Senior Engineer (Python)

Manila Recruitment

  • Central Visayas
  • Permanent
  • Full-time
  • 23 days ago
  • Design and implement agent evaluation pipelines that benchmark AI capabilities across real-world enterprise use cases
  • Build domain-specific benchmarks for product support, engineering ops, GTM insights, and other verticals relevant to modern SaaS
  • Develop performance benchmarks that measure and optimize latency, safety, cost-efficiency, and user-perceived quality
  • Create search- and retrieval-oriented benchmarks, including multilingual query handling, annotation-aware scoring, and context relevance
  • Partner with AI and infra teams to instrument models and agents with detailed telemetry for outcome-based evaluation
  • Drive human-in-the-loop and programmatic testing methodologies for fuzzy metrics like helpfulness, intent alignment, and resolution effectiveness
  • Contribute to the company's open evaluation tooling and benchmarking frameworks, shaping how the broader ecosystem thinks about SaaS AI performance
Requirements:
  • 3 to 7 years of experience in systems, infra, or performance engineering roles with strong ownership of metrics and benchmarking
  • Fluency in Python and comfort working across full-stack and backend services
  • Experience building or using LLMs, vector-based search, or agentic frameworks in production environments
  • Familiarity with LLM serving infrastructure (e.g., vLLM, Triton, Ray, or custom Kubernetes-based deployments), including observability, autoscaling, and token streaming
  • Experience working with model tuning workflows, including prompt engineering, fine-tuning (e.g., LoRA, DPO), or evaluation loops for post-training optimization
  • Deep appreciation for measuring what matters, whether it's latency under load, degradation in retrieval precision, or regression in AI output quality
  • Familiarity with evaluation techniques in NLP, information retrieval, or human-centered AI (e.g., RAGAS, Recall@K, BLEU)
  • Strong product and user intuition — you care about what the benchmark represents, not just what it measures
Advantageous Skills:
  • Experience contributing to academic or open-source benchmarking projects
