
Senior Engineer (Python)
- Central Visayas
- Permanent
- Full-time
- Design and implement agent evaluation pipelines that benchmark AI capabilities across real-world enterprise use cases (a minimal harness sketch follows this list)
- Build domain-specific benchmarks for product support, engineering ops, go-to-market (GTM) insights, and other verticals relevant to modern SaaS
- Develop performance benchmarks that measure and optimize latency, safety, cost-efficiency, and user-perceived quality
- Create search- and retrieval-oriented benchmarks, including multilingual query handling, annotation-aware scoring, and context relevance
- Partner with AI and infra teams to instrument models and agents with detailed telemetry for outcome-based evaluation
- Drive human-in-the-loop and programmatic testing methodologies for fuzzy metrics like helpfulness, intent alignment, and resolution effectiveness
- Contribute to the company's open evaluation tooling and benchmarking frameworks, shaping how the broader ecosystem thinks about SaaS AI performance
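
For concreteness, here is a minimal sketch of the kind of agent evaluation harness this role builds. Everything in it is an illustrative assumption: the `echo_agent` stub, the `EvalCase` fields, and the keyword-based quality score stand in for a real LLM-backed agent and task-specific judges.

```python
# Minimal agent evaluation harness (sketch). The agent callable, test case,
# and keyword-recall quality score below are illustrative assumptions.
import time
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalCase:
    query: str
    expected_keywords: list[str]  # crude stand-in for "resolution effectiveness"


def run_case(agent_fn, case: EvalCase) -> dict:
    """Run one case through the agent, recording latency and a toy quality score."""
    start = time.perf_counter()
    answer = agent_fn(case.query)
    latency = time.perf_counter() - start
    hits = sum(kw.lower() in answer.lower() for kw in case.expected_keywords)
    return {"latency_s": latency, "keyword_recall": hits / len(case.expected_keywords)}


def run_benchmark(agent_fn, cases: list[EvalCase]) -> dict:
    """Aggregate per-case results into benchmark-level metrics."""
    results = [run_case(agent_fn, c) for c in cases]
    return {
        "n_cases": len(results),
        "mean_latency_s": mean(r["latency_s"] for r in results),
        "mean_keyword_recall": mean(r["keyword_recall"] for r in results),
    }


if __name__ == "__main__":
    # Stub agent standing in for a real LLM-backed system.
    def echo_agent(query: str) -> str:
        return "To reset your password, open Settings and follow the reset link."

    cases = [EvalCase("How do I reset my password?", ["reset", "settings"])]
    print(run_benchmark(echo_agent, cases))
```

In practice the keyword check would be replaced by programmatic judges or human-in-the-loop annotation for fuzzy metrics like helpfulness and intent alignment.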
- 3 to 7 years of experience in systems, infra, or performance engineering roles with strong ownership of metrics and benchmarking
- Fluency in Python and comfort working across full-stack and backend services
- Experience building or using LLMs, vector-based search, or agentic frameworks in production environments
- Familiarity with LLM model serving infrastructure (e.g., vLLM, Triton, Ray, or custom Kubernetes-based deployments), including observability, autoscaling, and token streaming
- Experience working with model tuning workflows, including prompt engineering, fine-tuning (e.g., LoRA or DPO), and evaluation loops for post-training optimization
- Deep appreciation for measuring what matters - whether it's latency under load, degradation in retrieval precision, or regression in AI output quality
- Familiarity with evaluation techniques in NLP, information retrieval, or human-centered AI (e.g., RAGAS, Recall@K, BLEU; a toy Recall@K example follows this list)
- Strong product and user intuition - you care about what the benchmark represents, not just what it measures
- Experience contributing to academic or open-source benchmarking projects
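
As a small illustration of one of the retrieval metrics named above, here is a toy Recall@K implementation; the document IDs and K values are made up for the example.

```python
# Toy Recall@K: the fraction of relevant documents that appear in the
# top-k retrieved results. IDs and k values below are illustrative.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Compute Recall@K for one query."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


if __name__ == "__main__":
    retrieved = ["doc3", "doc7", "doc1", "doc9", "doc2"]
    relevant = {"doc1", "doc2", "doc4"}
    print(recall_at_k(retrieved, relevant, k=3))  # only doc1 in top 3 -> 1/3
    print(recall_at_k(retrieved, relevant, k=5))  # doc1 and doc2 in top 5 -> 2/3
```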