Site Reliability Engineer
HostPapa
- Philippines
- Permanent
- Full-time
- Define and implement SLIs, SLOs, and error budgets for critical CloudBlue services to ensure reliability and performance
- Influence system architecture with a strong focus on reliability, scalability, and operability, designing systems for fault tolerance, graceful degradation, and self-healing
- Reduce operational toil by identifying opportunities for automation and process improvement
- Design and operate CloudBlue’s observability stack across metrics, logs, and traces using tools such as Datadog, Grafana, and Elastic Stack
- Develop actionable alerting strategies and dashboards that provide clear insight into platform and business health
- Design and maintain high-availability architectures, implementing redundancy, failover, and disaster recovery strategies across regions and availability zones
- Conduct capacity planning, load testing, and performance optimization to ensure platform stability and scalability
- Act as a senior responder during production incidents, leading incident coordination, communication, and service restoration
- Own blameless postmortems and drive improvements that reduce incident frequency, MTTR, and customer impact
- Improve reliability of Kubernetes-based platforms through health checks, autoscaling strategies, rollout safety, and resilience testing
- Partner with engineering and DevOps teams to improve deployment safety, rollback strategies, and platform reliability
- Maintain runbooks and operational documentation, and promote SRE best practices across engineering teams
- Support other tasks or projects as assigned to meet team and business needs
- 3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, with strong ownership of production systems
- Proven experience operating highly available, enterprise-grade, multi-tenant SaaS platforms
- Hands-on experience with observability and monitoring tools such as Datadog, Grafana, and Elasticsearch/Kibana
- Solid understanding of Linux, networking, and distributed systems fundamentals
- Experience working with containerized environments such as Docker and Kubernetes
- Strong scripting and automation skills using Python and/or Bash
- Experience participating in on-call rotations and incident response in production environments
- Strong written and spoken English
- Experience defining SLIs/SLOs and managing error budgets at scale will be considered a plus
- Exposure to hyperscale or service-provider-grade platforms is an advantage
- Cloud experience, preferably with Azure; experience with AWS and/or GCP will also be valued
- Experience working with hybrid or on-premises integrations is beneficial
- Familiarity with chaos engineering and resilience testing will be considered an asset
- Work from anywhere: this is a remote opportunity
- A competitive salary that values you and your unique skill sets
- Career advancement & professional development opportunities to help you reach your full potential
- Flexible work arrangements to support work/life balance