
Lead, Site Reliability Engineer
- Pasay City, Metro Manila
- Permanent
- Full-time
- Incident Management. Leads a team prepared to respond quickly to production incidents, with the goal of restoring systems and applications to normal service operation as quickly as possible and minimizing the impact on guest/crew experience or business operations, so that the best possible service levels and availability are maintained. Reviews ticket analysis and approves closure of tickets/incidents. Understands the architecture of the Royal website and escalates incidents as needed to the appropriate team for further triage. Synthesizes and communicates incident details to the production team and stakeholders, including executive-level stakeholders. Reviews postmortem/RCA documents and follows up.
- Application Performance Management (APM). Ensures proactive monitoring and management of the performance and availability of the software applications within the products they are responsible for. Detects and diagnoses complex application performance problems to maintain the expected level of service. Builds the case for prioritizing bug and enhancement tickets. Creates reports on the performance of new deployment builds so that product teams can ensure quality.
- Configuration Management. Leads the team(s) in implementing and maintaining technology standards and practices across product definition and product configuration. Adjusts health thresholds and other monitoring settings based on historical performance. Creates and maintains performance dashboards used by support and product teams. Maintains the alerting, communication, and documentation toolchain to ensure it is up to date and efficient.
- Change Control Governance. Ensures all production changes required by the product teams are carried out in a planned and authorized manner, within established change control policies and procedures, and that all changes are thoroughly tested and validated from the monitoring perspective.
- Production Operations Readiness. Ensures all product implementations go through an operational readiness review. Establishes and maintains clear communication channels (e.g., Slack, Teams) with the scrum and marketing teams. Ensures all team members are informed about relevant updates and changes that may affect the website.
- 10+ years in Site Reliability Engineering (SRE), DevOps, or a related IT operations role.
- Bachelor's degree in Computer Science, Information Technology, Computer Engineering, or a related field; relevant advanced degree preferred.
- At least 3 years of experience managing teams and collaborating with external service providers.
- Technical Expertise:
- Proficiency in cloud platforms such as AWS, including AWS Elastic Beanstalk.
- Understanding of API design principles: REST, SOAP, GraphQL.
- Advanced knowledge of monitoring and logging tools (AppDynamics, Datadog, Splunk, New Relic, etc.).
- Strong proficiency in Adobe Experience Manager (AEM) is crucial for guiding technical initiatives and mentoring teams.
- Problem-Solving Skills:
- Strong analytical and troubleshooting skills to diagnose and resolve complex production issues swiftly.
- Ability to develop and implement effective incident response plans.
- Communication and Collaboration:
- Excellent written and verbal communication skills for effective interaction with cross-functional teams and documentation.
- Ability to collaborate with Development, QA, IT, and external managed service providers to ensure seamless operations.
- The Lead SRE may be required to participate in an on-call rotation to handle urgent incidents and ensure 24x7 system reliability.
- On-call duties may include evenings, weekends, and holidays as needed.