[Remote] Director of Site Reliability Engineering
Note: This is a remote position open to candidates in the USA. Talently is a cutting-edge organization in the Technology, Information and Media industry seeking a Director of Site Reliability Engineering. In this role, you will build and lead world-class Site Reliability Engineering practices, drive strategic reliability initiatives, and mentor engineering teams in a remote-first environment.
Responsibilities
- Define and execute a comprehensive company-wide Site Reliability Engineering strategy, embedding reliability as a core discipline across engineering teams
- Build, lead, and develop a high-performing SRE organization, including hiring, mentoring, and fostering a reliability-focused culture
- Establish SLIs, SLOs, KPIs, and error budgets to measure and drive platform reliability and performance improvement
- Guide architecture decisions and technical roadmaps for highly available, resilient, and scalable distributed systems
- Drive adoption of observability, monitoring, logging, and incident response solutions across cloud-based microservices environments, primarily on Google Cloud Platform
- Establish and oversee robust incident response frameworks, operational governance, and post-incident analysis processes
- Promote and implement best practices for infrastructure automation, cloud-native operations, and cost optimization
- Lead continuous improvement and innovation initiatives, including exploring AI-driven operations and new SRE methodologies
Skills
- 12+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or DevOps in high-scale environments
- 5+ years of proven technical leadership, building and scaling SRE teams and practices
- Strong expertise in distributed systems, cloud-native infrastructure, and microservices, with hands-on Google Cloud Platform experience (GKE, Compute Engine, Cloud Functions)
- Deep proficiency with infrastructure as code, automation frameworks, and CI/CD deployment pipelines
- Track record of designing large-scale observability and monitoring solutions using tools such as Prometheus, Grafana, Datadog, or New Relic
- Excellent communication, organizational development, and mentorship abilities
- Strong programming ability in Python, Go, Java, or similar languages
- Cloud or reliability certifications (e.g., Google Cloud Professional, SRE certifications)
- Experience implementing AIOps, anomaly detection, predictive analytics, or automated remediation/self-healing infrastructure
- Familiarity with AI/ML tools for operational intelligence and intelligent alerting
- Strong database performance tuning and distributed data systems knowledge
- Comfortable operating in fast-paced, high-growth technology environments
- Bachelor's degree in Computer Science, Engineering, or related field
Company Overview