[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. CertifyOS is building the data infrastructure that powers modern healthcare. They are seeking a Senior Site Reliability Engineer who will design for reliability, manage the operational lifecycle, and influence platform architecture and deployment workflows across their systems.
Responsibilities
- Design for reliability, ship the automation, and stand behind it in production
- Own the operational lifecycle end-to-end and influence platform architecture, reliability standards, and deployment workflows
- Own the full lifecycle of what you support, from infrastructure design and deployment automation through observability, incident response, and postmortems
- Improve autoscaling behavior, resource utilization, and workload efficiency across cloud-native distributed systems
- Own incident response processes, root cause analysis, escalation workflows, and runbooks
- Build and maintain Infrastructure as Code, CI/CD pipelines, and operational tooling that reduce manual work and improve engineering productivity
- Instrument data freshness and infrastructure health, not just service uptime
Skills
- 5+ years in SRE, DevOps, Platform Engineering, or Infrastructure Engineering — operating production systems at scale where your infrastructure is someone else's dependency and failures have real downstream consequences
- Track record of improving reliability end-to-end: you've debugged hard production problems, made them not happen again, and built the alerting to prove it
- Strong Linux systems administration, incident response, and root cause analysis skills
- Comfort influencing operational standards and mentoring teams on reliability practices
- Deep hands-on experience with GCP — GKE, Cloud Run, and containerized workloads at scale
- Experience building and maintaining Infrastructure as Code with Terraform and/or Pulumi
- Fluency across deployment patterns and the judgment to know when each fits: rolling deployments, blue/green, canary — and the rollback story for each
- Experience with autoscaling, resource optimization, and infrastructure efficiency for distributed systems
- Experience managing infrastructure security, secrets, and access controls in regulated or security-conscious environments
- Strong understanding of Golden Signals monitoring — latency, traffic, errors, saturation — and how to make them actionable rather than noisy
- Experience designing SLIs, SLOs, error budgets, alerting strategies, dashboards, and escalation workflows
- Hands-on experience with observability platforms: Google Cloud Monitoring, Datadog, Grafana, Prometheus, or similar
- Strong sense of data platform health: lineage, freshness, and correctness matter as much to you as throughput
- Experience building and maintaining CI/CD pipelines using GitHub Actions or similar
- Scripting or programming fluency in Python, Bash, Go, or similar — you reduce toil through code, not process
- Experience working with Git workflows and modern software delivery practices
- Strong written and verbal communication — you can explain an operational risk to an engineer and a product manager in the same conversation
- Experience operating systems handling sensitive data or PII in regulated or compliance-adjacent environments
- Experience operating large-scale distributed systems or microservices architectures
- Familiarity with healthcare, credentialing, or health-tech environments
- Experience leveraging AI-assisted observability or incident response tooling
- Familiarity with NodeJS, TypeScript, Java, or React application stacks
Benefits
- We provide 100% coverage of health, dental, and vision insurance premiums for employees.
- Our US-based team benefits from unlimited PTO, with at least two weeks off each year to recharge.
- In India, employees are supported with health insurance, statutory leave benefits, and additional wellness (menstrual) leave for women.