Senior Site Reliability Engineer (Fleet Management)
Requirements
- Have 6+ years of experience in software development and operating distributed systems,
- Are proficient in Go, Python, or a similar language, with a strong commitment to code quality and testing practices (writing unit, integration, and E2E tests),
- Have deep experience using and extending containerization technologies, preferably Kubernetes,
- Have a solid understanding of Linux operating system internals and networking concepts (e.g., filesystems, TCP/IP, DNS, TLS),
- Possess a customer-focused mindset, treating internal developers as your primary users,
- Have strong operational ownership, including a track record of debugging complex production issues and driving them to resolution,
- Prefer automation over manual processes ("allergic to ops work"); we are a small team of software engineers with a strong bias toward building software solutions to eliminate toil,
- (Desirable) Experience designing and implementing secure, multi-tenant runtime environments from first principles,
- (Desirable) Proficiency with Kubernetes ecosystem tools and interfaces such as Helm, Kustomize, Gatekeeper, Kyverno, CRDs/Operators, CRI, and CSI,
- (Desirable) Expertise in cloud infrastructure platforms, including AWS, GCP, or Azure,
- (Desirable) Proficiency in provisioning infrastructure using tools like Terraform, Crossplane, and AWS Controllers for Kubernetes (ACK),
- (Desirable) Advanced knowledge of Linux systems internals and networking concepts specifically relevant to containers, such as namespaces and cgroups
What the job involves
- Platform Engineering is the department within SRE that is responsible for a range of critical infrastructure and operational functions that support the broader engineering organization,
- Among these are our multi-cloud-provider Kubernetes infrastructure, networking, load balancing (including our public-facing edge and internal service mesh), and observability and alerting systems,
- The Fleet Management team provides the core runtime environment that empowers our developers to build and ship products to delight our customers,
- We manage the end-to-end lifecycle of our Kubernetes fleet, alongside the critical components that ensure cluster reliability and security (e.g., CoreDNS, cert-manager, and Gatekeeper),
- As our infrastructure scales to support new use cases and products, we are spearheading a migration from Terraform-based Infrastructure as Code (IaC) to an Operator-driven lifecycle management model,
- Contribute to developing and maintaining a scalable and secure runtime environment on top of Kubernetes that supports product needs across MongoDB,
- Provide internal support for our Kubernetes ecosystem, partnering with engineering teams to help them solve domain-specific problems,
- Participate in a 24/7 on-call rotation to resolve critical issues,
- Prioritize blameless post-mortems and dedicate engineering time to systemic fixes, ensuring you aren’t paged for the same issue twice