Senior Site Reliability Engineer, Core Cloud Engineering

Work from home, full-time role

Join Vultr

Our Engineering team at Vultr is seeking a Senior Site Reliability Engineer, Core Cloud Engineering to report to the Director of Core Cloud Engineering. This role demands deep expertise in large-scale distributed systems, infrastructure automation, and production operations of hypervisor platforms and the control plane. The ideal candidate will combine hands-on systems engineering with a focus on reliability, scalability, and observability, ensuring Vultr's cloud services remain performant and resilient for our 1.5 million users.

Key Responsibilities

Production Control Plane Operations: Operate and scale Vultr's control plane, ensuring availability, correctness, and performance across global datacenters.
Hypervisor & Infrastructure Reliability: Design, implement, and maintain automation to manage hypervisor fleets (KVM, QEMU, libvirt) and supporting infrastructure at scale.
Networking & Systems Automation: Develop tooling and automation for Open vSwitch (OVS), BGP routing, and other networking components to ensure resilient and self-healing network operations.
Performance & Reliability Tuning: Continuously analyze and improve system performance across compute, storage, and network layers, with an emphasis on reducing toil and eliminating single points of failure.
Observability & Incident Response: Implement advanced monitoring, logging, and tracing solutions (Grafana, Sentry, SumoLogic) while leading incident response to minimize impact and drive postmortem culture.
CI/CD & Configuration Management: Maintain and evolve infrastructure pipelines (GitLab CI/CD, Puppet) to enable safe, fast, and reliable changes to both control plane and hypervisor infrastructure.
Collaboration: Work closely with Software Engineers, Network Engineers, and Product teams to align platform reliability with business and user needs.
Documentation & Standards: Produce clear technical documentation for runbooks, operational procedures, and automation frameworks to improve team efficiency and reliability standards.
Mentorship & Leadership: Coach and mentor team members in best practices for site reliability, incident handling, automation, and low-level Linux systems debugging.

Qualifications

Proficiency in PHP with strong scripting and automation skills.
Experience running large-scale distributed systems and control plane infrastructure in production.
Strong background in hypervisor technologies (libvirt, QEMU, KVM) and Linux systems administration.
Expertise in networking protocols and tools, particularly BGP and Open vSwitch (OVS), with automation experience.
Deep knowledge of observability and monitoring frameworks (Grafana, Sentry, SumoLogic) and incident management.
Advanced troubleshooting skills across compute, networking, and storage subsystems.
Experience building and maintaining CI/CD pipelines (GitLab) and configuration management (Puppet).
Familiarity with MySQL or similar databases, with an understanding of operational considerations for reliability and scale.
Strong problem-solving abilities and the drive to tackle complex, low-level reliability challenges.
Effective cross-team communication and collaboration skills.
A commitment to continuous improvement and fostering a culture of operational excellence.

Compensation

$120,000 - $130,000

Final compensation will vary depending on years of experience, background/skill set, location, and applicable laws.
