Site Reliability Engineering Manager - Remote

Work from home, full-time role

Why Arcoro?

Want to work with a solid company that’s transforming HR for the construction industry? Our team of dedicated professionals helps construction, contracting and field services companies hire, manage and grow their workforce with a market-leading SaaS solution. As a member of the A-Team, you’ll enjoy a top-notch employee experience where you can embrace your problem-solving skills and innovation, work with a team of great colleagues and see the impact of your contribution each day.

Our culture is collaborative, and we believe strongly in training, growth and internal advancement. We offer competitive compensation including comprehensive benefits and a generous time-off policy. We offer both on-site and remote opportunities. At Arcoro, you will help create software products that are cutting edge, easy to use, and that make an appreciated and notable difference in our customers’ daily lives.

About the Job

The Site Reliability Engineering Manager is responsible for leading the SRE team to ensure the availability, performance, scalability, and operational excellence of Arcoro’s production systems. This role combines people leadership with deep technical oversight, ensuring services meet defined reliability targets and that the team is effective, engaged, and aligned with product and business goals. The SRE Manager partners closely with Engineering and Product to drive reliability engineering practices, incident response, observability, and continuous improvement across the production environment.

This is a hands-on role. In addition to leading and developing the team, the SRE Manager is expected to contribute as an individual contributor by writing code and automation, building tooling, participating in on-call, and working directly in production systems alongside the team.

What You'll Do

Lead and manage a team of Site Reliability Engineers responsible for the reliability, performance, and operational health of production systems
Serve as a hands-on technical contributor by writing code and automation, building reliability tooling, participating in on-call, and working directly in production systems alongside the team
Support career growth and development of team members through coaching, mentoring, and performance management
Define, measure, and drive Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets in partnership with engineering and product teams
Own incident response, including on-call rotations, escalation processes, severity management, and blameless postmortems
Drive continuous improvement in monitoring, observability, alerting, and on-call practices to reduce toil and mean-time-to-recovery
Lead the adoption of AI and automation across SRE practices, including AI-assisted incident response, intelligent alerting, automated remediation, and the use of AI tooling to reduce toil and accelerate operational workflows
Partner with Engineering to refine our products to better support agentic AI development, including improving APIs, telemetry, environments, and platform capabilities that enable AI agents to safely build on and operate against our systems
Drive cloud cost optimization and FinOps practices in partnership with Engineering, including vendor management, cost allocation, rightsizing, and engineering best practices that reduce cloud spend
Partner with Engineering on operational readiness reviews, production change management, and release safety
Champion reliability best practices and ensure they are embedded across the engineering organization
Track and report on key reliability metrics, incident trends, and team health to leadership
Stay current with emerging SRE practices, tooling, and industry standards

What We're Looking For

Proven experience leading SRE, operations, or reliability-focused engineering teams in a production software environment
Willingness and ability to operate as a hands-on individual contributor in addition to managing the team, including writing code, building automation, and participating in on-call
Strong understanding of SRE principles, including SLOs/SLIs, error budgets, and blameless postmortems
Hands-on background in incident response, on-call management, and production troubleshooting
Experience with modern observability practices, including metrics, logging, tracing, and alerting
Demonstrated experience applying AI and automation to reliability work, including using AI-assisted tooling, building automated remediation, and leading the adoption of AI-driven practices on a team
Solid grasp of distributed systems, cloud infrastructure, and the operational characteristics of web-scale applications
Strong leadership, coaching, and team development skills
Excellent communication skills, including the ability to lead through high-pressure incidents and communicate clearly with technical and non-technical stakeholders
Strong analytical and problem-solving abilities
Ability to work across teams and influence at multiple levels of the organization

Preferred Qualifications

Bachelor’s degree in Computer Science, a related field, or equivalent professional experience
10+ years of experience in software engineering, systems engineering, DevOps, or site reliability engineering
3+ years of experience in a technical leadership, team lead, or Principal role
Previous experience as an SRE Manager, Lead SRE, Principal DevOps/SRE, Operations Manager, or similar leadership role
Strong experience with Microsoft Azure; additional experience with AWS or Google Cloud Platform a plus
Experience with Microsoft technologies (.NET, C#, SQL Server) in a production environment
Experience with container orchestration (Kubernetes, AKS, or EKS) and tools such as Helm or Argo
Experience with observability platforms (e.g., Datadog, ELK, Grafana, OpenTelemetry, Azure Monitor)
Experience with infrastructure-as-code (e.g., Bicep, Terraform, CloudFormation) and modern CI/CD pipelines (e.g., Azure DevOps, GitHub Actions)
Experience with cloud cost optimization and FinOps practices
Familiarity with incident management and ITSM tooling (e.g., PagerDuty, Opsgenie, ServiceNow)
Hands-on experience with AI-assisted engineering tools (e.g., coding copilots, LLM-powered runbooks or agents) and automation platforms used in production operations
Microsoft Azure certifications (e.g., AZ-305 Solutions Architect Expert, AZ-400 DevOps Engineer Expert) a plus

Salary Range: $200,000-$220,000 DOE

What We Offer
