[Remote] Sr. Site Reliability Engineer (AI Platforms)

Work from home, full-time role

Note: This is a remote position open to candidates in the USA.

Optomi, in partnership with a premier client in the financial services industry, is seeking a Site Reliability Engineer to establish and scale reliability practices for AI-powered applications and services in production. This role will drive production readiness, observability, incident management, and automation while partnering closely with engineering teams to ensure highly available, resilient systems.

Responsibilities

  • Define and enforce production readiness standards for AI services and agent-based applications prior to deployment
  • Establish and manage SLIs, SLOs, and error budgets, including burn-rate monitoring and alerting
  • Ensure services have appropriate runbooks, rollback procedures, monitoring, and on-call ownership
  • Track reliability metrics and enforce operational standards across engineering teams
  • Instrument AI services and agent pipelines using structured JSON logging, custom metrics, and distributed tracing
  • Build dashboards and alerting for service health, latency, error rates, dependency performance, and agent execution metrics
  • Identify and address observability gaps unique to AI systems, including context limitations, model timeouts, tool invocation failures, and partial task execution
  • Develop monitoring strategies that surface reliability risks before production impact occurs
  • Build and maintain automation that supports production readiness reviews, incident analysis, SLO monitoring, and reliability validation
  • Develop tooling and workflows that automate operational checks and reliability enforcement
  • Maintain reliability standards, operational documentation, runbooks, and service ownership mappings
  • Continuously evolve reliability controls as new failure patterns emerge across AI-powered systems
  • Lead incident response and post-incident review efforts for production services
  • Perform root cause analysis and drive remediation efforts through completion
  • Identify recurring failure patterns and implement systemic reliability improvements
  • Support on-call operations and validate escalation processes for critical services
  • Review application architectures, infrastructure designs, and code changes through a reliability lens
  • Evaluate resiliency patterns such as retries, circuit breakers, health checks, graceful degradation, and rollback strategies
  • Partner with engineering teams to address reliability risks before production deployment

Skills

  • 4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or Production Operations
  • Hands-on experience managing production services and reliability programs
  • Strong understanding of SLI/SLO frameworks, error budgets, and operational excellence practices
  • Experience building monitoring, alerting, and observability solutions using platforms such as Datadog, Dynatrace, New Relic, Grafana, or similar
  • Strong scripting or programming experience with Python, TypeScript, or comparable languages
  • Experience with distributed systems observability, including structured logging, metrics, and tracing
  • Experience supporting AI/ML, automation, or data-driven platforms in production
  • Strong background leading incident response and post-incident review processes
  • Experience integrating operational workflows with ticketing and documentation platforms
  • Experience working within regulated or highly available production environments

Company Overview

  • Optomi is an IT staffing firm that serves its consultants, clients, and employees through a consultant-focused approach. It was founded in 2012 and is headquartered in Roswell, Georgia, USA, with a workforce of 501-1000 employees. Its website is http://www.optomi.com/.

Company H1B Sponsorship

  • Optomi has a track record of offering H1B sponsorships, with 7 in 2025, 6 in 2024, 2 in 2023, 5 in 2022, 8 in 2021, and 7 in 2020. Please note that this does not guarantee sponsorship for this specific role.