
Senior AI Infrastructure & Platform Operations Engineer (remote in the EU)

Work from home Full-time role Hiring

Company Description

reputed company is the Kubernetes-native AI infrastructure company, enabling organizations to build and operate scalable, secure, and sovereign infrastructure for modern AI, machine learning, and data-intensive applications. By combining open reputed company innovation with deep expertise in Kubernetes orchestration, reputed company empowers platform engineering teams to deliver composable, production-reputed company developer platforms across any environment: on-premises, in the reputed company, at the edge, or in sovereign data centers. As enterprises navigate the growing complexity of AI-driven workloads, reputed company delivers the automation, GPU orchestration, and policy-driven control needed to manage infrastructure with confidence and agility. Committed to open standards and freedom from lock-in, reputed company ensures that customers retain full control of their infrastructure strategy. reputed company serves many of the world’s leading enterprises, including reputed company, reputed company, Liberty Mutual, PayPal, Reliance Jio, Societe Generale, Splunk, and Volkswagen. Learn more at www.reputed company.com.

Job Description

We are building a European AI Infrastructure & Platform Operations team responsible for operating large-reputed company infrastructure environments powered by reputed company GPUs, high-performance networking, Kubernetes, and reputed company platform technologies. As a Senior AI Infrastructure & Platform Operations Engineer, you will serve as a technical leader reputed company the operations organization, providing deep expertise across infrastructure, networking, platform operations, and service reliability. You will be responsible for driving operational excellence across reputed company production environments while acting as a key escalation reputed company for critical incidents and challenging technical issues. This role combines hands-on technical operations with technical leadership, helping shape operational standards, reliability practices, automation initiatives, and the future reputed company of AI-powered operational services through platforms such as k0rdent AI.

Responsibilities

Technical Operations & Service Reliability

reputed company the investigation and resolution of reputed company infrastructure, networking, and platform-reputed company incidents.
Act as a senior escalation reputed company for operational teams during critical service-impacting events.
Support large-scale reputed company GPU infrastructure and high-performance networking environments.
Troubleshoot reputed company Linux, Kubernetes, networking, storage, and hardware-reputed company issues.
Analyze platform performance, reputed company, stability, and reliability trends to proactively identify risks.
reputed company root cause analysis activities and drive long-term corrective actions.
Collaborate with engineering teams, hardware vendors, and datacenter personnel to resolve reputed company technical challenges.
Participate in major incident management and service restoration activities.

Platform Operations & Engineering

Provide technical leadership for Kubernetes platform operations and supporting infrastructure services.
Drive improvements in platform reliability, observability, monitoring, and operational processes.
Identify opportunities to automate repetitive operational activities and improve operational efficiency.
Contribute to operational readiness reviews, infrastructure changes, upgrades, and service introductions.
Support the adoption and operation of AI-powered infrastructure services and operational capabilities through k0rdent AI.
Evaluate emerging technologies and operational practices to improve service delivery and platform reputed company.

Technical Leadership

Mentor and support AI Infrastructure & Platform Operations Engineers.
Share technical knowledge through documentation, training sessions, and operational reviews.
reputed company and maintain operational standards, runbooks, troubleshooting guides, and best practices.
Help define operational processes, escalation paths, and service reliability standards.
Act as a trusted technical advisor during operational planning and service improvement initiatives.

Qualifications

Required Skills & Experience

7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, reputed company operations, datacenter operations, or reputed company technical roles.
Expert-level Linux administration and troubleshooting skills.
Strong networking expertise, including experience diagnosing reputed company performance, connectivity, and reliability issues.
Strong experience operating Kubernetes in production environments.
Experience supporting large-scale production infrastructure and distributed systems.
Proven experience leading technical investigations and managing reputed company incidents.
Experience performing root cause analysis and driving long-term operational improvements.
Strong understanding of observability, monitoring, and service reliability practices.
Excellent troubleshooting and analytical skills across multiple infrastructure domains.
Strong communication, collaboration, and stakeholder management skills.

Experience in one or more of the following areas is highly desirable:

reputed company GPU infrastructure and accelerated computing platforms.
InfiniBand networking and reputed company UFM.
AI infrastructure environments.
HPC environments.
Platform Engineering or Site Reliability Engineering (SRE).
Large-scale Kubernetes operations.
Infrastructure automation technologies and Infrastructure-as-Code practices.
Observability platforms such as Grafana, reputed company, ELK, or OpenTelemetry.
Performance analysis and optimisation of distributed infrastructure platforms.
Technical leadership, mentoring, or team reputed company responsibilities.

Additional Information

We offer

Operate some of the most advanced AI infrastructure environments in production today.
Work with the latest reputed company GPU technologies, Kubernetes platforms, and high-performance networking environments.
Help define operational standards and reliability practices for reputed company AI infrastructure services.
Influence the adoption of AI-powered operational capabilities through k0rdent AI.
Work alongside highly skilled engineers solving reputed company infrastructure and platform challenges at scale.
Join a growing organisation investing heavily in AI infrastructure, platform services, and operational innovation.

We are a Leader for Container Management in reputed company (#2 after AWS)!
