SRE Lead Platform Engineer Dynatrace & Azure - Fully remote

Work from home Full-time role Hiring

Job Title: SRE Lead Platform Engineer - Remote
Duration: 6 Months to Hire
Location: Fully remote, EST

The key skills for this Lead SRE Platform Engineer role are observability and monitoring (MELT data) using tools like Dynatrace, Datadog, and SCOM; strong Azure cloud and hybrid infrastructure knowledge; and DevOps automation with CI/CD, GitHub, and Terraform. The role also requires programming for automation (Python, C#, SQL) and strong experience with incident management, root cause analysis, and reliability engineering practices. At a lead level, the focus is on defining monitoring standards, improving system reliability, and guiding cross-team efforts to reduce outages and improve platform performance.

A typical day for this engineer would be a mix of monitoring system health, investigating reliability issues, improving observability, and leading automation and infrastructure improvements.

Role Summary

As a Lead SRE Platform Engineer, you will drive reliability engineering strategy and execution across critical IT Business Solutions platforms at Wegmans. This role focuses on improving uptime, performance, and operational efficiency through software enhancements, observability, automation, and data-driven root cause analysis (RCA).

You will serve as the technical lead for SRE practices: establishing monitoring standards, improving MELT (Metrics, Events, Logs, Traces) strategy, influencing tooling decisions, and partnering across infrastructure, development, operations, and vendor teams. This is a high-impact opportunity to build and mature reliability engineering capabilities from the ground up.

What You'll Do

Reliability & Observability Leadership

Define and mature SRE best practices across cloud and on-prem environments.
Design and implement comprehensive monitoring strategies using tools such as:
  o Dynatrace
  o Datadog
  o Microsoft SCOM
Develop dashboards, alerts, synthetic testing, and proactive monitoring capabilities.
Establish and evolve a MELT data strategy to improve service reliability.
Provide data-driven RCA investigations and implement preventative solutions.

Platform & Application Reliability

Support and enhance reliability across:

Cloud & Infrastructure
  o Microsoft Azure (software, storage, Azure Local)
  o Hyper-V and legacy VMware environments
  o NetApp and Pure storage platforms
  o Azure Log Analytics
  o Infrastructure as Code using Terraform
  o Migration from Azure DevOps to GitHub (strong GitHub experience required)

Order Management Systems
  o Azure-based, internally developed .NET/C# applications
  o Internal message queuing systems
  o Logging, analytics, and synthetic testing post-patching
  o API-based integrations

Workforce & Payroll Platforms
  o Workday (Payroll)
  o ADP Vantage (Timekeeping)

Warehouse & Distribution Systems
  o Blue Yonder Warehouse Management System (WMS)
  o Vocollect handheld voice picking devices
  o Network analytics for identifying dead zones and connectivity issues
  o Barcode scanners and device connectivity troubleshooting

DevSecOps & Automation

Lead CI/CD reliability improvements (the Azure DevOps to GitHub transition is critical).
Enhance pipeline automation with embedded security controls.
Advance Infrastructure-as-Code standards (Terraform).
Improve configuration management and change governance.
Drive automation to reduce manual intervention and operational risk.

ITSM & Incident Management

Work within the BMC ecosystem, including:
  o BMC Helix
  o BMC Remedy
  o BMC Server Automation
Optimize automated incident generation (SCOM-to-BMC workflows).
Improve triage, escalation, and impact modeling across services.
Monitor vendor performance and escalate appropriately.
Participate in off-hour escalation support when required.

Strategic Impact

Develop predictive reliability models using statistical techniques.
Identify systemic risk across production systems.
Guide tooling decisions (e.g., Dynatrace vs. Datadog or other observability platforms).
Ensure regulatory and operational compliance standards are met.
Facilitate cross-functional collaboration and document SRE procedures and planning artifacts.

Required Qualifications

5-7+ years of Software Engineering and Infrastructure/Database Engineering experience.
Deep expertise in:
  o DevSecOps practices
  o Observability platforms
  o API integrations
  o Performance management tools
  o ITIL principles
  o ITSM data analytics
  o MELT data collection and analysis
Experience in Azure cloud environments.
Strong analytical and problem-solving skills.
Demonstrated ability to influence technical direction.
Excellent communication and cross-team collaboration skills.
Continuous improvement mindset focused on reliability engineering.

Preferred Qualifications

Strong programming experience in:
  o .NET / C#
  o Python
  o SQL
Experience with MSSQL (primary) and Oracle (limited).
Experience with GitHub (critical for upcoming transition).
Agile/Scrum experience.
Knowledge of Reliability-Centered Engineering and maintenance strategies.
Experience with synthetic testing and proactive validation post-deployment.
Bachelor's degree in a related technical field.

Thank you,
Shiva Mittal
