
AI Test Engineer - Senior Manager

Work from home, Full-time role

About Vialto Labs (VLabs)

Vialto Labs (VLabs) is responsible for redesigning how work is delivered in the tax and immigration service lines, as well as driving operational efficiency across Vialto’s functional areas using AI. The team builds and deploys novel AI-enabled solutions that directly improve productivity and increase delivery quality for our clients. VLabs is accountable for rapidly turning innovative experiments into production-ready deliverables at scale and embedding them into day-to-day operations. This team focuses on the highest-impact workflows, creating standardized, repeatable capabilities that can be deployed globally. Operating with a mandate for speed and measurable outcomes, VLabs works alongside service line, product, and platform leaders.

About the Role

The Senior Manager, AI Test Engineering is a hands-on role within VLabs Quality Engineering, responsible for validating the performance, reliability, and integrity of AI-enabled solutions in production environments. This role operates at the intersection of AI engineering and quality assurance, ensuring that outputs from LLMs, OCR pipelines, document classification models, and agentic workflows perform as expected at scale and meet defined business performance thresholds. Working closely with the Programme Test Manager and partnering with engineering, product, and delivery teams, this role translates AI testing strategy into executable frameworks, evaluation pipelines, and reusable assets embedded into the delivery lifecycle. Success requires independent execution, strong technical depth, and the ability to proactively identify risks, patterns, and performance gaps while enabling rapid, production-grade deployment of AI capabilities.

Key Responsibilities

AI Evaluation & Test Design

Translate AI testing strategy into executable test scenarios across LLM outputs, document classification, extraction accuracy, agent workflows, and edge cases
Design adversarial and boundary test inputs to expose hallucination, misclassification, and failure modes
Validate AI outputs for structure, consistency, accuracy, and production readiness against defined performance thresholds

Evaluation Engineering & Automation

Build reusable Python-based evaluation frameworks, including output validation, hallucination detection, and scoring mechanisms
Develop parameterized test scripts reusable across features, models, and releases
Implement AI-as-Judge frameworks, including prompt design, scoring logic, and calibration of evaluation reliability
Embed evaluation frameworks into CI/CD pipelines to support continuous testing and deployment

Drift Detection & Quality Monitoring

Design and operate drift detection frameworks using fixed baseline datasets and scheduled re-evaluation
Establish thresholds to distinguish acceptable variation from performance degradation
Enable release gating by identifying regressions prior to production deployment

Ground Truth & Data Quality

Build and maintain ground truth datasets in partnership with subject matter experts
Define standards for classification, extraction accuracy, and acceptable output characteristics
Continuously update datasets to reflect evolving business requirements and use cases

Workflow & Integration Testing

Test end-to-end agentic workflows, validating data integrity, error propagation, and fallback behavior
Perform API-level testing of AI pipeline endpoints using Python and Postman/Newman
Validate data persistence and integrity across system layers using SQL
Partner with engineering teams to ensure testability, observability, and system reliability

Standardization & Scaling

Define and scale standardized AI evaluation patterns and reusable quality frameworks across VLabs
Contribute to enterprise AI quality standards and reference architectures

Governance & Responsible AI

Ensure adherence to Responsible AI, data privacy, and governance requirements
Support auditability, traceability, and transparency of AI outputs and evaluation processes

Stakeholder Enablement

Translate evaluation results into actionable insights for engineering, product, and business stakeholders
Support decision-making on model readiness, release risk, and performance trade-offs
Proactively identify risks, patterns, and systemic issues and escalate appropriately

Qualifications & Experience

Professional Experience

7+ years in software testing, including 2–3 years focused on AI/ML-enabled systems in production environments
Proven experience designing and executing AI evaluation frameworks and quality strategies
Strong track record building ground truth datasets, drift detection systems, and scalable evaluation pipelines
Experience testing multi-step agentic workflows and AI-driven automation systems
Experience operating in fast-paced, iterative delivery environments
Background in regulated or compliance-driven environments preferred

Technical Expertise

Advanced Python programming for evaluation frameworks, batch processing, and data analysis
Experience with LLM evaluation tools such as deepeval, RAGAS, promptfoo, or similar
Strong capabilities in:
    AI output validation, hallucination detection, and grounding checks
    Drift detection frameworks and statistical evaluation methods
    OCR, VLM, and document AI testing (classification, extraction, edge cases)
    API testing using Python (requests/httpx) and Postman/Newman
    SQL for data validation and pipeline integrity checks
Familiarity with LangChain, LlamaIndex, or similar frameworks
Experience with cloud AI platforms such as Azure AI Foundry or AWS Bedrock preferred

Operating Capabilities

Ability to operate independently in fast-moving, ambiguous environments
Strong analytical mindset with attention to detail and quality rigor
Ability to balance speed and rigor in AI evaluation and delivery cycles
Proactive communicator who identifies risks and drives resolution
Ability to translate technical findings into business-relevant insights

Education

Bachelor’s degree required; advanced degree in Computer Science, Data Science, or related field preferred

We are an equal opportunity employer that does not discriminate on the basis of any legally protected status.

Please note, AI is used as part of the application process.
