GPU Kernel Developer – AI/ML
Role Summary
We are seeking expert-level GPU Software Engineers to support a high-visibility platform initiative within the Maya program, focused on building software tooling on top of a custom compiler and SDK. The role involves developing, optimizing, and porting GPU kernels and AI workloads to a specialized hardware platform. This is a critical and time-sensitive engagement with immediate onboarding expectations and long-term roadmap alignment (~18 months).
Key Responsibilities
- Develop GPU kernels for specialized hardware platforms using PyTorch/Triton frameworks
- Build software solutions leveraging custom compiler and SDK capabilities
- Design and implement kernel-level optimizations to control hardware execution behavior
- Port open-source AI/ML models to custom SDK environments
- Port and adapt high-performance computing benchmarks and stress workloads such as:
  - Linpack (High Performance Linpack)
  - BERT/benchmark-style workloads (referred to as “Babu bench”)
- Develop stress testing and validation workloads aligned to hardware behaviour and platform validation
- Support testing and stress testing of current and next-generation hardware platforms
- Collaborate closely with platform architects and compiler teams to enhance system capabilities
Core Technical Skills (Must-Have)
Programming & Frameworks
- Python
- C/C++ (systems-level programming)
- PyTorch
- Triton (Triton language / kernel development)
GPU & Systems Expertise
- GPU kernel development (mandatory and critical)
- Strong understanding of GPU architecture and compute optimization
- Experience with compiler-based optimizations / runtime execution layers
- Experience with custom SDKs or hardware abstraction layers
Performance & Workloads
- Experience in:
  - GEMM kernel development (matrix multiplication kernels)
  - Porting ML models to new hardware platforms
  - Performance tuning and stress testing at system level
Nice-to-Have Skills
- Experience working with custom silicon / hardware platforms
- Exposure to high-performance computing (HPC) workloads
- Familiarity with:
  - Linpack benchmarks
  - AI workload benchmarking tools
- Experience in compiler optimization ecosystems
Engagement Model & Structure
- Number of roles: 3 developers (initial hiring may start with 2)
- Location flexibility:
  - Onsite / Offshore / Hybrid mix allowed
- Timeline:
  - Immediate start required
- Duration:
  - ~18 months program duration with phased platform evolution
Key Differentiators (Critical Expectation)
- This is NOT a DevOps / support / debugging role
- Requires deep hands-on engineering expertise in:
  - Kernel programming
  - GPU workloads
  - ML framework internals
- Candidates must demonstrate build-level competence, not just theoretical knowledge