Job Title:
MLOps Engineer (PyTorch, Systems & Training Pipeline)
About the Role
We’re seeking a systems-minded MLOps Engineer to own and evolve the infrastructure behind our PyTorch-based training workloads. You’ll help build robust pipelines, solve real networking and systems issues, and ensure our training codebases are clean, maintainable, and scalable.
This role is ideal for someone who thrives at the intersection of deep learning, systems programming, and infrastructure engineering: someone who cares not just about getting models trained, but about doing it right, with code that lasts.
Key Responsibilities
- Build and maintain training and inference pipelines using PyTorch
- Write and maintain robust tooling in Python and C++ to support the training lifecycle
- Own the training codebase: enforce clarity, modularity, reproducibility, and performance
- Design workflows that support checkpointing, resuming, versioning, and tracking experiments
- Optimize compute workloads for bare-metal environments (I/O, CPU/GPU utilization, memory)
- Troubleshoot low-level networking issues, distributed training errors, and hardware bottlenecks
- Set up and manage ML environments (containers, package management, drivers, runtime configs)
- Monitor and debug training jobs across multiple nodes and GPUs
- Build systems that last: designed for scale, maintainability, and long-term use
You Should Have
- Expertise in PyTorch (e.g., DDP, mixed precision, TorchScript)
- Strong programming skills in C++ and Python
- Solid background in computer science fundamentals (data structures, concurrency, OS)
- Experience debugging and tuning bare-metal servers (Linux, kernel parameters, BIOS settings)
- Strong understanding of networking, interconnects, and distributed training setups (NCCL, MPI)
- Proven ability to build reliable, reproducible pipelines for training and evaluation
- Familiarity with job schedulers (SLURM, custom batch runners) and monitoring tools
Nice to Have
- Experience with custom, non-cloud deployments (local clusters, edge devices)
- Contributions to PyTorch or open-source ML tooling
- Familiarity with infrastructure-as-code (e.g., Ansible, Terraform, Nix)
- Experience setting up logging, observability, and alerting for training runs
- Passion for clean code and detail-oriented engineering