MLOps Engineer (PyTorch)

Full-time

Remote

Software Engineer

Job Title: MLOps Engineer (PyTorch, Systems & Training Pipeline)

About the Role

We’re seeking a systems-minded MLOps Engineer to own and evolve the infrastructure behind our PyTorch-based training workloads. You’ll help build robust pipelines, solve real networking and systems issues, and ensure our training codebases are clean, maintainable, and scalable.

This role is ideal for someone who thrives at the intersection of deep learning, systems programming, and infrastructure engineering: someone who cares not just about getting models trained, but about doing it right, with code that lasts.

Key Responsibilities

Build and maintain training and inference pipelines using PyTorch
Write and maintain robust tooling in Python and C++ to support the training lifecycle
Own the training codebase: enforce clarity, modularity, reproducibility, and performance
Design workflows that support checkpointing, resuming, versioning, and tracking experiments
Optimize compute workloads for bare-metal environments (I/O, CPU/GPU utilization, memory)
Troubleshoot low-level networking issues, distributed training errors, and hardware bottlenecks
Set up and manage ML environments (containers, package management, drivers, runtime configs)
Monitor and debug training jobs across multiple nodes and GPUs
Build systems that last: designed for scale, maintainability, and long-term use

You Should Have

Expertise in PyTorch (e.g., DDP, mixed precision, TorchScript)
Strong programming skills in C++ and Python
Solid background in computer science fundamentals (data structures, concurrency, OS)
Experience debugging and tuning bare-metal servers (Linux, kernel params, BIOS tuning)
Strong understanding of networking, interconnects, and distributed training setups (NCCL, MPI)
Proven ability to build reliable, reproducible pipelines for training and evaluation
Familiarity with job schedulers (SLURM, custom batch runners) and monitoring tools

Nice to Have

Experience with custom deployments (no cloud, local clusters, edge devices)
Contributions to PyTorch or open-source ML tooling
Familiarity with infrastructure-as-code (e.g., Ansible, Terraform, Nix)
Experience setting up logging, observability, and alerting for training runs
Passion for clean code and detail-oriented engineering
