HARDCORE ENGINEER - PRE-TRAINING INFRASTRUCTURE

xAI
Full-time
Palo Alto, CA
$180,000 - $440,000
Posted a month ago

Job Description

xAI is seeking a highly motivated, hands-on engineer to design, build, and implement large-scale distributed training systems for AI models. The role involves profiling, debugging, and optimizing GPU utilization, as well as contributing to the codebase. It calls for a strong work ethic and good communication skills.

Responsibilities

  • Design, build, and implement large-scale distributed training systems
  • Profile, debug, and optimize multi-host GPU utilization
  • Co-design hardware, software, and algorithms
  • Maintain and innovate on the codebase
  • Build tools to boost team productivity

Requirements

  • Experience configuring and troubleshooting operating systems for performance
  • Experience building scalable training frameworks for AI models in HPC clusters
  • Knowledge of scalable orchestration frameworks and tools
  • Familiarity with machine learning compilers and runtimes (XLA, MLIR, Triton)
  • Understanding of distributed training strategies (FSDP, Megatron, pipeline parallelism)
  • Experience with NCCL or custom communication libraries

Benefits

  • No benefits