AI/HPC NETWORK DEVELOPMENT ENGINEER - NETWORKING

xAI
Full-time
Memphis, TN; Palo Alto, California; Dublin, Ireland
Posted a month ago

Job Description

xAI is seeking a highly motivated network engineer with deep experience in RoCEv2 to develop and optimize hyper-scale networks for AI training and inference workloads. The role involves optimizing network performance, designing new network infrastructure, and contributing to deployment and operations frameworks.

Responsibilities

  • Develop networks at hyper scale while optimizing performance and availability
  • Optimize network performance for training and inference workloads
  • Build metric dashboards and tweak configurations
  • Design next-generation back-end and front-end networks
  • Participate in the on-call rotation and in scaling and maintenance efforts
  • Travel to data centers for buildouts and team collaboration

Requirements

  • 10+ years designing/operating large-scale networks
  • 5+ years in the Ethernet AI/HPC space
  • Deep understanding of Ethernet congestion control
  • Deep understanding of AI training/inference workloads
  • Experience debugging NCCL and, potentially, contributing to it
  • Expertise in creating performance/operations metrics
  • Python scripting for automation and data analysis

Benefits

  • No benefits