AI/HPC NETWORK DEVELOPMENT ENGINEER - NETWORKING

xAI
Full-time
Palo Alto, CA
Posted a month ago

Job Description

xAI is seeking a Network Development Engineer with deep experience in RoCEv2 to develop and optimize hyper-scale networks for AI training and inference workloads. The role involves optimizing network performance, designing new network infrastructure, and contributing to deployment and operations frameworks.

Responsibilities

  • Develop hyper-scale networks while optimizing performance and availability
  • Optimize network performance for AI training and inference workloads
  • Design next-generation backend and front-end networks
  • Build metrics dashboards and tune configurations
  • Travel to Memphis for data center buildouts
  • Participate in on-call rotation
  • Automate repetitive tasks with Python

Requirements

  • 10+ years designing and operating large-scale networks
  • 5+ years in the Ethernet AI/HPC space
  • Deep understanding of congestion control on Ethernet
  • Deep understanding of AI training and inference workloads
  • Experience with NCCL
  • Expertise in creating performance and operations metrics
  • Experience with Python

Benefits

  • No benefits