Site Reliability Engineer - Storage

xAI
Full-time
Memphis, TN
Posted a month ago

Job Description

xAI is seeking a Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of its petabyte-to-exabyte-scale storage infrastructure, which supports large AI training clusters. The role involves deploying, troubleshooting, and optimizing storage for 24/7 AI workloads, collaborating with various engineering teams, and participating in on-call rotations.

Responsibilities

  • Deploy, maintain, and scale exabyte-scale storage clusters
  • Troubleshoot production storage issues across the hardware and software stack
  • Collaborate with storage teams to validate server specs and debug field problems
  • Evaluate and onboard new storage vendors and technologies
  • Support storage software development engineers (SDEs) by translating requirements into reliable systems
  • Lead hardware refreshes for legacy storage fleets
  • Participate in on-call rotations and drive post-mortems
  • Create and maintain documentation and monitoring for storage health

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field
  • 3+ years in site reliability engineering, systems engineering, or storage operations at multi-PB+ scale
  • Hands-on experience with storage systems (VAST, DDN, Dell, Lustre, GPFS, Weka)
  • Proficiency in scripting (Python/Bash)
  • Strong troubleshooting skills across storage hardware and software
  • Experience with incident response and on-call rotations
  • Basic hardware knowledge for storage bring-up
  • Excellent communication and documentation skills

Benefits

  • No benefits