Site Reliability Engineer - Automation

xAI
Full-time
Memphis, TN
Posted a month ago

Job Description

xAI is seeking a Site Reliability Engineer specializing in automation. The role focuses on automating firmware upgrades, scripting solutions for hardware from key vendors, and proactively identifying issues that can be fixed automatically. It involves using Python, Bash, Linux, and Kubernetes to improve datacenter efficiency and support scalable AI infrastructure.

Responsibilities

  • Develop and maintain scripts for firmware upgrades
  • Work with hardware vendors for seamless integration
  • Identify and automate fixes for operational problems
  • Collaborate with Datacenter Operations Technicians
  • Integrate automation into CI/CD pipelines
  • Monitor and refine automated processes
  • Document automation scripts and procedures
  • Participate in on-call rotations and incident response

Requirements

  • Bachelor's degree in Computer Science or related field
  • 5+ years of experience in site reliability engineering or automation
  • Proficiency in Python, Bash, Linux, and Kubernetes
  • Experience with firmware packages and upgrades
  • Familiarity with NVIDIA, Dell, Supermicro, and HP hardware
  • Strong problem-solving skills
  • Experience in high-performance computing or AI infrastructure
  • Excellent collaboration skills

Benefits

  • No benefits