Site Reliability Engineer - Monitoring

xAI
Full-time
Memphis, TN
Posted a month ago

Job Description

xAI is seeking a Site Reliability Engineer specializing in monitoring to develop and manage monitoring solutions, primarily in Grafana, that provide visibility into datacenter health and help scale business operations. The role involves collaborating with datacenter teams to minimize downtime and deliver actionable insights.

Responsibilities

  • Design, build, and maintain Grafana dashboards
  • Develop automation scripts using Java, Golang, Python, C/C++/C#, Bash, or Linux shell scripting
  • Collaborate with Datacenter Operations Technicians
  • Evaluate and optimize dashboards for scalability
  • Manage dashboard lifecycle
  • Participate in on-call rotations and incident analysis
  • Document monitoring strategies and best practices

Requirements

  • Bachelor's degree in Computer Science or related field
  • 5+ years of experience in site reliability engineering or monitoring
  • Proficiency in at least two of Java, Golang, Python, C/C++/C#
  • Strong skills in Linux and Bash scripting
  • Experience with JSON data parsing
  • Expert-level knowledge of Grafana
  • Proven track record of developing scalable dashboards
  • Experience managing monitoring tools and dashboards
  • Strong problem-solving skills

Benefits

  • No benefits