PRINCIPAL SITE RELIABILITY ENGINEER, DATA PROTECTION PRODUCTS

ConnectWise
Full-time
ARM Remote
Posted 19 days ago

Job Description

As a Site Reliability Engineer, you will work as an integral member of product teams, helping to build, deploy, and monitor cloud services reliably. You will contribute to complex software development projects to maintain essential, revenue-critical services, focusing on the reliability, availability, and performance of Elasticsearch infrastructure.

Responsibilities

  • Build systems and infrastructure to monitor complex, large-scale distributed systems
  • Identify stability and performance issues and collaborate with developers to resolve them
  • Represent the SRE organization in design reviews
  • Monitor system throughput, capacity, and reliability
  • Debug complex systems without downtime
  • Engage in service capacity planning and demand forecasting
  • Drive standardization efforts
  • Monitor and troubleshoot Elasticsearch performance issues

Requirements

  • Bachelor’s degree in Computer Science or equivalent work experience
  • Knowledge of virtualization, storage, networking, servers, and security
  • Understanding of systems and application design
  • Experience with monitoring and logging solutions such as Prometheus, Grafana, and the ELK stack
  • Proficiency in scripting languages such as Python
  • Experience with infrastructure-as-code tools such as Terraform or CloudFormation
  • Strong understanding of Linux system administration and networking concepts
  • Excellent troubleshooting and problem-solving skills
  • Strong communication and interpersonal skills
  • Knowledge of Unix, TCP/IP, HTTP, and web application security
  • Experience analyzing logs and troubleshooting distributed systems

Benefits

  • No benefits