Job Summary
As the Site Reliability Engineering (SRE) Manager, you will lead our SRE team, working closely with cross-functional teams to establish and enhance our reliability engineering practices. You will drive the continuous improvement of our systems' reliability, scalability, and efficiency while ensuring prompt incident response and effective problem resolution. You will also play a key role in setting and achieving service level objectives (SLOs) and in driving the adoption of best practices for monitoring, alerting, and automation. This is a hands-on technical role that requires a thorough understanding of every layer of a modern web application stack: front end, back end, database, networking, and systems.
- Minimum Qualification: Degree
- Experience Level: Mid level
- Experience Length: 2 years
Job Description/Requirements
Responsibilities:
- Collaborate with development, operations, and product teams to optimize the reliability, scalability, and performance of our systems
- Define and monitor service level objectives (SLOs) to ensure the availability and performance of our services
- Implement effective incident management and problem resolution processes, ensuring minimal impact to customers
- Develop and maintain monitoring and alerting systems to proactively identify and mitigate potential issues
- Drive automation efforts to streamline deployments, infrastructure provisioning, and operational tasks
- Perform post-incident reviews to identify root causes, implement preventive measures, and share lessons learned
- Stay up to date with industry trends and emerging technologies, and assess their potential impact on our SRE practices
Requirements:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field
- 2-6 years of experience in Site Reliability Engineering or a related role, with demonstrated experience in leading and managing teams
- Strong knowledge of SRE and DevOps principles, practices, and methodologies
- Proficiency in scripting and automation using Python, Node.js, or other languages
- Experience with cloud platforms (AWS, Azure, GCP) and infrastructure-as-code (IaC) tools like Terraform
- Expertise in monitoring and observability tools (e.g., Prometheus, Datadog, New Relic, ELK stack)
- Expertise with containerization technologies (Docker, Kubernetes)
- Familiarity with incident response and post-incident analysis processes
- Strong analytical and problem-solving skills
- Excellent communication and leadership ability