The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to join its Platform Engineering Team. This role will directly support our application platform serving the world's favorite encyclopædia to millions of people around the globe. Wikipedia and its sister projects are powered strictly by Free and Open Source software, with MediaWiki at their core, surrounded by an ecosystem of microservices in PHP, Node.js, Python, and Go.
- Minimum Qualification: Degree
- Experience Level: Mid level
- Experience Length: 3 years
Your responsibilities include:
- Sharing our values and working in accordance with them
- Ensuring smooth and reliable operation of the Platform Engineering team's ecosystem of microservices, systems and infrastructure (including dealing with our main MediaWiki installations). We work regularly with Cassandra (in particular), Redis, Envoy, MariaDB/MySQL and other platforms.
- Investigating the performance of our Debian Linux fleet, including deep dives into profiling, UNIX fundamentals and shell scripting
- Performing platform transformations and migrations towards modernized infrastructure (bare metal deployments to Kubernetes clusters, active/active multi-data center support, etc.)
- Bringing your creativity to improve our current infrastructure and introduce new automation where needed
- Participating in early design and review of projects, guiding appropriate technology approaches
- Mentoring and supporting other engineers on the Platform team on infrastructure and beyond
- Supporting the team in deploying new features and fixes
- Troubleshooting, debugging and following up on emerging issues in our application stack and services with other teams at the Foundation
- Interfacing between the Platform Engineering Team and the WMF's SRE team
- Implementing and utilizing configuration management, orchestration and deployment tools (Puppet, Kubernetes)
- Assisting in the architectural design of new services and making them operate at scale
- Monitoring systems, services and service clusters, and optimizing performance and resource utilisation
- Responding to, diagnosing and following up on system outages or alerts across Wikimedia's production infrastructure
Skills and Experience:
- Demonstrable experience in an SRE/Operations/DevOps role as part of a team
- Ability to communicate well across different teams, roles and contexts
- Experience in supporting complex web applications running on highly available, high-traffic Linux-based infrastructure
- Development experience in at least one language; we mostly work with PHP, Node.js, Python and Go
- Comfortable with configuration management and orchestration tools (Puppet, Ansible, Chef, SaltStack, etc.), and modern observability infrastructure (monitoring, metrics and logging)
Qualities that are important to us:
- Track record of open source contributions is highly appreciated
- Familiarity with core distributed systems concepts
- Familiarity with modern distributed container cluster management systems (Kubernetes, Docker Swarm, Mesos, …)