Distinguished Engineer, Fault Tolerance and Disaster Recovery


Engineering & Technology

IT & Telecoms Confidential
1 month ago

Job Summary

As a Distinguished Engineer I, you will work with our Principal and Senior Engineers to innovate and build new systems, improve and enhance existing ones, and identify new opportunities to apply your knowledge to critical problems. You will lead the strategy and execution of a technical roadmap that increases the velocity of product delivery and unlocks new engineering capabilities. The ideal candidate has a deep understanding of technology, risk management, site reliability engineering principles, and strategic planning, and can design and implement resilient systems that safeguard our business from potential threats.

  • Minimum Qualification: Degree
  • Experience Level: Senior level
  • Experience Length: 10 years

Job Description/Requirements

Position Responsibilities:

As a Distinguished Engineer I, you will:

  • Develop and drive the overall strategy for Business Continuity and Disaster Recovery (BCDR), aligning it with the organization's business goals and objectives
  • Provide thought leadership in BCDR, staying ahead of industry trends and emerging technologies to enhance our resilience posture
  • Conduct comprehensive risk assessments to identify potential threats and vulnerabilities
  • Design and implement robust risk mitigation strategies and plans to ensure continuous business operations
  • Lead the design and architecture of resilient and scalable systems, considering both on-premises and cloud-based solutions.
  • Collaborate with cross-functional teams to integrate BCDR considerations into the development and deployment processes
  • Develop and maintain comprehensive incident response plans to address various disaster scenarios
  • Conduct regular simulations and drills to ensure the organization's readiness in the event of a disaster
  • Practice hands-on software engineering and apply SDLC best practices (Technical Review Documents, Architecture, Software Development, Software Reviews, Testing, Production Readiness Reviews, among others)
  • Evaluate, select, and implement cutting-edge technologies and tools to enhance our BCDR capabilities including but not limited to processes, compliance, and visibility
  • Stay current with industry best practices and emerging technologies to continuously improve our BCDR capabilities
  • Work closely with executive leadership, IT teams, and other stakeholders to communicate the importance of BCDR and foster a culture of resilience.
  • Act as a trusted advisor, providing guidance on BCDR matters to technical and non-technical stakeholders.
  • Be a role model and mentor, helping to coach and strengthen the technical expertise and know-how of our engineering and product community. Influence and educate executives
  • Analyze costs and forecasts, incorporating them into business plans
  • Determine and support resource requirements, evaluate operational processes, measure outcomes to ensure desired results, demonstrate adaptability, and sponsor continuous learning


  • Fluency and specialization in software development and best practices using modern programming languages such as Go, Java, and Python
  • Understanding of SQL and NoSQL databases, including stateful services management and storage
  • Understanding of networking, caches, key/value stores, load balancing, global load balancing, queues, DNS and CDN.
  • Deep knowledge of SRE practices, methodologies, and principles, along with a solid understanding of on-premises and public cloud network, compute, and storage technologies
  • In-depth knowledge of hybrid cloud architecture, IaaS and PaaS technologies, container orchestration platforms (e.g., Kubernetes), cloud efficiency, and observability
  • Strong background in incident management
  • Ability to create incident response playbooks, runbooks, incident triaging strategies, and post-incident analysis to drive continuous improvement in system reliability and availability
  • Experience with open-source management and monitoring tools
  • Experience with infrastructure automation, tooling, and configuration management frameworks (e.g., Puppet, Chef, Ansible, Terraform)
  • Familiarity with cloud security best practices and compliance standards
  • Excellent leadership skills with a passion for mentoring and fostering professional growth
  • Strong problem-solving and analytical abilities, with a keen eye for detail and a passion for driving operational excellence
  • Visionary thinker with the ability to anticipate future challenges and opportunities
  • Exceptional communication skills
  • Proven track record of successfully leading and building software in large and complex organizations


  • 10+ years of professional experience in software engineering
  • 8+ years of experience with architecture and design
  • 6+ years of experience in open-source frameworks
  • 4+ years of experience with AWS, GCP, Azure, or another cloud service
