
Principal Site Reliability Engineering Specialist (SRE)
City: Waterloo
Category: Software & Systems
Industry: IT
Employer: BlackBerry
Worker Sub-Type: Regular
Job Description:
About BlackBerry
Today, BlackBerry is a transformed company. What we once did for smartphones we’re now doing for financial institutions, automotive OEMs, aerospace, defense, healthcare, and media companies: envisioning, enabling, and securing new forms of communication that are connecting the world in extraordinary new ways. We have the most sophisticated end-to-end solutions, and our ideas lead the way in the hottest markets like cybersecurity and autonomous vehicles. With such growth and opportunity, you couldn’t consider joining us at a more exciting time!
Are you the person we're looking for?
Are you fascinated by cloud-native technologies? Do you feel a nagging disquiet when you see something that isn't automated well, or worse, not at all? Are you intrigued by the challenges presented by running globally distributed, multi-cloud infrastructure that enterprises and governments around the world depend on?
As a Site Reliability Engineering Specialist (SRE) on the BlackBerry Service Engineering & Operations team, you'll be responsible for keeping BlackBerry services running smoothly and securely, with the availability that our customers expect. You'll do this by blending operational discipline with systems engineering principles, emphasizing robust automation and rigorous observability.
The kind of SRE we want is comfortable with Kubernetes and containers, particularly in public clouds such as AWS and Azure. You're no stranger to Git workflows and GitOps practices. Need that merge request rebased? No problem. GitLab pipeline failing? You've got it covered. You've used Terraform enough to know how to navigate its idiosyncrasies. You're able to wrangle PromQL to build the perfect Grafana dashboards to monitor your services and craft high-quality, actionable alerts. You might even enjoy spending your free time going for long walks on the beach while thinking about all the ways complex systems can fail -- who are we to judge!
Toil is your enemy. You consider it a good day when you can fire up Vim and whip up a Bash or Python script to automate an annoying task you do regularly. VS Code? That works too, we're equal opportunity at BlackBerry. (Unless you use Emacs, of course.) Polyglot? Even better. We've also got tools written in Go, Ruby, and even C++.
A broader knowledge of programming languages and software development practices is a strong asset. It helps you build and manage world-class services and makes you a better partner to BlackBerry's various development teams, which is a fundamental aspect of an SRE's job. Software architects, developers, and product owners will look to you for infrastructure and operational insights that shape the solutions our customers use every day.
If this sounds like you, come join our team of SREs and help us solve interesting and challenging problems!
Responsibilities
- Architect, design, and engineer observability platforms supporting customer-facing BlackBerry services
- Support deeply integrated and sophisticated CI/CD pipelines
- Ensure the services you support have essential metrics, high-quality dashboards and alerts, and well-documented runbooks
- Maintain existing services by measuring overall system health and ensuring platforms and related software are current and patched
- Be a member of an on-call rotation (includes additional compensation) in a global 24x7 environment, responding to escalations, performing root cause analyses, and striving to ensure the same incident never occurs twice
- Help maintain our catalog of reusable, cross-service automation and build custom automation as needed for your services
- Find inventive ways of reducing costs and improving the performance of existing systems
- Plan for infrastructure and services to meet targeted SLOs and capacity
- Document as much as possible, and automate everything else
Skills and Qualifications
- Post-secondary degree in Computer Science or related technical discipline, or equivalent practical experience
- Ten or more years of experience working with cloud technologies, systems administration, or related fields in a production environment
- Experience as a cloud architect for applications using public clouds such as AWS, Azure, or GCP
- Deep knowledge of observability principles and experience using solutions such as Prometheus, Cortex, OpenSearch, Grafana, Zabbix, or related SaaS such as OpsGenie
- Extremely comfortable using Linux and navigating around the shell
- Experience with private clouds such as OpenStack or OpenNebula would be an asset
- Strong familiarity with Git and associated Git workflows
- Experience automating infrastructure deployments using tools like Terraform, Ansible, Chef, Puppet, Salt, etc.
- Experience using container orchestration platforms such as Kubernetes or Docker Swarm
- Solid understanding of the full infrastructure stack: networks and network protocols, block and object storage, virtualization and operating systems, traffic steering (especially load balancers and DNS), and databases
- Experience with CI/CD pipelines using solutions such as GitLab CI/CD, GitHub Actions, or Jenkins
- Competent and preferably fluent in at least one programming language (Bash counts, but something like Python, JavaScript, Ruby, or Go is preferred)
Projects you could be part of
- Plan, design, and migrate globally deployed services from private to public clouds
- Improve and build upon our existing observability platform by researching and prototyping new monitoring, logging, or tracing technologies
- Enhance the log ingestion pipeline to include new cloud data sources
- Work with a task force team to help improve the performance and availability of business-critical service deployments
- Become part of a DevOps group to build custom solutions, establish deployment patterns and help standardize operational practices
Scheduled Weekly Hours: 40