Site Reliability Engineer
London, England, United Kingdom
Software and Services
Shazam Site Reliability Engineers are not just responsible for making sure all services and systems that Shazam relies on are operating at their highest level; they’re also responsible for helping development teams embrace these principles as they develop software. Shazam SREs embed themselves with development teams and act as extensions of those teams to propagate best practices. As an SRE, you’ll collaborate with development teams to help them understand the bigger picture of distributed systems, beyond individual components. We are strong believers in ownership with software engineers being responsible for the code they write. The SRE team helps build the competencies across teams to ensure we build scalable and supportable systems.
This role sits in our London office, reporting to our Head of SRE. You will assist multiple development teams based in London, San Diego and other locations. You'll build and maintain key backend systems and participate in all stages of the development cycle, from feature design all the way to production release. You will be expected to write and review code and to deeply understand how our applications work. This role offers the potential for leadership opportunities, so if you have an interest in leading a team or taking on managerial responsibilities in the future, we’d love to hear from you.
Description
Hundreds of millions of users. Billions of Shazams. Countless moments of discovery. Shazam brings a unique brand of magic to millions every day. Bring us your vision, and it’ll be you creating the wow moments that excite people across the world! We’re looking for a strong engineer to join our team to lead advancements to the next level of reliability, scalability, and performance for the core services that Shazam provides to its users. You’ll work alongside development teams to continue to evangelize best practices and improve the systems that power Shazam.
Minimum Qualifications
- Design, develop, and operate highly available and scalable distributed systems.
- Collaborate with development teams to implement best practices for CI/CD, infrastructure as code, automated testing, and security so that services can meet scaling demands.
- Troubleshoot and debug issues across the entire stack, including application code, networking, and infrastructure.
- Build, maintain, and optimize monitoring and alerting solutions to ensure high availability and performance of services. Familiarity with different reliability methodologies (e.g., SLOs).
- Automate repetitive tasks and processes, focusing on reliability and efficiency improvements.
- Participate in on-call rotations and incident management processes to ensure rapid resolution of critical issues.
- Contribute to team and organizational strategy, participating in architectural reviews and decision-making processes.
- Experience: 3+ years of experience in designing, building, and operating reliable distributed systems.
- Cloud Expertise: Hands-on experience with a cloud platform such as Google Cloud Platform (GCP) or Amazon Web Services (AWS).
- Strong understanding of core Linux/UNIX operating system fundamentals and the TCP/IP network stack.
- Experience operating Kubernetes clusters in production, with an understanding of how containers interact with network and system resources.
- Monitoring & Logging: Knowledge of monitoring and logging tools (Prometheus, Grafana, ELK stack, or similar) as well as how to instrument applications.
- Programming Skills: Proficiency in at least one programming language (e.g., Golang, Python, Java, C/C++) and a scripting language (e.g., Bash), with a strong understanding of software development and debugging. Ability to read, understand and contribute to source code.
- Bachelor's degree in Computer Science, Electrical or Computer Engineering, or equivalent experience.
Preferred Qualifications
- Security: Knowledge of security best practices for cloud-based infrastructure.
- DevOps Tools: Experience with deploying software to production, implementing and managing CI/CD pipelines, infrastructure as code, and software release tooling. Familiarity with Helm and Helmfile is a plus.
- Database experience: Familiarity with databases (e.g., PostgreSQL, Cassandra, Redis) is a plus.
- Team Leadership: Prior experience leading a team of engineers is a plus.
Additional Requirements
- A dedicated lifelong learner who is always looking for new things to learn and try.
- A professional engineer who loves crafting, analysing and troubleshooting large software systems.
- An excellent communicator who builds collaborative relationships with technical and non-technical stakeholders.
- An analytical problem-solver who is tenacious in sticking with a problem until it's resolved once and for all.
- A great teammate, but you can work on your own initiative as well.
- Always actively looking for ways to improve our services, taking personal ownership of the quality of the services we offer.
- Personally accountable, owning the decisions and mistakes that you make.