Website Zapier
Site Reliability Engineer
About You
You’re an experienced technologist. You’ve spent 4 years working on multiple projects at SaaS companies in systems administration, systems engineering, or software development, with at least 2 years of experience in Site Reliability Engineering or DevOps.
You know the cloud. You’ve participated in the design or maintenance of highly available, cloud-based infrastructure in AWS or another cloud offering. You understand how to leverage infrastructure as code tools and have learned best practices for reliability and observability. We use tools like Terraform, Kubernetes, Redis, GitLab, and Datadog, among others.
You can code. You have experience with languages like Python or Go to create automated tools. You believe in hands-off deployments and infrastructure as code. Well-honed expertise with the fundamentals of software development goes a long way here.
You can solve complex systems challenges. You enjoy hard problems, understand how to improve performance, and help uncover opportunities for improvement. You’ve worked on problems where “just throw more hardware at it” isn’t enough for the system to scale.
You’re a great communicator. Not only do you know how to share your knowledge with the team and document things well so they can be consumed asynchronously (we do this a lot as a remote company), but you also know how to communicate effectively with software and support teams.
You value our values. At Zapier, our values are at the heart of how we collaborate and how we think about our customers. In our remote setting, they help develop trust and ensure we work together to democratize automation. You see how these values can empower meaningful work, you thrive in a collaborative setting, and you are eager to continue growing and excited to be part of the team.
Things You’ll Do
As part of this team, you’ll work on:
- Designing and deploying our AWS infrastructure using infrastructure as code (Terraform, Helm, etc.) across multiple accounts.
- Contributing to our Kubernetes clusters (EKS) and serverless functions (Lambda). Production Engineering provides compute resources as a service, and you’ll help shape what features we offer.
- Evaluating new tools and recommending technologies to the entire organization. If there’s a tool that will help us serve our customers, we’ll go get it.
- Partnering with teams to solve novel infrastructure and design problems. Service teams are responsible for keeping services running. It’s your job to help them make decisions that scale.
- Building services to integrate systems, process high-traffic workloads, and perform critical migrations. We don’t believe in drawing a hard line between developers and SREs. If you see a part of the code you can improve, default to action and make the change.
Using site reliability principles, you’ll help fix problems at their root cause rather than just treating the symptoms. You’ll improve application reliability using a software engineering approach to operations. You’ll develop internal tools and systems to help engineering teams ship better software, faster. You’ll get to impact every engineering team in the organization and use a broad set of technologies. Maintaining excellent relationships and communicating effectively with teams will be crucial to your success.
Building new features and services is a big part of this role. We continually develop and implement new ways to support our teams, understand our customers’ needs, and become experts in site reliability.
When bad things happen, you’ll have the support of your team to solve contributing causes, learn from failures, and build robust and resilient systems for our customers. We look for the solution that automates the problem away, not one that requires manual effort.
If you’re interested in making a big impact and taking our infrastructure to the next level at a fast-growing and profitable startup, then read on.
To apply for this job, please visit zapier.com.