Website Braze

Ensure services utilize infrastructure platforms in a scalable, reliable manner to meet our strict enterprise-grade SLAs with customers.
Design, implement and continuously improve infrastructure architecture for in-house and third-party services.
Manage capacity proactively, preferably in an automated way.
Debug reliability and scalability issues across all infrastructure layers, including the Kubernetes and virtualization layers.
Build monitoring and alerting that triggers on symptoms, not on product outages.
Manage incidents:
- Be on a PagerDuty rotation to respond to availability incidents and provide support for other engineers.
- Use your on-call shift to prevent incidents from ever happening.
- Run retrospectives on everything that happens to turn lessons into system improvements, changes, and automation.

5+ years of experience as a Site Reliability, DevOps, and/or Software Engineer.
You think about systems – interfaces, boundaries, edge cases, failure modes, behaviors, and specific implementations.
Have the urge to collaborate, document, and deliver quickly.
- Collaborate across global remote teams, often working asynchronously.
- Document everything so you don’t need to learn the same thing (or plan the same work) twice.
- Deliver fast to delight our customers, even internal ones.
Have an enthusiastic, go-for-it attitude. When you see something broken, you can’t help but fix it.
Have the desire to solve availability challenges in a high-scale, high-volume environment.
Have an excellent ability to manage multiple tasks and expectations at once.
Know your way around Linux and the Unix shell.
Have good programming skills – Ruby and/or Go preferred.
Have experience with Docker, Kubernetes, Terraform, or similar container and IaC technologies.
Have experience with MongoDB, Redis, Kafka, Postgres, or similar data technologies.

To apply for this job, please visit boards.greenhouse.io.

Senior Site Reliability Engineer, Infrastructure at Braze
