
Website CME
Description
Job Title: Site Reliability Engineer
We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to join our Platform Engineering team. The ideal candidate will have a strong understanding of DevOps and Service Level Management (SLM) metrics, as well as experience working on event-driven infrastructure projects using tools like Terraform, New Relic, Kubernetes, AWS, and Kafka.
As a representative of Platform Engineering, you will play a critical role working with other engineering teams to ensure our platform infrastructure tooling fulfils their needs and has a positive impact on Developer Experience. You will also help them determine the right settings and thresholds for triggering alerts or automations on their applications.
Key Responsibilities:
- Scalability and High Availability: Design, implement, and maintain scalable and highly available systems using load balancing, auto-scaling patterns, canary releases, and blue-green deployments.
- Monitoring, Logging, and Observability: Develop and maintain monitoring and logging dashboards using tools like New Relic, Prometheus, Grafana, and Datadog.
- Ensure observability through metrics, tracing, log aggregation, and alerting.
- Alerting and Automation: Help teams determine the right settings and thresholds for triggering alerts or automations on their applications.
- Understand that each application has different performance requirements, such as varying acceptable response times or resource constraints.
- System Performance and Reliability: Monitor, optimize, and ensure system reliability and performance using tools like New Relic.
- Apply DORA metrics to measure and improve development and operational performance.
- Ensure compliance with SLM metrics like SLAs, SLOs, and SLIs by tracking uptime, response times, and resolution times.
- Resiliency: Implement and advocate for “Chaos” engineering practices to ensure system resiliency.
- Collaboration: Work with cross-functional teams to enhance platform engineering practices and gather the right information for metrics analysis.
Requirements
- Proven experience working with Infrastructure-as-Code tooling, like Terraform, for infrastructure management.
- Strong understanding of scalability and high availability patterns, including load balancing, auto-scaling, canary releases, and blue-green deployments.
- Strong understanding of DevOps metrics (like DORA) and their application in measuring and improving development and operational performance.
- Strong understanding of Service Level Management (SLM) metrics (like SLAs, SLOs, and SLIs) and their importance in defining, monitoring, and ensuring compliance of the services bound to them.
- Experience with monitoring, logging, and observability tools like New Relic, Prometheus, Grafana, and Datadog.
- Experience working with Kafka and improving the performance of event-driven, real-time data processing and streaming projects and architectures.
- Familiarity with tooling used for SLM, DevOps, and DORA metrics, such as Apache DevLake, Grafana, and New Relic.
- Experience working with AWS, Azure or GCP for cloud infrastructure management.
- Experience working with CI/CD pipeline tools such as GitHub Actions, Jenkins, GitLab CI, or similar.
- Analytical skills: ability to analyze and interpret metrics to drive improvements.
- Strong communication skills to effectively collaborate with team members and stakeholders.
Nice-to-haves
- Familiarity with Observability-as-Code tooling and practices.
- Familiarity with “Chaos” engineering practices for system resiliency.
To apply for this job please visit gotocme.zohorecruit.com.
Site Reliability Engineer at CME, Remote