Always Never Home

We help candidates land their dream Jobs, Internships, Grants, Scholarships and Graduate programs

Site Reliability Engineer at CME, Remote

  • Full Time
  • Mid-level
  • Remote

Website CME

Description

Job Title: Site Reliability Engineer

We are seeking a highly skilled Senior Site Reliability Engineer (SRE) to join our Platform Engineering team. The ideal candidate will have a strong understanding of DevOps and Service Level Management (SLM) metrics, as well as experience working on event-driven infrastructure projects using tools like Terraform, New Relic, Kubernetes, AWS, and Kafka.
As a representative of Platform Engineering, you will play a critical role working with other engineering teams to ensure our platform infrastructure tooling fulfils their needs and has a positive impact on Developer Experience. You will also help them determine the right settings and thresholds for triggering alerts or automations on their applications.

Key Responsibilities:

  • Scalability and High Availability: Design, implement, and maintain scalable and highly available systems using load balancing, auto-scaling patterns, canary releases, and blue-green deployments.
  • Monitoring, Logging, and Observability: Develop and maintain monitoring and logging dashboards using tools like New Relic, Prometheus, Grafana, and Datadog. Ensure observability through metrics, tracing, log aggregation, and alerting.
  • Alerting and Automation: Help teams determine the right settings and thresholds for triggering alerts or automations on their applications, understanding that each application has different performance requirements, such as varying acceptable response times or resource constraints.
  • System Performance and Reliability: Monitor, optimize, and ensure system reliability and performance using tools like New Relic. Apply DORA metrics to measure and improve development and operational performance. Ensure compliance with SLM metrics like SLAs, SLOs, and SLIs by tracking uptime, response times, and resolution times.
  • Resiliency: Implement and advocate for “Chaos” engineering practices to ensure system resiliency.
  • Collaboration: Work with cross-functional teams to enhance platform engineering practices and gather the right information for metrics analysis.

Requirements

  • Proven experience working with Infrastructure-as-Code tooling, like Terraform, for infrastructure management.
  • Strong understanding of scalability and high availability patterns, including load balancing, auto-scaling, canary releases, and blue-green deployments.
  • Strong understanding of DevOps metrics (like DORA) and their application in measuring and improving development and operational performance.
  • Strong understanding of Service Level Management (SLM) metrics (like SLAs, SLOs, and SLIs) and their importance in defining, monitoring, and ensuring compliance of the services bound to them.
  • Experience with monitoring, logging, and observability tools like New Relic, Prometheus, Grafana, and Datadog.
  • Experience working with Kafka and improving the performance of event-driven, real-time data processing and streaming projects and architectures.
  • Familiarity with tooling used for SLM, DevOps, and DORA metrics, like Apache DevLake, Grafana, and New Relic.
  • Experience working with AWS, Azure or GCP for cloud infrastructure management.
  • Experience working with CI/CD pipeline tools such as GitHub Actions, Jenkins, GitLab CI, or similar.
  • Analytical skills: ability to analyze and interpret metrics to drive improvements.
  • Strong communication skills to effectively collaborate with team members and stakeholders.

Nice-to-haves

  • Familiarity with Observability-as-Code tooling and practices.
  • Familiarity with “Chaos” engineering practices for system resiliency.

To apply for this job please visit gotocme.zohorecruit.com.
