Mid-level
Remote
Posted 6 minutes ago

Operations Engineer at fal, Remote

As we bring up owned clusters alongside our cloud capacity, we’re hiring Operations Engineers to keep the fleet alive. This is a hands-on role. You’re first in line when nodes go bad, GPUs throw ECC errors, IB links flap, or a rack stops responding. You’ll provision new nodes, validate them, ship them to production, and troubleshoot whatever entropy throws at them. You’ll be on-call. You’ll be in the weeds.

You’re a fit if you’ve:

Administered Linux systems in the critical path
Troubleshot GPU node issues: NVLink, NCCL, IB, driver and firmware bugs
Worked with observability systems like Grafana and Prometheus
Scripted your way out of repetitive work (bash, python, go, whatever)

Who you are:

Curious. You don’t accept “it’s flaky” as a root cause
Comfortable with ambiguity. The runbook doesn’t exist yet for half of what you’ll do
On-call doesn’t scare you
You’d rather automate a problem than fix it twice

Responsibilities:

Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters
Troubleshoot hardware and software issues across compute, network, and storage
Monitor fleet health, take remediation action, and push fixes upstream when needed
Write the runbooks. Improve the ones that exist. Delete the ones that don’t work

To apply for this job please visit job-boards.greenhouse.io.
