Operations Engineer at fal, Remote
As we bring up owned clusters alongside our cloud capacity, we’re hiring Operations Engineers to keep the fleet alive. This is a hands-on role. You’re first in line when nodes go bad, GPUs throw ECC errors, IB links flap, or a rack stops responding. You’ll provision new nodes, validate them, ship them to production, and troubleshoot whatever entropy throws at them. You’ll be on-call. You’ll be in the weeds.
You’re a fit if you’ve:
- Administered Linux systems in the critical path
- Troubleshot GPU node issues: NVLink, NCCL, IB, driver and firmware bugs
- Worked with observability systems like Grafana and Prometheus
- Scripted your way out of repetitive work (Bash, Python, Go, whatever)
Who you are:
- Curious. You don’t accept “it’s flaky” as a root cause
- Comfortable with ambiguity. The runbook doesn’t exist yet for half of what you’ll do
- On-call doesn’t scare you
- You’d rather automate a problem than fix it twice
Responsibilities:
- Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters
- Troubleshoot hardware and software issues across compute, network, and storage
- Monitor fleet health, take remediation action, and push fixes upstream when needed
- Write the runbooks. Improve the ones that exist. Delete the ones that don’t work
To apply for this job please visit job-boards.greenhouse.io.
