Every job now records its own metrics

Technical write-up from the team at machine.dev.

Every machine.dev job now records its own metrics: CPU, memory, disk, and network on all runners, plus GPU utilisation, memory, temperature, and power draw on GPU runners. They show up as sparkline charts on the job page in the dashboard. The feature has been running in beta for months. Today it's generally available, on by default, and free.

You don't have to do anything to get it. Push a job, let it run, open the job page when it's done. The charts are there.

Why we built it

The old way to find out whether your job was using the GPU was to put nvidia-smi in your workflow and read the logs afterwards. That works, but it's a snapshot, and you have to remember to add it before the run, not after. By the time you're wondering why a training job took 40 minutes, the run is over and you've got nothing to look at.
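For reference, the manual approach looked something like the sketch below. It assumes a GitHub Actions-style workflow; the job name and train.py are placeholders for illustration. The query fields are standard nvidia-smi ones.

```yaml
# Sketch of the manual approach; assumes a GitHub Actions-style workflow.
# The job name "train" and the script train.py are hypothetical.
jobs:
  train:
    runs-on: machine/gpu=l4
    steps:
      - uses: actions/checkout@v4
      - name: Train
        run: python train.py
      # One reading, taken after training has already finished.
      - name: GPU snapshot
        if: always()
        run: |
          nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu,power.draw --format=csv
```

The snapshot step prints a single CSV row per GPU. It tells you the state of the card at that instant, not what happened during the 40 minutes before it.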

So the question kept coming up: was the GPU actually doing anything, or did the job spend half its life on CPU-side preprocessing while a very expensive L40S sat idle? Was it memory-bound? Did the disk choke on checkpoint writes? You could guess. Guessing is not a great way to spend money on compute.

Metrics answer that without you instrumenting anything. The runner samples itself while the job runs and ships the numbers home. When the job finishes, the dashboard draws the charts. You look at the GPU utilisation line, see it flatlined at 30%, and now you know to drop from an L40S to a T4 and stop paying for headroom you weren't using.

What gets collected

On every runner:

CPU utilisation
Memory usage
Disk I/O
Network bytes in and out

On GPU runners, you also get:

GPU utilisation
GPU memory usage
GPU temperature
Power draw

GPU runners get all eight. CPU runners get the four system metrics. Everything renders as a sparkline on the job page, so you're reading a shape, not a wall of numbers.

Low fidelity by default

On by default means on for every job, and the default is deliberately low fidelity. The runner samples once every 60 seconds. That's enough to see the shape of a job that runs for more than a few minutes, and it costs almost nothing to collect.

For a long training run, a sample a minute is plenty. You're looking for "did the GPU stay busy," and an hour of one-minute samples draws that line fine. For a job that's over in 90 seconds, a sample a minute gives you one or two data points, which is not a chart, it's a dot. That's where you turn it up.

Turning it up

Fidelity is the sampling interval, and you set it with a label. The config uses the packed label format, where everything goes in one runs-on string separated by forward slashes.

Default, every 60 seconds. Nothing to add:

YAML

runs-on: machine/gpu=l4

Sample every 5 seconds for a finer-grained look at a shorter job. At that interval a 90-second job gets about 18 data points instead of one or two:

YAML

runs-on: machine/gpu=l4/metrics_interval=5

You can go all the way down to every second when you really want to see what happened moment to moment:

YAML

runs-on: machine/gpu=l4/metrics_interval=1

The interval range is 1 to 60 seconds. One second is the floor. Higher fidelity means more data points, which means a denser chart and a bit more overhead while the job runs. For most jobs the default is the right call and you'll never touch this. When you're chasing a specific question about a specific run, drop the interval and look closer.
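In context, the label is just the value of the job's runs-on key. A minimal sketch, assuming a GitHub Actions-style workflow file; the workflow name, job name, and train.py are placeholders, and only the runs-on string comes from the examples above.

```yaml
# Minimal sketch; assumes a GitHub Actions-style workflow.
# "train" and train.py are hypothetical names.
name: train
on: push

jobs:
  train:
    # Packed label: runner type, then options, separated by forward slashes.
    # metrics_interval accepts 1 to 60 (seconds); omit it to keep the 60-second default.
    runs-on: machine/gpu=l4/metrics_interval=5
    steps:
      - uses: actions/checkout@v4
      - name: Train
        run: python train.py
```

Nothing in the steps changes. The sampling interval is part of the runner request, so the job itself stays uninstrumented.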

Turning it off

If you don't want metrics on a job, set metrics=false:

YAML

runs-on: machine/gpu=l4/metrics=false

That's the whole switch. No metrics collected, no charts drawn. We don't think you'll want this often, but it's there.

A CPU example

None of this is GPU-only. A big build on a 32-core runner gets the same treatment, minus the GPU lines:

YAML

runs-on: machine/cpu=32/tenancy=spot/metrics_interval=10

Now you can see whether make -j$(nproc) actually saturated all 32 cores or whether it spent most of the build single-threaded and you could have rented a smaller box.
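Put together, that build job might look like the sketch below. It assumes a GitHub Actions-style workflow and a project with a Makefile at the repository root; the job name is a placeholder.

```yaml
# Sketch only; assumes a GitHub Actions-style workflow and a Makefile
# at the repository root. The job name "build" is hypothetical.
jobs:
  build:
    # 32-core spot runner, sampled every 10 seconds.
    runs-on: machine/cpu=32/tenancy=spot/metrics_interval=10
    steps:
      - uses: actions/checkout@v4
      - name: Build
        # nproc reports the number of available cores, so make runs one job per core.
        run: make -j$(nproc)
```

If the CPU sparkline sits near the top for the whole build, the 32 cores are earning their keep. If it spikes briefly and then idles low, most of the build is serial.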

What's next

The current view is per-job sparklines, which is the thing we wanted first: see one run, understand one run. Comparing runs over time, alerting on a utilisation threshold, and exporting the raw series are all good side projects, and none of them are done yet. If there's one you want before the others, tell us. It changes what we build next.

For now: it's on, it's free, and it has been collecting on every job you've run since the rollout. Go open a recent job page and have a look. You've probably been leaving GPU utilisation on the table this whole time. Most people are.

The full label reference, including every metrics option, is in the configuration docs.

[POST_001] JUNE 5, 2026

Every job now records its own metrics