At a glance
Workload: Train ResNet on CIFAR-10 across a 2×2 hyperparameter grid (learning rates × batch sizes) using GitHub Actions matrix strategy.
Runner: gpu=t4, cpu=4, ram=16 per job, with 4 jobs running in parallel ($0.004/min each).
Estimated cost: ~$0.55 per full sweep (~30 min wall-clock thanks to parallelism).
This page shows how to use a GitHub Actions matrix to fan out hyperparameter combinations across parallel machine.dev GPU runners, then aggregate the results in a single comparison job.
When to fan out
Reasons you might run a parallel sweep:
- Find the right model configuration faster by testing combinations side by side
- Cut total wall-clock time on a hyperparameter search
- Compare model performance across configurations in one report
- Pick the best-performing run automatically rather than by eye
How it works
The workflow uses GitHub Actions’ matrix strategy to run training jobs concurrently. Each job trains a ResNet model on CIFAR-10 with a different combination of hyperparameters. The workflow runs only when you trigger it manually (workflow_dispatch).
The pipeline:
- Defines a matrix of hyperparameter combinations
- Launches a GPU job per combination, all running concurrently
- Saves per-run metrics as artifacts
- Aggregates results in a final comparison job
- Outputs a comparison CSV
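The fan-out in the steps above is a Cartesian product of the matrix axes. A minimal Python sketch of that expansion, using the values from this page's workflow (the expand_matrix helper is illustrative and not part of GitHub Actions or the template repo):

```python
from itertools import product

# Mirrors the matrix declared in the workflow on this page.
matrix = {
    "learning_rate": [0.001, 0.0005],
    "batch_size": [32, 64],
}


def expand_matrix(matrix):
    """Return one dict per combination, the way a matrix fans out into jobs."""
    keys = list(matrix)
    return [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]


jobs = expand_matrix(matrix)
for job in jobs:
    print(job)
```

Two values on each of two axes give four dictionaries, one per training job.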
Workflow
name: ResNet Hyperparameter Tuning

on:
  workflow_dispatch:

jobs:
  hyperparameter_tuning:
    name: Hyperparameter Tuning
    # id makes each matrix leg pin its own runner: no job-stealing between sweeps
    runs-on: machine/id=${{ github.run_id }}-lr${{ matrix.learning_rate }}-bs${{ matrix.batch_size }}/gpu=t4/cpu=4/ram=16/architecture=x64
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        learning_rate: [0.001, 0.0005]
        batch_size: [32, 64]
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install uv
        uses: astral-sh/setup-uv@v5
      - name: Install dependencies
        run: |
          uv venv .venv --python=3.10
          source .venv/bin/activate
          uv pip install -r requirements.txt
          deactivate
      - name: Train and Evaluate ResNet
        env:
          LEARNING_RATE: ${{ matrix.learning_rate }}
          BATCH_SIZE: ${{ matrix.batch_size }}
        run: |
          source .venv/bin/activate
          python train.py
          deactivate
      - name: Upload metrics artifact
        uses: actions/upload-artifact@v4
        with:
          name: metrics-${{ matrix.learning_rate }}-${{ matrix.batch_size }}
          path: metrics_*.json

  compare_tuning:
    needs: hyperparameter_tuning
    name: Compare Tuning Performance
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install uv
        uses: astral-sh/setup-uv@v5
      - name: Install dependencies
        run: |
          uv venv .venv --python=3.10
          source .venv/bin/activate
          uv pip install -r requirements.txt
          deactivate
      - name: Download all metrics
        uses: actions/download-artifact@v4
        with:
          path: metrics
      - name: Compare Metrics
        run: |
          source .venv/bin/activate
          python compare_metrics.py
          deactivate
      - name: Upload comparison results
        uses: actions/upload-artifact@v4
        with:
          name: comparison-results
          path: model_comparison.csv
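The workflow passes each combination to train.py through the LEARNING_RATE and BATCH_SIZE environment variables and uploads whatever matches metrics_*.json. The template's actual train.py is not reproduced on this page. The sketch below only illustrates that contract under stated assumptions: the function names, the file-name pattern, and the accuracy and loss keys are hypothetical, and the training loop is a placeholder.

```python
import json
import os
from pathlib import Path


def read_config(env=os.environ):
    """Read the hyperparameters the workflow passes in via env vars."""
    return {
        "learning_rate": float(env["LEARNING_RATE"]),
        "batch_size": int(env["BATCH_SIZE"]),
    }


def train_and_evaluate(config):
    """Placeholder: a real ResNet training loop would go here."""
    # Hypothetical metric names; they stay None until real training fills them in.
    return {"accuracy": None, "loss": None}


def write_metrics(config, results, out_dir="."):
    """Write one JSON file whose name matches the workflow's metrics_*.json glob."""
    name = f"metrics_lr{config['learning_rate']}_bs{config['batch_size']}.json"
    path = Path(out_dir) / name
    path.write_text(json.dumps({**config, **results}, indent=2))
    return path
```

A script built this way would end by calling read_config, passing the result to train_and_evaluate, and handing both to write_metrics. Putting the hyperparameters in the file name keeps each job's output distinct even if the files are later merged into one folder.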
What this gives you
A few useful properties of the matrix-on-machine.dev pattern:
- Matrix strategy. The workflow declares a matrix of hyperparameters and Actions auto-creates one job per combination. Two learning rates × two batch sizes = four concurrent training jobs.
- Real parallelism. Each job lands on its own machine.dev GPU runner. Wall-clock time is the slowest combination, not the sum of all combinations.
- Per-run metrics. Each job writes a metrics file as a uniquely-named artifact.
- Automatic comparison. A final job downloads everything and produces a single comparison report.
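The comparison step can be pictured as "load every metrics file, rank, write a CSV". The sketch below is not the template's compare_metrics.py. It assumes each metrics file is a flat JSON object with a numeric accuracy key (a hypothetical name), and it searches recursively because download-artifact, when given no artifact name, places each artifact in its own subfolder of metrics.

```python
import csv
import json
from pathlib import Path


def load_metrics(root="metrics"):
    """Collect every metrics_*.json under root, including per-artifact subfolders."""
    return [json.loads(p.read_text()) for p in sorted(Path(root).rglob("metrics_*.json"))]


def write_comparison(rows, out_path="model_comparison.csv", sort_key="accuracy"):
    """Write rows to CSV, highest sort_key first, and return the ranked rows."""
    ranked = sorted(rows, key=lambda r: r[sort_key], reverse=True)
    # Union of keys across runs, so a run with an extra metric still fits.
    fieldnames = sorted({key for row in ranked for key in row})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(ranked)
    return ranked
```

Calling write_comparison on the result of load_metrics produces model_comparison.csv with the best run in the first data row, which is what makes picking the winner automatic.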
Runner config
The default runner used here:
- T4 GPU: 16 GB VRAM, fits ResNet on CIFAR-10 comfortably
- Configurable CPU, RAM, and architecture per job
Because each combination has its own runner, the search finishes in roughly one combination’s worth of time, even with many combinations.
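To make the timing claim concrete, here is a small sketch with made-up per-job durations (not measured results). It assumes each runner is billed per minute for its own runtime, as the per-job rate quoted earlier suggests, so parallelism shortens the wait but not the billed total.

```python
# Hypothetical per-job training times in minutes; these are not measured results.
durations = {
    "lr0.001-bs32": 28,
    "lr0.001-bs64": 22,
    "lr0.0005-bs32": 30,
    "lr0.0005-bs64": 24,
}


def wall_clock_minutes(durations):
    """Parallel jobs finish when the slowest one does."""
    return max(durations.values())


def billed_minutes(durations):
    """Each runner accrues its own minutes, so billing follows the sum."""
    return sum(durations.values())


print(wall_clock_minutes(durations))  # 30
print(billed_minutes(durations))      # 104
```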
Tips
- Pick hyperparameters that move the needle. Don’t sweep things that won’t change the result.
- Start broad, then narrow around promising values
- Adjust CPU/RAM per matrix entry if some combinations need more
- Set the workflow timeout long enough to cover the slowest combination
- Use fail-fast: false so one bad combination doesn’t kill the whole sweep
Getting started
- Use MachineDotDev/parallel-hyperparameter-tuning as a template
- Open the Actions tab in your repository
- Pick the “ResNet Hyperparameter Tuning” workflow
- Click “Run workflow”
- Wait for all jobs to finish
- Download the comparison-results artifact to see which combination won
Adapting it to your model
- Update the matrix to your hyperparameters
- Replace train.py with your training code
- Capture the metrics that actually matter for your task
- Update compare_metrics.py to highlight what you care about
How to adapt this
- More hyperparameters: add optimizer, weight_decay, dropout, etc. to the matrix. Every combination spawns its own runner.
- Larger sweep: a 5×5×5 = 125 combination matrix is fine. Every job runs on its own runner concurrently.
- Larger model: bump to gpu=l4 or gpu=a10g if 16 GB VRAM is too tight
- Use spot: add tenancy=spot to cut costs by 70-90% (the example uses on-demand by default)
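Before growing the matrix, it helps to estimate what a larger sweep would cost. The sketch below multiplies job count, minutes per job, and the $0.004/min T4 rate quoted on this page, then applies the 70-90% spot range. The sweep_cost helper is illustrative, and the 30 minutes per job is an assumption, not a measurement.

```python
from math import prod

RATE_PER_MIN = 0.004  # on-demand T4 rate quoted on this page, in dollars


def sweep_cost(axis_sizes, minutes_per_job, rate=RATE_PER_MIN, discount=0.0):
    """Estimated dollars for a full sweep: one job per matrix combination."""
    jobs = prod(axis_sizes)
    return jobs * minutes_per_job * rate * (1.0 - discount)


# Assumed 30 minutes per job for a 5x5x5 matrix (125 jobs).
on_demand = sweep_cost([5, 5, 5], 30)
spot_best = sweep_cost([5, 5, 5], 30, discount=0.9)
spot_worst = sweep_cost([5, 5, 5], 30, discount=0.7)

print(f"on-demand: ${on_demand:.2f}")
print(f"spot: ${spot_best:.2f} to ${spot_worst:.2f}")
```

Under these assumptions the 125-job sweep comes to about $15 on demand and roughly $1.50 to $4.50 on spot. For the 2×2 sweep the same formula gives about $0.48 at exactly 30 minutes per job, a little under this page's ~$0.55 estimate, which presumably counts some time not modeled here.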
Next steps
- Working repo: fork or use as a template
- Cost Optimization: spot pricing for sweep economics
- LLM Supervised Fine-Tuning: same matrix technique applied to LLM fine-tunes