04.0 // Documentation v1.8.4 Last updated 2026-06-05

Parallel Hyperparameter Tuning

Run parallel hyperparameter tuning on machine.dev GPU runners. Use a GitHub Actions matrix strategy to test multiple configurations simultaneously.

At a glance

  • Workload: Train ResNet on CIFAR-10 across a 2×2 hyperparameter grid (learning rates × batch sizes) using a GitHub Actions matrix strategy.
  • Runner: gpu=t4, cpu=4, ram=16, with 4 parallel jobs ($0.004/min each).
  • Estimated cost: ~$0.55 per full sweep (~30 min wall-clock thanks to parallelism).
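
The estimate can be sanity-checked with simple arithmetic. The sketch below multiplies the job count by the per-minute rate and the time each job runs. The 30 minutes per job is an assumption taken from the wall-clock figure, and `estimate_sweep_cost` is a hypothetical helper, not part of machine.dev. The result lands slightly below the ~$0.55 quoted above; this page does not break down what accounts for the difference.

```python
def estimate_sweep_cost(n_jobs: int, rate_per_min: float, minutes_per_job: float) -> float:
    """Return the total cost in dollars when every job is billed per minute."""
    return n_jobs * rate_per_min * minutes_per_job


# Figures from this page: 4 parallel T4 jobs at $0.004/min.
# Assumption: each job bills for roughly 30 minutes.
cost = estimate_sweep_cost(n_jobs=4, rate_per_min=0.004, minutes_per_job=30)
print(f"estimated sweep cost: ${cost:.2f}")
```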

This page shows how to use a GitHub Actions matrix to fan out hyperparameter combinations across parallel machine.dev GPU runners, then aggregate the results in a single comparison job.

When to fan out

Reasons you might run a parallel sweep:

  • Find the right model configuration faster by testing combinations side by side
  • Cut total wall-clock time on a hyperparameter search
  • Compare model performance across configurations in one report
  • Pick the best-performing run automatically rather than by eye

How it works

The workflow uses GitHub Actions’ matrix strategy to run training jobs concurrently. Each job trains a ResNet model on CIFAR-10 with a different combination of hyperparameters. The workflow runs only when you trigger it manually (workflow_dispatch).

The pipeline:

  1. Defines a matrix of hyperparameter combinations
  2. Launches a GPU job per combination, all running concurrently
  3. Saves per-run metrics as artifacts
  4. Aggregates results in a final comparison job
  5. Outputs a comparison CSV
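
Steps 1 and 2 amount to taking the Cartesian product of the matrix axes. The sketch below reproduces that expansion in Python to show which jobs the matrix on this page produces. `expand_matrix` is a hypothetical helper for illustration only; GitHub Actions performs the expansion itself.

```python
from itertools import product

# Matrix values copied from the workflow on this page.
matrix = {
    "learning_rate": [0.001, 0.0005],
    "batch_size": [32, 64],
}


def expand_matrix(matrix: dict) -> list[dict]:
    """Return one dict per combination, mirroring how a matrix fans out into jobs."""
    keys = list(matrix)
    return [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]


jobs = expand_matrix(matrix)
for job in jobs:
    # Same naming scheme as the upload step: metrics-<learning_rate>-<batch_size>
    print(f"metrics-{job['learning_rate']}-{job['batch_size']}")
```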

Workflow

name: ResNet Hyperparameter Tuning

on:
  workflow_dispatch:

jobs:
  hyperparameter_tuning:
    name: Hyperparameter Tuning
    # id makes each matrix leg pin its own runner, so there is no job-stealing between sweeps
    runs-on: machine/id=${{ github.run_id }}-lr${{ matrix.learning_rate }}-bs${{ matrix.batch_size }}/gpu=t4/cpu=4/ram=16/architecture=x64
    timeout-minutes: 30
    strategy:
      fail-fast: false
      matrix:
        learning_rate: [0.001, 0.0005]
        batch_size: [32, 64]
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install dependencies
        run: |
          uv venv .venv --python=3.10
          source .venv/bin/activate
          uv pip install -r requirements.txt
          deactivate

      - name: Train and Evaluate ResNet
        env:
          LEARNING_RATE: ${{ matrix.learning_rate }}
          BATCH_SIZE: ${{ matrix.batch_size }}
        run: |
          source .venv/bin/activate
          python train.py
          deactivate

      - name: Upload metrics artifact
        uses: actions/upload-artifact@v4
        with:
          name: metrics-${{ matrix.learning_rate }}-${{ matrix.batch_size }}
          path: metrics_*.json

  compare_tuning:
    needs: hyperparameter_tuning
    name: Compare Tuning Performance
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Install dependencies
        run: |
          uv venv .venv --python=3.10
          source .venv/bin/activate
          uv pip install -r requirements.txt
          deactivate

      - name: Download all metrics
        uses: actions/download-artifact@v4
        with:
          path: metrics

      - name: Compare Metrics
        run: |
          source .venv/bin/activate
          python compare_metrics.py
          deactivate

      - name: Upload comparison results
        uses: actions/upload-artifact@v4
        with:
          name: comparison-results
          path: model_comparison.csv

What this gives you

A few useful properties of the matrix-on-machine.dev pattern:

  1. Matrix strategy. The workflow declares a matrix of hyperparameters and Actions auto-creates one job per combination. Two learning rates × two batch sizes = four concurrent training jobs.

  2. Real parallelism. Each job lands on its own machine.dev GPU runner. Wall-clock time is set by the slowest combination, not by the sum of all combinations.

  3. Per-run metrics. Each job writes a metrics file as a uniquely-named artifact.

  4. Automatic comparison. A final job downloads everything and produces a single comparison report.
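
A minimal sketch of what the comparison step might do, assuming each metrics file is a flat JSON object with the same keys and an `accuracy` field. The field name and both function names are assumptions; the actual compare_metrics.py in the template may differ. Because download-artifact is called without a name, each artifact lands in its own subdirectory under metrics/, so the search is recursive.

```python
import csv
import json
from pathlib import Path


def collect_metrics(root: Path) -> list[dict]:
    """Load every metrics_*.json found anywhere under root."""
    return [json.loads(p.read_text()) for p in sorted(root.rglob("metrics_*.json"))]


def write_comparison(rows: list[dict], out_path: Path, sort_key: str = "accuracy") -> list[dict]:
    """Write rows to a CSV, highest sort_key first, and return the sorted rows."""
    ranked = sorted(rows, key=lambda r: r[sort_key], reverse=True)
    if not ranked:
        return ranked  # nothing to write
    with out_path.open("w", newline="") as f:
        # Assumes every metrics file carries the same keys as the first one.
        writer = csv.DictWriter(f, fieldnames=list(ranked[0]))
        writer.writeheader()
        writer.writerows(ranked)
    return ranked


if __name__ == "__main__":
    root = Path("metrics")
    rows = collect_metrics(root) if root.is_dir() else []
    ranked = write_comparison(rows, Path("model_comparison.csv"))
    print("best:", ranked[0] if ranked else "no metrics found")
```

Sorting before writing puts the best run on the first data row of model_comparison.csv, which covers picking the winner automatically.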

Runner config

The default runner used here:

  • T4 GPU: 16 GB VRAM, fits ResNet on CIFAR-10 comfortably
  • Configurable CPU, RAM, and architecture per job

Because each combination has its own runner, the whole search takes roughly as long as the slowest single combination, however many combinations the matrix contains.
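
To make the difference concrete, the sketch below compares running the four combinations back to back on one runner with running them side by side. The durations are hypothetical placeholders, not measurements, and the calculation ignores queueing and runner start-up time.

```python
# Hypothetical per-combination training times in minutes (not measured values).
durations = {
    "lr0.001-bs32": 28,
    "lr0.001-bs64": 22,
    "lr0.0005-bs32": 30,
    "lr0.0005-bs64": 24,
}

sequential_minutes = sum(durations.values())  # one runner, jobs back to back
parallel_minutes = max(durations.values())    # one runner per job, all at once

print(f"sequential: {sequential_minutes} min, parallel: {parallel_minutes} min")
```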

Tips

  • Pick hyperparameters that move the needle. Don’t sweep things that won’t change the result.
  • Start broad, then narrow around promising values
  • Adjust CPU/RAM per matrix entry if some combinations need more
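  • Set each job's timeout-minutes long enough to cover the slowest combination
  • Use fail-fast: false so one failed combination doesn’t cancel the rest of the sweep. Because compare_tuning lists the matrix job under needs, it is still skipped if any combination fails.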
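
One way to apply "start broad, then narrow" to a scale-sensitive value such as the learning rate is a log-spaced first pass followed by a tighter log-spaced second pass around the winner. The sketch below is an illustration under those assumptions; `log_grid`, `narrow_around`, and the factor of 3 are hypothetical choices, not something the template prescribes.

```python
import math


def log_grid(low: float, high: float, n: int) -> list[float]:
    """Return n values spaced evenly on a log scale from low to high (inclusive)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    step = (math.log10(high) - math.log10(low)) / (n - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(n)]


def narrow_around(best: float, factor: float = 3.0, n: int = 3) -> list[float]:
    """Return n log-spaced values spanning best / factor to best * factor."""
    return log_grid(best / factor, best * factor, n)


broad = log_grid(1e-4, 1e-1, 4)   # first pass: one value per decade
refined = narrow_around(1e-3)     # second pass, assuming 1e-3 won the first pass

print([f"{v:.1e}" for v in broad])
print([f"{v:.1e}" for v in refined])
```

The printed values can be rounded and pasted into the learning_rate list of the matrix for each pass.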

Getting started

  1. Use MachineDotDev/parallel-hyperparameter-tuning as a template
  2. Open the Actions tab in your repository
  3. Pick the “ResNet Hyperparameter Tuning” workflow
  4. Click “Run workflow”
  5. Wait for all jobs to finish
  6. Download the comparison-results artifact to see which combination won

Adapting it to your model

  1. Update the matrix to your hyperparameters
  2. Replace train.py with your training code
  3. Capture the metrics that actually matter for your task
  4. Update compare_metrics.py to highlight what you care about
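
Whatever replaces train.py has to keep the contract the workflow relies on: read LEARNING_RATE and BATCH_SIZE from the environment, and write a file matching metrics_*.json. A minimal sketch of that contract follows. The file-name scheme, the function names, and the metric keys are assumptions for illustration; the training loop itself is left out.

```python
import json
import os
from pathlib import Path


def read_hyperparameters(env=os.environ) -> dict:
    """Read the values the workflow exports as LEARNING_RATE and BATCH_SIZE."""
    return {
        "learning_rate": float(env["LEARNING_RATE"]),
        "batch_size": int(env["BATCH_SIZE"]),
    }


def write_metrics(params: dict, metrics: dict, out_dir: Path = Path(".")) -> Path:
    """Write one JSON file whose name matches the metrics_*.json upload pattern."""
    # Hypothetical naming scheme; any name matching metrics_*.json gets uploaded.
    name = f"metrics_lr{params['learning_rate']}_bs{params['batch_size']}.json"
    path = out_dir / name
    # Store the hyperparameters next to the results so each file is self-describing.
    path.write_text(json.dumps({**params, **metrics}, indent=2))
    return path
```

In a training script, the entry point would call `read_hyperparameters()`, run training and evaluation, then pass the resulting metrics dict (for example `{"accuracy": ...}`) to `write_metrics`.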

How to adapt this

  • More hyperparameters: add optimizer, weight_decay, dropout, etc. to the matrix. Every combination spawns its own runner.
  • Larger sweep: a 5×5×5 = 125 combination matrix is fine (GitHub Actions caps a matrix at 256 jobs per workflow run). Every job runs on its own runner concurrently.
  • Larger model: bump to gpu=l4 or gpu=a10g if 16 GB VRAM is too tight
  • Use spot: add tenancy=spot to cut costs by 70-90% (the example uses on-demand by default)
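
Before launching a larger sweep it helps to check the job count and a rough budget. The sketch below counts the jobs for a hypothetical 5×5×5 matrix, checks it against the GitHub Actions limit of 256 jobs per matrix, and applies the 70-90% spot saving quoted above. The axis values, the 30 minutes per job, and the reuse of the $0.004/min T4 rate are all assumptions.

```python
from math import prod

# GitHub Actions generates at most 256 jobs per matrix in a workflow run.
GITHUB_MATRIX_JOB_LIMIT = 256


def matrix_size(matrix: dict) -> int:
    """Number of jobs a matrix expands to: the product of the axis lengths."""
    return prod(len(values) for values in matrix.values())


def spot_cost_range(on_demand_cost: float, min_saving: float = 0.70,
                    max_saving: float = 0.90) -> tuple[float, float]:
    """Return (lowest, highest) spot cost for the given range of savings."""
    return on_demand_cost * (1 - max_saving), on_demand_cost * (1 - min_saving)


# Hypothetical 5x5x5 sweep; the values are placeholders, not recommendations.
matrix = {
    "learning_rate": [0.1, 0.03, 0.01, 0.003, 0.001],
    "batch_size": [16, 32, 64, 128, 256],
    "weight_decay": [0.0, 1e-5, 1e-4, 1e-3, 1e-2],
}

n_jobs = matrix_size(matrix)
assert n_jobs <= GITHUB_MATRIX_JOB_LIMIT, "split the sweep across several workflow runs"

# Assumes every job bills 30 minutes at $0.004/min.
on_demand = n_jobs * 0.004 * 30
low, high = spot_cost_range(on_demand)
print(f"{n_jobs} jobs, on-demand ${on_demand:.2f}, spot ${low:.2f} to ${high:.2f}")
```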

Next steps