04.0 // Documentation v1.8.4 Last updated 2026-06-05

GRPO Fine-Tuning

Improve language-model reasoning with Group Relative Policy Optimization on machine.dev GPU runners.

At a glance

  • Workload: Fine-tune Qwen 2.5 3B on the GSM8K math dataset using Group Relative Policy Optimization (GRPO) via Unsloth.
  • Runner: gpu=t4, cpu=4, ram=16, tenancy=spot ($0.004/min).
  • Estimated cost: ~$0.15 per training run (~30 min).

This page shows how to use GRPO, the reasoning-focused RL algorithm introduced in the DeepSeekMath paper, to strengthen a model’s mathematical reasoning. It also covers checkpointing and automatic retries for spot instances.

Why GRPO

A few reasons you might reach for it:

  • Push mathematical reasoning forward on smaller models
  • Get structured outputs with defined reasoning steps
  • Improve performance on multi-step problem solving
  • Build models that show their working

How it works

The fine-tuning pipeline uses Unsloth to accelerate training and applies GRPO. It runs as a GitHub Actions workflow you can trigger on demand with input parameters.

The job:

  1. Loads a base model (e.g. Qwen 2.5 3B)
  2. Prepares the GSM8K dataset of grade-school math problems
  3. Applies LoRA for memory-efficient training
  4. Trains with GRPO to improve reasoning and structured outputs
  5. Saves checkpoints during training (in the retry-enabled workflow)
  6. Pushes the fine-tuned model to Hugging Face Hub

Workflow

The basic version:

name: Training

on:
  workflow_dispatch:
    inputs:
      max_seq_length:
        type: string
        required: false
        description: 'The maximum sequence length'
        default: '1024'
      lora_rank:
        type: string
        required: false
        description: 'The lora rank'
        default: '64'
      max_steps:
        type: string
        required: false
        description: 'The maximum number of steps'
        default: '250'
      gpu_memory_utilization:
        type: string
        required: false
        description: 'The GPU memory utilization'
        default: '0.60'
      learning_rate:
        type: string
        required: false
        description: 'The learning rate'
        default: '5e-6'
      per_device_train_batch_size:
        type: string
        required: false
        description: 'The per device training batch size'
        default: '1'
      hf_repo:
        type: string
        required: true
        description: 'The Hugging Face repository to upload the model to'

jobs:
  train:
    name: Qwen 2.5 3B - GRPO LoRA Training (unsloth)
    runs-on: machine/gpu=t4/cpu=4/ram=16/architecture=x64
    timeout-minutes: 180
    env:
      MAX_SEQ_LENGTH: ${{ inputs.max_seq_length }}
      LORA_RANK: ${{ inputs.lora_rank }}
      GPU_MEMORY_UTILIZATION: ${{ inputs.gpu_memory_utilization }}
      MAX_STEPS: ${{ inputs.max_steps }}
      LEARNING_RATE: ${{ inputs.learning_rate }}
      PER_DEVICE_TRAIN_BATCH_SIZE: ${{ inputs.per_device_train_batch_size }}
      HF_TOKEN: ${{ secrets.HF_TOKEN }}
      HF_HUB_ENABLE_HF_TRANSFER: 1
      HF_REPO: ${{ inputs.hf_repo }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Run Training
        run: |
          python3 "qwen2_5_(3b)_grpo.py"
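
Every workflow input reaches the training script as a string in an environment variable. A minimal sketch of how a script might parse them follows. The `TrainConfig` class and `load_config` function are hypothetical names introduced here and are not taken from the repository's script. The variable names and defaults mirror the workflow above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TrainConfig:
    """Typed view of the env block set by the workflow (hypothetical helper)."""
    max_seq_length: int
    lora_rank: int
    max_steps: int
    gpu_memory_utilization: float
    learning_rate: float
    per_device_train_batch_size: int
    hf_repo: str


def _value(env: Mapping[str, str], name: str, default: str) -> str:
    # Treat an unset or empty variable as "use the default".
    return env.get(name) or default


def load_config(env: Mapping[str, str] = os.environ) -> TrainConfig:
    """Convert the string-valued env vars into numbers; defaults match the workflow inputs."""
    hf_repo = (env.get("HF_REPO") or "").strip()
    if not hf_repo:
        raise ValueError("HF_REPO is required (the hf_repo workflow input)")
    return TrainConfig(
        max_seq_length=int(_value(env, "MAX_SEQ_LENGTH", "1024")),
        lora_rank=int(_value(env, "LORA_RANK", "64")),
        max_steps=int(_value(env, "MAX_STEPS", "250")),
        gpu_memory_utilization=float(_value(env, "GPU_MEMORY_UTILIZATION", "0.60")),
        learning_rate=float(_value(env, "LEARNING_RATE", "5e-6")),
        per_device_train_batch_size=int(_value(env, "PER_DEVICE_TRAIN_BATCH_SIZE", "1")),
        hf_repo=hf_repo,
    )
```

Parsing in one place means a malformed input such as a non-numeric `max_steps` fails before the model is loaded rather than partway through training.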

Retry on spot interruption

For longer runs, the repository ships a workflow with automatic checkpointing and retry:

name: Training with Retry

on:
  workflow_dispatch:
    inputs:
      attempt:
        type: string
        description: 'The attempt number'
        default: '1'
      max_attempts:
        type: number
        description: 'The maximum number of attempts'
        default: 5
      # Same parameters as in the basic workflow
      # ...

This avoids losing training progress when AWS reclaims a spot instance. The pattern:

  1. Save checkpoints to Hugging Face Hub during training
  2. Detect spot instance interruptions with a custom GitHub Action
  3. Restart the workflow with an incremented attempt number
  4. Resume training from the latest checkpoint
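
Resuming from "the latest checkpoint" means picking the checkpoint with the highest step number. The sketch below assumes checkpoints are stored in folders named `checkpoint-<step>`, which is the naming the Hugging Face Trainer uses locally. Whether the repository's script uses the same layout on the Hub is an assumption, and `latest_checkpoint` is a hypothetical helper.

```python
import re
from typing import Iterable, Optional

# Assumed folder naming: "checkpoint-<step>", e.g. "checkpoint-200".
_CHECKPOINT = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the checkpoint folder with the highest step, or None if there is none."""
    best_step, best_name = -1, None
    for name in names:
        match = _CHECKPOINT.match(name)
        if match is None:
            continue  # ignore files such as README.md
        step = int(match.group(1))
        if step > best_step:
            best_step, best_name = step, name
    return best_name
```

The comparison is numeric on purpose: sorted as plain strings, `checkpoint-50` would come after `checkpoint-200`.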

The retry walkthrough:

  1. The workflow starts a training job with a given attempt number (default: 1)
  2. Checkpoints get pushed to Hugging Face Hub on a schedule
  3. If the job completes, the workflow ends
  4. If the job fails on a spot interruption:
    • The check-runner-interruption action confirms a spot preemption was the cause
    • The workflow calculates the next attempt number
    • Within max_attempts, it triggers a new run with the incremented attempt
    • Original parameters carry over to the new attempt
  5. The new attempt downloads the latest checkpoint and resumes from there

Even if a spot instance is reclaimed, your training picks up where it left off on the next instance.
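
The decision in step 4 can be written as a small pure function. This is a sketch of the logic described above, not the code of the check-runner-interruption action. `next_attempt` and its parameter names are hypothetical.

```python
from typing import Optional, Union


def next_attempt(
    attempt: Union[int, str],
    max_attempts: Union[int, str],
    job_failed: bool,
    spot_interrupted: bool,
) -> Optional[int]:
    """Return the attempt number to re-dispatch with, or None if no retry should happen."""
    current = int(attempt)        # the 'attempt' input arrives as a string
    limit = int(max_attempts)
    if not job_failed:
        return None               # job completed: the workflow ends
    if not spot_interrupted:
        return None               # a genuine failure should not be retried blindly
    if current >= limit:
        return None               # retry budget exhausted
    return current + 1
```

Retrying only on a confirmed spot preemption keeps real failures, such as a bad hyperparameter or an out-of-memory error, from burning through every attempt.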

machine.dev runner config

The default runner used here:

  • T4 GPU: 16 GB VRAM, suits Unsloth-optimised training of small/medium LLMs
  • Spot tenancy: cheap but interruptible (paired with the retry pattern above)
  • Configurable CPU, RAM, and architecture

You can pin regions:

runs-on: machine/gpu=t4/cpu=4/ram=16/architecture=x64/tenancy=spot/regions=us-east-1,us-east-2
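
If you generate workflows programmatically, the label can be assembled from its parts. The sketch below uses only the keys that appear on this page. Whether other keys exist or whether key order matters is not stated here, so treat `runner_label` as a hypothetical helper.

```python
from typing import Optional, Sequence


def runner_label(
    gpu: str,
    cpu: int,
    ram: int,
    architecture: str = "x64",
    tenancy: Optional[str] = None,
    regions: Optional[Sequence[str]] = None,
) -> str:
    """Build a runs-on label in the slash-separated key=value form shown on this page."""
    parts = [
        "machine",
        f"gpu={gpu}",
        f"cpu={cpu}",
        f"ram={ram}",
        f"architecture={architecture}",
    ]
    if tenancy:
        parts.append(f"tenancy={tenancy}")
    if regions:
        parts.append("regions=" + ",".join(regions))
    return "/".join(parts)
```

For example, `runner_label("t4", 4, 16, tenancy="spot", regions=["us-east-1", "us-east-2"])` reproduces the label above.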

What GRPO actually is

Group Relative Policy Optimization is a reinforcement-learning algorithm aimed at reasoning. It first appeared in the DeepSeekMath paper and was later used to train DeepSeek-R1. Compared to PPO it has three differences worth knowing:

  1. No value function. GRPO drops the separate value-function model, cutting memory use.

  2. Group-based advantage. Instead of a value function, GRPO samples multiple outputs per prompt and uses the group’s mean reward as the baseline. This fits how reward models are usually trained: on comparisons between several outputs for the same input.

  3. Direct KL divergence. The KL penalty against the reference model is added directly to the loss instead of being folded into the reward. The policy is still held close to its starting behaviour, and the advantage calculation stays simple.
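
The three points above can be written down in the notation of the DeepSeekMath paper (outcome-reward case). The symbols are not defined on this page and are introduced for illustration: q is a prompt, o_1..o_G are the G sampled outputs with rewards r_1..r_G, ε is the clipping range, β is the KL coefficient, and π_ref is the frozen reference model. Note that the paper also divides by the group's standard deviation.

```latex
% Group-relative advantage: no value function, the group itself is the baseline
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}

% Objective: PPO-style clipped ratio term, with the KL penalty inside the loss
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
  \Big( \min\big(\rho_{i,t}\,\hat{A}_{i,t},\ \operatorname{clip}(\rho_{i,t},\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_{i,t}\big)
  - \beta\,\mathbb{D}_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \Big) \right],
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})}

% Per-token KL estimator used in the penalty (always non-negative)
\mathbb{D}_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]
  = \frac{\pi_{\mathrm{ref}}(o_{i,t}\mid q, o_{i,<t})}{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}
  - \log\frac{\pi_{\mathrm{ref}}(o_{i,t}\mid q, o_{i,<t})}{\pi_\theta(o_{i,t}\mid q, o_{i,<t})} - 1
```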

On math reasoning datasets like GSM8K, GRPO nudges models to show their working in structured steps, which lifts both accuracy and explainability.

A typical training loop:

  • Sample multiple outputs per prompt
  • Score each generation with reward functions (rule-based or outcome-based)
  • Compute advantages relative to the group mean
  • Update the policy while staying close to the original model
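
The scoring and advantage steps of that loop can be sketched in plain Python. The reward below is a simple rule-based correctness check against a GSM8K-style reference, where the final answer follows `####`. The function names, the use of population standard deviation, and the `eps` constant are assumptions for illustration, not the repository's reward functions.

```python
import re
import statistics
from typing import List, Optional


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number mentioned in the text, ignoring thousands separators."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def gsm8k_reference(answer: str) -> Optional[float]:
    """GSM8K solutions end with '#### <final answer>'; parse that final answer."""
    return extract_final_number(answer.split("####")[-1])


def correctness_reward(completion: str, reference: Optional[float]) -> float:
    """Rule-based reward: 1.0 if the completion's final number equals the reference."""
    predicted = extract_final_number(completion)
    if predicted is None or reference is None:
        return 0.0
    return 1.0 if predicted == reference else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Centre each reward on the group mean and scale by the group's spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled completions (made-up strings for illustration).
reference = gsm8k_reference("16 - 3 - 4 = 9 eggs, 9 * 2 = 18 dollars\n#### 18")
completions = [
    "She sells 9 eggs at $2 each, so she makes 18",
    "The answer is 20",
    "I am not sure.",
    "9 * 2 = 18",
]
rewards = [correctness_reward(c, reference) for c in completions]
advantages = group_advantages(rewards)
```

Correct completions get a positive advantage and incorrect ones a negative advantage. If every completion in a group scores the same, all advantages are zero and that prompt contributes no policy gradient for the step.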

Getting started

  1. Use MachineDotDev/grpo-fine-tune as a template
  2. Create a Hugging Face access token with write permissions
  3. Add it as a repository secret named HF_TOKEN
  4. Open the Actions tab in your repository
  5. Pick the “Training with Retry” workflow
  6. Click “Run workflow” and set parameters:
    • Sequence length, LoRA rank, training steps
    • GPU memory utilisation and learning rate
    • Hugging Face target repository
  7. Run it and wait
  8. The fine-tuned model lands on Hugging Face Hub
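
Steps 4 to 6 can also be scripted through GitHub's REST API, which exposes a workflow dispatch endpoint. A minimal sketch using only the standard library follows. The workflow file name `train-with-retry.yml`, the owner and repository names, and the helper `build_dispatch_request` are hypothetical, so check the template repository for the real file name. The token here is a GitHub token allowed to run workflows, not the Hugging Face token.

```python
import json
import urllib.request
from typing import Mapping


def build_dispatch_request(
    owner: str,
    repo: str,
    workflow_file: str,
    github_token: str,
    inputs: Mapping[str, object],
    ref: str = "main",
) -> urllib.request.Request:
    """Build (but do not send) a workflow_dispatch request for the given workflow file."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # The inputs on this page are mostly declared as strings, so send strings.
    payload = {"ref": ref, "inputs": {key: str(value) for key, value in inputs.items()}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {github_token}",
        },
    )


request = build_dispatch_request(
    owner="your-org",                      # hypothetical
    repo="grpo-fine-tune",                 # your copy of the template
    workflow_file="train-with-retry.yml",  # hypothetical file name
    github_token="<github-token>",
    inputs={"hf_repo": "your-user/qwen-grpo", "max_steps": 250},
)
# To actually trigger the run: urllib.request.urlopen(request)
```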

Tips

  • The default parameters are tuned for GSM8K and are a fine starting point
  • Lower the batch size if you hit OOM
  • Use the retry-enabled workflow for any run longer than ~20 minutes
  • Watch the workflow logs to track loss and reward
  • Evaluate on math word problems to see whether reasoning actually improved

How to adapt this

  • Larger model: swap gpu=t4 for gpu=l4 or gpu=l40s. See CPU vs GPU.
  • Different base model: change the model name in the training script
  • Different reasoning task: swap GSM8K for MATH, BIG-bench, or your own dataset

Next steps