GRPO Fine-Tuning — machine.dev docs

At a glance

Workload: Fine-tune Qwen 2.5 3B on the GSM8K math dataset using Group Relative Policy Optimization (GRPO) via Unsloth.
Runner: gpu=t4, cpu=4, ram=16, tenancy=spot ($0.004/min).
Estimated cost: ~$0.15 per training run (~30 min).

This page shows how to use GRPO, the reasoning-focused RL algorithm from the DeepSeekMath paper, to strengthen a model’s mathematical reasoning. It also covers checkpointing and automatic retries after spot-instance interruptions.

Why GRPO

A few reasons you might reach for it:

Push mathematical reasoning forward on smaller models
Get structured outputs with defined reasoning steps
Improve performance on multi-step problem solving
Build models that show their working

How it works

The fine-tuning pipeline uses Unsloth to accelerate training and applies GRPO. It runs as a GitHub Actions workflow you can trigger on demand with input parameters.

The job:

Loads a base model (e.g. Qwen 2.5 3B)
Prepares the GSM8K dataset of grade-school math problems
Applies LoRA for memory-efficient training
Trains with GRPO to improve reasoning and structured outputs
Saves checkpoints during training (in the retry-enabled workflow)
Pushes the fine-tuned model to Hugging Face Hub

Workflow

The basic version:

name: Training

on:
  workflow_dispatch:
    inputs:
      max_seq_length:
        type: string
        required: false
        description: 'The maximum sequence length'
        default: '1024'
      lora_rank:
        type: string
        required: false
        description: 'The LoRA rank'
        default: '64'
      max_steps:
        type: string
        required: false
        description: 'The maximum number of steps'
        default: '250'
      gpu_memory_utilization:
        type: string
        required: false
        description: 'The GPU memory utilization'
        default: '0.60'
      learning_rate:
        type: string
        required: false
        description: 'The learning rate'
        default: '5e-6'
      per_device_train_batch_size:
        type: string
        required: false
        description: 'The per device training batch size'
        default: '1'
      hf_repo:
        type: string
        required: true
        description: 'The Hugging Face repository to upload the model to'

jobs:
  train:
    name: Qwen 2.5 3B - GRPO LoRA Training (unsloth)
    runs-on: machine/gpu=t4/cpu=4/ram=16/architecture=x64
    timeout-minutes: 180
    env:
      MAX_SEQ_LENGTH: ${{ inputs.max_seq_length }}
      LORA_RANK: ${{ inputs.lora_rank }}
      GPU_MEMORY_UTILIZATION: ${{ inputs.gpu_memory_utilization }}
      MAX_STEPS: ${{ inputs.max_steps }}
      LEARNING_RATE: ${{ inputs.learning_rate }}
      PER_DEVICE_TRAIN_BATCH_SIZE: ${{ inputs.per_device_train_batch_size }}
      HF_TOKEN: ${{ secrets.HF_TOKEN }}
      HF_HUB_ENABLE_HF_TRANSFER: 1
      HF_REPO: ${{ inputs.hf_repo }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Run Training
        run: |
          python3 "qwen2_5_(3b)_grpo.py"
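
Workflow inputs reach the job as environment variables, and they are always strings. The sketch below shows one way a training script might turn them into typed values. The `TrainingConfig` class and the `load_config` function are hypothetical names used for illustration; the script in the repository may read these values differently. The fallback defaults mirror the workflow inputs above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Typed view of the workflow's env block (hypothetical helper)."""
    max_seq_length: int
    lora_rank: int
    max_steps: int
    gpu_memory_utilization: float
    learning_rate: float
    per_device_train_batch_size: int
    hf_repo: str


def _get(env, name, default):
    # Treat an unset or empty variable as "use the default".
    return env.get(name) or default


def load_config(env=None) -> TrainingConfig:
    """Parse string env vars into typed values, falling back to the workflow defaults."""
    env = os.environ if env is None else env
    hf_repo = env.get("HF_REPO", "")
    if not hf_repo:
        # hf_repo is the only required workflow input, so fail early without it.
        raise ValueError("HF_REPO must be set")
    return TrainingConfig(
        max_seq_length=int(_get(env, "MAX_SEQ_LENGTH", "1024")),
        lora_rank=int(_get(env, "LORA_RANK", "64")),
        max_steps=int(_get(env, "MAX_STEPS", "250")),
        gpu_memory_utilization=float(_get(env, "GPU_MEMORY_UTILIZATION", "0.60")),
        learning_rate=float(_get(env, "LEARNING_RATE", "5e-6")),
        per_device_train_batch_size=int(_get(env, "PER_DEVICE_TRAIN_BATCH_SIZE", "1")),
        hf_repo=hf_repo,
    )
```

Parsing everything in one place means a malformed input such as a non-numeric learning rate fails at start-up rather than partway through a run.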

Retry on spot interruption

For longer runs, the repository ships a workflow with automatic checkpointing and retry:

name: Training with Retry

on:
  workflow_dispatch:
    inputs:
      attempt:
        type: string
        description: 'The attempt number'
        default: '1'
      max_attempts:
        type: number
        description: 'The maximum number of attempts'
        default: 5
      # Same parameters as in the basic workflow
      # ...

This avoids losing training progress when AWS reclaims a spot instance. The pattern:

Save checkpoints to Hugging Face Hub during training
Detect spot instance interruptions with a custom GitHub Action
Restart the workflow with an incremented attempt number
Resume training from the latest checkpoint
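
Resuming depends on picking the newest checkpoint. Here is a minimal sketch, assuming checkpoints use the `checkpoint-<step>` folder naming that the Hugging Face `Trainer` produces by default. The `latest_checkpoint` helper is hypothetical, and how the repository lists checkpoint folders on the Hub is not shown here.

```python
import re
from typing import Iterable, Optional

# Matches folder names such as "checkpoint-250" and captures the step number.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the checkpoint folder with the highest step, or None if there is none."""
    best_step = -1
    best_name = None
    for name in names:
        match = _CHECKPOINT_RE.match(name)
        if match is None:
            continue  # ignore unrelated files such as README.md
        step = int(match.group(1))
        if step > best_step:
            best_step = step
            best_name = name
    return best_name
```

Steps are compared as integers because a plain string sort would rank `checkpoint-9` above `checkpoint-100`.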

The retry walkthrough:

The workflow starts a training job with a given attempt number (default: 1)
Checkpoints get pushed to Hugging Face Hub on a schedule
If the job completes, the workflow ends
If the job fails on a spot interruption:
- The check-runner-interruption action confirms a spot preemption was the cause
- The workflow calculates the next attempt number
- If the next attempt number is within max_attempts, it triggers a new run with the incremented attempt
- Original parameters carry over to the new attempt
The new attempt downloads the latest checkpoint and resumes from there
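
The retry decision in the walkthrough above comes down to a small piece of logic, sketched here in Python. The function name and its parameters are hypothetical, and the real workflow makes this decision in workflow steps rather than in Python. The sketch also assumes that a failure not caused by a spot interruption is not retried, which the walkthrough does not state explicitly.

```python
from typing import Optional


def next_attempt(attempt: int, max_attempts: int,
                 job_failed: bool, spot_interrupted: bool) -> Optional[int]:
    """Return the attempt number to re-dispatch with, or None to stop."""
    if not job_failed:
        return None  # training completed, nothing to retry
    if not spot_interrupted:
        return None  # assumption: only spot preemptions are retried
    candidate = attempt + 1
    # Stop once the retry budget is used up.
    return candidate if candidate <= max_attempts else None
```

Because the `attempt` input is declared as a string in the workflow, it would need converting with `int(...)` before being passed to a function like this.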

Even if a spot instance is reclaimed, training resumes from the most recent checkpoint on the next instance, so only the steps since that checkpoint are repeated.

machine.dev runner config

The default runner used here:

T4 GPU: 16 GB VRAM, suits Unsloth-optimised training of small/medium LLMs
Spot tenancy: cheap but interruptible (paired with the retry pattern above)
Configurable CPU, RAM, and architecture

You can pin regions:

runs-on: machine/gpu=t4/cpu=4/ram=16/architecture=x64/tenancy=spot/regions=us-east-1,us-east-2
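
If you generate workflows for several runner configurations, it can help to build the label from its parts. The sketch below is a hypothetical helper, not part of machine.dev's tooling. The label syntax is inferred only from the two `runs-on` examples on this page, so treat any other keys or orderings as unverified.

```python
from typing import Optional, Sequence


def runner_label(gpu: str, cpu: int, ram: int, architecture: str = "x64",
                 tenancy: Optional[str] = None,
                 regions: Optional[Sequence[str]] = None) -> str:
    """Compose a runs-on label in the slash-separated key=value form shown above."""
    parts = [
        "machine",
        f"gpu={gpu}",
        f"cpu={cpu}",
        f"ram={ram}",
        f"architecture={architecture}",
    ]
    if tenancy:
        parts.append(f"tenancy={tenancy}")
    if regions:
        # Multiple regions are joined with commas, as in the example above.
        parts.append("regions=" + ",".join(regions))
    return "/".join(parts)


print(runner_label("t4", 4, 16, tenancy="spot", regions=["us-east-1", "us-east-2"]))
```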

What GRPO actually is

Group Relative Policy Optimization is a reinforcement-learning algorithm aimed at reasoning. It first appeared in the DeepSeekMath paper and was later used to train DeepSeek-R1. Compared to PPO it has three differences worth knowing:

No value function. GRPO drops the separate value-function model, cutting memory use.
Group-based advantage. Instead of a value function, GRPO samples multiple outputs per prompt and uses the group’s mean reward as the baseline. This fits the comparative nature of reward models, which are typically trained on comparisons between outputs for the same prompt.
Direct KL divergence. GRPO adds the KL penalty against the reference model directly to the loss instead of folding it into the reward. That keeps the advantage calculation simple while still holding the policy close to its starting behaviour.
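
The group-based advantage is small enough to show in full. This is a minimal sketch, not the code Unsloth or the training script runs. It centres each reward on the group mean and, as in the DeepSeekMath formulation, also divides by the group's standard deviation. The `eps` term and the choice of population standard deviation are assumptions made here to keep the example well defined; implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each sampled output relative to the other outputs for the same prompt."""
    baseline = mean(rewards)   # the group mean replaces a learned value function
    spread = pstdev(rewards)   # population standard deviation of the group
    # eps avoids division by zero when every output earns the same reward.
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (reward 1.0), two wrong (reward 0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Outputs that beat the group mean receive a positive advantage and the rest a negative one. When every output in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.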

On math reasoning datasets like GSM8K, GRPO nudges models to show their working in structured steps, which can lift both accuracy and explainability.

A typical training loop:

Sample multiple outputs per prompt
Score each generation with reward functions (rule-based or outcome-based)
Compute advantages relative to the group mean
Update the policy while staying close to the original model
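
The scoring step can be as simple as a rule that checks the final answer. Below is a minimal sketch of an outcome-based reward. It relies on GSM8K's convention of ending each reference solution with `#### <answer>`, and it assumes the model is prompted to end its output with the same marker. `extract_answer` and `correctness_reward` are hypothetical names; the repository's training script may use different reward functions and a different output format.

```python
import re
from typing import Optional

# Captures the number after a "####" marker, allowing a sign, thousands commas and decimals.
_ANSWER_RE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")


def extract_answer(text: str) -> Optional[str]:
    """Return the last '#### <number>' value in the text with commas removed, or None."""
    matches = _ANSWER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def correctness_reward(completion: str, reference: str) -> float:
    """Give 1.0 when the completion's final answer equals the reference answer, else 0.0."""
    predicted = extract_answer(completion)
    expected = extract_answer(reference)
    if predicted is None or expected is None:
        return 0.0  # no parseable final answer earns no reward
    return 1.0 if float(predicted) == float(expected) else 0.0
```

A rule like this rewards only the outcome. Encouraging structured reasoning steps usually takes additional format-based rewards alongside it.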

Getting started

Use MachineDotDev/grpo-fine-tune as a template
Create a Hugging Face access token with write permissions
Add it as a repository secret named HF_TOKEN
Open the Actions tab in your repository
Pick the “Training with Retry” workflow
Click “Run workflow” and set parameters:
- Sequence length, LoRA rank, training steps
- GPU memory utilisation and learning rate
- Hugging Face target repository
Start the run and wait for it to finish
The fine-tuned model lands on Hugging Face Hub

Tips

The default parameters are tuned for GSM8K and are a fine starting point
Lower the batch size if you hit OOM
Use the retry-enabled workflow for any run longer than ~20 minutes
Watch the workflow logs to track loss and reward
Evaluate on math word problems to see whether reasoning actually improved

How to adapt this

Larger model: swap gpu=t4 for gpu=l4 or gpu=l40s. See CPU vs GPU.
Different base model: change the model name in the training script
Different reasoning task: swap GSM8K for MATH, Big-Bench, or your own dataset

Next steps

Working repo: fork or use as a template
Cost Optimization: checkpointing and retry pattern
LLM Supervised Fine-Tuning: the same checkpointing approach with SFT instead of RL