At a glance
Workload: Fine-tune Qwen 2.5 3B on the GSM8K math dataset using Group Relative Policy Optimization (GRPO) via Unsloth.
Runner: gpu=t4, cpu=4, ram=16, tenancy=spot ($0.004/min).
Estimated cost: ~$0.15 per training run (~30 min).
This page shows how to use GRPO, the reasoning-focused RL algorithm from the DeepSeekMath paper, to strengthen a model’s mathematical reasoning. It also covers checkpointing and automatic retries after spot-instance interruptions.
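As a rough check on the cost figures above, the sketch below multiplies the quoted spot rate by a run length. The rate comes from the runner line; the function name and the rounding are illustrative assumptions. Thirty minutes of pure runtime comes to about $0.12, so the ~$0.15 estimate leaves some headroom, and the 180-minute job timeout set in the workflow puts a ceiling on any single attempt.

```python
# Rate taken from the "At a glance" runner line; the helper itself is illustrative.
SPOT_RATE_PER_MIN = 0.004  # USD per minute for gpu=t4, tenancy=spot


def run_cost(minutes: float, rate_per_min: float = SPOT_RATE_PER_MIN) -> float:
    """Return the cost in USD of a run lasting `minutes`, rounded to 4 decimals."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return round(minutes * rate_per_min, 4)


typical = run_cost(30)    # a ~30 minute training run
ceiling = run_cost(180)   # a single attempt that uses the full 180-minute timeout
print(f"typical: ${typical:.2f}, ceiling per attempt: ${ceiling:.2f}")
```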
Why GRPO
A few reasons you might reach for it:
- Push mathematical reasoning forward on smaller models
- Get structured outputs with defined reasoning steps
- Improve performance on multi-step problem solving
- Build models that show their working
How it works
The fine-tuning pipeline uses Unsloth to accelerate training and applies GRPO. It runs as a GitHub Actions workflow you can trigger on demand with input parameters.
The job:
- Loads a base model (e.g. Qwen 2.5 3B)
- Prepares the GSM8K dataset of grade-school math problems
- Applies LoRA for memory-efficient training
- Trains with GRPO to improve reasoning and structured outputs
- Saves checkpoints during training (in the retry-enabled workflow)
- Pushes the fine-tuned model to Hugging Face Hub
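For the dataset step, one detail matters when scoring generations: GSM8K reference solutions end with a line of the form `#### <final answer>`. The helper below is a minimal sketch of pulling that final answer out so it can be compared with a model's output. The function name is hypothetical and the training script in the repository may handle this differently.

```python
import re
from typing import Optional


def extract_gsm8k_answer(solution: str) -> Optional[str]:
    """Return the number that follows '####' in a GSM8K solution, or None if absent."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if match is None:
        return None
    # Drop thousands separators so "1,000" and "1000" compare equal
    return match.group(1).replace(",", "")


example = "She sold 48 clips in April and 24 in May.\n48 + 24 = 72\n#### 72"
print(extract_gsm8k_answer(example))
```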
Workflow
The basic version:
name: Training
on:
  workflow_dispatch:
    inputs:
      max_seq_length:
        type: string
        required: false
        description: 'The maximum sequence length'
        default: '1024'
      lora_rank:
        type: string
        required: false
        description: 'The lora rank'
        default: '64'
      max_steps:
        type: string
        required: false
        description: 'The maximum number of steps'
        default: '250'
      gpu_memory_utilization:
        type: string
        required: false
        description: 'The GPU memory utilization'
        default: '0.60'
      learning_rate:
        type: string
        required: false
        description: 'The learning rate'
        default: '5e-6'
      per_device_train_batch_size:
        type: string
        required: false
        description: 'The per device training batch size'
        default: '1'
      hf_repo:
        type: string
        required: true
        description: 'The Hugging Face repository to upload the model to'
jobs:
  train:
    name: Qwen 2.5 3B - GRPO LoRA Training (unsloth)
    runs-on: machine/gpu=t4/cpu=4/ram=16/architecture=x64
    timeout-minutes: 180
    env:
      MAX_SEQ_LENGTH: ${{ inputs.max_seq_length }}
      LORA_RANK: ${{ inputs.lora_rank }}
      GPU_MEMORY_UTILIZATION: ${{ inputs.gpu_memory_utilization }}
      MAX_STEPS: ${{ inputs.max_steps }}
      LEARNING_RATE: ${{ inputs.learning_rate }}
      PER_DEVICE_TRAIN_BATCH_SIZE: ${{ inputs.per_device_train_batch_size }}
      HF_TOKEN: ${{ secrets.HF_TOKEN }}
      HF_HUB_ENABLE_HF_TRANSFER: 1
      HF_REPO: ${{ inputs.hf_repo }}
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
      - name: Run Training
        run: |
          python3 "qwen2_5_(3b)_grpo.py"
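The workflow hands every input to the training script as a string environment variable, so the script has to convert them to numbers itself. The sketch below shows one way that could look. The `TrainConfig` class and `load_config` function are hypothetical and not taken from the repository; the variable names and defaults mirror the workflow above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TrainConfig:
    max_seq_length: int
    lora_rank: int
    max_steps: int
    gpu_memory_utilization: float
    learning_rate: float
    per_device_train_batch_size: int
    hf_repo: str


def load_config(env: Mapping[str, str] = os.environ) -> TrainConfig:
    """Build a typed config from the workflow's env block."""

    def get(name: str, default: str) -> str:
        # An unset or empty variable falls back to the workflow default
        return env.get(name) or default

    return TrainConfig(
        max_seq_length=int(get("MAX_SEQ_LENGTH", "1024")),
        lora_rank=int(get("LORA_RANK", "64")),
        max_steps=int(get("MAX_STEPS", "250")),
        gpu_memory_utilization=float(get("GPU_MEMORY_UTILIZATION", "0.60")),
        learning_rate=float(get("LEARNING_RATE", "5e-6")),
        per_device_train_batch_size=int(get("PER_DEVICE_TRAIN_BATCH_SIZE", "1")),
        hf_repo=env["HF_REPO"],  # required input, so there is no default
    )
```

Parsing in one place means a malformed input (for example a learning rate of `abc`) fails immediately with a `ValueError` rather than partway through training.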
Retry on spot interruption
For longer runs, the repository ships a workflow with automatic checkpointing and retry:
name: Training with Retry
on:
  workflow_dispatch:
    inputs:
      attempt:
        type: string
        description: 'The attempt number'
        default: '1'
      max_attempts:
        type: number
        description: 'The maximum number of attempts'
        default: 5
      # Same parameters as in the basic workflow
      # ...
This avoids losing training progress when AWS reclaims a spot instance. The pattern:
- Save checkpoints to Hugging Face Hub during training
- Detect spot instance interruptions with a custom GitHub Action
- Restart the workflow with an incremented attempt number
- Resume training from the latest checkpoint
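Resuming depends on picking the right checkpoint. The sketch below assumes checkpoints follow the `checkpoint-<step>` directory naming that the Hugging Face `Trainer` uses; whether the repository's script stores them under exactly those names on the Hub is an assumption. The function name is hypothetical.

```python
import re
from typing import Iterable, Optional

# Assumed naming scheme: "checkpoint-<global step>"
_CHECKPOINT = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(names: Iterable[str]) -> Optional[str]:
    """Return the checkpoint name with the highest step, or None if there is none."""
    best_step, best_name = -1, None
    for name in names:
        match = _CHECKPOINT.match(name)
        if match and int(match.group(1)) > best_step:
            best_step, best_name = int(match.group(1)), name
    return best_name


print(latest_checkpoint(["checkpoint-50", "checkpoint-100", "README.md"]))
```

The comparison is numeric on purpose: sorting the names as strings would rank `checkpoint-50` above `checkpoint-100`.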
The retry walkthrough:
- The workflow starts a training job with a given attempt number (default: 1)
- Checkpoints get pushed to Hugging Face Hub on a schedule
- If the job completes, the workflow ends
- If the job fails on a spot interruption:
  - The check-runner-interruption action confirms a spot preemption was the cause
  - The workflow calculates the next attempt number
  - Within max_attempts, it triggers a new run with the incremented attempt
  - Original parameters carry over to the new attempt
- The new attempt downloads the latest checkpoint and resumes from there
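The retry decision in the walkthrough reduces to a small rule. The sketch below expresses it in Python for clarity; in the real workflow this logic lives in workflow steps, and the function name and signature here are hypothetical.

```python
from typing import Optional


def next_attempt(attempt: int, max_attempts: int, interrupted: bool) -> Optional[int]:
    """Return the attempt number to dispatch next, or None if no retry should happen."""
    if not interrupted:
        # The job either finished or failed for a reason a retry would not fix
        return None
    if attempt >= max_attempts:
        # The retry budget is used up
        return None
    return attempt + 1


# The workflow passes `attempt` as a string input, so convert it before use
print(next_attempt(int("1"), 5, interrupted=True))
```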
Even if a spot instance is reclaimed, your training picks up where it left off on the next instance.
machine.dev runner config
The default runner used here:
- T4 GPU: 16 GB VRAM, suits Unsloth-optimised training of small/medium LLMs
- Spot tenancy: cheap but interruptible (paired with the retry pattern above)
- Configurable CPU, RAM, and architecture
You can pin regions:
runs-on: machine/gpu=t4/cpu=4/ram=16/architecture=x64/tenancy=spot/regions=us-east-1,us-east-2
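The runner label is a slash-separated list of `key=value` pairs after the `machine` prefix. If you generate workflows programmatically, a small helper like the hypothetical one below can assemble it. The key order copies the examples on this page; whether the order matters is not stated here.

```python
from typing import Optional, Sequence


def runner_label(
    gpu: str,
    cpu: int,
    ram: int,
    architecture: str = "x64",
    tenancy: Optional[str] = None,
    regions: Sequence[str] = (),
) -> str:
    """Assemble a runs-on label in the format shown in the examples above."""
    parts = [
        "machine",
        f"gpu={gpu}",
        f"cpu={cpu}",
        f"ram={ram}",
        f"architecture={architecture}",
    ]
    if tenancy:
        parts.append(f"tenancy={tenancy}")
    if regions:
        parts.append("regions=" + ",".join(regions))
    return "/".join(parts)


print(runner_label("t4", 4, 16, tenancy="spot", regions=["us-east-1", "us-east-2"]))
```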
What GRPO actually is
Group Relative Policy Optimization is a reinforcement-learning algorithm aimed at reasoning. It first appeared in the DeepSeekMath paper and was later used to train DeepSeek-R1. Compared to PPO it has three differences worth knowing:
- No value function. GRPO drops the separate value-function (critic) model that PPO trains alongside the policy, cutting memory use.
- Group-based advantage. Instead of a value function, GRPO samples multiple outputs per prompt and uses the group’s mean reward as the baseline. That maps more cleanly to how reward models score multiple outputs for one input.
- Direct KL divergence. The KL penalty against the reference model is added to the loss function rather than folded into the reward, which keeps the advantage calculation simple while still holding the model close to its starting behaviour.
On math reasoning datasets like GSM8K, GRPO nudges models to show their working in structured steps, which lifts both accuracy and explainability.
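Structured output is usually encouraged with rule-based reward functions. The sketch below is one possible pair: a small reward for following a template and a larger one for the right answer. The `<reasoning>`/`<answer>` tag names and the reward values are assumptions for illustration, not taken from this page or its training script.

```python
import re

# Hypothetical output template: reasoning first, then the final answer
_TEMPLATE = re.compile(
    r"^\s*<reasoning>.+?</reasoning>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def format_reward(completion: str) -> float:
    """Give a small reward when the completion follows the template."""
    return 0.5 if _TEMPLATE.match(completion) else 0.0


def correctness_reward(completion: str, expected: str) -> float:
    """Give a larger reward when the tagged answer equals the reference answer."""
    match = _TEMPLATE.match(completion)
    if match is None:
        return 0.0
    return 2.0 if match.group(1).strip() == expected.strip() else 0.0


sample = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>4</answer>"
print(format_reward(sample), correctness_reward(sample, "4"))
```

Keeping the format reward smaller than the correctness reward is a design choice: it rewards showing the working without letting a well-formatted wrong answer outscore a correct one.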
A typical training loop:
- Sample multiple outputs per prompt
- Score each generation with reward functions (rule-based or outcome-based)
- Compute advantages relative to the group mean
- Update the policy while staying close to the original model
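The advantage step in this loop can be sketched in a few lines. Each reward in a group is centred on the group mean and scaled by the group standard deviation, as in the DeepSeekMath formulation. The small `eps` term is an assumption added here to avoid dividing by zero when every output in a group scores the same.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalise the rewards of one prompt's sampled outputs against their group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one prompt, scored by the reward functions
advantages = group_advantages([2.5, 0.5, 0.0, 0.5])
print(advantages)
```

Outputs that beat the group average get a positive advantage and are reinforced; those below it get a negative one. A group in which every output scores the same yields all-zero advantages and so contributes no learning signal.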
Getting started
- Use MachineDotDev/grpo-fine-tune as a template
- Create a Hugging Face access token with write permissions
- Add it as a repository secret named HF_TOKEN
- Open the Actions tab in your repository
- Pick the “Training with Retry” workflow
- Click “Run workflow” and set parameters:
  - Sequence length, LoRA rank, training steps
  - GPU memory utilisation and learning rate
  - Hugging Face target repository
- Run it and wait
- The fine-tuned model lands on Hugging Face Hub
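The workflow can also be started without the web UI, through GitHub's REST endpoint for workflow dispatch events. The sketch below only builds the request. The owner, repository, workflow file name, branch, and token values are placeholders you would replace, and input values are passed as strings here as an assumption.

```python
import json
import urllib.request
from typing import Mapping


def build_dispatch_request(
    owner: str,
    repo: str,
    workflow_file: str,
    ref: str,
    inputs: Mapping[str, str],
    token: str,
) -> urllib.request.Request:
    """Build a POST request that creates a workflow_dispatch event."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref, "inputs": dict(inputs)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values: the workflow file name and branch are hypothetical
request = build_dispatch_request(
    owner="your-user",
    repo="grpo-fine-tune",
    workflow_file="training-with-retry.yml",
    ref="main",
    inputs={"hf_repo": "your-user/qwen2.5-3b-grpo", "max_steps": "250"},
    token="YOUR_GITHUB_TOKEN",
)
# To send it: urllib.request.urlopen(request)
```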
Tips
- The default parameters are tuned for GSM8K and are a fine starting point
- Lower the batch size if you hit OOM
- Use the retry-enabled workflow for any run longer than ~20 minutes
- Watch the workflow logs to track loss and reward
- Evaluate on math word problems to see whether reasoning actually improved
How to adapt this
- Larger model: swap gpu=t4 for gpu=l4 or gpu=l40s. See CPU vs GPU.
- Different base model: change the model name in the training script
- Different reasoning task: swap GSM8K for MATH, Big-Bench, or your own dataset
Next steps
- Working repo: fork or use as a template
- Cost Optimization: checkpointing and retry pattern
- LLM Supervised Fine-Tuning: the same checkpointing approach with SFT instead of RL