At a glance
Workload: Benchmark two LLMs head-to-head across HellaSwag, ARC, MathQA, TruthfulQA, DROP, GSM8K, and MMLU using lm-evaluation-harness.
Runner: gpu=l40s, cpu=4, ram=32, tenancy=spot ($0.016/min).
Estimated cost: ~$1 per benchmark run (~1 hour, 100 examples per task).
This page shows how to run reproducible head-to-head LLM benchmarks on machine.dev GPU runners, with comparison charts you can drop into a PR review.
Why benchmark
Reasons you might want to run head-to-head LLM evals:
- Compare model performance on tasks you actually care about
- Pick the right model for your use case
- Check whether a fine-tune is genuinely better than the baseline it started from
- See where each model is strong or weak across reasoning task types
How it works
The Language Model Arena uses the lm-evaluation-harness framework to benchmark models on a fixed task set. It runs as a GitHub Actions workflow you trigger on demand with input parameters.
The job:
- Loads two models (Hugging Face IDs or local paths)
- Runs them through the same evaluation tasks
- Generates comparison charts showing relative performance
- Stores results as GitHub workflow artifacts
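The workflow's evaluation steps are elided on this page, so the following is only a sketch of how the two runs could be launched with the lm-evaluation-harness `lm_eval` command line. The output directory names (`benchmarks/model_1`, `benchmarks/model_2`), the shortened task list, and the choice of `--batch_size auto` are assumptions for illustration, not taken from the repository.

```python
import shlex


def build_lm_eval_command(model_id, revision, tasks, limit, output_dir):
    """Build an lm_eval CLI invocation for one Hugging Face model."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id},revision={revision}",
        "--tasks", tasks,
        "--limit", str(limit),      # cap the number of examples per task
        "--batch_size", "auto",     # assumption: let the harness pick a batch size
        "--output_path", output_dir,
    ]


# Shortened task list for readability; the workflow default is longer.
TASKS = "hellaswag,arc_easy,gsm8k"

# Output directories are hypothetical names chosen for this sketch.
commands = [
    build_lm_eval_command("Qwen/Qwen2.5-3B-Instruct", "main", TASKS, 100, "benchmarks/model_1"),
    build_lm_eval_command("unsloth/Llama-3.1-8B-Instruct", "main", TASKS, 100, "benchmarks/model_2"),
]

for cmd in commands:
    # Print instead of executing; pass cmd to subprocess.run(cmd, check=True) on a GPU runner.
    print(shlex.join(cmd))
```

Both models get the same task list and example limit, which is what makes the comparison head-to-head.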
Workflow
name: LM Eval Benchmarking
on:
  workflow_dispatch:
    inputs:
      model_1:
        type: string
        required: false
        description: 'The first model to benchmark'
        default: 'Qwen/Qwen2.5-3B-Instruct'
      model_1_revision:
        type: string
        required: false
        description: 'The first model revision to benchmark'
        default: 'main'
      model_2:
        type: string
        required: false
        description: 'The second model to benchmark'
        default: 'unsloth/Llama-3.1-8B-Instruct'
      model_2_revision:
        type: string
        required: false
        description: 'The second model revision to benchmark'
        default: 'main'
      tasks:
        type: string
        required: false
        description: 'The tasks to benchmark'
        default: 'hellaswag,arc_easy,mathqa,truthfulqa,drop,arc_challenge,gsm8k,mmlu_abstract_algebra,mmlu_college_mathematics'
      examples_limit:
        type: string
        required: false
        description: 'The number of examples to use for benchmarking'
        default: '100'
jobs:
  benchmark:
    name: LLM Eval Benchmarking
    runs-on: machine/gpu=l40s/cpu=4/ram=32/architecture=x64/tenancy=spot
    steps:
      # Workflow steps for running the benchmark
      # ...
      - name: Generate Benchmark Comparison Chart
        run: |
          ls -l ./benchmarks/
          python ./llm_benchmark_plotting.py
      - name: Upload Benchmark Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: benchmarks/
          retention-days: 90
Tasks evaluated
The default task list covers a range of reasoning skills:
- hellaswag: common-sense reasoning about events
- arc_easy and arc_challenge: multiple-choice science questions
- mathqa: mathematical reasoning and problem solving
- truthfulqa: truthfulness in model responses
- drop: reading comprehension with numerical reasoning
- gsm8k: grade-school math word problems
- mmlu_abstract_algebra and mmlu_college_mathematics: advanced mathematics
Together they cover common-sense, scientific, mathematical, and reading-comprehension reasoning, plus truthfulness.
Runner config
The default runner used here:
- L40S GPU: 48 GB VRAM, fits 8B models comfortably
- Spot tenancy: billed at $0.016/min for this runner, and a run of about an hour is short enough that interruption is rare
- Configurable CPU, RAM, and architecture
You can pin regions:
runs-on: machine/gpu=l40s/cpu=4/ram=32/architecture=x64/tenancy=spot/regions=us-east-1,us-east-2
machine.dev picks the cheapest spot price within the listed regions.
What you get back
After the run finishes, the workflow produces:
- JSON files with per-task metrics
- Comparison charts as PNG
- Everything stored as GitHub artifacts for 90 days
The plotting code, simplified:
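import glob
import json
import os

import matplotlib.pyplot as plt

# model_results, tasks, metrics, and current_dir are set up earlier in the
# script and are omitted from this simplified excerpt.

# Extract metrics from JSON files
for model, dir_path in model_results.items():
    result_files = glob.glob(os.path.join(dir_path, "results_*.json"))
    if result_files:
        # Use the most recently created results file for each model
        latest_file = max(result_files, key=os.path.getctime)
        with open(latest_file) as f:
            data = json.load(f)
        for task, task_metrics in data['results'].items():
            tasks.add(task)
            metrics[model][task] = task_metrics

# Generate comparison charts for each task
for task in sorted(tasks):
    plt.figure(figsize=(12, 7))
    plt.title(f'{task} Comparison: Model 1 vs Model 2')
    # ... Chart generation code ...
    output_path = current_dir / f'benchmarks/{task}_comparison.png'
    plt.savefig(output_path)
    plt.close()  # release the figure so a long task list does not accumulate open figures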
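If you want numbers alongside the charts, the per-task metrics can also be reduced to a table of differences. The sketch below assumes each task entry in the results JSON is a flat mapping of metric names to values, with standard-error fields containing `stderr` in their name and non-numeric labels such as `alias`; check this against your own results files. The sample values are placeholders for illustration, not measured results.

```python
def numeric_metrics(task_metrics):
    """Keep numeric metrics only, skipping stderr fields and labels such as 'alias'."""
    return {
        name: value
        for name, value in task_metrics.items()
        if isinstance(value, (int, float))
        and not isinstance(value, bool)
        and "stderr" not in name
    }


def compare_results(results_1, results_2):
    """Return (task, metric, value_1, value_2, delta) rows for metrics both models report."""
    rows = []
    for task in sorted(set(results_1) & set(results_2)):
        m1 = numeric_metrics(results_1[task])
        m2 = numeric_metrics(results_2[task])
        for name in sorted(set(m1) & set(m2)):
            rows.append((task, name, m1[name], m2[name], m2[name] - m1[name]))
    return rows


# Placeholder values for illustration only, not measured results.
model_1 = {"arc_easy": {"alias": "arc_easy", "acc,none": 0.70, "acc_stderr,none": 0.046}}
model_2 = {
    "arc_easy": {"alias": "arc_easy", "acc,none": 0.75, "acc_stderr,none": 0.043},
    "gsm8k": {"exact_match,strict-match": 0.50},  # only model 2 ran this task, so it is skipped
}

rows = compare_results(model_1, model_2)
for task, name, v1, v2, delta in rows:
    print(f"{task:10s} {name:12s} {v1:.3f} {v2:.3f} {delta:+.3f}")
```

Restricting the table to tasks and metrics that both models report avoids comparing a score against a missing value.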
Getting started
- Fork MachineDotDev/language-model-arena
- Open the Actions tab in your repository
- Pick the “LM Eval Benchmarking” workflow
- Click “Run workflow” and set parameters:
  - The two models to compare
  - Which tasks to benchmark
  - Number of examples per task
- Run it and wait
- Download the benchmark artifacts to view the charts
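As an alternative to clicking through the Actions tab, a `workflow_dispatch` workflow can also be triggered through GitHub's REST API. The sketch below only builds the request and does not send it. The owner `your-github-user`, the workflow file name `lm-eval-benchmark.yml`, and the token placeholder are hypothetical; substitute the values from your own fork.

```python
import json
import urllib.request


def build_dispatch_request(owner, repo, workflow_file, token, inputs, ref="main"):
    """Build a POST request that triggers a workflow_dispatch run."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_dispatch_request(
    owner="your-github-user",               # hypothetical: the account that owns your fork
    repo="language-model-arena",
    workflow_file="lm-eval-benchmark.yml",  # hypothetical: use the file name in .github/workflows/
    token="YOUR_GITHUB_TOKEN",              # placeholder: a token allowed to run workflows
    inputs={
        "model_1": "Qwen/Qwen2.5-3B-Instruct",
        "model_2": "unsloth/Llama-3.1-8B-Instruct",
        "tasks": "hellaswag,arc_easy,gsm8k",
        "examples_limit": "100",            # workflow inputs are sent as strings
    },
)

print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Inputs you leave out fall back to the defaults declared in the workflow file.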
Tips
- More examples means more reliable numbers but a longer run. 100 is a useful default.
- Pick tasks that match your downstream use case
- Compare models at similar parameter counts for fair head-to-heads
- When comparing fine-tune vs baseline, keep every other parameter constant
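To put a rough number on the first tip: for an accuracy-style metric, the binomial standard error is sqrt(p(1 - p) / n). The sketch below assumes examples are scored independently as right or wrong, which is a simplification and does not apply to every metric the harness reports.

```python
import math


def accuracy_stderr(accuracy, n_examples):
    """Binomial standard error of an accuracy estimated from n_examples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# Worst case is accuracy = 0.5; the error halves each time the sample size quadruples.
for n in (100, 400, 1600):
    se = accuracy_stderr(0.5, n)
    print(f"n={n:5d}  stderr={se:.4f}  approx 95% interval = +/-{1.96 * se:.4f}")
```

At 100 examples and an accuracy near 0.5, the standard error is 0.05, so a gap of a few percentage points between two models is within noise. Raise examples_limit before drawing conclusions from small differences.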
How to adapt this
- Smaller cheaper run: swap gpu=l40s for gpu=l4 ($0.006/min) if both models fit in 24 GB VRAM
- Different tasks: edit the tasks input to pick any lm-eval-harness task
- More than 2 models: turn this into a matrix across model names
- Run on PRs: trigger on pull_request to compare a candidate fine-tune against its baseline before merging
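The cost of a run is the per-minute rate multiplied by the runtime. The sketch below uses the two spot rates quoted on this page and assumes a 60-minute run on both GPUs. An L4 will likely take longer than an L40S for the same workload, so treat the second figure as a lower bound rather than a prediction.

```python
# Spot rates quoted on this page, in USD per minute.
RATES_PER_MINUTE = {
    "l40s": 0.016,
    "l4": 0.006,
}


def run_cost(gpu, minutes):
    """Estimated cost in USD for a run of the given length on the given GPU."""
    return RATES_PER_MINUTE[gpu] * minutes


for gpu in RATES_PER_MINUTE:
    # Assumption: the same 60-minute runtime on both GPUs.
    print(f"{gpu}: ${run_cost(gpu, 60):.2f} for a 60-minute run")
```

For the L40S this gives $0.96, which matches the estimate of about $1 per run.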
Next steps
- Working repo: fork or use as a template
- CPU vs GPU: picking GPU size for evaluation workloads
- LLM Supervised Fine-Tuning: train a model, then benchmark it here