Language Model Arena — machine.dev docs


At a glance

Workload: Benchmark two LLMs head-to-head across HellaSwag, ARC, MathQA, TruthfulQA, DROP, GSM8K, and MMLU using lm-evaluation-harness.
Runner: gpu=l40s, cpu=4, ram=32, tenancy=spot ($0.016/min).
Estimated cost: ~$1 per benchmark run (~1 hour, 100 examples per task).
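
The cost estimate follows from the per-minute rate. A minimal sketch of the arithmetic, assuming the run takes about 60 minutes and that billing is simply rate times minutes (the function name is illustrative, not part of machine.dev):

```python
def estimate_run_cost(rate_per_min: float, minutes: float) -> float:
    """Return the estimated cost in dollars for one run."""
    return round(rate_per_min * minutes, 2)

# Spot rate for the L40S runner quoted above, for a run of roughly one hour.
l40s_spot_cost = estimate_run_cost(0.016, 60)
print(f"${l40s_spot_cost:.2f}")
```

Actual run time depends on the models, the task list, and the example limit, so treat the result as a rough budget figure.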

This page shows how to run reproducible head-to-head LLM benchmarks on machine.dev GPU runners, with comparison charts you can drop into a PR review.

Why benchmark

Reasons you might want to run head-to-head LLM evals:

Compare model performance on tasks you actually care about
Pick the right model for your use case
Check whether a fine-tune is genuinely better than the baseline it started from
See where each model is strong or weak across reasoning task types

How it works

The Language Model Arena uses the lm-evaluation-harness framework to benchmark models on a fixed task set. It runs as a GitHub Actions workflow you trigger on demand with input parameters.

The job:

Loads two models (Hugging Face IDs or local paths)
Runs them through the same evaluation tasks
Generates comparison charts showing relative performance
Stores results as GitHub workflow artifacts
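
The workflow steps that run the evaluation are elided on this page, so the following is only a sketch of what one evaluation call per model might look like. It assumes the harness's `lm_eval` command-line entry point with the Hugging Face (`hf`) backend; the helper name `build_lm_eval_command` and the output directory names are hypothetical.

```python
from typing import List

def build_lm_eval_command(model: str, revision: str, tasks: str,
                          limit: int, output_dir: str) -> List[str]:
    """Build an lm_eval CLI invocation for one model (hypothetical helper)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model},revision={revision}",
        "--tasks", tasks,
        "--limit", str(limit),
        "--output_path", output_dir,
    ]

# Both models get the same task list and example limit so results are comparable.
tasks = "hellaswag,arc_easy,gsm8k"
commands = [
    build_lm_eval_command("Qwen/Qwen2.5-3B-Instruct", "main", tasks, 100,
                          "benchmarks/model_1"),  # assumed output directory
    build_lm_eval_command("unsloth/Llama-3.1-8B-Instruct", "main", tasks, 100,
                          "benchmarks/model_2"),  # assumed output directory
]
for cmd in commands:
    print(" ".join(cmd))
    # To execute on a machine with lm_eval installed:
    # subprocess.run(cmd, check=True)
```

Keeping the task list and limit identical across both calls is what makes the comparison head-to-head.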

Workflow

name: LM Eval Benchmarking

on:
  workflow_dispatch:
    inputs:
      model_1:
        type: string
        required: false
        description: 'The first model to benchmark'
        default: 'Qwen/Qwen2.5-3B-Instruct'
      model_1_revision:
        type: string
        required: false
        description: 'The first model revision to benchmark'
        default: 'main'
      model_2:
        type: string
        required: false
        description: 'The second model to benchmark'
        default: 'unsloth/Llama-3.1-8B-Instruct'
      model_2_revision:
        type: string
        required: false
        description: 'The second model revision to benchmark'
        default: 'main'
      tasks:
        type: string
        required: false
        description: 'The tasks to benchmark'
        default: 'hellaswag,arc_easy,mathqa,truthfulqa,drop,arc_challenge,gsm8k,mmlu_abstract_algebra,mmlu_college_mathematics'
      examples_limit:
        type: string
        required: false
        description: 'The number of examples to use for benchmarking'
        default: '100'

jobs:
  benchmark:
    name: LLM Eval Benchmarking
    runs-on: machine/gpu=l40s/cpu=4/ram=32/architecture=x64/tenancy=spot
    
    steps:
      # Workflow steps for running the benchmark
      # ...
      
      - name: Generate Benchmark Comparison Chart
        run: |
          ls -l ./benchmarks/
          python ./llm_benchmark_plotting.py
      
      - name: Upload Benchmark Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: benchmarks/
          retention-days: 90

Tasks evaluated

The default task list covers a range of reasoning skills:

hellaswag: common-sense reasoning about events
arc_easy and arc_challenge: multiple-choice science questions
mathqa: mathematical reasoning and problem solving
truthfulqa: truthfulness in model responses
drop: reading comprehension with numerical reasoning
gsm8k: grade-school math word problems
mmlu_abstract_algebra and mmlu_college_mathematics: advanced mathematics

Together they cover common-sense, scientific, mathematical, and comprehension reasoning.

Runner config

The default runner used here:

L40S GPU: 48 GB VRAM, fits 8B models comfortably
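Spot tenancy: billed at $0.016/min for this configuration; a run of about an hour is short enough that interruption is rare
CPU, RAM, and architecture: set in the same runs-on label (here cpu=4, ram=32, architecture=x64)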
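
A back-of-the-envelope check on the VRAM claim. This sketch assumes weights are loaded in 16-bit precision (2 bytes per parameter) and counts weights only; activations, KV cache, and framework overhead add to the real footprint. The nominal sizes (3B, 8B) are read off the default model names.

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * 1e9 * bytes_per_param / 1e9

L40S_VRAM_GB = 48
for name, size_b in [("3B model", 3), ("8B model", 8)]:
    need = weights_gb(size_b)
    print(f"{name}: ~{need:.0f} GB of weights, "
          f"~{L40S_VRAM_GB - need:.0f} GB left on an L40S")
```

Under these assumptions an 8B model needs about 16 GB for weights, which leaves ample headroom on a 48 GB card.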

You can restrict the runner to specific regions by appending regions= to the label:

runs-on: machine/gpu=l40s/cpu=4/ram=32/architecture=x64/tenancy=spot/regions=us-east-1,us-east-2

machine.dev picks the cheapest spot price within the listed regions.
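
The label is a slash-separated list of key=value pairs after the `machine/` prefix. If you generate workflows programmatically, a small helper along these lines can compose it. This is an illustrative sketch, not a machine.dev API, and it assumes the key order shown on this page.

```python
def runner_label(regions=None, **options) -> str:
    """Compose a runs-on label from key=value options (illustrative helper)."""
    parts = ["machine"] + [f"{key}={value}" for key, value in options.items()]
    if regions:
        # Regions are comma-separated within a single regions= segment.
        parts.append("regions=" + ",".join(regions))
    return "/".join(parts)

label = runner_label(gpu="l40s", cpu=4, ram=32, architecture="x64",
                     tenancy="spot", regions=["us-east-1", "us-east-2"])
print(label)
```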

What you get back

After the run finishes, the workflow produces:

JSON files with per-task metrics
Comparison charts as PNG
Everything stored as GitHub artifacts for 90 days

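The plotting code, simplified. The model_results paths are placeholders for wherever the workflow writes each model's results:

import glob
import json
import os
from collections import defaultdict
from pathlib import Path

import matplotlib.pyplot as plt

current_dir = Path(__file__).resolve().parent
# Placeholder paths (assumption): one results directory per model
model_results = {
    "Model 1": current_dir / "benchmarks" / "model_1",
    "Model 2": current_dir / "benchmarks" / "model_2",
}
tasks = set()
metrics = defaultdict(dict)

# Extract metrics from the most recently written results file per model
for model, dir_path in model_results.items():
    result_files = glob.glob(os.path.join(dir_path, "results_*.json"))
    if result_files:
        latest_file = max(result_files, key=os.path.getmtime)
        with open(latest_file) as f:
            data = json.load(f)
        for task, task_metrics in data["results"].items():
            tasks.add(task)
            metrics[model][task] = task_metrics

# Generate one comparison chart per task
for task in sorted(tasks):
    plt.figure(figsize=(12, 7))
    plt.title(f"{task} Comparison: Model 1 vs Model 2")
    # ... Chart generation code ...
    output_path = current_dir / f"benchmarks/{task}_comparison.png"
    plt.savefig(output_path)
    plt.close()  # release the figure so memory does not grow with the task count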
Getting started

Fork MachineDotDev/language-model-arena
Open the Actions tab in your repository
Pick the “LM Eval Benchmarking” workflow
Click “Run workflow” and set parameters:
- The two models to compare
- Which tasks to benchmark
- Number of examples per task
Run it and wait for the job to finish (about an hour with the default settings)
Download the benchmark artifacts to view the charts

Tips

More examples give more reliable numbers but a longer run. 100 per task is a useful default.
Pick tasks that match your downstream use case
Compare models at similar parameter counts for fair head-to-heads
When comparing fine-tune vs baseline, keep every other parameter constant
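
To put a number on "more reliable": for an accuracy metric, sampling uncertainty shrinks with the square root of the example count. The sketch below uses the binomial standard error sqrt(p(1-p)/n), assuming independent examples. The accuracy of 0.5 is an illustrative worst case, not a measured result.

```python
import math

def accuracy_std_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)

for n in (100, 400, 1600):
    se = accuracy_std_error(0.5, n)
    print(f"n={n}: standard error ~{se:.4f}, 95% interval ~+/-{1.96 * se:.4f}")
```

At 100 examples the 95% interval is roughly plus or minus 10 percentage points in the worst case, so small gaps between two models on a single task may be noise. Quadrupling the example count halves the interval.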

How to adapt this

Smaller, cheaper run: swap gpu=l40s for gpu=l4 ($0.006/min) if both models fit in 24 GB VRAM
Different tasks: edit the tasks input to use any lm-evaluation-harness task
More than 2 models: turn this into a matrix across model names
Run on PRs: trigger on pull_request to compare a candidate fine-tune against its baseline before merging

Next steps

Working repo (MachineDotDev/language-model-arena): fork it or use it as a template
CPU vs GPU: picking GPU size for evaluation workloads
LLM Supervised Fine-Tuning: train a model, then benchmark it here