04.0 // Documentation v1.8.4 Last updated 2026-06-05

Language Model Arena

Benchmark and compare LLMs using the lm-evaluation-harness on machine.dev GPU runners. Run standardized evaluations on models from Hugging Face.

At a glance

  • Workload: benchmark two LLMs head-to-head across HellaSwag, ARC, MathQA, TruthfulQA, DROP, GSM8K, and MMLU using lm-evaluation-harness
  • Runner: gpu=l40s, cpu=4, ram=32, tenancy=spot ($0.016/min)
  • Estimated cost: ~$1 per benchmark run (~1 hour, 100 examples per task)
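
The cost estimate follows from the per-minute rate and the run length. A quick check in Python, where `estimate_cost` is a hypothetical helper and the rate is the one quoted on this page (actual spot prices may differ):

```python
def estimate_cost(rate_per_min: float, minutes: float) -> float:
    """Return the approximate cost in dollars of a run billed per minute."""
    return rate_per_min * minutes


# L40S spot rate quoted on this page, for a run of roughly one hour
l40s_cost = estimate_cost(0.016, 60)
print(f"L40S, 60 min: ${l40s_cost:.2f}")
```

Longer runs, for example with a higher example limit, scale the cost linearly under this model.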

This page shows how to run reproducible head-to-head LLM benchmarks on machine.dev GPU runners, with comparison charts you can drop into a PR review.

Why benchmark

Reasons you might want to run head-to-head LLM evals:

  • Compare model performance on tasks you actually care about
  • Pick the right model for your use case
  • Check whether a fine-tune is genuinely better than the baseline it started from
  • See where each model is strong or weak across reasoning task types

How it works

The Language Model Arena uses the lm-evaluation-harness framework to benchmark models on a fixed task set. It runs as a GitHub Actions workflow you trigger on demand with input parameters.

The job:

  1. Loads two models (Hugging Face IDs or local paths)
  2. Runs them through the same evaluation tasks
  3. Generates comparison charts showing relative performance
  4. Stores results as GitHub workflow artifacts
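
The evaluation steps themselves are not shown on this page, so the following is only a sketch of how one model's evaluation could be invoked. It assumes the harness's `lm_eval` command-line entry point with its `hf` model backend; `build_lm_eval_command` and the `benchmarks/model_1` output directory are hypothetical names, not taken from the repository.

```python
import shlex


def build_lm_eval_command(model, revision, tasks, limit, output_dir):
    """Build an lm_eval CLI invocation for one model (illustrative sketch)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model},revision={revision}",
        "--tasks", tasks,        # comma-separated task names
        "--limit", str(limit),   # number of examples per task
        "--output_path", output_dir,
    ]


cmd = build_lm_eval_command(
    model="Qwen/Qwen2.5-3B-Instruct",
    revision="main",
    tasks="hellaswag,arc_easy,gsm8k",
    limit=100,
    output_dir="benchmarks/model_1",  # hypothetical output location
)
print(shlex.join(cmd))
```

Running the same command once per model, with identical `tasks` and `limit` values, keeps the comparison like-for-like. The list can be passed to `subprocess.run(cmd, check=True)`.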

Workflow

name: LM Eval Benchmarking

on:
  workflow_dispatch:
    inputs:
      model_1:
        type: string
        required: false
        description: 'The first model to benchmark'
        default: 'Qwen/Qwen2.5-3B-Instruct'
      model_1_revision:
        type: string
        required: false
        description: 'The first model revision to benchmark'
        default: 'main'
      model_2:
        type: string
        required: false
        description: 'The second model to benchmark'
        default: 'unsloth/Llama-3.1-8B-Instruct'
      model_2_revision:
        type: string
        required: false
        description: 'The second model revision to benchmark'
        default: 'main'
      tasks:
        type: string
        required: false
        description: 'The tasks to benchmark'
        default: 'hellaswag,arc_easy,mathqa,truthfulqa,drop,arc_challenge,gsm8k,mmlu_abstract_algebra,mmlu_college_mathematics'
      examples_limit:
        type: string
        required: false
        description: 'The number of examples to use for benchmarking'
        default: '100'

jobs:
  benchmark:
    name: LLM Eval Benchmarking
    runs-on: machine/gpu=l40s/cpu=4/ram=32/architecture=x64/tenancy=spot
    
    steps:
      # Workflow steps for running the benchmark
      # ...
      
      - name: Generate Benchmark Comparison Chart
        run: |
          ls -l ./benchmarks/
          python ./llm_benchmark_plotting.py
      
      - name: Upload Benchmark Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: benchmarks/
          retention-days: 90

Tasks evaluated

The default task list covers a range of reasoning skills:

  • hellaswag: common-sense reasoning about events
  • arc_easy and arc_challenge: multiple-choice science questions
  • mathqa: mathematical reasoning and problem solving
  • truthfulqa: truthfulness in model responses
  • drop: reading comprehension with numerical reasoning
  • gsm8k: grade-school math word problems
  • mmlu_abstract_algebra and mmlu_college_mathematics: advanced mathematics

Together they cover common-sense, scientific, mathematical, and reading-comprehension reasoning, plus truthfulness.

Runner config

The default runner used here:

  • L40S GPU: 48 GB VRAM, fits 8B models comfortably
  • Spot tenancy: cheap, and the roughly one-hour workload is short enough that interruption is rare
  • Configurable CPU, RAM, and architecture

You can pin regions:

runs-on: machine/gpu=l40s/cpu=4/ram=32/architecture=x64/tenancy=spot/regions=us-east-1,us-east-2

machine.dev picks the cheapest spot price within the listed regions.

What you get back

After the run finishes, the workflow produces:

  1. JSON files with per-task metrics
  2. Comparison charts as PNG
  3. Everything stored as GitHub artifacts for 90 days

The plotting code, simplified:

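import glob
import json
import os
from collections import defaultdict

import matplotlib.pyplot as plt

# model_results (model name -> results directory) and current_dir (a pathlib.Path)
# are assumed to be defined elsewhere in the script.
tasks = set()
metrics = defaultdict(dict)

# Extract metrics from the most recent results file for each model
for model, dir_path in model_results.items():
    result_files = glob.glob(os.path.join(dir_path, "results_*.json"))
    if result_files:
        latest_file = max(result_files, key=os.path.getctime)
        with open(latest_file) as f:
            data = json.load(f)
        for task, task_metrics in data['results'].items():
            tasks.add(task)
            metrics[model][task] = task_metrics

# Generate one comparison chart per task
for task in sorted(tasks):
    plt.figure(figsize=(12, 7))
    plt.title(f'{task} Comparison: Model 1 vs Model 2')
    # ... Chart generation code ...
    output_path = current_dir / f'benchmarks/{task}_comparison.png'
    plt.savefig(output_path)
    plt.close()  # release the figure before drawing the next one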
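
The chart-drawing code is elided in the snippet above. As a plain-text alternative for a quick look at the numbers, the sketch below flattens a `metrics` mapping of the same shape into side-by-side rows. It assumes metric keys such as `acc,none` and `acc_stderr,none`, which is how recent lm-evaluation-harness versions name them. The sample values are made up for illustration, and `comparison_rows` is a hypothetical helper.

```python
def comparison_rows(metrics, model_a, model_b):
    """Return (task, metric, value_a, value_b) rows for metrics both models report."""
    rows = []
    shared_tasks = sorted(set(metrics[model_a]) & set(metrics[model_b]))
    for task in shared_tasks:
        a, b = metrics[model_a][task], metrics[model_b][task]
        for key in sorted(a):
            # Skip standard-error entries and metrics the other model lacks
            if "stderr" in key or key not in b:
                continue
            # Skip non-numeric fields such as task aliases
            if isinstance(a[key], (int, float)) and isinstance(b[key], (int, float)):
                rows.append((task, key, a[key], b[key]))
    return rows


# Made-up example values, not real benchmark results
sample = {
    "model_1": {"arc_easy": {"alias": "arc_easy", "acc,none": 0.71, "acc_stderr,none": 0.045}},
    "model_2": {"arc_easy": {"alias": "arc_easy", "acc,none": 0.78, "acc_stderr,none": 0.041}},
}
for task, key, a, b in comparison_rows(sample, "model_1", "model_2"):
    print(f"{task:<12} {key:<10} {a:.3f}  {b:.3f}")
```

Restricting the rows to tasks and metrics that both models report avoids comparing a score against a missing value.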

Getting started

  1. Fork MachineDotDev/language-model-arena
  2. Open the Actions tab in your repository
  3. Pick the “LM Eval Benchmarking” workflow
  4. Click “Run workflow” and set parameters:
    • The two models to compare
    • Which tasks to benchmark
    • Number of examples per task
  5. Run it and wait for the job to finish (about an hour with the default settings)
  6. Download the benchmark artifacts to view the charts

Tips

  • More examples means more reliable numbers but a longer run. 100 is a useful default.
  • Pick tasks that match your downstream use case
  • Compare models at similar parameter counts for fair head-to-heads
  • When comparing fine-tune vs baseline, keep every other parameter constant
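
To make the first tip concrete: if a task's score is treated as the mean of independent right/wrong outcomes, its standard error shrinks with the square root of the number of examples. The sketch below uses that binomial approximation, which is an assumption and ignores any correlation between questions. `accuracy_standard_error` is a hypothetical helper.

```python
import math


def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy measured on n_examples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


for n in (100, 400, 1600):
    se = accuracy_standard_error(0.5, n)
    # 1.96 standard errors gives an approximate 95% interval half-width
    print(f"n={n:>4}: SE={se:.3f}, 95% CI about +/-{1.96 * se:.3f}")
```

At 100 examples and an accuracy near 0.5, the standard error is about 0.05, so a gap of a few points between two models can be within noise. Quadrupling the example count halves the standard error.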

How to adapt this

  • Smaller cheaper run: swap gpu=l40s for gpu=l4 ($0.006/min) if both models fit in 24 GB VRAM
  • Different tasks: edit the tasks input to pick any lm-evaluation-harness task
  • More than 2 models: turn this into a matrix across model names
  • Run on PRs: trigger on pull_request to compare a candidate fine-tune against its baseline before merging
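
For the GPU swap, a rough first check is whether each model's weights fit in VRAM. The sketch below counts weights only, assuming 16-bit parameters (2 bytes each) and nominal parameter counts read off the default model names. Activations, the KV cache, and framework overhead need extra headroom on top. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Nominal parameter counts inferred from the default model names (approximate)
models = {
    "Qwen/Qwen2.5-3B-Instruct": 3e9,
    "unsloth/Llama-3.1-8B-Instruct": 8e9,
}
for name, n_params in models.items():
    print(f"{name}: ~{weight_memory_gb(n_params):.0f} GB of weights in 16-bit")
```

By this estimate an 8B model in 16-bit needs about 16 GB for weights alone, which leaves limited headroom on a 24 GB card, so it is worth testing with a small example limit before committing to the smaller GPU.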

Next steps