Documentation v1.3.1 · Last updated 2026-04-26

Language Model Arena

Benchmark and compare LLMs using the lm-evaluation-harness on machine.dev GPU runners. Run standardized evaluations on models from Hugging Face.

At a glance

Workload: Benchmark two LLMs head-to-head across HellaSwag, ARC, MathQA, TruthfulQA, DROP, GSM8K, and MMLU using lm-evaluation-harness
Runner: gpu=l40s, cpu=4, ram=32, tenancy=spot ($0.016/min)
Estimated cost: ~$1 per benchmark run (~1 hour, 100 examples per task)

This page shows how to run reproducible head-to-head LLM benchmarks on machine.dev GPU runners and produce comparison charts you can drop into a PR review.

Use Case Overview

Why might you want to evaluate language models?

  • Compare the performance of different models on specific tasks
  • Identify which model is best suited for your particular use case
  • Measure how well your fine-tuned models perform against baselines
  • Understand the strengths and weaknesses of various models across different reasoning tasks

How It Works

The Language Model Arena uses the lm-evaluation-harness framework to benchmark models on a set of standardized tasks. The workflow is defined in a GitHub Actions workflow file and triggered on-demand with configurable parameters.

The benchmarking process:

  1. Loads two specified models (from Hugging Face or local paths)
  2. Runs them through the same set of evaluation tasks
  3. Generates comparison charts to visualize their relative performance
  4. Stores the results as GitHub workflow artifacts
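
Under the hood, each evaluation in steps 1–2 amounts to a single lm-eval-harness CLI call per model. The sketch below assembles that command from the workflow inputs; the helper name and output directory are illustrative assumptions, but the `lm_eval` flags shown (`--model`, `--model_args`, `--tasks`, `--limit`, `--output_path`) are the harness's standard ones:

```python
# Sketch: build the lm_eval CLI command for one model from the
# workflow_dispatch inputs. The output directory is illustrative.
import shlex

def build_lm_eval_cmd(model, revision, tasks, limit, output_dir):
    """Assemble an lm-evaluation-harness invocation for one model."""
    args = [
        "lm_eval",
        "--model", "hf",                   # Hugging Face transformers backend
        "--model_args", f"pretrained={model},revision={revision}",
        "--tasks", tasks,                  # comma-separated task list
        "--limit", limit,                  # examples per task
        "--output_path", output_dir,       # JSON results land here
    ]
    return shlex.join(args)

cmd = build_lm_eval_cmd(
    model="Qwen/Qwen2.5-3B-Instruct",
    revision="main",
    tasks="hellaswag,arc_easy",
    limit="100",
    output_dir="./benchmarks/model_1",
)
print(cmd)
```

The workflow runs this once per model with identical `--tasks` and `--limit` values, which is what makes the comparison fair.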

Workflow Implementation

The Language Model Arena is implemented as a GitHub Actions workflow that can be triggered manually. Here’s the workflow definition:

name: LM Eval Benchmarking

on:
  workflow_dispatch:
    inputs:
      model_1:
        type: string
        required: false
        description: 'The first model to benchmark'
        default: 'Qwen/Qwen2.5-3B-Instruct'
      model_1_revision:
        type: string
        required: false
        description: 'The first model revision to benchmark'
        default: 'main'
      model_2:
        type: string
        required: false
        description: 'The second model to benchmark'
        default: 'unsloth/Llama-3.1-8B-Instruct'
      model_2_revision:
        type: string
        required: false
        description: 'The second model revision to benchmark'
        default: 'main'
      tasks:
        type: string
        required: false
        description: 'The tasks to benchmark'
        default: 'hellaswag,arc_easy,mathqa,truthfulqa,drop,arc_challenge,gsm8k,mmlu_abstract_algebra,mmlu_college_mathematics'
      examples_limit:
        type: string
        required: false
        description: 'The number of examples to use for benchmarking'
        default: '100'

jobs:
  benchmark:
    name: LLM Eval Benchmarking
    runs-on:
      - machine
      - gpu=l40s
      - cpu=4
      - ram=32
      - architecture=x64
      - tenancy=spot
    
    steps:
      # Workflow steps for running the benchmark
      # ...
      
      - name: Generate Benchmark Comparison Chart
        run: |
          ls -l ./benchmarks/
          python ./llm_benchmark_plotting.py
      
      - name: Upload Benchmark Artifacts
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: benchmarks/
          retention-days: 90

Evaluation Tasks

The Language Model Arena evaluates models on a variety of reasoning tasks, including:

  • hellaswag: Common sense reasoning about events
  • arc_easy & arc_challenge: Multiple-choice science questions
  • mathqa: Mathematical reasoning and problem-solving
  • truthfulqa: Measuring truthfulness in model responses
  • drop: Reading comprehension with numerical reasoning
  • gsm8k: Grade school math word problems
  • mmlu_abstract_algebra & mmlu_college_mathematics: Advanced mathematics knowledge

These tasks are designed to test different aspects of a model’s reasoning capabilities, from simple common sense to complex mathematical problem-solving.

Using machine.dev GPU Runners

This benchmark leverages machine.dev GPU runners to provide the necessary computing power for efficient evaluation. The workflow is configured to use:

  • L40S GPU: A powerful GPU with 48GB of VRAM for handling larger models
  • Spot instance: To optimize for cost while maintaining performance
  • Configurable resources: CPU, RAM, and architecture specifications

You can also specify regions for the benchmark to run in:

runs-on:
  - machine
  - gpu=l40s
  - cpu=4
  - ram=32
  - architecture=x64
  - tenancy=spot
  - regions=us-east-1,us-east-2

This ensures your benchmarks run efficiently while optimizing for cost. machine.dev will search for the lowest spot price within the specified regions.
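
The ~$1 estimate quoted at the top of this page follows directly from the spot rate; a quick sanity check, using the numbers given above:

```python
# Back-of-the-envelope cost check for the figures quoted on this page.
spot_rate_per_min = 0.016   # $/min for the gpu=l40s spot runner
runtime_min = 60            # ~1 hour for 100 examples per task

estimated_cost = spot_rate_per_min * runtime_min
print(f"~${estimated_cost:.2f} per benchmark run")  # ~$0.96, i.e. about $1
```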

Benchmark Results

After running the workflow, the benchmark produces:

  1. JSON files containing detailed performance metrics for each task
  2. Comparison charts visualizing the performance differences between models
  3. GitHub artifacts containing all of the above, retained for 90 days
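
In the JSON files, per-task metrics sit under a top-level results key. A minimal sketch of pulling out a headline score, using a synthetic result dict (real files are written by lm_eval itself; recent harness versions suffix metric names with the filter, e.g. acc,none, while older versions use plain acc):

```python
# Sketch: extract a headline accuracy from an lm-eval results dict.
# The dict below is synthetic, for illustration only.
results = {
    "results": {
        "hellaswag": {"acc,none": 0.57, "acc_norm,none": 0.74},
        "gsm8k": {"exact_match,strict-match": 0.41},
    }
}

def headline_accuracy(task_metrics):
    """Return the first accuracy-like metric found for a task, else None."""
    for key in ("acc,none", "acc", "acc_norm,none", "exact_match,strict-match"):
        if key in task_metrics:
            return task_metrics[key]
    return None

for task, task_metrics in results["results"].items():
    print(task, headline_accuracy(task_metrics))
```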

Here’s an example of how the benchmark plotting works:

# Extract metrics from JSON files
# (imports and setup shown for completeness; directory names are illustrative)
import glob
import json
import os
from collections import defaultdict
from pathlib import Path

import matplotlib.pyplot as plt

current_dir = Path(__file__).parent
model_results = {            # model label -> directory of lm_eval output
    "model_1": "./benchmarks/model_1",
    "model_2": "./benchmarks/model_2",
}
tasks = set()
metrics = defaultdict(dict)

for model, dir_path in model_results.items():
    result_files = glob.glob(os.path.join(dir_path, "results_*.json"))
    if result_files:
        # Use the most recently created results file for this model
        latest_file = max(result_files, key=os.path.getctime)
        with open(latest_file) as f:
            data = json.load(f)
            for task, task_metrics in data['results'].items():
                tasks.add(task)
                metrics[model][task] = task_metrics

# Generate comparison charts for each task
for task in sorted(tasks):
    plt.figure(figsize=(12, 7))
    plt.title(f'{task} Comparison: Model 1 vs Model 2')
    # ... Chart generation code ...
    output_path = current_dir / f'benchmarks/{task}_comparison.png'
    plt.savefig(output_path)
    plt.close()  # free the figure before the next task
Getting Started

To run the Language Model Arena benchmark:

  1. Fork the MachineDotDev/language-model-arena repository
  2. Navigate to the Actions tab in your repository
  3. Select the “LM Eval Benchmarking” workflow
  4. Click “Run workflow” and configure your parameters:
    • Select the models you want to compare
    • Choose which tasks to benchmark
    • Set the number of examples to evaluate
  5. Run the workflow and wait for results
  6. Download the benchmark artifacts to view the comparison charts
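
If you prefer the terminal to the Actions UI, step 4 can also be driven through the GitHub CLI. A sketch of assembling that invocation (assumes `gh` is installed and authenticated against your fork; the input values here are illustrative):

```python
# Sketch: trigger the workflow from the command line via the GitHub CLI.
# Run the printed command in a shell, or pass `cmd` to subprocess.run.
import shlex

inputs = {
    "model_1": "Qwen/Qwen2.5-3B-Instruct",
    "model_2": "unsloth/Llama-3.1-8B-Instruct",
    "tasks": "hellaswag,gsm8k",
    "examples_limit": "100",
}

cmd = ["gh", "workflow", "run", "LM Eval Benchmarking"]
for key, value in inputs.items():
    cmd += ["-f", f"{key}={value}"]   # -f sets a workflow_dispatch input

print(shlex.join(cmd))
```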

Best Practices

  • Balance depth vs. breadth: More examples provide more accurate results but take longer to evaluate
  • Choose appropriate tasks: Select tasks that are relevant to your use case
  • Compare similar models: Compare models with similar parameter counts for more meaningful comparisons
  • Use consistent evaluation settings: When comparing different versions of a model, keep all other parameters constant

How to adapt this

  • Smaller, cheaper benchmarks: swap gpu=l40s for gpu=l4 ($0.006/min) if both models fit in 24 GB VRAM
  • Different tasks: edit the tasks input to pick from any lm-eval-harness task
  • More than 2 models: turn the workflow into a matrix across model names
  • Run on PRs: trigger this on pull_request to compare a candidate fine-tune against its baseline before merging
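
For the "more than 2 models" adaptation, GitHub Actions' matrix strategy runs one benchmark job per model. A minimal sketch (the model list and job wiring are illustrative; each job would run lm_eval against ${{ matrix.model }}):

```yaml
jobs:
  benchmark:
    strategy:
      matrix:
        model:
          - Qwen/Qwen2.5-3B-Instruct
          - unsloth/Llama-3.1-8B-Instruct
          - mistralai/Mistral-7B-Instruct-v0.3
    runs-on:
      - machine
      - gpu=l40s
      - tenancy=spot
    steps:
      # Each job evaluates one model; reference it as ${{ matrix.model }}
      # ...
```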

Next steps