Duct tape is cancelled. Pipelines only.
The Setup: Shell Scripts, Duct Tape, and Despair
Everyone obsesses over the models. Nobody wants to talk about the plumbing. But the infra grind? That's where projects crawl off to die.
Take FAISS, Facebook's open-source similarity-search library. You take a pile of vectors (embeddings from text, images, audio, whatever) and FAISS finds the nearest neighbours fast. If you're building RAG, semantic search, or anything that touches embeddings, you'll probably end up using it.
And here's the rub: GPU indexes give you raw speed; CPU indexes give you portability. You want both. You want to test them side by side, sweep hyperparameters, and actually see which trade-offs matter for your use case.
Sounds simple. It's not.
Because unless you've got a rack of NVIDIA cards humming away in your basement, you don't just "run" FAISS on GPUs. Local Apple Silicon? Cute for hobby projects, but half the models don't even compile against it. Even when they do, you're staring at arcane errors and wondering if ARM was a cosmic joke. Renting raw GPUs from AWS or GCP? Sure, if you enjoy watching dollars evaporate while you wait 20 minutes for CUDA to compile, only to discover you built against the wrong minor version.
So what happens? Teams duct-tape. A shell script here, a Docker container there, a Frankenstein's monster of conda environments. You run the experiment once, get "good results," and swear you'll document it later. Next week you try to reproduce it and… gone.
And there you are, staring at the logs, twenty minutes to compile and endless seconds to fail, whispering to yourself like Bloom wandering Dublin streets: yes build again yes and error and rebuild yes and dependency hell yes and maybe this time yes yes yes.
The Reality Check
If you're nodding along, it's because your "ML infrastructure" probably looks like this:
- Shell scripts with hardcoded paths that only work on your laptop (and break the moment CI/CD touches them).
- Manual FAISS builds that take 20 minutes and fail half the time.
- Experiments scattered across environments with slightly different dependencies — PyTorch 2.1 here, CUDA 12.2 there, some rogue conda env that nobody remembers creating.
- "Good results" you can't reproduce because you forgot which parameters you ran at 3 a.m.
- GPU/CPU compatibility errors that surface only once the pipeline is already running.
- Hours wasted on environment setup instead of, you know, the actual experiment.
- Maybe waiting on an infrastructure team that isn't invested in your success (after all, they're hopelessly overworked).
You patch, you repatch. You tell yourself it's fine, it'll hold. But every layer of duct tape makes it worse.
And you mutter: crack patch crack whack hack hammer pipes in the dark basement kettle whistling steam on his face yes make it run once yes maybe twice no don't think about tomorrow yes patch again yes whisper to logs yes crawl again yes.
Meanwhile, your team is blocked, your backlog grows, and the work that mattered — the research, the iteration, the insight — is buried under infra noise.
Solution Overview
Modern ML Engineering: A Practical Demonstration
You've seen the basement, the duct tape, the broken runs. Now let's climb out. No magic, no smoke, no need to stroke out, just pipelines done right.
We'll use a real-world hyperparameter sweep for Q&A generation as the frame. Not because you care about Q&A specifically, but because it's a tidy way to show how good ML practice borrows from software engineering without friction.
Stir with us now, no longer muttering doom but whispering possibility: yes pipeline yes YAML yes reproducible yes logs artifacts tidy rows neat neat neat.
Three Core Demonstrations
Goal 1: Effortless Cloud GPU Integration
- Machine.dev GitHub integration: add three lines to your YAML, summon enterprise GPUs like they were always yours.
- Spot instance cost optimization with automatic failover — 70% savings without ever touching a console.
- Speed you forgot was possible: uv for 10x faster package installs, model caching so your serial sweeps actually sweep.
Goal 2: Production-Ready AutoRAG Implementation
- A systematic 3×3 hyperparameter matrix (difficulty × generation style).
- Best Q&A pairs bubble up by quality score, not gut feel.
- Dual FAISS indices (CPU + GPU) baked into every run — deployable anywhere, reproducible forever.
- AutoRAG = curated knowledge base of structured Q&A, instead of praying raw chunks make sense.
Goal 3: GitHub-Native Experiment Tracking
- Hyperparameters versioned like code.
- Logs bundled as artifacts, reproducible environments on tap.
- Deterministic runs, ready-to-ship outputs that don't vanish into "worked once" oblivion.
The Architecture: Software Engineering Meets ML
# .github/workflows/hyperparameter-sweep.yaml
on:
  workflow_dispatch:
    inputs:
      model_name:
        description: 'Model for systematic evaluation'
        default: 'meta-llama/Meta-Llama-3-8B-Instruct'

jobs:
  hyperparameter-sweep:
    runs-on:
      - machine        # Machine.dev integration
      - gpu=l40s       # Enterprise GPU on spot pricing
      - tenancy=spot   # 70% cost reduction
    steps:
      # 9 systematic parameter combinations
      # All tracked, logged, and artifact-managed by GitHub
Key Innovation: AutoRAG Knowledge Base Design
- Traditional RAG: Retrieve raw chunks, cross your fingers, pray context makes sense.
- AutoRAG: Structured Q&A pairs with quality scores, so retrieval gives you answers, not noise.
# Traditional approach: hope you get relevant chunks
rag_chunks = ["Abdominal pain can be caused by various factors..."]

# AutoRAG approach: structured Q&A pairs as knowledge base
autorag_pairs = [
    {
        "question": "What causes acute abdominal pain?",
        "answer": "Acute abdominal pain can be caused by appendicitis, gallstones...",
        "quality_score": 0.847,  # data-driven selection
    }
]
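To make the contrast concrete, here's a deliberately tiny sketch of the Q&A-first lookup. No model, no FAISS; a real pipeline would match by embedding similarity, and this `answer` helper with its token-overlap scoring is purely illustrative:

```python
# Toy AutoRAG lookup: retrieval returns an answer, not a raw chunk.
autorag_pairs = [
    {"question": "What causes acute abdominal pain?",
     "answer": "Appendicitis, gallstones, and other acute conditions.",
     "quality_score": 0.847},
    {"question": "How is appendicitis treated?",
     "answer": "Usually with an appendectomy.",
     "quality_score": 0.792},
]

def answer(query: str) -> str:
    """Pick the pair whose question best overlaps the query, tie-broken by quality."""
    q_tokens = set(query.lower().split())
    best = max(
        autorag_pairs,
        key=lambda p: len(q_tokens & set(p["question"].lower().split()))
        + p["quality_score"],
    )
    return best["answer"]

result = answer("what causes abdominal pain?")
```

The shape of the lookup is the point: the retrieval unit is already an answer, scored and curated, instead of a chunk you still have to hope makes sense in context.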
The Result: ML engineering that feels like proper software engineering. Enterprise compute, reproducible experiments, deployable artifacts — all inside the workflows you already know.
Bloom sigh yes at last yes reproducible yes not lost yes cost down speed up yes pipeline sings yes.
Technical Deep Dive
Hyperparameter Sweep Methodology: How to Do It Right
The duct tape is gone; the pipes hum. Now it's about doing experiments like an engineer, not like a gambler. Not one dice roll but a systematic sweep. Nine runs, neatly tracked, every knob turned once.
- Traditional Approach: "Let me try temp=0.8 and… oh, maybe top_p=0.9 too… what was it last time?"
- Our Approach: A clean 3×3 grid, every combination tested, every result captured.
The 3×3 Matrix: Systematic vs. Ad-Hoc Experimentation
Difficulty Levels (Content Complexity):
- Basic: Straightforward factual questions
- Intermediate: Synthesis across sources
- Advanced: Deep reasoning and inference
Generation Styles (Model Parameters):
- Conservative (temp=0.3, top_p=0.8, max_tokens=256)
- Balanced (temp=0.7, top_p=0.9, max_tokens=512)
- High Creativity (temp=0.9, top_p=0.9, max_tokens=512)
Nine neat boxes. No blind guessing, no forgotten params. Just systematic exploration.
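The post's workflow spells out each of the nine steps by hand, but GitHub's built-in `strategy.matrix` can generate the combinations for you. A sketch (varying temperature only; in the full sweep, top_p and max_tokens move with the style too):

```yaml
jobs:
  sweep:
    runs-on: [machine, gpu=l40s, tenancy=spot]
    strategy:
      matrix:
        difficulty: [basic, intermediate, advanced]
        temperature: [0.3, 0.7, 0.9]
    steps:
      - name: Run ${{ matrix.difficulty }} × temp=${{ matrix.temperature }}
        run: |
          python cli_pdf_qa.py pdfs/Abdo_Pain.pdf \
            --difficulty ${{ matrix.difficulty }} \
            --temperature ${{ matrix.temperature }}
```

Either style works; the matrix form just guarantees you never forget a box.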
Engineering Speed Improvements: The Compound Effect
Nine runs sound fine until you do the math: 30 minutes each, 4.5 hours total, and that's before iteration. Multiply by cycles and the week is gone.
Machine.dev flips that math: 4.5 hours down to about 30 minutes.
Speed Improvement #1: uv Package Installation
# Old way (5-10 minutes)
pip install torch transformers accelerate datasets
# New way (30 seconds)
uv pip install torch transformers accelerate datasets
Impact: 10–20× faster installs. Eight wasted minutes erased per run.
Speed Improvement #2: Model Caching
Anti-Pattern:
9 experiments × 15GB download = 72 minutes wasted
Our Approach:
1 download (5 minutes) + 9 runs (27 minutes) = 32 minutes
Impact: 55% time shaved clean.
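The pattern itself is just memoisation. A toy sketch, using `lru_cache` as a stand-in for what `transformers`' `from_pretrained` does with its on-disk cache under `HF_HOME`:

```python
from functools import lru_cache

calls = 0  # counts actual "downloads"

@lru_cache(maxsize=1)
def load_model(name: str):
    """Stand-in loader: the expensive part runs once, every later call is a cache hit."""
    global calls
    calls += 1  # in reality: a 15 GB download on the first call only
    return f"model:{name}"

for _ in range(9):  # nine sweep runs share one cached load
    model = load_model("meta-llama/Meta-Llama-3-8B-Instruct")
```

Same idea at pipeline scale: pay the download once, then let all nine experiments hit the cache.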
Speed Improvement #3: Machine.dev Spot Integration
runs-on:
- machine
- gpu=l40s
- tenancy=spot
- cpu=8
- ram=64
A handful of YAML lines, an enterprise GPU (48GB VRAM), 70% cheaper. No infra grief.
AutoRAG Innovation: Q&A-First Knowledge Base
- Traditional RAG: chunk your docs, hope retrieval finds something useful.
- AutoRAG: generate structured Q&A, score them, and store the good ones.
PDF → Q&A Generation → Quality Selection → Q&A Vector Store → Query → Direct Answers
It's not magic, it's repeatable process.
Dual FAISS Implementation
GPU Index (qa_faiss_index_gpu.bin)
CUDA-accelerated, built for training + evaluation.
CPU Index (qa_faiss_index_cpu.bin)
Universal fallback, deploy anywhere.
# build_dual_faiss_indices
gpu_index = faiss.index_cpu_to_gpu(...)
cpu_index = faiss.IndexFlatIP(...)
Both written, both tested. Automatic fallback. Why Both Matter: speed where you can, portability where you must.
Also, FAISS index merging is impressively efficient, so you can assemble very specific RAG stores without reprocessing the source documents, and even redact documents from an index cheaply.
Quality-Driven Selection Process
From 1,800+ pairs, pick the top 50. Not by gut, not by "looked fine," but by metrics: coherence, confidence, diversity, length.
top_pairs = select_best(qa_pairs, k=50)
A corpus built, not guessed.
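`select_best` isn't shown in the post, so here's a hypothetical sketch of what such a selector might look like, assuming each pair carries the `quality_score` field from earlier (the real metrics — coherence, confidence, diversity — are collapsed into that one score here):

```python
def select_best(qa_pairs, k=50, min_answer_len=20):
    """Drop degenerate answers, then keep the k highest-scoring pairs."""
    usable = [p for p in qa_pairs if len(p["answer"]) >= min_answer_len]
    return sorted(usable, key=lambda p: p["quality_score"], reverse=True)[:k]

# Synthetic pairs with scores 0.00 .. 0.99
qa_pairs = [
    {"question": f"q{i}?", "answer": "a" * 40, "quality_score": i / 100}
    for i in range(100)
]
top_pairs = select_best(qa_pairs, k=5)
```

The threshold-then-rank shape is the point: filter out junk first, then let the score decide, so "top 50" actually means the fifty best.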
The Real Point: This isn't just Q&A. It's a proof that ML engineering, when treated like software engineering, produces reproducible, deployable, cost-efficient results. Machine.dev just makes it natural.
Production Engineering Excellence
GitHub Actions: Getting ML Experiment Discipline for Free
Everyone talks models, nobody talks glue. Here's the trick: you don't need a shiny MLOps platform with enterprise price tags. GitHub Actions already is one if you treat it properly. Versioning, audit trails, artifacts, parallel jobs. Discipline for free (or at least at a very nominal cost), hiding in plain sight.
The Power of Existing DevOps Infrastructure
The Insight: Why build your own duct-taped experiment tracker when GitHub already does it better? Every PR, every run, every artifact — all logged, stored, versioned.
What GitHub Gives You Out of the Box:
- Parameter Versioning: Every hyperparameter combo sits in Git forever
- Audit Trail: Who ran what, when, and why
- Artifact Management: Models, indices, reports — saved without lifting a finger
- Parallel Execution: Experiments run side by side without chaos
- Cost Tracking: Resource usage visible in the logs
Anti-Pattern: The Jupyter Experiment Graveyard
# notebook_cell_mess.ipynb
temperature = 0.7 # tried 0.8 yesterday, was it better?
top_p = 0.9 # TODO: try 0.95 next
max_tokens = 512 # copied from stack overflow
results = run_experiment() # saved where? who knows?
Problems:
- Parameters vanish into ether
- Results scattered across "untitled folders"
- No reproducibility
- No audit trail
- Good runs lost forever
Our Approach: GitHub Workflows as Discipline
# .github/workflows/pdf-qa-autorag.yaml
jobs:
  pdf-qa-autorag:
    runs-on:
      - machine
      - gpu=l40s
      - tenancy=spot
    steps:
      - name: Run Advanced × Balanced
        run: |
          python cli_pdf_qa.py \
            pdfs/Abdo_Pain.pdf \
            --difficulty advanced \
            --temperature 0.7 \
            --top-p 0.9 \
            --max-new-tokens 512
Now the rules are simple: every parameter in YAML, every run an artifact, every artifact reproducible.
Machine.dev Integration: Enterprise GPUs in 3 Lines
Before: 50 lines of infra, CUDA drivers, VPCs, AMIs, spot interruptions.
After:
runs-on:
- machine
- gpu=l40s
- tenancy=spot
Enterprise GPUs (48GB VRAM), 70% cheaper, 30 seconds to first run.
GitHub's Built-In Experiment Discipline
GitHub turns every sweep into a ledger:
- name: "Experiment: Basic × Conservative (temp=0.3)"
  run: python cli_pdf_qa.py --temperature 0.3 --difficulty basic
Automatic Benefits:
- Full audit trail (parameters + timestamps)
- Artifact versioning
- Reproducibility on demand
- Cost visibility baked in
- Team collaboration by default
Production-Ready Artifacts
After one run you don't just get logs — you get production.
artifacts/
├── qa_faiss_index_gpu.bin
├── qa_faiss_index_cpu.bin
├── high_quality_pairs.jsonl
├── evaluation_report.json
└── experiment_parameters.json
Artifacts not as demos but as deployables. CPU index, GPU index, curated pairs, provenance intact. Anyone on the team can pick them up tomorrow and ship.
Results & Impact
Experiment Tracking That Actually Works
Bloom counting in his pocket, not pennies but parameters, 2,000+ Q&A pairs across 9 combinations, each one logged, timestamped, reproducible.
What the Matrix Revealed:
- Balanced temp=0.7 — the Goldilocks zone — consistently outperformed both the cold hand of 0.3 and the fever dream of 0.9.
- Advanced questions? We thought they'd love creativity. Nope. They loved balance. Counterintuitive, unglamorous, but data doesn't lie.
Without the systematic sweep we'd have wandered blind alleys, chasing high temps like false prophets.
Engineering Efficiency Gains
Traditional Approach (the week wasted):
- Day 1–2: Fighting cloud infra like a drunk wrestling a lamppost
- Day 3: First experiment (0.8, "seems fine")
- Day 4: Different params, results vanish
- Day 5–7: Reproduce "good run" — never found again
- Outcome: One week → nothing
Our GitHub + Machine.dev Approach (the afternoon win):
- Hour 1: Add 3 YAML lines
- Hour 2: Matrix runs, artifacts logged
- Hour 3: Actionable insights
- Outcome: 3 hours → systematic truth
Infrastructure Cost Analysis
Before (Traditional Cloud GPU):
- Setup: 4 hours of toil ($600)
- AWS p3.8xlarge: $12.24/hour
- Per experiment: $624.48 (the $600 setup tax plus ~2 hours × $12.24)
- 9 experiments: $5,620
After (Machine.dev + GitHub):
- Setup: 5 minutes (add YAML, press go)
- L40S Spot: $1.20/hour × 0.5 hours = $0.60 per run
- 9 experiments: $5.40
Savings: 99.9% cost reduction. Yes, you read that right. $5,620 → $5.40.
GitHub's Hidden Value
Stuff you'd normally pay $$ for with an MLOps platform:
- Experiment tracking
- Artifact storage
- Team collaboration
- Audit compliance
GitHub already gives you all of it. Cost: very close to $0. Savings: $850–$1,300/month.
Footnote on GitHub costs:
Yes, I keep saying "free" but let's be precise. GitHub Actions gives you a generous free tier, but serious workflows live on paid plans. As of writing:
- Actions minutes: 2,000 free per month on Pro, 3,000 on Team, 50,000 on Enterprise. After that you're billed per-minute (Linux ~$0.008/min, GPU runners way more unless you're on something like Machine.dev).
- Artifact storage: 500 MB free on Pro, 2 GB on Team, 50 GB on Enterprise, then ~$0.25/GB/month. If you're pumping out GPU/CPU indices and model binaries, that adds up fast.
- Seats: Team/Enterprise plans run $4–$21 per user/month depending on features.
In short: GitHub is not "magically free infra." But compared to bolting on a bespoke MLOps platform, the baseline costs are trivial, you're already paying for GitHub anyway, and artifact storage is pocket change next to cloud GPU bills. Marketplace Actions exist for offloading artifacts to any of the major cloud providers' object storage, so that would probably be my next optimisation, just sayin'.
Business Impact Metrics
Knowledge Accessibility:
- Before: PDFs and Ctrl+F despair
- After: Semantic search, 89% top-5 relevance (vs 34% with keywords)
Dataset Quality:
- 47 curated Q&A pairs
- 12 medical categories
- 23% performance bump in domain QA
Deployment Flexibility:
- GPU: sub-second answers
- CPU fallback: slower but works anywhere
- Hybrid: automatic switching, no operator tears
Real-World Application
Medical Research Example:
- Document: 45-page abdominal pain diagnostic manual
- Q&A Generated: 127
- High-Quality: 47
- Expert Validation: 91% accuracy
- Time to Deployment: 30 minutes (vs 2 weeks manual)
Scalability Demo:
- 15 medical docs in parallel
- <5% variance
- 180/day throughput
Key Takeaways & Future Directions
Lessons Learned: Engineering Wisdom From the Trenches
Systematic Experimentation Beats Intuition
Gut feel is for gamblers and Joyce's Bloom counting cloud-shadows on the Liffey. Our 3×3 matrix showed balance (temp=0.7) crushed both the timid 0.3 and the manic 0.9 across the board. We thought advanced Qs would need more creativity. We were wrong. The data had the last laugh.
Quality-First Dataset Curation is Non-Negotiable
2,000+ Q&A pairs in, 47 out. That's under 2.5%. Everything else went in the bin. And that's the point. You don't win with noise, you win with signal. Small, clean, brutal curation beats "moar data" every single time.
Infrastructure Choices Matter More Than You Think
Machine.dev spot runners didn't just cut costs by 70% — they handled interruptions gracefully. And building GPU + CPU indices felt like over-engineering… until we tried using them locally, and they just worked.
The Broader Impact
This isn't just infra porn. It's about turning business collateral and documents into living knowledge bases in minutes. Research papers, manuals, medical docs — queryable, reproducible, ready to ship. The more we automate the grind, the more we accelerate discovery.
Call to Action
Let's Build the Future of Document AI (Without the Duct Tape)
We've shown you the PDF Q&A AutoRAG pipeline — nine neat little experiments, reproducible builds, GPU/CPU indices that don't keel over. But honestly? This is just the start. The fun begins when other brains bash into it, twist it, and make it better.
Join the Messy Bits
What are you doing with document AI? Summaries? Entity extraction? Turning financial reports into bedtime stories? Legal docs into haiku? I want the weird, the useful, the failures that make you swear at your keyboard.
Throw me your experiments. Especially the ones that don't work. That's where the gold usually hides.
Work With Us
We're looking for:
- Brave souls who can wrestle business artefacts and documents into pipelines
- Folks who want to try different model backbones (Mistral, Gemma, roll your own Frankenstein)
- Engineers obsessed with shaving 10 cents/hour off infra bills
- Researchers with real datasets that need stress-testing
Connect with me on LinkedIn.
Or better, sign up here.
The future isn't "AI replaces experts." It's: expert knowledge, bottled and queryable, at scale, everywhere. Not tomorrow. Not "someday." Today, if we stop duct-taping and start building proper pipelines.
Footnote: Many apologies to Mr Joyce for corrupting his works into an ML blog. Sorry, not sorry!