An LLM eval harness your team will actually trust
· 2 min read

The fastest way to lose trust in an AI feature is to change a prompt, ship it, and discover in production that it now does something dumb on a case it used to handle. Without evaluations you have no regression net — every change is a coin flip, and confidence erodes with each surprise.
You don't need a research platform. You need the LLM equivalent of a test suite: a golden set, a scorer, and a gate in CI.
1. A golden set that encodes what "good" means
Collect 20–50 real inputs that matter — including the ugly edge cases that have bitten you — and pair each with what a good response must (and must not) contain. This dataset is your spec. It's worth more than any prompt-engineering trick because it survives model swaps.
2. A scorer that's mostly assertions, sparingly a judge
Score deterministically wherever you can — it's fast, free, and never flaky:
// Cheap, deterministic checks first.
expect(answer).toContain(expected.citation);
expect(answer).not.toMatch(/i('?m| am) (just )?an ai/i); // no disclaimers
expect(wordCount(answer)).toBeLessThan(180);
Reserve an LLM-as-judge for the genuinely subjective ("is this tone professional?"), and pin the judge model + rubric so its scores are comparable over time. Judges drift; treat their output as a noisy signal, not gospel.
3. A threshold gate in CI
Run the harness on every change to a prompt, model, or retrieval step, and fail the build when the pass rate drops below a line you set:
eval: golden-set (42 cases)
pass rate: 95% (40/42) threshold: 90% ✓
Now a prompt tweak is a pull request with evidence attached, not a vibe.
The honest caveats
- Evals rot. Revisit the golden set as the product changes, or you'll optimize for yesterday's problem.
- Judge bias is real. LLM judges favor verbosity and their own style — calibrate against human labels occasionally.
- Coverage ≠ correctness. A green suite means "no known regressions," not "perfect."
None of this is glamorous, and that's the point. The teams shipping AI they can stand behind aren't the ones with the cleverest prompts — they're the ones who made "did it get worse?" a question with an automatic answer.
Related reading

Shipping agentic AI that actually saves time
Agent demos are easy; agents that survive contact with real operations are not. Notes from putting multimodal RAG and agent workflows into production.
· 1 min read

Kubernetes migrations without downtime: a project manager's runbook
Most 'big bang' platform migrations don't fail on the technology — they fail on coordination. Here's the runbook I use to move a live system to Kubernetes one slice at a time, with a rollback at every step.
· 1 min read

Compliance as Code: turning ISO 27001 controls into CI checks
Audit season shouldn't be archaeology. Here's how I turn a handful of ISO 27001 controls into automated checks that run on every pull request — so evidence is a by-product of shipping, not a fire drill.
· 1 min read