AI Evaluation: Benchmarks, Custom Evals, LLM-as-Judge, and Production Testing
You cannot improve what you cannot measure. In AI engineering, evaluation is the difference between a demo that impresses and a product that works. Yet most teams ship LLM features with no systematic evaluation, relying on vibes and spot-checking. This guide covers how to evaluate LLMs properly — from standard benchmarks to custom production evals.
Why Evaluation Matters#
Without evaluation, you are guessing. Every change to your AI system — new model, updated prompt, different retrieval strategy — can improve one dimension while silently degrading another. Evaluation gives you:
- Confidence to ship: Quantified evidence that a change is an improvement
- Regression detection: Catch quality drops before users do
- Model selection: Data-driven decisions between providers and versions
- Prompt optimization: Measure the impact of prompt changes objectively
Standard Benchmarks#
Benchmarks provide a shared language for comparing models. They are useful for initial model selection but insufficient for evaluating your specific use case.
MMLU (Massive Multitask Language Understanding)#
Tests knowledge across 57 subjects — from abstract algebra to world religions. Multiple choice format with four options per question. MMLU scores tell you about general knowledge breadth but nothing about instruction following, reasoning depth, or generation quality.
- What it measures: Factual knowledge and basic reasoning across domains
- Limitation: Saturating quickly. Top models score above 90%, making differentiation difficult.
- Use it for: Quick sanity check that a model has broad knowledge
HumanEval#
A benchmark of 164 Python programming problems. The model generates a function body, which is tested against unit tests. Pass@k measures the probability that at least one of k generated solutions passes all tests.
- What it measures: Code generation ability in Python
- Limitation: Problems are relatively simple. Does not test debugging, refactoring, or working with large codebases.
- Use it for: Comparing code generation capabilities between models
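The pass@k metric mentioned above is usually computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass, and estimate the chance that at least one of k random draws passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them correct.

    Estimates the probability that at least one of k randomly
    drawn samples passes all unit tests.
    """
    if n - c < k:
        return 1.0  # not enough failing samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, `pass_at_k(2, 1, 1)` gives 0.5.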
MT-Bench#
A two-turn conversation benchmark with 80 questions across eight categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Uses GPT-4 as a judge to score responses on a 1-10 scale.
- What it measures: Multi-turn conversation quality and instruction following
- Limitation: Relies on GPT-4 as judge, which has its own biases
- Use it for: Evaluating conversational and instruction-following ability
Other Notable Benchmarks#
- GPQA: Graduate-level science questions that challenge even domain experts
- MATH: Competition-level math problems requiring multi-step reasoning
- BIG-bench: A collaborative benchmark with over 200 diverse tasks
- Arena Elo: Crowdsourced human preference rankings from Chatbot Arena
The Benchmark Trap#
Do not choose a model based solely on benchmark scores. Benchmarks test narrow capabilities under controlled conditions. Your application has specific requirements — tone, format, latency, cost — that no benchmark captures. Always supplement benchmarks with custom evals.
Building Custom Evals#
Custom evals test exactly what matters for your application. They are the most valuable evaluation you can build.
Anatomy of a Custom Eval#
Every eval has three components:
- Test cases: Input-output pairs representing real usage
- Scoring function: How to measure quality of the output
- Threshold: What score constitutes acceptable performance
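These three components can be wired into a minimal harness. This is an illustrative sketch, not a specific framework's API; `generate` stands in for whatever model call your application makes:

```python
from typing import Callable

def run_eval(
    cases: list[tuple[str, str]],            # (input, expected) pairs
    generate: Callable[[str], str],          # your model call (placeholder)
    score_fn: Callable[[str, str], float],   # scoring function, returns 0..1
    threshold: float,                        # minimum acceptable mean score
) -> dict:
    """Run every test case, score the outputs, and check the threshold."""
    scores = [score_fn(generate(inp), expected) for inp, expected in cases]
    mean = sum(scores) / len(scores)
    return {"mean": mean, "passed": mean >= threshold, "scores": scores}
```

Swapping in a different `score_fn` (exact match, contains, LLM judge) changes what "quality" means without touching the harness.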
Types of Scoring Functions#
Exact match: The output must match a specific string. Useful for classification, entity extraction, and structured output.
```python
def score_exact(output: str, expected: str) -> float:
    # Whitespace-insensitive exact match: 1.0 on match, else 0.0
    return 1.0 if output.strip() == expected.strip() else 0.0
```
Contains/regex: The output must contain specific strings or match a pattern. Good for checking that key information is present.
```python
def score_contains(output: str, required_terms: list[str]) -> float:
    # Fraction of required terms present in the output
    if not required_terms:
        return 1.0  # nothing required, trivially satisfied
    return sum(1 for t in required_terms if t in output) / len(required_terms)
```
Semantic similarity: Use embeddings to measure meaning overlap. Handles paraphrasing better than string matching.
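A sketch of the scoring side, assuming you already have embedding vectors from whatever embedding model you use (the embedding call itself is omitted; the 0.8 cutoff is an arbitrary example):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_semantic(output_vec: list[float], expected_vec: list[float],
                   min_sim: float = 0.8) -> float:
    # Pass/fail on a similarity cutoff; tune min_sim for your domain
    return 1.0 if cosine_similarity(output_vec, expected_vec) >= min_sim else 0.0
```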
LLM-as-judge: Use another model to evaluate the output (covered in detail below).
Task-specific metrics: BLEU for translation, ROUGE for summarization, pass@k for code generation.
Building Your Test Suite#
Start with 20-50 test cases covering:
- Happy path: Typical inputs the model handles well
- Edge cases: Unusual inputs, ambiguous requests, boundary conditions
- Adversarial inputs: Attempts to jailbreak, confuse, or extract harmful content
- Regression cases: Previous failures that have been fixed
Grow the test suite over time. Every production bug should become a test case.
LLM-as-Judge#
Using a strong LLM to evaluate a weaker one (or the same model) is surprisingly effective. It scales better than human evaluation and correlates well with human judgment when done correctly.
Basic Pattern#
```python
judge_prompt = """
Rate the following response on a scale of 1-5 for accuracy and helpfulness.
Think through your reasoning first, then give the score.
Question: {question}
Response: {response}
Reasoning:
Score (1-5):
"""
```

Asking for the reasoning before the score matters: a judge that commits to a number first tends to rationalize it afterward.
Best Practices#
- Use rubrics: Define explicit criteria for each score level. Vague instructions produce inconsistent scores.
- Pairwise comparison: Instead of absolute scores, ask the judge to compare two responses. "Which response better answers the question?" is easier to judge than "Rate this response from 1-10."
- Multiple judges: Use 2-3 different judge prompts or models and average scores. This reduces individual bias.
- Calibration: Include examples with known scores in the judge prompt to anchor the scale.
- Position bias: When comparing two responses, randomize which appears first. LLMs tend to favor the first or last option.
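The position-bias mitigation above can be sketched as a small wrapper: randomize which response the judge sees first, then map the verdict back to the stable A/B labels. `ask_judge` is a hypothetical judge call you would supply:

```python
import random

def pairwise_judge(question: str, resp_a: str, resp_b: str,
                   ask_judge, rng=random) -> str:
    """Compare two responses with randomized presentation order.

    `ask_judge(question, first, second)` returns "first" or "second"
    (a placeholder for your judge-model call). Returns "A" or "B".
    """
    if rng.random() < 0.5:
        verdict = ask_judge(question, resp_a, resp_b)
        return "A" if verdict == "first" else "B"
    # Swapped order: invert the mapping back to the original labels
    verdict = ask_judge(question, resp_b, resp_a)
    return "A" if verdict == "second" else "B"
```

Averaging over many randomized comparisons washes out any residual preference the judge has for a particular slot.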
Limitations#
LLM judges can be fooled by verbose, confident-sounding responses that are actually wrong. They may also share biases with the model being evaluated. Always validate your judge against human ratings on a subset of cases.
A/B Testing AI Features#
Evals tell you if a change is better in isolation. A/B testing tells you if it is better for your users.
What to A/B Test#
- Model upgrades (GPT-4o vs Claude Sonnet)
- Prompt changes
- RAG retrieval strategies
- Temperature and sampling parameters
- UI changes around AI features (how you present AI output)
Metrics to Track#
Quality metrics: User ratings, thumbs up/down, edit distance (how much users modify AI output), task completion rate.
Engagement metrics: Feature usage, session length, return rate, copy/paste frequency.
Business metrics: Conversion rate, support ticket volume, time-to-resolution.
Safety metrics: Hallucination rate (measured by spot-checking), harmful content flags, user reports.
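The edit-distance metric above can be approximated cheaply with Python's standard library; this sketch uses difflib's similarity ratio as a proxy rather than a true Levenshtein distance:

```python
import difflib

def edit_fraction(ai_output: str, user_final: str) -> float:
    """Rough fraction of the AI output the user changed.

    0.0 means the user kept the output verbatim; values near 1.0
    mean they rewrote almost everything.
    """
    similarity = difflib.SequenceMatcher(None, ai_output, user_final).ratio()
    return 1.0 - similarity
```

Tracked per variant, a falling edit fraction is one of the strongest signals that output quality actually improved for users.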
A/B Testing Pitfalls#
- Sample size: AI features often have high variance. You need larger sample sizes than typical product A/B tests.
- Novelty effects: Users may engage more with a new AI feature simply because it is new. Run tests long enough for the novelty to wear off.
- Segment effects: AI quality may vary dramatically across user segments. Analyze results by segment, not just overall.
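To make the sample-size point concrete, here is the textbook two-proportion formula (defaults: 95% confidence, 80% power) applied to a rate-style metric such as thumbs-up rate. This is a standard statistical approximation, not specific to any A/B tool:

```python
import math

def sample_size_per_variant(p1: float, p2: float,
                            z_alpha: float = 1.96,   # 95% confidence
                            z_beta: float = 0.84) -> int:  # 80% power
    """Approximate users per variant to detect a shift from rate p1 to p2."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)
```

Detecting a thumbs-up lift from 10% to 12% requires several thousand users per variant; a lift to 20% needs far fewer. Small quality deltas are expensive to measure.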
Tools#
Braintrust#
A purpose-built platform for LLM evaluation. Define evals as code, run them in CI, and track results over time. Supports custom scorers, LLM-as-judge, and dataset management. The logging SDK captures production traces for monitoring. Strong on experiment tracking and comparison.
LangSmith (by LangChain)#
End-to-end observability and evaluation for LLM applications. Captures full execution traces including chain steps, retrieval results, and tool calls. Built-in eval framework with human annotation queues. Best for teams already using LangChain, but works with any LLM framework.
Promptfoo#
An open-source CLI tool for prompt testing and evaluation. Define test cases in YAML, run them against multiple providers simultaneously, and see results in a comparison table. Excellent for rapid prompt iteration. Supports custom assertions, model grading, and red-teaming. Lightweight and easy to integrate into CI/CD.
Other Notable Tools#
- OpenAI Evals: Open-source framework for evaluating OpenAI models
- RAGAS: Specialized evaluation framework for RAG pipelines
- DeepEval: Unit testing framework for LLMs with pytest integration
- Humanloop: Prompt management with built-in evaluation and monitoring
Building an Eval Pipeline#
A production eval pipeline runs automatically on every change:
- Pre-merge: Run fast evals (exact match, contains, basic LLM judge) in CI on every PR
- Post-merge: Run full eval suite including expensive LLM judge evaluations
- Pre-deploy: Gate deployments on eval scores meeting thresholds
- Production: Monitor live traffic with sampling-based evaluation
- Feedback loop: Route user feedback and flagged outputs into the test suite
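The pre-merge and pre-deploy gates above can be as simple as a pytest-style assertion that fails the build when the suite's mean score drops below threshold. A sketch with illustrative names and hardcoded scores standing in for your eval runner's output:

```python
THRESHOLD = 0.85  # minimum acceptable mean eval score for this suite

def test_eval_gate():
    """CI gate: fail the pipeline if eval quality regresses."""
    # In practice these scores come from running your eval suite;
    # hardcoded here for illustration.
    scores = [1.0, 0.9, 0.8, 1.0]
    mean = sum(scores) / len(scores)
    assert mean >= THRESHOLD, f"Eval gate failed: mean {mean:.2f} < {THRESHOLD}"
```

Because it is an ordinary test, it runs under the same CI machinery as the rest of your suite — no special infrastructure needed for the gate itself.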
Summary#
Evaluation is not optional for production AI. Use standard benchmarks for model selection, build custom evals for your specific use case, leverage LLM-as-judge for scalable quality assessment, and A/B test to measure real user impact. Start with 20 test cases and grow from there. The tools — Braintrust, LangSmith, Promptfoo — make it practical to build eval pipelines that run in CI and catch regressions before users do.
Build smarter AI systems with us at codelit.io.
Article #335 on Codelit — Keep building, keep shipping.