AI Evaluation: Benchmarks, Custom Evals, LLM-as-Judge, and Production Testing
You cannot improve what you cannot measure. In AI engineering, evaluation is the difference between a demo that impresses and a product that works. Yet most teams ship LLM features with no systematic evaluation, relying on vibes and spot-checking. This guide covers how to evaluate LLMs properly — from standard benchmarks to custom production evals.
Why Evaluation Matters#
Without evaluation, you are guessing. Every change to your AI system — new model, updated prompt, different retrieval strategy — can improve one dimension while silently degrading another. Evaluation gives you:
- Confidence to ship: Quantified evidence that a change is an improvement
- Regression detection: Catch quality drops before users do
- Model selection: Data-driven decisions between providers and versions
- Prompt optimization: Measure the impact of prompt changes objectively
Standard Benchmarks#
Benchmarks provide a shared language for comparing models. They are useful for initial model selection but insufficient for evaluating your specific use case.
MMLU (Massive Multitask Language Understanding)#
Tests knowledge across 57 subjects — from abstract algebra to world religions. Multiple choice format with four options per question. MMLU scores tell you about general knowledge breadth but nothing about instruction following, reasoning depth, or generation quality.
- What it measures: Factual knowledge and basic reasoning across domains
- Limitation: Saturating quickly. Top models score above 90%, making differentiation difficult.
- Use it for: Quick sanity check that a model has broad knowledge
HumanEval#
A benchmark of 164 Python programming problems. The model generates a function body, which is tested against unit tests. Pass@k measures the probability that at least one of k generated solutions passes all tests.
- What it measures: Code generation ability in Python
- Limitation: Problems are relatively simple. Does not test debugging, refactoring, or working with large codebases.
- Use it for: Comparing code generation capabilities between models
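The pass@k metric mentioned above is usually computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count the c that pass, and estimate the chance that at least one of k random draws passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated, c of them correct.

    Estimates the probability that at least one of k randomly
    drawn samples passes all unit tests.
    """
    if n - c < k:
        return 1.0  # not enough failing samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 passes, `pass_at_k(2, 1, 1)` gives 0.5.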
MT-Bench#
A two-turn conversation benchmark with 80 questions across eight categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Uses GPT-4 as a judge to score responses on a 1-10 scale.
- What it measures: Multi-turn conversation quality and instruction following
- Limitation: Relies on GPT-4 as judge, which has its own biases
- Use it for: Evaluating conversational and instruction-following ability
Other Notable Benchmarks#
- GPQA: Graduate-level science questions that challenge even domain experts
- MATH: Competition-level math problems requiring multi-step reasoning
- BIG-bench: A collaborative benchmark with over 200 diverse tasks
- Arena Elo: Crowdsourced human preference rankings from Chatbot Arena
The Benchmark Trap#
Do not choose a model based solely on benchmark scores. Benchmarks test narrow capabilities under controlled conditions. Your application has specific requirements — tone, format, latency, cost — that no benchmark captures. Always supplement benchmarks with custom evals.
Building Custom Evals#
Custom evals test exactly what matters for your application. They are the most valuable evaluation you can build.
Anatomy of a Custom Eval#
Every eval has three components:
- Test cases: Input-output pairs representing real usage
- Scoring function: How to measure quality of the output
- Threshold: What score constitutes acceptable performance
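These three components can be wired into a minimal harness. This is an illustrative sketch, not a specific framework's API; `generate` stands in for whatever model call your application makes:

```python
from typing import Callable

def run_eval(
    cases: list[tuple[str, str]],            # (input, expected) pairs
    generate: Callable[[str], str],          # your model call (placeholder)
    score_fn: Callable[[str, str], float],   # scoring function, returns 0..1
    threshold: float,                        # minimum acceptable mean score
) -> dict:
    """Run every test case, score the outputs, and check the threshold."""
    scores = [score_fn(generate(inp), expected) for inp, expected in cases]
    mean = sum(scores) / len(scores)
    return {"mean": mean, "passed": mean >= threshold, "scores": scores}
```

Swapping in a different `score_fn` (exact match, contains, LLM judge) changes what "quality" means without touching the harness.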
Types of Scoring Functions#
Exact match: The output must match a specific string. Useful for classification, entity extraction, and structured output.
```python
def score_exact(output: str, expected: str) -> float:
    # Whitespace-insensitive exact match: 1.0 on match, else 0.0
    return 1.0 if output.strip() == expected.strip() else 0.0
```
Contains/regex: The output must contain specific strings or match a pattern. Good for checking that key information is present.
```python
def score_contains(output: str, required_terms: list[str]) -> float:
    # Fraction of required terms present in the output
    if not required_terms:
        return 1.0  # nothing required, trivially satisfied
    return sum(1 for t in required_terms if t in output) / len(required_terms)
```
Semantic similarity: Use embeddings to measure meaning overlap. Handles paraphrasing better than string matching.
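A sketch of the scoring side, assuming you already have embedding vectors from whatever embedding model you use (the embedding call itself is omitted; the 0.8 cutoff is an arbitrary example):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_semantic(output_vec: list[float], expected_vec: list[float],
                   min_sim: float = 0.8) -> float:
    # Pass/fail on a similarity cutoff; tune min_sim for your domain
    return 1.0 if cosine_similarity(output_vec, expected_vec) >= min_sim else 0.0
```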
LLM-as-judge: Use another model to evaluate the output (covered in detail below).
Task-specific metrics: BLEU for translation, ROUGE for summarization, pass@k for code generation.
Building Your Test Suite#
Start with 20-50 test cases covering:
- Happy path: Typical inputs the model handles well
- Edge cases: Unusual inputs, ambiguous requests, boundary conditions
- Adversarial inputs: Attempts to jailbreak, confuse, or extract harmful content
- Regression cases: Previous failures that have been fixed
Grow the test suite over time. Every production bug should become a test case.
LLM-as-Judge#
Using a strong LLM to evaluate a weaker one (or the same model) is surprisingly effective. It scales better than human evaluation and correlates well with human judgment when done correctly.
Basic Pattern#
```python
judge_prompt = """
Rate the following response on a scale of 1-5 for accuracy and helpfulness.
Think through your reasoning first, then give the score.
Question: {question}
Response: {response}
Reasoning:
Score (1-5):
"""
```

Asking for the reasoning before the score matters: a judge that commits to a number first tends to rationalize it afterward.
Best Practices#
- Use rubrics: Define explicit criteria for each score level. Vague instructions produce inconsistent scores.
- Pairwise comparison: Instead of absolute scores, ask the judge to compare two responses. "Which response better answers the question?" is easier to judge than "Rate this response from 1-10."
- Multiple judges: Use 2-3 different judge prompts or models and average scores. This reduces individual bias.
- Calibration: Include examples with known scores in the judge prompt to anchor the scale.
- Position bias: When comparing two responses, randomize which appears first. LLMs tend to favor the first or last option.
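The position-bias mitigation above can be sketched as a small wrapper: randomize which response the judge sees first, then map the verdict back to the stable A/B labels. `ask_judge` is a hypothetical judge call you would supply:

```python
import random

def pairwise_judge(question: str, resp_a: str, resp_b: str,
                   ask_judge, rng=random) -> str:
    """Compare two responses with randomized presentation order.

    `ask_judge(question, first, second)` returns "first" or "second"
    (a placeholder for your judge-model call). Returns "A" or "B".
    """
    if rng.random() < 0.5:
        verdict = ask_judge(question, resp_a, resp_b)
        return "A" if verdict == "first" else "B"
    # Swapped order: invert the mapping back to the original labels
    verdict = ask_judge(question, resp_b, resp_a)
    return "A" if verdict == "second" else "B"
```

Averaging over many randomized comparisons washes out any residual preference the judge has for a particular slot.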
Limitations#
LLM judges can be fooled by verbose, confident-sounding responses that are actually wrong. They may also share biases with the model being evaluated. Always validate your judge against human ratings on a subset of cases.
A/B Testing AI Features#
Evals tell you if a change is better in isolation. A/B testing tells you if it is better for your users.
What to A/B Test#
- Model upgrades (GPT-4o vs Claude Sonnet)
- Prompt changes
- RAG retrieval strategies
- Temperature and sampling parameters
- UI changes around AI features (how you present AI output)
Metrics to Track#
Quality metrics: User ratings, thumbs up/down, edit distance (how much users modify AI output), task completion rate.
Engagement metrics: Feature usage, session length, return rate, copy/paste frequency.
Business metrics: Conversion rate, support ticket volume, time-to-resolution.
Safety metrics: Hallucination rate (measured by spot-checking), harmful content flags, user reports.
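The edit-distance metric above can be approximated cheaply with Python's standard library; this sketch uses difflib's similarity ratio as a proxy rather than a true Levenshtein distance:

```python
import difflib

def edit_fraction(ai_output: str, user_final: str) -> float:
    """Rough fraction of the AI output the user changed.

    0.0 means the user kept the output verbatim; values near 1.0
    mean they rewrote almost everything.
    """
    similarity = difflib.SequenceMatcher(None, ai_output, user_final).ratio()
    return 1.0 - similarity
```

Tracked per variant, a falling edit fraction is one of the strongest signals that output quality actually improved for users.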
A/B Testing Pitfalls#
- Sample size: AI features often have high variance. You need larger sample sizes than typical product A/B tests.
- Novelty effects: Users may engage more with a new AI feature simply because it is new. Run tests long enough for the novelty to wear off.
- Segment effects: AI quality may vary dramatically across user segments. Analyze results by segment, not just overall.
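To make the sample-size point concrete, here is the textbook two-proportion formula (defaults: 95% confidence, 80% power) applied to a rate-style metric such as thumbs-up rate. This is a standard statistical approximation, not specific to any A/B tool:

```python
import math

def sample_size_per_variant(p1: float, p2: float,
                            z_alpha: float = 1.96,   # 95% confidence
                            z_beta: float = 0.84) -> int:  # 80% power
    """Approximate users per variant to detect a shift from rate p1 to p2."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)
```

Detecting a thumbs-up lift from 10% to 12% requires several thousand users per variant; a lift to 20% needs far fewer. Small quality deltas are expensive to measure.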
Tools#
Braintrust#
A purpose-built platform for LLM evaluation. Define evals as code, run them in CI, and track results over time. Supports custom scorers, LLM-as-judge, and dataset management. The logging SDK captures production traces for monitoring. Strong on experiment tracking and comparison.
LangSmith (by LangChain)#
End-to-end observability and evaluation for LLM applications. Captures full execution traces including chain steps, retrieval results, and tool calls. Built-in eval framework with human annotation queues. Best for teams already using LangChain, but works with any LLM framework.
Promptfoo#
An open-source CLI tool for prompt testing and evaluation. Define test cases in YAML, run them against multiple providers simultaneously, and see results in a comparison table. Excellent for rapid prompt iteration. Supports custom assertions, model grading, and red-teaming. Lightweight and easy to integrate into CI/CD.
Other Notable Tools#
- OpenAI Evals: Open-source framework for evaluating OpenAI models
- RAGAS: Specialized evaluation framework for RAG pipelines
- DeepEval: Unit testing framework for LLMs with pytest integration
- Humanloop: Prompt management with built-in evaluation and monitoring
Building an Eval Pipeline#
A production eval pipeline runs automatically on every change:
- Pre-merge: Run fast evals (exact match, contains, basic LLM judge) in CI on every PR
- Post-merge: Run full eval suite including expensive LLM judge evaluations
- Pre-deploy: Gate deployments on eval scores meeting thresholds
- Production: Monitor live traffic with sampling-based evaluation
- Feedback loop: Route user feedback and flagged outputs into the test suite
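The pre-merge and pre-deploy gates above can be as simple as a pytest-style assertion that fails the build when the suite's mean score drops below threshold. A sketch with illustrative names and hardcoded scores standing in for your eval runner's output:

```python
THRESHOLD = 0.85  # minimum acceptable mean eval score for this suite

def test_eval_gate():
    """CI gate: fail the pipeline if eval quality regresses."""
    # In practice these scores come from running your eval suite;
    # hardcoded here for illustration.
    scores = [1.0, 0.9, 0.8, 1.0]
    mean = sum(scores) / len(scores)
    assert mean >= THRESHOLD, f"Eval gate failed: mean {mean:.2f} < {THRESHOLD}"
```

Because it is an ordinary test, it runs under the same CI machinery as the rest of your suite — no special infrastructure needed for the gate itself.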
Summary#
Evaluation is not optional for production AI. Use standard benchmarks for model selection, build custom evals for your specific use case, leverage LLM-as-judge for scalable quality assessment, and A/B test to measure real user impact. Start with 20 test cases and grow from there. The tools — Braintrust, LangSmith, Promptfoo — make it practical to build eval pipelines that run in CI and catch regressions before users do.
Build smarter AI systems with us at codelit.io.
Article #335 on Codelit — Keep building, keep shipping.