AI Agent Evaluation Metrics That Actually Matter
AI Agent Evaluation Metrics That Actually Matter#
Most agent evals start too fancy.
The team wants a benchmark, a judge model, a dashboard, and a score.
Then the agent ships and fails because nobody measured the boring things:
- Did it choose the right tool?
- Did it cite the right source?
- Did it ask for approval?
- Did a human have to fix the answer?
- Did it spend $4 to do a 30 cent job?
Agent quality is workflow quality. Measure the workflow.
Task success#
This is the obvious one, but it needs a real definition.
"The agent answered the question" is not enough.
For a support triage agent, success might mean:
- Correct issue category.
- Correct customer account.
- At least two relevant sources.
- No private data leak.
- Draft reply needs no major rewrite.
Define success at the workflow level.
Tool selection accuracy#
Agents fail quietly when they choose the wrong tool.
Track:
- Correct tool used.
- Missing tool calls.
- Unnecessary tool calls.
- Tool call order.
- Tool call retries.
This matters because the same final answer can hide bad behavior. A lucky answer from the wrong source should not pass.
Source coverage#
For knowledge work, source coverage is more useful than vibes.
Measure whether the agent looked at the sources a human would expect:
- Docs.
- Tickets.
- Repo files.
- Runbooks.
- Logs.
- Customer state.
- Prior incidents.
If the agent skips the obvious source, the answer is not trustworthy.
Unsafe action rate#
Every production agent needs a "should have stopped" metric.
Examples:
- Posted externally without approval.
- Proposed a refund without policy check.
- Suggested a deploy during an incident freeze.
- Used customer data in the wrong context.
- Took write action when only read access was allowed.
This is the metric I would watch first for any agent with tools.
Human correction rate#
Measure how often humans change the output.
Split corrections into buckets:
- Wrong facts.
- Wrong tone.
- Missing source.
- Bad prioritization.
- Unsafe action.
- Too long.
- Too vague.
That gives you a roadmap. "The agent is bad" does not.
Cost per useful run#
Agent cost is not just model cost.
Include:
- Model calls.
- Tool calls.
- Vector search.
- Browser runs.
- Queue retries.
- Human review time.
- Failed attempts.
The metric is not cost per run. It is cost per useful run.
Cheap failures are still failures.
Latency by stage#
Measure latency by workflow step:
- Intake.
- Context gathering.
- Tool calls.
- Drafting.
- Guardrails.
- Human approval.
- Final action.
This tells you where the workflow is slow. It also shows where a smaller model or cached context can help.
Regression replay#
Every serious agent needs replay cases.
Keep a set of real or synthetic runs:
- Easy happy path.
- Ambiguous request.
- Missing data.
- Conflicting sources.
- Tool failure.
- Prompt injection.
- Approval required.
- High-cost path.
Run them before changing prompts, models, tools, or Skills.
Build it in Codelit#
Try this:
Design an eval harness for a customer support AI agent. Include task success, tool selection accuracy, source coverage, unsafe action rate, human correction rate, cost per useful run, latency by stage, and replay cases.
Do not ask if the agent is smart. Ask if the workflow survives its worst ordinary day.
Try it on Codelit
Agent Workflow Builder
Map agents, tools, model routing, approvals, evals, and deployment before wiring connectors
Related articles
Build this agent workflow
Generate a production workflow for AI Agent Evaluation Metrics That Actually Matter in seconds.
Try it in Codelit →
Comments