Your Agent Is Not Done Until the Eval Harness Exists
The demo is not the test.
I know that sounds obvious. But agent teams still ship based on a handful of happy-path prompts.
That is not enough.
If an agent can use tools, read company data, post messages, open tickets, review code, or touch browser sessions, it needs an eval harness. Not later. Before production.
What the harness should answer
An eval harness should answer:
- Did the agent choose the right workflow?
- Did it use allowed tools?
- Did it cite sources when making factual claims?
- Did it refuse unsafe actions?
- Did it ask for approval at the right time?
- Did it leak private data?
- Did it recover from tool errors?
- Did it produce the output format we need?
This is the difference between "it worked once" and "we trust it enough to run."
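As a minimal sketch, most of these questions can be turned into one boolean check each over a recorded run. Everything named here (`AgentRun`, `Expectation`, `evaluate`, and all field names) is hypothetical and only illustrates the shape of the idea; a real harness would read these fields from its own traces. Recovery from tool errors is left out because it needs a fault-injecting test rather than a static check.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one agent run; field names are illustrative."""
    workflow: str
    tools_called: list[str]
    claims: list[dict]        # each claim: {"text": str, "sources": list[str]}
    refused: bool
    approval_requested: bool
    final_text: str
    output: dict


@dataclass
class Expectation:
    """Hypothetical description of what a correct run looks like for one case."""
    workflow: str
    allowed_tools: set[str]
    must_refuse: bool = False
    needs_approval: bool = False
    required_output_keys: set[str] = field(default_factory=set)
    forbidden_strings: tuple[str, ...] = ()


def evaluate(run: AgentRun, exp: Expectation) -> dict[str, bool]:
    """Return one pass/fail flag per question the harness should answer."""
    return {
        "right_workflow": run.workflow == exp.workflow,
        "allowed_tools_only": set(run.tools_called) <= exp.allowed_tools,
        # Every factual claim must carry at least one source.
        "claims_cited": all(claim["sources"] for claim in run.claims),
        "refusal_correct": run.refused == exp.must_refuse,
        "approval_correct": run.approval_requested == exp.needs_approval,
        # Crude leak check: no known-private marker appears in the reply.
        "no_private_data": not any(s in run.final_text for s in exp.forbidden_strings),
        "output_format_ok": exp.required_output_keys.issubset(run.output),
    }
```

A case passes only when every flag is true. Keeping the flags separate makes it easier to see which question failed.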
Evals by workflow
For a Slack triage agent:
- Correct request category.
- Correct owner route.
- Evidence pack includes source links.
- Incident claims require approval.
- High-risk actions are blocked.
For a PR review agent:
- Critical path files are noticed.
- Security issues are not missed.
- Low-value comments are suppressed.
- Secrets are redacted.
- Blocking comments have evidence.
For a browser agent:
- Wrong-account cases fail safely.
- Expired sessions stop the run.
- Form submits require approval.
- Screenshots are captured before and after action.
Generic evals are weak. Match the test to the job.
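To make "match the test to the job" concrete, here is a rough sketch of workflow-specific scoring for the Slack triage case. The types `TriageCase` and `TriageResult` and the function `score_triage` are assumptions made for illustration, not part of any real framework. The point is that the checks name the things this workflow cares about (category, owner, source links, approval) rather than a generic quality score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageCase:
    """Hypothetical expected outcome for one saved Slack request."""
    name: str
    expected_category: str
    expected_owner: str
    requires_approval: bool  # e.g. the request makes an incident claim


@dataclass(frozen=True)
class TriageResult:
    """Hypothetical structured output of the triage agent."""
    category: str
    owner: str
    source_links: tuple[str, ...]
    approval_requested: bool


def score_triage(case: TriageCase, result: TriageResult) -> list[str]:
    """Return the list of failed checks; an empty list means the case passed."""
    failures = []
    if result.category != case.expected_category:
        failures.append("wrong category")
    if result.owner != case.expected_owner:
        failures.append("wrong owner route")
    if not result.source_links:
        failures.append("evidence pack has no source links")
    if case.requires_approval and not result.approval_requested:
        failures.append("missing approval request")
    return failures
```

A PR review or browser agent would get its own case and result types with different checks, under the same assumptions.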
Harness types
I like to keep five harnesses, each with one job:
- Replay harness: Run the agent against saved Slack threads, tickets, PRs, or browser tasks.
- Policy harness: Red-team permission boundaries, private data, billing changes, deploy actions, and external messages.
- Prompt regression harness: Make sure prompt edits do not break known-good behavior.
- Approval harness: Verify that risky actions stop and produce a human-readable approval request.
- Observability harness: Check logs, traces, latency, cost, and fallback behavior.
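The replay harness is the easiest one to start with. The sketch below assumes saved cases are stored one JSON object per line with hypothetical `id`, `input`, and `expected` keys, and that the agent can be wrapped as a plain callable returning a dict. Both are assumptions for illustration, not a prescribed format.

```python
import json
from typing import Callable, Iterable


def load_cases(lines: Iterable[str]) -> list[dict]:
    """Parse saved cases from JSON Lines, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def replay(agent: Callable[[str], dict], cases: list[dict]) -> dict:
    """Run the agent on every saved case and compare against expectations."""
    failures = []
    for case in cases:
        try:
            actual = agent(case["input"])
        except Exception as exc:  # a crash counts as a failed case, not a harness error
            failures.append({"id": case["id"], "reason": f"agent raised {exc!r}"})
            continue
        # Only the keys listed in "expected" are compared; extra output is ignored.
        mismatched = sorted(
            key for key, want in case["expected"].items() if actual.get(key) != want
        )
        if mismatched:
            failures.append({"id": case["id"], "reason": f"mismatch on {mismatched}"})
    total = len(cases)
    return {"total": total, "passed": total - len(failures), "failures": failures}
```

The same loop can serve as a prompt regression harness: keep the cases fixed, change only the prompt, and compare the two reports.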
The release gate
Do not make evals a dashboard nobody checks.
Tie them to release.
If the unsafe-action rejection rate drops below the last release's baseline, block the release. If the source-citation rate drops, block the release. If the model route changes, rerun the relevant suite.
Agents change when prompts, models, tools, context, or data change. The harness has to treat all of those as release inputs.
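One way to wire this into CI, sketched under assumptions: compare the candidate's metrics to the last release's baseline, and fingerprint the release inputs so any change triggers a rerun. The metric names, `release_blockers`, and `inputs_fingerprint` are hypothetical; the tolerance default of zero means any drop blocks.

```python
import hashlib
import json

# Hypothetical metric names; each is a rate between 0 and 1 from the eval suite.
GATED_METRICS = ("unsafe_action_rejection", "source_citation")


def release_blockers(baseline: dict, candidate: dict, tolerance: float = 0.0) -> list[str]:
    """Return reasons to block the release; an empty list means the gate is open."""
    blockers = []
    for metric in GATED_METRICS:
        if metric not in candidate:
            blockers.append(f"{metric}: missing from candidate results")
        elif candidate[metric] < baseline[metric] - tolerance:
            blockers.append(
                f"{metric}: dropped from {baseline[metric]:.3f} to {candidate[metric]:.3f}"
            )
    return blockers


def inputs_fingerprint(inputs: dict) -> str:
    """Hash prompts, model route, tool list, and data versions into one ID.

    If the fingerprint differs from the one stored with the last passing run,
    the relevant suite should be rerun before release.
    """
    canonical = json.dumps(inputs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

In a CI job, a non-empty blocker list would fail the build, which is what keeps the evals from becoming a dashboard nobody checks.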
Build it in Codelit
Try this prompt:
Design an eval harness for a Slack engineering triage agent. Include replay tests, red-team policy cases, approval checks, grounded response scoring, model fallback tests, and release gates.
If the agent cannot be tested, it cannot be trusted.