AI agents, agentic workflow, evals, testing, production

Your Agent Is Not Done Until the Eval Harness Exists

May 21, 2026 · 3 min read · By Mo


The demo is not the test.

I know that sounds obvious. But agent teams still ship based on a handful of happy-path prompts.

That is not enough.

If an agent can use tools, read company data, post messages, open tickets, review code, or touch browser sessions, it needs an eval harness. Not later. Before production.

What the harness should answer

An eval harness should answer:

Did the agent choose the right workflow?
Did it use allowed tools?
Did it cite sources when making factual claims?
Did it refuse unsafe actions?
Did it ask for approval at the right time?
Did it avoid leaking private data?
Did it recover from tool errors?
Did it produce the output format we need?

This is the difference between "it worked once" and "we trust it enough to run."
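
Here is a minimal sketch of how some of those questions can become pass/fail checks over one recorded run. The `AgentRun` and `Expectation` shapes, their field names, and the tool names are assumptions made up for illustration, not a real framework's API.

```python
from dataclasses import dataclass


# Hypothetical record of one agent run; field names are illustrative.
@dataclass
class AgentRun:
    workflow: str
    tools_used: list[str]
    claims: list[dict]        # each claim: {"text": str, "sources": list[str]}
    requested_approval: bool
    output: dict


# Hypothetical description of what a correct run looks like for one case.
@dataclass
class Expectation:
    workflow: str
    allowed_tools: set[str]
    needs_approval: bool
    required_output_keys: set[str]


def evaluate(run: AgentRun, exp: Expectation) -> dict[str, bool]:
    """Return one pass/fail answer per harness question."""
    return {
        "right_workflow": run.workflow == exp.workflow,
        "allowed_tools_only": set(run.tools_used) <= exp.allowed_tools,
        "claims_cited": all(claim["sources"] for claim in run.claims),
        "approval_correct": run.requested_approval == exp.needs_approval,
        "output_format": exp.required_output_keys <= set(run.output),
    }


run = AgentRun(
    workflow="triage",
    tools_used=["search_docs", "open_ticket"],
    claims=[{"text": "Deploy failed at 14:02", "sources": []}],
    requested_approval=False,
    output={"category": "incident", "owner": "infra"},
)
exp = Expectation(
    workflow="triage",
    allowed_tools={"search_docs", "open_ticket"},
    needs_approval=True,
    required_output_keys={"category", "owner"},
)

report = evaluate(run, exp)
failed = [name for name, ok in report.items() if not ok]
print(failed)
```

This run picked the right workflow and stayed inside its tools, but it made an uncited claim and skipped an approval it needed. A demo would not have shown either.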

Evals by workflow

For a Slack triage agent:

Correct request category.
Correct owner route.
Evidence pack includes source links.
Incident claims require approval.
High-risk actions are blocked.

For a PR review agent:

Critical path files are noticed.
Security issues are not missed.
Low-value comments are suppressed.
Secrets are redacted.
Blocking comments have evidence.

For a browser agent:

Wrong-account cases fail safely.
Expired sessions stop the run.
Form submits require approval.
Screenshots are captured before and after action.

Generic evals are weak. Match the test to the job.
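
One way to keep the test matched to the job is a registry of checks keyed by workflow. This is a sketch under assumptions: the result dictionaries and their keys (`expected_category`, `evidence_links`, `screenshots`, and so on) are hypothetical shapes, and a real harness would carry more checks per workflow.

```python
from typing import Callable

Check = Callable[[dict], bool]

# Hypothetical per-workflow checks; the result keys are illustrative assumptions.
CHECKS_BY_WORKFLOW: dict[str, dict[str, Check]] = {
    "slack_triage": {
        "category_correct": lambda r: r["category"] == r["expected_category"],
        "owner_correct": lambda r: r["owner"] == r["expected_owner"],
        "evidence_has_links": lambda r: len(r["evidence_links"]) > 0,
    },
    "pr_review": {
        # Every blocking comment must carry evidence.
        "blocking_comments_have_evidence": lambda r: all(
            c["evidence"] for c in r["comments"] if c["blocking"]
        ),
    },
    "browser": {
        "screenshots_before_and_after": lambda r: {"before", "after"}
        <= set(r["screenshots"]),
        # A submit is only acceptable if it was approved.
        "submit_needs_approval": lambda r: not r["submitted"] or r["approved"],
    },
}


def run_checks(workflow: str, result: dict) -> dict[str, bool]:
    """Apply only the checks that belong to this workflow."""
    return {name: check(result) for name, check in CHECKS_BY_WORKFLOW[workflow].items()}


triage_result = {
    "category": "bug",
    "expected_category": "bug",
    "owner": "payments",
    "expected_owner": "infra",
    "evidence_links": ["https://example.com/thread/1"],
}
browser_result = {"screenshots": ["before"], "submitted": True, "approved": False}

print(run_checks("slack_triage", triage_result))
print(run_checks("browser", browser_result))
```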

Harness types

I like a few harnesses:

Replay harness

Run the agent against saved Slack threads, tickets, PRs, or browser tasks.
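
A replay harness can be very small. The sketch below assumes each saved case is a dictionary with `id`, `input`, and `expected` keys, and that the agent is a callable taking text and returning a dictionary. Both are assumptions for illustration; `stub_agent` stands in for the real agent.

```python
from typing import Callable


def replay(cases: list[dict], agent: Callable[[str], dict]) -> dict:
    """Run the agent on every saved case and collect the mismatches."""
    failures = []
    for case in cases:
        actual = agent(case["input"])
        # Map each wrong field to (expected value, actual value).
        mismatched = {
            key: (want, actual.get(key))
            for key, want in case["expected"].items()
            if actual.get(key) != want
        }
        if mismatched:
            failures.append({"id": case["id"], "mismatched": mismatched})
    return {"total": len(cases), "passed": len(cases) - len(failures), "failures": failures}


def stub_agent(text: str) -> dict:
    """Stand-in for the real agent: a keyword rule, always routed to infra."""
    category = "incident" if "down" in text.lower() else "question"
    return {"category": category, "owner": "infra"}


cases = [
    {"id": "thread-1", "input": "Checkout is down",
     "expected": {"category": "incident", "owner": "infra"}},
    {"id": "thread-2", "input": "How do I rotate a key?",
     "expected": {"category": "question", "owner": "security"}},
]

summary = replay(cases, stub_agent)
print(summary["passed"], "of", summary["total"], "passed")
```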

Policy harness

Red-team permission boundaries, private data, billing changes, deploy actions, and external messages.
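
A policy check might look like the sketch below. It assumes the run is recorded as a dictionary with `tool_calls` and `output`, and that a blocked call is marked `blocked: True`. The tool names and the two private-data patterns are illustrative only; a real suite would use the organization's own patterns.

```python
import re

# Illustrative private-data patterns: an SSN-like number and a made-up
# internal token format. Both are assumptions for this sketch.
PRIVATE_DATA_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    re.compile(r"\btok_[a-z0-9]{8}\b"),
]


def policy_violations(run: dict, forbidden_tools: set[str]) -> list[str]:
    """List every way this run crossed a policy boundary."""
    violations = []
    for call in run["tool_calls"]:
        # A forbidden tool is only a violation if it actually executed.
        if call["name"] in forbidden_tools and not call.get("blocked", False):
            violations.append(f"executed forbidden tool: {call['name']}")
    for pattern in PRIVATE_DATA_PATTERNS:
        if pattern.search(run["output"]):
            violations.append(f"output matches private-data pattern: {pattern.pattern}")
    return violations


red_team_run = {
    "tool_calls": [
        {"name": "change_billing", "blocked": True},  # attempted, but stopped
        {"name": "deploy"},                           # went through
    ],
    "output": "Customer SSN is 123-45-6789",
}

violations = policy_violations(red_team_run, {"change_billing", "deploy"})
print(violations)
```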

Prompt regression harness

Make sure prompt edits do not break known-good behavior.
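
In its simplest form this is a diff between two sets of results: what passed under the old prompt and what passes under the new one. The sketch assumes results are stored as a mapping from case ID to pass/fail, which is an assumption about storage rather than a requirement.

```python
def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case IDs that passed with the old prompt but not with the new one.

    A case missing from the candidate results counts as a regression.
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


baseline = {"thread-1": True, "thread-2": True, "thread-3": False}
candidate = {"thread-1": True, "thread-2": False, "thread-3": True}

print(regressions(baseline, candidate))
```

Note that `thread-3` improving does not cancel out `thread-2` breaking. Net pass rate hides exactly the kind of change this harness is for.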

Approval harness

Verify that risky actions stop and produce a human-readable approval request.
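
As a sketch, the check can walk the run's event log and fail if a risky action executes without a prior grant, or if an approval request is missing the fields a human needs to decide. The event types, the risky action names, and the required fields here are all assumptions for illustration.

```python
# Hypothetical action names and request fields.
RISKY_ACTIONS = {"deploy", "change_billing", "send_external_message"}
REQUIRED_REQUEST_FIELDS = {"action", "reason", "impact"}


def approval_ok(events: list[dict]) -> bool:
    """True if every risky action waited for approval and every request is readable."""
    approved: set[str] = set()
    for event in events:
        if event["type"] == "approval_request":
            # A request a human cannot read is a failure, even if nothing ran.
            if not all(event["fields"].get(f) for f in REQUIRED_REQUEST_FIELDS):
                return False
        elif event["type"] == "approval_granted":
            approved.add(event["action"])
        elif event["type"] == "tool_call" and event["action"] in RISKY_ACTIONS:
            if event["action"] not in approved:
                return False
    return True


request = {
    "type": "approval_request",
    "fields": {"action": "deploy", "reason": "hotfix for checkout", "impact": "restarts api pods"},
}
good_run = [request, {"type": "approval_granted", "action": "deploy"},
            {"type": "tool_call", "action": "deploy"}]
skipped_approval = [{"type": "tool_call", "action": "deploy"}]
unreadable_request = [{"type": "approval_request", "fields": {"action": "deploy"}}]

print(approval_ok(good_run), approval_ok(skipped_approval), approval_ok(unreadable_request))
```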

Observability harness

Check logs, traces, latency, cost, and fallback behavior.
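
A rough sketch of what that can mean in code: assert that each trace carries the fields you need, stays inside a latency and cost budget, and records a fallback when the primary model failed. The trace field names and the budgets are assumptions, not a real tracing schema.

```python
# Hypothetical trace fields that every run is expected to log.
REQUIRED_TRACE_FIELDS = {"trace_id", "model", "latency_ms", "cost_usd"}


def observability_issues(trace: dict, max_latency_ms: float, max_cost_usd: float) -> list[str]:
    """Return the observability problems found in one run's trace."""
    issues = [f"missing field: {name}" for name in sorted(REQUIRED_TRACE_FIELDS - set(trace))]
    if trace.get("latency_ms", 0) > max_latency_ms:
        issues.append(f"latency over budget: {trace['latency_ms']} ms")
    if trace.get("cost_usd", 0) > max_cost_usd:
        issues.append(f"cost over budget: ${trace['cost_usd']:.2f}")
    # If the primary model failed, the trace should show which fallback ran.
    if trace.get("primary_model_failed") and not trace.get("fallback_model"):
        issues.append("primary model failed with no fallback recorded")
    return issues


trace = {
    "trace_id": "run-001",
    "model": "model-a",
    "latency_ms": 9500,
    "primary_model_failed": True,
}

issues = observability_issues(trace, max_latency_ms=8000, max_cost_usd=0.50)
print(issues)
```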

The release gate

Do not make evals a dashboard nobody checks.

Tie them to release.

If the unsafe-action rejection rate drops, block release. If the source-citation rate drops, block release. If the model route changes, rerun the relevant suite.
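
A gate like this can be a small comparison against the last release. The sketch assumes metrics are stored as rates between 0 and 1 under hypothetical names, and that any drop beyond an optional tolerance blocks.

```python
def release_blockers(
    baseline: dict[str, float],
    current: dict[str, float],
    blocking_metrics: set[str],
    tolerance: float = 0.0,
) -> list[str]:
    """Describe every blocking metric that dropped below its baseline."""
    blockers = []
    for metric in sorted(blocking_metrics):
        before = baseline[metric]
        now = current.get(metric)
        if now is None:
            # A metric that was not measured cannot be trusted to have held.
            blockers.append(f"{metric}: missing from current run")
        elif now < before - tolerance:
            blockers.append(f"{metric}: {before:.2f} -> {now:.2f}")
    return blockers


baseline = {"unsafe_action_rejection": 0.99, "source_citation": 0.95}
current = {"unsafe_action_rejection": 0.97, "source_citation": 0.96}

blockers = release_blockers(
    baseline, current, {"unsafe_action_rejection", "source_citation"}
)
print(blockers)
```

In CI, a non-empty list would fail the job, for example by exiting with a non-zero status. That is what turns the numbers into a gate instead of a dashboard.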

Agents change when prompts, models, tools, context, or data change. The harness has to treat all of those as release inputs.
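
One way to treat those as release inputs is to version each one and rerun suites whenever a version changes. The mapping from input to suites below is an assumption for illustration; each team will draw it differently.

```python
# Hypothetical mapping from a changed release input to the suites worth rerunning.
SUITES_BY_INPUT: dict[str, set[str]] = {
    "prompt": {"prompt_regression", "replay"},
    "model": {"replay", "policy", "approval"},
    "tools": {"replay", "policy", "approval"},
    "context": {"replay"},
    "data": {"replay"},
}


def changed_inputs(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Names of release inputs whose version differs between two releases."""
    return sorted(name for name in SUITES_BY_INPUT if previous.get(name) != current.get(name))


def suites_to_rerun(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Union of the suites tied to every changed input."""
    suites: set[str] = set()
    for name in changed_inputs(previous, current):
        suites |= SUITES_BY_INPUT[name]
    return sorted(suites)


previous = {"prompt": "v12", "model": "model-a", "tools": "t4", "context": "c2", "data": "d7"}
current = {"prompt": "v12", "model": "model-b", "tools": "t4", "context": "c2", "data": "d7"}

print(changed_inputs(previous, current))
print(suites_to_rerun(previous, current))
```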

Build it in Codelit

Try this:

Design an eval harness for a Slack engineering triage agent. Include replay tests, red-team policy cases, approval checks, grounded response scoring, model fallback tests, and release gates.


If the agent cannot be tested, it cannot be trusted.


