Last reviewed: 2026-06-12

Agent Run Evidence Ledgers for Human Review

When a coding agent finishes a task it leaves behind a trail of artifacts: diffs, test results, CI logs, commit messages, instruction-file states, and model call records. Without a structured ledger, a human reviewer has no reliable way to understand what the agent did, why it made certain choices, or whether the output is safe to merge. This guide explains what belongs in an agent run evidence ledger, how to structure it so a reviewer can act on it, and how to hand it off through a pull request or CI workflow.

Direct answer

An agent run evidence ledger is a structured record of the artifacts produced during a single coding agent session. It is designed for human review, not agent consumption. A minimal ledger contains five elements:

The task brief the agent was given, including acceptance criteria and any instruction files active at run start (such as AGENTS.md or a project-level Claude Code memory file).
The diff of every file changed, in a form that a reviewer can read line by line.
Test and CI output showing pass/fail state, exit codes, and any error messages produced during the run.
A handoff summary written by the agent stating what it did, what it did not do, and any uncertainty or blocked state it encountered.
A model call log recorded at a field level sufficient to triage unexpected output: endpoint family, request timestamp, and response status. Do not log full prompt or response text.

Assembling these five elements into a single pull request or linked directory before requesting review is what makes the ledger actionable. Verify the exact PR fields and CI check attachment behavior available in your GitHub plan against the sources listed under Sources checked, because interface details differ between GitHub product tiers.
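The five elements can be held in one small record. The Python sketch below is illustrative only: the names `EvidenceLedger`, `ModelCallEntry`, and `missing_elements` are assumptions made for this example and do not come from any agent tool or from GitHub.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCallEntry:
    """Field-level record of one model call; no prompt or response text."""
    endpoint_family: str
    timestamp: str          # ISO 8601 string
    response_status: int


@dataclass
class EvidenceLedger:
    """Hypothetical container for the five ledger elements."""
    task_brief: str = ""
    diff: str = ""
    ci_output: str = ""
    handoff_summary: str = ""
    model_calls: list[ModelCallEntry] = field(default_factory=list)

    def missing_elements(self) -> list[str]:
        """Return the names of the elements that are still empty."""
        checks = {
            "task_brief": bool(self.task_brief.strip()),
            "diff": bool(self.diff.strip()),
            "ci_output": bool(self.ci_output.strip()),
            "handoff_summary": bool(self.handoff_summary.strip()),
            "model_calls": bool(self.model_calls),
        }
        return [name for name, present in checks.items() if not present]
```

A non-empty result from `missing_elements()` is a signal to hold the review request until the gap is filled. The check covers presence only; whether each element is adequate remains a reviewer judgment.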

For broader release checks, see When to Stop, Retry, or Escalate: A Practical Guide to Coding Agent Task Control.

Who this is for

Engineering teams that delegate implementation tasks to coding agents and need a repeatable review process before merging agent-generated changes.
Platform engineers building agent orchestration infrastructure who need to define which artifacts each agent run must produce.
Reviewers who receive agent-generated PRs and want a consistent checklist to work from rather than reading raw commit history.
Agent operators setting up new agents and deciding which artifacts to capture before the first production run.

Key takeaways

A ledger is reviewer-facing. Organize it for the human, not the agent.
Five artifact types cover the majority of review needs: task brief, diff, CI output, handoff summary, and a model call log.
Instruction files are part of the evidence. Capturing their state at run start lets a reviewer reconstruct the rules the agent operated under. Verify current instruction file naming and scope for your specific agent tool against the linked documentation.
A pull request is the standard container for an agent run ledger. GitHub’s PR interface surfaces the diff, links CI results, and provides a comment thread for reviewer questions.
CI output attached to the PR is evidence, not decoration. A reviewer should be able to see which checks ran, which passed, and what the failing output was without leaving the PR interface.
Model call logs belong in the ledger but must not contain credentials, full prompts, or full response bodies. Log at the field level only.
Routing all agent model calls through a single model gateway keeps the endpoint family and response status fields consistent across ledgers, making log records comparable run to run.

Smoke-test workflow

Use this workflow to verify a ledger is reviewer-ready before requesting review.

Setup assumptions:

The agent has completed its run and committed changes to a feature branch.
A pull request has been opened against the base branch.
CI checks have run at least once.

Happy-path check:

Open the PR. Confirm the description contains a handoff summary written by the agent. The summary should state what changed, what the acceptance criteria were, and whether the agent considers the run complete or blocked.
Open the Files Changed tab. Confirm all changed files are visible and the diff is readable. An empty diff means the ledger is incomplete.
Open the Checks tab. Confirm at least one CI workflow ran and its pass/fail status is visible. If no checks ran, review the GitHub Actions workflow definition against the documentation linked in the Sources section.
Confirm the instruction files active at run start are either committed to the branch or referenced by path in the handoff summary. For Claude Code agents this typically means the project-level memory directory or a CLAUDE.md file; for Codex agents this typically means an AGENTS.md file at the repo root or a subdirectory. Verify current file naming and scope rules against your agent’s documentation before assuming a specific path.
Confirm a model call log entry exists. It must not contain credentials or full prompt/response text.

Error-path check:

If CI checks are missing, confirm the workflow file exists in .github/workflows/ and the trigger matches the branch name pattern. A missing workflow is a blocker; do not approve the PR until at least one automated check has run.
If the diff is unexpectedly large (many unrelated files changed), check the handoff summary. If the agent does not explain the scope, request a revised summary before review proceeds.
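The first error-path check, confirming that a workflow file exists, can be scripted. This is a minimal sketch: the function name `find_workflow_files` is hypothetical, and it only looks for `.yml` and `.yaml` files in `.github/workflows/`. It does not parse triggers, so whether the trigger matches the branch still has to be checked by reading the workflow file.

```python
from pathlib import Path


def find_workflow_files(repo_root: str) -> list[str]:
    """Return sorted workflow file names under .github/workflows/, or [] if none."""
    workflows_dir = Path(repo_root) / ".github" / "workflows"
    if not workflows_dir.is_dir():
        return []
    return sorted(
        p.name
        for p in workflows_dir.iterdir()
        if p.is_file() and p.suffix in {".yml", ".yaml"}
    )
```

An empty list corresponds to the "missing workflow" blocker described above.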

Minimum assertions for the reviewer:

Task brief is present and matches the diff scope.
At least one CI check passed.
Handoff summary explains any incomplete or blocked items.
Model call log does not contain credentials or full response bodies.
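The presence half of these assertions can be automated against a run record. The sketch below assumes the record is a dictionary using the field names from the sample log record template (`task_brief_ref`, `ci_status`, `handoff_summary_present`, `model_call_log`). The function name `reviewer_blockers` and the deny-list of key names are assumptions for illustration; adjust them to whatever your logger actually emits.

```python
# Hypothetical key names that would indicate leaked secrets or full text.
FORBIDDEN_LOG_KEYS = {"api_key", "authorization", "prompt", "response_body"}


def reviewer_blockers(record: dict) -> list[str]:
    """Return human-readable blockers; an empty list means the presence checks passed."""
    blockers = []
    if not record.get("task_brief_ref"):
        blockers.append("task brief reference is missing")
    if record.get("ci_status") != "pass":
        blockers.append("no passing CI check recorded")
    if not record.get("handoff_summary_present"):
        blockers.append("handoff summary is missing")
    for entry in record.get("model_call_log", []):
        leaked = FORBIDDEN_LOG_KEYS & {key.lower() for key in entry}
        if leaked:
            blockers.append(
                "model call log contains forbidden fields: " + ", ".join(sorted(leaked))
            )
    return blockers
```

The script cannot judge whether the task brief matches the diff scope or whether the handoff summary adequately explains blocked items. Those two assertions stay with the human reviewer.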

What the smoke test must not assert:

Do not assert that the model used was a specific identifier. Model availability changes; record the endpoint family only.
Do not assert a specific token count, cost, or latency from the model call log. These fields change with pricing and routing updates; verify current pricing and quota rules against your gateway provider’s documentation before drawing conclusions from them.
Do not assert that CI output matches a specific log line. Test output formats change with framework versions.

Sample log record template

Record the following fields after each agent run. All values shown are placeholders.

run_id: agent-run-YYYYMMDDTHHMMSSZ
branch: feature/[task-slug]
agent: [agent-name]
task_brief_ref: [path or PR link to task brief]
instruction_files:
  - path: [e.g. AGENTS.md or .claude/memory/]
    captured_at: [ISO timestamp]
ci_workflow: [workflow filename]
ci_status: [pass | fail | skipped]
model_call_log:
  - endpoint_family: [e.g. chat-completions]
    timestamp: [ISO timestamp]
    response_status: [e.g. 200]
diff_files_changed: [integer count]
handoff_summary_present: [true | false]
reviewer_requested_at: [ISO timestamp]

Do not log credentials, API keys, full prompts, full response bodies, or personal data in any field of this record.
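One way to enforce this rule is to build each model call entry from an allow-list, so that any field not explicitly permitted never reaches the record. The sketch below is an assumption-laden example: the helper names `make_run_id` and `sanitize_call_entry` are hypothetical, and the allow-list simply mirrors the three `model_call_log` fields in the template.

```python
from datetime import datetime, timezone

# Mirrors the model_call_log fields in the template; nothing else is kept.
ALLOWED_CALL_FIELDS = ("endpoint_family", "timestamp", "response_status")


def make_run_id(started_at: datetime) -> str:
    """Format a run ID as agent-run-YYYYMMDDTHHMMSSZ, converted to UTC.

    Pass a timezone-aware datetime; a naive one is interpreted as local time.
    """
    utc_time = started_at.astimezone(timezone.utc)
    return "agent-run-" + utc_time.strftime("%Y%m%dT%H%M%SZ")


def sanitize_call_entry(raw: dict) -> dict:
    """Keep only the allow-listed fields; everything else is dropped."""
    return {name: raw[name] for name in ALLOWED_CALL_FIELDS if name in raw}
```

An allow-list is chosen here over a deny-list because a new, unanticipated field (for example one added by a client library upgrade) is excluded by default instead of being logged until someone notices.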

Failure modes

Evidence gap: the agent cannot inspect the failing log, source page, pull request, or local command output. The safe action is to stop and record the missing evidence instead of guessing.
Scope drift: the agent edits files that are not connected to the observed failure. Keep the repair tied to the failing signal and leave unrelated cleanup for a separate task.
Environment mismatch: the local check uses different versions, credentials, feature flags, or runtime settings than the hosted path. Record the mismatch before treating the result as proof.
Unreviewed fallback: the agent changes models, endpoints, permissions, or retry behavior to make a run pass without preserving the review boundary. Treat access and provider failures as operational blockers, not as failures of the task itself.
Weak handoff: the final note says the issue is fixed but omits the command, result, changed files, and remaining uncertainty. That makes the next operator repeat the investigation.
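Scope drift is the failure mode that lends itself most readily to an automated check. The sketch below flags changed files that fall outside the directories a task was expected to touch. The function name `out_of_scope_files` and the idea of an allowed-directory list are assumptions for this example; the list of changed files could come from a command such as `git diff --name-only`, or from whatever your review tool reports.

```python
from pathlib import PurePosixPath


def out_of_scope_files(changed_files: list[str], allowed_dirs: list[str]) -> list[str]:
    """Return changed files that are not inside any allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    flagged = []
    for name in changed_files:
        path = PurePosixPath(name)
        # Compare path components, not string prefixes, so that
        # "src/authz/x.py" is not treated as being inside "src/auth".
        if not any(root == path or root in path.parents for root in allowed):
            flagged.append(name)
    return flagged
```

A non-empty result is not proof of a problem. It is a prompt to check whether the handoff summary explains why those files changed.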

Sources checked

GitHub pull requests documentation - accessed 2026-06-12; purpose: verify pull request review and collaboration boundaries.
GitHub Actions documentation - accessed 2026-06-12; purpose: verify workflow runs, jobs, steps, checks, and logs.
OpenAI Codex AGENTS.md guidance - accessed 2026-06-12; purpose: verify repository instruction-file context for coding agents.
Claude Code memory documentation - accessed 2026-06-12; purpose: verify project memory and instruction-file context for agent workflows.
CometAPI chat completions reference - accessed 2026-06-12; purpose: verify endpoint family naming conventions available for model call log records.

Contract details to verify

Area	What to verify	Source URL	Accessed	Safe candidate wording
PR diff interface	Which diff views are available in the Files Changed tab and whether they persist after force pushes	https://docs.github.com/en/pull-requests	2026-06-12	“Verify that the diff view shows all changed files before requesting review”
CI check attachment	Which fields appear in the Checks tab and how check run status is reported per plan tier	https://docs.github.com/en/actions	2026-06-12	“Confirm at least one CI check ran and its status is visible in the Checks tab”
AGENTS.md scope rules	Whether AGENTS.md applies repo-wide, per-directory, or only at the root; current nesting behavior	https://github.com/openai/codex/blob/main/docs/agents_md.md	2026-06-12	“Verify current instruction file scope rules against your agent’s documentation”
Claude Code memory paths	Current file names and directory layout for project-level memory and instruction files	https://code.claude.com/docs/en/memory	2026-06-12	“Verify the current memory file path for your Claude Code version before referencing it in the ledger”

Reader next step

Compare the workflow against Start with CometAPI.

Use When to Stop, Retry, or Escalate: A Practical Guide to Coding Agent Task Control as the next comparison point. Keep AI Coding Agent Setup, Security, and Model Routing nearby for setup and permission checks.

FAQ

What is the minimum viable evidence ledger for a one-person team? A PR with: a description written by the agent, at least one CI check result, and a readable diff. Add the instruction file state and a model call log entry when the run involved repeated model calls or complex branching decisions.

Should the agent write the handoff summary, or should a human write it? The agent should write the initial summary as part of its run artifacts. A human reviewer may add notes, but the agent-written summary is the primary evidence of what the agent understood its task to be.

How long should a model call log entry be? Log the minimum fields needed to triage a failure: endpoint family, timestamp, response status, and run ID. Do not log full prompts, full response bodies, token counts, or cost figures. Token counts and cost figures change with pricing and quota updates and belong in a separate operational dashboard, not in a per-run review ledger.

Does every agent run need a PR? A PR is the standard container recommended here because it surfaces the diff, CI output, and a comment thread in one interface. If your workflow uses a different review interface, map the five ledger elements to the fields available in that interface.

What if the agent did not complete the task? A blocked or partial run still requires a ledger. The handoff summary should explain what was completed, what was not, and why. A reviewer who receives a partial-run ledger can decide whether to retry, finish manually, or close the branch.

How do I keep model call log fields consistent across different agents? Routing all agent model calls through a single model gateway means the endpoint family and response status fields are consistent regardless of which underlying model was used. Start with CometAPI to route agent model calls through a single interface and keep your evidence ledger fields stable across runs. Verify current endpoint families and response fields in the CometAPI documentation before finalizing your log record template.

Related: How to Hand Off Coding Agent Pull Requests for Review | How to Produce Reviewable Diffs From Coding Agent Sessions | Write Coding Agent Task Briefs That Produce Reviewable Changes