Last reviewed: 2026-06-18
Direct answer
Terminal command evidence is the review record that shows what a coding agent actually ran, what passed, what failed, and what still needs a human decision. A useful record is short, repeatable, and tied to the pull request or handoff note instead of living only in chat.
Use this workflow when the agent changes code, documentation, tests, or build configuration:
- Setup assumptions: the repository has clear project instructions, the agent is working in a clean branch or isolated workspace, secrets are not printed, and reviewers can access the same test commands.
- Happy-path request plan: ask the agent to list the intended verification commands before running them, then run the smallest command set that proves the changed behavior.
- Error-path check: require the agent to preserve failing command output, explain whether the failure is expected, and stop before broad retries hide the original signal.
- Minimum assertions: record command, working directory, exit result, short output summary, changed files under review, and whether the result supports the requested change.
- Pass/fail logging fields: use command_id, purpose, cwd, command, exit_result, evidence_summary, changed_files_checked, follow_up, and reviewer_note.
- What not to assert: do not claim broad product reliability, full security coverage, complete CI parity, model quality, or production readiness from one local command.
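The logging fields above can be captured as a typed record, so a missing command or an unexpected result value is rejected before the entry reaches a reviewer. This is a minimal Python sketch. The class name `EvidenceRecord` and the validation rules are illustrative assumptions, not part of any agent tool's API.

```python
from dataclasses import asdict, dataclass, field
from typing import List

# Allowed values mirror the "pass | fail | not_run" options in the sample record.
ALLOWED_EXIT_RESULTS = ("pass", "fail", "not_run")


@dataclass
class EvidenceRecord:
    """One terminal-evidence entry (hypothetical); fields match the sanitized log record."""

    command_id: str
    purpose: str
    cwd: str
    command: str
    exit_result: str
    evidence_summary: str
    changed_files_checked: List[str] = field(default_factory=list)
    follow_up: str = "none"
    reviewer_note: str = ""

    def __post_init__(self) -> None:
        # Reject results outside the agreed vocabulary so logs stay comparable.
        if self.exit_result not in ALLOWED_EXIT_RESULTS:
            raise ValueError(f"exit_result must be one of {ALLOWED_EXIT_RESULTS}, got {self.exit_result!r}")
        # A record without the command cannot be repeated by a reviewer.
        if not self.command.strip():
            raise ValueError("command must be recorded, even when exit_result is 'not_run'")


record = EvidenceRecord(
    command_id="cmd-001",
    purpose="Verify the changed validation path",
    cwd="<REPOSITORY_ROOT>",
    command="<SAFE_COMMAND_PLACEHOLDER>",
    exit_result="pass",
    evidence_summary="<SHORT_RESULT_SUMMARY>",
    changed_files_checked=["<PATH_PLACEHOLDER>"],
)
print(asdict(record)["exit_result"])  # pass
```

Because `asdict` returns a plain dictionary, the same record can be serialized into a pull request comment or a handoff file without extra glue code.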
For adjacent review handoff structure, see How to Hand Off Coding Agent Pull Requests for Review.
Who this is for
This guide is for engineering teams that use coding agents to edit repositories and need a review trail that survives beyond the agent session. It is especially useful when a reviewer must decide whether a change is ready for a pull request, a CI retry, or a narrower follow-up task.
It is not a substitute for CI, human code review, or repository-specific release checks. It is a way to keep terminal evidence from becoming vague after the run ends.
Key takeaways
- Treat terminal evidence as a review artifact, not as a transcript dump.
- Capture the planned command, the actual command, the working directory, and the result.
- Keep failed commands visible until a reviewer understands whether the failure is caused by the change, the environment, or a pre-existing issue.
- Connect terminal evidence to repository instructions, pull request review, and CI workflow evidence.
- Use placeholders for secrets and prompts; never paste real credentials or full private outputs into examples.
Smoke-test workflow
Before a reviewer trusts an agent-authored change, ask for one compact smoke test record.
Setup:
- Confirm the agent read the repository instruction file that applies to the changed path.
- Confirm the command will run from the repository root or record the exact subdirectory.
- Confirm no command will print tokens, private prompts, customer data, or full generated responses.
Happy-path request:
Run the smallest repository command that verifies the changed behavior. Before running it, state why this command is relevant. After it finishes, summarize only the result and the review implication.
Error-path request:
If the command fails, keep the first failure visible. Do not retry with a broader command until the failure is classified as changed-code failure, environment failure, dependency failure, or unrelated pre-existing failure.
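The error-path rule can be expressed as a small gate. A broader retry is allowed only once the failing command carries one of the four classifications, and the earliest failure is always kept for the record. This is a minimal Python sketch. `FailureClass`, `may_run_broader_command`, and `first_failure` are hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import List, Optional, Tuple


class FailureClass(Enum):
    """The four classifications named in the error-path request."""

    CHANGED_CODE = "changed-code failure"
    ENVIRONMENT = "environment failure"
    DEPENDENCY = "dependency failure"
    PRE_EXISTING = "unrelated pre-existing failure"


def may_run_broader_command(exit_result: str, classification: Optional[FailureClass]) -> bool:
    """Allow a broader retry only if it would not hide an unclassified failure."""
    if exit_result != "fail":
        return True  # nothing failed, so there is no signal to preserve
    return classification is not None


def first_failure(results: List[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return the earliest (command, exit_result) pair that failed, so it stays visible."""
    for command, exit_result in results:
        if exit_result == "fail":
            return (command, exit_result)
    return None


print(may_run_broader_command("fail", None))  # False
print(may_run_broader_command("fail", FailureClass.ENVIRONMENT))  # True
```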
Sanitized log record:
command_id: "cmd-001"
purpose: "Verify the changed validation path"
cwd: "<REPOSITORY_ROOT>"
command: "<SAFE_COMMAND_PLACEHOLDER>"
exit_result: "pass | fail | not_run"
evidence_summary: "<SHORT_RESULT_SUMMARY>"
changed_files_checked: ["<PATH_PLACEHOLDER>"]
follow_up: "<NONE_OR_NEXT_CHECK>"
reviewer_note: "<HUMAN_DECISION_PLACEHOLDER>"
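One way to produce a record in this shape is to wrap the command in a small runner that captures the exit code and a short output tail. The sketch below uses Python's standard `subprocess.run` and `shlex.join`. The function name `run_with_evidence`, the 200-character summary limit, and the choice to map a missing executable to `not_run` are assumptions for illustration. It does not redact anything, so it is only safe for commands that already follow the setup rules.

```python
import shlex
import subprocess
import sys
from typing import Dict, List


def run_with_evidence(command_id: str, purpose: str, argv: List[str], cwd: str,
                      max_summary_chars: int = 200) -> Dict[str, object]:
    """Run argv in cwd and return a dict shaped like the sanitized log record.

    Assumes argv is already a safe command: nothing here redacts arguments or output.
    """
    record: Dict[str, object] = {
        "command_id": command_id,
        "purpose": purpose,
        "cwd": cwd,  # recorded as given, e.g. a path relative to the repository root
        "command": shlex.join(argv),
        "changed_files_checked": [],
        "follow_up": "none",
        "reviewer_note": "",
    }
    try:
        completed = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # The executable was not found, so the check never ran.
        record["exit_result"] = "not_run"
        record["evidence_summary"] = f"not run: {exc}"
        return record
    output = (completed.stdout + completed.stderr).strip()
    record["exit_result"] = "pass" if completed.returncode == 0 else "fail"
    # Keep the tail of the output, where many test runners print their summary.
    record["evidence_summary"] = output[-max_summary_chars:]
    return record


demo = run_with_evidence("cmd-001", "Smoke-test the interpreter",
                         [sys.executable, "-c", "print('ok')"], ".")
print(demo["exit_result"], demo["evidence_summary"])  # pass ok
```

Passing an argument list instead of a shell string keeps the recorded `command` identical to what actually ran, which makes the entry repeatable for a reviewer.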
Failure modes
- Evidence gap: the agent cannot inspect the failing log, source page, pull request, or local command output. The safe action is to stop and record the missing evidence instead of guessing.
- Scope drift: the agent edits files that are not connected to the observed failure. Keep the repair tied to the failing signal and leave unrelated cleanup for a separate task.
- Environment mismatch: the local check uses different versions, credentials, feature flags, or runtime settings than the hosted path. Record the mismatch before treating the result as proof.
- Unreviewed fallback: the agent changes models, endpoints, permissions, or retry behavior to make a run pass without preserving the review boundary. Treat access and provider failures as operational blockers to report, not as failures of the change under review.
- Weak handoff: the final note says the issue is fixed but omits the command, result, changed files, and remaining uncertainty. That makes the next operator repeat the investigation.
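Two of these failure modes are mechanical enough to check before a human reads the note: a handoff that omits required fields, and a local environment that differs from the hosted path. This is a minimal Python sketch. The required-field list and both function names are hypothetical, and the environment dictionaries stand in for whatever version and flag information your team actually records.

```python
from typing import Dict, List, Optional, Tuple

# Fields whose absence turns a handoff into a "weak handoff".
REQUIRED_HANDOFF_FIELDS = ("command", "exit_result", "changed_files_checked", "follow_up")


def missing_handoff_fields(note: Dict[str, object]) -> List[str]:
    """Return required fields that the note leaves out or leaves empty."""
    return [name for name in REQUIRED_HANDOFF_FIELDS if note.get(name) in (None, "", [])]


def environment_mismatches(
    local: Dict[str, str], hosted: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return settings that differ between the local run and the hosted (CI) path."""
    keys = sorted(set(local) | set(hosted))
    return {k: (local.get(k), hosted.get(k)) for k in keys if local.get(k) != hosted.get(k)}


print(missing_handoff_fields({"exit_result": "pass"}))
# ['command', 'changed_files_checked', 'follow_up']
print(environment_mismatches({"python": "3.12", "node": "20"}, {"python": "3.11", "node": "20"}))
# {'python': ('3.12', '3.11')}
```

A non-empty result from either function is a reason to record the gap in the handoff, not proof that the change is wrong.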
Sources checked
- OpenAI Codex AGENTS.md guidance - accessed 2026-06-18; purpose: verify repository instruction-file context for coding agents.
- GitHub Actions documentation - accessed 2026-06-18; purpose: verify workflow runs, jobs, steps, checks, and logs.
- GitHub pull requests documentation - accessed 2026-06-18; purpose: verify pull request review and collaboration boundaries.
- Claude Code memory documentation - accessed 2026-06-18; purpose: verify project memory and instruction-file context for agent workflows.
Contract details to verify
| Area | What to verify | Source URL | Accessed | Safe candidate wording |
|---|---|---|---|---|
| Repository instructions | Whether the agent used the instruction file that applies to the edited path | https://github.com/openai/codex/blob/main/docs/agents_md.md | 2026-06-18 | “Confirm the agent read the applicable repository instructions before accepting the command evidence.” |
| Project memory | Whether project-level notes are relevant to the current agent run | https://code.claude.com/docs/en/memory | 2026-06-18 | “Treat project memory as context to verify, not as proof that the command result is correct.” |
| CI workflows | Whether local command evidence should be compared with CI workflow results | https://docs.github.com/en/actions | 2026-06-18 | “Use CI results as a separate signal when a local command does not cover the full workflow.” |
| Pull request review | Whether the evidence belongs in a reviewable pull request handoff | https://docs.github.com/en/pull-requests | 2026-06-18 | “Attach the command result summary to the pull request review context so reviewers can inspect the change.” |
FAQ
How much terminal output should the agent include?
Include enough output to show the command result and the relevant failure or success signal. Do not paste full logs when a short excerpt and command metadata are enough for review.
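A short excerpt can keep the first line that mentions a failure plus the last few lines, where most tools print a summary. This is a minimal Python sketch. The function name `excerpt`, the default of five tail lines, and the `error`/`fail` keywords are assumptions to tune for your own tools.

```python
from typing import List, Tuple


def excerpt(output: str, tail_lines: int = 5, signals: Tuple[str, ...] = ("error", "fail")) -> str:
    """Keep the first line mentioning a failure signal, then the last tail_lines lines."""
    lines = output.splitlines()
    first_signal = next((ln for ln in lines if any(s in ln.lower() for s in signals)), None)
    # Guard against tail_lines == 0, because lines[-0:] would return every line.
    tail = lines[-tail_lines:] if tail_lines > 0 else []
    kept: List[str] = []
    if first_signal is not None and first_signal not in tail:
        kept.extend([first_signal, "..."])
    kept.extend(tail)
    return "\n".join(kept)


log = "\n".join(["start", "ERROR: bad value in parser", "line a", "line b", "line c", "summary: 1 failed"])
print(excerpt(log, tail_lines=2))
```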
Should a failed command block the change?
A failed command should block blind acceptance. It may still be acceptable if the failure is classified, unrelated to the change, and recorded with a clear follow-up.
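The conditions in this answer can be written as an explicit gate, so "blocked" always comes with a reason. This is a minimal Python sketch. `review_gate` and its return strings are hypothetical, and treating `not_run` as blocking is an assumption this guide does not state. The gate only prepares the record; the reviewer still makes the decision.

```python
from typing import Optional


def review_gate(exit_result: str, classification: Optional[str],
                unrelated_to_change: bool, follow_up: str) -> str:
    """Return a review status for one command record (hypothetical helper)."""
    if exit_result == "pass":
        return "reviewable"
    if exit_result == "not_run":
        return "blocked: command not run"  # assumption: no evidence means no acceptance
    if classification is None:
        return "blocked: failure not classified"
    if not unrelated_to_change:
        return "blocked: failure linked to the change"
    if follow_up.strip().lower() in ("", "none"):
        return "blocked: no follow-up recorded"
    return "reviewable with recorded failure"


print(review_gate("fail", "unrelated pre-existing failure", True, "Track in a separate issue"))
# reviewable with recorded failure
```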
Is local terminal evidence enough when CI exists?
No. Local evidence explains what the agent checked before handoff. CI workflow results remain a separate review signal.
What should be redacted?
Redact credentials, private prompts, customer data, full generated responses, and any output that would expose secrets. Use <API_KEY_PLACEHOLDER> for credential examples.
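A regular-expression pass can replace obvious credential patterns with the placeholder before output is pasted into a record. This is a minimal Python sketch. The patterns are illustrative assumptions covering only `key=value`/`key: value` pairs and bearer tokens. They will miss many real secret formats and do nothing for prompts or customer data, so treat the function as a first pass, not a guarantee.

```python
import re

# Hypothetical patterns: names containing api key/token/secret/password, followed by = or :.
_KEY_VALUE = re.compile(r"(?i)\b(\w*(?:api[_-]?key|token|secret|password)\w*)(\s*[=:]\s*)(\S+)")
_BEARER = re.compile(r"(?i)(bearer\s+)\S+")
PLACEHOLDER = "<API_KEY_PLACEHOLDER>"


def redact(text: str) -> str:
    """Replace values that look like credentials with the placeholder."""
    text = _KEY_VALUE.sub(lambda m: m.group(1) + m.group(2) + PLACEHOLDER, text)
    return _BEARER.sub(lambda m: m.group(1) + PLACEHOLDER, text)


print(redact("export MY_TOKEN=abc123"))  # export MY_TOKEN=<API_KEY_PLACEHOLDER>
```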
Where does CometAPI fit in this workflow?
If your coding agents route model calls through a gateway, keep review and command evidence separate from provider-specific claims. For gateway setup work, start with CometAPI and verify exact API behavior in the relevant product documentation before increasing usage.
Reader next step
Run your next implementation or review pass against Agent Memory Review Before Long-Running Tasks, and keep Agent Run Evidence Ledgers for Human Review nearby for guidance on evidence records and source boundaries.