Last reviewed: 2026-06-18

Direct answer

Terminal command evidence is the review record that shows what a coding agent actually ran, what passed, what failed, and what still needs a human decision. A useful record is short, repeatable, and tied to the pull request or handoff note instead of living only in chat.

Use this workflow when the agent changes code, documentation, tests, or build configuration:

  1. Setup assumptions: the repository has clear project instructions, the agent is working on a clean branch or in an isolated workspace, secrets are not printed, and reviewers can run the same test commands.
  2. Happy-path request plan: ask the agent to list the intended verification commands before running them, then run the smallest command set that proves the changed behavior.
  3. Error-path check: require the agent to preserve failing command output, explain whether the failure is expected, and stop before broad retries hide the original signal.
  4. Minimum assertions: record command, working directory, exit result, short output summary, changed files under review, and whether the result supports the requested change.
  5. Pass/fail logging fields: use command_id, purpose, cwd, command, exit_result, evidence_summary, changed_files_checked, follow_up, and reviewer_note.
  6. What not to assert: do not claim broad product reliability, full security coverage, complete CI parity, model quality, or production readiness from one local command.
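
The logging fields from step 5 can be sketched as a small typed record with a readiness check. This is a minimal illustration in Python, not a prescribed schema: the class name CommandEvidence, the problems method, and the rule about which fields must be non-empty are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

# "not_run" covers commands that were planned but never executed.
ALLOWED_RESULTS = {"pass", "fail", "not_run"}


@dataclass
class CommandEvidence:
    """One reviewable record per command (hypothetical structure)."""

    command_id: str
    purpose: str
    cwd: str
    command: str
    exit_result: str
    evidence_summary: str
    follow_up: str = ""
    reviewer_note: str = ""

    def problems(self) -> list[str]:
        """Return human-readable reasons the record is not review-ready."""
        found = []
        # Assumption: these fields must always be filled in by the agent.
        for name in ("command_id", "purpose", "cwd", "command", "evidence_summary"):
            if not getattr(self, name).strip():
                found.append(f"{name} is empty")
        if self.exit_result not in ALLOWED_RESULTS:
            found.append(f"exit_result must be one of {sorted(ALLOWED_RESULTS)}")
        return found
```

An empty list from problems() means only that the record is complete enough to review. It says nothing about whether the change itself is correct.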

For adjacent review handoff structure, see How to Hand Off Coding Agent Pull Requests for Review.

Who this is for

This guide is for engineering teams that use coding agents to edit repositories and need a review trail that survives beyond the agent session. It is especially useful when a reviewer must decide whether a change is ready for a pull request, a CI retry, or a narrower follow-up task.

It is not a substitute for CI, human code review, or repository-specific release checks. It is a way to keep terminal evidence from becoming vague after the run ends.

Key takeaways

  • Treat terminal evidence as a review artifact, not as a transcript dump.
  • Capture the planned command, the actual command, the working directory, and the result.
  • Keep failed commands visible until a reviewer understands whether the failure is caused by the change, the environment, or a pre-existing issue.
  • Connect terminal evidence to repository instructions, pull request review, and CI workflow evidence.
  • Use placeholders for secrets and prompts; never paste real credentials or full private outputs into examples.

Smoke-test workflow

Before a reviewer trusts an agent-authored change, ask for one compact smoke test record.

Setup:

  • Confirm the agent read the repository instruction file that applies to the changed path.
  • Confirm the command will run from the repository root or record the exact subdirectory.
  • Confirm no command will print tokens, private prompts, customer data, or full generated responses.

Happy-path request:

Run the smallest repository command that verifies the changed behavior. Before running it, state why this command is relevant. After it finishes, summarize only the result and the review implication.

Error-path request:

If the command fails, keep the first failure visible. Do not retry with a broader command until the failure is classified as changed-code failure, environment failure, dependency failure, or unrelated pre-existing failure.
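
The error-path rule can be expressed as a small gate that blocks a broader retry until the first failure has been preserved and classified. The sketch below is illustrative only: the enum FailureClass and the function may_run_broader_command are hypothetical names, and the four values simply mirror the classifications listed in the request above.

```python
from enum import Enum
from typing import Optional


class FailureClass(Enum):
    """The four classifications named in the error-path request."""

    CHANGED_CODE = "changed-code failure"
    ENVIRONMENT = "environment failure"
    DEPENDENCY = "dependency failure"
    PRE_EXISTING = "unrelated pre-existing failure"


def may_run_broader_command(first_failure_output: str,
                            classification: Optional[FailureClass]) -> bool:
    """Allow a broader retry only once the first failure is kept and classified."""
    if not first_failure_output.strip():
        # The original failing output was not preserved, so the signal is lost.
        return False
    return classification is not None
```

Keeping the classification as an explicit value, rather than free text, makes it harder for a retry to proceed on an unexamined failure.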

Sanitized log record:

command_id: "cmd-001"
purpose: "Verify the changed validation path"
cwd: "<REPOSITORY_ROOT>"
command: "<SAFE_COMMAND_PLACEHOLDER>"
exit_result: "pass | fail | not_run"
evidence_summary: "<SHORT_RESULT_SUMMARY>"
changed_files_checked: ["<PATH_PLACEHOLDER>"]
follow_up: "<NONE_OR_NEXT_CHECK>"
reviewer_note: "<HUMAN_DECISION_PLACEHOLDER>"
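
One way to fill a record like the one above is to wrap the command run itself. The following is a minimal sketch using Python's standard subprocess module. The function name run_and_record, the 200-character summary limit, and the choice to summarize the last non-empty output line are assumptions for illustration. The caller is still responsible for choosing a command that does not print secrets.

```python
import shlex
import subprocess
import sys
from pathlib import Path


def run_and_record(command_id, purpose, argv, cwd=".", changed_files=None):
    """Run argv without a shell and return a dict shaped like the record above."""
    completed = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
    # On failure, prefer stderr for the summary when it has content.
    if completed.returncode != 0 and completed.stderr.strip():
        stream = completed.stderr
    else:
        stream = completed.stdout
    lines = [line for line in stream.splitlines() if line.strip()]
    summary = lines[-1][:200] if lines else "(no output)"
    return {
        "command_id": command_id,
        "purpose": purpose,
        "cwd": str(Path(cwd).resolve()),
        "command": shlex.join(argv),
        "exit_result": "pass" if completed.returncode == 0 else "fail",
        "evidence_summary": summary,
        "changed_files_checked": list(changed_files or []),
        "follow_up": "",
        "reviewer_note": "",  # left empty for the human reviewer
    }
```

For example, run_and_record("cmd-001", "Demonstrate the record shape", [sys.executable, "-c", "print('ok')"]) uses the Python interpreter as a stand-in for a real repository check. The summary should be passed through redaction before it is attached to a pull request.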

Failure modes

  • Evidence gap: the agent cannot inspect the failing log, source page, pull request, or local command output. The safe action is to stop and record the missing evidence instead of guessing.
  • Scope drift: the agent edits files that are not connected to the observed failure. Keep the repair tied to the failing signal and leave unrelated cleanup for a separate task.
  • Environment mismatch: the local check uses different versions, credentials, feature flags, or runtime settings than the hosted path. Record the mismatch before treating the result as proof.
  • Unreviewed fallback: the agent changes models, endpoints, permissions, or retry behavior to make a run pass without preserving the review boundary. Treat access and provider failures as operational blockers, not as failures of the change under review.
  • Weak handoff: the final note says the issue is fixed but omits the command, result, changed files, and remaining uncertainty. That makes the next operator repeat the investigation.
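
The environment-mismatch item above asks for differences to be recorded before a local result is treated as proof. A minimal sketch of that comparison follows. The function name environment_mismatches and the example setting names are hypothetical, and how the hosted values are obtained is outside the scope of this example. Compare only non-secret descriptors such as versions and flag names, never credential values.

```python
def environment_mismatches(local, hosted):
    """Return {setting: (local_value, hosted_value)} for every differing setting."""
    mismatches = {}
    for key in sorted(set(local) | set(hosted)):
        local_value = local.get(key, "<missing>")
        hosted_value = hosted.get(key, "<missing>")
        if local_value != hosted_value:
            mismatches[key] = (local_value, hosted_value)
    return mismatches


# Hypothetical settings, for illustration only.
local_env = {"python": "3.11", "feature_flag_x": "on"}
hosted_env = {"python": "3.12", "feature_flag_x": "on", "region": "eu"}
print(environment_mismatches(local_env, hosted_env))
```

A non-empty result belongs in the follow_up or reviewer_note field so the reviewer can weigh the local result accordingly.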

Sources checked

Contract details to verify

| Area | What to verify | Source URL | Accessed | Safe candidate wording |
|---|---|---|---|---|
| Repository instructions | Whether the agent used the instruction file that applies to the edited path | https://github.com/openai/codex/blob/main/docs/agents_md.md | 2026-06-18 | “Confirm the agent read the applicable repository instructions before accepting the command evidence.” |
| Project memory | Whether project-level notes are relevant to the current agent run | https://code.claude.com/docs/en/memory | 2026-06-18 | “Treat project memory as context to verify, not as proof that the command result is correct.” |
| CI workflows | Whether local command evidence should be compared with CI workflow results | https://docs.github.com/en/actions | 2026-06-18 | “Use CI results as a separate signal when a local command does not cover the full workflow.” |
| Pull request review | Whether the evidence belongs in a reviewable pull request handoff | https://docs.github.com/en/pull-requests | 2026-06-18 | “Attach the command result summary to the pull request review context so reviewers can inspect the change.” |

FAQ

How much terminal output should the agent include?

Include enough output to show the command result and the relevant failure or success signal. Do not paste full logs when a short excerpt and command metadata are enough for review.
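
A short excerpt can be produced mechanically. The sketch below keeps the tail of the output and states how much was dropped. The function name excerpt and the default of 20 lines are assumptions for illustration. Keeping the tail is itself an assumption: when the first failing line appears early in the log, the excerpt should be chosen around that line instead.

```python
def excerpt(output: str, max_lines: int = 20) -> str:
    """Keep the last max_lines lines and note how many earlier lines were dropped."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return "\n".join(lines)
    dropped = len(lines) - max_lines
    marker = f"[... {dropped} earlier lines omitted ...]"
    return "\n".join([marker] + lines[-max_lines:])
```

The omission marker matters for review: it tells the reader that the excerpt is partial, so the full log can be requested if the excerpt is not enough.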

Should a failed command block the change?

A failed command should block blind acceptance. It may still be acceptable if the failure is classified, unrelated to the change, and recorded with a clear follow-up.

Is local terminal evidence enough when CI exists?

No. Local evidence explains what the agent checked before handoff. CI workflow results remain a separate review signal.

What should be redacted?

Redact credentials, private prompts, customer data, full generated responses, and any output that would expose secrets. Use <API_KEY_PLACEHOLDER> for credential examples.
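
As a rough illustration, a redaction pass might replace suspected credential values with the placeholder before output is stored. The pattern list and the function name redact below are hypothetical and deliberately narrow: they only catch key/value pairs whose key name ends in a credential-like word. A pattern-based pass is a safety net and does not replace choosing commands that avoid printing secrets in the first place.

```python
import re

# Hypothetical patterns for illustration; real projects need their own list.
_SECRET_PATTERNS = [
    # "key=value" or "key: value" pairs whose key name suggests a credential.
    re.compile(r"(?i)(api[_-]?key|token|secret|password)\b(\s*[:=]\s*)(\S+)"),
]


def redact(text: str) -> str:
    """Replace suspected credential values with <API_KEY_PLACEHOLDER>."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(r"\1\2<API_KEY_PLACEHOLDER>", text)
    return text


print(redact("export MY_TOKEN=abc123"))
```

Ordinary output such as "tests passed: 12" is left unchanged, because no key name in the line matches a pattern.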

Where does CometAPI fit in this workflow?

If your coding agents route model calls through a gateway, keep review and command evidence separate from provider-specific claims. For gateway setup work, start with CometAPI and verify exact API behavior in the relevant product documentation before increasing usage.

Reader next step

Run the next implementation or review pass against Agent Memory Review Before Long-Running Tasks, then keep Agent Run Evidence Ledgers for Human Review nearby as a reference for the surrounding evidence and source boundaries.