Proof | Define Done

Proof, not vibes

The 50-case Agent PR Verification Bench

Internal v1 corpus: 50 cases across demo-saas and Epic Stack. Timaeus-full caught 35/35 known-bad PRs and false-blocked 1/15 known-good PRs. Wilson 95%: recall 90–100%, false-block 1–30%. Not a public SOTA claim.

internal v1 corpus · N = 50 · 2 products · last run June 3, 2026 · not a public SOTA claim

proof-report.md

# proof report — v1 corpus

run: 2026-06-03 · cases: 50

workflow       recall   F1
ci-only        0%       0.00
timaeus-full   100%     0.99

false blocks: 1/15 (7%; Wilson 95%: 1–30%)

Current result

What each workflow caught

35 known-bad agent PRs and 15 known-good agent PRs across demo-saas and Epic Stack. Recall is the share of known-bad PRs blocked; false blocks are known-good PRs wrongly blocked.

Workflow	Bad caught	Recall (Wilson 95%)	False blocks	F1	Interpretation
CI only	0/35	0% (0%, 10%)	0%	0.00	Build/typecheck-style checks. Blind to task fulfillment.
Generic AI review	14/35	40% (26%, 56%)	7%	0.56	Comments on the diff. Misses subtle task failures.
Timaeus reviewer	30/35	86% (71%, 94%)	7%	0.91	Separate-context review of the diff against the task.
Timaeus full	35/35	100% (90%, 100%)	7%	0.99	Review + executable acceptance gates + regression replay.

Internal proof-corpus result, not a public SOTA claim. Wilson 95% intervals shown because N is small.
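
The recall, false-block, and F1 columns can be reproduced from raw counts. The sketch below assumes "block" is the positive class and that 7% means 1 false block out of 15 known-good PRs, as stated for Timaeus full. The `Counts` type and the function names are illustrative, not part of the benchmark code.

```typescript
// Confusion counts for one workflow, treating "block" as the positive class.
interface Counts {
  tp: number; // known-bad PRs blocked
  fn: number; // known-bad PRs let through (false approvals)
  fp: number; // known-good PRs blocked (false blocks)
  tn: number; // known-good PRs approved
}

// Share of known-bad PRs that were blocked.
function recall(c: Counts): number {
  return c.tp / (c.tp + c.fn);
}

// Share of known-good PRs that were wrongly blocked.
function falseBlockRate(c: Counts): number {
  return c.fp / (c.fp + c.tn);
}

// F1 = 2TP / (2TP + FP + FN); taken as 0 when the denominator is 0.
function f1(c: Counts): number {
  const denom = 2 * c.tp + c.fp + c.fn;
  return denom === 0 ? 0 : (2 * c.tp) / denom;
}

const timaeusFull: Counts = { tp: 35, fn: 0, fp: 1, tn: 14 };
const ciOnly: Counts = { tp: 0, fn: 35, fp: 0, tn: 15 };

console.log(recall(timaeusFull).toFixed(2)); // 1.00
console.log(f1(timaeusFull).toFixed(2)); // 0.99 (70 / 71)
console.log(f1(ciOnly).toFixed(2)); // 0.00
```

Because F1 combines recall with precision, a single false block is enough to pull Timaeus full from 1.00 down to 0.99.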

What CI-only means here

CI-only here means the current build/typecheck-style workflow these products already run. The benchmark intentionally targets task-fulfillment, UI behavior, regression replay, and acceptance-criteria failures that ordinary build checks do not cover.

Wilson 95% confidence intervals

Headline numbers, with their uncertainty

N is small, so every rate is reported with a Wilson score interval rather than a bare percentage.

timaeus-full recall

100% (90%, 100%)

All 35 known-bad PRs were caught.

timaeus-full false-approval

0% (0%, 10%)

No known-bad PR slipped through in this corpus.

timaeus-full false-block

7% (1%, 30%)

One known-good PR of 15 was wrongly blocked.
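
For reference, a minimal sketch of how intervals like these can be computed, assuming the standard Wilson score formula with z = 1.96 for a 95% interval. The function names are illustrative.

```typescript
// Wilson score interval for k successes out of n trials.
// z = 1.96 corresponds to a two-sided 95% interval.
function wilson(k: number, n: number, z: number = 1.96): [number, number] {
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  // Clamp to [0, 1] to absorb floating-point drift at the boundaries.
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Rounds both bounds to whole percentages for display.
function asPercent(interval: [number, number]): [number, number] {
  return [Math.round(interval[0] * 100), Math.round(interval[1] * 100)];
}

console.log(asPercent(wilson(35, 35))); // recall: [90, 100]
console.log(asPercent(wilson(0, 35))); // false-approval: [0, 10]
console.log(asPercent(wilson(1, 15))); // false-block: [1, 30]
```

Unlike the naive normal approximation, the Wilson interval stays informative at 0% and 100%, which is why a 35/35 result still carries a lower bound of about 90%.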

Case corpus & error analysis

False approvals and false blocks

Corpus shape

50 total cases
35 known-bad PRs
15 known-good PRs
products: demo-saas and Epic Stack

False approvals (FN)

Known-bad PRs let through. CI-only let through all 35. Timaeus-full let through 0 in this corpus.

False blocks (FP)

Known-good PRs wrongly blocked. Timaeus-full produced 1 of 15. Over-blocking is tracked as a first-class error.

Representative caught cases

What CI passed and Timaeus caught

Illustrative cases from the corpus (demo-saas and Epic Stack).

case/checkout-promo-code

Agent claim: “Promo codes now apply at checkout.”

Valid codes applied, but expired codes were silently accepted, discounting orders below cost.

CI: Green. Types compiled, existing tests passed.
Timaeus: Acceptance gate “reject expired code” failed; blocked with replayable evidence.
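
A gate like “reject expired code” might look roughly like the sketch below. Everything here is hypothetical: the `Promo` and `Cart` shapes, the sample codes, and the `expiresAt` field are assumptions made for illustration, since the real demo-saas code is not shown.

```typescript
// Hypothetical data model; the real product types are not shown in the source.
interface Promo { amount: number; expiresAt: Date; }
interface Cart { subtotal: number; discount: number; }
type ApplyPromo = (code: string, cart: Cart, now: Date) => Cart;

const codes = new Map<string, Promo>([
  ["SPRING10", { amount: 10, expiresAt: new Date("2026-12-31T00:00:00Z") }],
  ["OLD20", { amount: 20, expiresAt: new Date("2025-01-01T00:00:00Z") }],
]);

// Mirrors the failure described above: any known code is accepted.
const applyPromoUnchecked: ApplyPromo = (code, cart) => {
  const promo = codes.get(code);
  return promo ? { subtotal: cart.subtotal, discount: promo.amount } : cart;
};

// One possible fix: also require that the code has not expired.
const applyPromoChecked: ApplyPromo = (code, cart, now) => {
  const promo = codes.get(code);
  if (!promo || promo.expiresAt.getTime() <= now.getTime()) {
    return cart; // unknown or expired: leave the cart unchanged
  }
  return { subtotal: cart.subtotal, discount: promo.amount };
};

// The gate passes only if an expired code leaves the discount at zero.
function rejectExpiredCodeGate(apply: ApplyPromo): boolean {
  const now = new Date("2026-06-03T00:00:00Z");
  const cart: Cart = { subtotal: 50, discount: 0 };
  return apply("OLD20", cart, now).discount === 0;
}

console.log(rejectExpiredCodeGate(applyPromoUnchecked)); // false: gate fails
console.log(rejectExpiredCodeGate(applyPromoChecked)); // true: gate passes
```

Both variants type-check and would pass a test that only applies a valid code, which is why this kind of failure stays green in CI.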

case/list-pagination

Agent claim: “Added pagination to the orders list.”

Page 2 re-queried from offset 0, so later pages showed duplicate rows.

CI: Green. No assertion covered cross-page uniqueness.
Timaeus: Separate-context review flagged the offset bug; acceptance gate confirmed duplicates.
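
The offset bug and a cross-page uniqueness check can be sketched as follows. The in-memory `rows` array and all function names are assumptions for illustration; the real orders list presumably queries a database.

```typescript
interface Row { id: number; }

// Correct paging: `page` is 1-based, so page 2 starts at offset pageSize.
function getPage(rows: readonly Row[], page: number, pageSize: number): Row[] {
  const offset = (page - 1) * pageSize;
  return rows.slice(offset, offset + pageSize);
}

// Mirrors the failure described above: every page re-queries from offset 0.
function getPageFromZero(rows: readonly Row[], _page: number, pageSize: number): Row[] {
  return rows.slice(0, pageSize);
}

// Returns true if any row id appears on more than one page (or twice on one page).
function hasDuplicateIds(pages: Row[][]): boolean {
  const seen = new Set<number>();
  for (const page of pages) {
    for (const row of page) {
      if (seen.has(row.id)) return true;
      seen.add(row.id);
    }
  }
  return false;
}

const rows: Row[] = [1, 2, 3, 4, 5, 6].map((id) => ({ id }));

console.log(hasDuplicateIds([getPage(rows, 1, 3), getPage(rows, 2, 3)])); // false
console.log(hasDuplicateIds([getPageFromZero(rows, 1, 3), getPageFromZero(rows, 2, 3)])); // true
```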

case/settings-save-toast

Agent claim: “Settings now persist on save.”

Success toast fired before the write resolved; a failed write still showed success.

CI: Green. UI rendered, request was dispatched.
Timaeus: UI-behavior gate asserted persisted state after reload; mismatch caught.
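
One way to avoid this class of bug is to show the toast only after the write settles. The sketch below is an assumption about the shape of such a handler, not the product's code: `write` and `showToast` are hypothetical callbacks.

```typescript
type Toast = "success" | "error";

// Awaits the write before choosing a toast, so a failed write
// can never be reported as a success.
async function saveSettings(
  write: () => Promise<void>,
  showToast: (toast: Toast) => void,
): Promise<boolean> {
  try {
    await write();
    showToast("success");
    return true;
  } catch (_err) {
    showToast("error");
    return false;
  }
}

// Usage: a write that rejects should surface an error toast.
const shownToasts: Toast[] = [];
saveSettings(
  () => Promise.reject(new Error("simulated write failure")),
  (toast) => { shownToasts.push(toast); },
).then((saved) => {
  console.log(saved, shownToasts); // false [ 'error' ]
});
```

A gate that reloads and compares persisted state, as described above, checks the outcome of the write and not just the fact that a request was dispatched.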

Artifact examples

The evidence each run preserves

Every verdict ships with reproducible artifacts. The examples below show their shape.

taskscore.json · BLOCK

case               checkout-promo-code
claim              “Promo codes now apply at checkout.”
confidence         0.94
acceptance gates   3/4 passed

apply-valid-code         passed
reject-expired-code      failed
stack-with-sale-price    passed
ui-shows-discount-line   passed

A calibrated per-case verdict with evidence references.

pr #482 · promo codes

 function applyPromo(code, cart) {
-  if (codes.has(code)) {
+  const promo = codes.get(code);
+  if (promo) {
     cart.discount = promo.amount;
     return cart;
   }

block — expired codes are still accepted; acceptance gate reject-expired-code failed. CI was green.

The merge / block / warn decision, attached to the change.

case evidence

checkout-promo-code/
  task.md
  acceptance/
    apply-valid-code.spec.ts
    reject-expired-code.spec.ts
  review/separate-context.md
  replay/checkout-promo.json

Per-case inputs, gate output, and review trace.

proof-report.md

# proof report — v1 corpus

run: 2026-06-03 · cases: 50

workflow       recall   F1
ci-only        0%       0.00
timaeus-full   100%     0.99

false blocks: 1/15 (7%; Wilson 95%: 1–30%)

A reproducible run summary with the confusion matrix.

Second product

Epic Stack: 6/6 bad caught, 0 false blocks

The v1 corpus spans two products, so the result is not specific to a single codebase. The Epic Stack fixture contributes 10 cases to the 50-case total.

epic-stack.txt

product        Epic Stack
cases          10  (6 known-bad, 4 known-good)
known-bad      6/6 caught
known-good     4/4 approved
false blocks   0
result         10/10 clean

Part of the corpus

Epic Stack cases are part of the 50-case total above, not in addition to it.

Methodology

How a case is scored

Each case fixes a task, a completion claim, and a known good/bad label.
CI runs build, typecheck, and existing tests — the same checks a team already has.
Timaeus reviews the diff in a separate context and runs executable acceptance gates tied to the task.
Regression behavior is sealed so later PRs replay it deterministically.
A verdict (merge / block / warn) is emitted with attached evidence and a confidence score.
Verdicts are scored against deterministic oracles, not LLM judgment, to produce the confusion matrix above.
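
The final scoring step can be sketched as a tally of verdicts against oracle labels. The sketch assumes that only a "block" verdict counts as catching a PR, with "warn" and "merge" both letting it through; the source does not say how "warn" is scored. The type and function names are illustrative.

```typescript
type Verdict = "merge" | "block" | "warn";
type Label = "known-bad" | "known-good";

interface CaseResult { id: string; label: Label; verdict: Verdict; }
interface Confusion { tp: number; fn: number; fp: number; tn: number; }

// Tallies a confusion matrix from oracle labels and emitted verdicts.
// Assumption: "block" is the only verdict that stops a PR.
function scoreVerdicts(results: CaseResult[]): Confusion {
  const confusion: Confusion = { tp: 0, fn: 0, fp: 0, tn: 0 };
  for (const result of results) {
    const blocked = result.verdict === "block";
    if (result.label === "known-bad") {
      if (blocked) confusion.tp += 1; // bad PR caught
      else confusion.fn += 1; // false approval
    } else {
      if (blocked) confusion.fp += 1; // false block
      else confusion.tn += 1; // good PR approved
    }
  }
  return confusion;
}

// Example with three hypothetical cases.
const sample: CaseResult[] = [
  { id: "case-a", label: "known-bad", verdict: "block" },
  { id: "case-b", label: "known-good", verdict: "merge" },
  { id: "case-c", label: "known-good", verdict: "block" },
];

console.log(scoreVerdicts(sample)); // { tp: 1, fn: 0, fp: 1, tn: 1 }
```

Because the labels are fixed ahead of time, the tally is a plain comparison and needs no model in the loop.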

Caveats

What these numbers are not

Internal corpus

Results come from an internal 50-case corpus across two products (demo-saas and Epic Stack). They are not a public SOTA claim and not a leaderboard result.

Small sample

With 35 known-bad and 15 known-good cases, intervals are wide. Treat point estimates as indicative.

No guarantee

Define Done surfaces evidence and calibrated verdicts. It does not guarantee correctness or zero regressions.

Deterministic ground truth

Ground truth comes from deterministic oracles, not LLM judgment. Each case is labelled by an oracle, so a verdict is scored as right or wrong without a model grading itself.

What's next

Where the benchmark goes next

Expanding the corpus beyond 50 cases.
A third product beyond demo-saas and Epic Stack.
Larger external real-world agent-PR evaluations.

In our internal 50-case Agent PR Verification Bench, the full Timaeus pipeline caught all 35 known-bad PRs, where generic AI review caught 14 and CI-only caught none.