Proof, not vibes

The 50-case Agent PR Verification Bench

Internal v1 corpus: 50 cases across demo-saas and Epic Stack. Timaeus-full caught 35/35 known-bad PRs and false-blocked 1/15 known-good PRs. Wilson 95%: recall 90–100%, false-block 1–30%. Not a public SOTA claim.

internal v1 corpus · N = 50 · 2 products · last run June 3, 2026 · not a public SOTA claim
proof-report.md
# proof report — v1 corpus
run: 2026-06-03 · cases: 50
workflow       recall   F1
ci-only        0%       0.00
timaeus-full   100%     0.99
false blocks: 1/15 (Wilson 7%, 1–30%)
Current result

What each workflow caught

35 known-bad agent PRs and 15 known-good agent PRs across demo-saas and Epic Stack. Recall is the share of known-bad PRs blocked; false blocks are known-good PRs wrongly blocked.

  • CI only
    Bad caught: 0/35
    F1: 0.00
    Recall (Wilson 95%): 0% (0%, 10%)
    False blocks: 0%

    Build/typecheck-style checks. Blind to task fulfillment.

  • Generic AI review
    Bad caught: 14/35
    F1: 0.56
    Recall (Wilson 95%): 40% (26%, 56%)
    False blocks: 7%

    Comments on the diff. Misses subtle task failures.

  • Timaeus reviewer
    Bad caught: 30/35
    F1: 0.91
    Recall (Wilson 95%): 86% (71%, 94%)
    False blocks: 7%

    Separate-context review of the diff against the task.

  • Timaeus full
    Bad caught: 35/35
    F1: 0.99
    Recall (Wilson 95%): 100% (90%, 100%)
    False blocks: 7%

    Review + executable acceptance gates + regression replay.

Internal proof-corpus result, not a public SOTA claim. Wilson 95% intervals shown because N is small.
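
The F1 values listed for each workflow can be reproduced from the raw counts. The sketch below assumes that a block on a known-bad PR counts as a true positive and that each 7% false-block rate corresponds to 1 of 15 known-good PRs. The `WorkflowCounts` type, the `f1` helper, and the workflow identifiers are illustrative names, not part of the benchmark tooling.

```typescript
// Illustrative types and helpers; names are assumptions, not the benchmark's own code.
interface WorkflowCounts {
  name: string;
  badCaught: number;   // known-bad PRs blocked (true positives)
  badTotal: number;    // all known-bad PRs in the corpus
  goodBlocked: number; // known-good PRs wrongly blocked (false positives)
}

// F1 = 2TP / (2TP + FP + FN), treating "blocked" as the positive class.
function f1(c: WorkflowCounts): number {
  const tp = c.badCaught;
  const fn = c.badTotal - c.badCaught;
  const fp = c.goodBlocked;
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

const workflows: WorkflowCounts[] = [
  { name: "ci-only", badCaught: 0, badTotal: 35, goodBlocked: 0 },
  { name: "generic-ai-review", badCaught: 14, badTotal: 35, goodBlocked: 1 },
  { name: "timaeus-reviewer", badCaught: 30, badTotal: 35, goodBlocked: 1 },
  { name: "timaeus-full", badCaught: 35, badTotal: 35, goodBlocked: 1 },
];

for (const w of workflows) {
  console.log(w.name, f1(w).toFixed(2)); // 0.00, 0.56, 0.91, 0.99
}
```

Because ci-only blocks nothing, it has no false blocks but also no true positives, which is why its F1 is 0.00 despite the 0% false-block rate.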

What CI-only means here
CI-only here means the current build/typecheck-style workflow these products already run. The benchmark intentionally targets task-fulfillment, UI behavior, regression replay, and acceptance-criteria failures that ordinary build checks do not cover.
Wilson 95% confidence intervals

Headline numbers, with their uncertainty

N is small, so every rate is reported with a Wilson score interval rather than a bare percentage.

timaeus-full recall
100% (90%, 100%)

All 35 known-bad PRs were caught.

timaeus-full false-approval
0% (0%, 10%)

No known-bad PR slipped through in this corpus.

timaeus-full false-block
7% (1%, 30%)

One known-good PR of 15 was wrongly blocked.
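
For readers who want to check the intervals, here is a minimal sketch of the standard Wilson score interval with z = 1.96. The helper names `wilson`, `pct`, and `show` are illustrative, and rounding to whole percentages is an assumption about how the page formats its numbers.

```typescript
// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%).
function wilson(successes: number, n: number, z = 1.96): [number, number] {
  if (n === 0) return [0, 1];
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  // Clamp to [0, 1] to absorb floating-point noise at the boundaries.
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

// Format as "point (low, high)" in whole percentages.
const pct = (x: number): string => `${Math.round(x * 100)}%`;
const show = (k: number, n: number): string => {
  const [lo, hi] = wilson(k, n);
  return `${pct(k / n)} (${pct(lo)}, ${pct(hi)})`;
};

console.log("recall:        ", show(35, 35)); // 100% (90%, 100%)
console.log("false-approval:", show(0, 35));  // 0% (0%, 10%)
console.log("false-block:   ", show(1, 15));  // 7% (1%, 30%)
```

Unlike the naive normal approximation, the Wilson interval stays non-degenerate when the observed rate is 0% or 100%, which is why a perfect 35/35 still carries a lower bound of about 90%.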

Case corpus & error analysis

False approvals and false blocks

Corpus shape
  • 50 total cases
  • 35 known-bad PRs
  • 15 known-good PRs
  • products: demo-saas and Epic Stack
False approvals (FN)

Known-bad PRs let through. CI-only let through all 35. Timaeus-full let through 0 in this corpus.

False blocks (FP)

Known-good PRs wrongly blocked. Timaeus-full produced 1 of 15. Over-blocking is tracked as a first-class error.

Representative caught cases

What CI passed and Timaeus caught

Illustrative cases from the corpus (demo-saas and Epic Stack).

case/checkout-promo-code
Agent claim: “Promo codes now apply at checkout.”

Valid codes applied, but expired codes were silently accepted, discounting orders below cost.

CI
Green. Types compiled, existing tests passed.
Timaeus
Acceptance gate “reject expired code” failed; blocked with replayable evidence.
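
To make the failure concrete, the sketch below shows one way an expiry check and the two matching gate assertions might look. Everything here is hypothetical: the `Promo` and `Cart` shapes, the `expiresAt` field, the sample codes, and the `now` parameter are assumptions for illustration, not the corpus's actual code.

```typescript
// Hypothetical data shapes; the corpus's real schema is not shown in the report.
interface Promo { amount: number; expiresAt: Date }
interface Cart { subtotal: number; discount: number }

// Made-up sample codes for the sketch.
const codes = new Map<string, Promo>([
  ["SPRING10", { amount: 10, expiresAt: new Date("2099-01-01T00:00:00Z") }],
  ["OLD20", { amount: 20, expiresAt: new Date("2020-01-01T00:00:00Z") }],
]);

// Applies a promo only if it exists and has not expired at `now`.
function applyPromo(code: string, cart: Cart, now: Date = new Date()): Cart {
  const promo = codes.get(code);
  if (!promo || promo.expiresAt.getTime() <= now.getTime()) {
    return cart; // unknown or expired: leave the cart unchanged
  }
  return { ...cart, discount: promo.amount };
}

// Gate-style checks in the spirit of "apply valid code" and "reject expired code".
const now = new Date("2026-06-03T00:00:00Z");
const cart: Cart = { subtotal: 100, discount: 0 };
console.log(applyPromo("SPRING10", cart, now).discount); // 10
console.log(applyPromo("OLD20", cart, now).discount);    // 0
```

Passing `now` explicitly keeps the check deterministic, so the same gate gives the same answer whenever it is replayed.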
case/list-pagination
Agent claim: “Added pagination to the orders list.”

Page 2 re-queried from offset 0, so later pages showed duplicate rows.

CI
Green. No assertion covered cross-page uniqueness.
Timaeus
Separate-context review flagged the offset bug; acceptance gate confirmed duplicates.
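
A cross-page uniqueness check of the kind CI lacked could be sketched as follows. The in-memory `orders` array, `fetchPage`, and `pagesAreDisjoint` are hypothetical stand-ins for the real query and gate, which the report does not show.

```typescript
// Hypothetical in-memory stand-in for the orders table: ids 1..25.
const orders: number[] = Array.from({ length: 25 }, (_, i) => i + 1);

// Returns one page of order ids; `page` is 1-based.
function fetchPage(page: number, pageSize: number): number[] {
  // The offset must advance with the page; always using 0 repeats page 1.
  const offset = (page - 1) * pageSize;
  return orders.slice(offset, offset + pageSize);
}

// Acceptance-style check: no id appears on more than one page.
function pagesAreDisjoint(pageCount: number, pageSize: number): boolean {
  const seen = new Set<number>();
  for (let page = 1; page <= pageCount; page++) {
    for (const id of fetchPage(page, pageSize)) {
      if (seen.has(id)) return false;
      seen.add(id);
    }
  }
  return true;
}

console.log(pagesAreDisjoint(3, 10)); // true
```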
case/settings-save-toast
Agent claim: “Settings now persist on save.”

Success toast fired before the write resolved; a failed write still showed success.

CI
Green. UI rendered, request was dispatched.
Timaeus
UI-behavior gate asserted persisted state after reload; mismatch caught.
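
The ordering problem can be illustrated with a small sketch in which the toast fires only after the write settles. The `saveSettings` handler, the `write` callback, and the `Toast` type are hypothetical; the real component is not shown in the report.

```typescript
type Toast = "success" | "error";

// Hypothetical save handler: `write` stands in for the persistence call.
async function saveSettings(
  write: () => Promise<void>,
  showToast: (t: Toast) => void,
): Promise<boolean> {
  try {
    await write();        // wait for the write to settle first
    showToast("success"); // only reached if the write resolved
    return true;
  } catch {
    showToast("error");   // a rejected write never shows success
    return false;
  }
}

// Simulate a failing write followed by a successful one and record the toasts.
async function demo(): Promise<Toast[]> {
  const toasts: Toast[] = [];
  await saveSettings(() => Promise.reject(new Error("write failed")), (t) => toasts.push(t));
  await saveSettings(() => Promise.resolve(), (t) => toasts.push(t));
  return toasts;
}

demo().then((t) => console.log(t.join(", "))); // error, success
```

A check on the toast alone would not distinguish this from the buggy ordering on a successful write, which is why the gate described above asserts persisted state after a reload.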
Artifact examples

The evidence each run preserves

Every verdict ships with reproducible artifacts. The examples below show their shape.

taskscore.json · BLOCK
case: checkout-promo-code
claim: “Promo codes now apply at checkout.”
confidence: 0.94
acceptance gates: 3/4 passed
  • apply-valid-code
  • reject-expired-code
  • stack-with-sale-price
  • ui-shows-discount-line

A calibrated per-case verdict with evidence references.

TaskScore
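
As a rough illustration of the TaskScore artifact's shape, the card's fields could be modelled as below. The field names and the `gateSummary` helper are assumptions made for this sketch; the actual taskscore.json schema is not published here. The single failing gate is taken from the diff verdict for the same case.

```typescript
// Hypothetical shape for a taskscore.json artifact; field names are assumptions.
type Verdict = "MERGE" | "BLOCK" | "WARN";

interface GateResult { name: string; passed: boolean }

interface TaskScore {
  case: string;
  claim: string;
  verdict: Verdict;
  confidence: number; // between 0 and 1
  gates: GateResult[];
}

const example: TaskScore = {
  case: "checkout-promo-code",
  claim: "Promo codes now apply at checkout.",
  verdict: "BLOCK",
  confidence: 0.94,
  gates: [
    { name: "apply-valid-code", passed: true },
    { name: "reject-expired-code", passed: false },
    { name: "stack-with-sale-price", passed: true },
    { name: "ui-shows-discount-line", passed: true },
  ],
};

// Summarise gates the way the card does.
function gateSummary(score: TaskScore): string {
  const passed = score.gates.filter((g) => g.passed).length;
  return `${passed}/${score.gates.length} passed`;
}

console.log(gateSummary(example)); // 3/4 passed
```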
pr #482 · promo codes
  function applyPromo(code, cart) {
-   if (codes.has(code)) {
+   const promo = codes.get(code);
+   if (promo) {
      cart.discount = promo.amount;
      return cart;
    }
  }
block — expired codes are still accepted; acceptance gate reject-expired-code failed. CI was green.

The merge / block / warn decision, attached to the change.

Diff verdict
case evidence
  • checkout-promo-code/
    • task.md
    • acceptance/
      • apply-valid-code.spec.ts
      • reject-expired-code.spec.ts
    • review/separate-context.md
    • replay/checkout-promo.json

Per-case inputs, gate output, and review trace.

Case evidence tree
proof-report.md
# proof report — v1 corpus
run: 2026-06-03 · cases: 50
workflow       recall   F1
ci-only        0%       0.00
timaeus-full   100%     0.99
false blocks: 1/15 (Wilson 7%, 1–30%)

A reproducible run summary with the confusion matrix.

Proof report
Second product

Epic Stack: 6/6 bad caught, 0 false blocks

The v1 corpus spans two products, so the result is not specific to a single codebase. The Epic Stack fixture contributes 10 cases to the 50-case total.

epic-stack.txt
product        Epic Stack
cases          10  (6 known-bad, 4 known-good)
known-bad      6/6 caught
known-good     4/4 approved
false blocks   0
result         10/10 clean
Part of the corpus
Epic Stack cases are part of the 50-case total above, not in addition to it.
Methodology

How a case is scored

  1. Each case fixes a task, a completion claim, and a known good/bad label.
  2. CI runs build, typecheck, and existing tests — the same checks a team already has.
  3. Timaeus reviews the diff in a separate context and runs executable acceptance gates tied to the task.
  4. Regression behavior is sealed so later PRs replay it deterministically.
  5. A verdict (merge / block / warn) is emitted with attached evidence and a confidence score.
  6. Verdicts are scored against deterministic oracles, not LLM judgment, to produce the confusion matrix above.
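
The final scoring step can be sketched as a plain tally of verdicts against oracle labels. The report does not say how warn verdicts are scored, so this sketch reduces each verdict to a boolean `blocked`; the `ScoredCase` and `Confusion` types and the synthetic case ids are placeholders, not corpus data.

```typescript
// Illustrative types; names are assumptions, not the benchmark's own code.
interface ScoredCase { id: string; label: "bad" | "good"; blocked: boolean }
interface Confusion { tp: number; fn: number; fp: number; tn: number }

// Tally verdicts against oracle labels, with "blocked" as the positive class.
function score(cases: ScoredCase[]): Confusion {
  const m: Confusion = { tp: 0, fn: 0, fp: 0, tn: 0 };
  for (const c of cases) {
    if (c.label === "bad") {
      if (c.blocked) m.tp++; else m.fn++; // fn = false approval
    } else {
      if (c.blocked) m.fp++; else m.tn++; // fp = false block
    }
  }
  return m;
}

// Synthetic cases shaped like the timaeus-full result:
// 35 known-bad (all blocked) and 15 known-good (1 blocked).
const cases: ScoredCase[] = [
  ...Array.from({ length: 35 }, (_, i): ScoredCase =>
    ({ id: `bad-${i}`, label: "bad", blocked: true })),
  ...Array.from({ length: 15 }, (_, i): ScoredCase =>
    ({ id: `good-${i}`, label: "good", blocked: i === 0 })),
];

const m = score(cases);
console.log(`TP=${m.tp} FN=${m.fn} FP=${m.fp} TN=${m.tn}`); // TP=35 FN=0 FP=1 TN=14
```

Because the labels are fixed ahead of time, this step involves no model judgment: a verdict either matches the label or it does not.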
Caveats

What these numbers are not

Internal corpus
Results come from an internal 50-case corpus across two products (demo-saas and Epic Stack). They are not a public SOTA claim and not a leaderboard result.
Small sample
With 35 known-bad and 15 known-good cases, intervals are wide. Treat point estimates as indicative.
No guarantee
Define Done surfaces evidence and calibrated verdicts. It does not guarantee correctness or zero regressions.
Deterministic ground truth
Ground truth comes from deterministic oracles, not LLM judgment. Each case is labelled by an oracle, so a verdict is scored as right or wrong without a model grading itself.
What's next

Where the benchmark goes next

  • Expanding the corpus beyond 50 cases.
  • A third product beyond demo-saas and Epic Stack.
  • Larger external real-world agent-PR evaluations.

In our internal 50-case Agent PR Verification Bench, the full Timaeus pipeline caught all 35 known-bad PRs, where generic AI review caught 14 and CI-only caught none, at the cost of 1 false block among 15 known-good PRs.