The 50-case Agent PR Verification Bench
Internal v1 corpus: 50 cases across demo-saas and Epic Stack. Timaeus-full caught 35/35 known-bad PRs and false-blocked 1/15 known-good PRs. Wilson 95%: recall 90–100%, false-block 1–30%. Not a public SOTA claim.
What each workflow caught
35 known-bad agent PRs and 15 known-good agent PRs across demo-saas and Epic Stack. Recall is the share of known-bad PRs blocked; false blocks are known-good PRs wrongly blocked.
| Workflow | Bad caught | Recall (Wilson 95%) | False blocks | F1 | Interpretation |
|---|---|---|---|---|---|
| CI only | 0/35 | 0% (0%, 10%) | 0% | 0.00 | Build/typecheck-style checks. Blind to task fulfillment. |
| Generic AI review | 14/35 | 40% (26%, 56%) | 7% | 0.56 | Comments on the diff. Misses subtle task failures. |
| Timaeus reviewer | 30/35 | 86% (71%, 94%) | 7% | 0.91 | Separate-context review of the diff against the task. |
| Timaeus full | 35/35 | 100% (90%, 100%) | 7% | 0.99 | Review + executable acceptance gates + regression replay. |
Internal proof-corpus result, not a public SOTA claim. Wilson 95% intervals shown because N is small.
Headline numbers, with their uncertainty
N is small, so every rate is reported with a Wilson score interval rather than a bare percentage.
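As a minimal sketch of how such intervals can be computed, the function below implements the standard Wilson score interval. The choice of z = 1.96 and the absence of a continuity correction are assumptions, since the page does not state them; the function and helper names are illustrative only.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 is the usual value for a ~95% interval (assumed here).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


def as_percent(interval: tuple[float, float]) -> tuple[int, int]:
    """Round both bounds to whole percentages for display."""
    low, high = interval
    return round(low * 100), round(high * 100)


if __name__ == "__main__":
    # Counts taken from the results reported on this page.
    for label, k, n in [
        ("Timaeus full recall", 35, 35),
        ("Timaeus full false blocks", 1, 15),
    ]:
        print(label, as_percent(wilson_interval(k, n)))
```

Under these assumptions, 35/35 gives roughly 90% to 100% and 1/15 gives roughly 1% to 30%, matching the headline figures.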
All 35 known-bad PRs were caught.
No known-bad PR slipped through in this corpus.
One known-good PR of 15 was wrongly blocked.
False approvals and false blocks
- 50 total cases
- 35 known-bad PRs
- 15 known-good PRs
- products: demo-saas and Epic Stack
False approvals are known-bad PRs let through. CI-only let through all 35; Timaeus-full let through 0 in this corpus.
False blocks are known-good PRs wrongly blocked. Timaeus-full produced 1 of 15. Over-blocking is tracked as a first-class error.
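The F1 column can be related to these two error types with a short calculation. The sketch below treats "block" as the positive class, so a caught known-bad PR is a true positive, a false approval is a false negative, and a false block is a false positive. It also assumes that the 7% false-block rate reported for the generic AI review and the Timaeus reviewer corresponds to 1 of 15 known-good PRs; the page does not give those counts directly.

```python
def f1_from_counts(bad_caught: int, bad_total: int, false_blocks: int) -> float:
    """F1 with 'block' as the positive class.

    tp = known-bad PRs blocked
    fn = known-bad PRs let through (false approvals)
    fp = known-good PRs blocked (false blocks)
    """
    tp = bad_caught
    fn = bad_total - bad_caught
    fp = false_blocks
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


if __name__ == "__main__":
    # (workflow, bad caught, known-bad total, false blocks); the false-block
    # counts for the two middle rows are inferred from the reported 7%.
    rows = [
        ("CI only", 0, 35, 0),
        ("Generic AI review", 14, 35, 1),
        ("Timaeus reviewer", 30, 35, 1),
        ("Timaeus full", 35, 35, 1),
    ]
    for name, caught, total, blocks in rows:
        print(f"{name}: F1 = {f1_from_counts(caught, total, blocks):.2f}")
```

With those counts the calculation gives 0.00, 0.56, 0.91, and 0.99, consistent with the table.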
What CI passed and Timaeus caught
Illustrative cases from the corpus (demo-saas and Epic Stack).
Valid codes applied, but expired codes were silently accepted, discounting orders below cost.
- CI: Green. Types compiled, existing tests passed.
- Timaeus: Acceptance gate “reject expired code” failed; blocked with replayable evidence.
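To make the idea of an executable acceptance gate concrete, here is a minimal sketch of a "reject expired code" check. It is not the corpus fixture: the corpus gates appear to be `.spec.ts` files, and every name below (`PromoCode`, `apply_promo`, `gate_reject_expired_code`) is hypothetical and written in Python purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PromoCode:  # hypothetical model, not taken from the corpus
    code: str
    percent_off: int
    expires_at: datetime


class ExpiredPromoCode(Exception):
    """Raised when a promo code is past its expiry time."""


def apply_promo(total_cents: int, promo: PromoCode, now: datetime) -> int:
    """Return the discounted total, refusing codes that have expired."""
    if now >= promo.expires_at:
        raise ExpiredPromoCode(promo.code)
    return total_cents - total_cents * promo.percent_off // 100


def gate_reject_expired_code() -> bool:
    """Acceptance gate: passes only if an expired code is rejected."""
    now = datetime(2024, 1, 2, tzinfo=timezone.utc)
    expired = PromoCode("OLD20", 20, expires_at=now - timedelta(days=1))
    try:
        apply_promo(10_000, expired, now)
    except ExpiredPromoCode:
        return True
    return False  # the code was silently accepted: the gate fails


if __name__ == "__main__":
    print("reject-expired-code:", "pass" if gate_reject_expired_code() else "fail")
```

A build and typecheck would pass whether or not the expiry check exists, which is why a gate tied to the task is needed to see the failure.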
Page 2 re-queried from offset 0, so later pages showed duplicate rows.
- CI: Green. No assertion covered cross-page uniqueness.
- Timaeus: Separate-context review flagged the offset bug; acceptance gate confirmed duplicates.
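The pagination failure can be sketched as follows. The exact defect in the corpus PR is not shown on this page, so the buggy function below assumes the simplest version of it: the offset ignores the page number. All function names are hypothetical.

```python
from typing import Callable, Sequence


def fetch_page_buggy(rows: Sequence[int], page: int, page_size: int) -> list[int]:
    """Assumed form of the bug: every page is read from offset 0."""
    offset = 0  # ignores `page`
    return list(rows[offset:offset + page_size])


def fetch_page(rows: Sequence[int], page: int, page_size: int) -> list[int]:
    """Corrected version: 1-based page number mapped to a row offset."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    offset = (page - 1) * page_size
    return list(rows[offset:offset + page_size])


def has_cross_page_duplicates(
    fetch: Callable[[Sequence[int], int, int], list[int]],
    rows: Sequence[int],
    page_size: int,
    pages: int,
) -> bool:
    """Cross-page uniqueness check: no row may appear on two pages."""
    seen: set[int] = set()
    for page in range(1, pages + 1):
        for row in fetch(rows, page, page_size):
            if row in seen:
                return True
            seen.add(row)
    return False


if __name__ == "__main__":
    data = list(range(10))
    print("buggy has duplicates:", has_cross_page_duplicates(fetch_page_buggy, data, 5, 2))
    print("fixed has duplicates:", has_cross_page_duplicates(fetch_page, data, 5, 2))
```

A test that only inspects page 1 passes for both versions; the uniqueness check across pages is what separates them.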
Success toast fired before the write resolved; a failed write still showed success.
- CI: Green. UI rendered, request was dispatched.
- Timaeus: UI-behavior gate asserted persisted state after reload; mismatch caught.
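The toast case is an ordering problem, which a small asynchronous sketch can illustrate. This is an assumed reconstruction rather than the corpus code: `FakeStore`, `save_buggy`, `save_fixed`, and the gate function are hypothetical, and reading `store.saved` stands in for reloading the page and inspecting persisted state.

```python
import asyncio


class FakeStore:
    """Stand-in for a backend write that may fail (hypothetical)."""

    def __init__(self, fail: bool) -> None:
        self.fail = fail
        self.saved = None

    async def write(self, value: str) -> None:
        await asyncio.sleep(0)  # yield control, like a network round trip
        if self.fail:
            raise OSError("write failed")
        self.saved = value


async def save_buggy(store: FakeStore, value: str, toasts: list) -> None:
    task = asyncio.create_task(store.write(value))
    toasts.append("success")  # shown before the write has resolved
    try:
        await task
    except OSError:
        pass  # failure is swallowed; the success toast is already visible


async def save_fixed(store: FakeStore, value: str, toasts: list) -> None:
    try:
        await store.write(value)
    except OSError:
        toasts.append("error")
        return
    toasts.append("success")  # shown only after the write succeeded


def gate_toast_matches_persisted_state(save, fail: bool) -> bool:
    """Gate: a success toast must imply the value was actually persisted."""
    store = FakeStore(fail)
    toasts: list = []
    asyncio.run(save(store, "new-name", toasts))
    shows_success = "success" in toasts
    persisted = store.saved == "new-name"  # simulated reload
    return shows_success == persisted


if __name__ == "__main__":
    print("buggy, failing write:", gate_toast_matches_persisted_state(save_buggy, True))
    print("fixed, failing write:", gate_toast_matches_persisted_state(save_fixed, True))
```

Both versions behave identically when the write succeeds, so the gate only distinguishes them when a failure is injected.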
The evidence each run preserves
Every verdict ships with reproducible artifacts. The examples below show their shape.
TaskScore: a calibrated per-case verdict with evidence references.
- apply-valid-code
- reject-expired-code
- stack-with-sale-price
- ui-shows-discount-line
Diff verdict: the merge / block / warn decision, attached to the change.
Case evidence tree: per-case inputs, gate output, and review trace.
- checkout-promo-code/
  - task.md
  - acceptance/
    - apply-valid-code.spec.ts
    - reject-expired-code.spec.ts
  - review/separate-context.md
  - replay/checkout-promo.json
Proof report: a reproducible run summary with the confusion matrix.
Epic Stack: 6/6 bad caught, 0 false blocks
The v1 corpus spans two products, so the result is not specific to a single codebase. The Epic Stack fixture contributes 10 cases to the 50-case total.
- product: Epic Stack
- cases: 10 (6 known-bad, 4 known-good)
- known-bad: 6/6 caught
- known-good: 4/4 approved
- false blocks: 0
- result: 10/10 clean
How a case is scored
- Each case fixes a task, a completion claim, and a known good/bad label.
- CI runs build, typecheck, and existing tests — the same checks a team already has.
- Timaeus reviews the diff in a separate context and runs executable acceptance gates tied to the task.
- Regression behavior is sealed so later PRs replay it deterministically.
- A verdict (merge / block / warn) is emitted with attached evidence and a confidence score.
- Verdicts are scored against deterministic oracles, not LLM judgment, to produce the confusion matrix above.
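The last step in the list above can be sketched as a deterministic tally of verdicts against the fixed labels. The sketch assumes that only a "block" verdict counts as blocking, so "warn" and "merge" both count as letting the PR through; the page does not say how "warn" is scored. The per-case data in the example is synthetic and only reproduces the aggregate Timaeus-full counts (35/35 caught, 1/15 false blocks).

```python
from collections import Counter
from typing import Iterable, Tuple

LABELS = {"bad", "good"}
VERDICTS = {"merge", "block", "warn"}


def confusion_matrix(cases: Iterable[Tuple[str, str]]) -> dict:
    """Tally (label, verdict) pairs with 'block' as the positive class.

    Assumption: 'warn' is scored like 'merge' (the PR is not blocked).
    """
    counts: Counter = Counter()
    for label, verdict in cases:
        if label not in LABELS or verdict not in VERDICTS:
            raise ValueError(f"unexpected case: {label!r}, {verdict!r}")
        blocked = verdict == "block"
        if label == "bad":
            counts["tp" if blocked else "fn"] += 1  # fn = false approval
        else:
            counts["fp" if blocked else "tn"] += 1  # fp = false block
    return {key: counts[key] for key in ("tp", "fn", "fp", "tn")}


if __name__ == "__main__":
    # Synthetic verdicts matching the reported Timaeus-full aggregates.
    cases = (
        [("bad", "block")] * 35
        + [("good", "block")] * 1
        + [("good", "merge")] * 14
    )
    matrix = confusion_matrix(cases)
    print(matrix)
    print("recall:", matrix["tp"] / (matrix["tp"] + matrix["fn"]))
    print("false-block rate:", matrix["fp"] / (matrix["fp"] + matrix["tn"]))
```

Because the labels are fixed in advance, the tally involves no model judgment, which is the property the scoring step relies on.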
What these numbers are not
They are an internal proof-corpus result on a small corpus (50 cases, two products), not a public SOTA claim.
Where the benchmark goes next
- Expanding the corpus beyond 50 cases.
- A third product beyond demo-saas and Epic Stack.
- Larger external real-world agent-PR evaluations.
In our internal 50-case Agent PR Verification Bench, the full Timaeus pipeline caught all 35 known-bad PRs. CI-only let all 35 through, and generic AI review let 21 through.