# Proof report — v1 corpus

- run: 2026-06-03
- corpus: Agent PR Verification Bench (v1)
- products: demo-saas, Epic Stack
- cases: 50 (35 known-bad, 15 known-good)

> Public-safe example. Internal proof-corpus result, not a public SOTA claim.
> Wilson 95% intervals shown because N is small. Ground truth comes from
> deterministic oracles, not LLM judgment.

## Confusion matrix

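| workflow            | TP | FN | FP | TN | precision | recall (Wilson 95%) | F1   |
| ------------------- | -- | -- | -- | -- | --------- | ------------------- | ---- |
| ci-only             |  0 | 35 |  0 | 15 | n/a       | 0% (0%, 10%)        | 0.00 |
| generic-ai-review   | 14 | 21 |  1 | 14 | 0.93      | 40% (26%, 56%)      | 0.56 |
| timaeus-reviewer    | 30 |  5 |  1 | 14 | 0.97      | 86% (71%, 94%)      | 0.91 |
| timaeus-full        | 35 |  0 |  1 | 14 | 0.97      | 100% (90%, 100%)    | 0.99 |

A positive is a block: TP and FN count known-bad cases that were blocked and approved, FP and TN count known-good cases that were blocked and approved. ci-only blocks nothing, so its precision (0/0) is undefined.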
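
The derived columns follow directly from the four counts. The sketch below recomputes them, assuming the usual definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2·TP / (2·TP + FP + FN)). The helper names (`precision`, `recall`, `f1`, `fmt`) and the `COUNTS` dictionary are illustrative and not part of the bench itself. Returning `None` when a workflow blocks nothing is this sketch's convention for an undefined precision.

```python
from typing import Optional

# (TP, FN, FP, TN) per workflow, copied from the confusion matrix.
COUNTS = {
    "ci-only": (0, 35, 0, 15),
    "generic-ai-review": (14, 21, 1, 14),
    "timaeus-reviewer": (30, 5, 1, 14),
    "timaeus-full": (35, 0, 1, 14),
}


def precision(tp: int, fp: int) -> Optional[float]:
    """TP / (TP + FP); None when nothing was blocked (0/0 is undefined)."""
    flagged = tp + fp
    return tp / flagged if flagged else None


def recall(tp: int, fn: int) -> Optional[float]:
    """TP / (TP + FN); None when there are no known-bad cases."""
    known_bad = tp + fn
    return tp / known_bad if known_bad else None


def f1(tp: int, fp: int, fn: int) -> float:
    """2*TP / (2*TP + FP + FN); 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fmt(value: Optional[float]) -> str:
    """Two-decimal rendering, with 'n/a' for undefined values."""
    return "n/a" if value is None else f"{value:.2f}"


for name, (tp, fn, fp, _tn) in COUNTS.items():
    print(
        f"{name:<18} precision={fmt(precision(tp, fp))} "
        f"recall={fmt(recall(tp, fn))} f1={fmt(f1(tp, fp, fn))}"
    )
```

Writing F1 in terms of counts rather than as the harmonic mean of precision and recall keeps the ci-only row well defined even though its precision is not.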

## timaeus-full headline rates (Wilson 95%)

- recall:         100% (90%, 100%)   (35/35 known-bad blocked)
- false-approval:   0% (0%, 10%)     (0/35 known-bad approved)
- false-block:      7% (1%, 30%)     (1/15 known-good blocked)
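
The intervals in this report are consistent with the standard Wilson score interval for a binomial proportion. The following is a minimal sketch of that formula, not the bench's actual code: the function name `wilson_interval`, the default `z` of about 1.96 (two-sided 95%), and the rounding to whole percent are assumptions made for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion successes/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def as_percent(successes: int, n: int) -> str:
    """Render a rate as 'point% (low%, high%)', rounded to whole percent."""
    low, high = wilson_interval(successes, n)
    return f"{round(100 * successes / n)}% ({round(100 * low)}%, {round(100 * high)}%)"


# timaeus-full headline rates, using counts from the confusion matrix.
print("recall:        ", as_percent(35, 35))
print("false-approval:", as_percent(0, 35))
print("false-block:   ", as_percent(1, 15))
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and keeps a nonzero width when the observed rate is 0% or 100%, which is the situation for recall and false-approval here.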

## Second product (Epic Stack)

10/10 cases correct: all 6 known-bad caught and all 4 known-good approved (0 false approvals, 0 false blocks).

## Caveats

- Internal corpus across two products; not a public SOTA claim.
- Small N (35 known-bad, 15 known-good); treat point estimates as indicative and read them alongside their intervals.
- Not a guarantee of correctness or zero regressions.