Public historical benchmarks
PatchProof benchmarks are reproducible, read-only evaluations of pinned historical bug-fix commits. They are not endorsements by the upstream maintainers.
The benchmark records:
- Repository and immutable base/head SHAs.
- Adapter and explicit commands.
- Selection reasons and granularity.
- Expected and actual canonical status.
- Duration, fallbacks, unsupported cases, and false-negative causes.
Results are generated from benchmarks/manifest.json. Unsupported and inconclusive cases remain visible rather than being excluded.
First full run
The first full run evaluated all ten repositories on June 20, 2026:
- Two repositories produced
proventargeted evidence: one axios test and four Zod tests. - Two repositories produced
not_proventargeted evidence. - Six repositories produced
inconclusivetargeted evidence. - One case exposed a pre-execution nested-project transplant bug, which was fixed and rerun.
- All ten aggregate results remained
inconclusivebecause the historical clean suites did not run reliably in the minimal modern environment.
This is not presented as a 10/10 success story. It demonstrates that PatchProof preserves uncertainty and that historical dependency reconstruction is the primary benchmark challenge. See the permanent machine-readable results and the workflow run.
| Case | Mode | Status | Tests |
|---|---|---|---|
| axios-http-adapter-error | inspect | selected | 1 |
| click-fish-completion | inspect | selected | 1 |
| date-fns-chinese-month | inspect | selected | 2 |
| httpx-request-timeout | inspect | selected | 1 |
| pydantic-generator-max-length | inspect | selected | 1 |
| pytest-initial-conftest | inspect | selected | 1 |
| requests-file-wrapper | inspect | selected | 1 |
| vite-null-export-glob | inspect | selected | 1 |
| vitest-inline-diff-config | inspect | selected | 1 |
| zod-default-map-set-clone | inspect | selected | 4 |