Public historical benchmarks

PatchProof benchmarks are reproducible, read-only evaluations of pinned historical bug-fix commits. They are not endorsements by the upstream maintainers.

Each benchmark records:

  • Repository and immutable base/head SHAs.
  • Adapter and explicit commands.
  • Selection reasons and granularity.
  • Expected and actual canonical status.
  • Duration, fallbacks, unsupported cases, and false-negative causes.
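To make the recorded fields concrete, here is a minimal sketch of one record as a Python dataclass. The class name, field names, types, and placeholder values are all hypothetical; they mirror the list above and are not PatchProof's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkRecord:
    """Hypothetical shape of one benchmark record (illustrative names only)."""

    repository: str
    base_sha: str                       # immutable base commit
    head_sha: str                       # immutable head (fix) commit
    adapter: str
    commands: tuple[str, ...]           # explicit commands, stored as given
    selection_reasons: tuple[str, ...]
    granularity: str
    expected_status: str                # expected canonical status
    actual_status: str                  # canonical status actually observed
    duration_seconds: float
    fallbacks: tuple[str, ...] = ()
    unsupported: bool = False
    false_negative_cause: Optional[str] = None

    def matches_expectation(self) -> bool:
        """True when the observed status equals the expected one."""
        return self.expected_status == self.actual_status


# Placeholder values only; none of these come from a real run.
record = BenchmarkRecord(
    repository="example/repo",
    base_sha="<base-sha>",
    head_sha="<head-sha>",
    adapter="example-adapter",
    commands=("example test command",),
    selection_reasons=("example reason",),
    granularity="test",
    expected_status="proven",
    actual_status="inconclusive",
    duration_seconds=1.5,
)
```

Freezing the dataclass reflects the read-only nature of the evaluations: a record is written once and compared, never mutated.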

Results are generated from benchmarks/manifest.json. Unsupported and inconclusive cases remain visible rather than being excluded.
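The "remain visible" rule can be sketched as a summary that counts every case by status and reports zero counts instead of omitting rows. The manifest schema is not shown here, so the `status` key, the status names (in particular `unsupported`), and the sample cases below are assumptions for illustration.

```python
from collections import Counter

# Assumed status vocabulary; only some of these names appear in the text.
KNOWN_STATUSES = ("proven", "not_proven", "inconclusive", "unsupported")


def summarize(results):
    """Count cases per status without filtering any case out.

    Known statuses are seeded with zero so they always appear in the
    summary, and any unexpected status is still counted instead of dropped.
    """
    counts = Counter({status: 0 for status in KNOWN_STATUSES})
    counts.update(r["status"] for r in results)
    return dict(counts)


# Hypothetical cases, not real benchmark output.
sample_results = [
    {"case": "case-a", "status": "proven"},
    {"case": "case-b", "status": "inconclusive"},
    {"case": "case-c", "status": "unsupported"},
    {"case": "case-d", "status": "inconclusive"},
]

summary = summarize(sample_results)
```

The design point is that no case is excluded before counting, so the totals always add up to the number of cases in the input.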

First full run

The first full run evaluated all ten repositories on June 20, 2026:

  • Two repositories produced proven targeted evidence: one axios test and four Zod tests.
  • Two repositories produced not_proven targeted evidence.
  • Six repositories produced inconclusive targeted evidence.
  • One case exposed a pre-execution nested-project transplant bug; the bug was fixed and the case was rerun.
  • All ten aggregate results remained inconclusive because the historical clean suites did not run reliably in the minimal modern environment.

This is not presented as a 10/10 success story. It demonstrates that PatchProof preserves uncertainty and that historical dependency reconstruction is the primary benchmark challenge. See the permanent machine-readable results and the workflow run.
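The relationship between targeted and aggregate results in this run can be sketched as follows. The targeted counts come from the list above; the function name and the rule itself (an unreliable clean suite forces an inconclusive aggregate) are an assumed reading of the text, not PatchProof's documented logic.

```python
# Targeted evidence counts from the first full run, as stated above.
TARGETED_COUNTS = {"proven": 2, "not_proven": 2, "inconclusive": 6}


def aggregate_status(targeted_status: str, clean_suite_reliable: bool) -> str:
    """Assumed rule: without a reliable clean suite, the aggregate
    result stays inconclusive regardless of the targeted evidence."""
    if not clean_suite_reliable:
        return "inconclusive"
    return targeted_status


# In the first run, no historical clean suite ran reliably.
aggregates = [
    aggregate_status(status, clean_suite_reliable=False)
    for status, count in TARGETED_COUNTS.items()
    for _ in range(count)
]
```

Under this reading, a proven targeted test is not promoted to a proven aggregate result unless the surrounding suite can also be trusted.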

| Case                          | Mode    | Status   | Tests |
|-------------------------------|---------|----------|-------|
| axios-http-adapter-error      | inspect | selected | 1     |
| click-fish-completion         | inspect | selected | 1     |
| date-fns-chinese-month        | inspect | selected | 2     |
| httpx-request-timeout         | inspect | selected | 1     |
| pydantic-generator-max-length | inspect | selected | 1     |
| pytest-initial-conftest       | inspect | selected | 1     |
| requests-file-wrapper         | inspect | selected | 1     |
| vite-null-export-glob         | inspect | selected | 1     |
| vitest-inline-diff-config     | inspect | selected | 1     |
| zod-default-map-set-clone     | inspect | selected | 4     |

Deterministic regression-test evidence. No telemetry or required AI.