Curated Lean Verified Code Generation Benchmark
| # | Model | Approach | End-to-End Code Generation | Note | # Proofs Generated |
|---|---|---|---|---|---|
| 1 | 💙 Claude-3.7 | COPRA-enhanced | 1/161 | Problem 53 | 2/161 (spec) + 14/161 (impl) = 16/282 |
| 1 | 💚 DeepSeek-R1 | Few-Shot | 1/161 | Problem 53 | 1/161 (spec) + 9/161 (impl) = 10/282 |
| 1 | 💚 GPT OSS 20b | COPRA-enhanced | 1/161 | Problem 53 | 2/161 (spec) + 8/161 (impl) = 10/282 |
| 1 | 💙 GPT-4o | COPRA-enhanced | 1/161 | Problem 53 | 3/161 (spec) + 6/161 (impl) = 9/282 |
| 1 | 💙 GPT-4o mini | Few-Shot | 1/161 | Problem 53 | 2/161 (spec) + 3 / 161 (impl) = 5/282 |
| 1 | 💙 Claude-3.7 | Few-Shot | 1/161 | Problem 53 | 1/161 (spec) + 3/161 (impl) = 4/282 |
| 6 | 💙 GPT-4o | Few-Shot | 0/161 | - | 1/161 (spec) + 1/161 (impl) = 2/282 |
| 6 | 💙 GPT-5 mini (For Code Generation) + Kimina Prover (For proofs) | Few-Shot | 0/161 | - | 0/161 (spec) + 1/161 (impl) = 1/282 |
| Model | Approach | Spec Certification | Impl Certification | End-to-End | ||
|---|---|---|---|---|---|---|
| Compiled | Proved | Compiled | Proved | |||
| Few-Shot Baseline | ||||||
| 💙 GPT-4o mini | Few-Shot | 82.609% | 1.242% | 83.230% | 1.863% | 0.621% |
| 💙 Claude-3.7 | Few-Shot | 86.957% | 0.621% | 65.217% | 1.863% | 0.621% |
| 💙 GPT-4o | Few-Shot | 84.472% | 0.621% | 68.323% | 0.621% | 0% |
| 💚 DeepSeek-R1 | Few-Shot | 71.42% | 0.621% | 60.870% | 5.559% | 0.621% |
| COPRA Baseline | ||||||
| 💙 Claude-3.7 | COPRA-enhanced | 81.366% | 1.242% | 65.217% | 8.696% | 0.621% |
| 💚 GPT OSS 20b | COPRA-enhanced | 78.261% | 1.242% | 65.839% | 4.969% | 0.621% |
| 💙 GPT-4o | COPRA-enhanced | 76.398% | 1.863% | 68.323% | 3.727% | 0.621% |
To add your results to this leaderboard:
Contact: amitayush@utexas.edu