Curated Lean Verified Code Generation Benchmark
| # | Model | Approach | End-to-End | Problems Solved | Note |
|---|---|---|---|---|---|
| 1 | 💙 GPT-4o mini | Few-Shot | 0.621% | 1/161 | Problem 53 |
| 1 | 💙 Claude-3.7 | Few-Shot | 0.621% | 1/161 | Problem 53 |
| 1 | 💙 DeepSeek-R1 | Few-Shot | 0.621% | 1/161 | Problem 53 |
| 1 | 💙 GPT-4o | COPRA-enhanced | 0.621% | 1/161 | Problem 53 |
| 1 | 💙 Claude-3.7 | COPRA-enhanced | 0.621% | 1/161 | Problem 53 |
| 6 | 💙 GPT-4o | Few-Shot | 0% | 0/161 | - |
| Model | Approach | Spec Certification | Impl Certification | End-to-End | ||
|---|---|---|---|---|---|---|
| Compiled | Proved | Compiled | Proved | |||
| Few-Shot Baseline | ||||||
| 💙 GPT-4o | Few-Shot | 84.472% | 0.621% | 68.323% | 0.621% | 0% |
| 💙 GPT-4o mini | Few-Shot | 82.609% | 1.242% | 83.230% | 1.863% | 0.621% |
| 💙 Claude-3.7 | Few-Shot | 86.957% | 0.621% | 65.217% | 1.863% | 0.621% |
| 💙 DeepSeek-R1 | Few-Shot | 71.42% | 0.621% | 60.870% | 5.559% | 0.621% |
| COPRA Baseline | ||||||
| 💙 GPT-4o | COPRA-enhanced | 76.398% | 1.863% | 68.323% | 3.727% | 0.621% |
| 💙 Claude-3.7 | COPRA-enhanced | 81.366% | 1.242% | 65.217% | 8.696% | 0.621% |
To add your results to this leaderboard:
Contact: amitayush@utexas.edu