Curated Lean Verified Code Generation Benchmark
# | Model | Approach | End-to-End | Problems Solved | Note |
---|---|---|---|---|---|
1 | 💙 GPT-4o mini | Few-Shot | 0.621% | 1/161 | Problem 53 |
1 | 💙 Claude-3.7 | Few-Shot | 0.621% | 1/161 | Problem 53 |
1 | 💙 DeepSeek-R1 | Few-Shot | 0.621% | 1/161 | Problem 53 |
1 | 💙 GPT-4o | COPRA-enhanced | 0.621% | 1/161 | Problem 53 |
1 | 💙 Claude-3.7 | COPRA-enhanced | 0.621% | 1/161 | Problem 53 |
6 | 💙 GPT-4o | Few-Shot | 0% | 0/161 | - |
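
The End-to-End column appears to be the Problems Solved count divided by the 161 benchmark problems, shown as a percentage with three decimals. A minimal sketch of that arithmetic follows; the helper name `pass_rate` is hypothetical and not part of any benchmark tooling.

```python
def pass_rate(solved: int, total: int = 161) -> str:
    """Format solved/total as a percentage with three decimal places.

    `total` defaults to 161, the denominator shown in the leaderboard rows.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{100 * solved / total:.3f}%"


print(pass_rate(1))  # 0.621%
```

Under this reading, each additional problem solved moves the End-to-End score by about 0.621 percentage points.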
Model | Approach | Spec Certification: Compiled | Spec Certification: Proved | Impl Certification: Compiled | Impl Certification: Proved | End-to-End |
---|---|---|---|---|---|---|
Few-Shot Baseline | | | | | | |
💙 GPT-4o | Few-Shot | 84.472% | 0.621% | 68.323% | 0.621% | 0% |
💙 GPT-4o mini | Few-Shot | 82.609% | 1.242% | 83.230% | 1.863% | 0.621% |
💙 Claude-3.7 | Few-Shot | 86.957% | 0.621% | 65.217% | 1.863% | 0.621% |
💙 DeepSeek-R1 | Few-Shot | 71.42% | 0.621% | 60.870% | 5.559% | 0.621% |
COPRA Baseline | | | | | | |
💙 GPT-4o | COPRA-enhanced | 76.398% | 1.863% | 68.323% | 3.727% | 0.621% |
💙 Claude-3.7 | COPRA-enhanced | 81.366% | 1.242% | 65.217% | 8.696% | 0.621% |
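
The detailed table reports percentages only. Assuming it uses the same 161-problem denominator as the leaderboard, a percentage can be mapped back to an approximate problem count. The sketch below uses a hypothetical helper, `implied_count`, and rounds to the nearest whole problem, so it is an estimate and not an official figure.

```python
def implied_count(percent: float, total: int = 161) -> int:
    """Estimate how many problems a percentage corresponds to.

    Assumes the percentage was computed as count / total * 100.
    """
    return round(percent / 100 * total)


print(implied_count(84.472))  # 136
print(implied_count(8.696))   # 14
```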
To add your results to this leaderboard, contact amitayush@utexas.edu.