🏆 CLEVER Leaderboard

Curated Lean Verified Code Generation Benchmark

Legend
💚 Fully open-sourced method
💙 Partially open-sourced method
FS: Few-shot prompting with 1-2 examples
COPRA: Symbolic proof search agent [arXiv]
Compiled: Code is syntactically valid and type-checks
Proved: Proofs accepted by Lean's kernel
Pass@k-sec: Success rate within 600-second time budget (k=600)
# Proofs Generated: # Proofs generated while Spec and Impl Certification
🏆 End-to-End Performance Rankings
Complete pipeline success: specification certification + implementation certification (Pass@600-sec)
Sorted based on first "End-to-End Code Generation" column, then "# Proofs Generated"
# Model Approach End-to-End Code Generation Note # Proofs Generated
1 💙 Claude-3.7 COPRA-enhanced 1/161 Problem 53 2/161 (spec) + 14/161 (impl) = 16/282
1 💚 DeepSeek-R1 Few-Shot 1/161 Problem 53 1/161 (spec) + 9/161 (impl) = 10/282
1 💚 GPT OSS 20b COPRA-enhanced 1/161 Problem 53 2/161 (spec) + 8/161 (impl) = 10/282
1 💙 GPT-4o COPRA-enhanced 1/161 Problem 53 3/161 (spec) + 6/161 (impl) = 9/282
1 💙 GPT-4o mini Few-Shot 1/161 Problem 53 2/161 (spec) + 3 / 161 (impl) = 5/282
1 💙 Claude-3.7 Few-Shot 1/161 Problem 53 1/161 (spec) + 3/161 (impl) = 4/282
6 💙 GPT-4o Few-Shot 0/161 - 1/161 (spec) + 1/161 (impl) = 2/282
6 💙 GPT-5 mini (For Code Generation) + Kimina Prover (For proofs) Few-Shot 0/161 - 0/161 (spec) + 1/161 (impl) = 1/282
📊 Detailed Performance Breakdown
Pass@600-sec rates for specification certification and implementation certification
(Compiled: syntactically valid and type-checks; Proved: proofs accepted by Lean's kernel)
Sorted based on average compilation rates across both stages
Model Approach Spec Certification Impl Certification End-to-End
Compiled Proved Compiled Proved
Few-Shot Baseline
💙 GPT-4o mini Few-Shot 82.609% 1.242% 83.230% 1.863% 0.621%
💙 Claude-3.7 Few-Shot 86.957% 0.621% 65.217% 1.863% 0.621%
💙 GPT-4o Few-Shot 84.472% 0.621% 68.323% 0.621% 0%
💚 DeepSeek-R1 Few-Shot 71.42% 0.621% 60.870% 5.559% 0.621%
COPRA Baseline
💙 Claude-3.7 COPRA-enhanced 81.366% 1.242% 65.217% 8.696% 0.621%
💚 GPT OSS 20b COPRA-enhanced 78.261% 1.242% 65.839% 4.969% 0.621%
💙 GPT-4o COPRA-enhanced 76.398% 1.863% 68.323% 3.727% 0.621%
📝 Submission Guidelines

To add your results to this leaderboard:

  1. Evaluate your approach using the CLEVER Python API
  2. Document your methodology with a preprint or publication
  3. Submit your results by contacting us with evaluation metrics and methodology details

Contact: amitayush@utexas.edu