🏆 CLEVER Leaderboard

Curated Lean Verified Code Generation Benchmark

Legend
💚 Fully open-sourced method
💙 Partially open-sourced method
FS: Few-shot prompting with 1-2 examples
COPRA: Symbolic proof search agent
Compiled: Code is syntactically valid and type-checks
Proved: Proofs accepted by Lean's kernel
Pass@k-sec: Success rate within 600-second time budget (k=600)
🏆 End-to-End Performance Rankings
Complete pipeline success: specification certification + implementation certification (Pass@600-sec)
# Model Approach End-to-End Problems Solved Note
1 💙 GPT-4o mini Few-Shot 0.621% 1/161 Problem 53
1 💙 Claude-3.7 Few-Shot 0.621% 1/161 Problem 53
1 💙 DeepSeek-R1 Few-Shot 0.621% 1/161 Problem 53
1 💙 GPT-4o COPRA-enhanced 0.621% 1/161 Problem 53
1 💙 Claude-3.7 COPRA-enhanced 0.621% 1/161 Problem 53
6 💙 GPT-4o Few-Shot 0% 0/161 -
📊 Detailed Performance Breakdown
Pass@600-sec rates for specification certification and implementation certification
Model Approach Spec Certification Impl Certification End-to-End
Compiled Proved Compiled Proved
Few-Shot Baseline
💙 GPT-4o Few-Shot 84.472% 0.621% 68.323% 0.621% 0%
💙 GPT-4o mini Few-Shot 82.609% 1.242% 83.230% 1.863% 0.621%
💙 Claude-3.7 Few-Shot 86.957% 0.621% 65.217% 1.863% 0.621%
💙 DeepSeek-R1 Few-Shot 71.42% 0.621% 60.870% 5.559% 0.621%
COPRA Baseline
💙 GPT-4o COPRA-enhanced 76.398% 1.863% 68.323% 3.727% 0.621%
💙 Claude-3.7 COPRA-enhanced 81.366% 1.242% 65.217% 8.696% 0.621%
📝 Submission Guidelines

To add your results to this leaderboard:

  1. Evaluate your approach using the CLEVER Python API
  2. Document your methodology with a preprint or publication
  3. Submit your results by contacting us with evaluation metrics and methodology details

Contact: amitayush@utexas.edu