🏆 CLEVER Leaderboard

Curated Lean Verified Code Generation Benchmark

GitHub Repository 📄 Paper 📦 PyPI Package

Legend

💚 Fully open-sourced method

💙 Partially open-sourced method

FS: Few-shot prompting with 1-2 examples

COPRA: Symbolic proof search agent [arXiv]

Compiled: Code is syntactically valid and type-checks

Proved: Proofs accepted by Lean's kernel

Pass@k-sec: Success rate within 600-second time budget (k=600)

# Proofs Generated: # Proofs generated while Spec and Impl Certification

Complete pipeline success: specification certification + implementation certification (Pass@600-sec)
Sorted based on first "End-to-End Code Generation" column, then "# Proofs Generated"

#	Model	Approach	End-to-End Code Generation	Note	# Proofs Generated
1	💙 Claude-3.7	COPRA-enhanced	1/161	Problem 53	2/161 (spec) + 14/161 (impl) = 16/282
1	💚 DeepSeek-R1	Few-Shot	1/161	Problem 53	1/161 (spec) + 9/161 (impl) = 10/282
1	💚 GPT OSS 20b	COPRA-enhanced	1/161	Problem 53	2/161 (spec) + 8/161 (impl) = 10/282
1	💙 GPT-4o	COPRA-enhanced	1/161	Problem 53	3/161 (spec) + 6/161 (impl) = 9/282
1	💙 GPT-4o mini	Few-Shot	1/161	Problem 53	2/161 (spec) + 3 / 161 (impl) = 5/282
1	💙 Claude-3.7	Few-Shot	1/161	Problem 53	1/161 (spec) + 3/161 (impl) = 4/282
6	💙 GPT-4o	Few-Shot	0/161	-	1/161 (spec) + 1/161 (impl) = 2/282
6	💙 GPT-5 mini (For Code Generation) + Kimina Prover (For proofs)	Few-Shot	0/161	-	0/161 (spec) + 1/161 (impl) = 1/282

Pass@600-sec rates for specification certification and implementation certification
(Compiled: syntactically valid and type-checks; Proved: proofs accepted by Lean's kernel)
Sorted based on average compilation rates across both stages

Model	Approach	Spec Certification		Impl Certification		End-to-End
Model	Approach	Compiled	Proved	Compiled	Proved	End-to-End
Few-Shot Baseline
💙 GPT-4o mini	Few-Shot	82.609%	1.242%	83.230%	1.863%	0.621%
💙 Claude-3.7	Few-Shot	86.957%	0.621%	65.217%	1.863%	0.621%
💙 GPT-4o	Few-Shot	84.472%	0.621%	68.323%	0.621%	0%
💚 DeepSeek-R1	Few-Shot	71.42%	0.621%	60.870%	5.559%	0.621%
COPRA Baseline
💙 Claude-3.7	COPRA-enhanced	81.366%	1.242%	65.217%	8.696%	0.621%
💚 GPT OSS 20b	COPRA-enhanced	78.261%	1.242%	65.839%	4.969%	0.621%
💙 GPT-4o	COPRA-enhanced	76.398%	1.863%	68.323%	3.727%	0.621%

📝 Submission Guidelines

To add your results to this leaderboard:

Evaluate your approach using the CLEVER Python API
Document your methodology with a preprint or publication
Submit your results by contacting us with evaluation metrics and methodology details

Contact: amitayush@utexas.edu