PutnamBench: A Multilingual Mathematics Benchmark for Formal Theorem-Proving

Abstract

We present PutnamBench, a new multilingual benchmark for evaluating the ability of neural theorem-provers to solve competition mathematics problems.

PutnamBench consists of 1709 hand-constructed formalizations of problems sourced from the William Lowell Putnam Mathematical Competition, the premier undergraduate-level mathematics competition in North America. There are 657 problems formalized in Lean 4, 640 formalized in Isabelle, and 412 formalized in Coq.

Proving the theorems requires significant problem-solving ability and proficiency in a broad range of topics taught in undergraduate mathematics courses. We use PutnamBench to evaluate several established neural and symbolic theorem-provers. These approaches can only solve a handful of the PutnamBench problems, establishing the benchmark as a difficult open challenge for research on neural theorem-proving.

Formal Theorem-Proving

Formal theorem-proving is the task of proving mathematical theorems specified inside an interactive theorem prover such as Lean, Isabelle, or Coq. These systems allow one to specify theorems and proofs inside a programming language. The correctness of a proof is automatically checked by the compiler.

Informal benchmarks such as MATH and GSM8K measure accuracy by exact match with the ground truth numerical answer. However, the reasoning path might be incorrect and cannot be checked automatically as the problem is specified in natural language. By contrast, benchmarking formal theorem-proving through PutnamBench allows one to measure a model's capacity to produce precise logical arguments in a mathematical context. Furthermore, since no formal proofs have been written for any of our problems, direct contamination is not possible. When training models for informal reasoning, one must take many more precautions to prevent test-set leakage.

The William-Lowell Putnam Mathematical Competition

The William Lowell Putnam Mathematical Competition, organized by the Mathematical Association of America (MAA), is the premier collegiate mathematics competition in North America. Thousands of undergraduate students from universities across the United States and Canada take the exam each year. The competition comprises two 3-hour-long sessions of six problems each, presented in approximately ascending order of difficulty within each session.

The Putnam Competition is known for its challenging problems that require a deep understanding of undergraduate-level mathematics and creative problem-solving ability. The competition includes problems involving concepts from a wide variety of topics in the undergraduate curriculum, including analysis, abstract algebra, and linear algebra.

The competition is scored out of 120 points, with the median score typically being zero. The top five individual scorers are named Putnam Fellows, and in recent years regularly include former International Mathematical Olympiad (IMO) medalists. Hence, we find that PutnamBench is a well-suited benchmark for measuring progress towards the Artificial Intelligence Mathematical Olympiad and the IMO Grand Challenge.

PutnamBench: Multilinguality & Factored Solutions

Multilinguality. Most recent work on neural theorem-proving for competition mathematics has been focused in Lean and Isabelle. PutnamBench supports researchers using these tools by incorporating formalizations of all 640 problems in these languages. Previously without a benchmark for competition math, we also include formalizations in Coq, of which there are 417. Formalization obstacles specific to Coq, as mentioned in our paper, prevent us from formalizing all 640 problems. We actively support community feedback and contributions as we continue to iterate on PutnamBench-Coq.

We hope that the multilinguality of PutnamBench will allow the theorem-proving community to make progress towards general mathematical reasoning machines.

Factored Solutions. Roughly 60% of Putnam problems, in their natural language form, require exhibiting a (closed-form) solution along with a proof of its correctness. Such problems do not assert propositions, and hence are not immediately formalizable as they are not directly the statement of a theorem. Prior benchmarks have included the explicit solution in the theorem statement, but this can drastically reduce the problem's overall difficulty, as finding the answer can be the most challenging part. For such problems, we factor out the solution from the theorem statement, allowing two tasks: proving the theorem with the solution provided, and additionally determining the solution.

Evaluations & Leaderboard

We perform evaluations with some recent models, as described in the paper. We find that PutnamBench is hard, with an aggregation of all methodologies tried unable to successfully prove even 1% of theorems. In particular, few-shot GPT-4o solves only a single problem in each language when allotted 10 attempts. PutnamBench highlights several deficiencies of large language models (LLMs): weak ability to (1) perform precise logical reasoning, (2) operate in low-resource languages.

We are actively maintaining a leaderboard to keep track of progress from upcoming approaches for neural-theorem proving. We will be incorporating results which are accompanied by a research publication (or arxiv preprint) detailing the general methodology.

Sample Problem & Proof

Putnam 1988 B1. A composite (positive integer) is a produce \(ab\) with \(a\) and \(b\) not necessarily distinct integers in \(\{2,3,4,\dots\}\). Show that every composite is expressible as \(xy + xz + yz + 1\), with \(x,y,z\) positive integers.


            theorem putnam_1988_b1 :
            ∀ a ≥ 2, ∀ b ≥ 2, ∃ x y z : ℤ, 
              x > 0 ∧ y > 0 ∧ z > 0 ∧ a * b = x * y + x * z + y * z + 1
            := by
              intro a ha b hb
              use a - 1, b - 1, 1
              constructor
              linarith
              constructor
              linarith
              constructor
              linarith
              ring

Here we include a formalization of Putnam 1988 B1 in Lean. The proof is generated by GPT-4o, and crucially requires the choice of \(x,y,z = a-1,b-1,1\). Once this choice is specified, the proof is completed with automated tactics for linear arithmetic and ring manipulations.

Putnam 1988 B1 is among the easiest problems to ever appear on the Putnam Competition. We believe progress on PutnamBench will come, in part, as a result of advancements in (1) proof planning & intermediate lemma synthesis and (2) ability to reason about and utilize existing formalized mathematical theory.

Licensing & Maintenance

PutnamBench is available under an Apache 2.0 License for Isabelle & Lean, and an MIT License for Coq. With a big thanks to the MAA, we also include the informal problem statements as sourced from the official archives.

PutnamBench will be actively maintained for the foreseeable future. We plan to incorporate problem formalizations from Putnam 2024 in December.

We are actively encouraging community feedback and discussion. We are interested in hearing how you might plan to use PutnamBench! Your feedback may help us iterate on the structure of PutnamBench, such as incorporating a validation/training split. Please reach out to us if you have (1) suggestions to improve the benchmark, (2) comments on maintaining benchmark longevity and usefulness, (3) noticed issues in certain formalizations which require modification

BibTeX

@misc{tsoukalas2024putnambenchevaluatingneuraltheoremprovers,
      title={PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition}, 
      author={George Tsoukalas and Jasper Lee and John Jennings and Jimmy Xin and Michelle Ding and Michael Jennings and Amitayush Thakur and Swarat Chaudhuri},
      year={2024},
      eprint={2407.11214},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.11214}, 
}

PutnamBench: A Multilingual Competition-Mathematics Benchmark for Formal Theorem-Proving