LaSR is a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts.
LaSR (pronounced 'Laser') uses zero-shot queries to a large language model (LLM) to discover and evolve concepts occurring in known high-performing hypotheses. We discover new hypotheses using a mix of standard evolutionary steps and LLM-guided steps (obtained through zero-shot LLM queries) conditioned on discovered concepts. Once discovered, hypotheses are used in a new round of concept abstraction and evolution.
We validate LaSR on the Feynman equations, a popular SR benchmark, as well as a set of synthetic tasks. On these benchmarks, LaSR substantially outperforms a variety of state-of-the-art SR approaches based on deep learning and evolutionary algorithms.
The next sections might not render correctly on mobile devices. Please view this page on a desktop or enable "desktop mode!"
Symbolic regression is the task of finding a mathematical expression that best fits a given dataset. The goal is to find a compact, interpretable expression that can be used to make predictions and infer new relationships. Symbolic regression is a key tool in scientific discovery, as it can help uncover hidden relationships in the data and suggest new hypotheses to test.
One of the earliest known example of symbolic regression is Johannes Kepler's discovery of the empirical laws of planetary motion. Kepler fit various geometric shapes to the ground truth celestial data and found that elliptical orbits best explained the data. Kepler's work in transforming the empirical data into a set of mathematical relationships paved the way for broader interpretation. For instance, Newton validates his laws of motion and his law of universal gravitation by showing that the empirical relationships discoverd by Kepler are a consequence of his laws. In this way, the interpretation of empirical law in one phenomena often generalizes to and explains other phenomena that didn't have a-priori data collection mechanisms.
PySR is a framework for symbolic regression that implements search using multi-population genetic programming. Intuitively, for each population, PySR runs a tournament to select the programs with the highest fitness score. These programs are randomly mutated / crossed with each other to produce new programs. These new programs replace the oldest candidate. This process is repeated for a fixed number of iteration and, in implementation, can be parallelized across multiple cores. This enables scalable exploration of a large search space, and is crucial for PySR's success.
However, PySR does not allow for the incorporation of prior knowledge or domain-specific concepts. This can make it difficult to discover complex relationships, especially in scientific domains where the data is noisy and/or high-dimensional but supplemented with rich background knowledge.
This fundamental limitation surfaces as an exploration bottleneck.
PySR's search strategy is essentially a form of "local search." That is, each population starts with a randomly sampled set of programs and iteratively optimizes them by sampling programs that are syntactically close to the best programs in the population. This process continues until a local optimum is reached. Intuitively, this forms an "island" in the search space of programs.
As PySR runs multiple populations in parallel, we end up with a disjoint set of "islands" in the search space. Now, PySR has many heuristics to deal with such "islands" -- for instance, by migrating programs between populations. However, there is no obvious way to overcome this fundamental exploration bottleneck.
Consecutively, our main goal is to understand if we can overcome this exploration bottleneck and increase exploration in relevant parts of the search space.
Our hypothesis is that the islands emerge because PySR (or any symbolic algorithm) can only sample programs that are syntactically close to the best program in each population. But syntactic closeness does not necessarily imply semantic closeness. Can we instead sample programs that are semantically close to the best programs in each population?
We're going to achieve this by abstracting programs into relevant concepts. A concept is a natural language representation of a program. In symbolic regression, our programs are equations. Hence, we desire two properties from a concept: (1) Symbolic Abstraction and (2) Symbolic guidance.
Symbolic Abstraction: A concept captures an abstract property that is common to a set of programs. For instance, many empirical trends such as Zipf's law (Linguistics), Moore's law (Computer Science), and Arrhenius' equation (Chemistry) can be represented by the equation sketch: $$y = a x^k + \epsilon$$ This trend is pervasive across many domains and is commonly referred to as a "power law." This is just one example of how a concept can "summarize" an important property of a set of programs.
Symbolic Guidance: A concept can also guide the search for new programs. This arises naturally from the abstraction property: if a concept is a good abstraction of a set of equations, we should be able to use it to sample new equations that are similar to the ones in the set.
This is particularly useful in scientific discovery, where scientists often have a preconception of how certain variables will interact with each other, but may not know the exact form of the relationship (The example with gravitational waves is a little misleading as the theoretical prediction prompted the experimental discovery).
LaSR consists of three phases: Hypothesis Evolution, Concept Abstraction, and Concept Evolution. In Hypothesis Evolution, we search for programs that best fit the data and are guided by concepts discovered in the previous iteration. Concept Abstraction deals with updating the concept library with new concepts derived from the best performing programs. Finally, Concept Evolution involves updating the concept library with new concepts that are implications of the discovered concepts.
This is a self-amplifying loop: better programs lead to better concepts, which can, in turn, guide the search for even better programs.
The hypothesis evolution searches for programs that best fit the data and are guided by concepts discovered in the previous iteration. To do this, we will extend PySR's search process to incorporate concept guidance. The next few slides will detail how we do this.
We build upon PySR's multi-population genetic programming framework to integrate concept guidance. Let's begin by understanding PySR's main loop first. Assuming \(k\) programs in a single population, PySR starts with a randomly initialized set of programs \( \left( \{Pi_1^1, \dots \Pi_1^k \} \right) \), from which the best program is selected \( \left( \pi_1^\star \right) \) by evaluating its fitness on a supervised dataset. This program undergoes either mutation (a part of the program is resampled), crossover (a part of the program is swapped with a part of the second best program), simplification (the program is simplified), or optimization (the constants in the program are optimized). The generated program replaces the oldest program in the population, and the process repeats for a fixed number of iterations. Each iteration takes less than a second, and many such populations can be run in parallel.
Integrating concept guidance into PySR involves augmenting all symbolic operations that generate new programs with an LLM zero shot query that mimics the symbolic operation it is replacing
Instead of fully replacing the symbolic operations, we propose a hybrid approach where we replace the symbolic operation with the LLM zero shot query with probability \(p\) (usually 1%). This allows us to incorporate the concept guidance into the search process without sacrificing the 'local search' nature of PySR.
Each LLM zero shot query is conditioned on concepts discovered in the previous iteration. These concepts are stored in a concept library and ensure that the LLM zero shot query is relevant to the current search space.
Next, the best performing programs are abstracted into concepts. We rely on another LLM zero shot query to "summarize" the programs into a natural language representation. This concept is then stored in the concept library.
Finally, the concept library is updated with new concepts that are implications of the discovered concepts. This is done by another LLM zero shot query that is conditioned on the discovered concepts.
Let's take a step back and understand the intuition behind LaSR. We noted that PySR's ends up with "islands" in the search space because it can only sample programs that are syntactically close to the best programs in each population. As LaSR builds upon PySR, we will also end up with these "islands" in the search space at the end of Phase 1.
In Phase 2, LaSR abstracts the best performing programs into concepts. These concepts serve as "bridges" between the "islands" in the search space.
We can now sample new programs conditioned on the discovered concepts. This allows us to explore new parts of the search space that were previously inaccessible. Since we retain the "local search" capabilities of PySR, intuitively, this allows us to explore more "islands" in the search space.
Furthermore, in Phase 3, we can sample more concepts that are derived from the discovered concepts. these concepts, in turn, increase the exploration of the search space in areas that were previously inaccessible.
LaSR doesn't require human provided concepts. Hence, we can test LaSR against other symbolic regression algorithms on the Feynman equations; a collection of equations that describe empirical relationships from the Feynman Lectures Series . We found that LaSR outperforms other symbolic regression algorithms on this benchmark.
LaSR's performance is ultimately bottlenecked by the quality of the language guidance. To evaluate this, we conducted a set of cascading experiments where we varied the quality of the language guidance by (1) increasing the probability of replacing symbolic operations with LLM zero shot queries and (2) changing the backend LLM model. We found that even a small model like Llama 3 8B can provide sufficient guidance for LaSR to outperform other symbolic regression algorithms.
More details on this set of experiments (and a set of experiments on an unseen synthetic dataset) can be found in our paper.
We also conducted an enhancement study where we provided LaSR with hints for the equations in the Feynman equations dataset. We found that LaSR with hints accelerated the search process and found the correct equation faster than LaSR without hints, even discovering some new equations.
We also conducted a qualitative study on the equations discovered by LaSR. Here, we should Equation #10 from the Feynman equations dataset: Coulomb's law. LaSR and PySR both discover a high performing program for this equation. Coulomb's law is interesting because it embodies multiple concepts: (1) The force and the distance of the charges are inversely proportional and follow a "power law" trend, (2) In the scalar form, the force is proportional to the product of the charges and since multiplication is commutative, the order of the charges does not matter.
PySR's equation: PySR's equation is unwieldy and simplifies to the correct form after about 10 manual steps of simplification. Also, as this equation requires more constants, it is more prone to optimization errors.
LaSR's equation: LaSR's equation is much simpler and reduces to ground truth after four manual steps of simplification. Surprisingly, we notice that smaller models tend to produce simpler equations.
LaSR produces two artifacts: an equation of best fit and a concept library. Here, we showcase snippets of the concept library generated for Coulomb's law. As a consequence of LLM training, the concepts are rather verbose and small relevant concepts are often buried in the middle of the generations.
Limitations: As a consequence of using LLM zero shot queries, LaSR cannot guarantee the factuality or the correctness of the concepts in the concept library. Furthermore, concepts deemed to be "important" may be a consequence of the LLM model's training data and may mislead scientists. Addressing these concerns is an exciting direction for future work in LLM guided program induction.
This project would not be possible without the excellent work of the community. These are some relevant papers to better understand the premise of our work:
If you found this post interesting, please read our paper for mathematical details and experimental results. You can cite our paper as follows:
@misc{grayeli2024symbolicregressionlearnedconcept,
title={Symbolic Regression with a Learned Concept Library},
author={Arya Grayeli and Atharva Sehgal and Omar Costilla-Reyes and Miles Cranmer and Swarat Chaudhuri},
year={2024},
eprint={2409.09359},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2409.09359},
}