LaSR: Symbolic Regression with a Learned Concept Library

1UT Austin, 2MIT CSAIL, 3University of Cambridge, 4Foundry Technologies
*Equal Contribution
LaSR leverages concept guidance to accelerate symbolic regression for scientific discovery.

Abstract

LaSR is a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts.

LaSR (pronounced 'Laser') uses zero-shot queries to a large language model (LLM) to discover and evolve concepts occurring in known high-performing hypotheses. We discover new hypotheses using a mix of standard evolutionary steps and LLM-guided steps (obtained through zero-shot LLM queries) conditioned on discovered concepts. Once discovered, hypotheses are used in a new round of concept abstraction and evolution.

We validate LaSR on the Feynman equations, a popular SR benchmark, as well as a set of synthetic tasks. On these benchmarks, LaSR substantially outperforms a variety of state-of-the-art SR approaches based on deep learning and evolutionary algorithms.


Symbolic Regression

Symbolic regression is the task of finding a mathematical expression that best fits a given dataset. The goal is to find a compact, interpretable expression that can be used to make predictions and infer new relationships. Symbolic regression is a key tool in scientific discovery, as it can help uncover hidden relationships in the data and suggest new hypotheses to test.

One of the earliest known examples of symbolic regression is Johannes Kepler's discovery of the empirical laws of planetary motion. Kepler fit various geometric shapes to the ground-truth celestial data and found that elliptical orbits best explained the data. By transforming the empirical data into a set of mathematical relationships, Kepler paved the way for broader interpretation. For instance, Newton validated his laws of motion and his law of universal gravitation by showing that the empirical relationships discovered by Kepler are a consequence of those laws. In this way, the interpretation of an empirical law in one phenomenon often generalizes to and explains other phenomena for which no data had been collected a priori.

PySR: evolutionary algorithm for Symbolic Regression

PySR is a framework for symbolic regression that implements search using multi-population genetic programming. Intuitively, for each population, PySR runs a tournament to select the programs with the highest fitness scores. These programs are randomly mutated or crossed over with each other to produce new programs, which replace the oldest candidates in the population. This process is repeated for a fixed number of iterations and, in implementation, can be parallelized across multiple cores. This enables scalable exploration of a large search space and is crucial for PySR's success.
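For concreteness, here is a minimal sketch of how PySR is typically invoked through its Python interface; the hyperparameter values are illustrative rather than the settings used in our experiments.

# Minimal PySR usage sketch (illustrative hyperparameters).
import numpy as np
from pysr import PySRRegressor

X = np.random.randn(100, 2)          # two input variables
y = 2.5 * np.cos(X[:, 0]) + X[:, 1]  # hidden ground-truth relationship

model = PySRRegressor(
    niterations=40,                          # evolution iterations
    populations=15,                          # populations evolved in parallel
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["cos", "exp", "sin"],
)
model.fit(X, y)                      # run the multi-population genetic search
print(model.sympy())                 # best-fit expression as a SymPy object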

However, PySR does not allow for the incorporation of prior knowledge or domain-specific concepts. This can make it difficult to discover complex relationships, especially in scientific domains where the data is noisy and/or high-dimensional but supplemented with rich background knowledge.

This fundamental limitation surfaces as an exploration bottleneck.

PySR's Exploration bottleneck

[Figure: Sketch of PySR's search space]

PySR's search strategy is essentially a form of "local search." That is, each population starts with a randomly sampled set of programs and iteratively optimizes them by sampling programs that are syntactically close to the best programs in the population. This process continues until a local optimum is reached. Intuitively, this forms an "island" in the search space of programs.

As PySR runs multiple populations in parallel, we end up with a disjoint set of "islands" in the search space. Now, PySR has many heuristics to deal with such "islands" -- for instance, by migrating programs between populations. However, there is no obvious way to overcome this fundamental exploration bottleneck.

Our Hypothesis

Consequently, our main goal is to understand whether we can overcome this exploration bottleneck and increase exploration in relevant parts of the search space.

Our hypothesis is that the islands emerge because PySR (or any symbolic algorithm) can only sample programs that are syntactically close to the best program in each population. But syntactic closeness does not necessarily imply semantic closeness. Can we instead sample programs that are semantically close to the best programs in each population?

LaSR: Symbolic Regression with a Learned Concept Library

Concept Library Desiderata: (1) Symbolic Abstraction

We're going to achieve this by abstracting programs into relevant concepts. A concept is a natural language representation of a program. In symbolic regression, our programs are equations. Hence, we desire two properties from a concept: (1) Symbolic Abstraction and (2) Symbolic Guidance.

Symbolic Abstraction: A concept captures an abstract property that is common to a set of programs. For instance, many empirical trends such as Zipf's law (Linguistics), Moore's law (Computer Science), and Arrhenius' equation (Chemistry) can be represented by the equation sketch: $$y = a x^k + \epsilon$$ This trend is pervasive across many domains and is commonly referred to as a "power law." This is just one example of how a concept can "summarize" an important property of a set of programs.

Concept Library Desiderata: (2) Symbolic Guidance

Symbolic Guidance: A concept can also guide the search for new programs. This arises naturally from the abstraction property: if a concept is a good abstraction of a set of equations, we should be able to use it to sample new equations that are similar to the ones in the set.

This is particularly useful in scientific discovery, where scientists often have a preconception of how certain variables will interact with each other but may not know the exact form of the relationship. (One caveat: the example of gravitational waves is a little misleading here, since the theoretical prediction prompted the experimental discovery rather than the other way around.)

LaSR's Three phases

LaSR consists of three phases: Hypothesis Evolution, Concept Abstraction, and Concept Evolution. In Hypothesis Evolution, we search for programs that best fit the data and are guided by concepts discovered in the previous iteration. Concept Abstraction deals with updating the concept library with new concepts derived from the best performing programs. Finally, Concept Evolution involves updating the concept library with new concepts that are implications of the discovered concepts.

This is a self-amplifying loop: better programs lead to better concepts, which can, in turn, guide the search for even better programs.
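The loop can be summarized with the pseudocode sketch below; the function names (evolve_hypotheses, abstract_concepts, evolve_concepts) and variables are illustrative placeholders, not the identifiers used in our implementation.

# Pseudocode sketch of LaSR's self-amplifying loop (placeholder helper names).
concept_library = []                           # natural-language concepts, initially empty
populations = initialize_random_populations()

for iteration in range(num_iterations):
    # Phase 1: Hypothesis Evolution -- genetic search with occasional
    # LLM-guided steps conditioned on the current concept library.
    populations, best_programs = evolve_hypotheses(populations, concept_library, dataset)

    # Phase 2: Concept Abstraction -- summarize the best programs into
    # natural-language concepts and add them to the library.
    concept_library += abstract_concepts(best_programs)

    # Phase 3: Concept Evolution -- derive new concepts implied by the
    # concepts already in the library.
    concept_library += evolve_concepts(concept_library)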

LaSR: Phase 1 Hypothesis Evolution

The hypothesis evolution phase searches for programs that best fit the data, guided by concepts discovered in the previous iteration. To do this, we extend PySR's search process to incorporate concept guidance. The next few sections detail how we do this.

Phase 1: Base Algorithm

We build upon PySR's multi-population genetic programming framework to integrate concept guidance. Let's begin by understanding PySR's main loop first. Assuming \(k\) programs in a single population, PySR starts with a randomly initialized set of programs \( \{\pi_1^1, \dots, \pi_1^k\} \), from which the best program \( \pi_1^\star \) is selected by evaluating fitness on a supervised dataset. This program undergoes either mutation (a part of the program is resampled), crossover (a part of the program is swapped with a part of the second-best program), simplification (the program is simplified algebraically), or constant optimization (the numeric constants in the program are re-fit). The generated program replaces the oldest program in the population, and the process repeats for a fixed number of iterations. Each iteration takes less than a second, and many such populations can be run in parallel.
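In code, one iteration of this base loop looks roughly like the sketch below; the helper functions (tournament_select, mutate, crossover, simplify, optimize_constants, replace_oldest) are illustrative placeholders, not PySR's internal API.

# Sketch of one iteration of a single population's genetic loop.
# All helper functions are illustrative placeholders.
import random

def evolve_population_step(population, dataset):
    best = tournament_select(population, dataset)         # highest-fitness program
    second_best = tournament_select(population, dataset)

    op = random.choice(["mutate", "crossover", "simplify", "optimize"])
    if op == "mutate":
        child = mutate(best)                      # resample a subtree of the program
    elif op == "crossover":
        child = crossover(best, second_best)      # swap subtrees between two programs
    elif op == "simplify":
        child = simplify(best)                    # algebraically simplify the program
    else:
        child = optimize_constants(best, dataset) # re-fit numeric constants

    replace_oldest(population, child)             # child replaces the oldest member
    return population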

Phase 1: LLM Operations

Integrating concept guidance into PySR involves augmenting every symbolic operation that generates new programs with an LLM zero-shot query that mimics the symbolic operation it replaces.

Instead of fully replacing the symbolic operations, we propose a hybrid approach in which a symbolic operation is swapped for its LLM zero-shot counterpart with probability \(p\) (usually 1%). This allows us to incorporate concept guidance into the search process without sacrificing the 'local search' nature of PySR.
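A minimal sketch of this hybrid choice is shown below; the value of p and the helper names (llm_guided_operation, symbolic_operation) are illustrative.

# Hybrid operator choice: with small probability p, delegate the step to a
# concept-conditioned LLM query; otherwise use the standard symbolic operation.
# Both helper functions are illustrative placeholders.
import random

def generate_candidate(best_program, concept_library, p=0.01):
    if random.random() < p:
        # LLM zero-shot query conditioned on the concept library.
        return llm_guided_operation(best_program, concept_library)
    # Fall back to PySR's standard symbolic mutation/crossover.
    return symbolic_operation(best_program)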

Phase 1: Concept Library Integration

Each LLM zero-shot query is conditioned on concepts discovered in the previous iteration. These concepts are stored in a concept library and ensure that the LLM zero-shot query is relevant to the current search space.
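As an illustration, such a concept-conditioned query might assemble a prompt along the following lines; the template is a hypothetical sketch, not the exact prompt used in LaSR.

# Hypothetical sketch of conditioning an LLM query on sampled concepts.
import random

def build_mutation_prompt(program, concept_library, n_concepts=3):
    concepts = random.sample(concept_library, min(n_concepts, len(concept_library)))
    hints = "\n".join(f"- {c}" for c in concepts)
    return (
        "You are evolving candidate equations for symbolic regression.\n"
        f"Useful concepts discovered so far:\n{hints}\n"
        f"Propose a variation of this expression: {program}\n"
        "Return only the new expression."
    )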

LaSR: Phase 2 Concept Abstraction

Next, the best-performing programs are abstracted into concepts. We rely on another LLM zero-shot query to "summarize" the programs into a natural language representation. This concept is then stored in the concept library.
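A sketch of this abstraction step is given below; llm_complete is a hypothetical stand-in for whichever LLM client is used.

# Sketch of Phase 2: summarize high-performing programs into a concept.
# llm_complete is a hypothetical stand-in for an LLM client call.
def abstract_concepts(best_programs):
    listing = "\n".join(str(p) for p in best_programs)
    prompt = (
        "The following expressions all fit the data well:\n"
        f"{listing}\n"
        "Describe, in one sentence, an abstract property they share."
    )
    return [llm_complete(prompt)]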

LaSR: Phase 3 Concept Evolution

Finally, the concept library is updated with new concepts that are implications of the already discovered concepts. This is done by another LLM zero-shot query that is conditioned on the discovered concepts.
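A corresponding sketch of this step, again using the hypothetical llm_complete helper:

# Sketch of Phase 3: derive new concepts implied by the existing library.
def evolve_concepts(concept_library, n_new=2):
    listing = "\n".join(f"- {c}" for c in concept_library)
    prompt = (
        "Here are concepts that describe good hypotheses:\n"
        f"{listing}\n"
        f"Suggest {n_new} further concepts implied by or refining these."
    )
    response = llm_complete(prompt)
    return [line.lstrip("- ").strip() for line in response.splitlines() if line.strip()]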

LaSR: Intuition (1)

Let's take a step back and understand the intuition behind LaSR. We noted that PySR ends up with "islands" in the search space because it can only sample programs that are syntactically close to the best programs in each population. As LaSR builds upon PySR, we will also end up with these "islands" in the search space at the end of Phase 1.

LaSR: Intuition (2)

In Phase 2, LaSR abstracts the best performing programs into concepts. These concepts serve as "bridges" between the "islands" in the search space.

LaSR: Intuition (3)

We can now sample new programs conditioned on the discovered concepts. This allows us to explore new parts of the search space that were previously inaccessible. Since we retain the "local search" capabilities of PySR, intuitively, this allows us to explore more "islands" in the search space.

LaSR: Intuition (4)

Furthermore, in Phase 3, we can sample more concepts that are derived from the discovered concepts. These concepts, in turn, increase the exploration of the search space in areas that were previously inaccessible.

LaSR: Results

LaSR: Feynman Equations

LaSR doesn't require human-provided concepts. Hence, we can test LaSR against other symbolic regression algorithms on the Feynman equations: a collection of equations that describe empirical relationships from the Feynman Lectures on Physics. We found that LaSR outperforms other symbolic regression algorithms on this benchmark.

LaSR's performance is ultimately bottlenecked by the quality of the language guidance. To evaluate this, we conducted a set of cascading experiments in which we varied the quality of the language guidance by (1) increasing the probability of replacing symbolic operations with LLM zero-shot queries and (2) changing the backend LLM. We found that even a small model like Llama 3 8B can provide sufficient guidance for LaSR to outperform other symbolic regression algorithms.

More details on this set of experiments (and a set of experiments on an unseen synthetic dataset) can be found in our paper.

LaSR: Feynman Equations with Hints

We also conducted an enhancement study in which we provided LaSR with hints for the equations in the Feynman equations dataset. We found that hints accelerated the search process: LaSR with hints found the correct equation faster than LaSR without hints, and it even discovered some equations that were not found without hints.

LaSR: Feynman Equations Qualitative Study

We also conducted a qualitative study of the equations discovered by LaSR. Here, we show Equation #10 from the Feynman equations dataset: Coulomb's law. LaSR and PySR both discover a high-performing program for this equation. Coulomb's law is interesting because it embodies multiple concepts: (1) the force is inversely proportional to the square of the distance between the charges, an instance of a "power law" trend, and (2) in the scalar form, the force is proportional to the product of the charges, and since multiplication is commutative, the order of the charges does not matter.

PySR's equation: PySR's equation is unwieldy and simplifies to the correct form only after about 10 manual simplification steps. Moreover, because it requires more constants, it is more prone to optimization errors.

LaSR: Feynman Equations Qualitative Study

LaSR's equation: LaSR's equation is much simpler and reduces to the ground truth after four manual simplification steps. Surprisingly, we notice that smaller models tend to produce simpler equations.

LaSR: Feynman Equations Qualitative Study

LaSR produces two artifacts: an equation of best fit and a concept library. Here, we showcase snippets of the concept library generated for Coulomb's law. As a consequence of LLM training, the concepts are rather verbose and small relevant concepts are often buried in the middle of the generations.

Limitations: Because it relies on LLM zero-shot queries, LaSR cannot guarantee the factuality or correctness of the concepts in the concept library. Furthermore, concepts deemed "important" may be artifacts of the LLM's training data and may mislead scientists. Addressing these concerns is an exciting direction for future work in LLM-guided program induction.

BibTeX

If you found this post interesting, please read our paper for mathematical details and experimental results. You can cite our paper as follows:

@misc{grayeli2024symbolicregressionlearnedconcept,
	title={Symbolic Regression with a Learned Concept Library}, 
	author={Arya Grayeli and Atharva Sehgal and Omar Costilla-Reyes and Miles Cranmer and Swarat Chaudhuri},
	year={2024},
	eprint={2409.09359},
	archivePrefix={arXiv},
	primaryClass={cs.LG},
	url={https://arxiv.org/abs/2409.09359}, 
}