We study the problem of building a visual concept library for visual recognition. Building effective visual concept libraries is challenging, as manual definition is labor-intensive, while relying solely on LLMs for concept generation can result in concepts that lack discriminative power or fail to account for the complex interactions between them.
Our approach, Escher, takes a library learning perspective to iteratively discover and improve visual concepts. Escher uses a vision-language model (VLM) as a critic to iteratively refine the concept library, including accounting for interactions between concepts and how they affect downstream classifiers. By leveraging the in-context learning abilities of LLMs and the history of performance using various concepts, Escher dynamically improves its concept generation strategy based on the VLM critic's feedback.
Notably, Escher does not require any human annotations, and is thus an automated plug-and-play framework. We empirically demonstrate the ability of Escher to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks. This work represents, to our knowledge, the first application of concept library learning to real-world visual tasks.
In many scientific fields, perceptual reasoning doesn't come naturally. Sometimes the features of an image aren't obvious to the human eye. Other times the questions we pose demand more than instinct, requiring deliberate analysis. As a result, scientists must learn how to identify subtle traits by building domain knowledge and seeking constant feedback from peers. Let's look at a concrete example from ecology.
In this iNaturalist exchange [Source], an experienced ecologist uploads a geo-tagged photo of a lizard, initially misidentifying it as a Florida Scrub Lizard (Sceloporus woodi). Another user, a trained herpetologist, corrects the identification, suggesting that the lizard is actually a Northern Curly-tailed Lizard (Leiocephalus carinatus). They catch a distinguishing morphological detail: lizards in the genus Sceloporus (which includes the Florida Scrub Lizard) do not have a strongly keeled tail. The ecologist accepts the correction and explains that they had been focusing on the lizard's color patterning rather than its scales.
We are interested in the question: how can machine perception systems learn to identify such subtle visual concepts?
Visual programming attempts to decompose complex perceptual reasoning problems into a logical combination of simpler perceptual tasks that can be solved using off-the-shelf vision foundation models. These works leverage the code-generation capabilities of large language models (LLMs), along with a pre-specified API of visual foundation models, to generate code that can be executed to answer perceptual reasoning questions.
However, such approaches inherently suffer from the decoupling of the program synthesizer and the underlying vision foundation models. The program synthesizer is trained to generate deterministic code, and the stochastic nature of the vision foundation model is hidden away by the API. This decoupling leads to a disconnect between how the program synthesizer assumes the vision foundation model will behave and how it actually behaves on real-world images.
Our main goal is to understand whether we can overcome this decoupling by using the vision foundation model as a critic to guide the program synthesizer.
Escher focuses on the task of fine-grained image classification. In this task, we're given an image and a set of classes (in natural language). Our goal is to classify the image into one of the classes.
Prior work in this area begins by asking an LLM to decompose the complex categorization task into a series of simpler perceptual tasks.
For example, here, the LLM would be asked two questions: (1) What does a donut look like? (2) What does a bagel look like?
Then, a CLIP-based VLM is used to estimate how well the image matches the visual concepts. The final categorization is decided by choosing the class whose average visual concept score is highest. Such approaches have been shown to work well in practice compared to naively using the class-image similarity.
There are other benefits too: (1) The visual concepts are interpretable, (2) The visual concepts can be reused across different tasks, (3) The visual concepts can be used to decompose otherwise out-of-distribution categorization tasks into in-distribution classification tasks.
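To make this scoring concrete, here is a minimal sketch of classification by averaged concept scores, assuming an open_clip backbone; the concept library and its strings below are invented for illustration and are not the concepts Escher actually generates.

```python
import torch
import open_clip
from PIL import Image

# Any CLIP-style VLM works here; this checkpoint is an illustrative choice.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Toy concept library: class name -> LLM-generated visual concepts (made up here).
library = {
    "donut": ["a ring-shaped fried pastry", "a sugary glaze", "colorful sprinkles on top"],
    "bagel": ["a dense, chewy ring of bread", "a shiny, browned crust", "sesame or poppy seeds"],
}

@torch.no_grad()
def classify(image: Image.Image) -> str:
    img = model.encode_image(preprocess(image).unsqueeze(0))
    img = img / img.norm(dim=-1, keepdim=True)
    scores = {}
    for cls, concepts in library.items():
        txt = model.encode_text(tokenizer(concepts))
        txt = txt / txt.norm(dim=-1, keepdim=True)
        # Class score = mean cosine similarity between the image and the class's concepts.
        scores[cls] = (txt @ img.T).mean().item()
    return max(scores, key=scores.get)

# Usage: classify(Image.open("some_image.jpg").convert("RGB"))
```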
However, on this input, the LLM-generated visual concepts, while faithful to the categorization task, yield very different results when evaluated by the VLM.
We hypothesize that such approaches suffer from the same weakness as zero-shot visual programming approaches: the LLM's decomposition is decoupled from the VLM's performance. The VLM and the LLM were trained in very different ways and hence have different inductive biases. These biases can lead to the LLM generating visual concepts that, while faithful to the categorization task, prove to be detrimental to the VLM's performance.
Escher attempts to overcome this limitation by instantiating a concept library that is iteratively refined using the VLM's feedback.
This concept library is a data structure that stores the visual concepts generated by the LLM for each class.
Following previous work, we generate concepts using an LLM conditioned on the class names, with a slight modification: the LLM is also given access to the concept library. Initially, the concept library is empty, so this is equivalent to how prior work uses the LLM.
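As a rough sketch, the library can be as simple as a dictionary from class names to concept strings, and the generation prompt can expose its current contents to the LLM. The prompt wording and the `query_llm` hook below are illustrative assumptions, not Escher's actual prompts or API.

```python
# The concept library: class name -> list of concept strings generated so far.
ConceptLibrary = dict[str, list[str]]

def build_generation_prompt(class_name: str, library: ConceptLibrary) -> str:
    """Ask for visual concepts, conditioned on what the library already contains."""
    existing = "\n".join(
        f"- {cls}: {', '.join(concepts)}" for cls, concepts in library.items() if concepts
    )
    return (
        f"List short, visually checkable features of a {class_name}.\n"
        f"Concepts already in the library (avoid redundant or non-discriminative ones):\n"
        f"{existing or '(library is empty)'}\n"
        "Return one concept per line."
    )

def generate_concepts(class_name: str, library: ConceptLibrary, query_llm) -> list[str]:
    # `query_llm` is any prompt -> text function (e.g. a chat-completion call).
    reply = query_llm(build_generation_prompt(class_name, library))
    return [line.lstrip("- ").strip() for line in reply.splitlines() if line.strip()]
```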
To evaluate the generated concepts, Escher simply passes the concepts to the underlying VLM. However, instead of directly using the VLM outputs to predict the class, we use the VLM outputs to better understand what classes get confused with each other, and how the concept library can be improved to reduce this confusion.
To generate feedback, Escher attempts to find classes that are confused with each other. If we have access to the ground truth labels, this can be achieved by constructing a confusion matrix of the VLM outputs.
However, Escher does not have access to the ground truth labels. Instead, Escher analyzes the raw similarity scores for each image-class pair to make this decision. Generally, this requires a reasonably well-calibrated backbone VLM, and there are many heuristics to achieve this.
Here, we showcase one such heuristic: top-k pseudo-confusion. For each image, this heuristic takes the k classes with the highest image-text similarity scores, and it identifies pairs of classes that repeatedly appear together in these top-k sets across the dataset. The intuition is that if the VLM consistently ranks two classes within the top-k most-similar classes for the same images, it is likely confused between them.
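Below is a sketch of this heuristic, assuming the VLM's raw image-class similarity scores are collected into a matrix; it is one possible implementation rather than the paper's exact one.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def pseudo_confusion(scores: np.ndarray, k: int = 5) -> Counter:
    """scores: (n_images, n_classes) image-class similarity matrix from the VLM.

    Counts how often each pair of classes appears together in an image's top-k;
    pairs with high counts are treated as pseudo-confused. A reasonably
    calibrated backbone makes these counts more trustworthy.
    """
    pair_counts = Counter()
    topk = np.argsort(-scores, axis=1)[:, :k]  # top-k class indices per image
    for row in topk:
        for pair in combinations(sorted(row.tolist()), 2):
            pair_counts[pair] += 1
    return pair_counts

# Example: the ten most frequently co-occurring (i.e. most confused) class pairs.
# confused_pairs = pseudo_confusion(similarity_matrix, k=5).most_common(10)
```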
For example, in this image, the pseudo-confusion matrix can be used to identify two classes of birds in a fine-grained bird classification dataset that are confused with each other: a Slaty-backed Gull and a California Gull.
Once Escher identifies the classes that are confused with each other, it attempts to resolve this confusion by generating new concepts for the confused classes. If classes are confused more than once, the resolution query reflects the history of past disambiguations, as well as the VLM's feedback score after each disambiguation.
This loop is repeated for a fixed number of iterations, and usually yields a concept library that is much more interpretable and useful than the initial concept library.
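Putting the pieces together, the overall loop might look like the sketch below, which reuses the helpers sketched above. The `score_with_vlm` hook, the resolution prompt, and the history bookkeeping are simplified assumptions about how such a loop could be wired up, not the paper's exact interfaces.

```python
from collections import defaultdict

def escher_loop(classes, images, query_llm, score_with_vlm, n_iters=10, k=5, n_pairs=10):
    """Iteratively refine the concept library with the VLM as critic.

    `score_with_vlm(library, images)` should return an (n_images, n_classes)
    similarity matrix; `query_llm` maps a prompt string to a text reply.
    """
    # Iteration 0: the library starts empty, matching prior (non-iterative) work.
    library = {c: generate_concepts(c, {}, query_llm) for c in classes}
    history = defaultdict(list)  # (class_i, class_j) -> past disambiguation attempts

    for _ in range(n_iters):
        scores = score_with_vlm(library, images)
        for (i, j), count in pseudo_confusion(scores, k).most_common(n_pairs):
            a, b = classes[i], classes[j]
            past = "\n".join(history[(i, j)]) or "(none)"
            prompt = (
                f"The VLM confuses '{a}' and '{b}'.\n"
                f"Current concepts for {a}: {library[a]}\n"
                f"Current concepts for {b}: {library[b]}\n"
                f"Previous disambiguation attempts and outcomes:\n{past}\n"
                f"Propose new visual concepts that separate {a} from {b}."
            )
            reply = query_llm(prompt)
            # Parsing the reply and assigning new concepts per class is omitted here.
            history[(i, j)].append(f"confusion count {count}; proposed: {reply[:100]}")
    return library
```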
In our running example, this enables the discovery of visual concepts that help distinguish between the donut and the bagel.
Escher's key insight is in leveraging the concept library to couple the LLM's generation and the VLM's responses. A VLM with a specialized concept library produces more fine-grained class-disambiguation feedback, prompting the LLM to uncover even finer concepts, which leads to an even more specialized library for the next iteration. As with other library learning algorithms, this self-reinforcing loop ideally continues, as long as the LLM disambiguates concepts well and the VLM remains sensitive to them, until no further relevant concepts can be identified.
Generally, algorithms that use intermediate concepts for decomposing perceptual tasks are known as concept-bottleneck models (CBMs). Our running baseline so far -- using an LLM to generate a set of textual concepts -- can also be expressed as a CBM.
Escher is a meta-algorithm that is agnostic to the underlying choice of CBM. Furthermore, it doesn't require any human-provided concepts. We evaluate how Escher behaves in two scenarios (more studies in the paper!):
We start with an image of a Male Ring-necked Pheasant. This is a category from the North American birds dataset, which contains over 400 species of birds.
With no iterations, the underlying VLM mispredicts this image as a Female Ring-necked Pheasant. Since this is a concept-bottleneck model, we can inspect the underlying concepts to better understand the model's prediction.
When examining the concept activations for the Male Ring-necked Pheasant, we notice a substantially lower overall score, likely because the LLM generates concepts for each class in isolation. As a result, highly similar concepts may end up with slightly different initializations, which can alter prediction outcomes. Moreover, these LLM-generated concepts may lack the discriminative power needed to distinguish between closely related classes, because the LLM is unaware of which other classes are present in the dataset.
After five iterations with Escher, the VLM correctly predicts the true class of the image. This is because we are able to identify that the underlying VLM is getting confused between a Male and a Female Ring-necked Pheasant, and generate concepts that resolve this confusion. Specifically, if we inspect the new most-activated concepts, we find that the LLM identifies that a Male Ring-necked Pheasant has a metallic green head and neck, which proves to be a very discriminative concept for the VLM.
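Because the model is a concept bottleneck, this kind of inspection is easy to script. Here is a small sketch reusing the open_clip setup from the earlier snippet; the function name, signature, and example class string are ours, not the paper's.

```python
import torch

@torch.no_grad()
def top_concepts(image, class_name, library, model, preprocess, tokenizer, n=5):
    """Rank a class's concepts by how strongly the image activates them."""
    img = model.encode_image(preprocess(image).unsqueeze(0))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = model.encode_text(tokenizer(library[class_name]))
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (txt @ img.T).squeeze(1)  # one similarity score per concept
    order = sims.argsort(descending=True)[:n].tolist()
    return [(library[class_name][i], sims[i].item()) for i in order]

# e.g. top_concepts(img, "Male Ring-necked Pheasant", library, model, preprocess, tokenizer)
```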
Limitations: Escher currently has three main constraints:
This project would not be possible without the excellent work of the community. These are some relevant papers to better understand the premise of our work:
If you found this post interesting, please read our paper for mathematical details and experimental results. You can cite our paper as follows:
@misc{sehgal2025selfevolvingvisualconceptlibrary,
  title={Self-Evolving Visual Concept Library using Vision-Language Critics},
  author={Atharva Sehgal and Patrick Yuan and Ziniu Hu and Yisong Yue and Jennifer J. Sun and Swarat Chaudhuri},
  year={2025},
  eprint={2504.00185},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.00185},
}