Escher: Self-Evolving Visual Concept Library using Vision-Language Critics

UT Austin, Cornell, Caltech
Prior work in concept-bottleneck visual recognition aims to leverage discriminative visual concepts to enable more accurate object classification. Escher is an approach for iteratively evolving a visual concept library using feedback from a VLM critic to discover descriptive visual concepts.

Abstract

We study the problem of building a visual concept library for visual recognition. Building effective visual concept libraries is challenging, as manual definition is labor-intensive, while relying solely on LLMs for concept generation can result in concepts that lack discriminative power or fail to account for the complex interactions between them.

Our approach, Escher, takes a library learning perspective to iteratively discover and improve visual concepts. Escher uses a vision-language model (VLM) as a critic to iteratively refine the concept library, including accounting for interactions between concepts and how they affect downstream classifiers. By leveraging the in-context learning abilities of LLMs and the history of performance using various concepts, Escher dynamically improves its concept generation strategy based on the VLM critic's feedback.

Notably, Escher does not require any human annotations, and is thus an automated plug-and-play framework. We empirically demonstrate the ability of Escher to learn a concept library for zero-shot, few-shot, and fine-tuning visual classification tasks. This work represents, to our knowledge, the first application of concept library learning to real-world visual tasks.

Discovering visual concepts

In many scientific fields, perceptual reasoning doesn't come naturally. Sometimes the features of an image aren't obvious to the human eye. Other times the questions we pose demand more than instinct, requiring deliberate analysis. As a result, scientists must learn how to identify subtle traits by building domain knowledge and seeking constant feedback from peers. Let's look at a concrete example from ecology.

In this iNaturalist exchange [Source], an experienced ecologist uploads a geo-tagged photo of a lizard, initially misidentifying it as a Florida Scrub Lizard (Sceloporus woodi). Another user, a trained herpetologist, corrects the identification, suggesting that the lizard is actually a Northern Curly-tailed Lizard (Leiocephalus carinatus). They point out a distinguishing morphological detail: lizards in the genus Sceloporus (which includes the Florida Scrub Lizard) do not have strongly keeled tails. The ecologist agrees with the correction and explains that they were focusing on the lizard's color patterning rather than its scales.

We are interested in the question: how can machine perception systems learn to identify such subtle visual concepts?

Visual Programming

Visual programming attempts to decompose complex perceptual reasoning problems into a logical combination of simpler perceptual tasks that can be solved using off-the-shelf vision foundation models. These works leverage the code-generation capabilities of large language models (LLMs), along with a pre-specified API of vision foundation models, to generate code that can be executed to answer perceptual reasoning questions.

Visual Programming's decoupling bottleneck

Constructing visual programs for scientific images

However, such approaches inherently suffer from the decoupling of the program synthesizer and the underlying vision foundation models. The program synthesizer is trained to generate deterministic code, and the stochastic nature of the vision foundation model is hidden away by the API. This decoupling leads to a disconnect between how the program synthesizer assumes the vision foundation model will behave and how it actually behaves on real-world images.

Our Hypothesis

Our main goal is to understand whether we can overcome this decoupling by using the vision foundation model as a critic to guide the program synthesizer.

Escher: Self-Evolving Visual Concept Library using Vision-Language Critics

Task: Fine-grained Image Classification

Escher focuses on the task of fine-grained image classification. In this task, we're given an image and a set of classes (in natural language). Our goal is to classify the image into one of the classes.

Prior work: (1) Visual concept-bottleneck models

Prior work in this area begins by asking an LLM to decompose the complex categorization task into a series of simpler perceptual tasks.

For example, here, the LLM would be asked two questions: (1) What does a donut look like? (2) What does a bagel look like?

Prior work: (2) Visual concept-bottleneck models

Then, a CLIP-based VLM is used to estimate how well the image matches the visual concepts. The final categorization is decided by choosing the class whose average visual concept score is highest. Such approaches have been shown to work well in practice compared to naively using the class-image similarity.
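
A minimal sketch of this scoring scheme is shown below, assuming the image and concept embeddings have already been computed with a CLIP-style VLM and L2-normalized; the function and variable names are illustrative.

    import numpy as np

    def classify_with_concepts(image_emb: np.ndarray,
                               concept_embs: dict[str, np.ndarray]) -> str:
        """Pick the class whose concepts best match the image on average."""
        class_scores = {}
        for class_name, embs in concept_embs.items():
            # Cosine similarity between the image and each concept description
            # (dot product of L2-normalized embeddings).
            concept_scores = embs @ image_emb
            # A class's score is the mean over its concept scores.
            class_scores[class_name] = concept_scores.mean()
        return max(class_scores, key=class_scores.get)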

There are other benefits too: (1) The visual concepts are interpretable, (2) The visual concepts can be reused across different tasks, (3) The visual concepts can be used to decompose otherwise out-of-distribution categorization tasks into in-distribution classification tasks.

Prior work: (3) Visual concept-bottleneck models - Limitations

However, on this input, the LLM-generated visual concepts, while faithful to the categorization task, yield very different results when evaluated by the VLM.

Prior work: (4) Visual concept-bottleneck models - Limitations

We hypothesize that such approaches suffer from the same weakness as zero-shot visual programming approaches: the LLM's decomposition is decoupled from the VLM's performance. The VLM and the LLM were trained with very different methodologies and hence have different inductive biases. These biases can lead to the LLM generating visual concepts that, while faithful to the categorization task, prove detrimental to the VLM's performance.

Concept Library: Introduction

Escher attempts to overcome this limitation by instantiating a concept library that is iteratively refined using the VLM's feedback.

This concept library is a data structure that stores the visual concepts generated by the LLM for each class.
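
Concretely, one can think of the library as a simple mapping from class names to lists of textual concepts, which grows over iterations; the example entries below are illustrative.

    # A minimal sketch of the concept library data structure.
    concept_library: dict[str, list[str]] = {
        "donut": ["a ring-shaped fried pastry", "a sugary glaze", "colorful sprinkles"],
        "bagel": ["a dense, chewy ring of bread", "a shiny, boiled crust", "sesame or poppy seeds"],
    }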

Concept Library: Instantiation

Following previous work, we generate concepts using an LLM conditioned on the class names, with a slight modification: the LLM is also given access to the concept library. Initially, the concept library is empty, so this is equivalent to how prior work uses the LLM.
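
A sketch of how such a generation query might be assembled is shown below; the prompt wording is illustrative rather than Escher's exact prompt.

    def build_generation_prompt(class_name: str,
                                library: dict[str, list[str]]) -> str:
        """Ask the LLM for visual concepts, conditioned on the current library."""
        existing = library.get(class_name, [])
        return (
            f"List short visual descriptions of what a '{class_name}' looks like "
            f"in a photograph. Do not repeat these existing concepts: {existing}"
        )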

Concept Library: VLM Forward Pass

To evaluate the generated concepts, Escher simply passes the concepts to the underlying VLM. However, instead of directly using the VLM outputs to predict the class, we use the VLM outputs to better understand what classes get confused with each other, and how the concept library can be improved to reduce this confusion.
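
As a sketch, this forward pass over an unlabeled image set can be summarized as an image-by-class score matrix, again assuming precomputed, L2-normalized embeddings and illustrative names.

    import numpy as np

    def class_score_matrix(image_embs: np.ndarray,
                           concept_embs: dict[str, np.ndarray]) -> np.ndarray:
        """Return a (num_images, num_classes) matrix of mean concept scores."""
        class_names = sorted(concept_embs)
        per_class = [(image_embs @ concept_embs[c].T).mean(axis=1)
                     for c in class_names]
        return np.stack(per_class, axis=1)  # one column per class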

Feedback Generation: (1) Disambiguation Heuristic

To generate feedback, Escher attempts to find classes that are confused with each other. If we have access to the ground truth labels, this can be achieved by constructing a confusion matrix of the VLM outputs.

However, Escher does not have access to the ground truth labels. Instead, Escher analyzes the raw similarity scores for each image-class pair to make this decision. Generally, this necessitates a well-calibrated backbone VLM, and there are many heuristics to achieve this.

Feedback Generation: (2) Disambiguation Heuristic

Here, we showcase one such heuristic: top-k pseudo-confusion. For each image, this heuristic selects the k classes with the highest image-text similarity scores, and it identifies pairs of classes that are consistently ranked together within this top-k set across images. The intuition is that if the VLM consistently ranks two classes among the top-k most-similar classes for the same images, it is likely that the VLM is confused between these two classes.
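
A sketch of this heuristic, operating on the score matrix from the previous step, might look as follows (illustrative code, not Escher's exact implementation).

    from collections import Counter
    from itertools import combinations
    import numpy as np

    def topk_pseudo_confusion(scores: np.ndarray, k: int = 5) -> Counter:
        """Count how often pairs of classes co-occur in an image's top-k ranking."""
        pair_counts = Counter()
        for row in scores:                    # one row per image
            topk = np.argsort(row)[-k:]       # indices of the k highest scores
            for a, b in combinations(sorted(topk), 2):
                pair_counts[(int(a), int(b))] += 1
        return pair_counts                    # frequent pairs ~ confused classes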

Feedback Generation: (3) Example

For example, in this image, the pseudo-confusion matrix can be used to identify two classes of birds in a fine-grained bird classification dataset that are confused with each other: a Slaty-backed Gull and a California Gull.

Feedback Generation: (4) Disambiguation Resolution

Once Escher identifies the classes that are confused with each other, it attempts to resolve this confusion by generating new concepts for the confused classes. If classes are confused more than once, the resolution query reflects the history of past disambiguations, as well as the VLM's feedback score after each disambiguation.
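
The sketch below illustrates how such a resolution query could be assembled; the prompt wording and the structure of the history are assumptions for illustration, not Escher's exact prompt.

    def build_resolution_prompt(class_a: str, class_b: str,
                                library: dict[str, list[str]],
                                history: list[str]) -> str:
        """Ask the LLM for concepts that separate two confused classes."""
        past = "\n".join(history) if history else "None so far."
        return (
            f"The VLM confuses '{class_a}' with '{class_b}'.\n"
            f"Current concepts for {class_a}: {library[class_a]}\n"
            f"Current concepts for {class_b}: {library[class_b]}\n"
            f"Previous disambiguation attempts and their outcomes:\n{past}\n"
            "Propose new visual concepts that distinguish these two classes."
        )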

Concept Library: Feedback loop

This loop is repeated for a fixed number of iterations, and usually yields a concept library that is much more interpretable and useful than the initial concept library.

In our running example, this enables the discovery of visual concepts that help distinguish between the donut and the bagel.
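
Putting the pieces together, the overall loop can be sketched as below. It reuses the helpers from the earlier sketches; embed_texts (the VLM's text encoder) and propose_concepts (an LLM call that builds a resolution prompt and parses the reply into new concepts per class) are hypothetical placeholders supplied by the caller.

    def evolve_library(library, image_embs, embed_texts, propose_concepts,
                       num_iters=5, k=5, num_pairs=10):
        """Iteratively refine the concept library using pseudo-confusion feedback."""
        class_names = sorted(library)
        for _ in range(num_iters):
            concept_embs = {c: embed_texts(library[c]) for c in class_names}
            scores = class_score_matrix(image_embs, concept_embs)
            confusion = topk_pseudo_confusion(scores, k=k)
            for (i, j), _count in confusion.most_common(num_pairs):
                a, b = class_names[i], class_names[j]
                for cls, new_concepts in propose_concepts(a, b, library).items():
                    library[cls] = library[cls] + new_concepts
        return library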

Feedback Generation: (5) Underlying Insight

Escher's key insight is to leverage the concept library to couple the LLM's generation with the VLM's responses. A VLM with a specialized concept library produces more fine-grained class-disambiguation feedback, prompting the LLM to uncover even finer concepts, which in turn yields an even more specialized library for the next iteration. As with other library learning algorithms, this self-reinforcing loop ideally continues, as long as the LLM disambiguates concepts well and the VLM remains sensitive to them, until no further relevant concepts can be identified.

Escher Results

Fine-grained evolution w/ CBMs

Generally, algorithms that use intermediate concepts for decomposing perceptual tasks are known as concept-bottleneck models (CBMs). Our running baseline so far -- using an LLM to generate a set of textual concepts -- can also be expressed as a CBM.

Escher is a meta-algorithm that is agnostic to the underlying choice of CBM. Furthermore, it doesn't require any human-provided concepts. We evaluated how Escher behaves in two scenarios (more studies in the paper!):

  1. Does evolving CBMs with Escher improve the performance of the CBM? We compared the performance of a CBM in its first iteration with the performance of the CBM after iterating with Escher. We found that Escher improves the performance of the CBM in all cases. The size of the improvement depended on many factors, which are discussed in greater detail in further experiments in the paper.
  2. Does Escher outperform the LLM's zero-shot generated concepts? Inherently, Escher has an unfair advantage over the LLM's zero-shot generated concepts, as it samples more concepts in each iteration. In this experiment, we queried the LLM for a set of concepts of the same size as Escher's final concept library. We found that Escher outperformed the LLM's zero-shot generated concepts in all cases.

Qualitative Results (1)

We start with an image of a Male Ring-necked Pheasant. This is a category from the North American birds dataset, which contains over 400 species of birds.

Qualitative Results (2)

With no iterations, the underlying VLM mispredicts this image as a Female Ring-necked Pheasant. Since this is a concept-bottleneck model, we can inspect the underlying concepts to better understand the model's prediction.

Qualitative Results (3)

When examining the concept activations for the Male Ring-necked Pheasant, we notice a substantially lower overall score, likely because the LLM generates concepts for each class in isolation. As a result, highly similar concepts may end up with slightly different wordings and scores, which can alter prediction outcomes. Moreover, these LLM-generated concepts may lack the discriminative power needed to distinguish between closely related classes, because the LLM is unaware of which other classes are present in the dataset.

Qualitative Results (4)

After five iterations with Escher, the VLM correctly predicts the true class of the image. This is because Escher is able to identify that the underlying VLM is getting confused between a Male and a Female Ring-necked Pheasant, and to generate concepts that resolve this confusion. Specifically, if we inspect the new most-activated concepts, we find that the LLM identifies that a Male Ring-necked Pheasant has a metallic green head and neck, which proves to be a very discriminative concept for the VLM.

Limitations

Escher currently has three main constraints:

  1. Factual Accuracy: Because we rely on zero-shot queries from large language models, Escher cannot guarantee the accuracy or correctness of the concepts in its library. Moreover, concepts flagged as important may reflect biases from the VLM's underlying training process and risk misleading researchers.
  2. Generality: The pseudo-confusion matrix approach is limited in fidelity and not well-suited for general-purpose use. Generating feedback for broader visual reasoning tasks remains a challenge.
  3. VLM Robustness: CLIP-based vision-language models (VLMs) are imperfect, as identified in recent studies. Since Escher is largely model-agnostic, it would be worthwhile to explore how more advanced VLMs might improve its performance.

BibTeX

If you found this post interesting, please read our paper for mathematical details and experimental results. You can cite our paper as follows:

@misc{sehgal2025selfevolvingvisualconceptlibrary,
	title={Self-Evolving Visual Concept Library using Vision-Language Critics}, 
	author={Atharva Sehgal and Patrick Yuan and Ziniu Hu and Yisong Yue and Jennifer J. Sun and Swarat Chaudhuri},
	year={2025},
	eprint={2504.00185},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2504.00185}, 
}