Abstract
Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.
System Overview
Our framework reframes demonstration selection as a finite-horizon Markov Decision Process (MDP). The agent uses a novel Query-Centric State Encoder (a Transformer Decoder) to fuse the query embedding with previously selected demonstrations, preventing "policy collapse". To handle an action space spanning the entire dataset of \(N\) candidates, our Dueling Q-Network predicts a continuous advantage query vector, enabling efficient FAISS-based approximate nearest-neighbor search instead of scoring all \(N\) actions.
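The query-centric fusion step can be sketched as a single cross-attention pass in which the query is always the attention source, so the resulting state stays conditioned on the query even as demonstrations accumulate. This is an illustrative numpy sketch, not the paper's implementation; the projection matrices `W_q`, `W_k`, `W_v` are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def query_centric_state(query, demos, W_q, W_k, W_v):
    """Fuse the query embedding with previously selected demos.

    query: (D,) embedding of the test query -- always the attention
    source, which is the intuition behind avoiding policy collapse.
    demos: (M, D) embeddings of demonstrations selected so far.
    W_q, W_k, W_v: (D, D) hypothetical projection matrices.
    """
    if len(demos) == 0:                      # empty set at episode start
        return query
    q = query @ W_q                          # project query
    k = demos @ W_k                          # project selected demos
    v = demos @ W_v
    scores = k @ q / np.sqrt(len(q))         # scaled dot-product attention
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    context = weights @ v                    # query-conditioned demo summary
    return query + context                   # residual fusion -> state vector
```

Because the query, not the demonstration set, drives the attention, two different queries with identical selected sets still produce different states.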
Main Quantitative Results
We report the Mean Absolute Error (MAE \(\downarrow\)) for all methods across five benchmark datasets, evaluated with Gemma 3 4B-it. Our proposed method, LSD, consistently outperforms all baselines on objective tasks, and the performance gap widens as \(K\) increases.
| Domain | Dataset | kNN (K=4) | kNN (K=8) | LSD (K=4) | LSD (K=8) |
|---|---|---|---|---|---|
| Objective (LSD Wins) | UTKFace | 7.27 | 7.61 | 6.27 | 7.05 |
| Objective (LSD Wins) | KonIQ-10k | 0.44 | 0.55 | 0.40 | 0.51 |
| Objective (LSD Wins) | KADID-10k | 0.87 | 0.91 | 0.79 | 0.82 |
| Subjective (LSD Fails) | AVA | 0.98 | 0.83 | 1.06 | 0.98 |
| Subjective (LSD Fails) | SCUT-FBP5500 | 0.39 | 0.40 | 0.62 | 0.67 |
Objective Tasks (Require Diversity)
On factual regression tasks (e.g., Age, Quality), LSD significantly outperforms standard kNN. The agent learns that providing a diverse set of "boundary" examples helps the MLLM model the entire regression space accurately.
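One simple way to make this relevance-diversity trade-off concrete is maximal marginal relevance (MMR), a greedy heuristic that penalizes redundancy with the already-chosen set. The sketch below is an illustrative stand-in for the behavior the learned policy discovers, not the policy itself; `lam` is a hypothetical mixing weight.

```python
import numpy as np

def mmr_select(query, pool, k, lam=0.5):
    """Greedy relevance-diversity selection over L2-normalized embeddings.

    lam=1.0 recovers pure kNN (similarity only); lower lam trades
    similarity for diversity, spreading picks across the pool.
    """
    rel = pool @ query                         # cosine relevance to query
    chosen = [int(np.argmax(rel))]
    while len(chosen) < k:
        red = pool @ pool[chosen].T            # similarity to chosen set
        score = lam * rel - (1 - lam) * red.max(axis=1)
        score[chosen] = -np.inf                # never re-pick an item
        chosen.append(int(np.argmax(score)))
    return chosen
```

With `lam=1.0` the selection reduces exactly to the kNN baseline, which makes the heuristic a convenient knob for probing how much diversity a given task rewards.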
Subjective Tasks (Rely on Similarity)
Conversely, on subjective preference tasks (e.g., Aesthetics, Beauty), the kNN baseline is superior. For human perception tasks, visual similarity acts as a necessary anchor, and the diversity introduced by LSD acts as confusing noise.
Demonstration Set Analysis
To understand our agent's policy, we analyzed the selected demo sets on UTKFace. While kNN follows a fixed, myopic strategy based strictly on feature similarity, LSD optimizes directly for MLLM performance, resulting in emergent label-awareness.
Emergent Label-Awareness: Plot (a) shows that by optimizing for the final reward, LSD implicitly learns to select demos that are closer in label-space to the query, despite its state containing no ground-truth label information. Plot (d) shows LSD actively seeks diversity (low-similarity) compared to kNN.
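The two diagnostics behind these plots can be computed with a few lines of numpy: mean absolute label gap to the query (lower indicates label-awareness) and mean pairwise cosine similarity within the selected set (lower indicates diversity). This is a minimal sketch of the analysis, assuming embeddings and scalar labels are already available.

```python
import numpy as np

def set_statistics(query_label, demo_labels, demo_emb):
    """Diagnostics for a selected demonstration set.

    Returns (mean |label gap| to the query, mean pairwise cosine
    similarity within the set). demo_emb: (M, D) raw embeddings.
    """
    label_gap = np.abs(np.asarray(demo_labels) - query_label).mean()
    e = demo_emb / np.linalg.norm(demo_emb, axis=1, keepdims=True)
    sim = e @ e.T                              # pairwise cosine similarities
    iu = np.triu_indices(len(e), k=1)          # upper triangle, no diagonal
    return label_gap, sim[iu].mean()
```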
Qualitative Behavioral Insights
Cross-MLLM Generalization
We demonstrate that the policy learned by our LSD agent is highly generalizable. A single policy trained using reward signals from Gemma 3 4B-it successfully transfers to unseen models, maintaining its performance advantage on Qwen 2.5 7B and performing comparably to strong baselines on Phi-3.5-vision.
Efficiency & Ablation
Large-Scale Action Selection
Discrete RL fails at \(N=50{,}000\) due to the "exploration cliff": with one logit per item, most actions are never sampled during training. LSD instead maps states to continuous embeddings, exploiting the semantic inductive bias of the SigLIP encoder, so action selection scales logarithmically via FAISS.
| Feature | Discrete RL (e.g., PPO) | Proposed (LSD) |
|---|---|---|
| Output Space | \(N\) logits (One per item) | Vector \(\in \mathbb{R}^D\) |
| Semantics | Orthogonal Actions | Semantic Neighbors |
| Complexity | Linear \(\mathcal{O}(N)\) | Logarithmic \(\mathcal{O}(\log N + k)\) |
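The continuous-action step reduces to emitting a vector and retrieving its nearest unselected neighbor. The sketch below uses brute-force cosine search for clarity where the paper uses FAISS; the linear head `head_W` is a hypothetical stand-in for the advantage head.

```python
import numpy as np

def select_action(state, head_W, dataset_emb, selected):
    """Map a state to a continuous advantage query vector, then return
    the index of the nearest unselected dataset item.

    dataset_emb: (N, D) row-normalized item embeddings.
    selected: set of indices already chosen this episode.
    """
    a_vec = state @ head_W                       # advantage query in R^D
    a_vec /= np.linalg.norm(a_vec) + 1e-9
    sims = dataset_emb @ a_vec                   # cosine similarities
    sims[list(selected)] = -np.inf               # mask already-chosen demos
    return int(np.argmax(sims))                  # FAISS ANN in practice
```

Swapping `np.argmax` for an ANN index is what turns the \(\mathcal{O}(N)\) scan into the \(\mathcal{O}(\log N + k)\) lookup reported in the table.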
Ablation: State Encoder Architecture
We compare our Query-Centric model against a standard decoder-only model (Concat Input). The Concat baseline exhibited a critical behavioral failure—policy collapse—learning to select the same non-query-specific demonstrations for all queries.
| Strategy | K=4 | K=8 | K=16 | Policy Behavior |
|---|---|---|---|---|
| Query-Centric | 6.27 | 7.05 | 6.64 | Query-Specific |
| Concat Input | 7.01 | 6.42 | 7.74 | Policy Collapse |
Citation
@inproceedings{lee2026learning,
title={Learning to Select Demonstrations for Visual In-Context Learning},
author={Lee, Eugene and Lin, Yu-Chi and Diao, Jiajie},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}