Abstract
Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.
System Overview
Our framework reframes demonstration selection as a finite-horizon Markov Decision Process (MDP). The agent uses a novel Query-Centric State Encoder (a Transformer Decoder) to fuse the query embedding with previously selected demonstrations, preventing "policy collapse". To handle an action space spanning the entire dataset of \(N\) candidates, our Dueling Q-Network predicts a continuous advantage query vector, enabling efficient FAISS-based approximate nearest-neighbor search instead of scoring all \(N\) actions.
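The query-centric fusion step can be sketched as a single cross-attention pass in which the query is always the attention source, so the resulting state stays conditioned on the query even as demonstrations accumulate. This is an illustrative numpy sketch, not the paper's implementation; the projection matrices `W_q`, `W_k`, `W_v` are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def query_centric_state(query, demos, W_q, W_k, W_v):
    """Fuse the query embedding with previously selected demos.

    query: (D,) embedding of the test query -- always the attention
    source, which is the intuition behind avoiding policy collapse.
    demos: (M, D) embeddings of demonstrations selected so far.
    W_q, W_k, W_v: (D, D) hypothetical projection matrices.
    """
    if len(demos) == 0:                      # empty set at episode start
        return query
    q = query @ W_q                          # project query
    k = demos @ W_k                          # project selected demos
    v = demos @ W_v
    scores = k @ q / np.sqrt(len(q))         # scaled dot-product attention
    weights = np.exp(scores - scores.max())  # stable softmax
    weights /= weights.sum()
    context = weights @ v                    # query-conditioned demo summary
    return query + context                   # residual fusion -> state vector
```

Because the query, not the demonstration set, drives the attention, two different queries with identical selected sets still produce different states.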
Main Quantitative Results
We report the Mean Absolute Error (MAE \(\downarrow\)) for all methods across five benchmark datasets, evaluated with Gemma 3 4B-it. Our proposed method, LSD, consistently outperforms all baselines on objective tasks, and the performance gap widens as \(K\) increases.
| Domain | Dataset | kNN (K=4) | kNN (K=8) | LSD (K=4) | LSD (K=8) |
|---|---|---|---|---|---|
| Objective (LSD Wins) | UTKFace | 7.27 | 7.61 | 6.27 | 7.05 |
| Objective (LSD Wins) | KonIQ-10k | 0.44 | 0.55 | 0.40 | 0.51 |
| Objective (LSD Wins) | KADID-10k | 0.87 | 0.91 | 0.79 | 0.82 |
| Subjective (LSD Fails) | AVA | 0.98 | 0.83 | 1.06 | 0.98 |
| Subjective (LSD Fails) | SCUT-FBP5500 | 0.39 | 0.40 | 0.62 | 0.67 |
Objective Tasks (Require Diversity)
On factual regression tasks (e.g., Age, Quality), LSD significantly outperforms standard kNN. The agent learns that providing a diverse set of "boundary" examples helps the MLLM model the entire regression space accurately.
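One simple way to make this relevance-diversity trade-off concrete is maximal marginal relevance (MMR), a greedy heuristic that penalizes redundancy with the already-chosen set. The sketch below is an illustrative stand-in for the behavior the learned policy discovers, not the policy itself; `lam` is a hypothetical mixing weight.

```python
import numpy as np

def mmr_select(query, pool, k, lam=0.5):
    """Greedy relevance-diversity selection over L2-normalized embeddings.

    lam=1.0 recovers pure kNN (similarity only); lower lam trades
    similarity for diversity, spreading picks across the pool.
    """
    rel = pool @ query                         # cosine relevance to query
    chosen = [int(np.argmax(rel))]
    while len(chosen) < k:
        red = pool @ pool[chosen].T            # similarity to chosen set
        score = lam * rel - (1 - lam) * red.max(axis=1)
        score[chosen] = -np.inf                # never re-pick an item
        chosen.append(int(np.argmax(score)))
    return chosen
```

With `lam=1.0` the selection reduces exactly to the kNN baseline, which makes the heuristic a convenient knob for probing how much diversity a given task rewards.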
Subjective Tasks (Rely on Similarity)
Conversely, on subjective preference tasks (e.g., Aesthetics, Beauty), the kNN baseline is superior. For human perception tasks, visual similarity acts as a necessary anchor, and the diversity introduced by LSD acts as confusing noise.
Demonstration Set Analysis
To understand our agent's policy, we analyzed the selected demo sets on UTKFace. While kNN follows a fixed, myopic strategy based strictly on feature similarity, LSD optimizes directly for MLLM performance, resulting in emergent label-awareness.
Emergent Label-Awareness: Plot (a) shows that by optimizing for the final reward, LSD implicitly learns to select demos that are closer in label-space to the query, despite its state containing no ground-truth label information. Plot (d) shows LSD actively seeks diversity (low-similarity) compared to kNN.
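The two diagnostics behind these plots can be computed with a few lines of numpy: mean absolute label gap to the query (lower indicates label-awareness) and mean pairwise cosine similarity within the selected set (lower indicates diversity). This is a minimal sketch of the analysis, assuming embeddings and scalar labels are already available.

```python
import numpy as np

def set_statistics(query_label, demo_labels, demo_emb):
    """Diagnostics for a selected demonstration set.

    Returns (mean |label gap| to the query, mean pairwise cosine
    similarity within the set). demo_emb: (M, D) raw embeddings.
    """
    label_gap = np.abs(np.asarray(demo_labels) - query_label).mean()
    e = demo_emb / np.linalg.norm(demo_emb, axis=1, keepdims=True)
    sim = e @ e.T                              # pairwise cosine similarities
    iu = np.triu_indices(len(e), k=1)          # upper triangle, no diagonal
    return label_gap, sim[iu].mean()
```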
Qualitative Behavioral Insights
Cross-MLLM Generalization
We demonstrate that the policy learned by our LSD agent is highly generalizable. A single policy trained using reward signals from Gemma 3 4B-it successfully transfers to unseen models, maintaining its performance advantage on Qwen 2.5 7B and performing comparably to strong baselines on Phi-3.5-vision.
Efficiency & Ablation
Large-Scale Action Selection
Discrete RL fails at \(N=50{,}000\) due to the "exploration cliff": with one logit per item, most actions are never sampled during training. LSD instead maps states to continuous embeddings, exploiting the semantic inductive bias of the SigLIP encoder, so action selection scales logarithmically via FAISS.
| Feature | Discrete RL (e.g., PPO) | Proposed (LSD) |
|---|---|---|
| Output Space | \(N\) logits (One per item) | Vector \(\in \mathbb{R}^D\) |
| Semantics | Orthogonal Actions | Semantic Neighbors |
| Complexity | Linear \(\mathcal{O}(N)\) | Logarithmic \(\mathcal{O}(\log N + k)\) |
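The continuous-action step reduces to emitting a vector and retrieving its nearest unselected neighbor. The sketch below uses brute-force cosine search for clarity where the paper uses FAISS; the linear head `head_W` is a hypothetical stand-in for the advantage head.

```python
import numpy as np

def select_action(state, head_W, dataset_emb, selected):
    """Map a state to a continuous advantage query vector, then return
    the index of the nearest unselected dataset item.

    dataset_emb: (N, D) row-normalized item embeddings.
    selected: set of indices already chosen this episode.
    """
    a_vec = state @ head_W                       # advantage query in R^D
    a_vec /= np.linalg.norm(a_vec) + 1e-9
    sims = dataset_emb @ a_vec                   # cosine similarities
    sims[list(selected)] = -np.inf               # mask already-chosen demos
    return int(np.argmax(sims))                  # FAISS ANN in practice
```

Swapping `np.argmax` for an ANN index is what turns the \(\mathcal{O}(N)\) scan into the \(\mathcal{O}(\log N + k)\) lookup reported in the table.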
Ablation: State Encoder Architecture
We compare our Query-Centric model against a standard decoder-only model (Concat Input). The Concat baseline exhibited a critical behavioral failure—policy collapse—learning to select the same non-query-specific demonstrations for all queries.
| Strategy | K=4 | K=8 | K=16 | Policy Behavior |
|---|---|---|---|---|
| Query-Centric | 6.27 | 7.05 | 6.64 | Query-Specific |
| Concat Input | 7.01 | 6.42 | 7.74 | Policy Collapse |
Citation
@inproceedings{lee2026learning,
title={Learning to Select Demonstrations for Visual In-Context Learning},
author={Lee, Eugene and Lin, Yu-Chi and Diao, Jiajie},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}