Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models

Published in Winter Conference on Applications of Computer Vision, 2026

Recommended citation: Kawarada, M., Yamada, K., Tejero-de-Pablos, A., & Inoue, N. (2026, March). Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models. In Proc. Winter Conference on Applications of Computer Vision (pp. 7636-7646).

Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre). Producing such embeddings has been a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM’s last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, in multiple settings, DIOR outperforms methods that require additional training.
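
The extraction step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording in `PROMPT_TEMPLATE`, the `encode` callable (standing in for an LVLM forward pass that returns one hidden-state vector per token), and the `toy_encode` stub are all assumptions introduced here so the example runs without a real model.

```python
import math
from typing import Any, Callable, List, Sequence

# Hypothetical prompt wording; the exact prompt used by DIOR is not given on this page.
PROMPT_TEMPLATE = "Describe the {condition} of this image in one word:"

# One hidden-state vector per token position, as a real LVLM would return.
HiddenStates = List[List[float]]


def build_prompt(condition: str) -> str:
    """Ask for a single-word description related to the condition."""
    return PROMPT_TEMPLATE.format(condition=condition)


def conditional_embedding(
    encode: Callable[[Any, str], HiddenStates], image: Any, condition: str
) -> List[float]:
    """Use the hidden state of the last token as the conditional embedding."""
    hidden = encode(image, build_prompt(condition))
    if not hidden:
        raise ValueError("encoder returned no hidden states")
    return list(hidden[-1])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def conditional_similarity(
    encode: Callable[[Any, str], HiddenStates],
    image_a: Any,
    image_b: Any,
    condition: str,
) -> float:
    """Compare two images with respect to one textual condition."""
    emb_a = conditional_embedding(encode, image_a, condition)
    emb_b = conditional_embedding(encode, image_b, condition)
    return cosine_similarity(emb_a, emb_b)


def toy_encode(image: dict, prompt: str) -> HiddenStates:
    """Toy stand-in for an LVLM (assumption): returns two 'token' states,
    the last of which is the attribute vector named in the prompt."""
    for name, vector in image.items():
        if name in prompt:
            return [[0.0] * len(vector), list(vector)]
    raise KeyError("no attribute of the toy image matches the prompt")


# Toy "images": each attribute maps to a made-up feature vector.
red_dress = {"color": [1.0, 0.0], "genre": [0.0, 1.0]}
red_car = {"color": [1.0, 0.0], "genre": [1.0, 0.0]}

same_color = conditional_similarity(toy_encode, red_dress, red_car, "color")
same_genre = conditional_similarity(toy_encode, red_dress, red_car, "genre")
```

With the toy encoder, the two images are maximally similar under the "color" condition and dissimilar under "genre", which is the behavior a conditional embedding is meant to expose. In practice, `encode` would wrap a real LVLM and return its final-layer hidden states for the image and prompt.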

Download here

Bibtex:

@inproceedings{kawarada2026training,
  title={Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models},
  author={Kawarada, Masayuki and Yamada, Kosuke and Tejero-de-Pablos, Antonio and Inoue, Naoto},
  booktitle={Proc. Winter Conference on Applications of Computer Vision},
  pages={7636--7646},
  year={2026}
}