Human perception of visual similarity is inherently adaptive and subjective, depending on the user's interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates textual conditioning from visual feature extraction, enabling highly efficient, multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset, CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY achieves state-of-the-art retrieval accuracy and notable computational efficiency compared to previous works.
(a) Given a condition, we construct the manifold-aware textual subspace from the condition text features in advance and generate the condition-aware projection matrix Pc. (b) At inference, we compute the conditional similarity between query and database images by projecting their visual features onto the textual subspace with Pc.
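The projection-based pipeline above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the "manifold-aware" subspace construction is not specified here, so the sketch uses a plain orthogonal projection onto the span of the condition text embeddings; all function names, dimensions, and the random features are hypothetical stand-ins for precomputed VLM embeddings.

```python
import numpy as np

def projection_matrix(T):
    """Orthogonal projection onto the span of row-stacked text features.

    T: (k, d) condition text embeddings. Returns P = T^T (T T^T)^+ T, a
    (d, d) matrix; a stand-in for the paper's condition-aware Pc.
    """
    return T.T @ np.linalg.pinv(T @ T.T) @ T

def conditional_similarity(q, X, P):
    """Project query and database visual features with P, then compare
    them with cosine similarity. q: (d,), X: (n, d), P: (d, d)."""
    qp = q @ P
    Xp = X @ P
    qp = qp / np.linalg.norm(qp)
    Xp = Xp / np.linalg.norm(Xp, axis=1, keepdims=True)
    return Xp @ qp  # (n,) conditional similarities

# Toy example with random vectors standing in for VLM embeddings.
rng = np.random.default_rng(0)
d, k, n = 512, 8, 100
T = rng.standard_normal((k, d))   # condition text features (built in advance)
P_c = projection_matrix(T)        # computed once per condition
q = rng.standard_normal(d)        # query image feature (fixed visual embedding)
X = rng.standard_normal((n, d))   # database image features (fixed)
sims = conditional_similarity(q, X, P_c)
top5 = np.argsort(-sims)[:5]      # retrieval: highest conditional similarity
```

Because the visual embeddings stay fixed, switching conditions only swaps the precomputed Pc; no image re-encoding is needed, which is the source of the efficiency claim.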
We construct a synthetic dataset with diverse condition annotations, consisting of (a) object entities and (b) human entities. Each pie chart visualizes the distribution of a key attribute, showing the diversity of the dataset.
For each query image and condition text pair, we compare the top-5 retrieved results from (a) CLIP-B, (b) InstructBLIP, (c) GeneCIS, and (d) our method. We also report Average Precision (AP) for each result. Green boxes indicate correctly retrieved images, while incorrect retrievals are shown in red.
We show t-SNE visualizations of CLIP-B and our method (CLIP-B) on CLAY-Human under the conditions (a) action, (b) background, and (c) age. Features with the same label share a color for easy interpretation. Compared to the fixed representation space of CLIP-B, our method forms more discriminative spaces compliant with the given conditions.
@inproceedings{lim2026clay,
title = {CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space},
author = {Lim, Sohwi and Lee, Hyoseok and Park, Jungjoon and Oh, Tae-Hyun},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}