Tune-an-Ellipse: CLIP Has Potential to Find What You Want

Jinheng Xie, Songhe Deng, Bing Li, Haozhe Liu, Yawen Huang*, Yefeng Zheng, Jürgen Schmidhuber, Bernard Ghanem, Linlin Shen, Mike Zheng Shou

*Corresponding author for this work

Research output: Contribution to conference › Paper › peer-review

1 Scopus citation

Abstract

Visual prompting of large vision-language models such as CLIP exhibits intriguing zero-shot capabilities. A manually drawn red circle, commonly used for highlighting, can guide CLIP's attention to the surrounding region and help it identify specific objects within an image. Without precise object proposals, however, such prompting is insufficient for localization. We propose Differentiable Visual Prompting, a simple yet effective approach that enables CLIP to localize in a zero-shot manner: given an image and a text prompt describing an object, we first pick a rendered ellipse from uniformly distributed anchor ellipses on the image grid via visual prompting, then use three loss functions to tune the ellipse coefficients so that the ellipse gradually encapsulates the target region. This yields promising experimental results for referring expression comprehension without precisely specified object proposals. In addition, we systematically present the limitations of visual prompting inherent in CLIP and discuss potential solutions.
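To make the tuning loop in the abstract concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' released code: it renders a soft red ellipse as a differentiable visual prompt and adjusts the ellipse coefficients by gradient ascent on CLIP image-text similarity. The parameterization (center, axes, rotation), the single similarity loss, and all function names are assumptions for illustration; the paper itself uses three loss functions and selects an initial ellipse from a uniform grid of anchors, whose exact forms are not given in this abstract.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 so gradients flow cleanly

def soft_ellipse_ring(h, w, params, band=0.05):
    """Differentiable alpha mask for an ellipse outline.
    params = (cx, cy, a, b, theta); coordinates are normalized to [0, 1]."""
    cx, cy, a, b, theta = params
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, h, device=params.device),
        torch.linspace(0, 1, w, device=params.device),
        indexing="ij",
    )
    dx, dy = xs - cx, ys - cy
    xr = dx * torch.cos(theta) + dy * torch.sin(theta)    # rotate into the
    yr = -dx * torch.sin(theta) + dy * torch.cos(theta)   # ellipse frame
    r = (xr / a) ** 2 + (yr / b) ** 2                     # = 1 on the boundary
    # A soft band around the boundary keeps gradients non-zero everywhere.
    return torch.exp(-((r - 1.0) ** 2) / (2 * band ** 2))

def tune_ellipse(image, text, steps=100, lr=0.01):
    """image: CLIP-preprocessed (3, 224, 224) tensor; text: the object query."""
    with torch.no_grad():
        txt = model.encode_text(clip.tokenize([text]).to(device))
        txt = txt / txt.norm(dim=-1, keepdim=True)
    # One centered anchor for brevity; the paper instead picks the best of
    # a uniform grid of anchor ellipses via visual prompting.
    params = torch.tensor([0.5, 0.5, 0.25, 0.25, 0.0],
                          device=device, requires_grad=True)
    red = torch.tensor([1.0, 0.0, 0.0], device=device).view(3, 1, 1)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        alpha = soft_ellipse_ring(224, 224, params)        # (224, 224)
        # Blend a red ring into the image (CLIP's input normalization is
        # glossed over here for simplicity).
        prompted = image * (1 - alpha) + red * alpha
        img = model.encode_image(prompted.unsqueeze(0))
        img = img / img.norm(dim=-1, keepdim=True)
        loss = -(img * txt).sum()                          # maximize similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params.detach()  # tuned (cx, cy, a, b, theta)
```

Under these assumptions, a call would look like `tune_ellipse(preprocess(Image.open("cat.jpg")).to(device), "a photo of a cat")`; the tuned normalized coefficients could then be mapped back to pixel coordinates to read off the localized region.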

Original language: English (US)
Pages: 13723-13732
Number of pages: 10
State: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: Jun 16, 2024 – Jun 22, 2024

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Country/Territory: United States
City: Seattle
Period: 06/16/24 – 06/22/24

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
