INQUIRE-Fullrank: end-to-end retrieval from the full image pool. Rows of the form "X Top 100 → Y" first retrieve the top 100 images with CLIP model X, then rerank them with multimodal LLM Y (see the sketch after the table). Scores are mAP@50, overall and per query category.

Method | Size | Overall | Appearance | Behavior | Context | Species |
---|---|---|---|---|---|---|
CLIP ViT-H/14-378 (DFN) Top 100 → GPT-4o | - | 47.1 | 36.6 | 49.7 | 51.9 | 59.4 |
CLIP ViT-H/14-378 (DFN) Top 100 → VILA1.5-40B | - | 42.1 | 32.5 | 44.7 | 46.7 | 52.4 |
CLIP ViT-H/14-378 (DFN) Top 100 → GPT-4-Turbo (20240409) | - | 38.8 | 29.7 | 40.0 | 42.2 | 54.7 |
CLIP ViT-H/14-378 (DFN) Top 100 → PaliGemma-3B-mix-448 | - | 37.7 | 27.2 | 41.2 | 41.7 | 48.6 |
CLIP ViT-H/14-378 (DFN) Top 100 → LLaVA-v1.6-34B | - | 37.4 | 28.0 | 39.0 | 41.8 | 50.8 |
CLIP ViT-H/14-378 (DFN) | 987M | 35.6 | 25.7 | 38.7 | 36.5 | 52.7 |
SigLIP SO400m-14-384 | 878M | 34.9 | 30.5 | 35.7 | 36.0 | 42.6 |
SigLIP ViT-L/16-384 | 652M | 31.6 | 24.1 | 33.0 | 33.8 | 44.5 |
CLIP ViT-L/14 (DFN) | 428M | 24.6 | 18.4 | 24.0 | 26.3 | 40.9 |
CLIP ViT-B/16 (DFN) | 150M | 16.2 | 12.0 | 16.8 | 15.7 | 28.3 |
CLIP ViT-L/14 (OpenAI) | 428M | 15.8 | 14.9 | 15.3 | 14.3 | 23.6 |
CLIP RN50x16 (OpenAI) | 291M | 14.3 | 10.4 | 15.8 | 13.3 | 23.3 |
CLIP ViT-B/16 (OpenAI) | 150M | 11.4 | 9.8 | 10.6 | 11.2 | 19.0 |
CLIP ViT-B/32 (OpenAI) | 151M | 8.2 | 5.8 | 7.6 | 8.9 | 16.1 |
CLIP RN50 (OpenAI) | 102M | 7.6 | 5.7 | 7.3 | 7.9 | 13.8 |
WildCLIP-t1 | 150M | 7.5 | 5.2 | 8.0 | 7.0 | 13.2 |
WildCLIP-t1t7-lwf | 150M | 7.3 | 6.5 | 6.8 | 6.4 | 13.1 |
BioCLIP | 150M | 3.6 | 2.3 | 0.5 | 2.2 | 21.1 |
Random | - | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
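
The "X Top 100 → Y" rows above describe a two-stage pipeline: CLIP model X ranks the entire image pool by text-image similarity, and multimodal LLM Y then reranks only X's 100 best candidates. Below is a minimal sketch of the first stage, assuming embeddings have already been extracted with the CLIP model; the array names and the stage-2 prompt are illustrative placeholders, not the benchmark's actual code.

```python
import numpy as np

def clip_top_k(image_embs: np.ndarray, query_emb: np.ndarray, k: int = 100) -> np.ndarray:
    """Stage 1: rank every image in the pool by cosine similarity to the text query."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    query_emb = query_emb / np.linalg.norm(query_emb)
    sims = image_embs @ query_emb              # one cosine similarity per image
    return np.argsort(-sims)[:k]               # indices of the k best-matching images

rng = np.random.default_rng(0)
pool = rng.standard_normal((10_000, 512)).astype(np.float32)  # stand-in for precomputed image embeddings
query = rng.standard_normal(512).astype(np.float32)           # stand-in for the query text embedding
candidates = clip_top_k(pool, query)  # 100 indices to hand to the stage-2 reranker

# Stage 2 (rerank): show each candidate image to the LMM with a prompt along the
# lines of "Does this image show <query>?" and re-sort the candidates by its score.
```

In the table above, this kind of reranking lifts the CLIP ViT-H/14-378 (DFN) retriever's overall score from 35.6 (CLIP alone) to 47.1 (with GPT-4o as the reranker).
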
INQUIRE-Rerank: reranking a fixed set of 100 candidate images per query. Scores are mAP over the 100 candidates.

Method | Size | Overall | Appearance | Behavior | Context | Species |
---|---|---|---|---|---|---|
GPT-4o | - | 62.4 | 59.7 | 61.9 | 70.6 | 42.4 |
VILA1.5-40B | 40B | 54.3 | 50.4 | 55.1 | 61.9 | 36.0 |
SigLIP SO400m-14-384 | 878M | 51.5 | 51.8 | 51.7 | 53.4 | 38.8 |
GPT-4-Turbo (20240409) | - | 48.9 | 43.7 | 49.6 | 56.6 | 39.7 |
PaliGemma-3B-mix-448 | 3B | 48.9 | 44.1 | 51.6 | 53.8 | 35.3 |
LLaVA-v1.6-34B | 34B | 48.3 | 43.7 | 48.7 | 56.4 | 34.7 |
SigLIP ViT-L/16-384 | 652M | 47.5 | 42.8 | 50.2 | 52.1 | 34.7 |
VILA1.5-13B | 13B | 46.3 | 40.2 | 46.5 | 56.8 | 32.7 |
CLIP ViT-H/14-378 (DFN) | 987M | 44.6 | 38.8 | 50.1 | 47.4 | 28.6 |
InstructBLIP-FLAN-T5-XXL | 12B | 44.3 | 38.7 | 45.9 | 50.7 | 37.2 |
LLaVA-v1.6-Mistral-7B | 7B | 43.1 | 39.0 | 42.7 | 51.5 | 31.7 |
LLaVA-v1.5-13B | 13B | 43.0 | 37.7 | 45.1 | 48.9 | 32.7 |
BLIP-2-FLAN-T5-XXL | 12B | 40.5 | 32.8 | 43.4 | 47.9 | 32.4 |
CLIP ViT-L/14 (DFN) | 428M | 39.1 | 34.9 | 40.7 | 43.3 | 33.4 |
CLIP ViT-L/14 (OpenAI) | 428M | 37.8 | 35.1 | 37.9 | 41.4 | 37.6 |
CLIP RN50x16 (OpenAI) | 291M | 36.2 | 32.7 | 36.1 | 40.5 | 39.8 |
CLIP ViT-B/16 (DFN) | 150M | 33.7 | 29.4 | 35.4 | 37.2 | 31.5 |
CLIP ViT-B/16 (OpenAI) | 150M | 33.5 | 30.8 | 32.9 | 37.2 | 37.1 |
WildCLIP-t1 | 150M | 31.6 | 28.2 | 31.0 | 36.5 | 34.3 |
WildCLIP-t1t7-lwf | 150M | 31.5 | 29.0 | 30.5 | 35.2 | 37.4 |
CLIP ViT-B/32 (OpenAI) | 151M | 31.3 | 26.9 | 30.4 | 37.3 | 37.0 |
CLIP RN50 (OpenAI) | 102M | 31.2 | 28.8 | 30.3 | 35.0 | 35.2 |
BioCLIP | 150M | 28.9 | 27.4 | 27.2 | 30.8 | 41.1 |
Random | - | 23.0 | - | - | - | - |
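
The numbers in both tables are mean average precision scores: AP is computed per query over the ranked list (at a cutoff of the top 50 images for the full-ranking task above, and over the fixed 100 candidates here), then averaged across the queries in each column. A minimal AP sketch over binary relevance labels follows; this is one common normalization, and the benchmark's exact variant may differ.

```python
def average_precision(rels: list[int], cutoff: int | None = None) -> float:
    """AP of a ranked list of binary relevance labels (1 = relevant, 0 = not)."""
    if cutoff is not None:
        rels = rels[:cutoff]
    hits, ap_sum = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap_sum += hits / rank  # precision at the rank of each relevant hit
    return ap_sum / hits if hits else 0.0

# Relevant images at ranks 1 and 3: AP = (1/1 + 2/3) / 2 ≈ 0.833
print(average_precision([1, 0, 1, 0]))
```

This also explains the Random rows: over a fixed 100-image candidate set, the expected AP of a random ordering is roughly the fraction of relevant candidates, so the 23.0 baseline suggests that around 23% of the candidates are relevant on average; a random ordering of a multi-million-image pool, by contrast, scores essentially 0.
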
Citation:

@inproceedings{vendrow2024inquire,
  title={{INQUIRE}: A Natural World Text-to-Image Retrieval Benchmark},
  author={Vendrow, Edward and Pantazis, Omiros and Shepard, Alexander and Brostow, Gabriel and Jones, Kate E. and Mac Aodha, Oisin and Beery, Sara and Van Horn, Grant},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2024}
}