INQUIRE
A Benchmark for Natural World Image Retrieval
1 Massachusetts Institute of Technology · 2 University College London · 3 iNaturalist · 4 University of Edinburgh · 5 University of Massachusetts, Amherst (*, † indicate equal contribution)
Expert-level multi-modal models require expert-level benchmarks.

We introduce 🔍 INQUIRE, an image retrieval benchmark of 200 challenging ecological queries, each comprehensively labeled over iNat24, a new five-million-image subset of iNaturalist data. We hope that 🔍 INQUIRE will encourage the community to build next-generation image retrieval methods that help accelerate and automate scientific discovery.
Example queries:
A hermit crab using plastic trash as its shell
Distal rhynchokinesis
California condor tagged with a green “26”
Everted osmeterium
An ornamented bowerbird nest
A nest brood parasitized by a cowbird
A sick cassava plant
Tamandua back-brooding its young
INQUIRE Leaderboards
We evaluate current multimodal models on INQUIRE, on both the Fullrank and Rerank tasks. Evaluations are conducted zero-shot, with no additional prompt tuning or in-context demonstrations. Results are reported as AP@50, the average precision over the top 50 retrieved images.
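For reference, below is a minimal sketch of one common AP@k formulation. The exact normalization used by the official scoring code (here, dividing by the number of relevant images found within the top k) is an assumption, not a statement of the benchmark's implementation.

```python
import numpy as np

def average_precision_at_k(relevance, k=50):
    """AP@k over a ranked list.

    `relevance` is a binary sequence ordered by the model's ranking
    (1 = the image at that rank matches the query, 0 = it does not).
    """
    relevance = np.asarray(relevance[:k], dtype=float)
    if relevance.sum() == 0:
        return 0.0
    # Precision at each rank, averaged over the ranks where a relevant image appears.
    precisions = np.cumsum(relevance) / (np.arange(len(relevance)) + 1)
    return float((precisions * relevance).sum() / relevance.sum())

# Example: relevant images retrieved at ranks 1, 3, and 4 -> AP@50 ≈ 0.81.
print(average_precision_at_k([1, 0, 1, 1, 0]))
```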
INQUIRE-Fullrank Leaderboard
This leaderboard shows full dataset retrieval performance, starting from all 5 million images in iNat24.
Methods are grouped into two-stage retrieval (embedding-based retrieval of the top 100 images, followed by reranking with a stronger model) and one-stage retrieval (a CLIP/embedding model alone).
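To make the two groupings concrete, here is a minimal sketch of one-stage retrieval with a CLIP-style model over precomputed iNat24 image embeddings. The model name, checkpoint tag, and embedding file path are illustrative assumptions rather than the benchmark's released artifacts; two-stage methods take the resulting top 100 and rerank them with a stronger model.

```python
import numpy as np
import torch
import open_clip  # any CLIP-style text/image encoder works here

# Hypothetical file of precomputed, L2-normalized iNat24 image embeddings.
image_embeddings = np.load("inat24_image_embeddings.npy")  # shape (5_000_000, D)

# Assumed open_clip names for a DFN ViT-H/14-378 checkpoint.
model, _, _ = open_clip.create_model_and_transforms(
    "ViT-H-14-378-quickgelu", pretrained="dfn5b")
tokenizer = open_clip.get_tokenizer("ViT-H-14-378-quickgelu")

query = "a hermit crab using plastic trash as its shell"
with torch.no_grad():
    q = model.encode_text(tokenizer([query]))
    q = torch.nn.functional.normalize(q, dim=-1).cpu().numpy()[0]

# One-stage retrieval: rank all 5 million images by cosine similarity to the query.
scores = image_embeddings @ q
top100 = np.argsort(-scores)[:100]

# A two-stage method would now rerank `top100` with a stronger model,
# e.g. an LMM asked whether each candidate actually matches the query.
```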
| Method | Size | Overall | Appearance | Behavior | Context | Species |
|---|---|---|---|---|---|---|
| CLIP ViT-H/14-378 (DFN) Top 100 → GPT-4o | - | 47.1 | 36.6 | 49.7 | 51.9 | 59.4 |
| CLIP ViT-H/14-378 (DFN) Top 100 → VILA1.5-40B | - | 42.1 | 32.5 | 44.7 | 46.7 | 52.4 |
| CLIP ViT-H/14-378 (DFN) Top 100 → GPT-4-Turbo (20240409) | - | 38.8 | 29.7 | 40.0 | 42.2 | 54.7 |
| CLIP ViT-H/14-378 (DFN) Top 100 → PaliGemma-3B-mix-448 | - | 37.7 | 27.2 | 41.2 | 41.7 | 48.6 |
| CLIP ViT-H/14-378 (DFN) Top 100 → LLaVA-v1.6-34B | - | 37.4 | 28.0 | 39.0 | 41.8 | 50.8 |
| CLIP ViT-H/14-378 (DFN) | 987M | 35.6 | 25.7 | 38.7 | 36.5 | 52.7 |
| SigLIP SO400m-14-384 | 878M | 34.9 | 30.5 | 35.7 | 36.0 | 42.6 |
| SigLIP ViT-L/16-384 | 652M | 31.6 | 24.1 | 33.0 | 33.8 | 44.5 |
| CLIP ViT-L/14 (DFN) | 428M | 24.6 | 18.4 | 24.0 | 26.3 | 40.9 |
| CLIP ViT-B/16 (DFN) | 150M | 16.2 | 12.0 | 16.8 | 15.7 | 28.3 |
| CLIP ViT-L/14 (OpenAI) | 428M | 15.8 | 14.9 | 15.3 | 14.3 | 23.6 |
| CLIP RN50x16 (OpenAI) | 291M | 14.3 | 10.4 | 15.8 | 13.3 | 23.3 |
| CLIP ViT-B/16 (OpenAI) | 150M | 11.4 | 9.8 | 10.6 | 11.2 | 19.0 |
| CLIP ViT-B/32 (OpenAI) | 110M | 8.2 | 5.8 | 7.6 | 8.9 | 16.1 |
| CLIP RN50 (OpenAI) | 102M | 7.6 | 5.7 | 7.3 | 7.9 | 13.8 |
| WildCLIP-t1 | 150M | 7.5 | 5.2 | 8.0 | 7.0 | 13.2 |
| WildCLIP-t1t7-lwf | 150M | 7.3 | 6.5 | 6.8 | 6.4 | 13.1 |
| BioCLIP | 150M | 3.6 | 2.3 | 0.5 | 2.2 | 21.1 |
| Random | - | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
INQUIRE-Rerank Leaderboard
This leaderboard shows reranking performance, starting from a fixed set of 100 images per query.
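A minimal sketch of this setup: each of the 100 candidates is scored against the query and the pool is re-sorted by that score. CLIP image-text similarity is used as the stand-in scorer here; LMM rerankers instead derive a score from the model's answer to a relevance prompt. The file paths and the helper name `clip_score` are hypothetical.

```python
from PIL import Image
import torch
import open_clip

# A small CLIP model as a stand-in scorer.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")

def clip_score(query, image_paths):
    """Cosine similarity between the query text and each candidate image."""
    with torch.no_grad():
        text = torch.nn.functional.normalize(model.encode_text(tokenizer([query])), dim=-1)
        imgs = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in image_paths])
        feats = torch.nn.functional.normalize(model.encode_image(imgs), dim=-1)
    return (feats @ text.T).squeeze(-1)

# Hypothetical fixed pool of 100 candidate images for one INQUIRE-Rerank query.
query = "everted osmeterium"
candidates = [f"candidates/{i:03d}.jpg" for i in range(100)]
scores = clip_score(query, candidates)
reranked = [candidates[i] for i in scores.argsort(descending=True)]
```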
Methods are grouped into proprietary LMMs, open-source LMMs, and open-source CLIP/embedding models.
| Method | Size | Overall | Appearance | Behavior | Context | Species |
|---|---|---|---|---|---|---|
| GPT-4o | - | 62.4 | 59.7 | 61.9 | 70.6 | 42.4 |
| VILA1.5-40B | 40B | 54.3 | 50.4 | 55.1 | 61.9 | 36.0 |
| SigLIP SO400m-14-384 | 878M | 51.5 | 51.8 | 51.7 | 53.4 | 38.8 |
| GPT-4-Turbo (20240409) | - | 48.9 | 43.7 | 49.6 | 56.6 | 39.7 |
| PaliGemma-3B-mix-448 | 3B | 48.9 | 44.1 | 51.6 | 53.8 | 35.3 |
| LLaVA-v1.6-34B | 34B | 48.3 | 43.7 | 48.7 | 56.4 | 34.7 |
| SigLIP ViT-L/16-384 | 652M | 47.5 | 42.8 | 50.2 | 52.1 | 34.7 |
| VILA1.5-13B | 13B | 46.3 | 40.2 | 46.5 | 56.8 | 32.7 |
| CLIP ViT-H/14-378 (DFN) | 987M | 44.6 | 38.8 | 50.1 | 47.4 | 28.6 |
| InstructBLIP-FLAN-T5-XXL | 12B | 44.3 | 38.7 | 45.9 | 50.7 | 37.2 |
| LLaVA-v1.6-Mistral-7B | 7B | 43.1 | 39.0 | 42.7 | 51.5 | 31.7 |
| LLaVA-1.5-13B | 13B | 43.0 | 37.7 | 45.1 | 48.9 | 32.7 |
| BLIP-2-FLAN-T5-XXL | 12B | 40.5 | 32.8 | 43.4 | 47.9 | 32.4 |
| CLIP ViT-L/14 (DFN) | 428M | 39.1 | 34.9 | 40.7 | 43.3 | 33.4 |
| CLIP ViT-L/14 (OpenAI) | 428M | 37.8 | 35.1 | 37.9 | 41.4 | 37.6 |
| CLIP RN50x16 (OpenAI) | 291M | 36.2 | 32.7 | 36.1 | 40.5 | 39.8 |
| CLIP ViT-B/16 (DFN) | 150M | 33.7 | 29.4 | 35.4 | 37.2 | 31.5 |
| CLIP ViT-B/16 (OpenAI) | 150M | 33.5 | 30.8 | 32.9 | 37.2 | 37.1 |
| WildCLIP-t1 | 150M | 31.6 | 28.2 | 31.0 | 36.5 | 34.3 |
| WildCLIP-t1t7-lwf | 150M | 31.5 | 29.0 | 30.5 | 35.2 | 37.4 |
| CLIP ViT-B/32 (OpenAI) | 151M | 31.3 | 26.9 | 30.4 | 37.3 | 37.0 |
| CLIP RN50 (OpenAI) | 102M | 31.2 | 28.8 | 30.3 | 35.0 | 35.2 |
| BioCLIP | 150M | 28.9 | 27.4 | 27.2 | 30.8 | 41.1 |
| Random | - | 23.0 | - | - | - | - |