VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Abstract
Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question–answer pairs with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs and that encoding and reasoning over details remains a challenging task for them, especially when they are confronted with densely populated scenes. Indeed, we observe that even the best of the 37 tested models (o3) achieves only 19.6% accuracy on our hardest test split and 69.5% accuracy across all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failures in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models.
- 2,720 question-answer pairs
- 6 tasks and 3 levels of difficulty
- fresh image data
- all images are public domain
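For orientation, the sketch below shows how the data could be iterated if it is distributed through the Hugging Face Hub. The dataset ID and field names are assumptions for illustration only, and since ground-truth answers are held privately, no answer field is expected in the released split.

```python
# Minimal loading sketch, assuming the benchmark is hosted on the Hugging Face Hub.
# The dataset ID and field names below are illustrative assumptions, not the
# official schema -- consult the dataset card for the actual layout.
from datasets import load_dataset

ds = load_dataset("VisualOverload/VisualOverload", split="test")  # hypothetical ID

for sample in ds:
    image = sample["image"]            # high-resolution scan of a public-domain painting
    question = sample["question"]      # e.g., "How many live animals can be seen?"
    options = sample.get("options")    # option list for multiple-choice items, None for freeform
    task = sample["task"]              # one of the six task categories, e.g., "counting"
    difficulty = sample["difficulty"]  # "easy", "medium", or "hard"
    # Note: ground-truth answers are privately held and are not part of the released data.
```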
Leaderboard
You can benchmark your own model by submitting your predictions to our evaluation server. If you want your submission to appear on the public leaderboard, please follow the instructions to open a GitHub issue. All leaderboard values are accuracy in %; a sketch of a possible submission file format follows the table.

Model | Special Inference | Activity | Attributes | Counting | OCR | Reasoning | Scene | Easy | Medium | Hard | Total |
---|---|---|---|---|---|---|---|---|---|---|---|
PaliGemma 2 3B | No | 42.0 | 53.0 | 20.4 | 8.5 | 24.9 | 32.7 | 51.9 | 28.3 | 5.0 | 29.0 |
LLaVA 1.5 7B | No | 35.3 | 43.6 | 13.2 | 3.4 | 39.5 | 43.2 | 69.7 | 24.6 | 1.9 | 30.8 |
Gemma 3n E2B | No | 32.0 | 26.2 | 15.0 | 19.5 | 35.6 | 53.2 | 74.6 | 25.7 | 7.9 | 33.9 |
LLaVA-NeXT 7B | No | 44.7 | 41.6 | 19.1 | 8.5 | 40.5 | 54.0 | 81.8 | 31.5 | 2.2 | 37.5 |
LFM2 VL 450M | No | 35.3 | 47.0 | 22.9 | 20.3 | 27.8 | 59.5 | 83.1 | 32.4 | 8.6 | 39.7 |
DeepSeek VL2 Tiny | No | 54.7 | 47.7 | 22.5 | 35.6 | 37.1 | 54.2 | 82.5 | 38.0 | 2.6 | 41.2 |
SmolVLM | No | 42.7 | 41.6 | 17.2 | 28.0 | 32.2 | 67.3 | 83.5 | 38.8 | 3.1 | 42.0 |
Gemma 3n E4B | No | 40.0 | 23.5 | 19.3 | 23.7 | 41.0 | 73.9 | 87.8 | 38.4 | 8.9 | 44.2 |
InternVL3 1B | No | 48.0 | 57.0 | 27.2 | 25.4 | 35.1 | 77.5 | 94.9 | 48.9 | 5.0 | 50.6 |
LFM2 VL 1.6B | No | 49.3 | 55.7 | 25.2 | 28.0 | 44.4 | 79.5 | 97.4 | 50.4 | 4.8 | 51.9 |
InternLM-XComposer2-4KHD | No | 50.7 | 53.7 | 25.4 | 31.4 | 42.4 | 83.6 | 94.4 | 53.8 | 6.7 | 53.4 |
Qwen2.5-VL 3B | No | 60.7 | 61.7 | 25.9 | 49.2 | 43.9 | 77.5 | 94.0 | 56.0 | 4.8 | 54.1 |
InternLM-XComposer2.5 | No | 48.0 | 51.7 | 22.7 | 35.6 | 45.9 | 87.3 | 95.9 | 53.7 | 9.1 | 54.3 |
InternVL3 2B | No | 50.0 | 58.4 | 30.4 | 39.0 | 49.8 | 80.3 | 98.9 | 55.6 | 5.7 | 55.3 |
DeepSeek VL2 | No | 65.3 | 63.8 | 25.9 | 46.6 | 58.5 | 81.8 | 99.4 | 60.6 | 4.1 | 57.7 |
LLaVA-OneVision 7B | No | 60.7 | 57.7 | 28.4 | 29.7 | 54.1 | 88.2 | 95.5 | 63.6 | 4.3 | 58.3 |
Qwen2.5-VL 7B | No | 63.3 | 69.1 | 34.9 | 55.9 | 49.8 | 85.3 | 97.9 | 66.2 | 9.6 | 61.5 |
LLaVA 1.5 13B | No | 41.3 | 39.6 | 13.8 | 3.4 | 42.9 | 71.6 | 94.0 | 34.0 | 2.6 | 42.0 |
LLaVA-NeXT 13B | No | 44.0 | 43.6 | 17.0 | 6.8 | 41.5 | 75.8 | 97.4 | 38.1 | 2.9 | 45.1 |
VILA HD 4K | No | 54.0 | 48.3 | 22.5 | 11.0 | 49.3 | 74.5 | 91.2 | 47.1 | 4.1 | 48.5 |
Gemma 3 12B | No | 48.7 | 42.3 | 16.5 | 31.4 | 47.8 | 82.7 | 98.3 | 45.6 | 6.2 | 50.0 |
PaliGemma 2 10B | No | 48.7 | 52.3 | 23.6 | 5.1 | 42.4 | 81.8 | 91.9 | 49.5 | 5.7 | 50.3 |
VILA HD 1.5K | No | 54.0 | 57.7 | 25.9 | 21.2 | 52.2 | 79.4 | 94.2 | 54.3 | 4.1 | 53.1 |
InternVL3 8B | No | 66.0 | 67.8 | 32.2 | 42.4 | 59.0 | 93.4 | 99.6 | 70.8 | 7.9 | 63.9 |
PaliGemma 2 28B | No | 40.0 | 49.0 | 17.4 | 5.9 | 40.0 | 66.1 | 81.2 | 37.7 | 6.0 | 41.5 |
Gemma 3 27B | No | 51.3 | 46.3 | 18.1 | 40.7 | 50.7 | 86.3 | 98.5 | 50.6 | 8.9 | 53.2 |
Llama 4 Scout | No | 58.7 | 65.8 | 31.1 | 37.3 | 62.0 | 78.8 | 95.7 | 57.9 | 13.6 | 57.5 |
InternVL3 14B | No | 66.7 | 69.1 | 30.6 | 41.5 | 57.1 | 91.1 | 98.5 | 69.7 | 5.3 | 62.5 |
LLaVA-OneVision 72B | No | 66.0 | 69.8 | 30.9 | 39.0 | 57.1 | 91.8 | 97.6 | 71.0 | 4.1 | 62.7 |
Qwen2.5-VL 32B | No | 60.0 | 70.5 | 30.8 | 61.0 | 61.5 | 90.3 | 98.5 | 68.7 | 12.4 | 63.6 |
Qwen2.5-VL 72B | No | 67.3 | 74.5 | 35.1 | 72.9 | 53.2 | 90.5 | 97.6 | 72.6 | 13.4 | 65.7 |
InternVL3 78B | No | 78.0 | 80.5 | 34.7 | 31.4 | 65.4 | 93.7 | 97.6 | 76.9 | 8.1 | 66.8 |
InternVL3 38B | No | 76.7 | 78.5 | 35.4 | 45.8 | 69.8 | 92.2 | 98.3 | 78.6 | 7.2 | 67.6 |
Horizon Alpha | No | 57.3 | 74.5 | 35.6 | 48.3 | 63.9 | 93.2 | 99.4 | 72.9 | 10.8 | 65.7 |
Gemini 2.0 Flash | No | 76.0 | 71.1 | 41.7 | 57.6 | 56.6 | 92.1 | 99.1 | 74.0 | 19.1 | 68.1 |
o4 mini | No | 70.0 | 76.5 | 38.3 | 62.7 | 67.8 | 93.7 | 98.1 | 77.4 | 17.2 | 69.1 |
o3 | No | 74.0 | 69.8 | 36.7 | 61.0 | 75.1 | 94.7 | 99.4 | 76.4 | 19.6 | 69.5 |
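As mentioned above, predictions are scored by the evaluation server. As a rough sketch (not the official format), a submission could be a JSON file listing a predicted answer per question ID; the key names below are assumptions for illustration.

```python
# Hypothetical sketch of packaging predictions for the evaluation server.
# The actual expected file format is defined in the submission instructions;
# the "question_id" / "prediction" keys here are assumptions for illustration.
import json

predictions = [
    {"question_id": "q0001", "prediction": "B"},      # multiple-choice: predicted option letter
    {"question_id": "q0002", "prediction": "three"},  # freeform: short textual answer
]

with open("submission.json", "w") as f:
    json.dump(predictions, f, indent=2)
```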
Example Questions

Based on the shadows of the people, what is the most likely position of the sun?
Options: A. behind the right building, B. behind the left building, C. it's nighttime, D. behind the middle tower
Task: Reasoning

What is the ninth word of the caption below the image?
(freeform)
Task: OCR

How many live animals can be seen?
(freeform)
Task: Counting
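Since the benchmark mixes multiple-choice and freeform questions, a simple way to turn a question record into a model prompt might look like the following sketch; the template is an assumption, not the protocol used for the official evaluation.

```python
# Illustrative prompt construction for the two question formats shown above.
# This template is an assumption; the exact prompts used for the official
# evaluation may differ.
from typing import List, Optional


def build_prompt(question: str, options: Optional[List[str]] = None) -> str:
    if options:
        # Multiple-choice: enumerate the options and ask for a single letter.
        lettered = ", ".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
        return f"{question}\nOptions: {lettered}\nAnswer with the letter of the correct option."
    # Freeform: ask for a short, direct answer.
    return f"{question}\nAnswer with a single word or short phrase."


print(build_prompt("How many live animals can be seen?"))
```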
BibTeX
@misc{gavrikov2025visualoverload,
title={VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes},
author={Paul Gavrikov and Wei Lin and M. Jehanzeb Mirza and Soumya Jahagirdar and Muhammad Huzaifa and Sivan Doveh and Serena Yeung-Levy and James Glass and Hilde Kuehne},
year={2025},
eprint={2509.25339},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.25339},
}