VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne
1 Independent Researcher 2 JKU Linz 3 MIT CSAIL 4 Tübingen AI Center 5 Stanford 6 MIT-IBM Watson AI Lab
Preprint, 2025

Abstract

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question–answer pairs with privately held ground-truth responses. Unlike prior VQA datasets, which typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs and that encoding and reasoning over details remains challenging for them, especially when they are confronted with densely populated scenes. Indeed, we observe that even the best of the 37 models we tested, o3, achieves only 19.6% accuracy on our hardest test split and 69.5% accuracy across all questions. Beyond this thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failures in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better ones.

  • 2,720 question–answer pairs
  • 6 tasks and 3 levels of difficulty
  • fresh image data
  • all images are public domain
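
To make these statistics concrete, the snippet below shows one way to tabulate the questions by task and difficulty with Hugging Face datasets. It is a rough sketch, not the official loader: the hub id paulgavrikov/visualoverload and the "category" and "difficulty" field names are assumptions, so consult the dataset card for the real schema.

from collections import Counter

from datasets import load_dataset

# Assumed hub id and split name; verify against the dataset card.
dataset = load_dataset("paulgavrikov/visualoverload", split="test")

# Assumed field names for the task category and difficulty level.
by_task = Counter(sample["category"] for sample in dataset)
by_difficulty = Counter(sample["difficulty"] for sample in dataset)

print(f"{len(dataset)} questions")             # expected: 2720
print("per task:", dict(by_task))              # expected: 6 task categories
print("per difficulty:", dict(by_difficulty))  # expected: 3 levels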

Leaderboard

You can benchmark your own model by submitting your predictions to our evaluation server. If you want your submission to appear on the public leaderboard, please follow the instructions to open a GitHub issue.
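
For reference, here is a minimal sketch of assembling a prediction file. The submission format shown (a JSON list of id–prediction records) and the per-sample fields "id", "image", and "question" are assumptions; the GitHub instructions linked above are authoritative.

import json

from datasets import load_dataset

def answer(image, question: str) -> str:
    """Replace with your model's inference call."""
    raise NotImplementedError

# Assumed hub id; ground-truth answers are privately held, so the public
# split only provides images and questions.
dataset = load_dataset("paulgavrikov/visualoverload", split="test")

# Assumed submission schema: one {"id", "prediction"} record per question.
predictions = [
    {"id": sample["id"], "prediction": answer(sample["image"], sample["question"])}
    for sample in dataset
]

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f)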
All scores are accuracy (%).

| Model | Special Inference | Activity | Attributes | Counting | OCR | Reasoning | Scene | Easy | Medium | Hard | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PaliGemma 2 3B | No | 42.0 | 53.0 | 20.4 | 8.5 | 24.9 | 32.7 | 51.9 | 28.3 | 5.0 | 29.0 |
| LLaVA 1.5 7B | No | 35.3 | 43.6 | 13.2 | 3.4 | 39.5 | 43.2 | 69.7 | 24.6 | 1.9 | 30.8 |
| Gemma 3n E2B | No | 32.0 | 26.2 | 15.0 | 19.5 | 35.6 | 53.2 | 74.6 | 25.7 | 7.9 | 33.9 |
| LLaVA-NeXT 7B | No | 44.7 | 41.6 | 19.1 | 8.5 | 40.5 | 54.0 | 81.8 | 31.5 | 2.2 | 37.5 |
| LFM2 VL 450M | No | 35.3 | 47.0 | 22.9 | 20.3 | 27.8 | 59.5 | 83.1 | 32.4 | 8.6 | 39.7 |
| DeepSeek VL2 Tiny | No | 54.7 | 47.7 | 22.5 | 35.6 | 37.1 | 54.2 | 82.5 | 38.0 | 2.6 | 41.2 |
| SmolVLM | No | 42.7 | 41.6 | 17.2 | 28.0 | 32.2 | 67.3 | 83.5 | 38.8 | 3.1 | 42.0 |
| Gemma 3n E4B | No | 40.0 | 23.5 | 19.3 | 23.7 | 41.0 | 73.9 | 87.8 | 38.4 | 8.9 | 44.2 |
| InternVL3 1B | No | 48.0 | 57.0 | 27.2 | 25.4 | 35.1 | 77.5 | 94.9 | 48.9 | 5.0 | 50.6 |
| LFM2 VL 1.6B | No | 49.3 | 55.7 | 25.2 | 28.0 | 44.4 | 79.5 | 97.4 | 50.4 | 4.8 | 51.9 |
| InternLM-XComposer2-4KHD | No | 50.7 | 53.7 | 25.4 | 31.4 | 42.4 | 83.6 | 94.4 | 53.8 | 6.7 | 53.4 |
| Qwen2.5-VL 3B | No | 60.7 | 61.7 | 25.9 | 49.2 | 43.9 | 77.5 | 94.0 | 56.0 | 4.8 | 54.1 |
| InternLM-XComposer2.5 | No | 48.0 | 51.7 | 22.7 | 35.6 | 45.9 | 87.3 | 95.9 | 53.7 | 9.1 | 54.3 |
| InternVL3 2B | No | 50.0 | 58.4 | 30.4 | 39.0 | 49.8 | 80.3 | 98.9 | 55.6 | 5.7 | 55.3 |
| DeepSeek VL2 | No | 65.3 | 63.8 | 25.9 | 46.6 | 58.5 | 81.8 | 99.4 | 60.6 | 4.1 | 57.7 |
| LLaVA-OneVision 7B | No | 60.7 | 57.7 | 28.4 | 29.7 | 54.1 | 88.2 | 95.5 | 63.6 | 4.3 | 58.3 |
| Qwen2.5-VL 7B | No | 63.3 | 69.1 | 34.9 | 55.9 | 49.8 | 85.3 | 97.9 | 66.2 | 9.6 | 61.5 |
| LLaVA 1.5 13B | No | 41.3 | 39.6 | 13.8 | 3.4 | 42.9 | 71.6 | 94.0 | 34.0 | 2.6 | 42.0 |
| LLaVA-NeXT 13B | No | 44.0 | 43.6 | 17.0 | 6.8 | 41.5 | 75.8 | 97.4 | 38.1 | 2.9 | 45.1 |
| VILA HD 4K | No | 54.0 | 48.3 | 22.5 | 11.0 | 49.3 | 74.5 | 91.2 | 47.1 | 4.1 | 48.5 |
| Gemma 3 12B | No | 48.7 | 42.3 | 16.5 | 31.4 | 47.8 | 82.7 | 98.3 | 45.6 | 6.2 | 50.0 |
| PaliGemma 2 10B | No | 48.7 | 52.3 | 23.6 | 5.1 | 42.4 | 81.8 | 91.9 | 49.5 | 5.7 | 50.3 |
| VILA HD 1.5K | No | 54.0 | 57.7 | 25.9 | 21.2 | 52.2 | 79.4 | 94.2 | 54.3 | 4.1 | 53.1 |
| InternVL3 8B | No | 66.0 | 67.8 | 32.2 | 42.4 | 59.0 | 93.4 | 99.6 | 70.8 | 7.9 | 63.9 |
| PaliGemma 2 28B | No | 40.0 | 49.0 | 17.4 | 5.9 | 40.0 | 66.1 | 81.2 | 37.7 | 6.0 | 41.5 |
| Gemma 3 27B | No | 51.3 | 46.3 | 18.1 | 40.7 | 50.7 | 86.3 | 98.5 | 50.6 | 8.9 | 53.2 |
| Llama 4 Scout | No | 58.7 | 65.8 | 31.1 | 37.3 | 62.0 | 78.8 | 95.7 | 57.9 | 13.6 | 57.5 |
| InternVL3 14B | No | 66.7 | 69.1 | 30.6 | 41.5 | 57.1 | 91.1 | 98.5 | 69.7 | 5.3 | 62.5 |
| LLaVA-OneVision 72B | No | 66.0 | 69.8 | 30.9 | 39.0 | 57.1 | 91.8 | 97.6 | 71.0 | 4.1 | 62.7 |
| Qwen2.5-VL 32B | No | 60.0 | 70.5 | 30.8 | 61.0 | 61.5 | 90.3 | 98.5 | 68.7 | 12.4 | 63.6 |
| Qwen2.5-VL 72B | No | 67.3 | 74.5 | 35.1 | 72.9 | 53.2 | 90.5 | 97.6 | 72.6 | 13.4 | 65.7 |
| InternVL3 78B | No | 78.0 | 80.5 | 34.7 | 31.4 | 65.4 | 93.7 | 97.6 | 76.9 | 8.1 | 66.8 |
| InternVL3 38B | No | 76.7 | 78.5 | 35.4 | 45.8 | 69.8 | 92.2 | 98.3 | 78.6 | 7.2 | 67.6 |
| Horizon Alpha | No | 57.3 | 74.5 | 35.6 | 48.3 | 63.9 | 93.2 | 99.4 | 72.9 | 10.8 | 65.7 |
| Gemini 2.0 Flash | No | 76.0 | 71.1 | 41.7 | 57.6 | 56.6 | 92.1 | 99.1 | 74.0 | 19.1 | 68.1 |
| o4 mini | No | 70.0 | 76.5 | 38.3 | 62.7 | 67.8 | 93.7 | 98.1 | 77.4 | 17.2 | 69.1 |
| o3 | No | 74.0 | 69.8 | 36.7 | 61.0 | 75.1 | 94.7 | 99.4 | 76.4 | 19.6 | 69.5 |

Example Questions

BibTeX

@misc{gavrikov2025visualoverload,
      title={VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes}, 
      author={Paul Gavrikov and Wei Lin and M. Jehanzeb Mirza and Soumya Jahagirdar and Muhammad Huzaifa and Sivan Doveh and Serena Yeung-Levy and James Glass and Hilde Kuehne},
      year={2025},
      eprint={2509.25339},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25339}, 
}