Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Abstract
The Vision-DeepResearch benchmark addresses limitations in evaluating the visual-textual search capabilities of multimodal models by introducing realistic evaluation conditions and improving visual retrieval through a multi-round cropped-search workflow.
Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities remains difficult, and existing benchmarks have two major limitations. First, they are not visual search-centric: answers that should require visual search are often leaked through textual cues in the questions or can be inferred from the prior world knowledge already encoded in current MLLMs. Second, their evaluation scenarios are overly idealized: on the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench), comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, and are designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow, which is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released at https://github.com/Osilly/Vision-DeepResearch.
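The abstract only names the multi-round cropped-search workflow, so the following is a minimal sketch of what such a loop could look like: instead of a single search over the full image, the model repeatedly proposes a crop, searches with it, and accumulates evidence until it can answer. The functions `image_search`, `propose_crop`, and `answer_or_continue`, as well as the round budget, are hypothetical placeholders and are not taken from the paper; the released code may implement the workflow differently.

```python
# Minimal sketch of a multi-round cropped-search loop (illustrative only,
# not the paper's implementation). The three helper functions below are
# hypothetical stand-ins for an image-search engine and MLLM calls.
from PIL import Image

MAX_ROUNDS = 3  # assumed budget; not specified in the abstract


def image_search(img: Image.Image) -> list[str]:
    """Hypothetical wrapper around a reverse image-search engine.
    Returns textual snippets describing the retrieved pages."""
    raise NotImplementedError


def propose_crop(img: Image.Image, question: str,
                 evidence: list[str]) -> tuple[int, int, int, int] | None:
    """Hypothetical MLLM call: given the question and evidence so far,
    return a (left, upper, right, lower) box worth searching, or None."""
    raise NotImplementedError


def answer_or_continue(question: str, evidence: list[str]) -> str | None:
    """Hypothetical MLLM call: return a final answer if the evidence
    suffices, otherwise None to trigger another round."""
    raise NotImplementedError


def multi_round_cropped_search(image_path: str, question: str) -> str | None:
    image = Image.open(image_path)
    evidence: list[str] = []
    evidence += image_search(image)  # round 0: search with the full image
    for _ in range(MAX_ROUNDS):
        answer = answer_or_continue(question, evidence)
        if answer is not None:
            return answer
        box = propose_crop(image, question, evidence)
        if box is None:
            break
        # Search again with only the proposed region of the image.
        evidence += image_search(image.crop(box))
    return answer_or_continue(question, evidence)
```

The key design choice sketched here is that cropping narrows the query to the region that actually carries the evidence, which avoids the near-exact full-image matching the benchmark is built to penalize.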
Community
We introduce the Vision-DeepResearch Benchmark (VDR-Bench) to address two key limitations of existing multimodal deep-research benchmarks: (1) they are not visual-search-centric, allowing many instances to be solved without genuine visual retrieval; and (2) they rely on overly idealized retrieval settings that fail to reflect noisy, real-world search engines. To this end, VDR-Bench comprises 2,000 instances curated with full human involvement and rigorous solvability verification, complementing existing benchmarks by enforcing realistic visual search and evidence-grounded reasoning.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models (2026)
- KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering (2026)
- Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models (2026)
- VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors (2025)
- Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions (2026)
- Efficient Multimodal Planning Agent for Visual Question-Answering (2026)
- UR-Bench: A Benchmark for Multi-Hop Reasoning over Ultra-High-Resolution Images (2025)