Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning
Abstract
Perceive-to-Reason (P2R) is a unified framework that separates visual perception from reasoning in vision-language models through a two-stage process, improving fine-grained visual reasoning on high-resolution images.
Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.
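The two-stage formulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `perceive_to_reason`, `ReasonerInput`, `clamp_box`, and `crop` are hypothetical, the image is a toy nested list rather than a real high-resolution image, and the Perceiver and Reasoner are passed in as plain callables standing in for the two roles of the vision-language model. The list of boxes stands in for the "annotated image" mentioned in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: boxes are (left, top, right, bottom) in pixel coordinates.
Box = Tuple[int, int, int, int]
# Assumption: a toy grayscale image stored as rows of pixel values.
Image = List[List[int]]


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds and order its corners."""
    left, top, right, bottom = box
    left, right = sorted((max(0, min(left, width)), max(0, min(right, width))))
    top, bottom = sorted((max(0, min(top, height)), max(0, min(bottom, height))))
    return left, top, right, bottom


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


@dataclass
class ReasonerInput:
    """Everything the Reasoner role sees in stage two (hypothetical container)."""
    question: str
    image: Image
    boxes: List[Box]      # stands in for the annotated image
    crops: List[Image]    # cropped evidence regions


def perceive_to_reason(
    image: Image,
    question: str,
    perceiver: Callable[[Image, str], List[Box]],
    reasoner: Callable[[ReasonerInput], str],
) -> str:
    """Stage 1: localize question-relevant evidence. Stage 2: answer from it."""
    height = len(image)
    width = len(image[0]) if image else 0

    # Stage 1 (Perceiver): propose regions, then clip and drop empty boxes.
    boxes = [clamp_box(b, width, height) for b in perceiver(image, question)]
    boxes = [b for b in boxes if b[2] > b[0] and b[3] > b[1]]

    # Stage 2 (Reasoner): answer using the full image, the boxes, and the crops.
    crops = [crop(image, b) for b in boxes]
    return reasoner(ReasonerInput(question, image, boxes, crops))
```

In a real system both callables would wrap the same model prompted in different roles; here any function with the matching signature can be plugged in, for example a perceiver that always returns one fixed box and a reasoner that inspects the crop.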
Community
This paper proposes P2R, a two-stage framework that decouples fine-grained visual reasoning into perception and reasoning, trained with PRA-GRPO, a role-aware alternating RL strategy requiring only final-answer supervision without bounding box annotations.
Code: https://github.com/ZJU-REAL/Perceive-to-Reason
Model: https://huggingface.co/hongxingli/P2R-4B
Dataset: https://huggingface.co/datasets/hongxingli/P2R-10k
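To make the idea of role-aware alternating updates with only final-answer supervision more concrete, here is a rough sketch. It is not the PRA-GRPO algorithm from the paper, whose details are not given on this page. It shows only the standard GRPO-style group-relative advantage (rewards normalized within a group of sampled responses to the same prompt) together with an assumed strict alternation schedule. The function names, the exact-match reward, the use of population standard deviation, and the `period` parameter are all illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def final_answer_reward(prediction: str, reference: str) -> float:
    """Assumed reward: 1.0 on a case-insensitive exact match of the final answer."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_for_step(step: int, period: int = 1) -> str:
    """Assumed schedule: switch the updated role every `period` training steps."""
    return "perceiver" if (step // period) % 2 == 0 else "reasoner"
```

Under these assumptions, a training loop would sample a group of full Perceiver-then-Reasoner rollouts per question, score each with `final_answer_reward`, and apply the resulting advantages only to the tokens produced by the role that `role_for_step` selects. No bounding-box labels enter the reward, which matches the final-answer-only supervision described above.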
Get this paper in your agent:
hf papers read 2607.01191
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash