arxiv:2607.01191

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

Published on Jul 1

Submitted by Hongxing Li on Jul 2

Zhejiang University

Abstract

Perceive-to-Reason (P2R) is a unified two-stage framework that separates visual perception from reasoning in vision-language models, improving fine-grained visual reasoning on high-resolution images.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Fine-grained visual reasoning remains challenging for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches rely on repeated cropping or test-time visual search to introduce local evidence, but they typically do not explicitly distinguish perception from reasoning. In this paper, we propose Perceive-to-Reason (P2R), a unified framework that formulates fine-grained visual reasoning as a two-stage process: the model first localizes question-relevant evidence as a Perceiver, and then answers the question as a Reasoner based on the annotated image and cropped regions. To better align training with this decoupled formulation, we further introduce Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused updates using only final-answer supervision. Built on top of Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance across model scales. In particular, P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone. Further experiments show that the benefits of P2R extend beyond high-resolution benchmarks to broader multimodal reasoning tasks. These results suggest that explicitly decoupling perception from reasoning provides an effective framework for fine-grained visual reasoning.
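
The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `perceive_then_reason`, `ReasonerInput`, `clamp_box`, `pad_box`, and the `pad` parameter are hypothetical, the Perceiver and Reasoner are passed in as plain callables standing in for model calls, and the box format (left, top, right, bottom in pixels) is an assumption. The sketch only covers the bookkeeping between the stages: cleaning the Perceiver's boxes and deriving the crop windows that the Reasoner would receive alongside the annotated image.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box format: (left, top, right, bottom) in pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class ReasonerInput:
    """Hypothetical container for what stage 2 receives."""
    question: str
    boxes: List[Box]   # regions to draw on the full image (the "annotated image")
    crops: List[Box]   # padded crop windows around each region


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds and order its corners."""
    left, top, right, bottom = box
    left, right = sorted((max(0, min(left, width)), max(0, min(right, width))))
    top, bottom = sorted((max(0, min(top, height)), max(0, min(bottom, height))))
    return (left, top, right, bottom)


def pad_box(box: Box, pad: int, width: int, height: int) -> Box:
    """Grow a box by `pad` pixels on every side, staying inside the image."""
    left, top, right, bottom = box
    return clamp_box((left - pad, top - pad, right + pad, bottom + pad), width, height)


def perceive_then_reason(
    question: str,
    image_size: Tuple[int, int],
    perceiver: Callable[[str, Tuple[int, int]], List[Box]],
    reasoner: Callable[[ReasonerInput], str],
    pad: int = 16,
) -> str:
    """Stage 1 localizes evidence; stage 2 answers from boxes and crops."""
    width, height = image_size
    # Stage 1 (Perceiver): propose question-relevant regions.
    proposed = perceiver(question, image_size)
    boxes = [clamp_box(b, width, height) for b in proposed]
    # Drop degenerate boxes that have no area after clipping.
    boxes = [b for b in boxes if b[2] > b[0] and b[3] > b[1]]
    crops = [pad_box(b, pad, width, height) for b in boxes]
    # Stage 2 (Reasoner): answer using the annotated image and the crops.
    return reasoner(ReasonerInput(question=question, boxes=boxes, crops=crops))


if __name__ == "__main__":
    # Stand-in callables; a real system would call a vision-language model here.
    def toy_perceiver(question: str, size: Tuple[int, int]) -> List[Box]:
        return [(3900, 2100, 4200, 2300)]

    def toy_reasoner(inputs: ReasonerInput) -> str:
        return f"answer based on {len(inputs.crops)} crop(s)"

    print(perceive_then_reason("What color is the sign?", (4096, 2304),
                               toy_perceiver, toy_reasoner))
```

Passing the two roles as callables keeps the sketch independent of any particular model API; in the framework described above, both roles would be played by the same underlying vision-language model under different prompts.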

Community

hongxingli

Paper submitter 1 day ago

This paper proposes P2R, a two-stage framework that decouples fine-grained visual reasoning into perception and reasoning. It is trained with PRA-GRPO, a role-aware alternating RL strategy that requires only final-answer supervision, with no bounding-box annotations.

💻 Code: https://github.com/ZJU-REAL/Perceive-to-Reason
🤗 Model: https://huggingface.co/hongxingli/P2R-4B
🤗 Dataset: https://huggingface.co/datasets/hongxingli/P2R-10k
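
For readers unfamiliar with GRPO, the training signal mentioned above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the linked repository. The group-relative advantage (each sampled response's reward standardized against the other responses to the same prompt) is the standard GRPO formulation. Everything specific to PRA-GRPO here is an assumption: the function names `answer_reward` and `role_for_step`, the exact-match reward, and the fixed alternation `period` are hypothetical, since the page only states that updates alternate between perception-focused and reasoning-focused phases and use final-answer supervision.

```python
from statistics import mean, pstdev
from typing import List


def answer_reward(predicted: str, gold: str) -> float:
    """Final-answer supervision: 1.0 for a match, else 0.0.

    Case-insensitive exact match is an assumed scoring rule.
    """
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_for_step(step: int, period: int = 1) -> str:
    """Assumed schedule: switch the updated role every `period` steps."""
    return "perceiver" if (step // period) % 2 == 0 else "reasoner"


if __name__ == "__main__":
    gold = "red"
    sampled_answers = ["Red", "blue", "green", "red "]
    rewards = [answer_reward(a, gold) for a in sampled_answers]
    for step in range(4):
        advantages = group_relative_advantages(rewards)
        print(step, role_for_step(step), [round(a, 3) for a in advantages])
```

Because the reward depends only on the final answer, no box-level labels enter the computation; under the assumed schedule, the same advantages would simply be applied to the tokens produced in whichever role is being updated at that step.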