Improve Vision Language Model Chain-of-thought Reasoning
Authors: Ruohong Zhang†, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun‡, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang‡
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes often rely on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers leads to poor generalization on reasoning tasks that require more detailed explanations. To address this limitation, we propose a two-stage post-training strategy that extends the use of short-answer data for enhanced CoT reasoning. First, we augment short answers with CoT reasoning generated by GPT-4o, enhancing the VLM’s CoT capabilities through fine-tuning. Second, we leverage short answers as outcome rewards for reinforcement learning. Specifically, short answers are used as correctness indicators to construct positive (correct) and negative (incorrect) pairs from model-generated reasoning chains. These pairs are then used to calibrate the model’s reasoning via Direct Preference Optimization. Our experiments show significant improvements in CoT reasoning on benchmark datasets, along with enhanced generalization to direct answer prediction. This work provides a critical data resource for VLM CoT training and demonstrates the effectiveness of outcome rewards for post-training multimodal models.
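To make the second stage concrete, the sketch below illustrates one way short ground-truth answers can serve as outcome rewards: sampled reasoning chains are labeled correct or incorrect by matching their final answer against the short annotation, correct/incorrect chains are paired as chosen/rejected, and a standard DPO loss is computed per pair. This is not the authors' released code; the `Answer:` extraction heuristic, function names, and `beta` value are illustrative assumptions.

```python
# Minimal sketch: build DPO preference pairs from short-answer correctness
# and evaluate the standard DPO objective on pre-computed log-probabilities.
import math
from itertools import product
from typing import List, Tuple


def extract_answer(chain: str) -> str:
    """Naive extraction: take the text after an 'Answer:' marker (assumed format)."""
    marker = "Answer:"
    return chain.split(marker)[-1].strip().lower() if marker in chain else chain.strip().lower()


def build_preference_pairs(chains: List[str], short_answer: str) -> List[Tuple[str, str]]:
    """Pair every correct chain (chosen) with every incorrect chain (rejected)."""
    gold = short_answer.strip().lower()
    correct = [c for c in chains if extract_answer(c) == gold]
    incorrect = [c for c in chains if extract_answer(c) != gold]
    return list(product(correct, incorrect))


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO objective for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


if __name__ == "__main__":
    # Two hypothetical reasoning chains sampled by the VLM for one question.
    chains = [
        "The sign in the image is red and octagonal. Answer: stop",
        "The sign appears to be a yield sign. Answer: yield",
    ]
    pairs = build_preference_pairs(chains, short_answer="stop")
    print(pairs)  # the correct chain is chosen, the incorrect one rejected
    print(dpo_loss(-4.2, -6.1, -4.5, -5.8))  # example policy/reference log-probs
```

In practice the per-pair log-probabilities would come from the fine-tuned policy and a frozen reference model over full reasoning sequences; the sketch only shows how the short annotation converts sampled chains into preference data.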