MM-RLHF

The Next Step Forward in Multimodal LLM Alignment

Yi-Fan Zhang2, Tao Yu2, Haochen Tian2, Chaoyou Fu3,
Peiyan Li2, Jianshu Zeng, Wulin Xie2, Yang Shi, Huanyu Zhang2, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen1, Fan Yang1, Zhang Zhang2, Tingting Gao1, Di Zhang1, Liang Wang2, Rong Jin, Tieniu Tan2,3
CASIA MAIS-NLPR
1KuaiShou, 2CASIA, 3NJU

Introduction

   Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored.
   To this end, we introduce MM-RLHF, a dataset containing 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across 10 distinct dimensions and 27 benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning with MM-RLHF and our alignment algorithm leads to a 19.5% increase in conversational abilities and a 60% improvement in safety.

MM-RLHF Dataset

Construction Pipeline


(1) Data Collection and Cleaning: Starting from 10 million instruction samples, we cluster the data by image similarity and sample uniformly across the resulting categories. This yields a diverse dataset covering image-based Q&A and video Q&A formats. (2) Response Generation: We leverage state-of-the-art models, including GPT-4o and Qwen2-VL-72B, to generate candidate responses. (3) Human Annotation: We conduct comprehensive manual annotation across nine categories, including scoring, ranking, and textual explanations, ensuring fine-grained evaluation.
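As a rough illustration of step (1), the cluster-then-sample procedure can be sketched as follows. The use of precomputed CLIP-style image embeddings, k-means, and the specific cluster and sample counts are assumptions made for this example; the description above only specifies clustering by image similarity followed by uniform sampling.

# Minimal sketch of the cluster-then-sample step, assuming precomputed
# image embeddings (e.g., CLIP features) and k-means as the similarity
# clustering; these choices are illustrative, not the exact pipeline.
import numpy as np
from sklearn.cluster import KMeans

def uniform_sample_by_cluster(image_embeddings: np.ndarray,
                              n_clusters: int = 1000,
                              per_cluster: int = 120,
                              seed: int = 0) -> np.ndarray:
    """Cluster samples by image similarity, then draw roughly the same
    number of samples from every cluster."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(image_embeddings)
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, members.size)
        if take > 0:
            selected.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(selected)

# Example (hypothetical inputs): indices = uniform_sample_by_cluster(clip_features, 1000, 120)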


Re-sampling results from the clustering process. Because the total number of samples is large, the clustered and deduplicated results cover a rich diversity of categories. Selected samples include topics such as mathematics, daily life, natural scenes, medicine, electronic technology, and OCR scenarios, showcasing a variety of problem-image pairs. The 2D features are obtained via UMAP dimensionality reduction.
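For reference, a 2D view of this kind can be produced with umap-learn along the lines below. The embedding model and UMAP settings are assumptions here, and `image_embeddings` / `cluster_labels` stand in for whatever image features and cluster assignments were produced upstream.

# Illustrative UMAP visualization; the exact features and settings are assumed.
import numpy as np
import umap
import matplotlib.pyplot as plt

def plot_umap(image_embeddings: np.ndarray, cluster_labels: np.ndarray) -> None:
    """Reduce high-dimensional image features to 2D with UMAP and plot them,
    colored by cluster."""
    coords = umap.UMAP(n_components=2, random_state=0).fit_transform(image_embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], s=2, c=cluster_labels, cmap="tab20")
    plt.title("UMAP projection of sampled problem-image pairs")
    plt.show()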

Dataset Examples

MM-RLHF-Reward

Critique-Based Reward Model Training


Illustration of the multi-task reward model training process. The process begins with a user query and the corresponding model responses, which are ranked and annotated by humans. Human annotations are then expanded with GPT-4o to provide richer rationales. The reward model is trained with two objectives: (1) Learning to Provide Critique, where the model generates detailed critiques and evaluations of model responses, and (2) Learning Scoring, where the model assigns scores conditioned on the model response and its critique. Together, these tasks provide a robust evaluation framework for improving model outputs.
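A minimal PyTorch sketch of these two objectives is given below. The backbone interface (a causal LM that returns logits and hidden states), the -100 masking convention for non-critique tokens, the pairwise ranking loss, and the weight `alpha` are illustrative assumptions rather than the authors' exact implementation.

# Sketch of the two reward-model objectives; interfaces and loss weighting
# here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CritiqueRewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone          # any causal (M)LLM returning LM logits + hidden states
        self.score_head = nn.Linear(hidden_size, 1)  # scalar reward from the final hidden state

    def forward(self, input_ids, attention_mask, critique_labels=None):
        out = self.backbone(input_ids=input_ids,
                            attention_mask=attention_mask,
                            output_hidden_states=True)
        # (1) Learning to Provide Critique: next-token loss on the critique span only
        #     (non-critique positions are masked with -100 in critique_labels).
        lm_loss = F.cross_entropy(
            out.logits[:, :-1].flatten(0, 1),
            critique_labels[:, 1:].flatten(),
            ignore_index=-100,
        ) if critique_labels is not None else 0.0
        # (2) Learning Scoring: scalar score read off the last token's hidden state.
        last_hidden = out.hidden_states[-1][:, -1]
        score = self.score_head(last_hidden).squeeze(-1)
        return score, lm_loss

def reward_loss(score_chosen, score_rejected, lm_loss, alpha=1.0):
    """Pairwise (Bradley-Terry style) ranking loss on the scores,
    plus the critique language-modeling loss."""
    ranking = -F.logsigmoid(score_chosen - score_rejected).mean()
    return ranking + alpha * lm_loss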

Dynamic Reward Scaling


Overview of the MM-DPO framework. The dynamic reward scaling mechanism adjusts the update strength according to the reward margin, improving optimization stability and robustness.
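A minimal sketch of how such per-sample scaling could plug into a standard DPO objective is shown below. The bounded tanh-based form of the scaling and its hyperparameters `w` and `tau` are illustrative assumptions; only the idea of modulating update strength by the reward margin comes from the description above.

# Sketch of a DPO loss with per-sample dynamic reward scaling; the exact
# scaling function used in MM-DPO is not reproduced here.
import torch
import torch.nn.functional as F

def mm_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                reward_margin, beta=0.1, w=0.5, tau=1.0):
    """reward_margin: per-sample score gap (chosen - rejected) from the
    critique-based reward model; larger margins get stronger updates."""
    # Standard DPO logits: difference of policy/reference log-ratio gaps.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios

    # Dynamic scaling factor: bounded and monotone in the reward margin,
    # so high-confidence pairs contribute more to the update.
    scale = 1.0 + w * torch.tanh(reward_margin / tau)

    return -(scale * F.logsigmoid(beta * logits)).mean()

Because the scaling factor is bounded, even a poorly calibrated reward margin can only rescale a pair's contribution within a fixed range, which is consistent with the stability motivation above.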

Experiment Results

Reward Evaluation on MM-RLHF-RewardBench and VLRewardBench

Main Performance on MM-RLHF-SafeBench and General Image/Video Understanding and Hallucination Benchmarks

Citation


      @article{zhang2025mmrlhfstepforwardmultimodal,
        title={MM-RLHF: The Next Step Forward in Multimodal LLM Alignment},
        author={Yi-Fan Zhang and Tao Yu and Haochen Tian and Chaoyou Fu and Peiyan Li and Jianshu Zeng and Wulin Xie and Yang Shi and Huanyu Zhang and Junkang Wu and Xue Wang and Yibo Hu and Bin Wen and Fan Yang and Zhang Zhang and Tingting Gao and Di Zhang and Liang Wang and Rong Jin and Tieniu Tan},
        journal={arXiv preprint arXiv:2502.10391},
        year={2025}
      }