MM-RLHF

The Next Step Forward in Multimodal LLM Alignment

Yi-Fan Zhang2, Tao Yu2, Haochen Tian2, Chaoyou Fu3,
Peiyan Li2, Jianshu Zeng, Wulin Xie2, Yang Shi, Huanyu Zhang2, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen1, Fan Yang1, Zhang Zhang2, Tingting Gao1, Di Zhang1, Liang Wang2, Rong Jin, Tieniu Tan2,3
CASIA MAIS-NLPR
1KuaiShou, 2CASIA, 3NJU

Introduction

   Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not been thoroughly aligned with human preferences. This gap exists because current alignment research has mainly made progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we are proud to open-source MM-RLHF, a comprehensive project for aligning MLLMs with human preferences. This release includes:
  • 🔹 High-Quality MLLM Alignment Dataset: 120K samples curated by over 50 experts over two months, featuring ratings and manual annotations across eight dimensions.
  • 🔹 Strong Critique-Based MLLM Reward Model: Trained on human annotations, achieving state-of-the-art (SOTA) performance on public benchmarks.
  • 🔹 Novel Alignment Algorithm – MM-DPO: Effectively integrates reward signals to enhance the data efficiency of DPO training.
  • 🔹 Two New Benchmarks: Designed for reward modeling and multimodal safety, addressing critical gaps in existing benchmarks.
  • 🔹 Broad Performance Gains: Our dataset and algorithms drive consistent improvements across 10 dimensions and 27 benchmarks for open-source MLLMs.

MM-RLHF Dataset

Construction Pipeline

[Figure: MM-RLHF dataset construction pipeline]

(1) Data Collection and Cleaning: Starting from 10 million instruction samples, we cluster the data by image similarity and sample uniformly across the resulting categories, yielding a diverse dataset that covers both image-based and video Q&A formats. (2) Response Generation: We leverage state-of-the-art models, including GPT-4o and Qwen2-VL-72B, to generate candidate responses. (3) Human Annotation: We conduct comprehensive manual annotation across nine categories, including scoring, ranking, and textual explanations, ensuring fine-grained evaluation.
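As a rough illustration of step (1), the sketch below clusters instruction samples by image embedding and then draws a uniform quota from every cluster. The choice of CLIP-style embeddings, k-means, and the cluster/quota sizes are assumptions for illustration; the description above only specifies clustering by image similarity and uniform sampling across categories.

    # A minimal sketch of step (1), assuming precomputed image embeddings and
    # k-means as the clustering backend; encoder choice, cluster count, and
    # per-cluster quota are illustrative, not the released pipeline.
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_and_sample(image_embeddings: np.ndarray,
                           n_clusters: int = 100,
                           per_cluster: int = 1200,
                           seed: int = 0) -> np.ndarray:
        """Cluster samples by image embedding, then draw an equal number of
        samples from each cluster so the final mix stays balanced across categories."""
        rng = np.random.default_rng(seed)
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(image_embeddings)
        selected = []
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            take = min(per_cluster, len(idx))  # small clusters contribute all they have
            selected.append(rng.choice(idx, size=take, replace=False))
        return np.concatenate(selected)

    # Hypothetical usage: `clip_embeds` is an (N, D) array from a frozen vision encoder.
    # keep_idx = cluster_and_sample(clip_embeds, n_clusters=200, per_cluster=600)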

[Figure: Re-sampling results from the clustering process]

Re-sampling results from the clustering process. Because the total number of samples is large, the clustered and deduplicated results cover a rich diversity of categories. The selected samples span topics such as mathematics, daily life, natural scenes, medicine, electronic technology, and OCR scenarios, showcasing a variety of problem-image pairs. The 2D features are obtained via UMAP dimensionality reduction.
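For the 2D visualization mentioned in the caption, a projection along the following lines would reproduce the layout; the umap-learn hyperparameters here are library defaults, not values taken from the project.

    # Minimal sketch of the 2D projection behind the figure above, assuming the
    # umap-learn package and the same image embeddings as the clustering step.
    import umap  # pip install umap-learn

    def project_2d(image_embeddings, n_neighbors: int = 15, min_dist: float = 0.1):
        """Reduce high-dimensional image embeddings to 2D coordinates for plotting."""
        reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors,
                            min_dist=min_dist, random_state=0)
        return reducer.fit_transform(image_embeddings)  # shape (N, 2)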

Dataset Examples

MM-RLHF-Reward

Critique-Based Reward Model Training

[Figure: Multi-task reward model training process]

Illustration of the multi-task reward model training process. The process begins with a user query and corresponding model responses, which are ranked and annotated by humans. Human annotations are expanded using GPT-4o to provide enhanced rationales. The reward model is trained with two objectives: (1) Learning to Provide Critique, where the model learns to provide detailed critiques and evaluations for model responses, and (2) Learning Scoring, where the model learns to assign scores based on the model response and critique. The integration of these tasks ensures a robust evaluation framework for improving model outputs.
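The two objectives can be written down compactly. The sketch below is a minimal PyTorch rendering under assumed interfaces: the critique objective as token-level cross-entropy restricted to critique tokens, and the scoring objective as a pairwise term on scalar scores for the preferred versus rejected response conditioned on its critique. The exact loss forms and weighting in the released model may differ.

    # Minimal PyTorch sketch of the two reward-model objectives. The pairwise form
    # of the scoring loss and the weighting lam are assumptions, not the released code.
    import torch
    import torch.nn.functional as F

    def critique_loss(critique_logits: torch.Tensor, critique_labels: torch.Tensor) -> torch.Tensor:
        """(1) Learning to Provide Critique: next-token cross-entropy computed only
        on critique tokens; prompt and response positions are masked with -100."""
        return F.cross_entropy(critique_logits.view(-1, critique_logits.size(-1)),
                               critique_labels.view(-1), ignore_index=-100)

    def scoring_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
        """(2) Learning Scoring: pairwise loss pushing the score of the human-preferred
        response (conditioned on its critique) above that of the rejected one."""
        return -F.logsigmoid(score_chosen - score_rejected).mean()

    def multitask_loss(critique_logits, critique_labels,
                       score_chosen, score_rejected, lam: float = 1.0) -> torch.Tensor:
        """Joint objective combining critique generation and scoring."""
        return critique_loss(critique_logits, critique_labels) + lam * scoring_loss(score_chosen, score_rejected)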

Dynamic Reward Scaling

[Figure: MM-DPO framework with dynamic reward scaling]

Overview of the MM-DPO framework. The dynamic reward scaling mechanism adjusts the update strength based on the reward margin, improving optimization stability and robustness.
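Concretely, this keeps the standard DPO objective but lets the reward-model margin modulate the update strength per sample. The sketch below uses a tanh-shaped scaling of beta as an illustrative assumption; only the qualitative behaviour (larger margin, stronger update) is taken from the description above.

    # Sketch of a DPO loss with per-sample dynamic reward scaling. The tanh-based
    # scaling of beta and its hyperparameters are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def mm_dpo_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    ref_chosen_logps: torch.Tensor,
                    ref_rejected_logps: torch.Tensor,
                    reward_margin: torch.Tensor,  # r(chosen) - r(rejected) from the reward model
                    beta0: float = 0.1,
                    w: float = 0.5) -> torch.Tensor:
        # Standard DPO logits: difference of policy/reference log-ratios for the pair.
        logits = (policy_chosen_logps - ref_chosen_logps) - (policy_rejected_logps - ref_rejected_logps)
        # Dynamic reward scaling (assumed form): confident pairs (large margin) get a
        # larger effective beta, i.e. a stronger update; ambiguous pairs a weaker one.
        beta = beta0 * (1.0 + w * torch.tanh(reward_margin))
        return -F.logsigmoid(beta * logits).mean()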

Experimental Results

Reward Evaluation on MM-RLHF-RewardBench and VLRewardBench

Main Performance on MM-RLHF-SafeBench and on General Image/Video Understanding and Hallucination Benchmarks

Citation


      @article{zhang2025mmrlhfstepforwardmultimodal,
        title={MM-RLHF: The Next Step Forward in Multimodal LLM Alignment},
        author={Yi-Fan Zhang and Tao Yu and Haochen Tian and Chaoyou Fu and Peiyan Li and Jianshu Zeng and Wulin Xie and Yang Shi and Huanyu Zhang and Junkang Wu and Xue Wang and Yibo Hu and Bin Wen and Fan Yang and Zhang Zhang and Tingting Gao and Di Zhang and Liang Wang and Rong Jin and Tieniu Tan},
        journal={arXiv preprint arXiv:2502.10391},
        year={2025}
      }