(1) Data Collection and Cleaning: Starting with 10 million instruction samples, we cluster the data by image similarity and sample uniformly across the resulting categories (a minimal sketch of this step is given below). This yields a diverse dataset covering both image-based and video-based Q&A formats. (2) Response Generation: We leverage state-of-the-art models, including GPT-4o and Qwen2-VL-72B, to generate responses. (3) Human Annotation: We conduct comprehensive manual annotation across nine categories, including scoring, ranking, and textual explanations, ensuring fine-grained evaluation.
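A minimal sketch of the cluster-then-sample step in (1), assuming precomputed image embeddings (e.g., from CLIP) and k-means clustering; the embedding model, number of clusters, and per-cluster quota are illustrative assumptions rather than the pipeline's actual settings.

```python
# Hypothetical sketch: cluster instruction samples by image similarity,
# then draw an equal number of samples from each cluster.
import numpy as np
from sklearn.cluster import KMeans

def uniform_sample_by_cluster(image_embeddings: np.ndarray,
                              n_clusters: int = 100,
                              per_cluster: int = 50,
                              seed: int = 0) -> np.ndarray:
    """Return indices of a cluster-balanced subset of the data."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(image_embeddings)
    selected = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        take = min(per_cluster, len(idx))  # clusters smaller than the quota are kept whole
        selected.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(selected)
```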
Re-sampling results from the clustering process. Because of the large total number of samples, the clustered and deduplicated data cover a rich diversity of categories. Selected samples span topics such as mathematics, daily life, natural scenes, medicine, electronic technology, and OCR scenarios, showcasing a variety of problem-image pairs. The 2D features are obtained via UMAP dimensionality reduction.
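The 2D projection mentioned in the caption can be reproduced in spirit with umap-learn; the parameter values below are illustrative defaults, not the settings used for the actual figure.

```python
# Sketch: project image embeddings to 2D with UMAP and color points by cluster.
import umap
import matplotlib.pyplot as plt

def plot_clusters_2d(image_embeddings, cluster_labels):
    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
    coords = reducer.fit_transform(image_embeddings)  # shape (N, 2)
    plt.scatter(coords[:, 0], coords[:, 1], c=cluster_labels, s=2, cmap="tab20")
    plt.title("UMAP projection of image embeddings")
    plt.show()
```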
Illustration of the multi-task reward model training process. The process begins with a user query and corresponding model responses, which are ranked and annotated by humans. Human annotations are expanded using GPT-4o to provide enhanced rationales. The reward model is trained with two objectives: (1) Learning to Provide Critique, where the model learns to provide detailed critiques and evaluations for model responses, and (2) Learning Scoring, where the model learns to assign scores based on the model response and critique. The integration of these tasks ensures a robust evaluation framework for improving model outputs.
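One way to combine the two objectives in the caption is a joint loss: a language-modeling term over the critique tokens and a pairwise (Bradley-Terry style) term over scalar scores. The weighting `lambda_score` and the exact way the score head conditions on the critique are assumptions for illustration, not the paper's specification.

```python
# Hypothetical sketch of the two-objective reward-model loss.
import torch
import torch.nn.functional as F

def reward_model_loss(critique_logits, critique_labels,
                      score_chosen, score_rejected,
                      lambda_score: float = 1.0):
    # (1) Learning to Provide Critique: next-token prediction on critique text.
    critique_loss = F.cross_entropy(
        critique_logits.view(-1, critique_logits.size(-1)),
        critique_labels.view(-1),
        ignore_index=-100,  # non-critique tokens are masked out
    )
    # (2) Learning Scoring: the human-preferred response should score higher.
    score_loss = -F.logsigmoid(score_chosen - score_rejected).mean()
    return critique_loss + lambda_score * score_loss
```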
Overview of the MM-DPO framework. The dynamic reward scaling mechanism adjusts the update strength based on the reward margin, improving optimization stability and robustness.
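A minimal sketch of how a margin-dependent scale could sit on top of the standard DPO objective: pairs with a larger reward-model margin receive a stronger update. The bounded tanh schedule and its hyperparameters are illustrative assumptions, not the exact formulation used by MM-DPO.

```python
# Hypothetical sketch: DPO loss with a reward-margin-dependent scaling factor.
import torch
import torch.nn.functional as F

def mm_dpo_style_loss(policy_logps_chosen, policy_logps_rejected,
                      ref_logps_chosen, ref_logps_rejected,
                      reward_margin,           # reward-model margin per pair
                      beta: float = 0.1, k: float = 1.0):
    # Larger (more confident) margins increase the effective update strength.
    dynamic_beta = beta * (1.0 + torch.tanh(reward_margin / k))
    logits = (policy_logps_chosen - policy_logps_rejected) \
             - (ref_logps_chosen - ref_logps_rejected)
    return -F.logsigmoid(dynamic_beta * logits).mean()
```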
Performance comparison across metrics and methods on MM-RLHF-RewardBench. MM-RLHF-Reward (w/o Task 1) denotes training the LLaVA-OV-7B model to score pairwise samples while excluding Task 1. MM-RLHF-Reward (w/o enhanced annotations) learns from the original human-provided annotations (without GPT-4o enhancement) before scoring. MM-RLHF-Reward (inference w/ GT annotation) uses ground-truth annotations during inference.
Performance comparison of our reward model (MM-RLHF-Reward) with existing open-source and private MLLMs. MM-RLHF-Reward-7B outperforms existing 72B open-source MLLMs and several competitive closed-source models.
Performance variations after alignment across 8 evaluation dimensions for multiple models trained with our alignment strategy. All models show comprehensive improvements, with significant gains across diverse tasks.
Performance variations after alignment on MM-RLHF-SafeBench, comparing multiple models under our alignment strategy.
@article{zhang2025mmrlhfstepforwardmultimodal,
title={MM-RLHF: The Next Step Forward in Multimodal LLM Alignment},
author={Yi-Fan Zhang and Tao Yu and Haochen Tian and Chaoyou Fu and Peiyan Li and Jianshu Zeng and Wulin Xie and Yang Shi and Huanyu Zhang and Junkang Wu and Xue Wang and Yibo Hu and Bin Wen and Fan Yang and Zhang Zhang and Tingting Gao and Di Zhang and Liang Wang and Rong Jin and Tieniu Tan},
journal={arXiv preprint arXiv:2502.10391},
year={2025}
}