MMoE is an initial attempt to build better multimodal foundation models by focusing on how different types of interactions occur between image and text data. In this work, we categorize the interactions between these two modalities into three types:
- Redundancy: When both image and text contain similar information relevant to the task.
- Uniqueness: When each modality (image or text) holds different information that’s valuable for the task.
- Synergy: When the image and text combine in a way that produces new, useful information for the task.
We design and train expert models to handle each interaction type separately, and then a final model (a “fuser”) combines their outputs to make the final prediction.
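Conceptually, MMoE is a mixture of interaction experts whose outputs are weighted by the fuser. The sketch below is only an illustration of this idea (the class `MixtureOfInteractionExperts` and its interface are our own naming, not the repository's API); it assumes each expert maps an (image, text) pair to task logits and the fuser scores how much each interaction type applies.

```python
# Illustrative sketch only (not the repository's API): three interaction
# experts (Redundancy / Uniqueness / Synergy) whose task logits are combined
# using weights predicted by a fuser.
import torch
import torch.nn as nn


class MixtureOfInteractionExperts(nn.Module):
    def __init__(self, expert_r, expert_u, expert_s, fuser):
        super().__init__()
        self.experts = nn.ModuleList([expert_r, expert_u, expert_s])
        self.fuser = fuser  # maps (image, text) -> 3 interaction logits

    def forward(self, image, text):
        # Each expert produces task logits of shape (batch, num_classes).
        expert_logits = torch.stack(
            [expert(image, text) for expert in self.experts], dim=1
        )  # (batch, 3, num_classes)
        # The fuser scores how strongly each interaction type applies.
        weights = torch.softmax(self.fuser(image, text), dim=-1)  # (batch, 3)
        # Final prediction: weighted combination of the expert outputs.
        return (weights.unsqueeze(-1) * expert_logits).sum(dim=1)
```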
The overall pipeline of MMoE includes three steps:
- Categorizing training data based on multimodal interactions
- Training expert models for each multimodal interaction type
- Inference with mixtures of expert models
In the following sections, we provide detailed instructions for running the experiments for each part.
To categorize multimodal datasets into interaction types (Redundancy, Uniqueness, Synergy), we use unimodal predictions from CogVLM2-LLaMA3-chat19B for images and Qwen2-72B-Instruct for text. Below is an example using the MUSTARD dataset; a conceptual sketch of the splitting rule follows the commands.
Step-by-Step Data Categorization
cd ./data_gen_vision_label_CogVLM2
pip install -r requirements.txt
python mustard_vision_label.py  # collect unimodal vision labels
cd ./data_gen_text_label_Qwen2
pip install -r requirements.txt
python mustard_text_label.py  # collect unimodal text labels
cd ./data_split
pip install -r requirements.txt
python mustard_split.py
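For intuition, the split compares the unimodal labels with the gold label. The rule below is a minimal sketch under our own assumptions (the function name included); see mustard_split.py for the exact criteria used in the repository.

```python
# Minimal sketch of one plausible splitting rule (our assumption; see
# mustard_split.py for the exact criteria): compare the unimodal vision and
# text labels against the gold label for each datapoint.
def assign_interaction(vision_label, text_label, gold_label):
    vision_correct = vision_label == gold_label
    text_correct = text_label == gold_label
    if vision_correct and text_correct:
        return "redundancy"  # both modalities already carry the answer
    if vision_correct or text_correct:
        return "uniqueness"  # exactly one modality carries the answer
    return "synergy"  # neither suffices alone; the modalities must be combined
```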
Accessing Preprocessed Data
We provide organized and processed data for the MMSD2.0, MUSTARD, and URFUNNY datasets. Each dataset contains the following components:
- data_raw: Original dataset files (note: large images are stored separately; see instructions below).
- data_gen_output: Unimodal labels and captions.
- data_split_output: Data split according to unimodal labels.
- expert_inference_output: Expert model outputs.
You can download each dataset at the following links:
Note on Large Images: Each dataset’s /data_raw folder contains an /images directory with files that are too large to store directly. If you wish to run experiments, please download these images and place them under /data_raw/images as indicated below:
Due to size limitations, the image data is hosted externally. To use these datasets:
- Download images from the provided Google Drive links
- Place them in the respective image folders (/mmsd2.0_data, /mustard_data, /urfunny_data), as shown below.
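For reference, after the images are downloaded each dataset folder is expected to look roughly like this (shown for /mustard_data; the exact contents may differ slightly):

```
mustard_data/
├── data_raw/
│   └── images/               # downloaded separately from Google Drive
├── data_gen_output/          # unimodal labels and captions
├── data_split_output/        # data split according to unimodal labels
└── expert_inference_output/  # expert model outputs
```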
Train the expert models (one per multimodal interaction type) and the fusion model.
Examples of training the fusion and expert models:
To train the fusion model for MUSTARD:
cd ./expert_fusion
sh mustard_fusion_train.sh
sh mustard_fusion_test.sh  # generate RUS (Redundancy/Uniqueness/Synergy) logits for each datapoint
To train BLIP-2 models for the MUSTARD dataset:
conda create -n py310 python=3.10
conda activate py310
cd ./expert_BLIP2
pip install -r requirements.txt
sh blip2_mustard_train.sh
sh blip2_mustard_test.sh  # generate predictions from each model for each datapoint
To train Qwen2 models for MUSTARD:
conda create -n py310 python=3.10
conda activate py310
cd ./expert_Qwen2
pip install -r requirements.txt
sh train_qwen_mustard.sh
sh test_qwen_mustard.sh  # generate predictions from each model for each datapoint
To train ALBEF models for MUSTARD:
conda create -n py38 python=3.8
conda activate py38
cd ./expert_ALBEF
# download base model weights from https://github.com/salesforce/ALBEF
wget https://storage.googleapis.com/sfr-pcl-data-research/ALBEF/ALBEF.pth
pip install -r requirements.txt
sh scripts/all_mustard.sh
Once the expert and fusion models are trained, fusion-based inference combines the expert predictions to produce the final results; a conceptual sketch of this weighting follows the commands below.
cd ./expert_fusion
python fusion.py
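For intuition, the fusion step weights each expert's prediction by the RUS logits produced earlier. The sketch below uses assumed variable names (`rus_logits`, `expert_logits`) and a simple softmax weighting; the actual file handling and combination rule live in fusion.py.

```python
# Illustrative sketch of fusion-based inference (see fusion.py for the actual
# implementation): weight each expert's prediction by the fused RUS scores.
import numpy as np

def fuse_predictions(rus_logits, expert_logits):
    """rus_logits: shape (3,), scores for Redundancy/Uniqueness/Synergy.
    expert_logits: shape (3, num_classes), predictions from the R/U/S experts."""
    weights = np.exp(rus_logits) / np.exp(rus_logits).sum()  # softmax over R/U/S
    combined = (weights[:, None] * expert_logits).sum(axis=0)
    return int(combined.argmax())  # final class prediction
```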
If you find MMoE useful in your research, please consider citing our paper:
@inproceedings{yu-etal-2024-mmoe,
    title = "{MMoE}: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts",
    author = "Yu, Haofei and Qi, Zhengyang and Jang, Lawrence Keunho and Salakhutdinov, Russ and Morency, Louis-Philippe and Liang, Paul Pu",
    editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.558",
    pages = "10006--10030",
}