M2P2 is a multimodal sequence learning framework for predicting persuasion in debate videos. Given a speaking clip with audio, video, and text modalities, M2P2 learns both shared and modality-specific (heterogeneous) embeddings and combines them to predict how persuasive the speaker is.
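To make this interface concrete, here is a minimal sketch in PyTorch: per-modality sequence encoders produce clip-level embeddings, which a prediction head maps to a persuasiveness score. The GRU encoders, the feature dimensions, and the simple concatenation head are illustrative assumptions, not M2P2's exact architecture (the actual model uses the adaptive fusion described below).

```python
import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    """Encode one modality's feature sequence into a clip-level embedding.
    A single-layer GRU is an illustrative choice, not necessarily M2P2's."""
    def __init__(self, in_dim: int, hid_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim) -> (batch, hid_dim)
        _, h = self.gru(x)
        return h.squeeze(0)

class PersuasionPredictor(nn.Module):
    """Encode each modality, concatenate the embeddings, and regress a
    persuasiveness score. All dimensions below are hypothetical."""
    def __init__(self, dims: dict, hid_dim: int = 128):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: UnimodalEncoder(d, hid_dim) for m, d in dims.items()})
        self.head = nn.Linear(hid_dim * len(dims), 1)

    def forward(self, clip: dict) -> torch.Tensor:
        embs = [enc(clip[m]) for m, enc in self.encoders.items()]
        return self.head(torch.cat(embs, dim=-1))

# Hypothetical usage with assumed per-frame feature sizes.
dims = {"audio": 40, "video": 512, "text": 300}
model = PersuasionPredictor(dims)
clip = {m: torch.randn(2, 20, d) for m, d in dims.items()}
score = model(clip)  # (2, 1) persuasiveness scores
```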
The QPS debate video dataset is released to support future research on persuasion; its statistics are summarized below.
| Dataset statistics | |
| --- | --- |
| Duration (minutes) | 582 |
| Number of debates | 62 |
| Number of speakers | 48 |
| Number of segments | 2,297 |
Controversial topics (e.g., foreign policy, immigration, national debt, privacy) engender intense debate among academics, businesses, and politicians, and identifying persuasive speakers in such adversarial environments is a critical task. In debate videos, multiple modalities (audio, video, and text) carry persuasive cues. Different modalities (1) are often semantically aligned, but (2) may each provide distinct information for prediction.
To leverage the alignment between modalities while preserving the diversity of the cues they provide, M2P2 devises a novel adaptive fusion learning framework. It fuses embeddings from two modules: an alignment module that extracts information shared across modalities, and a heterogeneity module that learns per-modality weights with guidance from three separately trained unimodal reference models.
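Below is a minimal sketch of this two-module fusion, assuming a linear projection into the shared space, a pairwise MSE alignment loss, and a softmax over negative unimodal reference-model losses as the modality weighting; the paper's exact losses and weighting scheme may differ.

```python
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Sketch of the alignment + heterogeneity fusion described above.
    The shared-space projection, pairwise alignment loss, and softmax
    weighting over reference losses are illustrative assumptions."""
    def __init__(self, emb_dim: int = 128, n_mods: int = 3):
        super().__init__()
        # Alignment module: one projection per modality into a shared space.
        self.proj = nn.ModuleList(
            [nn.Linear(emb_dim, emb_dim) for _ in range(n_mods)])

    def forward(self, embs, ref_losses):
        # embs: list of per-modality embeddings, each (batch, emb_dim).
        # ref_losses: losses of the unimodal reference models, shape
        # (n_mods,); a lower loss marks a more reliable modality.
        shared = [p(e) for p, e in zip(self.proj, embs)]
        # Alignment loss: pull shared projections of the same clip together.
        align_loss = sum(
            F.mse_loss(a, b) for a, b in itertools.combinations(shared, 2))
        # Heterogeneity module: weight modalities by reference-model quality.
        w = F.softmax(-ref_losses, dim=0)
        weighted = torch.cat(
            [w[i] * embs[i] for i in range(len(embs))], dim=-1)
        # Fuse the shared embedding with the weighted heterogeneous ones.
        fused = torch.cat([torch.stack(shared).mean(0), weighted], dim=-1)
        return fused, align_loss

# Hypothetical usage.
fusion = AdaptiveFusion(emb_dim=128, n_mods=3)
embs = [torch.randn(2, 128) for _ in range(3)]
ref_losses = torch.tensor([0.8, 1.2, 0.5])  # assumed unimodal val losses
fused, align_loss = fusion(embs, ref_losses)
# `fused` feeds the final predictor; `align_loss` joins the training objective.
```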
The example above shows the real-time prediction of debate persuasiveness (number of votes) using M2P2. The debate is from the Chinese debate TV show Qipashuo.
The following BibTeX citation can be used:
```bibtex
@misc{bai2020persuasion,
  author = {Chongyang Bai and Haipeng Chen and Srijan Kumar and Jure Leskovec and V. S. Subrahmanian},
  title  = {M2P2: Multimodal Persuasion Prediction using Adaptive Fusion},
  year   = {2020},
  eprint = {arXiv:2006.11405},
}
```