MULTIMODAL INFORMATION FUSION FOR VIDEO CONCEPT DETECTION (WA-S2)
Author(s):
Yi Wu (ECE Dept, University of California Santa Barbara, USA)
Ching-Yung Lin (IBM T.J. Watson Research Center, Hawthorne, USA)
Edward Y. Chang (ECE Dept, University of California Santa Barbara, USA)
John R. Smith (IBM T.J. Watson Research Center, Hawthorne, USA)
Abstract: Video media carries multimodal information, including visual, audio, and textual data. Considerable research has focused on utilizing multimodal features for better understanding of video content. However, many problems remain, such as how to combine multimodal features and what the effects of different combinations are. In this paper, we propose two methods to find the optimal combination of multimodal information in order to improve the performance of video concept detection. The first method is Gradient-descent-optimization Linear Fusion. The second method is Two-Level SVMs (Support Vector Machines) Nonlinear Fusion. Both methods first train separate classifiers for the single modalities. Once the individual models have been trained, Gradient-descent-optimization Linear Fusion learns an optimal weighted linear combination of the single modalities using a gradient descent technique, while Two-Level SVMs Nonlinear Fusion learns an optimal nonlinear combination of the single modalities using another SVM. Our experiments show that both methods improve performance significantly on the TREC-Video 2003 benchmarks.
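The first method described in the abstract can be illustrated with a minimal sketch: given confidence scores from separately trained unimodal classifiers, learn a weighted linear combination by gradient descent on a squared-error objective. This is an assumption-laden illustration of the general technique, not the authors' actual implementation; the function name, learning rate, objective, and toy data below are all hypothetical.

```python
import numpy as np

def learn_fusion_weights(scores, labels, lr=0.1, epochs=500):
    """Learn linear-fusion weights over per-modality scores by gradient descent.

    scores: (n_samples, n_modalities) confidence scores from unimodal classifiers
    labels: (n_samples,) binary ground-truth labels in {0, 1}
    Minimizes mean squared error between the weighted combination and the labels.
    (Illustrative sketch; the paper's exact objective may differ.)
    """
    n_mod = scores.shape[1]
    w = np.full(n_mod, 1.0 / n_mod)          # start from uniform weights
    for _ in range(epochs):
        pred = scores @ w                     # fused score per sample
        grad = 2 * scores.T @ (pred - labels) / len(labels)
        w -= lr * grad                        # gradient descent step
    return w

# Toy example: modality 0 tracks the label; modality 1 is pure noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200).astype(float)
informative = 0.8 * labels + rng.normal(0.0, 0.1, 200)
noise = rng.normal(0.5, 0.3, 200)
scores = np.column_stack([informative, noise])
w = learn_fusion_weights(scores, labels)
```

On such data the learned weight for the informative modality ends up larger than the weight for the noise modality, which is the behavior linear fusion is meant to exploit. The second method (two-level stacking) would instead feed the unimodal scores into another SVM rather than a fixed linear form.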
