11th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services

Research Article

Multi-modal Fusion for Flasher Detection in a Mobile Video Chat Application

  • @INPROCEEDINGS{10.4108/icst.mobiquitous.2014.257973,
        author={Lei Tian and Rahat Rafiq and Shaosong Li and David Chu and Richard Han and Qin Lv and Shivakant Mishra},
        title={Multi-modal Fusion for Flasher Detection in a Mobile Video Chat Application},
        proceedings={11th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services},
        publisher={ICST},
        proceedings_a={MOBIQUITOUS},
        year={2014},
        month={11},
        keywords={multi-modal fusion, flasher detection, mobile video chat},
        doi={10.4108/icst.mobiquitous.2014.257973}
    }
    
Lei Tian1,*, Rahat Rafiq1, Shaosong Li1, David Chu2, Richard Han1, Qin Lv1, Shivakant Mishra1
  • 1: University of Colorado Boulder
  • 2: Microsoft Research
*Contact email: lei.tian@colorado.edu

Abstract

This paper investigates the development of accurate and efficient classifiers to identify misbehaving users (i.e., “flashers”) in a mobile video chat application. Our analysis is based on video session data collected via a mobile client we built that connects to a popular random video chat service. We show that prior image-based classifiers designed to distinguish normal from misbehaving users in online video chat systems perform poorly on mobile video chat data. We present an enhanced image-based classifier that improves classification performance on mobile data. More importantly, we demonstrate that incorporating multi-modal mobile sensor data from the accelerometer and the camera state (front/back), along with audio, can significantly improve overall image-based classification accuracy. Our work also shows that leveraging multiple image-based predictions within a session (i.e., the temporal modality) has the potential to further improve classification performance. Finally, we show that the running-time cost of classification can be significantly reduced by employing a multilevel cascaded classifier in which high-complexity features and additional image-based predictions are generated only when needed.
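
To make the fusion and cascade ideas concrete, the sketch below shows one plausible two-stage design: a cheap first stage classifies each frame from accelerometer, camera-state, and audio features, and the expensive image-based stage runs only when the first stage is uncertain; per-frame predictions are then aggregated over a session. This is an illustrative sketch under stated assumptions, not the paper's implementation: sensor_features, image_features, CascadedFlasherClassifier, the confidence band (low, high), and session_label are all hypothetical names and choices; only scikit-learn's RandomForestClassifier is a real API.

    # Minimal sketch of a two-level cascaded, multi-modal classifier.
    # All feature extractors and thresholds here are hypothetical.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def sensor_features(meta):
        """Cheap features: accelerometer stats, camera state, audio energy."""
        acc = meta["accel"]                        # (n, 3) accelerometer samples
        return np.array([
            acc.mean(), acc.std(),                 # coarse device-motion statistics
            float(meta["front_camera"]),           # 1.0 if front camera, else 0.0
            meta["audio_rms"],                     # coarse audio energy
        ])

    def image_features(frame):
        """Expensive image features (e.g., skin-region descriptors); stubbed here."""
        return frame.reshape(-1)[:64].astype(float)  # placeholder descriptor

    class CascadedFlasherClassifier:
        def __init__(self, low=0.2, high=0.8):
            self.stage1 = RandomForestClassifier(n_estimators=50)
            self.stage2 = RandomForestClassifier(n_estimators=100)
            self.low, self.high = low, high        # confidence band for escalation

        def fit(self, metas, frames, labels):
            self.stage1.fit([sensor_features(m) for m in metas], labels)
            self.stage2.fit([image_features(f) for f in frames], labels)

        def predict(self, meta, frame):
            p = self.stage1.predict_proba([sensor_features(meta)])[0, 1]
            if p <= self.low:
                return 0                           # confidently normal: skip stage 2
            if p >= self.high:
                return 1                           # confidently misbehaving
            # Uncertain: compute image features only now; fuse the two scores.
            q = self.stage2.predict_proba([image_features(frame)])[0, 1]
            return int((p + q) / 2 >= 0.5)

    def session_label(frame_preds):
        """Temporal modality: aggregate per-frame predictions by majority vote."""
        return int(np.mean(frame_preds) >= 0.5)

In such a design, widening the (low, high) band routes more frames to the expensive image stage, trading running time for accuracy; narrowing it does the opposite, which is the cost/accuracy knob the cascade in the abstract alludes to.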