
Research Article
Multimodal-Driven Emotion-Controlled Facial Animation Generation Model
@ARTICLE{10.4108/eetsis.7624,
  author={Zhenyu Qiu and Yuting Luo and Yiren Zhou and Teng Gao},
  title={Multimodal-Driven Emotion-Controlled Facial Animation Generation Model},
  journal={EAI Endorsed Transactions on Scalable Information Systems},
  volume={12},
  number={4},
  publisher={EAI},
  journal_a={SIS},
  year={2025},
  month={7},
  keywords={Deep Learning, Computer Vision, Generative Adversarial Networks, Facial Animation Generation Technology, Multimodal},
  doi={10.4108/eetsis.7624}
}
Zhenyu Qiu, Yuting Luo, Yiren Zhou, Teng Gao
Year: 2025
EAI Endorsed Transactions on Scalable Information Systems (SIS), EAI
DOI: 10.4108/eetsis.7624
Abstract
INTRODUCTION: In recent years, facial animation generation technology has emerged as a prominent area of focus within computer vision, with varying degrees of progress in lip-synchronization quality and emotion control.
OBJECTIVES: Existing research, however, often compromises lip movements during facial expression generation, thereby diminishing lip-synchronization accuracy. This study proposes a multimodal, emotion-controlled facial animation generation model to address this challenge.
METHODS: The proposed model comprises two custom deep-learning networks arranged sequentially. Given an expressionless target portrait image, the model generates high-quality, lip-synchronized, and emotion-controlled facial videos driven by three modalities: audio, text, and emotional portrait images. In this framework, text features serve a critical supplementary role in predicting lip movements from audio input, thereby enhancing lip-synchronization quality.
RESULTS: Experimental findings indicate that the proposed model reduces the lip feature coordinate distance (L-LD) by 5.93% and 33.52% compared with established facial animation generation methods, namely MakeItTalk and the Emotion-Aware Motion Model (EAMM), and reduces the facial feature coordinate distance (F-LD) by 7.00% and 8.79%.
CONCLUSION: These results substantiate the efficacy of the proposed model in generating high-quality, lip-synchronized, and emotion-controlled facial animations.
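For reference, the sketch below shows one common way landmark-distance metrics such as L-LD and F-LD are computed: the mean Euclidean distance between predicted and ground-truth facial landmark coordinates, restricted to the lip region for L-LD. This is a minimal illustration only; the 68-point landmark convention (with indices 48-67 covering the lips), the absence of normalization, and the function names are assumptions, not details taken from the paper.

import numpy as np

# Assumed 68-point landmark convention: indices 48-67 cover the lips.
# The paper does not specify its landmark layout; this split is an assumption.
LIP_IDX = np.arange(48, 68)

def landmark_distance(pred, gt, idx=None):
    """Mean Euclidean distance between predicted and ground-truth landmarks.

    pred, gt: arrays of shape (num_frames, num_landmarks, 2) holding (x, y)
    coordinates for each frame of the generated and reference videos.
    idx: optional landmark indices restricting the metric to a facial region.
    """
    if idx is not None:
        pred, gt = pred[:, idx], gt[:, idx]
    # Per-landmark Euclidean distance, averaged over landmarks and frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Usage with random stand-in data (300 frames, 68 landmarks per frame):
rng = np.random.default_rng(0)
pred = rng.normal(size=(300, 68, 2))
gt = rng.normal(size=(300, 68, 2))
f_ld = landmark_distance(pred, gt)            # facial landmark distance (F-LD)
l_ld = landmark_distance(pred, gt, LIP_IDX)   # lip landmark distance (L-LD)
print(f"F-LD: {f_ld:.3f}  L-LD: {l_ld:.3f}")

Restricting the average to lip landmarks isolates lip-synchronization error, which is why the reported L-LD reductions track the claimed improvement in lip-sync quality.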
Copyright © 2025 Zhenyu Qiu et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA 4.0 license, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium so long as the original work is properly cited.