
Research Article
Multimodal-Driven Emotion-Controlled Facial Animation Generation Model
@ARTICLE{10.4108/eetsis.7624,
  author={Zhenyu Qiu and Yuting Luo and Yiren Zhou and Teng Gao},
  title={Multimodal-Driven Emotion-Controlled Facial Animation Generation Model},
  journal={EAI Endorsed Transactions on Scalable Information Systems},
  volume={12},
  number={4},
  publisher={EAI},
  journal_a={SIS},
  year={2025},
  month={7},
  keywords={Deep Learning, Computer Vision, Generative Adversarial Networks, Facial Animation Generation Technology, Multimodal},
  doi={10.4108/eetsis.7624}
}
Zhenyu Qiu, Yuting Luo, Yiren Zhou, Teng Gao
Year: 2025
EAI Endorsed Transactions on Scalable Information Systems (SIS), EAI
DOI: 10.4108/eetsis.7624
Abstract
INTRODUCTION: In recent years, facial animation generation technology has emerged as a prominent area of focus within computer vision, with varying degrees of progress in lip-synchronization quality and emotion control.
OBJECTIVES: Existing research, however, often compromises lip movements during facial expression generation, thereby diminishing lip-synchronization accuracy. This study proposes a multimodal, emotion-controlled facial animation generation model to address this challenge.
METHODS: The proposed model comprises two custom deep-learning networks arranged sequentially. Given an expressionless target portrait image, the model generates high-quality, lip-synchronized, and emotion-controlled facial videos driven by three modalities: audio, text, and emotional portrait images. In this framework, text features serve a critical supplementary role in predicting lip movements from audio input, thereby enhancing lip-synchronization quality.
RESULTS: Experimental findings indicate that the proposed model reduces the lip feature coordinate distance (L-LD) by 5.93% and 33.52% compared with established facial animation generation methods, namely MakeItTalk and the Emotion-Aware Motion Model (EAMM), and reduces the facial feature coordinate distance (F-LD) by 7.00% and 8.79%.
CONCLUSION: These results substantiate the efficacy of the proposed model in generating high-quality, lip-synchronized, and emotion-controlled facial animations.
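For reference, the sketch below shows one common way landmark-distance metrics such as L-LD and F-LD are computed: the mean Euclidean distance between predicted and ground-truth facial landmark coordinates, restricted to the lip region for L-LD. This is a minimal illustration only; the 68-point landmark convention (with indices 48-67 covering the lips), the absence of normalization, and the function names are assumptions, not details taken from the paper.

import numpy as np

# Assumed 68-point landmark convention: indices 48-67 cover the lips.
# The paper does not specify its landmark layout; this split is an assumption.
LIP_IDX = np.arange(48, 68)

def landmark_distance(pred, gt, idx=None):
    """Mean Euclidean distance between predicted and ground-truth landmarks.

    pred, gt: arrays of shape (num_frames, num_landmarks, 2) holding (x, y)
    coordinates for each frame of the generated and reference videos.
    idx: optional landmark indices restricting the metric to a facial region.
    """
    if idx is not None:
        pred, gt = pred[:, idx], gt[:, idx]
    # Per-landmark Euclidean distance, averaged over landmarks and frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Usage with random stand-in data (300 frames, 68 landmarks per frame):
rng = np.random.default_rng(0)
pred = rng.normal(size=(300, 68, 2))
gt = rng.normal(size=(300, 68, 2))
f_ld = landmark_distance(pred, gt)            # facial landmark distance (F-LD)
l_ld = landmark_distance(pred, gt, LIP_IDX)   # lip landmark distance (L-LD)
print(f"F-LD: {f_ld:.3f}  L-LD: {l_ld:.3f}")

Restricting the average to lip landmarks isolates lip-synchronization error, which is why the reported L-LD reductions track the claimed improvement in lip-sync quality.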
Copyright © 2025 Zhenyu Qiu et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA 4.0 license, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium so long as the original work is properly cited.