EAI Endorsed Transactions on Scalable Information Systems, 12(4), 2025

Research Article

Multimodal-Driven Emotion-Controlled Facial Animation Generation Model

Cite (BibTeX)
@ARTICLE{10.4108/eetsis.7624,
    author={Zhenyu Qiu and Yuting Luo and Yiren Zhou and Teng Gao},
    title={Multimodal-Driven Emotion-Controlled Facial Animation Generation Model},
    journal={EAI Endorsed Transactions on Scalable Information Systems},
    volume={12},
    number={4},
    publisher={EAI},
    journal_a={SIS},
    year={2025},
    month={7},
    keywords={Deep Learning, Computer Vision, Generative Adversarial Networks, Facial Animation Generation Technology, Multimodal},
    doi={10.4108/eetsis.7624}
}
Zhenyu Qiu1, Yuting Luo1,*, Yiren Zhou1, Teng Gao1
  • 1: Nanchang Institute of Technology
*Contact email: yuting_luo24@outlook.com

Abstract

INTRODUCTION: In recent years, facial animation generation technology has emerged as a prominent area of focus within computer vision, with varying degrees of progress achieved in lip-synchronization quality and emotion control. OBJECTIVES: However, existing research often compromises lip movements during facial expression generation, thereby diminishing lip-synchronization accuracy. This study proposes a multimodal, emotion-controlled facial animation generation model to address this challenge. METHODS: The proposed model comprises two custom deep-learning networks arranged sequentially. Given an expressionless target portrait image, the model generates high-quality, lip-synchronized, and emotion-controlled facial videos driven by three modalities: audio, text, and emotional portrait images. RESULTS: In this framework, text features play a critical supplementary role in predicting lip movements from the audio input, thereby enhancing lip-synchronization quality. CONCLUSION: Experimental findings indicate that the proposed model reduces the lip feature coordinate distance (L-LD) by 5.93% and 33.52% relative to the established facial animation generation methods MakeItTalk and the Emotion-Aware Motion Model (EAMM), respectively, and the facial feature coordinate distance (F-LD) by 7.00% and 8.79%. These results substantiate the efficacy of the proposed model in generating high-quality, lip-synchronized, and emotion-controlled facial animations.
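
For readers unfamiliar with the landmark-distance scores cited in the abstract, the sketch below shows one common way such metrics are computed: the mean Euclidean distance between predicted and ground-truth facial landmark coordinates, restricted to the lip points for L-LD and taken over all facial points for F-LD. This is a minimal illustration, not the authors' evaluation code; the 68-point landmark convention (with indices 48-67 covering the lips), the array shapes, and the function names are assumptions for the sake of the example.

```python
# Hypothetical sketch of landmark-distance metrics (L-LD / F-LD).
# Assumes a 68-point facial landmark convention in which indices 48-67
# cover the lip region; NOT the authors' released evaluation code.
import numpy as np

LIP_INDICES = np.arange(48, 68)  # assumed lip landmarks (68-point convention)


def landmark_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between two landmark sequences.

    pred, gt: arrays of shape (num_frames, num_landmarks, 2) holding
    (x, y) coordinates for each landmark in each video frame.
    """
    assert pred.shape == gt.shape
    per_point = np.linalg.norm(pred - gt, axis=-1)  # (frames, landmarks)
    return float(per_point.mean())


def lip_and_face_ld(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Return (L-LD, F-LD): lip-only and full-face landmark distances."""
    l_ld = landmark_distance(pred[:, LIP_INDICES], gt[:, LIP_INDICES])
    f_ld = landmark_distance(pred, gt)
    return l_ld, f_ld
```

Under this reading, a relative improvement such as the reported 5.93% reduction in L-LD against a baseline method would correspond to (baseline_l_ld - model_l_ld) / baseline_l_ld.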

Keywords
Deep Learning, Computer Vision, Generative Adversarial Networks, Facial Animation Generation Technology, Multimodal
Received
2025-10-21
Accepted
2025-07-07
Published
2025-07-17
Publisher
EAI
http://dx.doi.org/10.4108/eetsis.7624

Copyright © 2025 Zhenyu Qiu et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA 4.0 license, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium, so long as the original work is properly cited.

Indexed in: EBSCO, ProQuest, DBLP, DOAJ, Portico