Research Article

Evaluating Open-Source Vision Language Models for Facial Emotion Recognition Against Traditional Deep Learning Models

Cite (BibTeX)
  @ARTICLE{10.4108/airo.8870,
      author={Vamsi Krishna Mulukutla and Sai Supriya Pavarala and Srinivasa Raju Rudraraju and Sridevi Bonthu},
      title={Evaluating Open-Source Vision Language Models for Facial Emotion Recognition Against Traditional Deep Learning Models},
      journal={EAI Endorsed Transactions on AI and Robotics},
      volume={4},
      number={1},
      publisher={EAI},
      journal_a={AIRO},
      year={2025},
      month={8},
      keywords={Facial Emotion Detection, VLMs, Facial Expression Classification, Phi-3.5, CLIP},
      doi={10.4108/airo.8870}
  }
Vamsi Krishna Mulukutla¹, Sai Supriya Pavarala¹, Srinivasa Raju Rudraraju¹, Sridevi Bonthu²,*
  • 1: Vishnu Institute of Technology, Bhimavaram
  • 2: Vishnu Institute of Technology
*Contact email: sridevi.b@vishnu.edu.in

Abstract

Facial Emotion Recognition (FER) is crucial for applications such as human-computer interaction and mental health diagnostics. This study presents the first empirical comparison of open-source Vision-Language Models (VLMs), including Phi-3.5 Vision and CLIP, against traditional deep learning models—VGG19, ResNet-50, and EfficientNet-B0—on the challenging FER-2013 dataset, which contains 35,887 low-resolution, grayscale images across seven emotion classes. To address the mismatch between VLM training assumptions and the noisy nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based image restoration with FER evaluation. Results show that traditional models, particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting the limitations of VLMs in low-quality visual tasks. In addition to performance evaluation using precision, recall, F1-score, and accuracy, we provide a detailed computational cost analysis covering preprocessing, training, inference, and evaluation phases, offering practical insights for deployment. This work underscores the need for adapting VLMs to noisy environments and provides a reproducible benchmark for future research in emotion recognition.
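The restore-then-classify pipeline summarized in the abstract can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the authors' released code: it assumes the gfpgan package's GFPGANer interface with a locally downloaded GFPGANv1.4.pth checkpoint, the openai/clip-vit-base-patch32 weights via Hugging Face transformers, and prompt templates of our own choosing; the paper's exact checkpoints, prompts, and preprocessing may differ.

  # Hypothetical sketch: GFPGAN restoration followed by zero-shot CLIP emotion
  # classification. Checkpoints and prompt wording are illustrative assumptions,
  # not the authors' exact setup.
  import cv2
  import torch
  from gfpgan import GFPGANer
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  # The seven FER-2013 emotion classes.
  EMOTIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]

  # GFPGAN face restorer (assumes a local GFPGANv1.4.pth checkpoint).
  restorer = GFPGANer(model_path="GFPGANv1.4.pth", upscale=2,
                      arch="clean", channel_multiplier=2)

  # Zero-shot CLIP classifier with one text prompt per emotion class.
  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
  prompts = [f"a face showing the emotion: {e}" for e in EMOTIONS]

  def classify(path: str) -> str:
      """Restore a low-resolution grayscale face, then zero-shot classify it."""
      bgr = cv2.imread(path)  # FER-2013 images are 48x48; OpenCV loads them as BGR
      _, _, restored = restorer.enhance(bgr, has_aligned=False, paste_back=True)
      image = Image.fromarray(cv2.cvtColor(restored, cv2.COLOR_BGR2RGB))
      inputs = processor(text=prompts, images=image,
                         return_tensors="pt", padding=True)
      with torch.no_grad():
          logits = model(**inputs).logits_per_image  # shape (1, 7): image-text scores
      return EMOTIONS[logits.softmax(dim=-1).argmax().item()]

  print(classify("fer2013_face.png"))  # e.g. "happy"

Zero-shot CLIP scores each image against one text prompt per emotion and picks the highest-scoring class, which is why prompt wording can noticeably shift accuracy on low-resolution FER-2013 faces.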

Keywords
Facial Emotion Detection, VLMs, Facial Expression Classification, Phi-3.5, CLIP
Received: 2025-03-09
Accepted: 2025-08-08
Published: 2025-08-11
Publisher: EAI
DOI: http://dx.doi.org/10.4108/airo.8870

Copyright © 2025 Vamsi Krishna Mulukutla et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA 4.0 license, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium, so long as the original work is properly cited.

Indexed in: EBSCO, ProQuest, DBLP, DOAJ, Portico