Automatic Speech Grading using a Multimodal Deep Learning Framework using Bert and Whisper

M. Hemantheswar Reddy; K. Rishitha; P. Bharath Raj; D N Kiran Pandiri; U. Thulasi Srinivas

Proceedings of the 4th International Conference on Information Technology, Civil Innovation, Science, and Management, ICITSM 2025, 28-29 April 2025, Tiruchengode, Tamil Nadu, India, Part I

Research Article

Cite (BibTeX):

@INPROCEEDINGS{10.4108/eai.28-4-2025.2357788,
    author={M. Hemantheswar Reddy and K. Rishitha and P. Bharath Raj and D N Kiran Pandiri and U. Thulasi Srinivas},
    title={Automatic Speech Grading using a Multimodal Deep Learning Framework using Bert and Whisper},
    proceedings={Proceedings of the 4th International Conference on Information Technology, Civil Innovation, Science, and Management, ICITSM 2025, 28-29 April 2025, Tiruchengode, Tamil Nadu, India, Part I},
    publisher={EAI},
    proceedings_a={ICITSM PART I},
    year={2025},
    month={10},
    keywords={speech grading, automatic speech recognition, whisper, nlp, pronunciation scoring, fluency measurement},
    doi={10.4108/eai.28-4-2025.2357788}
}

M. Hemantheswar Reddy¹,*, K. Rishitha¹, P. Bharath Raj¹, D N Kiran Pandiri¹, U. Thulasi Srinivas¹

1: VFSTR Deemed to be University

*Contact email: hemanth14082004@gmail.com

Abstract

This paper proposes a Natural Language Processing (NLP)-based speech grading system that covers both the audio and video portions and quantitatively evaluates speech in terms of grammar, vocabulary, pronunciation, fluency, and accuracy. Conventional speech evaluation methods tend to be subjective and inefficient and offer little feedback, which limits their use in overall assessment. The proposed system combines Automatic Speech Recognition (ASR) models such as Whisper, which transcribe speech to text, with NLP techniques that analyze and score the transcript in a standardized way. By providing detailed, actionable feedback, the system has the potential to improve the reliability and consistency of speech assessment. The technique has broad uses in education, recruitment, and communication training, and provides a scalable and objective approach to speech measurement.
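
The transcribe-then-score flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Transcriber` type, the `SpeechScores` fields, and the two metrics (words per minute as a rough fluency proxy, type-token ratio as a rough vocabulary proxy) are assumptions chosen only for illustration. The ASR step is passed in as a function so the scoring logic can be run without any model.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interface: maps an audio path to (transcript, duration in seconds).
# In a real system this would wrap an ASR model such as Whisper.
Transcriber = Callable[[str], Tuple[str, float]]


@dataclass
class SpeechScores:
    words_per_minute: float  # illustrative fluency proxy (assumption)
    type_token_ratio: float  # illustrative vocabulary proxy (assumption)


def score_transcript(text: str, duration_s: float) -> SpeechScores:
    """Compute simple, standardized metrics from a transcript."""
    # Lowercase each token and strip surrounding punctuation.
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words or duration_s <= 0:
        return SpeechScores(0.0, 0.0)
    wpm = len(words) / (duration_s / 60.0)
    ttr = len(set(words)) / len(words)
    return SpeechScores(wpm, ttr)


def grade(audio_path: str, transcribe: Transcriber) -> SpeechScores:
    """Run the two stages in order: ASR first, then text-based scoring."""
    text, duration_s = transcribe(audio_path)
    return score_transcript(text, duration_s)


if __name__ == "__main__":
    # Stand-in transcriber so the sketch runs without an ASR model.
    fake = lambda path: ("The cat sat. The cat ran!", 6.0)
    print(grade("sample.wav", fake))
```

In practice the `transcribe` argument could wrap a Whisper model, for example through the `openai-whisper` package, whose `load_model(...).transcribe(...)` call returns a result containing a `text` field. Grammar and pronunciation scoring would need additional models that this sketch does not attempt.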

Keywords: speech grading, automatic speech recognition, whisper, nlp, pronunciation scoring, fluency measurement

Published: 2025-10-13
Publisher: EAI

DOI: http://dx.doi.org/10.4108/eai.28-4-2025.2357788
