
Research Article
Automatic Speech Grading using a Multimodal Deep Learning Framework using Bert and Whisper
@INPROCEEDINGS{10.4108/eai.28-4-2025.2357788,
  author={M. Hemantheswar Reddy and K. Rishitha and P. Bharath Raj and D N Kiran Pandiri and U. Thulasi Srinivas},
  title={Automatic Speech Grading using a Multimodal Deep Learning Framework using Bert and Whisper},
  proceedings={Proceedings of the 4th International Conference on Information Technology, Civil Innovation, Science, and Management, ICITSM 2025, 28-29 April 2025, Tiruchengode, Tamil Nadu, India, Part I},
  publisher={EAI},
  proceedings_a={ICITSM PART I},
  year={2025},
  month={10},
  keywords={speech grading, automatic speech recognition, whisper, nlp, pronunciation scoring, fluency measurement},
  doi={10.4108/eai.28-4-2025.2357788}
}
M. Hemantheswar Reddy
K. Rishitha
P. Bharath Raj
D N Kiran Pandiri
U. Thulasi Srinivas
Year: 2025
ICITSM PART I
EAI
DOI: 10.4108/eai.28-4-2025.2357788
Abstract
This paper proposes a Natural Language Processing (NLP)-based speech grading system that handles both the audio and the video portion of a recording and quantitatively evaluates speech in terms of grammar, vocabulary, pronunciation, fluency, and accuracy. Conventional speech evaluation methods tend to be subjective and inefficient and to offer little feedback, which limits their usefulness for overall assessment. The proposed system combines Automatic Speech Recognition (ASR) models such as Whisper, which transcribe speech to text, with NLP techniques that analyze and score the transcript in a standardized way. By providing detailed, actionable feedback, the system has the potential to improve the reliability and consistency of speech assessment. The technique has broad uses in education, recruitment, and communication training, and offers a scalable and objective approach to speech measurement.
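The scoring stage that follows transcription can be illustrated with a minimal sketch. It assumes the transcript has already been produced by an ASR model such as Whisper and is passed in as plain text together with the clip duration. The function name `score_transcript`, the filler-word list, and the three measures are hypothetical choices made for illustration; the abstract does not specify which metrics or models the authors use.

```python
import re

# Hypothetical filler-word list; the abstract does not define one.
FILLERS = {"um", "uh", "er", "hmm"}


def score_transcript(transcript: str, duration_seconds: float) -> dict:
    """Compute simple, illustrative fluency and vocabulary measures.

    These are stand-in metrics, not the scoring method of the paper.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")

    # Lowercase and keep alphabetic tokens (apostrophes allowed).
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return {"words_per_minute": 0.0, "type_token_ratio": 0.0, "filler_rate": 0.0}

    filler_count = sum(1 for w in words if w in FILLERS)
    return {
        # Speaking rate as a rough fluency proxy.
        "words_per_minute": len(words) / duration_seconds * 60.0,
        # Distinct words over total words as a rough vocabulary proxy.
        "type_token_ratio": len(set(words)) / len(words),
        # Share of tokens that are filler words.
        "filler_rate": filler_count / len(words),
    }
```

For example, `score_transcript("um I think the plan is good and the plan is clear", 6.0)` returns a dictionary with the speaking rate, vocabulary diversity, and filler rate of that utterance. A fuller system would presumably add grammar and pronunciation scoring on top of measures like these.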