
Research Article
A Comprehensive Approach to Indian Sign Language Recognition: Leveraging LSTM and MediaPipe Holistic for Dynamic and Static Hand Gesture Recognition
@ARTICLE{10.4108/airo.8693,
  author={Prachi Rawat and Papendra Kumar and Vivek Kumar Tamta and Anuj Kumar},
  title={A Comprehensive Approach to Indian Sign Language Recognition: Leveraging LSTM and MediaPipe Holistic for Dynamic and Static Hand Gesture Recognition},
  journal={EAI Endorsed Transactions on AI and Robotics},
  volume={4},
  number={1},
  publisher={EAI},
  journal_a={AIRO},
  year={2025},
  month={5},
  keywords={Sign Language Recognition, LSTM, Indian Sign Language, MediaPipe Holistic, Computer Vision, Deep Learning, Gesture Recognition},
  doi={10.4108/airo.8693}
}
Prachi Rawat
Papendra Kumar
Vivek Kumar Tamta
Anuj Kumar
Year: 2025
DOI: 10.4108/airo.8693
Abstract
Recognizing Indian Sign Language (ISL) gestures effectively is crucial for improving communication accessibility for the deaf community. This study introduces an approach that integrates a Sequential Long Short-Term Memory (LSTM) model with MediaPipe Holistic for accurate, real-time gesture recognition. The pipeline consists of three steps: extracting landmark features with MediaPipe Holistic, cleaning and labelling the data, and classifying gestures with the LSTM model. The system tracks landmarks on the face, hands, and body across video frames, capturing the temporal and spatial features needed to interpret gestures. First, the data is cleaned and labelled by removing blurry images and null entries. The processed data is then passed to a Sequential LSTM model comprising two LSTM layers and a dense output layer. Training is stabilized with early stopping, and the model is optimized with a categorical cross-entropy loss. Trained and tested on a customized ISL dataset of 11 distinct gestures, the model achieved an accuracy of 96.97%. The framework emphasizes robustness across diverse lighting conditions and real-world scenarios, supporting applications in sectors such as healthcare, education, and public service. By enhancing communication for ISL users, it addresses existing accessibility gaps in these domains.
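
The abstract describes the model concretely enough to sketch: MediaPipe Holistic supplies per-frame landmark vectors, and a Sequential network of two LSTM layers plus a dense softmax output classifies the resulting sequences, trained with a categorical cross-entropy loss and early stopping. The Python sketch below illustrates one plausible realization; the 30-frame sequence length, the 64/128 layer widths, the patience value, and the keypoint-flattening scheme are illustrative assumptions, not the authors' reported configuration.

import numpy as np
import cv2
import mediapipe as mp
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    # Flatten MediaPipe Holistic landmarks (pose, face, both hands) into one
    # feature vector per frame; parts that were not detected become zeros.
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility]
                      for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[lm.x, lm.y, lm.z]
                      for lm in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    lh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z]
                    for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, lh, rh])  # 1662 features per frame

# Per-frame usage (frame is a BGR image from cv2.VideoCapture):
# with mp_holistic.Holistic(min_detection_confidence=0.5,
#                           min_tracking_confidence=0.5) as holistic:
#     results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
#     keypoints = extract_keypoints(results)

SEQ_LEN, N_FEATURES, N_CLASSES = 30, 1662, 11  # 11 ISL gestures

# Sequential LSTM: two LSTM layers and a dense output layer, as in the abstract.
model = Sequential([
    Input(shape=(SEQ_LEN, N_FEATURES)),
    LSTM(64, return_sequences=True),         # first LSTM layer
    LSTM(128),                               # second LSTM layer
    Dense(N_CLASSES, activation='softmax'),  # dense output over 11 gestures
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # loss named in the abstract
              metrics=['accuracy'])

# Early stopping, also mentioned in the abstract, halts training once the
# validation loss stops improving and restores the best weights seen.
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, callbacks=[early_stop])
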
Copyright © 2025 P. Rawat et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA 4.0 license, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium so long as the original work is properly cited.