
Research Article
Conformer network-guided speech recognition for smart home Internet of Things system
@ARTICLE{10.4108/eetiot.9818,
  author={Shengjun Huang and Dong-hyun Kim},
  title={Conformer network-guided speech recognition for smart home Internet of Things system},
  journal={EAI Endorsed Transactions on Internet of Things},
  volume={11},
  number={1},
  publisher={EAI},
  journal_a={IOT},
  year={2025},
  month={11},
  keywords={Internet of Things, conformer network, asymmetric convolution, speech recognition, gated feed-forward neural network},
  doi={10.4108/eetiot.9818}
}
Shengjun Huang
Dong-hyun Kim
Year: 2025
Journal: EAI Endorsed Transactions on Internet of Things (IOT)
Publisher: EAI
DOI: 10.4108/eetiot.9818
Abstract
As the world gradually becomes an aging society, and because traditional Internet of Things (IoT) systems are complex to operate and lack a human-centered design, this paper proposes a Conformer network-guided speech recognition system for the smart home IoT. First, a voice recognition module with an embedded processor is introduced. It supports conventional voice recognition and also transmits voice to the cloud, overcoming the limited computing and storage capacity of the main control chip. Then, using IoT technology, the computationally complex algorithms are offloaded to the cloud, which significantly improves voice recognition performance. By leveraging the distributed storage of the cloud, a user-specific voice database can be built and organized by category, providing a large data basis for learning each user's speech. To address the shortcomings of the existing Conformer speech recognition model (insufficient extraction of time-frequency features, a redundant model structure, and a large number of parameters), this paper proposes a speech recognition model based on asymmetric convolution and a gated feed-forward neural network. Asymmetric convolutions of different sizes perform multi-scale fusion and down-sampling on the time-frequency features of the speech sequence, which strengthens the model's extraction of time-frequency features and reduces the information lost during down-sampling. In addition, a gated feed-forward module replaces the two half-step feed-forward networks of Conformer, reducing the number of network parameters and simplifying the model structure. Finally, using the large volume of collected data, the system gradually builds a personalized speech recognition library for each user through learning.
Experiments verify the effectiveness of the proposed intelligent fusion-based Internet of Things system in terms of speech recognition accuracy, computing power, and the intelligence level of voice interaction.
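The parameter saving claimed for replacing Conformer's two half-step feed-forward networks with a single gated module can be checked with simple arithmetic. The sketch below is a hedged illustration, not the authors' implementation. It assumes a SwiGLU-style gate (a Swish-activated gate branch multiplied element-wise with a linear value branch), an expansion factor of 4 for the original feed-forward networks, and a gated hidden width of 4 * d_model. It also ignores LayerNorm parameters. The abstract specifies none of these choices, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Number of weights (plus optional biases) in a fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def macaron_ffn_params(d_model: int, expansion: int = 4) -> int:
    """Two half-step FFNs of a standard Conformer block, each
    Linear(d, e*d) -> Swish -> Linear(e*d, d). LayerNorm is ignored."""
    hidden = expansion * d_model
    one_ffn = linear_params(d_model, hidden) + linear_params(hidden, d_model)
    return 2 * one_ffn


def gated_ffn_params(d_model: int, hidden: int) -> int:
    """One gated FFN (hypothetical layout): value and gate projections
    (d -> h) plus an output projection (h -> d). LayerNorm is ignored."""
    return 2 * linear_params(d_model, hidden) + linear_params(hidden, d_model)


def sigmoid(x: float) -> float:
    # Numerically stable logistic function (avoids overflow in exp).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    return x * sigmoid(x)


def matvec(w: Matrix, b: Vector, x: Vector) -> Vector:
    # y = W x + b for one frame; W has one row per output unit.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(w, b)]


def gated_ffn(x: Vector, w_v: Matrix, b_v: Vector, w_g: Matrix, b_g: Vector,
              w_o: Matrix, b_o: Vector) -> Vector:
    """Gated FFN for a single frame: W_o (value * swish(gate)) + b_o.
    Assumption: SwiGLU-style gating; the paper's exact gate is not stated.
    The residual connection (x + output) is left to the caller."""
    value = matvec(w_v, b_v, x)
    gate = [swish(g) for g in matvec(w_g, b_g, x)]
    hidden = [v * g for v, g in zip(value, gate)]
    return matvec(w_o, b_o, hidden)


if __name__ == "__main__":
    d = 256
    print(macaron_ffn_params(d))              # 1051136
    print(gated_ffn_params(d, hidden=4 * d))  # 788736
```

With d_model = 256 this gives 1,051,136 parameters for the two half-step feed-forward networks versus 788,736 for one gated module of the same hidden width, about 75%. The real saving depends on the hidden width and gate design the paper actually uses.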
Copyright © 2025 Shengjun Huang et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium so long as the original work is properly cited.


