Speech emotion recognition method in educational scene based on machine learning

To improve the accuracy and anti-noise performance of speech emotion recognition in educational scenes, a new method based on machine learning is studied. Speech emotional features of educational scenes are extracted separately based on the fundamental frequency and resonance degree. Using kernel canonical correlation analysis (KCCA), a machine learning algorithm, the emotional feature samples are nonlinearly mapped to a high-dimensional feature space, the nonlinear correlation between the two groups of features is obtained, and the two kinds of speech emotional features are fused to construct the feature samples. A support vector machine (SVM) is used to build the speech emotion recognition classifier, and a genetic algorithm is used to determine its optimal parameters. The experimental results show that the emotion recognition rate of this method is above 90%, and the recognition rates for anger, fear, happiness and sadness are above 95%. After various kinds of noise are added, the speech emotion recognition results remain completely consistent with the actual speech emotion, which shows that the method has high anti-noise performance.

Yanning Zhang and Gautam Srivastava

and performed K-means clustering on the features of all frames of each audio signal to complete speech emotion recognition. Lee et al. established a speech emotion recognition model based on CNN transfer learning and an attention mechanism [8], carrying out speech emotion recognition with the CNN model as the core.
Machine learning (ML) is an interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines [9]. Machine learning algorithms can be divided into three categories: supervised learning, unsupervised learning and reinforcement learning [10]. Supervised learning can be used when a specific data set (the training set) carries specific attributes (labels) while other data are unlabeled or need their labels predicted; it mainly includes the decision tree, support vector machine, classification and logistic regression algorithms [11]. Unsupervised learning can be used on a given unlabeled data set (the data are not pre-assigned) to find the potential relationships between the data; it mainly includes the clustering, principal component analysis, independent component analysis, association analysis and singular value decomposition algorithms [12]. Reinforcement learning lies between the two: each prediction receives some form of feedback, but no exact label or error information. In this paper, machine learning algorithms are applied to speech emotion recognition in educational scenes, and a machine-learning-based recognition method is studied in order to obtain accurate recognition results.

Speech emotion feature extraction in educational scenes
Feature extraction is an important step in modelling speech emotion recognition in educational scenes. The educational speech signal is short-term stationary, so it can be processed frame by frame to extract the required feature parameters. Windowing and framing the speech signal makes effective use of this short-term stationarity for feature extraction and analysis. Windowing multiplies the original educational scene speech signal by a specific window function to obtain the windowed speech signal.
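The framing and windowing step can be sketched as follows. This is only an illustration using the Python standard library: the Hamming window, the 25 ms frame length, the 10 ms hop and the function names `hamming` and `frame_signal` are assumptions of this sketch, since the paper does not state which window function or frame size was used.

```python
import math

def hamming(n_samples):
    """Hamming window of length n_samples (assumed window type)."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n_samples - 1))
            for i in range(n_samples)]

def frame_signal(signal, frame_len, hop_len):
    """Split a signal into overlapping frames and window each frame."""
    window = hamming(frame_len)
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames

# Example: 1 s of a 200 Hz tone at 16 kHz, 25 ms frames with a 10 ms hop.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(tone, frame_len=400, hop_len=160)
```

Each entry of `frames` is one windowed frame on which per-frame features such as pitch or energy could then be computed.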
Many speech features are related to emotion, mainly including the pitch frequency and the formants. These are important speech features with wide applications in speech enhancement, speech coding, speech synthesis, speech recognition, speaker recognition, emotion recognition, speech hiding and sound source localization, especially for Chinese.

Characteristics of fundamental frequency
The pitch period is one of the important parameters describing the excitation source in speech signal processing for educational scenes. When a person produces a voiced sound, the air flow through the glottis drives the vocal cords into relaxation oscillation, producing a quasi-periodic pulsed air flow. This air flow excites the vocal tract to produce voiced speech, which carries most of the energy in speech. The frequency of the vocal cord vibration is called the fundamental frequency [13], and the corresponding period is called the pitch period. At present, pitch detection algorithms mainly include the autocorrelation function method, the average magnitude difference function method, the cepstrum method, and improved algorithms based on them.
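As an illustration of the autocorrelation function method mentioned above, the following sketch estimates the fundamental frequency of a single voiced frame. It is a simplified example, not the detector used in the paper: the search range of 60 to 500 Hz, the function name `estimate_f0` and the absence of any voiced/unvoiced decision are assumptions.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=500.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    The lag with the largest autocorrelation inside the allowed pitch range
    is taken as the pitch period; 0.0 is returned if no positive peak exists.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else 0.0

# Example: a 40 ms frame of a 200 Hz tone sampled at 16 kHz.
sample_rate = 16000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(640)]
f0 = estimate_f0(frame, sample_rate)
```

Statistics of the per-frame estimates over an utterance, such as their maximum, can then serve as fundamental frequency features.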
The fundamental frequency is closely related to the size and tightness of the vocal cords. Under different emotional states, the vocal cords change accordingly. For example, in anger the vocal cords stretch and tighten, so the fundamental frequency changes as well [14]. Eight features related to the fundamental frequency are selected for speech emotion recognition in educational scenes, including the maximum value of the fundamental frequency $F0_{\max}$. Their calculation formulas are as follows:

Amplitude characteristics
A formant is a region of the sound spectrum where energy is relatively concentrated. Formants not only determine sound quality but also reflect the physical characteristics of the vocal tract (the resonant cavity) [15]. In its original sense, a formant is a resonant frequency of the vocal cavity. Like pitch extraction, formant estimation is plagued by many problems, including false peaks, formant merging and high-pitched speech; its main methods are the cepstrum method and the LPC method.
Amplitude describes the intensity of speech emotional information in educational scenes and is mainly reflected in the rhythm of speech. In a state of anger or surprise the volume increases, while in a state of sadness the volume is low, so amplitude is also an indispensable speech emotional feature of educational scenes. The amplitude is measured by the short-time energy of each frame of the speech signal $s(n)$, calculated as follows:
$$E = \sum_{n=1}^{N} s^2(n)$$
where $s(n)$ is the $n$-th sample of the frame and $N$ is the frame length.
Based on the fundamental frequency and amplitude features extracted from the educational scene speech, the kernel canonical correlation analysis (KCCA) algorithm is used for feature fusion. KCCA nonlinearly maps the samples to a high-dimensional feature space [16] and then performs correlation analysis to obtain the nonlinear correlation between the two groups of variables. Let the kernel functions be $K_X$ and $K_Y$, respectively; the kernel matrices are described as follows:
Kernel matrix centralization: the kernel matrices are centered so that the mapped training samples have zero mean.
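The kernel matrix construction and centering described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the Gaussian kernel form, the bandwidth `sigma` and the function names are assumptions of this sketch rather than details given in the paper.

```python
import math

def gaussian_kernel_matrix(samples, sigma):
    """Kernel matrix with K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    n = len(samples)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            K[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return K

def center_kernel_matrix(K):
    """Center K so that the mapped samples have zero mean in feature space.

    Equivalent to K - 1K - K1 + 1K1, where 1 is the N x N matrix whose
    entries are all 1/N.
    """
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[K[i][j] - row_means[i] - col_means[j] + total_mean
             for j in range(n)] for i in range(n)]

# Example with three two-dimensional feature vectors.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gaussian_kernel_matrix(samples, sigma=1.0)
K_centered = center_kernel_matrix(K)
```

After centering, every row and every column of the kernel matrix sums to zero, which is the kernel-space counterpart of subtracting the sample mean.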
The goal of KCCA is to find the two projection directions, one for each feature set, that maximize the criterion function (12). Each projection vector lies in the space spanned by the mapped training samples, so according to the reproducing kernel theory it can be expressed through an N-dimensional coefficient vector, which is substituted into equation (13). To prevent meaningless canonical correlation vectors, a regularization term is introduced to constrain equation (14). The Lagrange multiplier method is used to solve this constrained extremum problem [17], with $\lambda_1$ and $\lambda_2$ as the Lagrange multipliers of the corresponding Lagrange function $L$.
The partial derivatives of $L$ with respect to the two coefficient vectors are set to zero. KCCA is thus equivalent to solving a generalized eigenvalue problem for the corresponding eigenvectors. Solving for the two coefficient vectors yields the nonlinear correlation features between $x$ and $y$, where $u$ and $v$ are the transformed feature components.
These components are linearly transformed and combined, and the projected combined features are used for the subsequent modelling and classification of speech emotion recognition in educational scenes.
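For reference, the standard textbook form of regularized KCCA can be written as below. It is offered only as a sketch of the derivation outlined above: the symbols $\alpha$, $\beta$, $\kappa$, $\lambda$ and $\mathbf{1}_N$, as well as the particular form of the regularization term, are assumptions and may differ from the notation of equations (12) to (14).

```latex
% Kernel matrices and centering; \mathbf{1}_N is the N x N matrix with entries 1/N
\[ (K_X)_{ij} = k_X(x_i, x_j), \qquad (K_Y)_{ij} = k_Y(y_i, y_j) \]
\[ \tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N \]

% Projection directions expressed through N-dimensional coefficient vectors
\[ w_x = \sum_{i=1}^{N} \alpha_i \, \phi(x_i), \qquad w_y = \sum_{i=1}^{N} \beta_i \, \psi(y_i) \]

% Criterion function in terms of the (centered) kernel matrices
\[ \rho = \max_{\alpha, \beta}
   \frac{\alpha^{T} K_X K_Y \beta}
        {\sqrt{\alpha^{T} K_X^{2} \alpha \;\cdot\; \beta^{T} K_Y^{2} \beta}} \]

% Regularized constraints (one common choice, with \kappa > 0)
\[ \alpha^{T} (K_X^{2} + \kappa K_X) \alpha = 1, \qquad
   \beta^{T} (K_Y^{2} + \kappa K_Y) \beta = 1 \]

% Lagrange function
\[ L(\alpha, \beta) = \alpha^{T} K_X K_Y \beta
   - \frac{\lambda_1}{2} \left( \alpha^{T} (K_X^{2} + \kappa K_X) \alpha - 1 \right)
   - \frac{\lambda_2}{2} \left( \beta^{T} (K_Y^{2} + \kappa K_Y) \beta - 1 \right) \]

% Setting the partial derivatives to zero gives \lambda_1 = \lambda_2 = \lambda and
\[ \begin{pmatrix} 0 & K_X K_Y \\ K_Y K_X & 0 \end{pmatrix}
   \begin{pmatrix} \alpha \\ \beta \end{pmatrix}
   = \lambda
   \begin{pmatrix} K_X^{2} + \kappa K_X & 0 \\ 0 & K_Y^{2} + \kappa K_Y \end{pmatrix}
   \begin{pmatrix} \alpha \\ \beta \end{pmatrix} \]

% Transformed feature components of the training samples
\[ u = K_X \alpha, \qquad v = K_Y \beta \]
```

Multiplying the two stationarity conditions by $\alpha^{T}$ and $\beta^{T}$ respectively shows that both multipliers equal $\alpha^{T} K_X K_Y \beta$, which is why a single eigenvalue $\lambda$ appears in the generalized eigenvalue problem.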

Recognition process
The implementation process of the speech emotion recognition method in educational scenes is as follows:
(1) The speech signal of the educational scene is collected and preprocessed;
(2) The fundamental frequency features and amplitude features are extracted separately. Because the features have different value ranges, they must be normalized to eliminate the influence of the different ranges on emotion modelling. The normalization formula is:
(3) The fundamental frequency features and amplitude features describe the relationship between speech and emotion in educational scenes from different angles; each has its own focus and they complement each other, but they are also correlated, that is, they carry redundant information. In addition, recognition performance is not proportional to the feature dimension, so KCCA is used to fuse the features, find the most important information in them, and transform the original feature vector into a low-dimensional vector;
(4) The emotion samples are represented by the low-dimensional vectors to reduce the data scale. The support vector machine learns from the training samples to build the emotion recognition classifier, and the test samples are then recognized to verify the effectiveness of the classifier.
To sum up, the speech emotion recognition process for educational scenes based on machine learning (kernel canonical correlation analysis and support vector machine) is shown in Figure 1.

Speech emotion recognition classifier based on SVM algorithm
Support vector machine (SVM) is a machine learning algorithm based on statistical learning theory [18]. It was first proposed by Vapnik et al. on the basis of the linear classifier. SVM can be used to solve both linear and nonlinear sample classification problems.
Its core idea is to map sample points that are linearly inseparable in the low-dimensional space to a high-dimensional feature space through a kernel function, and then construct the optimal classification hyperplane in that feature space. The data thus become linearly separable by a hyperplane in the high-dimensional space, and the margin between the samples and the hyperplane is kept as large as possible.
The basic principle of the SVM classification algorithm is as follows: The nonlinear sample set $(x_1, y_1), \ldots$
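For reference, the standard soft-margin formulation of a kernel SVM is sketched below. It is the usual textbook form, not necessarily the exact formulation of the paper: the symbols $w$, $b$, $\xi_i$, $\alpha_i$, $\phi$ and $K$ are assumptions of this sketch, and the handling of more than two emotion classes is not covered.

```latex
% Primal problem: maximize the margin while penalizing violations with C
\[ \min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
   \quad \text{s.t.} \quad
   y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0 \]

% Dual problem in terms of the kernel K(x_i, x_j) = \phi(x_i)^T \phi(x_j)
\[ \max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
   - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
     \alpha_i \alpha_j y_i y_j K(x_i, x_j)
   \quad \text{s.t.} \quad
   0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \]

% Decision function for a new sample x
\[ f(x) = \operatorname{sign} \left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right) \]
```

The penalty parameter $C$ and any parameter inside the kernel $K$ are the quantities that a parameter search has to tune.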

Genetic algorithm to determine the optimal parameters
Genetic algorithm is an efficient global optimization search algorithm that combines the survival of the fittest in biological evolution with the random exchange of chromosome information within a population [19]. To optimize parameters with a genetic algorithm, the parameters to be optimized are encoded to form chromosomes, and the initial population is generated randomly. During evolution, a selection strategy based on the fitness function simulates the law of "survival of the fittest" to select individuals, and crossover and mutation produce the next generation. The population is optimized continuously until the termination condition is met. The last generation of chromosomes is regarded as the global optimal solution, and the optimal parameters are obtained by decoding.
EAI Endorsed Transactions on Scalable Information Systems September 2022 -October 2022 | Volume 9 | Issue 5 | e9
In this paper, the genetic algorithm is used to optimize the SVM parameters for each training set, and the SVM model is then trained and used for recognition with the optimal parameters (Figure 2). The specific steps are as follows:
Step 1: initialize the parameters and binary-code the penalty parameter C and the kernel parameter of the SVM classification model. Each variable is represented by 20 binary bits, and the initial population is then generated randomly.
Step 2: decode the two parameters, substitute them into the SVM algorithm, and take the recognition rate of the trained classifier as the fitness value. The higher the fitness, the greater the probability of being inherited by the next generation; the lower the fitness, the smaller that probability.
Step 3: selection operation. Simulate the "survival of the fittest" in each generation through individual fitness [20], select excellent individuals from the population as parents, and generate a new population.
Step 4: crossover operation. Take the individuals chosen by the selection operation and generate new individuals according to the crossover probability.
Step 5: mutation operation. In each individual string of the population, change the gene at a locus according to the mutation probability to generate a new individual.
Step 6: decode and calculate the fitness value, compare the classification recognition rate of the offspring with that of the parents, and update the optimal individual.
Step 7: judge whether the number of iterations or the fitness value has reached the set termination value. If not, repeat steps 3 to 6; if so, proceed to step 8.
Step 8: when the termination condition is reached, output the optimal values of the two parameters.
The implicit parallelism and strong global search ability of the genetic algorithm allow it to find the global optimum in a short time; the optimization process is completed automatically without manual intervention, which avoids errors caused by manual operation and improves the efficiency of optimization.
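The eight steps above can be sketched in code as follows. This is an illustrative outline under several assumptions, not the implementation used in the paper: roulette-wheel selection and single-point crossover are assumed operators, the search bounds are hypothetical, and `toy_fitness` is a stand-in for the SVM recognition rate so that the example runs with the standard library alone. The default population size, iteration count, crossover rate and mutation rate follow the experimental settings reported in the paper.

```python
import random

BITS = 20  # bits per parameter, as in step 1

def decode(chromosome, bounds):
    """Map each 20-bit gene to a real value inside its (low, high) bounds."""
    values = []
    for k, (low, high) in enumerate(bounds):
        gene = chromosome[k * BITS:(k + 1) * BITS]
        integer = int("".join(map(str, gene)), 2)
        values.append(low + (high - low) * integer / (2 ** BITS - 1))
    return values

def roulette_select(population, fitnesses, rng):
    """Fitness-proportional selection (fitness values must be non-negative)."""
    pick = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]

def genetic_search(fitness, bounds, pop_size=20, generations=100,
                   crossover_rate=0.7, mutation_rate=0.035, seed=0):
    """Return the best decoded parameters found and their fitness."""
    rng = random.Random(seed)
    length = BITS * len(bounds)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Steps 2 and 6: decode, evaluate, and keep the best individual so far.
        fitnesses = [fitness(*decode(ind, bounds)) for ind in population]
        for ind, fit in zip(population, fitnesses):
            if fit > best_fit:
                best, best_fit = ind[:], fit
        next_population = []
        while len(next_population) < pop_size:
            # Step 3: selection.
            p1 = roulette_select(population, fitnesses, rng)[:]
            p2 = roulette_select(population, fitnesses, rng)[:]
            # Step 4: single-point crossover.
            if rng.random() < crossover_rate:
                point = rng.randint(1, length - 1)
                p1[point:], p2[point:] = p2[point:], p1[point:]
            # Step 5: bit-flip mutation.
            for child in (p1, p2):
                for i in range(length):
                    if rng.random() < mutation_rate:
                        child[i] ^= 1
                next_population.append(child)
        population = next_population[:pop_size]
    return decode(best, bounds), best_fit

def toy_fitness(c, kernel_param):
    """Stand-in for the recognition rate; peaks at c = 10, kernel_param = 0.5."""
    return 1.0 / (1.0 + (c - 10.0) ** 2 + (kernel_param - 0.5) ** 2)

bounds = [(0.1, 100.0), (0.01, 10.0)]  # hypothetical search ranges
best_params, best_fitness = genetic_search(toy_fitness, bounds)
```

In an actual experiment, `toy_fitness` would be replaced by a function that trains the SVM with the decoded parameters and returns its recognition rate.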

Data corpus
The speech databases used in this experiment are the Berlin emotional speech database and the Chinese emotional corpus of the Chinese Academy of Sciences. The Berlin corpus was recorded by the Technical University of Berlin with 10 non-professional actors, 5 men and 5 women, who expressed seven emotions (anger, boredom, disgust, fear, happiness, neutrality and sadness) using 10 recording scripts. A total of 800 emotional sentences were recorded and then listened to and identified by 20 volunteers. Some of the 800 sentences are short and difficult to recognize, and some are heavily colloquial, so the samples were screened and 535 sentences were finally retained. The Chinese emotional corpus was recorded and provided by the human-computer speech interaction research group of the State Key Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. Two male and two female professional speakers expressed six emotional states (anger, fear, happiness, sadness, surprise and neutrality) using 50 recording scripts, yielding 1200 emotional utterances. Both data sets are stored in WAV format with a 16 kHz sampling rate and 16-bit quantization.

Kernel function selection in feature fusion
The choice of kernel function for KCCA is very important to the recognition results. The main candidates are the polynomial kernel function, the fractional power polynomial kernel function and the Gaussian kernel function. The average recognition accuracy of the different kernel functions under different training samples is shown in Figure 3. Figure 3 shows that the Gaussian kernel function achieves the highest emotion recognition rate under all training sample conditions. Therefore, the proposed method uses the Gaussian kernel function as the KCCA kernel in the feature fusion process.
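The three candidate kernel functions can be written down as follows. The forms below are the commonly used definitions and the parameter values (`degree`, `offset`, `sigma`) are assumptions of this sketch; the paper does not state the exact forms or parameter settings that were compared.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, offset=1.0):
    """k(x, y) = (x . y + offset) ** degree, with an integer degree."""
    return (dot(x, y) + offset) ** degree

def fractional_power_polynomial_kernel(x, y, degree=0.5):
    """k(x, y) = sign(x . y) * |x . y| ** degree, with 0 < degree < 1."""
    d = dot(x, y)
    return math.copysign(abs(d) ** degree, d)

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Any of these functions can be used to fill a kernel matrix entry by entry, so comparing kernels only requires swapping the function passed to the matrix construction.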

Parameter optimization experiment of support vector machine
The SVM kernel function is selected through experiments. The polynomial kernel function (i.e. t = 1) performs better for the seven-emotion experiments on the Berlin speech data set, while the linear kernel function (i.e. t = 0) performs better for the five-emotion experiments on the Berlin speech data set and for the Chinese emotion data set. Cross-validation is used in the speech emotion recognition experiments. For each group of experiments, the penalty parameter C and the kernel parameter are optimized by the genetic algorithm, and the SVM model is then trained and used for recognition with the optimal parameters. The experimental parameters are set as follows: the crossover rate and mutation rate are 0.7 and 0.035, respectively; the parameters are binary coded; the population size is 20 and the number of iterations is 100. Figure 2 shows a set of experimental results obtained by using the genetic algorithm to optimize the SVM parameters for seven emotions on the Berlin data set.
To further illustrate the performance advantages of the proposed method in emotion recognition, the methods in reference [6] and reference [7] are taken as comparison methods and applied to the same test data set, and the emotion recognition results of the three methods are compared. The results are shown in Figure 5.
According to Figure 5, the average recognition rates of the proposed method, the method in reference [6] and the method in reference [7] are 93.6%, 87.7% and 83.7%, respectively. Moreover, the recognition rate of the proposed method for each emotion is generally higher than that of the two comparison methods and exceeds 90%. In particular, the recognition rates for anger, fear, happiness and sadness exceed 95%, showing that this method has a high speech emotion recognition rate.

Anti-noise performance
To test the anti-noise performance of the proposed method, the speech emotion results recognized under different Gaussian noise variances, salt-and-pepper noise densities and Gaussian filter operator standard deviations are shown in Table 1. According to Table 1, as the Gaussian noise variance, the salt-and-pepper noise density and the standard deviation of the Gaussian blur filter operator increase, the speech emotion recognition results obtained by the proposed method remain completely consistent with the actual speech emotion, which shows that the method has high noise resistance.
The above experimental results show that, compared with the recognition methods in reference [6] and reference [7], the proposed method achieves a higher recognition rate: its emotion recognition rate is above 90%, and its recognition rates for anger, fear, happiness and sadness are above 95%. In the anti-noise test under different Gaussian noise, salt-and-pepper noise and Gaussian filter operator standard deviations, the speech emotion recognition results are completely consistent with the actual speech emotion, indicating that the proposed method has high anti-noise performance.
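How test signals with Gaussian and salt-and-pepper noise might be generated can be sketched as follows. The paper does not describe its noise-injection procedure, so the function names, the impulse amplitudes `low` and `high`, and the treatment of each sample independently are all assumptions of this illustration.

```python
import random

def add_gaussian_noise(signal, variance, rng):
    """Add zero-mean Gaussian noise with the given variance to every sample."""
    std = variance ** 0.5
    return [s + rng.gauss(0.0, std) for s in signal]

def add_salt_and_pepper(signal, density, rng, low=-1.0, high=1.0):
    """Replace a fraction `density` of samples by extreme values.

    Each affected sample is set to `high` (salt) or `low` (pepper)
    with equal probability.
    """
    noisy = list(signal)
    for i in range(len(noisy)):
        if rng.random() < density:
            noisy[i] = high if rng.random() < 0.5 else low
    return noisy

# Example: corrupt a short signal with both kinds of noise.
rng = random.Random(0)
clean = [0.0, 0.5, -0.5, 0.25]
noisy_gaussian = add_gaussian_noise(clean, variance=0.01, rng=rng)
noisy_impulse = add_salt_and_pepper(clean, density=0.1, rng=rng)
```

Sweeping `variance` and `density` over increasing values and re-running recognition on the corrupted signals would give a test of the kind summarized in Table 1.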

Discussion
This paper studies a speech emotion recognition method for educational scenes based on machine learning, and the experimental results show that the method achieves good emotion recognition results. This is mainly because the kernel canonical correlation analysis (KCCA) algorithm is used for feature fusion. KCCA introduces the idea of the kernel function into canonical correlation analysis: low-dimensional data are mapped to a high-dimensional feature space (the kernel space), where correlation analysis is carried out conveniently through the kernel function. KCCA yields high-precision feature fusion results, on which the support vector machine algorithm performs speech emotion recognition. The support vector machine (SVM) was first proposed by Corinna Cortes and Vapnik in 1995. It is based on the VC dimension theory of statistical learning theory and the principle of structural risk minimization.
Given limited sample information, it seeks the best compromise between the complexity of the model (i.e., the learning accuracy on specific training samples) and the learning ability (i.e., the ability to identify any sample without error) in order to obtain the best generalization ability. The support vector machine shows many unique advantages in small-sample, nonlinear and high-dimensional pattern recognition, and can be extended to other machine learning problems such as function fitting. In machine learning, the support vector machine (SVM, also called support vector network) is a supervised learning model with associated learning algorithms that can analyse data, identify patterns, and be used for classification and regression analysis.

Conclusion
This paper studies a speech emotion recognition method for educational scenes based on machine learning. A machine learning algorithm is used to analyze the correlation between different emotional features and construct the feature samples, SVM is used to build the speech emotion recognition classifier, and a genetic algorithm is used to determine the optimal parameters. The kernel canonical correlation analysis algorithm and the support vector machine algorithm are thus applied to the field of speech emotion recognition in educational scenes. The experimental results show that the emotion recognition rate of the proposed method is above 90%; the speech emotion recognition results obtained in various noise environments are completely consistent with the actual speech emotion, which shows that the method has high anti-noise performance. It can provide a better scientific basis for speech emotion recognition in educational scenes.