Music Emotion Recognition Based on Long Short-Term Memory and Forward Neural Network

In this paper, we propose a new music emotion recognition method based on long short-term memory and a forward neural network. First, the Mel Frequency Cepstral Coefficient (MFCC) and Residual Phase (RP) features are weighted and combined to extract music emotion features, which improves the efficiency of emotion feature recognition. Meanwhile, to improve the classification accuracy of music emotion and shorten the training time of the new model, a Long Short-Term Memory (LSTM) network and a forward neural network (FNN) are combined. Using the LSTM as the feature mapping node of the FNN, a new deep learning network (LSTM-FNN) is proposed for music emotion recognition and classification training. Finally, we conduct experiments on the Emotion data set. The results show that the proposed algorithm achieves higher recognition accuracy than other state-of-the-art complex networks.


Introduction
Music has always been an indispensable part of human activity. It not only lets the composer express inner emotions, but also lets the listener feel the power of music, and so provides positive spiritual guidance [1,2]. In this era of pursuing intelligence, films, television works and multimedia videos emerge in an endless stream. Music emotion recognition can also be used to select a soundtrack in real time according to the emotion conveyed by voice and video content [3].
At present, research on music emotion recognition is mainly divided into two aspects: how to better extract the emotion features of music, and how to improve the classifier used for emotion recognition. Chen et al. [4] adopted the Deep Pitch Class Profile (DPCP) feature, based on deep learning, in the audio feature extraction stage to ensure the robustness and generalization ability of audio feature extraction and to improve the performance of nonlinear deep semantic features of music. Weninger et al. [5] fed low-level music features into a recurrent neural network for training to perform music emotion recognition. Markov et al. [6] used Gaussian Processes (GP) and Support Vector Machines (SVM) to study different features, including MFCC, Linear Prediction Coefficients (LPC), timbre features and their various combinations, and applied them to music style classification and valence-arousal (VA) emotion estimation. Their experiments show that the GP method classifies better than the SVM method. However, the algorithmic complexity of the GP method is higher than that of the SVM method, so it is very difficult to apply to large-scale tasks. Chen et al. [7] spliced features related to rhythm, intensity, timbre and pitch into 38-dimensional music features, and used the Deep Gaussian Process (DGP) method for music emotion recognition. They built a GP regressor for each emotion category and used regression to classify music emotions. Although this method achieved good emotion classification results, the set of music samples could not be expanded after model training was completed. Li et al. [8] proposed a method based on Deep Bidirectional Long Short-Term Memory (DBLSTM) to dynamically predict music emotion, which trained multiple DBLSTMs on time series of different scales. An Extreme Learning Machine (ELM) was then used to fuse the results of the multi-scale DBLSTMs to obtain the final result.
Wei Xiang et al. [9] used a Convolutional Neural Network (CNN) and its variants to automatically extract abstract features of emotion samples, eliminating the process of manual feature selection and dimension reduction. Sarkar et al. [10] proposed a convolutional neural network built around VGGNet, together with a novel post-processing technique, to improve the performance of deep-learning-based music emotion recognition.

EAI Endorsed Transactions on Scalable Information Systems
Orjesek et al. [11] proposed a deep learning model that used the spectrogram of the music signal as the input feature, and combined a convolutional neural network with a recurrent neural network to extract features from the spectrogram and classify emotions. Issa et al. [12] introduced a new architecture that extracted MFCC, chromagram, Mel-scale spectrogram, Tonnetz representation and spectral contrast features from sound files and fed them into a one-dimensional convolutional neural network. An incremental approach was then used to modify the initial model to improve classification accuracy. Unlike some previous approaches, all models worked directly with raw sound data without conversion to a visual representation. Nalini et al. [13] extracted music emotion features by combining MFCC and RP, and applied an Auto-associative Neural Network. The results showed that the fused feature was consistently recognized better than any single music emotion feature. However, training traditional deep learning models is time-consuming and inefficient, especially when the number of samples increases dynamically. Most music emotion recognition algorithms work in two stages. The first is feature extraction, which tries to extract the emotion feature information contained in the music signal as the model input. The second is classifier design, in which a better learning model is designed to maximize the accuracy of music emotion recognition and classification.
Although these algorithms have achieved good recognition results, there are still areas for improvement:
1) There are many kinds of extracted music emotion features, but the algorithms are not flexible enough to adapt to the various features.
2) A deep learning network is simple to build, but its internal structure is very complex and the number of hyperparameters is huge, which makes it difficult to modify and very difficult to analyze theoretically.
3) Emotion is subjective, so it is not easy to determine how to better extract emotion features from music or where to begin innovating.
The forward neural network (FNN) provides an alternative to deep learning networks, with a simple structure and fast data processing [14]. Tang et al. [15] used a random convolutional neural network to extract audio features and then used the FNN to predict labels. Deep learning and the FNN were spliced sequentially, which effectively improved the classification accuracy and training efficiency of the model. To combine the advantages of deep learning and the FNN, Chen et al. [16] proposed a cascade FNN based on convolutional feature mapping nodes, and their experiments showed that the network greatly exceeded traditional deep learning networks in feature extraction and training efficiency. Serhat et al. [17] proposed an approach to music emotion recognition based on a convolutional long short-term memory deep neural network architecture. It used features obtained by feeding convolutional neural network layers with log-mel filterbank energies and MFCCs, in addition to standard acoustic features. Madeline et al. [18] used machine learning techniques to classify the genre of music being listened to from physiological responses. Both Long Short-Term Memory networks and Convolutional Neural Networks can be used to make predictions from sequence data, so they trained and compared two networks that attempted to classify the genre of music a participant was listening to from their electrodermal activity. Benito-Gorron et al. [19] proposed a hybrid convolutional-LSTM model that achieved better overall results.
In this paper, LSTM and FNN are combined. LSTM is used as the feature mapping node of the FNN to build a new long short-term memory FNN (LSTM-FNN) that improves the accuracy of music emotion classification. LSTM-FNN uses an incremental learning algorithm to train new nodes without reprocessing all the data, which greatly reduces the running time of the model. First, in the music feature extraction stage, the content-based acoustic feature MFCC is used to increase emotion sensitivity. The residual phase is derived from the music signal to extract emotion-specific information, and the weighted combination of the two features is used as the model input. Second, the LSTM model is trained on the input data to extract the contextual relationships in the music, and the resulting set of feature nodes is used as the input of the FNN enhancement nodes. The enhancement layer output is generated through mapping, and the combination of the feature node set and the enhancement node set is used to obtain the final output by the global pseudo-inverse. Finally, the trained model is used to predict the type of music emotion. Experimental results show that the proposed algorithm extracts audio information more effectively by adding music features, and that the constructed LSTM-FNN effectively improves the accuracy and efficiency of music emotion recognition.
This paper is organized as follows. Section 2 introduces the proposed music emotion recognition model in detail. Section 3 presents experiments that verify the effectiveness of the proposed method. Finally, Section 4 concludes the paper.

Feature extraction
At present, content-based acoustic features are mainly divided into timbre, rhythm, pitch, harmony and temporal characteristics. Timbre features include cepstral features such as MFCC. Rhythm features mainly include the number of beats, the rhythm histogram and so on. Pitch features are mainly frequency information. Harmonic features include the chromagram [20,21]. Temporal features include the temporal centroid. Among these, MFCC makes use of the principles of human hearing and the decorrelation property of the cepstrum, and is one of the most successful spectral features in speech and music related recognition tasks.

To extract this feature, the audio signal is first preprocessed: it is segmented into frames and windowed. The original signal, with a sample rate of 44.1 kHz, is segmented into frames of 2048 samples using a Blackman-Harris window. After windowing, both ends of each frame fade to 0, so the signal at both ends of the frame is weakened. To overcome this problem, adjacent frames overlap during framing. Generally, the overlap is taken as half of the frame length or fixed at 10 ms. In this paper, adjacent frames overlap by 50%, which not only reduces spectral leakage but also avoids unnecessary computation. Then, the discrete short-time Fourier transform (STFT) is applied to each frame to obtain the spectral energy, which is weighted by the frequency responses of $k_1$ Mel filters and further filtered to generate the Mel spectrum. The center frequency and bandwidth of each filter roughly match those of an auditory critical-band filter. Finally, the whole Mel spectrum sequence is divided into $L$ blocks of $k_2$ frames each, so each block contains $k_2$ frames of $k_1$ Mel-band energies.

RP is defined as the cosine of the phase function of the analytic signal derived from the Linear Prediction (LP) residual of the music signal. At time $t$, the music sample $s(t)$ can be estimated as a linear combination of the $p$ past samples, so the predicted music sample can be expressed as:

$$\hat{s}(t) = -\sum_{k=1}^{p} a_k\, s(t-k)$$

where $a_k$ are the Linear Prediction Coefficients (LPCs). The prediction error $e(t)$ is defined as the difference between the actual value $s(t)$ and the predicted value:

$$e(t) = s(t) - \hat{s}(t) = s(t) + \sum_{k=1}^{p} a_k\, s(t-k)$$

The LPCs are obtained by minimizing the mean squared prediction error, and the resulting error signal is the LP residual $r(t)$ of the music signal.
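The LP residual described above can be sketched in a few lines of plain Python. This is only an illustration, not the implementation used in the experiments: it assumes the autocorrelation method with the Levinson-Durbin recursion and the sign convention $\hat{s}(t) = -\sum_k a_k s(t-k)$, and the function names `autocorrelation`, `lpc` and `lp_residual` are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Return the autocorrelation values R(0), ..., R(max_lag) of x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Estimate a_1..a_p with the Levinson-Durbin recursion.

    Assumed convention: s_hat(t) = -sum_k a_k * s(t - k).
    """
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order      # a[0] is the implicit leading 1
    err = r[0]                     # prediction error energy so far
    for i in range(1, order + 1):
        if err <= 0.0:             # silent or perfectly predicted signal
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err             # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a[1:]


def lp_residual(x, coeffs):
    """e(t) = s(t) + sum_k a_k * s(t - k); samples before t = 0 count as 0."""
    residual = []
    for t in range(len(x)):
        weighted_past = sum(a_k * x[t - k]
                            for k, a_k in enumerate(coeffs, start=1)
                            if t - k >= 0)
        residual.append(x[t] + weighted_past)
    return residual
```

For a signal that is exactly predictable from its previous sample, for example `s = [0.9 ** t for t in range(200)]`, a first-order model recovers a coefficient close to -0.9 and the residual is close to zero after the first sample. The prediction order is a free parameter of the sketch.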
The analytic signal $r_a(t)$ can be obtained from $r(t)$:

$$r_a(t) = r(t) + j\, r_h(t)$$

where $r_h(t)$ is the Hilbert transform of $r(t)$.
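As a rough illustration of how the analytic signal, and from it the cosine of its phase, might be computed, the sketch below builds the analytic signal in the frequency domain (negative frequencies zeroed, positive frequencies doubled). It assumes a short residual frame and uses a naive $O(N^2)$ DFT only to stay self-contained; an FFT-based routine would normally be used instead. The names `analytic_signal` and `residual_phase` are hypothetical.

```python
import cmath
import math


def analytic_signal(r):
    """Return r_a(t) = r(t) + j * r_h(t) for a real sequence r."""
    n = len(r)
    # Forward DFT of the real input.
    spectrum = [sum(r[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
                for f in range(n)]
    # Keep DC (and Nyquist for even n), double positive bins, zero negative bins.
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        upper = n // 2
    else:
        upper = (n + 1) // 2
    for f in range(1, upper):
        weights[f] = 2.0
    # Inverse DFT of the one-sided spectrum gives the analytic signal.
    return [sum(spectrum[f] * weights[f] * cmath.exp(2j * math.pi * f * t / n)
                for f in range(n)) / n
            for t in range(n)]


def residual_phase(r, eps=1e-12):
    """Cosine of the analytic-signal phase: Re(r_a(t)) / |r_a(t)|."""
    phase_cosine = []
    for z in analytic_signal(r):
        envelope = abs(z)  # Hilbert envelope at this sample
        phase_cosine.append(z.real / envelope if envelope > eps else 0.0)
    return phase_cosine
```

Because the result is a ratio of the signal to its Hilbert envelope, it always lies in [-1, 1] and carries phase information only. For a pure cosine with an integer number of periods in the frame, the envelope is 1 and the output equals the input.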
A lot of emotion information about music exists in the LP residual, and calculating the residual phase helps to extract emotion-specific information from the music signal. The residual phase is the cosine of the phase of the analytic signal, calculated as follows:

$$\cos\theta(t) = \frac{\operatorname{Re}\big(r_a(t)\big)}{\lvert r_a(t)\rvert}$$

Marius et al. [22] proved that RP contains audio-specific information that is complementary to MFCC features. The recognition rate in the deep learning model indicates that there is specific emotion information in the music signal, and that RP can extract this information.

The classification of music emotion in this experiment can be regarded as a multi-classification problem, which is expressed by the following formula.
$$z = wh + b$$

where $h$ represents the input of the neural network, $z$ represents the output of the neural network, and $w$ and $b$ are the parameter weights of the input and output layers.
The propagation modes of a neural network include forward propagation and back propagation. Forward propagation means that the model propagates from the bottom up, performing calculations on a given input. The loss value is calculated from the result of forward propagation. Back propagation of the error uses the gradient descent algorithm to compute and train the parameters of each neuron. The forward propagation of the neural network can be represented by a set of recursive formulas over the inputs of each hidden layer $k$ at time $t$.

In this paper, Softmax is used as the activation function for the training of the FNN [23]. Softmax performs well on multi-classification problems because the output of each Softmax neuron is positive and the outputs sum to 1, so the output of the Softmax layer can be regarded as a probability distribution. For the $i$-th output neuron with input $z_i$, the output of Softmax is

$$y_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

The main steps of training include: 1) calculate the activation value of each node in the network; 2) propagate the gradient through the back propagation algorithm to obtain the gradient value of each parameter; 3) update the model parameters with the gradient descent algorithm; 4) iterate the above process until convergence.
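The four training steps above can be illustrated with a single Softmax layer trained by gradient descent. This is a toy sketch, not the LSTM-FNN itself: it assumes a cross-entropy loss (the loss function is not named in this section), a fixed number of epochs instead of a convergence test, and hypothetical names (`softmax`, `forward`, `train`, `predict`) and hyperparameters.

```python
import math


def softmax(z):
    """Numerically stable softmax: positive outputs that sum to 1."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(w, b, h):
    """Linear layer z = w h + b for one input vector h."""
    return [sum(w_cj * h_j for w_cj, h_j in zip(row, h)) + b_c
            for row, b_c in zip(w, b)]


def train(samples, labels, n_classes, lr=0.5, epochs=200):
    """Train the layer with per-sample gradient descent (cross-entropy assumed)."""
    n_features = len(samples[0])
    w = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):                      # step 4: iterate
        for h, y in zip(samples, labels):
            p = softmax(forward(w, b, h))        # step 1: activations
            # step 2: gradient of cross-entropy w.r.t. z is p - onehot(y)
            grad_z = [p[c] - (1.0 if c == y else 0.0) for c in range(n_classes)]
            for c in range(n_classes):           # step 3: parameter update
                for j in range(n_features):
                    w[c][j] -= lr * grad_z[c] * h[j]
                b[c] -= lr * grad_z[c]
    return w, b


def predict(w, b, h):
    """Return the index of the most probable class."""
    p = softmax(forward(w, b, h))
    return max(range(len(p)), key=p.__getitem__)
```

Subtracting the maximum before exponentiating leaves the Softmax output unchanged but avoids overflow for large inputs.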
The block diagram of LSTM-FNN is shown in Figure 2. This paper uses the Emotion music data set to test and evaluate the performance of complex models combining deep learning and forward learning networks in emotion classification. The data set consists of 2906 songs in four emotion categories: 639 angry songs, 753 happy songs, 750 relaxed songs and 764 sad songs [24]. For convenience and consistency, only the first 30 s of each song are used, and songs shorter than 30 s are zero-padded. The data set is randomly divided in the ratio 8:1:1 into training, validation and test sets, to keep the experiment as fair as possible.

The LP residual is derived with 16th-order LP analysis. The input music data is pre-emphasized with a first-order digital filter and divided into 20 ms frames with 50% overlap between adjacent frames, and the LP residual is extracted from the emotion music signal. The highest Hilbert envelope of each frame is then extracted to generate the RP features. The feature sequence map is obtained by a weighted combination of the two features (MFCC and RP). Feature extraction is carried out for each type of music signal, and the resulting sequence feature maps are shown in Figure 3, which shows the temporal features of three frames extracted from the audio signals of the four emotion types. The network input has the form [batch size, height, width, channels]. Based on the computer memory size and the complexity of the classification model, the batch size is set to 128, that is, 128 sequence maps are input at a time.
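The data preparation described above (fixed 30 s clips with zero padding, frames with 50% overlap, and a random 8:1:1 split) can be sketched as follows. The helper names, the random seed and the use of integer division for the split sizes are assumptions of this sketch, not details taken from the experiments.

```python
import random


def pad_or_truncate(signal, sample_rate, seconds=30):
    """Keep the first `seconds` of audio; zero-pad shorter signals."""
    target = sample_rate * seconds
    clipped = list(signal[:target])
    return clipped + [0.0] * (target - len(clipped))


def frame_signal(signal, frame_len, overlap=0.5):
    """Split a signal into frames of frame_len samples with the given overlap."""
    hop = max(1, int(frame_len * (1.0 - overlap)))
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle and split items into training, validation and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

With 2906 songs, this split yields 2324 training, 290 validation and 292 test songs. At a 44.1 kHz sample rate, a 20 ms frame is 882 samples, so a 50% overlap corresponds to a hop of 441 samples.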

Parameter setting
In the LSTM-FNN network, a 3-layer LSTM is used for node mapping, with output dimensions of 400, 200 and 100 respectively. The model structure with the best effect is selected through experimental comparison, and the output of the LSTM is then mapped to the enhancement layer. The CCFBLS network has four convolutional blocks, each containing a convolutional layer, a pooling layer and a dropout layer. Each convolutional layer has 64 filters of fixed size 3×3 with a stride of 1, max pooling is used, and the dropout rate is 0.5; the outputs of all four convolutional blocks are connected to the output node of CCFBLS.

It is not known in advance which LSTM structure, combined with the FNN, achieves the highest music emotion classification accuracy. Therefore, an LSTM structure experiment is carried out to select the number of LSTM nodes in the mapping layer. The experiment compares combined LSTM and FNN models with 1 to 3 LSTM layers [27] to determine the influence of the number of LSTM nodes on the classification accuracy of the overall model. The classification results are shown in Figure 4. The classification accuracy of the two-layer LSTM model is higher than that of the other two models; adding more layers does not improve the result but increases the training time [28][29][30][31]. Therefore, the output of the two-layer LSTM model is selected as the input of the mapping layer and combined with the FNN for music emotion classification training.

The accuracy and effectiveness of the proposed classification model are evaluated by comparing it with the four classification models in [12]. For a fair comparison, each method is cross-validated 10 times to obtain the classification accuracy, and the results are shown in Table 2. The experimental environment is the same for all compared methods.
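One common way to realize the cross-validation mentioned above is 10-fold cross-validation, sketched below. Whether the experiments used exactly this scheme is an assumption; the function names and the `evaluate` callback (which would train a model on the training indices and return its accuracy on the held-out indices) are hypothetical.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs covering every sample once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    splits = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(train_idx, test_idx)
              for train_idx, test_idx in k_fold_splits(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Each sample is used for testing exactly once, so the averaged accuracy does not depend on one particular train/test partition.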
Experimental results show that the proposed model is far superior to the models based on deep learning in music emotion classification, and also better than the RCNNBL model. Because the emotional analysis of music is highly subjective, it is very difficult to describe music emotion with physical parameters derived from the characteristics of the audio signal [32][33][34]. Moreover, current results on music emotion classification are not yet satisfactory, and only promising directions can be identified from slight advantages. As can be seen from Table 2, LSTM has a slight advantage in music emotion classification, while MCCLSTM uses a multi-channel CNN and LSTM to perform the recognition and classification task. Although its recognition accuracy is a little more stable than that of LSTM, the more complex model increases the training time, so this paper chooses to use LSTM and FNN. The broad learning system using cascaded convolutional neural networks has shown obvious advantages in music emotion classification. Compared with other complex models, its accuracy is greatly improved, which demonstrates the superiority of the FNN network. The network structure proposed in this paper makes full use of the ability of the advanced network to process complex data quickly. Its advantages lie in its simple structure and short training time, which improve recognition efficiency. LSTM performs excellently at extracting temporal features from time series data. It can extract the temporal relationships in music and thus retain music emotion features to the greatest extent. Combining these advantages, the LSTM-FNN network model is obtained for the music emotion classification task. Figure 5 compares the classification accuracy of the different models.
The recognition accuracy of the LSTM-FNN model is 13% higher than that of MCCLSTM, 10.2% higher than that of RCNNBL, and 9.5% higher than that of CCFBLS, which shows that the LSTM-FNN model classifies music emotion more accurately. The performance of the proposed method is compared with other methods on the standard feature set [17] in Table 3. The results in Table 3 show that, for the standard feature set, LSTM-FNN greatly improves music emotion recognition in terms of F-measure compared with six of the other methods, but is slightly lower than LSTM-DNN [17]. Table 4 shows that its running time is shorter than that of LSTM-DNN owing to the use of three GPUs, which reflects the effectiveness of the proposed method.

Combining the advantages of LSTM and FNN, the LSTM-FNN network model is obtained to perform the task of music emotion classification. Experimental results show that the LSTM-FNN network model has higher recognition accuracy than a single deep learning model, achieves lower time complexity than complex LSTM-based models, and effectively classifies the emotion of music. However, the emotions expressed in different "paragraphs" of complex music may not be consistent with the overall emotion, which makes recognition difficult and results in low accuracy of the proposed method on this type of music. If the same audio data is divided into multiple segments, and the segments are fed into trained networks to "vote", the judgment of music emotion may be more accurate and scientific. In addition, experiments with more complex neural network architectures, such as recurrent neural networks suited to time-related classification of music emotion, may obtain better results than the method presented in this paper.
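The segment "voting" idea suggested above could look roughly like the following sketch. It assumes equal-length, non-overlapping segments and a simple majority vote, with ties resolved in favor of the label seen first; `classify_segment` stands in for a trained network and, like the other names here, is hypothetical.

```python
from collections import Counter


def split_into_segments(signal, n_segments):
    """Cut a signal into n_segments equal-length, non-overlapping pieces."""
    seg_len = len(signal) // n_segments
    if seg_len == 0:
        raise ValueError("signal is shorter than the number of segments")
    return [signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


def predict_by_voting(signal, classify_segment, n_segments=3):
    """Classify each segment separately and return the most frequent label."""
    votes = [classify_segment(segment)
             for segment in split_into_segments(signal, n_segments)]
    return Counter(votes).most_common(1)[0][0]
```

For example, with a toy classifier that labels a segment "happy" when its mean is positive and "sad" otherwise, a signal whose first and last thirds are positive and whose middle third is negative is labeled "happy" by two votes to one.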