Multi-attention mechanism based on gate recurrent unit for English text classification

Text classification is one of the core tasks in the field of natural language processing. Aiming at the advantages and disadvantages of current deep learning-based English text classification methods in long text classification, this paper proposes an English text classification model, which introduces multi-attention mechanism based on gate recurrent unit (GRU) to focus on important parts of English text. Firstly, sentences and documents are encoded according to the hierarchical structure of English documents. Second, it uses the attention mechanism separately at each level. On the basis of the global object vector, the maximum pooling is used to extract the specific object vector of sentence, so that the encoded document vector has more obvious category features and can pay more attention to the most distinctive semantic features of each English text. Finally, documents are classified according to the constructed English document representation. Experimental results on public data sets show that this model has better classification performance for long English texts with hierarchical structure.


Introduction
Text classification is one of the core tasks of natural language processing [1][2][3]. It is widely used in news classification, sentiment analysis, subject detection and other fields. Traditional machine learning classification algorithms use feature engineering to select suitable features to represent text, and then input text features into different classification models to obtain classification results, such as, Naive Bayes (NB) [4], K-nearest neighbor (KNN) [5], Support Vector Machine (SVM) [6] and other algorithms. However, with the increasing number of texts, the complex feature engineering in the traditional machine learning classification algorithm has become a bottleneck restricting its development.
Deep learning models [7][8][9] can autonomously learn the characteristics of samples from large-scale data samples, which improves the intelligence of modeling and simplifies the classification process. Therefore, it has become a research hotspot in the field of text classification. Convolutional neural Network (CNN) [10,11] is one of the commonly used models in deep learning. Yoon Kim et al. [12] proposed the TextCNN model in the EMNLP conference in 2014. He used convolutional neural network to model and classify texts and obtained results that were not inferior to the complex classifier model based on machine learning. Therefore, the research on text classification model based on deep learning is hot.
In natural language processing tasks, recurrent neural networks (RNN) [13][14][15]  time and learn features based on complex sequences of words. Thus, RNN can capture language patterns useful for natural language processing tasks, especially on long text segments. Francisco et al. [16] obtained good results by using the recurrent neural network based on long and short-term memory (LSTM) to encode text. Convolutional neural network can extract local features and RNN can extract global features, both of them show good results. However, when the document length is long, processing the document directly as a long sequence not only poses a significant performance challenge to the model, but also ignores the information contained in the document hierarchy. Therefore, some researchers study hierarchical neural network model [17][18][19][20] and use hierarchical network model for text classification. However, the hierarchical network model usually uses the global object vector in the training process and cannot pay attention to the most obvious semantic features of each text.
When using attention mechanisms for sentences, the existing methods usually use global parameters as target vectors for all categories. On the one hand, this is not conducive to the expression of the characteristics of each sentence, and on the other hand, it cannot highlight the words with obvious categorical characteristics in the sentence. Aiming at the advantages and disadvantages of current text classification algorithms in long text classification, this paper proposes an English text classification model based on multi-layer attention mechanism combining with GRU [21,22] (abbreviated as MAMGRU). A hierarchical model based on RNN is used to classify long English texts, and an attention mechanism is introduced to pay attention to important parts of English texts. At the sentence attention level, besides training a global objective vector, the most important information in each dimension is extracted from the sentence word vector matrix by maximum pooling as the sentencespecific objective vector. The two object vectors are used to score the words in the sentence together, so that the category features of the sentence coding can be more obvious, and the most distinctive semantic features of each English text can be better paid attention to.
Finally, experiments on public data sets verify the validity of the MAMGRU model. Especially for the data set with hierarchical features and long text, the MAMGRU model obtains specific important features of each sentence by extracting the maximum features of each dimension in the sentence at the sentence attention layer, which has a good performance in the accuracy of text classification.

Related works
Deep learning-based text classification algorithms usually use low-dimensional and real-value word vectors to represent the words in the text, then build a neural network model to model the text, obtain the text representation containing all the text information, and use the final text representation to classify the text.
TextCNN model is a classical text classification model based on deep learning. After that, researchers put forward many improvement schemes based on TextCNN. Londt et al. [23] proposed a character-level classification model based on CNN, adding a cyclic layer on the basis of CNN to capture the information that sentences rely on for a long time. Jang et al. [24] obtained sequential dependency information by increasing the depth of CNN. Paoletti et al. [25] studied how to deepen the word granularity of CNN to express the text globally, and proposed a simple pyramidal CNN network structure, which not only increased the network depth and accuracy, but also did not increase the amount of calculation too much.
Inspired by TextCNN and considering the advantages of RNN in processing text data, some researchers use RNN and its variant structure to model and classify texts. Shibata et al. [26] used LSTM structure to encode text, and in order to solve the problem of less annotated data in a single task, they designed three different information sharing mechanisms based on RNN for training, and achieved good results in the four benchmark text classification tasks. In order to solve the problem of insufficient storage unit when RNN encoding long text, Liang et al. [27] proposed a LSTM structure with high caching to capture the overall semantic information in long text, so that the network could better preserve emotional information in a circular unit.
In addition to capturing sequence information on long text, another advantage of RNN for text classification is that it can be well combined with the attention mechanism. It can focus on key information in text modeling to improve the effect of English text classification. The attention mechanism is first proposed in the field of computer vision [28,29]. In the field of natural language processing, the attention mechanism is first introduced into the RNN based encoding model of machine translation tasks. The attention mechanism scores the input sequence by object vector, focuses attention on the more important part of the input sequence, and makes the output result more accurate. Therefore, it has been gradually promoted and applied to a variety of NLP tasks including text classification.
When dealing with a long document representation consisting of many sentences, treating the document directly as a long sequence ignores the information contained in the document hierarchy. Therefore, some researchers use hierarchical neural network model to model documents for text classification. Hu et al. [30] EAI Endorsed Transactions on Scalable Information Systems Online First

R E T R A C T E D
Multi-attention mechanism based on gate recurrent unit for English text classification 3 constructed a bottom-up document representation method. CNN was first used to encode sentences, then RNN with gated structure was used to construct document representation, and finally classification results were obtained through Softmax layer. Experiments show that this model achieved the best result for long document classification at that time. Similarly, Ouyang et al. [31] proposed the hierarchical attention model, which incorporated the attention mechanism into the hierarchical GRU model, enabling the model to better capture the important information of documents and further improved the accuracy of classification of long English documents.

Proposed English text classification model
To capture the sequence information on the text in English, at the same time, better use of long hierarchy of document data, this paper puts forward the text classification model based on multi-attention mechanism, using the hierarchical model based on RNN, introducing the mechanism of focusing on the most important part of English text, in the words of attention by extracting sentences of each dimension in the biggest characteristics to obtain the importance of each sentence. The basic idea of the model is to encode sentences and documents respectively according to the hierarchical structure of words forming sentences and sentences forming documents. In order to focus the semantics of a sentence or document representation on its important components, attention mechanisms are used at each level. Finally, the documents are classified according to the document representation built.
Usually the attention mechanism starts by setting up a task-specific goal vector and matching it to the input sequence. Each element in the input sequence is then assigned an attention score by calculating its similarity to the target vector. After the score is normalized, the weighted sum of all elements gives the final representation. In text classification, the previous practice is to learn a global context vector in the network as the object vector, and score words by calculating the similarity between each word and the object vector.
However, when all categories share an object vector, its information on each feature dimension will be relatively average and it cannot highlight the salient features of the sentence. Even if a word with distinct categorical features appears in a sentence, the global object vector cannot assign it an attention score matching its salience.
Therefore, the MAMGRU model uses a mixed attention mechanism at the sentence attention level, that is, in addition to using the global objective vector, each sentence constructs its own specific objective vector. The largest value in each dimension, that is, the most obvious information, is directly extracted from the word vector matrix of the sentence as the specific object vector of the sentence. The MAMGRU model has five layers, including sentence coding layer, sentence attention layer, document coding layer, document attention layer and document classification layer from bottom to top. The structure of the model is shown in Figure 1.

Sentence encoding layer
At the sentence encoding layer, each word in the sentence is input in turn to construct the sentence representation. To obtain the sequence information in the sentence, the sentence is modeled using RNN. However, in the basic RNN, as the sequence spreads over time, the historical information in the sequence will be gradually forgotten, while the error accumulation is increasing. Therefore, in MAMGRU, a special RNN structure-gate recurrent unit (GRU) is used to solve the gradient updating problem in long-term memory and back propagation.
Bi-directional GRU is used to obtain word annotations combining contextual information by summarizing word information from both directions, as shown in equations (1) and (2).

Sentence attention layer
Because each word in the sentence contributes differently to the classification goal. The weight of words that are more important to the classification should be higher during coding. Therefore, at the sentence attention layer, the attention mechanism is used to calculate an attention score for each word in the sentence, and then the vector representation of the sentence is formed through the word and its score.
In particular, at the sentence attention level, MAMGRU model uses a hybrid attention mechanism. In addition to using the global target vector g v , it also constructs its unique target vector s v for each sentence. Since each dimension of the word vector represents an attribute information, similar to max-pooling in CNN [32,33], the largest value in each dimension is directly extracted from the word vector matrix of the sentence, that is, the most obvious information, as the unique target vector of the sentence, and the semantic information with obvious categorical characteristics is more highlighted.
For each  sentence  in  the  document,   ] ,   Then, for all the words in the sentence, the similarity between them and the two target vectors is calculated and normalized. The attention score for the two target vectors is obtained, as shown in equations (7) (11) In this way, the corresponding vector representation can be obtained for each sentence in the document, and the words with obvious classification characteristics in the text will get more weight and occupy a dominant position in the final sentence representation.

Document encoding and attention layer
Each sentence in the document also contributes differently to the result of the document classification. Therefore, in the document attention layer, the attention mechanism is also used to score each sentence. Here we focus only on the global information of the document.

Document classification layer
Document vector d is a higher-order representation of documents and can be directly used as a feature of document classification. The probability of each category is calculated by Softmax, as shown in equation (19).
In the whole model training process, the cross entropy is used as the loss function, as shown in equation (20). (20) Where j d p is the probability that document d belongs to category j .
To sum up, MAMGRU model firstly uses bidirectional GRU and mixed attention mechanism to encode sentences and obtain vector representation of each sentence in the document. The document is then encoded using a twoway GRU and a basic attention mechanism. The resulting document representation contains semantic information for all the sentences in the document, with a greater proportion of important sentences. Similarly, the more distinctive the words in each sentence, the greater the weight. Finally, at the document classification layer, classify documents according to the obtained document representation, and obtain the probability of documents corresponding to each category. Based on such a hierarchy, a more categorical representation of documents can be obtained.

Experiments and analysis
In this part, the experiments on a balanced dataset of internet media reports in the Chinese language are carried out. In this dataset, there are 1000 training samples and 1000 test samples in each category. This is an exactly balanced dataset [34].
In the training process, all the data in the data set are divided into training set and test set by 9:1, and 10% of the training set is taken as the verification set.
The experiment compares the MAMGRU model with the following classification models.
(1) Classification model based on machine learning: Bayesian model, decision tree model. The Bayesian model trains Bernoulli Bayesian classifier [35] and polynomial Naive Bayesian classifier respectively. The decision tree model trains the ID3 decision tree based on information entropy [36] and the CART decision tree based on GINI impurity [37].
Part of training parameter settings of MAMGRU model are shown in Table 1  Naive Bayes model assumes that attributes are independent of each other, but this assumption is often not true in practical application, and the performance of Naive Bayes is poor when the number of attributes is large. When the value of K is 1, KNN algorithm has a better effect on the open English data set. The K value of KNN algorithm has a great relationship with the data set itself. There is no specific empirical formula for determining K value. Therefore, when training the new data set, it is necessary to re-experiment with different values of K to determine the optimal K value. Since information entropy is more sensitive to impurity than Gini coefficient, when information entropy is used as an indicator, the growth of decision tree will be more "fine". When data dimension is high or data noise is high, information entropy is easy to over-fit.
TextCNN has achieved a good classification effect on data sets, which is higher than the accuracy of traditional machine learning classification model. GRU has a slightly better classification effect than TextCNN, and the proposed model achieves the same results as GRU on public data sets. Compared with short texts, the proposed model can pay more attention to the most distinctive semantic features in long texts and achieve better results.

Conclusion
This paper studies deep learning based text classification algorithm and proposes a novel English text classification model based on multi-attention mechanism. First, it improves the multi-attention model. In sentence encoding, it is proposed to set a specific objective vector for each sentence in the text, combine with the global objective vector, synthesize all word items and their attention score according to a certain weight to get the sentence coding. Then, through the document coding layer, document attention layer and document classification layer in the EAI Endorsed Transactions on Scalable Information Systems Online First

R E T R A C T E D
hierarchical attention model, the probability of documents corresponding to each category is obtained, that is, the realization of text classification. The experimental results show that the proposed algorithm is better than the traditional machine learn-based text classification models and the existing deep learn-based text classification models.