Correlation temporal feature extraction network via residual network for English relation extraction

In relation extraction, a major challenge is the absence of annotated samples. Relation extraction aims to extract the relationships between entity pairs from a large amount of unstructured data. To address this problem, this paper presents a new method for English relation extraction based on a correlation temporal feature extraction network with a residual network. First, the attention mechanism and a recurrent neural network are used to obtain the temporal features of English word correlation. Second, a multi-branch feature-sensing convolutional neural network is constructed to obtain global and local temporal correlation features separately. The residual network can dynamically reduce the influence of noisy data and better extract the deep information of English text. Finally, relation extraction is performed with a Softmax classifier. Experimental results show that the proposed method extracts English relations more effectively than other methods.


Introduction
With the rapid development of Internet technology, how to extract useful structured information from massive unstructured data has become a research hotspot in industry and academia at home and abroad. As a core branch of Information Extraction (IE), relation extraction (RE) aims to extract semantic relationships between given entity pairs from unstructured text, and has been widely used in knowledge graphs, text summarization, automatic question answering and other applications [1][2][3][4][5]. The sentence in Figure 1 contains entity 1: influenza [6] and entity 2: virus, and the goal is to predict the semantic relationship between the two entities, Cause-Effect$(e_2, e_1)$.

Figure 1. Example of relation extraction
In recent years, neural networks based on the Transformer [7,8], the recurrent neural network (RNN) [9,10], the convolutional neural network (CNN) [11][12][13][14], and their variants have been widely used in relation extraction tasks [15][16][17]. In relation extraction using neural networks, the word vector representation of each input word is fixed, but the actual meaning of the word changes with the context. Some neural networks aim to learn contextual information when encoding words. However, without considering the sentence as a whole [18], it is difficult to learn context information. The self-attention mechanism (SAM) [19] of the Transformer structure can capture the correlation between any pair of words in a sentence. It encodes words according to the correlation between them, so that the word encoding contains contextual correlation information. An RNN uses sequence units to store temporal information and has advantages in processing sequential information. Bidirectional Long Short-Term Memory (BiLSTM) [20], a typical RNN, has a bidirectional temporal structure and therefore uses both past and future information in the word sequence to encode words with contextual temporal information.
In this paper, SAM and BiLSTM are first used to encode words with the temporal features of word correlation, so as to express the meaning of words in sentences more accurately. On the other hand, most neural networks only consider a single-branch information flow. For example, Zhang et al. [21] used BiLSTM to encode the word sequence and obtain contextual sequence information. Nguyen et al. [22] used convolution kernels of different sizes to improve the local perception ability of CNN. However, these methods still struggle to obtain enough semantic features to extract relationships.
To solve this problem, this paper proposes a relation extraction method based on a global and local feature-aware network (GLFN) via residual network. Based on the extracted correlation temporal features, the global and local feature-aware convolutional neural network (GLFCNN) is used to fully express the important semantic features of sentences. Specifically, for GLFCNN, this paper first constructs a multi-branch feature-aware convolutional neural network (MFA-CNN), whose branches sense the global and local correlation temporal features respectively; these two features are then spliced and screened. Finally, a Softmax classifier predicts the relation from the extracted semantic information. The proposed method mainly consists of three parts: the correlation temporal feature extraction network (CSFEN), the global and local feature-sensing convolutional neural network (GLFCNN), and the predictive output network (PON).
(1) Correlation temporal feature extraction network (CSFEN): SAM and BiLSTM are used to obtain the correlation temporal feature of words. Specifically, the former is used to learn the correlation features of any word in a sentence, while the latter performs sequence encoding for words containing correlation features to obtain sequence information.
(2) Global and local feature sensing convolutional neural network (GLFCNN): The multi-branch feature sensing convolutional neural network (MFA-CNN) is constructed to sense global and local correlation temporal features separately; the two features are then spliced and filtered to comprehensively represent the important semantic features of sentences.
(3) Predictive output network (PON): A fully connected layer maps the output of GLFCNN to the classification space, and the relationship between the entity pair is predicted with a Softmax classifier.
In this paper, a relation extraction method based on global and local feature perception networks is proposed. SAM and BiLSTM are used to extract the temporal features of word correlation, and GLFCNN is constructed to obtain semantic features of words at different levels. The main contributions are as follows:
(1) SAM and BiLSTM are used to extract the temporal features of word correlation, which represent the meaning of words in a sentence more precisely.
(2) GLFCNN is constructed to respectively sense global and local correlation time sequence features to avoid the mutual influence of global and local perception. The global and local perceptual features are spliced and screened to fully represent the important semantic features of sentences.
(3) The validity of the proposed relation extraction method is verified on the standard SemEval-2010 Task 8 and KBP37 datasets, with F1 reaching 86.1% and 64.9%, respectively.

Related works
The research on relation extraction mainly includes the methods based on artificial rules in the early days and the artificial intelligence methods based on neural networks which have developed rapidly in recent years. The former mainly uses NLP tools or manually designs different kernels to select features [23,24]. Such methods mainly have the problems of error propagation caused by NLP tools [25] and limitations of manual experience.
With the rapid development of neural networks, scholars at home and abroad have tried to use them for relation extraction. Zeng et al. [26] used CNN to extract word-level and sentence-level features, and combined them with the relative position of each word with respect to the entity pair to extract the relationship, which was more efficient than methods based on artificial rules. However, due to the limitations of the CNN structure [27], this method cannot perform at its best for entity pairs that are far apart. RNN is good at processing temporal information and has advantages in capturing long-distance dependencies between words [28]. Tymoshenko et al. [29] used RNN to encode sentences sequentially and obtain semantic features that include contextual sequence information. Guo et al. [30] used BiLSTM's powerful memory preservation ability to capture time-dependent information between distant words in sentences. However, the above methods only consider the short-distance dependence or temporal dependence of words, without considering the correlation between words, so it is difficult to describe the overall linguistic information of words in a sentence. In this paper, SAM and BiLSTM are used to extract the temporal features of word correlation to encode each word, so that the encoding more accurately represents the overall contextual semantics of the word in the English sentence.

EAI Endorsed Transactions on Scalable Information Systems Online First
In terms of feature perception, Nguyen et al. [31] used CNN to extract local features, combined with maximum pooling operation to reduce redundancy and sense important features. However, as convolution kernels of fixed size would limit the perception range of CNN, Liu et al. [32] used convolution kernels of different sizes to improve CNN's perception ability of local features. Zhang et al. [33] used Gaussian attention to improve SAM's perception of local information, so that local information near the central word could get more attention, and further applied to natural language reasoning tasks. However, most of the above networks adopt single-branch structure, which can only perceive a single information flow feature, and lack comprehensive attention to the global and local features of sentences, so it is difficult to obtain enough semantic features to extract relationships.
To address this problem, this paper proposes a relation extraction method based on a global and local feature-aware network. Building on the extracted correlation temporal features, it further constructs MFA-CNN to sense the global and local correlation temporal features separately, so that global and local perception do not influence each other. The global and local perceptual features are then spliced and screened to represent the important semantics of sentences.

Residual network (ResNet)
The learning objective of a residual network is the residual mapping. The skip connection bypasses the intermediate layers and directly connects the low-level representation with the high-level representation, which greatly alleviates the vanishing-gradient problem that plagues deep networks. In the designed model, we use shortcut connections to build the residual convolution structure. Each residual block contains two convolution layers; each convolution layer is followed by a nonlinear layer whose activation function is ReLU [34][35][36]. All convolution windows are $l \times d$ in size, and with padding the output keeps the same size as the input. The first convolution layer produces an output for the i-th window, and the second convolution layer then produces the block's output for the i-th window. The entire ResNet cascades four such residual blocks, and its final output c is the extracted, more abstract relational information.
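The residual block described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the function names, the filter layout (`filters[f][k][j]`, one filter per output feature so that the output keeps d features), the placement of the second ReLU before the shortcut addition, and the toy weights are all hypothetical and are not taken from the paper's equations.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]

def conv_same(seq, filters, bias):
    """Slide an l x d window over seq (a list of d-dimensional vectors).

    Zero padding keeps the number of positions unchanged. filters[f][k][j]
    weights feature j of the k-th row of the window for output feature f.
    """
    n, l, d = len(seq), len(filters[0]), len(seq[0])
    left = (l - 1) // 2
    zero = [0.0] * d
    padded = [zero] * left + seq + [zero] * (l - 1 - left)
    out = []
    for i in range(n):
        window = padded[i:i + l]
        row = []
        for f, b in zip(filters, bias):
            s = b
            for k in range(l):
                for j in range(d):
                    s += f[k][j] * window[k][j]
            row.append(s)
        out.append(row)
    return out

def residual_block(seq, f1, b1, f2, b2):
    """Two convolutions, each followed by ReLU, plus the shortcut connection."""
    h = [relu(r) for r in conv_same(seq, f1, b1)]
    h = [relu(r) for r in conv_same(h, f2, b2)]
    return [[x + y for x, y in zip(xr, hr)] for xr, hr in zip(seq, h)]

def resnet(seq, blocks):
    """Cascade of residual blocks (the text uses four)."""
    for f1, b1, f2, b2 in blocks:
        seq = residual_block(seq, f1, b1, f2, b2)
    return seq

# With all-zero weights every block reduces to the identity mapping,
# which is the behaviour the shortcut connection is meant to make easy.
d, l = 2, 3
zero_filters = [[[0.0] * d for _ in range(l)] for _ in range(d)]
zero_bias = [0.0] * d
x = [[1.0, -2.0], [0.5, 3.0], [4.0, 0.0]]
c = resnet(x, [(zero_filters, zero_bias, zero_filters, zero_bias)] * 4)
```

Because the shortcut adds the input back, each block only has to learn the difference between its input and the desired output, which is one common reading of "the learning objective is the residual".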

Relation extraction methods based on both global and local feature-sensing networks
This section introduces the methods of relation extraction based on global and local feature sensing networks. As shown in figure 2, CSFEN uses SAM and BiLSTM to extract the temporal features of correlation between words in sentences. GLFCNN performs global and local perception of correlation temporal features and obtains multi-level semantic information. Finally, PON is used to predict the extracted semantic information.

The actual meaning of a word tends to change from context to context, whereas the representation of the input sentence X is fixed. To make the word encoding contain contextual relevance information and describe the contextual semantics of the sentence, this paper uses SAM to learn the relevance features of each word in the sentence. SAM is mainly composed of two parts: dot-product attention and multi-head attention. Dot-product attention involves three matrices: the query matrix Q, the key matrix K and the value matrix V. The weights of these matrices are updated automatically during network training. The specific realization of dot-product attention is shown in formula (4).
To obtain representations of the attention weights in different subspaces, multi-head attention maps the attention computation to multiple subspaces for learning, and then splices the information learned in the different subspaces. Multi-head attention is shown in equations (5) and (6).
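Equations (4) to (6) are not reproduced in this text, so the following sketch assumes the standard Transformer formulation: scaled dot-product attention softmax(QK^T / sqrt(d_k)) V, with the head outputs concatenated and projected by an output matrix. All function names, the projection matrices and the toy inputs are assumptions made for illustration only.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head_self_attention(x, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per subspace."""
    outs = []
    for w_q, w_k, w_v in heads:
        out, _ = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        outs.append(out)
    # Splice (concatenate) the per-head outputs position by position.
    concat = [sum((o[i] for o in outs), []) for i in range(len(x))]
    return matmul(concat, w_o)

# Toy usage: three words with 2-dimensional embeddings and two 1-dimensional heads.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
heads = [([[1.0], [0.0]],) * 3, ([[0.0], [1.0]],) * 3]
identity = [[1.0, 0.0], [0.0, 1.0]]
c = multi_head_self_attention(x, heads, identity)
```

Each row of the attention weights sums to one, so every output vector is a weighted mixture of the value vectors of all words in the sentence, which is how the encoding of one word comes to carry information about the others.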
For the input sentence X, SAM is used in this paper to obtain the sentence representation C=SA(X), which contains context-relevant information. SAM learns the correlation between words in a sentence but ignores the temporal information between them. To extract sequence information in both the forward and backward directions of the context, BiLSTM is used to further represent the temporal semantics of the sentence correlation; that is, it outputs the correlation temporal features of the sentence, L=BiLSTM(C).
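The step L=BiLSTM(C) can be illustrated with a small pure-Python bidirectional LSTM. This is a sketch, not the paper's implementation: the standard LSTM gate equations are assumed, the weights are random toy values, and names such as `make_lstm` and `bilstm` are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(input_dim, hidden_dim, seed):
    """Random toy weights: one row per hidden unit over [x; h; 1] for each gate."""
    rng = random.Random(seed)
    cols = input_dim + hidden_dim + 1
    return {g: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(hidden_dim)]
            for g in ("i", "f", "o", "g")}

def lstm_step(params, x, h, c):
    z = x + h + [1.0]  # input, previous hidden state, bias term

    def gate(name, act):
        return [act(sum(w * v for w, v in zip(row, z))) for row in params[name]]

    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

def run_lstm(params, seq):
    hidden = len(params["i"])
    h, c = [0.0] * hidden, [0.0] * hidden
    states = []
    for x in seq:
        h, c = lstm_step(params, x, h, c)
        states.append(h)
    return states

def bilstm(fwd, bwd, seq):
    """Concatenate forward states with backward states for each position."""
    f_states = run_lstm(fwd, seq)
    b_states = run_lstm(bwd, seq[::-1])[::-1]
    return [f + b for f, b in zip(f_states, b_states)]

# Toy usage: C stands for the SAM output, one 2-dimensional vector per word.
fwd, bwd = make_lstm(2, 3, seed=0), make_lstm(2, 3, seed=1)
C = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
L = bilstm(fwd, bwd, C)
```

The forward half of each output vector depends only on the words up to that position, while the backward half depends on the words from that position to the end, so together they carry sequence information from both directions.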

Ping Li
GLFN-GAB obtains the distribution information of global features to describe the importance of the correlation temporal features, so as to extract the global correlation temporal features. GLFN-LAB uses the convolution operation to obtain n-gram information of words and thereby perceive local correlation temporal features. GLFN-GAB uses MDC to map correlation temporal features to multiple representation spaces and learn the distribution information of features in the different spaces. The MDC is implemented as shown in formulas (7) and (8).
Here, DC is the dilated convolution operation, and t is the number of MDC channels, which determines the number of representation spaces. The influence of the value of t on network performance is discussed in the experimental part. After the distribution information of the correlation temporal features in the different representation spaces is obtained, Gaussian Error Linear Units (GELU) are used to apply a nonlinear transformation to it, which improves the nonlinear expression ability of the network. The calculation method is shown in formula (9).
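Formulas (7) to (9) are not reproduced in this text. The sketch below reads the DC operation as a dilated convolution and uses the exact GELU definition, GELU(x) = x Φ(x) with Φ the standard normal CDF. The scalar input sequence, the per-channel kernels and dilation rates, and all function names are simplifying assumptions, not the paper's configuration.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def dilated_conv1d(seq, kernel, dilation):
    """Zero-padded 1-D dilated convolution over a list of scalars.

    Kernel taps are spaced `dilation` positions apart; len(kernel) is
    assumed odd so that the window is centred on each position.
    """
    half = (len(kernel) // 2) * dilation
    padded = [0.0] * half + seq + [0.0] * half
    return [sum(w * padded[i + k * dilation] for k, w in enumerate(kernel))
            for i in range(len(seq))]

def multi_channel_dilated_conv(seq, kernels, dilations):
    """Map seq into t = len(kernels) representation spaces, then apply GELU."""
    return [[gelu(v) for v in dilated_conv1d(seq, k, d)]
            for k, d in zip(kernels, dilations)]

# Toy usage with t = 2 channels.
features = [1.0, 2.0, 3.0, 4.0, 5.0]
channels = multi_channel_dilated_conv(
    features,
    kernels=[[1.0, 0.0, 1.0], [0.5, 1.0, 0.5]],
    dilations=[2, 1],
)
```

With dilation 2 and a kernel of size 3, each output position combines inputs two steps to the left and right, so the receptive field widens without adding parameters.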

Predicted output network (PON)
In this paper, a fully connected network is used to map the final perceived global and local correlation temporal features into the $|y|$-dimensional classification label space to obtain the prediction information $G \in \mathbb{R}^{|y|}$, as shown in equation (14). In equation (15), $\hat{y}_i$ is the actual predicted output. Cross entropy is used as the loss function of GLFN, and L2 regularization is used to penalize the network parameters and improve the generalization performance of the network. The objective function of the network is shown in equation (16).
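Equations (14) to (16) are not reproduced in this text. As a hedged sketch, the prediction step can be read as a linear map followed by Softmax, with an objective equal to the mean cross entropy plus an L2 penalty. The coefficient `lam`, the choice to penalize only the weight matrix, and all function names are assumptions.

```python
import math

def linear(h, weights, bias):
    """Fully connected layer: one score per relation label."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]

def softmax(g):
    m = max(g)
    e = [math.exp(v - m) for v in g]
    s = sum(e)
    return [v / s for v in e]

def predict(h, weights, bias):
    """Index of the relation label with the highest Softmax probability."""
    y_hat = softmax(linear(h, weights, bias))
    return max(range(len(y_hat)), key=y_hat.__getitem__)

def objective(batch, weights, bias, lam):
    """Mean cross entropy over (feature, label) pairs plus lam * ||W||^2."""
    ce = 0.0
    for h, label in batch:
        y_hat = softmax(linear(h, weights, bias))
        ce -= math.log(y_hat[label])
    l2 = sum(w * w for row in weights for w in row)
    return ce / len(batch) + lam * l2

# Toy usage: 2-dimensional features and 19 labels, as for SemEval-2010 Task 8.
W = [[0.0, 0.0] for _ in range(19)]
b = [0.0] * 19
batch = [([1.0, 2.0], 3), ([0.5, -1.0], 0)]
loss = objective(batch, W, b, lam=0.01)
```

With all-zero weights the Softmax output is uniform, so the cross entropy equals log(19); training lowers it from there, while the L2 term discourages large weights.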

Experimental data and evaluation indexes
In this paper, pre-trained GloVe word vectors are used to represent words, and two standard data sets, SemEval-2010 Task 8 and KBP37, are used to verify the validity of the proposed method. The SemEval-2010 Task 8 data set includes 8000 training samples and 2717 test samples, covering the Cause-Effect, Instrument-Agency, Product-Producer, Content-Container, Entity-Origin, Entity-Destination, Component-Whole, Member-Collection, Message-Topic, and Other relationships. Since the first 9 relationships are directional, 19 relationships of this data set are considered in this paper. The KBP37 data set includes 15917 training samples and 3405 test samples, covering 18 kinds of directional relations and a special relation "No_Relation". Therefore, a total of 37 relationships of the KBP37 data set are considered. In addition, we use the macro-averaged F1 value to evaluate the effectiveness of the relation extraction method.
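The label counts and the evaluation metric above can be made concrete with a short sketch. The helper names are hypothetical, and the text does not say whether the "Other" / "No_Relation" class is included in the average, so the `labels` argument leaves that choice to the caller.

```python
def label_count(directional, undirected):
    """Each directional relation yields two labels, one per argument order."""
    return 2 * directional + undirected

def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-label F1 scores over `labels`."""
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

semeval_labels = label_count(9, 1)    # 9 directional relations + Other
kbp37_labels = label_count(18, 1)     # 18 directional relations + No_Relation

# Toy usage with two labels.
gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
score = macro_f1(gold, pred, labels=["A", "B"])
```

In the toy example, label A has precision 1 and recall 0.5 (F1 = 2/3) and label B has precision 2/3 and recall 1 (F1 = 0.8), so the macro average weights both labels equally regardless of how many samples each has.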

Experimental results and analysis
Since the channel number t of GLFN-GAB represents the number of representation spaces, selecting an appropriate t can effectively improve the perception ability of GLFN-GAB. In this paper, different channel numbers t are selected to evaluate the macro-averaged F1 values of the global sensing branch alone (GLFN-GAB) and of GLFCNN (which includes the global sensing branch GLFN-GAB and the local sensing branch GLFN-LAB). The performance of these two methods on the SemEval-2010 Task 8 and KBP37 datasets is shown in Table 1 and Table 2, respectively. As can be seen from Table 1, GLFCNN performs better than GLFN-GAB for every channel number t. As can be seen from Table 2, GLFCNN and GLFN-GAB have the same F1 value at t=8, and GLFCNN performs better than GLFN-GAB for the other channel numbers. When t=4, GLFCNN achieves its highest F1 value on both data sets.
After obtaining the best channel number t, this paper also compares GLFN-GAB, GLFN-LAB, GLFCNN, and the relation extraction method without GLFCNN (Non-GLFCNN). Non-GLFCNN directly takes the output of the correlation temporal feature extraction network (CSFEN) as the input to the predictive output network (PON). Figures 3 and 4 show the training process of the four methods on the SemEval-2010 Task 8 and KBP37 data sets. Table 3 and Table 4 show the highest F1 values of the four methods on the SemEval-2010 Task 8 and KBP37 data sets, respectively. As can be seen from Table 3 and Table 4, GLFN has the highest F1 value compared with Non-GLFCNN, GLFN-GAB and GLFN-LAB. On the SemEval-2010 Task 8 and KBP37 datasets, GLFN's F1 reaches 86.2% and 65.1%, respectively. In addition, Table 5 shows the highest F1 values of GLFN and 14 mainstream English relation extraction methods (CNN [38], CNN+PF [39], multi-CNN [40], CR-CNN, Attention-CNN [41], Bi-LSTM, BiLSTM+feature, ATT-BiLSTM, HierLSTM [42], BiLSTM-Attention, FORESTFT-DDCNN, LST-AGCN, GCN and S-Att) on the SemEval-2010 Task 8 dataset. Finally, for the KBP37 data set, this paper also compares CNN, RNN, BiLSTM-CNN, Block+CNN, Ranking CNN, Att-RCNN, Bi-SDP-Att and GLFN. Table 6 shows the highest F1 values of the 8 methods on the KBP37 data set.

Table 6 (surviving entries, highest F1 in %): Bi-SDP-Att 64.5; 65.2
As shown in Table 5, on the SemEval-2010 Task 8 data set, the proposed GLFN has the highest F1 value compared with the 14 mainstream relation extraction methods. This is because GLFN senses the global and local correlation temporal features separately, which avoids mutual influence between global and local perception, and then splices and filters the two features to remove features with low contribution, so as to comprehensively represent the important semantics of sentences and improve the accuracy of English relation extraction [43][44][45][46].
LST-AGCN, which is based on a grammar graph, enriches sentence information from the perspective of syntactic dependencies, and thus has a higher F1 value than the other mainstream methods. However, this method requires an NLP tool to build the syntactic dependency tree, and its performance is highly dependent on that tool. As shown in Table 6, on the KBP37 data set, GLFN has the highest F1 value compared with CNN, RNN, BiLSTM-CNN, Ranking CNN, Att-RCNN, Block+CNN and Bi-SDP-Att. Compared with Bi-SDP-Att, which has the second highest F1 value, the F1 value of GLFN is higher by 0.7%. Moreover, Bi-SDP-Att needs to construct a bidirectional shortest dependency path and a corresponding attention mechanism, so its structure is relatively complex. Among the other 6 methods, Att-RCNN has a relatively high F1 value, and the GLFN proposed in this paper exceeds it by 3.3%.

Conclusion
At present, most neural networks only consider a single-branch information flow, so it is difficult to obtain enough semantic features for relation extraction. To solve this problem, a GLFN-based English relation extraction method is proposed. The method first uses SAM and BiLSTM to obtain the correlation temporal features of words. Second, ResNet is built to obtain global and local correlation temporal features separately, which avoids interaction between global and local perception. The two features are then spliced and filtered to comprehensively represent the essential semantic features of the sentence. Finally, Softmax is used to classify the relation. To verify the validity of the proposed method, extensive experiments are conducted on the standard SemEval-2010 Task 8 and KBP37 datasets. The experimental results show that the F1 value of the proposed method is 86.2% and 65.1%, respectively, which is better than mainstream relation extraction methods based on convolutional and recurrent neural networks. In the future, we will verify the validity of the proposed method on more language word vector models. At the same time, we will target unstructured data in different fields, extract the relationships between entity pairs, and build entity-relation triples, so as to establish domain-specific structured databases for further use in knowledge graphs, automatic question answering and other tasks. In addition, we will also develop a convenient user interface to provide users with a better experience.