Self-organizing incremental and graph convolution neural network for English implicit discourse relation recognition

Implicit discourse relation recognition is a sub-task of discourse relation recognition. It is challenging because it is difficult to learn argument representations that carry both rich semantic information and interaction information. To address this problem, this paper proposes a self-organizing incremental and graph convolutional neural network for English implicit discourse relation recognition. The method uses the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) to encode the arguments, and constructs a classification model based on a self-organizing incremental and graph convolutional neural network to obtain argument representations that benefit English implicit discourse relation recognition. The experimental results show that the proposed method is superior to the benchmark model on the contingency and expansion relations.


Introduction
Discourse relation recognition aims to identify the logical relationship between two text segments (phrases, clauses, sentences or paragraphs) in the same discourse. As basic research in the field of Natural Language Processing (NLP), discourse relation recognition is of great value to higher-level NLP applications [1,2], such as sentiment analysis [3], machine reading comprehension [4], summarization [5] and machine translation [6][7][8]. The task framework of discourse relation recognition is shown in figure 1: given a pair of arguments (Arg1, Arg2), a discourse relation classification model is used to identify the discourse relation between them.

Figure 1. Task framework
At present, the largest authoritative corpus in the field of discourse relation recognition is the Penn Discourse Treebank [9] (PDTB), which defines discourse relations as a three-layer semantic relation type hierarchy at different granularities. The four top-level semantic relations are comparison, contingency, expansion, and temporal. At the same time, according to whether a connective (also known as a cue word, such as "because") appears between the two arguments, PDTB divides discourse relations into two categories: explicit discourse relations and implicit discourse relations. An explicit discourse relation is one that can be inferred directly from an explicit connective. As shown in example 1, the explicit contingency argument pair contains the explicit connective "so", a clue that Arg2 is the result of Arg1; therefore, we can directly infer that the argument pair in example 1 holds a contingency relation. In contrast, implicit discourse relation argument pairs lack explicit connectives, so they depend more on morphological, syntactic, semantic and contextual features. For example, the word "hurricane" in example 2 is the reason why "precautionary mechanisms" are needed, from which it can be inferred that the discourse relation of this argument pair is contingency. Explicit discourse relation studies have achieved high classification performance; Pitler et al. [10] achieved 93.09% accuracy by using the mapping between explicit connectives and discourse relations. However, implicit discourse relation recognition performance remains relatively low: the F1 values of the best existing methods on the four top-level relations only reach about 53% [11]. Therefore, this paper focuses on the task of English implicit discourse relation recognition.
Previous studies have applied the attention mechanism to the calculation of argument representations [12][13][14][15][16] to evaluate the relevance of semantic information between arguments, so as to capture important semantic features that assist implicit discourse relation recognition. However, these studies focus either only on the semantic features of the arguments themselves or only on the features between arguments, and such a single kind of feature cannot fully represent the semantic information of the arguments. If we focus only on the interaction information between arguments, for example the word-pair information such as "good" and "ruined" in example 3, the argument pair is easily misidentified as a comparison relation. But if the model also captures each argument's own information, noting the words "not" and "good" in Arg1 and then, through the interaction between arguments, the word "ruined" in Arg2, the double negation of "not" and "ruined" [17] allows the contingency relation to be inferred. Therefore, this paper proposes a graph convolutional neural network (SIG) based on self-organizing increments and an interactive attention mechanism to construct the implicit discourse relation classification model. The model constructs its adjacency matrix from self-organizing increments and the interactive attention mechanism; it can therefore exploit the semantic features of each argument itself while capturing the interaction information between arguments, so as to encode better argument representations and improve the performance of implicit discourse relation recognition.
In this paper, the PDTB 2.0 [2] data set is used for experiments and evaluation. The results show that the proposed model SIG outperforms the benchmark model in English implicit discourse relation classification and outperforms current implicit discourse relation recognition models on many relations.

Related works
Existing research on implicit discourse relation recognition mainly falls into two directions: constructing complex classification models and mining large amounts of training data. Model construction mainly includes machine learning models based on feature engineering and neural network models based on argument representations. Early studies used a variety of linguistic features to construct statistical learning models.
On the PDTB data set, Pitler et al. [18] made the first attempt to use a variety of linguistic features to identify the four top-level implicit discourse relations, with experimental performance exceeding random classification. Lin et al. [19] designed a discourse relation recognition model based on context features, word-pair features, syntactic structure features and dependency structure features. Rutherford and Xue [20] extracted Brown clustering features to alleviate the sparsity of word pairs. Braud et al. [21] used existing unsupervised word vectors to train a maximum entropy model for implicit discourse relation classification based on shallow lexical features. Lei et al. [17] mined the semantic features of each relation and trained a naive Bayes model by combining the two cohesive devices of topic continuity and attribution source, achieving an F1 value of 47.15% in four-way classification, a performance that exceeded most existing neural network models.
Most current research on implicit discourse relation recognition builds complex neural network models to improve classification performance. Ji et al. [22] used two recursive neural networks to recognize implicit discourse relations based on the vector representations of arguments and entity spans. Zhang et al. [23] proposed a shallow convolutional neural network containing only one hidden layer to avoid over-fitting. Chen et al. [12] built on a bidirectional long short-term memory network (Bi-LSTM) and captured semantic interaction information between word pairs using a gated relevance network. Qin et al. [24] added a gated neural network (GNN) on top of a convolutional neural network to capture interaction information (such as word pairs) between arguments. Yin et al. [16] adopted a neural network model based on a multi-task attention mechanism, used the unlabeled external corpus BLLIP to generate a pseudo-implicit discourse relation corpus, and took its recognition as an auxiliary task to improve PDTB implicit discourse relation recognition performance. Bai et al. [13] constructed a complex argument representation model to extract argument features by integrating word vectors of different granularities, convolution, recursion, residual connections and attention mechanisms. Nguyen et al. [11] used the model from reference [13] and, in addition, mapped relation representations and connective representations into the same vector space based on knowledge transfer, thereby assisting implicit discourse relation recognition.
In view of the shortage of implicit discourse relation corpora, different methods have been used to expand the implicit portion of PDTB. Zhu et al. [25] mined instances consistent with the original corpus in semantics and relations from other data resources through argument vectors. Wu et al. [26] found that explicit-implicit mismatches exist in bilingual corpora, that is, the English side contains no connective while the corresponding Chinese side contains an explicit connective; based on this, they extracted a pseudo-implicit discourse relation corpus from the FBIS and Hong Kong Law corpora. Xu et al. [27] used an explicit discourse relation corpus to construct pseudo-implicit examples and selected highly informative samples based on active learning to expand the implicit discourse relation corpus. Ruan et al. [28] used "WHY" question pairs in a question answering corpus to generate pseudo-implicit argument pairs based on declarative conversion of questions, so as to expand the implicit causality corpus.

Methodology
The graph convolutional neural network (SIG) framework based on self-organizing increments and interactive attention proposed in this paper is shown in figure 2.

Figure 2. SIG model
Firstly, the representations of the two arguments are obtained by fine-tuning the BERT language model [29]. Secondly, a fully connected word-word graph is obtained from the spliced feature matrix and the adjacency matrix. Taking the word features as the initial features of the graph convolutional network (GCN), convolution and nonlinear transformation operations are performed in the hidden layers of the double-layer GCN to obtain the final word representations. Finally, the word representations are fed to the fully connected layer for dimensionality reduction, normalized with the softmax function, and the final classification result is obtained.

Vector representation layer
Given an argument pair Arg1 = (w1^1, w2^1, ..., wL^1) and Arg2 = (w1^2, w2^2, ..., wL^2), the input sequence to BERT is constructed as [CLS] Arg1 [SEP] Arg2 [SEP]. [CLS] is a special classification symbol, and its BERT-encoded vector representation can be used as the vector representation of the whole input sequence; [SEP] is a special symbol used to separate the two arguments in the input sequence.
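The input construction above can be illustrated with a small sketch. This is a hedged illustration using plain token lists; a real system would use BERT's WordPiece tokenizer, and the example tokens are hypothetical.

```python
# Illustrative sketch of the BERT input construction described above.
# Tokens are plain strings here; an actual implementation would use
# BERT's WordPiece tokenizer and convert tokens to vocabulary ids.

def build_input_sequence(arg1_tokens, arg2_tokens):
    """Concatenate two arguments into a single BERT-style input sequence.

    [CLS] is the special classification symbol whose encoded vector can
    stand for the whole sequence; [SEP] separates the two arguments.
    """
    return ["[CLS]"] + arg1_tokens + ["[SEP]"] + arg2_tokens + ["[SEP]"]

# Hypothetical argument pair (loosely echoing example 3 in the text):
seq = build_input_sequence(["it", "was", "not", "good"],
                           ["it", "was", "ruined"])
```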

Self-organizing Incremental Graph Convolutional Neural Network (WSOINNGCN)
The WSOINNGCN model framework is shown in figure 3 and consists of three parts. In the first part, the feature vector set of the data is obtained based on transfer learning. In the second part, a weighted self-organizing incremental neural network (WSOINN) is used to extract the topological structure of the feature data [31,32], and a few nodes are selected for manual annotation according to their numbers of victories. In the third part, a graph convolutional network (GCN) is built; the cross-entropy loss function and the Adam algorithm are used to optimize the network parameters, and the remaining nodes are labeled automatically. Finally, all data are classified based on Euclidean distance. As shown in figure 4, the VGG16 convolutional module trained on the ImageNet data set is used to extract the features of each sample, and the 512 feature maps obtained are pooled globally by averaging, so that each sample yields a 512-dimensional feature vector; the feature set of all samples is obtained in this way. SOINN can obtain the spatial topological graph structure of feature data, while GCN can mine the relationships in huge, sparse, high-dimensional association graph data. In order to integrate SOINN and GCN, this paper introduces a self-organizing incremental neural network with connection weights (WSOINN), using the number of node victories to select a small number of nodes for manual annotation. The first step of the WSOINN algorithm is to initialize the node set. According to the WSOINN algorithm, the connection weight W between two nodes represents their similarity: the larger the connection weight, the more similar the two nodes. Likewise, the more victories a node has, the more representative and important it is.
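The competition-and-weighting idea behind WSOINN can be sketched as follows. This is a heavily simplified illustration under assumed details (a fixed insertion threshold, unit edge-weight increments); the actual WSOINN algorithm involves adaptive similarity thresholds and node pruning not shown here.

```python
import numpy as np

# Minimal illustrative sketch of WSOINN-style competition: for each input,
# the nearest node "wins" (its victory count grows, making it more
# representative), and the connection weight W between the two nearest
# nodes is strengthened (larger W = more similar nodes). If the input is
# too far from every node, a new node is inserted. Threshold and update
# rules here are hypothetical simplifications.

rng = np.random.default_rng(0)
nodes = [rng.normal(size=2) for _ in range(3)]  # initialized node set
wins = [0, 0, 0]                                # victory count per node
edges = {}                                      # (i, j) -> connection weight W

def present(x, threshold=2.0):
    d = [np.linalg.norm(x - n) for n in nodes]
    order = np.argsort(d)
    s1, s2 = int(order[0]), int(order[1])       # first and second winners
    if d[s1] > threshold:                       # too far from all nodes:
        nodes.append(x)                         # insert a new node
        wins.append(0)
        return
    wins[s1] += 1                               # winner grows more important
    key = (min(s1, s2), max(s1, s2))
    edges[key] = edges.get(key, 0) + 1          # strengthen similarity link

for _ in range(50):
    present(rng.normal(size=2))
```

Nodes with the most victories would then be the ones chosen for manual annotation, as described above.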

Node feature matrix
Given two encoded argument representations H1 and H2, they are spliced as the node feature matrix X = [H1; H2] ∈ R^(2L×dk). On this basis, the graph convolution operation can be performed on the two argument representations at the same time, so as to obtain a feature matrix rich in both the arguments' own information and their interaction information.

Adjacency matrix
Considering that discourse relations depend on deep text understanding and on the information interaction between arguments, this paper constructs the adjacency matrix of the graph convolutional neural network from the self-attention score matrices and the interactive attention score matrix of the arguments, so as to obtain a fully connected graph whose nodes are the words of the two arguments. The calculation of the self-attention and interactive attention mechanisms used in this paper is introduced below. The self-attention mechanism [32] is applied to the argument representations H1 and H2 to measure the importance of each word representation, yielding the self-attention score matrix S ∈ R^(L×L) of each argument. Taking Arg1 as an example, the specific calculation is shown in equations (4) to (6).
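The block structure of this adjacency matrix can be sketched concretely as follows. This is a hedged sketch, not the authors' code: the dot-product self-attention, the bilinear interactive-attention weight W_b, and the row-wise softmax normalization are assumptions about details the equations (4)-(6) specify in the original.

```python
import numpy as np

# Illustrative construction of the adjacency matrix described above:
# self-attention scores within each argument fill the diagonal blocks,
# interactive (cross) attention scores fill the off-diagonal blocks,
# yielding a fully connected word-word graph over both arguments.

rng = np.random.default_rng(1)
L, d = 4, 8                            # toy argument length and dimension
H1 = rng.normal(size=(L, d))           # encoded representation of Arg1
H2 = rng.normal(size=(L, d))           # encoded representation of Arg2

def softmax_rows(M):
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

S1 = softmax_rows(H1 @ H1.T)           # self-attention scores of Arg1, L x L
S2 = softmax_rows(H2 @ H2.T)           # self-attention scores of Arg2
W_b = rng.normal(size=(d, d))          # hypothetical bilinear weight
C = softmax_rows(H1 @ W_b @ H2.T)      # interactive attention, Arg1 -> Arg2

A = np.block([[S1, C], [C.T, S2]])     # adjacency of the 2L-node word graph
```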

Graph convolution operation
The node feature matrix X and the adjacency matrix A of the graph convolutional neural network are obtained from the formulas above. The graph convolution features of the node feature matrix X are then computed layer by layer. The number of GCN layers is 2, and the specific calculation is shown in formula (11): H^(l+1) = tanh(A H^(l) W^(l)), l = 0, 1, where H^(0) = X and W^(l) is the trainable weight matrix of layer l.
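The two-layer propagation rule above can be sketched in plain numpy. The row-normalized random adjacency stand-in and the 0.1-scaled weight initialization are illustrative assumptions; in the model itself, A comes from the attention score matrices of the previous section and the weights are learned.

```python
import numpy as np

# Minimal two-layer GCN forward pass consistent with the description:
# each layer multiplies the shared adjacency A by the node features and a
# weight matrix, then applies the tanh activation used in this paper.

rng = np.random.default_rng(2)
n, d = 8, 16                           # 2L word nodes, feature dimension
X = rng.normal(size=(n, d))            # node feature matrix (spliced H1; H2)

A = rng.random(size=(n, n))
A = A / A.sum(axis=1, keepdims=True)   # row-normalized adjacency (stand-in)

W1 = 0.1 * rng.normal(size=(d, d))     # layer-1 weight (random stand-in)
W2 = 0.1 * rng.normal(size=(d, d))     # layer-2 weight

H = X
for W in (W1, W2):                     # both layers share the same A
    H = np.tanh(A @ H @ W)             # H^(l+1) = tanh(A H^(l) W^(l))
```

Note that, unlike a Transformer, both layers reuse the same adjacency matrix, which is one of the simplifications the paper credits for SIG's resistance to over-fitting.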

Full connection layer
In this paper, the updated feature representations output by the last GCN layer are summed word by word to obtain the sentence-level representation, which is fed to the fully connected layer for dimensionality reduction and normalized with the softmax function to obtain the final classification result.
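A minimal sketch of this classification head follows, assuming a binary (yes/no) output as used by the classifiers in this paper; all weights are random stand-ins for learned parameters.

```python
import numpy as np

# Sketch of the classification head described above: word features from
# the last GCN layer are summed word by word into a sentence-level
# vector, reduced by a fully connected layer, and normalized by softmax.

rng = np.random.default_rng(3)
H = rng.normal(size=(8, 16))           # output of the last GCN layer
v = H.sum(axis=0)                      # word-by-word summation -> (16,)

W_fc = 0.1 * rng.normal(size=(16, 2))  # fully connected layer (stand-in)
b_fc = np.zeros(2)
logits = v @ W_fc + b_fc               # dimensionality reduction to 2 classes

probs = np.exp(logits - logits.max())
probs = probs / probs.sum()            # softmax normalization
```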

Training
This paper constructs a binary classifier for each of the four top-level relation classes of the PDTB corpus. In the training process, the cross-entropy loss function is used as the objective function, and the Adam [33][34][35] optimization algorithm is used to update all model parameters. For a given argument pair (Arg1, Arg2) and its relation label y_i, the loss function is calculated as shown in equation (14).
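The cross-entropy objective referenced as equation (14) can be illustrated as follows; this is a generic sketch of the loss, not the authors' implementation.

```python
import numpy as np

# Cross-entropy loss for one argument pair: the negative log-probability
# the model assigns to the gold relation label. Confidently correct
# predictions give a small loss; confidently wrong ones give a large loss.

def cross_entropy(probs, y):
    """probs: predicted class distribution; y: gold class index."""
    return -np.log(probs[y])

probs = np.array([0.7, 0.3])           # model output for one sample
loss_correct = cross_entropy(probs, 0)  # gold label matches the 0.7 class
loss_wrong = cross_entropy(probs, 1)    # gold label is the 0.3 class
```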

Experimental data
In this paper, the implicit discourse relation recognition experiments are carried out with the SIG model on the Penn Discourse Treebank (PDTB) corpus. PDTB was proposed by Prasad et al. in 2008; it is drawn from 2304 articles in the Wall Street Journal (WSJ), with a total of 40600 discourse relation samples annotated, of which 16224 are implicit discourse relation examples. To keep consistent with previous work, this paper takes sections 02-20 as the training set, sections 00-01 as the development set and sections 21-22 as the test set. The data distribution of the four top-level semantic relations comparison (COM.), contingency (CON.), expansion (EXP.) and temporal (TEM.) is shown in Table 1. As can be seen from Table 1, except for EXP., the amount of data for the other three relations in the PDTB data set is small, and this inter-class imbalance leads researchers to usually train a separate binary classifier for each relation type for evaluation. Therefore, following previous work, this paper trains binary classification models on the training sets of the different discourse relations, obtaining four binary classifiers that each judge whether a sample contains the given relation, and evaluates their performance with the F1 value. Also following previous work, this paper does not integrate the four binary classification results for the same sample, and only considers the yes/no decision for a single relation category. In addition, because the PDTB data set has an imbalance between positive and negative samples, this paper randomly down-samples the negative samples to construct a training set with balanced positive and negative samples.
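The down-sampling step can be sketched as follows; the function name and data layout are hypothetical, and the counts are toy values.

```python
import random

# Illustrative sketch of the negative down-sampling described above:
# negatives are randomly sampled down to the number of positives so that
# each binary training set is balanced.

def balance(positives, negatives, seed=42):
    rng = random.Random(seed)
    sampled = rng.sample(negatives, len(positives))
    return positives + sampled

# Toy data: (argument pair, label) tuples for one relation's classifier.
pos = [("argpair", 1)] * 100
neg = [("argpair", 0)] * 400
train = balance(pos, neg)
```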
At the same time, to better compare with previous work, this paper also carries out four-way classification experiments on the PDTB data set, training a four-way classifier on the training set and evaluating it with the Macro-F1 value and accuracy.

Experimental setting
In order to show that using a GCN to fuse the self-attention and interactive attention mechanisms helps implicit discourse relation recognition, the following six comparison systems are set up in this paper.
1) BERT (baseline): the hidden-layer outputs of Arg1 and Arg2 are obtained by fine-tuning BERT and cut to obtain the two argument representations. Sentence-level representations are then obtained by word-by-word summation, spliced into the final feature, and input to the fully connected layer for classification.
2) Self: after using BERT to obtain the representations of Arg1 and Arg2, their self-attention scores are calculated separately and applied to the argument representations; the updated representations are summed word by word into sentence-level representations, which are spliced as the input of the fully connected layer.
3) Inter: after the BERT output representations are obtained, the interactive attention mechanism is used to obtain the interactive attention weight matrix, which is applied to the argument representations; the new representations are summed word by word and spliced into a sentence-pair-level representation, which is input to the fully connected layer for implicit discourse relation classification.
4) Concatenate: the sentence-level representations generated by the Self and Inter systems above are spliced and input to the fully connected layer for implicit discourse relation classification.
5) Transformer: the BERT-encoded representations of Arg1 and Arg2 are spliced as the input of a double-layer Transformer [36,37] with an eight-head attention mechanism; the word features encoded by the Transformer are summed word by word into a sentence-level representation of the argument pair and input to the fully connected layer for implicit discourse relation classification.
6) SIG: after the representations of Arg1 and Arg2 are obtained from BERT, the self-attention and interactive attention weight matrices are calculated. The two argument representations are spliced into the feature matrix, the attention weight matrices are spliced into the adjacency matrix, and a double-layer GCN is constructed. The output of the last GCN layer is summed word by word to obtain the sentence-level representations of the two arguments, which are input to the fully connected layer for implicit discourse relation classification.

Parameter setting
In this paper, the hidden-layer output of the fine-tuned BERT is used as the argument representation, where the hidden-layer vector dimension d_k is set to 768 and the maximum argument length L to 80. Based on the feature matrix constructed from the argument representations, the self-attention and interactive attention weight matrices of the arguments are spliced to obtain the adjacency matrix, a two-layer GCN is constructed, and the tanh function is used as the activation function of the model. When building the Transformer model, the Transformer encoder from the work of Subakan et al. [32] is used as one Transformer layer; a two-layer Transformer is used to transform the encoded argument representations, the hidden-layer dimension of the feed-forward network is set to 768, and GeLU [38] is used as the activation function. During training, cross-entropy is used as the loss function, and mini-batch gradient descent based on Adam is used to optimize the model parameters, with a batch size of 32 and a learning rate of 5e-5. Dropout is applied after the last GCN layer with a drop probability of 0.1.

Experimental results
Six neural network models with different structures are used to classify the four top-level implicit discourse relations of PDTB; the classification performance is shown in Table 2. The proposed model SIG outperforms the other five comparison models on multiple relations. The main reason is that SIG combines the advantages of the two attention mechanisms: while attending to the information of each argument itself, it also attends to the interaction information between the arguments, and updates the argument representations with this information. SIG can therefore generate argument representations better suited to the implicit discourse relation classification task. The Transformer model also uses an eight-head attention mechanism to capture information about the arguments themselves and the interaction between them. However, when the Transformer models the interaction between arguments, it only uses the dot-product matrix of the arguments as the attention score matrix, whereas the attention mechanisms SIG can use are more flexible; in this paper, a bilinear model is used to model the interaction between the two arguments. In addition, the Transformer uses eight-head attention while SIG uses only single-head self-attention; the values of the Transformer's attention score matrices differ across layers, while the GCN layers in SIG share the same adjacency matrix, whose element values indicate the strength of the connections between word nodes; and after each Transformer layer updates the argument features with attention, it further transforms them with a feed-forward network containing two fully connected layers and a residual mechanism. In contrast, the structure of the SIG model is simpler, which prevents over-fitting to a certain extent.
As a result, the Transformer performs better than SIG on the expansion relation, which has a large amount of data, but slightly worse on the other relations.
In addition, the performance of the Concatenate model is inferior to Self and Inter on almost all discourse relations. We believe this is mainly caused by two factors: first, simple splicing cannot model the complex relationship between the two arguments or the balance between the two attention mechanisms; second, the model suffers from a certain degree of over-fitting. In contrast, the proposed model SIG uses the GCN to weigh the two attention mechanisms, and the inherent weight-sharing property of the GCN prevents over-fitting to a certain extent, so SIG surpasses the other models in classification performance on almost all four discourse relation types.
In order to demonstrate the effectiveness of the proposed model SIG, we compared it with existing advanced models (see Table 3). Bai et al. [13] used character-level, sub-word-level and word-level representations based on Shahid [35] to construct multi-granularity argument representations, and combined convolution operations, residual mechanisms, interactive attention mechanisms and multi-task learning to construct a complex deep neural network. Building on Bai et al. [13], Nguyen et al. [11] mapped relation vectors and connective vectors into the same vector space based on knowledge transfer. Yin et al. [16] trained a multi-task model with the help of external data such as BLLIP. Within the same text, there are dependencies among discourse relations from top to bottom; Xu et al. [36] explored this feature in depth and constructed an implicit discourse relation classifier using ensemble learning. Compared with previous work, the proposed model SIG is relatively simple and uses only the standard PDTB data set for training, yet it outperforms the current optimal methods in classification performance on multiple relations. The main reasons are as follows: 1) the BERT pre-trained language model [39][40][41][42] already contains a large amount of prior knowledge, which helps implicit discourse relation recognition, a task that requires common-sense knowledge; 2) previous work usually uses the interactive attention mechanism to extract interaction information between arguments but ignores the importance of each argument's own information, whereas SIG integrates both. Table 4 shows the vocabulary distribution of the four PDTB relation categories used in this paper; each relation category contains a large number of out-of-vocabulary (OOV) words.
Researchers usually represent these OOV words with the special symbol "UNK" and initialize them uniformly to obtain a single shared word vector. This sidesteps the problem of finding vectors for OOV words, but it loses a certain amount of information and affects implicit discourse relation recognition. For example, in example 4, the OOV word "steamed" does not appear in the training set; without its vector, it is difficult to derive the causal relationship between "paused" and "reaching its high". BERT, however, can use the contextual information to initialize the vector of an OOV word, and "steamed forward" is the reason for "reaching its high", so it can be inferred that the discourse relation of this argument pair is contingency. Example 4 [Arg1]: Instead, the rally only paused for about 25 minutes and then steamed forward as institutions resumed buying.
[Arg2]: The market closed minutes after reaching its high for the day of
[Discourse relation]: Contingency.Cause.Result
In order to demonstrate the effectiveness of SIG, this paper uses the models Self, Inter and SIG to calculate the attention weight distributions for example 3, averages the attention weights word by word, and draws gray color blocks, obtaining the grayscale maps of the attention distributions of the three models on example 3 (see figure 5). As can be seen from figure 5, both Self and SIG focus on the words "not" and "good" in Arg1, but only SIG gives a high weight to the word "ruined" in Arg2. Thus, SIG can infer from the double negation of "not" and "ruined" that the implicit discourse relation between these two arguments is contingency. Experiments are also carried out on models constructed with different numbers of GCN layers; their performance is shown in Table 5. When the number of GCN layers is 2 (GCN2), the binary classifiers reach their maximum F1 values, while when the number of GCN layers is 4, the Macro-F1 value and accuracy of four-way classification reach 53.87% and 59.49%, respectively. This is mainly because the training set of each binary classification model is smaller than that of the four-way classification model, so the binary classifiers tend to over-fit when the number of GCN layers is large.

Conclusions
In this paper, a graph convolutional neural network model based on self-organizing increments and an interactive attention mechanism is proposed to recognize implicit discourse relations. Experimental results show that the proposed model SIG outperforms the benchmark model BERT and outperforms existing advanced methods on multiple relation classes. The results also show that implicit discourse relation recognition remains very challenging: the classification performance on the three categories other than EXP. is low and far from meeting the requirements of practical application. In future work, we will proceed in two directions: (1) mine high-quality implicit discourse relation corpora externally to address data imbalance;
(2) construct more sophisticated classification models that match the characteristics of the implicit discourse relation recognition task.