An automatic scoring method for Chinese-English spoken translation based on attention LSTM

In this paper, we propose an automatic scoring method for Chinese-English spoken translation based on attention LSTM. We select semantic keywords, sentence drift (the general idea of the sentence) and spoken fluency as the main scoring parameters. To improve the accuracy of keyword scoring, a synonym discrimination method is used to identify synonyms among the keywords in the examinees' answers. At the sentence level, an attention LSTM model is used to analyze how well examinees translate the general idea of the sentence. Finally, spoken fluency is scored based on speech rate and speech distribution. The final translation quality score is obtained as a weighted combination of the three parameter scores. The experimental results show that the proposed method is in good agreement with manual grading, and achieves the expected design goal when compared with other methods.


Introduction
Chinese-English spoken translation quality assessment has been one of the hot topics in the automatic assessment of Chinese-English translation quality in recent years [1]. In the automatic scoring of oral English, most studies target the evaluation of pronunciation quality [2], in question types such as reading questions and repeat questions. Zheng et al. [3] scored English reading questions through maximum likelihood linear regression and maximum a posteriori probability algorithms, and achieved certain results. However, there is still a lack of effective evaluation strategies for question types related to text content (keywords, sentence drift, etc.), such as interpretation questions and retelling questions. Although some scholars have carried out corresponding research, results that have actually been applied to large-scale speaking test scoring are very limited. Zhang et al. [4], for example, used speed, text coverage, keyword coverage and other indicators to score oral retelling questions, but this method lacked an overall analysis of the general idea of the sentence. Suyoun Yoon et al. [5] used a Siamese convolutional neural network to extract key information features of examinees' spoken sentences for scoring, but likewise did not study the general idea of sentences further. Therefore, there are still many challenges in establishing an effective automatic scoring model for Chinese-English spoken translation questions.
In recent years, many methods based on deep neural networks (DNN) have been widely used in natural language processing [6][7][8][9][10]. Automatic feature extraction using DNN greatly alleviates the feature dependence problem of traditional methods. At the same time, distributed word embeddings are used as the input of the DNN, so that the extracted features contain rich semantic information. DNN methods combined with distributed word vectors have achieved success in many tasks, with better accuracy and efficiency than traditional methods. For part-of-speech tagging, reference [11] used multiple hidden layers to extract features automatically, in order to avoid task-specific feature design as far as possible. It designed a unified architecture for part-of-speech tagging, entity recognition and chunking, which greatly eased the feature dependence problem of traditional methods and significantly improved the labeling results on each task. For Chinese word segmentation and part-of-speech tagging, a more concise and efficient deep neural network model was designed in reference [12], which achieved better results with fewer manually designed features. However, although the neural network models mentioned above reduce the workload of manual feature design compared with traditional tagging models, their actual effect is limited by the size of the word window, and the context information available to part-of-speech tagging is very limited. Current research shows that word categories are closely related to the surrounding contextual information. Reference [13] proposed using hierarchical long short-term memory (LSTM) to obtain a wider range of context information, and combined part-of-speech tagging with word segmentation so that the two tasks provide supplementary information to each other, thus improving the accuracy of part-of-speech tagging.
Reference [14] proposed adding a conditional random field (CRF) layer to the output layer of the LSTM network, and using the CRF layer to realize tag inference at the sentence level. The result was better than both the traditional CRF model and the model using the LSTM network alone. However, this reference only focused on part-of-speech tagging, chunking and entity recognition in English, and did not discuss experiments on part-of-speech tagging with Chinese corpora.
Recently, the attention mechanism has been introduced into natural language processing, and has achieved good results in machine translation [15], syntactic analysis and automatic summarization [16]. The attention mechanism assigns different probability weights to the hidden layer units of a neural network, so that the hidden layer pays more attention to the feature information that is most useful for the classification task and less attention to redundant information. Thus, for the same context sequence, adding a hidden layer with an attention mechanism can further improve the quality of the extracted features. This is well demonstrated by the application of the attention mechanism in syntactic analysis [17], where it enables the syntactic analysis model to learn long-distance syntactic dependency information. Part of speech is a syntactic functional category of words, so the accuracy of part-of-speech tagging is clearly affected by the contextual information in the sentence. Long-distance and specific syntactic structure information in particular helps to solve the problem of tagging multi-category words [18]. Adding an attention mechanism to a neural network annotation model makes it possible to capture this specific context information and improve the accuracy of the annotation model.
This paper introduces an automatic grading method for the quality of Chinese-English spoken translation. We select semantic keywords, sentence semantic similarity and oral fluency as the indexes for evaluating translation quality. In sentence-level Chinese-English translation, the translation of key words must convey the meaning, and the general meaning of the sentence must also be rendered accurately. For oral translation, fluency is also very important, since it reflects the overall level of the translator's oral English [19].
In the scoring of sentence-level Chinese-English oral translation questions, researchers generally pay attention to the accuracy of the translation and to the general idea of the whole sentence. This is the main reason why we choose the above three evaluation parameters. In many spoken English proficiency tests in China, Chinese-English spoken translation is the main question type. Therefore, automatic scoring of Chinese-English spoken translation questions has practical significance.
In the oral test, the reference standard for manual marking of sentence-level Chinese-English oral translation questions clearly states that the translation of key information accounts for 60% of the score, and the overall comprehension and expression of the sentence accounts for 40%. This scoring criterion should therefore also be considered when constructing the automatic scoring model [20].
In terms of key word scoring, the scoring model should consider not only the key words given in the answer, but also the synonyms related to those keywords [21]. Therefore, we establish a thesaurus related to the answer keywords, and then score each keyword according to how closely the examinee's word is related to the keyword in the thesaurus.
In terms of sentence comprehension, scholars usually score by calculating the similarity between the standard answer and the examinee's answer. With the development of deep learning technology [22][23][24], some neural network models based on deep learning can mine deeper semantic information in sentences. For this reason, some researchers have applied neural network models to automatic scoring tasks. For example, Qian Hussein et al. [25] built an adaptive deep learning speaking scoring system. Bshary et al. [26] used the LSTM to evaluate oral pronunciation. This paper attempts to apply deep learning to the scoring of the general idea of the sentence in the automatic scoring of Chinese-English spoken translation.

Keyword translation calculation method based on synonym discrimination
As can be seen from the scoring criteria of Chinese-English spoken translation (Table 2), the scoring of key words is very important. In order to accurately evaluate the key word translation in the answer, we need to consider two aspects: first, the number of key words from the reference answer that appear in the examinee's answer; second, the use of synonyms of those keywords in the answer. Therefore, it is necessary to judge whether the examinee's answer contains the key information required by the question, that is, to check the examinee's grasp of the keywords and their synonyms. To do so, this paper adopts a synonym discrimination method that combines Word2Vec with a semantic tree [27] to carry out semantic analysis and grading of candidates' oral keyword information at the lexical level.
From the semantic level, a sentence is usually composed of "key words" and "general words". "Key words" affect the meaning of the sentence and carry the key information required by the oral answer. "General words", in contrast, do not have a decisive influence on understanding the meaning of the whole sentence [28]. Therefore, in line with the requirements of manual scoring, this paper mainly scores the key words in the sentence; the influence of "general words" is taken into account when scoring the sentence drift. Because Word2Vec can mine the semantics of keywords and their synonyms and represent them as vectors, it can be used to calculate semantic similarity. In order to score the key information of the answer, we need to build a corpus for it, which should contain the key information words and their frequently used synonyms. To avoid including too many synonyms, the candidates' level of knowledge is also taken into account [29,30]. Before the experiment, we recorded all the keywords and frequently used synonyms to build a language library. Manual annotation was then carried out to form a text corpus for use in the two experimental schemes.
The following is a Chinese-English spoken translation question and its standard answer. The question is "学习决定成绩,成绩又促进学习进步". The standard answer is "Learning determines grade, and achievement promotes progress in learning." Here "Learning", "grade", "achievement" and "progress" are the key words of the standard answer. The keyword and synonym structures are shown in Figure 1. In Figure 1, the node words in each right single-branched tree are synonyms of each other, and the similarity decreases gradually to the right.
At the lexical level, this paper uses synonym discrimination to score key information; the specific steps are shown in Figure 2. First, the position of the candidate's key word in the semantic tree is identified. If the candidate's key word is not found in the semantic tree, the word is not included in the thesaurus representing the standard answer, and the key information point is lost. If the word is found at a node of the semantic tree, the examinee's keyword is converted into a semantic feature vector by the previously trained Word2Vec model, and the semantic similarity between the two words is calculated by cosine similarity. The operator symbol in Figure 2 represents the calculation of cosine similarity. The final key information score is the score to which the semantic similarity of the word is mapped.
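The lookup-then-similarity procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the two-dimensional vectors stand in for a trained Word2Vec model, the `semantic_tree` dictionary is a flattened stand-in for the tree in Figure 1, and the linear mapping from similarity to score is hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def keyword_score(candidate_word, semantic_tree, vectors, full_mark=1.0):
    """Score one candidate word against the keyword branches of the tree.

    semantic_tree maps each standard-answer keyword to its list of synonyms.
    vectors maps words to embedding vectors (toy values here, standing in
    for a trained Word2Vec model).
    """
    for keyword, synonyms in semantic_tree.items():
        if candidate_word == keyword or candidate_word in synonyms:
            sim = cosine_similarity(vectors[candidate_word], vectors[keyword])
            # Hypothetical mapping: clip negative similarity, scale linearly.
            return keyword, full_mark * max(0.0, sim)
    # Word not in the thesaurus: the key information point is lost.
    return None, 0.0

# Toy data (hypothetical, for illustration only).
semantic_tree = {"grade": ["score", "mark"], "progress": ["improvement"]}
vectors = {
    "grade": [1.0, 0.0],
    "score": [0.8, 0.6],
    "mark": [0.6, 0.8],
    "progress": [0.0, 1.0],
    "improvement": [0.6, 0.8],
}

print(keyword_score("score", semantic_tree, vectors))
print(keyword_score("banana", semantic_tree, vectors))
```

An exact keyword match yields the full mark, a synonym yields a reduced score that depends on its similarity to the keyword, and a word outside the tree scores zero.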

LSTM model
LSTM is a type of recurrent neural network (RNN) [31]. By introducing memory cells and a gating mechanism, it solves the vanishing gradient problem of traditional RNNs, and performs better at representing the context of elements in sequence data and at extracting long-distance dependencies. Figure 3(a) shows a single LSTM unit, and Figure 3(b) shows its internal structure.
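For reference, a commonly used formulation of the LSTM unit is given here. The weight matrices $W$, $U$ and the bias vectors $b$ with gate subscripts are notational assumptions, and may differ from the symbols used in Figure 3(b):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```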
Here, $c_t$ represents the state information of the memory unit. The bias parameters are the corresponding offset terms, $\sigma$ and $\tanh$ are the sigmoid and tanh activation functions respectively, and $\odot$ denotes element-wise (pointwise) multiplication of vectors.
Generally, information in an LSTM network flows in one direction only: the LSTM can use information from past moments but not from future moments. For tasks such as word segmentation and part-of-speech tagging, however, both the forward and the backward information of a sequence are important. Therefore, a reverse layer can be added to the LSTM network to form a bidirectional LSTM (BLSTM). The BLSTM consists of two LSTM layers with opposite directions, and its structure is shown in Figure 4. In Figure 4, the unrolled BLSTM network is divided into three layers: input layer, hidden layer and output layer. The hidden layer consists of a forward LSTM and a reverse LSTM, which calculate the forward and reverse hidden states respectively; these are then projected to the common output layer. Compared with a unidirectional LSTM, the bidirectional LSTM performs better in sequence feature acquisition and representation, because the hidden layer information flows in two opposite directions and captures both preceding and following context. It has therefore been applied in many natural language sequence annotation tasks.
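The bidirectional arrangement can be illustrated with a small pure-Python sketch. It is an illustration under stated assumptions rather than the network used in this paper: the weights are random and untrained, the sizes are toy values, and the forward and reverse states are combined by simple concatenation instead of a learned projection.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params[g] = (W, U, b) for gate g in 'i', 'f', 'o', 'g'."""
    def affine(gate):
        W, U, b = params[gate]
        return [
            sum(w * xi for w, xi in zip(W[k], x))
            + sum(u * hi for u, hi in zip(U[k], h_prev))
            + b[k]
            for k in range(len(b))
        ]
    i = [sigmoid(z) for z in affine("i")]    # input gate
    f = [sigmoid(z) for z in affine("f")]    # forget gate
    o = [sigmoid(z) for z in affine("o")]    # output gate
    g = [math.tanh(z) for z in affine("g")]  # candidate memory
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def run_lstm(xs, params, hidden_size):
    """Run a unidirectional LSTM over xs and return the hidden state per step."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        states.append(h)
    return states

def init_params(input_size, hidden_size, rng):
    """Random, untrained parameters (for illustration only)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        gate: (mat(hidden_size, input_size),
               mat(hidden_size, hidden_size),
               [0.0] * hidden_size)
        for gate in "ifog"
    }

def run_blstm(xs, fwd_params, bwd_params, hidden_size):
    """Forward pass left to right, reverse pass right to left, then concatenate."""
    forward = run_lstm(xs, fwd_params, hidden_size)
    backward = run_lstm(xs[::-1], bwd_params, hidden_size)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

rng = random.Random(0)
xs = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]  # three toy word vectors
fwd = init_params(3, 4, rng)
bwd = init_params(3, 4, rng)
states = run_blstm(xs, fwd, bwd, 4)
print(len(states), len(states[0]))
```

Each position receives a state whose first half summarizes the words up to that position and whose second half summarizes the words from that position to the end of the sentence.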

Attention mechanism
Part of speech is a syntactic functional category of words, so the tagging process is influenced by the syntactic structure of the sentence and is most closely related to the words on which the current word has important syntactic dependence. Other words have no obvious effect on the labeling of the current word. The attention mechanism is an effective mechanism for allocating probability weights. By calculating the attention probability weights at different moments, the nodes that are highly relevant to the annotation of the target word receive more attention and are assigned larger probability weights. In this way, the quality of the hidden layer feature vectors can be improved. The structure of the basic attention model is shown in Figure 5.
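The weight allocation can be made concrete with a short Python sketch. It assumes dot-product scoring between a query vector and each hidden state, which is one common choice and is not confirmed as the scoring function of Figure 5; the vectors are toy values.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into probability weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, hidden_states):
    """Dot-product attention (assumed scoring function).

    Returns the attention weights and the weighted sum of hidden states.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * state[k] for w, state in zip(weights, hidden_states))
        for k in range(dim)
    ]
    return weights, context

hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states
weights, context = attend([1.0, 1.0], hidden_states)
print(weights)
print(context)
```

The third hidden state is the most similar to the query, so it receives the largest weight and dominates the weighted sum.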

Chinese-English Spoken Translation based on attention LSTM
The Chinese-English spoken translation model proposed in this paper adds an attention mechanism on the basis of BLSTM; the specific model structure is shown in Figure 6. It mainly includes three parts: input layer, hidden layer and output layer. The hidden layer is composed of a one-way LSTM layer, a two-way LSTM layer and an attention layer. The three parts are introduced below.
1) Input layer. With a one-hot word representation, the vector dimension is large and most element values are 0, so there is a serious data sparsity problem. Distributed word vectors represent words as low-dimensional, dense real-valued vectors, which contain rich semantic information and are widely used in many natural language processing tasks. This paper adopts the distributed word vector representation, using the Google word2vec tool. The word vector matrix M is formed through pre-training; each word is looked up in the word vector matrix [32,33] and transformed into its corresponding word vector $x_t$, which is used as the input of the BLSTM network.
2) Hidden layer. The calculation is mainly divided into three steps.
Step 1. The forward and reverse hidden states of the LSTM are calculated from the input word vectors. Here m is the dimension of the hidden units. The LSTM() function represents the nonlinear transformation of the LSTM network; its main function is to encode each input word vector into the corresponding hidden state vector.
Step 2. The BLSTM hidden layer is calculated from the LSTM forward hidden state and reverse hidden state, together with the corresponding offset term. The hidden layer aggregates the forward and backward sequence information of the current element in the input sequence, which provides richer contextual features for oral translation.
Step 3. According to the attention mechanism, probability weights are assigned to the BLSTM hidden layer, and the new attention hidden layer is calculated. Since the BLSTM contains both forward and reverse layers, both are involved in this calculation. The dimension of the new attention hidden layer obtained in Equation (14) is the same as that of the original hidden layer. Since the attention probability distribution differs at each moment, the new attention hidden layer can focus on different parts of the initial hidden layer and of the input sequence at each moment [34][35][36]. Therefore, the initial hidden states play a different role at each moment in Chinese-English spoken translation. Among them, the hidden states that have a great influence on the labeling of the current word receive correspondingly larger attention probabilities.
3) Output layer. The softmax function is used to calculate the probability distribution over the tag set at each time step.
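One plausible way to write these layer computations is sketched here. All symbols other than $x_t$ are assumptions introduced for illustration (the weight matrices $W_f$, $W_b$, $W_y$, the biases $b_h$, $b_y$, the scoring function $\mathrm{score}$ and the sequence length $T$); they are not recovered from the original equations of this paper.

```latex
\begin{aligned}
\overrightarrow{h}_t &= \mathrm{LSTM}\bigl(x_t, \overrightarrow{h}_{t-1}\bigr), \qquad
\overleftarrow{h}_t = \mathrm{LSTM}\bigl(x_t, \overleftarrow{h}_{t+1}\bigr) \\
h_t &= W_f \overrightarrow{h}_t + W_b \overleftarrow{h}_t + b_h \\
\alpha_{t,i} &= \frac{\exp\bigl(\mathrm{score}(h_t, h_i)\bigr)}{\sum_{j=1}^{T} \exp\bigl(\mathrm{score}(h_t, h_j)\bigr)} \\
\tilde{h}_t &= \sum_{i=1}^{T} \alpha_{t,i}\, h_i \\
y_t &= \mathrm{softmax}\bigl(W_y \tilde{h}_t + b_y\bigr)
\end{aligned}
```

In this form the attention hidden state $\tilde{h}_t$ is a weighted average of the BLSTM hidden states, so it has the same dimension as $h_t$, in line with the remark in Step 3.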

The scoring method of oral fluency
In terms of pronunciation, this paper mainly analyzes the examinee's oral fluency. Oral fluency is an important indicator by which teachers directly evaluate candidates' oral pronunciation ability. Oral fluency is mainly reflected in the speaker's speed, so this paper evaluates oral fluency based on speech rate and speech distribution. The speed feature is the average pronunciation time of each word.
First, the number of words n and the duration of the i-th word in the candidate's spoken answer are obtained by a double-threshold endpoint detection method based on short-term energy and zero-crossing rate. Then, formula (16) is used to calculate the speech-rate feature. If the candidate's speed is greater than the set threshold, the answer is judged to be fluent, and the fluency score is given by a score mapping function. If the examinee's speed is less than the set threshold, the examinee's speech distribution is judged to be uneven and does not meet the pronunciation requirements of the oral answer.
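The quantities involved can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: speed is expressed as words per second (the reciprocal of the average word duration), the thresholds `min_rate` and `full_rate` and the linear score mapping are hypothetical, and an answer below the threshold is simply given a score of zero. The two frame-level helpers show the features on which endpoint detection relies; they do not implement the double-threshold procedure itself.

```python
def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def average_word_duration(word_durations):
    """Mean pronunciation time per word, in seconds."""
    return sum(word_durations) / len(word_durations)

def fluency_score(word_durations, min_rate=1.5, full_rate=3.0, full_mark=1.0):
    """Map speaking speed (words per second) to a fluency score.

    min_rate, full_rate and the linear mapping are hypothetical values.
    """
    rate = 1.0 / average_word_duration(word_durations)
    if rate < min_rate:
        # Below the threshold: treated as not meeting the fluency requirement.
        return 0.0
    return full_mark * min(1.0, rate / full_rate)

durations = [0.4, 0.5, 0.3, 0.4]  # toy word durations in seconds
print(average_word_duration(durations))
print(fluency_score(durations))
```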

Fractional fusion model
The total score of the automatic scoring model for Chinese-English spoken translation is obtained through the following steps:
(1) Input the examinee's spoken answer, calculate the average length of the speech segments and the average pause time using the energy and zero-crossing-rate segmentation described above, calculate the examinee's speaking speed, and then calculate the fluency score.
(2) Using the trained Word2Vec vectors, determine the degree of match between the keywords in the answer audio and the keywords in the model library. The position of each keyword in the semantic tree is matched and the keyword score is calculated.
(3) Using Word2Vec and the long short-term memory neural network model, transform the semantic features of all sentences in the corpus into vectors. Match the examinee's answer against the general idea of the sentence in the standard answer, and give the score for the general idea of the sentence.
(4) The final total score is the weighted sum of the three scoring indexes, as shown in equation (17):
$$S = w_1 S_{\mathrm{key}} + w_2 S_{\mathrm{sen}} + w_3 S_{\mathrm{flu}} \qquad (17)$$
Here, $w_1$, $w_2$ and $w_3$ are the weights of the keyword score $S_{\mathrm{key}}$, the sentence drift score $S_{\mathrm{sen}}$ and the fluency score $S_{\mathrm{flu}}$, respectively. After analysis by the linear regression prediction method, the weights are set to 0.6, 0.3 and 0.1 in sequence. Finally, the total score is mapped to the A, B, C and D grade ranges.
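The fusion and grade mapping can be sketched as follows. The weights 0.6, 0.3 and 0.1 come from the text; the rest are assumptions made for illustration: each sub-score is taken to be normalized to the range 0 to 1, the full mark of 2 points is applied as a scale factor, and the lower bounds for grades A and B (1.5 and 1.0) are assumed to continue the 0.5-point steps given for grades C and D.

```python
WEIGHTS = (0.6, 0.3, 0.1)  # keyword, sentence drift, fluency (from the text)

def total_score(keyword, sentence, fluency, weights=WEIGHTS, full_mark=2.0):
    """Weighted sum of three sub-scores, each assumed to lie in [0, 1]."""
    w1, w2, w3 = weights
    return full_mark * (w1 * keyword + w2 * sentence + w3 * fluency)

def grade(score):
    """Map a total score to a grade.

    The C and D ranges follow the score ranges stated for those grades;
    the A and B lower bounds are assumptions.
    """
    if score >= 1.5:
        return "A"
    if score >= 1.0:
        return "B"
    if score >= 0.5:
        return "C"
    return "D"

strong = total_score(0.9, 0.8, 1.0)
weak = total_score(0.1, 0.2, 0.5)
print(strong, grade(strong))
print(weak, grade(weak))
```

Because the keyword weight is the largest, an answer with poor keyword coverage cannot reach a high grade even when it is delivered fluently.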

Experiments and analysis
In order to verify the effectiveness of the proposed method, relevant experiments were carried out. The questions were selected from the interpreting and listening test of a university in June 2020. The first part of the test is Chinese-English spoken translation. There are 5 questions in this part (Table 1), and each question is worth 2 points. We collected 3100 real audio recordings with accurate manual marking. Each question has 620 recordings, each no longer than 20 seconds, recorded by 620 students in 6 different examination rooms. In order to reduce the influence of the graders' subjectivity on the grading results, each spoken answer was graded independently by two teachers, and the average of the two teachers' scores was taken as the manual grade.
Table 1. Chinese-English spoken translation questions (excerpt)
Reference answer: In recent years, the research of machine translation has made great progress, and the translation quality has been constantly improved.
Question: 在人机交互和高级用户接口应用领域中,我们希望未来的机器能像人一样与我们更加容易和便捷地交流,如手势驱动控制、手语翻译等。
Reference answer: In the field of human-computer interaction and advanced user interface applications, we hope that the future machines can communicate with us more easily and conveniently like human beings, such as gesture driven control, sign language translation and so on.
Note: keywords are marked in bold.
Before keyword scoring and sentence general idea scoring, the 3100 spoken answers need to be manually transcribed into text. Meanwhile, in order to extract effective semantic features from the examinees' answers, stop words such as "the", "a", "to", "this" and "can" are all removed. In the experiments, modeling and evaluation are carried out on each of the 5 questions separately, and the result of the overall scoring model is taken as the average over the 5 experiments. The data set of each question is divided into training and test sets at a ratio of 7:3.
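The preprocessing and data split can be sketched as follows. The stop-word set contains only the five examples named in the text, and the whitespace tokenization, the shuffling and the fixed seed are assumptions made for illustration.

```python
import random

# Only the examples listed in the text; the full stop-word list is not given.
STOP_WORDS = {"the", "a", "to", "this", "can"}

def remove_stop_words(text):
    """Lower-case, split on whitespace and drop stop words (no punctuation handling)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def split_train_test(samples, train_ratio=0.7, seed=0):
    """Shuffle the samples and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]

tokens = remove_stop_words("Learning determines the grade")
train, test = split_train_test(range(620))  # 620 answers per question
print(tokens)
print(len(train), len(test))
```

With 620 answers per question, the 7:3 ratio gives 434 training samples and 186 test samples.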
This paper establishes a scoring model based on three scoring indexes: keywords, sentence drift and oral fluency. The weights of the three indexes are set to 0.6, 0.3 and 0.1, respectively. Following the teachers' suggestions, and combining the translation principle of "faithfulness, expressiveness and elegance" with the teachers' grading standards for translation questions, this paper sets four grades, A, B, C and D, for a single translation question with a total score of 2 points. The scoring standard and corresponding score range of each grade are shown in Table 2.
Table 2. Scoring standard and score range of each grade
Grade A: The key information is accurate, the language expression is fluent, the vocabulary is used properly, the general idea of the sentence is correct, and the overall grasp is excellent.
Grade B: The key information is not accurate enough, the language expression is smooth, the vocabulary is used properly, and the general idea of the sentence is not well understood.
Grade C (0.5≤score<1.0): The key information is relatively accurate, but the language expression is not smooth, the general idea of the sentence deviates, and the overall grasp is average.
Grade D (0≤score<0.5): The key information is inaccurate or irrelevant, the language expression is not smooth, the general meaning of the sentence deviates greatly, and the overall grasp is poor.
The recording parameters of the voice data are: mono, 22050 Hz sampling rate, 16-bit coding. Keywords are converted into word vectors by Word2Vec, with a word vector dimension of 100. The learning rate is 0.001. The number of neurons in the LSTM layer is set to 100, the tanh function is used as the activation function, and Adam is used as the optimization method. The batch size for each training step is 3.
This paper uses consistency (accuracy) and Pearson correlation coefficient to evaluate the scoring ability of the automatic scoring model for Chinese-English spoken translation. The consistency rate and Pearson correlation coefficient (r value) are shown in equation (18) and equation (19).
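The two metrics are usually defined as follows. The symbols $N$ (number of answers), $N_{\text{agree}}$ (number of answers on which the model grade and the manual grade coincide) and $y_i$, $\bar{y}$ (manual score and its mean) are assumptions introduced here, as is the reading of consistency as exact agreement of grades:

```latex
\begin{aligned}
\text{consistency rate} &= \frac{N_{\text{agree}}}{N} \\
r &= \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}
\end{aligned}
```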
Here, $x_i$ represents the model score, and $\bar{x}$ is the mean value of the model scores. In order to verify the effectiveness of the proposed method, the following comparative experimental schemes are adopted.
1) The oral fluency scoring method and the keyword scoring method remain unchanged. The LSTM model is not used in sentence drift scoring; instead, the word vectors generated by Word2Vec are directly averaged to obtain the sentence semantic representation for scoring.
2) On the basis of experimental scheme 1, Word2Vec in keyword scoring and sentence general idea scoring is replaced by BERT, a pre-trained model with good performance at present.
3) On the basis of experimental scheme 2, BERT and LSTM models are used together for sentence general idea scoring.
The experimental results are shown in Tables 3 to 7.
(1) Table 3 compares some of the experimental results. It can be seen that the scores of the model in this paper are close to those of the teachers. Table 4 shows that the average consistency rate of the automatic scoring model for Chinese-English spoken translation built in this paper is 0.8569 over the five questions. The highest value reaches 0.8853, indicating that the model's scores are close to the teachers' actual scores and that the model is effective. In the actual grading process, teachers score at their own discretion, while the model scores according to the established evaluation indexes. Therefore, model scoring with unified scoring rules has higher objectivity, and this difference in procedure can explain the differences between the model scores in this paper and the teachers' scores.
(2) According to Tables 4, 5 and 6, the result obtained with the BERT model alone is 0.022 higher than that obtained with the Word2Vec model alone, indicating that BERT does improve the acquisition of effective information to a certain extent through its multilayer bidirectional encoding process. Although Word2Vec+LSTM has a lower agreement rate than BERT+LSTM on questions 2 and 4, it achieves the best average agreement rate over the 5 questions. This indicates that the Word2Vec and LSTM models used in this paper combine well, improving the representation of keyword semantics and sentence semantics to a certain extent and thus improving the accuracy of automatic oral scoring.
(3) Table 7 shows that the correlation between the scores of the model in this paper and the average teacher scores reaches 0.8490, indicating a strong correlation between the scores predicted by the model and the teachers' actual scores. The highest value reaches 0.9095, indicating that the model score for question 5 has a very high correlation with the teachers' score. This indicates that the introduction of synonym discrimination, the LSTM neural network model and the speed-based fluency evaluation enhances the scoring ability of the automatic scoring model for Chinese-English spoken translation.
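The two evaluation metrics reported in Tables 4 to 7 can be computed with a few lines of Python. This is a minimal sketch: treating consistency as exact agreement between model grade and teacher grade is an assumption, and the sample values are toy data rather than experimental results.

```python
import math

def consistency_rate(model_grades, teacher_grades):
    """Share of answers on which the model grade equals the teacher grade."""
    matches = sum(1 for m, t in zip(model_grades, teacher_grades) if m == t)
    return matches / len(model_grades)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical, for illustration only).
print(consistency_rate(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```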

Conclusion
Taking Chinese-English spoken sentence translation in higher education examinations as the research object, this paper analyzes the scoring mechanism of Chinese-English spoken translation questions and establishes an objective and effective automatic scoring model for them. The study shows that synonym discrimination for key words, the LSTM model for sentence translation analysis, and fluency rating based on speed and pronunciation distribution all give good results. The directions for further work are as follows: (1) The size of the corpus needs to be increased. At present, it is difficult to find public test data for oral translation questions, so the research mainly builds scoring standards and test databases for specific tests. The resulting database is small, which affects the extraction and training of semantic features by the neural network model and, in turn, the accuracy of semantic scoring. (2) A speech recognition module should be added. In this paper, manual transcription is used in place of speech recognition. Only the combination of speech recognition and automatic scoring can form a complete oral evaluation system. (3) More evaluation indexes should be added. At present, this paper does not cover grammar or intonation. Adding such evaluation indexes to the scoring model will make the scoring mechanism more comprehensive and the scoring results more objective.