Incremental Knowledge Acquisition for WSD: A Rough Set and IL based Method

Word sense disambiguation (WSD) is one of the trickier tasks in natural language processing (NLP), as it must take full account of the complexities of language. Because WSD involves discovering semantic structures in unstructured text, automatic acquisition of word sense knowledge is profoundly difficult. To acquire knowledge about Chinese multi-sense verbs, we introduce an incremental machine learning method that combines the rough set method with instance-based learning. First, the context of a multi-sense verb is extracted into a table; its sense is annotated by a skilled human and stored in the same table. In this way a decision table is formed, and rules can then be extracted within the framework of rough set attribute value reduction. Instances not entailed by any rule are treated as outliers. When new instances are added to the decision table, only the newly added instances and the outliers need to be learned further, so learning is incremental. Experiments show that this method reduces the scale of the decision table dramatically without a decline in performance.


Introduction
Automatic knowledge acquisition is one of the most widely used technologies in NLP. However, knowledge about word sense disambiguation has remained a bottleneck. The sense of an ambiguous word is determined by its context, but the knowledge in the context that reveals a word's meaning is varied, incomplete and uncertain. How to acquire useful semantic knowledge from context is a research hotspot in the NLP community.
Rough set (RS) theory is a mathematical tool for describing incomplete and uncertain knowledge. It can be used to analyze and process inaccurate, inconsistent and incomplete information efficiently [11] [12], and further to discover implied knowledge and reveal potential rules. It depends only on the intrinsic structure of the data. Since rough set theory was proposed, many researchers have devoted themselves to the attribute reduction problem [13] [14] [15] [16]. The rough set approach has been applied successfully in many areas, including data mining [17] [18], decision-making [3] [4], forecasting [5], machine diagnosis [6], recommendation and filtering [7] [8], and personal investment portfolio analysis [9]. Rough set based methods often work together with other machine learning methods to boost performance [3] [4] [5] [6] [7] [8] [9].
Instance-based learning (IL), also called example-based, memory-based, or case-based learning, is able to learn outliers in data well. Since this is highly desirable for natural language processing in general, IL is widely used in NLP [19] [20]. IL is often integrated with other methods to increase classifier accuracy, for example with a simulated annealing genetic algorithm in [21] and with a Naïve Bayes classifier in [22].

EAI Endorsed Transactions on Scalable Information Systems
In rule-based NLP, rules represent general knowledge but ignore many outliers. The outliers, though they occur only occasionally, have a great impact on the accuracy of many NLP tasks. In addition, the accuracy of manually created rules needs to be verified. Therefore, when statistical methods can no longer improve accuracy, acquiring rules efficiently and automatically while also processing outliers becomes an alternative.
The RS based knowledge discovery for Chinese multi-sense verbs proposed in this paper uses RS as the mathematical tool to discover implied context information in part-of-speech (POS) tagged Chinese text. This information determines word meaning and is used in WSD.

Brief introduction to RS theory
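Suppose there is a knowledge representation system $S = (U, A, V, f)$, where $U$ is the universe of objects, $A = C \cup D$ is the set of condition and decision attributes, $V$ is the set of attribute values, and $f: U \times A \to V$ is the information function. In effect, $f$ is a decision table that determines the value of each attribute of an element in $U$.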
Decision tables are a precise yet compact way to model complex rule sets and their corresponding actions. In RS, they are used to represent the objects in the universe. A decision table has two dimensions: each row represents one object and each column represents one attribute of the objects. Attributes are divided into two classes: condition attributes and decision attributes. Objects in the universe are divided into different decision classes according to their condition attributes.
Table 1 contains a set of condition attributes and a decision attribute D. For a classification task, not all condition attributes are required: some are redundant and can be removed without degrading classification performance. A reduction is defined as a minimal set of condition attributes that contains no redundant attribute yet still ensures correct classification.
After Table 1 is reduced, Table 2 is obtained. Reduction is the essence of the RS method. It is a hotspot in machine learning and data mining and is also the theoretical basis of the proposed method.
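To make the notion of reduction concrete, the following is a minimal Python sketch, not the procedure used in this paper. It assumes a consistent decision table, searches attribute subsets by brute force (exponential in the number of attributes), and uses invented toy values rather than the contents of Table 1. The names `is_consistent` and `minimal_reducts` are hypothetical.

```python
from itertools import combinations


def is_consistent(rows, attrs, decision):
    """True if objects that agree on `attrs` always agree on `decision`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, row[decision]) != row[decision]:
            return False
    return True


def minimal_reducts(rows, condition_attrs, decision):
    """Brute-force search for the smallest attribute subsets that still classify correctly."""
    for size in range(1, len(condition_attrs) + 1):
        found = [subset for subset in combinations(condition_attrs, size)
                 if is_consistent(rows, subset, decision)]
        if found:
            return found
    return []


# Toy decision table with invented values (assumption, not the paper's Table 1).
TABLE = [
    {"A": 0, "B": 0, "C": 1, "D": "no"},
    {"A": 0, "B": 1, "C": 1, "D": "yes"},
    {"A": 1, "B": 0, "C": 0, "D": "no"},
    {"A": 1, "B": 1, "C": 0, "D": "yes"},
]

print(minimal_reducts(TABLE, ["A", "B", "C"], "D"))
```

In this toy table attribute B alone separates the two decision classes, so A and C are redundant.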

Description of word meaning
A word has a clear meaning in a specific context, i.e., it is context that determines the sense of a word.
Suppose that the set of sense items of the multi-sense word W is $\{s_1, s_2, \ldots, s_n\}$. Because context words form an open set, we must go through a reduction process when the sense of W is determined. Data sparseness is one problem in attribute value reduction. Some elementary categories are covered by a partition of condition attributes, but the coverage may be incomplete; that is, these categories should not in fact be covered.
In theory, knowledge acquired by RS reduction loses no information in the corpora, but corpora are only an approximate representation of part of the information in natural language. The amount of information in corpora is therefore less than the knowledge in natural language, so knowledge discovered by corpus-based attribute value reduction only approximates natural language.

Brief introduction to IL
Instance-based learning is a machine learning method that evolved from memory-based reasoning. This reasoning pattern supposes that reasoning is more a process of similarity comparison based on experience than a condition-action process based on concept induction. It rests on the hypothesis that the learning process is memory based.
The knowledge representation in IL is attribute logic, too. The similarity degree is computed with formula (3).
According to the knowledge representation system defined in section 2.1, the distance between two instances $X$ and $Y$ is $\Delta(X, Y) = \sum_{i=1}^{n} w_i\,\delta(x_i, y_i)$, where $w_i$ is the weight of attribute $i$, which determines its importance, and $\delta(x_i, y_i)$ is the distance between the values of attribute $i$ in the two instances. Classification based on formula (1) needs to traverse all instances in memory, find the instance N with the nearest distance, and tag the instance being classified with the decision attribute of N.
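The traversal described above can be sketched as a weighted nearest-neighbour search. This is only an illustration: the per-attribute distance is assumed here to be a simple overlap measure (0 for equal values, 1 otherwise), and the function names, weights and instances are invented for the example.

```python
def overlap(a, b):
    """Per-attribute distance: 0 if the values match, 1 otherwise (an assumption)."""
    return 0 if a == b else 1


def weighted_distance(x, y, weights):
    """Sum of per-attribute distances, each scaled by its attribute weight w_i."""
    return sum(w * overlap(a, b) for w, a, b in zip(weights, x, y))


def classify(instance, memory, weights):
    """Traverse all stored (attributes, decision) pairs and return the nearest decision."""
    nearest = min(memory, key=lambda item: weighted_distance(instance, item[0], weights))
    return nearest[1]


# Invented instances and weights, for illustration only.
MEMORY = [
    (("n", "v", "n"), "sense_1"),
    (("a", "v", "n"), "sense_2"),
]
WEIGHTS = [2.0, 1.0, 1.0]

print(classify(("n", "v", "a"), MEMORY, WEIGHTS))
```

Because every stored instance is visited, the cost of one classification grows linearly with the size of the memory, which is one motivation for shrinking the decision table.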

Natural language and instance-based learning
The similarity reasoning mechanism of IL has two advantages in NLP.

Knowledge acquisition without information loss
A lot of knowledge in natural language processing is difficult to represent and acquire. Learning in memory eases this problem to some degree.

Paying more attention to outliers
The difficulty in representing natural language is that it contains many outliers. The reasoning mechanism of IL pays more attention to outliers. IL reasoning is more accurate than dualism, a reasoning scheme in which whatever is not A must be B.

RS and IL based knowledge acquisition of Chinese multi-sense verbs
We use the Contemporary Chinese Dictionary (CCD) and HowNet as lexicographic resources and select verbs that have multiple senses in both dictionaries; otherwise, when a verb has only one sense, that sense is taken as the verb's sense.
There are 2 major modules in knowledge acquisition: original decision table generation and RS based attribute reduction.

Module of original decision table generation
The generation of the original table is shown in Figure 1.
1. For each multi-sense verb, do steps 2-4;
2. Recognize all sentences containing the verb in the corpora;
3. Manually select a sense for the multi-sense verb according to the context;
4. Put the sense and the corresponding context words into the table.

Figure 1. Pseudocode of decision table generation
There are 16 fields in the decision table: wd_0, SINHN, SINXH, wd_6, wd_5, wd_4, wd_3, wd_2, wd_1, wd1, wd2, wd3, wd4, wd5, wd6, number. SINHN and SINXH are decision attributes; they are the senses of the multi-sense verb in HowNet and CCD respectively. The condition attributes are wd_6~wd_1, the 6 context words preceding the multi-sense verb in the sentence, and wd1~wd6, the 6 context words following it. Number counts identical instances in the corpora, that is, the number of sentences that produce the same record in the decision table.
The major words deciding the verb sense in a sentence are nouns, and there are at most 5 slots modifying the headword from different angles before a specific noun [23]. We can therefore capture the noun that determines the verb sense by extracting the 6 context words preceding and the 6 following the multi-sense verb. If fewer than 6 context words precede or follow the verb, the empty positions are assigned a free value, denoted by '*'. The record obtained from the sample sentence is shown in Table 3.
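The window extraction with '*' padding can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name `context_record`, the English placeholder tokens, and the choice to put the padding in the positions farthest from the verb are all assumptions.

```python
WINDOW = 6   # context words kept on each side of the verb
PAD = "*"    # free value for missing positions


def context_record(tokens, verb_index, window=WINDOW):
    """Return (preceding, following) context words, each padded with '*' to `window`."""
    before = tokens[max(0, verb_index - window):verb_index]
    after = tokens[verb_index + 1:verb_index + 1 + window]
    before = [PAD] * (window - len(before)) + before   # wd_6 ... wd_1
    after = after + [PAD] * (window - len(after))      # wd1 ... wd6
    return before, after


# Placeholder tokens standing in for a segmented sentence (invented example).
tokens = ["we", "will", "alter", "the", "old", "plan"]
before, after = context_record(tokens, verb_index=2)
print(before)
print(after)
```

The two lists together give the 12 condition attributes of one record; the decision attributes are filled in by the human annotator.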

Module of RS based attribute reduction
Knowledge is acquired by reducing the original decision table with RS reduction.

Basic Algorithm
The algorithm includes three main procedures: PreProcess, AcquireProcess and ExceptionRule.

Preprocess
In this procedure, condition attributes in which no value occurs more than once are removed, because they cannot produce reduction rules.
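One possible reading of this step is sketched below. The function name `preprocess` is hypothetical, and ignoring the '*' free value when counting repeated values is an assumption made for the example, not something the procedure states.

```python
from collections import Counter


def preprocess(rows, condition_attrs, pad="*"):
    """Keep only attributes in which some non-padding value occurs more than once."""
    kept = []
    for attr in condition_attrs:
        # Assumption: the '*' free value does not count as a shared value.
        counts = Counter(row[attr] for row in rows if row[attr] != pad)
        if any(n > 1 for n in counts.values()):
            kept.append(attr)
    return kept


# Invented records with two condition attributes.
rows = [
    {"wd_1": "old", "wd1": "plan"},
    {"wd_1": "new", "wd1": "plan"},
    {"wd_1": "*", "wd1": "city"},
]
print(preprocess(rows, ["wd_1", "wd1"]))
```

An attribute whose values are all different cannot generalize over two instances, so dropping it early shrinks the search space of the later steps.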

ExceptionRule
It copies the instances not entailed by any rule, i.e., outliers, to the table ExceptionTable.

AcquireProcess
This is fulfilled by two procedures. First, ProduceFromLtoS extracts the rules composed of the most attributes from the decision table "worktable" into the candidate table "RuleTable". Second, ProduceFromStoL extracts rules from RuleTable in order of rule length. This procedure gives priority to the shortest rules: if a longer rule is entailed by a shorter rule, the longer rule is removed. In effect, ProduceFromStoL removes redundant information from RuleTable, which ensures that the acquired rules contain the fewest attributes while classification remains correct.
The pseudocode can be seen in Figure 2. PreProcess may remove some condition attributes because they cannot play any role in reduction, so the search scale is reduced dramatically.
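The short-to-long pruning attributed to ProduceFromStoL can be illustrated with the sketch below. It is not the pseudocode of Figure 2: the rule representation (a dictionary with `conditions` and `sense`) and the function names are assumptions, and a shorter rule is taken to entail a longer one when its conditions are a subset of the longer rule's conditions and both give the same sense.

```python
def entails(short, long):
    """True if `short` has the same sense and a subset of the conditions of `long`."""
    return (short["sense"] == long["sense"]
            and short["conditions"].items() <= long["conditions"].items())


def prune_short_to_long(rules):
    """Visit rules from shortest to longest; drop any rule entailed by a kept one."""
    kept = []
    for rule in sorted(rules, key=lambda r: len(r["conditions"])):
        if not any(entails(k, rule) for k in kept):
            kept.append(rule)
    return kept


# Invented candidate rules.
candidates = [
    {"conditions": {"wd1": "plan", "wd_1": "old"}, "sense": "alter"},
    {"conditions": {"wd1": "plan"}, "sense": "alter"},
    {"conditions": {"wd1": "car"}, "sense": "produce"},
]
for rule in prune_short_to_long(candidates):
    print(rule)
```

Sorting by length first means each rule only has to be compared with rules that are no longer than itself.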

Incremental Algorithm
As seen in the previous section, the complexity of computing a reduction increases exponentially with the scale of the decision table. Introducing a heuristic search method helps to discover a better reduction with the fewest condition attributes.
When new knowledge is added to the decision table, rules need to be acquired by an incremental algorithm.
The idea is naïve and its correctness is obvious. It is shown in Figure 4. The threshold on the number of instances covered by a candidate rule brings three advantages:
1. To avoid problems caused by data sparseness efficiently.
2. To filter 'noise'. 'Noise' may cause outliers independent of any category, and reduction based on RS attribute values may include outliers in a rule set. The result is that inference must comply with dualism. The threshold can avoid including outliers in a rule set.
3. To increase the time and space efficiency of machine learning. The threshold effectively limits the excess inflation of the candidate rule set caused by data sparseness, so time and space efficiency increase.
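The incremental idea, learning only from the stored outliers plus the newly added instances, can be sketched as below. This is not the algorithm of Figure 4: `incremental_update` and `toy_learn` are hypothetical names, the learner is a stand-in that simply promotes repeated instances to rules, and conflicts between new instances and existing rules are not handled.

```python
from collections import Counter


def toy_learn(instances, threshold=2):
    """Stand-in learner: a (context, sense) pair seen >= threshold times becomes a rule."""
    counts = Counter(instances)
    rules = [pair for pair, n in counts.items() if n >= threshold]
    exceptions = [pair for pair in instances if counts[pair] < threshold]
    return rules, exceptions


def incremental_update(rules, exceptions, new_instances, learn=toy_learn):
    """Re-learn only from the stored outliers plus the newly added instances."""
    new_rules, new_exceptions = learn(exceptions + new_instances)
    return rules + new_rules, new_exceptions


# Invented state: one existing rule, one stored outlier, two new instances.
rules = [("car", "produce")]
exceptions = [("plan", "alter")]
new_instances = [("plan", "alter"), ("city", "alter")]

rules, exceptions = incremental_update(rules, exceptions, new_instances)
print(rules)
print(exceptions)
```

Instances already absorbed into rules are never revisited, which is where the saving over re-learning the whole decision table comes from.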

Knowledge representation
Two kinds of knowledge are generated by reduction of the decision table. One is rules that cover a number of instances greater than or equal to the given threshold. The other is instances that cannot form a rule because the candidate rule covers fewer instances than the threshold.
We use production rules to describe these rules, which have different implementation forms for specific data.
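The split into eligible rules and leftover instances can be sketched as follows. The data layout is an assumption made for illustration: each instance carries a `count` standing for the number field of the decision table, and the names `covers` and `split_by_threshold` are hypothetical.

```python
def covers(rule, instance):
    """A rule covers an instance if the sense and every condition value match."""
    return (instance["sense"] == rule["sense"]
            and all(instance["context"].get(a) == v
                    for a, v in rule["conditions"].items()))


def split_by_threshold(candidates, instances, threshold=2):
    """Return (eligible rules, instances not covered by any eligible rule)."""
    rules = [r for r in candidates
             if sum(i["count"] for i in instances if covers(r, i)) >= threshold]
    exceptions = [i for i in instances
                  if not any(covers(r, i) for r in rules)]
    return rules, exceptions


# Invented instances and candidate rules.
instances = [
    {"context": {"wd1": "plan"}, "sense": "alter", "count": 2},
    {"context": {"wd1": "car"}, "sense": "produce", "count": 1},
]
candidates = [
    {"conditions": {"wd1": "plan"}, "sense": "alter"},
    {"conditions": {"wd1": "car"}, "sense": "produce"},
]
rules, exceptions = split_by_threshold(candidates, instances)
print(rules)
print(exceptions)
```

With a threshold of 2, the candidate supported by a single sentence is rejected and its instance is kept as an outlier instead.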
Because rules have varied condition attributes, describing them with a table would be wasteful, so we build a rule file, which is a text file, to store them. The format of the text file is:
word(i).Ri.XH=Sense in XH
word(i).Ri.HN=Sense in HN
word(i).Ri.wd(a)=value(a)
……
word(i).Ri.wd(n)=value(n)
word(i).Ri.len=Number of words contained in the rule i
The first two items are decision attributes, the last item denotes the length of the rule, i.e., the number of words determining the rule, and the other items are condition attributes.
The sense of the verb in CCD is Sense in XH; in HowNet it is Sense in HN.
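A small writer for one entry of such a rule file might look like the sketch below. How `word(i)` and `Ri` are rendered literally is not specified here, so the sketch assumes the verb itself is used for `word(i)` and `R` followed by the rule number for `Ri`; the function name `format_rule` and the sample values are invented.

```python
def format_rule(word, rule_id, sense_xh, sense_hn, conditions):
    """Render one rule as 'word.Ri.key=value' lines, decision attributes first."""
    prefix = f"{word}.R{rule_id}"
    lines = [f"{prefix}.XH={sense_xh}", f"{prefix}.HN={sense_hn}"]
    lines += [f"{prefix}.{attr}={value}" for attr, value in conditions.items()]
    lines.append(f"{prefix}.len={len(conditions)}")  # number of words in the rule
    return "\n".join(lines)


# Placeholder verb, senses and condition (invented values).
print(format_rule("verb", 1, "xh_sense", "hn_sense", {"wd1": "plan"}))
```

Only the condition attributes that survive reduction are written, which is why a variable-length text format is more economical than a fixed-width table.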
We collect the instances that can't form a rule into a table called "ExceptionTable". ExceptionTable has the same structure as the original table, so it is itself a decision table. Because the instances that form rules have been filtered out, ExceptionTable is much smaller than the original table. Data in ExceptionTable may be used as outliers in NLP.

Rule extraction
We have experimented with the proposed method on corpora of 5 million characters. The experiment targets 4 multi-sense Chinese verbs. Because the corpora are not large enough, we empirically set the threshold, the number of instances covered, to 2, on the grounds that a pattern covering a single instance is a contingency rather than a law, whereas one covering more than 1 instance is most probably a rule. The experiment results are shown in Table 4. The Chinese word "改造" has three senses: "改变", "改良" and "制造". Their English meanings are "alter", "improve" and "produce" respectively.
The Chinese word "发生" has two senses: "发生" and "制造". Their English meanings are "happen" and "produce" respectively.

Explanation
In Table 4, as the number of training instances increases, the number of rules and the coverage increase accordingly.

Disambiguation
To verify the usefulness of the acquired rules, we apply them to a WSD task.

Rule matching
Match the context words of a multi-sense verb against the rules in the rule base. If a rule matches, tag the verb with the verb sense in that rule. In matching, priority is given to the longest rule.
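Longest-rule-first matching can be sketched as follows. The rule representation and the function name `match_rule` are assumptions made for the example, and the rules and contexts are invented.

```python
def match_rule(context, rules):
    """Return the sense of the longest rule whose conditions all hold, else None."""
    matched = [r for r in rules
               if all(context.get(a) == v for a, v in r["conditions"].items())]
    if not matched:
        return None
    return max(matched, key=lambda r: len(r["conditions"]))["sense"]


# Invented rule base: the second rule is longer and more specific.
RULES = [
    {"conditions": {"wd1": "plan"}, "sense": "alter"},
    {"conditions": {"wd1": "plan", "wd_1": "new"}, "sense": "produce"},
]

print(match_rule({"wd_1": "new", "wd1": "plan"}, RULES))
print(match_rule({"wd_1": "old", "wd1": "plan"}, RULES))
print(match_rule({"wd1": "car"}, RULES))
```

A return value of None signals that no rule matched, so the distance-based steps that follow are needed.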

Calculating average semantic distances of rules
If no rule matches, calculate the distance from the context words to the rules with formula (5), which is the distance function between two words. Select the rule with the minimum average semantic distance and tag the verb with its decision attributes.
Actually, this is an expansion of the basic rules by way of semantic classes, and it is a good solution to the data sparseness problem. Here, the semantic classes used are the hyponymy and synonymy relations defined in HowNet.
The average semantic distance function is shown as formula (6).

Calculating the distance to outliers
If the minimum distance to the rules is unacceptable, for example when the minimum distance is 100, calculate the semantic distance to the outliers. Select the outlier with the minimum distance and tag the multi-sense verb with its decision attribute. When computing the distance to an outlier, the distance function between two words is formula (7).
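The fallback from rules to outliers can be sketched as below. The word distance is passed in as a function because the actual distance formulas are not reproduced here; `toy_distance` and its class table are invented stand-ins and are not HowNet relations. The sketch also assumes non-empty rule and outlier lists, and treats a minimum average distance of 100 or more as unacceptable.

```python
def average_distance(context, target, word_distance):
    """Mean word distance between the target's attribute values and the context."""
    dists = [word_distance(context.get(attr, "*"), value)
             for attr, value in target.items()]
    return sum(dists) / len(dists)


def disambiguate(context, rules, outliers, word_distance, unacceptable=100):
    """Use the nearest rule if it is close enough, otherwise the nearest outlier."""
    best = min(rules, key=lambda r: average_distance(context, r["conditions"], word_distance))
    if average_distance(context, best["conditions"], word_distance) < unacceptable:
        return best["sense"]
    nearest = min(outliers, key=lambda o: average_distance(context, o["context"], word_distance))
    return nearest["sense"]


# Invented semantic classes and distances, standing in for a lexical resource.
CLASSES = {"plan": "abstract", "scheme": "abstract", "car": "artifact"}


def toy_distance(a, b):
    if a == b:
        return 0
    if a in CLASSES and CLASSES.get(a) == CLASSES.get(b):
        return 1
    return 100


RULE_BASE = [{"conditions": {"wd1": "plan"}, "sense": "alter"}]
OUTLIERS = [{"context": {"wd1": "car"}, "sense": "produce"}]

print(disambiguate({"wd1": "scheme"}, RULE_BASE, OUTLIERS, toy_distance))
print(disambiguate({"wd1": "car"}, RULE_BASE, OUTLIERS, toy_distance))
```

Keeping the word distance as a parameter lets the same cascade be reused with any distance defined over a semantic hierarchy.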

Experimental results
We have validated the strategy above by experiments. The percentage of correct disambiguation in the closed test is 100%.
In the open test, on corpora of 500 thousand characters taken from journal articles, the percentage of correct disambiguation is 92%. The detailed decision results are shown in Table 5. The English meanings of the Chinese words are explained in section 4.1.

Conclusion
We put forward and implemented a method, based on RS theory and IL, for acquiring word sense knowledge about Chinese verbs. We proposed the concept of a threshold, i.e., the number of instances covered by a rule, to address the data sparseness problem, filter out instances not governed by rules, and improve the time and space efficiency of machine learning in NLP. Two problems must be pointed out and investigated further in future work.
1. The limited expressiveness of attribute-logic knowledge representation is one cause of data sparseness. Because natural language is flexible, when context words in specific positions are selected as attributes, the representation itself may bring about the data sparseness problem. Knowledge representation should therefore be studied further.
2. Without enough corpora, the rules acquired depend on the training corpora to some degree.

Figure 4. Incremental learning algorithm
Lacking empirical knowledge about data sparseness, we define a threshold as the number of instances covered by a candidate rule, to limit the distortion of induction and the errors of automatic knowledge acquisition caused by data sparseness. A candidate rule is selected as an eligible rule only when the number of instances it covers is greater than or equal to the threshold. Using the threshold as a condition for acquiring a rule has three advantages, which are listed in the section on the incremental algorithm.

X. Huang et al. EAI Endorsed Transactions on Scalable Information Systems 02-07 2015 | Volume 2 | Issue 5 | e3

Table 1. After the decision table is reduced by A, B and C, Table 2 is obtained.

Table 3. Description of data in decision table

Table 4. Experiment data in learning phase

N1 is the number of training instances; N2 is the number of redundant instances in the training set; Nr is the number of rules acquired by attribute reduction; Ncr is the number of instances covered by rules, and Rcr is the corresponding coverage; Ne is the number of outliers, and Rce is the corresponding coverage. Nt=Nr+Ne, and Rt, computed with formula (4), is the total coverage of rules.

Table 5. Decision table