Identification of New Parameters for Ontology Based Semantic Similarity Measures

Shivani Jain, Seeja.K R and Rajni Jindal

Measuring semantic similarity accurately is a major challenge across applications of computational linguistics, natural language processing and information retrieval. In this paper, various ontology-based approaches that compute the semantic similarity between words are studied, and their benefits and shortcomings are listed against a set of identified parameters. Earlier, correlation with human judgment was the single criterion for judging a good similarity measure. In this paper, more parameters for semantic similarity measures are identified and a relative analysis of similarity measures is performed on them. These parameters can be further utilized for formulating new semantic similarity measures in current research areas such as text mining, web mining and information retrieval. The identified parameters include the feature set, applicability to different ontology frameworks (single ontology, cross ontology or fuzzy ontology), ontology type, dataset applied and relationship mapping. Through detailed analysis we have found that feature based and hybrid approaches have higher accuracy than edge and content based methods and work on different types of ontologies. Recent research is drawing interest in finding new feature sets in this area, such as fuzzy distance, graph generation and text snippets. The maximum accuracy achieved was 0.87 on a single ontology and 0.83 over cross ontologies. WordNet and MeSH are the most widely used global ontologies.

Introduction
With the huge success of the WWW and the knowledge society, the quantity of textual electronic information has expanded significantly [1], which has attracted the interest of many research groups in this area. Most of the information on the WWW is represented in structural forms such as HTML, DHTML and XML. This information is interpreted by humans, not by machines. Nowadays the web generates approximately 25 billion textual data items each day. Such huge data, or big data [2,3,4], cannot be organized by humans, so recent research is drawing interest in how this textual information can be understood and interpreted by machines. Understanding and interpretation of textual information revolves around two basic terms, Text Mining [5] and Natural Language Processing (NLP) [6,7]. Textual information is broken into smaller chunks, and the smallest chunks are known as terms or words. A major concern in NLP and computational linguistics is to interpret the meaning of each single word. To understand the meaning of a word/concept, the first step is to calculate the semantic similarity among the words or terms [8]. Semantic similarity measures compute a numerical value that quantifies the closeness among terms/concepts. Semantic closeness states how near two ideas (words or terms) are, on the ground that they share some common part of their meaning [9,10]. The aim of semantic similarity is to measure the closeness among different words accurately. Resnik et al. [11] defined the similarity between the words ("Car" & "Automobile") as 1.

Several researchers have proposed similarity measures. Although massive research has been done in this area, in this paper we conduct a study of the ontology based similarity measures presented in the area of Knowledge Based (KB) systems [12,13]. A KB system stores huge amounts of information in unstructured formats. Unstructured information can be processed and presented through an object model known as an ontology [14,15]. Ontologies are created for different aspects and domains; they regularly contain overlapping data and information that can be further used for calculating the similarity among different words [16]. In this paper, a relative study of ontology based measures is presented on various parameters. Similarity is not a new term; the term semantic similarity was coined in 1995 by Resnik et al. Since then, new similarity measures have been proposed every year.
The first motivation for this study is that no previous research paper has presented such extensive research on ontology based measures; earlier papers merely illustrated a small number of measures in this area, whereas in this paper the authors have studied more than 45 research papers. A detailed comparative study is shown.
The second motivation is to identify different parameters by which these approaches can be compared. In prior studies accuracy was the only criterion for formulating and judging a good similarity measure; however, new similarity measures are proposed every year to deal with the latest research in this area [17,18]. To cope with these newly emerging fields of knowledge base, the authors have identified more parameters for the formulation of a good similarity measure. Through an extensive study of this field the authors have identified parameters such as the feature set, type of ontology framework, benchmark dataset used, relationships considered and ontology type. A detailed explanation is presented in the forthcoming sections. This paper is divided into sections as follows. The first section describes the methodology for conducting the research study and briefly explains the similarity taxonomy. The third section presents the various similarity measures in the area of KB. The fourth section characterizes the identified parameters and their respective meanings. The next section presents the detailed relative study of the various approaches on the identified parameters, followed by a section that illustrates the analysis and findings of the paper. The last section describes the conclusion and future work in this area.

Research Methodology
For conducting this research study, we searched for research papers in online repositories such as IEEE, Elsevier, Springer, IGI Global and InderScience, covering journal and conference papers from 1996 to 2018. The keywords identified for retrieving research papers are "Word-Similarity Concept", "Semantic Similarity Meaning", "Corpus-based Similarity", "Text-based Similarity", "Knowledge based similarity", "Ontology based similarity", "Edge-based similarity" and "feature based similarity". If the above keywords were found in the title and in the abstract of a research paper, we manually considered it for our study. After studying these papers we identified further related terms that provide more in-depth information in this area. The related keywords are "WordNet-based similarity measures", "Bio-medical based similarity measures", "Cross-Ontology based Similarity measures", "Fuzzy-ontology based similarity measures" and "Formal concept analysis based similarity measures". In this paper we work on word similarity measures in the area of knowledge bases; thus we have not included "string based similarity", "corpus based similarity", "Sentence based similarity", "Paragraph based similarity" and "Document level similarity".

Semantic Similarity Taxonomy
Similarity can be word similarity, sentence similarity or paragraph similarity. In this paper, word similarity is discussed. Two words can be similar lexically or semantically. Lexical similarity covers words that have a similar string sequence, as in {"Man", "Lan" and "Van"}, and also DNA sequence matching, as in {"ADCGTDCGTC" and "ADCCGTCGCA"}. Lexical similarity has wide application in the areas of medical sequence matching [19,20] and pattern recognition. Various measures have been proposed in this area [20].
Semantic similarity, in contrast, deals with the meaning of the two words, that is, how similar the linguistic meanings of two words are, such as how similar the words {"mango" and "orange"} are given that both belong to the class "Fruit". Another term associated with semantic similarity is "semantic relatedness" [22], which does not necessarily rely only on the taxonomic relationship "is-a"; more relationships are explored in semantic relatedness. For example, {"tier", "pencil"} are less associated with each other than {"pencil", "paper"} in terms of semantic relatedness, even though "pencil" and "paper" do not have an "is-a" relationship. Here other relationships such as "part-of" and "antonym" are explored.
For calculating the semantic similarity among words, diverse similarity measures have been proposed by different research groups. Two families of approaches are presented in the literature, "Corpus-Based Approaches" and "Knowledge-Based Approaches" [12,23]. In corpus based approaches a large amount of textual information is used for calculating the similarity among the words. Latent Semantic Approach (LSA) [24] and Pointwise Mutual Information (PMI) [25] are the most widely used methods for computing this similarity. Knowledge-based approaches rely on ontologies. Ontologies are represented using classes, attributes and relationships among the attributes. An ontology is characterized as "a formal specification of a shared conceptualization" [27]. WordNet is a domain-free, all-purpose thesaurus for English words. It organizes more than ten thousand English words in a semantic structure that forms an ontology. A graph is generated in the form of a network and is used for forming the meaning of concepts. In most knowledge based approaches, researchers have used WordNet as the base ontology for computing the similarity among words/concepts. In the next section we explain the various measures defined on the WordNet ontology.

Knowledge based/ Ontology-based approaches
Three main approaches to the KB method are present in the literature [8,28,29], together with a fourth, hybrid approach that combines any two of them. The approaches are:
1. Edge/Path based Approaches
2. Information Content Approaches
3. Feature Based Approaches
4. Hybrid Approach
In this section we briefly explain these approaches. The terms associated with semantic similarity measures are:
Len(c_i, c_j) - shortest path length from synset c_i to synset c_j in WordNet.
LCS(c_i, c_j) - Lowest Common Subsumer (LCS/lso) of c_i and c_j in WordNet.
depth(c_i) - path length to synset c_i from the overall root; depth of root = 1.
Max_deep - maximum depth of the taxonomy.
Max_node - maximum number of words present in the ontology.
Sim(c_i, c_j) - semantic similarity between the two concepts c_i and c_j.
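The terms defined above can be illustrated on a small hand-built "is-a" taxonomy. The following is a minimal sketch: the taxonomy, the concept names and the helper functions are hypothetical and are not taken from WordNet or from any specific measure in this paper.

```python
# Hypothetical toy "is-a" taxonomy (child -> parent); not actual WordNet data.
PARENT = {
    "entity": None,
    "money": "entity",
    "cash": "money",
    "coin": "cash",
    "nickel": "coin",
    "dime": "coin",
    "bill": "money",
}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def depth(concept):
    """Number of nodes from the root down to the concept; depth(root) = 1."""
    return len(ancestors(concept))


def lcs(a, b):
    """Lowest common subsumer: the deepest concept that subsumes both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None


def path_len(a, b):
    """Shortest path length between a and b, counted in edges through the LCS."""
    common = lcs(a, b)
    return (depth(a) - depth(common)) + (depth(b) - depth(common))


# Maximum depth of this toy taxonomy.
MAX_DEEP = max(depth(c) for c in PARENT)
```

In this toy structure the siblings "nickel" and "dime" are two edges apart, while "nickel" and "money" are three edges apart, so a pure edge-counting measure would rate the first pair as more similar.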

Edge/ Path based Approach
Edge-counting approaches were introduced by Lin et al. [30]. This is a very basic technique for measuring similarity: it calculates the minimum path distance between two concepts in their ontological model along "is-a" links. As the path length increases, the similarity between the terms tends to decrease. If two terms are close, their distance is low and their similarity is high [26]. Here the similarity between the words (nickel, dime) and (nickel, money) is computed according to the edge based approach proposed by Lin as:

Figure 2. Edge Based Approach presented by Lin
The most widely used measures in edge based approaches are:

3.1.1 Shortest Path based measure, introduced by Rada et al. [31]: This measure takes Len(C_1, C_2), the shortest path length between the two concepts C_1 and C_2, into consideration when computing the similarity.

Sussna's measure [32] rests on the assumption that sibling concepts at lower nodes of the taxonomy are more similar than pairs of nodes at higher levels. It uses the noun network of WordNet and defines two relationships on each edge, a direct relationship and an inverse relationship. The weights of both relationships lie in the range {min, max}, with min = 1 and max = 2. A weight is calculated for each edge of a node A, and the distance between two adjacent nodes (A, B) is calculated using the edge weights and the depth of the nodes.

The measure in [35]: It works on the hypothesis that the sources of information are infinite, and that humans tend to judge word similarity on a finite interval between completely similar and completely dissimilar words.

Information Content Approach
This approach is based on the theory that every concept in WordNet carries a certain amount of information. It relies on the fact that a node is a distinctive representation of a concept in a particular domain and holds some amount of information. A direct link between two concepts is represented by an edge. The similarity between two concepts is computed from the information they share in that particular domain. Concepts are more similar if they share a larger amount of information in a domain.
The information content of a class/concept is derived from its probability: for a concept w, p(w) denotes the probability of encountering an instance of w.
As the number of nodes subsumed by a concept increases, the information it contains decreases. If there is a single root node in the ontology, it has the highest probability of occurrence, which can be set to 1, and the information it contains is the lowest, measured as 0. The most widely used measures in information content approaches are:
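The relation between probability and information content described above can be sketched as follows, assuming the usual negative-log definition IC(c) = -log p(c). The function name is illustrative and the probabilities in the usage note are made up.

```python
import math


def information_content(p):
    """IC = -log p, written as log(1/p); a certain event (p = 1) carries no information."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return math.log(1.0 / p)
```

Under this assumption a root concept with p = 1 has an information content of 0, and a rare concept (for example p = 0.01) has a much higher information content than a frequent one (for example p = 0.5).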

3.2.1 Resnik Approach [11]: It relies on the information shared between two concepts. Two terms are maximally dissimilar if no LCS exists. If an LCS exists, the similarity is calculated from the information content of the LCS as,

$$Sim_{res}(c_i, c_j) = IC(LCS(c_i, c_j))$$

According to Resnik et al., each occurrence of a particular noun in the corpus is counted as an occurrence of every taxonomical class that contains it. Here W(a) represents the set of nouns in the corpus, the multiple senses of a word are denoted by "a", and N is the overall number of nouns present in the corpus.

Jiang and Conrath [37]: It depends on measuring the taxonomical length of the connections between the IC of a specific word and its LCS. The resulting formula defines the dissimilarity between the two terms.
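The counting scheme and the two measures above can be sketched on a toy example. This is only an illustration under stated assumptions: the taxonomy and noun counts are invented, the probability is taken as p(c) = freq(c) / N, and the Jiang and Conrath distance is written in its commonly cited form IC(c_i) + IC(c_j) - 2 * IC(LCS(c_i, c_j)).

```python
import math

# Hypothetical taxonomy (child -> parent) and noun counts; not real corpus data.
PARENT = {"entity": None, "fruit": "entity", "mango": "fruit",
          "orange": "fruit", "tool": "entity", "pencil": "tool"}
COUNT = {"entity": 0, "fruit": 15, "mango": 10, "orange": 30,
         "tool": 5, "pencil": 40}


def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def concept_frequency():
    """Each noun occurrence also counts for every class that subsumes it."""
    freq = dict.fromkeys(PARENT, 0)
    for noun, n in COUNT.items():
        for cls in ancestors(noun):
            freq[cls] += n
    return freq


FREQ = concept_frequency()
N = sum(COUNT.values())  # total number of noun occurrences in the toy corpus


def ic(concept):
    """Information content, assuming p(c) = FREQ[c] / N and IC = -log p."""
    return math.log(N / FREQ[concept])


def lcs(a, b):
    """Lowest common subsumer of a and b (the root always exists here)."""
    seen = set(ancestors(a))
    return next(node for node in ancestors(b) if node in seen)


def sim_resnik(a, b):
    """Similarity as the information content of the LCS."""
    return ic(lcs(a, b))


def dist_jiang_conrath(a, b):
    """Dissimilarity: IC(a) + IC(b) - 2 * IC(LCS(a, b)) (assumed form)."""
    return ic(a) + ic(b) - 2 * ic(lcs(a, b))
```

With these made-up counts, "mango" and "orange" share the subsumer "fruit" and obtain a positive Resnik similarity, whereas "mango" and "pencil" only share the root, whose information content is 0.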

Feature Based Approach
Feature based methodology describes the similarity among concepts as a function of their properties. It relies on the number of common and uncommon features of the evaluated concepts. Common features increase the similarity and distinguishing features tend to weaken it.

The measure in [38]: It is established on the fact that each concept is depicted through a set of keywords presenting its properties and features. In WordNet these are represented using definitions and gloss values. For example, WordNet describes the gloss of the word "automobile" as "a motor vehicle with four wheels; usually propelled by an internal combustion engine".

Rodriguez and Egenhofer [39]: They proposed a similarity measure in which the computation is based on the weighted sum of the individual similarities between the "synsets", the "features" and the "neighborhood concepts" of the evaluated terms as:

$$Sim(c_i, c_j) = w \cdot S_{synsets}(c_i, c_j) + u \cdot S_{features}(c_i, c_j) + v \cdot S_{neighborhoods}(c_i, c_j)$$

Here S_synsets, S_features and S_neighborhoods are the similarities among the "synonym sets", the "features" and the corresponding "semantic neighborhoods" of the compared terms. The weights satisfy w, u, v >= 0, and their values depend on the relative importance of each specific component. The values rely upon the characteristics of the taxonomy. Here S symbolizes the overlap between the distinct features, which is computed using equation (12).
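The weighted combination above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the overlap function S is assumed to be a Tversky-style ratio with a balance parameter alpha, the equal weights are arbitrary, and the feature sets for "car" and "bicycle" are invented for the example.

```python
def overlap(a, b, alpha=0.5):
    """Assumed Tversky-style ratio of shared to distinguishing elements."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + (1 - alpha) * len(b - a)
    return common / denom if denom else 0.0


def sim_weighted(c1, c2, w=1 / 3, u=1 / 3, v=1 / 3):
    """Weighted sum over synsets, features and neighborhoods (w, u, v >= 0)."""
    return (w * overlap(c1["synsets"], c2["synsets"])
            + u * overlap(c1["features"], c2["features"])
            + v * overlap(c1["neighborhood"], c2["neighborhood"]))


# Hypothetical descriptions of two concepts; not taken from WordNet.
car = {
    "synsets": {"car", "automobile", "auto"},
    "features": {"wheels", "engine", "seats", "steering"},
    "neighborhood": {"vehicle", "motor vehicle"},
}
bicycle = {
    "synsets": {"bicycle", "bike"},
    "features": {"wheels", "pedals", "seats", "steering"},
    "neighborhood": {"vehicle"},
}
```

Calling `sim_weighted(car, bicycle)` gives a value strictly between 0 and 1: the two concepts share most features and part of their neighborhood but no synonyms. Changing w, u and v shifts the emphasis between the three components.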

Hybrid Approaches
A hybrid approach is any approach in which two or more of the above approaches are combined. Recent research is drawing interest in these approaches. Recently, Cai et al. [40] used a hybrid technique combining the edge based and the information content approaches. A detailed analysis is shown in Table 6.

Identified Parameters
So far, we have discussed similarity measures in knowledge based systems. To cope with the newly emerging field of knowledge base systems, new measures are proposed every year [16]. In this section we define some new parameters for similarity measures. After conducting a thorough literature review, the following parameters have been identified for the evaluation of measures. For the analysis, we first consider the type of measure: a measure belongs to one of the three categories mentioned above or is based on a hybridization of them.
Next, we have chosen the feature set of a measure, which provides insight into how the measure works. Through detailed study we found thirty different features across the different measures.
The next parameter is accuracy; it is computed as the correlation of the values produced by the measure with human judgment. If the correlation is high, the accuracy of that measure is high.
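The accuracy parameter can be sketched as a correlation between a measure's scores and human ratings for the same word pairs. The sketch below assumes Pearson's correlation coefficient, which is one common choice; the scores listed are invented and do not come from any benchmark dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (norm_x * norm_y)


# Hypothetical human ratings and measure outputs for four word pairs.
human = [0.98, 0.90, 0.40, 0.05]
measure = [0.95, 0.80, 0.50, 0.10]
accuracy = pearson(human, measure)
```

A value of `accuracy` close to 1 means the measure ranks and spaces the word pairs much as the human raters did.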
The next parameter is the ontology development framework; mainly three types of work are present in the literature: single ontology [20], cross ontology [41] and fuzzy ontology [42].
Another parameter is the relationships considered among words. For example, mango "is-a" fruit, so the words "mango" and "fruit" have an "is-a" relationship. However, "mango" is also related to the word "seed", and semantic similarity is also present between these two words. A good similarity measure should also compute the similarity between them. So, considering all the relationships among the words is an important factor for any measure. Some measures compute good similarity for the "is-a" relationship but cannot compute similarity for other types of relationships; some edge based methods have high accuracy for the "is-a" relationship but do not give the same results on other types of relationships. A complete understanding of semantic relationships is an important factor for similarity measures.
The next parameter is the dataset; three types of standard datasets are available for the WordNet ontology.
The last parameter is the ontology type, which indicates the type of ontology on which a measure is applicable. Some measures are applied to global ontologies such as WordNet, MeSH and Gene Ontology, and some are applied to domain ontologies such as a solar ontology or a product ontology. The parameters and their meanings are given in Table 1.

Parameter | Meaning
Type of Measure | The approach followed: edge based, content based, feature based or hybrid.
Feature Set | Features used for computing the similarity.
Accuracy | The correlation with human judgment; the value varies in [0-1], and a value near 1 signifies high correlation.
Types of Ontology Framework | The type of ontology framework on which the measure works: single ontology, cross ontology or fuzzy ontology.
Taxonomical relations | The relationships from WordNet and MeSH, such as "is-a", "part-of", "anatomy" etc., examined in the measures.
Ontology Type | Global ontology such as WordNet, MeSH or Gene Ontology, or any domain ontology.

Evaluation of different parameters
Here we discuss the evaluation of the identified parameters.
Type of measure: Research shows that the first measures proposed were edge based, whereas current research focuses on hybrid approaches. A hybrid approach gives more accuracy, flexibility and adaptability to a system than edge based, content based and feature based approaches.
Accuracy: The accuracy of the different measures is shown in the tables; for this we have considered previous research work presented in this area. Alexendra et al. [46] compared the results of many previous works in 2006 and reported that Jiang and Conrath [37] performed better than the other approaches at that time. In recent work, different researchers [40,47,48] have presented their measures and compared their results with previous work. We have taken their results for the accuracy parameter.
Ontology framework:
1 Single Ontology - WordNet, MeSH and SUMO are general purpose ontologies. Most researchers used WordNet as the main ontology for their measures. All the proposed measures work on a single ontology.
2 Cross Ontology - Measuring similarity across ontologies has wide application in new ontology development. A new ontology can be developed by merging two ontologies. In merging, two words from different ontologies are combined. For example, the word "fever" has the synonym "Pyrogens" in the MeSH ontology. For developing a new ontology for medical science we can utilize both the words "fever" and "Pyrogens".
3 Fuzzy Ontology - Fuzzy ontology is an emerging field of ontology development. Conventional ontologies are not suitable for managing imprecision or vagueness in information; to deal with probabilistic, uncertain and vague information, fuzzy ontology was introduced by Yen et al. [49,50,51]. Fuzzy ontology is an extension of crisp ontology in which a fuzzy membership is assigned between two concepts for a particular relationship. In recent years fuzzy ontology has drawn the interest of many research groups, and a detailed literature review is given by Devadoss [52].
Data-Set: We have found that mainly three standard datasets are used for the different measures, as mentioned in Table 1. Most researchers considered one or two datasets for their measures. Miller and Charles is the most popular dataset in this area. This dataset has 30 word pairs and three columns: the first column specifies the first word, the second column specifies the other word and the last column gives the similarity between these two words according to human judgment. It varies over [0-1] in the dataset, where 1 means highest similarity. The 353-TC dataset has 353 word pairs, and the similarity among the words varies over [0-4]. Recent researchers have considered the 353-TC dataset. This dataset places more emphasis on semantic relatedness than on similarity.
Relationships-Considered: The most common taxonomical relationship is "is-a". However, some of the measures also compute the similarity among different words by considering more semantic relationships such as "part-of" and "anatomy".
Ontology Type - Most of the presented work is based on the WordNet ontology. Some of the measures considered MeSH and Gene Ontology (medical science ontologies). In some of the papers domain ontologies were also considered.

Discussion
We have given a detailed comparative study in the above section. These findings can be further used for defining new similarity measures in this area. The analysis and results are given below.
In the edge based measures, feature sets such as length, depth, LCS, and direct and indirect links were considered. These approaches are quite simple, and their execution time is low because they only have to consider the length or the depth of the ontology. The accuracy of these measures lies in the range 0.51 to 0.67. They work only for single ontologies. These measures consider only the "is-a" relationship in the ontology. They work well for global ontologies such as WordNet and MeSH, as these ontologies describe the complete structure of a general purpose ontology.
In the information content approach, LCS, formal concept analysis, noun, synset and graph similarity feature sets were used. Here the accuracy lies in the range 0.59-0.78. The accuracy depends on the coverage and the structure of the ontology. As the depth of the ontology increases, the similarity increases. These measures work well on a single ontology and also work for cross ontologies using the LCS: if an LCS exists between the ontologies, the similarity can be computed in accordance with the LCS. In this type of measure the "is-a" and "part-of" relationships can be considered if these relationships are present in the ontologies. These measures work for the single and cross ontology frameworks where the complete structure and relationships are defined among the ontologies.
Feature based approaches can be applied to the single, cross and fuzzy ontology frameworks. Their accuracy lies in the range 0.67-0.83. In WordNet we have the gloss information that can be used to calculate the similarity; however, the MeSH ontology does not provide such information. All relations can be considered, as these approaches do not consider the length and the structure of the ontology. These measures produce high correlation values among the words. However, no work has been presented in which fuzzy integration with feature based approaches is given, so this can be future work in this area. These approaches furnish the same result on a small or big ontology, or on any type of ontology, since they depend only on the features of the concepts. For small ontologies and for domain ontologies they achieve good accuracy with less complexity.
In the hybrid method, any two of the above methods can be combined, taking the advantages of both techniques. The accuracy of hybrid approaches lies in the range 0.76-0.87. In hybrid work some researchers have introduced fuzzy concepts into their measures. The accuracy is high compared to edge based and content based approaches and comparable to the feature based approach. The feature sets used in these techniques include annotated terms, open dictionaries, swarm optimization, heuristic approaches and text snippets. These approaches give high accuracy and can be used for all relationships, but they add complexity to the system due to their feature sets. These approaches can be applied to fuzzy ontologies. Very few measures have been proposed in the area of the fuzzy ontology framework.
The accuracy graph of the different similarity measures is shown in Fig 5. The maximum accuracy is 0.86, achieved by a feature based approach [61] and by hybrid approaches [40,70]. The Miller and Charles dataset [43] is the most popular dataset for WordNet. The TC-353 dataset is used in recent research, as it places more emphasis on semantic relatedness than on similarity.
The datasets used by the various measures are shown in Fig. 7.

Conclusion and future work
In this research paper, a detailed study of the previous two decades of work in the field of semantic similarity is presented. In this study the state of the art similarity measures are compared on seven parameters: type of measure, feature set, accuracy, ontology framework, benchmark datasets, taxonomical relationships and ontology type. To cope with newly emerging fields, new similarity measures are proposed every year. However, there are certain issues in these approaches. If the similarity is high, then the complexity is high due to the feature set. Similarity measures compute good results on a single ontology but do not give the same results over cross ontologies. Edge based measures work only on a single ontology, and only taxonomical relationships are considered. Information content methods work in the area of cross ontology when an LCS exists. Feature based approaches are found to give high correlation values and give good results on both single and cross ontologies. But sometimes it is difficult to determine the features of a concept in certain ontologies. Corpus based methods like LSA and PMI can be further used for finding the features in domain ontologies. These measures are used for domain ontologies, and the features can be formulated using text processing methods.
To cope with the newly emerging field of knowledge bases, new similarity measures are proposed every year. Further, more parameters can be identified in this area, such as the complexity of the measures, tool/language support for measures, ease of adaptability and ontological coverage.
Fig 1 shows the taxonomy of similarity measures. Knowledge-based methods attempt to determine semantic similarity without human intervention, using a vast amount of background knowledge about the concepts. The major related work on semantic similarity measures utilizes the taxonomic knowledge of global ontologies such as WordNet [26] and MeSH [22], a medical ontology. Ontologies have been extensively used in the area of knowledge-based systems.

Figure 1. Taxonomy of similarity approaches

3.1.2 Sussna's Measure [32]: This measure was proposed by Sussna et al. in 1993. It is based on the relative depth in the WordNet ontology, which has a maximum depth of 16 nodes.

3.1.3 Wu & Palmer's Measure [33]: This measure works on the position of the concepts C_1 and C_2 in the taxonomy. To find the position, the relative location of the common concept lso(C_1, C_2), that is, the LCS, is considered. The authors concluded that the similarity among concepts is based on the path length and the depth of the ontology, both of which are defined through the links in the ontology.

$$Sim_{W\&P}(c_i, c_j) = \frac{2 \cdot depth(LCS(c_i, c_j))}{Len(c_i, c_j) + 2 \cdot depth(LCS(c_i, c_j))}$$

3.1.4 Leacock & Chodorow Measure [34]: The maximum depth of the taxonomy is considered in defining the following measure. The authors considered that if C_1 and C_2 have the same sense or meaning, then their similarity is defined to be 1.

$$Sim_{L\&C}(c_i, c_j) = -\log\left(\frac{Len(c_i, c_j)}{2 \cdot Max\_deep}\right)$$
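The two depth-aware edge measures can be sketched numerically. The sketch assumes the usual forms Sim = 2 * depth(LCS) / (Len + 2 * depth(LCS)) for Wu & Palmer and Sim = -log(Len / (2 * Max_deep)) for Leacock & Chodorow; the function names and the input values in the usage note are illustrative.

```python
import math


def sim_wu_palmer(path_len, depth_lcs):
    """Wu & Palmer: deeper common subsumers and shorter paths give higher scores."""
    return 2 * depth_lcs / (path_len + 2 * depth_lcs)


def sim_leacock_chodorow(path_len, max_deep):
    """Leacock & Chodorow: negative log of the path length scaled by taxonomy depth."""
    if path_len <= 0:
        # log(0) is undefined; identical concepts need separate handling,
        # for example by counting nodes instead of edges.
        raise ValueError("path length must be positive")
    return -math.log(path_len / (2 * max_deep))
```

For example, two siblings that are 2 edges apart under a common subsumer at depth 4 obtain `sim_wu_palmer(2, 4) == 0.8`. With a maximum depth of 16, `sim_leacock_chodorow(2, 16)` is larger than `sim_leacock_chodorow(4, 16)`, reflecting that shorter paths mean higher similarity.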

Figure 3. An Information Content Approach presented by Resnik [9]

3.2.2 Lin Approach [36]: The similarity among concepts is measured as the ratio between the amount of information needed to state their commonality and the amount of information needed to describe them fully.

$$Sim_{Lin}(c_i, c_j) = \frac{2 \cdot IC(LCS(c_i, c_j))}{IC(c_i) + IC(c_j)}$$
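Lin's ratio can be sketched directly from information content values. The sketch assumes the form 2 * IC(LCS) / (IC(c_i) + IC(c_j)); the function name is illustrative, and returning 0 when both concepts carry no information is a convention chosen here, not something stated in the source.

```python
def sim_lin(ic_a, ic_b, ic_lcs):
    """Lin similarity from the IC of two concepts and of their LCS."""
    total = ic_a + ic_b
    if total <= 0:
        # Both concepts carry no information (e.g. both are the root);
        # the ratio is undefined, so 0.0 is returned by convention here.
        return 0.0
    return 2 * ic_lcs / total
```

When the LCS is as informative as the concepts themselves the score is 1, for example `sim_lin(2.0, 2.0, 2.0)`, and when the only shared subsumer is an uninformative root the score is 0.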

Figure 4. Tversky Model for two words 'car' and 'bicycle' [34]
Two words/concepts are more similar if they have more characteristics in common. Distinguishing characteristics tend to decrease the similarity.

Figure 6. Measures over the ontology frameworks

Figure 7 (a), (b), (c), (d). Data sets used in various measures.
Taxonomical relationships considered by the different knowledge-based approaches are shown in Fig 8.

Table 2. Edge based measures on various parameters

Table 3. Information content based measures on various parameters

Table 4. Feature based measures

Table 5. Hybrid measures on various parameters

Figure 5. Accuracy graph for the different measures (series: Min_accuracy, Max_accuracy).
Information content approaches work for cross ontologies where the complete structure and relationships are described among the ontologies. Feature based approaches work well on single and cross ontologies. The only limitation of this method is finding the complete features of the concepts in the defined domain ontologies. Various measures work on the different ontology frameworks.
EAI Endorsed Transactions on Scalable Information Systems, Online First