Novel Semantic Relatedness Computation for Multi-Domain Unstructured Data

Semantic relatedness computation is a fundamental and essential step for domains such as Information Retrieval, Natural Language Processing, and the Semantic Web. Many techniques for semantic relatedness calculation within a single domain have been proposed. However, these techniques give inappropriate results for massive multi-domain datasets because they assign non-negligible relatedness to concepts from different domains that are in fact unrelated; such similarities should be minimized. In this paper, a novel method, "Modified Balanced Mutual Information (MBMI)," for calculating the semantic relatedness of multi-domain data is proposed. In the proposed method, concepts are first extracted from a given corpus and a fuzzy vector is then built for each concept. The proposed method has been compared with other existing methods, using medical and computer science articles as the dataset, and shows better results for multi-domain data.

Received on 18 March 2020; accepted on 26 June 2020; published on 30 June 2020


Introduction
Data exists mainly in three forms: structured, semi-structured, and unstructured. Unstructured data comprises emails, blogs, web textual data, tweets, news articles, e-learning articles, online study material, Wikipedia, and so on, and it accounts for a higher percentage of all open data worldwide than any other format. Because of its size and lack of structure, it carries considerable ambiguity, which has given rise to different algorithms for extracting information using parameters that depend on the field, such as news articles, web mining, spam detection, and reviews and ratings; e-learning tools apply text mining for knowledge representation and information extraction. Unstructured data has created tremendous revenue in areas such as sentiment analysis, text summarization, and movie/product recommendation systems. The sources of all these forms of data are social media, Wikipedia, news, and customer reviews of movies, online products, foods, etc. This Big Data is the next generation of business analytics and data warehousing and is expected to produce top-line profits for industries. The most significant aspect of this phenomenon is the accelerated pace of change and transformation; where we are today is not where we will be in merely two years. Over the last few years, data has increasingly become unstructured: in the corporate sector, about 80 percent of data is unstructured. The sources of data have also grown beyond operational applications. Text, news, blogs, emails, e-books, geospatial data, and Internet data are unstructured data. Semi-structured data is frequently an aggregate of mixed types of data that has a recognizable pattern or structure but is not as rigidly defined as structured data.
Semantic relatedness computation is a fundamental and essential step for domains such as Information Retrieval, Natural Language Processing, and the Semantic Web. Many techniques for semantic relatedness or mutual information calculation have been developed, such as Normalized Google Distance (NGD), Balanced Mutual Information (BMI), and so on. A similarity measurement technique should report high similarity between related terms and minimal similarity between highly dissimilar items. However, these techniques fail on multi-domain big data [2] because they assign non-negligible relatedness to concepts from different domains that are not actually related, where the similarity should be minimal. We have used medical science and computer science articles for our experiments. After preprocessing, essential concepts are extracted and a vector is generated for each; a fuzzy vector is then obtained using different semantic relatedness techniques.

Motivation
The motivation behind deriving a new formula is that the existing Balanced Mutual Information (BMI) computes similarity from the cases where two terms appear together and where both are absent together, minus the cases where one term appears and the other is absent. However, in large data silos, two weakly correlated terms may appear together in one part of the corpus while both are missing from the major part of the data distribution. BMI therefore yields a high value even though the terms are only weakly related; in fact, BMI overestimates relatedness whenever two terms are rarely present but frequently co-absent.

Contribution
We have derived a new formula for semantic relatedness computation. We have worked on academic articles, mainly research articles drawn from multiple domains. We have compared our formula with existing techniques and shown that it gives better results.

Organization
Section 2 discusses semantic relatedness using different techniques. Section 3 covers the proposed framework and methodology. Section 4 presents the results obtained. In Section 5, the conclusion and future work are given.

Related Work
Computation of semantic relatedness between two concepts determines the extent to which the terms are semantically close to each other. Multiple relations may exist between two concepts, and any association between them can be extracted; for example, cancer is related to chemotherapy. Relatedness is a broad term with similarity as its subset; one example is the Is-A relation, as in cat Is-A mammal. Similarity measures are an essential and fundamental step for information extraction/retrieval, knowledge extraction, and so on.
Getting the similarities among tokens is the foremost step. The probabilistic distribution of words is used to obtain mutual information semantically or lexically; this can also be done for sentences and paragraphs. Information retrieval is a vital field based on this concept, and many techniques have been developed to retrieve more accurate results. Latent Semantic Analysis (LSA), also called Latent Semantic Indexing, is a mathematical model developed to improve the accuracy of information retrieval [3]. LSA extracts a concept-to-concept or term-document matrix from a given corpus and then applies Singular Value Decomposition (SVD) [4] if the matrix becomes large. SVD removes the less important features, thus reducing the dimension of a large matrix: it decomposes a given matrix S of order M x N into three matrices, which in turn reduces the rank of the matrix, shrinking its size while approximating the same information stored in the original matrix. Some applications of semantic similarity measures are semantic indexing [5], word sense disambiguation [6,7], coreference resolution [8], information extraction patterns [9], topic coherence [10], and spelling correction [11]. In e-learning, learning objects with semantic similarities are used to generate a knowledge graph and recommend a personalized learning path [12,13].
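To make the rank-reduction step concrete, here is a minimal NumPy sketch of a rank-k SVD approximation of a small term-document matrix; the matrix values and the choice k = 2 are illustrative, not taken from the paper.

    import numpy as np

    # Illustrative term-document count matrix S (M terms x N documents).
    S = np.array([[2., 0., 1.],
                  [1., 3., 0.],
                  [0., 1., 2.],
                  [2., 1., 1.]])

    # Full SVD: S = U * diag(s) * Vt.
    U, s, Vt = np.linalg.svd(S, full_matrices=False)

    # Keep only the k largest singular values (rank-k approximation).
    k = 2
    S_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # S_k approximates the information in S at lower rank.
    print(np.round(S_k, 2))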
An N-gram is an n-tuple of words from a sentence that follow each other. Take, for example, sentences like "Hadoop is a big data tool," "Hadoop can process big data," or "Hadoop can process big data in real-time": the patterns of words following each other can be stored as an index [14,15]. Edit-distance measures such as Damerau-Levenshtein compare strings at the character level [16,17]. The Smith-Waterman algorithm finds similar regions of protein or nucleic acid sequences by optimizing a similarity measure over segments of different lengths [18]. The Needleman-Wunsch algorithm [19] likewise uses a string score with dynamic programming to align complete nucleotide/protein sequences in bioinformatics, returning the alignment with the highest score. Probabilistic linkage technology has been used to link sizeable public health databases by computing scores between two given files of individual data under uncertainty, based on probability of error [20]. Given two strings, [21] proposed a string comparator measurement for computing partial agreement between them, updating exact agreement weights when the given strings do not agree character by character. CLEEK links entities using multi-domain data from a Chinese long-text corpus [22]. Balanced Mutual Information (BMI) calculates mutual information between two terms or concepts by accounting for how often both terms are present or absent together, minus how often one term is present while the other is absent, and so on [23].
When we have semantic context vectors for two concepts, C1 and C2, we can find which members are shared and which are distinct. The Jaccard similarity index [24] uses this approach.
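As a minimal sketch (ours, not the paper's), the Jaccard index over the term sets of two context vectors:

    # Jaccard similarity: |intersection| / |union| of the two term sets.
    def jaccard(c1_terms, c2_terms):
        c1, c2 = set(c1_terms), set(c2_terms)
        return len(c1 & c2) / len(c1 | c2) if (c1 | c2) else 0.0

    # Illustrative context-vector term sets for two concepts.
    print(jaccard({"scan", "brain", "image"}, {"brain", "image", "tumor"}))  # 0.5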
Cosine similarity finds the angle between two objects by taking their feature vectors as input; it gives an output from 0 (not similar at all) to 1 (highly similar). Other vital measures are Normalized Google Distance (NGD) [25], Kullback-Leibler divergence, and Expected Cross-Entropy (ECH). NGD calculates the semantic distance between two words x and y as

    NGD(x, y) = [max(log f(x), log f(y)) - log f(x, y)] / [log N - min(log f(x), log f(y))]

where f(x) and f(y) are the numbers of pages containing x and y, f(x, y) is the number of pages containing both, and N is the total number of pages indexed. [26] used short texts as input for finding semantic similarity based on lexical matching. WordNet, a lexical database for English, provides relations and a hierarchy among synsets [27]. Other techniques using information content are Jiang and Conrath [28], Resnik [29], and Lin [30]. Wikipedia is a vast, rapidly evolving tapestry of highly hyperlinked textual content; its articles, categories, and redirects make it a great resource for natural language processing. Based on Wikipedia, work has been done using the Wikipedia link structure, WikiWalks [31], the Wikipedia Link Vector Model [32], and WikiRelate [33]. Semantic interpretation of terms has often been made through vectors, as in word2vec or other techniques obtained through a windowing process. [34,35] used a fuzzy context vector to represent a concept. For big data analytics, a distributed technique is used in [36]. Semantic-based document clustering has been done using Wikipedia and the concept of ontology [37].
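For reference, a minimal sketch (ours, not the paper's) of cosine similarity between two feature vectors:

    import numpy as np

    # Cosine similarity: cos(theta) = (a . b) / (|a| * |b|).
    def cosine(a, b):
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Illustrative feature vectors for two concepts.
    print(round(cosine([1, 2, 0], [2, 1, 1]), 3))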

Problem Statement
Prominent approaches such as BMI, CP, ECH, Jaccard, KL, MI, and NGD have been used to compute semantic similarity in a given corpus for a particular domain. However, nowadays all data sources generate Big Data, whose characteristics have already been discussed in the Introduction. Big Data has a massive size and contains data from different domains, and the performance of these semantic relatedness computations degrades there, since they report non-zero similarity among terms across different domains.

Modified Balance Mutual Information -A Novel Technique
We propose a new formula, Modified Balanced Mutual Information (MBMI), given in Eq. (5), which weights the cases where two terms appear together by a factor β and subtracts (1 - β) times the cases where the two terms are absent together. The existing Balanced Mutual Information (BMI) computes similarity from the cases where two terms appear together and where both are absent together, minus the cases where one term appears and the other is absent. However, in large data silos, two weakly correlated terms may appear together in one part of the corpus while both are absent from the major part of the data distribution, so BMI yields a high value even though the terms are only weakly related, and it consistently overestimates relatedness for such term pairs. We show experimentally that our method gives better-calibrated results.
α and δ are kept between 0 and 1. We have set α = 0.55 and δ = 0.45 to give higher weightage to the case where two concepts appear together in a window than to the case where only one concept is present in a window.
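Since Eq. (5) itself is not reproduced above, the following Python sketch only mimics the behaviour described: co-presence is rewarded through α, single presence is penalized through δ, and co-absence is subtracted with weight (1 - β). The exact combination of window counts is our assumption for illustration, not the authors' formula.

    def mbmi(n11, n10, n01, n00, w, alpha=0.55, delta=0.45, beta=0.9):
        # ASSUMED combination of window counts mimicking the described
        # behaviour; Eq. (5) in the paper may differ.
        # n11: windows with both terms; n10/n01: only one term present;
        # n00: windows with neither term; w: total number of windows.
        p11, p10, p01, p00 = n11 / w, n10 / w, n01 / w, n00 / w
        return beta * (alpha * p11 - delta * (p10 + p01)) - (1 - beta) * p00

    # Two weakly related terms: rare co-occurrence, massive co-absence.
    # A BMI-style score would be inflated by the 988 co-absent windows;
    # here that mass lowers the score instead.
    print(round(mbmi(n11=3, n10=5, n01=4, n00=988, w=1000), 3))  # -0.101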

Implementation
As we have taken research articles of the medical domain and the computer science domain as our input, the data needs to be preprocessed before the algorithms can be applied.

Document Preprocessing
Stop-word Removal: Words that carry no informational quality beyond grammatical connotation are placed in a removal set; after removing them, the remaining words form a more productive bag of words for analysis. POS Tagging: The mechanism of labeling words with their respective parts of speech. Stemming: Reducing the grammatical variants of a word, such as its noun, adjective, and adverb forms, to its root word; POS and the different surface occurrences of a word are not distinguished after stemming.
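A minimal sketch of this preprocessing pipeline using NLTK (our illustration, not the paper's code; the sample sentence is invented):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    # One-time downloads: punkt, stopwords, averaged_perceptron_tagger.
    text = "MRI images of the brain are processed with deep networks."
    tokens = nltk.word_tokenize(text.lower())

    # Stop-word removal: drop words with only grammatical connotation.
    stop = set(stopwords.words("english"))
    content = [t for t in tokens if t.isalpha() and t not in stop]

    # POS tagging: label the remaining words with parts of speech.
    tagged = nltk.pos_tag(content)

    # Stemming: collapse grammatical variants onto a root form.
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in content]
    print(tagged, stems)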
Mutual information between two entities, i.e., two terms, can be computed by the following formula:

    MI(t_i, t_j) = log( Pr(t_i, t_j) / (Pr(t_i) * Pr(t_j)) )

where the probabilities of finding terms t_i and t_j in a window are Pr(t_i) and Pr(t_j), respectively, estimated as w_{t_i}/w and w_{t_j}/w, with w_{t_i} and w_{t_j} being the counts of windows containing t_i and t_j out of w total windows [11]. From a programming point of view, the equation is evaluated directly over these window counts. We have defined a fuzzy context vector for every concept extracted from the corpus: if C_i is the i-th concept, then the membership function µ_{C_i}(t_j) of a term t_j with C_i is the normalized weight assigned to t_j by the windowing process, so the concept C_i holds the terms obtained for it by windowing together with their corresponding weights µ_{C_i}(t_j).
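A minimal sketch of the windowed MI estimate and of a fuzzy context vector built from window counts (our illustration; the counts and the normalization used for the memberships are assumptions):

    import math

    def mi(w_ij, w_i, w_j, w):
        # Pointwise mutual information from window counts: w_ij windows
        # contain both terms, w_i and w_j contain each term, out of w.
        return math.log((w_ij / w) / ((w_i / w) * (w_j / w)))

    # Illustrative counts: (co-occurrence with "mri", own window count).
    counts = {"brain": (40, 60), "image": (55, 90), "data": (10, 200)}
    w_mri, w = 80, 1000

    # Fuzzy context vector: each neighbour term gets a membership in
    # [0, 1]; here co-occurrence counts are normalized (an assumption).
    fuzzy_mri = {t: w_ij / w_mri for t, (w_ij, _) in counts.items()}
    print({t: round(m, 2) for t, m in fuzzy_mri.items()})
    print(round(mi(40, w_mri, 60, w), 3))  # MI of "mri" and "brain"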

Semantic Computation
Finally, to get the semantic relatedness between two concepts, we require their fuzzy context vectors. We have carried out an extensive computational process to obtain these fuzzy vectors.
Here match(i) is used to reduce the computation, since the result is not affected: for each term in the first vector, the term in the second vector with the highest match is taken for computation. The semantic similarity between two concepts is then computed as follows:

    repeat for all concepts C_i in ArrayConcept:
        repeat for all concepts C_j in ArrayConcept:
            sum = 0
            repeat for all terms t_k in the fuzzy vector Fv_i:
                sum = sum + R(t_k, t_m(k)) * µ_{C_i}(t_k) * µ_{C_j}(t_m(k))
            display the semantic distance of C_i with C_j as 1/sum

where R(t_i, t_m(i)) means the mutual information of t_i and t_m(i), t_m(i) being the best-matching term of t_i in the fuzzy vector Fv_j.
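A runnable sketch of this loop (our illustration; the fuzzy vectors and the mutual-information table R are invented, with R playing the role of R(t_i, t_m(i)) above):

    def relatedness(fv1, fv2, R):
        # Semantic distance between two concepts from their fuzzy context
        # vectors: each term in fv1 is paired with its best match in fv2.
        total = 0.0
        for t1, mu1 in fv1.items():
            # match(i): pick the term of fv2 with the highest R value.
            t2 = max(fv2, key=lambda t: R.get((t1, t), 0.0))
            total += R.get((t1, t2), 0.0) * mu1 * fv2[t2]
        return 1.0 / total if total else float("inf")

    # Illustrative fuzzy vectors for concepts "MRI" and "Image".
    fv_mri = {"brain": 0.5, "scan": 0.7}
    fv_img = {"pixel": 0.6, "scan": 0.8}
    R = {("brain", "pixel"): 0.1, ("brain", "scan"): 0.3,
         ("scan", "scan"): 0.9, ("scan", "pixel"): 0.2}
    print(round(relatedness(fv_mri, fv_img, R), 3))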
After these datasets were preprocessed, frequent terms were assumed to be learning topics and treated as concepts. Applying a specific threshold value, the essential concepts were extracted. Figure 3 shows the number of concepts extracted versus the threshold value. We set the threshold at the 20th percentile, i.e., terms scoring above this value were treated as concepts, for each of which a vector is generated. In this paper, a novel semantic relatedness technique has been proposed and evaluated by computing semantic relatedness for cross-domain academic articles, with medical and computer science journals as our dataset. The result is dependent on the domain as well as on the content of the input articles, so the output depends heavily on the corpus being used.
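A minimal sketch of this percentile thresholding (our illustration; the frequency table is invented):

    import numpy as np

    # Illustrative term frequencies after preprocessing.
    freq = {"mri": 120, "brain": 95, "data": 300, "ieee": 4,
            "image": 150, "etc": 8}

    # Keep terms above the chosen percentile of the frequency distribution.
    cut = np.percentile(list(freq.values()), 20)
    concepts = [t for t, f in freq.items() if f > cut]
    print(cut, concepts)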
The snapshot in Figure 2 shows the concepts extracted above a certain threshold and their corresponding fuzzy context vectors. Figures 4 and 5 show the fuzzy context vectors extracted for the concepts 'Image' and 'MRI'. For the extracted fuzzy vectors of these concepts in our dataset, BMI gives a high value for every element in the vector, although this is not justified, whereas MBMI gives a balanced value, neither too low nor too high. Figures 6, 7, 8, and 9 show the semantic relatedness of the concepts 'MRI', 'DATA', 'IMAGE', and 'BRAIN' with other concepts using different techniques: BMI, Jaccard, NGD, CP, ECH, KL, MI, and our method MBMI. In Figure 6, MRI is not related to deep and network, yet BMI gives values 0.747 and 0.878, respectively, whereas MBMI gives values 0.152 and 0.115. In Figure 7, DATA is related at nearly 1 with other concepts under the BMI technique, whereas MBMI gives semantic similarities of 0.144, 0.2, and 0.431 with the concepts brain, convolutional, and MRI, respectively. In Figure 8, IMAGE is related with value 1 to almost all concepts under BMI, whereas MBMI gives similarity values of 0.224, 0.492, 0.505, 0.329, and 0.151 with learn, deep, network, and data, respectively. In Figure 9, BRAIN is related to learn, network, data, method, and IEEE with values 0.781, 0.947, 0.972, 1, and 0.842, respectively, using the BMI technique, whereas with MBMI we get values 0.015, 0.178, 0.081, 0.189, and 0.17. Figures 10 and 11 show that very dissimilar terms are reported as dissimilar by our proposed technique, in contrast to the other techniques. Thus MBMI is useful in the scenario where we have vast, multi-domain data, i.e., Big Data with an unstructured part, being processed for e-learning. The result also depends on the distributional probabilities of the terms, the content, and the domain.

Conclusion and Future Work
A novel semantic relatedness technique has been proposed and evaluated for semantic relatedness computation on cross-domain unstructured data. We have used academic articles from medical science and computer science journals. Our technique gives better results, particularly when two concepts are not related at all. The proposed work can also be extended to finding similar and dissimilar authors based on their publications, and it can be applied to the clustering of news articles and to fake news detection. Furthermore, a crawler can be built to download articles from Elsevier or other academic data sources, and since processing textual data takes much time, Big Data tools like Hadoop or Spark can be used to process these articles and obtain relevant results.