Unsupervised Machine Learning based Documents Clustering in Urdu
Atta Ur Rahman et al.

The volume of data on the web is growing rapidly due to the proliferation of news sources, blogs, journals and other content. Like other languages, Urdu has also seen tremendous growth on the internet. As the volume of data expands, information retrieval (IR) becomes more complicated. Document clustering is an unsupervised machine learning (ML) approach that groups a large number of dispersed documents into a small number of meaningful and coherent clusters, thus providing a basis for indexing, IR and browsing mechanisms. Document clustering has a long tradition in English and similar Western languages, but Urdu lags behind in terms of sophisticated natural language processing (NLP) tools and resources for document clustering. Document clustering is a challenging task in Urdu because of its rich morphology, particular structure, syntactic peculiarities and cursive script. In this study, we developed a framework for document clustering and analysed various similarity measures for Urdu documents. We also examined the effect of stop-word removal on Urdu document clustering.


Introduction
The amount of data on the web is expanding quickly due to the large-scale and rapid growth of web technologies [1][2][3][4]. These collections are continuously updated as documents are added and carry a heavy query load. This unstructured information has prompted new research into analysing this huge volume of data, sorting related information and thereby improving the organisation of content on the web [5]. Nearly every piece of information one requests is now accessible on the internet [6]. English and European languages have dominated the web since its beginning [7]. However, in the past few years, a wide range of information in the local languages of the Indian subcontinent, such as Urdu, Hindi, Bengali, Oriya, Tamil and Telugu, has appeared on the internet [6,8]. The richness of the data, along with the vibrant and diverse nature of the web, makes information retrieval (IR) a challenging task [9,10]. Document clustering provides a structure for categorising a large collection of documents [11,12]. It is used to automatically find the intrinsic characteristics and natural groupings among documents and to sort them into clusters [13,14]. Document clustering is an attractive approach because it groups documents without human intervention and frees organisations from manual categorisation of documents, which can be an arduous and tedious process [15]. Various studies on document clustering that use English-language documents as input have been presented [16]. However, each language can yield a different level of accuracy, depending on its forms and characteristics, such as morphological and syntactic peculiarities, the use of antonyms and synonyms, and the use of native expressions [17,18]. The rest of this paper is organised as follows: Section 2 highlights the importance and challenges of Urdu, Section 3 describes the unsupervised learning approach, Section 4 presents related work, Section 5 describes the proposed work, Section 6 covers the adopted unsupervised clustering algorithm, followed by the experimental work, and Section 8 provides the conclusion.

Urdu Language
Urdu is a lingua franca and the national language of Pakistan [25,26]. According to Wikipedia statistics, there are 100 million native speakers of Urdu in Pakistan and India and an additional 300 million speakers around the globe [25]. The development of computational resources is the elementary step in any natural language processing (NLP) task. Urdu is one of the most widely spoken languages of the Asian subcontinent, yet because resources are scarce, little work has been done on Urdu language processing [7]. The "Daily Jang" was the leading newspaper to produce Urdu text digitally in the Nastaliq script. Currently, many Urdu journals and magazines are published in Pakistan on a daily basis. A large quantity of tweets is also produced by outlets such as Geo News, Jang News, Roznama Dunya, Dawn News, BBC, ARY, AJJ and Abb Takk News. Moreover, more than 3,000 Urdu publications are distributed in India on a regular basis.

Challenges in Urdu Document Clustering
The many issues identified in the Urdu language make document clustering a difficult task. This section describes a few major constraints that degrade the performance of the proposed framework.
• Resource scarceness: The large number of complexities associated with Urdu text makes it a rarely studied language in NLP. A large benchmark corpus is the essential prerequisite for any NLP task. However, there is no standard corpus available for Urdu language processing. The available Urdu named entity (NE) labelled corpora are the Becker-Riaz (2002) and EMILLE (2003) corpora. Currently, no dataset is available for Urdu document clustering.
• Context-sensitive and cursive nature: In Urdu, the shape of a character is influenced not only by its position but also by its adjacent characters. Urdu characters have distinct shapes at the beginning, middle and end of a word. For example, the shape of (te, ت) in the word (love, محبت) differs from its shape in the word (gift, تحفہ). Similarly, the character "ل" has different shapes in the words (slave, غلام), (electricity, بجلی), (long, طویل) and (but, لیکن).
• Word segmentation problem: Segmentation is particularly problematic in Urdu because the space character is not a reliable word boundary. Space insertion and space omission problems are caused by the way spaces are used in Urdu text. Space insertion occurs when, for example, "خوب صورت" (khoobsurat, beautiful) is a single word, but because of the inserted space the system treats it as two words, "خوب" and "صورت". Space omission occurs when, for example, "اس لیے" (is liye, therefore) is two words, but because of the omitted space the system treats "اسلیے" as a single word.
• Compound name issues: A compound name is made up of several words, such as (ولادی میر پوتن, Vladimir Putin) and (عمران احمد نیازی, Imran Ahmad Niazi). Each refers to a single entity, but the system treats each as three separate words, namely "ولادی", "میر", "پوتن" and "عمران", "احمد", "نیازی".
• Large number of synonyms: Urdu has a large number of synonyms. For example, (جنت, heaven) has synonyms such as (بہشت، باغ، فردوس), which creates a significant problem in document clustering.
• Conjunction issues: Some entities are formed using a conjunction word, such as (پاکستان اور چین, Pakistan and China) and (علم و دانش, knowledge and wisdom).
• Acronym ambiguities: In English, acronyms can be easily distinguished because of capitalisation. In Urdu, however, acronyms are very hard to recognise, for example (بی بی سی, BBC), (سی پیک, CPEC) and (یو این او, UNO).
The rest of the paper is structured as follows: Section 2 describes related work in detail, Section 3 explains the proposed architecture employed for Urdu document clustering, Section 4 presents the experimental analysis, results and evaluation metric, and Section 5 concludes the paper.

Unsupervised Machine Learning
Supervised learning algorithms are typically used for classification. Such algorithms depend on manually labelled datasets and are domain dependent. For that reason, supervised algorithms are time consuming, require manual expertise and have relative difficulty capturing the vocabulary of human discourse. Familiar examples of supervised learning algorithms are the support vector machine (SVM), K-nearest neighbour (KNN) and the Naive Bayes classifier. Unsupervised algorithms, in contrast, work without training datasets and are domain independent.
Thus, the purpose of an unsupervised learning algorithm is to identify the underlying grouping of the data and refine its structure. Common examples of unsupervised learning algorithms are Association Rule Mining (AM), document clustering, the Likelihood Ratio Test (LRT) and K-means clustering [19].
Because unsupervised algorithms are domain independent and do not require labelled datasets, unlike supervised algorithms, we used an unsupervised algorithm for Urdu document clustering. Urdu is a resource-scarce language, so it is comparatively difficult to employ supervised techniques for Urdu document clustering.

Document Clustering
These days, data are produced abundantly in the form of news articles, social networks such as Twitter, e-books, financial analyses and so on. An estimated 80% of all data on the web is unstructured [19]. Conventional database query methods are not suitable for obtaining useful information from this large collection of data. Document clustering is the unsupervised categorisation of a set of documents into coherent clusters such that documents in the same cluster are more similar to one another than to documents in other clusters [20], as shown in Figure 1 [21]. These clusters are constructed at run time by the clustering procedure, rather than being labelled in advance as in document classification, which is usually referred to as supervised or pre-labelled categorisation of documents [22,23]. Document clustering has been used for various applications such as IR, indexing, browsing large document corpora and extracting data from the web [24].

Related Work
Document clustering has been broadly reviewed in the data mining literature [1]. Considerable research has been devoted to developing well-organised document clustering techniques [2]. Hierarchical K-means based clustering (HKM) was applied to 242 Arabic documents, and clustering-based IR was found to give markedly better results than the traditional IR framework [3]. An experimental investigation of automated text clustering was carried out on Brazilian Portuguese text, with the goal of finding the best computational technique for clustering the documents [4]. A multilingual document clustering framework was presented and tested on the FIRE dataset using a bisecting k-means algorithm [5]. Hierarchical clustering is broadly divided into divisive (top-down) and agglomerative (bottom-up) clustering [2,6]. The divisive approach begins with all data objects in a single cluster and divides them into sub-clusters based on some splitting criterion until each data object forms a cluster of its own or some termination condition is reached [7]. The agglomerative approach begins with each data object as a separate cluster and merges clusters based on some proximity metric. The process continues until all data points are merged into a single cluster or some stopping condition is reached [8].

In partitional clustering, a dataset of n objects is directly decomposed into a set of K disjoint clusters based on some optimisation criterion [9]. K-means, defined by [10], and K-medoids, described by [11], are the two prominent algorithms of this type. A comparison between the k-means and k-medoids algorithms on an Arabic dataset of 242 documents found that k-medoids performed better than k-means [12]. The key idea of K-means is to iteratively update the centre of each cluster, computed as the mean of its data objects, until some stopping criterion is met [10]. K-medoids is an enhancement of K-means for discrete data that takes the data points nearest the cluster centres as the cluster representatives [11]. The commonly used partition-based clustering algorithms also include PAM, defined by [13], CLARA, described by [14], and CLARANS, introduced by [15]. In density-based clustering, the core idea is that documents lying in a high-density region of the document space are considered to belong to the same cluster [16]. A density-based k-means algorithm (DBK-means) was proposed to improve on the performance of the DBSCAN and K-means algorithms. On a dataset of 250 documents, DBK-means outperformed both k-means and DBSCAN [17]. A clustering algorithm based on density and distance has also been used. It calculates the distance and density of every data point and, using a decision graph, merges those data objects that have the minimum distance and the highest density [18]. COBWEB, presented by [19], and GMM, outlined by [20], rely on statistical learning and neural networks. In kernel-based clustering algorithms, the input space is mapped into a high-dimensional feature space. The classic algorithms of this type are kernel K-means, explained by [21], kernel FCM, presented by [22], kernel SOM, specified by [23], and SVC, characterised by [24]. A clustering algorithm known as affinity propagation (AP), proposed in 2007, relies on "message passing" between data objects. Unlike k-means, this algorithm does not take the number of clusters as an input. However, like k-medoids, it finds "exemplars", members of the input set that are representative of clusters [25].

Various strategies have been adopted to capture semantic correlations between documents [26]. A well-known tool, WordNet, has been used to improve the semantic association between words, such as synonyms [27]. Further ontology-based research has also been carried out [28,29], focusing on the semantic relationships between words. A Chinese news clustering approach was proposed using a neural network language model [30]. K-nearest neighbour, k-means and support vector machines were employed for Marathi news clustering [31]. Agglomerative hierarchical clustering was proposed for Urdu ligature recognition, with Naïve Bayes, decision tree, K-nearest neighbour and linear discriminant analysis used for classification [32]. A detailed study of Urdu document images was conducted using various clustering algorithms such as the self-organising map, K-means and hierarchical clustering [33]. Urdu ligature classification was performed using a deep neural network on a corpus of 2430 ligatures, achieving an accuracy of 73.13% [34]. Table 1 shows the work most closely related to Urdu document clustering.

Proposed Architecture
The need for Urdu document clustering has become pressing because Urdu content is growing rapidly on the web. The steadily expanding use of document clustering and its wide range of applications led us to carry out an experimental analysis of Urdu document clustering based on various similarity measures. The proposed methodology consists of the following steps:

Dataset
A benchmark dataset is required for every natural language processing task [7]. A substantial pre-labelled dataset is essential to train a model effectively. However, Urdu lacks such linguistic resources for natural language processing tasks [59,60]. The dataset used in this study was collected from the BBC Urdu news portal (http://www.bbc.com/urdu) and saved as UTF-8 encoded text using Notepad. The dataset comprises 1000 documents from five distinct classes (Arts, International, National, Sports and Science news), and each document contains a different number of sentences and tokens, as shown in Table 2.

Pre-processing
Data pre-processing is an essential phase carried out before any NLP, IR or data mining task [7]. The text of every natural language comprises two kinds of words: function words and content words. Content words carry lexical meaning, while function words play a syntactic role [25]. Stop words belong to the category of function words. Removing stop words gives a total reduction of 30% in file size and document indexing [61]. It is generally recognised that stop words do not contribute lexical meaning to the content, yet their high frequency hinders document clustering. Almost every sentence comprises both content words and stop words.
Here we refer to content words as key words, as shown in Table 3.
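As a minimal sketch of the stop-word removal step described above, the following Python snippet filters a whitespace-tokenised sentence against a small stop-word set. The list `URDU_STOP_WORDS` and the function name `remove_stop_words` are illustrative assumptions, not the list or code used in this study, and whitespace splitting is a simplification given the segmentation issues discussed earlier.

```python
# Illustrative stop-word list (an assumption, not the list used in the study).
URDU_STOP_WORDS = {"کا", "کی", "کے", "اور", "میں", "ہے", "سے", "کو", "پر"}


def remove_stop_words(text, stop_words=URDU_STOP_WORDS):
    """Split text on whitespace and keep only tokens that are not stop words."""
    return [token for token in text.split() if token not in stop_words]


# Example sentence: "relations of Pakistan and China".
sentence = "پاکستان اور چین کے تعلقات"
keywords = remove_stop_words(sentence)
print(keywords)
```

The remaining tokens play the role of the key words in Table 3. A real system would likely need a curated stop-word list and a proper Urdu tokeniser.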

Bag-of-Words Model
In this study, we represent each document using the bag-of-words model. This technique is employed for named entity extraction, opinion target extraction and document clustering. The bag-of-words model works well with Term Frequency Inverse Document Frequency (TF-IDF) weighting for document clustering and records term frequencies without any contextual or syntactic association between the words in a document. The bag-of-words model is a well-known and frequently used method for object classification, information retrieval (IR), similarity measurement and natural language processing (NLP). The model treats a document as a bag of its words, ignoring word order [62].
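The bag-of-words representation described above can be sketched in a few lines of Python. The helper names `bag_of_words` and `to_vectors` are hypothetical, and the English placeholder tokens stand in for pre-processed Urdu tokens. This is one possible realisation, not necessarily the one used in the study.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count how often each term occurs, ignoring word order."""
    return Counter(tokens)


def to_vectors(bags):
    """Map each bag onto a shared, sorted vocabulary as a count vector."""
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, vectors


# Placeholder tokens standing in for two pre-processed documents.
doc_a = bag_of_words(["cricket", "match", "cricket"])
doc_b = bag_of_words(["match", "film"])
vocabulary, vectors = to_vectors([doc_a, doc_b])
print(vocabulary)
print(vectors)
```

Using a shared vocabulary gives every document a vector of the same length, which the similarity measures in the next section can compare directly.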

Similarity Measures
Document clustering needs a precise definition of the proximity between a pair of documents, in terms of either pairwise similarity or distance [63]. A similarity measure captures the likeness between documents or records and assigns it a value in the range 0 to 1 [21]. Documents can be similar in two different ways: lexically or semantically. They are lexically similar if they contain the same dictionary words, and semantically similar if they express the same concept [64,65]. To date, there is no agreed similarity metric that is suitable for a wide range of clustering applications [66]. This section describes some frequently used similarity metrics employed in document clustering.

Cosine Similarity
Cosine similarity treats each document as a vector of terms or features, and the similarity of two documents is calculated as the cosine of the angle between their vectors [65][66][67]. The terms or words of a document are known as its features or dimensions [30]. Cosine similarity is a widely used similarity metric in different areas of IR, such as clustering, classification and pattern recognition. Mathematically, it can be described as

$$\cos(doc_i, doc_j) = \frac{doc_i \cdot doc_j}{\lVert doc_i \rVert \, \lVert doc_j \rVert}$$

where $doc_i$ and $doc_j$ represent arbitrary documents.
Suppose we have two documents, doc 1 and doc 2, as shown in Table 4. Their cosine similarity is then calculated as shown in Table 5.
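The cosine computation can be illustrated with a short Python function operating on term-count vectors. The function name `cosine_similarity` and the example vectors are assumptions for illustration and are not the values from Tables 4 and 5.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


# Two hypothetical count vectors over a three-term vocabulary.
print(cosine_similarity([2, 0, 1], [0, 1, 1]))
```

For non-negative count vectors the result lies between 0 and 1, which matches the value range stated for the similarity measures.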

TF-IDF (Term Frequency Inverse Document Frequency)
Numerous term weighting methods have been suggested to compute the weight of a term in a specific document and in the entire corpus. Of these, TF-IDF is the most widely used term weighting scheme. It determines the weight of a term from its frequency within a document and the inverse of its document frequency within the corpus [68]. In practice, terms that occur repeatedly in a few documents but rarely in the remaining documents tend to be more significant for that particular set of documents [30].
Mathematically, TF-IDF can be represented as

$$(Tf\text{-}Idf)_{td} = Tf_{td} \times Idf_{td}$$

Here $Tf_{td}$ indicates the term frequency, $Idf_{td}$ represents the inverse document frequency and $(Tf\text{-}Idf)_{td}$ expresses the overall weight of term t in document d [65,69]. The Tf-Idf values of doc 1 and doc 2 are shown in Table 6.
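A small Python sketch of this weighting follows. The paper does not state which TF and IDF variants were used, so the raw term count and the natural logarithm of N divided by the document frequency are assumptions here, as is the function name `tf_idf`.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document.

    Assumed variants: tf = raw count in the document,
    idf = ln(N / number of documents containing the term).
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append(
            {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}
        )
    return weights


# Placeholder tokens standing in for two pre-processed documents.
docs = [["a", "b", "a"], ["b", "c"]]
for weight in tf_idf(docs):
    print(weight)
```

Under this variant a term that occurs in every document receives weight 0, which reflects the idea that such terms do not help to separate documents.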

Levenshtein distance
The Levenshtein distance measures the character-based similarity between strings or documents [64]. The closeness between two strings is computed as the number of operations (insertion, deletion or substitution) required to change one string into the other [70]. The Levenshtein distance between two strings s and t, where s is the source string and t is the target string, can be computed as

$$lev_{s,t}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0 \\ \min \begin{cases} lev_{s,t}(i-1,j) + 1 \\ lev_{s,t}(i,j-1) + 1 \\ lev_{s,t}(i-1,j-1) + 1_{(s_i \neq t_j)} \end{cases} & \text{otherwise} \end{cases}$$

where i indexes the characters of s and j indexes the characters of t. The Levenshtein distance between the two strings (درخواست, application) and (درست, right) can be calculated as shown in Tables 7, 8 and 9.
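The edit-distance computation can be sketched with the standard dynamic-programming approach, keeping only two rows of the table. The function name `levenshtein` is an assumption, and this is a generic textbook implementation rather than the code used in the study.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion from s
                    current[j - 1] + 1,      # insertion into s
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


# The example pair from the text: (application) and (right).
print(levenshtein("درخواست", "درست"))
```

Because Python strings are sequences of Unicode code points, the function compares Urdu characters individually. The printed value should agree with the deletion operations described for this example.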

Jaccard Coefficient
The Jaccard coefficient, also known as the Tanimoto coefficient, measures the similarity of two documents by dividing the total weight of their common terms by the total weight of the terms that occur in either of the two documents [71][72][73]. It can be mathematically described as

$$J(d_i, d_j) = \frac{\lvert T_i \cap T_j \rvert}{\lvert T_i \cup T_j \rvert}$$

Here $T_i$ and $T_j$ represent the terms of documents $d_i$ and $d_j$ respectively. The Jaccard coefficient of doc 1 and doc 2 is calculated as shown in Table 10.

In the Levenshtein example (Tables 7 to 9), three deletion operations (delt 1, delt 2 and delt 3, applied to the highlighted characters) convert the source string into the target string.
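As a hedged illustration of the Jaccard coefficient from this section, the snippet below uses the set-based form, that is, the number of shared terms divided by the number of distinct terms in either document. Whether the study used this form or a weighted (Tanimoto) variant is not stated, so the choice and the name `jaccard_coefficient` are assumptions.

```python
def jaccard_coefficient(terms_i, terms_j):
    """Shared distinct terms divided by all distinct terms of the two documents."""
    set_i, set_j = set(terms_i), set(terms_j)
    union = set_i | set_j
    if not union:
        return 0.0  # convention chosen here for two empty documents
    return len(set_i & set_j) / len(union)


# Placeholder tokens standing in for two pre-processed documents.
print(jaccard_coefficient(["a", "b", "c"], ["b", "c", "d"]))
```

The value ranges from 0 for documents with no common terms to 1 for documents with identical term sets.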

K-means Algorithm
This algorithm was first proposed by Stuart Lloyd in 1957, while the term "K-means" was first used by James MacQueen in 1967 [74]. K-means is an unsupervised algorithm that partitions a given dataset into a specified number of clusters [75]. It follows a centroid-based approach, in which each centroid is computed as the mean of the objects assigned to its cluster [2]. The aim of the K-means algorithm is to find the best partition of n data points into k clusters such that the total distance between the data points and their corresponding centroids is minimised [76].
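The minimisation goal stated above is commonly written as the following objective. The squared Euclidean distance and the symbols $x_i$ (data point), $C_j$ (cluster) and $c_j$ (centroid of $C_j$) are assumptions made for illustration, since the text does not fix a distance function.

$$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2$$

Each iteration of K-means cannot increase $J$ under this choice of distance, which is why the procedure terminates once the assignments stop changing.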

Experiment
A dataset of 1000 documents from five distinct classes, as shown in Table 2, was crawled from the BBC Urdu news portal. The dataset was then pre-processed to remove stop words. After pre-processing, the bag-of-words model and the similarity measures were applied to assess the closeness of the documents. The similarity between documents was obtained as a numeric value ranging from 0 to 1. These values were then passed to the K-means algorithm as input for clustering. The K-means algorithm clustered the dataset, and the clustering result was then compared with the manually classified clusters.

Results and Discussion
A manually created classification is ordinarily used as the reference standard for assessing the result of document clustering. Hence, the groups of documents formed using the various similarity measures and the K-means algorithm are compared with the manually classified document clusters. This kind of assessment assumes that the objective of clustering is to imitate human judgement. A clustering is considered reasonable if its groups are consistent with the manually generated clusters.

Evaluation Metric
In the proposed framework, a purity metric is employed to evaluate the result. Purity measures the coherence of a cluster, that is, the extent to which a cluster contains documents from a single class [65]. Given a cluster $C_i$ of size $n_i$, the purity of $C_i$ is expressed as

$$P(C_i) = \frac{1}{n_i} \max_h \left( n_i^h \right)$$

where $\max_h(n_i^h)$ is the number of documents in $C_i$ that belong to the dominant class h. The purity values of each similarity measure in each cluster, obtained using the K-means algorithm, are shown in Table 11.
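The purity computation can be sketched as follows, where each cluster is given as the list of manually assigned class labels of its documents. The function names `cluster_purity` and `mean_purity` are hypothetical, and averaging the per-cluster purities is an assumption based on the averages reported in the result tables.

```python
from collections import Counter


def cluster_purity(labels_in_cluster):
    """Share of documents in the cluster that belong to its dominant class."""
    if not labels_in_cluster:
        return 0.0
    counts = Counter(labels_in_cluster)
    return max(counts.values()) / len(labels_in_cluster)


def mean_purity(clusters):
    """Unweighted average of the per-cluster purities (an assumed aggregation)."""
    if not clusters:
        return 0.0
    return sum(cluster_purity(cluster) for cluster in clusters) / len(clusters)


# Hypothetical cluster contents, labelled with their manual classes.
clusters = [["Sports", "Sports", "Arts", "Sports"], ["Arts", "Arts"]]
print([cluster_purity(cluster) for cluster in clusters])
print(mean_purity(clusters))
```

A purity of 1 means that every document in the cluster comes from one manual class, while values near 1/5 would indicate an even mix of the five classes.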

Conclusion
Nowadays, machine learning approaches are the most widely adopted means of developing NLP frameworks around the globe, in almost all languages including Urdu. The core reason for their wide usage rests on four features: a) the capability of automatic learning, b) the degree of accuracy, c) the speed of processing and d) their generic nature. Document clustering, which aims to organise a huge number of documents distributed over different sites, is a well-investigated task from the ML perspective in Western languages compared with Urdu. The peculiarities of Urdu, such as lack of resources, rich morphology, lack of capitalisation and many other issues, make Urdu document clustering more complex than in languages written from left to right. In this study, we attempted the first adoption of an ML approach, namely the K-means clustering model with various similarity measures, for Urdu documents. The four similarity/distance measures that we analysed experimentally are cosine similarity, Jaccard coefficient, Levenshtein distance and TF-IDF. After conducting the experiments, we observed that each similarity measure has a notable impact on Urdu document clustering except for the TF-IDF measure. Before stop-word removal, we obtained purity values of 0.66 for cosine similarity, 0.71 for Jaccard coefficient, 0.44 for TF-IDF and 0.67 for Levenshtein distance. We also analysed the impact of stop-word removal on Urdu document clustering. After stop-word removal, we obtained purity values of 0.78 for cosine similarity, 0.61 for Levenshtein distance, 0.59 for Jaccard coefficient and 0.44 for TF-IDF. The results indicate that Jaccard coefficient (0.71) before stop-word removal and cosine similarity (0.78) after stop-word removal outperform the remaining similarity measures in Urdu document clustering. We also found that the results of Levenshtein distance (0.67 and 0.61) before and after pre-processing are notably better than those of TF-IDF (0.44). In future work, we plan to apply semantic similarity for document matching to capture the relationships among documents more effectively.

Figure 1
Documents clustering

Step 1: Data collection and pre-processing; Step 2: Document representation as a bag-of-words model; Step 3: Similarity measurement; and Step 4: Document clustering using the k-means algorithm, as shown in Figure 2.

Figure 2
Proposed Architecture

Basic steps are listed below:
• Input: A set of numbers $N = \{x_i, i = 1, \dots, n\}$ and k.
• Output: A grouping of the numbers into k clusters.
• Initialise by randomly selecting k centres $c_j$.
• Assign each number $x_i$ to the cluster $C_j$ for which the distance $d(x_i, c_j)$ is minimum.
• Recalculate each $c_j$ as the mean of all numbers in $C_j$.
• Repeat step 4 and step 5 until the centroids $c_j$ and the cluster members $C_j$ no longer change.
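The steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the study: points are tuples of floats, the distance is Euclidean, and the parameters `max_iter` and `seed` as well as the function name `k_means` are introduced here for illustration only.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (assignment, centres)."""
    rng = random.Random(seed)
    # Initialise by randomly selecting k of the points as centres.
    centres = [list(point) for point in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the cluster whose centre is nearest.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(point, centres[j]))
            for point in points
        ]
        if new_assignment == assignment:
            break  # cluster members no longer change
        assignment = new_assignment
        # Recalculate each centre as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centres


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centres = k_means(points, k=2)
print(labels)
print(centres)
```

The cluster indices themselves are arbitrary, so only the grouping matters when the result is compared with the manual classes. An empty cluster keeps its previous centre in this sketch, which is one of several possible conventions.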

Figure 3
Result of Similarity Measures before Pre-Processing

Figure 3 shows the average result for the five clusters using the different similarity measures before pre-processing.

Figure 4
Result of Similarity Measures after Pre-Processing

Figure 4 shows the average result for the five clusters using the four similarity measures after pre-processing.

Table 3
Example of Urdu Stop words and Key words

Table 7
Representation of String in Levenshtein distance

Table 8
Applying Deletion Operation of Strings

Table 10
Jaccard Coefficient

Table 11
Result of Similarity Measures before Pre-Processing (purity of the similarity measures in each cluster, using the K-means algorithm)

EAI Endorsed Transactions on Scalable Information Systems | 06 2018 - 12 2018 | Volume 5 | Issue 19 | e5

Table 11 describes the results for the five clusters using the different similarity measures before pre-processing. The average accuracy of cosine similarity, TF-IDF, Levenshtein distance and Jaccard coefficient is 0.66, 0.44, 0.67 and 0.71 respectively.

Table 12
Result of Similarity Measures after Pre-Processing

Table 12 shows the average results for the five clusters using the different similarity measures after pre-processing. The average accuracy of cosine similarity, TF-IDF, Levenshtein distance and Jaccard coefficient is 0.78, 0.44, 0.61 and 0.59 respectively.