Big Data and Named Entity Recognition Approaches for Urdu Language

Nowadays data is stored in digital form, and terabytes of data are generated on a daily basis. Extracting useful information from big data efficiently is a difficult task. Information extraction is a technique used to extract information from unstructured text. Named Entity Recognition (NER) is an essential component of information extraction in the field of Natural Language Processing (NLP). The Urdu language poses various challenges to NER due to its agglutinative, inflectional nature and rich morphology. As a result, NER systems for Urdu are not yet mature, owing to the lack of resources and to ambiguities. This paper specifically addresses the different approaches to NER and explores the existing work on NER for the Urdu language.


Introduction
In this digital era an enormous amount of data is available. Analysts work with collections of unstructured, semi-structured and structured datasets called big data. The volume, complexity and rate of growth of this data are very large. Capturing, managing, processing and analyzing this enormous amount of data is a difficult task for analysts, and different programming tools and applications are needed to process it. Data is available in different forms such as text, video, images, audio, web page log files, blogs, tweets, location information and sensor data. Big data is very valuable to any organization for making intelligent and faster decisions. According to the International Data Corporation (IDC): "Big Data technologies describe a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery and/or analysis" [1].
For structured data, the data sources in big data are business applications, e.g. retail, finance and bioinformatics. For semi-structured data, the data sources are web applications, which can be web logs, email and web pages. For unstructured data, the data sources are audio, video, images, sensor data, blogs and tweets.
On January 25th Twitter became available to its users in four different languages: Arabic, Farsi, Hebrew and Urdu¹. Geo News Urdu has 1.88 million tweets² published in Urdu, BBC Urdu has 16.7K tweets³ in Urdu and Dawn News has 95.1K tweets⁴ published in Urdu. Almost 115 websites and blogs are available in the Urdu language⁵, but it is the news sites that actively produce a large amount of Urdu data on a daily basis.

Approaches to implement NER
NER implementation involves three main approaches, and the selection among them is based on their efficiency. These are the rule based, statistical and hybrid approaches. The hybrid approach combines the rule based and statistical approaches.

Rule Based Approach
The rule based approach is applied to textual data. Whenever the system receives input in textual format, it finds named entities and compares them with the rules defined in the dictionary mapping and by linguists. Thereafter, an output is generated by assigning each matched named entity the classification given by the corresponding linguistic rule set [2].
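
The lookup-then-rules flow described above can be sketched as follows. This is a minimal illustration and not the system of [2]: the gazetteer entries, the two patterns and the function name `tag_tokens` are hypothetical, and romanized tokens are used only to keep the example short.

```python
import re

# Hypothetical toy gazetteer (the "dictionary mapping"): surface form -> entity class.
GAZETTEER = {
    "Lahore": "LOCATION",
    "Karachi": "LOCATION",
    "Iqbal": "PERSON",
}

# Hypothetical hand-written rules: (compiled pattern, entity class).
RULES = [
    (re.compile(r"^\d{4}$"), "DATE"),        # a four-digit token is treated as a year
    (re.compile(r"^\w+abad$"), "LOCATION"),  # place names ending in "abad"
]


def tag_tokens(tokens):
    """Return (token, label) pairs; "O" marks tokens matched by no entry or rule."""
    tagged = []
    for token in tokens:
        # Dictionary lookup first, then fall back to the pattern rules in order.
        label = GAZETTEER.get(token)
        if label is None:
            for pattern, rule_label in RULES:
                if pattern.match(token):
                    label = rule_label
                    break
        tagged.append((token, label or "O"))
    return tagged


if __name__ == "__main__":
    print(tag_tokens("Iqbal visited Faisalabad in 1930".split()))
```

A real rule set for Urdu would work on Urdu script and would need far richer rules; the point here is only the order of operations, dictionary first and rules second.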

Statistical Approach
The statistical approach uses machine learning techniques to build the linguistic rule set in a semi-automatic way instead of constructing it manually. These techniques can effectively perform part of speech tagging, assign categories to text and carry out sentence parsing. Furthermore, this approach employs a training module that is trained on a previously constructed corpus to perform NER using term frequency calculation and context understanding.

Models in Statistical Approach
Implementing statistical NER further involves different statistical models, each of which employs a specific statistical methodology to train on and work with a corpus to achieve the required outcome.
-Hidden Markov Model (HMM) is a graphical model that uses conditional probability distributions determined on the basis of a predefined limited history, i.e. the Markov property. An HMM distinguishes two kinds of variables: hidden states, which are the latent conditions of the context, and observations, which are the visible outputs associated with those states. HMM finds applications in part of speech tagging, speech recognition and machine translation [5].
-Maximum Entropy Model (MEM) chooses the state of largest entropy that is consistent with the current knowledge base of the implemented system. It places no assumptions or preconditions on the data distributions in the training data and is hence considered well suited to various kinds of experiments. It has applications in part of speech tagging, speech recognition, NER and machine translation [6].
-Conditional Random Field (CRF) is a Markov random field that is trained in a discriminative fashion. A distribution over the observed variables does not need to be modelled, so more complex features of the observations can be added to the model. It has applications in shallow parsing, NER and gene finding [7].
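
To make the hidden state and observation split of the HMM concrete, the following sketch decodes the most probable tag sequence with the Viterbi algorithm. It is an illustration under stated assumptions: the tag set, all probabilities, the `UNSEEN` smoothing constant and the function name `viterbi` are hypothetical toy values rather than figures from any cited system, and the input is assumed to be a non-empty list of words.

```python
import math

STATES = ["PER", "LOC", "O"]  # hidden states: person, location, other

# Hypothetical toy parameters; a real system estimates them from an annotated corpus.
START = {"PER": 0.3, "LOC": 0.2, "O": 0.5}
TRANS = {
    "PER": {"PER": 0.3, "LOC": 0.1, "O": 0.6},
    "LOC": {"PER": 0.1, "LOC": 0.3, "O": 0.6},
    "O": {"PER": 0.2, "LOC": 0.2, "O": 0.6},
}
EMIT = {
    "PER": {"Iqbal": 0.8},
    "LOC": {"Lahore": 0.8},
    "O": {"lives": 0.4, "in": 0.4},
}
UNSEEN = 1e-6  # small probability for a word a state never emitted in training


def viterbi(words):
    """Return the most probable hidden tag sequence for a non-empty word list."""

    def emit(state, word):
        return math.log(EMIT[state].get(word, UNSEEN))

    # best[i][s] = (log-probability of the best path ending in s at word i, previous state)
    best = [{s: (math.log(START[s]) + emit(s, words[0]), None) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            # Markov property: only the previous state matters for the transition.
            prev, score = max(
                ((p, best[-1][p][0] + math.log(TRANS[p][s])) for p in STATES),
                key=lambda item: item[1],
            )
            column[s] = (score + emit(s, word), prev)
        best.append(column)

    # Backtrack from the best final state to recover the full path.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


if __name__ == "__main__":
    print(viterbi(["Iqbal", "lives", "in", "Lahore"]))
```

Log-probabilities are used so that long sentences do not underflow. MEM and CRF taggers replace the generative emission table with feature weights but are decoded over tag sequences in a similar way.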

Learning Methods in Statistical Approach
The statistical approach makes it possible to train the system for efficient NER and involves three learning methodologies [8].
-Supervised Learning uses a pre-built, pre-annotated, state of the art corpus to train the system according to the aforementioned models. The learning system reads the available corpus and learns from it in order to process the text later fed to the system. For better performance, a very large corpus is necessary.
-Semi Supervised Learning starts by giving the system some initial entities, termed seeds. Afterwards, the system searches for the seeds in text and uses their contexts to identify other entities that occur within the same context.
-Unsupervised Learning groups a large number of entities that appear within similar contexts into units called clusters. The system then learns these clusters and is able to identify entities that match the learned clusters.
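
The seed-based idea behind semi supervised learning can be sketched as a small bootstrapping loop. This is a simplified illustration, not a method taken from [8]: the context definition (previous word, next word), the toy sentences and the names `windows` and `bootstrap` are assumptions made for the example.

```python
def windows(tokens):
    """Yield (previous, token, next) triples with sentence-boundary padding."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    for i in range(1, len(padded) - 1):
        yield padded[i - 1], padded[i], padded[i + 1]


def bootstrap(sentences, seeds, rounds=2):
    """Grow an entity set from seeds by reusing the contexts the seeds occur in."""
    entities = set(seeds)
    for _ in range(rounds):
        # Step 1: collect the contexts in which known entities appear.
        contexts = {
            (prev, nxt)
            for tokens in sentences
            for prev, token, nxt in windows(tokens)
            if token in entities
        }
        # Step 2: every token found in one of those contexts is a candidate entity.
        found = {
            token
            for tokens in sentences
            for prev, token, nxt in windows(tokens)
            if (prev, nxt) in contexts
        }
        if found <= entities:  # nothing new was learned, so stop early
            break
        entities |= found
    return entities


if __name__ == "__main__":
    corpus = [
        "He lives in Lahore now".split(),
        "She lives in Karachi now".split(),
        "They work in Multan today".split(),
    ]
    print(sorted(bootstrap(corpus, {"Lahore"})))
```

With the single seed "Lahore", the loop learns the context ("in", "now") and then picks up "Karachi", while "Multan" is left out because it appears in a different context. Practical systems score contexts and candidates instead of accepting every match, which limits the drift that this naive version would suffer on real text.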

Hybrid Approach
The hybrid approach combines multiple approaches, such as the rule based and statistical approaches, so that machine learning techniques are used together with manually constructed linguistic rule sets. The main advantage of applying this technique is that it overcomes the shortcomings of the traditional approaches to NER [9].
The IJCNLP-08 workshop concentrated on developing NER systems for the Hindi, Bengali, Oriya, Telugu and Urdu languages. In [10] the author developed a hybrid NER system for five languages. In another study, a conditional random field based NER was proposed in which machine learning approaches are combined with heuristics. This NER system was developed for Bengali, Oriya, Telugu, Urdu and Hindi by Karthik Gali et al. 2008 [11]. Moreover, named entities (NEs) help to extract information from text, so accurate recognition and classification of NEs is essential. A rule-based NE recognizer [12], New York University's MEM based recognizer MENE [13] and statistical NE recognizers such as BBN's HMM based IdentiFinder [13] have been developed. Unfortunately, these recognizers were not developed for the Urdu language.
For five languages including Urdu, Kumar et al. 2008 [14] proposed an NER system using a hybrid approach that includes CRF and HMM. For all languages the HMM model shows better performance than the hybrid CRF model. In another study, a rule based approach was used to develop a system by Kashif Riaz et al. 2010 [2], and the results were more accurate than those of the other systems developed in the IJCNLP-2008 workshop. Moreover, an information extraction tool for the Urdu language was developed by Mukund et al. in [15]. This tool also contains a submodule for NER, which was developed with a combination of HMM and CRF models. A rule based approach for NER in Urdu is used by Riaz [2]. The rules were formulated from 200 documents of the Becker-Riaz Urdu corpus [16]. Out of 2,262 documents, 600 documents were chosen. In 2012 Singh et al. [17] developed a rule based Urdu NER system. The system used the IJCNLP corpus for thirteen NEs. Test set 1 uses 12,032 tokens and test set 2 uses 150,243 tokens. In 2008 Shah et al. [18] developed an MEM based system for NER. This NER system focused on five Indian languages. In this study, rules and gazetteers were not built for the Urdu language. Test data of 12,805 words was used, and nested, maximal and lexical precision, recall and F-measure were measured. In another study, Gali et al. 2008 [11] used a two stage hybrid approach to NER for South and South East Asian languages, which included Telugu, Hindi, Urdu, Bengali and Oriya, without using any language specific resources. A dataset of 35,000 Urdu words was used to train the system. In [19] the author developed a statistical NER system for Urdu by using unigram and bigram models with gazetteer lists. The system used training data that contained 2,313 named entities and test data that contained 104 named entities. In [3] the author developed a hybrid NER system for the Urdu language with an n-gram model, rules (based on prefix and suffix characters) and gazetteers. The author used the IJCNLP named entity and CRL named entity corpora.

EAI Endorsed Transactions on Scalable Information Systems 12 2017 -04 2018 | Volume 4 | Issue 16 | e3

Performance Evaluation
The NER approaches discussed in Section 3 are evaluated by calculating precision, recall and f-measure. Moreover, the approaches presented in the IJCNLP 2008 workshop measured system performance in three categories: maximal matches, nested matches and lexical item matches. Maximal matches consider the longest possible match of NEs, while nested matches consider the longest match of nested NEs. The rest of the systems are evaluated on individual NE tag sets, and the overall performance of these systems is evaluated along with gazetteer lists and handcrafted rules. Table 1 shows the overall maximum f-measure achieved by each system. The system in [10] extracted fifteen different features and included different contextual information to identify the various classes of NEs. This approach handles different variations of numbers with special characters, such as numbers with a comma, hyphen, period, slash or percentage sign. The system obtained f-measures of 30.35%, 28.55% and 35.52% for maximal, nested and lexical item matches respectively. In the approach of [14], the authors tagged the corpus using an HMM model. They used a two layer approach, statistical and rule based, to extend the tool to languages other than Hindi, including Urdu. In the first layer, the model is trained on annotated data and the class of each word is identified. The second layer validates the system with chunks of test data and defines the rules for each class of NEs. Further, in this study CRF is used for initial tagging without analyzing morphology, due to the limited availability of Urdu language resources. The system obtained f-measures of 33.17%, 31.78% and 38.25% for maximal, nested and lexical item matches respectively.
In another relevant study [18], the authors initially used an ME model and then constructed language specific rules to identify the nested entities. For the Urdu language, POS information, language specific rules, morphological information and gazetteers were not used to tune the system because enough resources were not available. The system nevertheless obtained f-measures of 27.79%, 28.59% and 35.47% for maximal, nested and lexical item matches respectively. In [11] the authors used CRF with rules to identify the NEs. The rules were constructed for the Hindi and Bengali languages; for the other languages no rules were constructed, due to lack of resources and domain knowledge. By combining the rules with CRF, the system obtained f-measures of 39.86%, 39.01% and 43.46% for maximal, nested and lexical item matches respectively. This approach obtained the highest results for the Urdu language among all approaches proposed in the IJCNLP 2008 workshop. In study [2], the authors proposed a rule based approach and used a 6-gram model. They used finite state automata with lexical information, which analyze the states of the characters. Moreover, they defined three rules based on heuristics, corpus and grammar, and weighted the order in which the rules are applied. The evaluation of this system is performed on two Urdu corpora (Becker-Riaz and IJCNLP).
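
The precision, recall and f-measure used in the performance evaluation can be computed as in the sketch below. It assumes exact matching of (start, end, label) spans and the balanced F1 form of the f-measure; the matching criteria actually used by the workshop (maximal, nested and lexical item matches) are not reproduced here, and the function name and example spans are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold ones; both are sets of (start, end, label)."""
    true_positives = len(gold & predicted)
    # Precision: share of predicted entities that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold entities that were found.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced f-measure (F1): harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {(0, 1, "PER"), (3, 4, "LOC"), (6, 7, "ORG")}
    predicted = {(0, 1, "PER"), (3, 4, "ORG")}  # second span has the wrong label
    print(precision_recall_f1(gold, predicted))
```

In this toy case one of the two predictions is correct and one of the three gold entities is found, so the precision is 1/2, the recall is 1/3 and the f-measure is 0.4.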