Supervised Urdu Word Segmentation Model Based on POS Information

Urdu is the national language of Pakistan and one of the most widely spoken languages in the world. Successful Urdu NLP requires robust, high-performance tools and resources. Word segmentation plays a central role for morphologically rich languages such as Urdu in diverse NLP tasks such as named entity recognition, sentiment analysis, part-of-speech tagging and information retrieval. The morphological richness of Urdu adds to the difficulty of word segmentation, because a single word can be composed of zero or more prefixes, a stem, and zero or more suffixes. In this paper we present a supervised Urdu word segmentation scheme based on the part-of-speech (POS) information of the corresponding words. For the experiments, conditional random fields (CRF) with contextual features are used. The performance of the proposed system is evaluated on 300K words, and the results show clear improvements over the baseline approach.


Introduction
Natural language processing (NLP) now plays a vital role in many fields of computer science. Researchers are trying to simulate human knowledge in computer systems, and to this end NLP researchers work on introducing the knowledge through which computers can understand and use natural language. To achieve the desired tasks, various advanced tools and procedures are applied to make computer systems more capable. NLP draws on several disciplines, such as electronic and electrical engineering, linguistics, information and computer sciences, mathematics, psychology and artificial intelligence (AI) [1]. NLP applications are widely used and span several areas of study, such as word segmentation, speech recognition, text processing and summarization, cross-language information retrieval (CLIR), user interfaces, voice recognition and artificial intelligence. Information retrieval (IR) identifies the desired information in a large collection of data, while information extraction (IE) processes documents to identify pre-specified entities or events. Artificial intelligence is a sub-field of computer science that studies the development of hardware and software that simulate human intelligence.
Word segmentation has a vital role in every NLP application. It separates written or spoken text into meaningful word tokens, that is, it identifies the word boundaries in a language. In recent years, languages such as Hindi have attracted more attention from researchers than other regional languages. For example, Urdu is becoming more popular on the web day by day [2]. Data mining (DM) and information retrieval (IR) call for natural language processing in tasks such as topic categorization, relationship exploration, sentiment analysis and event extraction. Core NLP tasks, i.e. named entity recognition (NER), part-of-speech (POS) tagging, stop-word removal, morphological analysis and parsing, play a major role in all NLP systems [3].
In Urdu, one of the important languages of Asia, segmentation is difficult, as in some other Asian languages in which the space is not a reliable marker of word boundaries. The Urdu word segmentation issues of space insertion and space omission are caused by the way the space is used in Urdu text [4], [5]. As an example of the space omission problem, the Urdu word آپکا (Aapka, yours) is composed of the two words آپ (Aap, you) and کا (Kaa, of), but a system will treat it as a single word. Segmentation of such Urdu words has been addressed by [6] using an Urdu-Devanagari transliteration system. As an example of the second problem, space insertion, consider the Urdu word ضرورت مند (Zaroratmand, needful), which is a single word that a system will segment into two words, ضرورت (Zarorat, need) and مند (mand). Such Urdu words have been handled by [7] using a two-stage system. Hindi-Urdu transliteration issues that arise during segmentation are briefly discussed in [8] and [9]. Segmentation of simple, compound and complex words in Sindhi is discussed in three layers in [10]. Urdu-Arabic word segmentation techniques and their challenges are discussed in [11]. To the best of our knowledge, no research from the Urdu language processing (ULP) community has so far examined the effect of supervised machine learning models, especially CRFs, together with POS information on the Urdu word segmentation task. This gap, and the advantages of supervised learning schemes in diverse areas, motivated us to apply a supervised learning model along with POS tag information to the Urdu word segmentation problem.
The main contributions of this study are as follows: (1) it is the first to employ CRFs with POS information as a feature for Urdu word segmentation; (2) it provides a systematic evaluation of CRF models on data from three different genres: international, sports and science news.
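To make the first contribution more concrete, the following is a minimal sketch of how POS tags and neighbouring tokens could be turned into contextual features for a CRF. It is an illustration under stated assumptions, not the feature set used in this study: the function names, the feature names and the POS tags in the example are all hypothetical.

```python
from typing import Dict, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, POS tag) pairs


def token_features(sent: Sentence, i: int) -> Dict[str, object]:
    """Contextual features for the token at position i (hypothetical feature set)."""
    token, pos = sent[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "token": token,
        "pos": pos,                 # POS tag of the current token
        "length": len(token),
        "last_char": token[-1:],    # last character ("" for an empty token)
    }
    if i > 0:
        prev_token, prev_pos = sent[i - 1]
        feats["-1:token"] = prev_token
        feats["-1:pos"] = prev_pos  # POS context to the left
    else:
        feats["BOS"] = True         # beginning of sentence
    if i < len(sent) - 1:
        next_token, next_pos = sent[i + 1]
        feats["+1:token"] = next_token
        feats["+1:pos"] = next_pos  # POS context to the right
    else:
        feats["EOS"] = True         # end of sentence
    return feats


def sentence_features(sent: Sentence) -> List[Dict[str, object]]:
    """One feature dictionary per token, in sentence order."""
    return [token_features(sent, i) for i in range(len(sent))]


# Hypothetical tagged input; the tags are placeholders, not a real tagset.
example = [("ضرورت", "NN"), ("مند", "NN"), ("لوگ", "NN")]
features = sentence_features(example)
```

Per-token feature dictionaries of this kind are a common input format for CRF toolkits. Pairing each dictionary with a label that says whether the token starts a new word or continues the previous one would be one possible way to cast segmentation as sequence labelling.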

Literature review
Different techniques have been used for the word segmentation problem in different languages, and NLP researchers have reported good results with each of them. The existing techniques for word segmentation in NLP are dictionary/rule-based, statistical/machine learning and hybrid approaches.

Existing techniques
The segmentation techniques used so far are classified into the following categories:
• Dictionary/rule-based techniques
• Statistical/machine learning techniques
• Hybrid techniques
• Feature-based techniques

Dictionary/ Rule based techniques
Rule-based techniques have been studied extensively for Urdu language processing tasks, including Urdu word segmentation. In these techniques, a set of rules or patterns is used to perform NLP tasks. They have a number of limitations, however, because each process requires a separate rule. These techniques have been used for Chinese word segmentation [12]. Word segmentation in Japanese, Chinese, Thai and Urdu is more difficult than in Western languages such as English and French because the space is not used consistently. Urdu is also a resource-poor language. Thai word segmentation using a rule-based approach is presented in [13]. The Urdu stemmer "Assas-Band" first removes the prefix, then removes the postfix, and finally extracts the stem [14]; its accuracy is 91.2%. Rule-based techniques have also been used in an online Urdu handwriting recognition system that converts handwritten input into the corresponding Urdu text [15], and in NER for automated text processing in Urdu [16]. A rule-based maximum matching approach to the space insertion and omission problems in Urdu word segmentation is presented in [17].
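As an illustration of the dictionary-based idea, the sketch below implements generic greedy forward maximum matching. It is not the method of [17]; the function name and the toy lexicon are assumptions made for this example only.

```python
from typing import List, Set


def max_match(text: str, lexicon: Set[str]) -> List[str]:
    """Greedy forward maximum matching over text that has no reliable spaces."""
    max_len = max((len(w) for w in lexicon), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop cannot stall.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon (hypothetical); a real system would use a large dictionary.
LEXICON = {"آپ", "کا", "ضرورت", "مند", "ضرورتمند"}

omission_fixed = max_match("آپکا", LEXICON)       # space omission case
insertion_fixed = max_match("ضرورتمند", LEXICON)  # space insertion case, space removed first
```

With this lexicon the merged form آپکا is split into its two words, while ضرورتمند is kept whole because the longest match is preferred. The well-known weakness of the approach is that it depends entirely on lexicon coverage.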

Statistical/ Machine learning techniques
In recent research, statistical techniques have surpassed other techniques. Machine learning techniques give better results than rule-based approaches, but they have not received proper attention for Urdu word segmentation. ML algorithms learn a function that maps input samples to a range of output values. For these algorithms, a corpus is constructed in which word boundaries are explicitly marked. Statistical models based on bigrams and trigrams are the most frequently employed. Supervised statistical learning is the most significant technique in natural language processing; it induces rules automatically from training data. ML algorithms and different ML models are discussed in [18]. The space insertion issue in Urdu word segmentation has been handled using a statistical technique [7], while space omission has been addressed by [6] using the same approach.
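A corpus with explicitly marked word boundaries can be converted into training labels in several ways. The sketch below assumes a character-level B/I scheme (B for the first character of a word, I for the rest), which is a common convention in segmentation work and not necessarily the scheme used in this study. Both function names are hypothetical.

```python
from typing import List, Tuple


def boundary_labels(words: List[str]) -> Tuple[List[str], List[str]]:
    """Turn a gold-segmented word list into parallel character and B/I label lists."""
    chars: List[str] = []
    labels: List[str] = []
    for word in words:
        for k, ch in enumerate(word):
            chars.append(ch)
            labels.append("B" if k == 0 else "I")
    return chars, labels


def labels_to_words(chars: List[str], labels: List[str]) -> List[str]:
    """Inverse mapping: rebuild the words from characters and B/I labels."""
    words: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # extend the current word
    return words


gold = ["آپ", "کا"]
chars, labels = boundary_labels(gold)
```

A supervised model is then trained to predict the label sequence from the character sequence, and `labels_to_words` turns its predictions back into a segmentation.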

Hybrid techniques
These approaches combine the features of rule-based and statistical techniques. In Urdu, sentence boundary identification is the initial step for ULP tasks; it is addressed in [19] using a hybrid approach that combines a unigram statistical model with a rule-based algorithm. The reported results were better than those of both the rule-based and the statistical approach alone. A hybrid technique is presented in [20] that segments Urdu text into lines using a top-down mechanism and segments lines into ligatures using a bottom-up design. The results are quite reasonable, with an accuracy of 99.2%.

Feature based techniques
These techniques are used to identify word boundaries. Features can test any specific piece of information about unknown words. For Thai word segmentation, features can be extracted automatically from a training corpus using the ML algorithm Winnow [21].

The Urdu language
Urdu is one of the important languages of Asia. It is the national language of Pakistan.
Handheld devices such as mobile phones are used successfully everywhere, but the software they provide for user input is mostly in English, and it is difficult for an ordinary person to communicate easily in English. To help Urdu speakers and writers, and to reduce the gap between ordinary people and new technology, Urdu language processing systems are needed. Our effort is to reduce this gap by using ML techniques to segment Urdu text.
Urdu is written in the Arabic script, which is highly cursive. The Arabic script has many writing styles; one of them, Nastaleeq (a character-based, right-to-left style), is commonly used for Urdu. NLP tools for Arabic are nevertheless not directly applicable to Urdu; for example, a stemming algorithm for Arabic will not work for Urdu. Urdu characters are divided into joiners and non-joiners. Joiner characters join with both the preceding and the following character, while non-joiner characters join only with the preceding character and not with the following one. Table 1 shows the Urdu numbers and characters. In addition to these, diacritics, punctuation marks, signs and symbols are also used in Urdu text, as shown in Table 2. The joiner and non-joiner characters are listed in Table 3.
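The joiner and non-joiner distinction can be expressed as a simple lookup. The sketch below is illustrative only: the list of non-joiner letters is an assumption based on the letters that commonly do not connect to a following letter in the Arabic script, and it should be checked against Table 3. The function names are hypothetical.

```python
# Assumed non-joiner letters (they do not connect to the following letter).
# This list is an assumption for illustration; Table 3 is the reference.
NON_JOINERS = set("اآدڈذرڑزژوے")


def is_joiner(ch: str) -> bool:
    """True if ch is a letter that connects to the following letter."""
    return ch.isalpha() and ch not in NON_JOINERS


def needs_space_after(word: str) -> bool:
    """True if the word ends in a joiner, so that typing the next word without
    a space would visually fuse the two words."""
    return bool(word) and is_joiner(word[-1])


ends_in_joiner = needs_space_after("ضرورت")   # last letter is a joiner
ends_in_non_joiner = needs_space_after("کا")  # last letter is a non-joiner
```

A word that ends in a non-joiner keeps its shape even when the following space is dropped, which is one reason why writers omit spaces after such words.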

Morphological richness of Urdu text
In Urdu, each word has several variants, which is why the language is morphologically rich. Urdu is similar to other Indo-European languages in having a concatenative inflectional morphological system. In some cases, however, exceptions occur, for example in causative verbs, which also exhibit stem-internal changes. Urdu morphology and a solution for handling it are presented in [22].

The Urdu verbs
The verb (فعل) plays a vital role in every language because it conveys action. ULP researchers should study Urdu verbs because their structure is used in everyday conversation. Table 4 shows some examples of Urdu verbs.

Urdu linguistic resources
Urdu lexical resources are a necessary part of every NLP system for the Urdu language. The study of linguistics in Pakistan has mostly been restricted to applied linguistics, particularly English and sociolinguistics. Very little work has been done in descriptive and theoretical linguistics, and the available resources are correspondingly limited.
Two choices are now widely made when creating datasets for languages written in scripts such as Arabic: (a) the Unicode character set and (b) the XML file format. Both are used for storing data. Urdu text is stored using the Unicode encoding scheme. It can be stored in different file formats such as '.txt', 'inp' or 'doc', but XML is the most adaptable because data in this format can easily be converted to other formats and is easily readable for both users and systems [18].
ULP researchers are working to develop advanced tools and datasets. The available datasets are too few to meet the needs of all ULP tasks. Three ULP datasets are currently available; their details are given in Table 6. The Becker-Riaz dataset was first introduced by [24]. The EMILLE dataset for ULP was released by Lancaster University in 2003 [25].
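The combination of Unicode and XML can be illustrated with Python's standard library. The element and attribute names below (`corpus`, `sentence`, `w`, `pos`) and the POS tags are hypothetical and do not reflect the schema of any existing Urdu dataset.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def sentence_to_xml(sent_id: int, tokens: List[Tuple[str, str]]) -> ET.Element:
    """Build a <sentence> element with one <w pos="..."> child per token."""
    sent = ET.Element("sentence", {"id": str(sent_id)})
    for word, pos in tokens:
        w = ET.SubElement(sent, "w", {"pos": pos})
        w.text = word
    return sent


root = ET.Element("corpus", {"lang": "ur"})
root.append(sentence_to_xml(1, [("آپ", "PRP"), ("کا", "PSP")]))

# Serialize as UTF-8 bytes, then parse again to show the round trip.
xml_bytes = ET.tostring(root, encoding="utf-8")
restored = [(w.text, w.get("pos")) for w in ET.fromstring(xml_bytes).iter("w")]
```

Because the text is stored as Unicode inside a structured format, the same file can be read back without loss and converted to other formats with little effort.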

Conditional Random Fields
Conditional random fields (CRFs) are a machine learning model used for various NLP tasks such as named entity recognition, sequence labelling and word segmentation. They are undirected graphical models used to calculate the conditional probability of values on designated output nodes given values on designated input nodes. CRFs have been reported to give better results than HMMs [28]. A CRF can be defined over random variables $X$ and $Y$ as follows. Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$. Then $(X, Y)$ is a conditional random field when the random variables $Y_v$, conditioned on $X$, obey the Markov property with respect to the graph: $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$. Latent-dynamic conditional random fields (LDCRF), also called discriminative probabilistic latent variable models (DPLVM), are a type of CRF for sequence tagging tasks. They are latent variable models that are trained discriminatively. In an LDCRF, as in any sequence tagging task, given a sequence of observations $x = x_1, x_2, x_3, \ldots, x_n$, the problem the model must solve is how to assign a sequence of labels $y = y_1, y_2, y_3, \ldots, y_n$. Instead of directly modeling $P(y \mid x)$ as an ordinary linear-chain CRF would do, a set of latent variables $h$ is inserted between $x$ and $y$ using the chain rule of probability:

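$$P(y \mid x) = \sum_{h} P(y \mid h, x)\, P(h \mid x) \qquad (1)$$

Suppose $x_{1:n}$ is a sequence of Urdu words in a sentence with named entities $z_{1:n}$. According to the linear-chain CRF, the conditional probability in its standard form, with feature functions $f_i$ and weights $\lambda_i$, is

$$p(z_{1:n} \mid x_{1:n}) = \frac{1}{Z} \exp\left( \sum_{j=1}^{n} \sum_{i} \lambda_i f_i(z_{j-1}, z_j, x_{1:n}, j) \right)$$

where the normalization factor $Z$ is calculated as

$$Z = \sum_{z_{1:n}} \exp\left( \sum_{j=1}^{n} \sum_{i} \lambda_i f_i(z_{j-1}, z_j, x_{1:n}, j) \right)$$

CRFs have so far seen only limited use in named entity recognition experiments for ULP tasks, and their use as a module for Urdu NER deserves more attention from researchers.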
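The linear-chain conditional probability and its normalization factor Z can be checked numerically on a toy example. The sketch below enumerates all label sequences by brute force, which is only feasible for very short inputs; practical implementations use dynamic programming. The label set, the two feature functions and their weights are hypothetical and chosen only for illustration.

```python
import itertools
import math
from typing import Callable, Dict, Sequence

LABELS = ("B", "I")  # hypothetical label set: begin word / inside word

Feature = Callable[[str, str, Sequence[str], int], float]


def score(x: Sequence[str], z: Sequence[str],
          weights: Dict[str, float], features: Dict[str, Feature]) -> float:
    """Weighted sum of feature functions f_i(z_{j-1}, z_j, x, j) over positions j."""
    total = 0.0
    for j in range(len(x)):
        prev = z[j - 1] if j > 0 else "<s>"  # start symbol before the first label
        for name, f in features.items():
            total += weights[name] * f(prev, z[j], x, j)
    return total


def partition(x, weights, features) -> float:
    """Normalization factor Z: sum of exp(score) over every label sequence."""
    return sum(math.exp(score(x, z, weights, features))
               for z in itertools.product(LABELS, repeat=len(x)))


def prob(x, z, weights, features) -> float:
    """Conditional probability p(z | x) of one label sequence."""
    return math.exp(score(x, z, weights, features)) / partition(x, weights, features)


# Hypothetical features: the first token begins a word; a known suffix continues one.
FEATURES: Dict[str, Feature] = {
    "first_B": lambda prev, cur, x, j: 1.0 if j == 0 and cur == "B" else 0.0,
    "suffix_I": lambda prev, cur, x, j: 1.0 if x[j] == "مند" and cur == "I" else 0.0,
}
WEIGHTS = {"first_B": 2.0, "suffix_I": 1.5}

x = ["ضرورت", "مند"]
best = max(itertools.product(LABELS, repeat=len(x)),
           key=lambda z: prob(x, z, WEIGHTS, FEATURES))
```

With these weights the labelling that keeps the suffix attached to the preceding token receives the highest probability, and the probabilities of all labellings sum to one because of the division by Z.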
CRFClassifier provides a general implementation of (arbitrary-order) linear-chain CRF sequence models for any task. In the proposed work, the NER task is structured as follows, e.g. for the given Urdu sentence:

Issues in Urdu word segmentation
Compared to Western languages such as English and French, Urdu faces many segmentation issues because the space is not used regularly between Urdu words. These issues are space omission, space insertion, reduplicated words, compound words, affixation and English abbreviations. They are discussed in detail below.

Space insertion issue
This problem arises during word segmentation when a space is inserted between two Urdu units that together form a single word. In handwritten Urdu text no space is written between words [8], [17], [6]. When typing on a computer, however, a space must be entered if the last character of a word is a joiner; otherwise it joins to the next word, producing an incorrect form that is difficult to recognize for the system as well as for native speakers. The space insertion issue therefore arises from the joiner characters of Urdu text [17]. Examples of space insertion issues are given in Table 7.

Space omission issue
This problem arises when the space is omitted at a place where it must be inserted between two Urdu words. Space omission in Urdu text is also a challenge for word segmentation. When the last character of a word is a joiner, a space should be inserted after it; otherwise it attaches to the next word and gives a visually incorrect form. Considerable work on handling the space omission issue has been done by [6]. That work consists of a transliteration system from Urdu to Devanagari for segmentation, and then from Devanagari back to Urdu. Examples of the space omission issue in Urdu word segmentation are given in Table 7.

Compound words
Compound words are combinations of two or more words. The combined lexemes form another lexeme [27]. Compound words are categorized into three different formats [8], which are listed in Table 7 with examples.

Reduplicated Urdu words
Words that occur twice, one immediately after the other, are known as reduplicated words. Urdu reduplicated words are briefly discussed in [8]. Such words make Urdu word segmentation challenging. Examples are given in Table 7.
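Full reduplication, as defined above, can be detected by comparing adjacent tokens. The sketch below is a minimal illustration that covers exact repeats only; the function names and the example sentence are assumptions made for this example.

```python
from typing import List, Tuple


def find_reduplications(tokens: List[str]) -> List[Tuple[int, str]]:
    """Return (index, token) for every token that is immediately repeated."""
    return [(i, tokens[i]) for i in range(len(tokens) - 1)
            if tokens[i] == tokens[i + 1]]


def merge_reduplications(tokens: List[str], joiner: str = " ") -> List[str]:
    """Merge each immediately repeated pair into a single segmentation unit."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == tokens[i + 1]:
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2  # skip the repeated token
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["وہ", "آہستہ", "آہستہ", "چلا"]
pairs = find_reduplications(tokens)
units = merge_reduplications(tokens)
```

Treating the repeated pair as one unit keeps a space-based tokenizer from splitting a reduplicated expression into two separate words.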

The Urdu Affixations
By affixation we mean words that contain prefixes and suffixes. Urdu text also contains affixation, which makes the segmentation process challenging. These words are discussed in [3]. Prefixes and suffixes are removed in stemming. Examples of such Urdu words are shown in Table 7.
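Affix removal can be sketched as a lookup against prefix and suffix lists. The tiny lists below are illustrative assumptions, not a complete inventory of Urdu affixes, and the function name and the minimum stem length are hypothetical. A list-based approach like this one can overstrip, so it should be read as a sketch rather than as a working stemmer.

```python
from typing import Tuple

# Illustrative affix lists only; a real system would need far larger lists.
PREFIXES = ("بے", "نا")
SUFFIXES = ("مند", "دار")


def split_affixes(word: str, min_stem: int = 2) -> Tuple[str, str, str]:
    """Split word into (prefix, stem, suffix); the prefix is removed first."""
    prefix = suffix = ""
    stem = word
    for p in PREFIXES:
        # Keep at least min_stem characters so short words are not destroyed.
        if stem.startswith(p) and len(stem) - len(p) >= min_stem:
            prefix, stem = p, stem[len(p):]
            break
    for s in SUFFIXES:
        if stem.endswith(s) and len(stem) - len(s) >= min_stem:
            suffix, stem = s, stem[:-len(s)]
            break
    return prefix, stem, suffix


with_suffix = split_affixes("ضرورتمند")
with_prefix = split_affixes("بےشمار")
too_short = split_affixes("نام")  # looks like a prefix match but the stem would be too short
```

The minimum stem length is a simple guard against false matches such as نام, which begins with the same letters as a prefix but is a word in its own right.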

Abbreviations
Urdu text contains words from other languages, e.g. English, Arabic, Greek, Farsi and Latin. English abbreviations in Urdu writing need a space or dash character between the words [8]. Some examples of abbreviations are given in Table 7.

Conclusions
In this paper we have presented a supervised machine learning scheme for Urdu word segmentation. Urdu is a highly cursive language compared to other Asian languages, and the use of the space between words is problematic. Urdu word segmentation is a challenging task because of the morphological richness of the language, and it therefore requires standard tools. In this study we addressed the problem with a supervised machine learning model, namely CRF, along with the POS information of the corresponding words. To our knowledge, this is the first time that such a segmentation scheme for handling space insertion, space omission, compound words and reduplicated words has been presented. The results show clear improvements of the proposed scheme over previous approaches. In future we plan to test deep learning approaches such as deep convolutional neural networks, recurrent neural networks and LSTM networks on this task.

The Urdu nouns
A noun (اسم) represents the name of a place, person, animal or thing. Urdu nouns are classified into different categories and types [23]. Table 5 lists the different types of Urdu nouns with examples.

Table 6. Summary of existing datasets for ULP

Table 7. Summary of Urdu word segmentation issues