Word Embedding and String-Matching Techniques for Automobile Entity Name Identification from Web Reviews

With the widespread popularity of the Internet, information on a wide range of domains circulates across social media platforms. To extract this information for use in diverse natural language processing applications, identifying names is a prerequisite. This study identifies automobile names from noisy web reviews by exploring two widely used machine learning algorithms, Conditional Random Field and Support Vector Machine. The accuracy of machine learning classifiers relies heavily on the size and quality of the training data, which here has been prepared manually from a discussion forum corpus. Because this task is time consuming and laborious, word embedding is adopted to leverage unlabelled data. Although word embedding enhances the system's performance, it is unable to spot the noisy names that occur in web reviews. A gazetteer-based string-matching technique is therefore proposed; it recognizes a new set of noisy automobile entities, resulting in a considerable improvement in accuracy.


Introduction
The smallest token in a corpus that conveys the most information is an entity name. Naturally, identifying this name becomes the first choice for mining and extracting information from text. A named entity recognition (NER) system is useful in a number of natural language processing (NLP) tasks such as text classification, opinion mining, information retrieval, sentiment analysis and machine translation.
The literature has demonstrated that newsgroups, discussion forums and social media data have been used as information sources in a number of NLP systems [1][2][3][4][5][6]. We have also considered the Internet as our data source for developing the NER system. There are plenty of discussion and review forums available on the Internet, such as automobile discussion forums, diagnostic discussion forums, reviews of movies, mobiles, newly launched software or products and life insurance policies, and discussion forums on recent news or trends.

Observation and Motivation
We have observed that, for most common web users, these types of social media (web reviews or discussion forums) have become a centre of attraction. Exploring web reviews and discussion articles has become a regular practice for these people before watching a movie or buying a car, bike, mobile, laptop or any other product. They are accustomed to sharing information and their experiences of various aspects of daily life on the Internet through these types of social media. As a consequence, such a corpus is full of customer-to-customer (C2C) messages, business-to-business (B2B) conversations, and business-to-customer (B2C) communication and vice versa. Corporate giants and large business firms have recognized that the customer not only delivers profit through business transactions but also enriches the product reviews on these social media platforms. By analysing these reviews, demand-supply studies and market trend estimation can be carried out, provided that the targeted products and their companies (names of product, company etc.) can be identified. Hence, a named entity identification system is required for the target domain.
Purchasing a family car is quite common these days. While purchasing a car, knowing its features is essential for the buyer. On the Internet there are several web review and discussion forums on automobiles, for example CarWale, CarDekho, ZigWheels and CarTrade, from which a customer can get information about the features and other details of a car. One can also write reviews or post queries about a particular car model in those forums. Hence these web reviews are valuable sources of the experiences shared by existing automobile users. An automatic car recommender system based on buyer requirements and queries can be developed if the discussion corpora available on the Internet are used skilfully. To use these corpora in a variety of tasks such as demand-supply studies, market trend estimation and other information extraction (IE) tasks, recognizing the named entities is a prerequisite, as these are the pivotal elements of the corpus. These corpora contain primary named entities such as the company, product and model names of a car. These observations motivate us to design a NER system for the automobile domain.

Problem Formulation and Solution Strategy
However, the corpora available in these web reviews and discussion forums are posted by common web users; hence they often contain a large amount of noisy text. NLP tools developed for grammatical, standard corpora often fall short of fair performance on these corpora because of the noise. The intricate and irregular naming style (lxi, vxi, zxi, etc.) of car and product names is also a major obstacle in this task. Misspellings and abbreviations are further sources of ambiguity in identifying automobile names. A specific car name can be spelled differently by different users; for example, 'Hyundai' is written as 'Hyndai', 'Hundayi', 'Hundai' etc. Sometimes these names also contain numeric values (I10, I20, V2, eV2, etc.), which specify the product and model of the car. Moreover, the uneven capitalization and punctuation of discussion forum text add to the difficulty; hence a dedicated system is required to handle noisy automobile names.
For developing a NER system, three types of techniques can be adopted: Linguistic, Machine Learning (ML) and Hybrid. The first relies on handcrafted rules and requires a domain expert [7]; hence it is difficult to apply to complex named entities [8][9][10]. For this reason, a supervised ML based NER system is a preferable option. An ML classifier uses labelled training data in which names are annotated manually [11]. A combination of these two techniques, called the Hybrid approach, is also used for named entity identification [12].
This paper explores two machine learning classifiers, conditional random field and support vector machine, for automobile name identification from online discussion or user review corpora. The two classifiers are trained using a manually annotated data set (~105K words) taken from CarWale user reviews ("http://www.carwale.com/"). The authors have considered three named entity (NE) categories: company, product and model. To assess the robustness of the system, testing is performed on a data set taken from a different source, CarDekho user reviews ("https://www.cardekho.com/").
Here the baseline Conditional Random Field (CRF) system obtains an F-Measure of 92.20 and the baseline Support Vector Machine (SVM) system achieves an F-Measure of 93.24.
The ML algorithm uses the training samples from the annotated corpus to build a statistical classifier. Thus, the system relies heavily on the quantity and quality of the training corpus. Manually preparing a large labelled training corpus that is rich in feature samples is laborious and burdensome. On the other hand, a large collection of external raw corpora containing valuable information is readily available and can be used to leverage the labelled training corpus. Efficient processing of the raw data is a prerequisite, however, so that only the informative parts have sufficient impact on the outcome [13]. This is where word embedding becomes important. Word embedding is a technique for generating word vectors using the skip-gram and continuous bag-of-words (CBOW) models [14]. Word2Vec has been used to create word vectors that enhance the performance of the baseline system. The word embedding enhancement of the CRF based system achieves an F-Measure of 94.09, and the SVM system with the word embedding enhancement reaches an F-Measure of 94.93.
During the analysis, it has been observed that a few misspelled or abbreviated (noisy) names remain undetected by the word embedding based enhancements of the CRF and SVM models. To recognize these misspelled or abbreviated automobile NEs, an automobile gazetteer list has been prepared and a novel gazetteer-based string-matching approach has been proposed. Finally, the CRF system incorporating word embedding and the string-matching technique achieves an F-Measure of 95.87, and the SVM system with word embedding and string matching achieves an F-Measure of 96.40.
The salient features of this article, which contribute to the literature in several ways, are as follows:
• The proposed NER system explores two machine learning classifiers and is an initiative (the first NER system in the automobile domain) for identifying entity names from automobile web reviews.
• The article proposes a word embedding based semi-supervised learning framework.
• It introduces a novel gazetteer-based string-matching technique that achieves higher accuracy by identifying misspelled or abbreviated (noisy) entity names in the web discussion corpus.
The rest of the paper is organized into five sections. Related previous work is presented in Section 2. The baseline NER system is discussed in Section 3. The word embedding enhancement of the system is presented in Section 4. Section 5 presents the novel gazetteer-based string-matching approach for noisy name detection. The conclusion is drawn in Section 6.

Related Work
The literature shows that most of the available NER work concentrates on identifying general domain names such as person, organization and location. Some domain-specific systems are also available that identify NEs such as chemical, historical and biomedical names.
To boost the performance of a baseline system (model), various types of integrations or add-on modules have been incorporated [38]. Various NER systems have incorporated deep domain information such as part-of-speech (POS), word pattern, out-of-domain POS, morphological pattern and semantic trigger for identifying names [39][40]. Lin et al. proposed a Maximum Entropy based hybrid NER system that was combined with a rule-based technique [22]. Ekbal and Bandyopadhyay developed a NER system for a Bengali corpus using three different classifiers: Maximum Entropy, Conditional Random Field and Support Vector Machine [35]. Their system also used semi-supervised learning and weighted voting approaches as post-processing techniques. Poibeau proposed a hybrid NER system and incorporated multiple criteria-based techniques for boosting the robustness of the named entity recognizer [41].
A machine learning based approach with word embedding features was used for the disease named entity recognition and normalization subtask of the BioCreative-V Chemical Disease Relation (CDR) challenge task [42]. Ivanitskiy et al. described the results of Russian named entity classification and equivalent NE retrieval using word-phrase representation [43]. They reported that the context vector of a word or expression is an effective attribute for guessing the type of a NE class. Seok et al. used CRF as the learning algorithm and applied word embedding features for NE extraction [44]. A few other NER tasks have used word embedding for identifying names [45][46][47]. There are multiple embedding techniques, such as 'Word2Vec' and 'GloVe'.
A good number of research works address extracting names from informal web text. One NER system used a web search engine and a NE list to collect web documents containing NE instances [48]. These web documents were further filtered by sentence separation and text refinement procedures before the NEs were finally classified into appropriate classes. A novel n-gram based lexical system was proposed by Downey et al. for identifying complex NEs from online corpora [49]. Their system incorporated Pointwise Mutual Information (PMI) and Symmetric Conditional Probability (SCP) scores in the lexical method. Ritter et al. presented T-NER for extracting NEs from Twitter text [50]. They applied topic modelling (Labelled LDA) to enhance the performance of their system. Another NER system was introduced by Karaa for extracting and categorizing NEs from web corpora [51]. To extract the context of the NEs, Karaa used tf-idf (term frequency and inverse document frequency). Majumder et al. used CRF as the ML classifier along with an active learning based semi-supervised approach for identifying drug and disease names in a discussion forum corpus [3]. However, their NER system was unable to recognize noisy named entities. In another attempt, Majumder and Saha applied CRF as the classification algorithm along with a global context-based framework to track misspelled and abbreviated disease and drug names in a healthcare web review corpus [4]. Aguilar et al. proposed a multi-task approach for named entity recognition from social media data using part-of-speech tags and gazetteer information [52]. They used CRF as the classifier for their NER task. Singh et al. presented a named entity recognition system for Hindi-English code-mixed social media text (Twitter) using word, character and lexical features [53]. Sabty et al. proposed a NER system for identifying NEs from Arabic-English code-mixed data [54]. The performance of this system was enhanced by using a pooling technique and word embedding.
In the literature, hardly any NER system is found that works in the automobile domain to identify car company, product and model names. Most of the NER systems discussed above also have difficulty handling noisy NEs. Our proposed system is an initiative for identifying and classifying automobile names from noisy online user review or social media corpora.

NER system using CRF and SVM
Two ML classifiers, conditional random field and support vector machine, have been explored for developing the baseline NER system.

Conditional Random Field (CRF)
CRF is a conditional probabilistic model for labelling and classifying sequential data such as a natural language corpus [55]. It is an undirected graphical model that defines a single log-linear distribution over label sequences given a particular observation sequence [56].
For the development of this NER system, the CRF++ toolkit has been used. It is a simple and customizable open-source implementation of CRF for segmenting or classifying sequential data such as natural language text. CRF++ is designed for general purposes and is applicable to a wide range of NLP tasks, such as part-of-speech tagging, NER and opinion mining.

Support Vector Machine (SVM)
SVM is a widely used supervised ML classifier introduced by Vapnik [57]. A classification problem generally involves training and testing on data sets made up of data samples.
As the SVM is a binary classifier, the pair-wise or one-vs-rest approach is used for multi-class classification. In the proposed NER system, SVM is used as one of the machine learning classifiers; it carries out classification by building an N-dimensional hyperplane that best splits the data into two categories, the positive class and the negative class. This NER task involves two main phases, training and classification, which have been performed with YamCha, an open-source and customizable implementation of SVM that is widely used in a variety of NLP tasks such as sentiment analysis, NER and other sequence labelling problems.
The SVM toolkit YamCha supports kernel functions. For the proposed NER system, a polynomial kernel of degree 2 has been used for the SVM.
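As a rough illustration of what a second-degree polynomial kernel computes, the sketch below evaluates K(x, y) = (x · y + c)^d with d = 2. The constant c = 1 and the toy binary feature vectors are assumptions made for illustration; the kernel constants actually used with YamCha are not stated here.

```python
def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel K(x, y) = (x . y + coef0) ** degree."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


# Hypothetical binary feature vectors for two tokens (illustration only).
token_a = [1, 0, 1, 1]
token_b = [1, 1, 0, 1]
print(polynomial_kernel(token_a, token_b))
```

With binary features, a degree-2 kernel implicitly takes pairs of co-occurring features into account (for example, "starts with a capital letter" together with "contains a digit"), which is one common reason for preferring it over a linear kernel in sequence labelling.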

The Set of Features Used to Train the Baseline Model
Feature values play a key role in developing a machine learning based NER system. Various types of potential features have been explored, and the best suited feature values have been chosen for developing the baseline system. The system is trained on the labelled data set using several combinations of the following candidate features.

Word Features
The word feature is the most important feature for developing the NER system. The current word, along with the preceding and following words, has been used with window sizes of three, five and seven, where the target word is in the middle.
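A minimal sketch of how such a context window could be extracted is given below. The function name, the feature keys and the padding symbol are hypothetical, and the example sentence is invented for illustration.

```python
def window_features(tokens, index, size=3, pad="<PAD>"):
    """Return the words in a window of `size` centred on tokens[index]."""
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    half = size // 2
    features = {}
    for offset in range(-half, half + 1):
        pos = index + offset
        # Positions outside the sentence are filled with a padding symbol.
        word = tokens[pos] if 0 <= pos < len(tokens) else pad
        features[f"w[{offset}]"] = word
    return features


tokens = "I bought a Hyundai i10 Magna last year".split()
print(window_features(tokens, 3, size=3))
```

Passing `size=5` or `size=7` yields the wider windows; the target word always sits at offset 0.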

Affix Feature
Morphological information, such as prefixes and suffixes, is regarded as an important cue for terminology classification. We have used prefixes and suffixes of length three.

Digit Feature
In the corpora, the current token/word often contains a numeric value (digit). Hence a binary feature such as Is_Numerical is included.

Capitalization Feature
NEs are often capitalized in a standard corpus. A binary feature indicating whether the current token starts with a capital letter is used here.
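The affix, digit and capitalization features described above can be sketched for a single token as follows. The feature names are hypothetical, and the digit feature is assumed to fire when the token contains at least one digit.

```python
def orthographic_features(token, affix_len=3):
    """Affix, digit and capitalization features for a single token."""
    return {
        "prefix": token[:affix_len],      # first three characters
        "suffix": token[-affix_len:],     # last three characters
        "has_digit": any(ch.isdigit() for ch in token),
        "init_cap": token[:1].isupper(),  # False for an empty token
    }


print(orthographic_features("Hyundai"))
print(orthographic_features("i10"))
```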

Parts-of-speech (POS) Feature
Part-of-speech (POS) information is a key attribute for the NER task. The POS information of the target word and its surrounding words has been extracted using the NLTK (Natural Language Toolkit) POS tagger.

Dictionary Feature
A named entity normally does not belong to the dictionary, so we use a binary feature that indicates whether the current word is a dictionary word.

Symbolic Feature
As a general rule, a sentence ends with a single terminal punctuation symbol such as the period (.), the question mark (?) or the exclamation point (!). However, some complex or compound words (named entities such as "I-10") themselves contain symbols. Naturally, this feature has a great impact on accuracy.

Word Shape Feature
In the system, this attribute stands for the brief word class of the current token. The consecutive upper-case letters, lower-case letters, digits and symbols are mapped to 'X', 'x', '0' and '#' respectively. For example, the brief word shape of "I-10" is "X#00". The feature has a wide effect in identifying NEs in the noisy text of web reviews.
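A minimal sketch of this mapping is shown below. It assumes a character-by-character mapping, which is what the "I-10" to "X#00" example suggests; the function name is hypothetical.

```python
def word_shape(token):
    """Map each character to X (upper), x (lower), 0 (digit) or # (other)."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append("#")
    return "".join(shape)


print(word_shape("I-10"))
print(word_shape("Hyundai"))
```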

Data Set Used in the Baseline NER System
To train the baseline NER system, the data set has been taken from "CarWale" (http://www.carwale.com/) user reviews.
The training data set contains ~105K words with ~4436 annotated NEs. Three name classes have been considered here: Company (C), Product (P) and Model (M). The data set is annotated as follows: "
To test the robustness of the system, a data set of ~10K words has been taken as test data from a different source, "CarDekho" (https://www.cardekho.com/) user reviews, and evaluated with the baseline system.

Performance of Machine Learning Based NER System
Table 1 and Table 3 present the accuracies obtained at different stages of baseline classifier preparation. Accuracy is reported as the F-Measure or F-Score (F), which is the harmonic mean of recall and precision. Recall is the ratio of the number of relevant named entities retrieved to the total number of relevant named entities in the corpus, whereas precision is the ratio of the number of relevant named entities retrieved to the total number of named entities retrieved, relevant and irrelevant. Both recall and precision are usually expressed as percentages. The calculations of Precision, Recall and F-Measure are shown below, along with the Confusion Matrix in Table 2, where TP, FP and FN denote the numbers of true positives, false positives and false negatives.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$F = \frac{(\beta^{2} + 1) \cdot \text{Precision} \cdot \text{Recall}}{\beta^{2} \cdot \text{Precision} + \text{Recall}} \qquad (3)$$

Here 'β' is the relative weight between recall and precision, and its value is taken as 1.

The baseline CRF system achieves an F-Measure of 92.20 with 92.91% precision and 91.50% recall for a word window of 3, and the baseline SVM system reaches an F-Measure of 93.24 with 95.71% precision and 90.89% recall for a word window of 3. The results of the baseline system imply that some NEs are not detected. This may be due to the quantity (size of training data: 105K words) and the quality (noisy text) of the training corpus. To address this and obtain better accuracy, a word embedding based semi-supervised framework has been adopted.
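The reported F-Measures can be re-derived from the stated precision and recall values. The sketch below uses the standard weighted F-measure with β = 1; the function names are hypothetical, and the confusion-matrix counts in the last line are invented purely to show the calculation.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (both in percent)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


def precision_recall(tp, fp, fn):
    """Precision and recall (in percent) from confusion-matrix counts."""
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(round(f_measure(92.91, 91.50), 2))  # baseline CRF, reported as 92.20
print(round(f_measure(95.71, 90.89), 2))  # baseline SVM, reported as 93.24
print(precision_recall(tp=90, fp=10, fn=30))  # invented counts
```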

Word Embedding
Word embedding is the mapping of words to vectors of real numbers, which helps to obtain enhanced performance in different NLP tasks such as NER and sentiment analysis by grouping similar words [29][45][46][58][59]. It can capture a variety of dimensions of meaning and phrase information related to the prospective attributes of words within a vector. The vectors of the words depend on the vocabulary size. Words that have similar vectors are expected to be semantically similar. Semantically analogous words include synonyms, antonyms and words on a scale (such as hot, warm, cool). Word embedding is used in semantic parsing to extract meaning from text and enable natural language understanding. For a language model to predict the meaning of a text, it needs to be aware of the contextual similarity of words. These vectors are numerical representations of the contextual similarities between words and can be operated on mathematically and logically like any other vectors. Generally, word similarity is calculated from the cosine distance between the embedded vectors.
Assume the data is embedded in three-dimensional space, where each small bubble represents a word entity: a blue circle symbolizes a 'company name' (Hyundai, Maruti, Ford, etc.), a red circle indicates a 'product name' (i10, Figo, Alto etc.), and a yellow circle denotes a 'model name' (LXI, Magna, etc.), as shown in Figure 1. Here the embedding places neighbouring words close to each other in terms of cosine distance (for example Hyundai and Maruti, which are both company names) and at a similar distance from their co-related words in the other group (i10 and Alto respectively). For example, the distance between Hyundai and i10 is similar to the distance between Maruti and Alto; thus Hyundai - i10 and Maruti - Alto are co-related word pairs. As Hyundai and Maruti are close in terms of cosine distance and have the same cosine distance to their co-related words (i10 and Alto respectively), they are grouped together. In this approach, the word embedding has been created using Word2Vec. After applying the word embedding technique, considerable improvements have been seen in the successive versions of the CRF and SVM based NER systems. The results of these two NER systems are shown in Table 4 and Table 5. After the word embedding enhancement, the CRF based system reaches an F-Measure of 94.09 with 94.62% precision and 93.57% recall, and the SVM based system achieves an F-Measure of 94.93 with 96.57% precision and 93.34% recall.
A comparative study has also been carried out between the baseline NER systems and their word embedding based enhancements, as shown in Figure 2. These results show that the SVM and word embedding based NER system outperforms the CRF and word embedding based system. Although the performance of both versions of the system increases after applying word embedding, a few NEs still remain undetected because they are noisy (misspelled or abbreviated). To trace these misspelled or abbreviated automobile NEs, an automobile gazetteer list has been prepared and a novel gazetteer-based string-matching approach has been proposed, as described in the next section.
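As a small illustration of comparing embedded vectors, the sketch below computes cosine similarity (cosine distance is one minus this value). The three-dimensional vectors are invented for illustration and are not taken from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
vectors = {
    "Hyundai": [0.9, 0.1, 0.2],
    "Maruti": [0.8, 0.2, 0.1],
    "i10": [0.1, 0.9, 0.3],
}

same_class = cosine_similarity(vectors["Hyundai"], vectors["Maruti"])
cross_class = cosine_similarity(vectors["Hyundai"], vectors["i10"])
print(same_class > cross_class)
```

In this toy setting the two company names are closer to each other than a company name is to a product name, which is the kind of grouping the embedding is expected to provide.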

String-Matching Based Noisy Name Identification
Recognizing named entities in online user review or social media corpora suffers a major setback from the noisy nature of the text. The texts available on these social network platforms or discussion forums are posted by common web users; consequently, they often include a large amount of textual noise. A specific car NE can be spelled differently by different users; for example, 'Hyundai' is written as 'Hyndai', 'Hundayi', 'Hundai' etc. Moreover, the use of punctuation in web review corpora does not follow standard grammatical rules. As the NEs of this corpus are often misspelled and abbreviated, traditional NLP techniques are unable to detect them; hence a dedicated system is required to handle the noisy names. A novel gazetteer-based string-matching algorithm (Algorithm 1) has been proposed to classify these noisy NEs.
First, a gazetteer list (G_LIST) of automobile named entities is prepared from the training data. Then the classifier annotation on the test data (output "OP" on the test data) is extracted. Next, a list of words not tagged as names (NN_LIST) is prepared from "OP". Each of these two lists contains a single word followed by an annotation tag per line. Every word of NN_LIST is then compared with each word of G_LIST, and if the two match by 80% or more (string matching percentage), the annotation tag of the NN_LIST word is changed according to G_LIST. Finally, the F-Measure is calculated on the modified output data.

The gazetteer-based string-matching technique has been successfully incorporated to identify noisy automobile names in the web discussion corpus, but in very rare cases it misjudges non-name words as named entities. For example, "Swift" is a named entity (product name: BPNE) and "Shift" is not a name, but the technique misclassifies "Shift" as a named entity (BPNE). It has also been found that a few NEs remain undetected because the threshold matching percentage is set at 80% or more. For example, on quite a few occasions the product name (BPNE) Hyundai 'i10' has been misspelled as Hyundai '100'. 'i10' matches '100' by only 66.67%; hence the system is unable to identify it as an NE.
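The matching step can be sketched as follows. The exact string-matching function is not specified in the text, so this sketch assumes the similarity ratio of Python's `difflib.SequenceMatcher`, which happens to give 80% for "Swift" versus "Shift" and 66.67% for "i10" versus "100". The function names, the short tags and the tiny lists are hypothetical.

```python
from difflib import SequenceMatcher


def match_percent(a, b):
    """Similarity of two strings as a percentage (case-insensitive)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def relabel_noisy_names(nn_list, g_list, threshold=80.0):
    """Retag non-name words that match a gazetteer entry closely enough.

    nn_list: words the classifier did not tag as named entities.
    g_list: (word, tag) pairs from the gazetteer.
    Returns {word: tag} for every word whose tag was changed.
    """
    changed = {}
    for word in nn_list:
        for entry, tag in g_list:
            if match_percent(word, entry) >= threshold:
                changed[word] = tag
                break  # the first sufficiently close gazetteer entry wins
    return changed


gazetteer = [("Hyundai", "C"), ("Swift", "P"), ("i10", "P")]
candidates = ["Hyndai", "Shift", "100", "mileage"]
print(relabel_noisy_names(candidates, gazetteer))
```

Under these assumptions the sketch reproduces both behaviours described above: "Hyndai" is recovered as a company name, "Shift" is wrongly tagged as a product because it reaches the 80% threshold against "Swift", and "100" stays untagged because it falls below the threshold against "i10".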

Conclusion
In automobile discussion forums, users can write reviews or post queries about any particular car model. Hence, these web reviews are valuable sources of the experiences shared by existing automobile users. To use these corpora in a variety of tasks such as demand-supply studies, market trend estimation, opinion mining and other information extraction tasks, recognizing the names is a prerequisite. This article presents a study on developing a NER system by exploring two machine learning algorithms, Conditional Random Field and Support Vector Machine, to identify names in online automobile user review or discussion forum corpora. To train the machine learning classifiers with a set of identified candidate features, the training corpus has been prepared under human supervision. To obtain better accuracy, word embedding has been integrated. Although incorporating word embedding improves the system's performance, a few NEs are still not recognized because of the noise in the corpus. To identify these noisy automobile NEs, a gazetteer list has been prepared and a novel gazetteer-based string-matching approach has been proposed; it considerably improves the accuracy of the system, up to an F-Measure of 96.40 with recall of 96.24% and precision of 96.57%. Although the system is tested on an automobile data set, the approach is generic and a similar procedure may also prove useful for other noisy domains.

Figure 1. Example of word embedding

Word2Vec is a neural network based language representation model that learns an embedding for every word in the corpus. Word2Vec offers the skip-gram architecture along with the continuous bag-of-words (CBOW) model, proposed by Mikolov et al. at Google [11]. Skip-gram predicts the neighbouring words, or context, given a single word. The size of every word vector is 50, and the context window has been set to 4 words preceding and 4 words following the current word. Another ~135K words have been collected from automobile discussion corpora and added to the original data set of ~105K words. These ~240K words have been used to create the word vectors.
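To make the context window concrete, the sketch below generates the (target, context) pairs that a skip-gram model would be trained on with a window of 4 words on each side. It covers only pair generation, not the training of the 50-dimensional vectors; the function name and the example sentence are hypothetical.

```python
def skipgram_pairs(tokens, window=4):
    """Yield (target, context) pairs within `window` words on each side."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]


sentence = "the hyundai i10 magna gives good mileage in city traffic".split()
pairs = list(skipgram_pairs(sentence, window=4))
print(len(pairs))
print(pairs[:4])
```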

Figure 2. Comparison of CRF and SVM based NER System after Word Embedding

Algorithm 1. String Matching Based Noisy Name Identification
Begin
1. Prepare a gazetteer list (G_LIST) of Automobile Named Entity from training data;
2. On test data "T" find classifier annotation as output data "OP";
3. Prepare a list (NN_LIST) of words that are not classified as NE from "OP";
   For (P1 = first line of NN_LIST; P1 <= last line of NN_LIST; Increment the pointer by P1 = P1 + 1)
   {
       For (P2 = first line of G_LIST; P2 <= last line of G_LIST; Increment the pointer by P2 = P2 + 1)
       {
           If (current word of P1 matches more than or equal to 80% with current word of P2)
           {
               Change the tag of the current word of P1 according to the current word of P2;
               Break
           }
           Else
           {
               Do not change the tag of the current word of P1;
           }
       }
   }
4. Calculate F-Measure on Modified Output Data;
End

EAI Endorsed Transactions Scalable Information Systems 08 2021 -10 2021 | Volume 8 | Issue 33 | e1

Figure 3. Comparing baseline NER result with word embedding and gazetteer-based string-matching NE identification

Table 1. Result of Baseline CRF Based NER System on Dataset

Table 2. Confusion Matrix

Table 3. Result of Baseline SVM Based NER System on Dataset

Table 4. Result of CRF Based NER System with Word Embedding