Knowledge Extraction Framework for Building a Large-scale Knowledge Base

As the Web has permeated everyday life, people consume ever more data in online spaces and exchange their behaviours with others. At the same time, a variety of intelligent services such as virtual assistants, semantic search and intelligent recommendation have become available. Most of these services rely on their own knowledge bases; however, constructing a knowledge base raises many distinct technical issues. In this paper, we propose a knowledge extraction framework that comprises several extraction components for processing data in various formats, such as metadata and web tables in web documents. The framework can therefore be used to extract a set of knowledge entities from large-scale web documents. Most existing methods and tools concentrate on obtaining knowledge from a single format; in contrast, this framework handles various formats, and the extracted entities are simultaneously interlinked to a knowledge base through automatic semantic matching. We describe the features of each extractor in detail and provide an evaluation of them.


Introduction
Mobile devices are now equipped with intelligent virtual assistants, application programs that can understand natural language and complete electronic tasks for users [1]. Major IT companies such as Apple, Google, Microsoft and Facebook have invested in intelligent services that make assistants more contextually aware and more versatile [1][2]. Against this background, knowledge is crucial for a virtual assistant, because it searches and explores a set of knowledge bases to handle user enquiries. However, constructing a knowledge base from various data sources is not trivial. For example, the following sentence provides a brief description of Lionel Messi.

"Lionel Messi (born 24 June 1987) is an Argentine professional footballer who plays as a forward for Spanish club FC Barcelona and captains the Argentina national team."
Although humans can understand unstructured text in context, the meaningful information in such text must be transformed into structured formats before machines can understand it. For example, Lionel Messi is a Person who has a date of birth (i.e. 24 June 1987), and his nationality is Argentina, which is a type of country. We can also deduce the relation that his occupation is football player. A knowledge base comprises a set of objects as entities together with their various relations. Although the enormous amount of data on the Web keeps expanding and is a crucial source for constructing a knowledge base, extracting knowledge from large-scale data sources still poses many technical issues, such as named entity recognition, relation extraction, entity resolution and co-reference resolution [3]. Furthermore, recent knowledge bases tend to adopt a graph structure of inter-related objects [4], as in DBpedia†, Wikidata‡, YAGO§ and Google's Knowledge Graph**. All extracted data is transformed into interlinked data formats based on a specific ontology model. Moreover, objects in different knowledge bases can be intertwined; in this sense, entity interlinking and ontology matching are essential techniques for building a graph-based knowledge base.

In this paper, we introduce a knowledge extraction framework for building a knowledge base from large-scale data sources. Although existing knowledge bases contain millions of facts about the world, each has its own points of emphasis. We focus on a general framework for aggregating and extracting large-scale entities and relations from heterogeneous data sources. The contribution of this paper is as follows:
• We propose a comprehensive framework for knowledge extraction and describe its architecture in detail.

One of the projects that pursues goals similar to DBpedia is YAGO [8]. YAGO has evolved to version 3, YAGO3, an extension of the YAGO knowledge base that combines information from Wikipedia in multiple languages [9]. It fuses multilingual knowledge with the English WordNet to build a coherent knowledge base. The main differences between YAGO and DBpedia are as follows: first, the DBpedia ontology is manually maintained, while the YAGO ontology is backed by WordNet and Wikipedia leaf categories; second, the integration of attributes and objects in infoboxes is done via mappings in DBpedia, while YAGO implements it through expert-designed declarative rules.

The Open IE project has been developing a Web-scale information extraction system that reads arbitrary text from any domain on the Web, extracts meaningful information and stores it in a unified knowledge base for efficient querying [10]. This project has produced tools including TextRunner, ReVerb [10], OLLIE [11] and OpenIE 4.0. Open IE is schema-less and can extract data for arbitrary relations from plain text, which is useful when seed data and training corpora are lacking, compared with the distant supervision method used in our Text2K extractor; its main drawback, however, is that it extracts noisy data. Knowledge Vault [12] was developed by Google to extract facts, in the form of disambiguated triples, from the entire web. Its main difference from other work is that it fuses facts extracted from text with prior knowledge derived from Freebase [13]. Our work is most similar to this approach, apart from the addition of a confidence value to every fact. We also integrate Wikipedia data, which contains multilingual and abundant structured information and can enrich our knowledge base with more data.

Current knowledge bases can be clustered into four main groups [10][12]: 1) approaches such as Wikidata, DBpedia and YAGO, which are built on Wikipedia infoboxes and other structured data sources; 2) approaches such as ReVerb, OLLIE and PRISMATIC, which apply open (schema-less) information extraction techniques to the entire web; 3) approaches such as PROSPERA and Knowledge Vault, which extract information from the entire web but use a fixed ontology and schema; and 4) approaches such as Probase [14], which construct taxonomies (is-a hierarchies) as opposed to general knowledge bases with multiple types of predicates. Our extraction framework covers both 1) and 3), and merges all the extracted data to provide an enriched unified knowledge base.

Web table extraction has been a hotspot over the last decade; yet only limited studies have been published. Munoz, E. et al. released a tool, DRETa [15], to extract RDF triples from generic Wikipedia tables using DBpedia as a reference knowledge base. However, DRETa serves only Wikipedia tables and relies strongly on DBpedia. Limaye, G. et al. [16] propose machine learning techniques to annotate a table's cells with entity, type and relation information; this work focuses on table annotation and does not address detecting tables in web pages.

Knowledge Extraction Framework for Large-scale Web Data
This framework extracts a set of meaningful entities and their values from large-scale web data. Figure 1 illustrates the high-level architecture of the knowledge extraction framework, which comprises three types of extractors covering metadata, web tables and plain text in HTML pages. The input is a collection of URLs or a large-scale data dump, such as Wikipedia or crawled web data, and each extractor, depending on the data format, runs its extraction procedure with different processing logic. For example, a web table is obtained by detecting the HTML anchors that indicate table chunks, such as <table> or <class="wikitable">, or by training a table detection model using machine learning techniques (see Section 3.2 for details). After extraction by the individual extractors, the results are transformed into structured data using the knowledge model already defined for representing the knowledge base, and are stored in, or used to update, a triple store. The extraction techniques and features of each extractor are described in the following sections.
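As a rough sketch, the architecture above can be thought of as dispatching each document to a set of format-specific extractors and merging the resulting triples. All names in this sketch are illustrative, not taken from the actual implementation:

```python
# Illustrative sketch of the extractor-dispatch step (all names are
# hypothetical; each stub stands in for one extractor of the framework).
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def extract_metadata(html: str) -> List[Triple]:
    return []  # markup filtering + format-specific rules (metadata extractor)

def extract_tables(html: str) -> List[Triple]:
    return []  # table detection + triple transformation (web table extractor)

def extract_text(html: str) -> List[Triple]:
    return []  # pattern-based relation extraction (plain text extractor)

EXTRACTORS: Dict[str, Callable[[str], List[Triple]]] = {
    "metadata": extract_metadata,
    "table": extract_tables,
    "text": extract_text,
}

def run_pipeline(html: str) -> List[Triple]:
    """Run every extractor over one document and merge the triples,
    which would then be written to the triple store."""
    triples: List[Triple] = []
    for extractor in EXTRACTORS.values():
        triples.extend(extractor(html))
    return triples
```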

Metadata
Metadata extraction collects knowledge (e.g. products, people, organizations, places, events and resumes) from markup standards such as RDFa [17], microdata [18] and microformats [19] in web documents. This extractor contains two core modules: one performs markup filtering, and the other applies format-specific extraction rules. Markup filtering detects a set of markups in HTML sources. Because a large number of web pages do not use markup standards, a set of rules for detecting the corresponding tags can be used to analyse a specific part of a web page; this also reduces time-consuming work by aggregating only a fixed quantity of web pages. When a web page does not match a filtering rule, the rest of its source is not collected and no further data sources are expanded from the page. Currently, the extraction rules support several markup formats:
• RDF/XML, Turtle, Notation 3
• RDFa with the RDFa 1.1 prefix mechanism
• Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview, License, XFN and Species
• HTML5 Microdata (such as Schema.org)

The implementation is based on Anything To Triples (Any23)††, a library for extracting structured data in RDF format from a variety of Web documents. After extraction, the data is transformed and interlinked to a knowledge base using existing vocabularies such as schema.org‡‡, purl.org§§ and ogp.me***.

†† https://any23.apache.org/ ‡‡ http://schema.org §§ http://purl.org *** http://ogp.me
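The markup-filtering module can be sketched as a quick scan for the attributes that signal embedded structured data, so that pages without any markup are skipped before running the (more expensive) Any23-style extraction. This is a minimal stand-in using Python's standard HTML parser; the attribute heuristics and class names are illustrative, not the framework's actual rules:

```python
# Minimal markup-filtering sketch: scan an HTML page for attributes that
# indicate embedded microdata, RDFa, or microformats.
from html.parser import HTMLParser

class MarkupFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.formats = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # HTML5 microdata uses itemscope/itemtype attributes
        if "itemscope" in attrs or "itemtype" in attrs:
            self.formats.add("microdata")
        # RDFa uses property/typeof/vocab attributes
        if "property" in attrs or "typeof" in attrs or "vocab" in attrs:
            self.formats.add("rdfa")
        # microformats are signalled by well-known class names (hCard etc.)
        classes = set((attrs.get("class") or "").split())
        if classes & {"vcard", "hreview", "hresume", "adr", "geo"}:
            self.formats.add("microformat")

def detect_markup(html: str) -> set:
    """Return the set of markup formats detected in one page."""
    f = MarkupFilter()
    f.feed(html)
    return f.formats
```

A page for which `detect_markup` returns an empty set would be dropped by the filtering rule, as described above.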

Web tables
At present, millions of data items on the Web are represented in web table formats. Data in this format can be a particularly useful source for enriching a knowledge base, because it carries some structure through specific tags. We propose an approach to extract entities and relations from web tables based on a reference knowledge base. The approach consists of two stages: table detection, which identifies tables in web pages, and triple transformation, which generates triples from the detected tables. Table detection identifies visually formatted tables in web pages and then screens out those without relational information. It also transforms relational tables from HTML into a unified format in which each row represents entity relations. We therefore propose an improved approach, based on the work of Yalin and Hu [20], to detect tables, comprising the following parts.

To extract entities and their relations from plain text, various approaches are applied in the extraction framework. As illustrated in Figure 2, there are four main components.
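The tag-based portion of the table detection stage can be sketched as follows: candidate tables are pulled out of the HTML and normalised into rows of cell strings, after which the classifier decides which candidates are genuinely relational. This is a minimal illustration using the standard library; nested tables and <ul>-based layouts are out of scope, and all names are hypothetical:

```python
# Sketch: collect <table> candidates and flatten each into rows of cells.
from html.parser import HTMLParser

class TableCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tables = []       # each table: list of rows of cell strings
        self._row, self._cell = [], []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag in ("td", "th"):
            self._in_cell, self._cell = True, []

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append(" ".join(c for c in self._cell if c))
        elif tag == "tr" and self.tables:
            self.tables[-1].append(self._row)
            self._row = []

def collect_tables(html: str):
    """Return every table in the page as a list of rows of cell strings."""
    p = TableCollector()
    p.feed(html)
    return p.tables
```

In the unified format produced this way, each row can then be read as one entity and its attribute values.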

Pattern training and Pre-processing
In the first phase, pattern training is based on a knowledge seed and a training corpus. A semi-supervised learning algorithm is employed for this task, which makes use of a weakly labelled training set. It has the following steps: 1) it may have some labelled training data (relation seed data); 2) it has access to a pool of unlabelled data (the training corpus); 3) it has an operator that allows it to sample from this unlabelled data and label it, where the operator is expected to be noisy in its labels; and 4) the algorithm then collectively uses the original labelled training data, if any, together with the newly (noisily) labelled data to produce the final output (trained patterns). Table 2 shows a brief example of pattern training: 1) a seed is a (movie, writer) pair for the relation movie:writer, where the first item is a movie name and the second is a writer of the movie; 2) a co-occurrence corpus is extracted from the training corpus; 3) the co-occurrences are transformed into uniform formats to obtain a set of original patterns; and 4) final patterns are generated by applying stemming and replacement operations to the original patterns.

The pre-processing phase mainly comprises several tasks. Sentence boundary detection obtains each sentence from the text, which is the smallest unit for extraction. Stemming helps extraction work correctly and efficiently across different tenses and forms. Named Entity Recognition (NER) is the key step: it recognizes named entities, as subjects or objects, in a sentence [22]. Parsing is used to find the syntactic structure of a sentence [23][24]; we parse each sentence to identify its subject, applying the head rules defined by Bikel [21] to the parsing result. Co-reference resolution helps to find the corresponding entity when a subject is not a named entity [25]. Sentiment analysis provides extra information about an entity's emotional trend. In particular, we construct domain-dependent corpora for domains such as movies, music and celebrities, and use general-purpose semantic knowledge bases for extracting and detecting sentiment information.
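The core of the pattern-training example can be sketched for one relation (movie:writer): sentences in which both halves of a seed pair co-occur are turned into patterns by replacing the entities with $ARG1/$ARG2 placeholders. This is a minimal sketch under the assumption of exact string matching; the real system works from a co-occurrence corpus and applies further stemming and replacement:

```python
# Distant-supervision pattern training, reduced to its simplest form.
def train_patterns(seeds, corpus):
    """seeds: (movie, writer) pairs; corpus: a list of sentences."""
    patterns = set()
    for movie, writer in seeds:
        for sentence in corpus:
            # step 2: keep only sentences where the seed pair co-occurs
            if movie in sentence and writer in sentence:
                # step 3: transform into a uniform pattern format
                patterns.add(sentence.replace(movie, "$ARG1")
                                     .replace(writer, "$ARG2"))
    return patterns
```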

Optimization
Across the entire text extraction process, the NER, parsing and co-reference resolution modules account for roughly 90-95% of the processing time (see Figure 3). We therefore focus our optimization on these tasks and employ several optimization strategies compared with traditional relation extraction systems. Because the originally generated patterns may include many kinds of entities, such as dates and countries, pattern abstraction is necessary to obtain the final patterns in a post-processing step. For example, named entities are replaced by a wildcard character (e.g. "1898" becomes "*"), and some patterns are replaced through pattern stemming (e.g. "was" becomes "be"). For parsing, we use the Shift-Reduce model instead of the PCFG model; as shown in Table 4, Shift-Reduce reaches a better F1 measure while requiring much less parsing time.
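The pattern-abstraction step can be sketched directly from the examples above: named entities such as years collapse to a wildcard and inflected verb forms are stemmed, so many surface patterns fold into one. The year regex and the tiny stemming table are illustrative assumptions standing in for the real NER-driven replacement and stemmer:

```python
# Pattern abstraction: wildcard replacement plus a toy stemming table.
import re

STEMS = {"was": "be", "were": "be", "is": "be", "are": "be"}

def abstract_pattern(pattern: str) -> str:
    # replace named entities with a wildcard, e.g. "1898" -> "*"
    # (here approximated by a 4-digit-year regex)
    pattern = re.sub(r"\b\d{4}\b", "*", pattern)
    # pattern stemming, e.g. "was" -> "be"
    return " ".join(STEMS.get(w, w) for w in pattern.split())
```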
Co-reference resolution helps to find the corresponding entity when a subject is not a named entity. Considering that 88% of co-reference cases occur within the same paragraph [26], and for pronouns even more than 95%, our co-reference resolution unit focuses on the paragraph rather than the whole article. From the tables above, it can be observed that most of the time is spent on table triplification (about 99%), since it involves many retrievals of candidate entities and potential relations. Figure 4 compares the table extraction speed of our approach, DRETa and the annotation approach proposed by Limaye, G. et al. [16]. Our approach takes 1.75 s, better than DRETa at 7.28 s, though slower than the annotation approach (0.7 s), which lacks a table detection procedure.

Unstructured data extraction
In our experiments, we apply text analysis to plain text from HTML pages and Wikipedia pages, focusing on particular domains such as football (players and teams) and movies. Table 7 shows the statistics of the patterns for each domain: 6,269 patterns in total for the 70 relations defined. For example, the Person domain has 26 relations, such as name, date of birth and parents. In particular, the parents relation can generate several patterns, such as "$ARG1 be the child of $ARG2", where $ARG1 and $ARG2 are entities standing for child and parent. In our experiments, 106 sentences of different lengths were processed; the length distribution of the test cases is shown in Table 8. There are 11 cases with lengths between 150 and 300 words, 11 cases between 50 and 150 words, 51 cases between 20 and 50 words, and 33 cases with lengths greater than 1 word and less than 20 words.

Conclusions
In this paper, we presented a knowledge extraction framework for extracting knowledge entities and their relations from large-scale data sources. We described the conceptual model, implementation and evaluation of this framework. The approach offers a novel method for handling various data formats in a single process, and ultimately reduces the overall processing time of text extraction.
To process various unstructured data sources, specific extraction engines were developed for the different formats, such as metadata and web tables from HTML pages as well as unstructured plain text. Metadata extraction detects the semantic tags in page source files and generates triples referring to the ontologies used by those HTML pages. Web tables are another valuable source for extracting knowledge, by annotating cells, rows and columns with reference to a knowledge base. The text extraction component applies the distant supervision method to train patterns from seed data and a training corpus, and then applies these patterns to text to extract entities and their relations. Finally, we transform the extracted data into knowledge according to our defined ontology, using mappings from named entity types to ontology classes and from patterns to ontology properties. Since multiple extractors obtain data from different sources, a fusion mechanism is needed to merge all the extracted knowledge into our knowledge base. To improve the framework, we are considering several topics, including extraction performance and precision, support for more languages, incremental data updates, and data fusion.

Figure 1 Architecture of the knowledge extraction framework

Figure 2 A high-level workflow of unstructured data extraction

Extraction obtains the relation between a subject entity and an object entity in the same sentence. It is executed for each sentence, determining the relation between the subject entity S and the other recognized entities E = {E1, E2, ..., En} in the sentence through pattern matching. Assuming the relations are R = {r1, r2, ..., rm} and the patterns are P = {p1, p2, ..., pm}, where each pi is a group of patterns representing the same relation ri, the task is to find rj for each (S, Ei) pair. The detailed steps are as follows: frag(S, Ei) gets the text fragment between the occurrences of S and Ei in the original sentence, and pattern_match(text, pattern) checks whether the text conforms to the pattern.

Post-processing mainly focuses on normalization and transformation. Normalization turns data of the same type into a unique format; for example, the date values "May 11 2001" and "1586.01.02" need to be changed into "5/11/2001" and "1/2/1586". Transformation converts the data into triple format according to the designed ontology.
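The per-sentence matching loop can be sketched as follows. frag and pattern_match mirror the helper names in the text; the regex-based matching, the data structures, and the example sentence are assumptions made for illustration:

```python
# Sketch of relation extraction by pattern matching over (S, E_i) pairs.
import re

def frag(sentence, s, e):
    """Text fragment between the occurrences of S and E_i."""
    i, j = sentence.find(s), sentence.find(e)
    if i < 0 or j < 0:
        return ""
    lo, hi = (i + len(s), j) if i < j else (j + len(e), i)
    return sentence[lo:hi].strip()

def pattern_match(text, pattern):
    """Check whether the fragment conforms to a pattern (regex here)."""
    return re.fullmatch(pattern, text) is not None

def extract_relations(sentence, subject, entities, pattern_groups):
    """pattern_groups maps each relation r_i to its pattern group p_i.
    Returns (subject, relation, entity) triples for matched pairs."""
    triples = []
    for entity in entities:
        text = frag(sentence, subject, entity)
        for relation, patterns in pattern_groups.items():
            if any(pattern_match(text, p) for p in patterns):
                triples.append((subject, relation, entity))
                break  # one relation per (S, E_i) pair
    return triples
```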

Figure 3 A processing time of major tasks for text extraction

Volume 3, Issue e2. Knowledge extraction framework for building a large-scale knowledge base. Haklae Kim, Liang He and Ying Di.

Wikidata can be read and edited by humans and machines alike [5][6]. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages; Wikidata does not aim to offer a single truth about things, but instead provides statements given in a particular context.

Table 1 Selected features of web tables
(i) Table wrapper: a table wrapper embeds a visually formatted table in an HTML document. Most web tables are embedded in <table></table>, yet some appear in <ul></ul>, <li></li>, etc., and each kind of embedding has its own table structure definition. Therefore, to detect as many web tables as possible, a machine learning technique, specifically a decision tree, is employed. (ii) Feature selection: this is a crucial step for our machine-learning-based method; Table 1 describes the features we selected. (iii) Relational table classifier: for the table detection task, we train a decision tree on a set of feature vectors with true/false table labels to obtain a relational table classifier.
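The feature-vector step can be sketched as follows. The features here are simple layout heuristics standing in for the actual features of Table 1; a decision tree trained on such vectors with true/false labels would then serve as the relational table classifier:

```python
# Illustrative layout features for the relational-table classifier.
def table_features(rows):
    """rows: a candidate table as a list of rows of cell strings."""
    n_rows = len(rows)
    n_cols = max((len(r) for r in rows), default=0)
    cells = [c for r in rows for c in r]
    n_cells = len(cells) or 1
    return {
        "rows": n_rows,
        "cols": n_cols,
        # relational tables tend to have uniform row lengths
        "uniform_rows": all(len(r) == n_cols for r in rows),
        # and a high share of non-empty, shortish cells
        "nonempty_ratio": sum(1 for c in cells if c.strip()) / n_cells,
        "avg_cell_len": sum(len(c) for c in cells) / n_cells,
    }
```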

Experiments

Table extraction
To evaluate the proposed web table extraction method, we test it on 1,200 Wikipedia HTML pages containing 2,097 tables on a local Linux PC (i7-3770 CPU, 6 GB memory). The following tables show the extraction time consumed per web page and per web table:

Table 5 shows the table extraction results for each web page. The extraction task comprises table detection and triplification. Since each page may contain more than one table, we also analyse the extraction performance of each individual table and list the results separately.

Relation extraction is evaluated in the same way as entity extraction. There are four categories based on ranges of sentence length; for example, the 1-20 range has 33 cases in total, as shown in Table 8. We extract 102 entity triples and 348 relation triples, and obtain precision scores of 94.7% for entity extraction and 97.3% for relation extraction.