Constructing a Knowledge Base for Entertainment by Interlinking Multiple Data Sources

This paper describes a knowledge base for entertainment domains, including movies, music, and celebrities. We present an ontology model for representing graph-based knowledge, and describe the knowledge processing techniques used to construct an entertainment knowledge base: collection and extraction of large-scale data, and transformation of that data into knowledge. In particular, we address entity resolution and consolidation techniques for integrating entities across heterogeneous data sources. We also demonstrate a mobile application that uses this knowledge base and allows users to discover relevant entertainment content through a question answering system.


Introduction
As entertainment and multimedia content services in social and mobile environments have become popular, each service needs features that distinguish it from the others. One fundamental challenge is how to deliver the right content to users when it is needed [1] [2]. There has been a large body of research on this subject. Recently, personalisation and recommendation based on big data analytics of users' behaviours have emerged as candidates for providing better features [3]. At the same time, we also need to consider the characteristics of entertainment content and the typical usage patterns of users.
Most users tend to look for a set of relevant content across multiple domains such as movies and music. For example, when users watch a movie, they may want to listen to its original soundtrack or learn about its actors and actresses. Consider the following query: a film that is related to the musician who sings "Just love it." To obtain the right answer, a system has to investigate relationships among the required pieces of information, such as person, music and movie. The system then provides the answer: Eminem is the musician, and he appears in films such as How to Make Money Selling Drugs (2013), Something from Nothing: The Art of Rap (2012), Have Gun - Will Travel (2008), 50 Cent: Bulletproof (2005), 8 Mile (2002) and The Wash (2001).
Interlinking content entities among heterogeneous data sources is essential for cross-domain discovery and search. Although there are a huge number of applications and services for content recommendation [4], only a few applications allow users to discover content from several domains across different data sources. Thus, to enable hybrid content discovery, a knowledge base should cover content from various domains and interconnect it in a consistent manner. In this study, we describe a knowledge base for discovering cross-domain content with a single query or request. Most objects in a knowledge base are associated with other objects, and they are interconnected at a semantic level [4]. We also introduce knowledge processing techniques for constructing large-scale entertainment knowledge, and demonstrate a prototype system that uses this knowledge base. In particular, this system provides advanced features such as a question answering user interface and contextual recommendation based on semantic search.

EAI Endorsed Transactions on Industrial Networks and Intelligent Systems
Based on the knowledge base, a question answering engine is able to understand users' complex questions that span multiple domains. In summary, we make the following contributions:
• We introduce an integrated knowledge base, which is created from large-scale curated and extracted data from heterogeneous sources.
• We describe a process for constructing a knowledge base using linked data technologies, from data collection to knowledge transformation, including knowledge interlinking across multiple data sources.
• We demonstrate a mobile application, which provides effective recommendation using the knowledge base.
In Section 2, we describe related work in more detail before moving on to an overview of knowledge processing techniques (Section 3) and a mobile application system as a use case of the entertainment knowledge base (Section 4).

Related work
In the past few years, knowledge bases with a graph structure have been increasingly adopted in both commercial and public services. Freebase is a free knowledge graph with millions of general facts about real-world entities, including well-known people, places and things [5][6]. Some commercial knowledge bases, such as Google Knowledge Graph and Microsoft Bing, were powered in part by Freebase, because Freebase data was available for commercial and non-commercial use under a Creative Commons Attribution License. Freebase contained data harvested from sources such as Wikipedia, NNDB, Fashion Model Directory and MusicBrainz, as well as data contributed by its users.
Knowledge bases used in commercial services include Knowledge Vault [19], developed by Google, and Probase [20], constructed by Microsoft. Knowledge Vault is a web-scale probabilistic knowledge base that combines extractions from web content with prior knowledge derived from existing knowledge repositories. It employs supervised machine learning methods to fuse these distinct information sources, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. Probase is designed around a model of the world, and extracts knowledge from the web that fits this model.
DBpedia is one of the popular structured knowledge bases [8] [9]. It is constructed by extracting structured content from the information created as part of Wikipedia [9]. DBpedia allows users to semantically query relationships and properties associated with Wikipedia resources, including links to other related datasets.
In the area of entertainment, there are well-known data sources. The Internet Movie Database (IMDB) is an online database of information related to films, television programs, and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia and reviews [10] [11]. As of September 2015, IMDB had approximately 3.4 million titles (including episodes) and 6.7 million personalities in its database, as well as 60 million registered users.*
Although some existing knowledge bases are already connected at a semantic level using linked data technologies, domain-specific data sources are also constructed in a graph knowledge format so that they can be interlinked with existing knowledge. To achieve this, an ontological data model is essential, and knowledge processing techniques play an important role in allowing data to be collected and transformed into a graph knowledge format. Knowledge extraction from text employs the distant supervision method [13] to extract relations between recognised entities, and handles the task in two stages: the first trains patterns on seed data and a training corpus; the second applies these patterns to sentences to extract relations. Natural language processing technologies, such as named entity recognition [14], dependency parsing [15][16] [17], and co-reference resolution [18], are mandatory for this work. Finally, the extracted facts are transformed into triple format with reference to the designed ontology. Compared with this work, our solution focuses on fusing multilingual knowledge and provides a unified entry point for knowledge access.

A knowledge base for entertainment
In this section, we describe the process of constructing the knowledge base, which comprises ontology modelling, data collection and knowledge transformation of large-scale data.

Knowledge model of entertainment domain
The proposed ontology model provides comprehensive expressivity for various domains such as persons or celebrities, TV programmes, time, movies, locations and music. Each domain is described by an individual class, and these classes have semantic relations with one another. For example, the Music class is linked to the Person class to express relations such as singer, composer or producer. Figure 1 illustrates the core classes of the ontology and some of the relationships among them.
Note that this knowledge model reuses a set of existing vocabularies to leverage the connectivity of distributed data sources. For example, we do not create new classes for person, programme, movie and music. We reuse the following vocabularies in our ontology model:
• Friend of a friend ontology [24]: a machine-readable ontology describing personal profiles, their activities and their relationships.
• Music ontology [25]: concepts and properties for describing music domains (e.g. artists, albums, tracks and performances).
• Programme ontology [26]: a simple vocabulary for television programmes that covers brands, series, seasons, episodes and broadcast events and services.
• Dublin core metadata [27]: a set of metadata elements for cataloguing library items and other electronic resources.
* IMDB statistics: http://www.imdb.com/stats
Table 1 summarises the core classes and their properties. As explained, some properties are reused from existing vocabularies. Vocabulary terms that we define for our own purposes are also described; for these, the namespace is defined as "sage".

Knowledge construction framework
To construct an entertainment knowledge base, we develop a framework that handles structured and unstructured data from large-scale and heterogeneous data sources, as shown in Figure 2. The data collection component has several modules for aggregating data sources, such as a crawler for websites and a harvester for data dumps. The collected data is stored in raw data storage without any further processing.
Then, structured data is generated by the knowledge extraction module, which extracts knowledge (a set of entities, comprehensive values of each entity, and their relations with others) from multiple heterogeneous sources, including HTML pages, web tables and plain text. We conduct knowledge extraction in several different ways, as follows.
• HTML: extraction should consider the processing of unstructured plain text as well as semantic mark-up, including RDFa, Microdata and Microformats, which combines several ontologies.
• Web tables are also a valuable source for extraction. We trained a classification model based on a decision tree to detect tables. Considering that not all tables contain useful information, we developed a filter to remove useless tables, which do not contain relational knowledge.
• Wikipedia data is also extracted to enrich the knowledge base. This task recognises infoboxes in the pages, and parses them to extract facts in triple format. Furthermore, we also consider multilingual support and a real-time updating strategy for this task.
A large volume of data is extracted from various data sources using the knowledge extraction framework, and it is then transformed into a graph format using the knowledge model. For example, all extracted values are transformed into triples (i.e. subject, predicate, and object) in RDF. In particular, values extracted from tables are transformed using the data cube ontology for statistical web tables in order to obtain numerical knowledge.
After refining and extracting proper values for both entities and properties, we construct our knowledge base using the proposed knowledge model. It contains approximately 35 million entities and 850 million facts, covering six main subjects (music, movie and TV, POI, dining, sports and locations), as shown in Table 2.
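As a rough illustration of the transformation into RDF triples described above, the following Python sketch turns one extracted record into subject-predicate-object triples and serialises them as N-Triples lines. The record, the entity URI, the URI behind the `sage` prefix and all function names are hypothetical; only the Dublin Core namespace URI is a published one. It is a minimal sketch under these assumptions, not the framework's implementation.

```python
# Namespace table. The paper names the "sage" prefix but not its URI,
# so the URI below is a placeholder assumption.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",  # Dublin Core elements
    "sage": "http://example.org/sage/",        # hypothetical
}


def expand(prefixed_name):
    """Expand a prefixed name such as 'dc:title' into a full URI."""
    prefix, local = prefixed_name.split(":", 1)
    return NAMESPACES[prefix] + local


def record_to_triples(subject_uri, record):
    """Turn a flat dict of extracted values into (subject, predicate, object) triples."""
    return [(subject_uri, expand(prop), value) for prop, value in record.items()]


def to_ntriples(triples):
    """Serialise triples as N-Triples lines, treating every object as a plain literal."""
    lines = []
    for s, p, o in triples:
        literal = str(o).replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{s}> <{p}> "{literal}" .')
    return lines


# Illustrative record; the subject URI is made up for this example.
movie = {"dc:title": "8 Mile", "sage:release_date": "2002"}
triples = record_to_triples("http://example.org/movie/8-mile", movie)
ntriples = to_ntriples(triples)
for line in ntriples:
    print(line)
```

For simplicity the sketch emits every value as a literal; relations to other entities (for example a director) would instead use a resource URI as the object.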

Entity resolution and knowledge consolidation
Although the constructed knowledge is used as a fundamental source, it is also necessary to interlink it with existing data sources, both structured and semantic. In fact, various data sources can be used for commercial end-user services.

Entity resolution is the task of disambiguating manifestations of real-world entities [28]. There can be different ways of referring to the same people, movies or other objects. For example, a movie is often referred to by different titles (e.g. "Men in black" vs. "MIB"). To integrate instances with their properties (values), this task is essential for defining a unique identifier that resolves a set of entities among multiple data sources.
Figure 3 illustrates the process of entity resolution among multiple data sources. Methods that perform this task are categorised as blocking, iterative, or learning. Block-based approaches group entity descriptions that are close to each other into the same block, denoting possible matches. To accomplish that, we define mapping rules (i.e. blocking keys) according to which the descriptions are placed into blocks. To identify the same movies, we use a set of mapping properties, including dc:title, sage:release_date and sage:director, while we use the po:series and po:position properties for disambiguating TV programmes.
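The blocking idea can be sketched in Python as follows. The blocking key used here (the first three alphanumeric characters of the lowercased title plus the release year) is an assumption made for illustration; the paper does not state the exact key. The entity records, identifiers and the `source` field are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(entity):
    """Hypothetical key: first three alphanumeric title characters plus release year."""
    title = "".join(ch for ch in entity["dc:title"].lower() if ch.isalnum())
    year = entity.get("sage:release_date", "")[:4]
    return (title[:3], year)


def build_blocks(entities):
    """Place every entity description into the block named by its key."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[blocking_key(entity)].append(entity)
    return blocks


def candidate_pairs(blocks):
    """Yield only cross-source pairs that share a block; other pairs are never compared."""
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if a["source"] != b["source"]:
                yield a["id"], b["id"]


# Illustrative records from two sources.
entities = [
    {"id": "kb:1", "source": "kb", "dc:title": "Men in Black", "sage:release_date": "1997-07-02"},
    {"id": "imdb:a", "source": "imdb", "dc:title": "Men in black", "sage:release_date": "1997"},
    {"id": "imdb:b", "source": "imdb", "dc:title": "Mendy", "sage:release_date": "2003"},
    {"id": "kb:2", "source": "kb", "dc:title": "8 Mile", "sage:release_date": "2002"},
    {"id": "imdb:c", "source": "imdb", "dc:title": "8 Mile", "sage:release_date": "2002-11-08"},
]
pairs = sorted(candidate_pairs(build_blocks(entities)))
print(pairs)
```

A key built from the title alone would not place an abbreviation such as "MIB" in the same block as "Men in Black", which is one reason why additional properties such as the director are useful as mapping properties.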
For computing similarity scores between two entities, we use the Jaro-Winkler distance, for which a higher score for two entities means that the strings are more similar [29]. The Jaro similarity metric $d_j$ for two given strings $s_1$ and $s_2$ is
$$d_j = \begin{cases} 0 & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise} \end{cases}$$
where $m$ is the number of matching characters and $t$ is half the number of transpositions. Two characters from $s_1$ and $s_2$ are considered to be a match if and only if they are the same and they are at most $\lfloor \max(|s_1|, |s_2|) / 2 \rfloor - 1$ positions away from each other.
The Jaro-Winkler measure is an extension of the Jaro distance. Given two strings $s_1$ and $s_2$, their Jaro-Winkler distance $d_w$ is
$$d_w = d_j + \ell p (1 - d_j)$$
where $d_j$ is the Jaro distance for strings $s_1$ and $s_2$, $\ell$ is the length of the common prefix at the start of the strings up to a maximum of 4 characters, and $p$ is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. $p$ should not exceed 0.25, otherwise the distance can become larger than 1. The standard value for this constant in Winkler's work is 0.1.
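The Jaro and Jaro-Winkler measures can be sketched in plain Python as below. This follows the standard textbook definitions (matching window, half the number of out-of-order matches as transpositions, prefix capped at 4 characters, scaling factor 0.1); the function and parameter names are our own, and the code is not taken from the system described here.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count matched characters that appear in a different order, then halve.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro-Winkler: boost the Jaro score for a shared prefix of up to max_prefix characters."""
    d_j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return d_j + prefix * p * (1 - d_j)


print(round(jaro("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

For the pair MARTHA and MARHTA, all six characters match and one transposition occurs, so the Jaro score is (1 + 1 + 5/6) / 3 and the three-character common prefix raises the Jaro-Winkler score slightly above it.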
We compute similarity measures between entities in the knowledge base that we construct and entities in IMDB. In particular, we collect entities from IMDB: 422,916 movie entities, 415,975 TV series entities, 2,571,661 TV episode entities and 861,889 person entities. To evaluate our algorithm, we use random sampling, extracting 1,000 linkage results. The extracted matching entities achieve 98.5% accuracy. If results remain ambiguous after this step, we expand the set of compared properties, such as release date, director, brand name, name of series or date of birth and death. Note that we weight the similarity of dc:title values highly (i.e. 0.9) in order to minimise the number of candidates for the mapping rules. After filtering the candidate sets, we apply a priority-based sequential similarity computation iteratively until our predefined threshold is met. The mapping weights are defined by experts and experiments according to the different conditions. Finally, the matched entities are represented in RDF using the owl:sameAs property.
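The final step, representing matched entities with owl:sameAs, could look like the following Python sketch. The entity URIs, the similarity scores and the threshold value of 0.95 are all made up for illustration, since the paper does not give its threshold; only the owl:sameAs property URI is the published one.

```python
# Published URI of the owl:sameAs property.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(scored_pairs, threshold):
    """Emit one N-Triples owl:sameAs statement per pair whose score reaches the threshold."""
    lines = []
    for kb_uri, other_uri, score in scored_pairs:
        if score >= threshold:
            lines.append(f"<{kb_uri}> <{OWL_SAME_AS}> <{other_uri}> .")
    return lines


# Hypothetical candidate pairs with precomputed similarity scores.
scored = [
    ("http://example.org/movie/8-mile", "http://example.org/imdb/title/0001", 0.98),
    ("http://example.org/movie/the-wash", "http://example.org/imdb/title/0002", 0.71),
]
links = same_as_triples(scored, threshold=0.95)  # threshold value is an assumption
for line in links:
    print(line)
```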

Using entertainment knowledge
In this section, we introduce a mobile application that uses the entertainment knowledge base to let users discover relevant content across heterogeneous domains.
In general, content recommendation depends heavily on the metadata of individual items [1][3] and on user profiles such as preferences, demographic information or usage history [4]. However, the means for effective recommendation are often limited when the metadata available from a single data source is not sufficient [2]. As introduced in the previous section, we constructed our own knowledge base for entertainment domains, and this knowledge is further interlinked with other data sources. This is useful when searching for items relevant to a certain keyword; for example, a search can start from a primary data source and be expanded through the interlinked data when more relevant information is needed. Furthermore, when the primary data does not provide associations with other content, we can search for relevant items using the interlinked data sources. Using these interlinked data sources, we can deliver related and recommended content to users at the right time. This feature is not new in itself; the key to differentiation is the coverage of multiple domains.
In particular, this application provides a natural language interface that allows users to pose queries in everyday language. Several question-and-answer patterns are supported. For example, a user can ask the following question: "How long is the movie Avengers?" The system then returns an answer like "3 hours 15 minutes". Furthermore, this application provides a set of related content across different domains (e.g. movie, TV program and music).
Figure 4 illustrates the system architecture of our application. It comprises a client for the end-user application and server-side components for question answering and the knowledge bases. The main components are the user interface, question analysis, query construction, answer generation and result generation.
Figure 5 illustrates the mobile application with examples of a movie and a musician, respectively. As explained, a dialog interface is provided for discovering items, and question candidates are displayed by an autocomplete feature when a pattern is matched. If a question matches an answerable type, we provide the answer with details about the object: for example, in Figure 5 the user wants to know the running time of the movie "Avengers". We give the answer (i.e. 3h 15m), and also provide a description and rating of the item. In addition, a list of related items is provided simultaneously: the first panel provides a list of relevant movies and TV programs that share genres, directors, actors, and characters. We can find other titles in the same series as the selected item, and other sci-fi movies and TV programs. The second panel lists songs associated with the Avengers. The last panel shows the results of personalized recommendation. A user can configure individual preferences in the application, including demographic profile (birth date and place, and address), favorite genres and musicians, and favorite actors, genres and directors. This information is used for personalization and recommendation when an individual query is created.

Conclusion
This paper described a knowledge base for entertainment domains, which is created by integrating heterogeneous data sources. This knowledge base contains more than 18.2 million entities and 328 million relations as facts. To construct it, knowledge processing techniques were applied extensively to extract, integrate, transform and interlink entities and relations from heterogeneous data sources. Furthermore, a knowledge model for entertainment domains was developed to represent the ontological semantics of individual content and its relations to other content. This model has a highly flexible structure that can be interlinked with and extended by other ontology vocabularies, such as those for books, sports or concerts.
Figure 5. A question answering based mobile application. This application allows users to find items using natural language in English. The left panel shows an example of a question and answer about a film (i.e. the running time of Avengers), while the right panel illustrates an example of finding a person who is related to a particular piece of music. Note that this application provides relevant items such as films, music and events using the entertainment knowledge base.

Figure 1. Core classes of the entertainment knowledge model. Each class is connected to others with specific properties.
It shows some of the relationships among these classes: the Person class plays an important role in linking other classes, and the Time and Location classes can be used for representing contextual information by connecting to other classes.

Figure 2. Knowledge construction framework


Details of the main components of the application (Figure 4):
• User interface: this application allows users to discover and display entertainment content in mobile environments.
• Question analysis: a question is transformed into a dependency structure by a robust parser, and question-pattern rules are applied to extract the type (what or how) and content of the question. If no rule applies, no type is extracted.
• Query construction: selects a query template based on the results of the question analysis, and creates a SPARQL query.
• Answer generation: generates an answer sentence using answering templates and knowledge entities from the entertainment knowledge base.
• Result generation: merges the set of answers to the question with a set of recommendations. Results can be ordered for display by a simple ranking algorithm.
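The question analysis and query construction steps can be illustrated with a small Python sketch. The application described here uses a dependency parser and its own question-pattern rules; the sketch replaces both with a single regular expression as a stand-in. The rule, the `sage:runtime` property, the URI behind the `sage` prefix and the function names are hypothetical.

```python
import re

# Prefix declarations for the generated query; the sage URI is a placeholder.
PREFIXES = (
    "PREFIX dc: <http://purl.org/dc/elements/1.1/>\n"
    "PREFIX sage: <http://example.org/sage/>\n"
)

# Hypothetical question-pattern rules: (question type, pattern, SPARQL template).
RULES = [
    (
        "how",
        re.compile(r"^how long is the movie (?P<title>.+?)\??$", re.IGNORECASE),
        'SELECT ?runtime WHERE {{ ?movie dc:title "{title}" ; sage:runtime ?runtime . }}',
    ),
]


def build_query(question):
    """Return (question type, SPARQL query), or (None, None) if no rule applies."""
    text = question.strip()
    for qtype, pattern, template in RULES:
        match = pattern.match(text)
        if match:
            # Escape backslashes and quotes so the slot stays inside the string literal.
            slots = {
                name: value.replace("\\", "\\\\").replace('"', '\\"')
                for name, value in match.groupdict().items()
            }
            return qtype, PREFIXES + template.format(**slots)
    return None, None


qtype, query = build_query("How long is the movie Avengers?")
print(qtype)
print(query)
```

When no rule applies, the function returns no type and no query, which mirrors the behaviour described for question analysis above.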

Figure 3. An entity resolution process

EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, 12 2016 - 01 2017 | Volume 4 | Issue 10 | e1

Table 1. Core classes and their main properties

Table 2. Statistics of extracted data