Extracting Academic Subjects Semantic Relations Using Collocations

The paper presents an approach to analyzing the semantic content of academic subjects and its internal relations using statistically-based techniques for collocation extraction from a large electronic educational text corpus. It offers a survey and analysis of some related corpus-based approaches to extracting conceptual relations used for educational purposes and presents a technique for semantic search of collocations. The results of an extended keyword search in the British Academic Spoken English corpus using the Sketch Engine search software are presented. They are analysed with respect to the types of the keyword's generated collocations and the semantic relations which they assign.


Introduction
Recent developments in the use of Internet technologies have enlarged the range of approaches to search, with the aim of improving its effectiveness and returning results that are correctly related to the query. Internet users tend to search using various types of keywords in order to get results that are as close as possible to the searched keywords.
However, to obtain effective search results, complex techniques have been developed that combine information retrieval and semantic approaches [1]. These techniques use related data structures which are capable of dealing with complex semantic representations in order to filter and extract the required knowledge. Moreover, these technologies have multilingual application, since they use various statistical approaches and knowledge extraction.
Further, we present the results of research aimed at analyzing the semantic content of academic subjects based on the use of statistically-based keyword search techniques in academic educational electronic text corpora.

★ The research described presents results obtained during COST-STSM-IC1302-36988 "Natural Language Processing Keyword Search for Related Languages" of COST Action IC1302 "Semantic keyword-based search on structured data sources (KEYSTONE)".
* Email: vstoykova@yahoo.com

Related Approaches
Keyword generation is a widely known technique for defining semantic textual relations. It is also used to extract semantic relations between words, text coherence, etc. The tradition of applying electronic text corpora in education uses different academic texts [2] (including texts from Wikipedia [3]) and approaches extended and improved with statistically-based techniques to extract sophisticated semantic relations.
Various techniques have been successfully applied to a wide range of languages, including Chinese [4,5]. The results of existing applications have significantly improved their universality with respect to domain application and multilingual scope [6]. The related techniques are based on adopting various metrics for extracting different types of semantic relations between words, in applications that estimate a word similarity measure.
Further, we present and analyse applications of such techniques by comparing and discussing several results of extended keyword search with collocations in an educational electronic text corpus.

The Sketch Engine (SE)
The SE software [7] supports several approaches to extracting the semantic properties of words, most of which have multilingual application. Keyword extraction is a widely used technique for extracting the terms of a particular studied domain. Semantic relations can also be extracted by generating related word contexts through word concordances, which define context in quantitative terms; further work is then needed to extract semantic relations by searching for co-occurrences and collocations of the related keyword.
Co-occurrences and collocations are words which are most likely to be found with a related keyword. They assign the semantic relations between the keyword and its particular collocated word, which might be relations of similarity or of distance.
The statistical approaches used by SE to search for co-occurring and collocated words are based on defining the probability of their co-occurrences and collocations. Collocations have been regarded as statistically similar words [8] which can be extracted by using techniques for estimating the strength of association between co-occurring words. Recent developments have improved these techniques with respect to application areas including language learning [9].
Further, we present and analyse the results of extracting collocations using the SE software and compare the related results with respect to the semantic types of the collocations obtained and the related text sources.

The British Academic Spoken English (BASE) Corpus
The British Academic Spoken English (BASE) corpus is a collection of transcripts of lectures and seminars recorded at the University of Warwick and the University of Reading in the UK during the period 1998-2005. It was created to analyse English for Academic Purposes [10].
The texts included consist of 1 186 290 words and are distributed across four broad domain areas: (i) Arts and Humanities, (ii) Life and Medical Sciences, (iii) Physical Sciences and (iv) Social Studies and Sciences. The corpus is annotated according to the Text Encoding Initiative Guidelines and was recently uploaded into the SE, allowing the use of its incorporated options for storing, sampling, searching and filtering texts according to different criteria.

Keyword Search Results
For our research, we use the SE standard options to generate a keyword's concordances, distribution, collocations, and grammatical and semantic relations. We present the general methodology by demonstrating the related results for the keyword politics.
Concordances present all occurrences of a given keyword with its related quantitative contexts. Fig. 2 presents all occurrences of the keyword politics within the BASE corpus with their related contexts. The generated results show that the keyword politics has 119 occurrences but do not give information about its frequency distribution, which is also an important structural criterion. The SE has options to evaluate different types of keyword distribution. Thus, Fig. 3 shows the frequency distribution of the keyword politics over the whole BASE corpus. The results lead to the conclusion that the keyword is not distributed coherently over the whole corpus and is frequent only in certain texts. More detailed information about the frequency distribution of the keyword politics is obtained by generating the keyword distribution over its concordances.
The related results are presented in Fig. 4 and show that the distribution of the keyword is more coherent over certain concordance positions and can be detected with respect to different thematic parts of the BASE corpus. Thus, the keyword occurs mostly in texts from the domains of Social Studies and Sciences and of Arts and Humanities.
Another SE option for detecting keyword frequency distribution is the generation of the keyword's distribution over subject areas. Fig. 5 shows the distribution of the keyword politics over subject areas.
The generated results show that the keyword politics occurs in texts from subject areas such as History, Politics, Business, English Literature, etc. The results presented in Fig. 3, Fig. 4 and Fig. 5 show that the keyword politics occurs within structured texts of related specific domains, areas or subjects. However, concordance and frequency distribution do not give semantic information about the keyword's meaningful combinations. For that, semantic filtering is needed, which is achieved by extending the keyword search with collocations.
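The concordance and frequency distribution searches discussed so far can be illustrated with a minimal Python sketch. It is not the SE implementation: the function names concordance and distribution, the context width, and the toy documents with their subject labels are hypothetical and serve only to show the kind of output each search produces.

```python
from collections import Counter


def concordance(tokens, keyword, width=3):
    """Return (left context, hit, right context) for every keyword occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def distribution(docs, keyword):
    """Count keyword occurrences per metadata label (e.g. subject area)."""
    counts = Counter()
    for label, tokens in docs:
        counts[label] += sum(1 for t in tokens if t.lower() == keyword)
    return counts


# Hypothetical toy data standing in for labelled corpus texts.
toy_docs = [
    ("Politics", "the politics of gender and electoral politics".split()),
    ("Physics", "the energy of the system".split()),
]

for label, tokens in toy_docs:
    for left, hit, right in concordance(tokens, "politics", width=2):
        print(f"{label:10} {left:>15} [{hit}] {right}")

print(distribution(toy_docs, "politics"))
```

As in the discussion above, the concordance lists every hit with its context, while the distribution only shows where the hits concentrate; neither says which word combinations are meaningful.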
The SE offers several statistical approaches to generating collocations of a related keyword. However, for our analysis we restrict ourselves to those presented in Section 3. Thus, we apply the MI-score, which was already used in [11] for generating parallel bilingual collocations.
Fig. 6 shows the collocation candidates generated for the keyword politics from the BASE corpus. The results show that the most frequent words that are most likely to occur together with the keyword politics are: electoral, international, gender, etc.
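How a ranked list of collocation candidates can be produced is sketched below in Python. This is an illustration under stated assumptions, not the SE procedure: the symmetric token window, the min_freq threshold, the function name collocation_candidates and the toy token list are all hypothetical, and the MI-score is computed as log2(f_AB * N / (f_A * f_B)), the form in which it is commonly documented.

```python
import math
from collections import Counter


def collocation_candidates(tokens, keyword, window=3, min_freq=1):
    """Rank words co-occurring with keyword inside a +/- window by MI-score."""
    tokens = [t.lower() for t in tokens]
    n = len(tokens)                      # N: corpus size
    freq = Counter(tokens)               # f_A and f_B come from here
    co = Counter()                       # f_AB: co-occurrence counts
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                co[tokens[j]] += 1
    f_a = freq[keyword]
    scored = []
    for word, f_ab in co.items():
        if f_ab < min_freq:
            continue
        mi = math.log2(f_ab * n / (f_a * freq[word]))
        scored.append((word, f_ab, mi))
    # Highest MI-score first.
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical toy token stream; real input would be the corpus text.
toy = "electoral politics and gender politics and the and the the".split()
for word, f_ab, mi in collocation_candidates(toy, "politics", window=1):
    print(f"{word:10} f_AB={f_ab}  MI={mi:.2f}")
```

In the toy data the function word "and" co-occurs with the keyword more often than "electoral" or "gender" but receives a lower MI-score, because it is also frequent elsewhere; this is the filtering effect that makes the score useful for isolating meaningful combinations.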
The SE allows a more elaborate keyword search over structured data to extract both grammatical and semantic relations. The related techniques are based on the idea that word association measures extract not only collocations but also other types of associations, between a lexical unit and a grammatical word or between two semantically related words (hyponyms or hypernyms). Thus, we use the SE's word sketch option to generate the most frequent grammatical relations of the keyword politics. Fig. 7 shows the generated results, which include the following relations of the keyword together with their frequent collocations: modifier, pp-obj-of, and/or, obj-of, etc. Thus, the SE keyword search can not only extract statistically similar words for building thesauri but also define their semantic relations.
Generally, the search is performed over structured data and gives results with respect to the related structures. For example, concordance search gives all occurrences of the keyword with their related contexts within the whole corpus. Distributional frequency search gives the distribution of the keyword within the whole corpus, within the domain sub-corpora or within the subject areas. The collocation candidates search returns a list of words which are most likely to be found with the related keyword. The results include both attributive collocations like electoral politics and specialized collocations like international politics.
The semantic relations search takes into account the keyword's grammatical features and returns all possible semantic relations of the keyword and its related collocations. Consequently, the different types of collocation search generate the keyword's semantic profiles, which describe both semantic and grammatical features.

Conclusion
The approach presents search and retrieval over an educational electronic text corpus, using the SE statistically-based techniques for extending the keyword search to evaluate frequency distribution, collocation candidates, and grammatical and semantic relations.
The analyzed keyword search results show that, by using different types of extended search, it is possible to capture the keyword's constraints (lexical, grammatical, syntactic or semantic) which govern the selection of word combinations in the related semantic context. In that way, it is possible to extract the keyword's general semantic relations. The precision of statistical filtering is used to extract, rank and isolate semantically related keyword combinations and to obtain results that are more semantically relevant to the keyword.

Figure 1. The formulas of Sketch Engine's statistical scoring.
We use the T-score, MI-score and MI3-score techniques for corpora processing and searching. For all of them, the following terms are used: N - corpus size, f_A - number of occurrences of the keyword in the whole corpus (the size of the concordance), f_B - number of occurrences of the collocated keyword in the whole corpus, f_AB - number of occurrences of the collocate in the concordance (number of co-occurrences). The related formulas defining the T-score, MI-score and MI3-score are presented in Fig. 1. The T-score, MI-score and MI3-score are applicable to processing multilingual parallel corpora as well.
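Since Fig. 1 is not reproduced in the text, the following Python sketch shows the three scores in the form in which they are commonly documented for the Sketch Engine, using the terms N, f_A, f_B and f_AB defined above. The function names are hypothetical, and the values of f_B and f_AB in the usage example are invented for illustration; only N and f_A are taken from the figures reported for the BASE corpus and the keyword politics.

```python
import math


def t_score(f_ab, f_a, f_b, n):
    """T-score: (f_AB - f_A * f_B / N) / sqrt(f_AB)."""
    return (f_ab - f_a * f_b / n) / math.sqrt(f_ab)


def mi_score(f_ab, f_a, f_b, n):
    """MI-score: log2(f_AB * N / (f_A * f_B))."""
    return math.log2(f_ab * n / (f_a * f_b))


def mi3_score(f_ab, f_a, f_b, n):
    """MI3-score: log2(f_AB^3 * N / (f_A * f_B))."""
    return math.log2(f_ab ** 3 * n / (f_a * f_b))


# N and f_A as reported for the BASE corpus and the keyword "politics";
# f_B and f_AB are hypothetical values chosen only for illustration.
N, F_A = 1_186_290, 119
F_B, F_AB = 40, 6

print("T-score:  ", round(t_score(F_AB, F_A, F_B, N), 3))
print("MI-score: ", round(mi_score(F_AB, F_A, F_B, N), 3))
print("MI3-score:", round(mi3_score(F_AB, F_A, F_B, N), 3))
```

The MI3-score differs from the MI-score only by the term 2 * log2(f_AB), so it gives more weight to frequent co-occurrences, whereas the plain MI-score tends to favour rare but exclusive word pairs.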

Figure 2. The concordance of keyword politics from BASE corpus.

Figure 3. The frequency distribution of keyword politics over BASE corpus.

Figure 4. The frequency distribution of keyword politics over concordance position.

Figure 5. The frequency distribution of keyword politics over subject area.

Figure 6. The collocation candidates of keyword politics from BASE corpus.

Figure 7. The grammatical relations of keyword politics from BASE corpus.