From web to SMS: A text summarization of Wikipedia pages with character limitation

Wikipedia is one of the main sources of information on the Web. But accessing this content may be difficult, especially when using a basic telephone without browsing capability on a GSM-only network. The only means of text-based communication remains SMS. Due to the character limit, a Wikipedia page cannot always be sent through SMS. This work raises the issue of text summarization with character limitation. To solve this issue, two extractive approaches have been combined: the LSA and TextRank algorithms. Generated summaries have been evaluated using ROUGE metrics. Since ROUGE metrics do not consider character limitation, a new threshold named Threshold of Acceptability for Character-Oriented Summaries (TACOS) has been proposed to interpret the ROUGE scores. The evaluation showed the relevance of the approach for pages of at most 2000 characters. The system has been tested using the SMS simulator of RapidSMS without a GSM gateway to simulate deployment in a real environment. To the best of our knowledge, this is the first work tackling the text summarization problem with character limitation.


Introduction
Wikipedia is an encyclopedia hosted by Wikimedia and is considered one of the main sources of information on the Web. Its freely available web content can be used offline in remote regions that experience teacher shortages, such as rural areas in sub-Saharan Africa. Due to the lack of reliable power infrastructure, these areas are experiencing a high penetration rate of mobile devices; however, most of those devices are basic phones with no browsing capabilities [1]. The only way to send and receive text is through SMS, owing to the good GSM network coverage [2]. But because of its length, a complete Wikipedia webpage cannot always be sent through SMS. The web page should therefore be summarized.
Summarizing involves condensing the most important information from a document (or multiple documents) to produce an abbreviated version [3]. An automatic summary is thus a text resulting from the reduction by a computer (or any computing system) of one (or several) texts that conveys the same idea as the original ones.
Several systems have attempted to automatically summarize Wikipedia pages. Almost all of these employ an extractive approach and try to extract relevant sentences from the content of the page [4]. Most of those systems produce summaries that are generally one quarter the length of the original text. Although some of them impose word limits, none of them deals with a limit expressed in characters. If the generated summary contains more than 455 characters (a limitation depending on the mobile carrier), it will be transformed into an MMS, which requires adapted devices and incurs a cost. Summarizing a webpage into an SMS is therefore a challenge, since there is no fixed relationship between the number of sentences or words and the number of characters.
The current work raises the issue of character-limited text summarization. To solve this problem, a combination of two existing extractive summarization approaches has been used: the LSA and TextRank algorithms. To the best of our knowledge, this is the first work tackling the text summarization problem with character limitation.

EAI Endorsed Transactions on Creative Technologies, Research Article, 04 2020 - 06 2020 | Volume 7 | Issue 24 | e5
The rest of the paper is organized as follows. Section 2 presents related work on text summarization using Wikipedia. The proposed approach for text summarization is presented in Section 3, followed by the evaluation of the generated summaries in Section 4. Section 5 presents test results using the SMS simulator of RapidSMS. The paper ends with conclusions and possible directions for future developments.

Automatic text summarization approaches
Text summarization approaches can be classified as presented in Figure 1, adapted from [5]. There are two major approaches to creating a summary from a document: by extraction or by abstraction. Summarizing a text by extraction means extracting portions of the original text to form the summary [6]. Generally, the sentence is used as the basic unit, and the challenge in this area remains the development of effective and simple techniques for identifying important passages in a text. In contrast, summarizing a text by abstraction involves reducing the length of the text by paraphrasing it while retaining the original idea [7]. These approaches use ontological information, extraction and fusion of information, as well as compression. Usually, any summarization approach that does not use extraction is considered an abstractive approach.
When the summary is generated from only one source document, we speak of a single-document summary; otherwise, it is a multi-document summary. In addition, the system may use knowledge-rich techniques to produce the summary. In this case, the system makes use of lexical resources such as VerbOcean [8] or external resources such as WordNet [9]. The summarization process may be subject to constraints, such as containing the information requested by the query in the case of query-focused summarization. In update summarization, the aim is to generate an updated version of the summary by identifying new pieces of information in more recent articles. This extension supposes that the user has already read the previous versions of the summary. Finally, in guided summarization, the summarization approach is guided by a set of aspects that should be covered in a summary.
Recent surveys about automatic text summarization can be found in [10,11].

Wikipedia text summarization
Several works attempted to summarize Wikipedia pages. Hatipoglu and Omurca [12] have developed a mobile application for the automatic summarization of Wikipedia articles in Turkish. The system uses the extractive approach based on the structural characteristics of the Turkish language and on the semantic characteristics of the sentences. First, they score sentences from structural features such as the position of the sentence in the text, the number of words in the sentence, and the number of words from the title in the sentence. Then, a semantic analysis using the LSA (Latent Semantic Analysis) algorithm is performed before the extraction of the final summary.
Ajmera [13] built an extraction-based Wikipedia summarizer in Python using the Django web framework. He extracted Wikipedia pages' content using urllib2†, an extensible library for opening URLs. Sentence position and word similarity were used as features. Term Frequency-Inverse Document Frequency (TF-IDF) is used to score words, and the Google Search Result Count to score sections.
Hingu et al. [14] implemented two methods for summarizing Wikipedia pages. Their methods, which extend traditional approaches, provide new features based on the citations present in the document to score sentences. In their methods, the frequency of words is adjusted according to the root form of the word: words are stemmed so as to assign equal weight to words sharing the same root. The length of the summary, as a percentage of the original text, can be provided by the user.
Although those previous works are relevant, their generated summaries were not designed for constrained devices such as basic phones with a limited number of characters.
One of the first works summarizing Wikipedia pages for basic phones dates back to Ramanathan et al. [15]. They designed a proxy-based approach for extractive summarization. Their approach, inspired by [16], uses the Wikipedia corpus to find the document topic. They indexed the whole Wikipedia corpus using the Lucene engine‡. But the generated summary is limited to 100 words, which may exceed the maximum number of characters an SMS can contain.

General idea
Our approach is based on sentence scoring methods and is composed of five steps: text retrieval from Wikipedia, text pre-processing, sentence scoring, sentence selection, and summary generation. The approach is described in the flowchart in Figure 2.
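The five steps above can be sketched end to end as follows. This is a minimal illustrative pipeline, not the authors' implementation: the naive sentence split and the length-based placeholder scorer merely stand in for the real pre-processing and the LSA/TextRank scoring described in the next sections.

```python
def summarize_page(text: str, max_chars: int = 455) -> str:
    """Sketch of the pipeline: tokenize, score, select under a character budget."""
    # Step 2 (pre-processing), here reduced to a naive sentence split.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    # Step 3 (scoring): placeholder ranking by sentence length,
    # standing in for LSA/TextRank scores.
    ranked = sorted(sentences, key=len, reverse=True)
    # Steps 4-5: greedily add top sentences while the SMS budget holds.
    summary, used = [], 0
    for sent in ranked:
        cost = len(sent) + (1 if summary else 0)  # account for joining space
        if used + cost <= max_chars:
            summary.append(sent)
            used += cost
    return " ".join(summary)
```

The key design point the sketch preserves is that the budget is checked in characters, not in words or sentences, which is the paper's central constraint.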

Text retrieval
A Wikipedia page is generally divided into four parts: the top of the page, the left sidebar, the body, and the page's footer. The body is the part that contains the relevant information. It is subdivided into several parts, including a title, an introductory summary, a table of contents, and the content itself. The first step is to retrieve the text from the requested Wikipedia page. For this, we made use of the Wikipedia tool. Given a title (name, word, group of words), the latter retrieves all the text of the corresponding Wikipedia page if it exists. Then, all sections are removed except "content". Afterward, markers such as section and subsection headings are eliminated. In the end, only relevant text about the page's topic remains. If the searched page does not exist, an error message is sent indicating that the requested article does not exist.

† https://docs.python.org/2/library/urllib2.html
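The marker-elimination step can be sketched with a regular expression; the pattern below assumes MediaWiki-style headings (`== Section ==`) in the retrieved plain text and is an illustration, not the authors' exact code.

```python
import re

def strip_markers(text: str) -> str:
    """Remove section/subsection heading markers and collapse blank lines."""
    text = re.sub(r"={2,6}[^=\n]+={2,6}", "", text)  # drop "== Heading ==" lines
    return re.sub(r"\n{2,}", "\n", text).strip()      # collapse leftover blanks

sample = "Water is vital.\n\n== Properties ==\nIt is a polar molecule."
print(strip_markers(sample))
```

After this step only the running prose about the page's topic remains, ready for pre-processing.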

Text preprocessing
The retrieved text is pre-processed before summarizing: it is tokenized into sentences and then into words. Sentence tokenization is the process of splitting a paragraph into a list of sentences, while word tokenization is the process of splitting a sentence into a list of words. Words are lemmatized, i.e. reduced to their basic form. Then unnecessary and special characters are removed, as well as stop words.
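A minimal sketch of this pre-processing step is shown below. The tiny stop-word list and the absence of a real lemmatizer are simplifications for illustration; the paper does not name the specific toolkit used.

```python
import re

# Illustrative stop-word list; a real system would use a full one.
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to"}

def sentence_tokenize(text):
    """Split a paragraph into sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_tokenize(sentence):
    """Split a sentence into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def preprocess(text):
    """Tokenize into sentences, then words, dropping stop words."""
    return [[w for w in word_tokenize(s) if w not in STOP_WORDS]
            for s in sentence_tokenize(text)]

print(preprocess("Water is vital. It covers most of the Earth."))
```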

Sentence scoring
In general, sentence scoring encompasses three approaches: word scoring, which assigns scores to the most important words; sentence scoring, which examines sentence features such as its position in the document or its similarity to the title; and graph scoring, which analyzes the relationships between sentences. Two approaches are used in this work to score sentences: LSA and TextRank.

‡ https://lucene.apache.org/

LSA (Latent Semantic Analysis) algorithm
Inspired by the latent semantic indexing introduced by Dumais et al. [17] and improved later in [18], the LSA algorithm for text summarization was first developed by Gong et al. in [19]. In this paper, we consider the more recent variant provided in [20]. The LSA algorithm uses the co-occurrence of words to derive an implicit representation of the semantics of the text. The construction of the representation begins with the filling of a matrix A of size m × n, with one word per row (m words) and one sentence per column (n sentences). The entry a_ij of the matrix corresponds to the weight of word i in sentence j. The matrix is generally sparse, since sentences usually contain different words. If sentence j does not contain word i, the corresponding weight a_ij in matrix A is set to zero; otherwise, the weight is set to the product of a local and a global weight, L_ij × G_i. Afterward, Singular Value Decomposition is applied to A to obtain the product of three matrices: A = U Σ V^T. U is an m × n matrix, and each column of U can be interpreted as a subject, meaning a specific combination of words of the input, with the weight of each word in the subject given by a real number. The matrix Σ is a diagonal matrix of size n × n. The single entry σ_i in row i of the matrix corresponds to the weight of the subject given by the i-th column of U. The weights are sorted in decreasing order, so that weight σ_i is greater than or equal to weight σ_j if i < j.

Subjects with low weight can be ignored by removing the last columns from U, the last rows and columns from Σ, and the last rows from V^T. This procedure is called dimensionality reduction. Matrix V^T is a new representation of the sentences, with one sentence per column, expressed in terms of the subjects given in U. The matrix D = Σ V^T combines the subject weights and the sentence representation to indicate to what extent each sentence conveys each subject, with d_ij indicating the weight of subject i in sentence j. Then one sentence is selected for each of the most important subjects: a dimensionality reduction is performed, retaining as many subjects as the predefined number of sentences, and the sentence with the highest weight for each retained subject is selected to form the summary. The LSA algorithm provided in [20] is given in Appendix 1.
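The LSA selection step can be sketched with NumPy's SVD. Binary term weights are used here instead of the L_ij × G_i weighting for brevity, and `lsa_select` is an illustrative helper name, not the algorithm of [20] (which is given in Appendix 1).

```python
import numpy as np

def lsa_select(sentences, k=2):
    """Pick the index of the best sentence for each of the top-k subjects."""
    vocab = sorted({w for s in sentences for w in s})
    # Term-by-sentence matrix A (binary weights for simplicity).
    A = np.array([[1.0 if w in s else 0.0 for s in sentences] for w in vocab])
    # A = U diag(sigma) Vt; rows of Vt describe sentences in subject space.
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    chosen = []
    for topic in range(min(k, Vt.shape[0])):
        j = int(np.argmax(np.abs(Vt[topic])))  # strongest sentence for subject
        if j not in chosen:
            chosen.append(j)
    return chosen

docs = [["water", "vital"], ["water", "covers", "earth"], ["cats", "sleep"]]
print(lsa_select(docs, k=2))
```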

TextRank algorithm
TextRank was introduced by Mihalcea and Tarau [21]. It internally uses the popular PageRank algorithm, which is used by Google for ranking websites and pages by measuring their importance. PageRank [22] is one of the most popular ranking algorithms and was designed as a method for web link analysis. PageRank considers the influence of incoming and outgoing links in one single model, in order to produce a single set of scores. The approach considers a directed graph G = (V, E), with V representing the set of pages (vertices) and E representing the set of links (edges). In this paper, we consider the customized version of TextRank proposed in [23]. For extractive summarization, the TextRank algorithm uses sentences as the vertices of the graph. Since multiple links may exist between these vertices, in [23] the author modified the original PageRank algorithm to include a weight coefficient w_ij on the edge connecting two vertices V_i and V_j, such that w_ij indicates the strength of the connection between the vertices. The function for computing the TextRank score of a vertex is given in (1).

WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × WS(V_j)    (1)
The pseudocode of the algorithm in [23] is provided in Appendix 2.
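A compact sketch of the weighted iteration in equation (1) follows. The word-overlap edge weight is a common choice but an assumption here, and the function name is illustrative; the actual pseudocode is the one in Appendix 2.

```python
def textrank(sentences, d=0.85, iters=30):
    """Weighted PageRank over sentence vertices; sentences are word lists."""
    n = len(sentences)
    # Edge weight w[i][j]: word overlap between sentences i and j.
    w = [[float(len(set(a) & set(b))) if i != j else 0.0
          for j, b in enumerate(sentences)] for i, a in enumerate(sentences)]
    out_sum = [sum(row) or 1.0 for row in w]  # guard isolated vertices
    scores = [1.0] * n
    for _ in range(iters):
        # WS(Vi) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS(Vj)
        scores = [(1 - d) + d * sum(w[j][i] / out_sum[j] * scores[j]
                                    for j in range(n))
                  for i in range(n)]
    return scores
```

Sentences sharing more words reinforce each other's scores, so well-connected sentences rank highest, mirroring how PageRank favors well-linked pages.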

Sentence selection
The six top sentences from both LSA and TextRank are temporarily selected. Their scores are compared, and the sentences with the highest ranks are added to the summary as long as the character limit is not exceeded.
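The selection step can be sketched as below; the tie-breaking rule (taking the maximum of the two scores) and the helper names are illustrative assumptions, since the paper does not specify how the two rankings are compared.

```python
def select_sentences(sentences, lsa_scores, tr_scores, max_chars=455, top_k=6):
    """Merge two rankings and greedily fill the character budget."""
    # Rank sentence indices by their best score from either method.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: max(lsa_scores[i], tr_scores[i]),
                    reverse=True)
    pool = ranked[:top_k]                  # six best candidates per the paper
    chosen, used = [], 0
    for i in pool:
        cost = len(sentences[i]) + (1 if chosen else 0)  # joining space
        if used + cost <= max_chars:
            chosen.append(i)
            used += cost
    # Restore document order for readability.
    return " ".join(sentences[i] for i in sorted(chosen))
```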

Summary generation
The complete proposed algorithm to generate summaries is given in Algorithm 1.

Benchmark
For the evaluation, we used four datasets. The first contains articles with fewer than 500 characters. The second contains articles with a length between 500 and 1000 characters, the third articles between 1000 and 2000 characters, and the last articles with over 2000 characters. The Wikipedia summary section was used as the reference summary.

Metrics
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric is a set of metrics used for evaluating automatic summarization [24]. The metrics compare an automatically produced summary (candidate summary) against a reference or a set of reference summaries (that are human produced). The measure is done by counting the number of matching words between a candidate summary and the reference summary. To test our system, we used two different variants of ROUGE: ROUGE-N and ROUGE-SU.
ROUGE-N (with N between 1 and 9): the summaries' texts are divided into sequences of N contiguous words (N-grams). Let N-gramTotal be the number of N-grams co-occurring in both the candidate summary and the set of reference summaries, and N-gramRef the number of N-grams in the set of reference summaries; the ROUGE-N score is computed following (2).

ROUGE-N = N-gramTotal / N-gramRef    (2)
ROUGE-SU (ROUGE Skip-bigram plus Unigram): an extension of ROUGE-S (Skip-bigram), which measures the overlap of skip-bigrams between a candidate summary and a set of reference summaries. A skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps [25]. In addition to skip-bigrams, ROUGE-SU also counts unigrams, so as not to assign a score of 0 to a sentence just because it shares no skip-bigram with the reference when it nevertheless has common unigrams.
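The ROUGE-N computation of equation (2) can be sketched as follows; this is a minimal recall-oriented version over tokenized word lists, not the official ROUGE package.

```python
def ngrams(words, n):
    """All contiguous word sequences of length n."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n(candidate, reference, n=1):
    """Matched reference n-grams over total reference n-grams (recall)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not ref:
        return 0.0
    ref_counts = {}
    for g in ref:
        ref_counts[g] = ref_counts.get(g, 0) + 1
    match = 0
    for g in cand:
        if ref_counts.get(g, 0) > 0:  # clip matches to reference counts
            ref_counts[g] -= 1
            match += 1
    return match / len(ref)

print(rouge_n("the cat sat".split(), "the cat ran".split(), n=1))  # 2 of 3 unigrams match
```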
As ROUGE metrics are word-oriented, to appreciate the quality of character-limited summaries we defined a new metric named Threshold of Acceptability for Character-Oriented Summaries (TACOS), which allows us to set the threshold of acceptability for character-limitation summaries. Let Ss be the system summary and Sr the reference summary; TACOS is defined by (3).

TACOS(Ss, Sr) = 1                          if length(Sr) < length(Ss)
TACOS(Ss, Sr) = length(Ss) / length(Sr)    otherwise    (3)

A system summary Ss is considered relatively good according to a metric X if X(Ss) ≥ TACOS(Ss, Sr). TACOS can be seen as the expected proportion of Sr contained in Ss. If Ss is longer than Sr, the threshold is 1, meaning that Sr should be entirely contained in Ss for the summary to be considered relatively good. Conversely, if Ss is shorter than Sr and a metric yields a result at least equal to the ratio length(Ss)/length(Sr), then the summary is considered relatively good according to that metric.
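Equation (3) translates directly into code; the function name is ours, but the piecewise definition follows the formula above.

```python
def tacos(system_summary: str, reference_summary: str) -> float:
    """Threshold of Acceptability for Character-Oriented Summaries, eq. (3)."""
    if len(reference_summary) < len(system_summary):
        return 1.0  # the reference should fit entirely inside the system summary
    return len(system_summary) / len(reference_summary)
```

A metric score X is then compared against this threshold: the summary is relatively good under X whenever X ≥ tacos(Ss, Sr).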

Results of the evaluation
The results of the evaluation are presented in Tables 1 to 4. From Table 1, related to the dataset of articles with fewer than 500 characters, we observe that all metrics give a result greater than the TACOS threshold. This means that the system provides relatively good results, irrespective of the metric used for the evaluation.

From Table 2, related to the second dataset, we observe that in the first four columns all evaluations give a result of one: the entire reference summary is included in the system summary. In addition, in column nine (Mofu), even though the reference summary is longer than the system summary, the evaluations give a result of one. This means that the system summary is a subset of the reference summary and all the bigrams of the reference summary appear among the bigrams of the system summary. Here, we have seven relatively good results for all the metrics.

According to Table 3, related to the third dataset, we also have seven relatively good results for all the metrics. From Table 4, related to the last dataset, we note that there are only three relatively good results for ROUGE-1 and two for both ROUGE-2 and ROUGE-SU4.

According to Figure 3, all curves coincide because the purpose of the system is to generate summaries whose lengths are as close as possible to 455 characters. Since the pages are shorter than 500 characters, the system returns the entire article, which is why the result of the evaluation is 1. Figure 4 shows the results of the evaluation on the second dataset. We find that the smaller the article, the worse the result of the evaluation. This is due to the disproportion between the size of the generated summary and that of the reference summary.

From Figures 5 and 6, it appears that the generated summary is affected not only by the size of the reference summary but also by that of the article.
In light of these observations, we can state that the longer the reference summary, the worse the results for all ROUGE metrics. In other words, the ROUGE metrics are not suitable for short generated summaries paired with long reference summaries. How to evaluate character-based summaries of long texts remains to be investigated.

RapidSMS Framework
RapidSMS is an open-source framework for application development using the Short Message Service (SMS). Its development seems to have been motivated by the continuously increasing penetration rate of GSM technologies worldwide. Through its web interface, users can log in and access the system to view data as they arrive; they can also send SMS. RapidSMS is written in Python and integrates with Django, a web development platform also written in Python. According to the UNICEF website, RapidSMS is designed to work on small hardware configurations and requires at least a GSM modem and a SIM card. The complete architecture of RapidSMS is provided in Figure 5.

Results of testing
To get a summary of a Wikipedia page, the user sends an SMS to a short number with the following syntax: "wikisum Key_word". Figure 8 presents some requests with their corresponding summaries.
Request: "wikisum dog"
Request: "wikisum water"
Corresponding summary: "The latest dietary reference intake report by the United States National Research Council in general recommended, based on the median total water intake from US survey data (including food sources): 3.7 liters for men and 2.7 liters of water total for women, noting that water contained in food provided approximately 19% of total water intake in the survey." (359 characters including spaces)

Conclusions and perspectives
The main objective of this work was to summarize Wikipedia pages into a maximum of three SMS (455 characters). To do this, we proposed an approach combining two extractive techniques, namely the LSA and TextRank algorithms. Generated summaries have been evaluated using ROUGE metrics. Since those metrics were developed for summarization approaches using words as units, their results cannot be directly interpreted for character-based summarization. A new metric called Threshold of Acceptability for Character-Oriented Summaries (TACOS) has therefore been introduced to provide a relative appreciation of the quality of summaries. The complete system has been tested using RapidSMS, allowing a user to send a request and receive the corresponding summary of the Wikipedia page on a mobile phone.
Although TACOS provides a threshold for appreciating the quality of a summary, it would be of interest to elaborate this metric further. One approach could be to find a relationship between the characters of the system summary and those of the reference summary. In addition, to better guide the summarization process, the system could also make use of a profiling process based on the user's previous requests.