Extracting Actionable Knowledge from Domestic Violence Discourses on Social Media

Domestic Violence (DV) is a major social issue, and there is a strong relationship between DV and public health. Existing research has used social media to track and analyse real-world events such as emerging trends, natural disasters, user sentiment, political opinions, and health care. However, less attention has been given to social welfare issues such as DV and its impact on public health. Recently, victims of DV have turned to social media platforms to express their feelings in posts and to seek social and emotional support, sympathetic encouragement, compassion, and empathy from the public. It is difficult, however, to mine actionable knowledge from large conversational social media datasets because the data are high-dimensional, short, noisy, voluminous, and fast-arriving. Hence, this paper proposes a novel framework to model and discover the various themes related to DV in the public domain. The proposed framework could provide valuable information to public health researchers, national family health organizations, government, and the public through data enrichment and consolidation, with the aim of improving the social welfare of the community. It thereby provides actionable knowledge by monitoring and analysing continuous, rich user-generated content.


Introduction
Domestic Violence (DV) against women [1] is not a new phenomenon, and it has serious impacts on women's physical and mental health and well-being. Existing statistical surveys and reports [2] reveal that violence against women is a global public health issue that affects approximately one in three women worldwide. The findings of existing statistical content analyses and surveys highlight the following research gaps [3], which are addressed in this paper.
• First, existing surveys always rely on a sample of the population and require manual annotation and considerable human effort, which is labour-intensive, time-consuming, and expensive.
• Next, this kind of research requires additional safety measures, as women are reluctant to disclose their sufferings, and the data must be collected in a private space and in the absence of male partners. Interviewers must also be trained properly, and only one woman per household should be interviewed to prevent the survey content from being shared with others.
• Finally, existing studies on this issue are constrained by limited data availability and report only a selected number of health outcomes, which makes it difficult to determine the level of co-morbidity, emotions, opinions, and feelings. Women suffering from DV are likely to suffer not only from mental disorders but also physically, emotionally, and psychologically.
• Hence, there is a need to develop a novel methodology, other than statistical surveys, to analyse the various themes related to DV.

* Corresponding author. Email: sudha.subramani1@live.vu.edu.au

Research Aim
The amount of data generated over social media has grown massively in recent years. Social media gives users freedom of thought and provides a platform to share their expressions, opinions, and random details of their lives, without restriction, in the form of posts and status updates. These posts play a predominant role in analysing public opinion by aggregating the thoughts of millions of users. Although each tweet or post carries little informational value, the consolidation of millions of messages can generate actionable knowledge and provide valuable insights into public opinion in general.
Several studies in different domains (news events, natural disasters, user sentiment analysis, political opinions) have explored Twitter, a rich source of public opinion with over half a billion user-generated messages posted every day, in order to extract useful information. However, less attention has been directed toward social welfare topics and health care in social media than toward topics such as product marketing, customer sentiment analysis, and politics. In recent years, this has become an active research area that has drawn considerable attention from the research community, both for information retrieval and for discovering the abstract topics that underlie large microblogging streams. Information retrieval from large streams of microblogs is challenging because of the following characteristics:
• Immense volume, fast data arrival rate, and short message length.
• Large number of spelling and grammatical errors.
• Use of informal and mixed language.
• Microblog texts relate to a wide range of real-world events. Some of them carry interesting and relevant information, whereas others do not.

Contributions
This section first discusses the academic contribution of the methodologies used in this paper. It then explains certain practical benefits for the public health sector.

Contribution to Knowledge (Academic Contribution)
The contribution of this paper is to take the plethora of information generated daily in the information-explosion era of social media and to gain in-depth insights through knowledge discovery. This work harvests DV-related data, identifies hidden patterns, generates models, and finally uses that data to improve the quality of care. Data are collected from Twitter's data warehouse, and large computations are run over the Twitter dataset easily and efficiently to turn that data into something actionable. Existing topic models have focused only on term-based approaches to detect topics, and they have difficulties with interpretation and scalability when understanding, organizing, and exploring quality topics from a microblogging data stream. Hence, this paper uses a pattern mining technique to identify patterns, then applies the MapReduce architecture to the mined patterns, and finally clusters the terms that are most coherent with each topic. The aim of this paper is to focus on social media and provide appropriate data analytic methods in order to improve classification accuracy and the interpretability of the topic clusters.

Statement of Significance (Practical Contribution)
Recently, victims of DV have been turning to social media to share their experiences and seek social support, which has two realms: informational and emotional support. Not only victims but also friends, family members, caregivers, family welfare organizations, psychiatrists, and police are increasingly sharing their opinions, feelings, and regrets about such incidents to create social awareness.
To the best of our knowledge, this would be the first work to use social media to analyse public health informatics and to mine latent topics and internal semantic structures on the topic of DV. These comprehensive findings would mark an important milestone, not only in research on DV against women but also in global public health in general. The work supports people with data enrichment and consolidation on a specific topic, and the entire community can benefit from the valuable information extracted from the large data.
The public health sector plays a significant role by considering, through such evidence-based studies, the serious health risks faced by women and their families. This promotes and strengthens health care systems for women who have experienced violence. This report not only provides the first such summary of acts of DV against women, but would also mark a significant advancement for women's health and rights. Let this report serve as a unified call to action for those working for a world without violence against women.

Literature Survey
With the increasing popularity of social media, several research studies have focused on social media to analyse and predict real-world activities such as user sentiment [4], opinions on political campaigns [5] [6], natural disasters [7], epidemic surveillance [8], event detection [9], and e-healthcare services [70], even as the increasing sophistication of social media data poses substantial threats to users' privacy. In contrast to statistical surveys, Twitter has become a reliable data source where opinions and thoughts are shared online immediately through status updates. Hence, it is an efficient platform for researchers to detect real-world events for information retrieval and decision making. Sentiment analysis was previously studied on sources such as blogs and forums and is now also applied to social media [11]. Liu et al. [12] performed sentiment analysis by extracting comments on the attributes and features of a particular aspect and categorizing the comments as positive, negative, or neutral. The aspect can be a product, event, person, or topic.
Opinion mining on political events shows that tweets are effective in describing opinions about an event. O'Connor et al. [6] and Tumasjan et al. [7] found that the sentiment of tweets correlated with voters' political preferences and closely aligned with election results. Public tweets have played a major role not only in politics but also in economics. Bollen et al. [13] [14] found that tweet sentiment is correlated with stock trends and can be used to predict them. Bruns et al. [15], Burns et al. [16], and Gaffney et al. [17] observed that Twitter is a powerful tool for gathering public opinion and creating social change.
Sakaki et al. [7] investigated tweets during natural disasters and showed that they can be used to detect earthquakes and send warning alerts to society. They treated each Twitter user in Japan as a mobile sensor and computed the probability of an earthquake using the time and geolocation information of the user. Posting time and volume were modelled as an exponential distribution, and earthquake locations were estimated using Kalman and particle filters. Their research further showed that an earthquake can be sensed earlier than the official broadcast.
Culotta et al. [18] [19] analysed Twitter to detect influenza outbreaks, improving speed and reducing cost relative to traditional methods. User data such as gender, age, and location can provide more descriptive demographic insights than search queries. They detected influenza using multiple regression models, and Quincey et al. [20] detected swine flu from Twitter using predefined keywords and a term co-occurrence method. These methods search tweets for the keywords and detect anomalous changes when message traffic related to the given keywords rises rapidly. The advantage of such methods is that they collect more focused information from the Twitter stream.
Conventional methods [32][34] of event detection have been addressed in Topic Detection and Tracking [33], which deals with traditional media sources such as news articles and academic papers to find real-world events. Event detection is difficult to perform on Twitter because tweets are noisy and short. Bernstein et al. [35] proposed an interactive topic browsing system that groups tweets into topics and uses a tag cloud for visualization. Sankaranarayanan et al. [36] developed TwitterStand, which lets users browse news based on geographic preference. It extracts feature sets and performs online clustering based on term frequencies to find topics. TwitInfo [37] is a similar approach that performs sentiment analysis to assist users and also detects peaks in tweet frequency.
Another approach to Twitter trend analysis is bursty topic detection, which finds topics that are trending in the time series. Lee et al. [38] developed a sliding window model that considers the time factor of the term frequency. It calculates term weights based on arrival rates in a specified time frame. Burstiness measures how popular a term is in a given time interval for detecting a particular event.
Topic modelling is a popular research area for identifying semantic relationships in text content. The probabilistic topic model of Blei et al. [39], Latent Dirichlet Allocation (LDA), frames topic detection as a problem of probabilistic distributions. It represents a document as a distribution over topics and a topic as a distribution over words. Variations of LDA, such as online-LDA [41], dynamic topic models [40], and labelled LDA [42], have used probabilistic topic models as a baseline to detect events in Twitter streams. However, owing to the unique characteristics of Twitter streams, such as short length, noise, informal text, spelling mistakes, and fast arrival rate, those methods are not well suited. LDA models are used to extract topics [71]. Various classifiers have been trained to assess model performance [72,73,74,75].
Most topic models are term-based, which makes the topics difficult to predict and challenging for users to interpret. Term-based models suffer from polysemy and synonymy. Polysemy means that one word has multiple meanings. For instance, the term Apple denotes both a fruit and a technology company, and both a river bank and a financial organization can be called a bank. Synonymy means that two different words share a similar meaning. Hence, patterns are better suited to topic modelling, as the meaning of a word can be expressed well in a pattern through the co-occurrence of neighbouring terms.
Pattern mining is the classical approach that is used in traditional databases. Apriori [43][44] is one of the most popular and important association rule mining algorithms. Association rule mining discovers all the association rules between the item sets that exceed the values of minimum support and confidence.
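The two thresholds mentioned above can be illustrated with a minimal Python sketch. The function names and the toy transactions are illustrative assumptions and are not taken from the paper; each transaction stands for the set of terms in one pre-processed tweet.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    base = support(antecedent, transactions)
    return support(both, transactions) / base if base else 0.0


# Toy data (hypothetical): one set of terms per tweet.
transactions = [
    frozenset({"family", "violence", "abuse"}),
    frozenset({"family", "violence", "depression"}),
    frozenset({"violence", "abuse"}),
    frozenset({"family", "support"}),
]

print(support({"family", "violence"}, transactions))
print(confidence({"family"}, {"violence"}, transactions))
```

A rule is kept only when both values exceed the chosen minimum support and minimum confidence.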
Patterns and interest are the major factors in data mining applications. Association rule mining and classification extract the patterns, and the interest factor determines whether a pattern is useful. Data mining can also be viewed as a process of converting data into information and information into value [57] for users. Mining actionable patterns can be beneficial to people, organizations, and the community. Hence, in a large Twitter data stream, actionable knowledge can be extracted by mining the interesting patterns related to a particular issue [58].
Recent research studies have used techniques such as Twitter Monitor [59], EDCoW [60], HUPC [61], and SFPM [62]. The first two techniques are based on bursty keyword analysis. Keywords that have a higher absolute frequency than usual are used to find more bursty terms. EDCoW applies wavelet analysis to model terms based on their frequencies. HUPC and SFPM apply pattern mining to detect hot topics in Twitter data streams.

Proposed Methodology
There are many studies on pattern mining in traditional data sets, on the MapReduce architecture, and on topic modelling individually. This work attempts to integrate these three popular techniques on a big Twitter data stream to extract the valuable information hidden inside the large corpus, and the benefits are multi-fold. The architecture of the proposed work is shown in Figure 1.

The first stage collects DV-related tweets from social media using the Twitter API. Then, preprocessing techniques are applied to remove noisy and inane content and to extract the relevant data. From the pre-processed input data, semantic associations between terms are captured by mining frequent patterns.
In the next phase, the semantic units of the patterns are extracted and reduced using the MapReduce architecture. The frequent patterns generated from the Twitter stream are taken as input to the mapper, and its output is shuffled and passed to the reducer. This outputs the terms that are most cohesively related to the topic. The purpose of this architecture is to extract the terms that are most relevant to the topic and to remove redundant terms. Because the patterns mined in the earlier stage contain global terms that are cohesively related to each other, this step adds semantic value to the conventional bag-of-words representation.
The next stage generates the clusters of topics, which summarize the entire big Twitter data stream. This step ensures that similar words are clustered together and logically represent a theme for effective summarization. The clusters of topics are then visualized using tag clouds. Finally, evaluation metrics are computed to assess the quality of the topics.
In the data collection process, tweets are extracted from the Twitter API based on keywords. In the second stage, frequent patterns are generated that share many common terms related to abuse and mental disorders, such as <family violence, verbal abuse, depression>, <family violence, physical abuse, hurt, anxiety>, and <domestic violence, sexual abuse, suicide>. Patterns that exhibit similar semantic information lead to redundancy. Hence, the MapReduce architecture is applied in the third stage. A Map process is assigned to each pattern and iterates over each word in the pattern, and the reducer finally produces the consolidated count of every word. In the next stage, the topic detection algorithm clusters the words based on their semantic structure: for example, <physical abuse, verbal abuse, sexual abuse> are clustered into the topic "abuses" and <depression, anxiety, suicide> into the topic "mental disorders". Thus, this research produces actionable knowledge by converting raw Twitter data into valuable information. This knowledge would help family health organizations and health care providers further improve health services by monitoring public social data.

Data Collection by Twitter Streaming API
The constant stream of tweets arriving from the service is accessed through the Twitter Streaming API [50]. This module collects as much data as possible in order to find hidden patterns that can be acted upon for further knowledge discovery. Apache Flume is a distributed and reliable data ingestion system with a simple and flexible architecture for efficiently collecting large amounts of Twitter data. The endpoints are configured in a workflow as sources and sinks. Every piece of data in Flume is defined as an event; here, the events are tweets. Figure 2 represents the Flume architecture for high-volume data gathering. The source generates events and is where the data enter Flume. The acquired data flow through a channel, which is the conduit between source and sink. Once the data reach the sink, it writes them to the Hadoop Distributed File System (HDFS). A custom source has to be designed for this module to access the Twitter Streaming API and extract the relevant tweets based on search keywords and hashtags, since extracting the complete Twitter firehose would be wasteful and time-consuming. Initially, four credentials have to be created in our Twitter account: Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret. The Flume source, channel, and sink are then configured to gather the Twitter streaming data related to DV.
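The keyword and hashtag selection described above can be illustrated with a small Python sketch. It is only a client-side stand-in for the filtering that the custom Flume source would request from the Streaming API, and the keyword and hashtag lists are hypothetical examples rather than the ones used in this work.

```python
import re

# Hypothetical search terms; the actual lists are not given in the paper.
KEYWORDS = {"domestic violence", "family violence"}
HASHTAGS = {"#domesticviolence", "#dv"}


def is_relevant(tweet_text):
    """Return True if the tweet mentions a tracked keyword or hashtag."""
    text = tweet_text.lower()
    if any(keyword in text for keyword in KEYWORDS):
        return True
    # Extract hashtags as whole tokens so "#dv" does not match "#dvd".
    found = set(re.findall(r"#\w+", text))
    return bool(found & HASHTAGS)


sample = [
    "Survivors of Domestic Violence need support",
    "Stay strong #DomesticViolence",
    "Lovely weather today",
]
print([t for t in sample if is_relevant(t)])
```

Filtering early keeps the volume written to HDFS manageable, which is the reason the paper gives for avoiding the full firehose.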

Pre-Processing of Twitter Stream
The content of the collected Twitter stream varies from useful and meaningful information to incomprehensible text. The former includes people's opinions and relevant posts on the particular topic, whereas the latter may contain advertisements and content not worth reading.
Hence, in this step, high-quality information and features are extracted by applying preprocessing techniques. Pre-processing of the Twitter stream removes noise that has negative effects and degrades performance. For microblogs this is the most important step, as tweets contain a high level of noise. Stop word lists contain common English words such as articles, prepositions, and pronouns, for example the, a, an, in, and at. Saif et al. [51] reported that removing stop words improves classification accuracy in Twitter analysis by reducing data sparsity and shrinking the feature space.
Stemming is used to identify the root of a word, remove the suffixes of a term, and save memory space. For example, the words relations, related, and relates can all be reduced to the common root relate (the Porter stemmer itself outputs the truncated stem relat). Porter stemming [52] [53] is applied to standardize the appearance of terms and to reduce data sparseness. Non-text symbols and punctuation marks are removed. Noisy tweets are filtered by eliminating links, non-ASCII words, mentions, numbers, and hashtags.
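The cleaning steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stop word list is a tiny hypothetical sample, and Porter stemming is left out because it would need an external library (for example the PorterStemmer class in NLTK).

```python
import re

# Small illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "in", "at", "of", "and", "is", "to"}


def preprocess(tweet):
    """Lower-case a tweet and strip links, mentions, hashtags, non-ASCII
    words, numbers, punctuation, and stop words. Stemming is not applied."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # links
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
    tokens = [t for t in text.split() if t.isascii()]  # drop non-ASCII words
    # Keep alphabetic runs only, which removes numbers and punctuation.
    words = re.findall(r"[a-z]+", " ".join(tokens))
    return [w for w in words if w not in STOP_WORDS]


print(preprocess("@user The abuse hurt me in 2019!! http://t.co/xyz #DV café"))
```

The output of this step is the list of terms per tweet that the pattern mining stage takes as a transaction.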

Pattern Mining Model for Microblogs
Mining patterns from microblogs can resolve many of the problems of the bag-of-words representation mentioned in the previous section. Unigrams are ambiguous and are interpreted differently in combination with different words. Hence, algorithms are needed to extract patterns in this module. Patterns capture the relationships among multiple terms, since co-occurring terms exhibit semantic associations.
Pattern mining discovers recurring relationships and interesting correlations between terms. The quality of the detected topics depends on the quality of the mined patterns. For higher accuracy and human interpretability, extracted patterns need to satisfy three qualities: frequency, collocation, and completeness. Frequency, the most important factor, is how often the pattern occurs in the Twitter stream. Collocation refers to the co-occurrence of the terms in a pattern at a frequency higher than would be expected by chance. Completeness means that the pattern provides a meaningful representation and has semantic meaning. The advantages of this step are as follows:

• Successful mining of patterns from tweets leads to the discovery of high-quality and informative topics.
• Pattern representation reduces noisy and redundant information and captures the words that are more meaningful to topics.
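The frequency criterion can be illustrated with a level-wise (Apriori-style) frequent pattern miner. The paper does not state which mining algorithm it uses, so the sketch below is an assumption chosen for brevity; the function name, the toy tweets, and the absolute support threshold are all hypothetical.

```python
from itertools import combinations


def frequent_patterns(transactions, min_support):
    """Return {itemset: count} for every itemset that occurs in at least
    `min_support` transactions, grown level by level."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_support:
                current[c] = n
        result.update(current)
        k += 1
    return result


tweets = [
    ["family", "violence", "verbal", "abuse", "depression"],
    ["family", "violence", "physical", "abuse", "anxiety"],
    ["domestic", "violence", "sexual", "abuse", "suicide"],
]
patterns = frequent_patterns(tweets, min_support=2)
for itemset, count in sorted(patterns.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), count)
```

Collocation and completeness would need additional scoring on top of these counts; the sketch covers frequency only.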

MapReduce Architecture
MapReduce is a computing paradigm for distributed processing that processes vast amounts of data in a parallel, reliable, and fault-tolerant manner. It has become a popular programming model for data-intensive workloads. A MapReduce job parallelizes the map tasks by decomposing the input data into separate chunks. The outputs of the maps are sorted and given as input to the reduce tasks. Both the input and the output of a MapReduce job are stored in a distributed file system. In this step, the job takes frequent patterns consisting of various terms as input and produces the number of appearances of each word w_i contained in the patterns. The MapReduce framework is well suited to the pattern growth model because it can split the patterns into smaller portions. The two routines of MapReduce, Map and Reduce, have different functions. A Map process is assigned to each pattern, iterates over each word w_i in its assigned pattern, and emits the pair <w_i, 1>. The outputs of all Map processes are grouped together and passed to a single Reduce phase. A serial algorithm computes each local task, and the reducer finally counts the appearances of each word by gathering the output from all processes. The purpose of this module is to generate the bag of frequent terms from the frequent patterns. A word with a high count occurs in most of the patterns and contributes strongly to the topic. The words with the highest support counts are also the most coherent in representing the semantic meaning of the patterns. The advantages of this step are as follows:
• The global frequent terms generated from the frequent patterns by the MapReduce model are more relevant to the topic.
• Millions of patterns are processed in parallel, which improves time efficiency.
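The Map and Reduce routines described above can be simulated in a single Python process. This is a sketch of the data flow only, not a Hadoop job; the function names are illustrative, and the input reuses the example patterns given earlier in the paper.

```python
from collections import defaultdict


def map_pattern(pattern):
    """Map: emit the pair (w_i, 1) for each word w_i in one pattern."""
    for word in pattern:
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by word."""
    grouped = defaultdict(list)
    for word, value in pairs:
        grouped[word].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce: sum the values of each word to get its support count."""
    return {word: sum(values) for word, values in grouped.items()}


def term_support(patterns):
    pairs = (pair for pattern in patterns for pair in map_pattern(pattern))
    return reduce_counts(shuffle(pairs))


patterns = [
    ["family violence", "verbal abuse", "depression"],
    ["family violence", "physical abuse", "hurt", "anxiety"],
    ["domestic violence", "sexual abuse", "suicide"],
]
counts = term_support(patterns)
print(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))
```

In a real cluster each call to the map function would run on a separate node, and the grouping step would be handled by the framework.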

Topic Prediction and Visualization
In this module, clusters of topics are generated from the global frequent terms produced by the MapReduce stage. The terms that belong to each topic cluster depend on their semantic meaning and similarity. Hence, the corpus is a collection of topics and each topic is a distribution over terms. Domain knowledge, in the form of Wikipedia pages or articles about mental disorders, physical illness, sentiments and emotions, and abuses related to DV, is scraped from the Web. Web crawlers extract example words related to each of these themes. With this prior knowledge, a must-link technique groups the terms extracted by MapReduce into topics. A must-link states that there is a semantic relationship between two terms t1 and t2, so that the two terms belong to the same topic. Such terms consistently appear together as topical terms in the same topic across many disciplines. Good topical words are readable, describe the topic well, and discriminate strongly against neighbouring topics. The terms describing each topic are ranked by their support: the importance of a term is directly proportional to the number of patterns that contain it. Hence, the term frequency is calculated as the number of times the term appears in the patterns, and the terms are ranked in descending order of support. This importance ranking of the terms in every topic gives a good description of each detected topic.
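One simple reading of the must-link step is sketched below in Python: terms that appear in the same seed list are treated as must-linked and placed in the same topic, and each topic is then ranked by term support. The seed lists stand in for the crawled domain knowledge and are hypothetical, as are the support counts.

```python
# Hypothetical seed vocabulary standing in for the crawled domain knowledge.
SEED_TOPICS = {
    "abuses": {"physical abuse", "verbal abuse", "sexual abuse"},
    "mental disorders": {"depression", "anxiety", "suicide"},
}


def assign_topics(term_support, seed_topics):
    """Place each term in the topic whose seed list contains it, then rank
    the terms of every topic by descending support (ties broken by name)."""
    topics = {name: [] for name in seed_topics}
    for term, sup in term_support.items():
        for name, seeds in seed_topics.items():
            if term in seeds:
                topics[name].append((term, sup))
    for name in topics:
        topics[name].sort(key=lambda pair: (-pair[1], pair[0]))
    return topics


# Illustrative support counts, as produced by the MapReduce stage.
support_counts = {
    "depression": 5,
    "anxiety": 7,
    "physical abuse": 4,
    "verbal abuse": 9,
    "hurt": 3,
}
print(assign_topics(support_counts, SEED_TOPICS))
```

Terms with no seed match, such as "hurt" here, stay unassigned in this sketch; how the actual method handles them is not specified in the paper.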
The tag cloud [54] is an effective way to summarize the terms describing the topics in a visually appealing way. The most frequent terms describing a topic are visualized using a tag cloud, in which the tags are usually terms. The tag cloud is useful for quickly perceiving the most prominent terms in the tweets and for differentiating popular terms by font size and colour.
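The mapping from term support to font size can be sketched as a linear scaling. The point-size range and the function name are assumptions made for illustration; the paper does not specify how sizes are chosen.

```python
def font_sizes(term_support, min_pt=10, max_pt=40):
    """Scale each term's support linearly into a font size in points.
    Expects a non-empty mapping of term -> support count."""
    lowest = min(term_support.values())
    highest = max(term_support.values())
    span = highest - lowest
    sizes = {}
    for term, sup in term_support.items():
        # If all supports are equal, give every term the largest size.
        fraction = (sup - lowest) / span if span else 1.0
        sizes[term] = round(min_pt + fraction * (max_pt - min_pt))
    return sizes


print(font_sizes({"abuse": 9, "fear": 6, "hurt": 3}))
```

A rendering library would then draw each term at its computed size; colour could be assigned per topic in the same way.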

Evaluation Metrics
The topic detection performance can be evaluated from the result of classifying the texts into their relevant topic clusters. It is measured by the following metrics: precision, recall, and F-measure [55]. Precision and recall are the basic measures used to evaluate the performance of the proposed algorithm. True Positive (TP) denotes the number of terms correctly classified as relevant to the topic; False Positive (FP) denotes the number of irrelevant terms classified as relevant; True Negative (TN) denotes the number of terms correctly classified as irrelevant; False Negative (FN) denotes the number of relevant terms misclassified as irrelevant. Thus, precision and recall are calculated as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

Another metric, the F-measure, is also often used to measure performance in information retrieval. It combines precision and recall with equal weight in the following equation.

$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
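The standard definitions of these three metrics can be written as a short Python function. The counts in the example are made up for illustration and are not results from this work.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and the equally weighted F-measure
    from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical counts: 8 relevant terms found, 2 false alarms, 2 misses.
p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
print(p, r, f)
```

True negatives do not enter any of the three formulas, which is why the function does not take them as an argument.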

Figure 3. Hadoop Framework
In addition, the quality of the discovered topics is assessed by the quantitative methods proposed by Chang et al. [56] in the paper "How Humans Interpret Topic Models". Quality is measured by two metrics, term intrusion and domain expert evaluation, which assess the semantic meaning of the inferred topics. Term intrusion assesses whether the discovered topics are semantically coherent to human perception. The task involves a set of questions in which the annotator must find the intruder among the available options. Each question contains five terms: four are randomly selected from the top 10 terms of one topic, and the fifth is an irrelevant term chosen from another topic. The annotators analyse the terms and identify the one that does not belong. The proportion of questions answered correctly is used to evaluate how well the terms are uniquely correlated and associated with each topic.
As the topics relate to DV, domain expert knowledge is needed. Once topics such as the various abuses suffered by women experiencing DV, their mental disorders, physical health problems, emotions, and so on are discovered, domain experts are needed to analyse the health problems extracted by our methodology. This can be evaluated by the topic coherence property, that is, whether the terms in a topic are really coherent with its thematic structure.
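The construction and scoring of term intrusion questions can be sketched as follows. The function names, the example terms, and the fixed random seed are illustrative assumptions; the sketch follows the description above, with four terms drawn from the top 10 of one topic and one intruder drawn from another topic.

```python
import random


def make_question(topic_terms, other_topic_terms, rng):
    """Build one question: four terms from the top 10 of a topic plus
    one intruder from another topic, presented in shuffled order."""
    chosen = rng.sample(topic_terms[:10], 4)
    pool = [t for t in other_topic_terms if t not in topic_terms]
    intruder = rng.choice(pool)
    options = chosen + [intruder]
    rng.shuffle(options)
    return options, intruder


def intrusion_accuracy(answers, intruders):
    """Proportion of questions where the annotator picked the intruder."""
    correct = sum(1 for a, i in zip(answers, intruders) if a == i)
    return correct / len(intruders)


# Hypothetical ranked terms for two topics.
abuses = ["verbal", "physical", "sexual", "emotional", "financial", "threat"]
disorders = ["depression", "anxiety", "suicide", "trauma", "insomnia"]

rng = random.Random(0)  # fixed seed so the example is repeatable
options, intruder = make_question(abuses, disorders, rng)
print(options, intruder)
print(intrusion_accuracy(["a", "b", "c"], ["a", "b", "x"]))
```

A high accuracy suggests that annotators can easily tell which term does not belong, which is the sign of a coherent topic.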

Functional Architecture
Twitter data have massive storage and high processing requirements, and are thus difficult to handle with traditional databases. The paper therefore implements the proposed methodological framework on the Apache Hadoop platform [63] (Figure 3), an open source distributed software platform, for optimized data storage and workflow solutions.

Gathering Twitter Data with Apache Flume
As mentioned in Section 4.1, the Twitter data are extracted using Apache Flume with the Twitter Streaming API. The data can then be stored in the Hadoop Distributed File System (HDFS).

Storing data in HDFS
The data can be stored in the Hadoop Distributed File System (HDFS), a prominent tool for storing big Twitter data with high-performance access. HDFS can support large-scale applications with huge file sizes, where an individual file can be terabytes in size. HDFS has two main components, the Namenode and the Datanodes, which follow a master-slave architecture. There are multiple Datanodes, where the actual data reside, and a single master Namenode, which keeps track of information about data storage and size in the form of metadata. Resilience and fault tolerance are key features of HDFS: the data are divided into blocks of 64 MB by default (128 MB in more recent Hadoop releases) and replicated three times across the Datanodes.
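The effect of the block size and replication factor can be worked through with a little arithmetic. The sketch below assumes the defaults stated above (64 MB blocks, three replicas); the function name and the 1 GiB example file are illustrative.

```python
import math


def hdfs_footprint(file_bytes, block_bytes=64 * 1024**2, replication=3):
    """Return (number of blocks, number of stored block replicas,
    raw bytes stored across the cluster) for one file."""
    blocks = math.ceil(file_bytes / block_bytes)
    replicas = blocks * replication
    # The last block only occupies as much space as it actually holds,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_bytes * replication
    return blocks, replicas, raw_bytes


one_gib = 1024**3
print(hdfs_footprint(one_gib))
```

For a 1 GiB file this gives 16 blocks and 48 block replicas, so the loss of any single Datanode still leaves two copies of every block.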

Querying Complex Data with Hive
The Twitter Streaming API outputs tweets in JSON format, which can be arbitrarily complex. Once the data are loaded into HDFS, an external table is created to enable querying. Hive provides a query interface that can be used to query data residing in HDFS in a simple way. The query language is similar to SQL, and queries read the data directly from HDFS. The Serializer and Deserializer (SerDe) interface tells Hive how to process the data, and it can be flexibly defined and redefined: it reads the JSON data and translates it into objects for Hive.
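What the deserialization step does conceptually can be shown with a Python sketch that turns one JSON tweet into a flat row. This is not a Hive SerDe, only an illustration of the translation; the selected fields and the sample record are assumptions about the tweet layout.

```python
import json


def tweet_to_row(line):
    """Parse one JSON-encoded tweet and keep a few flat columns.
    Missing fields become None (or an empty list for hashtags)."""
    tweet = json.loads(line)
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "screen_name": user.get("screen_name"),
        "hashtags": [h.get("text") for h in entities.get("hashtags", [])],
    }


# Made-up sample record for illustration.
sample_line = json.dumps({
    "id": 1,
    "created_at": "Mon Mar 06 09:00:00 +0000 2017",
    "text": "Support is available #DV",
    "user": {"screen_name": "example_user"},
    "entities": {"hashtags": [{"text": "DV"}]},
})
print(tweet_to_row(sample_line))
```

Nested structures such as the user object are what make a plain delimited-text table unsuitable and a JSON-aware SerDe necessary.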

Partition Management with Oozie
Flume will continue to create new files as the Twitter API perpetually streams tweets. As new tweets arrive, partitions need to be added to the Hive table periodically and automatically. Apache Oozie is a flexible workflow coordination system that can be used for this purpose: it instructs job workflows to run based on a set of criteria (for example, to refresh on an hourly basis).
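The statement that such an hourly workflow would run can be sketched by generating it in Python. The table name `tweets`, the partition column `datehour`, and the directory layout under `/user/flume/tweets` are hypothetical and would have to match the actual Flume sink and Hive table definitions.

```python
from datetime import datetime


def add_partition_statement(hour, table="tweets", base_path="/user/flume/tweets"):
    """Build the HiveQL statement that registers one hourly partition.
    Table name, partition column, and path layout are assumptions."""
    datehour = hour.strftime("%Y%m%d%H")
    location = f"{base_path}/{hour:%Y/%m/%d/%H}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (datehour = {datehour}) LOCATION '{location}'"
    )


print(add_partition_statement(datetime(2017, 3, 5, 9)))
```

An Oozie coordinator would trigger the equivalent statement once per hour, after the Flume sink has finished writing that hour's directory.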

Conclusion
This paper proposes a novel framework to extract actionable knowledge from social media discourse, and this knowledge would be valuable to the social welfare of the community. This work is expected to provide significant results, as it has "some unique advantage of real time access to naturalistic information by identifying the health related trends from the real world events and online communities" compared with a traditional survey.