A Novel Method to Detect Public Health in Online Social Network Using Graph-based Algorithm

INTRODUCTION: Twitter plays an important role in people's social lives. Health-related tweets can be extracted to track the spread of epidemic diseases across the network, and they can serve as a starting source of individual-level data for learning about the physical condition of users. OBJECTIVES: The key objective is to develop a graph-based algorithm to detect public health information in an online social network. METHODS: The proposed method collects tweets relating to general health on Twitter and processes them with the min-cut algorithm, which finds the minimum cut of an undirected edge-weighted graph. The runtime of the algorithm appears to be lower than that of other graph algorithms. Min-cut is reliable, well suited to network optimization, and helps prevent redundancy. RESULTS: To evaluate performance, we use a health dataset for the detection of epidemic disease. The proposed graph-based method achieves the best accuracy, precision, and recall. In the confusion matrix, Min-cut yields the highest number of true positives compared with the Text Rank and K-Means algorithms. CONCLUSION: The proposed health detection method using a graph-based algorithm performs better than Text Rank and K-Means on all evaluated metrics.


Introduction
Advances in social websites and the rapid growth in the volume and complexity of data generated by Internet services have become challenging not only technologically but also in terms of software application areas [1]. The performance and availability of data processing are important factors that must be evaluated, given that traditional data processing tools may not offer adequate resources.
Such large-scale data analysis is regarded as a powerful tool and is now used by both small and large organizations and companies, such as Google and Facebook, as well as by public and private healthcare institutions [2].
Given its recent rise and the growing complexity of the associated technological issues, the development of comprehensive system solutions has been advocated for each specific application [3]. In this work, we adopt a well-established functional architecture for storing and mining large data that can be applied in a variety of scenarios [4]. We show that large-scale health-related social data can yield important information, valuable to both ordinary users and professionals, and we present preliminary results of a statistical analysis of Twitter health data [5]. Social media is where people learn, share, and express their opinions on specific issues, and additional information about an issue can be derived from those opinions [5]. These platforms connect people across the globe. We live in a society in which textual data on the Internet is growing at a rapid pace, and many organizations are trying to use this flood of data to extract people's views on their products [6]. A vast source of unstructured text is contained in social networks, where it is infeasible to analyze such volumes of data manually. Numerous social networking sites allow users to contribute, modify, and rate content, as well as to express their personal opinions about specific topics [7].

EAI Endorsed Transactions on Pervasive Health and Technology
Examples include blogs, forums, product review sites, and social networks such as Twitter (http://twitter.com/). These services are emerging as a leading setting for mining data streams. Tweets are challenging to analyze because data arrive at high frequency, and the algorithms that process them must do so under strict constraints of space and time [8]. To build classifiers for tweets, we need to collect training data so that appropriate learning algorithms can be applied.
With the increased use of social media in recent years, the sources of knowledge available to an individual have expanded. Among social media platforms, Twitter is the most popular micro-blogging service, with the largest number of followers [9]. People share their opinions regardless of place or time. The length of a tweet is generally limited to 280 characters, up from the earlier limit of 140. Although only a little information can be gained from a single tweet, a vast amount can be obtained from accumulated content. Twitter has long been a valuable source of information on a variety of issues [10].
Twitter has helped in the analysis of various issues, such as predicting election results and detecting earthquakes before government agencies could. Tweets can also be useful for important issues such as public health. For instance, users often post tweets such as "I have a severe headache" or "I have been diagnosed with pneumonia". Tweets like these may give us a way to find similar cases [11]. Twitter data is also used as a source in sentiment analysis, where the task is to classify messages into two classes depending on whether they convey positive or negative feelings; labeling tweets manually as positive or negative is a laborious and costly task [12]. However, a significant advantage of Twitter data is that many tweets carry sentiment indicators provided by their authors: sentiment is implied by the use of various types of emoticons [13].
Smileys or emoticons are visual cues that are associated with emotional states [19]. When the author of a tweet uses an emoticon, they are annotating their own text with an emotional state. Such annotated tweets can be used to train a sentiment classifier [14]. Streaming algorithms use probabilistic data structures that can give fast approximate answers. However, sequential online algorithms are limited by the memory and bandwidth of a single machine. Achieving results faster and scaling to larger data streams ultimately requires parallel and distributed computing [15].
Data stream mining algorithms are needed to handle the data currently being generated, at an ever-increasing rate, from many sources: sensor networks, measurements in network monitoring and traffic management, log records or click streams in web browsing, manufacturing processes, call detail records, email, blogging, Twitter posts, etc. [16]. In fact, all generated data can be considered streaming data or a snapshot of streaming data, since it is obtained over an interval of time. This data is unstructured, is produced in huge amounts, and is referred to as big data [20]. By definition, big data in healthcare refers to electronic health data sets so large and complex that they are difficult to manage with traditional tools [17]. Big data in healthcare is overwhelming not only because of its volume but also because of the diversity of data types and the speed at which it must be managed. Data related to patient healthcare and well-being make up "big data" in the healthcare industry [18].
In this paper, we propose a framework in which the Text Rank and min-cut algorithms are compared. Tweets are extracted using appropriate keywords, and the collected data is represented as a matrix, which is then processed by the min-cut algorithm (min-cut requires its input in the form of a square matrix). Using min-cut, the data is clustered and key phrases are formed; the same is done with Text Rank. The data is labeled both manually and based on the key phrases. An SVM is trained on the labeled data, which contains both positive and negative key phrases, and accuracy is calculated from the result. We thus apply the min-cut algorithm, compare its performance with other algorithms, and show that min-cut outperforms the Text Rank algorithm.

Related works
To obtain health data, we considered many common and prevalent diseases, including TB, HIV, rabies, malaria, cholera, influenza, measles, hepatitis A, whooping cough, mumps, chickenpox, rubella, gonorrhea, syphilis, Zika, Ebola, yellow fever, typhoid, dengue, chikungunya, diabetes, obesity, heart disease, cancer, mental illness, autoimmune diseases, stroke, and Alzheimer's disease. These disease-related words were chosen as keywords to retrieve the tweets [22].
Health-related tweets arrive at very high frequency and constitute a large volume of unstructured data. These tweets are important for analyzing and predicting health-related issues. An online clustering algorithm is restricted by the memory and bandwidth of stand-alone machines [23][24]. Various clustering algorithms were surveyed as possible replacements for the Text Rank keyphrase extraction algorithm; they are described below.

Connectivity models
In connectivity models, similarity is based on the notion that data points lying closer together in data space are more alike than data points lying far apart. These models follow one of two approaches. They either classify all data points into separate clusters and then aggregate them as the distance decreases, or they treat all data as a single cluster and then partition it as the distance increases. The choice of distance function is subjective. Although these models are easy to interpret, they lack scalability when handling large datasets. Examples of these models are hierarchical clustering algorithms and their variants [25].
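As an illustration of the first (bottom-up) approach, the following is a minimal sketch of single-linkage agglomerative clustering on one-dimensional data. The function name `single_linkage`, the choice of single linkage, the absolute-difference distance, and the toy data are all assumptions made for this example; they are not taken from the surveyed work.

```python
from itertools import combinations


def single_linkage(points, num_clusters):
    """Merge the two closest clusters until `num_clusters` remain (1-D points)."""
    clusters = [[p] for p in points]  # bottom-up: every point starts alone

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > num_clusters:
        # Find the pair of clusters (i < j) with the smallest gap.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: gap(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical toy data for illustration only.
toy_points = [1.0, 1.2, 5.0, 5.1, 9.0]
hier_clusters = single_linkage(toy_points, 3)
```

Every merge step compares all pairs of clusters, which illustrates why this family of models scales poorly to large datasets.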

Centroid models
Centroid models are iterative clustering methods in which similarity is determined by the closeness of a data point to the centroid of a cluster. One of the most popular algorithms of this kind is K-Means clustering. The number of clusters required must be specified at the outset, which makes prior knowledge of the dataset important for these models. The models run iteratively in order to find a local optimum [26].
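The iteration can be sketched as follows for one-dimensional data. This is a minimal illustration of the standard assign-and-update loop (Lloyd's algorithm), not the implementation used in the experiments; the function name `k_means_1d`, the initial centroids, and the toy values are assumptions.

```python
def k_means_1d(values, centroids, max_iter=100):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    groups = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Update step: move each centroid to the mean of its group.
        updated = [
            sum(g) / len(g) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:  # no change: a local optimum has been reached
            break
        centroids = updated
    return centroids, groups


# Hypothetical toy data; the number of clusters (2) is fixed in advance
# through the length of the initial centroid list.
final_centroids, final_groups = k_means_1d(
    [1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0]
)
```

The result depends on the initial centroids, which is why the procedure is only guaranteed to reach a local optimum.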

Distribution models
The distribution model is a statistical approach that iterates over the data, examining each data record in turn. The similarity between each record and the currently existing clusters is computed. At the start, no clusters exist. A record is added to the best-matching cluster if the computed similarity reaches the threshold, and the characteristics of that cluster change accordingly. If the similarity does not reach the threshold, or if no cluster exists yet, a new cluster containing only that record is created. Both the threshold and the maximum number of clusters can be specified [27].
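A single-pass procedure of this kind might look like the sketch below. The Jaccard similarity on word sets, the function names, the cluster representation, and the rule of forcing a record into the best-matching cluster once the cluster limit is reached are all assumptions introduced for illustration; the description above does not fix these details.

```python
def jaccard(a, b):
    """Similarity between two word sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0


def threshold_clustering(records, threshold, max_clusters=None):
    """Single-pass clustering: join the most similar cluster or open a new one."""
    clusters = []  # each cluster: {"words": set of all words, "members": [...]}
    for record in records:
        words = set(record.lower().split())
        best = max(clusters, key=lambda c: jaccard(words, c["words"]), default=None)
        # Assumed policy: once the limit is reached, never open a new cluster.
        full = max_clusters is not None and len(clusters) >= max_clusters
        if best is not None and (jaccard(words, best["words"]) >= threshold or full):
            best["members"].append(record)
            best["words"] |= words  # the cluster's characteristics change
        else:
            clusters.append({"words": words, "members": [record]})
    return clusters


# Hypothetical records for illustration only.
sample_records = [
    "flu fever cough",
    "fever cough headache",
    "diabetes insulin",
    "insulin diabetes diet",
]
sample_clusters = threshold_clustering(sample_records, threshold=0.4)
```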

Density models
In density models, clustering is based on density, computed either from locally connected density points or from an explicitly constructed density function. Clusters are separated by low-density regions, and data in low-density regions is treated as noise or outliers. These models play a crucial role in discovering nonlinear shape structures; their main feature is the ability to find clusters of arbitrary shape [28][29]. For persuasive recommendation, a travel plan is created based on the active target user, user ID, and location [30][33]. Many types of wearable sensor devices incorporate unobtrusive sensors, smart textiles, and printable electronic tattoos for health detection [31]. Cloud services hold an enormous amount of health information and can provide it to healthcare services, which reduces cost [32][34]. They also help to improve the early detection and prevention of chronic diseases [35][36].

Graph models
The remainder of the paper is organized as follows: Section 2 presents the literature review on event detection algorithms. Section 3 describes the methodology used in this paper.

Data preprocessing
Tweets are collected through the Twitter API, which can be accessed by creating an API account. Creating an account provides an API key, an API secret key, an access token, and an access token secret. Using these authorization keys, tweets are retrieved for each keyword separately and then merged into one document. The retrieved tweets are preprocessed and manually classified as positive or negative in our implementation. To find key phrases with min-cut, a frequency matrix must be fed as input to the min-cut algorithm. For that, the frequencies among words are first recorded, and the matrix is then constructed manually. Fig. 1 shows the word cloud of the preprocessed text.
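Although the matrix in this work is built by hand, the construction can be sketched in code. The example below assumes that "frequencies among words" means how often two distinct words occur in the same tweet; that interpretation, the whitespace tokenization, the function name `cooccurrence_matrix`, and the sample tweets are assumptions for illustration.

```python
from itertools import combinations


def cooccurrence_matrix(tweets):
    """Return (vocabulary, square symmetric matrix of co-occurrence counts)."""
    tokenized = [sorted(set(t.lower().split())) for t in tweets]
    vocab = sorted({w for words in tokenized for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in vocab]
    for words in tokenized:
        # Count each unordered pair of distinct words once per tweet.
        for a, b in combinations(words, 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return vocab, matrix


# Hypothetical preprocessed tweets for illustration only.
sample_tweets = ["fever cough", "fever headache", "cough fever"]
vocab, freq = cooccurrence_matrix(sample_tweets)
```

The result is square and symmetric, so it can be read as the weighted adjacency matrix of an undirected graph whose vertices are words.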

Retrieving Health data
The classified positive and negative tweets are fed separately as input to Text Rank, and the positive and negative key phrases are extracted. In the same way, the frequency matrices of the positive and negative tweets are fed to min-cut and the corresponding output is obtained, from which the key phrases have to be selected manually. The overall workflow is shown in Fig. 2: Twitter data containing both negative and positive tweets is preprocessed, and the keyword extraction algorithms (Text Rank and Min-cut) are applied to the processed tweets. The extracted key phrases are converted into a separate dataset for each algorithm. An SVM classifier is applied to both datasets, and the performance of the two algorithms is then analyzed.
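The paper does not state which min-cut variant is used. As a sketch of what finding the minimum cut of an undirected edge-weighted graph involves, the example below assumes the Stoer-Wagner algorithm operating on a square symmetric weight matrix. The function name `stoer_wagner` and the toy matrix are illustrative assumptions, and the step of selecting key phrases from the resulting partition is not shown.

```python
def stoer_wagner(weights):
    """Global minimum cut of an undirected graph given as a symmetric matrix.

    Returns (cut_weight, one_side), where one_side lists vertex indices.
    """
    n = len(weights)
    w = [row[:] for row in weights]     # working copy; vertices are merged in place
    groups = [[i] for i in range(n)]    # original vertices behind each merged node
    active = list(range(n))
    best_weight, best_side = float("inf"), []

    while len(active) > 1:
        # Minimum cut phase: grow a set by repeatedly adding the vertex
        # that is most tightly connected to it.
        order = [active[0]]
        rest = active[1:]
        attach = {v: w[active[0]][v] for v in rest}
        while rest:
            nxt = max(rest, key=lambda v: attach[v])
            rest.remove(nxt)
            order.append(nxt)
            for v in rest:
                attach[v] += w[nxt][v]
        s, t = order[-2], order[-1]
        # The cut of the phase separates t from all other vertices.
        if attach[t] < best_weight:
            best_weight, best_side = attach[t], sorted(groups[t])
        # Merge t into s and continue on the smaller graph.
        for v in active:
            if v != s and v != t:
                w[s][v] += w[t][v]
                w[v][s] = w[s][v]
        groups[s] += groups[t]
        active.remove(t)
    return best_weight, best_side


# Hypothetical 4-word graph: two tightly linked pairs joined by weak edges.
toy_weights = [
    [0, 5, 0, 1],
    [5, 0, 1, 0],
    [0, 1, 0, 5],
    [1, 0, 5, 0],
]
cut_weight, cut_side = stoer_wagner(toy_weights)
```

In this toy graph the cheapest cut separates the two tightly linked pairs, which is the behavior that makes a minimum cut usable as a clustering step.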
After the key phrases are selected, the datasets for Text Rank and min-cut are constructed manually and separately. Each dataset consists of both labeled tweets and key phrases. The SVM classifier is trained on the Text Rank and min-cut datasets separately, and it predicts the test data for each training run, from which the performance can be measured. After applying the SVM, we obtain the health-related tweets. Table 1 shows that the Min-cut algorithm gives the best confusion matrix values.

Table 1. Confusion matrix values for the three algorithms

Algorithm for Health Detection System
For further analysis of the performance of the min-cut and Text Rank algorithms, a confusion matrix (Table 1) was constructed, consisting of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts. Using this confusion matrix, Accuracy, Sensitivity, Specificity, F1, Recall, and Precision were calculated; their formulas are given below.

Analyses
We analyzed the performance of the min-cut and Text Rank algorithms using the following parameters: number of tweets, accuracy, sensitivity, specificity, F1 score, precision, and recall, each of which is described below.

Accuracy: Accuracy measures how closely the predictions conform to the correct values, that is, the fraction of all instances that are classified correctly, as given in equ (1).

Accuracy = (TP+TN)/ (TP+TN+FP+FN) (1)

Sensitivity: Sensitivity, calculated from equ (2), is the fraction of positives that are correctly identified as positives.

Sensitivity = TP/ (TP+FN) (2)

Specificity: Specificity, calculated from equ (3), is the fraction of negatives that are correctly identified as negatives.

Specificity = TN/ (TN+FP) (3)

Precision: Precision, otherwise called positive predictive value, is the fraction of relevant instances among the retrieved instances, calculated using equ (4).

Precision = TP/ (TP+FP) (4)

Table 4 presents the comparison between min-cut and Text Rank. It shows that min-cut performs better than Text Rank: the number of correctly matched tweets out of the total number of tweets is higher for the min-cut algorithm than for the Text Rank algorithm. A further analysis comparing Min-cut with the PageRank algorithm shows that Min-cut performs better than the existing method.
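The evaluation metrics named in this section can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions; recall is taken to be identical to sensitivity and F1 to be the harmonic mean of precision and recall. The function name `classification_metrics` and the counts are hypothetical and are not taken from Table 1.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # also called recall
    precision = tp / (tp + fp)    # positive predictive value
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": sensitivity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

The function assumes that no denominator is zero, which holds whenever the test data contains at least one actual positive, one actual negative, and one predicted positive.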

Conclusion and Future Work
The aim of this project, to obtain health-related data from Twitter data, has been achieved, along with a comparison of algorithms for efficient results. The min-cut algorithm has many variations; this paper considers the variant for undirected graphs. The algorithm is comparatively easy to implement, and it can be solved using n^2 processors. It divides the graph into smaller subgraphs that still maintain the structural characteristics. Our future plan is to identify not only health-related data but also the associated locations and severity.
In graph models, clustering is performed by dividing the graph into several clusters (groups) based on some measure of relevance. A cluster is formed by cutting or discarding the useless edges. The resulting clusters are called community clusters. Graph clustering applications differ from other kinds of clustering and have their own methods, such as MST clustering, Markov clustering, Chameleon, and Star clustering. After analyzing various clustering algorithms, we propose a min-cut clustering algorithm for keyphrase extraction.

Section 4 discusses the experimental setup and results. Section 5 concludes the paper with a discussion.

Table 4. Comparison between Min-cut and Text Rank algorithms