Going Multi-viral: Synthedemic Modelling of Internet-based Spreading Phenomena

Epidemics of a biological and technological nature pervade modern life. For centuries, scientific research focused on biological epidemics, with simple compartmental epidemiological models emerging as the dominant explanatory paradigm. Yet there has been limited translation of this effort to explain internet-based spreading phenomena. Indeed, single-epidemic models are inadequate to explain the multimodal nature of complex phenomena. In this paper we propose a novel paradigm for modelling internet-based spreading phenomena based on the composition of multiple compartmental epidemiological models. Our approach is inspired by Fourier analysis, but rather than trigonometric wave forms, our components are compartmental epidemiological models. We show results on simulated multiple epidemic data, swine flu data and BitTorrent downloads of a popular music artist. Our technique can characterise these multimodal data sets utilising a parsimonious number of subepidemic models.


INTRODUCTION
Human existence has always been driven by interactions between humans and between humans and their environment. Spreading processes of various kinds arise as an inevitable consequence of these interactions. Where the spreading is rapid and widespread, the resulting outbreak is termed an epidemic. Epidemics occur in and impact on almost every domain from biology (e.g. infectious diseases) to technology (e.g. computer viruses and social networks). For this reason, the study of epidemics and spreading processes has been a vital scientific endeavour throughout history.
Since physiological well-being is one of the most basic human needs [15], it is natural that the study of spreading processes focused for many centuries on disease propagation and biological epidemics in populations. The last two centuries witnessed the emergence of the evidence-based scientific study of disease we know today as epidemiology. Over the same time period, increased industrialization, mass transit and technological developments have increased not only the potential for the activation of a broad class of spreading processes but also the rates of transmission, increasing the likelihood that they manifest themselves as epidemics.
It has long been recognised that it is not only diseases that are subject to spreading processes. Amongst others, Dawkins has suggested the theory of memes, i.e. ideas that spread like "mind-viruses" [8]. A key assumption of our present research is that there are many similarities between the way diseases spread and the way internet-based spreading mechanisms, such as tweeting and sharing of online content, operate. That is, an outbreak of interest starts with a few susceptible individuals who are exposed to an originating event and some become "infected". These individuals then interact with others, passing on the "infection". Eventually the infected individuals recover/lose interest and the outbreak dies out. We consequently adopt epidemiological models to describe the dynamics of internet-based phenomena.
A model of a single epidemic is inadequate to characterise the multimodality that emerges from many complex internet-based spreading phenomena. We speculate that this is because mono-epidemic-based modelling efforts cannot account for the potential influence of multiple underlying spreading mechanisms, each of which may initiate at a different time. Consider for example YouTube video views. Views may be due to sharing of the content on social media platforms such as Facebook and Twitter, links on other websites, the content being featured and/or recommended in a news article or by YouTube itself, notifications to channel subscribers etc. Ideally we require a model that is able to adapt to the sudden activation of any of these mechanisms, rapidly updating itself to enable near-term predictions of reasonable quality, without detailed knowledge of the underlying spreading mechanisms involved. We propose a novel modelling and prediction framework based on the analysis and synthesis of multiple epidemic models as shown in Figure 1. Given a composed signal which is presumed to represent the aggregated observable manifestation of multiple underlying epidemics, the framework breaks down the incoming signal into its fundamental components and selects the disease spreading models that best explain each component. These models are resynthesized in order to predict the future evolution of the signal.
Our approach is inspired by Fourier analysis, but instead of trigonometric wave forms our components are compartmental epidemiological models. There are several challenges inherent in our approach, not least in determining the number of epidemics to be fitted, and in selecting appropriate epidemiological models and parameters for each component.
The remainder of this paper is organised as follows. Section 2 presents relevant background of the theory and analysis of disease propagation, and an overview of contemporary epidemiology-based social network analysis. Section 3 lays out our synthedemic decomposition algorithm which is implemented in a prototype version of our framework. Section 4 presents our case study results on simulated multiple epidemic data, swine flu data and BitTorrent download activity from artists that went viral in recent years. Lastly, Section 5 concludes and considers avenues for future work.

Disease Propagation Theory and Predictive Analytics
In ancient times, and throughout the Dark and Middle Ages, the predominant explanations for disease propagation included the supernatural, superstition and miasma theory, which held that diseases were caused by "bad air". A notable exception was Hippocrates who correctly identified the role of human behaviours and environmental factors [1].
Progress towards a more scientific and data-based approach began to be made from 1600 onwards with the collection of the first public health statistics, by John Graunt (1620-1674) [12] and others. One of the most famous studies now regarded as the foundation of this discipline was by John Snow of the 1854 London Cholera epidemic [20] in which he identified a particular water pump on Broad Street as the likely source of the outbreak.
Predictive mathematical models for epidemics were relatively slow to develop, despite their tremendous utility in understanding, managing and forecasting the progression of epidemics. One of the earliest was in 1766 by Daniel Bernoulli who carried out a study of the effects of smallpox vaccination. However, arguably the most significant breakthrough in this context was that of compartmental disease models based on Ordinary Differential Equations (ODEs), as proposed by Kermack and McKendrick in 1927 [14].
A variety of compartmental disease models are currently used in practice [22]. The most well-known of these, the Susceptible-Infected-Recovered (SIR) model, features a closed population of individuals divided into three evolving subpopulations: S(t) tracks the number of individuals who are susceptible to infection by the disease at time t, I(t) tracks the number of individuals who are infected by the disease with rate β and R(t) tracks the number of individuals who have recovered from the disease at rate γ.

Epidemiology-based Social Network Analysis
Goffman and Newill were the first to bring a social context to epidemiology, with their mathematical model for the spreading of rumours [13]. The rise of the internet, particularly search engines and Online Social Networks (OSNs), led to two classes of studies: those designed to augment conventional epidemiology (e.g. [6,9]), and those applying epidemiological or diffusion process principles to model the dissemination of information (e.g. [2,3,5,11,16]). The former includes detection of real physical disease outbreaks by assuming a relationship between online searches and the real number of infected individuals [9]. The latter includes the work of Tweedle and Smith, who applied an SIR-inspired model to pop star Justin Bieber's popularity based on Google Trends data [21]. Very recently, Coviello et al. published a controversial study which measured the contagion of emotional expression amongst Facebook users [7].
A recent study explored the potential for epidemiology to explain certain outbreaks of internet-based information spreading [18]. The authors were able to progressively fit and to parameterise simple epidemiological models from single data traces of BitTorrent downloads and YouTube views. Subsequently they investigated confidence intervals on the outbreak parameter values as the outbreak unfolded over time [19]. Another work explored the dynamics of interacting epidemics in multiple overlapping populations [17].

METHODOLOGY
The synthedemic¹ methodology is designed to fit composed epidemic models to outbreak datasets that are regularly augmented with new observations (so as to facilitate potentially real-time operation). We start with a small truncated data set and at each step we add one new data point to the truncated dataset until we reach the end of the time frame to be considered. Initially we start by fitting no epidemics, and dynamically incorporate more epidemics when it becomes necessary to improve the fit.
It is clearly important to choose a set of compartmental model types which are appropriate for the context within which the synthedemic framework is deployed. Followers of online phenomena have noticed that there appear to be two types of content diffusion: growth, characterised by organic spreading of content in communities (initially by influencers), and spike, which represents a sudden "explosion" of sharing activity sparked by some (mass-media) event that is then followed by a gradual decay [10]. Here we propose to model the former by an SIR process, and the latter by an IR (Infected-Recovered) process consisting of an initial impulse followed by exponential decay. That is,
• An SIR epidemic starting at time $t_0$ is characterised by the initial number of infected individuals $I_0$, the initial number of susceptible individuals $S_0$, the initial number of recovered individuals $R_0$, the infection rate β and the recovery rate γ. The SIR model dynamics are:
$$\frac{dS}{dt} = -\beta S I, \qquad \frac{dI}{dt} = \beta S I - \gamma I, \qquad \frac{dR}{dt} = \gamma I,$$
for $t > t_0$ with $[S(t_0), I(t_0), R(t_0)] = [S_0, I_0, R_0]$ and with $I(t) = R(t) = S(t) = 0$ for $t < t_0$.
• An IR (spike) epidemic starting at time $t_0$ is characterised by the initial number of infected individuals $I_0$ and the decay rate γ. The IR model dynamics are:
$$\frac{dI}{dt} = -\gamma I,$$
for $t > t_0$ with $I(t_0) = I_0$ and with $I(t) = 0$ for $t < t_0$.
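The two component types can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the mass-action form of the SIR equations (infection term βSI), integrates them with a hand-written fourth-order Runge-Kutta step, and uses the closed-form exponential for the IR process. The function names `sir_infected` and `ir_infected` and the step size `dt` are illustrative choices.

```python
import math


def sir_infected(t0, S0, I0, beta, gamma, times, dt=0.01):
    """Infected count I(t) of an SIR epidemic that starts at t0.

    Assumes dS/dt = -beta*S*I and dI/dt = beta*S*I - gamma*I.
    `times` must be sorted in ascending order; I(t) = 0 for t < t0.
    """
    def deriv(S, I):
        new_infections = beta * S * I
        return -new_infections, new_infections - gamma * I

    out = []
    S, I, t = float(S0), float(I0), float(t0)
    for target in times:
        if target < t0:
            out.append(0.0)
            continue
        # Advance the state with RK4 steps until we reach the requested time.
        while t < target - 1e-12:
            h = min(dt, target - t)
            k1 = deriv(S, I)
            k2 = deriv(S + 0.5 * h * k1[0], I + 0.5 * h * k1[1])
            k3 = deriv(S + 0.5 * h * k2[0], I + 0.5 * h * k2[1])
            k4 = deriv(S + h * k3[0], I + h * k3[1])
            S += h / 6.0 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
            I += h / 6.0 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
            t += h
        out.append(I)
    return out


def ir_infected(t0, I0, gamma, times):
    """Infected count of an IR (spike) process: impulse I0 at t0, then decay."""
    return [I0 * math.exp(-gamma * (t - t0)) if t >= t0 else 0.0 for t in times]
```

Setting `gamma` to zero in `ir_infected` gives a constant level from `t0` onwards, which is how a baseline component can be expressed with the same function.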

Synthedemic Methodology Overview
Let M be the class of subepidemic models that we are considering and let $M^{(k)}$ be the set of vectors with k subepidemics. Generally the set M can contain any type of epidemic model but here we restrict ourselves to the two types of epidemics introduced above. In view of the parameter sets of these processes, elements of M take the form $\mathrm{sir}(t_0, I_0, S_0, \beta, \gamma)$ or $\mathrm{ir}(t_0, I_0, \gamma)$.
Note that we do not include the initial number of recovered individuals in the parameter set of the SIR, as the number of recovered individuals does not influence the evolution of the number of infected individuals. For further use, we also introduce the type $\mathrm{base}(t_0, I_0) \doteq \mathrm{ir}(t_0, I_0, 0)$, which corresponds to a constant infection level $I_0$ starting at $t_0$.

¹ A portmanteau term from "synthesised epidemic".
For any $m \in M$, let $f_m(t)$ be the number of infected individuals at time t of model m. With a slight abuse of notation and assuming that epidemics are additive, we associate with every vector E of elements of M the multiple epidemic,
$$f_E(t) = \sum_{m \in E} f_m(t).$$
Let $y_i$ be the ith data point, which is collected at time $t_i$, and let $\mathbf{t}$ and $\mathbf{y}$ be the vectors with elements $t_i$ and $y_i$, respectively. Moreover, let $\mathbf{t}_i$ be the vector with elements $t_1$ to $t_i$; $\mathbf{y}_i$ is defined likewise. We aim to find a sequence of vectors of subepidemics $\{E(i) : E(i) \subset \cup_k M^{(k)}\}$ such that E(i) maximizes the coefficient of determination for the data up till time $t_i$, whereby the number of subepidemics is upper-bounded. The bound is chosen such that a target coefficient of determination $r^2_{\mathrm{target}}$ can be attained. The coefficient of determination for a vector of epidemics E and data points $\mathbf{y}$ collected at epochs $\mathbf{t}$ is defined as,
$$r^2(E, \mathbf{t}, \mathbf{y}) = 1 - \frac{\lVert \mathbf{y} - f_E(\mathbf{t}) \rVert^2}{\lVert \mathbf{y} - \bar{y} \rVert^2},$$
where $\lVert \cdot \rVert$ and $\bar{y} = \frac{1}{\ell(\mathbf{y})} \sum_{k=1}^{\ell(\mathbf{y})} y_k$ denote Euclidean distance and sample mean, respectively. We also introduce the notation $\ell(\mathbf{y})$ for the number of elements in a vector $\mathbf{y}$ and the vector $f_E(\mathbf{t})$ with elements $f_E(t_i)$ for ease of notation.
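The additive composition and the coefficient of determination can be sketched in a few lines. This is an illustrative example only: it represents each component by a plain Python function of time, uses IR and baseline components because they have a closed form, and the helper names `ir`, `multiple_epidemic` and `r_squared` are hypothetical.

```python
import math


def ir(t0, I0, gamma):
    """Return f_m(t) for an IR component; gamma = 0 gives a baseline level."""
    return lambda t: I0 * math.exp(-gamma * (t - t0)) if t >= t0 else 0.0


def multiple_epidemic(components, times):
    """f_E(t): the sum of the infected counts of all components at each time."""
    return [sum(f(t) for f in components) for t in times]


def r_squared(y, fitted):
    """Coefficient of determination 1 - ||y - f||^2 / ||y - mean(y)||^2.

    Undefined (division by zero) when all observations in y are equal.
    """
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot
```

For instance, data generated by a baseline plus one spike is matched exactly by the two-component model, while the baseline alone yields a coefficient of determination strictly below one.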
The general optimisation problem can be formulated as, The bound k − i on the number of subepidemics allows for achieving $r^2_{\mathrm{target}}$ with a parsimonious model. Without such a bound the optimisation problem would be trivial: the optimal fit would be to have a spike with infinite (or very large) decay rate at every data point. As the formulated optimisation problem is numerically involved, we formulate a heuristic optimisation approach in the next section.

Practical Implementation Issues
In order to improve the speed and stability of our online fitting procedures, we constrain the search space for finding E(i) as follows:
• We add or subtract at most one epidemic at each t.
• In updating the vector of epidemics at time ti, the start times and types of all currently-fitted subepidemics are assumed to be fixed. Other parameters of subepidemics are free and can be updated.
• If an epidemic is added, we use a heuristic to determine its type based on the residual process prior to adding this epidemic.
• SIR-type processes are assumed to start with a single infected individual. Henceforth this parameter is suppressed in the notation.
In view of the former assumptions, let Nδ(E) denote the δ-neighbourhood of subepidemic E. For an SIR process, this neighbourhood is defined as, whereas for the IR and baseline, the neighbourhood is, and, respectively. With a slight abuse of notation, the neighbourhood of a vector of epidemics is defined as Notice that we only allow changes of the start time for the epidemic which was added last and keep the start time of the preceding epidemics fixed.
Our practical experience to date is that a value of δ = 20 yields good results; this corresponds to a large enough window to provide start time flexibility while maintaining computational feasibility.
With the notation introduced above, our heuristic online fitting algorithm is shown in Algorithm 1. Here ite(cond, a, b) is an if-then-else function that returns a when cond is true and b otherwise. Informally, the algorithm can be described as follows. First, as there is insufficient information if only the first few data points are known, we set For each additional data point, we do the following.
1. We first check if the target coefficient of determination can be attained by parametrising the current set of epidemics. The optimal set within the neighbourhood of the current epidemics is denoted $\hat{E}(i)$ and the corresponding coefficient of determination is denoted $\hat{r}^2(i)$.
2. If $\hat{r}^2(i) \geq r^2_{\mathrm{target}}$, we try to reduce the number of epidemics. Therefore, we try to attain the target coefficient of determination without the last epidemic (provided that there is more than one epidemic). The optimal set without the last epidemic is denoted $\tilde{E}(i)$ and the corresponding coefficient of determination is denoted $\tilde{r}^2(i)$. If $\tilde{r}^2(i) \geq r^2_{\mathrm{target}}$, we can reduce the number of epidemics and set $E(i) = \tilde{E}(i)$. If not then we set $E(i) = \hat{E}(i)$ and move on to the next data point.
3. If $\hat{r}^2(i) < r^2_{\mathrm{target}}$, we consider adding an epidemic. To determine the type of the epidemic (sir or ir), we first calculate the residual vector $\mathbf{z}_i = \mathbf{y}_i - f_{\hat{E}(i)}(\mathbf{t}_i)$.
Let μ(i) be the sample mean of $\mathbf{z}_i$ and let σ(i) be the sample standard deviation of $\mathbf{z}_i$. As the new epidemic should be located at the end of the residual, let $\mathbf{z}_{i-\kappa+1:i}$ be the last κ data points in $\mathbf{z}_i$.
• The new epidemic type is sir if the minimum value in $\mathbf{z}_{i-\kappa+1:i}$ exceeds μ(i) + 2σ(i). In our experiments, we found that κ = 2 yields good results.
• The new epidemic type is ir if the most recent residual exceeds μ(i) + 6σ(i).
• If neither ir nor sir is detected, we set $E(i) = \hat{E}(i)$, and issue a warning that $r^2_{\mathrm{target}}$ cannot be attained at $t_i$ due to no epidemic being detected.
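The residual-based type detection can be sketched as follows. This is an illustration under stated assumptions rather than the prototype's code: the function name `detect_epidemic_type` is hypothetical, the sir test is evaluated before the ir test (the text does not state an order), and the mean and sample standard deviation are computed over the whole residual vector including its most recent points.

```python
import statistics


def detect_epidemic_type(residuals, kappa=2, sir_sigmas=2.0, ir_sigmas=6.0):
    """Classify a candidate new epidemic from the residual vector z_i.

    Returns "sir" if the minimum of the last `kappa` residuals exceeds
    mean + 2 standard deviations, "ir" if the most recent residual exceeds
    mean + 6 standard deviations, and None if neither test fires.
    """
    if len(residuals) < max(kappa, 2):
        return None  # too few points for a sample standard deviation
    mu = statistics.mean(residuals)
    sigma = statistics.stdev(residuals)  # sample (n - 1) standard deviation
    if min(residuals[-kappa:]) > mu + sir_sigmas * sigma:
        return "sir"  # sustained excess over the last kappa points: growth
    if residuals[-1] > mu + ir_sigmas * sigma:
        return "ir"   # single very large excess at the last point: spike
    return None
```

Because the outlying point itself inflates the sample standard deviation, a single residual can only exceed the six-sigma threshold once several dozen residuals are available, so a spike is unlikely to be detected very early in a trace under this sketch.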
If an epidemic is detected, we extend $\hat{E}(i)$ with the detected epidemic $E^{(d)}$ started at time $t_i$, and let $\breve{E}(i)$ be the optimal vector of epidemics in the neighbourhood of this extended vector. The corresponding coefficient of determination is denoted $\breve{r}^2(i)$. If $\breve{r}^2(i) \geq r^2_{\mathrm{target}}$ then we add the new epidemic and set $E(i) = \breve{E}(i)$. If not then $\breve{r}^2(i)$ is below the target value; however, we may still be able to improve on the current fit. So we check if $\breve{r}^2(i) > \hat{r}^2(i)$, where $\hat{r}^2(i)$ is the coefficient of determination of the current fit $\hat{E}(i)$; if this is the case we set $E(i) = \breve{E}(i)$ and issue a warning that $r^2_{\mathrm{target}}$ could not be attained at time $t_i$, even though an epidemic has been added. Finally, if $\breve{r}^2(i) \leq \hat{r}^2(i)$, then the new epidemic did not improve the fit; in this case we set $E(i) = \hat{E}(i)$ and issue a warning that $r^2_{\mathrm{target}}$ could not be attained at time $t_i$.
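The decision logic applied at each new data point (keep, reduce or extend the set of epidemics) can be summarised as a small function. This is a sketch only: it assumes the three candidate fits have already been computed by some optimiser, which is not shown, and the name `select_fit`, the `(model, r2)` pair representation and the warning strings are all hypothetical.

```python
def select_fit(r2_target, current, reduced=None, extended=None):
    """Choose E(i) from candidate fits, each given as a (model, r2) pair.

    current  -- re-parametrised current set of epidemics and its r^2
    reduced  -- fit without the last epidemic, or None if not applicable
    extended -- fit with a newly detected epidemic, or None if none detected
    Returns (chosen_model, warning), where warning is None on success.
    """
    model_cur, r2_cur = current
    if r2_cur >= r2_target:
        # Target met: prefer the smaller model if it also meets the target.
        if reduced is not None and reduced[1] >= r2_target:
            return reduced[0], None
        return model_cur, None
    if extended is None:
        return model_cur, "target not attained: no epidemic detected"
    model_ext, r2_ext = extended
    if r2_ext >= r2_target:
        return model_ext, None
    if r2_ext > r2_cur:
        # Still below target, but the extra epidemic improves the fit.
        return model_ext, "target not attained although an epidemic was added"
    return model_cur, "target not attained"
```

Keeping the decision separate from the numerical optimisation makes the parsimony rule easy to inspect: an epidemic is only retained when it either reaches the target or strictly improves on the current fit.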

RESULTS
We demonstrate our technique's applicability on a synthetic data trace (derived by composing two time-shifted SIR model traces to produce a double epidemic model) and real data including swine flu data and BitTorrent downloads of music artist Robin Thicke. The BitTorrent download data were retrieved via the MusicMetric API (an online artist analytics toolbox that contains detailed information on fan trends for particular artists).

Synthetic Double Epidemic Model (2 SIR models)
This dataset was created by the superposition of two time-shifted stochastic simulation trajectories of SIR epidemics with known parameters. The combined epidemic was then used as input to a prototype implementation. As observed in Figure 2, on day 56 the fit of the epidemic component is proceeding well ($r^2 = 0.999$) and the short-term prediction quality is good as the model matches the forthcoming decay of the epidemic. The estimated parameters are close to the known parameters. The estimated parameters become even closer to the actual parameters as the data points move towards the introduction of the second epidemic on day 91. Our framework realises the need for a second epidemic and begins the fitting procedure again. On day 103 the quality of the fit to past data is good (as $r^2 = 0.999$) and the model predicts the downward trend. By the end of the epidemic, the model fitted the data and estimated the parameters of both epidemics well.
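A data set of this kind could be generated along the following lines. This is a hedged sketch, not the procedure used for the experiment: it assumes a Gillespie-style exact stochastic simulation of the mass-action SIR model sampled once per day, and every numerical value a caller passes in (population size, rates, shift) is a placeholder, since the known parameters of the synthetic trace are not reproduced here. The names `gillespie_sir_daily` and `superpose` are hypothetical.

```python
import random


def gillespie_sir_daily(S0, I0, beta, gamma, days, rng):
    """Stochastic SIR trajectory; returns infected counts at days 0..days-1."""
    S, I, t = S0, I0, 0.0
    daily = []
    for day in range(days):
        while True:
            rate_inf = beta * S * I   # S + I -> 2I
            rate_rec = gamma * I      # I -> R
            total = rate_inf + rate_rec
            if total == 0:
                break                 # epidemic is extinct
            dt = rng.expovariate(total)
            if t + dt > day:
                # Exponential waiting times are memoryless, so the overshoot
                # can be discarded and the clock restarted at the sample day.
                t = float(day)
                break
            t += dt
            if rng.random() * total < rate_inf:
                S, I = S - 1, I + 1
            else:
                I -= 1
        daily.append(I)
    return daily


def superpose(first, second, shift):
    """Add `second`, delayed by `shift` days, onto `first` (additive epidemics)."""
    total = list(first)
    for k, value in enumerate(second):
        if shift + k < len(total):
            total[shift + k] += value
    return total
```

A combined trace is then obtained with, for example, `superpose(a, b, 90)` for two trajectories `a` and `b` drawn from a seeded `random.Random` instance, which keeps the experiment reproducible.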

Swine flu 2009 reported cases in the UK
In 2009, there was a global outbreak of a new strain of influenza A virus subtype H1N1 (colloquially called swine flu) which was termed a pandemic by the World Health Organization. We use weekly reported swine flu cases in 2009 in England as provided by the Health Protection Agency [4], and successfully fit a double epidemic. On week 9 in Figure 3 the model has detected and fitted the swine flu data with past $r^2 = 0.916$. On week 22, the model detects with good precision the subsequent downwards evolution of the second outbreak. Investigating the biological interpretation of our model would be an interesting exercise, but one which is beyond the scope of the present paper.

Robin Thicke BitTorrent Downloads
This dataset begins with the release of Robin Thicke's album Blurred Lines, the title track of which was the best-selling song of 2013 in the UK and the second best-selling song of 2013 in the US. Each data point represents the number of daily downloads of Robin Thicke's songs, and it is these downloads that we presume are the manifestation of a number of underlying epidemic spreading processes.
Figure 4 presents a monoepidemic fit in the style of the work described in [18] which clearly demonstrates the inability of a single-epidemic model to reflect adequately the complexity of this kind of data set. Indeed the $r^2$ value is just 0.485. As illustrated in Figure 5, the synthedemic model fit with $r^2_{\mathrm{target}} = 0.9$ fares much better. On day 94 the model not only fits the historical data extremely well but also predicts the peak of the initial growth phase accurately, albeit that the model tends to overestimate near-term future download counts. On day 135 a sudden peak in the data is observed corresponding to Robin Thicke's performance of Blurred Lines on the TV show Jimmy Kimmel Live. The model not only detects this as a second epidemic, but also predicts the downward trend with good accuracy. On day 206, we observe a new short and sharp outbreak which is detected and fitted well. This corresponds to the infamous live performance of Blurred Lines by Robin Thicke and Miley Cyrus at the 2013 MTV Video Music Awards. Last but not least, our model has successfully detected the outbreak on day 254 corresponding to Robin Thicke's live performance on the X Factor results show.
We note there is clear potential to improve historical fit and prediction quality through the application of appropriate residual refinement techniques, e.g. use of an autoregressive (AR) process to characterise the variability of residuals.

CONCLUSION
This paper has proposed a novel framework for quantitative modelling and prediction based on the analysis and synthesis of multiple epidemic models. Using a surprisingly low number of synthesised epidemics, a prototype implementation of this framework is able to adequately characterise the evolution of an artificially-generated data set and real-world data sets based on swine flu data and daily BitTorrent downloads. Model fitting can be performed in an online manner as an outbreak of interest unfolds. The short-term model predictions are generally encouraging, although they have not been subjected to rigorous quantitative analysis in the present work and there is scope to improve predictions using residual refinement techniques.
There are several possible directions for future work. Firstly, although we have focused on internet-based phenomena, we anticipate that our methodology may be readily applied in many other domains that might arguably be driven by underlying epidemic-like phenomena, such as computer viruses and retail sales. To facilitate this, we believe our prototype implementation could be extended in order to support epidemic model selection from a broader range of candidate models, as well as to support negative epidemic terms.
We also plan to incorporate event prediction with prior knowledge of upcoming events (where applicable) in order to improve predictive ability. We further plan to investigate how appropriate confidence intervals can be computed on synthedemic model predictions. Last but not least, we would like to explore the range of potential dependencies between epidemics and their host populations and what implications these may have for the synthedemic paradigm.

Figure 1: The proposed synthedemic modelling and prediction framework.