Modeling user and topic interactions in social networks using Hawkes processes

We present in this paper a framework to model information diffusion in social networks based on linear multivariate Hawkes processes. Our model exploits the effective broadcasting times of information by users, which guarantees a more realistic view of the information diffusion process. The proposed model takes into consideration not only interactions between users but also interactions between topics, which provides a deeper analysis of influences in social networks. We provide an estimation algorithm based on nonnegative matrix factorization techniques, which together with a dimensionality reduction argument is able to discover, in addition, the latent community structure of the social network. We also provide several numerical results of our method.


INTRODUCTION
There has been a steady increase of interest in point processes for modeling information diffusion in networks (see [16,25,29,31,32]). Information diffusion is the phenomenon in social networks whereby users broadcast information to others in the network; for example, on Twitter, users can "tweet". By tweeting, users broadcast information to the network; however, only those capable of receiving these tweets can retrieve the information, i.e., one must follow the user in question to be able to see his or her tweets. This sequence of broadcasts by users is called an information cascade.
Following this principle, in this paper we model the information diffusion in a social network by a linear multivariate Hawkes process (see [10,17]). A Hawkes process is a point process whose intensity increases when an event occurs, hence allowing one to decouple two very different phenomena: the information diffusion due to the willingness of users to propagate their information, and the viral network effect of receiving information from neighbours and retransmitting it. One can think of a group of people participating in a webchat: even though everyone has an intrinsic willingness to discuss, post and interact, if at a given time someone posts a comment on a subject, this comment increases the chance that others will also comment on the same subject, and so on (see for example [4]).
A different reason for the use of point processes in information diffusion models is that they take into consideration the broadcast times of users, whereas standard information cascade models consider time to be discrete, i.e., time only evolves when events occur.
Our motivation comes from information cascade models (see [7,8,19,20,26,30]), where one studies the propagation of information in a social network as a cascade of broadcasts by its users. For example: in [29], Yang and Zha study the dissemination of memes in social networks with linear Hawkes processes and couple the point process with a language model in order to estimate the memes. They provide a variational Bayes algorithm for the joint estimation of the language model, the influence of users and their intrinsic diffusion rates; however, they do not take into consideration the influence that memes may have on one another; moreover, they propose the estimation of the entire social network, ignoring the eventual lack of communication between users. In [19], Myers and Leskovec study the variation of the probability of retransmitting information due to previous exposure to different types of information; they found that, for Twitter, the retransmission probabilities change drastically; however, their approach does not take into consideration the time between broadcasts of information or the topology of the network. And in [20], Myers et al. study the influence of externalities from other nodes on information cascades in networks; they use a point process approach, in which the times of infection are essential for the estimation of parameters, but the topological properties of the network are of secondary concern in their work.
In this paper, we first model information dissemination by linear Hawkes processes using the full information of the broadcast times and we derive not only the influence of users on each other, but also the influence of different types of topics on each other, as in [19,29].
We follow the ideas in [23,24] to derive a nonnegative matrix factorization (see [6,13,14]) cyclic estimation for the system parameters -the influence of agents, the influence of topics and the intrinsic diffusion rates.

LINEAR HAWKES PROCESSES
A multivariate linear Hawkes process (see [10,17] for more details) is a self-exciting orderly point process X_t, t ∈ [0, τ], in R^R with an intensity for the r-th coordinate of the form

λ^r_t = μ_r + Σ_{r'=1}^R ∫_0^t φ_{r,r'}(t − s) dX^{r'}_s,    (1)

where φ_{r,r'}, r, r' ∈ {1, 2, ..., R}, are positive kernel functions responsible for the temporal decay of all the interactions which happened in the past, and μ_r ≥ 0 is the intrinsic (baseline) intensity of the point process for the r-th coordinate. In our framework, for example, each coordinate X^r_t of the Hawkes process could represent the number of broadcasts by user r until time t.
The use of self-exciting processes here highlights the need for a theory that can model the interaction between people having a conversation or exchanging messages: imagine two people messaging each other through SMS. Normally each one would have his or her own rhythm of messaging (this is modelled by the intrinsic rate μ in Eqn. (1)), but due to the self-excitation between these people, they will text faster than they normally would, in order to answer each other; one can model this effect with the temporal kernel φ_{r,r'} in Eqn. (1).
By orderly, we mean that almost surely X_t does not have more than one jump occurring at the same time (see [5] for a more formal definition). By the standard theory of point processes (see [5]), an orderly point process is completely characterized by its intensity, which in this case is itself a stochastic process.
Throughout this paper we consider kernels of the form

φ_{r,r'}(t) = K_{r,r'} φ(t),    (2)

where K_{r,r'} ≥ 0 are the entries of the R × R interaction matrix K, which represents the interaction between coordinate r and coordinate r'. In other words, the kernels φ_{r,r'} all share the same time-decaying function φ, and the interactions between different coordinates differ only in intensity, not in type.
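As a concrete illustration, the intensity in Eqn. (1) with kernels K_{r,r'} φ(t) can be evaluated directly from an event history. The following is a minimal sketch (numpy assumed), using an exponential φ(t) = β e^{−βt} and illustrative parameter values; it is a toy example, not the paper's implementation.

```python
import numpy as np

def hawkes_intensity(t, mu, K, events, beta=1.0):
    """Intensity of a linear Hawkes process at time t, with kernels
    phi_{r,r'}(s) = K[r, r'] * beta * exp(-beta * s).
    `events` is a list of (time, coordinate) pairs with time < t."""
    lam = mu.astype(float).copy()
    for s, r_src in events:
        if s < t:
            # a past event in coordinate r_src excites every coordinate r
            lam += K[:, r_src] * beta * np.exp(-beta * (t - s))
    return lam

mu = np.array([0.5, 0.2])              # baseline rates mu_r
K = np.array([[0.1, 0.3],              # interaction matrix K_{r,r'}
              [0.2, 0.1]])
events = [(0.5, 0), (1.0, 1)]          # (jump time, coordinate) pairs
lam = hawkes_intensity(2.0, mu, K, events)
```

Each past event adds a decaying contribution K_{r,r'} β e^{−β(t−s)} to every coordinate, on top of the baseline μ.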

INFORMATION DIFFUSION BY LINEAR HAWKES PROCESSES
After this brief introduction to linear Hawkes processes, we turn to the cornerstone of our paper: information diffusion. As mentioned before, information diffusion over social networks is to be interpreted as the broadcasting of messages by users. These messages are assumed to have a specific topic: it can be politics, sports, religion, films, etc.
The broadcasting of messages can be measured in various ways, depending on the application: counting tweets or retweets, checking the history of a conversation in a chat room, etc. They all have one thing in common: messages are broadcast by a set V of users in a social network.
In our context, the network is a generic social network, defined as a communication graph G = (V, E), where V is the set of users, with cardinality |V| = N, and E is the edge set, i.e., the set of all possible communication links between users. We assume this graph to be directed and unweighted, and coded by an inward adjacency matrix A such that A_{i,j} = 1 if user j is able to broadcast messages to user i, and A_{i,j} = 0 otherwise. If one thinks about Twitter, A_{i,j} = 1 means that user i follows user j and receives the news published by user j in his or her timeline.
We also assume that each message has a specific content or topic, meaning that one can discern the message's major topic and label it with one of K different topics. These topics could be economics, politics, religion, sports, music, cinema, etc.
In light of these assumptions, we model the number of messages broadcast by users as a linear Hawkes process X_t, where X^{i,k}_t is the cumulative number of messages of topic k broadcast by user i in the time interval [0, t]. In other words, our Hawkes process is an R^{N×K}-valued point process.
We define our temporal kernel functions as φ (i,j),(c,k) (t), which measures the temporal influence of a broadcast of a message about topic c by user j on the broadcast of a message about topic k by user i.

Intensity
Having explicitly stated the basic assumptions, and using the machinery of linear Hawkes processes, we factorize our kernel functions into two parts: the influence of users on other users (given by the N × N matrix J) and the influence of topics on other topics (given by the K × K matrix B). We thus have the full form of our kernel functions

φ_{(i,j),(c,k)}(t) = J_{i,j} B_{c,k} φ(t).    (3)

Given the full form of our kernel functions, the final form of our intensities is

λ^{i,k}_t = μ_{i,k} + Σ_{j,c} J_{i,j} B_{c,k} ∫_0^t φ(t − s) dX^{j,c}_s,    (4)

which in matrix form can be written as

λ_t = μ + J (φ * dX)_t B,    (5)

where (φ * dX)_t is the N × K matrix with entries ((φ * dX)_t)_{j,c} = ∫_0^t φ(t − s) dX^{j,c}_s. As said before, not all users can communicate among themselves; hence one must take into consideration the inward adjacency matrix A given by the underlying structure of the social network. This is done through the relation A_{i,j} = 0 ⇒ J_{i,j} = 0. In vector form, we have

v(λ_t) = v(μ) + (B^T ⊗ J) v((φ * dX)_t),

where v(λ) is the (column-stacking) vectorization of a matrix λ and ⊗ is the Kronecker product.
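The equivalence between the matrix and Kronecker-vectorized forms of the intensity can be checked numerically; a small sketch with arbitrary random matrices and illustrative dimensions (numpy assumed, column-stacking vectorization):

```python
import numpy as np

rng = np.random.default_rng(0)
N, Ktop = 4, 3                      # users and topics (illustrative sizes)
J = rng.random((N, N))              # user-user influence
B = rng.random((Ktop, Ktop))        # topic-topic influence
mu = rng.random((N, Ktop))          # baseline rates
Phi = rng.random((N, Ktop))         # Phi[j, c] stands in for (phi * dX^{j,c})_t

# matrix form of the intensity
lam_matrix = mu + J @ Phi @ B

# vectorized form: v(lambda) = v(mu) + (B^T kron J) v(Phi),
# using column-stacking vectorization (Fortran order)
vec = lambda M: M.reshape(-1, order="F")
lam_vec = vec(mu) + np.kron(B.T, J) @ vec(Phi)
```

This is the identity v(J Φ B) = (B^T ⊗ J) v(Φ), which underlies the vectorized intensity above and makes the discretized intensity a linear function of each parameter matrix.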

Parameter estimation by cyclic descent
After defining our model in Eqn. (4), we proceed to the estimation of the user-user influence parameter J, the topic-topic influence parameter B and the intrinsic rates μ, following [23]: let (t^{j,c}_n)_n be the jump times of the point process X^{j,c}_t. Define δ_min = min_{(i,k,n') ≠ (j,c,n)} |t^{i,k}_{n'} − t^{j,c}_n| as the minimum elapsed time between jumps of X_t in [0, τ], and fix δ < δ_min.
We divide [0, τ] into T = ⌈τ/δ⌉ time bins, such that we do not have more than one jump of X_t in each bin, in order to preserve the orderliness property of X_t.
Define also the NK × T matrices Y, λ, μ and φ such that Y_{(i,k),t} is the number of jumps of X^{i,k} in the t-th time bin (0 or 1 by construction of δ), λ and μ collect the discretized intensities and baseline rates, and φ_{(j,c),t} is the discretized convolution (φ * dX^{j,c}) evaluated at the t-th bin. With this notation, the discretized version of the intensity reads

λ = μ + (B^T ⊗ J) φ.    (6)
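A minimal sketch of the binning step (numpy assumed; jump times and flattened coordinate indices are given as lists, and the names are illustrative):

```python
import numpy as np

def bin_jumps(times, coords, NK, tau, T):
    """Discretize jump times over [0, tau] into T bins of width delta = tau/T.
    Y[r, t] counts the jumps of (flattened) coordinate r falling in bin t;
    if delta is below the minimal inter-jump gap, every entry of Y is 0 or 1."""
    Y = np.zeros((NK, T))
    delta = tau / T
    for t_n, r in zip(times, coords):
        b = min(int(t_n / delta), T - 1)   # clamp t_n == tau into the last bin
        Y[r, b] += 1
    return Y

Y = bin_jumps([0.3, 1.1, 2.6], [0, 1, 0], NK=2, tau=3.0, T=30)
```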

Maximum likelihood estimation and nonnegative matrix factorization
With the Hawkes intensity discretized, we proceed to the maximum likelihood estimation of the Hawkes parameters. We begin by showing that maximizing the Riemann-sum approximation of the log-likelihood of X is equivalent to minimizing the Kullback-Leibler (KL) divergence between the jumps Y of X and the intensity λ,

D_KL(Y | λ) = Σ_{(i,k),t} d_KL(Y_{(i,k),t} | λ_{(i,k),t}),

where d_KL(x | y) = x log(x/y) − x + y is the (generalized) Kullback-Leibler divergence between x and y.
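The generalized KL divergence used here can be written entrywise as a short sketch (numpy assumed, with the convention 0 log 0 = 0):

```python
import numpy as np

def d_kl(x, y):
    """Generalized KL divergence d_KL(x | y) = x log(x/y) - x + y, summed
    over all entries; 0 log 0 is taken to be 0, and y must be positive."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    terms = np.where(x > 0, x * np.log(np.where(x > 0, x / y, 1.0)), 0.0) - x + y
    return terms.sum()
```

Unlike the usual KL divergence, this form does not require x and y to be normalized, which is what allows comparing jump counts Y to intensities λ.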
Proof. The log-likelihood of X is given by (see for example [5,21])

L = Σ_{i,k} ( ∫_0^τ log λ^{i,k}_t dX^{i,k}_t − ∫_0^τ λ^{i,k}_t dt ).

Approximating the integrals in L by their Riemann sums over the T time bins, we get

L ≈ Σ_{(i,k),t} ( Y_{(i,k),t} log λ_{(i,k),t} − δ λ_{(i,k),t} );

thus maximizing the approximation of L is equivalent, absorbing the bin size δ into λ, to minimizing Σ_{(i,k),t} ( λ_{(i,k),t} − Y_{(i,k),t} log λ_{(i,k),t} ). With Y fixed, this is equivalent to minimizing

D_KL(Y | λ).    (7)

Following Eqn. (6), λ is a linear combination of several matrices with positive entries, hence the minimization of Eqn. (7) can be solved by nonnegative matrix factorization (NMF) algorithms (see [6,14]).
Unfortunately, the NMF objective is not jointly convex in the ensemble of matrices. Nevertheless, it is convex (due to the convexity of the Kullback-Leibler divergence in this case) in each matrix when all the others are fixed. It can be shown (see [6,11,14]) that estimating each matrix with the rest fixed, in a cyclic way, produces nonincreasing values of Eqn. (7), thus converging to a local maximum of the approximate log-likelihood.
Due to the overwhelming number of user-user interaction parameters J_{i,j} in real-life social networks (where we may have N ∼ 10^8), we factorize J = F G, with F ∈ M_{N×d}(R_+), G ∈ M_{d×N}(R_+) and d ≪ N. This method is similar to clustering our social network communication graph into different communities (see [12]).
Since the composition of a linear function with a convex function remains convex, D_KL(Y | μ + (B^T ⊗ F G) φ) is still convex as a function of each single matrix when the others remain fixed, and the NMF updates still converge to a local minimum of D_KL.
Lemmas 2, 3, 4 and 5 give us the exact form of the NMF multiplicative estimation updates for the Hawkes parameters J = F G, B and μ.

Estimation of F
We now proceed to the estimation of the first user-user influence matrix factor F by NMF techniques. It is also extremely desirable to uphold the constraint A_{i,j} = 0 ⇒ J_{i,j} = (F G)_{i,j} = 0, i.e., we must estimate F and G such that we keep the communication graph unaltered. This is a difficult problem, since the NMF updates destroy this relationship, and the only other way to enforce it is to estimate each coordinate separately. Since A_{i,j} ∈ {0, 1}, we can circumvent this problem using a convex relaxation of the constraint, of the form η g(⟨1 − A, F G⟩), with g : R_+ → R_+ a convex function, η ≥ 0 a penalization parameter and ⟨·, ·⟩ the Frobenius inner product.
Choosing, for example, g linear, we have the penalization η_F ⟨1 − A, F G⟩.

Lemma 2. Write ρ^k for the N × T matrix with entries ρ^k_{j,t} = Σ_c B_{c,k} φ_{(j,c),t}, and Y^k, λ^k for the corresponding N × T blocks of Y and λ. We have the following multiplicative² estimates for F:

F ← F ⊙ [ Σ_k (Y^k ⊘ λ^k)(G ρ^k)^T ] ⊘ [ Σ_k 1 (G ρ^k)^T + η_F (1 − A) G^T ],    (8)

where η_F (1 − A) G^T, with η_F ≥ 0, is a convex penalization term responsible for A_{i,j} = 0 ⇒ (F G)_{i,j} = 0, i.e., we do not estimate interactions outside the underlying network structure.
Proof. First of all, writing F_i for the rows of F and ρ_t for the columns of ρ (with ρ^k_t the columns of the submatrices ρ^k), one can expand the penalized cost D^F_KL = D_KL(Y | λ) + η_F ⟨1 − A, F G⟩ row by row in F_i and column by column in ρ_t. Since D^F_KL is a sum of convex functions, it is still a convex function of F, and we can use the multiplicative update rule for F given by [6,14]. Since the penalization term η_F (1 − A) G^T has all its entries nonnegative, it is added to the denominator of the NMF updates, as in [6]. Following [6], we can rewrite the multiplicative updates with the linear penalization as Eqn. (8).

¹ From now on, we denote by 1 any vector or matrix with all entries equal to 1; its dimension will be clear from the context.
² For two matrices A and B of the same dimensions, we denote by A ⊘ B their entrywise division and by A ⊙ B their entrywise product.
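As a sanity check of the penalized multiplicative update, the following is a simplified single-topic sketch (K = 1, B = 1, so λ = μ + F G ρ), with synthetic data and illustrative names; it only verifies that the penalized cost does not increase when F alone is updated.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, T = 6, 2, 300
A = (rng.random((N, N)) < 0.6).astype(float)   # toy adjacency matrix
rho = rng.random((N, T))                       # stand-in for the convolution (phi * dX)
mu = np.full((N, T), 0.05)                     # baseline, kept fixed here
J_true = A * rng.random((N, N)) * 0.5
Y = rng.poisson(mu + J_true @ rho).astype(float)

def kl_cost(Y, lam):
    return np.sum(np.where(Y > 0, Y * np.log(np.where(Y > 0, Y / lam, 1.0)), 0.0) - Y + lam)

F = rng.random((N, d))
G = rng.random((d, N))
eta_F = 1e2
costs = []
for _ in range(30):
    GR = G @ rho                                           # d x T
    lam = mu + F @ GR
    num = (Y / lam) @ GR.T                                 # N x d numerator
    den = np.ones_like(Y) @ GR.T + eta_F * (1 - A) @ G.T   # penalized denominator
    F *= num / den                                         # multiplicative update, K = 1
    costs.append(kl_cost(Y, mu + F @ (G @ rho)) + eta_F * np.sum((1 - A) * (F @ G)))
```

The penalty η_F (1 − A) G^T only enlarges the denominator on entries outside the support of A, driving the corresponding entries of F G toward zero.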

Estimation of G
We now proceed to the estimation of the second user-user influence matrix factor G, using the same ideas applied to the estimation of F. Again we must use a convex relaxation of the constraint A_{i,j} = 0 ⇒ J_{i,j} = (F G)_{i,j} = 0; the derivative of this relaxation with respect to G takes the form η_G F^T (1 − A).
Unfortunately, since F and G act as a product, there is a potential identifiability issue of the form F G = (F M)(M^{-1} G) = F' G', where M is any scaled permutation and the pair F' = F M, G' = M^{-1} G is also a valid factorization of J (see [13,18,23]). We deal with this issue by normalizing the rows of G to sum to 1 (see [13,23]). This normalization step involves the resolution of a nonlinear system for each row of G to find the associated Lagrange multipliers.
Our constraint thus becomes G 1 = 1, for which the Karush-Kuhn-Tucker (KKT) conditions are written in matrix form as η̄_G = Σ_{i=1}^d η_{G,i} e_i 1^T, with η_{G,i} ∈ R the Lagrange multipliers, solutions of the nonlinear equation G 1 = 1 after the update.
Lemma 3. We have the following multiplicative updates for G:

G ← G ⊙ [ Σ_k F^T (Y^k ⊘ λ^k)(ρ^k)^T ] ⊘ [ Σ_k F^T 1 (ρ^k)^T + η_G F^T (1 − A) + η̄_G ],    (9)

where η̄_G is a d × N matrix composed of the Lagrange multipliers, solutions of the nonlinear equation G 1 = 1, and η_G F^T (1 − A), with η_G ≥ 0, is responsible for A_{i,j} = 0 ⇒ (F G)_{i,j} = 0.
Proof. Firstly, we compute the gradient of the penalized cost with respect to G; using the same arguments as for F, we obtain the update rule for G given by Eqn. (9).
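The exact Lagrange-multiplier computation is not reproduced here, but the reason the normalization G 1 = 1 is harmless can be sketched directly: row scales of G can be absorbed into F without changing the product J = F G (numpy assumed, names illustrative).

```python
import numpy as np

def renormalize(F, G, eps=1e-12):
    """Rescale the rows of G to sum to 1 and absorb the inverse scales
    into the columns of F, leaving the product F @ G unchanged."""
    s = G.sum(axis=1, keepdims=True)   # d x 1 vector of row sums
    s = np.maximum(s, eps)             # guard against all-zero rows
    return F * s.T, G / s

F0 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
G0 = np.array([[0.2, 0.6],
               [1.0, 3.0]])
F1, G1 = renormalize(F0, G0)
```

This resolves exactly the diagonal part of the scaled-permutation ambiguity F G = (F M)(M^{-1} G).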

Estimation of B
For the estimation of the topic-topic influence matrix B, notice that we still need to normalize the rows of B to sum to 1, for the same reasons as for G, since B appears multiplying J = F G in Eqn. (5).
Lemma 4. We have the following multiplicative updates for B:

B_{c,k} ← B_{c,k} ⟨J φ^c, Y^k ⊘ λ^k⟩ / ( ⟨J φ^c, 1⟩ + (η̄_B)_{c,k} ),    (10)

where φ^c is the N × T block of φ corresponding to topic c, J = F G, and η̄_B is a matrix composed of the Lagrange multipliers, solutions of the nonlinear equation B 1 = 1.
Proof. Firstly, we compute the gradient of D_KL with respect to B; by the same principle as in the estimation of F and G, the updates for B are given by Eqn. (10).

Estimation of μ
Applying the same techniques, we can estimate the users' intrinsic rate matrix μ.
Lemma 5. We have the multiplicative updates for μ:

μ ← μ ⊙ (Y ⊘ λ).    (11)

Proof. By the same token, since ∂λ/∂μ is the identity, the gradient of D_KL with respect to μ is 1 − Y ⊘ λ, giving us the multiplicative updates in Eqn. (11).

Complexity
NMF estimation relies on multiplicative updates, using only entrywise operations and matrix products, which are fast and easily performed in a distributed fashion. Hence, at each step of the cyclic descent procedure, we have the following complexity for the updates, written in terms of the number of users N, the number of topics K, the factorization dimension d and the number of time discretization steps T: • The complexity for the numerator and denominator updates of F and of G is O(dKNT) each. • The complexity for the numerator and denominator updates of B is O(K²NT).
• The complexity for μ is O(dKN T ).
For the complexity of G and B, we also have to take into consideration the calculation of the Lagrange multipliers η̄_G and η̄_B. These multipliers are calculated using convex optimization techniques. However, the complexity of these calculations is not greater than the complexity of the multiplicative updates for G or B.

Total complexity of the updates
The complexity of each cyclic updating step (updating F with the rest fixed, updating G with the rest fixed, updating B with the rest fixed and updating μ with the rest fixed) is thus O(dKNT + K²NT) if N ≪ T, which is normally the case (we usually have considerably more messages than users). Thus, we achieve a linear complexity in the dataset, basically dictated by N and T, since K ≪ N, K ≪ T and d ≪ N.

Complexity for J without the factorization F G
Following the same calculations as for the complexity of F using Eqn. (8), we get that the complexity for updating J directly, without the factorization, is O(KN²T). By the same token, every time we use the factorization J = F G to compute the other multiplicative updates for B and μ, we have to calculate λ, which has a complexity of O(dKNT). If we cannot factorize J, this complexity becomes O(KN²T), which is much larger than O(dKNT) since d ≪ N.
This proves that the dimensionality reduction J = F G is crucial to obtain a linear complexity in the data.

Additional remarks
Our estimation method, based on the maximum likelihood of the point process X_t and on nonnegative matrix factorization techniques, requires the NMF parameter d. This parameter is ad hoc and must be chosen beforehand. However, Tan and Févotte derive in [27] an automatic way of finding the optimal d during the NMF updates. They do so by considering the NMF procedure for the β-divergence (of which the KL divergence is a particular case) as a Bayesian estimation of an underlying probabilistic model.
One known setback in the NMF framework is the convergence to local minima of the cost function, which means that the initial condition is crucial for a good estimation.
There are results that show how to achieve a better estimation by constructing an improved initial condition (see [1,3,28]), but they do not work here: our cost function is D_KL(Y | μ + (B^T ⊗ F G) φ), and the frameworks in [1,3] do not apply when one needs good initial conditions for J = F G, B and μ at the same time. Moreover, we do not know the true value of J = F G; our only proxy is the adjacency matrix A, which is binary (A_{i,j} ∈ {0, 1}), making it very hard to use the methods in [3,28]. We use random initial conditions for B and μ, and we factorize A into A ≈ F_A G_A, with F_A ∈ M_{N×d}(R_+) and G_A ∈ M_{d×N}(R_+), and use F_A as the initial condition for F and G_A as the initial condition for G.
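The initialization step can be sketched with a plain multiplicative-update KL-NMF of the adjacency matrix (numpy assumed; the function below is a generic textbook NMF, not the paper's implementation):

```python
import numpy as np

def nmf_kl(V, d, n_iter=200, seed=0, eps=1e-10):
    """Multiplicative-update NMF for the KL divergence, V ~ W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, d)) + eps
    H = rng.random((d, m)) + eps
    one = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (one @ H.T + eps)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ one + eps)
    return W, H

# toy adjacency matrix with two obvious communities
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
F_A, G_A = nmf_kl(A, d=2)
err = np.abs(F_A @ G_A - A).mean()
```

F_A and G_A then serve as initial conditions for F and G, so the cyclic descent starts from a factorization already consistent with the communication graph.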
If estimating parametric kernels φ of exponential or power-law type (see section 2), the convolution φ * dX must be recalculated at each NMF update, which considerably increases the running time of the algorithm, since calculating φ * dX is costlier than the NMF updates. However, for the exponential kernel we can calculate φ * dX only up to a fixed lag L, as in [29], which speeds up the algorithm.
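For the exponential kernel, the binned convolution φ * dX also admits an exact O(T) recursion, an alternative to the fixed-lag truncation of [29]; a sketch (numpy assumed, names illustrative):

```python
import numpy as np

def exp_conv(Y, beta, delta):
    """For phi(t) = beta * exp(-beta * t) and binned counts Y (NK x T), compute
    conv[:, t] = sum_{s < t} beta * exp(-beta * delta * (t - s)) * Y[:, s]
    via the recursion conv[:, t] = exp(-beta*delta) * (conv[:, t-1] + beta * Y[:, t-1])."""
    NK, T = Y.shape
    conv = np.zeros((NK, T))
    decay = np.exp(-beta * delta)
    for t in range(1, T):
        conv[:, t] = decay * (conv[:, t - 1] + beta * Y[:, t - 1])
    return conv

rng = np.random.default_rng(2)
Yb = rng.integers(0, 2, size=(3, 50)).astype(float)   # toy binned jumps
C = exp_conv(Yb, beta=2.0, delta=0.1)
```

Each step reuses the previous value, so the cost is O(NKT) instead of the O(NKT²) of a direct convolution.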
There are also attempts to derive nonparametric estimation of kernels for Hawkes processes, as in Bacry and Muzy [2].
The problem with nonparametric kernel estimation is the high dimension of our Hawkes process, i.e., N ≫ 1 and T ≫ 1; since these methods are quadrature-based methods for the convolutions, they are much slower than the parametric alternatives.

NUMERICAL SIMULATION
This section is dedicated to the estimation of our model parameters F, G, B and μ using synthetic Hawkes processes, simulated following the thinning algorithm⁴ developed by Ogata in [22]. We used, for figures 1, 2, 3 and 4, the parameters N = 100, K = 10 and an exponential temporal kernel of the form φ(t) = e^{−10t} 1_{t>0}.
Figures 1, 2, 3 and 4 are from a network composed of two cliques of size 50 with uniform random weights. We simulated our Hawkes process until time τ = 250, and used d = 51 for our factorization J = F G, with a linear penalization (as in Lemma 2) with constants η_F = η_G = 10³. We did not use cross-validation to find optimal penalization parameters η_F and η_G, since the algorithm is robust enough with respect to them. Figure 1 shows the heatmap of J = F G, where the left heatmap is the estimated Ĵ = F G and the right heatmap is the true value of J. One can clearly see that our algorithm retrieves quite well the structure behind the true J, i.e., two distinct cliques.
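Ogata's thinning algorithm can be sketched in a few lines for a one-dimensional linear Hawkes process (numpy assumed; parameters are illustrative and the O(n²) intensity evaluation is kept for clarity):

```python
import numpy as np

def thinning_hawkes_1d(mu, alpha, beta, tau, seed=0):
    """Ogata's thinning for the intensity
    lambda(t) = mu + sum_{t_n < t} alpha * beta * exp(-beta * (t - t_n)).
    Between events the intensity is nonincreasing, so its current value
    is a valid upper bound for the rejection step. Requires alpha < 1."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    def intensity(s):
        return mu + sum(alpha * beta * np.exp(-beta * (s - tn)) for tn in events)
    while t < tau:
        lam_bar = intensity(t)                       # dominating (upper-bound) rate
        t += rng.exponential(1.0 / lam_bar)          # candidate from Poisson(lam_bar)
        if t >= tau:
            break
        if rng.random() <= intensity(t) / lam_bar:   # thinning (accept/reject)
            events.append(t)
    return events

ev = thinning_hawkes_1d(mu=1.0, alpha=0.5, beta=2.0, tau=50.0, seed=3)
```

With branching ratio α = 0.5, the expected number of events is roughly μτ/(1 − α) = 100 on this horizon.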
Figure 2 shows the heatmap of the squared difference between the true J and its estimate Ĵ, i.e., for each true entry J_{i,j} and estimated entry Ĵ_{i,j} we have plotted the differences (J_{i,j} − Ĵ_{i,j})² and (J_{i,j} − Ĵ_{i,j})²/J²_{i,j} (when J_{i,j} is nonzero).
Figure 3 refers to the squared difference between B and its estimate, and figure 4 to the squared difference between the true μ and its estimate, as in figure 2.
Figure 5 is, again, related to a 2-clique network, with cliques of size 10, random edge weights following a uniform distribution, K = 1 (only one topic), and a Hawkes process simulated until τ = 20. We compare our estimation, choosing d = 10, with the estimation algorithm in [29] (with the obvious simplification of K = 1 and no language model); one can see that our algorithm (on the left) outperforms the algorithm in [29] not only in the estimation of μ, but also in the estimation of J, retrieving the community structure where the algorithm in [29] does not. Moreover, the algorithm in [29] needs an ad-hoc parameter ρ to control the sparsity of the network, which is not needed in our case.

CONCLUSIONS
We presented in this paper a general framework to model information diffusion in social networks, based on the theory of self-exciting point processes (linear multivariate Hawkes processes), and used nonnegative matrix factorization techniques to derive our estimation algorithm.
The model studied here exploits the real broadcasting times of users, a feature that comes with no mathematical overhead since we work in the framework of point processes (see [5]), which guarantees a more realistic view of the information diffusion cascades. Also, the model takes into consideration not only interactions between users (as in [29]) but also interactions between topics (as in [19]), which provides a deeper analysis of influences in social networks.
Another crucial advantage of this framework is that all the parameters are versatile and allow a variety of extensions and adaptations to real-life situations: if one has predefined labelled data, if one wants to discover the topics broadcast in the messages, or if one wants to change the shape of the temporal kernel in order to exploit a longer memory of past events. Nonnegative matrix factorization techniques are interesting here for two main reasons: the multiplicative updates derived from the optimization problem are easy to implement, even in a distributed fashion (they are basically matrix products and entrywise operations), and the complexity of the algorithm is linear in the data, allowing one to perform estimations on real-life social networks, especially if some of the parameters are already known beforehand.
One can also notice that, by performing the dimensionality reduction J = F G during our nonnegative matrix factorization estimation, we not only estimated the influence that users have on one another but also acquired information on the communities of the underlying social network, since we were able to factorize the hidden influence graph J. Here, we relied heavily on the self-exciting model to retrieve the hidden influence graph, which is different from graphs generated by other methods; for example, one could weight the communication graph A with the number of messages from each user to its neighbours, but by doing so one loses the temporal character. Moreover, the graphs found by such techniques rest on the assumption that messages directly influence other users, which may not be the case. In our Hawkes framework, influence is a byproduct of the interaction of users and information, and is therefore probabilistic: it may or may not occur at each broadcast.

Figure 1: Heatmaps of the estimated Ĵ = F G (left) and the true J (right), for a network composed of two cliques of size 50.

⁴ The thinning algorithm simulates a standard Poisson process P_t with intensity M > Σ_{i,k} λ^{i,k}_t for all t ∈ [0, τ] and selects, from the jumps of P_t, the Hawkes jumps of X^{i,k}_t.

Figure 2: Heatmap of L² differences (absolute and relative) between entries of the true J and the estimated Ĵ.

Figure 3: Heatmap of L² differences (absolute and relative) between entries of the true B and the estimated B̂.

Figure 4: Heatmap of L² differences (absolute and relative) between entries of the true μ and the estimated μ̂.