A Comprehensive Survey of Link Prediction Techniques for Social Network

The growing use of social networking sites is attracting researchers to study and analyze different aspects of social networks. Among many problems, link prediction is a fascinating one in the field of social network analysis (SNA). In SNA, link prediction is the task of identifying missing links and predicting new links. Several researchers have proposed solutions to the link prediction problem during the past two decades. However, a comprehensive overview of the significant contributions is still needed for a thorough analysis. The objective of this review is to summarize and discuss the existing link prediction algorithms in a common context for an unbiased analysis. The review is organized around a systematic categorization of the proposed algorithms, selected problems and evaluation measures, along with selected network datasets. Finally, applications of link prediction are discussed.


Introduction
Social network (SN) platforms enable social actors to perform various activities, such as sharing information, exchanging views, and learning from other social actors by following them [68]. Interactions of people in public places, e.g., students in universities or customers in restaurants and cafes, can be considered examples of offline SNs. Likewise, SNs can be online, supported by social networking websites (e.g., Twitter [69], Facebook [49]). In graph theory, an SN is represented as a social graph, where people are the nodes and their relationships or interactions are the edges (i.e., ties or links).
With the unprecedented growth of the WWW, the tendency of humans to interact, communicate and form relationships with each other has grown manifold. In just a few years, online SNs have become an essential part of our lives and provide us with opportunities to stay connected. This focus on SNs has created many opportunities for researchers from different disciplines to study and analyze various aspects of human behavior as well as the characteristics of SNs. Nevertheless, the analysis of SNs is a demanding task that faces many difficulties. Many problems related to SN analysis are being studied, including community detection [28], structural analysis of SNs [17], network visualization [71], and finding influential users [105]. Link prediction is one of the most interesting of these problems: the task is to predict the formation of a new or unknown link during a given time interval from T to T+1 using concepts from network mining. Consider the SN in Figure 1. At time T, there are three persons and two edges. The solid links show that "Ana" is a common friend of both "James" and "Jack", while "James" and "Jack" are not connected, as there is no link between them. The interesting question is whether a new link between "James" and "Jack" will appear at time T+1. The task of predicting the friendship between "James" and "Jack" is called link prediction. Link prediction has a variety of applications, e.g., friend recommendation [96], citation recommendation [83], identification of collaborators in co-authorship networks [43], identification of criminals in criminal networks [59], and item recommendation [99]. Figure 2 shows the number of published research papers matching the search keyword "Link Prediction" on DBLP (Computer Science Bibliography).

* Corresponding author. Email: writetosamadalvi@gmail.com
The significance of link prediction in various domains over the last two decades can be seen in Figures 2 and 3. Researchers from different disciplines have published hundreds of papers on link prediction in the last ten years. The largest number of publications appeared in 2019, when 183 research papers were published. This growing number of publications shows that link prediction is a challenging and interesting task for researchers. Moreover, 681 out of 1216 research papers were published in conferences and workshops, which suggests that both new and expert researchers are working in this area, as shown in Figure 3.
In the past, several useful surveys have been conducted on link prediction in social networks [55] [5] [64]. Liben-Nowell and Kleinberg [55] give useful insight into link prediction using classical prediction measures and topological features of the network; their work can be considered among the pioneering contributions on the topic. Lu et al. extended the survey by considering popular link prediction algorithms for complex networks [5]. However, they approached these contributions from the point of view of the social sciences and physics. Although the collection of algorithms considered by Lu et al. is valuable, it still requires deeper analysis for the assessment of link prediction techniques. Hassan et al. categorized link prediction methods [64] by considering three types of models: probabilistic, binary classification and linear algebraic. Although their survey is well suited to experts, it is less suitable for new researchers who want to learn about the link prediction problem.
In order to include recently proposed link prediction techniques and address the above-mentioned shortcomings of previous surveys, this paper provides a systematic and comprehensive survey of link prediction in SNs. By systematic, we mean that our focus is on studies that used various link prediction methods to conduct critical research. In addition, we propose a taxonomy to categorize the link prediction methods. To the best of our knowledge, this is the first study in the last five years that provides a complete picture of the topic of link prediction.
The organization of this article is as follows. Section 2 explains the definition of the link prediction problem in detail. Section 3 explains the different types of networks that are used for link prediction. Section 4 discusses state-of-the-art link prediction methods. Section 5 presents a detailed discussion of evaluation measures. Section 6 explains the different kinds of features. Section 7 presents link prediction applications. Finally, the conclusion is presented in Section 8.

Problem Definition
• Definition (Homogeneous Network): For a given network G = (V, E), where V is a set of nodes of the same type and E is a set of links of the same type, G is called a homogeneous network.
• Definition (Decrement Network): For a given network G = (V, E) at time t, if at time t+1, new nodes and edges appeared in G, then new decrement network at time t+1 [...]
• Definition (Mixed Network): For a given network G = (V, E) at time t, if at time t+1, some nodes and edges appeared $(V_a, E_a)$ and disap- [...]

EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, Online First

Depending on the kinds of link prediction approaches and networks, the link prediction problem can be formulated in various ways. It falls into two categories: missing link prediction and future link prediction. The missing link prediction problem is formally defined as follows. Consider an undirected network graph G(V, E), where V is a set of nodes (vertices) and E is a set of links (ties). Let U denote the universal set of all $|V|(|V| - 1)/2$ possible links, where |V| is the number of nodes in V. Then $U_n = U - E$ is the set of links that do not exist. Some of the links in $U_n$ are missing links, and the task of link prediction is to find them.
Furthermore, the future link prediction problem can be classified into two categories: periodic and non-periodic link prediction. Periodic link prediction focuses on dynamic networks, whereas non-periodic link prediction considers only the current state of the network.
• Periodic: We are given a series of graph snapshots in which each link $e = (u, v) \in E_t$ took place at time t (as shown in Figure 6). Here, the goal of link prediction is to predict the link state at the next time step, $G_{t+1}$. In other words, the objective is to predict the next snapshot of the graph.
• Non-Periodic: In this case, we have the current state of the graph G with only one snapshot $G_t$ instead of a series of snapshots (as shown in Figure 7). Consider a graph $G = (V, E_t)$, where $E_t \subseteq (V \times V)$ is the set of links and V denotes its nodes. Consider subgraphs of G, future $G_{t+1}$ and $G_t$ that [...] Here, the objective of link prediction is to predict the next state of the graph, i.e., $G_{t+1}$.
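The missing link formulation above can be illustrated with a minimal sketch that enumerates the candidate set $U - E$ for a small undirected graph. The function name `missing_link_candidates` and the three-person toy graph (taken from the "Ana", "James", "Jack" example of the Introduction) are illustrative assumptions, not part of any surveyed method.

```python
from itertools import combinations

def missing_link_candidates(nodes, edges):
    """Return U - E: every unordered node pair that is not an observed link."""
    observed = {frozenset(edge) for edge in edges}
    return [
        (u, v)
        for u, v in combinations(sorted(nodes), 2)
        if frozenset((u, v)) not in observed
    ]

# Toy graph at time T: Ana is linked to both James and Jack.
nodes = {"Ana", "James", "Jack"}
edges = [("Ana", "James"), ("Ana", "Jack")]

candidates = missing_link_candidates(nodes, edges)
print(candidates)  # [('Jack', 'James')]
```

With three nodes, U contains 3 · 2 / 2 = 3 possible links, two of which are observed, so a single candidate pair remains to be scored by a link predictor.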

Types of Networks
Two types of networks (as shown in Figure 4) are considered for link prediction in the literature: (1) certain and (2) uncertain. In a certain network, there is no concept of deleting nodes or edges: once a node or edge is added, it remains forever. A co-authorship network is an example of a certain network, where authors are represented by nodes and the edges between them represent collaborations. The weight associated with an edge represents the number of collaborations between the authors. In uncertain networks, on the other hand, a probability is attached to each link, expressing that the link exists for a specified time slot. Certain and uncertain networks are further categorized as follows.
Static Network. Static networks (as shown in Figure 9) are networks in which nodes neither change their position nor fail. The whole structure of the network, along with its nodes and edges, remains the same. A single snapshot of an SN at a specific time slot is an example of a static network. Static networks are further classified into homogeneous and heterogeneous networks.
Dynamic Network. A dynamic network (as shown in Figure 8) reshapes its whole structure with the passage of time. Facebook is an example of a dynamic network. Due to this dynamic property, the network can change as follows: (1) [...]

Similarity-Based
Similarity-based approaches assume that nodes tend to form links with other similar nodes. They work on the hypothesis that two nodes are similar if they share a common neighbor or are separated by a short distance in the network. These approaches use a similarity function S(u, v) that assigns a similarity score to each non-connected pair of nodes u and v. Finally, the pairs of nodes are sorted in descending order of similarity score. A high score represents a high probability that the nodes will be linked in the near future, while a low score indicates that they are unlikely to be linked.
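The generic procedure described above (score every non-connected pair with S(u, v), then sort in descending order) can be sketched as follows. This is only an illustration: the helper names `rank_candidate_links` and `shared_neighbors` and the five-node toy graph are assumptions, and the count of shared neighbors stands in for whichever similarity function is actually chosen.

```python
from itertools import combinations

def rank_candidate_links(adj, similarity):
    """Score every non-connected pair and sort by descending similarity."""
    scored = []
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:  # only non-connected pairs are candidates
            scored.append(((u, v), similarity(adj, u, v)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

def shared_neighbors(adj, u, v):
    """Placeholder similarity S(u, v): number of neighbors u and v share."""
    return len(adj[u] & adj[v])

# Hypothetical undirected graph stored as node -> set of neighbors.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

ranking = rank_candidate_links(adj, shared_neighbors)
for pair, score in ranking:
    print(pair, score)
```

In this toy graph the pair (a, d) is ranked first because it shares two neighbors, while (a, e) shares none and is ranked last.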
Node-Based. Computing the similarity between a pair of nodes is a natural solution for the task of link prediction. It builds on a simple idea: the more similar a pair of nodes is, the more likely a link between them. This reflects the fact that people tend to form relationships with others who are similar in religion, language, education, location and interests. This relationship can be measured by computing a similarity, where a score (the similarity between u and v) is assigned to each pair of nodes (u, v). A high similarity score means that u and v are likely to be linked, while a low similarity score means that they are not. In a practical SN, a node (i.e., a person) has a profile containing a set of attributes such as gender, age, location, language, interests, bio, country and city. These attribute values can be used to compute the similarity between a pair of nodes. Most of the time these attributes are in textual form, so text-based similarities [83] are used. A detailed discussion of these similarity measures is beyond the scope of this study; readers may consult a comprehensive survey [32]. Samad et al., in the area of citation networks, evaluated both textual and topological similarity measures in order to predict links between research papers [83]. They used profiles of research papers containing textual attributes, including title and abstract. Their observation is that predicting links between nodes through topological similarity works better than through textual similarity. They also observe that increasing the amount of text in the attributes lowers the similarity between nodes. Bhattacharyya et al. define a tree model with multiple categories to study and analyze the keywords of profiles, and then compute the distance between keywords to find the similarity between users [11]. They found that the similarity between users is almost equal except for direct friends.
In addition, as the number of keywords and friends increases, the similarity between users decreases. Akcora et al. observe that most user profiles in current social networks are either not publicly available or incomplete. Keeping this limitation in mind, they propose a method that estimates the missing portion of profile values before computing user similarity [2]. Anderson et al. use the commonality of users' interests to measure similarity [6]. A user's interests are the actions the user takes, such as asking questions, editing articles, reading blogs and bookmarking items. All actions taken by a user can be represented as a vector, and the similarity between two users is the cosine between their action vectors. Samad et al., in the context of face-to-face contact networks, evaluate six different social attributes in order to predict links [84]. They found that language and country are attributes that play an important role in contact prediction, and observed that people tend to contact those who are similar in language and country.
In conclusion, actions and attributes are the information most often used in node-based similarity approaches. These actions and attributes reflect personal behaviors and interests. Node-based approaches are useful when such social attributes and behaviors are available.
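As a small illustration of the action-vector idea mentioned above, the cosine between two users' action-count vectors can be computed as in the sketch below. The vector layout (questions asked, articles edited, blogs read, items bookmarked) and the three users are hypothetical and serve only to show the computation.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a user with no recorded actions is similar to nobody
    return dot / (norm_x * norm_y)

# Hypothetical action counts:
# [questions asked, articles edited, blogs read, items bookmarked]
alice = [3, 0, 4, 0]
bob = [6, 0, 8, 0]    # same behavior as alice, twice as active
carol = [0, 5, 0, 0]  # only edits articles

sim_alice_bob = cosine_similarity(alice, bob)
sim_alice_carol = cosine_similarity(alice, carol)
print(sim_alice_bob, sim_alice_carol)
```

Because the cosine ignores vector length, two users with proportional activity profiles are maximally similar regardless of how active each one is, whereas users with disjoint actions have similarity zero.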

Topological-Based.
Many metrics exist to compute the similarity between two nodes even without node or edge attributes. These metrics use topological information and are known as topological-based measures. They are further categorized into local and global metrics.
Local. In an SN, local similarity-based methods rely on structural information such as the neighborhood to estimate the similarity of each node with other nodes. These methods are faster, effective and highly parallelizable compared to non-local methods. Moreover, local methods make it possible to deal adequately with link prediction in changing and dynamic networks such as online SNs. Their primary drawback is that local information (such as the neighborhood) restricts candidate contacts to neighbors of neighbors, whereas in real-world networks many connections are formed at greater distances (i.e., more than two hops) [55]. Nevertheless, local methods have shown competitive prediction results compared to more complex methods. It is also worth noting that, although these approaches are restricted to two hops, their time complexity is $O(xk^2 f(m))$, where $O(xk^2)$ is the spatial complexity and $f(m)$ is the cost of computing the similarity.

1. Common Neighbors: This method is widely used in link prediction. As its name suggests, the more common neighbors two nodes have, the more likely they are to be linked in the future [72]. It is based on the hypothesis that two nodes sharing many common neighbors are more likely to become linked than nodes without common neighbors. Most researchers agree with this hypothesis [55]. The similarity can be computed as in Equation 1:

$$S^{CN}(u, v) = |\Gamma(u) \cap \Gamma(v)| \qquad (1)$$

where $\Gamma(u)$ and $\Gamma(v)$ represent the sets of neighbors of u and v.
2. Jaccard Coefficient: The Jaccard Coefficient, also known as the Jaccard Index, normalizes the common neighbors score by considering intersection over union [36]. For the similarity of two nodes u and v, it takes into account their common neighbors and the total neighbors of both nodes. Liben-Nowell et al. showed that Jaccard produced worse results than common neighbors. It can be defined as in Equation 2:

$$S^{JC}(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|} \qquad (2)$$

3. SAM: This method was recently published by Samad et al. [85]. It works on the simple idea that each node has its own view of the similarity, i.e., it is possible that one node is 100% similar to another node while the other node is not equally similar to the first. SAM similarity is defined in Equation 3.
4. Adamic Adar: Initially, this method was proposed to find the similarity between two web pages [1]. Later, Liben-Nowell et al. [55] used a customized version for the link prediction problem. This measure penalizes common neighbors with high degree. It can be defined as in Equation 4:

$$S^{AA}(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|} \qquad (4)$$

5. Resource Allocation: This measure is inspired by the process of resource allocation in networks. Resource allocation is similar to Adamic Adar, but it penalizes common neighbors with high degree more heavily [106]. This is why resource allocation and Adamic Adar produce close results. Its foremost feature is that it considers neighbors of neighbors along with direct neighbors. It is defined as in Equation 5:

$$S^{RA}(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{|\Gamma(z)|} \qquad (5)$$

6. Preferential Attachment: This method was proposed by Barabási et al. [8]. Its main feature is that a new node is more likely to connect to a node with high degree than to a node with low degree. The method can be defined as in Equation 6:

$$S^{PA}(u, v) = |\Gamma(u)| \cdot |\Gamma(v)| \qquad (6)$$
7. Sørensen Index: This method was proposed by Thorvald Sørensen to find the similarity between data samples of ecological communities [90]. The foremost objective of this method is to favor lower-degree nodes when finding their links. The similarity can be computed as in Equation 7:

$$S^{SI}(u, v) = \frac{2\,|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u)| + |\Gamma(v)|} \qquad (7)$$

8. Salton Cosine: This method is also known as cosine similarity [82]. It is similar to the Sørensen Index and the Jaccard Index. Some studies have found that the value produced by Salton Cosine is roughly twice the Jaccard Index [34]. The value can be computed as in Equation 8:

$$S^{SC}(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{\sqrt{|\Gamma(u)| \cdot |\Gamma(v)|}} \qquad (8)$$

9. Hub Promoted: The Hub Promoted measure was proposed by Ravasz et al. during the study of metabolic networks [79]. It defines the overlap between nodes u and v on the basis of topology. The similarity is defined as in Equation 9:

$$S^{HP}(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{\min(|\Gamma(u)|, |\Gamma(v)|)} \qquad (9)$$

10. Hub Depressed: This measure is the same as Hub Promoted, except that the similarity value is determined by the node with the higher degree [107]. The similarity can be defined as in Equation 10:

$$S^{HD}(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{\max(|\Gamma(u)|, |\Gamma(v)|)} \qquad (10)$$
11. Leicht-Holme-Nerman: This measure assigns a high similarity score to pairs of nodes with many common neighbors [50]. It takes into account the actual number of paths of length two between two nodes relative to the expected number of such paths. The authors claim that it is more sensitive than other measures in terms of structural equivalence. The similarity can be computed as in Equation 11:

$$S^{LHN}(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u)| \cdot |\Gamma(v)|} \qquad (11)$$

12. Parameter-Dependent: This measure improves the accuracy of link prediction for both the unpopular and the popular cases [108]. The parameter λ is convenient: for λ = 0 the measure reduces to Common Neighbors, while for λ = 0.5 and λ = 1 it reduces to Salton Cosine and Leicht-Holme-Nerman, respectively. The formula is shown in Equation 12:

$$S^{PD}(u, v) = \frac{|\Gamma(u) \cap \Gamma(v)|}{\left(|\Gamma(u)| \cdot |\Gamma(v)|\right)^{\lambda}} \qquad (12)$$
13. Individual Attraction: This method is the same as resource allocation, but it also takes into account the connections among the shared neighbors [27]. It works on the hypothesis that a pair of nodes is likely to be connected if their shared neighbors are highly connected. The similarity can be estimated as in Equation 13, where the term indexed by $z, \Gamma(u) \cap \Gamma(v)$ represents the links between node z and the nodes of the set $\Gamma(u) \cap \Gamma(v)$.
14. Local Naive Bayes: This measure works on the hypothesis that every shared neighbor has a unique role or influence [52]. The influence or role of a node can be computed using statistical theory. The similarity can be estimated as in Equation 14, where $o$, $R_z$ and $f(z)$ denote a network constant, the role of node z and the influence-measuring function, respectively.

15. CAR-Based: This measure is built on the assumption that two nodes are more likely to be linked if their neighbors are strongly connected in a local community [15]. The CAR-based measure is estimated as in Equation 15.
16. Functional Similarity Weight: This method is a variant of the Sørensen Index. It considers the probability of interaction of the two nodes u and v independently in a directed network [21]. Nevertheless, this probability score can also be used in undirected networks. The method can be estimated as in Equation 16.
[Table fragment, summary of local measures: a measure that depends on the degree of the shared neighbors, for which nodes will be connected if the clustering coefficient is high; FSW [21], for which the link likelihood is estimated from the interaction of both nodes.]
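Several of the local measures listed above can be computed directly from neighbor sets. The sketch below uses their standard definitions from the link prediction literature (common neighbors, Jaccard, Adamic Adar, resource allocation, preferential attachment, and the parameter-dependent index with exponent λ on the degree product); the function names and the toy graph are illustrative assumptions.

```python
import math

def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    # In a simple undirected graph a common neighbor z of two distinct nodes
    # is linked to both, so |Γ(z)| >= 2 and the logarithm is positive.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])

def resource_allocation(adj, u, v):
    return sum(1.0 / len(adj[z]) for z in adj[u] & adj[v])

def preferential_attachment(adj, u, v):
    return len(adj[u]) * len(adj[v])

def parameter_dependent(adj, u, v, lam):
    # lam = 0 gives common neighbors, 0.5 Salton cosine, 1 Leicht-Holme-Nerman.
    shared = len(adj[u] & adj[v])
    if shared == 0:
        return 0.0
    return shared / (len(adj[u]) * len(adj[v])) ** lam

# Hypothetical undirected graph stored as node -> set of neighbors.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

u, v = "a", "d"
scores = {
    "CN": common_neighbors(adj, u, v),
    "Jaccard": jaccard(adj, u, v),
    "AA": adamic_adar(adj, u, v),
    "RA": resource_allocation(adj, u, v),
    "PA": preferential_attachment(adj, u, v),
    "PD(0.5)": parameter_dependent(adj, u, v, 0.5),
}
for name, value in scores.items():
    print(name, round(value, 4))
```

All of these measures only inspect the neighborhoods of u and v (and of their shared neighbors), which is what makes local methods cheap and easy to parallelize over candidate pairs.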
Global. Global similarity-based methods rely on the whole structure of the network to estimate the similarity between a pair of nodes. Unlike local methods, they are not restricted to a distance of two hops; however, their complexity makes them impractical for large networks. In addition, their parallelization is more complex, as the whole topology of the network may not be known to each computational agent. Although their time complexities are very diverse, their spatial complexity is $O(k^2)$, as they store the similarity score of each pair. Global similarity-based methods are further categorized into path-based and random walk methods.
Path-Based. Besides node and neighbor information, paths are another feature that can be used to estimate the similarity between nodes, and this feature is used in path-based methods.
1. Local Path: The local path measure [61] uses information about paths of length 2 and 3. Unlike local measures, which rely on the nearest neighbors, it takes into account the additional information of neighbors at distance 2 and 3. Since neighbors at distance 2 are more important than those at distance 3, α is used as an adjustment factor in the measure. This measure can be defined as in Equation 17:

$$S^{LP} = A^{2} + \alpha A^{3} \qquad (17)$$

where $A^2$ and $A^3$ are the second and third powers of the adjacency matrix A, whose entries count the paths of length 2 and 3 between node pairs. Therefore, LP is a matrix of similarity scores based on paths of length 2 and 3.

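2. Katz: The Katz method [42] is based on the ensemble of all paths between two nodes. The paths are damped exponentially by length, which gives more importance to shorter paths. The expression is defined as in Equation 18:

$$S^{Katz}(u, v) = \sum_{l=1}^{\infty} \beta^{l} \, |path^{l}_{u,v}| \qquad (18)$$

where $path^{l}_{u,v}$ denotes the set of paths of length l between u and v, and β is the damping factor.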
4. Shortest Path: This is the simplest and easiest global measure to use. It determines the similarity of u and v by taking into account the shortest path between them [3]. The expression is given in Equation 21, where $P_{u \to v}$ is a path between u and v, and |P| denotes the length of path P.

5. FriendLink: This method finds the similarity by traversing all the paths [73]. It works on the hypothesis that social network users can use all the paths between them. The similarity between a pair of nodes u and v can be estimated as in Equation 22, where n is the size of the network, l is the path length between u and v, and $path^{i}_{u,v}$ denotes all paths of length i between u and v. In addition, a higher l leads to poorer performance.
Compared with neighbor-based and node-based methods, which are restricted to local community information, path-based methods take additional topological information into account: they consider not only local but also global information, such as the paths between a pair of nodes. However, path-based methods are more expensive than local methods in terms of time complexity. In addition, longer paths are rarely used and are less useful. Sarkar et al. [86] show that longer paths become useful when shorter paths are not enough. Therefore, path-based methods can produce better results when excessively long paths are removed.
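To make the role of path length concrete, the following sketch computes the Local Path scores $A^2 + \alpha A^3$ and a Katz-style sum truncated at a maximum length, using plain nested lists. The truncation, the helper names and the four-node path graph are assumptions made for illustration; note also that powers of the adjacency matrix count walks, which may revisit nodes.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def local_path(adj, alpha):
    """Local Path scores: A^2 + alpha * A^3."""
    a2 = mat_mul(adj, adj)
    a3 = mat_mul(a2, adj)
    n = len(adj)
    return [[a2[i][j] + alpha * a3[i][j] for j in range(n)] for i in range(n)]

def truncated_katz(adj, beta, max_length):
    """Katz-style scores: sum of beta^l * A^l for l = 1 .. max_length."""
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # current power A^l, starting with l = 1
    weight = beta                    # current damping beta^l
    for _ in range(max_length):
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = mat_mul(power, adj)
        weight *= beta
    return scores

# Hypothetical path graph 0 - 1 - 2 - 3 as an adjacency matrix.
path_graph = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

lp_scores = local_path(path_graph, alpha=0.01)
katz_scores = truncated_katz(path_graph, beta=0.1, max_length=3)
print(lp_scores[0][2], lp_scores[0][3])
print(katz_scores[0][2], katz_scores[0][3])
```

In both measures the pair (0, 2), which is two hops apart, scores higher than the pair (0, 3), which is three hops apart, showing how the adjustment factor α and the damping factor β favor shorter paths. The untruncated Katz series converges only when β is smaller than the reciprocal of the largest eigenvalue of the adjacency matrix.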
Random Walk. The similarity between nodes in an SN can also be calculated by random walks. A random walk takes into account the transition probabilities from the current node to its neighbors. Several similarity measures use random walks to find the similarity between nodes.

1. Random Walk: The term random walk was coined by Karl Pearson in 1905 [75], and random walks have been adopted by many researchers from various disciplines such as physics, economics and biology. Consider a graph and a starting node. Suppose we randomly pick one of its neighbors and move to it, then repeat this step for every node reached. The resulting series of randomly picked nodes is called a random walk [60]. Let $p^{u}$ be the vector of probabilities that a walker starting at node u reaches each node in the network; it is iteratively estimated as in Equation 23:

$$p^{u}(t+1) = Mat^{T} \, p^{u}(t) \qquad (23)$$

where Mat is the transition probability matrix computed from the adjacency matrix Am, with $Mat_{i,j} = Am_{i,j} / \sum_{k} Am_{i,k}$. In addition, all elements of $p^{u}(0)$ are set to 0 except $p^{u}_{u}(0)$, whose value is 1.

Randome
Where, p u (0) assigned 0 to all its elements.
3. Hitting Time: Hitting time [29] takes into account the average number of steps required to reach node v from node u. Usually, it is an asymmetric measure, which means [...] Based on the probability matrix Mat, Hitting Time can be defined as in Equation 25.
5. Cosine Similarity Time: The cosine similarity time method measures the similarity of two vectors [29]. It is based on $Q^{\dagger}$, the pseudo-inverse of the Laplacian matrix $Q = D_A - Am$. It is estimated as in Equation 27.
6. SimRank: SimRank is a distinctive method that takes into account the point where two random walkers meet [38]. It works on the hypothesis that if two random walkers meet at a node, then their starting nodes are similar to each other. It is estimated as in Equation 28.
8. PropFlow: This measure is the same as Rooted PageRank, but it is more localized [57]. It restricts the random walker to l steps. In other words, if a random walker starts its walk from u and goes to v, it takes no more than l steps. It picks links on the basis of their weight. It can be defined as in Equation 30.
9. SpectralLink: The SpectralLink method was proposed by Symeonidis et al. [92] to capture the proximity of nodes by enhancing the method of spectral clustering. It takes into account the Laplacian matrix and produces a noise-free matrix, which is smaller and more compact. Therefore, it predicts links more accurately. The authors also extended their work to predict negative and positive links in social networks [93].
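The basic random walk iteration underlying several of the measures above can be sketched as follows, assuming the usual update $p^{u}(t+1) = Mat^{T} p^{u}(t)$ with a row-normalized transition matrix Mat. The function names and the three-node example (Ana linked to James and Jack) are illustrative assumptions.

```python
def transition_matrix(adj):
    """Row-normalize an adjacency matrix: Mat[i][j] = Am[i][j] / sum_k Am[i][k]."""
    mat = []
    for row in adj:
        degree = sum(row)
        mat.append([x / degree if degree else 0.0 for x in row])
    return mat

def random_walk(adj, start, steps):
    """Probability of being at each node after `steps` moves from `start`."""
    mat = transition_matrix(adj)
    n = len(adj)
    p = [0.0] * n
    p[start] = 1.0  # p(0): all probability mass on the starting node
    for _ in range(steps):
        # p(t+1)[j] = sum_i Mat[i][j] * p(t)[i], i.e. p(t+1) = Mat^T p(t)
        p = [sum(mat[i][j] * p[i] for i in range(n)) for j in range(n)]
    return p

# Hypothetical graph: node 0 (Ana) is linked to node 1 (James) and node 2 (Jack).
adjacency = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]

after_one = random_walk(adjacency, start=1, steps=1)
after_two = random_walk(adjacency, start=1, steps=2)
print(after_one)  # [1.0, 0.0, 0.0]
print(after_two)  # [0.0, 0.5, 0.5]
```

Starting from James, the walker must first move to Ana and then reaches Jack with probability 0.5 after two steps; random walk measures turn such reachability probabilities into similarity scores for the non-connected pair.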
Quasi. Quasi approaches have recently appeared to strike a balance between global and local similarity methods. Quasi methods are almost as efficient to compute as local methods, yet they also consider additional structural information, as global methods do. Some quasi methods consider the whole structural information, but their time complexity is still lower than that of global methods. The spatial complexity of quasi methods is $O(uk^{2+s})$, where s depends on the parameters that set the path length or the number of iterations.
1. Local Random Walk: The local random walk measure [60] uses a random walk from source to destination, but restricts the number of iterations to a small number k. The similarity is estimated as in Equation 31, where $p^{u}_{v}(t)$ represents the probability vector estimated at iteration t.

2. Supervised Random Walk: Supervised random walk uses the topological information of nodes together with link features [60]. The foremost objective of this method is to release the random walker continuously at the starting node. It can be defined as in Equation 32.

Hybrid.

1. Evidential Measurement: Yin et al. proposed the evidential measurement method [102]. This is a hybrid technique that requires both node and local similarity. It is estimated as in Equation 33.
2. Methods in Weighted Networks: Link prediction has also been used in weighted networks. Some of the measures are weighted Adamic/Adar, weighted common neighbors and weighted resource allocation [63]. These measures are given in Equations 34 (weighted version of Adamic/Adar), 35 (weighted version of common neighbors) and 36 (weighted version of resource allocation), where $\Gamma(u) \cap \Gamma(v)$ represents the common neighbors of nodes u and v, $w(x, v)$ represents the weight of the link between x and v, and $S(u) = \sum_{x \in \Gamma(u)} w(u, x)^{\alpha}$.

Learning-Based
Classification. Let u, v ∈ V be nodes of the graph G(V, E) and let $l_{(u,v)}$ be the label of the pair of nodes (u, v). When link prediction is treated as classification, every pair of nodes is an instance with a class label: if the nodes are connected the label is positive, otherwise it is negative. In general, the label of the pair (u, v) is defined as in Equation 37:

$$l_{(u,v)} = \begin{cases} +1 & \text{if } (u, v) \in E \\ -1 & \text{otherwise} \end{cases} \qquad (37)$$

Classification is a powerful way to deal with the link prediction problem, and it can use any kind of similarity measure as a feature, as shown in Figure 10. However, this kind of approach has to deal with a serious problem known as class imbalance [46]. Many classifier-based methods have been developed, and any kind of classifier can be part of such approaches. Some researchers have compared many classifiers for link prediction, e.g., support vector machines, decision trees, multilayer perceptrons, k-nearest neighbors, naive Bayes and many ensembles of these classifiers [4], while others have found random forests to perform well [23]. To build an effective and efficient classifier, it is crucial to extract and define a desirable feature set from the SN. Past studies have shown that topological and node-based features are important for classification models; for example, the VCP measure is a distinctive feature that represents topological information [56]. Li [...] [89]. The foremost objective of this model is to determine efficient and predictive attributes and features. It computes weighted similarity measures using node and topological features for the alignment of the adjacency matrix. Weight as a feature is usually perceived to play an important role in link prediction. However, this has not been verified yet, and in a few studies performance was even degraded. A few research works claim that weights help to improve prediction results in supervised link prediction [25]. On the other [...] [41], where the main objective was to predict the unrevealed chunks of the network using the similarity of nodes.
In addition, since it can fill in the missing chunks of the network, it enables us to predict different kinds of links simultaneously. As a variant, a fast link propagation algorithm was proposed to solve the linear equations in the method [80]. Brouard et al. [14] tried to predict links through kernel regression, a semi-supervised learning approach.
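The classification view of link prediction can be illustrated by building labeled instances from node pairs, with similarity scores as features. The sketch below is a simplified assumption: labels are encoded as +1 and -1, the two features are common neighbors and preferential attachment, and features and labels are taken from the same toy graph, whereas in practice features would be computed on a training snapshot and labels on a later one.

```python
from itertools import combinations

def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

def preferential_attachment(adj, u, v):
    return len(adj[u]) * len(adj[v])

def build_instances(adj, features):
    """One instance per node pair: (pair, feature vector, label)."""
    instances = []
    for u, v in combinations(sorted(adj), 2):
        x = [feature(adj, u, v) for feature in features]
        y = 1 if v in adj[u] else -1  # +1 for linked pairs, -1 otherwise
        instances.append(((u, v), x, y))
    return instances

# Hypothetical path graph a - b - c - d - e - f.
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}

instances = build_instances(adj, [common_neighbors, preferential_attachment])
positives = sum(1 for _, _, y in instances if y == 1)
negatives = sum(1 for _, _, y in instances if y == -1)
print(positives, negatives)  # 5 10
```

Even in this tiny graph the negative class is twice as large as the positive class, and the gap widens quickly in sparse real networks, which is the class imbalance problem noted above. The resulting feature vectors and labels could be passed to any classifier.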

Matrix Factorization.
Matrix factorization approaches extract and use additional or latent features for link prediction and have been used by various recommender systems [45]. Menon et al. proposed a method to learn latent features for link prediction [70]. This method considers a vector $l_i$ of latent features for every node in the network, a scaling factor $SF_{u,v}$ for every pair of nodes, node feature weights $W_n$ and link feature weights $w_l$. Moreover, the feature vector $b_{u,v}$ corresponds to a link and $a_i$ corresponds to a node. The model computes the prediction score for nodes u and v as in Equation 38.
Meta-Heuristics. The involvement of many factors makes link formation a complicated process. Most link formation methods are heuristics, as they try to achieve higher accuracy than other predictors by making hypotheses about the network. Bliss et al. proposed a link prediction method based on the hypothesis that different heuristics of link formation can cooperate and coexist [12]. It optimizes a combination of link predictors (i.e., global and local similarity measures) by adopting an evolutionary strategy. Every solution x is represented by a vector $w^{(x)}$ of real numbers whose length equals the number of heuristics. The similarity function for each candidate predictor is given in Equation 39.
Kernel-Based. Kunegis et al. developed a kernel-based method for link prediction that integrates different graph kernels as well as methods of dimensionality reduction [48]. The learn ability of this method makes it unique, since it is capable to learn F function which exerts adjacency or Laplacian matrix. Let there are training and testing sets of X and Y adjacency matrices for link prediction. Now, consider a function F(spectral transformation) which maps adjacency matrix X to adjacency matrix Y with least error using optimization problem as follows in Equation 40 Where, ||F(X) − Y ||F corresponds to Frobenius norm. The constrain in this norm ensure that spectral transformation function F is the property of another function S( known as function of spectral transformation). Consider a symmetric matrix X = MΛM T for function F, then we have F(X) = MF(Λ)M T , where function F(X) applies on every eigenvalue. Moreover, optimization problem, as shown in Equation 40, can be resolved by calculating the eigenvalue X = MΛM T as shown in Equation 41.
Since the off-diagonal entries are independent of the spectral function F, the optimization problem can be reduced from matrices to real numbers, as shown in Equation 42.
Several kernels can be used as the spectral function F, as described below.
1. Exponential Kernel: Consider an unweighted graph G with adjacency matrix Am. The entries of $Am^n$ give the number of paths of length n between pairs of nodes. The kernel rests on the hypothesis that nodes connected through more paths are more similar than nodes connected through fewer paths. The function F can therefore be estimated as in Equation 43. The exponential kernel is thus defined as in Equation 44.
2. Von Neumann Kernel: The von Neumann kernel is similar to the exponential kernel in that it also counts the number of paths. It can be expressed as in Equation 45. Moreover, the diffusion kernel can be estimated in a similar manner.
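To make the spectral view concrete, the following minimal sketch applies a scalar function to the eigenvalues of a symmetric adjacency matrix, i.e., $F(X) = M F(\Lambda) M^T$, and uses it for the exponential and von Neumann kernels in their standard series forms, $\exp(\alpha A) = \sum_n \frac{\alpha^n}{n!} A^n$ and $(I - \alpha A)^{-1} = \sum_n \alpha^n A^n$. The function names, the damping parameter `alpha`, and the toy path graph are assumptions made for this example.

```python
import numpy as np

def spectral_transform(adjacency, f):
    """Apply f to every eigenvalue of a symmetric matrix: M f(Lambda) M^T."""
    eigenvalues, eigenvectors = np.linalg.eigh(adjacency)
    return eigenvectors @ np.diag(f(eigenvalues)) @ eigenvectors.T

def exponential_kernel(adjacency, alpha=0.5):
    """exp(alpha * A): paths of length n are weighted by alpha^n / n!."""
    return spectral_transform(adjacency, lambda lam: np.exp(alpha * lam))

def von_neumann_kernel(adjacency, alpha=0.2):
    """(I - alpha * A)^-1: paths of length n are weighted by alpha^n."""
    spectral_radius = np.max(np.abs(np.linalg.eigvalsh(adjacency)))
    if alpha * spectral_radius >= 1.0:
        # The geometric series only converges below this bound.
        raise ValueError("alpha must be smaller than 1 / spectral radius")
    return spectral_transform(adjacency, lambda lam: 1.0 / (1.0 - alpha * lam))

# Path graph 0 - 1 - 2 - 3 as a small unweighted example.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

K_exp = exponential_kernel(A)
K_neu = von_neumann_kernel(A)
```

In both kernels, node 0 scores higher with node 2 than with node 3, since shorter paths receive larger weights.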

Probabilistic Model
In the literature, a number of network formation models have been studied in terms of probabilistic and statistical approaches [31]. These approaches address the link prediction problem on the basis of probability and statistical analysis. Probabilistic methods usually assume that the network under study has a known structure.
A model is then built by estimating a set of model parameters. For each missing link, the formation probability is computed from these parameters. Finally, the candidate links are ranked by their formation probability values, as in the similarity-based approaches.
Hierarchical Structure Model. According to the literature, most real networks are organized hierarchically, such as protein-protein interaction networks, metabolic networks, social networks like actor networks, and Internet domains [78]. In such networks, lower-degree nodes are expected to have a higher clustering coefficient than higher-degree nodes. In 2008, Clauset et al. proposed a method that describes a hierarchical network by a dendrogram with $|V| - 1$ internal nodes and $|V|$ leaves [22], as shown in Figure 10. Each leaf corresponds to a network node and each internal node corresponds to a relationship between its descendant nodes. Moreover, a probability $Pro_n$ is attached to every internal node n, which gives the probability of a link between two leaves that have this node as their lowest common ancestor.
Here, $r_k$ and $l_k$ denote the numbers of leaves in the right and left subtrees rooted at internal node k. Consider Figure 10, which shows the dendrogram of a hierarchical network. According to the dendrogram, nodes 2 and 3 connect with probability 0.5, whereas nodes 1 and 6 connect with probability 0.2. A Markov Chain Monte Carlo [30] approach is employed to sample a set of dendrograms with probability proportional to their likelihood. Each sampling step regroups the subtrees of the current dendrogram in a different order.
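The role of the internal-node probabilities can be sketched for a fixed dendrogram as follows. The sketch assumes the maximum-likelihood estimate of Clauset et al., in which the probability at an internal node is the number of observed links crossing between its left and right subtrees divided by the product of their leaf counts. The nested-tuple encoding of the dendrogram, the function names, and the toy graph (two triangles joined by one link) are assumptions made for this example, and the sampling of dendrograms is not shown.

```python
import math

def leaves(tree):
    """Return the leaf labels of a dendrogram given as nested 2-tuples."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def internal_node_stats(tree, edges, stats=None):
    """Collect (left leaves, right leaves, crossing links, probability)."""
    if stats is None:
        stats = []
    if isinstance(tree, int):
        return stats
    left, right = tree
    l, r = leaves(left), leaves(right)
    crossing = sum(1 for u in l for v in r if frozenset((u, v)) in edges)
    stats.append((l, r, crossing, crossing / (len(l) * len(r))))
    internal_node_stats(left, edges, stats)
    internal_node_stats(right, edges, stats)
    return stats

def log_likelihood(stats):
    """Sum of E*log(p) + (L*R - E)*log(1 - p); terms with p in {0, 1} are 0."""
    total = 0.0
    for l, r, crossing, p in stats:
        if 0.0 < p < 1.0:
            pairs = len(l) * len(r)
            total += crossing * math.log(p) + (pairs - crossing) * math.log(1.0 - p)
    return total

def link_probability(tree, edges, u, v):
    """Probability stored at the lowest common ancestor of leaves u and v."""
    for l, r, _, p in internal_node_stats(tree, edges):
        if (u in l and v in r) or (u in r and v in l):
            return p
    raise ValueError("both nodes must be distinct leaves of the dendrogram")

# Two triangles {1, 2, 3} and {4, 5, 6} joined by the link 3 - 4.
edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]}
dendrogram = (((1, 2), 3), ((4, 5), 6))

stats = internal_node_stats(dendrogram, edges)
p_missing = link_probability(dendrogram, edges, 1, 6)
ll = log_likelihood(stats)
```

For the unobserved pair (1, 6), the lowest common ancestor is the root, where one of the nine possible crossing links is observed, so the estimated connecting probability is 1/9.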

Stochastic Block Model. In reality, not all networks fit a hierarchical schema. A common alternative is to assume that the nodes of the network are distributed into blocks or communities, where nodes belonging to the same group or community have the same status [33]. The chance of a link forming between two nodes depends on the blocks or communities they belong to. A stochastic block model usually consists of two parts, Mod = (P, ProM), where P is the partition of the nodes into blocks and ProM is the matrix of probabilities that two nodes belonging to two given blocks are linked. Let $ProM_{\alpha\beta}$ be the link probability between two blocks α and β, and let G be the network. The likelihood can be computed as in Equation 47.
Here, the term indexed by α, β is the number of links between nodes in blocks α and β, while $\gamma_{\alpha,\beta}$ is the number of links that can exist between the two blocks. The foremost feature of this model is that it allows us to identify both spurious and missing links in noisy network data. In addition, it provides better prediction results than the hierarchical structure model. However, its computational complexity is high and it has limited ability to represent possible overlaps between communities. To overcome the shortcomings of the previous model, Chen and Zhand proposed a marginalized denoising model [19]. It treats link prediction as a matrix denoising problem and learns a mapping function that converts the matrix of observed links into the matrix of unobserved links.
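For a fixed partition, the block model likelihood can be sketched as below. The sketch assumes the standard Bernoulli form, in which every block pair contributes $q^{l}(1-q)^{r-l}$ with l observed links, r possible links, and link probability q, and it uses the maximum-likelihood estimate q = l / r. The symbol names, function names, and the toy graph are assumptions made for this example, and no search over partitions is performed.

```python
import math
from itertools import combinations

def block_counts(edges, membership):
    """Count observed and possible links for every pair of blocks."""
    links, pairs = {}, {}
    for u, v in combinations(sorted(membership), 2):
        key = tuple(sorted((membership[u], membership[v])))
        pairs[key] = pairs.get(key, 0) + 1
        if frozenset((u, v)) in edges:
            links[key] = links.get(key, 0) + 1
    return links, pairs

def max_likelihood_probabilities(edges, membership):
    """Estimate each block-pair probability as observed / possible links."""
    links, pairs = block_counts(edges, membership)
    return {key: links.get(key, 0) / total for key, total in pairs.items()}

def log_likelihood(edges, membership, prob):
    """Log of the product of q^l * (1 - q)^(r - l) over all block pairs."""
    links, pairs = block_counts(edges, membership)
    total = 0.0
    for key, possible in pairs.items():
        observed = links.get(key, 0)
        q = prob[key]
        if observed:
            total += observed * math.log(q) if q > 0 else float("-inf")
        if possible - observed:
            total += (possible - observed) * math.log(1 - q) if q < 1 else float("-inf")
    return total

# Two triangles {1, 2, 3} and {4, 5, 6} joined by the link 3 - 4.
edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]}
by_triangle = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b", 6: "b"}
off_partition = {1: "a", 2: "a", 3: "b", 4: "b", 5: "b", 6: "b"}

prob_good = max_likelihood_probabilities(edges, by_triangle)
ll_good = log_likelihood(edges, by_triangle, prob_good)
ll_bad = log_likelihood(edges, off_partition,
                        max_likelihood_probabilities(edges, off_partition))
```

The partition that follows the two triangles attains the higher likelihood, and under it any missing link between the two blocks receives probability 1/9.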
Cycle Formation Model. Huang et al. proposed a model based on the hypothesis that networks tend to close cycles during link formation [35]. This hypothesis is shared with methods such as common neighbors, which take into account the number of cycles that would be formed if the link existed. This approach additionally tries to capture longer cycles by generalizing the clustering coefficient. The generalized clustering coefficient can be computed as in Equation 48, where k is the length of the cycles being analyzed. Furthermore, the cycle formation model, defined as CF(k) with k > 0, characterizes every cycle formation mechanism (i.e., g(1), g(2), ..., g(k)) by a single coefficient (i.e., $c_1, c_2, \dots, c_k$). The expected clustering coefficient can be computed as in Equation 49:

$$f(c_1, c_2, \dots, c_k) = \sum_{j} |G_j| \Pr(G_j) \Pr(e_{1,k+1} \in E \mid G_j) \qquad (49)$$

where $G_j$ denotes the possible connected graphs with i nodes. Based on these coefficients, the probability that a link exists is estimated as in Equation 50.

Local Co-Occurrence Model. The probabilistic models discussed earlier are impractical for large networks, since their computational complexity is very high. Wang et al. proposed a probabilistic model based on three types of local features: topological features, co-occurrence probability features, and semantic features [95]. To obtain the link probability between two nodes, a local probabilistic model is built using a Markov Random Field (MRF). The prediction proceeds as follows: (1) a central neighborhood set $S_{x,y}$ is identified; (2) the nodes that lie on the frequent paths are then used to compute the co-occurrence probability features; finally, logistic regression is used for classification over the three types of features discussed above. The local co-occurrence method is illustrated in Figure 11.

Preprocessing
Preprocessing approaches are also called meta-approaches or high-level approaches, since they work in combination with other methods. Their foremost objective is to minimize the noise present in the network in the form of "false" or "weak" links, and thereby to enhance the performance of the approaches described earlier.
Low Rank Approximation. This method simplifies the network structure by solving a well-known problem, namely low-rank approximation. It operates on the adjacency matrix Am to remove noise from the network [48]. The optimization problem minimizes a cost function that measures the fit between the original matrix and an approximating matrix of reduced rank. It can be solved efficiently through the SVD of the original matrix, as in Equation 51.
Here, $M^T$ and M denote unitary matrices, while the remaining factor is the diagonal matrix with positive elements. Many methods are available to compute the SVD.
The most widely used approach relies on the fact that the singular values are the square roots of the eigenvalues of $Am\,Am^T$. Using the decomposition, it can be expressed as in Equation 52.
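A minimal sketch of the low-rank approximation with NumPy is given below. It truncates the SVD to the largest singular values, which is the best approximation of that rank in the Frobenius norm by the Eckart-Young theorem. The function name, the chosen rank, and the toy adjacency matrix are assumptions made for this example.

```python
import numpy as np

def low_rank_approximation(adjacency, rank):
    """Keep only the `rank` largest singular values of the matrix."""
    left, singular_values, right_t = np.linalg.svd(adjacency, full_matrices=False)
    return (left[:, :rank] * singular_values[:rank]) @ right_t[:rank, :]

# Small undirected toy network with five nodes.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

singular_values = np.linalg.svd(A, compute_uv=False)
A_2 = low_rank_approximation(A, rank=2)

# Frobenius error of the approximation; it is determined by the
# singular values that were discarded.
error = np.linalg.norm(A - A_2)
```

The entries of the approximated matrix can then be used as link scores for unobserved node pairs, with the rank controlling how much of the original structure is treated as noise.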
Unseen Bigrams. Two consecutive (adjacent) elements of a string are called a bigram (also known as a digram). Bigrams are used in various applications, e.g., speech recognition, linguistics, and cryptography. Unseen bigrams are bigrams that are valid but not observed in a collection of strings. For example, if "a flower", "a room", "the flower", "the room" and "a bike" are the observed bigrams, then "the bike" is an unseen bigram. The bigram strategy can be adapted to link prediction to reduce noise by replacing a node with similar nodes [54]. The similarity can then be expressed as in Equation 54, where $X_u^{\infty}$ is the set of ∞ nodes most similar to u.
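The idea can be sketched as follows, assuming the usual formulation in which the score of a pair (u, v) is computed from the nodes that are both among the most similar nodes to u and neighbors of v, either by counting them (unweighted) or by summing their similarity to u (weighted). The use of common neighbors as the base similarity, the parameter name `delta` for the size of the similar-node set, and the toy graph are assumptions made for this example.

```python
def common_neighbors(adj, u, v):
    """Base similarity: number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])

def most_similar(adj, u, delta, score=common_neighbors):
    """Return the delta nodes most similar to u (ties broken by label)."""
    others = [z for z in adj if z != u]
    others.sort(key=lambda z: (-score(adj, u, z), z))
    return others[:delta]

def unseen_bigram_score(adj, u, v, delta, weighted=False, score=common_neighbors):
    """Score (u, v) through the nodes similar to u that are neighbors of v."""
    candidates = set(most_similar(adj, u, delta, score)) & adj[v]
    if weighted:
        return sum(score(adj, u, z) for z in candidates)
    return len(candidates)

# Toy undirected graph stored as adjacency sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

similar_to_a = most_similar(adj, "a", delta=2)
unweighted = unseen_bigram_score(adj, "a", "e", delta=2)
weighted = unseen_bigram_score(adj, "a", "e", delta=2, weighted=True)
```

Here nodes a and e share no neighbor, so their common-neighbors score is zero, yet node d is similar to a and adjacent to e, which gives the pair a non-zero score.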
Filtering. This approach is also known as clustering [54]; to avoid ambiguity, we call it filtering. It is another noise reduction method, which removes the weakest ties between nodes in order to improve link prediction results. The weakest ties are observed links whose endpoints share no neighbors or only a small number of neighbors. The same scoring can also be applied to observed links to assess their worth or strength. Therefore, the filtering approach assigns a similarity score to every connected pair in order to remove the γ weakest links and clean the network.
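A minimal sketch of this filtering step is shown below. It scores every observed link, removes the weakest ones, and returns the cleaned network on which a link predictor could then be run. The use of common neighbors as the similarity score, the reading of γ as an absolute number of links (`gamma`), and the toy graph are assumptions made for this example.

```python
def common_neighbors(adj, u, v):
    """Similarity of a connected pair: number of shared neighbors."""
    return len(adj[u] & adj[v])

def filter_weak_links(adj, gamma, score=common_neighbors):
    """Return a copy of the graph without its gamma lowest-scoring links."""
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    # Rank observed links from weakest to strongest (ties broken by label).
    ranked = sorted(edges, key=lambda e: (score(adj, *sorted(e)), sorted(e)))
    cleaned = {u: set(neighbors) for u, neighbors in adj.items()}
    for edge in ranked[:gamma]:
        u, v = sorted(edge)
        cleaned[u].discard(v)
        cleaned[v].discard(u)
    return cleaned

# Toy undirected graph stored as adjacency sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

cleaned = filter_weak_links(adj, gamma=1)
```

In this example the link between d and e is removed first, because its endpoints share no neighbors.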

Performance Evaluation Measures
Evaluation measures used in the area of link prediction are adopted from other research areas, e.g., classification and information retrieval [39]. These evaluation measures can be classified into two categories: (1) threshold curves and (2) fixed threshold [58][101][24]. Fixed threshold measures have the drawback that reasonable estimates of a sensible threshold in score space are rarely available. Threshold curve measures are an alternative that overcomes this flaw.

$$TPR = \frac{TruePositive}{TruePositive + FalseNegative}, \qquad FPR = \frac{FalsePositive}{TrueNegative + FalsePositive}$$

Here, TPR is the proportion of positive links that are correctly predicted, while FPR is the proportion of negative links that are wrongly predicted as positive. Although these measures have contributed greatly to link prediction, some researchers have shown that both AUC and ROC can be misleading [101]. They further state that, because of acute class imbalance, PR curves and PRAUC are better suited than ROC and AUC for performance evaluation.
2. PR: PR stands for the precision-recall curve. It plots precision against recall at different thresholds [24]. It considers only the positive links rather than the negative links. Therefore, the PR curve is not suitable for periodic link prediction, where removed links must also be predicted [39].

3. AUC: AUC stands for the area under the ROC curve.
A high AUC indicates superior classification results, while a low AUC indicates poor results.
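The relation between the ROC curve and the AUC can be sketched as below. The sketch sweeps the threshold over the prediction scores to obtain (FPR, TPR) points, integrates them with the trapezoidal rule, and cross-checks the result against the equivalent pairwise reading of the AUC, i.e., the probability that a randomly chosen positive link scores higher than a randomly chosen negative one. The function names and the toy scores are assumptions made for this example, and both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points obtained by lowering the threshold."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all pairs that share the same score before adding a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def pairwise_auc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy prediction scores; label 1 marks a true link, 0 a non-existent link.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
auc_pairwise = pairwise_auc(scores, labels)
```

Both computations agree on this example, where eight of the nine positive/negative pairs are ranked correctly.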

Fixed Threshold
1. Accuracy (Classification): Accuracy, borrowed directly from classification, is the most widely used measure. It is estimated as in Equation 55.
Here, N and P are the total numbers of negative and positive links. Imbalanced data can make accuracy deceptive. SNs are usually very large and the existing links amount to less than 10 percent of all possible links, so accuracy is not a meaningful measure in this setting [56].
2. Accuracy (Graph): Graph accuracy [84] is similar to classification accuracy. However, graph accuracy compares the original graph with the predicted graph. It is estimated as in Equation 56.

$$Precision = \frac{TruePositive}{TruePositive + FalsePositive} \qquad (58)$$

5. F1-Measure: It is the harmonic mean of recall and precision. It is defined as in Equation 59.
Here, R represents recall and P denotes precision.
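The fixed-threshold measures can be sketched with their standard definitions: accuracy as the share of correct decisions, precision as TP / (TP + FP), recall as TP / (TP + FN), and F1 as the harmonic mean of precision and recall. The function name, the convention that a score at or above the threshold counts as a predicted link, and the toy data are assumptions made for this example.

```python
def fixed_threshold_metrics(scores, labels, threshold):
    """Compute accuracy, precision, recall, and F1 at a single threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Balanced toy example.
balanced = fixed_threshold_metrics(
    [0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0], threshold=0.65)

# Imbalanced toy example: one true link among ten pairs, nothing predicted.
imbalanced = fixed_threshold_metrics([0.1] * 10, [1] + [0] * 9, threshold=0.5)
```

The imbalanced case illustrates why accuracy alone can be deceptive: predicting no link at all still yields an accuracy of 0.9, while recall and F1 are zero.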
6. NDCG: Normalized discounted cumulative gain (NDCG) is another evaluation measure of accuracy. It uses the top k prediction scores and is computed as in Equation 60.
Here, $DCG_k$ is the discounted cumulative gain of the top k predictions.
And $IDCG$ is its ideal value, obtained for a perfect ranking.
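A minimal sketch of NDCG over a ranked list of predictions is given below. It assumes the common form in which the gain of the item at rank i is its relevance divided by $\log_2(i + 1)$; some definitions use $2^{rel} - 1$ as the gain instead, which coincides with this form for binary relevance. The function names and the toy relevance list are assumptions made for this example.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the ideal (perfectly sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal else 0.0

# Ranked predictions, best score first; 1 marks a true link.
ranked_relevance = [1, 0, 1, 0, 0]

dcg = dcg_at_k(ranked_relevance, k=3)
ndcg = ndcg_at_k(ranked_relevance, k=3)
```

Here the DCG of the top three predictions is 1 + 0 + 1/2 = 1.5, and dividing by the ideal value $1 + 1/\log_2 3$ gives an NDCG below 1 because the second true link is ranked third rather than second.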

7. MR (Mean-Rank): This measure is used specifically for missing link prediction. The evaluation proceeds through the following steps.
• First, the dataset is separated into two sets (i.e., training and testing) without any negative links,
• For every test link, one node is removed and replaced by another node,
• The dissimilarity values of the corrupted links are computed,
• The nodes are then sorted in descending order,
• In this way, the correct nodes are ordered according to their rank,
• Finally, the mean of the predicted ranks is computed.

Figure 13. Taxonomy of Performance Evaluation Measures

8. Hit@n: This measure is similar to mean rank. The difference is that it considers only the top n nodes. Many studies report Hit@10 [13].
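The steps of mean rank and Hit@n can be sketched as follows. For every test link the target node is replaced by each candidate node, the candidates are ranked, and the rank of the correct node is recorded. The sketch assumes that a lower dissimilarity marks a more plausible link, so the ranking runs from the most to the least plausible candidate, and it reads Hit@n as the share of test links whose correct node appears among the top n candidates. The function names and the toy dissimilarity (distance between hand-picked positions on a line) are assumptions made for this example.

```python
def rank_of_true_node(source, true_target, candidates, dissimilarity):
    """Rank (1 = best) of the true target among all candidate targets."""
    ranked = sorted(candidates, key=lambda node: dissimilarity(source, node))
    return ranked.index(true_target) + 1

def mean_rank_and_hits(test_links, nodes, dissimilarity, n=10):
    """Return the mean rank of the true targets and the Hit@n ratio."""
    ranks = []
    for source, target in test_links:
        candidates = [node for node in nodes if node != source]
        ranks.append(rank_of_true_node(source, target, candidates, dissimilarity))
    mean_rank = sum(ranks) / len(ranks)
    hits_at_n = sum(1 for rank in ranks if rank <= n) / len(ranks)
    return mean_rank, hits_at_n

# Hypothetical node positions standing in for a learned model.
position = {"a": 0.0, "b": 1.0, "c": 3.0, "d": 7.0, "e": 8.0}

def toy_dissimilarity(u, v):
    """Distance between two nodes on the line; smaller means more plausible."""
    return abs(position[u] - position[v])

test_links = [("a", "b"), ("d", "c")]
mean_rank, hits_at_1 = mean_rank_and_hits(test_links, list(position), toy_dissimilarity, n=1)
_, hits_at_2 = mean_rank_and_hits(test_links, list(position), toy_dissimilarity, n=2)
```

In this example the correct node is ranked first for the link (a, b) and second for the link (d, c), which gives a mean rank of 1.5, a Hit@1 of 0.5, and a Hit@2 of 1.0.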

Link Prediction Features
In this study, we have categorized the features used for link prediction as shown in Figure 14. Two types of features are used by the majority of link prediction approaches: (1) node-based features and (2) link-based features. Node-based features include a node's in-degree, out-degree, level, and distance, while link-based features include a link's level, type, label, weight, and path. A study of link prediction features shows that the major part of the research relies on node-based features [104][66][103][44][9]. This research claims that node-related features play a significant role in link prediction, since individuals in an SN have their own attributes (i.e., age, gender, city, country, and language) that help them form relationships with new individuals. This claim is further supported by Samad et al. [84], who evaluated node features for predicting links between nodes. They observed that language and country are key features that play an important role in the association between nodes in an SN. Node-based features are further classified into two categories: (1) attributes (i.e., age, gender, school, interest, or location etc.) and (2) topological (i.e., degree, neighbors, level, or distance etc.). Node attributes are used when we have a graph without edges, whereas topological features are used when we have a partial graph from which to infer the links. For example, Jahanbakhs et al. [37] divided their link prediction work into two parts: (1) in the first phase, they predicted links using a graph without edges; (2) in the second phase, they considered a partial graph for link prediction. Moreover, link-based features are also used by many researchers for link prediction [40][91][7][67].

Figure 14. Taxonomy of Features used for link prediction

Link weight is considered the foremost feature for link prediction in weighted networks. In addition, node distance is another important feature and is estimated through random walk and shortest path methods. A few researchers have combined node-based and link-based features in order to predict future links [37]. Furthermore, link-based features are classified into two types: (1) attributes (i.e., weight, type, or label etc.) and (2) topological (i.e., level, path, or subgraph etc.).

Applications
Link prediction techniques have found a large number of applications in different areas. Any domain in which entities interact in a structured way can benefit from link prediction. A few of the most compelling and commonly used applications of link prediction are described briefly below.
Link prediction methods help recommender systems select similar users through a collaborative approach, leading to more effective recommendations [88]. Because a huge number of users are registered, users of such systems expect an effective and easy way to find the people they are familiar with. To achieve a high degree of accuracy, the majority of social networks implement link prediction techniques that automatically recommend similar users.
In biology, link prediction methods are applied to protein-protein interaction networks to detect potential interactions among proteins [77]. Examining protein interactions through laboratory experiments is costly in terms of time and money, so the results of initial experiments are used to select promising targets computationally.
Another use of link prediction is collaboration prediction in scientific co-authorship networks. Collaboration data are easy to access, since collections from journal indexing sites are publicly available. Link prediction methods serve as a tool for understanding how research areas progress by predicting which authors or groups could collaborate in the future [74].
Record linkage (also called entity resolution) consists of searching for identical records or references in a dataset. Traditionally, record linkage focused only on the similarity of attributes among entries. Recently, a few authors have used structural information to improve record linkage by applying link prediction methods along with the similarity between references [10].
Another widely used application of link prediction in social networks is to explore the structure of terrorist networks (also called criminal networks) in order to find ways to fight crime [47]. For instance, the authors in [100] claimed that if a small portion of links is reinserted using link prediction methods, the structure of some terrorist networks does not change. These outcomes suggest that link prediction techniques can reveal important links in criminal networks, opening a way to investigate specific terrorist actions.
Finally, networks can be useful for predicting how likely something is to spread across society. Marketing studies can also be improved with the help of network analysis. Some authors also report that link prediction can be used for viral marketing in order to build more effective marketing plans [81].

Conclusion
Link prediction has gained increasing attention in the recent decade as new algorithms and applications emerge rapidly. This article presented a comprehensive review of link prediction and showed that a variety of challenges and techniques exist. In the context of link prediction, categories of techniques, problems, networks, evaluation measures, and features are proposed. The different link prediction techniques, i.e., similarity-based, learning-based, probabilistic models, and preprocessing, are explained. Node-based and link-based features are also illustrated. Finally, applications of link prediction are discussed.
EAI Endorsed Transactions on Industrial Networks and Intelligent Systems Online First