Optimal topology of gene-regulatory networks : role of the average shortest path

Gene regulatory networks (GRNs) possess an important structural property; they are sparse and resilient, with a robust topology that affords protection against random “attacks” (e.g., gene deletions). However, such networks exhibit optimal or near-optimal topological features not present in other scale-free networks. This paper utilizes an integer linear program formulation to gauge the exact structural optimality of scale-free networks measured using the average shortest path between transcription factors and the regulated genes of a gene-regulatory network sampled from the Escherichia coli bacterium. While randomly generated versions of these networks show several cases for improvement, few subnetworks sampled from Escherichia coli ’s transcriptional network show optimized solutions that differ substantially from their original topology. We therefore conclude that sampled transcriptional subnetworks from Escherichia coli exhibit an “optimal” topology not present in alternative networks. Because these analyses do not consider the biology of expression dynamics and are based on topology alone, other communication systems, such as wireless networks, may benefit from a more detailed examination of the role in which the average shortest path affects system function, such as with noise or other signaling disruptions.


INTRODUCTION
In a gene-regulatory network (GRN), the protein-coding genes of DNA are abstracted as the nodes of a directed graph, with the interconnecting links associated with either a stimulatory (up-regulating), inhibitory (down-regulating), or mixed (both up-regulating and down-regulating) effect on gene expression, and therefore, cellular protein abundance [1].An exciting capability of GRNs is that they resist deleterious effects from external disruptions-a property termed as 'biological robustness' [2], such as the ability to mitigate dynamical effects of noise in protein-coding gene expression [3].While previous research has accredited some of this capability to network topology, there stands an underexplored question: which topological features contribute to this robustness, and what is its underlying mechanism?
Gene-regulatory networks fall within a class of scale-free networks [4], in which their topological degree distribution follows a power-law equation [5].This feature is associated with loosely connected modules that support fewer genes with larger degree; such genes are distinguishably known as "hubs" or "hub nodes" [5].Conversely, the majority of networked genes possess degrees much lower than the average.This aspect attributes resilience to the deleterious effects of genetic mutations, because highly connected master transcriptional regulators occur rarely in these networks; however, targeted suppression of these proteins, possibly by manipulating gene activity, may decouple their influence from the network and lead to a loss in organism-level function.Additionally, connectivity of some transcriptional network motifs, such as the feed-forward loop, to the embed-ding network may contribute substantially to its topological properties, such as the average shortest path and clusterring coefficient [6,7].Aside from these topological characteristics, feed-forward loops admit dynamical characteristics linked with their topology, such as an ability to generate pulses [8], admit signal delays and irreversible speed-ups [8], or respond differentially to dynamically acute concentration fluctuations [9].
In this paper, we investigate how another topological metricthe average shortest path between any two networked nodesis affected by different scale-free architectures with identical degree distribution and feed-forward loop abundance.In this context, We show that a gene-regulatory network of the Escherichia coli (E.coli) bacterium achieves a near-optimal level of robustness against changes to the average shortest path.This finding was not observed for the randomized networks considered here.

NETWORK DATASETS 2.1 Transcriptional Regulatory Networks
The E. coli bacterium thrives in the mammalian digestive tract, and is a widely-used model prokaryotic organism [10], and its GRN is considered a prototypical scalefree network.We have used the GeneNetWeaver software package [11] to obtain the largest-connected component of E. coli's GRN.We then derived a transcriptional-regulatory network (TRN) from this GRN by pruning the gene-gene interactions, leaving transcription factor (TF)-gene and TF-TF interactions.We term a "regulated gene" as one that does not regulate another, although TFs may regulate both other TFs and regulated genes.In this sense, the transcriptional regulatory network admits a regulation hierarchy with TFs near the root(s) and regulated genes as its leaves.Collectively, the E coli TRN so obtained has 23 connected components, 1565 transcription factors and regulated genes, and 3758 directed edges.
A number of subnetworks were sampled randomly from this transcriptional network, each with a number of nodes n = 5, 10, 15, 20, 25, 30, using the GeneNetWeaver software package.Networks were sampled in a way that ignored any auto-regulatory edges, as these did not contribute to the optimization metric (see below).Additionally, 10 replicates for each network size, n, were sampled for a total of 300 unique transcriptional-regulatory networks.

Randomized networks
Each sampled transcriptional-regulatory network was "randomized" using the following algorithm.We began with a number of disconnected nodes n, and |E|-many edges from each sampled E. coli network.Next, we selected a node pair at random with uniform probability from the set of all possible pair-wise combinations between distinct nodes, and drew a directed edge between them.This node-pair was then removed from the set of available pairs.This procedure was iterated until |E|-many edges were drawn.

Metrics and criteria for optimal topology
We adopt a version of the average shortest path, d , as a measure for network robustness, which has been widely used as a robustness metric for many complex systems [12].In this paper, d is defined as the distance, in number of "hops" (i.e., single movements between adjacent nodes) required to connect a transcription-factor node to a regulated gene, and averaged over all such pairs in a network.We restrict our analyses to the connected paths from transcription factors to genes.With these conditions, d can be expressed by the equation: wherein dij denotes the shortest path between nodes i and j, p the number of connected TF-gene pairs, nt the number of TFs, and ng the number of genes in the network.The type of the regulatory interaction, either up-or down-regulating, was not considered in Eq. ( 1).We note that d is slightly different from other definitions of the average shortest path, wherein p has been taken as the number of connected pairs without regard for their type.However, the p of 1 is distinct from these by its consideration of biology, that it reflects consideration of only a subset of the total number of possible paths by eliminating any contributions to d between regulated genes.
An optimal topology is one wherein a metric of the topology is at an extremum.As explained in the next section, this criteria is the average shortest path, d , for a given network G(V, E), wherein V is the set of all networked nodes (TFs and regulated genes) and E denotes the set of edges connecting them.An output of the integer linear programming algorithm (explained below) is an "optimized" version of G, G0(V, E0), which hosts an identical set of vertices, number of edges, |E| = |E0|, identical degree distribution as G, and the same number (and type) of feed-forward loop transcriptional network motifs, but with potentially different average shortest path, d0 .To determine whether the topology of G is optimal, we compare d to that of its optimized coun-   2)) evaluated using the integer linear programming method, for transcriptional-regulatory subnetworks sampled from E. coli (light grey boxes) of size n = 5, 10, 15, 20, 25, and 30, compared against optimal solutions found from randomized versions of these networks (dark grey).If the input network is optimal, then δ = 0.
terpart, d0 , using the following metric, δ: Therefore, δ = 0 corresponds to the result that G exhibits a perfectly optimal topology.

Optimizing Topology with Integer Linear Programming
Finding a network with optimal topology is a challenging problem.On the one hand, a brute-force search approach is intractable due to an exponentially large search space on the order of O(2 |V | 2  )-many different graphs.On the other hand, meta-heuristic methods, such as Simulated Annealing [13] and Genetic Programming [14,15,16], cannot guarantee an optimal solution.To address this problem, we propose a new approach to finding an optimal graph topologies for small and moderate size networks based on integer linear programming (ILP).
Many graph-theoretical problems are solvable using integer linear programming [17], such as the shortest path, vertex coverage, maximum flow, and minimum cost-flow problems.This is possible, because these problems can be expressed in terms of linear relationships which together form a polytope enclosed by their intersections.From this polytope, it is possible to identify an extremum of a cost function.Moreover, ILP can be useful to identify cases wherein solutions are not feasible with other methods, and their implementation can be facilitated using freely available academic software, such as IBM ILOG CPLEX optimizer [18].
We consider a linear program which identifies a new graph G0 with minimum d0 based on an input G, subject to the following constraints: 1. Connectivity between a TF and a regulated gene "target' must be preserved from G, which ensures invariance of the overall paths, but not necessarily the path-lengths; 2. Degree distributions between G and G0 are identical; 3. The number and type of feed-forward loop transcriptional network motifs remain invariant between G and G0, despite any variance in topology.
In particular, we hold the number of feed-forward loops constant during the optimization process, because we have previously identified that path-length metrics may be significantly affected if feed-forward loop transcriptional motifs are "deleted" from the network topology [6].The detailed equations for the linear constraints have not been included due to space restrictions.An example of an "optimized" 6-node network is given in Figure 1.The optimal solution (Fig. 1(b)) preserves connections between TF-gene pairs, in-and out-degrees for each node, and the number of feed-forward loop transcriptional motifs, but exhibits a d smaller by approximately 11%.

RESULTS AND DISCUSSION
Figure 2 illustrates results of the integer linear programming based optimization of sampled transcriptional-regulatory network topologies.Here we observe that very few of the subnetworks sampled from E. coli, of any size, exhibit an average path length in their optimized topologies that is smaller than already supported by these networks.This result should be contrasted by our attempts to optimize randomized networks, which demonstrate that average shortest path lengths computed for optimized topologies were significantly reduced over their non-optimized input network.
These results indicates that E. coli network topologies are already well-suited to minimize the average shortest path between transcription factors and their regulated genes.There may be plausible reasons why an evolved transcriptionalregulatory network may experience pressure to minimize the number of regulatory interactions between the regulating proteins and the terminal genes.Common modes of genetic evolution, such as gene duplication and divergence, alters the degree and connectivity of networked proteincoding genes; while these are manifestly local topological alterations, the path-length encompasses regulatory interactions that transcend a gene's local neighborhood, suggesting that network dynamics manifest with system-level properties might play a role in whether an organism's progeny survives.Although dynamics may play a role in correlating topological characteristics at network scales, there is evidence that node-degrees in both biological and non-biological networks are correlated by a geodesic distance of approximately three steps [19]; however, it is as-yet unclear what general mechanism underlies such a correlative relationship.

CONCLUSIONS
In this paper we used an integer linear programming based optimization method to determine a topology for a given (directed) transcriptional-regulatory network under the constraints that: (i) any paths present between transcription factors and genes remain in the optimized network; (ii) that degree distributions of the optimized network remain invariant; and (iii) that the number of feed-forward loop transcriptional motifs remain invariant.By using this method with sampled subnetworks from the transcriptional-regulatory network obtained from the E. coli bacterium, we found that E. coli's subnetworks exhibited topologies that already minimized the average shortest path length between any two transcription factor and regulated gene of the network.In contrast, randomized versions of these biological networks were not optimized, in the sense that the integer linear program identified alternative network topologies that further reduced the network's average shortest path length.
Although the integer linear programming formulation reveals differences in topological optimality between the considered TRNs and random networks, it may experience significant difficulties when analyzing larger networks, because the relatively large number of constraints required by this method makes it infeasible.For example, the largest reported example we are aware of involved a 150 node network with over 10 6 constraints [20], which leads to an excessive convergence time.It may therefore be beneficial to develop a heuristic that aims to reduce this computational complexity.

ACKNOWLEDGMENTS
This research was funded by the US Army's Environmental Quality and Installations 6.1 basic research program.Opinions, interpretations, conclusions, and recommendations are those of the author(s) and are not necessarily endorsed by the U.S. Army.It was also partially supported by NSF.

Figure 1 :
Figure 1: (a) An exemplary random network and (b) it's post-optimization output from the integer linear programming algorithm.The in-and out-degrees of each node in both are identical, and the existence of routes from the TFs (nodes 1-5) to their target regulated gene (node 6) have been preserved.In this example, the optimized network (b) exhibits a smaller average shortest path, d , than the exemplary network (a).