On the Computational Complexity of Software (Re)Modularization: Elaborations and Opportunities

Software system modularization and remodularization are key challenges in software engineering. Previous research has generally assumed that these problems are computationally intractable and has hence focused on heuristic methods such as hill-climbing, evolutionary algorithms, and simulated annealing that are fast but are not guaranteed to produce solutions of optimal or even near-optimal quality. However, this intractability has never been formally established. In this paper, we give the first proofs of the NP-hardness of software modularization and remodularization relative to several models of module-internal connectivity. We also review three popular algorithmic approaches for efficiently producing provably optimal or near-optimal solutions, discuss the applicability of these approaches to practical software modularization and remodularization, and list relevant results from the literature.


INTRODUCTION
The modularization of software systems, i.e., the assignment of software units to modules that are minimally coupled and maximally cohesive, is a central challenge in software system design. Of equal importance later on, in software maintenance, is the remodularization of such systems to re-optimize cohesion and coupling as these systems are modified in response to changing requirements. Over the last 30 years, much effort has been put into developing automated techniques to aid in modularizing and remodularizing software systems.
Much of this work has assumed that these problems are computationally intractable and hence cannot be solved both efficiently and optimally. This assumption is based on the similarity of modularization to intractable problems like graph partitioning [17] and the very large number of possible modularizations of a software system [20, page 194]. Given this assumed but unproven intractability, research has focused on heuristic methods such as hill-climbing [20], evolutionary algorithms (see [13, Sections 7.1 and 7.2] and references), and simulated annealing [20,27] that are fast but are not guaranteed to produce solutions of optimal or even near-optimal quality. However, this focus may be unnecessary: even if modularization and remodularization are intractable, there may yet be fast methods that give provably optimal or near-optimal solutions in practice.
In this paper, we address the issues raised above by (1) giving the first proofs of the computational intractability of software modularization and remodularization and (2) reviewing existing approaches for efficiently obtaining provably optimal or near-optimal modularizations and remodularizations and the restrictions (if any) under which these approaches operate. We will focus in particular on the promise of fixed-parameter tractable algorithms [7,8], whose runtimes are exponential in general but may be effectively polynomial on inputs that occur in practice.
This paper is organized as follows. In Section 2, we formalize the problems of software system modularization and remodularization. Section 3 demonstrates the intractability of these problems in general relative to both deterministic and probabilistic algorithms. Section 4 reviews efficient algorithmic approaches for modularization and remodularization whose performance is provably optimal or near-optimal, as well as existing results relative to those approaches. In order to focus in the main text on the implications of our results and reviewed approaches for (re)modularization research, all proofs of results are given in Appendix B. Finally, our conclusions are given in Section 5.

FORMALIZING SOFTWARE (RE)MODULARIZATION
To formalize software modularization and remodularization, we need to formalize the following four entities:
1. Software systems: Following [20], a software system is represented as a Unit Dependency Graph (UDG) G = (V, E) in which vertices are software units, e.g., procedures, classes, data types, and edges are dependencies between pairs of units, e.g., inheritance, procedure invocation, data-type use. Though dependency graphs are usually directed to allow for the directionality of dependencies, this directionality affects neither the definitions of software system coupling and cohesion nor the manner in which dependent units are assigned to the same module during (re)modularization. Hence, we will here ignore this directionality and consider only undirected UDGs.
2. Software system modularization: A k-modularization is a partition of a UDG G into k disjoint connected components; this will be formalized as a function M : V → {1, 2, ..., k} such that M(u) is the number of the module to which unit u is assigned. We will further refine the standard definition of cohesion given above by considering three degrees of connectivity via dependencies among the units in each module of a modularization M:

(a) Basic Connectivity (BC): There is a dependency-path in the module between every pair of units.
(b) High Connectivity (HC): There is a dependency-path of length at most 2 in the module between every pair of units. This is equivalent to each unit in a module with n units having at least n/2 dependencies to other units in that module.
(c) Complete Connectivity (CC): There is a dependency in the module between every pair of units.
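The three connectivity degrees can be made concrete with a short sketch. It assumes, purely for illustration, that a UDG is given as a list of undirected edges between unit names and that a module is a collection of unit names; the function names are hypothetical. The HC check follows the path-based wording of (b) literally and does not test the degree-based characterization.

```python
from collections import deque
from itertools import combinations

def module_adjacency(edges, module):
    """Adjacency sets of the subgraph induced by the units in `module`."""
    members = set(module)
    adj = {u: set() for u in members}
    for u, v in edges:
        if u in members and v in members and u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def is_basic_connected(edges, module):
    """BC: a dependency-path inside the module joins every pair of units."""
    adj = module_adjacency(edges, module)
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search restricted to the module
        u = queue.popleft()
        for v in adj[u] - seen:
            seen.add(v)
            queue.append(v)
    return len(seen) == len(adj)

def is_high_connected(edges, module):
    """HC (path reading): every pair is adjacent or shares a neighbour in the module."""
    adj = module_adjacency(edges, module)
    return all(v in adj[u] or bool(adj[u] & adj[v])
               for u, v in combinations(adj, 2))

def is_complete_connected(edges, module):
    """CC: every pair of units in the module is directly dependent."""
    adj = module_adjacency(edges, module)
    return all(v in adj[u] for u, v in combinations(adj, 2))
```

On a four-unit dependency chain only the BC check succeeds, a star-shaped module passes BC and HC, and a triangle passes all three.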

Remodularization modifications: A remodularization changes the module-assignments of one or more units, where each such change is a move refactoring [19,27]. To allow control over the number of changes in a remodularization (in order to, for example, ensure that the remodularization respects the original modularization as much as possible [1]), we will define the number of module-assignment changes between two modularizations M and
The above yields the following problems:
Software Modularization with Connectivity X (SMod-X where X ∈ {BC,HC,CC}) if such an M exists, and special symbol ⊥ otherwise.
Software Remodularization with Connectivity X (SReMod-X where X ∈ {BC,HC,CC}) if such an M′ exists, and special symbol ⊥ otherwise.
Note that in assessing the quality of a remodularization, we require only that this remodularization improve by at least a specified amount cs relative to the quality of the given modularization. In combination with the control over the number of remodularization changes granted by parameter cm, this allows assessment of the computational difficulty of incremental schemes for remodularization [2,27].
The above is already useful, in that it establishes that software modularization is not, as it is sometimes construed in the literature [17], the problem Graph Partitioning [12, Problem ND14]. This is so because Graph Partitioning incorporates the additional constraint that all components in a modularization have a specified maximum size. Rather, modularization with basic and complete connectivity corresponds to the following problems:
Footnote 2: A clique is a graph G in which there is an edge between each pair of vertices in G.
The connected components and cliques in kWC and kCD are equivalent to the modules in SMod-BC and SMod-CC in both number and connectivity-type, and the sets E′ of edges in both kWC and kCD are equivalent to the sets of coupling-dependencies counted by F_cpl() in SMod-BC and SMod-CC. This gives us the following:
Observation 1. Problems k-way Cut and SMod-BC are equivalent.
Observation 2. Problems k-Cluster Deletion and SMod-CC are equivalent.
As we shall see below, these observations will be most useful both in applying known results to our problems and in proving new results.
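The correspondence behind these observations can be sketched in a few lines. The sketch assumes, for illustration only, that a modularization is stored as a dictionary from unit names to module numbers; both function names are hypothetical. The edges joining different modules play the role of the edge set E′, and deleting them leaves exactly the modules as connected components whenever each module is itself connected.

```python
def coupling_edges(edges, assignment):
    """Edges whose endpoints lie in different modules (the set E' in cut terms)."""
    return [(u, v) for u, v in edges if assignment[u] != assignment[v]]

def components_after_removal(vertices, edges, removed):
    """Connected components once the `removed` edges are deleted (union-find)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    removed_set = {frozenset(e) for e in removed}
    for u, v in edges:
        if frozenset((u, v)) not in removed_set:
            parent[find(u)] = find(v)
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

For a five-unit chain a-b-c-d-e split into modules {a, b, c} and {d, e}, the only coupling edge is (c, d), and removing it yields those two modules as components.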

SOFTWARE (RE)MODULARIZATION IS INTRACTABLE
Following general practice in Computer Science [12], we define tractability as being solvable in the worst case in time polynomially bounded in the input size. We show that a problem is not polynomial-time solvable, i.e., not in the class P of polynomial-time solvable problems, by proving it to be at least as difficult as the hardest problems in problem-class NP (see [12] and Appendix A for details).
Modulo the conjecture P ≠ NP, which is widely believed to be true [11], the above shows that software modularization and remodularization cannot be done optimally in polynomial time in general.
The NP-hardness of SMod-CC and SReMod-CC is perhaps unsurprising given the computational intractability of finding cliques of a specified size in a given graph [12, Clique, Problem GT19] (indeed, this is reflected in the fact that the NP-hardness of SMod-CC and SReMod-CC holds when the numbers of requested modules k and kF, respectively, have any fixed value greater than or equal to three). As clique-modules have the highest possible number of module-internal dependencies, this implies that finding modularizations and remodularizations with the highest possible values of MQ [20] and Q [27] is intractable; as our problems remain NP-hard relative to basic connectivity, this intractability continues to hold for much lower values of MQ and Q.
Result A has very interesting additional consequences. It is widely believed that P = BPP [25, Section 5.2], where BPP is considered the most inclusive class of problems that can be efficiently solved using probabilistic methods (in particular, methods whose probability of correctness can be efficiently boosted to be arbitrarily close to one). Hence, our results also imply that unless P = NP, there are no probabilistic polynomial-time methods which correctly modularize or remodularize software systems with high probability for all inputs. Taken together, the above constitutes the first proof that no currently-used method (including those based on hill-climbing [20], evolutionary algorithms (see [13, Sections 7.1 and 7.2] and references), or simulated annealing [20,27]) can guarantee both efficient and correct operation for all inputs for these problems.
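To give a sense of why exhaustive search offers no way around these results, the following sketch counts the ways of splitting n units into k non-empty modules, i.e., the Stirling number of the second kind S(n, k). This count ignores the connectivity constraints, so it is only an upper bound on the number of candidate k-modularizations of a particular UDG; the function name is hypothetical.

```python
def stirling2(n, k):
    """Number of partitions of n labelled units into k non-empty modules."""
    row = [1] + [0] * k  # row[j] holds S(i, j) for the current i, starting at i = 0
    for _ in range(n):
        new = [0] * (k + 1)
        for j in range(1, k + 1):
            # The new unit either joins one of j existing modules or opens module j.
            new[j] = j * row[j] + row[j - 1]
        row = new
    return row[k]

if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, stirling2(n, 3))
```

Already for ten units and three modules there are 9330 candidate partitions, and the count grows roughly like k^n / k! as n increases.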

PROVABLY (NEAR-)OPTIMAL ALGORITHMIC OPTIONS FOR SOFTWARE (RE)MODULARIZATION
Though the intractability results in Section 3 are elegant and powerful, the inconvenient fact remains that we would still like to perform software system modularization and remodularization in an efficient and reliable manner. In this section, we will consider three options for accomplishing this, namely polynomial-time approximation algorithms (Section 4.1), restricted-case polynomial-time algorithms (Section 4.2), and fixed-parameter tractable algorithms (Section 4.3), and discuss their applicability in practical software modularization and remodularization (Section 4.4).

Poly-time Approximation Algorithms
A polynomial-time approximation algorithm is an algorithm that gives a solution whose value relative to a particular parameter (such as solution quality, e.g., parameter s in problem SMod-BC) is provably within an additive or multiplicative factor of the value of that parameter in an optimal solution [6]. Very few problems have polynomial-time additive-factor approximation algorithms. Given this, the desirable situation is a polynomial-time (1 + ε)-multiplicative factor algorithm, where ε is very small, e.g., 1/10.
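The two kinds of guarantee can be written out for a minimization problem as follows. The symbols A(I) and OPT(I) are notation assumed here for illustration and do not appear in the original text.

```latex
% Assumed notation: A(I) is the value of the solution returned by algorithm A
% on instance I, and OPT(I) is the value of an optimal solution for I.
\begin{align*}
\text{additive factor } c:\quad & A(I) \le \mathrm{OPT}(I) + c \\
\text{multiplicative factor } (1+\epsilon):\quad & A(I) \le (1+\epsilon)\,\mathrm{OPT}(I)
\end{align*}
```

For example, with ε = 1/10 and an optimal solution of value 50, a multiplicative guarantee promises a solution of value at most 55.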
At present, there is only one such result for our problems.
Result B: SMod-BC can be approximated in polynomial time to a multiplicative factor of 2 − c²/k² for some constant c (follows from Observation 1 and [26, Theorem 2.1]).
One cannot set c = k to get a polynomial-time optimal solution algorithm as the algorithm runtime increases dramatically as the value of c increases, yielding an exponential runtime as c approaches the value of k. Moreover, though this algorithm may produce solutions close in value to optimal on some inputs, it will produce solutions whose value is effectively twice that of optimal on others, and there is no way to tell which situation holds for any given input; hence, it is not useful in practical modularization.

Restricted-case Poly-time Algorithms
Many problems have algorithms whose runtimes are non-polynomial in general, e.g., O(n^k) for problem-parameters n and k, but polynomial if one or more parameters have their values fixed to constants, e.g., if k = c for some constant c, O(n^k) becomes O(n^c). Such restricted-case polynomial-time algorithms are practical if the parameters k and n are of very small and moderate value, respectively, in inputs encountered in practice.
At present, there are several such results for our problems.
Running times can often be improved for small constant values of a parameter (for an overview of such algorithms for SMod-BC when k ≤ 6, see [26, Section 1]). Such algorithms may indeed be useful in the case of modularizing systems that are small (and hence have both few units and few modules) or as part of an iterative strategy that modularizes small portions of a larger system. However, they are unlikely to be practical for one-shot modularizations of systems consisting of a large number of units, even if the number of requested modules is small.

Fixed-parameter Tractable Algorithms
As the problem with restricted-case polynomial-time algorithms noted above is that n may be very large even if k is very small, it would seem reasonable to relax the requirement that an algorithm's runtime be polynomial in all of its parameters. This insight underlies the theory of parameterized computational complexity [8]. It turns out that a number of NP-hard problems have been successfully solved by algorithms whose runtimes are polynomial in the overall input size and non-polynomial in parameters whose values are small in the inputs encountered in practice (see [8,22] and references). The following states this insight formally.
Definition 1. Let Π be a problem with parameters k1, k2, .... Then Π is said to be fixed-parameter (fp-) tractable for parameter-set K = {k1, k2, ...} if there exists at least one algorithm that solves Π for any input of size n in time f(k1, k2, ...) · n^c, where f(·) is an arbitrary function and c is a constant. If no such algorithm exists, then Π is said to be fixed-parameter (fp-) intractable for parameter-set K.
In other words, a problem Π is fp-tractable for a parameter-set K if all superpolynomial-time complexity inherent in solving Π can be confined to the parameters in K.
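The practical difference between a restricted-case runtime of the form n^k and a fixed-parameter runtime of the form f(k) · n^c can be seen by comparing step counts directly. In the sketch below, the choices f(k) = 2^k and c = 2 are hypothetical and serve only as an illustration; they are not the bounds of any algorithm cited here.

```python
def restricted_case_steps(n, k):
    """Step count for a hypothetical O(n^k) restricted-case algorithm."""
    return n ** k

def fp_tractable_steps(n, k, c=2):
    """Step count for a hypothetical fp-algorithm with f(k) = 2^k and exponent c."""
    return (2 ** k) * n ** c

if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        for k in (3, 5, 10):
            print(n, k, restricted_case_steps(n, k), fp_tractable_steps(n, k))
```

With n = 1000 and k = 5, the first count is 10^15 while the second is 3.2 × 10^7: the exponential cost is paid only in k, and the dependence on the system size n stays quadratic.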
There are many techniques for designing fp-tractable algorithms [7,8], and fp-intractability is established in a manner analogous to polynomial-time intractability by proving a parameterized problem is at least as difficult as the hardest problems in one of the problem-classes in the W-hierarchy {W[1], W[2], ...} (see [8] and Appendix A for details). At present, there are several such results for our problems.
The situation is better than it may first appear, as a problem that is fp-tractable for a parameter-set K is also fp-tractable for any parameter-set K′ that is a superset of K [24, Lemma 2.1.30], and the runtimes of algorithms derived relative to K′ are often much better than those derived relative to K. Such additional restrictions will often even banish fp-intractability.
Result H: SMod-BC is fp-tractable for {k} when the average of the vertex-degrees in the given UDG is a constant (follows from Observation 1 and [15, Corollary 2]).
Result I: SMod-BC is fp-tractable for {k} when the given UDG is planar (follows from Observation 1 and [15, Proposition 3]).
Given that the running times of these algorithms are (to be blunt, ludicrously) impractical and k and s may only be small in instances that are easily modularizable by humans, the results above are not immediately useful for real-world software modularization. However, such impracticalities are typical of the initial fp-algorithms derived relative to a parameter or parameter-set. Experience has shown that once fp-tractability is proven, surprisingly effective fp-algorithms are often subsequently developed, sometimes by incorporating parameters that were not considered in the original analysis (see [7,8] and references). Hence, the results given here should be seen as promissory notes on algorithms that will be developed in the future, possibly within a research program like that sketched in the next section.
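To give a flavour of how fp-algorithms confine exponential cost to a parameter, the following sketch applies the standard bounded-search-tree technique to plain Cluster Deletion, parameterized by the number s of deleted edges. It is not the algorithm behind any result listed above, and it ignores the constraint on the number of modules k; the graph representation (a dictionary of adjacency sets with sortable vertex names) and the function names are assumptions made for illustration. The idea is that a graph is a disjoint union of cliques exactly when it has no three units u, v, w with dependencies uv and vw but not uw, and any such conflict can only be resolved by deleting uv or vw, which gives a search tree with at most 2^s leaves.

```python
from itertools import combinations

def build_adj(edges):
    """Adjacency sets for an undirected graph given as a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def find_conflict(adj):
    """Return (u, v, w) with edges uv and vw present but uw absent, or None."""
    for v in adj:
        for u, w in combinations(sorted(adj[v]), 2):
            if w not in adj[u]:
                return u, v, w
    return None

def cluster_deletion(adj, s):
    """Return at most s edges whose deletion leaves disjoint cliques, or None."""
    conflict = find_conflict(adj)
    if conflict is None:
        return []          # already a disjoint union of cliques
    if s == 0:
        return None        # a conflict remains but the budget is spent
    u, v, w = conflict
    for a, b in ((u, v), (v, w)):  # branch: delete uv, or delete vw
        adj[a].remove(b)
        adj[b].remove(a)
        rest = cluster_deletion(adj, s - 1)
        adj[a].add(b)      # undo the deletion before trying the other branch
        adj[b].add(a)
        if rest is not None:
            return [(a, b)] + rest
    return None
```

Each call does polynomial work to find a conflict and makes at most two recursive calls with a smaller budget, so the total runtime has the form 2^s · n^c required by Definition 1.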

Discussion
Though none of the three approaches reviewed above has results that are immediately useful in practical modularization and remodularization, all three are potentially of use. The most promising of these is fixed-parameter tractable algorithms. Such fp-algorithms are ideal for exploiting restrictions characterizing instances of modularization and remodularization that occur in practice, e.g., incremental remodularization in which the requested degrees of change in modularization quality (cs) and structure (cm) are both small [2,27]. Moreover, the relaxed tractability encoded in the fixed-parameter approach may yield improvements when combined with other types of algorithms, e.g., fixed-parameter approximation [18], hill-climbing [10], and evolutionary [16] algorithms.
The best way to start a fixed-parameter (re)modularization research program would be to characterize the UDGs that underlie actual software systems with an eye to finding both parameters whose values are small in practice and the most restricted types of graphs that encode the UDGs encountered in practice. The specific situations in which modularization and remodularization are done in practice, e.g., incremental (re)modularization, should also be scrutinized to look for additional parameters of small value. Such parameters and graph-types would then guide the derivation of useful fp-(in)tractability results relative to (if necessary, reformulations of) the modularization and remodularization problems defined in Section 2. This derivation process may benefit from both the results listed above and results for closely related problems, e.g., fp-tractability results for Cluster Deletion [4,5] and Highly Connected Deletion [14] and algorithms for Graph Partition [3].

CONCLUSIONS
We have presented a formal characterization of the problems of software system modularization and remodularization relative to several types of module-internal connectivity and given the first proofs that all of these problems are computationally intractable in general. This intractability makes unlikely the existence of polynomial-time deterministic or probabilistic methods that produce optimal solutions for these problems. We have also reviewed several algorithmic options for producing optimal or near-optimal solutions, namely polynomial-time approximation algorithms, restricted-case polynomial-time algorithms, and fixed-parameter tractable algorithms. Though none of these approaches has yet produced results that are of immediate use in practical software modularization and remodularization, the fixed-parameter approach shows some promise and we have accordingly sketched the outlines of a fixed-parameter-based research program. It is our hope that even if our recommendations are not adopted, the results and approaches discussed here will be of use in guiding future research on software modularization and remodularization.

ACKNOWLEDGMENTS
I would like to thank the three reviewers for comments which helped greatly in improving the content and presentation of this paper. This work was supported by NSERC Discovery Grant RGPIN 228104-2015.