Scalable Source Code Similarity Detection in Large Code Repositories

Source code similarity detection is increasingly used in application development to identify clones, isolate bugs, and find copyright violations. Similar code fragments can be very problematic because errors in the original code must be fixed in every copy. Other maintenance changes, such as extensions or patches, must also be applied multiple times. Furthermore, the diversity of coding styles and the flexibility of modern languages make it difficult and cost-ineffective to manually inspect large code repositories. Therefore, detection is only feasible with automatic techniques. We present an efficient and scalable approach for identifying similar code fragments based on fingerprinting of source code control flow graphs. The source code is processed to generate control flow graphs that are then hashed to create a unique fingerprint of the code, capturing semantic as well as syntactic similarity. The fingerprints can then be efficiently stored and retrieved to perform similarity search between code fragments. Experimental results from our prototype implementation support the validity of our approach and show its effectiveness and efficiency in comparison with other solutions.

Enterprise Resource Planning (ERP) systems are a fundamental part of most companies' IT application portfolios. They provide a set of standardized software applications that handle interdisciplinary business processes across the entire value chain of an enterprise [1,2]. The potential of ERP systems to integrate business functions such as supply chain, financial accounting, or human resources has led to their widespread adoption. One such example is the SAP (Systems Applications and Products) ERP software, which provides standard packages capturing "best business practices" [3]. However, rapid and continuous changes in business requirements are forcing companies to continuously modify and enhance the standard functionality to meet their needs [4]. Therefore, developers often need to modify specific pieces of code from the standard ERP code base to satisfy their business scenario requirements. Specifically, SAP allows its customers to develop their own enhancements by using Advanced Business Application Programming (ABAP) [...] present detection results in a useful and usable manner to development and quality teams.
Typically, source code similarity detection tools work by scanning the source code to identify similar pieces of code using different string matching algorithms. Although reasonably fast for small data, they are quite inaccurate and slow for large data. Additionally, code fragments can be textually different but share similarity at the semantic or structural level due to modifications made to the code, such as variable renaming and statement insertions, deletions, and replacements. Furthermore, precise (i.e., exhaustive) search of code fragments is often infeasible; therefore, the tools tend to over-approximate so as not to miss any possible duplicates. Consequently, this generates a considerable number of false positives that need to be manually inspected and verified [11,12]. Therefore, there is a need for source code search and classification techniques that can handle large source code repositories efficiently and with reasonable accuracy in order to be useful for development and quality teams.
In this paper, we present our approach for source code similarity detection in the context of the SAP ABAP programming language. The presented approach is designed to be robust and scalable for large code bases. We present our initial experimental results and our experience in implementing a prototype of the approach. The rest of the paper is organized as follows. Section 2 provides brief background, including general terms, definitions, and some related work. The proposed approach is introduced in Section 3. In Section 4 we discuss the experiment settings and show some results. Lastly, we conclude the paper and present future work in Section 5.

Background and Related Work
Clones can be broadly categorized into four types based on the nature of their similarity [10,13,14,15]. Exact clones (Type-1) are clone pairs that are identical to each other with no modification to the source code. Renamed clones (Type-2) are clone pairs that differ only in literals and variable types. Restructured or gapped clones (Type-3) are renamed clone pairs with some structural modifications such as additions, deletions, and rearrangement of statements. Semantic clones (Type-4) are clone pairs that have different syntax but perform the same functionality (i.e., they are semantically equivalent). These are typically the most challenging to find and identify, yet they are the most relevant in the context of ERP systems [16,17].
Several approaches have been proposed in the literature to identify similar source code, ranging from textual to semantic similarity identification. Generally, they are classified based on the source representations they work with. In text-based detection, the raw source code, with minimal transformation, is used to perform a pairwise comparison to identify similar source code [18]. Token-based detection, on the other hand, extracts a sequence of tokens using compiler-style source code transformation [19]. The sequence is then used to match tokens and identify duplicates in the repository, and the corresponding original code is returned as clones.
In tree-based detection, the code is transformed into Abstract Syntax Trees (ASTs) that are then used in subtree matching algorithms to identify similar subtrees [20]. Similarly, clone detection is expressed as a graph matching problem for Program Dependence Graphs (PDGs) in [21]. Metrics-based detection extracts a number of metrics from the source code fragments and then compares metrics rather than code or trees to identify similar code [22].
Generally, similar code identification techniques work at varying levels of granularity. Fine-grained detection leverages tokens, statements, and lines as the basis for detection and comparison [23]. On the other hand, coarse-grained detection uses functions, methods, classes, or program files as the basic units of detection [13]. Naturally, the finer the granularity of the tool, the longer it takes to find clone candidates. Equally, the coarser the granularity of the tool, the faster the detection, albeit with fewer detected clones [24]. Detection tools therefore have to make design trade-offs between accuracy and performance on an almost constant basis, depending on the code base being examined.
Another challenge in finding duplicate code is the performance of querying and retrieving possible matches from a large code base. Fingerprinting and hashing have been used to improve search efficiency [25]. Hashing maps variable-size source code to a fixed-size fingerprint that can later be used to query and search for clones in linear time [26]. However, a simple match does not work well for inexact matches. Others [16,27] use hashing techniques to group similar source code fragments together, thus enhancing the accuracy and performance of clone detection techniques. However, this is less effective in detecting Type-4 clones, as hashing and fingerprints are based on the source code and not on its semantics. Machine learning approaches have been proposed [28] to link lexical-level features with syntactic-level features using semantic encoding techniques [29] to improve Type-4 clone detection. However, for them to be effective, human experts need to analyze source code repositories to define the features that are most relevant for clone detection.
One way to capture program semantics is through the code's Control Flow Graphs (CFGs). CFGs are an intermediate code representation that describes, in graph notation, all paths that might be followed through a piece of code during its execution [30]. In CFGs, vertices represent basic blocks and edges (i.e., arcs) represent execution flow. Since CFGs capture syntactic and semantic features of the code, they are better at resisting changes that manipulate the source code in very minor ways without affecting the functionality of the program. For this reason, control flow graphs have been used in static analysis [31], fuzzing and test coverage tools [32], execution profiling [33,34], binary code analysis [35], malware analysis [36], and anomaly analysis [37].
In this paper, we argue that clone detection is characterized by more than just text patterns in the source code, because there are semantic features as well as syntactic features that must be considered for effective similarity detection. Since CFGs deliver both syntactic and semantic information about the code, we argue that the CFG representation provides a sensible choice for source code similarity detection for the following reasons. First, CFG block boundaries represent an intrinsic granularity level that is neither too fine nor too coarse for clone identification. Second, CFGs provide a reasonable balance between the syntactic and semantic representation of the code, thus considering clones with more than text pattern similarities.
We therefore present a clone detection technique for improving the precision of detecting similar code clones while reducing the time and memory complexities associated with large code bases. The key idea is to avoid the cost associated with graph and tree pattern matching by utilizing a staged identification approach. Specifically, we employ normalization and abstraction to standardize the source code. Then we derive the CFG and enumerate a list of possible execution paths. Finally, context sensitive hashing is used to efficiently

Proposed Approach
In this section, we describe the proposed three-step (staged) clone detection for source code as shown in Figure 1. The first step involves extracting source code meta-data up to the CFG for the code being evaluated. The second step involves processing the CFG to generate the program fingerprints. The third step involves identifying the most related code groups in the repository followed by similarity check among the group members.

Normalization and Abstraction
Code normalization is the process of transforming a piece of code to remove all parts that are irrelevant for the comparison. Applying normalization and abstraction to the code increases the clone variations that can be detected. This includes removing comments, white space, and empty lines, which do not affect the program behavior. Literal values and identifier names are replaced with fixed tokens. Control structures such as loops, IFs, and CASE statements are also normalized to increase resilience against syntactic variations. Specifically, the lexer (i.e., tokenizer) is used to break the stream of code into tokens. The parser then generates the Abstract Syntax Tree (AST) from the tokens using the context provided by the language grammar. The code can then be normalized according to predefined rules such as the example ones provided in Table 1.
One should note that the purpose of the normalization and abstraction step is to make the matching more resistant to semantically irrelevant variation. Which information to include or exclude depends on the specific language and on which kinds of clones should be found. However, excessive normalization can introduce ambiguities that decrease the accuracy of the match. Therefore, it is important to carefully consider the level of normalization based on the envisioned use cases.
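To make the normalization step concrete, the following is a minimal sketch in Python rather than the Java prototype described later. It assumes a small, hypothetical keyword list and three illustrative rules (drop comments, replace literals with LIT and numbers with NUM, replace identifiers with ID); these are not the rules of Table 1, and the regular-expression tokenizer stands in for a real ABAP lexer and parser.

```python
import re

# Hypothetical keyword subset; a real normalizer would take it from the ABAP grammar.
KEYWORDS = {"DATA", "TYPE", "IF", "ELSE", "ENDIF", "LOOP", "AT", "INTO",
            "ENDLOOP", "WRITE", "DO", "ENDDO", "TIMES"}

# Token classes: 'quoted literal', "inline comment, identifier, number, any other symbol.
TOKEN = re.compile(r"'[^']*'|\".*|[A-Za-z_]\w*|\d+|\S")


def normalize(source):
    """Return one normalized statement string per non-empty source line."""
    statements = []
    for line in source.splitlines():
        if line.startswith("*"):          # ABAP full-line comment
            continue
        out = []
        for tok in TOKEN.findall(line):
            if tok.startswith('"'):       # ABAP inline comment: ignore rest of line
                break
            if tok.startswith("'"):
                out.append("LIT")         # abstract literal values
            elif tok.isdigit():
                out.append("NUM")         # abstract numeric constants
            elif tok.upper() in KEYWORDS:
                out.append(tok.upper())   # keywords are case-insensitive
            elif tok[0].isalpha() or tok[0] == "_":
                out.append("ID")          # abstract identifier names
            else:
                out.append(tok)           # operators and punctuation stay as-is
        if out:
            statements.append(" ".join(out))
    return statements
```

Under these assumed rules, `IF total > 10.` and `if Sum > 99.` both normalize to `IF ID > NUM .`, which is the kind of Type-2 variation the step is meant to absorb. Adding more abstraction (for example, collapsing all comparison operators) would catch more variants at the risk of the ambiguities noted above.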

Control Flow Graphs (CFGs)
CFGs describe the order in which code statements are executed as well as the conditions that need to be met for a specific path of execution to be taken [38]. They capture the structure of a program as a directed graph (i.e., digraph) in which nodes (i.e., vertices) define the program's basic blocks and edges define the possible control transfers between these blocks. Specifically, a basic block is a continuous sequence of statements that execute in the same order as they appear in the block, without control changes (i.e., branches and jumps). Directed edges, on the other hand, represent jumps and branches between CFG basic blocks. The CFG construction is carried out based on the abstract syntax tree (AST) representation, to which control flow information is added [21,22].
The set of all possible execution paths in a CFG can then be expressed as $P(CFG) = (p_0, \ldots, p_r, \ldots, p_R)$.

Control Flow Extraction.
The steps to extract the CFG can be explained with the help of the pseudo code shown in algorithm 2.1. CFG extraction starts with the normalized source code obtained from the corresponding AST. A syntax tree represents the design as a tree structure by abstracting away the details of the language syntax. This is then used to obtain the normalized program statements. These statements are then analyzed to identify the labels of the code blocks, called leaders. We identify leaders for the code blocks (i.e., nodes) as follows: 1) the first statement of a program is a leader, 2) the targets of control statements (i.e., loops and conditions) are leaders, and 3) the statements immediately following control statements are leaders. The sequence of statements between these leaders constitutes the basic blocks $b_i \in B$. An edge $e_{i,j} \in E$ describes the transfer of control between two blocks of code $b_i$ and $b_j$. The CFG is constructed by adding edges between basic blocks wherever execution control flow exists. Figure 2 illustrates how a CFG is extracted from a piece of example source code. It should be noted that the CFG extraction here is a simplified approach and does not consider recursive calls, function calls, or try-catch statements at this stage. If more precise CFGs are deemed necessary, one can use more advanced techniques, however, at the expense of CFG extraction time.
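The three leader rules can be sketched as follows. This is an illustrative Python sketch, not the algorithm 2.1 pseudo code: it assumes that control statements have already been lowered to a hypothetical `Stmt` record carrying the index of its jump target and a flag saying whether execution can fall through to the next statement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stmt:
    text: str
    target: Optional[int] = None   # index of the jump target for control statements
    falls_through: bool = True     # False for unconditional jumps


def build_cfg(stmts):
    """Split a non-empty statement list into basic blocks and connect them."""
    leaders = {0}                              # rule 1: first statement
    for i, s in enumerate(stmts):
        if s.target is not None:
            leaders.add(s.target)              # rule 2: target of a control statement
            if i + 1 < len(stmts):
                leaders.add(i + 1)             # rule 3: statement after a control statement
    starts = sorted(leaders)
    ends = starts[1:] + [len(stmts)]
    blocks = [stmts[a:b] for a, b in zip(starts, ends)]
    block_of = {start: n for n, start in enumerate(starts)}

    edges = set()
    for n, block in enumerate(blocks):
        last = block[-1]                       # a control statement always ends its block
        if last.target is not None:
            edges.add((n, block_of[last.target]))
        if last.falls_through and n + 1 < len(blocks):
            edges.add((n, n + 1))
    return blocks, sorted(edges)
```

For a lowered IF/ELSE with five statements (condition, then-branch statement, jump over the else branch, else-branch statement, join statement), this produces four blocks and the diamond-shaped edge list `[(0, 1), (0, 2), (1, 3), (2, 3)]`. Like the simplified extraction described above, the sketch ignores calls and exception handling.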

Execution Paths
Execution paths model the different possible execution orders of program statements, which contain many loops, exceptions, and calls; that is to say, features that reflect the semantics of the program. More precisely, the execution paths consider the nodes and the interdependencies (i.e., edges) between different nodes in the CFG of a program. Therefore, they can reproduce some of the semantic effects of a particular CFG.
Given the CFG obtained in section 2.2, one can define all possible execution paths of a program such that every two adjacent nodes in a path are connected by an edge in $E$, where $b_0$ is the start block and $b_N$ is the end block. For any given program run, only one path among all the possible paths can be followed. Therefore, specific program behavior can be viewed as a collection of these paths. The pseudo code in algorithm 2.2 enumerates all potential paths using a depth-first traversal of the CFG starting at $b_0$. Each path is a stack of basic blocks, and each block $b_n$ is a sequence of statements $s_k$. A program CFG can then be expressed as the set of possible execution paths $P(CFG)$.
One should note that the path enumeration technique presented here is not precise. However, we argue that exact paths are not important in themselves. Rather, what is important is that these paths provide additional execution context to the CFGs. We can therefore use paths with lower accuracy that is still acceptable for the purposes of our approach.
Once all potential paths are enumerated, the fingerprints can be created for each path as described in the next section.
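A depth-first enumeration of this kind can be sketched in a few lines of Python. The sketch is not the algorithm 2.2 pseudo code; in particular, it assumes that a block may appear at most once per path, which is one simple way to keep loops from producing infinitely many paths, and it drops branches that never reach the end block.

```python
def enumerate_paths(edges, start, end):
    """List every simple path from `start` to `end` as a list of block indices."""
    successors = {}
    for a, b in edges:
        successors.setdefault(a, []).append(b)

    paths = []
    stack = [start]                    # the path currently being explored

    def dfs(node):
        if node == end:
            paths.append(list(stack))  # copy: the stack keeps changing
            return
        for nxt in successors.get(node, []):
            if nxt in stack:           # assumption: skip back edges (loops)
                continue
            stack.append(nxt)
            dfs(nxt)
            stack.pop()

    dfs(start)
    return paths
```

For the diamond-shaped graph with edges `[(0, 1), (0, 2), (1, 3), (2, 3)]` this yields the two paths `[0, 1, 3]` and `[0, 2, 3]`. The number of simple paths can grow exponentially with the number of branches, so a practical implementation would likely cap path length or count.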

Fingerprint Generation
CFG fingerprinting involves extracting various features from the CFGs. The features are used to generate a unique and compact representation of the CFGs in such a way that similar paths are assigned similar fingerprints. Typically, the extracted features can be either semantic or syntactic features, which are then hashed to produce the compact representation of the CFGs. However, on the one hand, using only syntactic features results in a fingerprint that is sensitive to minor code modifications. On the other hand, using only semantic features results in a fingerprint that is sensitive to CFG structural information while ignoring the syntactic features of block statements. Therefore, our objective is to generate a fingerprint that captures not only the syntactic information of a CFG, but also its semantic features.
We make use of the extracted execution paths in section 2.3 to represent the CFGs in a light-weight fingerprint. The fingerprint, in addition to compactly capturing textual and structural information of the programs, should represent each path in the program, such that similar paths have a higher probability of collisions or will only differ slightly in their digest. Specifically, very similar paths should map to very similar, or even the same, digest, and difference between digests should be some measure of the difference between paths.
To this end, we employ similarity-preserving hashing techniques [39,40,41] to fingerprint the CFGs. Our choice is motivated by the following reasons: 1) the shorter size of fingerprints lends itself well to efficient search and clustering algorithms, thus reducing search time; and 2) similarity-preserving fingerprints incorporate approximation, thus capturing more clones than strict text-based or graph isomorphism-based approaches.
With these observations in mind, we extend the SimHash technique presented in [42] to generate a fingerprint that is efficient and also considers the program execution semantics as captured by the control flow graph. Concretely, all sequences of blocks along a certain execution path are stacked as one unit. The sequence of statements in these ordered blocks is then hashed as shown in algorithm 2.3. Therefore, an execution path will have a specific hash value and a program CFG will have multiple hash values.
Consider the CFG example shown in Figure 3. The possible paths for the program are the set of paths in [...] An observant reader may argue that it would be redundant and space-consuming to consider information captured from repeated blocks that are common to multiple paths of the same CFG. However, our fingerprint is compact and computed efficiently, thus making the impact of repetition negligible. Moreover, further refinement and fine-tuning are also possible to exclude irrelevant paths and blocks from the fingerprint.
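For readers unfamiliar with SimHash, the basic scheme can be sketched as follows. This Python sketch shows plain SimHash over the statements of a path with equal weights; it is not the extended variant of algorithm 2.3. The 64-bit fingerprint length, the use of BLAKE2b as the per-feature hash, and the helper names are assumptions made for illustration.

```python
import hashlib

R = 64  # assumed fingerprint length in bits


def feature_hash(feature):
    """Stable R-bit hash of one feature (here: one normalized statement)."""
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=R // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(features, weight=1):
    """Each feature votes +weight or -weight on every bit; the sign decides the bit."""
    votes = [0] * R
    for feature in features:
        h = feature_hash(feature)
        for bit in range(R):
            votes[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(R):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def path_fingerprint(path, blocks):
    """Stack the statements of all blocks along a path and hash them as one unit."""
    statements = [stmt for block_index in path for stmt in blocks[block_index]]
    return simhash(statements)
```

Here `blocks` is assumed to be a list of basic blocks, each a list of normalized statement strings, and `path` a list of block indices. Because each bit is decided by a majority vote, changing one statement out of many tends to flip only a few bits, which is the property the matching step relies on. Hashing single statements ignores their order; hashing overlapping statement pairs instead would be one way to make the fingerprint order-sensitive.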

CFGs Matching
Once the fingerprints for the CFGs are generated, the actual matching is performed. For CFGs to be considered similar, they not only have to be isomorphic, but their basic blocks also have to match. Specifically, graph matching is performed by checking node similarity, edge similarity, and the relationships between them. One can make use of graph matching techniques such as bipartite matching and maximum common subgraph isomorphism (MCS) [43]. Alternatively, one can use the CFG fingerprints to compare pairs of CFGs. However, pairwise exact matching is neither efficient for large repositories nor robust enough to match programs with small variations. Therefore, we use inexact matching of the fingerprints. Specifically, in our case, once we have all the fingerprints, the problem of detecting code clones essentially becomes a fingerprint categorization and clustering problem such that, within a cluster, the pairwise similarity metric remains below a pre-defined threshold value, α, while the cluster size is restricted to be no less than another pre-defined value. This is accomplished with the help of a similarity function, which is described next.
Similarity Function. The idea of our similarity function is to have a measure of the structural similarity of two CFGs that looks not only at the CFG structure, but also at the meta-data of the nodes. For example, a pair of CFGs may be isomorphic (i.e., identical) while their nodes (i.e., block syntax) are different. Another pair of CFGs may not be isomorphic but have similar paths. Therefore, CFG comparison should consider the similarities of the path basic blocks as well as the actual paths common to the two CFGs.
We estimate the similarity between a pair of paths $(p_i, p_j)$ as the number of bits that differ between their two fingerprints. More formally, given two fingerprints $h(p_i)$ and $h(p_j)$, each expressed as a binary vector of length $R$, we define the distance between $p_i$ and $p_j$, $D_{path}(p_i, p_j)$, to be the number of bits in which $h(p_i)$ and $h(p_j)$ differ. The lower the value of the distance, the more similar the paths are. For example, a distance value of 0 means that the paths are identical, while a distance value of $R$ means that the two paths are dissimilar. The set $S_{paths} = \{ D_{path}(p_i, p_j) : \forall (p_i, p_j) \in CFG \times CFG \}$ denotes the pairwise comparison values (i.e., Hamming distances) between paths in two distinct CFGs.
Earlier work [44] shows that one can efficiently identify whether fingerprint pairs differ in at most α bits. This value can be seen as a threshold for similarity between two fingerprints. Specifically, the lower the value of α, the higher the similarity between the path blocks. Furthermore, different values of α represent different degrees of similarity. For example, 0 < α < 4 represents identical or near-identical clones, while 4 < α < 8 represents similar but not near-identical clones. In our approach we use α < 8, chosen empirically based on our experimental findings; however, other values can be used for different environment settings.
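The distance and the threshold test translate directly into bit operations. The following Python sketch assumes fingerprints are stored as non-negative integers; the function names are illustrative only.

```python
def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")   # XOR leaves a 1 exactly where the bits differ


def within_threshold(a, b, alpha):
    """True if the fingerprints differ in at most `alpha` bits."""
    return hamming(a, b) <= alpha
```

For instance, the fingerprints `0b1011` and `0b0010` differ in two positions, so they match for any α of 2 or more and fail for α = 1.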
The term $S_{CFG} \in [0, 1]$ denotes the pairwise similarity between two CFG fingerprints. A similarity value of 1 means that the programs are similar, while a value of 0 means that the two programs' CFGs share no similar paths. Specifically, we use a variant of the Jaccard index to estimate the overall similarity between the pair (CFG, CFG): the numerator is the number of common paths with at most α differing bits, and the denominator is the total number of paths in the smaller program CFG. Consequently, the overall similarity between the two programs can be seen as the degree of overall similarity between similar paths in the two programs. One should note that the term $S_{CFG}$(CFG, CFG) measures the extent to which one CFG is contained inside another. In other words, it covers the case where one program consists of repeated copies of another, smaller program. If we want to measure the total amount of resemblance, that is, proportional similarity, between two programs' CFGs, one can change the numerator to be the total number of common paths. However, with either measure one can use α to consider only candidates where the similarity or containment score meets a pre-determined threshold.
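One possible reading of this containment measure is sketched below in Python. It is an interpretation, not the paper's exact formula: a path of the smaller CFG is counted as "common" if at least one path of the larger CFG lies within α bits of it, which keeps the score in [0, 1]. Returning 0 for a CFG without paths is likewise an assumption.

```python
def cfg_similarity(fingerprints_a, fingerprints_b, alpha):
    """Share of paths in the smaller CFG that have a near match in the other CFG."""
    if not fingerprints_a or not fingerprints_b:
        return 0.0                                  # assumption: nothing to compare
    if len(fingerprints_a) <= len(fingerprints_b):
        small, large = fingerprints_a, fingerprints_b
    else:
        small, large = fingerprints_b, fingerprints_a
    matched = sum(
        1
        for p in small
        if any(bin(p ^ q).count("1") <= alpha for q in large)  # Hamming distance test
    )
    return matched / len(small)
```

With α = 1, comparing the path fingerprints `[0b0000, 0b1111]` against `[0b0001, 0b11110000, 0b10101010]` gives 0.5, because only the first path has a counterpart within one bit. The nested loop is quadratic in the number of paths; it is meant to show the definition, not an efficient search.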

Evaluation And Discussion
In this section we present the initial results of our proof of concept along with some insights gained during our evaluation. First, we present the details of

Experiment
To evaluate the effectiveness of our approach in detecting similarities in source code, we ran the evaluation on a synthetic data set. We assessed the performance of the proposed approach against two criteria: execution time efficiency and detection precision. The detector is implemented in Java. Specifically, the detector reads the source code files, performs normalization, and generates the control flow graphs. The control flow graphs are represented in the DOT format, where nodes contain the normalized source code of a block and edges connect the blocks. All the experiments were conducted on a Windows 10 computer with an Intel Core i7 (4 cores at 3.60GHz) processor, 16GB of RAM, and 1TB of HDD storage capacity.
Experimental Data. Due to the lack of standard benchmarks for ABAP clone detection, 664 ABAP programs were collected from online repositories¹ to serve as our code base. Typically, verifying clones is a subjective decision that depends on the analyst's experience and the context of the code. Furthermore, manual verification of all code clone candidates is impractical. Therefore, we selected 50 random programs for manual examination to evaluate the existence of false positives. The distribution of the block counts in the programs shows that a considerable number of programs have 10 or fewer blocks and paths with fewer than 3 blocks. Intuitively, programs and paths with a lower number of blocks are less diverse and will generate a less precise fingerprint. This can be illustrated with the help of Figure 5a. The figure shows that the precision increases as the number of basic blocks increases. Furthermore, examining paths that have only 1 block revealed that most of these blocks contain generic code related to language-specific style requirements such as error handling, calls, and GUI-related code. This code is not necessarily useful for the purposes of clone detection. Therefore, we exclude all paths that contain fewer than 3 blocks from our fingerprint repository.

Results
Similarly, the precision increases as the number of lines of code in the block increases, as shown in Figure 5a. Additionally, a smaller threshold α is slightly more accurate for smaller programs; however, this improvement becomes less pronounced as the block size increases. Another observation is that the value of α is sensitive to the size of the program. For example, larger programs can tolerate higher values of α with a reasonable number of false positives, while smaller programs provide more accurate results with lower values of α.
¹ https://github.com/trending/abap
The impact of the threshold α on the number of clone candidates identified and on false positives is shown in Figure 5b. The detector is able to identify clones with no false positives when α ≤ 4. The false positives are still reasonable for α ≤ 7; however, they become more pronounced at higher thresholds. This can be explained by the fact that a higher α allows more bit differences between the signatures. We chose α = 5 as a reasonable threshold for our experiment. With this value, the detector is able to identify most of the relevant clones with a reasonable number of false positives. A larger α identifies more clone candidates; however, most of the newly identified ones were false positives. A smaller α only detected near-identical clones and missed many programs. Figure 5c illustrates the performance improvements compared to the standard clone detector in SAP systems. One can notice that our detector executes in linear time, growing with the number of programs under review. Nevertheless, our detector provides statistically significant performance improvements, as shown in the figure. Moreover, the execution time of the SAP clone finder grows at a faster rate with the number of programs being converted. The number of clones detected is also significantly higher with our approach. This can be explained by the flexibility of our fingerprint representation and the value of α. Figure 5d shows the distance heat map (i.e., the values of the similarity matrix) for our code. Paths that are identical have a lower distance and thus darker colors. For example, there are quite a few paths that are identical and are therefore shown in black. The structure shows that our method is able to discriminate between identical (or near-identical) paths and similar paths. Furthermore, it validates our choice of α in the experiment.

Discussion
While we have demonstrated the effectiveness of our approach for source code similarity search, there are still several practical issues, mainly in performance and accuracy, that one has to carefully consider for large-scale adoption.
First, in a very large code repository, search and detection time may not be acceptable even with the improvements provided by our approach. However, one of the salient features of our CFG fingerprint is its ability to represent group similarities. For example, one can use the fingerprints computed for each program CFG to classify the source code repository into several groups according to the similarity of the fingerprints. To find an initial match, we identify categories that are close to the fingerprint. Categories that are farthest away can then be excluded to reduce the number of comparisons required to find an initial match. This can be done as an initial step to identify possible initial matches.
Another performance improvement is leveraging bit parallelism and bit vector arithmetic for the distance and similarity computations. For example, in [45] the Jaccard index is approximated using a bit vector counting optimization. The same concept can be applied to our similarity measure. Similarly, [44] showed how one can efficiently find fingerprints within a certain threshold α (i.e., Hamming distance).
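One generic way to avoid comparing a query against every stored fingerprint is a pigeonhole filter, sketched below in Python. It is offered as an illustration and is not claimed to be the method of [44]: if two R-bit fingerprints differ in at most α bits and each is cut into α + 1 disjoint chunks, at least one chunk must be identical in both, so only fingerprints sharing a chunk need a full distance check. The function names and the chunking scheme are assumptions, and the sketch requires α + 1 ≤ R.

```python
from collections import defaultdict


def chunks(fingerprint, r, parts):
    """Cut an r-bit fingerprint into `parts` disjoint (position, value) chunks."""
    bounds = [r * i // parts for i in range(parts + 1)]
    return [
        (i, (fingerprint >> bounds[i]) & ((1 << (bounds[i + 1] - bounds[i])) - 1))
        for i in range(parts)
    ]


def build_index(fingerprints, r, alpha):
    """Map every chunk to the positions of the fingerprints that contain it."""
    index = defaultdict(set)
    for n, fp in enumerate(fingerprints):
        for key in chunks(fp, r, alpha + 1):
            index[key].add(n)
    return index


def query(index, fingerprints, fp, r, alpha):
    """Positions of all stored fingerprints within `alpha` bits of `fp`."""
    candidates = set()
    for key in chunks(fp, r, alpha + 1):
        candidates |= index.get(key, set())     # shares at least one exact chunk
    return sorted(
        n for n in candidates
        if bin(fingerprints[n] ^ fp).count("1") <= alpha   # verify the survivors
    )
```

With 16-bit fingerprints `[0x0000, 0x000F, 0xFFFF, 0x0101]` and α = 3, a query for `0x0001` returns positions `[0, 1, 3]`; the fingerprint `0xFFFF` shares no chunk with the query and is never compared. The filter is exact (it misses no match within α), but its selectivity drops as α grows, since the chunks become shorter.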
One should also note that the accuracy and precision of our approach relies on the CFGs extraction and path enumeration techniques. Naive techniques may result in more false positives. However, more robust techniques may be computationally expensive. Therefore, they need to be considered carefully. One can leverage different optimization techniques and use self tuning methods [46,47] to strike a balance between these conflicting requirements. Operationally, our tool may offer several static analysis techniques out of the box that may be selected and fine-tuned for the specific environment requirements.
Another challenge is the fact that inserting jumps that are never taken in the CFGs distorts the CFGs by generating paths that will never be executed. While matching based on CFGs semantics is not possible in the general case, static profiling techniques such as the ones presented here [33,34] can be used to check for branching and path probabilities. This can improve the semantic matching accuracy of our approach.
Furthermore, the size of the CFGs is a critical factor, as shown in our experiment. We excluded the smaller CFG blocks from our signatures. We believe this is a reasonable assumption, since smaller code fragments have significantly lower chances of being copied. However, the threshold should be investigated further.
Using fine-grained semantics at the block level to reveal more features is another area worth exploring. While some of this can be handled by fine-tuning the normalization and abstraction step in our approach, it might be helpful to explore some graph-theoretic and machine learning ideas [48,49] to further improve accuracy. This, however, may increase performance requirements and should be considered as an additional refining step.

Conclusion and Future Work
This paper demonstrated the need to efficiently identify and measure syntactic and semantic source code similarities in the large code bases of ERP systems and presented an approach to address this need. We designed a detection approach that searches for duplicate and near-duplicate code in an efficient way. By leveraging similarity hashing to concisely represent the control flow graphs of the code, the fingerprints capture the programs' intrinsic characteristics. The detector uses the control flow graphs to enumerate possible execution paths and fingerprints these paths efficiently. The experiment showed the viability of our approach and illustrated how the detector can achieve reasonable accuracy efficiently compared to current tools.
While the results presented in the evaluation look promising, they are initial results of our ongoing research in clone detection for large systems, and there is considerable work to be done. For example, in addition to continuing our empirical validation on larger and more challenging code bases, we plan to continue our work in several important directions. First, the CFG extraction can be enhanced to capture exception handling and function calls in the representation. Since SAP systems are highly integrated systems in which data-centric programming is carried out in ABAP, one may also consider evaluating other intermediate representations such as call graphs, data flow graphs, and system dependency graphs in order to consider data-flow dependencies as well as control-flow dependencies.
Secondly, our naive path enumeration can be improved to consider dominance relationship, loops and back edges to provide more precise paths for the CFG. It might also be worthwhile exploring the possibility of using path and execution static profiling techniques and path execution frequencies in the fingerprints to improve expressiveness of our representation.
The similarity hashing used can be also improved by exploring more intricate weights w for the most relevant parts instead of the equal weights used in the current implementation. Other similarity hashing techniques can be explored and studied. Finally, we plan to apply our approach on industrial case studies to evaluate different practical considerations, which would provide more insight into scalability and usability questions for different situations.