Vul-Mirror: A Few-Shot Learning Method for Discovering Vulnerable Code Clone

Code reuse is common in software development, and it can spread vulnerabilities widely, so the automatic detection of vulnerable code clones is becoming increasingly important. However, existing solutions either cannot automatically extract the characteristics of vulnerable code or cannot select different algorithms for different code, which results in low detection accuracy. In this paper, we treat the identification of vulnerable code clones as a code recognition task and propose Vul-Mirror, a method based on a few-shot learning model for discovering vulnerable code clones. It not only extracts vulnerability features automatically but also uses the network to measure similarity. Experiments on open-source projects of five operating systems show that the accuracy of Vul-Mirror is 95.7%, and that it outperforms the state-of-the-art methods.


Introduction
With the rapid development of open-source communities, code reuse has become very popular. A code clone is a code fragment that is the same as or similar to another piece of source code [1], and it is one of the common ways of reusing code. Code cloning improves development efficiency, but it can lead to security problems. If vulnerable code is reused, its vulnerabilities spread to other applications and endanger the security of all related systems. For example, the OpenSSL heartbeat vulnerability (CVE-2014-0160) [2] affects websites, web servers, operating systems, and software applications because the affected systems either use the entire OpenSSL library or clone parts of it. We therefore need an automatic method that can accurately detect vulnerable code clones in different programs with minimal human intervention.
Once a vulnerability is exposed, researchers can analyze the vulnerable code and manually extract the corresponding vulnerability pattern, and then use the extracted patterns to discover vulnerable code in other programs. With the help of traditional program analysis methods such as symbolic execution [3][4] and taint analysis [5][6][7], semi-automatic discovery of vulnerable code is possible, but these tools depend on human learning and require expert knowledge.
To reduce human intervention, researchers have proposed many methods [8][9][10][11] that discover vulnerable code clones by calculating the similarity between vulnerable code and target code. These methods assume that similar code has the same vulnerability, which holds in most cases, so code comparison is a feasible way to discover vulnerable code when two code fragments are the same or highly similar. However, most vulnerabilities are caused by only a few lines of code, so there is a gap between code similarity and vulnerability. We need to extract vulnerability characteristics and make a more accurate comparison to discover vulnerable code clones.
In this paper, we observe that the process of discovering code clone vulnerabilities is very similar to face recognition. We therefore treat the discovery of vulnerable code clones as a code recognition task and propose a few-shot learning model for it. The model not only extracts vulnerability features automatically but also uses the network to measure similarity. To meet the requirements of few-shot learning, we build a sample set of vulnerabilities according to the different types of clones. By training the few-shot learning model on multiple tasks, the model learns how to compare code. When we test target code, the model outputs the vulnerability most similar to it. Based on code similarity, combined with the characteristics of vulnerabilities, the model can discover vulnerable code clones. The contributions of this paper are as follows:
We are the first to use few-shot learning to discover vulnerable code clones, and we propose a novel method named Vul-Mirror, which considers both code features and code similarity.
We analyze the relationships between different types of code clones and the original code, and effectively construct a code clone dataset for few-shot learning.
We implement the prototype system on the code of five popular operating systems, and the experimental results show that Vul-Mirror achieves much higher performance than the state-of-the-art methods.
The remainder of the paper is organized as follows. Section 2 reviews the related work. Section 3 presents the design of the system. Section 4 describes our experimental results. Section 5 discusses the advantages and disadvantages of the method, and we conclude the paper in Section 6.

Related work
Traditional static and dynamic analysis methods can also discover vulnerable code clones, but they rely heavily on security experts. We mainly review methods that rely less on security experts. These methods can be divided into two types: pattern-based methods and similarity-based methods.

Pattern-based methods
Li [12] assumes that most code is correct and proposes a method named CP-Miner to find code clone errors. CP-Miner parses a program and compares the resulting token sequences using the "frequent subsequence mining" algorithm known as CloSpan [13]. PR-Miner [14] focuses on clones of vulnerability patterns rather than of code. With frequent patterns, it can discover paired operations that need to appear together, such as "lock" and "unlock", or "malloc" and "free". These methods can find code clone vulnerabilities, but in many cases the vulnerabilities do not match a frequent pattern.
Yamaguchi [15] provides a method called Chucky to discover missing-check vulnerabilities. Chucky maps code to a vector space and extracts API (Application Programming Interface) usage patterns by principal component analysis. Candidate functions that rank highly in similarity to vulnerable code should be audited. Yamaguchi [16] exploits patterns extracted from the abstract syntax trees of functions to detect semantic clones. Yamaguchi [17] proposes a method for inferring search patterns for taint-style vulnerabilities in C code. These methods extract vulnerability patterns semi-automatically, and each of them can only discover one fixed pattern.
Deep learning can automatically extract sample features. Li [18] develops a deep learning-based vulnerability detection system called VulDeePecker, which can extract more than one pattern automatically. μVulDeePecker [19] builds on VulDeePecker and can not only judge whether code is vulnerable but also decide the type of vulnerability. However, due to the lack of a large number of high-quality training samples, methods based on deep learning have not been widely used. Few-shot learning [24][25][26][27] was proposed to address sample shortage, but it has not yet been applied to vulnerability discovery.

Similarity-based methods
SourcererCC [20] and CCFinder [21] are typical lexicon-based approaches that only consider similarity at the lexical level of code fragments. Deckard [22] is a standard syntax-based approach that uses structural information to identify one kind of code clone. White [23] proposes a deep learning method to detect code clones. These techniques aim to detect as many code clones as possible rather than to find security vulnerabilities accurately.
ReDeBug [11] can quickly find some unpatched Type-3 code clones. However, it can hardly be applied to Type-2 clones. VulPecker [9] takes advantage of a variety of algorithms to calculate similarity. However, its comparison algorithms are limited, and it characterizes vulnerabilities with a predefined set of features that must be specified manually. VUDDY [8] normalizes tokens by replacing variables, function names, etc. with fixed names, and uses the hash values of functions to search for code clones. Shi H. [10] adopts deep learning to detect vulnerable code clones. The common disadvantage of the methods mentioned above is that they only use a single predefined metric to compare code at the token level or line level, and the vulnerability characteristics are not fully considered. Our approach not only uses a deep metric to compute code similarity but also combines the features of different vulnerabilities to find vulnerable code clones.

System design
When we write a new program or review code, we want to test whether the program contains historical vulnerable code. We can compute the similarity of two code fragments and then confirm whether the candidate code clones contain vulnerable code. There are two ways to compute similarity: the first compares one exposed vulnerability with all the target code (see Figure 1(a)), and the second compares one target code with all historical vulnerabilities (see Figure 1(b)). The first method is suitable for comparison with a few vulnerabilities, while the second is suitable for finding multiple vulnerabilities. We select the second method. In this way, the process of vulnerable code detection is similar to image recognition, so we can use image recognition methods to solve the problem of code clone vulnerability detection.
Following the image recognition approach, we design a system named Vul-Mirror to detect code clone vulnerabilities based on few-shot learning. In the training phase, we train a few-shot learning model with code clone vulnerabilities. The model learns how to find, among multiple vulnerabilities, the one most similar to the clone code. In the testing phase, the clone code is replaced with the target code. The trained model identifies which historical vulnerability is most similar to the target code and outputs the similarity value. Because some code clones have low similarity to the original code, we need to further verify whether they are vulnerable based on the output of the few-shot learning model, so we add a vulnerability verification process to the testing phase. The framework of Vul-Mirror is shown in Figure 2.
To realize Vul-Mirror, we break the task down into four individual tasks: building the data set, processing the data, designing a few-shot learning model, and identifying vulnerabilities.

Building data set
A few-shot learning model needs to be trained on a data set in which every class has one original sample and k similar samples, such as Omniglot and miniImageNet. No such data set exists for vulnerability discovery, so we need to build one. Code clones can be divided into four types [1]. We treat exposed vulnerable code as the original code and code clones as target code. We found the following patterns of vulnerable code clones:
Type-1 clone is an exact clone, which either copies the source code completely or at most adds some comments. A Type-1 clone does not change the function code, so it has the same vulnerability as the original code (see Table 1, original code and Type-1 clone). A buffer overflow vulnerability exists in the original code, and the Type-1 clone has the same vulnerability.
Type-2 clone is a renamed clone: it modifies variable or function names. If the changed name has nothing to do with any vulnerability, this kind of clone has the same vulnerability as the original code (see Table 1, Type-2 clone 1).

Original code                    Type-1 clone
int sum = 0;                     int sum = 0;
for (i = 0; i < len; i++);       for (i = 0; i < len; i++);

Type-4 clone is a semantic clone: it changes the statements but keeps the same functionality. In most cases, the clone has low similarity to the original code, so it is difficult to determine the vulnerability directly.
According to the vulnerability patterns of code clones, we build the data set (code clones) by the following steps:
Firstly, according to the information on exposed vulnerabilities in Common Vulnerabilities and Exposures (CVE), we download the vulnerability files and patch files from the open-source community and obtain the diff files of the vulnerabilities and patches.
Secondly, we extract functions from the vulnerable files. Because vulnerabilities are usually intra-function, we select the function as the unit of comparison. When the vulnerable code spans multiple functions, we compare the vulnerable code with the patch file and choose the function with the most changed lines.
Thirdly, based on the vulnerable functions, we generate the code clones using a heuristic method. For Type-1 clones, the clones have the same vulnerabilities as the original code, so we copy the original code to obtain them. For Type-2 clones, we copy the original code and normalize variables, parameters, function names, etc. as fixed symbols. For Type-3 clones, we delete or insert some statements in functions based on the Type-2 clone code. For Type-4 clones, creating a code clone is difficult, so we discover them through the similarity value and the verification module.
We then compare each modified code line with the patch. If a modified line is the same as a line in the patch, we replace it, and repeat until no modified line matches the patch. Finally, each vulnerability class contains one original code sample and three code clones, all of which are vulnerable.
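The heuristic clone generation described in these steps could be sketched as follows. This is only an illustration under stated assumptions, not the authors' tool: the function names (`make_type1`, `make_type2`, `make_type3`), the regex-based renaming, the small keyword set, and the use of caller-supplied filler statements are all hypothetical choices made for the example. Type-4 clones are not generated, in line with the text.

```python
import re

# Small, incomplete set of C keywords left untouched (assumption for the sketch).
C_KEYWORDS = {"int", "char", "void", "for", "while", "if", "else", "return", "sizeof"}

def make_type1(func_src):
    """Type-1: an exact copy of the function, at most with an added comment."""
    return "/* cloned */\n" + func_src

def make_type2(func_src):
    """Type-2: rename every non-keyword identifier to a fixed symbol ID1, ID2, ..."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"ID{len(mapping) + 1}"
        return mapping[name]

    return re.sub(r"\b[A-Za-z_]\w*\b", rename, func_src)

def make_type3(type2_src, patch_lines, fillers, position):
    """Type-3: insert one statement into a Type-2 clone.

    A candidate statement that is identical to a patch line is skipped,
    so the inserted line never coincides with the fix.
    """
    lines = type2_src.splitlines()
    patch = {line.strip() for line in patch_lines}
    for filler in fillers:
        if filler.strip() not in patch:
            lines.insert(position, filler)
            return "\n".join(lines)
    raise ValueError("every candidate statement matches a patch line")
```

Applying the three functions to one vulnerable function yields the three clones that, together with the original, would form one class of the data set.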

Data processing
Program code is different from images, so we process the samples to adapt them to few-shot learning. The processing flow is as follows:
(i) Normalization. We remove the tokens that have nothing to do with the function of the code, such as comments, non-ASCII characters, and redundant whitespace. We replace feature-independent prompt text, such as a long prompt string in double quotes, with "str". We normalize numbers to "NUM1", "NUM2", and normalize variables to "VAR1", "VAR2".
(ii) Transforming code to an abstract syntax tree (AST) sequence. An AST retains the most important information and removes the redundant information of the source code, so we use the AST to represent functions. We first transform the code to an AST, then transform the AST into a token sequence by depth-first search.
(iii) Vectorization. To get a fixed-size vector, we split the AST sequence into tokens and convert every token into a corresponding vector. We select 2,000 tokens (about 250 lines of C code) as the unit length of a function. When a function has fewer than 2,000 tokens, we pad it with zero vectors. If a function is longer than 2,000 tokens, we truncate it to the corresponding number of lines of code. We then use word2vec to complete the word embedding. After word embedding, each token is converted to a vector of 1×50 dimensions, and each function is converted into a vector matrix of 2000×50 dimensions.
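Steps (i) and (iii) can be illustrated with a short sketch. It is a simplified approximation under stated assumptions: the regular expressions, the keyword set, and the rule that an identifier followed by "(" is a function name rather than a variable are choices made for this example, not details given in the text. The AST conversion of step (ii) and the word2vec training are not shown; `pad_or_truncate` assumes that the token vectors have already been produced.

```python
import re

MAX_TOKENS = 2000  # tokens per function, as stated in the text
EMBED_DIM = 50     # embedding size per token, as stated in the text
C_KEYWORDS = {"int", "char", "long", "short", "unsigned", "float", "double", "void",
              "for", "while", "do", "if", "else", "return", "break", "continue",
              "sizeof", "struct", "static", "const", "switch", "case", "default"}

def normalize(code):
    """Approximate normalization of a C function (assumed rules, see lead-in)."""
    code = re.sub(r"/\*.*?\*/", " ", code, flags=re.S)      # block comments
    code = re.sub(r"//[^\n]*", " ", code)                   # line comments
    code = re.sub(r'"(?:\\.|[^"\\])*"', '""', code)         # empty out string literals
    code = code.encode("ascii", "ignore").decode("ascii")   # drop non-ASCII characters
    variables, numbers = {}, {}

    def var_symbol(match):
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        return variables.setdefault(name, f"VAR{len(variables) + 1}")

    def num_symbol(match):
        return numbers.setdefault(match.group(0), f"NUM{len(numbers) + 1}")

    # Identifiers not followed by "(" are treated as variables (assumption).
    code = re.sub(r"\b[A-Za-z_]\w*\b(?!\s*\()", var_symbol, code)
    code = re.sub(r"\b\d+(?:\.\d+)?\b", num_symbol, code)
    code = code.replace('""', '"str"')                      # every string becomes "str"
    return re.sub(r"\s+", " ", code).strip()                # collapse whitespace

def pad_or_truncate(vectors, max_tokens=MAX_TOKENS, dim=EMBED_DIM):
    """Return exactly max_tokens rows: truncate long inputs, zero-pad short ones."""
    clipped = list(vectors[:max_tokens])
    padding = [[0.0] * dim for _ in range(max_tokens - len(clipped))]
    return clipped + padding
```

Comments are stripped before string literals here, so a "//" inside a string would be misread; a real implementation would use a lexer instead of regular expressions.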

Design few-shot learning model
Many few-shot learning models are available. We choose a CNN for feature extraction and relation comparison. The corresponding model is shown in Figure 3. "code-1" is a query sample that has the highest similarity with cve-3 among the vulnerable codes, so we consider that "code-1" may contain the same vulnerability as cve-3.
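The 5-way 1-shot training described for this model, and the way the most similar vulnerability is picked, can be sketched as follows. The sketch makes several assumptions: the dataset is a dictionary mapping a class id to a list whose first element is the original vulnerable code and whose remaining elements are its clones, and the function names are hypothetical. The CNN encoder and the relation module are not shown; the relation scores are taken as given values between 0 and 1.

```python
import random

def sample_episode(dataset, n_way=5, n_query=1, rng=random):
    """Build one episode from {class_id: [original, clone1, clone2, clone3]}."""
    classes = rng.sample(sorted(dataset), n_way)
    # Support set: one labelled sample (the original code) per class.
    support = [(c, dataset[c][0]) for c in classes]
    # Query set: a fraction of the remaining samples of the same classes.
    query = []
    for c in classes:
        for clone in rng.sample(dataset[c][1:], n_query):
            query.append((c, clone))
    return support, query

def mse_loss(scores, support_labels, query_label):
    """Mean squared error between relation scores and 0/1 targets.

    The target is 1 for the support sample of the query's class and 0 otherwise.
    """
    targets = [1.0 if label == query_label else 0.0 for label in support_labels]
    return sum((s - t) ** 2 for s, t in zip(scores, targets)) / len(scores)

def most_similar(scores, support_labels):
    """Return the support class with the highest relation score, and that score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return support_labels[best], scores[best]
```

With relation scores of 0.1, 0.2, 0.9, 0.0, and 0.3 against cve-1 to cve-5, `most_similar` returns cve-3, which mirrors the "code-1" example above.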

Identification vulnerability
This process is added to confirm some vulnerabilities. If the vulnerability can be judged directly from the similarity, as for Type-1 clones and some Type-2 clones, this process can be skipped.
The few-shot learning model outputs which vulnerability is most similar to the target code, together with its relation value, which can be used to judge code similarity. If the similarity of the two codes is too low, for example less than 50%, we consider that the target code is not a vulnerable code clone. Conversely, if the similarity of the two codes is high, for example greater than 95%, we consider that they have the same vulnerability. When the target code is similar to vulnerable code with a score between 50% and 95%, we use the patch (or diff file) of the vulnerable code to check the target code. If the target code is more similar to the patch, we consider that it has no vulnerability; otherwise, the target code is considered vulnerable. Our method is flexible, and different thresholds can be selected manually for different situations.
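The decision rule above can be written as a small function. This is a sketch under stated assumptions: the text does not specify how the target code is compared with the patch, so `difflib.SequenceMatcher` from the Python standard library is used here purely as a stand-in similarity measure, and the function names and return labels are hypothetical.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Stand-in similarity in [0, 1]; the actual comparison method is not given."""
    return SequenceMatcher(None, a, b).ratio()

def classify(target, vulnerable, patched, relation_score, low=0.50, high=0.95):
    """Apply the two thresholds, then fall back to a patch comparison.

    relation_score is the score output by the few-shot model for the most
    similar historical vulnerability.
    """
    if relation_score < low:
        return "not a clone"
    if relation_score > high:
        return "vulnerable"
    # Middle band: check whether the target is closer to the fix or to the flaw.
    closer_to_patch = (text_similarity(target, patched)
                       > text_similarity(target, vulnerable))
    return "not vulnerable" if closer_to_patch else "vulnerable"
```

The `low` and `high` parameters correspond to the 50% and 95% thresholds in the text and can be changed per application scenario.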

Experiment
We perform our experiments with a large number of exposed vulnerabilities on a machine running Ubuntu 16.04, with an NVIDIA GeForce RTX 2070 GPU, an Intel Xeon E5-2650 v4 CPU, 64 GB RAM, and a 12 TB HDD.
To increase the similarity of the code domain, we search the exposed vulnerabilities of five operating systems and download the vulnerabilities and patches from the open-source community. Patches are used to verify candidate code. Following the method discussed in Section 3, we extract the vulnerable functions from the vulnerability files of the five operating systems, construct the clone code of each vulnerability, and process the function code. The processed data set is used for training and testing the model. Table 2 summarizes the size of the datasets. #CVE is the number of vulnerabilities, #Fun is the number of functions extracted from vulnerable files, #Patch is the number of functions extracted from patch files and used to train the other models, and #Clone is the number of generated clone codes. The dataset consists of 5,258 classes, and each class includes one original vulnerable code and three clone codes. The dataset is randomly split into two parts, 80% for training and the remaining 20% for testing. We use the same metrics as described in [18]. TP is the number of vulnerable samples correctly discovered as vulnerabilities, FP is the number of non-vulnerable samples falsely reported as vulnerabilities, FN is the number of vulnerable samples that are not detected, and TN is the number of non-vulnerable samples correctly identified as non-vulnerable. We use the widely used metrics Precision (P), Recall (R, also called the true positive rate, TPR), False Positive Rate (FPR), False Negative Rate (FNR), and F1 Score (F1) to evaluate vulnerability detection systems. The ideal system neither misses vulnerabilities (FNR=0 and TPR=1) nor triggers false alarms (FPR=0 and P=1), which means F1=1.
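For reference, the five metrics follow from the four counts as shown in this short sketch. The formulas are the standard definitions; the function name and the choice to return 0.0 when a denominator is zero are assumptions of the example.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute P, R (TPR), FPR, FNR, and F1 from the confusion counts."""
    def ratio(numerator, denominator):
        # Return 0.0 for an empty denominator (assumed convention).
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)   # recall is the true positive rate (TPR)
    fpr = ratio(fp, fp + tn)
    fnr = ratio(fn, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"P": precision, "R": recall, "FPR": fpr, "FNR": fnr, "F1": f1}
```

For an ideal system with no false positives and no false negatives, the function returns P=1, R=1, FPR=0, FNR=0, and F1=1, matching the description above.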
To evaluate efficacy and effectiveness, we compare against several state-of-the-art methods: VUDDY [8], VulPecker [9], and VulDeePecker [18]. All methods use the same vulnerability samples, and the results are shown in Figure 3.

Figure 3. Results compared with other methods
The results show that Vul-Mirror achieves higher performance (F1=0.941). It misses fewer vulnerabilities (FNR=0.037 and TPR=0.963) and triggers fewer false alarms (FPR=0.126 and P=0.957). This is because Vul-Mirror not only extracts features of vulnerabilities but also uses the network to measure the similarity of two codes. VulDeePecker also uses deep learning to extract code features, but the number of samples is small (only 5,258 vulnerabilities and 5,258 patches), which degrades its performance. VulPecker can choose one of six algorithms to compare target code with vulnerable code according to the type of vulnerability, but it cannot extract features automatically. Imprecise features and limited algorithms reduce the effectiveness of VulPecker. VUDDY uses hash values to discover code clone vulnerabilities. It can discover Type-1 and Type-2 clone vulnerabilities and triggers no false alarms, but it hardly works when most of the samples are Type-3 clone vulnerabilities. We use an end-to-end network that unifies feature extraction and relation calculation to identify vulnerable code clones, which is much more effective than the existing methods.
Because every sample in our data sets is vulnerable, we only need to find the vulnerability most similar to the cloned code, and there is no need to validate the candidate code. The model achieves good performance, but due to the complexity and diversity of code, it still fails to recognize a few samples correctly.
To test the performance of the model in practice, we randomly select a new sample set that includes 13 code clone vulnerabilities and 417 non-vulnerable codes. The non-vulnerable code is different from all the patch code. We set the threshold to a high value (0.95), a middle value (0.65), and a low value (0.5) respectively. The test results for the different thresholds are shown in Table 3. The results show that with a high threshold (0.95), the model achieves high precision (100%) but misses some true vulnerabilities (FNR=84.6%): because the similarity between the target code and the related vulnerability is below the threshold, the samples are judged as not vulnerable. Note that the model can still identify which historical vulnerability is most similar to the target code, so if we do not use a threshold, the model can find these vulnerabilities.
When we set a middle threshold value (0.65), the model obtains better overall performance, but it still misses some vulnerabilities. When we set a lower threshold value (0.5), the model finds all vulnerabilities (TPR=100%), but some non-vulnerable samples are identified as vulnerabilities. Whenever the similarity between the target code and one of the vulnerabilities is higher than the threshold value, the target code is judged as vulnerable, so the model triggers false alarms. By using the vulnerability verification process to confirm candidate target code, we can reduce false positives and false negatives.
Experimental results show that our method can effectively detect code clone vulnerabilities, and that it can achieve good performance in practice.

Discussion
Few-shot learning is a hot topic. The model can reduce intra-class differences and increase inter-class differences of samples. Few-shot learning uses only a small number of samples, which alleviates the lack of labelled samples. We use few-shot learning to detect code clone vulnerabilities and achieve good results, which show that few-shot learning is suitable for the code clone problem. At present, however, our method still has some shortcomings:
First, our method is based on few-shot learning and needs only a few samples, but it still requires that every class has some similar samples. We create the set of code clones (similar samples) by a heuristic method, which is not necessarily always accurate in practice. How to build a vulnerability data set effectively is an interesting topic.
Second, unlike code clone detection, there is a gap between similarity and vulnerability. When we test non-vulnerable code, the performance of the model declines. How to use a few-shot learning model to distinguish vulnerable code from non-vulnerable code correctly is worth studying.
Third, although our method can alleviate the problem of insufficient vulnerable samples in deep learning, at present it can only be used to discover vulnerable code clones, and it is difficult to detect vulnerabilities with other causes. How to use few-shot learning to discover other kinds of vulnerabilities is the next topic.
Fourth, the similarity of different codes varies, so threshold adjustment is a complex process. Our next work is to provide corresponding thresholds for different application scenarios to improve the practicability of the method.

Conclusion
In this paper, we propose Vul-Mirror to solve the problem of low accuracy in discovering vulnerable code clones. Vul-Mirror uses a few-shot learning model to extract code features and compare the relation between codes. It takes advantage of an end-to-end network to implement fine-grained detection of similar code. We use five common metrics to evaluate Vul-Mirror and conduct a comparative experiment with three state-of-the-art methods on vulnerability datasets from five open-source operating systems. Experimental results show that Vul-Mirror is significantly better than the other methods. We extend the application of few-shot learning, improve the efficiency of code clone vulnerability detection, and alleviate the lack of large labelled data sets.

Figure 1. Two methods of code comparison

Figure 2. Overall framework of Vul-Mirror

According to our goal, we choose a 5-way 1-shot model. In every iteration step, an episode is formed by randomly selecting five classes from the training set, each with one labelled sample, as well as a fraction of the remaining samples of those five classes to serve as the query set. The features of the samples are extracted by the encoding module. The features of the five vulnerable codes and the clone code are combined, and the relation value between them is calculated by the relation module. The relation score is a value from 0 to 1: 0 means the two codes are totally different, and 1 means they are exactly the same. We then use mean squared error (MSE) as the loss function of the network. The relation function and objective function are shown in equations (1) and (2).

EAI Endorsed Transactions on Security and Safety 05 2020 -06 2020 | Volume 7 | Issue 23 | e4

If the renamed identifier is related to a vulnerability, the Type-2 clone differs in vulnerability from the original code (see Table 1, Type-2 clone 2). Type-3 clone is a restructured clone, in which statements are inserted or deleted based on the Type-2 clone. If the statements are related to some vulnerability, the kind of

Table 1. Code clone and vulnerability

Table 2. Datasets used in experiment

Table 3. The result of the new sample