Malware Detection Based on Opcode Dynamic Analysis

Malware detection is an important problem in the field of information security. Opcodes are the most direct information reflecting the execution behavior of malware. However, malware detection based on dynamic opcode analysis faces several challenges: acquiring opcode information during malware execution, the high false alarm rate of the detection process, and the large system overhead caused by performing detection in the application layer. To address these problems, this paper proposes a new scheme for dynamic opcode acquisition, in which the opcode information obtained at software runtime is used for offline analysis. The offline detection accuracy reaches 99.85%, which is superior to traditional techniques. Moreover, this paper proposes an online detection scheme, the CPU built-in malware monitoring model (CBMM), which addresses the difficulty of accurately identifying the execution trajectory of malware in current detection approaches and can monitor malware in real time. Finally, we implement the model in Verilog HDL, carry out functional simulation in ModelSim, and analyze its implementation cost.


Introduction
With the development of Internet technology in recent years, the openness of the Internet has also been exploited by criminals to carry out malicious activities, so information security is becoming more and more important. An important problem in this field is malware detection. Most antivirus software uses a combination of signatures and heuristics to detect malware. This approach is susceptible to malware obfuscation mechanisms, which makes it difficult to accurately identify the trajectory of malware execution.
Because malware must eventually execute the code that carries out its malicious behavior, it can be detected by analyzing its behavior information. Since opcodes are the most direct information reflecting the execution behavior of malware, this paper analyzes the opcodes executed by malware in order to detect it. However, malware detection based on dynamic opcode analysis faces several challenges: acquiring opcode information during malware execution, the high false alarm rate of the detection process, and the large system overhead caused by performing detection in the application layer. To address these problems, this paper first analyzes the progress and open problems of existing malware detection technology based on dynamic opcode analysis, and then proposes a new scheme for dynamic opcode acquisition, in which the opcode information obtained at software runtime is used for offline analysis. In the offline analysis, we apply a variety of feature selection algorithms to the runtime opcode information and combine the resulting feature subsets with a variety of machine learning algorithms in cross-comparison experiments. The offline detection accuracy reaches 99.85% and the false alarm rate reaches 0.5%. Based on these results, this paper proposes an online detection scheme, the CPU built-in malware monitoring model (CBMM), which addresses the difficulty of accurately identifying the execution trajectory of malware in current detection approaches and can monitor malware in real time. Finally, we implement the model in Verilog HDL, carry out functional simulation in ModelSim, and analyze its implementation cost.

Related Work
Malware detection can be divided into dynamic malware detection and static malware detection according to how the malware information is obtained. Dynamic malware detection runs the malware and collects behavior information (including opcodes, registers, API calls, etc.) during its execution.
Because signature-based malware detection is susceptible to obfuscation, current research focuses on detecting malware by analyzing behavior information at software runtime. The opcode information at execution time reflects the operations performed by the processor. By analyzing this information, the existence of malware can be detected without being affected by obfuscation [2].
Ozsoy et al. [3] used the Intel Pin tool to instrument malware. They collected instructions during program execution, analyzed information such as instruction type and the frequency of access operations, and built classifiers using a neural network and logistic regression. The sensitivity of the classifier to malware reached 100% and the false alarm rate was 9%. Opcode-based malware detection was also proposed in [4] and [5]. In [6], the authors proposed using variable-length instruction sequences as features; after obtaining the features, malware detection was carried out using a bagging ensemble learning algorithm, and a better detection result was obtained. Some scholars directly use hexadecimal opcode data for malware detection. The advantage of this approach is that the complexity of opcode types is reduced, because one hexadecimal value may correspond to multiple opcodes [8].
These works show that opcodes are a good feature for malware detection, and that using opcode information obtained by dynamic analysis avoids the detection problems caused by code obfuscation. However, most previously proposed methods have a high false alarm rate, high performance overhead, and high resource requirements. Online detection of malware, as well as accuracy and false alarm rate, need further improvement.
There are two main ways to acquire opcodes dynamically: application-layer tools (e.g., the Intel Pin tool [52] and Valgrind [53]) and system-level tools (mainly sandboxes) [55]. Application-level tools cannot obtain kernel-space information, while sandbox tools suffer from low monitoring efficiency. We propose a new dynamic opcode acquisition scheme in Section 3.

Proposed Methodology
The malware detection model classifies a program as malware or benign software according to the input feature data. The malware detection model M can be understood as a function whose domain is the set of all programs P and whose range is {Y, N}. The detection model M scans a program p and determines whether the program contains malicious behavior. The desired result is: if M returns Y, the program is detected to contain malicious behavior; otherwise, malicious behavior is not detected. In this paper, the detector is the malware detection technology based on dynamic opcode analysis, and its domain is the set of programs.
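The definition above can be written compactly as follows. This is only a restatement of the text in standard notation; no symbols beyond M, P, p, Y and N are introduced.

```latex
M : P \to \{Y, N\}, \qquad
M(p) =
\begin{cases}
Y, & \text{if malicious behavior is detected in } p,\\
N, & \text{otherwise.}
\end{cases}
```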
With traditional dynamic analysis methods it is difficult to obtain the complete opcode sequence of malware. This paper proposes a method for obtaining runtime opcode information that captures the complete opcode information. We apply different feature extraction and classification algorithms to this information, and then analyze the effects of the feature selection algorithm, the N-gram length, and the classification algorithm on malware detection. Experiments show that the detection accuracy reaches 99.85%, which is superior to traditional techniques.
We divide malware detection based on dynamic opcode analysis into three parts and two stages (see Fig. 1). The three parts are the dynamic opcode acquisition part, the feature extraction part, and the decision classification part. The two stages are the offline data analysis stage and the online detection stage. The parts and stages are related as follows: in the offline data analysis stage, the dynamic opcode acquisition part obtains the required dynamic opcode information, the feature extraction part derives a feature subset, and the decision classification part is trained on that feature subset. At the end of the offline data analysis stage, we obtain a malware detection classifier whose input is the dynamic opcode features. The online detection stage uses the results of the offline analysis to detect malicious code online. Section 3.1 introduces the offline dynamic opcode acquisition scheme, Section 3.2 introduces the feature selection and decision algorithms used in the offline analysis stage, and Section 3.3 presents the online malware detection scheme of this paper.

Opcode Acquisition Scheme
Current dynamic opcode acquisition methods still have problems such as incomplete information and low monitoring efficiency, so we propose a new dynamic opcode acquisition scheme. The scheme is based on the QEMU [58] binary translation mechanism: as shown in Fig. 2, a data-saving module is inserted into the process of translating guest opcodes into host operations in order to save the required information. Based on this idea, we design a QEMU-based sandbox system that provides instruction-level monitoring granularity. The QEMU sandbox system is implemented in Section 4. Fig. 3 shows the dynamic opcode acquisition scheme built on the QEMU sandbox system, which includes three parts: the data source, data filtering, and the QEMU sandbox system. These parts are described below.

Fig. 2. QEMU binary translation technology
• Data source: the software to be analyzed. The malicious software in this paper is downloaded from the VirusShare website, which is dedicated to providing malware analysis samples for researchers. The benign software is Linux system software, mainly taken from the /usr/bin and /bin directories.
• Data filtering: to ensure the reliability of the malicious samples, they are filtered before analysis. This paper uses VirusTotal [59] for sample filtering: we wrote an interface program that submits the samples to VirusTotal for analysis. VirusTotal provides nearly 10 mainstream antivirus engines, including McAfee, Symantec, and 360, which ensures the reliability of the malware samples.
• QEMU-based sandbox system: this is the core of the scheme. Its main task is to run the software to be analyzed and obtain its dynamic opcode data. The system is described in detail in Section 4.

Feature Selection:
The feature selection algorithms adopted in this paper are as follows.
• Chi-square Test Feature Selection Algorithm.
• Information Gain and Information Gain Rate.
• Symmetric Uncertainty Feature Selection Algorithm.
In section 4, we will use these feature selection algorithms for feature extraction, and analyze the effect of feature selection algorithms on the final detection effect.
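To make the listed criteria concrete, the following is a minimal Python sketch of the four scores for a single discrete feature. It assumes the feature has already been discretized (for example, an opcode frequency binned into "hi" and "lo"); the function names and the toy data are illustrative and are not taken from the experiments in this paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def information_gain(feature, labels):
    """IG = H(labels) - H(labels | feature) for a discrete feature."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0


def symmetric_uncertainty(feature, labels):
    """SU = 2 * IG / (H(feature) + H(labels)), which lies in [0, 1]."""
    denom = entropy(feature) + entropy(labels)
    return 2.0 * information_gain(feature, labels) / denom if denom > 0 else 0.0


def chi_square(feature, labels):
    """Pearson chi-square statistic of the feature/label contingency table."""
    total = len(labels)
    f_counts, l_counts = Counter(feature), Counter(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for x, fx in f_counts.items():
        for y, ly in l_counts.items():
            expected = fx * ly / total
            stat += (observed[(x, y)] - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Toy data (hypothetical): a binned opcode frequency and the class labels.
    labels = ["malware", "malware", "benign", "benign"]
    informative = ["hi", "hi", "lo", "lo"]
    print(information_gain(informative, labels), chi_square(informative, labels))
```

A feature that separates the classes perfectly gets the maximum score under each criterion, while a feature that is independent of the class gets zero, which is the behavior that ranking-based feature selection relies on.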

Online Malware Detection Scheme
We propose an online malware detection scheme. As shown in the following figure, the scheme includes three parts: a data separation module, a feature extraction module, and a decision module. The data separation module splits the opcode stream according to the CR3 value; its data source is the decoder of the processor. The feature extraction module and the decision module are designed using the detection model obtained from the offline analysis, so they are explained in detail after the offline data analysis in Section 4.
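On x86, CR3 holds the base address of the current page tables, so its value changes when the processor switches to a different address space; this is what makes it usable as a per-process key. The data separation step can be sketched in software as follows. The sketch assumes the decoder output is available as a sequence of (cr3, opcode) pairs; the function name and the sample values are hypothetical, and the actual module is a hardware design.

```python
from collections import defaultdict


def separate_by_cr3(stream):
    """Group (cr3, opcode) pairs into one opcode list per CR3 value, keeping order."""
    per_process = defaultdict(list)
    for cr3, opcode in stream:
        per_process[cr3].append(opcode)
    return dict(per_process)


if __name__ == "__main__":
    # Two interleaved address spaces (hypothetical CR3 values).
    stream = [(0x1000, "mov"), (0x2000, "push"), (0x1000, "add")]
    print(separate_by_cr3(stream))
```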

System Implementation And Experiment
This section presents the analytical experiments. Dynamic opcodes are acquired with the scheme proposed in Section 3. Once the dynamic opcode information is obtained, the opcodes are analyzed by empirical and computational methods. The computational analysis applies the feature selection and classification algorithms introduced in Section 3. When analyzing the experimental results, we mainly use the controlled-variable method to identify the factors that affect the detection accuracy. We further design the online malware detection scheme, the CPU built-in malware monitoring module (CBMM), which can detect malware online while the CPU is running. We implement the CBMM in Verilog HDL, use ModelSim for functional simulation, and analyze the implementation cost.

Data Collection Environment
The environment in which the data collection system operates is shown in Table 2.

EAI Endorsed Transactions on Security and Safety 07 2020 - 10 2020 | Volume 7 | Issue 26 | e4
As shown in Fig. 5, the QEMU-based sandbox system consists of three modules: a microarchitecture information preservation module, a software scheduling module, and a system operation anomaly monitoring module.

Fig. 5. QEMU based sandbox system
The three modules are described below:
• Opcode information saving module. Following the principle of QEMU binary translation, we modify the source code of QEMU's intermediate translation stage, add a data-saving routine, and design the trigger points at which data must be saved at run time, so that information such as opcodes can be saved in specified scenarios. After analyzing the x86 instruction set, the instructions 0xf1 and 0xd6, which do not exist in the x86 instruction set, are selected as the trigger points for saving microarchitecture information. Two long integer variables ϕ and φ represent the number of times the 0xf1 and 0xd6 instructions have been executed. Start flag: start recording opcode information. Stop flag: stop recording opcode information.
• Software scheduling module. Using the trigger points and the data-saving routine designed above, this module lets the user trigger the simulator's microarchitecture-saving routine whenever microarchitecture information needs to be saved. The scheduling software must also keep the initial state of the simulated platform consistent across runs, so the system is restored to its initial state before each run. The functional flow of the scheduling module is shown in Fig. 6.
• Abnormal monitoring module. The scheduling software automatically starts and stops the saving of processor microarchitecture information under certain conditions. In some situations, however, running malware can cause unrecoverable damage to the system. To improve the robustness of the whole system, an abnormal monitoring module is designed that tracks, by PID, how long each subprocess dispatched by the scheduling module has existed. If the preset abnormal time is exceeded, the abnormal monitoring software handles the subprocess invoked by the current system scheduling software.
Through the above opcode collection scheme, we obtain the opcode information that is analyzed in the offline analysis stage of this paper.
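The timeout behavior of the abnormal monitoring module can be illustrated with a small watchdog. This is a sketch under stated assumptions, not the monitoring software used in this paper: it assumes each analysis run is launched as a child process and that "handling" an abnormal run means terminating it; `run_with_watchdog` and the commands in the usage example are hypothetical.

```python
import subprocess
import sys


def run_with_watchdog(command, timeout_seconds):
    """Run command; return its exit code, or None if it was killed after timing out."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no stray process is left behind at this point.
        return None


if __name__ == "__main__":
    # A run that finishes in time, then one that exceeds the preset limit.
    print(run_with_watchdog([sys.executable, "-c", "pass"], 30))
    print(run_with_watchdog([sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
```

A caller that receives None would then restore the sandbox to its initial state before scheduling the next sample.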

Datasets
Datasets: The malware samples are obtained from VirusShare [69], and the benign software is system software taken from 32-bit Ubuntu 16.04. Table 3 shows the number of malware and benign software samples used in this paper.

Experimental result:
A total of 168 different opcodes appeared in the experiments, and the 18 most frequent opcodes in malware accounted for 93.6% of all opcode occurrences (Fig. 7). We also counted the opcodes that appear only in malware (Fig. 8). The main functions of the most frequent opcodes are listed in Table 7 of the Appendices. N-gram data for N=2 and N=3 were also extracted in this experiment (Appendices, Fig. 14 to Fig. 19). There are 3064 different sequences in the 2-gram data and 20,387 in the 3-gram data.
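The statistics above rest on two simple computations: counting N-grams in an opcode trace and measuring how much of the trace the k most frequent opcodes cover. A minimal Python sketch is given here; the function names and the five-opcode trace are illustrative and are not taken from the dataset.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return all contiguous n-grams of an opcode trace as tuples."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def ngram_frequencies(opcodes, n):
    """Relative frequency of each distinct n-gram in the trace."""
    grams = opcode_ngrams(opcodes, n)
    total = len(grams)
    return {gram: count / total for gram, count in Counter(grams).items()}


def top_k_coverage(opcodes, k):
    """Fraction of all opcode occurrences covered by the k most frequent opcodes."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(k)) / total if total else 0.0


if __name__ == "__main__":
    trace = ["mov", "push", "mov", "push", "call"]  # hypothetical trace
    print(len(ngram_frequencies(trace, 2)), top_k_coverage(trace, 2))
```

The number of distinct N-grams grows quickly with N, which is consistent with the jump from 3064 distinct 2-grams to 20,387 distinct 3-grams and is one reason feature selection is applied before classification.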
We process the opcode data with feature selection algorithms. We first construct an ARFF file that conforms to the Weka data format and contains the feature names and the feature data. This paper mainly considers the frequency of the different opcodes as features. Feature selection reduces the dimension of the feature data and the computational complexity of the classifier. We then experiment with different classification algorithms and compare the results under different combinations of feature selection and classification algorithms. The feature selection and classification algorithms are applied for N-gram lengths N=1, 2, 3 (Fig. 7). When the feature selection algorithm is CFS, the random forest classifier has the best performance for single-opcode features, with an accuracy of 99.85%, followed by Bayes Net with an accuracy of 99.79% (Table 4).
Table 5. Bayes Net and random forest recall rate
According to the figure above, the CFS feature selection algorithm performs best. Except for the information gain rate feature selection algorithm at N=1, the accuracy is above 99% in all cases. Under every feature selection algorithm other than CFS, the classification accuracy increases with N.
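For readers unfamiliar with the ARFF format mentioned above, the following sketch writes opcode-frequency features as an ARFF document with numeric attributes and a nominal class attribute. The helper name, the relation name and the sample rows are assumptions made for illustration; attribute names are assumed to be plain mnemonics that need no quoting.

```python
def to_arff(relation, feature_names, rows, class_values):
    """Build ARFF text from rows of (feature_values, class_label)."""
    lines = [f"@RELATION {relation}", ""]
    for name in feature_names:
        # One numeric attribute per opcode (or n-gram) frequency.
        lines.append(f"@ATTRIBUTE {name} NUMERIC")
    # The class is a nominal attribute listing every allowed label.
    lines.append("@ATTRIBUTE class {" + ",".join(class_values) + "}")
    lines += ["", "@DATA"]
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical frequencies for two samples and two opcodes.
    rows = [([0.6, 0.4], "malware"), ([0.9, 0.1], "benign")]
    print(to_arff("opcodes", ["mov", "push"], rows, ["malware", "benign"]))
```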

Design and Implementation of On-line Malware Detection Scheme
This section designs the online malware detection module of this paper: the CPU built-in malware monitoring module (CBMM). The data separation module diverts the received opcode data stream according to the CR3 value, splitting it into multiple channels; the feature extraction module of each channel generates the feature vector associated with a CR3 value; and the decision module uses the feature vector generated by the feature extraction module to decide whether the program indicated by the current CR3 is malware. The advantages of the CPU built-in malware monitoring model are that it is implemented in hardware, responds quickly, and can monitor in real time without being affected by malware obfuscation mechanisms. Fig. 9 shows where the security module is built into the schematic architecture. The input of the CBMM in this architecture is the opcode field of each instruction together with the value of the CR3 register (Fig. 10). The model determines whether the currently running program has malicious behavior by inspecting the input data stream. During online monitoring, the module generates feature vectors and makes a malware decision when the CR3 value changes and when the number of input opcodes reaches the preset threshold. Before the decision condition is reached, the module only uses registers to record the number of times specific opcodes are executed; the decision module is used only in the decision stage. Because the multiple channels produced by the data separation module (64 channels in this paper) run in parallel, the expected amount of data received by each channel is 1/64 of the data processed by the processor, which greatly alleviates the problem of keeping up with the processor's high processing speed.

Hardware Implementation Cost Analysis
Implementation and simulation environment: Quartus Prime Standard Edition 16 and ModelSim.
The functional modules are simulated at RTL level using ModelSim, and power consumption is estimated using the PowerPlay Power Analyzer tool.
The feature extraction module produces a feature vector when the current CR3 value changes or the number of recorded opcodes reaches a certain threshold. The decision module uses a decision tree built with the C4.5 algorithm: the information gain rate is used to construct the tree, and pruning is carried out during construction.
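The triggering rule of the feature extraction module can be modeled in software as follows. This is a behavioral sketch of a single channel, not the Verilog HDL implementation: the class name, the list of tracked opcodes and the threshold value are assumptions, and the decision tree that would consume the emitted vectors is omitted.

```python
from collections import Counter


class FeatureExtractor:
    """Counts opcodes for the current CR3 and emits a vector on either trigger."""

    def __init__(self, tracked_opcodes, threshold):
        self.tracked = list(tracked_opcodes)  # opcodes kept after feature selection
        self.threshold = threshold            # opcodes recorded before a forced emit
        self.current_cr3 = None
        self.counts = Counter()
        self.seen = 0

    def _emit(self):
        vector = (self.current_cr3, [self.counts[op] for op in self.tracked])
        self.counts.clear()
        self.seen = 0
        return vector

    def feed(self, cr3, opcode):
        """Consume one (cr3, opcode) pair; return the list of vectors it triggers."""
        emitted = []
        # Trigger 1: the CR3 value changed, so flush the counts of the old CR3.
        if self.seen > 0 and cr3 != self.current_cr3:
            emitted.append(self._emit())
        self.current_cr3 = cr3
        self.counts[opcode] += 1
        self.seen += 1
        # Trigger 2: enough opcodes were recorded for the current CR3.
        if self.seen >= self.threshold:
            emitted.append(self._emit())
        return emitted


if __name__ == "__main__":
    extractor = FeatureExtractor(["mov", "push"], threshold=3)
    for pair in [(1, "mov"), (1, "push"), (2, "mov"), (2, "mov"), (2, "mov")]:
        for vector in extractor.feed(*pair):
            print(vector)
```

Each emitted vector pairs a CR3 value with its opcode counts, which is the input a trained decision tree would need to label the corresponding process.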
The CBMM module is the complete functional module obtained by integrating the data separation module, the feature extraction module, and the decision module. This paper uses ModelSim to verify it. The required input data are constructed for the simulation experiment. We write a testbench that supplies all the stimuli required by the module and analyze its accuracy from the simulation output. In the experiment, all opcode data come from the offline collected data, and the CR3 values are added by ourselves; the CR3 values range from 0 to 99. In the online simulation experiment, 50 benign samples and 50 malicious samples were used for testing.
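One simple way to construct such input data is to tag each offline trace with a synthetic CR3 value and concatenate the tagged traces into a single stream. The sketch below assumes exactly that; whether the traces were concatenated or interleaved in the actual testbench is not stated here, and `build_stimulus` is a hypothetical helper.

```python
def build_stimulus(traces):
    """Tag trace i with synthetic CR3 value i and concatenate into one stream."""
    return [(cr3, opcode) for cr3, trace in enumerate(traces) for opcode in trace]


if __name__ == "__main__":
    # With 100 traces the synthetic CR3 values cover the range 0-99.
    traces = [["mov", "push"], ["call"]]  # hypothetical offline traces
    print(build_stimulus(traces))
```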

Experimental Results Of Algorithm Simulation
As shown below, the inputs are the CR3 value and the opcode, and the output is the result of the decision module: 01 means the decision is malware and 10 means the decision is benign software. In the CBMM experiment, the module detects the malicious software accurately, in line with the results expected from the offline analysis. Thermal power consumption analysis is carried out after functional simulation. The following table shows the estimated thermal power consumption and resource usage of each module, where LE denotes a logic element of the FPGA development board. According to Table 9, the whole online detection scheme can be implemented with few logic elements.

Conclusion
This paper proposes a CPU built-in malware monitoring model and completes the design and simulation of the module. However, the module does not yet take the performance factors of a hardware implementation into account. In future work, the implementation will be optimized and built into a real processor so that it can be applied in practice.