Applying Machine Learning Techniques to Understand User Behaviors When Phishing Attacks Occur

Email is widely used in daily life, so it is important to understand user behaviors regarding email security situation assessments. However, studies on email user behaviors are challenging and limited. To study security-related user behaviors, we design and investigate an email test platform to understand how users behave differently when they read emails, some of which are phishing. Specifically, we conduct two experimental studies, referred to as the on-site study and the online study, in which participants take part on site in a contained lab environment and online through Amazon Mechanical Turk, respectively. For the two studies, we design questionnaires and use a set of emails, including phishing emails from the real world with some necessary modifications for personal information protection. Furthermore, we develop the necessary software tools to collect experimental data, including participants’ basic background information, time measurements, mouse movement, and their answers to survey questions. Based on the collected data, we investigate which factors, such as intervention, phishing types, and an incentive mechanism, play a key role in user behaviors when phishing attacks occur. The difficulty of such an investigation is due to the qualitative nature of user behaviors and the limited amount of data in the on-site study. For these reasons, we develop an approach to quantify user behavior metrics and reduce the number of user attributes by evaluating the significance of each attribute and analyzing the correlation among attributes. Moreover, we propose a machine learning framework, which includes attribute reduction, to find a critical point that classifies the performance of a participant as either ‘good’ or ‘bad’, using models built with 10-fold cross-validation and cross-validation over randomly selected attributes.
The proposed machine learning model can be used to predict the performance of a user based on the user profile. Our data analysis shows that intervention and an incentive mechanism play a significant role while phishing type I is more harmful to users compared to the other two types. The findings of this research can be used to help a user identify a phishing attack and prevent the user from being a victim of such an attack. Received on 21 November 2019; accepted on 13 January 2020; published on 29 January 2020


Introduction
Attackers usually send out phishing emails, a form of online identity theft, to deceive victims into providing their personal information and login credentials [2]. susceptibility to phishing. By understanding users' behaviors during phishing attacks, we can determine how to educate users so that they are better protected against phishing attacks.
Because users' network security education levels are not homogeneous, users are susceptible to phishing attacks to different degrees [6]. Although security and usability experts claim that computer systems should not rely on users' behavior, researchers have found that phishing attacks are directly correlated with user behavior factors [7]. Thus, an important security prevention method is to educate users to adopt better security behaviors, where user behavior education refers to teaching Internet users about phishing awareness and defense techniques. Education-based approaches usually offer online information or educational games [8,9].
In this research, we aim to study user behavior factors, such as intervention, phishing types, and a monetary incentive, to understand how a user behaves during phishing email attacks and what mechanism may prevent a user from becoming a victim of such attacks. Our understanding of user behaviors will help us design a guideline to educate users on how to identify phishing emails, thus reducing their chance of becoming victims, although user education itself is outside the scope of this study. Here, intervention is defined as a mechanism that helps users become aware of phishing attacks more easily by modifying phishing types to make them appear more obvious [10]. A monetary incentive is introduced to motivate users to pay attention to phishing attacks [11].
Specifically, in our experiments, we recruit participants to conduct email sorting tasks. The emails used in the research consist of both phishing emails and non-phishing or normal emails. There are three kinds of phishing types in the phishing emails: (1) Suspicious sender's email address; (2) Suspicious links or attachments; (3) Malicious email contents. Performance of each participant, such as sorting correctness and time as well as mouse movement, is recorded in each experiment.
The goal of this study is first to understand how user behaviors are correlated to phishing victims through an analysis of the collected experimental data and then to develop a model to predict how likely a user will be a victim based on the user's profile and behaviors.
For this purpose, we seek to answer the following challenging questions in this research:
1. How can intervention affect user behaviors?
2. Which phishing type is more harmful than others?
3. How can a monetary incentive affect a user's behavior and sorting?
4. How accurately can we predict the performance of a user on email sorting based on user profiles and behaviors?
To answer the above questions, we propose two study designs, an on-site study design and an online study design. We start with the on-site study design, which is carried out in a contained lab environment. In the lab, participants are asked to conduct a pre-set experiment on our testbed, where each participant first reads a number of emails and then sorts them into either "phishing" or "non-phishing." We introduce a performance score to record the total number of emails that a participant sorts correctly.
In this research, our first main challenge is how to answer the above questions quantitatively. To address it, we first quantify user information and behaviors and then analyze the data obtained from participants' performance as well as participants' basic information from the questionnaires shown in Appendices A and B. The two questionnaires are designed for the on-site and online studies, respectively, so they are not identical, as their experimental environments are different. We also design a mouse tracking mechanism to trace participants' mouse movement. This first challenge is particularly difficult to address in the on-site study because the amount of experimental data there is typically small. The small dataset is due to limitations of budget, resources, and participant diversity, which restrict the number of people who can be recruited. Such limitations are typical in many human subject studies. In this research, only 40 participants are recruited in the on-site study. Thus, our second main challenge is how to extract useful information from a relatively small amount of collected data to build a machine learning framework for predicting the performance score of each participant accurately.
Furthermore, to increase the diversity and scalability of recruitment, we design an online study through Amazon Mechanical Turk, where participants attend the study online. In the online study, we also collect the profile, performance, and mouse movement of each participant, as in the on-site study. Based on the collected data, we develop a comprehensive approach to building a machine learning framework for predicting participants' susceptibility to phishing. In order to better evaluate the performance of participants, we divide their performance into two classes, 'Good' and 'Poor,' based on their performance scores. Thus, it is important to set up a threshold, which we call a critical point, to divide participant performance scores into two classes. We evaluate the critical point in the online study to find the best division method. Our machine learning models are developed using 10-fold cross-validation, where we apply a similar cross-validation idea to select attributes.
EAI Endorsed Transactions on Security and Safety 04 2019 -08 2019 | Volume 6 | Issue 21 | e3
Our main contributions are summarized as follows:
1. We propose two study designs, on-site and online, to understand how a user behaves when phishing attacks occur and to determine how we can help a user identify phishing attacks or prevent a user from becoming a victim of phishing attacks. The on-site study is conducted in a lab environment, while the online study is carried out online only. The on-site study is easily controlled as it is done in a contained environment, but recruiting a large number of participants is difficult. Conversely, the online study is easily scaled up, but recruiting participants online makes it difficult to ensure that the participants' data and profiles are trustworthy.

2. To help users, we introduce intervention, which is a mechanism used in our study to help participants become aware of their weak areas related to phishing. We specifically address the type of phishing attack that they are unaware of and help them recognize that type of phishing attack. Furthermore, we introduce a monetary incentive to test how the incentive impacts participants' security decision making. To study the monetary incentive, we divide the participants into two groups, a control group and an incentive group.
3. Besides participants' basic background information, we develop software tools to collect experimental data including time measurements, mouse movement, and participants' answers to the survey questions that we carefully design for the above two study designs. The experimental data collected in our two study designs help us answer all the questions raised above.
4. To understand the collected data, we propose and develop a machine learning framework to predict the performance score of each participant based on his/her profile. The proposed machine learning framework consists of four different models; all of them are developed with 10-fold cross-validation and cross-validation-based feature selection. We also perform attribute reduction by analyzing the data obtained from participants' performance as well as the participants' basic information from the survey to select the best attributes for our machine learning framework.
5. In order to better evaluate the performance of participants, we introduce two classes of performance, Good and Poor, based on their performance scores. In this research, we find the best critical point for dividing the participants' performance scores into the two classes by applying the proposed machine learning model to the collected experimental data.
The rest of the paper is organized as follows. In Section 2, we present related work on phishing emails and why people fall for phishing. In Section 3, we introduce the designs of our two studies. In Section 4, we present the dataset and attributes as well as our machine learning framework. We evaluate the results of our studies in Section 5. Finally, we discuss the implications of our findings in Section 6.

Related Work
As phishing becomes a more and more popular attack vector, email has been the most common way to conduct phishing attacks [12][13][14][15]. In 2011, Vishwanath et al. [16] discovered that most phishing emails are peripherally processed and the decisions made by individuals are usually based on simple clues embedded in an email message. They also found that if the email contains urgent information, the user will typically ignore other clues that could potentially help detect the deception. Furthermore, these findings suggest that the users who have more experience with emails are more likely to be phished.
Based on Vishwanath et al.'s observation [16], Angela Sasse and Kirlappos [9] claimed that the direction of security awareness and training against phishing attacks needed to be changed. They argued that user education needed to focus on challenging and correcting the misconceptions that guide current user behaviors. Better understanding users' perspectives and decision-making strategies is an effective way of implementing security awareness applications.
Dhamija et al. [17] conducted an experiment to better understand why phishing works. They first analyzed a large dataset of phishing attacks and developed hypotheses about why phishing attacks succeed. They then assessed those hypotheses by showing 20 web sites to 22 participants and asking them to determine which ones were deceptive. Their results showed that 23% of the participants did not attend to security indicators, leading to incorrect choices 40% of the time.
Vishwanath et al. [18] later conducted an experiment to examine the factors behind phishing susceptibility and found that an individual's email habits were an important factor. People with entrenched email habits tended to be more susceptible to phishing attacks: as soon as a notification arrives, they are likely to open it, even without realizing that they are doing so.
Interventions can be utilized to better understand user behavior in phishing susceptibility, and existing studies have consequently focused on training individuals to better detect fraudulent emails [19,20]. Liang et al. [21] demonstrated the effectiveness of warning interfaces with two groups, a control group that had no warnings for phishing attacks and another group that had warnings. They recruited nine participants in total, eight of whom fell for the attack. After the experiments, most of the participants claimed that they did not notice the warning, and some did not even know what it meant. Further, many of the participants admitted that they did not know the meaning of phishing.
Many studies have shown that people are vulnerable to phishing for the following reasons. Many users do not trust security indicators on websites [22]. Attackers can easily replicate legitimate websites, since people usually judge a website by how it looks and how they feel about it [17]. Although some users are aware of phishing, this awareness does not help them detect or prevent phishing attacks [23,24]. More recently, machine learning techniques have been applied to detect phishing emails [25][26][27][28][29].
User education about security, which can also be viewed as an intervention, has made a significant impact on preventing phishing attacks [30]. There is evidence that well-designed user security education can be very effective [31]. Many forms of security education, such as interactive games, can be utilized to improve users' knowledge for preventing phishing attacks [32][33][34][35].
Supriya et al. [36] have recently studied user behaviors in phishing attacks with incentive and intervention. They conducted a three-round experiment in which participants distinguish phishing emails from normal emails. In our study, we closely follow their experiment but perform more analysis. We not only study how user behaviors affect phishing attack outcomes but also try to predict how users will perform based on their behaviors and background.
The above studies suggest the importance of understanding user behaviors in phishing attacks in order to avoid such attacks effectively. In the on-site study, we specifically focus on phishing indicators to test whether users can differentiate various phishing attacks and to study which type of phishing attack has more impact on users. We introduce intervention in both the incentive and control groups. The intervention tells the user to pay attention to a certain type of phishing attack. Participants are given an intervention on the phishing type on which they were weak in the first round, by making that phishing type easier for them to spot in the second round. We use the incentive group to test whether or not a monetary incentive impacts the decision-making of participants, i.e., whether or not participants perform better in the presence of a monetary incentive. In the online study, we build a machine learning framework to predict the performance of a user based on their behavior and background. Compared to existing studies in the literature, our approach is more comprehensive in understanding user behaviors when phishing attacks occur. We propose two study designs and investigate multiple factors, such as intervention and a monetary incentive. Furthermore, we propose a machine learning framework to classify user performance regarding phishing emails.

Study Design
Nowadays, email is widely used throughout the world via the Internet. Many people, especially employees in a work environment and students at colleges, read and respond to emails daily. Email has become an integral part of daily life for most people. Thus, it is very likely that many people have at some point mistakenly clicked on a link that seemed legitimate but was actually a phishing link.
In order to thoroughly understand user behaviors when phishing attacks occur and provide better user education, we present two study designs, on-site and online. The on-site study has experiments carried out in a lab environment, while the online study is carried out online. The on-site study is designed to answer the first three questions given in Section 1, and the online study is designed to answer the last question in that section. It is important to set up our study so that it corresponds to how users behave when they read emails in the real world. Checking emails in daily life can be viewed as an email sorting task: when we look at an email, we first decide whether or not it is legitimate. If it looks suspicious, we will not open it. Even if we open it, we look at some keywords and decide whether or not the email is trustworthy and useful. In both study designs, we mimic an email opening, reading, and decision-making setting for participants, who are asked to act as an administrative assistant helping the department chair, Dr. Jane Smith, sort her emails while she is on vacation. Therefore, we set up an email testbed that allows users to sort a set of emails for Dr. Jane Smith's email accounts. Those emails consist of both legitimate and phishing ones. Participants do not need to respond to any of the emails; they only sort them into either a "phishing" or "non-phishing" folder based on the information within the email and the email interface. In our study, we use emails obtained from the real world with some necessary modifications for personal information protection. Phishing emails were derived from a semi-random sample of emails in the "Phish Bowl" database [37]. Legitimate emails were derived from legitimate emails received by the research team. In this section, we give a detailed description of these two study designs.

On-Site Study Design
In the on-site study design, the email sorting task consists of three rounds, and each round is preloaded with 20 emails, of which 15 are phishing emails and 5 are legitimate emails. Among those 15 phishing emails, there are 3 different phishing types, and each type includes 5 emails. Thus, each of the four email categories (three phishing types and legitimate) contains 5 emails in each round. In the second round, the intervention is introduced to the participants based on their performance in the first round. We recruit 40 participants to perform this task. During the experiment, the participants are asked to differentiate the phishing emails from the legitimate emails. After the tasks in the three rounds, participants are required to take a survey in the lab, where they are asked about their backgrounds and their feelings about the task.
Environmental Setup. Our email testbed has three main components: the RoundCube email client, the Postfix virtual mail server, and the BurpSuite proxy listener. The RoundCube email client is a browser-based IMAP client. It is used as the interface through which users preview emails and make decisions about them in our study. The Postfix mail server provides the ability to host multiple virtual domains. The emails preloaded in the RoundCube client are sent through the Postfix virtual mail server.
We utilized the HTTP Proxy feature of BurpSuite, which serves as a man-in-the-middle between the browser and the destination web servers. This allows the interception, inspection, and modification of the raw traffic passing in both directions. Therefore, both the HTTP requests and responses sent between the RoundCube client and the Postfix mail server can be captured by BurpSuite. The logs obtained from BurpSuite after each round are saved in XML format. We then parse the XML file to extract useful information for later analysis.
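As a minimal sketch of this parsing step, the Python code below reads a proxy log and returns one record per captured HTTP exchange. The paper does not describe the log schema, so the layout used here is an assumption: an `<items>` root whose `<item>` children carry `<time>`, `<url>`, `<method>`, and a `<request>` element that may be base64-encoded. The function name `parse_proxy_log`, the sample URL, and the sample request body are hypothetical and only serve the illustration.

```python
import base64
import xml.etree.ElementTree as ET


def parse_proxy_log(xml_text):
    """Return one dict per logged HTTP exchange.

    Assumed layout (not specified in the paper): <items><item>...</item></items>
    with <time>, <url>, <method> and <request base64="true|false"> children.
    """
    records = []
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        request = item.find("request")
        body = ""
        if request is not None and request.text:
            if request.get("base64") == "true":
                # Decode the raw request so that it can be searched as text.
                body = base64.b64decode(request.text).decode("utf-8", errors="replace")
            else:
                body = request.text
        records.append({
            "time": item.findtext("time", default=""),
            "url": item.findtext("url", default=""),
            "method": item.findtext("method", default=""),
            "request": body,
        })
    return records


# Hypothetical one-item log: a participant moves email 7 to a folder.
SAMPLE_LOG = """<items>
  <item>
    <time>t0</time>
    <url>http://testbed.example/mail</url>
    <method>POST</method>
    <request base64="true">{}</request>
  </item>
</items>""".format(base64.b64encode(b"uid=7&folder=Suspicious").decode("ascii"))

for record in parse_proxy_log(SAMPLE_LOG):
    print(record["time"], record["method"], record["request"])
```

Each record can then be filtered for the requests that correspond to sorting actions, which is where the per-email decisions and timestamps used in the later analysis would come from.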
The email testbed is set up in the environment of Ubuntu 16.04 Long Term Support (LTS). The testbed architecture is shown in Figure 1. It consists of RoundCube Email client, BurpSuite Proxy Listener, and Postfix Virtual Mail Server where there are HTTP requests and responses among them.

Figure 1. Email testbed architecture
In addition, we developed Python code to track the movement of the mouse, including the mouse's locations and the durations for which it stays at those locations. In our study, this code captures the time and location of the mouse during the experiment. If a participant's mouse stays in the same location for a long period of time, we calculate and record the time interval t (in seconds). We then set a threshold a to determine the number of hesitations h: if t > a, we increase h by 1. This helps us estimate how hesitation affects performance.
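The hesitation count can be sketched as follows. This is an illustrative reading of the rule rather than the authors' tracking code: it assumes the tracker yields chronologically ordered `(timestamp, x, y)` samples, it treats a pause as the time during which the position does not change, and it increments h whenever a pause interval t exceeds the threshold a. The function name `count_hesitations` and the sample values are hypothetical.

```python
def count_hesitations(samples, threshold_s):
    """Count pauses whose duration t (seconds) exceeds threshold_s (the threshold a).

    samples: chronologically ordered (timestamp_seconds, x, y) tuples.
    A pause starts when a position is first seen and ends when the
    position changes (or at the last sample).
    """
    if not samples:
        return 0
    hesitations = 0
    start_time, px, py = samples[0]
    position = (px, py)
    for t, x, y in samples[1:]:
        if (x, y) != position:
            # The mouse moved: close the pause at the previous position.
            if t - start_time > threshold_s:
                hesitations += 1
            start_time, position = t, (x, y)
    # Close the final pause using the last recorded timestamp.
    if samples[-1][0] - start_time > threshold_s:
        hesitations += 1
    return hesitations


# Hypothetical trace: a long pause, a quick move, then another long pause.
trace = [
    (0.0, 10, 10), (0.5, 10, 10), (3.0, 10, 10),
    (3.2, 50, 60), (3.4, 80, 90), (6.0, 80, 90),
]
print(count_hesitations(trace, threshold_s=2.0))
```

The value of the threshold a is a design choice that the text leaves open; a larger value counts only long pauses, while a smaller one also counts brief stops.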
Participant Recruitment. IRB approval was obtained before we started to recruit participants (approval number: Pro00026240). In the on-site study, participants are students, as we recruited them on campus. They were recruited through flyers posted on campus or announcements via mailing lists in different departments. Participants are asked to sign the Informed Consent Form before they start the experiment. We recruited 40 participants at our university to perform this user study. To increase the diversity of participants in this study, we chose most of the participants from different majors and education backgrounds, and both undergraduate and graduate students were recruited.
The average age of the participants is about 23 years, and their ages range from 18 to 38. Among the 40 participants, 18 are female and 22 are male. The distribution of the participants is shown in Table 1. We introduce a monetary incentive in our study, which is designed to answer the third question in Section 1. We want to study whether or not the monetary incentive affects a user's performance. In separate research for education purposes, we can then decide whether to use this monetary incentive to motivate users to pay more attention to phishing attacks. Each participant has a chance to receive a payment of $10 to $25. To see how a monetary incentive can affect performance, we assign participants to two groups: a control group and a monetary incentive group. Each participant in the control group receives a $15 payment regardless of his/her performance. The base amount for the incentive group is $10, but participants have a chance to earn an extra $5 in each round if their accuracy is above 80%.
Phishing Types. One purpose of this research is to study which type of phishing attack is more harmful to users. There are three types of phishing attacks used in our study:
1. A suspicious sender's email address. This type of phishing contains a suspicious sender's email address. Nowadays, people are flooded with emails and tend to pay less attention to the sender's email address. They usually look only at the sender's name and neglect the email address, or just catch a glimpse of it. This inattention gives the scammer a good chance to imitate a legitimate email address. Some phishing email addresses are really hard to distinguish from authentic email addresses if users do not pay much attention. For example, the letter 'l' and the number '1' look very similar. Therefore, a scammer could exploit this similarity to create a fake 'we11sfargo' domain name rather than 'wellsfargo.'
2. Suspicious links or attachments. Suspicious links can be very similar to suspicious sender's email addresses. These links can be manipulated by using similar characters or misspellings. For example, a link that contains the word 'directdeposit' could be misspelled as 'directdepost.' Suspicious attachments can be disguised as pdf files, exe files, or other types of files. A suspicious exe file may be easier to spot than a suspicious pdf file. Usually, people will not consider that a pdf file could be malicious until they open it.
3. Malicious email contents. This type of phishing is quite tricky. At first glance, the email content seems normal to most people. However, this kind of phishing attack contains suspicious content. For example, the content may have several grammar issues, or the icons of popular social networks may be faked. These cues are very hard to notice if the user is not familiar with those popular social media platforms or is not a native English speaker.
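To make the look-alike tricks in the first two phishing types concrete, the following Python sketch checks a name against a trusted one in two ways: by undoing digit-for-letter substitutions (as in 'we11sfargo') and by measuring string similarity (as in 'directdepost'). It only illustrates the deception and is not part of the study's tooling; the substitution table, the 0.9 similarity cutoff, and both function names are assumptions chosen for this example.

```python
import difflib

# Hypothetical substitution table: digits often used to imitate letters.
LOOKALIKES = str.maketrans({"1": "l", "0": "o", "5": "s"})


def uses_lookalike_characters(candidate, trusted):
    """True if candidate matches trusted only after undoing look-alike digits."""
    candidate, trusted = candidate.lower(), trusted.lower()
    return candidate != trusted and candidate.translate(LOOKALIKES) == trusted


def is_near_misspelling(candidate, trusted, min_ratio=0.9):
    """True if candidate differs from trusted but is very similar to it."""
    candidate, trusted = candidate.lower(), trusted.lower()
    ratio = difflib.SequenceMatcher(None, candidate, trusted).ratio()
    return candidate != trusted and ratio >= min_ratio


print(uses_lookalike_characters("we11sfargo", "wellsfargo"))
print(is_near_misspelling("directdepost", "directdeposit"))
```

Both checks flag the examples given in the text, which shows how small the visible difference is between a deceptive name and the authentic one.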

Experimental Rounds.
In the on-site study, we let participants perform three rounds of email sorting tasks. We collected data on each participant from each round. The average time spent in each round is about 15 minutes. After each round, the participant can take a few minutes' rest while we save the data captured from the experiment.
First Round: This round contains 20 emails in total. Among them, 15 are phishing emails and 5 are legitimate emails. Those 15 phishing emails consist of three types of phishing attacks we introduced above. Each type of phishing attack has 5 emails. The task for participants is to classify 20 emails into two folders, suspicious or keep, based on their knowledge and experience. Participants were not told how many phishing emails and legitimate emails were given.
Second Round: The second round follows the same procedure as the first round, except that we introduce the intervention in this round. The intervention makes emails of one phishing type more obvious to participants, so that they pay more attention to this type of phishing attack. This could be a useful factor in user education for preventing phishing attacks, since users can be educated about different types of phishing attacks. After a participant finishes the first round, we calculate the participant's first-round score. The score is based on the correctness of sorting each email: if a participant moves an email to the correct folder, he/she gets 1 point; otherwise, 0 points. The points are then added up. The overall performance score is the total score for sorting all 60 emails across the three rounds. We separate the first-round score by phishing type and identify the type with the lowest score, which is used as the intervention in the second round. Before the second round starts, we point out this phishing type to the participant and make it easier to spot in the second round. We introduce the intervention to examine whether a participant performs better in this round with knowledge of a certain type of phishing attack. We also want to see whether or not the intervention affects the overall performance of each participant.
Third Round: This is the last round in our experiment. In this round, a participant continues to sort 20 emails. The procedure of the third round is the same as in the previous two rounds, but we do not give any intervention in this round. We compare the performance score of round three with those of the other two rounds to see whether the intervention from the second round still has an effect in the third round.
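The scoring and intervention rules described for the three rounds can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the type labels, the tuple layout of a sorting decision, and all function names are hypothetical, and the tie-breaking rule (the first listed type wins) is not specified in the text. The bonus function follows the payment rule given for the incentive group, a $10 base plus $5 for each round with accuracy above 80%.

```python
# Hypothetical labels for the three phishing types used in the study.
PHISHING_TYPES = ("sender_address", "link_or_attachment", "content")


def round_score(decisions):
    """Total points: 1 per correctly sorted email, 0 otherwise.

    decisions: list of (email_type, is_phishing, sorted_as_phishing) tuples.
    """
    return sum(1 for _, truth, answer in decisions if truth == answer)


def scores_by_phishing_type(decisions):
    """Points earned on each phishing type (legitimate emails are skipped)."""
    scores = {phishing_type: 0 for phishing_type in PHISHING_TYPES}
    for email_type, truth, answer in decisions:
        if email_type in scores and truth == answer:
            scores[email_type] += 1
    return scores


def intervention_type(first_round):
    """Phishing type with the lowest first-round score (ties: first listed)."""
    scores = scores_by_phishing_type(first_round)
    return min(PHISHING_TYPES, key=lambda phishing_type: scores[phishing_type])


def incentive_payment(round_accuracies, base=10, bonus=5, cutoff=0.80):
    """Base pay plus a bonus for every round with accuracy above the cutoff."""
    return base + sum(bonus for accuracy in round_accuracies if accuracy > cutoff)


def _emails(email_type, is_phishing, n_correct, n_total=5):
    """Build n_total decisions of one category, n_correct of them correct."""
    return [
        (email_type, is_phishing, is_phishing if i < n_correct else not is_phishing)
        for i in range(n_total)
    ]


# Hypothetical first round: 15 phishing emails (5 per type) and 5 legitimate.
first_round = (
    _emails("sender_address", True, 4)
    + _emails("link_or_attachment", True, 2)
    + _emails("content", True, 5)
    + _emails("legitimate", False, 3)
)

print(round_score(first_round))
print(intervention_type(first_round))
print(incentive_payment([0.85, 0.80, 0.95]))
```

In this hypothetical round the participant is weakest on links and attachments, so that type would be pointed out and made easier to spot in the second round.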
Survey. The survey was carried out after the three rounds of email sorting tasks. We used an online survey platform to record the answers from participants, who were required to complete it in the lab. The survey contains 30 questions, mainly about the background of participants, such as age, gender, and some general questions about their experience and habits of using social media. There were also some questions related to the email sorting tasks they had just completed. Examples of the survey questions are as follows:
• Have you taken any cybersecurity courses?
• I believe I was successful in the email sorting task.
• I briefly looked at the sender/source of the emails.
• I ignored the message content of the emails.
Here, participants are given multiple choices, "Yes/No" for the first question above and "Strongly Disagree/Disagree/Neutral/Agree/Strongly Agree" for the remaining three questions, and they are required to choose one of the given answers.
Besides the data we collected throughout the experiment, this survey helps us better understand participants' behaviors and backgrounds regarding phishing attacks. A complete list of survey questions is shown in Appendix A.

Online Study Design
Although the on-site study is sufficient to answer the first three questions mentioned in Section 1, it is not sufficient for us to thoroughly understand user behaviors regarding phishing attacks. This is due to the limited demographic diversity and the limited number of participants recruited. Therefore, we propose the online study design developed by the project team at our university. The online study can help us answer the questions of what kinds of groups are more vulnerable to phishing attacks and how accurately we can predict performance based on user behavior. Since the online study design is an extension of the on-site study design, we only introduce the new components of the online study design and compare both designs afterwards.
Environmental Setup. The online study has an environment setup similar to that of the on-site study, except that we do not use the BurpSuite proxy listener to capture the data, since the experiment is carried out online. To collect users' input, we use the JavaScript-Based Data Capture, and to communicate the captured data to the server, we use the AJAX-Based Data Sender. The PHP Listener is used to receive the data sent via AJAX, and the Logger is used to log the data. Both the Listener and the Logger are installed on the server side, while both the Data Capture and the Data Sender run in the client-side browser. In order to see how confident participants are while they are sorting an email, we add a rating module to the Roundcube email client so that they can rate their confidence level for each email. The rating ranges from 1 to 10.

Participant Recruitment.
In the online study, participants are recruited from Amazon Mechanical Turk (MTurk) [38]. We recruited 90 participants in total for this online study. The average age of the participants is about 34 years, and their ages range from 20 to 61. Among the 90 participants, 35 are female and 55 are male. Eight participants are currently students. One participant is not a native English speaker. Nine participants previously completed a network engineering or cybersecurity course/certificate. The distribution of the participants is shown in Table 2. We keep the monetary incentive mechanism in the online study because it is a useful feature/attribute for predicting user performance when users encounter phishing attacks. The base amount is $4 for the non-incentive group. For the incentive group, participants could earn an additional payment (up to $8.00) for their performance if their accuracy is greater than 75%.
Phishing Types and Experimental Round. The online study uses the same phishing types as the on-site study. However, there is only one experimental round in the online study. Since the on-site study is sufficient for exploring the effect of intervention, we keep the online design simple: participants complete a single round of the email sorting task, in which they are asked to sort 40 emails and to rate their confidence level for each email. They are asked to complete the task within 30 minutes. Among those 40 emails, 20 are legitimate and 20 are phishing. Participants are not aware of this distribution when they take the experiment.

Survey.
Because the study is carried out online, we designed two types of surveys, a pre-survey and a post-survey. We use the pre-survey to collect the basic information and background of participants, such as age, gender, education background, cybersecurity background, and habits of using social media. In the on-site study, we include the Informed Consent Form and email sorting instructions, whose information is similar to the questions in the pre-survey of the online study. The post-survey asks questions related to the email sorting task the participants took. The pre-survey and post-survey contain all the questions from the on-site study survey, and we add more questions because the online study is designed differently from the on-site study. For example, we add the confidence rating in the online study, so the post-survey of the online study has a question, "How confident are you in your assessment of the number of correctly sorted email?", which is not in the on-site survey. The complete list of survey questions is shown in Appendix B.

Similarity and Comparison
Since the online study is designed based on the on-site study, the two study designs are similar to a certain extent, but they also differ because they focus on different aspects of user behavior.

Similarity.
Both studies contain email sorting tasks and aim to study user behaviors when users encounter phishing attacks. Both divide participants into two groups, a monetary incentive group and a control group. The phishing types are the same in the two studies. Both conduct surveys about the email sorting task afterwards.
Comparison. First, the on-site study is carried out in a lab environment: participants are asked to show up and perform the experiment on the testbed set up in the lab, whereas the online study is carried out entirely online, including participant recruitment and the email sorting task. Second, participants recruited online are more diverse than those in the on-site study, and the number of participants recruited for the online study is much larger. Third, for the on-site study we design the intervention mechanism and a three-round email sorting task with 60 emails in total. The data collected on-site are sufficient for the analysis of the intervention question, so, for simplicity, the online study has only a one-round email sorting task with 40 emails. Fourth, in the online study we add the rating module to allow users to rate their confidence level for each email, which turns out to be a useful feature in our machine learning framework. Fifth, in the on-site study we ask participants to answer the survey questions only after the experiment, and there is no pre-survey. In the online study, however, we have both a pre-survey and a post-survey due to the online form of the experiment, and there are more survey questions than in the on-site study.

Data Analysis Methodologies
The goal of this research is to thoroughly understand the behavior of a user encountering phishing attacks and to identify which factors play a significant role in phishing attack outcomes. We raised four questions in the introduction section. To answer these questions effectively, we propose two data analysis methodologies. Specifically, to answer the first three questions, we propose a statistical method to analyze each factor, such as intervention, phishing types, and incentive. To answer the fourth question, we propose machine learning techniques. In this section, we first discuss our dataset and attributes and then propose a machine learning framework.

Data Set and Attributes
We developed a data collection infrastructure that automatically captured and monitored the detailed actions of each participant, such as clicks, navigation, timestamps, and decisions. We further processed and stored this information in CSV file format for analysis. Since we proposed two study designs, the two datasets, the on-site dataset and the online dataset, are stored separately and analyzed by different data analysis methods. The on-site study design mainly focuses on the factors that affect user behaviors under phishing attacks, while the online study is designed to predict user performance under phishing attacks. Although we also predicted user performance by applying machine learning approaches to the on-site dataset, the result is not as good as we expected because of the small dataset. Thus, we design the online study, whose collected data are sufficient to predict user performance, as shown later. For this reason, we apply statistical models to the on-site dataset in order to understand the contributing factors of user behaviors when phishing attacks occur. We also briefly show our machine learning user performance prediction results on the on-site dataset and compare them with those based on the online dataset.
The on-site data file stores data collected from 40 participants. The file includes the participants' detailed information, the time used to sort emails (processing time), the performance score, etc. There are 50 attributes in total in the on-site dataset. The online data file stores data collected from 90 participants. It includes information similar to that in the on-site data file plus additional information, including confidence ratings and new survey questions, but it does not include intervention information. There are 119 attributes in total in the online dataset. Some of them, such as the processing time and performance score, are used to evaluate the performance of a user. Besides these, many attributes come from the pre- and post-surveys, which provide basic information such as gender, age, education level, and answers to questions about the task (see Appendices A and B). An example of the attributes can be seen in Table 3. In the study column, 'On-site' or 'Online' means that the attribute is only for the on-site study or the online study, respectively. 'Both' means that the attribute is for both studies. The online study has more attributes from the survey questions than the on-site one. A complete list of attributes with their descriptions is given in Appendix C.
Performance score is one of the most important indicators in both of our studies. For the on-site dataset, we calculate the score of each participant in each round as well as the score for each phishing type. We can use a statistical method to analyze which factors, such as intervention, processing time, and incentives, are closely related to the performance score. For the online dataset, we calculate the score of each participant and the scores for phishing emails and normal emails. We can feed the online dataset into machine learning models to predict the performance of a participant and thereby understand the behavior of a user encountering phishing attacks.

Proposed Machine Learning Framework
The goal of using a machine learning method is to predict whether a user performs well or poorly when encountering phishing attacks. Hence, we divide the performance score into two classes, Good and Poor, where a critical point, c, is used as the threshold. If the performance score is greater than or equal to c, we label it as Good; otherwise, it is labeled as Poor. Thus, we have a binary classification problem. Some attributes are more significant than others, and some have minimal or no significant effect. Therefore, choosing attributes is critical in our machine learning framework, especially with a small dataset as in the on-site study. Moreover, correlation studies are conducted to illustrate the relationship of these independent attributes with the performance scores of participants. Since we have 119 attributes but only 90 instances, a critical question arises: the sample size of our dataset is relatively small, while the number of attributes is relatively large. To prevent over-fitting, a machine learning model requires a sufficient number of instances for a given number of attributes [39,40]. As we know, a non-over-fitting machine learning model usually requires at least $P^2$ instances to train the model for $P$ attributes. Clearly, our dataset does not meet this requirement. To resolve this problem, we introduce a stepwise attribute selection to reduce attributes and then present our machine learning models in detail. Then, we present how to find the best critical point to classify user performance where a prediction accuracy is ensured.

EAI Endorsed Transactions on Security and Safety 04 2019 -08 2019 | Volume 6 | Issue 21 | e3
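The labeling rule with the critical point c can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the example scores are made up.

```python
def label_performance(scores, c):
    """Label each performance score Good if it is >= c, otherwise Poor."""
    return ["Good" if score >= c else "Poor" for score in scores]


# Made-up scores for illustration only.
example_scores = [25, 30, 34, 29, 40]
labels = label_performance(example_scores, c=30)
```

Note that a score exactly equal to c falls into the Good class, matching the "greater than or equal to" rule in the text.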
Stepwise Attribute Selection. To select the best attributes for the subsequent machine learning models, we first perform a Pearson correlation coefficient analysis to observe the importance of each individual attribute. Based on the data we collected, we then fit a linear regression model to evaluate all the attributes. To select the most significant attributes for the model, we use three selection procedures: stepwise, forward, and backward selection. The significance level for entry into the model was set to 0.5 and the significance level for staying in the model was set to 0.2.
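The first step, ranking attributes by their Pearson correlation with the performance score, can be sketched as below. This covers only the correlation screening, not the stepwise regression with entry and stay thresholds. The helper names and the toy values are illustrative assumptions, not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_attributes(table, target):
    """Sort attributes by the absolute value of their correlation with target."""
    scores = {name: pearson_r(values, target) for name, values in table.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy values for illustration only.
toy_table = {
    "phishing_accuracy": [0.5, 0.6, 0.8, 0.9, 1.0],
    "age": [30, 25, 41, 22, 35],
    "avg_rating": [9, 7, 8, 5, 6],
}
toy_score = [20, 24, 31, 35, 38]
ranking = rank_attributes(toy_table, toy_score)
```

Ranking by absolute value treats strongly negative correlations as informative too, which matters for attributes such as stress-related survey answers.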

Figure 3. Machine learning model with 10 fold cross-validation
Machine Learning Models. After obtaining the reduced attributes, we apply machine learning approaches to predict the overall performance. We build four different machine learning models: Decision Tree-J48, Naive Bayes (NB), Support Vector Machine (SVM), and Multilayer Perceptron (MP). We first use Decision Tree-J48, the decision tree implementation developed by the WEKA project team; J48 implements the C4.5 algorithm, a successor of ID3 (Iterative Dichotomiser 3). The Decision Tree classifier requires relatively little effort for data preparation. The Naive Bayes classifier works well for independent attributes and is based on the Bayes rule of conditional probability [41,42]. It considers each of the attributes separately when classifying a new instance. SVM is primarily a classifier that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases of different class labels. Multilayer Perceptron is a type of neural network that consists of at least three layers of nodes. Except for the input nodes, each node uses a nonlinear activation function.
We also propose to use 10-fold cross-validation to predict the performance of each participant, as shown in Figure 3, where MM is short for machine learning model. The original dataset is randomly partitioned into 10 equal-size subdatasets. Of the 10 subdatasets, a single subdataset is retained as the validation data for testing the model and the remaining 9 subdatasets are used as training data. The cross-validation process is then repeated 10 times (i.e., 10 folds), with each of the 10 subdatasets used exactly once as the validation data. The 10 results from the folds can then be averaged to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once. Besides the 10-fold cross-validation, we also apply a similar idea of cross-validation to the attributes. Suppose we have m attributes, and each time we randomly select n attributes to feed into our cross-validation machine learning model. This process is run k times, where m = k × n.
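The 10-fold partitioning described above can be sketched in plain Python. The `evaluate` callable is a placeholder assumption standing in for training and testing any one of the four models; it is not part of the paper.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Randomly partition instance indices into k folds of near-equal size."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_instances, evaluate, k=10, seed=0):
    """Average the accuracy returned by evaluate(train_idx, test_idx) over k folds."""
    folds = k_fold_indices(n_instances, k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        accuracies.append(evaluate(train_idx, test_idx))
    return sum(accuracies) / k


# With 90 instances, each of the 10 folds holds 9 instances.
folds = k_fold_indices(90)
```

Each index appears in exactly one fold, so every instance is used for validation exactly once and for training in the other nine rounds.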
The procedure of the proposed machine learning framework is shown in Figure 2. We have m attributes in total after stepwise attribute selection. We then randomly choose n attributes and perform cross-validation training with our machine learning model. The next step is to calculate the performance accuracy. This process is run k times, and the k performance accuracies are averaged to form one final accuracy. Within each run, we use 10-fold cross-validation for training and testing: each fold yields an accuracy, and the performance accuracy of the run is calculated by averaging the accuracies of the 10 folds.
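One possible reading of this procedure is that the m attributes are randomly split into k disjoint groups of n attributes, so that m = k × n. The sketch below follows that reading, which is an assumption. The `cv_accuracy` callable is a hypothetical placeholder for the 10-fold cross-validated accuracy of a model trained on one attribute group.

```python
import random


def attribute_groups(attributes, n, seed=0):
    """Randomly split m attributes into k disjoint groups of n (m = k * n)."""
    if len(attributes) % n != 0:
        raise ValueError("m must be a multiple of n so that m = k * n")
    shuffled = list(attributes)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + n] for i in range(0, len(shuffled), n)]


def framework_accuracy(attributes, n, cv_accuracy, seed=0):
    """Average the cross-validated accuracy over the k attribute groups."""
    groups = attribute_groups(attributes, n, seed)
    accuracies = [cv_accuracy(group) for group in groups]
    return sum(accuracies) / len(accuracies)


# Illustrative attribute names; 16 attributes split into k = 4 groups of n = 4.
attrs = [f"a{i}" for i in range(16)]
groups = attribute_groups(attrs, n=4)
```

If the selection in the study was instead made with replacement across runs, the grouping function would need to sample independently for each run.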
Finding A Critical Point. To predict the performance of a user encountering phishing attacks, we divide the performance into two classes, Good and Poor, based on the performance score of a user, as discussed before. Recall that the critical point c is used to divide the performance score. Finding a critical point is a very important step before we use our machine learning models for prediction. To find the critical point, we use a greedy method that goes through each candidate threshold and checks whether it yields the preferred accuracy. In the evaluation section, we show how to find the critical point in detail.
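The threshold search can be sketched as a simple scan over candidate critical points, keeping the one with the highest accuracy. Interpreting the "greedy method" as such a scan is an assumption. The `accuracy_for` callable is a hypothetical placeholder for labeling the data with a given c and running cross-validation, and the accuracy values below are made up.

```python
def find_critical_point(candidates, accuracy_for):
    """Return the candidate threshold with the highest accuracy, and that accuracy."""
    best_c, best_accuracy = None, float("-inf")
    for c in candidates:
        accuracy = accuracy_for(c)
        # Strict comparison keeps the smallest threshold in case of ties.
        if accuracy > best_accuracy:
            best_c, best_accuracy = c, accuracy
    return best_c, best_accuracy


# Made-up accuracies per candidate threshold, for illustration only.
toy_accuracy = {26: 0.81, 28: 0.84, 30: 0.92, 32: 0.88, 34: 0.85}
best_c, best_accuracy = find_critical_point(sorted(toy_accuracy), toy_accuracy.get)
```

In practice the candidate list would also exclude thresholds that leave one class nearly empty, since accuracy on a badly unbalanced split is not informative.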

Evaluation
In this section, we analyze and identify which factors have a significant impact on a phishing attack outcome. Motivated by the questions introduced in Section 1, we first evaluate the intervention factor and find the type of phishing attack that is more harmful to people. Then, we examine whether there is a difference between the time spent on phishing emails and the time spent on normal emails. We also study whether a monetary incentive can improve the participants' performance regardless of their backgrounds. Last but not least, we present the evaluation of our machine learning models. The evaluations of intervention, phishing types, and the monetary incentive use the dataset from the on-site study, while the evaluation of performance prediction uses the dataset from the online study. We also present the performance prediction results on the on-site dataset and compare them with the results on the online dataset.

Intervention Evaluation
To answer the first question, namely how intervention affects the outcome of a phishing attack, we calculate the mean phishing score, mean total score, mean total processing time, and mean phishing processing time. The result is shown in Table 4. The intervention is introduced in the second round and is based on the performance of the participant in the first round. Table 4 reports both the phishing score, whose full score is 15, and the total score, whose full score is 20. As shown in the table, the performance in the second round improves slightly compared with the first round. The mean time used in the second round is also less than in the first round. However, in the third round, the performance score decreases and is even worse than in the first round.

Phishing Type Evaluation
We analyze the performance score and time for different types of phishing attacks. Table 5 answers the question of which kind of phishing attack is more harmful to people. A type 1 phishing attack contains a suspicious sender's email address, a type 2 phishing attack has suspicious links or attachments, and a type 3 phishing attack contains malicious contents. The mean score (full score is 15) and mean time are calculated by averaging the scores and times of all 40 participants for each phishing type. The intervention frequency is the total number of times an intervention for a certain phishing type was introduced in the task. We can see from the table that type 1 phishing has the lowest score and has been used the most as an intervention. This implies that type 1 phishing is more harmful than the other two types. In addition, the score is inversely related to the intervention frequency. Thus, intervention is a suitable attribute that can be used in our neural network.

Monetary Incentive
The next question is whether a monetary incentive affects the performance and total processing time.
We calculate the mean total performance score, mean phishing performance score, mean total processing time, and mean phishing processing time of all 40 participants. The result is shown in Table 6. Condition 0 means that there is no monetary incentive, that is, the control group, and condition 1 means that the group receives a monetary incentive. We can see from the table that the group with the incentive has a higher performance score than the group without it. Furthermore, the incentive group tends to spend more time than the control group. Therefore, incentive is also a useful attribute regarding a phishing outcome.
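The per-condition means reported in such a table can be computed with a small grouping helper. The sketch below uses made-up records and hypothetical column names (condition, total_score, total_time); it does not reproduce the values of Table 6.

```python
from statistics import mean


def group_means(records, keys):
    """Compute the mean of each key separately for every condition value."""
    by_condition = {}
    for record in records:
        by_condition.setdefault(record["condition"], []).append(record)
    return {
        condition: {key: mean(row[key] for row in rows) for key in keys}
        for condition, rows in sorted(by_condition.items())
    }


# Made-up records: condition 0 is the control group, 1 is the incentive group.
toy_records = [
    {"condition": 0, "total_score": 40, "total_time": 600},
    {"condition": 0, "total_score": 44, "total_time": 700},
    {"condition": 1, "total_score": 48, "total_time": 800},
    {"condition": 1, "total_score": 50, "total_time": 900},
]
summary = group_means(toy_records, ["total_score", "total_time"])
```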

Mouse Movement Evaluation
In our study, we also record the mouse movement of each participant and calculate the hesitation times as described in Section 3, where we pick the threshold h = 10. We analyze the relationship between hesitation and the total time used in this study as well as the relationship between hesitation and the total score. Both relationships are shown in Figure 4.
We can see from Figure 4 (a) that as the hesitation times increase, the total time also increases. The orange line represents the incentive group, while the blue line represents the control group. It is clear from this figure that the incentive group tends to spend more time and has more hesitation times. This is likely because the participants in the incentive group are more cautious when doing this task. Figure 4 (b) shows the relationship between the total score and hesitation times. For the control group, the relationship is not obvious. For the incentive group, the total score decreases as the hesitation times increase. It is interesting to see that when participants become more cautious, they tend to perform worse.

Time difference between phishing email and control email
The next question is whether users spend different amounts of time on normal and phishing emails. Since there are three rounds in total, we compare the time of each round as well as the total time of all three rounds. As shown in Figure 5, in round one, users spend less time on phishing emails. In round two, users also spend less time on phishing emails. However, in round three, users spend more time on phishing emails. Thus, in total, there is no significant time difference between normal and phishing emails.

Attribute Reduction
The first important step is to select the useful attributes that will be used in our machine learning models. We perform the Pearson correlation coefficient analysis to observe the importance of each individual attribute. We observe that most of the attributes are not significantly related to the total score. Detailed information on some of the attributes is shown in Table 7.
In the table, the attributes are sorted from most significant to least significant. We can see that the attribute phishing_accuracy has p < 0.0001. Some significant attributes, such as sort_agreement_4 and sort_correct_1, come from the survey questions. In particular, sort_agreement_4 refers to the following statement in the post-survey: "I felt irritated and stressed while sorting emails." In this analysis, we select 16 attributes that will be used in our machine learning models.

Critical Point Evaluation
Before we apply a machine learning model to predict whether a participant will perform well or not, we need to find a critical point to label the training data as Good or Poor. We test the critical points for each machine learning model.

Performance Prediction Evaluation
In our machine learning framework, we use four different machine learning models, J48, Naive Bayes, SVM, and Multilayer Perceptron, to predict the performance. We presented a table of critical points in the previous section; to further examine the preferred critical point, consider Figure 6 (a), which shows the accuracies of the machine learning models for different critical points. The critical points start from 26 because, for critical points smaller than 26, the two classes are unevenly distributed. We can see from this figure that when the critical point is 30, all models except J48 achieve their highest accuracy compared with other critical points. Therefore, we choose the critical point c = 30 to label the dataset into two classes, Good and Poor. Next, we use the four machine learning models with 10-fold cross-validation to do the classification. Figure 6 (b) shows the accuracy of each fold for the four machine learning models. The final accuracy is the average over all 10 folds. The accuracy varies across folds 1 to 10 because each fold uses a different training and testing subdataset, as discussed in the last section. For J48, the accuracies range from 55.56% to 100%; only folds 5 and 9 reach 100% accuracy, and the worst accuracy, 55.56%, comes from fold 10. For Naive Bayes, the accuracies range from 77.78% to 100%; folds 2, 3, and 10 have the lowest accuracy and folds 5, 6, 7, and 9 have the highest. SVM also has accuracies ranging from 77.78% to 100%, but it performs better than Naive Bayes overall. Multilayer Perceptron has accuracies ranging from 88.89% to 100%, which is better than the other three machine learning models. Figure 7 (a) shows the Mean Squared Error (MSE) of each fold for the four machine learning models. The MSE results correspond to the accuracies of each fold: as accuracy increases, the MSE decreases. Among them, fold 10 of J48 has the highest MSE because the accuracy of J48 on fold 10 is the lowest.
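The per-fold accuracies quoted above are consistent with folds of nine instances each, since 90 instances split into 10 equal folds give 9 test instances per fold. The short check below maps each reported percentage to a number of correctly classified instances out of 9; the helper name is illustrative.

```python
def fold_accuracy_percent(correct, fold_size):
    """Accuracy in percent, rounded to two decimals as reported in the text."""
    return round(100 * correct / fold_size, 2)


fold_size = 90 // 10  # 90 instances split into 10 equal folds

# Every accuracy value that a fold of this size can produce.
possible = {
    fold_accuracy_percent(k, fold_size): k for k in range(fold_size + 1)
}

reported = [55.56, 77.78, 88.89, 100.0]
correct_counts = {percent: possible[percent] for percent in reported}
```

Under this reading, the lowest J48 fold corresponds to 5 of 9 instances classified correctly, and a single additional error changes a fold's accuracy by about 11 percentage points, which explains the coarse steps between the reported values.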
Then, we evaluate the performance accuracy as described in our machine learning framework in Figure 2. As described in Section 4.2.2, we also apply a similar idea of cross-validation to the attributes. In the following analysis, the accuracy is the final accuracy obtained by averaging the accuracies of the 10 folds. Figure 8 (a) shows the relationship between accuracy and the number of instances, that is, the number of participants, because we treat each participant as an instance. We can see that as the number of instances increases, the accuracy also increases. Among the models, Multilayer Perceptron has the best accuracy, 93.84% on average; when using all 90 instances, its accuracy reaches 96.67%. SVM has the second-best accuracy, with an average of 89.93%, and reaches its best accuracy, 92.22%, when using all 90 instances. The average accuracies for Naive Bayes and J48 are 86.23% and 83.58%, respectively.
We also use the on-site dataset to predict user performance, following the same procedure as discussed above. However, the on-site dataset has only 40 participants, which means we can use only 40 instances. Figure 9 (a) compares the accuracy of the four machine learning models on the on-site dataset and the online dataset, which contains 90 instances. We can see that the accuracy for the online study is much better than for the on-site study. For J48, the on-site study accuracy is 65% and the online study accuracy is 86.67%. With Naive Bayes, the accuracy is 70% in the on-site study and 88.89% in the online study. The accuracies of SVM for the on-site study and the online study are 70% and 92.22%, respectively. Multilayer Perceptron has the highest accuracy in both studies: 80% in the on-site study and 96.67% in the online study.
Figure 9 (b) compares the accuracy of the four machine learning models on the on-site dataset and the online dataset with 40 instances each. We can see that the online study still has much better accuracy than the on-site study. For J48, the on-site study accuracy is 65% and the online study accuracy is 72.5%. With Naive Bayes, the accuracy is 70% in the on-site study and 82.5% in the online study. The accuracies of SVM for the on-site study and the online study are 70% and 85%, respectively. Multilayer Perceptron has the highest accuracy in both studies: 80% in the on-site study and 90% in the online study. With the same number of instances, the online study still has better prediction performance. The reason is that the attributes in our online study are more significant than the attributes used in the on-site study.

Discussions
In this research, we have collected data from both the on-site study and the online study. The on-site study aims at answering the questions regarding the intervention, phishing type, and monetary incentive factors. Using statistical methods, we first analyzed the data collected from the on-site study, where we introduced intervention in the second round. Our analysis demonstrates that participants with intervention and a monetary incentive perform better than those in other cases. Our data analysis also showed that the performance of participants in the second round improved due to the use of the intervention. However, we noticed that in the third round, some participants' performance was even worse than in the first round. We suspect that the worse performance could be due to the participants' fatigue in the third round. To investigate this phenomenon, we plan to conduct further experiments in our future research. Because of limited budget and resources, we were only able to recruit 40 participants in the on-site study. To increase the scalability and diversity of participants, we designed the online study and collected data from more participants using Amazon Mechanical Turk. We applied four machine learning models, J48, Naive Bayes, SVM, and Multilayer Perceptron, to predict a participant's performance, classified as Good or Poor. Our data analysis results showed that Multilayer Perceptron performed the best, with an accuracy of 96.67%. However, there was a weakness: in this study, we performed attribute selection or reduction by fitting all attributes in a linear regression, which might cause the problem of multicollinearity because some of the attributes were somewhat correlated. This is due to the small dataset in both our studies, resulting in a limitation of the current research. To address this issue, we plan to recruit more participants in future research. Furthermore, as we saw in Section 5, both intervention and the monetary incentive could improve user performance when dealing with phishing emails. Therefore, these factors could be applied in user education. We could design an education game that predicts a user's performance based on user behaviors by applying our machine learning framework. Then, we could design specific schemes that help users become aware of phishing attacks so that they could achieve better performance. We could motivate them by giving them a hint (intervention) or an award (monetary incentive). User education is out of the scope of this research; we leave it to a separate paper.

Conclusions and Future Work
In this paper, we studied user behavior related to phishing emails. We conducted a comprehensive and quantitative investigation of how users react when checking and reading emails, which has become an integral part of our daily life. We have designed two studies, an on-site study and an online study. We have applied statistical methods to analyze our on-site dataset and explore how intervention, phishing types, and a monetary incentive affect user behaviors when phishing attacks are encountered. Our analysis has shown that participants with intervention and a monetary incentive perform better than those in other cases. Phishing type 1, suspicious senders' email addresses, tends to be more harmful to users than the other two phishing types. We have further developed machine learning techniques with 10-fold cross-validation to analyze the data collected in the online study. We have analyzed the best attributes and found the preferred critical point used in our machine learning framework. By choosing 16 attributes and critical point c = 30, we have achieved user performance prediction accuracies of 86.67%, 88.89%, 92.22%, and 96.67% for J48, Naive Bayes, SVM, and Multilayer Perceptron, respectively.
Based on the findings from our study, we suggest that users pay more attention to the sender's email address, links, and contents in an email in order to avoid becoming a victim of phishing email attacks. In the future, we plan to conduct more experiments and recruit more participants. In daily-life scenarios, we tend to deal with many other things while checking our emails; thus, in addition to the future work discussed in Section 6, we plan to investigate a multitasking experiment platform to understand how multitasking affects the behavior of a user.

| Attribute | Description | Study |
| --- | --- | --- |
| R1_Phis_Time | Time used for sorting all phishing emails in round 1 | On-site |
| R1_Phis_Score | Score obtained for sorting all phishing emails in round 1 | On-site |
| R2_Phis_Time | Time used for sorting all phishing emails in round 2 | On-site |
| R2_Phis_Score | Score obtained for sorting all phishing emails in round 2 | On-site |
| R3_Phis_Time | Time used for sorting all phishing emails in round 3 | On-site |
| R3_Phis_Score | Score obtained for sorting all phishing emails in round 3 | On-site |
| Num_sorted | The number of all emails that have been sorted in the task | Online |
| Phis_sorted | The number of phishing emails that have been sorted | Online |
| Phis_accuracy | The accuracy of phishing emails that have been sorted | Online |
| Nr_sorted | The number of normal emails that have been sorted | Online |
| Nr_accuracy | The accuracy of normal emails that have been sorted | Online |
| Pay | The amount of money paid to participants | Online |
| Avg_rating | The average rating of confidence level | Online |
| Median_rating | The median rating of confidence level | Online |
| All_percent | The ratio of correctly sorted emails to all emails | Online |
| Phis_percent | The ratio of correctly sorted phishing emails to all emails | Online |
| Nr_percent | The ratio of correctly sorted normal emails to all emails | Online |

Note: The attributes of survey questions can be found in Appendices A and B.