How data-sharing nudges influence people’s privacy preferences: A machine learning-based analysis

INTRODUCTION: Many online services use data-sharing nudges to solicit personal data from their customers for personalized services. OBJECTIVES: This study examines people’s privacy preferences in sharing different types of personal data under different nudging conditions, how digital nudging can change their data-sharing willingness, and whether people’s data-sharing preferences can be predicted from their responses to a questionnaire. METHODS: This paper reports a machine learning-based analysis of people’s privacy preference patterns under four different data-sharing nudging conditions (without nudging, monetary incentives, non-monetary incentives, and privacy assurance). The analysis is based on data collected from 685 UK residents who participated in a panel survey. Their self-reported willingness levels towards sharing 23 different types of personal data were analyzed using both unsupervised (clustering) and supervised (classification) machine learning algorithms. RESULTS: The results led to a better understanding of people’s privacy preference patterns across different data-sharing nudging conditions, e.g., our participants’ preferences are distributed in a space of 48 possible profiles more sparsely than we expected, and, unexpectedly, all three data-sharing nudging strategies had an overall negative effect: they led to a reduced level of self-reported willingness for more participants, compared with the case of no nudging at all. Our experiments with supervised machine learning models also showed that people’s privacy (data-sharing) preference profiles can be automatically predicted with good accuracy, even when a small questionnaire with just seven questions is used. CONCLUSION: Our work revealed a more complicated structure of people’s privacy preference profiles, which have some dependencies on the type of data nudging and the type of personal data shared. Such complicated privacy preference profiles can be effectively analyzed using machine learning methods, including automatic prediction based on a small questionnaire. The negative results on the overall effect of different data-sharing nudges imply that service providers should consider if and how to use such mechanisms to incentivise their consumers to share personal data. We believe that more consumer-centric and transparent methods and tools should be used to help improve trust between consumers and service providers.


Introduction
Behavioral nudging refers to "any aspect of the choice architecture that alters people's behaviors in a predictable way without forbidding any options or significantly changing their economic incentives" [1,2]. When implemented in a digital environment, behavioral nudging is normally done via the use of some specific user interface (UI) elements aiming to guide people's decisions towards a very specific direction [3][4][5][6].

EAI Endorsed Transactions on Security and Safety
When applied to privacy-related applications, nudging can influence people's behaviors to disclose personal information to online services, e.g., to complete a transaction with an e-commerce website or to make a hotel booking via a travel agent [7,8]. In this context, there are two different types of behavioral nudging depending on its purpose: 1) privacy protection nudges that help people to behave more securely in order to achieve better privacy protection (e.g., sharing less personal data to reduce privacy risks), and 2) data-sharing nudges used by online service providers to encourage their customers to share more personal data in exchange for more personalized services and/or benefits. Note that (over-)sharing personal data with service providers is the source of many privacy problems, so the second type of nudge has profound implications for privacy. Therefore, people's privacy behaviors have been extensively studied in the context of personal data-sharing decisions, e.g., research on the privacy paradox (people's actual data-sharing behaviors deviate from their self-reported attitudes) [9]. Data-sharing nudges are often implemented as a range of monetary or non-monetary incentivizing methods, e.g., cash returns, price discounts, and loyalty points, but can also be implemented in other ways, e.g., privacy assurance via a trust seal provided by an independent trusted party. To ensure that data-sharing nudging messages are effective at the individual level, some online service providers have also considered tailoring the user interfaces of their online services to present more targeted incentives matching the user's personal preferences [10][11][12].
Although there is a growing number of studies focusing on both types of privacy-related nudging, it is still poorly understood how different types of data-sharing nudges can influence people's willingness to disclose personal information. Some researchers have argued that "one-size-fits-all" interventions should be tailored by leveraging individual differences in decision making and personality [13], which calls for personalized nudges. This in turn requires a better understanding of how people can be segmented into different profiles according to their privacy attitudes towards different types of data-sharing nudges.
In this work, we use both unsupervised (DBSCAN, k-means, and hierarchical agglomerative clustering) and supervised (decision tree, random forest, and naïve Bayes) machine learning algorithms to investigate the effect of three typical types of data-sharing nudges (monetary incentives, non-monetary incentives and privacy assurances) on people's willingness to disclose personal data to service providers. The analysis is based on data collected from 685 UK residents who participated in a panel survey measuring their privacy attitudes towards personal data sharing. The data is analyzed following a three-step procedure, in which user segmentation and profiling are performed by examining how individual privacy preferences change across different data-sharing nudging conditions. The machine learning algorithms were chosen among widely used ones that tend to perform well in tasks with relatively small datasets. Multiple algorithms were considered to allow us to identify the best method for each step of the analysis procedure. The results led to a more complete understanding of people's privacy (data-sharing) preference patterns, e.g., their preferences concentrate on a small number of profiles out of 48 possible ones, and the unexpected result that all three data-sharing nudging strategies actually led more participants to report a reduced willingness level compared with the no-nudging condition. Our work also showed that people's privacy (data-sharing) preference profiles could be automatically predicted with good accuracy, even if just seven features are used, which suggests that a small questionnaire with just seven questions can be used to profile users. This can help simplify the development of private data management tools when configuring the initial preferences for each individual user.
The rest of the paper is organized as follows. In Section 2, we briefly review some closely related work. Section 3 explains the data used in detail and the proposed three-step procedure for automatic segmentation and profiling of individual users. The results of applying the three-step procedure to the collected data are discussed in Section 4. We cover further discussions and some limitations of our work in Section 5. Finally, Section 6 concludes the paper.

Related work
In this section we discuss some closely related work in two areas: how data-sharing nudges are used by service providers and how they influence people's behaviors, and how people can be segmented based on their privacy attitudes (i.e., research on privacy typologies). We would like to highlight that, while many researchers have studied privacy attitudes and behaviors under different data-sharing nudging conditions and privacy typologies, we have not seen any work on applying machine learning to study privacy typologies in the context of different data-sharing nudging strategies, which is the focus of this paper.

Data-Sharing Nudges
Service providers have been offering different types of incentives such as monetary rewards, price discounts, loyalty points, free products and services to encourage their consumers to share more personal data for more personalized services, with different effects achieved [14]. It has been found that monetary incentives can exert positive influences in certain contexts. For instance, Hui et al. (2007) found that monetary incentives worked for a Singaporean company to boost personal data disclosure [15]. Mukherjee et al. (2013) reported that monetary incentives positively influenced not only privacy-disclosure preferences but also actual self-disclosure [16]. Similarly, Shibchurn and Yan (2014) showed that offering a monetary reward could increase the willingness levels of online social network (OSN) users to share personal data [17]. However, some other studies showed different results, and below we give two examples. Lee et al. (2015) found that, when sensitive information was required, offering customers monetary benefits resulted in an increase in their privacy concerns, so it did not seem an effective way to encourage data sharing. Instead, they found that building trust with customers can be a more effective mechanism for organizations to encourage their customers to share personal data [14]. In another study, Lu et al. (2018) found that monetary incentives were not better than simple email reminders for encouraging self-disclosure of personal data, in the context of an online car-sharing platform [18].
EAI Endorsed Transactions on Security and Safety 11 2021 - 08 2022 | Volume 8 | Issue 30 | e3
In addition to monetary incentives, engagement in online activities and self-disclosure of personal information are affected by non-monetary incentives and other factors. For instance, it was found that the use of Facebook is strongly associated with the benefits of social capital, such as self presence, accumulating friends, joining virtual groups etc. [19]. There are negative results in the literature as well, for instance, Ward et al. (2005) pointed out that neither price discounts nor personalized services had any effect in incentivizing customers to share personal information [20], and that, in addition, participants were found to be more sensitive when sharing financial information (e.g., online transaction records), while being relatively willing to share personal information.
Among all non-monetary factors studied, many researchers have studied whether enhancing trust between customers and service providers via mechanisms such as self-stated privacy statements (e.g., privacy policies) and third-party trust seals can play an important role in influencing people's decisions on data disclosure. Gerlach et al. (2015) reported that the effect of a privacy policy's permissiveness on users' willingness to disclose personal information on OSNs is mediated by users' privacy risk perceptions [21]. In another study, Kobsa and Teltzrow (2004) found that people were more willing to share personal data when purchasing from online shops that attach certain descriptions of privacy practices [22]. Gabisch and Milne (2013) found that, compared with monetary rewards, "safety cues", such as a statement reassuring users about the existence of a privacy policy and a privacy seal on a website, could be more effective in encouraging self-disclosure online [23]. Similarly, in the context of location-based social network services, Zhao et al. (2012) also reported that privacy policies (in addition to privacy controls) could help in reducing privacy concerns when sharing location-based information [24]. Hui et al. (2007) reported that by using proper privacy statements a local firm in Singapore could collect more personal data than by using privacy seals [15]. In [25], it is demonstrated that neither advanced privacy conditions (i.e., Confidential, Anonymized-Envelope and Anonymous-Postcard) nor monetary incentives could result in higher disclosure rates of sensitive information. A similar result was reported in [26], which suggests that none of many types of privacy assurances had a direct or a moderating effect on personal information disclosure by online users in Saudi Arabia.

Privacy Typology
Research on privacy typology has shown that people can be divided into different segments or profiles based on their privacy attitudes. For example, Harris and Westin's classic work on this topic led to the so-called Westin's Privacy Segmentation Index, i.e., citizens can be grouped into three segments (Fundamentalists, Unconcerned and Pragmatists), according to their different trust levels in existing laws and in organizations' processing and collection of their personal data [27]. In a study examining Internet users' privacy concerns regarding the collection and usage of personal information [28], the original Pragmatist group was further split into two subgroups, resulting in four user segments: Unconcerned Internet, Circumspect Internet, Wary Internet and Alarmed Internet users. To examine whether Westin's Privacy Segmentation Index could represent users' actual behaviors, Woodruff et al. (2014) conducted an online survey of 884 Amazon Mechanical Turk participants with detailed statements about privacy scenarios and privacy-sensitive consequences, leading to the finding that Westin's segments are not well correlated with behavioral intents or consequences [29]. Such conflicting results called for more research into this topic.
An immediate consequence of the existence of multiple user segments is that a "one-size-fits-all" solution will not be ideal for privacy protection. Therefore, prediction of the user's privacy attitude patterns (privacy profiling) is helpful in many applications. Due to the reported context-dependent attitude-consequence gap, it was also suggested to combine contextual and cost-benefit analyses when aiming to predict privacy choices. For instance, in the context of mobile applications, in [30], user groups including Conservatives, Unconcerned, Fence-Sitters and Advanced users could be identified based on their comfort levels towards requested app permissions with certain purposes. For web-based services such as location-based services (LBS), Poikela et al. (2014) reported that users could be segmented based on the frequency and the level of accuracy of real-time shared location [31]. By inviting participants to rank privacy behaviors while using a technology service, Morton and Sasse (2014) suggested a five-group segmentation to describe users' information-seeking preferences and inform the construction of default privacy settings [32].
Privacy typology has also been studied by many researchers in the context of Internet of Things (IoT) and OSNs. For instance, for 14 IoT scenarios varying across eight factors, Naeini et al. (2017) found that over 86% of privacy (data-sharing) preferences of users can be modeled accurately [33]. Wisniewski et al. (2014) studied user privacy typology based on self-reported behaviors on privacy settings on user interfaces of Facebook, and reported six groups of user privacy behavioral profiles: Privacy Maximizers, Selective Sharers, Privacy Balancers, Self-Censors, Time Savers and Privacy Minimalists [34]. In 2017 [35], they further profiled Facebook users' privacy attitudes according to their awareness of privacy features, ranging from Experts to Novices. In [36], Lankton et al. (2017) examined privacy segmentation of Facebook users using two datasets on self-reported privacy strategies from undergraduate students in information systems and general users, respectively. They argued that their results could better interpret cluster differences through demographics, trust, privacy, and technology-usage perceptions. Considering that users tend to apply default privacy settings, some researchers investigated how well default privacy settings could meet users' expectations. For instance, in [37], Watson et al. (2015) concluded that using personal characterizations from relevant community data to create default privacy settings could better match users' expectations on OSNs. Based on a survey of 337 Internet users in Germany, Schomakers and Lidynia (2019) identified three user clusters with different privacy attitudes: Privacy Guardians, Privacy Cynics and Privacy Pragmatists [38].
In addition to the contextual dependency of user segmentation of privacy attitude and actual behaviors, some researchers also studied how the user segmentation differs across different types of data items. For instance, Knijnenburg et al. (2013) found that such data item dependency did exist in their analysis of three datasets of online information disclosure intentions and behaviors [39]. They also argued that more accurate user profiling algorithms should consider this effect to be able to help tailor the needs of different users.
Finally, we would like to mention that as a preliminary study of the present research, we conducted a clustering-based analysis of 685 UK travelers based on their self-reported willingness to share personal data, which led to two different user segments: Privacy Pessimists and Privacy Rationalists [40]. The present research reports our efforts of applying more advanced machine learning algorithms to the user segmentation problem, covering more complicated aspects such as how users' privacy attitudes change depending on the type of data-sharing nudging condition, and if and how we can automatically classify a given user into a specific user profile.

Data Used and Methodology
In this section, we first explain the data we used in detail, and then describe the proposed three-step procedure for segmenting and profiling individual participants based on the data.

Data Used
Our study aims to segment people according to their self-reported willingness to disclose personal data to online travel service providers. To gather the data needed for the study, we conducted an online survey with a panel of UK residents recruited through a professional survey company in May 2019, as part of the PriVELT project. Participants were requested to state their level of willingness to share 23 types of personal information online. In this study, four data-sharing nudging strategies were tested: (1) no incentive, (2) monetary incentives (e.g., cash), (3) non-monetary incentives (e.g., discounts), and (4) privacy assurances (e.g., third-party trust seals). Each data item is scored on a five-point Likert scale: 1 = "strongly disagree", 2 = "disagree", 3 = "neither disagree nor agree", 4 = "agree", and 5 = "strongly agree". The questions we asked for the four different conditions are illustrated in the Appendix (Table C.1). For each of the four conditions, the participants reported how willing they would be to share the 23 different types of personal data shown in Table 1, so in total each participant reported 23 × 4 = 92 levels of data-sharing willingness. Note that we also include acronyms of the 23 data types in Table 1, which are used in Figures 2 and 3 to keep those figures more compact.
After cleaning the data, responses from 685 participants were considered valid. The 23-D willingness responses collected from the 685 participants were divided into four datasets, Dataset_NI, Dataset_MI, Dataset_PA and Dataset_NMI, as the input of our data analysis process explained later, where the subscripts in the datasets' names refer to the four data-sharing nudging conditions: NI = No Incentive, MI = Monetary Incentives, NMI = Non-Monetary Incentives, and PA = Privacy Assurances. These acronyms will be used in the remainder of the paper for the sake of brevity. The demographic profiles of the 685 participants can be found in the Appendix (Table C.2).
[Table 1 (excerpt). Personal data types and acronyms, e.g., Specific Expenses (SE): credit card and PayPal transaction records; smartphone data such as time-stamped movements.]

Methodology
To analyze the above-described data for user segmentation and profiling purposes, we extended our preliminary work reported in [41] by applying a more advanced three-step procedure described below and detailed in the following three sub-subsections.
• Step 1: Clustering. Unsupervised clustering algorithms are applied to the four datasets to identify the best performing algorithm.
• Step 2: Cluster Analysis. Based on the clustering results, we look at three aspects of cluster analysis: participant distributions to clusters and user profiles, effectiveness of nudging strategies, and behavioral variances across different personal data types.
• Step 3: Automatic Profiling. Supervised machine learning algorithms are used to evaluate if cluster labels and user profiles across all four nudging conditions can be automatically predicted.
Step 1: Clustering. As stated earlier, each dataset includes participants' willingness responses towards 23 data types under one specific data-sharing condition, and each response can be represented as a 23-D vector $\vec{w} = (w_1, w_2, \ldots, w_{23})$. Since the dimensionality is relatively high, before conducting the cluster analysis, we first apply a Principal Component Analysis (PCA) for the purpose of dimensionality reduction. According to one of the commonly used criteria for selecting significant factors (principal components), namely retaining factors with an eigenvalue greater than 1.0 [42], six factors are kept for actual clustering.
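To illustrate the dimensionality-reduction step, the following is a minimal sketch of PCA with the eigenvalue-greater-than-1.0 criterion. It assumes the responses are standardized first, so that the eigenvalues are those of the correlation matrix; the function name `kaiser_pca` and the synthetic Likert data are hypothetical and are not taken from the study.

```python
import numpy as np

def kaiser_pca(responses):
    """Project responses onto principal components with eigenvalue > 1.0.

    Assumption: columns are standardized first, so the eigenvalues come
    from the correlation matrix (the usual setting for this criterion).
    """
    X = np.asarray(responses, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.cov(Z, rowvar=False)            # correlation matrix of X
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    return Z @ eigvecs[:, keep], eigvals

# Synthetic stand-in for one 685 x 23 dataset of 1-5 Likert scores.
rng = np.random.default_rng(0)
latent = rng.normal(size=(685, 3))
loadings = rng.normal(size=(3, 23))
raw = latent @ loadings + rng.normal(scale=1.0, size=(685, 23))
likert = np.clip(np.round(3 + raw), 1, 5)

scores, eigvals = kaiser_pca(likert)
print(scores.shape[1], "components retained")
```

The number of retained components depends entirely on the data; the synthetic example is not expected to reproduce the six factors reported above.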
For the actual clustering part, a number of candidate clustering algorithms that may perform well should be selected and tested in order to identify the best method for further analysis. A number of clustering-evaluation metrics are needed here to compare the candidate clustering algorithms.
For non-deterministic clustering methods (such as k-means, which we actually used), the clustering result depends on the random initial condition. Therefore, it is necessary to examine the stability of the produced clusters under different random initial conditions, where stability refers to the level of consistency of the clustering results, i.e., the same points stay in the same cluster. The procedure used to perform this stability evaluation is explained below.
For each dataset $X$ and each number of clusters ($k$), the non-deterministic clustering method is run $n$ times, varying the random initial conditions used to initialize the algorithm. To measure how stable the clustering results are across the $n$ runs, we need a quantitative metric to allow comparison and selection of the best parameters. Considering that the actual cluster labels of different runs for each $(X, k)$ do not align and there are no ground truth labels, we define a pair-based stability metric $SM_{X,k,n}$ as follows:
$$SM_{X,k,n} = \frac{N_s(n)}{N_{X,k}}, \qquad (1)$$
where $N_{X,k}$ is the total number of unique data-point pairs for the dataset $X$ and a specific number of clusters $k$, and $N_s(n)$ is the number of pairs of data points that fall into the same cluster consistently for all $n$ runs of k-means. This metric has a natural range between 0 and 1, and can be interpreted as the probability that a randomly sampled data-point pair in the dataset $X$ is never split into two separate clusters across all $n$ runs of clustering, when $k$ clusters are predefined. A higher value of $SM_{X,k,n}$ indicates a higher level of stability of the clustering results. With the above-defined stability metric, for each $X$ we can find the best value of $k$ giving the highest value of $SM_{X,k,n}$.
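The pair-based stability idea can be sketched in a few lines of Python. This is an illustrative reading of the metric, not the authors' implementation: it assumes the denominator counts every unique pair of data points, and the function name `stability` and the toy label lists are hypothetical. Comparing pairs rather than labels sidesteps the fact that cluster IDs are permuted between runs.

```python
from itertools import combinations

def stability(label_runs):
    """Pair-based stability over several clustering runs of the same data.

    label_runs: list of label sequences, one per run, all of equal length.
    Assumption: the denominator counts every unique pair of data points.
    """
    n_points = len(label_runs[0])
    pairs = list(combinations(range(n_points), 2))
    # A pair is stable if both points share a cluster in every run.
    stable = sum(
        all(run[i] == run[j] for run in label_runs)
        for i, j in pairs
    )
    return stable / len(pairs)

runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # same partition as the first run, labels permuted
    [0, 0, 0, 1],   # point 2 moves to the other cluster
]
print(stability(runs))  # 1 stable pair out of 6
```

In practice each entry of `label_runs` would be the label vector returned by one k-means run with a different random seed.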
By clustering a user's responses under four different data-sharing nudging conditions, a more complete picture of the user's privacy preferences to data-sharing nudging can be inferred by identifying user segments over four datasets, corresponding to their data-sharing willingness in four different conditions: "sharing for no additional benefit", "sharing for additional monetary returns", "sharing for additional non-monetary returns" and "sharing due to third-party privacy assurances". In this way, there are maximally $k_{NI} \times k_{MI} \times k_{NMI} \times k_{PA}$ possible user profiles.
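As a small worked example of the profile count, the sketch below enumerates all combinations of per-condition clusters. It uses the numbers of clusters that the paper selects later (2 for NI, 3 for MI, 4 for NMI and 2 for PA), which give the 48 possible profiles mentioned in the abstract; the dictionary layout is only an illustrative choice.

```python
from itertools import product
from math import prod

# Cluster names per nudging condition, as named later in the paper.
clusters = {
    "NI": ["L", "H"],
    "MI": ["L", "M", "H"],
    "NMI": ["L", "ML", "MH", "H"],
    "PA": ["L", "H"],
}

# Maximum number of profiles is the product of the cluster counts.
n_profiles = prod(len(names) for names in clusters.values())
# Each profile is one cluster per condition, in the order NI, MI, NMI, PA.
profiles = list(product(*clusters.values()))

print(n_profiles)    # 48
print(profiles[0])   # ('L', 'L', 'L', 'L')
```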
Step 2: Cluster Analysis. Step 1 gives a set of clusters for each of the four data-sharing nudging conditions. In Step 2 we conduct a detailed analysis of the clustering results, focusing on the following three different but related aspects.
For the first aspect, we examine how all participants distribute to different clusters under each nudging condition and also to the $k_{NI} \times k_{MI} \times k_{NMI} \times k_{PA}$ different user profiles considering their overall behaviors across all four nudging conditions. The latter allows us to examine how some participants may "migrate" from one cluster with a lower level of data-sharing willingness to a higher one under a specific nudging condition, or vice versa, therefore providing some useful evidence about the effectiveness of different nudging strategies. To capture the "cluster-migration rate", i.e., the portion of participants in one cluster $X_i$ (the $i$-th cluster under the nudging condition $X$) who "come from" another cluster $Y_j$ (the $j$-th cluster under the nudging condition $Y \neq X$), we define a similarity metric as follows:
$$S_{X_i,Y_j} = \frac{\#(X_i \cap Y_j)}{\#(X_i)},$$
where $\#(X)$ denotes the cardinality of a set $X$. This similarity has a natural range of [0, 1], and a higher value indicates a higher level of membership overlap between these two clusters. Note that this similarity metric is asymmetric, i.e., $S_{X_i,Y_j} = S_{Y_j,X_i}$ does not hold in general, since the cluster-migration rates for the two directions can be different. We focus on the cluster-migration rates from clusters under the no-nudging condition NI to the other three nudging conditions because this migration direction can give more useful information about how data-sharing nudges can influence participants' behaviors.
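The cluster-migration rate can be illustrated with plain Python sets. This is a minimal sketch assuming each cluster is represented as a set of participant IDs; the function name `migration_rate` and the IDs are hypothetical.

```python
def migration_rate(cluster_x, cluster_y):
    """Share of participants in cluster_x who also belong to cluster_y."""
    cluster_x, cluster_y = set(cluster_x), set(cluster_y)
    if not cluster_x:
        raise ValueError("cluster_x must not be empty")
    return len(cluster_x & cluster_y) / len(cluster_x)

# Hypothetical participant IDs.
ni_low = {1, 2, 3, 4, 5, 6}   # low-willingness cluster under NI
mi_high = {5, 6, 7, 8}        # high-willingness cluster under MI

print(migration_rate(mi_high, ni_low))  # 0.5
print(migration_rate(ni_low, mi_high))  # 2 of 6, roughly 0.333
```

The two calls return different values, which shows the asymmetry noted above: half of the MI high cluster comes from the NI low cluster, but only a third of the NI low cluster ends up in the MI high cluster.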
The second aspect extends the analysis of the effectiveness of the three nudging conditions by looking at the extent to which they are able to successfully nudge participants towards a higher willingness level to share personal data. Different from the first aspect, which focuses more on collective behaviors inferred from the average data-sharing level per cluster, here we examine how data-sharing nudging strategies influence individual participants' data-sharing willingness (increase, decrease or no change).
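A per-participant comparison of this kind could be sketched as follows. The sketch assumes that a participant's willingness level under a condition is summarized by the mean of their item scores, which may differ from the aggregation actually used in the study; the function name `change_direction` and the toy scores are hypothetical.

```python
import numpy as np

def change_direction(baseline, nudged):
    """Label each participant as 'increase', 'decrease', or 'no change'.

    Assumption: a participant's willingness level under a condition is
    the mean of their item scores; the paper may aggregate differently.
    """
    diff = np.mean(nudged, axis=1) - np.mean(baseline, axis=1)
    return np.where(diff > 0, "increase",
                    np.where(diff < 0, "decrease", "no change"))

# Three hypothetical participants, three data types each (1-5 Likert).
ni = np.array([[3, 3, 3], [4, 4, 4], [2, 2, 2]])
mi = np.array([[4, 3, 3], [2, 4, 4], [2, 2, 2]])

labels = change_direction(ni, mi)
print(labels.tolist())  # ['increase', 'decrease', 'no change']
```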
The third aspect is about how participants' data-sharing willingness levels vary across all 23 types of personal data. We expect participants to have different privacy (data-sharing) preferences for different data types, and it is interesting to know how such type-specific behaviors may affect our overall analysis results.
Step 3: Profile Prediction. In this step, we look at how to build four separate classifiers, each automatically predicting the cluster label of a given 23-D data point $\vec{w}$ (i.e., an individual participant's response) under a specific nudging condition. The four classifiers jointly can predict the participant's overall user profile. Having such classifiers also allows us to have a different set of evaluation criteria for the clustering performance in Step 1 because we would like to choose the clustering method and relevant parameters to optimize the prediction accuracy, too. In our experiments, we decided to choose decision tree (DT), random forest (RF) and naïve Bayes (NB) as the candidate classification algorithms because they can be trained on a smaller dataset to achieve a reasonably good performance. The first two algorithms can also return indicators of the importance of different features, which can be used to select a smaller number of important features for classification. This allows a small questionnaire to be used to profile users and therefore improves the usability of such user profiling / personalization systems. Once the most important features for classification are determined, we re-train the classification models using only those features, and measure their predictive accuracy.
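The feature-selection idea in Step 3 can be sketched with scikit-learn as below. This is a hedged illustration on synthetic data, not the authors' pipeline: the labels are a hypothetical two-cluster split, the forest size is arbitrary, and ranking features on the full dataset before cross-validation is a simplification that a careful evaluation would avoid.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in: 685 participants x 23 Likert items; the label is a
# hypothetical two-cluster split driven by the first seven items.
X = rng.integers(1, 6, size=(685, 23))
y = (X[:, :7].mean(axis=1) > 3).astype(int)

# Rank all 23 features by random-forest importance and keep the top seven.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top7 = np.argsort(forest.feature_importances_)[::-1][:7]

# Re-train on the seven selected features only and estimate accuracy.
small = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(small, X[:, top7], y, cv=5, scoring="accuracy").mean()
print(sorted(top7.tolist()), round(acc, 3))
```

The seven selected column indices correspond to the seven questionnaire items that would be kept in a shortened survey.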

Step 1: Clustering
We decided to choose three well-established clustering algorithms of different types as candidates: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [43], k-means [44] and hierarchical agglomerative clustering (HAC) [45]. We used implementations of these algorithms in the widely used library scikit-learn (0.21.3) running with Python 3.7. We set the parameters $k_{\max}$ (the maximum number of clusters) and $t_{\max}$ (the number of runs for each $(X, k)$) both to 10. For cluster-evaluation metrics, we decided to use four widely adopted indices: the silhouette index (S), the Calinski-Harabasz index (CH), the Davies-Bouldin index (DB) and the S_Dbw index (SD) [46][47][48][49][50][51]. In order to determine the value of $k_{\max}$, we ran one of the three clustering algorithms, DBSCAN, with $k_{\max} = 10$, and it failed to produce any clustering results for k = 9 and k = 10 under two nudging conditions, MI and NMI (see Tables C.4 and C.5). We considered that these results indicate that meaningful clusters cannot be identified for k > 10, so we decided to set $k_{\max} = 10$ for our clustering experiments. Tables C.3-C.6 show the comparison results of the three clustering algorithms for each of these indices and each value of k in {2, 3, ..., 10}. We identified the best clustering algorithm for each of the four datasets by choosing the algorithm with the most "wins" among all the performance metrics. For instance, Table C.5 shows that for Dataset_NMI, k-means is the best when k = 4 given that it achieved more best scores than the others, and it is the best across all k values for the same reason.
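A comparison of this kind can be sketched with scikit-learn as follows. The sketch is illustrative only: it runs on synthetic data standing in for the six retained principal components, covers only k-means and HAC (DBSCAN does not take the number of clusters as a parameter), and omits the S_Dbw index because scikit-learn does not provide it.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

rng = np.random.default_rng(0)
# Synthetic stand-in for six retained principal components: two blobs.
X = np.vstack([rng.normal(-2, 1, size=(100, 6)),
               rng.normal(2, 1, size=(100, 6))])

results = {}
for k in range(2, 5):
    candidates = {
        "k-means": KMeans(n_clusters=k, n_init=10, random_state=0),
        "HAC": AgglomerativeClustering(n_clusters=k),
    }
    for name, model in candidates.items():
        labels = model.fit_predict(X)
        results[(name, k)] = {
            "S": silhouette_score(X, labels),          # higher is better
            "CH": calinski_harabasz_score(X, labels),  # higher is better
            "DB": davies_bouldin_score(X, labels),     # lower is better
        }

for key, scores in results.items():
    print(key, {m: round(v, 3) for m, v in scores.items()})
```

Counting, for each k, how many indices each algorithm wins would then give a tally of the kind summarized in Table 2.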
Here we focus only on a summary of these results, as reported in Table 2. In this table, Columns 1 and 2 show the number of clusters (k) and the clustering algorithm, respectively. In Columns 3-6, each cell shows, for each combination of k and algorithm, the number of clustering quality indices (out of 4) for which the algorithm was ranked the best, for a given dataset (as identified in the headings of Columns 3-6). The rightmost column shows the total number of times the algorithm was ranked the best across the 4 datasets, out of 16 times (4 datasets times 4 index values per dataset), for each value of k.
The results clearly show that the k-means algorithm achieved the best overall performance, which is stable across all values of k. More precisely, considering all k values shown in Table 2, k-means is ranked the best in 107 cases, whilst HAC and DBSCAN are ranked the best in only 20 and 17 cases, respectively. In addition, k-means outperformed the other two clustering methods for all 9 values of k. Hence, k-means was selected for further analysis.
The next step of the result analysis consists of selecting the best value of k for k-means. We used Eq. (1) to compute the value of $SM_{X,k,n}$ (with $n = 10$ runs) for each dataset $X$ and each value of k in {2, 3, ..., 10}. The results are shown in Table 3, where the number of clusters corresponding to the largest possible value of $SM_{X,k,n}$ (meaning the most stable clustering results, with no change at all across all 10 runs) for each dataset is highlighted in light gray. In general, the most stable clustering results were obtained with smaller k values, between 2 and 5.
In addition to training classifiers to do automatic profiling of participants in our survey, we are also interested in how good the clustering results are for supporting the automatic profiling in Step 3. Therefore, to further validate the results in Table 3, we built some classifiers (based on decision tree, random forest and naïve Bayes, as mentioned in the previous section) and compared their predictive performance indicators. One classifier was built for each of the best clustering results produced by the k-means algorithm (i.e., those gray cells shown in Table 3). The predictive accuracy of these classifiers was evaluated through a five-fold cross validation. Each classifier was trained using a training set, which includes 80% of all data points in one of the four datasets and the corresponding cluster IDs as the class labels. The remaining 20% of the data points in each dataset were used as the test set to calculate the performance indicators of the trained classifier.
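The evaluation protocol described above can be sketched with scikit-learn's `cross_validate`. This is a minimal illustration on synthetic data with hypothetical binary cluster labels; `GaussianNB` is an assumed choice, since the naïve Bayes variant used in the study is not stated here, and the forest size is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Synthetic stand-in for one dataset with hypothetical cluster labels.
X = rng.integers(1, 6, size=(685, 23)).astype(float)
y = (X.mean(axis=1) > 3).astype(int)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "NB": GaussianNB(),  # assumed variant of naive Bayes
}
scoring = ["accuracy", "f1_weighted", "roc_auc"]

summary = {}
for name, model in models.items():
    # Five folds: each fold trains on 80% and tests on the remaining 20%.
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    summary[name] = {m: cv[f"test_{m}"].mean() for m in scoring}

for name, scores in summary.items():
    print(name, {m: round(v, 3) for m, v in scores.items()})
```

For conditions with more than two clusters, the `roc_auc` scorer would need to be replaced by a multi-class variant after binarizing the class labels.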
In Table 3, there is a single best number of clusters for the datasets NI and PA: k = 2. However, there is a tie between two best values of k for each of the datasets MI and NMI. Therefore, by using the classifiers to further validate the clustering results, we can also break the tie and determine the best k value for the MI and NMI datasets. Table 4 shows the results of three widely used predictive performance measures for all three classifiers we built: the accuracy (i.e., the average percentage of correct predictions across all classes), the weighted-average F1-score and the Area Under the ROC Curve (AUC) after binarizing class labels. As observed in Table 4, for all three performance metrics and all three classifiers, the better value of k is 3 for the MI dataset and 4 for the NMI dataset. Hence, these k values are selected for further analysis of these two datasets. Moreover, all three classifiers trained with the selected k values achieved a high predictive performance, but the random forest and naïve Bayes models performed better.

EAI Endorsed Transactions on Security and Safety 11 2021 -08 2022 | Volume 8 | Issue 30 | e3

Step 2: Cluster Analysis
In this subsection, we report results of our cluster analysis on all the three aspects described in Section 3.2.

Participants vs. Clusters and User Profiles.
How participants are distributed across the different clusters under each nudging condition (produced by the k-means clustering algorithm in Step 1) is shown in Table 5, from which we can see that, under each nudging condition, different clusters have clearly different values of w (the average data-sharing willingness level across all participants belonging to a cluster). We also conducted a one-way (between-subject) ANOVA to check if the differences are statistically significant, and the results are positive for all conditions with p < 0.001 (NI: F(1, 683) = 401; MI: F(2, 682) = 1295; NMI: F(3, 681) = 854; PA: F(1, 683) = 1233). For MI and NMI, since there are more than two groups, a Tukey's post-hoc test was run and the results showed significant differences between all group pairs. Based on the statistical results, we name the clusters according to the values of w: L (low) or H (high) for the nudging conditions NI and PA (k = 2); L (low), M (medium) or H (high) for MI (k = 3); L (low), ML (medium low), MH (medium high) or H (high) for NMI (k = 4). Accordingly, we use these names to differentiate the clusters under the same condition, e.g., NMI_ML refers to the ML cluster under the nudging condition NMI. We also conducted a separate one-way ANOVA to compare differences between all 11 clusters across different nudging conditions.

Table 4. Performance indicators of all classifiers we built for predicting and further validating the user clustering results of k-means. For MI and NMI datasets, we highlight the higher accuracy in bold face for the two possible values of k.
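The statistical comparison of cluster-level willingness described in this sub-subsection can be sketched as follows. This is a minimal example under stated assumptions, not our analysis script: the three groups are randomly generated stand-ins for the per-participant average willingness values in three clusters, the group sizes and means are hypothetical, and the example relies on `scipy.stats.f_oneway` and `scipy.stats.tukey_hsd` (the latter requires SciPy 1.8 or later).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical per-participant average willingness levels for three clusters
# (L, M, H) under one nudging condition; sizes sum to 685 participants.
w_low = np.clip(rng.normal(1.5, 0.4, size=300), 1, 5)
w_mid = np.clip(rng.normal(3.0, 0.4, size=250), 1, 5)
w_high = np.clip(rng.normal(4.3, 0.4, size=135), 1, 5)
groups = [w_low, w_mid, w_high]

# One-way (between-subject) ANOVA across the clusters.
f_stat, p_value = stats.f_oneway(*groups)

# Degrees of freedom: (number of groups - 1, number of participants - number of groups).
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
print(f"F({df_between}, {df_within}) = {f_stat:.0f}, p = {p_value:.3g}")

# Tukey's post-hoc test for all pairs, needed when there are more than two groups.
tukey = stats.tukey_hsd(*groups)
print(tukey.pvalue)  # matrix of pairwise p-values
```

With three groups and 685 participants, the degrees of freedom are 2 and 682, which matches the form of the F statistic reported for the MI condition.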

If we consider participants' overall behaviors across all four nudging conditions, we can observe some more interesting behavioral patterns, as shown in Figure 1. Surprisingly, the distribution is much sparser than we expected: only 9 of the 48 possible profiles stand out with a significant portion of participants, and each of the other 39 profiles has just a few (no more than 12) participants and so can be considered background noise. Probably not surprisingly, two (the most and the third most) "popular" profiles turned out to be (NI_L, MI_L, NMI_L, PA_L) and (NI_H, MI_H, NMI_H, PA_H), representing un-incentivizable privacy fundamentalists and the privacy-unconcerned, respectively. The remaining 7 profiles can be classified into four sub-groups: 1) 4 profiles (those close to the bottom-left corner) representing moderately incentivizable privacy fundamentalists; 2) the profile (NI_H, MI_H, NMI_MH, PA_H), representing those with slight privacy concerns over non-monetary incentives; 3) the profile (NI_M, MI_M, NMI_MH, PA_H), representing privacy pragmatists happy with privacy assurance more than monetary and non-monetary incentives; and 4) the profile (NI_L, MI_M, NMI_MH, PA_H), representing pragmatic privacy fundamentalists who are incentivizable, more so by non-monetary incentives and privacy assurance than by monetary incentives. To summarize, the above behavioral patterns lead to the following observation: 70.5% of participants fall at the two ends of the spectrum (over half of all participants, 51.2%, are more like privacy fundamentalists and 18.4% have no or only slight privacy concerns); only two profiles (which cover only 9.6% of all participants) can be considered privacy pragmatists; and all the other 39 profiles cover the remaining 19.9% of participants.
Cluster-migration rates from the two NI clusters to the clusters under the other three nudging conditions are shown in Table 6. As a general trend, all three data-sharing nudging strategies have no (L to L or H to H) or only a moderate (L to ML/M or H to MH/M) influence on most participants, although a small portion of participants changed their data-sharing willingness level drastically (L to H or H to L). This observation is aligned with the results shown in Figure 1.
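To make the notions of profiles and cluster-migration rates concrete, the following sketch counts four-condition profiles and computes the migration rates from the NI clusters to the MI clusters. It is only an illustration: the eight participants and their cluster labels are hypothetical, and the use of pandas (`DataFrame.value_counts` and `pandas.crosstab`) is an assumption of this example rather than a description of our implementation.

```python
import pandas as pd

# Hypothetical cluster labels for 8 participants (not the survey data).
profiles = pd.DataFrame({
    "NI":  ["L", "L", "L", "L", "L", "H", "H", "H"],
    "MI":  ["L", "L", "M", "L", "H", "H", "M", "H"],
    "NMI": ["L", "ML", "MH", "L", "H", "H", "MH", "MH"],
    "PA":  ["L", "L", "H", "L", "H", "H", "H", "H"],
})

# Number of possible profiles given k = 2, 3, 4 and 2 for NI, MI, NMI and PA.
n_possible_profiles = 2 * 3 * 4 * 2

# Share of participants (in %) in each observed profile, most common first.
profile_share = profiles.value_counts(normalize=True) * 100

# Migration rates (in %) from each NI cluster to each MI cluster;
# normalize="index" makes every row sum to 100.
migration_mi = pd.crosstab(profiles["NI"], profiles["MI"], normalize="index") * 100

print(n_possible_profiles)
print(profile_share)
print(migration_mi)
```

The same cross-tabulation can be repeated with the NMI and PA columns to obtain the migration rates for the other two nudging conditions.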

Data-Sharing Nudges vs. Individual Behaviors.
The results discussed in the previous sub-subsection already give some indication of the lack of effectiveness of all three data-sharing nudging strategies, but that analysis is based on collective behaviors at the cluster level. In this sub-subsection we look at how the three nudging strategies influenced individual participants' self-reported data-sharing willingness levels, to get more direct evidence on their actual effectiveness. Let ∆_C(w) denote the difference between the data-sharing willingness level self-reported by a participant under a nudging condition C and that under the no-nudging condition. Table 7 shows the median and mean values of ∆_C(w) for all three nudging conditions, from which we can see a surprising result: all three nudging strategies actually led more participants to report a reduced willingness level compared with the no-nudging condition. The observed failure of all nudging strategies holds even for participants in NI_H.
One may argue that the direction of change in the value of w (i.e., no change, increased or decreased) matters more than how much it changed. Table 8 shows the percentages of participants who fall into these three categories. The results are indeed more revealing: for all nudging strategies, there are more participants with a decreased willingness level than with an increased level. This implies that, while all three nudging strategies can work for some participants, they failed for more participants, so the overall effect is a failure (with respect to their purpose of increasing the overall data-sharing willingness level). This result came as a surprise to us, and has a profound implication for whether and how service providers should use data-sharing nudges to solicit personal data from their customers.
While all three nudging strategies failed to work as a whole, the data in Tables 7 and 8 also show the order of the overall effect of the three nudging strategies: PA > MI > NMI. The fact that PA works better than MI and NMI implies that service providers should consider how to increase their customers' trust in their services rather than relying on monetary or non-monetary incentives.
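The two individual-level measures used in this sub-subsection, i.e., the per-participant change in willingness relative to the no-nudging condition and the direction of that change, can be computed as in the following sketch. The five participants and their willingness values are hypothetical and chosen only to keep the arithmetic easy to follow; the pandas-based implementation and the helper name `direction_shares` are assumptions of this example.

```python
import pandas as pd

# Hypothetical average willingness levels w for 5 participants
# under the four nudging conditions (not the survey data).
w = pd.DataFrame({
    "NI":  [2.0, 3.0, 4.0, 1.5, 3.5],
    "MI":  [2.0, 2.5, 4.5, 1.0, 3.0],
    "NMI": [1.5, 2.5, 4.0, 1.0, 3.0],
    "PA":  [2.5, 3.0, 4.0, 1.0, 3.5],
})

# Change under each nudging condition C relative to the no-nudging condition.
delta = w[["MI", "NMI", "PA"]].sub(w["NI"], axis=0)

# Median and mean change per nudging condition.
summary = delta.agg(["median", "mean"])


def direction_shares(d: pd.Series) -> pd.Series:
    """Percentages of participants with unchanged, increased and decreased w."""
    return pd.Series({
        "=": (d == 0).mean() * 100,
        "+": (d > 0).mean() * 100,
        "-": (d < 0).mean() * 100,
    })


shares = delta.apply(direction_shares)

print(summary)
print(shares)
```

In this toy example, a nudging condition would be judged to have an overall negative effect when the "-" row exceeds the "+" row, which is the criterion applied to the survey data above.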
Table 8. Percentages of participants with an unchanged (=), increased (+) and decreased (-) value of w under different nudging conditions (C). Columns: C; participants in NI_L; participants in NI_H; all participants.

Data Type vs. Data-Sharing Willingness.
All our previous analysis is based on the data-sharing willingness level averaged across all 23 different types of personal data. Figure 2 shows how the willingness level varies across those data types. Despite the visible variations, for all 23 data types and all four nudging conditions, participants in an L cluster always reported a lower average level of willingness than those in the corresponding H cluster. Under the MI condition, participants in the H cluster reported a lower average level of willingness than those in the M cluster for all data types except for one (N). Under the NMI condition, results are more mixed: differences are less clear among the ML, MH and H clusters for 10 data types (N, DoB, HA, EA, PhN, CCI, P, E, PaN, PP), showing that the boundaries between these clusters are not always clear-cut for these data types. As a whole, we felt that averaging across the 23 data types should still give largely reliable results, since the mixed results have a limited effect. However, we believe that more research is needed to further explore the data type-specific aspects. Looking at all the results, another observation stands out: across all four nudging conditions, participants in H clusters and those in non-H clusters behaved particularly differently for the following types of personal data: BAI, CAB, PaN, DLN, F, VS, FS/I, I/RP, SMPD, SSH. Compared with other data types, these types seem generally more sensitive, e.g., biometric features (F, VS, FS/I, I/RP), finance-related data (CCI, BAN), and more private data (PaN, DLN, SSH). SMPD seems a notable exception: many people have public profiles on social media, so it remains unclear why some participants had more concerns about this type of personal data. More research will be required to better understand this particular observation.

Step 3: Profile Prediction
In Section 4.1, we showed how an individual's profile can be predicted by training four classifiers, one for each nudging condition, and then used the profile predictors to help further validate the clustering results and determine the "best" k values for k-means clustering. We also showed that the performance of the classifiers is generally good (see Table 4). In this subsection we look at how the number of input features of the four classifiers can be reduced from 23 × 4 = 92 to a number more usable in real-world applications, so that a user needs to answer only a few questions (ideally just a handful) to let the system set up his/her privacy (data-sharing) preference profile.
To help determine the most important features for all four classifiers we built, we show all 23 features for each of the four trained classifiers in Figure 3, visualizing the significance level (in [0, 1]) of each feature, as given by the feature_importances_ attribute of the trained models in the scikit-learn library. From the results shown in Figure 3, we can see that for decision trees (with parameter min_samples_leaf = 2) only a very small number of features have a high significance level, so we can easily choose the most important features. For random forests (with parameter n_estimators = 1,000), however, it is more difficult to select significant variables, as their weights are more evenly distributed. The significance patterns of different nudging conditions are also markedly different, so we need to select the reduced feature subsets for the four classifiers separately.
When decision trees are used to build classifiers, we are able to select only 7 features out of the 92 features across the four classifiers. For the no-nudging condition, the significance of the willingness to share Voice data (significance level = 0.68) is much higher than that of any other feature, and thus only this feature is selected for profile prediction. Similarly, a significantly reduced feature set can be determined for the other three nudging conditions: (Email Address, Face Scan/Image) (significance levels = 0.33, 0.25) for monetary incentives, (Name, Education, Fingerprint) (significance levels = 0.29, 0.2, 0.23) for non-monetary incentives, and Activity Sensors (significance level = 0.41) for privacy assurances. Hence, for these experiments with the most significant variables only, when using the decision tree algorithm, only 7 out of the 23 variables were used across the four datasets to predict the class labels under the four nudging conditions. In the case of random forests, we simply selected the five most significant features for each nudging condition to reduce the total number of features to 20. The prediction accuracy of the classifiers built with the reduced number of features can be seen in Table 9. While the performance drops in general, the accuracy is very good for the no-nudging condition (over 96% for both decision trees and random forests) and still reasonably high for the other three conditions: ≥ 79% for decision trees and ≥ 88% for random forests. Note that in real-world applications, a relatively accurate prediction is often sufficient and better than what is currently being used (a default profile for all users, or multiple profiles from which the user has to choose manually).
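The feature reduction procedure described in this subsection can be sketched as follows: train a decision tree and a random forest, read the `feature_importances_` attribute, keep only the top-ranked features and re-evaluate a classifier on the reduced input. This is a minimal illustration on synthetic data rather than our experimental code; the labelling rule, the cut-offs (top 1 feature for the tree, top 5 for the forest) and `n_estimators = 200` (instead of 1,000, to keep the example fast) are assumptions of this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)

# Synthetic stand-in: 685 participants x 23 Likert items (1-5).
n_participants, n_features = 685, 23
X = rng.integers(1, 6, size=(n_participants, n_features)).astype(float)

# Hypothetical class labels driven mostly by feature 0, so that one
# feature dominates, similar in spirit to the no-nudging case.
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 0.3, n_participants) > 3.9).astype(int)

tree = DecisionTreeClassifier(min_samples_leaf=2, random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Rank features by importance (values are in [0, 1] and sum to 1).
tree_top = np.argsort(tree.feature_importances_)[::-1][:1]      # single dominant feature
forest_top = np.argsort(forest.feature_importances_)[::-1][:5]  # five most important features

# Re-evaluate a decision tree that only sees the selected feature(s).
reduced_tree = DecisionTreeClassifier(min_samples_leaf=2, random_state=0)
reduced_accuracy = cross_val_score(reduced_tree, X[:, tree_top], y, cv=5).mean()

print("tree top feature:", tree_top)
print("forest top-5 features:", forest_top)
print("accuracy with reduced features:", round(reduced_accuracy, 3))
```

In practice, the selected feature indices map to survey questions, so a reduced feature set translates directly into a shorter questionnaire for setting up a user's profile.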

Further Discussions
The results reported in the previous section lead to a number of interesting observations about privacy

what we observed in this research can be reproduced in other experiments and different contexts. Second, our results on automatic profile prediction show that it is possible to use four separate classifiers with a small number of features (as few as 7 in total) to capture the complicated user profiling problem, therefore potentially offering service providers a better tool to serve their customers with enhanced usability. In addition to the direct implications of our results for service providers mentioned above, our work can provide useful insights to more parties in the larger data economy ecosystem. For instance, many independent personal data management platforms (PDMPs)³ have been developed to give users more control over their own data and to empower users to trade data with service providers. Such PDMPs can use the user profiling method reported in this paper to engage and better serve users, including recommending services to users according to their privacy (data-sharing) preferences and special requirements on utility. In addition, new business opportunities can be generated around user-facing tools that build on the work reported in this paper to help individual users, e.g., a data-sharing awareness tool can help users better understand how they are sharing data with multiple entities online and in the physical world, and a service comparison tool can assist users in choosing the best service based on their personal privacy (data-sharing) preferences, including switching to physical services to avoid sharing more sensitive data online.

³ Some examples: Solid (https://solid.mit.edu/), HAT (https://www.hubofallthings.com/), Databox (https://www.databoxproject.uk/) and digi.me (https://digi.me/).

Furthermore, our work could help policy makers such as national data protection authorities, e.g., to define more granular regulations and guidelines for different business sectors and user groups.
Because many personal data are sensitive and privacy is a basic human right⁴, our work has important links to issues around ethics, transparency, trust and legal obligations of different parties in the data economy ecosystem. For instance, ignoring consumers' privacy wishes and constantly nudging them for more personal data is likely unethical if not illegal. The results reported in this paper re-confirmed previous findings that some people become more alert and less willing to share personal data if they see any data-sharing nudges, which could be explained by the lack of trust between some consumers and service providers. Such a lack of trust is often caused by the lack of transparency about service providers' data collection and processing practices.

⁴ As defined in the Universal Declaration of Human Rights (UDHR, https://www.un.org/en/universal-declaration-human-rights/) and human rights laws in many jurisdictions.

Service providers should therefore increase the level of transparency of their processes, and provide more diverse solutions to people with different privacy (data-sharing) preferences, which can help build a more privacy-friendly ecosystem. Having a more user-centric ecosystem will eventually benefit service providers, especially those who respect consumers more and engage them more actively. Our work can offer service providers tools towards such a direction.

Limitations
Since our analysis is based on data collected from an online survey, it can reflect only self-reported privacy attitudes rather than actual behaviors in the real world. As widely reported in past studies around the privacy paradox theory [9], self-reported willingness may not be consistent with actual data-sharing behaviors, which can be affected by various factors such as how information is presented and the context of the data sharing. We plan to investigate actual data-sharing behaviors in selected real-world scenarios and see whether and how behavioral profiles may change. This will also allow us to look at how people's privacy attitudes and behaviors evolve over time, e.g., via a longitudinal study.
As with all empirical studies, some biases may have been introduced in the data collection procedure and the experimental design. For instance, in the survey we conducted, for all questions the four nudging conditions were presented in one fixed order, i.e., "No Nudge" → "Monetary Incentive" → "Privacy Assurances" → "Non-Monetary Incentive". To identify and avoid such biases, we will consider re-validating our work under different experimental setups.
In addition, the panel survey we conducted may have attracted people in the two baseline clusters (NI_L and NI_H) unevenly, so the results on the overall population may be biased. Despite this uncertainty, aggregated results on both clusters in Section 4.2 showed that the effects of the three nudging strategies on the two NI clusters are largely aligned, differing by just a small margin, so we believe that our main conclusions should hold.
While we believe the data we collected are sufficiently representative and the main insights are reliable, there are some factors that may affect the generalizability of the reported results, especially at the lower level, e.g., which of the 48 profiles are more dominant. For instance, our pool of participants was limited to a single country (UK) and the number of participants (685) may not be large enough to allow examination of some profiles, especially those that are less common. In addition, our survey was put into a more realistic context of data sharing with online travel service providers, which may not capture different responses in other contexts. We will conduct further work to validate our reported results with participants from other countries.
Before performing the first step, i.e., the clustering, we reduced the dimensionality of the data using PCA. Since our datasets contain 23 dimensions (n = 23), PCA was helpful in reducing the dimensionality to 6 (n = 6), thus avoiding problems with the computation of distances between examples (samples) in the datasets: with the original high dimensionality (n = 23), all examples would be essentially far away from each other. Such a pre-processing step has been commonly used by researchers for clustering high-dimensional data [52][53][54][55][56][57]. PCA has the drawback of being global, so it cannot preserve pairwise distances between points in the original space. To address this problem, we can use a dimensionality reduction algorithm that preserves such distances better [58], e.g., t-SNE [59]. We plan to test such distance-preserving dimensionality reduction algorithms in our future work and compare the results with those reported in this paper with PCA.
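For illustration, the pre-processing described above (PCA from 23 to 6 dimensions, followed by k-means) can be sketched as follows. The data are synthetic stand-ins built around two hypothetical willingness levels, and the scikit-learn settings shown (`n_init`, `random_state`) are assumptions of this example rather than the settings used in our experiments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Synthetic stand-in: 685 participants x 23 Likert items (1-5), generated
# around two hypothetical willingness levels (low = 2, high = 4).
group = rng.integers(0, 2, size=685)
base = np.where(group[:, None] == 0, 2.0, 4.0)
X = np.clip(np.rint(base + rng.normal(0, 0.9, size=(685, 23))), 1, 5)

# Reduce the 23 dimensions to 6 principal components before clustering.
pca = PCA(n_components=6, random_state=0)
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()

# Cluster in the reduced space (k = 2, as for the NI and PA datasets).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)

# Cluster IDs are arbitrary, so compare with the generating groups up to relabelling.
agreement = max(np.mean(labels == group), np.mean(labels != group))

print(X_reduced.shape)
print("variance explained by 6 components:", round(explained, 3))
print("agreement with generating groups:", round(agreement, 3))
```

Swapping `PCA` for another dimensionality reduction method in this sketch would be one simple way to run the comparison with distance-preserving alternatives mentioned above.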
For clustering we tested three algorithms (k-means, HAC and DBSCAN), and for automatic profile prediction we tested three algorithms (decision trees, random forests and naïve Bayes). While the chosen algorithms gave good results, other algorithms may perform even better. We will explore these possibilities in future work.

Conclusions
This paper presents our work on utilizing both unsupervised and supervised machine learning algorithms to analyze 685 UK residents' self-reported willingness levels to share 23 types of personal data with online travel companies, under four different data-sharing nudging conditions (no nudge, monetary and non-monetary incentives, and privacy assurances). By applying a three-step data analysis process, we revealed more comprehensive user segmentation results when we consider how different types of data-sharing nudges influence people's data-sharing behaviors. We also showed that, using four classifiers and a small number of features (as few as seven), people's data-sharing behavioral profiles can be predicted with good accuracy. The results reported provide new insights on how people's privacy preferences interact with data-sharing nudges and how machine learning methods can be used to analyze and capture such interactions. They can find direct applications in many real-world scenarios, e.g., online booking for flights and hotels, OSN-based communication, and web forum discussions, to help better balance people's wishes for privacy protection and service providers' desires to provide more personalized services.
Acknowledgement. This

Questions we used to capture the level of willingness to disclose personal information with online travel companies. Each question is followed by a list of 23 data types and a 5-point Likert scale.

Nudging Condition | Items
No Nudge | "How willing are you to share the following information with online travel companies?"
Monetary Incentives | "Should you receive monetary incentives (i.e., cash), how willing are you to share the following information with online travel companies?"
Non-Monetary Incentives | "Should you receive non-cash incentives (such as coupons and discounts), how willing are you to share the following information with online travel companies?"
Privacy Assurances | "If the online company is providing privacy assurances (such as an easy to read privacy policy) about the protection of your personal data, how willing are you to share the following information with online travel companies?"