Privacy Preserving Large-Scale Rating Data Publishing

Large scale rating data usually contains ratings of both sensitive and non-sensitive issues, and the ratings of sensitive issues are part of personal privacy. Even when survey participants do not reveal any of their ratings, their survey records are potentially identifiable using information from other public sources. In order to protect privacy in large-scale rating data, it is important to propose new privacy principles that consider the properties of rating data. Moreover, given a privacy principle, how to efficiently determine whether the rating data satisfies the required privacy principle is crucial as well. Furthermore, if the privacy principle is not satisfied, an efficient method is needed to securely publish the large-scale rating data. In this paper, all of these problems are addressed.


Introduction
The problem of privacy-preserving data publishing has received a lot of attention in recent years. Privacy preservation on relational data has been studied extensively. A major category of privacy attacks on relational data is to re-identify individuals by joining a published table containing sensitive information with some external tables. Most of the existing work can be formulated in the following context: several organizations, such as hospitals, publish detailed data (called microdata) about individuals (e.g. medical records) for research or statistical purposes [22,23,28,32].
Privacy risks of publishing microdata are well-known. Famous attacks include de-anonymisation of the Massachusetts hospital discharge database by joining it with a public voter database [32] and privacy breaches caused by AOL search data [16]. Even if identifiers such as names and social security numbers have been removed, the adversary can use linking [32], homogeneity and background attacks [23] to re-identify individuals. Table 1 gives an example: (a) a published survey rating data set containing ratings of survey participants on both sensitive and non-sensitive issues; (b) public comments on some non-sensitive issues by some participants of the survey. By matching the ratings on non-sensitive issues with publicly available preferences, t1 is linked to Alice, and her sensitive rating is revealed.
Most of the existing studies are incapable of handling rating data, since survey rating data normally does not have a fixed set of personal identifiable attributes as relational data does, and it is characterized by high dimensionality and sparseness. Survey rating data shares a similar format with transactional data. Privacy preserving research on transactional data has recently been acknowledged as an important problem in the data mining literature [14,37].

Motivation
On October 2, 2006, Netflix, the world's largest online DVD rental service, announced the $1-million Netflix Prize to improve their movie recommendation service [15]. To aid contestants, Netflix publicly released a data set containing 100,480,507 movie ratings, created by 480,189 Netflix subscribers between December 1999 and December 2005. Narayanan and Shmatikov showed in their recent work [24] that an attacker needs only a little information to identify the anonymized movie rating transaction of an individual. They re-identified Netflix movie ratings using the Internet Movie Database (IMDb) as a source of auxiliary information and successfully identified the Netflix records of known users, uncovering their political preferences and other potentially sensitive information.
We consider the privacy risk in publishing anonymous survey rating data. For example, in a lifestyle survey, ratings on some issues are non-sensitive, such as the liking of the book "Harry Potter", the movie "Star Wars" and the food "Sushi". Ratings on some issues are sensitive, such as income level and sexual frequency. Assume that each survey participant is cautious about his/her privacy and does not reveal his/her ratings. However, it is easy to find his/her preferences on non-sensitive issues from publicly available information sources, such as personal weblogs or social networks. An attacker can use these preferences to re-identify an individual in the anonymous published survey rating data and consequently find the sensitive ratings of a victim.
Based on the public preferences, a person's ratings on sensitive issues may be revealed in a supposedly anonymized survey rating data set. An example is given in Table 1. In a social network, people make comments on various issues, which are not considered sensitive. Some comments can be summarized as in Table 1(b). People rate many issues in a survey; some issues are non-sensitive while some are sensitive. We assume that people are aware of their privacy and do not reveal their ratings, either non-sensitive or sensitive ones. However, individuals in the anonymized survey rating data are potentially identifiable based on their public comments from other sources. For example, Alice is at risk of being identified: since the attacker knows that Alice's preference on issue 1 is 'excellent', by cross-checking Table 1(a) and (b), s/he will deduce that t1 in Table 1(a) is linked to Alice, and the sensitive rating on issue 4 of Alice will be disclosed. This example motivates the following research question (Satisfaction Problem): given a large scale rating data set T with privacy requirements, how can we efficiently determine whether T satisfies the given privacy requirements?
Although the satisfaction problem is easy and straightforward to decide in relational databases, it is nontrivial for a large scale rating data set. Research on privacy protection originated in relational databases, in which several state-of-the-art privacy paradigms [22,23,32] were proposed and many greedy or heuristic algorithms [12,19,20,28] were developed to enforce the privacy principles. In a relational database, taking k-anonymity as an example [26,32], each record is required to be identical with at least k − 1 others with respect to a set of quasi-identifier attributes. Given an integer k and a relational data set T, it is easy to determine whether T satisfies the k-anonymity requirement, since equality has the transitive property: whenever a transaction a is identical with b, and b is in turn indistinguishable from c, then a is the same as c. With this property, each transaction in T only needs to be checked once, and the time complexity is at most O(n²d), where n is the number of transactions in T and d is the size of the quasi-identifier attribute set. So the satisfaction problem is trivial for relational data sets. The situation is different for large rating data. First, survey rating data normally does not have a fixed set of personal identifiable attributes as relational data does. In addition, survey rating data is characterized by high dimensionality and sparseness. The lack of a clear set of personal identifiable attributes, together with this high dimensionality and sparseness, makes the determination of the satisfaction problem challenging. Second, the dissimilarity distance defined between two transactions (ε-proximity) does not possess the transitive property: when a transaction a is ε-proximate with b, and b is ε-proximate with c, usually a is not ε-proximate with c. Each transaction in T has to be checked as many as n times in the extreme case, which makes it highly inefficient to determine the satisfaction problem. This calls for a smarter technique to efficiently determine the satisfaction problem before anonymizing the survey rating data. To the best of our knowledge, this research is the first to address the satisfaction of privacy requirements in survey rating data.
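The contrast between transitive equality and non-transitive ε-proximity can be sketched in a few lines. This is a toy illustration, not code from the paper; the rating vectors are made up:

```python
# Equality is transitive, so a k-anonymity check needs only one pass,
# but epsilon-proximity is not transitive, as this example shows.

def eps_proximate(a, b, eps):
    """Two rating vectors are eps-proximate if every coordinate
    differs by at most eps (null handling omitted in this toy version)."""
    return all(abs(x - y) <= eps for x, y in zip(a, b))

a, b, c = [1, 1], [2, 2], [3, 3]
eps = 1
print(eps_proximate(a, b, eps))  # True
print(eps_proximate(b, c, eps))  # True
print(eps_proximate(a, c, eps))  # False: proximity does not chain
```

Because proximity does not chain, a transaction's neighborhood cannot be inferred from a previously examined one, which is exactly why naive checking degrades to many pairwise comparisons.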
How to preserve individuals' privacy in a large scale rating data set?
Though several models and algorithms have been proposed to preserve privacy in relational data, most of the existing studies can deal with relational data only [22,23,31]. Divide-and-conquer methods are applied to anonymize relational data sets due to the fact that tuples in a relational data set are separable during anonymisation. In other words, anonymizing a group of tuples does not affect other tuples in the data set. However, anonymizing a survey rating data set is much more difficult, since changing one record may cause a domino effect on the neighborhoods of other records, as well as affect the properties of the whole data set. Hence, previous methods cannot be applied to deal with survey rating data, and it is much more challenging to devise anonymisation methods for large scale rating data than for relational data.

Related work
Privacy preserving data publishing has received considerable attention in recent years, especially in the context of relational data [1,7,12,18-20,23,25,36]. All these works assume a given set of attributes QID on which an individual is identified, and anonymize data records on the QID. Their main differences consist in the selected privacy model and in the various approaches employed to anonymize the data. The author of [1] presents a study on the relationship between the dimensionality of the QID and information loss, and concludes that, as the dimensionality of the QID increases, information loss increases quickly. Transactional databases present exactly the worst case scenario for existing anonymisation approaches because of the high dimension of the QID. To the best of our knowledge, all existing solutions in the context of k-anonymity [26,27], l-diversity [23] and t-closeness [22] assume a relational table, which typically has a low dimensional QID.
There is little previous work considering the privacy of large rating data. In collaboration with the MovieLens recommendation service, [11] correlated public mentions of movies in the MovieLens discussion forum with the users' movie rating histories in the internal MovieLens data set. A recent study reveals a new type of attack on anonymized transactional data [24]: movie rating data supposed to be anonymized is re-identified by linking non-anonymized data from another source. No solution exists for high dimensional large scale rating databases.
Privacy-preservation of transactional data has been acknowledged as an important problem in the data mining literature. There is a family of literature [5,6] addressing the privacy threats caused by publishing data mining results such as frequent item sets and association rules. Existing works on this topic [4,34] focus on publishing patterns: the patterns are mined from the original data, and the resulting set of rules is sanitized to prevent privacy breaches. In contrast, our work addresses the privacy threats caused by publishing data for data mining. As discussed above, we do not assume that the data publisher can perform data mining tasks, and we assume that the data must be made available to the recipient. The two scenarios have different assumptions on the capability of the data publisher and the information requirement of the data recipient. Recent work on this topic [14,37] focuses on high dimensional transaction data, while our focus is preventing linking individuals to their ratings. This paper is also related to work on anonymizing social networks [8], and large scale rating data can be considered as a special case of the complex social network. A social network is a graph in which a node represents a social entity (e.g., a person) and an edge represents a relationship between the social entities. Although the data is very different from transaction data, the model of attacks is similar to ours: an attacker constructs a small subgraph connected to a target individual and then matches the subgraph to the whole social network, attempting to re-identify the target individual's node, and therefore, other unknown connections to the node. [8] demonstrates the severity of privacy threats in today's social networks, but does not provide a solution to prevent such attacks.

Privacy models
The auxiliary information of an attacker includes: (i) the knowledge that a victim is in the survey rating data; (ii) preferences of the victim on some non-sensitive issues. The attacker wants to find the ratings on sensitive issues of the victim.
In practice, knowledge of types (i) and (ii) can be gleaned from an external database [24]. For example, in the context of Table 1(b), an external database may be the IMDb. By examining the anonymous data set in Table 1(a), the adversary can identify a small number of candidate groups that contain the record of the victim. In the unfortunate scenario, there is only one record in the candidate group. For example, since t1 is unique in Table 1(a), Alice is at risk of being identified. If the candidate group contains not only the victim but also other records, an adversary may still use this group to infer the sensitive value of the victim. For example, although it is difficult to identify whether t2 or t3 in Table 1(a) belongs to Bob, since both records have the same sensitive value, Bob's private information is identified.
Intuitively, in order to avoid such attacks, a two-step protection model is needed. The first step is to protect an individual's identity: to make sure that in the released data set, every transaction is "similar" to at least (k − 1) other records based on the non-sensitive ratings, so that no survey participant is identifiable. For example, t1 in Table 1(a) is unique, and based on the preference of Alice in Table 1(b), her sensitive issues can be re-identified in the supposedly anonymized data set. Jack's sensitive issues, on the other hand, are much safer, since t4 and t5 in Table 1(a) form a similar group based on their non-sensitive ratings.
The second step is to prevent the sensitive rating from being inferred in an anonymized data set. The idea is to require that the sensitive ratings in a similar group be diverse. For example, although t2 and t3 in Table 1(a) form a similar group based on their non-sensitive ratings, their sensitive ratings are identical. Therefore, an attacker can immediately infer Bob's preference on the sensitive issue without identifying which transaction belongs to Bob. In contrast, Jack's preference on the sensitive issue is much safer than both Alice's and Bob's.
In our previous work, two privacy models have been proposed. The first one is the (k, ε)-anonymity model, which targets protecting an individual's identity; the second is the (k, ε, l)-anonymity model, which protects not only an individual's identity, but also the personal sensitive information. In the next section, these two models are discussed.

(k, ε)-anonymity
Let o^A = (o^A_1, ..., o^A_m) be the non-sensitive ratings of a survey participant A and o^B = (o^B_1, ..., o^B_m) those of a participant B. We define the dissimilarity between two non-sensitive ratings as follows, where r denotes the maximum rating on the scale: Dis(o^A_i, o^B_i) = |o^A_i − o^B_i| if both ratings are non-null; r if exactly one of them is null; and 0 if both are null.
Definition 1 (ε-proximate). Given a survey rating data set T with a small positive number ε, two transactions T_A, T_B ∈ T are ε-proximate if, for every non-sensitive issue i, Dis(o^A_i, o^B_i) ≤ ε. We say T is ε-proximate if every two transactions in T are ε-proximate.
If two transactions are ε-proximate, the dissimilarity between their non-sensitive ratings is bounded by ε. In our running example, suppose ε = 1; ratings 5 and 6 may have no difference in interpretation, so t4 and t5 in Table 1(a) are 1-proximate based on their non-sensitive ratings. If a group of transactions is ε-proximate, then the dissimilarity between each pair of their non-sensitive ratings is bounded by ε. For example, if T = {t1, t2, t3}, then it is easy to verify that T is 5-proximate.
Definition 2 ((k, ε)-anonymity). A survey rating data set T is said to be (k, ε)-anonymous if every transaction is ε-proximate with at least (k − 1) other transactions. The transaction t ∈ T together with all the other transactions that are ε-proximate with t form a (k, ε)-anonymous group.
For instance, there are two (2,5)-anonymous groups in Table 1(a). The first is formed by {t1, t2, t3} and the second by {t4, t5}. The idea behind this privacy principle is to make the non-sensitive ratings of each transaction similar to those of other transactions in order to avoid linking to a personal identity. (k, ε)-anonymity preserves identity privacy well: it guarantees that no individual is identifiable with probability greater than 1/k. Both parameters k and ε are intuitive and operable in real-world applications. The parameter ε captures the protection range of each identity, whereas the parameter k lowers an adversary's chance of beating that protection. The larger k and ε are, the better protection they provide. Although the (k, ε)-anonymity privacy principle can protect people's identities, it fails to protect individuals' private information. Consider one (k, ε)-anonymous group: if the transactions of the group have the same rating on a number of sensitive issues, an attacker can learn the preference on the sensitive issues of each individual without knowing which transaction belongs to whom. For example, in Table 1(a), t2 and t3 are in a (2, 1)-anonymous group, but they have the same rating on the sensitive issue, and thus Bob's private information is breached.
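Definition 2 can be checked by brute force. The following is a hedged sketch of such a check over a toy table; the data values and the null-penalty constant R are illustrative assumptions, not the paper's data:

```python
# Naive (k, eps)-anonymity check over non-sensitive ratings only.
# A sketch under the paper's definitions; the table values are made up.

R = 5  # rating scale upper bound, used as the null-vs-value penalty

def dis(x, y):
    """Dissimilarity of two non-sensitive ratings (null encoded as None):
    |x - y| if both present, R if exactly one is null, 0 if both null."""
    if x is None and y is None:
        return 0
    if x is None or y is None:
        return R
    return abs(x - y)

def eps_proximate(ta, tb, eps):
    return all(dis(x, y) <= eps for x, y in zip(ta, tb))

def is_k_eps_anonymous(T, k, eps):
    """Every transaction must be eps-proximate to at least k-1 others."""
    for ta in T:
        close = sum(1 for tb in T if tb is not ta and eps_proximate(ta, tb, eps))
        if close < k - 1:
            return False
    return True

T = [[5, None, 1], [6, None, 1], [5, None, 2], [1, 4, None]]
print(is_k_eps_anonymous(T, 2, 1))  # the lone [1, 4, None] fails -> False
```

This quadratic pairwise scan is exactly the inefficiency that the slicing technique of the later sections is designed to avoid.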

(k, ε, l)-anonymity
This example illustrates the limitation of the (k, ε)-anonymity model. To mitigate the limitation, we require more diversity of sensitive ratings in the anonymous groups. In the following, we define the distance between two sensitive ratings, which leads to the metric for measuring the diversity of sensitive ratings in the anonymous groups.
First, we define the dissimilarity between two sensitive rating scores as follows: Dis(s^A_i, s^B_i) = |s^A_i − s^B_i| if both ratings are non-null, and r if at least one of them is null.
Note that there is only one difference between the dissimilarity of sensitive ratings Dis(s^A_i, s^B_i) and the dissimilarity of non-sensitive ratings Dis(o^A_i, o^B_i): in the definition of Dis(o^A_i, o^B_i), null − null = 0, while in the definition of Dis(s^A_i, s^B_i), null − null = r. This is because, for sensitive issues, two null ratings mean that an attacker gets no information from the two survey participants, and hence they are good for the diversity of the group.
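The two dissimilarity variants differ only in the null-null case, which a minimal sketch makes concrete (null encoded as None, and r set to a hypothetical scale maximum of 5):

```python
# Sketch of the two dissimilarity variants described above.
# Only the null-null case differs between them.

r = 5  # assumed top of the rating scale

def dis_nonsensitive(x, y):
    if x is None and y is None:
        return 0          # two nulls reveal nothing and count as identical
    if x is None or y is None:
        return r
    return abs(x - y)

def dis_sensitive(x, y):
    if x is None or y is None:
        return r          # any null, including null-null, counts as maximally diverse
    return abs(x - y)

print(dis_nonsensitive(None, None))  # 0
print(dis_sensitive(None, None))     # 5
```

The asymmetric null-null treatment rewards groups in which some participants declined to rate the sensitive issue, since those nulls genuinely increase the attacker's uncertainty.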
Next, we introduce the metric to measure the diversity of sensitive ratings. For a sensitive issue s, let the vector of ratings of the group be [s1, s2, ..., sg], where si ∈ {1 : r, null}. The mean of the ratings is defined as s̄ = (1/Q) Σ_{si ≠ null} si, where Q is the number of non-null values and si ± null = si. The standard deviation of the ratings is then defined as σ = sqrt((1/Q) Σ_{si ≠ null} (si − s̄)²). For instance, in Table 1(a), for sensitive issue 4, the mean of the ratings is (6 + 1 + 1 + 1 + 5)/5 = 2.8 and the standard deviation of the ratings is 2.23 according to Equation (3).
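The diversity metric can be reproduced in a few lines; the sketch below recomputes the running example's mean 2.8 and standard deviation 2.23 (a population deviation over the non-null ratings, matching Equation (3)):

```python
# Diversity metric sketch: mean and (population) standard deviation over
# the non-null sensitive ratings, reproducing the running example.

import math

def rating_std(scores):
    vals = [s for s in scores if s is not None]
    q = len(vals)                     # Q: number of non-null ratings
    mean = sum(vals) / q
    var = sum((s - mean) ** 2 for s in vals) / q
    return mean, math.sqrt(var)

mean, sd = rating_std([6, 1, 1, 1, 5])
print(round(mean, 2), round(sd, 2))  # 2.8 2.23
```

Note that the divisor is Q, the non-null count, rather than g, so null ratings dilute neither the mean nor the deviation.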

Definition 3 ((k, ε, l)-anonymity).
A survey rating data set is said to be (k, ε, l)-anonymous if and only if the standard deviation of the ratings for each sensitive issue is at least l in each (k, ε)-anonymous group.
Still consider Table 1(a) as an example: t4 and t5 are 1-proximate with a standard deviation of 2. If we set k = 2, l = 2, then this group satisfies the (2,1,2)-anonymity requirement. The (k, ε, l)-anonymity requirement enforces sufficient diversity of sensitive issues in T, and therefore it can prevent inference from the (k, ε)-anonymous groups to a sensitive issue with high probability. The following theorem gives the upper bound of the parameter l in the (k, ε, l)-anonymity model. The proof of the following theorem can be found in [30].
Theorem 1. Let S be the set of ratings of a sensitive issue of T. Let S_min and S_max be the minimum and maximum ratings in S; then the maximum standard deviation of S is (S_max − S_min)/2.
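Theorem 1's bound can be checked numerically by brute force over small rating vectors on a 1-5 scale. This is an illustrative sanity check, not the proof in [30]:

```python
# Brute-force check of Theorem 1 on a 1..5 scale: the standard deviation
# is maximized by splitting ratings evenly between the extremes, giving
# (S_max - S_min) / 2 = 2.0 here.

import itertools
import math

def pstd(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

best = max(pstd(c) for c in itertools.product(range(1, 6), repeat=4))
print(best)  # 2.0, achieved e.g. by (1, 1, 5, 5)
```

This is why l = 2 is the strongest meaningful diversity requirement for 5-star data in the experiments.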

Validating privacy requirements
In this section, we formulate the satisfaction problem and develop a slicing technique to solve it.
Problem 1 (Satisfaction Problem). Given a survey rating data set T and privacy requirements k, ε, l, the satisfaction problem of (k, ε, l)-anonymity is to decide whether T satisfies the k, ε, l privacy requirements.
The satisfaction problem is to determine whether the user's given privacy requirement is satisfied by the given data set. It is a very important step before anonymizing the survey rating data. If the data set already meets the requirements, it is not necessary to make any modifications before publishing. In the following, we propose a novel slicing technique to solve the satisfaction problem.

Search by slicing
The slicing technique is proposed to efficiently search for the neighbors within distance ε in high dimensions. As we shall see, the complexity of the proposed algorithm grows very slowly with dimension for small ε. We illustrate the proposed slicing technique using a simple example in 3-D space, as shown in Figure 1. Given t = (t1, t2, t3) ∈ T, our goal is to slice out a set of transactions T′ (with t ∈ T′) that are ε-proximate. Our approach is first to find the ε-proximate of t, which is the set of transactions that lie inside a cube C_t of side 2ε centered at t. Since ε is typically small, the number of points inside the cube is also small. The ε-proximate of C_t can then be found by an exhaustive comparison within the ε-proximate of t. If there are no transactions inside the cube C_t, we know that the ε-proximate of t is empty, and so is the ε-proximate of the set C_t.
The transactions within the cube can be found as follows. First we find the transactions that are sandwiched between a pair of parallel planes X1, X2 (see Figure 1) and add them to a candidate set. The planes are perpendicular to the first axis of the coordinate frame and are located on either side of the transaction t at a distance of ε. Next, we trim the candidate set by disregarding transactions that are not also sandwiched between the parallel pair Y1 and Y2, which are perpendicular to X1 and X2, again located on either side of t at a distance of ε. This procedure is repeated for Z1 and Z2, at the end of which the candidate set contains only transactions within the cube of side 2ε centered at t.
Since the number of transactions in the final ε-proximate is typically small, the cost of the exhaustive comparison is negligible. The major computational cost in the slicing process therefore occurs in constructing and trimming the candidate set.
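The plane-by-plane trimming can be sketched as follows; this is a simplified version that ignores null ratings, and the transactions are made up:

```python
# Sketch of the slicing search: build the candidate set for the cube of
# side 2*eps around t by filtering one coordinate (one "slice") at a time.

def cube_candidates(T, t, eps):
    candidates = list(T)
    for axis in range(len(t)):
        # keep transactions sandwiched between the two parallel planes
        # at distance eps on either side of t along this axis
        candidates = [u for u in candidates if abs(u[axis] - t[axis]) <= eps]
        if not candidates:        # early exit: the cube is empty
            break
    return candidates

T = [(1, 2, 3), (2, 2, 2), (5, 5, 5), (1, 1, 4)]
print(cube_candidates(T, (1, 2, 3), 1))  # [(1, 2, 3), (2, 2, 2), (1, 1, 4)]
```

Each pass shrinks the candidate set, so later axes are filtered over far fewer transactions than the first, which is why the cost grows slowly with dimension for small ε.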

Anonymous survey rating data
In this section, we describe our modification strategies through the graphical representation of the (k, ε)-anonymity model. Given a survey rating data set T = {t1, t2, ..., tn}, its graphical representation is the graph G = (V, E), where V is a set of nodes, each node in V corresponding to a record ti (i = 1, 2, ..., n) in T, and E is the set of edges, where two nodes are connected by an edge if and only if the distance between the two records is bounded by ε with respect to the non-sensitive ratings.
Two nodes ti and tj are called connected if G contains a path from ti to tj (1 ≤ i, j ≤ n). The graph G is called connected if every pair of distinct nodes in the graph can be connected through some path. A connected component is a maximal connected subgraph of G; each node belongs to exactly one connected component. We say G is a clique if every pair of distinct nodes is connected by an edge. A k-clique is a clique with at least k nodes. A maximal k-clique is a k-clique that is not a subset of any other k-clique. We say a connected component is k-decomposable if it can be decomposed into several k-cliques, and the graph is k-decomposable if all its connected components are k-decomposable.

Theorem 2. Given the survey rating data set T with its graphical representation G, T is (k, ε)-anonymous if and only if G is k-decomposable.
The proof of Theorem 2 can be found in [29]. For instance, the survey rating data shown in Table 2 is (2, 2)-anonymous since its graphical representation (Figure 2(a)) is 2-decomposable. With Theorem 2, to make the rating data satisfy the privacy requirement, it only needs to make its graphical representation k-decomposable.
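Building the graphical representation and its connected components can be sketched as below; the records are illustrative and null handling is omitted:

```python
# Sketch of the graphical representation: nodes are records, with an edge
# whenever the non-sensitive distance is within eps; connected components
# are then found by a simple depth-first traversal.

def build_graph(T, eps):
    n = len(T)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if all(abs(a - b) <= eps for a, b in zip(T[i], T[j])):
                adj[i].add(j)
                adj[j].add(i)
    return adj

def components(adj):
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

T = [(1, 1), (2, 1), (5, 5)]
print(components(build_graph(T, 1)))  # [{0, 1}, {2}]
```

Testing k-decomposability itself then reduces to examining each component; the singleton component {2} here immediately violates any requirement with k ≥ 2.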
Let t = (t1, t2, ..., tn) be the ratings for some issue from n survey participants with the privacy requirement ε. We assume that some ratings in t are not bounded by ε, and our aim is to modify t so that every pair of ratings is bounded by ε while minimizing the distortion. The idea of the approach is as follows. Order all ratings for the issue t, and find the minimum rating Min and maximum rating Max. Find all intervals of size ε between Min and Max. Change the ratings that do not fit in an interval such that the distortion is minimized. In the case of several candidates with the same minimum distortion, randomly pick one of them as the anonymization. The process is described in Algorithm 1.
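A hedged sketch of the interval step for a single issue: candidate intervals start at observed ratings (a simplifying assumption), every rating is clamped into the interval, and the clamping with the least total distortion wins. The L1 distortion measure and the sample ratings are illustrative choices, not necessarily the paper's exact metric:

```python
# Interval-based anonymization sketch for one issue: try candidate
# intervals of width eps, clamp every rating into the interval, and keep
# the clamping with the least total distortion (L1 distance here).

def anonymize_issue(ratings, eps):
    best, best_cost = None, float("inf")
    for a in sorted(set(ratings)):      # candidate interval left endpoints
        b = a + eps
        clamped = [min(max(r, a), b) for r in ratings]
        cost = sum(abs(c - r) for c, r in zip(clamped, ratings))
        if cost < best_cost:
            best, best_cost = clamped, cost
    return best, best_cost

print(anonymize_issue([5, 6, 1, 1], 1))  # ([2, 2, 1, 1], 7)
```

After clamping, every pair of ratings for the issue lies within ε of each other, which is exactly the pairwise condition the modified group must satisfy.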

Table 4. Example of the anonymization algorithm
Let us take Table 1(a) as an example with k = 2, ε = 1. There are two groups HG1 = {t1, t2, t3, t4} and HG2 = {t5, t6}. HG2 already satisfies the privacy requirement, but HG1 does not. The anonymization of HG1 is shown in Table 5, in which the vector in bold is the anonymisation we choose.

[Table fragment: each row lists a candidate interval, its anonymization (e.g., (5,6,5,5)) and its distortion (e.g., 4); the remaining rows were lost in extraction.]

Suppose T contains n transactions and each transaction ti ∈ T contains m issues. The computation cost consists of three parts: sorting, finding intervals and computing distortion. The complexity of the sorting is O(mn log n). During the next phase of the algorithm, for each attribute, we find Min and Max and all the possible intervals of size ε, which incurs O(2(n − 1)) overhead, and the cost of the comparisons to find the interval with the least distortion is O(n). So the total complexity over all attributes in this phase is O(mn). The last phase, comparing the original and anonymous data sets to estimate the distortion, costs O(mn). The computational complexity of this approach is therefore O(mn log n + mn), i.e., O(mn log n).

Experimental study
In this section, we experimentally evaluate the efficiency of the proposed slicing algorithm and the proposed anonymization algorithm. Our objectives are two-fold. First, we verify that our slicing algorithm is fast and scalable for the satisfaction problem. Second, we show that the slicing technique is not only time efficient, but also space efficient compared with the heuristic pairwise algorithm.

Data sets
Our experimentation deploys two real-world databases: the MovieLens and Netflix data sets. The MovieLens data set was made available by the GroupLens Research Project at the University of Minnesota. The data set contains 100,000 ratings (5-star scale), 943 users and 1682 movies. Each user has rated at least 20 movies. The Netflix data set was released by Netflix for its competition. The movie rating files contain over 100,480,507 ratings from 480,189 randomly-chosen, anonymous Netflix customers over 17 thousand movie titles. The data were collected between October 1998 and December 2005 and reflect the distribution of all ratings received during this period. The ratings are on a scale from 1 to 5 (integral) stars. In both data sets, a user is considered an object while a movie is regarded as an attribute, and many entries are empty since a user rated only a small number of movies. Besides movie ratings, users' ratings on some simple demographic information (e.g., age range) are also included. In our experiments, we treat the users' ratings on movies as non-sensitive issues and ratings on the others as sensitive ones.
We first evaluate the running time by varying the percentage of data from 10% to 100%. For both data sets, we evaluate the running time for the (k, ε, l)-anonymity model with the default setting k = 20, ε = 1, l = 2. For both testing data sets, the execution time for (k, ε, l)-anonymity increases with the data percentage. This is because as the percentage of data increases, the computation cost increases too. The result is expected, since the overhead increases with more dimensions. Next, we evaluate how the parameters affect the cost of computing. The data used for this set of experiments are the whole MovieLens and Netflix data sets, and we evaluate by varying the values of ε, k and l. With k = 20, l = 2, Figure 3(b) shows the computational cost as a function of ε, in determining the (k, ε, l)-anonymity requirement for both data sets. Interestingly, in both data sets, as ε increases, the cost initially becomes lower but then increases monotonically. This phenomenon is due to a pair of contradicting factors that push the running time up and down, respectively. At the initial stage, when ε is small, more computation effort is put into finding the ε-proximate of each transaction, but less is used in the exhaustive search for the proper ε-proximate neighborhood, and this explains the initial descent of the overall cost. On the other hand, as ε grows, there are fewer possible ε-proximate neighborhoods, thus reducing the searching time for this part, but the number of transactions in the ε-proximate neighborhood increases, which results in a huge exhaustive search for the proper ε-proximate neighborhood, and this causes the eventual cost increase. Setting ε = 2, Figure 4(a) displays the running time obtained by varying k from 10 to 60 for both data sets. The cost drops as k grows. This is expected, because fewer searches for proper ε-proximate neighborhoods are needed for a greater k, allowing our algorithm to terminate earlier. We also ran the experiment varying the parameter l, and the results are shown in Figure 4(b). Since the ratings of both data sets are between 1 and 5, according to Theorem 1, 2 is already the largest possible l. When l = 0, there is no diversity requirement among the sensitive issues, and the (k, ε, l)-anonymity model reduces to the (k, ε)-anonymity model. As we can see, the running time increases with l, because more computation is needed in order to enforce stronger privacy control.
In addition to showing the scalability and efficiency of the slicing algorithm itself, we also compared the slicing algorithm (Slicing) with the heuristic pairwise algorithm (Pairwise), which works by computing all the pairwise distances to construct the dissimilarity matrix and identify violations of the privacy requirements. We implemented both algorithms and studied the impact on the execution time of the data percentage and the values of ε, k and l. From the graphs we can see that the slicing algorithm is far more efficient than the heuristic pairwise algorithm, especially when the volume of the data becomes larger. This is because, when the dimension of the data increases, the disadvantage of the heuristic pairwise algorithm, which computes all the dissimilarity distances, dominates most of the execution time. On the other hand, the smarter grouping technique used in the slicing process incurs less computation cost for the slicing algorithm. A similar trend is shown in Figure 5.

Data Utility
Having verified the efficiency of the slicing technique, we proceed to test its effectiveness. We measure the utility by the distortion metric defined in previous sections. Generally speaking, the more the distortion, the less useful the anonymized data.
We first study the influence of ε (i.e., the length of a proximate neighborhood) on data utility. Towards this, we set k to 40. Concerning (40, ε)-anonymity, Figure 7(a) plots the information loss on both data sets as a function of ε. The anonymization algorithm incurs less distortion as ε increases. This is expected, since a smaller ε demands stricter privacy preservation, which reduces data utility. When ε = 5, no anonymization is required, and therefore the information loss reaches 0. Next, we examine the utility of the (k, 2)-anonymous solution with different k. The results are averaged across 50 trials. For comparison, we also include the accuracy of a classifier trained on the (not anonymized) original data. From the graph, we can see that the average prediction accuracy is around 75%, very close to the original accuracy, which preserves good utility for data mining purposes. Similar results are obtained using the Netflix rating data in Figure 8(b).

Conclusion and future work
We have studied the problem of protecting sensitive ratings of individuals in large scale rating data. Such privacy risk has emerged in a recent study on the de-identification of published movie rating data. We proposed a novel (k, ε, l)-anonymity privacy principle for protecting privacy in such survey rating data. We theoretically investigated the properties of this model, and studied the satisfaction problem, which is to decide whether a survey rating data set satisfies the privacy requirements given by the user. A greedy anonymization algorithm has been proposed to anonymize large scale rating data. Extensive experiments confirm that our technique produces anonymized data sets that are useful.
This work also initiates future investigations of approaches to anonymizing survey rating data. Traditional approaches to anonymizing either relational data sets or transactional data sets are based on generalization or suppression, and the published data set has the same number of records but with some fields modified to meet the privacy requirements. As shown in the literature, this kind of anonymization problem is normally NP-hard, and several algorithms have been devised within this framework to minimize certain pre-defined cost metrics. Inspired by the research in this paper, the satisfaction problem can be further used to develop a different method of anonymizing the data set. The idea follows directly from the result of the satisfaction problem: if the rating data set already satisfies the privacy requirement, it is not necessary to do any anonymization before publishing it. Otherwise, we anonymize the data set by deleting some of the records to make it meet the privacy requirement. The criteria used during deletion can vary (for example, minimizing the number of deleted records) to keep the data as useful as possible for data mining or other research purposes. We believe that this new anonymization method is flexible in the choice of privacy parameters and efficient in execution for practical usage.

Figure 1 .
Figure 1. The slicing technique finds a set of transactions C_t inside a cube of side 2ε within the ε-proximate of t. The ε-proximate of the set C_t can then be found by an exhaustive search in the cube.

Figure 5. Running time of both the slicing and pairwise algorithms on the MovieLens data set; Figure 5(a) shows the trend of the algorithms as the percentage of the data set varies.

Figure 6. Running time comparison of the Slicing and Pairwise methods on the Netflix data set: (c) k varies; (d) l varies.

Figure 7(b) presents the information loss as a function of k. The error grows with k, because a larger k demands tighter anonymity control, requiring much more data modification. Figures 8(a) and (b) evaluate the classification and prediction accuracy of the greedy anonymization algorithm. Our evaluation methodology is as follows: we first divide the data into training and testing sets; we apply the anonymization algorithm to both to obtain the anonymized training and testing sets; and finally the classification or regression model is trained on the anonymized training set and tested on the anonymized testing set. The Weka implementation [35] of the simple Naive Bayes classifier was used for classification and prediction. Using the MovieLens data, Figure 8(a) compares the predictive accuracy of a classifier trained on MovieLens data produced by the greedy anonymization algorithm. In these experiments, we generated 50 independent training and testing sets, each containing 2000 records, and we fixed ε = 2.
ICST Transactions on Scalable Information Systems, January-March 2013, Volume 13.