TONG Ze,DENG Bowen,ZHENG Lele,ZHANG Tao
(School of Computer Science and Technology,Xidian University,Xi'an 710071,China)
Abstract: The structure of key-value data is a typical data structure generated by mobile devices.The collection and analysis of the data from mobile devices are critical for service providers to improve service quality.Nevertheless,collecting raw data,which may contain various personal information,would lead to serious personal privacy leaks.Local differential privacy (LDP) has been proposed to protect privacy on the device side so that the server cannot obtain the raw data.However,existing mechanisms assume that all keys are equally sensitive,which cannot produce high-precision statistical results.A utility-improved data collection framework with LDP for key-value formed mobile data is proposed to solve this issue.More specifically,we divide the key-value data into sensitive and non-sensitive parts and only provide an LDPequivalent privacy guarantee for sensitive keys and all values.We instantiate our framework by using a utility-improved key value-unary encoding (UKV-UE) mechanism based on unary encoding,with which our framework can work effectively for a large key domain.We then validate our mechanism which provides better utility and is suitable for mobile devices by evaluating it in two real datasets.Finally,some possible future research directions are envisioned.
Keywords: key-value data;local differential privacy;mobile devices;privacy-preserving data collection
With the development of mobile communication technologies,service providers are more willing to collect data from mobile devices to enhance the service experience for users.As a classical data structure,key-value data are widespread in practical mobile applications[1–2].The structure of key-value data is a hybrid data structure,where the key is the identifier of data and the value is the content of data.The following three examples show its potential applications:
1) Mobile devices (such as wearable devices,smartphones,tablets,etc.) generate a large number of data,the majority of which are in a key-value format,i.e.,device_id,device_valueortimestamp,device_value.These data could show the usage habits of users on a specific device or during a particular period,which can help data collectors provide a personalized experience for the user.For example,the service center collectsdevice_id,sleepdurationfrom the user’s smart bracelet to remind the user to rest properly at a suitable time[3].
2) Software vendors,such as Android and iOS,collect users’ data to enhance the users’ experience,i.e.app_name,userratingorapp_name,lengthofvisit,in a key-value format,where the key is the name of an APP,and the value is the length of time or a score to access the APP.These data could show the users’ experience with a particular application,which can help software vendors study future product improvements.For example,software vendors provide users with personalized recommendations by collecting their specific interest[4].
3) Advertisers are interested in knowing whether the video advertisements they place on mobile devices appeal to potential customers[5].Therefore,they are willing to collect advertisement ratings from users in the form of key-value,where the key is the ID of the advertisement,and the value is the number of minutes that users watch the advertisement.
However,the key-value data involves a lot of personal information,thus users may be reluctant to upload data from their mobile devices.To address the privacy-preserving data collection issue,some researchers proposed local differential privacy[6]to obfuscate local information in the data collection phase.Because of its decentralization and strict mathematical proof,it has been adopted by mainstream systems such as MacOS[7]and Windows[8]to collect data.In addition,local differential privacy (LDP) reduces the communication costs of largescale computing and the frequent interaction with the data center,making it well-suited for mobile devices with limited resources and low computing power.
Recently,there has been extensive research on key-value data collections with LDP.YE et al.[9]first proposed PrivKVM to protect key-value data using synchronized key and value perturbation protocols.It adopted one iteration to obtain frequency estimation and several iterations to achieve an approximately unbiased mean estimation.The result of the last iteration is sent to the next iteration as input.However,it requires all users to be online in all the iterations,which is difficult to achieve in practical scenarios.Moreover,PrivKVM may lead to a high estimation error when the key domain is large.SUN et al.[2]proposed a series of LDP mechanisms based on PrivKVM and introduced conditional analysis for key-value data analysis.However,the mean estimation obtained by SUN et al.is biased.Subsequently,GU et al.[10]proposed a private correlated key-value (PCKV) data collection mechanism,which adopts the padding-and-sampling mechanism to solve the large key domain problem of previous work[9].Moreover,a budget composition theorem for the relevant perturbation mechanism is further given to enhance the data utility using privacy budget relaxation.However,according to the definition of LDP,we cannot distinguish whether the output key is genuine.Because the virtual values significantly reduce the aggregation accuracy,the aforementioned mean estimation mechanisms perform poorly in the case of a small privacy budget.Therefore,there is a requirement to enhance the utility of key-value data collection under LDP.
Moreover,the mechanisms aforementioned regard all data as equally sensitive and thus provide excessive protection for some data and leave much room for improving data utility.In real-world scenarios,there is quite a lot of non-sensitive data.For example,when the server collects application names and ratings from cell phones,attackers cannot infer users’ privacy preferences even if they know that the user logs in WeChat,which is a social APP that has a huge user base.Therefore,using WeChat provides non-sensitive data for users.In contrast,using some minority applications provides sensitive data.Based on this idea,MURAKAMI et al.[11]proposed the concept of utility-optimized LDP (ULDP),which only requires LDP protection for sensitive data to reduce the frequency estimation error.Nevertheless,ULDP is only suitable for frequency estimation,thus the accuracy of data collection under the privacy protection for key-value data needs further enhancements.
To address these issues,we propose a new framework for mobile devices called the utility-improved key value (UKV) data collection with LDP.In UKV,mobile devices take different perturbations based on whether the data are sensitive or not to achieve a balance between privacy and utility.We then introduce an initial implementation of the UKV framework and verify its performance in terms of data utility using public datasets.
The remainder of the paper is organized as follows.The overview and benefits of the UKV framework are introduced in Section 2.Some key challenges are presented in Section 3.In Section 4,we describe a case study of an initial implementation of the UKV framework for mobile devices.The performance of our mechanisms is evaluated in Section 5,and some possible future directions are given in Section 6.Finally,we conclude the paper in Section 7.
In this section,we briefly introduce data collection under local differential privacy.Then we describe the UKV framework and its benefits for data protection.
Data collection is an important means of obtaining data from mobile devices.By collecting and analyzing users’ data,data collectors can mine users’ characteristics (such as living habits and health status) and thus formulate more appropriate development strategies.However,users’ data often contains a large amount of personal information.Collecting raw data may lead to serious personal privacy leaks,which not only harms privacy leakers but also brings a series of legal risks and economic losses to data collectors.Therefore,this issue needs to be solved urgently.
Differential privacy[12]provides a feasible solution to the problem of personal privacy leakage due to its characteristic of being plausible and deniable.It provides strictly provable privacy protection without relying on the background knowledge possessed by the attacker.LDP is one of the differential privacy technologies that specifically address the problem of personal privacy leakage during data collection.Unlike central differential privacy,which assumes the existence of a trusted data collector with access to the user’s raw data,LDP does not require any qualification on the credibility of the data collector.In particular,LDP requires each user to locally perturb the raw data with a local perturbation mechanism before sending it to the data collector.Therefore,the data security of the users is guaranteed.Because of this unique advantage,LDP has been widely adopted in practice.A successful case is RAPPOR[13]on Google Chrome,which enables Google to collect users’ browsing information while protecting user privacy.
A basic LDP mechanism is Generalized Randomized Response (GRR)[14].The main idea of GRR is to set the output range to be the same as the input range,with a certain probability of providing a “fake” response while maximizing the likelihood of providing a “true” response.Specifically,each user perturbsxto itself with a large probabilityp,and perturbsxto other data with a small probabilityq.However,the utility of GRR drops rapidly when the data domain is large.UE[15]solves this problem.UE first encodes the input data as a one-hotd-dimensional vector with only the bit corresponding to the data set to 1,wheredis the size of input domain.Then each bit is perturbed independently.Here,each user retains(only) input 1 with large probabilityp,and perturbs each 0 to 1 with probabilityq.Our work is based on the above scheme and achieves secure data collection adapted to mobile devices.
Fig.1 shows the overview of our framework,which contains three parts: mobile devices,the server side,and data analysts.And,we will describe each part of our framework in detail.
·Mobile devices.Mobile devices are individual users who own personal data.They can not only generate and collect data of users but also perturb the raw data with local differential privacy mechanisms to protect information privacy.
·Server side.The server,which has a large number of computing and storage resources,is responsible for collecting the data sent by mobile devices,and aggregating and estimating the data.Finally,the server releases the data and its corresponding estimations.In this paper,we assume that the server is “semi-honest”.Here,“semi-honest” means that the server honestly executes the data collection protocol while potentially leaking the user’s historical data to attackers.
·Data analysts.Data analysts are the actual users of the data.The data analyst submits a query request to the server and gets the noise-added results.These analysts may be ordinary users or malicious attackers.
We briefly describe the data flow in a UKV framework to understand the data processing procedure.The raw data are generated by the mobile device and locally perturbed by UKV.Then,the mobile device sends it to the server side.In addition,the key-value data are divided into two categories:sensitive data and non-sensitive data.In the data perturbation phase,UKV divides the privacy budgetεinto two parts,namelyε1andε2,whereε1is used for perturbation of key andε2is used for perturbation of value.For sensitive key-value data,the key consumes all privacy budgetε1;for non-sensitive key-value data,the key does not consume any privacy budget.Furthermore,UKV consumes the privacy budget ofε2to perturb the value for two types of key-value data.Finally,the server releases the estimated results to the data analyst.
By applying the UKV framework,users perturb the raw data before sending key-value data to the server.In addition,the UKV framework also provides the following benefits.
·Privacy protection and efficient utility of data.In data collection,each user sends the noise-added data to the server.Then the server aggregates and analyzes the data where the frequency and mean estimation are important data analysis components.UKV can maintain high efficiency with a low privacy budget.As the number of non-sensitive keys increases,the data utility improves rapidly.
·No trusted third party is required.LDP transfers the part that adds noise to the raw data to local devices so that the third-party data collectors cannot get the raw data;thus it avoids the risk of privacy leakage by third parties.

▲Figure 1.Overview of the proposed framework
In order to take full advantage of the UKV framework for mobile devices,we still face challenges in the implementation of the proposed framework,which seriously hinders the booming development of related applications.
1) Individuals have different privacy needs for data.The difference between sensitive and non-sensitive data can vary from one user to another (e.g.,some people even want to keep the name of the APP that they use and the scores of the movie they give private).Moreover,we concentrate on a situation in which users can easily choose,no matter it is sensitive or not.Nevertheless,there is also a situation in which the user knows nothing about the sensitive data type.For the latter case,the improvement of our UKV framework for data utility is greatly reduced.Therefore,how to divide sensitive data and nonsensitive data is crucial.
2) Association of sensitive data with non-sensitive data.First,we assume that each user sends a key-value data pair and each user’s data are irrelevant.This makes sense for most personal data (e.g.,application ratings).Yet,for certain types of personal data (e.g.,flu status[16]),users may be extremely influenced by other users.Moreover,when users send more than one pair of data,sensitive and non-sensitive data may also be correlated with each other,which means that nonsensitive data release may lead to sensitive data leakage[17].Therefore,designing a scheme for sending multiple data pairs per user is an important and challenging problem.
3) Lightweight.In practical applications,the communication bandwidth cost between the mobile device and the server is proportionate to the domain size of the key.The time complexity of UKV proposed in this paper isO(d).When the key domain is too large,the communication cost increases dramatically,which is unacceptable in many practical applications.Moreover,because of the limited computing resources of mobile devices,they lack the ability to perform complex computing tasks.Thus,designing a lightweight privacy-preserving algorithm for mobile devices is necessary.
4) Selection of parameters.The Padding-and-Sampling protocol is used in the UKV framework,where the padding lengthlneeds to be set in advance.In theory,it should be set based on the data distribution.A smalllwill reduce the frequency estimation for the key,while a largelwill increase the quantity of virtual key-value data,leading to a larger estimation error.However,the best selection oflis not possible because the purpose of UKV is to learn the data distribution.In addition,the choice of privacy budget is essential for balancing privacy and utility.Therefore,it is crucial to choose parameters to achieve near-optimal efficiency.
In this section,we provide a case study to introduce the initial implementation of the UKV framework for mobile device data collection.
Fig.2 shows the implementation of the UKV framework for mobile devices.It consists of two parts:
1) The mobile device perturbs key-value data with LDP to provide provable privacy guarantees.
2) The server estimates data to generate usable data from the collected key-value data for analysis.
The two parts are combined to improve the accuracy of server data analysis while protecting the privacy of key-value data.We then describe the details of the two parts in the following sections.

▲Figure 2.Implementation of the utility-improved key value framework
·Padding-and-sampling[18]: Each user samples one datum from possessed key-value data instead of sampling one datum from the domain of all key-value data.In order to make all samplings rate the same,each user first adds different random dummy data to possessed key-value data until he haslkeyvalue data.
·Perturbation: The general overview of the perturbation method includes the key input domain,key output domain,and flip probability.The input domain has sensitive keyks∈KSand non-sensitive keykn∈KN,whereKSandKNare the sets of sensitive and non-sensitive keys,respectively.In the output domain,kr∈KSkdenotes the rest sensitive keys exceptk,wherekmay not belong toKS.When the user inputsksto UKV,her output includeskswithpssprobability andkrwithpsrprobability,where the values ofpssandpsrare related to the privacy budgetε1.When the user inputskn,her output includeskrwith thepnrprobability andknwith thepnnprobability,wherepnris equal topsr.

Data estimation: The server collects the data uploaded by users.For the sensitive key,the server computes the counts of 1 and -1 that support keykfrom all the data sent by users,denoted byn0andn1,respectively.Then we could calculate the frequency estimationand the corresponding mean estimationby

wherenis the number of the users.
For the non-sensitive key,the server computes the number ofvthat supports keykfrom all the data sent by users,denoted byn2.Then we could calculate the frequency estimationand the corresponding mean estimationby

wherePis the set of perturbed values sent by the users.
In summary,UKV improves the data utility by slackening privacy protection for non-sensitive data.
In this section,we evaluate the privacy and utility assurance performance of UKV for data collection in public datasets.
Datasets: In this paper,we use the e-commerce (Ec) dataset[21]and the clothing (Cl) dataset[22]to evaluate the performance of UKV on privacy protection and utility assurance.The Ec dataset includes 23 486 key-value pairs and 1 206 category keys for a total of 23 486 users.The Cl dataset includes 192 544 key-value pairs and 5 850 types of keys,with a total of 105 508 users.
To demonstrate the advantages of the UKV framework,we compare it with the most advanced key-value data collection mechanism PCKV.
Evaluation metrics: We evaluate the frequency and mean estimations by comparing the averaged mean square error(MSE) among non-sensitive keys:

whereKNis the domain of non-sensitive keys,kandare the frequency and mean estimations of the key-value data,andfkandmkare the actual frequency and mean values of the keyvalue data.
We use the ten most frequent keys as non-sensitive keys,because the frequency of non-sensitive keys is usually higher in practice.
Figs.3 and 4 show the MSE of non-sensitive keys in two realworld datasets,from which the effect of privacy budget on data utility can be observed.We double the privacy budgets in our experiments.The larger the privacy budget,the lower the mean square error and the higher the data utility.The MSE of the Cl dataset does not change much compared with the results of the Ec dataset because all algorithms benefit from the number of users,which makes up for the effect of the large key domain.As shown in Figs.3(a) and 4(a),our UKV-UE mechanism performs the best as it does not decrease the privacy budget of frequency estimation while discriminating the key sensitivity in UKV.The theory of dividing the sensitivity to decrease the frequency estimation errors is detailed in Ref.[12].

▲Figure.3.MSE of Cl dataset
Similarly,in Figs.3(b) and 4(b),the UKV-UE mechanism performs well for any privacy budget about the mean estimation.In the case of a small privacy budget,only the UKV-UE achieves higher accuracy.
This work,key-value data collection with LDP for mobile devices,still needs further research to advance its development.In this section,we envision some possible future directions.

▲Figure 4.MSE of Ec dataset
1) Statistical analysis of key-value data for mobile devices.To the best of our knowledge,the current work is limited to frequency estimation and mean estimation of key-value data.In contrast,other applications of key-value data are less explored(e.g.,maximum-minimum estimation of key-value data).Therefore,other aggregation statistics of key-value data for mobile devices are a direction worthy of attention.
2) Machine learning on mobile devices.In a distributed machine learning system on mobile devices,the mobile devices collect data and send it to the server.Then,the server divides the subsets of data items according to certain rules and finally distributes the subsets to each device for training.Currently,only a few mobile machine learning frameworks support keyvalue data formats to submit training data,like searching English words in dictionaries (the dictionary data structure is in key-value format,where the key is the alphabet and the value is their sequence number).Therefore,exploring more training frameworks that support key-value data formats and adding LDP protection to them is quite worthwhile.
3) High-dimensional key-value data in mobile devices.Most of the current major differential privacy protection frameworks are for two-dimensional data sets.However,there are a lot of complex high-dimensional data in mobile devices,and it is necessary to protect them using local differential privacy techniques.Moreover,shifting data protection from two dimensions to multiple dimensions will inevitably bring more challenges,like dimensional disasters.In short,designing a differential privacy protection framework for mobile devices that can be extended to multi-dimensional data protection is an important challenge for data analysis work.
4) Mobile real-time data release.With the need for some particular scenarios (such as a health code and a nucleic acid test),people have an increasing demand for real-time query response and data updates.Real-time data release has high requirements for the stability of data transmission.However,the data transmission stability of mobile devices is doubtful,which may cause frequent dropouts for users.In addition,problems such as repeated data release and dynamic data update significantly increase the risk of privacy leakage in the real-time data release.Therefore,the privacy protection for real-time data release of mobile devices deserves much attention.
5) Preventing poisoning attacks by mobile devices.Poisoning attacks against key-value data aim to reduce data availability by sending carefully crafted data from some fake users to the server while changing the frequency and mean value of the target key chosen by the attacker[23].For example,an attacker successfully changed a road segment in Google Maps from"clear" to "congested" using 99 mobile phones.Existing defense methods are effective in some cases but ineffective in others.Therefore,researching methods to defend against poisoning attacks from mobile devices is a worthwhile endeavor.
We researched the utility improvement of key-value data collection for mobile devices and proposed a novel framework,UKV,which has improved the data utility by providing LDP privacy protection for sensitive key-value data only.We also introduced the main challenges that hindered key-value data collections from maximizing their benefits.Then,we introduced an initial implementation of the UKV framework and validated the excellent utility of our mechanism on two real datasets.Finally,we envisioned some possible future directions to attract more research in this area.