Contributions to lifelogging protection in streaming environments

  1. Pàmies Estrems, David
Supervised by:
  1. Jordi Castellà Roca, Supervisor
  2. Joaquín García Alfaro, Supervisor

Defense university: Universitat Rovira i Virgili

Defense date: September 10, 2020

Committee:
  1. Pino Caballero Gil, Chair
  2. Josep Domingo Ferrer, Secretary
  3. Nora Cuppens-Boulahia, Member

Type: Thesis

Teseo: 649078

Abstract

Every day, more than five billion people generate some kind of data over the Internet. To access that information, we rely on search services, either in the form of Web Search Engines or through Personal Assistants. On each interaction with them, our actions are recorded in logs, which are used to offer a more useful experience. For companies, logs are also very valuable, since they offer a way to monetize the service. Monetization is achieved by selling data to third parties; however, query logs could potentially expose sensitive user information: identifiers, sensitive data about users (such as diseases, sexual orientation, or religious beliefs), or they could be used for what is called "lifelogging": a continuous record of one's daily activities. Current regulations oblige companies to protect this personal information.

The thesis is grounded in a detailed and thorough state of the art covering the most relevant and up-to-date research conducted in this field. It briefly explains the concepts of privacy and anonymization, studying in detail the main techniques applied in server-side, client-side and mixed approaches. From this analysis we were able to detect current privacy and security problems related to releasing query logs. Protection systems for closed data sets have previously been proposed, most of them working with atomic files or structured data. Unfortunately, those systems do not fit the growing real-time, unstructured data environment posed by Internet services. We designed techniques to protect the user's sensitive information in a non-structured, real-time streaming environment, guaranteeing a trade-off between data utility and protection. In this regard, three proposals have been made for efficient log protection. Each research opportunity was developed applying Design and Creation as the main research strategy, which implies that the results of this thesis include IT artifacts (prototypes of the proposed solutions).
The performance of those artifacts was assessed with both formal validation and an extensive set of experiments using real data. The privacy guarantee is defined in terms of set theory, which relates sets of users to sets of query logs. Data can be released without modifications other than removing direct identifiers from query text and remapping between those two sets. This contrasts with existing approaches, which release heavily modified data, either distorted or generalized, to maintain anonymity.

Our first proposal is fast and provides a high level of privacy. It describes a new method to anonymize query logs based on probabilistic k-anonymity. It also proposes some de-anonymization tools to determine possible privacy problems. This approach tries to preserve the original user interests, but spreads possible quasi-identifier information over many users to prevent linkage attacks. However, since the number of categories it uses is low, the utility of the anonymized data could be improved. This led us to our second proposal, which allows one to adjust the number of available categories, as well as the number of elements in each category. This parameterization permits different privacy and utility levels, according to the needs of each application. The second proposal also provides formal and experimental proofs that ensure its feasibility in terms of privacy, utility, and scalability. Regarding privacy, our evaluations consider the worst-case scenario, in which an attacker willing to use the anonymized query logs to retrieve the original ones has gained access to the base information of the service: the algorithms, the anonymized logs, and the parameters used. In order to evaluate the utility of query logs after anonymization, we measured distances between user profiles. Studying the results, we found that just above 40% of utility was lost with a simple classification.
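As a rough illustration of the probabilistic k-anonymity idea described above, the sketch below partitions the users of each category into groups of k and remaps every query to a random member of its group, so a released entry links back to any particular user with probability at most 1/k. The log schema (`user`, `category`, `query` fields) and the grouping strategy are assumptions made here for illustration, not the implementation proposed in the thesis.

```python
import random
from collections import defaultdict

def anonymize_logs(logs, k):
    """Probabilistic k-anonymity sketch (illustrative schema):
    within each category, users are partitioned into groups of k
    and every query's user id is remapped to a random group member."""
    by_category = defaultdict(list)
    for entry in logs:
        by_category[entry["category"]].append(entry)

    anonymized = []
    for category, entries in by_category.items():
        users = sorted({e["user"] for e in entries})
        for e in entries:
            # Index of this user's group of k; the last group may be
            # smaller, which would weaken the guarantee in practice.
            group_idx = users.index(e["user"]) // k
            group = users[group_idx * k:(group_idx + 1) * k]
            anonymized.append({
                "user": random.choice(group),   # remap within the group
                "category": e["category"],
                "query": e["query"],
            })
    return anonymized
```

Because the remapping is uniform inside each group, user interests at the category level are preserved in aggregate, which is the trade-off between utility and protection that the abstract describes.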
However, utility increases rapidly with more detailed classifications, such as the ones used in our second proposal, easily achieving less than 1% of utility loss without affecting privacy. Since we achieve a protection level of 1/k in all cases, with k parameterizable, applying our proposal is sufficient to generate anonymized logs that meet the defined utility criteria and can be safely released to third parties. The algorithmic time complexity of our proposal is linear with respect to the input and can be established as O(n). We also considered Google's average load, in queries per second, to study the run-time cost and the memory usage. The results show that the proposed anonymizer algorithms could handle the equivalent of Google's average load in real time, using only one thread of execution and a feasible amount of memory, in our test environment.

Finally, a full architecture was presented that takes into account the use of Internet-based Personal Assistants. This architecture jointly addresses anonymization by organizational roles, in terms of Data Controller, Data Processor and Data Subject, in order to comply with the guidelines of the GDPR. The final proposal combines lifelogging anonymization with sanitizable signatures to promptly mitigate privacy threats.
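The utility evaluation mentioned above, based on distances between user profiles, could be approximated as follows. This is a minimal sketch under assumptions of our own: profiles are represented as category-frequency distributions and compared with cosine distance; the thesis may use different profile representations and distance functions.

```python
import math
from collections import Counter

def profile(logs, user):
    """Category distribution of one user's queries
    (the {"user", "category"} schema is illustrative)."""
    counts = Counter(e["category"] for e in logs if e["user"] == user)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def profile_distance(p, q):
    """Cosine distance between two category distributions:
    0.0 for identical interest profiles, 1.0 for disjoint ones."""
    cats = set(p) | set(q)
    dot = sum(p.get(c, 0.0) * q.get(c, 0.0) for c in cats)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)
```

Comparing each user's profile before and after anonymization and averaging the distances yields an aggregate utility-loss figure of the kind reported in the percentages above.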