Anonymization techniques for Semantic Networks of Data

Internship Motivations

The information in social networks becomes an important data source, and sometimes it is necessary or beneficial to release such data to the public. Many real-world social networks contain sensitive information and serious privacy concerns on graph data have been raised. The famous result of Narayan and Shmatikov [3] has shown that naive anonymization does not work: it is in practise very easy to re-identify elements of a trivially anonymized (ie replacing identifying informations such as names, social security numbers etc. with random numbers) social networks. Later works [2] pushed further the study of attacks on anonymized social networks.

The goal of social network anonymization is to produce a graph in such a way that some statistical functions produce the same result on the original graph and on the transformed graph, while other functions (namely reidentification) should not produce the same result. There are two main ways to work on the anonymization:

Clustering: one tries to group together edges and nodes so that when the the cluster regroups $k$ nodes then there is no way to distinguish an individual node among them.
k-anonymity: one tries to modify the original graph in such a way that there should be at least k-1 other candidate nodes with similar features (the features are part of the assumption made on the capability of the attacker).

Internship Objectives

The aim of the internship is to study a new approach to social data anonymization. An important focus will be made on semantic networks of data in which nowadays, data are often organized as graphs with an underlying semantic to allow efficient querying and support inference engines. Such is the case in, for example, linked data and semantic web typically relying on RDF. An individual or company may have legitimate motivations to query a database it does not own, but preventing illegitimate access to private information while allowing legitimate database usage is a difficult task. An example as resurfaced recently with the debates about facebook’s privacy leaks and Cambridge Analytica.

In general, the objective is to limit the knowledge a tenant may gather from the database by limiting the queries it may execute. This is particularly difficult when the database possesses an underlying semantic, as attributes may have strong correlations and since inference rules can generally be exploited to infer new knowledge.

A first starting point is to use a graph rewriting approach based on category theory [1]. Using this approach it is very easy to make clones of a graph while being able to ``tune'' the properties of the clones. The idea is to select 1/k nodes, to make k-clones of this selection and to link them together in an appropriate way so that the global graph obtained shares good properties with relation to the original graph.
A second option is to explore topological approaches based on algebraic topology, see [4].

Location

This intership will be held at the LIG laboratory in Grenoble France in the team CAPP, coached by Frederic Prost.

Requisite skills and abilities

C++ programming
Graph theory.
Category Theory
Ability to use/modify large programs.

References

[1] Andrea Corradini, Dominique Duval, Rachid Echahed, Fr\'ed\'eric Prost, Leila Ribeiro: AGREE - Algebraic Graph Rewriting with Controlled Embedding. ICGT 2015: 35-51.

[2] Lars Backstrom, Cynthia Dwork, Jon M. Kleinberg: Wherefore art thou R3579X?: anonymized social networks, hidden patterns, and structural steganography. Commun. ACM 54(12): 133-141 (2011).

[3] Arvind Narayanan, Vitaly Shmatikov: De-anonymizing Social Networks. IEEE Symposium on Security and Privacy 2009: 173-187.

[4] Alberto Speranzon, Shaunak D. Bopardikar: An Algebraic Topological Per- spective to Privacy: Numerical and Categorical Data, American Control Con- ference 2016.