What is the difference between differential privacy and other anonymization techniques?
In this post, I will explain the difference between differential privacy and other anonymization techniques.
What is anonymization?
Lots of datasets contain personal data: names, phone numbers, birth dates, etc. But personal data is subject to regulations, like GDPR in Europe or CCPA in the US. If you want your datasets to fall outside the scope of these regulations, they have to be anonymized. Anonymization means transforming a dataset containing personal data into a dataset that includes no personal data.
Several techniques exist to remove this personal data, but as always, there is no silver bullet. There is always a tradeoff between privacy and loss of information: the more privacy is guaranteed, the more information gets lost. The simplest anonymization method is to erase the columns containing obvious identifiers. No data, no trouble, right? The problem is that by doing so, you lose all the information. More subtle methods exist: you can add noise, aggregate, pseudonymize, or hash attributes. But the principle is always the same: you apply transformations to a dataset and give the anonymized version to the people who need it, like your data science team or your client.
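To make this concrete, here is a minimal sketch of those classic treatments on a toy table, using Python and pandas. The column names, salt, and noise level are purely illustrative; a real pipeline would need a careful re-identification risk analysis before release.

```python
import hashlib

import numpy as np
import pandas as pd

# A toy dataset with obvious identifiers (hypothetical columns).
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "phone": ["555-0101", "555-0102"],
    "birth_year": [1990, 1985],
    "income": [52_000, 67_000],
})

# 1. Erase columns containing obvious identifiers.
df = df.drop(columns=["phone"])

# 2. Pseudonymize: replace names with salted hashes.
salt = "s3cret-salt"  # illustrative only; keep real salts secret
df["name"] = df["name"].apply(
    lambda v: hashlib.sha256((salt + v).encode()).hexdigest()[:12]
)

# 3. Add noise to a numeric attribute.
rng = np.random.default_rng(seed=0)
df["income"] = df["income"] + rng.normal(0, 1_000, size=len(df))

# 4. Aggregate: generalize birth years into decades.
df["birth_decade"] = (df["birth_year"] // 10) * 10
df = df.drop(columns=["birth_year"])

print(df)
```

Each step trades some information for some privacy, which is exactly the tension described above.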
What is different about differential privacy?
These anonymization techniques can run into trouble when your dataset changes over time. Suppose a university released a dataset in June 2021 containing aggregated data on household incomes, mentioning that 10 students come from families earning over $100,000 per year. Now it is July 2021, and exactly one new student has enrolled. If you release the same dataset and 11 students now fall in the $100,000+ household income category, anyone can deduce the new student's family's financial situation simply by comparing the two releases.
Differential privacy (DP) handles this kind of situation thanks to a slightly different approach. Instead of sharing an anonymized dataset, you keep your dataset and control the algorithms applied to it: you allow others to run queries against it, like “What is the average age of our customers?” or “What is the median blood pressure in this cohort of patients?”. For each query, you add some noise to the answer. This time, the noise level is precisely calibrated, and the guarantee is backed by mathematical proof: how much any single individual can influence the released answers is strictly bounded. There is a limit, however: third parties can only run a restricted number of queries.
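Here is a minimal sketch of the Laplace mechanism, the textbook way to answer a counting query with differential privacy. The epsilon value and the income figures are illustrative, not prescriptive.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon):
    """Return a differentially private count of items matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person
    changes the true count by at most 1), so Laplace noise with scale
    1 / epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# The university example: household incomes in June, then July.
incomes_june = [45_000, 120_000, 80_000, 150_000] + [60_000] * 20
incomes_july = incomes_june + [130_000]  # one new student enrolls

for month, data in [("June", incomes_june), ("July", incomes_july)]:
    answer = dp_count(data, lambda x: x > 100_000, epsilon=0.5)
    print(f"{month}: ~{answer:.1f} students above $100,000")
```

Because each answer is perturbed by fresh noise, comparing the June and July releases no longer pins down the new student's income bracket. Smaller epsilon means more noise and stronger privacy, which is why the number of queries has to be limited: the privacy losses of successive queries add up.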
Which technique should I choose?
If you want to share an anonymized dataset with your data scientists or with well-known third parties, you can use classic anonymization techniques. You will be compliant with regulations like GDPR, provided you apply the right transformations.
If your dataset regularly incorporates new data and is meant to be queryable, you should build a differential privacy solution. It lets you keep control of your data and of the algorithms applied to it.
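One common way to enforce the query limit mentioned above is a privacy budget: each query spends part of a global epsilon, and the dataset refuses to answer once the budget is exhausted. The class below is a hypothetical sketch of that idea; the name and API are mine, not those of a real library.

```python
import numpy as np

rng = np.random.default_rng()

class PrivateDataset:
    """Answers noisy counting queries while tracking a global epsilon budget."""

    def __init__(self, values, total_epsilon):
        self._values = values
        self._remaining = total_epsilon

    def count(self, predicate, epsilon):
        if epsilon > self._remaining:
            raise RuntimeError("privacy budget exhausted: no more queries allowed")
        self._remaining -= epsilon
        true_count = sum(1 for v in self._values if predicate(v))
        # Laplace noise with scale 1/epsilon: a count has sensitivity 1.
        return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

incomes = [45_000, 120_000, 80_000, 150_000] + [60_000] * 20
ds = PrivateDataset(incomes, total_epsilon=1.0)
print(ds.count(lambda x: x > 100_000, epsilon=0.5))  # allowed
print(ds.count(lambda x: x > 100_000, epsilon=0.5))  # allowed, budget now spent
# A third query with epsilon=0.5 would raise RuntimeError.
```

Under basic composition, the epsilons of successive queries simply add up, so capping their sum caps the total privacy loss.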
At Cosmian, we are developing a new anonymization feature to help you share your sensitive datasets. Our cryptographic tools already secure your data when you run a model or an algorithm in the cloud. With this new functionality, your datasets will stay protected even further. Stay tuned!