An incredibly important paper on whether data can ever be "anonymized" and how we should handle release of large data-sets

doctorow · February 1, 2018, 4:59pm

Originally published at: https://boingboing.net/2018/02/01/high-dimensional-data.html

…

astazangasta · February 1, 2018, 5:58pm

There is a big difference between a data breach and de-identification; the former presumes that the storage method is secure against unauthorized access, but the data itself is not. The latter assumes the data is freely available, but does not contain identifying marks.

I think both are impossible to solve in any real-world way: there are too many holes in any large system to secure it perfectly against leaks, and many data sets are too complex to protect against a determined sleuth. We live in a world where we struggle to produce cryptographically-secure methods of hiding identity; the wrong pRNG can result in secrets being spilled. A de-identified dataset doesn’t even come close to this level of security; it is full of all kinds of identifying information. A typical dataset of genetic information is full of thousands of correlates with your identity. You cannot release this data in any meaningful sense (i.e., expose actual unique information about an individual, the whole point of releasing the data) without risking identification. To some extent this means that people who want their data to be shared “anonymously” must be made aware of, and accept, the risk of identification.

PsiPhiGrrrl · February 1, 2018, 6:59pm

A thousand times this. Once I worked at a company that conducted annual employee surveys. They promised feedback would be anonymous. I worked in a department with a few other women and POC, but I was the only one who was both. So, I knew that the source of my comments could easily be identified.
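
To make the survey problem above concrete, here is a rough sketch in Python. The roster is invented for illustration and the helper names (`group_sizes`, `unique_groups`) are hypothetical, not taken from the paper or any library. It only counts how many respondents share each combination of attributes, which is roughly the idea behind a k-anonymity check.

```python
from collections import Counter

# Hypothetical department roster: (gender, ethnicity) per respondent.
# These records are made up purely for illustration.
respondents = [
    ("man", "white"),
    ("man", "white"),
    ("man", "white"),
    ("man", "poc"),
    ("man", "poc"),
    ("woman", "white"),
    ("woman", "white"),
    ("woman", "poc"),
]

def group_sizes(records, columns):
    """Count how many records share each combination of the chosen columns."""
    return Counter(tuple(r[c] for c in columns) for r in records)

def unique_groups(records, columns):
    """Return the attribute combinations held by exactly one record."""
    return [combo for combo, n in group_sizes(records, columns).items() if n == 1]

print(unique_groups(respondents, [0]))     # gender alone
print(unique_groups(respondents, [1]))     # ethnicity alone
print(unique_groups(respondents, [0, 1]))  # gender and ethnicity together
```

In this made-up roster neither attribute singles anyone out on its own, but the two together isolate one respondent, even though no name was ever recorded.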

Slant · February 1, 2018, 8:11pm

They should also be told point-blank that if data were truly, honest-to-goodly, for real “de-identified”, then it would be of little to no use.

WhyBother · February 1, 2018, 8:29pm

This is the fact that no one wants to admit: if you are giving away a dataset that has potentially-useful correlations that you don’t understand, you can’t be surprised when someone teases out correlations that you didn’t expect. That was sort of the point. We’re basically just lying to ourselves about being able to throw these things out into the ether in the hope that it helps some stranger figure out how to help us. The only way to curate these datasets while respecting user privacy is under contract to select, vetted researchers, with the understanding that nothing is ever really anonymized.

doctorow · February 6, 2018, 4:59pm

This topic was automatically closed after 5 days. New replies are no longer allowed.
