An incredibly important paper on whether data can ever be "anonymized" and how we should handle release of large data-sets


Originally published at:


There is a big difference between a data breach and de-identification. The former presumes that the storage is secured against unauthorized access, but the data itself is not protected. The latter assumes the data is freely available but contains no identifying marks.

I think both are impossible to solve in any real-world way: there are too many holes in any large system to secure it perfectly against leaks, and many data sets are too complex to protect against a determined sleuth. We live in a world where we struggle to produce cryptographically secure methods of hiding identity; the wrong PRNG can result in secrets being spilled. A de-identified dataset doesn’t even come close to this level of security; it is full of all kinds of identifying information. A typical dataset of genetic information contains thousands of correlates with your identity. You cannot release this data in any meaningful sense (i.e., expose actual unique information about an individual, which is the whole point of releasing the data) without risking identification. To some extent this means that people who want their data to be shared “anonymously” must be made aware of, and accept, the risk of identification.


A thousand times this. I once worked at a company that conducted annual employee surveys. They promised feedback would be anonymous. I worked in a department with a few other women and a few other POC, but I was the only person who was both. So I knew that the source of my comments could easily be identified.
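To make that concrete, here is a rough sketch of how a survey with no names can still single someone out. Everything in it is made up for illustration: the `responses` rows, the field names, and the helpers `group_sizes` and `unique_rows` are all hypothetical. The size of the smallest group is roughly what the privacy literature calls k in k-anonymity.

```python
from collections import Counter

# Hypothetical "anonymous" survey responses: no names, only a department
# and two demographic fields. All values are invented for illustration.
responses = [
    {"dept": "eng", "gender": "M", "poc": False},
    {"dept": "eng", "gender": "M", "poc": False},
    {"dept": "eng", "gender": "M", "poc": True},
    {"dept": "eng", "gender": "M", "poc": True},
    {"dept": "eng", "gender": "F", "poc": False},
    {"dept": "eng", "gender": "F", "poc": False},
    {"dept": "eng", "gender": "F", "poc": True},
]


def group_sizes(rows, fields):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


def unique_rows(rows, fields):
    """Return the rows whose combination of fields appears exactly once."""
    sizes = group_sizes(rows, fields)
    return [row for row in rows if sizes[tuple(row[f] for f in fields)] == 1]


# Looked at one field at a time, everyone sits in a group of at least three.
smallest_by_gender = min(group_sizes(responses, ["gender"]).values())
smallest_by_poc = min(group_sizes(responses, ["poc"]).values())

# Combining the fields leaves one respondent in a group of one.
exposed = unique_rows(responses, ["dept", "gender", "poc"])
```

Each extra field released makes the groups smaller, so a dataset with many columns can end up with most rows being unique even when no single column looks identifying.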


They should also be told point-blank that if data were truly, honest-to-goodly, for real “de-identified”, then it would be of little to no use.


This is the fact that no one wants to admit: if you are giving away a dataset that has potentially useful correlations that you don’t understand, you can’t be surprised when someone teases out correlations that you didn’t expect. That was sort of the point. We’re basically just lying to ourselves about being able to throw these things out into the ether in the hope that it helps some stranger figure out how to help us. The only way to curate these datasets while respecting user privacy is to release them under contract to select, vetted researchers, with the understanding that nothing is ever really anonymized.

