A generalized method for re-identifying people in "anonymized" data-sets

doctorow · July 24, 2019, 3:23pm

Originally published at: https://boingboing.net/2019/07/24/99-98pct-reidentification.html

…

FGD135 · July 24, 2019, 3:50pm

Beschizzaed:

xhonk · July 24, 2019, 4:18pm

OK, so the minimum threshold for re-identification is 15 parameters… which is where stuff like hypertargeted ads gets useful. For other stuff, we don’t NEED as many parameters to get insights, so it makes me wonder if we’re better off fragmenting data into pairs, triplets, quads, etc., where each parameter combo is covered but limited in scope, i.e., one big record gets turned into lots of smaller ones. As an example, for medical research, I might want to query on a patient’s sex, age, condition, medication, gene 1, gene 2, protein 1, and protein 2, for a total of 8 parameters. This is just spitballing, though, and it doesn’t do anything to counter better re-identification methods that work from fewer parameters.
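A rough sketch of the fragmenting idea, for illustration only: one wide record is split into every k-field projection, so each stored fragment carries only a few parameters. The `fragment` helper and the sample `patient` record are hypothetical, and the sketch says nothing about whether fragments could be linked back together.

```python
from itertools import combinations


def fragment(record, k):
    """Split one record into all of its k-field projections (hypothetical helper)."""
    fields = sorted(record)
    return [
        {name: record[name] for name in combo}
        for combo in combinations(fields, k)
    ]


# Made-up example record; the field names are illustrative only.
patient = {"sex": "F", "age": 42, "condition": "asthma", "medication": "salbutamol"}

pairs = fragment(patient, 2)
print(len(pairs))   # number of 2-field fragments: C(4, 2) = 6
print(pairs[0])     # {'age': 42, 'condition': 'asthma'}
```

With 8 parameters, as in the medical example, the same call would produce 28 pairs or 56 triplets per record.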

WhyBother · July 24, 2019, 4:37pm

No, this method in particular suggests that 15 parameters is sufficient to re-identify almost all participants in a data set. That doesn’t mean that the number of identifiable people drops that much (or at all) with 14, or 13, etc. parameters instead. And there will always be some people in the set who can be identified by as few as two or three parameters. (And those rare people producing nice, clear signals that stand above the noise are the most exploitable in terms of advertising, spear-phishing, etc.)
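To make the "as few as two or three parameters" point concrete, here is a minimal counting sketch. It is not the method from the linked study; it only measures what fraction of rows in a toy table are unique on a chosen set of fields. The `unique_fraction` helper, the field names, and the rows are all made up.

```python
from collections import Counter


def unique_fraction(rows, fields):
    """Fraction of rows whose combination of `fields` appears exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Toy data, invented for illustration.
rows = [
    {"zip": "10001", "sex": "F", "birth_year": 1980},
    {"zip": "10001", "sex": "F", "birth_year": 1980},
    {"zip": "10001", "sex": "M", "birth_year": 1980},
    {"zip": "10002", "sex": "F", "birth_year": 1975},
]

print(unique_fraction(rows, ["zip"]))         # 0.25
print(unique_fraction(rows, ["zip", "sex"]))  # 0.5
```

Even in this tiny table, adding one field doubles the share of rows that stand alone, and one row is already unique on a single field.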

An anonymized dataset is a collection of information with the information sucked out. If you can use it to solve your problem, you either didn’t need all that data in the first place (for example, if you were just taking a simple average) or the dataset is not actually anonymized.

The set needs a gatekeeper: someone trusted who can accept queries from a researcher, run them over the data, and return either a statistic or an error indicating some form of underflow. (By underflow I mean that there is so little data that any events turned up can’t generate useful, repeatable statistics, or that they are so few that they almost have to identify a unique individual.)
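A minimal sketch of what such a gatekeeper might look like, assuming a simple minimum-group-size rule. The names `gatekeeper_mean` and `UnderflowError` and the threshold of 5 are hypothetical choices for illustration, not anything proposed in the thread or the study.

```python
from statistics import mean

MIN_GROUP_SIZE = 5  # hypothetical threshold; a real policy would set its own


class UnderflowError(Exception):
    """Raised when too few records match to release a statistic."""


def gatekeeper_mean(rows, predicate, field, min_group_size=MIN_GROUP_SIZE):
    """Return the mean of `field` over matching rows, or refuse on underflow."""
    matched = [row[field] for row in rows if predicate(row)]
    if len(matched) < min_group_size:
        raise UnderflowError("too few matching records to release a statistic")
    return mean(matched)


# Made-up data: six common cases and one rare one.
rows = [{"age": 30 + i, "condition": "asthma"} for i in range(6)]
rows.append({"age": 70, "condition": "rare"})

print(gatekeeper_mean(rows, lambda r: r["condition"] == "asthma", "age"))  # 32.5

try:
    gatekeeper_mean(rows, lambda r: r["condition"] == "rare", "age")
except UnderflowError as err:
    print("refused:", err)
```

The researcher only ever sees the statistic or the refusal, never the rows. This sketch does not address repeated or overlapping queries, which a real gatekeeper would also have to consider.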

xhonk · July 24, 2019, 4:45pm

OK, thanks for explaining that in the first part. Gatekeeper approach is very much needed and is what I’ve been recommending to biotech groups I work with.

Slant · July 24, 2019, 8:17pm

Didn’t really need a study to guess at this, but good to have it as confirmation. On a related note, one of my favorite terms bouncing around various privacy policies these days is: pseudonymized!

I’ll use it in a sentence: “When you use our products or services you can rest assured that your personal data is safe and secure through our revolutionary system of obfuscation whereby we keep your information pseudonymized.”

Non-techie translation: you’re fucked if you use this product and think we’re not tracking you step by step.

DonatellaNobody · July 24, 2019, 8:42pm

That works out to 65,434 Americans who can’t be identified.

So the survivalists were right. Who knew.
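For anyone checking the arithmetic: the 65,434 figure is consistent with applying the 0.02% remainder (100% minus the article's 99.98%) to a US population of about 327.17 million. That population figure is back-calculated here as an assumption; it is not stated in the thread.

```python
# Assumed population, back-calculated from the 65,434 figure (not from the thread).
population = 327_170_000

# 100% - 99.98% = 0.02% = 2 per 10,000; integer math avoids float rounding.
not_identified = population * 2 // 10_000
print(not_identified)  # 65434
```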

LutherBlisset · July 24, 2019, 8:47pm

Is that a fancy word for preppers? Which is a fancy word for neo-fascist conspiracy theorists? Which is a fancy word for nutters? Which is a non-PC word which should be avoided, so we go with survivalist?

GorillaCoder · July 25, 2019, 4:23am

Everything old is new again. I graduated in ’81. Back about then I discovered at least one DB research paper demonstrating that deanonymization was easy. Folks forgot about it, and didn’t believe you when you told them about it. Now it’s back ;->

buddybradley · July 26, 2019, 7:10pm

To be fair, it seems they can only do it to Americans, so non-Americans can rest easy!

doctorow · July 29, 2019, 3:23pm

This topic was automatically closed after 5 days. New replies are no longer allowed.
