A generalized method for re-identifying people in "anonymized" data-sets

Originally published at: https://boingboing.net/2019/07/24/99-98pct-reidentification.html

OK, so the minimum threshold for re-identification is 15 parameters, which is where stuff like hypertargeted ads gets useful. For other stuff, we don’t NEED as many parameters to get insights, so it makes me wonder if we’re better off fragmenting data into pairs, triplets, quads, etc., where each parameter combo is covered but limited in scope, i.e., one big record gets turned into lots of smaller ones. As an example, for medical research, I might want to query on the patient’s sex, age, condition, medication, gene 1, gene 2, protein 1, and protein 2, for a total of 8 parameters. This is just spitballing, though, and it doesn’t do anything about better re-identification methods that need fewer parameters.

No, this method in particular suggests that 15 parameters is sufficient to re-identify almost all participants in a data set. That doesn’t mean that the number of identifiable people drops that much (or at all) with 14, or 13, etc. parameters instead. And there will always be some people in the set who can be identified by as few as two or three parameters. (And those rare people producing nice, clear signals that stand above the noise are the most exploitable in terms of advertising, spear-phishing, etc.)
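To make the "fewer parameters still identify some people" point concrete, here is a toy sketch. It is not the study's method, just a plain count of how many records are unique on a given set of attributes. The function name `unique_fraction`, the field names, and the four made-up records are all assumptions for illustration.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose values on `fields` match no other record."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Four invented records; real data sets are far larger and messier.
people = [
    {"sex": "F", "birth_year": 1980, "zip": "10001"},
    {"sex": "F", "birth_year": 1980, "zip": "10002"},
    {"sex": "M", "birth_year": 1980, "zip": "10001"},
    {"sex": "M", "birth_year": 1975, "zip": "10001"},
]

# Add one attribute at a time and watch the unique share grow.
for n in range(1, 4):
    fields = ["sex", "birth_year", "zip"][:n]
    print(fields, unique_fraction(people, fields))
```

Even in this tiny set, two of the four people are already unique on two attributes, well before all three are used.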

An anonymized dataset is a collection of information with the information sucked out. If you can use it to solve your problem, you either didn’t need all that data in the first place (for example, if you were just taking a simple average) or the dataset is not actually anonymized.

The set needs a gatekeeper: someone trusted who can accept queries from a researcher, run them over the data, and return either a statistic or an error indicating some form of underflow. (By underflow I mean that so few records match that they can’t generate useful, repeatable statistics, or so few that they almost have to identify a unique individual.)
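A minimal sketch of that gatekeeper idea, assuming the simplest possible rule: refuse any query that matches fewer than some minimum number of records. The names `gatekeeper_mean`, `Underflow`, and `MIN_GROUP_SIZE`, the cutoff of 5, and the toy patient data are all hypothetical.

```python
from statistics import mean

MIN_GROUP_SIZE = 5  # hypothetical cutoff; a real gatekeeper would choose this deliberately


class Underflow(Exception):
    """Raised when too few records match to release a statistic."""


def gatekeeper_mean(records, predicate, field, min_group_size=MIN_GROUP_SIZE):
    """Return the mean of `field` over matching records, or refuse."""
    values = [r[field] for r in records if predicate(r)]
    if len(values) < min_group_size:
        raise Underflow(f"only {len(values)} matching records; refusing to answer")
    return mean(values)


# Toy data: six patients on drug A, one on drug B.
patients = [{"drug": "A", "age": a} for a in (30, 40, 50, 60, 70, 80)]
patients.append({"drug": "B", "age": 45})

# Enough matches: the researcher gets a statistic, never the raw rows.
print(gatekeeper_mean(patients, lambda r: r["drug"] == "A", "age"))

# Too few matches: the researcher gets an underflow error instead.
try:
    gatekeeper_mean(patients, lambda r: r["drug"] == "B", "age")
except Underflow as err:
    print("refused:", err)
```

A size cutoff on its own is not enough, since two allowed queries that differ by one person can still be subtracted from each other, so a real gatekeeper would probably also need to track or limit what each researcher asks.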

OK, thanks for explaining that in the first part. Gatekeeper approach is very much needed and is what I’ve been recommending to biotech groups I work with.

Didn’t really need a study to guess at this, but good to have it as confirmation. On a related note, one of my favorite terms bouncing around various privacy policies these days is: pseudonymized!

I’ll use it in a sentence: “When you use our products or services you can rest assured that your personal data is safe and secure through our revolutionary system of obfuscation whereby we keep your information pseudonymized.”

Non-techie translation: you’re fucked if you use this product and think we’re not tracking you step by step.

That works out to 65,434 Americans who can’t be identified.

So the survivalists were right. Who knew.

Is that a fancy word for preppers? Which is a fancy word for neo-fascist conspiracy theorists? Which is a fancy word for nutters? Which is a non-PC word which should be avoided, so we go with survivalist?

Everything old is new again. I graduated in ’81. Back about then I discovered at least one DB research paper demonstrating that deanonymization was easy. Folks forgot about it, and didn’t believe you when you told them about it. Now it’s back ;->

To be fair, it seems they can only do it to Americans, so non-Americans can rest easy!
