Algorithm can identify 99.98% of users in supposedly "anonymized" data

beschizza · July 24, 2019, 12:34pm

Originally published at: https://boingboing.net/2019/07/24/algorithm-can-identify-99-98.html

…

Fang · July 24, 2019, 12:51pm

https://www.nature.com/articles/s41467-019-10933-3

I haven’t read it completely yet, but the summary doesn’t seem correct. What they actually made is an algorithm that checks whether a combination of characteristics is likely to be unique. The most important characteristics are things like ZIP code and birth date. (If you live in a fairly small town, you are fairly likely to be the only one in that locality born on that exact year/month/day.)
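
To make the "is this combination unique?" idea concrete, here is a minimal sketch. It is a plain count over a made-up table, not the paper's algorithm; the `unique_fraction` helper, the field names, and the records are all hypothetical.

```python
from collections import Counter

def unique_fraction(records, keys):
    """Fraction of records whose values for `keys` appear exactly once."""
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(records)

# Hypothetical toy data: six people in two ZIP codes.
records = [
    {"zip": "10001", "birth": "1980-03-14", "gender": "F"},
    {"zip": "10001", "birth": "1980-03-14", "gender": "M"},
    {"zip": "10001", "birth": "1975-07-02", "gender": "F"},
    {"zip": "10002", "birth": "1975-07-02", "gender": "F"},
    {"zip": "10002", "birth": "1990-11-30", "gender": "M"},
    {"zip": "10002", "birth": "1990-11-30", "gender": "M"},
]

# Each attribute added to the combination singles out more people.
for keys in (["zip"], ["zip", "birth"], ["zip", "birth", "gender"]):
    print(keys, unique_fraction(records, keys))
```

On this toy table nobody is unique on ZIP code alone, a third are unique on ZIP plus birth date, and two thirds are unique once gender is added.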

gracchus · July 24, 2019, 1:13pm

They had to publish because to do the research is to realize that criminals and governments already did the research.

Criminals and governments and corporations. In some countries, all three at the same time.

AnthonyI · July 24, 2019, 2:33pm

It’s the first thing I think of when I read about “anonymized” data online.

I’ve always been amazed at how accurate these things are.

sqlrob · July 24, 2019, 3:22pm

I’m surprised they bothered with the year. Earlier research on deanonymization didn’t use it. IIRC, it was something like 80% with just birth day, gender, zipcode.

HMSGoose · July 24, 2019, 4:46pm

Birthday is much more specific than a lot of other indicators. I feel like in a list of “15 attributes,” throwing in birth date among the others is like saying we could pick you out of a crowd if we had “hair color, foot shape and a picture of your fucking face.” There are very few datasets that should be disseminated with birth date. For research and marketing, a 1-to-2 year birth range should be plenty to put you in a statistical bucket.
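
A rough sketch of what the birth-range idea buys you, assuming a made-up table of (ZIP code, birth date) rows. The helper names and the data are hypothetical; the point is only to compare the smallest group size with exact dates versus 2-year buckets.

```python
from collections import Counter
from datetime import date

def smallest_group(rows, key):
    """Size of the smallest group of rows that share the same key."""
    return min(Counter(key(r) for r in rows).values())

def exact_birth_date(row):
    zip_code, born = row
    return (zip_code, born)

def two_year_bucket(row):
    zip_code, born = row
    # Round the year down to an even year: 1980 and 1981 share a bucket.
    return (zip_code, born.year // 2 * 2)

# Hypothetical toy data.
rows = [
    ("10001", date(1980, 3, 14)),
    ("10001", date(1981, 9, 1)),
    ("10001", date(1980, 12, 25)),
    ("10002", date(1975, 7, 2)),
    ("10002", date(1974, 1, 19)),
    ("10002", date(1975, 5, 5)),
]

k_exact = smallest_group(rows, exact_birth_date)
k_bucketed = smallest_group(rows, two_year_bucket)
print(k_exact, k_bucketed)
```

With exact dates every row here is alone in its group; with 2-year buckets each person hides among three. Real data will be far less tidy, so treat this as an illustration and not a guarantee.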

renato · July 24, 2019, 7:04pm

Although the number of attributes is misleading, it helps lay people understand easily how it works.
I guess they are using something similar to this:
https://wiki.mozilla.org/Fingerprinting

Each of the attributes can separate the users, and some of them carry much more information and discriminate more than others.
But even the more innocuous ones can be enough when combined.
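
One common way to put a number on "how much information" an attribute carries is Shannon entropy in bits. The sketch below is an assumption about how you might measure it, not something taken from the paper or the Mozilla page; the attributes and the eight users are made up.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy, in bits, of the observed distribution of values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical toy population: (timezone, language, screen) for 8 users.
users = [
    ("UTC+0", "en", "1080p"),
    ("UTC+0", "en", "1440p"),
    ("UTC+0", "fr", "1080p"),
    ("UTC+0", "fr", "1440p"),
    ("UTC+1", "en", "1080p"),
    ("UTC+1", "en", "1440p"),
    ("UTC+1", "fr", "1080p"),
    ("UTC+1", "fr", "1440p"),
]

timezone_bits = entropy_bits([u[0] for u in users])
language_bits = entropy_bits([u[1] for u in users])
screen_bits = entropy_bits([u[2] for u in users])
combined_bits = entropy_bits(users)  # all three attributes together

print(timezone_bits, language_bits, screen_bits, combined_bits)
```

Each attribute alone only splits this population in half (1 bit), but the three together give 3 bits, which is exactly enough to tell 8 users apart.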

johnd · July 24, 2019, 11:35pm

Nothing really new here, beyond a significant speedup to the process.
Custodians of anonymised medical data have recognised this problem for decades. Once you’ve given someone an anonymised data set, you rely on formal agreements to keep it anonymised, rather than the fiction that anonymised data can’t be linked back to individuals.
I did a demonstration many years ago that showed how to take “anonymised” datasets from three geographically related medical institutions and link these together to produce individually identified data. This re-linkage was actually done live in a 90 minute presentation.
It shocked a lot of people.
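
For anyone wondering what that kind of linkage looks like mechanically, here is a minimal sketch. It is not the demonstration described above: it joins only two tables, the `link` helper is hypothetical, and every name and record is invented.

```python
def link(anonymised, identified, keys):
    """Attach names to anonymised rows whose key values match exactly one person."""
    index = {}
    for row in identified:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])
    matches = []
    for row in anonymised:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # only an unambiguous match counts as a re-identification
            matches.append((names[0], row["diagnosis"]))
    return matches

# Hypothetical identified list (say, a public register).
identified = [
    {"name": "Alice Example", "zip": "10001", "birth": "1980-03-14", "gender": "F"},
    {"name": "Bob Example", "zip": "10001", "birth": "1975-07-02", "gender": "M"},
    {"name": "Carol Example", "zip": "10002", "birth": "1990-11-30", "gender": "F"},
    {"name": "Dana Example", "zip": "10002", "birth": "1990-11-30", "gender": "F"},
]

# Hypothetical "anonymised" release: names removed, other fields kept.
anonymised = [
    {"zip": "10001", "birth": "1980-03-14", "gender": "F", "diagnosis": "asthma"},
    {"zip": "10001", "birth": "1975-07-02", "gender": "M", "diagnosis": "diabetes"},
    {"zip": "10002", "birth": "1990-11-30", "gender": "F", "diagnosis": "migraine"},
]

linked = link(anonymised, identified, ["zip", "birth", "gender"])
print(linked)
```

Two of the three "anonymised" rows get a name back. The third stays ambiguous only because two people in the register happen to share the same ZIP code, birth date and gender.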

beschizza · July 29, 2019, 12:34pm

This topic was automatically closed after 5 days. New replies are no longer allowed.
