Algorithm can identify 99.98% of users in supposedly "anonymized" data

Originally published at:


Haven’t read it completely yet, but the summary doesn’t seem correct. What they actually made is an algorithm that checks whether a combination of characteristics is likely to be unique. The most important characteristics are things like ZIP code and birth date. (If you live in a fairly small town, you are fairly likely to be the only one in that locality born on that exact year/month/day.)
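To make the "is this combination unique?" idea concrete, here is a minimal Python sketch. It is not the paper's method, just direct counting on a made-up toy table; the function name `unique_fraction` and the records are invented for illustration.

```python
from collections import Counter

def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

# Hypothetical toy records, not data from the study.
people = [
    {"zip": "02139", "birth": "1980-03-14", "gender": "F"},
    {"zip": "02139", "birth": "1980-03-14", "gender": "M"},
    {"zip": "02139", "birth": "1975-07-01", "gender": "F"},
    {"zip": "59001", "birth": "1980-03-14", "gender": "F"},
]

print(unique_fraction(people, ["zip"]))                     # 0.25
print(unique_fraction(people, ["zip", "birth"]))            # 0.5
print(unique_fraction(people, ["zip", "birth", "gender"]))  # 1.0
```

Each added attribute splits the groups further, so the share of people who are alone in their group only goes up.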


They had to publish because to do the research is to realize that criminals and governments already did the research.

Criminals and governments and corporations. In some countries, all three at the same time.


It’s the first thing I think of when I read about “anonymized” data online.


I’ve always been amazed at how accurate these things are.


I’m surprised they bothered with the year. Earlier research on deanonymization didn’t use it. IIRC, it was something like 80% with just birth day, gender, zipcode.


Birthday is much more specific than a lot of other indicators. I feel like throwing birth day into a list of “15 attributes” is like saying we could pick you out of a crowd if we had “hair color, foot shape and a picture of your fucking face.” There are very few datasets that should be disseminated with birth date. For research and marketing, a 1-to-2 year birth range should be plenty to put you in a statistical bucket.
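A rough sketch of what that bucketing buys you, in Python. The helper names `birth_bucket` and `smallest_group` and the dates are made up for illustration; the smallest group size is the "k" that k-anonymity talks about.

```python
from collections import Counter
from datetime import date

def birth_bucket(d, width=2):
    """Replace an exact birth date with a `width`-year range label."""
    start = d.year - d.year % width
    return f"{start}-{start + width - 1}"

def smallest_group(values):
    """Size of the smallest group of identical values (1 means someone is unique)."""
    return min(Counter(values).values())

# Hypothetical birth dates for four people in the same ZIP code.
births = [date(1980, 3, 14), date(1981, 11, 2), date(1980, 7, 30), date(1981, 1, 5)]

print(smallest_group(births))                             # 1
print(smallest_group([birth_bucket(d) for d in births]))  # 4
```

With exact dates every person is alone in their group; with a 2-year range all four land in the same "1980-1981" bucket.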


Although the number of attributes is misleading, it helps lay people understand easily how it works.
I guess they are using something similar to this:

Each of the attributes can separate the users, and some of them carry much more information and can discriminate more.
But, even the more inoffensive ones can be enough when combined.
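
One way to put numbers on "carries more information" is Shannon entropy in bits. This is only a sketch on invented data, and not necessarily what the researchers used; `entropy_bits` is a hypothetical helper.

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy of the observed values, in bits."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical attributes for four people.
gender = ["F", "M", "F", "M"]
hair = ["brown", "brown", "black", "black"]

print(entropy_bits(gender))                  # 1.0
print(entropy_bits(hair))                    # 1.0
print(entropy_bits(list(zip(gender, hair)))) # 2.0
```

Singling out one person among four takes log2(4) = 2 bits. Neither attribute gets there alone, but the pair does.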

Nothing really new here, beyond a significant speedup to the process.
Custodians of anonymised medical data have recognised this problem for decades. Once you’ve given someone an anonymised data set, you rely on formal agreements to keep it anonymised, rather than the fiction that anonymised data can’t be linked back to individuals.
I did a demonstration many years ago that showed how to take “anonymised” datasets from three geographically related medical institutions and link these together to produce individually identified data. This re-linkage was actually done live in a 90 minute presentation.
It shocked a lot of people.
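
For anyone who has not seen linkage done, a minimal Python sketch of the general idea follows. It is not a reconstruction of that demonstration: the `link` function, the field names, and every record are invented, and it joins one "anonymised" table against one identified table rather than three.

```python
def link(anonymised, identified, keys):
    """Attach a name to each anonymised row that matches exactly one identified row."""
    index = {}
    for row in identified:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    linked = []
    for row in anonymised:
        matches = index.get(tuple(row[k] for k in keys), [])
        if len(matches) == 1:  # skip ambiguous or missing matches
            linked.append({**row, "name": matches[0]["name"]})
    return linked

# Hypothetical data: a de-identified hospital extract and a named register.
hospital = [
    {"zip": "59001", "birth": "1980-03-14", "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth": "1975-07-01", "gender": "M", "diagnosis": "diabetes"},
]
register = [
    {"name": "Alice Example", "zip": "59001", "birth": "1980-03-14", "gender": "F"},
    {"name": "Bob Example", "zip": "02139", "birth": "1975-07-01", "gender": "M"},
    {"name": "Carl Example", "zip": "02139", "birth": "1975-07-01", "gender": "M"},
]

for row in link(hospital, register, ["zip", "birth", "gender"]):
    print(row["name"], row["diagnosis"])  # Alice Example asthma
```

Only the first hospital row is re-identified here, because two register entries share the second row's ZIP code, birth date and gender.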

