Whois tells us that something.co.uk is a valid domain name so our search space is basically all registered domains and reasonable combinations of words to the left.
So presumably that’s one more thing we can cross off the list of things that make us people, then? The US has established that companies are people, but if they don’t have beliefs, then either they aren’t people or people don’t have beliefs.
(Yes, I’m being facetious. I think.)
Even so, you’ll still end up with a list many orders of magnitude above ‘billions’.
Quick back of the envelope calculation to drive home your point:
There are ~300 million people in the US. So you can be identified with 28.2 bits of information.
There are ~42000 US zip codes, so knowing that counts for ~15.4 of those bits. Birthday (ignoring year) is worth ~8.5 bits, and gender is worth one bit (plus a little if the database includes more than two options). So that’s 24.9 bits of information right there; only 3.3 bits remain. So, under the naïve assumption of uniformity, we would expect there to be only 9 or 10 people fitting those constraints. Almost any other fact about oneself would allow complete identification.
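For anyone who wants to check the envelope, here is a minimal sketch of the same arithmetic. It uses only the figures quoted above and the same naïve uniformity assumption; the variable names are mine.

```python
import math

# Figures quoted in the post above; all are rough estimates.
us_population = 300e6
zip_codes = 42_000
birthdays = 365        # day and month only, ignoring year
genders = 2

# Bits needed to single out one person in the whole population.
total_bits = math.log2(us_population)

# Bits revealed by zip code, birthday and gender, assuming each is
# uniformly distributed and independent of the others.
known_bits = math.log2(zip_codes) + math.log2(birthdays) + math.log2(genders)

remaining_bits = total_bits - known_bits
expected_matches = 2 ** remaining_bits

print(f"total: {total_bits:.1f} bits, known: {known_bits:.1f} bits")
print(f"remaining: {remaining_bits:.1f} bits, about {expected_matches:.1f} people")
```

Real zip codes and birthdays are not uniform, so the actual anonymity set varies a lot from place to place.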
Or instead of billions it could be one. A “rainbow” table may be needed to crack every possible hash, but a “just this very specific shade of orange” table is needed to crack one hash.
Like when your abusive ex has an email address for you, buys access to a database with customer information that is “secured” by hashing email addresses, hashes your email address, finds it in the table, and gets enough information to figure out where you are living now.
When you think about security of private information, it has to work if a potential attacker is looking for a specific person that they already have some information about.
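To make the targeted case concrete, here is a rough sketch. The thread does not say which hash the data sellers use, so SHA-256 over the lowercased address is an assumption, and the `dataset` contents and field names are made up for illustration.

```python
import hashlib

def hash_email(email: str) -> str:
    """Hash an address the way a data seller might (assumed: SHA-256, lowercased)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

# Hypothetical "anonymised" customer table, keyed by hashed email address.
dataset = {
    hash_email("alice@example.com"): {"city": "Springfield", "last_order": "2 days ago"},
}

def lookup(known_email: str, table: dict):
    """One hash and one dictionary lookup: no rainbow table involved."""
    return table.get(hash_email(known_email))

print(lookup("Alice@Example.com", dataset))
```

The attacker's cost here is a single hash, which is why the size of the full search space offers no protection to a specific person.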
Only if you want to get every single email address. If you are OK with getting the vast majority, you can just generate the combinations which look like names, abbreviations of names, and other words. Sure, you will miss UmW6npvRG2@whatever.org, but there just aren’t that many of those.
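A minimal sketch of that kind of guided search, assuming SHA-256 and a handful of common local-part patterns. The name lists, domains, and patterns are placeholders; a real attempt would use much larger word lists.

```python
import hashlib
from itertools import product

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def candidate_addresses(first_names, last_names, domains):
    """Yield name-like addresses instead of every possible string."""
    for first, last, domain in product(first_names, last_names, domains):
        for local in (
            first,
            f"{first}.{last}",
            f"{first}{last}",
            f"{first[0]}{last}",
            f"{first}{last[0]}",
        ):
            yield f"{local}@{domain}"

def crack(target_hashes, candidates):
    """Return {hash: address} for every candidate whose hash is a target."""
    targets = set(target_hashes)
    found = {}
    for address in candidates:
        digest = sha256_hex(address)
        if digest in targets:
            found[digest] = address
    return found

# One guessable address and one random-looking one.
targets = {sha256_hex("jsmith@example.com"), sha256_hex("UmW6npvRG2@whatever.org")}
found = crack(targets, candidate_addresses(["john"], ["smith"], ["example.com"]))
print(found)
```

The random-looking address survives, as expected, but it is the exception rather than the rule.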
0.00nice.
Methods that rely on making something too computationally expensive for the attacker, or requiring more disk space for tables than the attacker can afford, are not a good choice for ensuring security at this point in time.
Salt is good for passwords, but you can’t use it for the purpose being discussed here.
Correct. You can kind of split the difference, though, by using tokens.
Devise a tokenizer function that’s a one-way map of email addresses to a random-ish token number. Can be something simple, like a 64-bit pseudo-random integer. Before you hash any email address with a well-known hashing algorithm, generate a token for that email address through the tokenizer, and hash the combination of the email address and the token.
The tokenizer doesn’t have to be as statistically rigorous or computationally intensive as a hash function. The actual hash function will give you your asymmetric, statistically uniform irreversibility. You’re just relying on the tokenizer in the same way you’d rely on a salt: as long as it’s not totally guessable, you’re probably fine.
This should be reliable given a few modest constraints:
- You keep the tokenizer secret, of course;
- You also protect the pipeline between the output of the tokenizer and the input to the hash function (so that people can’t reverse-engineer the tokenizer by running a bunch of inputs and seeing what tokens get tacked onto the end); and
- You only use the tokenizer for one specific password collection.
Given those constraints, the process should be both consistent and resistant to attack. Someone would have to take one email address and try all 2^64 tokens just to figure out what token got tacked onto that one email address. As long as they can’t repeat that process for a large enough sample of email addresses to discern the tokenizer, it should remain secure.
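Here is one way the scheme could be sketched. The post does not specify how the tokenizer works, so using HMAC-SHA256 with a secret key, truncated to 64 bits, is my substitution and not the poster's design. `SECRET_KEY`, `tokenizer`, and `tokenized_hash` are hypothetical names.

```python
import hashlib
import hmac
import secrets

# Assumption: the "secret tokenizer" is modelled as a keyed function.
# The key must stay private and be used for one collection only.
SECRET_KEY = secrets.token_bytes(32)

def tokenizer(email: str, key: bytes = SECRET_KEY) -> int:
    """Map an address to a 64-bit token that depends on the secret key."""
    normalized = email.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

def tokenized_hash(email: str, key: bytes = SECRET_KEY) -> str:
    """Hash the address together with its token using a well-known hash."""
    normalized = email.strip().lower()
    token = tokenizer(normalized, key)
    return hashlib.sha256(f"{normalized}:{token}".encode("utf-8")).hexdigest()

print(tokenized_hash("alice@example.com"))
```

The same address always produces the same output for whoever holds the key, and a plain hash of the address no longer matches. The flip side is that anyone who needs to match records has to share the key.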
You saw that it was Cory posting, right?
Orders of magnitude larger isn’t a lot of defense at this scale. Using Cory’s larger value of 4 cents means that $20 buys you an awful lot of hashing. The estimates I’ve seen for domains are over a year old at just under 350 million; let’s round it up and call it half a billion for easy math. At that scale a modest lunch price would let you hash the 1000 most likely addresses at each domain. If you apply a small amount of thought to your search space you would get an even larger share of the domains (you can apply more to gmail.com than zark.com, for example). More importantly, you can easily work the other way: I know I’ve got X addresses, and hashing those will cost essentially nothing.
+1 for what johnd posted. And anyone here wasting time trying to find ways to help solve this marketing company “problem” should consider saving their energy. Marketing companies couldn’t care less about your privacy. It’s a wonder that hashes are used at all.
Sure, as I said, the entire problem changes if you already know the email addresses. It effectively becomes trivial in the case of one single, known address.
However, looking at the title of the article, that’s not the case: To ‘unscramble’ the hashes to retrieve every email address you need the complete rainbow table.
But if you want to discuss that other problem: that a hash of the email address of a user is not useful to protect the identity of the user, you’re right. In fact, just ask yourself why they would add that hash to the dataset!
If it was meant to be undecipherable and impossible to identify anything, why add it in the first place?
The companies that sell data with hashed email addresses all know that the entire dataset is unusable to most of their clients without an identifier for the users… Hell, even if a buyer doesn’t know the email address itself, it can still correlate data from multiple ‘anonymised’ sources and figure out just about every detail of that person. Except maybe their email address.
And the moment that person identifies him/herself at their site (with an email address, of course), they can hash it and glue the rest of the profile in, no sweat.
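A rough sketch of that correlation step, assuming both sellers hash addresses the same way (SHA-256 here, which is a guess). The two datasets and their fields are invented for illustration.

```python
import hashlib

def hash_email(email: str) -> str:
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

# Two hypothetical "anonymised" datasets bought from different sellers.
purchases = {
    hash_email("alice@example.com"): {"last_purchase": "running shoes"},
}
locations = {
    hash_email("alice@example.com"): {"home_zip": "00000"},
    hash_email("bob@example.com"): {"home_zip": "11111"},
}

def merge_profiles(*datasets):
    """Join datasets on the hashed address; no plaintext email is needed."""
    merged = {}
    for dataset in datasets:
        for hashed, fields in dataset.items():
            merged.setdefault(hashed, {}).update(fields)
    return merged

profiles = merge_profiles(purchases, locations)

def identify(email: str, merged: dict):
    """When a visitor types in their address, hash it and attach the profile."""
    return merged.get(hash_email(email))

print(identify("alice@example.com", profiles))
```

The hash works as a stable join key, which is the property that makes the data sellable and the reason it cannot also provide anonymity.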
Yeah, the problem is that the whole point of what they are doing is that two different groups can match their databases together. There’s no way to allow that to happen without allowing third parties to crack the code.
Part of my job involves data privacy and I’m constantly asking people a question very close to this. If you are going to release a dataset and you don’t want anyone to be identified, then you just don’t release fields that could identify them. I get that this reduces the value of the dataset for some purposes, but you really have to choose which side of that tradeoff you are going to land on. In government people are actively thinking about that tradeoff; I imagine in these companies people are just trying to figure out a way to make it look like they made a nod towards privacy.
Like you say, the whole reason they add the hash is so that someone else can match it. And yeah, the data privacy problems go way beyond that. Who even needs to match an email address if you have the location people were standing for every text they ever sent?
Orders of magnitude, as in “40^30 is about 1.15e+48”. That’s 1.15 trillion trillion trillion trillion or 1.15 Octillion (if you’re in Europe; Quindecillion in the US).
Now, consider the entire world’s worth is estimated at about $250 trillion. That’s 2.5e+14. You’d need a way better deal than $0.0069 per 5 billion if you’re ever going to pay that. And a bit more time as well.
As a little exercise I calculated the longest email address (in characters) you’d be able to buy a rainbow table for at that price: the entire world’s worth would cover the 16-character table (costing ‘only’ $59 trillion) but would not come even close to the price of the 17-character table (costing $2371 trillion).
More realistically, the table for 12 characters would possibly be affordable at $23 million, even the 13-character one could be considered (at just over $900 million), so even hashing all the possible ‘gmail.com’ addresses will be prohibitively expensive as it’s not unlikely to have more than 12 characters before the “@” symbol… And you’ll have to re-calculate for every domain.
Time-wise, at 5 billion hashes per 10 milliseconds the 12-character table will take just over 1 year to calculate. The 13-character table takes 42 years, the 16-character table takes 2.7 million years and the 17-character one takes over 108 million years…
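The figures above can be reproduced with a few lines. This is a sketch using only the numbers quoted in this post: a 40-character alphabet, $0.0069 per 5 billion hashes, and 5 billion hashes per 10 milliseconds. It counts strings of exactly the given length. The function names are mine.

```python
ALPHABET_SIZE = 40          # characters allowed per position, as in 40^30
HASHES_PER_UNIT = 5e9       # one priced unit of work: 5 billion hashes
DOLLARS_PER_UNIT = 0.0069   # price per unit quoted in the thread
SECONDS_PER_UNIT = 0.010    # 5 billion hashes per 10 milliseconds
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def table_size(length: int) -> int:
    """Number of strings of exactly `length` characters."""
    return ALPHABET_SIZE ** length

def table_cost_dollars(length: int) -> float:
    return table_size(length) / HASHES_PER_UNIT * DOLLARS_PER_UNIT

def table_time_years(length: int) -> float:
    return table_size(length) / HASHES_PER_UNIT * SECONDS_PER_UNIT / SECONDS_PER_YEAR

for length in (12, 13, 16, 17):
    print(
        f"{length} chars: ${table_cost_dollars(length):,.0f}, "
        f"{table_time_years(length):,.1f} years"
    )
```

Each extra character multiplies both cost and time by 40, which is where the jump from millions to trillions comes from.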
Exponents… always a surprise.
You’re making a large number of unrealistic assumptions. The biggest one is that working addresses are evenly distributed. They aren’t. Names and words are far more common than random strings. The second bad assumption is that the search needs to be complete to be damaging. If you can de-anonymize the first billion or two emails you have a named dataset of unbelievable richness and detail. The third is ignoring the option to work from known emails to marketing databases. If you hash one person’s email it is a simple database query to get their information.