Big Data should not be a faith-based initiative



Unfortunately, this is one of those situations where ‘sufficiently advanced stupidity is indistinguishable from malice’. It may well be that only some of the pro ‘anonymization’ parties are actually lying because they want to be able to sell data and know full well that the most dangerous stuff is the most valuable, and that some of them are sincere in their ignorance; but that barely matters once the harm has been done (especially since getting the data scrubbed once it hits the market will be…nontrivial).


Paul Ramsey just did a 5 part series about the privacy implications of the BC government’s choice of vendors for storage of sensitive or personal data.

Much of what he discusses revolves around slippery definitions of secure and private, and the fact that governments and vendors create definitions that solve for selection of the vendor.

When ‘sufficiently advanced stupidity’ is accounted for, the Stupid/Evil vortex is instantiated.


I’m a corporate data analyst, and I have this conversation a lot. The bizarre part is that people can accept any amount of conference-presentation bullshit about how revolutionary big data could be for their business, while at the same time refusing to accept that big data could be revolutionary in inconvenient ways that might allow people to be identified and then fucked around with.

Also, one specific complaint about the pro-anonymisation paper: it quotes a competition entry that managed 0.8% re-identification for a specific, limited set of attacks. It says that since the target for the competition was 5%, this result is pants-wettingly amazing. I don’t know about you, but when we’re talking about a dataset of, say, 15 million hospital patients’ records, 0.8% sounds incredibly dangerous, and 5% sounds like a career-ending possible-criminal-charges-brought politician-destroying meltdown the likes of which we can only imagine in our darkest nightmares. So even if we concede their entire argument, they’re still being laughably optimistic.
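To put rough numbers on those percentages, here is a minimal sketch of the arithmetic. The 15 million figure is the hypothetical dataset size from the comment above, and `expected_reidentified` is just an illustrative helper name, not anything from the paper.

```python
def expected_reidentified(records: int, rate: float) -> int:
    """Number of records re-identified if `rate` of them can be matched."""
    return round(records * rate)


# Hypothetical dataset size taken from the comment above.
RECORDS = 15_000_000

for label, rate in [("competition entry", 0.008), ("competition target", 0.05)]:
    count = expected_reidentified(RECORDS, rate)
    print(f"{label}: {rate:.1%} of {RECORDS:,} records = {count:,} people")
```

That works out to 120,000 people at 0.8% and 750,000 people at the 5% target.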


I have a feeling that the solution to “big data” is on the symptom side rather than looking for a cure. You can cure allergies by killing off your entire immune system… but most people just opt to mask the symptoms.

The problem is that big data has some real value, and not just value in the “more money plz” sense of the word. If you could open up all medical records in the country to all researchers, you can safely assume that we would almost instantly make incredible discoveries. In the course of making those discoveries, you could almost certainly warn people of conditions that your data suggests that they have. We of course won’t do this for the obvious reason that insurance companies would bend everyone over and fuck them.

Personally, I think that the approach is two-part. First, do the obvious and anonymize. It won’t stop a determined attack, but like a shitty bike lock, it helps keep honest people honest and makes it so that you only have to deal with determined attackers. The second bit is to look for where deanonymized information can be used, and ban its use. This is attacking the symptoms. We don’t mind the data being used, we just don’t want it put to ends that are going to hurt us.

Outside of that, I am not sure what else you can do. The promise of sifting through that data for gems is far too great to flat out ban it, and you should be skeptical of people who claim that there is a way to magically make this data’s anonymity bulletproof. Either path you take, you are going to do harm. I think the best we can do is try and thread the needle, do the common sense things, and fully accept that our defense against using big data for evil will never be 100%.

One thing which isn’t mentioned is that positive views of de-identification assume that the threat is the identification of some specific victim. But many crimes involve a victim of opportunity. For example, if a burglar can obtain a list of people who are on holiday, he doesn’t care if only 1% can be re-identified as long as he can find some whose houses can be safely burgled.
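A rough sketch of why a low re-identification rate doesn't help against a victim-of-opportunity attacker. It assumes each record is re-identifiable independently with the same probability, which is a simplification, and the list size of 1,000 is a made-up number for illustration.

```python
def expected_hits(n_records: int, rate: float) -> float:
    """Expected number of re-identifiable records in a list of n_records."""
    return n_records * rate


def prob_at_least_one(n_records: int, rate: float) -> float:
    """Chance that at least one record can be re-identified,
    assuming each record is independent (a simplifying assumption)."""
    return 1.0 - (1.0 - rate) ** n_records


# Hypothetical list of 1,000 people on holiday, 1% re-identifiable.
n, rate = 1_000, 0.01
print(f"expected usable targets: {expected_hits(n, rate):.0f}")
print(f"chance of at least one:  {prob_at_least_one(n, rate):.4f}")
```

Under those assumptions the attacker expects about ten usable targets and is all but guaranteed at least one, which is all he needs.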

Big data is the information age’s equivalent of the nuclear weapon.

It could have broad, sweeping powers to marshal many changes, but ultimately, it’s going to create an ugly legacy that few people will want to touch, unless they’re also looking to assert their own power on the global stage.

As soon as it’s unleashed, the good and bad will start flowing.

You can’t make good technical regulations by ignoring technical experts, even if the thing those technical experts are telling you is that your cherished plans are impossible.

Can we get that on a billboard please.


Well who are you going to believe - some computer nerds, or this big fat check I’m handing you?


Totally agree with the author and in fact I blogged about this a few days ago with the same concept regarding how big data can be used maliciously: Influence of data in enterprises

We cannot “hope for the best” when it comes to the way data is handled.

Big Data could end up being a red herring for “Big Knowledge”, since data remains bits and bytes until you can transform it into knowledge. I cannot see the value in an organisation storing useless data without an identifiable entity relation. Encrypted or not.

Though not overtly stated, medical records are implied here. Just wanted to toss in my $0.02 that PHI (personal health information) has a standard of deidentification that renders it immune to Narayanan and Felten’s reidentification methods: location, for starters. Here is the relevant HHS standard in that case for removing patient identifiers:

(B) All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:
(1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and
(2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000
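As a rough illustration of how the ZIP code part of that rule could be applied, here is a minimal sketch. The function name and the population figures are made up for the example; a real implementation would need the current Census data the rule refers to.

```python
def generalize_zip(zip_code: str, population_by_zip3: dict[str, int]) -> str:
    """Keep only the first three ZIP digits, or '000' if the area formed by
    those three digits has 20,000 or fewer people (or is unknown)."""
    zip3 = zip_code[:3]
    if population_by_zip3.get(zip3, 0) > 20_000:
        return zip3
    return "000"


# Hypothetical population counts per three-digit prefix (not real Census data).
populations = {"100": 1_500_000, "036": 15_000}

print(generalize_zip("10001", populations))  # populous prefix is kept
print(generalize_zip("03601", populations))  # sparse prefix becomes 000
```

Treating an unknown prefix as "000" is a conservative choice made for this sketch, not something the quoted text specifies.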

The larger PHI deidentification standard is here.

Still, I think it would be naive to think that reidentification were ‘impossible’ given enough data points. But is absolute protection what we need, or is this a risk analysis cost/benefit type situation? The number of lives lost because PHI isn’t easily shared clinically or for research purposes is staggering. This is public record (e.g., “To Err is Human”, 1999).

This is a societal problem and issues on privacy hinder the flow of clinical and research based health information. This will get worse when genomics information becomes readily available. Though healthcare is regulated to a far greater degree than any other sector and our data privacy standards are higher, the fear over privacy is palpable to people that otherwise have no beef (other than bitching -> actions not words) with the Facebooks, the Googles, and the Visas of the world.

So it’s a hurdle. And it’s a hurdle that impedes medical progress. Balancing privacy and the need for information is something society needs to address. But in HC the fear is so overblown and the regulations so onerous that I see us coming very close to an ‘opt-in’ scenario re patient data. Right now technically this is the case, but it isn’t the case in practice (patients aren’t educated enough to be stewards of their own PHI). This is even more likely in the case of ‘personalized medicine’ which I see as both the advent of personal health records (PHR) but mostly the coming of genomics. In theory patients control their own data, but in practice somebody else does (possession being 9/10 of the law). With genomics data it’s the opposite in terms of possession. The patient both owns and possesses the data and submits that data for the benefit of their own diagnostic outcomes at their own discretion.

So I see real ‘opt-in’ (not HIPAA ‘opt-in’) becoming a force in HC and the tipping point occurring (patients willingly submitting personal information including genomics data) when the research shows far greater outcomes in the cases of more precise information. While encouraging the free flow of information would get us there quicker, from an insider’s perspective, I don’t see that as likely.
