Big Data's religious faith denies the reality of failed promises, privacy Chernobyls


[Read the post]


Laughing uncontrollably at this!

“In Soviet times, there was the old anecdote about a nail factory. In the first year of the Five-Year Plan, they were evaluated by how many nails they could produce, so they made hundreds of millions of uselessly tiny nails.”


From the full presentation: “We’re very much in the ‘radium underpants’ stage of the surveillance economy.”

Yeah, that sounds about right. Except that we know that and we are still doing it.


O ye of little faith: had ye but faith the size of a 16-bit integer, ye could move mountains!


Of course, they got smart the next year, and measured the nail production by weight…

Ceglowski also raises a critical point: Big Data has not lived up to its promises, especially in the life sciences, where the new science that deep analysis of data was promised to yield has spectacularly failed to materialise.

I think this is more a case of “Where’s my jetpack?” syndrome where people don’t realize just how much science has advanced because they base it on some unrealistic goal rather than more important if less glamorous advances. Biology certainly has been tremendously advanced by “big data” in the sense of automated DNA sequencing over the past twenty years or so.


The comparison with nuclear mishaps does not entirely fit.

The data breaches do not create beautiful natural preserves.


…or shambling, misshapen mutants.


Well, since they are talking about big data in biology, they might yet.


If you never test your assumptions, then no amount of data will light your way out of the deep, dark statistical woods.


Do you work in data? You sound like you know a thing or two. I have loathed the term “Big Data” since people really started using it all the time. Most people don’t realize that all data is a sample. Even if you collect exhaustive data, it is still a sample from a putative infinite population. And, because of that, sampling methods, appreciation of distributions, adjustment and all kinds of other basic processes flowing from the Central Limit Theorem still hold true. And then there is the reporting… Whenever I hear Big Data, my eyes crinkle a bit and the plates running down my spine start to redden and I brace for battle.
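That CLT point can be sketched in a few lines (my own toy example, not from anyone in this thread — an exponential distribution stands in for skewed real-world data):

```python
# Sketch: even a big, skewed dataset behaves like a draw from a larger
# process. Repeatedly sampling from an exponential distribution shows
# the Central Limit Theorem at work: sample means cluster normally
# around the true mean (1.0), and tighten as n grows.
import random
import statistics

random.seed(42)

def sample_mean(n):
    # Individual values are heavily skewed...
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

for n in (10, 100, 1000):
    means = [sample_mean(n) for _ in range(2000)]
    # ...but the sampling distribution of the mean narrows like 1/sqrt(n)
    print(f"n={n:4d}  mean of means={statistics.fmean(means):.3f}  "
          f"sd of means={statistics.stdev(means):.3f}")
```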

#Bring it, Big Data.


Oh, they will…


So good. Thank you for sharing.


I also get a small whiff of, “Where’s my jetpack?” here.

If we replace this sentence:
“Big Data’s advocates believe that all this can be solved with more Big Data.”

With this one:
“Big Data’s advocates believe that all this can be solved with more research on methods of data modeling.”

Do we feel differently about it?

The way the term Big Data is used these days, those two sentences are functionally interchangeable. Data “science” is still mainly heuristics at this point; we haven’t had the computing power to run experiments in this field for very long at all.

I completely agree that collecting lots of data about human behavior is problematic and I love Maciej’s 90 day expiration plan. That said, I’d be willing to bet that twenty years from now, no one is going to wish that we, as a society, had spent less money researching ways to use machine-learning to understand the life sciences, or that the contributions therefrom will be considered unimpressive.

All of that said, I need to give a talk next month on “Big Data for Social Good” and I’m having a really hard time coming up with material…


The problem is not that big data is not in itself useful, the problem is the adversarial nature in which it’s used. It’s a real shame, with a bit of honestly-intentioned regulation the potential for research in the public interest would be far more interesting than all this ad targeting and Skinner Box fine-tuning.


All data is not a sample, nor are all populations infinite. That’s just ridiculous on its face. If my population is the manufacturers of socks in my drawer right now, I assure you that not only is it finite, but I have a complete data set for it.

My only problem with “big data” is that people like Cory seem to think all big data sets are related to people.


Until one fine day… a stray sock shows up.

If you are trying to generalize to socks based on your accessible sample, which consists of the socks made by the manufacturers in your drawer, then the larger theoretical population is what your sock sample was drawn from.

I.e. theoretical population > accessible population > sampling distributions if you are running stats on those socks.
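A hypothetical sketch of that nesting (the manufacturer names and counts are made up for illustration):

```python
# Sketch: theoretical population > accessible population > sample.
import random

random.seed(0)

# Theoretical population: all sock manufacturers that could exist.
theoretical = [f"maker_{i}" for i in range(10_000)]

# Accessible population: the makers whose socks ended up in my drawer.
accessible = random.sample(theoretical, 12)

# Sample: the socks I actually inspect when I run my stats.
sample = random.sample(accessible, 5)

# Each level is drawn from — and contained in — the one above it.
assert set(sample) <= set(accessible) <= set(theoretical)
```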


Yes, but my population was the manufacturers of socks that were in my drawer at the time. I assure you that has not changed just because you decided to create a new population of socks that could ever exist period in my drawer.

The population of states of an abstract bit is exactly {0, 1}. It does not change. One does not say this is invalid because they decide to redefine the population to mean the states of every bit that exists, has ever existed and will ever exist. That’s just an exercise in sophism.


I see the direction you are coming at this from. But even still, what if you are generalizing about manufacturers? If you are generalizing about manufacturers based on your sample, even though you call it complete, that act of trying to generalize about them requires this theoretical construct of a larger frame of reference. Yes, we can get super esoteric, and I see what you are saying, but I am talking stat theory and you seem to be talking comp sci or maybe set theory? I dunno.

Also, I would note that a bit is not data. It is a datum. :relaxed: I’m not arguing sophistically about data, though. I’m talking stat theory, so if we are talking about different things we oughta acknowledge it and we can both be correct in our domains.

n.b. thinking a bit more. This back-and-forth summarizes the chicken-and-egg problem with Big Data. Just because Big Data might be exhaustive or somehow total does not mean that inferences gained from analyzing it are necessarily generalizable. When you cross over that line from data to statistics, the assumptions underlying the statistical methods do not get suspended just because you think you have all the data. It doesn’t work that way. Google Flu and misfired ad targeting are examples: in each case, even exhaustive data didn’t tell you everything there is to know about something. It’s still just a sample.


Great presentation. Interesting point about the measurement of truckers. The observer effect of quantum physics seems to kick in here: the measurement of a thing impacts the thing itself. Isn’t that applicable to much of big data? Who hasn’t changed how they communicate online around certain topics (politics, religion, relationships), knowing that someone, somewhere will be observing it? Who doesn’t buy some things in cash so nobody knows they bought them? The mere collection of big data has influenced the behaviours it’s trying to analyse. Doesn’t that lower the value of the analysis itself?