Laughing uncontrollably at this!
"In Soviet times, there was the old anecdote about a nail factory. In the first year of the Five-Year Plan, they were evaluated by how many nails they could produce, so they made hundreds of millions of uselessly tiny nails."
From the full presentation: "We're very much in the 'radium underpants' stage of the surveillance economy."
Yeah, that sounds about right. Except that we know that and we are still doing it.
O ye of little faith: had ye but faith the size of a 16-bit integer, ye could move mountains!
Of course, they got smart the next year, and measured the nail production by weight…
Ceglowski also raises a critical point: Big Data has not lived up to its promises, especially in the life sciences, where we were promised that deep analysis of data would yield new science. That promise has spectacularly failed to materialise.
I think this is more a case of "Where's my jetpack?" syndrome, where people don't realize just how much science has advanced because they judge it against some unrealistic goal rather than against more important if less glamorous advances. Biology certainly has been tremendously advanced by "big data" in the sense of automated DNA sequencing over the past twenty years or so.
The comparison with nuclear mishaps does not entirely fit.
The data breaches do not create beautiful natural preserves.
ā¦or shambling, misshapen mutants.
Well, since they are talking about big data in biology, they might yet.
If you never test your assumptions, then no amount of data will light your way out of the deep, dark statistical woods.
Do you work in data? You sound like you know a thing or two. I have loathed the term "Big Data" since people really started using it all the time. Most people don't realize that all data is a sample. Even if you collect exhaustive data, it is still a sample from a putative infinite population. And, because of that, sampling methods, appreciation of distributions, adjustment and all kinds of other basic processes flowing from the Central Limit Theorem still hold true. And then there is the reporting… Whenever I hear Big Data, my eyes crinkle a bit and the plates running down my spine start to redden and I brace for battle.
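The Central Limit Theorem point above can be shown with a quick toy simulation (a sketch, not from the talk; the skewed exponential population and the sample sizes are arbitrary choices for illustration):

```python
import random
import statistics

random.seed(42)  # reproducible toy run

# Toy "population": a heavily skewed exponential distribution with true mean 1.0.
# Individual draws look nothing like a bell curve, yet the means of
# repeated samples cluster tightly around 1.0 -- the CLT at work.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(500))
    for _ in range(2000)
]

grand_mean = statistics.mean(sample_means)  # lands near the true mean, 1.0
spread = statistics.stdev(sample_means)     # near 1/sqrt(500), about 0.045

print(f"mean of sample means:   {grand_mean:.3f}")
print(f"spread of sample means: {spread:.3f}")
```

The point being: no matter how lumpy the underlying population, the sampling machinery behaves exactly as the theory says, which is why those "basic processes" don't stop applying just because a data set is large.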
#Bring it, Big Data.
Oh, they will…
So good. Thank you for sharing.
I also get a small whiff of, "Where's my jetpack?" here.
If we replace this sentence:
"Big Data's advocates believe that all this can be solved with more Big Data."
With this one:
"Big Data's advocates believe that all this can be solved with more research on methods of data modeling."
Do we feel differently about it?
The way the term Big Data is used these days, those two sentences are functionally interchangeable. Data "science" is still mainly heuristics at this point; we haven't had the computing power to run experiments in this field for very long at all.
I completely agree that collecting lots of data about human behavior is problematic, and I love Maciej's 90-day expiration plan. That said, I'd be willing to bet that twenty years from now, no one is going to wish that we, as a society, had spent less money researching ways to use machine learning to understand the life sciences, or that the contributions therefrom will be considered unimpressive.
All of that said, I need to give a talk next month on "Big Data for Social Good" and I'm having a really hard time coming up with material…
The problem is not that big data is not useful in itself; the problem is the adversarial way in which it's used. It's a real shame: with a bit of honestly intentioned regulation, the potential for research in the public interest would be far more interesting than all this ad targeting and Skinner Box fine-tuning.
All data is not a sample, nor are all populations infinite. That's just ridiculous on its face. If my population is the manufacturers of the socks in my drawer right now, I assure you that not only is it finite, but I have a complete data set for it.
My only problem with ābig dataā is that people like Cory seem to think all big data sets are related to people.
Until one fine day… a stray sock shows up.
If you are trying to generalize to socks based on your accessible sample, which consists of the socks made by the manufacturers in your drawer, then the larger theoretical population is what your sock sample was drawn from.
I.e. theoretical population > accessible population > sampling distributions if you are running stats on those socks.
Yes, but my population was the manufacturers of the socks that were in my drawer at the time. I assure you that has not changed just because you decided to create a new population of socks that could ever exist, period, in my drawer.
The population of states of an abstract bit is exactly {0, 1}. It does not change. One does not say this is invalid because they decide to redefine the population to mean the states of every bit that exists, has ever existed and will ever exist. That's just an exercise in sophism.
I see the direction you are coming at this from. But even still, what if you are generalizing about manufacturers? If you are generalizing about manufacturers based on your sample, even though you call it complete, that act of trying to generalize about them requires this theoretical construct of a larger frame of reference. Yes, we can get super esoteric, and I see what you are saying, but I am talking stat theory and you seem to be talking comp sci or maybe set theory? I dunno.
Also, I would note that a bit is not data. It is a datum. I'm not arguing sophistically about data, though. I'm talking stat theory, so if we are talking about different things we oughta acknowledge it and we can both be correct in our domains.
n.b. thinking a bit more. This back-and-forth summarizes the chicken-egg problem with Big Data. Just because Big Data might be exhaustive or somehow total, does not mean that inferences gained from analyzing it are necessarily generalizable. When you cross over that line from data to statistics, it turns a corner and the assumptions underlying the stat methods do not get suddenly suspended because you think you have all the data. It doesn't work that way. Examples are Google Flu, ad targeting and use cases gone awry. In each case, there are or were misfires that happen because even exhaustive data doesn't tell you everything there is to know about something. It's still just a sample.
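That "exhaustive data is still a sample" argument can be sketched numerically. This is a toy illustration of the Google Flu-style failure mode; the distributions, the drift, and all the numbers here are made up:

```python
import random
import statistics

random.seed(0)  # reproducible toy run

# "Exhaustive" data: every single event the process emitted last year.
# We hold the complete record -- and it is still just one draw from an
# underlying generating process that is free to drift.
last_year = [random.gauss(10.0, 2.0) for _ in range(100_000)]
model_estimate = statistics.mean(last_year)  # near 10.0

# The process shifts (new flu strain, new ad market, new user habits),
# and the "complete" data silently stops describing reality.
this_year = [random.gauss(13.0, 2.0) for _ in range(100_000)]
actual = statistics.mean(this_year)  # near 13.0

error = abs(actual - model_estimate)
print(f"estimate from 'all the data': {model_estimate:.2f}")
print(f"this year's reality:          {actual:.2f}")
print(f"forecast error:               {error:.2f}")
```

Having every record changes nothing here: the inference still rests on the assumption that the process generating next year's data is the one that generated last year's.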
Great presentation. Interesting point about the measurement of truckers. The observer effect of quantum physics seems to kick in here; the measurement of a thing impacts the thing itself. Isn't that applicable to much of big data? Who hasn't changed how they communicate online around certain topics (politics, religion, relationships) knowing that someone, somewhere will be observing it? Who buys some things in cash so nobody knows they bought it? The mere collection of big data has influenced the behaviours it's trying to analyse. Doesn't that lower the value of the analysis itself?