Big Data has big problems


The problem is there is no such thing as purely bias-free data. Big data is hoping that by making the dataset big enough, the biases will neutralize each other, but depending on the way the data is obtained, there will always be SOME sort of a bias.
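To make that concrete, here is a minimal sketch in Python (my own illustration, not anything from the linked article). It assumes a made-up collection process: the true population mean is 0, but values at or below zero only get recorded half the time. The function name and the 50% drop rate are hypothetical.

```python
import random
import statistics


def biased_sample_mean(n, seed=0):
    """Mean of n recorded observations from a standard normal population
    (true mean 0) under a hypothetical, biased collection process."""
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        x = rng.gauss(0.0, 1.0)
        # Assumed bias: positive values are always recorded,
        # non-positive values are recorded only half the time.
        if x > 0 or rng.random() < 0.5:
            kept.append(x)
    return statistics.fmean(kept)


# Growing the dataset does not pull the estimate back toward 0.
for n in (100, 10_000, 200_000):
    print(n, round(biased_sample_mean(n), 3))
```

Under these assumptions the sample mean settles near 0.27 rather than 0 as n grows. More data shrinks the noise around the wrong answer; it does not remove the bias.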




Watch out for that large datum!


Leaving aside that it is perfectly cromulent to treat “data” as a mass noun (especially given the subject), “Big Data” is a singular entity, referring to the idea which is explicated in the post, and so should take singular forms of verbs.


The archetypical big data science of old is astronomy. There are all kinds of biases that creep into astronomical surveys, but these can be understood and corrected for - this is a whole science in itself. There are also all kinds of theory-free correlations that pop out of the data, but that doesn’t mean they’re bogus or cannot be understood. One of the greatest examples of this is the Hertzsprung-Russell diagram, which shows how the colour and luminosity of stars relate to one another. When this was found nobody knew what it meant, but pretty soon theories were being put together and we now have a very good understanding of the physical processes behind it.

Applying these same principles to data about more complex biological and societal systems will be harder, but the example of astronomy shows it can be done (and in fact continues to be done with ever bigger astronomical data sets). But we should not expect Big Data to be a magic bullet, suddenly producing amazing new results. It will be slow, and it will require new approaches to understanding and controlling both the inevitable biases and the correlations, real and imagined, that come out. Funders and the public need to know it will be a long slog.


I really don’t understand all the noise made about the failure of Google Flu Trends (as per the linked article) as if it had anything meaningful to say about Big Data in general. Gee, a stunt pulled by a company not known for epidemiological research turns out to be poorly thought out. Imagine that. Real studies by real researchers spend a lot of effort figuring out how to collect data and apply statistical methods to them.


GIGO, correlation is not causation, etc, etc… and applying filters and algorithms is difficult even for math PhDs. I’ve worked with quants. I’ve found that less accomplished math geeks and engineers are able to produce more relevant results by keeping it simple. I found that very accomplished math PhDs have a tendency to get lost in complex theories and delusions of grandeur. When you see their awful results that’s when you know you wasted a lot of time and money. Never give a quant with grand ideas too much money. They’ll empty the company checkbook before you know it.


The bigger the data, the bigger the lie.

Although, I will say this, apart from the obvious flaws. The bigger the data, the more opportunities there are for subgroup analysis. IF the research questions are informed and well-formulated. IF error, bias and confounding are appreciated fully. IF the statistical methods are appropriate for the situation at hand.

The failure of “a company not known for epidemiological research” actually says a good deal about Big Data as it’s been presented to many businesses. The “magic Christmas-land” view of Big Data boils down to nothing less than instant expertise as a service. You don’t have to know what you’re doing. You don’t have to understand the field. You don’t have to invest money in people who know what’s going on, or time in developing expertise. Instead, you just throw enough data at the problem to tease out a correlation. You don’t need to understand a model. When trends change, there’s no model to break. The body of data just changes underneath you and carries your correlations along with it.

This wasn’t an experiment in epidemiology. It was an experiment in machine prediction, using something that people care enough about to give them a lot of data, and using something that is fairly easy to track after-the-fact to tease out what worked and what didn’t. Bringing in people who understood epidemiology would just undermine the entire point.

Obviously, the real world is more complicated than that. All technology is more complicated than the idea men will tell you. But how far you get before you start to really hit those complications gives you an idea of the potential. Google’s failure is notable because it’s Google. They are the biggest of the Big Data companies. They have access to more data than anyone else through their services. They use algorithms to make pretty much every decision (including who to hire). They know what they’re doing. And it still didn’t help that much. You can’t blame the failure on lack of technical execution. You can only blame it on the Big Data concept itself.
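The "throw enough data at it until a correlation falls out" approach has a well-known failure mode that is easy to sketch. This is an illustrative simulation in Python, not a reconstruction of what Google did: the target series and every candidate series are pure random noise, and the function names, series length and candidate counts are all assumptions picked for the demo.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def best_spurious_correlation(n_candidates, n_points=30, seed=0):
    """Strongest |correlation| between a random target series and
    n_candidates unrelated random series (all hypothetical noise)."""
    rng = random.Random(seed)
    target = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


# The more unrelated series you search, the better the "best match" looks.
for k in (10, 1_000, 10_000):
    print(k, round(best_spurious_correlation(k), 3))
```

Because the same seed is reused, the larger search includes the smaller one, so its best match can only be as strong or stronger. None of these correlations mean anything, which is why a match found by brute-force search needs a model or an out-of-sample check before anyone trusts it.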


Back in the 90’s, wavelets were supposed to solve our large data set problems – what happened?

The hardware vendors peddle the same strategy year after year - Don’t miss out on The Next Big Thing!

But to participate, you have to buy lots of hardware or servers!

And companies have such a “me too” herd mentality there is an almost nihilistic zeal to discard the technology that has just matured in favor of the next thing that barely works.

Why is Google Flu the poster child for big data? I thought it was simply correlation between searches for flu and the location where flu was occurring.

Seems rather simplistic and unrepresentative of big data given all the sophisticated analyses happening in public and private workplaces. What would’ve prevented Google from doing this same analysis 10 years ago?

If big data is going to be criticized, I think it should be done when analyzing cases utilizing the full capabilities of new data that is being collected and fused, as well as the sophisticated analytical tools.


I think the main problem is people thinking that “all the data” means there will be no confounding, bias or error. “How could there be any bias? I have ALL THE DATA.”

What they fail to appreciate are the theoretical underpinnings of measures of central tendency and sampling distributions, and what it takes to construct good research. Simply having an exhaustive dataset does not relieve one of statistical due diligence.
