Inherent biases warp Big Data




There is no substitute for common sense. Every rate has a numerator and a denominator. Boston knows its denominator, by neighborhood and by street and traffic patterns, and can tell low-confidence rates from high-confidence rates based on the known populations from which those samples are drawn.

Same kind of thing applies to any form of Big Data: what is the distribution of the underlying population? Is the dataset representative of that underlying population? Is the collected data more representative in some places and less representative in other places? Can you characterize the variation? Can you describe potential sources of that variation?

If you don't know your denominator and can't characterize it, then any numerator data that you collect should be looked at with a squinty, skeptical eye until you can understand how your sample fits into the larger picture.
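To make the denominator point concrete, here is a minimal sketch using the Wilson score interval, one standard way to put a confidence range around a rate. The neighborhood counts are made up for illustration; nothing here is Boston's actual data or method.

```python
import math

def wilson_interval(events, population, z=1.96):
    """Approximate 95% Wilson score interval for the rate events / population."""
    if population <= 0:
        raise ValueError("population (the denominator) must be positive")
    p = events / population
    denom = 1 + z**2 / population
    centre = (p + z**2 / (2 * population)) / denom
    half = z * math.sqrt(
        p * (1 - p) / population + z**2 / (4 * population**2)
    ) / denom
    return centre - half, centre + half

# Hypothetical neighborhoods: same observed rate (10%), very different denominators.
small = wilson_interval(events=2, population=20)
large = wilson_interval(events=200, population=2000)

print("small denominator:", small)
print("large denominator:", large)
```

Both intervals are centred near the same 10% rate, but the one built on 20 observations is far wider. Without the denominator you cannot tell which of the two situations you are in.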

The caveat to Big Data is that all data are a sample, no matter how big. Even if you sampled every single living human, you would still have a sample of billions, from among a theoretical population of infinity across all time. Our mistake with Big Data is in thinking that an exhaustive sample is synonymous with having everything. It's still just a sample.


The theory of Big Data is that the numbers have an objective property that makes their revealed truth especially valuable.

This is simply untrue.

Every data analyst since before the invention of the computer has understood that datasets have problems with modeling, accuracy, and interpretation. Only a complete fool thinks that merely adding data quantity improves the quality of a data model.

The pothole example given in the article is a nice one, but of course it's not Big Data -- 20,000 potholes a year, pssh. Moreover, as the article says, the government office that uses that data understands the analysis and interpretation issues of this highly constrained dataset quite well, so in fact the article defeats the quoted premise above.


The promise of Big Data has always been the elimination of data analysts. It's a matter not of working smarter, but of working harder. The only difference is that a computer can work harder, faster, and cheaper than a human. It is a fundamentally brute-force approach that speculates that -- if the data set is large enough -- then no model is needed. The data set IS the model. When reality changes, the data changes, and the model changes along with it.

So the first point is that Big Data is about trying to automate basic observational analysis. The second point is that this basically means the promise of "dumb expertise". A user may not understand why two things correlate, and probably won't care. That they correlate very, very well is enough. Big Data has ultimately promised a world where you can eventually fire most of your economists, marketing team, and analysts. This has never been feasible, but the people who understand that it isn't feasible are also the people whose jobs they're trying to eliminate.

The pothole example is cited because it's easy to spot. It's easy to see that the data points disproportionately at wealthier sections of town, and easy to figure out poorer people are less likely to have a smartphone. It's cited because it is easy to demonstrate and to understand that the entire method has a large blind spot. Similarly with Google Flu. It's a warning about every other algorithm where there is probably also a blind spot, just not as glaring.
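A toy simulation of that blind spot, as a rough sketch only. It assumes two areas with the same true number of potholes, and that each pothole gets reported with probability equal to the local smartphone ownership rate. The ownership rates and area names are invented for illustration.

```python
import random

def simulate_reports(potholes_per_area, ownership, rng):
    """Count reported potholes, assuming only smartphone owners report them."""
    return {
        area: sum(rng.random() < rate for _ in range(potholes_per_area))
        for area, rate in ownership.items()
    }

# Hypothetical ownership rates; both areas have the SAME true pothole count.
ownership = {"wealthier": 0.8, "poorer": 0.3}
rng = random.Random(0)

shares = {}
for n in (1_000, 100_000):
    reports = simulate_reports(n, ownership, rng)
    shares[n] = reports["wealthier"] / sum(reports.values())
    print(n, reports, round(shares[n], 3))
```

The true share of potholes in the wealthier area is 50%, but the reported share settles around 0.8 / (0.8 + 0.3), roughly 73%. Collecting a hundred times more reports makes that wrong number more stable, not more correct.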

The most useful concept for addressing this is probably one of resolution. Resolution is a simple concept that all technical people understand: if you want to measure something on the order of microns, then a resolution on the order of cm is just not going to cut it. No matter how many cm-scale measurements you make, you're not going to get the micron resolution you want. Less understood is the notion of statistical resolution (aka statistical power), but at least everyone recognizes it's important even if they don't completely understand it. We are now building an awareness that resolution is important for even large, networked data sets. In the past it's been largely overlooked.
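A quick way to see the resolution point, as a sketch under stated assumptions: an instrument that rounds to the nearest whole cm, with measurement noise much smaller than that step. The true length, noise level, and function name are all hypothetical. (If the noise were comparable to the step size, averaging could recover some sub-step detail, so this case is deliberately the low-noise one.)

```python
import random
import statistics

TRUE_CM = 1.23456  # hypothetical true length

def mean_reading(true_cm, n, step_cm=1.0, noise_cm=0.05, seed=0):
    """Average n readings from an instrument that rounds to the nearest step."""
    rng = random.Random(seed)
    readings = [
        round((true_cm + rng.gauss(0, noise_cm)) / step_cm) * step_cm
        for _ in range(n)
    ]
    return statistics.fmean(readings)

# Absolute error of the averaged reading for a small and a large sample.
errors = {n: abs(mean_reading(TRUE_CM, n) - TRUE_CM) for n in (10, 100_000)}
print(errors)
```

Under these assumptions essentially every reading comes back as 1 cm, so the average stays about 0.23 cm off whether you take ten readings or a hundred thousand.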


No, I'm afraid I disagree, WB. The premise of Big Data has nothing to do with eliminating analysts, at least not as presented by any sane view of the technology.

Admittedly, know-nothing consultants may sell that idea to know-nothing IT directors or marketing VPs, but that's just smoke.

As you may know, Big Data arose from Google solving a particular problem they had with a particular data and computing architecture. As often happens with in-house Google solutions, when they published their result, everyone + dog decided to replicate it and then to apply it to all kinds of inappropriate problems, and then to apply appropriate but non-Big-Data tech to those problems under the same name, and somewhere in there the definition of the thing got totally lost.

That's when the consultants and tech bloggers and the market analysts at Gartner and IDC and so on started their feeding frenzy.

As it used to be known back in the Olde Tyme Days of the late 2000s, Big Data is a large distributed data store -- typically with weak or nonexistent ACID characteristics and a query language much less powerful than SQL -- capable of storing more total data than a conventional relational DB as well as heterogeneous data which may not even have any schemas known in advance. That data storage facility combined with a distributed computing framework for processing the distributed datastore -- i.e. Map-Reduce or something similar -- is a particular thing with a particular use.
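For anyone who hasn't seen the Map-Reduce pattern, here is a single-process toy version of the classic word count. It only illustrates the shape of the idea (map, group by key, reduce); it is not Google's implementation or any real framework's API, and all function names are made up. A real system would spread each phase across many machines.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = map_reduce(["big data big claims", "data is a sample"])
print(counts)
```

Note that nothing in the pattern interprets the result. It counts what it is told to count, which is why the analysis still has to come from somewhere else.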

None of this has anything to do with eliminating analysts.

To the extent that Big Data now means "any reasonably large amount of data, or any raw data, or any data whatsoever having to do with marketing or demographics, or really anything at all which will convince you to buy my consulting service", well sure, maybe somewhere in there "eliminating analysts" is at least a marketing goal. But that's not a feature of the underlying technology. Perhaps there has been some confusion of Big Data with the NoSQL movement or with other disintermediation movements in ICT intended to distribute authority to business units and executives with purchasing power and away from IT specialists.


Nicely stated! I'd in fact argue that the "Data Scientist" that's becoming all the rage is really a technically astute analyst who can work with these heterogeneous and schema-less data stores. We're going to need more analysts, not fewer.


Oh Christ, we've had Object Oriented Data Models, and Ontologies, and various APIs that mutated into doomed query languages consisting of 500,000-line slush piles, and always the quants who promise to turn straw into gold. Wake me when this latest plague of locusts has passed.


And don't forget to ask if the goal is an honest inquiry or just to support a decision that has already been made. Usually the cake has already been baked.


I suspect that it is also cited because it's a decent example of the ways that you can use nice, soothing, Objective Math, to effect, or continue, policies that would be impolite to engage in openly.
Some sampling biases are innocent mistakes that crop up because sampling is nontrivial. Some are convenient mechanisms for doing what you can't do openly by other means. Simply having a policy of providing better service to people you care more about would be dreadfully impolite. Having a sampling mechanism that happens to result in the people you care about getting better service? Much better.

