Soooo…economic activity! Inconvenience is the driving force of a market-based society.
My (fairly limited) exposure to ‘big data’ suggests that we probably ought to be doing a lot more sampling and a lot less collecting everything.
Consider, as the article mentioned, the use of old data to detect long-term trends. Is there really anything you can do with 100% of the old data that you could not do with 10% or 1% of it? If not, is it worth paying 10 or 100 times the storage costs (or more, if it tips you over into needing more elaborate storage), plus something extra on the analysis side, even if that extra is only pulling a random sample from the full dataset to analyze?
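To make that concrete, here's a minimal sketch (hypothetical data, standard library only) comparing a long-term trend estimate from a full dataset against one from a 10% simple random sample:

```python
import random
import statistics

random.seed(42)

# Hypothetical "old data": ten years of daily measurements with a slow upward drift.
full_data = [100 + 0.01 * day + random.gauss(0, 5) for day in range(3650)]

# Keep only a 10% simple random sample -- roughly a tenth the storage.
sample = random.sample(full_data, k=len(full_data) // 10)

# For a broad question like "what was the average level over the decade?",
# the two answers land very close together.
print(statistics.mean(full_data))
print(statistics.mean(sample))
```

The sample's answer wobbles by a little statistical noise, but for coarse trend questions that noise is usually far cheaper than storing everything.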
This is the most convoluted way to deliver a simple message that I have seen in a while, and it doesn’t really talk about security (which it implies it will) but about privacy.
I get the point, but still…
It’s a clever point but actually sort of painful. I used to work in research years ago, with access to both restricted and unrestricted census data. We could do tons with the unrestricted data, which was a small representative sample of the full restricted data, and we preferred to use it. However, the things we learned were:
- It takes A LOT of effort to create a representative sample; the unrestricted sample data lagged significantly behind the restricted full data because accurately creating that sample was hard work.
- The sample was representative only in terms of certain populations. Sometimes we even had different weights to apply to the sample depending on whether we were measuring things like wealth vs. things like ethnicity.
- If we had questions well outside the mainstream (“how did wealth grow in this rural county over 3 decades?”), then we usually had an insufficient sample in the unrestricted data and had to fall back to the full restricted dataset.
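The weighting point can be sketched like this (hypothetical numbers; real census products publish weights alongside each sample record):

```python
# Each sampled household carries a weight: roughly, how many households
# in the full population that one record stands in for.
households = [
    {"income": 30_000, "weight": 120.0},
    {"income": 55_000, "weight": 95.0},
    {"income": 250_000, "weight": 40.0},  # high-income households sampled at a higher rate
]

# The unweighted mean treats every record equally, so it over-counts
# whichever groups were oversampled.
unweighted = sum(h["income"] for h in households) / len(households)

# The weighted mean corrects for the unequal sampling probabilities.
total_weight = sum(h["weight"] for h in households)
weighted = sum(h["income"] * h["weight"] for h in households) / total_weight

print(unweighted, weighted)
```

Notice that a single set of weights only corrects for the dimensions the sample was designed around, which is exactly why different questions sometimes needed different weights.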
So practically speaking, sampling is valid if you’re trying to share your dataset with others externally (which, if your company sees its data as an asset, it will probably not want to do). However, a sample is less expressive than the full dataset and expensive in terms of the effort needed to make it accurately representative.
I’d also like to point out the obvious truth that data is an asset only if you can actually use it. I’ve worked at one company whose only real market edge was industry data going back a decade; that was extremely valuable and something they could sell as a service (where do you rank in the industry? we can tell you and no one else can). OTOH, mass-collecting every website click is probably only going to help your UX, and even then, only if you actually use it. Otherwise, if it’s just taking up space on a server, you might as well ditch it.
The census is actually a great comparison, because respondents are fully aware of the questions and the information they’re giving.
It’s true that companies are tending to collect more than is necessary, partly because storage is cheap but also because they’ve already encountered situations where they realize the data or sample they have is incomplete, and points to a far more interesting question that is not presently being tracked. Making the necessary adjustments to the sampling or data collection requires a fresh start, at which point the point where you can answer the question just moves further back.
But even if you are collecting more than necessary, the liability aspect really comes down to what business or organizational question you’re trying to answer. If you’re Amazon, do you really care what people browsed 5 years ago? If you’re eBay, it may be interesting to store datapoints for the raw prices of items, averaged or smoothed out, but it’s useless to store bidding information past a certain point. Sadly, most organizations only consider this when they get the bill for their storage.
Data is both an asset and a liability.
If it were only a liability people would not keep it around.
I do not at all claim that there is no utility to complete datasets or that some organizations are not actually using the completeness of their datasets. The US Census is a good counterexample, and is used by a huge number of researchers for all sorts of questions.
However, the Census does real-world data collection; it isn’t just hoovering up events from within a computer program connected to the internet. Real-world sampling is vastly harder, in large part because it is fantastically difficult to collect all the data or to collect a sample without bias. For data collected during the execution of a computer program, however, it is not very hard to collect almost all the data (you do miss certain error cases), and it ought to be pretty easy to collect a truly random sample of that data. Under those conditions you ought to be losing only statistical power, and losing it much more slowly than you are reducing your costs.
But I don’t think most of the organizations doing complete data collection are digging into it so deeply.
I went to a talk about data collection in ‘Gears of War (2?)’ the video game. They had a record of every shot fired in the game. They were looking at this data to decide if they needed to alter the game balance. On the order of a billion shots. In fact it was too much data to query in a reasonable amount of time, so they actually took small slices and queried those. But they were still storing every shot record. For the questions they were going to ask, however, it was just ridiculous overkill. They were spending big money for statistical power they were never going to use.
That conclusion requires more supporting premises, such as: “people always make rational decisions.”
Behold, those extra supporting premises are suspect:
code is a liability, not an asset
Which is different from data, but anyway. Deleting code is my favorite fix for broken software.
This is like saying money isn’t necessarily an asset because the bearer might lose it or waste it all on something stupid that doesn’t really help them accomplish their goals.
All sorts of interesting things can be done with data, many of which we haven’t thought of yet, which is an argument for keeping it around. It might not be an overriding argument compared to the associated drawbacks that Cory refers to of course.
My point exactly. I’m not certain whether any given repository of data is an asset or a liability, but I don’t think we can just assume that the people who make the decision to maintain the data are doing so because it is definitely a net asset.
Hm. If data is a liability then surely those who are most invested in compromising our privacy are on the verge of bankruptcy, no?
Then surely you’d agree that the title of this post (“Data is a liability, not an asset”) is potentially incorrect. Which was the point I was actually trying to make.