MI5 warning: we're gathering more than we can analyse, and will miss terrorist attacks




Nice. “In the national interest” - it falls under that definition, I’m sure.


Too much data? We must gather yet more data so we can do a meta-analysis to find out which data we should be paying the most attention to.
Better move from Cheltenham to somewhere with a bigger river; all those data warehouses are going to need a lot of cooling.


Orwell must be turning in his grave.

it is obvious on its face that if you're looking for a needle in a haystack, you do yourself no favours by making the haystack as large as possible.
I don't think that's a good way of describing what mass surveillance does or how it fails. Obviously no matter your data collection methods (mass or otherwise), you're always starting with far more data than you need - but that's not a bad thing. That's why you depend on many different kinds of filters to eliminate unrelated data from your analysis.

Now, you could argue that restricting your initial data collection is a kind of filter, but it’s a pretty crude one. There’s no reason to think that your restricted initial collection has necessarily captured all available relevant data and excluded only irrelevant stuff. If you wanted to find out, you could also collect the stuff outside your focus, and then just design more sophisticated analysis tools to filter out the noise introduced by more data. I think this is what MI5 is actually saying: we have the collection down pat, but our ability to make sense of it is lagging behind. We need more funding for advancing our data mining systems and analyst budget, so that the nightmare of total government insight into all of society can be finally achieved.

We should not argue against mass surveillance on the basis that it doesn’t work. It doesn’t currently work, but it can, and that’s what makes it so dangerous.


Sounds to me like an argument for an increased personnel budget.


Let’s keep making that haystack bigger! We’re sure to find more needles that way!


But why do they need all available relevant data? If you’re trying to catch suicide gunmen you don’t need every detail of their planning, you just need one relevant clue to cue further investigation.


This is a fairly common thing you learn in science. Most of the data you could collect is not very useful. Take samples, but don’t just sample noise. Use your intuition to tell you what is useful to collect.

Of course these spooks aren’t collecting this data to “stop terror attacks.” They’re collecting it for post hoc analysis. After they catch or identify an attacker using conventional techniques (warrants, shoe-leather) they will pick through their pile of data and they might find something that helps them find conspirators. They could of course just get that data with a warrant, but they’re lazy. They just want the computer to tell them who the terrorists are, rather than investigate. Their dreams of a perfect system greatly exceed their grasp, probably because the people who come up with these ideas greatly underestimate the amount of work and thought and experience it takes to make sense of the data, not to mention a failure to understand the fundamental logistical/statistical limitations of this approach.

Depending on what you’re doing it sometimes is easy to take a lot of data, more than you may need. So why not take everything? Lots of reasons, mostly logistical and statistical. It’s easy to overburden yourself with too much data that you can’t draw good conclusions from, or which you might be able to, but the effort involved would consume time without really advancing the goals of the project. Use your intuition to collect the data you think is relevant, then analyze it the best way (which is often, but not always, the simplest way), and then draw your conclusions and move on to the next experiment.

Say you’re doing a time-course experiment on a machine in an automated fashion. The conversation might go like this:

Me: We tell the machine to take a reading every 5 minutes for an hour, which gives us 12 data points.

You: Well what if we tell it to take a reading every 5 seconds, then we have 720 data points! More data!

Me: Yeah, but what will that accomplish? The reaction proceeds slowly, so we’re not going to see changes on a 5 second scale, so why bother taking that extra data?

You: True, but it doesn’t cost anything to get that data, so why not?

Me: Because it does cost something, it’s more to analyze.

You: But the analysis is just a computer script right? So it doesn’t take any more human time.

Me: True, but I still have to think about it, and thinking about 12 data points is much easier than thinking about 720. The real issue though is that most of those 720 points are not informative. Meaning, the differences between a reading at Time = 5:00 minutes and Time = 5:05 minutes will most likely just be noise in the machine, technical error, not due to the reaction itself.

You: But won’t the statistics “sort all that out”?

Me: They could, but since the proportion of informative data points is so low, I’m going to lose a lot of statistical power. You can get around this, but it’s going to need a potentially more complex model. Imagine: The difference between Time = 0 minutes and Time = 5 minutes should be pretty large, and we will see a change, but the difference between Time = 0:00 and Time = 0:05 will be small, or even negative. So comparing each point to the one before it will likely show no significant change, even though the overall trend is positive. Instead of just comparing each measurement to the previous one in a repeated-measures-based approach, I’m going to have to fit a full regression on those points, which yes, happens on a computer, but it’s still more work. I may also have to fit a non-linear regression to get the same result, which is more difficult still and makes the result harder to interpret.

You: Well then why don’t we just measure every 5 seconds and then you can just pull data from every 5 minute interval, so it’s the same analysis.

Me: That’s no different than just doing what we were doing!

You: But then we’ll have it!

Me: Yes, but if we collect data we incur an obligation to analyze it. Imagine defending this to a peer-reviewer. Are you gonna say “We collected all this data, and then ignored 99% of it, and the points we kept were chosen fairly arbitrarily based on what was easier to analyze”? Now we look like we’re throwing out potentially relevant data.

You: Well shouldn’t we be capturing as much data as possible?

Me: No. We should be collecting as much relevant data as possible. It’s a waste of time and money to collect useless data. We know this protocol works. It makes useful predictions, it tells us when this reaction is working or not and the rate at which it happens. This data successfully informs future experiments. We’re doing an elegant experiment.

If we were having problems, then maybe we would take more data, but there’s a fine line between useful data-collection and OCD data-collection frenzy.

You: Well I still think we should do it.

Me: Well then you can analyze the experiment, we’ll need the results tomorrow.

You: But I don’t know how to do statistics, or use the analysis software!

Me: My point exactly.
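
A minimal sketch of the statistical point in the conversation above, using simulated data. Everything in it is an assumption for illustration: the linear reaction rate, the Gaussian noise level, and the function names are hypothetical and not taken from any real experiment. It compares the "is each reading higher than the previous one?" approach at 5-second and 5-minute sampling, then fits an ordinary least-squares line to both.

```python
import random

def simulate(interval_s, duration_s=3600, rate_per_s=0.001, noise_sd=0.05, seed=0):
    """Hypothetical slow reaction: the signal rises linearly with time and
    every reading carries independent Gaussian instrument noise (assumed)."""
    rng = random.Random(seed)
    times = list(range(interval_s, duration_s + 1, interval_s))
    readings = [rate_per_s * t + rng.gauss(0.0, noise_sd) for t in times]
    return times, readings

def fraction_of_rises(readings):
    """Share of consecutive pairs in which the later reading is higher."""
    pairs = list(zip(readings, readings[1:]))
    return sum(later > earlier for earlier, later in pairs) / len(pairs)

def ols_slope(times, readings):
    """Slope of the ordinary least-squares line through (time, reading)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_r = sum(readings) / n
    sxy = sum((t - mean_t) * (r - mean_r) for t, r in zip(times, readings))
    sxx = sum((t - mean_t) ** 2 for t in times)
    return sxy / sxx

# One reading every 5 seconds for an hour: 720 points.
fast_t, fast_y = simulate(interval_s=5)

# Keep only the readings at 5, 10, ..., 60 minutes: 12 points.
slow_t, slow_y = fast_t[59::60], fast_y[59::60]

print("points:", len(fast_y), "vs", len(slow_y))
print("consecutive rises, 5 s sampling:  ", fraction_of_rises(fast_y))
print("consecutive rises, 5 min sampling:", fraction_of_rises(slow_y))
print("fitted slope, 5 s sampling:  ", ols_slope(fast_t, fast_y))
print("fitted slope, 5 min sampling:", ols_slope(slow_t, slow_y))
```

Under these assumed numbers, the expected change over 5 seconds is much smaller than the noise, so point-to-point comparisons at that resolution behave close to a coin flip, while the 5-minute steps are large enough to show up directly. The regression recovers the trend from either set, which is the trade-off described above: the dense data is usable, but only with the heavier analysis.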


Why do I get the feeling you’ve had that exact conversation at least once (and probably many, many times)?


I would prefer “inductive logic” to “intuition”.
To paraphrase Shannon, simply gathering more data actually reduces the signal to noise ratio. That’s also the challenge for CERN. But the information scientists at CERN have an idea of what they are looking for, and a clever triage system. It is that which makes the problem soluble.
The trick is not to collect more data than you can analyse, until you have that method of triage which enables the data to be reduced automatically until the signal to noise ratio allows the extraction of information. Or to put it another way, prove your idea works before betting the farm on it.


I generally agree, but I don’t think the distinction between “inductive logic” and “intuition” is very meaningful in this case. If you’re doing something for the first time, you have nothing to base the logic on, just your experience and intuition. You might say that your experience with or knowledge of similar experiments is the “inductive logic” you’re using, but that’s just intuition isn’t it?

that method of triage which enables the data to be reduced automatically until the signal to noise ratio allows the extraction of information

That makes a lot of sense for CERN, but writing software to do that for experiments you do every day would seem like a lot of work, and would require a lot of specialized training. It’s making a mountain out of a mole-hill. Most researchers don’t have the need for that, or the ability to program that. Moreover most peer-reviewers wouldn’t be able to criticize such a system. In fact I’m pretty sure what you’re suggesting is what statistics already does, though not always explicitly (except in the case of an informatics-based approach, like doing model selection/comparison).

CERN’s difficulties are not generally applicable to most labs, where you can rarely do the amount of replication you’d like to. Some people tend to worry a lot about whether they are using the theoretically-most-mathematically-and-statistically-perfect method while they’re limited for financial/logistical reasons to only having 10 samples in their experiment anyway. They waste time and money on “quarterbacking” their methods in theory instead of just doing some experiments. Any chemist will tell you that you can do all the theory you want, but when it comes down to it the reaction will either work or it won’t. The only way you figure it out is by doing it a lot. Most of the PhD projects I’ve seen synthetic chemists do boiled down to "I spent 2 years working on a reaction, turned out the solution was to mix it up in a smaller container, and add the reagent drop-wise while shaking vigorously, then it works great. The final years were spent doing 20 variations on the reaction and quantifying carefully." No amount of reading or modeling chemistry is likely to tell you that’s the secret, you just have to do it.

I remember an organic chem lab I took. We banged our head against the wall for 3 hours with a simple reaction. “Mix A and B, heat to 70 degrees for 10 minutes, collect crystallized product by filtration.” Tried 5 times, wouldn’t work. So at some point I said “Fuck it” and just boiled the solution and got tons of product.

“What’s the deal!” I exclaimed to my lab instructor (who was some chemistry PhD student). “It says 70! But that doesn’t work, did I do it right?” “Did you get the right product when you boiled it?” she asked disinterestedly. “Yeah, I got lots, melting curves are just right,” I replied. “Then you did it right, welcome to chemistry.” Yawn.


Ah. I am sorry; to paraphrase Rutherford, I was writing about science, I did not realise you were writing about stamp collecting. :)

[edit - fair enough. Myself, I would say that practical chemistry has a complexity problem that is hard to control. But I still think your examples are induction rather than intuition, as in "well, the last time this kind of thing happened we did X and it worked, try a bit of X."
As for mistakes in textbooks…I do suspect that as with cookbooks, the authors have often not actually tested their recipes and mistakes get propagated.
And no, I don’t really agree with Rutherford.]



Oh I see, you saw the word “science” and then wanted to make a joke.



You appear to be responding to me rather than @nimelennar.
The XKCD seems to me one of the very much less inspired efforts - the remark about stamp collecting was indeed Rutherford not Feynman, and the argument in the strip is pretty stupid - biology has only been able to progress because of the work of physicists, chemists and engineers, and these have become much more effective due in part to the longer life expectancy and so on brought about by modern medicine, in a virtuous circle.

I have edited my original comment that you objected to. It wasn’t my intention to cause offence.


I don’t think that Munroe was trying to slight physicists at all. I think that he was just trying to puncture the egos of those physicists who slight other fields (like biology).


Nor did I, I thought the argument was stupid - both sides put up straw men. I’m a bit annoyed with bad arguments at the moment because of the sheer hideous awfulness of the UK referendum “debate” - which seems to consist of racists shouting at bureaucrats.

But my reason for posting was that @clayton_coffman seemed to have got annoyed with the wrong person.

