My old philosophy of science professor, Donald Gillies, used to tell a story about attending Karl Popper’s undergraduate lectures at the LSE in the 1960s. If I remember the story rightly, Popper used to begin by simply commanding his students: “Observe!” The students would look about them, try to avoid his gaze, shift awkwardly in their seats. After some minutes a student would pipe up: “Observe what?”
“Exactly!” would be Popper’s triumphant reply.
Popper (and, in his turn, Professor Gillies) would then proceed to unfold his version of the argument that all observation begins with some idea about what is significant and what you expect to see, and so that all observation is to some extent theory-laden.
This came to mind as I was thinking about what Edward Snowden’s leaks have and have not revealed about government surveillance. John pointed out in an earlier post some of the things we know now that we didn’t know before about the extent of data gathering undertaken by the government. But it’s far from clear what sort of reasoning the NSA is using in dealing with this deluge of data. Lots of people have been asking, quite rightly, “is it ethical?” But a more basic question would be: Which theories are guiding their observations? What are the research strategies with which they hope to draw some meaning from all this information, and are they any good?
I recently came across a blog post by Zeynep Tufekci, an assistant professor at the University of North Carolina at Chapel Hill, a fellow at Princeton CITP and a faculty associate at Harvard’s Berkman Center for Internet and Society, that raises this question in a way I found useful.
She highlights the importance of a comment by NSA deputy director Chris Inglis, in testimony to Congress, that the NSA went out “two or three hops”. This, she explains, is hugely significant for the extent of the data trawling. One hop would mean looking at all your friends. Two hops takes in all your friends’ friends. Three hops is your friends’ friends’ friends. I’m using “friends” here in the Facebook sense of the term, as contacts in your network. Assuming that everybody has 300 such contacts, and ignoring any overlap between people’s networks, going out “two hops” in someone’s network would take you to 90,000 people. Three hops would get you to 27 million.
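The arithmetic behind those figures is just repeated multiplication. Here is a minimal sketch in Python, assuming (as above) that everybody has exactly 300 contacts and that nobody's contact lists overlap. The function name is my own, purely for illustration, and has nothing to do with how the NSA actually counts.

```python
def upper_bound_reach(contacts_per_person: int, hops: int) -> int:
    """Upper bound on the number of people sitting `hops` steps away.

    Assumes every person has the same number of contacts and that no
    two contact lists overlap, so each hop multiplies the count again.
    """
    if contacts_per_person < 0 or hops < 0:
        raise ValueError("contacts_per_person and hops must be non-negative")
    return contacts_per_person ** hops


CONTACTS = 300  # the simplifying assumption used in the text

for hops in (1, 2, 3):
    print(f"{hops} hop(s): up to {upper_bound_reach(CONTACTS, hops):,} people")
```

Real networks overlap a great deal, so these numbers are ceilings rather than estimates, but they give a sense of how quickly each extra hop widens the net.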
“At three hops out, you cannot examine individuals—you have to start relying on easily identifiable markers. You have to squint and look at outlines. You have to engage in pattern recognition. If you have swept meta-data on 27 million individuals, what do you look for? Males? Muslims? Those who bought guns? Fertilizers? Pressure cookers? It has to be something. There simply cannot be a semblance of individual examination. By necessity, the “sweep” has to be algorithmic and be trained to look for specific behaviors. (And pattern recognition is often another way of saying stereotypes.)”
“Pattern recognition is great for big things that happen again and again. Things that jump out from the data and that happen repeatedly so we can figure out commonalities. Storms and hurricanes. Supernovas. Patterns of migration. Family formation. Economic development.
But you know what pattern-recognition is worst at? It’s picking out things that are rare, for which there is not enough data to pick out regularities.”
Pattern recognition, she argues, is likely to both “underfit” and “overfit” the data, that is, fail to pick out “the sweet-faced Jahar who attended his high school prom,” and (at least after the attack on the Boston Marathon) succeed in identifying every Muslim who bought a pressure cooker.
“Even at two hops, let alone three, NSA’s surveillance program can be bad for anti-terrorism by shifting focus and resources from individual investigations, a more fitting method for rare events, to pattern recognition—which this data deluge will almost surely necessitate—and which is not a good method for detecting rare events.”
Basically, she suggests that the data tail may be wagging the analytic dog, encouraging inappropriate research strategies. I also found a piece in Slate that raises some similar concerns about the research strategies encouraged by the capacity to trawl huge amounts of data.
I’m not sure where I stand on all this, but Tufekci’s blog post raises interesting questions about the epistemology of big data and the lack of transparency of reasoning in the NSA’s approach to it.