Yesterday the Cato Institute released a paper that Jim Harper and I co-authored. Harper and I have been working on this paper for about two years.
Here is a link to the abstract. And here is a link to the full paper.
One of the big challenges we faced in drafting this paper was dealing with all of the confusion surrounding data mining. It turns out that what counts as data mining depends on whom you ask.
The key point of our paper is that the form of data mining that uses historical incident data to derive a pattern, and then uses that pattern to predict a future event, is not helpful in the terrorism context because there isn't enough historical data to produce a meaningful and statistically reliable pattern. Thus, we settled on the term "predictive data mining" to distinguish what we characterize as ineffective from the many other effective uses.
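To see why the base rate matters so much, here is a back-of-the-envelope sketch. Every number in it is hypothetical and chosen only to show the arithmetic, not drawn from the paper: even an implausibly accurate classifier, applied to a population in which plotters are vanishingly rare, flags enormous numbers of innocent people.

```python
# Hypothetical back-of-the-envelope illustration of the base-rate problem.
# None of these numbers come from the paper; they only show the arithmetic.

population = 300_000_000        # people scanned
plotters = 3_000                # assumed number of actual plotters (hypothetical)
sensitivity = 0.99              # P(flagged | plotter)      -- generously high
specificity = 0.999             # P(not flagged | innocent) -- generously high

innocents = population - plotters

flagged_plotters = sensitivity * plotters
flagged_innocents = (1 - specificity) * innocents
flagged_total = flagged_plotters + flagged_innocents

# Probability that a flagged person is actually a plotter (precision).
precision = flagged_plotters / flagged_total

print(f"Innocent people flagged: {flagged_innocents:,.0f}")
print(f"Plotters flagged:        {flagged_plotters:,.0f}")
print(f"Precision of a flag:     {precision:.4%}")
# With these assumptions, roughly 300,000 innocent people are flagged against
# about 2,970 plotters, so fewer than 1 in 100 flags points to a real plotter.
```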
This paper also highlights a real governmental need to efficiently locate, access, and aggregate information about specific suspects. To illustrate this point, we show that, starting with two primary suspects, available data points, and existing laws, a good number of the 9/11 terrorists could have been identified through a very narrow, subject-based investigation before September 11th.
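As a rough illustration of what this kind of subject-based, outward-expanding search looks like, here is a minimal sketch. The names and associations below are invented for illustration and are not the actual 9/11 records discussed in the paper: the idea is simply to start from known suspects and follow shared records outward a few hops.

```python
from collections import deque

# Hypothetical association data (shared addresses, phone numbers, travel
# records, etc.). The names and links are invented for illustration only.
associations = {
    "suspect_A": {"person_1", "person_2"},
    "suspect_B": {"person_2", "person_3"},
    "person_1":  {"person_4"},
    "person_2":  {"person_5"},
    "person_3":  set(),
    "person_4":  set(),
    "person_5":  {"person_6"},
    "person_6":  set(),
}

def expand_from(seeds, max_hops=2):
    """Breadth-first expansion from known suspects, limited to a few hops
    so the search stays narrow rather than population-wide."""
    found = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        person, hops = frontier.popleft()
        if hops >= max_hops:
            continue
        for contact in associations.get(person, set()):
            if contact not in found:
                found.add(contact)
                frontier.append((contact, hops + 1))
    return found

# Starting from two primary suspects, only a handful of people surface.
print(sorted(expand_from({"suspect_A", "suspect_B"})))
```

The point of the sketch is the contrast with predictive data mining: here the search begins with particularized suspicion and touches only the records of people connected to it.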
Make no mistake about it: data mining has many valuable uses, from reducing corporate direct-marketing costs to classifying celestial objects and even medical research. It just is not well suited to discovering the underlying patterns of low-incidence terrorism.
I enjoyed your article and found it interesting. I feel the lack of positive cases of terrorism does, um, undermine the ability to perform supervised learning.
There are certainly conceptual problems in using mining as a sentinel for further humint operations. Your analogy to the medical decision process is a misleading one, however, on a couple of different levels. For one, it is machines, not people, that are examining the data. If a stone sees me naked I don't feel my privacy has been violated, because a stone has no consciousness; it isn't aware and cannot act. A machine does have the capacity to enable a person to violate my privacy later, and that's a problem, certainly, but it's a problem of managing data and privacy.

There are two other aspects of the 3 million false positives example that are essentially misleading. First, having a biopsy is far more painful than having your email parsed by a machine behind your back. (We don't mine for cancer, a far greater threat to our well-being.) That leads me to the second misleading point: the costs and benefits are not being properly analyzed in the analogy. Not only does it bootstrap us into a false impression of cost, it ignores the easy use of a cost matrix. Technically speaking, in the case of actionable intelligence you want extremely high precision rather than high recall if your goal is to prevent terrorism without violating people's rights or wasting tons of humint resources.
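To make the cost-matrix point concrete, here is a toy calculation; every number in it is made up, and the costs are arbitrary units, but it shows how unequal error costs change what you optimize for:

```python
# Made-up numbers illustrating a cost matrix over screening outcomes,
# rather than treating every false positive as if it were a biopsy.

# Confusion counts from a hypothetical screening run.
tp, fp, fn, tn = 50, 10_000, 20, 9_990_000

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Cost of each kind of outcome (arbitrary, invented units).
cost = {
    "fp": 1,        # an analyst-hour wasted reviewing an innocent lead
    "fn": 100_000,  # a missed plot
    "tp": 10,       # follow-up investigation
    "tn": 0,
}
total_cost = (fp * cost["fp"] + fn * cost["fn"]
              + tp * cost["tp"] + tn * cost["tn"])

print(f"precision={precision:.4f}, recall={recall:.4f}, total cost={total_cost:,}")
```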
A couple more things: I think what we're talking about here is not really just data mining but also text mining. I think, perhaps for different reasons than Marti Hearst does, that data mining and text mining are not equivalents. A sentinel system is going to be built on a combination of text mining and data mining, not data mining per se, though of course data mining and text mining are hardly disjoint. Also, supervised learning is not the full extent of mining. Unsupervised learning could be used to learn more about populations that foster greater rates of terrorist acts.
Mining to generate actionable intelligence is a task that faces numerous barriers. But foremost among the barriers is certainly the lack of frequency of terrorist acts. Great point. Nice work.
Posted by: Patrick Herron | December 13, 2006 at 10:31 PM
Great article, but I do agree with Patrick that data mining, especially in intelligence and law enforcement, is more than just supervised learning. Unsupervised techniques and text mining are very much part of such an environment. Also, using a combination of unsupervised and supervised learning to emulate the thinking process of intelligence officers can help organize information in a very efficient and effective manner.
Colleen McCue recently published a book (Data Mining and Predictive Analysis: Intelligence Gathering and Crime Analysis, published by Butterworth-Heinemann) based on her practical experience in this area, which covers both the drawbacks of infrequent results and the implications of false positives. She illustrates a pragmatic approach based on several cases from the "real world".
Again: nice article, because we do need to be very careful about promising that data mining is the silver bullet for all intelligence issues. It is a very effective analytic approach that can complement other activities.
Posted by: Jaap Vink | December 19, 2006 at 03:42 AM
Jeff: Great article on data mining and counterterrorism. I used to be a researcher in statistical pattern recognition at IBM's Human Language Technologies Group but now work in financial prediction. I wrote an article on my blog epchan.blogspot.com some weeks ago espousing the view that data mining and AI are not suitable for financial markets prediction either, for very similar reasons. Best, Ernie
Posted by: Ernie Chan | December 19, 2006 at 05:55 AM
First, my comments are my own and don't reflect my company's position. I was glad to see that you tackled the subject and I think your conclusions are good. Many of the post-9/11 actions have been reactions without logical basis.
But back to your article: I don't believe you sold your analysis very well, because of a weak definition of predictive data mining and because you totally exclude the possibility of a predictive component in data mining.
If the connections that you listed were less obvious to the human mind, maybe data mining could reveal them. Using that data, an experienced law enforcement officer or analyst might then provide the predictive "aha" that leads to a terrorist plot. As computer systems become more "intelligent," I think it is reasonable to assume that computers will be able to deduce threats from patterns, at least of a general nature. For example, flying lessons might indicate a planned aircraft attack.
In conclusion, I worry about the privacy issues and about the ability of most analysts to construct useful queries that don't waste time and resources, but I think data mining is here to stay. The US needs an umbrella law that describes privacy and establishes some sort of template for what information can be collected and under what circumstances. Our current laws are directed at specific professions, such as medicine, or at reporting regulations. There is no general privacy law, as I believe Australia has.
Posted by: Stephen Taylor | January 09, 2007 at 12:30 PM
Hi Jeff. Thanks for your post about "Effective Counter-Terrorism and the Limited Role of Predictive Data Mining". Your article gave me some great ideas. Thanks.
Posted by: Michael | October 17, 2007 at 09:30 PM