December 29, 2006

It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You

I think there used to be a saying that ingesting a teaspoon of dirt would actually keep the immune system strong. (Before my time, in case you are wondering.)

Now, whether this is true or not, I have come to conclude that with respect to context engines … poor-quality data (or "dirt") can in fact be quite helpful. Just to be clear, I am not talking about a date of birth value incorrectly placed in a middle name field or a phone number field containing a non-phone value like the phrase "who put the ear muffs on the cookie?"

When incorrect data actually expresses "natural variability", this kind of data error can be helpful to context assembling systems. What do I mean by "natural variability", you might ask? Well, I am referring to plausible variations. For example, the month and day in a date of birth may be transposed. Or an address will sometimes include the word "Drive" while other times this same address is recorded without it. If someone’s first name is "Marek" (a fairly uncommon name here in the United States), it may periodically be recorded as "Mark" by a confused data entry operator. The list goes on.

When context accumulating systems keep this natural variability, accuracy goes up when they later try to recognize like objects, because the system has been able to learn from the natural variability of the past. For example, recognizing that Marek is sometimes also recorded as Mark is in fact helpful.
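
To make the Marek/Mark point concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular context engine is built. The `EntityContext` class and its `observe` and `matches` methods are hypothetical names invented for this example, and exact string matching stands in for whatever comparison a real system would use.

```python
from collections import defaultdict


class EntityContext:
    """Hypothetical store that keeps every value ever observed for an entity.

    Nothing is overwritten or "cleansed" away, so plausible variations
    seen in the past remain available for future recognition.
    """

    def __init__(self):
        # attribute name -> set of normalized values observed so far
        self.observed = defaultdict(set)

    @staticmethod
    def _normalize(value):
        # Light normalization only: trim whitespace and ignore case.
        return value.strip().lower()

    def observe(self, attribute, value):
        """Record a value for an attribute, keeping all earlier values too."""
        self.observed[attribute].add(self._normalize(value))

    def matches(self, attribute, value):
        """Return True if this value was previously observed for the attribute."""
        return self._normalize(value) in self.observed.get(attribute, set())


person = EntityContext()
person.observe("first_name", "Marek")
person.observe("first_name", "Mark")  # plausible variation, kept on purpose

# A later record spelled "MARK" is recognized because the variation was kept.
print(person.matches("first_name", "MARK"))
```

Had the system kept only the "correct" value Marek, the later "Mark" record would have had nothing to match against.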

The other funny thing about this is: How would one know if someone named Marek has decided to now go by the nickname Mark? Well, in most cases you will never know this other than by observing over time that he used to go by one name and in recent years he seems to be going by another.

This is yet another reason why with respect to context engines there is no such thing as a single version of truth.

Other related posts:

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness

Accumulating Context: Now or Never

December 12, 2006

Effective Counter-Terrorism and the Limited Role of Predictive Data Mining

Yesterday the Cato Institute released a paper that Jim Harper and I co-authored. Harper and I have been working on this paper for about two years.

Here is a link to the abstract. And here is a link to the full paper.

One of the big challenges we faced in getting this paper drafted was dealing with all of the confusion related to data mining. It turns out that what counts as data mining depends on whom you ask.

The key point of our paper is that one form of data mining, the kind that uses historical incident data to determine a pattern and then uses this pattern to predict a future event, is not helpful in the terrorism context because there isn’t enough historical data to derive a meaningful and statistically reliable pattern. Thus, we settled on the term "predictive data mining" to differentiate what we were characterizing as ineffective from many other effective uses.
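
One way to see why a handful of historical incidents cannot support a reliable pattern is to look at how wide the statistical uncertainty is for small samples. The sketch below uses the Wilson score interval, a standard confidence interval for a proportion. It is a swapped-in illustration, not a method taken from the paper, and the counts are made-up numbers chosen only to contrast a tiny sample with a large one.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion.

    successes: number of cases showing the candidate pattern (hypothetical)
    n:         total number of historical cases
    z:         normal quantile; 1.96 corresponds to roughly 95% confidence
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed rate (80%), very different amounts of history.
small = wilson_interval(4, 5)        # a few incidents
large = wilson_interval(4000, 5000)  # e.g. a large marketing data set

print("5 cases:    width =", round(small[1] - small[0], 3))
print("5000 cases: width =", round(large[1] - large[0], 3))
```

With only a few cases the interval spans much of the range from 0 to 1, so the apparent pattern says very little. With thousands of cases it narrows enough to act on.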

This paper also highlights a real governmental need to efficiently locate, access, and aggregate information about specific suspects. To highlight this point we show that starting with two primary suspects, available data points and existing laws, a good number of the 9/11 terrorists could have been identified in a very narrow investigative fashion before September 11th.

Make no mistake about it: though data mining has many valuable uses, from reducing corporate direct marketing costs to classifying celestial objects and even medical research, it just so happens that it is not so helpful for discovering underlying patterns of low-incidence terrorism.

Other related posts:

What is Data Mining? Depends Who You Ask ...

Data Mining, Predicate Triage and NSA Domestic Surveillance