

December 29, 2006

It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You

I think there used to be a saying that ingesting a teaspoon of dirt would actually keep the immune system strong. (Before my time, in case you are wondering.)

Whether or not this is true, I have come to conclude that, with respect to context engines, poor-quality data (or "dirt") can in fact be quite helpful. Just to be clear, I am not talking about a date of birth value incorrectly placed in a middle name field or a phone number field containing a non-phone value like the phrase "who put the ear muffs on the cookie?"

When incorrect data actually expresses "natural variability", this kind of data error can be helpful to context-assembling systems. What do I mean by "natural variability", you might ask? Well, I am referring to plausible variations. For example, the month and day in a date of birth may be transposed. Or sometimes an address will include the word "Drive", while other times the same address will be recorded without it. If someone’s first name is "Marek" (a fairly uncommon name here in the United States), it may periodically be recorded as "Mark" by a confused data entry operator. The list goes on.
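To make "plausible variation" a little more concrete, here is a minimal Python sketch of two of the examples above: a transposed month and day, and an address recorded with and without "Drive". The function names and the list of street-type words are hypothetical, chosen only for illustration; they do not describe any particular context engine.

```python
from datetime import date

# Hypothetical, deliberately tiny list of street-type words (an assumption).
STREET_SUFFIXES = {"drive", "dr", "street", "st", "avenue", "ave", "road", "rd"}


def is_day_month_transposition(a: date, b: date) -> bool:
    """True when b is a with month and day swapped (and the two dates differ)."""
    return a != b and a.year == b.year and a.month == b.day and a.day == b.month


def strip_street_suffix(address: str) -> str:
    """Lowercase an address and drop one trailing street-type word, if present."""
    words = address.lower().replace(".", "").split()
    if words and words[-1] in STREET_SUFFIXES:
        words = words[:-1]
    return " ".join(words)


# March 4 recorded as April 3 is a plausible variation of the same birth date.
same_person_dob = is_day_month_transposition(date(1970, 3, 4), date(1970, 4, 3))

# "123 Main Drive" and "123 Main" reduce to the same comparison key.
same_address = strip_street_suffix("123 Main Drive") == strip_street_suffix("123 Main")
```

Checks like these only flag a variation as plausible. Whether two records really describe the same person still depends on the rest of the evidence.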

When context-accumulating systems keep this natural variability, accuracy goes up when they later try to recognize like objects, because the system has been able to learn from the natural variability of the past. For example, recognizing that Marek is sometimes also recorded as Mark is in fact helpful.
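A rough sketch of what keeping the variability might look like, assuming a hypothetical `ContextStore` class that remembers every first name seen for an entity rather than overwriting it with one "clean" value. How a record gets assigned to an entity in the first place (by address, date of birth, and so on) is outside this sketch and simply assumed.

```python
from collections import defaultdict


class ContextStore:
    """Hypothetical store that accumulates observed name variants per entity."""

    def __init__(self):
        # entity id -> set of every first name observed for that entity
        self._names = defaultdict(set)

    def observe(self, entity_id: str, first_name: str) -> None:
        """Record a first name seen for an entity, keeping earlier variants."""
        self._names[entity_id].add(first_name.strip().lower())

    def candidates(self, first_name: str) -> list:
        """Return ids of entities that have ever been recorded under this name."""
        key = first_name.strip().lower()
        return sorted(eid for eid, names in self._names.items() if key in names)


store = ContextStore()
store.observe("person-1", "Marek")
store.observe("person-1", "Mark")   # the "dirty" variant is kept, not corrected
store.observe("person-2", "Maria")

matches = store.candidates("Mark")  # person-1 is found because the variant was kept
```

Had the store "cleansed" the second observation back to "Marek", a later record arriving as "Mark" would have had nothing to match on by name.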

The other funny thing about this is: how would one know if someone named Marek has decided to go by the nickname Mark? Well, in most cases you will never know this other than by observing over time that he used to go by one name and in recent years seems to be going by another.
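Observing a name change over time could be sketched as follows, assuming each observation carries the date it was recorded. The function name and the sample dates are made up purely for illustration.

```python
from datetime import date


def name_usage_by_year(observations):
    """Map each observed name to the (first_year, last_year) in which it was seen."""
    spans = {}
    for seen_on, name in observations:
        first, last = spans.get(name, (seen_on.year, seen_on.year))
        spans[name] = (min(first, seen_on.year), max(last, seen_on.year))
    return spans


# Hypothetical observations of one person, oldest first.
observations = [
    (date(2001, 5, 1), "Marek"),
    (date(2003, 2, 9), "Marek"),
    (date(2005, 7, 20), "Mark"),
    (date(2006, 11, 3), "Mark"),
]

spans = name_usage_by_year(observations)
# spans shows "Marek" in the earlier years and "Mark" in the later ones.
```

Nothing here says which name is "true". It only shows which name was in use when, which is all the system can actually observe.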

This is yet another reason why, with respect to context engines, there is no such thing as a single version of truth.

Other related posts:

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness

Accumulating Context: Now or Never
