I think there used to be a saying that ingesting a teaspoon of dirt would actually keep the immune system strong. (Before my time in case you are wondering.)
Now whether this is true or not, I have come to conclude that with respect to context engines … poor quality data (or "dirt") can in fact be quite helpful. Just to be clear, I am not talking about a date of birth value incorrectly placed in a middle name field or a phone number field containing a non-phone value like the phrase "who put the ear muffs on the cookie?"
When incorrect data actually expresses "natural variability", this kind of data error can be helpful to context assembling systems. What do I mean by "natural variability", you might ask? Well, I am referring to plausible variations: for example, the month and day in a date of birth being transposed. Or sometimes an address will include the word "Drive" while other times the same address is recorded without it. If someone’s first name is "Marek" (a fairly uncommon name here in the United States), it may periodically be recorded as "Mark" by a confused data entry operator. The list goes on.
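To make "plausible variation" a little more concrete, here is a minimal sketch of the transposed month/day case. The function name `is_plausible_dob_variant` is made up for illustration, and a real context engine would weigh many more kinds of variation than this one.

```python
from datetime import date


def is_plausible_dob_variant(a: date, b: date) -> bool:
    """Return True if b is the same date as a, or a with month and day swapped.

    Hypothetical helper for illustration only: it covers exactly one kind
    of natural variability (a month/day transposition) and nothing else.
    """
    if a == b:
        return True
    # Same year, and the month of one is the day of the other (and vice versa).
    return a.year == b.year and a.month == b.day and a.day == b.month


recorded = date(1970, 3, 4)      # 4 March 1970
transposed = date(1970, 4, 3)    # 3 April 1970, month and day swapped
unrelated = date(1982, 11, 20)

print(is_plausible_dob_variant(recorded, transposed))  # True
print(is_plausible_dob_variant(recorded, unrelated))   # False
```

A date of birth in the middle name field would never pass a check like this, which is the difference between dirt that helps and dirt that is just noise.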
When context accumulating systems keep this natural variability, accuracy goes up when they later try to recognize like objects, because the system has been able to learn from the natural variability of the past. For example, recognizing that Marek is sometimes also recorded as Mark is in fact helpful.
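As a rough sketch of what "keeping" the variability might look like, the toy store below remembers every first name it has seen for an entity instead of overwriting it with a single canonical value. The class `ContextStore`, its methods, and the entity ids are all hypothetical; this is not how any particular product works, just the idea in miniature.

```python
from collections import defaultdict


class ContextStore:
    """Toy store that keeps every observed name variant per entity.

    Hypothetical illustration: entity resolution is assumed to have
    happened already, so callers pass in the entity id directly.
    """

    def __init__(self):
        # entity id -> set of normalized first names seen so far
        self.variants = defaultdict(set)

    def observe(self, entity_id: str, first_name: str) -> None:
        """Record a first name as seen for this entity (nothing is discarded)."""
        self.variants[entity_id].add(first_name.strip().lower())

    def candidates(self, first_name: str) -> list:
        """Return ids of entities that have ever been recorded under this name."""
        key = first_name.strip().lower()
        return sorted(eid for eid, names in self.variants.items() if key in names)


store = ContextStore()
store.observe("person-1", "Marek")
store.observe("person-1", "Mark")   # the "dirty" record is kept, not cleansed away
store.observe("person-2", "Mary")

print(store.candidates("Mark"))     # ['person-1']
print(store.candidates("Marek"))    # ['person-1']
```

Had the store "cleansed" the second record back to Marek, a later lookup for Mark would find nothing.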
The other funny thing about this is: how would one know if someone named Marek has decided to now go by the nickname Mark? Well, in most cases you will never know this, other than by observing over time that he used to go by one name and in recent years he seems to be going by another.
This is yet another reason why with respect to context engines there is no such thing as a single version of truth.
Other related posts:
It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness
I have worked in years past with vendors like Identity Systems that have software to fuzzy match names of people and addresses. The counter-intuitive lesson I learned was that "dirty" data should not be converted to some canonical form, but that the dirty data should be kept to support clustering (and future re-clustering).
In my recent studying of Philosophy, I found that this was something deep and basic, and not a specific identity-management tweak. See...
http://existentialprogramming.blogspot.com/search?q=superman
Posted by: Bruce Wallace | February 16, 2009 at 07:48 AM