"People lie. How are you going to account for that?"
This question used to make me crazy. I always wanted to blurt out, "And the sun is going to consume the earth someday – deal with it!"
I never said this, of course.
Anyway, I have a more thoughtful response these days.
Try this on for size. Yep. People are going to falsify information. In fact, you may have experienced this in your life. Let’s say you had a friend – or so you thought. Over time you discovered that this person was in fact dishonest. How did you discover this? The answer is simple: you collected more observations over time.
Observations add up.
I have seen this play out in real data. For example, there was this very big database (billions of table rows describing hundreds of millions of unique people). In this particular database there was this one fellow who was repeatedly lying about his identity. He did a good job; in fact, such a good job that, despite Semantic Reconciliation processing, he appeared to be six different people.
The guy was a liar and no one knew ... that is, until future observations (created by his own actions) flushed him out.
[Skip this next paragraph, if you are speed reading or want to stay out of the weeds.]
Here is how this happened. Imagine six apparently discrete identities. Some name similarity, but that never matters at this scale. Then one day this fellow decides to use one of these identities (using previously reported features, e.g., same name, phone, SSN, date of birth, etc.), except this time he introduces a new address, one that had never been previously associated with this identity. So this new record is identity resolved to the existing identity (the identity he wanted to present). This causes context accumulation – in this case the new address enhances what is known about the person he is being today. Sequence Neutrality processing then fires up to make sure earlier identity resolution events are still valid. During this process another identity is located that shares the new address (the one just learned) and other matching features (e.g., similar names and more). The identity he was trying to be has now become conjoined to one of his other identities – one he was trying to distance himself from. [Technical note: I am specifically using the term conjoined as opposed to merged. Think of conjoined like being rubber-banded together versus merged where two records become one. This is essential for many reasons, e.g., retaining the ability to change one’s mind later. More about this in a future post.]
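For readers who think in code, here is a rough sketch of the conjoined-versus-merged distinction from the technical note. This is not the actual system; the class name `ConjoinedIdentity`, its methods, the `merge` helper, and the sample records are all made up for illustration. The point is only that conjoining keeps every source record intact, so the decision can be reversed later, while merging overwrites.

```python
from dataclasses import dataclass, field


@dataclass
class ConjoinedIdentity:
    """Hypothetical structure: source records rubber-banded together, never fused."""
    records: dict = field(default_factory=dict)  # record_id -> feature dict

    def conjoin(self, other):
        # Keep every original record intact under one identity.
        return ConjoinedIdentity({**self.records, **other.records})

    def disjoin(self, record_ids):
        # Changing one's mind later: pull records back out, untouched.
        ids = set(record_ids)
        kept = {r: f for r, f in self.records.items() if r not in ids}
        removed = {r: f for r, f in self.records.items() if r in ids}
        return ConjoinedIdentity(kept), ConjoinedIdentity(removed)

    def features(self):
        # Accumulated context: every value ever seen for each feature.
        context = {}
        for feats in self.records.values():
            for key, value in feats.items():
                context.setdefault(key, set()).add(value)
        return context


def merge(a, b):
    # Merging fuses two records into one; values from `a` that `b`
    # also supplies are overwritten and cannot be recovered.
    return {**a, **b}
```

With two made-up records sharing an address, `conjoin` yields one identity whose `features()` holds both names, and `disjoin` hands back the two originals unchanged. `merge` on the same pair keeps only one name.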
When two identities collapse into one, the new conjoined identity has more context. Because something new has just been learned, sequence neutral processing immediately checks whether any further assertions of the past need fixing (e.g., more identities that can be conjoined, or in some cases, disjoined).
Long and short, his six discrete identities collapsed into one … thanks to the arrival of two new records.
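The cascade can be sketched in a few lines. Treat this as a toy under loud assumptions, not the real engine: the rule in `matches` (two identities match when they share exact values on at least two features) is invented for the example, as are the function names and all the data. Real matching would be fuzzier, and this sketch only conjoins; it never disjoins.

```python
def matches(a, b, min_shared=2):
    # Hypothetical rule: match when at least `min_shared` features
    # have a value in common. Not the real system's logic.
    shared = sum(1 for key in a.keys() & b.keys() if a[key] & b[key])
    return shared >= min_shared


def absorb(identity, other):
    # Context accumulation: fold the other identity's values in.
    for key, values in other.items():
        identity.setdefault(key, set()).update(values)


def add_record(identities, record):
    """Resolve a new record, then keep re-checking earlier decisions
    until a full pass changes nothing."""
    current = {key: {value} for key, value in record.items()}
    changed = True
    while changed:
        changed = False
        remaining = []
        for other in identities:
            if matches(current, other):
                absorb(current, other)
                changed = True  # new context learned, so check again
            else:
                remaining.append(other)
        identities[:] = remaining
    identities.append(current)
    return current


# Three made-up identities; no pair shares more than one feature.
identities = [
    {"name": {"J. Dough"}, "phone": {"555-0199"}, "ssn": {"000-00-0001"}},
    {"name": {"Jon Doe"}, "dob": {"1970-01-01"},
     "address": {"9 Oak Ave"}, "phone": {"555-0199"}},
    {"name": {"John Doe"}, "phone": {"555-0100"},
     "dob": {"1970-01-01"}, "ssn": {"000-00-0001"}},
]

# Known name and phone, plus an address never seen on this identity.
add_record(identities, {"name": "John Doe", "phone": "555-0100",
                        "address": "9 Oak Ave"})
print(len(identities))  # 1
```

The new record first lands on "John Doe" (name and phone). That adds a date of birth to the context, which together with the new address pulls in "Jon Doe". That in turn adds a second phone number, which together with the SSN pulls in "J. Dough". Each step only becomes possible because of what the previous step learned.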
Knowing this, one thinks about what data sources are better than others. Some data sources are so good … they work like "glue guns."
From a national security and privacy point of view, it is the above behavior that makes it so important to debate what perceptions (observations) are fair game for context construction, and when.
RELATED POSTS:
More Data is Better, Proceed with Caution
Ubiquitous Sensors? You Have Seen Nothing yet
Accumulate Context: Now or Never
To Know Semantic Reconciliation is to Love Semantic Reconciliation
Jeff:
Considering conjoined instead of merged
Very good idea.
Rubber cement instead of glue gun?
keep up the good work
Posted by: JT Hoagland | July 14, 2007 at 09:07 AM
Pretty slick. Of course, to get the benefits of conjoining versus merging you have to:
1-store lots more info
2-process that increased info over and over and over.
Interesting.
Whether you reply to this comment (assuming you find it worthy of reply) will indicate whether you apply conjoining to your blog.
:-)
Posted by: Alex Simonelis | November 28, 2007 at 07:58 AM