

September 25, 2007

Entity Resolution Systems vs. Match Merge/Merge Purge/List De-duplication Systems

Sometimes entity resolution systems get mixed up with match merge, merge purge and list de-duplication systems (collectively "merge purge systems") … so I thought I would take a moment to point out some of the differences.

Both types of systems help organizations count unique entities (e.g., customers) by making assertions as to when two or more described entities likely represent the same physical entity, aka Semantic Reconciliation. This kind of processing is essential in information systems for many reasons, for example, to save money on postage (e.g., by reducing duplicate mail pieces) or to improve analysis (e.g., to accurately determine "the average number of accounts per customer" which obviously first requires an accurate customer count).
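To make the counting point concrete, here is a minimal sketch in Python. The records, field names, and the "same phone number means same person" rule are all assumptions made up for illustration; real semantic reconciliation uses far richer evidence.

```python
from collections import defaultdict

# Hypothetical account records; the field names are assumptions for illustration.
accounts = [
    {"account": "A1", "name": "Robert Smith", "phone": "555-0100"},
    {"account": "A2", "name": "Bob Smith", "phone": "555-0100"},
    {"account": "A3", "name": "Ann Lee", "phone": "555-0199"},
]


def average_accounts_per_customer(records, key):
    """Group accounts by the given customer key and return accounts per customer."""
    customers = defaultdict(list)
    for record in records:
        customers[key(record)].append(record["account"])
    return len(records) / len(customers)


# Treating every distinct name as a distinct customer overcounts customers.
naive = average_accounts_per_customer(accounts, key=lambda r: r["name"])

# Toy matching rule (an assumption): a shared phone number means the same person.
resolved = average_accounts_per_customer(accounts, key=lambda r: r["phone"])

print(naive, resolved)
```

With these three toy records, the naive count sees three customers while the resolved count sees two, so "the average number of accounts per customer" changes with the customer count.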

But other than the counting of unique entities, the differences between merge purge and entity resolution systems are night and day.

[Note: These differences are not absolutes. However, the more merge purge behavior a system exhibits, the more it is a merge purge system. Conversely, the more entity resolution behavior a system exhibits, the more it is an entity resolution system.]

Batch versus real-time: Merge purge systems are traditionally batch oriented. Input files are compared and the result is a de-duplicated output file. The input files are then periodically re-processed in their entirety to account for changes. Note: while some merge purge systems support incremental updates, take a close look at how missing records ["deletes"] in the input files are incrementally handled – this is almost always problematic. Entity resolution systems are generally designed to handle real-time updates. There is no real notion of "input files" in real time systems, rather there are "data sources" producing transactional streams.

Snapshot in time versus perpetually current: Merge purge systems, due to their batch nature, deliver a static de-duplicated data set. The accuracy of the snapshot obviously degrades over time – until the data is reprocessed. Entity resolution systems, due to their real-time nature, deliver a dynamic data store of disambiguated entities that are current to the second.

Data survivorship versus full attribution: Merge purge systems have data survivorship rules that are used to determine which values should be kept and which values should be discarded or archived (e.g., keeping "Robert" while dropping "Bob"). Entity resolution systems retain every record and every attribute, each with its associated attribution/pedigree (i.e., pointers to the system of record and transaction references). The importance of this behavior is clarified in the next three paragraphs.
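The contrast can be sketched as follows. This is an illustrative toy, not how any particular product works: the `Observation` type, its fields, and the "keep the longest value" survivorship rule are assumptions chosen only to reproduce the Robert/Bob example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    attribute: str
    value: str
    source: str   # hypothetical pointer to the system of record
    txn_ref: str  # hypothetical transaction reference


def survivorship_merge(observations):
    """Merge purge style: keep one value per attribute, discard the rest.

    The rule used here (keep the longest value) is an assumption for illustration.
    """
    best = {}
    for obs in observations:
        current = best.get(obs.attribute)
        if current is None or len(obs.value) > len(current):
            best[obs.attribute] = obs.value
    return best


def full_attribution(observations):
    """Entity resolution style: retain every value along with its pedigree."""
    by_attribute = {}
    for obs in observations:
        by_attribute.setdefault(obs.attribute, []).append(obs)
    return by_attribute


observations = [
    Observation("first_name", "Robert", "crm", "txn-001"),
    Observation("first_name", "Bob", "billing", "txn-002"),
]

survived = survivorship_merge(observations)
attributed = full_attribution(observations)
```

After the survivorship merge, "Bob" and where it came from are gone; under full attribution both values remain, each still tied to its source and transaction.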

Data drifting versus self-correcting: Merge purge systems drift in accuracy over time for two reasons: 1) they are not current until the process is re-run, and the longer this interval the more inaccurate the snapshot; 2) incremental updates can invalidate earlier assertions. On this second point, as new records are loaded, they sometimes present new evidence that improves earlier resolutions (either two previously processed records should have been matched, or two previously processed records should not have been matched). As most merge purge systems have data survivorship rules, they cannot remedy earlier assertions because some of the original data is now missing. The remedy for this is to completely reprocess all of the data. Entity resolution systems with full attribution are not only able to recognize that new records may change the past, but can also self-correct earlier assertions in real time, i.e., without having to perform a ground-up re-load. [See: Sequence Neutrality in Information Systems, Data Tethering]
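A minimal sketch of the "should have been matched" case, assuming a toy rule in which two records match when they share any identifier. The `EntityStore` class and the record layout are hypothetical. The sketch covers only late merges; splitting records that should not have been matched is not shown.

```python
class EntityStore:
    """Toy store that retains every record so earlier assertions can be revised."""

    def __init__(self):
        self.entities = []  # each entity is a list of retained records

    @staticmethod
    def _matches(a, b):
        # Toy rule (an assumption): records match when they share any identifier.
        return bool(a["ids"] & b["ids"])

    def add(self, record):
        # Find every existing entity that the new record provides evidence for.
        hits = [
            entity for entity in self.entities
            if any(self._matches(record, kept) for kept in entity)
        ]
        # Merge the new record with all matched entities, keeping every record.
        merged = [record]
        for entity in hits:
            merged.extend(entity)
        self.entities = [
            entity for entity in self.entities
            if not any(entity is hit for hit in hits)
        ]
        self.entities.append(merged)
        return merged


store = EntityStore()
store.add({"source": "crm", "ids": {"phone:555-0100"}})
store.add({"source": "web", "ids": {"email:js@example.com"}})
count_before = len(store.entities)

# A third record carries both identifiers and connects the two earlier entities.
store.add({"source": "billing", "ids": {"phone:555-0100", "email:js@example.com"}})
count_after = len(store.entities)
```

The first two records start out as separate entities. When the third arrives, the store revises that earlier assertion on the spot, without reprocessing anything else, which is only possible because no record was discarded along the way.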

Single version of truth versus every version of truth: Merge purge systems tend to deliver a single version of truth – whereby the best first name, best last name, best home address, and so on is maintained. Entity resolution systems and full attribution deliver every version of truth, where the decision about which name and address is best can be determined when the data is being consumed and in that specific context (e.g., summer time = summer house, winter time = winter house). [See: There is no Such Thing as a Single Version of Truth]
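As a rough sketch of choosing the "best" value at the moment the data is consumed, the snippet below picks an address by season. The addresses, the `season` tag, and the April-to-September definition of summer are all assumptions for illustration.

```python
from datetime import date

# Every known address is retained; nothing is discarded as "not the best".
addresses = [
    {"value": "12 Lake Rd", "season": "summer", "source": "crm"},
    {"value": "8 Palm Ave", "season": "winter", "source": "billing"},
]


def season_of(day):
    # Simplified rule (an assumption): April through September counts as summer.
    return "summer" if 4 <= day.month <= 9 else "winter"


def best_address(known_addresses, day):
    """Pick the address that fits the consumer's context, here the date of use."""
    wanted = season_of(day)
    for address in known_addresses:
        if address["season"] == wanted:
            return address["value"]
    # Fall back to the first known address when no seasonal match exists.
    return known_addresses[0]["value"]


july_choice = best_address(addresses, date(2007, 7, 4))
december_choice = best_address(addresses, date(2007, 12, 25))
```

The same entity yields a different "best" address in July than in December, a choice that a single stored version of truth could not support.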

Outlier attribute suppression versus context accumulating: As merge purge systems rely on data survivorship processing, they drop outlying attributes. For example, the name Marek might sometimes appear as Mark due to data entry error; a merge purge system would keep Marek and drop Mark. Entity resolution systems keep all values whether they compete or not and, as such, accumulate context. By keeping both Marek and Mark, the semantic reconciliation algorithms can benefit by recognizing that sometimes Marek is recorded as Mark. [See: It Turns Out Both Bad Data and a Teaspoon Of Dirt May Be Good For You]
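A small sketch of why the retained outlier helps, assuming a deliberately simple matcher that compares an incoming name against every name already known for an entity. The entity layout and the `name_matches` helper are hypothetical.

```python
def name_matches(known_names, incoming_name):
    """Return True when the incoming name equals any name already on the entity."""
    return incoming_name.lower() in {name.lower() for name in known_names}


# After survivorship processing, only the "best" name remains.
survivorship_entity = {"names": {"Marek"}}

# With context accumulation, the outlier spelling is retained as well.
accumulating_entity = {"names": {"Marek", "Mark"}}

incoming = "Mark"
matched_after_survivorship = name_matches(survivorship_entity["names"], incoming)
matched_with_context = name_matches(accumulating_entity["names"], incoming)
```

When a later record arrives for "Mark", only the entity that kept both spellings recognizes it under this toy rule.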

Binary list processing versus "n" data source ingestion: Most merge purge systems are designed to compare one input file to another (i.e., binary). If there is a third input file to process, the first output file is then compared to the third file (i.e., again a binary function). Entity resolution systems operate on an entirely different principle. In these systems, each evaluated entity (from any data source in any order) is evaluated against the universe of constructed entities that are being persisted in context. [See: Persistent Context] Between this point, full attribution, and self-correction … entity resolution systems are capable of being more accurate.

Limited scalability versus massive scalability: As data volumes grow, it becomes more and more unsustainable to reprocess all of one’s data holdings. For this reason, the larger the historical volume of data, the less practical merge purge systems become, as there comes a point at which there is not enough time to re-crunch all of the data from the ground up. Entity resolution systems that support real-time and sequence-neutral (self-correcting) processing are not dependent upon re-loading for accuracy and currency.

While I’ve been obsessed with real-time, perpetual analytics and thus entity resolution type systems, I don’t dispute that there are still many missions in this day and age that only need merge purge functionality. For example, merge purge systems are very well suited to any non-real-time mission that can live with snapshots, such as direct mail marketing and monthly reporting.

Entity resolution systems are best suited for real-time missions where processes require access to the most accurate and most current view of that which is knowable [to the enterprise].

RELATED POSTS:

IEEE Paper: Threat and Fraud Intelligence – Las Vegas Style

September 18, 2007

More Death Cheaper in Future

The difficulty and cost of delivering death and mayhem are dropping so fast, there will come a time in which the ill-will of a few evil men could ruin the day for millions.

Technological advances in physics, engineering and biology coupled with the Internet and the dynamics of Web 2.0 have contributed to unprecedented social progress and overall improvement of the human condition. In many ways … and in most places … it is better now than ever before; hence my recent post "The World is Not a More Dangerous Place." At the same time, these same phenomena are accelerating the lethality potential per unit of human effort.

Example 1: Building and delivering the first few 10-kiloton nuclear devices in the 1940s involved 130,000 people and cost two billion dollars ($23B in 2007 dollars). Today, graduate students are building viable detonation systems … albeit lacking the enriched uranium or plutonium. But unlike the 1940s, when enriched uranium did not exist and every ounce had to be produced, today this nuclear material exists in stockpiles all over the world.

Example 2: Recent biological advances have made it possible to reanimate the 1918 Spanish Influenza. Did I say "possible?" Sorry, I meant to say "this has already been done!" Between a couple of tissue samples left over in a military hospital and a deceased Alaskan Eskimo preserved in the permafrost, the virus has been successfully reconstructed and its DNA sequenced. Researchers then proceeded to inject this virus into mice with the human immune system. The result – unprecedented death – the most deadly flu virus ever tested. [story here] While nuclear material is hard to acquire, I was told the DNA sequence of the 1918 Spanish Influenza was already in the public domain. Hard to believe, so I asked a friend in the biological community for a copy of this DNA sequence. So it appears that I now have a copy on my laptop, but what would I know!

While advances in technology are a big part of this trend, other factors contribute as well including population density, dependence on mobility, the tightly coupled interdependencies in which the world operates (e.g., from just-in-time supply chains to your just-in-time access to cash and food) and media-driven sensationalism. Factors such as these have a force multiplying and amplification effect even upon traditional means for mayhem. For example, consider the death and mayhem created by Malvo and Muhammad, the two Washington DC-area gunmen. They were able to turn an investment of a few thousand dollars (car, gas, gun, bullets) into an instrument of terror which not only killed a number of people but also created so much panic the regional economy lost an estimated half a billion dollars ($500,000,000).

And so it seems that, as time marches forward, fewer people are able to create more damage, more cheaply and more quickly.

RELATED POSTS:

The World is Not a More Dangerous Place

The Only Way to Actually Win the (Long) War on Terror

Web 2.0 – Al Qaeda’s Most Effective Force Multiplier