Sometimes entity resolution systems get mixed up with match merge, merge purge and list de-duplication systems (collectively "merge purge systems") … so I thought I would take a moment to point out some of the differences.
Both types of systems help organizations count unique entities (e.g., customers) by making assertions as to when two or more described entities likely represent the same physical entity, aka Semantic Reconciliation. This kind of processing is essential in information systems for many reasons, for example, to save money on postage (e.g., by reducing duplicate mail pieces) or to improve analysis (e.g., to accurately determine "the average number of accounts per customer" which obviously first requires an accurate customer count).
But other than the counting of unique entities, the differences between merge purge and entity resolution systems are night and day.
[Note: These differences are not absolutes. However, the more a system exhibits merge purge behavior, the more it is a merge purge system. Conversely, the more a system exhibits entity resolution behavior, the more it is an entity resolution system.]
Batch versus real-time: Merge purge systems are traditionally batch oriented. Input files are compared and the result is a de-duplicated output file. The input files are then periodically re-processed in their entirety to account for changes. Note: while some merge purge systems support incremental updates, take a close look at how missing records ["deletes"] in the input files are incrementally handled – this is almost always problematic. Entity resolution systems are generally designed to handle real-time updates. There is no real notion of "input files" in real-time systems; rather, there are "data sources" producing transactional streams.
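To make the contrast concrete, here is a minimal sketch – hypothetical function names, record fields, and a deliberately naive match key, not any particular product's API – of a batch merge purge pass versus per-record, real-time resolution:

```python
def merge_purge_batch(input_files):
    """Batch style: read all inputs, de-duplicate, emit a static output file."""
    records = [rec for f in input_files for rec in f]   # load everything up front
    deduped = {}
    for rec in records:
        key = (rec["name"].lower(), rec["zip"])         # naive match key (illustrative)
        deduped.setdefault(key, rec)                    # first record seen "survives"
    return list(deduped.values())                       # a snapshot in time

class EntityResolver:
    """Real-time style: each transaction is resolved the moment it arrives."""
    def __init__(self):
        self.entities = {}                              # persisted, ever-current context

    def resolve(self, rec):
        key = (rec["name"].lower(), rec["zip"])         # same naive key, for symmetry
        entity = self.entities.setdefault(key, {"records": []})
        entity["records"].append(rec)                   # nothing is discarded
        return entity                                   # current to the second

# The batch function must be re-run per cycle; the resolver just keeps going.
resolver = EntityResolver()
resolver.resolve({"name": "Bob Smith", "zip": "89109", "src": "CRM txn 17"})
```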
Snapshot in time versus perpetually current: Merge purge systems, due to their batch nature, deliver a static de-duplicated data set. The accuracy of the snapshot obviously degrades over time – until the data is reprocessed. Entity resolution systems, due to their real-time nature, deliver a dynamic data store of disambiguated entities that are current to the second.
Data survivorship versus full attribution: Merge purge systems have data survivorship rules that are used to determine which values should be kept and which values should be discarded or archived (e.g., keeping "Robert" while dropping "Bob"). Entity resolution systems retain every record and every attribute, each with its associated attribution/pedigree (i.e., pointers to the system of record and transaction references). The importance of this behavior is clarified in the next three paragraphs.
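A sketch of the two retention policies – the field names and the "longest name wins" survivorship rule are illustrative assumptions, chosen only to echo the Robert/Bob example above:

```python
# Survivorship: collapse to one "best" value; the losers are simply gone.
def survive(values):
    # Illustrative rule: prefer the longest (most formal) name,
    # so "Robert" survives and "Bob" is discarded.
    return max(values, key=len)

# Full attribution: keep every value, each with its pedigree.
def attribute(values_with_pedigree):
    # Each value keeps a pointer to its system of record and transaction,
    # so nothing is lost and earlier assertions remain reversible.
    return [{"value": v, "source": src, "txn": txn}
            for (v, src, txn) in values_with_pedigree]

print(survive(["Bob", "Robert"]))                  # -> "Robert"
print(attribute([("Bob", "CallCenter", "t-101"),
                 ("Robert", "Billing", "t-207")])) # both retained, with pedigree
```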
Data drifting versus self-correcting: Merge purge systems drift in accuracy over time for two reasons: 1) they are not current until the process is re-run, and the longer this interval, the more inaccurate the snapshot; 2) incremental updates can invalidate earlier assertions. On this second point: as new records are loaded, they sometimes present new evidence that improves earlier resolutions (either two previously processed records should have been matched, or two previously processed records should not have been matched). Because most merge purge systems have data survivorship rules, they cannot remedy earlier assertions – some of the original data is now missing. The only remedy is to completely reprocess all of the data. Entity resolution systems with full attribution are not only able to recognize that new records may change the past, but can also self-correct earlier assertions in real time, i.e., without having to perform a ground-up re-load. [See: Sequence Neutrality in Information Systems, Data Tethering]
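Here is a toy illustration of that self-correcting behavior, under the assumption that entities are just sets of match keys and that any shared key is a match (real matching logic is far richer): a later record supplies evidence bridging two earlier entities, and the store repairs itself in place, with no ground-up re-load.

```python
class SelfCorrectingStore:
    """Toy sequence-neutral store: new evidence can merge earlier entities."""
    def __init__(self):
        self.entities = []                       # each entity = a set of match keys

    def resolve(self, keys):
        hits = [e for e in self.entities if e & keys]
        merged = set(keys)
        for e in hits:                           # the new record bridges old entities
            merged |= e
            self.entities.remove(e)              # retract the earlier assertions
        self.entities.append(merged)             # assert the corrected entity
        return merged

store = SelfCorrectingStore()
store.resolve({"ssn:123"})                       # entity A
store.resolve({"phone:555-0100"})                # entity B (looks unrelated)
store.resolve({"ssn:123", "phone:555-0100"})     # bridges A and B in real time
print(store.entities)                            # -> [{'ssn:123', 'phone:555-0100'}]
```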
Single version of truth versus every version of truth: Merge purge systems tend to deliver a single version of truth – whereby the best first name, best last name, best home address, and so on are maintained. Entity resolution systems with full attribution deliver every version of truth, where the decision about which name and address is best can be made when the data is being consumed, in that specific context (e.g., summer time = summer house, winter time = winter house). [See: There is no Such Thing as a Single Version of Truth]
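Because every version survives, "best" becomes a read-time computation against the consumer's context rather than a load-time decision. A toy example of the summer house/winter house rule (the field names and seasonal rule are assumptions for illustration):

```python
def best_address(entity, month):
    """Pick the 'best' address at consumption time, not at load time."""
    season = "summer" if 6 <= month <= 8 else "winter"
    # Every recorded address survives; the consumer's context decides the winner.
    for addr in entity["addresses"]:
        if addr["season"] == season:
            return addr["street"]
    return entity["addresses"][0]["street"]      # fallback: any known address

customer = {"addresses": [{"street": "1 Lake Rd",  "season": "summer"},
                          {"street": "9 Palm Ave", "season": "winter"}]}
print(best_address(customer, month=7))           # -> "1 Lake Rd"
print(best_address(customer, month=1))           # -> "9 Palm Ave"
```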
Outlier attribute suppression versus context accumulating: Because merge purge systems rely on data survivorship processing, they drop outlying attributes. For example, the name Marek might sometimes appear as Mark due to data entry error; a merge purge system would keep Marek and drop Mark. Entity resolution systems keep all values, whether they compete or not; as such, these systems accumulate context. By keeping both Marek and Mark, the semantic reconciliation algorithms can benefit by recognizing that Marek is sometimes recorded as Mark. [See: It Turns Out Both Bad Data and a Teaspoon Of Dirt May Be Good For You]
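A minimal sketch of that idea – the retained "outlier" becomes learned evidence that improves future matching (the class and its one-directional variant map are hypothetical simplifications):

```python
from collections import defaultdict

class ContextAccumulator:
    """Toy matcher that learns name variants instead of discarding them."""
    def __init__(self):
        self.variants = defaultdict(set)         # canonical -> observed spellings

    def observe(self, canonical, seen):
        # A survivorship system would drop "Mark"; here it becomes evidence.
        self.variants[canonical].add(seen)

    def same_name(self, a, b):
        return (a == b
                or b in self.variants.get(a, set())
                or a in self.variants.get(b, set()))

m = ContextAccumulator()
m.observe("Marek", "Mark")                       # data-entry variant retained
print(m.same_name("Marek", "Mark"))              # -> True, learned from context
```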
Binary list processing versus "n" data source ingestion: Most merge purge systems are designed to compare one input file to another (i.e., a binary operation). If there is a third input file to process, the first output file is then compared to the third file (again, a binary operation). Entity resolution systems operate on an entirely different principle: each incoming record (from any data source, in any order) is evaluated against the universe of constructed entities being persisted in context. [See: Persistent Context] Taken together with full attribution and self-correction … this is why entity resolution systems are capable of being more accurate.
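A toy rendering of that principle – no file-versus-file comparison at all; every record, from whatever source, meets the whole persisted universe of entities (the email-equality match rule is deliberately naive and purely illustrative):

```python
class PersistentContext:
    """Toy n-source ingestion: every record meets the whole entity universe."""
    def __init__(self):
        self.universe = []                       # constructed entities so far

    def ingest(self, record, source):
        record = dict(record, source=source)     # keep pedigree with each record
        for entity in self.universe:             # compare to ALL persisted entities,
            if self.matches(entity, record):     # not to one other "file"
                entity["records"].append(record)
                return entity
        entity = {"records": [record]}           # no match: a new entity is born
        self.universe.append(entity)
        return entity

    @staticmethod
    def matches(entity, record):
        # Deliberately naive match rule, for illustration only.
        return any(r["email"] == record["email"] for r in entity["records"])

ctx = PersistentContext()
ctx.ingest({"email": "m@x.com", "name": "Marek"}, source="CRM")
ctx.ingest({"email": "m@x.com", "name": "Mark"},  source="WebForm")
print(len(ctx.universe))                         # -> 1 (one entity, two sources)
```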
Limited scalability versus massive scalability: As data volumes grow, it becomes more and more unsustainable to reprocess all of one's data holdings. For this reason, the larger the historical volume of data, the less practical merge purge systems become – there comes a point at which there is not enough time to re-crunch all of the data from the ground up. Entity resolution systems that support real-time, sequence-neutral (self-correcting) processing do not depend upon re-loading for accuracy and currency.
While I’ve been obsessed with real-time, perpetual analytics and thus entity resolution type systems, I don’t dispute that there are still many missions in this day and age that only need merge purge functionality. For example, merge purge systems are very well suited for any non-real-time mission that can live with snapshots, which includes things like direct mail marketing and monthly reporting.
Entity resolution systems are best suited for real-time missions where processes require access to the most accurate and most current view of that which is knowable [to the enterprise].
Thanks for addressing this topic. I'm interested in your opinion on a recent US Air Force SBIR proposal entitled "Consolidating Entity Information from Heterogeneous Text Sources for Multi-INT Fusion". It concerns the difficulty in solving two cross-document coreference resolution problems: (1) cross-document name disambiguation, and (2) alias resolution. The authors of this topic seem to think that cross-document resolution involving structured and unstructured data across multi-INT domains is still a major problem.
Is that your view as well, Jeff?
Posted by: Jeff Carr | September 25, 2007 at 07:13 PM
Very informative.
Posted by: Delpierre | October 26, 2007 at 12:43 AM
We found the same thing to be true also.
Posted by: Douglas Schwartz | April 21, 2008 at 11:22 PM
Thank you for this article. It has expanded my narrow thoughts on the uses of Merge and Purge. I had not considered a real-time application of the service.
Posted by: Andrew | August 04, 2008 at 12:37 PM
Thank you for this article. My team is currently debating what kind of business framework we give to our tool set, i.e., survivorship vs. order of precedence.
Our goal is to make available individual source system data as well as a golden record, computed from the various source systems, holding the most accurate active information about a customer, hence a true 360 view of a customer.
Please expect future question(s) on this topic once I demo this article to my team.
Posted by: Umang Juthani | March 12, 2009 at 07:35 AM
This article makes a strong case for probabilistic databases, or other kinds of uncertainty management, and for collective matching (a current trend in machine learning and data mining). I certainly agree with this view.
Posted by: Leonardo | January 15, 2010 at 03:30 AM