This month in IEEE Security and Privacy (November/December 2006) there is an article I wrote that describes in relatively plain English the key principles of "Identity Resolution" and "Relationship Resolution."
Here is a link to a PDF version of this story: Threat and Fraud Intelligence – Las Vegas Style
In a nutshell, here are the essential objectives:
- Sequence Neutrality
- Relationship Aware
- Perpetual Analytics
- Context Accumulation
- Extensibility
- Knowledge-based Name Evaluations
- Real-time
- Scalable
This story also makes the case that probabilistic-based identity matching systems skew over time as the underlying data changes. I have 23 years of work in the area of identity disambiguation at scale. This has led me to the conclusion that starting with deterministic matching and tuning probabilistically is far superior, especially in large data sets that cannot be retrained or reloaded in any reasonable interval (e.g., quarterly).
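To make the "deterministic first, probabilistic tuning second" idea concrete, here is a minimal sketch in Python. It is not the method from the article: the `Record` fields, the `same_identity` function, the conflict veto, and the 0.85 threshold are all assumptions chosen for illustration, and a plain string-similarity score stands in for a real probabilistic model.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class Record:
    # Hypothetical fields; a real system would carry many more attributes.
    name: str
    ssn: Optional[str] = None
    phone: Optional[str] = None


def name_similarity(a: Record, b: Record) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in for a probabilistic score."""
    return SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()


def same_identity(a: Record, b: Record, threshold: float = 0.85) -> bool:
    """Deterministic rules decide first; the tunable score only breaks ties."""
    if a.ssn is not None and b.ssn is not None:
        # Deterministic step: a strong identifier settles the question
        # either way, regardless of how the rest of the data is distributed.
        return a.ssn == b.ssn
    # Tuning step: require a shared phone number plus a similar enough name.
    shared_phone = a.phone is not None and a.phone == b.phone
    return shared_phone and name_similarity(a, b) >= threshold
```

Because the deterministic rule does not depend on the statistical makeup of the data set, its answers do not drift as new records arrive; only the threshold in the fallback step needs tuning.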
Neat article, I'm glad you wrote it.
What kind of controls do organizations put in place to keep people from lying about (or just manipulating) their personal information? For example, someone trying to beat the system could use a pay-as-you-go cell phone number instead of a home number, or a PO box instead of their home address. It seems like that would be an effective way of blocking the identity and relationship resolution process.
Do organizations end up building unique components or procedures to verify different types of data? For example, one system for SSNs, another for credit card numbers, a third for phone numbers?
Would obfuscated identities reveal themselves in some other way, such as tending to have more generic components to their identities?
Is the problem just not worth worrying about? Or will smart attackers looking for large payoffs try to confuse the identity resolution system?
Posted by: Brian | December 04, 2006 at 10:20 AM
Jeff, the technique you describe applies to international trade, which has a compliance requirement to spot blacklisted people and entities in what is referred to as the Denied Parties list. I worked on this problem many years ago and used a modified version of the Double Metaphone algorithm to deal with variations in international names. I also extended the technique to work with international addresses.
You mentioned Soundex in the article, which is very poor at phonetic matching and has been supplanted by Metaphone, although none of the database vendors have advanced their products to replace Soundex yet.
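A minimal sketch of classic American Soundex in Python illustrates the weakness. The function name `soundex` and the sample names are illustrative assumptions, not taken from the article or from any database product, and Double Metaphone itself is not implemented here.

```python
# Digit groups of classic American Soundex; vowels, H, W and Y carry no code.
_CODES = {}
for _digit, _letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if there are no ASCII letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = [letters[0]]                 # the first letter is kept verbatim
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                      # H and W do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code                       # a vowel resets prev, allowing repeats
    return ("".join(result) + "000")[:4]  # pad or truncate to four characters


print(soundex("Robert"), soundex("Rupert"))        # R163 R163
print(soundex("Catherine"), soundex("Katherine"))  # C365 K365
```

Robert and Rupert collide on the same code, while Catherine and Katherine, which sound identical, get different codes because Soundex keeps the first letter as written. Metaphone-family algorithms aim to reduce both kinds of error.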
Posted by: Ray Garcia | January 27, 2008 at 02:51 PM
The constraint of using an identity structure that can be constructed as information is captured makes sense in this context. The main reason is that humans can already conceive of the various attempts at tricking the systems to avoid being caught; establishing a set of rules against the probable structure of data and relationships may therefore work for this specific class of problems.
The strategy may be worth trying for other classes of problems as well, where analysis and prediction have been difficult. A similar strategy to the one described in the article might be to construct a fuzzy ontology and fuzzy action semantics to capture information as it becomes available. The information can then be analyzed for partial representation and given fuzzy treatment in matching and in formulating relationships to other aspects of the knowledge being captured.
This approach strikes a sensible balance between attempting to fully structure the data and struggling to make sense of purely unstructured data.
The above likely cannot be done with a traditional SQL database and would require an RDFS or OWL repository that is modified to support the fuzzy knowledge.
Posted by: Ray Garcia | January 27, 2008 at 02:59 PM
For a related area of research that can help detect the subversion of internal controls, see how data lineage is addressed by models that support Data Provenance. Dr. Sudha Ram at the University of Arizona in Tucson has done some excellent work in this area. See http://kartik.eller.arizona.edu/wits2006_poster_gif.gif for a visual example of what Data Provenance is and how it might be used.
Posted by: Ray Garcia | January 28, 2008 at 08:25 AM
Identity is the simple root for an address search; once it is resolved, we can easily find a person's address.
Posted by: Thomas | April 20, 2012 at 12:54 AM