November 29, 2006

IEEE Paper: Threat & Fraud Intelligence – Las Vegas Style

This month in IEEE Security and Privacy (November/December 2006) there is an article I wrote that describes in relatively plain English the key principles of "Identity Resolution" and "Relationship Resolution."

Here is a link to a PDF version of this story: Threat and Fraud Intelligence – Las Vegas Style

In a nutshell, here are the essential objectives:

This story also makes the case that probabilistic identity matching systems skew over time as the underlying data changes. After 23 years of work on identity disambiguation at scale, I have concluded that starting with deterministic matching and tuning probabilistically is far superior, especially for large data sets that cannot be retrained or reloaded at any reasonable interval (e.g., quarterly).
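To make the "deterministic first, probabilistic tuning" idea concrete, here is a minimal sketch in Python. It is one possible reading of the approach, not the method from the paper: the `Record` fields, the two deterministic rules, the `difflib`-based name score, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    """Hypothetical identity record; the fields are illustrative only."""
    name: str
    dob: str        # ISO date string, e.g. "1970-01-31"
    phone: str
    ssn: str = ""


def normalize_phone(phone: str) -> str:
    """Keep digits only so formatting differences do not block a match."""
    return "".join(ch for ch in phone if ch.isdigit())


def deterministic_match(a: Record, b: Record) -> bool:
    """Fixed rules that do not depend on the statistics of the data set."""
    # Assumed rule 1: identical, non-empty SSN.
    if a.ssn and a.ssn == b.ssn:
        return True
    # Assumed rule 2: identical date of birth and identical phone number.
    pa, pb = normalize_phone(a.phone), normalize_phone(b.phone)
    return bool(a.dob) and a.dob == b.dob and bool(pa) and pa == pb


def name_similarity(a: Record, b: Record) -> float:
    """Crude string similarity in [0, 1]; a stand-in for a tuned score."""
    return SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()


def resolve(a: Record, b: Record, threshold: float = 0.85) -> str:
    """Deterministic rules decide first; the score only flags near misses."""
    if deterministic_match(a, b):
        return "match"
    if a.dob and a.dob == b.dob and name_similarity(a, b) >= threshold:
        return "possible"
    return "no match"


r1 = Record("Jon Smith", "1970-01-31", "(702) 555-0100")
r2 = Record("John Smith", "1970-01-31", "702-555-0100")
r3 = Record("John Smith", "1970-01-31", "702-555-0199")
print(resolve(r1, r2), resolve(r1, r3))
```

In this sketch the deterministic rules never change as data is loaded, so earlier decisions do not drift. Only the threshold on the fuzzy score would be tuned over time.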

Comments

Neat article, I'm glad you wrote it.

What kind of controls do organizations put in place to keep people from lying about (or just manipulating) their personal information? For example, someone trying to beat the system could use a pay-as-you-go cell phone number instead of a home number, or a PO box instead of their home address. It seems like that would be an effective way of blocking the identity and relationship resolution process.

Do organizations end up building unique components or procedures to verify different types of data? For example, one system for SSNs, another for credit card numbers, a third for phone numbers?

Would obfuscated identities reveal themselves in some other way, such as tending to have more generic components to their identities?

Is the problem just not worth worrying about? Or will smart attackers looking for large payoffs try to confuse the identity resolution system?

Jeff, the technique you describe applies to international trade, which has a compliance requirement to spot blacklisted people and entities on what is referred to as the Denied Parties list. I worked on this problem many years ago and used a modified version of the Double Metaphone algorithm to deal with variations in international names. I also extended the technique to work with international addresses.

In the article you mentioned Soundex, which is very poor at phonetic matching and has been supplanted by Metaphone, although none of the database vendors have advanced their products to replace Soundex yet.
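For readers who have not seen Soundex up close, here is a minimal sketch of the classic American Soundex rules in Python. It is written for illustration only (database implementations can differ in edge cases), and the example names are my own, not taken from the article.

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        code = codes.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad or truncate to four characters


# Different names collapse to the same code:
print(soundex("Robert"), soundex("Rupert"))
# The same sound with a different first letter gets different codes:
print(soundex("Catherine"), soundex("Katherine"))
```

The first letter is kept verbatim and everything after the third coded consonant is dropped, which is where much of the coarseness comes from.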

The constraint of using an identity structure that can be constructed as information is captured makes sense in this context. The reason is mostly that humans can already conceive of the various attempts at tricking the system to avoid being caught; therefore, establishing a set of rules against the probable structure of data and relationships may work for this specific class of problems.

The strategy may be worth trying for other classes of problems as well, where analysis and prediction have been difficult. A similar strategy to the one described in the article might be to construct a fuzzy ontology and fuzzy action semantics to capture information as it becomes available. The information can then be analyzed for partial representation and given fuzzy treatment in matching and in formulating relationships to other aspects of the knowledge being captured.

This approach provides a sensible balance between attempting to fully structure the data versus the difficulty of making sense out of purely unstructured data.

The above likely cannot be done with a traditional SQL database and would require an RDFS or OWL repository that is modified to support the fuzzy knowledge.

A related area of research that can help detect the subversion of internal controls is data lineage, as addressed by models that support Data Provenance. Dr. Sudha Ram at the University of Arizona in Tucson has done some excellent work in this area. See http://kartik.eller.arizona.edu/wits2006_poster_gif.gif for a visual example of what Data Provenance is and how it might be used.
