I wrote an article with the above title. This article has since been published in the proceedings of the International Risk Assessment and Horizon Scanning Symposium 2010 (IRAHSS) in
[Opening Excerpt]
Man continues to chase the notion that systems should be capable of digesting daunting volumes of data and making sufficient sense of this data such that novel, specific, and accurate insight can be derived without direct human involvement. While there are many major breakthroughs in computation and storage, advances in sensemaking systems have not enjoyed the same significant gains.
This article suggests that the single most fundamental capability required to make a sensemaking system is the system’s ability to recognize when multiple references to the same entity (often from different source systems) are in fact the same entity. For example, it is essential to understand the difference between three transactions carried out by three people versus one person who carried out all three transactions. Without the ability to determine when entities are the same, it quickly becomes clear that sensemaking is all but impossible.
Full article here.
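To make the excerpt's three-transactions example concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the records, the field names, and the deliberately simplistic rule that two records refer to the same person when they share a phone number or an email address. It is not the method described in the article, only a way to see how the count changes once references are resolved.

```python
from itertools import combinations

# Hypothetical transaction records; names, phones and emails are made up.
transactions = [
    {"id": "T1", "name": "Robert Smith", "phone": "555-0101", "email": None},
    {"id": "T2", "name": "Bob Smith", "phone": "555-0101", "email": "bsmith@example.com"},
    {"id": "T3", "name": "R. Smith", "phone": None, "email": "bsmith@example.com"},
]


def same_entity(a, b):
    """Toy rule (an assumption): a shared phone or email means same person."""
    return any(a[k] is not None and a[k] == b[k] for k in ("phone", "email"))


def count_entities(records):
    """Count distinct entities by merging matching records with union-find."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if same_entity(a, b):
            parent[find(a["id"])] = find(b["id"])

    return len({find(r["id"]) for r in records})


naive_count = len({t["name"] for t in transactions})  # counts name strings
resolved_count = count_entities(transactions)         # counts people
print(naive_count, resolved_count)
```

Counting distinct name strings suggests three people. Under the toy rule, T1 and T2 share a phone and T2 and T3 share an email, so all three transactions collapse to one person.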
I find most organizations have underestimated this principle: If a system cannot count, it cannot predict. While I covered this point in some detail in a previous post, this new article is more complete and has a section entitled Expert Counting Systems: Essential Ingredients For Sensemaking which covers such issues as:
- Expert counting engines should not rely on training data.
- Counted entities should accumulate features.
- Entities believed to be the same should be asserted as same.
- Expert counting benefits from favoring the false negatives.
- New observations should be able to reverse earlier assertions.
- Full attribution/pedigree of each observation should be maintained.
- It should be fast in order to digest the historical data.
- It should be real time so that counting assertions can be made as the transaction is happening, in time to do something about it.
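A few of the ingredients above can be sketched in code. The following is a toy illustration only, not the engine described in the article: the `Observation` fields, the `CountingEngine` class, and the matching rule (same name and address, unless dates of birth conflict) are all hypothetical. It uses a fixed rule rather than training data, accumulates features per entity, keeps the pedigree of every observation, prefers to leave uncertain records apart, and lets a new observation undo an earlier "same" assertion. For simplicity it re-evaluates everything on each add, so it makes no attempt at the speed or real-time requirements.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    source: str      # pedigree: which system reported this record
    record_id: str
    name: str
    address: str
    dob: Optional[str] = None


class CountingEngine:
    """Toy rule-based resolver (hypothetical), re-evaluated on every add."""

    def __init__(self):
        self.observations = []

    def add(self, obs):
        self.observations.append(obs)
        return self.entities()

    def entities(self):
        # Candidate groups: records with the same name and address.
        groups = defaultdict(list)
        for o in self.observations:
            groups[(o.name.lower(), o.address.lower())].append(o)

        resolved = []
        for members in groups.values():
            dobs = {o.dob for o in members if o.dob is not None}
            if len(dobs) <= 1:
                resolved.append(members)  # no conflict: assert same
                continue
            # Conflicting birth dates: split by date, and leave records
            # without a date on their own (favor the false negative).
            by_dob = defaultdict(list)
            for o in members:
                if o.dob is None:
                    resolved.append([o])
                else:
                    by_dob[o.dob].append(o)
            resolved.extend(by_dob.values())
        return [self._summarize(m) for m in resolved]

    @staticmethod
    def _summarize(members):
        return {
            # Features accumulate across all records asserted as same.
            "names": {o.name for o in members},
            "addresses": {o.address for o in members},
            "dobs": {o.dob for o in members if o.dob is not None},
            # Full attribution of every contributing observation.
            "pedigree": [(o.source, o.record_id) for o in members],
        }


engine = CountingEngine()
engine.add(Observation("crm", "c-1", "John Doe", "12 Elm St"))
after_two = engine.add(
    Observation("billing", "b-7", "John Doe", "12 Elm St", dob="1950-01-01"))
after_three = engine.add(
    Observation("loyalty", "l-3", "John Doe", "12 Elm St", dob="1980-05-05"))
print(len(after_two), len(after_three))
```

After two observations the engine asserts a single John Doe. The third observation carries a conflicting date of birth, which suggests two people at that address (for example a father and son), so the earlier assertion is reversed and the undated record is left unattached rather than guessed at.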
Anyway, long story short, expert counting is non-trivial, especially at scale, and lots more must be done in this area.
Miscellaneous Note: Over the years I’ve sometimes used the term Semantic Reconciliation (recognizing two things are the same despite having been described differently) to describe counting. And, many have heard me or others using the term Entity Resolution or Identity Resolution. Yes, more words that relate to counting … especially with respect to people or organizations: is this about one person or two? Unfortunately, trying to explain these terms to non-technical people has been a bit of work, so now in an attempt to make the concept more consumable … maybe the term “Expert Counting” is an improvement.
This "Counting Problem" is more commonly known as the (Semantic) Equivalency Problem.
Posted by: James Paul White | May 31, 2010 at 04:23 AM
Jeff,
Great article. While entity resolution / semantic equivalency isn't glamorous, it is indeed fundamental. As they say, garbage in, garbage out.
Also enjoyed piece on data finds data a while back.
Cheers,
Dave
Posted by: Kellblog | June 02, 2010 at 10:45 AM
Bill James radically transformed the world of baseball analysis. He did so by creating a canonical process for looking at player performance -- based on accurate counting.
"So if we can't tell who the good fielders are accurately from the record books, and we can't tell accurately from watching, how can we tell?
*By counting things*." Bill James, attributed by Michael Lewis in Moneyball, page 69
Posted by: Abe | July 08, 2010 at 05:35 PM