What this means: If you cannot count and associate like
things and remember them, it is damn unlikely your higher-level prediction
system is going to produce accurate answers. Trajectory and velocity are required for many forms of prediction.
Example: Is the customer becoming a better customer and, if so, how fast? How fast is the voter registration roll growing? Is the swine flu headed this way or that way, and how fast? Is on-line sentiment around your prized brand on the rise or in decline?

If you think you have five customers moving slowly on unremarkable vectors when in fact these five customers are all the same person, you might be missing the fact that this one customer is moving on a specific vector with meaningful velocity (e.g., becoming a much better or worse customer).

Why do I speak of this? Well, I get a chance to see some of the most advanced sensemaking systems being created and tested around the world. And the way I determine (in about five minutes) whether they have half-a-chance of ever delivering high value is this quick and dirty
assessment: Can they count discrete objects?

Take a hypothetical biosurveillance system on the West Coast of the US, which is supposed to observe the trends of a future influenza outbreak – say, a new swine flu mutation. Such a system might, for example, use newsfeeds and other available data like blogs to count incidents and locations over time. How accurately could such a system make predictions if San Francisco, San Fran, SF, and the Bay Area were tallied as discrete regions? If the system cannot tally these geographically, there might appear to be several cities, each with mild volumes – when in fact it is one dense region with moderate volumes.

Counting like entities (Semantic Reconciliation) is fundamental to the measurement of trajectory and velocity. Nonetheless, it seems a lot of organizations want to work on prediction first and then circle back to cover “basic counting” later.
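The reconciliation step in the biosurveillance example can be sketched in a few lines. This is a minimal illustration, not how any particular system works; the alias table and canonical region name are made up:

```python
from collections import Counter

# Hypothetical alias table mapping surface forms to a canonical region.
# A real semantic-reconciliation system would learn or curate this
# mapping from evidence, not hard-code it.
REGION_ALIASES = {
    "san francisco": "SF Bay Area",
    "san fran": "SF Bay Area",
    "sf": "SF Bay Area",
    "bay area": "SF Bay Area",
}

def canonical_region(mention: str) -> str:
    """Resolve a raw place mention to its canonical region (or itself)."""
    key = mention.strip().lower()
    return REGION_ALIASES.get(key, mention.strip())

def tally(mentions):
    """Count incidents per canonical region instead of per raw string."""
    return Counter(canonical_region(m) for m in mentions)

mentions = ["San Francisco", "SF", "San Fran", "Bay Area", "Portland"]
print(tally(mentions))
# → Counter({'SF Bay Area': 4, 'Portland': 1})
```

Without the alias step, the same mentions would be tallied as five distinct places with mild volumes each; with it, four of them collapse into one region with moderate volume. The point is simply that counting happens after reconciliation, not before.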
Well, let me just say … counting is a first order activity. Get this wrong and your prediction system will certainly miss obvious predictions, and it will lead astray all downstream processes, e.g., staff who are taking these predictions as inputs. Such systems will also fail to scale. And, to the extent an organization is in the “we want to detect weak signals” business … counting becomes even that much more important.

Smart systems, prediction systems, sensemaking systems, situational awareness systems, incremental learning systems … whatever one calls these things … must first be able to form an opinion (aka make assertions) about context (aka count and associate) … if they are to be relevant.

On a more technical note:

1. While I am referring to determining when two entities are the same entity (e.g., same location), it is equally important to be able to assert classification (e.g., same kind of car) and relationships (e.g., employed by).

2. For some time now, I have been thinking of technologies that compute when entities are the same, similar, and related as “Assertion Engines.” Note, I do not think of these as “Estimation Engines,” which would compute and persist probabilities between each and every observation collected. Assertion-based systems take very high-grade probabilities and call it “so” … leaving some room for maybes … and calling the rest “not so.”
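The difference between an assertion engine and an estimation engine can be pictured as a simple thresholding step. The threshold values and match score below are invented for illustration:

```python
# A minimal sketch of the "assertion engine" idea: rather than persisting
# a probability for every pair of observations, pick thresholds and
# persist only the decision. Thresholds here are illustrative.

ASSERT_SAME = 0.95   # at or above: call it "so" and remember it as fact
MAYBE = 0.75         # in between: park as a "maybe" for later review

def decide(match_score: float) -> str:
    """Turn a pairwise match probability into an assertion."""
    if match_score >= ASSERT_SAME:
        return "so"          # asserted: same entity, persisted
    if match_score >= MAYBE:
        return "maybe"       # small band of undecideds
    return "not so"          # everything else: not stored at all

for score in (0.99, 0.85, 0.30):
    print(score, "->", decide(score))
```

Persisting only decisions (plus a small maybe pile), rather than a probability between every pair of observations ever collected, is presumably what lets this approach scale.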
This, I believe, is essential to scalability.

3. Assertion engines that deal with context (determining if an entity is the same, similar, or related) seem to misbehave if one does not favor the false negative (confident beyond a reasonable doubt). This means one must decide it is true – and remember that it is true.
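One way to picture "decide it is true and remember it, while leaving room to change one's mind" is the sketch below. The class, its fields, and the revision trigger are mine, not from any particular system:

```python
# A sketch of "decide it is true – and remember that it is true," while
# retaining enough evidence that a past assertion can be unwound later.

class AssertionStore:
    def __init__(self):
        self.same_as = {}    # asserted identities: mention -> canonical id
        self.evidence = {}   # why we decided, kept for later review

    def assert_same(self, mention, canonical, why):
        self.same_as[mention] = canonical
        self.evidence[mention] = why

    def retract(self, mention, why):
        # New evidence can overturn a past assertion: the system
        # "flip flops," changing its mind about the past.
        self.same_as.pop(mention, None)
        self.evidence[mention] = why

store = AssertionStore()
store.assert_same("Don Conrad, 123 Main St", "Donald Conrad, 123 Main St",
                  why="same surname and address, compatible first name")
# Later: we learn there is a Jr. and a Sr. at this address.
store.retract("Don Conrad, 123 Main St",
              why="Jr./Sr. ambiguity at this address")
print(store.same_as)   # {} – the earlier assertion has been unwound
```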
Example: Donald Conrad at 123 Main Street IS Don Conrad at 123 Main Street. Unless, of course, new evidence becomes available (at a later point in time) which brings this into question, e.g., learning there is a Conrad Jr. and a Conrad Sr. living at 123 Main Street. This is why smart
systems flip flop – changing their mind about the past.

RELATED POSTS:
To Know Semantic Reconciliation is to Love Semantic Reconciliation
Sequence Neutrality in Information Systems
Streaming Analytics vs. Perpetual Analytics (Advantages of Windowless Thinking)
Accumulating Context: Now or Never
Federated Discovery vs. Persistent Context – Enterprise Intelligence Requires the Latter
Entity Resolution Systems vs. Match Merge/Merge Purge/List De-duplication Systems
More Data is Better, Proceed With Caution
How to Use a Glue Gun to Catch a Liar
It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You
There Is No Such Thing As A Single Version of Truth
Big Breakthrough in Performance: Tuning Tips for Incremental Learning Systems
The TAC 2009 Knowledge Base Population Track (http://apl.jhu.edu/~paulmac/kbp.html), might be a good place to look to get an idea of the community's approaches for addressing some of the points you've raised. The task consists of two subtasks: Entity-Linking and Slot-Filling. Entity-Linking focuses on resolving entities, given mentions in documents. Unlike previous entity-resolution tasks, the task here is to map mentions to entities in a knowledge-base (or to declare that the entity is unknown). Slot-Filling focuses on extracting facts and relationships about a given entity from a set of target texts.
Although the conference itself is occurring in November, there's at least one work out there that addressed entity-linking.
Posted by: Eric | August 06, 2009 at 01:47 PM