If the goal is to make substantially more sense of data, the only way forward is CONTEXT ACCUMULATION.
This is so important … I have been feverishly looking for a better way to explain in plain English how real-time, streaming, contextualization works. The very best analogy I have come up with to date is that of assembling jigsaw puzzles. The parallels are uncanny.
Puzzling
A pile of puzzle pieces is in a box. You cannot look at the picture on the cover of the box, as that would be cheating and not like the real world.
You grab a piece out of the box and look at your table space. It is the first piece. So it really does not matter where you place it. The second piece is now in hand. Does this relate to the first piece? Probably not. So it is placed on its own elsewhere on the table. This was an assertion. You decided this second piece is not related to the first piece.
Soon there are many free standing pieces scattered across the table space - none at this point have been associated (snapped together). Now you have the next piece in hand, eager to find its mate. Do you attempt to physically match it with every possible piece in what would be a very brute force and time-consuming manner? No. You notice the piece in hand has at least one discriminating feature, some red and white on one of the puzzle edges. Glancing over the table space you look for pieces with a similar distinguishing feature. You find three such candidates. Your attention is now narrowly focused on just these three pieces.
Comparing the piece in hand to each candidate, you are assessing confidence. And at the end of this process you come to a decision point: a new assertion. It is either (A) a match, (B) not a match but possibly a member of a family of pieces that hopefully will converge as new puzzle pieces arrive, or (C) at this time, has no apparent relation whatsoever to any other pieces.
You only connect two pieces when you are sure. You would never get out a hammer and force the piece. In this regard you are favoring the false negative – only connecting the pieces if you have a high degree of certainty.
Let’s say this latest piece finds a match. You are now sitting there with a new puzzle unit (two pieces). Looking this new unit over you ask yourself, “Now that I know this, are there some other puzzle pieces that can now be associated to this unit?” This involves the same process: candidates, confidence, and assertion. There comes a point where an assertion or set of assertions are made and now it is time to move on to the next piece.
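To make this process concrete, here is a minimal sketch in code of the candidates, confidence, and assertion loop. Everything here is invented for illustration: the feature index, the Jaccard-style overlap score, and the threshold are stand-ins, not how any particular production system works.

```python
# Hypothetical sketch of the candidates -> confidence -> assertion loop.
# The feature index, scoring function, and threshold are illustrative only.
from collections import defaultdict

MATCH_THRESHOLD = 0.9  # favor the false negative: only assert when very sure

feature_index = defaultdict(set)  # feature -> ids of pieces already on the table
pieces = {}                       # piece id -> its set of features

def contextualize(piece_id, features):
    """Place one new observation into the accumulating context."""
    features = set(features)

    # 1. Candidates: only pieces sharing at least one discriminating feature
    #    (the "red and white on one edge" glance across the table).
    candidates = set()
    for f in features:
        candidates |= feature_index[f]

    # 2. Confidence: score each candidate by feature overlap (Jaccard).
    best_id, best_score = None, 0.0
    for cand in candidates:
        score = len(features & pieces[cand]) / len(features | pieces[cand])
        if score > best_score:
            best_id, best_score = cand, score

    # Record the new piece so future pieces can find it.
    pieces[piece_id] = features
    for f in features:
        feature_index[f].add(piece_id)

    # 3. Assertion: (A) match, (B) possible family, or (C) no relation.
    if best_score >= MATCH_THRESHOLD:
        return ("match", best_id)
    if candidates:
        return ("possible family", sorted(candidates))
    return ("no relation", None)
```

Note the indexing step: the piece in hand is never brute-force compared against every piece on the table, only against the few candidates that share a distinguishing feature.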
Some pieces produce remarkable epiphanies. You grab the next piece, which appears to be just some chunk of grass - obviously no big deal. But wait … you discover this innocuous piece connects the windmill scene to the alligator scene! This innocent little new piece turned out to be the glue.
You can change the approach, operate on a hunch and become curious. Sometimes you decide there is an opportunity to resolve some uncertainty. For example, you have a cluster of pieces all appearing to be related – each having some red and white expressed. With this new interest in mind you don’t grab the next random piece but rather shuffle through the pieces in the box looking specifically for red and white pieces. The goal: with the right few pieces this whole portion of the puzzle may resolve.
Luckily you have only been snapping pieces together when you are sure. Otherwise, your puzzle would be not only a mess but more importantly it would not be evolving towards any degree of clarity. But no matter how careful you have been … every now and then you grab the next piece out of the box and the home you find for it causes you to realize that two earlier pieces should not have been connected at all. It catches you off guard for a second, but upon closer inspection you realize you stuck these pieces together inappropriately. Shame, shame. But luckily, this past error (a false positive) was discovered and corrected, thanks to the new observations.
The puzzle you are working on today happens to be a pretty big puzzle with what appears to be thousands of pieces. And without the cover of the box you are unsure of its final size (e.g., 1’x 1’ or 3’x 3’, or bigger). One thing is for sure: you have to leave some space between pieces that do not connect; therefore, you need a table bigger than the final puzzle size. Notably, after a great deal of work, there is a point when the in-progress puzzle reaches its maximum required workspace. After this tipping point, new pieces have a higher likelihood of finding mates and consolidation than not.
As the working space of the puzzle begins to collapse, not only does context become richer, but the computational effort of figuring out where the next piece belongs becomes more efficient despite the fact there are more pieces on the table than ever. Assertions become faster and more certain. So much so … those last few pieces are as fast and easy as the first few pieces! [I have seen this behavior in one of my systems … an absolutely phenomenal event with extraordinary ramifications!]
But is it really this easy? No.
There may be more than one puzzle in the box, some puzzles having nothing to do with others. There may be duplicate pieces, pieces that disagree with each other, and missing pieces. Some pieces may have been shredded and are now unusable. Other pieces are mislabeled and/or are exceptionally well crafted lies.
Nonetheless, you will never know what you know ... unless you contextualize your observations!
On a Slightly More Technical Level:
1. When I speak of Persistent Context, this is synonymous with the "work-in-progress puzzle" where the net sum of all previous observations and assertions co-exist.
2. Semantic reconciliation (e.g., identity resolution) is but one of several steps of contextualization processing, albeit one of the most important ones, as it asserts "same" or "not same."
3. Contextualizing observations is entirely dependent on one’s ability to extract and classify features from the observations. Feature extractors fall woefully short today. I’ve hinted at what I think will fix this in previous posts and more about this later.
4. Using new observations to correct earlier assertions is an essential property I have been referring to as Sequence Neutrality. When systems favor the false negative, sequence neutrality most frequently discovers false negatives, while the discovery of previous false positives is few and far between.
5. Non-training-based, context-accumulating systems with sequence neutrality have this behavior: the puzzle can be assembled first pass, without brute force, where the computational cost of the last pieces is as low as the first pieces, while having no knowledge of what the picture looks like beforehand, and regardless of the order in which the pieces are received.
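Sequence neutrality can also be sketched roughly in code. In this toy version, every new observation that lands triggers a re-check of the clusters it touches, so two units previously kept apart (a false negative) get merged the moment a "glue" piece reveals they belong together. The cluster representation, score, and threshold are all illustrative assumptions, not the actual mechanics of my systems; this sketch also only shows repairing false negatives, not splitting false positives.

```python
# Rough sketch of sequence neutrality: each accepted observation triggers
# re-evaluation of the units it touches. All names, the Jaccard score, and
# the threshold are illustrative assumptions, not a real implementation.

THRESHOLD = 0.5  # illustrative high-confidence bar

def score(a, b):
    """Illustrative confidence: Jaccard similarity over feature sets."""
    return len(a & b) / len(a | b)

def revisit(clusters, new_features):
    """Add one observation, then re-check the affected clusters.

    clusters: list of feature sets, one per assembled 'unit' on the table.
    Mutates and returns the cluster list.
    """
    new_features = set(new_features)

    # First try to attach the new observation to an existing unit.
    for c in clusters:
        if score(c, new_features) >= THRESHOLD:
            c |= new_features
            break
    else:
        clusters.append(new_features)

    # Sequence neutrality: the new information may reveal that two units
    # previously asserted unrelated (a false negative) actually connect,
    # like the grass piece gluing the windmill scene to the alligator scene.
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if score(clusters[i], clusters[j]) >= THRESHOLD:
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

For example, a windmill unit and an alligator unit sit apart until one bridging piece arrives carrying features of both; the attach step joins it to one unit, and the re-check then collapses the two units into one.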
This is what I am working on these days. It is very exciting and I enjoy talking about such things. Feel free to comment or email me questions. I answer every email I get.
I think the hardest part of explaining this is the cliche problem:
Puzzles are, in fact, a perfect analogy to contextualization. But we've been talking about "another piece of the puzzle" for decades, or maybe centuries. When you say "it's like putting together a puzzle", people no longer get a mental picture of puzzle pieces; it's not a metaphor at all. It's just an idiom, like "momentum" or "perspective" or "focus".
Once it becomes idiom, you lose all the cool facets that came with the metaphor. Puzzle pieces are part of a whole! And if you look at them from a different angle, you might mistake them for something else! And they only fit together a certain way! And... doesn't matter. The metaphor has lost its power - quite *because* it is so very apt.
This is why you don't want logos or stock art involving globes, rainbows, multi-racial handshakes, and smiling people on headsets. It's not that they don't fit; it's that they don't communicate anything anymore.
Finding a good metaphor is no longer enough. You have to find a fresh one, too. That's much harder - in fact, it's like finding a needle in a haystack.
Posted by: Jay Levitt | November 25, 2008 at 08:40 AM
Is there a real-life problem you are trying to generate solutions for?
I am asking because, sometimes, the opposite of context accumulation (context decumulation?!??!) may also make sense. What I mean is, instead of starting with zero context and trying to build it up, you can start with the maximum possible context and try to reduce it to only one (the exact location and position of the puzzle piece relative to others).
Induction vs. deduction.
Of course, this is only useful when you know all possible locations and positions your puzzle piece can be initially and from that remove those that it can't be in.
Sounds like a brute force algorithm, but it is not.
Actually, this is exactly the algorithm I had used in the new Address Parser in EAS to make sense of human language addresses (IBM patent pending).
Posted by: Baris Yazici | November 25, 2008 at 02:51 PM
Good metaphor! I love visualizing multiple puzzles in one box with no pictures -- seems like there may be a market there? :-)
IMO, for the puzzle analogy to be similar in difficulty you'd have to have a tessellation puzzle where there'd be no physical limitations to putting pieces together (i.e. no hammer and no weighting towards false negatives). I often see two puzzle pieces that look like they should be adjacent, only later to find that I'd need the hammer -- I think the puzzles have more "context per piece" than typically exists in your systems.
Posted by: Brand Hunt | November 25, 2008 at 11:07 PM
Terrific post, and comments. You've inspired me to re-think some sticky issues by using Context in a different way - recognizing the absence of something that should be present, but isn't.
Posted by: Jeff Carr | November 29, 2008 at 08:59 AM
I came across your blog after googling information management. This seems to be used extensively in the insurance industry:
http://slabbed.wordpress.com/2008/12/02/the-scheme-them-thats-got-and-them-thats-not-the-monopoly-game-chapter-6-qui-tam/#comments
If the detailed explanation in the above site is any indication, the average Joe doesn't have a chance.
How does one get to the heart of these processes?
Posted by: Juanillo | December 02, 2008 at 01:53 PM
Good stuff. Thought provoking.
Your observation about iteratively making assertions, match, no match, or possible match, is similar to the InfoBright technology for BI and the underlying rough set concepts.
It is interesting to note that the premise of their BI architecture, incremental, scalable updates to a warehouse without reloads, is similar if not identical to the theme of your post. It seems the hurdles of existing warehouses have influentially formed the zeitgeist.
Posted by: Randy Brokaw | February 08, 2009 at 09:19 AM
Hi JJ
I know it's only an analogy, and not meant to be precise in every detail, and I'm also not trying to argue against your underlying convictions - which I think I agree with.
But another way to look at the end game of putting a puzzle together is to regard pieces that are already joined as being a single (composite) piece. Looked at from that POV, the process of assembling the puzzle has monotonically decreasing difficulty - all the way to the point where you're down to the last two pieces (possibly both composite), at which point a 2yr old child can finish the puzzle.
In other words, the implied(?) argument that things might be expected to be getting harder because you have more pieces on the table, can be flipped on its head - with great effect, because the puzzle *is* getting easier, as seen by the speedup provided by non-false-positive context.
In a sense I'm just picking nits. But I do also find great value in thinking about practical / real-world analogies and how they can give insight into how we go about designing computational systems. FluidDB is in large part about always being able to insert new information into a system in the place(s) where it (is imagined it) will be of most value. It's about how information becomes more valuable when it's in the right context. I love thinking about how we work with information in the real world (e.g., putting a puzzle together, or even just putting a bookmark in a book or a post-it note onto something) and comparing it to the way our familiar computational environments force us to work with information. The gap seems vast, as I think you agree. I also find it fun to think about how we might build systems that make the latter more like the former, and to have also dedicated so much of my life to actually trying to build such things.
Cheers from NYC.
Posted by: Terrycojones | February 13, 2010 at 11:02 AM