When I ask investigators or analysts what technology improvements they would most appreciate, invariably one of their top requests is “to get answers to their questions faster.” This has always struck me as funny. What if the question being asked today is not a smart question until next Thursday? How can we expect analysts to ask every smart question every day? In short, this is kind of like climbing a tree to get to the moon. You can always inch further up, but how is that really going to get you where you need to go?
Systems that produce different answers depending on the order of events lack a property I refer to as "Sequence Neutrality." Sequence neutrality means that regardless of the order in which data or queries arrive, the end state, once all data points are known, is the same. Sequence neutrality spares systems from having to ask every smart question every day.
Here's an example. Today, when a bank searches for "Billy the Kid," the answer depends on whether such a record already exists at the time of the query. With sequence neutrality, however, the moment "Billy the Kid" opens a bank account, whenever that occurs, the user who made the original query can be notified. And if, months later, "Billy the Kid" is added to the OFAC list (the list of people and organizations that financial institutions are banned from doing business with), the bank is alerted instantly.
As another example, government entities perform background checks on individuals seeking "top secret" clearances. What happens if one of the systems used to favorably qualify a person later receives a record suggesting the applicant deserves additional scrutiny? Say a record shows up in a registered (and public) sex offender database shortly after the person is granted a clearance. How will the agency learn of this new data point? One option would be for the government to ask every question every day, which is obviously impractical. To address this scenario, the US Government instead repeats background checks every five years. But that means a glaring problem in the data may not be discovered until the question is asked again, potentially years later. In a system designed for sequence neutrality, the moment a relevant record comes into existence, it is published (pushed) to the relevant system or user.
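As a rough illustration of this push model, here is a minimal Python sketch of a "standing query" registry. The names (StandingQueryRegistry, watch, ingest) and the record layout are made up for the example; they are not from any actual product.

```python
# Illustrative sketch only: a toy in-memory "standing query" registry.
# Real systems would persist subscriptions and push alerts over a message
# bus rather than invoking a callback directly.

from collections import defaultdict
from typing import Callable


class StandingQueryRegistry:
    def __init__(self) -> None:
        self._watchers = defaultdict(list)  # subject -> callbacks to notify
        self._records = defaultdict(list)   # subject -> records seen so far

    def watch(self, subject: str, notify: Callable[[dict], None]) -> None:
        """Register a query that outlives the moment it was asked."""
        self._watchers[subject].append(notify)
        for record in self._records[subject]:
            notify(record)  # data arrived first: answer immediately

    def ingest(self, record: dict) -> None:
        """Run new data past the stored queries instead of waiting to be asked."""
        subject = record["subject"]
        self._records[subject].append(record)
        for notify in self._watchers[subject]:
            notify(record)  # query arrived first: push the alert now


# Either order of events yields the same outcome: the bank learns of the new
# account and the OFAC listing whenever those facts appear, not only when it
# happens to re-ask the question.
registry = StandingQueryRegistry()
registry.watch("Billy the Kid", lambda r: print("ALERT:", r))
registry.ingest({"subject": "Billy the Kid", "event": "opened account"})
registry.ingest({"subject": "Billy the Kid", "event": "added to OFAC list"})
```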
When sequence neutrality is applied to information systems, a very interesting effect emerges: the "data finds the data." As each new piece of data is observed, the system considers how it relates to every previously observed data point, without waiting for a user to ask a question. And while this benefits a single system, it is even more powerful when applied across heterogeneous systems. Suddenly, very interesting insight is possible.
How does a company recognize that its accounts payable manager shares the same phone number as its largest vendor (a relationship that can violate company policy if undisclosed)?
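One way to picture how that overlap could surface without anyone asking is to index every arriving record by its shared attributes and report a collision the instant the second record lands. The sketch below assumes records carry role, name, and phone fields; all of the names are illustrative.

```python
# Illustrative sketch: flag records from different roles that share an
# attribute value (here, a phone number) the moment the overlap appears.

from collections import defaultdict

phone_index = defaultdict(list)  # phone number -> records observed so far


def observe(record: dict) -> None:
    """Relate each new record to everything previously observed."""
    prior_records = phone_index[record["phone"]]
    for prior in prior_records:
        if prior["role"] != record["role"]:
            print(f"Possible undisclosed relationship: {prior['role']} "
                  f"{prior['name']!r} and {record['role']} {record['name']!r} "
                  f"share phone {record['phone']}")
    prior_records.append(record)


# The alert fires regardless of which record arrives first.
observe({"role": "employee", "name": "A.P. Manager", "phone": "555-0100"})
observe({"role": "vendor", "name": "Biggest Vendor Inc.", "phone": "555-0100"})
```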
When the "data finds the data," such insight and awareness is not only possible, it is essential to creating market-differentiating services. Whether an organization is focused on managing customer relationships, credentialing parties, evaluating credit risk, or handling investigations, unusually powerful possibilities emerge once sequence neutrality is built in.
Jeff,
I strongly agree with your "aggregate vs. sequence results" perspective. The need for systems to neutralize the challenges posed by traditional runtime "race conditions" and other nondeterministic factors inherent in distributed systems is key to solving many of the representative problems you cite.
With respect to the "temporal nature of request...", I again strongly support your position on the need to monitor and broadcast "substantial" changes in results based on parametric data. I would add that there is considerable value in keeping some persistent data available to track the users and organizations that "used" information that has since changed (per the discussion and scenario above). Decision makers may want to consider these insights in addition to leveraging the latest and greatest (temporal) perspective on the available data set.
I recognize this is probably covered in other discussion threads, but I have to bring up the importance of disambiguation. Applying a plethora of text analytics technologies to minimize ambiguity is key to meeting these challenges.
Regards, from a long-time SRD (and now IBM Entity Analytics) fan, Fred M-D
Posted by: Fred M-D | February 05, 2006 at 12:41 PM
Dear Jeff James,
I am a Brazilian journalist and I would like to interview you. I can explain the aim of the interview by e-mail. Could you give me your e-mail address?
Best Regards,
Solange
Posted by: Solange Azevedo | March 02, 2006 at 09:22 AM
I actually implemented this concept in an Arabic search engine I wrote for a school project in my data mining class. But it had a lot more to do with the fact that it was more efficient to implement it so that the data was run past the queries rather than the queries run on top of the data. But I used the same argument you present here while convincing my professor that I should get a decent grade for it.
Man that was a tough class.
Posted by: Bob Aman | August 29, 2006 at 10:16 AM
Thanks for giving me a word for it. I've been striving for "sequence neutrality" in my music aggregator, Grabb.it, and now that I can name it it's a lot easier to whiteboard.
Posted by: Chris Anderson | July 11, 2007 at 01:01 AM
Jeff....isn't sequence neutrality as described in your blog the same as recording a "declaring an interest in and want to be informed when it happens" type of setup?
Posted by: Sreenath Chary | January 05, 2009 at 11:42 PM
It also looks similar to the backward chaining vs. forward chaining problem (or trade-off) in reasoning.
Posted by: Jakub Kotowski | July 12, 2010 at 02:59 AM
There are 3 things being conflated and confused here:
- Publish-subscribe; registering or "declaring an interest in and wanting to be informed when it happens" (@Sreenath), instead of periodically polling for changes and pulling the increments; "data was run past the queries rather than the queries run on top of the data" (@Bob). So we have streaming data, standing query and asynch push to client.
- Order independence of operators; depending on what you want, this is the associative, commutative, or distributive property of the operator algebra; idempotency is important too, because it affects whether you need to de-dup; generalize this to Category Theory, and you have catamorphism, anamorphism, hylomorphism (aka Bananas, Lenses and Barbed Wire), for example, map-reduce relies on catamorphisms (as has been known since ca. 1969, through the Bird-Meertens Formalism).
- Non-monotonic logical inference; previous inferences can be undone by new data, something that was assumed true is no longer true; this is more like the idea of backtracking, than backward v. forward inference direction; inferences should be firing in all directions at all times, the only question is how are conflicts resolved; one can imagine asynch agents colliding on a fact, then negotiating with each other about their confidence, explaining (exchanging) their evidence, and coming to an agreed conclusion, or an agreement to differ and continue anyway - you just get a multivalued result, so leave it to someone else to arbitrate and decide later. Monotonic logics are often equivalent to a 'closed world assumption', i.e. all the data/facts are already known, bounded and accessible, whereas we all know that the real world is open, liable to accretion and nonmonotonicity.
So 'sequence neutrality' means being careful about incremental information flow, operator algebras, and logical commitment to partial inferences.
Mike
Posted by: Mike French | June 17, 2011 at 05:14 AM
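A minimal sketch of the order-independence point raised in the last comment, assuming facts are merged with set union, which is associative, commutative, and idempotent: folding it over any permutation of the inputs, duplicates included, reaches the same end state. The fact tuples below are illustrative only.

```python
# Illustrative sketch: a merge operator that is associative, commutative,
# and idempotent, so any arrival order (and any duplication) of the same
# facts folds to the same end state.

from functools import reduce
from itertools import permutations

facts = [
    frozenset({("Billy the Kid", "opened account")}),
    frozenset({("Billy the Kid", "on OFAC list")}),
    frozenset({("Billy the Kid", "opened account")}),  # a duplicate arrival
]


def merge(state: frozenset, fact: frozenset) -> frozenset:
    return state | fact  # set union; idempotency makes de-duplication unnecessary


end_states = {
    reduce(merge, ordering, frozenset())
    for ordering in permutations(facts)
}
assert len(end_states) == 1  # every ordering yields the identical end state
```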