The terms "streaming" and "perpetual" probably sound like the same thing to most people. However, in the context of intelligent systems, I think there is a big difference.
[Note: when I use the term "observation" below, feel free to think about this as a synonym for "transaction" or "record."]
Streaming analytics involves applying transaction-level logic to real-time observations. The rules applied to these observations take previous observations into account only if they occurred within the prescribed window – and these windows have some arbitrary size (e.g., the last five seconds, the last 10,000 observations, etc.). A minimal sketch of this windowed behavior follows.
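To make the window idea concrete, here is a toy sketch in Python of count-based windowed logic. The observation fields and the rule itself (an "account_id" repeat check) are hypothetical stand-ins, not how any particular streaming engine actually works:

```python
# Minimal sketch of window-based streaming logic (illustrative only):
# each new observation is evaluated against just the last N observations,
# and anything older than the window is simply forgotten.
from collections import deque

WINDOW_SIZE = 10_000  # arbitrary window, e.g. "last 10,000 observations"

window = deque(maxlen=WINDOW_SIZE)  # older observations fall off automatically

def on_observation(obs):
    """Apply a transaction-level rule using only the current window."""
    # Hypothetical rule: flag the observation if the same account_id
    # was already seen somewhere inside the window.
    seen_before = any(prev["account_id"] == obs["account_id"] for prev in window)
    window.append(obs)
    return {"observation": obs, "repeat_within_window": seen_before}
```

Note that anything learned 10,001 observations ago is gone – which is exactly the limitation perpetual analytics avoids.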
Perpetual analytics, on the other hand, evaluates every incoming observation against ALL prior observations. There is no window size. Recognizing how the new observation relates to all prior observations makes it possible to publish insight in real time (i.e., The Data Finds the Data and the Relevance Finds the User). Another unique property is Sequence Neutrality (i.e., future observations can affect earlier outcomes). A toy sketch of this behavior follows.
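Here is a minimal Python sketch of the perpetual idea – persist everything, relate each new observation to all prior observations, and publish the moment something connects. The entity key and the "insight" it returns are hypothetical; a real engine would do genuine entity resolution rather than a dictionary lookup:

```python
# Minimal sketch of perpetual-analytics behavior (illustrative only):
# every new observation is related to ALL prior observations, with no window.
class PerpetualContext:
    def __init__(self):
        self.history = []    # every observation ever ingested, nothing expires
        self.entities = {}   # entity_key -> all observations about that entity

    def ingest(self, obs):
        """Relate the new observation to everything known so far."""
        self.history.append(obs)
        related = self.entities.setdefault(obs["entity_key"], [])
        related.append(obs)
        # The moment new data "finds" old data, publish the insight.
        # A later observation can change what we assert about an earlier one,
        # which is the sequence-neutrality idea in miniature.
        if len(related) > 1:
            return {"entity": obs["entity_key"], "related_count": len(related)}
        return None
```

The point of the sketch is simply that the context is persistent and cumulative – the answer to "does this matter?" can depend on an observation that arrived years ago.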
Just to be fair, both streaming and perpetual analytics engines have their place in the world. For example, sometimes transactional volumes are so high … non-persistence and small window sizes are the only route.
However, when the mission is significant and transaction volumes can be managed in real time … perpetual analytics answers these questions: "How does what I just learned relate to what I have known?" "Does this matter?" and "Who needs to know?" And if you can’t answer these questions, then your organization is likely to exhibit some degree of Enterprise Amnesia.
So how many observations per second can our current technology sustain? Recently, we achieved a new record: roughly 600 million observations ingested and contextualized in under five days – which works out to better than 1,300 observations per second, sustained around the clock. And amazingly, my team thinks they can double that performance with some more tuning!
Another reason, by the way, that so much throughput is necessary is that historical data cannot simply be bulk loaded. Constructing context from historical data involves streaming the data in. I sometimes describe this in terms of "sticking a straw into the historical data and slurping it out one observation at a time." In short, such systems must incrementally learn from the past! [Exception: if you do bulk load, then you must first crawl through the bulk-loaded data to contextualize these historical observations as if they had been incrementally ingested.]
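The "straw" amounts to replaying history through the same incremental ingest path used for live data. A minimal sketch, reusing the hypothetical PerpetualContext from the earlier example (the function and source names are likewise hypothetical):

```python
# Illustrative only: historical data is not loaded wholesale; it is replayed
# one observation at a time through the same ingest path as live data, so
# context accumulates as if the history had arrived in real time.
def replay_history(historical_source, context):
    """Slurp historical observations out one at a time and contextualize them."""
    for obs in historical_source:       # any iterable/cursor over old records
        insight = context.ingest(obs)   # same code path as live observations
        if insight:
            yield insight               # even old data can surface new insight
```

This is why ingest throughput matters so much: the faster the straw, the sooner the past is fully learned.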
RELATED POSTS:
Accumulating Context: Now or Never
Federated Discovery vs. Persistent Context – Enterprise Intelligence Requires the Latter
"Constructing context from historical data involves streaming the data in. ... In short, such systems must incrementally learn from the past!"
Can you post more about why context *has* to be built up historically? Is there no other, perhaps better, way to deal with the time variable in establishing context without having to do it serially?
Thanks!
Posted by: Aneel | April 18, 2007 at 01:22 PM
Interesting. I believe this also depends on whether the transactions are causal or not. If you have a first-order Markov system/source generating the transactions, then your last transaction (or two) is adequate to tell you about the current transaction, so if you keep a large window, you are not gaining anything.
If you have a situation where random events/noise affect the transactions, or you have multiple overlapping inputs with different-order causalities, well... then it gets interesting, because you have to do signal/source separation as well as system identification. (Here a "source" refers to those things that generate the various types of causal transactions, not just the literal sources. For instance, you can be singing, banging on the drums, etc., all in an uncorrelated manner. You are the person responsible, but the noises generated are from three different sources.)
Anyway, both problems of signal/source separation and system identification are non-trivial (in the literal sense). The first problem involves guessing, or rather guesstimating, how many sources of transactions/signals there are, and their form or lag if amplitudes or priorities of any kind are involved (for discrete binary systems, this pretty much comes down to guessing the number of sources generating the causally related transactions). System identification, on the other hand, is then done FOR EACH SOURCE: it is the process of identifying how the past outputs are correlated with the new one, plus noise.
Now things get really interesting if we have multi-layer systems. This is when some sources may be correlated with one or more other sources [but you don't know which ones, and you don't know the time lags either :-) ], and some may or may not be influenced by external effects. E.g., Google keeps people's searches around because if ... Jessica Alba gets in the news ... well, more people will search for her, and one index can serve very many :-) (this may sound odd, but one can think of these as distorted echoes).
Even more interesting yet is when we have a feedback loop. Here system identification, even for one source, especially if non-linear transformations are involved, is... well, hella-cool and then some, partly because it is darn near impossible to do :-) I said near impossible to do, so...
Posted by: Esfandiar Bandari | December 13, 2007 at 11:49 AM
Hi,
This is very interesting. Can you please tell me which companies are providing stream analytics solutions?
Posted by: Rajat Chadha | June 08, 2011 at 04:52 AM