There is a high transaction cost associated with performing Perpetual Analytics with Sequence Neutrality. Essentially, what these systems are doing is this: as every record (observation) is received (perceived), the system tries to answer the question, "How does what we are observing relate to what we already know? Does this change any earlier assertions? Have we learned something that matters, and if so, who needs to know?" I think this might be an example of what could be called an Incremental Learning System. Maybe a better term is an Aware Incremental Learning System, in that such a system has the ability to publish insight as relevance is noticed, not simply wait for a user to make a query.
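To make that loop concrete, here is a minimal sketch of an "aware" incremental loop: each inbound observation is compared against accumulated context, and anything newly learned is pushed to subscribers the moment it is noticed rather than waiting for a query. All class and method names here are illustrative assumptions, not the actual system described in this post.

```python
# Hypothetical sketch of publish-on-relevance incremental learning.
from dataclasses import dataclass, field

@dataclass
class Context:
    # entity key -> set of attribute strings observed so far
    entities: dict = field(default_factory=dict)
    insights: list = field(default_factory=list)   # published, not queried

    def ingest(self, key, attrs):
        known = self.entities.setdefault(key, set())
        new = attrs - known            # how does this relate to what we know?
        known |= new
        if new:                        # did we learn something that matters?
            self.publish(key, new)     # then tell who needs to know, now
        return new

    def publish(self, key, new_facts):
        self.insights.append((key, sorted(new_facts)))

ctx = Context()
ctx.ingest("cust-17", {"addr:main st"})
ctx.ingest("cust-17", {"addr:main st", "phone:555-0100"})
# the second record adds only the phone number, so only that gets published
```

The point of the sketch is the shape of the loop: insight is a side effect of ingestion, not the result of a later query.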
Incremental Learning Systems, over time, grow what they know. And what is growing is context. When this growing context is stored in a database, I have been calling it Persistent Context.
Notably, persistent context must be constructed. One cannot take historical non-contextual data and simply bulk load such information into a database, without an assembly process, and expect to get information in context. Context must be carefully assembled.
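A toy illustration of why assembly matters, under assumed (and deliberately trivial) matching rules: two records that share an identifier become one entity only if each arriving record is matched against what is already known. A bulk load with no assembly step would simply leave them as two unrelated rows.

```python
# Hedged sketch: context assembly as record-by-record matching and merging.
# The matching rule (any shared value) is an illustrative assumption only.

def assemble(records):
    entities = []                          # each entity: a merged set of values
    for rec in records:
        values = set(rec.values())
        match = next((e for e in entities if e & values), None)
        if match is not None:
            match |= values                # enrich the existing entity
        else:
            entities.append(values)        # brand-new entity
    return entities

records = [
    {"name": "J. Smith", "phone": "555-0100"},
    {"name": "John Smith", "phone": "555-0100"},   # shared phone -> same entity
    {"name": "A. Jones", "phone": "555-0199"},
]
entities = assemble(records)
# three records assemble into two entities; a raw bulk load would keep three rows
```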
Yeah, I know. Sounds great and all, but how fast are these systems?
Well, they are getting faster, that is for sure.
A few months ago, while on the road, I received a very exciting call: our performance engineering team had just broken an all-time throughput record.
In part this was made possible by an internal project we started back in 2002 or 2003 called the "small database footprint project." The notion being that at the end of the day, the pinch-point was going to be the database engine itself. Once you tip the database engine over, well, then you have reached your limit.
Our small database footprint project had the goal of externalizing as much computation as possible off the database engine, pushing this processing into share-nothing, parallelizable pipelines. So we did such things as externalize serialization (no more using the database engine to dole out unique record IDs) and eliminated virtually all stored procedures and triggers, placing more computational weight on these "n"-wide pipeline processes instead. By the way … no "table scans" (duh!), and a pretty cool strategy to make sure that SELECT result sets were small.
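One way externalized serialization is commonly done, sketched here as an assumption rather than the actual design: hand each of the "n" pipeline workers a disjoint block of IDs up front, so assigning an ID to a record is a purely local operation and the database engine is never in the loop.

```python
# Illustrative sketch of block-based external ID allocation.
import threading

class BlockAllocator:
    """Hands out disjoint ID ranges; only this hand-off needs coordination."""
    def __init__(self, block_size=1000):
        self.block_size = block_size
        self.next_start = 0
        self.lock = threading.Lock()

    def next_block(self):
        with self.lock:
            start = self.next_start
            self.next_start += self.block_size
        return iter(range(start, start + self.block_size))

class PipelineWorker:
    def __init__(self, allocator):
        self.allocator = allocator
        self.ids = iter(())                # empty until first block is fetched

    def assign_id(self, record):
        try:
            record_id = next(self.ids)
        except StopIteration:              # block exhausted: fetch a new one
            self.ids = self.allocator.next_block()
            record_id = next(self.ids)
        return record_id, record

alloc = BlockAllocator(block_size=3)
w1, w2 = PipelineWorker(alloc), PipelineWorker(alloc)
ids1 = [w1.assign_id(r)[0] for r in ("a", "b", "c", "d")]   # blocks 0-2, 3-5
ids2 = [w2.assign_id(r)[0] for r in ("x", "y")]             # block 6-8
# the two workers never collide and never touch the database for an ID
```

The trade-off is that a crashed worker may strand unused IDs from its block, which is usually acceptable when IDs only need to be unique, not dense.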
Since large multi-terabyte systems are not going to live solely in memory and remain sustainable, much attention must also be paid to disk layout; for example, many smaller-capacity disk drives operating at 15,000 RPM so the data can be spread out. RAID 5? Not for read/write tables. For all the fast-growing read/write tables we used RAID 10. More expensive? Yep. Faster? Yep.
How fast? In short, 800 million records were loaded in four days. To boot, the performance team felt that, with some more tuning, they might be able to cut that down to just over two days. They were using 32 CPUs (half of what was possible on that box).
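As a quick back-of-the-envelope check, that load rate works out to roughly 2,300 records per second, sustained around the clock for four days:

```python
# Sanity-check arithmetic on the throughput figures quoted above.
records = 800_000_000
four_days = 4 * 24 * 3600              # seconds in four days
per_sec = records / four_days
print(f"{per_sec:,.0f} records/sec sustained")   # roughly 2,315 records/sec
```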
If you are turning on one of these systems, or plan on implementing one, we have a great white paper with all the configuration specifics. Drop me a note and I'll get you a copy.