There is a high transaction cost associated with performing Perpetual Analytics with Sequence Neutrality. Essentially, what these systems are doing is this … as every record (observation) is received (perceived), one is trying to answer the question: "How does what we are observing relate to what we already know? Does this change any earlier assertions? Have we learned something that matters, and if so, who needs to know?" I think this might be an example of what could be called an Incremental Learning System. Maybe a better term is an Aware Incremental Learning System, in that such a system has the ability to publish insight as relevance is noticed – not simply wait for a user to make a query.
Incremental Learning Systems, over time, grow what they know. And what is growing is context. When this context is stored in a database, I have been calling it Persistent Context.
Notably, persistent context must be constructed. One cannot take historical non-contextual data and simply bulk load such information into a database, without an assembly process, and expect to get information in context. Context must be carefully assembled.
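To make the shape of this concrete, here is a minimal Python sketch of an aware incremental loop over persistent context. It is an illustration of the pattern described above, not the actual engine – every class and function name is hypothetical, and the entity resolution is deliberately naive:

```python
# Illustrative sketch only: each new observation is resolved against
# persistent context, the context is updated in place, and insight is
# published the moment relevance is noticed -- not when someone queries.

class PersistentContext:
    def __init__(self):
        self.entities = {}          # resolution key -> accumulated observations
        self.subscriptions = []     # (predicate, subscriber) pairs

    def resolve(self, record):
        """Find the entity this record most likely refers to (or create it)."""
        key = record.get("name", "").lower()    # naive resolution, for the sketch only
        return self.entities.setdefault(key, {"observations": []})

    def assert_and_notify(self, record):
        entity = self.resolve(record)
        entity["observations"].append(record)

        # Did this observation add something we did not already know?
        changed = entity["observations"].count(record) == 1

        # Publish to anyone who cares, right now -- no query required.
        if changed:
            for predicate, subscriber in self.subscriptions:
                if predicate(entity):
                    subscriber(entity, record)


def perceive(context, records):
    """Perpetual analytics: evaluate every record as it arrives."""
    for record in records:
        context.assert_and_notify(record)
```

A subscription here is just a predicate plus a callback, which is the "relevance finds the user" part – the assembly step happens one observation at a time, never as a blind bulk load.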
Yeah, I know. Sounds great and all, but how fast are these systems?
Well, they are getting faster, that is for sure.
A few months ago, while on the road, I received a very exciting call. Our performance engineering team had just broken an all-time throughput record.
In part this was made possible by an internal project we started back in 2002 or 2003 called the "small database footprint project." The notion being that at the end of the day, the pinch-point was going to be the database engine itself. Once you tip the database engine over, well, then you have reached your limit.
Our small database footprint project had the goal of externalizing as much computation as possible off the database engine – pushing this processing into shared-nothing, parallelizable pipelines. So we also did such things as externalizing serialization (no more using the database engine to dole out unique record IDs) and eliminating virtually all stored procedures and triggers – placing more computational weight on these "n"-wide pipeline processes instead. By the way … no "table scans" (duh!) and a pretty cool strategy to make sure that SELECT result sets stay small.
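As an illustration of the "externalized serialization" idea, here is a sketch of the general pattern – not our code, and the names and block size are invented – in which each shared-nothing pipeline worker leases a block of IDs up front and hands them out locally:

```python
# Sketch of ID block leasing: the database (or whatever reserves ranges)
# is consulted once per 100,000 IDs instead of once per record.

import itertools
import threading

class BlockIdAllocator:
    """Hands out unique record IDs from locally leased blocks.

    lease_block() stands in for whatever lightweight mechanism actually
    reserves a range (a sequence table touched once per block, a counter
    service, etc.).
    """

    def __init__(self, lease_block, block_size=100_000):
        self._lease_block = lease_block
        self._block_size = block_size
        self._lock = threading.Lock()
        self._ids = iter(())            # empty until the first lease

    def next_id(self):
        with self._lock:
            try:
                return next(self._ids)
            except StopIteration:
                start = self._lease_block(self._block_size)
                self._ids = iter(range(start, start + self._block_size))
                return next(self._ids)


# Demo: a fake lease function standing in for the real range reservation.
_blocks = itertools.count(0, 100_000)
allocator = BlockIdAllocator(lease_block=lambda size: next(_blocks))
print(allocator.next_id(), allocator.next_id())   # 0 1
```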
Since large multi-terabyte systems are not going to live solely in memory and remain sustainable, much attention must also be paid to disk layout – for example, many smaller-capacity disk drives operating at 15,000 RPM so the data can be spread out. RAID 5? Not for read/write tables. For all the fast-growing read/write tables we used RAID 10. More expense? Yep. Faster? Yep.
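For a sense of that tradeoff, here is some generic RAID arithmetic – not figures from this system, and the drive count and size are assumed purely for the example:

```python
# Generic RAID arithmetic (assumed: eight 73 GB, 15,000 RPM drives).
drives, size_gb = 8, 73

raid5_usable  = (drives - 1) * size_gb      # 511 GB usable
raid10_usable = (drives // 2) * size_gb     # 292 GB usable -- the "more expense"

# Classic small-write penalty: a RAID 5 update costs read old data + read
# old parity + write data + write parity (4 I/Os), while RAID 10 needs only
# the two mirrored writes -- the "faster" for busy read/write tables.
raid5_ios_per_write, raid10_ios_per_write = 4, 2
```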
How fast? In short, 800 million records were loaded in four days. To boot, the performance team felt that with some more tuning they might be able to cut that down to just over two days. They were using 32 CPUs (half of what was possible on that box).
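For perspective, the sustained rates those numbers imply – a quick back-of-the-envelope, with "just over two days" approximated here as 2.2 days:

```python
# Back-of-the-envelope rates implied by the figures above.
records = 800_000_000
print(records / (4 * 24 * 3600))      # ~2,315 records/second sustained over four days
print(records / (2.2 * 24 * 3600))    # ~4,209 records/second at the projected tuning
```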
If you are turning on one of these systems or plan on implementing one, we have a great white paper with all the configuration specifics. If so, drop me a note and I’ll get you a copy.
RELATED POSTS:
Streaming Analytics vs. Perpetual Analytics (Advantages of Windowless Thinking)
Scalability and Sustainability in Large Information Sharing Systems
You Won't Have to Ask -- The Data Will Find Data and Relevance Will Find the User
Jeff,
Interesting thought piece. This is something that is desperately needed--now. However, given the current evolution of data, one must expect the future of Persistent Context to be a fused product of both textual entity data and video entity data. For example, non-obvious relationship data fused with face recognition data and other biometrics.
With that said, data is data is data. By digitizing what is a truck, what is a color, what is a crease on a face, Persistent Context should be able to adapt to the point of following an item of interest everywhere. Imagine Persistent Context being used for Amber Alerts (tied into traffic cameras, ATM cameras, etc.).
Posted by: Darryl Williams | June 03, 2007 at 07:27 PM
Hi Jeff,
I'm new to your blog, but I'm excited at the progress you're making with the small footprint database.
Over at www.assetbar.com, we're trying to tackle similar problems, but are taking a different technical approach. We've implemented a horizontally-scalable database filesystem that is able to sustain rather large numbers of reads and writes.
We're into distributing the db load across many machines, but whereas something like GFS/Bigtable is optimized for large, sequential reads, we're optimizing for lots of small reads/writes. In particular, for individual user behavior as well as how large numbers of users interact with an "asset" (aka file, photo, blog post, etc.).
So by combining a long history of user behavior / activity (one context) with the context of one or more different assets in real time, we're shooting for a web-scale personalization engine / platform. It's a different take on Greg Linden's interesting Findory project.
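(A toy illustration of that kind of key-partitioned small-read/write pattern follows – the node names and hash routing below are invented for the example, not assetbar's actual design.)

```python
# Toy sketch: route lots of small reads/writes across nodes by key hash,
# combining a per-user history context with per-asset activity.
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]    # hypothetical storage nodes

def node_for(key: str) -> str:
    """Route a small read or write to a node by hashing its key."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def record_interaction(store, user_id, asset_id, action):
    # Two small writes: one to the user's history, one to the asset's activity.
    user_key, asset_key = f"user:{user_id}", f"asset:{asset_id}"
    store[node_for(user_key)].setdefault(user_key, []).append((asset_id, action))
    store[node_for(asset_key)].setdefault(asset_key, []).append((user_id, action))

def personalize(store, user_id):
    # One small read of the user's accumulated context.
    key = f"user:{user_id}"
    return store[node_for(key)].get(key, [])

store = {n: {} for n in NODES}
record_interaction(store, "u42", "achewood-2007-07-07", "read")
print(personalize(store, "u42"))      # [('achewood-2007-07-07', 'read')]
```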
So far, we have a live proof of concept of our non-SQL architecture that is serving out nearly 10M assets a month. The content is from an indie web comic called Achewood, by Chris Onstad.
This fall, we'll be launching a new application on our platform that should be widely useful.
In the meantime, I'd love to read your whitepaper and I hope that later this year we'll be able to publish stats and whitepapers of our own.
Again, great project!
Posted by: Israel L'Heureux | July 07, 2007 at 06:16 AM
Another approach for those interested in these kinds of systems is Space-Based Architecture, which utilizes an In-Memory Data Grid in conjunction with persistence as needed.
See GigaSpaces' implementation.
http://www.gigaspaces.com
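(The snippet below is a generic illustration of that write-behind idea – keep the working set in memory and persist asynchronously – not GigaSpaces' actual API; the class and parameter names are invented.)

```python
# Generic in-memory grid with write-behind persistence (illustration only).
import queue
import threading

class InMemoryGrid:
    """Keep the working set in memory; persist asynchronously in the background."""

    def __init__(self, persist):
        self._data = {}
        self._pending = queue.Queue()
        self._persist = persist
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, key, value):
        self._data[key] = value           # fast path: memory first
        self._pending.put((key, value))   # durability handled later

    def get(self, key):
        return self._data.get(key)

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self._persist(key, value)     # e.g. write to disk or a database

grid = InMemoryGrid(persist=lambda k, v: None)   # no-op persistence for the demo
grid.put("entity:1", {"name": "example"})
print(grid.get("entity:1"))
```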
Posted by: Geva Perry | July 10, 2007 at 04:11 AM
Great job. You've come a long way.
Posted by: MA Wedding Videographer | February 15, 2008 at 03:52 PM
In the interests of full disclosure, I should point out that I'm an open-source fanatic. Unless you're using an OSI-approved license, I'm going to crib from your work to build an open-source engine. If your software is open-source, however, I'll be glad to contribute as much as I can.
That all said, I'd love a copy of that white paper, if you're comfortable giving me one. Thank you for this site, either way; it's helped clarify my understanding of systems integration.
Posted by: Nathaniel Eliot | August 31, 2008 at 04:06 PM