I get asked from time to time how data flows. What people really mean is: how many places does the data land? After explaining this a few times I decided to blog it for easy future reference.
If you give a company your name and address, how many copies of this data might there be twelve months later? Many might be surprised to discover that there could easily be in excess of 1,000 copies!
So roughly speaking it looks something like this …
When data first arrives it is likely to be stored in an operational system – sometimes called the "system of record." This is the first instance.
Systems of record are frequently mission-critical systems and are therefore candidates for robust backup policies. While different organizations have different backup policies, one common strategy involves creating one backup every day and keeping each daily backup for seven days. This is a rolling strategy where every Monday overwrites last Monday's backup. An end-of-week backup (e.g., every Sunday night) might be kept for five rolling weeks. Month-end backups might be kept for twelve rolling months. And year-end backups are likely to be kept for something like seven years.
So at the end of twelve months it is possible that there are now an additional 24 copies of the data (7 + 5 + 12). The good news is that backups are well protected; the bad news is that the greater the number of backups, the greater the chance that one turns up missing -- which happens. [Example here]
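To make the arithmetic concrete, here is a minimal sketch in Python that simply adds up the rolling retention policy described above; the counts are the illustrative ones from this post, not any kind of standard:

```python
# Backup copies accumulated after twelve months under the rolling
# retention policy sketched above (illustrative numbers only).
daily_kept = 7     # one backup per day, kept for seven rolling days
weekly_kept = 5    # end-of-week backups, kept for five rolling weeks
monthly_kept = 12  # month-end backups, kept for twelve rolling months

backups_per_system = daily_kept + weekly_kept + monthly_kept
print(backups_per_system)  # 24 additional copies of the data
```

(The year-end backups kept for something like seven years would add roughly one more copy per year on top of this.)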
Structure governs function. [More on this here.] This is important because how the data is structured in the original system of record is specific to its mission. This means that if an organization wants to use the data internally for other purposes (e.g., secondary operational systems like a fraud detection system, statistical analysis, marketing), the data is copied into each additional system.
Along this line, many organizations create a reporting copy that can be used for ad hoc analysis without affecting operational systems. Some copy the data into an operational data store (ODS). Another copy of the data is often moved into the enterprise data warehouse. Copies from data warehouses are often used to populate data marts. How many data marts might there be? Who knows -- one, two, three, or maybe more?
So if an organization has only one reporting copy, one ODS, one enterprise data warehouse, and three data marts, then this would add up to six more copies. And these copies are likely to have backups made of them as well, especially when significant computational effort was involved in moving the data (e.g., pre-processing, translation, standardization, and integration/co-mingling with secondary data sets). If the same backup strategy is used, this could result in 6 × 24 or 144 more copies.
So now we are at 1 + 24 + 6 + 144 = 175 copies.
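Carrying the same illustrative assumptions forward, the running total of internal copies so far looks something like this:

```python
# Running total of internal copies, reusing the illustrative figure of
# 24 backup copies per backed-up system from the sketch above.
system_of_record = 1
sor_backups = 24
downstream_copies = 6                       # reporting copy, ODS, warehouse, 3 data marts
downstream_backups = downstream_copies * 24

total_internal = (system_of_record + sor_backups
                  + downstream_copies + downstream_backups)
print(total_internal)  # 175 copies -- before audit logs, snapshots, test and training systems
```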
But wait, there is more. Many of these systems likely have some form of audit logging – maybe at both the application and database level. Often additional "one-time data snapshots" are made over the course of a year for such things as pre- and post-maintenance and conversion (e.g., application or database upgrades), specialty analysis projects, audit snapshots, and so on. Then there are complete copies made for testing purposes (e.g., to ensure a scheduled upgrade is going to work as planned) and training systems (yes, sometimes training systems are created with real data). These may be backed up as well!
Furthermore, high-availability, mission-critical systems can be expected to have one or more fully synchronized copies of the database strategically dispersed across the landscape for workload distribution and/or disaster recovery purposes.
And then there are many odd little places data can get parked, including sensor-side caching (e.g., at the slot machine or cash register itself), in-transit caches (e.g., cell phone towers), message queues, local and central search engines, performance-enhancing indices, and so on.
Sorry, but I’ve lost count. So let’s just say over a hundred copies are made … internally. Now, what about the copies of the data which travel beyond the organization that originally collected the data?
Let's say you are applying for credit. In this case, you have likely authorized a credit report. Getting your credit report involves sending your information to one (or all three) of the credit bureaus. This information request now sits in their systems of record, their audit logs, their data warehouses and data marts, their backups, and so on. But wait, there is more!
These secondary recipients of your data may in turn further disseminate this information. This is especially true if the organization is a data aggregator/data broker. This data is combined with other information, assembled, scored and sold. These tertiary recipients then make their own mission-centric copies, data warehouses, backups, etc. And, in some cases, it is repackaged and sold again.
Care to guess how many copies of the data are out there now?
- No copies: You better hope not
- >10 copies: Almost certainly
- >100 copies: Very likely
- >1,000 copies: Quite possible in certain settings
- >100,000 copies: Sometimes
- >1,000,000 copies: Not out of the question
What can cause information to be replicated over 100,000 times? This comes into play with such information as phone service (phone books), credit applications, and, believe it or not, even those warranty cards you have been filling out!
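For a sense of how the count can climb that high, here is a purely hypothetical fan-out model; every number below is an assumption invented for illustration, not a measured figure:

```python
# Hypothetical fan-out: one organization shares the data with a few
# secondary recipients, each of which shares it with its own downstream
# recipients, and every holder keeps its own internal copies.
internal_copies_per_org = 175  # the illustrative internal estimate from above
secondary_recipients = 3       # e.g., credit bureaus receiving a credit inquiry
tertiary_per_secondary = 25    # hypothetical aggregators/brokers per bureau
buyers_per_tertiary = 20       # hypothetical purchasers of repackaged lists

organizations = (1
                 + secondary_recipients
                 + secondary_recipients * tertiary_per_secondary
                 + secondary_recipients * tertiary_per_secondary * buyers_per_tertiary)
total_copies = organizations * internal_copies_per_org
print(organizations, total_copies)  # 1579 organizations, 276325 copies
```

And when the "copy" is something printed and physically distributed, like a phone book, a single print run can account for hundreds of thousands of copies on its own.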
What does all this mean?
1. Keeping data current in the ecosystem is not trivial. [See: Data Tethering]
2. Protecting this many copies of the data is not trivial.
3. The more data you see, the more you realize most data is duplicative.
And this leads to an area I have been thinking about for roughly five years, which I sometimes refer to collectively as "Data Reduction Strategies." More about the progress I have made in this area at some future date.
Oh … and my Perpetual Analytics stuff is going to need one more copy (with its own particular database schema) since Enterprise Intelligence requires Persistent Context. And, of course, it would be wise to back this up too.
Very interesting post Jeff. I suspect that most I.T. organizations don't realize the storage overhead involved as the number of copies grows exponentially.
A few years back it seemed like EII and the related technologies in federated queries, xquery, etc. were going to address having to make multiple copies of the source data, but it seems like the hype didn't match the capabilities at the time and I haven't heard much lately.
Do you think perpetual analytics and persistent context could be used on top of an EII architecture? Would this address the data duplication issues?
Posted by: Sean B. | August 22, 2007 at 08:34 AM
This is some great work on your part. What's interesting to me is that many of the issues you are raising also exist in the world of unstructured content, ECM. Same set of problems but with a slightly different solution set. Thanks, JP.
Posted by: JP Harris | January 08, 2008 at 11:25 AM
Sounds like a system to keep the hard drive and storage device people in business forever. I wonder how many copies this reply will generate.
Posted by: Douglas Schwartz | April 21, 2008 at 08:02 PM
And then there's all these extra copies @ the wayback machine (archive.org):
http://web.archive.org/web/*/http://jeffjonas.typepad.com/
Posted by: Aghilmort Tromlihga | June 11, 2010 at 08:54 AM