March 30, 2006

When is a Data Warehouse not a Warehouse?

Data warehouses are used by organizations to liberate their information assets: information that is otherwise trapped across a myriad of disparate operational systems.  Organizations use data warehousing architectures to better understand their customers, improve marketing campaigns, enhance customer service, mitigate risk, and so on.

Controversy is the operative word when it comes to selecting data warehousing architecture.  At times, it almost sounds like a religious debate as professionals argue about how to build, manage, and use data warehouses and data marts. 

So, with all due respect, here is my version of truth. Needless to say, having designed a few multi-terabyte, multi-billion row data warehouses in my day, I’m passionate about this.

This is a warehouse (this picture is worth a thousand words).  More specifically, it is a Wal-Mart warehouse located in the middle of nowhere on Interstate 15 between Las Vegas and Salt Lake City.

Warehouses are about strategic distribution.  They are engineered to support three primary functions: (1) a receiving function; (2) a staging function; and (3) a distribution function.  Ideally, warehouses are strategically located, i.e., physically constructed in areas where expansion is economical and convenient, and in proximity to efficient distribution channels (think highways or railways).  Warehouses are designed to support ever-changing inventory requirements (e.g., from pet rocks to tandem bicycles).  Their inventory is organized towards maximizing efficiency at scale (e.g., pallets and forklifts).  And they are appropriately secured (e.g., protected by a fenced perimeter and a guardhouse which controls the arrival and departure of product).

Marts (picture this), on the other hand, are located and engineered to serve users.  They are conveniently located and readily accessible (e.g., on-site parking).  Content is highly predictable: consumers know which marts have what product.  Inventory is organized in a manner best suited to the products offered and customer expectations.  This is why Starbucks, Kroger and Payless shoe stores all have unique and highly specific inventory models (picture this).  Product is often presented in a manner designed specifically to drive consumption, and frequently optimized towards guiding consumers towards product with higher margins.  Marts are secured according to the value of the content, which is why pharmacies are secured differently than 7-Eleven stores.

Structure governs function.  Production facilities, warehouses, and marts have different purposes.  Therefore, each will structure inventory appropriately.

Question: When is a data warehouse not a warehouse? 

Answer: When consumers are found running around the warehouse looking for size 10 shoes.

In my view, many data warehouses are really marketing-oriented data marts because they are engineered solely to serve a specific user mission (not strategic distribution).  This is not to say these systems aren’t valuable.  I just would not call them warehouses.

Let me suggest only building warehouses when some large number of distribution end points (marts) are envisioned in the future.  Otherwise, a more efficient use of resources is to build a few specific marts without building a warehouse at all.  Wal-Mart would not have built that warehouse in the middle of nowhere without anticipating a myriad of storefronts (marts).

My “Two Cents” Technical Note: Star schemas don’t belong in the warehouse, but are well suited for certain types of data marts.  And, if you are looking for a near real-time data warehouse, start by thinking about OLTP-like schemas (e.g., 3NF).  The schemas used by data warehouses and data marts are critical to achieving scalability and sustainability.

March 27, 2006

2006 Solvang Double Century – No Surveillance Cameras at the Laundromat

This last Saturday a few buddies and I were out near Santa Barbara, California to put another 200 mile bike ride under our belt.  But unlike other double century rides where the sequence of events is something like suffer, suffer, suffer, finish … this one finished us!

Adam and Vic (two of my buddies from Vegas) and I pulled out at about 6:30 a.m., which should have assured us completion before dark.  That would be important as we had no night gear (lights, reflectors, etc.).

Well long story short, we got lost twice and spent over an hour in no man’s land.  It rained on us for four hours and of course being from Vegas, we had no rain gear.  Vic got a flat but was so cold he was unable to change the tire.  We rode into a small town and had the flat fixed at a local bike store, bought rain coats and went into Frank’s Hot Dogs restaurant for warmth and food (maybe we should have had breakfast after all).  Of course, these activities would be considered unusual practices for such an event.

On the subject of unusual practices, we also decided to get more comfortable, so we pedaled to a laundromat and borrowed towels from strangers while we put our soaked clothes in the dryer.  This was a ridiculous scene: Vic had a tiny pink towel.  My towel had so many holes that how you wrapped it around your waist mattered.  We stood there shivering with our backs up against the dryers for 40 minutes while waiting for our clothes to dry.  There are a lot of surveillance cameras in Vegas, but none here!

180 miles into the race … while we were pedaling in the dark on the shoulder of some freeway, without lights, the race director called foul and removed us from the course.

And while we whimpered and shivered in the back of the van, they called in our names as DNF (Did Not Finish).  The misery of defeat!  All I wanted now was a Ho Ho (yeah, same family as the cupcake) and Thai food.  I’d been dreaming and talking about my Ho Ho fantasies for hours now.

This was only the second time I failed to finish a road ride.  The first time occurred when attempting to finish the Death Valley Double Century … need I explain?

March 22, 2006

Scalability and Sustainability in Large Information Sharing Systems

For a little over 10 years I have been fascinated by scalable architectures and what makes large scale environments sustainable.

As our government considers national programs to address everything from Master Patient Indexes (MPIs) for improved health care to Information Sharing Environments (ISEs) for counter-terrorism … scalability and sustainability will matter.

When I think about scalability, I think about the highest conceivable transactional volumes and at what point volume will exceed the limits of the architecture.  And when thinking about sustainability, I think about the ability of the system to stay operational through high volumes and changing times without impractical re-engineering or maintenance requirements (e.g., periodic database reloads).

Just after I moved to Las Vegas in the early 1990s I was tasked by a CIO to build a cross-property charging system that would enable the guests of one hotel to have dinner at another hotel while charging the bill back to their hotel, and vice versa for guests of the other hotel.  While I was contracted to only build a system that supported two hotels, I wondered what architecture would be required to someday support the entirety of Las Vegas.

What kind of architecture would be required to enable the 40 million annual Las Vegas visitors to stay at any hotel and charge any activity (retail, meals, spa, etc.) from any other location back to their room?  How could this be accomplished in a manner that every business unit can select their own systems – and evolve these systems without any overall disruption to the network at large? 

In 1996, I was asked to build a data warehouse for another industry – a warehouse that would consolidate identities and their transactions from the daily feeds of an estimated 4,200 disparate operational systems.  This company wanted to better understand its customers across their diversified hotel and real estate brands.  While I was contracted to only build a system designed for these two business lines, I wondered what architecture would be required to support an unlimited number of business lines and billions of rows of related transactional data. 

While scale was one concern, sustainability was another.  What architecture would support a system whereby corporate acquisitions into entirely new markets (e.g., rental cars) would not invalidate the design and thus require a re-engineering activity?  How could this system be engineered such that each operating unit was not required to be on the same system and same version at all times? 

In both of the above cases (and incidentally my practice today), my mental exercise to contemplate scalability was/is as follows: How would the system look, behave and be managed if the mission exploded horizontally (i.e., new brands) and vertically (i.e., huge transactional growth)?  Would the architecture still hold in such circumstances?  Would every innovative and strategic move by the board of directors trigger a re-engineering of the architecture – thus making it impossible to ever get a large scale integration effort off the ground?

And only in more recent years have I come to fully appreciate the importance of sustainability, especially, the importance of Sequence Neutrality in large scale Perpetual Analytic systems.  This is necessary to avoid the standard data warehousing practice of database reloads designed to address “data drift” (i.e., the accuracy of the database drifting from truth over time).

Other technologies needed to achieve sustainability (not to mention confidence) are tools that enable reconciliation audits between operational systems and these secondary data stores in a manner that reconciliation discrepancies trigger specific re-synchronization events.  And to the best of my knowledge, our IT industry lacks robust and commercially available technology enabling reconciliation between very large data sets suitable for real-time production environments.
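To make the idea of a reconciliation audit concrete, here is a minimal sketch in Python. It is purely illustrative and does not describe any existing product: the function names (`row_digest`, `reconcile`), the record layout, and the event labels are all assumptions, and a toy in-memory comparison like this says nothing about the hard part, which is doing the same thing across very large data sets in a real-time production environment.

```python
import hashlib


def row_digest(row: dict) -> str:
    """Hash a row's sorted field/value pairs so field order does not matter."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: dict, replica: dict) -> list:
    """Compare two stores keyed by record id and list re-synchronization events.

    `source` stands in for the operational system and `replica` for the
    secondary data store; both are hypothetical in-memory dictionaries.
    """
    events = []
    for key in sorted(source.keys() | replica.keys()):
        if key not in replica:
            events.append(("insert", key))   # missing from the secondary store
        elif key not in source:
            events.append(("delete", key))   # no longer in the operational system
        elif row_digest(source[key]) != row_digest(replica[key]):
            events.append(("update", key))   # contents have drifted apart
    return events


operational = {
    1: {"name": "Ann Lee", "city": "Las Vegas"},
    2: {"name": "Bob Ray", "city": "Reno"},
    3: {"name": "Cy Poe", "city": "Elko"},
}
warehouse = {
    1: {"name": "Ann Lee", "city": "Las Vegas"},
    2: {"name": "Bob Ray", "city": "Henderson"},
    4: {"name": "Di Orr", "city": "Ely"},
}

resync_events = reconcile(operational, warehouse)
print(resync_events)
```

Each discrepancy becomes a specific event that could drive a targeted re-synchronization of just that record, rather than a wholesale database reload.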

Getting scalability and sustainability wrong translates into wasted resources.  And as organizations and nations increasingly pursue very large information sharing missions, these engineering aspects will continue to prove most challenging.  Almost as challenging as the policy issues!

March 16, 2006

What’s In A Name?

As founder and chief scientist of Systems Research & Development (SRD), I spent over 20 years refining technology to determine when people are the same or related despite the natural (and sometimes intentional) variability that occurs in identity data.  My company and our Non-Obvious Relationship Awareness (NORA) technology were purchased by IBM in January of 2005 (we are now called IBM Entity Analytics).

When resolving identities, understanding when names are similar is critical.  And it requires very sophisticated algorithms to handle global name issues like transliteration.  For example, while Mohammed is represented one way in Arabic, it can be spelled over 100 ways when transliterated into English (e.g., Mohamed, Muhammad, Mohammad, etc.), the shortest of which is Mhd.

And while SRD passionately worked on identity resolution over these many years, a company called Language Analysis Systems (LAS) has itself been passionately working for over 20 years on mastering global name resolution.  Their leadership in this field has made them the de facto standard for global name analysis.

Well, this is a great day for IBM, and especially my Entity Analytics team, as we purchased LAS today!

What do you think happens when the best identity resolution technology on the planet meets the best name resolution technology on the planet?  I will tell you this … I am very excited!

March 14, 2006

There Is No Such Thing As A Single Version of Truth

If you are not interested in a technical peculiarity that occurs in aggregated data sets, just ignore this post.

I am often asked what my thoughts are about selecting the single best attributes (e.g., best name and best address) when multiple attributes are known. I always respond with, “truth is in the eye of the beholder.”

This came as a hard lesson. In the mid-1990s, I built a data warehouse that was being fed daily by over 4,000 disparate operational systems belonging to a handful of widely recognized consumer brands. The goal was to better understand the customer by recognizing when the same person was transacting across different brands all held by the same holding company. The underlying motivation: the more fully the customer is understood, the more you can sell to the customer.

There I sat with a number of marketing VPs, each representing their brand’s interests. And while everyone worked for the same parent company, there was one question no one could agree upon: When a consumer has transacted with all of the brands, each time using a slightly different name or new address, which name and address should be considered the enterprise-wide GOLD standard? As it turns out, there is no such thing as a single version of truth.

The name and address supplied to a human resources system by an employee is the best name and address for an IRS filing, even if a different name and address has become available from another system. And a hotel statement better be sent to the address supplied by the guest when he or she checks out of the hotel – not some other address deemed “best” because of its perceived currency and reliability from some other data source. For a direct marketing piece, a name and address from a loyalty club program is generally better than the hotel reservation data provided over the phone. Why? Because loyalty club data is more reliable as consumers want to receive their points statement in the mail.

Thus, the definition of best varies based on who is asking the question. So when I am asked how to determine the single best version of truth I recommend being prepared to deliver every version of truth -- for truth is in the eye of the beholder.

Truth on demand … so to speak.
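One way to picture "truth on demand" is to keep every version of a name and address and choose among them only when a specific consumer asks. The sketch below is a minimal illustration under stated assumptions: the `AddressRecord` type, the `SOURCE_PREFERENCE` table, and the source and purpose labels are hypothetical names invented for this example, loosely following the IRS filing, hotel statement and direct marketing cases above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AddressRecord:
    source: str    # system that supplied this version of the record
    name: str
    address: str


# Hypothetical preference order: which sources each consumer of the data trusts.
SOURCE_PREFERENCE = {
    "tax_filing": ["human_resources"],
    "hotel_statement": ["hotel_checkout"],
    "direct_marketing": ["loyalty_club", "hotel_reservation"],
}


def best_for(purpose: str, records: list) -> Optional[AddressRecord]:
    """Return the version of truth preferred for this purpose, if any exists."""
    by_source = {record.source: record for record in records}
    for source in SOURCE_PREFERENCE.get(purpose, []):
        if source in by_source:
            return by_source[source]
    return None


# Every version is retained; nothing is collapsed into a single "gold" record.
versions = [
    AddressRecord("human_resources", "Robert Smith", "12 Elm St"),
    AddressRecord("hotel_checkout", "Bob Smith", "98 Oak Ave"),
    AddressRecord("loyalty_club", "Rob Smith", "5 Pine Rd"),
]

print(best_for("tax_filing", versions))
print(best_for("direct_marketing", versions))
```

The design choice worth noticing is that the store never decides what is "best"; the question, and who is asking it, does.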

March 12, 2006

Data Mining, Predicate Triage and NSA Domestic Surveillance

Data mining means different things to different people and quite frankly has become an overused term.  And after having seen quite a few data mining definitions, I have concluded the longer the definition the greater the confusion.  So what might be the shortest possible definition?

Data Mining = Prediction.

Marketing organizations use data mining to create promotional offerings (“junk mail” to you and me) targeting selected individuals they “predict” will have a higher propensity to transact.  In this scenario, better prediction means a higher promotional response rate, thus improved sales and savings in promotional costs.  False positives – or incorrect predictions – in this domain have a benign consequence.

And while I am on record about the negative Consequences of False Positives in Government Surveillance Systems, there are some very powerful uses of data mining in government settings.

Introducing: Data Mining for Predicate Triage

When a government is faced with an overwhelming number of predicates (i.e., subjects of investigative interest), data mining can be quite useful for triaging (prioritizing) which subjects should be pursued first.  One example: the hundreds of thousands of people currently in the United States with expired visas. The student studying virology from Saudi Arabia holding an expired visa might be more interesting than the holder of an expired work visa from Japan writing game software.

Applying this line of thinking to the recently reported NSA warrantless surveillance debate, if the surveillance always starts with a predicate (in this setting, phone calls from known Al Qaeda training camps), and then data mining is used for predicate triage … then we are talking about a very useful form of data mining. 

So what constitutes a viable (legal and useful) predicate?  That is the question of the day!

March 10, 2006

Pioneering the Future of Personal Data

Today on National Public Radio I was interviewed by Renée Montagne, the Morning Edition host, on the subject of privacy.

To listen to this interview click here.

Related posts: Advanced Analytics in the Anonymized Data Space, Podcast: The Future of Privacy, Technology Review: Blinding Big Brother, Sort of, FCW: Story About My Anonymization Work

March 07, 2006

More Data is Better, Proceed With Caution

Whether an organization is trying to better understand its customers, detect fraud or protect its infrastructure from terrorism, it is true that the more data points available the better the situational awareness.  Poor and inaccurate situational awareness results in poor decisions.  And poor decisions result in inefficient operations, non-competitive offerings in the market place, and in the case of government, wasted resources and unforeseen attacks on a society. In many cases, poor situational awareness stems from too little, not too much, data.

Situational awareness requires Context and context requires data points.  In the information management world it is recognized that an organization’s information assets are generally useless to the broad strategic interests of the enterprise because the data holdings are so disparate and isolated.  The opportunity cost or consequence related to situational awareness is proportional to the size of the organization: bad decisions by the manager of a barber shop produce one worst-case outcome, bad decisions at Enron another, a country another, and mankind yet another, each with increasing consequences.

For example, large organizations lacking enterprise awareness inadvertently hire people who were once arrested for stealing from them.  A large mortgage company once called and left me a message every week for months in hopes of getting me to refinance my loan through them when in fact they had refinanced my loan months ago.  Your request for a wake-up call in the hotel will not stop the maid from knocking on your door – rather you are expected to hang your “Do Not Disturb” sign on your door.  Duh, that’s efficient.

The private sector becomes more competitive when it leverages its existing information assets better.  And while privacy concerns matter to everyone, when governments attempt to leverage information assets, especially across organizational boundaries, privacy and civil liberties end up front and center in the debate.  I am a member of the Markle Foundation’s Task Force on National Security in the Information Age, where we spent a lot of time thinking about how a government can be more effective while at the same time ensuring a higher degree of privacy and civil liberties protections.  Our Second Report discusses this at some length.

The government is charged with protecting us, and it will be smart policy and effective technology that is called upon to answer the mail.  And because it is true that in many cases “more hay helps locate more needles,” I encourage my technical colleagues to fully engage with the privacy community to better understand and explore what kinds of solutions get the job done and in a way that is more privacy conscious.  And while there is no perfect answer, we can do a better job – and that includes me – and that is why I am doing my best to maintain a conversation with the privacy community.

March 03, 2006

Comments on the TSA No-Fly and Selectee Watch List Process

Every so often I have someone express concern because they went to the airport and discovered that their name is on the TSA No Fly or Selectee list.  I then have to explain how watch lists work.  In almost every case, the answer is the person on the watch list is not them but someone with the same/similar name. 

The underlying problem is that the information on these watch lists typically has low fidelity (i.e., limited data points like only name and date of birth).  If you want to see an example of a government watch list check out the Office of Foreign Assets Control’s Specially Designated Nationals Watch List.  You will find this frequently contains only a name, date of birth and place of birth.  Financial institutions are required by US law not to transact with these folks.

So back to the airport ... when making an airline reservation, one typically provides a name, address, phone, credit card and sometimes a frequent flyer number.  Well, the problem is the only relevant field for matching is the name.  And names are matched with fuzzy logic which means matches can be found despite minor discrepancies and name variations (e.g., Bob and Robert).  Using only fuzzy name matching in large populations of data produces many false positives.
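The false-positive problem with name-only matching can be shown with a few lines of Python. This is a minimal sketch, not how any real watch list system works: the nickname table, the 0.85 threshold, and the record layout are assumptions made for illustration, and the similarity measure is simply the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Illustrative nickname table only; real name matching is far more involved.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}


def normalize(name: str) -> str:
    """Lowercase, drop periods, and expand known nicknames."""
    parts = name.lower().replace(".", "").split()
    return " ".join(NICKNAMES.get(part, part) for part in parts)


def name_score(a: str, b: str) -> float:
    """Similarity between two normalized names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def matches(passenger: dict, listed: dict, threshold: float = 0.85) -> bool:
    """Fuzzy name match; a conflicting date of birth, when both are known, clears it."""
    if name_score(passenger["name"], listed["name"]) < threshold:
        return False
    p_dob, l_dob = passenger.get("dob"), listed.get("dob")
    if p_dob and l_dob and p_dob != l_dob:
        return False
    return True


listed_person = {"name": "Robert Jones", "dob": "1970-01-01"}

# With only a name to go on, a different Bob Jones matches the listed person.
name_only = {"name": "Bob Jones"}
print(matches(name_only, listed_person))

# One additional attribute is enough to tell the two people apart.
with_dob = {"name": "Bob Jones", "dob": "1982-05-17"}
print(matches(with_dob, listed_person))
```

The first check matches on name alone; the second does not, because the extra data point disagrees. The more names on the list and the more passengers screened, the more often the first situation occurs.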

One remedy often reported is to start using your middle initial when making airline reservations, which in most cases will cause your name to not match, unless of course the name on the watch list happens to have the same middle initial as you.

As watch lists grow, as they have in the post September 11 world, so do false positives.  Additionally, the administrators of the watch list have to address a number of other policy and process challenges, for example, watch list redress.

Paul Rosenzweig and I thought through and addressed both of these key issues in a paper we co-authored entitled, “Correcting False Positives: Redress and the Watch List Conundrum” and published this past year by the Heritage Foundation.

In this paper we present solutions to redress and a consumer-driven method for handling false positives.

On a bright note, since I travel extensively, I can attest to the fact that I have seen someone who, after matching to a name on the No Fly list, was offered and elected to share another personal attribute.  Coincidentally, this is an approach Rosenzweig and I championed in our paper. The premise is that members of the flying public are in the best position to differentiate themselves from the watch-listed individual by providing an additional attribute, rather than requiring large scale, automated access to public records data. Using such an approach should result in the individual not being matched on his next trip to the airport.  This process used to be broken, and this new and improved process is certainly better than many other approaches.

March 01, 2006

Advanced Analytics in the Anonymized Data Space

There are emerging technical advances whereby it is becoming not just possible but viable to perform advanced information analytics on data after it has been anonymized (shredded).

This could have profound implications for future information sharing missions.  Because, if an organization intends to share its sensitive data and it is possible to share only anonymized data while achieving a materially similar result, why would they ever share sensitive data any other way?
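As a rough illustration of the general idea, and not a description of any specific product or of the techniques covered in the stories linked below, here is a minimal Python sketch in which two organizations compare keyed one-way hashes instead of raw values. The shared key, the function names, and the sample data are all hypothetical. This toy version only finds exact matches after simple normalization; anything fuzzier in the anonymized space takes considerably more work.

```python
import hashlib
import hmac

# Hypothetical secret agreed between the parties out of band.
SHARED_KEY = b"demo-key-agreed-out-of-band"


def anonymize(value: str, key: bytes = SHARED_KEY) -> str:
    """Normalize a value, then replace it with a keyed one-way hash."""
    normalized = " ".join(value.lower().split())
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def shared_tokens(values_a: list, values_b: list) -> set:
    """Return the anonymized tokens both parties hold, never the raw values."""
    tokens_a = {anonymize(value) for value in values_a}
    tokens_b = {anonymize(value) for value in values_b}
    return tokens_a & tokens_b


org_a_phones = ["555-0100", "555-0142"]
org_b_phones = ["555-0142", "555-0199"]

overlap = shared_tokens(org_a_phones, org_b_phones)
print(len(overlap))
```

Each party learns that it has one value in common with the other, and can look up which of its own records produced that token, without either side handing over its full list in the clear.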

There are a lot of winners in this alternate future as the risk of unintended disclosure (of one’s sensitive data) is greatly reduced.  Not just good business but better privacy. 

Not a lot of privacy-enhancing technologies also have a compelling business case, but in this case, I think this could be a win-win.

Related Links:

Technology Review Story: Blinding Big Brother, Sort of

FCW Story About My Anonymization Work