
July 10, 2010



Will Gordon

This post could not have come at a better time for me. At work, I have been tasked with overcoming a federated search problem.

The solution my team came up with was what you describe as federated fetch.

It is wonderful to now have the language to describe our solution as well as the validation that our solution is the proper one.

Thank you for sharing this information!

Ian Story

While I can't attempt to garner the $25k prize...I would note that for Challenge 1, I think that it's a little off to say:

"And, if by chance passport number is an indexed and searchable field in the airline system, are the addresses and phone numbers indexed as well? And what about the key values in unstructured comment fields? Due to this issue, federated search can produce incomplete results because a source system may contain related records but cannot find them."

In my experience, you find the indexed value (in the example, passport number) and then any associated data is available, whether an indexed field or not, without a significant performance penalty: you use the indexed data to locate the record, and then you can get all the other information on it. I'm thinking you mean that you can't search for it (which of course is true, or at least, the cost of that search and the resulting table scans is a killer).
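To make the distinction concrete, here is a minimal sketch, assuming a toy in-memory "airline system" whose only index is on passport number. The `Reservation` type, the sample records, and both function names are hypothetical and not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    record_id: int
    passport: str
    phone: str
    comment: str


# Hypothetical sample data, invented for illustration.
RECORDS = [
    Reservation(1, "P123", "555-0100", "aisle seat"),
    Reservation(2, "P456", "555-0199", "call 555-0100 to confirm"),
]

# The only index this toy system maintains: passport number -> record.
PASSPORT_INDEX = {r.passport: r for r in RECORDS}


def find_by_passport(passport):
    """Indexed lookup: one dictionary probe returns the whole record."""
    return PASSPORT_INDEX.get(passport)


def find_by_phone(phone):
    """No index on phone or comments, so every record has to be scanned."""
    return [r for r in RECORDS if r.phone == phone or phone in r.comment]
```

Once `find_by_passport` locates a record, its phone and comment come along for free. Going the other way, from a phone number to the records that mention it, is what forces the scan.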

So, granted that federated search sucks, how does one build the central directory/index (and keep it up to date)? I guess you have whatever feeds are causing the updates to the disparate systems also update your index, but in some cases that too could be crushing (as in the airline reservation example). If you don't update your index in real time, though, then it seems you open the door to bad things happening (last-minute purchases of airline tickets by bad guys that aren't flagged because the nightly update hasn't happened yet, for instance).
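The staleness trade-off can be sketched in a few lines. This is only an illustration under assumptions: `CentralIndex`, its methods, and the event shape are hypothetical names, and a real feed would also have to handle deletes, ordering, and retries.

```python
from collections import defaultdict


class CentralIndex:
    """Toy central index mapping a key value to (system, record_id) pointers."""

    def __init__(self):
        self.entries = defaultdict(set)
        self.pending = []  # change events waiting for the next batch run

    def publish(self, system, record_id, keys, realtime=False):
        """A source system announces a new or changed record."""
        event = (system, record_id, tuple(keys))
        if realtime:
            self._apply(event)
        else:
            self.pending.append(event)

    def run_batch(self):
        """Stand-in for the nightly update: apply everything queued so far."""
        for event in self.pending:
            self._apply(event)
        self.pending.clear()

    def _apply(self, event):
        system, record_id, keys = event
        for key in keys:
            self.entries[key].add((system, record_id))

    def lookup(self, key):
        return set(self.entries.get(key, set()))


index = CentralIndex()
index.publish("airline", 42, ["P123"])   # last-minute purchase, batch mode
stale = index.lookup("P123")             # empty: the batch has not run yet
index.run_batch()
fresh = index.lookup("P123")             # now contains the airline pointer
```

Passing `realtime=True` closes that window, at the price of doing index work on every source transaction.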

Too many problems, not enough CPU cycles!

Matt McKnight


You are 100% right on this. The federation approach is sadly a "politically" popular solution, but is woeful in practice, and while it does something, it doesn't do what most people want it to do. One wrinkle that non-federation brings in is that the big index now must somehow implement the security rules of its component systems. If there is not a good delegation model for authorization in place, it will be difficult to get the trust necessary for the big index, whereas federated search can delegate to the other system. That's been the wedge that is most frequently thrown at me as I advocate for the big index.
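One possible way to soften that wedge, sketched here purely as an assumption and not as anything the post prescribes: let the big index hold only pointers, and have each source system enforce its own rules when the record is actually fetched. `SourceSystem`, `federated_fetch`, and the sample users and records are all hypothetical.

```python
class SourceSystem:
    """Toy source system that enforces its own access rules on fetch."""

    def __init__(self, name, records, allowed_users):
        self.name = name
        self._records = records
        self._allowed_users = set(allowed_users)

    def fetch(self, record_id, user):
        if user not in self._allowed_users:
            raise PermissionError(f"{user} may not read from {self.name}")
        return self._records[record_id]


def federated_fetch(pointers, systems, user):
    """Resolve index pointers, delegating authorization to each source."""
    results, denied = [], []
    for system_name, record_id in pointers:
        try:
            results.append(systems[system_name].fetch(record_id, user))
        except PermissionError:
            denied.append((system_name, record_id))
    return results, denied


# Hypothetical systems and index hits, invented for illustration.
SYSTEMS = {
    "airline": SourceSystem(
        "airline", {42: {"passport": "P123", "flight": "XY100"}}, ["alice", "bob"]
    ),
    "hotel": SourceSystem(
        "hotel", {7: {"passport": "P123", "room": "12"}}, ["alice"]
    ),
}
POINTERS = [("airline", 42), ("hotel", 7)]  # what the index returned for P123
```

Note that even in this sketch the index still reveals that a matching record exists in the hotel system, so it narrows the trust problem without removing it.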

I think Ian's point is important, as that is the first challenge in building the big index, getting all of the data in there. Push or pull?

Dave M

Centralized systems have their own drawbacks as well. Actually moving and transforming 100s or 1000s of datasets is a complex task and prone to error. Moving data means it will always be out of date by some amount. There is no instantaneous data teleportation technology available yet. Catastrophic failure is another issue. Your organization might give you $20M to build your system but good luck getting another $20 for a COOP system. And some organizations will simply not give you their data.

My ideal system would be layered hybrid. This builds on the idea of leaving the data in place and building a centralized index. Data can be increasingly compressed into multiple layers of abstraction. The bottom layer is raw data exactly as it appears in the source system. Next layer up abstracts the important parts of the source system. In other words get rid of the admin tables and leave the good stuff. But overall the data is the same. This can be a view on the original data. The next layer up would be models that are specific to particular needs. One might consolidate all the information on people (names, passport numbers, birth dates, etc) while another might consolidate travel information (flights, dates, times, airports, equipment, etc). When you have a large number of data sources not every one will contain every entity of interest. These specialized models let you consolidate the common parts of relevant data sets for a specific purpose. At the top would be the uber model that has a simplified, very basic view of all the data. This would be the global index. At each stage you abstract away data but maintain the pedigree and lineage that lets you work your way back.

The bottom layer stays at the source and is used mainly as a lookup. The uber model is at the centralized location. The other two abstraction layers can exist either out at the federated level or at the centralized location.
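As a rough sketch of the four layers, assuming two made-up source systems with different field names: every identifier below (`RAW`, `FIELD_MAPS`, the function names, the sample fields) is hypothetical, and the point is only that each layer carries a lineage pointer back to the raw record.

```python
# Layer 1: raw data, exactly as it sits in each source system.
RAW = {
    ("airline", 42): {"pax_name": "A. Traveler", "passport_no": "P123",
                      "flight": "XY100", "audit_user": "sys"},
    ("border", 9): {"full_name": "A. Traveler", "passport": "P123",
                    "entry_port": "JFK", "row_version": 3},
}

# Which raw fields survive into layer 2, and under what common name.
FIELD_MAPS = {
    "airline": {"pax_name": "name", "passport_no": "passport", "flight": "flight"},
    "border": {"full_name": "name", "passport": "passport", "entry_port": "port"},
}


def cleaned_view(source_key):
    """Layer 2: drop admin fields, keep the good stuff, remember the source."""
    system, _ = source_key
    row = {new: RAW[source_key][old] for old, new in FIELD_MAPS[system].items()}
    row["lineage"] = source_key
    return row


def person_model(views):
    """Layer 3: a purpose-specific model that consolidates people."""
    people = {}
    for view in views:
        person = people.setdefault(
            view["passport"], {"name": view["name"], "lineage": []}
        )
        person["lineage"].append(view["lineage"])
    return people


def global_index(people):
    """Layer 4: the uber model, holding just a key and pointers to the sources."""
    return {passport: list(p["lineage"]) for passport, p in people.items()}


def trace_back(index, passport):
    """Follow the lineage from the global index down to the raw records."""
    return [RAW[key] for key in index.get(passport, [])]


views = [cleaned_view(key) for key in RAW]
uber = global_index(person_model(views))
```

Here `uber` is small enough to live centrally, while `trace_back` is the step that goes out to the source systems for the full record.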

Obviously this is more complex than the pathetically simple description above and there are many, many challenges associated with it. Way too many to list here. But this is only a blog response, not a PhD thesis. Yet. It does provide the benefit of flexibility though. The different layers can be staged in different places depending on the realities of the environment that you are working in.

Stephan Engberg

Politely, this is a solved problem, but you need to look at the application to find the best solution, and shortcutting/greed is part of the problem.

a) If this is about Data finds Data on people, it should not be possible, as it means that the underlying transaction systems are insecure, reusing identifiers across purposes.

In Europe this would and should be illegal, constituting a breach of purpose, even though we all know that these are the most violated laws after traffic regulations.

Instead you use the person herself as the intermediate and only trusted part. You give the citizen access to her data and let her do the query. Security is of course important just as usability is.

I ran a series of workshops interconnecting Needsdriven Innovation and Security by Design, incorporating some of the heavier cases as illustrations.

One case was about checking job applications against police records and another about tax reporting in connection with sperm banks.

But fundamentally, you do not link databases with unrelated transactions, as that would imply not only a serious security problem, but also a drift towards command & control economics and inefficiencies. Even if you can, you shouldn't, and you should eliminate the security problems in the source transaction systems as a parallel to solving the problem using market-supporting and sustainable structures.

b) If you are talking about non-person-related queries, this is solved in a number of cases, but of course it likely requires pre-data management to be fast enough. Google, searches for air travel, and price comparison are examples.

c) The transactions define the security requirement. For each case, you will have explicit sources of "truth" and need to define the logic of verification, transformed into security requirements, and these might be dynamic according to contextual variables.

I.e. under normal circumstances you would not do an identity check when using public transportation, but in situations of specific risk you would raise requirements, and in case of full-case breaches of fundamental security structures send out "Dead or Alive" posters, even though they technically belong to the last century, as we have much better security means and requirements, e.g. incorporating blinded credentials for fact validation.

Dima Rekesh

If I am not mistaken, one way to address this would be to host these federation members in the cloud, where their scalability could be dialed up and down, uh, quickly. Same for the tweaks on each one of them - this is the kind of stuff that should be easy to do.
