I am
probably stepping on some folks’ toes. My apologies.
First,
let me explain what I mean by federated search. Federated search:
conducting a search against “n” source systems via a broadcast mechanism
without the benefit or guidance of an index. This is somewhat like
roaming the three buildings of the Library of Congress looking for a book title
… without benefit of a card catalog.
I am
speaking specifically about environments where the systems in the federation
are heterogeneous, are physically dispersed, were not engineered for federation
a priori, and are not managed by a common command and control system.
By way of
example, an airline might have a payroll system containing employees, a
reservation system containing flight reservations and a watch list database
containing people who are not permitted to fly. If this airline
implemented federated search, the data in these three systems would remain in
these three systems. Searches (whether invoked by users or machines) are
then broadcast to each source system. Note: Source systems receive
queries for information they may or may not have, and as we shall see, receive
queries for data they may have but have no means to locate in any efficient
manner.
Federated
search works fine if the goal is simply a reference system used to answer
periodic inquiries. Such systems could be described as forensic in nature:
when there is something of interest, one can look for it. Think of such
federated search environments as systems where “the data only speaks when spoken
to.” If this is what an organization needs, and there are a small number
of queries and a finite number of source systems, federated search is a fine
option.
Most
organizations are not living in a world where “after-the-fact forensic
discovery delivered only when asked” is acceptable.
Most
organizations have some obligation to make sense of what they know. For
example, the airline should know if the person added to the watch list is
already an employee or already has a flight reservation. Ideally, the
moment such facts become knowable, someone or some system should be
notified. Think of this as “the data speaks to itself.” I call this “data finds data.”
This
notion of data finds data implies the “data is the query.” As
each new piece of data enters the organization, the organization has just
learned something. And it is at this exact moment in time that one (a
smart system) must ask: Now that I know this,
how does it relate to what I already know? Does this matter, and if so
… to whom?
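A toy sketch can make “the data is the query” concrete. The structure and names below are illustrative assumptions, not any real system: every arriving record is immediately matched against the features the organization has already seen.

```python
# Hypothetical sketch of "data finds data": every arriving record is
# treated as a query against what the organization already knows.
# All names and values here are illustrative, not from any real system.

known = {}  # feature value -> list of record ids already seen

def ingest(record_id, features):
    """On arrival, ask: how does this relate to what we already know?"""
    related = set()
    for value in features:
        # Any record that shares this feature value is "related".
        related.update(known.setdefault(value, []))
        known[value].append(record_id)
    return sorted(related)  # non-empty means someone should be notified

# A watch-list entry arrives, then a reservation sharing its passport number.
ingest("watchlist-1", ["passport:X123"])
print(ingest("reservation-9", ["passport:X123", "phone:555-0100"]))
# → ['watchlist-1']
```

The point of the sketch is that no user ever posed a query; the second record’s own features drove the discovery.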
Whether
the data is the query (generated by systems likely at high volumes) or the user
invokes a query (by comparison likely lower volumes), there is no
difference. In both cases, this is simply a need for “discoverability” – the ability to discover if the enterprise
has any related information.
If
discoverability across a federation of disparate systems is the goal, federated
search does not scale, in any practical way, for any amount of money.
Period. It is so essential that folks understand this before they run off
wasting millions of dollars on fairy-tale stories backed up by a few math guys
with a new vision who have never done it before.
I will
spare you the gory details of that day in 1996 when I came to witness such a
federated search system. Multi-million-dollar, very smart middleware, developed
over a number of years, sat atop a reported 2,000 data stores and 50 billion
rows of data. Watching this large federated search system really drove
home a series of epiphanies about the problems of federated search.
Fortunately, the purpose of this particular system was a reference/forensic
system that only had to respond to a relatively low volume of queries,
primarily generated by users. And getting an incomplete answer from
time-to-time would not be the end of the world.
To
explain why federated search bites I will lay out three basic goals, three
notional source systems, and four nasty problems (let’s call them
challenges). Mind you, the greater the number of source systems, and the
greater the transactional volumes, the more impossible it becomes to discover
similar data across dissimilar systems (data finds data).
GOALS
Goal 1:
Because the data must find the data, this means for every record added or
updated in the federation one must determine if this information is related to
any other records in the federation. Such discoverability must be able to
keep up with transactional volumes therefore must be near-real-time. [Note: To keep this really simple let
us say related only means: shares an exact passport
number, address, or phone number.]
Goal 2:
Users should be able to pose queries themselves. As it turns out, this goal
adds nothing new: the discoverability properties needed to deliver on
Goal 1 can just as easily be applied here.
Goal 3:
The federated search system must be scalable across hundreds or more disparate
source systems. As such,
new source systems must be able to be added to the federation without adverse
consequence to existing source systems in the federation, otherwise, the
greater the number of systems the more unmanageable the environment.
SYSTEMS
Using the
airline example, let’s say the three notional systems look like this:
System 1:
A commercial-off-the-shelf payroll system (20K employees, <16 CPUs, 200
transactions a day (subject to data finds data), system running at 90%
utilization).
System 2:
An airline reservation system (100M reservations, <265 CPUs, 2,000
transactions a second, system running at 97% utilization).
System 3:
A watch list database (subjects of interest) running on a commercial-off-the-shelf
SQL database (1M records, <8 CPUs, 1,000 changes a day, system running at
80% utilization).
CHALLENGES
Challenge
1: How will a new watch listing record containing a passport number (in System
3) efficiently locate related reservations records (in System 2) which share
the same passport number? Here is the problem: An airline reservation
system is typically designed to search on things like reservation number or
flight number and date of departure, not passport number. Source systems
are optimized for their purpose, maintaining only the necessary indexes.
And, if by chance passport number is an indexed and searchable field in the
airline system, are the addresses and phone numbers indexed as well? And
what about the key values in unstructured comment fields? Due to this
issue, federated search can produce incomplete results because a source system
may contain related records but cannot find them. Note: It is not
practical to re-engineer every source system to maintain all conceivable
indexes.
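A hedged sketch (hypothetical data, illustrative names) shows why Challenge 1 bites: a source system answers efficiently only on the fields it chose to index, and everything else degenerates into a full scan.

```python
# Illustrative sketch of Challenge 1: the reservation system indexes what
# *it* needs (reservation number), so a broadcast query by passport number
# has no index to consult and must scan every row. Data is hypothetical.

reservations = [
    {"res_no": i, "passport": f"P{i % 1000}", "phone": f"555-{i:04d}"}
    for i in range(100_000)
]

# The system maintains only the index its own purpose requires.
by_res_no = {r["res_no"]: r for r in reservations}

def lookup_res_no(n):
    """O(1): one hash probe against the maintained index."""
    return by_res_no.get(n)

def lookup_passport(p):
    """O(n): no index on passport, so every record is examined."""
    return [r for r in reservations if r["passport"] == p]
```

At 100K rows the scan is merely slow; at 100M reservations, fielding broadcast queries this way is what makes the source system unable to find records it actually holds in any timely fashion.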
Challenge
2: How will the payroll system (System 1) keep up with the flood of queries
generated by the reservation system (System 2)? Here is the problem: The
payroll system does not have the compute resources to sustain thousands of
queries a second; it was not designed for that. Now maybe you are
thinking: why would you do that? Well, data finds data is used to construct
context (determine what one knows) in order to determine the right course of
action. In this oversimplified example, maybe the airline likes to know
when current or former employees make reservations so the right offers are
made. Maybe terminated employees are not provided the same kind of offers
as other former employees. Note: It is not practical to re-host the
hardware of every source system such that it will be able to sustain the
cumulative transactional volume of the federation.
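Running the notional numbers from the SYSTEMS section makes Challenge 2 concrete. This back-of-envelope arithmetic is only illustrative:

```python
# Back-of-envelope arithmetic for Challenge 2, using the notional figures
# from the SYSTEMS section above (2,000 tx/sec reservations; a payroll
# system designed for 200 transactions a day).

reservation_tx_per_sec = 2_000          # System 2
payroll_tx_per_day = 200                # System 1's designed daily load

seconds_per_day = 24 * 60 * 60
broadcast_queries_per_day = reservation_tx_per_sec * seconds_per_day

# Broadcasting every reservation transaction means the payroll system must
# absorb ~172.8 million queries a day against a design point of 200.
overload_factor = broadcast_queries_per_day / payroll_tx_per_day
print(f"{broadcast_queries_per_day:,} queries/day, "
      f"~{overload_factor:,.0f}x the payroll system's designed load")
```

Six orders of magnitude over design load is not something you tune your way out of, which is the point of the note above about re-hosting hardware.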
Challenge
3: New information can be located during the federated search that warrants a
re-query of the source systems. This is recursive. Imagine the query is
for a passport number that exists only in the watch list database, and the
matching watch list record reveals a new phone number. This newly discovered
information, ideally, must be used to re-query the federated systems. For
example, maybe there is a record in the reservation system with the same phone number
and maybe this reservation contains a new address! Here is the
problem: With each new feature discovered, one must consider re-querying the
source systems (again). Note: The hardware at each source system would
not only have to support the transactional volume of the federation – but the
recursive queries on top of that.
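The recursion in Challenge 3 can be sketched as a loop: every newly discovered feature becomes fuel for another broadcast. All names and data below are hypothetical:

```python
# Hypothetical sketch of Challenge 3's recursion: each source is modeled
# as a dict of feature value -> matching records, and every new feature
# found in a matching record triggers another broadcast to all sources.

def federated_query(sources, seed_features):
    """Broadcast features to every source; re-query on each new feature."""
    seen_features = set()
    found_records = []
    frontier = list(seed_features)
    while frontier:                        # the recursion, as a worklist loop
        feature = frontier.pop()
        if feature in seen_features:
            continue
        seen_features.add(feature)
        for source in sources:             # one full broadcast per new feature
            for record in source.get(feature, []):
                if record not in found_records:
                    found_records.append(record)
                    frontier.extend(record["features"])  # re-query fuel
    return found_records

# The watch list reveals a phone; the phone matches a reservation.
watch = {"passport:X1": [{"id": "w1", "features": ["passport:X1", "phone:555"]}]}
resv  = {"phone:555":   [{"id": "r9", "features": ["phone:555", "addr:Elm St"]}]}
hits = federated_query([watch, resv], ["passport:X1"])
```

Note that even this tiny example issues three full broadcasts (passport, phone, address) for a single seed query; each source pays for every round.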
Challenge
4: Can you be sure all systems, across all the time zones, are all on-line, all
at the same time? What if the fourth system added to the federation is a
small desktop application running a Microsoft Access database – will this
system be left on-line at night, with a high-availability failover system
standing by? The issue is: Heterogeneous systems have non-uniform
availability.
[Theatrical
pause]
Just how sure am I that federated search cannot
handle discoverability at scale? How about this: First person to describe
a scalable federated search system that delivers on the goals and overcomes
these technical challenges … in a practical way (e.g., without having to re-host source system hardware) …
I’ll write you a personal check for $25,000 (see small print below).
So, if
federated search is not the ideal approach for discoverability at scale, then
what is?
Discovery
at scale is best solved with some form of central directories or indexes.
That is how Google does it (queries hit the Google indexes which return
pointers). That is how the DNS works (queries hit a hierarchical set of
directories which return pointers). And this is how people locate books
at the library (the card catalog is used to reveal pointers to books).
Once a
directory reveals a pointer, you can go fetch it. Federated fetch does scale. Yes, the source
system will have to be on-line, in the same way the floor at the library must
be open. Yes, the user will have to have access privileges. And
yes, there are other challenges like the need to keep the directory current and semantically
reconciled (to overcome
the recursive issues described in Challenge 3). But, at least these are all tractable
problems!
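A minimal sketch of the directory idea, with illustrative names only: source systems publish feature values with pointers, and discovery becomes a single index probe followed by a federated fetch.

```python
# Minimal sketch of the directory approach: the central index maps feature
# values to pointers (system, record id), never to copies of the records.
# Names and structure are illustrative assumptions, not a real design.

directory = {}   # feature value -> set of (source_system, record_id) pointers

def publish(source, record_id, features):
    """Source systems push features as records are added or changed."""
    for value in features:
        directory.setdefault(value, set()).add((source, record_id))

def discover(feature):
    """One index probe returns pointers; the fetch itself stays federated."""
    return sorted(directory.get(feature, set()))

publish("watchlist", "w1", ["passport:X1"])
publish("reservations", "r9", ["passport:X1", "phone:555"])
print(discover("passport:X1"))   # pointers only -- go fetch from the source
```

This is the card-catalog pattern from the library analogy: the lookup is centralized and cheap, while the expensive fetch fans out only to the systems the pointers name.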
Truthfully,
I would love to be proven wrong here for a variety of reasons, e.g., the privacy ramifications of having
large centralized database directories. Although, on the brighter side,
the directory approach to discoverability results in fewer copies of the data
floating around. And another plus may be that data governance
(accountability, oversight, immutable audit
logs, etc.) is going to be
vastly easier to manage with a smaller number of central directories.
[Small
Print: Offer good for two years from the date of this posting. If you
have a solution in mind no need to physically prove it, just explain it on
paper in plain English such that the average propeller-head can read it and go
“oh yeah, that would work.” But, don’t spend too much time on this as
it’s obviously not a fair challenge. I’m just trying to make a point as
it seems a number of organizations, each desperate to quickly solve large scale
discoverability, are being sold on the notion of federated search. An
absolute waste of money.
RELATED POSTS
Federated Discovery
vs. Persistent Context – Enterprise Intelligence Requires the Latter
To Know Semantic
Reconciliation is to Love Semantic Reconciliation
It’s All About the
Librarian! New Paradigms in Enterprise Discovery and Awareness
Discoverability:
The First Information Sharing Principle
What Came First,
the Query or the Data?
Jeff,
This post could not have come at a better time for me. At work, I have been tasked with overcoming a federated search problem.
The solution my team came up with was what you describe as federated fetch.
It is wonderful to now have the language to describe our solution as well as the validation that our solution is the proper one.
Thank you for sharing this information!
Posted by: Will Gordon | July 10, 2010 at 08:37 AM
While I can't attempt to garner the $25k prize...I would note that for Challenge 1, I think that it's a little off to say:
"And, if by chance passport number is an indexed and searchable field in the airline system, are the addresses and phone numbers indexed as well? And what about the key values in unstructured comment fields? Due to this issue, federated search can produce incomplete results because a source system may contain related records but cannot find them."
In my experience, you find the indexed value (in the example, passport number) and then any associated data is available, whether an indexed field or not, without a significant performance penalty - you use the indexed data to locate the record, and then you can get all the other information on the record...am thinking you mean that you can't search for it (which of course is true, or at least, the cost of that search and resulting table scans is a killer).
So, granted that federated search sucks, how does one build the central directory/index (and keep it up to date)? I guess you just have feeds from whatever systems are causing the updates to happen to the disparate systems also update your index, but in some cases that too could be crushing (like in the airline reservation example)...if you don't do realtime updates of your index though, then it seems you have the chance for bad things to happen (last minute purchases of airline tickets by bad guys that aren't flagged as our nightly update hasn't happened yet, for instance).
Too many problems, not enough CPU cycles!
Posted by: Ian Story | July 12, 2010 at 09:26 AM
Jeff,
You are 100% right on this. The federation approach is sadly a "politically" popular solution, but is woeful in practice, and while it does something, it doesn't do what most people want it to do. One wrinkle that non-federation brings in is that the big index now must somehow implement the security rules of its component systems. If there is not a good delegation model for authorization in place, it will be difficult to get the trust necessary for the big index, whereas federated search can delegate to the other system. That's been the wedge that is most frequently thrown at me as I advocate for the big index.
I think Ian's point is important, as that is the first challenge in building the big index, getting all of the data in there. Push or pull?
Posted by: Matt McKnight | July 13, 2010 at 10:03 PM
Centralized systems have their own drawbacks as well. Actually moving and transforming 100s or 1000s of datasets is a complex task and prone to error. Moving data means it will always be out of date by some amount. There is no instantaneous data teleportation technology available yet. Catastrophic failure is another issue. Your organization might give you $20M to build your system but good luck getting another $20 for a COOP system. And some organizations will simply not give you their data.
My ideal system would be layered hybrid. This builds on the idea of leaving the data in place and building a centralized index. Data can be increasingly compressed into multiple layers of abstraction. The bottom layer is raw data exactly as it appears in the source system. Next layer up abstracts the important parts of the source system. In other words get rid of the admin tables and leave the good stuff. But overall the data is the same. This can be a view on the original data. The next layer up would be models that are specific to particular needs. One might consolidate all the information on people (names, passport numbers, birth dates, etc) while another might consolidate travel information (flights, dates, times, airports, equipment, etc). When you have a large number of data sources not every one will contain every entity of interest. These specialized models let you consolidate the common parts of relevant data sets for a specific purpose. At the top would be the uber model that has a simplified, very basic view of all the data. This would be the global index. At each stage you abstract away data but maintain the pedigree and lineage that lets you work your way back.
The bottom layer stays at the source and is used mainly as look up. The uber model is at the centralized location. The other two abstraction layers can exist in either out at the federated level or at the centralized location.
Obviously this is more complex than the pathetically simple description above and there are many, many challenges associated with it. Way too many to list here. But this is only a blog response, not a PhD thesis. Yet. It does provide the benefit of flexibility though. The different layers can be staged in different places depending on the realities of the environment that you are working in.
Posted by: Dave M | August 01, 2010 at 05:48 PM
Politely, this is a solved problem - but it needs to address the application to find the best solution, and shortcutting/greed is part of the problem.
a) If this is about data finds data on people, it should not be possible, as it means the underlying transaction systems are insecure, reusing identifiers across purposes.
In Europe this would and should be illegal, constituting a breach of purpose, even though we all know it is among the most violated laws after traffic regulation.
Instead you use the person herself as the intermediary and only trusted party. You give the citizen access to her data and let her do the query. Security is of course important, just as usability is.
I ran a series of workshops interconnecting needs-driven innovation and security by design, incorporating some of the heavier cases as illustrations.
One case was about checking job applications against police records, and another about tax reporting in connection with sperm banks:
http://digitaliser.dk/resource/896495
But fundamentally - you do not link databases with unrelated transactions, as that would imply not only a serious security problem, but also a drift towards command & control economics and inefficiencies. Even if you can, you shouldn't, and you should eliminate the security problems in the source transaction systems in parallel, solving the problem using market-supporting and sustainable structures.
b) If you are talking about non-person-related queries, this is solved in a number of cases, but of course likely requires pre-data management to be fast enough - Google, searches for air travel, and price comparison are examples.
c) The transactions define the security requirement. For each case, you will have explicit sources of "truth" and need to define the logic of verification transformed into security requirements, and these might be dynamic according to contextual variables.
I.e., under normal circumstances you would not do an identity check when using public transportation, but in situations of specific risk you would raise requirements, and in case of full-scale breaches of fundamental security structures send out "Dead or Alive" posters, even though they technically belong to the last century, as we have much better security means and requirements, e.g. incorporating blinded credentials for fact validation.
Posted by: Stephan Engberg | September 26, 2011 at 12:25 AM
If I am not mistaken, one way to address this would be to host these federation members in the cloud, where their scalability could be dialed up and down, uh, quickly. Same for the tweaks on each one of them - this is the kind of stuff that should be easy to do.
Posted by: Dima Rekesh | October 23, 2011 at 10:16 PM