

April 30, 2007

"Need to Know" vs. "Need to Share" – A Very Fine Line Indeed

I kept thinking I should title this post, "Knock. Knock. Who’s there?" But I scrubbed that idea.

This post is related to national security, intelligence, classification and information sharing. If this is not your domain, my comments below will make no sense.

Back in pre-9/11 days, those holding classified data used to apply the "Need to Know" model when considering information sharing. Following 9/11, there has been a call for improved information sharing, which has resulted in the new mantra "Need to Share."

What do you think is the difference between "need to know" and "need to share"? When push comes to shove, most people I speak with in the mission cannot quite articulate the difference. Not good.

In part, "need to share" involves a new mindset. This new mindset was highlighted in the third report issued by the Markle Foundation’s Task Force on National Security in the Information Age. For example, in this report we called for an increased use of tearline reporting and a decreased use of ORCON designations. (see report pages 44-48)

There is another aspect, however, that, while referenced in our report (see report pages 46 and 61), may be lacking the attention it deserves. And this is the subject of data indices. These are so fundamental to implementing a functional information sharing program, that I might hazard to say that without data indices … there is little to no hope information sharing will ever be solved. Let me explain.

If someone is the custodian of a highly relevant data item how will they "know who needs to know?" And conversely, if someone else is in need of this highly relevant data item how will they "know whom to ask?" Basically the problem is: who needs to know what? Example: How will the folks working on counter-proliferation know they have a record that is directly related to another team specializing in anti-money laundering? The chances these two groups (even if working in the same building ... ouch) will actually recognize they have related data points is close to Z E R O. If there were just these two groups, the problem would be trivial and could be worked out. But in the real world, organizations may have hundreds of isolated data sets. On whose door shall I knock?

In this earlier post I introduced the Information Sharing Paradox. This paradox basically states that if everyone cannot share everything with everyone else, and everyone cannot ask everyone else every question every day … then how is someone going to find something?

The answer of course is one must first solve "discovery," i.e., knowing who to ask for what. All large scale discovery problems are solved by central indexes (data registries with pointers). Be advised, discovery is not solved by a federated search where one broadcasts searches across the enterprise. And if you hear that federated search is the solution, be afraid, be very afraid. [I explain this in some detail in this post here.]

In order for "need to share" to fulfill its full potential, data custodians must first publish (limited) metadata to the central index. More precisely, when I say "publish data," in actuality they will need to use data tethering to ensure all adds, changes and deletes are properly reflected in the index. At libraries, index metadata about new documents includes subject, title and author. In your business this limited metadata is more likely to be something like who, what, where, when, etc.
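To make the idea of a central index with pointers concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the design of any real system: the class name `CentralIndex`, its methods, and the sample custodians and records are all hypothetical. Custodians publish only limited metadata, and every add, change and delete is pushed through to the index (a toy stand-in for data tethering). A lookup returns pointers to custodians, never the underlying records.

```python
from collections import defaultdict


class CentralIndex:
    """Toy discovery index (hypothetical): limited metadata plus pointers to custodians."""

    def __init__(self):
        # (source, record_id) -> set of (field, value) metadata pairs
        self._records = {}
        # (field, value) -> set of (source, record_id) pointers
        self._postings = defaultdict(set)

    def upsert(self, source, record_id, metadata):
        """Reflect an add or a change published by a custodian."""
        self.delete(source, record_id)  # drop any stale metadata first
        key = (source, record_id)
        pairs = {(field, str(value).lower()) for field, value in metadata.items()}
        self._records[key] = pairs
        for pair in pairs:
            self._postings[pair].add(key)

    def delete(self, source, record_id):
        """Reflect a delete so the index never points at a record that is gone."""
        key = (source, record_id)
        for pair in self._records.pop(key, set()):
            self._postings[pair].discard(key)
            if not self._postings[pair]:
                del self._postings[pair]

    def who_has(self, field, value):
        """Answer 'on whose door shall I knock?' with pointers only."""
        return sorted(self._postings.get((field, str(value).lower()), set()))


# Hypothetical sample data: two isolated teams holding a related data point.
index = CentralIndex()
index.upsert("counter_proliferation", "cp-17", {"who": "Example Trading Co", "where": "Port A"})
index.upsert("anti_money_laundering", "aml-203", {"who": "example trading co"})

print(index.who_has("who", "Example Trading Co"))
# [('anti_money_laundering', 'aml-203'), ('counter_proliferation', 'cp-17')]

index.delete("anti_money_laundering", "aml-203")
print(index.who_has("who", "Example Trading Co"))
# [('counter_proliferation', 'cp-17')]
```

Note the design choice: the index holds just enough to answer "who has something about this?" The requester must still go to the custodian, who decides what to share.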

As central indexes will be the means by which information discovery challenges are solved, this becomes a way to begin focusing the privacy and civil liberties debate.

One privacy related tension will be defining exactly what kind of data should be discoverable, i.e., placed in the index? For example, in counter-terrorism information sharing programs, there would be significant controversy over, say, including pharmaceutical prescription information of all US citizens; whereas, including foreigners banned from traveling to the US would probably cause little to no concern. The subject of discoverability (i.e., selecting which data will live in the central index) deserves much debate.

On the good news front, solving discoverability via central indexes brings with it a few useful privacy protections, including: a) the urge to share more data with more parties is replaced by transferring less information to one place (the central index); b) who is searching for what, and what they found, can be logged (e.g., using immutable audit logs) in a consistent manner, thus facilitating better accountability and oversight; c) information sharing between parties is now reduced to just the records that they need to know and need to share (sharing less by sharing only information that must be shared); and d) it is now possible to make the index anonymized (see: Anonymized Semantic Indexes), which means the risk of unintended disclosure of even the limited metadata in the index is drastically reduced.

Whether living in the "need to know" world or the "need to share" world, one must first be able to answer the questions "who" and "what"; otherwise, this dog won’t hunt.

RELATED POSTS:

Discoverability: The First Information Sharing Principle

Information Sharing: Got Directory?

No Need to "Over Share" – Thoughts on Information Sharing

It’s All About the Librarian! New Paradigms in Enterprise Discovery and Awareness

Intelligent Organizations – Assembling Context and The Proof is in the Chimp!

Federated Discovery vs. Persistent Context – Enterprise Intelligence Requires the Later

April 26, 2007

To Know Semantic Reconciliation is to Love Semantic Reconciliation

Semantic reconciliation is possibly the most fundamental building block required to make intelligent systems intelligent. When I say "semantic reconciliation," I mean: "Recognizing when two objects are the same despite having been described differently." Or put more simply, this is about counting like things.

In disease research one would need to know the difference between six reported cases of Lupus versus one case reported six times. A 911 operator receives emergency calls from six people, each reporting the sound of gunshots. Is this one incident, six separate incidents, or somewhere in between?

I stayed at a W Hotel a few weeks ago. They asked me if I was in the loyalty club program. I did not know, so I had them look. It turns out I am in the loyalty club program three times. They think I am three different customers when in fact I am one. They don’t know me! (Ironically, I checked into a different W Hotel last night and they could not find any loyalty club records for me whatsoever.)

If all data collected contained globally unique identifiers (e.g., a bar coded serial number), then semantic reconciliation would be trivial. But the world collects different features in different ways from the same object. Some systems record me as Jeff Jonas and others Jeffrey Jonas. Sometimes I share a frequent flyer number and no date of birth, and in other places I share a date of birth and passport number. So how many Jeff Jonases are there? Organizations that cannot count unique objects make suboptimal decisions. In the case of the multiple loyalty club accounts, that may mean denying a decent customer decent rewards that would have been earned had all the points been recognized as belonging to one account!
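The counting problem above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the technique used in any particular product: it assumes two records describe the same person whenever they share a strong identifier (here a frequent flyer number or a passport number), and it merges transitively with a small union-find. The function name `count_entities`, the field names, and every identifier value are made up for the example; names alone are intentionally not treated as evidence.

```python
def count_entities(records, id_fields=("frequent_flyer", "passport")):
    """Count distinct entities, assuming a shared identifier means 'same entity'."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for i, rec in enumerate(records):
        for field in id_fields:
            value = rec.get(field)
            if value is None:
                continue
            key = (field, value)
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])  # merge the two groups
            else:
                first_seen[key] = i
    return len({find(i) for i in range(len(records))})


# Hypothetical records: different features collected in different places.
records = [
    {"name": "Jeff Jonas", "frequent_flyer": "FF-001"},
    {"name": "Jeffrey Jonas", "frequent_flyer": "FF-001", "passport": "P-999"},
    {"name": "J. Jonas", "passport": "P-999", "dob": "1970-01-01"},
    {"name": "Jeff Jonas", "frequent_flyer": "FF-777"},  # same name, no shared identifier
]

print(count_entities(records))  # 2
```

The first three records collapse into one entity even though the first and third share no field directly; the second record bridges them. The fourth stays separate despite the identical name.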

It is important to address semantic reconciliation before other analytical processes (e.g., statistical analysis, market segmentation, link analysis, etc.). This is a "first things first" principle because semantic reconciliation makes secondary analytic and computational problems that much easier and that much more accurate.

And, while my primary focus over the years has been the semantic reconciliation of identities (people and organizations) with attention to massive scale and subtle little nuances like sequence neutrality, similar techniques are possible for many other things (e.g., in Las Vegas the Starbucks on the corner of Sahara and Maryland Parkway happens to be the same as the Starbucks at 2595 S. Maryland Parkway).

If one cannot count discrete objects, one cannot properly construct context. And when organizations make decisions without context – brace yourself for bad decisions – and say hello to more Enterprise Amnesia!

RELATED POSTS:

Accumulating Context: Now or Never

Federated Discovery vs. Persistent Context – Enterprise Intelligence Requires the Later

April 18, 2007

Predicate-based Link Analysis: A Post 9/11 Analysis (1+1= 13)

Had I been blogging in 2001, this would have been posted then. And although I posted something similar to this on May 16th of last year [here] … I am posting this now as some of the reporting to date around my work in this area has been overstated and/or inaccurate.

Following September 11th many newspaper and magazine stories began showing how the hijackers were related to each other and ultimately to Osama Bin Laden. And with these pictures came suggestions that this event could have been prevented had the government had access to much more data (e.g., health care records, banking records, communications, etc.). As well, there appeared to be an emerging consensus that by studying merely the shape of the 9/11 network, one may be able to locate similarly shaped networks – thus, detecting and preempting future events.

I disagreed with this thinking. In fact, it was my opinion that, at least in the case of 9/11, neither more large data sets to graph the nation nor hunting for similar network shapes in this graph would have been necessary (or even useful) for detecting and preempting this event.

Ever see someone standing in front of a giant graph? Imagine a picture with millions of nodes connected via millions of lines each with varying thickness and color. Think spaghetti-fest ready to feed 10,000 people. While very impressive to look at … looking at such is not useful in establishing a starting point.

Networks are useful when one has an entrance point. From a specific vantage point one has a string to pull. And in the case of criminal investigations, these starting points are "predicates." By this I mean knowledge about something or someone that meets some threshold on the scale of reasonable and particular (calibrated with respect to the crime, i.e., a different threshold for a deadbeat dad versus a nuclear threat), and that justifies some further action.

From this predicate one begins an investigation or inspection – pulling the string and marching down the path toward the ultimate fact: whether someone is planning something, or has done some bad act. When an investigation is started without a sufficient predicate, or starting point, one risks rampant false positives, which not only waste resources but also bring investigative attention to the innocent – which results in unnecessary intrusion on our privacy and, worse, our civil liberties.

While the question of "what is a predicate" is worthy of a longer conversation and debate, in the case of the 9/11 hijackers, there were two perfect starting points. Both Nawaf Alhamzi and Khalid Al-Midhar were already known to the US government to be very bad men. They should have never been let into the US, yet they were living in the US and were hiding in plain sight – using their real names.

When 1+1=13: Starting with these two guys I drew from various public sources (e.g., investigative journalism, grand jury indictment, etc.) to demonstrate how the network would have looked. In short, with basic investigative procedures, I demonstrated that at least 13 of the 19 could have been exposed.

So, back in the day when running SRD, I created a series of PowerPoint charts to illustrate exactly this point.

This was first published on page 28 of the Markle Foundation’s report entitled "Protecting America’s Freedom in the Information Age." Since then, it has found its way into a number of other publications (e.g., Newsweek: Geek War on Terror).

From time-to-time, though, this work has been characterized incorrectly. For example:

It has been said that the data was run through NORA to develop this analysis. Nope. I never had this data. Rather, I just analyzed the open source and told the story – which required no computational power at all.

It has also been said that had NORA been in use by the US government, 9/11 would have been prevented. Ha Ha! The whole point of my 9/11 analysis was that the government did not need mounds of data, did not need new technology, and in fact did not need any new laws to unravel this event!

Just to be clear, I am not saying better technology and better laws would not be helpful. Obviously, our government needs both. I am simply saying that according to my analysis 9/11 very possibly could have been averted without either. I attempted to make this point in my most recent paper entitled "Effective Counterterrorism and the Limited Role of Predictive Data Mining." In this paper my co-author Jim Harper of the Cato Institute and I were able to draw upon new insights revealed in the 9/11 Commission Report to more clearly describe just how effective predicate-based link analysis would have been in the context of 9/11.

One more thing: I am often asked about how false positives would have affected my 9/11 analysis had such an investigation been carried out in the real world. The relationships selected for this demonstration involved solely shared addresses, phone numbers and frequent flyer numbers. When constrained by date ranges, the number of additional parties would likely have been minimal, unless the addresses and phone numbers on the plane reservations were actually those of the travel agency (which was not revealed in open source documents). As such, I posit that the investigation would have produced a small universe of subjects and would have exposed the likes of Mohamed Atta.
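For readers who think in code, the "pull the string" idea can be sketched as a breadth-first walk that starts only from known predicates and follows only shared attributes (addresses, phone numbers, frequent flyer numbers). This is an illustrative sketch, not the original analysis, which was done by hand from open sources. The function name `expand_from_predicates`, the `max_hops` limit, and all people and values below are invented. The date-range constraint mentioned above is omitted for brevity.

```python
from collections import defaultdict, deque


def expand_from_predicates(records, predicates, max_hops=3):
    """Walk outward from predicate subjects via shared attributes.

    records: dict of person -> set of (kind, value) attributes
    Returns a dict of person -> number of hops from the nearest predicate.
    """
    # Invert the records so each attribute points at everyone who shares it.
    by_attr = defaultdict(set)
    for person, attrs in records.items():
        for attr in attrs:
            by_attr[attr].add(person)

    found = {p: 0 for p in predicates if p in records}
    queue = deque(found)
    while queue:
        person = queue.popleft()
        hops = found[person]
        if hops == max_hops:
            continue  # do not pull the string any further from here
        for attr in records[person]:
            for other in by_attr[attr]:
                if other not in found:
                    found[other] = hops + 1
                    queue.append(other)
    return found


# Entirely synthetic data: one predicate subject ("A") and four other parties.
records = {
    "A": {("address", "1 Elm St"), ("phone", "555-0100")},
    "B": {("address", "1 Elm St")},
    "C": {("phone", "555-0100"), ("frequent_flyer", "FF-42")},
    "D": {("frequent_flyer", "FF-42")},
    "E": {("address", "9 Oak Ave")},
}

print(expand_from_predicates(records, ["A"]))
```

In this toy data, B and C are reached in one hop and D in two, while E is never touched because nothing connects E to the predicate. The walk only ever inspects records reachable from the starting point, not the whole graph.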

RELATED POSTS:

Sometimes a Big Picture is Worth a 1,000 False Positives

The Six Degrees of Kevin Arbitrary

Hunting Bad Guys, Phone Records and a Few Good Dead Men

What is Data Mining? Depends Who You Ask ...

110th Congress Debates Data Mining

April 17, 2007

Streaming Analytics vs. Perpetual Analytics (Advantages of Windowless Thinking)

The terms "streaming" and "perpetual" probably sound like the same thing to most people. However, in the context of intelligent systems, I think there is a big difference.

[Note: when I use the term "observation" below, feel free to think about this as a synonym for "transaction" or "record."]

Streaming analytics involves applying transaction-level logic to real-time observations. The rules applied to these observations take into account previous observations as long as they occurred in the prescribed window – these windows have some arbitrary size (e.g., last five seconds, last 10,000 observations, etc.).

Perpetual Analytics, on the other hand, evaluates every incoming observation against ALL prior observations. There is no window size. Recognizing how the new observation relates to all prior observations enables the publishing of real-time insight (i.e., The Data Finds the Data and the Relevance Finds the User).  And another unique property is Sequence Neutrality (i.e., future observations can affect earlier outcomes).
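The difference between the two models can be shown with a toy sketch. It assumes observations are simple `(id, key)` pairs and that "related" just means "same key"; both functions and the sample stream are hypothetical. The windowed version only remembers the last `window_size` observations, while the perpetual version compares each new observation against everything seen so far. Sequence neutrality is not modeled here.

```python
from collections import defaultdict, deque


def windowed_matches(observations, window_size):
    """Streaming style: compare each observation to the last `window_size` only."""
    window = deque(maxlen=window_size)  # older observations silently fall out
    for obs_id, key in observations:
        related = [prev_id for prev_id, prev_key in window if prev_key == key]
        yield obs_id, related
        window.append((obs_id, key))


def perpetual_matches(observations):
    """Perpetual style: compare each observation to ALL prior observations."""
    seen = defaultdict(list)  # key -> ids of every prior observation with that key
    for obs_id, key in observations:
        yield obs_id, list(seen[key])
        seen[key].append(obs_id)


# Hypothetical stream: observation 4 relates to observation 1.
stream = [(1, "x"), (2, "y"), (3, "z"), (4, "x")]

print(dict(windowed_matches(stream, window_size=2)))
# {1: [], 2: [], 3: [], 4: []}
print(dict(perpetual_matches(stream)))
# {1: [], 2: [], 3: [], 4: [1]}
```

With a window of two, observation 1 has already been forgotten by the time observation 4 arrives, so the relationship is missed. The perpetual version catches it, at the cost of persisting every observation.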

Just to be fair, both streaming and perpetual analytics engines have their place in the world. For example, sometimes transactional volumes are so high … non-persistence and small window sizes are the only route.

However, when the mission is significant and transaction volumes can be managed in real-time … perpetual analytics answers these questions: "How does what I just learned relate to what I have known?", "Does this matter?" and "Who needs to know?" And if you can’t answer these questions, then your organization is likely to exhibit some degree of Enterprise Amnesia.

So how many observations per second can our current technology sustain? Recently, we achieved a new record: roughly 600 million observations ingested and contextualized in under five days. And amazingly, my team thinks they can double the performance with some more tuning!
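As a back-of-the-envelope check, the figures above can be converted into a per-second rate. The only assumption is that the run took the full five days; since it finished in under five days, the real average was somewhat higher.

```python
observations = 600_000_000
seconds_in_five_days = 5 * 24 * 60 * 60  # 432,000 seconds

rate = observations / seconds_in_five_days
print(round(rate))  # 1389 observations per second, as a lower bound on the average
```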

Another reason, by the way, that so much throughput is necessary is that historical data cannot just be bulk loaded. Constructing context from historical data involves streaming the data in. I sometimes describe this in terms of "sticking a straw into the historical data and slurping it out one observation at a time." In short, such systems must incrementally learn from the past! [Exception: if you do bulk load, then you must first crawl through the bulk loaded data to contextualize these historical observations as if they had been incrementally ingested.]

RELATED POSTS:

Accumulating Context: Now or Never

Federated Discovery vs. Persistent Context – Enterprise Intelligence Requires the Later

April 16, 2007

Cannibalism Bites: Speaking at the Risk Assessment and Horizon Scanning Symposium in Singapore

I was invited to Singapore in March 2007 to speak at their International Risk Assessment and Horizon Scanning (RAHS) symposium. While people with varied interests and expertise were invited (including such folks as John Petersen of the Arlington Institute, Admiral John Poindexter and James Surowiecki, author of "The Wisdom of Crowds"), one of the recurring themes involved systems which assist humans in considering future scenarios.

Scenario folks scare me.

Let me give you an example: the evening before my speech I was invited to a dinner function. Over dinner Petersen explained to me that one future scenario involves the super-volcano that lives under Yellowstone Park. Apparently, this volcano goes off like clockwork every 600,000 years, well … except, it is already 100,000 years overdue! This sucker is so big, that when it goes off the world may go dark for a couple years. Another person at the table chimes in and says … "We are thinking humans will have to resort to cannibalism." I’m thinking "Excuse me … I’m trying to eat dinner here!"

Then during another presentation it was mentioned that one scenario team predicted that in the year 2012 Earth will see an unscheduled 10 kiloton nuclear weapon discharged. That is five years from now. Oh goodie.

With all this in mind, when I spoke at the conference I opened with: "Scenario folks scare me." Then I proceeded to talk about another aspect of horizon scanning. As Singapore has one of the largest shipping container ports in the world, I wanted to talk about the 23 million containers a year that emerge over their horizon. More specifically, I wanted to talk about selecting which ones they should be focusing their attention on, as there are neither enough resources nor enough time to inspect all containers!

So, while there was much talk about predicting future scenarios such as health epidemics like SARS (which caught Singapore completely off-guard and took an enormous toll on their economy) … I wanted to talk about pointing laser beams at containers. With laser beams in mind, I rendered a new PowerPoint chart I titled "Targeting Austin Powers." Then, last minute, I decided not to show it because a short survey revealed that more than half the audience had never seen the Austin Powers movies.

Oh. And get this … a couple days ago I was reading a science magazine which indicated that physicists think they will finally be able to create their own black hole in 2008. How? The CERN particle accelerator in Switzerland will be operating at full power next year. I’m no math guy, but can this be a good thing? 

April 10, 2007

Ghost in the Machine?

Well, here is a weird series of thoughts that popped into my head the other day.

If you have an avatar in something like Second Life are you then the spirit/soul of the avatar?

If your avatar creates and guides another avatar, does your avatar itself become a spirit?

In the case of Second Life, is its creator Linden Research, Inc. the Supreme Being?

Is it possible we are all simply avatars of others?

Who's your daddy?