
September 21, 2006

Data Tethering: Managing the Echo

When one system transfers data to another system, what happens when the original data changes? Or what if the policies governing the original piece of data simply change after it has been transferred?

Say someone who manages a government watch list sends their watch list to a secondary organization. Later someone is cleared (removed) from the list. What assurances can be made that the cleared individual will also be removed from the secondary organization’s watch list? Now imagine how complicated this can be if the recipient of a watch list then re-distributes (cascading) the watch list to tertiary organizations and so on.

Some organizations sell their customer lists to secondary organizations, e.g., to a marketing alliance partner. What if one of the customers requests that their name and address not be sold, and what if they ask for their information to be redacted from secondary sources where transfers have already occurred? Guess what? Bad news. Most organizations don’t even know what customer records were transferred (at least at the customer level) as they likely only know what extract criteria were used on what date and the total record count.
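To make the gap concrete, here is a minimal sketch of a record-level transfer log, the kind of bookkeeping that would let an organization answer "who has this customer's record?" The class, method names, partner names and customer IDs are all hypothetical and only for illustration.

```python
from collections import defaultdict
from datetime import date


class TransferLedger:
    """Hypothetical record-level log of which customer went to which recipient."""

    def __init__(self):
        # customer_id -> list of (recipient, transfer_date)
        self._by_customer = defaultdict(list)

    def record_transfer(self, recipient, customer_ids, when):
        """Log every customer in an extract, not just the criteria and the count."""
        for customer_id in customer_ids:
            self._by_customer[customer_id].append((recipient, when))

    def recipients_of(self, customer_id):
        """Return the sorted, de-duplicated recipients holding this customer's record."""
        entries = self._by_customer.get(customer_id, [])
        return sorted({recipient for recipient, _ in entries})


ledger = TransferLedger()
ledger.record_transfer("PartnerA", ["c1", "c2", "c3"], date(2006, 9, 1))
ledger.record_transfer("PartnerB", ["c2"], date(2006, 9, 15))

# Customer c2 asks to be redacted: these are the partners that must be notified.
print(ledger.recipients_of("c2"))  # ['PartnerA', 'PartnerB']
```

With only extract criteria, a date and a record count, the last lookup is not possible without re-running the original extract against data that may have changed since.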

Data tethering means that when data changes at its source, the change is reflected through the entire food chain. Every copied piece of data is virtually “tethered” to its master copy. Non-tethered systems carry stale, incorrect data until the next database reload, and the greater the window between database refreshes, the greater the error rate.
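One way to picture tethering is as a chain of holders, each of which forwards every change to whoever it passed data to. The sketch below is a toy model under that assumption, not a description of any real system: the Node class and the record IDs are made up, and it assumes transfers never loop back on themselves.

```python
class Node:
    """One holder of a copy of the list (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.records = {}      # record_id -> value
        self.subscribers = []  # downstream nodes that received our data

    def transfer_to(self, other):
        """Copy current records downstream and keep the tether for later changes."""
        other.records.update(self.records)
        self.subscribers.append(other)

    def upsert(self, record_id, value):
        """Add or change a record here, then push the change down every tether."""
        self.records[record_id] = value
        for node in self.subscribers:
            node.upsert(record_id, value)

    def remove(self, record_id):
        """Remove a record here, then push the removal down every tether."""
        self.records.pop(record_id, None)
        for node in self.subscribers:
            node.remove(record_id)


source = Node("source")
secondary = Node("secondary")
tertiary = Node("tertiary")

source.upsert("person-17", "watch")
source.transfer_to(secondary)
secondary.transfer_to(tertiary)

# The person is cleared at the source; the removal cascades to every copy.
source.remove("person-17")
print("person-17" in tertiary.records)  # False
```

In a non-tethered setup, transfer_to would be a one-time copy with no subscriber list, and the tertiary copy would keep the cleared person until its next reload.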

Non-tethered systems in national security and law enforcement settings are problematic as there can be real privacy and civil liberties consequences resulting from organizations operating on incorrect data points. And a resource waste to boot.

Data tethering is an important design element when thinking about responsible innovations.

[Miscellaneous note: From a manageability perspective, there may be some reasonable number of cascading data transfers. For example, in some settings like watch lists it may be ideal to mandate no more than two tiers of transfer (e.g., Source A transfers to B and C, then B re-transfers to X, Y and Z). Maybe public records are stipulated for a three-tier maximum. The point being if there are too many tiers, it will not be possible to ensure currency and accuracy across the network.]
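A tier limit like this is easy to check if the transfers themselves are recorded. Here is a rough sketch that counts tiers in a transfer map and tests it against a maximum; the function names and the dictionary layout are my own assumptions, and the map is assumed to contain no cycles.

```python
def transfer_depth(transfers, source):
    """Return the number of transfer tiers reachable from source.

    transfers maps a holder to the holders it sent data to (assumed acyclic).
    """
    downstream = transfers.get(source, [])
    if not downstream:
        return 0
    return 1 + max(transfer_depth(transfers, node) for node in downstream)


def within_tier_limit(transfers, source, max_tiers):
    """True if no chain of transfers from source exceeds max_tiers."""
    return transfer_depth(transfers, source) <= max_tiers


# Source A transfers to B and C, then B re-transfers to X, Y and Z: two tiers.
transfers = {"A": ["B", "C"], "B": ["X", "Y", "Z"]}
print(transfer_depth(transfers, "A"))        # 2
print(within_tier_limit(transfers, "A", 2))  # True

# If X passes the list on again, the two-tier mandate is broken.
transfers["X"] = ["Q"]
print(within_tier_limit(transfers, "A", 2))  # False
```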

Related Posts: Responsible Innovation: Designing for Human Rights

September 19, 2006

Athletics Update: Alcatraz Swim and Malibu Triathlon

Last Sunday my 15-year-old son and I (pic) swam from Alcatraz to San Francisco (pic).  About 800 people jumped in the water at what is called the Alcatraz Sharkfest Swim.  Many gave up and went home in shame.  Not us … we actually finished.  And my son was nice enough to stay with me as he clearly could have left me in the dust.  (I should never have let him get involved in competitive swimming when he was younger.)

After the Alcatraz race we flew to Los Angeles.  Sunday, the next morning, I did the Malibu Triathlon with my friend Joe.  To my surprise, I did my fastest race ever, finishing 13th in my age group of 119 athletes.  Makes me wonder what is possible if I really trained.  Nope, not me … work is my number one hobby!

September 11, 2006

9/11 … Five Years Ago Today

I was in New York on September 11th, 2001. 

At 8:30am I left the Marriott Hotel at the World Financial Center heading for a meeting in upper Manhattan. As we drove past the World Trade Center I asked the cab driver if I had time to run up for a quick look from the observation deck and still make my 9am meeting.

The cab driver said "No."

Minutes later … the rest is history.

Oddly, Friday the week before, I was in a New York meeting with the head of security of a large financial services company. Behind his desk, no kidding, was a picture of Osama bin Laden. When asked about the picture he said, "This is who I fear most."

September 08, 2006

What is Data Mining? Depends Who You Ask ...

Everyone has their own definition of data mining. My favorite is this one I heard at the ACM SIGKDD data mining and knowledge discovery conference a few weeks ago, specifically:

Data Mining, noun 1. Torturing the data until it confesses … and if you torture it enough, you can get it to confess to anything.

Here are some far less humorous definitions:

The Government Accountability Office produced the following definition for data mining:

"The application of database technology and techniques—such as statistical analysis and modeling—to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results."

The Congressional Research Service has defined data mining as:

"Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data, it also includes analysis and prediction."

The Internet’s popular Wikipedia site defines data mining as:

"Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It is a fairly recent topic in computer science but applies many older computational techniques from statistics, information retrieval, machine learning and pattern recognition."

Mary DeRosa at the Center for Strategic and International Studies (CSIS) published a report on data mining citing a presentation by David Jensen at the CSIS Data Mining Roundtable on July 23, 2003. In "Data Mining in Networks," David Jensen defined data mining as follows:

"'Data mining' ... has a relatively narrow meaning: it is a process that uses algorithms to discover predictive patterns in data sets."

Kim Taipale at the Center for Advanced Studies in Science and Technology Policy has defined data mining this way:

"The combination of mathematics, statistics, economics, political science, cultural anthropology, sociology, psychology, psychiatry, neuroscience, and other social sciences with computer science techniques such as federated search and retrieval, visualization, knowledge extraction, modeling, and simulation — together referred to expansively for policy purposes as "data mining" — enable the development and application of nonlinear, nondeterministic theories and models of complex human phenomena at all scales to social governance and control problems, including law enforcement and national security."

In a soon-to-be-published paper, Jim Harper and I have opted for the following definition:

"Data mining is the process of searching data for previously unknown patterns and often using these patterns to predict future outcomes."

The Department of Defense's TAPAC (Technology and Privacy Advisory Committee) Report defined data mining as:

We define 'data mining' to mean "searches of one or more electronic databases of information concerning U.S. persons by or on behalf of an agency or employee of the government."

The TAPAC definition is certainly the broadest. Under this definition, when a doctor at the Veterans Administration searches for a specific patient record (e.g., by name and date of birth) – this would constitute a data mining activity.

And new definitions of data mining will surely continue to appear – some more rational than others. For example, here is a definition pending on Capitol Hill, in an amendment submitted by Senator Feingold to H.R. 5441:

DATA-MINING.-The term "data-mining" means a query or search or other analysis of 1 or more electronic databases, whereas-

(A) at least 1 of the databases was obtained from or remains under the control of a non-Federal entity, or the information was acquired initially by another department or agency of the Federal Government for purposes other than intelligence or law enforcement;

(B) a department or agency of the Federal Government or a non-Federal entity acting on behalf of the Federal Government is conducting the query or search or other analysis to find a predictive pattern indicating terrorist or criminal activity; and

(C) the search does not use a specific individual's personal identifiers to acquire information concerning that individual.

Maybe I’m confused. But per this definition it would appear that analysis to find predictive patterns not related to terrorist or criminal activity would not be considered data mining. Also, if the data is owned entirely by the intelligence or law enforcement community, no form of analysis could be construed as data mining.
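Here is how I read the three clauses, written out as a simple test. This is only my rough paraphrase, not legal text: the field names are mine, clause (B) is reduced to its "terrorist or criminal pattern" part, and all three conditions must hold at once.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    # Field names are illustrative paraphrases, not wording from the amendment.
    uses_non_federal_or_repurposed_data: bool  # clause (A)
    seeks_terror_or_crime_pattern: bool        # clause (B), simplified
    uses_individual_identifiers: bool          # clause (C) requires this to be False


def is_data_mining(a: Analysis) -> bool:
    """True only when clauses (A), (B) and (C) are all satisfied."""
    return (
        a.uses_non_federal_or_repurposed_data
        and a.seeks_terror_or_crime_pattern
        and not a.uses_individual_identifiers
    )


# Predictive pattern analysis unrelated to terrorism or crime: fails clause (B).
unrelated_pattern = Analysis(True, False, False)

# Data held entirely by intelligence or law enforcement: fails clause (A).
in_house_data = Analysis(False, True, False)

# Pattern search for criminal activity over commercially obtained data.
commercial_pattern = Analysis(True, True, False)

print(is_data_mining(unrelated_pattern))   # False
print(is_data_mining(in_house_data))       # False
print(is_data_mining(commercial_pattern))  # True
```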

Here are links to the above:

GAO Reference: Data Mining: Federal Efforts Cover a Wide Range of Uses

CRS Reference: Data Mining and Homeland Security: An Overview

Wikipedia Reference: Wikipedia

CSIS Reference: CSIS Report, Data Mining and Data Analysis for Counterterrorism

Taipale Reference: From Data Mining to Computational Social Science for Counterterrorism

TAPAC Reference: Report of the Technology and Privacy Advisory Committee, "Safeguarding Privacy in the Fight Against Terrorism"

H.R. 5441 Reference: Homeland Security Appropriations Bill, amendment 4569