Over the last twenty-eight months I have been quietly running a skunk works effort that I’ve code named “G2.” To my delight, on January 28th, 2011 this system became officially viable and will be entering something akin to a “sea trial” phase through 2011.
I believe this system will prove to be my most innovative work to date. I also believe it is the most responsible technology I have created to date.
This new technology, something that might be characterized as a “big data analytic sensemaking” engine, is designed to make sense of new observations as they happen, fast enough to do something about it, while the transaction is still happening. This engine brings to life many of the principles I have been openly sharing on my blog, ranging from Sensemaking Systems Must be Expert Counting Systems, Data Finds Data, Context Accumulation, Sequence Neutrality and Information Colocation to new techniques to harness the Big Data/New Physics phenomenon. That said, as this is version 1.1, there remain many things to do to realize my full vision. It is a very ambitious effort, but more about that some other day.
In terms of responsible innovation, I am even more proud to report that my team and I have baked in, from conception, more privacy and civil liberties enhancing technologies than any other product I am aware of to date.
Friday, January 28th, 2011 – my official launch date – also happened to be international Data Privacy Day. On that day, internationally recognized privacy commissioner Ann Cavoukian hosted a few hundred privacy executives and practitioners from around the world in Toronto, Canada at her Privacy by Design: Time to Take Control conference. During my keynote, entitled “Confessions of an Architect,” I highlighted seven exciting Privacy by Design (PbD) features that have been baked into this new technology, specifically:
1. Full Attribution
2. Data Tethering
3. Analytics in the Anonymized Data Space
4. Tamper-Resistant Audit Logs
5. False Negative Favoring Methods
6. Self-Correcting False Positives
7. Information Transfer Accounting
The full presentation is here.
Here is a summary of the above seven PbD features:
1. FULL ATTRIBUTION: Every observation (record) needs to know from where it came and when. There cannot be merge/purge data survivorship processing whereby some observations or fields are discarded. Why is this so important?
A. If received data does not contain its data source and transaction pedigree, then system-to-system reconciliation and audit are virtually impossible, especially in large information sharing environments.
B. If the system merges and purges observations, only later to discover that the wrong observations were merged or purged, then without full attribution correcting these earlier mistakes can be difficult if not impossible. The typical alternative is periodic batch re-processing.
C. The Universal Declaration of Human Rights has four articles containing the word “arbitrary” e.g., Article 9 reads “No one shall be subjected to arbitrary arrest, detention or exile.” If you don’t know where the data came from or when, how can any resulting action be anything but arbitrary?
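To make full attribution concrete, here is a minimal Python sketch. It is an illustration only, not how G2 is implemented; the `Observation` and `Entity` names, the field layout, and the sample records are all hypothetical. The point it shows is that every observation is kept with its source and arrival time, so a mistaken merge can be undone without batch re-processing.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Observation:
    source: str            # which system of record supplied this (hypothetical)
    record_id: str         # the record's key inside that source
    received_at: datetime  # when it arrived
    fields: dict           # the values exactly as received


@dataclass
class Entity:
    """Keeps every observation; nothing is merged away or purged."""
    observations: list = field(default_factory=list)

    def add(self, obs: Observation) -> None:
        self.observations.append(obs)

    def values(self, name: str) -> list:
        """Every value seen for a field, each with its source and timestamp."""
        return [(o.fields[name], o.source, o.received_at)
                for o in self.observations if name in o.fields]

    def split(self, source: str, record_id: str) -> "Entity":
        """Undo a bad merge by moving one source record into a new entity."""
        moved = [o for o in self.observations
                 if (o.source, o.record_id) == (source, record_id)]
        self.observations = [o for o in self.observations if o not in moved]
        return Entity(moved)


person = Entity()
person.add(Observation("dmv", "D-17", datetime(2011, 1, 3, tzinfo=timezone.utc),
                       {"name": "Pat Smith", "phone": "555-0100"}))
person.add(Observation("bank", "B-42", datetime(2011, 1, 9, tzinfo=timezone.utc),
                       {"name": "Patrick Smith"}))
```

Because no name variant was discarded in favor of a "surviving" value, `person.split("bank", "B-42")` can later separate the bank record cleanly if the merge turns out to be wrong.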
2. DATA TETHERING: Adds, changes and deletes occurring in systems of record must be accounted for, in real-time, in sub-seconds. Why is this so important?
A. Data currency in information sharing environments is important, especially if one is using data to make important, difficult to reverse decisions that affect people’s freedoms or privileges.
B. When derogatory data is removed or corrected in a system of record, it is vital to reflect such corrections immediately. For example, if someone is removed from a watch list, how long should they have to wait before their name is cleared?
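A rough sketch of data tethering, assuming the system of record publishes each add, change and delete as an event keyed by source and record id. The `TetheredCopy` class and the event shape are assumptions made for illustration; the post does not describe the actual mechanism.

```python
class TetheredCopy:
    """A downstream copy that stays tethered to its systems of record."""

    def __init__(self):
        self.records = {}  # (source, record_id) -> fields

    def apply(self, op, source, record_id, fields=None):
        """Apply one add/change/delete event as soon as it arrives."""
        key = (source, record_id)
        if op in ("add", "change"):
            self.records[key] = dict(fields or {})
        elif op == "delete":
            self.records.pop(key, None)  # the correction takes effect at once
        else:
            raise ValueError(f"unknown operation: {op}")

    def is_listed(self, name):
        return any(r.get("name") == name for r in self.records.values())


copy = TetheredCopy()
copy.apply("add", "watchlist", "W-9", {"name": "Jane Doe"})
listed_before = copy.is_listed("Jane Doe")
copy.apply("delete", "watchlist", "W-9")
listed_after = copy.is_listed("Jane Doe")
```

The delete is honored by the very next lookup, which is the behavior the watch list example calls for.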
3. ANALYTICS ON ANONYMIZED DATA: The ability to perform advanced analytics (including some fuzzy matching) over cryptographically altered data means organizations can anonymize more data before information sharing. Why is this so important?
A. With every copy of data, there is an increased risk of unintended disclosure.
B. Data anonymized before transfer and anonymized at rest reduces the risk of unintended disclosure.
C. If organizations can now share information in an anonymized form and still get a materially similar result, why would organizations want to share information any other way?
[Technical Note: As every anonymized value maintains full attribution, re-identification is by design to support Data Tethering as well as reconciliation and audit.]
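One way to picture matching in the anonymized data space is a keyed one-way hash applied after normalization. This sketch swaps in HMAC-SHA256 with a shared secret as a stand-in; the post does not say which cryptographic technique is used, and the normalization here only approximates the fuzzy matching it mentions. The function names, the key, and the sample data are hypothetical.

```python
import hashlib
import hmac


def anonymize(value: str, key: bytes) -> str:
    """Keyed one-way hash of a normalized value (case and spacing ignored)."""
    normalized = " ".join(value.lower().split())
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_records(source: str, names_by_id: dict, key: bytes) -> dict:
    """Map token -> (source, record_id) so attribution travels with each token."""
    return {anonymize(name, key): (source, rid)
            for rid, name in names_by_id.items()}


def matches(a: dict, b: dict) -> list:
    """Pairs of records whose anonymized tokens agree."""
    return [(a[token], b[token]) for token in sorted(a.keys() & b.keys())]


SHARED_KEY = b"hypothetical-shared-secret"  # assumption: agreed out of band
airline = anonymize_records("airline", {"P1": "Jane  DOE", "P2": "Sam Lee"}, SHARED_KEY)
agency = anonymize_records("agency", {"W9": "jane doe"}, SHARED_KEY)
hits = matches(airline, agency)
```

Neither party sees the other's clear-text names, yet the match on the Jane Doe records is still found, and the retained `(source, record_id)` pairs let each custodian re-identify its own record when that is warranted.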
4. TAMPER-RESISTANT AUDIT LOGS: Each record of who searches for what should be logged in a tamper-resistant manner – even the database administrator should not be able to alter the evidence contained in this audit log. Why is this so important?
A. Every now and then people with access and privilege take a look at records without a legitimate business purpose, e.g., an employee at a financial services institution taking a peek into their roommate’s file.
B. Tamper-resistant logs make it possible to audit user behavior.
C. And, when the word gets out to the work force that such accountability exists, this can cause a chilling effect on misuse.
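As an illustration of the idea, a hash chain is one common way to make a log tamper-evident: each entry commits to the one before it, so editing any earlier entry breaks verification. This is an assumption about technique, not a description of the actual product, and the `AuditLog` class is hypothetical.

```python
import hashlib
import json


class AuditLog:
    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(user, query, prev):
        body = json.dumps({"user": user, "query": query, "prev": prev},
                          sort_keys=True)
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def record(self, user, query):
        """Append who searched for what, chained to the previous entry."""
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        self.entries.append({"user": user, "query": query, "prev": prev,
                             "hash": self._digest(user, query, prev)})

    def verify(self):
        """Return False if any entry was altered, removed or reordered."""
        prev = self.GENESIS
        for e in self.entries:
            expected = self._digest(e["user"], e["query"], e["prev"])
            if e["prev"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True


log = AuditLog()
log.record("teller_14", "account lookup: roommate")
log.record("teller_14", "account lookup: customer 8812")
```

A chain alone does not stop an administrator who can rewrite every entry; a real deployment would also need the latest hash anchored somewhere that administrator cannot reach.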
5. FALSE NEGATIVE FAVORING METHODS: The ability to more strongly favor false negatives is of critical importance in systems that could be used to affect someone’s civil liberties. Why is this so important?
A. In many business scenarios, it is better to miss a few things (false negatives) than inadvertently make claims that are not true (false positives). False positives can feed into decisions that adversely affect people’s lives – e.g., the police find themselves knocking down the wrong door or an innocent passenger is denied the ability to board a plane.
[Technical Note: Sometimes a new observation can lead to multiple conclusions. Systems that are not false negative favoring may select the strongest conclusion and ignore the remaining conclusions. But had the strongest candidate not existed, the second strongest conclusion would be asserted. One false negative favoring method involves remedying such a condition, for example by reversing an earlier conclusion should a future observation bring to light the fact that multiple possible conclusions now exist.]
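The method in the technical note can be sketched as a rule that asserts a match only when exactly one candidate qualifies. The scores, the 0.8 threshold, and the `resolve` function are hypothetical values chosen for illustration.

```python
def resolve(candidates: dict, threshold: float = 0.8):
    """Return the matched entity, or None when the evidence is ambiguous.

    candidates maps entity id -> match score. If more than one candidate
    clears the threshold, assert nothing rather than pick the strongest.
    """
    qualifying = [entity for entity, score in candidates.items()
                  if score >= threshold]
    return qualifying[0] if len(qualifying) == 1 else None


# First observation: only one plausible identity, so the match is asserted.
first = resolve({"entity_A": 0.95})

# A later observation reveals a second plausible identity; re-running the
# rule reverses the earlier conclusion instead of keeping the top scorer.
later = resolve({"entity_A": 0.95, "entity_B": 0.85})
```

Here `first` is `"entity_A"` and `later` is `None`: the system accepts a missed match in preference to a possibly wrong one.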
6. SELF-CORRECTING FALSE POSITIVES: With every new observation presented, prior assertions are re-evaluated to ensure they are still correct, and if no longer correct, these earlier assertions can often be repaired – in real-time, not end of month. Why is this so important?
A. False positives occur when an assertion (claim) is made, but is not true. If relied upon to make a decision, false positives can adversely affect people’s lives e.g., consider someone who cannot board a plane because he or she shares a similar name and date of birth as someone else on a watch list.
B. Without self-correcting false positives, databases start to drift from the truth and become provably wrong (even to the naked eye) – necessitating periodic (batch) reloading to true-up the database.
C. Periodic monthly reloading to correct for false positives means wrong decisions are possible all month until the next reload, even though the system had everything it needed to know beforehand.
[Technical Note: Reversing earlier assertions in real-time at scale, as new observations present themselves, is computationally non-trivial. Imagine making an assertion that two people are the same because they share exactly the same name, address and home phone number – only later to learn through another series of observations that these are really two different people (a junior and a senior). Our “self-correcting false positives” feature self-corrects for these rare cases, in real-time. We consider our ability to perform sequence neutrality at scale one of several breakthrough aspects of our work.]
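The junior/senior case can be sketched in a few lines. This toy version recomputes the assertions from all records on every call, which sidesteps the hard part (doing it incrementally, in real time, at scale); the `Resolver` class, the matching rule, and the date-of-birth field are assumptions made for illustration.

```python
from collections import defaultdict


class Resolver:
    def __init__(self):
        self.records = {}  # (source, record_id) -> accumulated fields

    def observe(self, key, fields):
        """Fold a new observation into what is known about that record."""
        self.records[key] = {**self.records.get(key, {}), **fields}

    def entities(self):
        """Re-evaluate every assertion against everything known so far."""
        groups = defaultdict(list)
        for key, f in self.records.items():
            groups[(f.get("name"), f.get("address"), f.get("phone"))].append(key)
        result = []
        for keys in groups.values():
            dobs = {self.records[k]["dob"] for k in keys if "dob" in self.records[k]}
            if len(dobs) <= 1:
                result.append(set(keys))      # no conflict: one entity
                continue
            by_dob = defaultdict(set)         # conflict: undo the merge
            for k in keys:
                dob = self.records[k].get("dob")
                if dob is None:
                    result.append({k})        # unknown: leave it unmerged
                else:
                    by_dob[dob].add(k)
            result.extend(by_dob.values())
        return result


r = Resolver()
base = {"name": "John Doe", "address": "1 Elm St", "phone": "555-0100"}
r.observe(("dmv", "1"), base)
r.observe(("bank", "7"), base)
count_before = len(r.entities())
r.observe(("dmv", "1"), {"dob": "1950-02-01"})
r.observe(("bank", "7"), {"dob": "1980-06-15"})
count_after = len(r.entities())
```

The two records start as one entity and end as two once the conflicting birth dates arrive. Because the result depends only on what is known, not on arrival order, the sketch is also sequence neutral in the small.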
7. INFORMATION TRANSFER ACCOUNTING: Every secondary transfer of data, whether to human eyeball or tertiary system, can be recorded to allow stakeholders (e.g., data custodians or the consumers themselves) to determine how their data is flowing. Why is this so important?
A. It is often cumbersome to learn who has seen what records, or what records have been shared with tertiary systems.
B. Much like a US credit report that contains an inquiries section exposing the list of recent inquiring parties, now so can your medical or financial file.
C. Users can now be easily provided with such disclosures, increasing transparency and control e.g. enabling a consumer in some cases to request an information recall.
D. When there is a series of leaks, information transfer accounting makes discovery of who accessed all records in the series quite trivial. This can narrow an investigation when looking for criminals within.
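A minimal sketch of information transfer accounting, assuming each disclosure is written to a ledger keyed by record. The `TransferLedger` class and the sample recipients are hypothetical; the sketch shows both the credit-report-style inquiries list and the leak investigation described in point D.

```python
from collections import defaultdict
from datetime import date


class TransferLedger:
    def __init__(self):
        self._by_record = defaultdict(list)  # record_id -> [(date, recipient)]

    def record_transfer(self, record_id, recipient, on):
        """Log one disclosure, whether to a person or a tertiary system."""
        self._by_record[record_id].append((on, recipient))

    def inquiries(self, record_id):
        """Who received this record and when, oldest first."""
        return sorted(self._by_record.get(record_id, []))

    def saw_all(self, record_ids):
        """Recipients that received every record in a leaked series."""
        seen = [{recipient for _, recipient in self._by_record.get(rid, [])}
                for rid in record_ids]
        return set.intersection(*seen) if seen else set()


ledger = TransferLedger()
ledger.record_transfer("chart-1", "dr_adams", date(2011, 1, 5))
ledger.record_transfer("chart-1", "billing_system", date(2011, 1, 6))
ledger.record_transfer("chart-2", "dr_adams", date(2011, 1, 7))
ledger.record_transfer("chart-2", "clerk_31", date(2011, 1, 8))
suspects = ledger.saw_all(["chart-1", "chart-2"])
```

`ledger.inquiries("chart-1")` is the disclosure a consumer could be shown, and `suspects` narrows the investigation to the parties who touched every leaked record.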
What has me most excited is that, where some of the features above would typically be extra-priced options, in my new system so many are built in (e.g., the tamper-resistant audit logs). And some of our privacy and civil liberties enhancing features cannot even be turned off!
Yes, there is an official name for my new technology. And no, I’m not telling you, because this is not a sales pitch. Rather, I am simply trying to inspire other technologists to consider Privacy by Design as they innovate.
I’ve had two great days at IBM. The first great day was in January 2005 when IBM bought my company, SRD. And the second great day came six years later on January 28th, 2011.
Smart Sensemaking Systems, First and Foremost, Must be Expert Counting Systems
Accumulating Context: Now or Never
General Purpose Sensemaking Systems and Information Colocation
Source Attribution, Don’t Leave Home Without It
Data Tethering: Managing the Echo
To Anonymize or Not Anonymize, That is the Question
Out-bound Record-level Accountability in Information Sharing Systems
Wow, kudos if you've developed a real-time big data analytic sensemaking engine with baked-in privacy and civil liberties enhancing technologies.
Posted by: Sardire | February 14, 2011 at 04:54 PM
Nice one.
None of us can outrun our memories and with work like this there's no need to.
Happy futures and all the very best.
Posted by: Weaver | February 14, 2011 at 07:42 PM
Congratulations on your efforts and contributions to PbD.
Posted by: frederik Kortbaek | February 15, 2011 at 01:19 AM
You can't give us a hook like that and then leave out any mention of the product name!
Even if you want others to adopt PbD they need to know who they'll be compared to.
Posted by: Joe Harris | February 15, 2011 at 03:37 AM
Very impressive Jeff -- certainly meshes well with my philosophy. When governance fails, innovation must prevail! -- Best, MM
Posted by: Kyield | February 15, 2011 at 06:06 AM
I have to wonder if all the PbD in the technology will actually make it less appealing/marketable? For instance, I'm sure there are big agencies that have some information about me, and that through some system, I am screened to see if I'm allowed to fly...but I don't think they want to tell me I'm being screened, or what information they have on me...nor do I necessarily want to know (although I probably should, I kind of don't want to think about the fact that I may be constantly being evaluated as to whether or not I'm a terrorist).
If such things are built in, it seems like you'd have to allow them to be turned off (thus defeating the purpose) or at least not to notify those in the system...as otherwise, how would they do their thing without constantly alerting everyone in the world...even if it was opt-in for notification, so then the badguys opt-in (and you have no way to know for sure that they're badguys, unless you then decide not to notify people that you consider bad guys)...I dunno, but as usual, a thought provoking post.
PS - I have only had one truly great day at IBM, that would be late last year when I was hired...but more to come!
PPS - Watson FTW! :)
Posted by: Ian Story | February 15, 2011 at 02:02 PM
Is this a product? Will it be commercially available, open source, etc...?
Posted by: alan huffman | February 16, 2011 at 06:59 AM
Great work with multiple applications in and out of government. It also serves to validate the importance of its individual components for those of us working in related areas. Congratulations, Jeff!
Posted by: Jeffreycarr | February 16, 2011 at 08:11 AM
Any developments regarding data erasure, or data cancellation?
Is any further progress needed considering the right of oblivion, the right to be forgotten? It could impact nearly all features, but especially 2 and 3.
Very interesting post.
Posted by: Álvaro Del Hoyo | March 23, 2011 at 07:56 AM
Given the ability to correlate large, anonymous datasets (specifically referencing the Netflix / IMDB correlation --http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf), do you think "Privacy by Design" systems can counter this attack?
Posted by: Mike Barretta | May 18, 2011 at 10:11 AM