I see a future in which organizations planning to transfer sensitive information from one system of record to some other destination will first ask themselves: "Can our data be shared in an anonymized form while achieving materially similar results to what we would get had the data been transferred in clear text?" And if the answer is "yes," I would then ask: "Why would that organization ever share that sensitive information any other way?"
A new class of technology, "Analytics in the Anonymized Data Space," is making this possible. With this type of technology, information can be anonymized before being transferred between parties while still permitting sophisticated analysis, even though the data is in a non-human-readable and irreversible (i.e., anonymized) form.
I think this will become a best practice. When? I don't know; maybe two years, maybe five, maybe even twenty, but someday for sure. It will start with early adopters (this is already beginning to happen), its use will grow, and at some point anonymization-based analytics will achieve critical mass. Thereafter, anonymization will likely be viewed as a best practice. From that moment on, if an organization is not handling its data in such a manner, I would submit it could be considered negligent.
Here is an anonymization scenario:
To stay competitive, banks must understand their customers at least as well as their competition does. So banks send their customer information to data aggregators. The data aggregators then match the bank's customer data with their private collections of demographic data (e.g., marital status) and lifestyle data (e.g., magazine subscriptions). This information is appended to the original file and returned to the bank (hence this practice is often called "database marketing appends"). The bank then uses this new information to profile its customers, applying the newfound knowledge to improve its customer acquisition and retention programs.
But transferring all customer data to a secondary party causes organizational heartburn. In the example above, the bank's management recognizes that sending customer data to another party comes with risk: What if an employee at the data aggregator makes an illegal copy of the customer file and secretly sells it? What if a hacker breaks into the data aggregator's systems and extracts all or portions of the bank's customer file? What if an employee at the aggregator uses the bank's customer file to answer very specific questions from "outsiders" about specific people? What if the aggregator quietly retains portions of the bank's customer file for later use in unanticipated ways?
As gut-wrenching as these risks are, most banks find themselves doing this anyway in an effort to remain competitive.
Emerging innovations that enable advanced analytics to be performed on encrypted or anonymized data will let the bank pass non-human-readable customer data to the data aggregator. The data aggregator will then match the bank's anonymized customer data with its own records, while the bank's customer records remain anonymized! The demographic and lifestyle data would then be passed back to the bank keyed to a non-personally-identifying value (e.g., a customer number).
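To make that flow concrete, here is a minimal sketch. The post does not name a specific anonymization function, so this assumes a keyed hash (HMAC-SHA256) over normalized identity fields; the shared secret, field layout, and records are all hypothetical:

```python
import hmac
import hashlib

SHARED_SECRET = b"key-agreed-by-bank-and-aggregator"  # hypothetical

def anonymize(identity: str) -> str:
    """One-way transform: the same normalized input always yields the same token."""
    return hmac.new(SHARED_SECRET, identity.lower().encode(), hashlib.sha256).hexdigest()

# Bank side: send tokens plus an opaque customer number, never names or addresses.
bank_customers = {"C-1001": "jane doe|123 elm st", "C-1002": "john roe|9 oak ave"}
outbound = {anonymize(pii): cust_no for cust_no, pii in bank_customers.items()}

# Aggregator side: its own file, anonymized with the same agreed transform.
aggregator_file = {anonymize("jane doe|123 elm st"): {"marital_status": "married"}}

# Match on tokens; demographics come back keyed only by customer number.
appended = {outbound[tok]: attrs for tok, attrs in aggregator_file.items() if tok in outbound}
print(appended)  # {'C-1001': {'marital_status': 'married'}}
```

The aggregator only ever sees tokens and customer numbers; the match still works because both sides apply the same transform to the same normalized fields.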
What is gained? In short, if the data is stolen by a hacker or an agent of the aggregator, the thief learns nothing useful. A corrupt employee at the data aggregator cannot peruse the customer file for selected information. And the aggregator does not learn anything new, like an address or phone number that the bank knew but the aggregator did not.
What are the risks? Well, there are lots of risks, especially in this simplified embodiment (e.g., something called a dictionary attack). But the basic principle is this: if one is going to share the information in clear text anyway, then even this simple model reduces, to some degree, the risk of unintended disclosure.
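To see why the simple model is vulnerable, here is the dictionary attack in miniature, assuming the naive case of an unkeyed hash over a guessable field (all names are hypothetical):

```python
import hashlib

def naive_anonymize(value: str) -> str:
    return hashlib.sha256(value.lower().encode()).hexdigest()

stolen_token = naive_anonymize("jane doe")  # what the attacker got hold of

# The attacker hashes candidates from public lists (phone books, voter rolls)
# until one matches; the hash is never reversed.
for candidate in ["john roe", "jane doe", "alex poe"]:
    if naive_anonymize(candidate) == stolen_token:
        print("re-identified:", candidate)  # re-identified: jane doe
```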
Luckily, there are a variety of cryptographic and architectural extensions one can use to harden this information-sharing model against many different kinds of attacks. [Techie interjection: Commutative encryption, for example, makes it more difficult for any one party to dictionary-attack the anonymized values.]
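As a toy illustration of that commutative property, here is a sketch using SRA-style exponentiation modulo a prime (one well-known commutative scheme); the prime and keys are illustrative, not production-grade:

```python
from math import gcd

P = 2**61 - 1  # a Mersenne prime; real deployments would use far larger moduli

def keygen(seed: int) -> int:
    """Pick an exponent coprime to P-1 so decryption exists (toy derivation)."""
    e = seed
    while gcd(e, P - 1) != 1:
        e += 1
    return e

def encrypt(m: int, e: int) -> int:
    return pow(m, e, P)

key_bank, key_aggregator = keygen(65537), keygen(90001)
m = 123456789  # an identity already encoded as an integer < P

# Order of application does not matter: E_a(E_b(m)) == E_b(E_a(m)).
assert encrypt(encrypt(m, key_bank), key_aggregator) == \
       encrypt(encrypt(m, key_aggregator), key_bank)
print("commutative match token:", encrypt(encrypt(m, key_bank), key_aggregator))
```

Because each party contributes its own secret exponent, neither side alone can compute the final token for a guessed value, which is what blunts the dictionary attack.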
[Another technical note: Anonymization systems that prevent any possible re-identification (e.g., that keep no pointers to the original record) come with additional risks, like the inability to fully audit the system and the inability to correctly process deletions. This being the case, I think certain classes of anonymization-based systems must include Source Attribution and Data Tethering, in which case the original holder of the data controls whether any re-identification is permitted within law and policy.]
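A hedged sketch of what that could look like: the bank keeps the token-to-record map private, so it alone can attribute a token back to a source record and can propagate deletions downstream. All structures here are hypothetical illustrations, not any particular product's design:

```python
import hmac
import hashlib

SECRET = b"bank-only-key"  # held by the bank alone, never shared

def tokenize(pii: str) -> str:
    return hmac.new(SECRET, pii.encode(), hashlib.sha256).hexdigest()

bank_records = {"R-1": "jane doe|123 elm st"}

# Source attribution: the token-to-record map exists only at the bank,
# so re-identification is possible there, and nowhere else.
tether = {tokenize(pii): record_id for record_id, pii in bank_records.items()}

# Data tethering: deleting the source record produces the same token,
# which can be forwarded so downstream copies are deleted too.
def delete_record(record_id: str) -> str:
    pii = bank_records.pop(record_id)
    return tokenize(pii)

delete_token = delete_record("R-1")      # forwarded to the aggregator
assert tether[delete_token] == "R-1"     # the bank can still attribute the token
```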
COMMENTS:
The other consideration here is to pass analytic insight around instead of data. This has been the basis for the scoring industry for years, but sometimes companies could do this for themselves.
That said, I do agree that anonymizing is going to become a best practice sooner rather than later.
Posted by: James Taylor | February 20, 2007 at 10:57 AM
It's particularly interesting when the aggregator provides anonymized analytics to the bank's customers, not the bank:
http://blog.wesabe.com/index.php/2007/02/23/safeguarding-your-data-the-privacy-wall/
Posted by: Dan Sickles | February 23, 2007 at 11:48 PM
IMHO the name you've started out with for this technology, 'Anonymization,' is way too wonky to be understood as valuable by Joe Average User.
Anonymity is a term that is loaded with all kinds of associations, some of them negative. Better to use a simple descriptive term or phrase. Something like "Personal Protection Layer," while vague, at least is clearly positive and presents itself as something unambiguously beneficial. Or maybe "Privacy Protection System."
You may be able to come up with something better than these. Improving on "Anonymizing Technology" should be easy.
Posted by: Natch | April 09, 2007 at 09:17 AM