Having designed a lot of systems over the years – more often than not the customer says they plan on performing periodic purges of historical data. This always seems logical at the time. But, it turns out once you have data it becomes hard to justify its destruction. And if anyone actually destroys data … one is at the same time eliminating any accountability whatsoever (not to mention other adverse consequences).
Data decommissioning is a double-edged sword.
After a number of personal missteps over the years, I have revised my think about data decommissioning. Today I imagine a process where accountability is maximized while the risk of unintended disclosure, misuse, and repurposing are minimized. The goal being to write accountability data into storage de-optimized for information retrieval … therefore rendering retrieval practical only for infrequent, forensic inspection. In simple terms, think paper tape, think hard copy reports, or think microfiche. Alternatively, in more sophisticated settings, I suspect immutable audit logs optimized only for investigative/forensic-specific information retrieval might be useful too. [More detail about this line of thinking available in this paper that Peter Swire and I penned on behalf of the Markle Foundation.] Obviously, at some point in time when there is no longer any reasonable expectation of information accountability, repeatability, etc. wholesale data decommissioned makes sense (burn the microfiche).
How I arrived at this revised thinking in part came about from this series of events.
Many years ago, I deployed a system designed to address a single, very specific threat. Then, several years later I concluded that long after that threat was over, the aggregated data set had probably lived on. I would not have thought twice about the privacy and civil liberties implications of this had I not started to engage in conversations with privacy advocates. Following these conversations, I decided that there are some scenarios in which data decommissioning should be "baked in."
Subsequently, with this in mind, when a pro bono opportunity to assist with a humanitarian disaster relief effort presented itself, I proposed a data destruction caveat for the contract. While the customer didn’t seem to care much one way or another, I was excited to learn the customer agreed to the wholesale destruction of the aggregated data set upon project closure. And delete it all we did.
A small victory for privacy it seemed – that is, until a few years later when I realized that I could no longer prove what was done, right or wrong. In fact, had there been any after-the-fact disputes about incorrect action taken based on the recommendations of the technology, I would have had to say, "We destroyed the evidence!"
In summary, when designing systems which require strong audit, accountability and repeatability processes … very careful consideration must be given to delete processes.
Deeper Technical Points:
1. Much like the challenges that come with processing deletes, record changes can have the same issues. This occurs when a system overwrites changes rather than keeping each incremental record state and its temporal relevance. When overwriting changes – one is deleting previous values; it is this de facto deletion that compromises audit and accountability processes.
2. A further complicating factor is that not all changes are the same. Some changes are corrections, i.e., the earlier value was incorrect, e.g., wrong driver’s license number or a missing apartment number in an address. Another type of change is one where a value supersedes a previous value, e.g., when recording a married name, new email address, or new cell phone number. Further complicating matters, most systems of record do not have a mechanism to capture the difference between corrections and updates – forcing system designers to make some assumptions.
3. When synchronizing data across information sharing environments, propagating deletes through this ecosystem forces each receiving party into this same accountability dilemma.
1. When data actually does get purged it is often prompted by a forcing-function. The three purge scenarios I have seen are: (a) all the ancient history is compromising performance; (b) there is no interest in paying for more storage; and (c) "oops - we shouldn’t have been collecting that!"
2. With all the countless copies of data being made, how can one be sure it is ever all deleted anyway?