This blog entry is dedicated to false positives and false negatives, specifically why it is so essential that systems find and fix them. A false negative occurs when something true is missed, e.g., the true perpetrator of a crime is overlooked. A false positive occurs when something is asserted to be true when, in fact, it is not, e.g., when a court convicts someone of a crime he/she did not commit.
Imagine for a moment a DNA database. What if a subject’s second DNA profile does not match the subject’s previously collected DNA profile? Would anyone notice this false negative? Or, what if a DNA profile from a new criminal investigation is submitted to the DNA database and it matches someone who has been in prison for 15 years. Is this proof of a false positive?
How could such errors happen? Contamination? Incompetence? Criminal negligence? Regardless, such tragic things really do happen.
“It did not catch the mistake in Jackson's case, however, because the mistake was not merely a mislabel -- it was the wrong DNA in the wrong vial.”
“… acknowledging that the police lab accidentally had placed Sotolusson's name on another man's DNA sample.”
“… revelations DNA evidence that might have cleared the men never made it to defense attorneys.”
Now the million dollar question: If a DNA database like the Combined DNA Index System (CODIS) were to get new evidence that cast doubt on an earlier assertion, could the system itself detect it? Would anyone be notified? Better yet, would the defendant be notified?
Systems with the ability to use new observations to reverse earlier assertions are able to self-correct false negatives and self-correct false positives – in essence, changing their minds about the past. Yes … smart systems flip flop.
Traditional systems are generally unable to use new observations to reverse earlier assertions and thus end up with internal inconsistencies – the evidence exists right before your eyes (in the database) while an errant assertion from the past lives on. Some call this “database drift.”
One common remedy to database drift involves periodically re-processing all the data, which raises two questions. First, how long do you want to wait for the right answer when the correct answer is already in hand, right before your eyes? Second, how long is the reprocessing itself going to take? In very large databases, it can take a very long time.
I have been extremely interested in how to prevent this type of database drift – detecting previous false negatives and false positives at the split second each new observation arrives – doing this over billions of rows of historical data while sustaining thousands of transactions a second.
In essence, this kind of algorithm works like this: Now that I know this, had I known this first, are there any assertions I have made that would have been made differently – in real-time at high transaction rates over ALL historical data?
From a privacy and civil liberties perspective, systems incapable of correcting previous false negatives and false positives are problematic, e.g., an innocent person remains harmed (incarcerated) despite the fact that late-arriving evidence clears his/her name.
Fortunately for some wrongly convicted, the Innocence Project is addressing this problem. This organization uses DNA evidence to exonerate the wrongfully convicted. They have nearly 300 examples (like Michael Morton) of folks wrongly imprisoned – in some cases these innocent folks have been incarcerated for many decades. This amazing group does this by hand, it takes a lot of work, and they have an enormous backlog of cases worth reviewing.
I encourage those working on identity-based assertion systems to add self-correcting false negatives and self-correcting false positives to your systems.
Technical Note #1: A Mini-Demonstration of a Self-Correcting False Positive
I first stumbled across the need for self-correcting false positives about a decade ago when we discovered a particular data set in which three out of every eight million people, a Patrick and a Patricia, lived at the same address with the same last name.
Imagine what might happen in this scenario. You already have this one record in your database:
Record 1, Patrick Smith, Male, 123 Main Street, 555-1212
And today you get Record 2 containing:
Pat Smith, 123 Main Street, 555-1212 and date of birth 03/17/1986
Let’s say for the sake of argument that a matching name, address and phone number are deemed sufficient evidence to assert that two records describe the same entity (not always a great idea, by the way). The result is that both records are asserted to be the same person:
Record 1, Patrick Smith, Male, 123 Main Street, 555-1212
Record 2, Pat Smith, 123 Main Street, 555-1212 and date of birth 03/17/1986
This is fine and dandy and all … until Record 3 appears on the scene:
Patricia Smith, Female, 123 Main Street, 555-1212, and date of birth 03/17/1986
At this moment, one has an opportunity to recognize the earlier error: the Pat Smith record is now most likely to be Patricia, NOT Patrick. At this split second, a system capable of self-correcting false positives fixes the past, replacing it with these new assertions (Record 1 stands alone as Patrick, while Records 2 and 3 are now asserted to be the same person, Patricia):
Record 1, Patrick Smith, Male, 123 Main Street, 555-1212
Record 2, Pat Smith, 123 Main Street, 555-1212, and date of birth 03/17/1986
Record 3, Patricia Smith, Female, 123 Main Street, 555-1212, and date of birth 03/17/1986
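The Pat Smith scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the Record fields, the tiny NICKNAMES table, the scoring rule (three points for the shared last name, address and phone, plus one for each additional agreeing attribute) and the observe/resolve helpers are all hypothetical. The idea is that every new record triggers a fresh resolution of the records it could relate to, so an earlier assertion can be reversed the moment better evidence arrives.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional

# Hypothetical nickname table: "Pat" is compatible with both full names.
NICKNAMES = {"pat": {"patrick", "patricia"}}

@dataclass(frozen=True)
class Record:
    rec_id: int
    first: str
    last: str
    address: str
    phone: str
    gender: Optional[str] = None
    dob: Optional[str] = None

def names_compatible(a, b):
    a, b = a.lower(), b.lower()
    return a == b or a in NICKNAMES.get(b, set()) or b in NICKNAMES.get(a, set())

def score(a, b):
    """Return None when two records conflict, else a count of agreeing features."""
    if not names_compatible(a.first, b.first):
        return None
    if (a.last, a.address, a.phone) != (b.last, b.address, b.phone):
        return None
    bonus = 0
    for field in ("gender", "dob"):
        x, y = getattr(a, field), getattr(b, field)
        if x and y and x != y:
            return None          # both known and different: a conflict
        if x and x == y:
            bonus += 1           # both known and equal: extra evidence
    return 3 + bonus             # last name + address + phone, plus extras

def resolve(records):
    """Group records into entities, strongest evidence first, never merging conflicts."""
    by_id = {r.rec_id: r for r in records}
    cluster_of = {r.rec_id: {r.rec_id} for r in records}
    pairs = [(score(a, b), a.rec_id, b.rec_id) for a, b in combinations(records, 2)]
    pairs = [p for p in pairs if p[0] is not None]
    for _, i, j in sorted(pairs, key=lambda p: -p[0]):
        ci, cj = cluster_of[i], cluster_of[j]
        if ci is cj:
            continue
        # Merge only if every record in one group is compatible with every record in the other.
        if all(score(by_id[x], by_id[y]) is not None for x in ci for y in cj):
            merged = ci | cj
            for rid in merged:
                cluster_of[rid] = merged
    return sorted({tuple(sorted(c)) for c in cluster_of.values()})

def observe(store, record):
    """Add a record, then re-resolve only the records sharing its address and phone."""
    store.append(record)
    neighbors = [r for r in store
                 if (r.address, r.phone) == (record.address, record.phone)]
    return resolve(neighbors)

store = []
observe(store, Record(1, "Patrick", "Smith", "123 Main Street", "555-1212", gender="Male"))
print(observe(store, Record(2, "Pat", "Smith", "123 Main Street", "555-1212",
                            dob="03/17/1986")))   # [(1, 2)]
print(observe(store, Record(3, "Patricia", "Smith", "123 Main Street", "555-1212",
                            gender="Female", dob="03/17/1986")))   # [(1,), (2, 3)]
```

After Record 2 arrives, Records 1 and 2 are one entity. After Record 3 arrives, the shared date of birth makes Record 3 the stronger partner for Record 2, and the gender conflict keeps Record 1 out of that group, so the earlier assertion is undone. Re-resolving only the records that share an address and phone is one assumed way of keeping the work local to the new observation rather than reprocessing everything.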
Technical Note #2: A More In-depth Look at CODIS Processes
Over the course of my research, I was fortunate to come across background information about CODIS. Some may find this interesting or helpful.
CODIS has three tiers: LDIS (local lab databases), SDIS (statewide databases), and NDIS (the national USA database). Selected LDIS records roll up to SDIS and selected SDIS records roll up to NDIS.
The criteria to enter the LDIS tier are the least stringent while the criteria to reach the NDIS level are the most stringent. A profile in the NDIS database will have come from an SDIS database. And entries in an SDIS database may have come from an LDIS database below it, possibly varying by each state’s lab structure/organization.
Each tier (LDIS, SDIS, NDIS) is broken down further into several indices: Forensic, Convicted Offender, Arrestees (depending on state legislation), Missing Persons, Relatives of Missing Persons, and Unidentified Human Remains. Some LDIS/SDIS databases also have a Volunteer and/or Suspect Index (depending on respective local/state legislation).
DNA Profile Submission
Each sample’s profile is generally only eligible for submission to a single index. If a forensic profile matches an individual and is uploaded to the forensic index, that individual’s reference profile can’t simply be added to the convicted offender index later after conviction. Also, the reference sample collected for comparison to the evidence profile cannot be retested to re-obtain the individual’s profile for convicted offender submission either—an entirely new sample is collected upon conviction for submission to the convicted offender index.
The only information contained in each of these databases is:
a) the DNA profile
b) the lab submitting the profile
c) the lab employee responsible for submitting the profile
d) some numerical identifier for the submitting lab/employee to track the profile
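To make the point concrete, here is a minimal Python sketch of what such an entry might look like. The class name, field names and sample values are hypothetical (the real CODIS schema is not described here); the only thing taken from the list above is that there are four kinds of information and nothing that names the person.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class CodisEntry:
    dna_profile: dict      # a) the DNA profile, e.g. locus name -> pair of allele values
    submitting_lab: str    # b) the lab submitting the profile
    submitted_by: str      # c) the lab employee responsible for the submission
    tracking_id: str       # d) identifier the lab uses to track the profile

# Made-up sample values, for illustration only.
entry = CodisEntry({"locus_1": (12, 14)}, "Lab A", "analyst-7", "A-000123")
print([f.name for f in fields(entry)])
```

Note what is absent: no name, no date of birth, no case details. Anything that ties the profile back to a person lives with the submitting lab, not in the database.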
CODIS Search Procedures
Only certain index combinations are compared against each other. For example, the Relatives of Missing Persons Index is only compared to the Unidentified Human Remains Index while the Convicted Offender Index is compared against all indices except the Relatives of Missing Persons Index.
- If a profile only meets LDIS criteria, it resides in the LDIS database and is only compared within the appropriate indices to other LDIS profiles (profiles belonging to the individual laboratory)
- For example, if a profile meets Miami’s LDIS upload criteria, but not Florida’s SDIS criteria, the profile would simply reside in and be compared against the appropriate indices within Miami’s LDIS.
- If the profile meets SDIS criteria, it is compared within the appropriate indices to other profiles in both the respective LDIS and SDIS databases but not outside LDIS databases within the state. Put differently, a profile that has reached SDIS will not be compared against other states’ LDIS or SDIS profiles.
- For example, if a profile meets both Miami’s LDIS and Florida’s SDIS upload criteria, but not NDIS’s, the profile would reside in both the LDIS and SDIS and be compared against the appropriate indices at both levels. It would not, however, be compared against any profiles contained in NDIS, nor would it be compared against the LDIS and SDIS of other states, say New Orleans’s LDIS and Louisiana’s SDIS.
- If the profile meets NDIS criteria, it is compared to other profiles within the appropriate indices in the respective LDIS, SDIS, and NDIS databases, but not outside SDIS/LDIS databases
- For example, if a profile meets Miami’s LDIS, Florida’s SDIS, and NDIS upload criteria, it would reside at all three levels and be compared against appropriate indices at each. It would not, however, be compared against profiles residing only in other states’ LDIS or SDIS systems.
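The tier rules above can be sketched as a small Python check. This is one reading of the rules, not a description of the CODIS software: it assumes every profile resides in its own lab's LDIS (as in the Miami examples), that two profiles can only be compared when they reside in at least one common database, and it ignores the index pairing rules entirely. The Profile class and the residence and may_be_compared helpers are hypothetical names.

```python
from dataclasses import dataclass

TIERS = ("LDIS", "SDIS", "NDIS")

@dataclass(frozen=True)
class Profile:
    lab: str            # e.g. "Miami"
    state: str          # e.g. "Florida"
    highest_tier: str   # the most stringent tier whose upload criteria it meets

def residence(p):
    """Return the set of databases the profile resides in, as (tier, owner) pairs."""
    if p.highest_tier not in TIERS:
        raise ValueError(f"unknown tier: {p.highest_tier}")
    dbs = {("LDIS", p.lab)}
    if p.highest_tier in ("SDIS", "NDIS"):
        dbs.add(("SDIS", p.state))
    if p.highest_tier == "NDIS":
        dbs.add(("NDIS", "USA"))
    return dbs

def may_be_compared(a, b):
    """Two profiles can meet only if they share at least one database."""
    return bool(residence(a) & residence(b))

miami_sdis = Profile("Miami", "Florida", "SDIS")
miami_ndis = Profile("Miami", "Florida", "NDIS")
nola_ldis = Profile("New Orleans", "Louisiana", "LDIS")
nola_sdis = Profile("New Orleans", "Louisiana", "SDIS")
nola_ndis = Profile("New Orleans", "Louisiana", "NDIS")

print(may_be_compared(miami_sdis, nola_sdis))   # False: different states, neither in NDIS
print(may_be_compared(miami_ndis, nola_ldis))   # False: the other profile never left its LDIS
print(may_be_compared(miami_ndis, nola_ndis))   # True: both reside in NDIS
```

A full model would also need the index compatibility rules before declaring that two profiles will actually be searched against each other.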
CODIS Matches and DNA Identification Errors
The CODIS software performs comparisons between the appropriate indices/tiers of the databases and then reports back any matching profiles with their associated identifiers to the submitting lab(s). Then the lab(s) must follow up on their end to retest their matching sample(s) and report back the confirmation results to each other (the exception being that forensic profiles are NOT re-tested, to avoid consumption of evidence). If the source (DNA donor) of the matching DNA profiles is known, this information is exchanged between the individual submitting lab(s), and it is at this point that a “false-positive” match can be discovered.
Basically, this type of DNA identification error can only be caught after a match has occurred, meaning the falsely matching profiles have to wind up in an index/tier where they will be compared to each other. An erroneous LDIS profile residing in the SDIS database will not be detected if the falsely matching profile resides in the NDIS database, a different SDIS database, or even an unsearchable index in the same SDIS! Note: this can take an unreasonably long time, if it ever happens at all.
Additionally, since the database contains minimal identifying attributes for each entity or DNA profile, and the attributes mostly contain information that can only be utilized by the submitting lab, the CODIS software doesn’t have any capability to detect “false negative” non-matches. It’s important to note that this inherent limitation was designed in intentionally to protect individuals’ privacy, but as a consequence it may lead to erroneous identification and wrongful incarceration.
Here is another interesting weakness in CODIS. When a subject’s DNA becomes crime scene evidence, this DNA profile may possibly appear in the databases up to four times over the entire course of a single investigation: First as a forensic sample (the evidence profile attributed to him), second as a volunteer (the reference specimen he consented to database entry according to local/state law), third as an arrestee (the reference specimen collected from him when arrested for a qualifying offense according to local/state law), and fourth as a convicted offender (the reference specimen collected from him when convicted of a qualifying offense according to local/state/federal law).
Matches should occur at each of these stages of submission if there are no DNA identification errors, and multiple matches would definitely validate each individual testing result. However, if a match does not occur at one or more of these submission stages, then an error has occurred. After the forensic profile was uploaded, each subsequent related submission provided an opportunity to detect an identification error because each subsequent match reported by CODIS would be returned to the lab for confirmation. CODIS’s inability to detect a false-negative non-match at any of the submission stages means opportunities to detect errant results are missed, and this faulty evidence means the wrong people are convicted or freed.
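For illustration, here is a minimal Python sketch of the check that is missing. It is purely hypothetical: it assumes some tracking layer (not CODIS itself, which holds no identity information) knows which submissions belong to the same subject in the same investigation, and it reduces DNA matching to exact equality on shared loci, which is far simpler than real matching rules. The stage names, locus names and allele values are made up.

```python
STAGES = ("forensic", "volunteer", "arrestee", "convicted_offender")

def profiles_match(a, b):
    """Simplified match: at least one shared locus, and all shared loci agree."""
    shared = a.keys() & b.keys()
    return bool(shared) and all(a[locus] == b[locus] for locus in shared)

def unexpected_non_matches(submissions):
    """Given (stage, profile) pairs for one subject, list the stages whose
    reference profile fails to match the forensic (evidence) profile."""
    for stage, _ in submissions:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
    evidence = [p for stage, p in submissions if stage == "forensic"]
    if not evidence:
        return []  # nothing to compare against yet
    return [stage for stage, p in submissions
            if stage != "forensic" and not profiles_match(evidence[0], p)]

history = [
    ("forensic",           {"locus_1": (12, 14), "locus_2": (7, 9)}),
    ("arrestee",           {"locus_1": (12, 14), "locus_2": (7, 9)}),
    ("convicted_offender", {"locus_1": (12, 14), "locus_2": (8, 9)}),
]
print(unexpected_non_matches(history))   # ['convicted_offender']
```

In this made-up history the arrestee sample agrees with the evidence but the convicted offender sample does not, which is exactly the kind of non-match that ought to prompt someone to ask which of the test results is wrong.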
Even if CODIS had the capability to detect/correct a false positive, waiting for that false-positive match to occur is problematic because it takes longer to occur/detect than a false-negative non-match, or even worse, might never occur at all. To obtain the false positive association, the true DNA donor to the forensic profile must somehow get uploaded to an index/tier that will be compared to the forensic profile that was erroneously attributed to the original subject. The match might occur immediately if the true donor’s profile is already there, but might never occur if it’s residing in an ineligible index/tier or the true donor’s profile is forever absent altogether.
Basically, detecting/correcting the false-positive match prevents a subject from being wrongfully convicted for the second crime, but might only minimally exonerate him for the first wrongful conviction (he may have already served his sentence, be deceased, etc.). A false-negative non-match actually precludes a false-positive match unless the falsely matching profile is immediately available for comparison when the erroneous profile is uploaded to the database.