“It is a capital mistake to theorize before one has data”
~ Sir Arthur Conan Doyle
Over the years, folks have often asked me what kind of math I am using to create large-scale, real-time, context-accumulating systems (e.g., NORA). Some who are fond of Bayesian methods speculate I am using Bayesian techniques. Some ask if I am using neural networks or heuristics. A math professor said I was doing advanced work in the field of Set Theory.
My answer is always, “I don’t know any math. I didn’t finish high school. But I can explain how it works, step-by-step, and it is really quite simple.”
That reminds me of a related funny story. After IBM acquired my SRD company in 2005, I began touring IBM’s impressive research facilities around the world. During a visit to one of IBM's research labs I explained my techniques to a room full of researchers. A few months later, to my surprise, they sent me a technical paper expressing my work … using math. Fascinating, I thought. The idea that my algorithms were now expressed in math terms was really exciting. Could it be? I was so curious. So I asked them to humor me and take me through the paper very slowly via a conference call. It was actually a bit embarrassing. I started out by asking what an equal sign means when a colon is in front of it. Symbol by symbol I asked for an explanation. Then I asked about this thing shaped like the letter “U” … what does that mean? (Union, as it turns out.) Anyway, I was able to follow the math and it all made sense until about halfway through the paper when I spotted an obvious error. So I said, “um, the math here is inconsistent with my technique.” I suggested a fix. The phone went quiet for a minute and then about 45 days later they came back with a new and improved paper. Continuing where we left off, I found a similar discrepancy further down the page and then provided some more specifics about my technique. Unfortunately, I never received another draft. Clearly, they could have. But honestly, I suspect they simply lost interest in having to teach me math.
I wish we had finished that paper, as then folks trained in formal methods would better understand what I am doing and seeing.
One of the things demonstrated by this mathy paper might have been the notion that “data beats math” – at least when it comes to Assertion Algorithms. Based on the available observation space, can an assertion be made? Yes or no. In short, there comes a point where sufficient evidence exists such that an assertion can be made as a “no-brainer” without feeling compelled to split hairs with probability math.
Here is a practical example. Imagine being presented with two identity records:
Record #1
Name: Mark Smith
Date of Birth: 05/12/1987
SSN: 555-00-1122
Record #2
Name: Mark Smith
Date of Birth: May 1987
D/L: 0099912334
Are they the same person? It is certainly possible. Using population statistics and some math, someone could compute a reasonably accurate probability. I say to heck with using math to guess. I’d ask: where can I find some glue around here? For example, a record like this:
Record #3
Name: Mark K Smith
Date of Birth: May 12, 1987
D/L: 0099912334
SSN: 555-00-1122
So the point is: I’d rather look for corroborating and/or dissenting evidence than look to math for estimated probabilities. And if a really important outcome might come from such an assertion, I would continue to seek observations until it was so obvious you could show the board of directors and they would say “duh.” If you run out of available observations and you are still not sure … then you have a few choices: 1) locate and collect the kinds of observations you need, 2) wait until you luck into a future observation related to the assertion in question (letting the existing ambiguity fester), or 3) pound on it with math. But I say only pound on it with math if it is going to be worth the additional effort/compute (e.g., you are playing high-stakes poker in Vegas).
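To make the glue idea concrete, here is a minimal Python sketch. It is not the actual NORA logic, just one simple way to act on corroborating and dissenting evidence. The field names, the `STRONG_KEYS` choice, the prefix test for partial dates, and the assumption that dates have already been normalized to ISO form are all illustrative assumptions.

```python
from itertools import combinations

# Hypothetical records mirroring the three examples above.
# Assumption: dates were normalized to ISO form (YYYY-MM or YYYY-MM-DD).
RECORDS = {
    1: {"name": "Mark Smith", "dob": "1987-05-12", "ssn": "555-00-1122"},
    2: {"name": "Mark Smith", "dob": "1987-05", "dl": "0099912334"},
    3: {"name": "Mark K Smith", "dob": "1987-05-12",
        "dl": "0099912334", "ssn": "555-00-1122"},
}

# Assumption: only these identifiers are strong enough to act as glue.
STRONG_KEYS = ("ssn", "dl")


def dob_compatible(a, b):
    """A partial date agrees with a full date when one is a prefix of the other."""
    return a.startswith(b) or b.startswith(a)


def evidence(a, b):
    """Return (corroborating, dissenting) feature names for two records."""
    shared = [k for k in STRONG_KEYS if k in a and k in b]
    corroborating = [k for k in shared if a[k] == b[k]]
    dissenting = [k for k in shared if a[k] != b[k]]
    if "dob" in a and "dob" in b and not dob_compatible(a["dob"], b["dob"]):
        dissenting.append("dob")
    return corroborating, dissenting


def same_entity(a, b):
    """Assert a match only on a shared strong identifier with no dissent."""
    corroborating, dissenting = evidence(a, b)
    return bool(corroborating) and not dissenting


def resolve(records):
    """Group record ids into entities; asserted links are followed transitively."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if same_entity(records[a], records[b]):
            parent[find(a)] = find(b)

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(g) for g in groups.values())


without_glue = resolve({k: RECORDS[k] for k in (1, 2)})
with_glue = resolve(RECORDS)
```

With only Records #1 and #2, no assertion is made and they stay separate. Once Record #3 arrives, it shares the SSN with #1 and the D/L with #2, so all three land in one entity without any probability estimate. Following links transitively is a simplification; a production system would want more care before letting one record glue two others together.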
My gripe, if any, is that way too many people are chipping away at hard problems and have made no material gains in decades (e.g., entity extraction and classification) … when what they actually need is more data. Not more of the same data, by the way. No, they more likely need orthogonal data – data from a different sensor sharing some of the same domain, entities and features (e.g., name and driver’s license number).
When the quality of mathematical predictions starts to flatten out, I recommend increasing your observation space. Hence the above reference to this awesome quote:
“It is a capital mistake to theorize before one has data”
~ Sir Arthur Conan Doyle
RELATED POSTS:
Accumulating Context: Now or Never
Algorithms At Dead-End: Cannot Squeeze Knowledge Out Of A Pixel
How to Use a Glue Gun to Catch a Liar
It Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You
Smart Sensemaking Systems, First and Foremost, Must be Expert Counting Systems
I'm ecstatic to have found your blog. Actually, I skipped a few years of high school but ended up with a Ph.D. anyway. This means that in some areas of math I am superb and other areas a bit iffy, yet I did just fine as a statistician. This taught me that there is often more than one way to do things, whether it is with calculus, algebra, geometry or just pounding on it with data. It also gave me a healthy skepticism about some of the answers produced by our equations. Sometimes, as you point out, the math hides more than it reveals.
Posted by: AnnMaria | April 11, 2011 at 12:18 PM
Hi Jeff
Honestly, I never understood such debates. It reminds me of the discussions in physics where theoreticians fight against experimentalists over whether the theory needs the experiment (observation) or vice versa. But it is so obvious that both need each other. I think Sir Arthur Conan Doyle was simply wrong with his statement: “It is a capital mistake to theorize before one has data”, because there may be situations where it is better to think (theorize) about some "scenarios" based on different "data topologies" before having the actual data, because it could save time and money or even prevent some serious damage. And of course there are many situations where you simply need the data before you can do anything useful. So the statement from Sir Arthur Conan Doyle is only one part of the game.
In fact there are only "SOLUTIONS". Whether these solutions are approached by hard math or by some clever "on the fly" ideas, which might be pretty unconventional but functional, is not relevant at all. Use math where it is needed and use "non-math" approaches where they can help you come up with a SOLUTION faster. Solutions are the only thing that matters. Have a nice day!
Posted by: Marcel Blattner | April 12, 2011 at 12:37 AM
I'm glad I came across your blog Jeff :).
I'm a college undergrad. And I totally agree. This is a matter of common sense, isn't it? Pounding a problem with math while having incomplete data is just... stupid. It's common sense that if you don't have enough evidence of a "crime", then you go roam your beat, do a stakeout, ask informants, whatever.
My old professors (years ago) said that one needs to be really good at math to work with computers. They were wrong. I'm working with computers, and I'm not good at math.
Your post reminded me of the Numbers TV series by the way :).
Posted by: Romar | April 26, 2011 at 01:27 AM
Great post, Jeff. Isn't what you are suggesting the essence of the scientific method? Propose a theory and then prove it. In your case, you propose the two records are the same and then gather data to support that theory.
What's changed is we are suddenly able to gather and understand more data in more different forms than ever before. Proving theories (within an acceptable statistical error margin) has never been easier.
Posted by: Mikecurr55.wordpress.com | September 23, 2011 at 08:27 AM
Interesting post. What you are doing here is creating a hypothesis based on the data that you have. Isn't that biased? Wouldn't it be better to create a hypothesis and then gather data to check if it's valid?
Just a thought :)
Posted by: Retnec | September 27, 2011 at 10:51 PM
Brilliant, I thought I was alone on this one. Now I have one more data point :)
Posted by: Account Deleted | October 28, 2011 at 11:05 AM
Hi Jeff, I just found your blog, and more than a few posts are great. "Data beats math" reminds me of how both sides of our mind work together to get something cool.
Posted by: CyberYisus | July 09, 2012 at 10:11 AM