Your AI Is Probably Wrong and Doesn't Know It
Most AI systems can tell you their answer. Almost none can tell you how confident they should be. That's the difference between a tool and a toy.
One afternoon I decided to manually check my prediction engine’s accuracy. Not read the number it reported — actually calculate it myself. Load the data, compare predictions to outcomes, do the math by hand.
The system said 73%. My calculation said 61%.
I stared at the two numbers side by side in my spreadsheet — one cell green, one cell red — for a long time. Then I went looking for the bug.
The lying dashboard
Here’s what had happened.
My system ran calibration rounds: score some products, compare predictions against reality, update the model weights if things improved. Standard stuff.
The problem was in the comparison. Each round was reading the accuracy number from a saved table — a value computed during the previous round, using the previous weights. So it was comparing new results against old baselines.
The chart was going up. Every round, the number improved. Progress.
Except it wasn’t progress. It was comparing apples to oranges and calling it growth. The new weights weren’t necessarily better — they were just being measured against the wrong version of themselves.
I only caught it because I did the math myself. I told a colleague: “My system’s been lying to me about its own accuracy. For months.” He asked how. “It was grading itself with yesterday’s answer key.”
If I hadn’t had that random impulse to check, the system would still be telling me it was getting smarter while it was actually wandering sideways.
The fix that sounds too simple
Now, every time the system evaluates its own accuracy, it computes everything from scratch. No saved values. No trusting what it calculated last time.
Load the weights, score every case, compare against reality, do the math. Every single time.
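A minimal sketch of that idea in Python, not the engine’s real code. The `score` function, the shape of `weights`, and the `(features, true_tier)` case format are all assumptions made for illustration. What matters is that the baseline and the candidate are both recomputed in the same call, on the same cases, and nothing is read from a saved table.

```python
from typing import Callable, Sequence

# Hypothetical case shape: (features, true_tier)
Case = tuple[dict, int]


def evaluate(score: Callable[[dict, dict], int],
             weights: dict,
             cases: Sequence[Case]) -> float:
    """Recompute accuracy from scratch by scoring every case with `weights`."""
    if not cases:
        raise ValueError("no cases to evaluate")
    hits = sum(1 for features, truth in cases if score(features, weights) == truth)
    return hits / len(cases)


def compare_round(score: Callable[[dict, dict], int],
                  old_weights: dict,
                  new_weights: dict,
                  cases: Sequence[Case]) -> tuple[float, float]:
    """Return (baseline, candidate), both measured now on the same cases."""
    baseline = evaluate(score, old_weights, cases)    # old weights, fresh run
    candidate = evaluate(score, new_weights, cases)   # new weights, fresh run
    return baseline, candidate
```

Scoring everything twice per round is where the extra time goes. In exchange, the two numbers are always comparable.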
Slower? Yes. I don’t care.
I also added a threshold: only accept new weights if they improve accuracy by at least 3%. Not 0.1%. Not 1%. Three percent. I chose 3% because improvements under that kept reversing in the next round — they were noise dressed up as signal.
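The acceptance rule fits in a few lines. This sketch assumes “3%” means three percentage points of accuracy (61% to 64%, say) rather than a 3% relative gain. The names `MIN_GAIN` and `accept_new_weights` are made up for illustration.

```python
# Assumption: the threshold is 3 percentage points of absolute accuracy.
MIN_GAIN = 0.03


def accept_new_weights(baseline_acc: float,
                       candidate_acc: float,
                       min_gain: float = MIN_GAIN) -> bool:
    """Accept the candidate weights only if accuracy rises by at least `min_gain`."""
    return candidate_acc - baseline_acc >= min_gain


print(accept_new_weights(0.61, 0.65))  # clear gain, accepted
print(accept_new_weights(0.61, 0.62))  # small gain, treated as noise and rejected
```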
The two numbers
My prediction engine now reports:
- 65% exact accuracy. It predicts the right tier 65% of the time.
- 93% within one tier. It’s off by at most one step 93% of the time.
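Both numbers come out of the same pass over the data. A minimal sketch, assuming tiers are encoded as ordered integers (1, 2, 3, and so on) so that “one step off” means a difference of 1. The function name is hypothetical.

```python
from typing import Sequence


def tier_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> tuple[float, float]:
    """Return (exact accuracy, within-one-tier accuracy) for integer-coded tiers."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(predicted)
    exact = sum(p == a for p, a in zip(predicted, actual)) / n
    within_one = sum(abs(p - a) <= 1 for p, a in zip(predicted, actual)) / n
    return exact, within_one


# Toy data, not real results: two exact hits, one near miss, one far miss.
print(tier_accuracy([1, 2, 3, 4], [1, 3, 3, 1]))  # (0.5, 0.75)
```

The within-one figure can never be lower than the exact one, which is part of why it flatters.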
The 93% sounds much better. If I were pitching investors, that’s the one I’d put on the slide. It makes me look good. It’s technically true.
But the 65% is the real number. That’s how often the system gets it exactly right. If I only showed the 93%, I’d be doing the same thing I hate about other AI demos — cherry-picking the metric that flatters.
The system also knows what it’s worst at. Products that grow slowly — the “slow burns” — look almost the same in the numbers as products that are about to take off or die. (Part of the problem: the AI-generated training data was inflating scores in exactly the range where slow burns live.) The model can’t tell them apart. And it says so.
That honesty is the most useful thing it does.
A system that says “I don’t know” when it genuinely doesn’t know is far more useful than one that always produces an answer.
The confident advisor
Here’s what scares me about most AI systems.
They always have an answer. Ask a question, get a response. Confident. Specific. No hesitation.
Whether the system is 95% sure or 15% sure, the output looks exactly the same.
That’s the financial advisor who loses your money while sounding like he knows exactly what he’s doing. The confidence doesn’t come from knowledge. It comes from how the model works: it generates a likely next word, in the same fluent tone, regardless of whether it’s right.
If your AI says “I’m 90% confident” and it’s right only 60% of the time, that’s not just inaccuracy. That’s miscalibration: a system that keeps lying about how much it knows. And the worst part is, it doesn’t know it’s lying.
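One way to check for that gap is to group predictions by stated confidence and compare each group’s average confidence with how often it was actually right. A rough sketch, assuming the system reports a confidence between 0 and 1 for every prediction. The function name and the ten-bin default are illustrative choices.

```python
from collections import defaultdict
from typing import Sequence


def calibration_table(confidences: Sequence[float],
                      correct: Sequence[bool],
                      n_bins: int = 10) -> dict[int, tuple[float, float, int]]:
    """Bucket predictions by stated confidence.

    Returns {bin_index: (mean stated confidence, observed accuracy, count)}.
    """
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the top bin
        bins[idx].append((conf, ok))

    table = {}
    for idx, rows in sorted(bins.items()):
        mean_conf = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        table[idx] = (mean_conf, accuracy, len(rows))
    return table


# Toy data: five predictions at 95% stated confidence, three of them right.
for idx, (stated, observed, count) in calibration_table(
        [0.95] * 5, [True, True, True, False, False]).items():
    print(f"bin {idx}: stated {stated:.2f}, observed {observed:.2f}, n={count}")
```

In a well-calibrated system the stated and observed columns roughly agree in every bin. A bin that claims 0.95 and delivers 0.60 is the overconfident advisor in numbers.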
What I actually look for now
When I evaluate any AI system — mine or someone else’s — I ask three questions:
- Does it know what it’s bad at? Not overall accuracy. Per-category accuracy. Where specifically does it fail?
- Does its confidence match reality? When it says high confidence, is it actually right more often?
- Can it say “I don’t know”? Or does it always produce an answer, even when it shouldn’t?
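The first and third questions are cheap to check in code. A minimal sketch under stated assumptions: each record is a `(category, predicted, actual)` triple, the category labels other than “slow burn” are invented for the example, and the 0.6 confidence cutoff is an arbitrary placeholder rather than a recommended value.

```python
from collections import defaultdict
from typing import Iterable, Optional


def accuracy_by_category(records: Iterable[tuple[str, int, int]]) -> dict[str, float]:
    """records: (category, predicted, actual). Returns accuracy per category."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for category, predicted, actual in records:
        totals[category] += 1
        hits[category] += int(predicted == actual)
    return {category: hits[category] / totals[category] for category in totals}


def answer_or_abstain(prediction: int,
                      confidence: float,
                      min_confidence: float = 0.6) -> Optional[int]:
    """Return the prediction only if confidence clears the bar; None means 'I don't know'."""
    return prediction if confidence >= min_confidence else None


# Toy records, not real results.
records = [("slow_burn", 2, 3), ("slow_burn", 2, 2),
           ("breakout", 4, 4), ("breakout", 4, 4)]
print(accuracy_by_category(records))  # {'slow_burn': 0.5, 'breakout': 1.0}
print(answer_or_abstain(3, 0.2))      # None
```

An abstention cutoff only means something if the confidence values themselves are calibrated, which is what the second question checks.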
A system that’s right 80% of the time and knows which 20% it’s uncertain about is dramatically more useful than one that’s right 85% and has no idea when it’s wrong.
The first one you can trust. The second one you can only hope about. And hoping your AI is right is not a strategy I’m willing to bet on anymore.
Building an AI system and wondering how to make it honest? I’ve spent a lot of time on exactly this problem — mo@fadaly.net.