I Was Paying a Brain Surgeon to Put On Band-Aids

The first month I ran AI agents, I spent more on API calls than on my phone bill. Formatting a reminder — cloud API. Classifying an email — cloud API. Filling in a template that a regex could handle — cloud API.

Every task, no matter how trivial, went to the most expensive model I had access to. It was like calling a brain surgeon every time my kid scraped a knee.

My wife saw the invoice over my shoulder one morning. “Is that for the month?” Yes. “For that?” Yes.

Then I installed a small AI model on my laptop and my monthly cost dropped 70% overnight.

The $10 grocery run

Here’s the math that made me feel stupid.

About 65% of what my agents do is simple. Format this text. Sort these items. Fill in this template. Send this reminder. A pocket calculator could handle most of it.

I was paying cloud prices for all of it. Not a lot per task, just fractions of a cent. But fractions of a cent, thousands of times a day, thirty days a month, add up to real money. My first month’s bill was higher than my phone bill.

The other 35% genuinely needs a powerful model. Complex analysis, nuanced writing, multi-step reasoning. That stuff belongs in the cloud. But only that stuff.

The hospital

I think about it like triage.

A patient walks into the emergency room with a papercut. Do you call the brain surgeon? No. The nurse handles it. Takes thirty seconds. Costs nothing.

A patient comes in with chest pain. The nurse escalates to a doctor. If the doctor suspects something serious, they call the specialist. The surgeon only gets paged for actual surgery.

My AI runs the same way now.

Ollama running qwen3 on my laptop is the nurse. Handles 65% of everything for free. Responds in milliseconds — no internet, no queue, no waiting.

Gemini Flash is the general doctor. Picks up another 15-20% of tasks. Things the nurse can’t handle but that don’t need expensive resources. Free tier, 250 requests a day.

DeepSeek is the specialist. Smart, capable, costs fractions of a cent per call. About 10-15% of work hits this level.

Claude is the brain surgeon. Maybe 5% of all tasks. The ones where getting it wrong actually matters.

The AI triage pyramid: Nurse handles 65% for free, Doctor 15-20% free, Specialist 10-15% paid, Brain Surgeon 5% premium.

Every task starts at the nurse. Only gets escalated if it actually needs more.
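A minimal sketch of that escalation rule in Python, assuming each tier is a plain function and that some `good_enough` check exists to judge an answer. The tier names follow the analogy above. `escalate`, the handlers, and the check are hypothetical stand-ins, not real model calls.

```python
from typing import Callable, Optional

# Tiers in escalation order, cheapest first. Names follow the hospital analogy.
TIER_ORDER = ["nurse", "doctor", "specialist", "surgeon"]

Handler = Callable[[str], Optional[str]]


def escalate(task: str, handlers: dict[str, Handler],
             good_enough: Callable[[str], bool]) -> tuple[str, str]:
    """Try each tier in order; return (tier, answer) from the first acceptable one."""
    for tier in TIER_ORDER:
        answer = handlers[tier](task)
        if answer is not None and good_enough(answer):
            return tier, answer
    raise RuntimeError(f"no tier produced an acceptable answer for: {task!r}")


# Stand-in handlers (hypothetical): the nurse only knows how to format reminders.
def nurse(task: str) -> Optional[str]:
    return f"Reminder: {task}" if task.startswith("remind") else None


def surgeon(task: str) -> Optional[str]:
    return f"Detailed answer to: {task}"


handlers = {
    "nurse": nurse,
    "doctor": lambda task: None,      # placeholder: declines everything
    "specialist": lambda task: None,  # placeholder: declines everything
    "surgeon": surgeon,
}

print(escalate("remind me to call mom", handlers, good_enough=bool))
print(escalate("analyze this contract", handlers, good_enough=bool))
```

In a real setup the hard part is the `good_enough` check: it decides when a cheap answer is acceptable and when the task moves up a level.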

The routing is a config file — nothing fancy:

reminder formatting → local (Ollama)
calendar sync       → local (Ollama)
content writing     → paid (DeepSeek)
complex analysis    → frontier (Claude)
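As an illustration, the same four routes could be expressed as a small Python lookup table. This is a sketch, not the actual config format: the names `ROUTES`, `MODELS`, and `route` are made up for the example, and sending unknown task types to the local tier is an assumption based on "every task starts at the nurse."

```python
# Task type -> tier, mirroring the four routes listed above.
ROUTES = {
    "reminder formatting": "local",
    "calendar sync": "local",
    "content writing": "paid",
    "complex analysis": "frontier",
}

# Tier -> model, as named in the routing list.
MODELS = {
    "local": "Ollama",
    "paid": "DeepSeek",
    "frontier": "Claude",
}

# Assumption: anything not listed starts at the local tier.
DEFAULT_TIER = "local"


def route(task_type: str) -> tuple[str, str]:
    """Return (tier, model) for a task type, ignoring case and outer whitespace."""
    tier = ROUTES.get(task_type.strip().lower(), DEFAULT_TIER)
    return tier, MODELS[tier]


print(route("content writing"))
print(route("Calendar Sync"))
print(route("something new"))
```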

How I learned this the painful way

I went through three phases. Each one felt like the right answer at the time.

Phase one: everything expensive. Every task goes to the cloud. Costs pile up. And when the provider goes down on a Sunday afternoon, nothing works. Zero resilience, maximum spend.

Phase two: everything cheap. I discover local models and get excited. Route everything locally. Easy tasks are fine.

Hard tasks come back as incoherent mush. I spend twenty minutes fixing an output that a two-cent API call would have nailed. “Saving money” by wasting time.

Phase three: honesty. Each task goes to the model that can actually handle it. Not the cheapest. Not the most expensive. The right one.

The three phases: Everything Expensive (wrong), Everything Cheap (also wrong), Matched to the Task (right).

The hard part isn’t building the system. The hard part is being honest about what each task actually needs. It’s tempting to push everything cheap. But twenty minutes fixing a bad output costs more than the API call you “saved.”

The accident

I built this to save money. And it did — about 70% less per month.

But the real win was something I didn’t plan for: speed.

A local model responds in milliseconds. No internet round-trip. No server queue.

When 65% of your workload comes back instantly instead of taking two to three seconds each, chains of tasks that used to take minutes finish in seconds. The whole system feels like a different machine.

Cost comparison: Before (100% paid) vs After (65% free local, 25% cheap, 10% paid). 70% cost reduction.

Speed payoff: Cloud takes 2-3 seconds per call. Local responds instantly. At 1000 calls run back to back, that's roughly 40 minutes of waiting vs seconds.
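The arithmetic behind that figure is easy to check. The sketch below assumes the calls run one after another with no parallelism. The 2 to 3 second range comes from the text, but the 50 ms local latency is purely an assumed number for illustration, not a measurement.

```python
def total_minutes(calls: int, seconds_per_call: float) -> float:
    """Total wall-clock minutes for sequential calls."""
    return calls * seconds_per_call / 60


calls = 1000

low = total_minutes(calls, 2.0)   # cloud, fast end of the range
high = total_minutes(calls, 3.0)  # cloud, slow end of the range
mid = total_minutes(calls, 2.5)   # cloud, midpoint

# Assumption: 50 ms per local call. Chosen only to illustrate the scale.
local_seconds = calls * 0.05

print(f"cloud: {low:.1f} to {high:.1f} minutes (midpoint {mid:.1f})")
print(f"local: about {local_seconds:.0f} seconds")
```

At the midpoint of 2.5 seconds per call, 1000 calls come to a little under 42 minutes, which is where the "40 minutes" figure lands.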

I optimized for cost and accidentally optimized for speed. Sometimes the cheapest path and the fastest path are the same path.

The bonus nobody mentions

When you depend on one AI provider and they go down — and every provider goes down — you’re stuck. Nothing works until they fix it.

With four levels, if the cloud disappears, the local model keeps essentials running. Not everything. But enough. The system bends instead of breaking.
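"Bends instead of breaking" can be sketched as a fallback chain: try the preferred provider, and if it is down, move to the next one. Everything here is hypothetical scaffolding. `ProviderDown`, `call_with_fallback`, and the two fake providers simulate an outage and are not real API clients.

```python
class ProviderDown(Exception):
    """Raised by a provider call when the service is unreachable (hypothetical)."""


def call_with_fallback(task, providers):
    """Try providers in order and skip any that are down.

    `providers` is a list of (name, callable) pairs, preferred first.
    Returns (name, result) from the first provider that answers.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(task)
        except ProviderDown as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderDown("all providers failed: " + "; ".join(errors))


def cloud(task):
    # Simulated outage for the example.
    raise ProviderDown("simulated outage")


def local(task):
    return f"[local] handled: {task}"


print(call_with_fallback("send reminder", [("cloud", cloud), ("local", local)]))
```

The local model will not handle every task well, so in practice only the essentials would be allowed to fall through to it.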

Fourteen months. Zero complete outages.

Why we overpay

The reason most people send everything to the expensive model isn’t technical. It’s psychological.

We connect price with quality. Expensive feels safe. If I use the best model for everything, nothing can go wrong, right? It’s the same instinct that makes people buy the most expensive wine at dinner when they can’t tell the difference between a $20 bottle and an $80 one.

The wine analogy: $20 bottle vs $80 bottle — both taste fine. Same as cheap model vs expensive on simple tasks.

The nurse handles the papercut just fine. You don’t need the surgeon. You never did.

Running AI and wondering if you’re overpaying? You probably are. Happy to compare notes — mo@fadaly.net.