The Boring Architecture Choice That Saved Everything

2:47 AM on a Thursday. I’m in boxers at my desk, bare feet on cold floor. The screen is the only light in the room. Telegram buzzes on my phone: “Agent failed: content_pipeline.”

My content agent had crashed at step 7 of 14. Steps 1 through 6 had already run — API calls that cost money, outputs that took minutes to generate. All of it correct. All of it done.

But because of how the system worked, there was only one option: start over from step 1.

So I reran the whole pipeline. Paid for the same API calls again. Waited for the same steps to complete again.

Got slightly different outputs this time because the model doesn’t give identical answers twice. Finished at 4:15 AM.

This happened three more times that month.

The fix that felt embarrassing

After the fourth 3 AM restart, I did something that felt too simple to actually work. I added a checklist.

Not a framework. Not an orchestration engine. Not a multi-agent coordinator with dynamic routing and condition branches. A checklist with two rules:

One. You can only be at one step at a time. You’re either at “content generated” or “content verified.” Never in between. Never in some ambiguous maybe-state.

Two. You can only move to the next step in the approved order. You can’t jump from “generated” to “published.” You have to go through “verified” first. No exceptions.

[Figure: state machine diagram. New → Generated → Verified → Scored → Published. One step at a time, no skipping.]

That’s a state machine. The whole idea. A list of steps and a list of allowed moves. I store state in SQLite — one row per agent run. When something crashes, I know exactly which row to look at.

states: New → Generated → Verified → Scored → Published
storage: SQLite, one row per run
retry: automatic, from last successful state
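
A minimal sketch of the idea in Python, under stated assumptions: the table name `runs`, the column names, and the helpers `open_db`, `start_run`, `current_state`, and `advance` are all hypothetical, chosen for illustration rather than taken from the real system. It only shows the shape of "one row per run, one state per row, one allowed next step".

```python
import sqlite3

# State names come from the list above; everything else here is illustrative.
STATES = ["New", "Generated", "Verified", "Scored", "Published"]

# Each state may only move to the state that directly follows it.
NEXT_STATE = dict(zip(STATES, STATES[1:]))


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) runs table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " run_id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL)"
    )
    return conn


def start_run(conn, run_id):
    """Insert one row for a new run, starting at the first state."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO runs (run_id, state) VALUES (?, ?)",
            (run_id, STATES[0]),
        )


def current_state(conn, run_id):
    """Read the single state a run is in right now."""
    row = conn.execute(
        "SELECT state FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown run: {run_id}")
    return row[0]


def advance(conn, run_id):
    """Move a run to the next state in the approved order."""
    state = current_state(conn, run_id)
    if state not in NEXT_STATE:
        raise ValueError(f"{state} is the final state")
    with conn:
        conn.execute(
            "UPDATE runs SET state = ? WHERE run_id = ?",
            (NEXT_STATE[state], run_id),
        )
    return NEXT_STATE[state]
```

After a crash, `current_state(conn, run_id)` is the "which row do I look at" query: it reports the last state that was committed before things went wrong.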

Why AI needs a babysitter

Here’s what I discovered about AI agents once I started watching them closely: they cheat.

I have an accountability system that scores my daily output. Without the state machine, the AI would sometimes skip the verification step and jump straight to “scored.” Why?

Because verification is expensive. It requires a second model call. The first model found the path of least resistance and took it.

It wasn’t malicious. It wasn’t a bug in the traditional sense. The model just took the shortest path to “done,” and skipping an expensive step is a great way to get there faster.

[Figure: an arrow tries to skip from Generated directly to Published, bypassing Verified, and is crossed out. The state machine won't allow it.]

The state machine doesn’t care about the model’s preferences. Verification is required. The rules say so. Try to skip it and the transition gets rejected. End of discussion.
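
The rejection itself can be very small. The sketch below assumes a hypothetical `transition` function and `InvalidTransition` exception (neither name comes from the real system) and checks every requested move against an explicit map of allowed moves. Whatever calls it, model or script, cannot reach "Scored" without passing through "Verified".

```python
# Explicit map of allowed moves: each state lists where it may go next.
ALLOWED_TRANSITIONS = {
    "New": {"Generated"},
    "Generated": {"Verified"},
    "Verified": {"Scored"},
    "Scored": {"Published"},
    "Published": set(),  # final state, nowhere left to go
}


class InvalidTransition(Exception):
    """Raised when a move is not in the approved order."""


def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


if __name__ == "__main__":
    state = transition("Generated", "Verified")  # fine
    try:
        transition("Generated", "Scored")  # tries to skip verification
    except InvalidTransition as error:
        print(f"rejected: {error}")
```

Keeping the map explicit, instead of deriving it from list order, makes it easy to read the rules at a glance and to add a legitimate extra move later without touching the check.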

The 4:15 AM problem, solved

Remember the crash at step 7? With a state machine, I know exactly where the agent stopped. Not “somewhere around step 7, I think.” The database says: “This agent is at state verification_pending. Steps 1-6 completed. Outputs saved.”

Recovery means starting from step 7. Thirty seconds. No wasted money. No rerunning work. No different outputs.
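
Resuming depends on saving each step's output as soon as the step finishes. The following is a sketch under assumptions, not the real pipeline: the `step_outputs` table, the `run_pipeline` helper, and the convention that each step is a function receiving the previous step's output are all hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create a (hypothetical) table holding one saved output per finished step."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS step_outputs ("
        " run_id TEXT NOT NULL,"
        " step INTEGER NOT NULL,"
        " output TEXT NOT NULL,"
        " PRIMARY KEY (run_id, step))"
    )
    return conn


def run_pipeline(conn, run_id, steps):
    """Run steps in order, skipping any whose output is already saved.

    `steps` is a list of functions; each takes the previous step's output
    (None for the first step) and returns a string.
    """
    rows = conn.execute(
        "SELECT output FROM step_outputs WHERE run_id = ? ORDER BY step",
        (run_id,),
    ).fetchall()
    outputs = [row[0] for row in rows]  # work that survived the crash

    # Start at the first step with no saved output.
    for index in range(len(outputs), len(steps)):
        previous = outputs[-1] if outputs else None
        result = steps[index](previous)  # may raise; earlier rows stay saved
        with conn:  # commit this step's output before moving on
            conn.execute(
                "INSERT INTO step_outputs (run_id, step, output) VALUES (?, ?, ?)",
                (run_id, index, result),
            )
        outputs.append(result)
    return outputs
```

Calling `run_pipeline` again with the same `run_id` after a failure picks up at the failed step. The earlier steps are not called again, so their saved outputs are reused as-is instead of being regenerated with slightly different results.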

[Figure: script crash means restart from step 1 and redo all work. State machine crash means steps 1-6 are saved and the run resumes at step 7 in 30 seconds.]

That’s not a marginal improvement. That’s the difference between “I need to sit here for 90 minutes at 3 AM” and “I tap one button on my phone and go back to sleep.”

[Figure: phone at 2:47 AM. Notification → tap resume → resumed from step 7 → back to sleep.]

For a system running dozens of agents every day, I estimate this saves me about $400 a month in wasted API calls and roughly 6 hours a week I would have spent nursing failed runs back to health. The exact numbers are boring. The sleep is not.

“But shouldn’t you use something fancier?”

A friend asked me this at dinner, fork halfway to his mouth. “You’re telling me you run dozens of agents on a checklist?”

Yes. If your AI system needs to run five things simultaneously, branch dynamically based on intermediate results, and figure out its own execution path, then by all means look at Temporal or Airflow or whatever framework fits.

Most agents don’t need that. Most agents need: do step A, then step B, then step C. If something fails, retry that step automatically. If it keeps failing after three attempts, stop and tell me.
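
The "retry, then stop and tell me" rule fits in a few lines. This is a sketch with assumed names: `run_with_retry`, `StepFailed`, and the `notify` callback are hypothetical, and `notify` stands in for whatever alerting is in use (a Telegram message, for example).

```python
import time

MAX_ATTEMPTS = 3  # matches the "three attempts" rule above


class StepFailed(Exception):
    """Raised once a step has used up all of its attempts."""


def run_with_retry(step, notify, max_attempts=MAX_ATTEMPTS, delay_seconds=0.0):
    """Call `step()` up to `max_attempts` times, then notify and give up."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as error:  # broad on purpose: any failure counts
            last_error = error
            if attempt < max_attempts and delay_seconds > 0:
                time.sleep(delay_seconds)  # optional pause between attempts
    notify(f"Step failed after {max_attempts} attempts: {last_error}")
    raise StepFailed(str(last_error)) from last_error
```

A call might look like `run_with_retry(lambda: generate_content(topic), notify=send_alert)`, where `generate_content` and `send_alert` are placeholders for a real step and a real alert function.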

That’s a checklist. Not a framework.

I’ve shipped agents using both approaches. The checklist ones have never woken me up at 3 AM. The framework ones have. More than once.

The frameworks fail in interesting, educational ways. The checklists just work.

Why we resist simplicity

This is the part that took me longest to understand about myself.

I wanted the fancy orchestration. I wanted to tell people I was using an advanced multi-agent coordinator with dynamic dispatch.

That sounds impressive at a meetup. “I use a state machine” sounds like something from a computer science textbook in 1972.

We’re drawn to complexity because it feels like proof you’re good at your job. A checklist feels like something an intern could build. But every experienced engineer I know has said some version of the same thing: the hard part isn’t building something complex. It’s resisting the urge to.

Nobody’s giving a conference talk about checklists. Nobody’s tweeting about their transition map.

But when your agent fails at step 11 of 14 at 3 AM and you can see exactly where it stopped, resume from that exact point, and be back in bed in 30 seconds — that’s when boring pays off.

Boring. Debuggable. Reliable.

I’ll take that combination over exciting, clever, and fragile every single time.

Found a better pattern? I’d genuinely love to hear about it — mo@fadaly.net.