YouTube Splitter — Chunked Transcription

The Problem

A 90-minute YouTube interview is full of value and impossible to mine. You can scrub. You can read the auto-generated captions and get a 70%-accurate wall of text with no structure. What you actually want is the part where they answered question 3, transcribed cleanly, with a deep link back to the timestamp.

YouTube Splitter was the tool for that — give it a URL, get back a clean chunked transcript with timestamps you can actually navigate.

The Architecture

Six components, each doing the thing it’s best at:

Component	Stack
Frontend	Next.js + Clerk + Tailwind
API	Node + Express
Video processing	Modal (serverless Python) + FFmpeg
Transcription	Faster Whisper
Storage	S3 with presigned URLs
Auth	Clerk

Modal handles the variable load gracefully: a one-hour video and a five-minute video share the same code path, but only the heavy one spins up a GPU. Faster Whisper is the cost-quality sweet spot for English transcription; it’s substantially cheaper than the OpenAI API at comparable accuracy on long-form content.
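To make the video-processing row concrete, here is a minimal sketch of how FFmpeg's segment muxer could cut a downloaded video into fixed-length audio files ready for transcription. The function name, the output file pattern, the 16 kHz mono WAV target, and the 300-second default are all assumptions for illustration, not the tool's actual settings.

```python
from __future__ import annotations

from pathlib import Path


def segment_audio_command(source: str, out_dir: str, chunk_seconds: int = 300) -> list[str]:
    """Build an ffmpeg argv that splits the audio track into fixed-length WAV files.

    Hypothetical helper: the name, defaults, and output pattern are illustrative.
    """
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    # ffmpeg expands %03d into a zero-padded segment counter.
    pattern = str(Path(out_dir) / "chunk_%03d.wav")
    return [
        "ffmpeg",
        "-i", source,                         # input video file
        "-vn",                                # drop the video stream, keep audio only
        "-ac", "1",                           # downmix to mono
        "-ar", "16000",                       # resample to 16 kHz (assumed target rate)
        "-f", "segment",                      # use the segment muxer
        "-segment_time", str(chunk_seconds),  # length of each segment in seconds
        pattern,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)` inside whatever worker runs the job. Building the argv as a list rather than a shell string sidesteps quoting problems with odd file names.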

The Trick

The interesting design call was the chunk schema. Videos were split on configurable time boundaries, not on detected speech or scene changes. Two reasons:

Speech-detected chunks are great for podcasts and terrible for interview videos with cross-talk and laughter.
Time-based chunks make the timestamp math trivial: with 5-minute chunks, chunk 4 (counting from 1) starts at 15:00, every time.
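The timestamp math really is that small. The sketch below assumes 1-based chunk numbering (so chunk 4 of a 5-minute-chunk video lands on 15:00) and uses YouTube's `?t=<seconds>` short-link form for the deep link. The `Chunk` type, the function names, and the `VIDEO_ID` placeholder are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int    # 1-based chunk number
    start_s: int  # offset from the start of the video, in seconds
    end_s: int    # end offset in seconds (the last chunk may be shorter)


def chunk_start(index: int, chunk_seconds: int) -> int:
    """Start offset of a 1-based chunk: pure multiplication, no media probing."""
    if index < 1:
        raise ValueError("chunk index is 1-based")
    return (index - 1) * chunk_seconds


def plan_chunks(duration_s: int, chunk_seconds: int) -> list[Chunk]:
    """List every chunk for a video of the given length."""
    chunks = []
    index = 1
    while chunk_start(index, chunk_seconds) < duration_s:
        start = chunk_start(index, chunk_seconds)
        end = min(start + chunk_seconds, duration_s)
        chunks.append(Chunk(index, start, end))
        index += 1
    return chunks


def format_timestamp(seconds: int) -> str:
    """Render seconds as M:SS, or H:MM:SS once the video passes an hour."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def deep_link(video_id: str, seconds: int) -> str:
    """Link that opens the video at the given offset."""
    return f"https://youtu.be/{video_id}?t={seconds}"


start = chunk_start(4, 5 * 60)
print(format_timestamp(start))         # 15:00
print(deep_link("VIDEO_ID", start))    # https://youtu.be/VIDEO_ID?t=900
```

Because the start offset depends only on the chunk number and the chunk length, a transcript segment's absolute time is the chunk start plus its offset within the chunk, with no lookup table needed.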

A scene-detected version was on the roadmap. Time-based got the tool out the door, and the operator (me) never missed the upgrade.

What I Learned

This was the project that taught me that serverless GPU is fine when the load curve is bursty and impossible to predict. Modal made the entire video-processing tier essentially free at the volume I was running. The same architecture choice surfaces in Studio — the video pipeline I run today is structurally a descendant of this.