YouTube Splitter — Chunked Transcription
A web app that ingests a YouTube URL, splits the video into configurable time-segment chunks, transcribes each chunk through Whisper, and stores the result with presigned-URL access. Built for the people who treat YouTube as a research corpus.
The Problem
A 90-minute YouTube interview is full of value and impossible to mine. You can scrub. You can read the auto-generated captions and get a 70%-accurate wall of text with no structure. What you actually want is the part where they answered question 3, transcribed cleanly, with a deep link back to the timestamp.
YouTube Splitter was the tool for that — give it a URL, get back a clean chunked transcript with timestamps you can actually navigate.
The Architecture
Five components, each doing the thing it’s best at:
| Component | Stack |
|---|---|
| Frontend | Next.js + Clerk + Tailwind |
| API | Node + Express |
| Video processing | Modal (serverless Python) + FFmpeg |
| Transcription | Faster Whisper |
| Storage | S3 with presigned URLs |
| Auth | Clerk |
Modal handles the variable load gracefully — a one-hour video and a five-minute video share the same code path but only the heavy one spins up GPU. Faster Whisper is the cost-quality sweet spot for English transcription; it’s substantially cheaper than the OpenAI API at comparable accuracy on long-form content.
The Trick
The interesting design call was the chunk schema. Videos got split on configurable time boundaries, not on detected speech or scene change. Two reasons:
- Speech-detected chunks are great for podcasts and terrible for interview videos with cross-talk and laughter.
- Time-based chunks make the timestamp math trivial — chunk 4 of a 5-minute-chunk video starts at 15:00, every time.
A scene-detected version was on the roadmap. Time-based got the tool out the door, and the operator (me) never missed the upgrade.
What I Learned
This was the project that taught me that serverless GPU is fine when the load curve is bursty and impossible to predict. Modal made the entire video-processing tier essentially free at the volume I was running. The same architecture choice surfaces in Studio — the video pipeline I run today is structurally a descendant of this.