- Local video indexing lets you search an unlabeled archive in plain English without uploading a single file to the cloud.
- One developer ran local video indexing on a 2021 MacBook with 50GB of swap, processing years of wildlife footage offline.
- Off-the-shelf AI video editors all assume footage is already labeled — that blind spot is exactly what this build fixes.
- The total monthly cost dropped from a projected $140 SaaS stack to just $22 by solving the indexing problem first.
The Archive Problem Nobody Talks About
Local video indexing sounds like a niche obsession — until you realize almost every serious photographer or videographer is drowning in the same problem. Files pile up faster than anyone can sort them. Folders get names like Mara june 2024 backup final FINAL. And the footage that should be telling stories just sits on spinning drives, untouched.
That’s the situation one developer and founder — who splits his time between the Maasai Mara in Kenya and Silicon Valley-style coding marathons — found himself staring at a few months ago. Three months of social media silence from a safari lodge he works with. Not because there was nothing to post. Because no one could find anything in a multi-SSD archive of raw footage shot across iPhones, a DJI Pocket, a drone, a Nikon Z8, and Ray-Ban Meta glasses. The bottleneck wasn’t creativity. It was the index.
His solution — built over a single weekend and documented on his blog at SimbaStack — is one of the more honest accounts of what it actually takes to make local video indexing useful against a real, messy archive.
Why the SaaS Stack Fell Apart Immediately
The first instinct was the obvious one: throw money at it. The initial plan was a subscription stack combining Eddie AI for iterative editing, Higgsfield for generative B-roll, Submagic for captions, and Buffer for scheduling. Cost: roughly $140 a month. On paper, slick. In practice, it collapsed before a single clip ran through it.
The generative video piece was the first casualty. As he put it bluntly, guests paying $300 a night or more to stay at a real place want to see that real place. Slipping AI-generated savanna footage into a lodge’s Instagram feed isn’t a shortcut — it’s a TripAdvisor disaster waiting to happen. Higgsfield was out instantly.
Then came the realization that DaVinci Resolve Studio — software he already owned — ships with features in version 21 that eat directly into what Eddie AI charges for. IntelliSearch handles semantic clip search. Smart Bins auto-organize footage. Voice to Subtitle produces captions at 90–95% accuracy directly on the timeline. That’s a significant chunk of the $140 stack covered by software already sitting on his machine. Eddie was out too.
What remained was a leaner setup: Claude Code driving Resolve via the open-source DaVinci Resolve MCP server, with ElevenLabs handling voiceover only where it genuinely added value. Monthly cost: $22. But even that cleaner stack exposed a deeper flaw in how every AI video editor on the market is built — and why local video indexing has to come first.
The Real Problem: Every AI Editor Assumes the Work Is Already Done
Here’s the thing the AI video editing industry quietly glosses over. Tools like Eddie AI can search by transcript. They can pull clips by keyword. They’re genuinely impressive — when your footage has metadata, proper filenames, and some kind of organizational structure behind it.
Most real archives don’t look like that. They look like IMG_4721.mov and DJI_0034.mp4 dumped into folders with inconsistent names across three different external drives. No transcript. No tags. No GPS labels. Nothing a semantic search tool can grip onto.
Local video indexing — actually looking at the pixels, understanding what’s in a clip, and writing that understanding to disk — is the step these tools skip entirely. They’re solving problem two. Problem one is the index, and without it, you can’t ask “find the wide shot at sunrise with the giraffe in the frame” and get anything useful back. You’re just searching a void.
That upstream gap is what the weekend build was designed to close. Until the raw archive has been indexed, downstream AI editing tools have nothing to search.
Building Local Video Indexing on a 2021 MacBook
The architecture is deliberate and worth understanding in detail, because the constraints shaped every decision. Four rules governed the local video indexing build from the start.
First, everything had to stay local. Uploading thousands of multi-gigabyte clips (wildlife footage, personal travel, drone footage from Kenya) to a third-party cloud wasn’t just expensive. It meant handing over the entire visual record of a life to a company with its own data policies. Keeping the index on the machine was non-negotiable.
Second, the output format had to be simple and durable. Rather than a central database that could corrupt, migrate poorly, or become dependent on a specific tool, each clip gets a .description.md sidecar file written right next to it. Plain text. Grep-able. Survives if the indexer breaks tomorrow. Travels with the file when drives get reorganized.
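A minimal sketch of that sidecar convention, under stated assumptions: the post only specifies the .description.md suffix, so naming the sidecar after the clip's stem (IMG_4721.mov becomes IMG_4721.description.md), the helper names, and the skip-if-present check are all illustrative rather than the author's code.

```python
from pathlib import Path

# Assumed naming: <clip stem> + ".description.md", stored next to the clip.
SIDECAR_SUFFIX = ".description.md"


def sidecar_path(clip: Path) -> Path:
    """Return the sidecar path that sits beside the given clip."""
    return clip.with_name(clip.stem + SIDECAR_SUFFIX)


def needs_indexing(clip: Path) -> bool:
    """A clip needs indexing only if it has no sidecar yet."""
    return not sidecar_path(clip).exists()


def write_sidecar(clip: Path, text: str) -> Path:
    """Write the sidecar via a temp file so a crash never leaves a half-written index."""
    target = sidecar_path(clip)
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    tmp.replace(target)  # rename over the final name in one step
    return target
```

Because the index is just files on disk, re-running the indexer over a drive can skip every clip that already has a sidecar, and moving a folder moves its index with it.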
Third, the vision model pass happens once per clip, and it has to capture everything in that single call. Vision inference is the expensive operation in time, in compute, and in memory pressure on a machine running a 31-billion-parameter model against 50GB of swap. So the schema is built to be exhaustive upfront: rating, technical quality, lighting, time of day, color palette, audio quality, people count, keywords, detected faces, location, full transcript, and a prose description. All in one shot.
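One way to picture that schema is as a single record rendered into a sidecar. The sketch below is hypothetical: the field list comes from the post, but the field names, the types, and the choice to keep the transcript in the frontmatter are assumptions. Values are emitted as JSON, which is also valid YAML, so no YAML library is needed.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ClipDescription:
    # Field list follows the post; names and types are assumed.
    rating: int
    technical_quality: str
    lighting: str
    time_of_day: str
    color_palette: list[str]
    audio_quality: str
    people_count: int
    keywords: list[str]
    detected_faces: list[str]
    location: str
    transcript: str
    description: str


def render_sidecar(desc: ClipDescription) -> str:
    """Render frontmatter (one 'key: value' line per field) followed by the prose."""
    meta = asdict(desc)
    body = meta.pop("description")  # the prose goes below the frontmatter
    lines = ["---"]
    for key, value in meta.items():
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.strip() + "\n"
```

Making every field required means a vision response that omits one fails when the record is constructed, which fits the goal of never paying for a second pass over the same clip.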
Fourth, three vision backends give flexibility depending on the situation. Claude via a Max subscription CLI handles zero-marginal-cost inference for most work. The Anthropic API kicks in when speed matters. And a local backend pointed at LM Studio — running Gemma 4 at 31 billion parameters — handles the bulk pass offline, which is where the 50GB swap file earns its existence.
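The choice between those three backends can be sketched as a small dispatch function. The post names the backends and what each is for, but the rules below (offline or bulk work goes local, urgent work goes to the API, everything else uses the subscription CLI) are one reading of that description, and all identifiers are hypothetical.

```python
from enum import Enum


class Backend(Enum):
    CLAUDE_CLI = "claude-cli"        # subscription CLI, no per-call cost
    ANTHROPIC_API = "anthropic-api"  # paid per call, used when speed matters
    LOCAL_LM_STUDIO = "lm-studio"    # local model, works offline


def choose_backend(*, online: bool, bulk: bool, urgent: bool) -> Backend:
    """Pick a vision backend for a job (assumed policy, not the author's)."""
    if not online or bulk:
        # Bulk passes and offline work stay on the local model.
        return Backend.LOCAL_LM_STUDIO
    if urgent:
        return Backend.ANTHROPIC_API
    return Backend.CLAUDE_CLI
```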
The per-clip pipeline that feeds those vision calls is impressively thorough:
- ffprobe pulls technical metadata.
- exiftool extracts GPS coordinates (latitude, longitude, and altitude) from iPhone clips, DJI footage, and drone video alike.
- Nominatim handles reverse geocoding, free and without an API key.
- ffmpeg extracts five evenly spaced frames at 1920px resolution for the vision pass.
- WhisperX transcribes audio with word-level alignment across 97 languages, including Swahili, English, and Hindi, with speaker diarization via pyannote.
- insightface detects faces and stores 512-dimensional ArcFace embeddings in a centralized SQLite database, which enables cross-archive queries for specific people across thousands of clips.
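The metadata and frame-sampling steps can be sketched as plain command builders. Two assumptions are baked in: "evenly spaced" is read as the midpoint of five equal slices of the clip (which avoids black first and last frames), and "1920px" is read as the output width. The function names are hypothetical; the ffprobe and ffmpeg flags are standard ones.

```python
from pathlib import Path


def frame_timestamps(duration_s: float, count: int = 5) -> list[float]:
    """Timestamps at the midpoint of `count` equal slices of the clip."""
    if duration_s <= 0 or count <= 0:
        return []
    step = duration_s / count
    return [round(step * (i + 0.5), 3) for i in range(count)]


def ffprobe_cmd(clip: Path) -> list[str]:
    """Command that dumps container and stream metadata as JSON."""
    return ["ffprobe", "-v", "error", "-print_format", "json",
            "-show_format", "-show_streams", str(clip)]


def ffmpeg_frame_cmd(clip: Path, at_s: float, out: Path, width: int = 1920) -> list[str]:
    """Command that grabs one frame at `at_s`, scaled to `width` pixels wide."""
    return ["ffmpeg", "-y", "-ss", f"{at_s:.3f}", "-i", str(clip),
            "-frames:v", "1", "-vf", f"scale={width}:-2", str(out)]
```

Each list can be handed to subprocess.run(cmd, check=True). A 100-second clip yields timestamps at 10, 30, 50, 70, and 90 seconds.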
The vision model then reads those frames, the transcript snippet, and the folder path as context, and returns structured YAML frontmatter plus a prose description. That gets written to the sidecar. Indexing is now complete for that clip, and it can be searched in plain English.
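Once sidecars exist, even a naive search becomes possible. The post does not spell out how plain-English queries are answered, so the sketch below is a stand-in: it ranks sidecars by how many query words they contain. The function name, the stopword list, and the scoring are assumptions.

```python
import re
from pathlib import Path

# Tiny illustrative stopword list; words of two letters or fewer are dropped too.
STOPWORDS = {"the", "with", "and", "for", "find", "from"}


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_sidecars(root: Path, query: str) -> list[tuple[int, Path]]:
    """Return (score, sidecar) pairs, best match first, for sidecars under root."""
    terms = {w for w in _words(query) if len(w) > 2 and w not in STOPWORDS}
    hits = []
    for sidecar in sorted(root.rglob("*.description.md")):
        text = sidecar.read_text(encoding="utf-8", errors="replace")
        score = len(terms & _words(text))
        if score:
            hits.append((score, sidecar))
    hits.sort(key=lambda hit: (-hit[0], str(hit[1])))
    return hits
```

A query such as "find the wide shot at sunrise with the giraffe in the frame" would surface any sidecar whose keywords or description mention a wide shot, sunrise, or a giraffe. A language model reading the same files could handle looser phrasing.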
What This Actually Costs to Run
Running a 31B-parameter model on a 2021 MacBook with 50GB of swap isn’t fast. It’s not meant to be. This is a background job, the kind you kick off overnight or while you’re working on something else. Speed was never the constraint. Cost and privacy were.
The economics here are striking. The original SaaS stack was $140 a month, ongoing, for tools that still couldn’t solve the core problem. The local video indexing build costs $22 a month (driven primarily by ElevenLabs for voiceover on specific clips), runs on hardware already owned, and produces durable plain-text output that isn’t dependent on any vendor staying in business or keeping their pricing stable.
For professional photographers, documentary filmmakers, travel content creators, or anyone managing large personal archives, that math is hard to ignore. Especially when the alternative is an archive that keeps growing and never gets used.
A Signal for the AI Tooling Industry
There’s a broader pattern here that the AI video editing space should probably reckon with. The current generation of tools — Eddie AI, Descript, Runway’s organization features, even Adobe’s AI integrations in Premiere — are all built with the assumption that users come in with organized, labeled, reasonably structured footage. They optimize for the editing workflow. They assume the discovery problem is already solved.
For professional productions with dedicated media management, that assumption holds. For the vast majority of creators — people shooting constantly on multiple devices, accumulating years of footage across hard drives, never having the time to tag anything — it doesn’t. Local video indexing that runs at the edge, on the device, without cloud dependency, is the missing layer. And right now, you have to build it yourself.
The fact that a single developer could assemble this local video indexing pipeline in a weekend from free, mostly open-source components (WhisperX, ffmpeg, insightface, Nominatim, plus the free but closed-source LM Studio) and run a 31B-parameter vision model locally on a 2021 laptop says something important about where consumer AI hardware is heading. The compute is already there. The tooling to make use of it for real creative workflows is still catching up. Whoever builds a polished, accessible version of local video indexing for non-developers is going to have a very large audience waiting.
Source: https://blog.simbastack.com/indexed-a-year-of-video-locally/

