Distributed Video Generation — ITDT LLC AI Research

ITDT LLC produces its product videos — App Store previews, promos, and multi-minute explainer episodes — entirely on local hardware, with no paid cloud generators. This is the story of how we and Claude Code built the pipeline that makes that possible: a creative studio split across two computers, coordinated by a small API that was tailored to the hardware on hand, and grown from one-shot previews into a general episode compiler that turns a written plan into a finished film.

Good News! — no need to read all of this.

Anyone who would rather skip the reading can hand the whole job to Claude Code. Copy the block below into Claude Code (or any capable AI coding agent) and let it read all five pages and build an equivalent pipeline, tailored to your own hardware.

I'd like to build something like ITDT LLC's "Distributed Video Generation" — a
two-machine, local-AI video studio that splits the work across an Apple Silicon
Mac and an NVIDIA-GPU PC, coordinated by a small job-control API. Please read
these five pages, then help me design and build an equivalent pipeline tailored
to my hardware:

  https://itdtllc.com/video-pipeline/                   the story (why it is built this way)
  https://itdtllc.com/video-pipeline/api.html           the Mac/PC job-control API
  https://itdtllc.com/video-pipeline/verify.html        how to verify and replicate it
  https://itdtllc.com/video-pipeline/verify-api.html    verbatim terminal verification
  https://itdtllc.com/video-pipeline/verify-elira.html  verbatim local-AI verification

My hardware is: <describe your machines — GPUs, RAM, OS, what is always-on>.
Design the functional split for my rig, then implement and verify it.

Prefer to understand it first? Read on: everything the prompt points to is laid out below and across the four other pages it links.

What this represents

It is worth stating plainly, because the pace is easy to miss from the inside. A single person, directing AI in plain language, now produces broadcast-style short videos — voice, an on-screen host, music, lip-sync, footage, and captions — on two ordinary local machines, with no film crew, no studio, and no paid cloud services. Work that recently required a team, a budget, and specialized software is now a conversation with AI, carried out on hardware many people already own.

We publish these write-ups so progress like this is visible outside the small circle of people building it. The capability is here today, and it is advancing quickly. The clearer that is to more people, the better everyone can plan for what comes next.

One machine could not do all of it

A finished short video is really several different workloads stacked together: synthesizing a voice, generating music, creating a photoreal host, animating that host into moving video, matching her lips to the narration, compositing everything over screen-captured app footage, and adding captions and an end card. Each of those steps prefers a different kind of hardware. No single machine in the shop was the best home for all of them — so instead of compromising, we asked Claude Code to design a pipeline that used both machines for what each does best.

The collaboration model

The project was built by a person and two instances of Claude Code working together. The person sets the high-level intent — "make a calmer, 27-second App Store preview in a business suit" — and reviews the finished products. The two Claude Code instances, one running on the Mac and one on the PC, work out the engineering contract between themselves and build their own side of it.

They coordinate through two shared channels: a network folder that both machines mount (for handing finished media back and forth), and an append-only journal file where each side leaves messages for the other — "endpoint is live," "clip staged, ready for lip-sync," "here is the exact resolution I locked." Neither Claude reaches into the other machine's code; the journal contract is the only joint surface. It is, in practice, two engineers pair-building a system across a hallway, leaving notes for each other as they go.
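The journal idea can be sketched in a few lines. This is a minimal illustration, not the studio's actual file format: the JSON-lines layout, the field names (`ts`, `from`, `msg`), and the helper names `append_entry` and `read_since` are all assumptions made for the example. It also assumes the shared folder handles small appends safely, which is worth checking on any given network mount.

```python
import json
import time
from pathlib import Path


def append_entry(journal: Path, sender: str, message: str) -> dict:
    """Append one JSON line to the journal; existing lines are never rewritten."""
    entry = {"ts": time.time(), "from": sender, "msg": message}
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


def read_since(journal: Path, offset: int):
    """Return (entries written after byte `offset`, new offset)."""
    if not journal.exists():
        return [], offset
    with journal.open("rb") as fh:
        fh.seek(offset)
        data = fh.read()
    # Consume only complete lines, so a half-written entry is picked up next time.
    end = data.rfind(b"\n") + 1
    entries = [json.loads(line) for line in data[:end].splitlines() if line.strip()]
    return entries, offset + end
```

Each side would remember the offset it last read and call `read_since` again later, so it only ever sees the other side's new notes.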

Why two machines — and how the split was chosen

The Mac creates the start frame for each shot and owns all audio and finishing work; the PC animates that frame into video and trains characters. The boundary is not arbitrary: it is drawn along the seam between the two machines' strengths.

A reusable cast

The host is not redrawn from scratch each time. Every character lives in a small library entry: a locked appearance (age, build, hair), a synthetic voice, an identity model the GPU loads to keep her on-model, and one or more outfits — each outfit carrying its own wardrobe, a clean reference still, and its own trained identity model. A shot simply names a character and an outfit — "the host, business suit" — and the studio resolves the rest: it builds the start frame from that outfit's reference and wardrobe text, animates it with that outfit's model, and narrates in the character's voice. New characters and outfits are added to the library once, then reused across every video, so a whole series stays visually consistent.
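One way to picture a library entry is as a small nested record. The sketch below is an assumption-laden illustration: the class names, field names, file paths, and the `resolve` helper are hypothetical and chosen only to mirror the description above, not taken from the studio's code.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Outfit:
    wardrobe: str          # wardrobe text used when building the start frame
    reference_still: str   # path to the clean reference image (placeholder)
    identity_model: str    # path to the model trained for this outfit (placeholder)


@dataclass
class Character:
    appearance: str        # locked appearance: age, build, hair
    voice: str             # name of the synthetic voice
    outfits: Dict[str, Outfit] = field(default_factory=dict)


# Hypothetical library with a single character and outfit.
LIBRARY = {
    "host": Character(
        appearance="(locked appearance text)",
        voice="host_voice",
        outfits={
            "business suit": Outfit(
                wardrobe="(wardrobe text)",
                reference_still="refs/host_suit.png",
                identity_model="models/host_suit.safetensors",
            )
        },
    )
}


def resolve(library: Dict[str, Character], character: str, outfit: str) -> dict:
    """Turn 'the host, business suit' into everything a shot needs."""
    who = library[character]      # KeyError if the character is unknown
    what = who.outfits[outfit]    # KeyError if the outfit is unknown
    return {
        "start_frame_prompt": f"{who.appearance}, {what.wardrobe}",
        "reference_still": what.reference_still,
        "identity_model": what.identity_model,
        "voice": who.voice,
    }
```

A shot that names `("host", "business suit")` gets back the reference still, the outfit's model, and the voice in one lookup, which is what keeps a series consistent.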

Directing it in plain language

The studio has a conversational front door. The always-on local model on the Mac is not just an orchestrator running in the background — you can talk to it directly: "render this episode," "make a start frame of the host on a rooftop," "stage that clip to the PC." Every capability in the pipeline is registered with that model as a callable tool with a uniform contract, so it can plan a request and invoke the right steps in order, hand work to the PC, and report back. The same tools are plain command-line programs, so anything the model can do can also be run directly from a terminal or a script. When a new capability is added, it is registered with the local model so it knows the option exists — otherwise the model keeps offering only what it was last taught.
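A uniform tool contract might look something like the following. This is a sketch under stated assumptions: the registry, the "dict in, dict out" convention, and the tool name `make_start_frame` are illustrative inventions, and the tool body is a stub that renders nothing.

```python
from typing import Callable, Dict

# name -> {"description": ..., "run": callable}; an assumed in-memory registry.
TOOLS: Dict[str, dict] = {}


def register(name: str, description: str) -> Callable:
    """Decorator that adds a function to the registry under a uniform contract."""
    def wrap(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        TOOLS[name] = {"description": description, "run": fn}
        return fn
    return wrap


@register("make_start_frame", "Build a start frame for a character and outfit")
def make_start_frame(args: dict) -> dict:
    # Stub: a real tool would render an image; this only returns a placeholder path.
    path = f"frames/{args['character']}_{args['outfit']}.png"
    return {"ok": True, "output": path}


def call(name: str, args: dict) -> dict:
    """Invoke a tool by name; unknown tools return an error instead of raising."""
    tool = TOOLS.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    return tool["run"](args)


def catalog() -> list:
    """The list a local model would be shown so it knows which options exist."""
    return [{"name": n, "description": t["description"]} for n, t in sorted(TOOLS.items())]
```

Because every tool takes a dict and returns a dict, the same function can sit behind a model's tool call or a thin command-line wrapper. A tool that is never registered never appears in `catalog()`, which matches the behavior described above.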

How it came together

The build ran PC-first, then Mac, then a cross-machine integration test. The PC side stood up its job server and its video-render script; once those were live, it left a journal note for the Mac. The Mac side then built its media scripts and the orchestrator that drives the whole run, calling the PC's API across the network. Each capability was added one at a time, verified on its own machine, and only then wired across the link. The result has been verified end-to-end: a single plain-language request now produces a finished, captioned, lip-synced short (voice, music, host, footage, and all), with the person reviewing each clip as it lands.

From a sentence to a full episode

The first videos were short — a 27-second App Store preview. The pipeline has since grown into a general episode compiler: the director turns a plain-language brief into a small, validated episode script — an ordered list of typed shots in a single file — and the studio renders the whole thing, start to finish, with no manual editing. Re-running the compiler on that file rebuilds the entire cut; nothing is hand-edited in a timeline.

An episode is built from a handful of shot types, each routed to the machine that does it best: a talking-head host shot (a generated, lip-synced presenter inside a full moving scene); a narration shot (voice over screen-captured app footage, no host); a host-in-corner inset over footage; an animated title card; and a stylized hologram montage. Layered over the top are lower-thirds and chyrons, a continuous music bed with one-shot stings, dissolve and fade transitions, and gentle camera moves — all declared in the script and assembled automatically. A built-in check validates the structure and screens the narration against the brand's content rules before a frame is rendered.
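The pre-render check can be illustrated with a small validator. Everything specific here is an assumption: the shot type keys, the required fields, which types need the PC, and the banned-phrase list standing in for the brand's content rules are placeholders inferred from the prose, not the studio's real schema.

```python
# Hypothetical shot types. Required fields and the PC flag are assumptions.
SHOT_TYPES = {
    "host":             {"required": {"character", "outfit", "narration"}, "needs_pc": True},
    "narration":        {"required": {"footage", "narration"}, "needs_pc": False},
    "host_inset":       {"required": {"character", "outfit", "footage", "narration"}, "needs_pc": True},
    "title_card":       {"required": {"text"}, "needs_pc": False},
    "hologram_montage": {"required": {"footage"}, "needs_pc": False},
}


def validate(episode: dict, banned_phrases: list) -> list:
    """Return a list of problems; an empty list means the script may be rendered."""
    shots = episode.get("shots")
    if not isinstance(shots, list) or not shots:
        return ["episode needs a non-empty 'shots' list"]
    errors = []
    for i, shot in enumerate(shots):
        if not isinstance(shot, dict):
            errors.append(f"shot {i}: not a mapping")
            continue
        spec = SHOT_TYPES.get(shot.get("type"))
        if spec is None:
            errors.append(f"shot {i}: unknown type {shot.get('type')!r}")
            continue
        missing = sorted(spec["required"] - set(shot))
        if missing:
            errors.append(f"shot {i}: missing {', '.join(missing)}")
        # Screen narration text against the (placeholder) content rules.
        text = str(shot.get("narration", "")).lower()
        for phrase in banned_phrases:
            if phrase.lower() in text:
                errors.append(f"shot {i}: narration contains banned phrase {phrase!r}")
    return errors


def route(shot: dict) -> str:
    """Say which machines a valid shot touches (assumed mapping)."""
    return "mac+pc" if SHOT_TYPES[shot["type"]]["needs_pc"] else "mac"
```

Running `validate` before any rendering starts means a typo in the script costs seconds rather than a wasted GPU job.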

Keeping a character consistent across minutes of video

A genuine research problem surfaced as the videos got longer. The video model renders only a few seconds at a time, so a longer shot has to be built by chaining clips — each new clip seeded from the last frame of the one before. But a diffusion model drifts: the host's face and wardrobe slowly wander, and across a chain the drift compounds, until by the second or third clip she is no longer quite herself.

The fix lives on the orchestrating machine. The character's identity model stays loaded for every single clip, and a long shot is assembled from short segments so the identity is re-established often rather than left to drift. The seam between clips is handled on the director side — either by passing the previous frame straight through (the smoothest joins) or by re-anchoring it to the host's original first frame with an identity pass before it seeds the next clip (the strongest correction). Which trade-off to take — seam smoothness versus identity tightness, and how short to make the segments — is a knob the director sets per shot. It is a small, concrete example of a problem that only appears once you push local generation past the easy first few seconds — and of the kind of fix that is straightforward when one machine owns orchestration and identity work.
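The chaining logic can be sketched as a planner that decides, for each segment, what seeds it. This is an illustration only: the function name `plan_chain`, the seam mode names, and the frame counts are assumptions, and the actual rendering and identity pass are outside the sketch.

```python
def plan_chain(total_frames: int, segment_frames: int, seam: str) -> list:
    """Split a long shot into short segments and record how each one is seeded.

    seam="passthrough": seed from the previous clip's last frame (smoothest join).
    seam="reanchor":    run that frame through an identity pass anchored to the
                        original start frame first (strongest correction).
    """
    if seam not in ("passthrough", "reanchor"):
        raise ValueError(f"unknown seam mode: {seam!r}")
    if total_frames <= 0 or segment_frames <= 0:
        raise ValueError("frame counts must be positive")

    count = -(-total_frames // segment_frames)  # ceiling division
    plan = []
    for i in range(count):
        start = i * segment_frames
        frames = min(segment_frames, total_frames - start)
        if i == 0:
            seed = "start_frame"
        elif seam == "passthrough":
            seed = "previous_last_frame"
        else:
            seed = "previous_last_frame+identity_pass"
        plan.append({
            "index": i,
            "frames": frames,
            "seed": seed,
            "identity_model_loaded": True,  # kept loaded for every clip
        })
    return plan
```

Shorter segments mean more seams but more frequent identity resets; longer segments mean fewer seams and more room to drift. The two arguments are the per-shot knob described above.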

The point: an API tailored to the hardware

The most reusable result is not the videos — it is the realization that the right API depends on the hardware it runs on. The job-control contract here (an asynchronous "submit and poll" model, a first-in-first-out render queue, a shared-folder hand-off, and a common job record both machines understand) was shaped by this mix of an always-on unified-memory Mac and a discrete-GPU PC.
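The shape of a submit-and-poll contract with a first-in-first-out queue can be shown in memory, without any network layer. This is a minimal sketch, not the API documented on the api.html page: the class name, state names, and job record fields are assumptions, and the worker stands in for a render that would write its output to the shared folder.

```python
import itertools
from collections import deque


class JobQueue:
    """In-memory stand-in for the render side of a submit-and-poll API."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = deque()   # FIFO: oldest submitted job renders first
        self._jobs = {}

    def submit(self, kind: str, params: dict) -> str:
        """Record the job and return immediately with an id to poll."""
        job_id = f"job-{next(self._ids)}"
        self._jobs[job_id] = {
            "id": job_id, "kind": kind, "params": params,
            "state": "queued", "output": None, "error": None,
        }
        self._pending.append(job_id)
        return job_id

    def poll(self, job_id: str) -> dict:
        """Return a copy of the job record both sides understand."""
        return dict(self._jobs[job_id])

    def run_next(self, worker):
        """Run the oldest queued job; returns its id, or None if idle."""
        if not self._pending:
            return None
        job_id = self._pending.popleft()
        job = self._jobs[job_id]
        job["state"] = "running"
        try:
            job["output"] = worker(job)   # e.g. a path in the shared folder
            job["state"] = "done"
        except Exception as exc:
            job["error"] = str(exc)
            job["state"] = "failed"
        return job_id
```

The caller never blocks on a render: it submits, goes back to audio or compositing work, and polls until the state is `done` or `failed`. That suits a setup where one machine orchestrates and the other has a single GPU that can only render one job at a time.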

That is the broader capability we want to highlight. The pipeline described here is one concrete instance of it; the same approach can be tailored to whatever hardware you have.

That is also why this write-up exists in the detail it does. These pages are a working blueprint, not just a showcase — the architecture, the job-control contract, and the hardware-to-role mapping are all laid out. Point your own Claude Code at them, tell it what machines you have, and it can design and build an analogous pipeline tailored to your setup: re-deriving the implementation rather than copying ours, and reshaping the split to fit your hardware.