The Mac↔PC Video Generation API

How the two machines talk · 2026 · ITDT LLC

This is the working contract between the two halves of ITDT's local video studio: a Mac "Director" that orchestrates and finishes a video, and a PC "Supervisor" that does the heavy GPU rendering. The API is deliberately small — a handful of HTTP endpoints for job control, a shared folder for moving media, and one common job record both machines understand. For the background on why the work is split this way, see the story.

Architecture at a glance

PERSON (plain-language intent)
            |
            v
+-------------------------+       HTTP / LAN       +--------------------------+
| MAC - "Director"        | --- submit & poll ---> | PC - "Supervisor"        |
|                         |                        |                          |
| local LLM orchestrator  | <--- job status -----  | HTTP job server (8765)   |
| audio · stills · lips   |                        | diffusion video render   |
| compositing · finish    |                        | character training       |
+-----------+-------------+                        +------------+-------------+
            |                                                   |
            |        SHARED NETWORK FOLDER ("the relay")        |
            +------------ media hand-off + journal -------------+
   (start frames Mac->PC, finished clips PC->Mac, append-only message log)

Three channels carry everything:

HTTP over the LAN — job control (start a render, check on it, cancel it).
A shared network folder, "the relay" — the actual media files: the Mac stages a start frame for the PC; the PC stages finished clips back for the Mac.
An append-only journal in that folder — short status messages the two sides leave for each other (and that the two Claude Code instances read when they start work).

Roles

Mac — Director	PC — Supervisor
Always-on local LLM that turns intent into a plan; voice (TTS) and music synthesis; diffusion still start-frames and identity re-anchoring; lip-sync; title cards, overlays, transitions, and montage; compositing and final assembly; upscaling; the orchestrators that drive a whole run and manage long-shot chains.	An HTTP job server fronting GPU scripts; diffusion video generation from a start frame; character-model (LoRA) training; staging finished clips back to the relay.

PC Supervisor — HTTP API

The Supervisor exposes a tiny asynchronous job API. Because a GPU render can take many minutes, nothing blocks: you POST a job, get an id back immediately, and poll for it. Concurrent requests queue first-in-first-out so the single GPU is never double-booked.

Base URL on the reference setup: http://<pc-host>:8765 (the PC's address on the local network).

Method & path	Purpose
`GET /scripts`	List the runnable GPU scripts and their accepted arguments (from a server-side registry).
`POST /run_script`	Enqueue a job. Returns `202` with a `job_id` and queue position. Never blocks on the render.
`GET /job/{id}`	Poll one job: `queued` / `running` / `complete` / `failed`, plus its result on completion.
`GET /queue`	What is running, what is waiting, and recent history.
`POST /cancel_job/{id}`	Drop a queued job, or interrupt a running one and free the GPU.
`POST /cancel_run`	Halt an entire multi-clip run (every job sharing a `run_id`).
`GET /system_stats`, `GET /state`	Health and lifecycle checks.
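
The queueing behaviour described above (submit returns at once, one job runs at a time, strict first-in-first-out) can be sketched in a few lines. This is a minimal in-process illustration, not the Supervisor's actual implementation: the `JobQueue` class, the `runner` callable, and the way `queue_position` is computed are all assumptions made for the example.

```python
import queue
import threading
import uuid


class JobQueue:
    """Single-worker FIFO queue: submit() returns immediately, one job runs at a time."""

    def __init__(self, runner):
        self._runner = runner            # hypothetical callable(script, args) -> result dict
        self._pending = queue.Queue()    # queue.Queue is first-in-first-out
        self._lock = threading.Lock()
        self.jobs = {}                   # job_id -> job record
        threading.Thread(target=self._work, daemon=True).start()

    def submit(self, script, args):
        job_id = uuid.uuid4().hex[:12]
        with self._lock:
            self.jobs[job_id] = {"job_id": job_id, "script": script, "args": args,
                                 "state": "queued", "result": None, "error": None}
            # Assumption: position = number of jobs still waiting ahead (approximate).
            position = self._pending.qsize()
            self._pending.put(job_id)
        return {"job_id": job_id, "queue_position": position}

    def _work(self):
        # The only consumer, so the single GPU is never asked to run two jobs at once.
        while True:
            job = self.jobs[self._pending.get()]
            job["state"] = "running"
            try:
                job["result"] = self._runner(job["script"], job["args"])
                job["state"] = "complete"
            except Exception as exc:
                job["error"] = str(exc)
                job["state"] = "failed"
            finally:
                self._pending.task_done()

    def wait_idle(self):
        """Block until every submitted job has finished (handy in tests)."""
        self._pending.join()
```

An HTTP handler for `POST /run_script` would call `submit()` and return its dictionary with status `202`; `GET /job/{id}` would simply look the record up in `jobs`.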

Submitting a render

POST http://<pc-host>:8765/run_script
{
  "script": "render_base_clip",
  "args": {
    "start_frame": "host_start.png",   // staged to the relay by the Mac
    "lora":        "<character-model>",  // identity model for the host
    "prompt":      "<setting / backdrop only>",
    "width": 832, "height": 480,        // render size (landscape or portrait)
    "framing":     "waist-up medium shot",
    "seconds":     5,                   // one native segment; larger values chain segments
    "motion":      "static",            // optional idle motion (e.g. blink / subtle head)
    "seed":        42,
    "output_name": "scene01"
  },
  "run_id":      "run_2026_0607",       // groups the clips of one video
  "stage_review": true,                  // copy the result to the relay for review
  "run_total":    3                      // how many clips this run will produce
}

-> 202  { "job_id": "a1b2c3d4e5f6", "queue_position": 0 }

Polling it

GET http://<pc-host>:8765/job/a1b2c3d4e5f6

-> { "job_id": "a1b2c3d4e5f6",
     "state":  "complete",
     "result": { "path": "scene01_00001.mp4",
                 "sha256": "…", "wall_clock_s": 1880 },
     "error":  null,
     "returncode": 0 }

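The submit-then-poll exchange above could be driven from the Mac side roughly as follows. This is a sketch using only the Python standard library; the helper names (`http_json`, `submit_and_wait`), the poll interval, and the timeout are assumptions, and the payload you send must be plain JSON (the `//` comments in the example above are annotations, not part of the body).

```python
import json
import time
import urllib.request


def http_json(method, url, body=None):
    """Send one JSON request and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def submit_and_wait(base_url, payload, http=http_json, sleep=time.sleep,
                    poll_s=10.0, timeout_s=7200.0):
    """POST /run_script, then poll GET /job/{id} until the job completes or fails."""
    job_id = http("POST", f"{base_url}/run_script", payload)["job_id"]
    waited = 0.0
    while True:
        job = http("GET", f"{base_url}/job/{job_id}")
        if job["state"] == "complete":
            return job
        if job["state"] == "failed":
            raise RuntimeError(f"job {job_id} failed: {job.get('error')}")
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still {job['state']} after {waited:.0f}s")
        sleep(poll_s)          # renders take minutes, so a slow poll is enough
        waited += poll_s
```

The `http` and `sleep` parameters exist so the loop can be exercised without a network or real delays; in normal use you would pass only `base_url` and the payload.
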
The shared job record

Both machines write and read the same job-record shape, so the orchestrator treats a local Mac job and a remote PC job through one mental model:

{
  "job_id": "…", "label": "…", "script": "…", "args": { … },
  "state":  "queued | running | complete | failed",
  "result": { … } | null,
  "error":  "…"   | null,
  "returncode": 0 | null,
  "created_at": "…"
}
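
One way to hold that record in code is a small dataclass that rejects unknown states. The field names and the four states come from the shape above; the class name, the defaults, and the `finished` helper are illustrative assumptions.

```python
from dataclasses import dataclass, field, fields
from typing import Optional

STATES = ("queued", "running", "complete", "failed")


@dataclass
class JobRecord:
    job_id: str
    state: str
    label: str = ""
    script: str = ""
    args: dict = field(default_factory=dict)
    result: Optional[dict] = None
    error: Optional[str] = None
    returncode: Optional[int] = None
    created_at: str = ""

    def __post_init__(self):
        if self.state not in STATES:
            raise ValueError(f"unknown job state: {self.state!r}")

    @property
    def finished(self) -> bool:
        """True once the job can no longer change state."""
        return self.state in ("complete", "failed")

    @classmethod
    def from_dict(cls, data: dict) -> "JobRecord":
        # Ignore keys this sketch does not model, so extra fields do not break parsing.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```

Because a Mac job and a PC job parse into the same type, code that waits on `finished` or reads `result` does not need to know where the job ran.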

The core call: `render_base_clip`

This is the heart of the video API. Given a start frame (a single still of the host, produced on the Mac) plus the host's identity model and a short description of the setting, the PC animates it into a few seconds of moving video on the GPU and returns the clip's path, a checksum, and the wall-clock time. With stage_review set, the clip is also copied into the relay under its run_id so the person can watch it as soon as it lands and stop the run if it is wrong. Long multi-clip runs are resumable: already-finished clips are reused rather than re-rendered.

The model renders only a few seconds per pass, so a longer shot is built by chaining short segments, each seeded from the previous segment's last frame. Run directly on the GPU, a chain drifts: the host's identity wanders, and the error compounds from clip to clip. So for host shots the Director owns the chain instead: it keeps the identity model on every segment, keeps the segments short, and manages each seam itself (passing the previous frame through for the smoothest join, or re-anchoring it to the host's original frame with an identity pass for the strongest correction). See the story for the reasoning; the controls are seconds (segment length), the seam mode, and which identity reference to anchor to.
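
The chain logic described above can be sketched as a short loop. Everything named here is hypothetical: `render_segment` stands in for a `render_base_clip` submission that returns the clip path and its last frame, `reanchor` stands in for the image-to-image identity pass, and the seam mode names `"passthrough"` and `"reanchor"` are invented for the example.

```python
def render_chain(render_segment, reanchor, start_frame, total_s,
                 segment_s=5, seam="passthrough"):
    """Build one long shot from short segments, managing each seam explicitly.

    render_segment(frame, seconds) -> (clip_path, last_frame_path)   # hypothetical
    reanchor(last_frame, reference) -> corrected_frame_path          # hypothetical
    """
    if seam not in ("passthrough", "reanchor"):
        raise ValueError(f"unknown seam mode: {seam!r}")
    clips = []
    frame = start_frame
    remaining = total_s
    while remaining > 0:
        seconds = min(segment_s, remaining)       # keep every segment short
        clip, last_frame = render_segment(frame, seconds)
        clips.append(clip)
        remaining -= seconds
        if remaining > 0:
            if seam == "passthrough":
                frame = last_frame                # smoothest join
            else:
                # Strongest correction: pull the seam frame back toward the
                # host's original start frame before seeding the next segment.
                frame = reanchor(last_frame, start_frame)
    return clips
```

With `seam="reanchor"`, every segment after the first is seeded from a frame corrected against the original reference, which trades a slightly less smooth join for less accumulated drift.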

Characters & outfits

A host is defined once in a small library and then referenced by name. A character entry holds a locked appearance, a voice id, an identity-model trigger token, and a set of outfits; each outfit holds a reference still, wardrobe text, and its own trained identity model:

{
  "host": {
    "voice":      "<voice-id>",
    "trigger":    "<lora-token>",
    "appearance": "age / build / hair — fixed across all outfits",
    "outfits": {
      "default":       { "reference": "ref.png", "wardrobe": "…", "lora": "<model>" },
      "business_suit": { "reference": "ref.png", "wardrobe": "…", "lora": "<model>" }
    }
  }
}

A shot names a character and an outfit; the studio resolves them to the reference + wardrobe (for the start frame), the identity model (for render_base_clip), and the voice (for narration). Base clips are cached per character / outfit / setting, so two shots that share a look reuse one render. New characters and outfits are added with a small capture-and-train flow and then reused across every video.
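
The resolution and caching steps can be illustrated as below. This is a sketch under assumptions: the function names are invented, and hashing the character / outfit / setting triple is just one plausible way to derive a cache key, not necessarily how the studio does it.

```python
import hashlib
import json


def resolve_shot(library, character, outfit):
    """Look up everything a host shot needs from the character library."""
    entry = library[character]
    look = entry["outfits"][outfit]
    return {
        "reference": look["reference"],   # for the start frame
        "wardrobe": look["wardrobe"],     # for the start frame
        "lora": look["lora"],             # for render_base_clip
        "trigger": entry["trigger"],
        "voice": entry["voice"],          # for narration
        "appearance": entry["appearance"],
    }


def base_clip_key(character, outfit, setting):
    """Stable cache key: shots with the same look and setting share one render."""
    blob = json.dumps([character, outfit, setting], ensure_ascii=False)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]
```

Two shots that name the same character, outfit, and setting produce the same key, so the second one can reuse the first one's base clip instead of queuing another render.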

The episode layer — a video from a written script

Above the per-clip API sits a small compiler that turns one episode script — a JSON file of typed shots — into a finished multi-minute master, headlessly. The orchestrator validates the script, generates or reuses each host's base clips through the PC, then runs the Mac finishing steps; re-running it on the same file rebuilds the cut.

Shot type	What it produces
`host` / `host_pip`	A lip-synced presenter in a full scene, or as a corner inset over footage.
`narration`	Voice over a background clip or screen capture — no host, no lip-sync.
`title`	A title card: animated text elements over a plate.
`montage`	Stylized full-frame scenes (e.g. holograms) materializing in sequence under a voiceover.

Each shot carries a spoken timeline; the script also declares overlays (chyrons / lower-thirds), a music bed with one-shot stings, per-shot transitions (cut / dissolve / fade), and camera moves. A validator checks structure and screens the narration against content rules before anything renders.

{
  "fps": 60, "canvas": { "w": 1920, "h": 1080 },
  "music": { "bed": { "source": "prompt", "value": "warm synth bed" },
             "cues": [ { "id": "chime", "at_s": 21, "source": "prompt", "value": "ui chime" } ] },
  "shots": [
    { "id": "01", "type": "host", "character": "host", "outfit": "default",
      "setting": "<scene>", "camera": "push_in",
      "timeline": [ { "say": "…" } ], "transition_out": "dissolve" },
    { "id": "02", "type": "title", "duration_s": 8,
      "elements": [ { "text": "Title", "style": "title_magenta", "anim": "materialize" } ],
      "overlays": [ { "text": "E01", "style": "chyron", "pos": "lower_right" } ] },
    { "id": "03", "type": "narration", "background": { "source": "file", "value": "capture.mov" },
      "voiceover": [ { "say": "…" } ] }
  ]
}
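
A structural check over a script like the one above might look like this. It is a sketch only: the specific rules (which fields are required, `cut` as the default transition) are assumptions inferred from the example, and the content screening of narration is not modelled here.

```python
SHOT_TYPES = {"host", "host_pip", "narration", "title", "montage"}
TRANSITIONS = {"cut", "dissolve", "fade"}


def validate_episode(script, library):
    """Return a list of human-readable problems; an empty list means the script passed."""
    errors = []
    fps = script.get("fps")
    if not isinstance(fps, int) or fps <= 0:
        errors.append("fps must be a positive integer")
    canvas = script.get("canvas") or {}
    if not all(isinstance(canvas.get(k), int) and canvas[k] > 0 for k in ("w", "h")):
        errors.append("canvas needs positive integer w and h")
    shots = script.get("shots") or []
    if not shots:
        errors.append("script has no shots")
    seen = set()
    for index, shot in enumerate(shots):
        sid = shot.get("id", f"#{index}")
        if sid in seen:
            errors.append(f"shot {sid}: duplicate id")
        seen.add(sid)
        kind = shot.get("type")
        if kind not in SHOT_TYPES:
            errors.append(f"shot {sid}: unknown type {kind!r}")
            continue
        if kind in ("host", "host_pip"):
            # A host shot must name a character and outfit that exist in the library.
            outfits = library.get(shot.get("character"), {}).get("outfits", {})
            if shot.get("outfit") not in outfits:
                errors.append(f"shot {sid}: unknown character or outfit")
        if kind == "title" and not shot.get("duration_s", 0) > 0:
            errors.append(f"shot {sid}: title needs a positive duration_s")
        if shot.get("transition_out", "cut") not in TRANSITIONS:
            errors.append(f"shot {sid}: unknown transition {shot['transition_out']!r}")
    return errors
```

Collecting every problem before returning, rather than stopping at the first, lets a person fix the whole script in one pass before any GPU time is spent.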

Mac Director — the scripts it orchestrates

On the Mac side, each capability is a small command-line tool with a uniform contract: arguments in, a single line of JSON out ({"status":"ok", …} or {"status":"error", …}), progress to the side. This uniformity means the studio can be driven two ways: directly from a terminal or script, or by talking to the always-on local model in plain language. Every tool below is registered with that model, so it can plan a request and call the right tools in order. (The model only offers tools it has been registered for, so a newly added tool must be registered with it before the model will use it.)
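
A caller can lean on that contract with a very small wrapper. The sketch below assumes the JSON line arrives on standard output and progress goes elsewhere (for example standard error); the function names and the `message` field read from an error payload are assumptions, not part of the documented contract.

```python
import json
import subprocess


def parse_tool_output(stdout):
    """Decode the single JSON status line a tool prints, raising on an error status."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("tool produced no output")
    try:
        payload = json.loads(lines[-1])   # tolerate stray lines before the final one
    except json.JSONDecodeError as exc:
        raise ValueError(f"last output line is not JSON: {lines[-1]!r}") from exc
    if not isinstance(payload, dict) or payload.get("status") not in ("ok", "error"):
        raise ValueError("output is missing a valid 'status' field")
    if payload["status"] == "error":
        # 'message' is a hypothetical field name for the error detail.
        raise RuntimeError(payload.get("message", "tool reported an error"))
    return payload


def run_tool(argv):
    """Run one command-line tool and return its decoded status payload."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    return parse_tool_output(proc.stdout)
```

Because every tool answers in the same shape, the same wrapper serves a shell script, an orchestrator, and the tool-calling layer of the local model.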

Script	What it does
`generate_image`	Diffusion still — the identity-locked host start frame the PC animates, plus an image-to-image identity re-anchor pass used to correct chain seams.
`render_title`	Renders animated title cards, chyrons, and lower-thirds (text the model can't reliably draw) for compositing.
`generate_voiceover`	Text-to-speech narration in the host's locked voice.
`generate_music`	Generates a music bed from a text prompt.
`lipsync_clip`	Matches the host's mouth to the narration audio.
`compose_video`	Composites a background, picture-in-picture host, overlays/captions, voice, and ducked music into one clip.
`assemble_timeline`	Turns a scene list into a finished film: per-scene audio, idle-fill, lip-sync, sequencing, and a single music track.
`assemble_episode`	The episode editor: sequences every shot type (host, narration, title, montage), burns in overlays, mixes the bed and stings, and applies transitions and camera moves.
`validate_episode`	Schema-checks an episode script and screens its narration against content rules before any render.
`prepare_delivery`	Formats a finished master to an exact platform/device size (e.g. App Store or YouTube resolutions) and performs the final upscale.
`stage_to_relay`	Hands a file to the shared folder for the PC, with a checksum and a journal note.
`render_screenplay`	The short-video orchestrator (one continuous scene): validates a plan, drives the PC per clip, runs the Mac finishing steps. Also home to the drift-managed extended-clip chain.
`render_episode`	The episode orchestrator: validate → generate/reuse host base clips on the PC → `assemble_episode` → master. Resumable, with a dry-run cost estimate.
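
As an illustration of the hand-off step, here is roughly what a `stage_to_relay`-style tool could do: copy the file, checksum it, and append a note to the journal. The journal file name (`journal.jsonl`), its fields, and the return payload are assumptions made for this sketch.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large clips do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_to_relay(src, relay_dir, note=""):
    """Copy src into the relay folder and record the hand-off in the journal."""
    src = Path(src)
    relay = Path(relay_dir)
    relay.mkdir(parents=True, exist_ok=True)
    dest = relay / src.name
    shutil.copy2(src, dest)
    checksum = sha256_of(dest)            # hash the copy, which is what the PC will read
    entry = {
        "at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "file": dest.name,
        "sha256": checksum,
        "note": note,
    }
    # Append-only: open in "a" mode and never rewrite earlier lines.
    with open(relay / "journal.jsonl", "a", encoding="utf-8") as journal:
        journal.write(json.dumps(entry) + "\n")
    return {"status": "ok", "path": str(dest), "sha256": checksum}
```

The receiving side can recompute the checksum before using the file, which catches a copy that was read while still being written to the network share.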

End-to-end flow

A plain-language request becomes a small, schema-validated plan — a short "screenplay" or a full "episode script" — that is content-checked before any render.
For each new host shot, the Mac generates an identity-locked start frame and stages it to the shared folder with stage_to_relay.
The Mac POSTs render_base_clip to the PC and polls the job_id; the PC animates the frame on the GPU and stages the clip back for review. A shot longer than one segment is built as a director-managed chain — render a segment, manage the seam, seed the next — so the host's identity holds across the whole shot.
The Mac promotes the accepted clips, then runs voice, music, lip-sync, and compositing — and, for an episode, title cards, overlays, a music bed with stings, transitions, and any montage.
The shots are assembled into a master, then formatted and upscaled to the target delivery size.

Already-rendered clips are cached, so re-runs are cheap and the person can stop and resume at any point.

Tailoring the API to your hardware

Every choice above is downstream of the hardware. The asynchronous submit-and-poll model, the first-in-first-out GPU queue, the shared-folder hand-off, and the always-on orchestrator all exist because this studio pairs an always-on, unified-memory Apple Silicon Mac with a discrete-GPU NVIDIA PC. That is the reference configuration — not a requirement.

The same contract re-targets cleanly to a different rig by redrawing only the functional split:

If your hardware is…	…the split shifts to
A single multi-GPU workstation	One machine; the "remote" job server becomes a local queue, no relay folder needed.
Two GPU machines	Render is sharded across both job servers; the queue spans the pool.
A laptop plus a cloud GPU node	The Supervisor moves to the cloud; the relay becomes object storage; the contract is unchanged.
One modest machine	Heavier models swap for lighter ones; long jobs stay asynchronous so the UI never blocks.

The endpoints, the job record, and the orchestration flow stay the same; only where each step runs changes. Claude Code can profile a given configuration and produce the matching split and API — which is the real product of this project: a video-generation API custom-fit to the hardware it runs on.

This page is a complete enough blueprint to act on. Point your own Claude Code at it, tell it what hardware you have, and it can design and build an equivalent pipeline for you — tailoring the functional split to your machines while keeping a contract shaped like this one. You would be re-deriving the implementation for your setup, not copying ours.
