structure-as-control beats prompt-roulette

the usual way to get a composition out of an image model is to describe it and re-roll until the dice land. give it the geometry instead — and stop asking the slot machine to be your camera operator.

we built a translator that turns grey blocking — clay/matcap greybox renders straight out of a 3d program, exact camera and geometry, zero materials or light — into finished, painted film plates, via image-to-image generation.

the trick

the trick is the role the greybox plays. it is not a vibe, it's a hard constraint. the generator is told, in plain language: "this is the final geometry, camera and layout. the pastel matcap colours are material ids, not final colours. keep every contour, silhouette and camera position exactly. only dress the shapes in the style." the clay carries the composition; the model only gets to decide paint.

the pipeline: export the clay → feed it as the structure reference → stack the prompt in a fixed order (universe → art-direction → treatment → "fully realise it" → structure-lock → the rules → scene read → creative notes → negative) → get back a plate that matches the layout exactly but is fully painted.
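the fixed prompt order above can be sketched as a small assembler. this is a minimal sketch, not the translator's actual code: the slot names are just the stages from the pipeline line turned into hypothetical python identifiers, and returning the negative as a separate string is an assumption about how the generator takes it.

```python
# hypothetical slot names, mirroring the order described in the pipeline.
# "negative" is handled separately (assumption: the generator takes it as its own field).
PROMPT_ORDER = [
    "universe",
    "art_direction",
    "treatment",
    "fully_realise",
    "structure_lock",
    "rules",
    "scene_read",
    "creative_notes",
]


def stack_prompt(slots: dict[str, str]) -> tuple[str, str]:
    """Return (positive, negative) with positive slots joined in the fixed order.

    Empty or missing slots are skipped; unknown slot names are rejected so a
    typo cannot silently drop a section such as the structure-lock.
    """
    unknown = set(slots) - set(PROMPT_ORDER) - {"negative"}
    if unknown:
        raise KeyError(f"unknown prompt slots: {sorted(unknown)}")
    parts = [slots[name].strip() for name in PROMPT_ORDER if slots.get(name, "").strip()]
    positive = "\n\n".join(parts)
    negative = slots.get("negative", "").strip()
    return positive, negative
```

the point of the fixed list is that the caller can fill slots in any order and the structure-lock still lands in the same place every time.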

before / after

the tool ships a wipe slider so you can drag between the two states, and choose which image sits on each side:

[wipe slider: one wide shot from the lab. grey clay blocking on the left (exact geometry and camera, no materials or light), the same shot dressed as a painted nocturnal film plate on the right. drag the divider, then switch the hour: same bones, different paint.]

left (blocking): flat grey clay. correct shapes, correct camera, correct depth — and zero soul. a 3d scene with the lights off.
right (render): the same frame, same silhouettes, same camera, now a moody painted film plate — value steps, warm light pools, atmosphere — without a single contour having moved.
and because every variant is selectable on both sides, you can wipe render-vs-render too: the same geometry as night, early-morning, burning, reclaimed-by-nature. same bones, different hour. that's the payoff shot — it makes "the structure is locked, only the treatment changed" something you can see, not just claim.
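"same bones, different hour" amounts to holding every structural input fixed and swapping only the treatment. a minimal sketch of that idea, assuming a hypothetical `ShotRequest` record (the field names and the file path are illustrative, not the tool's real schema):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ShotRequest:
    # all field names are hypothetical, for illustration only
    structure_image: str  # path to the clay export used as structure reference
    structure_lock: str   # the "keep every contour" instruction
    art_direction: str    # the pinned, named look
    treatment: str        # the only part that changes between variants


def hour_variants(base: ShotRequest, treatments: dict[str, str]) -> dict[str, ShotRequest]:
    """Build one request per treatment; everything except `treatment` is copied unchanged."""
    return {name: replace(base, treatment=text) for name, text in treatments.items()}


if __name__ == "__main__":
    base = ShotRequest(
        structure_image="lab_wide_clay.png",  # hypothetical filename
        structure_lock="this is the final geometry, camera and layout.",
        art_direction="a painted film frame, not a recolour of grey clay",
        treatment="night",
    )
    # variant names come from the post; the treatment text is placeholder wording
    variants = hour_variants(base, {
        "night": "night",
        "early-morning": "early morning",
        "burning": "burning",
        "reclaimed-by-nature": "reclaimed by nature",
    })
    for name, req in variants.items():
        print(name, "->", req.treatment, "| structure:", req.structure_image)
```

the frozen dataclass is a deliberate choice here: a variant cannot accidentally mutate the shared structure inputs, which is the code-level version of "the structure is locked".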

the lesson

the usual way to get a specific composition out of an image model is to describe it in words and re-roll until the dice land — you fight the model for camera and layout on every generation, and it drifts every time. give the model the geometry as an actual input and you stop gambling on the part you already know. you decide composition upstream, where you have real tools (a 3d viewport), and you spend the model only on the part it's genuinely good at: surfaces, light, mood.

stop asking the slot machine to also be your camera operator.

let each tool do the thing it's actually good at.

the catch

it has real failure modes, and pretending otherwise would be hype.

"too literal" — it comes back as tinted clay. sparse, object-centric blockings with thin direction make the model just recolour the greybox instead of realising it. the fix is unsubtle: front-load the art direction emphatically ("a painted film frame, not a recolour of grey clay"), and explicitly bar "raw 3d render / recolour" in the negative. the constraint and the creativity have to be cranked together.
the style has to be pinned or it drifts. "painterly" drifted too loose and went off-model; full 3d/pbr went too photoreal. landing a specific, named look — with reference frames — was the difference between consistent output and a different style every roll.
some things it just can't do — so don't ask. there's a signature element in this world the model can't render convincingly. the answer isn't a better prompt, it's: bar it entirely in the negative, keep the ai output clean, and add that element by hand afterward. know which 5% to keep out of the machine's hands.
characters and continuity still need a human. adding figures to a scene is a controlled break in the structure-lock and needs supervision; holding one set consistent across several camera angles is unproven — that's a human-judgment pass, not something the translator closes on its own yet.
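two of the fixes above are mechanical enough to sketch: always barring "raw 3d render / recolour" in the negative, and barring the element the model can't render so it can be added by hand later. a minimal sketch, assuming the negative is a comma-separated string (an assumption about the generator, not something stated above); the element name in the usage is a placeholder because the post doesn't say what it is.

```python
# always-on bans taken from the "too literal" fix
BASE_NEGATIVE = ["raw 3d render", "recolour of grey clay"]


def build_negative(barred_elements: list[str], base: list[str] = BASE_NEGATIVE) -> str:
    """Merge always-on bans with shot-specific bans.

    De-duplicates case-insensitively and keeps first-seen order, so the
    base bans always lead the list.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for term in [*base, *barred_elements]:
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return ", ".join(merged)


if __name__ == "__main__":
    # "<signature element>" is a placeholder, not a real term from the project
    print(build_negative(["<signature element>"]))
```

keeping the base bans in one constant means a new shot can add its own exclusions without anyone forgetting the recolour guard.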

net: it removes the gamble from composition. it does not remove taste, art direction, or the final human pass. it makes the machine reliable at the one job you constrained it to — which is the whole point.