The AI production pipeline comes of age: the GenAI scorecard / watchlist

Grading the AI Production Pipeline

In the Land of AI, nothing stays still for long. The AI production pipeline is rapidly moving out of the R&D lab and into commercial enterprises and creative agencies. New models and startups, with genuinely boundary-pushing technology, pop up almost daily. It’s a lot to keep track of. This is my by-no-means-comprehensive and totally subjective review of the current SotA (state-of-the-art) models for content generation, primarily in the buckets of 2d image generation, video generation, and the nascent field of 3d model generation.

A common pipeline today for AI video production uses a hybrid of prompted AI tools and traditional software packages:

[2d “Ai-Art” engine] >> [post/adjust/upscale]
>> [generative video engine] >> [NLE]

Here’s a really simple visualization of how such an AI production pipeline plays out: the top images are straight “text-to-image” renders, prompts fed into the MidJourney AI-art engine. Those are then fed into Runway’s Gen3 video generator, alongside additional text prompts describing the action, and the result is the video on the lower half of the media here:

It’s critically important to understand the driving concept here: NONE OF THOSE PEOPLE ARE REAL, NONE OF THOSE PLACES EXIST, and NONE OF THOSE EVENTS EVER HAPPENED. That entire set of imagery (top), and that entire video (bottom), exists only in the imagination of the AIs inhabiting some of the world’s largest data centers (today’s version of supercomputers).

Let’s take a look at some of the startups that are creating this miraculous content imagination technology and revolutionizing the AI production pipeline, a subset of GenAI:

Leading Generative Video Engines & Notes

Runway / Gen3

  • $200M in funding; a leader in generative video
  • offers a full web-based tool suite
  • widely recognized as the highest-quality model on the market (for now)

Wonder Dynamics / WonderStudio

  • superior AI rotoscoping (human forms / mocap)
  • acquired by Autodesk (June 2024)

Luma Labs

  • also offers a 3d model generation product (Genie)

Pika Labs

OpenAI / Sora

  • apparently a strong SotA model
  • broke significant new ground on initial press launch (Feb 2024)
  • not yet available for public / commercial release
  • camera moves through 3D worlds appear to have strong coherency and stability
    • (tested by bringing generated video output into a photogrammetry engine and extracting a 3d model)

2d graphics / photos / illustration “AI-art” engines

  • winner, photorealism: MidJourney v6
  • winner, text rendering: OpenAI / DALL-E 3
  • winner, most hype: Black Forest / FLUX
  • winner, local model: Stable Diffusion / ComfyUI
  • winner, actual AI production pipeline: Adobe / Photoshop / Firefly

Pro Features required for GenAI:

There are a number of key features “required” by professional users that extend far beyond the simple prompting (i.e. “here’s a text description / make me an image”) that most general consumers of AI-art apps utilize.

Those generative AI pro features include:

  • style reference
    • conforming all generated imagery to a set of visual style guidelines
    • (e.g. graphic novel, Pixar animation, fashion photography)
    • often this references the styles of particular individual human artists,
      which has serious implications for copyright and IP infringement law
  • character reference
    • ensuring visual consistency / continuity
    • one of the current Holy Grails of AI-generated imagery is maintaining the consistency of a particular character, including face, hairstyle, and costume, across multiple still images or, in the case of video, across multiple shots. Workarounds and solutions are available, but this remains one of the “hard problems” of the industry, and the variance, especially in human facial features, across scenes creates a sometimes unconscious “uncanny valley” for end-use content consumers.
    • this common scenario is demonstrated clearly in the composite image below (rendered using Leonardo.ai): the need to represent an individual human across a wide range of environments, lighting conditions, and costumes:

  • prop reference
    • same as with people, the ability to deploy a detailed model of a physical prop or product, and have it be reliably and consistently represented across a series of renders or scenes
  • set reference
    • same as with props, the ability to supply a detailed model of a location or set environment, and have it be reliably and consistently represented across a series of renders or scenes
  • camera control
    • important for still images, where creators like to specify camera type, lens type, f-stop, depth of field, camera angle, film grain, etc.
    • important for video generation, where directors like to specify both the lens and the camera path (movement) or zoom
  • color grading
    • conforming the generated imagery to an established palette / tone for the overall production / scene
  • inpainting
    • making AI variants of only a single area of an image
  • outpainting
    • synthesizing what is “outside the crop” of the original image (which can be AI-generated, human-drawn, or even photographic): essentially an AI-imagined “zoom out” from the original, when there is no data or evidence available of what was actually “outside” the original image. For instance:

Similarly, AI inpainting is used, via text prompt and/or visual / lasso selection, to alter / modify a single portion of the image, without affecting / re-rendering the surrounding image area:
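Under the hood, inpainting and outpainting are essentially the same operation: re-synthesize only the masked region of an image, conditioned on a text prompt and on the unmasked pixels. Here is a minimal sketch of that call using the open Stable Diffusion inpainting model via Hugging Face’s diffusers library; the model ID, filenames, and prompt are placeholder assumptions for illustration, not a recommendation of any particular tool.

```python
# Minimal inpainting sketch using Hugging Face diffusers (assumed model ID).
# "photo.png" and "mask.png" are placeholders: the mask is white where the
# image should be re-synthesized and black where it must be preserved.
# Outpainting works the same way: pad the original onto a larger canvas and
# let the mask cover the newly added border area.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))

# Only the masked region is re-rendered; surrounding pixels stay untouched.
result = pipe(
    prompt="a red leather handbag sitting on the table",
    image=image,
    mask_image=mask,
).images[0]
result.save("photo_inpainted.png")
```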

Of course, all of this has been done with traditional tools: Photoshop, After Effects, and millions of hours of skilled human labor, for years. The difference, as with many things AI, is the blinding speed and cost-efficiency with which these tasks can now be accomplished. It literally costs mere pennies (or less!) and takes seconds to make these alterations, or even to synthesize these photographs and illustrations from scratch. A $20,000 photo shoot is now a $2 render. A $5,000 photo edit taking days is now a $1 inpaint taking seconds. Seismic.

There remain, however, some quite sticky / nagging challenges for industry players:

  • legal rights issues of training data
  • artist compensation mechanisms
  • the scary 6-fingered hands
  • text rendering accuracy

The Gen AI Production Pipeline, Expanded

We outlined the basic AI production pipeline formula above:

  1. [2d “Ai-Art” engine] — TTI (text-to-image)
  2. >> [post/adjust/inpaint/upscale]
  3. >> [generative video engine] — ITV (image-to-video) + TTV (text-to-video)
  4. >> [NLE] — final cut

This is how it looks, expanded a bit more into its granular steps:

  1. create a 3d model in Blender / Maya (just the 3d geometry, no texturing or lighting necessary); this is only to create a depthmap that guides the image generator
  2. input that greyscale output into ComfyUI / StableDiffusion / ControlNet (see the code sketches after this list)
    1. and append a text prompt
    2. prompt describes lighting & specific scene characteristics (ethnicity, background, props, color grading, etc.)
  3. export the (single plate) image output from the render engine
  4. import that image as the starting point for the video generator engine
  5. generate the video asset based on the still image plus additional text prompt
    1. text prompt describes scene animation / camera movement / pan / zoom / etc.
  6. do this n times for each scene / video segment you want
  7. use a traditional NLE (nonlinear video editor) to assemble footage, add sound bed / voiceover, manage cuts / transitions / timing.
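Steps 1 through 3 can be scripted directly against open models. Below is a minimal sketch, assuming Hugging Face’s diffusers library with a depth-conditioned ControlNet as a stand-in for the ComfyUI / StableDiffusion / ControlNet graph described above; the model IDs, the “depthmap.png” filename, and the prompt are illustrative assumptions, not part of any particular production setup.

```python
# Steps 1-3 sketch: depth-map-guided still generation (assumed model IDs).
# "depthmap.png" stands in for the greyscale depth render exported from
# Blender / Maya in step 1.
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Depth-conditioned ControlNet plus a base Stable Diffusion checkpoint.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

depth_map = Image.open("depthmap.png").convert("RGB")

# Step 2: the prompt carries lighting and scene specifics (background,
# props, color grading, etc.); the depth map locks the composition.
prompt = (
    "studio portrait, soft key light from camera left, warm color grade, "
    "shallow depth of field, 85mm lens"
)

# Step 3: export the single plate that seeds the video generator.
plate = pipe(prompt, image=depth_map, num_inference_steps=30).images[0]
plate.save("plate_scene_01.png")
```

Steps 4 and 5 can be approximated with an open image-to-video model such as Stable Video Diffusion. Note that, unlike a commercial engine such as Runway’s Gen3, this particular open model takes only the still image, not an additional text prompt describing the action, so treat it as a rough stand-in rather than a like-for-like substitute; the model ID and filenames below are again assumptions.

```python
# Steps 4-5 sketch: animate the still plate with Stable Video Diffusion
# (assumed model ID) as a rough open-source stand-in for a commercial
# image-to-video engine.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

# Step 4: the plate rendered above becomes the anchor image.
plate = load_image("plate_scene_01.png").resize((1024, 576))

# Step 5: generate a short clip; motion_bucket_id loosely controls how
# much movement the model introduces.
frames = pipe(plate, decode_chunk_size=4, motion_bucket_id=127).frames[0]
export_to_video(frames, "scene_01.mp4", fps=7)

# Step 6: repeat per scene, then hand the clips to a traditional NLE (step 7).
```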

Trad NLEs:

  • Adobe Premiere Pro
  • Apple Final Cut Pro
  • Avid Media Composer
  • Blackmagic DaVinci Resolve



prompt: “a shiny chrome humanoid robot stands in an art class, painting at an easel. the model stands center. other robot students are at easels in a circle surrounding the model. a human art teacher stands next to the subject, and places a post-it note reading “A+” atop the upper right corner of the painting.”

engine: MidJourney v6.1

