The AI production pipeline comes of age: the GenAI scorecard / watchlist

Grading the AI Production Pipeline

In the Land of AI, nothing stays still for long. The AI production pipeline is rapidly moving out of the R&D lab and into commercial enterprises and creative agencies. New models and startups, with genuinely boundary-pushing technology, pop up almost daily. It’s a lot to keep track of. This is my by-no-means-comprehensive and totally subjective review of the current SotA (state-of-the-art) models for content generation, primarily in the buckets of 2d image generation, video generation, and the nascent field of 3d model generation.

A common pipeline today for AI video production uses a hybrid of prompted AI tools and traditional software packages:

[2d “Ai-Art” engine] >> [post/adjust/upscale]
>> [generative video engine] >> [NLE]

Here’s a really simple visualization of how such an AI production pipeline plays out: the top images are straight “text-to-image” renders, prompts fed into the MidJourney AI-art engine. Those are then fed into Runway’s Gen3 video generator, alongside additional text prompts describing the action, and the result is the video on the lower half of the media here:

It’s critically important to understand the driving concept here: NONE OF THOSE PEOPLE ARE REAL, NONE OF THOSE PLACES EXIST, and NONE OF THOSE EVENTS EVER HAPPENED. That entire set of imagery (top), and that entire video (bottom), exists only in the imagination of the AIs inhabiting some of the world’s largest data centers (today’s version of supercomputers).

Let’s take a look at some of the startups that are creating this miraculous content imagination technology and revolutionizing the AI production pipeline, a subset of GenAI:

Leading Generative Video Engines & Notes

Runway / Gen3

  • $200M in funding; a leader in generative video
  • offers a full web-based tool suite
  • widely recognized as the highest-quality model on the market (for now)

Wonder Dynamics / WonderStudio

  • superior AI rotoscoping (human forms / mocap)
  • acquired by Autodesk (June 2024)

Luma Labs

  • also offers a 3d model generation product (Genie)

Pika Labs

OpenAI / Sora

  • apparently a strong SotA model
  • broke significant new ground on initial press launch (Feb 2024)
  • not yet available for public / commercial release
  • camera moves through 3D worlds appear to have strong coherency and stability
    • (tested by bringing generated video output into a photogrammetry engine and extracting a 3d model)

2d graphics / photos / illustration “AI-art” engines

  • winner, photorealism: MidJourney v6
  • winner, text rendering: OpenAI / DALL-E 3
  • winner, most hype: Black Forest / FLUX
  • winner, local model: Stable Diffusion / ComfyUI
  • winner, actual AI production pipeline: Adobe / Photoshop / Firefly

Pro Features required for GenAI:

There are a number of key features “required” by professional users that extend far beyond the simple prompting (i.e. “here’s a text description / make me an image”) that most general consumers of AI-art apps utilize.

Those generative AI pro features include:

  • style reference
    • conforming all generated imagery to a set of visual style guidelines
    • (e.g. graphic novel, Pixar animation, fashion photography)
    • often this references the styles of particular individual human artists,
      which has serious implications for copyright and IP infringement law
  • character reference
    • ensuring visual consistency / continuity
    • one of the current Holy Grails of AI-generated imagery is maintaining the consistency of a particular character, including face, hairstyle, and costume, across multiple still images or, in the case of video, across multiple shots. Workarounds and solutions are available, but this remains one of the “hard problems” of the industry, and the variance, especially in human facial features, across scenes creates a sometimes unconscious “uncanny valley” for end-use content consumers.
    • this common scenario is demonstrated clearly in the composite image below (rendered using Leonardo.ai): the need to represent an individual human across a wide range of environments, lighting conditions, and costumes:

  • prop reference
    • same as with people, the ability to deploy a detailed model of a physical prop or product, and have it be reliably and consistently represented across a series of renders or scenes
  • set reference
    • same as with props, the ability to supply a detailed model of a location or set environment, and have it be reliably and consistently represented across a series of renders or scenes
  • camera control
    • important for still images, where creators like to specify camera type, lens type, f-stop, depth of field, camera angle, film grain, etc.
    • important for video generation, where directors like to specify both the lens and the camera path (movement) or zoom
  • color grading
    • conforming the generated imagery to an established palette / tone for the overall production / scene
  • inpainting
    • making AI variants of only a single area of an image
  • outpainting
    • synthesizing what is “outside the crop” of the original image (which can be AI-generated, human-drawn, or even photographic): essentially an AI-imagined “zoom out” from the original, when there is no data or evidence available of what was actually “outside” the original image. For instance:

Similarly, AI inpainting is used, via text prompt and/or visual / lasso selection, to alter / modify a single portion of the image, without affecting / re-rendering the surrounding image area:
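Under the hood, inpainting and outpainting are essentially the same operation: re-synthesize only the masked region of an image, conditioned on a text prompt and on the unmasked pixels. Here is a minimal sketch of that call using the open Stable Diffusion inpainting model via Hugging Face’s diffusers library; the model ID, filenames, and prompt are placeholder assumptions for illustration, not a recommendation of any particular tool.

```python
# Minimal inpainting sketch using Hugging Face diffusers (assumed model ID).
# "photo.png" and "mask.png" are placeholders: the mask is white where the
# image should be re-synthesized and black where it must be preserved.
# Outpainting works the same way: pad the original onto a larger canvas and
# let the mask cover the newly added border area.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))

# Only the masked region is re-rendered; surrounding pixels stay untouched.
result = pipe(
    prompt="a red leather handbag sitting on the table",
    image=image,
    mask_image=mask,
).images[0]
result.save("photo_inpainted.png")
```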

Of course, all of this has been done with traditional tools: Photoshop, After Effects, and millions of hours of skilled human labor, for years. The difference, as with many things AI, is the blinding speed and cost-efficiency with which these tasks can now be accomplished. It literally costs mere pennies (or less!) and takes seconds to make these alterations, or even to synthesize these photographs and illustrations from scratch. A $20,000 photo shoot is now a $2 render. A $5,000 photo edit taking days is now a $1 inpaint taking seconds. Seismic.

There remain, however, some quite sticky / nagging challenges for industry players:

  • legal rights issues of training data
  • artist compensation mechanisms
  • the scary 6-fingered hands
  • text rendering accuracy

The Gen AI Production Pipeline, Expanded

We outlined the basic AI production pipeline formula above:

  1. [2d “Ai-Art” engine] — TTI (text-to-image)
  2. >> [post/adjust/inpaint/upscale]
  3. >> [generative video engine] — ITV (image-to-video) + TTV (text-to-video)
  4. >> [NLE] — final cut

This is how it looks, expanded a bit more into its granular steps:

  1. create a 3d model in Blender / Maya (just the 3d geometry, no texturing or lighting necessary); this is only to create a depthmap that guides the image generator
  2. input that greyscale output into ComfyUI / StableDiffusion / ControlNet (see the code sketches after this list)
    1. and append a text prompt
    2. prompt describes lighting & specific scene characteristics (ethnicity, background, props, color grading, etc.)
  3. export the (single plate) image output from the render engine
  4. import that image as the starting point for the video generator engine
  5. generate the video asset based on the still image plus additional text prompt
    1. text prompt describes scene animation / camera movement / pan / zoom / etc.
  6. do this n times for each scene / video segment you want
  7. use a traditional NLE (nonlinear video editor) to assemble footage, add sound bed / voiceover, manage cuts / transitions / timing.
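Steps 1 through 3 can be scripted directly against open models. Below is a minimal sketch, assuming Hugging Face’s diffusers library with a depth-conditioned ControlNet as a stand-in for the ComfyUI / StableDiffusion / ControlNet graph described above; the model IDs, the “depthmap.png” filename, and the prompt are illustrative assumptions, not part of any particular production setup.

```python
# Steps 1-3 sketch: depth-map-guided still generation (assumed model IDs).
# "depthmap.png" stands in for the greyscale depth render exported from
# Blender / Maya in step 1.
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Depth-conditioned ControlNet plus a base Stable Diffusion checkpoint.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

depth_map = Image.open("depthmap.png").convert("RGB")

# Step 2: the prompt carries lighting and scene specifics (background,
# props, color grading, etc.); the depth map locks the composition.
prompt = (
    "studio portrait, soft key light from camera left, warm color grade, "
    "shallow depth of field, 85mm lens"
)

# Step 3: export the single plate that seeds the video generator.
plate = pipe(prompt, image=depth_map, num_inference_steps=30).images[0]
plate.save("plate_scene_01.png")
```

Steps 4 and 5 can be approximated with an open image-to-video model such as Stable Video Diffusion. Note that, unlike a commercial engine such as Runway’s Gen3, this particular open model takes only the still image, not an additional text prompt describing the action, so treat it as a rough stand-in rather than a like-for-like substitute; the model ID and filenames below are again assumptions.

```python
# Steps 4-5 sketch: animate the still plate with Stable Video Diffusion
# (assumed model ID) as a rough open-source stand-in for a commercial
# image-to-video engine.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16
).to("cuda")

# Step 4: the plate rendered above becomes the anchor image.
plate = load_image("plate_scene_01.png").resize((1024, 576))

# Step 5: generate a short clip; motion_bucket_id loosely controls how
# much movement the model introduces.
frames = pipe(plate, decode_chunk_size=4, motion_bucket_id=127).frames[0]
export_to_video(frames, "scene_01.mp4", fps=7)

# Step 6: repeat per scene, then hand the clips to a traditional NLE (step 7).
```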

Trad NLEs:

  • Adobe Premiere Pro
  • Apple Final Cut Pro
  • Avid Media Composer
  • Blackmagic DaVinci Resolve



prompt: “a shiny chrome humanoid robot stands in an art class, painting at an easel. the model stands center. other robot students are at easels in a circle surrounding the model. a human art teacher stands next to the subject, and places a post-it note reading “A+” atop the upper right corner of the painting.”

engine: MidJourney v6.1

