The AI WebTV is an experimental demo showcasing the latest advances in automatic video and music synthesis.
👉 Watch the stream now by going to the AI WebTV Space.
If you are using a mobile device, you can view the stream from the Twitch mirror.
Concept
The motivation for the AI WebTV is to demo videos generated with open-source text-to-video and text-to-music models such as Zeroscope and MusicGen, in an entertaining and accessible way.
You can find those open-source models on the Hugging Face Hub:
The individual video sequences are purposely kept short, so the WebTV should be seen as a tech demo/showreel rather than an actual show (with an art direction or programming).
Architecture
The AI WebTV works by taking a sequence of video shot prompts and passing them to a text-to-video model to generate a sequence of takes.
Moreover, a base theme and idea (written by a human) are passed through an LLM (in this case, ChatGPT) in order to generate a variety of individual prompts for each video clip.
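As a rough sketch of that prompt-generation step (the function, prompt template, and model settings below are illustrative assumptions, not the WebTV's actual code), this is what calling the OpenAI API from TypeScript could look like:
import OpenAI from "openai"

const openai = new OpenAI()

// Illustrative sketch only: the real prompt template and parameters may differ.
export const generateShotPrompts = async (theme: string, nbShots: number) => {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "You write short, vivid text-to-video prompts for individual video shots." },
      { role: "user", content: `Theme: ${theme}. Write ${nbShots} shot prompts, one per line.` }
    ]
  })

  // one shot prompt per line of the response
  return (completion.choices[0].message.content ?? "").split("\n").filter(Boolean)
}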
Here is a diagram of the current architecture of the AI WebTV:
Implementing the pipeline
The WebTV is implemented in NodeJS and TypeScript, and uses various services hosted on Hugging Face.
The text-to-video model
The central video model is Zeroscope V2, a model based on ModelScope.
Zeroscope consists of two parts that can be chained together:
👉 You'll want to use the same prompt for both the generation and the upscaling.
Calling the video chain
To make a quick prototype, the WebTV runs Zeroscope from two duplicated Hugging Face Spaces running Gradio, which are called using the @gradio/client NPM package. You can find the original Spaces here:
Other Spaces deployed by the community can also be found if you search for Zeroscope on the Hub.
👉 Public Spaces may become overcrowded and paused at any time. If you intend to deploy your own system, please duplicate those Spaces and run them under your own account.
Using a model hosted on a Space
Spaces using Gradio can expose a REST API, which can then be called from Node using the @gradio/client module.
Here is an example:
import { client } from "@gradio/client"

// URL of the Space hosting the text-to-video model
const instance = "*** URL OF THE SPACE ***"

export const generateVideo = async (prompt: string) => {
  const api = await client(instance)

  // call the /run endpoint with the prompt and the generation parameters
  const { data } = await api.predict("/run", [
    prompt,
    42, // seed
    24, // number of frames
    35  // number of inference steps
  ])

  const { orig_name } = data[0][0]

  // the generated file lives inside the Space, so we build its download URL
  const remoteUrl = `${instance}/file=${orig_name}`
  return remoteUrl
}
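Since Zeroscope runs in two passes, the two Space calls can simply be chained. Here is a minimal sketch, assuming a hypothetical upscaleVideo() helper that wraps the Zeroscope XL Space the same way generateVideo() wraps the base model:
// makeShot() is an illustrative helper, not the actual WebTV code
const makeShot = async (prompt: string) => {
  // first pass: generate a low-resolution clip
  const rawClipUrl = await generateVideo(prompt)

  // second pass: upscale the clip, reusing the exact same prompt
  return upscaleVideo(prompt, rawClipUrl)
}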
Post-processing
Once an individual take (a video clip) is upscaled, it is then passed to FILM (Frame Interpolation for Large Motion), a frame interpolation algorithm:
During post-processing, we also add music generated with MusicGen:
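Both post-processing steps can be called the same way as the text-to-video model, through Spaces exposed via @gradio/client. The sketch below is only illustrative: the Space URLs, endpoint names, and parameters are assumptions and will depend on the Spaces you duplicate:
import { client } from "@gradio/client"

// hypothetical Space URLs — duplicate the FILM and MusicGen Spaces and use your own
const filmSpace = "*** URL OF THE FILM SPACE ***"
const musicgenSpace = "*** URL OF THE MUSICGEN SPACE ***"

export const interpolateVideo = async (videoUrl: string) => {
  const api = await client(filmSpace)
  // endpoint name and parameters depend on the Space's Gradio app
  const { data } = await api.predict("/predict", [videoUrl])
  return data[0]
}

export const generateMusic = async (prompt: string, durationInSec: number) => {
  const api = await client(musicgenSpace)
  // endpoint name and parameters depend on the Space's Gradio app
  const { data } = await api.predict("/predict", [prompt, durationInSec])
  return data[0]
}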
Broadcasting the stream
Note: there are multiple tools you can use to create a video stream. The AI WebTV currently uses FFmpeg to read a playlist made of mp4 video files and m4a audio files.
Here is an example of making such a playlist:
import { promises as fs } from "fs"
import path from "path"

// folder containing the generated mp4 files
const dir = "** PATH TO VIDEO FOLDER **"

const allFiles = await fs.readdir(dir)

const allVideos = allFiles
  .map(file => path.join(dir, file))
  .filter(filePath => filePath.endsWith('.mp4'))

let playlist = 'ffconcat version 1.0\n'

allVideos.forEach(filePath => {
  playlist += `file '${filePath}'\n`
})

await fs.writeFile("playlist.txt", playlist)
This will generate the following playlist content:
ffconcat version 1.0
file 'video1.mp4'
file 'video2.mp4'
...
FFmpeg is then used again to read this playlist and send an FLV stream to an RTMP server. FLV is an old format but still popular in the world of real-time streaming due to its low latency.
ffmpeg -y -nostdin \
  -re \
  -f concat \
  -safe 0 -i channel_random.txt -stream_loop -1 \
  -loglevel error \
  -c:v libx264 -preset veryfast -tune zerolatency \
  -shortest \
  -f flv rtmp://
There are various different configuration options for FFmpeg; see the official documentation for more information.
For the RTMP server, you can find open-source implementations on GitHub, such as the NGINX-RTMP module.
The AI WebTV itself uses node-media-server.
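For reference, here is a minimal node-media-server setup; the ports and settings are the library's documented defaults, not necessarily the WebTV's actual configuration:
import NodeMediaServer from "node-media-server"

const nms = new NodeMediaServer({
  rtmp: {
    port: 1935,        // the port FFmpeg pushes the FLV stream to
    chunk_size: 60000,
    gop_cache: true,
    ping: 30,
    ping_timeout: 60
  },
  http: {
    port: 8000,        // exposes the stream over HTTP-FLV for playback
    allow_origin: "*"
  }
})

nms.run()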
💡 You can also stream directly to one of the Twitch RTMP entrypoints. Check out the Twitch documentation for more details.
Observations and examples
Here are some examples of the generated content.
The first thing we notice is that applying the second pass of Zeroscope XL significantly improves the quality of the image. The impact of frame interpolation is also clearly visible.
Characters and scene composition
Simulation of dynamic scenes
Something truly fascinating about text-to-video models is their ability to emulate real-life phenomena they’ve been trained on.
We have seen it with large language models and their ability to synthesize convincing content that mimics human responses, but this takes things to a whole new dimension when applied to video.
A video model predicts the next frames of a scene, which may include objects in motion such as fluids, people, animals, or vehicles. Today, this emulation isn't perfect, but it will be interesting to evaluate future models (trained on larger or specialized datasets, such as animal locomotion) for their accuracy when reproducing physical phenomena, and also their ability to simulate the behavior of agents.
💡 It will be interesting to see these capabilities explored more in the future, for instance by training video models on larger video datasets covering more phenomena.
Styling and effects
3D rendered video of a friendly broccoli character wearing a hat, walking in a candy-filled city street with gingerbread houses, under a vivid sun and blue skies, Pixar’s style, cinematic, photorealistic, movie, ambient lighting, natural lighting, CGI, wide-angle view, daytime, ultra realistic.
Failure cases
Wrong direction: the model sometimes has trouble with movement and direction. For instance, here the clip appears to be played in reverse. Also, the modifier keyword green was not taken into account.
Rendering errors on realistic scenes: sometimes we can see artifacts such as moving vertical lines or waves. It's unclear what causes this, but it may be due to the combination of keywords used.
Text or objects inserted into the image: the model sometimes injects words from the prompt into the scene, such as "IMAX". Mentioning "Canon EOS" or "Drone footage" in the prompt can also make those objects appear in the video.
In the following example, we notice that the word "llama" inserts a llama, but also two occurrences of the word llama written in flames.
Recommendations
Here are some early recommendations that can be made from the previous observations:
Using video-specific prompt keywords
You may already know that if you don't prompt a specific aspect of the image with Stable Diffusion, things like the color of clothes or the time of day may become random, or be assigned a generic value such as a neutral mid-day light.
The same is true for video models: you'll want to be specific about things. Examples include camera and character movement, their orientation, speed, and direction. You can leave it unspecified for creative purposes (idea generation), but this won't always give you the results you want (e.g., entities animated in reverse).
Maintaining consistency between scenes
If you plan to create sequences of multiple videos, you'll want to make sure you add as many details as possible in each prompt; otherwise you may lose important details from one sequence to another, such as the color.
💡 This will also improve the quality of the image, since the prompt is used for the upscaling part with Zeroscope XL.
Leverage frame interpolation
Frame interpolation is a powerful tool which can repair small rendering errors and turn many defects into features, especially in scenes with a lot of animation, or where a cartoon effect is acceptable. The FILM algorithm will smooth out elements of a frame with previous and following events in the video clip.
This works great to displace the background when the camera is panning or rotating, and will also give you creative freedom, such as control over the number of frames after the generation, to make slow-motion effects.
Future work
We hope you enjoyed watching the AI WebTV stream and that it will inspire you to build more in this space.
As this was a first trial, a lot of things were not the focus of the tech demo: generating longer and more varied sequences, adding audio (sound effects, dialogue), generating and orchestrating complex scenarios, or letting a language model agent have more control over the pipeline.
Some of these ideas may make their way into future updates to the AI WebTV, but we also can't wait to see what the community of researchers, engineers, and builders will come up with!

