The AI WebTV is an experimental demo showcasing the latest advances in automatic video and music synthesis.
👉 Watch the stream now by going to the AI WebTV Space.
If you are using a mobile device, you can view the stream from the Twitch mirror.
Concept
The motivation for the AI WebTV is to demo videos generated with open-source text-to-video and text-to-music models such as Zeroscope and MusicGen, in an entertaining and accessible way.
You can find those open-source models on the Hugging Face Hub:
The individual video sequences are purposely kept short, so the WebTV should be seen as a tech demo/showreel rather than an actual show (with an art direction or programming).
Architecture
The AI WebTV works by taking a sequence of video shot prompts and passing them to a text-to-video model to generate a sequence of takes.
Moreover, a base theme and idea (written by a human) are passed through an LLM (in this case, ChatGPT) in order to generate a variety of individual prompts for each video clip.
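As a rough sketch of that prompt-generation step (the function, prompt template, and model settings below are illustrative assumptions, not the WebTV's actual code), this is what calling the OpenAI API from TypeScript could look like:
import OpenAI from "openai"

const openai = new OpenAI()

// Illustrative sketch only: the real prompt template and parameters may differ.
export const generateShotPrompts = async (theme: string, nbShots: number) => {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "You write short, vivid text-to-video prompts for individual video shots." },
      { role: "user", content: `Theme: ${theme}. Write ${nbShots} shot prompts, one per line.` }
    ]
  })

  // one shot prompt per line of the response
  return (completion.choices[0].message.content ?? "").split("\n").filter(Boolean)
}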
Here is a diagram of the current architecture of the AI WebTV:
Implementing the pipeline
The WebTV is implemented in NodeJS and TypeScript, and uses various services hosted on Hugging Face.
The text-to-video model
The central video model is Zeroscope V2, a model based on ModelScope.
Zeroscope consists of two parts that can be chained together:
👉 You'll want to use the same prompt for both the generation and the upscaling.
Calling the video chain
To make a quick prototype, the WebTV runs Zeroscope from two duplicated Hugging Face Spaces running Gradio, which are called using the @gradio/client NPM package. You can find the original Spaces here:
Other Spaces deployed by the community can also be found if you search for Zeroscope on the Hub.
👉 Public Spaces may become overcrowded and paused at any time. If you intend to deploy your own system, please duplicate those Spaces and run them under your own account.
Using a model hosted on a Space
Spaces using Gradio can expose a REST API, which can then be called from Node using the @gradio/client module.
Here is an example:
import { client } from "@gradio/client"

// URL of the Space hosting the text-to-video model
const instance = "*** URL OF THE SPACE ***"

export const generateVideo = async (prompt: string) => {
  const api = await client(instance)

  // call the /run endpoint with the prompt and the generation parameters
  const { data } = await api.predict("/run", [
    prompt,
    42, // seed
    24, // number of frames
    35  // number of inference steps
  ])

  const { orig_name } = data[0][0]

  // the generated file lives inside the Space, so we build its download URL
  const remoteUrl = `${instance}/file=${orig_name}`
  return remoteUrl
}
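Since Zeroscope runs in two passes, the two Space calls can simply be chained. Here is a minimal sketch, assuming a hypothetical upscaleVideo() helper that wraps the Zeroscope XL Space the same way generateVideo() wraps the base model:
// makeShot() is an illustrative helper, not the actual WebTV code
const makeShot = async (prompt: string) => {
  // first pass: generate a low-resolution clip
  const rawClipUrl = await generateVideo(prompt)

  // second pass: upscale the clip, reusing the exact same prompt
  return upscaleVideo(prompt, rawClipUrl)
}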
Post-processing
Once an individual take (a video clip) is upscaled, it is then passed to FILM (Frame Interpolation for Large Motion), a frame interpolation algorithm:
During post-processing, we also add music generated with MusicGen:
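Both post-processing steps can be called the same way as the text-to-video model, through Spaces exposed via @gradio/client. The sketch below is only illustrative: the Space URLs, endpoint names, and parameters are assumptions and will depend on the Spaces you duplicate:
import { client } from "@gradio/client"

// hypothetical Space URLs — duplicate the FILM and MusicGen Spaces and use your own
const filmSpace = "*** URL OF THE FILM SPACE ***"
const musicgenSpace = "*** URL OF THE MUSICGEN SPACE ***"

export const interpolateVideo = async (videoUrl: string) => {
  const api = await client(filmSpace)
  // endpoint name and parameters depend on the Space's Gradio app
  const { data } = await api.predict("/predict", [videoUrl])
  return data[0]
}

export const generateMusic = async (prompt: string, durationInSec: number) => {
  const api = await client(musicgenSpace)
  // endpoint name and parameters depend on the Space's Gradio app
  const { data } = await api.predict("/predict", [prompt, durationInSec])
  return data[0]
}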
Broadcasting the stream
Note: there are multiple tools you can use to create a video stream. The AI WebTV currently uses FFmpeg to read a playlist made of mp4 video files and m4a audio files.
Here is an example of making such a playlist:
import { promises as fs } from "fs"
import path from "path"

// folder containing the generated mp4 files
const dir = "** PATH TO VIDEO FOLDER **"

const allFiles = await fs.readdir(dir)

const allVideos = allFiles
  .map(file => path.join(dir, file))
  .filter(filePath => filePath.endsWith('.mp4'))

let playlist = 'ffconcat version 1.0\n'

allVideos.forEach(filePath => {
  playlist += `file '${filePath}'\n`
})

await fs.writeFile("playlist.txt", playlist)
This will generate the following playlist content:
ffconcat version 1.0
file 'video1.mp4'
file 'video2.mp4'
...
FFmpeg is then used again to read this playlist and send an FLV stream to an RTMP server. FLV is an old format but still popular in the world of real-time streaming due to its low latency.
ffmpeg -y -nostdin \
  -re \
  -f concat \
  -safe 0 -i channel_random.txt -stream_loop -1 \
  -loglevel error \
  -c:v libx264 -preset veryfast -tune zerolatency \
  -shortest \
  -f flv rtmp://
There are various different configuration options for FFmpeg; see the official documentation for more information.
For the RTMP server, you can find open-source implementations on GitHub, such as the NGINX-RTMP module.
The AI WebTV itself uses node-media-server.
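For reference, here is a minimal node-media-server setup; the ports and settings are the library's documented defaults, not necessarily the WebTV's actual configuration:
import NodeMediaServer from "node-media-server"

const nms = new NodeMediaServer({
  rtmp: {
    port: 1935,        // the port FFmpeg pushes the FLV stream to
    chunk_size: 60000,
    gop_cache: true,
    ping: 30,
    ping_timeout: 60
  },
  http: {
    port: 8000,        // exposes the stream over HTTP-FLV for playback
    allow_origin: "*"
  }
})

nms.run()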
💡 You can also stream directly to one of the Twitch RTMP entrypoints. Check out the Twitch documentation for more details.
Observations and examples
Here are some examples of the generated content.
The first thing we notice is that applying the second pass of Zeroscope XL significantly improves the quality of the image. The impact of frame interpolation is also clearly visible.
Characters and scene composition
Simulation of dynamic scenes
Something truly fascinating about text-to-video models is their ability to emulate real-life phenomena they’ve been trained on.
We have seen it with large language models and their ability to synthesize convincing content that mimics human responses, but this takes things to a whole new dimension when applied to video.
A video model predicts the next frames of a scene, which may include objects in motion such as fluids, people, animals, or vehicles. Today, this emulation isn't perfect, but it will be interesting to evaluate future models (trained on larger or specialized datasets, such as animal locomotion) for their accuracy when reproducing physical phenomena, and also their ability to simulate the behavior of agents.
💡 It will be interesting to see these capabilities explored more in the future, for instance by training video models on larger video datasets covering more phenomena.
Styling and effects
3D rendered video of a friendly broccoli character wearing a hat, walking in a candy-filled city street with gingerbread houses, under a vivid sun and blue skies, Pixar’s style, cinematic, photorealistic, movie, ambient lighting, natural lighting, CGI, wide-angle view, daytime, ultra realistic.
Failure cases
Wrong direction: the model sometimes has trouble with movement and direction. For instance, here the clip appears to be played in reverse. Also, the modifier keyword green was not taken into account.
Rendering errors on realistic scenes: sometimes we can see artifacts such as moving vertical lines or waves. It's unclear what causes this, but it may be due to the combination of keywords used.
Text or objects inserted into the image: the model sometimes injects words from the prompt into the scene, such as "IMAX". Mentioning "Canon EOS" or "Drone footage" in the prompt can also make those objects appear in the video.
In the following example, we notice that the word "llama" inserts a llama, but also two occurrences of the word llama written in flames.
Recommendations
Here are some early recommendations that can be made from the previous observations:
Using video-specific prompt keywords
You may already know that if you don't prompt a specific aspect of the image with Stable Diffusion, things like the color of clothes or the time of day may become random, or be assigned a generic value such as a neutral mid-day light.
The same is true for video models: you'll want to be specific about things. Examples include camera and character movement, their orientation, speed, and direction. You can leave it unspecified for creative purposes (idea generation), but this won't always give you the results you want (e.g., entities animated in reverse).
Maintaining consistency between scenes
If you plan to create sequences of multiple videos, you'll want to make sure you add as many details as possible in each prompt; otherwise you may lose important details from one sequence to another, such as the color.
💡 This will also improve the quality of the image, since the prompt is used for the upscaling part with Zeroscope XL.
Leverage frame interpolation
Frame interpolation is a powerful tool which can repair small rendering errors and turn many defects into features, especially in scenes with a lot of animation, or where a cartoon effect is acceptable. The FILM algorithm will smooth out elements of a frame with previous and following events in the video clip.
This works great to displace the background when the camera is panning or rotating, and will also give you creative freedom, such as control over the number of frames after the generation, to make slow-motion effects.
Future work
We hope you enjoyed watching the AI WebTV stream and that it will inspire you to build more in this space.
As this was a first trial, a lot of things were not the focus of the tech demo: generating longer and more varied sequences, adding audio (sound effects, dialogue), generating and orchestrating complex scenarios, or letting a language model agent have more control over the pipeline.
Some of these ideas may make their way into future updates to the AI WebTV, but we also can't wait to see what the community of researchers, engineers, and builders will come up with!

