Video generation models as world simulators

This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks,^{[^1]}^{[^2]}^{[^3]} generative adversarial networks,^{[^4]}^{[^5]}^{[^6]}^{[^7]} autoregressive transformers,^{[^8]}^{[^9]} and diffusion models.^{[^10]}^{[^11]}^{[^12]} These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.
