Real-time interactive video diffusion from Overworld



[Image: Waypoint launch grid]



Try Out The Model

Overworld Stream: https://overworld.stream



What’s Waypoint-1?

Waypoint-1 is Overworld’s real-time interactive video diffusion model, controllable and promptable via text, mouse, and keyboard. You can give the model some frames, run it, and have it create a world you can step into and interact with.

The backbone of the model is a frame-causal rectified flow transformer trained on 10,000 hours of diverse video game footage paired with control inputs and text captions. Waypoint-1 is a latent model, meaning that it’s trained on compressed frames.
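A toy sketch of what “latent” means in practice, using stand-in conv layers (the post doesn’t describe the actual encoder/decoder): generation happens on a compressed per-frame grid rather than on raw pixels.

import torch
import torch.nn as nn

# Stand-in codec: each 16x16 pixel patch collapses into one latent position.
encoder = nn.Conv2d(3, 8, kernel_size=16, stride=16)
decoder = nn.ConvTranspose2d(8, 3, kernel_size=16, stride=16)

frame = torch.randn(1, 3, 256, 256)   # one raw RGB frame
latent = encoder(frame)               # (1, 8, 16, 16): 256 latent positions
recon = decoder(latent)               # generated latents decode back to pixels

A 16×16 latent grid happens to match the 256-tokens-per-frame figure quoted for Waypoint-1-Small below, though the real tokenizer’s details are not given here.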

The usual approach among existing world models has been to take pre-trained video models and fine-tune them with simplified control inputs. In contrast, Waypoint-1 is trained from the start with a focus on interactive experiences. With other models, controls are limited: you can move and rotate the camera only once every few frames, with severe latency issues. With Waypoint-1 you are not limited at all as far as controls are concerned: you can move the camera freely with the mouse and press any key on the keyboard, all with zero added latency. Each frame is generated with your controls as context. Moreover, the model runs fast enough to provide a seamless experience even on consumer hardware.
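To make “each frame is generated with your controls as context” concrete, here is a minimal, hypothetical sketch of per-frame control conditioning (none of these names are Waypoint-1 internals): the set of held keys and the mouse delta are embedded into one vector per frame.

import torch
import torch.nn as nn

NUM_KEYS, DIM = 128, 64
key_emb = nn.EmbeddingBag(NUM_KEYS, DIM, mode="sum")  # set of held keys -> one vector
mouse_proj = nn.Linear(2, DIM)                        # (dx, dy) mouse delta -> vector

keys = torch.tensor([[17, 30]])                  # keys held during this frame
mouse = torch.tensor([[0.4, -0.3]])              # mouse movement for this frame
ctrl_vec = key_emb(keys) + mouse_proj(mouse)     # conditioning for exactly one frame

Because a fresh control vector conditions every single frame, an input can influence the very next generated frame instead of waiting several frames.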



How was it trained?

Waypoint-1 was pre-trained via diffusion forcing, a method in which the model learns to denoise future frames given past frames. A causal attention mask is applied such that a token in any given frame can only attend to tokens in its own frame or in past frames, never future frames. Each frame is noised with an independently sampled noise level, so the model learns to denoise each frame individually. During inference, you can then denoise new frames one at a time, allowing you to generate an open-ended stream of new frames.
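Concretely, the two ingredients are per-frame noise levels and a frame-causal attention mask. A minimal sketch with made-up sizes, assuming the standard rectified-flow noising x_t = (1 − t)·x_0 + t·ε:

import torch

T, TOKENS_PER_FRAME, DIM = 4, 256, 64

# Each frame gets its own independently sampled noise level.
t = torch.rand(T)
x0 = torch.randn(T, TOKENS_PER_FRAME, DIM)   # clean latent tokens
eps = torch.randn_like(x0)
xt = (1 - t)[:, None, None] * x0 + t[:, None, None] * eps

# Frame-causal mask: attend within your own frame or to earlier frames only.
frame_id = torch.arange(T).repeat_interleave(TOKENS_PER_FRAME)
mask = frame_id[:, None] >= frame_id[None, :]   # (1024, 1024) boolean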

While diffusion forcing provides a strong baseline, randomly noising all frames is misaligned with a frame-by-frame autoregressive rollout. This train-inference mismatch results in error accumulation and noisy long rollouts. To address this we post-train with self forcing, a method that trains the model to produce realistic outputs under a regime that matches inference behavior. Self forcing via DMD (distribution matching distillation) has the additional benefits of one-pass classifier-free guidance (CFG) and few-step denoising.
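Schematically, self forcing changes where training examples come from: rather than denoising randomly noised ground-truth frames, the model rolls frames out exactly as it would at inference and learns from those rollouts. A placeholder sketch of the rollout structure (the real update is the model itself; the real loss is DMD):

import torch

def few_step_denoise(x: torch.Tensor, context: torch.Tensor, steps: int = 4) -> torch.Tensor:
    for _ in range(steps):
        x = x - 0.25 * x   # stand-in for one model-guided denoising step
    return x

context = torch.zeros(0, 256, 64)          # no past frames yet
for _ in range(8):                         # autoregressive rollout, as at inference
    frame = few_step_denoise(torch.randn(1, 256, 64), context)
    context = torch.cat([context, frame])  # the model's own outputs become context
# The distribution-matching loss is then applied to these rolled-out frames.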



The Inference Library: WorldEngine

WorldEngine is Overworld’s high‑performance inference library for interactive world model streaming. It provides the core tooling for building inference applications in pure Python, optimized for low latency, high throughput, extensibility, and developer simplicity. The runtime loop is designed for interactivity: it consumes context frame images, keyboard/mouse inputs, and text, and outputs image frames for real‑time streaming.

On Waypoint‑1‑Small (2.3B) running on a 5090, WorldEngine sustains ~30,000 token‑passes/sec (single denoising pass; 256 tokens per frame) and achieves 30 FPS at 4 denoising steps or 60 FPS at 2 steps.
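These two operating points are consistent with each other: each frame is 256 tokens and every denoising step is one pass over those tokens, so

    256 tokens/frame × 4 steps × 30 frames/s = 30,720 token-passes/s
    256 tokens/frame × 2 steps × 60 frames/s = 30,720 token-passes/s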

Performance comes from 4 targeted optimizations:

  • AdaLN feature caching: Avoids repeated AdaLN conditioning projections by caching and reusing them as long as the prompt conditioning and timesteps stay the same between forward passes (see the sketch after this list).
  • Static rolling KV cache + FlexAttention: A preallocated, fixed-size KV cache that rolls over the oldest frames, paired with PyTorch's FlexAttention for the frame-causal masking; static shapes avoid recompilation.
  • Matmul fusion: Standard inference optimization using fused QKV projections.
  • Torch Compile using torch.compile(fullgraph=True, mode="max-autotune", dynamic=False)
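A minimal sketch of the AdaLN caching idea (an illustrative class, not WorldEngine's actual implementation): the modulation projection depends only on the conditioning vector, so it is recomputed only when that vector changes.

import torch
import torch.nn as nn

class CachedAdaLN(nn.Module):
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(cond_dim, 3 * dim)
        self._cached_cond = None
        self._cached_mod = None

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Recompute (shift, scale, gate) only when the conditioning changes.
        if cond is not self._cached_cond:
            self._cached_mod = self.proj(cond).chunk(3, dim=-1)
            self._cached_cond = cond
        shift, scale, gate = self._cached_mod
        return x + gate * (self.norm(x) * (1 + scale) + shift)

block = CachedAdaLN(dim=64, cond_dim=32)
cond = torch.randn(1, 32)
y1 = block(torch.randn(1, 256, 64), cond)   # computes the projection
y2 = block(torch.randn(1, 256, 64), cond)   # reuses the cached projection

When the prompt and timestep schedule repeat across forward passes, as they do between consecutive frames at a fixed step count, this skips one linear projection per block per pass.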
Putting it together, minimal WorldEngine usage looks like this:

from world_engine import WorldEngine, CtrlInput

# Load Waypoint-1-Small onto the GPU.
engine = WorldEngine("Overworld/Waypoint-1-Small", device="cuda")

# Text prompt that conditions the generated world.
engine.set_prompt("A game where you herd goats in a beautiful valley")

# Seed the context with an initial frame (uint8_img: an HxWx3 uint8 RGB array).
img = engine.append_frame(uint8_img)

# Generate new frames one at a time, each conditioned on that step's controls.
for controller_input in [
        CtrlInput(button={48, 42}, mouse=[0.4, 0.3]),
        CtrlInput(mouse=[0.1, 0.2]),
        CtrlInput(button={95, 32, 105}),
]:
    img = engine.gen_frame(ctrl=controller_input)



Build with WorldEngine

We’re running a world_engine hackathon on 1/20/2026 – you can RSVP here. Teams of 2-4 are welcome, and the prize is a 5090 GPU awarded on the spot. We’d love to see what you can come up with to extend world_engine, and it should be a great event to meet like-minded founders, engineers, hackers, and investors. We hope you can join us at 10am PST on January 20th for 8 hours of friendly competition!



Stay in Touch


