Dreaming in Blocks — MineWorld, the Minecraft World Model


One of my favorite games growing up was definitely Minecraft. To this day, I still remember meeting up with a few friends after school and deciding what new, odd redstone contraption we would build next. That's why, when Oasis, a fully AI-generated, interactive open-world model, was released in October 2024, I was flabbergasted! Building reactive world models finally seemed within reach using current technology, and soon enough, we might have fully AI-generated environments.

World models [3], introduced back in 2018 by David Ha et al., are machine learning models capable of both simulating and interacting with a fully virtual environment. Their main limitation has been computational inefficiency, which made real-time interaction with the model a significant challenge.

In this blog post, we will introduce MineWorld [1], the first open-source Minecraft world model, developed by Microsoft. It is capable of fast real-time interaction and high controllability while using fewer resources than its closed-source counterpart, Oasis [2]. The contribution lies in three main points:

  1. MineWorld: a real-time, interactive world model with high controllability, released as open source.
  2. A parallel decoding algorithm that accelerates the generation process, increasing the number of frames generated per second.
  3. A novel evaluation metric designed to measure a world model’s controllability.

Paper link: https://arxiv.org/abs/2504.08388

Code: https://github.com/microsoft/mineworld

Released: 11th of April 2025


MineWorld, Simplified

To explain MineWorld and its approach accurately, we will divide this section into three subsections:

  • Problem Formulation: where we define the problem and establish some ground rules for both training and inference.
  • Model Architecture: an overview of the models used for generating tokens and output images.
  • Parallel Decoding: a look at how the authors tripled the number of frames generated per second using a novel diagonal decoding algorithm [8].

Problem Formulation

There are two kinds of input to the world model: video game footage and the player actions taken during gameplay. Each of these requires a different kind of tokenization to be used appropriately.

Given a clip of Minecraft video 𝑥, containing 𝑛 states/frames, image tokenization can be formulated as follows:

$$x=(x_{1},…,x_{n})$$

$$t= (t_{1},…,t_{c},t_{c+1},…,t_{2c},t_{2c+1},…,t_{N})$$

Each frame 𝑥(i) contains c patches, and each patch can be represented by a token t(j). This means that a single frame 𝑥(i) can be further described as the set of quantized tokens {t(1), t(2), …, t(c)}, where each t(j) ∈ t is a distinct patch, capturing its own set of pixels.

Since every frame contains c tokens, the total number of tokens over one video clip is N = n·c.
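To make this concrete, here is a minimal sketch (not the authors' code) of how visual tokens are indexed and counted, using the values reported later in the post (n = 16 frames per clip, c = 336 patch tokens per frame):

```python
# Minimal sketch of visual token indexing; n and c are taken from the post,
# the helper function is purely illustrative.
n, c = 16, 336

def token_index(i: int, j: int) -> int:
    """Global index of the j-th patch token of frame i (0-based)."""
    return i * c + j

N = n * c                                  # total tokens per clip: N = n * c
assert token_index(n - 1, c - 1) == N - 1  # last patch of last frame is token N-1
print(N)                                   # 5376 visual tokens for a 16-frame clip
```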

Table 1. Seven classes grouping the 11 different possible actions. Grouping taken from [1].

In addition to tokenizing video input, player actions must also be tokenized. These tokens must capture variations such as changes in camera perspective, keyboard input, and mouse movements. This is achieved using 11 distinct tokens that represent the full range of input features:

  • 7 tokens for seven exclusive action groups; related actions are grouped into the same class (the grouping is shown in Table 1).
  • 2 tokens to encode camera angles, following [5].
  • 2 tokens marking the start and end of the action sequence: [aBOS] and [aEOS].

Thus, a flat sequence capturing all game states and actions can be represented as follows:

$$t= \left(t_{i\cdot c+1},\dots,t_{(i+1)\cdot c},[\text{aBOS}],t_{1}^{a_{i}},\dots,t_{9}^{a_{i}},[\text{aEOS}]\right)$$

We start with the list of quantized IDs for each patch, from t(1) to t(N) (as shown in the previous equations), followed by a beginning-of-sequence token [aBOS], the 9 action tokens, and finally an end-of-sequence token [aEOS].
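The sketch below illustrates this interleaving. It is hypothetical (the token names and helper function are mine, not the paper's code), but it follows the formula above: each frame's patch IDs, then [aBOS], the action tokens, and [aEOS].

```python
# Hypothetical sketch of interleaving frame tokens with action tokens.
ABOS, AEOS = "[aBOS]", "[aEOS]"

def interleave(frame_tokens: list[list[int]], action_tokens: list[list[int]]) -> list:
    """frame_tokens[i] holds the c patch IDs of frame i,
    action_tokens[i] holds the action tokens taken at state i."""
    sequence = []
    for patches, actions in zip(frame_tokens, action_tokens):
        sequence.extend(patches)   # t_{i*c+1}, ..., t_{(i+1)*c}
        sequence.append(ABOS)      # action beginning-of-sequence
        sequence.extend(actions)   # the 9 action tokens
        sequence.append(AEOS)      # action end-of-sequence
    return sequence

# Example: 2 frames with 4 dummy patch tokens each and 9 dummy action tokens.
seq = interleave([[1, 2, 3, 4], [5, 6, 7, 8]], [[0] * 9, [0] * 9])
print(len(seq))  # 2 * (4 + 1 + 9 + 1) = 30
```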

Model Architecture

Two main models were used in this work: a Vector Quantized Variational Autoencoder (VQ-VAE) [6] and a Transformer decoder based on the LLaMA architecture [7].

Although traditional Variational Autoencoders (VAEs) were once the go-to architecture for image generation (especially before the wide adoption of diffusion models), they had some limitations. VAEs struggle with data that is more discrete in nature (such as words or tokens) or that requires high realism and sharpness. VQ-VAEs, on the other hand, address these shortcomings by moving from a continuous latent space to a discrete one, making the representation more structured and improving the model's suitability for downstream tasks.

In this paper, a VQ-VAE was used as the visual tokenizer, converting each image frame 𝑥 into its quantized ID representation t. Images of size 224×384 were used as input, with each image downsampled by a factor of 16 into a 14×24 grid of patches. This results in a sequence of 336 discrete tokens representing the visual information in a single frame.
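As a toy illustration of the quantization step, the sketch below snaps each latent vector of a 14×24 grid to its nearest codebook entry, yielding 336 token IDs per frame. The codebook size and embedding dimension are assumptions for illustration only; the actual tokenizer follows [6] as used in [1].

```python
# Toy VQ-VAE-style quantization: nearest-codebook lookup over a 14x24 latent grid.
import numpy as np

codebook = np.random.randn(1024, 64)   # assumed: 1024 codes, 64-dim embeddings
latents = np.random.randn(14, 24, 64)  # stand-in for the encoder output per patch

flat = latents.reshape(-1, 64)                                      # (336, 64)
dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)    # (336, 1024)
token_ids = dists.argmin(axis=1).reshape(14, 24)                    # quantized IDs

print(token_ids.shape, token_ids.size)  # (14, 24) 336 tokens per frame
```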

A LLaMA-based Transformer decoder, on the other hand, was employed to predict each token conditioned on all previous tokens:

$$f_{\theta}(t)=\prod_{i=1}^{N} p\left( t_{i} \mid t_{<i} \right)$$

The Transformer processes not only visual tokens but also action tokens. This allows it to model the relationship between the two modalities, so it can be used both as a world model (as intended in the paper) and as a policy model capable of predicting actions from preceding tokens.
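The following sketch shows plain autoregressive decoding matching the factorization above: each token t(i) is predicted conditioned on all previous tokens t(<i). It is a hedged stand-in, not MineWorld's API; the `model` here just returns random logits.

```python
# Hedged sketch of autoregressive next-token decoding with a decoder-only Transformer.
import torch

vocab_size = 8192  # assumed vocabulary size, for illustration only
model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab_size)

def generate(prompt_ids: torch.Tensor, num_new_tokens: int) -> torch.Tensor:
    tokens = prompt_ids                                    # (1, T): context so far
    for _ in range(num_new_tokens):
        logits = model(tokens)                             # (1, T, vocab)
        next_id = logits[:, -1].argmax(-1, keepdim=True)   # greedy pick of t_i
        tokens = torch.cat([tokens, next_id], dim=1)       # condition on t_<i next step
    return tokens

out = generate(torch.zeros(1, 347, dtype=torch.long), num_new_tokens=10)
print(out.shape)  # torch.Size([1, 357])
```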

Parallel Decoding

Figure 2. Comparison between raster-scan order generation (left) and parallel diagonal decoding (right). Notice that parallel decoding took 2.5 seconds to render, while raster scan took around 6.8 seconds. Visualization created by the blog post author, inspired by [1].

The authors had a clear requirement for considering a game "playable" under normal settings: it must generate enough frames per second for the player to comfortably perform an average number of actions per minute (APM). According to their analysis, an average player performs 150 APM. To accommodate this, the environment would need to run at a minimum of 2-3 frames per second.
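Spelled out, the arithmetic behind this threshold is simple:

$$150 \ \tfrac{\text{actions}}{\text{min}} \div 60 \ \tfrac{\text{s}}{\text{min}} = 2.5 \ \tfrac{\text{actions}}{\text{s}} \;\Rightarrow\; \text{at least } 2\text{-}3 \ \text{frames per second}$$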

To meet this requirement, the authors had to move away from typical raster-scan generation (generating tokens one at a time, left to right, top to bottom) and instead use diagonal decoding.

Diagonal decoding works by generating several image patches in parallel during a single step. For instance, if patch x(i,j) was processed at step t, both patches x(i+1,j) and x(i,j+1) are processed at step t+1. This method leverages the spatial and temporal connections between consecutive frames, enabling faster generation. The effect can be seen in more detail in Figure 2.
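The sketch below shows the spatial part of such a schedule: patches on the same anti-diagonal of the 14×24 token grid share no left/top dependency, so they can be decoded in the same step. This is a simplified illustration; the full algorithm in [8] also exploits the temporal dimension across frames.

```python
# Simplified diagonal-decoding schedule over a 14x24 patch grid (illustrative only).
rows, cols = 14, 24

# Step k decodes every patch (i, j) with i + j == k.
schedule = [[(i, k - i) for i in range(rows) if 0 <= k - i < cols]
            for k in range(rows + cols - 1)]

print(len(schedule))      # 37 steps instead of 336 sequential steps
print(schedule[0])        # [(0, 0)] -- first step decodes the top-left patch
print(len(schedule[13]))  # 14 patches decoded in parallel at the widest step
```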

However, switching from sequential to parallel generation introduces some quality degradation. This is due to the mismatch between training and inference (parallel generation only happens at inference) and to the sequential nature of LLaMA's causal attention mask. The authors mitigate this issue by fine-tuning with a modified attention mask that is better suited to their parallel decoding strategy.


Key Findings & Evaluation

For evaluation, MineWorld uses the VPT dataset [5], which consists of recorded gameplay clips paired with their corresponding actions. VPT contains 10M video clips, each comprising 16 frames. As previously mentioned, each frame (224×384 pixels) is split into 336 patches, each represented by a separate token t(i). Together with the 11 action tokens, this results in up to 347 tokens per frame, summing to roughly 55B tokens for the entire dataset.
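These numbers multiply out as follows:

$$10^{7} \ \text{clips} \times 16 \ \tfrac{\text{frames}}{\text{clip}} \times 347 \ \tfrac{\text{tokens}}{\text{frame}} \approx 5.55\times10^{10} \approx 55\text{B tokens}$$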

Quantitative Results

MineWorld is primarily compared to Oasis using two categories of metrics: visual quality and controllability.

To measure controllability accurately, the authors introduced a novel approach: they trained an Inverse Dynamics Model (IDM) [5] tasked with predicting the action occurring between two consecutive frames. In addition to reaching 90.6% accuracy, the model was further validated by showing 20 game clips with the IDM's predicted actions to five experienced players. After scoring each action from 1 to 5 and computing the Pearson correlation coefficient, they obtained a value of 0.56, indicating a significant positive correlation.

With the Inverse Dynamics Model providing reliable results, it can be used to compute metrics such as precision, F1 score, or L1 error by treating the input action as the ground truth and the IDM's predicted action as the action produced by the world model (a small sketch of this computation follows the list below). Because of the different kinds of actions involved, this evaluation is further divided into two categories:

  1. Discrete action classification: precision, recall, and F1 scores over the 7 action classes described in Table 1.
  2. Camera movement: by dividing rotation around the X and Y axes into 11 discrete bins, an L1 score can be calculated between the input camera actions and the IDM's predictions.
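Here is a hedged sketch of how such metrics could be computed; the arrays are dummy placeholders, and the evaluation harness is illustrative rather than the authors' code.

```python
# Sketch: input actions as ground truth, IDM predictions as the model's realized actions.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Discrete action classification over the 7 action classes.
true_actions = np.random.randint(0, 7, size=1000)   # actions fed to the world model
idm_actions  = np.random.randint(0, 7, size=1000)   # actions recovered by the IDM
p, r, f1, _ = precision_recall_fscore_support(true_actions, idm_actions,
                                              average="macro", zero_division=0)

# Camera movement: rotations binned into 11 classes per axis, scored with L1.
true_cam = np.random.randint(0, 11, size=(1000, 2)) # ground-truth bins (x, y)
idm_cam  = np.random.randint(0, 11, size=(1000, 2)) # IDM-predicted bins
l1 = np.abs(true_cam - idm_cam).mean()

print(f"P={p:.3f} R={r:.3f} F1={f1:.3f} L1={l1:.3f}")
```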
Table 2. Comparison between three different sizes of MineWorld and Oasis, across frames per second (FPS), precision (P), recall (R), F1 score (F1), L1 score (L1), Fréchet video distance (FVD), learned perceptual image patch similarity (LPIPS), structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR). Results taken from [1].

Examining the results in Table 2, we observe that MineWorld, despite having only 300M parameters, outperforms Oasis on all reported metrics, whether related to controllability or visual quality. The most interesting metric is frames per second, where MineWorld delivers more than twice as many frames, enabling a smoother interactive experience that can handle 354 APM, far exceeding the 150 APM requirement.

While scaling MineWorld to 700M or 1.2B parameters improves image quality, it comes at the cost of a slowdown, with the FPS dropping to 3.01. This reduction in speed can negatively impact the user experience, though it still supports a playable 180 APM.

Qualitative Results 

Figure 3. Three different cases of gameplay are provided. Image taken from [1]

Further qualitative evaluation was conducted to gauge MineWorld's ability to generate high-quality details, follow action instructions, and understand and regenerate contextual information. The initial game state was provided, together with a predefined list of actions for the model to execute.

From Figure 3, we can draw three conclusions:

  • Top Panel: given an image of a player inside a house and instructions to move towards the door and open it, the model successfully generated the desired sequence of frames.
  • Middle Panel: in a wood-chopping scenario, the model demonstrated the ability to generate fine-grained visual details, correctly rendering the wood-breaking animation.
  • Bottom Panel: a case of high fidelity and context awareness. When the camera pans away and back, the house leaves the frame and then reappears in full with the same details.

These three cases show the ability of MineWorld not only to generate high-quality gameplay content but also to follow the specified actions and regenerate contextual information consistently, something Oasis struggles with.

Figure 4. Further cases of controllability: given the same initial scene but different input actions, different gameplay sequences are generated. Image taken from [1].

In a second set of results, the authors focused on evaluating the controllability of the model by providing the exact same input scene alongside three different sets of actions. The model successfully generated three distinct output sequences, each leading to a completely different final state.


Conclusion

In this blog post, we explored MineWorld, the first open-source world model for Minecraft. We discussed its approach of tokenizing each frame/state into several tokens and combining them with 11 additional tokens representing discrete actions and camera movement. We also highlighted the use of an Inverse Dynamics Model to compute controllability metrics, alongside the novel parallel decoding algorithm that triples inference speed, reaching an average of three frames per second.

In the future, it could be beneficial to extend the evaluation beyond a 16-frame window. Such longer horizons would more accurately test MineWorld's ability to regenerate specific objects, a challenge that, in my view, will remain a major obstacle to the wide adoption of such models.

Thanks for reading!

Interested in trying a Minecraft world model in your browser? Try Oasis [2] here.


References

[1] J. Guo, Y. Ye, T. He, H. Wu, Y. Jiang, T. Pearce and J. Bian, (2025), arXiv preprint arXiv:2504.08388v1

[2] R. Wachen and D. Leitersdorf, (2024), https://oasis-ai.org/

[3] D. Ha and J. Schmidhuber, (2018), arXiv preprint arXiv:1803.10122

[4] J. Guo, Y. Ye, T. He, H. Wu, Y. Jiang, T. Pearce and J. Bian, (2025), GitHub repository: https://github.com/microsoft/mineworld

[5] B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro and J. Clune, (2022), arXiv preprint arXiv:2206.11795

[6] A. van den Oord, O. Vinyals and K. Kavukcuoglu, (2017), arXiv preprint arXiv:1711.00937

[7] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Joulin, E. Grave and G. Lample, (2023), arXiv preprint arXiv:2302.13971

[8] Y. Ye, J. Guo, H. Wu, T. He, T. Pearce, T. Rashid, K. Hofmann and J. Bian, (2025), arXiv preprint arXiv:2503.14070
