Learning to play Minecraft with Video PreTraining


The web contains an enormous number of publicly available videos that we can learn from. You can watch a person give a stunning presentation, a digital artist paint a beautiful sunset, and a Minecraft player build an intricate house. However, these videos only provide a record of what happened, not precisely how it was achieved, i.e., you won't know the exact sequence of mouse movements and keys pressed. If we would like to build large-scale foundation models in these domains as we have done in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where "action labels" are simply the next words in a sentence.

To utilize the wealth of unlabeled video data available on the internet, we introduce a novel, yet simple, semi-supervised imitation learning method: Video PreTraining (VPT). We start by gathering a small dataset from contractors where we record not only their video, but also the actions they take, which in our case are keypresses and mouse movements. With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step in the video. Importantly, the IDM can use past and future information to guess the action at each step. This task is much easier, and thus requires far less data, than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.
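To make the pipeline concrete, here is a minimal sketch in PyTorch. The architectures (SimpleIDM, SimplePolicy), the discretized action space of N_ACTIONS, the frame-feature dimension, and the random stand-in data are all illustrative assumptions, not the actual VPT models; the point is only the three-step structure: train a non-causal IDM on the small labeled dataset, pseudo-label a large unlabeled corpus with it, then behaviorally clone a causal policy on the pseudo-labels.

```python
# Minimal sketch of the VPT pipeline. All names, shapes, and data below
# are illustrative assumptions, not OpenAI's actual architecture or data.
import torch
import torch.nn as nn

N_ACTIONS = 10   # hypothetical size of a discretized keyboard/mouse action space
WINDOW = 8       # frames of temporal context
FRAME_DIM = 512  # hypothetical per-frame feature dimension

class SimpleIDM(nn.Module):
    """Inverse dynamics model: predicts the action at the center of a
    window using PAST AND FUTURE frames (non-causal)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FRAME_DIM * WINDOW, 256), nn.ReLU(),
            nn.Linear(256, N_ACTIONS),
        )

    def forward(self, frames):           # frames: (batch, WINDOW, FRAME_DIM)
        return self.net(frames.flatten(1))

class SimplePolicy(nn.Module):
    """Behavioral-cloning policy: predicts the next action from PAST
    frames only (causal), so it can be run during live play."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FRAME_DIM * WINDOW, 256), nn.ReLU(),
            nn.Linear(256, N_ACTIONS),
        )

    def forward(self, past_frames):      # past_frames: (batch, WINDOW, FRAME_DIM)
        return self.net(past_frames.flatten(1))

# 1) Train the IDM on the small contractor dataset (video + recorded actions).
idm = SimpleIDM()
opt = torch.optim.Adam(idm.parameters(), lr=1e-4)
contractor_frames = torch.randn(64, WINDOW, FRAME_DIM)     # stand-in video features
contractor_actions = torch.randint(0, N_ACTIONS, (64,))    # recorded keypresses/mouse moves
loss = nn.functional.cross_entropy(idm(contractor_frames), contractor_actions)
opt.zero_grad(); loss.backward(); opt.step()

# 2) Use the trained IDM to pseudo-label a much larger unlabeled video corpus.
web_frames = torch.randn(4096, WINDOW, FRAME_DIM)
with torch.no_grad():
    pseudo_actions = idm(web_frames).argmax(dim=-1)

# 3) Behavioral cloning: train the causal policy on the pseudo-labeled data.
#    (Simplification: in practice the IDM window is centered on the labeled
#    timestep while the policy window ends at it; here they are shared.)
policy = SimplePolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss = nn.functional.cross_entropy(policy(web_frames), pseudo_actions)
opt.zero_grad(); loss.backward(); opt.step()
```

The key asymmetry the sketch encodes is that the IDM may look at future frames, since it only has to infer what already happened, while the policy must remain causal to act during live play; that asymmetry is what makes the IDM's task easier and its labeled-data needs smaller.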
