We’re excited to share Jack of All Trades (JAT), a project that aims to move in the direction of a generalist agent. The project began as an open reproduction of the Gato (Reed et al., 2022) work, which proposed to train a Transformer capable of performing both vision-and-language and decision-making tasks. We thus began by building an open version of Gato’s dataset. We then trained multi-modal Transformer models on it, introducing several improvements over Gato for handling sequential data and continuous values.
Overall, the project has resulted in:
- The release of numerous expert RL agents on a wide variety of tasks.
- The release of the JAT dataset, the first dataset for generalist agent training. It contains hundreds of thousands of expert trajectories collected with the expert agents.
- The release of the JAT model, a transformer-based agent capable of playing video games, controlling a robot to perform a wide variety of tasks, understanding and executing commands in a simple navigation environment, and much more!

Datasets & expert policies
The expert policies
RL traditionally involves training policies on single environments. Leveraging these expert policies is a genuine way to build a versatile agent. We selected a wide range of environments, of varying nature and difficulty, including Atari, BabyAI, Meta-World, and MuJoCo. For each of these environments, we trained an agent until it reached state-of-the-art performance. (For BabyAI, we use the BabyAI bot instead.) The resulting agents are called expert agents, and have been released on the 🤗 Hub. You can find the list of all agents in the JAT dataset card.
The JAT dataset
We release the JAT dataset, the first dataset for generalist agent training. The JAT dataset contains hundreds of thousands of expert trajectories collected with the above-mentioned expert agents. To use this dataset, simply load it like any other dataset from the 🤗 Hub:
>>> from datasets import load_dataset
>>> dataset = load_dataset("jat-project/jat-dataset", "metaworld-assembly")
>>> first_episode = dataset["train"][0]
>>> first_episode.keys()
dict_keys(['continuous_observations', 'continuous_actions', 'rewards'])
>>> len(first_episode["rewards"])
500
>>> first_episode["continuous_actions"][0]
[6.459120273590088, 2.2422609329223633, -5.914587020874023, -19.799840927124023]
In addition to the RL data, we include textual datasets to enable a unique interface for the user. That is why you will also find subsets for Wikipedia, Oscar, OK-VQA and Conceptual-Captions.
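These subsets can be loaded in the same way. For example (the configuration name "wikipedia" below is an assumption based on the subset naming; check the dataset card for the exact names):

>>> from datasets import load_dataset
>>> # "wikipedia" is an assumed configuration name for the textual subset
>>> wikipedia = load_dataset("jat-project/jat-dataset", "wikipedia")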
JAT agent architecture
JAT’s architecture is based on a Transformer, using EleutherAI’s GPT-Neo implementation. JAT’s particularity lies in its embedding mechanism, which has been built to intrinsically handle sequential decision tasks. We interleave observation embeddings with action embeddings, along with the corresponding rewards.
Each embedding therefore corresponds either to an observation (associated with the reward), or to an action. So how does JAT encode this information? It depends on the type of data. If the data (observation or action) is an image (as is the case for Atari), then JAT uses a CNN. If it is a continuous vector, then JAT uses a linear layer. Finally, if it is a discrete value, JAT uses a linear projection layer. The same principle is used for the model output, depending on the type of data to be predicted. Prediction is causal, shifting observations by one time step. In this way, the agent must predict the next action from all previous observations and actions.
In addition, we thought it would be fun to train our agent to perform NLP and CV tasks. To do this, we also gave the encoder the option of taking text and image data as input. For text data, we tokenize using GPT-2’s tokenization strategy, and for images, we use a ViT-type encoder.
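To make the modality dispatch concrete, here is a minimal sketch of how an encoder like the one described above could be wired up. The layer sizes and module choices are illustrative assumptions, not the actual JAT implementation:

import torch.nn as nn

class ModalityEncoder(nn.Module):
    # Minimal sketch: map each observation or action to a common embedding size.
    def __init__(self, hidden_size=768, continuous_dim=8, num_discrete=18):
        super().__init__()
        # Image data (e.g. Atari frames): a small CNN projected to hidden_size
        self.image_encoder = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(hidden_size),
        )
        # Continuous vectors (e.g. MuJoCo observations/actions): a linear layer
        self.continuous_encoder = nn.Linear(continuous_dim, hidden_size)
        # Discrete values (e.g. Atari actions): a projection of the one-hot encoding
        self.discrete_encoder = nn.Embedding(num_discrete, hidden_size)

    def forward(self, x, modality):
        if modality == "image":
            return self.image_encoder(x)
        if modality == "continuous":
            return self.continuous_encoder(x)
        if modality == "discrete":
            return self.discrete_encoder(x)
        raise ValueError(f"Unknown modality: {modality}")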
Given that the modality of the data can change from one environment to another, how does JAT compute the loss? It computes the loss for each modality separately. For images and continuous values, it uses the MSE loss. For discrete values, it uses the cross-entropy loss. The final loss is the average of the losses for each element of the sequence.
Wait, does that mean we give equal weight to predicting actions and observations? Actually, no, but we’ll talk more about that below.
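As a rough sketch of the loss dispatch just described (function names are ours, for illustration only):

import torch
import torch.nn.functional as F

def element_loss(prediction, target, modality):
    # MSE for images and continuous values, cross-entropy for discrete values
    if modality in ("image", "continuous"):
        return F.mse_loss(prediction, target)
    if modality == "discrete":
        return F.cross_entropy(prediction, target)
    raise ValueError(f"Unknown modality: {modality}")

def sequence_loss(predictions, targets, modalities):
    # Final loss: the average of the per-element losses over the sequence
    losses = [element_loss(p, t, m) for p, t, m in zip(predictions, targets, modalities)]
    return torch.stack(losses).mean()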
Experiments and results
We evaluate JAT on all 157 training tasks. We collect 10 episodes per task and record the total reward. For ease of reading, we aggregate the results by domain.
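As an illustrative sketch of this protocol (the agent.act call and the normalization against a random baseline are assumptions for readability, not the actual evaluation code):

import numpy as np

def evaluate(env, agent, num_episodes=10):
    # Collect a fixed number of episodes and record each total reward
    episode_returns = []
    for _ in range(num_episodes):
        observation, _ = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = agent.act(observation)  # placeholder for the agent's policy call
            observation, reward, terminated, truncated, _ = env.step(action)
            episode_return += reward
            done = terminated or truncated
        episode_returns.append(episode_return)
    return float(np.mean(episode_returns))

def expert_normalized_score(agent_return, random_return, expert_return):
    # Score relative to the expert, using a random policy as the lower reference
    return (agent_return - random_return) / (expert_return - random_return)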
If we were to summarize these results in one number, it would be 65.8%, the average performance compared to the JAT expert over the 4 domains. This shows that JAT is capable of mimicking expert performance on a very wide variety of tasks.
Let’s go into a little more detail:
- For Atari 57, the agent achieves 14.1% of the expert’s score, corresponding to 37.6% of human performance. It exceeds human performance on 21 games.
- For BabyAI, the agent achieves 99.0% of the expert’s score, and fails to exceed 50% of the expert on just 1 task.
- For Meta-World, the agent achieves 65.5% of the expert.
- For MuJoCo, the agent achieves 84.8% of the expert.
What’s most impressive is that JAT achieves this performance using a single network for all domains. To take the measure of this performance, let’s watch JAT’s rendering on just a few tasks:
Want to try it out? You can! The JAT model is available on the 🤗 Hub!
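Here is a hedged sketch of how loading it could look. The repository name "jat-project/jat" and the use of trust_remote_code are assumptions on our part; the model card on the Hub has the authoritative instructions:

from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repository name; check the model card for the exact usage.
model = AutoModelForCausalLM.from_pretrained("jat-project/jat", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("jat-project/jat", trust_remote_code=True)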
For textual tasks, our model shows rudimentary capabilities; we refer the reader to the paper for more details.
The surprising advantages of predicting observations
When training an RL agent, the primary goal is to maximize future rewards. But what if we also ask the agent to predict what it will observe in the future? Will this additional task help or hinder the learning process?
There are two opposing views on this question. On the one hand, learning to predict observations could provide a deeper understanding of the environment, leading to better and faster learning. On the other hand, it could distract the agent from its primary goal, resulting in mediocre performance at both observation and action prediction.
To settle this debate, we conducted an experiment using a loss function that combines the observation loss and the action loss, with a weighting parameter κ to balance the two objectives.
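One natural way to write such a combination (our notation, for illustration; not necessarily the exact formulation used in the paper) is a convex mix of the two terms, where κ = 0.5 gives them equal weight:

def combined_loss(observation_loss, action_loss, kappa):
    # kappa = 0.5 weights predicting observations and actions equally
    return kappa * observation_loss + (1.0 - kappa) * action_loss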
The results were noteworthy. When κ was too high (0.5), the additional objective of predicting observations seemed to hinder the learning process. But when κ was lower, the impact on learning was negligible, and the agent’s performance was similar to that obtained when observation prediction was not part of the objective.
However, we found a sweet spot for κ, where learning to predict observations actually improved the agent’s learning efficiency.
Our study suggests that adding observation prediction to the learning process can be beneficial, as long as it is properly balanced. This finding has important implications for the design of such agents, highlighting the potential value of auxiliary objectives in improving learning efficiency.
So, the next time you’re training an RL agent, consider asking it to predict what it will observe in the future. It might just lead to better performance and faster learning!
Conclusions
In this work, we introduced JAT, a multi-purpose transformer agent capable of mastering a wide variety of sequential decision-making tasks, and showing rudimentary capabilities in NLP and CV tasks. For all these tasks, JAT uses a single network. Our contributions include the release of expert RL agents, the JAT dataset, and the JAT model. We hope that this work will inspire future research in the field of generalist agents and contribute to the development of more versatile and capable AI systems.
What’s next? A request for research
We believe that the JAT project has opened up a new direction for research in the field of generalist agents, and we have only just scratched the surface. Here are some ideas for future work:
- Improving the data: Although pioneering, the JAT dataset is still in its early stages. The expert trajectories come from only one expert agent per environment, which may introduce some bias. Although we have done our best to reach state-of-the-art performance, some environments remain challenging. We believe that collecting more data and training more expert agents could help a lot.
- Use offline RL: The JAT agent is trained using basic Behavioral Cloning. This implies two things: (1) we can’t take advantage of sub-optimal trajectories, and (2) the JAT agent cannot outperform the expert. We chose this approach for simplicity, but we believe that using offline RL could really help improve the agent’s performance, while not being too complex to implement.
- Unlock the full potential of a smarter multi-task sampling strategy: Currently, the JAT agent samples data uniformly from all tasks, but this approach may be holding it back. By dynamically adjusting the sampling rate to focus on the most difficult tasks, we could supercharge the agent’s learning process and unlock significant performance gains; see the sketch after this list for one illustrative possibility.
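As a purely illustrative sketch of such a strategy (an idea for future work, not something implemented in JAT), the sampling probabilities could be tied to each task’s current expert-normalized score:

import numpy as np

def task_sampling_probabilities(normalized_scores, temperature=1.0):
    # Sample harder tasks (lower normalized score) more often
    difficulty = 1.0 - np.asarray(normalized_scores, dtype=np.float64)
    logits = difficulty / temperature
    weights = np.exp(logits - logits.max())  # softmax, stabilized
    return weights / weights.sum()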
Links
Citation
@article{gallouedec2024jack,
  title = {{Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent}},
  author = {Gallouédec, Quentin and Beeching, Edward and Romac, Clément and Dellandréa, Emmanuel},
  journal = {arXiv preprint arXiv:2402.09844},
  year = {2024},
  url = {https://arxiv.org/abs/2402.09844}
}
