Hands-On Imitation Learning: From Behavior Cloning to Multi-Modal Imitation Learning


An overview of the most prominent imitation learning methods, tested on a grid environment

Photo by Possessed Photography on Unsplash

Reinforcement learning is a branch of machine learning concerned with learning from the guidance of scalar signals (rewards), in contrast to supervised learning, which needs full labels of the target variable.

An intuitive example to clarify reinforcement learning is a school with two classes taking two forms of test. The first class solves the test and receives the full correct answers (supervised learning: SL). The second class solves the test and receives only a grade for each question (reinforcement learning: RL). In the first case, it seems easier for the students to learn the correct answers and memorize them. In the second class, the task is harder because the students can learn only by trial and error. However, their learning is more robust, because they do not only know what is correct but also all the wrong answers to avoid.

However, designing accurate RL reward signals (the grades) can be a difficult task, especially for real-world applications. For instance, a human driver knows how to drive but cannot set rewards for the skill of 'correct driving'; the same goes for cooking or painting. This created the need for imitation learning (IL) methods. IL is a branch of RL concerned with learning from expert trajectories alone, without knowing the rewards. The main application areas of IL are robotics and autonomous driving.

In the following, we will explore the most famous IL methods in the literature, ordered by their proposal time from oldest to newest, as shown in the timeline below.

Timeline of IL methods

The mathematical formulations are shown together with the nomenclature of the symbols. However, the theoretical derivations are kept to a minimum here; if further depth is needed, the original papers can be looked up as cited in the references section at the end. The full code for recreating all the experiments is provided in the accompanying GitHub repo.

So, buckle up, and let's dive into imitation learning, from behavior cloning (BC) to information-maximizing generative adversarial imitation learning (InfoGAIL).

The environment used in this post is a 15×15 grid. The environment state is illustrated below:

  • Agent: red color
  • Initial agent location: blue color
  • Walls: green color

The goal of the agent is to reach the first row in the shortest possible way, at a location symmetrical to its starting position with respect to the vertical axis passing through the center of the grid. The goal location is not shown in the state grid.

The action space A consists of discrete values from 0 to 4, representing movement in the four directions plus a stop action, as illustrated below:

The ground truth reward R(s, a) is a function of the current state and action, with a value equal to the displacement toward the goal.
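In symbols, with d(·,·) denoting the distance and g the goal position (notation introduced here for illustration), this can be written as:

$$R(s, a) = d(p_1, g) - d(p_2, g)$$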

where p_1 is the old position and p_2 is the new position. The agent is always initialized on the last row, but at a random position each time.

The expert policy used for all methods (except InfoGAIL) aims to reach the goal via the shortest possible path. This involves three steps:

  1. Moving towards the nearest gap in the wall
  2. Moving directly towards the goal
  3. Stopping at the goal location

This behavior is illustrated by a GIF:

The expert policy generates the demonstration trajectories used by the other IL methods, each represented as an ordered sequence of state-action tuples.
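Written out, a single trajectory of length T (notation introduced here) is:

$$\tau = \left\{(s_0, a_0), (s_1, a_1), \dots, (s_T, a_T)\right\}$$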

and the set of expert demonstrations is defined as D = {τ_0, ⋯, τ_n}.

The expert's episodic return was 16.33 ± 6 on average over 30 episodes, each 32 steps long.

First, we will train using the ground truth reward to set some baselines and tune hyperparameters for later use with the IL methods.

The implementation of the forward RL algorithms used in this post is based on the CleanRL scripts [12], which provide readable implementations of RL methods.

We will test both Proximal Policy Optimization (PPO) [2] and Deep Q-Network (DQN) [1], a state-of-the-art on-policy method and a well-known off-policy method, respectively.

The following is a summary of the training steps for each method, together with their characteristics:

On-Policy (PPO)

This method uses the current policy under training and updates its parameters after collecting rollouts in every episode. PPO has two main parts: a critic and an actor. The actor represents the policy, while the critic provides value estimates for each state, trained with its own objective.
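For reference, the PPO actor is typically updated with the clipped surrogate objective from [2] (a standard form, not specific to this post's implementation), where r_t(θ) is the probability ratio between the new and old policies and Â_t is the advantage estimated with the critic:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$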

Off-Policy (DQN)

DQN trains its policy off-policy by collecting rollouts into a replay buffer using epsilon-greedy exploration. Unlike PPO, DQN does not always take the best action according to its current policy; with some probability it selects a random action instead, which allows it to explore different solutions. An additional target network, a less frequently updated copy of the Q-network, can be used to make the learning objective more stable.
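For reference, the Q-network is regressed toward a one-step bootstrap target computed with the target network Q_θ⁻ (a standard form from [1], notation introduced here):

$$y = r + \gamma \max_{a'} Q_{\theta^{-}}(s', a'), \qquad L(\theta) = \left(Q_{\theta}(s, a) - y\right)^2$$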

The following figure shows the episodic return curves for both methods, with DQN in black and PPO in orange.

For this simple example:

  • Both PPO and DQN converge, with a slight advantage for PPO. Neither method reaches the expert level of 16.6 (PPO comes close at 15.26).
  • DQN seems slower to converge in terms of interaction steps, which is known as sample inefficiency, compared to PPO.
  • PPO takes longer wall-clock training time, possibly due to its actor-critic setup, which updates two networks with different objectives.

The parameters for training both methods are mostly the same. For a closer look at how these curves were generated, check the scripts ppo.py and dqn.py in the accompanying repository.

Behavior Cloning, first proposed in [4], is a direct IL method. It uses supervised learning to map each state to an action based on the expert demonstrations D.
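The objective can then be written as follows (a standard form consistent with the symbol definitions below):

$$\pi_{bc} = \arg\min_{\pi} \; \mathbb{E}_{s \sim D}\left[\, l\!\left(\pi(s), \pi_E(s)\right) \right]$$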

where π_bc is the trained policy, π_E is the expert policy, and l(π_bc(s), π_E(s)) is the loss function between the expert and the trained policy in response to the same state.

The difference between BC and plain supervised learning lies in framing the problem as an interactive environment where actions are taken in response to dynamic states (e.g., a robot moving towards a goal). In contrast, supervised learning maps inputs to outputs, like classifying images or predicting temperature. This distinction is explained in [8].

In this implementation, the full set of initial positions for the agent contains only 15 possibilities. Consequently, there are only 15 trajectories to learn from, which the BC network could easily memorize. To make the problem harder, we clip the size of the training dataset D to half (only 240 state-action pairs) and repeat this for all IL methods that follow in this post.

After training the model (as shown in the bc.py script), we get an average episodic return of 11.49 with a standard deviation of 5.24.

This is much lower than the forward RL methods above. The following GIF shows the trained BC model in action.

From the GIF, it is evident that almost two-thirds of the trajectories have learned to pass through the wall. However, the model gets stuck on the last third, as it cannot infer the true policy from the previous examples, especially since it was given only half of the 15 expert trajectories to learn from.

MaxEnt [3] is another method that trains a reward model separately (not iteratively), besides Behavior Cloning (BC). Its main idea lies in maximizing the probability of the expert trajectories under the current reward function.
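Following [3], this probability is the maximum entropy trajectory distribution (with r_θ denoting the current reward model):

$$P(\tau) = \frac{1}{Z} \exp\left(\sum_{t=0}^{N} r_\theta(s_t, a_t)\right)$$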

where τ is the trajectory of ordered state-action pairs, N is the trajectory length, and Z is a normalizing constant summing over the (exponentiated) returns of all possible trajectories under the given policy.

From there, the method derives its main objective based on the maximum entropy theorem [3], which states that the most representative policy fulfilling a given condition is the one with the highest entropy H. Therefore, MaxEnt looks for a reward that maximizes the entropy of the policy while still explaining the demonstrations, which leads to maximizing the following quantity.
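In [3], this quantity is the log-likelihood of the expert trajectories under the maximum entropy distribution above:

$$L(\theta) = \sum_{\tau \in D} \log P(\tau \mid \theta)$$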

Its derivative with respect to the reward parameters has a simple form.
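For the linear reward r_θ(s) = θᵀf_s used in [3] (a simplifying assumption relative to this post), with f̃ denoting the empirical feature count of the expert demonstrations, the gradient can be written as:

$$\nabla_\theta L(\theta) = \tilde{f} - \sum_{s} D_s\, f_s$$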

where D_s is the state visitation frequency (SVF), which can be calculated with a dynamic programming algorithm given the current policy.

In our implementation of MaxEnt here, we skip training a new reward, for which the dynamic programming algorithm would be slow and lengthy. Instead, we opt to test the main idea of maximizing the entropy by re-training a BC model exactly as in the previous section, but with the negative entropy of the inferred action distribution added as an extra term to the loss. The entropy term must be negative because we want to maximize it while minimizing the loss.
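A minimal sketch of such a loss, assuming a discrete-action policy that outputs logits (the function and variable names are illustrative, not taken from the repository):

```python
import torch
import torch.nn.functional as F

def bc_entropy_loss(logits, expert_actions, entropy_weight=0.5):
    # Standard behavior-cloning term: cross-entropy against the expert action.
    ce = F.cross_entropy(logits, expert_actions)
    # Entropy of the predicted action distribution.
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    # Subtracting the entropy (i.e., adding negative entropy) means that
    # minimizing the loss pushes the action distribution toward higher entropy.
    return ce - entropy_weight * entropy
```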

After adding the negative entropy of the action distributions with a weight of 0.5 (choosing the exact value is important; otherwise, it may lead to worse learning), we see a slight improvement over the previous BC model, with an average episodic return of 11.56 (+0.07). The small size of the improvement can be explained by the simple nature of the environment, which contains a limited number of states. If the state space were bigger, the entropy term would matter more.

The original work on GAIL [5] was inspired by Generative Adversarial Networks (GANs), which apply adversarial training to enhance the generative abilities of a main model. Similarly, GAIL applies this idea to matching the state-action distributions of the trained and expert policies.

This can be derived as a Kullback-Leibler divergence minimization, as shown in the main paper [5]. The paper finally derives the main objective for the two models (called the generator and the discriminator in GAIL).
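As given in [5] (notation slightly adapted), the saddle-point objective is:

$$\min_{\pi_\theta} \max_{D} \; \mathbb{E}_{\pi_\theta}\!\left[\log D(s, a)\right] + \mathbb{E}_{\pi_E}\!\left[\log\left(1 - D(s, a)\right)\right] - \lambda H(\pi_\theta)$$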

where D is the discriminator, π_θ is the generator model (i.e., the policy under training), π_E is the expert policy, and H(π_θ) is the entropy of the generator model.

The discriminator acts as a binary classifier, while the generator is the actual policy model being trained.

The main advantage of GAIL over the previous methods (and the reason it performs better) lies in its interactive training process: the trained policy explores different states on its own, guided by the discriminator's reward signal.

After training GAIL for 1.6 million steps, the model converged to a higher level than the BC and MaxEnt models. If trained further, even better results could be achieved.

Specifically, we obtained an average episodic reward of 12.8, which is noteworthy considering that only 50% of the demonstrations were provided, without any real reward.

This figure shows the training curve for GAIL (with ground truth episodic rewards on the y-axis). It is worth noting that the rewards coming from log(D(s,a)) will be more chaotic than the ground truth due to GAIL's adversarial training.

One remaining problem with GAIL is that the trained reward model, the discriminator, does not actually represent the ground truth reward. Instead, the discriminator is trained as a binary classifier between expert and generator state-action pairs, which converges to an average value of 0.5. This means the discriminator can only be considered a surrogate reward.

To solve this problem, the paper in [6] reformulates the discriminator.
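As formulated in [6], the discriminator takes the form:

$$D_\theta(s, a) = \frac{\exp\left(f_\theta(s, a)\right)}{\exp\left(f_\theta(s, a)\right) + \pi(a \mid s)}$$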

where f_θ(s, a) should converge to the actual advantage function. In this example, this value represents how close the agent is to the invisible goal. The ground truth reward can be recovered by adding another term that captures a shaped reward; however, for this experiment, we restrict ourselves to the advantage function above.

After training the AIRL model with the same parameters as GAIL, we obtained the following training curve:

Given the same number of training steps (1.6 million), AIRL was slower to converge due to the added complexity of training the discriminator. However, we now have a meaningful advantage function, albeit with a performance of only 10.8 episodic reward, which is still adequate.

Let's examine the values of this advantage function against the ground truth reward along the expert demonstrations. To make the values more comparable, we also normalized the learned advantage function f_θ. This gives the following plot:

In this figure, there are 15 pulses corresponding to the 15 initial states of the agent. We can see larger errors in the trained model over the last half of the plot, which is due to using only half of the expert demos in training.

For the first half, we observe a low point when the agent stands still at the goal with zero reward, while the trained model evaluates this as a high value. In the second half, there is a general shift towards lower values.

Roughly speaking, the learned function follows the ground truth reward, and AIRL has recovered useful information about it.

Despite the advances made by the previous methods, an important problem still persists in imitation learning: multi-modal learning. To apply IL to practical problems, it is necessary to learn from multiple possible expert policies. For instance, when driving or playing football, there is no single "true" way of doing things; experts vary in their methods, and the IL model should be able to learn these variations consistently.

To address this issue, InfoGAIL was developed [7]. Inspired by InfoGAN [11], which conditions the type of outputs generated by a GAN using an additional style vector, InfoGAIL builds on the GAIL objective and adds another criterion: maximizing the mutual information between state-action pairs and a new controlling input vector z.
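Since the mutual information I(z; s, a) cannot be maximized directly, [7] uses a variational lower bound (notation slightly adapted here):

$$L_I(\pi, Q) = \mathbb{E}_{z \sim p(z),\, a \sim \pi(\cdot \mid s, z)}\left[\log Q(z \mid s, a)\right] + H(z) \;\leq\; I(z; s, a)$$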

The derivation relies on the non-negativity of the Kullback-Leibler divergence, and the posterior p(z | s, a) is approximated with a new model, Q, which takes (s, a) as input and outputs z.

The final objective for InfoGAIL then combines this bound with the GAIL objective.
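As written in [7] (notation adapted), with weights λ_1 and λ_2:

$$\min_{\pi_\theta,\, Q} \max_{D} \; \mathbb{E}_{\pi_\theta}\!\left[\log D(s, a)\right] + \mathbb{E}_{\pi_E}\!\left[\log\left(1 - D(s, a)\right)\right] - \lambda_1 L_I(\pi_\theta, Q) - \lambda_2 H(\pi_\theta)$$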

As a result, the policy has an additional input, namely z, as shown in the following figure:

In our experiments, we generated new multi-modal expert demos where each expert enters through one gap only (of the three gaps in the wall), regardless of their goal. The full demo set was used without labels indicating which expert was acting. The z variable is a one-hot encoding vector with three elements representing the expert class (e.g., [1 0 0] for the left door); a minimal sketch of how z is fed to the policy follows the list below. The policy should:

  • Learn to move towards the goal
  • Link randomly generated z values to the different expert modes (thus passing through different doors)
  • The Q model should be able to detect which mode it is, based on the direction of the actions in each state
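A minimal sketch of conditioning the policy on z, assuming the grid state is flattened to a vector (layer sizes and names are illustrative, not taken from the repository):

```python
import torch
import torch.nn as nn

class ConditionedPolicy(nn.Module):
    """Policy network that takes the state and the one-hot mode vector z."""

    def __init__(self, state_dim=15 * 15, z_dim=3, n_actions=5, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + z_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # logits over the 5 actions
        )

    def forward(self, state, z):
        # Concatenate the flattened state with the one-hot mode vector z.
        return self.net(torch.cat([state, z], dim=-1))

# Example: sample an action for the "left door" mode z = [1, 0, 0].
policy = ConditionedPolicy()
state = torch.zeros(1, 15 * 15)
z = torch.tensor([[1.0, 0.0, 0.0]])
action = torch.distributions.Categorical(logits=policy(state, z)).sample()
```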

Note that the training curves of the discriminator, the Q model, and the policy are chaotic due to the adversarial training.

Fortunately, we were able to learn two modes clearly. However, the third mode was not recognized by either the policy or the Q model. The following three GIFs show the expert modes learned by InfoGAIL when given different values of z:

z = [1,0,0]
z = [0,1,0]
z = [0,0,1]

Lastly, the policy was able to converge to an episodic reward of around 10 after 800K training steps. With more training steps, better results could be achieved, even though the experts used in this example are not optimal.

Reviewing our experiments, it is clear that all the IL methods performed well in terms of the episodic reward criterion. The following table summarizes their performance:

*InfoGAIL results are not comparable, since its expert demos were based on multi-modal experts

The table shows that GAIL performed best on this problem, while AIRL was slower due to its new reward formulation, resulting in a lower return. InfoGAIL also learned well but struggled to recognize all three expert modes.

Imitation learning is a challenging and interesting field. The methods we have explored here are suitable for grid simulation environments but may not directly translate to real-world applications. Practical use of IL is still in its infancy, apart from some BC methods, and linking simulation to reality introduces new errors due to the differences in their nature.

Another open challenge in IL is multi-agent imitation learning. Works like MAIRL [9] and MAGAIL [10] have experimented with multi-agent environments, but a general theory for learning from multiple expert trajectories remains an open question.

The attached GitHub repository provides a basic implementation of these methods that can be easily extended. The code will be updated in the future. If you are interested in contributing, please submit an issue or a pull request with your modifications. Alternatively, feel free to leave a comment, and we will follow up with updates.

Note: Unless otherwise noted, all images are generated by the author

[1] Mnih, V., et al. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[2] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[3] Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. (2008, July). Maximum entropy inverse reinforcement learning. In AAAI (Vol. 8, pp. 1433–1438).

[4] Bain, M., & Sammut, C. (1995, July). A Framework for Behavioural Cloning. In Machine Intelligence 15 (pp. 103–129).

[5] Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. Advances in neural information processing systems, 29.

[6] Fu, J., Luo, K., & Levine, S. (2017). Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248.

[7] Li, Y., Song, J., & Ermon, S. (2017). Infogail: Interpretable imitation learning from visual demonstrations. Advances in neural information processing systems, 30.

[8] Osa, T., Pajarinen, J., Neumann, G., Bagnell, J. A., Abbeel, P., & Peters, J. (2018). An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics, 7(1–2), 1–179.

[9] Yu, L., Song, J., & Ermon, S. (2019, May). Multi-agent adversarial inverse reinforcement learning. In International Conference on Machine Learning (pp. 7194–7201). PMLR.

[10] Song, J., Ren, H., Sadigh, D., & Ermon, S. (2018). Multi-agent generative adversarial imitation learning. Advances in neural information processing systems, 31.

[11] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29.

[12] Huang, S., Dossa, R. F. J., Ye, C., Braga, J., Chakraborty, D., Mehta, K., & Araújo, J. G. (2022). CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. Journal of Machine Learning Research, 23(274), 1–18.
