A Crash Course of Planning for Perception Engineers in Autonomous Driving


The basics of planning and decision-making


AlphaGo, ChatGPT and FSD (image credit Elena Popova, Karthik Sridasyam and Jonathan Kemper on Unsplash)

A classical modular autonomous driving system typically consists of perception, prediction, planning, and control. Until around 2023, AI (artificial intelligence) or ML (machine learning) primarily enhanced perception in most mass-production autonomous driving systems, with its influence diminishing in downstream components. In stark contrast to the low integration of AI in the planning stack, end-to-end perception systems (such as the BEV, or bird's-eye-view, perception pipeline) have been deployed in mass-production vehicles.

Classical modular design of an autonomous driving stack, 2023 and prior (Chart created by author)

There are multiple reasons for this. A classical stack based on a human-crafted framework is more explainable and can be iterated on faster to fix field-test issues (within hours) compared with machine learning-driven features (which could take days or even weeks). However, it does not make sense to let the available human driving data sit idle. Furthermore, increasing computing power is more scalable than expanding the engineering team.

Fortunately, there has been a strong trend in both academia and industry to change this situation. First, downstream modules are becoming increasingly data-driven and may also be integrated via different interfaces, such as the one proposed in CVPR 2023's best paper, UniAD. Furthermore, driven by the ever-growing wave of Generative AI, a single unified vision-language-action (VLA) model shows great potential for handling complex robotics tasks (RT-2 in academia, TeslaBot and 1X in industry) and autonomous driving (GAIA-1 and DriveVLM in academia, and Wayve AI driver and Tesla FSD in industry). This brings the toolsets of AI and data-driven development from the perception stack to the planning stack.

This blog post aims to introduce the problem settings, existing methodologies, and challenges of the planning stack, in the form of a crash course for perception engineers. As a perception engineer, I finally had some time over the past couple of weeks to systematically learn the classical planning stack, and I would like to share what I learned. I will also share my thoughts on how AI can help, from the perspective of an AI practitioner.

The intended audience for this post is AI practitioners who work in the field of autonomous driving, in particular, perception engineers.

The article is a bit long (11,100 words), and the table of contents below will most definitely help those who wish to do quick ctrl+F searches with the keywords.

Table of Contents (ToC)

Why learn planning?
What's planning?
The issue formulation
The Glossary of Planning
Behavior Planning
Frenet vs Cartesian systems
Classical tools-the troika of planning
Searching
Sampling
Optimization
Industry practices of planning
Path-speed decoupled planning
Joint spatiotemporal planning
Decision making
What and why?
MDP and POMDP
Value iteration and Policy iteration
AlphaGo and MCTS-when nets meet trees
MPDM (and successors) in autonomous driving
Industry practices of decision making
Trees
No trees
Self-Reflections
Why NN in planning?
What about e2e NN planners?
Can we do without prediction?
Can we do with just nets but no trees?
Can we use LLMs to make decisions?
The trend of evolution

Why learn planning?

This brings us to an interesting question: why learn planning, especially the classical stack, in the era of AI?

From a problem-solving perspective, understanding your customers' challenges better will enable you, as a perception engineer, to serve your downstream customers more effectively, even if your main focus remains on perception work.

Machine learning is a tool, not a solution. The most efficient way to solve problems is to combine new tools with domain knowledge, especially domain knowledge with solid mathematical formulations. Domain knowledge-inspired learning methods are likely to be more data-efficient. As planning transitions from rule-based to ML-based systems, even with early prototypes and products of end-to-end systems hitting the road, there is a need for engineers who can deeply understand both the fundamentals of planning and machine learning. Despite these changes, classical and learning methods will likely continue to coexist for a considerable period, perhaps shifting from an 8:2 to a 2:8 ratio. It is quite essential for engineers working in this field to understand both worlds.

From a value-driven development perspective, understanding the limitations of classical methods is crucial. This insight allows you to effectively utilize new ML tools to design a system that addresses current issues and delivers immediate impact.

Moreover, planning is a critical part of all autonomous agents, not only in autonomous driving. Understanding what planning is and how it works will enable more ML talents to work on this exciting topic and contribute to the development of truly autonomous agents, whether they are cars or other forms of automation.

The problem formulation

As the "brain" of autonomous vehicles, the planning system is crucial for the safe and efficient driving of the vehicle. The goal of the planner is to generate trajectories that are safe, comfortable, and efficiently progressing toward the goal. In other words, safety, comfort, and efficiency are the three key objectives of planning.

As input to the planning system, all perception outputs are required, including static road structures, dynamic road agents, free space generated by occupancy networks, and traffic wait conditions. The planning system must also ensure vehicle comfort by monitoring acceleration and jerk for smooth trajectories, while considering interaction and traffic courtesy.

The planning system generates trajectories in the format of a sequence of waypoints for the ego vehicle's low-level controller to track. Specifically, these waypoints represent the future positions of the ego vehicle at a series of fixed time stamps. For example, each point might be 0.4 seconds apart, covering an 8-second planning horizon, resulting in a total of 20 waypoints.
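As a toy illustration of this output format (the spacing and horizon mirror the example above; the position values are placeholders, not a real plan):

```python
# A minimal sketch of the waypoint output described above: 20 waypoints,
# 0.4 s apart, covering an 8-second planning horizon. Positions are
# placeholders; a real planner fills them with the optimized (x, y) values.
DT = 0.4                 # seconds between waypoints
HORIZON = 8.0            # planning horizon in seconds
N = int(HORIZON / DT)    # 20 waypoints

trajectory = [(0.0, 0.0, (i + 1) * DT) for i in range(N)]  # (x, y, t) tuples
```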

A classical planning stack roughly consists of global route planning, local behavior planning, and local trajectory planning. Global route planning provides a road-level path from the start point to the end point on a global map. Local behavior planning decides on a semantic driving action type (e.g., car following, nudging, side passing, yielding, and overtaking) for the next several seconds. Based on the behavior type decided by the behavior planning module, local trajectory planning generates a short-term trajectory. Global route planning is typically provided by a map service once navigation is set and is beyond the scope of this post. We will focus on behavior planning and trajectory planning from now on.

Behavior planning and trajectory generation can work explicitly in tandem or be combined into a single process. In explicit methods, behavior planning and trajectory generation are distinct processes operating within a hierarchical framework, working at different frequencies, with behavior planning at 1–5 Hz and trajectory planning at 10–20 Hz. While highly efficient most of the time, adapting to different scenarios may require significant modifications and fine-tuning. More advanced planning systems combine the two into a single optimization problem. This approach ensures feasibility and optimality without any compromise.

Classification of planning design approaches (source: Fluid Dynamics Planner)

The Glossary of Planning

You might have noticed that the terminology used in the above section and in the image does not completely match. There is no standard terminology that everyone uses. Across both academia and industry, it is not unusual for engineers to use different names to refer to the same concept and the same name to refer to different concepts. This indicates that planning in autonomous driving is still under active development and has not fully converged.

Here, I list the notation used in this post and briefly explain other notions present in the literature.

  • Planning: A top-level concept, parallel to control, that generates trajectory waypoints. Together, planning and control are jointly referred to as PnC (planning and control).
  • Control: A top-level concept that takes in trajectory waypoints and generates high-frequency steering, throttle, and brake commands for actuators to execute. Control is relatively well-established compared with other areas and is beyond the scope of this post, despite the common notion of PnC.
  • Prediction: A top-level concept that predicts the future trajectories of traffic agents other than the ego vehicle. Prediction can be considered a lightweight planner for other agents and is also called motion prediction.
  • Behavior Planning: A module that produces high-level semantic actions (e.g., lane change, overtake) and typically generates a coarse trajectory. It is also referred to as task planning or decision making, particularly in the context of interactions.
  • Motion Planning: A module that takes in semantic actions and produces smooth, feasible trajectory waypoints for the duration of the planning horizon for control to execute. It is also known as trajectory planning.
  • Trajectory Planning: Another term for motion planning.
  • Decision Making: Behavior planning with a focus on interactions. Without ego-agent interaction, it is simply known as behavior planning. It is also referred to as tactical decision making.
  • Route Planning: Finds the preferred route over road networks; also referred to as mission planning.
  • Model-Based Approach: In planning, this refers to manually crafted frameworks used in the classical planning stack, as opposed to neural network models. Model-based methods contrast with learning-based methods.
  • Multimodality: In the context of planning, this typically refers to multiple intentions. This contrasts with multimodality in the context of multimodal sensor inputs to perception or multimodal large language models (such as VLMs or VLAs).
  • Reference Line: A local (several hundred meters) and coarse path based on global routing information and the current state of the ego vehicle.
  • Frenet Coordinates: A coordinate system based on a reference line. Frenet simplifies a curvy path in Cartesian coordinates to a straight tunnel model. See below for a more detailed introduction.
  • Trajectory: A 3D spatiotemporal curve, in the form of (x, y, t) in Cartesian coordinates or (s, l, t) in Frenet coordinates. A trajectory consists of both a path and a speed profile.
  • Path: A 2D spatial curve, in the form of (x, y) in Cartesian coordinates or (s, l) in Frenet coordinates.
  • Semantic Action: A high-level abstraction of an action (e.g., car following, nudge, side pass, yield, overtake) with a clear human intention. Also referred to as intention, policy, maneuver, or primitive motion.
  • Action: A term without a fixed meaning. It can refer to the output of control (high-frequency steering, throttle, and brake commands for actuators to execute) or the output of planning (trajectory waypoints). A semantic action refers to the output of behavior planning.

Different literature may use different notations and concepts; these variations illustrate the diversity in terminology and the evolving nature of the field.

Behavior Planning

As a machine learning engineer, you may notice that the behavior planning module is a heavily manually crafted intermediate module. There is no consensus on the exact form and content of its output. Concretely, the output of behavior planning can be a reference path or object labeling on ego maneuvers (e.g., pass from the left or right-hand side, pass or yield). The term "semantic action" has no strict definition and no fixed methods.

The decoupling of behavior planning and motion planning increases efficiency in solving the extremely high-dimensional action space of autonomous vehicles. The actions of an autonomous vehicle need to be reasoned about at typically 10 Hz or more (the time resolution of the waypoints), and most of these actions are relatively straightforward, like going straight. After decoupling, the behavior planning layer only needs to reason about future scenarios at a relatively coarse resolution, while the motion planning layer operates in the local solution space based on the decision made by behavior planning. Another benefit of behavior planning is converting non-convex optimization into convex optimization, which we will discuss further below.

Frenet vs Cartesian systems

The Frenet coordinate system is a widely adopted system that merits its own introduction section. The Frenet frame simplifies trajectory planning by independently managing lateral and longitudinal movements relative to a reference path. The s coordinate represents longitudinal displacement (distance along the road), while the l (or d) coordinate represents lateral displacement (side position relative to the reference path).

Frenet simplifies a curvy path in Cartesian coordinates to a straight tunnel model. This transformation converts non-linear road boundary constraints on curvy roads into linear ones, significantly simplifying the subsequent optimization problems. Moreover, humans perceive longitudinal and lateral movements differently, and the Frenet frame allows separate and more flexible optimization of these movements.
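To make the idea concrete, here is a minimal sketch (assuming numpy) of projecting a Cartesian point onto a polyline reference line to obtain (s, l). Production converters also handle heading, curvature, and higher-order state terms; this is only the geometric core:

```python
import numpy as np

def cartesian_to_frenet(point, ref_line):
    """Project a 2D point onto a polyline reference line.

    ref_line: (N, 2) array of distinct waypoints along the reference path.
    Returns (s, l): arc length along the line and signed lateral offset.
    """
    p = np.asarray(point, dtype=float)
    pts = np.asarray(ref_line, dtype=float)
    seg_vec = pts[1:] - pts[:-1]                      # segment vectors
    seg_len = np.linalg.norm(seg_vec, axis=1)
    cum_s = np.concatenate([[0.0], np.cumsum(seg_len)])

    best = (np.inf, 0.0, 0.0)                         # (squared distance, s, l)
    for i, (a, v, L) in enumerate(zip(pts[:-1], seg_vec, seg_len)):
        t = np.clip(np.dot(p - a, v) / (L * L), 0.0, 1.0)   # projection ratio on this segment
        proj = a + t * v
        d2 = np.sum((p - proj) ** 2)
        if d2 < best[0]:
            normal = np.array([-v[1], v[0]]) / L      # left-pointing unit normal
            l = np.dot(p - proj, normal)              # signed lateral offset (left positive)
            best = (d2, cum_s[i] + t * L, l)
    return best[1], best[2]

# Example: a point 1 m to the left of a straight reference line along x.
ref = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
print(cartesian_to_frenet((5.0, 1.0), ref))  # approximately (5.0, 1.0)
```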

Schematics on the conversion from Cartesian frame to Frenet frame (source: Cartesian Planner)

The Frenet coordinate system requires a clean, structured road graph with low-curvature lanes. In practice, it is preferred for structured roads with small curvature, such as highways or city expressways. However, the issues with the Frenet coordinate system are amplified with increasing reference line curvature, so it should be used cautiously on structured roads with high curvature, such as city intersections with guide lines.

For unstructured roads, such as ports, mining areas, parking lots, or intersections without guide lines, the more flexible Cartesian coordinate system is recommended. The Cartesian system is better suited to these environments because it can handle higher curvature and less structured scenarios more effectively.

Classical tools: the troika of planning

Planning in autonomous driving involves computing a trajectory from an initial high-dimensional state (including position, time, velocity, acceleration, and jerk) to a goal subspace, ensuring all constraints are satisfied. Searching, sampling, and optimization are the three most widely used tools for planning.

Searching

Classical graph-search methods are popular in planning and are used in route/mission planning on structured roads or directly in motion planning to find the best path in unstructured environments (such as parking, or urban intersections, especially mapless scenarios). There is a clear evolution path, from Dijkstra's algorithm to A* (A-star), and further to hybrid A*.

Dijkstra's algorithm explores all possible paths to find the shortest one, making it a blind (uninformed) search algorithm. It is a systematic method that guarantees the optimal path, but it is inefficient to deploy. As shown in the chart below, it explores nearly all directions. Essentially, Dijkstra's algorithm is a breadth-first search (BFS) weighted by movement costs. To improve efficiency, we can use information about the location of the goal to trim down the search space.

Visualization of Dijkstra’s algorithm and A-star search (Source: PathFinding.js, example inspired by RedBlobGames)

The A* algorithm uses heuristics to prioritize paths that appear to be leading closer to the goal, making it more efficient. It combines the cost so far (Dijkstra) with the cost to go (the heuristic, essentially greedy best-first). A* only guarantees the shortest path if the heuristic is admissible and consistent. If the heuristic is poor, A* can perform worse than the Dijkstra baseline and may degenerate into a greedy best-first search.
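A minimal grid-based A* sketch makes the "cost so far plus cost to go" idea concrete; with the heuristic set to zero it reduces to Dijkstra. This is a generic textbook implementation for illustration, not any particular planner's code:

```python
import heapq

def a_star(grid, start, goal):
    """Minimal A* on a 4-connected occupancy grid (0 = free, 1 = blocked).
    Manhattan distance is an admissible heuristic for unit-cost 4-connected moves."""
    rows, cols = len(grid), len(grid[0])
    def h(cell):                        # heuristic: estimated cost to go
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_set = [(h(start), 0, start)]   # (f = g + h, g = cost so far, cell)
    best_g = {start: 0}
    parent = {}
    while open_set:
        _, g, cur = heapq.heappop(open_set)
        if cur == goal:                 # reconstruct the path back to the start
            path = [cur]
            while cur in parent:
                cur = parent[cur]
                path.append(cur)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    parent[nxt] = cur
                    heapq.heappush(open_set, (ng + h(nxt), ng, nxt))
    return None                         # no path found

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))
```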

In the specific application of autonomous driving, the hybrid A* algorithm further improves A* by considering vehicle kinematics. A path from plain A* may not satisfy kinematic constraints and cannot be tracked accurately (for example, the steering angle is typically within 40 degrees). While A* operates in grid space for both state and action, hybrid A* separates them, keeping the state on the grid but allowing continuous action according to kinematics.

Analytical expansion (shot to goal) is another key innovation proposed by hybrid A*. A natural enhancement to A* is to connect the most recently explored node to the goal using a non-colliding straight line. If this is possible, we have found the solution. In hybrid A*, this straight line is replaced by Dubins and Reeds-Shepp (RS) curves, which comply with vehicle kinematics. This early stopping method strikes a balance between optimality and feasibility by focusing more on feasibility on the farther side.

Hybrid A* is used heavily in parking scenarios and mapless urban intersections. Here is a very nice video showcasing how it works in a parking scenario.

Hybrid A-star algorithm with analytical expansion (source: the 2010 IJRR Hybrid A-star paper and 2012 Udacity class )

Sampling

Another popular approach to planning is sampling. The well-known Monte Carlo method is a random sampling method. In essence, sampling involves selecting many candidates randomly or according to a prior, and then selecting the best one according to a defined cost. For sampling-based methods, the fast evaluation of many options is critical, as it directly impacts the real-time performance of the autonomous driving system.

Large Language Models (LLMs) essentially provide samples, and there needs to be an evaluator with a defined cost that aligns with human preferences. This evaluation process ensures that the selected output meets the desired criteria and quality standards.

Sampling can occur in a parameterized solution space if we already know the analytical solution to a given problem or subproblem. For example, we typically want to minimize the time integral of the square of jerk (the third derivative of position p(t), indicated by the triple dots over p, where one dot represents one order of derivative with respect to time), among other criteria.

Minimizing squared jerk for driving comfort (source: Werling et al, ICRA 2010)

It can be mathematically proven that quintic (fifth-order) polynomials provide the jerk-optimal connection between two states in a position-velocity-acceleration space, even when additional cost terms are considered. By sampling in this parameter space of quintic polynomials, we can find the one with the minimum cost to get an approximate solution. The cost takes into account factors such as speed, acceleration, jerk limits, and collision checks. This approach essentially solves the optimization problem through sampling.
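A small sketch of this idea, assuming numpy: solve for the quintic coefficients from boundary conditions, then sample candidate terminal states and keep the lowest-jerk one. The sampled terminal offsets and the single cost term are illustrative only; a real planner adds collision, speed, and lane-centering costs:

```python
import numpy as np

def quintic_coeffs(x0, v0, a0, xT, vT, aT, T):
    """Coefficients of p(t) = c0 + c1*t + ... + c5*t**5 matching
    position/velocity/acceleration at t = 0 and t = T."""
    A = np.array([
        [1, 0,    0,      0,       0,        0],
        [0, 1,    0,      0,       0,        0],
        [0, 0,    2,      0,       0,        0],
        [1, T, T**2,   T**3,    T**4,     T**5],
        [0, 1,  2*T, 3*T**2,  4*T**3,   5*T**4],
        [0, 0,    2,    6*T, 12*T**2,  20*T**3],
    ])
    return np.linalg.solve(A, np.array([x0, v0, a0, xT, vT, aT]))

def jerk_cost(c, T, n=200):
    """Crude numerical integral of squared jerk p'''(t)^2 over [0, T]."""
    t = np.linspace(0.0, T, n)
    jerk = 6 * c[3] + 24 * c[4] * t + 60 * c[5] * t**2
    return float(np.sum(jerk**2) * (T / (n - 1)))

# Start 0.5 m off-center; sample candidate terminal lateral offsets and keep
# the jerk-cheapest connection. Real planners add collision and other costs.
candidates = []
for l_end in np.linspace(-1.0, 1.0, 9):
    c = quintic_coeffs(0.5, 0.0, 0.0, l_end, 0.0, 0.0, T=4.0)
    candidates.append((jerk_cost(c, 4.0), l_end))
print("best terminal offset:", min(candidates)[1])
```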

Sampling of lateral movement time profiles (source: Werling et al, ICRA 2010)

Sampling-based methods have inspired numerous ML papers, including CoverNet, Lift-Splat-Shoot, NMP, and MP3. These methods replace mathematically sound quintic polynomials with human driving behavior, utilizing a large database. The evaluation of trajectories can be easily parallelized, which further supports the use of sampling-based methods. This approach effectively leverages a vast amount of expert demonstrations to mimic human-like driving behavior, while avoiding random sampling of acceleration and steering profiles.

Sampling from human-driving data for data-driven planning methods (source: NMP, CoverNet and Lift-splat-shoot)

Optimization

Optimization finds the best solution to a problem by maximizing or minimizing a specific objective function under given constraints. In neural network training, a similar principle is followed using gradient descent and backpropagation to adjust the network's weights. However, in optimization tasks outside of neural networks, models are usually less complex, and more efficient methods than gradient descent are often employed. For example, while gradient descent can be applied to Quadratic Programming, it is often not the most efficient method.

In autonomous driving, the planning cost to optimize typically considers dynamic objects for obstacle avoidance, static road structures for lane following, navigation information to ensure the correct route, and ego status to evaluate smoothness.

Optimization can be categorized into convex and non-convex types. The key distinction is that in a convex optimization scenario, there is only one global optimum, which is also the local optimum. This characteristic makes the result unaffected by the initial solution of the optimization problem. For non-convex optimization, the initial solution matters a lot, as illustrated in the chart below.

Convex vs non-convex optimization (source: Stanford course materials)

Since planning involves highly non-convex optimization with many local optima, it heavily relies on the initial solution. In addition, convex optimization typically runs much faster and is therefore preferred for onboard real-time applications such as autonomous driving. A typical approach is to use convex optimization in conjunction with other methods to outline a convex solution space first. This is the mathematical foundation behind separating behavior planning and motion planning, where finding a good initial solution is the role of behavior planning.

Take obstacle avoidance as a concrete example; it generally introduces non-convex problems. If we know the nudging direction, the problem becomes a convex optimization problem, with the obstacle position acting as a lower or upper bound constraint. If we don't know the nudging direction, we need to decide first which direction to nudge, making the problem convex for motion planning to solve. This nudging direction decision falls under behavior planning.
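A toy sketch of this, assuming the cvxpy package is available: once the behavior layer decides to pass the obstacle on the left, the obstacle becomes a simple bound constraint on the lateral offsets and the path problem is a convex QP. All numbers here are made up for illustration:

```python
import cvxpy as cp
import numpy as np

# Toy path optimization in the Frenet frame: decide lateral offsets l[i]
# at N stations. With the nudge direction fixed (pass on the obstacle's
# left), the obstacle turns into a lower bound and the problem is convex.
N = 30
l = cp.Variable(N)
smoothness = cp.sum_squares(cp.diff(l, 2))   # penalize curvature-like wiggles
centering = cp.sum_squares(l)                # stay near the lane center (l = 0)

obstacle_idx = range(12, 18)                 # stations blocked by the obstacle
obstacle_edge = 0.8                          # obstacle's left edge in meters
constraints = [l[0] == 0, l[-1] == 0]
constraints += [l[i] >= obstacle_edge + 0.3 for i in obstacle_idx]  # nudge left with margin

prob = cp.Problem(cp.Minimize(smoothness + 0.1 * centering), constraints)
prob.solve()
print(np.round(l.value, 2))
```

Choosing the other direction simply flips the inequality into an upper bound; each choice yields its own convex subproblem.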

Of course, we can directly optimize non-convex problems with tools such as projected gradient descent, alternating minimization, particle swarm optimization (PSO), and genetic algorithms. However, this is beyond the scope of this post.

A convex path planning problem vs a non-convex one (chart made by author)
The solution process of the convex vs non-convex path planning problem (chart made by author)

How do we make such decisions? We can use the aforementioned search or sampling methods to address non-convex problems. Sampling-based methods scatter many options across the parameter space, effectively handling non-convex issues similarly to searching.

You may also ask why deciding which direction to nudge from is enough to guarantee that the problem space is convex. To explain this, we need to discuss topology. In path space, similar feasible paths can transform continuously into one another without obstacle interference. These similar paths, grouped as "homotopy classes" in the formal language of topology, can all be explored using a single initial solution homotopic to them. All these paths form a driving corridor, illustrated as the red or green shaded area in the image above. For a 3D spatiotemporal case, please refer to the QCraft tech blog.

We can use a generalized Voronoi diagram to enumerate all homotopy classes, which roughly correspond to the different decisions available to us. However, this topic delves into advanced mathematical concepts that are beyond the scope of this blog post.

The key to solving optimization problems efficiently lies in the capabilities of the optimization solver. Typically, a solver requires roughly 10 milliseconds to plan a trajectory. If we can boost this efficiency tenfold, it can significantly impact algorithm design. This exact improvement was highlighted at Tesla AI Day 2022. A similar enhancement has occurred in perception systems, transitioning from 2D perception to Bird's Eye View (BEV) as available computing power scaled up tenfold. With a more efficient optimizer, more options can be calculated and evaluated, thereby reducing the importance of the decision-making process. However, engineering an efficient optimization solver demands substantial engineering resources.

Every time compute scales up by 10x, the algorithm will evolve to the next generation.
— The unverified law of algorithm evolution

Industry practices of planning

A key differentiator among various planning systems is whether they are spatiotemporally decoupled. Concretely, spatiotemporally decoupled methods plan in the spatial dimensions first to generate a path, and then plan the speed profile along this path. This approach is also referred to as path-speed decoupling.

Path-speed decoupling is also known as lateral-longitudinal (lat-long) decoupling, where lateral (lat) planning corresponds to path planning and longitudinal (long) planning corresponds to speed planning. This terminology originates from the Frenet coordinate system, introduced earlier.

Decoupled solutions are easier to implement and can solve about 95% of issues. In contrast, coupled solutions have a higher theoretical performance ceiling but are more challenging to implement. They involve more parameters to tune and require a more principled approach to parameter tuning.

The comparison of decoupled and joint planning (source: made by the author, inspired by QCraft)
Pros and cons of decoupled vs joint spatiotemporal planning (chart made by author)

Path-speed decoupled planning

We can take the Baidu Apollo EM planner as an example of a system that uses path-speed decoupled planning.

The EM planner significantly reduces computational complexity by transforming a three-dimensional station-lateral-speed problem into two two-dimensional problems: station-lateral and station-speed. At the core of Apollo's EM planner is an iterative Expectation-Maximization (EM) step, consisting of path optimization and speed optimization. Each step is divided into an E-step (projection and formulation in a 2D state space) and an M-step (optimization in the 2D state space). The E-step involves projecting the 3D problem into either a Frenet SL frame or an ST speed-tracking frame.

The EM iteration in Apollo EM planner (source: Baidu Apollo EM planner )

The M-step (maximization step) in both path and speed optimization involves solving non-convex optimization problems. For path optimization, this means deciding whether to nudge an object on the left or right side, while for speed optimization, it means deciding whether to overtake or yield to a dynamic object crossing the path. The Apollo EM planner addresses these non-convex optimization challenges using a two-step process: Dynamic Programming (DP) followed by Quadratic Programming (QP).

DP uses a sampling or searching algorithm to generate a coarse initial solution, effectively pruning the non-convex space into a convex space. QP then takes the coarse DP results as input and optimizes them within the convex space provided by DP. In essence, DP focuses on feasibility, and QP refines the solution to achieve optimality within the convex constraints.
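To make the DP-then-QP idea concrete, here is a toy dynamic-programming pass over an s-t grid (assuming numpy). It picks a coarse speed profile that avoids cells occupied by a predicted crossing obstacle; a QP would then smooth this coarse profile within the convex region it defines. The grid sizes, costs, and obstacle are all invented, and this is not the Apollo implementation:

```python
import numpy as np

# Toy speed DP on an s-t grid: choose a station index at each time step.
T_STEPS, S_STEPS = 8, 40            # 8 time steps, 40 station cells
V_REF = 3                           # desired progress in cells per step
occupied = np.zeros((T_STEPS, S_STEPS), dtype=bool)
occupied[3:6, 12:20] = True         # a crossing obstacle blocks s in [12, 20) for t in [3, 6)

INF = float("inf")
cost = np.full((T_STEPS, S_STEPS), INF)
parent = np.full((T_STEPS, S_STEPS), -1, dtype=int)
cost[0, 0] = 0.0

for t in range(1, T_STEPS):
    for s in range(S_STEPS):
        if occupied[t, s]:
            continue
        # No reversing: station can only stay or increase, bounded by a max step of 6.
        for prev in range(max(0, s - 6), s + 1):
            if cost[t - 1, prev] == INF:
                continue
            v = s - prev
            c = cost[t - 1, prev] + (v - V_REF) ** 2   # penalize deviation from reference speed
            if c < cost[t, s]:
                cost[t, s], parent[t, s] = c, prev

# Backtrack from the cheapest final station to recover the coarse speed profile.
s = int(np.argmin(cost[-1]))
profile = [s]
for t in range(T_STEPS - 1, 0, -1):
    s = parent[t, s]
    profile.append(s)
print("station per time step:", profile[::-1])   # the DP chooses to yield to the obstacle
```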

In our defined terminology, Path DP corresponds to lateral BP, Path QP to lateral MP, Speed DP to longitudinal BP, and Speed QP to longitudinal MP. Thus, the process involves conducting BP (behavior planning) followed by MP (motion planning) in both the path and speed steps.

A full autonomous driving stack with path-speed decoupled planning (chart made by author)

Joint spatiotemporal planning

Although decoupled planning can resolve 95% of cases in autonomous driving, the remaining 5% involve challenging dynamic interactions where a decoupled solution often results in suboptimal trajectories. In these complex scenarios, demonstrating intelligence is crucial, making this a hot topic in the field.

For example, in narrow-space passing, the optimal behavior might be to either decelerate to yield or accelerate to pass. Such behaviors are not achievable within the decoupled solution space and require joint optimization. Joint optimization allows for a more integrated approach, considering both path and speed simultaneously to handle intricate dynamic interactions effectively.

A full autonomous driving stack with joint spatiotemporal planning (chart made by author)

However, there are significant challenges in joint spatiotemporal planning. First, solving the non-convex problem directly in a higher-dimensional state space is more challenging and time-consuming than using a decoupled solution. Second, considering interactions in spatiotemporal joint planning is even more complex. We will cover this topic in more detail later when we discuss decision-making.

Here we introduce two solving methods: brute force search and constructing a spatiotemporal corridor for optimization.

Brute force search occurs directly in the 3D spatiotemporal space (2D in space and 1D in time), and can be performed in either XYT (Cartesian) or SLT (Frenet) coordinates. We will take SLT as an example. SLT space is long and flat, like an energy bar: it is elongated in the L dimension and flat in the ST face. For brute force search, we can use hybrid A-star, with the cost being a combination of progress cost and cost to go. During optimization, we must conform to search constraints that prevent reversing in both the s and t dimensions.

Overtake by lane change in spatiotemporal lattice (source: Spatiotemporal optimization with A*)

Another method is constructing a spatiotemporal corridor, essentially a curve with the footprint of a car winding through a 3D spatiotemporal state space (SLT, for example). SSC (spatiotemporal semantic corridor, RAL 2019) encodes requirements given by semantic elements into a semantic corridor, generating a safe trajectory accordingly. The semantic corridor consists of a series of mutually connected, collision-free cubes with dynamical constraints posed by the semantic elements in the spatiotemporal domain. Within each cube, the problem becomes a convex optimization problem that can be solved using Quadratic Programming (QP).

SSC still requires a BP (behavior planning) module to provide a coarse driving trajectory. Complex semantic elements of the environment are projected into the spatiotemporal domain with respect to the reference lane. EPSILON (TRO 2021) showcases a system where SSC serves as the motion planner working in tandem with a behavior planner. In the next section, we will discuss behavior planning, especially with a focus on interaction. In this context, behavior planning is often referred to as decision making.

An illustration of the spatiotemporal corridor (source: SSC)

Decision making

What and why?

Decision making in autonomous driving is essentially behavior planning, but with a focus on interaction with other traffic agents. The assumption is that other agents are mostly rational and will respond to our behavior in a predictable manner, which we can describe as "noisily rational."

People may question the necessity of decision making when advanced planning tools are available. However, two key aspects — uncertainty and interaction — introduce a probabilistic nature to the environment, primarily due to the presence of dynamic objects. Interaction is the most challenging part of autonomous driving, distinguishing it from general robotics. Autonomous vehicles must not only navigate but also anticipate and react to the behavior of other agents, making robust decision-making essential for safety and efficiency.

In a deterministic (purely geometric) world without interaction, decision making would be unnecessary, and planning through searching, sampling, and optimization would suffice. Brute force searching in the 3D XYT space could serve as a general solution.

In most classical autonomous driving stacks, a prediction-then-plan approach is adopted, assuming zero-order interaction between the ego vehicle and other vehicles. This approach treats prediction outputs as deterministic, requiring the ego vehicle to react accordingly. This leads to overly conservative behavior, exemplified by the "freezing robot" problem. In such cases, prediction fills the entire spatiotemporal space, preventing actions like lane changes in crowded conditions — something humans manage more effectively.

To handle stochastic strategies, Markov Decision Process (MDP) or Partially Observable Markov Decision Process (POMDP) frameworks are essential. These approaches shift the focus from geometry to probability, addressing chaotic uncertainty. By assuming that traffic agents behave rationally, or at least noisily rationally, decision making can help create a safe driving corridor in the otherwise chaotic spatiotemporal space.

Among the three overarching goals of planning — safety, comfort, and efficiency — decision making primarily enhances efficiency. Conservative actions can maximize safety and comfort, but effective negotiation with other road agents, achievable through decision making, is essential for optimal efficiency. Effective decision making also displays intelligence.

MDP and POMDP

We’ll first introduce Markov Decision Processes (MDP) and Partially Observable Markov Decision Processes (POMDP), followed by their systematic solutions, similar to value iteration and policy iteration.

A Markov Process (MP) is a sort of stochastic process that deals with dynamic random phenomena, unlike static probability. In a Markov Process, the long run state depends only on the present state, making it sufficient for prediction. For autonomous driving, the relevant state may only include the last second of information, expanding the state space to permit for a shorter history window.

A Markov Decision Process (MDP) extends a Markov Process to incorporate decision-making by introducing motion. MDPs model decision-making where outcomes are partly random and partly controlled by the choice maker or agent. An MDP could be modeled with five aspects:

  1. State (S): The state of the environment.
  2. Action (A): The actions the agent can take to affect the environment.
  3. Reward (R): The reward the environment provides to the agent as a result of the action.
  4. Transition Probability (P): The probability of transitioning from the old state to a new state given the agent's action.
  5. Gamma (γ): A discount factor for future rewards.

This is also the common framework used by reinforcement learning (RL), which is essentially an MDP. The goal of an MDP or RL is to maximize the cumulative reward received in the long run. This requires the agent to make good decisions given a state from the environment, according to a policy.

A policy, π, is a mapping from each state, s ∈ S, and action, a ∈ A(s), to the probability π(a|s) of taking action a when in state s. MDP or RL studies the problem of how to derive the optimal policy.

The agent-environment interface in MDP and RL (source: Reinforcement Learning: An Introduction)

A Partially Observable Markov Decision Process (POMDP) adds an extra layer of complexity by recognizing that states cannot be directly observed but rather inferred through observations. In a POMDP, the agent maintains a belief — a probability distribution over possible states — to estimate the state of the environment. Autonomous driving scenarios are better represented by POMDPs due to their inherent uncertainties and the partial observability of the environment. An MDP can be considered a special case of a POMDP where the observation perfectly reveals the state.

MDP vs POMDP (source: POMDPs as stochastic contingent planning)

POMDPs can actively collect information, leading to actions that gather necessary data, demonstrating the intelligent behavior of these models. This capability is particularly valuable in scenarios like waiting at intersections, where gathering information about other vehicles' intentions and the state of the traffic light is crucial for making safe and efficient decisions.
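A minimal sketch of the belief update that underlies a discrete POMDP (assuming numpy); the two-state "pedestrian going vs. yielding" example and all probabilities are invented for illustration, and real driving stacks use structured, continuous beliefs:

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step for a discrete POMDP.

    belief: current probability over states, shape (S,)
    T[a, s, s']: transition probability; O[a, s', o]: observation probability.
    """
    predicted = belief @ T[action]                   # predict: sum_s b(s) T(s'|s,a)
    updated = O[action][:, observation] * predicted  # correct with observation likelihood
    return updated / updated.sum()                   # normalize

# Toy example: is the crossing pedestrian "going" (state 0) or "yielding" (state 1)?
T = np.array([[[0.9, 0.1], [0.2, 0.8]]])             # one ego action ("creep forward")
O = np.array([[[0.7, 0.3], [0.2, 0.8]]])             # observations: 0 = moving, 1 = stopped
b = np.array([0.5, 0.5])
b = belief_update(b, action=0, observation=1, T=T, O=O)
print(np.round(b, 3))                                # belief shifts toward "yielding"
```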

Value iteration and policy iteration

Value iteration and policy iteration are systematic methods for solving MDP or POMDP problems. While these methods are not commonly used in real-world applications due to their complexity, understanding them provides insight into exact solutions and how they can be simplified in practice, such as using MCTS in AlphaGo or MPDM in autonomous driving.

To find the best policy in an MDP, we must assess the potential, or expected, reward from a state, or more specifically, from an action taken in that state. This expected reward includes not only the immediate reward but also all future rewards, formally known as the return, or cumulative discounted reward. (For a deeper understanding, refer to "Reinforcement Learning: An Introduction," often considered the definitive guide on the subject.)

The value function (V) characterizes the quality of states by summing the expected returns. The action-value function (Q) assesses the quality of actions for a given state. Both functions are defined with respect to a given policy. The Bellman Optimality Equation states that an optimal policy will select the action that maximizes the immediate reward plus the expected future rewards from the resulting new states. In simple terms, the Bellman Optimality Equation advises considering both the immediate reward and the future consequences of an action. For example, when switching jobs, consider not only the immediate pay raise (R) but also the future value (S') the new position offers.

Bellman’s equation of optimality (chart made by creator)

It is relatively straightforward to extract the optimal policy from the Bellman Optimality Equation once the optimal value function is available. But how do we find this optimal value function? This is where value iteration comes to the rescue.

Extract the best policy from optimal values (chart made by author)

Value iteration finds the best policy by repeatedly updating the value of each state until it stabilizes. This process is derived by turning the Bellman Optimality Equation into an update rule. Essentially, we use the optimal future picture to guide the iteration toward it. In plain language, "fake it until you make it!"

Update value functions under the guidance of Bellman's Equation (chart made by author)

Value iteration is guaranteed to converge for finite state spaces, regardless of the initial values assigned to the states (for a detailed proof, please refer to the Bible of RL). If the discount factor gamma is set to 0, meaning we only consider immediate rewards, value iteration will converge after only one iteration. A smaller gamma leads to faster convergence because the horizon of consideration is shorter, though it may not always be the best option for solving concrete problems. Balancing the discount factor is a key aspect of engineering practice.

One might ask how this works if all states are initialized to zero. The immediate reward in the Bellman Equation is crucial for bringing in additional information and breaking the initial symmetry. Think about the states that immediately lead to the goal state; their value propagates through the state space like a virus. In plain language, it is about making small wins, frequently.
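A minimal, self-contained sketch of value iteration on a toy problem (assuming numpy); the five-state corridor, rewards, and discount factor are invented purely to show the update rule converging and the value propagating back from the goal:

```python
import numpy as np

# Value iteration on a tiny 1-D gridworld: 5 states in a row, actions
# "left"/"right", +1 reward for reaching the rightmost (terminal) state.
N_STATES, GAMMA, THETA = 5, 0.9, 1e-6
ACTIONS = (-1, +1)                                  # left, right

def step(s, a):
    """Deterministic transition; the last state is terminal."""
    if s == N_STATES - 1:
        return s, 0.0
    s_next = min(max(s + a, 0), N_STATES - 1)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

V = np.zeros(N_STATES)                              # "fake it until you make it"
while True:
    delta = 0.0
    for s in range(N_STATES):
        best = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:                               # values have stabilized
        break

# Extract the greedy policy from the converged values.
policy = [max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in range(N_STATES)]
print(np.round(V, 3), policy)
```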

Value and policy functions interact until they converge to optimum together (source: Reinforcement Learning: An Introduction)

However, value iteration also suffers from inefficiency. It requires taking the optimal action at each iteration by considering all possible actions, similar to Dijkstra's algorithm. While it demonstrates feasibility as a basic approach, it is typically not practical for real-world applications.

The contrast between the Bellman Equation and the Bellman Optimality Equation (chart made by author)

Policy iteration improves on this by taking actions according to the current policy and updating it based on the Bellman Equation (not the Bellman Optimality Equation). Policy iteration decouples policy evaluation from policy improvement, making it a much faster solution. Each step is taken based on a given policy instead of exploring all possible actions to find the one that maximizes the objective. Although each iteration of policy iteration can be more computationally intensive due to the policy evaluation step, it generally results in faster convergence overall.

In simple terms, if you can only fully evaluate the consequence of one action, it is better to use your own judgment and do your best with the current information available.

AlphaGo and MCTS — when nets meet trees

We’ve all heard the unbelievable story of AlphaGo beating the most effective human player in 2016. AlphaGo formulates the gameplay of Go as an MDP and solves it with Monte Carlo Tree Search (MCTS). But why not use value iteration or policy iteration?

Value iteration and policy iteration are systematic, iterative methods that solve MDP problems. Nonetheless, even with improved policy iteration, it still requires performing time-consuming operations to update the worth of each state. A regular 19×19 Go board has roughly 2e170 possible states. This vast variety of states makes it intractable to resolve with traditional value iteration or policy iteration techniques.

AlphaGo and its successors use a Monte Carlo tree search (MCTS) algorithm to seek out their moves, guided by a price network and a policy network, trained on each human and computer play. Let’s take a have a look at vanilla MCTS first.

The 4 steps of MCTS by AlphaGo, combining each value network and policy network (source: AlphaGo, Nature 2016)

Monte Carlo Tree Search (MCTS) is a technique for policy estimation that focuses on decision-making from the present state. One iteration involves a four-step process: selection, expansion, simulation (or evaluation), and backup.

  1. Selection: The algorithm follows the most promising path based on previous simulations until it reaches a leaf node, a position not yet fully explored.
  2. Expansion: One or more child nodes are added to represent possible next moves from the leaf node.
  3. Simulation (Evaluation): The algorithm plays out a random game from the new node until the end, known as a "rollout." This assesses the potential outcome from the expanded node by simulating random moves until a terminal state is reached.
  4. Backup: The algorithm updates the values of the nodes on the path taken based on the game's result. If the outcome is a win, the value of the nodes increases; if it is a loss, the value decreases. This process propagates the result of the rollout back up the tree, refining the policy based on simulated outcomes.

After a given number of iterations, MCTS provides the frequency with which each immediate action was selected from the root during simulations. During inference, the action with the most visits is selected. Here is an interactive illustration of MCTS with the game of tic-tac-toe for simplicity.
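Below is a compact sketch of vanilla, single-agent MCTS with UCT selection. The legal_actions(), apply(), is_terminal(), and reward() methods form a hypothetical state interface assumed only for this illustration; two-player games such as Go additionally alternate the perspective of the reward at each ply, which is omitted here:

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                         # action -> child Node
        self.untried = list(state.legal_actions()) # actions not yet expanded
        self.visits, self.value = 0, 0.0

def uct_select(node, c=1.4):
    # Selection: balance exploitation (average value) and exploration (visit counts).
    return max(node.children.values(),
               key=lambda ch: ch.value / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal.
        while not node.untried and node.children:
            node = uct_select(node)
        # 2. Expansion: try one unexplored action.
        if node.untried and not node.state.is_terminal():
            a = node.untried.pop()
            child = Node(node.state.apply(a), parent=node)
            node.children[a] = child
            node = child
        # 3. Simulation: random rollout to a terminal state.
        state = node.state
        while not state.is_terminal():
            state = state.apply(random.choice(state.legal_actions()))
        result = state.reward()
        # 4. Backup: propagate the outcome back to the root.
        while node is not None:
            node.visits += 1
            node.value += result
            node = node.parent
    # At inference time, pick the most-visited action from the root.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```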

MCTS in AlphaGo is enhanced by two neural networks. The value network evaluates the winning rate from a given state (board configuration). The policy network evaluates the action distribution over all possible moves. These neural networks improve MCTS by reducing the effective depth and breadth of the search tree. The policy network helps in sampling actions, focusing the search on promising moves, while the value network provides a more accurate evaluation of positions, reducing the need for extensive rollouts. This combination allows AlphaGo to perform efficient and effective searches in the vast state space of Go.

The policy network and value network of AlphaGo (source: AlphaGo, Nature 2016)

In the expansion step, the policy network samples the most likely positions, effectively pruning the breadth of the search space. In the evaluation step, the value network provides an intuitive scoring of the position, while a faster, lightweight rollout policy network performs rollouts until the game ends to collect rewards. MCTS then uses a weighted sum of the evaluations from both networks to make the final assessment.

Note that a single evaluation of the value network approaches the accuracy of Monte Carlo rollouts using the RL policy network, but with 15,000 times less computation. This mirrors the fast-slow system design, akin to intuition versus reasoning, or System 1 versus System 2 as described by Nobel laureate Daniel Kahneman. Similar designs can be observed in more recent works, such as DriveVLM.

To be precise, AlphaGo incorporates two slow-fast systems at different levels. At the macro level, the policy network selects moves while the faster rollout policy network evaluates these moves. At the micro level, the faster rollout policy network can be approximated by a value network that directly predicts the winning rate of board positions.

What can we learn from AlphaGo for autonomous driving? AlphaGo demonstrates the importance of extracting an excellent policy using a powerful world model (simulation). Similarly, autonomous driving requires a highly accurate simulation to effectively leverage algorithms similar to those used by AlphaGo. This approach underscores the value of combining strong policy networks with detailed, precise simulations to enhance decision-making and optimize performance in complex, dynamic environments.

MPDM (and successors) in autonomous driving

In the game of Go, all states are immediately available to both players, making it a perfect information game where observation equals state. This allows the game to be characterized by an MDP process. In contrast, autonomous driving is a POMDP process, as the states can only be estimated through observation.

POMDPs connect perception and planning in a principled way. The typical solution for a POMDP is similar to that for an MDP, with a limited lookahead. However, the main challenges lie in the curse of dimensionality (explosion in state space) and the complex interactions with other agents. To make real-time progress tractable, domain-specific assumptions are typically made to simplify the POMDP problem.

MPDM (and its two follow-ups, and the white paper) is one pioneering study in this direction. MPDM reduces the POMDP to a closed-loop forward simulation of a finite, discrete set of semantic-level policies, rather than evaluating every possible control input for every vehicle. This approach addresses the curse of dimensionality by focusing on a manageable number of meaningful policies, allowing for effective real-time decision-making in autonomous driving scenarios.

Semantic actions help control the curse of dimensionality (source: EPSILON)

The assumptions of MPDM are twofold. First, much of the decision-making by human drivers involves discrete high-level semantic actions (e.g., slowing, accelerating, lane-changing, stopping). These actions are referred to as policies in this context. The second, implicit assumption concerns other agents: other vehicles will make reasonably safe decisions. Once a vehicle's policy is decided, its action (trajectory) is determined.

The framework of MPDM (chart created by author)

MPDM first selects one policy for the ego vehicle from many options (hence the "multi-policy" in its name) and selects one policy for each nearby agent based on their respective predictions. It then performs forward simulation (similar to a fast rollout in MCTS). The best interaction scenario after evaluation is then passed on to motion planning, such as the Spatiotemporal Semantic Corridor (SSC) mentioned in the joint spatiotemporal planning section.

MPDM enables intelligent and human-like behavior, such as actively cutting into dense traffic flow even when there is no sufficient gap present. This is not possible with a predict-then-plan pipeline, which does not explicitly consider interactions. The prediction module in MPDM is tightly integrated with the behavior planning module through forward simulation.

MPDM assumes a single policy throughout the decision horizon (10 seconds). Essentially, MPDM adopts an MCTS approach that is one layer deep and super wide, considering all possible agent predictions. This leaves room for improvement, inspiring many follow-up works such as EUDM, EPSILON, and MARC. For example, EUDM considers more flexible ego policies and assigns a policy tree with a depth of four, with each policy covering a time duration of 2 seconds over an 8-second decision horizon. To compensate for the extra computation induced by the increased tree depth, EUDM performs more efficient width pruning by guided branching, identifying critical scenarios and key vehicles. This approach explores a more balanced policy tree.

The forward simulation in MPDM and EUDM uses very simplistic driver models (the Intelligent Driver Model, or IDM, for longitudinal simulation, and Pure Pursuit, or PP, for lateral simulation). MPDM points out that high-fidelity realism matters less than the closed-loop nature itself, as long as policy-level decisions are not affected by low-level action execution inaccuracies.
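To show what such a lightweight forward simulation can look like, here is a sketch of the Intelligent Driver Model used as the longitudinal rollout; the parameter values are generic illustrative defaults, not those used by MPDM or EUDM:

```python
def idm_accel(v, v_lead, gap, v0=15.0, T=1.5, a_max=1.5, b=2.0, s0=2.0, delta=4):
    """Intelligent Driver Model: longitudinal acceleration of the ego vehicle
    given its speed v, the lead vehicle's speed v_lead, and the bumper gap."""
    dv = v - v_lead                                                     # closing speed
    s_star = s0 + max(0.0, v * T + v * dv / (2 * (a_max * b) ** 0.5))   # desired gap
    return a_max * (1 - (v / v0) ** delta - (s_star / gap) ** 2)

def forward_simulate(v_ego, v_lead, gap, dt=0.2, horizon=8.0):
    """Roll the ego forward under IDM while the lead keeps constant speed.
    This is the kind of lightweight closed-loop rollout used to score a
    candidate policy before handing the winner to motion planning."""
    traj = []
    for _ in range(int(horizon / dt)):
        a = idm_accel(v_ego, v_lead, gap)
        v_ego = max(0.0, v_ego + a * dt)
        gap += (v_lead - v_ego) * dt
        traj.append((v_ego, gap))
    return traj

print(forward_simulate(v_ego=12.0, v_lead=8.0, gap=20.0)[-1])  # ego settles behind the lead
```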

The conceptual diagram of decision making, where prediction, BP, and MP integrate tightly (chart created by author)

Contingency planning in the context of autonomous driving involves generating multiple potential trajectories to account for various possible future scenarios. A key motivating example is that experienced drivers anticipate multiple future scenarios and always keep a safe backup plan. This anticipatory approach leads to a smoother driving experience, even when cars perform sudden cut-ins into the ego lane.

A critical aspect of contingency planning is deferring the decision bifurcation point. This means delaying the point at which different potential trajectories diverge, allowing the ego vehicle more time to gather information and respond to different outcomes. By doing so, the vehicle can make more informed decisions, resulting in smoother and more confident driving behaviors, similar to those of an experienced driver.

Risk-aware contingency planning (source: MARC, RAL 2023)

MARC also combines behavior planning and motion planning, extending the notion and utility of forward simulation. In other words, MPDM and EUDM still use a policy tree for high-level behavior planning and rely on other motion planning pipelines, such as semantic spatiotemporal corridors (SSC), because ego motion in the policy tree is still characterized by heavily quantized behavior buckets. MARC extends this by keeping the quantized behavior for agents other than the ego vehicle but using more refined motion planning directly in the forward rollout. In a way, it is a hybrid approach, where hybrid carries a similar meaning to that in hybrid A*: a mixture of discrete and continuous.

One possible drawback of MPDM and all its follow-up works is their reliance on simple policies designed for highway-like structured environments, such as lane keeping and lane changing. This reliance may limit the capability of forward simulation to handle complex interactions. To address this, following the example of MPDM, the key to making POMDPs more effective is to simplify the action and state space through the growth of a high-level policy tree. It might be possible to create a more flexible policy tree, for example, by enumerating spatiotemporal relative position tags for all relevant objects and then performing guided branching.

Decision-making remains a hot topic in current research. Even classical optimization methods have not been fully explored yet. Machine learning methods could shine and have a disruptive impact, especially with the advent of Large Language Models (LLMs), empowered by techniques like Chain of Thought (CoT) or Monte Carlo Tree Search (MCTS).

Industry practices of decision making

Trees

Trees are systematic ways to perform decision-making. Tesla AI Day 2021 and 2022 showcased their decision-making capabilities, heavily influenced by AlphaGo and the subsequent MuZero, to address highly complex interactions.

At a high level, Tesla's approach performs behavior planning (decision making) followed by motion planning. It searches for a convex corridor first and then feeds it into continuous optimization, using spatiotemporal joint planning. This approach effectively addresses scenarios such as narrow passing, a typical bottleneck for path-speed decoupled planning.

Neural network heuristics guided MCTS (source: Tesla AI Day 2021)

Tesla also adopts a hybrid system that combines data-driven and physics-based checks. Starting with defined goals, Tesla's system generates seed trajectories and evaluates key scenarios. It then branches out to create more scenario variants, such as asserting or yielding to a traffic agent. Such an interaction search over the policy tree is showcased in the presentations from 2021 and 2022.

One highlight of Tesla's use of machine learning is the acceleration of tree search via trajectory optimization. For each node, Tesla uses physics-based optimization and a neural planner, achieving a 10 ms vs. 100 µs timeframe — resulting in a 10x to 100x improvement. The neural network is trained with expert demonstrations and offline optimizers.

Trajectory scoring is performed by combining classical physics-based checks (such as collision checks and comfort evaluation) with neural network evaluators that predict intervention likelihood and rate human-likeness. This scoring helps prune the search space, focusing computation on the most promising outcomes.

While many argue that machine learning should be applied to high-level decision-making, Tesla uses ML fundamentally to accelerate optimization and, consequently, tree search.

The Monte Carlo Tree Search (MCTS) method seems to be an ultimate tool for decision-making. Interestingly, those studying Large Language Models (LLMs) are trying to incorporate MCTS into LLMs, while those working on autonomous driving are trying to replace MCTS with LLMs.

As of roughly two years ago, Tesla's technology followed this approach. However, since March 2024, Tesla's Full Self-Driving (FSD) has switched to a more end-to-end approach, significantly different from their earlier methods.

No trees

We can still consider interactions without implicitly growing trees. Ad-hoc logic can be implemented to perform one order of interaction between prediction and planning. Even one order of interaction can already generate good behavior, as demonstrated by TuSimple. MPDM, in its original form, is essentially one order of interaction, but executed in a more principled and extendable way.

Multi-order interaction between prediction and planning (source: TuSimple AI Day, in Chinese, translated by the author)

TuSimple has also demonstrated the capability to perform contingency planning, similar to the approach proposed in MARC (though MARC can also accommodate a customized risk preference).

Contingency planning (source: TuSimple AI Day, in Chinese, translated by the author)

After learning the fundamental constructing blocks of classical planning systems, including behavior planning, motion planning, and the principled strategy to handle interaction through decision-making, I even have been reflecting on potential bottlenecks within the system and the way machine learning (ML) and neural networks (NN) may help. I’m documenting my thought process here for future reference and for others who could have similar questions. Note that the knowledge on this section may contain personal biases and speculations.

Let's look at the problem from three different perspectives: within the existing modular pipeline, as an end-to-end (e2e) NN planner, or as an e2e autonomous driving system.

Going back to the drawing board, let's review the problem formulation of a planning system in autonomous driving. The goal is to obtain a trajectory that ensures safety, comfort, and efficiency in a highly uncertain and interactive environment, all while adhering to real-time engineering constraints onboard the vehicle. These factors are summarized as goals, environments, and constraints in the chart below.

The potential of NN in planning (chart made by author)

Uncertainty in autonomous driving can refer to uncertainty in perception (observation) and uncertainty in predicting long-term agent behaviors into the future. Planning systems must also handle the uncertainty in future trajectory predictions of other agents. As discussed earlier, a principled decision-making system is an effective way to manage this.

Moreover, a typically neglected aspect is that planning must tolerate uncertain, imperfect, and sometimes incomplete perception results, especially in the current age of vision-centric and HD map-less driving. Having a Standard Definition (SD) map onboard as a prior helps alleviate this uncertainty, but it still poses significant challenges to a heavily handcrafted planner system. This perception uncertainty was considered a solved problem by Level 4 (L4) autonomous driving companies through the heavy use of Lidar and HD maps. However, it has resurfaced as the industry moves toward mass-production autonomous driving solutions without these two crutches. An NN planner is more robust and can handle largely imperfect and incomplete perception results, which is key to mass-production, vision-centric, and HD-mapless Advanced Driver Assistance Systems (ADAS).

Interaction should be treated with a principled decision-making system such as Monte Carlo Tree Search (MCTS) or a simplified version of MPDM. The main challenge is dealing with the curse of dimensionality (combinatorial explosion) by growing a balanced policy tree with smart pruning through domain knowledge of autonomous driving. MPDM and its variants, in both academia and industry (e.g., Tesla), provide good examples of how to grow this tree in a balanced way.

NNs can also improve the real-time performance of planners by speeding up motion planning optimization. This can shift the compute load from CPU to GPU, achieving orders-of-magnitude speedups. A tenfold increase in optimization speed can fundamentally impact high-level algorithm design, such as MCTS.
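One common way to realize this speedup is to distill an offline optimizer (or expert demonstrations) into a small network whose output warm-starts the onboard optimizer. The sketch below shows the idea in PyTorch; the feature dimension, horizon, architecture, and training loop are illustrative assumptions, not any particular company's implementation.

```python
import torch
import torch.nn as nn

FEATURE_DIM, HORIZON = 64, 16   # illustrative: scene feature size and number of waypoints

class NeuralPlanner(nn.Module):
    """Regress a short (x, y) trajectory from a scene feature vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEATURE_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, HORIZON * 2),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).view(-1, HORIZON, 2)

# Training pairs: (scene features, trajectory) produced offline by a slow, high-quality
# optimizer or taken from expert demonstrations. Random tensors stand in here.
features = torch.randn(1024, FEATURE_DIM)
expert_trajectories = torch.randn(1024, HORIZON, 2)

model = NeuralPlanner()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):   # imitation: match the offline solutions
    pred = model(features)
    loss = nn.functional.smooth_l1_loss(pred, expert_trajectories)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At runtime, the network's fast forward pass can seed (warm-start) the onboard optimizer,
# so each tree-search node needs far fewer optimization iterations.
warm_start = model(torch.randn(1, FEATURE_DIM))
```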

Trajectories also need to be more human-like. Human-likeness and takeover predictors can be trained with the vast amount of human driving data available. It is more scalable to increase the compute pool than to maintain a growing army of engineering talent.

The NN-based planning stack can leverage human-driving data more effectively (Chart created by author)

An end-to-end (e2e) neural network (NN) planner still constitutes a modular autonomous driving (AD) design, accepting structured perception results (and potentially latent features) as its input. This approach combines prediction, decision-making, and planning into a single network. Companies such as DeepRoute (2022) and Huawei (2024) claim to use this method. Note that relevant inputs, such as navigation and ego vehicle information, are omitted here.

A full autonomous driving stack with an e2e planner (chart made by author)

This e2e planner can be further developed into an end-to-end autonomous driving system that combines both perception and planning. This is what Wayve's LINGO-2 (2024) and Tesla's FSDv12 (2024) claim to achieve.

The advantages of this approach are twofold. First, it addresses perception issues. There are many aspects of driving that we cannot easily model explicitly with commonly used perception interfaces. For example, it is quite difficult to handcraft a driving system to nudge around a puddle of water or slow down for dips and potholes. While passing intermediate perception features can help, it may not fundamentally resolve the issue.

Moreover, emergent behavior will likely help resolve corner cases more systematically. The intelligent handling of edge cases, such as the examples above, may result from the emergent behavior of large models.

A full autonomous driving stack with a one-model e2e driver (chart made by author)

My speculation is that, in its ultimate form, the end-to-end (e2e) driver would be a large vision- and action-native multimodal model enhanced by Monte Carlo Tree Search (MCTS), assuming no computational constraints.

A world model in autonomous driving, per the 2024 consensus, is typically a multimodal model covering at least the vision and action modes (a VA model). While language can be useful for accelerating training, adding controllability, and providing explainability, it is not essential. In its fully developed form, a world model would be a VLA (vision-language-action) model.

There are at least two approaches to developing a world model:

  1. Video-Native Model: Train a model to predict future video frames, conditioned on or jointly outputting accompanying actions, as demonstrated by models like GAIA-1.
  2. Multimodality Adaptors: Start with a pretrained Large Language Model (LLM) and add multimodality adaptors, as seen in models like Lingo-2, RT-2, or ApolloFM. These multimodal LLMs are not native to vision or action but require significantly fewer training resources.

A world model can produce a policy itself through its action output, allowing it to drive the vehicle directly. Alternatively, MCTS can query the world model and use its policy outputs to guide the search. This World Model-MCTS approach, while much more computationally intensive, may have a higher ceiling in handling corner cases thanks to its explicit reasoning logic.
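The contrast between the two uses can be sketched as follows, with a stubbed world-model interface (a policy head that proposes actions and a dynamics head that imagines outcomes). The interface, the action set, and the depth-limited search standing in for full MCTS are all illustrative assumptions.

```python
import random
from typing import List, Tuple

class WorldModel:
    """Assumed, illustrative interface: a policy head and an imagination (dynamics + reward) head."""

    def propose_actions(self, state) -> List[Tuple[str, float]]:
        # (action, prior probability); placeholder values
        return [("keep_lane", 0.6), ("nudge_left", 0.3), ("brake", 0.1)]

    def imagine(self, state, action) -> Tuple[object, float]:
        # (imagined next state, estimated reward); placeholder dynamics
        return state, random.uniform(-1.0, 1.0)

def drive_directly(model: WorldModel, state) -> str:
    """Use the policy head as the driver: fast, no explicit reasoning."""
    return max(model.propose_actions(state), key=lambda a: a[1])[0]

def drive_with_search(model: WorldModel, state, depth: int = 3, width: int = 2) -> str:
    """Depth-limited search that queries the world model, standing in for full MCTS."""

    def value(s, d: int) -> float:
        if d == 0:
            return 0.0
        best = float("-inf")
        # expand only the top-`width` actions by prior to keep the tree small
        for action, _prior in sorted(model.propose_actions(s), key=lambda a: -a[1])[:width]:
            nxt, reward = model.imagine(s, action)
            best = max(best, reward + value(nxt, d - 1))
        return best

    scored = []
    for action, _prior in model.propose_actions(state):
        nxt, reward = model.imagine(state, action)
        scored.append((reward + value(nxt, depth - 1), action))
    return max(scored)[1]

print(drive_directly(WorldModel(), state="current_scene"))
print(drive_with_search(WorldModel(), state="current_scene"))
```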

Can we do without prediction?

Most current motion prediction modules represent the future trajectories of agents other than the ego vehicle as one or several discrete trajectories. It remains an open question whether this prediction-planning interface is sufficient or even necessary.

In a classical modular pipeline, prediction is still needed. However, a predict-then-plan pipeline definitely caps the upper limit of autonomous driving systems, as discussed in the decision-making section. A more critical question is how to integrate this prediction module more effectively into the overall autonomous driving stack. Prediction should aid decision-making, and a queryable prediction module within an overall decision-making framework, such as MPDM and its variants, is preferred. There are no severe issues with concrete trajectory predictions as long as they are integrated appropriately, for example through policy tree rollouts.

Another issue with prediction is that open-loop Key Performance Indicators (KPIs), such as Average Displacement Error (ADE) and Final Displacement Error (FDE), are not effective metrics, as they fail to reflect the impact on planning. Instead, metrics like recall and precision at the intent level should be considered.
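To make the contrast concrete, here is a minimal sketch of both kinds of metrics on toy data; the intent representation and the exact matching rule are illustrative assumptions.

```python
import numpy as np

def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Displacement Error: mean L2 distance over all timesteps."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Final Displacement Error: L2 distance at the last timestep."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))

def intent_precision_recall(pred_intents: set, gt_intents: set):
    """Intent-level metrics: did we predict the maneuvers that matter to planning?

    pred_intents / gt_intents are sets of (agent_id, intent) pairs such as
    {("car_3", "cut_in"), ("ped_1", "cross")}.
    """
    true_pos = len(pred_intents & gt_intents)
    precision = true_pos / len(pred_intents) if pred_intents else 1.0
    recall = true_pos / len(gt_intents) if gt_intents else 1.0
    return precision, recall

# A prediction can have excellent ADE/FDE yet still miss the one intent that matters to planning.
pred_traj, gt_traj = np.zeros((20, 2)), np.ones((20, 2)) * 0.1
print(ade(pred_traj, gt_traj), fde(pred_traj, gt_traj))
print(intent_precision_recall({("car_3", "keep_lane")}, {("car_3", "cut_in")}))
```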

In an end-to-end system, an explicit prediction module may not be necessary, but implicit supervision, together with other domain knowledge from a classical stack, can definitely help, or at least boost the data efficiency of the learning system. Evaluating the prediction behavior, whether explicit or implicit, will also be helpful in debugging such an e2e system.

Conclusions first: for an assistant, neural networks (nets) can achieve very high, even superhuman, performance. For an agent, I believe that a tree structure is still useful (though not necessarily a must).

First of all, trees can boost nets. Tree search enhances the performance of a given policy, whether it is NN-based or not. In AlphaGo, even with a policy network trained via supervised learning and reinforcement learning, the overall performance was still inferior to the MCTS-based AlphaGo, which integrates the policy network as just one component.

Second, nets can distill trees. In AlphaGo, MCTS used both a value network and the reward from a fast rollout policy network to evaluate a node (a state, or board position) in the tree. The AlphaGo paper also mentioned that while a value function alone could be used, combining the two yielded the best results. The value network essentially distilled the knowledge from the policy rollouts by directly learning the state-value mapping. This is akin to how humans distill the logical thinking of the slow System 2 into the fast, intuitive responses of System 1. Daniel Kahneman, in his book "Thinking, Fast and Slow," describes how a chess master can quickly recognize patterns and make rapid decisions after years of practice, whereas a novice would require significant effort to achieve similar results. Similarly, the value network in AlphaGo was trained to provide a fast evaluation of a given board position.

Grandmaster-Level Chess Without Search (source: DeepMind, 2024)

Recent papers explore the upper limits of this fast system with neural networks. The "chess without search" paper demonstrates that with sufficient data (prepared through tree search using a conventional algorithm), it is possible to achieve grandmaster-level proficiency. There is a clear "scaling law" related to data size and model size, indicating that as the amount of data and the capacity of the model increase, so does the proficiency of the system.

So here we have a power duo: trees boost nets, and nets distill trees. This positive feedback loop is essentially what AlphaZero uses to bootstrap itself to superhuman performance in multiple games.

The same principles apply to the development of large language models (LLMs). For games, since we have clearly defined rewards (wins or losses), we can use forward rollouts to determine the value of a certain action or state. For LLMs, the rewards are not as clear-cut as in the game of Go, so we rely on human preferences to rate the models via reinforcement learning with human feedback (RLHF). However, with models like ChatGPT already trained, we can use supervised fine-tuning (SFT), which is essentially imitation learning, to distill smaller yet still powerful models without RLHF.

Returning to the original question: nets can achieve extremely high performance with large quantities of high-quality data. This could be good enough for an assistant, depending on the tolerance for errors, but it may not be sufficient for an autonomous agent. For systems targeting driving assistance (ADAS), nets trained via imitation learning may be adequate.

Trees can significantly boost the performance of nets with an explicit reasoning loop, making them perhaps better suited for fully autonomous agents. The depth of the tree or reasoning loop depends on the return on investment of engineering resources. For example, even one order of interaction can provide substantial benefits, as demonstrated at TuSimple AI Day.

From the summary below of the most popular representative AI systems, we can see that LLMs are not designed to perform decision-making. In essence, LLMs are trained to complete documents, and even SFT-aligned LLM assistants treat dialogues as a special kind of document (completing a dialogue record).

Representative AI products as of 2024 (chart made by author)

I do not fully agree with recent claims that LLMs are slow systems (System 2). They are unnecessarily slow at inference due to hardware constraints, but in their vanilla form, LLMs are fast systems, as they cannot perform counterfactual checks. Prompting techniques such as Chain of Thought (CoT) or Tree of Thoughts (ToT) are actually simplified forms of MCTS, making LLMs behave more like slow systems.

There is extensive research attempting to integrate full-blown MCTS with LLMs. Specifically, LLM-MCTS (NeurIPS 2023) treats the LLM as a commonsense "world model" and uses LLM-induced policy actions as a heuristic to guide the search. LLM-MCTS outperforms both MCTS alone and policies induced by LLMs by a wide margin on complex, novel tasks. The much-speculated-about Q-star from OpenAI appears to follow the same approach of boosting LLMs with MCTS, as the name suggests.

Below is a rough sketch of the evolution of the planning stack in autonomous driving. It is rough because the solutions listed later are not necessarily more advanced than those listed earlier, and their debuts may not follow exact chronological order. Nonetheless, we can observe general trends. Note that the listed representative solutions from the industry are based on my interpretation of various press releases and could be subject to error.

One trend is the movement toward a more end-to-end design, with more modules consolidated into one. We see the stack evolve from path-speed decoupled planning to joint spatiotemporal planning, and from a predict-then-plan system to a joint prediction and planning system. Another trend is the increasing incorporation of machine learning-based components, especially in the last three stages. These two trends converge toward an end-to-end NN planner (without perception) or even an end-to-end NN driver (with perception).

A rough history of the evolution of planning (Chart made by author)

  • ML as a Tool: Machine learning is a tool, not a standalone solution. It can assist with planning even in current modular designs.
  • Full Formulation: Start with a full problem formulation, then make reasonable assumptions to balance performance and resources. This helps create a clear direction for a future-proof system design and allows for improvements as resources increase. Recall the transition from the POMDP formulation to engineering solutions like AlphaGo's MCTS and MPDM.
  • Adapting Algorithms: Theoretically elegant algorithms (e.g., Dijkstra and Value Iteration) are great for understanding concepts but need adaptation for practical engineering (Value Iteration is to MCTS as Dijkstra's algorithm is to Hybrid A-star).
  • Deterministic vs. Stochastic: Planning excels in resolving deterministic (not necessarily static) scenes. Decision-making in stochastic scenes is the most challenging task on the way to full autonomy.
  • Contingency Planning: Contingency planning can help merge multiple futures into a common action. It is beneficial to be aggressive only to the degree that you can always resort to a backup plan.
  • End-to-end Models: Whether an end-to-end model can solve full autonomy remains unclear. It may still need classical methods like MCTS. Nets can handle assistants, while trees can manage agents.
