BEV Perception in Mass Production Autonomous Driving
The Necessity of End-to-end Systems
Perception 2.0: End-to-end Perception
XNet: BEV Perception Stack from Xpeng Motors
The Future
Takeaways
References

The Recipe of XNet from Xpeng Motors

This blog post is based on the invited talk at the End-to-end Autonomous Driving Workshop at CVPR 2023 in Vancouver, titled “The Practice of Mass Production Autonomous Driving in China”.

BEV perception has witnessed great progress over the past few years. It directly perceives the environment around the autonomous driving vehicle, can be seen as an end-to-end solution within the perception module, and is a crucial step toward an end-to-end autonomous driving system. Here we define an end-to-end autonomous driving system as a fully differentiable pipeline that takes raw sensor data as input and produces a high-level driving plan or low-level control actions as output.

The autonomous driving community has witnessed rapid growth in approaches that embrace an end-to-end algorithm framework. We will first discuss the need for end-to-end approaches from first principles. Then we will review the efforts to deploy BEV perception algorithms onto mass production vehicles, taking XNet, the BEV perception architecture from Xpeng, as an example. Finally, we will brainstorm about the future of BEV perception toward fully end-to-end autonomous driving.

In solving any engineering problem, it is usually necessary to use a divide-and-conquer approach to find practical solutions quickly. This strategy involves breaking down the large problem into smaller, relatively well-defined components that can be solved independently. While this approach helps deliver a complete product quickly, it also increases the chance of getting stuck at a locally optimal solution. To reach the globally optimal solution, all components have to be optimized together in an end-to-end fashion.

The performance growth curve for divide-and-conquer vs end-to-end (chart made by the author)

The 80–20 rule reinforces the idea that 80% of the desired performance can be achieved with only 20% of the total effort. The advantage of the divide-and-conquer approach is that it allows developers to make quick progress with minimal effort. However, the downside is that this method often results in a performance ceiling at the 80% mark. To break through that ceiling and get out of the local optimum, developers must optimize certain components together, which is the first step toward an end-to-end solution. This process has to be repeated several times, breaking performance ceilings again and again until a fully end-to-end solution is achieved. The resulting curve may take the shape of a series of sigmoid curves until the globally optimal solution is approximated. One example of such an effort toward an end-to-end solution is the development of BEV perception algorithms.

In traditional autonomous driving stacks, 2D images are fed into the perception module to generate 2D results. Sensor fusion is then used to reason over the 2D results from multiple cameras and lift them to 3D. The resulting 3D objects are subsequently sent to downstream components, such as prediction and planning.

BEV perception is essentially end-to-end perception (image made by the author)

However, the sensor fusion step requires a lot of handcrafted rules to fuse the perception results from several camera streams. Each camera only perceives a portion of the object to be observed, so combining the obtained information requires careful adjustment of the fusion logic. We are essentially doing back-propagation through the engineers’ heads. Furthermore, developing and maintaining these rules creates a set of complications, resulting in numerous issues in complex urban environments.

To overcome this challenge, we can apply the Bird’s Eye View (BEV) perception model, which allows us to perceive the environment directly in the BEV space. The BEV perception stack combines two separate components into a single solution, thereby eliminating the brittle human-crafted logic. BEV perception is essentially an end-to-end solution within the perception module, and it marks a critical step toward an end-to-end autonomous driving system.

The BEV perception architecture from Xpeng is codenamed XNet. It was first publicly introduced at Xpeng 1024 Tech Day in 2022. The visualization below depicts the onboard XNet perception architecture in action. The red vehicle in the center represents the autonomous driving vehicle as it navigates a roundabout. The surrounding static environment is entirely detected by onboard perception, and no HD map is used. We can observe that XNet accurately detects a wide range of dynamic and static objects around the vehicle.

The Xpeng AI team began experimenting with the XNet architecture over two years ago (early 2021), and it has since undergone several iterations before arriving at its current form. Convolutional Neural Network (CNN) backbones generate per-camera image features, and the multi-camera features are transformed into the BEV space through a transformer structure, specifically a cross-attention module. The BEV features from several past frames are then fused with the ego pose, both spatially and temporally, and the dynamic and static elements are decoded from the fused features.
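The exact XNet implementation is not public. Purely as an illustration of the idea described above, here is a minimal PyTorch sketch in which learnable BEV queries cross-attend to features from a shared CNN backbone; the ResNet-18 stand-in, all dimensions, and the single attention layer are my own assumptions, and temporal fusion and the decoding heads are omitted.

```python
import torch
import torch.nn as nn
import torchvision

class ToyBEVEncoder(nn.Module):
    """Minimal sketch: per-camera CNN features -> cross-attention into a BEV grid."""
    def __init__(self, bev_h=50, bev_w=50, dim=256):
        super().__init__()
        # Shared CNN backbone for every camera (ResNet-18 trunk as a stand-in).
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # -> (B*N, 512, h, w)
        self.proj = nn.Conv2d(512, dim, kernel_size=1)
        # One learnable BEV query per grid cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        # Cross-attention: BEV queries attend to flattened multi-camera features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.bev_h, self.bev_w = bev_h, bev_w

    def forward(self, images):                                # images: (B, N_cams, 3, H, W)
        B, N, C, H, W = images.shape
        feats = self.proj(self.backbone(images.flatten(0, 1)))         # (B*N, dim, h, w)
        feats = feats.flatten(2).transpose(1, 2)                       # (B*N, h*w, dim)
        feats = feats.reshape(B, -1, feats.shape[-1])                  # (B, N*h*w, dim)
        queries = self.bev_queries.unsqueeze(0).expand(B, -1, -1)      # (B, bev_h*bev_w, dim)
        bev, _ = self.cross_attn(queries, feats, feats)                # (B, bev_h*bev_w, dim)
        return bev.transpose(1, 2).reshape(B, -1, self.bev_h, self.bev_w)

# bev_features = ToyBEVEncoder()(torch.randn(1, 6, 3, 224, 224))  # -> (1, 256, 50, 50)
```

A production system would typically also incorporate camera intrinsics and extrinsics into the attention (for example via geometry-aware positional encodings), fuse past BEV features with ego motion as described above, and attach separate heads for dynamic and static elements.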

XNet results and architecture (chart made by the author)

The vision-centric BEV perception architecture improves the cost-effectiveness of mass-deployed autonomous driving solutions by reducing the need for more expensive hardware components. The accurate 3D detection and velocity estimation open up a new dimension of redundancy and reduce the reliance on LiDARs and radars. Moreover, the real-time perception of the 3D environment lessens the dependency on HD maps. Both capabilities contribute significantly to a more reliable and cost-effective autonomous driving solution.

Deploying such a neural network on production vehicles presents several challenges. Firstly, hundreds of thousands of multi-camera video clips are necessary to train XNet. These clips involve around one billion objects requiring annotation. At the current annotation efficiency, roughly 2,000 human-years would be needed for annotation. Unfortunately, this means that for the in-house annotation team of around 1,000 people at Xpeng, such a task would take around two years to complete, which is simply not acceptable. From a model training perspective, it would take 276 days to train such a network on a single machine. Moreover, deploying such a network without any optimization on an NVIDIA Orin platform would consume 122% of the chip’s computation power.

All of these issues present challenges that we have to address for successful training and deployment of such a complex and large model.

Autolabel

To improve annotation efficiency, we developed a highly efficient autolabel system. This offline sensor fusion stack boosts efficiency by up to 45,000 times, enabling us to complete annotation tasks that would have required about 2,000 human-years in only around 17 days.
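As a quick sanity check on these numbers (my own back-of-the-envelope arithmetic, not from the talk), the quoted speedup, duration, and team estimate are mutually consistent:

```python
# Back-of-the-envelope check of the annotation figures quoted above.
human_years = 2_000        # estimated manual effort for ~1 billion objects
speedup = 45_000           # claimed autolabel efficiency gain
print(human_years * 365 / speedup)      # ~16.2 days, i.e. roughly the "17 days" quoted

team_size = 1_000          # Xpeng's in-house annotation team
print(human_years / team_size)          # 2.0 years of manual work, as stated earlier
```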

The autolabel system significantly boosts annotation efficiency

Above is the lidar-based autolabel system; we also developed a system that relies solely on vision sensors. This allows us to annotate clips obtained from customer fleets that do not have lidars. This is a critical part of the data closed loop and enhances the development of a self-evolving perception system.

Large Scale Training

We optimized the training pipeline for XNet from two perspectives. Firstly, we applied mixed precision training and operator optimization techniques to streamline the training process on a single node, which reduced the training time by a factor of 10. Next, we partnered with Alicloud and built a GPU cluster with a computation power of 600 PFLOPS, allowing us to scale out the training from a single machine to multiple machines. This reduced the training time even further, although the process was not straightforward, as we needed to carefully tune the training procedure to achieve near-linear performance scaling. Overall, we reduced the training time for XNet from 276 days to a mere 11 hours. Note that as we add more data into the training process, the training time naturally increases, calling for further optimization. Therefore, scaling-out optimization remains a continuous and demanding effort.
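The exact training setup is not disclosed. Purely as a generic sketch of the two techniques mentioned, mixed precision training on a single node and data-parallel scale-out across machines might look like this in PyTorch; the model, data loader, loss, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model: nn.Module, loader, device="cuda", distributed=False):
    """Generic sketch: AMP for single-node speedup, DDP for multi-machine scale-out."""
    model = model.to(device)
    if distributed:
        # Assumes torch.distributed.init_process_group() was called by the launcher.
        model = DDP(model, device_ids=[torch.cuda.current_device()])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()          # scales the loss to keep fp16 gradients stable
    criterion = nn.MSELoss()                      # placeholder loss

    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```

Near-linear scaling across nodes typically also requires tuning the global batch size, learning-rate schedule, and inter-node communication, which is the careful tweaking referred to above.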

Optimization of the large-scale parallel training pipeline for XNet (chart made by the author)

Efficient Deployment on Orin

We noted that without any optimization, running XNet on an NVIDIA Orin chip would require 122% of the chip’s computation power. Analyzing the profiling chart shown at the beginning, we observed that the transformer module consumed most of the runtime. This is understandable, as the transformer module had not received much attention during the Orin chip’s initial design phase. As a result, we needed to rework the transformer and attention modules to better suit the Orin platform, allowing us to achieve a 3x speedup.

Extreme optimization of the transformer-based XNet on the Orin platform (chart made by the author)

Motivated to push further, we then pruned the network, leading to an additional 2.6x speedup. Lastly, by balancing the workload between the GPU and the DLA, we achieved another 1.7x speedup.
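The talk does not specify which pruning method was used. As a generic illustration of the idea, PyTorch’s built-in pruning utilities can zero out a fraction of a layer’s filters:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Generic illustration only (not Xpeng's method): remove half of the output
# channels of a conv layer by L2 norm, then make the pruning permanent.
conv = nn.Conv2d(256, 256, kernel_size=3, padding=1)
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)   # zero 50% of filters
prune.remove(conv, "weight")                                       # bake the mask into the weights
# Masking alone does not speed up inference; real latency gains require physically
# removing the zeroed channels or a runtime/compiler that exploits the sparsity.
```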

With these optimization techniques, we reduced XNet’s GPU utilization from 122% to just 9%. This freed us to explore new architectural possibilities on the Orin platform.
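Chaining the reported gains roughly reproduces the quoted utilization drop, assuming the three speedup factors compound multiplicatively (my arithmetic, not from the talk):

```python
# Combined effect of the three reported optimizations on Orin.
baseline_utilization = 1.22            # 122% of Orin's compute without optimization
total_speedup = 3.0 * 2.6 * 1.7        # transformer rework x pruning x GPU/DLA balancing
print(total_speedup)                             # ~13.3x overall
print(baseline_utilization / total_speedup)      # ~0.092, i.e. roughly the 9% quoted
```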

Self-evolving Data Engine

With the XNet architecture in place, we can now run data-driven iterations to boost the model’s performance. To do this, we first identify corner cases on the car and then deploy configurable triggers to the customer fleet to collect relevant images. Subsequently, we retrieve images from the collected data based on a short natural-language description or an image itself. In doing so, we leverage recent advancements in large language models to increase the efficiency of dataset curation and annotation.
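The talk does not name the retrieval model used for this step. As a hypothetical sketch of text-to-image retrieval, a public CLIP checkpoint via Hugging Face Transformers could serve as a stand-in; the model choice, function names, and query are all illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical retrieval step of the data engine: embed fleet images, then find
# the ones matching a natural-language description of a corner case.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve(query_text, image_paths, top_k=5):
    inputs = processor(text=[query_text], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)
    sims = embed_images(image_paths) @ text_feat.T                  # cosine similarity
    idx = sims.squeeze(-1).topk(min(top_k, len(image_paths))).indices.tolist()
    return [image_paths[i] for i in idx]

# retrieve("truck carrying traffic cones at night", fleet_image_paths)
```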

The data engine helps improve XNet performance (chart made by the author)

With the XNet architecture and the data engine, we have created a scalable and self-evolving perception system.

The latest release of Xpeng Highway NGP 2.0 unifies the highway and city pilot solutions, allowing users to drop a pin in a different city and have a smooth experience from start to finish. This unification is made possible by XNet, which provides a solid foundation for a unified stack across all scenarios. The ultimate goal is to enable a point-to-point user experience with end-to-end autonomous driving.

In order to make the autonomous driving system end-to-end differentiable, another critical missing piece is a machine-learning-based planning stack. Learning-based planning solutions can be largely divided into imitation learning and reinforcement learning approaches. Recent progress in large language models (LLMs) also spells great potential for the advancement of this important topic. The following GitHub repo is a live collection of relevant work in the burgeoning field of end-to-end autonomous driving.

  • Divide and conquer reaches 80% of performance with 20% of the effort. End-to-end approaches aim to break the 80% performance ceiling, at potentially much greater cost.
  • XNet is an end-to-end perception system and one critical step toward an end-to-end full-stack solution. It requires significant engineering effort (80%), per the 80–20 rule.
  • The large amount of annotation needed for XNet calls for automatic annotation, as manual annotation is not feasible. The autolabel system can boost efficiency by 45,000 times.
  • Large-scale training requires optimizing training on a single machine and then scaling out from one machine to multiple machines.
  • XNet deployment on the NVIDIA Orin platform requires refactoring of the transformer module.

All charts and videos in this blog post are made by the author.

  • For the unique challenges in deploying mass production autonomous driving in China, please refer to the following link. This was also part of the same invited talk at CVPR 2023.
