BEV Perception in Mass Production Autonomous Driving
The Necessity of End-to-end Systems
Perception 2.0: End-to-end Perception
XNet: BEV Perception Stack from Xpeng Motors
The Future
Takeaways
References

The Recipe of XNet from Xpeng Motors

This blog post is based on the invited talk at the End-to-end Autonomous Driving Workshop at CVPR 2023 in Vancouver, titled “The Practice of Mass Production Autonomous Driving in China”.

BEV perception has witnessed great progress over the past few years. It directly perceives the environment around the autonomous driving vehicle, can be seen as an end-to-end solution within the perception module, and is a crucial step toward an end-to-end autonomous driving system. Here we define an end-to-end autonomous driving system as a fully differentiable pipeline that takes raw sensor data as input and produces a high-level driving plan or low-level control actions as output.

The autonomous driving community has witnessed rapid growth in approaches that embrace an end-to-end algorithm framework. We will first discuss the need for end-to-end approaches from first principles. Then we will review the efforts to deploy BEV perception algorithms onto mass production vehicles, taking XNet, the BEV perception architecture from Xpeng, as an example. Finally, we will brainstorm about the future of BEV perception toward fully end-to-end autonomous driving.

In solving any engineering problem, it is usually necessary to use a divide-and-conquer approach to find practical solutions quickly. This strategy involves breaking down the large problem into smaller, relatively well-defined components that can be solved independently. While this approach helps deliver a complete product quickly, it also increases the chance of getting stuck at a locally optimal solution. To reach the globally optimal solution, all components have to be optimized together in an end-to-end fashion.

The performance growth curve for divide-and-conquer vs end-to-end (chart made by the author)

The 80–20 rule reinforces the idea that 80% of the desired performance can be achieved with only 20% of the total effort. The advantage of the divide-and-conquer approach is that it allows developers to make quick progress with minimal effort. However, the downside is that this method often results in a performance ceiling at the 80% mark. To break through that ceiling and get out of the local optimum, developers must optimize certain components together, which is the first step toward an end-to-end solution. This process has to be repeated several times, breaking performance ceilings again and again until a fully end-to-end solution is achieved. The resulting curve may take the shape of a series of sigmoid curves until the globally optimal solution is approximated. One example of such an effort toward an end-to-end solution is the development of BEV perception algorithms.

In traditional autonomous driving stacks, 2D images are fed into the perception module to generate 2D results. Sensor fusion is then used to reason over the 2D results from multiple cameras and lift them to 3D. The resulting 3D objects are subsequently sent to downstream components, such as prediction and planning.

BEV perception is essentially end-to-end perception (image made by the author)

However, the sensor fusion step requires a lot of handcrafted rules to fuse the perception results from several camera streams. Each camera only perceives a portion of the object to be observed, so combining the obtained information requires careful adjustment of the fusion logic. We are essentially doing back-propagation through the engineers’ heads. Furthermore, developing and maintaining these rules creates a set of complications, resulting in numerous issues in complex urban environments.

To overcome this challenge, we can apply the Bird’s Eye View (BEV) perception model, which allows us to perceive the environment directly in the BEV space. The BEV perception stack combines two separate components into a single solution, thereby eliminating the brittle human-crafted logic. BEV perception is essentially an end-to-end solution within the perception module, and it marks a critical step toward an end-to-end autonomous driving system.

The BEV perception architecture from Xpeng is codenamed XNet. It was first publicly introduced at Xpeng 1024 Tech Day in 2022. The visualization below depicts the onboard XNet perception architecture in action. The red vehicle in the center represents the autonomous driving vehicle as it navigates a roundabout. The surrounding static environment is entirely detected by onboard perception, and no HD map is used. We can observe that XNet accurately detects a wide range of dynamic and static objects around the vehicle.

The Xpeng AI team began experimenting with the XNet architecture over two years ago (early 2021), and it has since undergone several iterations before arriving at its current form. Convolutional Neural Network (CNN) backbones generate per-camera image features, and the multi-camera features are transformed into the BEV space through a transformer structure, specifically a cross-attention module. The BEV features from several past frames are then fused with the ego pose, both spatially and temporally, and the dynamic and static elements are decoded from the fused features.
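The exact XNet implementation is not public. Purely as an illustration of the idea described above, here is a minimal PyTorch sketch in which learnable BEV queries cross-attend to features from a shared CNN backbone; the ResNet-18 stand-in, all dimensions, and the single attention layer are my own assumptions, and temporal fusion and the decoding heads are omitted.

```python
import torch
import torch.nn as nn
import torchvision

class ToyBEVEncoder(nn.Module):
    """Minimal sketch: per-camera CNN features -> cross-attention into a BEV grid."""
    def __init__(self, bev_h=50, bev_w=50, dim=256):
        super().__init__()
        # Shared CNN backbone for every camera (ResNet-18 trunk as a stand-in).
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # -> (B*N, 512, h, w)
        self.proj = nn.Conv2d(512, dim, kernel_size=1)
        # One learnable BEV query per grid cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        # Cross-attention: BEV queries attend to flattened multi-camera features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.bev_h, self.bev_w = bev_h, bev_w

    def forward(self, images):                                # images: (B, N_cams, 3, H, W)
        B, N, C, H, W = images.shape
        feats = self.proj(self.backbone(images.flatten(0, 1)))         # (B*N, dim, h, w)
        feats = feats.flatten(2).transpose(1, 2)                       # (B*N, h*w, dim)
        feats = feats.reshape(B, -1, feats.shape[-1])                  # (B, N*h*w, dim)
        queries = self.bev_queries.unsqueeze(0).expand(B, -1, -1)      # (B, bev_h*bev_w, dim)
        bev, _ = self.cross_attn(queries, feats, feats)                # (B, bev_h*bev_w, dim)
        return bev.transpose(1, 2).reshape(B, -1, self.bev_h, self.bev_w)

# bev_features = ToyBEVEncoder()(torch.randn(1, 6, 3, 224, 224))  # -> (1, 256, 50, 50)
```

A production system would typically also incorporate camera intrinsics and extrinsics into the attention (for example via geometry-aware positional encodings), fuse past BEV features with ego motion as described above, and attach separate heads for dynamic and static elements.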

XNet results and architecture (chart made by the author)

The vision-centric BEV perception architecture improves the cost-effectiveness of mass-deployed autonomous driving solutions by reducing the need for more expensive hardware components. The accurate 3D detection and velocity estimation open up a new dimension of redundancy and reduce the reliance on LiDARs and radars. Moreover, the real-time perception of the 3D environment lessens the dependency on HD maps. Both capabilities contribute significantly to a more reliable and cost-effective autonomous driving solution.

Deploying such a neural network on production vehicles presents several challenges. Firstly, hundreds of thousands of multi-camera video clips are necessary to train XNet. These clips involve around one billion objects requiring annotation. At the current annotation efficiency, roughly 2,000 human-years would be needed for annotation. Unfortunately, this means that for the in-house annotation team of around 1,000 people at Xpeng, such a task would take around two years to complete, which is simply not acceptable. From a model training perspective, it would take 276 days to train such a network on a single machine. Moreover, deploying such a network without any optimization on an NVIDIA Orin platform would consume 122% of the chip’s computation power.

All of these issues present challenges that we have to address for successful training and deployment of such a complex and large model.

Autolabel

To improve annotation efficiency, we developed a highly efficient autolabel system. This offline sensor fusion stack boosts efficiency by up to 45,000 times, enabling us to complete annotation tasks that would have required about 2,000 human-years in only around 17 days.
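As a quick sanity check on these numbers (my own back-of-the-envelope arithmetic, not from the talk), the quoted speedup, duration, and team estimate are mutually consistent:

```python
# Back-of-the-envelope check of the annotation figures quoted above.
human_years = 2_000        # estimated manual effort for ~1 billion objects
speedup = 45_000           # claimed autolabel efficiency gain
print(human_years * 365 / speedup)      # ~16.2 days, i.e. roughly the "17 days" quoted

team_size = 1_000          # Xpeng's in-house annotation team
print(human_years / team_size)          # 2.0 years of manual work, as stated earlier
```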

The autolabel system significantly boosts annotation efficiency

Above is the lidar-based autolabel system; we also developed a system that relies solely on vision sensors. This allows us to annotate clips obtained from customer fleets that do not have lidars. This is a critical part of the data closed loop and enhances the development of a self-evolving perception system.

Large Scale Training

We optimized the training pipeline for XNet from two perspectives. Firstly, we applied mixed precision training and operator optimization techniques to streamline the training process on a single node, which reduced the training time by a factor of 10. Next, we partnered with Alicloud and built a GPU cluster with a computation power of 600 PFLOPS, allowing us to scale out the training from a single machine to multiple machines. This reduced the training time even further, although the process was not straightforward, as we needed to carefully tune the training procedure to achieve near-linear performance scaling. Overall, we reduced the training time for XNet from 276 days to a mere 11 hours. Note that as we add more data into the training process, the training time naturally increases, calling for further optimization. Therefore, scaling-out optimization remains a continuous and demanding effort.
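The exact training setup is not disclosed. Purely as a generic sketch of the two techniques mentioned, mixed precision training on a single node and data-parallel scale-out across machines might look like this in PyTorch; the model, data loader, loss, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model: nn.Module, loader, device="cuda", distributed=False):
    """Generic sketch: AMP for single-node speedup, DDP for multi-machine scale-out."""
    model = model.to(device)
    if distributed:
        # Assumes torch.distributed.init_process_group() was called by the launcher.
        model = DDP(model, device_ids=[torch.cuda.current_device()])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()          # scales the loss to keep fp16 gradients stable
    criterion = nn.MSELoss()                      # placeholder loss

    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
            loss = criterion(model(images), targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```

Near-linear scaling across nodes typically also requires tuning the global batch size, learning-rate schedule, and inter-node communication, which is the careful tweaking referred to above.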

Optimization of the large-scale parallel training pipeline for XNet (chart made by the author)

Efficient Deployment on Orin

We noted that without any optimization, running XNet on an NVIDIA Orin chip would require 122% of the chip’s computation power. Analyzing the profiling chart shown at the beginning, we observed that the transformer module consumed most of the runtime. This is understandable, as the transformer module had not received much attention during the Orin chip’s initial design phase. As a result, we needed to rework the transformer and attention modules to better suit the Orin platform, allowing us to achieve a 3x speedup.

Extreme optimization of the transformer-based XNet on the Orin platform (chart made by the author)

Motivated to push further, we then pruned the network, leading to an additional 2.6x speedup. Lastly, by balancing the workload between the GPU and the DLA, we achieved another 1.7x speedup.
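The talk does not specify which pruning method was used. As a generic illustration of the idea, PyTorch’s built-in pruning utilities can zero out a fraction of a layer’s filters:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Generic illustration only (not Xpeng's method): remove half of the output
# channels of a conv layer by L2 norm, then make the pruning permanent.
conv = nn.Conv2d(256, 256, kernel_size=3, padding=1)
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)   # zero 50% of filters
prune.remove(conv, "weight")                                       # bake the mask into the weights
# Masking alone does not speed up inference; real latency gains require physically
# removing the zeroed channels or a runtime/compiler that exploits the sparsity.
```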

With these optimization techniques, we reduced XNet’s GPU utilization from 122% to just 9%. This freed us to explore new architectural possibilities on the Orin platform.
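Chaining the reported gains roughly reproduces the quoted utilization drop, assuming the three speedup factors compound multiplicatively (my arithmetic, not from the talk):

```python
# Combined effect of the three reported optimizations on Orin.
baseline_utilization = 1.22            # 122% of Orin's compute without optimization
total_speedup = 3.0 * 2.6 * 1.7        # transformer rework x pruning x GPU/DLA balancing
print(total_speedup)                             # ~13.3x overall
print(baseline_utilization / total_speedup)      # ~0.092, i.e. roughly the 9% quoted
```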

Self-evolving Data Engine

With the XNet architecture in place, we can now run data-driven iterations to boost the model’s performance. To do this, we first identify corner cases on the car and then deploy configurable triggers to the customer fleet to collect relevant images. Subsequently, we retrieve images from the collected data based on a short natural-language description or an image itself. In doing so, we leverage recent advancements in large language models to increase the efficiency of dataset curation and annotation.
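The talk does not name the retrieval model used for this step. As a hypothetical sketch of text-to-image retrieval, a public CLIP checkpoint via Hugging Face Transformers could serve as a stand-in; the model choice, function names, and query are all illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical retrieval step of the data engine: embed fleet images, then find
# the ones matching a natural-language description of a corner case.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve(query_text, image_paths, top_k=5):
    inputs = processor(text=[query_text], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_feat = torch.nn.functional.normalize(model.get_text_features(**inputs), dim=-1)
    sims = embed_images(image_paths) @ text_feat.T                  # cosine similarity
    idx = sims.squeeze(-1).topk(min(top_k, len(image_paths))).indices.tolist()
    return [image_paths[i] for i in idx]

# retrieve("truck carrying traffic cones at night", fleet_image_paths)
```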

The data engine helps improve XNet performance (chart made by the author)

With the XNet architecture and the data engine, we have created a scalable and self-evolving perception system.

The latest release of Xpeng Highway NGP 2.0 unifies the highway and city pilot solutions, allowing users to drop a pin in a different city and have a smooth experience from start to finish. This unification is made possible by XNet, which provides a solid foundation for a unified stack across all scenarios. The ultimate goal is to enable a point-to-point user experience with end-to-end autonomous driving.

In order to make the autonomous driving system end-to-end differentiable, another critical missing piece is a machine-learning-based planning stack. Learning-based planning solutions can be largely divided into imitation learning and reinforcement learning approaches. Recent progress in large language models (LLMs) also spells great potential for the advancement of this important topic. The following GitHub repo is a live collection of relevant work in the burgeoning field of end-to-end autonomous driving.

  • Divide and conquer reaches 80% of performance with 20% of the effort. End-to-end approaches aim to break the 80% performance ceiling, at potentially much greater cost.
  • XNet is an end-to-end perception system and one critical step toward an end-to-end full-stack solution. It requires significant engineering effort (80%), per the 80–20 rule.
  • The large amount of annotation needed for XNet calls for automatic annotation, as manual annotation is not feasible. The autolabel system can boost efficiency by 45,000 times.
  • Large-scale training requires optimizing training on a single machine and then scaling out from one machine to multiple machines.
  • XNet deployment on the NVIDIA Orin platform requires refactoring of the transformer module.

All charts and videos in this blog post are made by the author.

  • For the unique challenges in deploying mass production autonomous driving in China, please refer to the following link. This was also part of the same invited talk at CVPR 2023.
