AI model speeds up high-resolution computer vision

An autonomous vehicle must rapidly and accurately recognize objects that it encounters, from an idling delivery truck parked on the corner to a cyclist whizzing toward an approaching intersection.

To do that, the vehicle might use a robust computer vision model to categorize every pixel in a high-resolution image of this scene, so it doesn’t lose sight of objects that might be obscured in a lower-quality image. But this task, known as semantic segmentation, is complex and requires an enormous amount of computation when the image has high resolution.

Researchers from MIT, the MIT-IBM Watson AI Lab, and elsewhere have developed a more efficient computer vision model that vastly reduces the computational complexity of this task. Their model can perform semantic segmentation accurately in real time on a device with limited hardware resources, such as the on-board computers that enable an autonomous vehicle to make split-second decisions.

Recent state-of-the-art semantic segmentation models directly learn the interaction between each pair of pixels in a picture, so their calculations grow quadratically as image resolution increases. For this reason, while these models are accurate, they’re too slow to process high-resolution images in real time on an edge device like a sensor or cell phone.

The MIT researchers designed a new building block for semantic segmentation models that achieves the same capabilities as these state-of-the-art models, but with only linear computational complexity and hardware-efficient operations.

The result is a new model series for high-resolution computer vision that performs up to nine times faster than prior models when deployed on a mobile device. Importantly, this new model series exhibited the same or better accuracy than these alternatives.

Not only could this technique be used to help autonomous vehicles make decisions in real time, it could also improve the efficiency of other high-resolution computer vision tasks, such as medical image segmentation.

“While researchers have been using traditional vision transformers for quite a long time, and they give amazing results, we want people to also pay attention to the efficiency aspect of these models. Our work shows that it is possible to drastically reduce the computation so this real-time image segmentation can happen locally on a device,” says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and senior author of the paper describing the new model.

He is joined on the paper by lead author Han Cai, an EECS graduate student; Junyan Li, an undergraduate at Zhejiang University; Muyan Hu, an undergraduate student at Tsinghua University; and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The research will be presented at the International Conference on Computer Vision.

A simplified solution

Categorizing every pixel in a high-resolution image that may have millions of pixels is a difficult task for a machine-learning model. A powerful new type of model, known as a vision transformer, has recently been used effectively.

Transformers were originally developed for natural language processing. In that context, they encode each word in a sentence as a token and then generate an attention map, which captures each token’s relationships with all other tokens. This attention map helps the model understand context when it makes predictions.
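
As a rough illustration of that mechanism (generic scaled dot-product attention, not code from the paper; the names `tokens`, `W_q`, `W_k`, and `W_v` are placeholders), here is a minimal NumPy sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens, W_q, W_k, W_v):
    """tokens: (N, d) array, one row per token (e.g., one word embedding)."""
    Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
    # The attention map is N x N: every token's relationship to every other token.
    attn_map = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    # Each output token is a context-aware mix of all tokens.
    return attn_map @ V
```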

Using the same concept, a vision transformer chops an image into patches of pixels and encodes each small patch into a token before generating an attention map. In generating this attention map, the model uses a similarity function that directly learns the interaction between each pair of pixels. In this way, the model develops what is known as a global receptive field, which means it can access all the relevant parts of the image.
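
A minimal sketch of the patching step, assuming square, non-overlapping patches; the 16-pixel patch size and the optional projection matrix `W_embed` are illustrative choices, not the paper’s settings:

```python
import numpy as np

def image_to_tokens(image, patch=16, W_embed=None):
    """Split an (H, W, C) image into non-overlapping patch x patch blocks and
    flatten each block into a token vector, optionally projecting it with a
    learned matrix W_embed. Assumes H and W are divisible by `patch`."""
    H, W, C = image.shape
    tokens = (image
              .reshape(H // patch, patch, W // patch, patch, C)
              .transpose(0, 2, 1, 3, 4)          # group pixels by patch
              .reshape(-1, patch * patch * C))   # (num_patches, patch*patch*C)
    return tokens if W_embed is None else tokens @ W_embed
```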

Since a high-resolution image may contain millions of pixels, chunked into thousands of patches, the attention map quickly becomes enormous. Because of this, the amount of computation grows quadratically as the resolution of the image increases.
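
Some back-of-the-envelope arithmetic makes the quadratic growth concrete (assuming 16x16 patches, an illustrative choice):

```python
# Illustrative arithmetic: token counts and attention-map entries per resolution,
# assuming 16x16 patches. Doubling the resolution quadruples the tokens and
# grows the attention map roughly 16-fold.
for side in (512, 1024, 2048):
    n_tokens = (side // 16) ** 2
    print(f"{side}x{side} image -> {n_tokens:,} tokens -> {n_tokens**2:,} attention entries")
```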

In their new model series, called EfficientViT, the MIT researchers used a simpler mechanism to build the attention map, replacing the nonlinear similarity function with a linear similarity function. This allows them to rearrange the order of operations to reduce total calculations without changing functionality or losing the global receptive field. With their model, the amount of computation needed for a prediction grows linearly as the image resolution grows.
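
A minimal sketch of that reordering, using a generic kernel-style linear attention with a ReLU feature map as one illustrative choice (not necessarily the paper’s exact formulation): because phi(K)^T V is only d x d, the N x N attention map is never materialized, so the cost grows linearly with the number of tokens N.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernel-style linear attention: softmax(Q K^T) V is replaced by
    phi(Q) (phi(K)^T V), computed right-to-left so the N x N map never exists.
    phi = ReLU is used here as one simple nonnegative feature map."""
    Qp, Kp = np.maximum(Q, 0), np.maximum(K, 0)        # phi(Q), phi(K): (N, d)
    kv = Kp.T @ V                                      # (d, d), no N x N term
    norm = Qp @ Kp.sum(axis=0, keepdims=True).T + eps  # (N, 1) normalizer
    return (Qp @ kv) / norm                            # (N, d), linear in N
```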

“But there is no free lunch. The linear attention only captures global context about the image, losing local information, which makes the accuracy worse,” Han says.

To compensate for that accuracy loss, the researchers included two extra components in their model, each of which adds only a small amount of computation.

One of those elements helps the model capture local feature interactions, mitigating the linear function’s weakness in local information extraction. The second, a module that enables multiscale learning, helps the model recognize both large and small objects.
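
The article does not detail how these two modules are built. Purely as a rough sketch of the two ideas, a small convolution to restore local detail and pooling at several strides for multiscale context, one might picture something like the following; every name and size here is illustrative:

```python
import numpy as np

def local_branch(feat, kernel=3):
    """Stand-in for local feature extraction: a small box filter over an
    (H, W, C) feature map, restoring neighborhood detail that a purely
    global linear attention can miss."""
    H, W, C = feat.shape
    pad = kernel // 2
    padded = np.pad(feat, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(feat, dtype=float)
    for dy in range(kernel):
        for dx in range(kernel):
            out += padded[dy:dy + H, dx:dx + W, :]
    return out / (kernel * kernel)

def multiscale_context(feat, scales=(1, 2, 4)):
    """Stand-in for multiscale learning: average-pool the feature map at a few
    strides, upsample back, and average, so both large and small objects
    register. Assumes H and W are divisible by every scale."""
    H, W, C = feat.shape
    out = np.zeros_like(feat, dtype=float)
    for s in scales:
        pooled = feat.reshape(H // s, s, W // s, s, C).mean(axis=(1, 3))
        out += np.repeat(np.repeat(pooled, s, axis=0), s, axis=1)
    return out / len(scales)
```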

“The most critical part here is that we need to carefully balance the performance and the efficiency,” Cai says.

They designed EfficientViT with a hardware-friendly architecture, so it could be easier to run on different types of devices, such as virtual reality headsets or the edge computers on autonomous vehicles. Their model could also be applied to other computer vision tasks, like image classification.

Streamlining semantic segmentation

When they tested their model on datasets used for semantic segmentation, they found that it performed up to nine times faster on an Nvidia graphics processing unit (GPU) than other popular vision transformer models, with the same or better accuracy.

“Now, we can get the best of both worlds and reduce the computing to make it fast enough that we can run it on mobile and cloud devices,” Han says.

Building off these results, the researchers want to apply this method to speed up generative machine-learning models, such as those used to generate new images. They also want to continue scaling up EfficientViT for other vision tasks.

“Efficient transformer models, pioneered by Professor Song Han’s team, now form the backbone of cutting-edge techniques in diverse computer vision tasks, including detection and segmentation,” says Lu Tian, senior director of AI algorithms at AMD, Inc., who was not involved with this paper. “Their research not only showcases the efficiency and capability of transformers, but also reveals their immense potential for real-world applications, such as enhancing image quality in video games.”

“Model compression and lightweight model design are crucial research topics toward efficient AI computing, especially in the context of large foundation models. Professor Song Han’s group has shown remarkable progress compressing and accelerating modern deep learning models, particularly vision transformers,” adds Jay Jackson, global vice president of artificial intelligence and machine learning at Oracle, who was not involved with this research. “Oracle Cloud Infrastructure has been supporting his team to advance this line of impactful research toward efficient and green AI.”
