Technique enables AI on edge devices to continue to learn over time

Personalized deep-learning models can enable artificial intelligence chatbots that adapt to understand a user’s accent or smart keyboards that continually update to better predict the next word based on someone’s typing history. This customization requires constant fine-tuning of a machine-learning model with new data.

Because smartphones and other edge devices lack the memory and computational power necessary for this fine-tuning process, user data are typically uploaded to cloud servers where the model is updated. But data transmission uses a great deal of energy, and sending sensitive user data to a cloud server poses a security risk.

Researchers from MIT, the MIT-IBM Watson AI Lab, and elsewhere developed a technique that enables deep-learning models to efficiently adapt to new sensor data directly on an edge device.

Their on-device training method, called PockEngine, determines which parts of a huge machine-learning model need to be updated to improve accuracy, and only stores and computes with those specific pieces. It performs the bulk of these computations while the model is being prepared, before runtime, which minimizes computational overhead and boosts the speed of the fine-tuning process.

When compared to other methods, PockEngine significantly sped up on-device training, performing up to 15 times faster on some hardware platforms. Moreover, PockEngine didn’t cause models to have any dip in accuracy. The researchers also found that their fine-tuning method enabled a popular AI chatbot to answer complex questions more accurately.

“On-device fine-tuning can enable better privacy, lower costs, customization ability, and also lifelong learning, but it is not easy. Everything has to happen with a limited number of resources. We want to be able to run not only inference but also training on an edge device. With PockEngine, now we can,” says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, a distinguished scientist at NVIDIA, and senior author of an open-access paper describing PockEngine.

Han is joined on the paper by lead author Ligeng Zhu, an EECS graduate student, as well as others at MIT, the MIT-IBM Watson AI Lab, and the University of California San Diego. The paper was recently presented at the IEEE/ACM International Symposium on Microarchitecture.

Layer by layer

Deep-learning models are based on neural networks, which comprise many interconnected layers of nodes, or “neurons,” that process data to make a prediction. When the model is run, a process called inference, a data input (such as an image) is passed from layer to layer until the prediction (perhaps the image label) is output at the end. During inference, each layer no longer needs to be stored after it processes the input.

But during training and fine-tuning, the model undergoes a process known as backpropagation. In backpropagation, the output is compared to the correct answer, and then the model is run in reverse. Each layer is updated as the model’s output gets closer to the correct answer.

Because each layer may need to be updated, the entire model and intermediate results must be stored, making fine-tuning more memory demanding than inference.
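
To make that memory contrast concrete, here is a toy sketch in PyTorch. It illustrates the general inference-versus-training trade-off, not PockEngine’s own code: during inference the intermediate activations can be discarded layer by layer, while backpropagation has to keep them so the model can be run in reverse.

```python
import torch
import torch.nn as nn

# A tiny stand-in model: a few interconnected layers of "neurons."
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(),
                      nn.Linear(256, 10))
x = torch.randn(32, 128)

# Inference: no backward graph is built, so each layer's activations
# can be freed as soon as the next layer has consumed them.
with torch.no_grad():
    prediction = model(x)

# Fine-tuning: every intermediate activation is retained so the loss can be
# backpropagated from the output back toward the first layer, which is why
# training demands far more memory than inference.
target = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()
```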

However, not all layers in the neural network are important for improving accuracy. And even for layers that are important, the entire layer may not need to be updated. Those layers, and pieces of layers, don’t need to be stored. Furthermore, one may not need to go all the way back to the first layer to improve accuracy; the process could be stopped somewhere in the middle.

PockEngine takes advantage of these factors to speed up the fine-tuning process and cut down on the amount of computation and memory required.
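
A minimal sketch of that idea in plain PyTorch (a simplification, not PockEngine itself): when only the later layers are marked as trainable, backpropagation stops early, and the frozen layers never need their gradients stored or their weights updated.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, 10))

# Freeze the earliest layer; only the later layers will be fine-tuned.
for p in model[0].parameters():
    p.requires_grad_(False)

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2)

x = torch.randn(32, 128)
target = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()   # gradients are only computed where they are still needed
optimizer.step()  # only the unfrozen layers are updated
```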

The system first fine-tunes each layer, one at a time, on a certain task and measures the accuracy improvement after each individual layer. In this way, PockEngine identifies the contribution of each layer, as well as trade-offs between accuracy and fine-tuning cost, and automatically determines the percentage of each layer that needs to be fine-tuned.

“This method matches the accuracy very well compared to full back propagation on different tasks and different neural networks,” Han adds.
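
A rough sketch of what such a layer-by-layer sweep could look like; the tiny model, the synthetic data, and the helper functions here are illustrative assumptions rather than PockEngine’s actual implementation.

```python
import copy
import torch
import torch.nn as nn

def accuracy(model, x, y):
    # Fraction of correct predictions on a small evaluation batch.
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

def tune_single_layer(base, layer_idx, x, y, steps=20):
    # Fine-tune a copy of the model with only one layer unfrozen.
    model = copy.deepcopy(base)
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model[layer_idx].parameters():
        p.requires_grad_(True)
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
    return model

base = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))
baseline = accuracy(base, x, y)

# Score how much each trainable layer helps when fine-tuned on its own.
gains = {idx: accuracy(tune_single_layer(base, idx, x, y), x, y) - baseline
         for idx in (0, 2)}           # indices of the two Linear layers
print(gains)  # a schedule would weigh these gains against each layer's cost
```

In PockEngine, measurements like these are made while the model is being prepared, and the resulting fine-tuning schedule is what the edge device later follows.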

A pared-down model

Conventionally, the backpropagation graph is generated during runtime, which involves a great deal of computation. Instead, PockEngine does this during compile time, while the model is being prepared for deployment.

PockEngine deletes bits of code to remove unnecessary layers or pieces of layers, creating a pared-down graph of the model to be used during runtime. It then performs other optimizations on this graph to further improve efficiency.

Since all this only needs to be done once, it saves on computational overhead for runtime.

“It is like before setting out on a hiking trip. At home, you would do careful planning: which trails are you going to go on, which trails are you going to ignore. So then at execution time, when you are actually hiking, you already have a very careful plan to follow,” Han explains.
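
In code, a loose analogy of that compile-time planning might look like the following sketch; the JSON “training plan” format and the layer choice are invented for illustration and are not PockEngine’s real compiler output. The expensive decisions are made once ahead of deployment and saved, and the device only executes the stored plan at each fine-tuning step.

```python
import json
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# "Compile time," before deployment: decide once which layers to fine-tune
# (for example, based on the sensitivity sweep above) and save that plan.
plan = {"trainable_layers": [2]}
with open("train_plan.json", "w") as f:
    json.dump(plan, f)

# "Runtime," on the edge device: no graph analysis, just apply the stored plan.
with open("train_plan.json") as f:
    plan = json.load(f)
for idx, layer in enumerate(model):
    for p in layer.parameters():
        p.requires_grad_(idx in plan["trainable_layers"])
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2)
```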

When they applied PockEngine to deep-learning models on different edge devices, including Apple M1 chips and the digital signal processors common in many smartphones and Raspberry Pi computers, it performed on-device training up to 15 times faster, without any drop in accuracy. PockEngine also significantly slashed the amount of memory required for fine-tuning.

The team also applied the technique to the large language model Llama-V2. With large language models, the fine-tuning process involves providing many examples, and it’s crucial for the model to learn how to interact with users, Han says. The process is also important for models tasked with solving complex problems or reasoning about solutions.

For instance, Llama-V2 models that were fine-tuned using PockEngine answered the question “What was Michael Jackson’s last album?” correctly, while models that weren’t fine-tuned failed. PockEngine cut the time it took for each iteration of the fine-tuning process from about seven seconds to less than one second on an NVIDIA Jetson Orin, an edge GPU platform.

In the future, the researchers want to use PockEngine to fine-tune even larger models designed to process text and images together.

“This work addresses growing efficiency challenges posed by the adoption of large AI models such as LLMs across diverse applications in many different industries. It not only holds promise for edge applications that incorporate larger models, but also for lowering the cost of maintaining and updating large AI models in the cloud,” says Ehry MacRostie, a senior manager in Amazon’s Artificial General Intelligence division who was not involved in this study but works with MIT on related AI research through the MIT-Amazon Science Hub.

This work was supported, in part, by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT-Amazon Science Hub, the National Science Foundation (NSF), and the Qualcomm Innovation Fellowship.
