Bringing Generative AI to the masses

By Eric Sondhi and Gian Marco Iodice

With Arm’s recent SME2 announcement, the role of Arm KleidiAI is increasingly clear as Arm’s AI accelerator layer powering the next wave of AI. By embedding into widely used Edge AI frameworks like XNNPack, MediaPipe, MNN, ONNX Runtime, and even llama.cpp, KleidiAI has delivered substantial performance improvements with no code changes required from developers. That foundation leads to the upcoming ExecuTorch 0.7 beta, where KleidiAI will be enabled by default, bringing automatic acceleration to devices built on the most recent Arm CPU architectures as well as to the enormous base of existing phones built on earlier generations.

Android and cross-platform developers, whether first- or third-party, gain easy access to KleidiAI performance optimizations via ExecuTorch and XNNPack. The result? Faster model startup, lower latency, and a leaner memory footprint, with no integration hurdles. What previously required custom tuning is now turn-key performance, ready out of the box. This efficiency unlocks new possibilities, not only for the latest high-end devices but for a much wider range of hardware.

When we think about running Generative AI (GenAI) on mobile devices, it is easy to picture the latest flagship smartphones equipped with powerful CPUs, GPUs, and NPUs. But what if we told you that GenAI experiences, like running large language models (LLMs), can also be brought to devices that are 3, 4, or even 5 years old? Or even to the Raspberry Pi 5?

Well, this is no longer just a vision but a practical reality, thanks to the Arm SDOT CPU feature, which has been available in Arm CPUs since 2015.

The SDOT (Signed Dot Product) instruction, introduced in the Armv8.2 architecture and supported by later CPUs, enables efficient dot product operations on vectors of 8-bit signed integers. The following image illustrates the behavior of one such SDOT instruction available on Arm CPUs:
[Figure: a single SDOT instruction producing four int32 outputs, each the dot product of a group of four int8 elements from the LHS and RHS vector registers]

As shown above, the instruction produces four 32-bit integer outputs, each resulting from the dot product of corresponding groups of four int8 elements from the left-hand side (LHS) and right-hand side (RHS) vector registers.
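To make the data flow concrete, here is a minimal NumPy sketch that emulates the semantics of a single SDOT instruction operating on 16-byte vector registers. It is illustrative only, and of course far slower than the real hardware instruction:

```python
import numpy as np

def sdot(acc, lhs, rhs):
    """Emulate one SDOT instruction.

    acc:      four int32 accumulator lanes
    lhs, rhs: sixteen int8 elements each, treated as four groups of four
    Each output lane accumulates the dot product of the corresponding
    four-element int8 groups, as in the figure above.
    """
    a = lhs.astype(np.int32).reshape(4, 4)
    b = rhs.astype(np.int32).reshape(4, 4)
    return acc + np.sum(a * b, axis=1)

rng = np.random.default_rng(0)
lhs = rng.integers(-128, 128, size=16, dtype=np.int8)
rhs = rng.integers(-128, 128, size=16, dtype=np.int8)
acc = np.zeros(4, dtype=np.int32)
print(sdot(acc, lhs, rhs))  # four int32 dot-product results
```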

This instruction can be used to accelerate matrix multiplication routines, the core computational workload behind every LLM, when using Int8 or lower-bit precision formats such as Int4. These operations typically involve a large number of dot products between individual rows of the left-hand side matrix and corresponding columns of the right-hand side matrix.
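As a reference for what such kernels compute, the sketch below (reusing the sdot helper from the previous snippet) expresses an int8 matrix multiplication as row-times-column dot products, walking the inner K dimension in 16-element chunks the way an SDOT-based kernel consumes its vector registers. It is a scalar model of the computation, not the optimized implementation:

```python
def int8_matmul(lhs, rhs_t):
    """Int8 matmul built from emulated SDOT operations.

    lhs:   (M, K) int8 matrix, K a multiple of 16
    rhs_t: (N, K) int8 matrix, the RHS stored transposed so each
           row of rhs_t is a column of the original RHS
    Returns an (M, N) int32 accumulator matrix.
    """
    M, K = lhs.shape
    N = rhs_t.shape[0]
    out = np.zeros((M, N), dtype=np.int32)
    for m in range(M):
        for n in range(N):
            acc = np.zeros(4, dtype=np.int32)
            for k in range(0, K, 16):  # one emulated SDOT per 16 int8 values
                acc = sdot(acc, lhs[m, k:k+16], rhs_t[n, k:k+16])
            out[m, n] = acc.sum()  # reduce the four lanes to one dot product
    return out

# Sanity check against a plain int32 matmul.
A = rng.integers(-128, 128, size=(2, 32), dtype=np.int8)
B_t = rng.integers(-128, 128, size=(3, 32), dtype=np.int8)
assert np.array_equal(int8_matmul(A, B_t),
                      A.astype(np.int32) @ B_t.astype(np.int32).T)
```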

The SDOT instruction is already widely supported across a diverse range of devices, opening the door for GenAI use cases to reach a significantly larger smartphone audience. As of today, the Arm CPUs in roughly 3 billion Arm-based devices include this capability, enabling powerful on-device GenAI experiences for the vast majority of users. In fact, 72% of all devices now support this instruction.
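If you want to check whether a given Linux-based Arm device (an Android phone over adb, or a Raspberry Pi) exposes the feature, the dot product extension is reported as the asimddp flag in /proc/cpuinfo on AArch64 Linux. The helper below is a simple sketch of that check:

```python
def has_sdot(cpuinfo_path="/proc/cpuinfo"):
    """Return True if the Arm dot product extension (SDOT/UDOT)
    is reported by the kernel as the 'asimddp' CPU feature."""
    try:
        with open(cpuinfo_path) as f:
            return "asimddp" in f.read()
    except OSError:
        return False  # not a Linux system, or cpuinfo unavailable

print(has_sdot())
```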

Thanks to ExecuTorch, we are now enabling models like Llama 3.2 to run efficiently on the vast majority of Android devices as well as on edge devices like the Raspberry Pi 5.

For the quantized Llama 3.2 1B announcement last year, the ExecuTorch and KleidiAI teams collaborated to deliver optimizations for Int4 matrix multiplication on Arm CPUs leveraging the I8MM feature, available from the Armv8.6 architecture onwards. As highlighted in a previous blog post, ExecuTorch with KleidiAI achieves over 20% higher prefill performance on the Galaxy S24+ compared with non-KleidiAI kernels.

This translates to more than 350 tokens per second during the prefill phase and over 40 tokens per second during the decode phase. This level of performance is sufficient to enable on-device tasks, such as summarizing unread messages, with a smooth user experience using only Arm CPUs. For context, summarizing around 50 unread messages typically involves processing roughly 600 tokens.
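Putting those numbers together gives a rough end-to-end estimate for the summarization example. The 100-token summary length below is our assumption for illustration, not a figure from the measurements:

```python
prefill_tps = 350       # tokens/s during prefill (Galaxy S24+, from above)
decode_tps = 40         # tokens/s during decode
prompt_tokens = 600     # ~50 unread messages
summary_tokens = 100    # assumed length of the generated summary

latency_s = prompt_tokens / prefill_tps + summary_tokens / decode_tps
print(f"~{latency_s:.1f} s end to end")  # roughly 4.2 seconds
```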

This year, the ExecuTorch and KleidiAI teams have focused on optimizing Int4 matrix multiplication performance by leveraging the SDOT instruction, aiming to broaden adoption.

Check out the XNNPack PR on GitHub.

While LLM performance on Arm CPUs with only the SDOT extension may not match the latest flagship smartphones, it still enables impressive capabilities for on-device generative AI. In fact, in many scenarios, the decode phase is faster than the average human reading speed, highlighting that even older Arm CPUs can support practical and meaningful GenAI use cases.
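To see why, compare decode throughput with reading speed. The snippet below runs that comparison; both the reading-speed figures and the 10 tokens/s decode rate are assumptions for illustration, not measured values:

```python
words_per_min = 240        # assumed typical silent reading speed
tokens_per_word = 1.3      # rough average for English text
reading_tps = words_per_min / 60 * tokens_per_word  # ~5.2 tokens/s

decode_tps = 10            # illustrative rate for an older SDOT-only CPU
print(decode_tps > reading_tps)  # True: tokens arrive faster than they are read
```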

For instance, when combined with speech-to-text and text-to-speech models, a local LLM of this kind enables a fully private smart assistant that operates entirely offline, eliminating concerns about data privacy while still offering rich voice-based interactions. Such an assistant could seamlessly interact with your connected devices while giving users peace of mind about their data.

Another compelling use case for running Llama 3.2 1B is context-aware text completion in local text editors. As you type, the model provides intelligent, real-time suggestions to streamline writing or coding workflows, all without requiring an internet connection.

These are just a few examples, and they only scratch the surface of what is possible with on-device GenAI.

With the combined power of SDOT, KleidiAI, and ExecuTorch, we are pushing the boundaries of what is possible, bringing Generative AI beyond high-end flagship devices and making it accessible on the billions of Arm-based devices already in use.

Now it’s your turn, and we’re excited to see what you’ll create. To help you get started, check out Arm’s learning path, designed to guide you through developing your own applications with LLMs using ExecuTorch and KleidiAI:

Build an Android chat app with Llama, KleidiAI, ExecuTorch, and XNNPACK


