Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models


TL;DR:

  • Qwen3-8B is one of the most exciting recent releases: a model with native agentic capabilities, making it a natural fit for the AI PC.

  • With OpenVINO.GenAI, we were able to speed up generation by ~1.3× using speculative decoding with a lightweight Qwen3-0.6B draft.

  • By applying a simple pruning process to the draft on top of speculative decoding, we pushed the speedup even further, to ~1.4×.

  • We wrapped up by showing how these improvements can be used to run a fast, local AI agent with 🤗 smolagents.



Qwen3

Qwen3-8B is part of the latest Qwen family, trained with explicit agentic behaviors. It supports tool invocation, multi-step reasoning, and long-context handling, making it well-suited for complex agent workflows. When integrated with frameworks like Hugging Face 🤗smolagents, QwenAgent, or AutoGen, it enables a wide range of agentic applications built around tool use and reasoning. Unlike single-turn chatbots, agentic applications depend on reasoning models that produce "thinking aloud" traces: intermediate steps that expand token usage, making inference speed critical to responsiveness.
The combination of optimized inference and built-in agentic intelligence makes Qwen3-8B a compelling foundation for next-generation AI agents.



Accelerating Qwen3-8B on Intel® Core™ Ultra with Speculative Decoding

We began by benchmarking the 4-bit optimized OpenVINO version of Qwen3-8B on an Intel Lunar Lake integrated GPU, establishing this as our baseline for further acceleration.

Speculative decoding is a technique to speed up auto-regressive generation. It works by using a smaller, faster draft model to propose multiple tokens, which are then validated by the larger target model in a single forward pass. In our setup, Qwen3-8B served as the target model while Qwen3-0.6B was used as the draft. This approach delivered an average speedup of ~1.3× over the baseline.

from openvino_genai import LLMPipeline, draft_model

target_path = "/path/to/target/Qwen3-8B-int4-ov"
draft_path = "/path/to/draft/Qwen3-0.6B-int8-ov"
device = "GPU"

# Passing a draft model to the pipeline enables speculative decoding
model = LLMPipeline(target_path, device, draft_model=draft_model(draft_path, device))

# Stream tokens to stdout as they are generated
streamer = lambda x: print(x, end="", flush=True)
model.generate("What is speculative decoding and how does it improve inference speed?", max_new_tokens=100, streamer=streamer)

Before initializing the LLMPipeline, make sure that both the target and draft models are converted to OpenVINO. You can either download pre-converted models from the provided links or follow these instructions to convert your own.
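If you prefer to convert the models yourself rather than download the pre-converted ones, the sketch below shows one way to do it with the optimum-intel Python API (assuming a recent optimum-intel release with Qwen3 support; the output directory name is a placeholder and should match the paths used above):

```python
# A minimal conversion sketch using optimum-intel. The equivalent CLI is:
#   optimum-cli export openvino --model Qwen/Qwen3-8B --weight-format int4 Qwen3-8B-int4-ov
# which also exports the OpenVINO tokenizer files that LLMPipeline expects.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Qwen/Qwen3-8B"
out_dir = "Qwen3-8B-int4-ov"  # placeholder output directory

ov_model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,  # convert the checkpoint to OpenVINO IR on the fly
    quantization_config=OVWeightQuantizationConfig(bits=4),  # 4-bit weight compression
)
ov_model.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)

# The Qwen3-0.6B draft can be exported the same way, e.g. with OVWeightQuantizationConfig(bits=8).
```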



Pushing Performance Further

The speculative decoding speedup depends on the average number of tokens generated per forward step of the target model, E(#generated_tokens), the speculation window size γ, and the ratio c between the draft and target models' latency. A smaller, faster (though less accurate) draft can often deliver greater acceleration. This inspired us to shrink the draft model while preserving its quality, i.e. keeping E(#generated_tokens) high.

$$\text{Speedup} = \frac{E(\#\text{generated\_tokens})}{\gamma c + 1}$$
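To make the relationship concrete, here is a quick plug-in of the formula with purely illustrative numbers (not measurements): a draft running at roughly 10% of the target's latency with about two tokens accepted per target step lands near the ~1.3× we observed, and a faster draft pushes the prediction toward ~1.4×.

```python
# Illustrative plug-in of the speedup formula above; the numbers are hypothetical, not measurements.
def speculative_speedup(expected_tokens: float, gamma: int, c: float) -> float:
    """Speedup = E(#generated_tokens) / (gamma * c + 1)."""
    return expected_tokens / (gamma * c + 1)

# Original draft: ~2 tokens accepted per target step, window gamma=5, draft at 10% of target latency.
print(round(speculative_speedup(2.0, gamma=5, c=0.10), 2))  # 1.33
# Pruned draft: same acceptance rate, but lower draft latency relative to the target.
print(round(speculative_speedup(2.0, gamma=5, c=0.08), 2))  # 1.43
```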

Our recent work shows that model depth (the number of layers) is a major contributor to inference latency.
We drew inspiration from recent work on layer-wise compression [1]. In our approach, we identify blocks of layers that contribute little, measured using angular distance, and remove them. After pruning, we apply fine-tuning to recover accuracy. Using this method, we pruned 6 of the 28 layers of the Qwen3-0.6B draft model.
To recover the quality of the pruned draft model, we further fine-tuned it on synthetic data generated by Qwen3-8B.
The data was produced by generating responses to 500k prompts from the BAAI/Infinity-Instruct dataset.
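As a rough illustration of the layer-selection step described above (a sketch in the spirit of [1], not the exact code we used): score every block of consecutive layers by the angular distance between the hidden states entering and leaving it, averaged over a small calibration set, then drop the block with the lowest score before fine-tuning. The tensor layout of `hidden_states` and the helper names below are our own assumptions.

```python
# Sketch of block selection by angular distance, in the spirit of [1].
import math
import torch

def angular_distance(x: torch.Tensor, y: torch.Tensor) -> float:
    # d(x, y) = arccos(cosine_similarity(x, y)) / pi, averaged over calibration samples
    cos = torch.nn.functional.cosine_similarity(x, y, dim=-1).clamp(-1.0, 1.0)
    return (torch.arccos(cos) / math.pi).mean().item()

def select_block_to_prune(hidden_states: torch.Tensor, n_prune: int) -> int:
    """hidden_states: shape (num_layers + 1, num_samples, hidden_dim), where row i is the
    input to layer i and the last row is the final output. Returns the start index of the
    n_prune consecutive layers whose removal perturbs the hidden state the least."""
    num_layers = hidden_states.shape[0] - 1
    scores = [
        angular_distance(hidden_states[start], hidden_states[start + n_prune])
        for start in range(num_layers - n_prune + 1)
    ]
    return min(range(len(scores)), key=scores.__getitem__)

# Hypothetical usage: pick the 6-layer block to drop from the 28-layer draft, remove those
# layers from the transformer stack, then fine-tune the pruned model on the synthetic data.
# start = select_block_to_prune(hidden_states, n_prune=6)
# model.model.layers = torch.nn.ModuleList(
#     layer for i, layer in enumerate(model.model.layers) if not (start <= i < start + 6)
# )
```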

The resulting pruned draft model delivered a ~1.4× speedup compared to the baseline, an improvement over the ~1.3× gain achieved with the original draft. This result aligns with theoretical expectations: reducing draft latency improves the overall speedup, enabling faster and more efficient inference.
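Swapping in the pruned draft is a one-line change to the pipeline setup from earlier (the path below is a placeholder for wherever you download or save the pruned model):

```python
# Same speculative-decoding pipeline as before; only the draft path changes (placeholder path).
pruned_draft_path = "/path/to/draft/Qwen3-0.6B-depth-pruned-int8-ov"
model = LLMPipeline(target_path, device, draft_model=draft_model(pruned_draft_path, device))
```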

This demonstrates how pruning combined with speculative decoding can unlock faster and more efficient inference, making local AI agents even more practical.

Check out the notebook and the Qwen3-0.6B depth-pruned draft model to reproduce our results step by step.



Integration with 🤗smolagents

To showcase the real-world potential, we deployed our optimized setup with the 🤗smolagents library. With this integration, developers can plug in Qwen3-8B (paired with our pruned draft) to build agents that call APIs and external tools, write and execute code, handle long-context reasoning, and run efficiently on Intel® Core™ Ultra.
The benefits aren't limited to Hugging Face: this model pairing can also be used seamlessly with frameworks like AutoGen or QwenAgent, further strengthening the agentic ecosystem.

In our demo, we assigned the accelerated Qwen3-based agent a task: 👉 Summarize the key features of the Qwen3 model series and present them in a slide deck.

Here's how it worked:
1. The agent used a web search tool to gather up-to-date information.
2. It then switched to the Python interpreter to generate slides with the python-pptx library.
This simple workflow highlights just a fraction of the possibilities unlocked when accelerated Qwen3 models meet frameworks like 🤗smolagents, bringing practical, efficient AI agents to life on the AI PC. Try it here 🚀
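For reference, here is a rough sketch of what the wiring for such an agent can look like. The adapter below is illustrative glue code on our part, not an official integration: the custom-model interface smolagents expects varies across versions, so consult the smolagents documentation and the linked demo for the precise setup.

```python
# Illustrative glue code only: the custom-model interface expected by smolagents (plain callable
# vs. Model subclass, string vs. ChatMessage return type) differs across versions -- treat this
# as a sketch and check the smolagents docs for the version you are using.
from openvino_genai import LLMPipeline, draft_model
from smolagents import CodeAgent, DuckDuckGoSearchTool

pipe = LLMPipeline(
    "/path/to/target/Qwen3-8B-int4-ov",  # placeholder paths, as above
    "GPU",
    draft_model=draft_model("/path/to/draft/Qwen3-0.6B-depth-pruned-int8-ov", "GPU"),
)

def _as_text(content):
    # smolagents may pass message content as a string or as a list of content blocks.
    if isinstance(content, str):
        return content
    return " ".join(block.get("text", "") for block in content if isinstance(block, dict))

def ov_genai_model(messages, stop_sequences=None, **kwargs):
    # Assumption: messages arrive as {"role": ..., "content": ...} dicts and a plain string
    # reply is acceptable; adapt the return type if your smolagents version expects ChatMessage.
    prompt = "\n".join(f"{m['role']}: {_as_text(m['content'])}" for m in messages)
    return pipe.generate(prompt, max_new_tokens=2048)

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],           # web search tool used in step 1 of the demo
    model=ov_genai_model,
    additional_authorized_imports=["pptx"],   # allow python-pptx for the slide-deck step
)
agent.run("Summarize the key features of the Qwen3 model series and present them in a slide deck.")
```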



References

[1] Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P., & Roberts, D. A. (2025, January 22). The unreasonable ineffectiveness of the deeper layers. Poster presented at ICLR 2025. https://arxiv.org/abs/2403.17887

Performance and legal notices

  • Performance results are based on internal benchmarking with OpenVINO™ 2025.2 as of September 2025, using a configuration with an Intel® Core™ Ultra 7 268V 2.20 GHz processor with an integrated Intel® Arc™ 140V GPU, paired with 32 GB of DDR5 memory.
  • Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
  • No product or component can be absolutely secure.
  • Your costs and results may vary.
  • Intel technologies may require enabled hardware, software, or service activation.
  • © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
  • Other names and brands may be claimed as the property of others.


