Speed up StarCoder with 🤗 Optimum Intel on Xeon: Q8/Q4 and Speculative Decoding



Recently, code generation models have become very popular, especially with the release of state-of-the-art open-source models such as BigCode’s StarCoder and Meta AI’s Code Llama. A growing number of works focus on making Large Language Models (LLMs) more optimized and accessible. In this blog, we are happy to share the latest results of LLM optimization on Intel Xeon, focusing on the popular code generation LLM, StarCoder.

The StarCoder model is a cutting-edge LLM specifically designed for assisting the user with various coding tasks such as code completion, bug fixing, code summarization, and even generating code snippets from natural language descriptions. The StarCoder model is a member of the StarCoder family, which also includes the StarCoderBase variant. These Large Language Models for Code (Code LLMs) are trained on permissively licensed data from GitHub, including over 80 programming languages, Git commits, GitHub issues, and Jupyter notebooks. In this work we show more than 7x inference acceleration of the StarCoder-15B model on Intel 4th generation Xeon by combining 8-bit and 4-bit quantization with assisted generation.

Check out our demo on Hugging Face Spaces, which runs on a 4th Generation Intel Xeon Scalable processor.




Step 1: Baseline and Evaluation

We establish our baseline using StarCoder (15B) coupled with PyTorch and Intel Extension for PyTorch (IPEX). There are several datasets designed to evaluate the quality of automated code completion. In this work, we use the popular HumanEval dataset to evaluate the model’s quality and performance. HumanEval consists of 164 programming problems in the form of a function signature with a docstring, and the model completes the function’s code. The average prompt length is 139. We measure quality using the BigCode Evaluation Harness and report the pass@1 metric. We measure model performance in terms of Time To First Token (TTFT) and Time Per Output Token (TPOT) on the HumanEval test set and report the average TTFT and TPOT.
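
As a rough illustration of these two metrics, here is a minimal sketch of measuring TTFT and TPOT with plain transformers. It is not the exact benchmarking harness used for the reported numbers; the prompt, the number of generated tokens, and the loading options are illustrative assumptions.

```python
# Minimal sketch: measure TTFT and TPOT for greedy generation on CPU.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"  # illustrative; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

inputs = tokenizer("def print_hello_world():", return_tensors="pt")

with torch.inference_mode():
    # TTFT: latency of producing the first output token (dominated by prompt processing).
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
    ttft = time.perf_counter() - start

    # TPOT: average latency of each subsequent token during autoregressive decoding.
    n_new = 128
    start = time.perf_counter()
    model.generate(**inputs, min_new_tokens=n_new, max_new_tokens=n_new, do_sample=False)
    total = time.perf_counter() - start
    tpot = (total - ttft) / (n_new - 1)

print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms")
```
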
The 4th generation Intel Xeon processors feature AI-infused acceleration known as Intel® Advanced Matrix Extensions (Intel® AMX). Specifically, they have built-in BFloat16 (BF16) and INT8 GEMM accelerators in every core to accelerate deep learning training and inference workloads. AMX-accelerated inference is introduced through PyTorch 2.0 and Intel Extension for PyTorch (IPEX), along with other optimizations for common operators used in LLM inference (e.g. layer normalization, SoftMax, scaled dot product).
As a starting point, we use the out-of-the-box optimizations in PyTorch and IPEX to perform inference with a BF16 model. Figure 1 shows the latency of the baseline model, and Table 1 shows its latency as well as its accuracy.

Figure 1. Latency of the baseline model.
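
For reference, a minimal sketch of this BF16 + IPEX baseline is shown below, assuming PyTorch 2.x and intel-extension-for-pytorch are installed; the exact optimization options behind the published numbers may differ.

```python
# Minimal sketch: out-of-the-box BF16 inference with PyTorch + IPEX on Xeon.
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Apply IPEX operator/graph optimizations for BF16 inference.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
with torch.inference_mode(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0]))
```
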



LLM Quantization

Text generation in LLMs is performed in an auto-regressive manner, thus requiring the entire model to be loaded from memory to the CPU for each new token generated. We find that the bandwidth between the off-chip memory (DRAM) and the CPU poses the biggest bottleneck in the token generation process. Quantization is a popular approach for mitigating this issue. It reduces the model size and hence decreases the model weights loading time.

In this work we focus on two types of quantization:

  1. Weight Only Quantization (WOQ) – only the weights of the model are quantized, not the activations, while computation is performed in higher precision (e.g. BF16), which requires dequantization.
  2. Static Quantization (SQ) – both the weights and the activations are quantized. This quantization process includes pre-calculating the quantization parameters through a calibration step, which enables the computation to be executed in lower precision (e.g. INT8). Figure 2 shows the INT8 static quantization computation process. A toy comparison of both schemes is sketched below.
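
The toy torch sketch below (not the actual IPEX kernels) contrasts the two schemes on a single linear layer: WOQ keeps INT8 weights but dequantizes them and computes in BF16, while static quantization also quantizes the activations with a pre-calibrated scale so the matmul itself can run in low precision.

```python
# Toy sketch: Weight Only Quantization vs. static quantization on one linear layer.
import torch

x = torch.randn(4, 64, dtype=torch.bfloat16)   # activations (batch x in_features)
w = torch.randn(64, 64)                        # weights (in_features x out_features)

# Weight Only Quantization: INT8 weights, BF16 compute after dequantization.
w_scale = w.abs().max() / 127
w_int8 = (w / w_scale).round().clamp(-128, 127).to(torch.int8)
w_deq = (w_int8.float() * w_scale).to(torch.bfloat16)   # dequantize before the matmul
y_woq = x @ w_deq

# Static quantization: activations quantized too, integer GEMM (emulated here).
x_scale = x.float().abs().max() / 127          # in practice, found during calibration
x_int8 = (x.float() / x_scale).round().clamp(-128, 127).to(torch.int8)
acc = x_int8.long() @ w_int8.long()            # INT8 hardware accumulates in INT32; emulated with int64
y_sq = acc.float() * (x_scale * w_scale)       # rescale the result back to floating point
```
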



Step 2: 8bit Quantization (INT8)

SmoothQuant is a post-training quantization algorithm that is used to quantize LLMs to INT8 with minimal accuracy loss. Static quantization methods were shown to underperform on LLMs due to large-magnitude outliers present in specific channels of the activations. Since activations are quantized token-wise, static quantization results in either truncated outliers or underflowed low-magnitude activations. The SmoothQuant algorithm solves this problem by introducing a pre-quantization phase where additional smoothing scaling factors are applied to both activations and weights, which smooths the outliers in the activations and ensures better utilization of the quantization levels.

Figure 2. Computation diagram for INT8 static quantization.
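
The toy sketch below illustrates the smoothing step described above: per input channel j, a factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) divides the activations and is folded into the weights, leaving the layer output unchanged while damping activation outliers. The alpha value and tensor shapes here are illustrative.

```python
# Toy sketch of SmoothQuant's pre-quantization smoothing: X @ W == (X / s) @ (s * W).
import torch

alpha = 0.5
X = torch.randn(32, 64)            # calibration activations (tokens x input channels)
X[:, 7] *= 50.0                    # an outlier channel, as commonly seen in LLM activations
W = torch.randn(64, 64)            # linear weights (input channels x output channels)

act_max = X.abs().amax(dim=0)      # per-input-channel activation magnitude
w_max = W.abs().amax(dim=1)        # per-input-channel weight magnitude
s = act_max.pow(alpha) / w_max.pow(1 - alpha)

X_smooth = X / s                   # outliers damped -> activations are easier to quantize
W_smooth = W * s.unsqueeze(1)      # the compensation is folded into the (offline) weights

assert torch.allclose(X @ W, X_smooth @ W_smooth, rtol=1e-3, atol=1e-3)
print("outlier channel max before/after:",
      X[:, 7].abs().max().item(), X_smooth[:, 7].abs().max().item())
```
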

Using IPEX, we apply SmoothQuant to the StarCoder model. We used the test split of the MBPP dataset as our calibration dataset and introduced Q8-StarCoder. Our evaluation shows that Q8-StarCoder holds no accuracy loss over the baseline (in fact, there is even a slight improvement). In terms of performance, Q8-StarCoder achieves a ~2.19x speedup in TTFT and a ~2.20x speedup in TPOT. Figure 3 shows the latency (TPOT) of Q8-StarCoder compared with the BF16 baseline model.

Figure 3. Latency speedup of the 8-bit quantized model.
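
Below is a hedged sketch of one way to produce such an 8-bit SmoothQuant model with optimum-intel's Intel Neural Compressor integration. The published Q8-StarCoder results were obtained with IPEX, so treat the API choice, the alpha value, the preprocessing, and the sample count as assumptions rather than the exact recipe.

```python
# Hedged sketch: INT8 SmoothQuant via optimum-intel + neural-compressor, calibrating on MBPP test.
from neural_compressor import PostTrainingQuantConfig
from optimum.intel import INCQuantizer
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = PostTrainingQuantConfig(
    approach="static",
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}},  # alpha is illustrative
)

quantizer = INCQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "mbpp",
    dataset_split="test",   # the MBPP test split is used for calibration, as described above
    preprocess_function=lambda ex: tokenizer(
        ex["text"], truncation=True, padding="max_length", max_length=512
    ),
    num_samples=100,
)
quantizer.quantize(
    quantization_config=quant_config,
    calibration_dataset=calibration_dataset,
    save_directory="q8-starcoder",
)
```
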



Step 3: 4bit Quantization (INT4)

Although INT8 decreases the model size by 2x compared with BF16 (8 bits per weight instead of 16 bits), memory bandwidth remains the biggest bottleneck. To further decrease the model’s loading time from memory, we quantized the model’s weights to 4 bits using WOQ. Note that 4-bit WOQ requires dequantization to 16-bit before the computation (Figure 4), which means there is a compute overhead.

Figure 4. Computation diagram for the model quantized to INT4.

Tensor-wise asymmetric Round To Nearest (RTN) quantization, a basic WOQ technique, poses challenges and often results in accuracy reduction; however, it was shown in the literature (Zhewei Yao, 2022) that group-wise quantization of the model’s weights helps retain accuracy. To avoid accuracy degradation, we perform 4-bit quantization in groups (e.g. 128) of consecutive values along the input channel, with scaling factors calculated per group. We found that group-wise 4-bit RTN is sufficient to retain StarCoder’s accuracy on the HumanEval dataset. The 4-bit model achieves a 3.35x speedup in TPOT compared with the BF16 baseline (Figure 5), but it suffers from an expected slowdown of 0.84x in TTFT (Table 1) due to the overhead of dequantizing from 4-bit to 16-bit before computation.

Figure 5. Latency speedup of the 4-bit quantized model.
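
The toy sketch below shows what group-wise asymmetric RTN quantization looks like (group size 128 along the input channel, one scale and zero point per group); real INT4 kernels additionally pack two 4-bit values per byte, which is omitted here for clarity.

```python
# Toy sketch: group-wise asymmetric round-to-nearest (RTN) 4-bit weight quantization.
import torch

def rtn_quantize_groupwise(w: torch.Tensor, group_size: int = 128, bits: int = 4):
    out_features, in_features = w.shape
    qmax = 2**bits - 1                                # 4-bit asymmetric range [0, 15]
    wg = w.reshape(out_features, in_features // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax    # one scale per group
    zero = (-w_min / scale).round()                   # one zero point per group
    q = (wg / scale + zero).round().clamp(0, qmax)    # round to nearest
    deq = (q - zero) * scale                          # what the BF16 compute sees after dequant
    return q.to(torch.uint8), scale, zero, deq.reshape(out_features, in_features)

w = torch.randn(256, 1024)                            # a toy weight matrix
q, scale, zero, w_deq = rtn_quantize_groupwise(w)
print("mean absolute quantization error:", (w - w_deq).abs().mean().item())
```
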



Different Bottlenecks between Generating the First Token and Subsequent Tokens

The initial step of generating the first token, which involves parallel processing of the entire input prompt, demands significant computational resources when the prompt is long. Computation, therefore, becomes the bottleneck in this stage. Hence, switching from BF16 to INT8 precision for this stage improves performance compared with the baseline (and with 4-bit WOQ, which involves a compute overhead in the form of dequantization). However, starting from the second step, when the system generates the rest of the tokens one by one in an autoregressive manner, the model is loaded from memory again for each newly generated token. As a result, the bottleneck becomes memory bandwidth rather than the number of calculations (FLOPS) performed, and therefore INT4 outperforms INT8 and BF16.
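
A back-of-envelope calculation makes this concrete; the DRAM bandwidth figure below is an assumed, illustrative number for a single Xeon socket, not a measurement, and the estimate ignores compute, caches, and KV-cache traffic.

```python
# Illustrative lower bound on per-token decode latency if weights must stream from DRAM.
params = 15e9                # roughly StarCoder-15B
bandwidth = 250e9            # ASSUMED usable DRAM bandwidth in bytes/s (illustrative)
for name, bytes_per_weight in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    weight_bytes = params * bytes_per_weight
    ms_per_token = weight_bytes / bandwidth * 1000
    print(f"{name}: ~{weight_bytes / 1e9:.0f} GB of weights -> "
          f">= {ms_per_token:.0f} ms/token if purely bandwidth-bound")
```
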



Step 4: Assisted Generation (AG)

Another method to mitigate the high inference latency and alleviate the memory bandwidth bottleneck is Assisted Generation (AG), a practical implementation of speculative decoding. AG mitigates this issue by better balancing memory and computational operations. It relies on the premise that a smaller and faster assistant (draft) model often generates the same tokens as a larger target model.

AG uses a small, fast draft model to greedily generate K candidate tokens. These output tokens are generated much faster, but some of them may not match the output tokens of the original target model. Hence, in the next step, the target model checks the validity of all K candidate tokens in parallel in a single forward pass. This process accelerates decoding since the latency of decoding K tokens in parallel is smaller than generating K tokens autoregressively.
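
The runnable toy sketch below walks through exactly this accept/reject step, using two small stand-in models (distilgpt2 as the draft, gpt2 as the target, which share a tokenizer) so it is cheap to try; transformers' built-in assisted generation implements the same idea far more efficiently, with KV caching and proper handling of the first rejected position.

```python
# Toy sketch of the verification step: draft proposes K tokens, target validates them in one pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # stand-in models for illustration
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2")

K = 4
input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
prompt_len = input_ids.shape[1]

with torch.inference_mode():
    # 1) The draft model greedily proposes K candidate tokens.
    candidates = draft.generate(input_ids, max_new_tokens=K, do_sample=False,
                                pad_token_id=tokenizer.eos_token_id)
    # 2) The target model scores prompt + candidates in a single forward pass.
    target_preds = target(candidates).logits.argmax(dim=-1)
    # 3) Accept draft tokens for as long as they match the target's own greedy choice.
    n_accepted = 0
    for i in range(prompt_len, candidates.shape[1]):
        if candidates[0, i].item() == target_preds[0, i - 1].item():
            n_accepted += 1
        else:
            break

accepted = candidates[:, : prompt_len + n_accepted]
print(f"accepted {n_accepted}/{K} draft tokens:", tokenizer.decode(accepted[0]))
```
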

To accelerate StarCoder, we use bigcode/tiny_starcoder_py as the draft model. This model shares a similar architecture with StarCoder but has only 164M parameters – ~95x smaller than StarCoder, and thus much faster. To achieve an even greater speedup, in addition to quantizing the target model, we apply quantization to the draft model as well. We consider both 8-bit SmoothQuant and 4-bit WOQ quantization for the draft and target models. When evaluating both quantization options for the draft and target models, we found that 8-bit SmoothQuant for both models yielded the best results: ~7.30x speedup in TPOT (Figure 6).
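
In practice, assisted generation is available directly through the transformers generate() API via the assistant_model argument; the sketch below pairs StarCoder with tiny_starcoder_py in BF16 and leaves out the quantization of the two models for brevity (the prompt and generation settings are illustrative).

```python
# Sketch: assisted generation with a tiny_starcoder_py draft model via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
target = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", torch_dtype=torch.bfloat16)
draft = AutoModelForCausalLM.from_pretrained("bigcode/tiny_starcoder_py", torch_dtype=torch.bfloat16)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt")
output = target.generate(
    **inputs,
    assistant_model=draft,      # the draft model drives speculative decoding
    max_new_tokens=64,
    do_sample=False,
)
print(tokenizer.decode(output[0]))
```
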

These quantization choices are backed up by the following observations:

  1. Draft model quantization: when using an 8-bit quantized StarCoder with 164M parameters as the draft model, the model mostly fits within the CPU cache. As a result, the memory bandwidth bottleneck is alleviated, since token generation occurs without repeatedly reading the target model from off-chip memory for each token. In this case there is no memory bottleneck, and we see a better speedup with StarCoder-164M quantized to 8-bit than with StarCoder-164M quantized to 4-bit WOQ. We note that 4-bit WOQ holds an advantage where memory bandwidth is the bottleneck thanks to its smaller memory footprint, however 4-bit comes with a compute overhead due to the requirement to perform 4-bit to 16-bit dequantization before the computation.
  2. Target model quantization: in assisted generation, the target model processes a sequence of K tokens that were generated by the draft model. Forwarding K tokens at once (in parallel) through the target model, instead of applying the “standard” sequential autoregressive processing, shifts the balance from a memory-bandwidth bottleneck to a compute bottleneck. Therefore, we observed that using an 8-bit quantized target model yields higher speedups than using a 4-bit model, due to the additional compute overhead that stems from dequantizing each value from 4-bit to 16-bit.

Figure 6. Latency speedup of the optimized model.

| StarCoder | Quantization | Precision | HumanEval (pass@1) | TTFT (ms) | TTFT Speedup | TPOT (ms) | TPOT Speedup |
|-----------|--------------|-----------|--------------------|-----------|--------------|-----------|--------------|
| Baseline  | None         | A16W16    | 33.54              | 357.9     | 1.00x        | 181.0     | 1.00x        |
| INT8      | SmoothQuant  | A8W8      | 33.96              | 163.4     | 2.19x        | 82.4      | 2.20x        |
| INT4      | RTN (g128)   | A16W4     | 32.80              | 425.1     | 0.84x        | 54.0      | 3.35x        |
| INT8 + AG | SmoothQuant  | A8W8      | 33.96              | 183.6     | 1.95x        | 24.8      | 7.30x        |

Table 1: Accuracy and latency measurements of the StarCoder model on Intel 4th Gen Xeon

To load the resulting models and run inference, you can simply replace your AutoModelForXxx class with the corresponding IPEXModelForXxx class from optimum-intel.

Before you begin, make sure you have all the necessary libraries installed:

pip install --upgrade-strategy eager optimum[ipex]
- from transformers import AutoModelForCausalLM
+ from optimum.intel import IPEXModelForCausalLM
  from transformers import AutoTokenizer, pipeline

- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = IPEXModelForCausalLM.from_pretrained(model_id)
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
  results = pipe("He's a dreadful magician and")


