ChatGPT, OpenAI’s groundbreaking language model, has become an
influential force in the realm of artificial intelligence, paving the
way for a large number of AI applications across diverse sectors. With its
staggering ability to understand and generate human-like text, ChatGPT
has transformed industries, from customer support to creative writing,
and has even served as a useful research tool.
Various efforts have been made to offer
open-source large language models that demonstrate great capabilities
but in smaller sizes, such as
OPT,
LLAMA,
Alpaca and
Vicuna.
In this blog, we will delve into the world of Vicuna and explain how to
run the Vicuna 13B model on a single AMD GPU with ROCm.
What’s Vicuna?
Vicuna is an open-source chatbot with 13 billion parameters, developed
by a team from UC Berkeley, CMU, Stanford, and UC San Diego. To create
Vicuna, a LLAMA base model was fine-tuned using about 70K user-shared
conversations collected from ShareGPT.com via public APIs. According to
initial assessments in which GPT-4 is used as a reference, Vicuna-13B
achieves over 90%* of the quality of OpenAI ChatGPT.
It was released on GitHub on April
11, just a few weeks ago. It is worth mentioning that the data set,
training code, evaluation metrics, and training cost are known for Vicuna. Its total training cost was just
around $300, making it a cost-effective solution for the general public.
For more details about Vicuna, please check out
https://vicuna.lmsys.org.
Why do we need a quantized GPT model?
Running the Vicuna-13B model in fp16 requires around 28 GB of GPU RAM. To
further reduce the memory footprint, optimization techniques are
required. There is a recent research paper, GPTQ, which proposes
accurate post-training quantization for GPT models with lower
bit precision. As illustrated below, for models with more than
10B parameters, 4-bit or 3-bit GPTQ can achieve accuracy comparable
to fp16.
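As a rough back-of-the-envelope check of the memory savings (a minimal sketch; the 13B parameter count and per-weight sizes are approximations, and a real deployment adds overhead for activations and the KV cache):
# Rough memory-footprint estimate for a 13B-parameter model (a sketch, not a measurement).
params = 13e9
fp16_gb = params * 2 / 1e9          # 2 bytes per weight  -> ~26 GB
int4_gb = params * 0.5 / 1e9        # 0.5 bytes per weight -> ~6.5 GB
# GPTQ with groupsize 128 also stores scales/zeros, roughly 4 extra bytes per group of 128 weights.
gptq_overhead_gb = (params / 128) * 4 / 1e9
print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit GPTQ: ~{int4_gb + gptq_overhead_gb:.1f} GB")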
Furthermore, the large parameter sizes of these models also have a severely negative
effect on GPT latency, because GPT token generation is limited more by
memory bandwidth (GB/s) than by computation (TFLOPs or TOPs). For this
reason, a quantized model does not degrade
token generation latency when the GPU is in a memory-bound situation.
Refer to the GPTQ quantization paper and GitHub repo.
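To see why token generation is memory-bandwidth bound, note that each generated token must stream essentially all of the weights through the GPU memory system once, so a lower bound on per-token latency is weight bytes divided by bandwidth. A minimal sketch of that estimate (the bandwidth figure is an assumed placeholder, not a measured value):
# Lower-bound estimate of per-token latency when generation is memory-bound (a sketch).
weight_bytes_fp16 = 13e9 * 2        # ~26 GB of weights in fp16
weight_bytes_int4 = 13e9 * 0.5      # ~6.5 GB of weights in 4-bit
assumed_bandwidth = 1.6e12          # bytes/s; placeholder HBM bandwidth, adjust for your GPU
print(f"fp16 : >= {weight_bytes_fp16 / assumed_bandwidth * 1e3:.1f} ms/token")
print(f"4-bit: >= {weight_bytes_int4 / assumed_bandwidth * 1e3:.1f} ms/token")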
By leveraging this technique, several 4-bit quantized Vicuna models are
available from Hugging Face.
Running Vicuna 13B Model on AMD GPU with ROCm
To run the Vicuna 13B model on an AMD GPU, we need to leverage the power
of ROCm (Radeon Open Compute), an open-source software platform that
provides AMD GPU acceleration for deep learning and high-performance
computing applications.
Here is a step-by-step guide on how to set up and run the Vicuna 13B
model on an AMD GPU with ROCm:
System Requirements
Before diving into the installation process, ensure that your system
meets the following requirements:
- An AMD GPU that supports ROCm (check the compatibility list on the docs.amd.com page)
- A Linux-based operating system, preferably Ubuntu 18.04 or 20.04
- Conda or Docker environment
- Python 3.6 or higher
For more information, please check out https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/Prerequisites.html.
This example has been tested on Instinct
MI210
and Radeon
RX6900XT
GPUs with ROCm 5.4.3 and PyTorch 2.0.
Quick Start
1 ROCm installation and Docker container setup (Host machine)
1.1 ROCm installation
The following is for ROCm 5.4.3 and Ubuntu 22.04. Please modify it
according to your target ROCm and Ubuntu version from:
https://docs.amd.com/bundle/ROCm-Installation-Guide-v5.4.3/page/How_to_Install_ROCm.html
sudo apt update && sudo apt upgrade -y
wget https://repo.radeon.com/amdgpu-install/5.4.3/ubuntu/jammy/amdgpu-install_5.4.50403-1_all.deb
sudo apt-get install ./amdgpu-install_5.4.50403-1_all.deb
sudo amdgpu-install --usecase=hiplibsdk,rocm,dkms
sudo amdgpu-install --list-usecase
sudo reboot
1.2 ROCm installation verification
rocm-smi
sudo rocminfo
1.3 Docker image pull and run a Docker container
The following uses PyTorch 2.0 on ROCm 5.4.2. Please use the
appropriate Docker image according to your target ROCm and PyTorch
version: https://hub.docker.com/r/rocm/pytorch/tags
docker pull rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview
sudo docker run --device=/dev/kfd --device=/dev/dri --group-add video
--shm-size=8g --cap-add=SYS_PTRACE --security-opt seccomp=unconfined
--ipc=host -it --name vicuna_test -v ${PWD}:/workspace -e USER=${USER}
rocm/pytorch:rocm5.4.2_ubuntu20.04_py3.8_pytorch_2.0.0_preview
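Once inside the container, a quick way to confirm that the ROCm build of PyTorch can see the GPU (a minimal sketch; on ROCm builds, PyTorch exposes the GPU through the torch.cuda namespace and sets torch.version.hip):
import torch

# On ROCm builds, torch.version.hip is set and the CUDA API surface maps to HIP.
print("HIP version :", torch.version.hip)
print("GPU visible :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device name :", torch.cuda.get_device_name(0))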
2 Model quantization and Model inference (Inside the Docker container)
You can either download the quantized Vicuna-13b model from Hugging Face or
quantize the floating-point model yourself. Please check out Appendix – GPTQ
model quantization if you want to quantize the floating-point model.
2.1 Download the quantized Vicuna-13b model
Use the download-model.py script from the following git repo.
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
python download-model.py anon8231489123/vicuna-13b-GPTQ-4bit-128g
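Alternatively, a minimal sketch using the huggingface_hub package (an assumption: it is installed separately and is a recent enough release to support the local_dir argument) downloads the same repository directly:
from huggingface_hub import snapshot_download

# Download the pre-quantized 4-bit Vicuna-13B weights into a local folder.
# local_dir is an assumed convenience; older huggingface_hub releases only support the shared cache.
snapshot_download(
    repo_id="anon8231489123/vicuna-13b-GPTQ-4bit-128g",
    local_dir="models/vicuna-13b-GPTQ-4bit-128g",
)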
2.2 Running the Vicuna 13B GPTQ Model on AMD GPU
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
cd GPTQ-for-LLaMa
python setup_cuda.py install
These commands will compile and link HIPIFIED CUDA-equivalent kernel
binaries to
Python as C extensions. The kernels of this implementation are composed
of dequantization + FP32 Matmul. If you want to use dequantization +
FP16 Matmul for additional speed-up, please check out Appendix – GPTQ
Dequantization + FP16 Matmul kernel for AMD GPUs.
# model inference
python llama_inference.py ../../models/vicuna-13b --wbits 4 --load
../../models/vicuna-13b/vicuna-13b_4_actorder.safetensors --groupsize 128 --text "Your input text here"
Now that you have everything set up, it is time to run the Vicuna 13B
model on your AMD GPU. Use the commands above to run the model. Replace
"Your input text here" with the text you want to use as input for
the model. If everything is set up correctly, you should see the model
generate output text based on your input.
3 Expose the quantized Vicuna model to the Web API server
Change the path of the GPTQ Python modules (GPTQ-for-LLaMa) in the following
line:
To launch the Web UI from the gradio library, you need to set up the
controller, worker (Vicuna model worker), and web_server by running them as
background jobs.
nohup python -W ignore::UserWarning -m fastchat.serve.controller &
nohup python -W ignore::UserWarning -m fastchat.serve.model_worker --model-path /path/to/quantized_vicuna_weights
--model-name vicuna-13b-quantization --wbits 4 --groupsize 128 &
nohup python -W ignore::UserWarning -m fastchat.serve.gradio_web_server &
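Once the three background jobs are up, a minimal health-check sketch (assuming the gradio web server listens on its default port 7860; adjust the URL if you launched it with a different port):
import requests

# The gradio web UI should answer once the controller, model worker, and web server are all running.
resp = requests.get("http://localhost:7860", timeout=10)
print("Web server HTTP status:", resp.status_code)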
Now the 4-bit quantized Vicuna-13B model can be fitted in the RX6900XT GPU
DDR memory, which has 16 GB of DDR. Only 7.52 GB of DDR (46% of 16 GB) is
needed to run the 13B model, whereas the model needs more than 28 GB of DDR
space in the fp16 datatype. The latency penalty and accuracy penalty are
also very minimal, and the related metrics are provided at the end of
this article.
Test the quantized Vicuna model in the Web API server
Let us give it a try. First, let us use the fp16 Vicuna model for language
translation.
It does a better job than me. Next, let us ask something about soccer. The answer looks good to me.
When we switch to the 4-bit model, for the same question, the answer is
a bit different. There is a duplicated "Lionel Messi" in it.
Vicuna fp16 and 4bit quantized model comparison
Test environment:
– GPU: Instinct MI210, RX6900XT
– python: 3.10
– pytorch: 2.1.0a0+gitfa08e54
– rocm: 5.4.3
Metrics – Model size (GB)
- Model parameter size. When the models are preloaded to GPU DDR, the
actual DDR size consumption is larger than the model itself because of caching
for input and output token spaces.
Metrics – Accuracy (PPL: Perplexity)
- Measured on 2048 examples of the C4 (https://paperswithcode.com/dataset/c4) dataset (see the measurement sketch below)
- Vicuna 13b – baseline: fp16 datatype parameter, fp16 Matmul
- Vicuna 13b – quant (4bit/fp32): 4bits datatype parameter, fp32 Matmul
- Vicuna 13b – quant (4bit/fp16): 4bits datatype parameter, fp16 Matmul
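For reference, the following is a minimal sketch of how a perplexity number of this kind can be computed with Hugging Face transformers. It is a generic fp16 evaluation sketch, not the exact GPTQ-for-LLaMa evaluation script; the model path and the evaluation text are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/vicuna-13b"  # placeholder path to the fp16 model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda().eval()

def perplexity(text, seq_len=2048):
    # Tokenize, split into fixed-length windows, and average the per-token negative log-likelihood.
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_nll, total_tokens = 0.0, 0
    for i in range(0, ids.numel(), seq_len):
        chunk = ids[i : i + seq_len].unsqueeze(0).cuda()
        if chunk.numel() < 2:
            continue  # a single token has no next-token target
        with torch.no_grad():
            out = model(chunk, labels=chunk)
        total_nll += out.loss.item() * (chunk.numel() - 1)
        total_tokens += chunk.numel() - 1
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))

print("PPL:", perplexity("A long evaluation text drawn from the C4 dataset ..."))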
Metrics – Latency (Token generation latency, ms)
- Measured during token generation phases (see the timing sketch below).
- Vicuna 13b – baseline: fp16 datatype parameter, fp16 Matmul
- Vicuna 13b – quant (4bit/fp32): 4bits datatype parameter, fp32 Matmul
- Vicuna 13b – quant (4bit/fp16): 4bits datatype parameter, fp16 Matmul
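Similarly, token generation latency of this kind can be approximated with a simple timing sketch (assuming a model and tokenizer are already loaded as in the perplexity sketch above; the average includes the one-time prompt-processing cost, and the numbers depend on GPU, prompt, and generation length):
import time
import torch

def average_token_latency(model, tokenizer, prompt, new_tokens=128):
    # Generate new_tokens tokens greedily and report the average wall-clock time per generated token.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    print(f"~{elapsed / new_tokens * 1e3:.1f} ms per generated token")

# Example usage (model and tokenizer loaded as in the perplexity sketch):
# average_token_latency(model, tokenizer, "Who is the best soccer player in the world?")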
Conclusion
Large language models (LLMs) have made significant advancements in
chatbot systems, as seen in OpenAI’s ChatGPT. Vicuna-13B, an open-source
LLM, has been developed and has demonstrated excellent capability and quality.
By following this guide, you should now have a better understanding of
how to set up and run the Vicuna 13B model on an AMD GPU with ROCm. This
will enable you to unlock the full potential of this cutting-edge
language model for your research and personal projects.
Thanks for reading!
Appendix – GPTQ model quantization
Building the quantized Vicuna model from the floating-point LLaMA model
a. Download LLaMA and Vicuna delta models from Huggingface
The developers of Vicuna (lmsys) provide only delta models that can be
applied to the LLaMA model. Download LLaMA in Hugging Face format and the
Vicuna delta parameters from Hugging Face individually. Currently, 7b and
13b delta models of Vicuna are available.
https://huggingface.co/models?sort=downloads&search=huggyllama
https://huggingface.co/models?sort=downloads&search=lmsys
b. Convert LLaMA to Vicuna by using the Vicuna-delta model
git clone https://github.com/lm-sys/FastChat
cd FastChat
Convert the LLaMA parameters by using this command:
(Note: do not use vicuna-{7b, 13b}-*delta-v0 since its vocab_size is
different from that of LLaMA and the model cannot be converted.)
python -m fastchat.model.apply_delta --base /path/to/llama-13b --delta lmsys/vicuna-13b-delta-v1.1
--target ./vicuna-13b
Now the Vicuna-13b model is ready.
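As a quick sanity check that the conversion produced a loadable model, a minimal sketch that only reads the generated config (the path matches the --target directory above):
from transformers import AutoConfig

# Load the config of the converted model and print a few key fields.
cfg = AutoConfig.from_pretrained("./vicuna-13b")
print("model_type:", cfg.model_type)
print("vocab_size:", cfg.vocab_size)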
c. Quantize Vicuna to 2/3/4 bits
To apply GPTQ to LLaMA and Vicuna,
git clone https://github.com/oobabooga/GPTQ-for-LLaMa -b cuda
cd GPTQ-for-LLaMa
(Note: do not use https://github.com/qwopqwop200/GPTQ-for-LLaMa for
now, because the 2-, 3-, and 4-bit quantization + MatMul kernels implemented in that
repo do not parallelize the dequant + matmul and hence show lower token
generation performance.)
Quantize the Vicuna-13b model with this command. The quantization calibration is
done on the c4 dataset, but you can also use other datasets, such as wikitext2.
(Note: try different combinations of wbits and groupsize; under some combinations,
model accuracy can be improved significantly.)
python llama.py ./Vicuna-13b c4 --wbits 4 --true-sequential --act-order
--save_safetensors Vicuna-13b-4bit-act-order.safetensors
Now the model is ready and saved as
Vicuna-13b-4bit-act-order.safetensors.
GPTQ Dequantization + FP16 Matmul kernel for AMD GPUs
The more optimized kernel implementation in
https://github.com/oobabooga/GPTQ-for-LLaMa/blob/57a26292ed583528d9941e79915824c5af012279/quant_cuda_kernel.cu#L891
targets the A100 GPU and is not compatible with the ROCm 5.4.3 HIPIFY
toolkit. It needs to be modified as follows. The same applies to the
VecQuant2MatMulKernelFaster, VecQuant3MatMulKernelFaster, and
VecQuant4MatMulKernelFaster kernels.

For convenience, all of the modified code is available in a GitHub Gist.
