Optimize and deploy with Optimum-Intel and OpenVINO GenAI




Deploying Transformers models at the edge or client-side requires careful consideration of performance and compatibility. Python, though powerful, is not always ideal for such deployments, especially in environments dominated by C++. This blog will guide you through optimizing and deploying Hugging Face Transformers models using Optimum-Intel and OpenVINO™ GenAI, ensuring efficient AI inference with minimal dependencies.



Table of Contents

  1. Why Use OpenVINO™ for Edge Deployment
  2. Step 1: Setting Up the Environment
  3. Step 2: Exporting Models to OpenVINO IR
  4. Step 3: Model Optimization
  5. Step 4: Deploying with OpenVINO GenAI API
  6. Conclusion



Why Use OpenVINO™ for Edge Deployment

OpenVINO™ was originally developed as a C++ AI inference solution, making it ideal for edge and client deployment where minimizing dependencies is crucial. With the introduction of the GenAI API, integrating large language models (LLMs) into C++ or Python applications has become much more straightforward, with features designed to simplify deployment and enhance performance.



Step 1: Setting Up the Environment



Pre-requisites

To begin, ensure your environment is correctly configured with both Python and C++. Install the necessary Python packages:

pip install --upgrade --upgrade-strategy eager "optimum[openvino]"

Here are the specific packages used in this blog post:

transformers==4.44
openvino==24.3
openvino-tokenizers==24.3
optimum-intel==1.20
lm-eval==0.4.3

To install the GenAI C++ libraries, follow the instructions here.



Step 2: Exporting Models to OpenVINO IR

Hugging Face and Intel's collaboration has led to the Optimum-Intel project. It is designed to optimize Transformers models for inference on Intel hardware. Optimum-Intel supports OpenVINO as an inference backend, and its API has wrappers for various model architectures built on top of the OpenVINO inference API. All of these wrappers start with the OV prefix, for example, OVModelForCausalLM. Otherwise, the API is analogous to that of the 🤗 Transformers library.

There are two options for exporting Transformers models to the OpenVINO Intermediate Representation (IR): the Python .from_pretrained() method or the Optimum command-line interface (CLI). Below are examples using both methods:



Using Python API

from optimum.intel import OVModelForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-8B"
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
model.save_pretrained("./llama-3.1-8b-ov")



Using Command Line Interface (CLI)

optimum-cli export openvino -m meta-llama/Meta-Llama-3.1-8B ./llama-3.1-8b-ov

The ./llama-3.1-8b-ov folder will contain the .xml and .bin IR model files as well as the required configuration files that come from the source model. The 🤗 tokenizer is also converted to the format of the openvino-tokenizers library, and the corresponding configuration files are created in the same folder.
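Once exported, the IR model can be loaded back and used with the familiar 🤗 Transformers-style API. Below is a minimal sketch of such a check (the prompt text is only an illustration):

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_dir = "./llama-3.1-8b-ov"
# export=True is not needed here: the folder already contains OpenVINO IR files
model = OVModelForCausalLM.from_pretrained(model_dir)
# tokenizer files are stored alongside the model when exporting with the CLI
tokenizer = AutoTokenizer.from_pretrained(model_dir)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))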



Step 3: Model Optimization

When running LLMs on resource-constrained edge and client devices, model optimization is a highly beneficial step. Weight-only quantization is a mainstream approach that significantly reduces latency and model footprint. Optimum-Intel offers weight-only quantization through the Neural Network Compression Framework (NNCF), which has a wide range of optimization techniques designed specifically for LLMs: from data-free INT8 and INT4 weight quantization to data-aware methods such as AWQ, GPTQ, quantization scale estimation, and mixed-precision quantization.
By default, weights of models larger than one billion parameters are quantized to INT8 precision, which is safe in terms of accuracy. This means that the export steps described above produce a model with 8-bit weights. However, 4-bit integer weight-only quantization allows achieving a better accuracy-performance trade-off.

For the meta-llama/Meta-Llama-3.1-8B model we recommend stacking AWQ and quantization scale estimation together with mixed-precision INT4/INT8 quantization of weights, using a calibration dataset that reflects a deployment use case. As in the case of export, there are two options for applying 4-bit weight-only quantization to an LLM:



Using Python API

  • Specify the quantization_config parameter in the .from_pretrained() method. In this case an OVWeightQuantizationConfig object should be created and passed to this parameter as follows:
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B"
quantization_config = OVWeightQuantizationConfig(bits=4, awq=True, scale_estimation=True, group_size=64, dataset="c4")
model = OVModelForCausalLM.from_pretrained(MODEL_ID, export=True, quantization_config=quantization_config)
model.save_pretrained("./llama-3.1-8b-ov")



Using Command Line Interface (CLI):

optimum-cli export openvino -m meta-llama/Meta-Llama-3.1-8B --weight-format int4 --awq --scale-estimation --group-size 64 --dataset wikitext2 ./llama-3.1-8b-ov

Note: The model optimization process can take time, as it applies several methods sequentially and uses model inference over the specified dataset.

Model optimization with the Python API is more flexible, as it allows using custom datasets that can be passed as an iterable object, for example an instance of a 🤗 Dataset object or simply a list of strings, as in the sketch below.
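Here is a minimal sketch of passing a custom calibration set as a plain list of strings; the sample texts are placeholders, and in practice they should reflect the target workload:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

calibration_texts = [
    "OpenVINO is a toolkit for optimizing and deploying AI inference.",
    "Large language models can run efficiently on client devices.",
    # ... more samples that match the deployment use case
]

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    awq=True,
    scale_estimation=True,
    group_size=64,
    dataset=calibration_texts,  # custom dataset instead of a predefined name like "c4"
)
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    export=True,
    quantization_config=quantization_config,
)
model.save_pretrained("./llama-3.1-8b-ov")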

Weight quantization often introduces some degradation of the accuracy metric. To compare the optimized and source models we report the word perplexity metric, measured on the Wikitext dataset with the lm-evaluation-harness project, which supports both 🤗 Transformers and Optimum-Intel models out of the box.
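For reference, commands along these lines can be used to collect the numbers; the openvino model backend name and the exact arguments are assumptions based on lm-evaluation-harness, so verify them against your installed version:

# PyTorch baseline via the 🤗 Transformers backend
lm_eval --model hf --model_args pretrained=meta-llama/Meta-Llama-3.1-8B --tasks wikitext

# OpenVINO model exported above via the Optimum-Intel backend
lm_eval --model openvino --model_args pretrained=./llama-3.1-8b-ov --tasks wikitext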

Word perplexity (PPL), lower is better:

Model                          PyTorch FP32   OpenVINO INT8   OpenVINO INT4
meta-llama/Meta-Llama-3.1-8B   7.3366         7.3463          7.8288



Step 4: Deploying with OpenVINO GenAI API

After conversion and optimization, deploying the model using OpenVINO GenAI is straightforward. The LLMPipeline class in OpenVINO GenAI provides both Python and C++ APIs, supporting various text generation methods with minimal dependencies.



Python API Example

import argparse

import openvino_genai

# parse the exported model folder and the prompt from the command line
parser = argparse.ArgumentParser()
parser.add_argument("model_dir")
parser.add_argument("prompt")
args = parser.parse_args()

device = "CPU"  # GPU can also be used if available
pipe = openvino_genai.LLMPipeline(args.model_dir, device)
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
print(pipe.generate(args.prompt, config))

To run this example, only minimal dependencies need to be installed into the Python environment, as OpenVINO GenAI is designed to provide a lightweight deployment. You can install the OpenVINO GenAI package into the same Python environment or create a separate one to compare the application footprint:

pip install openvino-genai==24.3
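With the package installed, the script can be launched with the exported model folder and a prompt, for example (the script name here is arbitrary):

python generate.py ./llama-3.1-8b-ov "What is OpenVINO?"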



C++ API Example

Let's see how to run the same pipeline with the OpenVINO GenAI C++ API. The GenAI API is designed to be intuitive and provides a seamless migration from the 🤗 Transformers API.

Note: In the example below, any other device available in your environment can be specified for the "device" variable. For example, if you are using an Intel CPU with integrated graphics, "GPU" is a good option to try. To check the available devices, you can use the ov::Core::get_available_devices method (refer to query-device-properties).
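From Python, the same information is exposed through the openvino package installed earlier; a quick sketch to check which device names can be passed to the pipeline:

import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU'], depending on the machine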

#include "openvino/genai/llm_pipeline.hpp"
#include 

int primary(int argc, char* argv[]) {
   std::string model_path = "./llama-3.1-8b-ov";
   std::string device = "CPU"  
   ov::genai::LLMPipeline pipe(model_path, device);
   std::cout << pipe.generate("What's LLM model?", ov::genai::max_new_tokens(256));
}



Customizing Generation Config

LLMPipeline also allows specifying custom generation options via ov::genai::GenerationConfig:

ov::genai::GenerationConfig config;
config.max_new_tokens = 256;
std::string result = pipe.generate(prompt, config);

With the LLMPipeline, users can not only effortlessly leverage various decoding algorithms such as beam search but also construct an interactive chat scenario with a streamer, as in the example below. Furthermore, one can take advantage of enhanced internal optimizations with LLMPipeline, such as reduced prompt processing time through reuse of the KV cache of previous chat history with the chat methods start_chat() and finish_chat() (refer to using-genai-in-chat-scenario).

ov::genai::GenerationConfig config;
config.max_new_tokens = 100;
config.do_sample = true;
config.top_p = 0.9;
config.top_k = 30;

auto streamer = [](std::string subword) {
    std::cout << subword << std::flush;
    return false;
};



pipe.generate(prompt, config, streamer);
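For comparison, the Python bindings accept a callable streamer as well; a minimal sketch under that assumption:

import openvino_genai

pipe = openvino_genai.LLMPipeline("./llama-3.1-8b-ov", "CPU")

def streamer(subword):
    print(subword, end="", flush=True)
    return False  # returning False keeps generation going

pipe.generate("What is OpenVINO?", max_new_tokens=100, streamer=streamer)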

And finally, let's see how to use LLMPipeline in a chat scenario:

pipe.start_chat();
for (size_t i = 0; i < questions.size(); i++) {
   std::cout << "question:\n";
   std::getline(std::cin, prompt);

   std::cout << pipe.generate(prompt) << std::endl;
}
pipe.finish_chat();
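A rough Python equivalent of the same chat loop, assuming the start_chat()/finish_chat() methods exposed by the Python bindings:

import openvino_genai

pipe = openvino_genai.LLMPipeline("./llama-3.1-8b-ov", "CPU")
config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100

pipe.start_chat()
while True:
    prompt = input("question:\n")
    if not prompt:  # an empty line ends the chat
        break
    print(pipe.generate(prompt, config))
pipe.finish_chat()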



Conclusion

The combination of Optimum-Intel and OpenVINO™ GenAI offers a powerful, flexible solution for deploying Hugging Face models at the edge. By following these steps, you can achieve optimized, high-performance AI inference in environments where Python is not ideal, ensuring your applications run smoothly across Intel hardware.



Additional Resources

  1. You can find more details in this tutorial.
  2. To build the C++ examples above, refer to this document.
  3. OpenVINO Documentation
  4. Jupyter Notebooks
  5. Optimum Documentation

OpenVINO GenAI C++ chat demo


