With the Generative AI (GenAI) revolution in full swing, text generation with open-source transformer models like Llama 2 has become the talk of the town. AI enthusiasts as well as developers are looking to leverage the generative abilities of such models for their own use cases and applications. This article shows how easy it is to generate text with the Llama 2 family of models (7b, 13b and 70b) using Optimum Habana and a custom pipeline class – you will be able to run the models with just a few lines of code!
This custom pipeline class has been designed to offer great flexibility and ease of use. Moreover, it provides a high level of abstraction and performs end-to-end text generation, which involves pre-processing and post-processing. There are multiple ways to use the pipeline – you can run the run_pipeline.py script from the Optimum Habana repository, add the pipeline class to your own Python scripts, or initialize LangChain classes with it.
Prerequisites
As the Llama 2 models are part of a gated repo, you need to request access if you have not done so already. First, you have to visit the Meta website and accept the terms and conditions. After you are granted access by Meta (it can take a day or two), you have to request access on Hugging Face, using the same email address you provided in the Meta form.
After you are granted access, please log in to your Hugging Face account by running the following command (you will need an access token, which you can get from your user profile page):
huggingface-cli login
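If you prefer to authenticate from a Python script rather than the CLI, the huggingface_hub library provides an equivalent login helper. Here is a minimal sketch; the token string is a placeholder you need to replace with your own access token.
from huggingface_hub import login

# Equivalent to running `huggingface-cli login` in the shell.
# Replace the placeholder with your own read access token.
login(token="hf_xxx")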
You also need to install the latest version of Optimum Habana and clone the repo to access the pipeline script. Here are the commands to do so:
pip install optimum-habana==1.10.4
git clone -b v1.10-release https://github.com/huggingface/optimum-habana.git
If you plan to run distributed inference, install DeepSpeed according to your SynapseAI version. In this case, I am using SynapseAI 1.14.0.
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.14.0
Now you’re all set to perform text-generation with the pipeline!
Using the Pipeline
First, go to the following directory in your optimum-habana checkout, where the pipeline scripts are located, and follow the instructions in the README to update your PYTHONPATH.
cd optimum-habana/examples/text-generation
pip install -r requirements.txt
cd text-generation-pipeline
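If you prefer not to modify PYTHONPATH in your shell, you can make the example utilities importable from inside your own script instead. The snippet below is only a minimal sketch that assumes the default layout of the optimum-habana checkout; the path is a placeholder you need to adjust.
import sys
from pathlib import Path

# Hypothetical alternative to exporting PYTHONPATH: add the text-generation
# example folder of your optimum-habana checkout to sys.path so that modules
# such as run_generation can be imported. Adjust the path to your clone.
repo_root = Path("/path/to/optimum-habana")
sys.path.insert(0, str(repo_root / "examples" / "text-generation"))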
If you want to generate a sequence of text from a prompt of your choice, here is a sample command.
python run_pipeline.py --model_name_or_path meta-llama/Llama-2-7b-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "Here is my prompt"
You can also pass multiple prompts as input and change the temperature and top_p values for generation as follows.
python run_pipeline.py --model_name_or_path meta-llama/Llama-2-13b-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --temperature 0.5 --top_p 0.95 --prompt "Hello world" "How are you?"
For generating text with large models such as Llama-2-70b, here is a sample command to launch the pipeline with DeepSpeed.
python ../../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py --model_name_or_path meta-llama/Llama-2-70b-hf --max_new_tokens 100 --bf16 --use_hpu_graphs --use_kv_cache --do_sample --temperature 0.5 --top_p 0.95 --prompt "Hello world" "How are you?" "Here is my prompt" "Once upon a time"
Usage in Python Scripts
You can use the pipeline class in your own scripts as shown in the example below. Run the following sample script from optimum-habana/examples/text-generation/text-generation-pipeline.
import argparse
import logging
from pipeline import GaudiTextGenerationPipeline
from run_generation import setup_parser
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO,
)
logger = logging.getLogger(__name__)
# Re-use the argument parser from the text-generation example and
# override the generation settings programmatically.
parser = argparse.ArgumentParser()
args = setup_parser(parser)
args.num_return_sequences = 1
args.model_name_or_path = "meta-llama/Llama-2-7b-hf"
args.max_new_tokens = 100
args.use_hpu_graphs = True
args.use_kv_cache = True
args.do_sample = True
# Build the pipeline once, then generate text for several prompts.
pipe = GaudiTextGenerationPipeline(args, logger)

prompts = ["He is working on", "Once upon a time", "Far far away"]
for prompt in prompts:
    print(f"Prompt: {prompt}")
    output = pipe(prompt)
    print(f"Generated Text: {repr(output)}")
You will have to run the above script with `python <name_of_script> --model_name_or_path a_model_name`, as `--model_name_or_path` is a required argument. However, the model name can be changed programmatically, as shown in the Python snippet.
This shows that the pipeline class operates on a string input and performs data pre-processing as well as post-processing for us.
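Conceptually, that end-to-end flow boils down to three steps: tokenize the prompt, call the model's generate method, and decode the resulting token ids back into text. The snippet below is only an illustrative sketch of those steps using the plain transformers API; it is not the actual GaudiTextGenerationPipeline implementation, which additionally takes care of HPU graphs, the KV cache and other Gaudi-specific optimizations.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM would do for this illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Once upon a time"
# Pre-processing: string -> token ids
inputs = tokenizer(prompt, return_tensors="pt")
# Generation: produce up to 100 new token ids
output_ids = model.generate(**inputs, max_new_tokens=100, do_sample=True)
# Post-processing: token ids -> string
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))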
LangChain Compatibility
The text-generation pipeline can be fed as input to LangChain classes via the use_with_langchain constructor argument. You can install LangChain as follows.
pip install langchain==0.0.191
Here is a sample script that shows how the pipeline class can be used with LangChain.
import argparse
import logging
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from pipeline import GaudiTextGenerationPipeline
from run_generation import setup_parser
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO,
)
logger = logging.getLogger(__name__)
parser = argparse.ArgumentParser()
args = setup_parser(parser)
args.num_return_sequences = 1
args.model_name_or_path = "meta-llama/Llama-2-13b-chat-hf"
args.max_input_tokens = 2048
args.max_new_tokens = 1000
args.use_hpu_graphs = True
args.use_kv_cache = True
args.do_sample = True
args.temperature = 0.2
args.top_p = 0.95
pipe = GaudiTextGenerationPipeline(args, logger, use_with_langchain=True)
llm = HuggingFacePipeline(pipeline=pipe)
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer,
just say that you don't know, don't try to make up an answer.

Context: Large Language Models (LLMs) are the latest models used in NLP.
Their superior performance over smaller models has made them incredibly
useful for developers building NLP enabled applications. These models
can be accessed via Hugging Face's `transformers` library, via OpenAI
using the `openai` library, and via Cohere using the `cohere` library.

Question: {question}
Answer: """
prompt = PromptTemplate(input_variables=["question"], template=template)
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "Which libraries and model providers offer LLMs?"
response = llm_chain(prompt.format(question=question))
print(f"Question 1: {question}")
print(f"Response 1: {response['text']}")

question = "What is the provided context about?"
response = llm_chain(prompt.format(question=question))
print(f"\nQuestion 2: {question}")
print(f"Response 2: {response['text']}")
The pipeline class has been validated for LangChain version 0.0.191 and may not work with other versions of the package.
Conclusion
We presented a custom text-generation pipeline on the Intel® Gaudi® 2 AI accelerator that accepts single or multiple prompts as input. This pipeline offers great flexibility in terms of model size as well as parameters affecting text-generation quality. Furthermore, it is also very easy to use and to plug into your scripts, and is compatible with LangChain.
Use of the pretrained model is subject to compliance with third-party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the LLAMA2 model, what will be considered misuse and out-of-scope uses, who the intended users are, and additional terms, please review and read the instructions at this link: https://ai.meta.com/llama/license/. Users bear sole liability and responsibility to follow and comply with any third-party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use of or compliance with third-party licenses.
To be able to run gated models like this Llama-2-70b-hf, you need the following:
- Have a HuggingFace account
- Agree to the terms of use of the model in its model card on the HF Hub
- Set a read token
- Log in to your account using the HF CLI: run huggingface-cli login before launching your script
