Create your Vision Chat Assistant with LLaVA

Get started with multimodal conversational models using the open-source LLaVA model.

Photo by Izabela Kraus on Unsplash

Large Language Models have proved themselves to be a revolutionary technology. Many applications exploiting their capabilities have already been developed, and many more are expected to come soon. One of the most interesting applications of Large Language Models is their deployment as intelligent assistants able to help human users in a variety of tasks. Chat models trained with instruction tuning and Reinforcement Learning from Human Feedback (RLHF) have shown very promising capabilities of following human instructions and carrying out the assigned tasks. However, they are limited in their applicability to language-only tasks.

Multimodal conversational models aim to unleash the power of Large Language Models to tackle problems that require combining natural language with other modalities to be solved. In particular, vision-language models have received increasing attention since the introduction of vision capabilities to GPT-4V. Empowering the natural language capabilities of GPT-4 with image understanding has led to a powerful chat assistant that can help users with tasks requiring both vision and language understanding. While the vision capabilities of GPT-4V are impressive, closed-source models limit the potential for research and experimentation with this amazing technology. Fortunately, some open-source models have appeared, bringing the power of vision-language models to the community in an easily accessible and transparent way. These models also continue the trend of increased focus on compute and memory efficiency, a trend already seen for open-source Large Language Models. This is an important feature because it facilitates the widespread adoption of these models.

In this tutorial, I will walk through the process of creating a vision chat assistant using the LLaVA (Large Language and Vision Assistant) model introduced in the Visual Instruction Tuning paper. I will first give a brief introduction to the LLaVA model and its improvements before discussing a simple code implementation of a vision chat assistant using the code provided in the official repository. I will then present some examples I crafted to showcase the capabilities and limitations of the model.

LLaVA

The LLaVA model was introduced in the paper Visual Instruction Tuning, and then further improved in Improved Baselines with Visual Instruction Tuning (also known as LLaVA-1.5). The idea behind it is to extract visual embeddings from an image and treat them in the same way as embeddings coming from language tokens by feeding them to a Large Language Model. Intuitively, we can think that the image will be described with “words” that the language model will use to generate its answer. To choose the right “words”, the model uses a pre-trained CLIP visual encoder to extract the visual embeddings and then projects them into the word embedding space of the language model. The latter operation is achieved with a vision-language connector, which was originally chosen to be a simple linear layer in the first paper, Visual Instruction Tuning, and later replaced with a more expressive Multilayer Perceptron (MLP) in Improved Baselines with Visual Instruction Tuning. The architecture of the model is depicted below.

Architecture of the LLaVA model. The projection W is a simple linear layer in LLaVA or an MLP in LLaVA-1.5. Image from the paper Visual Instruction Tuning.
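As a rough illustration of the projection step, the sketch below shows in PyTorch what the two connector variants could look like. This is a simplified sketch, not the official implementation, and the embedding sizes are placeholders roughly matching a CLIP ViT-L encoder and a 7B language model.

import torch.nn as nn

clip_dim, llm_dim = 1024, 4096  # placeholder embedding sizes

# LLaVA: a single linear projection W
linear_connector = nn.Linear(clip_dim, llm_dim)

# LLaVA-1.5: a more expressive two-layer MLP connector
mlp_connector = nn.Sequential(
    nn.Linear(clip_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

# The visual features coming from the CLIP encoder have shape
# (batch, num_patches, clip_dim); after the projection they have shape
# (batch, num_patches, llm_dim) and are fed to the language model
# together with the text token embeddings.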

One of the advantages of the method is that, by using a pre-trained vision encoder and a pre-trained language model, only the vision-language connector (which is a lightweight module) needs to be learned from scratch. In particular, the training of LLaVA consists of two stages:

  • Pre-training for feature alignment: both the pre-trained vision encoder and language model are frozen, and only the weights of the vision-language connector are updated. All training samples consist of text-image pairs packed into a single-turn conversation. This stage aims to train the vision-language connector to align the embeddings of the vision encoder with the text embeddings of the language model.
  • Fine-tuning with visual instructions: in this stage, only the weights of the vision encoder are frozen, while the vision-language connector and the language model are fine-tuned together. The model is fine-tuned on image-based instruction-following tasks. It is interesting to note that some of this data has been created by using language-only GPT-4 to create instruction-following samples from the captions of the images and the coordinates of the bounding boxes of the entities depicted. A minimal sketch of this freezing scheme is shown after this list.
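For concreteness, here is a minimal, hypothetical sketch of how the freezing scheme of the two stages could be expressed in PyTorch. The module names vision_encoder, connector, and language_model are placeholders, not the actual LLaVA training code.

import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all the parameters of a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def configure_llava_stage(vision_encoder: nn.Module,
                          connector: nn.Module,
                          language_model: nn.Module,
                          stage: int) -> None:
    if stage == 1:
        # Pre-training for feature alignment: only the connector is updated.
        set_trainable(vision_encoder, False)
        set_trainable(language_model, False)
        set_trainable(connector, True)
    elif stage == 2:
        # Fine-tuning with visual instructions: the connector and the language
        # model are trained together, while the vision encoder stays frozen.
        set_trainable(vision_encoder, False)
        set_trainable(connector, True)
        set_trainable(language_model, True)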

Creating a vision chatbot using the code provided in the official repository is fairly easy. The repository also provides standardized chat templates that can be used to parse the inputs in the correct format. Following the correct format used in training is crucial for the quality of the answer generated by the model. The exact template depends on the language model used. The template for LLaVA-1.5 with a pre-trained Vicuna language model will look like this:

A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions.

USER: <im_start><image><im_end>
User's prompt

ASSISTANT: Assistant answer

USER: Another prompt

The first few lines are the general system prompt used by the model. The special tokens <im_start>, <image>, and <im_end> are used to indicate where the embeddings representing the image will be placed.
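Assuming the official LLaVA repository is installed, a prompt in this format can be built with the conversation templates it provides. The snippet below is a minimal sketch mirroring what the chatbot class defined next does internally.

from llava.constants import (DEFAULT_IM_START_TOKEN, DEFAULT_IMAGE_TOKEN,
                             DEFAULT_IM_END_TOKEN)
from llava.conversation import conv_templates

conv = conv_templates["v1"].copy()  # Vicuna v1 template shown above
first_input = (DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN +
               DEFAULT_IM_END_TOKEN + '\n' + "Describe the image in details.")
conv.append_message(conv.roles[0], first_input)  # USER turn with the image tokens
conv.append_message(conv.roles[1], None)         # empty ASSISTANT turn to be generated
print(conv.get_prompt())                         # full prompt in the format above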

The chatbot can be defined in a single simple Python class.

import requests
from io import BytesIO

import torch
from PIL import Image
from transformers import AutoTokenizer, BitsAndBytesConfig

# Imports below assume the official LLaVA repository
# (https://github.com/haotian-liu/LLaVA) is installed.
from llava.constants import (DEFAULT_IM_START_TOKEN, DEFAULT_IMAGE_TOKEN,
                             DEFAULT_IM_END_TOKEN, IMAGE_TOKEN_INDEX)
from llava.conversation import conv_templates, SeparatorStyle
from llava.model import LlavaLlamaForCausalLM
from llava.mm_utils import tokenizer_image_token, KeywordsStoppingCriteria
from llava.utils import disable_torch_init


class LLaVAChatBot:
    def __init__(self,
                 model_path: str = 'liuhaotian/llava-v1.5-7b',
                 device_map: str = 'auto',
                 load_in_8_bit: bool = True,
                 **quant_kwargs) -> None:
        self.model = None
        self.tokenizer = None
        self.image_processor = None
        self.conv = None
        self.conv_img = None
        self.img_tensor = None
        self.roles = None
        self.stop_key = None
        self.load_models(model_path,
                         device_map=device_map,
                         load_in_8_bit=load_in_8_bit,
                         **quant_kwargs)

    def load_models(self, model_path: str,
                    device_map: str,
                    load_in_8_bit: bool,
                    **quant_kwargs) -> None:
        """Load the model, processor and tokenizer."""
        quant_cfg = BitsAndBytesConfig(**quant_kwargs)
        self.model = LlavaLlamaForCausalLM.from_pretrained(model_path,
                                                           low_cpu_mem_usage=True,
                                                           device_map=device_map,
                                                           load_in_8bit=load_in_8_bit,
                                                           quantization_config=quant_cfg)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path,
                                                       use_fast=False)
        vision_tower = self.model.get_vision_tower()
        vision_tower.load_model()
        vision_tower.to(device='cuda')
        self.image_processor = vision_tower.image_processor
        disable_torch_init()

    def setup_image(self, img_path: str) -> None:
        """Load and process the image."""
        if img_path.startswith('http') or img_path.startswith('https'):
            response = requests.get(img_path)
            self.conv_img = Image.open(BytesIO(response.content)).convert('RGB')
        else:
            self.conv_img = Image.open(img_path).convert('RGB')
        self.img_tensor = self.image_processor.preprocess(self.conv_img,
                                                          return_tensors='pt'
                                                          )['pixel_values'].half().cuda()

    def generate_answer(self, **kwargs) -> str:
        """Generate an answer from the current conversation."""
        raw_prompt = self.conv.get_prompt()
        input_ids = tokenizer_image_token(raw_prompt,
                                          self.tokenizer,
                                          IMAGE_TOKEN_INDEX,
                                          return_tensors='pt').unsqueeze(0).cuda()
        stopping = KeywordsStoppingCriteria([self.stop_key],
                                            self.tokenizer,
                                            input_ids)
        with torch.inference_mode():
            output_ids = self.model.generate(input_ids,
                                             images=self.img_tensor,
                                             stopping_criteria=[stopping],
                                             **kwargs)
        outputs = self.tokenizer.decode(
            output_ids[0, input_ids.shape[1]:]
        ).strip()
        self.conv.messages[-1][-1] = outputs

        return outputs.rsplit('</s>', 1)[0]

    def get_conv_text(self) -> str:
        """Return full conversation text."""
        return self.conv.get_prompt()

    def start_new_chat(self,
                       img_path: str,
                       prompt: str,
                       do_sample=True,
                       temperature=0.2,
                       max_new_tokens=1024,
                       use_cache=True,
                       **kwargs) -> str:
        """Start a new chat with a new image."""
        conv_mode = "v1"
        self.setup_image(img_path)
        self.conv = conv_templates[conv_mode].copy()
        self.roles = self.conv.roles
        first_input = (DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN +
                       DEFAULT_IM_END_TOKEN + '\n' + prompt)  # f"{self.roles[0]}: {prompt}")
        self.conv.append_message(self.roles[0], first_input)
        self.conv.append_message(self.roles[1], None)
        if self.conv.sep_style == SeparatorStyle.TWO:
            self.stop_key = self.conv.sep2
        else:
            self.stop_key = self.conv.sep
        answer = self.generate_answer(do_sample=do_sample,
                                      temperature=temperature,
                                      max_new_tokens=max_new_tokens,
                                      use_cache=use_cache,
                                      **kwargs)
        return answer

    def continue_chat(self,
                      prompt: str,
                      do_sample=True,
                      temperature=0.2,
                      max_new_tokens=1024,
                      use_cache=True,
                      **kwargs) -> str:
        """Continue the existing chat."""
        if self.conv is None:
            raise RuntimeError("No existing conversation found. Start a new"
                               " conversation using the `start_new_chat` method.")
        self.conv.append_message(self.roles[0], prompt)
        self.conv.append_message(self.roles[1], None)
        answer = self.generate_answer(do_sample=do_sample,
                                      temperature=temperature,
                                      max_new_tokens=max_new_tokens,
                                      use_cache=use_cache,
                                      **kwargs)
        return answer

If you are familiar with the transformers library, you will recognize many of the usual features, and the operations performed should be straightforward to understand. Let's quickly go over the methods of the LLaVAChatBot class defined above.

  • load_models: this method loads the language model, the tokenizer, and the image processor with the specified parameters for quantization using the BitsAndBytes library. The code mirrors the from_pretrained method used by Hugging Face transformers models. BitsAndBytes allows quantizing the model to 8-bit or 4-bit for reduced GPU memory requirements.
  • setup_image: it loads the image from a local path or a URL and converts it to a tensor using the image processor.
  • generate_answer: this method returns the model's answer, continuing the current conversation about the provided image. Again, the generate method of the LLaVA model is analogous to the generate method of Hugging Face transformers models.
  • get_conv_text: this method returns the raw text of the conversation so far.
  • start_new_chat: this is one of the two main methods of the chatbot, used to start a new chat with the model. It creates a new conversation given the image and the initial prompt to pass to the model. It takes care of setting up the conversation using the templates defined in the repository, following the format discussed in the previous section.
  • continue_chat: the other main method, it continues an existing conversation about an image. An instantiation sketch follows this list.
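For instance, the chatbot can be instantiated with 8-bit quantization roughly like this (a sketch; any additional keyword arguments are forwarded to BitsAndBytesConfig):

chatbot = LLaVAChatBot(model_path='liuhaotian/llava-v1.5-7b',
                       device_map='auto',
                       load_in_8_bit=True)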

I have provided the full code in this Colab notebook so that you can try it yourself.

In this section, I collect some interesting examples obtained while testing the capabilities of the model. All the examples in this post were created using the model llava-v1.5-7b with 8-bit quantization.

For a start, let's look at the model's capabilities to describe and understand images. Below, I asked the model to describe a picture of a white tiger.
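The conversation below was obtained with calls along these lines, using the chatbot instantiated above (the image path is a placeholder for the downloaded photo):

answer = chatbot.start_new_chat(img_path="white_tiger.jpg",  # placeholder path
                                prompt="Describe the image in details.")
follow_up = chatbot.continue_chat("What is unique about this tiger?")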

Photo by Joshua J. Cotten on Unsplash
A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions.

USER:
Describe the image in details.

ASSISTANT: The image features a white tiger laying down on a rocky surface,
possibly a stone wall or a rock. The tiger appears to be relaxed and
comfortable in its environment. The tiger's position is such that it covers a
significant portion of the image, from the left side to almost the entire right
side. The rocky surface provides a natural and rugged backdrop for the scene.

USER: What is unique about this tiger?

ASSISTANT: This tiger is unique because it is a white tiger, which is an
extremely rare and visually striking variation of the species. White tigers are
born with a genetic mutation that causes their fur to be predominantly white,
with the rest of their body being a mixture of orange and black stripes. This
distinctive appearance sets them apart from the more common orange and black
striped tigers.

The model seems able to describe the image in detail and also to reason about the subject of the image, correctly assessing that a white tiger is a rare occurrence.

Next, I tested whether the model is able to provide the user with actionable instructions to perform various tasks. This is what distinguishes a "captioning" model from a true vision-chat assistant. For example, can the model provide recipes from the ingredients present in an image? The result is reported below.

Photo by ThermoPro on Unsplash
A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions.
USER:
What can I cook with these ingredients?

ASSISTANT: With the ingredients displayed on the table, you can prepare a
delicious roast dinner. The meat, which is a large piece of beef, can be
seasoned and cooked in the oven, while the vegetables, including carrots,
potatoes, and onions, can be roasted alongside the meat. You can also prepare a
side dish with the potatoes, carrots, and onions, which can be served alongside
the roast dinner. The presence of a bowl of sauce suggests that you can use it
as a dipping sauce or to enhance the flavors of the roast and vegetables.
Overall, this combination of ingredients offers a hearty and flavorful meal
that can be enjoyed with family and friends.

USER: Could the recipe you suggested be included in a vegetarian diet?

ASSISTANT: The recipe I suggested, which involves roasting a large piece of
beef and vegetables, is not suitable for a vegetarian diet. The meat is the
main protein source in the dish, and it is not suitable for vegetarians who
avoid consuming meat. However, you can still create a vegetarian-friendly meal
using the vegetables and sauce available on the table. For example, you can
prepare a vegetable roast with the carrots, potatoes, and onions, and serve it
alongside a side of mashed potatoes or a vegetable-based sauce. This way, you
can still enjoy the flavors and textures of the dish without including meat.

USER: Can you suggest a vegetarian recipe with minimal modifications of the
depicted ingredients?

ASSISTANT: Certainly! Here's a vegetarian recipe that can be made using the
ingredients available on the table:

Roasted Vegetable Medley with Sauce

Ingredients:

* Carrots
* Potatoes
* Onions
* Olive oil
* Salt
* Pepper
* Herbs (such as thyme, rosemary, or parsley)
* Sauce (optional)

Instructions:

1. Preheat your oven to 400°F (200°C).
2. Wash and peel the carrots, potatoes, and onions. Cut them into bite-sized
pieces.
3. In a large bowl, toss the vegetables with olive oil, salt, and pepper.
4. Spread the vegetables evenly on a baking sheet lined with parchment paper.
5. Roast the vegetables in the preheated oven for 20-25 minutes, or until they
are tender and slightly browned.
6. If desired, prepare a sauce by combining ingredients such as tomatoes,
garlic, and herbs. You can use this sauce as a dipping sauce or to enhance the
flavors of the roasted vegetables.
7. Serve the roasted vegetable medley alongside a side of mashed potatoes or a
vegetable-based sauce.

This vegetarian recipe allows you to enjoy the flavors and textures of the
roasted vegetables without including meat. The sauce can be easily customized
to suit your taste preferences, and the dish can be served alongside a variety
of side dishes, such as mashed potatoes or a vegetable-based sauce.

The model seems to have a good understanding of the ingredients present in the image and it is also capable of connecting the individual words to more abstract concepts (i.e. a meal with meat is not vegetarian).

Let's now see how the model can be used to get pieces of advice. This is a useful feature for getting new ideas when brainstorming and can provide real value to the user. In the conversation below, I asked the model for suggestions to change the look of a café.

Photo by Michał Mancewicz on Unsplash
A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions.

USER:
How can I renovate this café to make it look more vintage?

ASSISTANT: To make the café look more vintage, you can consider the following
changes:

1. Update the lighting: Replace the current lighting with vintage-style lamps
or lanterns, which will create a warm and cozy atmosphere.

2. Add vintage decorations: Incorporate antique items, such as vintage clocks,
artwork, or decorative plates, to create a nostalgic ambiance.

3. Use vintage furniture: Replace the current stools with vintage-style chairs
or barstools, which can add character and charm to the space.

4. Incorporate greenery: Add more greenery, such as potted plants or vines, to
create a lush and welcoming atmosphere.

5. Update the bar: Replace the current bar with a vintage-style bar, complete
with a vintage-inspired backsplash and countertop.

6. Use vintage-style glassware: Replace the current glassware with vintage-style
glasses, such as wine glasses, beer mugs, or cocktail shakers.
7. Add vintage-style signage: Replace the current signage with vintage-style
menu boards or chalkboards, which can add a touch of nostalgia to the space.

8. Use vintage-style wallpaper or paint: Apply a vintage-style wallpaper or
paint to the walls, which can create a unique and charming atmosphere.

By implementing these changes, the café will appear more vintage and welcoming,
creating a memorable experience for its patrons.

The advice looks on point, although quite generic, identifying the current elements that could be modified to give a more vintage style.

Let's continue with something practical that can aid users in their daily activities. Can the model help with writing advertisements or listings?

Photo by Eugene Chystiakov on Unsplash
A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions.

USER:
Write an advertisment for this chair

ASSISTANT: Introducing the sleek and stylish black and wood chair, perfect for
any modern setting. This chair features a sleek black seat and a wooden
base, creating a striking contrast against any background. Its minimalist
design makes it a versatile addition to any room, whether it's used as a dining
chair, a side table, or a statement piece in a living space. The chair's sturdy
construction ensures it can withstand daily use, while its clean lines and
sleek appearance add a touch of sophistication to any room. Upgrade your living
space with this stylish and functional chair today.

USER: Write now an advertisement for the same chair but targeting cafes

ASSISTANT: Attention café owners and design enthusiasts! Introducing the black
and wood chair, the perfect addition to your café's seating area. This stylish
and functional chair offers a modern touch to your establishment, with its
sleek black seat and wooden base creating a striking contrast against any
background. Its minimalist design makes it a versatile choice for any café
setting, whether it's used as a dining chair, a side table, or a statement
piece in a lounge area. The chair's sturdy construction ensures it can
withstand daily use, while its clean lines and sleek appearance add a touch of
sophistication to any room. Upgrade your café's ambiance with this stylish and
functional chair today.

The model is able to correctly pinpoint the features of the chair and creates an appealing advertisement, appropriately customizing the answer to different targets when asked to.

As a final test, let us see whether prompting techniques that have been demonstrated to be useful for Large Language Models can be used with LLaVA as well. Below is an example of attempting zero-shot Chain of Thought prompting.

Photo by Arnold Middelkoop on Unsplash
A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions.

USER:
Would the fish depicted in the image float if thrown in water? Think step by
step.

ASSISTANT: The fish depicted in the image is a large white fish sculpture or
statue, which is lying on the ground on a brick walkway near a body of water.
If thrown into the water, the fish would not float because it is a solid,
inanimate object. It would sink to the bottom of the water, as it is not
buoyant due to its composition and size. The fish's position on the ground and
its lack of buoyancy make it impossible for it to float in water.

USER: What material could be used to make a fish sculpture that floats on
water?

ASSISTANT: To make a fish sculpture that floats on water, one could use a
lightweight, buoyant material such as expanded polystyrene foam or polyethylene
foam. These materials are commonly used for making buoyant objects like life
jackets, flotation devices, and fishing floats. By shaping the foam into the
desired fish form, it would be able to float on the water's surface,
simulating the behavior of a real fish.

The model is able to explain why the fish would not float, correctly identifying that it is not a real fish. It also follows the instruction to think step by step before giving the final answer. For comparison, here is an example of the conversation without the "Think step by step" prompt.

A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions.

USER:
Would the fish depicted in the image float if thrown in water?

ASSISTANT: No, the fish depicted in the image would not float if thrown in
water. It is a large, white, stone fish sculpture, which is not a real fish
and does not have the buoyancy properties of a natural fish.

While the final answer is still correct, the explanation given in this case is much less detailed and the answer is given at the beginning of the sentence. Since the generative language model used by LLaVA is causal, this means that the provided answer does not depend on the subsequent explanation. It would be interesting to perform more extensive experiments to test whether Chain of Thought (CoT) can, in general, improve vision-language models' performance in complex reasoning tasks, similar to what has been observed for Large Language Models in Chain-of-Thought Prompting Elicits Reasoning in Large Language Models and Large Language Models are Zero-Shot Reasoners.

While very powerful, LLaVA comes with its limitations. For instance, the model has been trained to use only one image per chat, so it is not able to handle more complex conversations that require interactions with multiple images. It is worth noting, however, that this limitation is not intrinsic to the architecture of the model and can be amended by including multi-image chat data during training.

During the tests, I noticed that the model (at least the 8-bit quantized version of the 1.5 7b model) has trouble with counting elements in an image and, in general, with handling numerical tasks. Below is an example.

Photo by John Matychuk on Unsplash
A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions.

USER:
How many cars are inside the parking lot?

ASSISTANT: There are several cars parked inside the parking lot.

USER: List the numbers of the parking spots occupied by cars

ASSISTANT: The parking spots occupied by cars are 66, 67, 68, 69, 70, and 71.

The model, like all Large Language Models, is also prone to hallucinations and can be easily tricked into making mistakes using appropriate prompts. For example, if provided with a picture of the desert and asked what species is the fish in the image, it hallucinates a desert fish.

Photo by Wolfgang Hasselmann on Unsplash
A chat between a curious user and an artificial intelligence assistant. The 
assistant gives helpful, detailed, and polite answers to the user's questions.

USER:
Tell me the species of the fish in this image

ASSISTANT: The fish in this image is a desert fish, which is a species adapted
to living in desert environments.

LLaVA shows impressive capabilities in vision-language understanding. It marks a clear step forward for multimodal open-source vision-language models. One of the biggest advantages of LLaVA is that it is lightweight to train and fine-tune. For instance, the full training of LLaVA 1.5 13b took just 1.2M data samples and roughly 1 day on a single 8-A100 node. This makes it suitable for fine-tuning on specific domains to get an expert assistant, as was done, for example, in LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day.

Adding vision capabilities to chat assistants expands the range of applications of such models, bringing their revolutionizing potential to more complex and nuanced tasks. Treating image features as language tokens also brings up the possibility of using all the advanced prompting techniques used with text-only language models and further expands them. For example, one could expand the power of Retrieval Augmented Generation by retrieving both texts and images that are relevant to the conversation. In fact, using the shared image-text embedding space of CLIP, it is possible to retrieve both external documents and external images starting with either an input text or image!
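As a small illustration of this idea (a sketch unrelated to the LLaVA codebase; the model name and file paths are just examples), CLIP's shared embedding space can be used to score candidate images against a text query:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Placeholder image paths standing in for an external image collection.
images = [Image.open(p).convert("RGB") for p in ["img1.jpg", "img2.jpg"]]
query = "a white tiger resting on rocks"

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=[query], return_tensors="pt",
                                                  padding=True))

# Cosine similarity between the query and each image: the highest score
# identifies the most relevant image to retrieve.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
scores = (txt_emb @ img_emb.T).squeeze(0)
best_match = scores.argmax().item()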

Another interesting direction to expand the capabilities of the model is presented in LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing. The main idea is to combine the various capabilities of vision-language chat models, text-to-image generative models, and other vision models (such as image segmentation models) to get an assistant capable of handling multimodal inputs and generating multimodal outputs.

In conclusion, LLaVA marked an important step for open-source multimodal generative models, which have shown impressive capabilities and are attracting a lot of interest. With the more widespread adoption of open-source models, I believe we will soon witness a rapid increase in new applications of these powerful models.

Thanks for reading! If you want to try out the code yourself, you can have a look at this Colab notebook.
