AI Telephone — A Battle of Multimodal Models

DALL-E2, Stable Diffusion, BLIP, and more!

Artistic rendering of a game of AI Telephone. Image generated by the author using DALL-E2.

Generative AI is on fire right now. The past few months especially have seen an explosion in multimodal machine learning — AI that connects concepts across different “modalities” such as text, images, and audio. For instance, Midjourney is a multimodal text-to-image model: it takes in natural language and outputs images. The magnum opus of this recent renaissance in multimodal synergy was Meta AI’s ImageBind, which can take inputs of six(!) varieties and represent them in the same embedding “space”.

With all of this excitement, I wanted to put multimodal models to the test and see how good they actually are. In particular, I wanted to answer three questions:

  1. Which text-to-image model is the best?
  2. Which image-to-text model is the best?
  3. What’s more important — image-to-text, or text-to-image?

Of course, each model brings its own biases to the table, from training data to model architecture, so there is never really one BEST model. But we can still put models to the test in a general context!

To answer these questions, I decided to play a game of AI Telephone, inspired by the board game Telestrations, which my family and I love to play together.

Telestrations is much like the game of telephone: players go around in a circle, taking in communication from the person on one side, and in turn communicating their interpretation to the person on their other side. As the game ensues, the original message is invariably altered, if not lost entirely. Telestrations differs, however, by adding bimodal communication: players alternate between drawing (illustrating) a description, and describing (in text) a drawing.

Given that I was more interested in comparing models, I adapted the game to suit this purpose.

Here’s how the game of AI Telephone works:

  1. Each “game” pairs up an image-to-text (I2T) model with a text-to-image (T2I) model.
  2. Given an initial prompt, we use the T2I model to generate an image.
  3. We then pass this image into the I2T model to generate a description.
  4. We repeat steps 2 and 3 a fixed number of times n (in our case, n=10).
  5. Finally, we quantify the difference between the original prompt and the final description.

In this post, I’ll walk you through this whole process, so that you can play AI Telephone too! At the end, I’ll answer the three motivating questions.

Note: This game of AI Telephone is intimately connected with the notion of cycle consistency. By incorporating a cycle consistency term in the loss function during training, models can be incentivized to, effectively, minimize degradation over a game of telephone. To my knowledge, none of the models considered in this experiment were trained with cycle consistency as a consideration.
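For intuition, a text-to-image-to-text cycle consistency term could be written as follows (an illustrative formulation of mine, not the training objective of any model used here):

\mathcal{L}_{\text{cyc}} = \mathbb{E}_{x}\left[\, d\left(x,\; f_{\text{I2T}}\big(g_{\text{T2I}}(x)\big)\right) \,\right]

where g_{T2I} generates an image from text x, f_{I2T} generates a description of that image, and d(·,·) is a distance between texts (for instance, a cosine distance between their embeddings).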

The post is structured as follows:

  1. Selecting the Multimodal Models
  2. Generating the Prompts
  3. Creating Telephone Lines
  4. Carrying out the Conversations
  5. Visualizing and Analyzing the Results

All of the code to run this experiment and play AI Telephone can be found here.

To run this code, you will need to install the FiftyOne open source library for dataset curation, the OpenAI Python library, and the Replicate Python client.

pip install fiftyone openai replicate
Progression of images in a game of AI Telephone between DALL-E2 and BLIP.

The space of multimodal models is vast: at the time of writing, Hugging Face alone hosts 4,425 T2I models and 155 I2T models. Playing AI Telephone with all of these models — or even a non-negligible fraction of them — would be completely infeasible. My first task was to pare down this space of potential candidates to a more manageable set of competitors.

Choosing APIs

To start this project, I knew that I would be working with many models. Some of the prospective models were quite large, and many required their own environments, each with a unique set of requirements. Given that I planned to pair up each T2I model with each I2T model, installing these models locally to play games of AI Telephone presented a potential dependency purgatory — especially because I work on a MacBook Pro M1!

To avoid this problem, I decided to stick to models that were accessible via APIs. In particular, I chose to primarily use Replicate, whose simple interface allowed me to work with T2I and I2T models in plug-and-play fashion. Almost every model that I used is open source, so if you are braver than I, you can run these models locally and avoid the charges. That being said, in total this experiment cost less than $15 USD.

Text-to-Image Models

When choosing T2I models, I selected from the models in Replicate’s Text to image collection. My selection criteria were that the model needed to be cheap, fast, and relatively popular (judged by the number of “runs” of the model on Replicate). Additionally, the model needed to be general purpose, meaning that I wasn’t going to consider outpainting, logo generation, or anime styling models. You are more than welcome to try playing AI Telephone with these types of models if you’d like!

Given these requirements, I chose Stable Diffusion and Feed forward VQGAN CLIP. Initially, I also worked with DALL-E Mini, but in early tests I was disappointed by the model’s performance, so I swapped it out for OpenAI’s DALL-E2, which I accessed through OpenAI’s image generations endpoint.

As a side note, restricting my attention to API-accessible models meant that I didn’t consider Midjourney. There is no official API, and I didn’t want to use an unofficial API, nor did I want to enter prompts into Discord one at a time and download the generated images one by one.

To make this process as plug-and-play as possible, I took an object-oriented approach. I defined a base Text2Image class, which exposes a method generate_image(text):

import replicate

class Text2Image(object):
    """Wrapper for a Text2Image model."""
    def __init__(self):
        self.name = None
        self.model_name = None

    def generate_image(self, text):
        response = replicate.run(self.model_name, input={"prompt": text})
        if type(response) == list:
            response = response[0]
        return response

For Replicate models, all that is then required is setting the model_name attribute identifying the model on Replicate. For Stable Diffusion, for instance, the class definition looks like this:

class StableDiffusion(Text2Image):
    """Wrapper for a StableDiffusion model."""
    def __init__(self):
        self.name = "stable-diffusion"
        self.model_name = "stability-ai/stable-diffusion:27b93a2413e7f36cd83da926f3656280b2931564ff050bf9575f1fdf9bcd7478"

For other models, such as DALL-E2, the generate_image(text) method can be overridden:

import openai

class DALLE2(Text2Image):
    """Wrapper for a DALL-E 2 model."""
    def __init__(self):
        self.name = "dalle-2"

    def generate_image(self, text):
        response = openai.Image.create(
            prompt=text,
            n=1,
            size="512x512"
        )
        return response['data'][0]['url']

Each of these T2I models returns the URL of the generated image, which we can then pass on to our I2T models.
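As a quick sanity check, generating an image from one of these wrappers looks like the following (a minimal sketch, assuming your REPLICATE_API_TOKEN environment variable is set):

# Generate an image with Stable Diffusion and print the hosted URL
sd = StableDiffusion()
image_url = sd.generate_image("A red apple sitting on a wooden table")
print(image_url)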

Image-to-Text Models

I followed an analogous process to determine the I2T competitors, evaluating candidates in Replicate’s Image to text collection. After looking at the examples for all of the models in the collection, six models stood out: BLIP, BLIP-2, CLIP prefix captioning, Fine-grained Image Captioning with CLIP Reward, mPLUG-Owl, and MiniGPT-4. Other models were enticing, such as CLIP Interrogator, which tries to reverse engineer a prompt you can then use to generate a similar image. But this felt a bit like cheating as far as AI Telephone was concerned!

Playing around with the six I2T candidates, I was able to quickly eliminate two models from contention: BLIP-2 generated responses that were consistently too short to be useful, and the CLIP Caption Reward model generated responses that were often incoherent.

In direct analogy with the T2I models, I defined a base Image2Text class exposing a generate_text(image_url) method:

class Image2Text(object):
    """Wrapper for an Image2Text model."""
    def __init__(self):
        self.name = None
        self.model_name = None
        self.task_description = "Write a detailed description of this image."

    def generate_text(self, image_url):
        response = replicate.run(
            self.model_name,
            input={
                "image": image_url,
                "prompt": self.task_description,
            }
        )
        return response

I then created subclasses for each model. Here’s what the BLIP subclass looks like:

class BLIP(Image2Text):
    """Wrapper for a BLIP model."""
    def __init__(self):
        super().__init__()
        self.name = "blip"
        self.model_name = "salesforce/blip:2e1dddc8621f72155f24cf2e0adbde548458d3cab9f00c0139eea840d0ac4746"

All of the models are instantiated with the same task description — to “write a detailed description of this image”.
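Putting the two halves together, a single turn of the telephone game looks like this (a minimal sketch, assuming image_url is the output of one of the T2I wrappers above):

# Caption the generated image with BLIP
blip = BLIP()
description = blip.generate_text(image_url)
print(description)  # the model's detailed description of the image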

Progression of images in a game of AI Telephone between DALL-E2 and mPLUG-Owl.

To be as “scientific” as possible, I thought it best not to generate the initial prompts myself. Instead (and just for fun), I outsourced the task to ChatGPT. I asked:

I'm playing a game of telephone using text-to-image and image-to-text AI models.
I want to evaluate these models based on their ability to retain complex semantic
information over the course of long conversations. Your job is to give me 10 text
prompts that I can use to run these games of telephone. You should give me 3 easy,
3 medium, 3 hard, and 1 ultra-hard ("impossible") prompt


Here are some of the prompts ChatGPT generated:

Easy:

"A red apple sitting on a wooden table with sunlight streaming in from a window."

Medium:

"An astronaut floating in the International Space Station, looking at Earth through the window, with a space capsule docked in the background."

Hard:

"A bustling marketplace in an ancient Middle Eastern city. Traders haggling over spices and silks, camels carrying goods, the sun setting behind a mosque with a crescent moon visible."

Impossible:

"A panoramic scene of an advanced alien civilization on a distant exoplanet. Interstellar vehicles flying in an indigo sky above towering crystalline structures. Aliens with varied physical features are interacting, engaging in activities like exchanging energy orbs, communicating through light patterns, and tending to exotic, bio-luminescent flora. The planet's twin moons are visible on the horizon over a glistening alien ocean."

A more rigorous scientific approach would be far more intentional with the prompts used, as well as their categorization.

I then took the text prompts generated by ChatGPT and constructed Prompt objects, which contain the text of the prompt and the “level” of difficulty assigned by ChatGPT:

class Prompt(object):
    def __init__(self, text, level):
        self.text = text
        self.level = level

# easy_texts, medium_texts, hard_texts, and impossible_texts are lists of the
# prompt strings generated by ChatGPT (see the sketch below)
levels = ["easy", "medium", "hard", "impossible"]
level_prompts = [easy_texts, medium_texts, hard_texts, impossible_texts]

def get_prompts():
    prompts = []
    for level, texts in zip(levels, level_prompts):
        for text in texts:
            prompts.append(Prompt(text, level))
    return prompts
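For completeness, here is how the prompt lists referenced above could be populated with the ChatGPT-generated prompts (only one prompt per level is reproduced in this post, so the rest are left as placeholders):

easy_texts = [
    "A red apple sitting on a wooden table with sunlight streaming in from a window.",
    # ...two more easy prompts
]
medium_texts = [
    "An astronaut floating in the International Space Station, looking at Earth through the window, with a space capsule docked in the background.",
    # ...two more medium prompts
]
hard_texts = [
    "A bustling marketplace in an ancient Middle Eastern city. Traders haggling over spices and silks, camels carrying goods, the sun setting behind a mosque with a crescent moon visible.",
    # ...two more hard prompts
]
impossible_texts = [
    "A panoramic scene of an advanced alien civilization on a distant exoplanet. Interstellar vehicles flying in an indigo sky above towering crystalline structures. Aliens with varied physical features are interacting, engaging in activities like exchanging energy orbs, communicating through light patterns, and tending to exotic, bio-luminescent flora. The planet's twin moons are visible on the horizon over a glistening alien ocean.",
]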

Progression of images in a game of AI Telephone between VQGAN-CLIP and MiniGPT-4.

The last component needed to play AI Telephone was the “telephone line” itself. I created a TelephoneLine class to encapsulate the connection between a T2I model and an I2T model. Given a telephone line, a “game” of telephone is played by calling play(prompt, nturns=10), where the conversation evolves from prompt and runs for nturns back-and-forth turns.

import os
import hashlib
import fiftyone as fo
from fiftyone import ViewField as F

class TelephoneLine(object):
    """Class for playing telephone with AI."""
    def __init__(self, t2i, i2t):
        self.t2i = t2i
        self.i2t = i2t
        self.name = f"{t2i.name}_{i2t.name}"
        self.conversations = {}

    def get_conversation_name(self, text):
        full_name = f"{self.name}{text}"
        hashed_name = hashlib.md5(full_name.encode())
        return hashed_name.hexdigest()[:6]

    def play(self, prompt, nturns = 10):
        """Play a game of telephone."""
        print(f"Connecting {self.t2i.name} <-> {self.i2t.name} with prompt: {prompt.text[:20]}...")
        texts = [prompt.text]
        image_urls = []

        for _ in range(nturns):
            image_url = self.t2i.generate_image(texts[-1])
            text = self.i2t.generate_text(image_url)
            texts.append(text)
            image_urls.append(image_url)

        conversation_name = self.get_conversation_name(prompt.text)
        self.conversations[conversation_name] = {
            "texts": texts,
            "image_urls": image_urls,
            "level": prompt.level
        }

For each game played, the conversation is logged with a unique name, generated by hashing together the T2I model name, I2T model name, and the prompt text (the get_conversation_name() method above).

I also equipped the class with a save_conversations_to_dataset() method, which saves the images and descriptions from all games played on the telephone line to a FiftyOne Dataset:

    def save_conversations_to_dataset(self, dataset):
        """Save conversations to a dataset."""
        for conversation_name in self.conversations.keys():
            conversation = self.conversations[conversation_name]
            prompt = conversation["texts"][0]
            level = conversation["level"]
            image_urls = conversation["image_urls"]
            texts = conversation["texts"]

            for i in range(len(image_urls)):
                filename = f"{conversation_name}_{i}.jpg"
                filepath = os.path.join(IMAGES_DIR, filename)
                download_image(image_urls[i], filepath)

                sample = fo.Sample(
                    filepath = filepath,
                    conversation_name = conversation_name,
                    prompt = prompt,
                    level = level,
                    t2i_model = self.t2i.name,
                    i2t_model = self.i2t.name,
                    step_number = i,
                    text_before = texts[i],
                    text_after = texts[i+1]
                )

                dataset.add_sample(sample)
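The download_image() helper and IMAGES_DIR used above aren't defined in the snippets; a minimal version (my own assumption; any equivalent download routine would work) might look like this:

import os
import requests

IMAGES_DIR = "telephone_images"
os.makedirs(IMAGES_DIR, exist_ok=True)

def download_image(image_url, filepath):
    """Download the image at image_url and write it to filepath."""
    response = requests.get(image_url)
    response.raise_for_status()
    with open(filepath, "wb") as f:
        f.write(response.content)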

Progression of images in a game of AI Telephone between Stable Diffusion and CLIP Prefix Captioning.

With all of the building blocks in place, playing AI Telephone is child’s play!

We can instantiate T2I and I2T models:

## Image2Text models
mplug_owl = MPLUGOwl()
blip = BLIP()
clip_prefix = CLIPPrefix()
mini_gpt4 = MiniGPT4()
image2text_models = [mplug_owl, blip, clip_prefix, mini_gpt4]

## Text2Image models
vqgan_clip = VQGANCLIP()
sd = StableDiffusion()
dalle2 = DALLE2()
text2image_models = [sd, dalle2, vqgan_clip]

And then create a telephone line for each pair:

combos = [(t2i, i2t) for t2i in text2image_models for i2t in image2text_models]
lines = [TelephoneLine(*combo) for combo in combos]

We then load in our prompts:

prompts = get_prompts()

And create a FiftyOne Dataset, which we will use to store the generated images and all relevant information from the conversations:

import fiftyone as fo

dataset = fo.Dataset(name = 'telephone', persistent=True)
dataset.add_sample_field("conversation_name", fo.StringField)
dataset.add_sample_field("prompt", fo.StringField)
dataset.add_sample_field("level", fo.StringField)
dataset.add_sample_field("t2i_model", fo.StringField)
dataset.add_sample_field("i2t_model", fo.StringField)
dataset.add_sample_field("step_number", fo.IntField)
dataset.add_sample_field("text_before", fo.StringField)
dataset.add_sample_field("text_after", fo.StringField)

We can then run all 120 games of telephone (12 telephone lines × 10 prompts):

from tqdm import tqdm

for line in tqdm(lines):
    for prompt in prompts:
        line.play(prompt, nturns = 10)
    line.save_conversations_to_dataset(dataset)

session = fo.launch_app(dataset)

In the FiftyOne App, click the splitting icon in the menu bar to group images by conversation, select conversation_name from the dropdown, then toggle the selector to ordered and select step_number.
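If you prefer to set up this grouped view programmatically, something like the following should be equivalent (a sketch, assuming a recent FiftyOne version that supports ordered dynamic groups):

# Group samples into conversations, ordered by step number, and show them in the App
view = dataset.group_by("conversation_name", order_by="step_number")
session = fo.launch_app(view)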

To assess the quality of a conversation (purely in terms of how closely the meaning of the final description approximates the meaning of the initial prompt), I decided to generate embeddings for the prompts and descriptions, and compute the cosine distance (which lies in [0, 2]) between the two.

from scipy.spatial.distance import cosine as cosine_distance
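As a reminder of the scale, SciPy's cosine distance is one minus the cosine similarity, so identical directions score 0, orthogonal vectors score 1, and opposite directions score 2:

import numpy as np

a = np.array([1.0, 0.0])
print(cosine_distance(a, a))                     # 0.0 (identical)
print(cosine_distance(a, np.array([0.0, 1.0])))  # 1.0 (orthogonal)
print(cosine_distance(a, -a))                    # 2.0 (opposite)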

For an embedding model, I wanted a model that could embed both text and images, given the multimodal nature of the exercise. I ended up choosing to use ImageBind for three reasons:

  1. Other popular joint image-text embedding models like CLIP and BLIP are related to some of the models I used in the experiment (BLIP and CLIP prefix captioning), and I wanted to avoid any possible biases from using the same types of models for evaluation.
  2. Many text embedding models have a small max_token_count — the maximum number of tokens allowed in a text to be embedded. CLIP, for instance, has max_token_count=77. Some of our descriptions are significantly longer than this. Fortunately, ImageBind has a much longer maximum token count.
  3. I’d been meaning to try ImageBind, and this was a great opportunity!

I wrapped Replicate’s ImageBind API in a function embed_text(text):

import numpy as np

MODEL_NAME = "daanelson/imagebind:0383f62e173dc821ec52663ed22a076d9c970549c209666ac3db181618b7a304"

def embed_text(text):
    response = replicate.run(
        MODEL_NAME,
        input={
            "text_input": text,
            "modality": "text"
        }
    )
    return np.array(response)

To avoid redundant computations, I hashed the prompts and stored the prompt embeddings in a dictionary. This way, instead of embedding each prompt once for each of the 12 telephone lines, we only need to embed each prompt once:

import hashlib

def hash_prompt(prompt):
    return hashlib.md5(prompt.encode()).hexdigest()[:6]

### Embed initial prompts
prompt_embeddings = {}
dataset.add_sample_field("prompt_hash", fo.StringField)

## Group samples by initial prompt
## Add hash to all samples in group
prompt_groups = dataset.group_by("prompt")
for pg in prompt_groups.iter_dynamic_groups():
    prompt = pg.first().prompt
    hash = hash_prompt(prompt)
    prompt_embeddings[hash] = embed_text(prompt)
    view = pg.set_field("prompt_hash", hash)
    view.save("prompt_hash")

We can then group samples by conversation name, iterate through these groups, compute the text embedding for each step, and record the cosine distance (smaller is better!) between the text embedding and the initial prompt embedding:

dataset.add_sample_field("text_after_dist", fo.FloatField)

conversation_groups = dataset.group_by("conversation_name")
for cg in conversation_groups.iter_dynamic_groups(progress=True):
    hash = cg.first().prompt_hash
    prompt_embedding = prompt_embeddings[hash]

    ordered_samples = cg.sort_by("step_number")
    for sample in ordered_samples.iter_samples(autosave=True):
        text_embedding = embed_text(sample.text_after)
        sample["text_embedding"] = text_embedding
        sample.text_after_dist = cosine_distance(
            prompt_embedding,
            text_embedding
        )

I then computed the average scores for each T2I-I2T pair across all prompts at a given level of difficulty and plotted the results. In each of the videos, the I2T and T2I models are printed on the generated images, as well as the text used to generate that image (red) and the description generated from that image (green).
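The aggregation and plotting code isn't shown above; here is one minimal way it could be done (a sketch assuming pandas and matplotlib, using the sample fields defined earlier):

import pandas as pd
import matplotlib.pyplot as plt

# Pull the relevant fields out of the FiftyOne dataset
df = pd.DataFrame({
    "t2i": dataset.values("t2i_model"),
    "i2t": dataset.values("i2t_model"),
    "level": dataset.values("level"),
    "step": dataset.values("step_number"),
    "dist": dataset.values("text_after_dist"),
})

# One plot per difficulty level: average distance per (T2I, I2T) pair at each step
for level in ["easy", "medium", "hard", "impossible"]:
    sub = df[df.level == level]
    means = sub.groupby(["t2i", "i2t", "step"])["dist"].mean().reset_index()
    fig, ax = plt.subplots()
    for (t2i, i2t), grp in means.groupby(["t2i", "i2t"]):
        ax.plot(grp["step"], grp["dist"], label=f"{t2i} + {i2t}")
    ax.set_title(f"Prompt difficulty: {level}")
    ax.set_xlabel("Step number")
    ax.set_ylabel("Cosine distance to initial prompt")
    ax.legend()
    plt.show()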

Easy

For easy prompts, performance tends to depend most strongly on the text-to-image model. DALL-E2 and Stable Diffusion dramatically outperform VQGAN-CLIP. MiniGPT-4 is a member of each of the top-performing pairs.

Here are some examples for the easy prompt introduced above:

AI Telephone for a simple prompt, with pairs of text-to-image and image-to-text models.

In the games with MiniGPT-4 (and to a slightly lesser extent BLIP), the apple stays front and center, whereas for games involving CLIP Prefix, the apple gets phased out over time.

Medium

When the prompts become a bit harder, the situation starts to change.

AI Telephone for a medium difficulty prompt, with pairs of text-to-image and image-to-text models.

For nearly all of the games, the subject changes somewhere around the fourth or fifth step. Early on, MiniGPT-4 holds an advantage. But by the end of the game, that advantage seems to have been entirely lost.

Hard

By the time the prompts become difficult, we begin to see something interesting: for early steps, the image-to-text model is most important (MiniGPT-4 is the best, and CLIP Prefix is for the most part the worst). By later stages, however, the text-to-image model becomes most important. And to complicate the situation further, VQGAN-CLIP is the best here!

One might worry that “better” here could just mean that consistency is maintained, without accurately representing the original concept. However, when we look at examples, we can see that this is not the case.

AI Telephone for a tough prompt, with pairs of text-to-image and image-to-text models.

Take the example highlighted in the video, where the initial prompt is the “hard” prompt introduced above concerning a “bustling marketplace”. While the images generated by VQGAN-CLIP are undoubtedly grainy, the subject can still be made out, and it matches the original prompt fairly closely.

Impossible

Unsurprisingly, none of our competitors do terribly well here. One might argue that VQGAN-CLIP is the winner. But for the most part, this is all just noise. In the video, even for games involving VQGAN-CLIP, the subject is effectively unrecognizable.

AI Telephone for an “impossible” prompt, with pairs of text-to-image and image-to-text models.

This exploration was far from scientific: I only looked at ten prompts, without true validation of their difficulty levels; I only ran the conversations out to ten back-and-forth steps; and I only evaluated performance on one metric.

It is clear that which T2I and I2T models fare best depends largely on the complexity of the prompt, and on how long you want to keep the models talking. However, it is worth noting a few key observations:

  1. VQGAN-CLIP may fare better for tougher prompts, but this doesn’t mean it is a better T2I model. The images produced by VQGAN-CLIP are often far less coherent and globally consistent than those produced by Stable Diffusion or DALL-E2.
  2. The evaluation above is all about semantic similarity — it doesn’t take style into account. The style of these images can change a ton over the course of a game of AI Telephone. Anecdotally, I found that style is much more consistent for I2T models like mPLUG-Owl, which give long descriptions, than for models like BLIP, whose descriptions are more subject-focused.
  3. By around 5–6 iterations, the games had mostly converged to stable equilibria.
  4. Even though the embedding model, ImageBind, was multimodal, the distances between consecutive image and text embeddings were far greater than the distances between consecutive images or consecutive descriptions. In general, they followed the same trends, but in less pronounced fashion, which is why I didn’t include them in the plots.

I hope this inspires you to run your own experiments with generative AI — whether you’re playing AI Telephone, or doing something else entirely!

If you try out a variation of this and get interesting results, comment on this post!
