4 Approaches to construct on top of Generative AI Foundational Models

-

If among the terminology I exploit here is unfamiliar, I encourage you to read my earlier article on LLMs first.

There are teams which can be employing ChatGPT or its competitors (Anthropic, Google’s Flan T5 or PaLM, Meta’s LLaMA, Cohere, AI21Labs, etc.) for real somewhat for cutesy demos. Unfortunately, informative content about how they’re doing so is lost amidst marketing hype and technical jargon. Due to this fact, I see folks who’re getting began with generative AI take approaches that experts in the sector will inform you should not going to pan out. This text is my attempt at organizing this space and showing you what’s working.

Photo by Sen on Unsplash

The bar to clear

The issue with lots of the cutesy demos and hype-filled posts about generative AI is that they hit the training dataset — they don’t really inform you how well it can work when applied to the chaos of real human users and really novel input. Typical software is anticipated to work at 99%+ reliability —for instance, it was only when speech recognition crossed this accuracy bar on phrases that the marketplace for Voice AI took off. Same for automated captioning, translation, etc.

I see two ways through which teams are addressing this issue of their production systems:

  • Human users are more forgiving if the UX is in a situation where they already expect to correct errors (this appears to be what helps GitHub Copilot) or where it’s positioned as being interactive and helpful but not able to use (ChatGPT, Bing Chat, etc.)
  • Fully automated applications of generative AI are mostly within the trusted-tester stage today, and the jury is out on whether these applications are literally capable of clear this bar. That said, the outcomes are promising and trending upwards, and it’s likely only a matter of time before the bar’s met.

Personally, I even have been experimenting with GPT 3.5 Turbo and Google Flan-T5 with specific production use cases in mind, and learning quite a bit about what works and what doesn’t. None of my models have crossed the 99% bar. I also haven’t yet gotten access to GPT-4 or to Google’s PaLM API on the time of writing (March 2023). I’m basing this text on my experiments, on published research, and on publicly announced projects.

With all uses of generative AI, it is useful to firmly take into account that the pretrained models are trained on web content and will be biased in multiple ways. Safeguard against those biases in your application layer.

Approach 1: Use the API Directly

The primary approach is the only because many users encountered GPT through the interactive interface offered by ChatGPT. It seems very intuitive to check out various prompts until you get one which generates the output you would like. For this reason you might have a number of LinkedIn influencers publishing ChatGPT prompts that work for sales emails or whatever.

With regards to automating this workflow, the natural method is to make use of the REST API endpoint of the service and directly invoke it with the ultimate, working prompt:

import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.Edit.create(
model="text-davinci-edit-001",
input="It was so great to fulfill you .... ",
instruction="Summarize the text below in the shape of an email that's 5 sentences or less."
)

Nonetheless, this approach doesn’t lend itself to operationalization. There are several reasons:

  1. Brittleness. The underlying models keep improving. Sudden changes within the deployed models broke many production workloads, and other people learned from that have. ML workloads are brittle enough already; adding additional points of failure in the shape of prompts which can be fine-tuned to specific models shouldn’t be sensible.
  2. Injection. It’s rare that the instruction and input are plain strings as in the instance above. Most frequently, they include variables which can be input from users. These variables should be incorporated into the prompts and inputs. And as any programmer knows, injection by string concatenation is rife with security problems. You set yourself on the mercy of the guardrails placed across the Generative AI API once you do that. As when guarding against SQL injection, it’s higher to make use of an API that handles variable injection for you.
  3. Multiple prompts. It’s rare that you’re going to have the ability to get a prompt to work in one-shot. More common is to send multiple prompts to the model, and get the model to switch its output based on these prompts. These prompts themselves could have some human input (comparable to follow-up inputs) embedded within the workflow. Also common is for the prompts to supply just a few examples of the specified output (called few-shot learning).

A technique to resolve all three of those problems is to make use of langchain.

Approach 2: Use langchain

Langchain is rapidly becoming the library of alternative that means that you can invoke LLMs from different vendors, handle variable injection, and do few-shot training. Here’s an example of using langchain:

from langchain.prompts.few_shot import FewShotPromptTemplate

examples = [
{
"question": "Who lived longer, Muhammad Ali or Alan Turing?",
"answer":
"""
Are follow up questions needed here: Yes.
Follow up: How old was Muhammad Ali when he died?
Intermediate answer: Muhammad Ali was 74 years old when he died.
Follow up: How old was Alan Turing when he died?
Intermediate answer: Alan Turing was 41 years old when he died.
So the final answer is: Muhammad Ali
"""
},
{
"question": "When was the founder of craigslist born?",
"answer":
"""
Are follow up questions needed here: Yes.
Follow up: Who was the founder of craigslist?
Intermediate answer: Craigslist was founded by Craig Newmark.
Follow up: When was Craig Newmark born?
Intermediate answer: Craig Newmark was born on December 6, 1952.
So the final answer is: December 6, 1952
"""
...
]

example_prompt = PromptTemplate(input_variables=["question", "answer"],
template="Query: {query}n{answer}")

prompt = FewShotPromptTemplate(
examples=examples,
example_prompt=example_prompt,
suffix="Query: {input}",
input_variables=["input"]
)

print(prompt.format(input="Who was the daddy of Mary Ball Washington?"))

I strongly recommend using langchain vs. using a vendor’s API directly. Then, be sure that that the whole lot you do works with at the very least two APIs or use a LLM checkpoint that won’t change under you. Either of those approaches will avoid your prompts/code being brittle to changes within the underlying LLM. (Here, I’m using API to mean a managed LLM endpoint).

Langchain today supports APIs from Open AI, Cohere, HuggingFace Hub (and hence Google Flan-T5), etc. and LLMs from AI21, Anthropic, Open AI, HuggingFace Hub, etc.

Approach 3: Finetune the Generative AI Chain

That is the leading-edge approach in that it’s the one I see utilized by most of the delicate production applications of generative AI. As just an example (no endorsement), finetuning is how a startup consisting of Stanford PhDs is approaching standard enterprise use cases like SQL generation and record matching.

To grasp the rationale behind this approach, it helps to know that there are 4 machine learning models that underpin ChatGPT (or its competitors):

  1. A Large Language Model (LLM) is trained to predict the subsequent word of text given the previous words. It does this by learning word associations and patterns on an unlimited corpus of documents. The model is large enough that it learns these patterns in several contexts.
  2. A Reinforcement Learning based on Human Feedback Model (RL-HF) is trained by showing humans examples of generated text, and asking them to approve text that is enjoyable to read. The explanation this is required is that an LLM’s output is probabilistic — it doesn’t predict a single next word; as a substitute, it predicts a set of words each of which has a certain probability of coming next. The RL-HF uses human feedback to learn tips on how to select the continuation that can generate the text that appeals to humans.
  3. Instruction Model is a supervised model that’s trained by showing prompts (“generate a sales email that proposes a demo to the engineering leadership”) and training the model on examples of sales emails.
  4. Context Model is trained to hold on a conversation with the user, allowing them to craft the output through successive prompts.

As well as, there are guardrails (filters on each the input and output). The model declines to reply certain kinds of queries, and retracts certain answers. In practice, these are each machine learning models which can be consistently updated.

Step 2: How RL-HF works. Image from Stiennon et al, 2020

There are open-source generative AI models (Meta’s LLaMA, Google’s Flan-T5) which can help you pick up at any of the above steps (e.g. use steps 1–2 from the released checkpoint, train 3 on your individual data, don’t do 4). Note that LLaMA doesn’t permit business use, and Flan-T5 is a 12 months old (so you’re compromising on quality). To learn where to interrupt off, it is useful to grasp the fee/advantage of each stage:

  • In case your application uses very different jargon and words, it could be helpful to construct a LLM from scratch on your individual data (i.e., start at step 1). The issue is that you might not have enough data and even when you might have enough data, the training goes to be expensive (on the order of three–5 million dollars per training run). This appears to be what Salesforce has done with the generative AI they use for developers.
  • The RL-HF model is trained to appeal to a bunch of testers who might not be subject-matter experts, or representative of your individual users. In case your application requires material expertise, you might be higher off starting with a LLM and branching off from step 2. The dataset you wish for this is far smaller — Stiennon et al 2020 used 125k documents and presented a pair of outputs for every input document in each iteration (see diagram). So, you wish human labelers on standby to rate about 1 million outputs. Assuming that a labeler takes 10 min to rate each pair of documents, the fee is that of 250 human-months of labor per training run. I’d estimate $250k to $2m depending on location and skillset.
  • ChatGPT is trained to answer hundreds of various prompts. Your application, however, probably requires just one or two specific ones. It might probably be convenient to coach a model comparable to Google Flan-T5 in your specific instruction and input. Such a model will be much smaller (and subsequently cheaper to deploy). This advantage in serving costs explains why step 3 is probably the most common point of branching off. It’s possible to fine-tune Google Flan-T5 on your specific task with about 10k documents using HuggingFace and/or Keras. You’d do that in your usual ML framework comparable to Databricks, Sagemaker, or Vertex AI and use the identical services to deploy the trained model. Because Flan-T5 is a Google model, GCP makes training and deployment very easy by providing pre-built containers in Vertex AI. The fee can be perhaps $50 or so.
  • Theoretically, it’s possible to coach a distinct technique to maintain conversational context. Nonetheless, I haven’t seen this in practice. What most individuals do as a substitute is to make use of a conversational agent framework like Dialogflow that already has a LLM built into it, and design a custom chatbot for his or her application. The infra costs are negligible and also you don’t need any AI expertise, just domain knowledge.

It is feasible to interrupt off at any of those stages. Limiting my examples to publicly published work in medicine:

  1. This Nature article builds a custom 8.9-billion parameter LLM from 90 billion words extracted from medical records (i.e., they begin from step 1). For comparison, Flan T5 is 540 billion parameters and the “small/efficient” PaLM is 62 billion parameters. Obviously, cost is a constraint in going much greater in your custom language model.
  2. This MIT CSAIL study forces the model to closely hew to existing text and in addition doing instruction fine-tuning (i.e., they’re ranging from step 2).
  3. Deep Mind’s MedPaLM starts from an instruction-tuned variation of PaLM called Flan-PaLM (i.e. it starts after step 3). They report that 93% of healthcare professionals rated the AI as being on par with human answers.

My advice is to decide on where to interrupt off based on how different your application space is from the generic web text on which the foundational models are trained. Which model do you have to fine-tune? Currently, Google Flan T5 is probably the most sophisticated fine-tuneable model available and open for business use. For non-commercial uses, Meta’s LLaMA is probably the most sophisticated model available.

A word of caution though: once you tap into the chain using open-source models, the guardrail filters won’t exist, so you should have to place in toxicity safeguards. One option is to make use of the detoxify library. Make certain to include toxicity filtering around any API endpoint in production — otherwise, you’ll end up having to take it back down. API gateways generally is a convenient technique to make sure that you’re doing this for all of your ML model endpoints.

Approach 4: Simplify the issue

There are smart approaches to reframe the issue you’re solving in comparable to way you could use a Generative AI model (as in Approach 3) but avoid problems with hallucination, etc.

For instance, suppose you ought to do question-answering. You could possibly start with a strong LLM after which struggle to “tame” the wild beast to have it not hallucinate. A much simpler approach is to reframe the issue. Change the model from one which predicts the output text to a model that has three outputs: the URL of a document, the starting position inside that document, and the length of text. That’s what Google Search is doing here:

Google’s Q&A model predicts a URL, starting position, and length of text. This avoids problems with hallucination.

At worst, the model will show you irrelevant text. What it can not do is to hallucinate since you don’t allow it to truly predict text.

A Keras sample that follows this approach tokenizes the inputs and context (the document that you just are finding the reply inside):

from transformers import AutoTokenizer

model_checkpoint = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
...
examples["question"] = [q.lstrip() for q in examples["question"]]
examples["context"] = [c.lstrip() for c in examples["context"]]
tokenized_examples = tokenizer(
examples["question"],
examples["context"],
...
)
...

after which passes the tokens right into a Keras regression model whose first layer is the Transformer model that takes in these tokens and that outputs the position of the reply throughout the “context” text:

from transformers import TFAutoModelForQuestionAnswering
import tensorflow as tf
from tensorflow import keras

model = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
optimizer = keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer)
model.fit(train_set, validation_data=validation_set, epochs=1)

During inference, you get the anticipated locations:

inputs = tokenizer([context], [question], return_tensors="np")
outputs = model(inputs)
start_position = tf.argmax(outputs.start_logits, axis=1)
end_position = tf.argmax(outputs.end_logits, axis=1)

You’ll note that the sample doesn’t predict the URL — the context is assumed to be the results of a typical search query (comparable to returned by an identical engine or vector database), and the sample model only does extraction. Nonetheless, you’ll be able to construct the search also into the model by having a separate layer in Keras.

Summary

There are 4 approaches that I see getting used to construct production applications on top of generative AI foundational models:

  1. Use the REST API of an all-in model comparable to GPT-4 for one-shot prompts.
  2. Use langchain to abstract away the LLM, input injection, multi-turn conversations, and few-shot learning.
  3. Finetune in your custom data by tapping into the set of models that comprise an end-to-end generative AI model.
  4. Reframe the issue right into a form that avoids the hazards of generative AI (bias, toxicity, hallucination).

Approach #3 is what I see mostly utilized by sophisticated teams.

ASK DUKE

What are your thoughts on this topic?
Let us know in the comments below.

0 0 votes
Article Rating
guest
0 Comments
Inline Feedbacks
View all comments

Share this article

Recent posts

0
Would love your thoughts, please comment.x
()
x