4 Approaches to construct on top of Generative AI Foundational Models

What works, the professionals and cons, and example code for every approach

If a number of the terminology I take advantage of here is unfamiliar, I encourage you to read my earlier article on LLMs first.

There are teams which are employing ChatGPT or its competitors (Anthropic, Google’s Flan T5 or PaLM, Meta’s LLaMA, Cohere, AI21Labs, etc.) for real moderately for cutesy demos. Unfortunately, informative content about how they’re doing so is lost amidst marketing hype and technical jargon. Due to this fact, I see folks who’re getting began with generative AI take approaches that experts in the sector will let you know should not going to pan out. This text is my attempt at organizing this space and showing you what’s working.

The bar to clear

The issue with most of the cutesy demos and hype-filled posts about generative AI is that they hit the training dataset — they don’t really let you know how well it can work when applied to the chaos of real human users and truly novel input. Typical software is anticipated to work at 99%+ reliability —for instance, it was only when speech recognition crossed this accuracy bar on phrases that the marketplace for Voice AI took off. Same for automated captioning, translation, etc.

I see two ways through which teams are addressing this issue of their production systems:

Human users are more forgiving if the UX is in a situation where they already expect to correct errors (this appears to be what helps GitHub Copilot) or where it’s positioned as being interactive and helpful but not able to use (ChatGPT, Bing Chat, etc.)
Fully automated applications of generative AI are mostly within the trusted-tester stage today, and the jury is out on whether these applications are literally capable of clear this bar. That said, the outcomes are promising and trending upwards, and it’s likely only a matter of time before the bar’s met.

Personally, I even have been experimenting with GPT 3.5 Turbo and Google Flan-T5 with specific production use cases in mind, and learning quite a bit about what works and what doesn’t. None of my models have crossed the 99% bar. I also haven’t yet gotten access to GPT-4 or to Google’s PaLM API on the time of writing (March 2023). I’m basing this text on my experiments, on published research, and on publicly announced projects.

With all uses of generative AI, it is useful to firmly have in mind that the pretrained models are trained on web content and might be biased in multiple ways. Safeguard against those biases in your application layer.

Approach 1: Use the API Directly

The primary approach is the only because many users encountered GPT through the interactive interface offered by ChatGPT. It seems very intuitive to check out various prompts until you get one which generates the output you would like. That is why you’ve gotten a number of LinkedIn influencers publishing ChatGPT prompts that work for sales emails or whatever.

Relating to automating this workflow, the natural method is to make use of the REST API endpoint of the service and directly invoke it with the ultimate, working prompt:

import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.Edit.create(
model="text-davinci-edit-001",
input="It was so great to fulfill you .... ",
instruction="Summarize the text below in the shape of an email that's 5 sentences or less."
)

Nonetheless, this approach doesn’t lend itself to operationalization. There are several reasons:

. The underlying models keep improving. Sudden changes within the deployed models broke many production workloads, and other people learned from that have. ML workloads are brittle enough already; adding additional points of failure in the shape of prompts which are fine-tuned to specific models just isn’t sensible.
. It’s rare that the instruction and input are plain strings as in the instance above. Most frequently, they include variables which are input from users. These variables need to be incorporated into the prompts and inputs. And as any programmer knows, injection by string concatenation is rife with security problems. You place yourself on the mercy of the guardrails placed across the Generative AI API whenever you do that. As when guarding against SQL injection, it’s higher to make use of an API that handles variable injection for you.
. It’s rare that you’ll have the ability to get a prompt to work in one-shot. More common is to send multiple prompts to the model, and get the model to change its output based on these prompts. These prompts themselves can have some human input (reminiscent of follow-up inputs) embedded within the workflow. Also common is for the prompts to offer a couple of examples of the specified output (called few-shot learning).

A option to resolve all three of those problems is to make use of langchain.

Approach 2: Use langchain

Langchain is rapidly becoming the library of selection that means that you can invoke LLMs from different vendors, handle variable injection, and do few-shot training. Here’s an example of using langchain:

from langchain.prompts.few_shot import FewShotPromptTemplateexamples = [
{
"question": "Who lived longer, Muhammad Ali or Alan Turing?",
"answer": 
"""
Are follow up questions needed here: Yes.
Follow up: How old was Muhammad Ali when he died?
Intermediate answer: Muhammad Ali was 74 years old when he died.
Follow up: How old was Alan Turing when he died?
Intermediate answer: Alan Turing was 41 years old when he died.
So the final answer is: Muhammad Ali
"""
},
{
"question": "When was the founder of craigslist born?",
"answer": 
"""
Are follow up questions needed here: Yes.
Follow up: Who was the founder of craigslist?
Intermediate answer: Craigslist was founded by Craig Newmark.
Follow up: When was Craig Newmark born?
Intermediate answer: Craig Newmark was born on December 6, 1952.
So the final answer is: December 6, 1952
"""
...
]
example_prompt = PromptTemplate(input_variables=["question", "answer"], 
template="Query: {query}n{answer}")
prompt = FewShotPromptTemplate(
examples=examples, 
example_prompt=example_prompt, 
suffix="Query: {input}", 
input_variables=["input"]
)
print(prompt.format(input="Who was the daddy of Mary Ball Washington?"))

I strongly recommend using langchain vs. using a vendor’s API directly. Then, be certain that every part you do works with not less than two APIs or use a LLM checkpoint that won’t change under you. Either of those approaches will avoid your prompts/code being brittle to changes within the underlying LLM. (Here, I’m using API to mean a managed LLM endpoint).

Langchain today supports APIs from Open AI, Cohere, HuggingFace Hub (and hence Google Flan-T5), etc. and LLMs from AI21, Anthropic, Open AI, HuggingFace Hub, etc.

Approach 3: Finetune the Generative AI Chain

That is the leading-edge approach in that it’s the one I see utilized by most of the delicate production applications of generative AI. As just an example (no endorsement), finetuning is how a startup consisting of Stanford PhDs is approaching standard enterprise use cases like SQL generation and record matching.

To grasp the rationale behind this approach, it helps to know that there are 4 machine learning models that underpin ChatGPT (or its competitors):

A Large Language Model (LLM) is trained to predict the subsequent word of text given the previous words. It does this by learning word associations and patterns on an unlimited corpus of documents. The model is large enough that it learns these patterns in several contexts.
A Reinforcement Learning based on Human Feedback Model (RL-HF) is trained by showing humans examples of generated text, and asking them to approve text that is agreeable to read. The explanation this is required is that an LLM’s output is probabilistic — it doesn’t predict a single next word; as an alternative, it predicts a set of words each of which has a certain probability of coming next. The RL-HF uses human feedback to learn how you can select the continuation that can generate the text that appeals to humans.
Instruction Model is a supervised model that’s trained by showing prompts (“generate a sales email that proposes a demo to the engineering leadership”) and training the model on examples of sales emails.
Context Model is trained to hold on a conversation with the user, allowing them to craft the output through successive prompts.

As well as, there are guardrails (filters on each the input and output). The model declines to reply certain sorts of queries, and retracts certain answers. In practice, these are each machine learning models which are always updated.

Step 2: How RL-HF works. Image from Stiennon et al, 2020

There are open-source generative AI models (Meta’s LLaMA, Google’s Flan-T5) which let you pick up at any of the above steps (e.g. use steps 1–2 from the released checkpoint, train 3 on your personal data, don’t do 4). Note that LLaMA doesn’t permit business use, and Flan-T5 is a 12 months old (so you might be compromising on quality). To learn where to interrupt off, it is useful to grasp the associated fee/good thing about each stage:

In case your application uses very different jargon and words, it could be helpful to construct a LLM from scratch on your personal data (i.e., start at step 1). The issue is that it’s possible you’ll not have enough data and even when you’ve gotten enough data, the training goes to be expensive (on the order of three–5 million dollars per training run). This appears to be what Salesforce has done with the generative AI they use for developers.
The RL-HF model is trained to appeal to a gaggle of testers who will not be subject-matter experts, or representative of your personal users. In case your application requires material expertise, it’s possible you’ll be higher off starting with a LLM and branching off from step 2. The dataset you wish for this is far smaller — Stiennon et al 2020 used 125k documents and presented a pair of outputs for every input document in each iteration (see diagram). So, you wish human labelers on standby to rate about 1 million outputs. Assuming that a labeler takes 10 min to rate each pair of documents, the associated fee is that of 250 human-months of labor per training run. I’d estimate $250k to $2m depending on location and skillset.
ChatGPT is trained to answer 1000’s of various prompts. Your application, then again, probably requires just one or two specific ones. It will possibly be convenient to coach a model reminiscent of Google Flan-T5 in your specific instruction and input. Such a model might be much smaller (and subsequently cheaper to deploy). This advantage in serving costs explains why step 3 is essentially the most common point of branching off. It’s possible to fine-tune Google Flan-T5 to your specific task with about 10k documents using HuggingFace and/or Keras. You’d do that in your usual ML framework reminiscent of Databricks, Sagemaker, or Vertex AI and use the identical services to deploy the trained model. Because Flan-T5 is a Google model, GCP makes training and deployment very easy by providing pre-built containers in Vertex AI. The associated fee could be perhaps $50 or so.
Theoretically, it’s possible to coach a unique option to maintain conversational context. Nonetheless, I haven’t seen this in practice. What most individuals do as an alternative is to make use of a conversational agent framework like Dialogflow that already has a LLM built into it, and design a custom chatbot for his or her application. The infra costs are negligible and also you don’t need any AI expertise, just domain knowledge.

It is feasible to interrupt off at any of those stages. Limiting my examples to publicly published work in medicine:

This Nature article builds a custom 8.9-billion parameter LLM from 90 billion words extracted from medical records (i.e., they begin from step 1). For comparison, Flan T5 is 540 billion parameters and the “small/efficient” PaLM is 62 billion parameters. Obviously, cost is a constraint in going much greater in your custom language model.
This MIT CSAIL study forces the model to closely hew to existing text and likewise doing instruction fine-tuning (i.e., they’re ranging from step 2).
Deep Mind’s MedPaLM starts from an instruction-tuned variation of PaLM called Flan-PaLM (i.e. it starts after step 3). They report that 93% of healthcare professionals rated the AI as being on par with human answers.

My advice is to decide on where to interrupt off based on how different your application space is from the generic web text on which the foundational models are trained. Which model must you fine-tune? Currently, Google Flan T5 is essentially the most sophisticated fine-tuneable model available and open for business use. For non-commercial uses, Meta’s LLaMA is essentially the most sophisticated model available.

A word of caution though: whenever you tap into the chain using open-source models, the guardrail filters won’t exist, so . One option is to make use of the detoxify library. Be sure to include toxicity filtering around any API endpoint in production — otherwise, you’ll end up having to take it back down. API gateways is usually a convenient option to be certain that you might be doing this for all of your ML model endpoints.

Approach 4: Simplify the issue

There are smart approaches to reframe the issue you might be solving in reminiscent of way which you could use a Generative AI model (as in Approach 3) but avoid problems with hallucination, etc.

For instance, suppose you ought to do question-answering. You could possibly start with a robust LLM after which struggle to “tame” the wild beast to have it not hallucinate. A much simpler approach is to reframe the issue. Change the model from one which predicts the output text to a model that has three outputs: the URL of a document, the starting position inside that document, and the length of text. That’s what Google Search is doing here:

Google’s Q&A model predicts a URL, starting position, and length of text. This avoids problems with hallucination.

At worst, the model will show you irrelevant text. What it can not do is to hallucinate since you don’t allow it to really predict text.

A Keras sample that follows this approach tokenizes the inputs and context (the document that you simply are finding the reply inside):

from transformers import AutoTokenizermodel_checkpoint = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
...
examples["question"] = [q.lstrip() for q in examples["question"]]
examples["context"] = [c.lstrip() for c in examples["context"]]
tokenized_examples = tokenizer(
examples["question"],
examples["context"],
...
)
...

after which passes the tokens right into a Keras regression model whose first layer is the Transformer model that takes in these tokens and that outputs the position of the reply inside the “context” text:

from transformers import TFAutoModelForQuestionAnswering
import tensorflow as tf
from tensorflow import kerasmodel = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
optimizer = keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer)
model.fit(train_set, validation_data=validation_set, epochs=1)

During inference, you get the expected locations:

inputs = tokenizer([context], [question], return_tensors="np")
outputs = model(inputs)
start_position = tf.argmax(outputs.start_logits, axis=1)
end_position = tf.argmax(outputs.end_logits, axis=1)

You’ll note that the sample doesn’t predict the URL — the context is assumed to be the results of a typical search query (reminiscent of returned by an identical engine or vector database), and the sample model only does extraction. Nonetheless, you’ll be able to construct the search also into the model by having a separate layer in Keras.

Summary

There are 4 approaches that I see getting used to construct production applications on top of generative AI foundational models:

Use the REST API of an all-in model reminiscent of GPT-4 for one-shot prompts.
Use langchain to abstract away the LLM, input injection, multi-turn conversations, and few-shot learning.
Finetune in your custom data by tapping into the set of models that comprise an end-to-end generative AI model.
Reframe the issue right into a form that avoids the risks of generative AI (bias, toxicity, hallucination).

Approach #3 is what I see mostly utilized by sophisticated teams.

4 Approaches to construct on top of Generative AI Foundational Models

What works, the professionals and cons, and example code for every approach

The bar to clear

Approach 1: Use the API Directly

Approach 2: Use langchain

Approach 3: Finetune the Generative AI Chain

Approach 4: Simplify the issue

Summary

What are your thoughts on this topic?
Let us know in the comments below.

1 COMMENT

Share this article

Recent posts

What We Still Don’t Understand About Machine Learning

OpenAI Unveils SearchGPT: A Recent AI-Powered Search Engine

Public Release: Kling AI Video Generator

UK declares hiring of AI staff, but criticism continues

Radical Simplicity in Data Engineering

4 Approaches to construct on top of Generative AI Foundational Models

What works, the professionals and cons, and example code for every approach

The bar to clear

Approach 1: Use the API Directly

Approach 2: Use langchain

Approach 3: Finetune the Generative AI Chain

Approach 4: Simplify the issue

Summary

What are your thoughts on this topic? Let us know in the comments below.

1 COMMENT

Share this article

Recent posts

What are your thoughts on this topic?
Let us know in the comments below.