Home Artificial Intelligence 4 Approaches to construct on top of Generative AI Foundational Models

4 Approaches to construct on top of Generative AI Foundational Models

4 Approaches to construct on top of Generative AI Foundational Models

If a number of the terminology I exploit here is unfamiliar, I encourage you to read my earlier article on LLMs first.

There are teams which can be employing ChatGPT or its competitors (Anthropic, Google’s Flan T5 or PaLM, Meta’s LLaMA, Cohere, AI21Labs, etc.) for real somewhat for cutesy demos. Unfortunately, informative content about how they’re doing so is lost amidst marketing hype and technical jargon. Subsequently, I see folks who’re getting began with generative AI take approaches that experts in the sphere will let you know usually are not going to pan out. This text is my attempt at organizing this space and showing you what’s working.

Photo by Sen on Unsplash

The bar to clear

The issue with lots of the cutesy demos and hype-filled posts about generative AI is that they hit the training dataset — they don’t really let you know how well it would work when applied to the chaos of real human users and truly novel input. Typical software is predicted to work at 99%+ reliability —for instance, it was only when speech recognition crossed this accuracy bar on phrases that the marketplace for Voice AI took off. Same for automated captioning, translation, etc.

I see two ways during which teams are addressing this issue of their production systems:

  • Human users are more forgiving if the UX is in a situation where they already expect to correct errors (this appears to be what helps GitHub Copilot) or where it’s positioned as being interactive and helpful but not able to use (ChatGPT, Bing Chat, etc.)
  • Fully automated applications of generative AI are mostly within the trusted-tester stage today, and the jury is out on whether these applications are literally in a position to clear this bar. That said, the outcomes are promising and trending upwards, and it’s likely only a matter of time before the bar’s met.

Personally, I actually have been experimenting with GPT 3.5 Turbo and Google Flan-T5 with specific production use cases in mind, and learning quite a bit about what works and what doesn’t. None of my models have crossed the 99% bar. I also haven’t yet gotten access to GPT-4 or to Google’s PaLM API on the time of writing (March 2023). I’m basing this text on my experiments, on published research, and on publicly announced projects.

With all uses of generative AI, it is useful to firmly take into accout that the pretrained models are trained on web content and might be biased in multiple ways. Safeguard against those biases in your application layer.

Approach 1: Use the API Directly

The primary approach is the best because many users encountered GPT through the interactive interface offered by ChatGPT. It seems very intuitive to check out various prompts until you get one which generates the output you would like. For this reason you’ve got numerous LinkedIn influencers publishing ChatGPT prompts that work for sales emails or whatever.

In the case of automating this workflow, the natural method is to make use of the REST API endpoint of the service and directly invoke it with the ultimate, working prompt:

import os
import openai
openai.api_key = os.getenv("OPENAI_API_KEY")
input="It was so great to satisfy you .... ",
instruction="Summarize the text below in the shape of an email that's 5 sentences or less."

Nevertheless, this approach doesn’t lend itself to operationalization. There are several reasons:

  1. . The underlying models keep improving. Sudden changes within the deployed models broke many production workloads, and other people learned from that have. ML workloads are brittle enough already; adding additional points of failure in the shape of prompts which can be fine-tuned to specific models shouldn’t be sensible.
  2. . It’s rare that the instruction and input are plain strings as in the instance above. Most frequently, they include variables which can be input from users. These variables must be incorporated into the prompts and inputs. And as any programmer knows, injection by string concatenation is rife with security problems. You set yourself on the mercy of the guardrails placed across the Generative AI API whenever you do that. As when guarding against SQL injection, it’s higher to make use of an API that handles variable injection for you.
  3. . It’s rare that you’ll give you the option to get a prompt to work in one-shot. More common is to send multiple prompts to the model, and get the model to switch its output based on these prompts. These prompts themselves can have some human input (equivalent to follow-up inputs) embedded within the workflow. Also common is for the prompts to supply just a few examples of the specified output (called few-shot learning).

A solution to resolve all three of those problems is to make use of langchain.

Approach 2: Use langchain

Langchain is rapidly becoming the library of selection that permits you to invoke LLMs from different vendors, handle variable injection, and do few-shot training. Here’s an example of using langchain:

from langchain.prompts.few_shot import FewShotPromptTemplate

examples = [
"question": "Who lived longer, Muhammad Ali or Alan Turing?",
Are follow up questions needed here: Yes.
Follow up: How old was Muhammad Ali when he died?
Intermediate answer: Muhammad Ali was 74 years old when he died.
Follow up: How old was Alan Turing when he died?
Intermediate answer: Alan Turing was 41 years old when he died.
So the final answer is: Muhammad Ali
"question": "When was the founder of craigslist born?",
Are follow up questions needed here: Yes.
Follow up: Who was the founder of craigslist?
Intermediate answer: Craigslist was founded by Craig Newmark.
Follow up: When was Craig Newmark born?
Intermediate answer: Craig Newmark was born on December 6, 1952.
So the final answer is: December 6, 1952

example_prompt = PromptTemplate(input_variables=["question", "answer"],
template="Query: {query}n{answer}")

prompt = FewShotPromptTemplate(
suffix="Query: {input}",

print(prompt.format(input="Who was the daddy of Mary Ball Washington?"))

I strongly recommend using langchain vs. using a vendor’s API directly. Then, be certain that every little thing you do works with not less than two APIs or use a LLM checkpoint that won’t change under you. Either of those approaches will avoid your prompts/code being brittle to changes within the underlying LLM. (Here, I’m using API to mean a managed LLM endpoint).

Langchain today supports APIs from Open AI, Cohere, HuggingFace Hub (and hence Google Flan-T5), etc. and LLMs from AI21, Anthropic, Open AI, HuggingFace Hub, etc.

Approach 3: Finetune the Generative AI Chain

That is the leading-edge approach in that it’s the one I see utilized by most of the delicate production applications of generative AI. As just an example (no endorsement), finetuning is how a startup consisting of Stanford PhDs is approaching standard enterprise use cases like SQL generation and record matching.

To grasp the rationale behind this approach, it helps to know that there are 4 machine learning models that underpin ChatGPT (or its competitors):

  1. A Large Language Model (LLM) is trained to predict the subsequent word of text given the previous words. It does this by learning word associations and patterns on an unlimited corpus of documents. The model is large enough that it learns these patterns in several contexts.
  2. A Reinforcement Learning based on Human Feedback Model (RL-HF) is trained by showing humans examples of generated text, and asking them to approve text that is agreeable to read. The rationale this is required is that an LLM’s output is probabilistic — it doesn’t predict a single next word; as an alternative, it predicts a set of words each of which has a certain probability of coming next. The RL-HF uses human feedback to learn select the continuation that can generate the text that appeals to humans.
  3. Instruction Model is a supervised model that’s trained by showing prompts (“generate a sales email that proposes a demo to the engineering leadership”) and training the model on examples of sales emails.
  4. Context Model is trained to hold on a conversation with the user, allowing them to craft the output through successive prompts.

As well as, there are guardrails (filters on each the input and output). The model declines to reply certain varieties of queries, and retracts certain answers. In practice, these are each machine learning models which can be continually updated.

Step 2: How RL-HF works. Image from Stiennon et al, 2020

There are open-source generative AI models (Meta’s LLaMA, Google’s Flan-T5) which will let you pick up at any of the above steps (e.g. use steps 1–2 from the released checkpoint, train 3 on your individual data, don’t do 4). Note that LLaMA doesn’t permit business use, and Flan-T5 is a 12 months old (so you’re compromising on quality). To learn where to interrupt off, it is useful to know the associated fee/advantage of each stage:

  • In case your application uses very different jargon and words, it might be helpful to construct a LLM from scratch on your individual data (i.e., start at step 1). The issue is that you could not have enough data and even when you’ve got enough data, the training goes to be expensive (on the order of three–5 million dollars per training run). This appears to be what Salesforce has done with the generative AI they use for developers.
  • The RL-HF model is trained to appeal to a gaggle of testers who is probably not subject-matter experts, or representative of your individual users. In case your application requires subject material expertise, you could be higher off starting with a LLM and branching off from step 2. The dataset you wish for this is way smaller — Stiennon et al 2020 used 125k documents and presented a pair of outputs for every input document in each iteration (see diagram). So, you wish human labelers on standby to rate about 1 million outputs. Assuming that a labeler takes 10 min to rate each pair of documents, the associated fee is that of 250 human-months of labor per training run. I’d estimate $250k to $2m depending on location and skillset.
  • ChatGPT is trained to reply to 1000’s of various prompts. Your application, then again, probably requires just one or two specific ones. It will possibly be convenient to coach a model equivalent to Google Flan-T5 in your specific instruction and input. Such a model might be much smaller (and subsequently cheaper to deploy). This advantage in serving costs explains why step 3 is essentially the most common point of branching off. It’s possible to fine-tune Google Flan-T5 to your specific task with about 10k documents using HuggingFace and/or Keras. You’d do that in your usual ML framework equivalent to Databricks, Sagemaker, or Vertex AI and use the identical services to deploy the trained model. Because Flan-T5 is a Google model, GCP makes training and deployment very easy by providing pre-built containers in Vertex AI. The price can be perhaps $50 or so.
  • Theoretically, it’s possible to coach a unique solution to maintain conversational context. Nevertheless, I haven’t seen this in practice. What most individuals do as an alternative is to make use of a conversational agent framework like Dialogflow that already has a LLM built into it, and design a custom chatbot for his or her application. The infra costs are negligible and also you don’t need any AI expertise, just domain knowledge.

It is feasible to interrupt off at any of those stages. Limiting my examples to publicly published work in medicine:

  1. This Nature article builds a custom 8.9-billion parameter LLM from 90 billion words extracted from medical records (i.e., they begin from step 1). For comparison, Flan T5 is 540 billion parameters and the “small/efficient” PaLM is 62 billion parameters. Obviously, cost is a constraint in going much larger in your custom language model.
  2. This MIT CSAIL study forces the model to closely hew to existing text and likewise doing instruction fine-tuning (i.e., they’re ranging from step 2).
  3. Deep Mind’s MedPaLM starts from an instruction-tuned variation of PaLM called Flan-PaLM (i.e. it starts after step 3). They report that 93% of healthcare professionals rated the AI as being on par with human answers.

My advice is to decide on where to interrupt off based on how different your application space is from the generic web text on which the foundational models are trained. Which model do you have to fine-tune? Currently, Google Flan T5 is essentially the most sophisticated fine-tuneable model available and open for business use. For non-commercial uses, Meta’s LLaMA is essentially the most sophisticated model available.

A word of caution though: whenever you tap into the chain using open-source models, the guardrail filters won’t exist, so . One option is to make use of the detoxify library. Be certain that to include toxicity filtering around any API endpoint in production — otherwise, you’ll end up having to take it back down. API gateways is usually a convenient solution to make sure that you’re doing this for all of your ML model endpoints.

Approach 4: Simplify the issue

There are smart approaches to reframe the issue you’re solving in equivalent to way you could use a Generative AI model (as in Approach 3) but avoid problems with hallucination, etc.

For instance, suppose you would like to do question-answering. You could possibly start with a robust LLM after which struggle to “tame” the wild beast to have it not hallucinate. A much simpler approach is to reframe the issue. Change the model from one which predicts the output text to a model that has three outputs: the URL of a document, the starting position inside that document, and the length of text. That’s what Google Search is doing here:

Google’s Q&A model predicts a URL, starting position, and length of text. This avoids problems with hallucination.

At worst, the model will show you irrelevant text. What it would not do is to hallucinate since you don’t allow it to truly predict text.

A Keras sample that follows this approach tokenizes the inputs and context (the document that you just are finding the reply inside):

from transformers import AutoTokenizer

model_checkpoint = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
examples["question"] = [q.lstrip() for q in examples["question"]]
examples["context"] = [c.lstrip() for c in examples["context"]]
tokenized_examples = tokenizer(

after which passes the tokens right into a Keras regression model whose first layer is the Transformer model that takes in these tokens and that outputs the position of the reply throughout the “context” text:

from transformers import TFAutoModelForQuestionAnswering
import tensorflow as tf
from tensorflow import keras

model = TFAutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
optimizer = keras.optimizers.Adam(learning_rate=5e-5)
model.fit(train_set, validation_data=validation_set, epochs=1)

During inference, you get the expected locations:

inputs = tokenizer([context], [question], return_tensors="np")
outputs = model(inputs)
start_position = tf.argmax(outputs.start_logits, axis=1)
end_position = tf.argmax(outputs.end_logits, axis=1)

You’ll note that the sample doesn’t predict the URL — the context is assumed to be the results of a typical search query (equivalent to returned by an identical engine or vector database), and the sample model only does extraction. Nevertheless, you may construct the search also into the model by having a separate layer in Keras.


There are 4 approaches that I see getting used to construct production applications on top of generative AI foundational models:

  1. Use the REST API of an all-in model equivalent to GPT-4 for one-shot prompts.
  2. Use langchain to abstract away the LLM, input injection, multi-turn conversations, and few-shot learning.
  3. Finetune in your custom data by tapping into the set of models that comprise an end-to-end generative AI model.
  4. Reframe the issue right into a form that avoids the hazards of generative AI (bias, toxicity, hallucination).

Approach #3 is what I see mostly utilized by sophisticated teams.



Please enter your comment!
Please enter your name here