
Thinking about fine-tuning an LLM? Here are 3 considerations before you start


LLMs (Large Language Models) and generative AI are all the rage right now. A staggering statistic from IBM reveals that nearly 2 in 3 C-Suite executives feel pressure from investors to accelerate their adoption of generative AI. Naturally, this pressure is trickling down to Data Science and Machine Learning teams, who are responsible for navigating the hype and creating winning implementations.

As the landscape evolves, the ecosystem for LLMs has diverged between open source and industry models, with a moat that is quickly filling in. This emerging scene has prompted many teams to consider the following question: How can we make an LLM more specific for our use case?

In this article we explore some key considerations that should be top of mind when contemplating the investment of time and engineering cycles to build a niche LLM. On this journey, it's crucial to be aware of some of the recent research surrounding potential limitations and best practices for building fine-tuned language models. After reading this article, you'll be equipped with a few more ideas to steer your organization toward the right decision on whether to train, and how to train.

It's no secret to anyone that OpenAI is leading the LLM charge with its latest iterations of GPT. For this reason, many stakeholders may ask a development team to deploy a model that imitates the results of the more robust model for various reasons (rate limits, data privacy, costs, etc.). This naturally leads developers to wonder: Can we generate outputs from GPT and use them to fine-tune a model?

The answer to this question remains uncertain, since it seems to depend on several factors. This particular task, known as imitation learning, involves training a new language model through fine-tuning on target observations from a more advanced model such as GPT. While this seems like a great way to get good performance out of a downstream model, it comes with its share of potential issues.
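As a rough illustration of what collecting imitation data looks like in practice, the sketch below queries the teacher model and stores prompt/response pairs as fine-tuning targets. It assumes the pre-1.0 openai Python package and a hypothetical list of prompts; this is a minimal outline, not the setup used in the paper discussed next.

from typing import List
import json
import openai  # assumes the pre-1.0 openai package interface

openai.api_key = "YOUR_API_KEY"

def collect_imitation_data(prompts: List[str], out_path: str = "imitation_data.jsonl") -> None:
    """Query the teacher model and save (prompt, completion) pairs for fine-tuning."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
            )
            completion = response["choices"][0]["message"]["content"]
            # Each line becomes one target observation for the imitation model
            f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")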

Figure taken from Gudibande et al. [1].

A recent paper titled "The False Promise of Imitating Proprietary LLMs" [1] sheds some light on potential pitfalls you may encounter with this approach. The authors present experiments demonstrating that adding more imitation data can actually lead to a degradation in model performance. Looking at the middle graph in the figure above, we can see that accuracy on the benchmark task decreases as the number of tokens increases. But why is that the case?

The authors suggest the reason this happens is that imitation models learn the style of the model they are mimicking, rather than learning and understanding its content. Looking at the left pane of the figure above, the human reviewers preferred the results of the imitation model to those of ChatGPT. On closer inspection it was clear that the reviewers enjoyed the style of the imitation model but did not closely examine the content. The content produced by the imitation model tended to have weak factuality, leading the authors to conclude that "imitation models actually embody some of the worst aspects of AI assistants: their answers sound confident but are less factual than ChatGPT."

It's important to note that there are scenarios where imitation models can achieve great performance. The authors point out that imitation models can perform well on local tasks, or tasks that replicate a very specific behavior of the teacher model. On a task created for the study called NQ-Synthetic, the authors task the language model with generating 10 questions and answers related to a given context. Remarkably, the imitation model achieved a score close to that of GPT. This suggests that more specialized models could achieve favorable outcomes when attempting to mimic behaviors from a teacher model.

A fascinating corollary from the paper is that fine-tuning a model using a teacher model could actually help reduce the toxicity score of the imitation model. This could be extremely useful for companies that want to stand up an open source LLM quickly without undertaking the laborious task of building filters around its outputs. Instead of manually trying to build filters, companies could train on outputs from a carefully curated set of data from a teacher model to get a solid starting point.

It's worth mentioning the recent release of Orca, a model developed by Microsoft Research, which includes signals from GPT as part of its training data. The difference here is in the size of the training data used for the model. Orca is fine-tuned on 5 million examples, whereas the imitation model for broad coverage was tuned on roughly 151 thousand observations. Since I presume most of my audience will not be spending $16,000 to train an LLM as a casual experiment, I'm inclined to make statements that align more closely with the imitation modeling paper than with Orca. That said, we may have to wait for more research on the minimum number of examples required for imitation learning to emerge as a viable option for broader tasks.

Takeaway: Depending on the complexity of your task, attempting to imitate the outputs of GPT or another sophisticated model with a weaker model may result in poor model performance.

In-Context Learning, or Few-Shot Learning, is the process of including task-specific examples in the prompt. This approach is particular to sophisticated language models, since open source models have yet to achieve the flexibility required to handle In-Context Learning. Usually it is possible to achieve great results with this approach, but have you ever wondered why that is the case?

The answer to this question is explored in a paper by Dai et al. [3], where they examine the mathematical connections between loading examples into the prompt and fine-tuning on those same examples. The authors show that the prompt examples produce meta-gradients that are reflected during forward propagation at inference time. In the case of fine-tuning, the examples produce real gradients that are used to update the weights. Therefore, it seems that in-context learning achieves results similar to fine-tuning. For a more in-depth understanding of these findings, I would encourage reading the paper, which spares no detail in the mathematical connections.
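To give a flavor of the argument (my own loose paraphrase using the paper's relaxed, linear-attention view rather than the authors' exact notation): with demonstration tokens $X_{demo}$ placed before the regular context tokens $X$, the attention output for a query vector $q$ can be approximated as

$$F(q) \approx W_V [X_{demo}; X]\,(W_K [X_{demo}; X])^\top q = \underbrace{W_V X (W_K X)^\top}_{W_{ZSL}}\, q + \underbrace{W_V X_{demo} (W_K X_{demo})^\top}_{\Delta W_{ICL}}\, q,$$

so the demonstrations contribute a term $\Delta W_{ICL}\, q$ that acts like an implicit update applied on top of the zero-shot weights $W_{ZSL}$, mirroring the explicit weight update that the same examples would produce through fine-tuning.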

Although In-Context Learning is a great approach, there is a limitation that does not exist in fine-tuning. When we have a large corpus of training data, a fine-tuned model makes use of all of it by updating the model with real gradients during training. With In-Context Learning we can only provide a limited number of observations. So a question arises: Given a substantial training corpus, how can we use the most relevant examples for a given input to achieve the best results?

One way to tackle this issue is by selecting examples using a heuristic, and fortunately, LangChain provides support for this. LangChain is a Python module that essentially houses pre-built prompts and utilities that simplify working with language models. The LangChain tool we will concern ourselves with here is the ExampleSelector.

from typing import List, Union

def get_similarity(seq_a: str, seq_b: str) -> Union[float, int]:
    """
    Compute a similarity heuristic;
    here we use Jaccard similarity (IOU).

    seq_a: First sequence to compare
    seq_b: Second sequence to compare

    Returns:
        Similarity score (float or int)
    """
    # Tokenize on whitespace
    set_a = set(seq_a.split(' '))
    set_b = set(seq_b.split(' '))

    # Calculate IOU/Jaccard similarity
    return len(set_a.intersection(set_b)) / len(set_a.union(set_b))

def example_selector(examples: List[str], input: str, examples2use: int) -> List[str]:
    """
    Pseudo code for an example selector

    examples: List of training examples
    input: Target sequence to translate
    examples2use: Number of examples to use

    Returns:
        List of selected examples
    """
    # Score every example against the input, then keep the highest-scoring ones
    scores = [get_similarity(example, input) for example in examples]
    sorted_idx = [i for i, _ in sorted(enumerate(scores), key=lambda x: x[1], reverse=True)]
    return [examples[i] for i in sorted_idx[:examples2use]]

ExampleSelectors are a type of prompt manipulator that allows us to dynamically change which examples are used during inference. There are many heuristics that could be used. Above I wrote some pseudo code showing how a selector from LangChain essentially works, using Jaccard similarity between the input sequence and the example sequences. LangChain offers many more options, so check its documentation for details.
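As a quick usage sketch of the helpers above (the translation examples are invented; LangChain's real selectors plug into its prompt templates, but the mechanics are the same):

training_corpus = [
    "english: good morning | french: bonjour",
    "english: thank you very much | french: merci beaucoup",
    "english: where is the train station | french: où est la gare",
]
new_input = "english: where is the nearest station"

# Pick the 2 most similar examples and prepend them to the prompt
selected = example_selector(training_corpus, new_input, examples2use=2)
prompt = "Translate English to French.\n\n" + "\n".join(selected) + "\n" + new_input + " | french:"
print(prompt)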

There are two primary advantages to an approach like this. The first is that you allow your LLM to be data efficient by selectively choosing the most relevant examples for the given input, as opposed to statically loading the same few examples for all observations. The second benefit is cost savings compared with tuning through a managed service. As of writing, using a fine-tuned base Davinci model costs $0.12 per 1,000 tokens, while using instruct Davinci costs $0.02, six times less! These prices also do not include the cost of training.
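As a back-of-the-envelope illustration at those prices (the token counts below are made-up assumptions, and training cost is ignored):

# Quoted prices per 1,000 tokens at time of writing
FINE_TUNED_DAVINCI = 0.12
INSTRUCT_DAVINCI = 0.02

request_tokens = 500      # hypothetical prompt + completion for one call
example_tokens = 1_000    # extra tokens from dynamically selected examples

fine_tuned_cost = (request_tokens / 1000) * FINE_TUNED_DAVINCI                 # $0.06 per call
icl_cost = ((request_tokens + example_tokens) / 1000) * INSTRUCT_DAVINCI       # $0.03 per call

Even after paying for the extra example tokens on every request, the in-context approach comes out cheaper in this toy scenario.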

It's important to note that these prices are subject to change, as OpenAI is not yet using LoRA or Adapters, as revealed in a now-deleted blog post [5]. Nevertheless, fine-tuned models are still likely to be more expensive due to the need to maintain custom weights for individual users. This also doesn't account for the cost of the examples included in context. Your team will need to evaluate whether ICL or fine-tuning makes more sense for your task from both cost and accuracy standpoints.

Takeaway: In-Context Learning with dynamic example loading may achieve the same results as fine-tuning without the substantial additional costs that can come from a managed service.

Let's say you're trying to answer complex questions over long documents. This task fundamentally requires the language model to have a good mastery of language and understanding. This leads us to a question: What if we assist the language model in breaking down the reasoning process into subtasks, much like how a human would analyze a document and sequentially execute tasks?

Figure taken from Jörke et al. [4].

This is precisely what researchers from Microsoft set out to accomplish, and their answer to this problem is PEARL [4]. PEARL stands for Planning and Executing Actions for Reasoning over Long documents. The general framework is broken down into three steps:

  1. Action Mining: The language model is first prompted to read the documents and extract possible actions that could be used to answer domain-specific questions. To extract these actions, the language model is given a few example actions. An example of what an action could look like is included below.
  2. Plan Generation: After generating a set of task-specific actions, the LLM is asked to generate an ordered list of actions to execute, given a question and context. The LLM is provided some examples of plans for other tasks, which aids in constructing a quality plan. More details about the technicalities can be found in the paper.
  3. Plan Execution: The model now has the plan. We provide the inputs to the model and execute the plan step by step.
Example action taken from Jörke et al. [4].

There are some intermediary steps used to ensure quality between stages. The authors include a self-correction step which ensures the plan conforms to the required format, and a self-refinement step that determines whether the plan can be reused later as a few-shot example. A simplified sketch of how the stages chain together is shown below.
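In this sketch, llm() is a placeholder for whatever completion call you use, and the prompt strings are illustrative stand-ins, not the actual PEARL prompts:

def llm(prompt: str) -> str:
    """Placeholder for a call to your language model of choice."""
    raise NotImplementedError

def pearl_style_answer(document: str, question: str) -> str:
    # 1. Action mining: ask the model for reusable, domain-specific actions
    actions = llm(
        "Read the document and list actions (name, arguments, description) that "
        "would help answer questions about it:\n" + document
    )
    # 2. Plan generation: compose those actions into an ordered plan for this question
    plan = llm(
        "Available actions:\n" + actions + "\n\nQuestion: " + question +
        "\nWrite an ordered plan of action calls that answers the question."
    )
    # 3. Plan execution: run the plan step by step, feeding each result into the next
    result = document
    for step in plan.splitlines():
        if step.strip():
            result = llm("Execute this step.\nStep: " + step + "\nContext: " + result)
    return result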

Table taken from Jörke et al. [4].

In evaluation, PEARL demonstrated notable improvements over other GPT models, particularly when long documents were included. The key takeaway from this process is that in certain cases having multiple steps can significantly assist the model.

Another scenario where intermediate steps prove helpful is when the number of documents to be included in your context exceeds what the language model supports. As it currently stands, the attention mechanism used by OpenAI scales at O(n²), and there is no way around this yet [5]. This creates considerable interest in reducing the context to the most minimal form possible.

Depending on your task, there are ways to handle this. For instance, if your task entirely revolves around entities, there is an opportunity to extract the relevant entities and their related properties. You can think of this approach as a lossy compression that allows you to feed more context into the LLM. Another advantage of this intermediate step is that you have converted unstructured data into a structured format, which allows you to make informed decisions without the LLM. An example of this task is shown below in the figure from Fei et al. [6].

Figure taken from Fei et al. [6].
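A minimal sketch of this kind of intermediate step might look like the following: the LLM first emits a structured JSON summary of the entities, and only that compressed representation is carried forward. The prompt wording and schema are illustrative assumptions, not taken from the referenced paper, and llm() is again a placeholder completion call.

import json
from typing import Dict, List

def llm(prompt: str) -> str:
    """Placeholder for a call to your language model of choice."""
    raise NotImplementedError

def extract_entities(document: str) -> List[Dict]:
    """Lossy compression: keep only the entities and their properties."""
    response = llm(
        "Extract the entities in the document as a JSON list of objects with "
        "'name', 'type', and 'properties' fields.\n\nDocument:\n" + document
    )
    return json.loads(response)

def answer_from_entities(entities: List[Dict], question: str) -> str:
    # The structured summary is far shorter than the raw documents,
    # so many more of them fit inside a single context window.
    return llm("Entities:\n" + json.dumps(entities) + "\n\nQuestion: " + question)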

Takeaway: Breaking a task into smaller, subsequent problems can help simplify a larger problem into more manageable pieces. You can also use these smaller tasks to resolve bottlenecks related to model limitations.

These are some general ideas regarding what researchers are exploring at the new frontiers of LLM performance and efficiency. This is not an exhaustive list of everything to consider when fine-tuning a model, but it's a good starting point when considering the journey.

For further reading, this post from Hugging Face about training LLMs is quite interesting, and would be a great place to start when exploring imitation models on a local problem. Getting a concrete understanding of LangChain is also supremely helpful. While much of the library could be rewritten for your use case, the main benefit is that it's easier to keep up with the research if other people are writing the code for you!

Here are the takeaways again:

  1. Depending on the complexity of your task, attempting to imitate the outputs of GPT or another sophisticated model with a weaker model may result in poor model performance.
  2. In-Context Learning with dynamic example loading may achieve the same results as fine-tuning without the substantial additional costs that can come from a managed service.
  3. Breaking a task into smaller, subsequent problems can help simplify a larger problem into more manageable pieces. You can also use these smaller tasks to resolve bottlenecks related to model limitations.
