6 Common LLM Customization Strategies Briefly Explained

Why Customize LLMs?

Large Language Models (LLMs) are deep learning models pre-trained with self-supervised learning, requiring enormous amounts of training data and training time and holding numerous parameters. LLMs have revolutionized natural language processing, especially within the last two years, demonstrating remarkable capabilities in understanding and generating human-like text. Nonetheless, these general-purpose models' out-of-the-box performance may not always meet specific business needs or domain requirements. LLMs alone cannot answer questions that depend on proprietary company data or closed-book settings, making them relatively generic in their applications. Training an LLM from scratch is largely infeasible for small to medium teams due to the demand for massive amounts of training data and compute resources. Therefore, a wide range of LLM customization strategies have been developed in recent years to tune the models for scenarios that require specialized knowledge.

The customization strategies can be broadly split into two types:

  • Using a frozen model: These techniques don’t require updating model parameters and are typically accomplished through in-context learning or prompt engineering. They are cost-effective since they alter the model’s behavior without incurring extensive training costs, and are therefore widely explored in both industry and academia, with new research papers published daily.
  • Updating model parameters: This is a relatively resource-intensive approach that requires tuning a pre-trained LLM with custom datasets designed for the intended purpose. It includes popular techniques like Fine-Tuning and Reinforcement Learning from Human Feedback (RLHF).

These two broad customization paradigms branch out into various specialized techniques including LoRA fine-tuning, Chain of Thought, Retrieval Augmented Generation, ReAct, and Agent frameworks. Each technique offers distinct benefits and trade-offs regarding computational resources, implementation complexity, and performance improvements.

How to Select LLMs?

The first step in customizing LLMs is to select the appropriate foundation model as the baseline. Community-based platforms such as Hugging Face offer a wide range of open-source pre-trained models contributed by top companies or communities, such as the Llama series from Meta and Gemini from Google. Hugging Face additionally provides leaderboards, for example the “Open LLM Leaderboard”, to compare LLMs based on industry-standard metrics and tasks (e.g. MMLU). Cloud providers (e.g., AWS) and AI companies (e.g., OpenAI and Anthropic) also offer access to proprietary models, which are typically paid services with restricted access. The following factors are essential to consider when selecting LLMs.

Open source or proprietary model: Open source models allow full customization and self-hosting but require technical expertise, while proprietary models offer immediate access and often higher-quality responses but at higher costs.

Task and metrics: Models excel at different tasks, including question-answering, summarization, code generation, etc. Compare benchmark metrics and test on domain-specific tasks to determine the suitable models.

Architecture: Generally, decoder-only models (the GPT series) perform better at text generation, while encoder-decoder models (T5) handle translation well. More architectures are emerging and showing promising results, for instance the Mixture of Experts (MoE) model “DeepSeek”.

Number of Parameters and Size: Larger models (70B-175B parameters) offer better performance but need more computing power. Smaller models (7B-13B) run faster and cheaper but may have reduced capabilities.

After determining a base LLM, let’s explore the 6 most common strategies for LLM customization, ranked in order of resource consumption from the least to the most intensive:

  • Prompt Engineering
  • Decoding and Sampling Strategy
  • Retrieval Augmented Generation
  • Agent
  • Fine-Tuning
  • Reinforcement Learning from Human Feedback

If you’d prefer a video walkthrough of these concepts, please check out my video on “6 Common LLM Customization Strategies Briefly Explained”.

LLM Customization Techniques

1. Prompt Engineering

A prompt is the input text sent to an LLM to elicit an AI-generated response, and it can be composed of instructions, context, input data and an output indicator.

Instructions: This provides a task description or instruction for how the model should perform.

Context: This is external information to guide the model to respond within a certain scope.

Input data: This is the input for which you want a response.

Output indicator: This specifies the output type or format.

Prompt Engineering involves crafting these prompt components strategically to shape and control the model’s response. Basic prompt engineering techniques include zero-shot, one-shot, and few-shot prompting. Users can apply basic prompt engineering techniques directly while interacting with the LLM, making it an efficient approach to align the model’s behavior with a new objective. API implementation is also an option, and more details are introduced in my previous article “A Simple Pipeline for Integrating LLM Prompt with Knowledge Graph”.
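
As a minimal illustration, a few-shot prompt can combine all four components; the sentiment-classification task and example texts below are made up for this sketch:

# Instructions, context (few-shot examples), input data and output indicator
# assembled into a single prompt string.
instructions = "Classify the sentiment of the customer review."
context = (
    "Example: 'The delivery was fast and the item works great.' -> Positive\n"
    "Example: 'The package arrived damaged and support never replied.' -> Negative"
)
input_data = "Review: 'The product is fine, but setup took far too long.'"
output_indicator = "Answer with a single word: Positive, Negative, or Neutral."

prompt = "\n\n".join([instructions, context, input_data, output_indicator])
# prompt can now be sent to any LLM chat or completion API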

Due to the efficiency and effectiveness of prompt engineering, more complex approaches have been explored and developed to advance the logical structure of prompts.

Chain of Thought (CoT) asks LLMs to break down complex reasoning tasks into step-by-step thought processes, improving performance on multi-step problems. Each step explicitly exposes its reasoning outcome, which serves as the precursor context for subsequent steps until arriving at the answer.
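
In its zero-shot form, CoT can be as simple as appending a reasoning instruction to the question; a minimal sketch (the arithmetic word problem is made up for illustration):

# Zero-shot Chain of Thought: ask the model to expose its intermediate reasoning.
question = (
    "A warehouse has 3 shelves with 12 boxes each, and 7 boxes are shipped out. "
    "How many boxes remain?"
)
cot_prompt = question + "\n\nLet's think step by step, then state the final answer."
# The model is expected to write out the intermediate steps (3 * 12 = 36; 36 - 7 = 29)
# before giving the final answer, which typically improves multi-step accuracy.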

Tree of Thoughts extends CoT by considering multiple different reasoning branches and self-evaluating choices to select the next best action. It is more effective for tasks that involve initial decisions, strategies for the future and exploration of multiple solutions.

Automatic reasoning and tool use (ART) builds upon the CoT process; it deconstructs complex tasks and allows the model to select few-shot examples from a task library and use predefined external tools like search and code generation.

Synergizing reasoning and acting (ReAct) combines reasoning trajectories with an action space, where the model searches through the action space and determines the next best action based on environmental observations.

Techniques like CoT and ReAct are often combined with an Agentic workflow to strengthen their capabilities. These techniques will be introduced in more detail in the “Agent” section.

Further Reading

2. Decoding and Sampling Strategy

Decoding strategy can be controlled at model inference time through inference parameters (e.g. temperature, top p, top k), determining the randomness and diversity of model responses. Greedy search, beam search and sampling are three common decoding strategies for auto-regressive model generation.

During the autoregressive generation process, the LLM outputs one token at a time based on a probability distribution of candidate tokens conditioned on the previous tokens. By default, greedy search is applied to produce the next token with the highest probability.

In contrast, beam search decoding considers multiple hypotheses of next-best tokens and selects the hypothesis with the highest combined probability across all tokens in the text sequence. The code snippet below uses the transformers library to specify the number of beam paths (e.g. num_beams=5 considers 5 distinct hypotheses) during the model generation process.

from transformers import AutoModelForCausalLM, AutoTokenizer

# tokenizer_name, model_name and prompt are placeholders for your chosen checkpoint and input text
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(model_name)
# num_beams=5 keeps 5 candidate sequences and returns the one with the highest overall score
outputs = model.generate(**inputs, num_beams=5)

Sampling strategy is the third approach to control the randomness of model responses by adjusting these inference parameters:

  • Temperature: Lowering the temperature makes the probability distribution sharper by increasing the likelihood of generating high-probability words and decreasing the likelihood of generating low-probability words. When temperature = 0, it becomes equivalent to greedy search (least creative); when temperature = 1, it samples from the model’s original distribution and produces the most creative outputs.
  • Top K sampling: This method filters the K most probable next tokens and redistributes the probability among those tokens. The model then samples from this filtered set of tokens.
  • Top P sampling: Instead of sampling from the K most probable tokens, top-p sampling selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p.

The example code snippet below samples from the top 50 most likely tokens (top_k=50) with a cumulative probability threshold of 0.95 (top_p=0.95):

# model_inputs is the tokenized prompt, as in the earlier snippets
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,          # enable sampling instead of greedy/beam search
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,  # return three independently sampled completions
)
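
Temperature is controlled through the same generate() interface; a self-contained sketch (the checkpoint and prompt below are placeholders) contrasting a low and a high temperature setting:

from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder checkpoint for illustration; any causal LM exposes the same arguments
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model_inputs = tokenizer("Once upon a time", return_tensors="pt")

# a lower temperature sharpens the distribution (more focused, deterministic output) ...
focused = model.generate(**model_inputs, do_sample=True, temperature=0.2, max_new_tokens=40)
# ... while a higher temperature flattens it (more diverse, creative output)
creative = model.generate(**model_inputs, do_sample=True, temperature=1.0, max_new_tokens=40)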

Further Reading

3. RAG

Retrieval Augmented Generation (or RAG), initially introduced in the paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, has been demonstrated as a promising solution that integrates external knowledge and reduces common LLM “hallucination” issues when handling domain-specific or specialized queries. RAG allows dynamically pulling relevant information from the knowledge domain and generally doesn’t involve extensive training to update LLM parameters, making it a cost-effective strategy to adapt a general-purpose LLM for a specialized domain.

A RAG system can be decomposed into a retrieval and a generation stage. The objective of the retrieval process is to find content within the knowledge base that is closely related to the user query, by chunking external knowledge, creating embeddings, indexing and similarity search; a minimal sketch of these steps follows the list below.

  1. Chunking: Documents are divided into smaller segments, with each segment containing a distinct unit of information.
  2. Create embeddings: An embedding model compresses each information chunk into a vector representation. The user query is also converted into its vector representation through the same vectorization process, so that the user query can be compared in the same dimensional space.
  3. Indexing: This process stores the text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search functionality. For large external knowledge bases that exceed memory capacity, vector databases offer efficient long-term storage.
  4. Similarity search: Similarity scores between the query embedding and the text chunk embeddings are calculated and used to retrieve information highly relevant to the user query.
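
A minimal sketch of these retrieval steps, assuming the sentence-transformers library; the chunk texts are made up and the embedding model name mirrors the one used in the LlamaIndex snippet further below:

import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Chunking: the documents have already been split into small text segments (toy examples)
chunks = [
    "Prompt engineering changes model behavior without any training.",
    "RAG retrieves external knowledge to ground the model's answers.",
    "Fine-tuning updates model weights using specialized datasets.",
]

# 2. Create embeddings: compress each chunk and the query into vectors in the same space
embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
chunk_embeddings = embed_model.encode(chunks, normalize_embeddings=True)

# 3. Indexing: here a simple in-memory array; a vector database would be used at scale
query = "How can I adapt an LLM to my company's documents without training it?"
query_embedding = embed_model.encode([query], normalize_embeddings=True)[0]

# 4. Similarity search: cosine similarity reduces to a dot product on normalized vectors
scores = np.dot(chunk_embeddings, query_embedding)
top_chunk = chunks[int(np.argmax(scores))]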

The generation process of the RAG system then combines the retrieved information with the user query to form the augmented query, which is passed to the LLM to generate the context-rich response.
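
In its simplest form, the augmentation step is just a prompt template; a brief sketch continuing the toy retrieval example above (top_chunk and query come from that sketch):

# build the augmented query: retrieved context + original user question
augmented_query = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{top_chunk}\n\n"
    f"Question: {query}"
)
# augmented_query is then sent to the LLM to generate a context-rich response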

Code Snippet

The code snippet below first specifies the LLM and embedding model, then combines the external knowledge base documents into a single Document, creates an index from the document, defines the query_engine based on the index, and queries the query_engine with the user prompt.

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, Document, Settings, SimpleDirectoryReader

# specify the LLM and the embedding model
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# load the external knowledge base and merge it into a single Document ("data" is a placeholder folder)
documents = SimpleDirectoryReader("data").load_data()
document = Document(text="\n\n".join([doc.text for doc in documents]))

# index the document and build a query engine on top of the index
index = VectorStoreIndex.from_documents([document])
query_engine = index.as_query_engine()
response = query_engine.query(
    "Tell me about LLM customization strategies."
)

The example above shows a simple RAG system. Advanced RAG improves on this by introducing pre-retrieval and post-retrieval strategies to reduce pitfalls such as limited synergy between the retrieval and generation processes. For instance, the rerank technique reorders the retrieved information using a model capable of understanding bidirectional context, and integration with a knowledge graph enables advanced query routing. More use cases can be found on the LlamaIndex website.
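
As a small illustration of post-retrieval reranking, a cross-encoder can rescore retrieved chunks against the query; the model name and texts below are assumptions for this sketch:

from sentence_transformers import CrossEncoder

# a cross-encoder reads the query and chunk together, capturing bidirectional context between them
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How can I adapt an LLM without retraining it?"
retrieved_chunks = [
    "Fine-tuning updates model weights using specialized datasets.",
    "RAG retrieves external knowledge to ground the model's answers.",
]

# score each (query, chunk) pair and reorder chunks from most to least relevant
scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
reranked = [chunk for _, chunk in sorted(zip(scores, retrieved_chunks), reverse=True)]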

Further Reading

4. Agent

LLM Agent was a trending topic in 2024 and will likely remain a main focus in the GenAI field in 2025. Compared to RAG, an Agent excels at creating query routes and planning LLM-based workflows, with the following advantages:

  • Maintaining memory and state of previous model-generated responses.
  • Leveraging various tools based on specific criteria. This tool-using capability sets agents apart from basic RAG systems by giving the LLM independent control over tool selection.
  • Breaking down a complex task into smaller steps and planning a sequence of actions.
  • Collaborating with other agents to form an orchestrated system.

Several in-context learning techniques (e.g. CoT, ReAct) can be implemented through the Agentic framework, and we will discuss ReAct in more detail. ReAct, which stands for “Synergizing Reasoning and Acting in Language Models”, consists of three key elements: actions, thoughts and observations. This framework was introduced by researchers from Google Research and Princeton University, building upon Chain of Thought by integrating the reasoning steps with an action space that enables tool use and function calling. In addition, the ReAct framework emphasizes determining the next best action based on environmental observations.

This example from the original paper demonstrates ReAct’s inner working process, where the LLM generates the first thought and acts by calling the function “Search [Apple Remote]”, then observes the feedback from its first output. The second thought is then based on the previous observation, hence leading to a different action “Search [Front Row]”. This process iterates until reaching the goal. The research shows that ReAct overcomes the prevalent problems of hallucination and error propagation, which are more often observed in chain-of-thought reasoning, by interacting with a simple Wikipedia API. Moreover, through the implementation of decision traces, the ReAct framework also increases the model’s interpretability, trustworthiness and diagnosability.

Code Snippet

This demonstrates a ReAct-based agent implementation using LlamaIndex. First, it defines two functions (multiply and add). Second, these two functions are wrapped as FunctionTool objects, forming the Agent’s action space, to be executed based on its reasoning.

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

# create basic function tools that form the agent's action space
def multiply(a: float, b: float) -> float:
    return a * b
multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: float, b: float) -> float:
    return a + b
add_tool = FunctionTool.from_defaults(fn=add)

# the LLM drives the thought-action-observation loop (gpt-3.5-turbo is a placeholder choice)
llm = OpenAI(model="gpt-3.5-turbo")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)

# the agent reasons about which tool to call and in what order
response = agent.chat("What is 20 + (2 * 4)?")

The benefits of an Agentic Workflow are more substantial when combined with self-reflection or self-correction. It is a rapidly growing domain with a variety of Agent architectures being explored. For instance, the Reflexion framework facilitates iterative learning by providing a summary of verbal feedback from the environment and storing the feedback in the model’s memory; the CRITIC framework empowers frozen LLMs to self-verify by interacting with external tools such as code interpreters and API calls.

Further Reading

5. Fine-Tuning

Fine-tuning is the process of feeding niche and specialized datasets to modify the LLM so that it is more aligned with a certain objective. It differs from prompt engineering and RAG as it enables updates to the LLM weights and parameters. Full fine-tuning refers to updating all weights of the pretrained LLM through backpropagation, which requires large memory to store all weights and parameters and may suffer from a significant reduction in ability on other tasks (i.e. catastrophic forgetting). Therefore, PEFT (parameter-efficient fine-tuning) is more widely used to mitigate these caveats while saving the time and cost of model training. There are three categories of PEFT methods:

  • Selective: Fine-tune a subset of the initial LLM parameters, which can be more computationally intensive compared to other PEFT methods.
  • Reparameterization: Adjust model weights by training the weights of low-rank representations. For example, Low-Rank Adaptation (LoRA) falls into this category; it accelerates fine-tuning by representing the weight updates with two smaller matrices (see the configuration sketch after this list).
  • Additive: Add additional trainable layers to the model, including techniques like adapters and soft prompts.
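
A minimal sketch of a LoRA setup with the peft library; the base checkpoint, rank and target modules are illustrative assumptions:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# placeholder base checkpoint; any causal LM supported by peft works similarly
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA represents each weight update as the product of two small matrices of rank r
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    target_modules=["c_attn"],  # attention projection layers in GPT-2
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the small LoRA matrices are trainable

The resulting peft_model can then be trained with the Trainer shown in the snippet below.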

The fine-tuning process is similar to a standard deep learning training process, requiring the following inputs:

  • training and evaluation datasets
  • training arguments that define the hyperparameters, e.g. learning rate and optimizer
  • a pretrained LLM model
  • compute metrics and objective functions that the algorithm should be optimized for

Code Snippet

Below is an example of implementing fine-tuning using the transformers Trainer.

from transformers import TrainingArguments, Trainer

# output_dir, model, train_dataset, eval_dataset and compute_metrics
# are assumed to be defined earlier for the task at hand
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Fine-tuning has a wide range of use cases. For instance, instruction fine-tuning optimizes LLMs for conversations and following instructions by training them on prompt-completion pairs. Another example is domain adaptation, an unsupervised fine-tuning method that helps LLMs specialize in specific knowledge domains.
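
For illustration, an instruction fine-tuning record is typically a prompt-completion pair like the following (the content is made up for this sketch):

# a made-up prompt-completion pair for instruction fine-tuning
instruction_example = {
    "prompt": (
        "Summarize the following support ticket in one sentence:\n"
        "Customer reports the mobile app crashes when uploading photos larger than 10 MB."
    ),
    "completion": "The mobile app crashes when the customer uploads photos over 10 MB.",
}
# thousands of such pairs are used to teach the model to follow instructions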

Further Reading

6. RLHF

Reinforcement learning from human feedback, or RLHF, is a reinforcement learning technique that fine-tunes LLMs based on human preferences. RLHF operates by training a reward model based on human feedback and using this model as a reward function to optimize a reinforcement learning policy through PPO (Proximal Policy Optimization). The process requires two sets of training data: a preference dataset for training the reward model, and a prompt dataset used in the reinforcement learning loop.

Let’s break it down into steps:

  1. Gather a preference dataset annotated by human labelers who rate different completions generated by the model based on human preference. An example format of the preference dataset is {input_text, candidate1, candidate2, human_preference}, indicating which candidate response is preferred.
  2. Train a reward model using the preference dataset; the reward model is essentially a regression model that outputs a scalar indicating the quality of the model-generated response. The objective of the reward model is to maximize the score gap between the winning candidate and the losing candidate (a minimal sketch of this pairwise objective follows the list).
  3. Use the reward model in a reinforcement learning loop to fine-tune the LLM. The objective is to update the policy so that the LLM generates responses that maximize the reward produced by the reward model. This process utilizes the prompt dataset, which is a collection of prompts in the format of {prompt, response, rewards}.
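
A sketch of this pairwise objective in plain PyTorch; reward_model here is a hypothetical module that maps a tokenized response to a scalar score:

import torch.nn.functional as F

def reward_pairwise_loss(reward_model, winning_inputs, losing_inputs):
    """Bradley-Terry style loss: push the winning score above the losing score."""
    score_win = reward_model(winning_inputs)    # scalar reward for the preferred response
    score_lose = reward_model(losing_inputs)    # scalar reward for the rejected response
    # maximizing the gap is equivalent to minimizing -log(sigmoid(score_win - score_lose))
    return -F.logsigmoid(score_win - score_lose).mean()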

Code Snippet

The open-source library TRL (Transformer Reinforcement Learning) is widely applied in implementing RLHF, and it provides template code that shows the basic RLHF setup:

  1. Initialize the base model and tokenizer from a pretrained checkpoint.
  2. Configure the PPO hyperparameters in PPOConfig, such as learning rate, epochs, and batch sizes.
  3. Create the PPO trainer, PPOTrainer, by combining the model, tokenizer, and training data.
  4. The training loop uses the step() method to iteratively update the model to optimize the rewards calculated from the query and model response.

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

# define the hyperparameters of the PPO algorithm
# (model_name, learning_rate, max_ppo_epochs, mini_batch_size and batch_size
# are assumed to be defined for the specific use case)
config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

# initiate the pretrained model (with a value head for PPO) and the tokenizer
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# initiate the PPO trainer with the model, tokenizer and training dataset
# (dataset and collator are assumed to be prepared beforehand)
ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    tokenizer=tokenizer,
    dataset=dataset["train"],
    data_collator=collator
)

# ppo_trainer is iteratively updated through the rewards within the training loop
ppo_trainer.step(query_tensors, response_tensors, rewards)

RLHF is widely applied for aligning model responses with human preference. Common use cases involve reducing response toxicity and model hallucination. However, it does have the downside of requiring a large amount of human-annotated data as well as the computation costs associated with policy optimization. Therefore, alternatives like Reinforcement Learning from AI Feedback (RLAIF) and Direct Preference Optimization (DPO) have been introduced to mitigate these limitations.

Further Reading

Take-Home Message

This article briefly explains six essential LLM customization strategies: prompt engineering, decoding strategy, RAG, Agent, fine-tuning, and RLHF. I hope you find it helpful for understanding the pros and cons of each strategy, as well as how to implement them based on the practical examples.
