Context engineering is the science of providing LLMs with the right context to maximize performance. When you work with LLMs, you typically create a system prompt asking the model to perform a certain task. When working with LLMs from a programmer's perspective, however, there are more elements to consider: you have to figure out what other data you can feed the model to improve its ability to perform the task you have given it.
In this article, I'll discuss the science of context engineering and how you can apply context engineering techniques to improve your LLM's performance.
You can also read my articles on Reliability for LLM Applications and Document QA using Multimodal LLMs.
Definition
Before I start, it's important to define the term context engineering. Context engineering is essentially the science of deciding what to feed into your LLM. This can, for example, be:
- The system prompt, which tells the LLM how to act
- Document data fetched using RAG vector search
- Few-shot examples
- Tools
The closest previous description of this has been the term prompt engineering. However, prompt engineering is a less descriptive term, since it implies only changing the system prompt you are feeding to the LLM. To get maximum performance out of your LLM, you have to consider all of the context you feed into it, not only the system prompt.
Motivation
My initial motivation for this article came from reading this Tweet by Andrej Karpathy.
I strongly agreed with the point Andrej made in this tweet. Prompt engineering is definitely an important discipline when working with LLMs. However, prompt engineering doesn't cover everything we input into LLMs. In addition to the system prompt you write, you also have to consider elements such as:
- What data you should insert into your prompt
- How you fetch that data
- How to provide only relevant information to the LLM
- Etc.
I'll discuss all of these points throughout this article.
API vs Console usage
One important distinction to clarify is whether you are using an LLM through an API (calling it with code) or via a console (for example, the ChatGPT website or app). Context engineering certainly matters when working with LLMs through the console; however, my focus in this article will be on API usage. The reason is that an API gives you more options for dynamically changing the context you feed the LLM. For example, you can do RAG, where you first perform a vector search and only feed the LLM the most important pieces of information, rather than the entire database.
These dynamic changes are not available in the same way when interacting with LLMs through the console; thus, I'll focus on using LLMs through an API.
Context engineering techniques
Zero-shot prompting
Zero-shot prompting is the baseline for context engineering. Performing a task zero-shot means the LLM is performing a task it hasn't seen examples of before: you essentially provide only a task description as context. For example, you might give an LLM a long text and ask it to classify the text into class A or class B, according to some definition of the classes. The context (prompt) you feed the LLM could look something like this:
You are an expert text classifier, tasked with classifying texts into
class A or class B.
- Class A: The text contains a positive sentiment
- Class B: The text contains a negative sentiment
Classify the text: {text}
Depending on the task, this can work very well. LLMs are generalists and can handle most simple text-based tasks. Classifying a text into one of two classes is usually a simple task, so zero-shot prompting will typically work quite well.
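As a minimal sketch of what a zero-shot classification call can look like through an API (assuming the OpenAI Python client and a gpt-4o model; any chat-completion API follows the same pattern):

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

ZERO_SHOT_PROMPT = """You are an expert text classifier, tasked with classifying texts into
class A or class B.
- Class A: The text contains a positive sentiment
- Class B: The text contains a negative sentiment
Classify the text: {text}"""

def classify_zero_shot(text: str) -> str:
    # The only context given to the model is the task description itself, no examples
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any chat model works here
        messages=[{"role": "user", "content": ZERO_SHOT_PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content

print(classify_zero_shot("I loved this movie, it was fantastic!"))  # expected: Class A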
Few-shot prompting
(Infographic: how to perform few-shot prompting.)
The natural follow-up to zero-shot prompting is few-shot prompting. With few-shot prompting, you provide the LLM with a prompt similar to the one above, but you also give it examples of the task it should perform. This added context helps the LLM perform the task better. Building on the prompt above, a few-shot prompt could look like:
You are an expert text classifier, tasked with classifying texts into
class A or class B.
- Class A: The text contains a positive sentiment
- Class B: The text contains a negative sentiment
{text 1} -> Class A
{text 2} -> Class B
Classify the text: {text}
You can see that I have provided the model with a few example texts ({text 1} and {text 2}) along with their corresponding labels.
Few-shot prompting works well because you are showing the model examples of the task you are asking it to perform. This usually increases performance.
You can imagine this works well for humans, too. If you ask a human to do a task they have never done before, just by describing it, they might perform decently (depending, of course, on the difficulty of the task). However, if you also provide them with examples, their performance will usually improve.
Overall, I find it useful to think about LLM prompts as if I were asking a human to perform the task. Imagine that instead of prompting an LLM, you hand the text to a human, and ask yourself:
Given this prompt, and no other context, will the human be able to perform the task?
If the answer is no, you should work on clarifying and improving your prompt.
I also want to mention dynamic few-shot prompting, since it's a technique I've had a lot of success with. Traditionally, with few-shot prompting, you have a fixed list of examples that you feed into every prompt. However, you can often achieve better performance using dynamic few-shot prompting.
Dynamic few-shot prompting means choosing the few-shot examples dynamically when creating the prompt for a task. For example, suppose you are asked to classify a text into class A or class B, and you already have a list of 200 texts and their corresponding labels. You can then perform a similarity search between the new text you are classifying and the example texts you already have: measure the vector similarity and select only the most similar texts (out of the 200) to feed into your prompt as context. This way, you provide the model with more relevant examples of how to perform the task.
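A minimal sketch of how the example selection could work, assuming an OpenAI embedding model (text-embedding-3-small is an assumption; any embedding model would do) and a small labeled dataset:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Embed a batch of texts into vectors
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Assumed to exist already: a labeled dataset of ~200 example texts
examples = [
    ("I loved this product, it exceeded my expectations", "Class A"),
    ("Terrible experience, I want my money back", "Class B"),
    # ... the rest of your labeled examples
]
example_vectors = embed([text for text, _ in examples])

def select_few_shot_examples(new_text: str, k: int = 5) -> list[tuple[str, str]]:
    # Cosine similarity between the new text and every stored example
    query = embed([new_text])[0]
    sims = example_vectors @ query / (
        np.linalg.norm(example_vectors, axis=1) * np.linalg.norm(query)
    )
    top = np.argsort(sims)[::-1][:k]
    # These (text, label) pairs are then formatted into the prompt as "{text} -> {label}"
    return [examples[i] for i in top]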
RAG
Retrieval-augmented generation (RAG) is a well-known technique for extending the knowledge available to an LLM. Assume you already have a database consisting of thousands of documents. You now receive a question from a user and have to answer it using the information in your database.
Unfortunately, you can't feed the entire database into the LLM. Even though we have LLMs such as Llama 4 Scout with a 10-million-token context window, databases are often much larger than that. You therefore have to find the most relevant information in the database to feed into your LLM. RAG does this similarly to dynamic few-shot prompting:
- Perform a vector search
- Find the documents most similar to the user query (the most similar documents are assumed to be the most relevant)
- Ask the LLM to answer the query, given the most similar documents
By performing RAG, you are doing context engineering: you provide the LLM with only the most relevant data for its task. To improve the LLM's performance, you can work on the context engineering by improving your RAG search, for example by tuning it so that it returns only the most relevant documents.
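A minimal sketch of this flow, reusing the hypothetical client and embed() helper from the dynamic few-shot sketch above (in practice, document embeddings would be precomputed and stored in a vector database rather than embedded per query):

def answer_with_rag(question: str, documents: list[str], k: int = 3) -> str:
    # 1) Perform a vector search: embed the question and all documents
    doc_vectors = embed(documents)
    query_vector = embed([question])[0]

    # 2) Find the most similar documents, assumed to be the most relevant
    sims = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    top_docs = [documents[i] for i in np.argsort(sims)[::-1][:k]]

    # 3) Ask the LLM to answer the question, given only the retrieved documents
    context = "\n\n".join(top_docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content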
You can read more about RAG in my article about developing a RAG system for your own data.
Tools (MCP)
You can also provide the LLM with tools to call, which is an important part of context engineering, especially now that we are seeing the rise of AI agents. Tool calling today is often done using the Model Context Protocol (MCP), a concept introduced by Anthropic.
AI agents are LLMs capable of calling tools and thus performing actions. An example of this could be a weather agent. If you ask an LLM without access to tools about the weather in New York, it will not be able to provide an accurate response, simply because weather information has to be fetched in real time. To do this, you can, for example, give the LLM a tool such as:
@tool
def get_weather(city):
    # code to retrieve the current weather for a city
    return weather
If you give the LLM access to this tool and ask it about the weather, it can then look up the current weather for a city and give you an accurate response.
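As a sketch of how this could be wired up, here assuming LangChain's tool decorator and ChatOpenAI (an assumption; most LLM SDKs support the same pattern):

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    # Placeholder: in a real agent this would call a weather API
    return f"Sunny, 25 degrees Celsius in {city}"

# Bind the tool to the model so it knows the tool exists and how to call it
llm_with_tools = ChatOpenAI(model="gpt-4o").bind_tools([get_weather])

response = llm_with_tools.invoke("What is the weather in New York right now?")
print(response.tool_calls)  # the model requests get_weather(city="New York")

Note that the model only requests the tool call; your code executes get_weather and passes the result back to the model as additional context, which is exactly the context engineering step.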
Providing tools to LLMs is incredibly important, as it significantly expands what the LLM can do. Other examples of tools are:
- Search the web
- A calculator
- Search via Twitter API
Topics to consider
In this section, I make a few notes on what you should think about when creating the context to feed into your LLM.
Utilization of context length
The context length of an LLM is an important consideration. As of July 2025, you can feed most frontier LLMs over 100,000 input tokens. This gives you a lot of options for how to utilize that context, but it also forces you to consider the tradeoff between:
- Including a lot of information in a prompt, and thus risking that some of it gets lost in the context
- Leaving important information out of the prompt, and thus risking that the LLM lacks the context required to perform a specific task
Usually, the only way to find the right balance is to test your LLM's performance. For example, with a classification task, you can check the accuracy given different prompts.
If I find the context too long for the LLM to work with effectively, I sometimes split the task into several prompts: for example, one prompt that summarizes a text, and a second prompt that classifies the summary. This can help the LLM use its context effectively and thus increase performance.
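A rough sketch of what that splitting can look like, reusing the hypothetical OpenAI client from the earlier examples (the prompts and model name are illustrative):

def summarize(text: str) -> str:
    # First prompt: compress the long text into a short summary
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize the following text in 3-4 sentences:\n\n{text}"}],
    )
    return response.choices[0].message.content

def classify_summary(summary: str) -> str:
    # Second prompt: classify the much shorter summary instead of the full text
    prompt = (
        "Classify the following summary as Class A (positive sentiment) "
        f"or Class B (negative sentiment):\n\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

label = classify_summary(summarize(long_text))  # long_text: the original document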
Additionally, providing too much context to the model has a significant downside of its own, which I describe in the next section.
Context rot
Last week, I read an interesting article about context rot. The article described how increasing the context length lowers LLM performance, even when the task difficulty does not increase. This implies that:
Providing an LLM with irrelevant information will decrease its ability to perform tasks successfully, even when the task difficulty does not increase
The point here is essentially that you should only provide relevant information to your LLM. Providing other information decreases LLM performance (i.e., causes context rot).
Conclusion
In this article, I have discussed context engineering: the practice of providing an LLM with the right context to perform its task effectively. There are many techniques you can use to fill the context, such as few-shot prompting, RAG, and tools, all of which can significantly improve an LLM's ability to perform a task. However, you also have to consider that providing an LLM with too much context has downsides, since increasing the number of input tokens can reduce performance, as described in the section on context rot.
👉 Follow me on socials:
🧑💻 Get in contact
🔗 LinkedIn
🐦 X / Twitter
✍️ Medium
🧵 Threads