Keep MCPs Useful in Agentic Pipelines

-

Intro

Applications powered by Large Language Models (LLMs) require integration with external services: for example, Google Calendar to set up meetings, or PostgreSQL to access data.

Function calling

Initially these kinds of integrations were implemented through function calling: we defined special functions that the LLM could invoke by generating specific tokens following patterns we defined, which we then parsed and executed. To make this work we had to implement authorization and API-calling logic for every tool, manage all of the instructions telling the model when and how to call these tools, and build the internal logic of those functions, including default or user-specific parameters. However, the hype around "AI" demanded fast, sometimes brute-force solutions to keep up the pace, and that is where MCPs, introduced by Anthropic, came in.
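For context, here is a minimal sketch of that pre-MCP pattern, assuming an OpenAI-style chat completions API; the get_weather tool, its schema and its implementation are hypothetical examples, not taken from a real integration.

```python
# Minimal sketch of "manual" function calling with an OpenAI-style API.
# The get_weather tool and its wiring are hypothetical examples.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    # In a real system: authorization, the actual API call, error handling, defaults.
    return json.dumps({"city": city, "temp_c": 21})

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# Parse the tool call the model generated and execute it ourselves.
for call in response.choices[0].message.tool_calls or []:
    if call.function.name == "get_weather":
        args = json.loads(call.function.arguments)
        print(get_weather(**args))
```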

MCPs

MCP stands for Model Context Protocol, and today it is the standard way of providing tools to the vast majority of agentic pipelines. MCPs mainly manage each integration's functions and the LLM instructions for using those tools. At this point some may argue that Skills and Code Execution, also introduced by Anthropic, have already killed MCPs, but in practice these features are likely to use MCPs for integration and instruction management as well (see "Code execution with MCP", Anthropic). Skills and Code Execution focus on context management and tool orchestration, which is a different problem from the one MCPs address.

MCPs provide a standard way to integrate different services (tools) with LLMs, and they also provide the instructions LLMs use to call those tools. However, there are a couple of problems:

  1. The current Model Context Protocol assumes that all tool-calling parameters are exposed to the LLM and that all of their values are generated by the LLM. For example, this means the LLM has to generate a user id value if the function call requires one. That is overhead: the system (the application) already knows the user id without the LLM having to generate it, and to make the LLM aware of the user id we also have to put it into the prompt. (FastMCP from gofastmcp has a "hiding arguments" approach focused specifically on this problem, but I have not seen it in the original MCP implementation from Anthropic; see the sketch after this list.)
  2. No out-of-the-box control over instructions. MCPs provide a description for each tool and for each of a tool's arguments, and these values are used blindly in agentic pipelines as LLM API call parameters. The descriptions are written by each individual MCP server developer.
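To illustrate the first problem, here is a minimal sketch of the "hiding arguments" idea in plain Python, not FastMCP's actual API: the application strips a parameter such as user_id from the schema the LLM sees and injects its value at call time. The helper names (hide_argument, call_with_injected_args) are hypothetical.

```python
# Sketch of the "hiding arguments" idea: remove a parameter from the schema
# shown to the LLM and inject its value at call time. Hypothetical helpers,
# not FastMCP's actual API.
import copy

def hide_argument(tool_schema: dict, arg_name: str) -> dict:
    """Return a copy of an OpenAI-style tool schema without the hidden argument."""
    visible = copy.deepcopy(tool_schema)
    params = visible["function"]["parameters"]
    params["properties"].pop(arg_name, None)
    if arg_name in params.get("required", []):
        params["required"].remove(arg_name)
    return visible

def call_with_injected_args(tool_fn, llm_args: dict, hidden_args: dict):
    """Merge LLM-generated arguments with values the application already knows."""
    return tool_fn(**{**llm_args, **hidden_args})
```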

When calling LLMs you usually pass the tools to the LLM call as an API parameter. The value of this parameter is retrieved from the MCP server's list_tools function, which returns a JSON schema for the tools it exposes.
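As a sketch of that flow, the snippet below uses the official Python MCP SDK to fetch tool schemas over stdio and convert them into the tools parameter of an OpenAI-style call; the server command is a placeholder and SDK details may differ between versions.

```python
# Sketch: fetch tool schemas from an MCP server over stdio and convert them
# into the "tools" parameter of an OpenAI-style API call.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def get_llm_tools(command: str, args: list[str]) -> list[dict]:
    params = StdioServerParameters(command=command, args=args)
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            listed = await session.list_tools()
            # Each MCP tool carries a name, a description and a JSON schema
            # for its arguments; these go straight into the LLM call.
            return [{
                "type": "function",
                "function": {
                    "name": t.name,
                    "description": t.description or "",
                    "parameters": t.inputSchema,
                },
            } for t in listed.tools]

# Placeholder server command, for illustration only:
# tools = asyncio.run(get_llm_tools("npx", ["-y", "@some/mcp-server"]))
```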

At the same time, this "tools" parameter is used to put additional information into the model's system prompt. For example, the Qwen3-VL model has a chat_template that manages the insertion of tools into the system prompt the following way:

“...You are provided with function signatures within <tools></tools> XML tags:\n" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}...”

So the tool descriptions end up in the system prompt of the LLM you are calling.
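A quick way to see this is to render a chat template locally with transformers. The snippet below is a sketch: the model id is assumed, and the exact rendered text depends on the model's chat_template.

```python
# Sketch: how tool schemas end up in the prompt of an open-weights model.
# Model id assumed; the rendered text depends on the model's chat_template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-8B-Instruct")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # The tool JSON appears inside <tools>...</tools> in the system prompt.
```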

The first problem is indeed partially solved by the mentioned "hiding arguments" approach from FastMCP, but I have still seen solutions where values like "user id" were pushed into the model's system prompt to be used in the tool call: it is just faster and much simpler from the engineering standpoint (no engineering is required to put it into the system prompt and rely on the LLM to use it). So here I focus on the second problem.

At the same time, I am leaving aside the problems related to the tons of garbage MCPs on the market: some of them do not work, and some have generated tool descriptions that can confuse the model. The problem I focus on here is non-standardised tool and parameter descriptions, which can be the reason why LLMs misbehave with some tools.

Instead of a conclusion for the introduction part:

If your agentic LLM-powered pipeline fails with the tools you have, you can:

  1. Just select a more powerful, modern and expensive LLM API;
  2. Revisit your tools and the instructions overall.

Both can work. Make your own decision or ask your AI assistant to choose for you…

Formal part of the work: research

1. Examples of different descriptions

Searching through real MCPs on the market and checking their tool lists and descriptions, I could find many examples of the mentioned issue. Here I provide only a single example from each of two MCPs that also come from different domains (in real-life cases the list of MCPs a model uses tends to span different domains):

Tool description: “Generate an area chart to show data trends under continuous independent variables and observe the overall data trend, such as, displacement = velocity (average or instantaneous) × time: s = v × t. If the x-axis is time (t) and the y-axis is velocity (v) at each moment, an area chart allows you to observe the trend of velocity over time and infer the distance traveled by the area’s size.”,

“Data” property description: “Data for area chart, it should be an array of objects, each object contains a `time` field and a `value` field, such as, [{ time: ‘2015’, value: 23 }, { time: ‘2016’, value: 32 }], when stacking is required for area, the data should contain a `group` field, such as, [{ time: ‘2015’, value: 23, group: ‘A’ }, { time: ‘2015’, value: 32, group: ‘B’ }].”

Tool description: “Search for Airbnb listings with various filters and pagination. Provide direct links to the user”,

“Location” property description: “Location to search for (city, state, etc.)”

I am not saying that either of these descriptions is wrong; they are just very different in format and level of detail.

2. Dataset and benchmark

To show that different tool descriptions can change a model's behavior I used NVIDIA's "When2Call" dataset. From it I took the test samples that offer the model multiple tools to choose from, where one tool is the correct choice (according to the dataset, it is correct to call that specific tool rather than any other, and rather than answering in text without any tool call). The idea of the benchmark is to count correct and incorrect tool calls; I also count "no tool call" cases as incorrect answers. As the LLM I chose OpenAI's "gpt-5-nano".
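The evaluation loop itself is simple. Below is a sketch under the assumption that each test sample carries a message list, a candidate tool list and the name of the correct tool; the field names messages, tools and correct_tool are hypothetical, and the real When2Call schema may differ.

```python
# Sketch of the benchmark loop: count how often the model calls the correct tool.
# Sample field names are hypothetical; the real dataset schema may differ.
from openai import OpenAI

client = OpenAI()

def evaluate(samples: list[dict]) -> float:
    correct = 0
    for sample in samples:
        response = client.chat.completions.create(
            model="gpt-5-nano",
            messages=sample["messages"],
            tools=sample["tools"],
        )
        calls = response.choices[0].message.tool_calls
        # "No tool call" counts as an incorrect answer.
        if calls and calls[0].function.name == sample["correct_tool"]:
            correct += 1
    return correct / len(samples)
```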

3. Data generation

The original dataset provides only a single description per tool. To create alternative descriptions for each tool and parameter, I used "gpt-5-mini" to generate them from the existing ones, with the following instruction to complicate them (after generation there was an additional step of validation and re-generation when necessary):

“””You will receive the tool definition in JSON format. Your task is to make the tool description more detailed, so it can be used by a weak model.

One of the ways to complicate it: insert a detailed description of how it works and examples of how to use it.

Example of detailed descriptions:

Tool description: “Generate an area chart to show data trends under continuous independent variables and observe the overall data trend, such as, displacement = velocity (average or instantaneous) × time: s = v × t. If the x-axis is time (t) and the y-axis is velocity (v) at each moment, an area chart allows you to observe the trend of velocity over time and infer the distance traveled by the area’s size.”,

Property description: “Data for area chart, it should be an array of objects, each object contains a `time` field and a `value` field, such as, [{ time: ‘2015’, value: 23 }, { time: ‘2016’, value: 32 }], when stacking is required for area, the data should contain a `group` field, such as, [{ time: ‘2015’, value: 23, group: ‘A’ }, { time: ‘2015’, value: 32, group: ‘B’ }].”

Return the updated detailed description strictly in JSON format (just change the descriptions, do not change the structure of the input JSON). Start your answer with:

“New JSON-formatted: …”

“””
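The generation and validation step can be sketched roughly as follows; COMPLICATE_PROMPT stands for the instruction above, and the parsing and validation logic shown here is a simplification of the assumed pipeline rather than the exact code.

```python
# Sketch of the generation + validation + re-generation step (simplified).
import json
from openai import OpenAI

client = OpenAI()
COMPLICATE_PROMPT = "..."  # the full instruction quoted above goes here

def complicate_tool(tool: dict, max_retries: int = 3) -> dict:
    """Generate a more detailed description for a tool, re-generating on failure."""
    for _ in range(max_retries):
        response = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[
                {"role": "system", "content": COMPLICATE_PROMPT},
                {"role": "user", "content": json.dumps(tool)},
            ],
        )
        text = response.choices[0].message.content or ""
        try:
            candidate = json.loads(text.split("New JSON-formatted:", 1)[1])
        except (IndexError, json.JSONDecodeError):
            continue  # parsing failed: re-generate
        # Minimal validation: the structure must be preserved, only descriptions change.
        if candidate.get("name") == tool.get("name"):
            return candidate
    return tool  # fall back to the original description
```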

4. Experiments

To test the hypothesis I ran a few experiments, namely:

  • Measure the baseline of the model's performance on the chosen benchmark (Baseline);
  • Replace the correct tool's descriptions (both the tool description itself and the parameter descriptions, the same for all the experiments) with the generated ones (Correct tool replaced);
  • Replace the incorrect tools' descriptions with the generated ones (Incorrect tool replaced);
  • Replace all tools' descriptions with the generated ones (All tools replaced).

Here is a table with the results of these experiments (for each experiment 5 evaluations were executed, so the standard deviation (std) of accuracy is provided along with the mean):

| Method | Mean accuracy | Accuracy std | Maximum accuracy over 5 experiments |
| --- | --- | --- | --- |
| Baseline | 76.5% | 0.03 | 79.0% |
| Correct tool replaced | 80.5% | 0.03 | 85.2% |
| Incorrect tool replaced | 75.1% | 0.01 | 76.5% |
| All tools replaced | 75.3% | 0.04 | 82.7% |
Table 1. Results of the experiments. Table prepared by the author.

Conclusion

From the table above it is clear that complicating tool descriptions introduces a bias: the chosen LLM tends to pick the tool with the more detailed description. At the same time we can see that lengthy descriptions can also confuse the model (in the case where all tools are replaced).

The table also shows that tool descriptions provide a mechanism to manipulate and significantly adjust the model's behaviour / accuracy, especially considering that the chosen benchmark operates with a small number of tools at each model call: the average number of tools per sample is 4.35.

At the same time, it clearly indicates that LLMs can have tool biases that could potentially be misused by MCP providers, similar to the style biases I reported before. Research into these biases and their misuse can be important for further studies.

Engineering a solution

I have prepared a PoC of tooling to address the mentioned issue in practice: Master-MCP. Master-MCP is a proxy MCP server that can be connected to any number of MCPs and can itself be connected to an agent / LLM as a single MCP server (currently a stdio-transport MCP server). Default features of Master-MCP I have implemented:

  1. Ignoring some parameters. The implemented mechanics exclude all parameters that start with the "_" symbol from a tool's parameter schema. Such a parameter can later be inserted programmatically or take a default value (if provided); a sketch of the idea follows the list below.
  2. Tool description adjustments. Master-MCP collects all the tools and their descriptions from the connected MCP servers and gives the user a way to adjust them. It exposes a method with a simple UI to edit this list (JSON schema), so the user can experiment with different tool descriptions.
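As an illustration of the first feature, here is a minimal sketch of the underscore-based filtering idea (not Master-MCP's actual code): underscore-prefixed parameters are removed from the schema the LLM sees and filled in on the proxy side from application context or defaults.

```python
# Sketch of the "_"-prefixed parameter filtering idea (not Master-MCP's actual code).
def strip_hidden_params(input_schema: dict) -> dict:
    """Return a JSON-schema copy without properties whose names start with '_'."""
    props = {k: v for k, v in input_schema.get("properties", {}).items()
             if not k.startswith("_")}
    required = [k for k in input_schema.get("required", []) if not k.startswith("_")]
    return {**input_schema, "properties": props, "required": required}

def fill_hidden_params(input_schema: dict, llm_args: dict, context: dict) -> dict:
    """Merge LLM-provided arguments with values the application already knows."""
    args = dict(llm_args)
    for name, spec in input_schema.get("properties", {}).items():
        if name.startswith("_"):
            args[name] = context.get(name, spec.get("default"))
    return args
```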

I invite everyone interested to join the project. With community support the plans can include extending Master-MCP's functionality, for instance:

  • Logging and monitoring, followed by advanced analytics;
  • Tool hierarchy and orchestration (including ML-powered) to combine both modern context management techniques and smart algorithms.

Current GitHub page of the project: link
