How Not to Write an MCP Server


I recently had the opportunity to create an MCP server for an observability application, in an effort to provide the AI agent with dynamic code analysis capabilities. Because of its potential to transform how we work with applications, MCP is a technology I'm far more excited about than I originally was about genAI in general. I wrote more about that, along with a general intro to MCPs, in a previous post.

While initial POCs demonstrated that there was immense potential for this to be a force multiplier for our product's value, it took several iterations and a number of stumbles to deliver on that promise. In this post, I'll attempt to capture some of the lessons learned, as I believe they will benefit other MCP server developers.

My Stack

  • I was using Cursor and VS Code intermittently as the main MCP client
  • To develop the MCP server itself, I used the .NET MCP SDK, as I decided to host the server within another service written in .NET (a minimal bootstrap sketch follows this list)
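For reference, the snippet below is a minimal sketch of what the server bootstrap can look like with the preview ModelContextProtocol NuGet package; the exact extension method names may differ between SDK versions, and the tool classes themselves are registered via attributes as shown later in this post.

// Program.cs - minimal MCP server bootstrap (sketch; assumes the
// ModelContextProtocol package and Microsoft.Extensions.Hosting)
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

builder.Services
    .AddMcpServer()                // register the MCP server
    .WithStdioServerTransport()    // STDIO transport (see the SSE discussion at the end of this post)
    .WithToolsFromAssembly();      // discover [McpServerTool] methods in this assembly

await builder.Build().RunAsync();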

Lesson 1: Don't dump all of your data on the agent

In my application, one tool returns aggregated information about errors and exceptions. The API is very detailed, since it serves a complex UI view, and spews out large amounts of deeply linked data:

  • Error frames
  • Affected endpoints
  • Stack traces 
  • Priority and trends 
  • Histograms

My first hunch was to simply expose the API as-is as an MCP tool. After all, the agent should be able to make more sense of it than any UI view, and catch on to interesting details or connections between events. There were several scenarios I had in mind for how this data could be useful. The agent could automatically offer fixes for recent exceptions recorded in production or in the testing environment, let me know about errors that stand out, or help me address systematic problems that are the underlying root cause of the issues.

The basic premise was therefore to allow the agent to work its 'magic', with more data potentially meaning more hooks for the agent to latch onto in its investigation efforts. I quickly coded a wrapper around our API as an MCP tool (sketched below) and started with a basic prompt to see whether everything was working; the exchange is shown after the sketch.
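The first version looked roughly like this. The helper call (GetAggregatedErrorsAsync) and the exact attribute namespaces are illustrative stand-ins for our internal API; the point is simply that the tool relays the full, verbose payload untouched.

// Naive first pass (illustrative names): expose the existing, verbose
// errors API as-is and hand the entire payload to the agent.
using System.ComponentModel;
using System.Text.Json;
using System.Threading.Tasks;
using ModelContextProtocol.Server;

[McpServerToolType]
public static class ErrorTools
{
    [McpServerTool,
     Description("Returns aggregated information on errors and exceptions in an environment.")]
    public static async Task<string> GetErrors(IMcpService client,
        [Description("The environment id to query for errors")]
        string environmentId)
    {
        // Hypothetical backend call: returns stack traces, error frames,
        // affected endpoints, histograms... everything the UI view gets.
        var fullPayload = await client.GetAggregatedErrorsAsync(environmentId);
        return JsonSerializer.Serialize(fullPayload);
    }
}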

Image by author

We can see the agent was smart enough to know that it needed to call another tool to grab the environment ID for the 'test' environment I mentioned. With that at hand, after discovering that there were actually no recent exceptions in the last 24 hours, it took the liberty of scanning a longer time period, and that is when things got slightly weird:

Image by author

What a strange response. The agent queries for exceptions from the last seven days, gets back some tangible results this time, and yet proceeds to ramble on as if ignoring the data altogether. It keeps trying to use the tool in different ways and with different parameter combinations, obviously fumbling, until I notice it flat out states that the data is completely invisible to it. While errors are being sent back in the response, the agent actually claims there are no errors. What is going on?

Image by author

After some investigation, the problem turned out to be that we had simply hit a cap on the agent's ability to process large amounts of data in the response.

I used an existing API that was extremely verbose, which I initially even considered an advantage. The end result, however, was that I somehow managed to overwhelm the model. Overall, there were around 360k characters and 16k words in the response JSON, including call stacks, error frames, and references. At a rough rule of thumb of about four characters per token, that is in the neighborhood of 90k tokens, which should have fit comfortably within the context window limit of the model I was using (Claude 3.7 Sonnet supports up to 200k tokens). Nevertheless, the massive data dump left the agent thoroughly stumped.

One strategy would be to switch to a model that supports an even bigger context window. I switched over to the Gemini 2.5 Pro model just to test that theory, as it boasts an outrageous limit of one million tokens. Sure enough, the same query now yielded a much more intelligent response:

Image by author

That is great! The agent was able to parse the errors and find the systematic cause behind many of them with some basic reasoning. However, we can't rely on the user using a specific model, and to complicate things, this was output from a relatively low-bandwidth testing environment. What if the dataset were even larger?
To solve this issue, I made some fundamental changes to how the API was structured:

  • Nested data hierarchy: Keep the initial response focused on high-level details and aggregations, and create a separate API to retrieve the call stacks of specific frames as needed.
  • Enhance queryability: All of the queries made so far by the agent used a very small page size (10). If we want the agent to be able to access the most relevant subsets of the data within the constraints of its context, we need to provide more APIs to query errors along different dimensions, for example: affected methods, error type, priority and impact, etc. (a sketch of the reshaped tool surface follows this list).
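In practice, that meant splitting the single verbose tool into a compact, filterable summary tool plus a narrow drill-down tool, roughly as sketched below. The names, parameters, and backend calls are illustrative, not the actual API.

// Reshaped tool surface (illustrative names and parameters).
[McpServerToolType]
public static class ErrorQueryTools
{
    [McpServerTool,
     Description("Returns a paged, high-level summary of errors: type, score, trend and affected endpoints. " +
                 "Use the filters to narrow down the result set instead of requesting everything at once.")]
    public static Task<string> GetErrorsSummary(IMcpService client,
        [Description("The environment id to query")] string environmentId,
        [Description("ISO 8601 duration to look back, e.g. P7D")] string lookback,
        [Description("Optional: filter by error type, e.g. NullReferenceException")] string? errorType = null,
        [Description("Optional: filter by affected method, e.g. OrderService.Submit")] string? affectedMethod = null,
        [Description("Page number; pages contain 10 items")] int page = 1)
        => client.QueryErrorSummariesAsync(environmentId, lookback, errorType, affectedMethod, page);

    [McpServerTool,
     Description("Returns the full call stack and error frames for a single error, referenced by the id " +
                 "returned from GetErrorsSummary. Call this only for errors you want to drill into.")]
    public static Task<string> GetErrorCallStack(IMcpService client,
        [Description("The environment id to query")] string environmentId,
        [Description("The error id returned by GetErrorsSummary")] string errorId)
        => client.GetErrorCallStackAsync(environmentId, errorId);
}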

With the new changes, the tool now consistently analyzes important recent exceptions and comes up with fix suggestions. However, I had glossed over another minor detail that I needed to sort out before I could really use it reliably.

Lesson 2: What’s the time?

Image generated by the author with Midjourney

The keen-eyed reader may have noticed that in the previous example, to retrieve the errors in a specific time range, the agent uses the ISO 8601 duration format instead of actual dates and times. So instead of including standard 'From' and 'To' parameters with datetime values, the AI sent a duration value, for example seven days or P7D, to indicate it wants to check for errors in the past week.

The reason for this is somewhat strange: the agent may not know the current date and time! You can confirm that yourself by asking the agent that simple question. The response below would have made sense were it not for the fact that I typed that prompt in at around noon on May 4th…

Image by author

Using duration values turned out to be a great solution that the agent handled quite well. Don't forget to document the expected value and example syntax in the tool parameter description, though!
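On the server side, handling these duration values in .NET is straightforward, since XmlConvert already parses the ISO 8601/XSD duration syntax. The sketch below resolves the duration to an absolute range on the server, so the agent never needs to know today's date.

using System;
using System.Xml;

public static class TimeRange
{
    // Convert an ISO 8601 duration (e.g. "P7D", "PT12H") into an absolute
    // From/To range anchored at the current time.
    public static (DateTime From, DateTime To) FromIsoDuration(string isoDuration)
    {
        TimeSpan lookback = XmlConvert.ToTimeSpan(isoDuration); // "P7D" -> 7 days
        DateTime to = DateTime.UtcNow;
        return (to - lookback, to);
    }
}

// Example: TimeRange.FromIsoDuration("P7D") covers the last seven days, ending now.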

Lesson 3: When the agent makes a mistake, show it how to do better

In the first example, I was actually quite surprised by how the agent was able to decipher the dependencies between the different tool calls in order to provide the right environment identifier. In studying the MCP contract, it figured out that it needed to call another tool first to get the list of environment IDs.

However, when responding to other requests, the agent would sometimes take the environment names mentioned in the prompt verbatim. For example, I noticed that in response to this query: compare slow traces for this method between the test and prod environments, are there any significant differences? Depending on the context, the agent would sometimes use the environment names mentioned in the request and send the strings “test” and “prod” as the environment ID.

In my original implementation, my MCP server would silently fail in this scenario, returning an empty response. The agent, upon receiving no data or a generic error, would simply give up and try to solve the request using another strategy. To offset that behavior, I quickly modified my implementation so that if an incorrect value was provided, the JSON response would describe exactly what went wrong, and even provide a valid list of possible values to save the agent another tool call (sketched below).
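Continuing the illustrative wrapper from Lesson 1, the fix boils down to validating the id up front and, on a mismatch, returning a response that both names the problem and lists the valid values. GetEnvironmentsAsync and the shape of its result are hypothetical; the snippet slots into the body of the errors tool and assumes System.Linq and System.Text.Json.

// Inside the tool method, before querying the backend (illustrative names).
var knownEnvironments = await client.GetEnvironmentsAsync(); // e.g. { Id = "ENV-0221", Name = "TEST" }, ...
var match = knownEnvironments.FirstOrDefault(e =>
    string.Equals(e.Id, environmentId, StringComparison.OrdinalIgnoreCase));

if (match is null)
{
    // Don't fail silently: tell the agent what went wrong and how to fix it,
    // and list the valid ids to save it another round-trip.
    return JsonSerializer.Serialize(new
    {
        error = $"'{environmentId}' is not a valid environment id. " +
                "Environment names must first be resolved to ids using the environments tool.",
        validEnvironments = knownEnvironments.Select(e => new { e.Id, e.Name })
    });
}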

Image by author

This was enough for the agent: learning from its mistake, it repeated the call with the correct value, and somehow also avoided making that same error in the future.

Lesson 4: Focus on user intent and not functionality

While it's tempting to simply describe what the API does, sometimes generic terms don't quite allow the agent to grasp the kind of requirements for which this functionality might apply best.

Let's take a simple example: my MCP server has a tool that, for each method, endpoint, or code location, can indicate how it is being used at runtime. Specifically, it uses the tracing data to show which application flows reach the specific function or method.

The original documentation simply described this functionality:

[McpServerTool,
 Description(
 @"For this method, see which runtime flows in the application
 (including other microservices and code not in this project)
 use this function or method.
 This data is based on analyzing distributed tracing.")]
public static async Task GetUsagesForMethod(IMcpService client,
    [Description("The environment id to check for usages")]
    string environmentId,
    [Description("The name of the class. Provide only the class name without the namespace prefix.")]
    string codeClass,
    [Description("The name of the method to check, must specify a specific method to check")]
    string codeMethod)

The above is a functionally accurate description of what this tool does, but it doesn't necessarily make clear what kinds of activities it could be relevant for. After seeing that the agent wasn't picking this tool up for various prompts I thought it would be quite useful for, I decided to rewrite the tool description, this time emphasizing the use cases:

[McpServerTool,
 Description(
 @"Find out how a specific code location is being used and by
 which other services/code.
 Useful in order to detect possible breaking changes, to check whether
 the generated code will fit the current usages,
 to generate tests based on the runtime usage of this method,
 or to check for related issues on the endpoints triggering this code
 after any change, to ensure it didn't impact them")]

Updating the text helped the agent realize why the information was useful. For example, before making this change, the agent would not even trigger the tool in response to a prompt similar to the one below. Now, it has become completely seamless, without the user having to directly mention that this tool should be used:

Image by author

Lesson 5: Document your JSON responses

The JSON standard, at least officially, does not support comments. That means that if the JSON is all the agent has to go on, it might be missing some clues about the context of the data you’re returning. For example, in my aggregated error response, I returned the following score object:

"Score": {
  "Score": 21,
  "ScoreParams": {
    "Occurrences": 1,
    "Trend": 0,
    "Recent": 20,
    "Unhandled": 0,
    "Unexpected": 0
  }
}

Without proper documentation, any non-clairvoyant agent would be hard pressed to make sense of what these numbers mean. Thankfully, it is easy to add a comment element at the beginning of the JSON response with additional information about the data provided:

"_comment": "Each error contains a link to the error trace,
which can be retrieved using the GetTrace tool,
along with information about the affected endpoints, the code,
and the relevant stack trace.
Each error in the list represents numerous instances
of the same error and is given a score after it has been
prioritized.
The score reflects the criticality of the error.
The number is between 0 and 100 and is comprised of several
parameters, each of which can contribute to the error criticality;
all are normalized in relation to the system
and the other methods.
Each score parameter's value represents its contribution to the
overall score. They include:

1. 'Occurrences', representing the number of instances of this error
compared to others.
2. 'Trend', whether this error is escalating in its
frequency.
3. 'Unhandled', representing whether this error is caught
internally or propagates all the way
out of the endpoint scope.
4. 'Unexpected', errors that are in high probability
bugs, for example NullPointerException or
KeyNotFound",
"EnvironmentErrors":[]

This allows the agent to explain to the user what the score means if they ask, but also to feed this explanation into its own reasoning and suggestions.
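Wiring that comment in on the server side is trivial; one way to do it (property names here mirror the example above, but the wrapper itself is illustrative) is to wrap the backend payload before serializing:

using System.Text.Json;

public static class ErrorResponseSerializer
{
    // Prepend a "_comment" field so the agent receives the semantics of the
    // score alongside the data itself.
    public static string Serialize(object environmentErrors) =>
        JsonSerializer.Serialize(new
        {
            _comment = "Each error has a Score between 0 and 100 reflecting its criticality; " +
                       "ScoreParams lists each parameter's contribution " +
                       "(Occurrences, Trend, Recent, Unhandled, Unexpected).",
            EnvironmentErrors = environmentErrors
        });
}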

Choosing the right architecture: SSE vs. STDIO

There are two architectures you can use when developing an MCP server. The more common and widely supported approach is to make your server available as a command triggered by the MCP client. This could be any CLI-triggered command; npx, docker, and python are some common examples. In this configuration, all communication is done via the process STDIO, and the process itself runs on the client machine. The client is responsible for instantiating and maintaining the lifecycle of the MCP server.

Image by author

This client-side architecture has one major drawback from my perspective: because the MCP server implementation is run by the client on the local machine, it is much harder to roll out updates or new capabilities. Even if that problem were somehow solved, the tight coupling between the MCP server and the backend APIs it depends on in our application would further complicate this model in terms of versioning and forward/backward compatibility.

For these reasons, I chose the second type of MCP server: an SSE server hosted as part of our application services. This removes any friction from running CLI commands on the client machine, and it also allows me to update and version the MCP server code together with the application code that it consumes. In this scenario, the client is provided with the URL of the SSE endpoint it interacts with. While not all clients currently support this option, there is a great command-line tool called supergateway that can be used as a proxy to the SSE server implementation. That means users can still add the more widely supported STDIO variant and still consume the functionality hosted on your SSE backend.
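For clients that only speak STDIO today, the bridge can be configured entirely on the client side. The snippet below is a sketch of a typical mcpServers entry that proxies a hosted SSE endpoint through supergateway; the URL is a placeholder, and the exact flags should be verified against the supergateway documentation.

{
  "mcpServers": {
    "observability-mcp": {
      "command": "npx",
      "args": ["-y", "supergateway", "--sse", "https://your-app.example.com/mcp/sse"]
    }
  }
}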

Image by author

MCPs are still new

There are many more lessons and nuances to using this deceptively simple technology. I have found that there is a big gap between implementing a workable MCP and one that can actually integrate with user needs and usage scenarios, even beyond those you've anticipated. Hopefully, as the technology matures, we'll see more posts on best practices.

Want to connect? You can reach me on Twitter at @doppleware or via LinkedIn.
Follow my MCP for dynamic code analysis using observability at https://github.com/digma-ai/digma-mcp-server
