New! (May 23, ’25) If you prefer Python, check out the companion post
Tiny Agents in Python.
Over the past few weeks, I have been diving into MCP (Model Context Protocol) to understand what the hype around it was all about.
My TL;DR is that it’s fairly simple, but still quite powerful: MCP is a standard API to expose sets of Tools that can be hooked to LLMs.
It is fairly simple to extend an Inference Client – at HF, we have two official client SDKs: @huggingface/inference in JS, and huggingface_hub in Python – to also act as an MCP client and hook the available tools from MCP servers into the LLM inference.
But while doing that, came my second realization:
Once you have an MCP Client, an Agent is literally just a while loop on top of it.
In this short article, I will walk you through how I implemented it in TypeScript (JS), how you can adopt MCP too, and how it’s going to make Agentic AI way simpler going forward.
How to run the complete demo
If you have NodeJS (with pnpm or npm), just run this in a terminal:
npx @huggingface/mcp-client
or if using pnpm:
pnpx @huggingface/mcp-client
This installs my package into a temporary folder then executes its command.
You’ll see your simple Agent connect to two distinct MCP servers (running locally), load their tools, then prompt you for a conversation.
By default our example Agent connects to the following two MCP servers:
- the “canonical” file system server, which gets access to your Desktop,
- and the Playwright MCP server, which knows how to use a sandboxed Chromium browser for you.
Note: this is a bit counter-intuitive but currently, all MCP servers are actually local processes (though remote servers are coming soon).
Our input for this first video was:
write a haiku about the Hugging Face community and write it to a file named “hf.txt” on my Desktop
Now let us try this prompt that involves some Web browsing:
do a Web Search for HF inference providers on Brave Search and open the first 3 results
Default model and provider
In terms of model/provider pair, our example Agent uses Qwen/Qwen2.5-72B-Instruct running on Nebius by default.
This is all configurable through env variables! See:
const agent = new Agent({
provider: process.env.PROVIDER ?? "nebius",
model: process.env.MODEL_ID ?? "Qwen/Qwen2.5-72B-Instruct",
apiKey: process.env.HF_TOKEN,
servers: SERVERS,
});
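For example, you could override those defaults straight from the shell when launching the demo (the model and provider values below are just placeholders, pick any pair you like):

PROVIDER=together MODEL_ID=meta-llama/Llama-3.3-70B-Instruct HF_TOKEN=hf_xxx npx @huggingface/mcp-client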
Where does the code live
The Tiny Agent code lives in the mcp-client sub-package of huggingface.js, the GitHub mono-repo in which all our JS libraries reside.
https://github.com/huggingface/huggingface.js/tree/main/packages/mcp-client
The codebase uses modern JS features (notably, async generators) which make things way easier to implement, especially asynchronous events like the LLM responses.
You might need to ask an LLM about those JS features if you’re not yet familiar with them.
The foundation for this: tool calling native support in LLMs.
What is going to make this whole blog post very easy is that the recent crop of LLMs (both closed and open) have been trained for function calling, aka tool use.
A tool is defined by its name, a description, and a JSONSchema representation of its parameters.
In some sense, it’s an opaque representation of any function’s interface, as seen from the outside (meaning, the LLM doesn’t care how the function is actually implemented).
const weatherTool = {
type: "function",
function: {
name: "get_weather",
description: "Get current temperature for a given location.",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "City and country e.g. Bogotá, Colombia",
},
},
},
},
};
The canonical documentation I’ll link to here is OpenAI’s function calling doc. (Yes… OpenAI pretty much defines the LLM standards for the whole community 😅).
Inference engines let you pass a list of tools when calling the LLM, and the LLM is free to call zero, one, or several of those tools.
As a developer, you run the tools and feed their results back into the LLM to continue the generation.
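To make that round trip concrete, here is a minimal sketch using @huggingface/inference’s chatCompletion with the weatherTool above; the fakeGetWeather helper is a made-up placeholder, and the response shape assumes the usual OpenAI-compatible format:

import { InferenceClient } from "@huggingface/inference";

const client = new InferenceClient(process.env.HF_TOKEN);

const response = await client.chatCompletion({
  provider: "nebius",
  model: "Qwen/Qwen2.5-72B-Instruct",
  messages: [{ role: "user", content: "What's the weather in Bogotá right now?" }],
  tools: [weatherTool],
  tool_choice: "auto",
});

// The model may answer directly, or emit one or more tool calls.
for (const toolCall of response.choices[0].message.tool_calls ?? []) {
  const args = JSON.parse(toolCall.function.arguments);
  // You run the tool yourself (fakeGetWeather is a placeholder), then feed the
  // result back as a "tool" message and call chatCompletion again.
  console.log(fakeGetWeather(args.location));
}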
Note that in the backend (at the inference engine level), the tools are simply passed to the model in a specially-formatted chat_template, like any other message, and then parsed out of the response (using model-specific special tokens) to expose them as tool calls. See an example in our chat-template playground.
Implementing an MCP client on top of InferenceClient
Now that we know what a tool is in recent LLMs, let us implement the actual MCP client.
The official doc at https://modelcontextprotocol.io/quickstart/client is fairly well-written. You only have to replace any mention of the Anthropic client SDK with any other OpenAI-compatible client SDK. (There is also a llms.txt you can feed into your LLM of choice to help you code along).
As a reminder, we use HF’s InferenceClient for our inference client.
The complete McpClient.ts code file is here if you want to follow along using the actual code 🤓
Our McpClient class has:
- an Inference Client (works with any Inference Provider, and @huggingface/inference supports both remote and local endpoints)
- a set of MCP client sessions, one for each connected MCP server (yes, we want to support multiple servers)
- and a list of available tools that is going to be filled from the connected servers and just slightly re-formatted.
export class McpClient {
protected client: InferenceClient;
protected provider: string;
protected model: string;
private clients: Map<ToolName, Client> = new Map();
public readonly availableTools: ChatCompletionInputTool[] = [];
constructor({ provider, model, apiKey }: { provider: InferenceProvider; model: string; apiKey: string }) {
this.client = new InferenceClient(apiKey);
this.provider = provider;
this.model = model;
}
}
To connect to an MCP server, the official @modelcontextprotocol/sdk/client TypeScript SDK provides a Client class with a listTools() method:
async addMcpServer(server: StdioServerParameters): Promise<void> {
const transport = new StdioClientTransport({
...server,
env: { ...server.env, PATH: process.env.PATH ?? "" },
});
const mcp = new Client({ name: "@huggingface/mcp-client", version: packageVersion });
await mcp.connect(transport);
const toolsResult = await mcp.listTools();
debug(
"Connected to server with tools:",
toolsResult.tools.map(({ name }) => name)
);
for (const tool of toolsResult.tools) {
this.clients.set(tool.name, mcp);
}
this.availableTools.push(
...toolsResult.tools.map((tool) => {
return {
type: "function",
function: {
name: tool.name,
description: tool.description,
parameters: tool.inputSchema,
},
} satisfies ChatCompletionInputTool;
})
);
}
StdioServerParameters is an interface from the MCP SDK that will let you easily spawn a local process: as we mentioned earlier, currently, all MCP servers are actually local processes.
For each MCP server we connect to, we slightly re-format its list of tools and add them to this.availableTools.
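For reference, the SERVERS array passed to the Agent is just a list of such StdioServerParameters. A sketch of what it could look like for the two demo servers (the exact package names and arguments are assumptions, not necessarily the defaults shipped in @huggingface/mcp-client):

import { homedir } from "node:os";
import { join } from "node:path";
import type { StdioServerParameters } from "@modelcontextprotocol/sdk/client/stdio.js";

const SERVERS: StdioServerParameters[] = [
  {
    // the "canonical" file system server, scoped to your Desktop
    command: "npx",
    args: ["-y", "@modelcontextprotocol/server-filesystem", join(homedir(), "Desktop")],
  },
  {
    // the Playwright MCP server, which drives a sandboxed Chromium browser
    command: "npx",
    args: ["-y", "@playwright/mcp@latest"],
  },
];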
How to use the tools
Easy, you just pass this.availableTools to your LLM chat-completion, alongside your usual array of messages:
const stream = this.client.chatCompletionStream({
provider: this.provider,
model: this.model,
messages,
tools: this.availableTools,
tool_choice: "auto",
});
tool_choice: "auto" is the parameter you pass for the LLM to generate zero, one, or multiple tool calls.
When parsing or streaming the output, the LLM will generate some tool calls (i.e. a function name, and some JSON-encoded arguments), which you (as a developer) need to compute. The MCP client SDK once again makes that very easy; it has a client.callTool() method:
const toolName = toolCall.function.name;
const toolArgs = JSON.parse(toolCall.function.arguments);
const toolMessage: ChatCompletionInputMessageTool = {
role: "tool",
tool_call_id: toolCall.id,
content: "",
name: toolName,
};
const client = this.clients.get(toolName);
if (client) {
const result = await client.callTool({ name: toolName, arguments: toolArgs });
toolMessage.content = result.content[0].text;
} else {
toolMessage.content = `Error: No session found for tool: ${toolName}`;
}
Finally you will add the resulting tool message to your messages array and pass it back into the LLM.
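To close the loop, you might do something along these lines (a sketch; the exact assistant-message shape is an assumption based on the OpenAI-compatible format):

// Record the assistant turn that requested the tool call...
messages.push({
  role: "assistant",
  content: "",
  tool_calls: [toolCall],
});
// ...then the tool result, and query the LLM again with the updated history.
messages.push(toolMessage);

const nextStream = this.client.chatCompletionStream({
  provider: this.provider,
  model: this.model,
  messages,
  tools: this.availableTools,
  tool_choice: "auto",
});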
Our 50-lines-of-code Agent 🤯
Now that we have an MCP client capable of connecting to arbitrary MCP servers to get lists of tools, and capable of injecting them into and parsing them out of the LLM inference, well… what is an Agent?
Once you have an inference client with a set of tools, then an Agent is just a while loop on top of it.
In more detail, an Agent is simply a combination of:
- a system prompt
- an LLM Inference client
- an MCP client to hook a set of Tools into it from a number of MCP servers
- some basic control flow (see below for the while loop)
The complete Agent.ts code file is here.
Our Agent class simply extends McpClient:
export class Agent extends McpClient {
private readonly servers: StdioServerParameters[];
protected messages: ChatCompletionInputMessage[];
constructor({
provider,
model,
apiKey,
servers,
prompt,
}: {
provider: InferenceProvider;
model: string;
apiKey: string;
servers: StdioServerParameters[];
prompt?: string;
}) {
super({ provider, model, apiKey });
this.servers = servers;
this.messages = [
{
role: "system",
content: prompt ?? DEFAULT_SYSTEM_PROMPT,
},
];
}
}
By default, we use a very simple system prompt inspired by the one shared in the GPT-4.1 prompting guide.
Though this comes from OpenAI 😈, this sentence in particular applies to more and more models, both closed and open:
We encourage developers to exclusively use the tools field to pass tools, rather than manually injecting tool descriptions into your prompt and writing a separate parser for tool calls, as some have reported doing in the past.
Which is to say, we don’t need to provide painstakingly formatted lists of tool use examples in the prompt. The tools: this.availableTools param is enough.
Loading the tools onto the Agent is literally just connecting to the MCP servers we want (in parallel since it’s so easy to do in JS):
async loadTools(): Promise<void> {
await Promise.all(this.servers.map((s) => this.addMcpServer(s)));
}
We add two extra tools (outside of MCP) that can be used by the LLM for our Agent’s own control flow:
const taskCompletionTool: ChatCompletionInputTool = {
type: "function",
function: {
name: "task_complete",
description: "Call this tool when the duty given by the user is complete",
parameters: {
type: "object",
properties: {},
},
},
};
const askQuestionTool: ChatCompletionInputTool = {
type: "function",
function: {
name: "ask_question",
description: "Ask an issue to the user to get more info required to resolve or make clear their problem.",
parameters: {
type: "object",
properties: {},
},
},
};
const exitLoopTools = [taskCompletionTool, askQuestionTool];
When calling either of those tools, the Agent will break out of its loop and give control back to the user for new input.
The complete while loop
Behold our complete while loop. 🎉
The gist of our Agent’s main while loop is that we simply iterate with the LLM, alternating between tool calling and feeding it the tool results, and we do so until the LLM starts to respond with two non-tool messages in a row.
This is the complete while loop:
let numOfTurns = 0;
let nextTurnShouldCallTools = true;
while (true) {
try {
yield* this.processSingleTurnWithTools(this.messages, {
exitLoopTools,
exitIfFirstChunkNoTool: numOfTurns > 0 && nextTurnShouldCallTools,
abortSignal: opts.abortSignal,
});
} catch (err) {
if (err instanceof Error && err.message === "AbortError") {
return;
}
throw err;
}
numOfTurns++;
const currentLast = this.messages.at(-1)!;
if (
currentLast.role === "tool" &&
currentLast.name &&
exitLoopTools.map((t) => t.function.name).includes(currentLast.name)
) {
return;
}
if (currentLast.role !== "tool" && numOfTurns > MAX_NUM_TURNS) {
return;
}
if (currentLast.role !== "tool" && nextTurnShouldCallTools) {
return;
}
if (currentLast.role === "tool") {
nextTurnShouldCallTools = false;
} else {
nextTurnShouldCallTools = true;
}
}
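And that’s it. To tie everything together, using the Agent from your own code could look roughly like this (the run() async generator and the chunk shape are assumptions based on the yield* above, not a documented API):

const agent = new Agent({
  provider: process.env.PROVIDER ?? "nebius",
  model: process.env.MODEL_ID ?? "Qwen/Qwen2.5-72B-Instruct",
  apiKey: process.env.HF_TOKEN,
  servers: SERVERS,
});

await agent.loadTools();

// Each user prompt kicks off the while loop above; chunks (text deltas and
// tool-call results) stream back as they are produced.
for await (const chunk of agent.run("write a haiku about the Hugging Face community")) {
  if ("choices" in chunk) {
    process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
  }
}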
Next steps
There are many cool potential next steps once you have a running MCP Client and a simple way to build Agents 🔥
- Experiment with other models
- Experiment with all of the Inference Providers:
- Cerebras, Cohere, Fal, Fireworks, Hyperbolic, Nebius, Novita, Replicate, SambaNova, Together, etc.
- each of them has different optimizations for function calling (also depending on the model) so performance may vary!
- Hook local LLMs using llama.cpp or LM Studio
Pull requests and contributions are welcome!
Again, everything here is open source! 💎❤️
