Agents finally work.
They call tools, reason through workflows, and actually complete tasks.
Then the first real API bill arrives.
For a lot of teams, that's the moment the question appears:
"Should we just run this ourselves?"
The good news is that self-hosting an LLM is no longer a research project or a massive ML infrastructure effort. With the right model, the right GPU, and a few battle-tested tools, you can run a production-grade LLM on a single machine you control.
You're probably here because one of these happened:
- Your OpenAI or Anthropic bill exploded
- You can't send sensitive data outside your VPC
- Your agent workflows burn millions of tokens/day
- You want custom behavior from your AI, and prompting alone isn't cutting it
If that's you, perfect. If not, you're still perfect 🤗
In this article, I'll walk you through a practical playbook for deploying an LLM on your own infrastructure, including how models and instance types were evaluated and chosen, and the reasoning behind those decisions.
I'll also give you a zero-switching-cost deployment pattern for your own LLM that works with existing OpenAI or Anthropic client code.
By the end of this guide you'll know:
- Which benchmarks actually matter for LLMs that need to solve and reason through agentic problems, not recite the latest string theory result
- What it means to quantize a model and how it affects performance
- Which instance types/GPUs can be used for single-machine hosting¹
- Which models to use²
- How to use a self-hosted LLM without rewriting an existing API-based codebase
- How to make self-hosting cost-effective³
¹ Instance types were evaluated across the "big three": AWS, Azure, and GCP
² All models are current as of March 2026
³ All pricing data is current as of March 2026
Note: this guide is focused on deploying agent-oriented LLMs, not general-purpose, trillion-parameter, all-encompassing frontier models, which are largely overkill for agent use cases.
✋Wait…why would I host my own LLM again?
+++ Privacy
This is most likely why you're here. Sensitive data (patient health records, proprietary source code, user data, financial records, RFPs, or internal strategy documents) may never leave your firewall.
Self-hosting removes the dependency on third-party APIs and mitigates the risk of a breach, or of a provider retaining/logging data in violation of strict privacy policies.
++ Cost Predictability
API pricing scales linearly with usage. For agent workloads, which tend to sit at the high end of the token spectrum, operating your own GPU infrastructure introduces economies of scale. This is especially important if you plan on running agent reasoning across a medium-to-large company (20-30+ agents) or providing agents to customers at any kind of scale.
+ Performance
Remove round-trip API latency, get reasonable tokens-per-second, and add capacity as necessary with spot-instance elastic scaling.
+ Customization
Methods like LoRA and QLoRA (not covered in detail here) can be used to fine-tune an LLM's behavior or adapt its alignment: abliterating, enhancing, tailoring tool usage, adjusting response style, or fine-tuning on domain-specific data.
This is crucially useful for building custom agents or offering AI services that require behavior or style tuned to a use case, rather than generic instruction alignment via prompting.
An aside on fine-tuning
Methods such as LoRA/QLoRA, model ablation ("abliteration"), realignment techniques, and response stylization are technically complex and outside the scope of this guide. Nonetheless, self-hosting is often the first step toward exploring deeper customization of LLMs.
Why a single machine?
It's not a hard requirement; it's for simplicity. Deploying on a single machine with a single GPU is relatively easy. A single machine with multiple GPUs is doable with the right configuration decisions.
However, debugging distributed inference across many machines can be nightmarish.
This is your first self-hosted LLM. To simplify the process, we're going to focus on a single machine and a single GPU. As your inference needs grow, or if you need more performance, scale up on a single machine. Then, as you mature, you can start tackling multi-machine or Kubernetes-style deployments.
👉Which Benchmarks Actually Matter?
The LLM benchmark landscape is noisy. There are dozens of leaderboards, and most of them are irrelevant for our use case. We need to prune them down to find LLMs that excel at agent-style tasks.
Specifically, we're looking for LLMs that can:
- Follow complex, multi-step instructions
- Use tools reliably: call functions with well-formed arguments, interpret results, and judge what to do next
- Reason with constraints: handle potentially incomplete information without hallucinating a confident but wrong answer
- Write and understand code: we don't need to solve expert-level SWE problems, but interacting with APIs and generating code on the fly expands the action space and translates into better tool usage
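To make those capabilities concrete, here's a minimal sketch of the loop they plug into. The model call is mocked out and the `get_weather` tool is hypothetical; in a real deployment, the reply would come from your self-hosted endpoint.

```python
import json

# Hypothetical tool registry: in a real agent these would call actual APIs.
TOOLS = {
    "get_weather": lambda location: {"location": location, "temp_c": 21},
}

def run_agent(model_call, user_msg, max_steps=5):
    """Minimal agent loop: ask the model, execute any tool call, feed back the result."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = model_call(messages)
        if "tool_call" in reply:
            call = reply["tool_call"]
            result = TOOLS[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:
            return reply["content"]
    return None  # gave up: the model never produced a final answer

# Mocked model: first turn requests a tool, second turn answers.
def fake_model(messages):
    if messages[-1]["role"] == "tool":
        return {"content": "It's 21°C in NYC."}
    return {"tool_call": {"name": "get_weather", "arguments": {"location": "NYC"}}}

print(run_agent(fake_model, "What's the weather in NYC?"))  # It's 21°C in NYC.
```

Every benchmark below stresses one part of this loop: choosing the right tool, forming valid arguments, or reasoning over the results.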
Here are the benchmarks to actually pay attention to:

| Benchmark | Description | Why? |
|---|---|---|
| Berkeley Function Calling Leaderboard (BFCL v3) | Accuracy of function/tool calling across simple, parallel, nested, and multi-step invocations | Directly tests the capability your agents depend on most: structured tool use |
| IFEval (Instruction Following Eval) | Strict adherence to formatting, constraint, and structural instructions | Agents need strict adherence to instructions |
| τ-bench (Tau-bench) | E2E agent task completion in simulated environments | Measures real agentic competence: can this LLM actually accomplish a goal over multiple turns? |
| SWE-bench Verified | Ability to resolve real GitHub issues from popular open-source repos | If your agents write or modify code, this is the gold standard. The "Verified" subset filters out ambiguous or poorly-specified issues |
| WebArena / VisualWebArena | Task completion in realistic web environments | Super useful if your agent needs to drive a web UI |
Note: unfortunately, getting reliable benchmark scores on all of these, especially for quantized models, is difficult. You're going to have to use your best judgment, assuming the full-precision model degrades roughly according to the table outlined below.
🤖Quantizing
This is in no way, shape, or form meant to be an exhaustive guide to quantization. My goal is to give you enough information to let you navigate Hugging Face without coming out cross-eyed.
The fundamentals
A model's parameters are stored as numbers. At full precision (FP32), each weight is a 32-bit floating point number, i.e. 4 bytes. Modern models are distributed at FP16 or BF16 (half precision, 2 bytes per weight). You'll see this as the baseline for each model.
Quantization reduces the number of bits used to represent each weight, shrinking the memory requirement and increasing inference speed, at the cost of some accuracy.
Not all quantization methods are equal. Some clever methods retain performance at highly reduced bit precision.
BF16 vs. GPTQ vs. AWQ vs. GGUF
You'll see these acronyms a lot when model shopping. Here's what they mean:
- BF16: plain and simple. 2 bytes per parameter. A 70B-parameter model will cost you 140GB of VRAM. This is the unquantized baseline.
- GPTQ: stands for "Generative Pre-trained Transformer Quantization". Quantizes layer by layer using a greedy, error-aware approximation based on the Hessian of each layer's weights. Largely superseded by AWQ and the methods used in GGUF models (see below).
- AWQ: stands for "Activation-aware Weight Quantization". Quantizes weights using the magnitude of the activations (per channel) instead of the error.
- GGUF: not a quantization method at all; it's an LLM container format popularized by llama.cpp, inside which you'll find some of the following quantization methods:
  - K-quants: named by bits-per-weight and method, e.g. Q4_K_M/Q4_K_S
  - I-quants: newer variants that push precision at lower bitrates (4-bit and below)
Here's a rough guide to what quantization does to performance:
| Precision | Bits per weight | VRAM for 70B | Performance |
|---|---|---|---|
| FP16 / BF16 | 16 | ~140 GB | Baseline (100%) |
| Q8 (INT8) | 8 | ~70 GB | ~99–99.5% of FP16 |
| Q5_K_M | 5.5 (mixed) | ~49 GB | ~97–98% |
| Q4_K_M | 4.5 (mixed) | ~42 GB | ~95–97% |
| Q3_K_M | 3.5 (mixed) | ~33 GB | ~90–94% |
| Q2_K | 2.5 (mixed) | ~23 GB | ~80–88% — noticeable degradation |
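You can sanity-check the VRAM column yourself: a back-of-envelope estimate is just parameters × bits-per-weight ÷ 8. The 10% overhead factor below is my own fudge for runtime buffers, and the KV cache is deliberately excluded.

```python
def vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 0.10) -> float:
    """Rough VRAM estimate for model weights: params × bits/8, plus a fudge
    factor for runtime buffers. The KV cache is NOT included."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return round(bytes_total * (1 + overhead) / 1e9, 1)

print(vram_gb(70, 16, overhead=0))  # 140.0 -> matches the BF16 row
print(vram_gb(70, 4.5))             # 43.3  -> roughly the Q4_K_M row
print(vram_gb(27, 4.5))             # 16.7  -> a 27B Q4_K_M fits a 48GB card easily
```

The published numbers in the table differ slightly because real quants mix precisions across layers, but this gets you within a few GB.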
Where quantization really hurts
Not all tasks degrade equally. The things most affected by aggressive quantization (Q3 and below):
- Precise numerical computation: if your agent needs to do exact arithmetic in-weights (as opposed to via tool calls), lower precision hurts
- Rare/specialized knowledge recall: the "long tail" of a model's knowledge is stored in less-activated weights, which are the first to lose fidelity
- Very long chain-of-thought sequences: small errors compound over extended reasoning chains
- Structured output reliability: at Q3 and below, JSON schema compliance and tool-call formatting begin to degrade. This is a killer for agent pipelines
💡Protip: Stick to Q4_K_M and above for agents. Any lower, and long-context reasoning and output-reliability issues put agent tasks at risk.
🛠️Hardware
GPUs (Accelerators)
Although more GPU types are available, the landscape across AWS, GCP, and Azure can mostly be distilled into the following options, especially for single-machine, single-GPU deployments:
| GPU | Architecture | VRAM |
|---|---|---|
| H100 | Hopper | 80GB |
| A100 | Ampere | 40GB/80GB |
| L40S | Ada Lovelace | 48GB |
| L4 | Ada Lovelace | 24GB |
| A10/A10G | Ampere | 24GB |
| T4 | Turing | 16GB |
The best tradeoffs between performance and cost exist in the L4, L40S, and A100 range, with the A100 providing the best performance (in terms of model capability and multi-user agentic workloads). If your agent tasks are simple and require less throughput, it's safe to downgrade to an L4/A10. Don't upgrade to the H100 unless you need it.
The 48GB of VRAM provided by the L40S gives us plenty of model options. We won't get the throughput of the A100, but we'll save on hourly cost.
For the sake of simplicity, I'm going to frame the rest of this discussion around this GPU. If you decide that your needs are different (less/more), the decisions I outline below will help you navigate model selection, instance selection, and cost optimization.
Note about GPU selection: even if you have your heart set on an A100, and the budget to pay for it, cloud capacity may restrict you to another instance/GPU type unless you're willing to purchase "Capacity Blocks" [AWS] or "Reservations" [GCP].
Quick decision checkpoint
If you're deploying your first self-hosted LLM:
| Situation | Advice |
|---|---|
| experimenting | L4 / A10 |
| production agents | L40S |
| high concurrency | A100 |
Recommended Instance Types
I've compiled a non-exhaustive list of instance types across the big three to help narrow down virtual machine choices.
Note: all pricing information was sourced in March 2026.
AWS
AWS lacks single-GPU A100 instance options and is more geared toward large multi-GPU workloads. That said, if you want to purchase reserved Capacity Blocks, they offer a p5.4xlarge with a single H100. They also have a large block of L40S instance types, which are prime for spot instances running predictable/scheduled agentic workloads.
Click to reveal instance types
| Instance | GPU | VRAM | vCPU | RAM | On-demand $/hr |
|---|---|---|---|---|---|
| `g4dn.xlarge` | 1x T4 | 16 GB | 4 | 16 GB | ~$0.526 |
| `g5.xlarge` | 1x A10G | 24 GB | 4 | 16 GB | ~$1.006 |
| `g5.2xlarge` | 1x A10G | 24 GB | 8 | 32 GB | ~$1.212 |
| `g6.xlarge` | 1x L4 | 24 GB | 4 | 16 GB | ~$0.805 |
| `g6e.xlarge` | 1x L40S | 48 GB | 4 | 32 GB | ~$1.861 |
| `p5.4xlarge` | 1x H100 | 80 GB | 16 | 256 GB | ~$6.88 |
Google Cloud Platform
Unlike AWS, GCP offers single-GPU A100 instances. This makes a2-ultragpu-1g the most cost-effective option for running 70B models on a single machine. You pay only for what you use.
Click to reveal instance types
| Instance | GPU | VRAM | On-demand $/hr |
|---|---|---|---|
| `g2-standard-4` | 1x L4 | 24 GB | ~$0.72 |
| `a2-highgpu-1g` | 1x A100 (40GB) | 40 GB | ~$3.67 |
| `a2-ultragpu-1g` | 1x A100 (80GB) | 80 GB | ~$5.07 |
| `a3-highgpu-1g` | 1x H100 (80GB) | 80 GB | ~$7.20 |
Azure
Azure has the most limited set of single-GPU instances, so you're pretty much locked into the Standard_NC24ads_A100_v4, which gives you an A100 for ~$3.67 per hour, unless you want to go with a smaller model.
Click to reveal instance types
| Instance | GPU | VRAM | On-demand $/hr | Notes |
|---|---|---|---|---|
| `Standard_NC4as_T4_v3` | 1x T4 | 16 GB | ~$0.526 | Dev/test |
| `Standard_NV36ads_A10_v5` | 1x A10 | 24 GB | ~$1.80 | Note: A10 (not A10G), slightly different specs |
| `Standard_NC24ads_A100_v4` | 1x A100 (80GB) | 80 GB | ~$3.67 | Strong single-GPU option |
‼️Important: Don't downplay the KV Cache
The key-value (KV) cache is a major component when sizing VRAM requirements for LLMs.
Remember: LLMs are large transformer-based models. A transformer layer computes attention using queries (Q), keys (K), and values (V). During generation, each new token must attend to all previous tokens. Without caching, the model would need to recompute the keys and values for the entire sequence at every step.
By caching (storing) the attention keys and values in VRAM, long contexts become feasible, since the model doesn't need to recompute keys and values, taking each generation step from O(T²) to O(T).
Agents need to deal with long contexts. This means that even when the model we choose fits within VRAM, we also need to ensure there's sufficient capacity for the KV cache.
Example: a quantized 32B model might occupy around 20-25 GB of VRAM, but the KV cache for several concurrent requests at an 8K or 16K context can add another 10-20 GB. This is why GPUs with 48 GB or more memory are typically recommended for production inference of mid-size models with longer contexts.
💡Protip: Along with serving models with a paged KV cache (discussed below), allocate an additional 30-40% of the model's VRAM requirements for the KV cache.
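As a sketch of the math: the cache stores one key and one value vector per layer, per token. The layer/head/dim numbers below are illustrative for a 30B-class model with grouped-query attention, not the real config of any specific model.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, n_seqs, bytes_per_elem=2):
    """KV cache size: 2 (K and V) × layers × kv_heads × head_dim × bytes per token,
    scaled by context length and number of concurrent sequences."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * n_seqs / 1e9

# Illustrative 30B-class config (GQA with 8 KV heads), FP16 cache:
gb = kv_cache_gb(n_layers=48, n_kv_heads=8, head_dim=128, context_len=16384, n_seqs=4)
print(f"{gb:.1f} GB")  # ~12.9 GB for 4 concurrent 16K-token sequences
```

Notice how quickly concurrency and context length multiply: the same model serving 8 agents at 32K context would need four times that.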
💾Models
So now we know:
- the VRAM limits
- the quantization target
- the benchmarks that matter
That narrows the model field from hundreds to just a handful.
In the previous section, we selected the L40S as the GPU, giving us instances at a reasonable price point (especially spot instances on AWS). This puts us at a cap of 48GB of VRAM. Remembering the importance of the KV cache limits us to models that fit into ~28GB of VRAM (saving 20GB for multiple agents caching long context windows).
With Q4_K_M quantization, this puts us in range of some very capable models.
I've included links to the models directly on Hugging Face. You'll notice that Unsloth is the provider of the quants. Unsloth does very detailed evaluation and heavy testing of their quants. As a result, they've become a community favorite. But feel free to use any quant provider you prefer.
🥇Top Rank: Qwen3.5-27B
Developed by Alibaba as part of the Qwen3.5 model family.
This 27B model is a dense hybrid transformer architecture optimized for long-context reasoning and agent workflows.
Qwen3.5 uses a Gated DeltaNet + Gated Attention hybrid to maintain long context while preserving reasoning ability and minimizing the cost (in VRAM).
The 27B version gives us similar mechanics to the frontier model and preserves its reasoning, giving it outstanding performance on tool-calling, SWE, and agent benchmarks.
Curiously, the 27B version performs slightly better than the 32B version.
Link to the Q4_K_M quant:
https://huggingface.co/unsloth/Qwen3.5-27B-GGUF?show_file_info=Qwen3.5-27B-Q4_K_M.gguf
🥈Solid Contender: GLM 4.7 Flash
GLM-4.7-Flash, from Z.ai, is a 30-billion-parameter language model that activates only a small subset of its parameters per token (~3B active).
Its architecture supports very long context windows (up to ~128K-200K tokens), enabling extended reasoning over large inputs such as long documents, codebases, or multi-turn agent workflows.
It comes with turn-based "thinking modes" that support more efficient agent-level reasoning: toggle off for quick tool executions, toggle on for extended reasoning over code or tool results.
Link to the Q4_K_M quant
https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF?show_file_info=GLM-4.7-Flash-Q4_K_M.gguf
👌Worth checking: GPT-OSS-20B
OpenAI's open-weight models, released in 120B-param and 20B-param versions, are still competitive despite being over a year old. They consistently perform better than Mistral, and the 20B version (quantized) fits well within our VRAM limit.
It supports configurable reasoning levels (low/medium/high), so you can trade off speed versus depth of reasoning. GPT-OSS-20B also exposes its full chain-of-thought, which makes debugging and introspection easier.
It's a solid choice for agentic AI tasks. You won't get the same performance as OpenAI's frontier models, but its benchmark performance along with a low memory requirement still warrants a test.
Link to the GGUF quants:
https://huggingface.co/unsloth/gpt-oss-20b-GGUF
Remember: even if you're running your own model, you can still use frontier models
This is a smart agentic pattern. If you have a dynamic graph of agent actions, you can turn on the expensive API (Claude 4.6 Opus or GPT 5.4) for your complex subgraphs, or for tasks that require frontier-model-level visual reasoning.
Compress a summary of your entire agent graph using your local LLM to minimize input tokens, and be sure to set the maximum output length when calling the frontier API to minimize costs.
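A sketch of what that routing can look like. The endpoint names, model IDs, task fields, and depth threshold below are all placeholders for your own graph logic:

```python
# Hypothetical routing rule: names, thresholds, and model IDs are illustrative.
LOCAL = {"base_url": "http://localhost:8000/v1", "model": "qwen3.5-27b"}
FRONTIER = {"base_url": "https://api.anthropic.com", "model": "claude-4.6-opus"}

def pick_endpoint(task: dict) -> dict:
    """Send visual reasoning or deep subgraphs to the frontier API, everything else local."""
    if task.get("needs_vision") or task.get("subgraph_depth", 0) > 3:
        return FRONTIER
    return LOCAL

assert pick_endpoint({"subgraph_depth": 1}) == LOCAL
assert pick_endpoint({"needs_vision": True}) == FRONTIER
```

Because both endpoints speak an OpenAI-style API (see the deployment patterns below), the only thing that changes per task is the base URL and model name.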
🚀Deployment
I'm going to introduce two patterns: the first is for evaluating your model in a non-production mode, the second is for production use.
Pattern #1: Evaluate with Ollama
Ollama is the docker run of LLM inference. It wraps llama.cpp in a clean CLI and REST API, handles model downloads, and just works. It's perfect for local dev and evaluation: you can have an OpenAI-compatible API serving your model in under 10 minutes.
Setup

```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull qwen3.5:27b
ollama run qwen3.5:27b
```
As mentioned, Ollama exposes an OpenAI-compatible API right out of the box. Hit it at http://localhost:11434/v1:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required by the SDK, but ignored by Ollama
)

response = client.chat.completions.create(
    model="qwen3.5:27b",
    messages=[
        {"role": "system", "content": "You are a paranoid android."},
        {"role": "user", "content": "Determine when the singularity will eventually consume us"}
    ]
)

print(response.choices[0].message.content)
```
You can always just build llama.cpp from source directly (with the GPU flags on), which is also fine for evals. Ollama just simplifies it.
Pattern #2: Production with vLLM
vLLM is ideal because it automatically handles KV caching via PagedAttention. Naive KV cache management leads to memory underutilization via fragmentation; PagedAttention instead allocates the cache in fixed-size pages (much like virtual memory in an OS), keeping VRAM fragmentation to a minimum.
While tempting, don't use Ollama for production. Use vLLM, as it's much better suited to concurrency and monitoring.
Setup

```shell
# Install vLLM (CUDA required)
pip install vllm

# Serve a model with the OpenAI-compatible API server
vllm serve Qwen/Qwen3.5-27B-GGUF \
    --dtype auto \
    --quantization gguf \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --port 8000 \
    --api-key your-secret-key
```
Key configuration flags:

| Flag | What it does | Guidance |
|---|---|---|
| `--max-model-len` | Maximum sequence length (input + output tokens) | Set this to the max you actually need, not the model's theoretical max. 32K is a good default. Setting it to 128K will reserve an enormous KV cache. |
| `--gpu-memory-utilization` | Fraction of GPU memory vLLM can use | 0.90 is aggressive but fine for dedicated inference machines. Lower to 0.85 if you see OOM errors. |
| `--quantization` | Tells vLLM which quantization format to use | Must match the model format you downloaded. |
| `--tensor-parallel-size N` | Shard the model across N GPUs | For single-GPU, omit or set to 1. For multi-GPU on a single machine, set to the number of GPUs. |
Monitoring:
vLLM exposes a /metrics endpoint compatible with Prometheus:

```yaml
# prometheus.yml scrape config
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
```
Key metrics to watch:
- vllm:num_requests_running: current concurrent requests
- vllm:num_requests_waiting: requests queued (if consistently > 0, you need more capacity)
- vllm:gpu_cache_usage_perc: KV cache utilization (high values = approaching memory limits)
- vllm:avg_generation_throughput_toks_per_s: your actual throughput
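You can also poll /metrics yourself and turn those signals into alerts. A minimal sketch; the thresholds here are rules of thumb, not vLLM recommendations, and the parser ignores metric labels for brevity:

```python
def parse_metrics(text: str) -> dict:
    """Parse a Prometheus exposition payload into {metric_name: value} (labels ignored)."""
    out = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, _, value = line.rpartition(" ")
            out[name.split("{")[0]] = float(value)
    return out

def capacity_alerts(metrics: dict) -> list:
    """Flag the two overload signals called out above."""
    alerts = []
    if metrics.get("vllm:num_requests_waiting", 0) > 0:
        alerts.append("requests queueing: add capacity")
    if metrics.get("vllm:gpu_cache_usage_perc", 0) > 0.9:
        alerts.append("KV cache nearly full: lower max-model-len or concurrency")
    return alerts

sample = "vllm:num_requests_waiting 3\nvllm:gpu_cache_usage_perc 0.95\n"
print(capacity_alerts(parse_metrics(sample)))
```

In practice you'd wire the same thresholds into Prometheus alerting rules rather than a script, but the logic is identical.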
🤩Zero switch costs?
Yep.
You use OpenAI's API:
vLLM's API is fully OpenAI-compatible.
You do need to launch vLLM with tool calling explicitly enabled. You also need to specify a parser so vLLM knows how to extract tool calls from the model's output (e.g., llama3_json, hermes, mistral).
For Qwen3.5, add the following flags when running vLLM:

```shell
--enable-auto-tool-choice
--tool-call-parser qwen3_xml
--reasoning-parser qwen3
```
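With those flags set, responses come back in the standard OpenAI tool_calls shape, so existing dispatch code keeps working unchanged. A minimal sketch: the get_weather tool is hypothetical, and `msg` mirrors the shape of an assistant message in a /v1/chat/completions response.

```python
import json

def get_weather(location: str) -> dict:  # hypothetical tool
    return {"location": location, "temp_c": 21}

DISPATCH = {"get_weather": get_weather}

def execute_tool_calls(message: dict) -> list:
    """Execute each OpenAI-format tool call and return tool-role messages
    ready to append to the conversation."""
    results = []
    for call in message.get("tool_calls", []):
        fn = DISPATCH[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results

# Shape of an assistant message as returned by the OpenAI-compatible endpoint:
msg = {"tool_calls": [{"id": "call_1", "function": {
    "name": "get_weather", "arguments": '{"location": "NYC"}'}}]}
print(execute_tool_calls(msg))
```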
You use Anthropic's API:
We need to add one more, somewhat hacky, step: a LiteLLM proxy acting as a "phantom Claude" to handle Anthropic-formatted requests.
LiteLLM acts as a translation layer. It intercepts the Anthropic-formatted requests (e.g., Messages API, tool_use blocks) and converts them into the OpenAI format that vLLM expects, then maps the response back so your Anthropic client never knows the difference.
Note: add this proxy on the machine/container that actually runs your agents, not on the LLM host.
Configuration is simple:

```yaml
model_list:
  - model_name: claude-local      # The name your Anthropic client will use
    litellm_params:
      model: openai/qwen3.5-27b   # Tells LiteLLM to use the OpenAI-compatible adapter
      api_base: http://yourvllm-server:8000/v1  # Where you're serving vLLM
      api_key: sk-1234
```
Run LiteLLM:

```shell
pip install 'litellm[proxy]'
litellm --config config.yaml --port 4000
```
Changes to your source code (example call with Anthropic's API):

```python
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:4000",  # Point to the LiteLLM proxy
    api_key="sk-1234"                  # Must match your LiteLLM master key
)

response = client.messages.create(
    model="claude-local",  # proxied model
    max_tokens=1024,
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=[{
        "name": "get_weather",
        "description": "Get current weather",
        "input_schema": {
            "type": "object",
            "properties": {"location": {"type": "string"}}
        }
    }]
)

# LiteLLM translates vLLM's response back into an Anthropic ToolUseBlock
print(response.content[0].name)  # e.g. 'get_weather'
```
What if I don't want to use Qwen?
Going rogue, fair enough.
Just make sure the arguments for --tool-call-parser, --reasoning-parser, and --quantization match the model you're using.
Since you're using LiteLLM as a gateway for an Anthropic client, remember that Anthropic's SDK expects a very specific structure for "thinking" vs. "tool use" blocks. When all else fails, pipe everything to stdout and inspect where the error is.
🤑How much is this going to cost?
A typical production agent system can consume:
200M-500M tokens/month
At API pricing, that typically lands between:
$2,000 - $8,000 per month
As mentioned, cost scalability is key. I'm going to provide two realistic scenarios, with monthly token estimates taken from real-world production workloads.
Scenario 1: Mid-size team, multi-agent production workload
Setup: Qwen 3.5 72B (Q4_K_M) on a GCP a2-ultragpu-1g (1x A100 80GB)
| Cost component | Monthly cost |
|---|---|
| Instance (on-demand, 24/7) | $5.07/hr × 730 hrs = $3,701 |
| Instance (1-year committed use) | ~$3.25/hr × 730 hrs = $2,373 |
| Instance (3-year committed use) | ~$2.28/hr × 730 hrs = $1,664 |
| Storage (1 TB SSD) | ~$80 |
| Total (1-year committed) | ~$2,453/mo |
Comparable API cost: 20 agents running production workloads, averaging 500K tokens/day:
- 500K × 30 = 15M tokens/month per agent × 20 agents = 300M tokens/month
- At ~$9/M tokens: ~$2,700/mo
Nearly equivalent on cost, but with self-hosting you also get: no rate limits, no data leaving your VPC, sub-20ms first-token latency (vs. 200-500ms API round-trip), and the ability to fine-tune.
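A quick way to find your own crossover point. This sketch assumes 730 hours/month of always-on serving and a flat storage cost; the inputs below are the Scenario 1 numbers (1-year committed A100 at ~$3.25/hr vs. ~$9/M API tokens):

```python
def breakeven_tokens_per_month(gpu_hourly: float, api_per_million: float,
                               storage_monthly: float = 80.0) -> float:
    """Monthly token volume above which always-on self-hosting beats API pricing."""
    monthly_infra = gpu_hourly * 730 + storage_monthly
    return monthly_infra / api_per_million * 1e6

be = breakeven_tokens_per_month(3.25, 9.0)
print(f"{be / 1e6:.1f}M tokens/month")  # 272.5M tokens/month
```

That matches Scenario 1: at 300M tokens/month, the workload sits just past break-even. Cheaper GPUs pull the crossover down dramatically; plug in an L4 at ~$0.72/hr and it lands well under 100M tokens/month.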
Scenario 2: Research team, experimentation and evaluation
Setup: Multiple models on a spot-instance A100, running 10 hours/day on weekdays
| Cost component | Monthly cost |
|---|---|
| Instance (spot, ~10hr/day × 22 days) | ~$2.00/hr × 220 hrs = $440 |
| Storage (2 TB SSD for multiple models) | ~$160 |
| Total | ~$600/mo |
This gives you unlimited experimentation: swap models, test quantization levels, and run evals for the price of a moderately heavy API bill.
Always be optimizing
- Use spot instances and make your agents "reschedulable" or "interruptible": LangChain provides built-ins for this. That way, if you're ever evicted, your agent can resume from a checkpoint when the instance restarts. Implement a health check via AWS Lambda (or similar) to restart the instance when it stops.
- If your agents don't need to run overnight, schedule stops and starts with cron or any other scheduler.
- Consider committed-use/reserved instances. If you're a startup planning to offer AI-based services into the future, this alone can give you considerable cost savings.
- Monitor your vLLM usage metrics. Check for signals of being overprovisioned (queued requests, utilization). If you're only using 30% of your capacity, downgrade.
✅Wrapping things up
Self-hosting an LLM is no longer a giant engineering effort; it's a practical, well-understood deployment pattern. The open-weight model ecosystem has matured to the point where models like Qwen 3.5 and GLM-4.7 rival frontier APIs on the tasks that matter most for agents: tool calling, instruction following, code generation, and multi-turn reasoning.
Remember:
- Pick your model based on agentic benchmarks (BFCL, τ-bench, SWE-bench, IFEval), not general leaderboard rankings.
- Quantize to Q4_K_M for the perfect balance of quality and VRAM efficiency. Don’t go below Q3 for production agents.
- Use vLLM for production inference
- GCP's single-GPU A100 instances are currently the best value for 70B-class models. For 32B-class models, the L40S, L40, L4, and A10 are capable alternatives.
- The cost crossover from API to self-hosted happens at roughly 40-100M tokens/month, depending on the model and instance type. Beyond that, self-hosting is both cheaper and more capable.
- Start simple. Single machine, single GPU, one model, vLLM, systemd. Get it running, validate your agent pipeline E2E, optimize.
Enjoy!
