Self-Hosting Your First LLM

Agents finally work.

They call tools, reason through workflows, and actually complete tasks.

Then the first real API bill arrives.

For a lot of teams, that’s the moment the question appears:

“Should we just run this ourselves?”

The good news is that self-hosting an LLM is no longer a research project or a large ML infrastructure effort. With the right model, the right GPU, and a couple of battle-tested tools, you can run a production-grade LLM on a single machine you control.

You’re probably here because one of these happened:

Your OpenAI or Anthropic bill exploded

You can’t send sensitive data outside your VPC

Your agent workflows burn millions of tokens per day

You want custom behavior from your AI and the prompts aren’t cutting it.

If that’s you, perfect. If not, you’re still perfect 🤗

In this article, I’ll walk you through a practical playbook for deploying an LLM on your own infrastructure, including how models were evaluated and chosen, which instance types were evaluated and chosen, and the reasoning behind those decisions.

I’ll also give you a zero-switching-cost deployment pattern for your own LLM that works whether your codebase targets OpenAI or Anthropic.

By the end of this guide you’ll know:

  1. Which benchmarks actually matter for LLMs that need to solve and reason through agentic problems, not regurgitate the latest string theorem
  2. What it means to quantize and how it affects performance
  3. Which instance types/GPUs can be used for single-machine hosting [1]
  4. Which models to use [2]
  5. How to use a self-hosted LLM without having to rewrite an existing API-based codebase
  6. How to make self-hosting cost-effective [3]

[1] Instance types were evaluated across the “big three”: AWS, Azure, and GCP

[2] All models are current as of March 2026

[3] All pricing data is current as of March 2026

Note: this guide focuses on deploying agent-oriented LLMs — not general-purpose, trillion-parameter, all-encompassing frontier models, which are largely overkill for agent use cases.


✋Wait…why would I host my own LLM again?

+++ Privacy

This is most likely why you’re here. Sensitive data — patient health records, proprietary source code, user data, financial records, RFPs, or internal strategy documents that can never leave your firewall.

Self-hosting removes the dependency on third-party APIs and mitigates the risk of a breach, or of data being retained/logged in ways that violate strict privacy policies.

++ Cost Predictability

API pricing scales linearly with usage. For agent workloads, which tend to sit at the high end of the token spectrum, operating your own GPU infrastructure introduces economies of scale. This is especially important if you plan on running agent reasoning across a medium-to-large company (20-30+ agents) or providing agents to customers at any kind of scale.

+ Performance

Remove round-trip API calls, get reasonable tokens-per-second, and increase capacity as necessary with spot-instance elastic scaling.

+ Customization

Methods like LoRA and QLoRA (not covered in detail here) can be used to fine-tune an LLM’s behavior or adapt its alignment: abliterating, enhancing or tailoring tool usage, adjusting response style, or fine-tuning on domain-specific data.

This is crucially useful for building custom agents or offering AI services that require specific behavior or style tuned to a use case, rather than generic instruction alignment via prompting.

An aside on finetuning

Methods such as LoRA/QLoRA, model ablation (“abliteration”), realignment techniques, and response stylization are technically complex and outside the scope of this guide. Nevertheless, self-hosting is often the first step toward exploring deeper customization of LLMs.

Why a single machine?

It’s not a hard requirement; it’s more for simplicity. Deploying on a single machine with a single GPU is relatively easy. A single machine with multiple GPUs is doable with the right configuration decisions.

However, debugging distributed inference across many machines can be nightmarish.

This is your first self-hosted LLM. To simplify the process, we’re going to focus on a single machine and a single GPU. As your inference needs grow, or if you need more performance, scale up on a single machine. Then, as you mature, you can start tackling multi-machine or Kubernetes-style deployments.


👉Which Benchmarks Actually Matter?

The LLM benchmark landscape is noisy. There are dozens of leaderboards, and most of them are irrelevant for our use case. We need to prune these benchmarks down to find LLMs which excel at agent-style tasks.

Specifically, we’re looking for LLMs that can:

  1. Follow complex, multi-step instructions
  2. Use tools reliably: call functions with well-formed arguments, interpret results, and judge what to do next
  3. Reason with constraints: reason over potentially incomplete information without hallucinating a confident but wrong answer
  4. Write and understand code: we don’t need to solve expert-level SWE problems, but interacting with APIs and being able to generate code on the fly expands the action space and translates into better tool usage

Here are the benchmarks to really pay attention to:

| Benchmark | Description | Why? |
| --- | --- | --- |
| Berkeley Function Calling Leaderboard (BFCL v3) | Accuracy of function/tool calling across simple, parallel, nested, and multi-step invocations | Directly tests the capability your agents depend on most: structured tool use |
| IFEval (Instruction Following Eval) | Strict adherence to formatting, constraint, and structural instructions | Agents need strict adherence to instructions |
| τ-bench (Tau-bench) | E2E agent task completion in simulated environments | Measures real agentic competence: can this LLM actually accomplish a goal over multiple turns? |
| SWE-bench Verified | Ability to resolve real GitHub issues from popular open-source repos | If your agents write or modify code, this is the gold standard. The “Verified” subset filters out ambiguous or poorly specified issues |
| WebArena / VisualWebArena | Task completion in realistic web environments | Super useful if your agent needs to use a WebUI |

Note: unfortunately, getting reliable benchmark scores on all of these, especially for quantized models, is difficult. You’re going to have to use your best judgment, assuming that the full-precision model degrades according to the table outlined below.

🤖Quantization

This is in no way, shape, or form meant to be the exhaustive guide to quantization. My goal is to give you enough information to let you navigate HuggingFace without coming out cross-eyed.

The fundamentals

A model’s parameters are stored as numbers. At full precision (FP32), each weight is a 32-bit floating-point number: 4 bytes. Modern models are distributed at FP16 or BF16 (half precision, 2 bytes per weight). You will see this as the baseline for each model.

Quantization reduces the number of bits used to represent each weight, shrinking the memory requirement and increasing inference speed, at the cost of some accuracy.

Not all quantization methods are equal. There are some clever methods that retain performance with highly reduced bit precision.
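To make the tradeoff concrete, here’s a toy symmetric round-to-nearest int8 quantizer applied to one weight tensor. This is a far cruder scheme than GPTQ/AWQ/K-quants and is shown only to illustrate the mechanics (the tensor size and scale choice are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # fake FP32 weights

# Symmetric quantization: map [-max|w|, +max|w|] onto the int8 range
scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)  # 1 byte/weight
w_hat = q.astype(np.float32) * scale                         # dequantize

print(q.nbytes / w.nbytes)                            # 0.25 -> 4x smaller
print(bool(np.abs(w - w_hat).max() <= scale / 2))     # True: error bounded by half a step
```

The real methods below are smarter about *which* weights get how much precision, which is why they lose far less accuracy than this naive approach would suggest.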

BF16 vs. GPTQ vs. AWQ vs. GGUF

You’ll see these acronyms a lot when model shopping. Here’s what they mean:

  • BF16: plain and simple. 2 bytes per parameter. A 70B-parameter model will cost you 140GB of VRAM. This is the unquantized baseline.
  • GPTQ: stands for “Generative Pre-trained Transformer Quantization”. Quantizes layer by layer using a greedy, “error-aware” approximation of the Hessian for each weight. Largely superseded by AWQ and the methods used in GGUF models (see below)
  • AWQ: stands for “Activation-aware Weight Quantization”. Quantizes weights using the magnitude of the activations (per channel) instead of the error.
  • GGUF: isn’t a quantization method at all; it’s an LLM container format popularized by llama.cpp, inside which you’ll find some of the following quantization methods:
    • K-quants: named by bits-per-weight and method, e.g. Q4_K_M/Q4_K_S.
    • I-quants: newer variants that push precision at lower bitrates (4-bit and below)

Here’s a guide to what quantization does to performance:

| Precision | Bits per weight | VRAM for 70B | Performance |
| --- | --- | --- | --- |
| FP16 / BF16 | 16 | ~140 GB | Baseline (100%) |
| Q8 (INT8) | 8 | ~70 GB | ~99–99.5% of FP16 |
| Q5_K_M | 5.5 (mixed) | ~49 GB | ~97–98% |
| Q4_K_M | 4.5 (mixed) | ~42 GB | ~95–97% |
| Q3_K_M | 3.5 (mixed) | ~33 GB | ~90–94% |
| Q2_K | 2.5 (mixed) | ~23 GB | ~80–88% — noticeable degradation |
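The VRAM column above is roughly parameters × bits-per-weight ÷ 8, plus runtime overhead. A quick sketch (the flat 5% overhead factor is an assumption for illustration, not a measured value):

```python
# Weights-only VRAM estimate: params * bits / 8, plus overhead.
# Excludes the KV cache, which is sized separately.
def model_vram_gb(params_billion: float, bits_per_weight: float,
                  overhead: float = 1.05) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

print(round(model_vram_gb(70, 16)))   # 147 -> close to the ~140 GB BF16 row
print(round(model_vram_gb(70, 4.5)))  # 41  -> close to the ~42 GB Q4_K_M row
```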

Where quantization really hurts

Not all tasks degrade equally. The things most affected by aggressive quantization (Q3 and below):

  • Precise numerical computation: if your agent needs to do exact arithmetic in-weights (versus via tool calls), lower precision hurts
  • Rare/specialized knowledge recall: the “long tail” of a model’s knowledge is stored in less-activated weights, which are the first to lose fidelity
  • Very long chain-of-thought sequences: small errors compound over extended reasoning chains
  • Structured output reliability: at Q3 and below, JSON schema compliance and tool-call formatting begin to degrade. This is a killer for agent pipelines

💡Protip: Stick to Q4_K_M and above for agents. Any lower, and long-context reasoning and output-reliability issues put agent tasks at risk.

🛠️Hardware

Finally, Santa has delivered a capacity-block-free A100 instance with 80GB VRAM. Imagined by ChatGPT

GPUs (Accelerators)

Although more GPU types are available, the landscape across AWS, GCP, and Azure can be mostly distilled into the following options, especially for single-machine, single-GPU deployments:

| GPU | Architecture | VRAM |
| --- | --- | --- |
| H100 | Hopper | 80GB |
| A100 | Ampere | 40GB / 80GB |
| L40S | Ada Lovelace | 48GB |
| L4 | Ada Lovelace | 24GB |
| A10/A10G | Ampere | 24GB |
| T4 | Turing | 16GB |

The best tradeoffs between performance and cost exist in the L4, L40S, and A100 range, with the A100 providing the best performance (in terms of model capability and multi-user agentic workloads). If your agent tasks are simple and require less throughput, it’s safe to downgrade to the L4/A10. Don’t upgrade to the H100 unless you need it.

The 48GB of VRAM provided by the L40S gives us plenty of options for models. We won’t get the throughput of the A100, but we’ll save on hourly cost.

For the sake of simplicity, I’m going to frame the rest of this discussion around this GPU. If you determine that your needs are different (less/more), the decisions I outline below will help you navigate model selection, instance selection, and cost optimization.

Note about GPU selection: although you may have your heart set on an A100, and the funds to buy it, cloud capacity may restrict you to another instance/GPU type unless you’re willing to buy “Capacity Blocks” [AWS] or “Reservations” [GCP].

Quick decision checkpoint

If you’re deploying your first self-hosted LLM:

| Situation | Advice |
| --- | --- |
| Experimenting | L4 / A10 |
| Production agents | L40S |
| High concurrency | A100 |

Really helpful Instance Types

I’ve compiled a non-exhaustive list of instance types across the big three which can help narrow down virtual machine types.

Note: all pricing information was sourced in March 2026.

AWS

AWS lacks many single-GPU instance options and is more geared toward large multi-GPU workloads. That being said, if you want to purchase reserved capacity blocks, they offer a p5.4xlarge with a single H100. They also have a large lineup of L40S instance types, which are prime candidates for spot instances running predictable/scheduled agentic workloads.

| Instance | GPU | VRAM | vCPU | RAM | On-demand $/hr |
| --- | --- | --- | --- | --- | --- |
| g4dn.xlarge | 1x T4 | 16 GB | 4 | 16 GB | ~$0.526 |
| g5.xlarge | 1x A10G | 24 GB | 4 | 16 GB | ~$1.006 |
| g5.2xlarge | 1x A10G | 24 GB | 8 | 32 GB | ~$1.212 |
| g6.xlarge | 1x L4 | 24 GB | 4 | 16 GB | ~$0.805 |
| g6e.xlarge | 1x L40S | 48 GB | 4 | 32 GB | ~$1.861 |
| p5.4xlarge | 1x H100 | 80 GB | 16 | 256 GB | ~$6.88 |

Google Cloud Platform

Unlike AWS, GCP offers single-GPU A100 instances. This makes a2-ultragpu-1g the most cost-effective option for running 70B models on a single machine. You pay only for what you use.

| Instance | GPU | VRAM | On-demand $/hr |
| --- | --- | --- | --- |
| g2-standard-4 | 1x L4 | 24 GB | ~$0.72 |
| a2-highgpu-1g | 1x A100 (40GB) | 40 GB | ~$3.67 |
| a2-ultragpu-1g | 1x A100 (80GB) | 80 GB | ~$5.07 |
| a3-highgpu-1g | 1x H100 (80GB) | 80 GB | ~$7.2 |

Azure

Azure has the most limited set of single-GPU instances, so you’re pretty much locked into the Standard_NC24ads_A100_v4, which gives you an A100 for ~$3.67 per hour, unless you want to go with a smaller model.

| Instance | GPU | VRAM | On-demand $/hr | Notes |
| --- | --- | --- | --- | --- |
| Standard_NC4as_T4_v3 | 1x T4 | 16 GB | ~$0.526 | Dev/test |
| Standard_NV36ads_A10_v5 | 1x A10 | 24 GB | ~$1.80 | A10 (not A10G), slightly different specs |
| Standard_NC24ads_A100_v4 | 1x A100 (80GB) | 80 GB | ~$3.67 | Strong single-GPU option |

‼️Necessary: Don’t downplay the KV Cache

The key–value (KV) cache is a major factor when sizing VRAM requirements for LLMs.

Remember: LLMs are large transformer-based models. A transformer layer computes attention using queries (Q), keys (K), and values (V). During generation, each new token must attend to all previous tokens. Without caching, the model would need to recompute the keys and values for the entire sequence at every step.

By caching (storing) the attention keys and values in VRAM, long contexts become feasible, since the model doesn’t need to recompute keys and values. This takes the per-token generation cost from O(T²) to O(T).

Agents have to deal with longer contexts. This means that even when the model we select fits within VRAM, we also need to ensure there’s sufficient capacity for the KV cache.

Example: a quantized 32B model might occupy around 20-25 GB of VRAM, but the KV cache for several concurrent requests at an 8K or 16K context can add another 10-20 GB. This is why GPUs with 48 GB or more memory are typically recommended for production inference of mid-size models with longer contexts.
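You can sanity-check numbers like these yourself. A rough sketch of per-request KV cache sizing, using illustrative layer/head counts for a 32B-class GQA model (not any specific release):

```python
# KV cache bytes = 2 (K and V) * layers * KV heads * head_dim
#                  * context tokens * concurrent requests * bytes/element
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, batch: int,
                 bytes_per_elem: int = 2) -> float:  # 2 bytes = FP16 cache
    total = 2 * layers * kv_heads * head_dim * context_tokens * batch * bytes_per_elem
    return total / 2**30

# Illustrative: 64 layers, 8 KV heads, head_dim 128, 16K context, 4 requests
print(f"~{kv_cache_gib(64, 8, 128, 16_384, 4):.0f} GiB")  # ~16 GiB
```

That lands squarely in the 10-20 GB range quoted above, and it scales linearly with both context length and concurrency.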

💡Protip: Along with serving models with a paged KV cache (discussed below), allocate an additional 30-40% of the model’s VRAM requirements for the KV cache.

💾Models

So now we know:

  • the VRAM limits
  • the quantization goal
  • the benchmarks that matter

That narrows the model field from hundreds to just a handful.

From the previous section, we selected the L40S as the GPU, giving us instances at a reasonable price point (especially spot instances from AWS). This puts us at a cap of 48GB VRAM. Remembering the importance of the KV cache limits us to models which fit into ~28GB VRAM (saving 20GB for multiple agents caching with long context windows).

With Q4_K_M quantization, this puts us within range of some very capable models.

I’ve included links to the models directly on HuggingFace. You’ll notice that Unsloth is the provider of the quants. Unsloth does very detailed evaluation and heavy testing of their quants; as a result, they’ve become a community favorite. But feel free to use any quant provider you prefer.

🥇Top Rank: Qwen3.5-27B

Developed by Alibaba as part of the Qwen3.5 model family.

This 27B model is a dense hybrid transformer architecture optimized for long-context reasoning and agent workflows.

Qwen 3.5 uses a Gated DeltaNet + gated-attention hybrid to maintain long context while preserving reasoning ability and minimizing the cost (in VRAM).

The 27B version gives us similar mechanics to the frontier model and preserves its reasoning, giving it outstanding performance on tool-calling, SWE, and agent benchmarks.

Fun fact: the 27B version performs slightly better than the 32B version.

Link to the Q4_K_M quant

https://huggingface.co/unsloth/Qwen3.5-27B-GGUF?show_file_info=Qwen3.5-27B-Q4_K_M.gguf

🥈Solid Contender: GLM 4.7 Flash

GLM‑4.7‑Flash, from Z.ai, is a 30-billion-parameter language model that activates only a small subset of its parameters per token (~3B active).

Its architecture supports very long context windows (up to ~128K–200K tokens), enabling extended reasoning over large inputs such as long documents, codebases, or multi‑turn agent workflows.

It comes with turn-based “thinking modes”, which support more efficient agent-level reasoning: toggle off for quick tool executions, toggle on for extended reasoning over code or interpreting results.

Link to the Q4_K_M quant

https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF?show_file_info=GLM-4.7-Flash-Q4_K_M.gguf

👌Worth checking: GPT-OSS-20B

OpenAI’s open-sourced models, in 120B- and 20B-parameter versions, are still competitive despite being released over a year ago. They consistently perform better than Mistral, and the 20B version (quantized) is well suited to our VRAM limit.

It supports configurable reasoning levels (low/medium/high), so you can trade off speed versus depth of reasoning. GPT‑OSS‑20B also exposes its full chain‑of‑thought reasoning, which makes debugging and introspection easier.

It’s a solid choice for agentic AI tasks. You won’t get the same performance as OpenAI’s frontier models, but benchmark performance along with a low memory requirement still warrants a test.

Link to the Q4_K_M quant

https://huggingface.co/unsloth/gpt-oss-20b-GGUF

Remember: even if you’re running your own model, you can still use frontier models

This is a great agentic pattern. If you have a dynamic graph of agent actions, you can turn on the expensive API for Claude 4.6 Opus or GPT 5.4 for your complex subgraphs, or for tasks that require frontier-level visual reasoning.

Compress the summary of your entire agent graph using your LLM to minimize input tokens, and be sure to set the maximum output length when calling the frontier API to minimize costs.
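A minimal sketch of that routing decision. The endpoint URLs, model names, and token caps below are placeholders, not real defaults:

```python
# Cost-aware router: self-hosted model by default, frontier API only for
# steps flagged as complex, with a tight output cap on frontier calls.
LOCAL = {"base_url": "http://localhost:8000/v1", "model": "qwen3.5-27b"}
FRONTIER = {"base_url": "https://api.openai.com/v1", "model": "gpt-5.4"}

def route(step: dict) -> dict:
    # Escalate only when the step needs frontier-level reasoning or vision
    needs_frontier = step.get("complexity") == "high" or step.get("vision", False)
    target = FRONTIER if needs_frontier else LOCAL
    # Capping max_tokens on frontier calls keeps the bill predictable
    return {**target, "max_tokens": 1024 if needs_frontier else 4096}

print(route({"complexity": "high"})["model"])  # gpt-5.4 (frontier)
print(route({"complexity": "low"})["model"])   # qwen3.5-27b (local)
```

Because both endpoints speak the same OpenAI-style API (see the deployment section below), switching targets is just a different base URL and model name.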

🚀Deployment

I’m going to introduce two patterns: the first is for evaluating your model in a non-production mode, the second is for production use.

Pattern 1: Evaluate with Ollama

Ollama is the docker run of LLM inference. It wraps llama.cpp in a clean CLI and REST API, handles model downloads, and just works. It’s perfect for local dev and evaluation: you can have an OpenAI-compatible API running with your model in under 10 minutes.

Setup

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull qwen3.5:27b
ollama run qwen3.5:27b

As mentioned, Ollama exposes an OpenAI-compatible API right out of the box. Hit it at http://localhost:11434/v1:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # required but unused
)

response = client.chat.completions.create(
    model="qwen3.5:27b",
    messages=[
        {"role": "system", "content": "You are a paranoid android."},
        {"role": "user", "content": "Determine when the singularity will eventually consume us"}
    ]
)

You can always just build llama.cpp from source directly (with the GPU flags on), which is also fine for evals. Ollama just simplifies it.

Pattern 2: Production with vLLM

vLLM is ideal because it automagically handles KV caching via PagedAttention. Naively managing the KV cache leads to memory underutilization through fragmentation; PagedAttention applies OS-style paging, originally designed for RAM, to the KV cache in VRAM to avoid exactly that.

While tempting, don’t use Ollama for production. Use vLLM, as it’s much better suited to concurrency and monitoring.

Setup

# Install vLLM (CUDA required)
pip install vllm

# Serve a model with the OpenAI-compatible API server
vllm serve Qwen/Qwen3.5-27B-GGUF \
  --dtype auto \
  --quantization k_m \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000 \
  --api-key your-secret-key

Key configuration flags:

| Flag | What it does | Guidance |
| --- | --- | --- |
| --max-model-len | Maximum sequence length (input + output tokens) | Set this to the max you actually need, not the model’s theoretical max. 32K is a good default. Setting it to 128K will reserve an enormous KV cache. |
| --gpu-memory-utilization | Fraction of GPU memory vLLM can use | 0.90 is aggressive but fine for dedicated inference machines. Lower to 0.85 if you see OOM errors. |
| --quantization | Tells vLLM which quantization format to use | Must match the model format you downloaded. |
| --tensor-parallel-size N | Shard the model across N GPUs | For single-GPU, omit or set to 1. For multi-GPU on a single machine, set to the number of GPUs. |

Monitoring:
vLLM exposes a /metrics endpoint compatible with Prometheus

# prometheus.yml scrape config
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

Key metrics to observe:

  • vllm:num_requests_running: current concurrent requests
  • vllm:num_requests_waiting: requests queued (if consistently > 0, you need more capacity)
  • vllm:gpu_cache_usage_perc: KV cache utilization (high values = approaching memory limits)
  • vllm:avg_generation_throughput_toks_per_s: your actual throughput

🤩Zero switching costs?

Yep.

You use OpenAI’s API:

The API that vLLM exposes is fully compatible.

You need to launch vLLM with tool calling explicitly enabled. You also need to specify a parser so vLLM knows how to extract the tool calls from the model’s output (e.g., llama3_json, hermes, mistral).

For Qwen3.5, add the following flags when running vLLM:

--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3

You use Anthropic’s API:

We need to add one more, somewhat hacky, step: a LiteLLM proxy acting as a “phantom Claude” to handle Anthropic-formatted requests.

LiteLLM will act as a translation layer. It intercepts the Anthropic-formatted requests (e.g., messages API, tool_use blocks) and converts them into the OpenAI format that vLLM expects, then maps the response back so your Anthropic client never knows the difference.

Note: add this proxy on the machine/container that actually runs your agents, not on the LLM host.

Configuration is simple:

model_list:
  - model_name: claude-local  # The name your Anthropic client will use
    litellm_params:
      model: openai/qwen3.5-27b    # Tells LiteLLM to use the OpenAI-compatible adapter
      api_base: http://yourvllm-server:8000/v1 # this is where you're serving vLLM
      api_key: sk-1234

Run LiteLLM

pip install 'litellm[proxy]'
litellm --config config.yaml --port 4000

Changes to your source code (example call with Anthropic’s API)

import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:4000", # Point to LiteLLM Proxy
    api_key="sk-1234"                 # Must match your LiteLLM master key
)

response = client.messages.create(
    model="claude-local", # proxied model
    max_tokens=1024,
    messages=[{"role": "user", "content": "What's the weather in NYC?"}],
    tools=[{
        "name": "get_weather",
        "description": "Get current weather",
        "input_schema": {
            "type": "object",
            "properties": {"location": {"type": "string"}}
        }
    }]
)

# LiteLLM translates vLLM's response back into an Anthropic ToolUseBlock
print(response.content[0].name) # Output: 'get_weather'

What if I don’t want to use Qwen?

Going rogue, fair enough.

Just make sure the arguments for --tool-call-parser, --reasoning-parser, and --quantization match the model you’re using.

Since you’re using LiteLLM as a gateway for an Anthropic client, remember that Anthropic’s SDK expects a very specific structure for “thinking” vs. “tool use.” When all else fails, pipe everything to stdout and inspect where the error is.

🤑How much is that this going to cost?

A typical production agent system can devour:

200M–500M tokens/month

At API pricing, that usually lands between:

$2,000 – $8,000 monthly

As mentioned, cost scalability is essential. I’m going to offer two realistic scenarios, with monthly token estimates taken from real-world production scenarios.

Scenario 1: Mid-size team, multi-agent production workload

Setup: Qwen 3.5 72B (Q4_K_M) on a GCP a2-ultragpu-1g (1x A100 80GB)

| Cost component | Monthly cost |
| --- | --- |
| Instance (on-demand, 24/7) | $5.07/hr × 730 hrs = $3,701 |
| Instance (1-year committed use) | ~$3.25/hr × 730 hrs = $2,373 |
| Instance (3-year committed use) | ~$2.28/hr × 730 hrs = $1,664 |
| Storage (1 TB SSD) | ~$80 |
| Total (1-year committed) | ~$2,453/mo |

Comparable API cost: 20 agents running production workloads, averaging 500K tokens/day:

  • 500K × 30 = 15M tokens/month per agent × 20 agents = 300M tokens/month
  • At ~$9/M tokens: ~$2,700/mo

Nearly equivalent on cost, but with self-hosting you also get: no rate limits, no data leaving your VPC, sub-20ms first-token latency (vs. 200–500ms API round-trip), and the ability to fine-tune.
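A back-of-envelope break-even sketch using the prices quoted in this article (and assuming a flat ~$9/M blended API rate, which is an assumption, not a quote):

```python
# At what monthly token volume does a dedicated instance beat per-token
# API pricing? Result is in millions of tokens per month.
def breakeven_mtok_per_month(instance_usd_per_hr: float,
                             api_usd_per_mtok: float,
                             hours: float = 730.0) -> float:
    return instance_usd_per_hr * hours / api_usd_per_mtok

print(round(breakeven_mtok_per_month(0.72, 9.0)))  # 58  -> $0.72/hr L4
print(round(breakeven_mtok_per_month(3.25, 9.0)))  # 264 -> committed-use A100
```

A small L4 box pays for itself around 58M tokens/month, while the committed-use A100 needs a few hundred million, which is why the crossover point depends so heavily on the instance you pick.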

Scenario 2: Research team, experimentation and evaluation

Setup: Multiple models on a spot-instance A100, running 10 hours/day on weekdays

| Cost component | Monthly cost |
| --- | --- |
| Instance (spot, ~10 hr/day × 22 days) | ~$2.00/hr × 220 hrs = $440 |
| Storage (2 TB SSD for multiple models) | ~$160 |
| Total | ~$600/mo |

This gives you unlimited experimentation: swap models, test quantization levels, and run evals for the price of a moderately heavy API bill.

Always be optimizing

  1. Use spot instances and make your agents “reschedulable” or “interruptible”: LangChain provides built-ins for this. That way, if you’re ever evicted, your agent can resume from a checkpoint when the instance restarts. Implement a health check via AWS Lambda or similar to restart the instance when it stops.
  2. If your agents don’t need to run overnight, schedule stops and starts with cron or any other scheduler.
  3. Consider committed-use/reserved instances. If you’re a startup planning to offer AI-based services into the future, this alone can give you considerable cost savings.
  4. Monitor your vLLM usage metrics. Check for signals of being overprovisioned (queued requests, utilization). If you’re only using 30% of your capacity, downgrade.
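A minimal sketch of the “resume from a checkpoint” idea in point 1. The checkpoint file name and step list are illustrative; a real agent would persist richer state (conversation history, tool results) rather than just a step counter:

```python
import json
import os

CKPT = "agent_checkpoint.json"  # illustrative path

def run(steps):
    # On restart, skip everything already completed before the eviction
    done = json.load(open(CKPT))["done"] if os.path.exists(CKPT) else 0
    for i in range(done, len(steps)):
        steps[i]()  # do the actual work (LLM call, tool call, ...)
        with open(CKPT, "w") as f:
            json.dump({"done": i + 1}, f)  # checkpoint after every step

run([lambda: print("step 1"), lambda: print("step 2")])
```

With this shape, a spot eviction costs you at most the in-flight step; the restarted instance picks up where the checkpoint left off.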

✅Wrapping things up

Self-hosting an LLM is no longer a massive engineering effort; it’s a practical, well-understood deployment pattern. The open-weight model ecosystem has matured to the point where models like Qwen 3.5 and GLM-4.7 rival frontier APIs on the tasks that matter most for agents: tool calling, instruction following, code generation, and multi-turn reasoning.

Remember:

  1. Pick your model based on agentic benchmarks (BFCL, τ-bench, SWE-bench, IFEval), not general leaderboard rankings.
  2. Quantize to Q4_K_M for the best balance of quality and VRAM efficiency. Don’t go below Q3 for production agents.
  3. Use vLLM for production inference.
  4. GCP’s single-GPU A100 instances are currently the best value for 70B-class models. For 32B-class models, the L40, L40S, L4, and A10 are capable alternatives.
  5. The cost crossover from API to self-hosted happens at roughly 40–100M tokens/month depending on the model and instance type. Beyond that, self-hosting is both cheaper and more capable.
  6. Start simple. Single machine, single GPU, one model, vLLM, systemd. Get it running, validate your agent pipeline E2E, then optimize.

Enjoy!
