Latest in llama.cpp: Model Management

By Xuan-Son Nguyen and Victor Mustar


llama.cpp server now ships with a router mode that lets you dynamically load, unload, and switch between multiple models without restarting.

Reminder: llama.cpp server is a lightweight, OpenAI-compatible HTTP server for running LLMs locally.

This feature was a popular request to bring Ollama-style model management to llama.cpp. It uses a multi-process architecture: each model runs in its own process, so if one model crashes, the others remain unaffected.



Quick Start

Start the server in router mode by not specifying a model:

llama-server

This auto-discovers models from your llama.cpp cache (LLAMA_CACHE or ~/.cache/llama.cpp). If you've previously downloaded models via llama-server -hf user/model, they'll be available automatically.
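For example, a model you pull once with -hf stays in the cache and is picked up by router mode afterwards. A rough sketch (the model name is simply the one used in the examples below):

# Download and serve a single model once; this stores it in the cache (Ctrl-C when done)
llama-server -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M

# Later, start in router mode; the cached model is discovered automatically
llama-server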

You can also point it at a local directory of GGUF files:

llama-server --models-dir ./my-models



Features

  1. Auto-discovery: Scans your llama.cpp cache (default) or a custom --models-dir folder for GGUF files
  2. On-demand loading: Models load automatically when first requested
  3. LRU eviction: When you hit --models-max (default: 4), the least-recently-used model is unloaded
  4. Request routing: The model field in your request determines which model handles it



Examples



Chat with a specific model

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

On the first request, the server automatically loads the model into memory (loading time depends on model size). Subsequent requests to the same model are fast since it's already loaded.



List available models

curl http://localhost:8080/models

Returns all discovered models with their status (loaded, loading, or unloaded).



Manually load a model

curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'



Unload a model to free VRAM

curl -X POST http://localhost:8080/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'



Key Options

Flag                    Description
--models-dir PATH       Directory containing your GGUF files
--models-max N          Max models loaded concurrently (default: 4)
--no-models-autoload    Disable auto-loading; require explicit /models/load calls (example below)
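For instance, with auto-loading disabled, nothing is loaded until you request it explicitly through the load endpoint. A minimal sketch, where the directory and model name are placeholders:

llama-server --models-dir ./my-models --no-models-autoload

# In another terminal: load a model explicitly before sending chat requests to it
curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'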

All model instances inherit settings from the router:

llama-server --models-dir ./models -c 8192 -ngl 99

All loaded models will use an 8192-token context and full GPU offload. You can also define per-model settings using presets:

llama-server --models-preset config.ini

where config.ini contains:

[my-model]
model = /path/to/model.gguf
ctx-size = 65536
temp = 0.7



Also available in the Web UI

The built-in web UI also supports model switching. Just select a model from the dropdown and it loads automatically.



Join the Conversation

We hope this feature makes it easier to A/B test different model versions, run multi-tenant deployments, or just switch models during development without restarting the server.
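As a rough sketch of the A/B case, you can send the same prompt to two models and let the router load each one on demand (the second model name below is just a placeholder):

for MODEL in "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M" "my-other-model.gguf"; do
  echo "=== $MODEL ==="
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}"
  echo
done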

Have questions or feedback? Drop a comment below or open an issue on GitHub.


