Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. Yet efficient systems for programming and executing these applications have been lacking. SGLang, a newly introduced system, aims to fill this gap by providing efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control, while the runtime accelerates execution through novel optimizations such as RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4× higher throughput than state-of-the-art inference systems on a range of large language and multimodal models, across tasks such as agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat.
Recent advances in LLM capabilities have expanded their utility, enabling them to handle a wider range of general tasks and to act as autonomous agents. In these applications, LLMs engage in multi-round planning, reasoning, and interaction with external environments. This is facilitated by tool use, multiple input modalities, and various prompting techniques such as few-shot learning, self-consistency, skeleton-of-thought, and tree-of-thought. These new use cases require multiple, often dependent, LLM generation calls, reflecting a trend toward multi-call structures for completing complex tasks.
This shift marks a transition from simple chatting to a more sophisticated programmatic use of LLMs, in which programs schedule and control the generation processes of LLMs. These programs are known as “Language Model Programs” (LM programs). Advanced prompting techniques and agentic workflows both fall within the scope of LM programs. LM programs share two common properties: (1) they typically involve multiple LLM calls interspersed with control flow to complete complex tasks and improve overall quality, and (2) they receive structured inputs and produce structured outputs, enabling the composition of LM programs and their integration into existing software systems.
In this article, we take a deeper dive into the SGLang framework, exploring its architecture, analyzing its performance, and comparing it against state-of-the-art frameworks. Let's get started.
Despite the widespread use of LM programs, current systems for expressing and executing them remain inefficient. SGLang identifies two primary challenges in using LM programs efficiently:
- Programming Complexity: Developing LM programs is tedious and difficult because of the non-deterministic nature of LLMs. It involves extensive string manipulation, experimental tuning of prompts, brittle output parsing, handling multiple input modalities, and implementing parallelism mechanisms. This complexity significantly reduces the readability of even simple programs.
- Execution Inefficiency: Executing LM programs is inefficient because of redundant computation and memory usage. State-of-the-art inference engines, optimized to reduce latency and improve throughput, have no direct knowledge of the workload, which leads to significant inefficiencies. A notable example is reuse of the key-value (KV) cache, which consists of reusable intermediate tensors essential for generative inference. Current systems lack effective mechanisms to reuse the KV cache across multiple LLM calls that share a common prefix, resulting in unnecessary computation and wasted memory. Moreover, constrained decoding for structured outputs, such as JSON mode, is suboptimal because existing systems decode only one token at a time.
To address these challenges, SGLang introduces a Structured Generation Language for LLMs. The core idea is to systematically exploit the multi-call structure of LM programs for efficient execution. As shown in the figure below, SGLang has two parts: a front-end language and a back-end runtime.
The front-end simplifies the programming of LM programs, and the runtime accelerates their execution. The two parts can work together for better performance or be used independently.
SGLang is a domain-specific language embedded in Python, providing primitives for generation (e.g., extend, gen, select) and parallelism control (e.g., fork, join). It is compatible with Python's control flow and libraries, so users can develop advanced prompting workflows with native Python syntax. SGLang includes an interpreter and a compiler. The interpreter manages the prompt state as a stream and submits primitive operations to the stream for asynchronous execution, ensuring proper synchronization and intra-program parallelism. In addition, SGLang programs can be traced and compiled for further optimization. The SGLang runtime introduces several novel optimizations to accelerate the execution of LM programs:
- RadixAttention: This technique enables automatic reuse of the KV cache across multiple generation calls. In existing inference engines, the KV cache of a request is discarded after processing, preventing reuse across calls and slowing execution. SGLang instead keeps the KV cache in an LRU cache within a radix tree, managing it like a conventional cache and using the radix tree for efficient matching, insertion, and eviction. This lets the runtime handle various reuse patterns efficiently.
- Compressed Finite State Machine: This technique enables faster constrained decoding for structured outputs. Existing systems enforce constraints only for the next token, so they can decode just one token at a time. SGLang instead analyzes the constraint and builds a compressed finite state machine to represent it, collapsing a multi-token path into a single-step path wherever possible, which allows multiple tokens to be decoded at once (a sketch of this idea follows the list below).
- API Speculative Execution: For API-only models such as OpenAI's GPT-4, SGLang introduces speculative execution to optimize multi-call programs: a call may be allowed to generate a few tokens past its stop condition, and later primitives can match and reuse that extra output, reducing the number of API calls.
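To make the compressed-FSM idea concrete, here is a minimal, self-contained Python sketch of the compression step. It is an illustration of the concept only, not SGLang's implementation, and the toy "tokens" are plain strings:

```python
# Illustration of the compressed-FSM idea only; not SGLang's implementation.
# A constraint FSM is modeled as: state -> {allowed_token: next_state}.
# Chains of states that admit exactly one token are deterministic, so they can
# be collapsed into a single multi-token edge and emitted in one decoding step.

def compress_fsm(transitions):
    compressed = {}
    for state, edges in transitions.items():
        compressed[state] = {}
        for token, nxt in edges.items():
            path = [token]
            # Follow the chain while the next state allows exactly one token.
            while len(transitions.get(nxt, {})) == 1:
                (tok, nxt), = transitions[nxt].items()
                path.append(tok)
            compressed[state][tuple(path)] = nxt
    return compressed

# Toy FSM for the output skeleton {"grade": "<A|B|C>"} at token granularity.
fsm = {
    0: {'{"': 1},
    1: {'grade': 2},
    2: {'": "': 3},
    3: {'A': 4, 'B': 4, 'C': 4},   # the only state that needs a model decision
    4: {'"}': 5},
    5: {},                          # accepting state
}

print(compress_fsm(fsm))
# From state 0 the fixed JSON skeleton '{"grade": "' collapses into one
# three-token jump, and each grade choice carries its deterministic closing
# '"}' with it, so only one model decision step remains.
```

Only the branching state requires a model forward pass; every collapsed edge is appended to the output in a single step, which is where the speedup for JSON-like schemas comes from.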
Using SGLang, a variety of LLM applications were implemented, including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, multi-turn chat, and multi-modality processing. Performance was tested on models including Llama-7B/70B, Mixtral-8x7B, LLaVA-v1.5-7B (image), and LLaVA-NeXT-34B (video) on NVIDIA A10G and A100 GPUs. Experimental results show that SGLang achieves up to 6.4× higher throughput across a wide range of workloads, models, and hardware setups compared with existing programming and inference systems, including Guidance, vLLM, and LMQL.
SGLang: Programming Model and Methodology
The SGLang programming model is introduced through a running example that describes its language primitives and execution modes and outlines runtime optimization opportunities. The model simplifies tedious operations in multi-call workflows (e.g., string manipulation, API calling, constraint specification, parallelism) by providing flexible and composable primitives. SGLang is a domain-specific language embedded in Python. The following figure shows a program that evaluates an essay about an image using the branch-solve-merge prompting technique.
The function multi_dimensional_judge takes three arguments: `s`, `path`, and `essay`. `s` manages the prompt state, `path` is the image file path, and `essay` is the essay text. New strings and SGLang primitives can be appended to the state `s` for execution using the `+=` operator. The function first adds the image and essay to the prompt. It then checks whether the essay is related to the image using select, storing the result in s["related"]. If it is related, the prompt is forked into three copies for parallel evaluation along different dimensions, using gen to store the results in f["judgment"]. Next, it merges the judgments, generates a summary, and assigns a letter grade. Finally, it returns the results in JSON format, following a schema defined by a regular expression constraint. SGLang greatly simplifies this program: an equivalent program written against an OpenAI-API-like interface would take 2.1× as many lines of code because of manual string manipulation and parallelism control.
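A hedged reconstruction of how such a program might look in SGLang's Python-embedded frontend is sketched below. The prompt wording, the three evaluation dimensions, and the regular expression are illustrative choices, and exact primitive signatures may differ from the released library; see the paper or the sglang repository for the canonical version.

```python
import sglang as sgl

@sgl.function
def multi_dimensional_judge(s, path, essay):
    # Add the image and the essay to the prompt state.
    s += sgl.image(path) + "Please evaluate the following essay about this image.\n" + essay
    # Constrained choice: is the essay related to the image?
    s += "\nIs the essay related to the image? " + sgl.select("related", choices=["yes", "no"])
    if s["related"] == "yes":
        # Fork the prompt state and judge three dimensions in parallel.
        dimensions = ["clarity", "creativity", "accuracy"]   # illustrative dimensions
        forks = s.fork(len(dimensions))
        for f, dim in zip(forks, dimensions):
            f += f"\nJudge the essay in terms of {dim}. " + sgl.gen("judgment", max_tokens=128)
        # Fetching the forked results blocks until they are ready (implicit join),
        # and appending them merges the judgments back into the main state.
        merged = "\n".join(f["judgment"] for f in forks)
        s += "\nJudgments:\n" + merged + "\nSummarize the judgments. " + sgl.gen("summary", max_tokens=128)
    # Constrain the final answer to a small JSON schema via a regular expression.
    s += "\nReply in JSON with a letter grade. " + sgl.gen("grade_json", regex=r'\{"grade": "[A-D][+-]?"\}')
```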
SGLang provides primitives for controlling prompt state, generation, and parallelism, which can be used together with Python syntax and libraries. The primitives are:
- gen: Calls a model to generate text and stores the result in a variable with the name given by its first argument. It supports a regex argument to constrain the output to follow a grammar defined by a regular expression (e.g., a JSON schema).
- select: Calls a model to choose the highest-probability option from a list.
- += or extend: Appends a string to the prompt.
- [variable_name]: Fetches the result of a generation.
- fork: Creates parallel forks of the prompt state (see the sketch after this list).
- join: Rejoins the prompt state.
- image and video: Take in image and video inputs.
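As a small illustration of fork and join (the prompt content here is made up; the structure mirrors the style of SGLang's published examples), the snippet below expands two tips in parallel and then merges them back into the main prompt:

```python
import sglang as sgl

@sgl.function
def tip_suggestion(s):
    s += "Here are two tips for staying healthy: 1. Balanced Diet. 2. Regular Exercise.\n\n"
    # fork: create two parallel copies of the prompt state.
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i + 1} into a paragraph:\n"
        f += sgl.gen("detailed_tip", max_tokens=256, stop="\n\n")
    # Reading the forked results blocks until they finish; appending them to `s`
    # joins them back into the main prompt state.
    s += "Tip 1: " + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2: " + forks[1]["detailed_tip"] + "\n"
    s += "To summarize: " + sgl.gen("summary", max_tokens=128)
```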
The simplest way to execute an SGLang program is through an interpreter, where a prompt is treated as an asynchronous stream. Primitives like extend, gen, and select are submitted to the stream for asynchronous execution. These non-blocking calls allow Python code to continue running without waiting for generation to finish, similar to launching CUDA kernels asynchronously. Each prompt is managed by a stream executor in a background thread, enabling intra-program parallelism. Fetching generation results blocks until they are ready, ensuring correct synchronization. Alternatively, SGLang programs can be compiled as computational graphs and executed with a graph executor, allowing further optimizations. The paper uses interpreter mode by default and discusses compiler-mode results in Appendix D. SGLang supports open-weight models with its own SGLang Runtime (SRT), as well as API models such as those from OpenAI and Anthropic.
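Continuing the sketch above, a program is typically launched against either a local SGLang Runtime (SRT) server or an API model; the endpoint URL and model name below are placeholders:

```python
import sglang as sgl

# Point the frontend at a local SGLang Runtime (SRT) server...
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
# ...or at an API model instead, e.g.:
# sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))

# Primitives inside the function are submitted to the stream executor and run
# asynchronously; fetching a variable from the returned state blocks until it is ready.
state = tip_suggestion.run()
print(state["summary"])
```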
Programming systems for LLMs can be classified as high-level (e.g., LangChain, DSPy) or low-level (e.g., LMQL, Guidance, SGLang). High-level systems provide predefined or auto-generated prompts, such as DSPy's prompt optimizer. Low-level systems typically do not alter prompts but allow direct manipulation of prompts and primitives. SGLang is a low-level system similar to LMQL and Guidance. The following table compares their features.
SGLang focuses more on runtime efficiency and comes with its own co-designed runtime, which enables novel optimizations. High-level languages (e.g., DSPy) can be compiled to low-level languages (e.g., SGLang). The integration of SGLang as a backend in DSPy for better runtime efficiency is demonstrated later.
The example above illustrates RadixAttention operations with an LRU eviction policy across nine time points, showing how the radix tree evolves dynamically in response to various requests. These requests include two chat sessions, a batch of few-shot learning queries, and self-consistency sampling. Each tree edge carries a label denoting a substring or a sequence of tokens. Nodes are color-coded to reflect different states: green for newly added nodes, blue for cached nodes accessed during the time point, and red for nodes that have been evicted.
Step 1: The radix tree is initially empty.
Step 2: The server processes an incoming user message “Hello!” and responds with the LLM output “Hi!”. The system prompt “You are a helpful assistant”, the user message “Hello!”, and the LLM reply “Hi!” are consolidated into the tree as a single edge linked to a new node.
Step 3: A new prompt arrives, and the server finds the prefix of the prompt (i.e., the first turn of the conversation) in the radix tree and reuses its KV cache. The new turn is appended to the tree as a new node.
Step 4: A new chat session begins. The node from Step 3 is split into two nodes so that the two chat sessions can share the system prompt.
Step 5: The second chat session continues. However, because of memory limits, a node from Step 4 must be evicted. The new turn is then appended after the remaining node from Step 4.
Step 6: The server receives a few-shot learning query, processes it, and inserts it into the tree. The root node is split because the new query does not share any prefix with existing nodes.
Step 7: The server receives a batch of additional few-shot learning queries. These queries share the same set of few-shot examples, so a node from Step 6 is split to enable sharing.
Step 8: The server receives a new message from the first chat session. It evicts all nodes from the second chat session, as they are the least recently used.
Step 9: The server receives a request to sample more answers for the questions in a node from Step 8, likely for self-consistency prompting. To make room for these requests, several nodes are evicted.
This example demonstrates how RadixAttention handles the dynamic allocation and eviction of nodes in response to different kinds of requests, ensuring efficient KV cache reuse and memory management.
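To make the walkthrough concrete, here is a toy Python sketch of the cache structure it describes. For brevity it uses a plain trie over tokens rather than a compressed radix tree, and Python objects rather than GPU KV memory, but it shows the three operations the steps above rely on: prefix matching, insertion, and LRU eviction of leaves.

```python
import time

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}                  # token -> Node
        self.last_access = time.monotonic()

class PrefixCache:
    """Toy prefix cache: a trie over tokens with LRU leaf eviction.
    Each node stands in for the KV cache of one token."""

    def __init__(self, capacity):
        self.root = Node()
        self.capacity = capacity
        self.size = 0

    def match_prefix(self, tokens):
        """Length of the longest cached prefix whose KV cache could be reused."""
        node, now = self.root, time.monotonic()
        for i, tok in enumerate(tokens):
            if tok not in node.children:
                return i
            node = node.children[tok]
            node.last_access = now          # touching a node refreshes its LRU time
        return len(tokens)

    def insert(self, tokens):
        """Cache a request's tokens, evicting LRU leaves if over capacity."""
        node = self.root
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = Node(parent=node)
                self.size += 1
            node = node.children[tok]
            node.last_access = time.monotonic()
        while self.size > self.capacity:
            self._evict_lru_leaf()

    def _evict_lru_leaf(self):
        # Only leaves are evictable: inner nodes are prefixes of cached entries.
        leaves = [n for n in self._all_nodes() if not n.children and n is not self.root]
        victim = min(leaves, key=lambda n: n.last_access)
        for tok, child in list(victim.parent.children.items()):
            if child is victim:
                del victim.parent.children[tok]
        self.size -= 1

    def _all_nodes(self):
        stack, out = [self.root], []
        while stack:
            node = stack.pop()
            out.append(node)
            stack.extend(node.children.values())
        return out

# Two chat turns share the system-prompt prefix, so the second reuses its cache.
cache = PrefixCache(capacity=64)
cache.insert(["<sys>", "You", "are", "helpful", "<user>", "Hello!"])
print(cache.match_prefix(["<sys>", "You", "are", "helpful", "<user>", "What's", "up?"]))  # -> 5
```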
SGLang: Evaluation and Results
Results on Open-Weight Models
The latency and throughput results are shown in the figures below. SGLang improves throughput by up to 6.4× and reduces latency by up to 3.7×. These improvements come from KV cache reuse, the exploitation of parallelism within a single program, and faster constrained decoding.
On these benchmarks, the cache hit rate ranges from 50% to 99%. Figure 13 (Appendix) lists the achieved and optimal cache hit rates for all of them, showing that SGLang's cache-aware scheduling approaches 96% of the optimal hit rate on average.
Results on Larger Models with Tensor Parallelism
Larger models, Mixtral-8x7B and Llama-70B, were tested with tensor parallelism on the same set of benchmarks, and the results are reported in the figure below. The speedup on larger models follows a trend similar to that observed on smaller models, indicating that SGLang's optimizations generalize well to larger models. Guidance and LMQL were omitted because they lack efficient tensor-parallelism implementations.
Results on Multi-Modal Models
SGLang natively supports multi-modal models through the image and video primitives. The optimizations in the paper are compatible with multi-modal models. For RadixAttention, a hash of the input image is computed and used as the key in the radix tree, allowing the KV cache of the image tokens from the same image to be reused. LLaVA-v1.5-7B (image) was run on llava-bench-in-the-wild and LLaVA-NeXT-34B (video) on ActivityNet. Because these models are not well supported by other baseline systems, the model authors' original implementation in Hugging Face Transformers was used as the baseline. As shown in the table below, SGLang provides up to 6× higher throughput on these benchmarks. In llava-bench-in-the-wild, multiple questions about the same image were handled, and the SGLang runtime reused the KV cache in this case.
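As a rough sketch of this keying idea (the exact hashing scheme and token layout in SGLang may differ), the image's bytes can be reduced to a single hash "token" so that two requests about the same image map to the same prefix:

```python
import hashlib

def image_prefix_tokens(image_bytes, prompt_tokens):
    """Key sequence used for prefix matching (illustrative, not SGLang's exact scheme)."""
    image_hash = hashlib.sha256(image_bytes).hexdigest()[:16]
    # The whole image is represented by one hash "token", so requests about the
    # same image produce the same prefix and can reuse the image tokens' KV cache.
    return [f"<image:{image_hash}>"] + prompt_tokens

fake_png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 32   # stand-in image bytes
q1 = image_prefix_tokens(fake_png, ["Describe", "the", "scene", "."])
q2 = image_prefix_tokens(fake_png, ["What", "is", "in", "the", "foreground", "?"])
assert q1[0] == q2[0]   # shared image prefix -> cache hit on the image tokens
```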
Production Deployment
SGLang has been deployed in Chatbot Arena to serve open-weight models. Because traffic for some models is low, only one SGLang worker serves each. After one month, a 52.4% RadixAttention cache hit rate was observed for LLaVA-Next-34B and 74.1% for Vicuna-33B. Cache hits came from common system messages, frequently reused example images, and multi-turn chat histories. This reduced first-token latency by an average of 1.7× for Vicuna-33B.
Final Thoughts
In this article, we have discussed SGLang, a newly introduced system that provides efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control, while the runtime accelerates execution through novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4× higher throughput than state-of-the-art inference systems on a range of large language and multimodal models, across tasks such as agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat.