In my latest posts, I've talked a lot about prompt caching, and caching in general, and the way it can improve your AI app's cost and latency. However, even in a fully optimized AI app, some responses are simply going to take a while to generate, and there's nothing we can do about it. When we request large outputs from the model, or require reasoning or deep thinking, the model is naturally going to take longer to reply. Reasonable as that is, waiting longer to receive an answer can be frustrating for users and lower their overall experience with an AI app. Happily, a simple and straightforward way to mitigate this issue is response streaming.
Streaming means receiving the model's response incrementally, piece by piece, as it is generated, rather than waiting for the entire response to be generated and then displaying it to the user. Normally (without streaming), we send a request to the model's API, wait for the model to generate the response, and once the response is complete, we get it back from the API in a single step. With streaming, however, the API sends back partial outputs while the response is being generated. This is a rather familiar concept, because most user-facing AI apps like ChatGPT have used streaming to display their responses from the moment they first appeared. But beyond ChatGPT and LLMs, streaming is used everywhere on the web and in modern applications, such as live notifications, multiplayer games, or live news feeds. In this post, we're going to explore how we can integrate streaming into our own requests to model APIs and achieve a similar effect in custom AI apps.
There are several different mechanisms for implementing streaming in an application. For AI applications, though, two types of streaming are widely used. More specifically, these are:
- HTTP Streaming over Server-Sent Events (SSE): A relatively simple, one-way type of streaming, allowing live communication only from server to client.
- Streaming with WebSockets: A more advanced and sophisticated type of streaming, allowing two-way live communication between server and client.
In the context of AI applications, HTTP streaming over SSE can support simple AI applications where we just need to stream the model's response for latency and UX reasons. However, as we move beyond simple request–response patterns into more advanced setups, WebSockets become particularly useful, as they allow live, bidirectional communication between our application and the model's API. For example, in code assistants, multi-agent systems, or tool-calling workflows, the client may need to send intermediate updates, user interactions, or feedback back to the server while the model is still generating a response. Still, for most simple AI apps where we just need the model to produce a response, WebSockets are usually overkill, and SSE is sufficient.
In the rest of this post, we'll take a closer look at streaming for simple AI apps using HTTP streaming over SSE.
. . .
What about HTTP Streaming Over SSE?
HTTP Streaming Over Server-Sent Events (SSE) relies on HTTP streaming.
. . .
HTTP streaming means that the server can send whatever it has to send in parts, rather than all at once. This is achieved by the server not terminating the connection to the client after sending a response, but rather leaving it open and sending the client additional data as soon as it becomes available.
For example, instead of receiving the response in a single chunk:
Hello world!
we could get it in parts using raw HTTP streaming:
Hello
world
!
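At the wire level, this kind of partial delivery is typically done with HTTP/1.1 chunked transfer encoding: each chunk is preceded by its size in hexadecimal and terminated by CRLF, and a zero-length chunk marks the end of the stream. As a rough sketch of that framing (a simplified illustration, not a full HTTP implementation):

```python
def encode_chunked(parts):
    """Frame text parts using HTTP/1.1 chunked transfer encoding."""
    out = b""
    for part in parts:
        data = part.encode()
        # each chunk: hex size, CRLF, payload, CRLF
        out += f"{len(data):x}\r\n".encode() + data + b"\r\n"
    out += b"0\r\n\r\n"  # zero-length chunk terminates the stream
    return out

def decode_chunked(raw):
    """Parse a chunked body back into its original parts."""
    parts, pos = [], 0
    while True:
        crlf = raw.index(b"\r\n", pos)
        size = int(raw[pos:crlf], 16)
        if size == 0:
            break  # terminating chunk reached
        start = crlf + 2
        parts.append(raw[start:start + size].decode())
        pos = start + size + 2  # skip chunk payload and its trailing CRLF
    return parts

framed = encode_chunked(["Hello", " world", "!"])
print(decode_chunked(framed))  # ['Hello', ' world', '!']
```

This framing is what lets the server keep the connection open and emit pieces as they become ready, since the client knows the body is over only when it sees the zero-length chunk.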
If we were to implement HTTP streaming from scratch, we would have to handle everything ourselves, including parsing the streamed text, managing errors, and reconnecting to the server. In our example, using raw HTTP streaming, we would have to somehow signal to the client that 'Hello world!' is conceptually one event, and that whatever follows it is a separate event. Fortunately, there are several frameworks and wrappers that simplify HTTP streaming, one of which is HTTP Streaming over Server-Sent Events (SSE).
. . .
So, Server-Sent Events (SSE) provide a standardized way to implement HTTP streaming by structuring server outputs into clearly defined events. This structure makes it much easier to parse and process streamed responses on the client side.
Each event typically includes:
- an id
- an event type
- a data payload
or, more properly:
id:
event:
data:
Our example using SSE could look something like this:
id: 1
event: message
data: Hello world!
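Parsing this format on the client side is straightforward: events are separated by blank lines, and each field is a name–value pair. A minimal parser sketch (simplified; it ignores details of the full SSE spec such as multi-line data fields and comments, and browsers ship this logic built in as EventSource):

```python
def parse_sse(raw: str):
    """Parse a raw SSE stream into a list of event dicts."""
    events = []
    for block in raw.strip().split("\n\n"):  # a blank line separates events
        event = {}
        for line in block.splitlines():
            name, _, value = line.partition(": ")
            event[name] = value  # e.g. {'id': '1', 'event': 'message', ...}
        events.append(event)
    return events

stream = (
    "id: 1\nevent: message\ndata: Hello world!\n\n"
    "id: 2\nevent: message\ndata: Bye!\n\n"
)
print(parse_sse(stream))
# [{'id': '1', 'event': 'message', 'data': 'Hello world!'},
#  {'id': '2', 'event': 'message', 'data': 'Bye!'}]
```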
But what is an event? Anything can qualify as an event: a single word, a sentence, or thousands of words. What actually counts as an event in a particular implementation is defined by the setup of the API or the server we're connected to.
On top of this, SSE comes with various other conveniences, like automatically reconnecting to the server if the connection is terminated. Incoming stream messages are also clearly tagged with the text/event-stream content type, allowing the client to handle them appropriately and avoid errors.
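The reconnection mechanism is where the event ids earn their keep: the client remembers the last id it saw and sends it back on reconnect (in the standard Last-Event-ID header) so the server can resume where it left off. A simplified simulation of that logic, with a mock in-memory server standing in for a real SSE endpoint:

```python
EVENTS = [(1, "Hello"), (2, " world"), (3, "!")]  # (id, data) held by a mock server

def fetch(last_event_id=0, fail_after=None):
    """Mock server: replay events after last_event_id, optionally dropping the connection."""
    sent = 0
    for event_id, data in EVENTS:
        if event_id <= last_event_id:
            continue  # already delivered before the disconnect
        if fail_after is not None and sent == fail_after:
            raise ConnectionError("connection dropped")
        yield event_id, data
        sent += 1

received, last_id = [], 0
try:
    # first attempt: the connection drops mid-stream after two events
    for event_id, data in fetch(last_id, fail_after=2):
        received.append(data)
        last_id = event_id
except ConnectionError:
    pass

# reconnect, resuming from the last seen id (what Last-Event-ID carries)
for event_id, data in fetch(last_id):
    received.append(data)
    last_id = event_id

print("".join(received))  # Hello world!
```

No event is lost and none is duplicated across the disconnect, which is exactly the guarantee the id field exists to provide.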
. . .
Roll up your sleeves
Frontier LLM APIs like OpenAI's API or the Claude API natively support HTTP streaming over SSE. As a result, integrating streaming into your requests is relatively easy, as it can be achieved by setting a single parameter in the request (e.g., enabling a stream=True parameter).
Once streaming is enabled, the API no longer waits for the full response before replying. Instead, it sends back small parts of the model's output as they're generated. On the client side, we can iterate over these chunks and display them progressively to the user, creating the familiar ChatGPT typing effect.
Let's look at a minimal example of this using, as usual, OpenAI's API:
from openai import OpenAI

client = OpenAI(api_key="your_api_key")

stream = client.responses.create(
    model="gpt-4.1-mini",
    input="Explain response streaming in 3 short paragraphs.",
    stream=True,
)

full_text = ""
for event in stream:
    # only print the text delta as text parts arrive
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
        full_text += event.delta

print("\n\nFinal collected response:")
print(full_text)
In this example, instead of receiving a single completed response, we iterate over a stream of events and print each text fragment as it arrives. At the same time, we also accumulate the chunks into a full response, full_text, in case we want to use it later.
. . .
So, should I just slap stream=True on every request?
The short answer is no. As useful as it is, with great potential for significantly improving user experience, streaming isn't a one-size-fits-all solution for AI apps, and we should use our judgment when deciding where it should be implemented and where it shouldn't.
More specifically, adding streaming to an AI app is very effective in setups where we expect long responses and value the user experience and responsiveness of the app above all. A typical example is consumer-facing chatbots.
On the flip side, for simple apps where we expect the responses to be short, adding streaming isn't likely to provide significant gains to the user experience and doesn't make much sense. On top of this, streaming only makes sense when the model's output is free text rather than structured output (e.g., JSON), since a partial structured output is unusable until it's complete.
Most importantly, the main drawback of streaming is that we aren't able to review the full response before displaying it to the user. Remember, LLMs generate tokens one by one, and the meaning of the response forms as it is generated, not in advance. If we make 100 requests to an LLM with the exact same input, we're going to get 100 different responses. In other words, no one knows what a response is going to say before it is complete. As a result, with streaming activated, it's much harder to review the model's output before displaying it to the user, or to apply any guarantees to the produced content. We can always try to evaluate partial completions, but partial completions are harder to evaluate, as we have to guess where the model is going. Add to that the fact that this evaluation has to be performed in real time, and not just once but repeatedly on different partial responses, and the process becomes even harder. In practice, validation in such cases is usually run on the whole output after the response is complete. The trouble is that by that point it may already be too late, as we may have already shown the user inappropriate content that doesn't pass our validations.
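One pragmatic middle ground is to buffer the stream and only flush each buffered piece to the user once it passes a lightweight check, trading a small delay for some control. A toy sketch of that idea, where the banned-phrase check and buffer size are placeholders rather than a real moderation API:

```python
BANNED = {"secret token"}  # placeholder for a real moderation check

def violates_policy(text: str) -> bool:
    """Toy check: flag text containing any banned phrase."""
    return any(phrase in text.lower() for phrase in BANNED)

def stream_with_guardrail(chunks, buffer_size=20):
    """Buffer streamed chunks and only release text that passes the check."""
    buffer, released = "", []
    for chunk in chunks:
        buffer += chunk
        if len(buffer) >= buffer_size:
            if violates_policy(buffer):
                released.append("[response withheld]")
                return released  # stop the stream instead of showing the content
            released.append(buffer)  # safe so far: flush to the user
            buffer = ""
    if buffer:  # flush (or withhold) whatever remains at the end of the stream
        released.append("[response withheld]" if violates_policy(buffer) else buffer)
    return released

safe = stream_with_guardrail(["Streaming ", "keeps users ", "engaged."])
print("".join(safe))  # Streaming keeps users engaged.
```

Note how this illustrates the exact difficulty described above: a problematic phrase split across two buffers can still slip through, so buffering reduces the risk without eliminating it.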
. . .
On my mind
Streaming is a feature that has no real impact on an AI app's capabilities, or on its associated cost and latency. However, it can have a great impact on the way users perceive and experience an AI app. Streaming makes AI systems feel faster, more responsive, and more interactive, even when the time to generate the complete response stays exactly the same. That said, streaming isn't a silver bullet. Different applications and contexts may benefit more or less from introducing streaming. Like many choices in AI engineering, it's less about what's possible and more about what makes sense for your specific use case.
. . .
. . .
💌 and 💼
. . .
