You know that feeling when everything seems to be working just fine, until you look under the hood and realize your system is burning 10× more fuel than it needs to?
We had a client script firing off requests to validate our prompts, built with async Python code and running smoothly in a Jupyter notebook. Clean, simple, and fast. We ran it frequently to test our models and collect evaluation data. No red flags. No warnings.
But beneath that polished surface, something was quietly going wrong.
We weren’t seeing failures. We weren’t getting exceptions. We weren’t even noticing slowness. But our system was doing far more work than it needed to, and we didn’t know it.
In this post, we’ll walk through how we discovered the problem, what caused it, and how a simple structural change in our async code reduced LLM traffic and cost by 90%, with virtually no loss in speed or functionality.
Now, fair warning: reading this post won’t magically slash your LLM costs by 90%. But the takeaway here is broader: small, overlooked design decisions, sometimes just a few lines of code, can lead to massive inefficiencies. And being intentional about how your code runs can save you time, money, and frustration in the long run.
The fix itself might feel niche at first. It involves the subtleties of Python’s asynchronous behavior, how tasks are scheduled and dispatched. If you’re familiar with Python and async/await, you’ll get more out of the code examples, but even if you’re not, there’s still plenty to take away. Because the real story here isn’t just about LLMs or Python, it’s about responsible, efficient engineering.
Let’s dig in.
The Setup
To automate validation, we use a predefined dataset and trigger our system through a client script. The validation focuses on a small subset of the dataset, so the client code only stops after receiving a certain number of responses.
Here’s a simplified version of our client in Python:
import asyncio

from aiohttp import ClientSession
from tqdm.asyncio import tqdm_asyncio

URL = "http://localhost:8000/example"
NUMBER_OF_REQUESTS = 100
STOP_AFTER = 10

async def fetch(session: ClientSession, url: str) -> bool:
    async with session.get(url) as response:
        body = await response.json()
        return body["value"]

async def main():
    results = []
    async with ClientSession() as session:
        tasks = [fetch(session, URL) for _ in range(NUMBER_OF_REQUESTS)]
        for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
            response = await future
            if response is True:
                results.append(response)
                if len(results) >= STOP_AFTER:
                    print(f"\n✅ Stopped after receiving {STOP_AFTER} true responses.")
                    break

asyncio.run(main())
This script reads requests from a dataset, fires them concurrently, and stops once we collect enough true responses for our evaluation. In production, the logic is more complex and based on the number of responses we need. But the structure is the same.
Let’s use a dummy FastAPI server to simulate real behavior:
import asyncio
import random

import fastapi
import uvicorn

app = fastapi.FastAPI()

@app.get("/example")
async def example():
    # simulate LLM latency with a random delay, then return a random boolean
    sleeping_time = random.uniform(1, 2)
    await asyncio.sleep(sleeping_time)
    return {"value": random.choice([True, False])}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
Now let’s fire up that dummy server and run the client. You’ll see something like this in the client terminal:
Can You Spot the Problem?

Nice! Fast, clean, and… wait, is everything working as expected?
On the surface, it looks like the client is doing the right thing: sending requests, getting 10 true responses, then stopping.
But is it?
Let’s add a few print statements to our server to see what it’s actually doing under the hood:
import asyncio
import random

import fastapi
import uvicorn

app = fastapi.FastAPI()

@app.get("/example")
async def example():
    print("Got a request")
    sleeping_time = random.uniform(1, 2)
    print(f"Sleeping for {sleeping_time:.2f} seconds")
    await asyncio.sleep(sleeping_time)
    value = random.choice([True, False])
    print(f"Returning value: {value}")
    return {"value": value}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
Now re-run everything.
You’ll start seeing logs like this:
Got a request
Sleeping for 1.11 seconds
Got a request
Sleeping for 1.29 seconds
Got a request
Sleeping for 1.98 seconds
...
Returning value: True
Returning value: False
Returning value: False
...
Take a closer look at the server logs. You’ll notice something unexpected: instead of processing just 14 requests like we see in the progress bar, the server handles all 100. Even though the client stops after receiving 10 true responses, it still sends every request up front. As a result, the server has to process all of them.
It’s an easy mistake to miss, especially because everything appears to be working correctly from the client’s perspective: responses come in quickly, the progress bar advances, and the script exits early. But behind the scenes, all 100 requests are sent immediately, regardless of when we decide to stop listening. This results in 10× more traffic than needed, driving up costs, increasing load, and risking rate limits.
So the key question becomes: why is this happening, and how can we make sure we only send the requests we really need? The answer turned out to be a small but powerful change.
The root of the problem lies in how the tasks are scheduled. In our original code, we create a list of 100 tasks:
tasks = [fetch(session, URL) for _ in range(NUMBER_OF_REQUESTS)]
for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
    response = await future
When you pass a list of coroutines to as_completed, Python immediately wraps each coroutine in a Task and schedules it on the event loop. This happens before you start iterating over the loop body. Once a coroutine becomes a Task, the event loop starts running it in the background right away.
as_completed itself doesn’t control concurrency; it simply waits for tasks to finish and yields them one by one in the order they complete. Think of it as an iterator over completed futures, not a traffic controller. This means that by the time you start looping, all 100 requests are already in flight. Breaking out after 10 true results stops you from processing the rest, but it doesn’t stop them from being sent.
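You can see this scheduling behavior in isolation with a small self-contained sketch (the worker coroutine below is ours, purely for illustration): every task prints that it has started, even though we break out of the loop after the first result.
import asyncio

async def worker(i: int) -> int:
    print(f"task {i} started")  # runs even if we never await this particular task
    await asyncio.sleep(0.1)
    return i

async def main():
    coros = [worker(i) for i in range(5)]
    for future in asyncio.as_completed(coros):
        result = await future  # by the time this returns, all five tasks have printed "started"
        print(f"got {result}, breaking early")
        break

asyncio.run(main())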
To fix this, we introduced a semaphore to limit concurrency. The semaphore adds a lightweight lock inside fetch so that only a fixed number of requests can start at the same time. The rest stay paused, waiting for a slot. Once we hit our stopping condition, the paused tasks never acquire the lock, so they never send their requests.
Here’s the adjusted version:
import asyncio

from aiohttp import ClientSession
from tqdm.asyncio import tqdm_asyncio

URL = "http://localhost:8000/example"
NUMBER_OF_REQUESTS = 100
STOP_AFTER = 10

async def fetch(session: ClientSession, url: str, semaphore: asyncio.Semaphore) -> bool:
    # the request is only sent once this task acquires a semaphore slot
    async with semaphore:
        async with session.get(url) as response:
            body = await response.json()
            return body["value"]

async def main():
    results = []
    semaphore = asyncio.Semaphore(int(STOP_AFTER * 1.5))
    async with ClientSession() as session:
        tasks = [fetch(session, URL, semaphore) for _ in range(NUMBER_OF_REQUESTS)]
        for future in tqdm_asyncio.as_completed(tasks, total=NUMBER_OF_REQUESTS, desc="Fetching"):
            response = await future
            if response:
                results.append(response)
                if len(results) >= STOP_AFTER:
                    print(f"\n✅ Stopped after receiving {STOP_AFTER} true responses.")
                    break

asyncio.run(main())
With this change, we still define 100 requests upfront, but only a small group is allowed to run at the same time, 15 in this example. If we reach our stopping condition early, the tasks still waiting on the semaphore never get to send their requests. This keeps the behavior responsive while reducing unnecessary calls.
Now the server logs will show only around 20 "Got a request" / "Returning value" entries. On the client side, the progress bar will look identical to the original.

With this change in place, we saw immediate impact: a 90% reduction in request volume and LLM cost, with no noticeable degradation in client experience. It also improved throughput across the team, reduced queuing, and eliminated rate-limit issues from our LLM providers.
This small structural adjustment made our validation pipeline dramatically more efficient, without adding much complexity to the code. It’s a good reminder that in async systems, control flow doesn’t always behave the way you assume unless you’re explicit about how tasks are scheduled and when they should run.
Bonus Insight: Closing the Event Loop
If we had run the original client code without asyncio.run, we might have noticed the problem earlier.
For instance, if we had used manual event loop management like this:
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()
Python would have printed warnings such as:

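The exact output depends on your tasks, but it looks roughly like this (the task names, coroutine, and file locations below are illustrative):
Task was destroyed but it is pending!
task: <Task pending name='Task-12' coro=<fetch() running at client.py:11> wait_for=<Future pending>>
Task was destroyed but it is pending!
task: <Task pending name='Task-13' coro=<fetch() running at client.py:11> wait_for=<Future pending>>
...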
These warnings appear when the program exits while there are still unfinished async tasks scheduled in the loop. If we had seen a screen full of those warnings, it likely would have triggered a red flag much sooner.
So why didn’t we see that warning when using asyncio.run()?
Because asyncio.run() takes care of cleanup behind the scenes. It doesn’t just run your coroutine and exit; it also cancels any remaining tasks, waits for them to finish, and only then shuts down the event loop. This built-in safety net prevents those “pending task” warnings from showing up, even if your code quietly launched more tasks than it needed to.
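In simplified form, the shutdown that asyncio.run() performs looks roughly like the sketch below (it also shuts down async generators and the default executor, which we omit here), reusing the main() coroutine from our client:
import asyncio

loop = asyncio.new_event_loop()
try:
    loop.run_until_complete(main())
finally:
    # cancel whatever is still scheduled, let the cancellations propagate,
    # and only then close the loop, so no "pending task" warnings fire
    pending = asyncio.all_tasks(loop)
    for task in pending:
        task.cancel()
    loop.run_until_complete(asyncio.gather(*pending, return_exceptions=True))
    loop.close()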
As a result, it suppresses those “pending task” warnings. When you manually close the loop with loop.close() after run_until_complete(), any leftover tasks that haven’t been awaited are still hanging around. Python detects that you’re forcefully shutting down the loop while work is still scheduled, and warns you about it.
This isn’t to say that every async Python program should avoid asyncio.run() or always use loop.run_until_complete() with a manual loop.close(). But it does highlight something important: you should be aware of what tasks are still running when your program exits. At the very least, it’s a good idea to monitor or log any pending tasks before shutdown, as in the sketch below.
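One lightweight way to do that, sketched here with a hypothetical slow_job coroutine standing in for the extra requests, is to check asyncio.all_tasks() right before your top-level coroutine returns:
import asyncio

async def slow_job(i: int) -> int:
    await asyncio.sleep(10)
    return i

async def main():
    # kick off work we never wait for (a stand-in for the extra requests)
    for i in range(5):
        asyncio.create_task(slow_job(i))
    await asyncio.sleep(0.1)
    # before returning, report anything still scheduled on the loop
    pending = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    if pending:
        print(f"⚠️ exiting with {len(pending)} tasks still pending")

asyncio.run(main())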
Final Thoughts
By stepping back and rethinking the control flow, we were able to make our validation process dramatically more efficient, not by adding more infrastructure, but by using what we already had more carefully. A few lines of code led to a 90% cost reduction with almost no added complexity. It resolved rate-limit errors, reduced system load, and allowed the team to run evaluations more frequently without causing bottlenecks.
It’s an important reminder that “clean” async code doesn’t always mean efficient code; being intentional about how we use system resources is crucial. Responsible, efficient engineering is about more than just writing code that works. It’s about designing systems that respect time, money, and shared resources, especially in collaborative environments. When you treat compute as a shared asset instead of an infinite pool, everyone benefits: systems scale better, teams move faster, and costs stay predictable.
So, whether you’re making LLM calls, launching Kubernetes jobs, or processing data in batches, pause and ask yourself: am I only using what I need?
Often, the answer and the improvement are just one line of code away.