How to Turn Your LLM Prototype into a Production-Ready System


Some of the most popular applications of LLMs are the ones I like to call the “wow-effect LLMs”. There are many viral LinkedIn posts about them, and they all sound like this:

“I built [x] that does [y] in [z] minutes using AI.”

Where:

  • [x] is usually something like a web app/platform
  • [y] is a somewhat impressive key feature of [x]
  • [z] is usually an integer between 5 and 10.
  • “AI” is, admittedly, more often than not an LLM wrapper (Cursor, Codex, or similar)

If you read carefully, the focus of the sentence is not really the quality of the result but the amount of time you save. That is to say, when dealing with a task, people are not excited about how the LLM tackles the problem; they are thrilled that the LLM is spitting out something that sounds like an answer to their problem.

This is why I refer to them as wow-effect LLMs. As impressive as they sound and look, these wow-effect LLMs have multiple issues that prevent them from actually being deployed in a production environment. A few of them:

  1. The prompt is usually not optimized: you don’t have time to test all the different versions of the prompt, evaluate them, and provide examples in 5-10 minutes.
  2. They are not meant to be sustainable: in that short amount of time, you can develop a nice-looking plug-and-play wrapper. By default, you are throwing all the cost, latency, maintainability, and privacy considerations out of the window.
  3. They typically lack context: LLMs are powerful when they are plugged into a larger infrastructure, have decision-making power over the tools they use, and have contextual data to enrich their answers. No chance of implementing that in 10 minutes.

Now, don’t get me wrong: LLMs are designed to be intuitive and easy to use. This means that evolving LLMs from the wow effect to production level is not rocket science. However, it requires a specific methodology that needs to be implemented.

The goal of this blog post is to provide this methodology.
The points we are going to cover to move from wow-effect LLMs to production-level LLMs are the following:

  • LLM System Requirements. When this beast goes into production, we need to know how to maintain it. This is done at stage zero, through an adequate system requirements analysis.
  • Prompt Engineering. We are going to optimize the prompt structure and provide some best-practice prompt strategies.
  • Force structure with schemas and structured output. We are going to move from free text to structured objects, so the format of your response is fixed and reliable.
  • Use tools so the LLM doesn’t work in isolation. We are going to let the model connect to data and call functions. This provides richer answers and reduces hallucinations.
  • Add guardrails and validation around the model. Check inputs and outputs, enforce business rules, and define what happens when the model fails or goes out of bounds.
  • Combine everything into a simple, testable pipeline. Orchestrate prompts, tools, structured outputs, and guardrails into a single flow that you can log, monitor, and improve over time.

We are going to use a very simple case: we are going to have an LLM grade data scientists’ exams. This is just a concrete case to avoid a completely abstract and confusing article. The procedure is general enough to be adapted to other LLM applications, typically with very minor adjustments.

Looks like we have a lot of ground to cover. Let’s start!

Image generated by author using Excalidraw Whiteboard

The whole code and data can be found here.

Tough choices: cost, latency, privacy

Before writing any code, there are a few essential questions to ask:

  • How complex is your task?
    Do you really need the newest and most expensive model, or can you use a smaller one or an older family?
  • How often do you run this, and at what latency?
    Is this a web app that must respond on demand, or a batch job that runs once and stores results? Do users expect an immediate answer, or is “we’ll email you later” acceptable?
  • What’s your budget?
    You should have a rough idea of what is “okay to spend”. Is it 1k, 10k, 100k? And compared to that, would it make sense to train and host your own model, or is that clearly overkill?
  • What are your privacy constraints?
    Is it okay to send this data through an external API? Is the LLM seeing sensitive data? Has this been approved by whoever owns legal and compliance?

Let me throw some examples at you. If we consider OpenAI, this is the table to look at for prices:

Image from https://platform.openai.com/docs/pricing

For simple tasks, where you have a low budget and want low latency, the smaller models (for example the 4.x mini family or 5 nano) are often your best bet. They are optimized for speed and cost, and for many basic use cases like classification, tagging, light transformations, or simple assistants, you will barely notice the quality difference while paying a fraction of the price.

For more complex tasks, such as complex code generation, long-context analysis, or high-stakes evaluations, it can be worth using a stronger model in the 5.x family, even at a higher per-token cost. In those cases, you are explicitly trading money and latency for higher decision quality.

If you are running large offline workloads, for example re-scoring or re-evaluating thousands of items overnight, batch endpoints can significantly reduce costs compared to real-time calls. This often changes which model fits your budget, because you can afford a “larger” model when latency is not a constraint.

From a privacy standpoint, it is usually good practice to only send non-sensitive or “sensitive-cleared” data to your provider, meaning data that has been cleaned to remove anything confidential or personal. If you need even more control, you can consider running local LLMs.

Image made by author using Excalidraw Whiteboard

The specific use case

For this article, we are building an automated grading system for Data Science exams. Students take a test that requires them to analyze actual datasets and answer questions based on their findings. The LLM’s job is to grade these submissions by:

  1. Understanding what each question asks
  2. Accessing the correct answers and grading criteria
  3. Verifying student calculations against the actual data
  4. Providing detailed feedback on what went wrong

This is a perfect example of why LLMs need tools and context. You see, you could indeed go for a plug-and-play approach. If we graded the Data Science exam through a single prompt and API call, it would have the wow effect, but it wouldn’t work well in production. Without access to the datasets and grading rubrics, the LLM cannot grade accurately. It needs to retrieve the actual data to verify whether a student’s answer is correct.

Our exam is stored in test.json and contains 10 questions across three sections. Students must analyze three different datasets: e-commerce sales, customer demographics, and A/B test results. Let’s look at a few example questions:
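The exact questions live in test.json in the repo; the hypothetical entry below only illustrates the shape (field names and values are illustrative, not the real ones):

{
  "question_number": 1,
  "section": "E-commerce Sales",
  "dataset": "ecommerce_sales.csv",
  "question": "What is the average order value in the e-commerce sales dataset?",
  "points_possible": 10
}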

As you can see, the questions are data-related, so the LLM will need a tool to analyze them. We will come back to this.

Image made by author using Excalidraw Whiteboard

Building the prompt

When I use ChatGPT for small daily questions, I am terribly lazy, and I don’t pay attention to prompt quality at all, and that’s okay. Imagine that you need to know the current state of the housing market in your city, and you have to sit down at your laptop and write hundreds of lines of Python code. Not very appealing, right?

However, to really get the best prompt for your production-level LLM application, there are some key components to follow:

  • Clear Role Definition. WHO the LLM is and WHAT expertise it has.
  • System vs User Messages. The system message contains the LLM-specific instructions. The “user” message represents the specific prompt to run, with the current request from the user.
  • Explicit Rules with Chain-of-Thought. This is the list of steps that the LLM has to follow to perform the task. This step-by-step reasoning triggers the Chain-of-Thought, which improves performance and reduces hallucinations.
  • Few-Shot Examples. This is a list of examples, so that we explicitly show the LLM how it should perform the task: in our case, correct grading examples.

It is usually a good idea to have a prompt.py file, with SYSTEM_PROMPT, USER_PROMPT_TEMPLATE, and FEW_SHOT_EXAMPLES. This is the example for our use case:
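The full file is in the repo; the sketch below only shows the structure, with the actual prompt texts heavily abbreviated:

# prompt.py -- structural sketch; the real prompt texts in the repo are much longer

SYSTEM_PROMPT = """You are an expert Data Science grader.
You grade student exam submissions accurately and fairly by verifying
their answers against the actual datasets and the grading rubric."""

FEW_SHOT_EXAMPLES = """Example:
Question: What is the mean of column X?
Student answer: 42.1
Ground truth: 42.1
Grading: full marks, the value matches the ground truth."""

USER_PROMPT_TEMPLATE = """Grade the following submission.

Question {question_number}: {question_text}
Student answer: {student_answer}

Follow the grading rubric, verify the numbers against the dataset,
and return the result in the required structured format."""


def get_grading_prompt(question_number: int, question_text: str, student_answer: str) -> str:
    """Build the user prompt for a specific student answer."""
    return FEW_SHOT_EXAMPLES + "\n\n" + USER_PROMPT_TEMPLATE.format(
        question_number=question_number,
        question_text=question_text,
        student_answer=student_answer,
    )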

So the prompts that we reuse are stored as constants, while the ones that change based on the student answer are built by get_grading_prompt.

Image made by author using Excalidraw Whiteboard

Output Formatting

If you notice, the output in the few-shot example already has a kind of “structure”. At the end of the day, the score needs to be “packaged” in a production-adequate format. It is not acceptable to have the output as a free-text string.

In order to do this, we are going to use the magic of Pydantic. Pydantic allows us to easily create a schema that can then be passed to the LLM, which will build the output based on the schema.

This is our schemas.py file:
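Here is a condensed sketch of it; the repo version has more fields, and the feedback field below is an assumption based on the grading requirements described above:

# schemas.py -- condensed sketch; the repo version contains additional fields
from pydantic import BaseModel, Field


class GradingResult(BaseModel):
    """Structured grading output for a single exam question."""

    question_number: int = Field(..., ge=1, le=10, description="Question number (1-10)")
    points_earned: float = Field(..., ge=0, le=10, description="Points earned out of 10")
    points_possible: int = Field(default=10, description="Maximum points for this question")
    feedback: str = Field(..., description="Detailed feedback on what went wrong (or right)")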

If you focus on GradingResult, you can see fields like these:

question_number: int = Field(..., ge=1, le=10, description="Question number (1-10)")
points_earned: float = Field(..., ge=0, le=10, description="Points earned out of 10")
points_possible: int = Field(default=10, description="Maximum points for this question")

Now, imagine that we want to add a new field (e.g. completeness_of_the_answer); this would be very easy to do: you just add it to the schema. However, keep in mind that the prompt should reflect the way your output will look.
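A hypothetical version of that extension (the 0-1 range is an assumption) is just one extra line in the schema:

from pydantic import BaseModel, Field


class GradingResult(BaseModel):
    # ... existing fields (question_number, points_earned, points_possible, ...) ...
    # New, hypothetical field: add it here and mirror it in the prompt and few-shot examples
    completeness_of_the_answer: float = Field(..., ge=0, le=1, description="How complete the answer is (0-1)")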

Image made by author using Excalidraw Whiteboard

Tools Description

The /data folder has:

  1. A list of datasets, which are the subject of our questions. This folder contains a set of tables that represent the data the students need to analyze when taking the test.
  2. The grading rubric dataset, which describes how we are going to evaluate each question.
  3. The ground truth dataset, which contains the ground truth answer for each question.

We are going to give the LLM free rein over these datasets; we are letting it explore each file based on the specific question.

For example, get_ground_truth_answer() allows the LLM to pull the ground truth for a given question. query_dataset() lets the LLM run some operations on the data, like computing the mean, max, and count.
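As a rough idea, here is a minimal sketch of what those two tools could look like; the function names follow the article, but the bodies, file paths, and the exact decorator import are assumptions (the real implementations are in the repo):

# tools.py -- minimal sketch; bodies and file paths are illustrative, not the repo code
import json

import pandas as pd
from crewai.tools import tool  # in older CrewAI versions: from crewai_tools import tool


@tool("get_ground_truth_answer")
def get_ground_truth_answer(question_number: int) -> str:
    """Return the ground truth answer for a given question number."""
    with open("data/ground_truth.json") as f:  # hypothetical file layout
        ground_truth = json.load(f)
    return str(ground_truth[str(question_number)])


@tool("query_dataset")
def query_dataset(dataset_name: str, column: str, operation: str) -> str:
    """Run a simple aggregation (mean, max, count) on a column of one of the exam datasets."""
    df = pd.read_csv(f"data/{dataset_name}.csv")  # hypothetical file layout
    operations = {"mean": df[column].mean, "max": df[column].max, "count": df[column].count}
    return str(operations[operation]())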

Even in this case, it is worth noting that tools, schema, and prompt are completely customizable. If your LLM has access to 10 tools and you need to add one more functionality, there is no need to make any structural change to the code: the only thing to do is to add the functionality in terms of prompt, schema, and tool.

Image made by author using Excalidraw Whiteboard

Guardrails Description

In software engineering, you recognize a good system by how gracefully it fails. This shows the amount of work that has been put into the task.

In this case, some “graceful failures” are the following:

  1. The input should be sanitized: the question ID should exist, and the student’s answer text should exist and not be too long.
  2. The output should be sanitized: the question ID should exist, the score should be between 0 and 10, and the output should be in the correct format defined by the Pydantic schema.
  3. The output should “make sense”: you cannot give the best score if there are errors, or give 0 if there are no errors.
  4. A rate limit should be implemented: in production, you don’t want to accidentally run thousands of calls at once for no reason. You should implement a rate-limit check.

This part is slightly boring, but very necessary. Because it is necessary, it is included in my GitHub folder; because it is boring, I won’t copy-paste the whole thing here and will only sketch the idea below. You’re welcome! 🙂
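Still, to give you a rough idea, here is a minimal sketch of the kind of checks involved (the function signatures and the error_count argument are illustrative assumptions, and the rate limiter is omitted):

# guardrails.py -- minimal sketch of the validation logic; the real version is in the repo
MAX_ANSWER_LENGTH = 5_000  # illustrative limit, not the repo's exact value


def validate_input(question_number: int, student_answer: str) -> None:
    """Sanitize the input before it ever reaches the LLM."""
    if not 1 <= question_number <= 10:
        raise ValueError(f"Unknown question ID: {question_number}")
    if not student_answer or len(student_answer) > MAX_ANSWER_LENGTH:
        raise ValueError("Student answer is missing or too long")


def validate_output(points_earned: float, points_possible: int, error_count: int) -> None:
    """Check that the graded score also makes sense semantically."""
    if not 0 <= points_earned <= points_possible:
        raise ValueError("Score out of range")
    if error_count > 0 and points_earned == points_possible:
        raise ValueError("Full score assigned despite errors being reported")
    if error_count == 0 and points_earned == 0:
        raise ValueError("Zero score assigned despite no errors being reported")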

Image made by author using Excalidraw Whiteboard

Whole pipeline

The whole pipeline is implemented through CrewAI, which is built on top of LangChain. The logic is simple:

  • The crew is the main object that is used to generate the output for a given input with a single command (crew.kickoff()).
  • An agent is defined: this wraps the tools, the prompts, and the specific LLM (e.g., GPT-4 with a given temperature). This is connected to the crew.
  • The task is defined: this is the specific task that we want the LLM to perform. This is also connected to the crew.

Now, the magic is that the task is connected to the tools, the prompts, and the Pydantic schema. This means that all the dirty work is done in the backend. The pseudo-code looks like this:

from crewai import Agent, Crew, Process, Task

agent = Agent(
    role="Expert Data Science Grader",
    goal="Grade student data science exam submissions accurately and fairly by verifying answers against actual datasets",
    backstory=SYSTEM_PROMPT,
    tools=tools_list,
    llm=llm,
    verbose=True,
    allow_delegation=False,
    max_iter=15
)

task = Task(
    description=description,
    expected_output=expected_output,
    agent=agent,
    output_json=GradingResult  # Enforce structured output
)

crew = Crew(
    agents=[agent],
    tasks=[task],
    process=Process.sequential,
    verbose=True
)

result = crew.kickoff()

Now, let’s say we have the following JSON file, with the student’s work:
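The real file is in the repo; a hypothetical submission entry (field names are illustrative) might look like this:

{
  "student_id": "student_001",
  "answers": [
    {
      "question_number": 1,
      "answer": "The average order value in the e-commerce dataset is 87.3."
    }
  ]
}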

We can use the following main.py file to process this:
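A condensed sketch of what main.py could look like is below; the grade_submission helper (which would build the agent, task, and crew shown above and call kickoff()) and the submission layout follow the hypothetical examples in this article, not necessarily the exact repo code:

# main.py -- condensed sketch of the entry point; the full version is in the repo
import argparse
import json

from grader import grade_submission  # hypothetical helper that runs the crew and returns a GradingResult


def main() -> None:
    parser = argparse.ArgumentParser(description="Grade data science exam submissions with an LLM crew")
    parser.add_argument("--submission", required=True, help="Path to the submission JSON file")
    parser.add_argument("--limit", type=int, default=None, help="Grade only the first N answers")
    parser.add_argument("--output", required=True, help="Where to write the structured grading results")
    args = parser.parse_args()

    with open(args.submission) as f:
        submission = json.load(f)

    answers = submission["answers"][: args.limit] if args.limit else submission["answers"]
    results = [grade_submission(answer).model_dump() for answer in answers]  # model_dump assumes Pydantic v2

    with open(args.output, "w") as f:
        json.dump(results, f, indent=2)


if __name__ == "__main__":
    main()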

And run it with:

python main.py --submission ../data/test.json \
               --limit 1 \
               --output ../results/test_llm_output.json

This kind of setup is exactly how production-level code works: the output is passed through an API as a structured piece of information, and the code must run on that piece of information.

This is what the terminal will show you:

Image made by author

As you can see from the screenshot above, the input is processed by the LLM, but before the output is produced, the Chain-of-Thought is triggered, the tools are called, and the tables are read.

And this is what the output looks like:
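The values below are illustrative, but the shape follows the GradingResult schema (the real output file is in the repo):

{
  "question_number": 1,
  "points_earned": 8.0,
  "points_possible": 10,
  "feedback": "The aggregation approach is correct, but the reported mean does not match the value computed from the dataset."
}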

This is a good example of how LLMs can be used to their full power. At the end of the day, the main advantage of LLMs is their ability to read context efficiently. The more context we provide (tools, rule-based prompting, few-shot prompting, output formatting), the less the LLM will have to “fill the gaps” (usually by hallucinating) and the better job it will eventually do.

Image generated by author using Excalidraw Whiteboard

Conclusions

Thanks for sticking with me throughout this long, but hopefully not too painful, blog post. 🙂

We covered a lot of fun stuff. More specifically, we started from the wow-effect LLMs, the ones that look great in a LinkedIn post but collapse as soon as you ask them to run every day, within a budget, and under real constraints.

Instead of stopping at the demo, we walked through what it actually takes to turn an LLM into a system:

  • We defined the system requirements first, thinking in terms of cost, latency, and privacy, instead of just picking the biggest model available.
  • We framed a concrete use case: an automated grader for Data Science exams that has to read questions, look at real datasets, and give structured feedback to students.
  • We designed the prompt as a specification, with a clear role, explicit rules, and few-shot examples to guide the model toward consistent behavior.
  • We enforced structured output using Pydantic, so the LLM returns typed objects instead of free text that needs to be parsed and fixed every time.
  • We plugged in tools to give the model access to the datasets, grading rubrics, and ground truth answers, so it can check the student work instead of hallucinating results.
  • We added guardrails and validation around the model, making sure inputs and outputs are sane, scores make sense, and the system fails gracefully when something goes wrong.
  • Finally, we put everything together into a simple pipeline, where prompts, tools, schemas, and guardrails work as one unit that you can reuse, test, and monitor.

The main idea is simple. LLMs are not magical oracles. They are powerful components that need context, structure, and constraints. The more you control the prompt, the output format, the tools, and the failure modes, the less the model has to fill the gaps on its own, and the fewer hallucinations you get.

Before you head out

Thanks again for your time. It means a lot ❤️

My name is Piero Paialunga, and I’m this guy here:

Image made by author

I am originally from Italy, hold a Ph.D. from the University of Cincinnati, and work as a Data Scientist at The Trade Desk in New York City. I write about AI, Machine Learning, and the evolving role of data scientists both here on TDS and on LinkedIn. If you liked the article and want to know more about machine learning and follow my work, you can:

A. Follow me on LinkedIn, where I publish all my stories
B. Follow me on GitHub, where you may see all my code
C. For questions, you may send me an email
