
Some Thoughts on Operationalizing LLM Applications


A number of personal lessons learned from developing LLM applications

Source DALL·E 3 prompted with “Operationalizing LLMs, watercolor”

It’s been fun posting articles exploring the latest Large Language Model (LLM) techniques and libraries as they emerge, but much of my time has been spent behind the scenes working on the operationalization of LLM solutions. Many organizations are working on this right now, so I thought I’d share a few quick thoughts about my journey so far.

It’s beguilingly easy to throw up a quick demo to showcase some of the amazing capabilities of LLMs, but anyone tasked with putting them in front of users, with the hope of having a discernible impact, soon realizes there’s a lot of work required to tame them. Below are some of the key areas that most organizations might need to consider.

Some of the key areas that must be considered before launching applications that use Large Language Models (LLMs).

The list isn’t exhaustive (see also Kaddour et al, 2023), and which of the above applies to your application will of course vary, but even solving for safety, performance, and cost can be a daunting prospect.

So what can we do about it?

There is much concern about the safe use of LLMs, and quite rightly so. Trained on human output, they suffer from many of the less favorable facets of the human condition, and being so convincing in their responses raises new issues around safety. However, the risk profile is not the same for all cases; some applications are much safer than others. Asking an LLM to provide answers directly from its training data offers more potential for hallucination and bias than a low-level technical use of an LLM to predict metadata. This is an obvious distinction, but worth considering for anyone about to build LLM solutions: starting with low-risk applications is an obvious first step and reduces the amount of work required for launch.

How LLMs are used influences how risky they are to use.

We live in incredibly exciting times, with rapid advances in AI coming out each week, but it sure makes building a roadmap difficult! Several times in the last year a new vendor feature, open-source model, or Python package has been released that has changed the landscape significantly. Figuring out which techniques, frameworks, and models to use so that LLM applications maintain value over time is difficult. There’s no point in building something fabulous only to have its capabilities natively supported for free or at very low cost in the next 6 months.

Another key consideration is to ask whether an LLM is actually the best tool for the job. With all the excitement in the last year, it’s easy to get swept away and “LLM the heck” out of everything. As with any new technology, using it just for the sake of using it is often a big mistake, and as LLM hype adjusts, we may find our snazzy app becomes obsolete with real-world usage.

That said, there is no doubt that LLMs can offer some incredible capabilities, so if forging ahead, here are some ideas that might help …

In website design there is the concept of mobile-first: developing web applications that work on less capable phones and tablets first, then figuring out how to make things work nicely on more flexible desktop browsers. Doing things this way around can sometimes be easier than the converse. A similar idea can be applied to LLM applications: where possible, try to develop them so that they work with cheaper, faster models from the outset, such as GPT-3.5-turbo instead of GPT-4. These models are a fraction of the cost and will often force the design process towards more elegant solutions that break the problem down into simpler parts, with less reliance on monolithic, lengthy prompts to expensive and slow models.

Of course, this isn’t always feasible and those advanced LLMs exist for a reason, but many key functions can be supported with less powerful LLMs: simple intent classification, planning, and memory operations. It may also be the case that careful design of your workflows can open the possibility of different streams, where some use less powerful LLMs and others more powerful (I’ll be doing a later blog post on this).

Down the road, when those more advanced LLMs become cheaper and faster, you can then swap out the more basic LLMs and your application may magically improve with very little effort!

It’s good software engineering practice to use a generic interface where possible. For LLMs, this can mean using a service or Python module that presents a fixed interface able to interact with multiple LLM providers. A great example is LangChain, which offers integration with a wide range of LLMs. By using LangChain to communicate with LLMs from the outset rather than native LLM APIs, we can swap out different models in the future with minimal effort.
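As a rough illustration, here is a minimal sketch using LangChain’s common chat interface, assuming the langchain-openai package is installed and an API key is set in the environment; the build_llm helper and the model names are illustrative choices rather than a prescribed setup:

```python
# A minimal sketch: talk to an LLM through LangChain's common chat interface
# instead of a provider's native SDK, so models and providers can be swapped.
from langchain_openai import ChatOpenAI
# from langchain_anthropic import ChatAnthropic  # another provider behind the same interface

def build_llm(provider: str = "openai"):
    """Illustrative helper returning a chat model behind a common interface."""
    if provider == "openai":
        return ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    # elif provider == "anthropic":
    #     return ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)
    raise ValueError(f"Unknown provider: {provider}")

llm = build_llm()
# LangChain chat models share the same invoke() call, so application code
# stays the same when the underlying model changes.
reply = llm.invoke("Classify this request as 'query', 'chitchat', or 'command': Show me sales for March")
print(reply.content)
```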

Another example of this is to use AutoGen for agents, even when using OpenAI assistants. That way, as other native agents become available, your application can be adjusted more easily than if you had built an entire process around OpenAI’s native implementation.
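To make that concrete, below is a minimal AutoGen sketch, assuming the pyautogen package and an OpenAI API key in the environment; the model, working directory, and task are illustrative only:

```python
# A minimal two-agent AutoGen sketch: the assistant proposes code, the user proxy runs it.
import autogen

llm_config = {"config_list": [{"model": "gpt-3.5-turbo"}]}

assistant = autogen.AssistantAgent(name="assistant", llm_config=llm_config)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",  # fully automated for this sketch
    code_execution_config={"work_dir": "scratch", "use_docker": False},
)

# The agent abstraction sits above the LLM provider, so the model or even the
# agent framework can later be swapped with limited changes to this code.
user_proxy.initiate_chat(assistant, message="Plot y = x**2 for x from 0 to 10 and save it as plot.png")
```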

A common pattern in LLM development is to break down the workflow into a series of conditional steps using frameworks such as Promptflow. Chains are well-defined, so we know, more or less, what will happen in our application. They’re a great place to start and offer a high degree of transparency and reproducibility. However, they don’t handle edge cases well; that’s where groups of autonomous LLM agents can work well, as they can iterate towards a solution and recover from errors (most of the time). The problem with these is that, for now at least, agents can be slow due to their iterative nature, expensive due to LLM token usage, and tend to be a bit wild at times and fail spectacularly. They’re likely the future of LLM applications though, so it’s a good idea to prepare even if you aren’t using them in your application right now. By building your workflow as a modular chain, you are in fact doing just that! Individual nodes in the workflow can be swapped out to use agents later, providing the best of both worlds when needed.

It should be noted that there are some limitations with this approach; streaming of the LLM response becomes more complicated, but depending on your use case the benefits may outweigh these challenges.

Linking together steps in an LLM workflow with Promptflow. This has several benefits, one being that steps can be swapped out for more advanced techniques in the future.
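To illustrate the idea of swappable nodes (independently of Promptflow’s own syntax), here is a minimal sketch where each step is a plain function behind a small, common signature; the step names and placeholder logic are made up for the example:

```python
# A generic modular workflow: each node takes and returns a state dict, so any
# node can later be replaced by an agent-backed implementation.
from typing import Callable, Dict, List

def classify_intent(state: Dict) -> Dict:
    # Placeholder: in practice this could call a cheap LLM to label the request.
    state["intent"] = "report_request"
    return state

def retrieve_data(state: Dict) -> Dict:
    # Placeholder: fetch whatever data the intent refers to.
    state["data"] = [1, 2, 3]
    return state

def summarize(state: Dict) -> Dict:
    # Placeholder: a good candidate to swap for an autonomous agent later on.
    state["summary"] = f"Found {len(state['data'])} records for intent '{state['intent']}'"
    return state

workflow: List[Callable[[Dict], Dict]] = [classify_intent, retrieve_data, summarize]

state: Dict = {"query": "Show me last month's sales"}
for step in workflow:
    state = step(state)
print(state["summary"])
```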

It is truly amazing to watch AutoGen agents and OpenAI assistants generating code and automatically debugging to solve tasks; to me it feels like the future. It also opens up amazing opportunities such as LLM As Tool Maker (LATM, Cai et al 2023), where your application can generate its own tools. That said, from my personal experience so far, code generation can be a bit wild. Yes, it’s possible to optimize prompts and implement a validation framework, but even if that generated code runs perfectly, is it right when solving new tasks? I have come across many cases where it isn’t, and it’s often quite subtle to catch: the scale on a graph, summing across the wrong elements in an array, or retrieving slightly the wrong data from an API. I think this will change as LLMs and frameworks advance, but right now I would be very cautious about letting LLMs generate code on the fly in production and would instead opt for some human-in-the-loop review, at least for now.
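As a simple precaution, something like the following human-in-the-loop gate can sit between generation and execution; the review flow and sample snippet are illustrative, and a real system would also sandbox the execution:

```python
# A minimal sketch of a human review gate before running LLM-generated code.
import ast

def review_and_run(generated_code: str) -> None:
    # Sanity check: the snippet must at least parse as valid Python.
    try:
        ast.parse(generated_code)
    except SyntaxError as err:
        print(f"Rejected: generated code does not parse ({err})")
        return

    # Require explicit human approval before anything is executed.
    print("----- generated code -----")
    print(generated_code)
    if input("Run this code? [y/N] ").strip().lower() != "y":
        print("Skipped by reviewer.")
        return

    # Still risky: in production this should run in an isolated sandbox.
    exec(generated_code, {}, {})

review_and_run("print(sum(x * x for x in range(5)))")
```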

There are of course many use cases that absolutely require an LLM. But to ease into things, it might make sense to choose applications where the LLM adds value to the process rather than being the process. Imagine a web app that presents data to a user and is already useful. That application could be enhanced with LLM features for finding and summarizing that data. By placing slightly less emphasis on the LLM, the application is less exposed to issues arising from LLM performance. Stating the obvious, of course, but it’s easy to dive into generative AI without first taking baby steps.

Prompting LLMs incurs costs and can result in a poor user experience as users wait for slow responses. In many cases, the prompt is similar or identical to one previously made, so it’s useful to be able to remember past activity for reuse without having to call the LLM again. Some great packages exist, such as memgpt and GPTCache, which use document embedding vector stores to persist ‘memories’. This is the same technology used for common RAG document retrieval; memories are just chunked documents. The slight difference is that frameworks like memgpt do some clever things to use the LLM to self-manage memories.
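The underlying idea is simple enough to sketch without either library: embed the incoming prompt, compare it to embeddings of previous prompts, and only call the LLM when nothing similar has been seen. The embed() and call_llm() functions below are trivial stand-ins for a real embedding model and LLM call:

```python
# A minimal sketch of embedding-based prompt caching, similar in spirit to GPTCache.
import numpy as np

cache: list = []                 # list of (prompt embedding, cached response) pairs
SIMILARITY_THRESHOLD = 0.95      # illustrative cut-off for "same question"

def embed(text: str) -> np.ndarray:
    """Trivial stand-in for a real embedding model (e.g. a hosted embedding API)."""
    return np.array([len(text), text.count(" "), sum(map(ord, text)) % 97], dtype=float)

def call_llm(prompt: str) -> str:
    """Trivial stand-in for the expensive LLM call we want to avoid repeating."""
    return f"(LLM answer to: {prompt})"

def cached_completion(prompt: str) -> str:
    query_vec = embed(prompt)
    for vec, response in cache:
        # Cosine similarity between the new prompt and a previously seen one.
        sim = float(np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if sim >= SIMILARITY_THRESHOLD:
            return response      # reuse the stored answer; no LLM call made
    response = call_llm(prompt)
    cache.append((query_vec, response))
    return response

print(cached_completion("What were sales in March?"))
print(cached_completion("What were sales in March?"))   # served from the cache
```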

You may find, however, that due to a particular use case you need some form of custom memory management. In this scenario, it’s sometimes useful to be able to view and manipulate memory records without having to write code. A powerful tool for this is pgvector, which combines vector store capabilities with the Postgres relational database for querying, making it easy to understand the metadata stored with memories.
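As a rough sketch of what that can look like, the snippet below stores memories with their metadata in Postgres and runs a nearest-neighbour query with pgvector; the connection string, table layout, and 3-dimensional embeddings are illustrative only:

```python
# A minimal sketch of storing and querying 'memories' with pgvector via psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=llm_app user=postgres")  # illustrative connection
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS memories (
        id        serial PRIMARY KEY,
        content   text,
        metadata  jsonb,
        embedding vector(3)
    )
""")
cur.execute(
    "INSERT INTO memories (content, metadata, embedding) VALUES (%s, %s, %s)",
    ("User prefers weekly summaries", '{"user_id": 42}', "[0.1, 0.2, 0.3]"),
)

# '<->' is pgvector's L2 distance operator; because this is ordinary SQL,
# the metadata stored alongside each memory is easy to inspect and edit.
cur.execute(
    "SELECT content, metadata FROM memories ORDER BY embedding <-> %s LIMIT 5",
    ("[0.1, 0.2, 0.25]",),
)
for content, metadata in cur.fetchall():
    print(content, metadata)

conn.commit()
cur.close()
conn.close()
```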

At the end of the day, whether your application uses LLMs or not, it is still a software application and will benefit from standard engineering techniques. One obvious approach is to adopt test-driven development. This is especially important with LLMs provided by vendors, to account for the fact that the performance of those LLMs may vary over time, something you will need to quantify for any production application. Several validation frameworks exist; again, Promptflow offers some straightforward validation tools and has native support in Microsoft AI Studio. There are other testing frameworks out there; the point is to use one from the start for a strong foundation in validation.
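A small pytest example shows the general shape; classify_intent() stands in for a hypothetical application function (with a trivial keyword heuristic so the example runs), and the point is that re-running such tests regularly helps quantify drift in a vendor-hosted model:

```python
# A minimal sketch of regression tests around an LLM-backed function.
import pytest

def classify_intent(text: str) -> str:
    """Hypothetical application function; in practice this would call an LLM.
    A trivial keyword stand-in is used here so the example runs."""
    return "query" if "sales" in text.lower() else "chitchat"

@pytest.mark.parametrize(
    "text, expected",
    [
        ("Show me sales for March", "query"),
        ("Hi there, how are you?", "chitchat"),
    ],
)
def test_intent_labels(text, expected):
    # Constrained outputs (a label from a fixed set) can be tested deterministically.
    assert classify_intent(text) == expected
```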

That said, it should be noted that LLMs are not deterministic, providing slightly different results each time depending on the use case. This has an interesting effect on tests in that the expected result isn’t set in stone. For example, testing that a summarization task is working as required can be difficult because the summary will vary slightly every time. In these cases, it’s often useful to use another LLM to evaluate the application LLM’s output. Metrics such as Groundedness, Relevance, Coherence, Fluency, GPT Similarity, and ADA Similarity can be applied; see for example Azure AI Studio’s implementation.
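A hedged sketch of that LLM-as-judge pattern is shown below, again via LangChain; the 1-to-5 relevance rubric and prompt wording are my own illustration, not Azure AI Studio’s implementation:

```python
# A minimal sketch of using a second LLM to grade non-deterministic output.
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4", temperature=0)  # requires an OpenAI API key

def score_relevance(source_text: str, summary: str) -> int:
    prompt = (
        "You are grading a summary. Rate how relevant the summary is to the source "
        "on a scale of 1 (irrelevant) to 5 (fully relevant). Reply with a single digit.\n\n"
        f"Source:\n{source_text}\n\nSummary:\n{summary}"
    )
    reply = judge.invoke(prompt).content.strip()
    return int(reply[0])  # naive parsing for the sketch; real code should validate this

# A test can then assert on the score rather than on exact wording, e.g.:
# assert score_relevance(source_text, generated_summary) >= 4
```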

Once you have a set of tests that confirm your application is working as expected, you can incorporate them into a DevOps pipeline, for example running them in GitHub Actions before your application is deployed.

No one size fits all of course, but for smaller organizations implementing LLM applications, developing every aspect of the solution may be a challenge. It might make sense to focus on the business logic and work closely with your users while using enterprise tools for areas such as LLM safety rather than developing them yourself. For example, Azure AI Studio has some great features that enable various safety checks on LLMs with the click of a button, as well as easy deployment to API endpoints with integrated monitoring and safety. Other vendors such as Google have similar offerings.

There is of course a cost associated with features like this, but it may be well worth it, as developing them yourself is a significant undertaking.

Azure AI Content Safety Studio is a great example of a cloud vendor solution for ensuring your LLM application is safe, with no associated development effort.

LLMs are far from perfect, even the most powerful ones, so any application using them must have a human in the loop to ensure things are working as expected. For this to be effective, all interactions with your LLM application should be logged and monitoring tools put in place. This is of course no different from any well-managed production application, the difference being the new kinds of monitoring needed to capture performance and safety issues.
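A minimal way to start is to wrap every LLM call so that the prompt, response, and latency are written somewhere a human can review them; the decorator, log file, and stubbed call_llm() below are illustrative assumptions:

```python
# A minimal sketch of logging every LLM interaction for later human review.
import json
import logging
import time

logging.basicConfig(filename="llm_interactions.log", level=logging.INFO)

def logged(llm_call):
    """Decorator that records the prompt, response, and latency of each LLM call."""
    def wrapper(prompt: str) -> str:
        start = time.time()
        response = llm_call(prompt)
        logging.info(json.dumps({
            "prompt": prompt,
            "response": response,
            "latency_s": round(time.time() - start, 3),
        }))
        return response
    return wrapper

@logged
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider client or LangChain model."""
    return "stub response"

call_llm("Summarize last week's incidents")
```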

Another key role humans can play is to correct and improve the LLM application when it makes mistakes. As mentioned above, the ability to view the application’s memory can help, especially if the human can make adjustments to that memory, working with the LLM to provide end users with the best experience. Feeding this modified data back into prompt tuning or LLM fine-tuning can be a powerful tool for improving the application.

The above thoughts are by no means exhaustive for operationalizing LLMs and may not apply to every scenario, but I hope they might be useful for some. We’re all on an amazing journey right now!

Challenges and Applications of Large Language Models, Kaddour et al., 2023.

Large Language Models as Tool Makers, Cai et al., 2023.

Unless otherwise noted, all images are by the author.

Please like this article if inclined, and I’d be delighted if you followed me! You can find more articles here.
