LLMOps: My Thesis & Market Map


Table of Contents

1) Introduction
2) Why does LLMOps matter now?
3) Issues with LLMs in production?
4) LLMOps Market Map
5) My predictions for LLMOps
6) Concluding Remarks

Introduction

In my previous article last year, I explored MLOps and highlighted 4 verticals that I believe offer exceptional investment opportunities.

However, over the past 12 months, machine learning has moved at a breakneck pace, with many arguing that AI has finally crossed the inflection point that has been promised for a long time. The emergence of Large Language Models (LLMs), notably exemplified by OpenAI’s ChatGPT, has captivated the general public’s imagination. Generative AI has even shaken the enterprise market, with just about all firms, big or small, actively exploring avenues to integrate AI capabilities into their services and products. This is reflected in the large amount of capital being poured in by VCs (including new funds such as Amazon’s recent Generative AI fund) into the generative AI ecosystem.

(Source: Pitchbook Report)

As I have argued previously, the proliferation of any new technology relies heavily on the availability of robust tooling and infrastructure. In the world of enterprise technology, particularly managed services, the selling point goes beyond mere accuracy or a sophisticated feature set, as competitors can swiftly catch up in those aspects. The true differentiator is offering a superior developer experience: streamlined setup, effortless usage, and reduced overhead. In the case of Large Language Models (LLMs) and AI, there is a pressing need for precisely such a superior developer experience along with better accuracy.

AI Value Chain (Image by Author)

Why does LLMOps matter now?

“LLMOps” encompasses everything that a developer needs to make LLMs work: development, deployment, and maintenance. Essentially, it’s a new set of tools and best practices to manage the lifecycle of LLM-powered applications.

Many firms and individuals are already using LLMs to power their applications: Notion (writing assistant), Github (programming assistant), Microsoft (office assistant), Michelle Huang (chatting with self), Bhaskar Tripathi (reading PDFs), and many more! And as expected, taking LLMs into production is not easy. As Chip Huyen puts it:

“It’s easy to make something cool with LLMs, but very hard to make something production-ready with them.”

LLMs have evolved at a breakneck pace since Google published the original Transformer paper in June 2017. In fact, you can take a look at this amazing infographic to learn more about the evolution of LLMs to date.

Rise of LLMs (Image by Author)

The case I’m trying to make is that LLMs have come of age: even with several chinks in their armor, today’s LLMs are good enough for several tasks. Hence, to enable the use of LLMs by businesses and individuals, we need robust tools and platforms.

But before diving into the challenges and opportunities in LLMOps, I want to state my inherent assumptions while thinking about this space:

  • I see a future where LLMs go beyond just generating text, images, music, etc. to directly calling APIs, executing code, or modifying system resources; LLMs will become the new interaction layer for software
  • LLMs will generate and trigger complex dynamic workflows
  • LLMs will integrate, run inference, and work with one another, including across differing modalities such as text, image, code, audio, and video

As a result, LLM infrastructure is a space that’s ripe for innovation and, consequently, investment!

Issues with LLMs in production?

  1. Large language models are very expensive to train, since they require constant experimentation and re-training on new datasets (OpenAI’s GPT model was initially trained on data until 2021) to keep the model from going stale. More importantly, it’s the inference costs that are steep (Google could lose $30B)!
  2. Only a few firms are mature enough to continuously fine-tune their models and keep their data pipelines healthy, especially in today’s world where most of the data is shared across code, services, product teams, and even organizations. LLMs, for all their goodness, can hence become an architectural nightmare in production; you can’t just train once and deploy forever.
  3. This is a fancy way of saying that LLMs can lie! Hallucination is a major concern (especially in an era where fake news, deepfakes, etc. are so common), as LLMs can spread vast amounts of misinformation (even ChatGPT) due to their proliferation. Further, trying to understand why hallucinations occur is hard because the way LLMs derive their output is a black box. However, at a high level, we know that data quality, the generation method, and the input context all affect hallucinations.
  4. Client-side orchestration is a much easier problem to solve than server-side orchestration. The real challenge is solving for the massive scale requirements of modern applications. Imagine having to train and deploy an LLM in a distributed setting with caching, throttling, authn/authz, and other critical enterprise features to provide adequate SLAs (required for any large application) in terms of API responsiveness and throughput; it is not trivial (a minimal caching and throttling sketch follows this list)!
  5. We have already seen multiple instances of security concerns over LLMs (e.g., the Samsung leak). Prompt injection has also become a popular and effective tool to bypass the rudimentary security of LLMs. Enterprise adoption won’t gain steam without much stronger security measures across the LLM stack.
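To make point 4 concrete, here is a minimal sketch of two of those server-side concerns, response caching and request throttling, wrapped around an LLM call. The call_llm helper and the rate limit are hypothetical placeholders, not any particular vendor's API:

    import time
    from functools import lru_cache

    MAX_REQUESTS_PER_MINUTE = 60   # assumed budget; tune per SLA
    _request_times = []

    def call_llm(prompt: str) -> str:
        """Placeholder for a real model API call."""
        return "response to: " + prompt

    @lru_cache(maxsize=1024)
    def cached_llm(prompt: str) -> str:
        # Caching: identical prompts skip a second (expensive) model call.
        return call_llm(prompt)

    def serve(prompt: str) -> str:
        # Throttling: reject requests beyond the per-minute budget.
        now = time.time()
        _request_times[:] = [t for t in _request_times if now - t < 60]
        if len(_request_times) >= MAX_REQUESTS_PER_MINUTE:
            raise RuntimeError("Rate limit exceeded; retry later.")
        _request_times.append(now)
        return cached_llm(prompt)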

LLMOps Market Map

As I continue following the evolution of large language models, startups are developing innovative products across the entire infrastructure stack (the LLM stack itself has evolved from the traditional NLP stack). However, not all of the sub-spaces within the infrastructure stack are equally exciting. Below, I cover my breakdown of and predictions for the LLMOps space!

LLMOps Market Map (Image by Author)

Client-side Orchestration

These are the products that help developers orchestrate the client-side integrations of a generative AI application. This would include tools that help developers hook deployed models up to external software APIs and other foundation models, and that facilitate end-user interactions. Client frameworks also provide mechanisms that enable end-users to chain prompts with external APIs. Such frameworks assist in breaking an overarching task into a series of smaller subtasks, mapping each subtask to a single step for the model to complete, and using the output from one step as an input to the next. Further, these frameworks can provide marketplaces for pre-built client-side workflows with integrations.
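A minimal sketch of that chaining pattern, assuming a hypothetical call_llm(prompt) helper that wraps whichever model API is in use; each step's output feeds the next step's input:

    def call_llm(prompt: str) -> str:
        """Placeholder for a real model API call."""
        return "..."

    def summarize_and_reply(email_text: str) -> str:
        # Step 1: extract the key asks from a long email.
        asks = call_llm("List the action items in this email:\n" + email_text)
        # Step 2: draft a reply addressing only those asks.
        return call_llm("Draft a short, polite reply addressing these action items:\n" + asks)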

Prompt Management

First up, what is prompt engineering? It’s a way to tweak your intent, i.e., your queries to the LLM, such that the output matches your expectations as closely as possible. The OpenAI Cookbook provides a bunch of suggestions to improve your prompts. Now, back to why prompt management is critical. I believe that applications will be composed of multiple LLMs, enabling developers to pick the most appropriate model for their specific task and other considerations such as domain knowledge, speed, cost, etc. In fact, there is a growing consensus that applications will have a multi-modal architecture glued together by an orchestration layer. Prompts will be the central piece in such a layer, and thus there will be a need for prompt engineering tools that are flexible and accommodate a wide range of use cases, are easy to use (possibly low-code/no-code?), are easy to evaluate, are traceable and debuggable, allow lifecycle management and versioning, and are compatible with the plethora of language models.

Prompt Chaining (Image by Author)
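As a sketch of what that lifecycle management could look like, prompts below are stored as versioned templates so they can be A/B tested, diffed, and rolled back like code. The registry, template names, and versions are all hypothetical:

    PROMPT_REGISTRY = {
        ("summarize", "v1"): "Summarize the following text:\n{text}",
        ("summarize", "v2"): "Summarize the following text in three bullet points for a busy executive:\n{text}",
    }

    def render_prompt(name: str, version: str, **variables) -> str:
        # Fetch a specific template version and fill in its variables.
        return PROMPT_REGISTRY[(name, version)].format(**variables)

    # Two versions can be evaluated against the same inputs; a bad
    # version is "rolled back" by pinning callers to "v1".
    prompt = render_prompt("summarize", "v2", text="...")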

Vector Databases

An effective way to leverage LLMs is to generate embeddings from context (i.e., information) and then develop ML applications on top of those embeddings. The idea is to use these mathematical representations of text for common operations such as search, clustering, recommendations, anomaly detection, etc. This is done by running similarity checks on these mathematical vectors. However, these embeddings (which reduce complex texts into mathematical vectors) can become very large, since the documents/information may have thousands of tokens. Hence, we need vector databases to efficiently store and retrieve embeddings. The popularity of vector databases has gone up in recent times, driven partly by the AI hype. Data storage and retrieval for LLMs will continue to evolve as the generative AI space itself matures. Hence, there is scope for large-scale innovation and growth. We are already seeing a lot of VC activity being poured into vector databases. It will be interesting to see which implementation comes out on top!

Illustration of Vector embeddings (Image by Redis)
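The similarity check at the heart of this is easy to sketch. Below, embed() is a stand-in for any embedding model (the random projection is purely illustrative), and a real vector database would replace the brute-force scan with an approximate nearest-neighbor index:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Placeholder: returns a fixed-size vector for the text."""
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.standard_normal(384)

    documents = ["refund policy", "shipping times", "account deletion"]
    doc_vectors = np.stack([embed(d) for d in documents])

    def search(query: str, k: int = 2) -> list:
        q = embed(query)
        # Cosine similarity between the query and every document vector.
        sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
        return [documents[i] for i in np.argsort(-sims)[:k]]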

Experimentation

Training, fine-tuning, and inference of models are both hard and very costly. To put it into context, a back-of-the-envelope calculation illustrates how moving from traditional search to LLMs can result in a ~65% reduction in operating income for Google. Hence, we need novel tools and techniques to reduce the costs of training, fine-tuning, and inference (as evidenced by this recent Stanford paper). Apart from the cost, inference latency is critical for enterprise adoption. Deterministic applications serve APIs in microseconds; nobody wants to wait several seconds just to get a response or trigger an action in an application. While training and inference are more straightforward, fine-tuning has issues beyond just performance. Fine-tuning involves updating the parameters of the underlying foundation model by retraining it on more data (which can also be a more targeted dataset versus a general dataset). An appropriately fine-tuned model can increase prediction accuracy, improve model performance, and reduce training costs. However, fine-tuning a model is not as easy as it seems! If not done properly, it can even result in worse outputs. Fine-tuning a model not only requires deep technical expertise, but also a lot of storage and compute resources. Fine-tuning too much can also introduce overfitting (i.e., the model gets too specialized on a particular dataset and fails to capture the overarching generic patterns) and even cause hallucinations. Hence, we need sophisticated tools that are easy to use, provide fine-tuning strategies, and offer evaluation methods to compare those strategies.
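To illustrate the kind of back-of-the-envelope math behind such claims, here is a cost sketch where every number is an assumption, not a measured figure:

    queries_per_day = 10_000_000      # assumed traffic
    tokens_per_query = 1_000          # assumed prompt + completion size
    cost_per_1k_tokens = 0.002        # assumed $ per 1K tokens

    daily_cost = queries_per_day * tokens_per_query / 1_000 * cost_per_1k_tokens
    print(f"~${daily_cost:,.0f}/day, ~${daily_cost * 365:,.0f}/year")
    # ~$20,000/day, ~$7,300,000/year in inference alone, before any
    # re-training, which is why inference economics dominate the discussion.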

Server-side Orchestration

Server-side orchestration includes all the pieces of code, i.e., the machinery that executes in the backend to run the model: deployment, training, inference, monitoring, and security.

Deployment

When thinking about leveraging foundation models, enterprises can either use managed models (such as OpenAI, Anthropic, Cohere, etc.) or deploy their own models. However, deploying a model is non-trivial and expensive. You have to scale the model architecture, upgrade models to newer versions, switch between multiple models, etc. Further, deploying and training a model requires powerful on-demand GPU infrastructure. Enterprises will have to weigh the pros and cons of cloud-based versus on-premise model deployment with respect to cost, latency, privacy, etc. Automated deployment pipelines (CI/CD) would power streamlined training, fine-tuning, and inference capabilities in addition to traditional software functionalities such as upgrades and rollbacks with minimal user disruption.
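A minimal sketch of the model-switching and rollback piece, written as a canary router; the version names and traffic split are hypothetical:

    class ModelRouter:
        def __init__(self, stable: str):
            self.stable = stable        # version serving all traffic
            self.candidate = None       # version being rolled out
            self.canary_fraction = 0.0  # share of traffic to the candidate

        def start_rollout(self, candidate: str, fraction: float = 0.05):
            self.candidate, self.canary_fraction = candidate, fraction

        def rollback(self):
            # Instantly route all traffic back to the stable version.
            self.candidate, self.canary_fraction = None, 0.0

        def pick_version(self, request_id: int) -> str:
            # Deterministic bucketing so retries hit the same model version.
            if self.candidate and (request_id % 100) < self.canary_fraction * 100:
                return self.candidate
            return self.stable

    router = ModelRouter(stable="llm-v1")
    router.start_rollout("llm-v2", fraction=0.10)  # 10% canary traffic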

Observability & Monitoring

In production systems, it is critical that we can observe, evaluate, optimize, and debug the code. With LLMs (or AI in general), the issue of observability gets exacerbated due to their black-box nature. Observability involves tracking and understanding performance, including identifying failures, outages, and downtime, evaluating system health (or LLM health), and, in the case of LLMs, deciphering outputs, for instance, explaining why the model came to a certain decision. However, LLMs present some unique challenges. Firstly, it is very difficult to determine what “good” performance actually means for the model. Here, one would probably need to analyze user interactions just to assess the model’s performance. Further, closed-source black-box models are even harder to understand and explain since we don’t have access to the architecture or the training data. Hence, we need new testing and comparison frameworks such as “HELM” by Stanford, Evals by OpenAI, etc. to provide standardization. LLMs (and in general all machine learning models) also exhibit issues such as model drift: deterioration in model performance due to changes in the underlying data distribution (aka stale data). Monitoring the model can help in updating it with fresh data to mitigate drift. Thus, tracking model performance and usage is crucial to debug potential issues, fine-tune the model, and even change the underlying model architecture.
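A sketch of the most basic layer of this: log latency and output size per call, and flag drift when recent responses deviate from a historical baseline. The 30% threshold, the baseline, and the call_llm helper are illustrative assumptions:

    import time
    from collections import deque
    from statistics import mean

    WINDOW = deque(maxlen=500)    # rolling stats over recent calls
    BASELINE_AVG_LENGTH = 120.0   # assumed from historical data

    def call_llm(prompt: str) -> str:
        return "..."  # placeholder for a real model API call

    def observed_call(prompt: str) -> str:
        start = time.time()
        output = call_llm(prompt)
        latency_ms = (time.time() - start) * 1000
        WINDOW.append(len(output.split()))
        print(f"latency={latency_ms:.1f}ms words={WINDOW[-1]}")
        # Naive drift check: response lengths shifting far from baseline.
        if len(WINDOW) == WINDOW.maxlen:
            drift = abs(mean(WINDOW) - BASELINE_AVG_LENGTH) / BASELINE_AVG_LENGTH
            if drift > 0.3:
                print("ALERT: output distribution drifted >30% from baseline")
        return output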

Privacy & Security

Here, I’m using privacy as an overarching term for model safety, security, and compliance. With stringent privacy and security laws such as GDPR, CCPA, HIPAA, and many more across the globe, governments are putting privacy at center stage for any new technological innovation. For enterprises to trust and deploy generative AI models, they need tools that provide accurate evaluations of model fairness, bias, and toxicity (generating unsafe or hateful content), in addition to privacy guardrails. Enterprises are now increasingly concerned about extraction of training data, corrupted training data, and leaking of proprietary sensitive data (e.g., the Samsung case). Apart from this, LLMs, just like traditional machine learning models, are prone to adversarial attacks. Hence, we need products that can protect against prompt injection, data leakage, and toxic language generation; provide data privacy through anonymization; provide access control (such as RBAC) for LLMs; implement adversarial training and defensive distillation; and much more. Such products can help detect anomalies and keep the production model optimized while maintaining its integrity.
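A deliberately naive sketch of where such guardrails sit: regex-based PII redaction plus a phrase-matching check for prompt injection, applied before user text reaches the model or logs. Real products use far more robust techniques; the patterns below are illustrative only:

    import re

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
    INJECTION_PHRASES = ["ignore previous instructions", "disregard the system prompt"]

    def sanitize_input(user_text: str) -> str:
        lowered = user_text.lower()
        if any(p in lowered for p in INJECTION_PHRASES):
            raise ValueError("Possible prompt injection detected.")
        # Anonymize obvious PII before it reaches the model or the logs.
        user_text = EMAIL_RE.sub("[EMAIL]", user_text)
        return SSN_RE.sub("[SSN]", user_text)

    print(sanitize_input("Contact me at jane@example.com"))  # -> "Contact me at [EMAIL]"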

Visualizing LLMOps (Image by Author)
My predictions for LLMOps

  1. Just as we saw with SQL, document, graph, etc. databases, I foresee that two major players will emerge in the vector database space (one closed-source and one open-source player). While the database market is large (and thus has a long tail of players), we have always seen only two or three major players taking most of the market share.
  2. LLMs are still evolving at a rapid pace, and so will the client orchestration frameworks. Multiple new approaches will emerge that will enable developers to integrate APIs, auth systems, databases, etc. Prompt management might get rolled up to become a part of strong orchestration offerings. Ultimately, the products that provide an excellent developer experience with robust integration support will excel.
  3. Currently, most LLMOps products tackle one (or a few) aspects of the LLM stack such as prompting, deployment, monitoring, etc. I believe that most of these startups will expand and converge to offer a broader range of capabilities (prompting, distributed training, deployment, monitoring, versioning, etc.) together as an end-to-end solution. This means most startups that may not be competitors today will end up being competitors. This also means there will be significant M&A among startups (e.g., Databricks’ acquisition of MosaicML).
  4. There are a bunch of startups working on the monitoring aspects of LLMs. However, individually they won’t be as valuable to enterprises, since the LLM will be just one part of the overall tech stack; enterprises will look for complete observability and monitoring platforms. Hence, I see players like Datadog, New Relic, Splunk, Elastic, etc. scooping up one (or more) LLM monitoring startups to bolster their monitoring portfolios.
  5. Startups focusing on “only” deployment of LLMs will struggle (including startups providing serverless GPUs). This is because I don’t see the economics of such a service working out, due to smaller scale and, more importantly, frequent GPU upgrade costs. Finally, big cloud providers have spent years providing compute at cheap rates that will be hard to beat. Further, most enterprises already have a lot of cloud accounts and thus have stickiness to the corresponding cloud platform.

Concluding Remarks

Using LLMs in production is hard! We have to tackle both difficult and unknown issues. Further, if a particular task is simple enough, then it’s hard to justify a more expensive, less explainable, and frankly slower system than traditional solutions. Hence, as the generative AI space keeps maturing with newer innovations and disrupts the industry, we’ll need powerful LLM infrastructure to support the ecosystem. Thus, LLMOps is one of the most exciting spaces.

If you are investing in this space, ideating or building your own venture, or have thoughts about AI or this article, I’d love to hear from you!

Rachit Kansal

The opinions expressed in this blog are solely those of the author and not of this platform. The author is not a member of or affiliated with any of the firms mentioned in the blog. The views in this blog are solely my own and do not represent any of my current or prior workplaces.

Additional references and reading material:

[1] https://home.mlops.community/public/events/llms-in-production-conference-2023-04-13

[2] https://shabazpatel.substack.com/p/rise-of-large-language-model-operations

[3] https://cobusgreyling.medium.com/prompt-chaining-large-language-models-a7c9b66eb5f9

[4] https://gradientflow.com/the-future-of-prompt-engineering-getting-the-most-out-of-llms/

[5] https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/

[6] https://duarteocarmo.com/blog/llms-lessons-learned

[7] https://recsysml.substack.com/p/staleness-of-recommendation-models

[8] https://speakerdeck.com/inesmontani/incorporating-llms-into-practical-nlp-workflows

[9] https://medium.com/@vbsowmya/llms-in-production-the-missing-discussion-ff8035f45a68

[10] https://twitter.com/sh_reya/status/1641106353971421185?s=46&t=SKZm3ndslERqEDd3XEsvyw

[11] https://machinelearningmastery.com/a-gentle-introduction-to-hallucinations-in-large-language-models/

[12] https://medium.com/@datasciencedisciple/why-do-large-language-models-hallucinate-d78dfac0f842
