
Bolstering enterprise LLMs with machine learning operations foundations


Once these components are in place, more complex LLM challenges require nuanced approaches and considerations, from infrastructure and capabilities to risk mitigation and talent.

Deploying LLMs as a backend

Inferencing with traditional ML models typically involves packaging a model object as a container and deploying it on an inferencing server. As demands on the model increase, with more requests and more customers requiring more run-time decisions (higher QPS within a latency bound), scaling is simply a matter of adding more containers and servers. In most enterprise settings, CPUs are sufficient for traditional model inferencing. But hosting LLMs is a much more complex process that requires additional considerations.
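As a point of reference, a minimal sketch of that traditional pattern is below, assuming a scikit-learn-style model serialized to a hypothetical model.pkl and served with FastAPI; the container image simply packages this script, and scaling means running more replicas behind a load balancer.

```python
# Minimal sketch of a containerized traditional-ML inference service.
# "model.pkl" is a hypothetical artifact baked into the container image.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:
    model = pickle.load(f)  # any model object exposing .predict()

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features):
    # CPU inference is usually sufficient for traditional models
    prediction = model.predict([features.values])
    return {"prediction": prediction.tolist()}
```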

LLMs operate on tokens, the basic units of text that the model uses to generate human-like language. They typically make predictions token by token in an autoregressive manner, with each new token conditioned on the previously generated ones, until a stop token is reached. The process can become cumbersome quickly: tokenizations vary by model, task, language, and computational resources. Engineers deploying LLMs need not only infrastructure experience, such as deploying containers in the cloud; they also need to know the latest techniques to keep inferencing costs manageable and meet performance SLAs.
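The token-by-token loop can be illustrated with a short sketch using the Hugging Face transformers library; "gpt2" is only a stand-in for any causal language model, and greedy decoding is used for simplicity.

```python
# Sketch of autoregressive, token-by-token generation with a stop condition.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The customer asked about", return_tensors="pt").input_ids

for _ in range(50):  # cap on the number of generated tokens
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)  # greedy pick
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:  # stop token reached
        break

print(tokenizer.decode(input_ids[0]))
```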

Vector databases as knowledge repositories

Deploying LLMs in an enterprise context means vector databases and other knowledge bases must be established, and they work together in real time with document repositories and language models to produce reasonable, contextually relevant, and accurate outputs. For instance, a retailer may use an LLM to power a conversation with a customer over a messaging interface. The model needs access to a database with real-time business data to call up accurate, up-to-date information about recent interactions, the product catalog, conversation history, company return policies, current promotions and ads in the market, customer service guidelines, and FAQs. These knowledge repositories are increasingly developed as vector databases for fast retrieval against queries via vector search and indexing algorithms.
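A small sketch of that retrieval pattern follows, assuming sentence-transformers for embeddings and FAISS as the vector index; a production vector database would add persistence, metadata filtering, and real-time updates on top of this.

```python
# Sketch: embed a small knowledge base and retrieve context for a query.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Returns are accepted within 30 days with a receipt.",
    "The spring promotion offers 20% off all outdoor gear.",
    "Customer support is available 9am-5pm on weekdays.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model
embeddings = encoder.encode(documents, normalize_embeddings=True)

# Inner product on normalized vectors is equivalent to cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

query = encoder.encode(["What is the return policy?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=2)
context = [documents[i] for i in ids[0]]  # passed to the LLM as grounding context
```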

Training and fine-tuning with hardware accelerators

LLMs present a further challenge: fine-tuning for optimal performance against specific enterprise tasks. Large enterprise language models can have billions of parameters. This requires more sophisticated approaches than traditional ML models, including a persistent compute cluster with high-speed network interfaces and hardware accelerators such as GPUs (see below) for training and fine-tuning. Once trained, these large models also need multi-GPU nodes for inferencing, with memory optimizations and distributed computing enabled.
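One common way to make fine-tuning tractable on such hardware is parameter-efficient fine-tuning; the sketch below uses LoRA via the peft library, with "gpt2" as a placeholder for a much larger base model, and is only illustrative of the setup, not a full training pipeline.

```python
# Sketch of parameter-efficient fine-tuning setup (LoRA) on GPU hardware.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",                      # placeholder for a much larger base model
    torch_dtype=torch.float16,   # reduced precision to fit accelerator memory
    device_map="auto",           # shard across available GPUs for large models
)

lora_config = LoraConfig(
    r=8,                         # low-rank adapter dimension
    lora_alpha=16,
    target_modules=["c_attn"],   # attention projection layers in GPT-2
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```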

To meet these computational demands, organizations will need to make more extensive investments in specialized GPU clusters or other hardware accelerators. These programmable hardware devices can be customized to accelerate specific computations such as matrix-vector operations. Public cloud infrastructure is an important enabler for these clusters.
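A toy timing sketch with PyTorch shows the kind of workload these accelerators are built for; actual speedups depend entirely on the hardware available.

```python
# Toy comparison of the same dense matrix multiply on CPU vs. GPU.
import time
import torch

def time_matmul(device: str, n: int = 4096) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()       # make sure timing excludes setup
    start = time.perf_counter()
    _ = a @ b                          # the core workload GPUs accelerate
    if device == "cuda":
        torch.cuda.synchronize()       # wait for the GPU kernel to finish
    return time.perf_counter() - start

print(f"cpu: {time_matmul('cpu'):.3f}s")
if torch.cuda.is_available():
    print(f"gpu: {time_matmul('cuda'):.3f}s")
```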

A new approach to governance and guardrails

Risk mitigation is paramount throughout the entire lifecycle of the model. Observability, logging, and tracing are core components of MLOps processes, which help monitor models for accuracy, performance, data quality, and drift after their release. This is critical for LLMs too, but there are additional infrastructure layers to consider.
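A minimal sketch of that observability layer might look like the following, where each generation call is wrapped so latency and hashed prompt/response identifiers are logged; a real deployment would ship these records to a monitoring stack rather than the local logger.

```python
# Sketch: wrap an LLM call to log latency and request/response identifiers.
import hashlib
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_inference")

def traced_generate(generate_fn, prompt: str) -> str:
    start = time.perf_counter()
    response = generate_fn(prompt)   # the actual LLM call (assumed callable)
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info(
        "prompt_hash=%s response_hash=%s latency_ms=%.1f",
        hashlib.sha256(prompt.encode()).hexdigest()[:12],
        hashlib.sha256(response.encode()).hexdigest()[:12],
        latency_ms,
    )
    return response
```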

LLMs can “hallucinate,” occasionally outputting false information. Organizations need proper guardrails, controls that enforce a specific format or policy, to ensure LLMs in production return acceptable responses. Traditional ML models rely on quantitative, statistical approaches to apply root cause analyses to model inaccuracy and drift in production. With LLMs, this is more subjective: it may involve running a qualitative scoring of the LLM’s outputs, then checking them against an API with pre-set guardrails to ensure an acceptable answer.
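A deliberately simple sketch of such a post-generation guardrail is below; the policy list and thresholds are illustrative only, and production systems would typically call a dedicated guardrails service or a second scoring model instead of hard-coded rules.

```python
# Sketch: check a raw LLM answer against simple policy rules before returning it.
BLOCKED_PHRASES = ["guaranteed returns", "legal advice"]  # illustrative policy list

def apply_guardrails(answer: str) -> str:
    # Rule 1: enforce a maximum length so responses stay concise.
    if len(answer) > 2000:
        answer = answer[:2000]
    # Rule 2: refuse answers that touch disallowed topics.
    if any(phrase in answer.lower() for phrase in BLOCKED_PHRASES):
        return "I'm sorry, I can't help with that. Please contact customer support."
    return answer
```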
