Applying LLMs to Enterprise Data: Concepts, Concerns, and Hot-Takes

Source: DreamStudio (image generated by the author)

Ask GPT-4 to prove there are infinitely many prime numbers, in rhyme, and it delivers. But ask it how your team performed against plan last quarter, and it fails miserably. This illustrates a fundamental challenge of large language models ("LLMs"): they have a superb grasp of general, public knowledge (like prime number theory), but are entirely unaware of proprietary, non-public information (how your team did last quarter).[1] And proprietary information is critical to the overwhelming majority of enterprise workflows. A model that understands the public web is cute, but of little use in its raw form to most organizations.

Over the past 12 months, I've had the privilege of working with numerous organizations applying LLMs to enterprise use cases. This post covers the key concepts and concerns that anyone embarking on such a journey should know, along with a couple of hot-takes on how I think LLMs will evolve and what that implies for ML product strategy. It's intended for product managers, designers, engineers, and other readers with little or no knowledge of how LLMs work "under the hood", but some interest in learning the concepts without going into technical details.

Prompt Engineering, Context Windows, and Embeddings

The simplest way to make an LLM reason about proprietary data is to provide that data in the model's prompt. Most LLMs have no problem answering the following correctly: "We have two customers, A and B, who spent $100K and $200K, respectively. Who was our largest customer and how much did they spend?" We've just done some basic prompt engineering, by prepending our query (the second sentence) with context (the first sentence).
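As a minimal sketch of that pattern, the snippet below prepends the context to the question and sends the combined prompt to a chat model. It assumes the OpenAI Python SDK (v1+) and an API key in the environment; the model name and dollar figures are purely illustrative.

```python
# Minimal prompt engineering: prepend proprietary context to the user's question.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

context = "We have two customers, A and B, who spent $100K and $200K, respectively."
question = "Who was our largest customer and how much did they spend?"

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
)
print(response.choices[0].message.content)  # e.g. "Customer B, who spent $200K."
```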

But in the real world, we may have thousands or millions of customers. How do we decide which information should go into the context, given that every word included in the context costs money? This is where embeddings come in. Embeddings are a technique for converting text into numerical vectors, such that similar text produces similar vectors (vectors that are "close together" in N-dimensional space).[2] We might embed website text, documents, maybe even an entire corpus from SharePoint, Google Docs, or Notion. Then, for each user prompt, we embed it and find the vectors from our text corpus that are most similar to our prompt vector.

For instance, if we embedded Wikipedia pages about animals, and the user asked a question about safaris, our search would rank highly the Wikipedia articles about lions, zebras, and giraffes. This lets us identify the text chunks most similar to the prompt, and therefore most likely to answer it.[3] We include these most-similar text chunks in the context that is prepended to the prompt, so that the prompt hopefully contains all the information the LLM needs to answer the question.
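Here is a hedged sketch of that retrieval step, again assuming the OpenAI SDK for the embedding call; the embedding model name, the toy corpus, and the choice of cosine similarity are illustrative, and any embedding API or vector database could stand in.

```python
# Embed a corpus once, embed each query, and retrieve the most similar chunks.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

corpus = [
    "Lions live in prides on the African savanna.",
    "Giraffes are the tallest living land animals.",
    "Penguins are flightless birds found mostly in Antarctica.",
]
corpus_vectors = embed(corpus)  # in practice, computed offline and stored

query = "What animals might I see on a safari?"
query_vector = embed([query])[0]

# Cosine similarity between the query vector and every corpus vector
scores = corpus_vectors @ query_vector / (
    np.linalg.norm(corpus_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_chunks = [corpus[i] for i in np.argsort(scores)[::-1][:2]]  # lions, giraffes

prompt = "Context:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {query}"
```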

A downside of embeddings is that every call to the LLM requires all of the context to be passed along with the prompt. The LLM has no "memory" of even the most basic enterprise-specific concepts. And since most cloud-based LLM providers charge per prompt token, this can get expensive fast.[4]
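A quick back-of-the-envelope calculation shows why. The prices and volumes below are hypothetical placeholders, not any provider's actual rates:

```python
# Hypothetical cost of always shipping retrieved context with every prompt.
price_per_1k_prompt_tokens = 0.03  # hypothetical $/1K tokens; check your provider's price list
context_tokens = 3_000             # retrieved chunks prepended to every call
question_tokens = 50
calls_per_day = 10_000

daily_cost = (context_tokens + question_tokens) / 1_000 * price_per_1k_prompt_tokens * calls_per_day
print(f"${daily_cost:,.0f} per day just for prompt tokens")  # ~$915 per day
```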

Fine-Tuning

Fine-tuning allows an LLM to understand enterprise-specific concepts without including them in every prompt. We take a foundation model, which already encodes general knowledge across billions of learned parameters, and tweak those parameters to reflect specific enterprise knowledge, while still retaining the underlying general knowledge.[5] When we generate inferences with the new fine-tuned model, we get that enterprise knowledge "for free".

In contrast to embeddings/prompt engineering, where the underlying model is a third-party black box, fine-tuning is closer to classical machine learning, where ML teams created their own models from scratch. Fine-tuning requires a training dataset with labeled observations; the fine-tuned model is highly sensitive to the quality and volume of that training data. We also have to make configuration decisions (number of epochs, learning rate, etc.), orchestrate long-running training jobs, and track model versions. Some foundation model providers offer APIs that abstract away some of this complexity; some don't.
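As one example of what those provider APIs look like, here is a hedged sketch of kicking off a fine-tuning job with the OpenAI Python SDK. The file name, base model, and epoch count are illustrative assumptions; check which base models your provider actually allows you to fine-tune.

```python
# Upload labeled training data and launch a fine-tuning job (OpenAI SDK v1+).
from openai import OpenAI

client = OpenAI()

# Each JSONL line pairs a prompt with the desired "expert" response.
training_file = client.files.create(
    file=open("enterprise_training_data.jsonl", "rb"),  # illustrative path
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="davinci-002",              # an older, fine-tunable base model
    hyperparameters={"n_epochs": 3},  # one of the configuration decisions mentioned above
)
print(job.id, job.status)  # poll this job id; training can take hours
```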

While inferences may be cheaper with fine-tuned models, that can be outweighed by costly training jobs.[6] And some foundation model providers (like OpenAI) only support fine-tuning of lagging-edge models (so not ChatGPT or GPT-4).

Evals

One of the novel, significant challenges presented by LLMs is measuring the quality of complex outputs. Classical ML teams have tried-and-true methods for measuring the accuracy of simple outputs, like numerical predictions or categorizations. But most enterprise use cases for LLMs involve generating responses that are tens to hundreds of words long. Concepts sophisticated enough to require more than ten words can usually be worded in many different ways. So even if we have a human-validated "expert" response, doing an exact string match between a model response and the expert response is too stringent a test, and would underestimate model response quality.

The Evals framework, open-sourced by OpenAI, is one approach to tackling this problem. It requires a labeled test set (where prompts are matched to "expert" responses), but it allows broad types of comparison between model and expert responses. For example: is the model-generated answer a subset or superset of the expert answer; factually equivalent to the expert answer; more or less concise than the expert answer? The caveat is that Evals performs these checks using an LLM. If there is a flaw in the "checker" LLM, the eval results may themselves be inaccurate.
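The snippet below is a deliberately simplified, model-graded check in the spirit of Evals, not the framework's actual API; the grader prompt and model names are assumptions.

```python
# A "checker" LLM judges whether a model answer matches the expert answer.
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """Question: {question}
Expert answer: {expert}
Model answer: {candidate}

Is the model answer factually equivalent to the expert answer? Reply "yes" or "no"."""

def grade(question: str, expert: str, candidate: str) -> bool:
    reply = client.chat.completions.create(
        model="gpt-4",  # flaws in this checker model propagate into the eval results
        messages=[{
            "role": "user",
            "content": GRADER_PROMPT.format(question=question, expert=expert, candidate=candidate),
        }],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")
```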

Guardrails

If you're using an LLM in production, you need confidence that it will handle mistaken or malicious user inputs safely. For most enterprises, the starting point is ensuring the model doesn't spread false information. That means a system that knows its own limitations and knows when to say "I don't know." There are many tactical approaches here. It can be done via prompt engineering, with prompt language like "Respond 'I don't know' if the question can't be answered with the context provided above." It can be done with fine-tuning, by providing out-of-scope training examples where the expert response is "I don't know."
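Both tactics are sketched below; the instruction wording, the example question, and the JSON layout are illustrative, and the exact schema for fine-tuning data depends on your provider.

```python
# Tactic 1: prompt engineering -- instruct the model to admit ignorance.
SYSTEM_PROMPT = (
    "Answer only using the context provided below. "
    "Respond 'I don't know' if the question cannot be answered with that context.\n\n"
    "Context:\n{context}"
)

# Tactic 2: fine-tuning -- an out-of-scope training example whose expert response
# is "I don't know" (schema varies by provider; this layout is illustrative).
out_of_scope_example = {
    "prompt": "What will our revenue be in 2030?",
    "completion": "I don't know.",
}
```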

Enterprises also need to guard against malicious user inputs, e.g. prompt hacking. Limiting the format and length of the system's acceptable inputs and outputs can be a simple and effective start. Such precautions are a good idea if you're only serving internal users, and they're essential if you're serving external users.
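For example, a thin validation layer in front of the LLM might look like the sketch below; the length cap and the expected input format are hypothetical.

```python
# Simple input guardrails: enforce format and length before anything reaches the LLM.
import re

MAX_INPUT_CHARS = 2_000                                # hypothetical limit
TICKET_ID_PATTERN = re.compile(r"^[A-Z]{2}-\d{4,6}$")  # hypothetical expected format

def validate_user_input(ticket_id: str, question: str) -> str:
    if not TICKET_ID_PATTERN.match(ticket_id):
        raise ValueError("ticket_id does not match the expected format")
    if len(question) > MAX_INPUT_CHARS:
        raise ValueError("question exceeds the maximum allowed length")
    return question
```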

Institutional Bias

The developers of the most popular LLMs (OpenAI / GPT-4, Google / Bard) have taken pains to align their models with human preferences and to deploy sophisticated moderation layers. If you ask GPT-4 or Bard to tell you a racist or misogynistic joke, they will politely refuse.[7]

That's the good news. The bad news is that this moderation, which targets societal biases, doesn't necessarily prevent institutional biases. Imagine our customer support team has a history of being rude to a particular type of customer. If historical customer support conversations are naively used to build a new AI system (for example, via fine-tuning), that system is likely to replicate the bias.

If you're using past data to train an AI model (be it a classical model or a generative model), closely scrutinize which past behaviors you want to perpetuate into the future and which you don't. Sometimes it's easier to set principles and work from those (for example, via prompt engineering), without using past data directly.

LLM Lock-In

Unless you've been living under a rock, you know generative AI models are advancing incredibly rapidly. Given an enterprise use case, the best LLM for it today may not be the best solution in six months and almost certainly will not be the best solution in six years. Smart ML teams know they'll have to swap models at some point.

But there are two other major reasons to build for easy LLM "swapping". First, many foundation model providers have struggled to support exponentially growing user volume, leading to outages and degraded service. Building a fallback foundation model into your system is a good idea. Second, it can be quite useful to test multiple foundation models in your system (a "horse race") to get a sense of which performs best. Per the section above on Evals, it's often difficult to measure model quality analytically, so sometimes you just have to run two models and qualitatively compare the responses.
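One way to build for this is to hide every provider behind the same tiny interface, as in the sketch below; the wrapper functions you plug in are whatever provider SDK calls you already use, and the overall design is an assumption on my part rather than a prescribed pattern.

```python
# Provider-agnostic completion with a fallback, plus a simple "horse race".
from typing import Callable, Dict

CompletionFn = Callable[[str], str]  # prompt in, response text out

def complete_with_fallback(prompt: str, primary: CompletionFn, fallback: CompletionFn) -> str:
    try:
        return primary(prompt)
    except Exception:  # outage, rate limit, degraded service, ...
        return fallback(prompt)

def horse_race(prompt: str, models: Dict[str, CompletionFn]) -> Dict[str, str]:
    """Run every candidate model on the same prompt for qualitative comparison."""
    return {name: fn(prompt) for name, fn in models.items()}
```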

Data Leakage

Read the terms and conditions of any foundation model you're considering using. If the model provider has the right to use user inputs for future model training, that's worrisome. LLMs are so large that specific user queries and responses can become directly encoded in a future model version, and could then become accessible to any user of that version. Imagine a user at your company queries "how can I clean up this code that does XYZ? [your proprietary, confidential code here]". If this query is then used by the model provider to retrain their LLM, the new version of the LLM may learn that your proprietary code is a great way to solve use case XYZ. If a competitor asks how to do XYZ, the LLM could "leak" your source code, or something very close to it.

OpenAI now allows users to opt out of their data being used to train models, which is a great precedent, but not every model provider has followed their example. Some organizations are also exploring running LLMs inside their own virtual private clouds; this is a key reason for much of the interest in open-source LLMs.

Hot-Take: Prompt Engineering Will Beat Fine-Tuning

When I first started adapting LLMs for enterprise use, I was far more interested in fine-tuning than prompt engineering. Fine-tuning felt like it adhered to the principles of the classical ML systems I was accustomed to: wrangle some data, produce a train/test dataset, kick off a training job, wait a while, evaluate the results against some metric.

But I've come to believe that prompt engineering (with embeddings) is a better approach for most enterprise use cases. First, the iteration cycle for prompt engineering is far faster than for fine-tuning, because there is no model training, which can take hours or days. Changing a prompt and generating new responses can be done in minutes. Conversely, fine-tuning is an irreversible process in terms of model training; if you used incorrect training data or a better base model comes out, you have to restart your fine-tuning jobs. Second, prompt engineering requires far less knowledge of ML concepts like neural network hyperparameter optimization, training job orchestration, or data wrangling. Fine-tuning often requires experienced ML engineers, while prompt engineering can often be done by software engineers without ML experience. Third, prompt engineering works better for the fast-growing practice of model chaining, in which complex requests are decomposed into smaller, constituent requests, each of which can be assigned to a different LLM. Sometimes the best "constituent model" is a fine-tuned model.[8] But most of the value-add work for enterprises is (i) figuring out how to break apart their problem, (ii) writing the prompts for each constituent part, and (iii) identifying the best off-the-shelf model for each part; it's not in creating their own fine-tuned models.
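To illustrate model chaining, here is a toy two-step chain; the use case, the prompts, and the idea of routing the cheap step to a smaller model are all assumptions of mine, and the model names are placeholders.

```python
# A complex request decomposed into two constituent LLM calls.
from openai import OpenAI

client = OpenAI()

def call_llm(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def summarize_then_draft(support_transcript: str) -> str:
    # Step 1: a cheaper model condenses the raw transcript.
    summary = call_llm(
        "gpt-3.5-turbo",
        f"Summarize the customer's issue in three sentences:\n{support_transcript}",
    )
    # Step 2: a stronger (or fine-tuned) model drafts the customer-facing reply.
    return call_llm(
        "gpt-4",
        f"Using this summary, draft a polite, accurate reply to the customer:\n{summary}",
    )
```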

The advantages of prompt engineering are likely to widen over time. Today, prompt engineering requires long, expensive prompts (since context must be included in every prompt). But I'd bet on rapidly declining cost per token, as the model provider space gets more competitive and providers figure out how to train LLMs more cheaply. Prompt engineering is also limited today by maximum prompt sizes, but OpenAI already accepts 32K tokens (~40 pages of average English text) per prompt for GPT-4, and Anthropic's Claude accepts 100K tokens (~125 pages). And I'd bet on even larger context windows coming out in the near future.

Hot-Take: Zero-Shot Will Dominate

As LLMs have become better at producing human-interpretable reasoning, it's useful to consider how humans use data to reason, and what that implies for LLMs.[9] Humans don't actually use much data! Most of the time, we do "zero-shot learning", which simply means we answer questions without the question being accompanied by a set of example question-answer pairs. The questioner just provides the question, and we answer based on logic, principles, heuristics, biases, etc.

That is different from the LLMs of just a few years ago, which were only good at few-shot learning, where you needed to include a handful of example question-answer pairs in your prompt. And it's very different from classical ML, where the model had to be trained on hundreds, thousands, or millions of question-answer pairs.
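The contrast is easiest to see in the prompts themselves. The task, example reviews, and policy wording below are invented purely for illustration:

```python
# Few-shot: the prompt carries worked examples for the model to imitate.
few_shot_prompt = """Classify the sentiment of each review.
Review: "The dashboard is slow but support was great." -> mixed
Review: "Setup took five minutes, love it." -> positive
Review: "Billing overcharged us twice." -> """

# Zero-shot: no examples, just instructions, policies, and assumptions.
zero_shot_prompt = """You are reviewing customer feedback for our product team.
Policy: classify each review as positive, negative, or mixed; if unsure, choose mixed.
Review: "Billing overcharged us twice."
Classification:"""
```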

I strongly believe that an increasing, eventually dominant share of LLM use cases will be "zero-shot". LLMs will be able to answer most questions without any user-provided examples. They will still need prompt engineering, in the form of instructions, policies, assumptions, etc. For instance, this post uses GPT-4 to review code for security vulnerabilities; the approach requires no data on past instances of vulnerable code. Having clear instructions, policies, and assumptions will become increasingly important, while having large volumes of high-quality, labeled, proprietary data will become less important.
