Decision making under uncertainty is a central concern for product teams. Decisions large and small often must be made under time pressure, despite incomplete (and potentially inaccurate) information about the problem and solution space. This may stem from a scarcity of relevant user research, limited knowledge of the intricacies of the business context (typically seen in firms that do too little to foster customer centricity and cross-team collaboration), and/or a flawed understanding of what a given technology can and cannot do (particularly when building front-runner products with novel, untested technologies).
The situation is especially difficult for AI product teams, for at least three reasons. First, many AI algorithms are inherently probabilistic and thus yield uncertain outcomes (e.g., model predictions may be right or wrong with a certain probability). Second, a sufficient quantity of high-quality, relevant data may not always be available to properly train AI systems. Third, the recent explosion in hype around AI, and generative AI in particular, has led to unrealistic expectations among customers, Wall Street analysts, and (inevitably) decision makers in upper management; the feeling among many of these stakeholders seems to be that virtually anything can now be solved easily with AI. Needless to say, it can be difficult for product teams to manage such expectations.
So, what hope is there for AI product teams? While there is no silver bullet, this article introduces readers to the notion of expected value and shows how it can be used to guide decision making in AI product management. After a brief overview of key theoretical concepts, we will look at three real-life case studies that underscore how expected value analysis can help AI product teams make strategic decisions under uncertainty across the product lifecycle. Given the foundational nature of the material, the target audience of this article includes data scientists, AI product managers, engineers, UX researchers and designers, managers, and anyone else aspiring to develop great AI products.
Expected Value
Before giving a formal definition of expected value, let us consider two simple games to build our intuition.
A Game of Dice
In the first game, imagine you are competing with your friends in a dice-rolling contest. Each of you gets to roll a fair, six-sided die n times. The score for each roll is given by the number of pips (dots) showing on the top face of the die after the roll; 1, 2, 3, 4, 5, and 6 are thus the only achievable scores for any given roll. The player with the highest total score at the end of the n rolls wins the game. Assuming n is a large number (say, 500), what should we expect to see at the conclusion of the game? Will there be an outright winner or a tie?
It turns out that, as n gets large, the total scores of each of the players are likely to converge to 3.5*n. For instance, after 500 rolls, the total scores of you and your friends are likely to be around 3.5*500 = 1750. To see why, notice that, for a fair, six-sided die, the probability of any side landing on top after a roll is 1/6. On average, the score of an individual roll will therefore be (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5, i.e., the average of all achievable scores per roll; this also happens to be the expected value of a die roll. Assuming that the outcomes of all rolls are independent of one another, we would expect the average score of the n rolls to be 3.5. So, after 500 rolls, we should not be surprised if each player has a total score of roughly 1750. In fact, there is a so-called law of large numbers in mathematics, which states that if you repeat an experiment (like rolling a die) a sufficiently large number of times, the average result of all those experiments converges almost surely to the expected value.
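To make the convergence concrete, here is a minimal Python sketch (the function name and random seed are my own choices, not part of the original discussion) that simulates the dice game for increasing numbers of rolls:

```python
# A minimal simulation of the dice game: as the number of rolls grows,
# the average score per roll drifts toward the expected value of 3.5.
import random

def average_score(num_rolls: int, seed: int = 42) -> float:
    """Roll a fair six-sided die num_rolls times and return the average score."""
    rng = random.Random(seed)
    total = sum(rng.randint(1, 6) for _ in range(num_rolls))
    return total / num_rolls

for n in (10, 500, 100_000):
    print(f"average score after {n} rolls: {average_score(n):.3f}")
# For large n, the printed averages approach 3.5, so a player's total score
# after n rolls is likely to be close to 3.5 * n.
```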
A Game of Roulette
Next, let us consider roulette, a popular casino game. Imagine you are playing a simplified version of roulette against a friend as follows. The roulette wheel has 38 pockets, and the game ends after n rounds. For each round, you must pick a whole number between 1 and 38, after which your friend will spin the roulette wheel and throw a small ball onto the spinning wheel. Once the wheel stops spinning, if the ball ends up in the pocket with the number you picked, your friend will pay you $35; if the ball ends up in any of the other pockets, however, you must pay your friend $1. How much money do you expect you and your friend to make after n rounds?
You might think that, since $35 is a lot more than $1, your friend will end up paying you quite a bit of money by the time the game is done, but not so fast. Let us apply the same basic approach we used in the dice game to analyze this seemingly lucrative game of roulette. For any given round, the probability of the ball ending up in the pocket with the number you picked is 1/38. The probability of the ball ending up in any other pocket is 37/38. From your perspective, the average outcome per round is therefore $35*1/38 - $1*37/38 = -$0.0526. So, it seems you will actually end up owing your friend just over a nickel per round, on average. After n rounds, you will be out of pocket by around $0.0526*n. If you play 500 rounds, as in the dice game above, you will end up paying your friend roughly $26. This is an example of a game that is rigged to favor the "house" (i.e., the casino, or in this case, your friend).
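A few lines of Python are enough to double-check this arithmetic; the variable names below are purely illustrative:

```python
# Expected payoff per roulette round from the player's perspective,
# and the likely cumulative loss over 500 rounds.
p_win, p_lose = 1 / 38, 37 / 38
payoff_win, payoff_lose = 35.0, -1.0

expected_per_round = p_win * payoff_win + p_lose * payoff_lose
print(f"expected payoff per round: ${expected_per_round:.4f}")            # about -$0.0526
print(f"expected payoff after 500 rounds: ${500 * expected_per_round:.2f}")  # about -$26
```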
Formal Definition
Let X be a random variable that can take any one of k outcome values x1, x2, ..., xk, with probabilities p1, p2, ..., pk of occurring, respectively. The expected value of X is the sum of the outcome values weighted by their respective probabilities of occurrence:

E[X] = p1*x1 + p2*x2 + ... + pk*xk

The total expected value of n independent occurrences of X will be n*E[X].
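As a hands-on illustration, the following short Python sketch implements the definition directly and reproduces the results of the two games above (the helper function expected_value is my own naming, not part of any library):

```python
# A generic expected value helper: a direct translation of the definition above.
def expected_value(outcomes: list[float], probabilities: list[float]) -> float:
    """Return E[X], the sum of outcome values weighted by their probabilities."""
    assert abs(sum(probabilities) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(x * p for x, p in zip(outcomes, probabilities))

# Sanity checks against the two games above:
print(f"{expected_value([1, 2, 3, 4, 5, 6], [1 / 6] * 6):.2f}")   # 3.50 (dice game)
print(f"{expected_value([35, -1], [1 / 38, 37 / 38]):.4f}")       # -0.0526 (roulette game)
```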
In the following case studies, we will see how expected value analysis can aid decision making under uncertainty. The companies involved are kept anonymous.
Case Study 1: Fraud Detection in E-Commerce
The company in this case study runs an online platform for reselling used cars across Europe. Legitimate car dealerships and private owners of used cars can list their vehicles for sale on the platform. A typical listing includes the seller's asking price, facts about the car (e.g., its basic properties, special features, and details of any damage or wear-and-tear), and photos of the car's interior and exterior. Buyers can browse through the many listings on the platform and, having found one they like, click a button on the listing page to contact the seller, arrange a viewing, and ultimately make the purchase. The platform charges sellers a small monthly fee to show their listings. To drive this subscription-based revenue, the process for sellers to sign up and create listings is kept as simple as possible.
The trouble is that some of the listings on the platform may in fact be fake. An unintended consequence of lowering the barriers to creating listings is that malicious users can set up fake seller accounts and create fake listings (often impersonating legitimate car dealerships) to lure and potentially defraud unsuspecting buyers. Fake listings can hurt the business in two ways. First, fearing reputational damage, affected sellers may take their listings to competing platforms, publicly criticize the platform for its apparently lax security standards (which could trigger other sellers to leave as well), or even sue for damages. Second, affected buyers (and those who hear about the instances of fraud in the press, on social media, and from family and friends) may also abandon the platform and write negative reviews online, all of which can further persuade sellers (the platform's key revenue source) to leave.
Against this backdrop, the chief product officer (CPO) has tasked a product manager and a cross-functional team of customer success representatives, data scientists, and engineers with assessing the potential of using AI to combat the scourge of fraudulent listings. The CPO is not interested in mere opinions; she wants a data-driven estimate of the net value of implementing an AI system that can quickly detect and delete fraudulent listings from the platform before they cause any damage.
Expected value analysis can be used to estimate the net value of the AI system by considering the probabilities of correct and incorrect predictions and their respective benefits and costs. In particular, we can distinguish between four cases: (1) correctly detected fake listings (true positives), (2) legitimate listings incorrectly deemed fake (false positives), (3) correctly detected legitimate listings (true negatives), and (4) fake listings incorrectly deemed legitimate (false negatives). The net monetary impact, vi, of each case can be estimated with the help of historical data and stakeholder interviews. Both true positives and false positives will result in some effort to remove the flagged listings, but false positives will incur additional costs (e.g., revenue lost due to removing legitimate listings and the cost of efforts to reinstate them). Meanwhile, whereas true negatives should incur no costs, false negatives can be expensive; these represent the very fraud that the CPO aims to combat.
Given an AI model with a certain predictive accuracy, if pi denotes the probability of case i occurring in practice, then the sum E = p1*v1 + p2*v2 + p3*v3 + p4*v4 reflects the expected value of each prediction (see Figure 1 below). The total expected value for n predictions would then be n*E.
[Figure 1: Expected value of a fraud-detection prediction, combining the probabilities and net monetary impacts of the four prediction outcomes]
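To show how such a calculation might be wired up in practice, here is an illustrative Python sketch of the Figure 1 logic; all probabilities and monetary impacts below are invented placeholders, not data from the case study:

```python
# Illustrative expected value calculation for the fraud-detection use case.
# Probabilities p_i and net monetary impacts v_i are made-up placeholders.
cases = {
    #                 (p_i,   v_i in EUR per screened listing)
    "true_positive":  (0.018, -5.0),    # fake listing caught: small removal effort
    "false_positive": (0.002, -150.0),  # legitimate listing removed: lost revenue, reinstatement
    "true_negative":  (0.975, 0.0),     # legitimate listing left alone: no cost
    "false_negative": (0.005, -400.0),  # fake listing missed: fraud damage, churn risk
}

expected_value_per_prediction = sum(p * v for p, v in cases.values())
n_predictions = 100_000  # e.g., listings screened per month
print(f"expected value per prediction: {expected_value_per_prediction:.2f} EUR")
print(f"total expected value for {n_predictions} predictions: "
      f"{n_predictions * expected_value_per_prediction:,.0f} EUR")

# For a go/no-go decision, this can be compared against the baseline without AI,
# where every fake listing effectively incurs the false-negative cost.
p_fake = cases["true_positive"][0] + cases["false_negative"][0]
baseline_per_listing = p_fake * cases["false_negative"][1]
print(f"baseline cost per listing without AI: {baseline_per_listing:.2f} EUR")
```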
Based on the predictive performance profile of a given AI model and estimates of expected value for each of the four cases (from true positives to false negatives), the CPO can get a better sense of the expected value of building an AI system for fraud detection and make a go/no-go decision for the project accordingly. Of course, additional fixed and variable costs normally associated with building, operating, and maintaining AI systems should also be factored into the overall decision making.
This article considers a similar case study, in which a recruiting agency decides to implement an AI system for identifying and prioritizing good leads (candidates likely to be hired by clients) over bad ones. Readers are encouraged to go through that case study and reflect on the similarities and differences with the one discussed here.
Case Study 2: Auto-Completing Purchase Orders
The procurement department of the company in this case study, an American automobile manufacturer, creates a large number of purchase orders every month. Building a single car requires several thousand individual parts that must be procured on time, and at the right quality standard, from approved suppliers. A team of purchasing clerks is responsible for manually creating the purchase orders; this involves filling out an online form consisting of several data fields that define the exact specifications and quantities of each item to be purchased per order. Needless to say, this is a time-consuming and error-prone activity, and as part of a company-wide cost-cutting initiative, the Chief Procurement Officer has tasked a cross-functional product team within her department with substantially automating the creation of purchase orders using AI.
Having conducted user research in close collaboration with the purchasing clerks, the product team has decided to build an AI feature for auto-filling fields in purchase orders. The AI can auto-fill fields based on a combination of any initial inputs provided by the purchasing clerk and other relevant information sourced from master data tables, inputs from production lines, and so on. The purchasing clerk can then review the auto-filled order and has the option of either accepting the AI-generated proposals (i.e., predictions) for each field or overriding incorrect proposals with manual entries. In cases where the AI is unsure of the correct value to fill (as indicated by a low confidence score for the given prediction), the field is left blank, and the clerk must fill it manually with an appropriate value. An AI feature for flexibly auto-filling forms in this way can be built using the approach described in this article.
To ensure high quality, the product team would like to set a threshold for model confidence scores, such that only predictions with confidence scores above this predefined threshold are shown to the user (i.e., used to auto-fill the purchase order form). The question is: what threshold value should be chosen?
Let v1 and v2 be the payoffs of showing correct and incorrect predictions to the user (because their confidence scores are above the threshold), respectively. Let v3 and v4 be the payoffs of not showing correct and incorrect predictions to the user (because their confidence scores are below the threshold), respectively. Presumably, there should be a positive payoff (i.e., a benefit) to showing correct predictions (v1) and to not showing incorrect ones (v4). By contrast, v2 and v3 should be negative payoffs (i.e., costs). Picking a threshold that is too low increases the chance of showing incorrect predictions that the clerk must manually correct (v2). But picking a threshold that is too high increases the chance of correct predictions not being shown, leaving blank fields on the purchase order form that the clerk would need to fill in manually (v3). The product team thus has a trade-off on its hands; can expected value analysis help resolve it?
As it happens, the team is able to estimate reasonable values for the payoff factors v1, v2, v3, and v4 by leveraging findings from user research and business domain know-how. Moreover, the data scientists on the product team are able to estimate the probabilities of incurring these payoffs by training an example AI model on a dataset of historical purchase orders and analyzing the results. Suppose c is the confidence score attached to a prediction. Then, given a predefined model confidence threshold t, let P(c > t) denote the proportion of predictions that have confidence scores greater than t; these are the predictions that will be used to auto-fill the purchase order form. The proportion of predictions with confidence scores at or below the threshold is P(c <= t) = 1 - P(c > t). Furthermore, let A(c > t) and A(c <= t) denote the average accuracies of predictions with confidence scores greater than t and at most t, respectively. The expected value (or expected payoff) per prediction can be derived by summing up the expected values attributable to each of the four payoff drivers (denoted EV1, EV2, EV3, and EV4), as shown in Figure 2 below. The task for the product team is then to test various threshold values and identify one that maximizes the expected payoff E.
[Figure 2: Expected payoff per prediction for a given confidence threshold, combining the four payoff drivers EV1 through EV4]
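The following Python sketch illustrates one way such a threshold sweep could be implemented; the evaluation data, payoff values, and function names are invented for illustration and are not taken from the case study:

```python
# A threshold sweep that maximizes the expected payoff per prediction.
import random

def expected_payoff(predictions, threshold, v1, v2, v3, v4):
    """Expected payoff per prediction for a given confidence threshold.

    predictions: list of (confidence_score, is_correct) tuples from a historical
                 evaluation set.
    v1: payoff of showing a correct prediction (benefit, > 0)
    v2: payoff of showing an incorrect prediction (cost, < 0)
    v3: payoff of not showing a correct prediction (cost, < 0)
    v4: payoff of not showing an incorrect prediction (benefit, > 0)
    """
    shown = [correct for conf, correct in predictions if conf > threshold]
    hidden = [correct for conf, correct in predictions if conf <= threshold]
    total = (sum(v1 if correct else v2 for correct in shown)
             + sum(v3 if correct else v4 for correct in hidden))
    return total / len(predictions)

# Invented evaluation data, purely to show the mechanics: higher confidence
# scores are made more likely to correspond to correct predictions.
rng = random.Random(0)
history = []
for _ in range(10_000):
    conf = rng.random()
    history.append((conf, rng.random() < conf))

v1, v2, v3, v4 = 1.0, -3.0, -0.5, 0.2  # placeholder payoffs per form field
best = max((t / 100 for t in range(100)),
           key=lambda t: expected_payoff(history, t, v1, v2, v3, v4))
print(f"threshold that maximizes expected payoff: {best:.2f}")
```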
Case Study 3: Standardizing AI Design Guidance
The CEO of the company in this case study, a global enterprise software vendor, has recently declared her intention to make the company "AI-first" and infuse all of its offerings with high-value AI features. To support this company-wide transformation effort, the Chief Product Officer has tasked the company's central design team with creating a consistent set of design guidelines to help teams build AI products that enhance the user experience. A key challenge is managing the trade-off between guidance that is too weak or high-level (giving individual product teams greater freedom of interpretation while risking inconsistent application of the guidance across teams) and guidance that is too strict (enforcing standardization across product teams without due regard for product-specific exceptions or customization needs).
One well-intentioned piece of guidance that the central design team initially came up with involves displaying labels next to predictions in the UI (e.g., "best option," "good alternative," or similar) to give users some indication of the expected quality or relevance of the predictions. The thinking is that showing such qualitative labels would help users make informed decisions during their interactions with AI products, without overwhelming them with hard-to-interpret statistics such as model confidence scores. In particular, the central design team believes that by stipulating a consistent, global set of model confidence thresholds, a standardized mapping can be created for translating between model confidence scores and qualitative labels for products across the portfolio. For example, predictions with confidence scores greater than 0.8 could be labeled as "best," predictions with confidence scores between 0.6 and 0.8 could be labeled as "good," and so on.
As we saw in the previous case study, it is possible to use expected value analysis to derive a model confidence threshold for a specific use case, so it is tempting to try to generalize this threshold across all use cases in the product portfolio. However, this is trickier than it first seems, and the probability theory underlying expected value analysis can help us understand why. Consider two simple games, a coin flip and a die roll. The coin flip has two possible outcomes, heads or tails, each with a 1/2 probability of occurring (assuming a fair coin). Meanwhile, as we discussed previously, rolling a fair, six-sided die has six possible outcomes for the top-facing side (1, 2, 3, 4, 5, or 6 pips), each with a 1/6 probability of occurring. A key insight here is that, as the number of possible outcomes of a random variable (also called the cardinality of the outcome set) increases, it generally becomes harder and harder to correctly guess the outcome of an arbitrary event. If you guess that the next coin flip will land heads, you will be right half the time on average. But if you guess that you will roll any particular number (say, 3) on the next die roll, you will only be correct one out of six times on average.
Now, what if we were to set a global confidence threshold of, say, 0.4 for both the coin and dice games? If an AI model for the dice game predicts a 3 on the next roll with a confidence score of 0.45, we might happily label this prediction as "good" or even "great"; after all, the confidence score is above the predefined global threshold and significantly higher than 1/6 (the success probability of a random guess). However, if an AI model for the coin game predicts heads on the next coin flip with the same confidence score of 0.45, we may suspect that this is a false positive and not show the prediction to the user at all; although the confidence score is above the predefined threshold, it is still below 0.5 (the success probability of a random guess).
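A tiny sketch makes the contrast explicit; the helper below (my own construction) simply compares a prediction's confidence score against both the global threshold from the text and the random-guess baseline implied by the number of possible outcomes:

```python
# Contrast the same confidence score across use cases with different outcome cardinalities.
def assess(confidence: float, num_outcomes: int, global_threshold: float = 0.4) -> str:
    """Compare a prediction's confidence against a global threshold and the
    success probability of a random guess for this use case."""
    random_baseline = 1 / num_outcomes
    return (f"confidence {confidence:.2f} | above global threshold: "
            f"{confidence > global_threshold} | beats random baseline "
            f"({random_baseline:.2f}): {confidence > random_baseline}")

print("die roll :", assess(0.45, num_outcomes=6))  # above threshold, well above 1/6
print("coin flip:", assess(0.45, num_outcomes=2))  # above threshold, yet below 1/2
```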
The above analysis suggests that a single, one-size-fits-all stipulation to display qualitative labels next to predictions should be struck from the standardized design guidance for AI use cases. Instead, individual product teams should perhaps be empowered to make use-case-specific decisions about how to display qualitative labels (if at all).
The Wrap
Decision making under uncertainty is a key concern for AI product teams, and it will likely grow in importance in a future dominated by AI. In this context, expected value analysis can help guide AI product management. The expected value of an uncertain outcome represents the theoretical, long-run average value of that outcome. Using real-life case studies, this article has shown how expected value analysis can help teams make educated, strategic decisions under uncertainty across the product lifecycle.
As with any such mathematical modeling approach, however, two key points are worth emphasizing. First, an expected value calculation is only as good as its structural completeness and the accuracy of its inputs. If all relevant value drivers are not included, the calculation will be structurally incomplete, and the resulting findings will be inaccurate. Using conceptual frameworks such as the matrices and tree diagrams shown in Figures 1 and 2 above can help teams verify the completeness of their calculations. Readers can refer to this book to learn how to leverage conceptual frameworks. If the data and/or assumptions used to derive the outcome values and their probabilities are faulty, then the resulting expected value will be inaccurate, and potentially damaging if used to inform strategic decision making (e.g., wrongly sunsetting a promising product). Second, it is a good idea to pair a quantitative approach like expected value analysis with qualitative approaches (e.g., customer interviews, observing how users interact with the products) to get a well-rounded picture. Qualitative insights can help us sanity-check inputs to the expected value calculation, better interpret the quantitative results, and ultimately derive holistic recommendations for decision making.
