Power Analysis in Marketing: A Hands-On Introduction


Show the code
library(tibble)
library(ggplot2)
library(dplyr)
library(tidyr)
library(latex2exp)
library(scales)
library(knitr)

Over the past few years working in marketing measurement, I've noticed that power analysis is one of the most poorly understood topics in testing and measurement. Sometimes it's misunderstood, and sometimes it's not applied at all despite its foundational role in test design. This article and the series that follow are my attempt to alleviate this.

In this article, I'll cover:

  • What’s statistical power?
  • How can we compute it?
  • What can influence power?

Power analysis is a statistical topic, and as a result there will be math and statistics (crazy, right?), but I'll try to tie those technical details back to real-world problems or basic intuition whenever possible.

Without further ado, let’s get to it.

Error types in testing: Type I vs. Type II

In testing, there are two types of error:

  • Type I:
    • Technical Definition: We erroneously reject the null hypothesis when the null hypothesis is true
    • Layman's Definition: We say there was an effect when there really wasn't
    • Example: A/B testing a new creative and concluding that it performs better than the old design when in reality, both designs perform the same
  • Type II:
    • Technical Definition: We fail to reject the null hypothesis when the null hypothesis is false
    • Layman's Definition: We say there was no effect when there really was
    • Example: A/B testing a new creative and concluding that it performs the same as the old design when in reality, the new design performs better

What’s statistical power?

Most people are familiar with Type I error. It's the error that we control by setting a significance level. Power relates to Type II error. More specifically, power is the probability of correctly rejecting the null hypothesis when it is false. It's the complement of Type II error (i.e., 1 − Type II error). In other words, power is the probability of detecting a true effect if one exists. It should be clear why this is important:

  • Underpowered tests are likely to miss true effects, resulting in missed opportunities for improvement
  • Underpowered tests can lead to false confidence in the results, as we may conclude that there is no effect when there actually is one
  • … and most simply, underpowered tests waste money and resources

The role of α and β

If both are important, why are Type II error and power so misunderstood and ignored while Type I error is always considered? It's because we can easily pick our Type I error rate. In fact, that's exactly what we're doing when we set the significance level α (typically α = 0.05) for our tests: we're stating that we're comfortable with a certain rate of Type I error. During test setup, we make a statement, "we're comfortable with an X% false positive rate," and then set α = X%. After the test, if our p-value falls below α, we reject the null hypothesis (i.e., "the results are significant"), and if the p-value falls above α, we fail to reject the null hypothesis (i.e., "the results are not significant").
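As a tiny sketch of that decision rule (the p-value here is purely hypothetical, just for illustration):

# Illustrative only: compare a hypothetical p-value against our chosen alpha
alpha <- 0.05
p_value <- 0.03
if (p_value < alpha) "Significant: reject the null" else "Not significant: fail to reject the null"
#> [1] "Significant: reject the null"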

Determining Type II error, β (typically β = 0.20), and thus power, is not as easy. It requires us to make assumptions and perform an analysis, called a "power analysis." To understand the process, it's best to first walk through the process of testing and then backtrack to work out how power can be computed and influenced. Let's use a simple A/B creative test as an example.

| Concept | Symbol | Typical Value(s) | Technical Definition | Plain-Language Definition |
|---|---|---|---|---|
| Type I Error | α | 0.05 (5%) | Probability of rejecting the null hypothesis when the null is actually true | Saying there is an effect when in reality there is no difference |
| Type II Error | β | 0.20 (20%) | Probability of failing to reject the null hypothesis when the null is actually false | Saying there is no effect when in reality there is one |
| Power | 1 − β | 0.80 (80%) | Probability of correctly rejecting the null hypothesis when the alternative is true | The chance we detect a true effect if there is one |

Quick Reference: Error Types and Power

Computing power: step-by-step

A couple of notes before we start:

  • I made a few assumptions and approximations to simplify the example. If you can spot them, great. If not, don't worry about it. The goal is to understand the concepts and process, not the nitty-gritty details.
  • I refer to the decision threshold in the z-score space as the critical value. "Critical value" typically refers to the threshold in the original space (e.g., conversion rates), but I'll use the terms interchangeably so I don't have to introduce a new one.
  • There are code snippets throughout, tied to the text and ideas. If you copy the code yourself, you can play around with the parameters to see how things change. Some of the code snippets are hidden to keep the article readable. Click "Show the code" to see the code.
    • Try this: Edit the sample size in the test setup so that the test statistic is just under the critical value and then run the power analysis. Are the results what you expected?

Test setup and the test statistic

As stated above, it's best to walk through the testing process first and then backtrack to identify how power can be computed. Let's do exactly that.

# Set parameters for the A/B test
N_a <- 1000  # Sample size for creative A
N_b <- 1000  # Sample size for creative B
alpha <- 0.05  # Significance level
# Function to compute the critical z-value for a one- or two-tailed test
critical_z <- function(alpha, two_sided = FALSE) {
  if (two_sided) qnorm(1 - alpha/2) else qnorm(1 - alpha)
}
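For reference, the one-tailed critical value at α = 0.05 used throughout this example:

critical_z(alpha)
#> [1] 1.644854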


Our test setup:

  • Null hypothesis: The conversion rate of A equals the conversion rate of B.
  • Alternative hypothesis: The conversion rate of B is greater than the conversion rate of A.
  • Sample size:
    • Na = 1,000 — Number of people who receive creative A
    • Nb = 1,000 — Number of people who receive creative B
  • Significance level: α = 0.05
  • Critical value: The critical value is the z-score that corresponds to the significance level α. We call this Z1−α. For a one-tailed test with α = 0.05, this is roughly 1.64.
  • Test type: Two-proportion z-test
x_a <- 100  # Number of conversions for creative A
x_b <- 150  # Number of conversions for creative B
p_a <- x_a / N_a  # Conversion rate for creative A
p_b <- x_b / N_b  # Conversion rate for creative B

Our results:

  • xa = 100 — Number of conversions from creative A
  • xb = 150 — Number of conversions from creative B
  • pa = xa / Na = 0.10 — Conversion rate of creative A
  • pb = xb / Nb = 0.15 — Conversion rate of creative B

Under the null hypothesis, the difference in conversion rates follows an approximately normal distribution with:

  • Mean: μ = 0 (no difference in conversion rates)
  • Standard deviation:
    σ = √[ pa(1 − pa)/Na + pb(1 − pb)/Nb ] ≈ 0.015
z_score <- function(p_a, p_b, N_a, N_b) {
  (p_b - p_a) / sqrt((p_a * (1 - p_a) / N_a) + (p_b * (1 - p_b) / N_b))
}

From these values, we can compute the test statistic:

\[
z = \frac{p_b - p_a}{\sqrt{\frac{p_a (1 - p_a)}{N_a} + \frac{p_b (1 - p_b)}{N_b}}} \approx 3.39
\]
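As a quick numeric check of the standard deviation and test statistic above, using the measured rates:

# Standard deviation of the difference under the null (roughly 0.0147)
sqrt(p_a * (1 - p_a) / N_a + p_b * (1 - p_b) / N_b)
# Test statistic (roughly 3.39)
z_score(p_a, p_b, N_a, N_b)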

If our test statistic, z, is greater than the critical value, we reject the null hypothesis and conclude that Creative B performs better than Creative A. If z is less than or equal to the critical value, we fail to reject the null hypothesis and conclude that there is no significant difference between the two creatives.

In other words, if our results are unlikely to be observed when the conversion rates of A and B are truly the same, we reject the null hypothesis and state that Creative B performs better than Creative A. Otherwise, we fail to reject the null hypothesis and state that there is no significant difference between the two creatives.

Given our test results, we reject the null hypothesis and conclude that Creative B performs better than Creative A.

z <- z_score(p_a, p_b, N_a, N_b)
critical_value <- critical_z(alpha)
if (z > critical_value) {
  result <- "Reject null hypothesis: Creative B performs higher than Creative A"
} else {
  result <- "Fail to reject null hypothesis: No significant difference between creatives"
}
result
#> [1] "Reject null hypothesis: Creative B performs higher than Creative A"

The intuition behind power

Now that we have walked through the testing process, where does power come into play? In the process above, we record conversion rates, pa and pb, and then compute the test statistic, z. However, if we repeated the test many times, we would get different sample conversion rates and different test statistics, all centered around the true conversion rates of the creatives.

Assume the true conversion rate of Creative B is higher than that of Creative A. Some of these tests will still fail to reject the null hypothesis due to natural variance. Power is the proportion of those tests that reject the null hypothesis. This is the underlying mechanism behind all power analysis, and it hints at the missing ingredient: the true conversion rates, or more generally, the true effect size.

Intuitively, if the true effect size is larger, our measured effect would typically be larger and we would reject the null hypothesis more often, increasing power.
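To make that mechanism concrete, here is a minimal simulation sketch: it assumes hypothetical true rates of 10% and 15% (these values and the seed are illustrative assumptions, not part of the test above), repeats the test many times, and counts how often we reject the null.

# Simulation sketch: power as the share of repeated tests that reject the null
set.seed(123)
true_r_a <- 0.10   # assumed true rate for creative A (illustrative)
true_r_b <- 0.15   # assumed true rate for creative B (illustrative)
n_sims <- 10000
rejections <- replicate(n_sims, {
  sim_p_a <- rbinom(1, N_a, true_r_a) / N_a
  sim_p_b <- rbinom(1, N_b, true_r_b) / N_b
  z_score(sim_p_a, sim_p_b, N_a, N_b) > critical_z(alpha)
})
mean(rejections)  # empirical power: the share of simulated tests that reject the null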

Selecting the true effect size

If we need true conversion rates to compute power, how do we get them? If we had them, we wouldn't need to perform testing. Therefore, we need to make an assumption. Broadly, there are two approaches:

  • Select the meaningful effect size: In this approach, we set the true effect size (or true difference in conversion rates) to a level that would be meaningful. If Creative B only increased conversion rates by 0.01%, would we actually care and take action on those results? Probably not. So why would we care about being able to detect that small of an effect? On the other hand, if Creative B increased conversion rates by 50%, we definitely would care. In practice, the effect size likely falls between these two points.
    • This is often referred to as the minimum detectable effect (MDE). However, the minimum detectable effect of the study and the minimum effect that we actually care about (for example, we may only care about 5% or greater effects, but the study is designed to detect 1% or greater effects) may differ. For this reason, I prefer to use the term meaningful effect size when referring to this strategy.
  • Use prior studies: If we have data from prior studies or models that measure the performance of this creative or similar creatives, we can use those values to set the true effect size.

Both of the above approaches are valid.

If you only care about meaningful effects and don't mind missing smaller ones, go with the first option. If you need to see "statistical significance," go with the second option and be conservative with the values you use (more on that in another article).

Technical Note

Because we don't have true conversion rates, we are technically assigning a specific expected distribution to the alternative hypothesis and then computing power based on that. The "true conversion rate" in the following passages is therefore technically the assumed conversion rate under the alternative hypothesis. I will use the simpler term to keep the language easy and concise.

Computing and visualizing power

Now that we have the missing ingredient, the true conversion rates, we can compute power. In place of the measured pa and pb, we now have true conversion rates ra and rb.

We measure power as:

\[
1 - \beta = 1 - P(z < Z_{1-\alpha} \;|\; N_a, N_b, r_a, r_b)
\]

This may be confusing at first glance, so let's break it down. We are stating that power (1 − β) is computed by subtracting the Type II error rate from one. The Type II error rate is the probability that a test results in a z-score below our significance threshold, given our sample sizes and true conversion rates ra and rb. How do we compute that last part?

In a two-proportion z-test, we know that:

  • Mean: μ = rb − ra
  • Standard deviation: σ = √[ ra(1 − ra)/Na + rb(1 − rb)/Nb ]

Now we need to compute:

\[
P(X > Z_{1-\alpha}), \quad X \sim N\!\left(\frac{\mu}{\sigma},\, 1\right)
\]

This is the area under the above distribution that lies to the right of Z1−α and is equivalent to computing:

\[
P\!\left(X < \frac{\mu}{\sigma} - Z_{1-\alpha}\right), \quad X \sim N(0, 1)
\]

If we had a textbook with a z-score table, we could simply look up the cumulative probability associated with
(μ / σ − Z1−α), and that would give us the power.
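Here is a quick sketch of that lookup in R, assuming (as we do in the plotting code below) that the true rates equal the measured 10% and 15%:

# pnorm() plays the role of the z-table: power = Phi(mu/sigma - Z_{1-alpha})
mu    <- 0.15 - 0.10
sigma <- sqrt(0.10 * (1 - 0.10) / N_a + 0.15 * (1 - 0.15) / N_b)
pnorm(mu / sigma - qnorm(1 - alpha))  # roughly 0.96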

Let’s show this visually:

Show the code
r_a <- p_a  # true baseline conversion rate; we're reusing the measured value
r_b <- p_b   # true treatment conversion rate; we're reusing the measured value
alpha <- 0.05
two_sided <- FALSE   # set TRUE for two-sided test

mu_diff <- function(r_a, r_b) r_b - r_a
sigma_diff <- function(r_a, r_b, N_a, N_b) {
  sqrt(r_a*(1 - r_a)/N_a + r_b*(1 - r_b)/N_b)
}

power_value <- function(r_a, r_b, N_a, N_b, alpha, two_sided = FALSE) {
  mu <- mu_diff(r_a, r_b)
  sd1 <- sigma_diff(r_a, r_b, N_a, N_b)
  zc <- critical_z(alpha, two_sided)
  thr <- zc * sigma_diff(r_a, r_b, N_a, N_b)  

  if (!two_sided) {
    1 - pnorm(thr, mean = mu, sd = sd1)
  } else {
    pnorm(-thr, mean = mu, sd = sd1) + (1 - pnorm(thr, mean = mu, sd = sd1))
  }
}

# Construct plot data
mu <- mu_diff(r_a, r_b)
sd1 <- sigma_diff(r_a, r_b, N_a, N_b)
zc <- critical_z(alpha, two_sided)
thr <- zc * sigma_diff(r_a, r_b, N_a, N_b)  

# x-range covering each curves and thresholds
x_min <- min(-4*sd1, mu - 4*sd1, -thr) - 0.1*sd1
x_max <- max( 4*sd1, mu + 4*sd1,  thr) + 0.1*sd1
xx <- seq(x_min, x_max, length.out = 2000)

df <- tibble(
  x = xx,
  H0 = dnorm(xx, mean = 0,  sd = sd1),   # null distribution (defines the test threshold)
  H1 = dnorm(xx, mean = mu, sd = sd1)    # true (alternative) distribution
)

# Regions to shade for power
if (!two_sided) {
  shade <- df %>% filter(x >= thr)
} else {
  shade <- bind_rows(
    df %>% filter(x >=  thr),
    df %>% filter(x <= -thr)
  )
}

# Numeric power for subtitle
pow <- power_value(r_a, r_b, N_a, N_b, alpha, two_sided)

# Plot
ggplot(df, aes(x = x)) +
  # H1 shaded power region
  geom_area(
    data = shade, aes(y = H1), alpha = 0.25
  ) +
  # Curves
  geom_line(aes(y = H0), linewidth = 1) +
  geom_line(aes(y = H1), linewidth = 1, linetype = "dashed") +
  # Critical line(s)
  geom_vline(xintercept = thr,  linetype = "dotted", linewidth = 0.8) +
  { if (two_sided) geom_vline(xintercept = -thr, linetype = "dotted", linewidth = 0.8) } +
  # Mean markers
  geom_vline(xintercept = 0,  alpha = 0.3) +
  geom_vline(xintercept = mu, alpha = 0.3, linetype = "dashed") +
  # Labels
  labs(
    title = "Power as shaded area under H1 beyond  critical threshold",
    subtitle = TeX(sprintf(r"($1 - beta$ = %.1f%%  |  $mu$ = %.4f,  $sigma$ = %.4f,  $z^*$ = %.3f,  threshold = %.4f)",
                       100*pow, mu, sd1, zc, thr)),
    x = TeX(r"(Difference in conversion rates ($D = p_b - p_a$))"),
    y = "Density"
  ) +
  annotate("text", x = mu, y = max(df$H1)*0.95, label = TeX(r"(H1: $N(mu, sigma^2)$)"), hjust = -0.05) +
  annotate("text", x = 0,  y = max(df$H0)*0.95, label = TeX(r"(H0: $N(0, sigma^2)$)"),  hjust = 1.05) +
  theme_minimal(base_size = 13)

In the plot above, power is the area under the alternative distribution (H1), where we assume the alternative is distributed according to our true conversion rates, that lies beyond the critical threshold (i.e., the region where we reject the null hypothesis). With the parameters we set, the power is 0.96. This means that if we repeated this test many times with the same parameters, we would expect to reject the null hypothesis roughly 96% of the time.

Power curves

Now that we have the intuition and math behind power, we can explore how power changes based on different parameters. The plots generated from such analysis are called power curves.

Note

Throughout the plots, you'll notice that 80% power is highlighted. This is a common target for power in testing, as it balances the risk of Type II error against the cost of increasing sample size or adjusting other parameters. You'll see this value highlighted in many software packages as a result.

Relationship with effect size

Earlier, I stated that the larger the effect size, the higher the power. Intuitively, this makes sense. We are essentially shifting the right bell curve in the plot above further to the right, so the area beyond the critical threshold increases. Let's test that theory.

Show the code
# Function to compute power for various effect sizes
power_curve <- function(effect_sizes, N_a, N_b, alpha, two_sided = FALSE) {
  sapply(effect_sizes, function(e) {
    r_a <- p_a
    r_b <- p_a + e  # Adjust r_b based on effect size
    power_value(r_a, r_b, N_a, N_b, alpha, two_sided)
  })
}
# Generate effect sizes
effect_sizes <- seq(0, 0.1, length.out = 100)  # Effect sizes from 0 to 10%
# Compute power for each effect size
power_values <- power_curve(effect_sizes, N_a, N_b, alpha)
# Create a data frame for plotting
power_df <- tibble(
  effect_size = effect_sizes,
  power = power_values
)
# Plot the power curve
ggplot(power_df, aes(x = effect_size, y = power)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_hline(yintercept = 0.80, linetype = "dashed", alpha = 0.6) +  # target power guide
  labs(
    title = "Power vs. Effect Size",
    x = TeX(r"(Effect Size ($r_b - r_a$))"),
    y = TeX(r'(Power ($1 - beta $))')
  ) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 0.01)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(NA,1)) +
  theme_minimal(base_size = 13)

Theory confirmed: as the effect size increases, power increases. It approaches 100% as the effect size grows and our decision threshold moves further down the tail of the normal distribution.
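We can also flip the question around: with the current sample sizes, what is the smallest lift over the baseline that reaches 80% power? A minimal sketch using uniroot, where the 80% target and the use of p_a as the baseline are assumptions for illustration:

# Solve for the effect size at which power crosses the 80% target
mde_80 <- uniroot(
  function(e) power_value(p_a, p_a + e, N_a, N_b, alpha) - 0.80,
  interval = c(1e-6, 0.5)
)$root
mde_80  # smallest detectable lift (in absolute percentage points) at ~80% power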

Relationship with sample size

Unfortunately, we cannot control effect size. It is either the meaningful effect size you want to detect or a value based on prior studies. It is what it is. What we can control is sample size. The larger the sample size, the smaller the standard deviation of the distribution and the larger the area under the curve beyond the critical threshold (imagine squeezing the sides to compress the bell curves in the plot earlier). In other words, larger sample sizes should result in higher power. Let's test this theory as well.

Show the code
power_sample_size <- function(N_a, N_b, r_a, r_b, alpha, two_sided = FALSE) {
  power_value(r_a, r_b, N_a, N_b, alpha, two_sided)
}
# Generate sample sizes
sample_sizes <- seq(100, 5000, by = 100)  # Sample sizes from 100 to 5000
# Compute power for each sample size
power_values_sample <- sapply(sample_sizes, function(N) {
  power_sample_size(N, N, r_a, r_b, alpha)
})
# Create a data frame for plotting
power_sample_df <- tibble(
  sample_size = sample_sizes,
  power = power_values_sample
)
# Plot the power curve for different sample sizes
ggplot(power_sample_df, aes(x = sample_size, y = power)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_hline(yintercept = 0.80, linetype = "dashed", alpha = 0.6) +  # target power guide
  labs(
    title = "Power vs. Sample Size",
    x = TeX(r"(Sample Size ($N$))"),
    y = TeX(r"(Power (1 - $beta$))")
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(NA,1)) +
  theme_minimal(base_size = 13)

We again see the expected relationship: as sample size increases, power increases.

Note

In this specific setup, we can increase power by increasing sample size. More generally, this is an increase in precision. In other test setups, precision (and thus power) can be increased through other means. For example, in Geo-testing, we can increase precision by selecting predictable markets or by including exogenous features (more on this in a future article).

Relationship with significance level

Does the significance level α influence power? Intuitively, if we are more willing to accept Type I error, we are more likely to reject the null hypothesis, and thus power (1 − β) should be higher. Let's test this theory.

Show the code
power_of_alpha <- function(alpha_vec, r_a, r_b, N_a, N_b, two_sided = FALSE) {
  sapply(alpha_vec, function(a)
    power_value(r_a, r_b, N_a, N_b, a, two_sided)
  )
}

alpha_grid <- seq(0.001, 0.20, length.out = 400)
power_grid <- power_of_alpha(alpha_grid, r_a, r_b, N_a, N_b, two_sided)

# Current point
power_now <- power_value(r_a, r_b, N_a, N_b, alpha, two_sided)

df_alpha_power <- tibble(alpha = alpha_grid, power = power_grid)

ggplot(df_alpha_power, aes(x = alpha, y = power)) +
  geom_line(color = "blue", size = 1) +
  geom_hline(yintercept = 0.80, linetype = "dashed", alpha = 0.6) +  # goal power guide
  geom_vline(xintercept = alpha, linetype = "dashed", alpha = 0.6) + # your alpha
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(NA,1)) +
  labs(
    title = TeX(r"(Power vs. Significance Level)"),
    subtitle = TeX(sprintf(r"(At $alpha$ = %.1f%%, $1 - beta$ = %.1f%%)",
                       100*alpha, 100*power_now)),
    x = TeX(r"(Significance Level ($alpha$))"),
    y = TeX(r"(Power (1 - $beta$))")
  ) +
  theme_minimal(base_size = 13)

Once again, the results match our intuition. There is no free lunch in statistics. All else equal, if we want to decrease our Type II error rate (β), we must be willing to accept a higher Type I error rate (α).

Power analysis

So what is power analysis? Power analysis is the process of computing power given the parameters of the test. In a power analysis, we fix the parameters we cannot control and then optimize the parameters we can control to achieve a desired power level. For example, we can fix the true effect size and then compute the sample size needed to achieve a desired power level, as sketched below. Power curves are often used to aid this decision-making process. Later in the series, I will walk through power analysis in detail with a real-world example.
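As a minimal sketch of that last example, we can reuse the power_value helper defined earlier and solve for the per-group sample size that reaches 80% power at our assumed true rates (the 80% target is an assumption for illustration):

# Solve for the per-group N at which power crosses the 80% target
target_power <- 0.80
n_required <- uniroot(
  function(N) power_value(r_a, r_b, N, N, alpha) - target_power,
  interval = c(10, 1e6)
)$root
ceiling(n_required)  # smallest whole per-group sample size meeting the target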


What's next in the series?

I haven't fully decided, but I definitely want to cover the following topics:

  • Power analysis in Geo-testing
  • Detailed guide on setting the true effect size in various contexts
  • Real-world end-to-end examples

Happy to hear your ideas. Feel free to reach out.
