💡 Note: Geo-Lift R code at the end of the article.
of my profession, I have used quasi-experimental designs with synthetic control groups to measure the impact of business changes. At Trustpilot, we switched the position of a banner on the homepage and retrospectively saw signals pointing to a decrease in our primary demand metric. By modeling how performance would have looked without the change and comparing it to what actually happened, we got a clear read of the incremental decline. At Zensai, we decided to treat some of our web forms and confirmed in the same way whether the change had a positive impact on user-reported ease of completion.
This approach works wherever you can split something into treatment and control groups: individuals, smartphones, operating systems, forms, or landing pages. You treat some. The others help you model what would have happened without treatment. You then compare.
When you cannot randomize at the individual level, geography makes an excellent unit of analysis. Cities and regions are easy to target via paid social, and their borders help contain spillovers.
There are three practical designs for isolating incremental impact:
- Randomized controlled trials (RCTs) like conversion or brand lift tests are appealing when you can use them, as they run inside the big ad platforms. They automate random user assignment and can deliver statistically robust answers within the platform’s environment. However, unless you leverage a conversions API, their scope is constrained to the platform’s own metrics and mechanics. More recently, evidence (1, 2, 3) has emerged that algorithmic features may compromise the randomness of assignment. This creates “divergent delivery”: ad A gets shown to one type of audience (e.g., more men, or people with certain interests), while ad B gets shown to a different type. Any difference in CTRs or conversions can’t be attributed purely to the ad creative; it’s confounded with how the algorithm delivered the ads. On Google Ads, Conversion Lift typically requires enablement by a Google account representative. On Meta, eligible advertisers can run self-serve Conversion Lift tests in Ads Manager (subject to spend and conversion prerequisites), and Meta’s 2025 Incremental Attribution feature further lowers friction for lift-style measurement.
- Geo-RCTs (Randomized Geo Experiments) use geographies as the unit of analysis, with random assignment to treatment and control. There is no need for individual-level tracking, but you need a sufficiently large number of geos to achieve statistical power and confident results. Because assignment is randomized, you don’t construct a synthetic control as you do in the next type of experiment.
- The Quasi Geo-Lift Experiment approaches the same question using geos as the units of analysis, with no need for individual-level tracking. Unlike Geo-RCTs, which require randomization and more geos, this quasi-experimental approach offers three key benefits with no randomization required: it a) works well with fewer geographies (no need for as many geos as Geo-RCTs), b) enables strategic market selection (direct control over where and when treatment is applied, based on business priorities), and c) accommodates retrospective evaluation and staggered rollouts (if your campaign already launched or must roll out in waves for operational reasons, you can still measure incrementality after the fact since randomization isn’t required). The synthetic control is constructed to match pre-intervention trends and characteristics of the treatment unit, so any differences after treatment begins can be attributed to the campaign’s incrementality. But: successful execution requires strong alignment between analytics and performance marketing teams to ensure proper implementation and interpretation.
The benefits of the *Quasi Geo-Lift Experiment* are substantial. To help you understand when you might use a quasi-experiment in your marketing science function, let’s consider the following example.
Quasi Geo-Lift Experiment Example
These conditions favor a quasi-experimental geo design that measures business outcomes directly, without relying on user-level tracking or on platforms like Meta or Google, instead using commonly available historical data such as rides, sales, conversions, or leads.
To perform the Geo-Lift test with a synthetic control group design on our example, we will use the GeoLift package in R, developed by Meta. The dataset I will be using is structured as daily ride counts across 13 cities in Poland.
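As a quick orientation, here is a minimal sketch (with made-up numbers, not the real dataset) of the long format the code later in the article expects: one row per city per day, with columns date, location, and Y (daily rides).
long_data_example <- data.frame(
  date     = rep(seq(as.Date("2023-01-01"), by = "day", length.out = 3), times = 2),
  location = rep(c("warsaw", "krakow"), each = 3),
  Y        = c(1210, 1185, 1304, 645, 662, 631)  # illustrative daily ride counts
)
head(long_data_example)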

Best practices for your data when using a Quasi Geo-Lift Experiment.
- Use daily data instead of weekly when possible.
- Use the most detailed location data available (e.g., zip codes, cities).
- Have at least 4–5 times the test duration in stable, pre-campaign historical data (no major changes or disruptions – more on this later in the operationalization chapter below!)
- Have at least 25 pre-treatment periods with a minimum of 10, but ideally 20+, geo-units.
- Ideally, collect 52 weeks of history to capture seasonal patterns and other factors.
- The test should last at least one purchase cycle for the product.
- Run the study for at least 15 days (daily data) or 4–6 weeks (weekly data).
- Panel data (covariates) is helpful but not required.
- For each time/location combination, include date, location, and KPIs (no missing values). Extra covariates can be added if they also meet this rule. A quick sanity check is sketched below.
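A hypothetical sanity check, assuming the long format shown earlier (columns date, location, Y); this is a sketch, not part of the GeoLift package:
library(dplyr)
long_data %>%
  summarise(
    locations     = n_distinct(location),
    dates         = n_distinct(date),
    expected_rows = n_distinct(location) * n_distinct(date),  # full date x location grid
    actual_rows   = n(),                                      # should equal expected_rows
    missing_kpis  = sum(is.na(Y))                             # should be 0
  )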
Planning Your Quasi Geo-Lift Experiment

Designing a quasi geo-lift experiment is not just about running a test; it is about creating a system that can credibly link marketing actions to business outcomes.
To do this in a structured way, the launch of any new channel or campaign should be approached in stages. Here is my ABCDE framework to help you with it:
(A) ASSESS
- Establish how incrementality will be measured and which method will be used.
(B) BUDGET
- Determine the minimum spend required to detect an effect that is both statistically credible and commercially meaningful.
(C) CONSTRUCT
- Specify which cities will be treated, how controls will be formed, how long the campaign will run, and what operational guardrails are needed.
(D) DELIVER
- Convert statistical results into metrics and report outcomes as a readout.
(E) EVALUATE
- Use results to inform broader decisions by updating MMM and MTA. Focus on calibrating, stress-testing, replicating, and localizing for rollout.
(A) ASSESS the marketing triangulation variable in question by drilling down into it.

1. Marketing triangulation as a starting point.
Start by evaluating which part of the marketing triangulation you need to break down. In our case, that would be the incrementality piece. For other projects, drill down into MTA and MMM similarly. For example, MTA covers heuristic out-of-the-box techniques like last click, first click, first- or last-touch decay, (inverted) U-shape, or W-shape, but also data-driven approaches like Markov chains. MMM can be custom, third-party, Robyn, or Meridian, and involves additional steps like saturation, adstock, and budget reallocation simulations. Don’t forget to establish your measurement metrics.
2. Incrementality is operationalized through a geo-lift test.
The geo-lift test is the practical expression of the framework because it reads the outcome in the same units the business manages, such as rides (or sales, conversions, leads) per city per day. It creates treatment and control groups just as a classic randomized study would, but it does so at the geographic level.

This makes the design executable across platforms and independent of user-level tracking, while allowing you to study your chosen business metrics.
3. Recognition of two experimental families of the Geo-Lift test: RCTs and quasi-experiments.
Where in-platform RCTs such as conversion lift tests or brand lift tests exist (Meta, Google), they remain the standard and can be leveraged. When individual randomization is infeasible, the geo-lift test proceeds as a quasi-experiment.
4. Identification relies on a synthetic control method.
For each treated city, a weighted combination of control cities is learned to reproduce its pre-period trajectory. The divergence between the observed series and its synthetic counterpart during the test window is interpreted as the incremental effect. This estimator preserves scientific rigor while keeping execution feasible and auditable.
5. Calibration and validation are explicit steps, not afterthoughts.
The experimental estimate of incrementality is used to validate that attribution signals point in the right direction and to calibrate MMM elasticities and adstock (via the calibration multiplier), so cross-channel budget reallocations are grounded in causal truth.
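As an illustration of what a calibration multiplier can look like, here is a simplified sketch with hypothetical numbers (not GeoLift output): compare the experiment’s incremental outcomes with what the MMM attributes to the same channel, geos, and window.
incremental_rides_experiment <- 5300  # hypothetical geo-lift estimate for the treated cities and window
mmm_attributed_rides         <- 7100  # hypothetical MMM-estimated contribution for the same slice
calibration_multiplier <- incremental_rides_experiment / mmm_attributed_rides
calibration_multiplier  # < 1 suggests the MMM over-credits the channel; use it to rescale elasticities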

6. Measure the impact in business terms.
In the planning phase, the core statistic is the Average Treatment Effect on the Treated (ATT), expressed in outcome units per day (e.g., rides per city per day). That estimate is translated into Total Incremental Rides over the test window and then into Cost per Incremental Conversion (CPIC) by dividing spend by the total number of incremental rides. The Minimum Detectable Effect (MDE) is reported to make the design’s sensitivity explicit and to separate actionable results from inconclusive ones. Finally, Net Profit is calculated by combining historical rider profit with the incremental outcomes and CPIC.
For lead-based businesses, the total incremental leads can be multiplied by a blended historical lead-to-customer conversion rate to estimate how many new customers the campaign is expected to generate. That figure is then multiplied by the average profit per customer in dollars. This way, even when revenue is realized downstream of the lead stage, the experiment still delivers a clear estimate of incremental financial impact and a clear decision rule for whether to scale the channel. All other metrics, like the ATT, total incremental leads, cost per incremental lead, and MDE, are calculated in the same fashion.
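A short sketch of that arithmetic, using the Bolt-style assumptions from this article (the ATT value below is a placeholder, not a result):
att_rides_per_city_day <- 50        # hypothetical ATT: incremental rides per treated city per day
n_treated_cities       <- 3
test_days              <- 21
spend                  <- 3038.16   # planned spend for the window (EUR)
profit_per_ride        <- 6         # industry prior (EUR)
total_incremental_rides <- att_rides_per_city_day * n_treated_cities * test_days
cpic                    <- spend / total_incremental_rides              # cost per incremental ride
net_profit              <- total_incremental_rides * profit_per_ride - spend
c(total_incremental_rides = total_incremental_rides,
  CPIC = round(cpic, 2),
  net_profit = round(net_profit, 2))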
(B) BUDGET for your quasi-experiment.

Budget estimation is not guesswork. It is a design choice that determines whether the experiment yields actionable results or inconclusive noise. The key concept is the Minimum Detectable Effect (MDE): the smallest lift the test can reliably detect given the variance in historical data, the number of treated and control cities, and the length of the test window.
In practice, variance is estimated from historical rides (or sales, conversions, or leads in other industries). The number of treated cities and the test length then define sensitivity. For instance, as you will see later, treating 3 cities for 21 days while holding 10 as controls provides enough power to detect lifts of about 4–5%. Detecting smaller but statistically significant effects would require more time, more markets, or more spend.
The GeoLift package models these power analyses and then prints the budget–effect–power simulation chart for any experiment ID.
The budget is aligned with unit economics. In the Bolt case, industry priors suggest a cost per ride of €6–€12 and a profit per ride of €6. Under these assumptions, the minimum spend to achieve an MDE of roughly 5% comes to €3,038 for 3 weeks, or €48.23 per treated city per day. This sits within the €5,000 benchmark budget but, more importantly, makes explicit what effect size the test can and cannot detect.
Framing the budget this way has two benefits. First, it ensures the experiment is designed to detect only effects that matter for the business. Second, it gives stakeholders clarity: if the result is null, it means the true effect was most likely smaller than the threshold, not that the test was poorly executed. Either way, the spend is not wasted. It buys causal knowledge that sharpens future allocation decisions.
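The core of that power and budget simulation is the GeoLiftMarketSelection() call, sketched here with a reduced grid. It mirrors the full script at the end of the article, where GeoTestData_PreTest is created with GeoDataRead() from the pre-period data:
library(GeoLift)
MarketSelections <- GeoLiftMarketSelection(
  data              = GeoTestData_PreTest,   # pre-period panel from GeoDataRead()
  treatment_periods = c(14, 21, 28),         # candidate test lengths in days
  N                 = c(2, 3, 4),            # candidate numbers of treated cities
  Y_id              = "Y",
  location_id       = "location",
  time_id           = "time",
  effect_size       = seq(0, 0.25, 0.05),    # simulated lifts
  cpic              = 6,                     # prior cost per incremental ride (EUR)
  budget            = 5000,
  alpha             = 0.05,
  fixed_effects     = TRUE,
  side_of_test      = "one_sided"
)
print(MarketSelections)
plot(MarketSelections, market_ID = 4, print_summary = TRUE)  # budget-effect-power chart for one setup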
(C) CONSTRUCT your quasi-experiment design.

Designing the experiment is about more than picking cities at random. It is about constructing a layout that preserves validity while staying practical for operations. The unit of analysis is the city-day, and the outcome is the business metric of interest, such as rides, sales, or leads. Treatment is applied to selected cities for a fixed test window, while the remaining cities serve as controls.
GeoLift will model and group the best city candidates for your treatment.
Control groups are not just left as-is.
They are refined using a synthetic control method. Each treated city is paired with a weighted combination of control cities that reproduces its pre-test trajectory. When the pre-period fit is accurate, the post-launch divergence between observed and synthetic outcomes provides a credible estimate of the incremental effect.
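Reading that divergence comes down to a single GeoLift() call once the panel covers the test window. This sketch mirrors the full script at the end of the article (the city names and time indices are taken from that script):
GeoLift_Results <- GeoLift(
  Y_id                 = "Y",
  data                 = GeoTestData_Full,               # full panel from GeoDataRead()
  locations            = c("Zabrze", "Szczecin", "Czestochowa"),
  treatment_start_time = treatment_start_time,           # time indices, not dates
  treatment_end_time   = treatment_end_time,
  model                = "None",
  fixed_effects        = TRUE
)
summary(GeoLift_Results)
plot(GeoLift_Results, type = "ATT")  # gap between observed and synthetic series over time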
Operational guardrails are critical to protect signal quality.

City boundaries should be fenced tightly in the campaign settings to reduce spillovers from commuters. Also consider strictly excluding control cities from treatment targeting in the settings, and vice versa. No specific targeting or lookalike audiences should be applied to the campaign. Local promotions that could confound the test are either frozen in treated geographies or mirrored in controls.
Creatives, bids, and pacing are held constant during the window, and outcomes are only read after a short cooldown period because of the marketing adstock effect. Other business-relevant factors should be considered as well: in our case, we should check available driver capacity upfront to ensure additional demand can be served without distorting prices or wait times.
Constructing the test means choosing the right balance.

The GeoLift package will do the heavy lifting for you by modeling and constructing all optimal experiment setups.
From your power analysis and market selection code, selecting the right experiment setup is a balance between:
- number of cities
- duration
- spend
- desired incremental uplift
- profit
- statistical significance
- test vs control alignment
- smallest detectable lift
- other business context
A configuration of three treated cities over a 21-day period, as in the Bolt example, provides sufficient power to detect lifts of ~4–5%, while keeping the test window short enough to minimize contamination. Based on previous experiments, we know this level of power is adequate. In addition, we need to stay within our budget of €5,000 per month, and the expected investment of €3,038.16 for 3 weeks fits well within that constraint.
(D) DELIVER the post-experiment readout.

The final step is to deliver results that translate statistical lift into clear business impact for stakeholders. The experiment’s outputs should be framed in terms that both analysts and decision-makers can use.
At the core is the Average Treatment Effect on the Treated (ATT), expressed in outcome units per day such as rides, sales, or leads. From this, the analysis calculates total incremental outcomes over the test window and derives Cost per Incremental Conversion (CPIC) by dividing spend by those outcomes. The Minimum Detectable Effect (MDE) is reported alongside the results to make the test’s sensitivity transparent, separating actionable lifts from inconclusive noise. Finally, the analysis converts outcomes into Net Profit by combining incremental conversions with unit economics.
For lead-based businesses, the same logic applies except for net profit: there, the total incremental leads are multiplied by a blended lead-to-customer conversion rate, then by the average profit per customer, to approximate the net financial impact.
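A sketch of that lead-based translation (all inputs hypothetical):
total_incremental_leads <- 420       # from the geo-lift readout
lead_to_customer_rate   <- 0.18      # blended historical conversion rate
avg_profit_per_customer <- 95        # from finance
spend                   <- 3038.16
incremental_customers <- total_incremental_leads * lead_to_customer_rate
net_profit            <- incremental_customers * avg_profit_per_customer - spend
round(net_profit, 2)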
Be cautious when interpreting results for stakeholders. Statistical analysis with p-values provides evidence, not absolute proof, so phrasing matters. GeoLift uses Synthetic Control/Augmented Synthetic Control with frequentist inference.
A common but misleading interpretation of statistical significance might sound like this:
This interpretation is problematic for several reasons:
- It treats statistical significance as proof.
- It assumes the effect size is exact (11.1%), ignoring the uncertainty range around that estimate.
- It misinterprets confidence intervals.
- It leaves no room for alternative explanations that might be valuable for the business.
- It can mislead decision-makers, creating overconfidence and potentially leading to dangerous business decisions.
When performing a quasi Geo-Lift test, what does a statistical test actually test?
Every statistical test depends on a statistical model, which is a complex web of assumptions. This model includes not only the main hypothesis being tested (e.g., a new TweetX campaign has no effect) but also a long list of other assumptions about how the data were generated. These include assumptions about:
- Random sampling or randomization.
- The form of probability distribution the data follow.
- Independence.
- Selection bias.
- The absence of major measurement errors.
- How the analysis was conducted.
A statistical test does not just evaluate the test hypothesis (such as the null hypothesis). It evaluates the entire statistical model: the whole set of assumptions. That is why we always try to ensure that all the other assumptions are fully met and the experiment design is not flawed: then, if we observe a small p-value, we can reasonably read it as evidence against the null, not as proof or ‘acceptance’ of H1.
The P-value is a measure of compatibility, not truth.
The most common definition of a P-value is flawed. A more accurate and useful definition is: the probability that the chosen test statistic would be at least as extreme as its observed value if every assumption in the statistical model, including the test hypothesis, were true.
Think of it as a “surprise index.”
- A small P-value (e.g., P=0.01) indicates that the data are surprising if the entire model were true. It is a red flag telling us that one or more of our assumptions may be wrong. However, it does not tell us which assumption is wrong. The problem could be the null hypothesis, but it could also be a violated study protocol, selection bias, or another unmet assumption.
- A large P-value (e.g., P=0.40) indicates that the data are not unusual or surprising under the model. It suggests the data are compatible with the model, but it does not prove the model or the test hypothesis is true. The data could be equally compatible with many other models and hypotheses.
The common practice of degrading the P-value into a simple binary, “statistically significant” (P≤0.05) or “not significant” (P>0.05), is damaging. It creates a false sense of certainty and ignores the actual amount of evidence.
Confidence Intervals (CIs) and their importance in the Quasi Geo-Lift test.
A confidence interval (CI) is more informative than a simple P-value from a null hypothesis test. It can be understood as the range of effect sizes that are relatively compatible with the data, given the statistical model.
A 95% confidence interval has a specific frequentist property: if you were to repeat a quasi Geo-Lift study countless times with valid statistical models, 95% of the calculated confidence intervals would, on average, contain the true effect size.
Crucially, this does not mean there is a 95% probability that your specific interval contains the true effect. Once calculated, your interval either contains the true value or it does not (0% or 100%). The “95%” tells you how often this method would capture the true effect over many repeated studies, not how certain we are about this single interval. If you want to move from frequentist confidence intervals to direct probabilities about lift, Bayesian methods are the way to go.
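A toy simulation (not GeoLift-specific) illustrates the coverage property: across many repeated studies, roughly 95% of the 95% intervals contain the true effect, even though any single interval either does or does not.
set.seed(1)
true_effect <- 0.10
covered <- replicate(10000, {
  x  <- rnorm(30, mean = true_effect, sd = 0.2)  # one simulated "study"
  ci <- t.test(x)$conf.int                       # its 95% confidence interval
  ci[1] <= true_effect && true_effect <= ci[2]
})
mean(covered)  # close to 0.95 over repetitions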
If you’d like to dive deeper into p-values, confidence intervals, and hypothesis testing, I recommend these two well-known papers:
(E) EVALUATE your Quasi Geo-Lift Experiment with a broader lens.

Remember to zoom out and see the forest, not just the trees.
The results must feed back into marketing triangulation by validating attribution ROAS and calibrating marketing mix models with a causal multiplier.
They should also guide the next steps: replicate positive results in new geographies, assess the percentage lift against the minimum detectable threshold, and avoid generalizing from a single market before further testing.
Stress-testing with placebo tests (in-space or in-time) can also strengthen confidence in your findings.

Below are the results from one in-time placebo: a +1.3% lift from a placebo is not statistically strong enough to reject H0 (no difference between treatment and control group), which is exactly what we expect from a placebo:

Incorporating the channel’s impressions and spend into the MMM helps capture interactions with other channels in your media mix.
If the campaign fails to deliver the expected lift despite planning suggesting it should, it is important to evaluate factors not captured in the quantitative results, such as the messaging or creative execution. Often, the shortfall can be attributed to how the campaign does (or does not) resonate with the target audience rather than to flaws in the experimental design.
What’s in it for you?
A quasi geo-lift lets you prove whether a campaign truly moves the needle without user-level tracking or big randomized tests. You choose a few markets, construct a synthetic control from the rest, and read the incremental impact directly in business units (rides, sales, leads). The ABCDE plan makes it practical:
- Assess marketing triangulation and the way you’ll measure,
- Budget to a clear MDE,
- Construct treatment/control and guardrails,
- Deliver ATT → CPIC → profit, then
- Evaluate by calibrating MMM/MTA, stress-testing with placebos, and viewing your business context through a broader lens.
Net result? Faster, cheaper, defensible answers you can act on.
Thanks for reading. If you enjoyed this article or learned something new, feel free to reach out to me on LinkedIn.
Full code:
library(tidyr)
library(dplyr)
library(GeoLift)
# Assuming long_data is your pre-formatted dataset with columns: date, location, Y
# The data should be loaded into your R environment before running this code.
long_data <- read.csv("/Users/tomasjancovic/Downloads/long_data.csv")
# Market selection (power evaluation)
long_data$date <- as.Date(long_data$date)  # convert dates once so the later date arithmetic works
GeoLift_PreTest <- long_data
# using data up to 2023-09-18 (day before launch)
GeoTestData_PreTest <- GeoDataRead(
data = GeoLift_PreTest[GeoLift_PreTest$date < '2023-09-18', ],
date_id = "date",
location_id = "location",
Y_id = "Y",
format = "yyyy-mm-dd",
summary = TRUE
)
# overview plot
GeoPlot(GeoTestData_PreTest, Y_id = "Y", time_id = "time", location_id = "location")
# power analysis & market selection
MarketSelections <- GeoLiftMarketSelection(
data = GeoTestData_PreTest,
treatment_periods = c(14, 21, 28, 35, 42),
N = c(1, 2, 3, 4, 5),
Y_id = "Y",
location_id = "location",
time_id = "time",
effect_size = seq(0, 0.26, 0.02),
cpic = 6,
budget = 5000,
alpha = 0.05,
fixed_effects = TRUE,
side_of_test = "one_sided"
)
print(MarketSelections)
plot(MarketSelections, market_ID = 4, print_summary = TRUE)
# ------------- simulation starts; you'd use your observed treatment/control group data instead
# parameters
treatment_cities <- c("Zabrze", "Szczecin", "Czestochowa")
lift_magnitude <- 0.11
treatment_start_date <- as.Date('2023-09-18')
treatment_duration <- 21
treatment_end_date <- treatment_start_date + (treatment_duration - 1)
# extending the time series
extend_time_series <- function(data, extend_days) {
extended_data <- data.frame()
for (city in unique(data$location)) {
city_data <- data %>% filter(location == city) %>% arrange(date)
baseline_value <- mean(tail(city_data$Y, 30))
recent_data <- tail(city_data, 60) %>%
mutate(dow = as.numeric(format(date, "%u")))
dow_effects <- recent_data %>%
group_by(dow) %>%
summarise(dow_multiplier = mean(Y) / mean(recent_data$Y), .groups = 'drop')
last_date <- max(city_data$date)
extended_dates <- seq(from = last_date + 1, by = "day", length.out = extend_days)
extended_values <- sapply(extended_dates, function(date) {
dow <- as.numeric(format(date, "%u"))
multiplier <- dow_effects$dow_multiplier[dow_effects$dow == dow]
if (length(multiplier) == 0) multiplier <- 1
value <- baseline_value * multiplier + rnorm(1, 0, sd(city_data$Y) * 0.1)
max(0, round(value))
})
extended_data <- rbind(extended_data, data.frame(
date = extended_dates,
location = city,
Y = extended_values
))
}
return(extended_data)
}
# extending to treatment_end_date
original_end_date <- max(long_data$date)
days_to_extend <- as.numeric(treatment_end_date - original_end_date)
set.seed(123)
extended_data <- extend_time_series(long_data, days_to_extend)
# Combining original + extended data
full_data <- rbind(
long_data %>% select(date, location, Y),
extended_data
) %>% arrange(date, location)
# applying treatment effect
simulated_data <- full_data %>%
mutate(
Y_original = Y,
Y = if_else(
location %in% treatment_cities &
date >= treatment_start_date &
date <= treatment_end_date,
Y * (1 + lift_magnitude),
Y
)
)
# Verifying treatment (prints just the table)
verification <- simulated_data %>%
filter(location %in% treatment_cities,
date >= treatment_start_date,
date <= treatment_end_date) %>%
group_by(location) %>%
summarize(actual_lift = (mean(Y) / mean(Y_original)) - 1, .groups = 'drop')
print(verification)
# constructing GeoLift input (simulated)
GeoTestData_Full <- GeoDataRead(
data = simulated_data %>% select(date, location, Y),
date_id = "date",
location_id = "location",
Y_id = "Y",
format = "yyyy-mm-dd",
summary = TRUE
)
# Computing time indices
date_sequence <- seq(from = min(full_data$date), to = max(full_data$date), by = "day")
treatment_start_time <- which(date_sequence == treatment_start_date)
treatment_end_time <- which(date_sequence == treatment_end_date)
# Running GeoLift
GeoLift_Results <- GeoLift(
Y_id = "Y",
data = GeoTestData_Full,
locations = treatment_cities,
treatment_start_time = treatment_start_time,
treatment_end_time = treatment_end_time,
model = "None",
fixed_effects = TRUE
)
# ---------------- simulation ends!
# plots
summary(GeoLift_Results)
plot(GeoLift_Results)
plot(GeoLift_Results, type = "ATT")
# placebos
set.seed(42)
# window length (days) of the actual treatment
window_len <- treatment_end_time - treatment_start_time + 1
# the furthest you can shift back while keeping the full window inside the pre-period
max_shift <- treatment_start_time - window_len
n_placebos <- 5
random_shifts <- sample(1:max(1, max_shift), size = min(n_placebos, max_shift), replace = FALSE)
placebo_random_shift <- vector("list", length(random_shifts))
names(placebo_random_shift) <- paste0("Shift_", random_shifts)
for (i in seq_along(random_shifts)) {
s <- random_shifts[i]
placebo_random_shift[[i]] <- GeoLift(
Y_id = "Y",
data = GeoTestData_Full,
locations = treatment_cities,
treatment_start_time = treatment_start_time - s,
treatment_end_time = treatment_end_time - s,
model = "None",
fixed_effects = TRUE
)
}
# --- Print summaries for every random-shift placebo ---
for (i in seq_along(placebo_random_shift)) {
s <- random_shifts[i]
cat("n=== Summary for Random Shift", s, "days ===n")
print(summary(placebo_random_shift[[i]]))
}
# Plot ATT for every random-shift placebo
for (i in seq_along(placebo_random_shift)) {
s <- random_shifts[i]
placebo_end_date <- treatment_end_date - s
cat("n=== ATT Plot for Random Shift", s, "days ===n")
print(plot(placebo_random_shift[[i]], type = "ATT", treatment_end_date = placebo_end_date))
}