Regression Discontinuity Design: How It Works and When to Use It


You’re an avid data scientist and experimenter. You already know that randomisation is the summit of Mount Evidence Credibility, and you might furthermore know that when you can’t randomise, you resort to observational data and causal inference techniques. At your disposal are various methods for spinning up a control group — difference-in-differences, inverse propensity score weighting, and others. With an assumption here or there (some shakier than others), you estimate the causal effect and drive decision-making. But if you thought it couldn’t get more exciting than “vanilla” causal inference, read on.

Personally, I’ve often found myself in at least two scenarios where “just doing causal inference” wasn’t straightforward. The common denominator in these two scenarios? A missing control group — at first glance, that is.

First, the cold-start scenario: the company wants to break into an uncharted opportunity space. Often there is no experimental data to learn from, nor has there been any change (read: “exogenous shock”), from the business or product side, to leverage within the more common causal inference frameworks like difference-in-differences (and other cousins in the pre-post paradigm).

Second, the unfeasible randomisation scenario: the organisation is perfectly intentional about testing an idea, but randomisation is not feasible — or not even wanted. Even emulating a natural experiment might be constrained legally, technically, or commercially (especially when it’s about pricing), or when interference bias arises in the marketplace.

These situations open up the space for a “different” kind of causal inference. Although the method we’ll focus on here is not the only one suited to the job, I’d love for you to tag along on this deep dive into Regression Discontinuity Design (RDD).

In this post, I’ll offer you a crisp view of how and why RDD works. Inevitably, this will involve a bit of math — a nice sight for some — but I’ll do my best to keep it accessible with classic examples from the literature.

We’ll also see how RDD can tackle a thorny causal inference challenge in e-commerce and online marketplaces: the impact of listing position on listing performance. In this practical section we’ll cover key modelling considerations that practitioners often face: parametric versus non-parametric RDD, choosing the right bandwidth parameter, and more. So, grab yourself a cup of coffee and let’s jump in!

Outline

How and why RDD works 

Regression Discontinuity Design exploits cutoffs — thresholds — to recover the effect of a treatment on an outcome. More precisely, it looks for a sharp change in the probability of treatment assignment along a ‘running’ variable. If treatment assignment depends solely on the running variable, and the cutoff is arbitrary, i.e. exogenous, then we can treat the units around it as randomly assigned. The difference in outcomes just above and below the cutoff gives us the causal effect.

For instance, a scholarship awarded only to students scoring above 90 creates a cutoff based on test scores. That the cutoff is 90 is arbitrary — it could have been 80 for that matter; the line simply had to be drawn somewhere. Furthermore, scoring 91 vs. 89 makes the whole difference as for the treatment: either you get it or not. But regarding everything else, the two groups of students that scored 91 and 89 are not really different, are they? And those who scored 89.9 versus 90.1 — if you insist?

Making the cutoff could come down to randomness, when it’s only about a few points. Maybe the student drank too much coffee right before the test — or too little. Maybe they got bad news the night before, were thrown off by the weather, or anxiety hit at the worst possible moment. It’s this randomness that makes the cutoff so valuable in RDD.

Without a cutoff, you don’t have an RDD — just a scatterplot and a dream. But the cutoff by itself is not equipped with all it takes to identify the causal effect. Why it works hinges on one core identification assumption: continuity.

The continuity assumption, and parallel worlds

If the cutoff is the cornerstone of the technique, then its importance comes entirely from the continuity assumption. The idea is a simple, counterfactual one: had there been no treatment, there would have been no effect.

To ground the idea of continuity, let’s jump straight into a classic example from public health: does legal alcohol access increase mortality?

Imagine two worlds where everyone and everything is the same. Except for one thing: a law that sets the minimum legal drinking age at 18 years (we’re in Europe, folks).

In the world with the law (the factual world), we’d expect alcohol consumption to jump right after age 18. Alcohol-related deaths should jump too, if there’s a link.

Now, take the counterfactual world where there is no such law; there should be no such jump. Alcohol consumption and mortality would likely follow a smooth trend across age groups.

Now, that’s a good thing for identifying the causal effect; the absence of a jump in deaths in the counterfactual world is the condition for interpreting a jump in the factual world as the impact of the law.

Put simply: if there is no treatment, there shouldn’t be a jump in deaths. If there is, then something other than our treatment is causing it, and the RDD is not valid.

Two parallel worlds. From left to right: one where there is no minimum age to consume alcohol legally, and one where there is: 18 years.

The continuity assumption can be written in the potential outcomes framework as:

\begin{equation}
\lim_{x \to c^-} \mathbb{E}[Y_i(0) \mid X_i = x] = \lim_{x \to c^+} \mathbb{E}[Y_i(0) \mid X_i = x]
\label{eq:continuity_po}
\end{equation}

Where \(Y_i(0)\) is the potential outcome — say, the risk of death — of subject \(i\) under no treatment.

Notice that the right-hand side is a quantity of the counterfactual world; not one that can be observed in the factual world, where subjects are treated if they fall above the cutoff.

Unfortunately for us, we only have access to the factual world, so the assumption can’t be tested directly. But, luckily, we can proxy it. We will see placebo groups do exactly that later in the post. But first, we start by identifying what can break the assumption:

  1. Confounders: something other than the treatment happens at the cutoff that also impacts the outcome. For instance, adolescents resorting to alcohol to alleviate the crushing pressure of being an adult now — something that has nothing to do with the law on the minimum age to consume alcohol (in the no-law world), but that does confound the effect we’re after, happening at the same age — the cutoff, that is.
  2. Manipulating the running variable:
    When units can influence their position with regard to the cutoff, it may be that the units who did so are inherently different from those who didn’t. Hence, cutoff manipulation can result in selection bias: a form of confounding. Especially if treatment assignment is binding, subjects may try their best to get one version of the treatment over the other.

Hopefully, it’s clear what constitutes an RDD: the running variable, the cutoff, and most importantly, reasonable grounds to defend that continuity holds. With that, you’ve got yourself a neat and effective causal inference design for questions that can’t be answered by an A/B test, nor by some of the more common causal inference techniques like diff-in-diff, nor with stratification.

In the next section, we continue shaping our understanding of how RDD works: how does RDD “control” confounding relationships? What exactly does it estimate? Can’t we just control for the running variable instead? These are the questions we tackle next.

RDD and instruments

If you are already familiar with instrumental variables (IV), you may see the similarities: both RDD and IV leverage an exogenous variable that doesn’t cause the outcome directly, but does influence the treatment assignment, which in turn may influence the outcome. In IV this is a third variable Z; in RDD it’s the running variable that serves as an instrument.

Wait. A third variable? Maybe. But an exogenous one? That’s less clear.

In our example of alcohol consumption, it is not hard to imagine that age — the running variable — is a confounder. As age increases, so might tolerance for alcohol, and with it the level of consumption. That’s a stretch, maybe, but not implausible.

Since treatment (the legal minimum age) depends on age — only units above 18 are treated — treated and untreated units are inherently different. If age also influences the outcome, through a mechanism like the one sketched above, we have ourselves an apex confounder.

Still, the running variable plays a key role. To grasp why, we need to look at how RDD and instruments leverage the front-door criterion to identify causal effects.

Perhaps almost instinctively, one may answer with controlling for the running variable; that’s what stratification taught us. The running variable is a confounder, so we include it in our regression, and close the backdoor. But doing so would cause some trouble.

Remember, treatment assignment depends on the running variable so that everyone above the cutoff is treated with certainty, and no one below it. So, if we control for the running variable, we run into two very related problems:

  1. Violation of the positivity assumption: this assumption says that treated units must have a non-zero probability of receiving the alternative treatment, and vice versa. Intuitively, conditioning on the running variable is like saying: “Let’s estimate the effect of being above the minimum age for alcohol consumption, while holding age fixed at 14.” That doesn’t make sense. At any given value of the running variable, treatment is either always 1 or always 0. So, there’s no variation in treatment conditional on the running variable to support such a question.
  2. Perfect collinearity at the cutoff: in estimating the treatment effect, the model has no way to separate the effect of crossing the cutoff from the effect of being at a specific value of X. The result? No estimate, or a variable forcefully dropped from the model design matrix — symptoms that should sound familiar to most practitioners (see the sketch right after this list).
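To make the collinearity point concrete, here is a toy sketch in R (simulated data, hypothetical names) where the treatment indicator is a deterministic function of the running variable, and fully stratifying on the running variable leaves nothing for the treatment coefficient to explain:

# toy sketch: treatment is assigned purely by a cutoff on the running variable
set.seed(1)
x <- 1:100                                 # running variable
d <- as.integer(x > 50)                    # treatment indicator, deterministic in x
y <- 1 + 0.5 * d + 0.01 * x + rnorm(100, sd = 0.1)

# conditioning on every value of x: d is a linear combination of the x dummies,
# so lm() flags it as collinear and reports NA for its coefficient
coef(lm(y ~ factor(x) + d))["d"]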

So no — conditioning on the running variable doesn’t make the running variable the exogenous instrument that we’re after. Instead, the running variable becomes exogenous by pushing it to the limit — quite literally. Where the running variable approaches the cutoff from either side, the units are the same with respect to the running variable. Yet falling just above or below makes all the difference for getting treated or not. This makes the running variable a valid instrument, if treatment assignment is the only thing that happens at the cutoff. Judea Pearl refers to instruments as meeting the front-door criterion.

X is the running variable, D the treatment assignment, Y the outcome, and U a set of unobserved influences on the outcome. The causal effect of D on Y is unidentified in the above marginal model, with X being a confounder, and potentially U too. Conditioning on X violates the positivity assumption. Instead, conditioning on X at its limits towards the cutoff (c0) controls for the backdoor path: X to Y directly, and through U.

So, in essence, we’re controlling for the running variable — but only near the cutoff. That’s why RDD identifies the local average treatment effect (LATE), a special flavour of the average treatment effect (ATE). The LATE looks like:

$$\delta_{SRD} = E\big[Y^1_i - Y_i^0 \mid X_i = c_0\big]$$

The “local” bit refers to the partial scope of the population we’re estimating the ATE for, which is the subpopulation around the cutoff. In fact, the further away a data point is from the cutoff, the more the running variable acts as a confounder, working against the RDD instead of in its favour.

Back to the context of the minimum age for legal alcohol consumption. Adolescents who are 17 years and 11 months old are really not so different from those who are 18 years and 1 month old, on average. If anything, a month or two of difference in age is not going to be what sets them apart. Isn’t that the essence of conditioning on, or holding constant, a variable? What sets them apart is that the latter group can consume alcohol legally for being above the cutoff, and the former can’t.

This setup enables us to estimate the LATE for the units around the cutoff and, with that, the effect of the minimum age policy on alcohol-related deaths.

We’ve seen how the continuity assumption has to hold for the cutoff to be an interesting point along the running variable for identifying the causal effect of a treatment on the outcome. Namely, by letting the jump in the outcome variable be entirely attributable to the treatment. If continuity holds, the treatment is as-good-as-random near the cutoff, allowing us to estimate the local average treatment effect.

In the next section, we’ll walk through the practical setup of a real-world RDD: we identify the key concepts — the running variable and cutoff, treatment, outcome, and covariates — and finally, we estimate the RDD after discussing some crucial modelling decisions, ending the section with a placebo test.

RDD in Action: Search Ranking and Listing Performance Example

In e-commerce and online marketplaces, the starting point of the customer experience is searching for a listing. Consider the visitor typing “Nikon F3 analogue camera” in the search bar. Upon carrying out this action, algorithms frantically sort through the inventory in search of the best matching listings to populate the search results page.

Time and attention are two scarce resources. So, it’s in the interest of everyone involved — the buyer, the seller and the platform — to reserve the most prominent positions on the page for the matches with the highest anticipated likelihood of becoming successful trades.

Moreover, position effects in consumer behaviour suggest that users infer higher credibility and desirability from items “ranked” at the top. Think of high-tier products being placed at eye height or above in supermarkets, and highlighted items on an e-commerce platform, at the top of the homepage.

So, the question then becomes: how does positioning on the search results page influence a listing’s chances of being sold?

Hypothesis:
If a listing is ranked higher on the search results page, then it will have a higher likelihood of being sold, because higher-ranked listings get more visibility and attention from users.

Intermezzo: business or theory?

As with any good hypothesis, we need a bit of theory to ground it. Good for us is that we are not trying to find the cure for cancer. Our theory is about well-understood psychological phenomena and behavioural patterns, to put it in an overly sophisticated way.

Consider, among others, the resource theory of attention. Ideas like these are well established in behavioural and cognitive psychology and back up our plan here.

Kicking off the conversation with a product manager will be more fun this way. Personally, I also get excited when I have to brush up on some psychology.

But I’ve found through and through that theory is really secondary to any initiative in my industry (tech). Except for a research team and project, arguably. And it’s fair to say it helps us stay on-purpose: what we’re doing is to bring the business forward, not to serve mother science.

Knowing the answer has real business value. Product and commercial teams could use it to design new paid features that help sellers get their listings into higher positions — a win for both the business and the user. It could also clarify the value of on-site real estate like banner positions and ad slots, helping drive growth in B2B advertising.

The question is about incrementality: would listing \(j\) have been sold, had it been ranked 1st on the results page instead of 15th? So, we want to make a causal statement. That’s hard for at least two reasons:

  1. A/B testing comes with a cost, and;
  2. there are confounders we need to deal with if we resort to observational methods.

Let’s expand on that.

The cost of A/B testing

One experiment design could randomise the fetched listings across the page slots, independent of listing relevance. Breaking the inherent link between relevance and position, we would learn the effect of position on listing performance. It’s an interesting idea — but a costly one.

While it’s a reasonable design for statistical inference, this setup is quite terrible for the user and the business. The user might have found what they needed — maybe even made a purchase. But instead, maybe only half of the inventory they would have seen was remotely a good match, thanks to our experiment. This suboptimal user experience likely hurts engagement in both the short and long run — especially for new users who are yet to see what value the platform holds for them.

Can we think of a way to mitigate this loss? Still committed to A/B testing, one could expose a smaller set of users to the experiment. While that would scale down the consequences, it would also stand in the way of reaching sufficient statistical power by lowering the sample size. Furthermore, even small audiences can be responsible for substantial revenue for some companies — those with millions of users. So, cutting the exposed audience is not a silver bullet either.

Naturally, the way to go is to leave the platform and its users undisturbed — and still find a way to answer the question at hand. Causal inference is the right mindset for this, but the question is: how do we do that exactly?

Confounders

Listings don’t just make it to the top of the page on a good day; it’s their quality, relevance, and the seller’s reputation that promote the ranking of a listing. Let’s call these three variables W.

What makes W tricky is that it influences both the ranking of the listing and the probability that the listing gets clicked, a proxy for performance.

In other words, W affects both our treatment (position) and outcome (click), earning itself the status of confounder.

A variable, or set thereof, W, is a confounder when it influences both the treatment (rank, position) and the outcome of interest (click).

Therefore, our task is to find a design that is fit for purpose; one that effectively controls the confounding effect of W.

Not all causal inference designs are just sitting around waiting to be picked. Sometimes they show up when you least need them, and sometimes you get lucky when you need them most — like today.

It looks like we can use the page cutoff to identify the causal impact of position on click-through rate.

Abrupt cutoff in search results pagination

Let’s unpack the listing recommendation mechanism to see exactly how. Here’s what happens under the hood when a results page is generated for a search:

  1. Fetch listings matching the query
    A rough set of listings is pulled from the inventory, based on filters like location, radius, category, etc.
  2. Score listings on personal relevance
    This step uses user history and listing quality proxies to predict what the user is most likely to click.
  3. Rank listings by score
    Higher scores get better ranks. Business rules mix in ads and commercial content with the organic results.
  4. Populate pages
    Listings are slotted by absolute relevance score. A results page ends at the 30th listing, so the 31st listing appears at the top of the next page. This is going to be crucial to our design.
  5. Impressions and user interaction
    Users see the results in order of relevance. If a listing catches their eye, they may click and view more details: one step closer to the trade.

Practical setup and variables

So, what exactly is our design? Next, we walk through the reasoning and identification of its key ingredients.

The running variable

In our setup, the running variable is the relevance score \(s_j\) for listing j. This score is a continuous, complex function of both user and listing properties:

$$s_j = f(u_i, l_j)$$

The listing’s rank \(r_j\) is simply a rank transformation of \(s_j\), defined as:

$$r_i = \sum_{j=1}^{n} \mathbf{1}(s_j \leq s_i)$$

Practically speaking, this means that for analytic purposes — such as fitting models, making local comparisons, or identifying cutoff points — knowing a listing’s rank conveys nearly the same information as knowing its underlying relevance score, and vice versa.
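As a quick illustration, here is a toy sketch of that rank transformation in R, using a hypothetical vector of relevance scores from a single search:

# toy sketch: rank as a transformation of relevance scores
s <- c(5.66, 0.10, 2.31, 0.99)             # hypothetical relevance scores in one search
r <- sapply(s, function(si) sum(s <= si))  # r_i = sum_j 1(s_j <= s_i), as in the formula above
all(r == rank(s))                          # TRUE: same as base::rank() when there are no ties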

Details: Relevance score vs. rank

The relevance score \(s_j\) reflects how well a listing matches a particular user’s query, given parameters like location, price range, and other filters. But this score is relative — it only has meaning within the context of the listings returned for that specific search.

In contrast, rank (or position) is absolute. It directly determines a listing’s visibility. I think of rank as a standardising transformation of \(s_j\). For instance, Listing A in search Z might have a top score of 5.66, while Listing B in search K tops out at 0.99. These raw scores aren’t comparable across searches — but both listings are ranked first in their respective result sets. That makes them equivalent in terms of what really matters here: how visible they are to users.

The cutoff, and treatment

If a listing just misses the first page, it doesn’t fall to the bottom of page two — it’s artificially bumped to the top. That’s a lucky break. Normally, only the most relevant listings appear at the top, but here a listing of merely moderate relevance lands in a prominent slot — albeit on the second page — purely as a consequence of the arbitrary position of the page break. Formally, the treatment assignment \(D_j\) goes like:

$$D_j = \begin{cases} 1 & \text{if } r_j > 30 \\ 0 & \text{otherwise} \end{cases}$$

The strength of this setup lies in what happens near the cutoff: a listing ranked 30 may be nearly identical in relevance to one ranked 31. A small scoring fluctuation — or a high-ranking outlier — can push a listing over the threshold, flipping its treatment status. This local randomness is what makes the setup valid for RDD.
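In code, constructing the treatment indicator and a cutoff-centred running variable could look like the sketch below, assuming a hypothetical raw rank column (rank_raw) and 30 listings per page; the resulting column names mirror those used in the snippets later on:

# sketch: treatment indicator and centred running variable from a raw rank column
df_listing_level$ad_position_idx <- df_listing_level$rank_raw - 30        # 0 at the page break
df_listing_level$D <- as.integer(df_listing_level$ad_position_idx > 0)    # 1 = bumped to page two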

The outcome: Impression-to-click

Finally, we operationalise the outcome of interest as the click-through rate from impressions to clicks. Remember that all listings are ‘impressed’ when the page is populated. The click is the binary indicator of the desired user behaviour.

In summary, this is our setup:

  • Outcome: impression-to-click conversion
  • Treatment: landing on the first vs. second page
  • Running variable: listing rank; page cutoff at 30 

Next, we walk through how to estimate the RDD.

Estimating RDD

In this section, we’ll estimate the causal parameter, interpret it, and connect it back to our core hypothesis: how position affects listing visibility.

Here’s what we’ll cover:

  • Meet the data: intro to the dataset
  • Covariates: why and how to include them
  • Modelling decisions: parametric RDD vs. not; choosing the polynomial degree and bandwidth
  • Placebo-testing
  • Density continuity testing

Meet the data

We’re working with impressions data from one of Adevinta’s (ex-eBay Classifieds Group) marketplaces. It’s real data, which makes the whole exercise feel grounded. That said, values and relationships are censored and scrambled where needed to protect their strategic value.

An important note for how we interpret the RDD estimates and drive decisions is how the data was collected: only those searches where the user saw both the first and the second page were included.

This way, we partial out the page fixed effect, if any, but the truth is that many users don’t make it to the second page at all. So there’s a big volume gap. We discuss the repercussions in the analysis recap.

The dataset consists of these variables:

  • Clicked: 1 if the listing was clicked, 0 otherwise – binary
  • Position: the rank of the listing – numeric
  • D: treatment indicator, 1 if position > 30, 0 otherwise – binary
  • Category: product category of the listing – nominal
  • Organic: 1 if organic, 0 if from a professional seller – binary
  • Boosted: 1 if it was paid to be at the top, 0 otherwise – binary
click  rel_position  D  category  organic  boosted
1      -3            0  A         1        0
1      -14           0  A         1        0
0      3             1  C         1        0
0      10            1  D         0        0
1      -1            0  K         1        1

A sample of the dataset we’re working with.
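Since the real data can’t be shared, here’s a hypothetical simulated frame with the same schema, handy if you want to follow along with the snippets below (it uses the column names that appear in the code: ad_position_idx for the position relative to the page break, and l1 for the category):

# hypothetical simulated data with the same schema, for following along only
set.seed(42)
n <- 5000
df_listing_level <- data.frame(
  ad_position_idx = sample(-29:30, n, replace = TRUE),          # rank relative to the page break
  l1      = factor(sample(LETTERS[1:5], n, replace = TRUE)),    # category, called l1 in the model code
  organic = rbinom(n, 1, 0.8),
  boosted = rbinom(n, 1, 0.1)
)
df_listing_level$D <- as.integer(df_listing_level$ad_position_idx > 0)
# click probability with a small jump at the cutoff, purely for illustration
p <- 0.05 + 0.01 * df_listing_level$D - 0.0005 * df_listing_level$ad_position_idx
df_listing_level$click <- rbinom(n, 1, pmin(pmax(p, 0), 1))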

Covariates: how to include them to increase accuracy?

The running variable, the cutoff, and the continuity assumption give you all you need to identify the causal effect. But including covariates can sharpen the estimator by reducing variance — if done right. And, oh, is it easy to do it wrong.

The easiest thing to “break” about the RDD design is the continuity assumption. At the same time, it’s the last thing we want to break (I already rambled long enough about this).

Therefore, the main quest in adding covariates is to do it in such a way that we reduce variance while keeping the continuity assumption intact. One way to formulate that is to require continuity to hold both without and with covariates:

\begin{equation}
\lim_{x \to c^-} \mathbb{E}[Y_i(0) \mid X_i = x] = \lim_{x \to c^+} \mathbb{E}[Y_i(0) \mid X_i = x] \quad \text{(no covariates)}
\end{equation}

\begin{equation}
\lim_{x \to c^-} \mathbb{E}[Y_i(0) \mid X_i = x, Z_i] = \lim_{x \to c^+} \mathbb{E}[Y_i(0) \mid X_i = x, Z_i] \quad \text{(covariates)}
\end{equation}

Where \(Z_i\) is a vector of covariates for subject i. Less mathy: two things should remain unchanged after adding covariates:

  1. The functional form of the running variable, and;
  2. The (absence of a) jump in treatment assignment at the cutoff

I didn’t figure out the above myself; Calonico, Cattaneo, Farrell, and Titiunik (2018) did. They developed a formal framework for incorporating covariates into RDD. I’ll leave the details to the paper. For now, some modelling guidelines can keep us going:

  1. Model covariates linearly, so that the treatment effect stays the same with and without covariates, thanks to a simple and smooth partial effect of the covariates;
  2. Keep the model terms additive, so that the treatment effect stays the LATE and doesn’t become a conditional average treatment effect (CATE), and to avoid a jump at the cutoff.
  3. The above implies that there should be no interactions with the treatment indicator, nor with the running variable. Doing any of these may break continuity and invalidate our RDD design.

Our goal model may look like this:

\begin{equation}
Y_i = \alpha + \tau D_i + f(X_i - c) + \beta^\top Z_i + \varepsilon_i
\end{equation}

By contrast, letting the covariates interact with the treatment indicator, the kind of model we want to avoid looks like this:

\begin{equation}
Y_i = \alpha + \tau D_i + f(X_i - c) + \beta^\top (Z_i \cdot D_i) + \varepsilon_i
\end{equation}

Now, let’s distinguish between two ways of practically including covariates:

  1. Direct inclusion: add them directly to the outcome model alongside the treatment and the running variable.
  2. Residualisation: first regress the outcome on the covariates, then use the residuals in the RDD.

We’ll use residualisation in our case. It’s an effective way to reduce noise, it produces cleaner visualisations, and it protects the strategic value of the data.
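For contrast, direct inclusion would look something like the sketch below, mirroring the goal model above with the covariates entering additively; the column names are assumptions based on this example’s dataset, not the post’s actual model:

# sketch of direct inclusion: covariates added additively to the outcome model
mod_direct <- lm(click ~ D + ad_position_idx + l1 + organic + boosted,
                 data = df_listing_level)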

The snippet below defines the outcome de-noising model and computes the residualised outcome, click_res. The idea is simple: once we strip out the variance explained by the covariates, what remains is a less noisy version of our outcome variable — at least in theory. Less noise means more accuracy.

In practice, though, the residualisation barely moved the needle this time. We can see that by checking the change in standard deviation:

SD(click_res) / SD(click) - 1 gives us about -3%, which is small, practically speaking.

# denoising clicks
mod_outcome_model <- lm(click ~ l1 + organic + boosted, 
                        data = df_listing_level)

df_listing_level$click_res <- residuals(mod_outcome_model)

# the impact on variance is limited: ~ -3%
sd(df_listing_level$click_res) / sd(df_listing_level$click) - 1

Even though the de-noising didn’t have much effect, we’re still in a good spot. The original outcome variable already has low conditional variance, and patterns around the cutoff are visible to the naked eye, as we can see below.

On the x-axis: ranks relative to the page end (30 positions on one page); on the y-axis: the residualised average click-through.

We move on to a few other modelling decisions that generally have a bigger impact: choosing between parametric and non-parametric RDD, the polynomial degree, and the bandwidth parameter (h).

Modelling decisions in RDD

Parametric vs non-parametric RDD

You may wonder why we even have to choose between parametric and non-parametric RDD. The answer lies in how each approach trades off bias and variance in estimating the treatment effect.

Choosing parametric RDD is essentially choosing to reduce variance. It assumes a particular functional form for the relationship between the outcome and the running variable, \(\mathbb{E}[Y \mid X]\), and fits that model across the whole dataset. The treatment effect is captured as a discrete jump in an otherwise continuous function. The typical form looks like this:

$$Y = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 D \cdot X + \varepsilon$$
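As a minimal sketch of what that looks like in R, on simulated data (all names hypothetical), the treatment effect is simply the coefficient on the treatment indicator in one global fit:

# parametric RDD sketch: one global linear fit, Y = b0 + b1*D + b2*X + b3*D*X + e
set.seed(1)
x <- runif(1000, -30, 30)                  # centred running variable
d <- as.integer(x > 0)                     # treatment: above the cutoff
y <- 0.05 + 0.01 * d + 0.001 * x + 0.0005 * d * x + rnorm(1000, sd = 0.02)
mod_param <- lm(y ~ d * x)
coef(mod_param)["d"]                       # estimated jump at the cutoff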

Non-parametric RDD, on the other hand, is about reducing bias. It avoids strong assumptions about the global relationship between Y and X and instead estimates the outcome function separately on either side of the cutoff. This flexibility allows the model to more accurately capture what’s happening right around the threshold. The non-parametric estimator is:

\(\tau = \lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]\)

So, which should you choose? Truthfully, it can feel arbitrary. And that’s okay. This is the first in a series of judgment calls that practitioners often call the fun part of RDD. It’s where modelling becomes as much an art as it is a science.

I’ll walk through how I approach that choice. But first, let’s look at two key tuning parameters (especially for non-parametric RDD) that can guide our final decision: the polynomial degree and the bandwidth, h.

Polynomial degree

The relationship between the outcome and the running variable can take many forms, and capturing its true shape is crucial for estimating the causal effect accurately. If you’re lucky, everything is linear and there is no need to think about polynomials — if you’re a realist, then you probably want to learn how they can serve you in the process.

In choosing the right polynomial degree, the goal is to reduce bias without inflating the variance of the estimator. So we want to allow for flexibility, but not more than needed. Take the examples in the image below: with an outcome of low enough variance, the linear form naturally invites the eyes to estimate the outcome at the cutoff. But the estimate becomes biased with even a slightly more complex form, if we enforce a linear shape in the model. Insisting on a linear form in such a complex case is like fitting your feet into a glove: it sort of works, but it’s very ugly.

Instead, we give the model more degrees of freedom with a higher-degree polynomial, and estimate \(\tau = \lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]\) with lower bias.

The functional form of the running variable needs to be captured flexibly enough, and failing to do so may introduce bias.
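In practice, that extra flexibility can be as simple as adding polynomial terms in the running variable. A sketch on this example’s data, using a quadratic in line with the rule of thumb discussed below:

# sketch: quadratic polynomial in the running variable, different shape on each side of the cutoff
mod_poly2 <- lm(click_res ~ D * poly(ad_position_idx, degree = 2, raw = TRUE),
                data = df_listing_level)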

The bandwidth parameter: h

Working with polynomials in the way described above doesn’t come free of worries. Two things are required, and they pose a challenge at the same time:

  1. we need to get the modelling right for the entire range, and;
  2. the entire range needs to be relevant for the task at hand, which is estimating \(\tau = \lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]\)

Only then do we reduce bias as intended; if one of these two is not the case, we risk adding more of it.

The thing is that modelling the entire range properly is harder than modelling a smaller range, especially if the form is complex. So, it’s easier to make mistakes. Furthermore, the entire range is almost certain not to be relevant for estimating the causal effect — the “local” in LATE gives it away. How do we work around this?

Enter the bandwidth parameter, h. The bandwidth parameter helps the model leverage data that is closer to the cutoff, dropping the use-all-the-data idea and bringing it back to the local scope RDD estimates the effect for. It does so by weighting the data by some function \(w(X)\) so that more weight is given to entries near the cutoff, and less to entries further away.

For example, with h = 10, the model considers a range of total length 20: 10 on each side of the cutoff.

The effective weight depends on the function \(w\). A weighting function with hard-boundary behaviour is known as a square, or uniform, kernel. Think of it as a function that gives weight 1 when the data is within the bandwidth, and 0 otherwise. The Gaussian and triangular kernels are two other kernels frequently used by practitioners. The key difference is that these behave less abruptly in weighting the entries, compared to the square kernel. The image below visualises the behaviour of the three kernel functions.

Three weighting functions visualised. The y-axis represents the weight. The square kernel acts as a hard cutoff on which entries the model gets to see. The triangular and Gaussian functions behave more smoothly in this respect.
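The final model below calls a triangular_kernel() helper that isn’t defined in the post; here is a minimal sketch of what such a helper could look like, assuming weight 1 at the cutoff decaying linearly to 0 at distance h:

# minimal sketch of a triangular kernel weight function
triangular_kernel <- function(x, c, h) {
  u <- abs(x - c) / h
  ifelse(u <= 1, 1 - u, 0)   # weight 1 at the cutoff, 0 beyond the bandwidth
}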

Everything put together: non- vs. parametric RDD, polynomial degree and bandwidth

To me, choosing the final model boils down to the question: what is the simplest model that does a good job? Indeed — the principle of Occam’s razor never goes out of fashion. In practice, this means:

  1. Non- vs. parametric: is the functional form simple on both sides of the cutoff? Then a single fit, pooling data from both sides, will do. Otherwise, non-parametric RDD adds the flexibility needed to embrace two different dynamics on either side of the cutoff.
  2. Polynomial degree: when the function is complex, I opt for higher degrees to follow the trend more flexibly.
  3. Bandwidth: if I just picked a high polynomial degree, then I’ll let h be larger too. Otherwise, lower values of h often go well with lower-degree polynomials, in my experience*, **.

* This brings us to the commonly accepted recommendation in the literature: keep the polynomial degree lower than 3. In most use cases, 2 works well enough. Just make sure you choose mindfully.

** Also, note that h fits especially well within the non-parametric mentality; I see these two decisions as co-dependent.

Back to the listing position scenario. This is the final model to me:

# modelling the residuals of the outcome (de-noised);
# D * ad_position_idx allows a separate slope on each side of the cutoff,
# matching the interaction term interpreted in the model summary below
mod_rdd <- lm(click_res ~ D * ad_position_idx,
              weights = triangular_kernel(x = ad_position_idx, c = 0, h = 10),  # this is h
              data = df_listing_level)
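To read off the estimates interpreted in the next section, the standard summary works; the LATE is the coefficient on the treatment indicator:

summary(mod_rdd)       # full model summary, as shown in the image below
coef(mod_rdd)["D"]     # the estimated local average treatment effect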

Interpreting RDD results

Let’s look at the model output. The image below shows the model summary. If you’re familiar with that, it all comes down to interpreting the parameters.

The first thing to look at is that treated listings have a ~1 percentage point higher probability of being clicked than untreated listings. To put that in perspective, that’s a +20% change if the click rate of the control is 5%, and a ~+1% increase if the control is 80%. In terms of the practical significance of this causal effect, these two uplifts are day and night. I’ll leave this open-ended with a few questions to take home: when would you and your team label this impact as an opportunity to jump on? What other data/answers do we need to declare this track worthy of following?

The rest of the parameters don’t really add much to the interpretation of the causal effect. But let’s go over them quickly, nonetheless. The second estimate (x) is that of the slope below the cutoff; the third one (D × x) is the extra [negative] slope added to the previous one to reflect the slope above the cutoff. Finally, the intercept is the average for the units right below the cutoff. Because our outcome variable is residualised, the value -0.012 is the demeaned outcome; it is no longer on the scale of the original outcome.

Different decisions, different models

I’ve put this image together to show a collection of other possible models, had we made different decisions on bandwidth, polynomial degree, and parametric-versus-not. Although hardly any of these models would have put the decision maker on a completely wrong path on this particular dataset, each model comes with its own bias and variance properties. These properties colour our confidence in the estimate.

Placebo testing

In any causal inference method, the identification assumption is everything. One thing is off, and the whole analysis crumbles. We can pretend everything is alright, or we can put our methods to the test ourselves (believe me, it’s better when you break your own analysis before it goes out there).

Placebo testing is one way to corroborate the results. It checks the validity of results by using a setup similar to the real one, minus the actual treatment. If we still see an effect, it signals a flawed design — continuity can’t be assumed, and causal effects can’t be identified.

Good for us, we have a placebo group. The 30-listing page cut only exists on the desktop version of the platform. On mobile, infinite scroll makes it one long page; no pagination, no page jump. So the effect of “going to the next page” shouldn’t appear, and it doesn’t.

I don’t think we need to do much inference. The graph below already tells us the whole story: without pages, going from the 30th position to the 31st is no different from going from any other position to the next. More importantly, the function is smooth at the cutoff. This finding adds a great deal of credibility to our analysis by showing that continuity holds in this placebo group.
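If you do want to put a number on it, a sketch of the placebo check could reuse the exact same specification on the mobile sessions, assuming a hypothetical df_listing_level_mobile frame with the same columns; we’d expect the coefficient on D to be indistinguishable from zero:

# placebo sketch: same model, fit on the infinite-scroll (mobile) data
mod_placebo <- lm(click_res ~ D * ad_position_idx,
                  weights = triangular_kernel(x = ad_position_idx, c = 0, h = 10),
                  data = df_listing_level_mobile)
summary(mod_placebo)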

The placebo test is one of the strongest checks in an RDD. It tests the continuity assumption almost directly, by treating the placebo group as a stand-in for the counterfactual.

Of course, this relies on a new assumption: that the placebo group is valid; that it is a sufficiently good counterfactual. So the test is strong only if that assumption is more credible than assuming continuity without evidence.

Which means that we have to be open to the possibility that there is no proper placebo group. How do we stress-test our design then?

No-manipulation and the density continuity test

Quick recap. There are two related sources of confounding, and hence of violating the continuity assumption:

  1. direct confounding from a third variable at the cutoff, and
  2. manipulation of the running variable.

The first can’t be tested directly (except with a placebo test). The second can.

If units can shift their running variable, they self-select into treatment. The comparison stops being fair: we’re now comparing manipulators to those who couldn’t or didn’t manipulate. That self-selection becomes a confounder if it also affects the outcome.

For instance, take students who didn’t make the cut for a scholarship, but go on to effectively smooth-talk their institution into letting them pass with a higher score. That silver tongue can also help them get better salaries, and act as a confounder when we study the effect of scholarships on future income.

In DAG form, running variable manipulation causes selection bias, which in turn means that the continuity assumption no longer holds. If we know that continuity holds, then there is no need to test for selection bias by manipulation. But when we cannot establish that (because there is no good placebo group), then at least we can try to test whether there is manipulation.

So, what are the signs that we’re in such a scenario? An unexpectedly high number of units just above the cutoff, and a dip just below (or vice versa). We can see this as another continuity question, but this time in terms of the density of the samples.

While we can’t test the continuity of the potential outcomes directly, we can test the continuity of the density of the running variable at the cutoff. The McCrary density test is the standard tool for this, testing exactly:

\(H_0: \lim_{x \to c^-} f(x) = \lim_{x \to c^+} f(x) \quad \text{(no manipulation)}\)

\(H_A: \lim_{x \to c^-} f(x) \neq \lim_{x \to c^+} f(x) \quad \text{(manipulation)}\)

where \(f(x)\) is the density function of the running variable. If \(f(x)\) jumps at x = c, it suggests that units have sorted themselves just above or below the cutoff — violating the assumption that the running variable was not manipulable at that margin.

The internals of this test are something for a different post, because luckily we can rely on the rddensity package to run it off-the-shelf.

require(rddensity)
density_check_obj <- rddensity(X = df_listing_level$ad_position_idx, 
                               c = 0)
summary(density_check_obj)

# for the plot below
rdplotdensity(density_check_obj, X = df_listing_level$ad_position_idx)
A visual representation of the McCrary test.

The test shows marginal evidence of a discontinuity in the density of the running variable (T = 1.77, p = 0.077). Binomial counts are unbalanced around the cutoff, suggesting fewer observations just below the threshold.

Often, this is a red flag, as it may pose a threat to the continuity assumption. This time, however, we know that continuity holds (see the placebo test).

Furthermore, ranking is done by the algorithm: sellers have no means to manipulate the rank of their listings at all. That’s something we know by design.

Hence, a more plausible explanation is that the discontinuity in the density is driven by platform-side impression logging (not ranking), or by my own filtering in the SQL query (which is elaborate, and missing values on the filter variables are not unusual).

Inference

The results will do this time around. But Calonico, Cattaneo, and Titiunik (2014) highlight a few issues with OLS RDD estimates like ours. Specifically: 1) the bias in estimating the expected outcome at the cutoff, which is no longer really the cutoff once we take samples further away from it, and 2) the bandwidth-induced uncertainty that is left out of the model (as h is treated as a hyperparameter, not a model parameter).

Their methods are implemented in rdrobust, an R and Stata package. I recommend using that software in analyses that are about driving real-life decisions.
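A minimal sketch of what that could look like on this example’s data; rdrobust picks a data-driven bandwidth and reports robust, bias-corrected confidence intervals:

library(rdrobust)
rd_out <- rdrobust(y = df_listing_level$click_res,
                   x = df_listing_level$ad_position_idx,
                   c = 0)
summary(rd_out)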

Analysis recap

We looked at how a listing’s spot in the search results affects how often it gets clicked. By focusing on the cutoff between the first and second page, we found a clear (though modest) causal effect: listings at the top of page two got more clicks than those stuck at the bottom of page one. A placebo test backed this up — on mobile, where there’s infinite scroll and no real “pages”, the effect disappears. That gives us more confidence in the result. Bottom line: where a listing shows up matters, and prioritising top positions could boost engagement and create new commercial possibilities.

But before we run with it, a couple of important caveats.

First, our result is local — it only tells us what happens near the page-two cutoff. We don’t know if the same effect holds at the top of page one, which probably signals even more value to users. So this might be a lower-bound estimate.

Second, volume matters. The first page gets a lot more eyeballs. So even if a top slot on page two gets more clicks per view, a lower spot on page one might still win overall.

Conclusion

Regression Discontinuity Design is not your everyday causal inference method — it’s a nuanced approach best saved for when the stars align and randomisation isn’t doable. Make sure you have a good grip on the design, and be thorough about the core assumptions: try to break them, and then try harder. When you have what you need, it’s an incredibly satisfying design. I hope this reading serves you well the next time you get a chance to apply this method.

It’s great seeing that you got this far into this post. If you want to read more, it’s possible; just not here. So, I compiled a small list of resources for you:

Also check the reference section below for some deep reads.

Happy to connect on LinkedIn, where I discuss more topics like the one here. Also, feel free to bookmark my personal website, which is much cosier than here.

