You’ve probably used the normal distribution one or two times too many. All of us have — it’s a real workhorse. But sometimes we run into problems: for instance, when predicting or forecasting values, simulating data given a particular data-generating process, or when we try to visualise model output and explain it intuitively to non-technical stakeholders. Suddenly, things don’t make much sense: can a user really have made -8 clicks on the banner? Or even 4.3 clicks? Both are examples of how count data doesn’t behave.
I’ve found that better encapsulating the data-generating process in my modelling has been key to producing sensible model output. Using the Poisson distribution when it was appropriate has not only helped me convey more meaningful insights to stakeholders, but it has also enabled me to produce more accurate error estimates, better inference, and sound decision-making.
In this post, my aim is to help you get a deep, intuitive feel for the Poisson distribution by walking through example applications and taking a dive into the foundations — the maths. I hope you learn not only how it works, but also why it works, and when to apply the distribution.
Outline
- Examples and use cases: Let’s walk through some use cases and sharpen the intuition I just mentioned. Along the way, the relevance of the Poisson distribution will become clear.
- The foundations: Next, let’s break down the equation into its individual components. By studying each part, we’ll uncover why the distribution works the way it does.
- The assumptions: Equipped with some formality, it will be easier to understand the assumptions that power the distribution, and that at the same time set the boundaries for when it works, and when it doesn’t.
- When real life deviates from the model: Finally, let’s explore the special link that the Poisson distribution has with the Negative Binomial distribution. Understanding this relationship can deepen our understanding and provide alternatives when the Poisson distribution is not suited to the job.
Example in an online marketplace
I chose to deep dive into the Poisson distribution because it frequently appears in my day-to-day work. Online marketplaces rely on binary user decisions from two sides: a seller deciding to list an item and a buyer deciding to make a purchase. These micro-behaviours drive supply and demand, both in the short and the long run. A marketplace is born.
Binary decisions aggregate into counts — the sum of many such decisions as they occur. Attach a timeframe to this counting process, and you’ll start seeing Poisson distributions everywhere. Let’s explore a concrete example next.
Consider a seller on a platform. In a given month, the seller may or may not list an item for sale (a binary choice). We would only know if she did, because then we would have a measurable count of the event. Nothing stops her from listing another item in the same month. If she does, we count those events too. The total could be zero for an inactive seller or, say, 120 for a highly engaged seller.
Over several months, we would observe a varying number of listed items by this seller — sometimes fewer, sometimes more — hovering around an average monthly listing rate. That is essentially a Poisson process. Once we get to the assumptions section, you’ll see what we had to assume away to make this example work.
Other examples
Other phenomena that can be modelled with a Poisson distribution include:
- Sports analytics: The number of goals scored in a match between two teams.
- Queuing: Customers arriving at a help desk, or incoming customer service calls.
- Insurance: The number of claims made within a given period.
Each of these examples warrants further inspection, but for the remainder of this post we’ll use the marketplace example to illustrate the inner workings of the distribution.
The mathy bit
I find opening up the probability mass function (PMF) of a distribution helpful for understanding why things work as they do. The PMF of the Poisson distribution goes like:
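$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$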
Where λ is the rate parameter, and 𝑘 is the realised count of the random variable (𝑘 = 0, 1, 2, 3, … events). Very neat and compact.

Contextualising λ and k: the marketplace example
In the context of our earlier example — a seller listing items on our platform — λ represents the seller’s average monthly listings. As the expected monthly value for this seller, λ orchestrates the number of items she would list in a month. Note that λ is a Greek letter, so read: λ is a parameter that we can estimate from the data. In contrast, 𝑘 doesn’t hold any information about the seller’s idiosyncratic behaviour. It’s the target value we set for the number of events that may occur, so we can learn about its probability.
The dual role of λ as the mean and the variance
When I said that λ orchestrates the number of monthly listings for the seller, I meant it quite literally. Namely, λ is both the expected value and the variance of the distribution, for all values of λ. This means that the mean-to-variance ratio (index of dispersion) is always 1.
To put this into perspective, the normal distribution requires two parameters — 𝜇 and 𝜎², the mean and variance respectively — to fully describe it. The Poisson distribution achieves the same with just one.
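A quick simulation makes this dual role tangible. Here is a minimal sketch (the rate of 4 listings per month is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
lam = 4  # illustrative average monthly listing rate

# Draw many months of listing counts for a single, well-behaved seller
counts = rng.poisson(lam=lam, size=100_000)

print(counts.mean())                 # ~4: the sample mean approaches lambda
print(counts.var())                  # ~4: so does the sample variance
print(counts.var() / counts.mean())  # index of dispersion, ~1
```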
Having to estimate only one parameter can be useful for parametric inference, specifically because it reduces the variance of the model and increases statistical power. However, it can also be too limiting an assumption. Alternatives like the Negative Binomial distribution can alleviate this limitation. We’ll explore that later.
Breaking down the probability mass function
Now that we know the smallest building blocks, let’s zoom out one step: what are λᵏ, 𝑒^⁻λ, and 𝑘!, and more importantly, what is each component’s function in the whole?
- λᵏ is a weight that expresses how likely it is for 𝑘 events to occur, given that the expectation is λ. Note that “likely” here doesn’t mean a probability, yet. It’s merely a signal strength.
- 𝑘! is a combinatorial correction so that we can say that the order of the events is irrelevant. The events are interchangeable.
- 𝑒^⁻λ normalises the PMF so that it sums to 1; it is the normalising constant (the reciprocal of the partition function familiar from exponential-family distributions).
In more detail, λᵏ relates the observed value 𝑘 to the expected value of the random variable, λ. Intuitively, more probability mass lies around the expected value. Hence, if the observed value lies close to the expectation, its probability is larger than the probability of an observation far removed from the expectation. Before we can cross-check this intuition against the numerical behaviour of λᵏ, we need to consider what 𝑘! does.
Interchangeable events
Had we cared about the order of events, then each unique set of 𝑘 events could be ordered in 𝑘! ways. But because we don’t, and we deem the events interchangeable, we “divide out” 𝑘! from λᵏ to correct for the overcounting.
Since λᵏ is an exponential term, its output keeps growing as 𝑘 grows, holding λ constant (for λ > 1). That contradicts our intuition that the probability should be maximal around 𝑘 = λ, since λᵏ alone is larger at 𝑘 = λ + 1 than at 𝑘 = λ. But now that we know about the interchangeable-events assumption — and the overcounting issue — we know that we have to bring in 𝑘! like so: λᵏ 𝑒^⁻λ / 𝑘!, to see the behaviour we expect.
Now let’s check the intuition about the relationship between λ and 𝑘 through λᵏ, corrected by 𝑘!. For the same λ, say λ = 4, we should see λᵏ 𝑒^⁻λ / 𝑘! be smaller for values of 𝑘 that are far removed from 4, compared to values of 𝑘 that lie close to 4. Dropping the common factor 𝑒^⁻⁴: 4²/2! = 8 is smaller than 4⁴/4! ≈ 10.7. That is consistent with the intuition of a higher likelihood for 𝑘 when it lies close to the expectation. The plot below shows this relationship more generally: the output is larger as 𝑘 approaches λ.
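If you want to verify this numerically yourself, here is a minimal sketch that evaluates the three components for λ = 4 over a range of 𝑘 and compares the result with scipy’s built-in PMF:

```python
import math

from scipy import stats

lam = 4  # the rate used in the example above

for k in range(9):
    weight = lam**k                 # signal strength
    correction = math.factorial(k)  # interchangeable-events correction
    normaliser = math.exp(-lam)     # makes the masses sum to 1
    pmf_by_hand = weight * normaliser / correction
    print(k, round(pmf_by_hand, 3), round(stats.poisson.pmf(k, mu=lam), 3))

# The probability mass peaks at k = 3 and k = 4, i.e., near lambda.
```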

The assumptions
First, let’s get one thing off the table: the difference between a Poisson process and the Poisson distribution. The Poisson process is a stochastic, continuous-time model of points occurring in a given interval: in 1D, a line; in 2D, an area; or in higher dimensions. We data scientists most often deal with the one-dimensional case, where the “line” is time and the points are the events of interest — I dare say.
These are the assumptions of the Poisson process:
- The occurrence of one event doesn’t affect the probability of a second event. Think of our seller listing another item tomorrow irrespective of having done so already today, or five days ago for that matter. The point here is that there is no memory between events.
- The average rate at which events occur is independent of any occurrence. In other words, no event that happened (or will happen) alters λ, which remains constant throughout the observed timeframe. In our seller example, this means that listing an item today doesn’t increase or decrease the seller’s motivation or likelihood of listing another item tomorrow.
- Two events cannot occur at the exact same instant. If we were to zoom in at an infinitely granular level on the timescale, no two listings could have been placed simultaneously; always sequentially.
From these assumptions — no memory, constant rate, events happening one at a time — it follows that 1) the number of events in any interval of length 𝑡 is Poisson-distributed with parameter λ𝑡, and 2) disjoint intervals are independent — two key properties of a Poisson process.
The distribution, by contrast, simply describes probabilities for various numbers of counts in an interval. Strictly speaking, one can use the distribution pragmatically whenever the data is nonnegative, can be unbounded on the right, has mean λ, and reasonably models the data. It would just be convenient if the underlying process is a Poisson one, and actually justifies using the distribution.
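To make the process-versus-distribution distinction concrete, here is a minimal sketch (a rate of 4 events per month is again an illustrative choice) that simulates a homogeneous Poisson process through its exponential inter-arrival times and checks that the per-month counts behave like Poisson(λ) draws:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
lam = 4            # illustrative rate: 4 listings per month
n_months = 50_000

# In a Poisson process, waiting times between events are exponential with mean 1/lambda.
# For each simulated month, count how many events fall inside it.
counts = []
for _ in range(n_months):
    t, n = 0.0, 0
    while True:
        t += rng.exponential(scale=1 / lam)  # next inter-arrival time
        if t > 1.0:                          # past the end of the month
            break
        n += 1
    counts.append(n)

counts = np.array(counts)
print(counts.mean(), counts.var())  # both close to lambda = 4, as the distribution dictates
```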
The marketplace example: Implications
So, can we justify using the Poisson distribution for our marketplace example? Let’s open up the assumptions of a Poisson process and put our example to the test.
Constant λ
- The seller has patterned online activity: holidays, promotions, or listings of seasonal goods.
- λ is not constant, leading to overdispersion (mean-to-variance ratio greater than 1) or to temporal patterns.
Independence and memorylessness
- The propensity to list again is higher after a successful listing, or conversely, listing once depletes the stock and reduces the propensity to list again.
- Two events are no longer independent, because the occurrence of one informs the occurrence of the other.
Simultaneous events
- Batch-listing, a new feature, was introduced to help sellers.
- Multiple listings would come online at the same time, clumped together, and they would be counted simultaneously.
Balancing rigour and pragmatism
As data scientists on the job, we may feel trapped between rigour and pragmatism. The three steps below should give you a sound foundation for deciding which side to err on when the Poisson distribution falls short:
- Pinpoint your goal: is it inference, simulation or prediction, and is it about high-stakes output? List the worst thing that can happen, and its cost for the business.
- Identify the problem and solution: why does the Poisson distribution not fit, and what can you do about it? List 2-3 solutions, including changing nothing.
- Balance gains and costs: Will your workaround improve things, or make them worse? And at what cost: interpretability, new assumptions introduced, and resources used. Does it help you achieve your goal?
That said, here are some counters I use when needed.
When real life deviates from your model
Everything described so far pertains to the standard, or homogeneous, Poisson process. But what if reality begs for something different?
In the next sections, we’ll cover two extensions of the Poisson distribution for when the constant-λ assumption doesn’t hold. These are not mutually exclusive, but neither are they the same:
- Time-varying λ: a single seller whose listing rate ramps up before holidays and slows down afterward
- Mixed Poisson distribution: multiple sellers listing items, each with their own λ, can be seen as a mixture of various Poisson processes
Time-varying λ
The first extension allows λ to have its own value for each time 𝑡. The PMF then becomes
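$$P\big(K(T) = k\big) = \frac{\left(\int_{T} \lambda(t)\,dt\right)^{k} e^{-\int_{T} \lambda(t)\,dt}}{k!}$$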

Where the number of events 𝐾(𝑇) in an interval 𝑇 follows the Poisson distribution with a rate no longer equal to a fixed λ, but one equal to:
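$$\int_{t}^{t+i} \lambda(s)\,ds$$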

More intuitively, integrating over the interval 𝑡 to 𝑡 + 𝑖 gives us a single number: the expected number of events over that interval. The integral will vary per arbitrary interval, and that’s what makes λ change over time. To grasp how that integration works, it was helpful for me to think of it like this: if the interval 𝑡 to 𝑡₁ integrates to 3, and 𝑡₁ to 𝑡₂ integrates to 5, then the interval 𝑡 to 𝑡₂ integrates to 8 = 3 + 5. That’s the two expectations summed up, now the expectation of the entire interval.
Practical implication
One may want to model the expected value of the Poisson distribution as a function of time, for instance to capture an overall trend or seasonality. In generative model notation:
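A minimal sketch of what that could look like, assuming a log-linear trend with a monthly seasonal term (the specific functional form is illustrative, not prescriptive):

$$\lambda(t) = \exp\big(\beta_0 + \beta_1 t + \beta_2 \sin(2\pi t / 12)\big), \qquad X_t \sim \text{Poisson}\big(\lambda(t)\big)$$

The exponential link keeps λ(𝑡) positive while letting the linear predictor roam freely.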

Time may be a continuous variable, or an arbitrary function of it.
Process-varying λ: Mixed Poisson distribution
But then there’s a gotcha. Remember when I said that λ has a dual role as the mean and variance? That also applies here. In the “relaxed” PMF above, the only thing that changes is that λ can vary freely with time. But it’s still the one and only λ that orchestrates both the expected value and the dispersion of the PMF. More precisely, 𝔼[𝑋] = Var(𝑋) still holds.
There are many reasons for this constraint not to hold in reality. Model misspecification, event interdependence and unaccounted-for heterogeneity could be the issues at hand. I’d like to focus on the latter case, as it motivates the Negative Binomial distribution — one of the topics I promised to open up.
Heterogeneity and overdispersion
Imagine we are not dealing with one seller, but with 10 of them listing at different intensity levels λᵢ, where 𝑖 = 1, 2, 3, …, 10. Then, essentially, we have 10 Poisson processes going on. If we unify the processes and estimate a grand λ, we simplify the mixture away. Meaning, we get a correct estimate of all sellers on average, but the resulting grand λ is naive and doesn’t know about the original spread of λᵢ. It still assumes that the variance and mean are equal, as per the axioms of the distribution. This leads to overdispersion and, in turn, to underestimated errors. Ultimately, it inflates the false-positive rate and drives poor decision-making.
Negative binomial: Extending the Poisson distribution
Among the few ways one can look at the Negative Binomial distribution, one way is to see it as a mixture of Poisson processes — 10 sellers, sounds familiar yet? That means counts from multiple Poisson processes, each with its own rate, are pooled into a single distribution. Mathematically, first we draw λ from a Gamma distribution: λ ~ Γ(r, θ), then we draw the count 𝑋 | λ ~ Poisson(λ).
In a single image, it’s as if we were sampling from plenty of Poisson distributions, one corresponding to each seller.
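Assuming Γ(r, θ) denotes the shape–scale parameterisation, the law of total variance shows exactly where the extra dispersion comes from:

$$\mathbb{E}[X] = \mathbb{E}[\lambda] = r\theta, \qquad \mathrm{Var}(X) = \mathbb{E}\big[\mathrm{Var}(X \mid \lambda)\big] + \mathrm{Var}\big(\mathbb{E}[X \mid \lambda]\big) = r\theta + r\theta^{2} > \mathbb{E}[X]$$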

The more revealing alias of the Negative Binomial distribution is the Gamma-Poisson mixture distribution, and now we know why: the dictating λ comes from a continuous mixture. That’s what we needed in order to explain the heterogeneity among sellers.
Let’s simulate this scenario to gain more intuition.

First, we draw λᵢ from a Gamma distribution: λᵢ ~ Γ(r, θ). Intuitively, the Gamma distribution tells us about the variety in intensity — the listing rate — among the sellers.
On a practical note, one can instill one’s assumptions about the degree of heterogeneity in this step of the model: how different are the sellers? By varying the degree of heterogeneity, one can observe the impact on the final Poisson-like distribution. Doing this kind of check (e.g., a prior or posterior predictive check) is common in Bayesian modelling, where the assumptions are set explicitly. A sketch of this first step follows below.
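A minimal sketch of this first step, with illustrative shape and scale values (numpy’s Gamma sampler uses the shape–scale parameterisation):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

r, theta = 2.0, 2.0  # illustrative shape and scale: average rate r * theta = 4
n_sellers = 10

# Step 1: each seller gets their own listing rate, drawn from the Gamma mixing distribution
lambdas = rng.gamma(shape=r, scale=theta, size=n_sellers)
print(np.round(lambdas, 2))  # heterogeneous rates, spread around 4
```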

In the second step, we plug the obtained λ into the Poisson distribution: 𝑋 | λ ~ Poisson(λ), and obtain a Poisson-like distribution that represents the pooled subprocesses. Notably, this unified process has a larger dispersion than expected from a homogeneous Poisson distribution, but it is consistent with the Gamma mixture of λ.
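Continuing the sketch, the second step draws monthly counts for every seller and pools them, so we can compare the dispersion against a single homogeneous Poisson (again, all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
r, theta, n_sellers, n_months = 2.0, 2.0, 10, 10_000

# Step 1: heterogeneous rates; Step 2: monthly counts per seller, pooled together
lambdas = rng.gamma(shape=r, scale=theta, size=n_sellers)
counts = rng.poisson(lam=lambdas, size=(n_months, n_sellers)).ravel()

print(counts.var() / counts.mean())  # index of dispersion well above 1: overdispersion

# Contrast with a homogeneous Poisson at the pooled average rate
homogeneous = rng.poisson(lam=lambdas.mean(), size=counts.size)
print(homogeneous.var() / homogeneous.mean())  # ~1, as the Poisson axioms dictate
```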
Heterogeneous λ and inference
A practical consequence of introducing flexibility into your assumed distribution is that inference becomes more difficult. More parameters (i.e., the Gamma parameters) need to be estimated. Parameters act as flexible explainers of the data, tending to overfit and explain away variance in your variable. The more parameters you have, the better the explanation may seem, but the model also becomes more susceptible to noise in the data. Higher variance reduces the power to detect a difference in means, if one exists, because — well — it gets lost in the variance.
Countering the loss of power
- Confirm whether you indeed need to extend the standard Poisson distribution. If not, simplify to the simplest, best-fitting model. A quick check on overdispersion may suffice for this (see the sketch after this list).
- Pin down the estimates of the Gamma mixture distribution parameters using regularising, informative priors (think: Bayes).
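For the first counter, here is a minimal sketch of such an overdispersion check (the classic dispersion test based on a chi-square statistic; the observed counts are made up for illustration):

```python
import numpy as np
from scipy import stats

def dispersion_check(counts):
    """Index of dispersion and an approximate p-value under the Poisson null."""
    counts = np.asarray(counts)
    dispersion = counts.var(ddof=1) / counts.mean()
    # Under H0 (Poisson), (n - 1) * dispersion approximately follows chi-square with n - 1 df
    statistic = (counts.size - 1) * dispersion
    p_value = stats.chi2.sf(statistic, df=counts.size - 1)
    return dispersion, p_value

monthly_listings = [3, 5, 4, 9, 0, 7, 2, 11, 6, 8, 1, 10]  # hypothetical observations
print(dispersion_check(monthly_listings))
```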
During my research process for writing this blog, I learned a great deal about the connective tissue underlying all of this: how the binomial distribution plays a fundamental role in the processes we’ve discussed. And while I’d love to ramble on about this, I’ll save it for another post, perhaps. In the meantime, feel free to share your understanding in the comments section below 👍.
Conclusion
The Poisson distribution is a simple distribution that can be highly suitable for modelling count data. However, when its assumptions don’t hold, one can extend the distribution by allowing the rate parameter to vary as a function of time or other factors, or by assuming subprocesses that collectively make up the count data. This added flexibility can address the limitations, but it comes at a price: increased flexibility in your modelling raises the variance and, consequently, undermines the statistical power of your model.
If your end goal is inference, you may want to think twice and consider exploring simpler models for the data. Alternatively, switch to the Bayesian paradigm and leverage its built-in way to regularise estimates: informative priors.
I hope this has given you what you came here for — a better intuition about the Poisson distribution. I’d love to hear your thoughts in the comments!