Building a Monitoring System That Actually Works

When building and managing products, it’s crucial to make sure they’re performing as expected and that everything is running smoothly. We typically rely on metrics to gauge the health of our products. And many factors can influence our KPIs, from internal changes such as UI updates, pricing adjustments, or incidents to external factors like competitor actions or seasonal trends. That’s why it’s important to continuously monitor your KPIs so you can respond quickly when something goes off course. Otherwise, it might take several weeks to realize that your product was completely broken for 5% of customers or that conversion dropped by 10 percentage points after the last release.

To achieve this visibility, we create dashboards with key metrics. But let’s be honest, dashboards that nobody actively monitors offer little value. We either need people continuously watching dozens or even hundreds of metrics, or we need an automated alerting and monitoring system. And I strongly prefer the latter. So, in this article, I’ll walk you through a practical approach to building an effective monitoring system for your KPIs. You’ll learn about different monitoring approaches, how to build your first statistical monitoring system, and what challenges you’ll likely encounter when deploying it in production.

Setting up monitoring

Let’s start with the big picture of how to architect your monitoring system, then we’ll dive into the technical details. There are a few key decisions you need to make when setting up monitoring:

  • Sensitivity. You need to find the right balance between missing important anomalies (false negatives) and getting bombarded with false alerts 100 times a day (false positives). We’ll talk about what levers you have to adjust this later on.
  • Dimensions. The segments you choose to monitor also affect your sensitivity. If there’s an issue in a small segment (like a specific browser or country), your system is much more likely to catch it if you’re monitoring that segment’s metrics directly. But here’s the catch: the more segments you monitor, the more false positives you’ll have to deal with, so you need to find the sweet spot.
  • Time granularity. If you have plenty of data and can’t afford delays, it may be worth looking at minute-by-minute data. If you don’t have enough data, you can aggregate it into 5–15 minute buckets and monitor those instead (see the short sketch after this list). Either way, it’s always a good idea to have higher-level daily, weekly, or monthly monitoring alongside your real-time monitoring to keep an eye on longer-term trends.
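
As a quick illustration of that bucketing, here’s a minimal pandas sketch (the dataframe and column names are assumptions, not part of the dataset used later in this article):

# a minimal sketch: aggregating raw events into 5-minute buckets
import pandas as pd

# hypothetical raw events, one row per event
events = pd.DataFrame({'event_time': pd.date_range('2025-07-01', periods=1000, freq='s')})
per_bucket = events.set_index('event_time').resample('5min').size().rename('n_events').reset_index()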

However, monitoring isn’t just about the technical solution. It’s also about the processes you have in place:

  • You need someone who’s responsible for monitoring and responding to alerts. We used to handle this with an on-call rotation in my team, where each week one person would be in charge of reviewing all the alerts.
  • Beyond automated monitoring, it’s worth doing some manual checks too. You can set up TV displays in the office, or at the very least, have a process where someone (like the on-call person) reviews the metrics once a day or week.
  • You need to establish feedback loops. When you’re reviewing alerts and looking back at incidents you might have missed, take the time to fine-tune your monitoring system’s settings.
  • The value of a change log (a record of all changes affecting your KPIs) can’t be overstated. It helps you and your team always have context about what happened to your KPIs and when. Plus, it gives you a valuable dataset for evaluating the real impact on your monitoring system when you make changes (like determining what percentage of past anomalies your new setup would actually catch).

Now that we’ve covered the high-level picture, let’s move on and dig into the technical details of how to actually detect anomalies in time series data.

Frameworks for monitoring 

There are various out-of-the-box frameworks you can use for monitoring. I’d break them down into two main groups.

The first group involves making a forecast with confidence intervals. Here are some options:

  • You can use statsmodels and its classical implementation of ARIMA-like models for time series forecasting.
  • Another option that typically works well out of the box is Prophet by Meta. It’s a simple additive model that returns uncertainty intervals (see the sketch after this list).
  • There’s also GluonTS, a deep learning-based forecasting framework from AWS.
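
To give a feel for the first group, here’s a minimal sketch of the Prophet option mentioned above; the synthetic data and the 99% interval width are illustrative assumptions, not recommendations:

# a minimal sketch: forecasting with uncertainty intervals using Prophet
import numpy as np
import pandas as pd
from prophet import Prophet

# synthetic minute-level history, just for illustration
ds = pd.date_range('2025-07-01', periods=7 * 24 * 60, freq='min')
y = 100 + 20 * np.sin(2 * np.pi * ds.hour / 24) + np.random.normal(0, 5, len(ds))
history = pd.DataFrame({'ds': ds, 'y': y})

model = Prophet(interval_width=0.99)  # width of the returned uncertainty band
model.fit(history)

future = model.make_future_dataframe(periods=60, freq='min')  # forecast the next hour
forecast = model.predict(future)[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
# an observed value outside [yhat_lower, yhat_upper] is a candidate anomaly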

The second group focuses on anomaly detection, and here are some popular libraries:

  • PyOD: The most popular Python outlier/anomaly detection toolbox, with 50+ algorithms (including time series and deep learning methods).
  • ADTK (Anomaly Detection Toolkit): Built for unsupervised/rule-based time series anomaly detection with easy integration into pandas dataframes.
  • Merlion: Combines forecasting and anomaly detection for time series using each classical and ML approaches.

I’ve only mentioned a few examples here; there are far more libraries out there. You can absolutely try them out with your data and see how they perform. However, I want to share a much simpler approach to monitoring that I often start with. Though it’s so simple that you can implement it with a single SQL query, it works surprisingly well in many cases. Another significant advantage of this simplicity is that you can implement it in pretty much any tool, whereas deploying more complex ML approaches can be tricky in some systems.

Statistical approach to monitoring

The core idea behind monitoring is simple: use historical data to construct a confidence interval (CI) and detect when current metrics fall outside of expected behaviour. We estimate this confidence interval using the mean and standard deviation of past data. It’s just basic statistics.

\[
\text{Confidence Interval} = (\text{mean} - \text{coef}_1 \times \text{std},\; \text{mean} + \text{coef}_2 \times \text{std})
\]

Image by author

However, the effectiveness of this approach depends on several key parameters, and the choices you make here will significantly impact the accuracy of your alerts.

The first decision is how to define the data sample used to calculate your statistics. Typically, we compare the current metric to the same time period on previous days. This involves two main components:

  • Time window: I often take a window of ±10–30 minutes around the current timestamp to account for short-term fluctuations.
  • Historical days: I prefer using the same weekday over the past 3–5 weeks. This approach accounts for weekly seasonality, which is usually present in business data. However, depending on your seasonality patterns, you might choose a different approach (for example, splitting days into two groups: weekdays and weekends).

Another important parameter is the choice of coefficient used to set the width of the confidence interval. I often use three standard deviations, since that covers 99.7% of observations for distributions close to normal.
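
If you want to sanity-check how the coefficient maps to coverage under a normal distribution, here’s a tiny sketch using scipy:

# coverage of a symmetric ±k·std interval under a normal distribution
from scipy.stats import norm

for k in (2, 3, 5):
  coverage = norm.cdf(k) - norm.cdf(-k)
  print(f'±{k} std covers {coverage:.4%} of observations')
# ±2 std ≈ 95.45%, ±3 std ≈ 99.73%, ±5 std ≈ 99.99994%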

As you can see, there are several decisions to make, and there’s no one-size-fits-all answer. The most reliable way to determine optimal settings is to experiment with different configurations using your own data and select the one that delivers the best performance for your use case. So this is a good moment to put the approach into action and see how it performs on real data.

Example: monitoring the number of taxi rides

To test this out, we’ll use the popular NYC Taxi Data dataset (). I loaded data from May to July 2025 and focused on rides related to high-volume for-hire vehicles. Since we have hundreds of trips every minute, we can use minute-by-minute data for monitoring.

Image by author

Building the first version

So, let’s try our approach and build confidence intervals based on real data. I started with a default set of key parameters:

  • A time window of ±15 minutes around the current timestamp,
  • Data from the current day plus the same weekday from the previous three weeks,
  • A confidence band defined as ±3 standard deviations.

Now, let’s create a couple of functions with the business logic to calculate the confidence interval and check whether our value falls outside of it.

import pandas as pd
import tqdm

# df is the minute-level dataframe with 'pickup_datetime' and 'values' columns

# returns the dataset of historic data (same time window on the same weekday over previous weeks)
def get_distribution_for_ci(param, ts, n_weeks=3, n_mins=15):
  tmp_df = df[['pickup_datetime', param]].rename(columns={param: 'value', 'pickup_datetime': 'dt'})
  
  tmp = [] 
  for n in range(n_weeks + 1):
    lower_bound = (pd.to_datetime(ts) - pd.Timedelta(weeks=n, minutes=n_mins)).strftime('%Y-%m-%d %H:%M:%S')
    upper_bound = (pd.to_datetime(ts) - pd.Timedelta(weeks=n, minutes=-n_mins)).strftime('%Y-%m-%d %H:%M:%S')
    tmp.append(tmp_df[(tmp_df.dt >= lower_bound) & (tmp_df.dt <= upper_bound)])

  base_df = pd.concat(tmp)
  base_df = base_df[base_df.dt < ts]
  return base_df

# calculates mean and std needed to calculate confidence intervals
def get_ci_statistics(param, ts, n_weeks=3, n_mins=15):
  base_df = get_distribution_for_ci(param, ts, n_weeks, n_mins)
  std = base_df.value.std()
  mean = base_df.value.mean()
  return mean, std

# iterating through all the timestamps in historic data
ci_tmp = []
for ts in tqdm.tqdm(df.pickup_datetime):
  ci = get_ci_statistics('values', ts, n_weeks=3, n_mins=15)
  ci_tmp.append(
    {
        'pickup_datetime': ts,
        'mean': ci[0],
        'std': ci[1],
    }
  )

ci_df = df[['pickup_datetime', 'values']].copy()
ci_df = ci_df.merge(pd.DataFrame(ci_tmp), how='left', on='pickup_datetime')

# defining CI
ci_df['ci_lower'] = ci_df['mean'] - 3 * ci_df['std']
ci_df['ci_upper'] = ci_df['mean'] + 3 * ci_df['std']

# defining whether value is outside of CI
ci_df['outside_of_ci'] = (ci_df['values'] < ci_df['ci_lower']) | (ci_df['values'] > ci_df['ci_upper'])

Analysing results

Let’s take a look at the results. First, we’re seeing quite a few false positive triggers (one-off points outside the CI that appear to be due to normal variability).

Image by author

There are two ways we can adjust our algorithm to account for this:

  • The CI doesn’t have to be symmetric. We may be less concerned about increases in the number of trips, so we could use a higher coefficient for the upper bound (for example, use 5 instead of 3).
  • The data is quite volatile, so there will be occasional cases where a single point falls outside the confidence interval. To reduce such false positive alerts, we can use more robust logic and only trigger an alert when multiple points are outside the CI (for example, at least 4 out of the last 5 points, or 8 out of 10).

However, there’s another potential problem with our current CIs. As you can see, there are quite a few cases where the CI is excessively wide. This looks off and will reduce the sensitivity of our monitoring.

Let’s look at one example to understand why this happens. The distribution we’re using to estimate the CI at this point is bimodal, which leads to a higher standard deviation and a wider CI. That’s because the number of trips on the evening of July 14th was significantly higher than in other weeks.

Image by author
Image by author

So we’ve encountered an anomaly in the past that’s affecting our confidence intervals. There are two ways to handle this issue:

  • If we’re doing continuous monitoring, we know there was anomalously high demand on July 14th, and we can exclude these periods when constructing our CIs. This approach requires some discipline to track these anomalies, but it pays off with more accurate results.
  • However, there’s always a quick-and-dirty approach too: we can simply drop or cap outliers when constructing the CI.

Improving the accuracy

So after the first iteration, we identified several potential improvements for our monitoring approach:

  • Use a higher coefficient for the upper bound since we care less about increases. I used 6 standard deviations instead of 3.
  • Deal with outliers to filter out past anomalies. I experimented with removing or capping the top 10–20% of outliers and found that capping at 20%, alongside increasing the period to 5 weeks, worked best in practice.
  • Raise an alert only when 4 out of the last 5 points are outside the CI, to reduce the number of false positive alerts caused by normal volatility.

Let’s see how this looks in code. We’ve updated the logic in get_ci_statistics to account for different strategies for handling outliers.

def get_ci_statistics(param, ts, n_weeks=3, n_mins=15, filter_outliers_strategy='none',
                      filter_outliers_perc=None):
  assert filter_outliers_strategy in ['none', 'clip', 'remove'], "filter_outliers_strategy must be one of 'none', 'clip', 'remove'"
  base_df = get_distribution_for_ci(param, ts, n_weeks, n_mins)
  if filter_outliers_strategy != 'none':
    # cap or remove the most extreme values so past anomalies don't inflate the CI
    p_upper = base_df.value.quantile(1 - filter_outliers_perc)
    p_lower = base_df.value.quantile(filter_outliers_perc)
    if filter_outliers_strategy == 'clip':
      base_df['value'] = base_df['value'].clip(lower=p_lower, upper=p_upper)
    if filter_outliers_strategy == 'remove':
      base_df = base_df[(base_df.value >= p_lower) & (base_df.value <= p_upper)]
  std = base_df.value.std()
  mean = base_df.value.mean()
  return mean, std

We also need to update the way we define the outside_of_ci flag.

# collect timestamps where at least 4 of the last 5 points fall outside the CI
anomalies = []
for ts in tqdm.tqdm(ci_df.pickup_datetime):
  tmp_df = ci_df[(ci_df.pickup_datetime <= ts)].tail(5).copy()
  tmp_df = tmp_df[~tmp_df.ci_lower.isna() & ~tmp_df.ci_upper.isna()]
  if tmp_df.shape[0] < 5:
    continue
  tmp_df['outside_of_ci'] = (tmp_df['values'] < tmp_df['ci_lower']) | (tmp_df['values'] > tmp_df['ci_upper'])
  if tmp_df.outside_of_ci.map(int).sum() >= 4:
    anomalies.append(ts)

ci_df['outside_of_ci'] = ci_df.pickup_datetime.isin(anomalies)

We can see that the CI is now significantly narrower (no more anomalously wide CIs), and we’re also getting far fewer alerts since we increased the upper bound coefficient.

Image by author

Let’s investigate the two alerts we found. These two alerts from the last two weeks look plausible when we compare the traffic to previous weeks.

Image by author

Image by author

So our new monitoring approach makes total sense. However, there’s a drawback: by only looking for cases where 4 out of 5 minutes fall outside the CI, we’re delaying alerts in situations where everything is completely broken. To deal with this problem, you can actually use two CIs:

  • Doomsday CI: A broad confidence interval where even a single point falling outside means it’s time to panic.
  • Incident CI: The one we built earlier, where we’d wait 5–10 minutes before triggering an alert, since the drop in the metric isn’t as critical.

Let’s define two CIs for our case.
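
Here’s a minimal sketch of how the two bands could be defined on top of the ci_df dataframe we built earlier; the doomsday coefficients are illustrative assumptions:

# two confidence bands: a wide "doomsday" CI (alert on a single breach)
# and the narrower "incident" CI with the 4-out-of-5 logic defined above
doomsday_coef_lower, doomsday_coef_upper = 6, 10  # illustrative values

ci_df['doomsday_lower'] = ci_df['mean'] - doomsday_coef_lower * ci_df['std']
ci_df['doomsday_upper'] = ci_df['mean'] + doomsday_coef_upper * ci_df['std']

# a single point outside the doomsday band triggers an immediate alert
ci_df['doomsday_alert'] = (ci_df['values'] < ci_df['doomsday_lower']) | (ci_df['values'] > ci_df['doomsday_upper'])

# the incident band keeps the robust 4-out-of-5 flag we computed earlier
ci_df['incident_alert'] = ci_df['outside_of_ci']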

Image by author

It’s a balanced approach that gives us the best of both worlds: we can react quickly when something is completely broken while still keeping false positives under control. With that, we’ve achieved a good result and we’re ready to move on.

Testing our monitoring on anomalies

We’ve confirmed that our approach works well for business-as-usual cases. However, it’s also worth doing some stress testing by simulating anomalies we want to catch and checking how the monitoring performs. In practice, it’s worth testing against previously known anomalies to see how it would handle real-world examples.

In our case, we don’t have a change log of previous anomalies, so I simulated a 20% drop in the number of trips, and our approach caught it immediately.
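
If you want to reproduce this kind of stress test, here’s a sketch on top of ci_df; the 30-minute window and the exact way of injecting the drop are assumptions:

# simulate an incident: a 20% drop in trips over the most recent 30 minutes
stressed_df = ci_df.copy()
incident_mask = stressed_df.pickup_datetime >= stressed_df.pickup_datetime.max() - pd.Timedelta(minutes=30)
stressed_df['values'] = stressed_df['values'].where(~incident_mask, stressed_df['values'] * 0.8)

# re-apply the confidence-interval check on the simulated data
stressed_df['outside_of_ci'] = (stressed_df['values'] < stressed_df['ci_lower']) | (stressed_df['values'] > stressed_df['ci_upper'])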

Image by author

These sorts of step changes can be tricky in real life. Imagine we lost one of our partners, and that lower level becomes the new normal for the metric. In that case, it’s worth adjusting our monitoring as well. If it’s possible to recalculate the historical metric based on the current state (for example, by filtering out the lost partner), that would be ideal since it would bring the monitoring back to normal. If that’s not feasible, we can either adjust the historical data (say, subtract 20% of traffic as our estimate of the change) or drop all data from before the change and use only the new data to build the CI.
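
Here’s a rough sketch of the two fallback options on our dataframe; the change date and the 20% estimate are hypothetical:

# option 1: scale the pre-change history down by our estimate of the step change
change_ts = pd.Timestamp('2025-07-20 18:00:00')  # hypothetical date of the step change
adjusted_df = df.copy()
# keep values after the change as-is, scale everything before it by the assumed 20% loss
adjusted_df['values'] = adjusted_df['values'].where(adjusted_df.pickup_datetime >= change_ts, adjusted_df['values'] * 0.8)

# option 2: drop everything before the change and build the CI on new data only
truncated_df = df[df.pickup_datetime >= change_ts].copy()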

Image by author

Let’s look at another tricky real-world example: gradual decay. If your metric is slowly dropping day after day, it likely won’t be caught by our real-time monitoring since the CI will be shifting along with it. To catch situations like this, it’s worth having less granular monitoring (like daily, weekly, or even monthly).
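
A minimal weekly-level sketch of such coarser monitoring (the 5% threshold is an arbitrary illustration):

# weekly totals: a slow decay shows up as consistently negative week-over-week changes
weekly = df.set_index('pickup_datetime')['values'].resample('1W').sum()
weekly_change = weekly.pct_change()

# e.g. flag weeks where the metric dropped by more than 5% versus the previous week
weekly_alerts = weekly_change[weekly_change < -0.05]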

Image by author

Operational challenges

We’ve discussed the maths behind alerting and monitoring systems. However, there are several other nuances you’ll likely encounter once you start deploying your system in production. So I’d like to cover these before wrapping up.

Lagging data. We don’t face this problem in our example since we’re working with historical data, but in real life, you’ll have to deal with data lags. It often takes a while for data to reach your data warehouse. So you need to learn how to distinguish between cases where data hasn’t arrived yet and actual incidents affecting the customer experience. The most straightforward approach is to look at historical data, identify the typical lag, and filter out the last 5–10 data points.
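
A tiny sketch of that filtering, assuming we’ve estimated from historical data that the last ~10 minutes are typically incomplete:

# ignore the most recent points, which may still be incomplete due to ingestion lag
typical_lag_minutes = 10  # assumption estimated from historical data

latest_reliable_ts = ci_df.pickup_datetime.max() - pd.Timedelta(minutes=typical_lag_minutes)
monitored_df = ci_df[ci_df.pickup_datetime <= latest_reliable_ts]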

Different sensitivity for different segments. You’ll likely want to monitor not only the main KPI (the number of trips), but also break it down by multiple segments (like partners, areas, etc.). Adding more segments is always helpful because it helps you spot smaller changes in specific segments (for example, that there’s an issue in Manhattan). However, as I mentioned above, there’s a downside: more segments mean more false positive alerts to deal with. To keep this under control, you can use different sensitivity levels for different segments (say, 3 standard deviations for the main KPI and 5 for segments).
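
One way to keep these settings manageable is a small configuration mapping; the segment names and coefficients below are purely illustrative:

# per-metric sensitivity settings (illustrative values)
ALERT_CONFIG = {
  'total_trips':       {'coef_lower': 3, 'coef_upper': 6, 'min_points_outside': 4},
  'segment:manhattan': {'coef_lower': 5, 'coef_upper': 8, 'min_points_outside': 4},
  'segment:partner_a': {'coef_lower': 5, 'coef_upper': 8, 'min_points_outside': 5},
}

def get_bounds(metric_name, mean, std):
  # build the confidence interval using this metric's sensitivity settings
  cfg = ALERT_CONFIG[metric_name]
  return mean - cfg['coef_lower'] * std, mean + cfg['coef_upper'] * std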

Smarter alerting system. Also, when you’re monitoring many segments, it’s worth making your alerting a bit smarter. Say you have monitoring for the main KPI and 99 segments. Now, imagine there’s a global outage and the number of trips drops everywhere. Within the next 5 minutes, you’ll (hopefully) get 100 notifications that something is broken. That’s not a great experience. To avoid this situation, I’d build logic to filter out redundant notifications, for example (see the sketch after this list):

  • If we received the same notification within the last 3 hours, don’t fire another alert.
  • If there’s a notification about a drop in the main KPI plus more than 3 segments, only alert about the main KPI change.
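
Here’s a rough sketch of how that suppression logic could look; the data structures and metric names are assumptions:

# notification suppression: skip duplicates and collapse global outages into one alert
from datetime import datetime, timedelta

recent_alerts = {}  # metric name -> timestamp of the last notification sent
SUPPRESSION_WINDOW = timedelta(hours=3)

def select_notifications(triggered_metrics, now=None):
  # triggered_metrics: names of metrics that breached their CI in this run
  now = now or datetime.now()

  # rule 2: if the main KPI plus more than 3 segments fired, only alert on the main KPI
  segments = [m for m in triggered_metrics if m != 'main_kpi']
  if 'main_kpi' in triggered_metrics and len(segments) > 3:
    triggered_metrics = ['main_kpi']

  to_send = []
  for metric in triggered_metrics:
    # rule 1: skip if we already notified about this metric within the last 3 hours
    last_sent = recent_alerts.get(metric)
    if last_sent is not None and now - last_sent < SUPPRESSION_WINDOW:
      continue
    recent_alerts[metric] = now
    to_send.append(metric)
  return to_send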

Overall, alert fatigue is real, so it’s worth minimising the noise.

And that’s it! We’ve covered the entire alerting and monitoring topic, and hopefully, you’re now fully equipped to set up your own system.

Summary

We’ve covered a lot of ground on alerting and monitoring. Let me wrap it up with a step-by-step guide on how to start monitoring your KPIs.

  • The first step is to collect a change log of past anomalies. You can use this both as a set of test cases for your system and to filter out anomalous periods when calculating CIs.
  • Next, build a prototype and run it on historical data. I’d start with the highest-level KPI, try out several possible configurations, and see how well it catches previous anomalies and whether it generates a lot of false alerts. At this point, you should have a viable solution.
  • Then try it out in production, since that’s where you’ll have to deal with data lags and see how the monitoring actually performs in practice. Run it for 2–4 weeks and tweak the parameters to make sure it’s working as expected.
  • After that, share the monitoring with your colleagues and start expanding the scope to include other segments. Don’t forget to keep adding all anomalies to the change log and to establish feedback loops to improve your system continuously.

And that’s it! Now you can rest easy knowing that automation is keeping an eye on your KPIs (but still check in on them occasionally, just in case).
