The Product Health Rating: How I Reduced Critical Incidents by 35% with Unified Monitoring and n8n Automation


For SaaS (software as a service) companies, monitoring and managing product data is crucial. Those who fail to grasp this notice an incident only once the damage is already done. For struggling companies, this can be fatal.

To prevent this, I built an n8n workflow linked to their database that analyzes the data every day and flags any incident. When one is found, a logging and notification system kicks off the investigation as soon as possible. I also built a dashboard so the team could see the results in real time.

Image by Yassin Zehar

Context

A B2B SaaS platform specializing in data visualization and automated reporting serves roughly 4500 customers, spread across three segments:

  • Small business
  • Mid-market
  • Enterprise

Weekly product usage exceeds 30000 active accounts, with strong dependencies on real-time data (pipelines, APIs, dashboards, background jobs).

The product team works closely with:

  • Growth (acquisition, activation, onboarding)
  • Revenue (pricing, ARPU, churn)
  • SRE/Infrastructure (reliability, availability)
  • Data Engineering (pipelines, data freshness)
  • Support and customer success
Image by Yassin Zehar

Over the last 12 months, the company observed a rising number of incidents. Between October and December, the total number of incidents increased from 250 to 450, an 80% increase. Within this increase, there were more than 45 High and Critical incidents that affected hundreds of users. The most affected metrics were:

  • api_error_rate
  • checkout_success_rate
  • net_mrr_delta
  • data_freshness_lag_minutes
  • churn_rate
Image by Yassin Zehar. For illustration

When incidents occur, a company is judged by its customers on how it handles and reacts to them, while the product team is judged on how it managed the incident and ensured it won’t occur again.

Having an incident once can happen, but having the same incident twice is a fault.

Business impact

  • More volatility in net recurring revenue
  • A noticeable decline in active accounts over several consecutive weeks
  • Multiple enterprise customers reporting and complaining about outdated dashboards (up to 45+ minutes late)
Image by Yassin Zehar

In total, between 30000 and 60000 users were impacted. Customer confidence in product reliability also suffered. Among non-renewals, 45% cited this as their main reason.

Why is this issue critical?

As a data platform, the company cannot afford to have:

  • Slow or stale data
  • API errors
  • Pipeline failures
  • Missed or delayed synchronizations
  • Inaccurate dashboards
  • Churn (downgrades, cancellations)
Image by Yassin Zehar

Internally, the incidents were spread across several systems:

  • Notion for product tracking
  • Slack for alerts
  • PostgreSQL for storage
  • Even Google Sheets for customer support
Image by Yassin Zehar

There was no single source of truth. The product team had to manually cross-reference and double-check all data, looking for trends and piecing them together. It was like investigating and solving a puzzle, and it cost them many hours every week.

Solution: automating an incident alert system with n8n and building a data dashboard, so that incidents are detected, tracked, resolved and understood.

Why n8n?

There are currently several automation platforms and solutions, but not all of them match the needs and requirements. Choosing the right one for the need is crucial.

The specific requirements were: direct access to a database without needing an API (n8n also supports APIs), visual workflows and nodes that a non-technical person can understand, custom-coded nodes, self-hosting options and cost-effectiveness at scale. So, among existing platforms like Zapier, Make or n8n, the choice fell on the last.

Designing the Product Health Rating

Image by Yassin Zehar

First, the key metrics have to be determined and calculated.

Impact score: simple function of severity + delta + scale of users

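import numpy as np

# severity_weights maps a severity label to a numeric weight (defined elsewhere)
impact_score = (
    severity_weights[severity] * 10   # severity term
    + abs(delta_pct) * 0.8            # size of the deviation from baseline
    + np.log1p(affected_users)        # log-scaled number of affected users
)
impact_score = round(float(impact_score), 2)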
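
To make the formula above concrete, here is a minimal, self-contained sketch. The values in `SEVERITY_WEIGHTS` and the function name `compute_impact_score` are assumptions for illustration only; the article does not show the real weights. It uses `math.log1p` in place of `np.log1p`, which gives the same result for a single number.

```python
import math

# Hypothetical weights: the article does not list the real values.
SEVERITY_WEIGHTS = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def compute_impact_score(severity: str, delta_pct: float, affected_users: int) -> float:
    """Severity term + deviation term + log-scaled user count, rounded to 2 decimals."""
    score = (
        SEVERITY_WEIGHTS[severity] * 10
        + abs(delta_pct) * 0.8
        + math.log1p(affected_users)
    )
    return round(score, 2)


# A "high" incident with a 30% deviation and 1000 affected users
example_score = compute_impact_score("high", 30.0, 1000)
print(example_score)
```

Because the user count goes through a logarithm, moving from 1000 to 50000 affected users adds only a few points, so under these assumed weights severity and deviation dominate the score.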

Priority: derived from severity + impact

if severity == "critical" or impact_score > 60:
    priority = "P1"
elif severity == "high" or impact_score > 40:
    priority = "P2"
elif severity == "medium":
    priority = "P3"
else:
    priority = "P4"
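
The same rule can be wrapped in a small function to check how the thresholds behave. This is only a sketch of the logic shown above; the function name `derive_priority` is an assumption.

```python
def derive_priority(severity: str, impact_score: float) -> str:
    """Map severity and impact score to a priority, using the thresholds above."""
    if severity == "critical" or impact_score > 60:
        return "P1"
    if severity == "high" or impact_score > 40:
        return "P2"
    if severity == "medium":
        return "P3"
    return "P4"


# A low-severity incident with a large impact score is still escalated
escalated = derive_priority("low", 45.0)
routine = derive_priority("low", 10.0)
print(escalated, routine)
```

Note that the impact score can raise the priority of an incident but never lower it: a "critical" incident is always P1, whatever its score.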

Product health rating

from math import log

def compute_product_health_score(incidents, metrics):
    """
    Rating = 100 - sum(penalties)
    Production version handles 15+ factors
    """
    # incident_rate, users and the calculate_* helpers are derived from
    # `incidents` and `metrics` in the full version (not shown here)
    # Key insight: penalties have different max weights
    penalties = {
        'volume': min(40, incident_rate * 13),            # 40% max
        'severity': calculate_severity_sum(incidents),    # 25% max
        'users': min(15, log(users) / log(50000) * 15),   # 15% max
        'trends': calculate_business_trends(metrics),     # 20% max
    }

    rating = 100 - sum(penalties.values())

    if rating >= 80:
        return rating, "🟢 Stable"
    elif rating >= 60:
        return rating, "🟡 Under watch"
    else:
        return rating, "🔴 In danger"
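
The snippet above leaves several pieces out. The following runnable sketch fills them with assumptions so the penalty caps can be tested: `SEVERITY_POINTS`, the definition of `incident_rate` as incidents per day, and the `trend_penalty` argument are all hypothetical stand-ins, not the production logic.

```python
import math

# Hypothetical per-incident penalty points; the article does not list them.
SEVERITY_POINTS = {"low": 0.5, "medium": 1.0, "high": 2.5, "critical": 5.0}


def health_score(incidents, window_days=7, trend_penalty=0.0):
    """Return (score, status) for a list of incidents.

    Each incident is a dict with 'severity' and 'affected_users'.
    trend_penalty stands in for the business-trend term (0 to 20).
    """
    incident_rate = len(incidents) / window_days  # assumed: incidents per day
    users = sum(i["affected_users"] for i in incidents)
    penalties = {
        "volume": min(40.0, incident_rate * 13),
        "severity": min(25.0, sum(SEVERITY_POINTS[i["severity"]] for i in incidents)),
        # log(users) is undefined for 0 and zero for 1, so skip the term there
        "users": min(15.0, math.log(users) / math.log(50000) * 15) if users > 1 else 0.0,
        "trends": min(20.0, max(0.0, trend_penalty)),
    }
    score = round(100 - sum(penalties.values()), 1)
    if score >= 80:
        return score, "Stable"
    if score >= 60:
        return score, "Under watch"
    return score, "In danger"


# Two "high" incidents per day for a week, plus a negative business trend
busy_week = [{"severity": "high", "affected_users": 2500}] * 14
print(health_score(busy_week, trend_penalty=10))
```

Since the four caps add up to 100 (40 + 25 + 15 + 20), the score can never go below 0, whatever the input.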

Designing the Automated Detection System with n8n

Image by Yassin Zehar

This system consists of 4 streams:

  • Stream 1: retrieves recent revenue metrics, identifies unusual spikes in churn MRR, and creates incidents when needed.

const rows = items.map(item => item.json);

if (rows.length < 8) {
  return [];
}

rows.sort((a, b) => new Date(a.date) - new Date(b.date));

const values = rows.map(r => parseFloat(r.churn_mrr || 0));

const lastIndex = rows.length - 1;
const lastRow = rows[lastIndex];
const lastValue = values[lastIndex];

const window = 7;
const baselineValues = values.slice(lastIndex - window, lastIndex);

const mean = baselineValues.reduce((s, v) => s + v, 0) / baselineValues.length;
const variance = baselineValues
  .map(v => Math.pow(v - mean, 2))
  .reduce((s, v) => s + v, 0) / baselineValues.length;
const std = Math.sqrt(variance);

if (std === 0) {
  return [];
}

const z = (lastValue - mean) / std;
const deltaPct = mean === 0 ? null : ((lastValue - mean) / mean) * 100;

if (z > 2) {
  const anomaly = {
    date: lastRow.date,
    metric_name: 'churn_mrr',
    baseline_value: mean,
    actual_value: lastValue,
    z_score: z,
    delta_pct: deltaPct,
    severity:
      deltaPct !== null && deltaPct > 50 ? 'high'
      : deltaPct !== null && deltaPct > 25 ? 'medium'
      : 'low',
  };

  return [{ json: anomaly }];
}

return [];
  • Stream 2: Monitors feature usage metrics to detect sudden drops in adoption or engagement.
    Incidents are logged with severity and context, and alerts are sent to the product team.
Image by Yassin Zehar
  • Stream 3: For each open incident, collects additional context from the database (e.g., churn by country or plan), uses AI to generate a clear root cause hypothesis and suggested next steps, sends a summarized report to Slack and email, and updates the incident
Image by Yassin Zehar
  • Stream 4: Every morning, the workflow compiles all incidents from the previous day, creates a Notion page for documentation and sends a report to the leadership team
Image by Yassin Zehar

We deployed similar detection nodes for 8 different metrics, adjusting the z-score direction based on whether increases or decreases were problematic.
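
The direction adjustment can be sketched as follows, in Python rather than in an n8n Code node. The function name `detect_anomaly` and its `direction` parameter are assumptions; the 7-point window, the population standard deviation and the threshold of 2 mirror the churn MRR node shown earlier.

```python
from statistics import mean, pstdev


def detect_anomaly(values, direction="up", window=7, threshold=2.0):
    """Compare the last value to the preceding `window` values.

    direction="up" flags spikes (e.g. churn), "down" flags drops
    (e.g. checkout success). Returns the z-score if anomalous, else None.
    """
    if len(values) < window + 1:
        return None
    baseline = values[-(window + 1):-1]
    mu = mean(baseline)
    sigma = pstdev(baseline)  # population std, as in the JavaScript node
    if sigma == 0:
        return None
    z = (values[-1] - mu) / sigma
    signed = z if direction == "up" else -z
    return z if signed > threshold else None


churn_mrr = [10, 12, 11, 13, 12, 11, 10, 30]
checkout_rate = [0.95, 0.96, 0.94, 0.95, 0.96, 0.95, 0.94, 0.70]

churn_z = detect_anomaly(churn_mrr, direction="up")
checkout_z = detect_anomaly(checkout_rate, direction="down")
print(churn_z, checkout_z)
```

With this shape, a metric where increases are bad (churn, error rate) uses `direction="up"`, and one where decreases are bad (checkout success, active accounts) uses `direction="down"`.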

The AI agent receives additional context through SQL queries (churn by country, by plan, by segment) to generate more accurate root cause hypotheses. All of this data is gathered and sent in a daily email.
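
As an illustration of those context queries, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the production PostgreSQL instance. The table `churn_events`, its columns and the sample rows are all hypothetical.

```python
import sqlite3

# In-memory stand-in for the real database; schema and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE churn_events (country TEXT, plan TEXT, churn_mrr REAL)")
conn.executemany(
    "INSERT INTO churn_events VALUES (?, ?, ?)",
    [
        ("FR", "pro", 1200.0),
        ("FR", "starter", 300.0),
        ("US", "pro", 800.0),
        ("DE", "starter", 150.0),
    ],
)

# Column names cannot be bound as query parameters, so validate them explicitly.
ALLOWED_DIMENSIONS = {"country", "plan"}


def churn_by(dimension):
    """Total churn MRR per value of `dimension`, top offenders first."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unsupported dimension: {dimension}")
    query = (
        f"SELECT {dimension}, SUM(churn_mrr) AS total "
        f"FROM churn_events GROUP BY {dimension} ORDER BY total DESC"
    )
    return conn.execute(query).fetchall()


# Context passed to the AI agent alongside the incident itself
context = {
    "churn_by_country": churn_by("country"),
    "churn_by_plan": churn_by("plan"),
}
print(context)
```

Sorting by total churn puts the most affected segments at the top of what the agent reads.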

The workflow generates daily summary reports aggregating all incidents by metric and severity, distributed via email and Slack to stakeholders.
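
The aggregation step can be sketched in a few lines. The incident fields `metric_name` and `severity` follow the anomaly object built in the detection node; the function name `summarize` and the output layout are assumptions.

```python
from collections import Counter


def summarize(incidents):
    """Count incidents per (metric, severity) pair and render one line per pair."""
    counts = Counter((i["metric_name"], i["severity"]) for i in incidents)
    lines = [
        f"{metric} | {severity}: {n}"
        for (metric, severity), n in sorted(counts.items())
    ]
    return "\n".join(lines)


yesterday = [
    {"metric_name": "churn_mrr", "severity": "medium"},
    {"metric_name": "api_error_rate", "severity": "high"},
    {"metric_name": "api_error_rate", "severity": "high"},
]
report = summarize(yesterday)
print(report)
```

The resulting text can then be used as the body of the email or Slack message.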

The dashboard

The dashboard consolidates all signals into one place. An automatic product health rating on a 0-100 scale is calculated from:

  • incident volume
  • severity weighting
  • open vs resolved status
  • number of users impacted
  • business trends (MRR)
  • usage trends (active accounts)

A segment breakdown to identify which customer groups are the most affected:

Image by Yassin Zehar

A weekly heatmap and time series trend charts to identify recurring patterns:

Image by Yassin Zehar

And a detailed incident view composed of:

  • Business context
  • Dimension & segment
  • Root cause hypothesis
  • Incident type
  • An AI summary, coming from the n8n workflow, to speed up communication and diagnosis
Image by Yassin Zehar

Diagnosis:

The product health rating scored the product 24/100 with the status “in danger”, with:

  • 45 High & Critical incidents
  • 36 incidents during the last 7 days
  • 33,385 estimated affected users
  • Negative trend in churn and DAU
  • Several spikes in api_error_rate and drops in checkout_success_rate
Image by Yassin Zehar

Biggest impact per segment:

  • Enterprise → critical data freshness issues
  • Mid-Market → recurring incidents on feature adoption
  • SMB → fluctuations in onboarding & activation
Image by Yassin Zehar

Impact

The goal of this dashboard isn’t only to analyze incidents and identify patterns but to enable the organization to react faster with a detailed overview.

Image by Yassin Zehar

We noticed a 35% reduction in critical incidents after 2 months. Thanks to the unified data, the SRE and Data teams identified the recurring root cause of some major issues, were able to fix it, and now monitor the maintenance. Incident response time improved dramatically thanks to the AI summaries and the consolidated metrics, which tell them where to investigate.

An AI-Powered Root Cause Analysis

Image by Yassin Zehar

Using AI can save a lot of time, especially when an investigation spans several databases and you don’t know where to begin. Adding an AI agent to the loop can save a considerable amount of time thanks to its speed at processing data. To achieve this, a detailed prompt is needed, since the agent replaces a human. To get the most accurate results, the AI needs to understand the context and receive some guidance. Otherwise, it could investigate and draw irrelevant conclusions. Don’t forget to .

You are a Product Data & Revenue Analyst.

We detected an incident:
{{ $json.incident }}

Here is churn MRR by country (top offenders first):
{{ $json.churn_by_country }}

Here is churn MRR by plan:
{{ $json.churn_by_plan }}

1. Summarize what happened in simple business language.
2. Identify the most impacted segments (country, plan).
3. Propose 3-5 plausible hypotheses (product issues, price changes, bugs, market events).
4. Propose 3 concrete next steps for the Product team.

It is important to note that once the results are obtained, a final check is needed to make sure the analysis was done correctly. AI is a helpful tool, but it can also go wrong, so don’t rely only on it. For this system, the AI suggests the top 3 likely root causes for each incident.

Image by Yassin Zehar

The result was better alignment with the leadership team and reporting based on the data. Everything became more data-driven with deeper analyses, not intuition or reports by segmentation. This also led to an improved process.

Conclusion & takeaways

In conclusion, building a product health dashboard has several advantages:

  • Detect negative trends (MRR, DAU, engagement) earlier
  • Reduce critical incidents by identifying root-cause patterns
  • Understand real business impact (users affected, churn risk)
  • Prioritize the product roadmap based on risk and impact
  • Align Product, Data, SRE, and Revenue around a single source of truth

That’s exactly what many companies lack: a unified data approach.

Image by Yassin Zehar

Using the n8n workflow helped in two ways: resolving issues as soon as possible and gathering the data in one place. The automation tool helped reduce the time spent on this task while the business was still running.

Lessons for Product teams

Image by Yassin Zehar
  • Start simple: an automation system and a dashboard must be clearly defined before you build them. You are not building a product for customers, you’re building a product for your collaborators. It is crucial that you understand each team’s needs, since they’re your core users. With that in mind, start with an MVP that answers those needs first. Then you can improve it by adding features or metrics.
  • Unified metrics matter more than perfect detection: keep in mind that unified metrics are what save time and build understanding. Good detection is important, but if the metrics are inaccurate, the time saved will be wasted by teams looking for metrics scattered across different environments.
  • Automation saves 10 hours per week of manual investigation: by automating manual and recurring tasks, you save hours of investigation. With the incident alert workflow, we know directly where to investigate first, along with a hypothesis of the cause and even some actions to take.
  • Document everything: proper, detailed documentation is a must and gives all parties involved a clear understanding and view of what is happening. Documentation is also a piece of data.

Who am I?

I’m Yassin, a Project Manager who expanded into Data Science to bridge the gap between business decisions and technical systems. Learning Python, SQL, and analytics has enabled me to design product insights and automation workflows that connect what teams need with how data behaves. Let’s connect on LinkedIn
