Turning 127 Million Data Points Into an Industry Report


This year, I published an industry report called Remediation at Scale analyzing how application security (AppSec) teams fix vulnerabilities in their code. The dataset: tens of thousands of repositories, a full year of scan data, and organizations ranging from startups to enterprises. In total, north of 127 million data points spanning individual findings, scan events, and remediation actions across two types of security scanning (SAST and SCA).

I’m a Senior Technical PMM at Semgrep with a background in computer science, data science, and solutions engineering. I like building things. This project let me combine all of that in a single effort: writing the SQL, building scripts to manage the analysis, parsing and cleaning the data, finding the story the data is telling, and shipping the final polished asset.

This post walks through five lessons I picked up along the way. If you’ve ever needed to take a large dataset, find the narrative inside it, and turn it into something a technical and non-technical audience can act on, some of this might be useful.

1. Start with the data, not the story

The temptation with any data project is to decide on your narrative first, then go searching for numbers to back it up. I did the opposite.

I spent weeks in pure exploration mode. Querying Snowflake, pulling distributions, running aggregations across different dimensions. No hypothesis, no angle. Just trying to understand what the data actually showed.

This was uncomfortable. Stakeholders wanted to know what the report would say. I didn’t have an answer yet.

But it turned out to be the most important phase of the entire project. The data told a story I wouldn’t have guessed: the gap between top-performing security teams and everyone else wasn’t about tooling. It was about systematic follow-through on remediation. I never would have landed on that framing if I’d started with a thesis.

You also have to be willing to kill your darlings. There were several findings I wanted to be true that the data didn’t support. On the flip side, some of the most interesting insights came from places I wasn’t looking. I used local LLMs via Ollama to classify 10,000+ text-based triage records into 20 thematic categories. What emerged was a clear pattern: the most common themes were about test files, framework protections, and trusted services. That told a story about how teams actually use triage tooling that I never would have found through aggregate metrics.
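The classification step can be sketched roughly like this. A minimal Python sketch, where `complete` stands in for any prompt-to-text call against a local model (the Ollama wiring, category names, and prompt wording are illustrative assumptions, not the report’s actual taxonomy or pipeline):

```python
CATEGORIES = [
    "test files",
    "framework protections",
    "trusted services",
    "other",
]  # illustrative subset of the 20 themes

def classify_triage_note(note: str, complete) -> str:
    """Map one free-text triage record to a single theme.

    `complete` is any prompt -> text callable; in practice it would wrap
    a local model call (e.g. via Ollama). That wiring is an assumption,
    not the author's exact setup.
    """
    prompt = (
        "Classify this vulnerability triage note into exactly one of: "
        + ", ".join(CATEGORIES)
        + f".\nNote: {note}\nCategory:"
    )
    answer = complete(prompt).strip().lower()
    # Guard against free-form model output: fall back to "other".
    return next((c for c in CATEGORIES if c in answer), "other")
```

Running something like this over 10,000+ records reduces the theme analysis to a simple counter over the returned labels.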

A few things that helped during exploration:

  • Run diagnostic queries first. I built a set of 12+ data quality checks before touching the analysis. One of them caught that a key metric (parse_rate) only had coverage for a fraction of repos. I switched to an alternative field (NUM_BYTES_SCANNED) with 90%+ coverage. Without that diagnostic, the entire findings-per-lines-of-code analysis would have been miscomputed.
  • Build checkpoint/resume into your pipeline. I had 108+ SQL queries across multiple report sections. I wrote a shell script that auto-discovered .sql files, tracked which ones had already produced output CSVs, and skipped them on re-runs. When queries failed midway through (and they did), I could pick up right where I left off instead of re-running everything.
  • Document as you go. Every interesting result, every dead end, every assumption. That running log became the backbone of the report’s methodology section and saved me weeks when I needed to retrace my steps.
Shell script for auto-discovering and running queries for the report. Image by Author.
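The checkpoint/resume pattern is small enough to sketch. Here is a hedged Python version of the same idea (the actual pipeline was a shell script; the directory layout and the `run_query` callable, which would wrap the warehouse client, are assumptions):

```python
from pathlib import Path

def run_pending(query_dir: str, out_dir: str, run_query) -> tuple:
    """Auto-discover .sql files and run only the ones without an output CSV.

    `run_query` is any SQL-text -> CSV-text callable. Returns the names
    of queries that ran and of queries skipped via the checkpoint.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    ran, skipped = [], []
    for sql in sorted(Path(query_dir).glob("*.sql")):
        csv = out / f"{sql.stem}.csv"
        if csv.exists():  # checkpoint hit: finished on a previous run
            skipped.append(sql.stem)
            continue
        csv.write_text(run_query(sql.read_text()))
        ran.append(sql.stem)
    return ran, skipped
```

If a query fails midway, its CSV is never written, so re-invoking the runner picks up exactly where the last run stopped.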

2. Become the domain expert

You can’t tell a story about data you don’t understand. Before I could write a single section, I needed to understand how static analysis scanners work, how remediation flows operate in practice, and what metrics actually matter to security teams.

Several companies in the space publish annual reports on similar topics. I collected and read as many as I could find. Not to copy, but to understand the format, the depth, and the expectations. Reading them gave me a sense of:

  • What the industry expects from this kind of resource
  • What’s already well-covered
  • Where there’s room to say something new

This also helped me spot gaps. Most reports focus on detection volume. Very few dig into what happens after detection. That became our angle.

Skipping this phase would have meant writing a report full of surface-level observations that didn’t differentiate against the other great content produced by others.

3. Talk to your audience early and often

Early versions of the analysis just showed averages. Average fix rate, average time to remediate, average findings per repo. The numbers were fine. The story was boring.

The breakthrough came after talking to actual practitioners: the security engineers, AppSec leads, and CISOs who would be reading the final product. Everyone wanted the answer to one question.

That feedback directly shaped two of the biggest decisions in the report.

First, it led to a cohort-based segmentation. I split organizations into two groups: the top 15% by fix rate (“leaders”) and everyone else (“the field”). This is similar to how survey-based reports segment by maturity level, except I was using behavioral data rather than self-reported responses. Suddenly the data had contrast:

  • Leaders fix 2–3x more vulnerabilities
  • They resolve findings caught during code review 9x faster than findings from full repository scans
  • They adopt workflow automation features at higher rates and extract more value from them

The segmentation was the difference between “here are some numbers” and “here is something you can act on.”
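Mechanically, the split is simple. A sketch in Python with pandas, assuming one row per organization with a precomputed `fix_rate` column (column names and the quantile-based cutoff are illustrative, not the report’s exact methodology):

```python
import pandas as pd

def tag_cohorts(orgs: pd.DataFrame, top_frac: float = 0.15) -> pd.DataFrame:
    """Label the top `top_frac` of orgs by fix_rate as 'leader', rest as 'field'."""
    cutoff = orgs["fix_rate"].quantile(1 - top_frac)
    out = orgs.copy()
    out["cohort"] = out["fix_rate"].map(
        lambda r: "leader" if r >= cutoff else "field"
    )
    return out
```

Every downstream chart can then simply group by `cohort`, which is what gives the numbers their contrast.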

Bar chart showing the difference in code vulnerability fix rates between “Leader” cohorts and “Field” cohorts.
Splitting cohorts into leaders and field gives the reader a frame of reference for where their program stands. It also helps frame talking points and findings. Image by Author.

Second, it reshaped the report’s structure. People didn’t just want benchmarks. They wanted to know what to do about them. “Great, the leader cohort fixes more code security vulnerabilities. How do I become a leader?” That feedback led me to add an evidence-based recommendations section organized by implementation speed:

  • Quick wins for this week
  • Process changes for this quarter
  • Strategic investments for this half

The final report reads as much like a playbook as it does a benchmark. None of that would have happened without putting early drafts in front of actual readers.

4. Get design involved early

This one I almost learned too late. Data reports live or die on how they look. A wall of charts with no visual hierarchy is just as bad as no data at all.

I brought in our design team earlier than I normally would and spent time walking them through the domain. What does “reachability analysis” mean? Why does the cohort split matter? When the designers understood the story, they made decisions (color coding for cohorts, callout boxes for key insights, before/after code examples) that reinforced it without me having to explain it in text.

Unused proof-of-concept rendering of the report cover graphic. Note the 2.4x Remediation Gap. Image used with permission.

5. Give yourself time

This project took months. The data exploration alone took weeks. Then there were iterations on the analysis as I discovered new angles, design cycles, legal reviews, and rounds of feedback from stakeholders across the company.

If I had tried to ship this in a quarter, the result would have been forgettable.

Where it landed

Looking back, the two things I’d change are both about speed. I’d write down every definition and assumption on day one. Things like “what counts as an active repository” or “how do we calculate fix rate” seem obvious at first. They become contested fast. I finally created a formal definitions document covering 40+ metrics, but doing it earlier would have saved several rounds of rework. And I’d bring in a second set of eyes during exploration. Working solo meant nobody to gut-check whether a finding was interesting or just noise.

The report itself, Remediation at Scale, covers six evidence-backed patterns that separate high-performing security teams from the rest. If you’ve tackled a similar data-heavy reporting project, I’d be curious to hear what you learned along the way.
