This year, I published an industry report called Remediation at Scale analyzing how application security (AppSec) teams fix vulnerabilities in their code. The dataset: tens of thousands of repositories, a full year of scan data, and organizations ranging from startups to enterprises. In total, north of 127 million data points spanning individual findings, scan events, and remediation actions across two types of security scanning (SAST and SCA).
I’m a Senior Technical PMM at Semgrep with a background in computer science, data science, and solutions engineering. I like building things. This project let me combine all of that in a single motion: writing the SQL, building scripts to manage the analysis, parsing and cleaning the data, finding the story the data is telling, and shipping the final polished asset.
This post walks through five lessons I picked up along the way. If you’ve ever had to take a massive dataset, find the narrative inside it, and turn it into something a technical and non-technical audience can act on, some of this might be useful.
1. Start with the data, not the story
The temptation with any data project is to decide your narrative first, then go looking for numbers to back it up. I did the opposite.
I spent weeks in pure exploration mode. Querying Snowflake, looking at distributions, running aggregations across different dimensions. No hypothesis, no angle. Just trying to understand what the data actually showed.
This was uncomfortable. Stakeholders wanted to know what the report would say. I didn’t have an answer yet.
But it turned out to be the most important phase of the entire project. The data told a story I wouldn’t have guessed: the gap between top-performing security teams and everyone else wasn’t about tooling. It was about systematic follow-through on remediation. I never would have landed on that framing if I’d started with a thesis.
You also have to be willing to kill your darlings. There were several findings I wanted to be true that the data didn’t support. On the flip side, some of the most interesting insights came from places I wasn’t looking. I used local LLMs via Ollama to classify 10,000+ text-based triage records into 20 thematic categories. What emerged was a clear pattern: the most common themes were about test files, framework protections, and trusted services. That told a story about how teams actually use triage tooling that I never would have found by looking at aggregate metrics.
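The classification step above can be sketched in a few lines. This is a hypothetical illustration, not the report's actual pipeline: the category names, prompt wording, and function names are all made up, and the `generate` callable stands in for whatever wraps the local model (e.g. a call to an Ollama server).

```python
# Sketch of LLM-based triage classification. Everything here is illustrative:
# the real report used ~20 categories; three are shown for brevity.
CATEGORIES = ["test file", "framework protection", "trusted service"]

def build_prompt(record: str) -> str:
    """Ask the model to pick exactly one category for a triage note."""
    options = ", ".join(CATEGORIES)
    return (
        f"Classify this vulnerability triage note into exactly one of: {options}.\n"
        f"Note: {record}\n"
        "Answer with the category name only."
    )

def classify(record: str, generate) -> str:
    """`generate` is any prompt -> text callable (e.g. a wrapper around a
    local LLM). Falls back to 'other' when the reply isn't a known category."""
    answer = generate(build_prompt(record)).strip().lower()
    return answer if answer in CATEGORIES else "other"

# Stub model for illustration; a real run would query the local model instead.
stub = lambda prompt: "test file"
print(classify("Finding is in tests/fixtures/login_spec.py", stub))  # -> test file
```

Keeping the model behind a plain callable also makes the batch job easy to test and to swap between models without touching the classification logic.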
A few things that helped during exploration:
- Run diagnostic queries first. I built a set of 12+ data quality checks before touching the analysis. One of them caught that a key metric (parse_rate) only had coverage for a fraction of repos. I switched to an alternative field (NUM_BYTES_SCANNED) with 90%+ coverage. Without that diagnostic, the entire findings-per-lines-of-code analysis would have been miscomputed.
- Build checkpoint/resume into your pipeline. I had 108+ SQL queries across multiple report sections. I wrote a shell script that auto-discovered .sql files, tracked which ones had already produced output CSVs, and skipped them on re-runs. When queries failed midway through (and they did), I could pick up right where I left off instead of re-running everything.
- Document as you go. Every interesting result, every dead end, every assumption. That running log became the backbone of the report’s methodology section and saved me weeks when I needed to retrace my steps.
Shell script for auto-discovering and running queries for the report. Image by Author.
2. Become the domain expert
You can’t tell a story about data you don’t understand. Before I could write a single section, I needed to know how static analysis scanners work, how remediation flows operate in practice, and what metrics actually matter to security teams.
Several companies in the space publish annual reports on similar topics. I collected and read as many as I could find. Not to copy, but to understand the format, the depth, and the expectations. Reading them gave me a sense of:
- What the industry expects from this kind of resource
- What’s already well-covered
- Where there’s room to say something new
This also helped me spot gaps. Most reports focus on detection volume. Very few dig into what happens after detection. That became our angle.
Skipping this phase would have meant writing a report full of surface-level observations that didn't differentiate from the strong content others have already produced.
3. Talk to your target audience early and often
Early versions of the analysis just showed averages. Average fix rate, average time to remediate, average findings per repo. The numbers were fine. The story was boring.
The breakthrough came after talking to actual practitioners: the security engineers, AppSec leads, and CISOs who would be reading the final product. Everyone wanted to answer one question: how do I compare to teams that are doing this well?
That feedback directly shaped two of the biggest decisions in the report.
First, it led to a cohort-based segmentation. I split organizations into two groups: the top 15% by fix rate (“leaders”) and everyone else (“the field”). This is similar to how survey-based reports segment by maturity level, except I was using behavioral data rather than self-reported responses. Suddenly the data had contrast:
- Leaders fix 2–3x more vulnerabilities
- They resolve findings caught during code review 9x faster than findings from full repository scans
- They adopt workflow automation features at higher rates and extract more value from them
The segmentation was the difference between “here are some numbers” and “here is something you can act on.”
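The cohort split itself is simple to express in code. A minimal sketch, assuming each organization reduces to a single fix-rate number (the org names and sample values below are invented; the 15% cutoff is the one the report uses):

```python
def split_cohorts(fix_rates: dict[str, float], top_pct: float = 0.15):
    """Split orgs into 'leaders' (top `top_pct` by fix rate) and
    'the field' (everyone else)."""
    ranked = sorted(fix_rates, key=fix_rates.get, reverse=True)
    n_leaders = max(1, round(len(ranked) * top_pct))
    leaders = set(ranked[:n_leaders])
    return leaders, set(ranked) - leaders

# Hypothetical orgs and fix rates for illustration.
rates = {"org_a": 0.92, "org_b": 0.40, "org_c": 0.65, "org_d": 0.18,
         "org_e": 0.77, "org_f": 0.33, "org_g": 0.51}
leaders, field = split_cohorts(rates)
print(leaders)  # top 15% of 7 orgs -> 1 org: {'org_a'}
```

With the two cohorts in hand, every downstream metric (time to remediate, feature adoption, findings per repo) can be computed per group and compared, which is where the contrast in the report comes from.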
Splitting cohorts into leaders and field gives the reader a frame of reference for where their program stands. It also helps frame talking points and findings. Image by Author.
Second, it reshaped the report’s structure. People didn’t just want benchmarks. They wanted to know what to do about them. “Great, the leader cohort fixes more code security vulnerabilities. How do I become a leader?” That feedback led me to add an evidence-based recommendations section organized by implementation speed:
- Quick wins for this week
- Process changes for this quarter
- Strategic investments for the half
The final report reads as much like a playbook as it does a benchmark. None of that would have happened without putting early drafts in front of actual readers.
4. Get design involved early
This one I almost learned too late. Data reports live or die on how they look. A wall of charts with no visual hierarchy is just as bad as no data at all.
I brought in our design team earlier than I normally would and spent time walking them through the domain. What does “reachability analysis” mean? Why does the cohort split matter? When the designers understood the story, they made choices (color coding for cohorts, callout boxes for key insights, before/after code examples) that reinforced it without me having to explain in text.
Unused proof-of-concept rendering of the report cover graphic. Note the 2.4x Remediation Gap. Image used with permission.
5. Give yourself time
This project took months. The data exploration alone was weeks. Then there were iterations on the analysis as I found new angles, design cycles, legal reviews, and rounds of feedback from stakeholders across the company.
If I had tried to ship this in a quarter, the result would have been forgettable.
Where it landed
Looking back, the two things I’d change are both about speed. I’d write down every definition and assumption on day one. Things like “what counts as an active repository” or “how do we calculate fix rate” seem obvious at the start. They become contested fast. I eventually created a formal definitions document covering 40+ metrics, but doing it earlier would have saved several rounds of rework. And I’d bring in a second set of eyes during exploration. Working solo meant no one to gut-check whether a finding was interesting or just noise.
The report itself, Remediation at Scale, covers six evidence-backed patterns that separate high-performing security teams from the rest. If you’ve tackled a similar data-heavy reporting project, I’d be curious to hear what you learned along the way.

