This year, I published an industry report called Remediation at Scale analyzing how application security (AppSec) teams fix vulnerabilities in their code. The dataset: tens of thousands of repositories, a full year of scan data, and organizations ranging from startups to enterprises. In total, north of 127 million data points spanning individual findings, scan events, and remediation actions across two types of security scanning (SAST and SCA).
I’m a Senior Technical PMM at Semgrep with a background in computer science, data science, and solutions engineering. I like building things. This project let me combine all of that in a single motion: writing the SQL, building scripts to manage the analysis, parsing and cleaning the data, finding the story the data is telling, and shipping the final polished asset.
This post walks through five lessons I picked up along the way. If you’ve ever had to take a massive dataset, find the narrative inside it, and turn it into something a technical and non-technical audience can act on, some of this might be useful.
1. Start with the data, not the story
The temptation with any data project is to decide your narrative first, then go looking for numbers to back it up. I did the opposite.
I spent weeks in pure exploration mode. Querying Snowflake, looking at distributions, running aggregations across different dimensions. No hypothesis, no angle. Just trying to understand what the data actually showed.
This was uncomfortable. Stakeholders wanted to know what the report would say. I didn’t have an answer yet.
But it turned out to be the most important phase of the entire project. The data told a story I wouldn’t have guessed: the gap between top-performing security teams and everyone else wasn’t about tooling. It was about systematic follow-through on remediation. I never would have landed on that framing if I’d started with a thesis.
You also have to be willing to kill your darlings. There were several findings I wanted to be true that the data didn’t support. On the flip side, some of the most interesting insights came from places I wasn’t looking. I used local LLMs via Ollama to classify 10,000+ text-based triage records into 20 thematic categories. What emerged was a clear pattern: the most common themes were about test files, framework protections, and trusted services. That told a story about how teams actually use triage tooling that I never would have found by looking at aggregate metrics.
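The classification step above can be sketched in a few lines. This is a hypothetical illustration, not the report's actual pipeline: the category names, prompt wording, and function names are all made up, and the `generate` callable stands in for whatever wraps the local model (e.g. a call to an Ollama server).

```python
# Sketch of LLM-based triage classification. Everything here is illustrative:
# the real report used ~20 categories; three are shown for brevity.
CATEGORIES = ["test file", "framework protection", "trusted service"]

def build_prompt(record: str) -> str:
    """Ask the model to pick exactly one category for a triage note."""
    options = ", ".join(CATEGORIES)
    return (
        f"Classify this vulnerability triage note into exactly one of: {options}.\n"
        f"Note: {record}\n"
        "Answer with the category name only."
    )

def classify(record: str, generate) -> str:
    """`generate` is any prompt -> text callable (e.g. a wrapper around a
    local LLM). Falls back to 'other' when the reply isn't a known category."""
    answer = generate(build_prompt(record)).strip().lower()
    return answer if answer in CATEGORIES else "other"

# Stub model for illustration; a real run would query the local model instead.
stub = lambda prompt: "test file"
print(classify("Finding is in tests/fixtures/login_spec.py", stub))  # -> test file
```

Keeping the model behind a plain callable also makes the batch job easy to test and to swap between models without touching the classification logic.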
A few things that helped during exploration:
- Run diagnostic queries first. I built a set of 12+ data quality checks before touching the analysis. One of them caught that a key metric (parse_rate) only had coverage for a fraction of repos. I switched to an alternative field (NUM_BYTES_SCANNED) with 90%+ coverage. Without that diagnostic, the entire findings-per-lines-of-code analysis would have been miscomputed.
- Build checkpoint/resume into your pipeline. I had 108+ SQL queries across multiple report sections. I wrote a shell script that auto-discovered .sql files, tracked which ones had already produced output CSVs, and skipped them on re-runs. When queries failed midway through (and they did), I could pick up right where I left off instead of re-running everything.
- Document as you go. Every interesting result, every dead end, every assumption. That running log became the backbone of the report’s methodology section and saved me weeks when I needed to retrace my steps.
Shell script for auto-discovering and running queries for the report. Image by Author.
2. Become the domain expert
You can’t tell a story about data you don’t understand. Before I could write a single section, I needed to know how static analysis scanners work, how remediation flows operate in practice, and what metrics actually matter to security teams.
Several companies in the space publish annual reports on similar topics. I collected and read as many as I could find. Not to copy, but to understand the format, the depth, and the expectations. Reading them gave me a sense of:
- What the industry expects from this kind of resource
- What’s already well-covered
- Where there’s room to say something new
This also helped me spot gaps. Most reports focus on detection volume. Very few dig into what happens after detection. That became our angle.
Skipping this phase would have meant writing a report full of surface-level observations that didn't differentiate from the strong content others have already produced.
3. Talk to your target audience early and often
Early versions of the analysis just showed averages. Average fix rate, average time to remediate, average findings per repo. The numbers were fine. The story was boring.
The breakthrough came after talking to actual practitioners: the security engineers, AppSec leads, and CISOs who would be reading the final product. Everyone wanted to answer one question: how do I compare to teams that are doing this well?
That feedback directly shaped two of the biggest decisions in the report.
First, it led to a cohort-based segmentation. I split organizations into two groups: the top 15% by fix rate (“leaders”) and everyone else (“the field”). This is similar to how survey-based reports segment by maturity level, except I was using behavioral data rather than self-reported responses. Suddenly the data had contrast:
- Leaders fix 2–3x more vulnerabilities
- They resolve findings caught during code review 9x faster than findings from full repository scans
- They adopt workflow automation features at higher rates and extract more value from them
The segmentation was the difference between “here are some numbers” and “here is something you can act on.”
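The cohort split itself is simple to express in code. A minimal sketch, assuming each organization reduces to a single fix-rate number (the org names and sample values below are invented; the 15% cutoff is the one the report uses):

```python
def split_cohorts(fix_rates: dict[str, float], top_pct: float = 0.15):
    """Split orgs into 'leaders' (top `top_pct` by fix rate) and
    'the field' (everyone else)."""
    ranked = sorted(fix_rates, key=fix_rates.get, reverse=True)
    n_leaders = max(1, round(len(ranked) * top_pct))
    leaders = set(ranked[:n_leaders])
    return leaders, set(ranked) - leaders

# Hypothetical orgs and fix rates for illustration.
rates = {"org_a": 0.92, "org_b": 0.40, "org_c": 0.65, "org_d": 0.18,
         "org_e": 0.77, "org_f": 0.33, "org_g": 0.51}
leaders, field = split_cohorts(rates)
print(leaders)  # top 15% of 7 orgs -> 1 org: {'org_a'}
```

With the two cohorts in hand, every downstream metric (time to remediate, feature adoption, findings per repo) can be computed per group and compared, which is where the contrast in the report comes from.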
Splitting cohorts into leaders and field gives the reader a frame of reference for where their program stands. It also helps frame talking points and findings. Image by Author.
Second, it reshaped the report’s structure. People didn’t just want benchmarks. They wanted to know what to do about them. “Great, the leader cohort fixes more code security vulnerabilities. How do I become a leader?” That feedback led me to add an evidence-based recommendations section organized by implementation speed:
- Quick wins for this week
- Process changes for this quarter
- Strategic investments for the half
The final report reads as much like a playbook as it does a benchmark. None of that would have happened without putting early drafts in front of actual readers.
4. Get design involved early
This one I almost learned too late. Data reports live or die on how they look. A wall of charts with no visual hierarchy is just as bad as no data at all.
I brought in our design team earlier than I normally would and spent time walking them through the domain. What does “reachability analysis” mean? Why does the cohort split matter? When the designers understood the story, they made choices (color coding for cohorts, callout boxes for key insights, before/after code examples) that reinforced it without me having to explain in text.
Unused proof-of-concept rendering of the report cover graphic. Note the 2.4x Remediation Gap. Image used with permission.
5. Give yourself time
This project took months. The data exploration alone was weeks. Then there were iterations on the analysis as I found new angles, design cycles, legal reviews, and rounds of feedback from stakeholders across the company.
If I had tried to ship this in a quarter, the result would have been forgettable.
Where it landed
Looking back, the two things I’d change are both about speed. I’d write down every definition and assumption on day one. Things like “what counts as an active repository” or “how do we calculate fix rate” seem obvious at the start. They become contested fast. I eventually created a formal definitions document covering 40+ metrics, but doing it earlier would have saved several rounds of rework. And I’d bring in a second set of eyes during exploration. Working solo meant no one to gut-check whether a finding was interesting or just noise.
The report itself, Remediation at Scale, covers six evidence-backed patterns that separate high-performing security teams from the rest. If you’ve tackled a similar data-heavy reporting project, I’d be curious to hear what you learned along the way.

