Research May 2026

We analyzed language drift in SEC 10-K filings from 4905 companies over 10 years. Here's what the data shows.

SVB's 2022 annual report contained a sentence that no other bank in our corpus was writing at the time. Our system flagged it in January 2023. The FDIC arrived in March.

That's a striking data point. But it's one case. The question worth asking is: does this pattern hold at scale? We ran the numbers across 4905 public companies and 7069 flag events. Here's what we found.

The signal

Each year a company files a 10-K. We compare that filing's language against two things simultaneously:

The same company's own prior filings (year-over-year change)
How rare each phrase is across the whole corpus, and whether it rose corpus-wide that year (corpus-wide normalization — not by SIC sector)

The score is high when a company is simultaneously writing things unusual for itself and unusual across the corpus. The primary score is phrase-frequency based and fully deterministic: the same input always produces the same score. A secondary sentence-embedding signal runs alongside it.
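
To make the two comparisons concrete, here's a minimal sketch of a phrase-frequency score that is high only when a phrase both rose versus the company's prior filing and is rare across the corpus. This is an illustration, not the production scorer: the function names, the bigram tokenization, and the IDF-style rarity weight are all assumptions, and it leaves out the corpus-wide yearly adjustment and the embedding signal.

```python
import math
from collections import Counter


def phrase_counts(text, n=2):
    """Count word n-grams in a filing (hypothetical tokenization: lowercase, whitespace split)."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def drift_score(current, prior, corpus_doc_freq, corpus_size, n=2):
    """Sum of (year-over-year rise in phrase share) x (corpus rarity of the phrase).

    corpus_doc_freq maps phrase -> number of corpus filings containing it (assumed input).
    """
    cur, prev = phrase_counts(current, n), phrase_counts(prior, n)
    cur_total = sum(cur.values()) or 1
    prev_total = sum(prev.values()) or 1
    score = 0.0
    for phrase, count in cur.items():
        # Unusual for the company: the phrase's share rose versus last year.
        rise = count / cur_total - prev.get(phrase, 0) / prev_total
        if rise <= 0:
            continue
        # Unusual for the corpus: smoothed inverse document frequency.
        rarity = math.log((1 + corpus_size) / (1 + corpus_doc_freq.get(phrase, 0)))
        score += rise * rarity
    return score
```

There is no sampling or model inference in the sketch, so the same pair of filings always yields the same number. An unchanged filing scores zero, and a newly introduced phrase contributes more the fewer corpus filings contain it.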

This is different from keyword search (which can't distinguish corpus-wide language shifts from company-specific ones) and different from LLM summarization (which can't compare one filing against the whole corpus of thousands of filings).

What the backtest shows

This section covers the distress read, meaning the high end of the score; the long-side factor result is in the signal validation. We ran a forward-return backtest across every flag event in the corpus: 7069 events from 4905 companies. Read the table as a tendency, not a trade signal. Most of the raw underperformance versus SPY is the small-cap size effect, so we don't lead with it. The distress evidence is the lift: moderate-flagged companies reach a distress outcome at about 1.2× the corpus base rate.

Horizon	Median alpha vs. S&P 500	% events with negative alpha
1 year	-8.6%	58%
2 years	-14.8%	61%
3 years	-22.4%	63%

n=7069 flag events, 4905 companies. Alpha = company return minus SPY return over the same period.

To be clear about what this means: across 7069 flag events, companies that crossed the distress ceiling had a median alpha of -8.6% versus SPY at 1 year, so the median flagged company underperformed by 8.6 percentage points. 58% of those events had negative alpha, versus the roughly 50% you'd expect from random flagging. That's a real directional signal in a noisy market, not a perfect predictor.
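
For readers who want to reproduce the two table columns on their own data, here's a minimal sketch of the arithmetic. The function name and the input shape (a list of company-return and SPY-return pairs over the same window) are assumptions for illustration, and the sample numbers are made up, not taken from the backtest.

```python
from statistics import median


def summarize_alpha(events):
    """events: list of (company_return, spy_return) pairs over the same horizon.

    Returns (median alpha, share of events with negative alpha).
    """
    # Alpha as defined under the table: company return minus SPY return.
    alphas = [company_ret - spy_ret for company_ret, spy_ret in events]
    negative_share = sum(a < 0 for a in alphas) / len(alphas)
    return median(alphas), negative_share


# Made-up example: four flag events, three of which trail SPY.
sample = [(-0.20, 0.10), (0.05, 0.10), (0.30, 0.10), (-0.10, 0.05)]
median_alpha, negative_share = summarize_alpha(sample)
```

The median is used rather than the mean so that a handful of extreme winners or losers doesn't dominate the summary.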

Specific cases

Companies we flagged before widely-known distress events:

Company	Event	Lead time
SVB	Bank collapse Mar 2023	14 days (final filing)
Bed Bath & Beyond	Bankruptcy Apr 2023	~24 months
Nikola	Bankruptcy Nov 2023	~44 months
Rite Aid	Bankruptcy Oct 2023	167 days

What it missed — and why

Three notable misses worth documenting:

PCG (Pacific Gas & Electric) — had zero pre-event filing pairs in our corpus at the time of the wildfire liabilities. Nothing to score.
Revlon — scored high, but only after the restructuring began, not before it.
Chesapeake Energy — only one pre-event filing pair, insufficient history for the signal to develop.

These aren't buried in a footnote. The signal requires multi-year filing history to work. Companies with few historical pairs have lower signal reliability, and we flag this on the company page.

Known methodological issues

Two problems we know about and haven't solved:

The binomial false-positive problem. The ceiling is set at the 95th percentile of pair scores from labeled stable companies. But if a company has about 10 years of filing history, and so about 10 scored pairs, the probability of at least one pair exceeding the 95th percentile purely by chance is 1 - 0.95^10 ≈ 40%, treating the pairs as independent. Companies with long histories have a structurally higher false-positive rate. We're working on adaptive per-company thresholds.
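
The arithmetic is easy to check, and it also shows how fast the problem grows with history length. The sketch below assumes independent pair scores. The second function is one textbook way to hold the per-company rate constant (a Šidák-style correction); it is shown only as an illustration and is not a description of the adaptive thresholds mentioned above. Both function names are hypothetical.

```python
def p_any_false_flag(n_pairs, percentile=0.95):
    """Chance that at least one of n_pairs independent scores exceeds the percentile ceiling."""
    return 1.0 - percentile ** n_pairs


def per_pair_percentile(n_pairs, target_rate=0.05):
    """Per-pair ceiling that keeps the whole-history false-flag rate at target_rate.

    Assumption: a Sidak-style correction over independent pairs.
    """
    return (1.0 - target_rate) ** (1.0 / n_pairs)


for n in (1, 5, 10, 20):
    print(n, round(p_any_false_flag(n), 3), round(per_pair_percentile(n), 4))
```

With 10 pairs the chance of a spurious flag is about 40%, and holding it at 5% would require a per-pair ceiling near the 99.5th percentile.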

No sector normalization. The score is normalized corpus-wide, not within industry sectors — an energy company with routine impairment language is weighted against the same pool as a software company. We tested per-sector (GICS) normalization and it reduced both the portfolio alpha and the distress recall, so we don't use it, but it means sector base rates of distress vocabulary aren't accounted for. It's also why a company whose distress language is common across the corpus (e.g. Party City) can be missed.
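
A toy example shows the trade-off. The sketch below z-scores a hypothetical raw "distress vocabulary rate" either across the whole pool or within each sector. The tickers, sectors, numbers, and function names are invented for illustration, and plain z-scoring is a stand-in for whatever normalization the real score applies.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscores(values):
    """Population z-scores; returns zeros if all values are identical."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def normalize(rows, by_sector=False):
    """rows: list of (ticker, sector, raw_rate). Returns {ticker: z-score}."""
    groups = defaultdict(list)
    for row in rows:
        # One shared pool for corpus-wide, one pool per sector otherwise.
        groups[row[1] if by_sector else "all"].append(row)
    out = {}
    for members in groups.values():
        for member, z in zip(members, zscores([m[2] for m in members])):
            out[member[0]] = z
    return out


# Invented data: energy filers use impairment language routinely, software filers rarely.
rows = [("E1", "energy", 8.0), ("E2", "energy", 9.0), ("E3", "energy", 10.0),
        ("S1", "software", 1.0), ("S2", "software", 2.0), ("S3", "software", 3.0)]
corpus_wide = normalize(rows)
per_sector = normalize(rows, by_sector=True)
```

Corpus-wide, every energy company outranks every software company simply because of its sector's vocabulary. Per-sector, E3 and S3 land on the same z-score because each is the outlier within its own peer group.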

The tool

We built FilingDrift to make this signal accessible. Free tier covers our labeled company set (the cases above and more). Researcher, Professional, and Desk plans add watchlist alerts, API access, and the full 4905-company corpus.

The live demo shows SVB's full score history with annotations. The methodology page has the technical detail and the full validation analysis.

Questions about the methodology or specific tickers? Email hello@filingdrift.com