When we tell people FilingDrift doesn't use an LLM, the reaction is usually some version of: "...why not?" It's 2026. Everything uses LLMs. Your toaster probably has a system prompt.
There's a good reason — and it comes down to what kind of question you're actually trying to answer. We want to share the reasoning because we think it applies to a lot of problems where people reflexively reach for GPT or Claude when a much simpler, cheaper, and more reliable tool would do better.
FilingDrift's core question is: "Is the language in this filing semantically unusual — for this company, in this year, relative to its peers?"
Notice what that question requires:
This is not a summarization task. It's not a question-answering task. It's a geometric comparison task: where does this document sit in semantic space, relative to a reference set?
For that specific task, embedding models are the right tool and LLMs are the wrong one. Here's why.
Run an embedding model on the same sentence twice, with the same model version, and you get the same vector every time. Run GPT or Claude on the same sentence twice and the output can differ: temperature, sampling, and model updates all introduce variation. For a financial scoring system that needs to be auditable ("why did this company's score change?"), non-determinism is disqualifying.
10-K filings run 70,000–150,000 words. Even today's large-context LLMs can hold at most a handful of these at once. We need to compare a filing against dozens of peers and multiple prior filings. Embeddings handle this naturally: embed each sentence offline, store the vectors, and compare them in arbitrary combinations whenever you need. No truncation, no chunking strategy, no "which 30% of the document do we send to the model?"
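To make "embed once, compare in any combination" concrete, here is a minimal sketch. It is an illustration under stated assumptions, not FilingDrift's actual pipeline: the `store` layout, the ticker `ACME`, the `drift` helper, and the toy 3-dimensional vectors are all hypothetical, and the centroid-based comparison is just one simple way to compare a filing against its own prior years.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical store: (ticker, fiscal_year) -> sentence vectors.
# Real sentence embeddings have hundreds of dimensions; these toy
# values exist only to make the arithmetic easy to follow.
store = {
    ("ACME", 2021): [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    ("ACME", 2022): [[1.0, 0.1, 0.0], [0.8, 0.2, 0.0]],
    ("ACME", 2023): [[0.1, 0.9, 0.3], [0.0, 1.0, 0.2]],
}

def drift(store, ticker, year, prior_year):
    """Cosine distance between the centroids of two stored filings."""
    current = centroid(store[(ticker, year)])
    previous = centroid(store[(ticker, prior_year)])
    return 1.0 - cosine(current, previous)

# Any pair of stored filings can be compared without re-reading the text.
drift_2022 = drift(store, "ACME", 2022, 2021)  # language barely moved
drift_2023 = drift(store, "ACME", 2023, 2022)  # language moved a lot
```

Because the vectors are computed once and stored, adding a new comparison (a different prior year, a different peer set) is just another lookup and a little arithmetic.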
The critical insight: SVB's language wasn't just negative — it was unusual relative to peer banks that year. To make that comparison, you need every peer bank's sentences embedded in the same vector space. With embeddings, this is a nearest-neighbor search across a pre-built index — fast, offline, cheap. With an LLM, you'd need to somehow send thousands of documents simultaneously and ask "which of these is most different?" That's not how LLMs work.
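The peer comparison can be sketched as a nearest-neighbor lookup. This is a toy illustration, not the production method: the `novelty` function, the labels, and the 2-dimensional vectors are made up, and the brute-force scan stands in for whatever pre-built index a real system would use.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def novelty(sentence_vec, peer_vecs):
    """1 minus the similarity to the closest peer sentence.

    Near 0: some peer says almost the same thing.
    Near 1: no peer says anything similar.
    """
    return 1.0 - max(cosine(sentence_vec, p) for p in peer_vecs)

# Hypothetical sentence vectors pooled from peer filings in the same year.
peer_index = [[1.0, 0.05], [0.9, 0.1], [0.95, 0.0]]

# Hypothetical sentence vectors from the filing being scored.
target = {
    "s1 (peer-like)": [1.0, 0.0],
    "s2 (unusual)": [0.0, 1.0],
}

scores = {label: novelty(vec, peer_index) for label, vec in target.items()}
most_unusual = max(scores, key=scores.get)
```

The scan here is linear in the number of peer sentences. At corpus scale one would presumably swap in an approximate nearest-neighbor index, but the quantity being computed stays the same.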
The score is computed from actual vectors derived from actual sentences in the actual filing. There is no generation step. The score cannot contain information that wasn't in the document, because it's a mathematical operation on the document's content. An LLM summarizing a 10-K might confidently tell you things that are plausible but wrong. Our system can only tell you what the geometry of the document's language looks like.
We process ~5,000 companies with multi-year filing histories — hundreds of thousands of sentences. At LLM API prices, running a cross-sectional analysis on the full corpus would cost hundreds of dollars per run. Embedding models run locally, cost fractions of a cent per document, and complete in minutes. We rebuild the full corpus score cache every 3 hours. That's not economically viable with LLM API calls.
None of this means LLMs are bad — just that they're solving a different class of problem. LLMs are excellent when you need:
The pattern: LLMs are good at single-document tasks where you need flexible language understanding and are okay with probabilistic output. Embeddings are good at cross-document comparison tasks where you need deterministic, geometric reasoning at scale.
We use a sentence transformer from the SBERT family, trained specifically for semantic similarity tasks. It captures meaning well enough to distinguish "we face significant liquidity risk" from "we believe our liquidity position is adequate" — which is exactly the kind of distinction that matters here.
It runs locally in under a second per document on CPU. The entire corpus re-embeds in minutes. It's not the most powerful model in the world — but for this specific task, "most powerful" is not what matters. What matters is stable geometry, fast inference, and consistency across runs.
The broader point: The NLP toolbox has more than one tool. Transformer-based embedding models have been solving similarity and retrieval problems efficiently for years, with properties (determinism, speed, geometric interpretability) that make them well-suited for auditable analytical systems. "Just use an LLM" is often the right call; sometimes it's reaching for a jackhammer when a precise chisel is what the job requires.
One more piece worth explaining: we don't treat every sentence equally. We weight sentences by their inverse document frequency (IDF) across the full corpus — a technique borrowed from classic information retrieval.
In plain terms: a sentence that every company in our corpus uses gets a low weight regardless of its content, because it's boilerplate. A sentence that only one company uses gets a high weight. "We are subject to various risks and uncertainties" appears in essentially every 10-K ever filed, so it carries no signal. "Our held-to-maturity portfolio has unrealized losses of $X, which may require liquidation at a loss to fund deposit withdrawals" is informative precisely when no peer bank is writing it.
IDF weighting is what makes the score sensitive to the specific and unusual, rather than amplifying the routine. An LLM reading a 10-K has no natural way to know which sentences are corpus-wide boilerplate. An embedding model combined with an IDF index does.
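Here is one way sentence-level IDF could be sketched. Several things are assumptions made for illustration: counting a sentence as "used" by a filing when that filing contains a near-duplicate vector, the 0.95 similarity threshold, the plain `log(N / df)` formula, and the `sentence_idf` name and toy vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sentence_idf(vec, corpus, threshold=0.95):
    """IDF weight for one sentence vector.

    corpus maps a filing id to its list of sentence vectors.
    A filing "contains" the sentence if any of its vectors is at
    least `threshold` similar (an assumed near-duplicate rule).
    """
    df = sum(
        1
        for vectors in corpus.values()
        if any(cosine(vec, v) >= threshold for v in vectors)
    )
    return math.log(len(corpus) / max(df, 1))

# Hypothetical corpus of four filings with toy 2-dimensional vectors.
corpus = {
    "bank_a": [[1.0, 0.0], [0.0, 1.0]],
    "bank_b": [[1.0, 0.01]],
    "bank_c": [[0.99, 0.0]],
    "bank_d": [[1.0, 0.02]],
}

boilerplate = corpus["bank_a"][0]  # every filing has a near-duplicate
unusual = corpus["bank_a"][1]      # only bank_a says anything like this

idf_boilerplate = sentence_idf(boilerplate, corpus)  # log(4 / 4)
idf_unusual = sentence_idf(unusual, corpus)          # log(4 / 1)
```

Multiplying each sentence's contribution by a weight like this drives corpus-wide boilerplate toward zero and leaves the rare sentences to dominate the score.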
Questions or pushback on the methodology? We like both. hello@filingdrift.com