
Lazy Prices, Lazy Investors - and the 22% Alpha Hidden in 10-Ks That Nobody Reads

Cohen, Malloy and Nguyen's Lazy Prices paper found that small year-on-year changes in 10-K filings predict large negative returns. Here is what the paper actually says, and how Snowflake Cortex AI and Semantic Views collapse the original eight-year engineering pipeline into an afternoon's work.

On 23 February 2010, Baxter International filed its annual report with the SEC. The stock did not budge. Two months later, the New York Times broke a story about an FDA crackdown on infusion pumps. Baxter fell more than 20% in a fortnight and never recovered. The thing is - Baxter had told everyone it was coming. They just buried it in their 10-K.

We were sitting in the upstairs room at Snowflake's London office last month, at one of the London Quant Group evening seminars, listening to a talk about a paper most of us had heard of but few had actually re-read in years: Cohen, Malloy and Nguyen's Lazy Prices.

The original paper was published in 2020. The result is, by quant-academic standards, almost rude in how big it is. A long-short portfolio that buys companies whose annual reports look like last year's and shorts companies whose annual reports have been quietly rewritten earns somewhere between 30 and 58 basis points per month in value-weighted abnormal returns - up to about 7% a year, risk-adjusted. Drill into changes concentrated in the Risk Factors section and the alpha climbs to 188 basis points per month, or over 22% per year, with a t-statistic of 2.76.

Numbers like that do not normally survive sustained academic scrutiny and replication on a sample that contains every publicly traded firm in the United States. This one does.

What we had not appreciated until that LQG talk was that the entire pipeline - the bit that took Cohen and his co-authors years to build, between FOIA requests to the SEC, raw 10-K parsing, custom diff tooling, and a small army of research assistants - now fits comfortably inside a single Snowflake account, with the LLM work done by Cortex AI and the whole thing exposed as a Semantic View that an analyst can query in plain English.

It is worth thinking about why.

What the paper actually says

Lazy Prices is a behavioural finance result dressed up as a textual-analysis paper. The core claim is simple: investors have stopped reading 10-Ks carefully because 10-Ks have become roughly six times longer in twenty years, and twelve times more textually volatile year-on-year. Loughran and McDonald estimate the average public company's 10-K is downloaded from EDGAR roughly 28 times in the days after filing. Twenty-eight. For the entire investing public.

So when a company changes its 10-K - when management quietly inserts a paragraph about increased FDA scrutiny, or rewrites the Risk Factors section, or stops reassuring you that no further charges related to a particular product are likely - almost nobody notices. The signal is hiding in plain text, but the cost of reading thousands of filings carefully every quarter is so high that the market under-prices it. This is, in the end, a story about a violation of the efficient market hypothesis driven by attention costs rather than information costs.

Cohen, Malloy and Nguyen test the idea with four off-the-shelf textual similarity measures (cosine similarity on bag-of-words term frequencies, Jaccard, minimum edit distance, and a simple word-level diff), all of which give similar answers. They then sort firms each month into quintiles by year-on-year similarity and look at forward returns.
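
For concreteness, the two workhorse measures are just the standard definitions - with w_t the term-frequency vector of this year's filing, w_{t-1} last year's, and A_t, A_{t-1} the corresponding sets of distinct words (our notation, not the paper's exact one):

$$
\mathrm{Sim}^{\text{cosine}}_t = \frac{w_t \cdot w_{t-1}}{\lVert w_t \rVert \,\lVert w_{t-1} \rVert},
\qquad
\mathrm{Sim}^{\text{Jaccard}}_t = \frac{\lvert A_t \cap A_{t-1} \rvert}{\lvert A_t \cup A_{t-1} \rvert}
$$

"Changers" sit in the bottom quintile of these scores each month; "non-changers" in the top.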

The empirical fingerprint is unusual. There is no announcement-day return - investors do not react to the filing. The drift accrues gradually over the next 6 to 18 months, and it does not reverse. That is not the signature of overreaction, or of a typical underreaction story like post-earnings announcement drift, where the price jumps and then drifts in the same direction. This is a population of investors who have simply not bothered to compare this year's text to last year's.

Two more findings stuck with us:

  • 86% of the textual changes in 10-Ks are negative in sentiment. The market's failure to read is not symmetric. Companies that have something good to say tend to broadcast it through other channels; companies that have something bad to say will, if they can, bury it in a Risk Factors update.
  • The effect is strongest among firms that do not include explicit comparative phrases ("compared to last year", "relative to prior year EBITDA"). When management actively draws attention to year-on-year changes, prices respond more or less efficiently. When they do not, it takes the market roughly half a year to figure it out.

That second point is the behavioural mechanism. The information is there. The cognitive cost of finding it is what drives the alpha. In quant trading strategy terms, you can think of the Lazy Prices signal as a textual-similarity factor that captures a slow-moving, attention-driven anomaly.

Why this was hard to replicate in 2019, and why it is not now

If you had wanted to build the Lazy Prices signal yourself in 2019 - and many people did try - you were looking at a multi-month engineering project before you even got to the trading rules.

You needed to: scrape every 10-K and 10-Q from EDGAR back to 1995, strip out the HTML, XBRL, embedded PDFs, exhibits and tables; identify the boundaries of each Item (1A Risk Factors, 7 MD&A, etc.) using regex against wildly inconsistent filing formats; pair each filing with its prior-year equivalent; compute four different similarity measures on documents that can run to 180,000 words; join the result back to CRSP returns, Compustat fundamentals and IBES forecasts; and then, finally, run the actual portfolio sorts.

The textual layer alone is the kind of thing that quietly consumes a graduate student for a year.

What we saw at the LQG talk - and this is the part that genuinely surprised us - is how much of that pipeline now collapses into a Snowflake-native workflow. Specifically, two things have changed since the paper was published:

  1. The data is just there. S&P Global Market Intelligence, Cybersyn and others publish parsed SEC filings as native Snowflake Marketplace shares. You do not ingest them; you query them. The raw text of every 10-K, already chunked and tagged by Item, lives one SELECT away.
  2. The NLP is just there too. Snowflake Cortex has put EMBED_TEXT_1024, VECTOR_COSINE_SIMILARITY, AI_SIMILARITY and COMPLETE behind a SQL function call, charged per token. The LLM infrastructure problem - embedding millions of document chunks, storing them, querying them - has been compressed into something a quant can write in an afternoon (a couple of minimal calls are sketched just below).
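
As a flavour of what "behind a SQL function call" means in practice, here is what a couple of one-off calls might look like - a sketch, with my_paired_filings and its curr_item_text / prev_item_text columns as placeholder names, and the model name subject to what is enabled in your account:

-- Semantic similarity between this year's and last year's text for one Item
SELECT AI_SIMILARITY(curr_item_text, prev_item_text) AS sim
FROM my_paired_filings;

-- Ask an LLM what actually changed, charged per token
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large2',
    'Summarise in two sentences what has been added or removed. '
    || 'PRIOR YEAR: ' || prev_item_text
    || ' CURRENT YEAR: ' || curr_item_text
) AS what_changed
FROM my_paired_filings
LIMIT 10;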

The interesting framing from the talk was not "use Cortex to do the embeddings". That is the obvious bit. The interesting framing was using Semantic Views as the abstraction.

The Semantic View as the alpha layer

A Semantic View in Snowflake is, roughly, a governed metadata layer that sits on top of your tables and turns them into a set of well-defined business concepts: dimensions, measures, relationships, synonyms. It is the thing that lets Cortex Analyst translate "show me companies whose Risk Factors section changed the most last quarter" into actual SQL without hallucinating column names.

The point - and this is the bit worth slowing down on - is that Lazy Prices is fundamentally a question about a comparison between two documents. Not a single forecast, not a single ticker, not a single number. It is "how different is this filing from its prior-year analogue, in this section, weighted this way?"

Once you express that as a Semantic View, every downstream question becomes a one-line query. A taste of what that pipeline looks like in practice:

-- 1. Embed every parsed 10-K Item (Risk Factors, MD&A, etc.)
CREATE OR REPLACE TABLE filings_embedded AS
SELECT
    cik,
    ticker,
    filing_date,
    fiscal_year,
    item_id,                     -- e.g. '1A', '7'
    item_text,
    SNOWFLAKE.CORTEX.EMBED_TEXT_1024(
        'snowflake-arctic-embed-l-v2.0',
        item_text
    ) AS item_embedding
FROM sec_filings.parsed_items
WHERE form_type IN ('10-K', '10-Q');

-- 2. Compute year-on-year similarity per Item, per firm
CREATE OR REPLACE TABLE filings_similarity AS
SELECT
    curr.cik,
    curr.ticker,
    curr.filing_date,
    curr.item_id,
    VECTOR_COSINE_SIMILARITY(
        curr.item_embedding,
        prev.item_embedding
    ) AS sim_cosine,
    curr.fiscal_year
FROM filings_embedded curr
JOIN filings_embedded prev
  ON curr.cik = prev.cik
 AND curr.item_id = prev.item_id
 AND curr.fiscal_year = prev.fiscal_year + 1;

That is the entire Lazy Prices similarity layer for one of the four measures, in two queries, on the full universe of US public companies. No ETL, no parsing, no infrastructure.

The Semantic View on top of this is where it gets interesting:

CREATE OR REPLACE SEMANTIC VIEW lazy_prices_signal AS
  TABLES (
    filings_similarity,
    monthly_returns,
    fundamentals
  )
  DIMENSIONS (
    ticker,
    fiscal_year,
    filing_date,
    item_id WITH SYNONYMS = ('section', 'risk factors', 'MD&A')
  )
  METRICS (
    similarity_cosine AS AVG(sim_cosine),
    forward_alpha_3m  AS ...,
    quintile_rank     AS NTILE(5) OVER (
        PARTITION BY filing_date ORDER BY sim_cosine
    )
  );

Now an analyst - or, more likely now, an LLM-driven agent sitting in front of Cortex Analyst - can ask "What was the value-weighted return of the bottom-quintile Risk Factors changers in 2024?" and get a correct answer, on the entire CRSP-linked universe, without writing any SQL at all.
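
For comparison, here is roughly the SQL that question collapses to if you write it by hand - a sketch that assumes a monthly_returns table carrying a market_cap and a three-month forward abnormal return per ticker-month; those column names are illustrative, not dictated by the Semantic View above:

-- "Value-weighted return of the bottom-quintile Risk Factors changers in 2024"
WITH ranked AS (
    SELECT s.ticker,
           s.filing_date,
           NTILE(5) OVER (PARTITION BY s.filing_date ORDER BY s.sim_cosine) AS quintile
    FROM filings_similarity s
    WHERE s.item_id = '1A'                      -- Risk Factors
      AND YEAR(s.filing_date) = 2024
)
SELECT SUM(r.fwd_abnormal_ret_3m * r.market_cap) / SUM(r.market_cap) AS vw_return
FROM ranked q
JOIN monthly_returns r
  ON r.ticker = q.ticker
 AND r.month  = DATE_TRUNC('month', q.filing_date)
WHERE q.quintile = 1;                           -- lowest similarity = biggest changers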

That is a meaningfully different research workflow from the one Cohen, Malloy and Nguyen had to build by hand.

Where this goes wrong

We do not want to be glib about this. Two things to keep an eye on.

First, the original paper measured similarity on bag-of-words term frequencies. Cortex embeddings measure semantic similarity. These are not the same thing - and in some ways, the embedding-based version is a worse fit for the original behavioural story. If the mechanism is "investors do not notice that the literal words have changed," then a paraphrase of the same content (which a transformer model would correctly score as highly similar) is precisely the kind of change a lazy investor would miss. We would want to run the classic Loughran-McDonald cosine on tokenised text in parallel with VECTOR_COSINE_SIMILARITY and treat any divergence as a research question, not a confirmation.
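
If you wanted to run that comparison without leaving Snowflake, a deliberately crude sketch of the bag-of-words cosine might look like this - simple whitespace tokenisation, no stop-words, no Loughran-McDonald dictionary handling, and with each document's norm computed over its full vocabulary rather than just the overlapping words:

-- Crude bag-of-words cosine per Item, to run alongside the embedding measure
WITH tokens AS (
    SELECT cik, fiscal_year, item_id,
           LOWER(t.value) AS word,
           COUNT(*)       AS tf
    FROM filings_embedded,
         LATERAL SPLIT_TO_TABLE(
             REGEXP_REPLACE(item_text, '[^A-Za-z ]', ' '), ' '
         ) t
    WHERE LENGTH(t.value) > 1
    GROUP BY cik, fiscal_year, item_id, LOWER(t.value)
),
norms AS (   -- each filing's norm, over its full vocabulary
    SELECT cik, fiscal_year, item_id, SQRT(SUM(tf * tf)) AS norm
    FROM tokens
    GROUP BY cik, fiscal_year, item_id
),
dots AS (    -- dot product over the words the two filings share
    SELECT curr.cik, curr.fiscal_year, curr.item_id,
           SUM(curr.tf * prev.tf) AS dot
    FROM tokens curr
    JOIN tokens prev
      ON prev.cik         = curr.cik
     AND prev.item_id     = curr.item_id
     AND prev.fiscal_year = curr.fiscal_year - 1
     AND prev.word        = curr.word
    GROUP BY curr.cik, curr.fiscal_year, curr.item_id
)
SELECT d.cik, d.fiscal_year, d.item_id,
       d.dot / (nc.norm * np.norm) AS sim_bow_cosine
FROM dots d
JOIN norms nc ON nc.cik = d.cik AND nc.item_id = d.item_id
             AND nc.fiscal_year = d.fiscal_year
JOIN norms np ON np.cik = d.cik AND np.item_id = d.item_id
             AND np.fiscal_year = d.fiscal_year - 1;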

Second, this is now a crowded trade. The original paper was published in the Journal of Finance in 2020, has been cited many hundreds of times, and is on every quant hedge fund's reading list. The fact that the cost of replicating it has collapsed by an order of magnitude does not mean the alpha has survived. What it does mean is that the next version of this idea - the cross-language equivalent on European filings, the application to bond covenants, the cross-document version that compares 10-Ks with the corresponding earnings call transcripts and 8-Ks for inconsistencies - is now extremely tractable. S&P's own follow-up work, Questioning the Answers: LLMs enter the Boardroom, is doing exactly that on the transcript side, scoring executives on how on-topic and proactive their answers are. There is a statistical arbitrage flavour to all of this - these are noisy, slow-moving, low-Sharpe signals that need to be combined carefully - but the building blocks are no longer the bottleneck.

The original Lazy Prices result was a paper about investor inattention. The 2026 version is a paper about which fund had the engineering pipeline to act on the signal first.

What this teaches us

The Knight Capital story we wrote about a few weeks ago was a parable about software systems eating risk management. This one is the inverse: data infrastructure eating research.

For most of the history of quant finance, the moat was the data - getting it, cleaning it, parsing it, joining it. The actual statistical idea on top was often surprisingly simple. A good chunk of the original Lazy Prices result is, mathematically, year-on-year cosine similarity sorted into quintiles. The hard part was not the maths. The hard part was the eight years of pipeline plumbing.

That moat is now extremely shallow. The infrastructure that used to take a graduate student a year takes an afternoon. The differentiator is moving back to where it always belonged: the quality of the question you are asking, the rigour of the backtest, the discipline of the risk management around the live signal, and a working understanding of why the alpha exists in the first place. Cohen, Malloy and Nguyen's answer - "because investors are lazy and 10-Ks are long" - is a real economic story, not a statistical artefact, which is why the result has held up.

If you want to break into modern quant work, the technology stack is part of the job now. Understanding how Cortex, Semantic Views, vector databases and the LLM tool-chain fit together with the maths of pricing and the economics of high-frequency trading is no longer optional. The advantage is no longer in having the pipeline. It is in knowing what to point it at.
