The Sharpe Ratio of Pure Noise

We backtested 1,000 strategies that we knew contained no signal at all. More than half the time, the best of them had a Sharpe ratio above 1.0. A simulation study of selection bias, the expected maximum Sharpe ratio, and why a parameter sweep flatters you less than you fear.

This week we backtested 2,000,000 trading strategies. Every one of them was pure noise. We generated the returns ourselves with a random number generator, so we know, with complete certainty, that the true Sharpe ratio of every single strategy is exactly zero.

The best ones still looked brilliant.

Run 1,000 of these noise strategies over ten years of daily data and pick the best, and you get a Sharpe ratio of 1.03 on average. More than half the time (54.5% in our simulation), the winner clears 1.0. One run in twenty, it clears 1.24. Nobody fabricated anything. Nobody even made an error. The only sin committed was looking at more than one backtest and keeping the best.

That number, 1.03 from nothing, is worth sitting with. It is roughly the in-sample Sharpe that gets a strategy taken seriously. It is the kind of number that ends up in a pitch deck.

The experiment

The setup could not be simpler, and that is the point. Each strategy is ten years of daily returns drawn from a normal distribution with zero mean. No signal, no autocorrelation, no regime structure. We compute the annualised Sharpe ratio of each, take the maximum across N strategies, and repeat the whole exercise 2,000 times to get a stable estimate of what "the best backtest on the desk" looks like under the null.

```python
import numpy as np

rng = np.random.default_rng(42)
T = 252 * 10  # ten years of daily returns

best = []
for _ in range(2000):  # repeat the experiment 2,000 times
    rets = rng.standard_normal((1000, T)) * 0.01  # 1,000 noise strategies
    # annualised Sharpe ratio of each strategy (one row per strategy)
    sr = rets.mean(axis=1) / rets.std(axis=1, ddof=1) * np.sqrt(252)
    best.append(sr.max())  # keep only the best backtest of the batch

print(np.mean(best))  # 1.03
```

Varying N, the number of strategies you allow yourself to test, gives the full picture:

| Strategies tested | E[max Sharpe] | 95th percentile |
|---|---|---|
| 1 | 0.01 | 0.53 |
| 10 | 0.48 | 0.82 |
| 50 | 0.71 | 0.96 |
| 100 | 0.79 | 1.03 |
| 500 | 0.96 | 1.17 |
| 1,000 | 1.03 | 1.24 |

The single-strategy row is the only honest one, and even there the 95th percentile is 0.53. With ten years of daily data, a lone backtest of nothing produces a Sharpe above 0.5 one time in twenty. Everything below that row is selection bias compounding on top of sampling error.
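A table of this shape can be reproduced with a small extension of the simulation. The sketch below is ours, not the exact script behind the numbers above: the function name `max_sharpe_table` is hypothetical, it reuses each batch of 1,000 strategies for every N by taking the best of the first N rows, and it defaults to 50 repetitions rather than 2,000 so that it runs in seconds. Expect the second decimal to wobble at that sample size.

```python
import numpy as np

def max_sharpe_table(n_values=(1, 10, 50, 100, 500, 1000), n_reps=50,
                     years=10, seed=42):
    """Mean and 95th percentile of the best Sharpe among N noise strategies."""
    rng = np.random.default_rng(seed)
    T = 252 * years
    n_max = max(n_values)
    best = {n: [] for n in n_values}
    for _ in range(n_reps):
        # One batch of pure-noise strategies: zero mean, 1% daily volatility.
        rets = rng.standard_normal((n_max, T)) * 0.01
        sr = rets.mean(axis=1) / rets.std(axis=1, ddof=1) * np.sqrt(252)
        for n in n_values:
            # "Testing N strategies" = keeping the best of the first N rows.
            best[n].append(sr[:n].max())
    return {n: (float(np.mean(v)), float(np.percentile(v, 95)))
            for n, v in best.items()}

table = max_sharpe_table()
for n, (mean_max, p95) in table.items():
    print(f"{n:>5d}  E[max]={mean_max:5.2f}  p95={p95:5.2f}")
```

Raising `n_reps` towards 2,000 tightens the estimates at the cost of runtime. Sharing one batch across all values of N makes the rows correlated with each other, but each row on its own remains an unbiased estimate.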

Two things stand out. First, the damage is front-loaded: going from one trial to ten buys the noise a Sharpe of 0.48, while going from 100 to 1,000 only adds another 0.24. The expected maximum grows roughly with the square root of the log of N, so each additional order of magnitude of data-mining costs less than the last. Second, these numbers are for ten years of data. Rerun the experiment on three years, the length of many crypto and intraday backtests, and the expected maximum at N=1,000 rises to 1.88. On short histories, noise does not just creep in. It struts.
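The three-year figure can be sanity-checked without rerunning anything. Under the null, the sampling error of an annualised Sharpe ratio shrinks roughly with the square root of the sample length, so, as an approximation, shortening the history from ten years to three stretches the whole distribution of the maximum by the square root of 10/3:

$$
E[\max SR]_{3\,\text{years}} \approx E[\max SR]_{10\,\text{years}} \times \sqrt{\frac{10}{3}} \approx 1.03 \times 1.83 \approx 1.88
$$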

The formula that predicts all of this

None of this required simulation. Bailey and Lopez de Prado (2014) derived the expected maximum Sharpe ratio across N independent trials under the null, using extreme value theory. In annualised form:

$$
E[\max SR] \approx \sqrt{\frac{252}{T}} \left[ (1-\gamma)\, \Phi^{-1}\!\left(1-\frac{1}{N}\right) + \gamma\, \Phi^{-1}\!\left(1-\frac{1}{Ne}\right) \right]
$$

where T is the number of daily observations, γ (gamma) is the Euler-Mascheroni constant (about 0.5772), e is the base of the natural logarithm, and Φ⁻¹ (Phi-inverse) is the standard normal quantile function. The formula needs N of at least 2, because the quantile at 1 − 1/N is not finite for a single trial. For N=1,000 and ten years of daily data the formula gives 1.03. Our brute-force simulation gave 1.03. For N=100 the formula says 0.80 against a simulated 0.79. It is not often that theory and experiment agree to the second decimal place on the first attempt, and it is a small advertisement for how far a little probability theory goes in this business.
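The formula is short enough to evaluate with the Python standard library alone. This is a minimal sketch; the function name `expected_max_sharpe` and its argument names are ours, and it assumes daily data with 252 periods per year, as in the text.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sharpe(n_trials, n_obs, periods_per_year=252):
    """Annualised expected maximum Sharpe across independent noise trials."""
    if n_trials <= 1:
        raise ValueError("the formula needs more than one trial")
    quantile = NormalDist().inv_cdf  # standard normal quantile function
    bracket = ((1 - EULER_GAMMA) * quantile(1 - 1 / n_trials)
               + EULER_GAMMA * quantile(1 - 1 / (n_trials * math.e)))
    return math.sqrt(periods_per_year / n_obs) * bracket

for n in (10, 100, 1000):
    print(n, round(expected_max_sharpe(n, n_obs=2520), 2))
```

For ten years of daily data (`n_obs=2520`) the values for N=100 and N=1,000 should land on the 0.80 and 1.03 quoted above.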

The formula is the engine behind their Deflated Sharpe Ratio: instead of asking "is this Sharpe above zero?", you ask "is this Sharpe above what the best of my N trials would have produced from noise?". The benchmark stops being zero and starts being the table above.
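To make the change of benchmark concrete, here is a simplified sketch of that comparison. It assumes normally distributed returns, so the skewness and kurtosis adjustments of the published Deflated Sharpe Ratio drop out, and it uses the null-hypothesis benchmark from the formula above rather than a cross-trial variance estimated from real backtests. The function names are ours. Treat it as an illustration of the idea, not a reference implementation.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()

def expected_max_sharpe(n_trials, n_obs, periods_per_year=252):
    """Annualised expected maximum Sharpe across independent noise trials."""
    bracket = ((1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
               + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * math.e)))
    return math.sqrt(periods_per_year / n_obs) * bracket

def deflated_sharpe_probability(sr_annual, n_trials, n_obs, periods_per_year=252):
    """Probability-style score that the observed Sharpe beats the noise benchmark.

    Gaussian simplification (assumption): with zero skew and kurtosis of 3,
    the standard error of a per-period Sharpe is sqrt((1 + SR^2 / 2) / (T - 1)).
    """
    scale = math.sqrt(periods_per_year)
    sr = sr_annual / scale  # per-period Sharpe
    sr0 = expected_max_sharpe(n_trials, n_obs, periods_per_year) / scale
    z = (sr - sr0) * math.sqrt(n_obs - 1) / math.sqrt(1 + 0.5 * sr ** 2)
    return _NORMAL.cdf(z)

p = deflated_sharpe_probability(0.9, n_trials=100, n_obs=2520)
print(round(p, 2))
```

Under these assumptions a Sharpe of 0.9 picked from 100 trials over ten years scores only a little above one half: barely distinguishable from the best of 100 noise strategies.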

The catch, and it is a serious one, is that N is the number of trials you ran, not the number you remember running. Every parameter value you tried and discarded, every universe definition, every cost assumption you toggled, every start date you nudged. Researchers are reliably terrible at counting this. Which raises the more practical question: when you sweep 275 parameter combinations, have you really run 275 trials?

275 backtests that count as 10

Here the story takes a turn that surprised us. We ran a second experiment: a moving average crossover system on a simulated random walk with zero drift. Fast lookbacks from 2 to 28 days, slow from 20 to 210, giving 275 valid combinations per price series. Same protocol as before: take the best Sharpe across the sweep, repeat over 500 independent price paths.
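A sweep of this shape can be sketched as follows. Several details are assumptions on our part, because the description above fixes only the lookback ranges and the count: the grid (fast lookbacks 2 to 28 in steps of 2, slow 20 to 210 in steps of 10, keeping pairs with fast below slow) is one choice that yields exactly 275 combinations; the rule is long when the fast average is above the slow one and short otherwise; daily volatility is 1%; and 210 warm-up days are simulated so that every variant trades the same ten years. The sketch runs 20 paths rather than 500 to stay quick.

```python
import numpy as np

FAST = range(2, 29, 2)     # assumed grid: 2, 4, ..., 28
SLOW = range(20, 211, 10)  # assumed grid: 20, 30, ..., 210
GRID = [(f, s) for f in FAST for s in SLOW if f < s]  # 275 valid pairs

def rolling_mean(x, w):
    """Trailing mean of x over w points; entry i covers x[i:i+w]."""
    c = np.concatenate(([0.0], np.cumsum(x)))
    return (c[w:] - c[:-w]) / w

def sweep_one_path(rng, T=2520, warmup=210, vol=0.01):
    """Sharpe ratios and return series of every crossover variant on one path."""
    r = rng.standard_normal(warmup + T) * vol  # zero-drift daily returns
    price = 100.0 * np.exp(np.cumsum(r))       # driftless random walk in logs
    # Moving averages aligned so index 0 is the last warm-up day.
    ma = {w: rolling_mean(price, w)[warmup - w:] for w in set(FAST) | set(SLOW)}
    live = r[warmup:]                          # the T returns actually traded
    # Yesterday's signal applied to today's return: no look-ahead.
    strat = np.array([np.sign(ma[f] - ma[s])[:-1] * live for f, s in GRID])
    sr = strat.mean(axis=1) / strat.std(axis=1, ddof=1) * np.sqrt(252)
    return sr, strat

rng = np.random.default_rng(0)
best, avg_corr = [], []
for _ in range(20):
    sr, strat = sweep_one_path(rng)
    best.append(sr.max())
    c = np.corrcoef(strat)
    n = len(GRID)
    avg_corr.append((c.sum() - n) / (n * (n - 1)))  # mean off-diagonal entry

print(len(GRID), round(float(np.mean(best)), 2), round(float(np.mean(avg_corr)), 2))
```

The second and third printed numbers are the average best Sharpe across the sweep and the average pairwise correlation between variants. Both depend on the assumptions listed above, so they need not match our figures exactly.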

If those 275 backtests were independent trials, the table above says the best should come in around 0.92 on average. The actual result: 0.50.

The resolution is correlation. The average pairwise correlation between the 275 variants in our runs was 0.65. A 10-and-200 crossover and a 12-and-200 crossover are not two ideas; they are one idea wearing two hats. Inverting the Bailey and Lopez de Prado formula, an expected maximum of 0.50 over ten years corresponds to roughly 10 effective independent trials. The sweep nominally tested 275 strategies and effectively tested about ten.
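The inversion has no closed form, but the formula is increasing in N, so a bisection does the job. A minimal sketch follows, treating N as continuous and searching in log space. The function names and the search bounds are our own choices.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()

def expected_max_sharpe(n_trials, n_obs, periods_per_year=252):
    """Annualised expected maximum Sharpe across independent noise trials."""
    bracket = ((1 - EULER_GAMMA) * _NORMAL.inv_cdf(1 - 1 / n_trials)
               + EULER_GAMMA * _NORMAL.inv_cdf(1 - 1 / (n_trials * math.e)))
    return math.sqrt(periods_per_year / n_obs) * bracket

def effective_trials(target_sr, n_obs, lo=1.001, hi=1e9, iters=100):
    """Number of independent trials whose expected maximum equals target_sr."""
    if not expected_max_sharpe(lo, n_obs) <= target_sr <= expected_max_sharpe(hi, n_obs):
        raise ValueError("target Sharpe is outside the bracketed range")
    log_lo, log_hi = math.log(lo), math.log(hi)
    for _ in range(iters):
        mid = 0.5 * (log_lo + log_hi)
        if expected_max_sharpe(math.exp(mid), n_obs) < target_sr:
            log_lo = mid  # too few trials: move the lower bound up
        else:
            log_hi = mid  # enough trials: move the upper bound down
    return math.exp(0.5 * (log_lo + log_hi))

print(round(effective_trials(0.50, n_obs=2520), 1))
```

Fed the observed 0.50 and ten years of daily data, this should return a value close to ten, in line with the estimate above.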

This cuts both ways, and both directions matter.

The comforting direction: a parameter sweep over one strategy family inflates your best Sharpe far less than a naive trial count suggests. If you tested 275 variants of one idea, deflating as if N=275 is too harsh by a wide margin.

The uncomfortable direction: the same logic means you cannot buy safety by gridding more finely. A heatmap of 275 green cells feels like 275 confirmations. It is closer to ten. And the genuinely dangerous trials are the uncorrelated ones, the ones that come from switching strategy families altogether. A researcher who tries momentum, then mean reversion, then a volatility carry strategy, then a seasonality effect, has racked up independent trials at a rate their parameter-sweeping colleague never approached, while feeling far more disciplined. By the numbers above, ten genuinely different ideas inflate your best backtest about as much as 275 variants of one.

Where this goes wrong

The simulation is a caricature, deliberately, and it is worth being precise about what it leaves out.

Real strategy returns are not i.i.d. Gaussian. Fat tails and autocorrelation widen the sampling distribution of the Sharpe ratio, which makes the tables above optimistic: real noise produces bigger flukes than Gaussian noise. Lo (2002) covers the autocorrelation correction, and the statistics behind these estimators matter more than any individual number quoted here.
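As a pointer to what that correction looks like, here is a sketch of the annualisation factor as we understand it from Lo (2002): with serially correlated returns, the usual square-root-of-252 scaling is replaced by a factor that depends on the autocorrelations. The function name is ours, and the AR(1) example with a first-order autocorrelation of 0.1 is purely illustrative, not an estimate from any real strategy.

```python
import math

def lo_annualisation_factor(rho, q=252):
    """Factor that scales a per-period Sharpe to q periods under autocorrelation.

    rho[k - 1] is the lag-k autocorrelation of returns; lags beyond
    len(rho) are treated as zero, and only lags 1..q-1 are used.
    """
    weighted = sum((q - k) * r for k, r in enumerate(rho[: q - 1], start=1))
    return q / math.sqrt(q + 2.0 * weighted)

iid = lo_annualisation_factor([])  # no autocorrelation: plain sqrt(252)
ar1 = lo_annualisation_factor([0.1 ** k for k in range(1, 252)])  # AR(1), phi = 0.1

print(round(iid, 2), round(ar1, 2))
```

With positive autocorrelation the factor falls below the square root of 252, so the naive annualisation overstates the Sharpe ratio. Negative autocorrelation works the other way.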

The effective-trials estimate of ten is specific to our grid, our correlation structure and our random walk. A sweep over a less self-similar family would land somewhere else. The honest statement is directional: effective N sits well below nominal N when variants are correlated, and the gap is large when the average correlation is high.

And the formula only deflates the trials you can count. The deepest selection in published quant research happens before any backtest runs: the choice of asset class, sample period and effect to study is itself conditioned on knowing, roughly, what has worked. No deflation formula reaches that far back. Monte Carlo methods can quantify the selection you did; they are silent on the selection the entire field did for you.

What we actually do with this

Three habits fall out of the arithmetic.

Count your trials, in writing, as you go. Not because the count will be accurate, but because the act of logging changes behaviour. A research log with 40 entries makes it much harder to convince yourself that the winner was your first idea.

Deflate against the right N. Variants within a family are heavily discounted; new families count nearly full price. If your process tried five genuinely different ideas and swept parameters within each, your effective N is closer to five or ten than to the thousands of backtests your optimisation loop executed.

And treat the table above as a price list. Testing strategies costs Sharpe, paid in expectation, whether or not anything works. With ten years of data, 100 trials cost you 0.79 of in-sample Sharpe before a single basis point of real edge enters the picture. If the best thing you found after 100 attempts shows 0.9, you have not found a strategy. You have found the price of looking.

Noise pays a Sharpe ratio of one to anyone willing to run enough backtests. The market, unfortunately, does not.
