
Statistics for Quantitative Trading: Estimation, Testing, and Regression

How to estimate volatility, test whether a strategy works, and build factor models — the statistics that actually get used on trading desks.

From Theory to Data

Probability tells you how the world should behave given a model. Statistics goes the other direction: given data, what can you infer about the world?

Every quant job involves statistics in some form. Estimating expected returns and volatility. Testing whether a trading signal is genuine or just noise. Building regression models to explain asset returns. Understanding the difference between "statistically significant" and "actually profitable."


Estimation: Pinning Down the Numbers

You rarely know the true mean or volatility of an asset's returns. You estimate them from historical data.

Point Estimates

The sample mean estimates expected return:

[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} r_i ]

The sample standard deviation estimates volatility:

[ \hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (r_i - \hat{\mu})^2} ]

(The ( n-1 ) rather than ( n ) is Bessel's correction — it removes a small bias.)
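Both estimators are one-liners with NumPy. A minimal sketch on simulated daily returns (the mean and volatility used here are illustrative, not real market figures):

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulate ~5 years of daily returns: true mean 0.05%/day, true vol 1%/day
returns = rng.normal(loc=0.0005, scale=0.01, size=1260)

mu_hat = returns.mean()            # sample mean
sigma_hat = returns.std(ddof=1)    # ddof=1 applies Bessel's correction (n-1)

# Annualise assuming 252 trading days per year
print(f"Annualised mean: {mu_hat * 252:.2%}")
print(f"Annualised vol:  {sigma_hat * np.sqrt(252):.2%}")
```

Note that volatility is annualised with ( \sqrt{252} ) while the mean scales with 252, because variances add across independent periods.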

Confidence Intervals

A point estimate without uncertainty is dangerous. A 95% confidence interval says: if we repeated this estimation many times, 95% of the intervals would contain the true value.

For the mean: ( \hat{\mu} \pm 1.96 \cdot \frac{\hat{\sigma}}{\sqrt{n}} )

The key insight: the uncertainty shrinks with ( \sqrt{n} ), not ( n ). You need four times as much data to halve the uncertainty. This has real implications — estimating expected returns precisely requires decades of data, which is why quants are much better at estimating volatility (which converges faster) than expected returns.
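The ( \sqrt{n} ) effect is easy to see numerically. A sketch on simulated data (hypothetical return parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def ci_width(n):
    """Width of the 95% confidence interval for the mean, from n simulated daily returns."""
    r = rng.normal(0.0004, 0.01, size=n)
    se = r.std(ddof=1) / np.sqrt(n)    # standard error of the mean
    return 2 * 1.96 * se

w1 = ci_width(1_000)
w4 = ci_width(4_000)
print(f"Width with n observations:  {w1:.6f}")
print(f"Width with 4n observations: {w4:.6f}")   # roughly half, not a quarter
```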


Hypothesis Testing

Hypothesis testing asks: is this effect real, or could it be random noise?

The Framework

  1. Null hypothesis ( H_0 ): the boring explanation (no effect, no alpha, no trend)
  2. Alternative hypothesis ( H_1 ): the interesting claim
  3. Test statistic: a number computed from data
  4. p-value: the probability of seeing data this extreme if ( H_0 ) is true
  5. Decision: reject ( H_0 ) if p-value < significance level (typically 0.05)

In Practice: Testing a Trading Strategy

You have a strategy that returned 8% annually over 5 years. Is that skill or luck?

The t-statistic is approximately:

[ t = \frac{\hat{\mu}}{\hat{\sigma} / \sqrt{n}} ]

If ( |t| > 2 ) (roughly), you reject the null of zero expected return at the 5% level.

But beware: if you tested 100 strategies and picked the best one, you have a multiple testing problem. By chance alone, several will look significant. This is why strategy overfitting is the biggest trap in algorithmic trading.
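The multiple testing trap is easy to demonstrate: generate 100 strategies that are pure noise and compute each one's t-statistic (a minimal simulation, not a backtest of real strategies):

```python
import numpy as np

rng = np.random.default_rng(1)

def t_stat(r):
    """t-statistic for the null hypothesis of zero mean return."""
    return r.mean() / (r.std(ddof=1) / np.sqrt(len(r)))

# 100 "strategies" with zero true edge: 5 years of daily returns each
noise = rng.normal(loc=0.0, scale=0.01, size=(100, 1260))
t_stats = np.array([t_stat(r) for r in noise])

print(f"Best |t| among 100 noise strategies: {np.abs(t_stats).max():.2f}")
print(f"'Significant' at the 5% level:       {(np.abs(t_stats) > 1.96).sum()} of 100")
```

On average about five of the hundred will clear the ( |t| > 2 ) bar by luck alone. Picking the best one and reporting its p-value as if it were a single test is exactly the mistake described above.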


Linear Regression

Regression models the relationship between variables:

[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i ]

The ordinary least squares (OLS) solution minimises the sum of squared errors. In matrix form:

[ \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y} ]

This is linear algebra in action.
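A quick check of the formula on simulated data with known coefficients (illustrative values; in practice you would use `np.linalg.lstsq` or a library rather than forming ( X^T X ) explicitly, for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.5 + 2.0 * x + rng.normal(scale=0.1, size=200)   # true beta0=0.5, beta1=2.0

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
print(beta_hat)   # close to [0.5, 2.0]
```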

The CAPM as a Regression

The Capital Asset Pricing Model says:

[ R_i - R_f = \alpha_i + \beta_i (R_m - R_f) + \epsilon_i ]

Running this regression gives you:

  • Alpha (( \alpha )): excess return not explained by the market — the holy grail
  • Beta (( \beta )): sensitivity to the market — how much the asset moves when the market moves

Factor Models

Extending to multiple factors:

[ R_i = \alpha + \beta_1 F_1 + \beta_2 F_2 + \cdots + \beta_k F_k + \epsilon ]

The Fama-French model uses market, size, and value factors. Modern quant equity strategies use dozens or hundreds of factors.
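A sketch of a three-factor regression on simulated factor returns (the factor series and loadings here are made up for illustration; real Fama-French factor data is published, but fetching it is beyond this snippet):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Three hypothetical daily factor return series (think market, size, value)
F = rng.normal(scale=0.01, size=(n, 3))
true_betas = np.array([1.1, 0.4, -0.3])
r = 0.0001 + F @ true_betas + rng.normal(scale=0.005, size=n)   # asset returns

X = np.column_stack([np.ones(n), F])          # add intercept column
coef, *_ = np.linalg.lstsq(X, r, rcond=None)  # OLS fit
alpha, betas = coef[0], coef[1:]
print(f"alpha: {alpha:.5f}")
print(f"betas: {betas.round(2)}")             # close to [1.1, 0.4, -0.3]
```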


Key Diagnostics

A regression is only as good as its assumptions. The main things to check:

| Check | What It Means | If It Fails |
| --- | --- | --- |
| R-squared | How much variance is explained | Model may be missing factors |
| Residual normality | Errors should be roughly normal | Inference may be unreliable |
| Autocorrelation | Residuals should not be correlated | Standard errors are wrong |
| Heteroscedasticity | Variance should be constant | Use robust standard errors |

In financial data, autocorrelation and heteroscedasticity (changing volatility) are the norm, not the exception. Volatility clustering — big moves follow big moves — is a well-documented stylised fact of financial returns.


Maximum Likelihood Estimation

Beyond OLS, maximum likelihood estimation (MLE) is the other workhorse. The idea: find the parameter values that make the observed data most probable.

For a normal distribution with unknown mean and variance:

[ \hat{\mu}, \hat{\sigma}^2 = \arg\max \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(r_i - \mu)^2}{2\sigma^2}\right) ]

MLE is used to fit GARCH models for volatility, estimate distribution parameters, and calibrate pricing models. It is the backbone of statistical modelling in finance.
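For the normal distribution the maximum sits at the sample mean and the ( 1/n ) sample variance, which makes it a good sanity check for a numerical optimiser. A sketch with scipy on simulated data (using a log-sigma parameterisation to keep sigma positive):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
data = rng.normal(0.001, 0.02, size=2000)

def neg_log_likelihood(params, r):
    """Negative log-likelihood of iid normal data."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)   # optimise log(sigma) so sigma stays positive
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((r - mu) / sigma) ** 2)

result = minimize(neg_log_likelihood, x0=[0.0, np.log(0.01)], args=(data,))
mu_mle, sigma_mle = result.x[0], np.exp(result.x[1])

print(f"MLE mu:    {mu_mle:.6f}  (sample mean:    {data.mean():.6f})")
print(f"MLE sigma: {sigma_mle:.6f}  (1/n sample std: {data.std(ddof=0):.6f})")
```

The same minimise-the-negative-log-likelihood pattern carries over to GARCH and other models where no closed-form solution exists.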


Statistics in Python

Pandas and statsmodels make statistical analysis straightforward:

```python
import statsmodels.api as sm

# CAPM regression: stock excess returns on market excess returns
X = sm.add_constant(market_excess_returns)
model = sm.OLS(stock_excess_returns, X).fit()

print(f"Alpha: {model.params[0]:.4f}")
print(f"Beta: {model.params[1]:.4f}")
print(f"R-squared: {model.rsquared:.4f}")
```

Going Further

Statistics connects probability theory to real-world data analysis. It is the bridge between "here is how the model works" and "here is what the data says."

Quantt covers estimation, testing, and regression with financial datasets and interactive Python exercises — not abstract toy examples, but the actual calculations quant teams perform daily. The full curriculum builds from mathematical foundations through to applied portfolio analysis.

Want to go deeper on Statistics for Quantitative Trading: Estimation, Testing, and Regression?

This article covers the essentials, but there's a lot more to learn. Inside Quantt, you'll find hands-on coding exercises, interactive quizzes, and structured lessons that take you from fundamentals to production-ready skills — across 50+ courses in technology, finance, and mathematics.

Free to get started · No credit card required