From Theory to Data
Probability tells you how the world should behave given a model. Statistics goes the other direction: given data, what can you infer about the world?
Every quant job involves statistics in some form. Estimating expected returns and volatility. Testing whether a trading signal is genuine or just noise. Building regression models to explain asset returns. Understanding the difference between "statistically significant" and "actually profitable."
Estimation: Pinning Down the Numbers
You rarely know the true mean or volatility of an asset's returns. You estimate them from historical data.
Point Estimates
The sample mean estimates expected return:
[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} r_i ]
The sample standard deviation estimates volatility:
[ \hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (r_i - \hat{\mu})^2} ]
(The ( n-1 ) rather than ( n ) is Bessel's correction, which makes the variance estimator unbiased.)
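As a quick sketch (using a made-up array of daily returns standing in for real price data), the two estimators map directly onto NumPy's `mean` and `std`, where `ddof=1` gives the ( n-1 ) denominator:

```python
import numpy as np

# Hypothetical daily returns; in practice these come from price data
returns = np.array([0.003, -0.012, 0.007, 0.001, -0.004, 0.009])

mu_hat = returns.mean()            # sample mean: estimate of expected return
sigma_hat = returns.std(ddof=1)    # sample standard deviation with the n-1 denominator

print(f"Estimated mean:       {mu_hat:.4%}")
print(f"Estimated volatility: {sigma_hat:.4%}")
```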
Confidence Intervals
A point estimate without uncertainty is dangerous. A 95% confidence interval says: if we repeated this estimation many times, 95% of the intervals would contain the true value.
For the mean: ( \hat{\mu} \pm 1.96 \cdot \frac{\hat{\sigma}}{\sqrt{n}} )
The key insight: the uncertainty shrinks like ( 1/\sqrt{n} ), not ( 1/n ). You need four times as much data to halve the uncertainty. This has real implications: estimating expected returns precisely requires decades of data, which is why quants are much better at estimating volatility (which converges faster) than expected returns.
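Continuing the sketch on made-up data, the 95% interval is a one-liner once you have the standard error of the mean:

```python
import numpy as np

# Hypothetical daily returns (same idea as above)
returns = np.array([0.003, -0.012, 0.007, 0.001, -0.004, 0.009])
n = len(returns)

mu_hat = returns.mean()
se = returns.std(ddof=1) / np.sqrt(n)          # standard error of the mean

# 95% confidence interval via the normal approximation (z = 1.96)
ci_low, ci_high = mu_hat - 1.96 * se, mu_hat + 1.96 * se
print(f"95% CI for the mean: [{ci_low:.4%}, {ci_high:.4%}]")
```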
Hypothesis Testing
Hypothesis testing asks: is this effect real, or could it be random noise?
The Framework
- Null hypothesis ( H_0 ): the boring explanation (no effect, no alpha, no trend)
- Alternative hypothesis ( H_1 ): the interesting claim
- Test statistic: a number computed from data
- p-value: the probability of seeing data at least this extreme if ( H_0 ) is true
- Decision: reject ( H_0 ) if p-value < significance level (typically 0.05)
In Practice: Testing a Trading Strategy
You have a strategy that returned 8% annually over 5 years. Is that skill or luck?
The t-statistic is approximately:
[ t = \frac{\hat{\mu}}{\hat{\sigma} / \sqrt{n}} ]
If ( |t| > 2 ) (roughly), you reject the null of zero expected return at the 5% level.
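Here is a minimal sketch of that test, using simulated monthly returns in place of a real track record; SciPy's `ttest_1samp` reproduces the hand-computed statistic and adds a p-value:

```python
import numpy as np
from scipy import stats

# Simulated monthly strategy returns over 5 years (60 observations), purely illustrative
rng = np.random.default_rng(0)
strategy_returns = rng.normal(loc=0.0065, scale=0.03, size=60)

n = len(strategy_returns)
t_stat = strategy_returns.mean() / (strategy_returns.std(ddof=1) / np.sqrt(n))

# One-sample t-test against a null of zero mean return
t_check, p_value = stats.ttest_1samp(strategy_returns, popmean=0.0)
print(f"t-statistic: {t_stat:.2f} (scipy: {t_check:.2f}), p-value: {p_value:.3f}")
```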
But beware: if you tested 100 strategies and picked the best one, you have a multiple testing problem. By chance alone, several will look significant. This is why strategy overfitting is the biggest trap in algorithmic trading.
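A quick simulation makes the trap concrete: generate 100 pure-noise "strategies" with zero true mean and count how many clear the 5% bar anyway (the data here is entirely synthetic).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 100 "strategies" that are pure noise: zero true mean, 60 monthly observations each
noise = rng.normal(loc=0.0, scale=0.03, size=(100, 60))
t_stats, p_values = stats.ttest_1samp(noise, popmean=0.0, axis=1)

significant = (p_values < 0.05).sum()
print(f"'Significant' at the 5% level by luck alone: {significant} out of 100")
```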
Linear Regression
Regression models the relationship between variables:
[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i ]
The ordinary least squares (OLS) solution minimises the sum of squared errors. In matrix form:
[ \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y} ]
This is linear algebra in action.
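As an illustration, here is the normal-equations formula in NumPy on synthetic data; solving ( X^T X \boldsymbol{\beta} = X^T \mathbf{y} ) directly is numerically safer than forming the inverse:

```python
import numpy as np

# Synthetic data: y = 0.5 + 2x + noise
rng = np.random.default_rng(1)
x = rng.normal(size=250)
y = 0.5 + 2.0 * x + rng.normal(scale=0.3, size=250)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (X^T X) beta = X^T y instead of inverting X^T X explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(f"Intercept: {beta_hat[0]:.3f}, slope: {beta_hat[1]:.3f}")
```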
The CAPM as a Regression
The Capital Asset Pricing Model says:
[ R_i - R_f = \alpha_i + \beta_i (R_m - R_f) + \epsilon_i ]
Running this regression gives you:
- Alpha (( \alpha )): excess return not explained by the market — the holy grail
- Beta (( \beta )): sensitivity to the market — how much the asset moves when the market moves
Factor Models
Extending to multiple factors:
[ R_i = \alpha + \beta_1 F_1 + \beta_2 F_2 + \cdots + \beta_k F_k + \epsilon ]
The Fama-French model uses market, size, and value factors. Modern quant equity strategies use dozens or hundreds of factors.
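A sketch of a three-factor regression with statsmodels, using synthetic factor returns and made-up column names ("MKT", "SMB", "HML") in place of real Fama-French data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic monthly factor returns standing in for market, size, and value factors
rng = np.random.default_rng(7)
factors = pd.DataFrame(rng.normal(scale=0.04, size=(120, 3)),
                       columns=["MKT", "SMB", "HML"])

# Hypothetical asset excess returns: loadings on each factor plus a small alpha and noise
loadings = np.array([1.1, 0.4, -0.2])
excess_returns = 0.002 + factors.to_numpy() @ loadings + rng.normal(scale=0.02, size=120)

X = sm.add_constant(factors)
model = sm.OLS(excess_returns, X).fit()
print(model.params)                     # "const" is the estimated alpha; the rest are loadings
print(f"R-squared: {model.rsquared:.3f}")
```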
Key Diagnostics
A regression is only as good as its assumptions. The main things to check:
| Check | What It Means | If It Fails |
|---|---|---|
| R-squared | How much variance is explained | Model may be missing factors |
| Residual normality | Errors should be roughly normal | Inference may be unreliable |
| Autocorrelation | Residuals should not be correlated | Standard errors are wrong |
| Heteroscedasticity | Variance should be constant | Use robust standard errors |
In financial data, autocorrelation and heteroscedasticity (changing volatility) are the norm, not the exception. Volatility clustering, where big moves tend to follow big moves, is a well-documented stylized fact of financial returns.
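The checks in the table map onto standard statsmodels calls. A sketch on synthetic data: the Durbin-Watson statistic for autocorrelation, the Breusch-Pagan test for heteroscedasticity, and HAC (Newey-West) standard errors as the robust fallback:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit a simple OLS on synthetic data as a stand-in for any return regression
rng = np.random.default_rng(3)
x = rng.normal(size=250)
y = 0.5 + 2.0 * x + rng.normal(scale=0.3, size=250)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

# Durbin-Watson near 2 suggests little first-order autocorrelation in the residuals
print(f"Durbin-Watson: {durbin_watson(model.resid):.2f}")

# Breusch-Pagan: a small p-value signals heteroscedasticity
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {bp_pvalue:.3f}")

# If either check fails, refit with HAC (Newey-West) standard errors
robust = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
print(robust.bse)    # robust standard errors
```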
Maximum Likelihood Estimation
Beyond OLS, maximum likelihood estimation (MLE) is the other workhorse. The idea: find the parameter values that make the observed data most probable.
For a normal distribution with unknown mean and variance:
[ (\hat{\mu}, \hat{\sigma}^2) = \arg\max_{\mu,\, \sigma^2} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(r_i - \mu)^2}{2\sigma^2}\right) ]
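A minimal sketch of the same idea done numerically, minimising the negative log-likelihood with SciPy on simulated returns; for the normal case the answer has a closed form (the sample mean and the ( n )-denominator variance), so it should match the direct calculation:

```python
import numpy as np
from scipy.optimize import minimize

# Simulated daily returns, purely illustrative
rng = np.random.default_rng(5)
r = rng.normal(loc=0.001, scale=0.02, size=500)

def neg_log_likelihood(params):
    mu, log_sigma = params                  # optimise log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (r - mu) ** 2 / (2 * sigma**2))

result = minimize(neg_log_likelihood, x0=[0.0, np.log(0.01)])
mu_mle, sigma_mle = result.x[0], np.exp(result.x[1])

print(f"MLE:    mean {mu_mle:.5f}, sigma {sigma_mle:.5f}")
print(f"Sample: mean {r.mean():.5f}, sigma {r.std():.5f}")   # n-denominator std
```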
MLE is used to fit GARCH models for volatility, estimate distribution parameters, and calibrate pricing models. It is the backbone of statistical modelling in finance.
Statistics in Python
Pandas and statsmodels make statistical analysis straightforward:
```python
import statsmodels.api as sm

# CAPM regression
X = sm.add_constant(market_excess_returns)
model = sm.OLS(stock_excess_returns, X).fit()

print(f"Alpha: {model.params[0]:.4f}")
print(f"Beta: {model.params[1]:.4f}")
print(f"R-squared: {model.rsquared:.4f}")
```
Going Further
Statistics connects probability theory to real-world data analysis. It is the bridge between "here is how the model works" and "here is what the data says."
Quantt covers estimation, testing, and regression with financial datasets and interactive Python exercises — not abstract toy examples, but the actual calculations quant teams perform daily. The full curriculum builds from mathematical foundations through to applied portfolio analysis.
Want to go deeper on Statistics for Quantitative Trading: Estimation, Testing, and Regression?
This article covers the essentials, but there's a lot more to learn. Inside Quantt, you'll find hands-on coding exercises, interactive quizzes, and structured lessons that take you from fundamentals to production-ready skills — across 50+ courses in technology, finance, and mathematics.