Why Backtests Lie: A Practitioner’s Guide to Honest Validation

August Quants Research • February 27, 2025 • 7 min read

Nine out of ten beautiful backtests will not survive contact with live capital. The reason is rarely technical: it is statistical and behavioural. This is a field guide to building research you can stake real money on.

Most quantitative research begins with a Sharpe ratio that looks too good to be true. It usually is. The gap between a research backtest and live performance, sometimes called “backtest decay”, has been studied extensively, and the empirical answer is sobering: live Sharpe ratios are routinely half of the backtested number, and sometimes less than a third. The interesting question is not whether decay happens, but what causes it and how to forecast it ex ante.

The four sources of backtest inflation

First, multiple-comparison bias: searching across thousands of feature combinations until something works is a near-certain way to find spurious results. Second, look-ahead bias: information from the future leaks into features that, in production, would not yet be available. Third, unrealistic data and cost assumptions: survivorship bias in datasets, and optimistic assumptions about borrow, fees, slippage and capacity. Fourth, regime overfitting: the strategy implicitly memorises the macro regime of the training period.
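The first source is easy to reproduce. The sketch below is a toy simulation, not a description of any real research process: it generates strategies whose returns are pure Gaussian noise (zero true edge by construction) and reports the best in-sample Sharpe found as the number of candidates grows. The function names, the 1% daily volatility and the 252-day year are illustrative assumptions.

```python
import random
import statistics


def annualised_sharpe(daily_returns, periods_per_year=252):
    """Annualised Sharpe ratio of a return series (risk-free rate assumed zero)."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return (mean / stdev) * periods_per_year ** 0.5


def best_sharpe_among_noise(n_strategies, n_days=252, seed=0):
    """Best in-sample Sharpe among strategies that are pure noise by construction."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_strategies):
        # Zero-mean returns: any positive Sharpe here is luck, not skill.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        best = max(best, annualised_sharpe(returns))
    return best


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} candidates -> best Sharpe {best_sharpe_among_noise(n):.2f}")
```

Because the seed is fixed, each larger search contains the smaller one, so the best Sharpe can only rise as more candidates are tried. None of the candidates has any edge; the winner is selected purely by the search.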

Cross-validation that respects time

Standard k-fold cross-validation is dangerous on financial time series because it mixes past and future: each model is trained partly on observations that come after its test fold, and overlapping labels leak information across the split. Walk-forward validation, purged k-fold and combinatorial purged cross-validation (the latter two codified by Marcos López de Prado) are the institutional standard. They produce more conservative, and far more honest, out-of-sample estimates.
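A minimal sketch of two of these schemes follows, working on integer bar indices only. It assumes each label is computed from the `label_horizon` bars after its observation, and that `embargo` extra bars are dropped after each test fold. Both parameter names and both function names are illustrative, not taken from any particular library, and combinatorial purged cross-validation is not shown.

```python
def walk_forward_indices(n_samples, n_splits=5, label_horizon=0):
    """Expanding-window walk-forward: train only on data before each test fold.

    The last `label_horizon` bars before the test fold are dropped from training,
    because their labels would be computed from bars inside the test fold.
    """
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        start = k * fold_size
        stop = n_samples if k == n_splits else start + fold_size
        train_idx = list(range(0, max(0, start - label_horizon)))
        test_idx = list(range(start, stop))
        yield train_idx, test_idx


def purged_kfold_indices(n_samples, n_splits=5, label_horizon=0, embargo=0):
    """Contiguous k-fold with purging and an embargo.

    A training bar i is kept only if its label window [i, i + label_horizon]
    ends before the test fold starts, or if it begins after the last test
    label window has closed plus `embargo` further bars.
    """
    for k in range(n_splits):
        start = k * n_samples // n_splits
        stop = (k + 1) * n_samples // n_splits
        test_idx = list(range(start, stop))
        train_idx = [
            i for i in range(n_samples)
            if i + label_horizon < start or i >= stop + label_horizon + embargo
        ]
        yield train_idx, test_idx


if __name__ == "__main__":
    for train, test in purged_kfold_indices(20, n_splits=4, label_horizon=2, embargo=1):
        print("test", test[0], "-", test[-1], "| train size", len(train))
```

Unlike walk-forward, purged k-fold still trains on data that comes after the test fold. The purge and embargo are what keep that from leaking test-period information into training.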

Pre-registration and the dignity of failure

Borrowing from clinical research, sophisticated quant teams pre-register hypotheses before running tests. The discipline of writing down what you expect, and what would falsify it, before seeing results, is the single most reliable way to suppress overfitting. It also turns negative results from career risk into intellectual capital, because the strategy was tested cleanly.

FAQ

How many signals can I test before multiple-testing risk dominates?

It depends on signal independence and effect size. A useful heuristic from Bailey and López de Prado is the deflated Sharpe ratio, which adjusts an observed Sharpe for the number of independent strategies tested, as well as for sample length and non-normal returns.
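As a rough illustration, the sketch below follows the commonly cited closed form of the deflated Sharpe ratio. It first estimates the Sharpe a skill-less researcher would expect from the best of `n_trials` independent tries, then asks how likely it is that the observed Sharpe exceeds that benchmark. It assumes per-period (not annualised) Sharpe ratios, non-excess kurtosis (3 for a normal distribution), and that `sharpe_variance` is the variance of the Sharpe estimates across trials. The function and parameter names are illustrative.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sharpe_variance):
    """Approximate expected best Sharpe among n_trials independent zero-skill trials."""
    if n_trials < 2:
        return 0.0  # a single trial involves no selection, so the benchmark is zero
    z = NormalDist()
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe_ratio(sharpe, n_obs, n_trials, sharpe_variance,
                          skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds the selection-adjusted benchmark.

    `sharpe` and `sharpe_variance` are in per-period units; `n_obs` is the
    number of return observations behind the selected strategy.
    """
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    return NormalDist().cdf((sharpe - benchmark) * math.sqrt(n_obs - 1) / denom)


if __name__ == "__main__":
    # Same observed daily Sharpe, different amounts of searching behind it.
    for trials in (1, 10, 100):
        print(trials, round(deflated_sharpe_ratio(0.1, 1250, trials, 0.002), 3))
```

The same observed Sharpe earns a lower score the more candidates were tried to find it. The inputs in the usage example are made up for illustration.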

Is a long backtest history necessarily better?

Not always. Long histories include regimes that may be irrelevant to the present (different market structure, regulation, microstructure). Newer data sometimes deserves more weight, not less.
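One simple way to give newer data more weight without discarding older regimes is exponential decay. This is a minimal sketch of that idea and only one of several possible weighting schemes. The `half_life` parameter (in observations) and the function names are assumptions chosen for illustration.

```python
def exponential_decay_weights(n_obs, half_life):
    """Weights summing to 1, with the newest observation (last index) weighted most.

    An observation `half_life` steps older than another gets half its weight.
    """
    decay = 0.5 ** (1.0 / half_life)
    raw = [decay ** (n_obs - 1 - i) for i in range(n_obs)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_mean(values, weights):
    """Mean of `values` under normalised `weights`."""
    return sum(v * w for v, w in zip(values, weights))


if __name__ == "__main__":
    returns = [0.01, -0.02, 0.015, 0.003, 0.007]  # oldest first, made-up numbers
    weights = exponential_decay_weights(len(returns), half_life=2)
    print(weighted_mean(returns, weights))
```

A short half-life tracks the current regime closely but leaves fewer effective observations. Treat that trade-off as a hypothesis to validate, not a free parameter to tune.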

Tags: backtesting, overfitting, validation, multiple testing

About the author

August Quants Research

The August Quants research desk publishes educational essays on systematic investing, market structure, ML in finance and portfolio construction. We write for institutional readers who value rigour over noise.
