My model for the M6 forecasting competition

The M6 Financial Forecasting Competition had two components: forecasting and investment decisions. Over 12 months (technically, twelve 4-week periods), participants had to:

  • forecast the probability that each of the 100 assets would have returns in the 1st quintile, 2nd quintile, etc. that month. This was the "forecasting" component.
  • build a portfolio from the 100 assets to hold for the month (the "decisions" component).

The forecasting component was scored by ranked probability score; the decisions component was scored by information ratio.
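
For reference, the ranked probability score compares cumulative forecast probabilities against the cumulative (one-hot) outcome. Here's a minimal sketch for a single asset; it ignores the tie-handling in the official rules, and the normalization may differ from the organizers' exact definition:

```python
import numpy as np

def ranked_probability_score(forecast_probs, realized_quintile):
    """RPS for one asset: squared differences between the cumulative forecast
    distribution and the cumulative (one-hot) outcome, averaged over the 5 bins."""
    outcome = np.zeros(5)
    outcome[realized_quintile] = 1.0          # one-hot encoding of the realized quintile (0-4)
    cum_forecast = np.cumsum(forecast_probs)
    cum_outcome = np.cumsum(outcome)
    return np.mean((cum_forecast - cum_outcome) ** 2)

# Example: a forecast that leans toward the middle quintiles
print(ranked_probability_score([0.15, 0.20, 0.30, 0.20, 0.15], realized_quintile=2))
```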

I came in 1st by a hair's breadth in the forecasting component. The margins at the top were incredibly close, and the competition was only 12 months, so this is far from enough evidence to conclude that my model was the best.

In this post, I'll describe my model and thought process.

Initial thoughts

My intuition was that the forecasting component would come down to correctly modeling volatility, not picking winners and losers. I don't believe markets are perfectly efficient, but I also didn't believe a contest like this would attract or unearth a diamond-in-the-rough stock-picking talent.

My next consideration was what frequency to model. For example, I could model daily or weekly returns and just roll that forecast forward for a month. But I quickly decided that I would start with the frequency of the competition, monthly. The problem with modeling at a higher frequency is that any bias in the model will compound over the month. Think of hitting a golf ball: if your angle is off by 1 degree, you'll sink a 1-foot putt but you'll miss a 30-footer. (I did, however, include some features from daily return data - more on this later.)
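
To put a rough number on that intuition (the figures here are made up purely for illustration):

```python
# A small, persistent bias in a daily-return model compounds over the month.
daily_bias = 0.0005                      # 5 basis points per day (hypothetical)
trading_days = 21
monthly_bias = (1 + daily_bias) ** trading_days - 1
print(f"{monthly_bias:.2%}")             # roughly 1% of bias in the monthly forecast
```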

I knew off the bat that I'd use pymc to build a Bayesian model. Bayesian models are built to quantify uncertainty, and probabilistic programming languages like pymc make building complicated Bayesian models easy. Given the challenge was to model the covariance of 100 assets, I landed on probabilistic PCA as the scaffolding for my model. I had used it with some success in other, similar applications; PCA is often used in econometric modeling to reduce the dimensionality of the hundreds of macroeconomic data series available. But I knew I'd need to add some additional bells and whistles for this problem.
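
To make the scaffolding concrete, here is a bare-bones probabilistic PCA in pymc: latent monthly factors and static loadings generate a low-rank mean for the 100 assets, plus idiosyncratic noise. The dimensions, priors, and names are illustrative, not my final specification (the bells and whistles come next):

```python
import numpy as np
import pymc as pm

n_assets, n_months, n_factors = 100, 96, 7  # illustrative dimensions

# Placeholder data standing in for the monthly return matrix
returns = np.random.default_rng(0).normal(0, 0.05, (n_months, n_assets))

with pm.Model() as ppca:
    # Latent monthly factors and static per-asset loadings
    factors = pm.Normal("factors", 0.0, 1.0, shape=(n_months, n_factors))
    loadings = pm.Normal("loadings", 0.0, 0.1, shape=(n_factors, n_assets))
    # Low-rank mean plus idiosyncratic Gaussian noise = probabilistic PCA
    mu = pm.Deterministic("mu", factors @ loadings)
    sigma = pm.HalfNormal("sigma", 0.1, shape=n_assets)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=returns)
```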

Modeling $\mu$

The $\mu_{it}$ for each asset $i$ in month $t$ is the sum of:

  • the asset's loadings times the factors, $w_i \cdot \lambda_t$. I somewhat arbitrarily chose 7 factors, and modeled their dynamics as AR(1). I used static loadings.
  • $\beta_{recent\_returns_i}$ multiplied by the asset's return in the final week of the prior month. These betas were drawn from one of three hierarchical distributions: one for stocks, one for equity ETFs, and one for fixed-income ETFs (hereafter "asset classes").
  • an asset-specific intercept, $\alpha_i$. In practice, these were all near zero because the rest of the model explained most of the variance.

The AR coefficient for the factors was slightly negative, and this puzzled me for a long while. It turns out I had stumbled across the short-term reversal anomaly - see here or here.

The remainder of the model is for the noise around $\mu$, which is symmetric. This means that the only mechanisms by which the model could independently forecast "winners and losers" are the three above: factor dynamics (effectively short-term reversal); returns in the last week of the prior month; and asset-specific intercepts. As a result, the model's forecasts were nearly symmetric: the probability of an asset being in the bottom quintile was close to the probability of it being in the top quintile.
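
Putting those three pieces together, here's roughly what the $\mu$ block looks like in pymc. The names, priors, and dimensions are illustrative rather than my exact specification, and the likelihood is omitted until the noise section below:

```python
import numpy as np
import pymc as pm

n_assets, n_months, n_factors, n_classes = 100, 96, 7, 3
asset_class = np.zeros(n_assets, dtype=int)        # 0=stock, 1=equity ETF, 2=fixed income (placeholder)
last_week_return = np.zeros((n_months, n_assets))  # each asset's return in the final week of the prior month

with pm.Model() as mu_block:
    # Seven latent factors with AR(1) dynamics; static per-asset loadings
    rho = pm.Normal("rho", 0.0, 0.5, shape=1)
    factors = pm.AR("factors", rho=rho, sigma=1.0,
                    init_dist=pm.Normal.dist(0, 1), shape=(n_factors, n_months))
    loadings = pm.Normal("loadings", 0.0, 0.1, shape=(n_assets, n_factors))

    # Recent-return betas, hierarchical by asset class
    beta_mu = pm.Normal("beta_mu", 0.0, 0.1, shape=n_classes)
    beta_sd = pm.HalfNormal("beta_sd", 0.1, shape=n_classes)
    beta = pm.Normal("beta", beta_mu[asset_class], beta_sd[asset_class], shape=n_assets)

    # Asset-specific intercepts (near zero in practice)
    alpha = pm.Normal("alpha", 0.0, 0.01, shape=n_assets)

    mu = pm.Deterministic("mu", (loadings @ factors).T + beta * last_week_return + alpha)
```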

Modeling noise around $\mu$

The observation noise around $\mu$ is t-distributed. The degrees-of-freedom parameter $\nu_i$ is drawn from a hierarchical distribution. The scale $\sigma_{it}$ is the sum of:

  • $\theta_{0i}$, a baseline noise level for each asset. Drawn from one of three hierarchical distributions, one per asset class.
  • $\theta_{recentvol_i}$ times the daily volatility of the asset in the last week of the prior month, normalized at the asset level by its past values. Also drawn from one of three hierarchical distributions.
  • $\theta_{earnings_i}$ times a dummy for whether the asset announced earnings in the month. Drawn from a hierarchical distribution.

That's it, that's the model. I added observation weights to weight the recent past more heavily than the distant past, and fit the model on the last 8 years of data.
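
For concreteness, here's a sketch of the noise block, along with one way the observation weights could enter: as a weighted log-likelihood via pm.Potential. As before, the names, priors, and the exact weighting scheme are illustrative, not my precise implementation:

```python
import numpy as np
import pymc as pm

n_assets, n_months, n_classes = 100, 96, 3
asset_class = np.zeros(n_assets, dtype=int)      # placeholder asset-class labels
recent_vol = np.zeros((n_months, n_assets))      # normalized daily vol in the final week of the prior month
earnings = np.zeros((n_months, n_assets))        # 1 if the asset announces earnings that month
returns = np.zeros((n_months, n_assets))         # observed monthly returns
mu = np.zeros((n_months, n_assets))              # stand-in for the mu block sketched above
obs_weight = 0.97 ** np.arange(n_months)[::-1]   # heavier weight on recent months (illustrative decay)

with pm.Model() as noise_block:
    # Hierarchical degrees of freedom for the Student-t observation noise
    nu_pop = pm.Gamma("nu_pop", alpha=2.0, beta=0.1)
    nu = pm.Gamma("nu", mu=nu_pop, sigma=nu_pop, shape=n_assets)

    # sigma_it = baseline + recent-volatility term + earnings term
    # (coefficients kept nonnegative in this sketch so sigma stays positive)
    theta0_sd = pm.HalfNormal("theta0_sd", 0.05, shape=n_classes)
    theta0 = pm.HalfNormal("theta0", theta0_sd[asset_class], shape=n_assets)
    theta_vol_sd = pm.HalfNormal("theta_vol_sd", 0.05, shape=n_classes)
    theta_vol = pm.HalfNormal("theta_vol", theta_vol_sd[asset_class], shape=n_assets)
    theta_earn_sd = pm.HalfNormal("theta_earn_sd", 0.05)
    theta_earn = pm.HalfNormal("theta_earn", theta_earn_sd, shape=n_assets)

    sigma = pm.Deterministic("sigma", theta0 + theta_vol * recent_vol + theta_earn * earnings)

    # Weighted likelihood: scale each month's log-density by its observation weight
    loglike = pm.logp(pm.StudentT.dist(nu=nu, mu=mu, sigma=sigma), returns)
    pm.Potential("weighted_loglike", (obs_weight[:, None] * loglike).sum())
```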

From posterior samples to a forecast

One benefit of using a tool like pymc to make forecasts is that in many cases, inference (or sampling) is forecasting. Just pass missing data to the observed variable, and pymc performs automatic imputation. The future is, of course, missing, so I passed it in and got 4000 samples from the posterior distribution of the future, conditional on the observed data and on my model being the correct one.
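
In pymc, that looks something like passing a masked array with the future rows masked out. The toy model below is just to show the mechanics, not my actual model:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
past = rng.normal(0.0, 0.05, size=(95, 100))            # observed monthly returns (placeholder)
future = np.full((1, 100), np.nan)                       # the month to forecast
data = np.ma.masked_invalid(np.vstack([past, future]))   # masked entries get imputed automatically

with pm.Model() as toy:
    mu = pm.Normal("mu", 0.0, 0.05, shape=100)
    sigma = pm.HalfNormal("sigma", 0.1, shape=100)
    pm.Normal("returns", mu=mu, sigma=sigma, observed=data)  # pymc imputes the masked (future) row
    idata = pm.sample(1000, chains=4)                    # 4000 posterior draws, including the future month

# The imputed future returns show up as their own variable in the posterior
# (the exact name, e.g. "returns_unobserved" or "returns_missing", depends on the pymc version).
```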

A second benefit revealed itself midway through the competition when one of the 100 assets (DRE) went defunct. The organizers said they would handle it by pretending DRE had a return of zero in each remaining month. This was easy for me to account for: all I did was set DRE's return to zero in each of the 4000 samples before taking the quintile probabilities.
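
Concretely, going from posterior draws to quintile probabilities (and zeroing out a defunct asset) is just a ranking exercise. The column index for DRE below is made up:

```python
import numpy as np

def quintile_probabilities(samples, zero_out_idx=None):
    """samples: (n_draws, n_assets) posterior draws of next-month returns.
    Returns an (n_assets, 5) array of quintile-membership probabilities."""
    samples = samples.copy()
    if zero_out_idx is not None:
        samples[:, zero_out_idx] = 0.0        # e.g. force DRE's return to zero in every draw
    n_draws, n_assets = samples.shape
    # Rank assets within each draw, then map ranks to quintile bins 0..4
    ranks = samples.argsort(axis=1).argsort(axis=1)
    quintiles = (ranks * 5) // n_assets
    probs = np.zeros((n_assets, 5))
    for q in range(5):
        probs[:, q] = (quintiles == q).mean(axis=0)
    return probs

# Hypothetical usage: 4000 draws over 100 assets, with DRE at column 37
draws = np.random.default_rng(2).normal(0, 0.05, size=(4000, 100))
probs = quintile_probabilities(draws, zero_out_idx=37)
```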

Probabilistic judgmental forecast, and a stroke of luck

I baked in the option to input "probabilistic judgmental forecasts" for the forecast period. I could tell the model, for example, that the return on XLP in the forecast period would be 2% with a standard deviation of 2 percentage points. The model would then give me the posterior distribution for all 100 assets conditional on this probabilistic forecast.
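
One way such a view can be encoded (though not necessarily how my implementation did it) is as a noisy pseudo-observation on the latent future return:

```python
import pymc as pm

n_assets, xlp_idx = 100, 0   # hypothetical position of XLP among the 100 assets

with pm.Model() as toy:
    # Latent next-month returns (a stand-in for the full model's forecast block)
    future_returns = pm.Normal("future_returns", 0.0, 0.05, shape=n_assets)
    # The judgmental view: "+2% with a standard deviation of 2 percentage points"
    pm.Normal("xlp_view", mu=future_returns[xlp_idx], sigma=0.02, observed=0.02)
    idata = pm.sample(1000, chains=4)  # posterior for all assets, conditioned on the view
```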

The competition had a trial month before the scores mattered, and I submitted an entry for the trial month without using this probabilistic judgmental forecast function of my model. Then came the first real month of the competition, which started in late February. I naively predicted that Russia's invasion of Ukraine would be over quickly, and put my thumb on the scale in a direction I thought would be consistent with such an outcome. But timezones are hard and I missed the submission deadline by a few hours, so my entry for the trial month carried over to the first official month. My prediction about Russia's invasion was, of course, horribly wrong, and my submission from the trial month actually had a better RPS than the submission I wanted to make.

My takeaway was clear: I cannot predict the future and should not try. For the remainder of the competition, I did not use the probabilistic judgmental forecast function of my model again.

Closing thoughts

The M6 competition had quarterly prizes as well. I did not win any. Indeed, among the top 3 finishers in the year-long forecasting component, none of us placed in any individual quarter, and none of the quarterly winners placed in more than one quarter. It takes many draws to assess the quality of probabilistic forecasting.