My model for the M6 forecasting competition

The M6 Financial Forecasting Competition had two components: forecasting and investing. Over 12 months (technically 12 4-week periods), participants had to:

  • forecast the probability that each of the 100 assets would have returns in the 1st quintile, 2nd quintile, etc. of the cross-section that month. This was the "forecasting" component.
  • build a portfolio from the 100 assets to hold for the month (the "decisions" component).

The forecasting component was scored by ranked probability score; the decisions component was scored by information ratio.

I came in 1st by a hair's breadth in the forecasting component. The margins at the top were incredibly close, and the competition was only 12 months, so this is far from enough evidence to conclude that my model was the best.

In this post, I'll describe my model and thought process.

Initial thoughts

My intuition was that the forecasting component would come down to correctly modeling volatility, not picking winners and losers. I don't believe markets are perfectly efficient, but I also didn't believe a contest like this would attract or unearth a diamond-in-the-rough stock-picking talent.

My next consideration was what frequency to model. E.g., I could model daily or weekly returns, and just roll the daily or weekly forecast forward for a month. But I quickly decided that I would start with the frequency of the competition, monthly. The problem with modeling at a higher frequency is that any bias in the model will compound over the month. Think of hitting a golf ball: if your angle is off by 1 degree, you'll sink a 1 foot putt but you'll miss a 30-footer. (I did, however, include some features from daily return data - more on this later).

I knew off the bat that I'd use pymc to build a Bayesian model. Bayesian models are built to quantify uncertainty, and probabilistic programming languages like pymc make building complicated Bayesian models easy. Given the challenge was to model the covariance of 100 assets, I landed on probabilistic PCA as the scaffolding for my model. I had used it with some success in other, similar applications; PCA is often used in econometric modeling to reduce the dimensionality of the hundreds of macroeconomic data series available. But I knew I'd need to add some additional bells and whistles for this problem.

Modeling $\mu$

The $\mu_i$ for each asset $i$ is the sum of:

  • the asset's loadings times the factors, $w_i * \lambda_t$. I somewhat arbitrarily chose 7 factors, and modeled their dynamics as AR(1). I used static loadings.
  • $\beta_{recent\_returns_i}$ multiplied by the asset's return in the final week of the prior month. These betas were drawn from one of three hierarchical distributions: one for stocks, one for equity ETFs, and one for fixed-income ETFs (hereafter "asset classes").
  • an asset-specific intercept, $\alpha_i$. In practice, these were all near zero because the rest of the model explained most of the variance.

The AR coefficient for the factors was slightly negative, and this puzzled me for a long while. It turns out I had stumbled across the short-term reversal anomaly - see here or here.

The remainder of the model is for the noise around $\mu$, which is symmetric. This means that the only mechanisms by which the model could independently forecast "winners and losers" are the three above: factor dynamics (effectively short-term reversal); returns in the last week of the prior month; and asset-specific intercepts. As a result, the model's forecasts were nearly symmetric: the probability of an asset being in the bottom quintile was close to the probability of it being in the top quintile.

Modeling noise around $\mu$

The observation noise around $\mu$ is t-distributed. The degrees-of-freedom parameter $\nu_i$ is drawn from a hierarchical distribution. $\sigma_{it}$ is the sum of:

  • $\theta_{0i}$, a baseline noise for each asset. Drawn from one of three hierarchical distributions, one per asset class.
  • $\theta_{recentvol_i}$ times the daily volatility of the asset in the last week of the prior month, normalized at the asset level by its past values. Also drawn from one of three hierarchical distributions.
  • $\theta_{earnings_i}$ times a dummy for whether the asset announced earnings in the month. Drawn from a hierarchical distribution.

That's it, that's the model. I added observation weights to weight the recent past more heavily than the distant past, and fit the model on the last 8 years of data.
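To make the structure concrete, here is a minimal pymc sketch of the pieces described above. It is not my competition code: the variable names and fake inputs are illustrative, the priors are placeholders, the AR(1) factor dynamics are replaced by a plain prior, and the recency weighting of observations is omitted.

import numpy as np
import pymc as pm

# Illustrative shapes and fake inputs only -- not the real features.
T, N, K = 96, 100, 7                                   # months, assets, factors
returns = np.random.normal(0, 0.05, (T, N))            # monthly returns
last_week_ret = np.random.normal(0, 0.02, (T, N))      # return in final week of prior month
recent_vol = np.random.gamma(2.0, 0.5, (T, N))         # recent daily vol, normalized per asset
earnings = np.random.binomial(1, 0.3, (T, N))          # earnings-announcement dummy
asset_class = np.random.randint(0, 3, N)               # 0=stock, 1=equity ETF, 2=fixed-income ETF

with pm.Model():
    # mu: factor structure (static loadings x monthly factors) + recent-return beta + intercept.
    # The factors get AR(1) dynamics in the real model; a plain prior keeps this sketch short.
    loadings = pm.Normal('loadings', 0, 0.1, shape=(N, K))
    factors = pm.Normal('factors', 0, 1, shape=(T, K))

    beta_mu = pm.Normal('beta_mu', 0, 0.5, shape=3)            # one level per asset class
    beta_sd = pm.HalfNormal('beta_sd', 0.5, shape=3)
    beta_recent = pm.Normal('beta_recent', beta_mu[asset_class], beta_sd[asset_class], shape=N)

    alpha = pm.Normal('alpha', 0, 0.01, shape=N)               # asset-specific intercepts

    mu = pm.math.dot(factors, loadings.T) + beta_recent * last_week_ret + alpha

    # sigma: baseline + recent-vol + earnings terms (asset-class hierarchies collapsed here)
    theta_0 = pm.HalfNormal('theta_0', 0.1, shape=N)
    theta_recentvol = pm.HalfNormal('theta_recentvol', 0.1, shape=N)
    theta_earnings = pm.HalfNormal('theta_earnings', 0.1, shape=N)
    sigma = theta_0 + theta_recentvol * recent_vol + theta_earnings * earnings

    nu = pm.Gamma('nu', 2, 0.1, shape=N)                       # hierarchical in the real model
    pm.StudentT('obs', nu=nu, mu=mu, sigma=sigma, observed=returns)
    # the real model also down-weights older months; that weighting is not shown here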

From posterior samples to a forecast

One benefit of using a tool like pymc to make forecasts is that in many cases, inference (or sampling) is forecasting. Just pass missing data to the observed variable, and pymc performs automatic imputation. The future is, of course, missing, so I passed it in and got 4000 samples from the posterior distribution of the future, conditional on the data we have observed and, of course, on my model being the correct one.
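Here's a toy illustration of that pattern, using a stand-in model rather than my real one: append a masked row for the unseen month to the observed data, and pymc samples it as a latent variable.

import numpy as np
import pymc as pm

# toy stand-in: 24 observed months for 5 assets, plus one future month of NaNs
rng = np.random.default_rng(0)
history = rng.normal(0, 0.05, size=(24, 5))
with_future = np.vstack([history, np.full((1, 5), np.nan)])
observed = np.ma.masked_invalid(with_future)           # mask the unseen month

with pm.Model():
    mu = pm.Normal('mu', 0, 0.05, shape=5)             # stand-in for the real forecasting model
    sigma = pm.HalfNormal('sigma', 0.05, shape=5)
    pm.Normal('returns', mu, sigma, observed=observed) # masked entries are imputed automatically
    idata = pm.sample(1000)

# draws for the masked month show up in the trace as their own variable
# (named after 'returns' with a missing/unobserved suffix, depending on the pymc version)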

A second benefit revealed itself midway through the competition when one of the 100 assets (DRE) went defunct. The organizers said they would handle it by pretending DRE had a return of zero in each remaining month. This was easy for me to account for: all I did was set DRE's return to zero in each of the 4000 samples before taking the quintile probabilities.
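For completeness, here's roughly what "taking the quintile probabilities" looks like. This is a reconstruction rather than my original code, and it ignores subtleties like ties.

import numpy as np

def quintile_probabilities(samples, assets, defunct=None):
    """samples: (n_draws, n_assets) posterior draws of next month's returns."""
    samples = samples.copy()
    for name in (defunct or []):                       # e.g. ['DRE'] -> force a zero return
        samples[:, assets.index(name)] = 0.0
    n_draws, n_assets = samples.shape
    ranks = samples.argsort(axis=1).argsort(axis=1)    # 0 = worst return within each draw
    quintile = ranks * 5 // n_assets                   # bucket 0..4, worst to best
    return np.stack([(quintile == q).mean(axis=0) for q in range(5)], axis=1)  # (n_assets, 5)

# probs = quintile_probabilities(posterior_draws, asset_names, defunct=['DRE'])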

Probabilistic judgmental forecast, and a stroke of luck

I baked in the option to input "probabilistic judgmental forecasts" for the forecast period. I could tell the model, for example, that the return on XLP in the forecast period would be 2% with a standard deviation of 2 percentage points. The model would then give me the posterior distribution for all 100 assets conditional on this probabilistic forecast.
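One simple way to wire up such a feature is to add the view as an extra noisy observation on the latent return, so that the rest of the assets shift through the covariance structure. A toy sketch of the mechanism (not necessarily how I implemented it), with two correlated assets and a known covariance:

import numpy as np
import pymc as pm

cov = np.array([[0.0025, 0.0015],      # two assets with correlation 0.6
                [0.0015, 0.0025]])

with pm.Model():
    next_month = pm.MvNormal('next_month', mu=np.zeros(2), cov=cov)
    # the probabilistic judgmental forecast: asset 0 will return ~2%, +/- 2 percentage points
    pm.Normal('view', mu=next_month[0], sigma=0.02, observed=0.02)
    idata = pm.sample(1000)

# the posterior for next_month[1] shifts upward too, pulled along by the correlation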

The competition had a trial month before the scores mattered, and I submitted an entry for the trial month without using this probabilistic judgmental forecast function of my model. Then came the first real month of the competition, which started in late February. I naively predicted that Russia's invasion of Ukraine would be over quickly, and put my thumb on the scale in a direction that I thought would be consistent with such an outcome. But timezones are hard and I missed the submission deadline by a few hours, so my entry for the trial month carried over to the first official month. My prediction about Russia's invasion was, of course, horribly wrong, and my submission from the trial month actually had a better RPS than the submission I wanted to make.

My takeaway was clear: I cannot predict the future and should not try. For the remainder of the competition, I did not use the probabilistic judgmental forecast function of my model again.

Closing thoughts

The M6 competition had quarterly prizes as well. I did not win any. Indeed, among the top 3 for the year-long forecasting, none of us placed in any of the quarters. None of the quarterly winners placed more than once. It takes many draws to assess the quality of probabilistic forecasting.

Accounting for Game Situation When Using xG to Evaluate Team Strength

Motivation

In a previous post, I implemented a model of soccer developed by Gianluca Baio and Marta A. Blangiardo. They model the number of goals each team scores in a game as Poisson distributed with latent scoring intensities a function of each team's attacking strength, defending strength, and a home team advantage. One output of the model is estimates of every team's attacking and defending strengths, estimates that can be used to predict the outcomes of future games.

This model had (at least) two weaknesses:

  1. By modeling goals scored, it didn't give any credit for almost-goals-that-weren't. A team got no credit for generating a solid opportunity that would have been a goal if it were not for bad luck.
  2. By modeling only the end-of-game goal tallies, it did not take into account the score differential, let alone the evolution of this differential over the course of the game. A team with 3 goals going into the final quarter of a game will put less effort into scoring if their opponents have 0 goals than if their opponents have 2, 3, or 4.

I considered these problems for a while and turned to American football, which is a much more structured and discretized game than soccer. A game can be broken down into drives or possessions, and data is available on the outcome of each drive. I built a model of the NFL at the drive level, which addressed both of the weaknesses of the soccer model. It addressed the first weakness by giving credit for drive progression, even if the drive didn't end in a touchdown. It addressed the second weakness by using drive-level covariates, such as score differential and time on the clock.

I then turned back to soccer. In the meantime, problem #1 had essentially been solved by 'expected goals' or xG. It was less obvious how to solve problem #2 for soccer. Because soccer is a fluid game, the modeling approach doesn't jump out at you the way it might for a discrete, possession-based game like football.

Before diving into the model, I'd like to motivate issue #2 above a bit more. If we don't control for game situation when evaluating team strength using an outcome measure like points, goals or expected goals, we risk biasing our estimates of team strength upward for the weakest teams and downward for the strongest. This is true across sports. Imagine a strong team blowing out a weak team in the fourth quarter of an NFL game. The strong team puts in their B-team, allowing the weak team to score a couple touchdowns. This result will make the two teams look closer in strength than they really are. Or in soccer, as described in the 'Criticisms' section of the xG article:

A team may score one or two difficult chances early in a game and sit back for the remaining of the 90 minutes, allowing their opponents to take many shots from different positions, thus increasing the opponents xG. One could then claim that the losing team achieved a higher xG therefore deserves the win. This is why xG should always be taken with additional context of the game before creating a verdict.

For this reason, if we want to use xG to evaluate team strength - and to use our strength estimates to predict the outcomes of future games - we need to take game situation into account.

The Model

I decided to approach the problem simply - by breaking up a soccer game into segments. Most segments are 10 minutes long, but the length of the last segment of each half varies depending on injury time.

In [5]:
SEGMENTS = pd.DataFrame([
    {'segment': 0, 'period': 1, 'start': 0, 'end': 10}, # start and end are minutes
    {'segment': 1, 'period': 1, 'start': 10, 'end': 20},
    {'segment': 2, 'period': 1, 'start': 20, 'end': 30},
    {'segment': 3, 'period': 1, 'start': 30, 'end': 40},
    {'segment': 4, 'period': 1, 'start': 40, 'end': None}, # 40-45 + stoppage
    {'segment': 5, 'period': 2, 'start': 45, 'end': 55},
    {'segment': 6, 'period': 2, 'start': 55, 'end': 65},
    {'segment': 7, 'period': 2, 'start': 65, 'end': 75},
    {'segment': 8, 'period': 2, 'start': 75, 'end': 85},
    {'segment': 9, 'period': 2, 'start': 85, 'end': None} # 85-90 + stoppage
])

I then model the total amount of xG generated by a team in a game segment as a function of its attacking strength, the opposing team's defending strength, home team advantage, the segment, the game situation as of the start of the segment, and an interaction between segment and situation. The obvious objection is that the game situation may change partway through the segment, rendering the covariates wrong. I find the model meets its intended purpose despite this weakness.

I graphed a histogram of team xG per game-segment and it looked exponentially distributed, so I modeled observed xG per game segment accordingly:

$$y_i|\lambda_i \sim Exponential(\lambda_i)$$

To be explicit, $i$ indexes segments of individual games, such that $i=0$ is the first 10 minutes of the first game of the season, $i=9$ is the closing minutes of the same game, $i=10$ is the first 10 minutes of the second game of the season, etc.

For the home team:

$$\lambda_i = intercept * (1/segment\_length_i) * e^{home_{h(i)} + attack_{h(i)} + defend_{a(i)} + segment_i + sitch_i + segment\_sitch_i}$$

All covariates are drawn from hierarchical distributions:

$$home \sim Normal(home\_mean, \sigma_{home})$$
$$home\_mean \sim Normal(-.2, 1)$$
$$\sigma_{home} \sim HalfNormal(.4)$$
$$attack \sim StudentT(0, \sigma_{attack}, \nu_{attack})$$
$$\sigma_{attack} \sim HalfNormal(.8)$$
$$\nu_{attack} \sim Gamma(2, .01)$$
$$defense \sim StudentT(0, \sigma_{defense}, \nu_{defense})$$
$$\sigma_{defense} \sim HalfNormal(.8)$$
$$\nu_{defense} \sim Gamma(2, .01)$$
$$segment \sim Normal(0, \sigma_{segment})$$
$$\sigma_{segment} \sim HalfNormal(1)$$
$$sitch \sim Normal(0, \sigma_{sitch})$$
$$\sigma_{sitch} \sim HalfNormal(1)$$
$$segment\_sitch \sim Normal(0, \sigma_{segment\_sitch})$$
$$\sigma_{segment\_sitch} \sim HalfNormal(.5)$$

I bucket game situation into 5 buckets, representing the score differential from the attacking team's POV:

In [1]:
def score_differential_to_sitch(x):
    if x < -1:
        return 0
    elif x == -1:
        return 1
    elif x == 0:
        return 2
    elif x == 1:
        return 3
    elif x > 1:
        return 4

I fit the model in pymc3 using data from the 2017-2018 English Premier League season.
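For reference, here's a minimal sketch of that model in current pymc (the original was fit in pymc3, whose API differs slightly). The long-format index arrays and fake data below are illustrative stand-ins for the notebook's real inputs, and the intercept prior is an assumption the post doesn't state.

import numpy as np
import pymc as pm

# Hypothetical long-format inputs: one row per (game, segment, attacking side).
rng = np.random.default_rng(1)
n_rows, n_teams, n_segments, n_sitches = 400, 20, 10, 5
att_idx = rng.integers(0, n_teams, n_rows)        # attacking team
def_idx = rng.integers(0, n_teams, n_rows)        # defending team
seg_idx = rng.integers(0, n_segments, n_rows)     # game segment 0..9
sitch_idx = rng.integers(0, n_sitches, n_rows)    # score-differential bucket 0..4
is_home = rng.integers(0, 2, n_rows)              # 1 if the attacking side is at home
seg_len = np.where(np.isin(seg_idx, [4, 9]), 12.0, 10.0)  # end-of-half segments run longer
xg = rng.exponential(0.3, n_rows)                 # observed xG in the segment

with pm.Model():
    home_mean = pm.Normal('home_mean', -0.2, 1)
    sigma_home = pm.HalfNormal('sigma_home', 0.4)
    home = pm.Normal('home', home_mean, sigma_home, shape=n_teams)

    sigma_att = pm.HalfNormal('sigma_att', 0.8)
    nu_att = pm.Gamma('nu_att', 2, 0.01)
    attack = pm.StudentT('attack', nu=nu_att, mu=0, sigma=sigma_att, shape=n_teams)

    sigma_def = pm.HalfNormal('sigma_def', 0.8)
    nu_def = pm.Gamma('nu_def', 2, 0.01)
    defend = pm.StudentT('defend', nu=nu_def, mu=0, sigma=sigma_def, shape=n_teams)

    sigma_seg = pm.HalfNormal('sigma_seg', 1)
    segment = pm.Normal('segment', 0, sigma_seg, shape=n_segments)
    sigma_sitch = pm.HalfNormal('sigma_sitch', 1)
    sitch = pm.Normal('sitch', 0, sigma_sitch, shape=n_sitches)
    sigma_seg_sitch = pm.HalfNormal('sigma_seg_sitch', 0.5)
    seg_sitch = pm.Normal('seg_sitch', 0, sigma_seg_sitch, shape=(n_segments, n_sitches))

    intercept = pm.HalfNormal('intercept', 5)     # assumption: the post doesn't give this prior

    # Exponential *rate*: a lower linear predictor means more expected xG,
    # which is why the parameter signs get inverted when charting the results below.
    lam = intercept * (1 / seg_len) * pm.math.exp(
        home[att_idx] * is_home + attack[att_idx] + defend[def_idx]
        + segment[seg_idx] + sitch[sitch_idx] + seg_sitch[seg_idx, sitch_idx])
    pm.Exponential('xg_obs', lam=lam, observed=xg)

    idata = pm.sample(1000)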

Results

The chart below uses the $segment$, $sitch$, and $segment\_sitch$ parameters to show the tendency for an average team to generate xG based on game segment and situation, after controlling for team strengths. I've inverted the parameter signs so that higher values correspond with more xG generation; these are represented by lighter colors. The top half of the chart is lighter than the bottom half because teams that are down tend to generate more xG than those that are up. Also, the 2nd half of the game is generally lighter than the first, as games tend to "open up" in the 2nd half. The lightest square corresponds to the closing minutes of a game, for the team down by only one goal.

In [5]:
Image(os.path.join(CHART_DIR, 'game_sitch.png'))
Out[5]:

And as I mentioned in the motivation section, the danger of ignoring game situation when using xG to evaluate team strength is that you'll get upwardly-biased estimates of the strength of the weakest teams and downwardly-biased estimates of the strength of the strongest teams. To illustrate, I fit the same model on the same data, but without the $segment$, $sitch$, or $segment\_sitch$ parameters. The chart below shows the posterior distributions for the defending strength parameters from the two models. Again, I've inverted the parameter signs so that lower values correspond to stronger xG suppression. As anticipated, the model that ignores game situation tends to overestimate the strength of the weakest teams and underestimate the strength of the strongest.

In [6]:
Image(os.path.join(CHART_DIR, 'xg_suppress.png'))
Out[6]:

The corresponding chart for attacking strength shows the same phenomenon, but to a lesser degree.

In [7]:
Image(os.path.join(CHART_DIR, 'xg_gen.png'))
Out[7]:

Conclusion

Across sports, when using an effort-based outcome measure like goals, shots, or expected goals to measure team strengths, we need to control for the teams' levels of effort. We can't measure effort directly, but we can measure the game situation, which determines effort level. Indeed, we could add covariates to this model that also help determine effort level, such as whether the game was a must-win to avoid relegation. When we don't control for effort, we risk underestimating the strength of the strongest teams because the strongest teams often find themselves in situations where they don't need to give 100%. Similarly, we risk overestimating the strength of the weakest, because they are often in situations where their opponents aren't giving 100%.

A Hierarchical Bayesian Drive-Survival Model of the NFL

Two posts ago, I implemented a Hierarchical Bayesian model of the Premier League. The model, introduced by Gianluca Baio and Marta A. Blangiardo, modeled scoring in soccer as a Poisson process, with the log scoring intensities a linear function of the teams' attacking/defending strengths plus a home field advantage. By fitting the model to the league as a whole, they are able to estimate teams' attacking strengths while 'controlling' for the defending strengths of their opponents, and vice versa.

Since writing that post, I've wanted to reproduce the model for (American) football. But a Poisson process isn't as natural of a fit for football as it is for soccer. It wouldn't capture the fight-for-field-position process of football. Moreover, because football is more structured, we have more data to work with: we can get much more granular than the game-outcome level.

In this post, I describe a model of the football drive as a piecewise exponential competing risks survival model. I then fit an example implementation, embedding the drive model within a Hierarchical Bayesian model of the NFL.

The Piecewise Exponential Competing Risks Survival Model

I learned about the piecewise exponential model via the course materials of Germán Rodríguez, who is based at the Office of Population Research at Princeton. Rodríguez calls it his "favorite model in the context of competing risks," and I'm borrowing heavily from him in this explanation of the model; see here (pdf) and here.

The model assumes there are intervals, each of which has a constant baseline hazard rate. With cutpoints $0 = \tau_0 < \tau_1 < ... < \tau_K = \infty$, the baseline hazard for death by cause $j$ is a step function: $$\lambda_{j0}(t) = \lambda_{jk}, \text{ for } t \in [\tau_{k-1}, \tau_k)$$

Bringing in covariates $x$ and coefficients $\beta_j$, the hazard rate for cause $j$ at time $t$ in interval $k$ is: $$\lambda_j(t,x) = \lambda_{j0}(t)e^{x'\beta_j} = e^{\alpha_{jk} + x'\beta_j}$$ where $\alpha_{jk} = \text{log }\lambda_{jk}$ is the log baseline risk for cause $j$ in interval $k$.

Thus, failures of type $j$ in interval $k$ among people with covariate values $x_i$ are distributed Poisson with mean $$\mu_{ijk} = E_{ik}e^{\alpha_{jk} + x'_i\beta_j} $$ where $E_{ik}$ is the total exposure of people with covariates $x_i$ in interval $k$ (to all causes). In other words, the likelihood function asks, 'how probable is it that we'd see $failures_{ijk}$ in a $Poisson(\mu_{ijk})$ distribution?', for every combination of $i$, $j$, $k$. (Note that this means covariates have to be discretized.)

What makes the model handy for competing risks is that the overall risk is just the sum over the causes: $$\lambda(t,x) = \sum_{j=1}^m\lambda_j(t,x) = \sum_{j=1}^me^{\alpha_{jk} + x'\beta_j} $$

And given an observed death, the conditional probability that the death was by cause $j$ is just $j$'s risk divided by the overall risk: $$\pi_{jk} = \frac{e^{\alpha_{jk} + x'\beta_j}}{\sum_{r=1}^me^{\alpha_{rk} + x'\beta_r}}$$
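A quick numeric sketch of these quantities, with made-up hazards for the two drive-death causes used later in the post:

import numpy as np

# made-up per-yard log baseline hazards: 2 causes x 3 field intervals
alpha = np.array([[-3.4, -3.6, -3.2],     # cause 0: punt / field goal / failed 4th down
                  [-5.5, -5.2, -5.0]])    # cause 1: turnover
xbeta = np.array([0.1, -0.2])             # x'beta_j for one covariate profile, per cause

hazard = np.exp(alpha + xbeta[:, None])   # lambda_j(t, x), constant within each interval
overall = hazard.sum(axis=0)              # lambda(t, x): total hazard per interval
pi = hazard / overall                     # P(cause j | a death in interval k)

E = np.array([120.0, 900.0, 300.0])       # total exposure yards in each interval for this profile
mu = E * hazard                           # mu_jk = E_k * exp(alpha_jk + x'beta_j)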

Application to Football: Yards Instead of Years

Survival analysis is usually concerned with an entity's survival in time. Here, we'll look at a drive's survival in yards. Drives that move backwards will be counted as 0-yard drives.

Application to Football: The Intervals-As-Regions-On-the-Field Trick

In a conventional application of the piecewise exponential survival model, the constant-baseline-risk intervals are age intervals - e.g. 0-1 months, 1-3 months, 3-6 months, etc. That is, they are relative to the entity whose survival is being modeled.

But there's no reason these intervals have to be relative to the entity. In the case of the football drive, they can represent fixed regions of the football field itself.

The idea is to pick areas on the field that have a similar baseline hazard rate. Intuitively, this means picking areas on the field within which an offense or a defense is going to think similarly, especially with regard to risk averseness. So, for example, I defined the following intervals, where yards 51 and up indicate the opponent's territory:

  1. 0-13 yards: offense pinned against their own goal line, so it has to call quick-developing plays
  2. 13-75 yards: the 'normal' zone
  3. 75-100 yards: 'extended red-zone'; offense thinks 'worst case, we kick a field goal' and focuses on ball control; defense has a very small area to worry about

So if a drive starts on the 10 yard line and wants to survive to the endzone, it has to survive the interval-0 hazard rate for 3 yards, the interval-1 rate for 62 yards, and the interval-2 rate for 25 yards. If a drive starts on the 77, it only has to survive the interval-2 hazard rate for 23 yards. And so on.

Intuitively, it's like a drive has to run a gauntlet, with a difficulty that varies by zone. To be clear, the gauntlet is such that each zone isn't pass/fail, but rather death-by-exposure, so that it can occur anywhere within the zone - indeed, it's more likely to occur earlier in the zone than later.
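A small helper (not from the original code) that reproduces the arithmetic of the example above, splitting a drive's yardage into per-interval exposures:

import numpy as np

CUTPOINTS = [0, 13, 75, 100]   # the example intervals, in yards from the offense's own goal line

def exposure_by_interval(start, end):
    """Yards of exposure a drive accumulates in each interval, from where it started
    to where it died (use end=100 for a drive that reached the end zone)."""
    return np.array([max(0, min(end, hi) - max(start, lo))
                     for lo, hi in zip(CUTPOINTS[:-1], CUTPOINTS[1:])])

exposure_by_interval(10, 100)   # started on the 10, scored: [3, 62, 25]
exposure_by_interval(77, 100)   # started on the 77, scored: [0, 0, 23]
exposure_by_interval(20, 46)    # started on the 20, died after 26 yards: [0, 26, 0]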

Application to Football: Competing Risks for a Drive

A drive can die by 'running out of steam' - a punt, field goal attempt, unsuccessful 4th down attempt - or by a turnover. While there's no requirement to do so, I model these two causes as competing risks because they have different consequences for the starting field position of the other team. Also, the piecewise baseline hazard rates probably vary: imagine a team focusing more on ball control as it gets closer to the endzone, knowing that as long as they retain the ball, they can kick a field goal.

Application to Football: Covariates

Covariates can include team-specific parameters (attacking strength, home field advantage), team-interval-specific parameters (red zone attacking strength), drive-situation-specific parameters (weather; game situation, e.g. is the offense losing badly, or down by just a score?), and game-specific parameters (importance). This flexibility is a key strength of the model.

Weaknesses

As an approach to modeling football, it has (at least) these weaknesses:

  • it makes no distinction between run vs. pass
  • it cannot represent drives that end 'before' they began - e.g. a drive starting on the 20 but ending on the 10 because of a sack or penalty. No safeties.
  • it doesn't include special teams, although this could be remedied by adding a special-teams model
  • it doesn't account for clock management

Example Implementation

In the remainder of the post, I show the results of an example implementation of the model. I've embedded the drive model within a Hierarchical Bayesian framework, so that team-specific parameters are drawn from common distributions.

I fit the model to 2014 regular season data from NFLsavant.

Intervals

I used the following constant-baseline-risk intervals. Note that yards 50+ are in the opponent's territory.

  1. 0-13 yards
  2. 13-75 yards
  3. 75-100 yards

Covariates: Drive Death by Punt/Field Goal

I used team-specific attacking/defending 'strength' covariates. My first attempts suffered from overshrinkage (see my earlier post for a discussion of shrinkage in Hierarchical Bayesian models), so I tried the technique used by Baio/Blangiardo in their second model of Serie A:

One possible way to avoid this problem is to introduce a more complicated structure for the parameters of the model, in order to allow for three different generating mechanism, one for the top teams, one for the mid-table teams, and one for the bottom-table teams. Also, in line with Berger (1984), shrinkage can be limited by modelling the attack and defense parameters using a non central t (nct) distribution on ν = 4 degrees of freedom instead of the normal of § 2.

Using $t$ to indicate team, first we define latent group parameters $grp^{att}(t)$ and $grp^{def}(t)$, with Dirichlet(1,1,1) priors. Team specific attacking and defending strengths are modelled as: $$att_t \sim nct(\mu_{grp(t)}^{att}, \tau_{grp(t)}^{att}, \nu)$$ $$def_t \sim nct(\mu_{grp(t)}^{def}, \tau_{grp(t)}^{def}, \nu)$$

To associate the latent groups with a performance tier, the priors on the $\mu$s constrain them to be positive/negative. In this example specification, a negative coefficient means a lower hazard (longer drive), while a positive coefficient means a higher hazard (shorter drive). So for the 'top' group, the priors are: $$\mu_1^{att} \sim truncNormal(0, 0.001, -3, 0)$$ $$\mu_1^{def} \sim truncNormal(0, 0.001, 0, 3)$$

And for the 'bottom' group, the priors are: $$\mu_3^{att} \sim truncNormal(0, 0.001, 0, 3)$$ $$\mu_3^{def} \sim truncNormal(0, 0.001, -3, 0)$$

I fix $\mu_2^{att}$ and $\mu_2^{def}$ to 0, unlike Baio/Blangiardo, who draw them from a Normal distribution.

Each set ($att$s and $def$s) is subject to a sum-to-zero constraint, e.g. $\sum_{t=1}^Tatt_t = 0$.

These covariates are used in all zones:

  • Team-specific home field advantage, applied when the defending team is home. Home parameters are drawn from $Normal(\mu_{home}, \sigma_{home})$. Unlike the other team-specific parameters, $home_t$ is not subject to a sum-to-zero constraint. (In future work, I'll try giving the attacking team a home advantage as well).
  • $offense\_winning\_greatly$, a binary flag indicating whether the team possessing the ball is winning by more than 16 points.
  • $offense\_losing\_badly$, a binary flag indicating whether the team possessing the ball is losing by more than 16 points.
  • $two\_minute\_drill$, a binary flag indicating whether the drive began in the last two minutes of a half.

Covariates: Drive Death by Turnover

Because we observe fewer deaths-by-turnover, we should cut down on the covariates to make sure the model is identifiable. With that in mind, the covariates I used in this implementation are:

  • Team specific ball-retention abilities $att_t \sim Normal(0, \sigma_{att})$
  • Home field advantage, applied when the defending team is home. Note that it's not team-specific here.
  • $offense\_winning\_greatly$, a binary flag indicating whether the team possessing the ball is winning by more than 16 points.
  • $offense\_losing\_badly$, a binary flag indicating whether the team possessing the ball is losing by more than 16 points.
  • $two\_minute\_drill$, a binary flag indicating whether the drive began in the last two minutes of a half.

Assessing Model Fit with Simulation

In order to simulate a game, we have to make some assumptions. The major assumptions are:

  • Special Teams

    • There are no kickoff returns; all kickoffs result in a touchback.
    • If a drive dies by punt/field goal, the team on offense kicks a field goal if they're past some point on the field (I used 66), and all field goal attempts are good.
    • A punt and its return always net out to some constant number of yards (I used 38).
  • The Clock

    • The clock time elapsed by a drive is a linear function of the drive distance.
    • Touchdowns, field goals, and punts all take a certain constant amount of time off the clock.
  • Ties

    • Ties are decided by coin flip.

The ties assumption is easily fixed, but I haven't had the chance yet. Simulation code is here.
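The core sampling step is simple enough to sketch here: draw how far the drive survives under each interval's total hazard, and if it dies, pick the cause in proportion to the cause-specific hazards. The hazard numbers below are made up; the real code linked above handles the rest of the bookkeeping.

import numpy as np

rng = np.random.default_rng(2)
CUTPOINTS = np.array([0.0, 13.0, 75.0, 100.0])

def simulate_drive(start, hazards):
    """hazards: (n_causes, n_intervals) per-yard rates. Returns (yardline_of_death, cause),
    or (100.0, None) if the drive survives to the end zone."""
    total = hazards.sum(axis=0)
    y = float(start)
    for k in range(len(CUTPOINTS) - 1):
        lo, hi = CUTPOINTS[k], CUTPOINTS[k + 1]
        if y >= hi:
            continue                                  # drive started beyond this interval
        gained = rng.exponential(1.0 / total[k])      # yards survived at this interval's rate
        if y + gained < hi:                           # died inside interval k
            cause = rng.choice(len(hazards), p=hazards[:, k] / total[k])
            return y + gained, cause
        y = hi                                        # survived into the next interval
    return 100.0, None                                # touchdown

simulate_drive(20, np.array([[0.035, 0.030, 0.040],   # made-up 'out of steam' hazards
                             [0.006, 0.005, 0.005]])) # made-up turnover hazards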

After fitting the example implementation on 2014 regular season data, and using these assumptions, I used the model to simulate the 2014 season 900 times. For each simulated season, I drew a set of parameter values from the posterior distribution.

The model + simulation recreates some macro-level features of football, such as the distribution of drive starting points. I think the discrepancy here is largely driven by my assuming away special teams, although again, this could be remedied by a sub-model.

In [7]:
Image(CHART_DIR + 'drives_by_starting_region.png')
Out[7]:

Drive outcome by starting point is similarly close, although the simulations have a slight bias toward drive success when a drive begins after the 50. This could be a result of a small sample size of observed drives (see chart above). Alternatively, it could be drive momentum. Of the drives that make it to the red zone, the vast majority started between the 20 and 40 (see chart above), so these drives could be pulling the baseline redzone hazard rate downward. It would be interesting to include a drive-age covariate to see if this is real.

In [8]:
Image(CHART_DIR + 'drive_outcome_by_starting_region.png')
Out[8]:

Same chart, wider regions, same bias:

In [9]:
Image(CHART_DIR + 'drive_outcome_by_starting_region_wider_regions.png')
Out[9]:

We can compare the actual season-level stats with the results from the simulations. We see some shrinkage, even with the three-tier hierarchy generating the team-specific parameters. Another interpretation would be that teams with a large differential between simulated and actual wins just got lucky or unlucky: Washington, for example, lost some close games.

In [10]:
Image(CHART_DIR + 'wins_actual_simulated.png')
Out[10]:

Offensive yards:

In [11]:
Image(CHART_DIR + 'yards_actual_simulated.png')
Out[11]:

We can also look at the underlying parameters being fed into the simulation, like the attack strength parameter. Some things to keep in mind when looking at this chart:

  • Negative value = lower drive hazard = better attack
  • Y-axis units are roughly '% better/worse than mean team'
  • Strength of defense, home field advantage, and drive context (large score differential) are being controlled for
In [12]:
Image(CHART_DIR + 'attacking_strength_with_intervals.png')
Out[12]:
In [13]:
Image(CHART_DIR + 'defending_strength_with_intervals.png')
Out[13]:

Obligatory Deflategate Chart: Turnover Propensity

There's been some discussion about whether the Patriots' low giveaway numbers are so improbable as to be evidence of something fishy. My model finds no evidence that NE's 2014 giveaway propensity is an outlier relative to the other teams.

The intervals on this chart are noticeably wider than on the attacking 'quality' chart above. This is because turnovers are extremely rare relative to 'natural' drive deaths like punts and field goals, so we have a lot less to go on.

The obvious next step is to include multiple years' worth of data.

In [15]:
Image(CHART_DIR + 'turnover_propensity_with_intervals.png')
Out[15]:

Simulated Win Rate Against Median Team

To assess a team's overall quality, I simulate each team playing a hypothetical 33rd team, the Median Team, 1000 times. Win rates are charted below. Keep in mind that the model hasn't seen the postseason.

In [16]:
Image(CHART_DIR + 'simulated_win_rate_against_median_team.png')
Out[16]:

Strength of Schedule

I estimate strength of schedule by having the median team play each team's schedule 500 times, and looking at the mean number of losses by schedule. Poor Oakland.

These results are comparable with the prediction machine.

In [17]:
Image(CHART_DIR + 'strength_of_schedule.png')
Out[17]:

Obligatory Superbowl Prediction

My implementation of the model is clearly hot on SEA over NE. Keep in mind that it hasn't seen the postseason. That said, when I simulate a neutral-field match up 50000 times, SEA wins 58% of the time.

Appendix: PyMC code for the Ex-Turnover model

In [ ]:
import os
import sys


CHART_DIR = os.path.join(os.getcwd(), 'charts/')
sys.path.append(os.getcwd())

import math
import warnings

warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import pymc as pm

CODE_DIR = os.path.join(os.getcwd(), 'code/')
DATA_DIR = os.path.join(os.getcwd(), 'data/')
DPI = 300
WIDECHARTWIDTH = 10
WIDECHARTHEIGHT = 6
SAVECHARTS = False


from nfl_hierarchical_bayes import data_prep, simulation_pwexp, elapsed_time, REDZONE_PIECE



# three tiers for attack/defense parameters
# no redzone breakout

df = pd.read_csv(DATA_DIR + 'pbp-2014-bugfixed.csv')
df = data_prep.enrich_play_level_df(df)
df = data_prep.merge_in_game_level_dataset(df, data_prep.GAME_LEVEL_DATASET_2014)
df = data_prep.calculate_game_score_at_play_start(df)
df_drive = data_prep.generate_drive_df(df)
df_drive, teams = data_prep.index_with_team_indexes(df_drive)
df_drive = data_prep.remove_unexplained_drives(df_drive)
df_drive = data_prep.enrich_drive_level_df(df_drive)

print 'Dropping %s drives due to their beginning with <30 seconds left in half' % \
      df_drive[(df_drive.thirty_seconds_drill)].shape[0]
df_drive = df_drive[~(df_drive.thirty_seconds_drill)]

print 'Dropping %s drives due to the ending with qb kneel' % df_drive[(df_drive.end_qb_kneel)].shape[0]
df_drive = df_drive[~(df_drive.end_qb_kneel)]

df_pw = data_prep.generate_piecewise_df(df_drive)
df_counts = data_prep.generate_piecewise_counts_df(df_pw)

observed_drive_deaths_ex_turnover = df_counts.deaths_ex_turnover.values
observed_exposures = df_counts.exposure_yards.values
piece_i = df_counts.piece_i.values
red_zone = (df_counts.piece_i == REDZONE_PIECE).astype(int).values
not_red_zone = (df_counts.piece_i != REDZONE_PIECE).astype(int).values
attacking_team = df_counts.i_attacking.values
defending_team = df_counts.i_defending.values
defending_team_is_home = df_counts.defending_team_is_home.values
offense_is_losing_badly = df_counts.offense_losing_badly.astype(int).values
offense_is_winning_greatly = df_counts.offense_winning_greatly.astype(int).values
drive_is_two_minute_drill = df_counts.two_minute_drill.astype(int).values
num_teams = len(df_counts.i_home.unique())
num_obs = len(drive_is_two_minute_drill)
num_pieces = len(df_counts.piece_i.unique())

observed_drive_deaths_turnover = df_counts.deaths_turnover.values

g = df_counts.groupby('piece_i')
baseline_starting_vals = g.deaths_ex_turnover.sum() / g.exposure_yards.sum()



def ex_turnover_piecewise_exponential_model():

    NCT_DOF = 4

    # hyperpriors for team-level distributions
    std_dev_att1 = pm.Uniform('std_dev_att1', lower=0, upper=50)
    std_dev_def1 = pm.Uniform('std_dev_def1', lower=0, upper=50)
    std_dev_att2 = pm.Uniform('std_dev_att2', lower=0, upper=50)
    std_dev_def2 = pm.Uniform('std_dev_def2', lower=0, upper=50)
    std_dev_att3 = pm.Uniform('std_dev_att3', lower=0, upper=50)
    std_dev_def3 = pm.Uniform('std_dev_def3', lower=0, upper=50)

    mu_att1 = pm.TruncatedNormal('mu_att1', 0, .001, -3, 0, value=-.2)
    mu_def1 = pm.TruncatedNormal('mu_def1', 0, .001, 0, 3, value=.2)
    mu_att3 = pm.TruncatedNormal('mu_att3', 0, .001, 0, 3, value=.2)
    mu_def3 = pm.TruncatedNormal('mu_def3', 0, .001, -3, 0, value=-.2)

    pi_att = pm.Dirichlet("grp_att", theta=[1,1,1])
    pi_def = pm.Dirichlet("grp_def", theta=[1,1,1])

    #team-specific parameters
    group_att = pm.Categorical('group_att', pi_att, size=num_teams)
    group_def = pm.Categorical('group_def', pi_def, size=num_teams)

    @pm.deterministic
    def mu_atts(group_att=group_att,
                mu_att1=mu_att1,
                mu_att3=mu_att3):
        mus_by_group = np.array([mu_att1, 0, mu_att3])
        return mus_by_group[group_att]

    @pm.deterministic
    def mu_defs(group_def=group_def,
                mu_def1=mu_def1,
                mu_def3=mu_def3):
        mus_by_group = np.array([mu_def1, 0, mu_def3])
        return mus_by_group[group_def]

    @pm.deterministic
    def tau_atts(group_att=group_att,
                std_dev_att1=std_dev_att1,
                std_dev_att2=std_dev_att2,
                std_dev_att3=std_dev_att3):
        taus_by_group = np.array([std_dev_att1**-2, std_dev_att2**-2, std_dev_att3**-2])
        return taus_by_group[group_att]


    @pm.deterministic
    def tau_defs(group_def=group_def,
                std_dev_def1=std_dev_def1,
                std_dev_def2=std_dev_def2,
                std_dev_def3=std_dev_def3):
        taus_by_group = np.array([std_dev_def1**-2, std_dev_def2**-2, std_dev_def3**-2])
        return taus_by_group[group_def]

    atts_star = np.empty(num_teams, dtype=object)
    defs_star = np.empty(num_teams, dtype=object)

    for i in range(num_teams):
        atts_star[i] = pm.NoncentralT("att_%i" % i, mu=mu_atts[i], lam=tau_atts[i], nu=NCT_DOF)
        defs_star[i] = pm.NoncentralT("def_%i" % i, mu=mu_defs[i], lam=tau_defs[i], nu=NCT_DOF)

    # home
    mu_home = pm.Normal('mu_home', 0, .0001, value=-.01)
    std_dev_home = pm.Uniform('std_dev_home', lower=0, upper=50)

    @pm.deterministic(plot=False)
    def tau_home(std_dev_home=std_dev_home):
        return std_dev_home**-2

    home = pm.Normal('home',
                     mu=mu_home,
                     tau=tau_home, size=num_teams, value=np.zeros(num_teams))

    # priors on coefficients
    baseline_hazards = pm.Normal('baseline_hazards', 0, .0001, size=num_pieces, value=baseline_starting_vals.values)
    two_minute_drill = pm.Normal('two_minute_drill', 0, .0001, value=-.01)
    offense_losing_badly = pm.Normal('offense_losing_badly', 0, .0001, value=-.01)
    offense_winning_greatly = pm.Normal('offense_winning_greatly', 0, .0001, value=.01)


    # trick to code the sum to zero contraint
    @pm.deterministic
    def atts(atts_star=atts_star):
        atts = [float(i) for i in atts_star]
        atts = atts - np.mean(atts)
        return atts

    @pm.deterministic
    def defs(defs_star=defs_star):
        defs = [float(i) for i in defs_star]
        defs = defs - np.mean(defs_star)
        return defs


    @pm.deterministic
    def mu_ijk(attacking_team=attacking_team,
                   defending_team=defending_team,
                   defending_team_is_home=defending_team_is_home,
                   two_minute_drill=two_minute_drill,
                   drive_is_two_minute_drill=drive_is_two_minute_drill,
                   offense_losing_badly=offense_losing_badly,
                   offense_is_losing_badly=offense_is_losing_badly,
                   offense_winning_greatly=offense_winning_greatly,
                   offense_is_winning_greatly=offense_is_winning_greatly,
                   home=home,
                   atts=atts,
                   defs=defs,
                   baseline_hazards=baseline_hazards,
                   observed_exposures=observed_exposures,
                   piece_i=piece_i):
        return  observed_exposures * baseline_hazards[piece_i] * \
                    np.exp(   home[defending_team] * defending_team_is_home + \
                              two_minute_drill * drive_is_two_minute_drill + \
                              offense_losing_badly * offense_is_losing_badly + \
                              offense_winning_greatly * offense_is_winning_greatly + \
                              atts[attacking_team] + \
                              defs[defending_team])


    drive_deaths = pm.Poisson("drive_deaths", mu_ijk,
                              value=observed_drive_deaths_ex_turnover, observed=True)

    @pm.potential
    def limit_sd(std_dev_att1=std_dev_att1,
                 std_dev_att2=std_dev_att2,
                 std_dev_att3=std_dev_att3,
                 std_dev_def1=std_dev_def1,
                 std_dev_def2=std_dev_def2,
                 std_dev_def3=std_dev_def3,
                 std_dev_home=std_dev_home):
        if std_dev_att1 < 0 or std_dev_att2 < 0 or std_dev_att3 < 0:
            return -np.inf
        if std_dev_def1 < 0 or std_dev_def2 < 0 or std_dev_def3 < 0:
            return -np.inf
        if std_dev_home < 0:
            return -np.inf
        return 0

    @pm.potential
    def keep_mu_within_bounds(mu_att1=mu_att1,
                              mu_def1=mu_def1,
                              mu_att3=mu_att3,
                              mu_def3=mu_def3):
        if mu_att1 < -3 or mu_att1 > 0 or mu_def3 < -3 or mu_def3 > 0:
            return -np.inf
        if mu_def1 < 0 or mu_def1 > 3 or mu_att3 < 0 or mu_att3 > 3:
            return -np.inf
        return 0

    return locals()


ex_turnover = pm.MCMC(ex_turnover_piecewise_exponential_model(),
                      db='pickle', dbname=DATA_DIR + 'ex_turnover_three_tiers_norz.pickle')

Using MCMC to Fit the Shifted-Beta-Geometric Customer Lifetime Value Model

Two professors of marketing, Peter Fader and Bruce Hardie, have developed probability models for estimating customer lifetime value (LTV). In their papers and example spreadsheets, they estimate the models using maximum-likelihood estimation (MLE).

In this post, I'm going to show how to use MCMC (via pymc) to estimate one of the models they've developed. Using MCMC makes it easy to quantify the uncertainty of the model parameters, and because LTV is a function of the model parameters, to pass that uncertainty through into the estimates of LTV itself.

This post is primarily about implementing the model, and I'm only going to touch briefly on the strengths of the Fader/Hardie model over simpler, back-of-the-envelope formulas you'll find if you google 'calculate customer lifetime value.' But in the interest of motivating the implementation, the model is worth understanding because:

  1. by modeling the processes underlying aggregate metrics like 'churn rate' or 'repeated buying rate,' and
  2. by allowing for heterogeneity in a customer base,

it provides more insight into customer behavior and in many cases, will provide less biased predictions about future behavior of customers.

The Four LTV Scenarios

The academic literature on LTV breaks down the problem into four scenarios, depending on whether the customer-business relationship is contractual, and on whether customers have discrete opportunities for transactions. The following graphic, taken from the second paper listed below, provides examples for each of the four scenarios.

In [2]:
import os
from IPython.display import Image, HTML
CHART_DIR = os.path.join(os.getcwd(), 'charts/')
Embed = Image(CHART_DIR + 'quadrant.png', retina=True)
Embed
Out[2]:

The key distinction between contractual and noncontractual is that in a contractual scenario, you observe when a customer leaves, while in a noncontractual scenario, customer departure is unobservable and therefore must be inferred. In this post, I'm going to implement a model from the contractual-discrete (lower-right) quadrant: the shifted-beta-geometric (sBG) model.

The paper introducing the sBG model is:

  • Peter S. Fader & Bruce G.S. Hardie, "How to Project Customer Retention." Journal of Interactive Marketing, 21 (2007), 76–90. link

I refer to some equations and examples from the paper, so it may be helpful to download the paper to follow along.
If you're interested in learning more about the larger world of probability models in LTV estimation, this paper provides a nice overview:

  • Peter S. Fader & Bruce G.S. Hardie, "Probability Models for Customer-Base Analysis." Journal of Interactive Marketing, 23 (2009), 61–69. link

Contractual-Discrete: The sBG Model

The shifted-beta-geometric (sBG) model makes two assumptions:

  1. Each customer has a constant churn probability $\theta$. You can think of this as the customer flipping a weighted coin (with probability of tails = $\theta$) at the end of every subscription period, and they cancel their membership if the coin lands tails.
  2. Customer churn probabilities are drawn from a beta distribution with parameters $\alpha$ and $\beta$.

The strength and flexibility of the model comes from the second assumption, which allows for heterogeneity in customer churn rates. In contrast, simple LTV formulas usually assume a constant aggregate churn rate. If you've worked with subscription data, this assumption probably matches your intuition: maybe you've got some lifers who seem like they'll renew until the end of time, and some short-lifers who try the product for a period or two and then cancel. While this heterogeneity can be approximated by the constant aggregate churn rate used in simpler LTV formulas, the aggregate churn rate often results in downwardly-biased estimates of LTV. (See the 'Perils of Ignoring Heterogeneity' paper linked below for more on this bias.)
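The two assumptions are also easy to simulate directly, which is a nice sanity check on intuition. The sketch below uses the $\alpha$ and $\beta$ values that, as we'll see later, Fader and Hardie estimate by MLE for the example data used in this post.

import numpy as np

rng = np.random.default_rng(3)

def simulate_sbg_cohort(n_customers, alpha, beta, n_periods):
    """Assumption 2: each customer draws a churn probability theta from Beta(alpha, beta).
    Assumption 1: they then flip that same weighted coin at the end of every period."""
    thetas = rng.beta(alpha, beta, size=n_customers)
    active = np.ones(n_customers, dtype=bool)
    counts = [n_customers]
    for _ in range(n_periods):
        active &= rng.random(n_customers) >= thetas   # survive this period's coin flip
        counts.append(int(active.sum()))
    return counts

simulate_sbg_cohort(1000, 0.668, 3.806, 7)   # compare with the observed counts used below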

Estimating the sBG model amounts to finding $\alpha$ and $\beta$, effectively estimating what proportion of your customer base is lifers, what proportion short-lifers, and what proportion fall on the spectrum in between. Using Matplotlib, we can recreate figure 3 from the sBG paper, illustrating how the beta distribution permits a variety of churn-rate distribution shapes - and thus how the sBG model can fit a variety of customer bases.

In [3]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
In [4]:
import numpy as np
import scipy
import pymc as pm
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
In [5]:
sns.set(style="white")
current_palette = sns.color_palette()
fig, ax = plt.subplots(figsize=(8,6))
ax.set_ylim(0, 2)
ax.set_xlim(0, 2)
x = np.linspace(0, 1, 100)[1:99]
subplot_locations = [[.6, .61, .25, .25], [.6, .21, .25, .25], [.2, .61, .25, .25], [.2, .21, .25, .25]]
subplot_params = [(2, 7), (1.3, .7), (.7, 1.3), (.5, .5)]
for loc, params in zip(subplot_locations, subplot_params):
    ax_i = plt.axes(loc)
    ax_i.plot(x, scipy.stats.beta.pdf(x, params[0], params[1]), '-')    
    plt.setp(ax_i.get_yticklabels(), visible=False)
    ax_i.axes.spines['right'].set_color('none')
    ax_i.axes.spines['top'].set_color('none')
    ax_i.set_xlabel(r'$\theta$')
    ax_i.set_xticks([0., .5, 1])

ax.set_title(r'General Shapes of the Beta Distribution as a Function of $\alpha$ and $\beta$')
ax.axes.spines['right'].set_color('none')
ax.axes.spines['top'].set_color('none')
ax.plot([1, 1], ax.get_ylim(), '--', c='black', lw=1)
ax.plot(ax.get_xlim(), [1, 1], '--', c='black', lw=1)
ax.set_xticks([0., 1., 2.])
ax.set_yticks([0., 1., 2.])
ax.set_xticklabels(['0', '1', r'$\alpha$'])
_ = ax.set_yticklabels(['0', '1', r'$\beta$'])

(If anyone can reproduce this using plt.subplots() or plt.subplot2grid(), I'd love to know how.)

In their 2007 paper, Fader and Hardie demonstrate how to fit an sBG model to data using MLE in Excel. They begin by deriving the likelihood function, then demonstrate how to code it in Excel and how to use Excel's solver to maximize the likelihood function with respect to $\alpha$ and $\beta$.

To fit the model using MCMC and pymc, we'll take the likelihood function they derived, code it in Python, and then use MCMC to sample from the posterior distributions of $\alpha$ and $\beta$. This post is more about implementation than derivation, so I'll just explain the intuition of the likelihood function without going into the details of the derivation.

I'll use the example data they use in Appendix B. These are counts of active users in period 0, period 1, etc.

In [28]:
example_data = [1000, 869, 743, 653, 593, 551, 517, 491]

To calculate the likelihood, we'll need the number of customers lost at each period.

In [29]:
def n_lost(data):
    lost = [None]
    for i in range(1, len(data)):
        lost.append(data[i - 1] - data[i])
    return lost

example_data_n_lost = n_lost(example_data)

data = (example_data, example_data_n_lost)
data
Out[29]:
([1000, 869, 743, 653, 593, 551, 517, 491],
 [None, 131, 126, 90, 60, 42, 34, 26])

Note that I've smushed the two series (counts of active, and lost per period) into a single tuple. I've done this so I can pass the data to pymc as a single object.

I'll define uninformative priors on $\alpha$ and $\beta$.

In [30]:
alpha = pm.Uniform('alpha', 0.00001, 1000, value=1) 
beta = pm.Uniform('beta', 0.00001, 1000, value=1)

And now we start getting into the meat and potatoes of the likelihood function, which includes two components:

  • the probability that $n_i$ customers churn in period $i$ (in this case, we observe $i$ up to 7)
  • the probability that $n - \sum\limits_{t=1}^7n_t $ customers are still active at the end of the 7th period.

To help us with the first component, we need the probability that a randomly chosen individual will have a lifetime $T = t$ given $\alpha$ and $\beta$, equation (7) in the paper:

In [31]:
num_periods = len(example_data)

@pm.deterministic
def P_T_is_t(alpha=alpha, beta=beta, num_periods=num_periods):
    p = [None, alpha / (alpha + beta)]
    for t in range(2, num_periods):
        pt = (beta + t - 2) / (alpha + beta + t - 1) * p[t-1]
        p.append(pt)
    return p

"pm.deterministic" tells pymc that the output is completely (derministically) determined by the inputs, as is the case here.

For the second component of the likelihood function, we need the probability of a randomly chosen individual surviving to period $t$, given $\alpha$ and $\beta$:

In [32]:
@pm.deterministic
def survival_function(P_T_is_t=P_T_is_t, num_periods=num_periods):
    s = [None, 1 - P_T_is_t[1]]
    for t in range(2, num_periods):
        s.append(s[t-1] - P_T_is_t[t])
    return s

We're now ready to define the likelihood function itself. We can see the two components described above:

In [33]:
@pm.observed
def retention_rates(P_T_is_t=P_T_is_t, survival_function=survival_function, value=data):
    def logp(value, P_T_is_t, survival_function):
        
        active, lost = value
        
        # Those who've churned along the way...
        died = np.log(P_T_is_t[1:]) * lost[1:]
        
        # and those still active in last period
        still_active = np.log(survival_function[-1]) * active[-1]              
        return sum(died) + still_active

"pm.observed" tells pymc that we've observed this variable's value, so it's fixed.

To make sure that we've coded everything correctly, we can confirm that our log likelihood matches that in the paper (when $\alpha = 1$ and $\beta = 1$):

In [34]:
mcmc = pm.MCMC([alpha, beta, P_T_is_t, survival_function, retention_rates])
retention_rates.logp
Out[34]:
-2115.545506645605

Yep, it matches (Figure B1 in the paper).

At this point, we can estimate the posterior distributions of $\alpha$ and $\beta$ using MCMC:

In [35]:
mcmc.sample(20000, 5000, 20)
 [-----------------100%-----------------] 20000 of 20000 complete in 11.4 sec

...and plot the posterior distribution using pymc's built-in plot method:

In [36]:
sns.set(style="darkgrid")
pm.Matplot.plot(alpha)
Plotting alpha
In [37]:
pm.Matplot.plot(beta)
Plotting beta

It looks like the medians of the posterior distributions are close to the MLE estimates Fader and Hardie found (.668 for $\alpha$ and 3.806 for $\beta$). Let's confirm:

In [38]:
df_trace = pd.DataFrame({'alpha': alpha.trace(), 'beta': beta.trace()})
df_trace.median()
Out[38]:
alpha    0.717973
beta     4.188330
dtype: float64

So why go through the trouble of using MCMC over MLE? Because now we have an easy way to understand the uncertainty of our estimates, without having to go through the extra steps of generating sampling distributions and computing p-values.

To demonstrate, I'm going to introduce a concept from another Fader/Hardie paper, the Discounted Expected Residual Lifetime (DERL).

Discounted Expected Residual Lifetime

The motivation for the DERL is that once you've fit an sBG model to a customer base, an obvious follow-up question is, "how much money can I expect to take in from this customer base in the future?" The DERL for a customer who pays $x$ per period and who is at the end of their $n$th period is the number such that $DERL * x$ is the expected present value of future payments from that customer. The DERL is a function of $\alpha$, $\beta$, a discount rate $d$, and the current period $n$ of the subscriber.

The DERL can also give us the expected discounted CLV of a new customer, if we set $n = 1$ and add an undiscounted initial payment. And because we have posterior distributions for $\alpha$ and $\beta$, we can easily leverage our uncertainty in $\alpha$ and $\beta$ to understand our uncertainty in the statistic we really care about, the CLV.

The paper deriving the DERL is:

  • Peter S. Fader & Bruce G.S. Hardie, "Customer-Base Valuation in a Contractual Setting: The Perils of Ignoring Heterogeneity" Marketing Science, 29-1 (Jan-Feb 2010), 85-93. link

Let's start by defining a function to compute the DERL:

In [17]:
from scipy.special import hyp2f1

def derl(alpha, beta, d, n):
    """
    Discounted Expected Residual Lifetime, as derived in Fader and Hardie (2010).  See equation (6).

    :param alpha: sBG alpha param
    :param beta: sBG beta param
    :param d: discount rate
    :param n: customer's contract period (customer has made n-1 renewals)
    :return: float
    """
    return (beta + n - 1) / (alpha + beta + n - 1) * hyp2f1(1, beta + n, alpha + beta + n, 1 / (1 + d))

For each draw of the posterior distribution, we'll calculate the DERL for a customer just before their first renewal decision. For the sake of example, imagine we get $10 per period from each customer, and I'll use a discount rate of 10%.

In [18]:
df_trace['derl'] = df_trace.apply(lambda x: 1 + derl(x['alpha'], x['beta'], .1, 1), axis=1)
df_trace['lifetime_value'] = 10 * df_trace.derl

Now we can provide not only an estimate of discounted customer lifetime value, but also a credible interval:

In [19]:
median_clv = df_trace.lifetime_value.median()
cred_interval = df_trace.lifetime_value.quantile(.025), df_trace.lifetime_value.quantile(.975)
ax = df_trace['lifetime_value'].hist()
ax.set_title('Customer Lifetime Value (Discount Rate: .1)')
ax.set_xlabel('Discounted Expected Customer Lifetime')
ax.plot([median_clv, median_clv], ax.get_ylim())
plt.annotate('Median: %.1f' % median_clv, xy=(median_clv + .02, ax.get_ylim()[1]-10))
ax.plot([cred_interval[0], cred_interval[0]], ax.get_ylim(), c=sns.color_palette()[2], lw=1)
_ = ax.plot([cred_interval[1], cred_interval[1]], ax.get_ylim(), c=sns.color_palette()[2], lw=1)

As we get more data, our credible interval should shrink. Let's confirm by imagining we had 10x the data:

In [20]:
example_data = [10 * x for x in example_data]
example_data_n_lost = n_lost(example_data)

data = (example_data, example_data_n_lost)
data
Out[20]:
([10000, 8690, 7430, 6530, 5930, 5510, 5170, 4910],
 [None, 1310, 1260, 900, 600, 420, 340, 260])

We have to redefine retention_rates so it knows about the new data:

In [21]:
@pm.observed
def retention_rates(P_T_is_t=P_T_is_t, survival_function=survival_function, value=data):
    def logp(value, P_T_is_t, survival_function):
        
        active, lost = value
        
        # Those who've churned along the way...
        died = np.log(P_T_is_t[1:]) * lost[1:]
        
        # and those still active in last period
        still_active = np.log(survival_function[-1]) * active[-1]              
        return sum(died) + still_active

Re-run MCMC....

In [22]:
mcmc = pm.MCMC([alpha, beta, P_T_is_t, survival_function, retention_rates])
mcmc.sample(20000, 5000, 20)
df_trace_10x = pd.DataFrame({'alpha': alpha.trace(), 'beta': beta.trace()})
df_trace_10x['derl'] = df_trace_10x.apply(lambda x: 1 + derl(x['alpha'], x['beta'], .1, 1), axis=1)
df_trace_10x['lifetime_value'] = 10 * df_trace_10x.derl
 [-----------------100%-----------------] 20000 of 20000 complete in 15.4 sec

... and plot our two posterior distributions on the same X axis:

In [23]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))

plots = [(ax1, df_trace, 'Original N'), (ax2, df_trace_10x, '10x N')]
for ax, df, title in plots:
    median_clv = df.lifetime_value.median()
    cred_interval = df.lifetime_value.quantile(.025), df.lifetime_value.quantile(.975)
    df['lifetime_value'].hist(ax=ax, lw=0)
    ax.plot([median_clv, median_clv], ax.get_ylim())
    ax.plot([cred_interval[0], cred_interval[0]], ax.get_ylim(), c=sns.color_palette()[2], lw=1)
    _ = ax.plot([cred_interval[1], cred_interval[1]], ax.get_ylim(), c=sns.color_palette()[2], lw=1)
    ax.text(.99, .9, title, verticalalignment='bottom', horizontalalignment='right', transform=ax.transAxes, fontsize=14)

ax2.set_xlabel('Discounted Expected Customer Lifetime')
_ = f.suptitle('Customer Lifetime Value (Discount Rate: .1), With Original and 10x Sample Size', fontsize=14)

As expected, as we collect more data, the credible interval for discounted customer lifetime value shrinks. Had we gone in the other direction - either cut our sample size, or removed data on the last few periods - our credible interval would have grown wider.

In summary: in this post, I used pymc to fit the shifted-beta-geometric model of LTV in a contractual-discrete business. By using MCMC, I got the same point estimates the authors got using MLE, but I also got some handy intervals providing a measurement of the uncertainty in the estimates.