Which Cognitive Bias is Making NFL Coaches Predictable?

In football, it pays to be unpredictable (although the “wrong way touchdown” might be taking it a bit far). If the other team picks up on an unintended pattern in your play calling, they can take advantage of it and adjust their strategy to counter yours. Coaches and their staffs of coordinators are paid millions of dollars to call plays that maximize their team’s talent and exploit their opponents’ weaknesses.

That’s why it surprised Brian Burke, formerly of AdvancedNFLAnalytics.com (and now at ESPN), to see a peculiar trend: football teams seem to rush on 2nd and 10 at a remarkably high rate compared to 2nd and 9 or 2nd and 11.

What’s causing that?

His insight was that 2nd and 10 disproportionately followed an incomplete pass. This generated two hypotheses:

  1. Coaches (like all humans) are bad at generating random sequences, and have a tendency to alternate too much when they’re trying to be genuinely random. Since 2nd and 10 is most likely the result of a 1st down pass, alternating would produce a high percent of 2nd down rushes.
  2. Coaches are suffering from the ‘small sample fallacy’ and ‘recency bias’, overreacting to the result of the previous play. Since 2nd and 10 not only likely follows a pass, but a failed pass, coaches have an impulse to try the alternative without realizing they’re being predictable.

These explanations made sense to me, and I wrote about the phenomenon a few years ago. But now that I’ve been learning data science, I can dive deeper into the analysis and add a hypothesis of my own.

The following work is based on the play-by-play data for every NFL game from 2002 through 2012, which Brian kindly posted. I spent some time processing it to create variables like Previous Season Rushing %, Yards per Pass, Yards Allowed per Pass by Defense, and QB Completion percent. The Python notebooks are available on my GitHub, although the data files were too large to host easily.

Irrationality? Or Confounding Variables?

Since this is an observational study rather than a randomized control trial, there are bound to be confounding variables. In our case, we’re comparing coaches’ play calling on 2nd down after getting no yards on their team’s 1st down rush or pass. But those scenarios don’t come from the same distribution of game situations.

A number of variables could be in play, some exaggerating the trend and others minimizing it. For example, teams that passed for no gain on 1st down (resulting in 2nd and 10) have a disproportionate number of inaccurate quarterbacks (the left graph). These teams with inaccurate quarterbacks are more likely to call rushing plays on 2nd down (the right graph). Combine those factors, and we don’t know whether any difference in play calling is caused by the 1st down play type or the quality of quarterback.

[Figure: QB completion percentage as a confounding variable – completion rate by 1st down play type (left) and 2nd down rush rate by QB accuracy (right)]

The classic technique is to train a regression model to predict the next play call, and judge a variable’s impact by the coefficient the model gives that variable. Unfortunately, models that give interpretable coefficients tend to treat each variable as either positively or negatively correlated with the target – so time remaining can’t be positively correlated with a coach calling running plays when the team is losing and negatively correlated when the team is winning. Since the relationships in the data are more complicated than that, we need a model that can handle them.
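To see why that matters, here’s a toy example – every number in it is invented – where the true relationship flips sign depending on whether the team is trailing. A plain logistic regression, with one signed coefficient per variable, can’t express the flip; a tree-based model picks it up on its own. This is just an illustration of the problem, not the actual analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 20_000
time_left = rng.uniform(0, 60, n)   # minutes remaining (synthetic)
losing = rng.integers(0, 2, n)      # 1 if the offense is trailing
# Made-up rule with a sign flip: trailing teams rush more when there's still
# plenty of time, leading teams rush more as the clock runs down.
p_rush = 0.5 + 0.012 * (time_left - 30) * (2 * losing - 1)
rushed = (rng.random(n) < p_rush).astype(int)
X = np.column_stack([time_left, losing])

linear = LogisticRegression().fit(X, rushed)
trees = GradientBoostingClassifier().fit(X, rushed)
print("logistic AUC:", roc_auc_score(rushed, linear.predict_proba(X)[:, 1]))
print("boosted  AUC:", roc_auc_score(rushed, trees.predict_proba(X)[:, 1]))
# The single signed coefficient on time_left can't capture "positive when
# losing, negative when winning", so the logistic model stays near chance,
# while the tree-based model learns the interaction.
```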

I saw my chance to try a technique I learned at the Boston Data Festival last year: Inverse Probability of Treatment Weighting.

In essence, the goal is to create artificial balance between your ‘treatment’ and ‘control’ groups — in our case, 2nd and 10 situations following 1st down passes vs. following 1st down rushes. We want to take plays with under-represented characteristics and ‘inflate’ them by pretending they happened more often, and – ahem – ‘deflate’ the plays with over-represented features.

To get a single metric of how over- or under-represented a play is, we train a model (one that handles non-linear relationships better) that takes each 2nd down play’s confounding variables as input – score, field position, QB quality, etc. – and tries to predict whether the 1st down play was a rush or a pass. If, based on the confounding variables, the model predicts the play was 90% likely to follow a 1st down pass – and it did – we decide the play probably has over-represented features and give it less weight in our analysis. However, if the play actually followed a 1st down rush, it must have under-represented features for the model to get it so wrong. Accordingly, we give it more weight.

After assigning each play a new weight to compensate for its confounding features (using K-fold cross-validation so the model never scores the very plays it was trained on), the two groups *should* be balanced. It’s as though we were running a scientific study, noticed that our control group had half as many men as the treatment group, and went out to recruit more men. However, since that isn’t an option, we just decided to count the men twice.
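Here’s a minimal sketch of what that weighting step can look like with scikit-learn. The function name, column contents, and model choice are my own assumptions for illustration, not necessarily what’s in the notebooks:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

def ipt_weights(X, followed_pass, n_folds=5):
    """X: confounders for each 2nd down play (score margin, field position,
    QB quality, ...). followed_pass: 1 if the 1st down play was a pass,
    0 if it was a rush. Returns one inverse-probability weight per play."""
    model = GradientBoostingClassifier()
    # Out-of-fold predictions, so no play is scored by a model that saw it in training.
    p_pass = cross_val_predict(model, X, followed_pass, cv=n_folds,
                               method="predict_proba")[:, 1]
    # Plays the model found unsurprising (over-represented features) get deflated;
    # plays it got badly wrong (under-represented features) get inflated.
    return np.where(followed_pass == 1, 1.0 / p_pass, 1.0 / (1.0 - p_pass))
```

In practice it’s also common to clip or stabilize extreme weights so a handful of unusual plays don’t dominate the analysis.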

Testing our Balance

Before processing, teams that rushed on 1st down for no gain were disproportionately likely to be teams with the lead. After the re-weighting process, the distributions are far more similar:

[Figure: distribution of score margin (lead) for the two groups, before vs. after re-weighting]

Much better! They’re not all this dramatic, but lead was the strongest confounding factor and the model paid extra attention to adjust for it.

It’s great that the distributions look more similar, but that’s qualitative. For a quantitative diagnostic, we can take the standardized difference in means, recommended as a best practice in a 2015 paper by Peter C. Austin and Elizabeth A. Stuart titled “Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies”.

For each potential confounding variable, we take the difference in means between plays following 1st down passes and plays following 1st down rushes, scaled by their pooled variance. A high standardized difference of means indicates that our two groups are dissimilar, and in need of balancing. Before applying IPT-weighting, the standardized differences had a maximum of around 47% and a median of 7.5%; weighting reduced those to 9% and 3.1%, respectively.

[Figure: standardized difference in means for each confounder, before vs. after IPT-weighting]
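For reference, a small helper along these lines computes that diagnostic – the difference in (weighted) means divided by the pooled standard deviation – for any one confounder. This is a sketch of the formula rather than the notebook code:

```python
import numpy as np

def standardized_difference(x, followed_pass, weights=None):
    """Standardized difference in means of one confounder x between plays
    following a 1st down pass and plays following a 1st down rush,
    optionally using the IPT weights from the step above."""
    x = np.asarray(x, dtype=float)
    followed_pass = np.asarray(followed_pass)
    weights = np.ones_like(x) if weights is None else np.asarray(weights, dtype=float)

    is_pass = followed_pass == 1
    mean_p = np.average(x[is_pass], weights=weights[is_pass])
    mean_r = np.average(x[~is_pass], weights=weights[~is_pass])
    var_p = np.average((x[is_pass] - mean_p) ** 2, weights=weights[is_pass])
    var_r = np.average((x[~is_pass] - mean_r) ** 2, weights=weights[~is_pass])
    # Difference in means, scaled by the pooled standard deviation.
    return (mean_p - mean_r) / np.sqrt((var_p + var_r) / 2.0)
```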

Actually Answering Our Question

So, now that we’ve done what we can to balance the groups, do coaches still call rushing plays on 2nd and 10 more often after 1st down passes than after rushes? In a word, yes.

[Figure: 2nd and 10 play calling after a 1st down pass vs. after a 1st down rush, before and after re-weighting]

In fact, the pattern is even stronger after controlling for game situation. It turns out that the biggest factor was the score (especially when time was running out.) A losing team needs to be passing the ball more often to try to come back, so their 2nd and 10 situations are more likely to follow passes on 1st down. If those teams are *still* calling rushing plays often, it’s even more evidence that something strange is going on.

Ok, so controlling for game situation doesn’t explain away the spike in rushing percent at 2nd and 10. Is it due to coaches’ impulse to alternate their play calling?

Maybe, but that can’t be the whole story. If it were, I would expect to see the trend consistent across different 2nd down scenarios. But when we look at all 2nd-down distances, not just 2nd and 10, we see something else:

[Figure: 2nd down rush percentage by yards gained on 1st down, split by whether the 1st down play was a rush or a pass]

If their teams don’t get very far on 1st down, coaches are inclined to change their play call on 2nd down. But as a team gains more yards on 1st down, coaches are less and less inclined to switch. If the team got six yards, coaches rush about 57% of the time on 2nd down regardless of whether they ran or passed last play. And it actually reverses if you go beyond that – if the team gained more than six yards on 1st down, coaches have a tendency to repeat whatever just succeeded.
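For anyone curious how that kind of chart gets tabulated, the aggregation is a straightforward groupby. The column names below are hypothetical stand-ins for whatever the processed play-by-play data actually uses:

```python
import pandas as pd

def rush_rate_by_first_down_result(plays: pd.DataFrame) -> pd.DataFrame:
    """plays: one row per 2nd down play, with columns 'yards_on_1st',
    'first_down_play' ('rush'/'pass'), and 'second_down_play' ('rush'/'pass').
    Returns the 2nd down rush percentage by 1st down distance and play type."""
    rushed = plays["second_down_play"] == "rush"
    table = (plays.assign(rushed_on_2nd=rushed)
                  .groupby(["yards_on_1st", "first_down_play"])["rushed_on_2nd"]
                  .mean()
                  .unstack("first_down_play"))
    return 100 * table  # rows near 0 yards show the gap; rows past 6 flip
```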

It sure looks like coaches are reacting to the previous play in a predictable Win-Stay Lose-Shift pattern.

Following a hunch, I did one more comparison: passes completed for no gain vs. incomplete passes. If incomplete passes feel more like a failure, the recency bias would influence coaches to call more rushing plays after an incompletion than after a pass that was caught for no gain.

Before the re-weighting process, there’s almost no difference in play calling between the two groups – 43.3% vs. 43.6% (p=.88). However, after adjusting for the game situation – especially quarterback accuracy – the trend reemerges: in similar game scenarios, teams rush 44.4% of the time after an incompletion and only 41.5% after passes completed for no gain. It might sound small, but with 20,000 data points it’s a pretty big difference (p < 0.00005).
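For the unweighted version of that comparison, a standard two-proportion z-test is all that’s needed (the weighted version swaps the raw counts for the IPT weights). A sketch using statsmodels, with the counts left as inputs rather than reproduced here:

```python
from statsmodels.stats.proportion import proportions_ztest

def compare_rush_rates(rush_counts, play_counts):
    """rush_counts, play_counts: number of 2nd down rushes and total 2nd down
    plays for the two groups (after an incompletion vs. after a completion
    for no gain). Returns the z statistic and two-sided p-value."""
    return proportions_ztest(count=rush_counts, nobs=play_counts)
```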

All signs point to the recency bias being the primary culprit.

Reasons to Doubt:

1) There are a lot of variables I didn’t control for, including fatigue, player substitutions, temperature, and whether the game clock was stopped in between plays. Any or all of these could impact the play calling.

2) Brian Burke’s (and my) initial premise was that if teams are irrationally rushing more often after incomplete passes, defenses should be able to prepare for this and exploit the pattern. Conversely, going against the trend should be more likely to catch the defense off-guard.

I really expected to find plays gaining more yards if they bucked the trends, but it’s not as clear as I would like.  I got excited when I discovered that rushing plays on 2nd and 10 did worse if the previous play was a pass – when defenses should expect it more. However, when I looked at other distances, there just wasn’t a strong connection between predictability and yards gained.

One possibility is that I needed to control for more variables. But another possibility is that while defenses *should* be able to exploit a coach’s predictability, they can’t or don’t. To give Brian the last words:

But regardless of the reasons, coaches are predictable, at least to some degree. Fortunately for offensive coordinators, it seems that most defensive coordinators are not aware of this tendency. If they were, you’d think they would tip off their own offensive counterparts, and we’d see this effect disappear.

Why Decision Theory Tells You to Eat ALL the Cupcakes

Imagine that you have a big task coming up that requires an unknown amount of willpower – you might have enough willpower to finish, you might not. You’re gearing up to start when suddenly you see a delicious-looking cupcake on the table. Do you indulge in eating it? According to psychology research and decision-theory models, the answer isn’t simple.

If you resist the temptation to eat the cupcake, current research indicates that you’ve depleted your stores of willpower (psychologists call it ego depletion), which causes you to be less likely to have the willpower to finish your big task. So maybe you should save your willpower for the big task ahead and eat it!

…But if you’re convinced already, hold on a second. How easily you give in to temptation gives evidence about your underlying strength of will. After all, someone with weak willpower will find the reasons to indulge more persuasive. If you end up succumbing to the temptation, it’s evidence that you’re a person with weaker willpower, and are thus less likely to finish your big task.

How can eating the cupcake cause you to be more likely to succeed while also giving evidence that you’re more likely to fail?

Conflicting Decision Theory Models

The strangeness lies in the difference between two conflicting models of how to make decisions. Luke Muehlhauser describes them well in his Decision Theory FAQ:

This is not a “merely verbal” dispute (Chalmers 2011). Decision theorists have offered different algorithms for making a choice, and they have different outcomes. Translated into English, the [second] algorithm (evidential decision theory or EDT) says “Take actions such that you would be glad to receive the news that you had taken them.” The [first] algorithm (causal decision theory or CDT) says “Take actions which you expect to have a positive effect on the world.”

The crux of the matter is how to handle the fact that we don’t know how much underlying willpower we started with.

Causal Decision Theory asks, “How can you cause yourself to have the most willpower?”

It focuses on the fact that, in any state, spending willpower resisting the cupcake causes ego depletion. Because of that, it says our underlying amount of willpower is irrelevant to the decision. The recommendation stays the same regardless: eat the cupcake.

Evidential Decision Theory asks, “What will give evidence that you’re likely to have a lot of willpower?”

We don’t know whether we’re starting with strong or weak will, but our actions can reveal that one state or another is more likely. It’s not that we can change the past – Evidential Decision Theory doesn’t look for that causal link – but our choice indicates which possible version of the past we came from.

Yes, seeing someone undergo ego depletion would be evidence that they lost a bit of willpower.  But watching them resist the cupcake would probably be much stronger evidence that they have plenty to spare.  So you would rather “receive news” that you had resisted the cupcake.
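To make the contrast concrete, here’s a toy calculation. Every number in it is invented purely for illustration – the point is the structure of the two calculations, not the values:

```python
# Made-up probabilities: chance of finishing the big task given your underlying
# type of willpower, and how much resisting the cupcake depletes you.
p_finish = {"strong": 0.8, "weak": 0.3}
depletion_penalty = 0.2
p_strong_prior = 0.5

# Causal decision theory: your type is fixed; the only causal effect of
# resisting is the depletion cost.
cdt_eat = p_strong_prior * p_finish["strong"] + (1 - p_strong_prior) * p_finish["weak"]
cdt_resist = cdt_eat - depletion_penalty

# Evidential decision theory: your choice is evidence about your type
# (assumed posteriors, again made up).
p_strong_if_eat, p_strong_if_resist = 0.3, 0.8
edt_eat = p_strong_if_eat * p_finish["strong"] + (1 - p_strong_if_eat) * p_finish["weak"]
edt_resist = (p_strong_if_resist * p_finish["strong"]
              + (1 - p_strong_if_resist) * p_finish["weak"]) - depletion_penalty

print(f"CDT: eat {cdt_eat:.2f} vs. resist {cdt_resist:.2f}")  # 0.55 vs. 0.35 -> eat
print(f"EDT: eat {edt_eat:.2f} vs. resist {edt_resist:.2f}")  # 0.45 vs. 0.50 -> resist
```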

A Third Option

Each of these models has strengths and weaknesses, and a number of thought experiments – especially the famous Newcomb’s Paradox – have sparked ongoing discussions and disagreements about what decision theory model is best.

One attempt to improve on standard models is Timeless Decision Theory, a method devised by Eliezer Yudkowsky of the Machine Intelligence Research Institute.  Alex Altair recently wrote up an overview, stating in the paper’s abstract:

When formulated using Bayesian networks, two standard decision algorithms (Evidential Decision Theory and Causal Decision Theory) can be shown to fail systematically when faced with aspects of the prisoner’s dilemma and so-called “Newcomblike” problems. We describe a new form of decision algorithm, called Timeless Decision Theory, which consistently wins on these problems.

It sounds promising, and I can’t wait to read it.

But Back to the Cupcakes

For our particular cupcake dilemma, there’s a way out:

Precommit. You need to promise – right now! – to always eat the cupcake when it’s presented to you. That way you don’t spend any willpower on resisting temptation, but your indulgence doesn’t give any evidence of a weak underlying will.

And that, ladies and gentlemen, is my new favorite excuse for why I ate all the cupcakes.

How has Bayes’ Rule changed the way I think?

People talk about how Bayes’ Rule is so central to rationality, and I agree. But given that I don’t go around plugging numbers into the equation in my daily life, how does Bayes actually affect my thinking?
A short answer, in my new video below:


(This is basically what the title of this blog was meant to convey — quantifying your uncertainty.)

What Would a Rational Gryffindor Read?

In the Harry Potter world, Ravenclaws are known for being the smart ones. That’s their thing. In fact, that was really all they were known for. In the books, each house could be boiled down to one or two words: Gryffindors are brave, Ravenclaws are smart, Slytherins are evil and/or racist, and Hufflepuffs are ~~pathetic~~ loyal. (Giving rise to this hilarious Second City mockery.)

But while reading Harry Potter and the Methods of Rationality, I realized that there’s actually quite a lot of potential for interesting reading in each house. Ravenclaws would be interested in philosophy of mind, cognitive science, and mathematics; Gryffindors in combat, ethics, and democracy; Slytherins in persuasion, rhetoric, and political machination; and Hufflepuffs in productivity, happiness, and the game theory of cooperation.

And so, after much thought, I found myself knee-deep in my books recreating what a rationalist from each house would have on his or her shelf. I tried to match the mood as well as the content. Here they are in the appropriate proportions for a Facebook cover image so that you can display your pride both in rationality and in your chosen house (click to see each image larger, with a book list on the left):

Rationality Ravenclaw Library

Rationality Gryffindor Library

Rationality Slytherin Library

Rationality Hufflepuff Library

What do you think? I’m always open to book recommendations and suggestions for good fits. Which bookshelf fits you best? What would you add?

Spirituality and “skeptuality”

Is “rational” spirituality a contradiction in terms? In the latest episode of the Rationally Speaking podcast, Massimo and I try to pin down what people mean when they call themselves “spiritual,” what inspires spiritual experiences and attitudes, and whether spirituality can be compatible with a naturalist view of the world.

Are there benefits that skeptics and other secular people could possibly get from incorporating some variants on traditional spiritual practices — like prayer, ritual, song, communal worship, and so on — into their own lives?

We examine a variety of attempts to do so, and ask: how well have such attempts worked, and do they come with any potential pitfalls for our rationality?

http://www.rationallyspeakingpodcast.org/show/rs55-spirituality.html

How to want to change your mind

New video blog: “How to Want to Change your Mind.”

This one’s full of useful tips to turn off your “defensive” instincts in debates, and instead cultivate the kind of fair-minded approach that’s focused on figuring out the truth, not on “winning” an argument.

A rational view of tradition

In my latest video blog I answer a listener’s question about why rationalists are more likely to abandon social norms like marriage, monogamy, standard gender roles, having children, and so on. And then I weigh in on whether that’s a rational attitude to take:
