## Which Cognitive Bias is Making NFL Coaches Predictable?

In football, it pays to be unpredictable (although the “wrong way touchdown” might be taking it a bit far). If the other team picks up on an unintended pattern in your play calling, they can take advantage of it and adjust their strategy to counter yours. Coaches and their staff of coordinators are paid millions of dollars to call plays that maximize their team’s talent and exploit their opponent’s weaknesses.

That’s why it surprised Brian Burke, formerly of AdvancedNFLAnalytics.com (and now hired by ESPN), to see a peculiar trend: football teams seem to rush a remarkably high percentage of the time on 2nd and 10 compared to 2nd and 9 or 11.

What’s causing that?

His insight was that 2nd and 10 disproportionately followed an incomplete pass. This generated two hypotheses:

1. Coaches (like all humans) are bad at generating random sequences, and have a tendency to alternate too much when they’re trying to be genuinely random. Since 2nd and 10 is most likely the result of a 1st down pass, alternating would produce a high percent of 2nd down rushes.
2. Coaches are suffering from the ‘small sample fallacy’ and ‘recency bias’, overreacting to the result of the previous play. Since 2nd and 10 not only likely follows a pass, but a failed pass, coaches have an impulse to try the alternative without realizing they’re being predictable.

These explanations made sense to me, and I wrote about the phenomenon a few years ago. But now that I’ve been learning data science, I can dive deeper into the analysis and add a hypothesis of my own.

The following work is based on the play-by-play data for every NFL game from 2002 through 2012, which Brian kindly posted. I spent some time processing it to create variables like Previous Season Rushing %, Yards per Pass, Yards Allowed per Pass by Defense, and QB Completion percent. The Python notebooks are available on my GitHub, although the data files were too large to host easily.

## Irrationality? Or Confounding Variables?

Since this is an observational study rather than a randomized controlled trial, there are bound to be confounding variables. In our case, we’re comparing coaches’ play calling on 2nd down after getting no yards on their team’s 1st down rush or pass. But those scenarios don’t come from the same distribution of game situations.

A number of variables could be in play, some exaggerating the trend and others minimizing it. For example, teams that passed for no gain on 1st down (resulting in 2nd and 10) have a disproportionate number of inaccurate quarterbacks (the left graph). These teams with inaccurate quarterbacks are more likely to call rushing plays on 2nd down (the right graph). Combine those factors, and we don’t know whether any difference in play calling is caused by the 1st down play type or the quality of quarterback.

The classic technique is to train a regression model to predict the next play call, and judge a variable’s impact by the coefficient the model gives that variable. Unfortunately, models that give interpretable coefficients tend to treat each variable as either positively or negatively correlated with the target, so time remaining can’t be positively correlated with a coach calling running plays when the team is losing and negatively correlated when the team is winning. Since the relationships in the data are more complicated, we needed a model that can handle them.

I saw my chance to try a technique I learned at the Boston Data Festival last year: Inverse Probability of Treatment Weighting.

In essence, the goal is to create artificial balance between your ‘treatment’ and ‘control’ groups — in our case, 2nd and 10 situations following 1st down passes vs. following 1st down rushes. We want to take plays with under-represented characteristics and ‘inflate’ them by pretending they happened more often, and – ahem – ‘deflate’ the plays with over-represented features.

To get a single metric of how over- or under-represented a play is, we train a model (one that can handle non-linear relationships better) that takes each 2nd down play’s confounding variables as input (score, field position, QB quality, etc.) and tries to predict whether the 1st down play was a rush or pass. If, based on the confounding variables, the model predicts the play was 90% likely to be after a 1st down pass, and it was, we decide the play probably has over-represented features and we give it less weight in our analysis. However, if the play actually followed a 1st down rush, it must have under-represented features for the model to get it so wrong. Accordingly, we decide to give it more weight.

After assigning each play a new weight to compensate for its confounding features (using k-fold cross-validation to avoid training the model on the very plays it’s trying to score), the two groups *should* be balanced. It’s as though we were running a scientific study, noticed that our control group had half as many men as the treatment group, and went out to recruit more men. However, since that isn’t an option, we just decided to count the men twice.
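The weighting step itself is small enough to sketch. This is a minimal illustration, not the notebook code: it assumes you already have an out-of-fold predicted probability for each play, and the function name `iptw_weights` and the `clip` safeguard are hypothetical additions.

```python
def iptw_weights(followed_pass, propensity, clip=0.01):
    """Return one inverse-probability-of-treatment weight per play.

    followed_pass[i] -- True if the play came after a 1st down pass
    propensity[i]    -- model's predicted probability that it came after a pass
    clip             -- hypothetical safeguard keeping probabilities off 0 and 1
    """
    weights = []
    for after_pass, p in zip(followed_pass, propensity):
        p = min(max(p, clip), 1.0 - clip)
        # Weight by the inverse of the probability of the group the play
        # actually landed in: expected plays shrink, surprising plays grow.
        weights.append(1.0 / p if after_pass else 1.0 / (1.0 - p))
    return weights


# Two plays the model scored as 90% likely to follow a pass:
# the first really did, the second actually followed a rush.
weights = iptw_weights([True, False], [0.9, 0.9])
```

The play the model saw coming gets a weight just over 1, while the surprising one gets a weight of about 10, which is the "count the men twice" idea in numbers.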

## Testing our Balance

Before processing, teams that rushed on 1st down for no gain were disproportionately likely to be teams with the lead. After the re-weighting process, the distributions are far more similar:

Much better! They’re not all this dramatic, but lead was the strongest confounding factor and the model paid extra attention to adjust for it.

It’s great that the distributions look more similar, but that’s qualitative. To do a quantitative diagnostic, we can take the standardized difference in means, recommended as a best practice in a 2015 paper by Peter C. Austin and Elizabeth A. Stuart titled “Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies”.

For each potential confounding variable, we take the difference in means between plays following 1st down passes and 1st down rushes and adjust for their combined variance. A high standardized difference in means indicates that our two groups are dissimilar, and in need of balancing. The standardized differences had a max of around 47% and median of 7.5% before applying IPT-weighting, which reduced them to 9% and 3.1%, respectively.
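Here is a rough sketch of that diagnostic for a single variable. It is an assumption-laden illustration rather than the exact notebook code: the function names are made up, and the weighted variance shown is one common form that reduces to the ordinary sample variance when every weight is 1.

```python
import math


def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_variance(values, weights):
    # Reduces to the usual (n - 1) sample variance when all weights equal 1.
    mean = weighted_mean(values, weights)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    spread = sum(w * (v - mean) ** 2 for v, w in zip(values, weights))
    return total / (total * total - total_sq) * spread


def standardized_difference(after_pass, after_rush, w_pass=None, w_rush=None):
    """Difference in (weighted) means, scaled by the pooled standard deviation."""
    w_pass = w_pass or [1.0] * len(after_pass)
    w_rush = w_rush or [1.0] * len(after_rush)
    diff = weighted_mean(after_pass, w_pass) - weighted_mean(after_rush, w_rush)
    pooled = (weighted_variance(after_pass, w_pass)
              + weighted_variance(after_rush, w_rush)) / 2
    return diff / math.sqrt(pooled)


# Toy score margins, not real data.
d = standardized_difference([0, 3, 7, 10], [-3, 0, 4, 7])
```

Calling it once without weights and once with the IPT weights, for every confounder, gives the before-and-after comparison described above.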

So, now that we’ve done what we can to balance the groups, do coaches still call rushing plays on 2nd and 10 more often after 1st down passes than after rushes? In a word, yes.

In fact, the pattern is even stronger after controlling for game situation. It turns out that the biggest factor was the score (especially when time was running out). A losing team needs to be passing the ball more often to try to come back, so their 2nd and 10 situations are more likely to follow passes on 1st down. If those teams are *still* calling rushing plays often, it’s even more evidence that something strange is going on.
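Once every play carries a weight, the comparison is just a weighted share of rushes within each group. A minimal sketch, with a hypothetical function name and toy inputs rather than the real play-by-play data:

```python
def weighted_rush_rate(rushed, weights):
    """Share of 2nd down rushes, counting each play by its weight."""
    total = sum(weights)
    return sum(w for was_rush, w in zip(rushed, weights) if was_rush) / total


# Four toy plays from one group; the last one is heavily up-weighted.
rushed = [True, True, False, False]
raw_rate = weighted_rush_rate(rushed, [1.0, 1.0, 1.0, 1.0])
adjusted_rate = weighted_rush_rate(rushed, [1.0, 1.0, 1.0, 3.0])
```

With equal weights this is the plain rushing percentage; with IPT weights, under-represented situations count for more, which is how an adjustment can move the rate in either direction.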

Ok, so controlling for game situation doesn’t explain away the spike in rushing percent at 2nd and 10. Is it due to coaches’ impulse to alternate their play calling?

Maybe, but that can’t be the whole story. If it were, I would expect to see the trend consistent across different 2nd down scenarios. But when we look at all 2nd-down distances, not just 2nd and 10, we see something else:

If their teams don’t get very far on 1st down, coaches are inclined to change their play call on 2nd down. But as a team gains more yards on 1st down, coaches are less and less inclined to switch. If the team got six yards, coaches rush about 57% of the time on 2nd down regardless of whether they ran or passed last play. And it actually reverses if you go beyond that – if the team gained more than six yards on 1st down, coaches have a tendency to repeat whatever just succeeded.

It sure looks like coaches are reacting to the previous play in a predictable Win-Stay Lose-Shift pattern.

Following a hunch, I did one more comparison: passes completed for no gain vs. incomplete passes. If incomplete passes feel more like a failure, the recency bias would influence coaches to call more rushing plays after an incompletion than after a pass that was caught for no gain.

Before the re-weighting process, there’s almost no difference in play calling between the two groups: 43.3% vs. 43.6% (p=.88). However, after adjusting for the game situation (especially quarterback accuracy), the trend reemerges: in similar game scenarios, teams rush 44.4% of the time after an incompletion and only 41.5% after passes completed for no gain. It might sound small, but with 20,000 data points it’s a pretty big difference (p < 0.00005).

All signs point to the recency bias being the primary culprit.

## Reasons to Doubt:

1) There are a lot of variables I didn’t control for, including fatigue, player substitutions, temperature, and whether the game clock was stopped in between plays. Any or all of these could impact the play calling.

2) Brian Burke’s (and my) initial premise was that if teams are irrationally rushing more often after incomplete passes, defenses should be able to prepare for this and exploit the pattern. Conversely, going against the trend should be more likely to catch the defense off-guard.

I really expected to find plays gaining more yards if they bucked the trends, but it’s not as clear as I would like.  I got excited when I discovered that rushing plays on 2nd and 10 did worse if the previous play was a pass – when defenses should expect it more. However, when I looked at other distances, there just wasn’t a strong connection between predictability and yards gained.

One possibility is that I needed to control for more variables. But another possibility is that while defenses *should* be able to exploit a coach’s predictability, they can’t or don’t. To give Brian the last words:

But regardless of the reasons, coaches are predictable, at least to some degree. Fortunately for offensive coordinators, it seems that most defensive coordinators are not aware of this tendency. If they were, you’d think they would tip off their own offensive counterparts, and we’d see this effect disappear.

## Why Decision Theory Tells You to Eat ALL the Cupcakes

Imagine that you have a big task coming up that requires an unknown amount of willpower – you might have enough willpower to finish, you might not. You’re gearing up to start when suddenly you see a delicious-looking cupcake on the table. Do you indulge in eating it? According to psychology research and decision-theory models, the answer isn’t simple.

If you resist the temptation to eat the cupcake, current research indicates that you’ve depleted your stores of willpower (psychologists call it ego depletion), which causes you to be less likely to have the willpower to finish your big task. So maybe you should save your willpower for the big task ahead and eat it!

…But if you’re convinced already, hold on a second. How easily you give in to temptation gives evidence about your underlying strength of will. After all, someone with weak willpower will find the reasons to indulge more persuasive. If you end up succumbing to the temptation, it’s evidence that you’re a person with weaker willpower, and are thus less likely to finish your big task.

How can eating the cupcake cause you to be more likely to succeed while also giving evidence that you’re more likely to fail?

### Conflicting Decision Theory Models

The strangeness lies in the difference between two conflicting models of how to make decisions. Luke Muehlhauser describes them well in his Decision Theory FAQ:

This is not a “merely verbal” dispute (Chalmers 2011). Decision theorists have offered different algorithms for making a choice, and they have different outcomes. Translated into English, the [second] algorithm (evidential decision theory or EDT) says “Take actions such that you would be glad to receive the news that you had taken them.” The [first] algorithm (causal decision theory or CDT) says “Take actions which you expect to have a positive effect on the world.”

The crux of the matter is how to handle the fact that we don’t know how much underlying willpower we started with.

Causal Decision Theory asks, “How can you cause yourself to have the most willpower?”

It focuses on the fact that, in any state, spending willpower resisting the cupcake causes ego depletion. Because of that, it says our underlying amount of willpower is irrelevant to the decision. The recommendation stays the same regardless: eat the cupcake.

Evidential Decision Theory asks, “What will give evidence that you’re likely to have a lot of willpower?”

We don’t know whether we’re starting with strong or weak will, but our actions can reveal that one state or another is more likely. It’s not that we can change the past – Evidential Decision Theory doesn’t look for that causal link – but our choice indicates which possible version of the past we came from.

Yes, seeing someone undergo ego depletion would be evidence that they lost a bit of willpower.  But watching them resist the cupcake would probably be much stronger evidence that they have plenty to spare.  So you would rather “receive news” that you had resisted the cupcake.
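To see the two theories disagree in numbers, here is a toy calculation. Every probability in it is invented for illustration, and the function names are hypothetical; the point is only that the same inputs lead to different recommendations.

```python
# All numbers below are made up for illustration.
PRIOR = {"strong": 0.5, "weak": 0.5}      # belief about your willpower type
P_RESIST = {"strong": 0.8, "weak": 0.2}   # chance each type resists the cupcake
P_FINISH = {                              # chance of finishing the big task
    ("strong", "eat"): 0.9, ("strong", "resist"): 0.8,  # resisting costs
    ("weak", "eat"): 0.4, ("weak", "resist"): 0.3,      # some willpower
}


def p_action(state, action):
    return P_RESIST[state] if action == "resist" else 1 - P_RESIST[state]


def cdt_value(action):
    """Causal: the action can't change your type, so weight by the prior."""
    return sum(PRIOR[s] * P_FINISH[(s, action)] for s in PRIOR)


def edt_value(action):
    """Evidential: update your belief about your type on the action first."""
    evidence = sum(PRIOR[s] * p_action(s, action) for s in PRIOR)
    return sum(
        PRIOR[s] * p_action(s, action) / evidence * P_FINISH[(s, action)]
        for s in PRIOR
    )


cdt_choice = max(["eat", "resist"], key=cdt_value)
edt_choice = max(["eat", "resist"], key=edt_value)
```

With these made-up numbers, the causal calculation favors eating (resisting lowers your odds in both states), while the evidential one favors resisting (resisting makes it much more likely you are the strong-willed type).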

### A Third Option

Each of these models has strengths and weaknesses, and a number of thought experiments – especially the famous Newcomb’s Paradox – have sparked ongoing discussions and disagreements about what decision theory model is best.

One attempt to improve on standard models is Timeless Decision Theory, a method devised by Eliezer Yudkowsky of the Machine Intelligence Research Institute.  Alex Altair recently wrote up an overview, stating in the paper’s abstract:

When formulated using Bayesian networks, two standard decision algorithms (Evidential Decision Theory and Causal Decision Theory) can be shown to fail systematically when faced with aspects of the prisoner’s dilemma and so-called “Newcomblike” problems. We describe a new form of decision algorithm, called Timeless Decision Theory, which consistently wins on these problems.

It sounds promising, and I can’t wait to read it.

### But Back to the Cupcakes

For our particular cupcake dilemma, there’s a way out:

Precommit. You need to promise – right now! – to always eat the cupcake when it’s presented to you. That way you don’t spend any willpower on resisting temptation, but your indulgence doesn’t give any evidence of a weak underlying will.

And that, ladies and gentlemen, is my new favorite excuse for why I ate all the cupcakes.

## Will moving to California make you happier?

I pass dozens of brilliantly-colored flowers like this on my daily walk to work. (Photo credit: B Mully, Flickr)

I might have to disagree with a Nobel Laureate on this one.

According to Daniel Kahneman, Nobel prize-winning psychologist and author of the excellent Thinking, Fast and Slow, the answer is “No.” A recent post on Big Think describes how Kahneman asked people to predict who’s happier, on average, Californians or Midwesterners. Most people (from both regions!) say, “Californians.” That’s because, Kahneman explains, the act of comparison highlights what’s saliently different between the two regions: their climate. And on that dimension, California’s a pretty clear winner.

And indeed, Californians report loving their climate and Midwesterners loathing theirs. Yet despite that, the overall life satisfaction in the two regions turns out to be nearly identical, according to a 1998 survey by Kahneman. Climate just isn’t that important to happiness, it turns out. The fact that it greatly influences people’s predictions of relative happiness in California vs. the Midwest stems from something called the “Focusing illusion,” Kahneman explains — a bias he sums up with the pithy, “Nothing in life is as important as you think it is when you are thinking about it.”

So far, I have no beef with this interpretation. What I *do* object to is the conclusion, which Kahneman implies and Big Think makes explicit, that “moving to california won’t make you happy.”

I moved from New York, NY to Berkeley, CA, earlier this year, and — having read Kahneman — I didn’t expect the climate to make a noticeable difference in my mood. And yet, every day, when I would leave my house, I found my spirits buoyed by the balmy weather and the clear blue sky. I noticed, multiple times daily, how beautiful the vegetation was and how fresh, fragrant, and — well — un-Manhattanlike the air smelled. It made a noticeable difference in my mood nearly every day, and continues to, six months after I moved.

I was a little surprised that my result was so different from Kahneman’s. And then I realized: Most of those Californians in his study have always been Californians. They grew up there; they didn’t move from the Midwest (or Manhattan) to California. So it’s understandable that their climate doesn’t make a big impact on their happiness, because they have no standard of comparison. They’re not constantly thinking to themselves — as I have been — “Man, it’s so *nice* not to have to shiver inside a bulky winter coat!” or “Man, it’s such a relief not to smell garbage bags sitting out on the sidewalk,” or “Wow, it’s quite pleasant not to be sticky with sweat.”

I’m only one data point, of course, and it’s possible that if you studied people who moved from the Midwest to CA, you’d find that their change in happiness was in fact no different than that of people who moved from CA to the Midwest. But at least, I think it’s important to note that that’s not the study Kahneman did. And that, as a general rule in reading (or conducting!) happiness research, it’s important to remember that the happiness you get from a state depends on your previous states.

## A rational view of tradition

In my latest video blog I answer a listener’s question about why rationalists are more likely to abandon social norms like marriage, monogamy, standard gender roles, having children, and so on. And then I weigh in on whether that’s a rational attitude to take:

## RS episode #53: Parapsychology

In Episode 53 of the Rationally Speaking Podcast, Massimo and I take on parapsychology, the study of phenomena such as extrasensory perception, precognition, and remote viewing. We discuss the type of studies parapsychologists conduct, what evidence they’ve found, and how we should interpret that evidence. The field is mostly not taken seriously by other scientists, which parapsychologists argue is unfair, given that their field shows some consistent and significant results. Do they have a point? Massimo and I discuss the evidence and talk about what the results from parapsychology tell us about the practice of science in general.

http://www.rationallyspeakingpodcast.org/show/rs53-parapsychology.html

## You’re such an essentialist!

My latest video blog is about essentialism, and why it’s damaging to your rationality — and your happiness.

## RS #48: Philosophical Counseling

Can philosophy be a form of therapy? On the latest episode of Rationally Speaking, we interview Lou Marinoff, a philosopher who founded the field of “philosophical counseling,” in which people pay philosophers to help them deal with their own personal problems using philosophy. For example, one of Lou’s clients wanted advice on whether to quit her finance job to pursue a personal goal; another sought help deciding how to balance his son’s desire to go to Disneyland with his own fear of spoiling his children.

As you can hear in the interview, I’m interested but I’ve got major reservations. I certainly think that philosophy can improve how you live your life; I’ve got some great examples of that from personal experience. But I’m skeptical of Lou’s project for two related reasons: first, because I think most problems in people’s lives are best addressed by a combination of psychological science and common sense. They require a sophisticated understanding of how our decision-making algorithms go wrong: for example, why we make decisions that we know are bad for us, how we end up with distorted views of our situations and of our own strengths and weaknesses, and so on. Those are empirical questions, and philosophy’s not an empirical field, so relying on philosophy to solve people’s problems is going to miss a large part of the picture.

The other problem is that it wasn’t at all clear to me how philosophical counselors choose which philosophy to cite. For any viewpoint in the literature, you can pretty reliably find an opposing one. In the case of the father afraid of spoiling his kid, Lou cited Aristotle to argue for an “all things in moderation” policy. But, I pointed out, he could just as easily have cited Stoic philosophers arguing that happiness lies in relinquishing desires.  So if you can pick and choose any philosophical advice you want, then aren’t you really just giving your client your own opinion about his problem, and just couching your advice in the words of a prestigious philosopher?

Hear more at Rationally Speaking Episode 48, “Philosophical Counseling.”