## Finding Pi From Random Numbers

If I gave you 10,000 random numbers between 0 and 1, how precisely could you estimate pi? It’s Pi Approximation Day (22/7 in the European format), which seems like the perfect time to share some math!

When we’re modeling something complex and either can’t or don’t want to bother to find a closed-form analytic solution, we can get close to the answer with a Monte Carlo simulation. The more precisely we want to estimate the answer, the more simulations we need to run.

One classic Monte Carlo approach to find pi is to treat our 10,000 numbers as the x and y coordinates of 5,000 points in a square between (0,0) and (1,1). If we draw a unit circle centered at (0,0), the fraction of the points that land inside the circle gives us a rough estimate of the area of the quarter circle within the square, which should be pi / 4. Multiply by 4, and we have our estimate.

simulated mean:  3.1608
95% confidence interval: 3.115 - 3.206
confidence interval size: 0.091

This technique works, but it’s not the most precise. There’s a fair bit of variance – a 95% confidence interval is about .09 units wide, from 3.12 to 3.21. Can we do better?
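The dart-throwing estimate described above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not necessarily how the figures above were produced: the function name and the seed are placeholders, and the confidence interval assumes a normal approximation.

```python
import math
import random
import statistics


def estimate_pi_hit_or_miss(numbers):
    """Treat consecutive pairs of uniform numbers as (x, y) points in the unit square."""
    xs, ys = numbers[0::2], numbers[1::2]
    # A point scores 4 if it lands inside the quarter circle and 0 otherwise,
    # so the mean score estimates 4 * (pi / 4) = pi.
    scores = [4.0 if x * x + y * y <= 1.0 else 0.0 for x, y in zip(xs, ys)]
    mean = statistics.fmean(scores)
    # Normal-approximation 95% confidence interval for the mean (an assumption).
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, (mean - half_width, mean + half_width)


rng = random.Random(2016)  # arbitrary seed, only for repeatability
numbers = [rng.random() for _ in range(10_000)]
estimate, (low, high) = estimate_pi_hit_or_miss(numbers)
print(f"simulated mean: {estimate:.4f}")
print(f"95% confidence interval: {low:.3f} - {high:.3f}")
print(f"confidence interval size: {high - low:.3f}")
```

The exact numbers depend on the seed, so a different run will land somewhere else inside a band of roughly the same width.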

Another way to find the quarter-circle’s area is to treat it as a function and take its average value. (Area is average height times width, the width is 1, so the area inside the quarter-circle is just its average height.) That would give us pi/4, so we multiply by 4 to get our estimate for pi.

$y^2+x^2=1$

$y^2=1-x^2$

$f(x)=\sqrt{1-x^{2}}$

We have 10,000 random numbers between 0 and 1; all we have to do is calculate f(x) for each, take the mean, and multiply by 4:

simulated mean:  3.1500
95% confidence interval: 3.133 - 3.167
confidence interval size: 0.0348

This gives us a more precise estimate; the 95% confidence interval is less than half as wide as it was!
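The average-height approach can be sketched the same way. The helper names and the seed below are placeholders, and the interval again assumes a normal approximation.

```python
import math
import random
import statistics


def mean_with_ci(values):
    """Return the sample mean and a normal-approximation 95% confidence interval."""
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, (mean - half_width, mean + half_width)


def quarter_circle_height(x):
    """f(x) = sqrt(1 - x^2), the height of the unit quarter circle at x."""
    return math.sqrt(1.0 - x * x)


rng = random.Random(2016)  # arbitrary seed, only for repeatability
xs = [rng.random() for _ in range(10_000)]
# Scale each height by 4 so the sample mean estimates pi rather than pi / 4.
estimate, (low, high) = mean_with_ci([4.0 * quarter_circle_height(x) for x in xs])
print(f"simulated mean: {estimate:.4f}")
print(f"confidence interval size: {high - low:.4f}")
```

All 10,000 numbers are used as x values here, instead of being split into 5,000 (x, y) pairs.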

But we can do better.

### Antithetic Variates

What happens if we take our 10,000 random numbers and flip them around? They’re just a set of points uniformly distributed between 0 and 1, so (1-x) is also a set of points uniformly distributed between 0 and 1. If the expected value of 4·f(x) is pi with some variance, then the expected value of 4·f(1-x) should also be pi with the same variance.

So how does this help us? It looks like we just have two different ways to get the same level of precision.

Well, if f(x) is particularly high, then f(1-x) is going to be particularly low. By pairing each random number with its complement, we can offset some of the error and get an estimate more closely centered around the true mean. Averaging two distributions that each have the same expected value should still give us the same answer.

(This trick, known as using antithetic variates, doesn’t work with every function, but works here because the function f(x) always decreases as x increases.)

simulated mean:  3.1389
95% confidence interval: 3.132 - 3.145
confidence interval size: 0.0131

Lo and behold, our 95% confidence interval has narrowed down to 0.013, still only using 10,000 random numbers!
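Here is a minimal sketch of antithetic variates for this problem. It assumes each of the 10,000 numbers is paired with its complement 1 - x and that the 10,000 pair averages are treated as the sample; the names and the seed are placeholders.

```python
import math
import random
import statistics


def scaled_height(x):
    """4 * sqrt(1 - x^2), whose mean over [0, 1] is pi."""
    return 4.0 * math.sqrt(1.0 - x * x)


def antithetic_estimate(xs):
    """Average each value with its antithetic partner at 1 - x before taking the mean."""
    pair_means = [(scaled_height(x) + scaled_height(1.0 - x)) / 2.0 for x in xs]
    mean = statistics.fmean(pair_means)
    # Normal-approximation 95% confidence interval over the pair averages.
    half_width = 1.96 * statistics.stdev(pair_means) / math.sqrt(len(pair_means))
    return mean, (mean - half_width, mean + half_width)


rng = random.Random(2016)  # arbitrary seed, only for repeatability
xs = [rng.random() for _ in range(10_000)]
estimate, (low, high) = antithetic_estimate(xs)
print(f"simulated mean: {estimate:.4f}")
print(f"confidence interval size: {high - low:.4f}")
```

The pair averages vary less than the individual values because a high f(x) is offset by a low f(1-x), which is what narrows the interval.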

To be fair, this only beats 22/7 about 30% of the time with 10,000 random simulations. Can we reliably beat the approximation without resorting to more simulations?

### Control Variates

It turns out we can squeeze a bit more information out of those randomly generated numbers. If we know the exact expected value for a part of the function, we can be more deliberate about offsetting the variance. In this case, let’s use c(x)=x^2 as our “control variate function”, since we know that the average value of x^2 from 0 to 1 is exactly 1/3.

Where our simulated function was

$f(x)=\sqrt{1-x^{2}}$

now we add a term that will have an expected value of 0, but will help reduce variance:

$f'(x)=\sqrt{1-x^{2}}+b(x^2-\frac{1}{3})$

For each of our 10,000 random x’s, if x^2 is above average, we know that f(x) will probably be a bit *below* average, and we nudge it up. If x^2 is below average, we know f(x) is likely a bit high, and nudge it down. The overall expected value doesn’t change, but we’re compressing things even further toward the mean.

The constant ‘b’ in our offset term determines how much we ‘nudge’ our function, and is estimated based on how our control variate covaries with the target function:

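$b = -\frac{\mathrm{Cov}(f(x), c(x))}{\mathrm{Var}(c(x))}$

(In this case, with f(x) scaled by 4 so that its mean estimates pi, b is about 2.9. The minus sign makes b positive, because f(x) falls as x^2 rises and the covariance is negative.) Here’s what we get: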

simulated mean:  3.1412
95% confidence interval: 3.1381 - 3.1443
confidence interval size: 0.0062

See how the offset flattens our new function (in orange) to be tightly centered around 3.14?

This is pretty darn good. Without resorting to more simulations, we reduced our 95% confidence interval to 0.006.  This algorithm gives a closer approximation to pi than 22/7 about 57% of the time.
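The control variate adjustment can be sketched as follows. This is a sketch under stated assumptions: b is estimated from the same sample as minus the sample covariance over the sample variance, which is the sign that matches adding b(x^2 - 1/3), and the function is scaled by 4 so that its mean estimates pi. The names and the seed are placeholders.

```python
import math
import random
import statistics


def control_variate_estimate(xs):
    """Estimate pi from uniform samples, using c(x) = x^2 as a control variate."""
    n = len(xs)
    f_values = [4.0 * math.sqrt(1.0 - x * x) for x in xs]  # mean estimates pi
    c_values = [x * x for x in xs]  # known mean: exactly 1/3
    f_mean = statistics.fmean(f_values)
    c_mean = statistics.fmean(c_values)
    covariance = sum(
        (f - f_mean) * (c - c_mean) for f, c in zip(f_values, c_values)
    ) / (n - 1)
    # f falls as x^2 rises, so the covariance is negative and b comes out positive.
    b = -covariance / statistics.variance(c_values)
    adjusted = [f + b * (c - 1.0 / 3.0) for f, c in zip(f_values, c_values)]
    mean = statistics.fmean(adjusted)
    half_width = 1.96 * statistics.stdev(adjusted) / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width), b


rng = random.Random(2016)  # arbitrary seed, only for repeatability
xs = [rng.random() for _ in range(10_000)]
estimate, (low, high), b = control_variate_estimate(xs)
print(f"b: {b:.2f}")
print(f"simulated mean: {estimate:.4f}")
print(f"confidence interval size: {high - low:.4f}")
```

The adjustment term has an expected value of zero, so it leaves the estimate centered on pi while cancelling much of the spread.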

If we’re not bound by the number of random numbers we generate, we can get as close as we want. With 100,000 points, our control variates technique has a 95% confidence interval of 0.002, and beats 22/7 about 98% of the time.
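To check a claim like "beats 22/7 about 57% of the time", one option is to repeat the whole experiment many times and count how often the estimate lands closer to pi than 22/7 does. The sketch below does that for the control variates estimator. The number of trials, the seed, and the function names are assumptions, and the measured rate will wobble by a few percentage points from run to run.

```python
import math
import random

APPROXIMATION = 22 / 7


def control_variate_pi(xs):
    """Point estimate of pi using c(x) = x^2 as a control variate."""
    n = len(xs)
    f_values = [4.0 * math.sqrt(1.0 - x * x) for x in xs]
    c_values = [x * x for x in xs]
    f_mean = sum(f_values) / n
    c_mean = sum(c_values) / n
    cov = sum((f - f_mean) * (c - c_mean) for f, c in zip(f_values, c_values))
    var = sum((c - c_mean) ** 2 for c in c_values)
    b = -cov / var  # the (n - 1) factors cancel in the ratio
    # Mean of f + b * (c - 1/3), written in terms of the two sample means.
    return f_mean + b * (c_mean - 1.0 / 3.0)


def beat_rate(estimator, n_numbers, n_trials, seed=0):
    """Fraction of repeated trials in which the estimate is closer to pi than 22/7."""
    rng = random.Random(seed)
    target = abs(APPROXIMATION - math.pi)
    wins = 0
    for _ in range(n_trials):
        xs = [rng.random() for _ in range(n_numbers)]
        if abs(estimator(xs) - math.pi) < target:
            wins += 1
    return wins / n_trials


rate = beat_rate(control_variate_pi, n_numbers=10_000, n_trials=200)
print(f"closer than 22/7 in {rate:.0%} of trials")
```

Passing a larger `n_numbers` shows how the rate climbs as more random numbers are used.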

These days, as computing power gets cheaper, we can generate 100,000 or even 1,000,000 random numbers with no problem. That’s what makes simulations so versatile – we can find ways to simulate even incredibly complicated processes and unbounded functions, deciding how precise we need to be.

Happy Pi Approximation Day!

(You may ask, why is there a “Pi Approximation Day” and not a “Pi Simulation Day”? Well, according to Nick Bostrom, every day is Simulation Day. Probably.)

## Quantifying the Trump-iness of Political Sentences

You could say that Donald Trump has a… distinct way of speaking. He doesn’t talk the way other politicians do (even ignoring his accent), and the contrast between him and Clinton is pretty strong. But can we figure out what differentiates them? And then, can we find the most… Trump-ish sentence?

That was the challenge my friend Spencer posed to me as my first major foray into data science, the new career I’m starting. It was the perfect project: fun, complicated, and requiring me to learn new skills along the way.

To find out the answers, read on! The results shouldn’t be taken too seriously, but they’re amusing and give some insight into what might be important to each candidate and how they talk about the political landscape. Plus, as a portfolio project, it demonstrates the data science techniques I’m learning.

If you want to play with the model yourself, I also put together an interactive javascript page for you: you can test your judgment compared to its predictions, browse the most Trumpish/Clintonish sentences and terms, and enter your own text for the model to evaluate.

To read about how the model works, I wrote a rundown with both technical and non-technical details below the tables and graphs. But without further ado, the results:

# The Trump-iest and Clinton-est Sentences and Phrases from the 2016 Campaign:

Clinton’s top sentence is listed first, then Trump’s.
Top sentence: “That’s why the slogan of my campaign is stronger together because I think if we work together and overcome the divisiveness that sometimes sets americans against one another and instead we make some big goals and I’ve set forth some big goals, getting the economy to work for everyone, not just those at the top, making sure we have the best education system from preschool through college and making it affordable and somp[sic] else.” — Presidential Candidates Debate

Predicted Clinton: 0.99999999999
Predicted Trump: 1.04761466567e-11

Frustratingly, I couldn’t download or embed the C-SPAN video for this clip, so here are two of the other top 5 Clinton-iest sentences:

Presidential Candidate Hillary Clinton Rally in Orangeburg, South Carolina

Presidential Candidate Hillary Clinton Economic Policy Address

Top sentence: “As you know, we have done very well with the evangelicals and with religion generally speaking, if you look at what’s happened with all of the races, whether it’s in south carolina, i went there and it was supposed to be strong evangelical, and i was not supposed to win and i won in a landslide, and so many other places where you had the evangelicals and you had the heavy christian groups and it was just — it’s been an amazing journey to have — i think we won 37 different states.” — Faith and Freedom Coalition Conference

Predicted Clinton: 4.29818403092e-11
Predicted Trump: 0.999999999957

Frustratingly, I couldn’t download or embed the C-SPAN video for this clip either, so here are two of the other top 5 Trump-iest sentences:

Presidential Candidate Donald Trump Rally in Arizona

Presidential Candidate Donald Trump New York Primary Night Speech

## Top Terms

Clinton

| Term | Multiplier |
|---|---|
| my husband | 12.95 |
| recession | 10.28 |
| attention | 9.72 |
| wall street | 9.44 |
| grateful | 9.23 |
| or us | 8.39 |
| citizens united | 7.97 |
| mother | 7.20 |
| something else | 7.17 |
| strategy | 7.05 |
| clear | 6.81 |
| kids | 6.74 |
| gun | 6.69 |
| i remember | 6.51 |
| corporations | 6.51 |
| learning | 6.36 |
| democratic | 6.28 |
| clean energy | 6.24 |
| well we | 6.14 |
| insurance | 6.14 |
| grandmother | 6.12 |
| experiences | 6.00 |
| progress | 5.94 |
| auto | 5.90 |
| climate | 5.89 |
| over again | 5.85 |
| often | 5.80 |
| a raise | 5.71 |
| immigration reform | 5.62 |

Trump

| Term | Multiplier |
|---|---|
| tremendous | 14.57 |
| guy | 10.25 |
| media | 8.60 |
| does it | 8.24 |
| hillary | 8.15 |
| politicians | 8.00 |
| almost | 7.83 |
| incredible | 7.42 |
| illegal | 7.16 |
| general | 7.03 |
| frankly | 6.97 |
| border | 6.89 |
| establishment | 6.84 |
| jeb | 6.76 |
| allowed | 6.72 |
| obama | 6.48 |
| poll | 6.24 |
| by the way | 6.21 |
| bernie | 6.20 |
| ivanka | 6.09 |
| japan | 5.98 |
| politician | 5.96 |
| nice | 5.93 |
| conservative | 5.90 |
| islamic | 5.77 |
| hispanics | 5.76 |
| deals | 5.47 |
| win | 5.43 |
| guys | 5.34 |
| believe me | 5.32 |

## Cherrypicked pairs of terms:

| Clinton term | Multiplier | Trump term | Multiplier |
|---|---|---|---|
| president obama | 3.27 | obama | 6.49 |
| immigrants | 3.40 | illegal immigrants | 4.87 |
| clean energy | 6.24 | energy | 1.97 |
| the wealthy | 4.21 | wealth | 2.11 |
| learning | 6.36 | earning | 1.38 |
| muslims | 3.46 | the muslims | 1.75 |
| senator sanders | 3.18 | bernie | 6.20 |

# How the Model Works:

### Defining the problem: What makes a sentence “Trump-y?”

I decided that the best way to quantify ‘Trump-iness’ of a sentence was to train a model to predict whether a given sentence was said by Trump or Clinton. The Trumpiest sentence will be the one that the predictive model would analyze and say “Yup, the chance this was Trump rather than Clinton is 99.99%”.

Along the way, with the right model, we can ‘look under the hood’ to see what factors into the decision.

Technical details:

The goal is to build a classifier that can distinguish between the candidates’ sentences, optimizing for ROC_AUC, and that allows us to extract meaningful, explainable coefficients.

### Gathering and processing the data:

In order to train the model, I needed large bodies of text from each candidate. I ended up scraping transcripts from events on C-SPAN.org. Unfortunately, they’re uncorrected closed caption transcripts and contained plenty of typos and misattributions. On the other hand, they’re free.

I did a bit to clean up some recurring problems like the transcript starting every quote section with “Sec. Clinton:” or including descriptions like [APPLAUSE] or [MUSIC]. (Unfortunately, they don’t reliably mark the end of the music, and C-SPAN sometimes claims that Donald Trump is the one singing ‘You Can’t Always Get What You Want.’)

Technical details:

I ended up learning to use Python’s Beautiful Soup library to identify the list of videos C-SPAN considers campaign events by the candidates, find their transcripts, and grab only the parts they supposedly said. I learned to use some basic regular expressions to do the cleaning.

My scraping tool is up on github, and is actually configured to be able to grab transcripts for other people as well.
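As an illustration of the kind of regular-expression cleanup described above, here is a small sketch. The patterns are hypothetical, not the ones in the scraping tool: only the "Sec. Clinton:" prefix and the [APPLAUSE] and [MUSIC] markers come from the transcripts mentioned here, while the other titles and the sample text are made up.

```python
import re

# Hypothetical patterns; real transcripts may need different ones.
# Matches a speaker label such as "Sec. Clinton:" at the start of a line.
SPEAKER_PREFIX = re.compile(r"^\s*(?:Sec|Mr|Mrs|Sen)\.\s+[A-Z][a-z]+:\s*", re.MULTILINE)
# Matches bracketed descriptions such as [APPLAUSE] or [MUSIC].
BRACKETED_NOTE = re.compile(r"\[[A-Z ]+\]")


def clean_transcript(text):
    """Strip speaker labels and bracketed notes, then collapse whitespace."""
    text = SPEAKER_PREFIX.sub("", text)
    text = BRACKETED_NOTE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample text in the style of a closed-caption transcript.
sample = (
    "Sec. Clinton: Thank you all. [APPLAUSE] It is great to be here.\n"
    "Sec. Clinton: We have work to do. [MUSIC]"
)
cleaned = clean_transcript(sample)
print(cleaned)
```

A cleanup like this can't fix misattributed lyrics, since the transcript doesn't reliably mark where the music ends.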

### Converting the data into usable features

After separating the large blocks of text into sentences and then words, I had some decisions to make. In an effort to focus on interesting and meaningful content, I removed sentences that were too short or too long – “Thank you” comes up over and over, and the longest sentences tended to be errors in the transcription service. It’s a judgement call, but I wanted to keep half the sentences, which set cutoffs at 9 words and 150 words. 34,108 sentences remained.

A common technique in natural language processing is to remove the “stopwords” – common non-substantive words like articles (a, the), pronouns (you, we), and conjunctions (and, but). However, following James Pennebaker’s research, which found these words are surprisingly useful in predicting personality, I left them in.

Now we have what we need: sequences of words that the model can consider evidence of Trump-iness.

Technical details:

I used NLTK to tokenize the text into sentences, but wrote my own regular expressions to tokenize the words. I considered it important to keep contractions together and include single-character tokens, which the standard NLTK function wouldn’t have done.
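A word tokenizer along those lines might look like the sketch below. The pattern is a guess at the idea (keep contractions together, allow single-character tokens), not the actual expression used, and treating the 9- and 150-word cutoffs as inclusive is also an assumption.

```python
import re

# Hypothetical pattern: runs of letters or digits, optionally joined by apostrophes,
# so "don't" stays one token and single characters such as "i" are kept.
WORD_PATTERN = re.compile(r"[a-z0-9]+(?:['’][a-z0-9]+)*")


def tokenize_words(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return WORD_PATTERN.findall(sentence.lower())


def keep_sentence(sentence, min_words=9, max_words=150):
    """Keep only sentences whose word count falls within the cutoffs (assumed inclusive)."""
    return min_words <= len(tokenize_words(sentence)) <= max_words


tokens = tokenize_words("We don't win anymore, I tell you.")
print(tokens)
print(keep_sentence("Thank you"))
```

Short sentences like "Thank you" fall below the lower cutoff and are dropped before any modeling happens.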

I used a CountVectorizer from sklearn to extract ngrams and later selected the most important terms using a SelectFromModel with a Lasso Logistic Regression. It was a balance – more terms would typically improve accuracy, but water down the meaningfulness of each coefficient.

I tested using various additional features, like parts of speech and lemmas (using the fantastic spaCy library) and sentiment analysis (using the TextBlob library), but found that they only provided marginal benefit and made the model much slower. Even just using 1- to 3-word ngrams, I got 0.92 ROC_AUC.

### Choosing & Training the Model

One of the most interesting challenges was avoiding overfitting. Without taking countermeasures, the model could look at a typo-riddled sentence like “Wev justv don’tv winv anymorev.” and say “Aha! Every single one of those words is unique to Donald Trump, therefore this is the most Trump-like sentence ever!”

I addressed this problem in two ways: the first is by using regularization, a standard machine learning technique that penalizes a model for using larger coefficients. As a result, the model is discouraged from caring about words like ‘justv’ which might only occur two times, since they would only help identify those couple sentences. On the other hand, a word like ‘frankly’ helps identify many, many sentences and is worth taking a larger penalty to give it more importance in the model.

The other technique was to use batch predictions – dividing the sentences into 20 chunks, and evaluating each chunk by only training on the other 19. This way, if the word ‘winv’ only appears in a single chunk, the model won’t see it in the training sentences and won’t be swayed. Only words that appear throughout the campaign have a significant impact in the model.

Technical details:

The model uses a logistic regression classifier because it produces very explainable coefficients. If that weren’t a factor, I might have tried a neural net or SVM. (I wouldn’t expect a random forest to do well with such sparse data.) In order to set the regularization parameters for both the final classifier and the feature-selection Lasso logistic regression, I used sklearn’s cross-validated grid search object, optimizing for ROC_AUC.

During the prediction process, I used a stratified Kfold to divide the data in order to ensure each chunk would have the appropriate mix of Trump and Clinton sentences. It was tempting to treat the sentences more like a time series and only use past data in the predictions, but we want to consider how similar old sentences are to the whole corpus.
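The pieces described above (ngram counts, Lasso-based feature selection, a logistic regression classifier, and stratified out-of-fold predictions) can be wired together roughly as follows. This is a sketch, not the actual model: the toy sentences are made up, the `C` values and the token pattern are placeholders instead of grid-searched settings, and it uses 3 folds only because the toy data is far too small for 20.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Made-up toy sentences standing in for the scraped transcripts.
sentences = [
    "we are going to win so much believe me",
    "the media is tremendous frankly believe me",
    "we will build the wall on the border believe me",
    "frankly the politicians are not nice guys",
    "it is a tremendous deal and we win",
    "believe me the border is a tremendous problem",
    "we need clean energy and a raise for working families",
    "my husband and i are grateful for your support",
    "wall street and corporations must pay their share for families",
    "i remember my grandmother and my mother working hard",
    "our kids deserve progress on clean energy",
    "i am grateful for the progress working families made",
]
labels = [1] * 6 + [0] * 6  # 1 = Trump, 0 = Clinton

model = make_pipeline(
    # 1- to 3-word ngrams; the token pattern keeps contractions and single characters.
    CountVectorizer(ngram_range=(1, 3), token_pattern=r"[a-z0-9]+(?:'[a-z0-9]+)*"),
    # L1-penalized ("Lasso") logistic regression picks which terms to keep.
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=10.0)),
    # Final classifier with the default L2 regularization; C is a placeholder.
    LogisticRegression(C=1.0),
)

# Each sentence is scored by a model trained only on the other folds.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
probabilities = cross_val_predict(
    model, sentences, labels, cv=folds, method="predict_proba"
)
auc = roc_auc_score(labels, probabilities[:, 1])
print("out-of-fold ROC AUC:", auc)
```

Because the whole pipeline is refit inside each fold, feature selection never sees the sentences being scored, which is the point of the batch predictions.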

### Interpreting and Visualizing the Results:

The model produced two interesting types of data: how likely the model thought each sentence was spoken by Trump or Clinton (how ‘Trumpish’ vs. ‘Clintonish’ it is), and how any particular term impacts those predicted odds. So if a sentence is predicted to be spoken by Trump with estimated 99.99% probability, the model considers it extremely Trumpish.

The terms’ multipliers indicate how each word or phrase impacts the predicted odds. The model starts at 1:1 (50%/50%). Let’s say the sentence includes the word “incredible”, which has a Trump multiplier of 7.42. The odds are now 7.42 : 1, or roughly 88% in favor of Trump. If the model then sees the word “grandmother”, which has a Clinton multiplier of 6.12, its estimated odds become 7.42 : 6.12 (or 1.21 : 1), roughly 55% Trump. Each term has a multiplying effect, so a 4x word and a 2x word together have as much impact as an 8x word, not 6x.
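The odds arithmetic above is easy to check in code. The function below is a hypothetical helper that only illustrates the worked example; it starts from even odds as described and is not part of the model.

```python
def trump_probability(terms):
    """Combine per-term odds multipliers into a predicted probability for Trump.

    `terms` is a list of (speaker, multiplier) pairs, where speaker is
    "trump" or "clinton" and multiplier is that term's odds multiplier.
    """
    odds = 1.0  # start at 1:1 odds, Trump : Clinton
    for speaker, multiplier in terms:
        if speaker == "trump":
            odds *= multiplier  # Trump terms multiply the odds
        else:
            odds /= multiplier  # Clinton terms divide them
    return odds / (1.0 + odds)  # convert odds to a probability


after_incredible = trump_probability([("trump", 7.42)])
after_grandmother = trump_probability([("trump", 7.42), ("clinton", 6.12)])
print(f"{after_incredible:.0%}, then {after_grandmother:.0%}")
```

In a logistic regression, each multiplier is the exponential of a term’s coefficient, which is why the effects multiply instead of adding.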

Technical details:

In order to visualize the results, I spent a bunch of time tweaking the matplotlib package to generate a graph of coefficients, which I used for the pronouns above. I made sure to use a logarithmic scale, since the terms are multiplicative.

In addition, I decided to teach myself enough javascript to learn to use the D3 library – allowing interactive visualizations and the guessing game where players can try to figure out who said a given random sentence from the campaign trail. There are a lot of ways the code could be improved, but I’m pleased with how it turned out given that I didn’t know any D3 prior to this project.

## How to Raise a Rationalist Kid

In honor of Father’s Day, I talk about the things Jesse’s and my parents did that helped make us intellectually curious and interested in rationality.

## Pick a name for a rationality non-profit!

My new job is basically my dream job: I just moved to the Bay area to help launch a non-profit devoted to teaching rationality.

But we need your help settling on a name. We’ve got it narrowed down to three contenders; click here to vote for your favorite. Thanks!

## The Simulation Hypothesis and the Problem of Evil

In this special live episode recorded at the 2012 Northeast Conference on Science and Skepticism, Massimo and I discuss the “simulation argument” — the case that it’s roughly 20% likely that we live in a computer simulation — and the surprising implications that argument has for religion. Our guest is philosopher David Kyle Johnson, who is professor of philosophy at King’s College and author of the blog “Plato on Pop” for Psychology Today, and who hosts his own podcast at philosophyandpopculture.com. Elaborating on an article he recently published in the journal Philo, Johnson lays out the simulation argument and his own insight into how it might solve the age-old Problem of Evil (i.e., “How is it possible that an all-powerful, all-knowing, and good God could allow evil to occur in the world?”). As usual, Massimo and I have plenty of questions and comments!

Rationally Speaking Episode #59

## My kind of protest sign

And how about a: “Two, four, six, eight! And if you could please register your studies ahead of time to combat publication bias, that would be great!”

## Thoughts on science podcasting: A dispatch from ScienceOnline 2012

I’m in Raleigh, NC this weekend for the sixth annual ScienceOnline un-conference, a gathering of 450 scientists, writers, bloggers, podcasters, educators, and others interested in the way the internet is changing the way we conduct, and communicate, science. My contribution was this morning: I moderated a discussion on science podcasting with Desiree Schell, the eloquent host of Skeptically Speaking. We made a nicely complementary team. Her podcast is live, whereas mine is pre-recorded; hers is solo, whereas mine is a dialogue with a co-host; hers focuses on the practical applications of science to people’s lives and pocketbooks (e.g., the common cold, the claims of the cosmetics industry, etc.), whereas mine is more abstract and philosophical. So our combined perspectives created a kind of podcasting guide in 3D.

A few highlights:

What’s your niche? There are a lot of science and skepticism podcasts out there already, and Desiree and I both agreed that you need a well-defined “niche” in mind if you’re going to start your own. Maybe it’s a topic  you think isn’t being covered enough, or it’s not being covered the way you think it should be, or maybe it’s a group you want to give a voice to. But there should be some reason your podcast exists other than the fact that you want to do a podcast.

For example, I consider Rationally Speaking’s niche to be in the philosophical implications of science. So instead of just covering topics like irrationality, or the science of love, we also try to hash out questions like, Why should we try to overcome irrationality? Does it actually make us happier, and what are the ethical implications of trying to make other people more rational? And if we understand the science of love, does that change our experience of love?

And then our other niche is the question of what constitutes good evidence for a claim: To what extent do fields like evolutionary psychology, string theory, and memetics make testable predictions, and if they don’t, can we have any confidence in their claims? Can we ever generalize from case studies? How do we know which experts to trust?  A lot of skeptic podcasts and blogs highlight claims that are unambiguously pseudoscience, but I think Rationally Speaking specializes in the murkier cases.

The outline versus the map: Desiree and I talked a lot about how to make podcast interviews and conversations go smoothly. When I first started doing Rationally Speaking, I would come into our tapings with a mental outline of the topics I wanted to cover, arranged in a nice order that flowed well… and as it turns out, that’s fine for when you’re giving a lecture, solo, but it just doesn’t work when you throw other people into the mix. You don’t know what topics your guest is going to bring up that call for follow-up, and I never know what direction Massimo’s going to take the conversation in. And the problem with  having an outline in your head is that once you diverge from that outline, you have no instructions for how to get back onto it.

So what I’ve settled on instead is more of a loose, web-like structure in my mind, where the topics aren’t in any set order, but for each topic, I’ve thought about how it connects to at least a couple of other topics. That way, wherever the conversation ends up, I have this map in my head of where I can go next.

Why a podcast at all? For that matter, you should really have a reason to do a podcast rather than write a blog. Podcasts have some significant downsides, compared to blogs. On the production end, they’re a hassle to record and edit, compared to writing a post, and they commit you to a specific length and schedule. On the consumption end, they’re inconvenient in that you can’t skim them at your own pace, you can’t skip down to another section, and you don’t get links or pictures to supplement the content.

But sometimes they really are better than a blog post. I think that’s especially true for treating controversial or multifaceted topics, the kind we look for in Rationally Speaking – hearing people debate a topic is far more engaging than reading one person’s point of view. Also, as Story Collider’s Ben Lillie pointed out during the conversation, listening to a science podcast creates an intimate connection to the scientist – when you’ve got headphones on and you’re hearing the scientist’s voice as if she’s right there with you, it takes barely any time to get a taste of her personality. And science could always use a little more humanizing.