October 19, 2016
You could say that Donald Trump has a… distinct way of speaking. He doesn’t talk the way other politicians do (even ignoring his accent), and the contrast between him and Clinton is pretty strong. But can we figure out what differentiates them? And then, can we find the most… Trump-ish sentence?
That was the challenge my friend Spencer posed to me as my first major foray into data science, the new career I’m starting. It was the perfect project: fun, complicated, and requiring me to learn new skills along the way.
To find out the answers, read on! The results shouldn’t be taken too seriously, but they’re amusing and give some insight into what might be important to each candidate and how they talk about the political landscape. Plus, it serves to demonstrate the data science techniques I’m learning, as a portfolio project.
To read about how the model works, I wrote a rundown with both technical and non-technical details below the tables and graphs. But without further ado, the results:
The Trump-iest and Clinton-est Sentences and Phrases from the 2016 Campaign:
Top sentence: “That’s why the slogan of my campaign is stronger together because I think if we work together and overcome the divisiveness that sometimes sets americans against one another and instead we make some big goals and I’ve set forth some big goals, getting the economy to work for everyone, not just those at the top, making sure we have the best education system from preschool through college and making it affordable and somp[sic] else.” — Presidential Candidates Debate
Predicted Clinton: 0.99999999999
Frustratingly, I couldn’t download or embed the C-SPAN video for this clip, so here are two of the other top 5 Clinton-iest sentences:
Top sentence: “As you know, we have done very well with the evangelicals and with religion generally speaking, if you look at what’s happened with all of the races, whether it’s in south carolina, i went there and it was supposed to be strong evangelical, and i was not supposed to win and i won in a landslide, and so many other places where you had the evangelicals and you had the heavy christian groups and it was just — it’s been an amazing journey to have — i think we won 37 different states.” — Faith and Freedom Coalition Conference
Predicted Clinton: 4.29818403092e-11
Frustratingly, I couldn’t download or embed the C-SPAN video for this clip either, so here are two of the other top 5 Trump-iest sentences:
Other Fun Results:
Cherrypicked pairs of terms:
How the Model Works:
Defining the problem: What makes a sentence “Trump-y?”
I decided that the best way to quantify the ‘Trump-iness’ of a sentence was to train a model to predict whether a given sentence was said by Trump or Clinton. The Trump-iest sentence will be the one that the predictive model would analyze and say, “Yup, the chance this was Trump rather than Clinton is 99.99%.”
Along the way, with the right model, we can ‘look under the hood’ to see what factors into the decision.
The goal is to build a classifier that can distinguish between the candidates’ sentences, optimizing for ROC AUC, while still letting us extract meaningful, explainable coefficients.
Gathering and processing the data:
In order to train the model, I needed large bodies of text from each candidate. I ended up scraping transcripts from events on C-SPAN.org. Unfortunately, they’re uncorrected closed caption transcripts and contained plenty of typos and misattributions. On the other hand, they’re free.
I did a bit to clean up some recurring problems like the transcript starting every quote section with “Sec. Clinton:” or including descriptions like [APPLAUSE] or [MUSIC]. (Unfortunately, they don’t reliably mark the end of the music, and C-SPAN sometimes claims that Donald Trump is the one singing ‘You Can’t Always Get What You Want.’)
I ended up learning to use Python’s Beautiful Soup library to identify the list of videos C-SPAN considers campaign events by the candidates, find their transcripts, and grab only the parts they supposedly said. I learned to use some basic regular expressions to do the cleaning.
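A minimal sketch of the kind of regex cleanup this involves — the speaker-tag and stage-direction patterns here are simplified, hypothetical stand-ins for the actual scraper’s rules:

```python
import re

def clean_transcript(text):
    # Strip speaker tags like "Sec. Clinton:" or "Mr. Trump:" (hypothetical pattern)
    text = re.sub(r"(Sec\.|Mr\.)\s+\w+:", "", text)
    # Strip bracketed stage directions like [APPLAUSE] or [MUSIC]
    text = re.sub(r"\[[A-Z ]+\]", "", text)
    # Collapse the leftover whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript("Sec. Clinton: Thank you. [APPLAUSE] It is great to be here."))
```

The real transcripts are messier than this, of course — the point is just that a few targeted substitutions knock out the most common recurring artifacts.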
My scraping tool is up on github, and is actually configured to be able to grab transcripts for other people as well.
Converting the data into usable features
After separating the large blocks of text into sentences and then words, I had some decisions to make. In an effort to focus on interesting and meaningful content, I removed sentences that were too short or too long – “Thank you” comes up over and over, and the longest sentences tended to be errors in the transcription service. It’s a judgment call, but I wanted to keep half the sentences, which set cutoffs at 9 words and 150 words. 34,108 sentences remained.
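The length filter itself is trivial — here’s a sketch using the cutoffs above (the word counts here are just whitespace splits, a simplification of the real tokenization):

```python
def keep_sentence(sentence, min_words=9, max_words=150):
    # Cutoffs chosen to retain roughly half the sentences
    n = len(sentence.split())
    return min_words <= n <= max_words

sentences = [
    "Thank you.",  # too short: dropped
    "We are going to win so much that you will get tired of winning.",
]
kept = [s for s in sentences if keep_sentence(s)]
```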
A common technique in natural language processing is to remove the “stopwords” – common non-substantive words like articles (a, the), pronouns (you, we), and conjunctions (and, but). However, following James Pennebaker’s research, which found these words are surprisingly useful in predicting personality, I left them in.
Now we have what we need: sequences of words that the model can consider evidence of Trump-iness.
I used NLTK to tokenize the text into sentences, but wrote my own regular expressions to tokenize the words. I considered it important to keep contractions together and include single-character tokens, which the standard NLTK function wouldn’t have done.
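A minimal version of that kind of word tokenizer — this exact pattern is my illustration, not the one from the project, but it shows the two properties I cared about (contractions stay whole, single characters survive):

```python
import re

# Keeps contractions like "don't" as one token and allows one-letter tokens like "i"
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?", re.IGNORECASE)

def tokenize(sentence):
    return [t.lower() for t in TOKEN_RE.findall(sentence)]

print(tokenize("We just don't win anymore, do we?"))
```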
I used a CountVectorizer from sklearn to extract ngrams and later selected the most important terms using a SelectFromModel with a Lasso Logistic Regression. It was a balance – more terms would typically improve accuracy, but water down the meaningfulness of each coefficient.
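The vectorize-then-select pipeline looks roughly like this — the toy corpus and labels are made up for illustration, but the sklearn pieces are the ones named above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real sentences (labels: 1 = Trump, 0 = Clinton)
sentences = ["we will win believe me",
             "stronger together for working families",
             "believe me we win and win",
             "working families deserve stronger schools"] * 5
labels = np.array([1, 0, 1, 0] * 5)

# Extract word 1- to 3-grams as count features
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(sentences)

# Keep only terms the L1-penalized ("lasso") logistic regression gives nonzero weight
selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"))
X_selected = selector.fit_transform(X, labels)
print(X.shape[1], "->", X_selected.shape[1])
```

Shrinking the feature set this way is exactly the accuracy-vs-interpretability trade mentioned above: the surviving terms are the ones worth assigning a coefficient to.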
I tested various additional features, like parts of speech and lemmas (using the fantastic Spacy library) and sentiment analysis (using the Textblob library), but found that they only provided marginal benefit and made the model much slower. Even just using word 1- to 3-grams, I got a 0.92 ROC AUC.
Choosing & Training the Model
One of the most interesting challenges was avoiding overfitting. Without taking countermeasures, the model could look at a typo-riddled sentence like “Wev justv don’tv winv anymorev.” and say “Aha! Every single one of those words are unique to Donald Trump, therefore this is the most Trump-like sentence ever!”
I addressed this problem in two ways: the first is by using regularization, a standard machine learning technique that penalizes a model for using larger coefficients. As a result, the model is discouraged from caring about words like ‘justv’ which might only occur two times, since they would only help identify those couple sentences. On the other hand, a word like ‘frankly’ helps identify many, many sentences and is worth taking a larger penalty to give it more importance in the model.
The other technique was to use batch predictions – dividing the sentences into 20 chunks, and evaluating each chunk by only training on the other 19. This way, if the word ‘winv’ only appears in a single chunk, the model won’t see it in the training sentences and won’t be swayed. Only words that appear throughout the campaign have a significant impact in the model.
The model uses a logistic regression classifier because it produces very explainable coefficients. If that weren’t a factor, I might have tried a neural net or SVM (I wouldn’t expect a random forest to do well with such sparse data.) In order to set the regularization parameters for both the final classifier and for the feature-selection Lasso Logistic Regressor, I used sklearn’s cross-validated gridsearch object, optimizing for ROC_AUC.
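The hyperparameter search is standard sklearn; a sketch on synthetic data (the grid values here are illustrative, not the ones from the actual run):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the n-gram feature matrix
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Tune the inverse regularization strength C, scoring by ROC AUC
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      scoring="roc_auc", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```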
During the prediction process, I used a stratified Kfold to divide the data in order to ensure each chunk would have the appropriate mix of Trump and Clinton sentences. It was tempting to treat the sentences more like a time series and only use past data in the predictions, but we want to consider how similar old sentences are to the whole corpus.
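The chunked prediction scheme is what sklearn calls out-of-fold prediction; a sketch on synthetic data, again with illustrative stand-in features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=400, n_features=50, random_state=0)

# Each of the 20 stratified chunks is scored by a model trained on the other 19,
# so a typo-word appearing only inside one chunk can't inflate its own predictions.
cv = StratifiedKFold(n_splits=20, shuffle=True, random_state=0)
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=cv, method="predict_proba")[:, 1]
```

Every sentence ends up with a predicted probability from a model that never saw it during training, which is what makes the “Trump-iest sentence” ranking honest.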
Interpreting and Visualizing the Results:
The model produced two interesting types of data: how likely the model thought each sentence was spoken by Trump or Clinton (how ‘Trumpish’ vs. ‘Clintonish’ it is), and how any particular term impacts those predicted odds. So if a sentence is predicted to be spoken by Trump with estimated 99.99% probability, the model considers it extremely Trumpish.
The terms’ multipliers indicate how each word or phrase impacts the predicted odds. The model starts at 1:1 (50%/50%), and let’s say the sentence includes the word “incredible” – a Trump multiplier of 7.42. The odds are now 7.42 : 1, or roughly 88% in favor of Trump. If the model then sees the word “grandmother” – a Clinton multiplier of 6.12 – its estimated odds become 7.42 : 6.12 (or about 1.21 : 1), roughly 55% Trump. Each term has a multiplying effect, so a 4x word and a 2x word together have as much impact as an 8x word – not 6x.
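The odds arithmetic above, worked through in a few lines (using the same two multipliers):

```python
# Start at 1:1 odds, i.e. 50/50
trump_odds = 1.0
trump_odds *= 7.42   # "incredible": Trump multiplier of 7.42
trump_odds /= 6.12   # "grandmother": Clinton multiplier of 6.12

# Convert odds back to a probability
probability = trump_odds / (trump_odds + 1)
print(round(probability, 3))  # → 0.548, roughly 55% Trump
```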
In order to visualize the results, I spent a bunch of time tweaking the matplotlib package to generate a graph of coefficients, which I used for the pronouns above. I made sure to use a logarithmic scale, since the terms are multiplicative.
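The skeleton of such a plot is short — the pronoun multipliers below are made-up placeholders, but the log-scale trick is the real one:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical pronoun multipliers (> 1 leans Trump, < 1 leans Clinton)
terms = ["i", "we", "you", "he", "she"]
multipliers = [1.8, 0.6, 1.3, 0.9, 0.4]

fig, ax = plt.subplots()
ax.barh(terms, multipliers)
ax.set_xscale("log")           # log scale, since the effects are multiplicative
ax.axvline(1.0, color="gray")  # 1.0 = no effect on the odds
ax.set_xlabel("odds multiplier")
fig.savefig("coefficients.png")
```

On a log scale, a 2x Trump word and a 2x Clinton word sit at equal distances from the 1.0 line, which is exactly the symmetry a multiplicative model deserves.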