Category Archives: R
Growing up in Northeast Ohio, I do not recall ever seeing, let alone actually kicking, a soccer ball. In those times and in that place the term “football” meant something entirely different. It meant an oblong, leather-clad, brown inflated ball. It meant glorious Friday nights at Mollenkopf Stadium. It meant watching the Ohio State Buckeyes stomp on the University of the Sisters of the Poor every Saturday afternoon. And it meant exploring new and exciting ways to express one’s displeasure and disgust at the Cleveland Browns every Sunday. So naturally I wondered how the 2014 FIFA World Cup was playing in this nether world of chauvinistic American sport.
Through the magic of the Twitter API, R code, and a few extra moments of time on my hands, I set forth on the journey to find out.
The R scripts for this little project can be found at https://github.com/dino-fire/worldcup.
The Twitter REST API enables users to set a geographic parameter to limit searches to a specific geographical area. The search terms were limited to #WorldCup, #worldcup2014, or #Brazil. These terms were subsequently eliminated from the analyses, because we’re interested in what people are saying about those terms, not about counts of the terms themselves.
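The geographic restriction boils down to a single `geocode` parameter of the form "lat,long,radius". Here's a minimal sketch of how that looks with the twitteR package, assuming OAuth credentials are already configured (the actual calls are commented out since they require a live, authenticated connection); the coordinates are approximately Columbus, Ohio:

```r
# Build the geocode string for a 200-mile radius around Columbus, Ohio
lat <- 39.96
lon <- -82.99
radius <- "200mi"
geocode <- paste(lat, lon, radius, sep = ",")  # "39.96,-82.99,200mi"

# With twitteR authenticated via setup_twitter_oauth(), the search would be:
# library(twitteR)
# tweets <- searchTwitter("#WorldCup OR #worldcup2014 OR #Brazil",
#                         n = 1500, geocode = geocode)
# tweet.text <- sapply(tweets, function(t) t$getText())
```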
I started with the latitude and longitude of Columbus, Ohio, and specified a 200-mile radius. A word cloud, of course, yields larger, more prominent displays of words with higher frequencies. The basic word cloud of Ohioans’ tweets demonstrates some interest in the Spanish and Croatian futbol teams. Speaker of the House John Boehner garnered a few honorable mentions as well. What that has to do with the World Cup, I do not know.
Next I made a little side excursion that compared the tweets from the Youngstown/Warren area with those of residents of Youngstown’s sister city, Salerno, Italy. The outcomes were predictable but nuanced. The Youngstown and Warren folks tweeted about the generic USA. Could’ve been the soccer team, could’ve been native cuisine, like hot dogs, and could’ve been anything. Not so with the Italians, though; the national football club was front and center.
Ohioans are people of few words, at least as far as tweeting about the World Cup is concerned. The vast majority of Ohioans’ tweets contained only 8 to 10 unique words. The base R program provides a nice histogram.
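The counting itself is simple base R: split each tweet into words, count the distinct ones, and feed the result to `hist()`. Here's a toy sketch with made-up tweets standing in for the real search results:

```r
# Toy tweets standing in for the real search results
tweets <- c("usa usa usa go team",
            "england out again what a shock",
            "watching the match with friends tonight")

# Count the number of unique words in each tweet
unique.words <- sapply(strsplit(tolower(tweets), "\\s+"),
                       function(w) length(unique(w)))
unique.words  # 3 6 6

# Base R histogram of the distribution
hist(unique.words, main = "Unique words per tweet", xlab = "Unique words")
```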
Before we get into the deeper statistical analysis, I should point out that THE BIG BUZZ at the time was about England getting unceremoniously booted from the tournament in the opening round.
What’s the difference between England and a teabag? The teabag stays in the Cup longer.
A hierarchical cluster analysis of Ohioans’ tweets is intended to depict how words tend to cluster together in Euclidean space. It’s a fancy way of seeing how words correlate. And here are the results.
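Stripped of the tm plumbing, the clustering step is just base R: build a word-by-tweet incidence matrix, compute Euclidean distances between word profiles, and hand them to `hclust()`. A minimal sketch with a hand-made matrix (the real script builds this from a TermDocumentMatrix):

```r
# Rows are words, columns are tweets; 1 means the word appeared in that tweet
m <- matrix(c(1, 1, 0, 0,    # "england"
              1, 1, 0, 0,    # "out"
              0, 0, 1, 1,    # "italy"
              0, 0, 1, 1),   # "costarica"
            nrow = 4, byrow = TRUE,
            dimnames = list(c("england", "out", "italy", "costarica"), NULL))

# Euclidean distance between word profiles, then hierarchical clustering
hc <- hclust(dist(m))
plot(hc)                  # dendrogram of how the words group together
groups <- cutree(hc, k = 2)  # extract two clusters
groups
```

Words that appear in the same tweets end up in the same branch, which is exactly what the dendrogram of the real tweets shows.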
One group of tweets centered on England’s demise, and another seemed to be about who was showing up in Rio de Janeiro. Yet another group of words dealt with the Italy – Costa Rica match, while a fourth cluster seemed to inquire about who was supporting US soccer.
Disregarding the clustering of words, we can review the correlations themselves. I’m proud to say that Ohioans are expert analysts of English soccer.
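The correlations come straight from the same incidence data: words that co-occur across tweets correlate positively, words that never share a tweet correlate negatively. A toy illustration with `cor()` (in the real script this is the job of tm's `findAssocs()` on a DocumentTermMatrix):

```r
# Toy incidence matrix: rows are tweets, columns are words (1 = word present)
inc <- cbind(england = c(1, 1, 1, 0, 0),
             teabag  = c(1, 1, 0, 0, 0),
             ronaldo = c(0, 0, 0, 1, 1))

# Pairwise correlations between the word columns
round(cor(inc), 2)
```

Here "england" and "teabag" co-occur and correlate positively, while "england" and "ronaldo" never share a tweet and correlate negatively.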
Despite a seemingly infinite number of startups claiming to do social media mining better than anyone else, sentiment analysis is an iffy proposition at best. For those who aren’t blessed with 50 unsolicited emails a day from social media mining companies, sentiment analysis refers to an evaluation of a tweet from a subjective, qualitative standpoint. The analysis tries to classify tweets or other textual content “scraped” from various websites into “good” or “bad,” “happy” or “sad,” or other such bipolar sentiments. But often that’s where the problem arises. For example, the following tweet would be classified as “good:”
Well, England, that was a good effort.
But unfortunately, so would this one:
Well, England, THAT was a good effort.
He or she who invents a sentiment algorithm that can accurately interpret sarcasm wins the prize. Yeah, THAT will happen. Scrape THAT, you bums.
Nevertheless, I’ll hop upon the sentiment analysis bandwagon and see how Ohioans feel about the World Cup so far. First of all, no transformation of the sentiment-scored data is required. The results reflect a very normal distribution, not skewing one way or another too badly.
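At its core, a dictionary-based sentiment scorer is nothing fancier than counting matches against positive and negative word lists. Here's a bare-bones toy version of that idea (the real script uses a much larger lexicon, but the mechanics are the same):

```r
# Tiny stand-in lexicons; real scorers use dictionaries of thousands of words
pos.words <- c("good", "great", "win", "joy")
neg.words <- c("bad", "sad", "lose", "awful")

# Score = positive-word matches minus negative-word matches
score.sentiment <- function(text) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

score.sentiment("what a great win for the usa")  # 2
score.sentiment("sad day bad result")            # -2
```

Which also makes plain why sarcasm breaks it: "that was a GREAT effort, England" scores just as positive as the sincere version.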
We see that the sentiment scores are more positive than not, but as of this writing, the USA team is 1 – 0. Those scores are subject to shift later, to be sure.
In this case, the sentiment scoring algorithm freely admits that it is clueless about the context of many of the words it encountered. Still, it seemed predisposed to find and tag joyful comments.
The sentiment scoring algorithm output a nice comparison word cloud, which visually demonstrates the words and their respective classifications based on frequency. Yes, I always associate the term “snapshot” with “disgust.” Interestingly, “Redskins” got lumped into that classification as well.
So are Ohioans’ beliefs about the World Cup different from other, surrounding, and, some would believe, inferior types of people (based on their state of residence)? Well, let’s see.
Sentiment scores in Ohio, Michigan, West Virginia, Pennsylvania, and Indiana lean uniformly positive. But a careful look at the boxplots shows that Ohioans’ and Indianans’ opinions tend to cluster in the middle: not too positive, and not too negative. That’s not the case among Michiganders, who tend to be extremely more positive or extremely more negative. Those Michigan folks represent very nicely the dangerous reality about averages: You can be standing with your feet in a bucket of ice water and your head in a roasting hot oven. But on average you feel just fine.
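The boxplot comparison is one line of base R once the scores are in a data frame. A sketch with simulated scores standing in for the real tweets, deliberately giving Michigan the same mean but far more spread, to make the "averages" point visible:

```r
# Simulated sentiment scores: MI has the same mean as OH but much more spread
set.seed(42)
scores <- data.frame(
  state = rep(c("OH", "MI", "IN", "PA", "WV"), each = 50),
  score = c(rnorm(50, 1, 1),    # Ohio: clustered near the middle
            rnorm(50, 1, 3),    # Michigan: same mean, wild extremes
            rnorm(50, 1, 1),    # Indiana
            rnorm(50, 1, 2),    # Pennsylvania
            rnorm(50, 0.5, 2))) # West Virginia

# Side-by-side boxplots of score by state
boxplot(score ~ state, data = scores,
        main = "Sentiment score by state", ylab = "Sentiment score")
```

The OH and MI boxes have medians in the same place, but Michigan's whiskers reach much further in both directions: feet in the ice water, head in the oven.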
A comparison cloud shows just how different the tweets from these separate states really are. Michiganders seem obsessed with the Italy – Costa Rica match. Indianans seem strangely interested in the Forza Italia political movement. Pennsylvanians are engrossed in a game of “where’s Ronaldo?” Ohioans are losing interest, and starting to turn their attention toward Wimbledon. And West Virginians don’t seem to care much about the World Cup at all.
The MLB All-Star game is coming up soon, so I thought I’d toss a few random analyses your way to commemorate the occasion. Here’s one…
So you want to be an All-Star, do ya? Then change your name to Rodriguez or Robinson. Here are the surnames of the top 250 All-Stars, by number of All-Star game selections, going back to the dawn of the All-Star game, in 1933 Chicago. Unfortunately, notable baseball fan Al Capone was probably not in attendance, since he had other commitments at the time in the Big House. But I digress. The bigger and bolder the name, the more someone with that name appeared in an All-Star uniform. This fine graphic represents the intersection of baseball and big data. For example, Robinson refers to the Orioles’ immortal third baseman, Brooks Robinson (18 career All-Star games), Frank Robinson, player-manager for my beloved Tribe despite those gawd-awful red uniforms (14 career selections), Eddie Robinson, who represented the White Sox and Twins in 4 contests, and of course Jackie Robinson, with 6 games as a Brooklyn Dodger. Frank Robinson was an All-Star selection for 3 of the 4 teams he played for in his career–Cincinnati, Baltimore, and LA. He never made it to the All-Star game as an Indian. Of course. All of the Robinsons on this list are in Cooperstown. Rodriguez is attached to Alex, Ivan, Ellie, Francisco, and Henry.
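Behind the word cloud is a simple tally: sum All-Star selections by surname, then size each name by its total. A toy sketch with a hypothetical subset of the data (the real data file lives in the GitHub repo):

```r
# Hypothetical slice of the All-Star data
allstars <- data.frame(
  surname    = c("Robinson", "Robinson", "Robinson", "Robinson", "Rodriguez"),
  first      = c("Brooks",   "Frank",    "Eddie",    "Jackie",   "Alex"),
  selections = c(18,         14,         4,          6,          14))

# Total selections per surname -- this drives the word sizes in the cloud
freq <- tapply(allstars$selections, allstars$surname, sum)
freq
#  Robinson Rodriguez
#        42        14

# wordcloud(names(freq), freq) would then draw the cloud from these totals
```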
The word cloud was created using the R wordcloud, tm, and RColorBrewer packages. The simple R script and data file can be found at https://github.com/dino-fire/allstar-analysis.
Like all of this and my upcoming All-Star analysis, a huge shout-out goes to the data geniuses at Baseball Reference. More baseball statistics than are fit for human consumption. This blog has been cross-posted to the most excellent R-bloggers site as well.
Goodness. good·ness [good-nis]: the state or quality of being good; excellence of quality. (dictionary.com)
A good predictive model is only as good as its “goodness.” And, fortunately, there is a well-established process for measuring the goodness of a logistic model in a way that a non-statistician—read: the senior manager end-users of your model—can understand.
There is, of course, a precise statistical meaning behind words we propeller heads throw around such as “goodness of fit.” In the case of logistic models, we are looking for a favorable log likelihood result relative to that of the null, or empty, model. This description isn’t likely to mean much to an end-user audience (it barely means anything to many propeller heads).
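For the propeller heads, here is what that comparison looks like in R, using the built-in mtcars data as a stand-in for the membership data: fit the model of interest and the intercept-only null model, then compare log likelihoods.

```r
# Logistic model of interest vs. the empty (intercept-only) null model
fit  <- glm(am ~ wt, data = mtcars, family = binomial)
null <- glm(am ~ 1,  data = mtcars, family = binomial)

logLik(fit)   # higher (less negative) is better
logLik(null)

# Likelihood-ratio test: is the improvement over the null significant?
anova(null, fit, test = "Chisq")
```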
Saying that your model will predict the correct binary outcome of something 81% of the time, however, makes a lot more intuitive sense to everyone involved.
It starts with a standard hold-out sample process, where the model is trained and modified using a random part—say, half—of the available data (the learning set) until a satisfactory result is apparent. The model is then tested on the second half of the data (the testing set) to see how “predictive” it is. For a logistic model, a “confusion matrix” is a very clean way to see how well your model does.
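The split itself is one `sample()` call. A sketch with a hypothetical `members` data frame, dividing it in half at random:

```r
# Hypothetical membership data: 1 = renewed, 0 = dropped
set.seed(123)
members <- data.frame(id = 1:1000, status = rbinom(1000, 1, 0.41))

# Random half-and-half split into learning and testing sets
idx      <- sample(nrow(members), nrow(members) / 2)
learning <- members[idx, ]    # train and modify the model on this half
testing  <- members[-idx, ]   # then score this half to measure "goodness"

nrow(learning)  # 500
nrow(testing)   # 500
```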
Using existing historical data, say we’re trying to predict whether someone will renew their association membership when their current contract is up. We run the model on the testing set, using the parameters determined in the initial model-building step we did on the learning set.
logit.estimate <- predict.glm(fit, newdata = testing, type = 'response')
Let’s set the playing field by determining what the existing “actual” proportions of the possible outcomes are in the testing data.
# Actual churn values from testing set
testprops <- table(testing$status)  # create an object of drop/renew (actuals)
prop.table(testprops)               # express drop/renew in proportional terms
So historically, we see that 59% of people don’t renew when their membership period is up. Houston, we have a problem! Good thing this is a hypothetical example.
The elegance of logistic regression—like other modeling methods—is that it provides a neat little probability statistic for each person in the database. We can pick some arbitrary value for this predicted probability—say anything greater than 50% —to indicate that someone will renew their membership when the time comes.
testing$pred.val <- 0                        # Initialize the variable
testing$pred.val[logit.estimate > 0.5] <- 1  # Anyone with a pred. prob. > 50% will renew
With those results in hand, we need to know 2 things. First, how well does the model do in pure proportional terms? In other words, is it close to the same drop/renew proportions from the actual data? This is knowable from a simple table.
testpreds <- table(testing$pred.val)  # create an object of drop/renew (predicted)
prop.table(testpreds)                 # express drop/renew predictions in proportional terms
Recall that our original proportions from the “actuals” were 59%/41%…so far so good.
Second, and most importantly, how well does the model predict the same people to drop among those who actually dropped, and how does it do predicting the same people to renew among those who actually renewed? That’s where the confusion matrix comes in.
In a perfect (but suspicious) model, cells A and D would be 100%. In other words, everyone who dropped will have been predicted to drop, and everyone who renewed will have been predicted to renew. In our example, the confusion matrix looks like this:
# Confusion matrix
confusion.matrix <- table(testing$status, testing$pred.val)  # create the confusion matrix
confusion.matrix                                             # view it

        Drop  Renew
Drop     310     55
Renew     62    189
Assign each of the four confusion matrix cells a letter indicator, and run the statistics to see how well the model predicts renewals and drops.
a <- confusion.matrix[2,2]  # actual renew, predicted renew
b <- confusion.matrix[2,1]  # actual renew, predicted drop
c <- confusion.matrix[1,2]  # actual drop, predicted renew
d <- confusion.matrix[1,1]  # actual drop, predicted drop
n <- a + b + c + d          # total sample size

CCC <- (a + d)/n  # cases correctly classified
CCC               # 0.81
CMC <- (b + c)/n  # cases misclassified
CMC               # 0.19
CCP <- a/(a + b)  # correct classification of positives (actual renew -> predicted renew)
CCP               # 0.75
CCN <- d/(c + d)  # correct classification of negatives (actual drop -> predicted drop)
CCN               # 0.85
OddsRatio <- (a * d)/(b * c)  # the odds that the model will classify a case correctly
OddsRatio                     # 17
At 81%, our model does a pretty fair job of correctly determining the proportion of members who will drop and renew. It is capable of predicting the individuals who will renew their membership 75% of the time. More importantly, the model will predict who will not renew 85% of the time…presumably giving us time to entice these specific individuals with a special offer, or send them a communication designed to address the particular reasons that contribute to their likelihood to drop their membership (we learn this in the model itself). If we send this communication or special offer to everyone the model predicts will drop their membership, we will only have wasted (aka “spilled”) this treatment on 15% of them.
Now that’s information our managers can use.