This is an actual pic I took from the deck at our vacation rental in Ashtabula, Ohio this weekend. No Photoshop! Just fortuitous timing and cloud cover.
There’s no real profound content here today, but the pic is too good not to share. Enjoy!
Or, Data Normalization Using Indices for Fun and Profit
One time a big outdoor sports retailer asked me to analyze their online and catalog sales data in one region of the country, and then use that analysis to try to estimate sales in future brick-and-mortar locations in the same region.
So I did the analysis. Based on my results, I confidently figured they’d probably sell more stuff in Atlanta than they would in Siler City, North Carolina. Wow, what a blinding glimpse of the obvious. The right thing to do would be to compare data between the two places on more of an even footing. There are many ways to do that, but one of my favorites is to use indices. Indexing is meaningful, easy to understand, easy to calculate, and fun. So naturally, I charge a fortune for it (just kidding).
To demonstrate, I took two of my favorite things, 1) anything to do with Ohio and 2) cheeseburgers. Then I calculated indices based on some real retail sales data from last year.
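For readers who want to see the arithmetic, here is a minimal Python sketch of a common way to build an index: divide the local value by a base value and multiply by 100, so the base always sits at 100. The post does not spell out its exact formula, so treat this as an assumption. The function name `index` and the dollar figures are made up for illustration and are not the real retail data.

```python
def index(local_value, base_value):
    """Return local_value as an index where base_value equals 100."""
    if base_value == 0:
        raise ValueError("base_value must be nonzero")
    return 100.0 * local_value / base_value

# Hypothetical per-person spending figures, invented for illustration only.
us_average = 50.00
some_state = 84.00

state_index = index(some_state, us_average)
print(round(state_index))        # the state's index relative to the US average
print(round(state_index - 100))  # percent above (+) or below (-) the base
```

Read this way, an index above 100 means spending above the base, and an index below 100 means spending below it, no matter how big or small the place is.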
Fact 1: People in Ohio Love Cheeseburgers. Or Big Macs, or Frosties, or Whoppers. Ohioans consumed 68% more of these types of fast foods, on average, than US consumers in general.
Fact 2: People in Warren have a higher Munch-a-Burger Index than do people in Youngstown. Folks in both cities spend more than the US average on this food, but less than others in Ohio (note the Bonus Facts below).
Fact 3: 26% of people nationally spent some amount > $0 at one of these restaurants in the past 6 months. But that’s true of 34% of Ohioans! It’s 30% in Youngstown, and 29% in Warren. Now there’s an interesting little data point. About the same percentage of the population frequents these restaurants in Youngstown as in Warren, yet those hungry Warrenites spend oh so much more: $102.28 in the past 6 months versus $74.31 among Youngstowners.
Bonus Fact: Among largish Ohio cities, here are the indices of the Top 5 towns at these restaurants relative to others in the state:
5. West Chester 149
4. Hamilton 163
3. Loveland 187
2. Hilliard 198
1. Grove City 204 (!!!)
So people in Grove City spend about twice as much as other Ohioans at these restaurants.
Bonus Fact #2: Here are the cheapest—er, lowest spending—cities in the state on this food:
5. Mentor 79
4. Massillon 76
3. Youngstown 73
2. Canton 64
1. Mansfield 49
People in Mansfield spend about half as much as other Ohioans.
Growing up in Northeast Ohio, I do not recall ever seeing, let alone actually kicking, a soccer ball. In those times and in that place the term “football” meant something entirely different. It meant an oblong, leather-clad, brown inflated ball. It meant glorious Friday nights at Mollenkopf Stadium. It meant watching the Ohio State Buckeyes stomp on the University of the Sisters of the Poor every Saturday afternoon. And it meant exploring new and exciting ways to express one’s displeasure and disgust at the Cleveland Browns every Sunday. So naturally I wondered how the 2014 FIFA World Cup was playing in this nether world of chauvinistic American sport.
Through the magic of the Twitter API, R code, and a few extra moments of time on my hands, I set forth on the journey to find out.
The R scripts for this little project can be found here. https://github.com/dino-fire/worldcup
The Twitter REST API enables users to set a geographic parameter to limit searches to a specific geographical area. The search terms were limited to #WorldCup, #worldcup2014, or #Brazil. These terms were subsequently eliminated from the analyses, because we’re interested in what people are saying about those terms, not about counts of the terms themselves.
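The step of dropping the search terms before analysis can be sketched in a few lines of Python. This is not the author's R code (that lives in the linked repository). The tokenizer, the case-insensitive matching, and the sample tweet are all assumptions made for illustration.

```python
import re

# The three search terms named in the post, lowercased for matching.
SEARCH_TERMS = {"#worldcup", "#worldcup2014", "#brazil"}

def tokenize(tweet):
    """Lowercase a tweet and split it into words and hashtags."""
    return re.findall(r"#?\w+", tweet.lower())

def strip_search_terms(tweet, search_terms=SEARCH_TERMS):
    """Return the tweet's tokens with the search terms removed."""
    return [token for token in tokenize(tweet) if token not in search_terms]

# A made-up tweet, not one from the collected data.
tokens = strip_search_terms("England out already? #WorldCup #Brazil")
print(tokens)
```

Removing the terms up front keeps them from dominating every word cloud and frequency count that follows.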
I started with the latitude and longitude of Columbus, Ohio, and specified a 200-mile radius. A word cloud, of course, yields larger, more prominent displays of words with higher frequencies. The basic word cloud of Ohioans’ tweets demonstrates some interest in the Spanish and Croatian futbol teams. Speaker of the House John Boehner garnered a few honorable mentions as well. What that has to do with the World Cup, I do not know.
Next I made a little side excursion that compared the tweets from the Youngstown/Warren area with those of residents of Youngstown’s sister city, Salerno, Italy. The outcomes were predictable but nuanced. The Youngstown and Warren folks tweeted about the generic USA. Could’ve been the soccer team, could’ve been native cuisine, like hot dogs, and could’ve been anything. Not so with the Italians, though; the national football club was front and center.
Ohioans are people of few words, at least as far as tweeting about the World Cup is concerned. The vast majority of Ohioans’ tweets comprised 8 or 10 unique words. The base R program provides a nice histogram.
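Counting unique words per tweet is simple enough to sketch in Python. The post used base R for its histogram, so this is only an illustration of the same idea. The whitespace tokenizer and the sample tweets are assumptions, not the collected data.

```python
from collections import Counter

def unique_word_count(tweet):
    """Count distinct whitespace-separated words, ignoring case."""
    return len(set(tweet.lower().split()))

# Made-up tweets for illustration only.
tweets = [
    "go go USA",
    "England out of the cup already",
    "who is watching Italy tonight",
]

counts = [unique_word_count(tweet) for tweet in tweets]
histogram = Counter(counts)

# Crude text histogram: one '#' per tweet with that many unique words.
for n_words in sorted(histogram):
    print(f"{n_words:2d} {'#' * histogram[n_words]}")
```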
Before we get into the deeper statistical analysis, I should point out that THE BIG BUZZ at the time was about England getting unceremoniously booted from the tournament in the opening round.
What’s the difference between England and a teabag? The teabag stays in the Cup longer.
A hierarchical cluster analysis of Ohioans’ tweets is intended to depict how words tend to cluster together in Euclidean space. It’s a fancy way of seeing how words correlate. And here are the results.
One group of tweets centered on England’s demise, and another seemed to be about who was showing up in Rio de Janeiro. Yet another group of words dealt with the Italy – Costa Rica match, while a fourth cluster seemed to inquire about who was supporting US soccer.
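To make the clustering idea concrete, here is a small pure-Python sketch. Each word becomes a 0/1 vector recording which tweets it appears in, and the closest groups of words (by Euclidean distance) are merged repeatedly. The complete-linkage rule, the function names, and the four toy tweets are assumptions for illustration. The post does not say which linkage its R analysis used.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def word_vectors(tweets, vocabulary):
    """Map each word to a 0/1 vector: does it appear in each tweet?"""
    token_sets = [set(tweet.lower().split()) for tweet in tweets]
    return {w: [1 if w in s else 0 for s in token_sets] for w in vocabulary}

def agglomerate(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Complete linkage: cluster distance is the largest pairwise distance.
    """
    clusters = [[word] for word in sorted(vectors)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Made-up tweets for illustration only.
tweets = ["england out early", "england out again",
          "italy costa rica", "italy rica costa match"]
vectors = word_vectors(tweets, ["england", "out", "italy", "costa"])
clusters = agglomerate(vectors, n_clusters=2)
print(clusters)
```

Words that always show up in the same tweets sit at distance zero from each other, so they are the first to be merged.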
Disregarding the clustering of words, we can review the correlations themselves. I’m proud to say that Ohioans are expert analysts of English soccer.
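A word-to-word correlation of this kind can be sketched as a Pearson correlation between two 0/1 vectors, where each position says whether a word appeared in a given tweet. This is an illustrative Python version with made-up vectors. The post does not say exactly how its R scripts computed the correlations.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a word in every tweet (or none) has no variation
    return cov / (spread_x * spread_y)

# Hypothetical occurrence vectors across four tweets.
england = [1, 1, 0, 0]
out = [1, 1, 0, 0]
italy = [0, 0, 1, 1]

print(pearson(england, out))    # the two words always appear together
print(pearson(england, italy))  # the two words never appear together
```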
Despite a seemingly infinite number of startups claiming to do social media mining better than anyone else, sentiment analysis is an iffy proposition at best. For those who aren’t blessed with 50 unsolicited emails a day from social media mining companies, sentiment analysis refers to an evaluation of a tweet from a subjective, qualitative standpoint. The analysis tries to classify tweets or other textual content “scraped” from various websites into “good” or “bad,” “happy” or “sad,” or other such bipolar sentiments. But often that’s where the problem arises. For example, the following tweet would be classified as “good:”
Well, England, that was a good effort.
But unfortunately, so would this one:
Well, England, THAT was a good effort.
He or she who invents a sentiment algorithm that can accurately interpret sarcasm wins the prize. Yeah, THAT will happen. Scrape THAT, you bums.
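The sarcasm problem is easy to reproduce with a toy word-list scorer in Python. The tiny lexicons and the `score` function below are invented for illustration and are far cruder than any real sentiment package. The point is only that a scorer that counts "good" and "bad" words cannot hear the emphasis on THAT.

```python
import re

# Tiny made-up lexicons, for illustration only.
POSITIVE = {"good", "great", "win"}
NEGATIVE = {"bad", "awful", "lose"}

def score(text):
    """Positive word count minus negative word count, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((word in POSITIVE) - (word in NEGATIVE) for word in words)

sincere = "Well, England, that was a good effort."
sarcastic = "Well, England, THAT was a good effort."

print(score(sincere), score(sarcastic))  # both tweets get the same score
```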
Nevertheless, I’ll hop upon the sentiment analysis bandwagon and see how Ohioans feel about the World Cup so far. First of all, we see that there is no transformation of the sentiment-scored data required. The results reflect a very normal distribution, not skewing one way or another too badly.
We see that the sentiment scores are more positive than not, but as of this writing, the USA team is 1 – 0. Those scores are subject to shift later, to be sure.
In this case, the sentiment scoring algorithm freely admits that it is clueless about the context of many of the words it encountered. Still, it seemed predisposed to find and tag joyful comments.
The sentiment scoring algorithm output a nice comparison word cloud, which visually demonstrates the words and their respective classifications based on frequency. Yes, I always associate the term “snapshot” with “disgust.” Interestingly, “Redskins” got lumped into that classification as well.
So are Ohioans’ beliefs about the World Cup different from other, surrounding, and, some would believe, inferior types of people (based on their state of residence)? Well, let’s see.
Sentiment scores in Ohio, Michigan, West Virginia, Pennsylvania, and Indiana lean uniformly positive. But a careful look at the boxplots shows that Ohioans’ and Indianans’ opinions tend to cluster in the middle: not too positive, and not too negative. That’s not the case among Michiganders, who tend to be either far more positive or far more negative. Those Michigan folks represent very nicely the dangerous reality about averages: You can be standing with your feet in a bucket of ice water and your head in a roasting hot oven. But on average you feel just fine.
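The ice-bucket-and-oven point can be shown with two made-up sets of sentiment scores that share the same average but have very different spreads. The numbers below are invented for illustration and are not the actual state scores. The quartiles are the same summary a boxplot draws.

```python
import statistics

# Hypothetical sentiment scores, invented for illustration only.
clustered = [0.4, 0.5, 0.5, 0.5, 0.6]    # opinions bunched in the middle
polarized = [-2.0, -1.5, 0.5, 2.5, 3.0]  # opinions at both extremes

for name, scores in [("clustered", clustered), ("polarized", polarized)]:
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)
    quartiles = statistics.quantiles(scores, n=4)  # Q1, median, Q3
    print(name, round(mean, 2), round(spread, 2), quartiles)
```

Both groups average out the same, which is exactly why the boxplots are worth a careful look.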
A comparison cloud shows just how different the tweets from these separate states really are. Michiganders seem obsessed with the Italy – Costa Rica match. Indianans seem strangely interested in the Forza Italia political movement. Pennsylvanians are engrossed in a game of “where’s Ronaldo?” Ohioans are losing interest, and starting to turn their attention toward Wimbledon. And West Virginians don’t seem to care much about the World Cup at all.