When one data word equals a thousand words

Being a certified propellerhead comes with certain privileges, and few are more important than the unalienable right to have other propellerhead geeks as role models. One of mine is a Yale professor by the name of Edward Tufte.

Sparklines are tiny little graphs embedded in textual analysis.  Dr. Tufte, the widely recognized guru of graphics for data-centric reporting, invented these little beasts.  He refers to them as “data-intense, word-sized graphics,” also known as “data words.”  They are useful in describing how linear or time-series data changes over time, or how one group stands out from the rest.  Sparklines are easily added to Excel spreadsheets through the “insert” ribbon, although they don’t copy and paste very well into PowerPoint or Word documents.  The high-resolution ones you see in this article were generated by the handy sparkTable R package (Kowarik, Meindl, and Templ, 2012).

As an interesting and pathetic side note, Microsoft has applied for a patent for their implementation of sparkline functionality in their software, which is particularly galling to the spirit of freely-available open source applications if not downright plagiarism couched in tech-giant legalese.  A Google search on “sparklines” turns up an onslaught of search-engine-optimized content about how to use Excel to make sparklines (a good thing, too, since making sparklines in Excel requires all of the technical expertise of a contemporary third grader).  Go ahead and patent your weak excuse for sparklines, Microsoft.  I guess patents are cheap to come by in the software world.  Where do I sign up?

Taking the Confusion Out of Your Confusion Matrix

Goodness good·ness [good-nis]: the state or quality of being good; excellence of quality (dictionary.com).

A good predictive model is only as good as its “goodness.”  And, fortunately, there is a well-established process for measuring the goodness of a logistic model in a way that a non-statistician—read: the senior manager end-users of your model—can understand.

There is, of course, a precise statistical meaning behind words we propeller heads throw around such as “goodness of fit.”  In the case of logistic models, we are looking for a favorable log likelihood result relative to that of the null, or empty, model.  This description isn’t likely to mean much to an end-user audience (it barely means anything to many propeller heads).
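For the propeller heads who do want to see it, here is a minimal sketch of that comparison. It uses simulated data, and the variable names (tenure, renew) are made up for illustration, not taken from the membership example. It fits a logistic model and an intercept-only null model, then compares their log likelihoods with a likelihood ratio test and McFadden's pseudo R-squared.

```r
# Simulated data: purely illustrative, not the membership data from the article
set.seed(42)
n <- 500
tenure <- runif(n, 0, 10)                          # hypothetical predictor
renew  <- rbinom(n, 1, plogis(-1 + 0.4 * tenure))  # hypothetical binary outcome
sim <- data.frame(renew, tenure)

fit.full <- glm(renew ~ tenure, data = sim, family = binomial)
fit.null <- glm(renew ~ 1, data = sim, family = binomial)  # null (empty) model

ll.full <- as.numeric(logLik(fit.full))
ll.null <- as.numeric(logLik(fit.null))

# Likelihood ratio statistic and its chi-square p-value (1 extra parameter)
lr.stat <- 2 * (ll.full - ll.null)
p.value <- pchisq(lr.stat, df = 1, lower.tail = FALSE)

# McFadden's pseudo R-squared: 0 means no improvement over the null model
mcfadden <- 1 - ll.full / ll.null
```

A log likelihood closer to zero than the null model's, with a small p-value, is the "favorable result" in question.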

Saying that your model will predict the correct binary outcome of something 81% of the time, however, makes a lot more intuitive sense to everyone involved.

It starts with a standard hold-out sample process, where the model is trained and modified using a random part—say, half—of the available data (the learning set) until a satisfactory result is apparent.  The model is then tested on the second half of the data (the testing set) to see how “predictive” it is.  For a logistic model, a “confusion matrix” is a very clean way to see how well your model does.
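The split itself can be sketched in a few lines of base R. The members data frame below is a hypothetical stand-in with made-up columns; only the names learning and testing follow the terms used in this article.

```r
# Hypothetical stand-in for the membership data; the column names are assumptions
set.seed(2012)
members <- data.frame(
  id     = 1:1000,
  tenure = round(runif(1000, 0, 10), 1),
  status = sample(c("Drop", "Renew"), 1000, replace = TRUE)
)

# Randomly pick half of the row numbers for the learning set
learn.rows <- sample(nrow(members), size = floor(nrow(members) / 2))

learning <- members[learn.rows, ]    # used to build and tune the model
testing  <- members[-learn.rows, ]   # held back to check the predictions
```

Setting the seed makes the random split repeatable, which helps when you have to explain the results a second time.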

Using existing historical data, say we’re trying to predict whether someone will renew their association membership when their current contract is up.  We run the model on the testing set, using the parameters determined in the initial model-building step we did on the learning set.

logit.estimate <- predict(fit, newdata = testing, type = 'response')  # predicted probability of renewal

Let’s set the playing field by determining what the existing “actual” proportions of the possible outcomes are in the testing data.

# Actual churn values from testing set
testprops <- table(testing$status)  # create an object of drop/renew (actuals)
prop.table(testprops)  # express drop/renew in proportional terms

Drop   Renew
0.59   0.41

So historically, we see that 59% of people don’t renew when their membership period is up.  Houston, we have a problem!  Good thing this is a hypothetical example.

The elegance of logistic regression, like other modeling methods, is that it provides a neat little probability statistic for each person in the database.  We can pick some arbitrary cutoff for this predicted probability (say, anything greater than 50%) to indicate that someone will renew their membership when the time comes.

testing$pred.val <- 0  # Initialize the variable
testing$pred.val[logit.estimate > 0.5] <- 1 # Anyone with a pred. prob.> 50% will renew
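Because the cutoff is arbitrary, it is worth seeing how much it matters. This toy sketch uses six made-up probabilities (not output from the model above) to show how moving the cutoff changes who gets labeled a renewal.

```r
# Hypothetical predicted probabilities for six members (made up for illustration)
probs <- c(0.15, 0.35, 0.48, 0.52, 0.61, 0.90)

pred.50 <- as.integer(probs > 0.5)  # the 50% cutoff used in this article
pred.40 <- as.integer(probs > 0.4)  # a looser cutoff, for comparison

share.50 <- mean(pred.50)  # share predicted to renew at the 50% cutoff
share.40 <- mean(pred.40)  # share predicted to renew at the 40% cutoff
```

In this toy case three of six members are predicted to renew at the 50% cutoff and four of six at 40%, so the cutoff is a business decision as much as a statistical one.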

With those results in hand, we need to know 2 things.  First, how well does the model do in pure proportional terms?  In other words, is it close to the same drop/renew proportions from the actual data?  This is knowable from a simple table.

testpreds <- table(testing$pred.val) # create an object of drop/renew (predicted)
prop.table(testpreds) # express drop/renew predictions in proportional terms

Drop   Renew
0.60   0.40

Recall that our original proportions from the “actuals” were 59%/41%…so far so good.

Second, and most importantly, how well does the model predict the same people to drop among those who actually dropped, and how does it do predicting the same people to renew among those who actually renewed?  That’s where the confusion matrix comes in.


In a perfect (but suspicious) model, cells A and D would be 100%.  In other words, everyone who dropped will have been predicted to drop, and everyone who renewed will have been predicted to renew.  In our example, the confusion matrix looks like this:

# Confusion matrix

confusion.matrix <- table(testing$status, testing$pred.val) # rows = actual, columns = predicted
confusion.matrix # view it

         Drop   Renew 
  Drop   310       55
  Renew   62      189

Assign each of the four confusion matrix cells a letter indicator, and run the statistics to see how well the model predicts renewals and drops.

a <- confusion.matrix[2,2]  # actual renew, predicted renew
b <- confusion.matrix[2,1]  # actual renew, predicted drop
c <- confusion.matrix[1,2]  # actual drop, predicted renew
d <- confusion.matrix[1,1]  # actual drop, predicted drop
n <- a + b + c + d  # total sample size

CCC <- (a + d)/n  # cases correctly classified
[1] 0.81

CMC <- (b + c)/n # cases misclassified
[1] 0.19

CCP <- a/(a + b) # correct classification of positives (actual -> predicted renew)
[1] 0.75

CCN <- d/(c + d) # correct classification of negatives (actual -> predicted drop)
[1] 0.85

OddsRatio <- (a * d) / (b * c) # odds of a renew prediction for actual renewers vs. actual droppers
[1] 17

At 81%, our model does a pretty fair job of putting members in the right bucket.  It correctly flags 75% of the members who actually renewed.  More importantly, it correctly flags 85% of the members who actually dropped…presumably giving us time to entice these specific individuals with a special offer, or send them a communication designed to address the particular reasons that contribute to their likelihood to drop their membership (we learn this in the model itself).  If we send this communication or special offer to everyone the model predicts will drop their membership, we will only have wasted (aka “spilled”) this treatment on about 17% of them: 62 of the 372 predicted drops actually renewed.
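Two different shares are easy to mix up here, so a quick worked check helps. The sketch below reuses the four cell counts from the confusion matrix; the names missed.drops, spilled.offers, and drop.precision are labels I made up for this illustration.

```r
# Counts taken from the confusion matrix (rows = actual, columns = predicted)
a <- 189  # actual renew, predicted renew
b <- 62   # actual renew, predicted drop
c <- 55   # actual drop,  predicted renew
d <- 310  # actual drop,  predicted drop

missed.drops   <- c / (c + d)  # actual droppers the model expected to renew
spilled.offers <- b / (b + d)  # predicted droppers who actually renewed
drop.precision <- d / (b + d)  # predicted droppers who actually dropped

round(c(missed.drops, spilled.offers, drop.precision), 2)
```

The model misses about 15% of the actual droppers, while about 17% of the offers sent to predicted droppers land on people who would have renewed anyway.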

Now that’s information our managers can use.

1986 Topps Baseball

In the expansive world of collectible baseball cards, 1986 Topps Baseball comes cheap. In the base set, there are no classic rookie cards worth extorting people over. Barry Bonds’ rookie card came in the 1986 Topps Traded & Rookies set, which is not at all part of the base set, as it is a supplement released after the season. I bet you didn’t know that. That Bonds card used to be valuable, prior to the ‘roid rage era.

It’s been about 12 years now, but Tim (my stepson and partner in baseball card overspending crime) and I came across the opportunity to grab a vending box case of those cards for $75.  Vending boxes are literally that…in those days, distributors would go around stuffing baseball card vending machines with these.  That case held 15,000 cards, if I recollect. 15,000 essentially worthless cards, stuck in dozens of individual vending boxes containing about 500 each, totally at random.  Cards with a big black banner, a weird all-caps font.  Bad ’80s haircuts.  Minuscule statistics on the back.  780-something of the damn things in a set.

What to do with them?  For starters, let’s have a collating party. That’s right, sort those bad boys into complete sets.  Tim was unceremoniously pressed into indentured servitude on this one.  I sent a set to my nephew in Texas, who happened to be born in 1986, figuring he might appreciate it someday…a snapshot of the professional baseball scene from the time of his birth.  I wish someone had given me a set of 1960 Topps Baseball back in the day, but if wishes were fishes we’d all cast nets, as the saying goes.

So I had reduced my extensive 1986 Topps Baseball holdings down to 14,220.  We made another complete set, and undertook a mission: get them signed by each of the 780+ players.  All of them.  Well, at least the ones who were 1) still alive, 2) able to write their name legibly in cursive, and C) willing to do it for the princely sum of free. This little mission went on for many years, in fits and starts.  We were able to accumulate a couple hundred of those autographs.  Some of the highlights of this journey:

  • Pete Rose wanted something like $50 to sign his card.  For that price, I’d rather have had him sign a betting slip from Caesars Palace.  I passed.
  • Cecil Fielder—papa to Prince—was the first one to send his autographed card back.  He wins the prize.
  • Cecil Cooper (another Cecil) from the Brewers wins the “You Are Now Forever Cool” award, as he signed the card to “Dino” personally.
  • At one point, a fellow collector who knew about my quest said he was planning to attend a game in which the minor league Winston-Salem Warthogs (look it up) were a contestant.  The man with the all-time coolest name in the history of major league baseball—who was the manager of the Warthogs—signed his card in person.  My connection said that the Warthog players witnessing this signing event could not stop laughing at the player’s hairdo on the card.  That manager was Razor Shines.

Anybody want some 1986 Topps Baseball cards?  Let’s make a deal!  Only a few thousand left…

A Closer Look at Exploratory Data Analysis: What and Why

What it is

An Exploratory Data Analysis, or EDA, is an exhaustive look at existing data from current and historical surveys conducted by a company.

In addition, the appropriate variables from your company’s customer database—such as information about rate plans, usage, account management, and others—are typically included in the analysis.

The intent of the EDA is to determine whether a predictive model is a viable analytical tool for a particular business problem, and if so, which type of modeling is most appropriate.

The deliverable is a low-risk, low-cost, comprehensive report of findings from the univariate data, with recommendations about how the company should use additional modeling.

At the very least, the EDA may reveal aspects of your company’s performance that others may not have seen.

Why do it

An EDA is a thorough examination meant to uncover the underlying structure of a data set and is important for a company because it exposes trends, patterns, and relationships that are not readily apparent.

You can’t draw reliable conclusions from a massive quantity of data by just glancing over it. Instead, you have to look at it carefully and methodically through an analytical lens.

Getting a “feel” for this critical information can help you detect mistakes, debunk assumptions, and understand the relationships between different key variables. Such insights may eventually lead to the selection of an appropriate predictive model.
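As a rough illustration of what that first pass might look like, here is a minimal sketch in base R. The customers data frame is entirely made up (column names and values are assumptions), and a couple of missing values are planted on purpose so there is a "mistake" to detect.

```r
# Hypothetical customer extract; every column name and value here is made up
set.seed(7)
customers <- data.frame(
  rate.plan    = sample(c("Basic", "Plus", "Premium"), 200, replace = TRUE),
  employees    = rpois(200, lambda = 25),
  satisfaction = sample(1:5, 200, replace = TRUE)
)
customers$employees[c(5, 50)] <- NA   # plant a couple of gaps to find

summary(customers$employees)                   # spread, center, and missing values
plan.shares <- prop.table(table(customers$rate.plan))  # share of each rate plan
missing.counts <- colSums(is.na(customers))    # number of gaps per column
tapply(customers$satisfaction, customers$rate.plan, mean)  # mean rating by plan
```

None of this is modeling yet; it is the univariate groundwork that tells you whether a model is worth building.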

What else you can do

If additional predictive modeling is deemed appropriate, a number of approaches may then be utilized.

Approach 1: A logistic model (LOGIT)

You may elect to segregate a company’s business customers into 2 separate and distinct classes: those who place a high value on or express high satisfaction with company services, and those who don’t.  This type of analysis is sometimes referred to as a response model.

LOGIT could offer some insight into the factors that drive this customer rating, especially when some of those factors are opinion oriented (from existing surveys, a survey designed expressly for this purpose, or both). This model would also utilize appropriate customer data from the company’s various strategic surveys, demographic variables, etc. One of the outcomes of LOGIT modeling is a probability, or “score,” that can be appended to each person in the larger database from whence the analysis came.
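Here is a minimal sketch of that scoring step, under loud assumptions: the database, the surveyed subset, and the high.value flag are all simulated, and the column names are invented for illustration. The model is fit on the surveyed customers only, then used to append a score to everyone.

```r
# Hypothetical customer database; all names and values are made up
set.seed(99)
database <- data.frame(
  customer.id = 1:2000,
  employees   = rpois(2000, 30),
  years       = round(runif(2000, 0, 15), 1)
)

# Hypothetical survey: 300 customers answered; high.value is the rating flag
surveyed <- database[sample(nrow(database), 300), ]
surveyed$high.value <- rbinom(300, 1, plogis(-1.5 + 0.2 * surveyed$years))

fit <- glm(high.value ~ employees + years, data = surveyed, family = binomial)

# Append a predicted probability ("score") to every customer, surveyed or not
database$score <- predict(fit, newdata = database, type = "response")
```

This only works when the predictors used in the model exist for the whole database, which is why opinion questions from a survey cannot be the only inputs if you want to score everyone.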

Approach 2: Recursive partitioning

Recursive partitioning is a technique that uses the same database and survey information but in a different way.

This modeling approach does a good job of taking categorical variables into account as well as ordinal and continuous value variables. Categorical variables tend to be characteristics, such as type of rate plan, type of business, or location, for example. Continuous variables are numbers, like number of employees or annual revenue.  Ratings scales from surveys fall into this latter type of data.

Recursive partitioning provides a “tree” output, and customers branch toward one classification or another based on how they respond to questions or how their behaviors and characteristics are measured.
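A minimal sketch of such a tree, assuming the rpart package (one common R implementation of recursive partitioning, not necessarily the one behind any analysis described here). The data are simulated with a made-up rule so the tree has something to find.

```r
library(rpart)  # recommended package that ships with standard R installations

# Hypothetical business customers; names, values, and the rule below are made up
set.seed(11)
n <- 600
biz <- data.frame(
  rate.plan = factor(sample(c("Basic", "Plus", "Premium"), n, replace = TRUE)),
  employees = rpois(n, 40),
  rating    = sample(1:5, n, replace = TRUE)
)
p.renew <- ifelse(biz$rating >= 4, 0.85,
                  ifelse(biz$rate.plan == "Premium", 0.6, 0.2))
biz$status <- factor(ifelse(runif(n) < p.renew, "Renew", "Drop"))

# Classification tree using a categorical, a count, and a rating-scale variable
tree <- rpart(status ~ rate.plan + employees + rating, data = biz, method = "class")

print(tree)  # text view of the splits
tree.class <- predict(tree, newdata = biz, type = "class")  # leaf each customer lands in
table(actual = biz$status, predicted = tree.class)
```

The printed splits read like a set of if/then rules, which is a big part of the method's appeal for a non-statistical audience.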

No matter which course of action your company decides to take, the first step is always an EDA. It’s an important component of the marketing research process that allows data to be organized, reviewed, and interpreted for the benefit of your business.