Or, Data Normalization Using Indices for Fun and Profit
One time a big outdoor sports retailer asked me to analyze their online and catalog sales data in one region of the country, and then use that analysis to try to estimate sales in future brick-and-mortar locations in the same region.
So I did the analysis. Based on my results, I confidently figured they’d probably sell more stuff in Atlanta than they would in Siler City, North Carolina. Wow, what a blinding glimpse of the obvious. The right thing to do would be to compare data between the two places on more of an even footing. There are many ways to do that, but one of my favorites is to use indices. Indexing is meaningful, easy to understand, easy to calculate, and fun. So naturally, I charge a fortune for it (just kidding).
To demonstrate, I took two of my favorite things, 1) anything to do with Ohio and 2) cheeseburgers. Then I calculated indices based on some real retail sales data from last year.
Fact 1: People in Ohio Love Cheeseburgers. Or Big Macs, or Frosties, or Whoppers. Ohioans consumed 68% more of these types of fast foods, on average, than US consumers in general.
Fact 2: People in Warren have a higher Munch-a-Burger Index than do people in Youngstown. Folks in both cities spend more than the US average on this food, but less than others in Ohio (note the Bonus Facts below).
Fact 3: 26% of people nationally spent some amount > $0 at one of these restaurants in the past 6 months. But that’s true of 34% of Ohioans! It’s 30% in Youngstown, and 29% in Warren. Now there’s an interesting little data point. About the same percentage of the population frequents these restaurants in Youngstown as in Warren, yet those hungry Warrenites spend oh so much more: $102.28 in the past 6 months versus $74.31 among Youngstowners.
Bonus Fact: Among largish Ohio cities, here are the indices of the Top 5 towns at these restaurants relative to others in the state:
5. West Chester 149
4. Hamilton 163
3. Loveland 187
2. Hilliard 198
1. Grove City 204 (!!!)
So people in Grove City spend about twice as much as other Ohioans at these restaurants.
Bonus Fact #2: Here are the cheapest—er, lowest spending—cities in the state on this food:
5. Mentor 79
4. Massillon 76
3. Youngstown 73
2. Canton 64
1. Mansfield 49
People in Mansfield spend about half as much as other Ohioans.
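For anyone who wants to see the arithmetic, here is a minimal sketch of how an index like these is usually built: divide a group's value by a base value and multiply by 100, so 100 means "same as the base." The function name is my own invention, and the Warren-versus-Youngstown pairing is just my illustration using the two dollar figures quoted above; the indices in the lists were calculated against the state, not this way.

```python
def index(value, base):
    """Express value relative to base, where 100 means 'equal to the base'."""
    return value / base * 100


# "68% more than US consumers in general" is the same thing as an index of 168.
ohio_vs_us = round(index(1.68, 1.00))

# Illustration only: Warren's six-month spend indexed against Youngstown's,
# using the two dollar amounts quoted in Fact 3.
warren_vs_youngstown = round(index(102.28, 74.31))

print(ohio_vs_us)            # 168
print(warren_vs_youngstown)  # 138
```

Reading it back the other way is just as easy: an index of 204 is 2.04 times the base, and an index of 49 is a little under half of it.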
Back by popular demand…derived importance.
A great deal of research is designed to measure the relative impact of specific features of products or services on customers’ satisfaction with those products or services.
Sometimes, surveys are designed to measure importance of those features explicitly and in isolation—no further analysis is necessary than an understanding of which features are more important to customers than others.
In other cases, the importance metrics will be used to determine what, if anything, could or should be changed to improve the product. That’s where key drivers analysis comes in, but more about that later.
Measuring importance through traditional Likert scales, while certainly frequently done, is not the method FGI recommends to measure importance. There are 2 fundamental reasons for this.
First, importance scales often do not provide adequate discrimination and differentiation between product features, especially when viewed in aggregate.
Q: How important is price?
A: Oh, that’s very important.
Q: How important is product availability?
A: Oh, that’s very important.
Q: How important are helpful store employees?
A: Oh, that’s very important too.
Second, people use scales differently (and this problem is not limited to importance scales). Respondents tend to calibrate their responses to previous scores. For example, here’s Respondent #1, rating the 3 attributes in our survey.
Q: How important is price?
A: Let’s give it a 9.
Q: Now, how important is product availability?
A: Well, not as important as price, so let’s say 8.
Q: How important are helpful store employees?
A: Less important than price, but more important than availability. 8 for that one too.
But Respondent #2 may follow precisely the same response pattern (9 / 8 / 8) but start their ratings at 6 instead, yielding 6 / 5 / 5. Should we view these three features as more important for Respondent #1 than for Respondent #2? No. Do any of Respondent #2’s answers qualify for top-2 box summaries? No. One person’s 9 rating may be another person’s 6 rating. The very nature of scales, that the values are relative and not absolute, can cause misinterpretation of the results.
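To make the two-respondent example concrete, here is a small sketch. It assumes a 10-point importance scale, so "top-2 box" means a rating of 9 or 10; the helper names are hypothetical and this is only an illustration of the problem, not a recommended fix.

```python
respondent_1 = [9, 8, 8]  # price, availability, helpful employees
respondent_2 = [6, 5, 5]  # same pattern, lower starting point


def relative_pattern(ratings):
    """Re-express each rating relative to the respondent's first rating."""
    return [r - ratings[0] for r in ratings]


def top2_box_count(ratings, scale_max=10):
    """Count ratings in the top two scale points (assumes a 10-point scale)."""
    return sum(1 for r in ratings if r >= scale_max - 1)


# Relative to their own starting points, the two respondents agree exactly.
print(relative_pattern(respondent_1))  # [0, -1, -1]
print(relative_pattern(respondent_2))  # [0, -1, -1]

# But a top-2 box summary treats them very differently.
print(top2_box_count(respondent_1))  # 1
print(top2_box_count(respondent_2))  # 0
```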
There are occasions where stated importance is appropriate and useful. If this is the case, there are far better ways than Likert scales to measure it, but that’s a subject for another day.
Measuring derived importance
Key drivers analysis yields importance in a derived manner, by measuring the relative impact of product features on critical performance metrics like overall satisfaction, likelihood to purchase again, likelihood to recommend, or some combination of those. The structure of a key drivers questionnaire looks like this:
Q. This next question is about your satisfaction with XX in general. Please rate the store on how satisfied you are with them overall. 10 means you are “Completely Satisfied” and 0 means you are “Not At All Satisfied.”
This question is treated as the dependent variable for our analysis.
Q. Now, consider these specific statements. Using the same scale, how satisfied are you with XX on…
- Variety of products and services
- Professional appearance of staff
- Length of wait time
- Ease of finding things in store
- Length of transaction time
- Convenient parking
- Convenient store location
We can then do some analysis to determine to what extent each of these independent (aka predictor) variables influences overall satisfaction. This is done through something called Pearson’s r correlations.
In correlations, we get a statistic called r (the correlation coefficient), which measures the strength and direction of the linear relationship between the scores on one item and the scores on another. With Pearson’s r, 1.0 means a perfect, positive correlation and -1.0 reflects a perfect, negative correlation. A value of 0.0 means no linear correlation at all. (Squaring r gives R^2, which runs from 0 to 1 and is read as the share of variance the two items have in common.)
In a key drivers analysis, the higher the correlation between each of the specific attributes and overall satisfaction, the more influence that attribute has on satisfaction, thus the more important it is. Notice that we never have to ask the question “how important is…” since the derived importance tells us everything we need to know. But that’s only half of the equation.
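Here is a rough sketch of that derived-importance step, assuming the survey responses sit in plain Python lists. The ratings below are made up purely for illustration, and the function and variable names are my own; a real analysis would run on the full respondent file.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings.

    Assumes neither list is constant (a constant list has no correlation).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical 0-10 ratings from eight respondents (not real survey data).
overall = [9, 7, 8, 4, 6, 10, 3, 7]
attribute_ratings = {
    "items in stock": [8, 6, 8, 3, 5, 9, 2, 6],
    "store location": [7, 8, 6, 7, 8, 7, 6, 8],
}

# Derived importance: correlate each attribute with overall satisfaction.
derived_importance = {
    name: pearson_r(ratings, overall)
    for name, ratings in attribute_ratings.items()
}

# Rank attributes from most to least important.
ranked = sorted(derived_importance.items(), key=lambda kv: kv[1], reverse=True)
for name, r in ranked:
    print(f"{name}: r = {r:.2f}")
```

In this toy data, "items in stock" tracks overall satisfaction almost perfectly while "store location" barely moves with it, which is the kind of separation that stated importance scales tend to hide.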
As a result of the question structure, we get explicit satisfaction metrics on each of the individual attributes as well. This data tells us how well we perform on each of the attributes. The resulting output looks something like this:
In our example, “helpful staff,” “coupon policy,” and “items in stock” are the most important attributes; they have the highest correlations to overall satisfaction.
Now compare those attributes to “store location.” The correlation is still positive, but not nearly as powerful as those of the top three. Remember, derived importance measures importance of individual attributes in relative, not absolute, terms.
The second part of our analysis shows that our store’s employees are helpful. In fact, it’s the highest performing attribute of all (while importance is viewed on the X, or horizontal, axis, performance is viewed on the Y, or vertical, axis).
This means that our store does well on this important attribute, which is considered a core strength. That is not the case, however, with another important attribute: having items in stock. Our store gets the lowest performance rating on that very important feature.
From our survey results, management can quickly see that resources should be directed toward reducing wait times (more cashiers), improving their coupon policy if they can, and especially keeping popular items in stock.
We’ve precisely identified the few items that need to be prioritized, as improvement in satisfaction with these things will have a direct and measurable impact on overall satisfaction.
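The importance-versus-performance reading can be sketched in a few lines. Everything numeric here is hypothetical: the correlations, the performance scores, and especially the two cutoffs, which in practice are a judgment call (the midpoint of each axis is a common choice). The labels follow the language used above.

```python
def quadrant(importance, performance, importance_cutoff=0.5, performance_cutoff=7.0):
    """Label an attribute from its derived importance and its performance score.

    Both cutoffs are assumptions for illustration, not fixed rules.
    """
    if importance < importance_cutoff:
        return "lower importance"
    if performance >= performance_cutoff:
        return "core strength"
    return "priority to improve"


# Hypothetical (derived importance, mean performance) pairs.
attributes = {
    "helpful staff": (0.72, 8.9),
    "items in stock": (0.68, 5.1),
    "coupon policy": (0.65, 6.0),
    "wait time": (0.60, 5.8),
    "store location": (0.21, 7.8),
}

labels = {name: quadrant(imp, perf) for name, (imp, perf) in attributes.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```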
Being a certified propellerhead comes with certain privileges, and few are more important than the unalienable right to have other propellerhead geeks as role models. One of mine is a Yale professor by the name of Edward Tufte.
Sparklines are tiny little graphs embedded in textual analysis. Dr. Tufte, the widely recognized guru of graphics for data-centric reporting, invented these little beasts. He refers to them as “data-intense, word-sized graphics,” also known as “data words.” They are useful in describing how linear or time-series data changes over time, or how one group stands out from the rest. Sparklines are easily added to Excel spreadsheets through the “insert” ribbon, although they don’t copy and paste very well into PowerPoint or Word documents. The high-resolution ones you see in this article were generated by the handy sparkTable R package (Kowarik, Meindl, and Templ, 2012).
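If you just want the flavor of a word-sized graphic without any tooling, a crude text version can be faked with Unicode block characters. To be clear, this is my own toy approximation in Python, not Dr. Tufte's design and not how the sparkTable package works; the function name and the sample series are made up.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block characters, shortest to tallest


def sparkline(values):
    """Map each value to a block character scaled between the series min and max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series gets a flat line.
        return BARS[0] * len(values)
    top = len(BARS) - 1
    return "".join(BARS[int((v - lo) / span * top)] for v in values)


# Hypothetical monthly sales figures.
monthly_sales = [12, 15, 11, 18, 24, 30, 27, 22, 19, 16, 14, 13]
print("Sales this year", sparkline(monthly_sales), "peaked in June.")
```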
As an interesting and pathetic side note, Microsoft has applied for a patent for their implementation of sparkline functionality in their software, which is particularly galling to the spirit of freely-available open source applications if not downright plagiarism couched in tech-giant legalese. A Google search on “sparklines” turns up an onslaught of search-engine-optimized content about how to use Excel to make sparklines (a good thing, too, since making sparklines in Excel requires all of the technical expertise of a contemporary third grader). Go ahead and patent your weak excuse for sparklines, Microsoft. I guess patents are cheap to come by in the software world. Where do I sign up?
What it is
An Exploratory Data Analysis, or EDA, is an exhaustive look at existing data from current and historical surveys conducted by a company.
In addition, the appropriate variables from your company’s customer database—such as information about rate plans, usage, account management, and others—are typically included in the analysis.
The intent of the EDA is to determine whether a predictive model is a viable analytical tool for a particular business problem, and if so, which type of modeling is most appropriate.
The deliverable is a low-risk, low-cost, comprehensive report of univariate findings from the data, along with recommendations about how the company should use additional modeling.
At the very least, the EDA may reveal aspects of your company’s performance that others may not have seen.
Why do it
An EDA is a thorough examination meant to uncover the underlying structure of a data set and is important for a company because it exposes trends, patterns, and relationships that are not readily apparent.
You can’t draw reliable conclusions from a massive quantity of data by just glancing over it. Instead, you have to look at it carefully and methodically through an analytical lens.
Getting a “feel” for this critical information can help you detect mistakes, debunk assumptions, and understand the relationships between different key variables. Such insights may eventually lead to the selection of an appropriate predictive model.
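As a small taste of what that first pass looks like, here is a sketch of a univariate summary for one variable, using only Python's standard library. The variable name and the values are hypothetical; a real EDA would run something like this across every field in the survey and customer files.

```python
from statistics import mean, median, stdev


def univariate_summary(values):
    """Summarize one variable, treating None as a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        "stdev": stdev(present) if len(present) > 1 else 0.0,
    }


# Hypothetical monthly usage for nine accounts; two are missing, one looks odd.
monthly_usage = [120, 95, None, 130, 2500, 110, 105, None, 98]
summary = univariate_summary(monthly_usage)
for stat, value in summary.items():
    print(f"{stat}: {value}")
```

Even this much is useful: a maximum of 2500 against a median of 110 is the kind of gap that sends you back to check whether that record is a genuine heavy user or a data entry mistake.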
What else you can do
If additional predictive modeling is deemed appropriate, a number of approaches may then be utilized.
Approach 1: A logistic model (LOGIT)
You may elect to segregate a company’s business customers into 2 separate and distinct classes: those who place a high value on or express high satisfaction with company services, and those who don’t. This type of analysis is sometimes referred to as a response model.
LOGIT could offer some insight into the factors that drive this customer rating, especially when some of those factors are opinion oriented (from existing surveys, a survey designed expressly for this purpose, or both). This model would also utilize appropriate customer data from the company’s various strategic surveys, demographic variables, etc. One of the outcomes of LOGIT modeling is a probability, or “score,” that can be appended to each person in the larger database from whence the analysis came.
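To show what that appended "score" is, here is a minimal sketch of the scoring step only. It assumes a model has already been fit elsewhere; the coefficients, the intercept, and the variable names below are invented for illustration and do not come from any real model.

```python
from math import exp


def logit_score(features, coefficients, intercept):
    """Probability of being in the 'high value' class under a logistic model."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + exp(-z))


# Hypothetical fitted model (not real estimates).
intercept = -4.0
coefficients = {"overall_satisfaction": 0.5, "years_as_customer": 0.1}

# Hypothetical customer records to be scored.
customers = {
    "A": {"overall_satisfaction": 9, "years_as_customer": 6},
    "B": {"overall_satisfaction": 4, "years_as_customer": 1},
}

scores = {cid: logit_score(row, coefficients, intercept) for cid, row in customers.items()}
for cid, p in scores.items():
    print(f"customer {cid}: score = {p:.2f}")
```

The score always lands between 0 and 1, so it can be sorted, bucketed, or thresholded however the business needs.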
Approach 2: Recursive partitioning
Recursive partitioning is a technique that uses the same database and survey information but in a different way.
This modeling approach does a good job of taking categorical variables into account as well as ordinal and continuous variables. Categorical variables tend to be characteristics, such as type of rate plan, type of business, or location, for example. Continuous variables are numbers, like number of employees or annual revenue. Rating scales from surveys are typically treated as this latter type of data.
Recursive partitioning provides a “tree” output, and customers branch toward one classification or another based on how they respond to questions or how their behaviors and characteristics are measured.
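For the curious, the branching idea can be sketched in plain Python. This is a deliberately simplified toy, assuming Gini impurity as the splitting rule and a tiny made-up customer file; it is not the algorithm of any particular software package, and all names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def goes_left(row, feature, value):
    """Numeric features split on <= value; categorical ones on == value."""
    if isinstance(value, (int, float)):
        return row[feature] <= value
    return row[feature] == value


def best_split(rows, labels):
    """Find the (feature, value) test that most reduces weighted impurity."""
    best, best_score = None, gini(labels)
    for feature in rows[0]:
        for value in sorted({row[feature] for row in rows}, key=str):
            left = [lab for row, lab in zip(rows, labels) if goes_left(row, feature, value)]
            right = [lab for row, lab in zip(rows, labels) if not goes_left(row, feature, value)]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score - 1e-12:
                best, best_score = (feature, value), score
    return best


def grow(rows, labels, max_depth=3):
    """Split recursively; a leaf is simply the majority label of its group."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, value = split
    left = [(r, lab) for r, lab in zip(rows, labels) if goes_left(r, feature, value)]
    right = [(r, lab) for r, lab in zip(rows, labels) if not goes_left(r, feature, value)]
    return {
        "test": split,
        "left": grow([r for r, _ in left], [lab for _, lab in left], max_depth - 1),
        "right": grow([r for r, _ in right], [lab for _, lab in right], max_depth - 1),
    }


def classify(tree, row):
    """Follow the branches until a leaf label is reached."""
    while isinstance(tree, dict):
        feature, value = tree["test"]
        tree = tree["left"] if goes_left(row, feature, value) else tree["right"]
    return tree


# Hypothetical business customers: one categorical and one numeric variable.
rows = [
    {"rate_plan": "basic", "employees": 5},
    {"rate_plan": "basic", "employees": 40},
    {"rate_plan": "premium", "employees": 8},
    {"rate_plan": "premium", "employees": 120},
    {"rate_plan": "basic", "employees": 300},
    {"rate_plan": "premium", "employees": 15},
]
labels = ["low", "low", "high", "high", "high", "high"]  # satisfaction class

tree = grow(rows, labels)
print(tree)
print(classify(tree, {"rate_plan": "basic", "employees": 20}))
```

With this toy data the tree splits first on rate plan and then, within the basic plan, on number of employees, which shows how categorical and numeric variables can sit side by side in the same tree.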
No matter which course of action your company decides to take, the first step is always an EDA. It’s an important component of the marketing research process that allows data to be organized, reviewed, and interpreted for the benefit of your business.