Tuesday, December 27, 2011

don't fall victim to this

I came across this graph recently when catching up on some reading over the holiday. My question to you is simple: can you read it?

The website where this interactive visual resides is called worldshapin, and it implores you to "compare countries through their shape." It visualizes data from the Human Development Report 2011 as a "star plot" along the six dimensions of education, population, health, workplace equality, carbon footprint, and living standards. As shown above, you can look at this data between countries and as it compares to continents and the world (when the world isn't obscured by the countries and continents you've chosen, as it is above).

Before I get to the don't fall victim portion of this blog post, let me first say that I do think this helps make the data in the report more accessible by making it visual. You can get a quick idea of how one part of the world stacks up to another across these dimensions that you wouldn't get with a table of data, for example. This is fine for information discovery, assuming your audience has an appetite to "play" with the data.

This visual is not fine, however, if you have a specific story that you want to tell through data.

To convince you of this, I'm going to take one of my own failed data visualizations from my past and remake it into something that works. First, a bit of history:

I used to make charts like this. I called them "spider graphs." In a prior life, I worked in banking, managing home equity fraud. When it comes to fraud, the ways you can impact it can be classified into 8 categories (where each category is a piece of the fraud management lifecycle): deterrence, prevention, detection, mitigation, analysis, policy, investigation, and prosecution (Wes Wilhelm, The Fraud Management Lifecycle Theory). So if we were to look at our efforts in each of these areas and rate the activities along a scale from 0 (we have nothing in place) to, say, 10 (the unattainable utopia of fraud management - we've solved every problem), we could show how well we're doing on a relative basis in each area, with the goal of maximizing our coverage and balancing activity across the different parts of the lifecycle. The spider graph was perfect for this!

I was able to locate an old annual review on the topic of home equity fraud that I put together that highlighted progress to date and introduced forward-looking plans. I'm going to assume it's ok to share an excerpt here, given that the financial institution I did this work for is now defunct (due to much bigger issues than my poor data viz). Here's what it looked like:

The visual starts off with an explanation, shows an example of how to read the graphs on the right, and then presents the graphs of real data across the bottom (the titles across the very bottom are the 5 different types of home equity fraud that we were tracking).

Lesson 1 (foreshadowing): if you have to have a graph to show how to read your graph, your visual may be too complicated.

When it comes to the visual at the bottom ("FML for Home Equity"), let's try to look past the black background and meaningless colors (while annoying, we have bigger fish to fry here) to the actual data. Same question as I led this post with: can you read it?

Before I answer that question with my current data viz lens on, let's back up the better part of a decade to take a look at what I thought of these visuals when I created them. I thought they looked really cool. Sexy, even. I also thought they clearly showed what I wanted to show: mainly, that we had a lot of work to do - we were failing in a lot of places and needed to make some changes.

But people found them really hard to read. I found myself explaining, repeatedly (to the same people even!) how to read them. At the time, I thought this was an issue with my audience.

When I look at the graphs through today's lens, I recognize that the issue was not with my audience, but rather with me. It was a visual design failure. I stubbornly persisted in showing data in a way that wasn't straightforward for my audience to consume (even when it became obvious through their questions that it wasn't clear!). When information isn't straightforward, it's hard to look at. For an audience, this feels uncomfortable. Most people don't want to spend a lot of time with things that make them feel uncomfortable, even when you try to convince them to. Can you blame them?

Let's talk about some other ways to visualize this same data. The sort of data we have lends itself easily to a matrix structure, with fraud management lifecycle stage across one axis and fraud type across the other. When I see the data organized this way, I think heatmap. But the main drawback to a heatmap in this scenario is that, while it gives us a decent visual comparison of how we're doing across the different buckets (both by fraud management lifecycle stage and by fraud type), we don't get a visual comparison of where we are vs. where we'd like to be, which I think is the most important piece here.

Instead, I'll leverage one of my best friends: the bar chart. Bar charts are great because people already know how to read them. This means there's no learning curve for your audience to face to get to the information you want to provide. Rather than spending their time deciphering how to read the graph, they can spend it understanding the information it shows. They're also more likely to spend time on a visual that doesn't make them feel uncomfortable. Here's another way to visualize this data:

Note that the actual numbers aren't so important here - they were somewhat subjective to begin with - so I opted not to show a numerical scale at all. What is important is the relative distance from where we consider ourselves to be currently and where we want to be (as close to "we've solved every problem" as possible). I've drawn attention to this gap by showing the opportunity that remains outlined in blue.
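If you'd like to play with the same idea outside of Excel, here is a minimal sketch in Python. The scores below are invented purely for illustration (they are not the ratings from my old review), and the function names are my own. It draws each lifecycle stage as a plain-text bar, with `#` for where we are and `.` for the opportunity that remains, and, like the chart described above, it shows no numerical scale.

```python
# Hypothetical self-assessed scores on the 0-10 scale described above.
# These values are made up for illustration; they are NOT the original data.
MAX_SCORE = 10

scores = {
    "deterrence": 3,
    "prevention": 6,
    "detection": 7,
    "mitigation": 4,
    "analysis": 5,
    "policy": 2,
    "investigation": 6,
    "prosecution": 1,
}


def remaining_opportunity(score, max_score=MAX_SCORE):
    """Gap between the current score and the top of the scale."""
    return max_score - score


def text_bar(score, max_score=MAX_SCORE):
    """One bar: '#' marks current progress, '.' marks the remaining gap."""
    return "#" * score + "." * remaining_opportunity(score, max_score)


def render(scores):
    """Return one labeled bar per lifecycle stage, labels padded to align."""
    width = max(len(name) for name in scores)
    return [f"{name:<{width}} {text_bar(value)}" for name, value in scores.items()]


for line in render(scores):
    print(line)
```

Every bar has the same total length, so the eye goes straight to the gap rather than to the numbers behind it.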

The overarching lesson is this: don't fall victim to choosing sexy over utility when it comes to data viz for telling a story. When your audience tells you something is hard to read, or you find yourself explaining the visual more than discussing the information it shows, listen and adjust!

If interested, my Excel file is here. Leave a comment to let me know what you think!

Friday, December 16, 2011

the cost of christmas

Each year around this time, the US financial institution PNC produces the "Christmas Price Index," in which they calculate the cost of Christmas based on the items in the 12 Days of Christmas carol. I guess it's a sort of merrier (at least in theme) version of the Consumer Price Index and is meant to provide some economic insight into how the price of goods changes from year to year.

This year, they've added an interactive layer of glitz: the Christmas Price Index Express. Fast Company describes it as "A game-enhanced site with a handmade feel, the Index Express appears as a magical train that carries visitors through an alpine world to collect each of the 12 gifts. But it's essentially an elaborate interactive infographic, where the data points come to life with animation and sound." (Fast Company article) Whatever it is, it takes forever to load, and I wasn't patient enough to spend time on the Index Express (where there are literally bells and whistles). Rather, I clicked through the site long enough to find what I really wanted to get my hands on: the underlying data.

PNC certainly didn't make the data easy to extract. After painstakingly copying and pasting data from each of the 13 pages (total cost of Christmas plus one page for each day of Christmas) and reformatting to get a dataset I could do something with, I had myself an Excel spreadsheet with 28 years of cost data for the 12 days of Christmas. Next challenge: visualize it and see what gems of wisdom we can acquire.

Often, there is much to be learned by looking at how not to visualize data. So before we get to how I'd visualize the cost of Christmas, let's look at a few less-than-optimal visualizations of this data and discuss their limitations.

First, the stacked bar chart. I often see data like this (multiple series over time) displayed this way. Unfortunately, this usually isn't a great approach. Stacked bar charts are tricky, because once you get past the first series, there is no longer a consistent baseline against which to compare the other series. Here's what it looks like with this data:

In the above, we can see how the total price of Christmas has changed over time and also see what the major contributors to the total price are. But if I want to understand how the different components have changed over time, that's tough with this visual. Are all goods changing in the same way, or are some getting more expensive while others have become cheaper? It's really difficult to tell with this graph.
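To make the baseline problem concrete, here is a small Python sketch. The three items and their prices are hypothetical placeholders, not the PNC figures, and `stacked_baselines` is a name I made up. It computes the height at which each series' segment starts in each year of a stacked bar chart.

```python
# Hypothetical prices for three items over three years.
# Invented for illustration only; these are NOT the PNC numbers.
years = [2009, 2010, 2011]
series = {
    "item A": [100, 120, 110],
    "item B": [50, 50, 50],   # this item's price never changes
    "item C": [30, 20, 40],
}


def stacked_baselines(series):
    """For each series, return where its bar segment starts in each year."""
    n_years = len(next(iter(series.values())))
    running = [0] * n_years  # running height of the stack so far, per year
    baselines = {}
    for name, values in series.items():
        baselines[name] = list(running)
        running = [r + v for r, v in zip(running, values)]
    return baselines


baselines = stacked_baselines(series)
for name, starts in baselines.items():
    print(name, starts)
```

Only the first series starts at zero every year. In this made-up data, item B's price is flat, yet its segment starts at a different height each year because item A underneath it moves, so the segment appears to shift even though nothing about item B changed.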

So what if we unstack the bars so that we do have a consistent baseline for each series? Here's what we get:

This clearly doesn't work here - there's way too much going on. But even with fewer series (picture just the first 5, for example), this format is hard to read. It puts a lot of onus on the audience to spend time staring at it and looking for interesting things to pull out. That's too much work, when we can make the interesting things more obvious so our audience doesn't have to search for them.

Let's see what this data looks like in a line chart:

This is getting better, but still may not be optimal. There are a lot of overlapping lines, especially at the bottom where a number of series have similar values. But the biggest drawback is that we don't get a good sense of how the total cost of Christmas has changed over time with this graph, which is kind of the meta point of the data and is probably interesting.

While we're on the topic of non-ideal graphs for this data, I can also picture some sort of horrible visualization with pie charts: one for each year showing the breakdown of Christmas items, perhaps even with the size of the pie scaled by the total cost of Christmas. This would take some time to build, so I'm not going to go through the effort, particularly given that pie charts are my enemy. Rather, I'll simply say: don't do this! Why? Check out this blog post for some background.

We've looked at some less than stellar graphical representations of this data; now let's turn our attention to something that I think might work a little better.

In any visualization exercise, one of the first things to do is determine what question(s) you want to answer. This will drive how you show the data: the goal is to show it in a way that makes it clear what questions you set out to answer and answers them in a straightforward manner. The problem is that this step is often skipped, resulting in graphs like the ones above. When you don't isolate what question(s) you want to answer and try to create a visual that will answer any question, you run the risk of not answering any single question very well.

With this data, I'm going to choose to answer a couple of questions: how has the price of Christmas changed over time (both in aggregate and for the various items), and what proportion does each day contribute to the total cost of Christmas? The trick I'll employ to do this in a way that isn't overwhelming is to create a visual with multiple graphs (and words!) so we can answer these questions one at a time. Said another way, I'm going to use my visual to tell a story with this data. Here's what it looks like:

The top left graph shows how the cost of Christmas has changed over time. The top right graph shows the 2011 cost breakdown per item so we can understand the contributors to the total cost. Finally, the mini-graphs at the bottom help us understand the drivers behind the total changes we see in the top left graph. I've put on my analyst hat and added some words to describe what I believe are the main takeaways that my audience shouldn't miss.
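I built these graphs in Excel, but the numbers behind the three views are simple to compute anywhere. The Python sketch below is one way to do it, under stated assumptions: the gift names and prices are hypothetical placeholders (not the PNC data) and the function names are mine. It computes the total per year, each item's share of the most recent year's total, and each item's percent change from the first year to the last.

```python
# Hypothetical prices per gift, one value per year.
# Placeholder data for illustration; NOT the actual Christmas Price Index.
years = [2009, 2010, 2011]
prices = {
    "gift A": [100, 120, 110],
    "gift B": [50, 50, 50],
    "gift C": [30, 20, 40],
}


def totals_by_year(prices):
    """Total cost per year: the aggregate trend over time."""
    return [sum(column) for column in zip(*prices.values())]


def shares(prices, year_index=-1):
    """Each item's proportion of the total for one year (default: latest)."""
    total = sum(values[year_index] for values in prices.values())
    return {name: values[year_index] / total for name, values in prices.items()}


def percent_change(values):
    """Percent change from the first year to the last for one item."""
    first, last = values[0], values[-1]
    return (last - first) * 100 / first


totals = totals_by_year(prices)
latest_shares = shares(prices)
changes = {name: percent_change(values) for name, values in prices.items()}

print("totals:", dict(zip(years, totals)))
print("latest shares:", latest_shares)
print("percent change:", changes)
```

Keeping each question in its own function mirrors the approach in the visual: answer one question at a time rather than asking a single graph to answer all of them.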

The bottom line: Christmas is getting more expensive. If you have a tight budget for your holiday party, for entertainment you may consider replacing your leaping lords and dancing ladies with milk maids and for decor swap your swans for hens to save a considerable amount of money!

In case you're interested, my full Excel spreadsheet with data and graphs can be found here.