Sunday, September 30, 2012

your input please: font

We have a small debate underway in my day job regarding font. Specifically, which should be our default or standard font for analyses, presentations, etc. This led me to the question: when it comes to font choices, where do best practices end and personal preferences begin?

I'm aware of some relevant research conducted by psychologists Song and Schwarz in 2008 at the University of Michigan at Ann Arbor, where they showed college students recipes for sushi and asked them to estimate 1) how long the recipe would take them to make and 2) how inclined they were to do so. The only thing that varied between the recipes was the font in which they were written. What they found, in a nutshell, was that the fussier the font, the more difficult the students judged the recipe to be and the less likely they were to want to attempt making it. For me, the translation for data visualization broadly is that the more complicated it looks, the less likely your audience is to take time with it.

But back to my specific question: if both fonts are straightforward to read (no legibility issues), how do you choose?

To try to answer this question, I initially planned on doing some research; I quickly grew impatient with this. My brief attempt in Google searches taught me that there is no shortage of font fodder on the internet. There are conflicting lists of the "best" fonts (example). Others have done much more research in this area than I care to (example). I was struck that there don't even seem to be consistent opinions on questions I thought would be easy (e.g. serif vs. sans serif... sans, obviously, right? not according to Wikipedia).

So rather than continue down this slightly frustrating path, I thought I'd pose the problem to you to see if any consensus in the form of the wisdom of crowds emerges. Here are the fonts we're considering:


The quick brown fox jumps over the lazy dog
1234567890 (Calibri)

The quick brown fox jumps over the lazy dog
1234567890 (Open Sans)

The quick brown fox jumps over the lazy dog
1234567890 (Arial)


Specifically, when it comes to the open debate at work: my colleague and I are in agreement that Calibri should not be our default font. I think our reasoning, when you boil it down, is probably simply that we don't like it, rather than anything scientific. Where we differ is on the question of Open Sans vs. Arial. I won't bias you by revealing which I prefer (though my sans serif comment and the text on this blog serve as a pretty big hint).

My questions to you are: If you were weighing in on this decision, what factors would you consider? Which font do you prefer? Why? Leave a comment with your thoughts!

Thursday, September 27, 2012

quick tip: left uppermost align title text

I've commented in the past about the important role that text plays in data visualization: in short, it helps to make the information you provide more accessible to your audience. But where should you place your text for it to best play its role? When it comes to chart and axis titles and legend, my recommendation is to left uppermost justify.

I frequently see chart and axis titles center-aligned and the legend placed to the right of the data it describes. Many standard tools default to this. I favor left uppermost justifying over center-title-alignment and righthand-legend-placement for two reasons:
  1. Center alignment looks messy: center alignment doesn't create a clean line on either the left or the right, so text is left visually hanging.
  2. Your eye hits the left uppermost space first: in Western cultures, most people read left to right, top to bottom*. This means if you left uppermost justify your graph title, legend, and axis titles, your audience's eye hits how to interpret what they're looking at before they get to the data. 
*I'm frequently asked the question how this changes in cultures reading in other directions: the small sample I've posed this question to have said that when it comes to business, the Western style prevails since so much international business is conducted in English. Please leave a comment with your thoughts if you have insight on this!

What I mean when I say "left uppermost align" when it comes to graph titles and legend is:
  • Graph title (+subtitle, if applicable) is positioned above graph and left-aligned.
  • Legend is placed above graph (below title/subtitle) and left-aligned.
  • y-axis title is aligned with topmost y-axis label.
  • x-axis title is aligned with leftmost x-axis label.

Here's a quick look at what a typical graph looks like with default text alignment settings compared to when we follow this tip:


Personally, I steer clear of center alignment almost always in favor of left- or right-alignment. Outside of titles and legends, whether to left- or right-align your text comes down to the layout of the visual. Sometimes right-alignment makes sense: in a horizontal bar chart, for example, you should right-align your y-axis labels so that funny spacing isn't created between the labels and the data. When in doubt, try aligning a couple of different ways and see what looks best: trust your eye or solicit input from a colleague.

Note: the Excel template to create the left uppermost chart above can be downloaded here.

Thursday, September 20, 2012

bar charts must have a zero baseline

This is one rule of data visualization that I see broken too often: when it comes to bar charts, the y-axis must begin at zero.

When our eyes interpret bar charts, we are comparing the relative heights of the bars. When we cut the height off at something greater than zero, it skews this visual comparison, over-emphasizing the difference between the bars in a way that simply isn't honest. Most recently, I saw this in a visual that was forwarded by a friend of a colleague. The offender: Fox News.


There are a number of things that bother me about this visual. Beyond the unnecessary visual clutter of tiny gridlines and strange chart borders, the y-axis isn't labeled (I think it's Top Tax Rate, as noted by the subtitle, but this would be a lot clearer if the axis itself were labeled) and it is placed on the right-hand side of the visual, so it's the last thing I see as my eyes scan across from left to right. That makes it even less likely that I notice the biggest issue with the graphic: the y-axis starts at 34%. This makes the difference between Now (35%) and Jan 1, 2013 (39.6%) appear to be way bigger than it actually is.

How big of an issue is this? Let's do some math to find out. The way it's graphed, the heights of the bars are 1 (35-34) and 5.6 (39.6-34). This represents a visual increase of 460% ((5.6-1)/1). If we graph the bars with a zero baseline so that the heights are accurately represented - 35 and 39.6 - we get a visual (and actual) increase of 13% ((39.6-35)/35). Perhaps that is still significant and that is the point that Fox News was attempting to make. That's fine, but I wish they had done it without this visual misrepresentation of the truth.
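
The arithmetic above can be reproduced with a short Python sketch. The function name `visual_increase` is just an illustrative choice; the only inputs are the two rates and the axis start quoted from the graphic.

```python
def visual_increase(old, new, baseline=0.0):
    """Relative increase in drawn bar height when the axis starts at `baseline`."""
    old_height = old - baseline
    new_height = new - baseline
    if old_height <= 0:
        raise ValueError("baseline must be below the smaller value")
    return (new_height - old_height) / old_height


# Top tax rates (%) shown in the graphic
now, jan_2013 = 35.0, 39.6

truncated = visual_increase(now, jan_2013, baseline=34.0)  # axis as drawn
honest = visual_increase(now, jan_2013)                     # zero baseline

print(f"axis starting at 34%: second bar looks {truncated:.0%} taller")
print(f"axis starting at 0%:  second bar looks {honest:.0%} taller")
print(f"exaggeration: roughly {truncated / honest:.0f}x")
```

Put differently, starting the axis at 34% makes the gap between the two bars look roughly 35 times larger than the underlying change.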

A couple related things to consider (and I have my own opinion on each of these that I'll of course make clear):
  • I've heard the argument that if you're graphing something that has a sort of "natural" baseline of something greater than zero, then it might be appropriate to start with that. For example, if we consider the baseline unemployment rate to be 5%, then the argument goes that you could use this 5% as the baseline. I don't like it. For me, it isn't a valid visual comparison, so if that were the case, I'd use a different way to show it (perhaps plot the entirety of the bars but then also highlight a horizontal line at 5% and label it in a way that makes it clear how to use it for comparison).
  • When it comes to line graphs, the zero baseline rule does not hold. In other words, you can get away with a non-zero baseline in a line graph. With line graphs, we compare the lines to each other more than their height from the x-axis. Still, you need to be careful. I would advise making it clear to your audience that you're using a non-zero baseline so they interpret the information correctly (one approach: label the y-axis and highlight the minimum value in bold to draw attention to the fact that it's something other than zero). And you need to be careful about zooming in too much and making a minor change look big - this gets you back into the visual misrepresentation place that we want to avoid.
My advice to Fox News (and to those communicating with data in general) would be to first determine the story you want to tell. Then determine what data will best support this story. Don't compel your audience with visual misrepresentations; rather, convince them with accurately displayed data that backs up the point you are trying to make.

Related note: there are a number of posts by others on this and related topics. In case you're interested in reading more, here are a few I'm aware of (not an exhaustive list):

Thursday, September 13, 2012

some finer points of data visualization

Last month, I conducted the first storytelling with data Data Viz Challenge. In addition to eternal notoriety, I promised the winner the invitation to write a guest blog post (in case you're interested, a full rundown of the entries and my comments about each can be viewed here). Winner Jeff Shaffer came through with the following post, which I'm excited to share with you here.
_________________________________

I have enjoyed reading Cole's blog at storytellingwithdata.com, so when she invited me to write a guest post I was thrilled with the opportunity. The challenge became focusing in on the exact topic for my post. Cole has done some terrific redesigns over the years, turning some not-so-good charts into good data visualizations. It would have been easy to find another bad chart and post a redesign because let's face it, there are more bad examples out there than good ones. So for this post I decided to cover some of the finer points of design in data visualization.

Before I do a critique of a chart, I wanted to share my view on creating a good data visualization. I teach data visualization at the University of Cincinnati and as part of the course I cover what I call "The Shaffer 4 C's of Data Visualization". They simply serve as a guideline to follow when creating or critiquing a data visualization.

The Shaffer 4 C's of Data Visualization:

  1. Clear - easily seen; sharply defined. Who's the audience? What's the message? Clarity is more important than aesthetics. Ex. good chart title, critical labels, units of measure, avoiding rotated text, good color choice, etc.
  2. Clean - thorough; complete; unadulterated. Ex. not overlabeling axes and data points, avoiding gridlines that are too numerous or too dark, proper formatting, using the right chart type, avoiding poor color choices, etc.
  3. Concise - brief but comprehensive. Not minimalist but not verbose.
  4. Captivating - to attract and hold by beauty or excellence. Does it capture attention? Is it interesting? Does it tell the story?

It's important to understand that certain elements can affect more than one area. For example, if a poor chart type or a 3D graphic is used, it could violate both the Clear and Clean principles, and if the chart is loaded with data labels at every opportunity, it could easily violate both Clean and Concise. On the other hand, it's quite possible to create a very Clean chart following all of the appropriate data visualization rules where the message is lost (not Clear) or the story may not be worth telling (not Captivating).

Color is another example that could affect multiple things. For example, using red/green would not be Clear to someone who is colorblind, and using a categorical color scheme instead of a sequential color scheme for a certain data type might be very confusing. Alerting colors might confuse the message, drawing attention to something they shouldn't. However, overuse of color, gradient or shadow could also affect Clean. Even if the message is Clear, it might still be a sloppy looking chart with poor color choices. For example, bright pink mixed with red might cause a visceral reaction to the clash.

One final comment on the 4 C's of Data Visualization. I specifically used Concise to contrast what I believe to be a minimalist approach to data visualization by Edward Tufte and some others in the field. It isn't necessary in my view to save ink as if my printer cartridge were running dry. I also believe it's ok at times to have extra emphasis, even if it's redundant and I think the use of color can be used to help with Captivating so that the visualization isn't boring. What would the world be like if every chart were black and white, shades of gray, or blue and orange? Don't get me wrong, I have nothing against any of these and the blue/orange colorblind-friendly palette is one of my favorites, but we can't use it for everything.

On the flip side, there is a fine line between adding color for this purpose and that color becoming distracting, alerting or overpowering to the reader. Jeffrey Heer, Associate Professor at the University of Washington and formerly with the Stanford Visualization Group, co-authored a paper with Wesley Willett and Maneesh Agrawala discussing Scented Widgets. "Visual Scent" was used to describe navigation cues embedded in visualizations. It's a great paper and I think the term visual scent will be used more, but I will add to the lexicon by coining my own term, "Visual Order". It's far too easy to create a chart in Excel that looks like Pac-Man eating a Skittles rainbow (yes, this is a real chart that someone produced, with the simple addition of the eye for effect). I won't critique this chart today.
Below is a chart to examine:
I ran across this chart on the University of Cincinnati Health website and the reason I picked this chart is because it's actually a pretty good chart.
  • It's the right chart type for the data. The bar chart allows for easy comparisons visually between the institutions. Bar charts are always a good choice for categorical comparisons.
  • It's ranked in order providing a quick and easy understanding of the verification rankings.
  • Reasonable abbreviations were used to shorten the names that would otherwise be very long.
  • The message is fairly Clear, UCNI is #1 receiving 13 verifications in 14 specialty areas of neurological care (note the benefit of a good title).
  • The chart has good use of color, emphasizing UCNI compared to the other institutions. Blue and red aren't exactly complementary, but red is the University of Cincinnati color so that's a natural choice in this case. This color combination also avoids red/green, which allows someone who is colorblind to make the color distinction and get the same visual message. You can test your own images at http://www.vischeck.com or download the free Adobe Photoshop plug-in.
  • The chart has good detail in the note section which gives the reader more information on how the designations are done and the fact that UCNI is working on the 14th specialty area.
  • From a design standpoint it is always best to use a dark font on white or light color and a light font on dark color. In this case the creator wisely chose a white font color on the color bars and black font on the white and light grey.
  • The gridlines are muted so that they are not distracting or creating a moire effect.
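
The dark-on-light, light-on-dark guideline in the list above can be approximated in code. The following is a minimal Python sketch, not something taken from the original chart or from Excel: it swaps in the WCAG 2.x relative-luminance and contrast-ratio formulas to choose black or white text for a given fill. The function names and the sample colors are hypothetical.

```python
def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color given as 0-255 integers."""
    def linearize(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio between two colors (1 = identical, 21 = black on white)."""
    lum_a, lum_b = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def pick_text_color(background):
    """Return 'black' or 'white', whichever contrasts more with the background."""
    black, white = (0, 0, 0), (255, 255, 255)
    if contrast_ratio(background, black) >= contrast_ratio(background, white):
        return "black"
    return "white"


# Hypothetical fills: a white plot area, a light grey band, a navy bar, a dark red bar
for fill in [(255, 255, 255), (211, 211, 211), (0, 0, 128), (139, 0, 0)]:
    print(fill, "->", pick_text_color(fill))
```

For saturated mid-tones the formula can disagree with intuition (for pure red it slightly favors black text), so treat the result as a starting point and trust your eye.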

Compared to many charts out there, including some of the examples Cole has critiqued in her redesigns, this would be considered a pretty good chart. However, this chart can be improved when examining some of the finer design changes that can be made.

  • It's best to avoid rotated text whenever possible (Clear and Clean). In this case the text was only rotated by 45 degrees, so it's not as hard to read as it would have been if it were rotated a complete 90 degrees (which is commonly done on long labels). I try to avoid rotated text as much as possible, even small angle rotation. The text label "Barrow Neuro. Institute" is actually below 4 bars and requires the eye to follow that text to the end to determine the bar it represents. Try to quickly compare Barrow Neuro. Institute to UC San Francisco. The eye has a hard time keeping a place holder for the comparison. The best solution is to rotate the chart instead of the text. This allows the reader to read the text normally while still using the bars for the visual comparison. It also puts UCNI at the top of the list, which is where they are in the ranking.
  • There is no need for the y-axis label (Clean and Concise). The purpose of axis labels is to give an approximate value for the bars. In this case we have every bar labeled with the value. In cases where there are lots of categories (and this could be one of those cases) then it might be better to remove the individual data points and simply use the axis labels. If using that method then I might highlight UCNI with a single data point for emphasis (still keeping with Concise and Clean).
  • The gridlines are interfering with the paragraph of text (Clean). This is partially due to the increment of the gridlines being set as 1, but it's also the white gridline contrasting with the dark text. There are a number of ways to solve this, for example adding a slightly filled background box to the text or deleting the gridlines completely.

Below is an example redesign:


  • I used a free tool called ColorPic to get the exact colors that were used in the original chart. ColorPic is a utility that will extract the exact color hue, saturation, value and RGB color code from any point hovered over with the mouse.
  • In this case I copied the original color scheme exactly and did not make any adjustments for the gradient of the bars. I recommend avoiding gradient, but the use in this case is so minimal that I simply left it alone for now to preserve the original color scheme. However, notice that even with the tiniest of gradient effects there is still a visual impact on the bars. The left sides of the bars (and the bottom part of the original) are darker and seem to have more weight to them.
  • Axis labels for the values were removed since the bars have data labels.
  • The gridlines are now in increments of 2 instead of 1, but still muted.
  • The paragraph of text is now in the bottom right hand corner of the chart. Notice that I used a gradient effect on the gridlines, muting them to nothing on the bottom right of the chart. They serve no purpose in this region since the bars do not extend to this area. This allows us to keep the gridlines in the area where the bars are without interfering with the text.
  • I changed the font color of the institution names to blue to match the bar.
  • I placed a text box on top of the UCNI text label since Excel doesn't have an option to change the color of a single axis label like it does for a single data label. Now UCNI matches the red bar. 
  • I added the UC Health logo to add to the presentation.
  • Finally, I would usually add the author name and data source as a note at the bottom, but since I don't have the information from the original chart I am unable to do that.

Taking some liberties with the original color scheme and avoiding the gradient effect yields an even better version that isn't as dark and heavy. Note that in this version I also removed the background fill, and when doing that the bars hang in the air. On this point I agree with Stephen Few, who advocates using an axis bar in this case. Although it might be considered "more unnecessary ink" by some, I prefer it over dangling bars because it visually sets them on a baseline.


As I stepped through this same exercise this past week in my data visualization class, one of the students remarked on one additional improvement that I had mentioned in class as a best practice, but had neglected in this chart. They pointed out that having the data labels set at the inside base instead of at the ends of the bars is visually better. It puts the data point immediately next to the text labels and creates a data table that is easy to read vertically. This allows for quick, easy comparisons and doesn't force the eye to jump back and forth from right to left. While I don't think there is anything wrong with the chart above, I do agree with that best practice because it makes it a bit more Clear.


I hope this example showcases some of the finer points of design for data visualization. We often cover the topics of redesign where the charts are so bad that almost anything would be an improvement. In this particular case it is the careful attention to a few details and applying the 4 C's that help make this chart a better presentation of the data.

I would like to thank Cole again for this opportunity to write a guest post on her wonderful blog. Keep up the great work, Cole!

Jeffrey A. Shaffer

Jeffrey A. Shaffer is the Vice President of Information Technology and Analytics at Unifund. Mr. Shaffer joined Unifund in 1996 and has been instrumental in the creation and development of the complex systems, analytics and business intelligence platform at Unifund. Mr. Shaffer holds a BM and MM degree from the University of Cincinnati and an MBA from Xavier University where he was the winner of the 2006 Graduate Student Scholarly Project in Research. Mr. Shaffer has attended the Harvard School's Executive Education Program, is a Certified Manager of Quality and Organizational Excellence through the American Society for Quality, a Certified Project Management Professional through the Project Management Institute and has completed Six Sigma Green Belt and Black Belt training with the Xavier Consulting Group. Mr. Shaffer is also Adjunct Assistant Professor at the University of Cincinnati in the Carl H. Lindner College of Business teaching Data Visualization in the Graduate Course series for Data Analytics. He is also a regular speaker at business intelligence conferences and symposiums on the topic of data visualization, writes for the data visualization blog at MakingDataMeaningful.com for Lucrum, Inc. and was a finalist in the 2011 Tableau Interactive Visualization Competition.


Thursday, September 6, 2012

color me bad(ly)

Recently, a contact shared the following image with me, along with his thoughts. I found both amusing, so thought I'd share with you here, along with some of my own thoughts and a makeover:

From: http://www.consultingmag-digital.com/consultingmag/201207?pg=6&pm=2&fs=1#pg26

Commentary accompanying the visual:
This seems like some seriously simple data to present, but SOOO poorly executed. Looking at it hurts my head and leaves me with nothing but questions:
  • How much time does it take others to figure out the color pattern(s)?
  • Is there really even a pattern?
  • Why are the two legends/color-schemes different? Don't make me work so hard!
  • Why use donuts/pies instead of some simple paired bars/columns, or even just a pair of lines (i.e., a simple histogram)?

No matter what your content, this is the sort of reaction we should work to avoid in our data visualizations. In this case, it seems the color and donut form is meant to make the data more visually interesting, but it hinders our ability to understand the data.

There are a number of lessons we can employ here to make this data easier to comprehend:
  • If there is an intrinsic order in your categories, leverage it. In this case, the 2011 data has categories in order of increasing days away from home (starting at the lower middle left of graph with the light green segment and working clockwise around), but somehow neither this ordering nor the colors of the categories carried over to the 2012 graph; rather, this graph appears to be sorted numerically by category. This makes comparing the segments of the pies even more difficult than it would otherwise be. Speaking of which...
  • Don't make people compare segments of two different pies (or donuts, in this case - substitute your fave dessert dataviz). Our eyes have a hard time measuring angles and areas: this difficulty is amplified when we're meant to do it across different pies/donuts, where the pieces are in slightly different places and there is no consistent baseline.
  • Put the things you want to compare close together. Physical distance between the things we're meant to compare makes comparing those things more difficult. In this case, a bar graph would allow us to put 2011 and 2012 right next to each other so we can get an easy visual comparison.
  • Use color strategically. Don't use color to make something colorful; rather, use color sparingly and strategically to draw your audience's attention to where you want it.
  • Tell a story with your data! Don't assume your audience will want to look at the data and make up their own story. If you look at the full article, the point they are trying to make is that consultants are traveling less in 2012 than prior years. I'm not actually sure this data shows that (it could be that the other groups surveyed are traveling less but the consultants are traveling just as much - we don't have that breakdown of the data to see). At any rate, I'd suggest making the point more clearly with the data and actually calling out the takeaway within the data visualization to help your audience know where to look for the evidence of what you're telling them.
Here's an alternative view of the same data, employing the lessons I've outlined above:


Thanks, Andy, for passing this less-than-stellar viz along and for your thoughts!

For those interested, you can download my Excel file with the above visual here.
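
For readers who work in code rather than Excel, the ordering and side-by-side lessons from this post can be sketched roughly as follows in Python. The category labels and percentages are invented purely for illustration (they are not the survey's figures), and `lower_bound` is a hypothetical helper.

```python
# Hypothetical shares (%) by days away from home; NOT the survey's real numbers.
# The dictionaries are deliberately stored in a jumbled order.
share_2011 = {"101+ days": 30, "0-25 days": 10, "51-100 days": 35, "26-50 days": 25}
share_2012 = {"26-50 days": 30, "101+ days": 22, "0-25 days": 16, "51-100 days": 32}


def lower_bound(label):
    """Leading number of a label such as '26-50 days' or '101+ days'."""
    digits = ""
    for ch in label:
        if not ch.isdigit():
            break
        digits += ch
    return int(digits)


# Leverage the intrinsic order: sort by days away, not by value or alphabet.
ordered = sorted(share_2011, key=lower_bound)

# Put the two years right next to each other within each category.
for label in ordered:
    a, b = share_2011[label], share_2012[label]
    print(f"{label:<12} 2011 {'#' * a:<35} {a}%")
    print(f"{'':<12} 2012 {'#' * b:<35} {b}%")
```

The text bars are only a stand-in for a real paired bar chart; the point is that both years share one category order and sit close together.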

Tuesday, September 4, 2012

a few words go a long way

Part of my day job is internal consulting to our analytics team. One of our interns is getting ready to present findings from his summer project and asked for help visualizing results. This is a part of my job that I really enjoy - helping make the "so what", the "why this is important or interesting" part of an analysis we've undertaken visually clear.

As with many of my work-related examples, I have to keep the details confidential and generalize the situation a bit. In this case, we conducted a study where there was a baseline group receiving no treatment, and then several possible categories of treatment received by other groups. We were looking to understand the difference in impact these various treatments would have on a given outcome.

Here was the original data viz (slightly generalized from the original form):

My initial feedback looked something like the following:
  • Nice use of preattentive attribute (color) to draw your audience's attention to where you want them to focus.
  • The graph needs a title. The legend should be closer to the data it's describing.
  • If baseline is what the audience is meant to compare to, put that first and make that clear - think of adding a summary stat on the right side of the bars that is "increase vs. baseline" or similar.
  • I'm not sure the grey bars are adding value? If they represent 100% minus Outcome observed, stack them on the green bars to add to 100% and make that clear.

After discussing live, I spent a little time with the visual and ended up here:


In addition to incorporating the feedback outlined above, I also separated the Baseline visually and added a subtitle to the treatment groups to try to make it clear that each treatment is meant to be compared against the baseline (reinforcing this via the summary stat on the right).
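
The two numeric pieces of this makeover, stacking each bar to 100% and adding an "increase vs. baseline" summary stat, can be sketched in a few lines of Python. Because the real data is confidential, the rates below are hypothetical placeholders; sorting the treatments by their increase is an assumption of this sketch rather than something the feedback above requires.

```python
# Hypothetical outcome rates (% of each group with the outcome observed).
outcome_pct = {"Treatment B": 28, "Baseline": 20, "Treatment A": 35, "Treatment C": 24}

baseline = outcome_pct["Baseline"]

rows = []
for group, pct in outcome_pct.items():
    rows.append({
        "group": group,
        "outcome": pct,                 # colored segment of the bar
        "no_outcome": 100 - pct,        # grey segment, so each bar totals 100%
        "vs_baseline": pct - baseline,  # summary stat, in percentage points
    })

# Baseline first, then treatments from largest to smallest increase.
rows.sort(key=lambda r: (r["group"] != "Baseline", -r["vs_baseline"]))

for r in rows:
    stat = "" if r["group"] == "Baseline" else f"{r['vs_baseline']:+d} pts vs. baseline"
    print(f"{r['group']:<12} {r['outcome']:>3}% | {r['no_outcome']:>3}%  {stat}")
```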

Note that we aren't done at this point - the story still needs to be put around this data. In this case, the story could be something like "Treatment A results in highest increase over baseline" and a recommendation for rolling this treatment out more broadly. But note how some relatively minor formatting changes and the addition of a few words make the information easier to consume.

The Excel file for the latter version is downloadable here.

Saturday, September 1, 2012

words in print!

Courtesy www.alliancemagazine.org

After I spoke at the European Foundation Centre's annual conference earlier this year, Alliance magazine (whose audience is primarily those in the European philanthropic sector) reached out with interest in a short article on best practices for telling a story with data.

Said article was recently published in their latest edition. You can view the article here. Enjoy!