One of the very useful features of spreadsheets is the capability to create charts and graphs of the data. After all, it is said that a picture is worth a thousand words, and a well-done graph can communicate a lot of information in a very concise form. I happen to appreciate good graphing. And I abhor misleading graphs, which are very easy to create. For anyone who wants to become an expert in this area, you cannot do better than to study the works of Edward Tufte, particularly his The Visual Display of Quantitative Information. And for pure pleasure, if you have a few minutes, here is a video from the YouTube channel Numberphile that discusses perhaps the greatest infographic ever created. Also I really have to give a shout-out to the Hans Rosling TED talk, The Best Stats You’ve Ever Seen. This is simply mind-blowing if you have any feel at all for what graphs might do. And finally there is the David McCandless TED talk The Beauty of Data Visualization.
Returning to the Tufte book for a moment, note the key word in the title: Quantitative. We discussed the distinction between quantitative and qualitative date in an earlier lesson, but it never hurts to be clear about this. Because the choices you make about the right chart or graph depend crucially on knowing what is appropriate for the data you have. Quantitative data is data that is measured in terms of numbers, and the measurements make sense as numbers. If I ask you how many apples you have, and you say you have three apples, I could record that data, and it would be quantitative. But if I asked you what apartment number you live in, and you said three, I could record that data, but it is no sense quantitative. Three is just a label, in this instance, and the data is actually qualitative, which means it can be used to distinguish your apartment from the one next door that is apartment two. But you would never claim that apartment three is 50% more “apartmenty” than apartment 2. The number in this case has no meaning as a number. And the software won’t prevent you from making a mistake here. You can create a graph using this data, but it may be completely useless if you make the wrong choice.
Qualitative Analysis and Charts
So what is qualitative data and how do you make charts out this? Qualitative data measures a particular quality of each object (hence the name), and these are very common in social science research. People can be divided by sex (male vs. female), by religion, by race, by nationality, by province, etc. And I like to think of these qualitative variables as being buckets into which the data is sorted. If I am sorting by sex, I have the buckets for male and female, and each person gets placed in the appropriate bucket.And when I have finished sorting them into buckets, I can do one meaningful mathematical measurement, and that is to count the number in each bucket. With those counts, I do have numbers that can be placed into a chart. I have several options here:
- Column – This chart lets you display a column whose height is proportional to the number in each bucket. If your data had 20 men and 10 women, the column for men should be twice as high as the column for women. If this is not the case, you may very well be lying with data, whether deliberately or inadvertently. This came up recently when a major television network in the U.S. had to apologize and correct a column graph that made the difference between 6 million and 7 million look like the difference between 1 and 7.
- Bar – This is just a Column graph turned on its side. Instead of columns going up, you have bars going from left-to-right. There really is no other difference. But there are reasons why you choose one over the other. If your chart has both positive and negative numbers, that is much clearer using Columns. And if you have a lot of bars, and they have long names, a horizontal bar graph is probably clearer. None of these involve any actual change in what they display, just in making things easier to read, but that is a good consideration.
- Pie – This is the chart to use when you want to discuss relative percentages within each bucket. The entire pie is 100%, and each bucket gets a “slice” proportional to its percentage within the total. The use case here is for qualitative data where the number of categories is fairly small. I find that if there are more than about a half-dozen “slices” of the pie it becomes progressively harder to read the chart.
And that last point raises a general point about all qualitative charts. You really don’t want a huge number of “buckets” or categories in these charts. Even with a bar graph, that can in theory accommodate more categories, you can have a chart that is hard to make sense of. A good way to resolve this is to broaden your categories. For example, suppose you are doing an analysis of the proportion of Evangelical Protestants in different parts of the United States. You might start with data breaking this down by each state, but there are 50 states in the United States, so you would have 50 “buckets”. None of these charts would work well in this case. But if you group the states into Regions, such as East Coast, Mid-West, South, West Coast, etc. you can get it down to a manageable number and produce a chart that makes sense and is easy to understand.
Quantitative Analysis and Charts
This is where we get to more complicated mathematical analysis, and the nature of the charts and graphs available changes as a result. The most interesting cases are ones where you have several quantitative variables interacting. For example, a typical Economics question might be how the unemployment rate has varied over time. The unemployment rate is one quantitative variable, and time itself is another. But you can come up with examples in many other fields as well. In Chemistry, you might measure the rate of reaction as the concentration varies, for instance. In these types of analysis each variable needs to be graphed on an axis that has numbers arranged in order. Given the limitations of the human brain, that generally means no more than three axes if you are trying to do a graph, though there have been some clever ways to get around this limitation. And the resources I have given at the beginning of this tutorial will give you some wonderful examples.
The first question you need to consider is what kind of relationship you think exists in this data. In scientific analysis there are in general two kinds of variables in any analysis. They can be called Independent vs. Dependent, and sometimes the Independent variable is instead referred to as the Explanatory variable. Whatever you call them, the basic idea is that one variable is “explaining” the other. To take an example from Medicine, you might want to examine the idea that there is a relationship between the Age of Death, and the Body Weight, and collect data from a group of individuals to examine this idea and see what relationship exists. I hope you will agree that it makes sense to think of body weight being something that helps to determine the age of death, but it makes no sense to think that age of death helps to determine the body weight. So if you are graphing this particular data set you would always put body weight on the horizontal axis and age of death on the vertical axis. This is a convention, not a scientific necessity, but conventions are important since that governs how people will read the graph. You should never violate conventions without a very compelling reason since in most cases it will cause people to misinterpret your graph.
- Line – This is the most basic type of quantitative chart. It places one variable on the horizontal axis (conventionally called the “X” axis in mathematics) and the other on the vertical axis (the “Y” axis). Each point of the graph represents a particular data point (or “observation” in science) which is entered by selecting the correct values for each axis. The last step is to draw a line that connects each of these data points. Line graphs carry certain implications, though. First of all, for each value on the X-axis, there should be one, and only one, value on the Y-axis. A variable that changes over time is an excellent case in point. If you have a graph of GDP for each year a line graph would be perfectly appropriate, since you can only have one value for each year. In this type of analysis (called Time Series in Statistics) the convention is to always place the time variable on the X-axis, and the corresponding measurement on the Y-axis. This type of graph generally presumes an orderly progression across the X-axis.
- XY (Scatter) – This is the preferred graph to use when there can be more than one value on the Y-axis for any given X-axis value, or where there is not yet a presumption of orderly progress on the X-axis. As an example, consider a graph that relates body weight to height for a group of individuals. Even if you presume some general relationship here, it is clear that for any given height you could have multiple weights, and for any given weight there can be multiple heights. This type of graph is often used to see whether a relationship might exist between these variables.
- Area – This is a way to combine multiple series of data that could just as well be displayed in a line graph. The idea here is to fill in the area under the graph.
- Bubble – This is a chart designed to show the relationship between three variables. One is on the horizontal axis, another on the vertical axis, and the third is shown by the size of the “bubble” drawn. This can be drawn for three quantitative variables, or it could be hybrid graph where one of the variable is qualitative. This is a clever alternative to drawing a perspective rendering of a 3-D graph. The limitation is that you cannot display zero or negative numbers in the bubble.
- Net – This is an odd name, since almost no one other than LibreOffice uses this terminology. If you go looking for this on the Web, you will find it referred to most commonly as a Radar chart or a Spider chart. This has “spokes” that represent a set of variables, which radiate out from a common center. Along each spoke the distance from the center represents a measurement. Then you connect the dots going around the spokes to form a very irregular shape. By repeating this process for a number selected objects, you can do a comparison. If you go to the Wikipedia article linked to they give an example of several different cars and for each you do a measurement of variables such as price, mileage, headroom, etc. By comparing the shapes you get for each automobile you can do a quick comparison.
Hybrid Charts and Graphs
Sometimes you need to combine both kinds of variables (Qualitative and Quantitative) in a single analysis, and for that it helps to have a hybrid graph that combines both types of data in a good way. Technically, you can view Bar Charts, Column Charts, and Pie Charts as hybrids in that the count of members in each “bucket” is actually a quantitative measure, but that is not how most people think of it.
- Stock – This is a specialized type of column graph that essentially combines three different numerical measures on one diagram. The height of each column generally represents the closing price, but it adds both the high and the low for the day. But this can be used for more than just stock prices. You could also use it to display the average, minimum, and maximum for a group of measurements, or in statistics perhaps the mean of the measurements combined with an estimated error, and so on.
- Column and Line – This combines two types of data in a single chart, presumably because they were related. For example, you have data on how many cars were sold in a dealership, broken down by model. You might display this as a Column or Bar chart if you were just displaying this by itself. But suppose you wanted to also add in a related quantitative variable, such the amount of display space each model was given, or the price of each model, etc. You could do this by putting the cars sold into a Column graph, and adding a line graph on top of it to show the amount of display space each model received.
So the point of this analysis is to understand that choosing a graph should not be random. You should have a reason for your choice, and the graph you choose should be a good fit for the point you are trying to communicate. I regularly see examples in the media of graphs that are not appropriate, that are done incorrectly, or that violate the conventions. At best, these are just stupid mistakes. But at worst they are examples of what to me is just plain lying. As an example, just in the last few days I saw an example of a graph where the vertical axis was reversed, so that the lower numbers were on top and the numbers increased as you went down. This is of course the exact opposite of what the convention tells all of us to expect, and I believe it was done deliberately to mislead people. And I was not totally surprised (this was a contentious political issue) to hear defenders say “Well if you are to stupid to read a graph…” But I am firmly in the school that says that clear and honest communication is the point, and using graphs properly is essential to that communication. To see some examples of creative ways of lying with graphs, see this article from Simply Statistics.
Listen to the audio version of this post on Hacker Public Radio!