In Statistics there are generally speaking two types of analysis, broken down between Descriptive and Inferential statistics. The difference has to do what what claims you are making about the data. If you are simply stating something about the data (e.g. there were more men than women in the sample) that is descriptive. But if you make a claim that something is not likely to occur by chance, for instance, or that something is statistically significant (and both of those statements are essentially the same thing) then you are in the realm of inferential statistics. Calc has functions to do both kinds of analysis, and this tutorial will examine some of the common descriptive statistics in Calc and how they are used.
Calc offers many statistical functions, of course, that you might want to make use of. These let you get some analytics on data that you have. But you need to have some data to start with. And to do that I am going to make use of the Random function to make up some numbers. In Cell A1&B1 I set up a header by clicking Merge and Center, making the font Arial 12, bold, and giving a colored background through the formatting cells option. Then I select cells C1 and D1, merge and center, and do the same formatting. For column A&B I give the column name “Data”, and for the combined C1D1 I enter “Statistics. In Cell A2 I use the RAND function (Found in the Mathematics category), which gives me a random number between zero and one. I then multiply by 100 to get numbers a little bigger, then I click-and-drag through the column to get 30 numbers.
When you start working on this you will encounter an interesting problem, which is that the random numbers keep changing when you change other cells. What is happening here is that the formulas are recalculating every time the sheet recalculates. But we can use a trick to get around this. Highlight all of the numbers, and select Copy. Then, with the same area highlighted, paste onto itself by right-clicking, selecting Paste Only, then selecting Number. This takes the result of the function and turns it into a raw number which replaces the function in each cell. Now you have your data locked in and you are ready to do some statistics.
Measures of Central Tendency
In statistics we distinguish several different measures of central tendency. Essentially, this is an attempt to answer the question “What does the most representative member of this group look like?” There is more than one answer depending on the data. First is the question does the data represent a qualitative or quantitative variable? Yes, that question does keep keep coming up. For quantitative data there are several likely answers. One is the average, also referred to in statistics as the mean, and another is the median. Which one do you use? Basically, it comes down to how symmetric the distribution is. When it is symmetric the two measures will be very close. But when it is skewed, they will diverge a lot. For instance, if you have a group of ten people in a room and ask what the typical person’s wealth is, if all ten people have similar circumstances you could just use the average and get a good answer. But if one of those people was Bill Gates, you might get an extremely high and unrepresentative number. In that case, you should use the median, which divides the sample into two groups and asks where is the boundary between the top 50% and the bottom 50%.
Average
Go to the Function Wizard, select Statistical as your category, and Average as the function. Click Next. The window for putting in arguments opens with a space for each number. You could enter each number one at a time, one per field, but that is not optimal. Instead, click on the field for the first number, then click on Cell A2. You will see that the field now has that cell address. But now hold down the Shift key and click on cell A31, which selects the whole column of numbers. Now the field will read “=AVERAGE(A1:A31)”. That gives you the whole range of numbers, and when you click OK you will get the average of this group of numbers. In this case, we generated data using the RAND function, multiplied by 100, which should mean random number between 0 and 100. So you should not be surprised if your answer comes pretty close to 50 (mine came out to 49, but your number could be slightly different).
Geometric Mean
The Average, also called the Arithmetic Mean, is calculated by adding up the measurements and dividing by the number of measurements. The Geometric Mean is calculated by multiplying the numbers together, then taking the nth root, where n=the number of measurements. The function is the GEOMEAN function, and you use it just like the AVERAGE function
Harmonic Mean
This one is fairly complicated to describe, but it is used in scientific applications. The 3 types of mean (Arithmetic, Geometric, and Harmonic) are called the Pythagorean means. The harmonic mean is used, for example in evaluating computer algorithms. It is called the HARMEAN function, and is used just like the AVERAGE function. The rule of the three measures is that the Arithmetic mean is always the largest, the Geomtric Mean is in the middle, and the Harmonic Mean is the lowest.
Median
As above go the Statistical functions, but this time select Median. As above, select the range for the first field. This may be a little farther from the middle, depending on your numbers (mine was 42).
Mode
The Mode is the most common value, and this is what you use for Qualitative data. For example, if you had a sample of people where you recorded their hair color as black, blonde, or red-headed, and wanted to know what was most representative, it makes no sense to use arithmetical measures like average. Mode simply looks for the most common observation in the sample. But note that it expects the data to be numerical, so you would need to encode the data with something like 1=Black, 2=Blonde, and 3=Redhead. Do that, and you can put in a range of data into the Mode function and get your answer.
Measures of Dispersion
These measures tell you how much variation there is in a group of numbers. Two different groups of numbers could have a similar average or median, and yet be very different when you look at the degree of variation. As an example, the number 4 and 6 have an average of 5, but so do the numbers 0 and 10, and yet it is clear that they are very different groups of numbers. So to address this we need to look at a few related measures of dispersion. But to use these functions we need to first discuss the difference between a population and a sample. A population means that you have the entire group measured, while a sample means you have some fraction of the group measured and want to use that to make a claim about the population. And this matters because the way you measure dispersion is very slightly different depending on which case you have. I won’t go any further into this right now, as it more appropriately a topic for a class on statistics, but you have slightly different functions for that reason.
Variance
If you want a short description of Variance, it is the average of the squared deviations from the mean. You have four possible functions here, but two of them are somewhat more specialized so I will skip over them for this tutorial. The ones I will demonstrate are VAR and VARP.
- VAR – This one assumes that you have a sample, and will produce a slightly larger number on that account because of assumed sampling error. Go to the Function wizard, to Statistical, and select the VAR function. Click Next, then in the first of the number fields click to place your insertion mark there. Then click on cell A2, then hold down the Shift key and click on cell A31. This will put the range A2:A31 into your function. Click OK and get your number.
- VARP – This one assumes you have a population. Procedure is same as above, except you select the VARP function
Standard Deviation
This is strongly related to Variance, since it is the square root of the variance. Again, you have four possible functions, but I will only cover the two more common ones. And again, there is one each for a sample and for a population. The Standard Deviation can be considered a type of measurement of the average deviation of each measurement from the mean.
- STDEV – This is the function to use if you are measuring the Standard Deviation of a sample. The procedure for using it is identical to the Variance.
- STDEVP – This is the function to measure Standard Deviation for a population. Again, the same procedure is followed
Other Descriptive Statistics
A few other measures can be useful as well. They are the Minimum and the Maximum of a group of numbers.
- MIN – Measures the minimum. Usage is the same as all other functions we have discussed.
- MAX – Measures the Maximum. Usage is the same as all other functions we have discussed.
Lessons Learned
- A very useful trick is to use the Paste Only, Number trick to convert the contents of a cell or a range of cells into the numbers that result. Recall that earlier in our series we emphasized the difference between the contents of a cell (frequently a formula or cell address if you are a skilled builder of spreadsheets) and the visible results, which are generally a number. If you ever want to get just the number and lose the underlying formulas, this is how it is done.
- To use some of these functions it helps to have a little background in the theory, such as why samples and populations are treated differently, or why there are three different ways of measuring the mean. Going any further here is beyond the scope of these tutorials, but is left as an exercise for the reader.
- Descriptive statistics simply describe what we see in a group of numbers. There is a branch of statistics that goes further, and it is called inferential statistics. In the next tutorial we will look at a few of the more common inferential statistics functions
- One big takeaway from this lesson is that all of these functions are used in similar ways. The procedures for using a function are very standardized. So the key is not figuring out the mechanics, it is understanding which function to use, and why that is the correct function. The single most common error I see is people using the wrong function because they don’t understand why they should be using one function rather than another.
The spreadsheet containing the examples for this tutorial can be found here.
Listen to the audio version of this post on Hacker Public Radio!