LibreOffice Calc: Inferential Statistics Functions

Inferential statistics is what you do to say that something is likely, or that it is not due to chance, or things of the sort. It goes beyond simply describing what is in the numbers and let’s you say something about what the numbers in a sample might mean for the population that generated the sample. There are several type of Inferential Statistics that I want to address in this tutorial, beginning with the idea of a confidence interval.

Confidence Interval

This is a pretty simple thing, but useful. The idea is that in some population you are interested in, you draw a sample, and use descriptive statistics to measure the sample. A common example is polling. The entire population of a country would be very expensive to measure, but a sample of say 1,000 is affordable, and gives you some insight into the whole population. A confidence interval takes the sample measurement and gives a range of numbers based on that which will most likely contain the true population measurement. To calculate this you need three numbers, two of which we have already talked about in descriptive statistics. The first is the sample measurement, and for our example let’s use the Average. The second is the sample size, which in our sample data was 30 observation. The third parameter is the Alpha, which is a probability that you got this sample, purely by chance, from a population that is significantly different from your claim. It is, in short, the probability of being wrong in the inference you are drawing. Now you could choose any value you want for Alpha, but you should know that there are some “industry standard” values, and the most common one is .05, or 5%. This means that your inference will be wrong one time in 20. This is the most common value used in political polls, for example, which means that one poll in 20 will be simply wrong. Everyone in the industry knows this, and now you do too.

To begin, select the CONFIDENCE function in the Function Wizard. Enter the value of .05 for the Alpha, whatever you got for the sample standard deviation in your numbers in the STDEV space, and 30 for the N or number of observations. It is important that you use the Sample Standard Deviation to get an accurate number, which is why they use the STDEV function as the name on this field. Click OK and you will get a number. In my case, I got 10.849.

Now, to turn this into a confidence interval, I take my Average, which was was 49.078, and find the lower limit by subtracting 10.849, and the upper limit by adding 10.849. In my numbers, I got the result that I am 95% confident that the true population mean from which this sample was drawn is between 38.228 and 59.927.

Statistical Tests

The next major area of inferential statistics I want to address concerns statistical tests. We won’t be able to give a full description of the theory behind all of these tests, which is more properly addressed as part of a course in statistics. But in general statistical testing is about formulating a claim and then doing a calculation to see if it is likely to be true. Again, the likelihood of truth is something you can choose. The industry standard is usually 95%, but it is often expressed in a slightly different way as “significant at the .05 level.” This is just another way of saying that 1 time in 20 your results will be wrong. Here are some common tests:

F-test – For this use the FTEST function. This compares two samples and asks what is the probability that the variance is the same in the populations from which they were drawn. For this I created two columns of sample data in a way similar to how I created one column earlier for the Descriptive Statistics. Since both use the random function to get numbers between 0 and 100, it is not surprising that I got a 93% probability that they come from populations with the same variance. After all, I made up the data the same way.
t-test – This compares the means of two samples and asks if they are in fact the same in the population. This takes several parameters. You need to have the two data distributions (in the background CALC is calculating both the mean and variance for each), the mode (1-tailed or 2-tailed), and the type (1=paired, 2=2 samples, equal variance, or 3=2 samples, unequal variance).
Chi-Squared – This uses the CHITEST function. This is a test of whether to variables are independent of each other. For example, does knowing the political party a person votes for allow you to predict which church they belong to? If the answer is “No”, then they are independent. This means you need to have data where each of the variables is measured for every individual in the sample, and that we know the measurements for each individual. If you have this kind of data, just plug in the two data groups into the function.
Z-test – This is one of the most common tests used in statistics, and one of the first introduced to students of the subject. This test compares a sample to a known population and asks if the sample is similar. For example, we might know that in general a country of 30 million people has an average life-span of 72 years, with a standard deviation of 10 years. We get a sample of 30 people from a particular city and get an average of 65 years. Is it likely that this represents a significant difference, or is it just random chance? This is the kind of thing you can test with a Z-test.

Regression

This is a technique that is used to estimate a relationship between two variables. The most common way is to look for the straight line that best fits the data, though there are also techniques that use curves. Essentially, what a linear regression does is minimize the “errors” in the estimation. In the real world, if there is a relationship between two variables, it is rarely a “pure” relationship because other things are going on. And that means the relationship you come up with in your model does not completely explain what is going on. You pick one variable which you think is the “cause”, which is usually called either the independent variable or the explanatory variable. And the other variable is the “effect”, and it is called the dependent variable. The relationship you come up with takes the independent variable as an input, and comes up with a predicted value for the dependent variable. The difference between the predicted value and what is actually measured is the error which this technique seeks to reduce. Again, you can get a lot more technical about this stuff, but that is best left to a course on statistics.

An example of what we are talking about is the project I was given as a class assignment in my first year of graduate school in Economics. We collected data on rents in the Ann Arbor area, and on the distance of each apartment from the central campus. The economic model we had in mind postulated that the closer the apartment was to the central campus, the higher the rent, and we estimated a relationship using a linear regression technique. Not surprisingly, we found a fairly strong relationship. But equally important is that there are obviously other factors that affect the rent, such as the amenities, being near public transportation, etc.

For this example I am using a simple economic model. I gathered data on the growth rate of GDP of the US by year, from 1933 to 2013, and the deficit as a % of GDP for the same years. I am going to postulate relationship between them. But which variable is independent, and which is dependent here? It could go either way depending on how you look at it, but let’s say that how fast (or not) the economy grows will determine the level of the deficit. This is not a course in Economics, I just need an example to illustrate the technique.

The first thing I usually do is get a scatter plot of the data to see if the data fits my ideas at all. I did it for this, and got this chart:

Scatter plot for regression model

And you can see that most of the data cluster around the middle, but it does look in general like the relationship I expected to find is there. I can add something more by going back to my chart, clicking on the Chart to get the eight “handles”, then right-clicking to select Edit, then right-clicking on the dots in the chart to get this properties window:

Editing the Scatter Plot to get the Trend Line

Note that the third option down says Insert Trend Line. If I select that, I get this window:

Trend Line properties window

And this has some very nice stuff. On the type tab note that there are four possible trend lines you can have. The first is the Linear trend, which is what we were looking for, but you also have three more complex ones, Logarithmic, Exponential, and Power. All of those postulate a relationship that is not linear, that involves multiplication, exponents, or something like that.

Selecting Linear and then clicking OK gives you this:

Scatter plot with linear regression line included

But let’s go back to when we put in the Trend line and selected Linear. There were two check boxes at the bottom in a section called Equation. One said to show the equation, the other show the coefficient of determination (R-squared). If we check those, what do we get?

Adding the regression equation and r-squared to the trend line chart

Chart with regression equation and r-squared added

Now, having these added numbers in the middle of the chart is just plain hard to read, but remember in our discussion of charts we talked about objects contained within objects? Well, go into Edit mode, then you can select these numbers as a unit and drag them to the side, to get this:

Final trend line chart with regression equation and R-squared

So the equation can be interpreted as saying that the deficit as a % of GDP is equal to 15.299 times the GDP growth rate, plus a constant of 1.123. And the R-squared is something called the Coefficient of Determination, which is a statistical measure of how well the model fits the actual data. What is considered a good fit can vary according to the circumstances, but again this is not a course on either statistics or economics.

Of course, you can get these numbers from functions as well. The Slope, Intercept, and R-squared can each be calculated directly this way. You need to specify the range of numbers for the Independent variable (called data_X), and the Dependent variable (called data_Y). The functions are called SLOPE, INTERCEPT, and RSQ. As with the other functions we have looked at, you select the function in the Function wizard, then select the ranges. If you do this you will get exactly the same numbers as you got in the chart.

Statistical Functions Wrap-up

Over the last two tutorials we have looked at some fairly common statistical functions and showed how they can be used. I will one final time say that the objective is not to provide an understanding of statistics. For that you would need to take a course on the subject. All we are trying to do here is demonstrate that many of the statistical measures you might want to use can be calculated within LibreOffice Calc. I remember when I had to purchase a statistical package for a lot of money to get these capabilities, so getting them in an open source program is really great.

The other thing I would emphasize is that the use of these functions is pretty standardized. Step one is to make sure you have the data, and you begin by checking the required arguments of the function. If you have these data on hand, you insert the function from the Function wizard, point to the cells that contain your data, and the function will do the work. It really is that simple.

The spreadsheet with our examples on it can be found here..

Listen to the audio version of this post on Hacker Public Radio!

Save as PDF

LibreOffice Calc: Inferential Statistics Functions

Confidence Interval

Statistical Tests

Regression

Statistical Functions Wrap-up

Advertisements

Categories

Audio Podcasts

Blogroll

My Sites

Office Software

CC License