The next major area of investigation for this series is how we can do data manipulation in Calc. Although Calc is not a database, it can be used for some data analysis and manipulation. When I worked for the finance department of a hospital, it was very common for the financial analysts to get a data dump from a centralized system as a CSV file, load it up in a spreadsheet, and then slice-and-dice the data to get the answers they wanted. It is not anywhere near what you can do with a good relational database and a structured query, but you can do some quick-and-dirty analysis here. To illustrate this I downloaded a data set from the Vanderbilt University Department of Biostatistics. In case you were not aware, there are lots of datasets you can download on the Internet, including in many cases government data. I grabbed the Meningitis dataset, partly to get something in the medical area into this discussion, but as it was larger than I needed I discarded a lot of the data to get something a little more manageable. I hid most of the variables, and kept these:
- Case Number -just a an index that starts at 1 and goes up
- Year -looks like the two-digit year in which the patient was observed, and some are blank
- Month – looks like the month within the year when the patient was observed, and some are blank
- Age – the age of the patient when they were observed, and some are blank.
- Race of patient – black v. white, some are blank
- Sex of patient – male v. female, some are blank
There are 581 records (rows) in this data set, and I kept them.
This is the simplest thing, You can sort the data in your spreadsheet to put it into order by a column. Let’s say you wanted to sort by the year when the patient was observed. You click on the B column which has that variable, then go to Data, Sort, and this window opens up:
If you clicked “Current Selection” Column B would be sorted all by itself, but the other columns would not be sorted. This used to be the default behavior of spreadsheets, but we have learned that people do attempt these kinds of manipulations on data sets in spreadsheets, so now it is smart enough to check. If you select “Extend selection”, it will sort based on what is in column B, but will sort the dataset, from cell A1 to cell F582. If you make this selection you will get another window:
As you can see you can choose several criteria, and it will sort one within another. For example, if you chose Month for the second sort key, the data would first be sorted by year. then, for all data points with the same year, they would be sorted by month.
Filters are an important tool for working with large datasets. While they don’t replace a relational database with structured queries, they do give you some useful features. There are three kinds of filters, all accessed through the Data menu in Calc. We will discuss AutoFilter in this tutorial, then move on to the others in subsequent tutorials.
Go to Data–>Filters–>AutoFilter and click to turn on. You will see a drop-down appear on every column in the open sheet:
These drop-downs will give you an entry for each unique value in the column. For a column like Case Number this feature is useless since each row has a unique case number. But if we wanted to only look at cases from 1978, we could go the Year column, click the drop-down, remove the check-mark from “All” in the bottom of the dropdown, which de-selects everything, then put a checkmark in “78” to select only that. The result is to display only those rows which have “78” in the Year column. And if you look at the row numbers on the left, every time one or more rows was skipped, there is a heavier line separating the row numbers. Also, for each column that has a filter applied there will be small black square in the lower-right corner of the drop-down arrow box. This is helpful if you need to quickly see where the filters are applied, such as when you need to remove one.
Note: You do need to be careful in working with a filtered dataset. If you forget the filtering you have applied you may get the wrong idea about your data. Be sure to always check your filters first when working with a dataset to make sure you have not filtered out some of the cases you want to see.
And you can combine filters as well. Leaving the filter on the Year column, I can next go to the Sex column and select female in a similar manner. This reduces the dataset even further to only those cases which occurred in 1978 and involved a female. And if I wanted to go even further, I could add a Race qualification, and filter for Black. This would give me all cases in 1978 which involved a black female.
You can remove an AutoFilter easily. Just click the drop-down, and put a checkmark in All at the bottom. This will remove that particular filter.
Other filters you might want to use are shown at the top of the drop-down window:
- Top 10 – This returns the rows with the ten highest values in the dataset. If there are no other filters, and you have at least ten rows, you should get ten rows selected based on this column. But if you have other filters, you may get less. For example, when I first filtered on Race to select White, then selected Top 10 for Age, i got back 6 records. This means that of the ten highest ages in my dataset, only 6 were of white patients.
- Empty- This selects only those rows that are empty for the column you are filtering
- Not Empty – This selects only those rows that are not empty
Now, suppose you wanted to select cases in 1978 involving females who were over the age of 40? You could hit the drop-down for the Age column, remove the checkmark in All, then put checkmarks in for every value greater than 40. This would work, and the result is 6 cases. But there is a better way, and that involves using what are called Standard Filters.
The sample spreadsheet with this data can be found here.
Listen to the audio version of this post on Hacker Public Radio!