Data frames
1 Preparations
Step 1
Before beginning this tutorial, we should make sure we have all the tools in place. We will therefore start by installing the tidyverse ecosystem of packages. Among the many packages included under this umbrella are readr, readxl and tibble, each of which will be used in this tutorial.
In addition, the foreign package supports importing data from other statistical software (such as SAS, SPSS, Stata, Systat and Minitab).
Let's start by installing the tidyverse ecosystem of packages along with foreign.
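A minimal sketch of the installation step, assuming an internet connection and a configured CRAN mirror. The names `pkgs` and `to_install` are illustrative only; the sketch installs just those packages that are not already present.

```r
# Packages used in this tutorial
pkgs <- c("tidyverse", "foreign")

# Install only those that are not already in the library
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) {
  install.packages(to_install)
}
```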
Now we will load these packages so that they are available for the rest of the session.
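Loading might look like the following. The start-up messages shown next are what loading tidyverse reports; the exact version numbers will depend on your installation.

```r
library(tidyverse)  # attaches the core tidyverse packages, including readr and tibble
library(foreign)    # import routines for other statistical software
```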
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Notice in the above output that when we load the tidyverse package, some validation steps are performed to indicate which actual packages were loaded. Importantly, notice also that a couple of conflicts are identified. The first of these, dplyr::filter() masks stats::filter(), indicates that once the dplyr package was loaded, the previous definition of a function called filter (from the stats package) was overwritten (masked) by a definition contained within the dplyr package.
This is not an error. Rather, it is a warning to advise that if you were expecting to call the filter function and get the behaviour defined within the stats package, then you should preface the call with the stats namespace. For example, call stats::filter() rather than just filter().
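To illustrate, the namespace-qualified call below behaves the same whether or not dplyr is loaded. The vector `x` and the moving-average weights are illustrative values and not part of the tutorial's data.

```r
x <- c(1, 2, 3, 4, 5)

# Explicitly request the stats version: a centred three-point moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))
moving_avg
```

The first and last elements are missing because a centred three-point window does not fit at the ends of the series.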
No such issues arose when loading the foreign package.
Step 2
The second necessary preparation is to prepare the file system for a tidy working environment. Rather than place all R scripts, data and outputs into a single (increasingly cluttered) folder, it is always better to organise your project into a set number of folders. For this tutorial, I would recommend setting up the following structure.
../
|-- data
|-- scripts
Now within your chosen editor, I suggest you create an R script within the scripts folder and set this path as the working directory.
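The folder structure can also be created from R itself. The sketch below uses a hypothetical project location inside the session's temporary directory so that it runs anywhere; substitute a folder of your own choosing.

```r
# Hypothetical project location; replace with a folder of your choosing
project_root <- file.path(tempdir(), "dataframes_tutorial")

# Create the data and scripts folders (no warning if they already exist)
for (folder in c("data", "scripts")) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}
list.files(project_root)

# In an interactive session, make the scripts folder the working directory
# setwd(file.path(project_root, "scripts"))
```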
Step 3
The final preparation step is to download some data files to use during this tutorial. These files should be placed in the data folder. Each of the files is an abbreviated version of the same Mac Nally (1996) data set, yet each is in a different format (some are text files, others are in formats of other software). Each format is listed below, along with a link to manually access the data and an R code snippet that will download the file and place it in the ../data folder.
macnally.csv: a comma separated format
macnally.txt: a tab separated format
macnally.xlsx: an Excel workbook format
2 Constructing data frames
2.1 data.frame
Data frames are generated by amalgamating vectors of the same length. To illustrate the translation of a data set (collection of variables) into an R data frame (collection of vectors), we will use a portion of a real data set by Mac Nally (1996), in which bird communities were investigated at 37 sites across five habitats in southeastern Australia. Although the original data set includes the measured maximum density of 102 bird species from the 37 sites, for simplicity’s sake only two bird species (GST: gray shrike thrush, EYR: eastern yellow robin) and the first eight of the sites will be included. The truncated data set comprises a single factorial (or categorical) variable, two continuous variables, and a set of site (row) names, and is as follows:
Site | HABITAT | GST | EYR |
---|---|---|---|
Reedy Lake | Mixed | 3.4 | 0.0 |
Pearcedale | Gipps.Manna | 3.4 | 9.2 |
Warneet | Gipps.Manna | 8.4 | 3.8 |
Cranbourne | Gipps.Manna | 3.0 | 5.0 |
Lysterfield | Mixed | 5.6 | 5.6 |
Red Hill | Mixed | 8.1 | 4.1 |
Devilbend | Mixed | 8.3 | 7.1 |
Olinda | Mixed | 4.6 | 5.3 |
Firstly, we will generate the three variables (excluding the site labels as they are not variables) separately:
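A sketch of this step, with the values taken from the table above. The vector names (habitat, gst, eyr) are assumed from the column names that appear in the printed output further down.

```r
# Habitat is categorical, so it is stored as a factor
habitat <- factor(c("Mixed", "Gipps.Manna", "Gipps.Manna", "Gipps.Manna",
                    "Mixed", "Mixed", "Mixed", "Mixed"))

# Maximum densities of the two bird species (continuous variables)
gst <- c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6)
eyr <- c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3)
```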
Next, list the names of the vectors as arguments in the data.frame() function to amalgamate the three separate variables into a single data frame (data set), which we will call macnally (after the author).
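The amalgamation might be written as follows. The three vectors are repeated in compact form only so that the block runs on its own.

```r
# The three vectors, repeated so that this block runs on its own
habitat <- factor(c("Mixed", rep("Gipps.Manna", 3), rep("Mixed", 4)))
gst <- c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6)
eyr <- c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3)

# Each vector becomes a column named after the vector
macnally <- data.frame(habitat, gst, eyr)
macnally
```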
habitat gst eyr
1 Mixed 3.4 0.0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
4 Gipps.Manna 3.0 5.0
5 Mixed 5.6 5.6
6 Mixed 8.1 4.1
7 Mixed 8.3 7.1
8 Mixed 4.6 5.3
Notice that each vector (variable) becomes a column in the data frame and that each row represents a single sampling unit (in this case, each row represents a different site). By default, the rows are named using numbers corresponding to the number of rows in the data frame. However, these can be altered to reflect the names of the sampling units by assigning a character vector of alternative names to the row.names() (data frame row names) property of the data frame.
row.names(macnally) <- c('Reedy Lake', 'Pearcedale', 'Warneet', 'Cranbourne',
'Lysterfield', 'Red Hill', 'Devilbend', 'Olinda')
macnally
habitat gst eyr
Reedy Lake Mixed 3.4 0.0
Pearcedale Gipps.Manna 3.4 9.2
Warneet Gipps.Manna 8.4 3.8
Cranbourne Gipps.Manna 3.0 5.0
Lysterfield Mixed 5.6 5.6
Red Hill Mixed 8.1 4.1
Devilbend Mixed 8.3 7.1
Olinda Mixed 4.6 5.3
2.2 expand.grid
When the data set contains multiple fully crossed categorical variables (factors), the expand.grid() function provides a convenient way to create the factor vectors.
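The call that produced the output below is not shown, so the following is a reconstruction that is consistent with it: four replicates crossed with two levels of B and three levels of A. The object name `design` is hypothetical.

```r
# The first argument varies fastest, then B, then A: 4 x 2 x 3 = 24 rows
design <- expand.grid(rep = 1:4,
                      B = paste0("b", 1:2),
                      A = paste0("a", 1:3))
design
```

By default, expand.grid() converts character inputs to factors, which is usually what is wanted for a crossed design.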
rep B A
1 1 b1 a1
2 2 b1 a1
3 3 b1 a1
4 4 b1 a1
5 1 b2 a1
6 2 b2 a1
7 3 b2 a1
8 4 b2 a1
9 1 b1 a2
10 2 b1 a2
11 3 b1 a2
12 4 b1 a2
13 1 b2 a2
14 2 b2 a2
15 3 b2 a2
16 4 b2 a2
17 1 b1 a3
18 2 b1 a3
19 3 b1 a3
20 4 b1 a3
21 1 b2 a3
22 2 b2 a3
23 3 b2 a3
24 4 b2 a3
2.3 as_tibble
Tibbles are a modern re-imagining of data frames in R that focus on clarity, consistency, and user-friendliness. While data frames and tibbles both hold data in rows and columns, tibbles introduce several key differences:
- Preserved Data Types: Unlike data.frame(), which coerced strings to factors by default before R 4.0.0, tibbles maintain the original data types, facilitating accurate analysis and avoiding surprises.
- Explicit Naming: Column names are kept exactly as supplied (non-syntactic names are not silently altered), preventing unintentional renaming of variables.
- Improved Printing: Tibbles display a concise overview, presenting only the first 10 rows and as many columns as fit on screen, making exploration more efficient.
- Streamlined Subsetting: Accessing specific columns is simpler and safer, minimizing potential errors related to partial matching.
The as_tibble() function converts a data frame into a tibble.
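A sketch of the conversion. The data frame is rebuilt in compact form so that the block runs on its own, and the result is stored under the name macnally.tbl.

```r
library(tibble)

# Data frame from the earlier section, rebuilt so that this block runs on its own
macnally <- data.frame(
  habitat = factor(c("Mixed", rep("Gipps.Manna", 3), rep("Mixed", 4))),
  gst = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6),
  eyr = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3)
)

# Convert to a tibble; column types are carried over (habitat remains a factor)
macnally.tbl <- as_tibble(macnally)
macnally.tbl
```

Note that as_tibble() drops row names by default, which is why the rows of the tibble are simply numbered.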
# A tibble: 8 × 3
habitat gst eyr
<fct> <dbl> <dbl>
1 Mixed 3.4 0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
4 Gipps.Manna 3 5
5 Mixed 5.6 5.6
6 Mixed 8.1 4.1
7 Mixed 8.3 7.1
8 Mixed 4.6 5.3
Since the example data set is so small, there is no appreciable difference in how it is presented as either a data frame or a tibble. It is mainly when the data sets get larger that the distinctions become more apparent.
2.4 tribble
The tribble() function allows us to construct tibbles directly.
## column names are given as formulas (~name), followed by the data row by row
macnally.tbl <- tribble(
  ~habitat,      ~gst, ~eyr,
  "Mixed",        3.4,  0.0,
  "Gipps.Manna",  3.4,  9.2,
  "Gipps.Manna",  8.4,  3.8,
  "Gipps.Manna",  3.0,  5.0,
  "Mixed",        5.6,  5.6,
  "Mixed",        8.1,  4.1,
  "Mixed",        8.3,  7.1,
  "Mixed",        4.6,  5.3
)
macnally.tbl
# A tibble: 8 × 3
habitat gst eyr
<chr> <dbl> <dbl>
1 Mixed 3.4 0
2 Gipps.Manna 3.4 9.2
3 Gipps.Manna 8.4 3.8
4 Gipps.Manna 3 5
5 Mixed 5.6 5.6
6 Mixed 8.1 4.1
7 Mixed 8.3 7.1
8 Mixed 4.6 5.3
Note that the construction of tibbles like this more closely resembles the eventual structure of the data. Compare this to the way data frames are constructed (by combining individual vectors).
3 Importing data frames
Statistical systems are typically not very well suited to tasks of data entry and management. This is the role of spreadsheets and databases, of which there are many available. Although the functionality of R continues to expand, it is unlikely that R itself will ever duplicate the extensive spreadsheet and database capabilities of other software. However, there are numerous projects in early stages of development that are being designed to offer an interface to R from within major spreadsheet packages.
R development has roots in the Unix/Linux programming philosophy that dictates that tools should be dedicated to performing specific tasks that they perform very well and rely on other tools to perform other tasks. Consequently, the emphasis of R is, and will continue to be, purely an environment for statistical and graphical procedures. It is expected that other software will be used to generate and maintain data sets.
Unfortunately, data importation into R can be a painful exercise that, for new users, overshadows the benefits of using R. In part, this is because there are a large number of competing methods that can be used to import data from a wide variety of sources. Moreover, many of the popular spreadsheets use their own proprietary file formats that are particularly complex to accommodate fully.
This section does not intend to cover all the methods. Rather, it will highlight the simplest and most robust methods of importing data from the most popular sources. Unless file path names are specified, all data reading functions will search for files in the current working directory.
3.1 Importing from text file
The easiest form of importation is from a pure text file. Since most software that accepts file input can read plain text files, text files can be created in all spreadsheet, database and statistical software packages and are also the default outputs of most data collection devices.
In a text file, data are separated (or delimited) by a specific character, which in turn defines what sort of text file it is. The text file should broadly represent the format of the data frame.
- variables should be in columns and sampling units in rows
- the first row should contain the variable names and, if there are row names, these should be in the first column
The following examples illustrate the format of the abbreviated Mac Nally (1996) data set stored as both comma delimited and tab delimited files, as well as the corresponding read_csv() and read_tsv() commands used to import the files.
The following examples assume that the above data will be in the current working directory. If the current working directory (which can be checked with the getwd() function) does not contain these files, then either:
- include the full path name (or path relative to the current working directory) as the filename argument
- change the current working directory of your session prior to continuing (use the setwd() function)
- copy and paste the files into the current working directory.
Comma delimited text file .csv
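The import call is likely to resemble the sketch below. The path assumes the file was saved to the ../data folder in Step 3, and the file.exists() guard is an addition so that the block does not fail when the file is absent.

```r
library(readr)

# Location assumed from the folder structure set up in Step 2
csv_path <- "../data/macnally.csv"

if (file.exists(csv_path)) {
  macnally <- read_csv(csv_path, trim_ws = TRUE)
  print(macnally)
} else {
  message("File not found: ", csv_path)
}
```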
Rows: 37 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): LOCATION, HABITAT
dbl (2): GST, EYR
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 37 × 4
LOCATION HABITAT GST EYR
<chr> <chr> <dbl> <dbl>
1 Reedy Lake Mixed 3.4 0
2 Pearcedale Gipps.Manna 3.4 9.2
3 Warneet Gipps.Manna 8.4 3.8
4 Cranbourne Gipps.Manna 3 5
5 Lysterfield Mixed 5.6 5.6
6 Red Hill Mixed 8.1 4.1
7 Devilbend Mixed 8.3 7.1
8 Olinda Mixed 4.6 5.3
9 Fern Tree Gum Montane Forest 3.2 5.2
10 Sherwin Foothills Woodland 4.6 1.2
# ℹ 27 more rows
Tab delimited text file .txt
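For the tab delimited version, read_tsv() takes the place of read_csv(). As before, the path is an assumption based on Step 3 and the file.exists() guard is an addition.

```r
library(readr)

# Location assumed from the folder structure set up in Step 2
txt_path <- "../data/macnally.txt"

if (file.exists(txt_path)) {
  macnally <- read_tsv(txt_path, trim_ws = TRUE)
  print(macnally)
} else {
  message("File not found: ", txt_path)
}
```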
Rows: 37 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): LOCATION, HABITAT
dbl (2): GST, EYR
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 37 × 4
LOCATION HABITAT GST EYR
<chr> <chr> <dbl> <dbl>
1 Reedy Lake Mixed 3.4 0
2 Pearcedale Gipps.Manna 3.4 9.2
3 Warneet Gipps.Manna 8.4 3.8
4 Cranbourne Gipps.Manna 3 5
5 Lysterfield Mixed 5.6 5.6
6 Red Hill Mixed 8.1 4.1
7 Devilbend Mixed 8.3 7.1
8 Olinda Mixed 4.6 5.3
9 Fern Tree Gum Montane Forest 3.2 5.2
10 Sherwin Foothills Woodland 4.6 1.2
# ℹ 27 more rows
In the above, the trim_ws = TRUE argument indicates that leading and trailing spaces should be removed from all the data. This is important as spreadsheets (I’m looking at you, Excel) often add spaces before or after words. These are invisible, yet can cause huge headaches when running analyses or graphing.
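The effect of trimming can be seen with a small stand-in file. The file contents and the object names (padded_file, trimmed, untrimmed) are made up for this illustration and are not part of the Mac Nally files.

```r
library(readr)

# Small stand-in file with stray spaces around the habitat values
padded_file <- tempfile(fileext = ".csv")
writeLines(c("LOCATION,HABITAT",
             "Reedy Lake, Mixed ",
             "Pearcedale, Gipps.Manna "),
           padded_file)

trimmed   <- read_csv(padded_file, trim_ws = TRUE,  show_col_types = FALSE)
untrimmed <- read_csv(padded_file, trim_ws = FALSE, show_col_types = FALSE)

# Compare the number of characters in each value with and without trimming
nchar(trimmed$HABITAT)
nchar(untrimmed$HABITAT)
```

Without trimming, a value such as " Mixed " would be treated as a different category from "Mixed" in any subsequent grouping or plotting.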
The read_csv() and read_tsv() functions provide feedback about what they have imported. Specifically, they list the number of rows and columns, what the delimiting character is and the data type assigned to each field (variable/column).
The data are imported as a tibble.
There are numerous ways to specify the filename.
using full paths
Note: In the above example, the full path used was appropriate for the machine that the code was run on. However, it is unlikely to reflect a valid path on your machine. You may want to adjust it accordingly.
using relative paths
Note: Recall that ../data/ means navigate out of the current directory and into the data directory.
using URLs
In the above example, the data are accessed directly from a remote location.
3.2 Importing from the clipboard
The read_tsv() function can also be used to import data (into a tibble) that has been placed on the clipboard by other software, thereby providing a very quick and convenient way of obtaining data from spreadsheets. Simply replace the filename argument with the clipboard() function. For example, to import data placed on the clipboard from Microsoft Excel, select the relevant cells, click copy and then in R, use the following syntax:
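A sketch of such a call. readr's clipboard() relies on the clipr package and an accessible system clipboard, so the guard below (an addition) only attempts the import when both are available.

```r
library(readr)

# clipboard() needs the clipr package and an accessible system clipboard,
# so the import is only attempted when both are available
if (requireNamespace("clipr", quietly = TRUE) && clipr::clipr_available()) {
  # cells copied from a spreadsheet arrive as tab separated text
  macnally <- read_tsv(clipboard())
  print(macnally)
}
```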
Although importing data from the clipboard can be convenient for quickly exploring something, it should mostly be discouraged from a reproducibility perspective:
- when such code is included in a script, the script will just import whatever is present on the clipboard at the time - which may or may not be what you expect it to be
- there is no way to record the provenance of the data because it is not pointing to a specific file or source.
3.3 Importing from Excel
Microsoft Excel is more than just a spreadsheet: it can contain macros, formulae, multiple worksheets and formatting. There are numerous ways to import xlsx files into R, yet depending on the complexity of the original files, the translations can be incomplete and inconsistent.
One of the easiest and safest ways to import data from Excel is either to save the worksheet as a text file (comma or tab delimited) and import the data as a text file (see above), or to copy the data to the clipboard in Excel and import the clipboard data into R.
Nevertheless, it is also possible to directly import a sheet from an Excel workbook. Tidyverse includes a package called readxl; however, as it is not one of the ‘core’ packages, it is not automatically loaded as part of the ecosystem when the tidyverse package is loaded. Hence to use the readxl package, it must be explicitly loaded.
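A sketch of a direct import, assuming the workbook was saved to the ../data folder in Step 3 and that the data sit on the first worksheet. The file.exists() guard is an addition so that the block does not fail when the file is absent.

```r
library(readxl)

# Location assumed from the folder structure set up in Step 2
xlsx_path <- "../data/macnally.xlsx"

if (file.exists(xlsx_path)) {
  # sheet = 1 reads the first worksheet; a sheet name can be given instead
  macnally <- read_excel(xlsx_path, sheet = 1, trim_ws = TRUE)
  print(macnally)
}
```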
4 Viewing data frames
For very small and simple data frames like the macnally example above, the whole data frame can be comfortably displayed in the console. However, for much larger data frames, displaying all the data can be overwhelming and not very useful. There are a number of convenient functions that provide overviews of data. To appreciate the particulars of each routine as well as the differences between the different routines, we will add some other data types to our macnally data.
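The code that added these columns is not shown, so the following is one way to arrive at the data frame displayed below, with the values read off that output. The data frame is first rebuilt so that the block runs on its own.

```r
# macnally as built earlier, repeated so that this block runs on its own
macnally <- data.frame(
  habitat = factor(c("Mixed", rep("Gipps.Manna", 3), rep("Mixed", 4))),
  gst = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6),
  eyr = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3),
  row.names = c("Reedy Lake", "Pearcedale", "Warneet", "Cranbourne",
                "Lysterfield", "Red Hill", "Devilbend", "Olinda")
)

# Add a logical, a character and a date column
macnally$bool <- rep(c(TRUE, FALSE), times = 4)
macnally$char <- rep(c("Large", "Small"), times = 4)
macnally$date <- as.Date(c("2000-02-29", "2000-03-10", "2000-03-20", "2000-03-31",
                           "2000-04-10", "2000-04-21", "2000-05-01", "2000-05-12"))
macnally
```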
habitat gst eyr bool char date
Reedy Lake Mixed 3.4 0.0 TRUE Large 2000-02-29
Pearcedale Gipps.Manna 3.4 9.2 FALSE Small 2000-03-10
Warneet Gipps.Manna 8.4 3.8 TRUE Large 2000-03-20
Cranbourne Gipps.Manna 3.0 5.0 FALSE Small 2000-03-31
Lysterfield Mixed 5.6 5.6 TRUE Large 2000-04-10
Red Hill Mixed 8.1 4.1 FALSE Small 2000-04-21
Devilbend Mixed 8.3 7.1 TRUE Large 2000-05-01
Olinda Mixed 4.6 5.3 FALSE Small 2000-05-12
4.1 summary()
The summary() function is an overloaded function whose behaviour depends on the object passed to the function. When summary() is called with a data.frame, a summary is provided in which:
- numeric vectors (variables) are summarized by the minimum, quartiles, median, mean and maximum (the five-number summary plus the mean) and, if there are any missing values, the number of missing values is also provided
- categorical (factor) vectors are tallied up - that is, the number of instances of each level is counted
- boolean states are also tallied
- character vectors are only described by their length
- date (and POSIX) vectors are summarized by the same statistics as numeric vectors
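The output that follows comes from calling summary() on the extended macnally data frame. The per-type behaviour listed above can also be seen by summarising small stand-alone vectors; the object names used here (gst_summary and so on) are illustrative.

```r
# Numeric vector: minimum, quartiles, median, mean and maximum
gst_summary <- summary(c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6))
gst_summary

# Factor: a count for each level
habitat_summary <- summary(factor(c("Mixed", rep("Gipps.Manna", 3), rep("Mixed", 4))))
habitat_summary

# Logical vector: counts of FALSE and TRUE
summary(rep(c(TRUE, FALSE), times = 4))

# Character vector: only length, class and mode
summary(rep(c("Large", "Small"), times = 4))
```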
habitat gst eyr bool
Gipps.Manna:3 Min. :3.00 Min. :0.000 Mode :logical
Mixed :5 1st Qu.:3.40 1st Qu.:4.025 FALSE:4
Median :5.10 Median :5.150 TRUE :4
Mean :5.60 Mean :5.013
3rd Qu.:8.15 3rd Qu.:5.975
Max. :8.40 Max. :9.200
char date
Length:8 Min. :2000-02-29
Class :character 1st Qu.:2000-03-18
Mode :character Median :2000-04-05
Mean :2000-04-05
3rd Qu.:2000-04-23
Max. :2000-05-12
4.2 str()
Similar to summary(), the str() function is overloaded. The str() function generally produces a compact view of the structure of an object. When str() is called with a data.frame, this compact view comprises a nested list of abbreviated structures.
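A sketch of the call. The extended data frame is rebuilt in a single step, with the dates expressed as day offsets from the first date, so that the block runs on its own.

```r
# Extended macnally data, rebuilt so that this block runs on its own
macnally <- data.frame(
  habitat = factor(c("Mixed", rep("Gipps.Manna", 3), rep("Mixed", 4))),
  gst  = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6),
  eyr  = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3),
  bool = rep(c(TRUE, FALSE), times = 4),
  char = rep(c("Large", "Small"), times = 4),
  date = as.Date("2000-02-29") + c(0, 10, 20, 31, 41, 52, 62, 73),
  stringsAsFactors = FALSE
)

# One line per column: name, type and the first few values
str(macnally)
```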
'data.frame': 8 obs. of 6 variables:
$ habitat: Factor w/ 2 levels "Gipps.Manna",..: 2 1 1 1 2 2 2 2
$ gst : num 3.4 3.4 8.4 3 5.6 8.1 8.3 4.6
$ eyr : num 0 9.2 3.8 5 5.6 4.1 7.1 5.3
$ bool : logi TRUE FALSE TRUE FALSE TRUE FALSE ...
$ char : chr "Large" "Small" "Large" "Small" ...
$ date : Date, format: "2000-02-29" "2000-03-10" ...
4.3 glimpse()
The glimpse() function in the tibble package is similar to str() except that it attempts to maximize the amount of data displayed according to the dimensions of the output.
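A sketch of the call, again with the extended data frame rebuilt so that the block runs on its own.

```r
library(tibble)

# Extended macnally data, rebuilt so that this block runs on its own
macnally <- data.frame(
  habitat = factor(c("Mixed", rep("Gipps.Manna", 3), rep("Mixed", 4))),
  gst  = c(3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6),
  eyr  = c(0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3),
  bool = rep(c(TRUE, FALSE), times = 4),
  char = rep(c("Large", "Small"), times = 4),
  date = as.Date("2000-02-29") + c(0, 10, 20, 31, 41, 52, 62, 73),
  stringsAsFactors = FALSE
)

# Each column is shown on one line, filled with as many values as the width allows
glimpse(macnally)
```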
Rows: 8
Columns: 6
$ habitat <fct> Mixed, Gipps.Manna, Gipps.Manna, Gipps.Manna, Mixed, Mixed, Mi…
$ gst <dbl> 3.4, 3.4, 8.4, 3.0, 5.6, 8.1, 8.3, 4.6
$ eyr <dbl> 0.0, 9.2, 3.8, 5.0, 5.6, 4.1, 7.1, 5.3
$ bool <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE
$ char <chr> "Large", "Small", "Large", "Small", "Large", "Small", "Large",…
$ date <date> 2000-02-29, 2000-03-10, 2000-03-20, 2000-03-31, 2000-04-10, 20…
4.4 Others
There are also numerous graphical methods, including view() and fix(); however, I have focused on the script friendly routines. As the graphical routines require user input, they are inappropriate to include in scripts.
Within RStudio, a data frame can be viewed like a spreadsheet. To view the data this way, click on the name of the data frame within the Environment pane. Furthermore, when in R Notebook mode, a simple functioning spreadsheet will be embedded within the notebook.
5 Exporting data
Although plain text files are not the most compact storage formats, they do offer two very important characteristics. Firstly, they can be read by a wide variety of other applications, ensuring that the ability to retrieve the data will continue indefinitely. Secondly, as they are neither compressed nor encoded, a corruption to one section of the file does not necessarily reduce the ability to correctly read other parts of the file. Hence, this is also an important consideration for the storage of datasets.
The write_csv() function is used to save data frames and tibbles. Although there are a large number of optional arguments available for controlling the exact format of the output file, typically only a few are required. The following example illustrates the exportation of the Mac Nally (1996) data set as a comma delimited text file.
The first and second arguments specify respectively the name of the data frame (or tibble) and filename (and path if necessary) to be exported. Alternatively, it is possible to export to text files with other delimiter characters using the write_delim() function.
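A sketch of both calls. To keep the block runnable anywhere, it uses a three-row stand-in for the data and writes to the session's temporary folder; the file names are hypothetical, and in the tutorial layout the target would be a path under ../data.

```r
library(readr)
library(tibble)

# Small stand-in for the macnally data (first three sites only)
macnally <- tibble(
  LOCATION = c("Reedy Lake", "Pearcedale", "Warneet"),
  HABITAT  = c("Mixed", "Gipps.Manna", "Gipps.Manna"),
  GST      = c(3.4, 3.4, 8.4),
  EYR      = c(0.0, 9.2, 3.8)
)

# Hypothetical output locations in a temporary folder
csv_out <- file.path(tempdir(), "macnally_export.csv")
txt_out <- file.path(tempdir(), "macnally_export.txt")

# First argument: the data; second argument: the file to write
write_csv(macnally, file = csv_out)

# The same data with a tab as the delimiter
write_delim(macnally, file = txt_out, delim = "\t")
```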
6 Saving and loading R objects
Any object in R (including data frames and tibbles) can also be saved into a native R workspace image file (*.RData) either individually, or as a collection of objects using the save() function.
Whilst this native R storage format is not recommended for long-term data storage and archival (as it is a binary format and thus less likely to be universally and indefinitely readable), saving and loading of R objects does provide very useful temporary storage of large R objects between sessions.
In particular, if one or more objects require processing or manipulations that take some time to regenerate, saving and loading of R objects can permit the analyst to skip straight to a specific section of a script and continue development or analysis. Moreover, this is very useful for tweaking and regenerating summary figures - rather than have to go through an entire sequence of data reading, processing and analysis, strategic use of saving/loading of R objects can allow the researcher to commence directly at the point at which modification is required.
6.1 saveRDS/readRDS
The saveRDS() function saves a single object in a compressed format.
The saved object can be restored during subsequent sessions by providing the name of the saved file as an argument to the readRDS() function and assigning the result to a name.
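A minimal round trip with saveRDS() and readRDS() might look as follows. The data frame is a small stand-in and a temporary file is used so that the block runs anywhere; in the tutorial layout a path such as ../data/macnally.rds (a hypothetical name) would be the natural choice.

```r
# Stand-in object; any single R object can be saved this way
macnally <- data.frame(gst = c(3.4, 3.4, 8.4), eyr = c(0.0, 9.2, 3.8))

# Save the object to a (temporary) file
rds_file <- tempfile(fileext = ".rds")
saveRDS(macnally, file = rds_file)

# readRDS() returns the object, so it can be assigned to any name
macnally_restored <- readRDS(rds_file)
all.equal(macnally, macnally_restored)
```

Because readRDS() returns the object rather than restoring it under its original name, it cannot silently overwrite an existing object in the session.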
6.2 save/load
When you want to save multiple objects, the save() function is convenient. This stores multiple objects in a single binary file (compressed by default).
## save just the macnally data frame to the data folder
save(macnally, file = "../data/macnally.RData")
## calculate the mean gst
mean_gst <- mean(macnally$gst)
## display the mean gst
mean_gst
## save the macnally data frame as well as the mean gst object
save(macnally, mean_gst, file = "../data/macnally_stats.RData")
The saved object(s) can be loaded during subsequent sessions by providing the name of the saved workspace image file as an argument to the load() function. For example:
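With the files saved as above, the call would be load("../data/macnally_stats.RData"). The sketch below shows the same round trip with stand-in objects and a temporary file so that it runs anywhere; the names rdata_file and restored are illustrative.

```r
# Stand-in objects saved to a temporary file so the sketch runs anywhere
macnally <- data.frame(gst = c(3.4, 3.4, 8.4), eyr = c(0.0, 9.2, 3.8))
mean_gst <- mean(macnally$gst)
rdata_file <- tempfile(fileext = ".RData")
save(macnally, mean_gst, file = rdata_file)

# Remove the objects, then restore them from the file
rm(macnally, mean_gst)
restored <- load(rdata_file)  # load() invisibly returns the names it restored
restored
mean_gst
```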
Note that load() reads the object(s) into the current environment, with each object given the name it had when it was saved.
6.3 dump
Similarly, a straight un-encoded text version of an object (including data frames and tibbles) can be saved or added to a text file (such as an R script) using the dump() function.
If the file character string is left empty, the text representation of the object will be written to the console. This output can then be viewed or copied and pasted into a script file, thereby providing a convenient way to bundle together data sets along with graphical and analysis commands that act on the data sets. It can even be used to paste data into an email.
Thereafter, the dataset is automatically included when the script is sourced and cannot accidentally become separated from the script.
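Both uses of dump() can be sketched as follows, with a small stand-in data frame and a temporary file (the name dump_file is illustrative).

```r
# Stand-in object
macnally <- data.frame(gst = c(3.4, 3.4, 8.4), eyr = c(0.0, 9.2, 3.8))

# An empty file string writes the text representation to the console
dump("macnally", file = "")

# Writing to a file instead gives a script that recreates the object
dump_file <- tempfile(fileext = ".R")
dump("macnally", file = dump_file)

# Remove the object, then recreate it by sourcing the dumped script
rm(macnally)
source(dump_file)
macnally
```

Note that dump() takes the name of the object as a character string rather than the object itself.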