I apologize in advance, this tutorial requires quite a bit of explaining and context before it can get into the code…. Good data manipulation is an art form that requires the use of many specific tools (functions) and expert data manipulation comes from the integration of these tools together. Therefore it is necessary to have an overview of the tool set before investigating any single tool.
Rarely are data transcribed and organised into such simple and ready-to-go data sets. More typically, data are spread throughout multiple sources and in a variety of formats (particularly if compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses , it is necessary to compile very focused, tidy data sets.
Wickham (2014) suggested that there are many ways to organise data, yet tidy data (data that are structured in such a consistent way as to facilitate analyses) must adhere to a fairly strict set of structural rules. Specifically, in tidy data:
To achieve tidy data, common data preparations include:
This workshop will concentrate on these essential data preparation phases.
Practical data manipulation will be demonstrated via a series of very small artificial datasets. These datasets are presented in tables with black font and lines and the R code required to generate those data will be presented in static code boxes either underneath or adjacent the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing the following command:
(if online)
source("https://github.com/pcinereus/SUYRs_docs/raw/main/data/manipulationDatasets.R")
##OR
load(url("https://github.com/pcinereus/SUYRs_docs/raw/main/data/manipulationDatasets.RData"))(if offline and are running this from a local version having first cloned the SUYRs documentation github repository
source(file="data/manipulationDatasets.R")
##OR
load(file = "data/manipulationDatasets.RData")An overview of the example data sets follows.
These data comprise two Responses (Resp1 and
Resp2) that are each observed on four occasions
(Time) from each of three dosage treatments
(Dose) nested within four plots (Plot). Two of
the four plots completely excluded something and the other two were
controls (did not exclude the thing).
| Treatment | Plot | Dose | Time | Resp1 | Resp2 |
|---|---|---|---|---|---|
| Control | P1 | H | 1 | 8.12 | 3.06 |
| Control | P1 | H | 2 | 20.55 | 25.94 |
| Control | P1 | H | 3 | 27.49 | 29.85 |
| Control | P1 | H | 4 | 44.79 | 25.39 |
| Control | P1 | M | 1 | 20.99 | 20.31 |
| Control | P1 | M | 2 | 37.54 | 17.62 |
| Control | P1 | M | 3 | 61.46 | 98.44 |
| Control | P1 | M | 4 | 82.21 | 160.01 |
| Control | P1 | L | 1 | 31.73 | 21.22 |
| Control | P1 | L | 2 | 59.08 | 37.51 |
| Control | P1 | L | 3 | 94.54 | 119.22 |
| Control | P1 | L | 4 | 121.17 | 116.45 |
| Control | P2 | H | 1 | 8.14 | 23.93 |
| Control | P2 | H | 2 | 13.36 | 28.02 |
| Control | P2 | H | 3 | 33.37 | 37.17 |
| Control | P2 | H | 4 | 39.87 | 38.25 |
| Control | P2 | M | 1 | 19.95 | 19.73 |
| Control | P2 | M | 2 | 42.83 | 40.52 |
| Control | P2 | M | 3 | 62.46 | 4.81 |
| Control | P2 | M | 4 | 81.78 | 136.66 |
| Control | P2 | L | 1 | 32.76 | 30.70 |
| Control | P2 | L | 2 | 62.35 | 123.78 |
| Control | P2 | L | 3 | 90.22 | 113.87 |
| Control | P2 | L | 4 | 114.03 | 76.52 |
| Exclusion | P3 | H | 1 | 21.86 | 23.58 |
| Exclusion | P3 | H | 2 | 39.83 | 28.03 |
| Exclusion | P3 | H | 3 | 59.53 | 21.32 |
| Exclusion | P3 | H | 4 | 75.59 | 90.76 |
| Exclusion | P3 | M | 1 | 38.57 | 30.63 |
| Exclusion | P3 | M | 2 | 81.25 | 83.61 |
| Exclusion | P3 | M | 3 | 124.08 | 124.09 |
| Exclusion | P3 | M | 4 | 159.69 | 112.65 |
| Exclusion | P3 | L | 1 | 61.16 | 39.53 |
| Exclusion | P3 | L | 2 | 119.84 | 110.27 |
| Exclusion | P3 | L | 3 | 175.87 | 286.33 |
| Exclusion | P3 | L | 4 | 238.76 | 54.23 |
| Exclusion | P4 | H | 1 | 18.82 | 28.60 |
| Exclusion | P4 | H | 2 | 39.82 | 39.07 |
| Exclusion | P4 | H | 3 | 63.30 | 93.43 |
| Exclusion | P4 | H | 4 | 82.29 | 60.15 |
| Exclusion | P4 | M | 1 | 39.51 | 45.90 |
| Exclusion | P4 | M | 2 | 79.24 | 88.04 |
| Exclusion | P4 | M | 3 | 122.09 | 84.19 |
| Exclusion | P4 | M | 4 | 161.67 | 256.34 |
| Exclusion | P4 | L | 1 | 57.93 | 85.24 |
| Exclusion | P4 | L | 2 | 117.88 | 167.90 |
| Exclusion | P4 | L | 3 | 181.09 | 314.49 |
| Exclusion | P4 | L | 4 | 242.31 | 304.70 |
Note as this data set is 48 rows long, the majority
examples that feature this dataset will only show the first 6 (or 10)
rows to save on space. This truncation will occur via the
head() function.
These data comprise a Response observed on three occasions
(Time.0, Time.1 and Time.2) from
each of two (A1 and A2) treatments
(Between, gridded or not) nested within four plots
(Plot).
| Plot | Between | Time.0 | Time.1 | Time.2 | |
|---|---|---|---|---|---|
| R1 | P1 | A1 | 8 | 14 | 14 |
| R2 | P2 | A1 | 10 | 12 | 11 |
| R3 | P3 | A2 | 7 | 11 | 8 |
| R4 | P4 | A2 | 11 | 9 | 2 |
Similar to dat.1, these data comprise two Responses
(Resp1 and Resp2). However, these data have a
different hierarchical structure. Each response was measured from one of
two Subplots nested within one of six Plots.
Within each Subplot each of two treatment levels
(B1 and B2) were applied and two of the
Plots where exposed to one of the three
Between levels.
| Resp1 | Resp2 | Between | Plot | Subplot | Within |
|---|---|---|---|---|---|
| 8 | 17 | A1 | P1 | S1 | B1 |
| 10 | 18 | A1 | P1 | S1 | B2 |
| 7 | 17 | A1 | P1 | S2 | B1 |
| 11 | 21 | A1 | P1 | S2 | B2 |
| 14 | 19 | A2 | P2 | S3 | B1 |
| 12 | 13 | A2 | P2 | S3 | B2 |
| 11 | 24 | A2 | P2 | S4 | B1 |
| 9 | 18 | A2 | P2 | S4 | B2 |
| 14 | 25 | A3 | P3 | S5 | B1 |
| 11 | 18 | A3 | P3 | S5 | B2 |
| 8 | 27 | A3 | P3 | S6 | B1 |
| 2 | 22 | A3 | P3 | S6 | B2 |
| 8 | 17 | A1 | P4 | S7 | B1 |
| 10 | 22 | A1 | P4 | S7 | B2 |
| 7 | 16 | A1 | P4 | S8 | B1 |
| 12 | 13 | A1 | P4 | S8 | B2 |
| 11 | 23 | A2 | P5 | S9 | B1 |
| 12 | 19 | A2 | P5 | S9 | B2 |
| 12 | 23 | A2 | P5 | S10 | B1 |
| 10 | 21 | A2 | P5 | S10 | B2 |
| 3 | 17 | A3 | P6 | S11 | B1 |
| 11 | 16 | A3 | P6 | S11 | B2 |
| 13 | 26 | A3 | P6 | S12 | B1 |
| 7 | 28 | A3 | P6 | S12 | B2 |
The data.d data comprise of a single response
(Resp1) collected on different days. The three different
components of the dates have been recorded in three different columns
corresponding to the year, month and
day.
| year | month | day | Resp1 |
|---|---|---|---|
| 2009 | 11 | 09 | 25 |
| 2008 | 05 | 08 | 52 |
| 2009 | 05 | 23 | 18 |
| 2009 | 04 | 15 | 21 |
| 2008 | 10 | 25 | 45 |
| 2008 | 09 | 26 | 40 |
| 2008 | 07 | 05 | 38 |
| 2008 | 11 | 02 | 9 |
| 2009 | 08 | 19 | 0 |
| 2008 | 10 | 03 | 20 |
The data.bio dataset represents a design that is like a
cut down version of data in which there are no
Within Subplot levels and thus there are also
fewer rows.
| Resp1 | Resp2 | Between | Plot | Subplot |
|---|---|---|---|---|
| 8 | 18 | A1 | P1 | S1 |
| 10 | 21 | A1 | P1 | S2 |
| 11 | 23 | A1 | P2 | S4 |
| 14 | 22 | A2 | P3 | S5 |
| 12 | 24 | A2 | P3 | S6 |
| 11 | 23 | A2 | P4 | S7 |
| 9 | 20 | A2 | P4 | S8 |
| 14 | 11 | A3 | P5 | S9 |
| 11 | 22 | A3 | P5 | S10 |
| 8 | 24 | A3 | P6 | S11 |
| 2 | 16 | A3 | P6 | S12 |
The data.chem dataset represents the same design as the
data.bio data except that rather than contain two
biological response, it has two chemical measurements
(Chem1 and Chem2) collected from the same
sampling units as the biological responses.
| Chem1 | Chem2 | Between | Plot | Subplot |
|---|---|---|---|---|
| 1.452878 | 0.8858208 | A1 | P1 | S1 |
| 3.266253 | 0.1800177 | A1 | P1 | S2 |
| 1.178652 | 5.0780682 | A1 | P2 | S3 |
| 13.400350 | 1.5762780 | A1 | P2 | S4 |
| 3.779183 | 1.6222430 | A2 | P3 | S5 |
| 1.196657 | 4.2369184 | A2 | P3 | S6 |
| 5.687807 | 2.9859003 | A2 | P4 | S8 |
| 4.834518 | 4.1328919 | A3 | P5 | S9 |
| 2.002931 | 3.6043314 | A3 | P5 | S10 |
| 12.326867 | 1.7763576 | A3 | P6 | S11 |
| 4.014221 | 0.2255188 | A3 | P6 | S12 |
The data.geo dataset represent geographical information
about the Plots in the data.bio dataset.
| Plot | LAT | LONG |
|---|---|---|
| P1 | 17.9605 | 145.4326 |
| P2 | 17.5210 | 146.1983 |
| P3 | 17.0011 | 146.3839 |
| P4 | 18.2350 | 146.7934 |
| P5 | 18.9840 | 146.0345 |
| P6 | 20.1154 | 146.4672 |
The tikus dataset is a data.frame version of the data
with the same name that is provided as a list within the
mvabund package. The data contain a the abundance
of 75 coral species from different locations (rep) over six
years (time) at Tikus island. The abundance was measured as
the length (in cm) of a 10m transect that intersected with the coral
species.
Note, these data actually have 60 rows and 77 columns. In order avoid displaying a huge table in this quick view, I have trimmed the view to just the first 10 rows and six columns.
| time | rep | Psammocora contigua | Psammocora digitata | Pocillopora damicornis | Pocillopora verrucosa | |
|---|---|---|---|---|---|---|
| V1 | 81 | 1 | 0 | 0 | 79 | 32 |
| V2 | 81 | 2 | 0 | 0 | 51 | 21 |
| V3 | 81 | 3 | 0 | 0 | 42 | 35 |
| V4 | 81 | 4 | 0 | 0 | 15 | 0 |
| V5 | 81 | 5 | 0 | 0 | 9 | 0 |
| V6 | 81 | 6 | 0 | 0 | 72 | 0 |
| V7 | 81 | 7 | 0 | 0 | 0 | 41 |
| V8 | 81 | 8 | 0 | 0 | 16 | 25 |
| V9 | 81 | 9 | 0 | 0 | 0 | 38 |
| V10 | 81 | 10 | 0 | 0 | 16 | 0 |
The great folks over at Rstudio have produced an excellent set of cheatsheets on a range of topics. For this tutorial, the Data Transformation Cheat Sheet and Data Tidying Cheat Sheet (mainly just the first page) are useful summaries.
https://github.com/rstudio/cheatsheets/raw/main/data-import.pdf
https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf
https://github.com/rstudio/cheatsheets/raw/main/lubridate.pdf
purrr)https://github.com/rstudio/cheatsheets/raw/main/tidyeval.pdf