I apologize in advance, this tutorial requires quite a bit of explaining and context before it can get into the code…. Good data manipulation is an art form that requires the use of many specific tools (functions) and expert data manipulation comes from the integration of these tools together. Therefore it is necessary to have an overview of the tool set before investigating any single tool.
Rarely are data transcribed and organised into such simple and ready-to-go data sets. More typically, data are spread throughout multiple sources and in a variety of formats (particularly if compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses , it is necessary to compile very focused, tidy data sets.
Wickham (2014) suggested that there are many ways to organise data, yet tidy data (data that are structured in such a consistent way as to facilitate analyses) must adhere to a fairly strict set of structural rules. Specifically, in tidy data:
To achieve tidy data, common data preparations include:
This workshop will concentrate on these essential data preparation phases.
Practical data manipulation will be demonstrated via a series of very small artificial datasets. These datasets are presented in tables with black font and lines and the R code required to generate those data will be presented in static code boxes either underneath or adjacent the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing the following command:
(if online)
source("https://github.com/pcinereus/SUYRs_docs/raw/main/data/manipulationDatasets.R")
##OR
load(url("https://github.com/pcinereus/SUYRs_docs/raw/main/data/manipulationDatasets.RData"))
(if offline and are running this from a local version having first cloned the SUYRs documentation github repository
source(file="data/manipulationDatasets.R")
##OR
load(file = "data/manipulationDatasets.RData")
An overview of the example data sets follows.
These data comprise two Responses (Resp1
and
Resp2
) that are each observed on four occasions
(Time
) from each of three dosage treatments
(Dose
) nested within four plots (Plot
). Two of
the four plots completely excluded something and the other two were
controls (did not exclude the thing).
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Control | P1 | L | 3 | 94.54 | 119.22 |
Control | P1 | L | 4 | 121.17 | 116.45 |
Control | P2 | H | 1 | 8.14 | 23.93 |
Control | P2 | H | 2 | 13.36 | 28.02 |
Control | P2 | H | 3 | 33.37 | 37.17 |
Control | P2 | H | 4 | 39.87 | 38.25 |
Control | P2 | M | 1 | 19.95 | 19.73 |
Control | P2 | M | 2 | 42.83 | 40.52 |
Control | P2 | M | 3 | 62.46 | 4.81 |
Control | P2 | M | 4 | 81.78 | 136.66 |
Control | P2 | L | 1 | 32.76 | 30.70 |
Control | P2 | L | 2 | 62.35 | 123.78 |
Control | P2 | L | 3 | 90.22 | 113.87 |
Control | P2 | L | 4 | 114.03 | 76.52 |
Exclusion | P3 | H | 1 | 21.86 | 23.58 |
Exclusion | P3 | H | 2 | 39.83 | 28.03 |
Exclusion | P3 | H | 3 | 59.53 | 21.32 |
Exclusion | P3 | H | 4 | 75.59 | 90.76 |
Exclusion | P3 | M | 1 | 38.57 | 30.63 |
Exclusion | P3 | M | 2 | 81.25 | 83.61 |
Exclusion | P3 | M | 3 | 124.08 | 124.09 |
Exclusion | P3 | M | 4 | 159.69 | 112.65 |
Exclusion | P3 | L | 1 | 61.16 | 39.53 |
Exclusion | P3 | L | 2 | 119.84 | 110.27 |
Exclusion | P3 | L | 3 | 175.87 | 286.33 |
Exclusion | P3 | L | 4 | 238.76 | 54.23 |
Exclusion | P4 | H | 1 | 18.82 | 28.60 |
Exclusion | P4 | H | 2 | 39.82 | 39.07 |
Exclusion | P4 | H | 3 | 63.30 | 93.43 |
Exclusion | P4 | H | 4 | 82.29 | 60.15 |
Exclusion | P4 | M | 1 | 39.51 | 45.90 |
Exclusion | P4 | M | 2 | 79.24 | 88.04 |
Exclusion | P4 | M | 3 | 122.09 | 84.19 |
Exclusion | P4 | M | 4 | 161.67 | 256.34 |
Exclusion | P4 | L | 1 | 57.93 | 85.24 |
Exclusion | P4 | L | 2 | 117.88 | 167.90 |
Exclusion | P4 | L | 3 | 181.09 | 314.49 |
Exclusion | P4 | L | 4 | 242.31 | 304.70 |
Note as this data set is 48 rows long, the majority
examples that feature this dataset will only show the first 6 (or 10)
rows to save on space. This truncation will occur via the
head()
function.
These data comprise a Response observed on three occasions
(Time.0
, Time.1
and Time.2
) from
each of two (A1
and A2
) treatments
(Between
, gridded or not) nested within four plots
(Plot
).
Plot | Between | Time.0 | Time.1 | Time.2 | |
---|---|---|---|---|---|
R1 | P1 | A1 | 8 | 14 | 14 |
R2 | P2 | A1 | 10 | 12 | 11 |
R3 | P3 | A2 | 7 | 11 | 8 |
R4 | P4 | A2 | 11 | 9 | 2 |
Similar to dat.1
, these data comprise two Responses
(Resp1
and Resp2
). However, these data have a
different hierarchical structure. Each response was measured from one of
two Subplot
s nested within one of six Plot
s.
Within each Subplot
each of two treatment levels
(B1
and B2
) were applied and two of the
Plot
s where exposed to one of the three
Between
levels.
Resp1 | Resp2 | Between | Plot | Subplot | Within |
---|---|---|---|---|---|
8 | 17 | A1 | P1 | S1 | B1 |
10 | 18 | A1 | P1 | S1 | B2 |
7 | 17 | A1 | P1 | S2 | B1 |
11 | 21 | A1 | P1 | S2 | B2 |
14 | 19 | A2 | P2 | S3 | B1 |
12 | 13 | A2 | P2 | S3 | B2 |
11 | 24 | A2 | P2 | S4 | B1 |
9 | 18 | A2 | P2 | S4 | B2 |
14 | 25 | A3 | P3 | S5 | B1 |
11 | 18 | A3 | P3 | S5 | B2 |
8 | 27 | A3 | P3 | S6 | B1 |
2 | 22 | A3 | P3 | S6 | B2 |
8 | 17 | A1 | P4 | S7 | B1 |
10 | 22 | A1 | P4 | S7 | B2 |
7 | 16 | A1 | P4 | S8 | B1 |
12 | 13 | A1 | P4 | S8 | B2 |
11 | 23 | A2 | P5 | S9 | B1 |
12 | 19 | A2 | P5 | S9 | B2 |
12 | 23 | A2 | P5 | S10 | B1 |
10 | 21 | A2 | P5 | S10 | B2 |
3 | 17 | A3 | P6 | S11 | B1 |
11 | 16 | A3 | P6 | S11 | B2 |
13 | 26 | A3 | P6 | S12 | B1 |
7 | 28 | A3 | P6 | S12 | B2 |
The data.d
data comprise of a single response
(Resp1
) collected on different days. The three different
components of the dates have been recorded in three different columns
corresponding to the year
, month
and
day
.
year | month | day | Resp1 |
---|---|---|---|
2009 | 11 | 09 | 25 |
2008 | 05 | 08 | 52 |
2009 | 05 | 23 | 18 |
2009 | 04 | 15 | 21 |
2008 | 10 | 25 | 45 |
2008 | 09 | 26 | 40 |
2008 | 07 | 05 | 38 |
2008 | 11 | 02 | 9 |
2009 | 08 | 19 | 0 |
2008 | 10 | 03 | 20 |
The data.bio
dataset represents a design that is like a
cut down version of data
in which there are no
Within
Subplot
levels and thus there are also
fewer rows.
Resp1 | Resp2 | Between | Plot | Subplot |
---|---|---|---|---|
8 | 18 | A1 | P1 | S1 |
10 | 21 | A1 | P1 | S2 |
11 | 23 | A1 | P2 | S4 |
14 | 22 | A2 | P3 | S5 |
12 | 24 | A2 | P3 | S6 |
11 | 23 | A2 | P4 | S7 |
9 | 20 | A2 | P4 | S8 |
14 | 11 | A3 | P5 | S9 |
11 | 22 | A3 | P5 | S10 |
8 | 24 | A3 | P6 | S11 |
2 | 16 | A3 | P6 | S12 |
The data.chem
dataset represents the same design as the
data.bio
data except that rather than contain two
biological response, it has two chemical measurements
(Chem1
and Chem2
) collected from the same
sampling units as the biological responses.
Chem1 | Chem2 | Between | Plot | Subplot |
---|---|---|---|---|
1.452878 | 0.8858208 | A1 | P1 | S1 |
3.266253 | 0.1800177 | A1 | P1 | S2 |
1.178652 | 5.0780682 | A1 | P2 | S3 |
13.400350 | 1.5762780 | A1 | P2 | S4 |
3.779183 | 1.6222430 | A2 | P3 | S5 |
1.196657 | 4.2369184 | A2 | P3 | S6 |
5.687807 | 2.9859003 | A2 | P4 | S8 |
4.834518 | 4.1328919 | A3 | P5 | S9 |
2.002931 | 3.6043314 | A3 | P5 | S10 |
12.326867 | 1.7763576 | A3 | P6 | S11 |
4.014221 | 0.2255188 | A3 | P6 | S12 |
The data.geo
dataset represent geographical information
about the Plot
s in the data.bio
dataset.
Plot | LAT | LONG |
---|---|---|
P1 | 17.9605 | 145.4326 |
P2 | 17.5210 | 146.1983 |
P3 | 17.0011 | 146.3839 |
P4 | 18.2350 | 146.7934 |
P5 | 18.9840 | 146.0345 |
P6 | 20.1154 | 146.4672 |
The tikus
dataset is a data.frame version of the data
with the same name that is provided as a list within the
mvabund
package. The data contain a the abundance
of 75 coral species from different locations (rep
) over six
years (time
) at Tikus island. The abundance was measured as
the length (in cm) of a 10m transect that intersected with the coral
species.
Note, these data actually have 60 rows and 77 columns. In order avoid displaying a huge table in this quick view, I have trimmed the view to just the first 10 rows and six columns.
time | rep | Psammocora contigua | Psammocora digitata | Pocillopora damicornis | Pocillopora verrucosa | |
---|---|---|---|---|---|---|
V1 | 81 | 1 | 0 | 0 | 79 | 32 |
V2 | 81 | 2 | 0 | 0 | 51 | 21 |
V3 | 81 | 3 | 0 | 0 | 42 | 35 |
V4 | 81 | 4 | 0 | 0 | 15 | 0 |
V5 | 81 | 5 | 0 | 0 | 9 | 0 |
V6 | 81 | 6 | 0 | 0 | 72 | 0 |
V7 | 81 | 7 | 0 | 0 | 0 | 41 |
V8 | 81 | 8 | 0 | 0 | 16 | 25 |
V9 | 81 | 9 | 0 | 0 | 0 | 38 |
V10 | 81 | 10 | 0 | 0 | 16 | 0 |
The great folks over at Rstudio have produced an excellent set of cheatsheets on a range of topics. For this tutorial, the Data Transformation Cheat Sheet and Data Tidying Cheat Sheet (mainly just the first page) are useful summaries.
https://github.com/rstudio/cheatsheets/raw/main/data-import.pdf
https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf
https://github.com/rstudio/cheatsheets/raw/main/lubridate.pdf
purrr
)https://github.com/rstudio/cheatsheets/raw/main/tidyeval.pdf