I apologise in advance: this tutorial requires quite a bit of explaining and context before it can get into the code. Good data manipulation is an art form that requires the use of many specific tools (functions), and expert data manipulation comes from integrating these tools together. It is therefore necessary to have an overview of the tool set before investigating any single tool.
Rarely are data transcribed and organised into such simple and ready-to-go data sets. More typically, data are spread across multiple sources and in a variety of formats (particularly if compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses, it is necessary to compile very focused, tidy data sets.
Wickham (2014) suggested that there are many ways to organise data, yet tidy data (data that are structured in such a consistent way as to facilitate analyses) must adhere to a fairly strict set of structural rules. Specifically, in tidy data:
To achieve tidy data, common data preparations include:
This workshop will concentrate on these essential data preparation phases.
Practical data manipulation will be demonstrated via a series of very small artificial datasets. These datasets are presented in tables with black font and lines, and the R code required to generate those data will be presented in static code boxes either underneath or adjacent to the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing the following command:
(if online)
(if offline and running this from a local version, having first cloned the SUYRs documentation GitHub repository)
An overview of the example data sets follows.
These data comprise two responses (`Resp1` and `Resp2`) that are each observed on four occasions (`Time`) from each of three dosage treatments (`Dose`) nested within four plots (`Plot`). Two of the four plots completely excluded something and the other two were controls (did not exclude the thing).
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Control | P1 | L | 3 | 94.54 | 119.22 |
Control | P1 | L | 4 | 121.17 | 116.45 |
Control | P2 | H | 1 | 8.14 | 23.93 |
Control | P2 | H | 2 | 13.36 | 28.02 |
Control | P2 | H | 3 | 33.37 | 37.17 |
Control | P2 | H | 4 | 39.87 | 38.25 |
Control | P2 | M | 1 | 19.95 | 19.73 |
Control | P2 | M | 2 | 42.83 | 40.52 |
Control | P2 | M | 3 | 62.46 | 4.81 |
Control | P2 | M | 4 | 81.78 | 136.66 |
Control | P2 | L | 1 | 32.76 | 30.70 |
Control | P2 | L | 2 | 62.35 | 123.78 |
Control | P2 | L | 3 | 90.22 | 113.87 |
Control | P2 | L | 4 | 114.03 | 76.52 |
Exclusion | P3 | H | 1 | 21.86 | 23.58 |
Exclusion | P3 | H | 2 | 39.83 | 28.03 |
Exclusion | P3 | H | 3 | 59.53 | 21.32 |
Exclusion | P3 | H | 4 | 75.59 | 90.76 |
Exclusion | P3 | M | 1 | 38.57 | 30.63 |
Exclusion | P3 | M | 2 | 81.25 | 83.61 |
Exclusion | P3 | M | 3 | 124.08 | 124.09 |
Exclusion | P3 | M | 4 | 159.69 | 112.65 |
Exclusion | P3 | L | 1 | 61.16 | 39.53 |
Exclusion | P3 | L | 2 | 119.84 | 110.27 |
Exclusion | P3 | L | 3 | 175.87 | 286.33 |
Exclusion | P3 | L | 4 | 238.76 | 54.23 |
Exclusion | P4 | H | 1 | 18.82 | 28.60 |
Exclusion | P4 | H | 2 | 39.82 | 39.07 |
Exclusion | P4 | H | 3 | 63.30 | 93.43 |
Exclusion | P4 | H | 4 | 82.29 | 60.15 |
Exclusion | P4 | M | 1 | 39.51 | 45.90 |
Exclusion | P4 | M | 2 | 79.24 | 88.04 |
Exclusion | P4 | M | 3 | 122.09 | 84.19 |
Exclusion | P4 | M | 4 | 161.67 | 256.34 |
Exclusion | P4 | L | 1 | 57.93 | 85.24 |
Exclusion | P4 | L | 2 | 117.88 | 167.90 |
Exclusion | P4 | L | 3 | 181.09 | 314.49 |
Exclusion | P4 | L | 4 | 242.31 | 304.70 |
Note, as this data set is 48 rows long, the majority of examples that feature this dataset will only show the first 6 (or 10) rows to save space. This truncation will be performed via the `head()` function.
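For example, on a hypothetical 48-row data frame:

```r
# head() keeps only the first rows of a data frame (6 by default)
dat <- data.frame(x = 1:48)
head(dat)      # first 6 rows
head(dat, 10)  # first 10 rows
```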
These data comprise a Response observed on three occasions (`Time.0`, `Time.1` and `Time.2`) from each of two (`A1` and `A2`) treatments (`Between`, gridded or not) nested within four plots (`Plot`).
 | Plot | Between | Time.0 | Time.1 | Time.2 |
---|---|---|---|---|---|
R1 | P1 | A1 | 8 | 14 | 14 |
R2 | P2 | A1 | 10 | 12 | 11 |
R3 | P3 | A2 | 7 | 11 | 8 |
R4 | P4 | A2 | 11 | 9 | 2 |
Similar to `dat.1`, these data comprise two responses (`Resp1` and `Resp2`). However, these data have a different hierarchical structure. Each response was measured from one of two `Subplot`s nested within one of six `Plot`s. Within each `Subplot`, each of two treatment levels (`B1` and `B2`) were applied, and each pair of `Plot`s was exposed to one of the three `Between` levels.
Resp1 | Resp2 | Between | Plot | Subplot | Within |
---|---|---|---|---|---|
8 | 17 | A1 | P1 | S1 | B1 |
10 | 18 | A1 | P1 | S1 | B2 |
7 | 17 | A1 | P1 | S2 | B1 |
11 | 21 | A1 | P1 | S2 | B2 |
14 | 19 | A2 | P2 | S3 | B1 |
12 | 13 | A2 | P2 | S3 | B2 |
11 | 24 | A2 | P2 | S4 | B1 |
9 | 18 | A2 | P2 | S4 | B2 |
14 | 25 | A3 | P3 | S5 | B1 |
11 | 18 | A3 | P3 | S5 | B2 |
8 | 27 | A3 | P3 | S6 | B1 |
2 | 22 | A3 | P3 | S6 | B2 |
8 | 17 | A1 | P4 | S7 | B1 |
10 | 22 | A1 | P4 | S7 | B2 |
7 | 16 | A1 | P4 | S8 | B1 |
12 | 13 | A1 | P4 | S8 | B2 |
11 | 23 | A2 | P5 | S9 | B1 |
12 | 19 | A2 | P5 | S9 | B2 |
12 | 23 | A2 | P5 | S10 | B1 |
10 | 21 | A2 | P5 | S10 | B2 |
3 | 17 | A3 | P6 | S11 | B1 |
11 | 16 | A3 | P6 | S11 | B2 |
13 | 26 | A3 | P6 | S12 | B1 |
7 | 28 | A3 | P6 | S12 | B2 |
The `data.d` data comprise a single response (`Resp1`) collected on different days. The three different components of the dates have been recorded in three separate columns corresponding to the `year`, `month` and `day`.
year | month | day | Resp1 |
---|---|---|---|
2009 | 11 | 09 | 25 |
2008 | 05 | 08 | 52 |
2009 | 05 | 23 | 18 |
2009 | 04 | 15 | 21 |
2008 | 10 | 25 | 45 |
2008 | 09 | 26 | 40 |
2008 | 07 | 05 | 38 |
2008 | 11 | 02 | 9 |
2009 | 08 | 19 | 0 |
2008 | 10 | 03 | 20 |
The `data.bio` dataset represents a design that is like a cut-down version of `data` in which there are no `Within` `Subplot` levels and thus there are also fewer rows.
Resp1 | Resp2 | Between | Plot | Subplot |
---|---|---|---|---|
8 | 18 | A1 | P1 | S1 |
10 | 21 | A1 | P1 | S2 |
11 | 23 | A1 | P2 | S4 |
14 | 22 | A2 | P3 | S5 |
12 | 24 | A2 | P3 | S6 |
11 | 23 | A2 | P4 | S7 |
9 | 20 | A2 | P4 | S8 |
14 | 11 | A3 | P5 | S9 |
11 | 22 | A3 | P5 | S10 |
8 | 24 | A3 | P6 | S11 |
2 | 16 | A3 | P6 | S12 |
The `data.chem` dataset represents the same design as the `data.bio` data except that rather than containing two biological responses, it has two chemical measurements (`Chem1` and `Chem2`) collected from the same sampling units as the biological responses.
Chem1 | Chem2 | Between | Plot | Subplot |
---|---|---|---|---|
1.452878 | 0.8858208 | A1 | P1 | S1 |
3.266253 | 0.1800177 | A1 | P1 | S2 |
1.178652 | 5.0780682 | A1 | P2 | S3 |
13.400350 | 1.5762780 | A1 | P2 | S4 |
3.779183 | 1.6222430 | A2 | P3 | S5 |
1.196657 | 4.2369184 | A2 | P3 | S6 |
5.687807 | 2.9859003 | A2 | P4 | S8 |
4.834518 | 4.1328919 | A3 | P5 | S9 |
2.002931 | 3.6043314 | A3 | P5 | S10 |
12.326867 | 1.7763576 | A3 | P6 | S11 |
4.014221 | 0.2255188 | A3 | P6 | S12 |
The `data.geo` dataset represents geographical information about the `Plot`s in the `data.bio` dataset.
Plot | LAT | LONG |
---|---|---|
P1 | 17.9605 | 145.4326 |
P2 | 17.5210 | 146.1983 |
P3 | 17.0011 | 146.3839 |
P4 | 18.2350 | 146.7934 |
P5 | 18.9840 | 146.0345 |
P6 | 20.1154 | 146.4672 |
The `tikus` dataset is a data.frame version of the data with the same name that is provided as a list within the `mvabund` package. The data contain the abundance of 75 coral species from different locations (`rep`) over six years (`time`) at Tikus Island. The abundance was measured as the length (in cm) of a 10 m transect that intersected with the coral species.
Note, these data actually have 60 rows and 77 columns. In order to avoid displaying a huge table in this quick view, I have trimmed the view to just the first 10 rows and six columns.
 | time | rep | Psammocora contigua | Psammocora digitata | Pocillopora damicornis | Pocillopora verrucosa |
---|---|---|---|---|---|---|
V1 | 81 | 1 | 0 | 0 | 79 | 32 |
V2 | 81 | 2 | 0 | 0 | 51 | 21 |
V3 | 81 | 3 | 0 | 0 | 42 | 35 |
V4 | 81 | 4 | 0 | 0 | 15 | 0 |
V5 | 81 | 5 | 0 | 0 | 9 | 0 |
V6 | 81 | 6 | 0 | 0 | 72 | 0 |
V7 | 81 | 7 | 0 | 0 | 0 | 41 |
V8 | 81 | 8 | 0 | 0 | 16 | 25 |
V9 | 81 | 9 | 0 | 0 | 0 | 38 |
V10 | 81 | 10 | 0 | 0 | 16 | 0 |
The great folks over at RStudio have produced an excellent set of cheatsheets on a range of topics. For this tutorial, the Data Transformation Cheat Sheet and the Data Tidying Cheat Sheet (mainly just the first page) are useful summaries.
- https://github.com/rstudio/cheatsheets/raw/main/data-import.pdf
- https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf
- https://github.com/rstudio/cheatsheets/raw/main/lubridate.pdf
- (`purrr`) https://github.com/rstudio/cheatsheets/raw/main/tidyeval.pdf
`tidyverse` - a data manipulation ecosystem within R

There are numerous packages and base R routines devoted to data manipulation. Notable packages include `data.table`, `plyr`, `dplyr` and `doBy`. Indeed, earlier versions of this tutorial featured examples of each of these packages. However, an entire family of packages from Hadley Wickham's group now stands out as a comprehensive, intuitive suite of tools for data manipulation and visualisation.
Importantly, all of these packages are designed to integrate together and complement one another with a consistent interface. To simplify installing an entire data ecosystem, the tidyverse package is available. Installing this package (via `install.packages('tidyverse')`) will install the following packages (bold entries are automatically loaded into the global environment when loading the `tidyverse` package):
- **`dplyr`** - for dataframe manipulation
- **`tidyr`** - for data tidying
- **`readr`** - for importing data
- **`ggplot2`** - for visualising data
- **`purrr`** - for functional programming
- **`tibble`** - for tibbles
- **`stringr`** - for string manipulation
- **`forcats`** - for factor (categorical) manipulation
- `lubridate` - for dates/times manipulation
- `hms` - for time manipulation
- `broom` - for tidying model outputs
- `DBI` - for database interaction
- `haven` - for importing SAS, SPSS and Stata files
- `httr` - for web APIs
- `jsonlite` - for working with JSON
- `rvest` - for web scraping
- `xml2` - for manipulating XML structures
- `modelr` - for model manipulation

As this tutorial will focus largely on the `tidyverse` ecosystem, we will now load (attach) this library (actually, eight libraries) to the global environment so that the functionality is available.
── Attaching core tidyverse packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ purrr 1.0.1 ✔ tidyr 1.3.0
── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Note that when loading the `tidyverse` package, we are given feedback about which other packages have been attached. Importantly, also notice that the output highlights a number of conflicts. In the above example, this indicates that the definition of the `filter` function is now taken from the `dplyr` package rather than the `stats` package. Essentially, both packages define a function called `filter`, and this is just indicating that in the current environment, the `filter` function from the `dplyr` package will be the one used. On a case-by-case basis, this can be overruled by prefixing the function name with the package namespace. For example, to ensure that the `stats` package version of `filter` is used, we can engage it as `stats::filter()`.
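A brief base-R sketch of why the namespace matters: `stats::filter()` performs linear filtering of a time series, a completely different operation from `dplyr::filter()`'s row subsetting, so a namespace-qualified call removes any ambiguity:

```r
# stats::filter() applies a linear filter to a series (here a two-point
# moving average); it does not subset rows like dplyr::filter() does.
x <- c(1, 2, 3, 4, 5)
smoothed <- stats::filter(x, filter = c(0.5, 0.5), sides = 1)
smoothed  # first value is NA (no preceding value to average with)
```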
Hadley and his collaborators argue that there is a grammar of data manipulation and that manipulation comprises a set of verbs. Furthermore, the philosophy adheres to the UNIX ideal that rather than attempting to build large complex tools that perform a large set of operations, it is better (with regards to usability and robustness) to build a tool set in which each tool performs a single, specific task, and the individual tools can be bolted together to achieve complex operations.
The core of the `tidyverse` data manipulation ecosystem tools (verbs) can be organised into five categories within two types on the basis of what aspects of the data they operate on:

- `dplyr::arrange` - changing the order of the rows (sorting)
- `dplyr::filter` - subset of rows based on column values
- `dplyr::slice` - subset of rows based on position
- `dplyr::select` - subset of columns
- `dplyr::rename` - change the names of columns
- `dplyr::pull` - extract a single column as a vector
- `dplyr::distinct` - unique combinations of column values
- `dplyr::mutate` - adding columns and modifying column values
- `tidyr::unite` - combine multiple columns together
- `tidyr::separate` - separate a single column into multiple columns
- `dplyr::summarise` - aggregating (collapsing) to a single row
- `dplyr::count` - count the number of unique combinations
- `dplyr::group_by` - define groups of rows
- `tidyr::pivot_longer` - lengthen data from wide format
- `tidyr::pivot_wider` - widen data from long format
- `_join` - merge (join) two data sets together based on common field(s)

In base R, variables (columns) are referred to by either their name (prefixed by the environment - the data.frame etc.), their name as a string, or their index (position). For example, to access a variable called "Var1" within a dataset called "data", we could use either `data$Var1`, `data[, "Var1"]` or `data[, 1]` (if the focal variable was in the first column).
The `tidyverse` ecosystem adopts an alternative concept called tidy evaluation to make referring to columns more flexible.

- data-masking - refer to variables as if they were directly attached to the current environment (e.g. `Var1` instead of `data$Var1`). This form of variable referral is used within: `arrange()`, `filter()`, `count()`, `mutate()`, `summarise()`, `group_by()`
- tidy selection - refer to variables by their position, name or type (e.g. `starts_with("var")`). This form of variable referral is used within: `select()`, `rename()`, `pull()`
- `across()` - which brings tidy-selection semantics to data-masking (and thus functions like `mutate()`)
The following table highlights the various tidy-selection options. The examples all illustrate a hypothetical data set (`data`) with hypothetical columns (`Var1`, `Var2`, etc.) and cannot be evaluated directly. They all mostly illustrate the concept using the `select()` function.
tidy-selection | Description | Examples |
---|---|---|
Bare names | Select columns based on their names. | |
Ranges of columns | Select columns based on a range of names or column numbers. The selections will be inclusive. Prepending with a negative sign (`-`) excludes the selection. | |
Boolean helper functions: `contains()`, `starts_with()`, `ends_with()`, `matches()`, `num_range()`, `everything()`, `where()`, `all_of()`, `any_of()` | Select columns based on evaluating functions on the column names. | |
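A hedged sketch of the two referral styles (assuming `dplyr` is installed; `dat`, `Var1` and `Var2` are made up for illustration):

```r
library(dplyr)

dat <- data.frame(Var1 = 1:3, Var2 = c(10, 20, 30))

# data-masking: refer to Var1 as if it were an object in the environment
filter(dat, Var1 > 1)

# tidy selection: refer to columns by a name pattern
select(dat, starts_with("Var"))
```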
Typically, data manipulation/preparations comprise multiple steps and stages in which a series of alterations, additions etc are performed sequentially such that the data are progressively molded into a form appropriate for analysis etc. That is, to achieve the desired result, we must bolt multiple tools (verbs) together.
Traditionally, this would have involved a separate expression for each step often resulting in the generation of numerous intermediate data sets. Furthermore, in an attempt to reduce the number of intermediate steps, functions are often nested within other functions such that alterations are made inline within another function.
For example, the following pseudo-code examples illustrate the traditional and nested approaches to achieving numerous data manipulation steps:
Collectively, these practices can yield code that is very difficult to read and interpret.
A long-honoured Unix coding principle is that each application should focus on performing one action and performing it well. Performing a sequence of actions therefore involves piping (via the Unix pipe character `|`) the output of one application to the input of another, and so on in a chain. The grammar of data wrangling also adopts this principle (each tool specialises in one action, and tools are piped together to achieve a sequence of actions).
The piping (glue) operator in the tidyverse is `%>%`. An object on the left-hand side of the `%>%` operator is passed as the first argument of the function on the right-hand side.
In pseudo code, the piping approach to the above manipulation would be:
If the code is set out as above (with each verb on a separate line), it gives an opportunity to provide a short comment to the right side of each line to provide useful documentation.
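A minimal runnable sketch of this layout, using base R's native `|>` pipe (R >= 4.1), which reads the same way as `%>%` here; the vector and steps are made up:

```r
# Each verb on its own line, with a comment documenting that step
result <- c(2.3, 5.1, 8.4) |>
  log() |>      # log-transform each value
  sum() |>      # collapse to a single total
  round(2)      # round to two decimal places
result
```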
As a motivating (if not a bit extreme) example, let's say we wanted to calculate the logSumExp function:
\[ \log\left(\sum^{n}_{i=1} e^{x_i}\right) \]
[1] 9.316408
[1] 9.316408
The piping approach could be:
[1] 9.316408
[1] 9.316408
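The calculation itself is a one-liner; a sketch using a made-up vector `x` (the tutorial's own vector, which produces 9.316408, is not reproduced here):

```r
x <- c(3.4, 7.4, 1.6)

nested <- log(sum(exp(x)))              # read inside-out
piped  <- x |> exp() |> sum() |> log()  # read left-to-right

nested == piped
```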
To reiterate, the following three are equivalent:
[1] 29.653639 1601.918872 5.101634 118.918637 7140.686681 252.361318
[7] 9.175114 4.863565 1756.825350 199.466617
[1] 29.653639 1601.918872 5.101634 118.918637 7140.686681 252.361318
[7] 9.175114 4.863565 1756.825350 199.466617
[1] 29.653639 1601.918872 5.101634 118.918637 7140.686681 252.361318
[7] 9.175114 4.863565 1756.825350 199.466617
as are the following:
[1] 0.5301465 0.8679950 0.2120706 0.6792861 0.9480981 0.7427928 0.3456667
[8] 0.1991438 0.8733941 0.7239190
[1] 0.5301465 0.8679950 0.2120706 0.6792861 0.9480981 0.7427928 0.3456667
[8] 0.1991438 0.8733941 0.7239190
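To spell the equivalence out on a small made-up vector (assuming `magrittr`, which supplies the tidyverse's `%>%`):

```r
library(magrittr)

x <- c(1.5, 2.5)

a <- exp(x)       # ordinary function call
b <- x %>% exp()  # pipe into a call
d <- x %>% exp    # pipe onto a bare function name
```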
Most of the following examples will demonstrate isolated data
manipulation actions (such as filtering, summarising or joining) as this
focuses on the specific uses of these functions without the distractions
and complications of other actions. For isolated uses, piping has little
(if any) advantages. Nevertheless, in recognition that data
manipulations rarely comprise a single action (rather they are a series
of linked actions), for all remaining examples demonstrated in the
tidyverse
(dplyr
/tidyr
) context,
piping will be used.
`data.frame`s are collections of variables of identical length (yet not necessarily the same type) that are the fundamental data structures used by most modelling routines in R. Nevertheless, there are numerous properties of `data.frame`s that make them less than ideal. `tibble`s have been more recently engineered to address these shortcomings:
if a `data.frame` is very large, the `print` method can result in output that is too large to be useful. By contrast, `tibble`s truncate the output to a maximum of 10 rows and as many columns as will fit in the output. The `tibble` `print` method also displays the class (type) of each column (variable).
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
7 Control P1 M 3 61.46 98.44
8 Control P1 M 4 82.21 160.01
9 Control P1 L 1 31.73 21.22
10 Control P1 L 2 59.08 37.51
11 Control P1 L 3 94.54 119.22
12 Control P1 L 4 121.17 116.45
13 Control P2 H 1 8.14 23.93
14 Control P2 H 2 13.36 28.02
15 Control P2 H 3 33.37 37.17
16 Control P2 H 4 39.87 38.25
17 Control P2 M 1 19.95 19.73
18 Control P2 M 2 42.83 40.52
19 Control P2 M 3 62.46 4.81
20 Control P2 M 4 81.78 136.66
21 Control P2 L 1 32.76 30.70
22 Control P2 L 2 62.35 123.78
23 Control P2 L 3 90.22 113.87
24 Control P2 L 4 114.03 76.52
25 Exclusion P3 H 1 21.86 23.58
26 Exclusion P3 H 2 39.83 28.03
27 Exclusion P3 H 3 59.53 21.32
28 Exclusion P3 H 4 75.59 90.76
29 Exclusion P3 M 1 38.57 30.63
30 Exclusion P3 M 2 81.25 83.61
31 Exclusion P3 M 3 124.08 124.09
32 Exclusion P3 M 4 159.69 112.65
33 Exclusion P3 L 1 61.16 39.53
34 Exclusion P3 L 2 119.84 110.27
35 Exclusion P3 L 3 175.87 286.33
36 Exclusion P3 L 4 238.76 54.23
37 Exclusion P4 H 1 18.82 28.60
38 Exclusion P4 H 2 39.82 39.07
39 Exclusion P4 H 3 63.30 93.43
40 Exclusion P4 H 4 82.29 60.15
41 Exclusion P4 M 1 39.51 45.90
42 Exclusion P4 M 2 79.24 88.04
43 Exclusion P4 M 3 122.09 84.19
44 Exclusion P4 M 4 161.67 256.34
45 Exclusion P4 L 1 57.93 85.24
46 Exclusion P4 L 2 117.88 167.90
47 Exclusion P4 L 3 181.09 314.49
48 Exclusion P4 L 4 242.31 304.70
# A tibble: 48 × 6
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 H 3 27.5 29.8
4 Control P1 H 4 44.8 25.4
5 Control P1 M 1 21.0 20.3
6 Control P1 M 2 37.5 17.6
7 Control P1 M 3 61.5 98.4
8 Control P1 M 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
`data.frame`s have very strict column naming rules, and when these are not satisfied, the names will be altered in order to adhere to the rules. `tibble`s permit a wider range of names.
character vectors are often coerced into factors (categorical variables). This is not the case with `tibble`s.
when vectors are added that are of a different length to the others in the `data.frame`, the values of the shorter vector(s) are recycled until all vectors are the same length. This behaviour can be dangerous and is not permitted in `tibble`s (except for vectors of length one).
Var1 Var2 Var3
1 1 1 1
2 2 2 1
3 3 1 1
4 4 2 1
5 5 1 1
6 6 2 1
Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 6: Existing data.
• Size 2: Column `Var2`.
ℹ Only values of size one are recycled.
# A tibble: 6 × 2
Var1 Var2
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
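A sketch of the recycling difference (assuming the `tibble` package is installed; the columns are made up):

```r
library(tibble)

df <- data.frame(Var1 = 1:6, Var2 = 1:2)  # silently recycles Var2 three times
tb_bad <- try(tibble(Var1 = 1:6, Var2 = 1:2), silent = TRUE)  # refuses: error
tb_ok  <- tibble(Var1 = 1:6, Var2 = 1)    # only length-one values are recycled
```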
when sub-setting via column indices, `data.frame`s return a vector (when only a single index is provided) or a data.frame (if multiple indices are provided). This inconsistency is problematic in the context of a processing pipeline. `tibble`s will always return a `tibble` from column indices.
[1] P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P3
[26] P3 P3 P3 P3 P3 P3 P3 P3 P3 P3 P3 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4
Levels: P1 P2 P3 P4
`data.frame`s permit partial matching via `$` indexing. This can be problematic in the context of a processing pipeline. `tibble`s expressly forbid this.
`data.frame`s have very clumsy support for list columns (a list column is a column whose cells contain lists). List columns are better supported in `tibble`s. The ability to support list columns is integral to the functional programming routines of the `purrr` package.
In this fictitious example, I would like to store three different `data.frame`s (or any three objects for that matter) in the three cells of a variable (e.g. `Var2`). This is not possible in a `data.frame` - it will coerce the `Var2` column into multiple columns (one for each object) and thus will also recycle the length of each to ensure they are all the same length. Once the data have been coerced into a `data.frame`, the structure is lost (we can no longer access `Var2`). These issues are addressed in the `tibble`.
Many data manipulation actions involve the use of specific auxiliary functions to act on specific parts of the data set. For example, if we wish to summarise a data set, we might apply a `mean()` function to the numeric vectors (variables) and a `levels()` or `unique()` function to character or factor vectors. That is, the function is applied to each vector as a whole. Such functions are called summary functions.
On the other hand, if we wish to generate log-transformed versions of
the numeric vectors, then we would apply the log()
function to each value within those numeric vectors.
Furthermore, we might use other auxiliary functions that return vectors
of either boolean values (TRUE or FALSE) or integers that can represent
row or column indices to determine which observations to perform actions
on.
Broadly speaking, summary functions take a vector
and return a single value. Familiar examples of summary functions are
mean()
, var()
, min()
etc. By
contrast, vectorised functions take a vector and return
a vector of the same length as the original input vector.
rank()
, cumsum()
and log()
are
all examples of such vectorised (window) functions.
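A base-R sketch of the distinction (the vector is made up):

```r
x <- c(3.4, 7.4, 1.6, 4.8)

mean(x)    # summary function: length 4 in, length 1 out
cumsum(x)  # vectorised (window) function: length 4 in, length 4 out
rank(x)    # likewise returns one rank per input value
```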
The `dplyr` package introduces a number of additional summary and vectorised functions that are deemed useful augmentations of the standard set of functions available in base R. The following tables describe the most useful functions (from `base` and `dplyr`) along with the manipulation functions to which they can be applied. Those functions defined in `dplyr` are shown with the `dplyr` namespace.
There are also some operations that appear to provide a mixture of
both summary and vectorised functions. The scale()
function (which centres and scales a vector), for example,
returns a vector of scaled values as well as storing the original mean
and standard deviation (summaries) as attributes.
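For example:

```r
# scale() returns a (column) matrix of centred/scaled values, while the
# original mean and standard deviation are retained as attributes
x <- c(1, 2, 3)
s <- scale(x)
attr(s, "scaled:center")  # the mean of x
attr(s, "scaled:scale")   # the standard deviation of x
```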
To demonstrate summary and vectorised functions, the following vectors will be used.
[1] 3.389585 7.378957 1.629561 4.778440 8.873564 5.530862 2.216495 1.581772
[9] 7.471264 5.295647
[1] "a" "a" "b" "b" "a" "b" "a" "b" "a" "b"
[1] FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
Function | Description | Examples |
---|---|---|
`mean()`, `median()` | Arithmetic mean and median. Note, boolean vectors are stored as a set of 0 (FALSE) and 1 (TRUE) and thus mathematical functions operate on them as if they were numbers. | |
`sum()`, `prod()` | Sum and product. | |
`var()`, `sd()` | Variance and standard deviation. | |
`mad()`, `IQR()` | Median absolute deviation and inter-quartile range. | |
`min()`, `max()` | Minimum and maximum. | |
`quantile()` | Quantiles. | |
`dplyr::first()`, `dplyr::last()`, `dplyr::nth()` | First, last and nth value. | |
`dplyr::n()`, `dplyr::n_distinct()` | Number of values, number of distinct values. | |
Function | Description | Examples |
---|---|---|
`+`, `-`, `*`, `/`, `^`, `%/%`, `%%` | Typical arithmetic operators. | |
`log()`, `log2()`, `log10()`, `exp()` | Logarithms and exponentials. | |
`<`, `<=`, `>`, `>=`, `!=`, `==` | Logical operators. | |
`between()` | Whether a value is between two numbers. | |
`near()` | A safe way of assessing equality (`==`) in floating points. | |
`lag()`, `lead()` | Shift cases down/up by one. | |
`cummax()`, `cummin()`, `dplyr::cummean()` | Cumulative max, min and mean. | |
`cumsum()`, `cumprod()` | Cumulative sum and product. | |
`dplyr::cumall()`, `dplyr::cumany()` | Cumulative all and any (mainly for use with filtering). | |
`rank()`, `order()` | Rank and order of items. | |
`dplyr::min_rank()`, `dplyr::dense_rank()`, `dplyr::percent_rank()` | Rank in which ties = min, without gaps and expressed as a percentage. | |
`dplyr::row_number()` | Rank in which ties = first. | |
`dplyr::cume_dist()` | Cumulative empirical distribution (proportion less than current value). | |
`dplyr::ntile()` | Partition into (n) bins. | |
`dplyr::if_else()` | Elementwise (case by case) if and else. | |
`dplyr::case_when()` | Elementwise multiple if and else. | |
Sorting (`arrange`)

Most statistical analyses are invariant to the data order, and thus data reordering is typically only for aesthetics in tables and figures.
Sorting data has the potential to be one of the most dangerous forms of data manipulation - particularly in spreadsheets, in which there is no real binding between individual columns or rows. It is far too easy to accidentally sort the data in a single column (or row) without applying the ordering to all the other columns in the data set, thereby resulting in broken data.
Whilst the above apocalypse is still possible in R, the data structures and manipulation interfaces mean that you really have to try in order to break the data in this way. Furthermore, you are encouraged to store reordered data in a different object from the original data, and hence 'rolling back' is trivial.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P2 H 1 8.14 23.93
3 Control P2 H 2 13.36 28.02
4 Exclusion P4 H 1 18.82 28.60
5 Control P2 M 1 19.95 19.73
6 Control P1 H 2 20.55 25.94
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P4 L 4 242.31 304.70
2 Exclusion P3 L 4 238.76 54.23
3 Exclusion P4 L 3 181.09 314.49
4 Exclusion P3 L 3 175.87 286.33
5 Exclusion P4 M 4 161.67 256.34
6 Exclusion P3 M 4 159.69 112.65
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P2 H 1 8.14 23.93
3 Control P2 H 2 13.36 28.02
4 Exclusion P4 H 1 18.82 28.60
5 Control P1 H 2 20.55 25.94
6 Exclusion P3 H 1 21.86 23.58
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P2 H 1 8.14 23.93
3 Control P2 M 1 19.95 19.73
4 Control P1 M 1 20.99 20.31
5 Control P2 H 2 13.36 28.02
6 Exclusion P3 H 1 21.86 23.58
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P2 H 1 8.14 23.93
6 Control P2 H 2 13.36 28.02
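The sorting forms shown in the outputs above can be sketched on a small stand-in data frame (assuming `dplyr` is installed; `dat` and its values are made up, not the tutorial's `data.1`):

```r
library(dplyr)

dat <- data.frame(Trt = c("B", "A", "A"), Resp = c(5, 9, 2))

arrange(dat, Resp)        # ascending by Resp
arrange(dat, desc(Resp))  # descending by Resp
arrange(dat, Trt, Resp)   # by Trt, then by Resp within Trt
```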
We regularly want to run an analysis or generate tabular/graphic summaries for a sub-set of the data. For example, we may wish to reduce the number of columns in the data frame to remove excessive and unwanted data fields. Alternatively (or additionally), we may wish to focus on a subset of the rows. These two forms of sub-setting are:
- `select`ing - reducing the number of columns
- `filter`ing - reducing the number of rows

Selecting (`select`)

Selecting works by either including (or excluding) the column names that you indicate, or via special selection 'helper' functions that pass a vector of column indices to include in the subset data.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
The `select()` function uses tidy-select semantics.
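The selection forms demonstrated by the outputs below can be sketched on a small stand-in data frame (assuming `dplyr`; the column names mirror `data.1` but the values are made up):

```r
library(dplyr)

dat <- data.frame(Treatment = "Control", Plot = "P1", Dose = "H",
                  Time = 1, Resp1 = 8.12, Resp2 = 3.06)

select(dat, Treatment, Dose, Time, Resp1)  # include by name
select(dat, -Plot, -Resp2)                 # exclude by name
select(dat, Treatment:Time)                # inclusive range
select(dat, starts_with("R"))              # helper function
```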
Select specific variables
Treatment Dose Time Resp1
1 Control H 1 8.12
2 Control H 2 20.55
3 Control H 3 27.49
4 Control H 4 44.79
5 Control M 1 20.99
6 Control M 2 37.54
Exclude specific variables
Treatment Dose Time Resp1
1 Control H 1 8.12
2 Control H 2 20.55
3 Control H 3 27.49
4 Control H 4 44.79
5 Control M 1 20.99
6 Control M 2 37.54
Include columns Treatment
through to
Time
Treatment Plot Dose Time
1 Control P1 H 1
2 Control P1 H 2
3 Control P1 H 3
4 Control P1 H 4
5 Control P1 M 1
6 Control P1 M 2
Exclude columns Treatment
through to
Time
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Exclude columns Treatment
through to
Time
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Names containing an ‘r’ (case insensitive)
Treatment Resp1 Resp2
1 Control 8.12 3.06
2 Control 20.55 25.94
3 Control 27.49 29.85
4 Control 44.79 25.39
5 Control 20.99 20.31
6 Control 37.54 17.62
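The `contains()` helper likely produced this subset (Treatment, Resp1 and Resp2 all contain an "r"):

```r
dat.1 %>% select(contains("r")) %>% head()
```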
Names starting with ‘r’ (case insensitive)
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Names ending in ‘e’ (case insensitive)
Dose Time
1 H 1
2 H 2
3 H 3
4 H 4
5 M 1
6 M 2
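Similarly, names ending in a given string are matched with `ends_with()`:

```r
dat.1 %>% select(ends_with("e")) %>% head()
```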
Names that are exactly four characters long
Plot Dose Time
1 P1 H 1
2 P1 H 2
3 P1 H 3
4 P1 H 4
5 P1 M 1
6 P1 M 2
num_range
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
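The `num_range()` helper matches a common prefix followed by a numeric range:

```r
dat.1 %>% select(num_range("Resp", 1:2)) %>% head()
```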
Resp1 Resp2 Treatment Plot Dose Time
1 8.12 3.06 Control P1 H 1
2 20.55 25.94 Control P1 H 2
3 27.49 29.85 Control P1 H 3
4 44.79 25.39 Control P1 H 4
5 20.99 20.31 Control P1 M 1
6 37.54 17.62 Control P1 M 2
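One way to move the responses to the front while keeping all other columns is to combine a selection with `everything()` (a sketch; the original code box is not shown):

```r
dat.1 %>% select(starts_with("Resp"), everything()) %>% head()
```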
all_of
/ any_of
Select from a vector of strings
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
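Selecting from a character vector of names is done with `all_of()`:

```r
vars <- c("Resp1", "Resp2")
dat.1 %>% select(all_of(vars)) %>% head()
```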
Select from a vector of strings that may include names absent from the data (any_of)…
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Treatment Plot Dose Time
1 Control P1 H 1
2 Control P1 H 2
3 Control P1 H 3
4 Control P1 H 4
5 Control P1 M 1
6 Control P1 M 2
Since tibble
s permit column names that have special
characters in them, in order to refer to a column that has a name with
special characters, it is necessary to enclose the name in backticks.
For example, to select the variable, Pocillopora damicornis
(which contains a space in the name - these are not permitted in
data.frame
s, yet are permitted in tibble
s) and
print
the first six rows:
Pocillopora damicornis
V1 79
V2 51
V3 42
V4 15
V5 9
V6 72
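A sketch of such a selection (the name of the tibble holding this variable is not shown, so `dat.2` here is a placeholder):

```r
## dat.2 is a hypothetical tibble containing the variable
dat.2 %>% select(`Pocillopora damicornis`) %>% head()
```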
Renaming columns (rename)
Note, it is possible to have column names renamed during a select action.
Both the rename()
and select()
functions use tidy-select semantics.
Weight
1 8.12
2 20.55
3 27.49
4 44.79
5 20.99
6 37.54
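Renaming within a select takes the form `new_name = old_name`:

```r
dat.1 %>% select(Weight = Resp1) %>% head()
```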
If we want to retain the other variables, we would also have to
include them in the select
either explicitly, or via a
helper function.
Treatment Plot Dose Time Weight Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
However, note that this might not retain the order of the columns.
Hence, for simple renaming of columns, the rename
function is more convenient.
Treatment Plot Dose Time Weight Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
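The `rename()` call keeps all columns in their original order:

```r
dat.1 %>% rename(Weight = Resp1) %>% head()
```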
It is also possible to rename columns based on a lookup (list or data.frame). This is handy for cases when data columns have conveniently abbreviated names yet you are preparing data for tabular output - and thus need more descriptive column names.
lookup <- list("Response 1" = "Resp1",
"Response 2" = "Resp2")
dat.1 %>%
rename(!!!lookup) %>%
head()
Treatment Plot Dose Time Response 1 Response 2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
In the above example, the big bang operator
!!!
force-splices the list. That is, the elements of
the list are spliced into the statement as if they had been included
directly.
To do the same from a data.frame lookup:
lookup <- tribble(
~Abbr_name, ~Long_name,
"Resp1", "Response 1",
"Resp2", "Response 2")
## Convert to list of pairs
lookup <- lookup %>%
select(Long_name, Abbr_name) %>%
deframe() %>%
list()
dat.1 %>%
rename(!!!lookup) %>%
head()
Treatment Plot Dose Time Response 1 Response 2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
Extracting a column as a vector (pull)
As indicated earlier, single column indices on tibbles return a single column tibble. To extract a single column as a vector, there is the pull function.
The pull()
function uses tidy-select semantics.
[1] 8.12 20.55 27.49 44.79 20.99 37.54 61.46 82.21 31.73 59.08
[11] 94.54 121.17 8.14 13.36 33.37 39.87 19.95 42.83 62.46 81.78
[21] 32.76 62.35 90.22 114.03 21.86 39.83 59.53 75.59 38.57 81.25
[31] 124.08 159.69 61.16 119.84 175.87 238.76 18.82 39.82 63.30 82.29
[41] 39.51 79.24 122.09 161.67 57.93 117.88 181.09 242.31
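The vector above comes from pulling a single column:

```r
dat.1 %>% pull(Resp1)
```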
Subsetting rows (filter)
Filtering selects rows for which a condition evaluates to TRUE. Hence, any logical expression or vectorised function that returns Boolean values (TRUE or FALSE) can be used for filtering.
The filter()
function uses data-masking semantics.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Helper function | Combine multiple selections … |
---|---|
if_any() |
With an OR |
if_all() |
With an AND |
Note, by default, the above searches are NOT case sensitive
Logical/Boolean function | Description |
---|---|
== |
is equal to |
!= |
is not equal to |
< |
is less than |
> |
is greater than |
<= |
is less than or equal to |
>= |
is greater than or equal to |
is.na() |
is equal to NA |
!is.na() |
is not equal to NA |
%in% |
is in the following vector |
! |
not |
& && |
logical AND (& is elementwise; && expects single values) |
| || |
logical OR (| is elementwise; || expects single values) |
xor() |
elementwise exclusive OR |
isTRUE() |
is true |
isFALSE() |
is false |
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P2 H 1 8.14 23.93
6 Control P2 H 2 13.36 28.02
7 Control P2 H 3 33.37 37.17
8 Control P2 H 4 39.87 38.25
9 Exclusion P3 H 1 21.86 23.58
10 Exclusion P3 H 2 39.83 28.03
11 Exclusion P3 H 3 59.53 21.32
12 Exclusion P3 H 4 75.59 90.76
13 Exclusion P4 H 1 18.82 28.60
14 Exclusion P4 H 2 39.82 39.07
15 Exclusion P4 H 3 63.30 93.43
16 Exclusion P4 H 4 82.29 60.15
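The 16 high-dose rows above were most likely obtained with a simple equality condition:

```r
dat.1 %>% filter(Dose == "H")
```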
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
7 Control P1 M 3 61.46 98.44
8 Control P1 M 4 82.21 160.01
9 Control P2 H 1 8.14 23.93
10 Control P2 H 2 13.36 28.02
11 Control P2 H 3 33.37 37.17
12 Control P2 H 4 39.87 38.25
13 Control P2 M 1 19.95 19.73
14 Control P2 M 2 42.83 40.52
15 Control P2 M 3 62.46 4.81
16 Control P2 M 4 81.78 136.66
17 Exclusion P3 H 1 21.86 23.58
18 Exclusion P3 H 2 39.83 28.03
19 Exclusion P3 H 3 59.53 21.32
20 Exclusion P3 H 4 75.59 90.76
21 Exclusion P3 M 1 38.57 30.63
22 Exclusion P3 M 2 81.25 83.61
23 Exclusion P3 M 3 124.08 124.09
24 Exclusion P3 M 4 159.69 112.65
25 Exclusion P4 H 1 18.82 28.60
26 Exclusion P4 H 2 39.82 39.07
27 Exclusion P4 H 3 63.30 93.43
28 Exclusion P4 H 4 82.29 60.15
29 Exclusion P4 M 1 39.51 45.90
30 Exclusion P4 M 2 79.24 88.04
31 Exclusion P4 M 3 122.09 84.19
32 Exclusion P4 M 4 161.67 256.34
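Membership of a set of levels is tested with `%in%`:

```r
dat.1 %>% filter(Dose %in% c("H", "M"))
```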
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 M 1 20.99 20.31
4 Control P2 H 1 8.14 23.93
5 Control P2 H 2 13.36 28.02
6 Control P2 M 1 19.95 19.73
7 Exclusion P3 H 1 21.86 23.58
8 Exclusion P4 H 1 18.82 28.60
between
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 2 20.55 25.94
2 Control P1 M 1 20.99 20.31
3 Control P2 M 1 19.95 19.73
4 Exclusion P3 H 1 21.86 23.58
5 Exclusion P4 H 1 18.82 28.60
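A `between()` filter consistent with the output above (the exact bounds used are not shown; 15 and 25 reproduce these rows):

```r
dat.1 %>% filter(between(Resp1, 15, 25))
```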
Keep all cases after a value of Resp1 has exceeded 200
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P3 L 4 238.76 54.23
2 Exclusion P4 H 1 18.82 28.60
3 Exclusion P4 H 2 39.82 39.07
4 Exclusion P4 H 3 63.30 93.43
5 Exclusion P4 H 4 82.29 60.15
6 Exclusion P4 M 1 39.51 45.90
7 Exclusion P4 M 2 79.24 88.04
8 Exclusion P4 M 3 122.09 84.19
9 Exclusion P4 M 4 161.67 256.34
10 Exclusion P4 L 1 57.93 85.24
11 Exclusion P4 L 2 117.88 167.90
12 Exclusion P4 L 3 181.09 314.49
13 Exclusion P4 L 4 242.31 304.70
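Keeping every case from the first exceedance onwards can be done with `cumany()`:

```r
dat.1 %>% filter(cumany(Resp1 > 200))
```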
Keep all cases until the first case of Resp1 < 25
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
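The complementary `cumall()` keeps cases only while the condition has held for every row so far:

```r
dat.1 %>% filter(cumall(Resp1 < 25))
```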
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P2 H 1 8.14 23.93
4 Control P2 H 2 13.36 28.02
5 Exclusion P3 H 1 21.86 23.58
6 Exclusion P4 H 1 18.82 28.60
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P2 H 1 8.14 23.93
4 Control P2 H 2 13.36 28.02
5 Exclusion P3 H 1 21.86 23.58
6 Exclusion P4 H 1 18.82 28.60
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P2 H 1 8.14 23.93
7 Control P2 H 2 13.36 28.02
8 Control P2 H 3 33.37 37.17
9 Control P2 H 4 39.87 38.25
10 Control P2 M 1 19.95 19.73
11 Exclusion P3 H 1 21.86 23.58
12 Exclusion P3 H 2 39.83 28.03
13 Exclusion P3 H 3 59.53 21.32
14 Exclusion P3 H 4 75.59 90.76
15 Exclusion P4 H 1 18.82 28.60
16 Exclusion P4 H 2 39.82 39.07
17 Exclusion P4 H 3 63.30 93.43
18 Exclusion P4 H 4 82.29 60.15
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 M 1 20.99 20.31
3 Control P2 H 1 8.14 23.93
4 Control P2 M 1 19.95 19.73
5 Exclusion P3 H 1 21.86 23.58
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 M 1 20.99 20.31
4 Control P1 M 2 37.54 17.62
5 Control P1 L 1 31.73 21.22
6 Control P2 H 1 8.14 23.93
7 Control P2 H 2 13.36 28.02
8 Control P2 M 1 19.95 19.73
9 Control P2 M 3 62.46 4.81
10 Exclusion P3 H 1 21.86 23.58
11 Exclusion P3 H 3 59.53 21.32
12 Exclusion P4 H 1 18.82 28.60
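An `if_any()` filter consistent with the output above (rows where either response falls below a threshold; 25 matches these rows):

```r
dat.1 %>% filter(if_any(starts_with("Resp"), ~ .x < 25))
```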
Subsetting rows by position (slice)
  Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
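The first four rows are extracted by index:

```r
dat.1 %>% slice(1:4)
```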
Random subsets of rows (sample_n)
In each of the examples, I will set the random seed to ensure we can all repeat the example exactly.
Random sample of 10 rows (without replacement)
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P3 M 3 124.08 124.09
2 Control P2 H 3 33.37 37.17
3 Control P2 H 2 13.36 28.02
4 Control P1 H 3 27.49 29.85
5 Exclusion P4 M 2 79.24 88.04
6 Exclusion P4 M 3 122.09 84.19
7 Exclusion P4 H 1 18.82 28.60
8 Exclusion P4 L 2 117.88 167.90
9 Exclusion P3 H 1 21.86 23.58
10 Exclusion P3 H 2 39.83 28.03
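A sketch of the sampling call (the actual seed used above is not shown, so the rows drawn will differ):

```r
set.seed(123)  # hypothetical seed
dat.1 %>% sample_n(10, replace = FALSE)
```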
Random sample of 25% of the rows (without replacement)
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P3 M 3 124.08 124.09
2 Control P2 H 3 33.37 37.17
3 Control P2 H 2 13.36 28.02
4 Control P1 H 3 27.49 29.85
5 Exclusion P4 M 2 79.24 88.04
6 Exclusion P4 M 3 122.09 84.19
7 Exclusion P4 H 1 18.82 28.60
8 Exclusion P4 L 2 117.88 167.90
9 Exclusion P3 H 1 21.86 23.58
10 Exclusion P3 H 2 39.83 28.03
11 Exclusion P3 H 3 59.53 21.32
12 Control P1 M 1 20.99 20.31
In R, categorical variables (factors) are actually stored as vectors of integers (1, 2, 3, …) along with an attribute that registers the names (and order) of the levels. We can see this, if we ask to see the structure of any categorical variable:
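For example, inspecting the structure of the Dose factor:

```r
str(dat.1$Dose)
```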
Factor w/ 3 levels "H","L","M": 1 1 1 1 3 3 3 3 2 2 ...
In the above, we see that the levels of the Dose
variable are “H”, “L”, “M” (by default, levels of an unordered factor
are alphabetical). The first four cases are all 1
, which is
mapped to “H”. The next four are 3
which maps to “M” and so
on.
Alternatively, we can explore the levels attribute of a
factor via the levels()
function:
[1] "H" "L" "M"
[1] "P1" "P2" "P3" "P4"
[1] "Control" "Exclusion"
Although subsets of rows (filter
, sample_n
etc) may appear to completely remove particular levels of a categorical
variable, these actions do not update factor attributes. Consequently,
many routines that operate on such factors (such as plots and
statistical models) will proceed as if all factor levels are present
(e.g. plots will contain gaps where space has been provisioned for all
levels even though there may be no data associated with all levels).
To illustrate this, we will filter the dat.1
data such
that it only includes cases for which Plot
equals “P1” and
then explore the levels attribute of some of the categorical
variables.
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
7 Control P1 M 3 61.46 98.44
8 Control P1 M 4 82.21 160.01
9 Control P1 L 1 31.73 21.22
10 Control P1 L 2 59.08 37.51
11 Control P1 L 3 94.54 119.22
12 Control P1 L 4 121.17 116.45
[1] "H" "L" "M"
[1] "P1" "P2" "P3" "P4"
[1] "Control" "Exclusion"
So although the data only contains Plot
values of “P1”
(and Treatment
values of “Control”), the levels are still
listed as “P1”, “P2”, “P3”, and “P4”.
To ensure that the attributes reflect the subset data, it is
necessary to use the droplevels()
function:
[1] "H" "L" "M"
[1] "P1"
[1] "Control"
Adding/modifying columns (mutate)
To add to (mutate) a data set, a vectorised function is applied either once or across one or more existing columns.
The mutate() function uses data-masking semantics and can use tidy-select semantics via across().
Whether transforming/scaling existing variables or generating new variables derived from existing variables it is advisable that existing variables be retained (unless the data are huge and storage is tight).
Treatment Plot Dose Time Resp1 Resp2 Sum
1 Control P1 H 1 8.12 3.06 11.18
2 Control P1 H 2 20.55 25.94 46.49
3 Control P1 H 3 27.49 29.85 57.34
4 Control P1 H 4 44.79 25.39 70.18
5 Control P1 M 1 20.99 20.31 41.30
6 Control P1 M 2 37.54 17.62 55.16
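The Sum column above is the elementwise sum of the two responses:

```r
dat.1 %>% mutate(Sum = Resp1 + Resp2) %>% head()
```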
Multiple additions/modifications can be defined in a single
mutate
call and the changes occur sequentially.
Consequently, subsequent changes can build upon early changes. To
illustrate this, we will create a new variable (cResp1
)
that is a centred version of the Resp1
variable (centring
is when the mean of a variable is subtracted from each value resulting
in a variable whose mean is exactly zero, yet all other properties are
unaffected.).
Treatment Plot Dose Time Resp1 Resp2 meanResp1 cResp1
1 Control P1 H 1 8.12 3.06 75.26604 -67.14604
2 Control P1 H 2 20.55 25.94 75.26604 -54.71604
3 Control P1 H 3 27.49 29.85 75.26604 -47.77604
4 Control P1 H 4 44.79 25.39 75.26604 -30.47604
5 Control P1 M 1 20.99 20.31 75.26604 -54.27604
6 Control P1 M 2 37.54 17.62 75.26604 -37.72604
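The sequential nature of mutate means the second expression can use the column created by the first:

```r
dat.1 %>%
  mutate(meanResp1 = mean(Resp1),          # summary value, recycled
         cResp1 = Resp1 - meanResp1) %>%   # uses the column just created
  head()
```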
The following table lists some of the more useful vectorised
functions to use with mutate
. Importantly, vectorised
functions return a vector of the same length as they receive since all
mutate
edits must preserve the number of rows.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Function | Description |
---|---|
+,-,*,/,^,%/%, %% |
typical arithmetic operators |
log() ,log2() ,log10() ,
exp() |
logarithms/exponentials |
<,<=,>,>=,!=,== |
logical operators |
between() |
whether a value lies between two numbers |
near() |
a safe way of assessing equality in floating points |
Treatment Plot Dose Time Resp1 Resp2 Sum Div
1 Control P1 H 1 8.12 3.06 11.18 2.6535948
2 Control P1 H 2 20.55 25.94 46.49 0.7922128
3 Control P1 H 3 27.49 29.85 57.34 0.9209380
4 Control P1 H 4 44.79 25.39 70.18 1.7640803
5 Control P1 M 1 20.99 20.31 41.30 1.0334810
6 Control P1 M 2 37.54 17.62 55.16 2.1305335
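Multiple arithmetic columns in one call; Div here is the ratio of the responses:

```r
dat.1 %>% mutate(Sum = Resp1 + Resp2, Div = Resp1 / Resp2) %>% head()
```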
Treatment Plot Dose Time Resp1 Resp2 logResp1 expResp2
1 Control P1 H 1 8.12 3.06 2.094330 2.132756e+01
2 Control P1 H 2 20.55 25.94 3.022861 1.843312e+11
3 Control P1 H 3 27.49 29.85 3.313822 9.197934e+12
4 Control P1 H 4 44.79 25.39 3.801985 1.063499e+11
5 Control P1 M 1 20.99 20.31 3.044046 6.614864e+08
6 Control P1 M 2 37.54 17.62 3.625407 4.490232e+07
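Logarithms and exponentials are applied elementwise in the same way:

```r
dat.1 %>% mutate(logResp1 = log(Resp1), expResp2 = exp(Resp2)) %>% head()
```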
Treatment Plot Dose Time Resp1 Resp2 largeResp1
1 Control P1 H 1 8.12 3.06 FALSE
2 Control P1 H 2 20.55 25.94 FALSE
3 Control P1 H 3 27.49 29.85 TRUE
4 Control P1 H 4 44.79 25.39 TRUE
5 Control P1 M 1 20.99 20.31 FALSE
6 Control P1 M 2 37.54 17.62 TRUE
Treatment Plot Dose Time Resp1 Resp2 A B
1 Control P1 H 1 8.12 3.06 FALSE TRUE
2 Control P1 H 2 20.55 25.94 FALSE FALSE
3 Control P1 H 3 27.49 29.85 FALSE FALSE
4 Control P1 H 4 44.79 25.39 FALSE FALSE
5 Control P1 M 1 20.99 20.31 FALSE FALSE
6 Control P1 M 2 37.54 17.62 FALSE FALSE
Treatment Plot Dose Time Resp1 Resp2 mediumResp1
1 Control P1 H 1 8.12 3.06 FALSE
2 Control P1 H 2 20.55 25.94 TRUE
3 Control P1 H 3 27.49 29.85 FALSE
4 Control P1 H 4 44.79 25.39 FALSE
5 Control P1 M 1 20.99 20.31 TRUE
6 Control P1 M 2 37.54 17.62 FALSE
Function | Description |
---|---|
lag() |
shift cases down one |
lead() |
shift cases up one |
Treatment Plot Dose Time Resp1 Resp2 lagResp1
1 Control P1 H 1 8.12 3.06 NA
2 Control P1 H 2 20.55 25.94 8.12
3 Control P1 H 3 27.49 29.85 20.55
4 Control P1 H 4 44.79 25.39 27.49
5 Control P1 M 1 20.99 20.31 44.79
6 Control P1 M 2 37.54 17.62 20.99
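The lagged column shifts each value down one row (the first becomes NA):

```r
dat.1 %>% mutate(lagResp1 = lag(Resp1)) %>% head()
```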
Treatment Plot Dose Time Resp1 Resp2 leadResp1
43 Exclusion P4 M 3 122.09 84.19 161.67
44 Exclusion P4 M 4 161.67 256.34 57.93
45 Exclusion P4 L 1 57.93 85.24 117.88
46 Exclusion P4 L 2 117.88 167.90 181.09
47 Exclusion P4 L 3 181.09 314.49 242.31
48 Exclusion P4 L 4 242.31 304.70 NA
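Conversely, `lead()` shifts values up one row (so the last case becomes NA, as shown in the tail above):

```r
dat.1 %>% mutate(leadResp1 = lead(Resp1)) %>% tail()
```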
Function | Description |
---|---|
cummax() |
cumulative maximum |
cummin() |
cumulative minimum |
cummean() |
cumulative mean |
cumprod() |
cumulative product |
cumsum() |
cumulative sum |
rank() |
rank of current case (ties averaged) |
min_rank() |
rank of current case (minimum rank for ties) |
dense_rank() |
rank of current case (minimum rank for ties, no gaps) |
percent_rank() |
min_rank of current case (scaled to [0,1]) |
cume_dist() |
cumulative empirical distribution (prop. less than current rank) |
row_number() |
rank of current case (first row for ties) |
ntile() |
bin into (n) buckets |
dat.1 %>% mutate(Cummin = cummin(Resp1),
Cummax = cummax(Resp1),
Cummean = cummean(Resp1)) %>%
head()
Treatment Plot Dose Time Resp1 Resp2 Cummin Cummax Cummean
1 Control P1 H 1 8.12 3.06 8.12 8.12 8.1200
2 Control P1 H 2 20.55 25.94 8.12 20.55 14.3350
3 Control P1 H 3 27.49 29.85 8.12 27.49 18.7200
4 Control P1 H 4 44.79 25.39 8.12 44.79 25.2375
5 Control P1 M 1 20.99 20.31 8.12 44.79 24.3880
6 Control P1 M 2 37.54 17.62 8.12 44.79 26.5800
Treatment Plot Dose Time Resp1 Resp2 Cumsum Cumprod
1 Control P1 H 1 8.12 3.06 8.12 8.120000e+00
2 Control P1 H 2 20.55 25.94 28.67 1.668660e+02
3 Control P1 H 3 27.49 29.85 56.16 4.587146e+03
4 Control P1 H 4 44.79 25.39 100.95 2.054583e+05
5 Control P1 M 1 20.99 20.31 121.94 4.312569e+06
6 Control P1 M 2 37.54 17.62 159.48 1.618939e+08
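The running sum and product follow the same pattern as the other cumulative functions:

```r
dat.1 %>% mutate(Cumsum = cumsum(Resp1), Cumprod = cumprod(Resp1)) %>% head()
```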
dat.1 %>% mutate(Rank = rank(Resp1),
minRank = min_rank(Resp1),
denseRank = dense_rank(Resp1),
percentRank = percent_rank(Resp1)) %>%
head()
Treatment Plot Dose Time Resp1 Resp2 Rank minRank denseRank percentRank
1 Control P1 H 1 8.12 3.06 1 1 1 0.0000000
2 Control P1 H 2 20.55 25.94 6 6 6 0.1063830
3 Control P1 H 3 27.49 29.85 9 9 9 0.1702128
4 Control P1 H 4 44.79 25.39 20 20 20 0.4042553
5 Control P1 M 1 20.99 20.31 7 7 7 0.1276596
6 Control P1 M 2 37.54 17.62 13 13 13 0.2553191
Treatment Plot Dose Time Resp1 Resp2 cume_dist(Resp1)
1 Control P1 H 1 8.12 3.06 0.02083333
2 Control P1 H 2 20.55 25.94 0.12500000
3 Control P1 H 3 27.49 29.85 0.18750000
4 Control P1 H 4 44.79 25.39 0.41666667
5 Control P1 M 1 20.99 20.31 0.14583333
6 Control P1 M 2 37.54 17.62 0.27083333
Treatment Plot Dose Time Resp1 Resp2 row_number(Resp1)
1 Control P1 H 1 8.12 3.06 1
2 Control P1 H 2 20.55 25.94 6
3 Control P1 H 3 27.49 29.85 9
4 Control P1 H 4 44.79 25.39 20
5 Control P1 M 1 20.99 20.31 7
6 Control P1 M 2 37.54 17.62 13
Function | Description |
---|---|
if_else() |
elementwise (case by case) if and else |
case_when() |
elementwise multiple if_else |
na_if() |
elementwise replace nominated value with NA |
pmax() |
elementwise maximum across multiple columns |
pmin() |
elementwise minimum across multiple columns |
Treatment Plot Dose Time Resp1 Resp2 Size
1 Control P1 H 1 8.12 3.06 Small
2 Control P1 H 2 20.55 25.94 Small
3 Control P1 H 3 27.49 29.85 Big
4 Control P1 H 4 44.79 25.39 Big
5 Control P1 M 1 20.99 20.31 Small
6 Control P1 M 2 37.54 17.62 Big
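A two-level classification via `if_else()`; the cutoff of 25 is inferred from the output above:

```r
dat.1 %>% mutate(Size = if_else(Resp1 < 25, "Small", "Big")) %>% head()
```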
dat.1 %>% mutate(Size = case_when(Resp1 < 15 ~ "Small",
Resp1 < 25 ~ "Medium",
Resp1 >= 25 ~ "Big")) %>%
head()
Treatment Plot Dose Time Resp1 Resp2 Size
1 Control P1 H 1 8.12 3.06 Small
2 Control P1 H 2 20.55 25.94 Medium
3 Control P1 H 3 27.49 29.85 Big
4 Control P1 H 4 44.79 25.39 Big
5 Control P1 M 1 20.99 20.31 Medium
6 Control P1 M 2 37.54 17.62 Big
Summary functions (those that return a single value) are also permissible - the value will be recycled for the total length of the input vector. A table of useful summary functions can be found in the Summarise section.
Treatment Plot Dose Time Resp1 Resp2 meanResp1
1 Control P1 H 1 8.12 3.06 75.26604
2 Control P1 H 2 20.55 25.94 75.26604
3 Control P1 H 3 27.49 29.85 75.26604
4 Control P1 H 4 44.79 25.39 75.26604
5 Control P1 M 1 20.99 20.31 75.26604
6 Control P1 M 2 37.54 17.62 75.26604
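Here the single mean is recycled down the entire column:

```r
dat.1 %>% mutate(meanResp1 = mean(Resp1)) %>% head()
```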
Another important summary function is the n()
function. This function returns the total number of rows. In
the following example, we will use it to create a column that just
provides a running row counter (e.g. a row index variable).
Treatment Plot Dose Time Resp1 Resp2 N
1 Control P1 H 1 8.12 3.06 1
2 Control P1 H 2 20.55 25.94 2
3 Control P1 H 3 27.49 29.85 3
4 Control P1 H 4 44.79 25.39 4
5 Control P1 M 1 20.99 20.31 5
6 Control P1 M 2 37.54 17.62 6
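A row index can be generated from `n()`:

```r
dat.1 %>% mutate(N = 1:n()) %>% head()
```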
Applying functions across multiple columns (across)
In the section on select, a set of selection helper functions was described to facilitate convenient ways to select columns based on properties of the column names etc. The across() function allows us to bring those same selection helper functions to mutate.
The across()
function has the following
form:
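A sketch of the signature, following the dplyr documentation:

```r
across(.cols = everything(), .fns = NULL, ..., .names = NULL)
```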
where:
.cols
- a tidy selection (e.g. selection helper
function).fns
- a function (or list of functions) to apply to
each selected column.names
- a glue
specification determining
the format of the new variable names. By default the glue will be either
{.col}
(when there is only a single function) or
{.col}.{fn}
(when a list of functions) Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 2.094330 1.118415
2 Control P1 H 2 3.022861 3.255786
3 Control P1 H 3 3.313822 3.396185
4 Control P1 H 4 3.801985 3.234355
5 Control P1 M 1 3.044046 3.011113
6 Control P1 M 2 3.625407 2.869035
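With the default naming, the transformed columns replace the originals in place:

```r
dat.1 %>% mutate(across(starts_with("Resp"), log)) %>% head()
```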
Treatment Plot Dose Time Resp1 Resp2 lResp1 lResp2
1 Control P1 H 1 8.12 3.06 2.094330 1.118415
2 Control P1 H 2 20.55 25.94 3.022861 3.255786
3 Control P1 H 3 27.49 29.85 3.313822 3.396185
4 Control P1 H 4 44.79 25.39 3.801985 3.234355
5 Control P1 M 1 20.99 20.31 3.044046 3.011113
6 Control P1 M 2 37.54 17.62 3.625407 2.869035
where selections with naming
  Treatment Plot Dose Time Resp1 Resp2 lTime lResp1 lResp2
1 Control P1 H 1 8.12 3.06 0.0000000 2.094330 1.118415
2 Control P1 H 2 20.55 25.94 0.6931472 3.022861 3.255786
3 Control P1 H 3 27.49 29.85 1.0986123 3.313822 3.396185
4 Control P1 H 4 44.79 25.39 1.3862944 3.801985 3.234355
5 Control P1 M 1 20.99 20.31 0.0000000 3.044046 3.011113
6 Control P1 M 2 37.54 17.62 0.6931472 3.625407 2.869035
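Selecting by column type with `where()`; since Time is numeric it is also log-transformed here:

```r
dat.1 %>%
  mutate(across(where(is.numeric),
                .fns = list(l = log),
                .names = "{.fn}{.col}")) %>%
  head()
```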
num_range selections with naming
  Treatment Plot Dose Time Resp1 Resp2 lResp1 lResp2
1 Control P1 H 1 8.12 3.06 2.094330 1.118415
2 Control P1 H 2 20.55 25.94 3.022861 3.255786
3 Control P1 H 3 27.49 29.85 3.313822 3.396185
4 Control P1 H 4 44.79 25.39 3.801985 3.234355
5 Control P1 M 1 20.99 20.31 3.044046 3.011113
6 Control P1 M 2 37.54 17.62 3.625407 2.869035
dat.1 %>% mutate(across(c(Resp1, Resp2),
.fns = list(l = log, s = sqrt),
.names = "{.fn}.{.col}")) %>%
head()
Treatment Plot Dose Time Resp1 Resp2 l.Resp1 s.Resp1 l.Resp2 s.Resp2
1 Control P1 H 1 8.12 3.06 2.094330 2.849561 1.118415 1.749286
2 Control P1 H 2 20.55 25.94 3.022861 4.533211 3.255786 5.093133
3 Control P1 H 3 27.49 29.85 3.313822 5.243091 3.396185 5.463515
4 Control P1 H 4 44.79 25.39 3.801985 6.692533 3.234355 5.038849
5 Control P1 M 1 20.99 20.31 3.044046 4.581484 3.011113 4.506662
6 Control P1 M 2 37.54 17.62 3.625407 6.126989 2.869035 4.197618
Centring the response variables. Note the use of the
purrr-style lambda formula for functions that
require additional arguments; when the function is applied, the focal
variable is referred to as .x.
dat.1 %>% mutate(across(c(Resp1, Resp2),
.fns = list(c = ~ scale(.x, scale = FALSE)),
.names = "{.fn}{.col}")) %>%
head()
Treatment Plot Dose Time Resp1 Resp2 cResp1 cResp2
1 Control P1 H 1 8.12 3.06 -67.14604 -78.64958
2 Control P1 H 2 20.55 25.94 -54.71604 -55.76958
3 Control P1 H 3 27.49 29.85 -47.77604 -51.85958
4 Control P1 H 4 44.79 25.39 -30.47604 -56.31958
5 Control P1 M 1 20.99 20.31 -54.27604 -61.39958
6 Control P1 M 2 37.54 17.62 -37.72604 -64.08958
Convert Time
(a numeric) into a factor
# A tibble: 48 × 6
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <fct> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 H 3 27.5 29.8
4 Control P1 H 4 44.8 25.4
5 Control P1 M 1 21.0 20.3
6 Control P1 M 2 37.5 17.6
7 Control P1 M 3 61.5 98.4
8 Control P1 M 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
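A likely recipe for the conversion (the as_tibble is assumed, to match the tibble-style output above):

```r
dat.1 %>% mutate(Time = factor(Time)) %>% as_tibble()
```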
Change the labels of the “H” and “M” levels of Dose
# A tibble: 48 × 6
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 High 1 8.12 3.06
2 Control P1 High 2 20.6 25.9
3 Control P1 High 3 27.5 29.8
4 Control P1 High 4 44.8 25.4
5 Control P1 Medium 1 21.0 20.3
6 Control P1 Medium 2 37.5 17.6
7 Control P1 Medium 3 61.5 98.4
8 Control P1 Medium 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
'data.frame': 48 obs. of 6 variables:
$ Treatment: Factor w/ 2 levels "Control","Exclusion": 1 1 1 1 1 1 1 1 1 1 ...
$ Plot : Factor w/ 4 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Dose : Factor w/ 3 levels "High","L","Medium": 1 1 1 1 3 3 3 3 2 2 ...
$ Time : int 1 2 3 4 1 2 3 4 1 2 ...
$ Resp1 : num 8.12 20.55 27.49 44.79 20.99 ...
$ Resp2 : num 3.06 25.94 29.85 25.39 20.31 ...
- attr(*, "out.attrs")=List of 2
..$ dim : Named int [1:3] 4 3 4
.. ..- attr(*, "names")= chr [1:3] "Time" "Dose" "Plot"
..$ dimnames:List of 3
.. ..$ Time: chr [1:4] "Time=1" "Time=2" "Time=3" "Time=4"
.. ..$ Dose: chr [1:3] "Dose=H" "Dose=M" "Dose=L"
.. ..$ Plot: chr [1:4] "Plot=P1" "Plot=P2" "Plot=P3" "Plot=P4"
Change the level order of the Dose
factor to something
more natural.
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H L M
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: L M H
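The before/after pair above reflects a releveling via forcats:

```r
library(forcats)
dat.1 %>% pull(Dose)  # original level order: H L M
dat.1 %>%
  mutate(Dose = fct_relevel(Dose, "L", "M", "H")) %>%
  pull(Dose)          # new level order: L M H
```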
Change the labels and level order of the Dose
factor to
something more natural.
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H L M
dat.1 %>% mutate(
Dose = fct_relevel(Dose, c("L", "M", "H")),
Dose = fct_recode(Dose, High = 'H', Medium = 'M')
) %>%
as_tibble %>%
pull(Dose)
[1] High High High High Medium Medium Medium Medium L L
[11] L L High High High High Medium Medium Medium Medium
[21] L L L L High High High High Medium Medium
[31] Medium Medium L L L L High High High High
[41] Medium Medium Medium Medium L L L L
Levels: L Medium High
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H L M
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H M L
Change the order of Dose
levels according to the median
Resp1
values
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H L M
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H M L
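A sketch of reordering levels by a summary of another variable (using the median, as the caption indicates):

```r
dat.1 %>%
  mutate(Dose = fct_reorder(Dose, Resp1, median)) %>%
  pull(Dose)
```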
Data are often collected and transcribed at finer temporal/spatial scales and with greater fidelity than is required for all analyses. Furthermore, there is often a desire to present information summarised to a coarser level. Therefore an important phase of data preparation/presentation is to summarise the data into the spatial/temporal scales appropriate for the desired graphical and statistical analyses.
Summarising involves applying one or more summarising functions to one or more variables such that the data are reduced to a single row.
The summarise()
function uses data-masking
semantics and can use tidy-select semantics via
across()
.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Function | Description |
---|---|
n() |
number of rows/values |
n_distinct() |
number of unique rows/values |
mean() , median() |
mean/median |
sum() , prod() |
sum/product |
var() , sd() |
variance/standard deviation |
mad() , IQR() |
median absolute deviation / interquartile range |
min() , max() |
minimum / maximum |
quantile() |
nth quantile |
first() , last() , nth() |
first / last / nth value |
Calculate the number of rows in the data as well as the number of
distinct (unique) values of Dose
.
N NDose
1 48 3
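The counts above come from pairing `n()` with `n_distinct()`:

```r
dat.1 %>% summarise(N = n(), NDose = n_distinct(Dose))
```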
Calculate the mean and median of Resp1.
Mean Median
1 75.26604 61.31
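The corresponding summarise call:

```r
dat.1 %>% summarise(Mean = mean(Resp1), Median = median(Resp1))
```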
Calculate the sum and product of Resp1.
Sum Prod
1 3612.77 1.261602e+84
Calculate the variance and median absolute deviation (MAD) of Resp1.
Var MAD
1 3200.305 42.5951
Calculate the minimum and maximum of Resp1.
Min Max
1 8.12 242.31
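The extremes above come from:

```r
dat.1 %>% summarise(Min = min(Resp1), Max = max(Resp1))
```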
Calculate the quantiles of Resp1.
Q.0% Q.25% Q.50% Q.75% Q.100%
1 8.1200 36.4975 61.3100 99.4125 242.3100
Extract the first, last and fifth values of Resp1.
First Last Fifth
1 8.12 242.31 20.99
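Positional extraction uses `first()`, `last()` and `nth()`:

```r
dat.1 %>% summarise(First = first(Resp1),
                    Last  = last(Resp1),
                    Fifth = nth(Resp1, 5))
```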
Across functions
dat.1 %>% summarise(across(c(Resp1, Resp2),
list(Mean = mean, Var = var),
.names = "{.fn}.{.col}"),
N = n())
Mean.Resp1 Var.Resp1 Mean.Resp2 Var.Resp2 N
1 75.26604 3200.305 81.70958 5886.444 48
dat.1 %>% summarise(across(where(is.numeric),
list(Mean = mean, Var = var),
.names = "{.fn}.{.col}"),
across(where(is.factor), n_distinct),
N = n())
Mean.Time Var.Time Mean.Resp1 Var.Resp1 Mean.Resp2 Var.Resp2 Treatment Plot
1 2.5 1.276596 75.26604 3200.305 81.70958 5886.444 2 4
Dose N
1 3 48
Var <- c("Resp1", "Resp2", "Resp3")
dat.1 %>% summarise(across(any_of(Var),
list(Mean = mean, Var = var)))
Resp1_Mean Resp1_Var Resp2_Mean Resp2_Var
1 75.26604 3200.305 81.70958 5886.444
The count()
function provides a convenient way
to count up the number of unique combinations of factors.
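For example, tallying the rows for each Treatment/Dose combination:

```r
dat.1 %>% count(Treatment, Dose)
```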
Base R has a family of apply functions that apply a function
(such as mean()
) to each row (MARGIN = 1) or column (MARGIN = 2)
of a matrix-like object.
Apply the mean()
function to each column
(MARGIN = 2
) of a dataset comprising only “Resp1” and
“Resp2”.
Resp1 Resp2
75.26604 81.70958
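The base R call producing the column means above:

```r
apply(dat.1[, c("Resp1", "Resp2")], MARGIN = 2, mean)
```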
The above apply functions involve a sort of split, apply,
combine process. This process became the inspiration for the
plyr
package (also from the Wickham lab), which built on
and super-charged the approach. The dplyr
package represents
the next iteration, targeting data frames and
tibbles only (since most data are in data frame format).
Central to the modern split/apply/combine process is the idea of
groups. Groups are the basis of splitting the data.
Functions applied to grouped data are applied to each group (subset)
separately before the results are combined back into a single data frame
(actually tibble). Hence grouped data are most powerful when combined
with the summarise()
or mutate()
families of
functions.
Groups are defined via the group_by()
function.
For example, to indicate that all subsequent manipulations should occur
separately for each level of the Dose
variable, we
would:
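Presumably along these lines:

```r
## declare Dose as the grouping variable; the data themselves are unchanged
dat.1 %>% group_by(Dose)
```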
# A tibble: 48 × 6
# Groups: Dose [3]
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 H 3 27.5 29.8
4 Control P1 H 4 44.8 25.4
5 Control P1 M 1 21.0 20.3
6 Control P1 M 2 37.5 17.6
7 Control P1 M 3 61.5 98.4
8 Control P1 M 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
Note, in the above, we have only defined the grouping structure, not
any subsequent manipulations. Consequently, the tibble remains
unchanged. Do note, however, that above the tabular output a
Groups
attribute is displayed as a visual
reminder.
It is in combination with other data manipulation routines that
group_by()
is most useful. For example, to calculate the
mean of “Resp1” within each Dose
:
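A sketch of the call that presumably produced the output below:

```r
## group by Dose, then collapse each group to its mean Resp1
dat.1 %>% group_by(Dose) %>%
    summarise(Mean = mean(Resp1))
```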
# A tibble: 3 × 2
Dose Mean
<fct> <dbl>
1 H 37.3
2 L 113.
3 M 76.0
In the above output, you may have noticed that the data (means) have been rounded. This is a behaviour of the print method for tibbles. When stored, the full precision of the data are retained.
Before providing a list of examples, it is worth using
group_by()
to highlight the difference between
mutate()
and summarise()
.
# A tibble: 6 × 7
# Groups: Dose [2]
Treatment Plot Dose Time Resp1 Resp2 MeanResp1
<fct> <fct> <fct> <int> <dbl> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06 37.3
2 Control P1 H 2 20.6 25.9 37.3
3 Control P1 H 3 27.5 29.8 37.3
4 Control P1 H 4 44.8 25.4 37.3
5 Control P1 M 1 21.0 20.3 76.0
6 Control P1 M 2 37.5 17.6 76.0
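A sketch of the grouped mutate() that presumably produced the six rows above:

```r
## append the per-Dose mean as a new column; row count is unchanged
dat.1 %>% group_by(Dose) %>%
    mutate(MeanResp1 = mean(Resp1)) %>%
    head(6)
```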
- mutate() maintains the number of rows. Notice how the mean for each group is repeated for each case.
- summarise() reduces the number of rows to one (per group).
- mutate() retains all the columns.
- summarise() only returns the columns specifically involved (either as the Groups or manipulated variables).

When multiple variables are used to define the groups, the subsequent manipulations are applied to each unique combination of the factors.
# A tibble: 6 × 3
# Groups: Treatment [2]
Treatment Dose Mean
<fct> <fct> <dbl>
1 Control H 24.5
2 Control L 75.7
3 Control M 51.2
4 Exclusion H 50.1
5 Exclusion L 149.
6 Exclusion M 101.
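A sketch of the call with two grouping variables:

```r
## one mean per Treatment x Dose combination
dat.1 %>% group_by(Treatment, Dose) %>%
    summarise(Mean = mean(Resp1))
```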
Importantly, once all the manipulations that need to be performed separately per group have been performed, it is good practice to remove the group attribute. Forgetting to do so will continue to cause all subsequent manipulations (and possibly analyses and graphics) to be inadvertently performed separately per group - such downstream issues can be difficult to diagnose.
To remove the group attribute, we use the
ungroup()
function.
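For example (a sketch; note the output below no longer reports a Groups attribute):

```r
## perform the grouped summary, then drop the grouping attribute
dat.1 %>% group_by(Treatment, Dose) %>%
    summarise(Mean = mean(Resp1)) %>%
    ungroup()
```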
# A tibble: 6 × 3
Treatment Dose Mean
<fct> <fct> <dbl>
1 Control H 24.5
2 Control L 75.7
3 Control M 51.2
4 Exclusion H 50.1
5 Exclusion L 149.
6 Exclusion M 101.
dat.1 %>% group_by(Treatment, Dose) %>%
summarise(meanResp1 = mean(Resp1),
varResp1 = var(Resp1),
N = n())
# A tibble: 6 × 5
# Groups: Treatment [2]
Treatment Dose meanResp1 varResp1 N
<fct> <fct> <dbl> <dbl> <int>
1 Control H 24.5 202. 8
2 Control L 75.7 1191. 8
3 Control M 51.2 611. 8
4 Exclusion H 50.1 562. 8
5 Exclusion L 149. 5192. 8
6 Exclusion M 101. 2378. 8
Grouped summaries (across)

dat.1 %>% group_by(Treatment, Dose) %>%
summarise(across(starts_with("Resp"),
list(Mean = mean, Var = var),
.names = "{.fn}{.col}"),
N = n())
# A tibble: 6 × 7
# Groups: Treatment [2]
Treatment Dose MeanResp1 VarResp1 MeanResp2 VarResp2 N
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 Control H 24.5 202. 26.5 118. 8
2 Control L 75.7 1191. 79.9 1948. 8
3 Control M 51.2 611. 62.3 3676. 8
4 Exclusion H 50.1 562. 48.1 886. 8
5 Exclusion L 149. 5192. 170. 13391. 8
6 Exclusion M 101. 2378. 103. 4786. 8
# A tibble: 8 × 6
# Groups: Treatment, Dose [3]
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 M 1 21.0 20.3
4 Control P2 H 1 8.14 23.9
5 Control P2 H 2 13.4 28.0
6 Control P2 M 1 20.0 19.7
7 Exclusion P3 H 1 21.9 23.6
8 Exclusion P4 H 1 18.8 28.6
Lengthening data (pivot_longer)

Whilst wide data formats are often more compact and typically easier to manage for data entry (particularly in the field), the data are not in the appropriate format for most analyses (traditional repeated measures and multivariate analyses are two exceptions). Most analyses require that each replicate is in its own row and thus it is necessary to rearrange or reshape (melt) the data from this wide format to the long (molten) format.
Whilst there are numerous routines in R for reshaping data, we will only explore those that are formally part of the tidyverse ecosystem.
The pivot_longer()
function (tidyr
package) is very useful for converting wide (repeated
measures-like) data into long format. The important parameters to specify
are:
where:
- data - the input dataframe or tibble
- cols - a tidy select specification of the columns to be lengthened into a single column
- names_to - a name (string) for a new column in which to store the old wide column names
- values_to - a name (string) for the new column containing the values that were previously in the old columns
- values_drop_na - whether or not to drop rows that only contain NA values in the new value column

To illustrate, we will use the data.w
dataframe.
Plot | Between | Time.0 | Time.1 | Time.2 | |
---|---|---|---|---|---|
R1 | P1 | A1 | 8 | 14 | 14 |
R2 | P2 | A1 | 10 | 12 | 11 |
R3 | P3 | A2 | 7 | 11 | 8 |
R4 | P4 | A2 | 11 | 9 | 2 |
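A sketch of the basic lengthening call that presumably produced the output below (the three Time columns are collapsed into a names column and a values column):

```r
## gather the three Time.* columns into long format
data.w %>% pivot_longer(cols = c(Time.0, Time.1, Time.2),
                        names_to = "Time",
                        values_to = "Count")
```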
# A tibble: 12 × 4
Plot Between Time Count
<fct> <fct> <chr> <int>
1 P1 A1 Time.0 8
2 P1 A1 Time.1 14
3 P1 A1 Time.2 14
4 P2 A1 Time.0 10
5 P2 A1 Time.1 12
6 P2 A1 Time.2 11
7 P3 A2 Time.0 7
8 P3 A2 Time.1 11
9 P3 A2 Time.2 8
10 P4 A2 Time.0 11
11 P4 A2 Time.1 9
12 P4 A2 Time.2 2
Selecting columns with a helper (starts_with)

data.w %>% pivot_longer(starts_with("Time"),
names_to = "Time",
names_prefix = "Time.",
values_to = "Count"
)
# A tibble: 12 × 4
Plot Between Time Count
<fct> <fct> <chr> <int>
1 P1 A1 0 8
2 P1 A1 1 14
3 P1 A1 2 14
4 P2 A1 0 10
5 P2 A1 1 12
6 P2 A1 2 11
7 P3 A2 0 7
8 P3 A2 1 11
9 P3 A2 2 8
10 P4 A2 0 11
11 P4 A2 1 9
12 P4 A2 2 2
Widening data (pivot_wider)

The opposite of making a data set longer is to make a data set wider. Whilst analytical and graphical routines might require data to be in long format, wide tabular summaries are typically more compact and familiar.
Widening is performed via the pivot_wider()
function, the most important parameters of which are:
where:
- data - the input dataframe or tibble
- id_cols - a tidy select specification of the columns that uniquely identify each case - these columns will not be widened
- names_from - a tidy select specification of the column(s) that contain the names to be used as new column names
- values_from - a tidy select specification of the column(s) that contain the values to be used as values in the new columns (i.e. the data to be widened)

To illustrate, we will use the data
dataframe.
Resp1 | Resp2 | Between | Plot | Subplot | Within |
---|---|---|---|---|---|
8 | 17 | A1 | P1 | S1 | B1 |
10 | 18 | A1 | P1 | S1 | B2 |
7 | 17 | A1 | P1 | S2 | B1 |
11 | 21 | A1 | P1 | S2 | B2 |
14 | 19 | A2 | P2 | S3 | B1 |
12 | 13 | A2 | P2 | S3 | B2 |
11 | 24 | A2 | P2 | S4 | B1 |
9 | 18 | A2 | P2 | S4 | B2 |
14 | 25 | A3 | P3 | S5 | B1 |
11 | 18 | A3 | P3 | S5 | B2 |
8 | 27 | A3 | P3 | S6 | B1 |
2 | 22 | A3 | P3 | S6 | B2 |
8 | 17 | A1 | P4 | S7 | B1 |
10 | 22 | A1 | P4 | S7 | B2 |
7 | 16 | A1 | P4 | S8 | B1 |
12 | 13 | A1 | P4 | S8 | B2 |
11 | 23 | A2 | P5 | S9 | B1 |
12 | 19 | A2 | P5 | S9 | B2 |
12 | 23 | A2 | P5 | S10 | B1 |
10 | 21 | A2 | P5 | S10 | B2 |
3 | 17 | A3 | P6 | S11 | B1 |
11 | 16 | A3 | P6 | S11 | B2 |
13 | 26 | A3 | P6 | S12 | B1 |
7 | 28 | A3 | P6 | S12 | B2 |
Note, these data are not as long as they could be. Purely “long” data
should have each observation in its own row. The data
dataframe has two observations (one for “Resp1” and one for “Resp2”) per
row.
Widen the “Resp1” variable by the levels of the Within
factor.
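A sketch of a call consistent with the output below (here id_cols is an assumption; without it, Resp2 would also be treated as an identifier and produce NAs):

```r
## spread Resp1 across the levels of Within (B1, B2)
data %>% pivot_wider(id_cols = c(Between, Plot, Subplot),
                     names_from = Within,
                     values_from = Resp1)
```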
# A tibble: 12 × 5
Between Plot Subplot B1 B2
<fct> <fct> <fct> <int> <int>
1 A1 P1 S1 8 10
2 A1 P1 S2 7 11
3 A2 P2 S3 14 12
4 A2 P2 S4 11 9
5 A3 P3 S5 14 11
6 A3 P3 S6 8 2
7 A1 P4 S7 8 10
8 A1 P4 S8 7 12
9 A2 P5 S9 11 12
10 A2 P5 S10 12 10
11 A3 P6 S11 3 11
12 A3 P6 S12 13 7
Widen the “Resp1” and “Resp2” variables by the levels of the
Within
factor.
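A sketch of a call consistent with the output below (supplying both responses to values_from yields the value_name column naming seen in the output):

```r
## spread both responses across the levels of Within
data %>% pivot_wider(id_cols = c(Between, Plot, Subplot),
                     names_from = Within,
                     values_from = c(Resp1, Resp2))
```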
# A tibble: 12 × 7
Between Plot Subplot Resp1_B1 Resp1_B2 Resp2_B1 Resp2_B2
<fct> <fct> <fct> <int> <int> <int> <int>
1 A1 P1 S1 8 10 17 18
2 A1 P1 S2 7 11 17 21
3 A2 P2 S3 14 12 19 13
4 A2 P2 S4 11 9 24 18
5 A3 P3 S5 14 11 25 18
6 A3 P3 S6 8 2 27 22
7 A1 P4 S7 8 10 17 22
8 A1 P4 S8 7 12 16 13
9 A2 P5 S9 11 12 23 19
10 A2 P5 S10 12 10 23 21
11 A3 P6 S11 3 11 17 16
12 A3 P6 S12 13 7 26 28
Alternatively, we could make the data longer before widening:
data %>%
pivot_longer(cols = starts_with("Resp")) %>%
pivot_wider(names_from = c(name, Within),
values_from = value)
# A tibble: 12 × 7
Between Plot Subplot Resp1_B1 Resp2_B1 Resp1_B2 Resp2_B2
<fct> <fct> <fct> <int> <int> <int> <int>
1 A1 P1 S1 8 17 10 18
2 A1 P1 S2 7 17 11 21
3 A2 P2 S3 14 19 12 13
4 A2 P2 S4 11 24 9 18
5 A3 P3 S5 14 25 11 18
6 A3 P3 S6 8 27 2 22
7 A1 P4 S7 8 17 10 22
8 A1 P4 S8 7 16 12 13
9 A2 P5 S9 11 23 12 19
10 A2 P5 S10 12 23 10 21
11 A3 P6 S11 3 17 11 16
12 A3 P6 S12 13 26 7 28
Uniting columns (unite)

If data are recorded with excessive fidelity, it may be useful to combine multiple fields into a single field. For example, if the date was recorded across three fields (year, month and day), we may like to combine these to form a single date field.
year | month | day | Resp1 |
---|---|---|---|
2009 | 11 | 09 | 25 |
2008 | 05 | 08 | 52 |
2009 | 05 | 23 | 18 |
2009 | 04 | 15 | 21 |
2008 | 10 | 25 | 45 |
2008 | 09 | 26 | 40 |
2008 | 07 | 05 | 38 |
2008 | 11 | 02 | 9 |
2009 | 08 | 19 | 0 |
2008 | 10 | 03 | 20 |
Let's take those data and combine (unite) the separate date components
into a single date column before coercing it into a date
object.
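A sketch of the call (the name data.d is an assumption for the dataframe shown above, which is not named in the text):

```r
## "data.d" is an assumed name for the dataframe shown above
## unite year/month/day into one string column, then coerce it to Date
data.d %>%
    unite(Date, year, month, day, sep = "-") %>%
    mutate(Date = as.Date(Date))
```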
Date Resp1
1 2009-11-09 25
2 2008-05-08 52
3 2009-05-23 18
4 2009-04-15 21
5 2008-10-25 45
6 2008-09-26 40
7 2008-07-05 38
8 2008-11-02 9
9 2009-08-19 0
10 2008-10-03 20
Separating columns (separate)

Separating variables is the opposite of uniting them. A field is separated by either indicating the character(s) to use as a separator, or else providing a fixed-width format.
Year | Resp1 |
---|---|
F33 | 23 |
M27 | 55 |
M33 | 33 |
M26 | 36 |
F21 | 12 |
F34 | 49 |
M29 | 0 |
M34 | 18 |
F32 | 42 |
F26 | 0 |
For this example, we will split the Year
column into two
columns, one that specifies the gender and one that specifies the age.
The values are going to be separated after the first character.
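A sketch of the call (the name data.s is an assumption for the dataframe shown above; a numeric sep splits at that character position):

```r
## "data.s" is an assumed name for the dataframe shown above
## split Year after the first character: Gender (1 char) then Age
data.s %>% separate(Year, into = c("Gender", "Age"), sep = 1)
```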
Gender Age Resp1
1 F 33 23
2 M 27 55
3 M 33 33
4 M 26 36
5 F 21 12
6 F 34 49
7 M 29 0
8 M 34 18
9 F 32 42
10 F 26 0
Merging (_join) data sets

It is common to have data associated with a particular study organised into a number of separate data tables (databases etc.). In fact, large data sets are best managed in databases. However, statistical analyses generally require all data to be encapsulated within a single data structure. Therefore, prior to analysis, it is necessary to bring together multiple sources.
This phase of data preparation can be one of the most difficult to get right and verify that it is correct (particularly for large data sets).
Merging (or joining) involves creating a new data set that comprises information from two data sets. The resulting joined data set contains all fields from both data sets. The data sets are aligned according to fields they have in common, and matching records in these common fields determine which row from each input data set is combined.
There are numerous alternative ways of defining what should happen in the event that common records do not occur in both sets. For example, we could specify that only fully matching records be included in the final data set. Alternatively, we could specify that all records be included from both sets and thus the resulting data set will contain missing values. The following sections describe these various options.
To illustrate the above, we will use the data.bio
and
data.chem
datasets. The data.bio
dataset
mimics a typical ecological data set in which there are a number of
ecological responses (Resp1
and Resp2
) along
with the associated hierarchical design structure. This data set is
intentionally missing data for Subplot
S3
. The
data.chem
dataset has the same hierarchical design
structure, yet records the two chemical analytes (Chem1
and
Chem2
). This data set is intentionally missing data for
Subplot
S7
.
Resp1 | Resp2 | Between | Plot | Subplot |
---|---|---|---|---|
8 | 18 | A1 | P1 | S1 |
10 | 21 | A1 | P1 | S2 |
11 | 23 | A1 | P2 | S4 |
14 | 22 | A2 | P3 | S5 |
12 | 24 | A2 | P3 | S6 |
11 | 23 | A2 | P4 | S7 |
9 | 20 | A2 | P4 | S8 |
14 | 11 | A3 | P5 | S9 |
11 | 22 | A3 | P5 | S10 |
8 | 24 | A3 | P6 | S11 |
2 | 16 | A3 | P6 | S12 |
Chem1 | Chem2 | Between | Plot | Subplot |
---|---|---|---|---|
1.452878 | 0.8858208 | A1 | P1 | S1 |
3.266253 | 0.1800177 | A1 | P1 | S2 |
1.178652 | 5.0780682 | A1 | P2 | S3 |
13.400350 | 1.5762780 | A1 | P2 | S4 |
3.779183 | 1.6222430 | A2 | P3 | S5 |
1.196657 | 4.2369184 | A2 | P3 | S6 |
5.687807 | 2.9859003 | A2 | P4 | S8 |
4.834518 | 4.1328919 | A3 | P5 | S9 |
2.002931 | 3.6043314 | A3 | P5 | S10 |
12.326867 | 1.7763576 | A3 | P6 | S11 |
4.014221 | 0.2255188 | A3 | P6 | S12 |
Full join (full_join)

A full or outer join between data.bio
and
data.chem
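A sketch of the call (by default dplyr joins on all shared columns, here Between, Plot and Subplot):

```r
## full/outer join: keep all rows from both data sets (assumed call)
data.bio %>% full_join(data.chem)
```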
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 11 23 A2 P4 S7 NA NA
7 9 20 A2 P4 S8 5.687807 2.9859003
8 14 11 A3 P5 S9 4.834518 4.1328919
9 11 22 A3 P5 S10 2.002931 3.6043314
10 8 24 A3 P6 S11 12.326867 1.7763576
11 2 16 A3 P6 S12 4.014221 0.2255188
12 NA NA A1 P2 S3 1.178652 5.0780682
Note that although all Subplots are present, there are
missing values for Subplots S3
(last row) and
S7
.
Inner join (inner_join)

An inner join between data.bio
and
data.chem
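A sketch of the call (only rows present in both data sets are retained):

```r
## inner join: keep only fully matching rows (assumed call)
data.bio %>% inner_join(data.chem)
```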
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 9 20 A2 P4 S8 5.687807 2.9859003
7 14 11 A3 P5 S9 4.834518 4.1328919
8 11 22 A3 P5 S10 2.002931 3.6043314
9 8 24 A3 P6 S11 12.326867 1.7763576
10 2 16 A3 P6 S12 4.014221 0.2255188
Note that both Subplots S3
and
S7
are missing.
Left join (left_join)

A left join between data.bio
and
data.chem
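A sketch of the call (all rows of the left data set, data.bio, are retained):

```r
## left join: keep every row of data.bio (assumed call)
data.bio %>% left_join(data.chem)
```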
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 11 23 A2 P4 S7 NA NA
7 9 20 A2 P4 S8 5.687807 2.9859003
8 14 11 A3 P5 S9 4.834518 4.1328919
9 11 22 A3 P5 S10 2.002931 3.6043314
10 8 24 A3 P6 S11 12.326867 1.7763576
11 2 16 A3 P6 S12 4.014221 0.2255188
Note that Subplot
S3
is absent and
Subplot
S7
has only NA values for the chem
data.
Right join (right_join)

A right join between data.bio
and
data.chem
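A sketch of the call (all rows of the right data set, data.chem, are retained):

```r
## right join: keep every row of data.chem (assumed call)
data.bio %>% right_join(data.chem)
```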
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 9 20 A2 P4 S8 5.687807 2.9859003
7 14 11 A3 P5 S9 4.834518 4.1328919
8 11 22 A3 P5 S10 2.002931 3.6043314
9 8 24 A3 P6 S11 12.326867 1.7763576
10 2 16 A3 P6 S12 4.014221 0.2255188
11 NA NA A1 P2 S3 1.178652 5.0780682
Note that Subplot
S7
is absent and
Subplot
S3
has only NA values for the bio
data.
Anti join (anti_join)

An anti join between data.bio
and
data.chem
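A sketch of the call (rows of the left data set with no match in the right are returned):

```r
## anti join: rows of data.bio with no match in data.chem (assumed call)
data.bio %>% anti_join(data.chem)
```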
Resp1 Resp2 Between Plot Subplot
1 11 23 A2 P4 S7
Note that Subplot
S7
is returned, indicating
that it is present in the left dataset (data.bio), yet absent from the right (data.chem).
Semi join (semi_join)

A semi join between data.bio
and
data.chem
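A sketch of the call (rows of the left data set that do have a match in the right are returned, without any columns from the right):

```r
## semi join: matching rows of data.bio only (assumed call)
data.bio %>% semi_join(data.chem)
```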
Resp1 Resp2 Between Plot Subplot
1 8 18 A1 P1 S1
2 10 21 A1 P1 S2
3 11 23 A1 P2 S4
4 14 22 A2 P3 S5
5 12 24 A2 P3 S6
6 9 20 A2 P4 S8
7 14 11 A3 P5 S9
8 11 22 A3 P5 S10
9 8 24 A3 P6 S11
10 2 16 A3 P6 S12
Note that Subplots
S3
and S7
are absent and only the bio data are included.
Lookup tables provide a way of inserting a column of data into a
large data set such that the entries in the new column are determined by
a relational match within another data set (the lookup table). For
example, the main data set might contain data collected from a number of
sites (Plots). Elsewhere we may have a data set that just contains the
set of sites and their corresponding latitudes and longitudes
(geographical lookup table). We could incorporate these latitudes and
longitudes into the main data set by merging against the geographical
lookup table. In Excel, this is referred to as vlookup, in a relational
database (and in tidyverse
) it is referred to as a
join.
If we again consider our data.bio
data, but this time
also consider the data.geo
data. This latter dataset
contains the latitude and longitude of each of the plots.
Resp1 | Resp2 | Between | Plot | Subplot |
---|---|---|---|---|
8 | 18 | A1 | P1 | S1 |
10 | 21 | A1 | P1 | S2 |
11 | 23 | A1 | P2 | S4 |
14 | 22 | A2 | P3 | S5 |
12 | 24 | A2 | P3 | S6 |
11 | 23 | A2 | P4 | S7 |
9 | 20 | A2 | P4 | S8 |
14 | 11 | A3 | P5 | S9 |
11 | 22 | A3 | P5 | S10 |
8 | 24 | A3 | P6 | S11 |
2 | 16 | A3 | P6 | S12 |
Plot | LAT | LONG |
---|---|---|
P1 | 17.9605 | 145.4326 |
P2 | 17.5210 | 146.1983 |
P3 | 17.0011 | 146.3839 |
P4 | 18.2350 | 146.7934 |
P5 | 18.9840 | 146.0345 |
P6 | 20.1154 | 146.4672 |
Lookup merge (left_join)

   Resp1 Resp2 Between Plot Subplot     LAT     LONG
1 8 18 A1 P1 S1 17.9605 145.4326
2 10 21 A1 P1 S2 17.9605 145.4326
3 11 23 A1 P2 S4 17.5210 146.1983
4 14 22 A2 P3 S5 17.0011 146.3839
5 12 24 A2 P3 S6 17.0011 146.3839
6 11 23 A2 P4 S7 18.2350 146.7934
7 9 20 A2 P4 S8 18.2350 146.7934
8 14 11 A3 P5 S9 18.9840 146.0345
9 11 22 A3 P5 S10 18.9840 146.0345
10 8 24 A3 P6 S11 20.1154 146.4672
11 2 16 A3 P6 S12 20.1154 146.4672
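The merge above can presumably be reproduced with a left join against the lookup table (matched on the shared Plot column):

```r
## add LAT/LONG from the geographical lookup table (assumed call)
data.bio %>% left_join(data.geo)
```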