Data wrangling (tidyverse and friends)
I apologise in advance: this tutorial requires quite a bit of explaining and context before it can get into the code. Good data manipulation is an art form that requires the use of many specific tools (functions), and expert data manipulation comes from integrating these tools together. It is therefore necessary to have an overview of the tool set before investigating any single tool.
1 Manipulating data sets
Rarely are data transcribed and organised into such simple and ready-to-go data sets. More typically, data are spread across multiple sources and a variety of formats (particularly if compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses, it is necessary to compile very focused, tidy data sets.
Wickham (2014) suggested that there are many ways to organise data, yet tidy data (data that are structured in such a consistent way as to facilitate analyses) must adhere to a fairly strict set of structural rules. Specifically, in tidy data:
- variables are in columns
- observations are in rows - that is, for univariate data, there should be a separate row for each response observation.
To achieve tidy data, common data preparations include:
- Reshaping and rearranging data
- Merging multiple data sources
- Aggregating data to appropriate spatial/temporal scales
- Transforming data and deriving new variables
- Sorting and reordering data
- Relabelling data
This workshop will concentrate on these essential data preparation phases.
Practical data manipulation will be demonstrated via a series of very small artificial datasets. These datasets are presented in tables with black font and lines, and the R code required to generate those data is presented in static code boxes either underneath or adjacent to the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing the following command:
(if online)
(if offline and running this from a local version, having first cloned the ReefCloud Workshop github repository)
An overview of the example data sets follows.
These data comprise two responses (`Resp1` and `Resp2`) that are each observed on four occasions (`Time`) from each of three dosage treatments (`Dose`) nested within four plots (`Plot`). Two of the four plots received an exclusion treatment and the other two served as controls (no exclusion).
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Control | P1 | L | 3 | 94.54 | 119.22 |
Control | P1 | L | 4 | 121.17 | 116.45 |
Control | P2 | H | 1 | 8.14 | 23.93 |
Control | P2 | H | 2 | 13.36 | 28.02 |
Control | P2 | H | 3 | 33.37 | 37.17 |
Control | P2 | H | 4 | 39.87 | 38.25 |
Control | P2 | M | 1 | 19.95 | 19.73 |
Control | P2 | M | 2 | 42.83 | 40.52 |
Control | P2 | M | 3 | 62.46 | 4.81 |
Control | P2 | M | 4 | 81.78 | 136.66 |
Control | P2 | L | 1 | 32.76 | 30.70 |
Control | P2 | L | 2 | 62.35 | 123.78 |
Control | P2 | L | 3 | 90.22 | 113.87 |
Control | P2 | L | 4 | 114.03 | 76.52 |
Exclusion | P3 | H | 1 | 21.86 | 23.58 |
Exclusion | P3 | H | 2 | 39.83 | 28.03 |
Exclusion | P3 | H | 3 | 59.53 | 21.32 |
Exclusion | P3 | H | 4 | 75.59 | 90.76 |
Exclusion | P3 | M | 1 | 38.57 | 30.63 |
Exclusion | P3 | M | 2 | 81.25 | 83.61 |
Exclusion | P3 | M | 3 | 124.08 | 124.09 |
Exclusion | P3 | M | 4 | 159.69 | 112.65 |
Exclusion | P3 | L | 1 | 61.16 | 39.53 |
Exclusion | P3 | L | 2 | 119.84 | 110.27 |
Exclusion | P3 | L | 3 | 175.87 | 286.33 |
Exclusion | P3 | L | 4 | 238.76 | 54.23 |
Exclusion | P4 | H | 1 | 18.82 | 28.60 |
Exclusion | P4 | H | 2 | 39.82 | 39.07 |
Exclusion | P4 | H | 3 | 63.30 | 93.43 |
Exclusion | P4 | H | 4 | 82.29 | 60.15 |
Exclusion | P4 | M | 1 | 39.51 | 45.90 |
Exclusion | P4 | M | 2 | 79.24 | 88.04 |
Exclusion | P4 | M | 3 | 122.09 | 84.19 |
Exclusion | P4 | M | 4 | 161.67 | 256.34 |
Exclusion | P4 | L | 1 | 57.93 | 85.24 |
Exclusion | P4 | L | 2 | 117.88 | 167.90 |
Exclusion | P4 | L | 3 | 181.09 | 314.49 |
Exclusion | P4 | L | 4 | 242.31 | 304.70 |
Note: as this data set is 48 rows long, the majority of examples that feature it will only show the first 6 (or 10) rows to save space. This truncation will be performed via the `head()` function.
These data comprise a response observed on three occasions (`Time.0`, `Time.1` and `Time.2`) from each of two treatment levels (`A1` and `A2`; the `Between` factor, gridded or not) nested within four plots (`Plot`).
 | Plot | Between | Time.0 | Time.1 | Time.2 |
---|---|---|---|---|---|
R1 | P1 | A1 | 8 | 14 | 14 |
R2 | P2 | A1 | 10 | 12 | 11 |
R3 | P3 | A2 | 7 | 11 | 8 |
R4 | P4 | A2 | 11 | 9 | 2 |
Similar to `dat.1`, these data comprise two responses (`Resp1` and `Resp2`). However, these data have a different hierarchical structure. Each response was measured from one of two `Subplot`s nested within one of six `Plot`s. Within each `Subplot`, each of two treatment levels (`B1` and `B2`) was applied, and pairs of `Plot`s were exposed to one of three `Between` levels.
Resp1 | Resp2 | Between | Plot | Subplot | Within |
---|---|---|---|---|---|
8 | 17 | A1 | P1 | S1 | B1 |
10 | 18 | A1 | P1 | S1 | B2 |
7 | 17 | A1 | P1 | S2 | B1 |
11 | 21 | A1 | P1 | S2 | B2 |
14 | 19 | A2 | P2 | S3 | B1 |
12 | 13 | A2 | P2 | S3 | B2 |
11 | 24 | A2 | P2 | S4 | B1 |
9 | 18 | A2 | P2 | S4 | B2 |
14 | 25 | A3 | P3 | S5 | B1 |
11 | 18 | A3 | P3 | S5 | B2 |
8 | 27 | A3 | P3 | S6 | B1 |
2 | 22 | A3 | P3 | S6 | B2 |
8 | 17 | A1 | P4 | S7 | B1 |
10 | 22 | A1 | P4 | S7 | B2 |
7 | 16 | A1 | P4 | S8 | B1 |
12 | 13 | A1 | P4 | S8 | B2 |
11 | 23 | A2 | P5 | S9 | B1 |
12 | 19 | A2 | P5 | S9 | B2 |
12 | 23 | A2 | P5 | S10 | B1 |
10 | 21 | A2 | P5 | S10 | B2 |
3 | 17 | A3 | P6 | S11 | B1 |
11 | 16 | A3 | P6 | S11 | B2 |
13 | 26 | A3 | P6 | S12 | B1 |
7 | 28 | A3 | P6 | S12 | B2 |
The `data.d` data comprise a single response (`Resp1`) collected on different days. The three components of the dates have been recorded in three separate columns: `year`, `month` and `day`.
year | month | day | Resp1 |
---|---|---|---|
2009 | 11 | 09 | 25 |
2008 | 05 | 08 | 52 |
2009 | 05 | 23 | 18 |
2009 | 04 | 15 | 21 |
2008 | 10 | 25 | 45 |
2008 | 09 | 26 | 40 |
2008 | 07 | 05 | 38 |
2008 | 11 | 02 | 9 |
2009 | 08 | 19 | 0 |
2008 | 10 | 03 | 20 |
The `data.bio` dataset represents a design that is like a cut-down version of `data` in which there are no `Within` `Subplot` levels, and thus there are also fewer rows.
Resp1 | Resp2 | Between | Plot | Subplot |
---|---|---|---|---|
8 | 18 | A1 | P1 | S1 |
10 | 21 | A1 | P1 | S2 |
11 | 23 | A1 | P2 | S4 |
14 | 22 | A2 | P3 | S5 |
12 | 24 | A2 | P3 | S6 |
11 | 23 | A2 | P4 | S7 |
9 | 20 | A2 | P4 | S8 |
14 | 11 | A3 | P5 | S9 |
11 | 22 | A3 | P5 | S10 |
8 | 24 | A3 | P6 | S11 |
2 | 16 | A3 | P6 | S12 |
The `data.chem` dataset represents the same design as the `data.bio` data except that, rather than containing two biological responses, it has two chemical measurements (`Chem1` and `Chem2`) collected from the same sampling units as the biological responses.
Chem1 | Chem2 | Between | Plot | Subplot |
---|---|---|---|---|
1.452878 | 0.8858208 | A1 | P1 | S1 |
3.266253 | 0.1800177 | A1 | P1 | S2 |
1.178652 | 5.0780682 | A1 | P2 | S3 |
13.400350 | 1.5762780 | A1 | P2 | S4 |
3.779183 | 1.6222430 | A2 | P3 | S5 |
1.196657 | 4.2369184 | A2 | P3 | S6 |
5.687807 | 2.9859003 | A2 | P4 | S8 |
4.834518 | 4.1328919 | A3 | P5 | S9 |
2.002931 | 3.6043314 | A3 | P5 | S10 |
12.326867 | 1.7763576 | A3 | P6 | S11 |
4.014221 | 0.2255188 | A3 | P6 | S12 |
The `data.geo` dataset represents geographical information about the `Plot`s in the `data.bio` dataset.
Plot | LAT | LONG |
---|---|---|
P1 | 17.9605 | 145.4326 |
P2 | 17.5210 | 146.1983 |
P3 | 17.0011 | 146.3839 |
P4 | 18.2350 | 146.7934 |
P5 | 18.9840 | 146.0345 |
P6 | 20.1154 | 146.4672 |
The `tikus` dataset is a data.frame version of the data with the same name that is provided as a list within the `mvabund` package. The data contain the abundance of 75 coral species from different locations (`rep`) over six years (`time`) at Tikus Island. Abundance was measured as the length (in cm) of a 10 m transect that intersected with each coral species.
Note: these data actually have 60 rows and 77 columns. In order to avoid displaying a huge table in this quick view, I have trimmed it to just the first 10 rows and six columns.
 | time | rep | Psammocora contigua | Psammocora digitata | Pocillopora damicornis | Pocillopora verrucosa |
---|---|---|---|---|---|---|
V1 | 81 | 1 | 0 | 0 | 79 | 32 |
V2 | 81 | 2 | 0 | 0 | 51 | 21 |
V3 | 81 | 3 | 0 | 0 | 42 | 35 |
V4 | 81 | 4 | 0 | 0 | 15 | 0 |
V5 | 81 | 5 | 0 | 0 | 9 | 0 |
V6 | 81 | 6 | 0 | 0 | 72 | 0 |
V7 | 81 | 7 | 0 | 0 | 0 | 41 |
V8 | 81 | 8 | 0 | 0 | 16 | 25 |
V9 | 81 | 9 | 0 | 0 | 0 | 38 |
V10 | 81 | 10 | 0 | 0 | 16 | 0 |
The great folks over at RStudio have produced an excellent set of cheatsheets on a range of topics. For this tutorial, the Data Transformation and Data Tidying cheatsheets (mainly just the first page) are useful summaries.
https://github.com/rstudio/cheatsheets/raw/main/data-import.pdf
https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf
https://github.com/rstudio/cheatsheets/raw/main/tidyr.pdf
https://github.com/rstudio/cheatsheets/raw/main/factors.pdf
https://github.com/rstudio/cheatsheets/raw/main/lubridate.pdf
https://github.com/rstudio/cheatsheets/raw/main/purrr.pdf
https://github.com/rstudio/cheatsheets/raw/main/regex.pdf
https://github.com/rstudio/cheatsheets/raw/main/strings.pdf
https://github.com/rstudio/cheatsheets/raw/main/tidyeval.pdf
2 tidyverse - a data manipulation ecosystem within R
There are numerous packages and base R routines devoted to data manipulation. Notable packages include `data.table`, `plyr`, `dplyr` and `doBy`. Indeed, earlier versions of this tutorial featured examples of each of these packages. However, an entire family of packages from Hadley Wickham's group now stands out as a comprehensive, intuitive suite of tools for data manipulation and visualisation.
Importantly, all of these packages are designed to integrate together and complement one another with a consistent interface. To simplify installing an entire data ecosystem, the tidyverse package is available. Installing this package (via `install.packages('tidyverse')`) will install the following packages (bold entries are automatically loaded into the global environment when loading the `tidyverse` package):

- **`dplyr`** - for dataframe manipulation
- **`tidyr`** - for data tidying
- **`readr`** - for importing data
- **`ggplot2`** - for visualising data
- **`purrr`** - for functional programming
- **`tibble`** - for tibbles
- **`stringr`** - for string manipulation
- **`forcats`** - for factor (categorical) manipulation
- **`lubridate`** - for dates/times manipulation
- `hms` - for time manipulation
- `broom` - for tidying model outputs
- `DBI` - for database interaction
- `haven` - for importing SAS, SPSS and Stata files
- `httr` - for web APIs
- `jsonlite` - for working with JSON
- `rvest` - for web scraping
- `xml2` - for manipulating XML structures
- `modelr` - for model manipulation
As this tutorial will focus largely on the `tidyverse` ecosystem, we will now load (attach) this library (actually, eight libraries) to the global environment so that the functionality is available.
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Note that when loading the `tidyverse` package, we are given feedback about which other packages have been attached. Importantly, also notice that the output highlights a number of conflicts. In the above example, this indicates that the definition of the `filter` function is now taken from the `dplyr` package rather than the `stats` package. Essentially, both packages define a function called `filter`, and this is just indicating that in the current environment, the `filter` function from the `dplyr` package will be the one used.

On a case-by-case basis, this can be overruled by prefixing the function name with the package namespace. For example, to ensure that the `stats` package version of `filter` is used, we can engage it as `stats::filter()`.
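As a quick sketch of why this matters: `stats::filter()` applies linear filtering to a sequence, which is completely different behaviour from `dplyr::filter()` (which subsets rows). The namespace prefix guarantees which one runs, regardless of what is attached:

```r
x <- 1:5
## stats::filter() performs linear filtering (here a 3-point moving average),
## even when dplyr is attached and masking the name "filter"
as.numeric(stats::filter(x, rep(1/3, 3)))
## [1] NA 2 3 4 NA
```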
2.1 The grammar of data manipulation
Hadley and his collaborators argue that there is a grammar of data manipulation and that manipulation comprises a set of verbs. Furthermore, the philosophy adheres to the UNIX ideal that, rather than attempting to build large complex tools that perform a large set of operations, it is better (with regard to usability and robustness) to build a tool set in which each tool performs a single, specific task, and the individual tools can be bolted together to achieve complex operations.
The core of the `tidyverse` data manipulation ecosystem tools (verbs) can be organised into five categories within two types, on the basis of what aspects of the data they operate on:

- operate on a single data set
  - operate on the rows
    - `dplyr::arrange` - changing the order of the rows (sorting)
    - `dplyr::filter` - subset of rows based on column values
    - `dplyr::slice` - subset of rows based on position
  - operate on the columns
    - `dplyr::select` - subset of columns
    - `dplyr::rename` - change the names of columns
    - `dplyr::pull` - extract a single column as a vector
    - `dplyr::distinct` - unique combinations of column values
    - `dplyr::mutate` - adding columns and modifying column values
    - `tidyr::unite` - combine multiple columns together
    - `tidyr::separate` - separate a single column into multiple columns
  - operate on groups of rows
    - `dplyr::summarise` - aggregating (collapsing) to a single row
    - `dplyr::count` - count the number of unique combinations
    - `dplyr::group_by` - define groups of rows
  - reshape (pivot) the data
    - `tidyr::pivot_longer` - lengthen data from wide format
    - `tidyr::pivot_wider` - widen data from long format
- operate on two data sets
  - `_join` - merge (join) two data sets together based on common field(s)
In base R, variables (columns) are referred to by either their name (prefixed by the environment - the data.frame etc.), their name (as a string), or their index (position). For example, to access a variable called "Var1" within a dataset called "data", we could use either `data$Var1`, `data[, "Var1"]` or `data[, 1]` (if the focal variable was in the first column).
The `tidyverse` ecosystem adopts an alternative concept called tidy evaluation to make referring to columns more flexible.

- data-masking - refer to variables as if they were directly attached to the current environment (e.g. `Var1` instead of `data$Var1`). This form of variable referral is used within:
  - `arrange()`
  - `filter()`
  - `count()`
  - `mutate()`
  - `summarise()`
  - `group_by()`
- tidy selection - refer to variables by their position, name or type (e.g. `starts_with("Var")`). This form of variable referral is used within:
  - `select()`
  - `rename()`
  - `pull()`
  - `across()` - which brings tidy selection semantics to data-masking functions (such as `mutate()`)
The following examples act on a hypothetical data set (`data`) with hypothetical columns (`Var1`, `Var2`, etc.) and cannot be evaluated directly. They mostly illustrate the concept using the `select()` function.
tidy-selection | Description | Examples |
---|---|---|
Bare names | Select columns based on their names. | `select(data, Var1, Var2)` |
Ranges of columns | Select columns based on a range of names or column numbers. The selections are inclusive. Prepending with a negative sign (`-`) excludes the range. | `select(data, Var1:Var3)`, `select(data, -(Var1:Var3))` |
Helper functions: `contains()`, `starts_with()`, `ends_with()`, `matches()`, `num_range()`, `everything()`, `where()`, `all_of()`, `any_of()` | Select columns based on evaluating functions on the column names. | `select(data, starts_with("Var"))` |
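The table above can be made concrete with a small runnable sketch (hypothetical columns, and assuming the `dplyr` package is installed):

```r
library(dplyr)
## a hypothetical data set with columns Var1..Var3 plus one other column
data <- data.frame(Var1 = 1, Var2 = 2, Var3 = 3, Other = 4)

data |> select(Var1, Var2)          # bare names
data |> select(Var1:Var3)           # inclusive range
data |> select(-(Var1:Var3))        # exclude a range
data |> select(starts_with("Var"))  # helper function on the names
```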
2.2 Piping
Typically, data manipulation/preparation comprises multiple steps and stages in which a series of alterations, additions etc. are performed sequentially, such that the data are progressively moulded into a form appropriate for analysis. That is, to achieve the desired result, we must bolt multiple tools (verbs) together.

Traditionally, this would have involved a separate expression for each step, often resulting in the generation of numerous intermediate data sets. Furthermore, in an attempt to reduce the number of intermediate steps, functions are often nested within other functions, such that alterations are made inline within another function.

For example, the following pseudo code illustrates the traditional and nested approaches to achieving numerous data manipulation steps:
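The original code boxes are not reproduced here, but the two styles can be sketched with base R functions (hypothetical data and steps):

```r
data <- data.frame(x = c(1, 4, 9, 16))

## traditional approach: one expression (and intermediate object) per step
data.tmp1 <- subset(data, x > 1)                 # filter the rows
data.tmp2 <- transform(data.tmp1, y = sqrt(x))   # derive a new variable
result1 <- mean(data.tmp2$y)                     # aggregate

## nested approach: each step wrapped inline inside the next
result2 <- mean(transform(subset(data, x > 1), y = sqrt(x))$y)
```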
Collectively, these practices can yield code that is very difficult to read and interpret.
A long-honoured unix coding principle is that each application should focus on performing one action and performing it well. Performing a sequence of actions therefore involves piping (via the unix pipe character `|`) the output of one application to the input of another, and so on in a chain. The grammar of data wrangling adopts the same principle: each tool specialises in one action, and tools are piped together to achieve a sequence of actions.
The piping (glue) operator in R (as of version 4.1) is `|>`. The object on the left-hand side of the `|>` operator is passed as the first argument of the function on the right-hand side.
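As a small illustration (any functions would do), the following two expressions are equivalent:

```r
x <- c(4, 9, 16)

sum(sqrt(x))          # conventional nested call
x |> sqrt() |> sum()  # piped: x is passed as the first argument of sqrt(),
                      # and that result as the first argument of sum()
## both return 9
```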
The magrittr (`%>%`) pipe

Prior to the native adoption of the pipe character in R, piping was supported via the `magrittr` package. Within this package, the pipe operator is `%>%`. Although the native and `magrittr` pipe operators are somewhat analogous, they are not homologous.
Syntax:

- Native pipe: simple `|>` symbol placed between the expression and function call.
- Magrittr pipe: double percentage signs (`%>%`), offering visual distinction from surrounding code.
Placeholder:

- Native pipe: no placeholder prior to R 4.2; from R 4.2 onwards, an underscore (`_`) placeholder can be used with a named argument. Otherwise, functions need to handle the piped-in value explicitly.
- Magrittr pipe: dot placeholder (`.`), allowing concise access to piped-in values within the function call.
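A sketch of the placeholder issue: when the piped value is not the first argument of the target function, the native pipe needs an anonymous function (or, from R 4.2, the `_` placeholder), whereas magrittr would simply use `.`:

```r
## pass 3 as the length.out argument (not the first argument) of seq()
3 |> (\(n) seq(1, 10, length.out = n))()
## [1]  1.0  5.5 10.0

## from R 4.2, the underscore placeholder achieves the same with a named argument:
##   3 |> seq(1, 10, length.out = _)
## with magrittr:
##   3 %>% seq(1, 10, length.out = .)
```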
Functionality:

- Native pipe: more basic functionality, primarily focused on function chaining.
- Magrittr pipe: offers additional operators, such as:
  - Exposition pipe (`%$%`): exposes the names within the left-hand side to the right-hand side expression.
  - Tee pipe (`%T>%`): passes the value to the right-hand side for its side effects (e.g. printing or plotting) yet returns the left-hand side, facilitating debugging mid-chain.
  - Assignment pipe (`%<>%`): pipes a value through a chain and assigns the result back to the original object.
Scope:

- Native pipe: recently introduced, integrated directly into base R.
- Magrittr pipe: a separate package, offering more extensibility and customisation options.

Adoption:

- Native pipe: gaining popularity due to its simplicity and integration with base R.
- Magrittr pipe: widely used, having established a large community and ecosystem of compatible packages.
Choosing between them depends on:

- Personal preference: some prefer the visual clarity of `%>%`, while others favour the simplicity of `|>`.
- Project requirements: consider the need for features like the dot placeholder or additional pipe types.
- Team collaboration: choose a consistent style for shared code or adhere to project standards.
Ultimately, both pipes are powerful tools for data manipulation in R. Understanding their differences helps you choose the most appropriate option for your specific needs and coding style.
In pseudo code, the piping approach to the above manipulation would be:
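In the absence of the original code box, the shape of such a pipeline can be sketched with base R functions (hypothetical data and steps; with tidyverse verbs the structure is identical):

```r
data <- data.frame(x = 1:6, g = rep(c("a", "b"), 3))

data |>
  subset(x > 1) |>                                    ## filter the rows
  transform(logx = log(x)) |>                         ## derive a new variable
  (\(d) aggregate(logx ~ g, data = d, FUN = mean))()  ## summarise per group
```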
If the code is set out as above (with each verb on a separate line), there is an opportunity to provide a short comment to the right of each line, which makes for useful documentation.
A more specific example
As a motivating (if not a bit extreme) example, let's say we wanted to calculate the logSumExp function:
\[ \log\left(\sum^{n}_{i=1} e^{x_i}\right) \]
[1] 9.316408
[1] 9.316408
The piping approach could be:
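Using the `x` vector generated later in this tutorial (`set.seed(123); x <- rgamma(10, 5, 1)`), the nested and piped forms look like this:

```r
set.seed(123)
x <- rgamma(10, 5, 1)

log(sum(exp(x)))  ## nested form
## [1] 9.316408

x |>
  exp() |>  ## exponentiate each value
  sum() |>  ## sum the results
  log()     ## take the log of the sum
## [1] 9.316408
```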
To reiterate, the following three are equivalent:
[1] 29.653639 1601.918872 5.101634 118.918637 7140.686681 252.361318
[7] 9.175114 4.863565 1756.825350 199.466617
[1] 29.653639 1601.918872 5.101634 118.918637 7140.686681 252.361318
[7] 9.175114 4.863565 1756.825350 199.466617
as are the following:
Most of the following examples will demonstrate isolated data manipulation actions (such as filtering, summarising or joining), as this focuses on the specific uses of these functions without the distractions and complications of other actions. For isolated uses, piping has little (if any) advantage. Nevertheless, in recognition that data manipulations rarely comprise a single action (rather, they are a series of linked actions), piping will be used for all remaining examples demonstrated in the `tidyverse` (`dplyr`/`tidyr`) context.
2.3 Tibbles
`data.frame`s are collections of variables of identical length (yet not necessarily the same type) that are the fundamental data structures used by most modelling routines in R. Nevertheless, there are numerous properties of data.frames that make them less than ideal, and `tibble`s have been more recently engineered to address these shortcomings:
- if a `data.frame` is very large, the `print` method can result in output that is too large to be useful. By contrast, `tibble`s truncate the output to a maximum of 10 rows and as many columns as will fit in the output. The `tibble` `print` method also displays the class (type) of each column (variable).
```
   Treatment Plot Dose Time  Resp1  Resp2
1    Control   P1    H    1   8.12   3.06
2    Control   P1    H    2  20.55  25.94
3    Control   P1    H    3  27.49  29.85
4    Control   P1    H    4  44.79  25.39
5    Control   P1    M    1  20.99  20.31
6    Control   P1    M    2  37.54  17.62
7    Control   P1    M    3  61.46  98.44
8    Control   P1    M    4  82.21 160.01
9    Control   P1    L    1  31.73  21.22
10   Control   P1    L    2  59.08  37.51
11   Control   P1    L    3  94.54 119.22
12   Control   P1    L    4 121.17 116.45
13   Control   P2    H    1   8.14  23.93
14   Control   P2    H    2  13.36  28.02
15   Control   P2    H    3  33.37  37.17
16   Control   P2    H    4  39.87  38.25
17   Control   P2    M    1  19.95  19.73
18   Control   P2    M    2  42.83  40.52
19   Control   P2    M    3  62.46   4.81
20   Control   P2    M    4  81.78 136.66
21   Control   P2    L    1  32.76  30.70
22   Control   P2    L    2  62.35 123.78
23   Control   P2    L    3  90.22 113.87
24   Control   P2    L    4 114.03  76.52
25 Exclusion   P3    H    1  21.86  23.58
26 Exclusion   P3    H    2  39.83  28.03
27 Exclusion   P3    H    3  59.53  21.32
28 Exclusion   P3    H    4  75.59  90.76
29 Exclusion   P3    M    1  38.57  30.63
30 Exclusion   P3    M    2  81.25  83.61
31 Exclusion   P3    M    3 124.08 124.09
32 Exclusion   P3    M    4 159.69 112.65
33 Exclusion   P3    L    1  61.16  39.53
34 Exclusion   P3    L    2 119.84 110.27
35 Exclusion   P3    L    3 175.87 286.33
36 Exclusion   P3    L    4 238.76  54.23
37 Exclusion   P4    H    1  18.82  28.60
38 Exclusion   P4    H    2  39.82  39.07
39 Exclusion   P4    H    3  63.30  93.43
40 Exclusion   P4    H    4  82.29  60.15
41 Exclusion   P4    M    1  39.51  45.90
42 Exclusion   P4    M    2  79.24  88.04
43 Exclusion   P4    M    3 122.09  84.19
44 Exclusion   P4    M    4 161.67 256.34
45 Exclusion   P4    L    1  57.93  85.24
46 Exclusion   P4    L    2 117.88 167.90
47 Exclusion   P4    L    3 181.09 314.49
48 Exclusion   P4    L    4 242.31 304.70
```
```
# A tibble: 48 × 6
   Treatment Plot  Dose   Time Resp1 Resp2
   <fct>     <fct> <fct> <int> <dbl> <dbl>
 1 Control   P1    H         1  8.12  3.06
 2 Control   P1    H         2 20.6  25.9
 3 Control   P1    H         3 27.5  29.8
 4 Control   P1    H         4 44.8  25.4
 5 Control   P1    M         1 21.0  20.3
 6 Control   P1    M         2 37.5  17.6
 7 Control   P1    M         3 61.5  98.4
 8 Control   P1    M         4 82.2 160.
 9 Control   P1    L         1 31.7  21.2
10 Control   P1    L         2 59.1  37.5
# ℹ 38 more rows
```
- `data.frame`s have very strict column naming rules and, when these are not satisfied, the names will be altered in order to adhere to the rules. `tibble`s permit a wider range of names.
- character vectors are often coerced into factors (categorical variables). This is not the case with `tibble`s.
- when vectors are added that are of a different length to the others in the `data.frame`, the values of the shorter vector(s) are recycled until all vectors are the same length. This behaviour can be dangerous and is not permitted in tibbles (except for vectors of length one).
```
Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 6: Existing data.
• Size 2: Column `Var2`.
ℹ Only values of size one are recycled.
```
```
# A tibble: 6 × 2
   Var1  Var2
  <int> <dbl>
1     1     1
2     2     1
3     3     1
4     4     1
5     5     1
6     6     1
```
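A sketch of the recycling contrast, using base R only:

```r
## a data.frame silently recycles the shorter vector until the lengths match
d <- data.frame(Var1 = 1:6, Var2 = 1:2)
nrow(d)  # 6
d$Var2   # 1 2 1 2 1 2

## tibble(Var1 = 1:6, Var2 = 1:2) would instead raise an error,
## since tibbles only recycle vectors of length one
```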
- when sub-setting via column indices, `data.frame`s return a vector (when only a single index is provided) or a data.frame (if multiple indices are provided). This inconsistency is problematic in the context of a processing pipeline. `tibble`s will always return a `tibble` from column indices.
- `data.frame`s permit partial matching via `$` indexing. This can be problematic in the context of a processing pipeline. `tibble`s expressly forbid this.
- `data.frame`s have very clumsy support for list columns (a list column is a column whose cells contain lists). List columns are better supported in `tibble`s. The ability to support list columns is integral to the functional programming routines of the `purrr` package.
In this fictitious example, I would like to store three different `data.frame`s (or any three objects for that matter) in the three cells of a variable (e.g. `Var2`). This is not possible in a `data.frame` - it will coerce the `Var2` column into multiple columns (one for each object) and thus will also recycle the length of each to ensure they are all the same length. Once the data have been coerced into a `data.frame`, the structure is lost (we can no longer access `Var2`). These issues are addressed in the `tibble`.
3 Summary and vectorised functions
Many data manipulation actions involve the use of specific auxiliary functions to act on specific parts of the data set. For example, if we wish to summarise a data set, we might apply a `mean()` function to the numeric vectors (variables) and a `levels()` or `unique()` function to character or factor vectors. That is, the function is applied to each vector as a whole. Such functions are called summary functions.

On the other hand, if we wish to generate log-transformed versions of the numeric vectors, we would apply the `log()` function to each value within those numeric vectors. Furthermore, we might use other auxiliary functions that return vectors of either boolean values (TRUE or FALSE) or integers (representing row or column indices) to determine which observations to act on.

Broadly speaking, summary functions take a vector and return a single value. Familiar examples of summary functions are `mean()`, `var()`, `min()` etc. By contrast, vectorised functions take a vector and return a vector of the same length as the original input vector. `rank()`, `cumsum()` and `log()` are all examples of such vectorised (window) functions.
The `dplyr` package introduces a number of additional summary and vectorised functions that are deemed useful augmentations of the standard set available in base R. The following tables describe the most useful functions (from `base` and `dplyr`) along with the manipulation functions to which they can be applied. Those functions defined in `dplyr` are given with the `dplyr::` namespace prefix.

There are also some operations that appear to provide a mixture of both summary and vectorised functions. The `scale()` function (which centres and scales a vector), for example, returns a vector of scaled values as well as storing the original mean and standard deviation (summaries) as attributes.
To demonstrate summary and vectorised functions, the following vectors will be used.

```r
## set a random seed to ensure the same values each time
set.seed(123)
## a numeric vector generated from random draws from a gamma distribution
(x <- rgamma(10, 5, 1))
```

```
[1] 3.389585 7.378957 1.629561 4.778440 8.873564 5.530862 2.216495 1.581772
[9] 7.471264 5.295647
```

```
[1] "a" "a" "b" "b" "a" "b" "a" "b" "a" "b"
```

```
[1] FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
```
3.1 Summary functions
Function | Description |
---|---|
`mean()`, `median()` | Arithmetic mean and median. Note: boolean vectors are stored as a set of 0 (FALSE) and 1 (TRUE) values, and thus mathematical functions operate on them as if they were numbers. |
`sum()`, `prod()` | Sum and product. |
`var()`, `sd()` | Variance and standard deviation. |
`mad()`, `IQR()` | Median absolute deviation and inter-quartile range. |
`min()`, `max()` | Minimum and maximum. |
`quantile()` | Quantiles. |
`dplyr::first()`, `dplyr::last()`, `dplyr::nth()` | First, last and nth value. |
`dplyr::n()`, `dplyr::n_distinct()` | Number of values, number of distinct values. |
3.2 Vectorised functions
Function | Description |
---|---|
`+`, `-`, `*`, `/`, `^`, `%/%`, `%%` | Typical arithmetic operators. |
`log()`, `log2()`, `log10()`, `exp()` | Logarithms and exponentials. |
`<`, `<=`, `>`, `>=`, `!=`, `==` | Logical operators. |
`between()` | Whether a value is between two numbers. |
`near()` | A safe way of assessing equality (`==`) of floating-point numbers. |
`lag()`, `lead()` | Shift cases down/up by one. |
`cummax()`, `cummin()`, `dplyr::cummean()` | Cumulative max, min and mean. |
`cumsum()`, `cumprod()` | Cumulative sum and product. |
`dplyr::cumall()`, `dplyr::cumany()` | Cumulative all and any (mainly for use with filtering). |
`rank()`, `order()` | Rank and order of items. |
`dplyr::min_rank()`, `dplyr::dense_rank()`, `dplyr::percent_rank()` | Rank in which ties = min, without gaps, and expressed as a percentage. |
`dplyr::row_number()` | Rank in which ties = first. |
`dplyr::cume_dist()` | Cumulative empirical distribution (proportion less than or equal to the current value). |
`dplyr::ntile()` | Partition into (n) bins. |
`dplyr::if_else()` | Elementwise (case-by-case) if and else. |
`dplyr::case_when()` | Elementwise multiple if and else. |
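A sketch of a few base vectorised functions, each returning a vector of the same length as its input:

```r
x <- c(5, 2, 8, 2)

rank(x)     ## ties share the average rank: 3.0 1.5 4.0 1.5
cumsum(x)   ## running total: 5 7 15 17
order(x)    ## the permutation that would sort x: 2 4 1 3
ifelse(x > 4, "big", "small")  ## base elementwise if/else
```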
4 Sorting data (`arrange`)
Most statistical analyses are invariant to the data order; thus data reordering is typically only for aesthetics in tables and figures.

Sorting data has the potential to be one of the most dangerous forms of data manipulation - particularly in spreadsheets, in which there is no real binding between individual columns or rows. It is far too easy to accidentally sort the data in a single column (or row) without applying the ordering to all the other columns in the data set, thereby resulting in broken data.

Whilst the above apocalypse is still possible in R, the data structures and manipulation interfaces mean that you really have to try to break the data in this way. Furthermore, you are encouraged to store reordered data in a different object to the original data, hence 'rolling back' is trivial.
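A small self-contained sketch of `arrange()` (hypothetical data; assuming the `dplyr` package is installed):

```r
library(dplyr)
dat <- data.frame(Plot = c("P2", "P1", "P1"), Resp1 = c(3, 1, 2))

dat |> arrange(Plot, Resp1)  ## sort by Plot, then Resp1 (ascending)
dat |> arrange(desc(Resp1))  ## sort by Resp1, descending
```

Because every row moves as a unit, the bindings between columns are preserved.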
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
5 Subsets of data
We regularly want to run an analysis or generate tabular/graphical summaries for a subset of the data. For example, we may wish to reduce the number of columns in the data frame to remove excessive and unwanted data fields. Alternatively (or additionally), we may wish to focus on a subset of the rows. These two forms of sub-setting are:

- `select`ing - reducing the number of columns
- `filter`ing - reducing the number of rows
5.1 Subset columns (`select`)
Selecting works either by including (or excluding) the column names that you indicate, or via special selection 'helper' functions that pass a vector of column indices to include in the subset data.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
The select() function uses tidy-select semantics.
Select specific variables
Treatment Dose Time Resp1
1 Control H 1 8.12
2 Control H 2 20.55
3 Control H 3 27.49
4 Control H 4 44.79
5 Control M 1 20.99
6 Control M 2 37.54
Exclude specific variables
Include columns Treatment
through to Time
Treatment Plot Dose Time
1 Control P1 H 1
2 Control P1 H 2
3 Control P1 H 3
4 Control P1 H 4
5 Control P1 M 1
6 Control P1 M 2
Exclude columns Treatment
through to Time
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Names containing an ‘r’ (case insensitive)
Treatment Resp1 Resp2
1 Control 8.12 3.06
2 Control 20.55 25.94
3 Control 27.49 29.85
4 Control 44.79 25.39
5 Control 20.99 20.31
6 Control 37.54 17.62
Names starting with ‘r’ (case insensitive)
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Names ending in ‘e’ (case insensitive)
Names that are exactly four characters long
Select from a vector of strings
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Select from a character vector of strings…
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Since tibbles permit column names that contain special characters, in order to refer to such a column it is necessary to enclose the name in backticks. For example, to select the variable Pocillopora damicornis (which contains a space in the name - such names are sanitised by default in data.frames, yet are preserved in tibbles) and print the first six rows:
5.2 Renaming columns (rename)
Note, it is possible to have column names renamed during a select
action.
Both the rename() and select() functions use tidy-select semantics.
If we want to retain the other variables, we would also have to include them in the select
either explicitly, or via a helper function.
Treatment Plot Dose Time Weight Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
However, note that this might not retain the order of the columns. Hence, for simple renaming of columns, the rename
function is more convenient.
Treatment Plot Dose Time Weight Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
It is also possible to rename columns based on a lookup (list or data.frame). This is handy for cases when data columns have conveniently abbreviated names yet you are preparing data for tabular output - and thus need more descriptive column names.
Treatment Plot Dose Time Response 1 Response 2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
In the above example, the big-bang operator !!! force-splices a list. That is, the elements of the list are spliced into the statement as if they had been included directly.
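As a sketch (the exact lookup used above is not shown here), a named-list lookup spliced into rename() could be constructed as:

```r
## names are the new (long) column names, values the old (abbreviated) names
lookup <- list(`Response 1` = "Resp1", `Response 2` = "Resp2")
dat.1 |> rename(!!!lookup) |> head()
```

Each element of the list becomes a `new = old` pair in the rename() call.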
To do the same from a data.frame lookup:
lookup <- tribble(
~Abbr_name, ~Long_name,
"Resp1", "Response 1",
"Resp2", "Response 2")
## Convert to a named list (names = new names, values = old names)
lookup <- lookup |>
    select(Long_name, Abbr_name) |>
    deframe() |>
    as.list()
dat.1 |>
rename(!!!lookup) |>
head()
Treatment Plot Dose Time Response 1 Response 2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
5.3 Select a single column (pull)
As indicated earlier, single column indices on tibble
s return a single column tibble
. To extract a single column as a vector, there is a pull
function.
The pull() function uses tidy-select semantics.
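The vector below was presumably produced by a call of the following form:

```r
## extract Resp1 as a plain numeric vector (not a one-column tibble)
dat.1 |> pull(Resp1)
```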
[1] 8.12 20.55 27.49 44.79 20.99 37.54 61.46 82.21 31.73 59.08
[11] 94.54 121.17 8.14 13.36 33.37 39.87 19.95 42.83 62.46 81.78
[21] 32.76 62.35 90.22 114.03 21.86 39.83 59.53 75.59 38.57 81.25
[31] 124.08 159.69 61.16 119.84 175.87 238.76 18.82 39.82 63.30 82.29
[41] 39.51 79.24 122.09 161.67 57.93 117.88 181.09 242.31
5.4 Subset of rows (filter)
Filtering selects rows for which a condition evaluates to TRUE. Hence, any logical expression or vectorised function that returns Boolean values (TRUE or FALSE) can be used for filtering.
The filter() function uses data-masking semantics.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Helper function | Combine multiple selections … |
---|---|
if_any() |
With an OR |
if_all() |
With an AND |
Note, by default, the above searches are NOT case sensitive
Logical/Boolean function | Description |
---|---|
== |
is equal to |
!= |
is not equal to |
< |
is less than |
> |
is greater than |
<= |
is less than or equal to |
>= |
is greater than or equal to |
is.na() |
is equal to NA |
!is.na() |
is not equal to NA |
%in% |
is in the following vector |
! |
not |
& && |
logical AND |
| || |
logical OR |
xor() |
elementwise exclusive OR |
isTRUE() |
is true |
isFALSE() |
is false |
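As a sketch (the original code boxes are not reproduced here), conditions are supplied directly to filter(), and the two outputs below are consistent with calls like:

```r
## retain only the cases in which Dose is "H"
dat.1 |> filter(Dose == "H")

## retain the cases in which Dose is either "H" or "M"
dat.1 |> filter(Dose %in% c("H", "M"))
```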
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P2 H 1 8.14 23.93
6 Control P2 H 2 13.36 28.02
7 Control P2 H 3 33.37 37.17
8 Control P2 H 4 39.87 38.25
9 Exclusion P3 H 1 21.86 23.58
10 Exclusion P3 H 2 39.83 28.03
11 Exclusion P3 H 3 59.53 21.32
12 Exclusion P3 H 4 75.59 90.76
13 Exclusion P4 H 1 18.82 28.60
14 Exclusion P4 H 2 39.82 39.07
15 Exclusion P4 H 3 63.30 93.43
16 Exclusion P4 H 4 82.29 60.15
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
7 Control P1 M 3 61.46 98.44
8 Control P1 M 4 82.21 160.01
9 Control P2 H 1 8.14 23.93
10 Control P2 H 2 13.36 28.02
11 Control P2 H 3 33.37 37.17
12 Control P2 H 4 39.87 38.25
13 Control P2 M 1 19.95 19.73
14 Control P2 M 2 42.83 40.52
15 Control P2 M 3 62.46 4.81
16 Control P2 M 4 81.78 136.66
17 Exclusion P3 H 1 21.86 23.58
18 Exclusion P3 H 2 39.83 28.03
19 Exclusion P3 H 3 59.53 21.32
20 Exclusion P3 H 4 75.59 90.76
21 Exclusion P3 M 1 38.57 30.63
22 Exclusion P3 M 2 81.25 83.61
23 Exclusion P3 M 3 124.08 124.09
24 Exclusion P3 M 4 159.69 112.65
25 Exclusion P4 H 1 18.82 28.60
26 Exclusion P4 H 2 39.82 39.07
27 Exclusion P4 H 3 63.30 93.43
28 Exclusion P4 H 4 82.29 60.15
29 Exclusion P4 M 1 39.51 45.90
30 Exclusion P4 M 2 79.24 88.04
31 Exclusion P4 M 3 122.09 84.19
32 Exclusion P4 M 4 161.67 256.34
Keep all cases after a value of Resp1 has exceeded 200
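One way to achieve this (a sketch, not necessarily the original code) is to combine filter() with the cumulative cumany() function:

```r
## keep each row from the first exceedance of 200 onward
dat.1 |> filter(cumany(Resp1 > 200))
```

cumany() returns FALSE for every row until the condition is first TRUE, and TRUE thereafter.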
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P3 L 4 238.76 54.23
2 Exclusion P4 H 1 18.82 28.60
3 Exclusion P4 H 2 39.82 39.07
4 Exclusion P4 H 3 63.30 93.43
5 Exclusion P4 H 4 82.29 60.15
6 Exclusion P4 M 1 39.51 45.90
7 Exclusion P4 M 2 79.24 88.04
8 Exclusion P4 M 3 122.09 84.19
9 Exclusion P4 M 4 161.67 256.34
10 Exclusion P4 L 1 57.93 85.24
11 Exclusion P4 L 2 117.88 167.90
12 Exclusion P4 L 3 181.09 314.49
13 Exclusion P4 L 4 242.31 304.70
Keep all cases until the first case of Resp1 < 25
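A sketch using cumall() - note that the per-Plot grouping here is an assumption inferred from the output:

```r
## within each Plot, keep rows until Resp1 < 25 first fails
dat.1 |>
  group_by(Plot) |>
  filter(cumall(Resp1 < 25)) |>
  ungroup()
```

cumall() remains TRUE only while every case so far has satisfied the condition.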
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P2 H 1 8.14 23.93
4 Control P2 H 2 13.36 28.02
5 Exclusion P3 H 1 21.86 23.58
6 Exclusion P4 H 1 18.82 28.60
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P2 H 1 8.14 23.93
4 Control P2 H 2 13.36 28.02
5 Exclusion P3 H 1 21.86 23.58
6 Exclusion P4 H 1 18.82 28.60
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P2 H 1 8.14 23.93
7 Control P2 H 2 13.36 28.02
8 Control P2 H 3 33.37 37.17
9 Control P2 H 4 39.87 38.25
10 Control P2 M 1 19.95 19.73
11 Exclusion P3 H 1 21.86 23.58
12 Exclusion P3 H 2 39.83 28.03
13 Exclusion P3 H 3 59.53 21.32
14 Exclusion P3 H 4 75.59 90.76
15 Exclusion P4 H 1 18.82 28.60
16 Exclusion P4 H 2 39.82 39.07
17 Exclusion P4 H 3 63.30 93.43
18 Exclusion P4 H 4 82.29 60.15
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 M 1 20.99 20.31
4 Control P1 M 2 37.54 17.62
5 Control P1 L 1 31.73 21.22
6 Control P2 H 1 8.14 23.93
7 Control P2 H 2 13.36 28.02
8 Control P2 M 1 19.95 19.73
9 Control P2 M 3 62.46 4.81
10 Exclusion P3 H 1 21.86 23.58
11 Exclusion P3 H 3 59.53 21.32
12 Exclusion P4 H 1 18.82 28.60
5.5 Subset of rows by position (slice)
5.6 Random selection of rows (sample_n)
In each of the examples, I will set the random seed to ensure we can all repeat the example exactly.
Random sample of 10 rows (without replacement)
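A sketch of such a sample (the seed value 123 is an assumption, so the rows you obtain may differ):

```r
set.seed(123)  # hypothetical seed for reproducibility
dat.1 |> sample_n(10)
```

Note that in recent versions of dplyr, sample_n() has been superseded by slice_sample(n = 10).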
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P3 M 3 124.08 124.09
2 Control P2 H 3 33.37 37.17
3 Control P2 H 2 13.36 28.02
4 Control P1 H 3 27.49 29.85
5 Exclusion P4 M 2 79.24 88.04
6 Exclusion P4 M 3 122.09 84.19
7 Exclusion P4 H 1 18.82 28.60
8 Exclusion P4 L 2 117.88 167.90
9 Exclusion P3 H 1 21.86 23.58
10 Exclusion P3 H 2 39.83 28.03
Random sample of 25% of the rows (without replacement)
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P3 M 3 124.08 124.09
2 Control P2 H 3 33.37 37.17
3 Control P2 H 2 13.36 28.02
4 Control P1 H 3 27.49 29.85
5 Exclusion P4 M 2 79.24 88.04
6 Exclusion P4 M 3 122.09 84.19
7 Exclusion P4 H 1 18.82 28.60
8 Exclusion P4 L 2 117.88 167.90
9 Exclusion P3 H 1 21.86 23.58
10 Exclusion P3 H 2 39.83 28.03
11 Exclusion P3 H 3 59.53 21.32
12 Control P1 M 1 20.99 20.31
5.7 Effects of filtering on factor levels
In R, categorical variables (factors) are actually stored as vectors of integers (1, 2, 3, …) along with an attribute that registers the names (and order) of the levels. We can see this, if we ask to see the structure of any categorical variable:
In the above, we see that the levels of the Dose
variable are “H”, “L”, “M” (by default, levels of an unordered factor are alphabetical). The first four cases are all 1
, which is mapped to “H”. The next four are 3
which maps to “M” and so on.
Alternatively, we can explore the levels attribute of a factor via the levels()
function:
[1] "H" "L" "M"
[1] "P1" "P2" "P3" "P4"
[1] "Control" "Exclusion"
Although subsetting rows (filter, sample_n, etc.) may appear to completely remove particular levels of a categorical variable, these actions do not update factor attributes. Consequently, many routines that operate on such factors (such as plots and statistical models) will proceed as if all factor levels are present (e.g. plots will contain gaps where space has been provisioned for levels that no longer have any associated data).
To illustrate this, we will filter the dat.1
data such that it only includes cases for which Plot
equals “P1” and then explore the levels attribute of some of the categorical variables.
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
7 Control P1 M 3 61.46 98.44
8 Control P1 M 4 82.21 160.01
9 Control P1 L 1 31.73 21.22
10 Control P1 L 2 59.08 37.51
11 Control P1 L 3 94.54 119.22
12 Control P1 L 4 121.17 116.45
[1] "H" "L" "M"
[1] "P1" "P2" "P3" "P4"
[1] "Control" "Exclusion"
So although the data only contains Plot
values of “P1” (and Treatment
values of “Control”), the levels are still listed as “P1”, “P2”, “P3”, and “P4”.
To ensure that the attributes reflect the subset data, it is necessary to use the droplevels()
function:
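A sketch of the filtering followed by droplevels() (the object name dat.1.sub is hypothetical):

```r
dat.1.sub <- dat.1 |>
  filter(Plot == "P1") |>
  droplevels()

## the unused levels have now been dropped
levels(dat.1.sub$Plot)
```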
6 Adding/modifying columns (mutate)
To add to (mutate) a data set, a vectorised function is applied either once or across one or more existing columns.
The mutate() function uses data-masking semantics and can use tidy-select semantics via across().
Whether transforming/scaling existing variables or generating new variables derived from existing variables, it is advisable that existing variables be retained (unless the data are huge and storage is tight).
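The output below (a new Sum column) is consistent with a call like:

```r
## derive a new column as the sum of the two responses
dat.1 |> mutate(Sum = Resp1 + Resp2) |> head()
```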
Treatment Plot Dose Time Resp1 Resp2 Sum
1 Control P1 H 1 8.12 3.06 11.18
2 Control P1 H 2 20.55 25.94 46.49
3 Control P1 H 3 27.49 29.85 57.34
4 Control P1 H 4 44.79 25.39 70.18
5 Control P1 M 1 20.99 20.31 41.30
6 Control P1 M 2 37.54 17.62 55.16
Multiple additions/modifications can be defined in a single mutate
call and the changes occur sequentially. Consequently, subsequent changes can build upon early changes. To illustrate this, we will create a new variable (cResp1
) that is a centred version of the Resp1
variable (centring is when the mean of a variable is subtracted from each value, resulting in a variable whose mean is exactly zero, yet all other properties are unaffected).
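A sketch of the two sequential mutations (the intermediate meanResp1 column is retained, as in the output below):

```r
## the second mutation builds on the first within the same call
dat.1 |> mutate(meanResp1 = mean(Resp1),
                cResp1 = Resp1 - meanResp1) |>
  head()
```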
Treatment Plot Dose Time Resp1 Resp2 meanResp1 cResp1
1 Control P1 H 1 8.12 3.06 75.26604 -67.14604
2 Control P1 H 2 20.55 25.94 75.26604 -54.71604
3 Control P1 H 3 27.49 29.85 75.26604 -47.77604
4 Control P1 H 4 44.79 25.39 75.26604 -30.47604
5 Control P1 M 1 20.99 20.31 75.26604 -54.27604
6 Control P1 M 2 37.54 17.62 75.26604 -37.72604
The following table lists some of the more useful vectorised functions to use with mutate
. Importantly, vectorised functions return a vector of the same length as they receive since all mutate
edits must preserve the number of rows.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
6.1 Mathematical functions
Function | Description |
---|---|
+,-,*,/,^,%/%, %% |
typical arithmetic operators |
log() ,log2() ,log10() , exp() |
logarithms/exponentials |
<,<=,>,>=,!=,== |
logical operators |
between() |
whether a case is between two numbers |
near() |
a safe way of assessing equality of floating-point numbers |
Treatment Plot Dose Time Resp1 Resp2 Sum Div
1 Control P1 H 1 8.12 3.06 11.18 2.6535948
2 Control P1 H 2 20.55 25.94 46.49 0.7922128
3 Control P1 H 3 27.49 29.85 57.34 0.9209380
4 Control P1 H 4 44.79 25.39 70.18 1.7640803
5 Control P1 M 1 20.99 20.31 41.30 1.0334810
6 Control P1 M 2 37.54 17.62 55.16 2.1305335
Treatment Plot Dose Time Resp1 Resp2 logResp1 expResp2
1 Control P1 H 1 8.12 3.06 2.094330 2.132756e+01
2 Control P1 H 2 20.55 25.94 3.022861 1.843312e+11
3 Control P1 H 3 27.49 29.85 3.313822 9.197934e+12
4 Control P1 H 4 44.79 25.39 3.801985 1.063499e+11
5 Control P1 M 1 20.99 20.31 3.044046 6.614864e+08
6 Control P1 M 2 37.54 17.62 3.625407 4.490232e+07
Treatment Plot Dose Time Resp1 Resp2 largeResp1
1 Control P1 H 1 8.12 3.06 FALSE
2 Control P1 H 2 20.55 25.94 FALSE
3 Control P1 H 3 27.49 29.85 TRUE
4 Control P1 H 4 44.79 25.39 TRUE
5 Control P1 M 1 20.99 20.31 FALSE
6 Control P1 M 2 37.54 17.62 TRUE
Treatment Plot Dose Time Resp1 Resp2 A B
1 Control P1 H 1 8.12 3.06 FALSE TRUE
2 Control P1 H 2 20.55 25.94 FALSE FALSE
3 Control P1 H 3 27.49 29.85 FALSE FALSE
4 Control P1 H 4 44.79 25.39 FALSE FALSE
5 Control P1 M 1 20.99 20.31 FALSE FALSE
6 Control P1 M 2 37.54 17.62 FALSE FALSE
Treatment Plot Dose Time Resp1 Resp2 mediumResp1
1 Control P1 H 1 8.12 3.06 FALSE
2 Control P1 H 2 20.55 25.94 TRUE
3 Control P1 H 3 27.49 29.85 FALSE
4 Control P1 H 4 44.79 25.39 FALSE
5 Control P1 M 1 20.99 20.31 TRUE
6 Control P1 M 2 37.54 17.62 FALSE
6.2 Offset functions
Function | Description |
---|---|
lag() |
shift cases down one |
lead() |
shift cases up one |
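The output below (note the trailing NA, which arises because the last case has nothing to lead into) is consistent with a call like:

```r
## shift Resp1 up one case; the final value becomes NA
dat.1 |> mutate(leadResp1 = lead(Resp1)) |> tail()
```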
Treatment Plot Dose Time Resp1 Resp2 leadResp1
43 Exclusion P4 M 3 122.09 84.19 161.67
44 Exclusion P4 M 4 161.67 256.34 57.93
45 Exclusion P4 L 1 57.93 85.24 117.88
46 Exclusion P4 L 2 117.88 167.90 181.09
47 Exclusion P4 L 3 181.09 314.49 242.31
48 Exclusion P4 L 4 242.31 304.70 NA
6.3 Cumulative aggregate functions
Function | Description |
---|---|
cummax() |
cumulative maximum |
cummin() |
cumulative minimum |
cummean() |
cumulative mean |
cumprod() |
cumulative product |
cumsum() |
cumulative sum |
rank() |
rank of current case (ties averaged) |
min_rank() |
rank of current case (minimum rank for ties) |
dense_rank() |
rank of current case (minimum rank for ties, no gaps) |
percent_rank() |
min_rank of current case (scaled to [0,1]) |
cume_dist() |
cumulative empirical distribution (prop. less than current rank) |
row_number() |
rank of current case (first row for ties) |
ntile() |
bin into (n) buckets |
Treatment Plot Dose Time Resp1 Resp2 Cummin Cummax Cummean
1 Control P1 H 1 8.12 3.06 8.12 8.12 8.1200
2 Control P1 H 2 20.55 25.94 8.12 20.55 14.3350
3 Control P1 H 3 27.49 29.85 8.12 27.49 18.7200
4 Control P1 H 4 44.79 25.39 8.12 44.79 25.2375
5 Control P1 M 1 20.99 20.31 8.12 44.79 24.3880
6 Control P1 M 2 37.54 17.62 8.12 44.79 26.5800
Treatment Plot Dose Time Resp1 Resp2 Cumsum Cumprod
1 Control P1 H 1 8.12 3.06 8.12 8.120000e+00
2 Control P1 H 2 20.55 25.94 28.67 1.668660e+02
3 Control P1 H 3 27.49 29.85 56.16 4.587146e+03
4 Control P1 H 4 44.79 25.39 100.95 2.054583e+05
5 Control P1 M 1 20.99 20.31 121.94 4.312569e+06
6 Control P1 M 2 37.54 17.62 159.48 1.618939e+08
dat.1 |> mutate(Rank = rank(Resp1),
minRank = min_rank(Resp1),
denseRank = dense_rank(Resp1),
percentRank = percent_rank(Resp1)) |>
head()
Treatment Plot Dose Time Resp1 Resp2 Rank minRank denseRank percentRank
1 Control P1 H 1 8.12 3.06 1 1 1 0.0000000
2 Control P1 H 2 20.55 25.94 6 6 6 0.1063830
3 Control P1 H 3 27.49 29.85 9 9 9 0.1702128
4 Control P1 H 4 44.79 25.39 20 20 20 0.4042553
5 Control P1 M 1 20.99 20.31 7 7 7 0.1276596
6 Control P1 M 2 37.54 17.62 13 13 13 0.2553191
Treatment Plot Dose Time Resp1 Resp2 cume_dist(Resp1)
1 Control P1 H 1 8.12 3.06 0.02083333
2 Control P1 H 2 20.55 25.94 0.12500000
3 Control P1 H 3 27.49 29.85 0.18750000
4 Control P1 H 4 44.79 25.39 0.41666667
5 Control P1 M 1 20.99 20.31 0.14583333
6 Control P1 M 2 37.54 17.62 0.27083333
6.4 Miscellaneous
Function | Description |
---|---|
if_else() |
elementwise (case by case) if and else |
case_when() |
elementwise multiple if_else |
na_if() |
elementwise replace nominated value with NA |
pmax() |
elementwise maximum across multiple columns |
pmin() |
elementwise minimum across multiple columns |
Treatment Plot Dose Time Resp1 Resp2 Size
1 Control P1 H 1 8.12 3.06 Small
2 Control P1 H 2 20.55 25.94 Small
3 Control P1 H 3 27.49 29.85 Big
4 Control P1 H 4 44.79 25.39 Big
5 Control P1 M 1 20.99 20.31 Small
6 Control P1 M 2 37.54 17.62 Big
dat.1 |> mutate(Size = case_when(Resp1 < 15 ~ "Small",
Resp1 < 25 ~ "Medium",
Resp1 >= 25 ~ "Big")) |>
head()
Treatment Plot Dose Time Resp1 Resp2 Size
1 Control P1 H 1 8.12 3.06 Small
2 Control P1 H 2 20.55 25.94 Medium
3 Control P1 H 3 27.49 29.85 Big
4 Control P1 H 4 44.79 25.39 Big
5 Control P1 M 1 20.99 20.31 Medium
6 Control P1 M 2 37.54 17.62 Big
6.5 Summary functions
Summary functions (those that return a single value) are also permissible - the value will be recycled for the total length of the input vector. A table of useful summary functions can be found in the Summarise section.
Treatment Plot Dose Time Resp1 Resp2 meanResp1
1 Control P1 H 1 8.12 3.06 75.26604
2 Control P1 H 2 20.55 25.94 75.26604
3 Control P1 H 3 27.49 29.85 75.26604
4 Control P1 H 4 44.79 25.39 75.26604
5 Control P1 M 1 20.99 20.31 75.26604
6 Control P1 M 2 37.54 17.62 75.26604
Another important summary function is the n()
function. This function returns the total number of rows. In the following example, we will use it to create a column that just provides a running row counter (e.g. a row index variable).
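A sketch of such a counter (the column name RowID is hypothetical):

```r
## 1:n() generates the sequence 1, 2, ..., number of rows
dat.1 |> mutate(RowID = 1:n()) |> head()
```

The row_number() function achieves the same result.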
6.6 Multiple mutations (across)
In the section on select, a set of selection helper functions was described that facilitate convenient ways to select columns based on properties of the column names. The across() function allows us to bring those same selection helper functions to mutate.
The across()
function has the following form:
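The display of the form itself appears to have been omitted here; from the dplyr documentation, it is:

```r
across(.cols, .fns, ..., .names = NULL)
```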
where:
- .cols - a tidy selection (e.g. a selection helper function)
- .fns - a function (or list of functions) to apply to each selected column
- .names - a glue specification determining the format of the new variable names. By default, the glue specification is {.col} (when there is only a single function) or {.col}_{fn} (when a list of functions is supplied)
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 2.094330 1.118415
2 Control P1 H 2 3.022861 3.255786
3 Control P1 H 3 3.313822 3.396185
4 Control P1 H 4 3.801985 3.234355
5 Control P1 M 1 3.044046 3.011113
6 Control P1 M 2 3.625407 2.869035
Treatment Plot Dose Time Resp1 Resp2 lResp1 lResp2
1 Control P1 H 1 8.12 3.06 2.094330 1.118415
2 Control P1 H 2 20.55 25.94 3.022861 3.255786
3 Control P1 H 3 27.49 29.85 3.313822 3.396185
4 Control P1 H 4 44.79 25.39 3.801985 3.234355
5 Control P1 M 1 20.99 20.31 3.044046 3.011113
6 Control P1 M 2 37.54 17.62 3.625407 2.869035
Treatment Plot Dose Time Resp1 Resp2 lTime lResp1 lResp2
1 Control P1 H 1 8.12 3.06 0.0000000 2.094330 1.118415
2 Control P1 H 2 20.55 25.94 0.6931472 3.022861 3.255786
3 Control P1 H 3 27.49 29.85 1.0986123 3.313822 3.396185
4 Control P1 H 4 44.79 25.39 1.3862944 3.801985 3.234355
5 Control P1 M 1 20.99 20.31 0.0000000 3.044046 3.011113
6 Control P1 M 2 37.54 17.62 0.6931472 3.625407 2.869035
Treatment Plot Dose Time Resp1 Resp2 lResp1 lResp2
1 Control P1 H 1 8.12 3.06 2.094330 1.118415
2 Control P1 H 2 20.55 25.94 3.022861 3.255786
3 Control P1 H 3 27.49 29.85 3.313822 3.396185
4 Control P1 H 4 44.79 25.39 3.801985 3.234355
5 Control P1 M 1 20.99 20.31 3.044046 3.011113
6 Control P1 M 2 37.54 17.62 3.625407 2.869035
dat.1 |> mutate(across(c(Resp1, Resp2),
.fns = list(l = log, s = sqrt),
.names = "{.fn}.{.col}")) |>
head()
Treatment Plot Dose Time Resp1 Resp2 l.Resp1 s.Resp1 l.Resp2 s.Resp2
1 Control P1 H 1 8.12 3.06 2.094330 2.849561 1.118415 1.749286
2 Control P1 H 2 20.55 25.94 3.022861 4.533211 3.255786 5.093133
3 Control P1 H 3 27.49 29.85 3.313822 5.243091 3.396185 5.463515
4 Control P1 H 4 44.79 25.39 3.801985 6.692533 3.234355 5.038849
5 Control P1 M 1 20.99 20.31 3.044046 4.581484 3.011113 4.506662
6 Control P1 M 2 37.54 17.62 3.625407 6.126989 2.869035 4.197618
Centring all numeric variables. Note the use of the purrr-style lambda (formula) syntax for functions that require additional arguments; when the function is applied, the focal variable is referred to as .x.
dat.1 |> mutate(across(c(Resp1, Resp2),
.fns = list(c = ~ scale(.x, scale = FALSE)),
.names = "{.fn}{.col}")) |>
head()
Treatment Plot Dose Time Resp1 Resp2 cResp1 cResp2
1 Control P1 H 1 8.12 3.06 -67.14604 -78.64958
2 Control P1 H 2 20.55 25.94 -54.71604 -55.76958
3 Control P1 H 3 27.49 29.85 -47.77604 -51.85958
4 Control P1 H 4 44.79 25.39 -30.47604 -56.31958
5 Control P1 M 1 20.99 20.31 -54.27604 -61.39958
6 Control P1 M 2 37.54 17.62 -37.72604 -64.08958
6.7 Changing vectors
Convert Time
(a numeric) into a factor
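The conversion shown below is consistent with a call like:

```r
## replace the integer Time with a factor version of itself
dat.1 |> mutate(Time = factor(Time))
```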
# A tibble: 48 × 6
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <fct> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 H 3 27.5 29.8
4 Control P1 H 4 44.8 25.4
5 Control P1 M 1 21.0 20.3
6 Control P1 M 2 37.5 17.6
7 Control P1 M 3 61.5 98.4
8 Control P1 M 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
Change the labels of the “H” and “M” levels of Dose
# A tibble: 48 × 6
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 High 1 8.12 3.06
2 Control P1 High 2 20.6 25.9
3 Control P1 High 3 27.5 29.8
4 Control P1 High 4 44.8 25.4
5 Control P1 Medium 1 21.0 20.3
6 Control P1 Medium 2 37.5 17.6
7 Control P1 Medium 3 61.5 98.4
8 Control P1 Medium 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
'data.frame': 48 obs. of 6 variables:
$ Treatment: Factor w/ 2 levels "Control","Exclusion": 1 1 1 1 1 1 1 1 1 1 ...
$ Plot : Factor w/ 4 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Dose : Factor w/ 3 levels "High","L","Medium": 1 1 1 1 3 3 3 3 2 2 ...
$ Time : int 1 2 3 4 1 2 3 4 1 2 ...
$ Resp1 : num 8.12 20.55 27.49 44.79 20.99 ...
$ Resp2 : num 3.06 25.94 29.85 25.39 20.31 ...
- attr(*, "out.attrs")=List of 2
..$ dim : Named int [1:3] 4 3 4
.. ..- attr(*, "names")= chr [1:3] "Time" "Dose" "Plot"
..$ dimnames:List of 3
.. ..$ Time: chr [1:4] "Time=1" "Time=2" "Time=3" "Time=4"
.. ..$ Dose: chr [1:3] "Dose=H" "Dose=M" "Dose=L"
.. ..$ Plot: chr [1:4] "Plot=P1" "Plot=P2" "Plot=P3" "Plot=P4"
Change the level order of the Dose
factor to something more natural.
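A sketch of the reordering - the first output shows the default alphabetical level order, the second the explicitly reordered levels:

```r
## default (alphabetical) level order
dat.1$Dose

## explicitly set the level order to L, M, H
dat.1$Dose |> factor(levels = c("L", "M", "H"))
```

Note that reordering levels does not alter the data values themselves, only the internal level ordering.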
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H L M
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: L M H
Change the labels and level order of the Dose
factor to something more natural.
7 Summarising
Data are often collected and transcribed at finer temporal/spatial scales and with greater fidelity than is required for all analyses. Furthermore, there is often a desire to present information summarised to a coarser level. Therefore an important phase of data preparation/presentation is also to summarise the data into the spatial/temporal scales appropriate for the desired graphical and statistical analyses.
Summarising involves applying one or more summarising functions to one or more variables such that the data are reduced to a single row.
The summarise() function uses data-masking semantics and can use tidy-select semantics via across().
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Function | Description |
---|---|
n() |
number of rows/values |
n_distinct() |
number of unique rows/values |
mean() , median() |
mean/median |
sum() , prod() |
sum/product |
var() , sd() |
variance/standard deviation |
mad() , IQR() |
median absolute deviation / interquartile range |
min() , max() |
minimum / maximum |
quantile() |
nth quantile |
first() , last() , nth() |
first / last / nth value |
Calculate the number of rows in the data as well as the number of distinct (unique) values of Dose
.
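A sketch of such a calculation (the result column names are hypothetical):

```r
dat.1 |> summarise(N = n(), N_Dose = n_distinct(Dose))
```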
7.1 Count
The count()
function provides a convenient way to count up the number of unique combinations of factors.
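For example (a sketch; count() returns one row per unique Treatment/Dose combination, with the tally in a column named n):

```r
dat.1 |> count(Treatment, Dose)
```

This is equivalent to group_by(Treatment, Dose) followed by summarise(n = n()).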
8 Grouping
Base R has a family of apply functions (e.g. apply(), lapply(), tapply()) that apply a function (such as mean()) to a continuous variable separately for each margin of a matrix, each element of a list, or each subset defined by a factor.
Apply the mean() function to each column (MARGIN = 2) of a dataset comprising only “Resp1” and “Resp2”.
The above apply functions involve a sort of split-apply-combine process. This process became the inspiration for the plyr package (also out of the Wickham lab), which built on and super-charged the process. The dplyr package represents the next iteration of the process (restricted to data frames and tibbles, since most data are stored in data frame format).
Central to the modern split/apply/combine process is the idea of groups. Groups are the basis of splitting the data. Functions applied to grouped data are applied to each group (subset) separately before the results are combined back into a single data frame (actually tibble). Hence grouped data are most powerful when combined with the summarise()
or mutate()
families of functions.
Groups are defined via the group_by()
function. For example, to indicate that all subsequent manipulations should occur separately for each level of the Dose
variable, we would:
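That is (producing the grouped tibble shown below):

```r
## declare Dose as the grouping variable; the data themselves are unchanged
dat.1 |> group_by(Dose)
```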
# A tibble: 48 × 6
# Groups: Dose [3]
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 H 3 27.5 29.8
4 Control P1 H 4 44.8 25.4
5 Control P1 M 1 21.0 20.3
6 Control P1 M 2 37.5 17.6
7 Control P1 M 3 61.5 98.4
8 Control P1 M 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
Note, in the above, we have only defined the grouping structure, not any subsequent manipulations. Consequently, the tibble remains unchanged. Do note however, that above the tabular output, a Groups attribute is displayed as a visual reminder.
It is in combination with other data manipulation routines that group_by()
is most useful. For example, to calculate the mean of “Resp1” within each Dose
:
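The output below is consistent with:

```r
## the mean is calculated separately within each Dose group
dat.1 |>
  group_by(Dose) |>
  summarise(Mean = mean(Resp1))
```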
# A tibble: 3 × 2
Dose Mean
<fct> <dbl>
1 H 37.3
2 L 113.
3 M 76.0
In the above output, you may have noticed that the data (means) have been rounded. This is a behaviour of the print method for tibbles. When stored, the full precision of the data are retained.
Before providing a list of examples, it is worth using group_by()
to highlight the difference between mutate()
and summarise()
.
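The grouped mutate() version (whose output appears below) would be a call like:

```r
## each case receives the mean of its own Dose group
dat.1 |>
  group_by(Dose) |>
  mutate(MeanResp1 = mean(Resp1)) |>
  head()
```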
# A tibble: 6 × 7
# Groups: Dose [2]
Treatment Plot Dose Time Resp1 Resp2 MeanResp1
<fct> <fct> <fct> <int> <dbl> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06 37.3
2 Control P1 H 2 20.6 25.9 37.3
3 Control P1 H 3 27.5 29.8 37.3
4 Control P1 H 4 44.8 25.4 37.3
5 Control P1 M 1 21.0 20.3 76.0
6 Control P1 M 2 37.5 17.6 76.0
- mutate() maintains the number of rows. Notice how the mean for each group is repeated for each case.
- summarise() reduces the number of rows to one (per group).
- mutate() retains all the columns.
- summarise() only returns the columns specifically involved (either as the groups or manipulated variables).
When multiple variables are used to define the groups, the subsequent manipulations are applied to each unique combination of the factors.
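For example (a sketch matching the output below):

```r
## means calculated for each Treatment x Dose combination
dat.1 |>
  group_by(Treatment, Dose) |>
  summarise(Mean = mean(Resp1))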
# A tibble: 6 × 3
# Groups: Treatment [2]
Treatment Dose Mean
<fct> <fct> <dbl>
1 Control H 24.5
2 Control L 75.7
3 Control M 51.2
4 Exclusion H 50.1
5 Exclusion L 149.
6 Exclusion M 101.
Importantly, once all the manipulations that need to be performed separately per group have been performed, it is good practice to remove the group attribute. Forgetting to do so will continue to cause all subsequent manipulations (and possibly analyses and graphics) to be inadvertently performed separately per group - such downstream issues can be difficult to diagnose.
To remove the group attribute, we use the ungroup() function.
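For example, a sketch extending the grouped summary with ungroup() (note the absence of the Groups attribute in the output below):

```r
## drop the residual Treatment grouping after summarising
dat.1 |>
  group_by(Treatment, Dose) |>
  summarise(Mean = mean(Resp1)) |>
  ungroup()
```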
# A tibble: 6 × 3
Treatment Dose Mean
<fct> <fct> <dbl>
1 Control H 24.5
2 Control L 75.7
3 Control M 51.2
4 Exclusion H 50.1
5 Exclusion L 149.
6 Exclusion M 101.
dat.1 |> group_by(Treatment, Dose) |>
summarise(meanResp1 = mean(Resp1),
varResp1 = var(Resp1),
N = n())
# A tibble: 6 × 5
# Groups: Treatment [2]
Treatment Dose meanResp1 varResp1 N
<fct> <fct> <dbl> <dbl> <int>
1 Control H 24.5 202. 8
2 Control L 75.7 1191. 8
3 Control M 51.2 611. 8
4 Exclusion H 50.1 562. 8
5 Exclusion L 149. 5192. 8
6 Exclusion M 101. 2378. 8
dat.1 |> group_by(Treatment, Dose) |>
summarise(across(starts_with("Resp"),
list(Mean = mean, Var = var),
.names = "{.fn}{.col}"),
N = n())
# A tibble: 6 × 7
# Groups: Treatment [2]
Treatment Dose MeanResp1 VarResp1 MeanResp2 VarResp2 N
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 Control H 24.5 202. 26.5 118. 8
2 Control L 75.7 1191. 79.9 1948. 8
3 Control M 51.2 611. 62.3 3676. 8
4 Exclusion H 50.1 562. 48.1 886. 8
5 Exclusion L 149. 5192. 170. 13391. 8
6 Exclusion M 101. 2378. 103. 4786. 8
# A tibble: 8 × 6
# Groups: Treatment, Dose [3]
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 M 1 21.0 20.3
4 Control P2 H 1 8.14 23.9
5 Control P2 H 2 13.4 28.0
6 Control P2 M 1 20.0 19.7
7 Exclusion P3 H 1 21.9 23.6
8 Exclusion P4 H 1 18.8 28.6
9 Reshaping data
9.1 Wide to long (pivot_longer)
Whilst wide data formats are often more compact and typically easier to manage for data entry (particularly in the field), they are not in the appropriate format for most analyses (traditional repeated measures and multivariate analyses being two exceptions). Most analyses require that each replicate is in its own row, and thus it is necessary to rearrange or reshape (melt) the data from this wide format into the long (molten) format.
Whilst there are numerous routines in R for reshaping data, we will only explore those that are formally part of the tidyverse ecosystem.
The pivot_longer() function (tidyr package) is very useful for converting wide (repeated measures-like) data into long format. The important parameters to specify are:
where:
- data - the input dataframe or tibble
- cols - a tidy select specification of the columns to be lengthened into a single column
- names_to - a name (string) for a new column in which to store the old wide column names
- values_to - a name (string) for the new column containing the values that were previously in the old columns
- values_drop_na - whether or not to drop rows that contain only NA values in the new value column
To illustrate, we will use the data.w dataframe.
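The simplest lengthening call (a sketch producing the output below, in which the old column names are retained verbatim in the Time column) is:

```r
## gather the Time.0, Time.1 and Time.2 columns into Time/Count pairs
data.w |>
  pivot_longer(cols = starts_with("Time"),
               names_to = "Time",
               values_to = "Count")
```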
Plot | Between | Time.0 | Time.1 | Time.2 | |
---|---|---|---|---|---|
R1 | P1 | A1 | 8 | 14 | 14 |
R2 | P2 | A1 | 10 | 12 | 11 |
R3 | P3 | A2 | 7 | 11 | 8 |
R4 | P4 | A2 | 11 | 9 | 2 |
# A tibble: 12 × 4
Plot Between Time Count
<fct> <fct> <chr> <int>
1 P1 A1 Time.0 8
2 P1 A1 Time.1 14
3 P1 A1 Time.2 14
4 P2 A1 Time.0 10
5 P2 A1 Time.1 12
6 P2 A1 Time.2 11
7 P3 A2 Time.0 7
8 P3 A2 Time.1 11
9 P3 A2 Time.2 8
10 P4 A2 Time.0 11
11 P4 A2 Time.1 9
12 P4 A2 Time.2 2
data.w |> pivot_longer(starts_with("Time"),
names_to = "Time",
names_prefix = "Time.",
values_to = "Count"
)
# A tibble: 12 × 4
Plot Between Time Count
<fct> <fct> <chr> <int>
1 P1 A1 0 8
2 P1 A1 1 14
3 P1 A1 2 14
4 P2 A1 0 10
5 P2 A1 1 12
6 P2 A1 2 11
7 P3 A2 0 7
8 P3 A2 1 11
9 P3 A2 2 8
10 P4 A2 0 11
11 P4 A2 1 9
12 P4 A2 2 2
9.2 Long to wide (pivot_wider)
The opposite of making a data set longer is to make a data set wider. Whilst analytical and graphical routines might require data to be in long format, wide tabular summaries are typically more compact and familiar.
Widening is performed via the pivot_wider() function, the most important parameters of which are:
where:
- data - the input dataframe or tibble
- id_cols - a tidy select specification of the columns that uniquely identify each case - these columns will not be widened
- names_from - a tidy select specification of the column(s) that contain the names to be used as new column names
- values_from - a tidy select specification of the column(s) that contain the values to be used as values in the new columns (e.g. the data to be widened)
To illustrate, we will use the data dataframe.
Resp1 | Resp2 | Between | Plot | Subplot | Within |
---|---|---|---|---|---|
8 | 17 | A1 | P1 | S1 | B1 |
10 | 18 | A1 | P1 | S1 | B2 |
7 | 17 | A1 | P1 | S2 | B1 |
11 | 21 | A1 | P1 | S2 | B2 |
14 | 19 | A2 | P2 | S3 | B1 |
12 | 13 | A2 | P2 | S3 | B2 |
11 | 24 | A2 | P2 | S4 | B1 |
9 | 18 | A2 | P2 | S4 | B2 |
14 | 25 | A3 | P3 | S5 | B1 |
11 | 18 | A3 | P3 | S5 | B2 |
8 | 27 | A3 | P3 | S6 | B1 |
2 | 22 | A3 | P3 | S6 | B2 |
8 | 17 | A1 | P4 | S7 | B1 |
10 | 22 | A1 | P4 | S7 | B2 |
7 | 16 | A1 | P4 | S8 | B1 |
12 | 13 | A1 | P4 | S8 | B2 |
11 | 23 | A2 | P5 | S9 | B1 |
12 | 19 | A2 | P5 | S9 | B2 |
12 | 23 | A2 | P5 | S10 | B1 |
10 | 21 | A2 | P5 | S10 | B2 |
3 | 17 | A3 | P6 | S11 | B1 |
11 | 16 | A3 | P6 | S11 | B2 |
13 | 26 | A3 | P6 | S12 | B1 |
7 | 28 | A3 | P6 | S12 | B2 |
Note, these data are not as long as they could be. Purely “long” data should have each observation in its own row, whereas the data dataframe has two observations (one for “Resp1” and one for “Resp2”) per row.
Widen the “Resp1” variable by the levels of the Within factor.
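A sketch of the widening call (assuming the new B1/B2 columns are to be drawn from the Within factor, whose levels match the output column names below):

```r
## spread Resp1 across new columns named after the levels of Within
data |>
  pivot_wider(id_cols = c(Between, Plot, Subplot),
              names_from = Within,
              values_from = Resp1)
```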
# A tibble: 12 × 5
Between Plot Subplot B1 B2
<fct> <fct> <fct> <int> <int>
1 A1 P1 S1 8 10
2 A1 P1 S2 7 11
3 A2 P2 S3 14 12
4 A2 P2 S4 11 9
5 A3 P3 S5 14 11
6 A3 P3 S6 8 2
7 A1 P4 S7 8 10
8 A1 P4 S8 7 12
9 A2 P5 S9 11 12
10 A2 P5 S10 12 10
11 A3 P6 S11 3 11
12 A3 P6 S12 13 7
Widen the “Resp1” and “Resp2” variables by the levels of the Within factor.
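A sketch of the corresponding call, this time supplying both response columns to values_from (by default, the new names combine the value column and the Within level, as in the output below):

```r
## widen both responses; names default to <value>_<Within level>
data |>
  pivot_wider(id_cols = c(Between, Plot, Subplot),
              names_from = Within,
              values_from = c(Resp1, Resp2))
```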
# A tibble: 12 × 7
Between Plot Subplot Resp1_B1 Resp1_B2 Resp2_B1 Resp2_B2
<fct> <fct> <fct> <int> <int> <int> <int>
1 A1 P1 S1 8 10 17 18
2 A1 P1 S2 7 11 17 21
3 A2 P2 S3 14 12 19 13
4 A2 P2 S4 11 9 24 18
5 A3 P3 S5 14 11 25 18
6 A3 P3 S6 8 2 27 22
7 A1 P4 S7 8 10 17 22
8 A1 P4 S8 7 12 16 13
9 A2 P5 S9 11 12 23 19
10 A2 P5 S10 12 10 23 21
11 A3 P6 S11 3 11 17 16
12 A3 P6 S12 13 7 26 28
Alternatively we could make the data longer before widening
data |>
pivot_longer(cols = starts_with("Resp")) |>
pivot_wider(names_from = c(name, Within),
values_from = value)
# A tibble: 12 × 7
Between Plot Subplot Resp1_B1 Resp2_B1 Resp1_B2 Resp2_B2
<fct> <fct> <fct> <int> <int> <int> <int>
1 A1 P1 S1 8 17 10 18
2 A1 P1 S2 7 17 11 21
3 A2 P2 S3 14 19 12 13
4 A2 P2 S4 11 24 9 18
5 A3 P3 S5 14 25 11 18
6 A3 P3 S6 8 27 2 22
7 A1 P4 S7 8 17 10 22
8 A1 P4 S8 7 16 12 13
9 A2 P5 S9 11 23 12 19
10 A2 P5 S10 12 23 10 21
11 A3 P6 S11 3 17 11 16
12 A3 P6 S12 13 26 7 28
10 Combining columns (unite)
If data are recorded with excessive fidelity, it may be useful to combine multiple fields into a single field. For example, if the date was recorded across three fields (year, month and day), we may like to combine these to form a single date field.
year | month | day | Resp1 |
---|---|---|---|
2009 | 11 | 09 | 25 |
2008 | 05 | 08 | 52 |
2009 | 05 | 23 | 18 |
2009 | 04 | 15 | 21 |
2008 | 10 | 25 | 45 |
2008 | 09 | 26 | 40 |
2008 | 07 | 05 | 38 |
2008 | 11 | 02 | 9 |
2009 | 08 | 19 | 0 |
2008 | 10 | 03 | 20 |
Let's take those data and combine (unite) the separate date components into a single date column before coercing it into a date object.
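A sketch of such a pipeline (the dataframe name data.d is hypothetical, since the table above is unnamed in the text):

```r
## combine year/month/day into a single string, then coerce to a Date
data.d |>                                   # hypothetical name
  unite(Date, year, month, day, sep = "-") |>
  mutate(Date = as.Date(Date))
```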
11 Separating columns (separate)
Separating variables is the opposite of uniting them. A field is separated by either indicating a character (or characters) to use as a separator, or else providing a fixed width format.
Year | Resp1 |
---|---|
F33 | 23 |
M27 | 55 |
M33 | 33 |
M26 | 36 |
F21 | 12 |
F34 | 49 |
M29 | 0 |
M34 | 18 |
F32 | 42 |
F26 | 0 |
For this example, we will split the Year column into two columns: one that specifies the gender and one that specifies the age. The values will be separated after the first character.
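A sketch of the call (the dataframe name data.s is hypothetical; supplying a numeric sep splits at that character position):

```r
## split after the first character: "F33" -> Gender = "F", Age = "33"
data.s |>                                   # hypothetical name
  separate(Year, into = c("Gender", "Age"), sep = 1)
```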
12 Merging (_join) data sets
It is common to have data associated with a particular study organised into a number of separate data tables (databases etc). In fact, large data sets are best managed in databases. However, statistical analyses generally require all data to be encapsulated within a single data structure. Therefore, prior to analysis, it is necessary to bring together multiple sources.
This phase of data preparation can be one of the most difficult to get right and verify that it is correct (particularly for large data sets).
Merging (or joining) involves creating a new data set that comprises information from two data sets. The resulting joined data set contains all fields from both data sets. The data sets are aligned according to the fields they have in common. Matching records in these common fields are used to select a row from each input data set to be combined.
There are numerous alternative ways of defining what should happen in the event that common records do not occur in both sets. For example, we could specify that only fully matching records be included in the final data set. Alternatively, we could specify that all records be included from both sets and thus the resulting data set will contain missing values. The following describe these various options.
- left join
- return all rows and columns from the left data set
- return all columns from the right data set
- new columns for unmatched rows from the right data sets receive NA values
- when there are multiple matches, all combinations included
- right join
- return all rows and columns from the right data set
- return all columns from the left data set
- new columns for unmatched rows from the left data sets receive NA values
- when there are multiple matches, all combinations included
- inner join
- return all columns from the left and right data set
- return only rows that match from left and right data sets
- when there are multiple matches, all combinations included
- semi join
- return all rows from the left data set that match with rows from the right data set
- keep only the columns from the left data set
- anti join
- return only the rows from the left data set that do not match with rows from the right data set
- keep only the columns from the left data set
- full join
- return all rows and columns from the left and right data set
- unmatched rows from either data set receive NA values in the associated new columns
To illustrate the above, we will use the data.bio and data.chem datasets. The data.bio dataset mimics a typical ecological data set in which there are a number of ecological responses (Resp1 and Resp2) along with the associated hierarchical design structure. This data set is intentionally missing data for Subplot S3. The data.chem dataset has the same hierarchical design structure, yet records two chemical analytes (Chem1 and Chem2). This data set is intentionally missing data for Subplot S7.
Resp1 | Resp2 | Between | Plot | Subplot |
---|---|---|---|---|
8 | 18 | A1 | P1 | S1 |
10 | 21 | A1 | P1 | S2 |
11 | 23 | A1 | P2 | S4 |
14 | 22 | A2 | P3 | S5 |
12 | 24 | A2 | P3 | S6 |
11 | 23 | A2 | P4 | S7 |
9 | 20 | A2 | P4 | S8 |
14 | 11 | A3 | P5 | S9 |
11 | 22 | A3 | P5 | S10 |
8 | 24 | A3 | P6 | S11 |
2 | 16 | A3 | P6 | S12 |
Chem1 | Chem2 | Between | Plot | Subplot |
---|---|---|---|---|
1.452878 | 0.8858208 | A1 | P1 | S1 |
3.266253 | 0.1800177 | A1 | P1 | S2 |
1.178652 | 5.0780682 | A1 | P2 | S3 |
13.400350 | 1.5762780 | A1 | P2 | S4 |
3.779183 | 1.6222430 | A2 | P3 | S5 |
1.196657 | 4.2369184 | A2 | P3 | S6 |
5.687807 | 2.9859003 | A2 | P4 | S8 |
4.834518 | 4.1328919 | A3 | P5 | S9 |
2.002931 | 3.6043314 | A3 | P5 | S10 |
12.326867 | 1.7763576 | A3 | P6 | S11 |
4.014221 | 0.2255188 | A3 | P6 | S12 |
A full (outer) join between data.bio and data.chem
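A sketch of the call producing the output below (the by specification lists the columns the two datasets share):

```r
## keep all rows from both datasets, filling non-matches with NA
data.bio |>
  full_join(data.chem, by = c("Between", "Plot", "Subplot"))
```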
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 11 23 A2 P4 S7 NA NA
7 9 20 A2 P4 S8 5.687807 2.9859003
8 14 11 A3 P5 S9 4.834518 4.1328919
9 11 22 A3 P5 S10 2.002931 3.6043314
10 8 24 A3 P6 S11 12.326867 1.7763576
11 2 16 A3 P6 S12 4.014221 0.2255188
12 NA NA A1 P2 S3 1.178652 5.0780682
Note that although all Subplots are present, there are missing values for Subplots S3 (last row) and S7.
An inner join between data.bio and data.chem
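A corresponding sketch:

```r
## keep only the rows present in both datasets
data.bio |>
  inner_join(data.chem, by = c("Between", "Plot", "Subplot"))
```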
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 9 20 A2 P4 S8 5.687807 2.9859003
7 14 11 A3 P5 S9 4.834518 4.1328919
8 11 22 A3 P5 S10 2.002931 3.6043314
9 8 24 A3 P6 S11 12.326867 1.7763576
10 2 16 A3 P6 S12 4.014221 0.2255188
Note that both Subplots S3 and S7 are missing.
A left join between data.bio and data.chem
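A corresponding sketch:

```r
## keep all bio rows; unmatched chem columns become NA
data.bio |>
  left_join(data.chem, by = c("Between", "Plot", "Subplot"))
```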
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 11 23 A2 P4 S7 NA NA
7 9 20 A2 P4 S8 5.687807 2.9859003
8 14 11 A3 P5 S9 4.834518 4.1328919
9 11 22 A3 P5 S10 2.002931 3.6043314
10 8 24 A3 P6 S11 12.326867 1.7763576
11 2 16 A3 P6 S12 4.014221 0.2255188
Note that Subplot S3 is missing and Subplot S7 only has NA values for the chem data.
A right join between data.bio and data.chem
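The output below (in which Subplot S3 appears with NA values in the bio columns, while S7 is absent) corresponds to a right join; a sketch:

```r
## keep all chem rows; unmatched bio columns become NA
data.bio |>
  right_join(data.chem, by = c("Between", "Plot", "Subplot"))
```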
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 9 20 A2 P4 S8 5.687807 2.9859003
7 14 11 A3 P5 S9 4.834518 4.1328919
8 11 22 A3 P5 S10 2.002931 3.6043314
9 8 24 A3 P6 S11 12.326867 1.7763576
10 2 16 A3 P6 S12 4.014221 0.2255188
11 NA NA A1 P2 S3 1.178652 5.0780682
Note that Subplot S7 is missing and Subplot S3 only has NA values for the bio data.
An anti join between data.bio and data.chem
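A corresponding sketch:

```r
## return only the bio rows that have no matching chem row
data.bio |>
  anti_join(data.chem, by = c("Between", "Plot", "Subplot"))
```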
Note that Subplot S7 is returned, indicating that it is present in the left dataset, yet absent from the right.
A semi join between data.bio and data.chem
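A corresponding sketch:

```r
## return the bio rows that do match a chem row,
## without adding any of the chem columns
data.bio |>
  semi_join(data.chem, by = c("Between", "Plot", "Subplot"))
```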
Resp1 Resp2 Between Plot Subplot
1 8 18 A1 P1 S1
2 10 21 A1 P1 S2
3 11 23 A1 P2 S4
4 14 22 A2 P3 S5
5 12 24 A2 P3 S6
6 9 20 A2 P4 S8
7 14 11 A3 P5 S9
8 11 22 A3 P5 S10
9 8 24 A3 P6 S11
10 2 16 A3 P6 S12
Note that Subplots S3 and S7 are missing and only the bio columns are included.
12.1 VLOOKUP in R
Lookup tables provide a way of inserting a column of data into a large data set such that the entries in the new column are determined by a relational match within another data set (the lookup table). For example, the main data set might contain data collected from a number of sites (plots). Elsewhere, we may have a data set that just contains the set of sites and their corresponding latitudes and longitudes (a geographical lookup table). We could incorporate these latitudes and longitudes into the main data set by merging against the geographical lookup table. In Excel this is referred to as VLOOKUP; in a relational database (and in the tidyverse) it is referred to as a join.
Let us again consider our data.bio data, but this time alongside the data.geo data. This latter dataset contains the latitude and longitude of each of the plots.
Resp1 | Resp2 | Between | Plot | Subplot |
---|---|---|---|---|
8 | 18 | A1 | P1 | S1 |
10 | 21 | A1 | P1 | S2 |
11 | 23 | A1 | P2 | S4 |
14 | 22 | A2 | P3 | S5 |
12 | 24 | A2 | P3 | S6 |
11 | 23 | A2 | P4 | S7 |
9 | 20 | A2 | P4 | S8 |
14 | 11 | A3 | P5 | S9 |
11 | 22 | A3 | P5 | S10 |
8 | 24 | A3 | P6 | S11 |
2 | 16 | A3 | P6 | S12 |
Plot | LAT | LONG |
---|---|---|
P1 | 17.9605 | 145.4326 |
P2 | 17.5210 | 146.1983 |
P3 | 17.0011 | 146.3839 |
P4 | 18.2350 | 146.7934 |
P5 | 18.9840 | 146.0345 |
P6 | 20.1154 | 146.4672 |
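A sketch of the lookup-style merge producing the output below (here Plot is the only field the two datasets need to share):

```r
## insert LAT/LONG from the geographical lookup table, matched by Plot
data.bio |>
  left_join(data.geo, by = "Plot")
```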
Resp1 Resp2 Between Plot Subplot LAT LONG
1 8 18 A1 P1 S1 17.9605 145.4326
2 10 21 A1 P1 S2 17.9605 145.4326
3 11 23 A1 P2 S4 17.5210 146.1983
4 14 22 A2 P3 S5 17.0011 146.3839
5 12 24 A2 P3 S6 17.0011 146.3839
6 11 23 A2 P4 S7 18.2350 146.7934
7 9 20 A2 P4 S8 18.2350 146.7934
8 14 11 A3 P5 S9 18.9840 146.0345
9 11 22 A3 P5 S10 18.9840 146.0345
10 8 24 A3 P6 S11 20.1154 146.4672
11 2 16 A3 P6 S12 20.1154 146.4672