I apologise in advance: this tutorial requires quite a bit of explaining and context before it can get into the code. Good data manipulation is an art form that requires the use of many specific tools (functions), and expert data manipulation comes from integrating these tools together. It is therefore necessary to have an overview of the tool set before investigating any single tool.
Rarely are data transcribed and organised into such simple and ready-to-go data sets. More typically, data are spread across multiple sources and in a variety of formats (particularly if compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses, it is necessary to compile very focused, tidy data sets.
Wickham (2014) suggested that there are many ways to organise data, yet tidy data (data that are structured in such a consistent way as to facilitate analyses) must adhere to a fairly strict set of structural rules. Specifically, in tidy data:
To achieve tidy data, common data preparations include:
This workshop will concentrate on these essential data preparation phases.
Practical data manipulation will be demonstrated via a series of very small artificial datasets. These datasets are presented in tables with black font and lines, and the R code required to generate those data will be presented in static code boxes either underneath or adjacent to the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing the following command:
(if online)
(if offline and running this from a local version, having first cloned the SUYRs documentation GitHub repository)
An overview of the example data sets follows.
These data comprise two responses (`Resp1` and `Resp2`) that are each observed on four occasions (`Time`) from each of three dosage treatments (`Dose`) nested within four plots (`Plot`). Two of the four plots completely excluded something and the other two were controls (did not exclude the thing).
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Control | P1 | L | 3 | 94.54 | 119.22 |
Control | P1 | L | 4 | 121.17 | 116.45 |
Control | P2 | H | 1 | 8.14 | 23.93 |
Control | P2 | H | 2 | 13.36 | 28.02 |
Control | P2 | H | 3 | 33.37 | 37.17 |
Control | P2 | H | 4 | 39.87 | 38.25 |
Control | P2 | M | 1 | 19.95 | 19.73 |
Control | P2 | M | 2 | 42.83 | 40.52 |
Control | P2 | M | 3 | 62.46 | 4.81 |
Control | P2 | M | 4 | 81.78 | 136.66 |
Control | P2 | L | 1 | 32.76 | 30.70 |
Control | P2 | L | 2 | 62.35 | 123.78 |
Control | P2 | L | 3 | 90.22 | 113.87 |
Control | P2 | L | 4 | 114.03 | 76.52 |
Exclusion | P3 | H | 1 | 21.86 | 23.58 |
Exclusion | P3 | H | 2 | 39.83 | 28.03 |
Exclusion | P3 | H | 3 | 59.53 | 21.32 |
Exclusion | P3 | H | 4 | 75.59 | 90.76 |
Exclusion | P3 | M | 1 | 38.57 | 30.63 |
Exclusion | P3 | M | 2 | 81.25 | 83.61 |
Exclusion | P3 | M | 3 | 124.08 | 124.09 |
Exclusion | P3 | M | 4 | 159.69 | 112.65 |
Exclusion | P3 | L | 1 | 61.16 | 39.53 |
Exclusion | P3 | L | 2 | 119.84 | 110.27 |
Exclusion | P3 | L | 3 | 175.87 | 286.33 |
Exclusion | P3 | L | 4 | 238.76 | 54.23 |
Exclusion | P4 | H | 1 | 18.82 | 28.60 |
Exclusion | P4 | H | 2 | 39.82 | 39.07 |
Exclusion | P4 | H | 3 | 63.30 | 93.43 |
Exclusion | P4 | H | 4 | 82.29 | 60.15 |
Exclusion | P4 | M | 1 | 39.51 | 45.90 |
Exclusion | P4 | M | 2 | 79.24 | 88.04 |
Exclusion | P4 | M | 3 | 122.09 | 84.19 |
Exclusion | P4 | M | 4 | 161.67 | 256.34 |
Exclusion | P4 | L | 1 | 57.93 | 85.24 |
Exclusion | P4 | L | 2 | 117.88 | 167.90 |
Exclusion | P4 | L | 3 | 181.09 | 314.49 |
Exclusion | P4 | L | 4 | 242.31 | 304.70 |
Note, as this data set is 48 rows long, the majority of examples that feature this dataset will only show the first 6 (or 10) rows to save space. This truncation will be performed via the `head()` function.
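For example, on a hypothetical 48-row data frame:

```r
# head() keeps only the first rows of a data frame (6 by default)
dat <- data.frame(x = 1:48)
head(dat)      # first 6 rows
head(dat, 10)  # first 10 rows
```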
These data comprise a Response observed on three occasions (`Time.0`, `Time.1` and `Time.2`) from each of two (`A1` and `A2`) treatments (`Between`, gridded or not) nested within four plots (`Plot`).
 | Plot | Between | Time.0 | Time.1 | Time.2 |
---|---|---|---|---|---|
R1 | P1 | A1 | 8 | 14 | 14 |
R2 | P2 | A1 | 10 | 12 | 11 |
R3 | P3 | A2 | 7 | 11 | 8 |
R4 | P4 | A2 | 11 | 9 | 2 |
Similar to `dat.1`, these data comprise two responses (`Resp1` and `Resp2`). However, these data have a different hierarchical structure. Each response was measured from one of two `Subplot`s nested within one of six `Plot`s. Within each `Subplot`, each of two treatment levels (`B1` and `B2`) were applied, and each pair of `Plot`s was exposed to one of the three `Between` levels.
Resp1 | Resp2 | Between | Plot | Subplot | Within |
---|---|---|---|---|---|
8 | 17 | A1 | P1 | S1 | B1 |
10 | 18 | A1 | P1 | S1 | B2 |
7 | 17 | A1 | P1 | S2 | B1 |
11 | 21 | A1 | P1 | S2 | B2 |
14 | 19 | A2 | P2 | S3 | B1 |
12 | 13 | A2 | P2 | S3 | B2 |
11 | 24 | A2 | P2 | S4 | B1 |
9 | 18 | A2 | P2 | S4 | B2 |
14 | 25 | A3 | P3 | S5 | B1 |
11 | 18 | A3 | P3 | S5 | B2 |
8 | 27 | A3 | P3 | S6 | B1 |
2 | 22 | A3 | P3 | S6 | B2 |
8 | 17 | A1 | P4 | S7 | B1 |
10 | 22 | A1 | P4 | S7 | B2 |
7 | 16 | A1 | P4 | S8 | B1 |
12 | 13 | A1 | P4 | S8 | B2 |
11 | 23 | A2 | P5 | S9 | B1 |
12 | 19 | A2 | P5 | S9 | B2 |
12 | 23 | A2 | P5 | S10 | B1 |
10 | 21 | A2 | P5 | S10 | B2 |
3 | 17 | A3 | P6 | S11 | B1 |
11 | 16 | A3 | P6 | S11 | B2 |
13 | 26 | A3 | P6 | S12 | B1 |
7 | 28 | A3 | P6 | S12 | B2 |
The `data.d` data comprise a single response (`Resp1`) collected on different days. The three different components of the dates have been recorded in three separate columns corresponding to the `year`, `month` and `day`.
year | month | day | Resp1 |
---|---|---|---|
2009 | 11 | 09 | 25 |
2008 | 05 | 08 | 52 |
2009 | 05 | 23 | 18 |
2009 | 04 | 15 | 21 |
2008 | 10 | 25 | 45 |
2008 | 09 | 26 | 40 |
2008 | 07 | 05 | 38 |
2008 | 11 | 02 | 9 |
2009 | 08 | 19 | 0 |
2008 | 10 | 03 | 20 |
The `data.bio` dataset represents a design that is like a cut-down version of `data` in which there are no `Within` `Subplot` levels and thus there are also fewer rows.
Resp1 | Resp2 | Between | Plot | Subplot |
---|---|---|---|---|
8 | 18 | A1 | P1 | S1 |
10 | 21 | A1 | P1 | S2 |
11 | 23 | A1 | P2 | S4 |
14 | 22 | A2 | P3 | S5 |
12 | 24 | A2 | P3 | S6 |
11 | 23 | A2 | P4 | S7 |
9 | 20 | A2 | P4 | S8 |
14 | 11 | A3 | P5 | S9 |
11 | 22 | A3 | P5 | S10 |
8 | 24 | A3 | P6 | S11 |
2 | 16 | A3 | P6 | S12 |
The `data.chem` dataset represents the same design as the `data.bio` data except that rather than containing two biological responses, it has two chemical measurements (`Chem1` and `Chem2`) collected from the same sampling units as the biological responses.
Chem1 | Chem2 | Between | Plot | Subplot |
---|---|---|---|---|
1.452878 | 0.8858208 | A1 | P1 | S1 |
3.266253 | 0.1800177 | A1 | P1 | S2 |
1.178652 | 5.0780682 | A1 | P2 | S3 |
13.400350 | 1.5762780 | A1 | P2 | S4 |
3.779183 | 1.6222430 | A2 | P3 | S5 |
1.196657 | 4.2369184 | A2 | P3 | S6 |
5.687807 | 2.9859003 | A2 | P4 | S8 |
4.834518 | 4.1328919 | A3 | P5 | S9 |
2.002931 | 3.6043314 | A3 | P5 | S10 |
12.326867 | 1.7763576 | A3 | P6 | S11 |
4.014221 | 0.2255188 | A3 | P6 | S12 |
The `data.geo` dataset represents geographical information about the `Plot`s in the `data.bio` dataset.
Plot | LAT | LONG |
---|---|---|
P1 | 17.9605 | 145.4326 |
P2 | 17.5210 | 146.1983 |
P3 | 17.0011 | 146.3839 |
P4 | 18.2350 | 146.7934 |
P5 | 18.9840 | 146.0345 |
P6 | 20.1154 | 146.4672 |
The `tikus` dataset is a data.frame version of the data with the same name that is provided as a list within the `mvabund` package. The data contain the abundance of 75 coral species from different locations (`rep`) over six years (`time`) at Tikus Island. The abundance was measured as the length (in cm) of a 10 m transect that intersected with the coral species.
Note, these data actually have 60 rows and 77 columns. In order to avoid displaying a huge table in this quick view, I have trimmed the view to just the first 10 rows and six columns.
 | time | rep | Psammocora contigua | Psammocora digitata | Pocillopora damicornis | Pocillopora verrucosa |
---|---|---|---|---|---|---|
V1 | 81 | 1 | 0 | 0 | 79 | 32 |
V2 | 81 | 2 | 0 | 0 | 51 | 21 |
V3 | 81 | 3 | 0 | 0 | 42 | 35 |
V4 | 81 | 4 | 0 | 0 | 15 | 0 |
V5 | 81 | 5 | 0 | 0 | 9 | 0 |
V6 | 81 | 6 | 0 | 0 | 72 | 0 |
V7 | 81 | 7 | 0 | 0 | 0 | 41 |
V8 | 81 | 8 | 0 | 0 | 16 | 25 |
V9 | 81 | 9 | 0 | 0 | 0 | 38 |
V10 | 81 | 10 | 0 | 0 | 16 | 0 |
The great folks over at RStudio have produced an excellent set of cheatsheets on a range of topics. For this tutorial, the Data Transformation Cheat Sheet and the Data Tidying Cheat Sheet (mainly just the first page) are useful summaries.
- https://github.com/rstudio/cheatsheets/raw/main/data-import.pdf
- https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf
- https://github.com/rstudio/cheatsheets/raw/main/lubridate.pdf
- (`purrr`) https://github.com/rstudio/cheatsheets/raw/main/tidyeval.pdf
`tidyverse` - a data manipulation ecosystem within R

There are numerous packages and base R routines devoted to data manipulation. Notable packages include `data.table`, `plyr`, `dplyr` and `doBy`. Indeed, earlier versions of this tutorial featured examples of each of these packages. However, an entire family of packages from Hadley Wickham's group now stands out as a comprehensive, intuitive suite of tools for data manipulation and visualisation.
Importantly, all of these packages are designed to integrate together and complement one another with a consistent interface. To simplify installing an entire data ecosystem, the tidyverse package is available. Installing this package (via `install.packages('tidyverse')`) will install the following packages (bold entries are automatically loaded into the global environment when loading the `tidyverse` package):
- **`dplyr`** - for dataframe manipulation
- **`tidyr`** - for data tidying
- **`readr`** - for importing data
- **`ggplot2`** - for visualising data
- **`purrr`** - for functional programming
- **`tibble`** - for tibbles
- **`stringr`** - for string manipulation
- **`forcats`** - for factor (categorical) manipulation
- `lubridate` - for dates/times manipulation
- `hms` - for time manipulation
- `broom` - for tidying model outputs
- `DBI` - for database interaction
- `haven` - for importing SAS, SPSS and Stata files
- `httr` - for web APIs
- `jsonlite` - for working with JSON
- `rvest` - for web scraping
- `xml2` - for manipulating XML structures
- `modelr` - for model manipulation

As this tutorial will focus largely on the `tidyverse` ecosystem, we will now load (attach) this library (actually, eight libraries) to the global environment so that the functionality is available.
── Attaching core tidyverse packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ purrr 1.0.1 ✔ tidyr 1.3.0
── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Note that when loading the `tidyverse` package, we are given feedback about which other packages have been attached. Importantly, also notice that the output highlights a number of conflicts. In the above example, this indicates that the definition of the `filter` function is now taken from the `dplyr` package rather than the `stats` package. Essentially, both packages define a function called `filter`, and this is just indicating that in the current environment, the `filter` function from the `dplyr` package will be the one used. On a case-by-case basis, this can be overruled by prefixing the function name with the package namespace. For example, to ensure that the `stats` package version of `filter` is used, we can engage it as `stats::filter()`.
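A brief base-R sketch of why the namespace matters: `stats::filter()` performs linear filtering of a time series, a completely different operation from `dplyr::filter()`'s row subsetting, so a namespace-qualified call removes any ambiguity:

```r
# stats::filter() applies a linear filter to a series (here a two-point
# moving average); it does not subset rows like dplyr::filter() does.
x <- c(1, 2, 3, 4, 5)
smoothed <- stats::filter(x, filter = c(0.5, 0.5), sides = 1)
smoothed  # first value is NA (no preceding value to average with)
```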
Hadley and his collaborators argue that there is a grammar of data manipulation and that manipulation comprises a set of verbs. Furthermore, the philosophy adheres to the UNIX ideal that rather than attempting to build large complex tools that perform a large set of operations, it is better (with regards to usability and robustness) to build a tool set in which each tool performs a single, specific task, and the individual tools can be bolted together to achieve complex operations.
The core of the `tidyverse` data manipulation ecosystem tools (verbs) can be organised into five categories within two types on the basis of what aspects of the data they operate on:

- `dplyr::arrange` - changing the order of the rows (sorting)
- `dplyr::filter` - subset of rows based on column values
- `dplyr::slice` - subset of rows based on position
- `dplyr::select` - subset of columns
- `dplyr::rename` - change the names of columns
- `dplyr::pull` - extract a single column as a vector
- `dplyr::distinct` - unique combinations of column values
- `dplyr::mutate` - adding columns and modifying column values
- `tidyr::unite` - combine multiple columns together
- `tidyr::separate` - separate a single column into multiple columns
- `dplyr::summarise` - aggregating (collapsing) to a single row
- `dplyr::count` - count the number of unique combinations
- `dplyr::group_by` - define groups of rows
- `tidyr::pivot_longer` - lengthen data from wide format
- `tidyr::pivot_wider` - widen data from long format
- `_join` - merge (join) two data sets together based on common field(s)

In base R, variables (columns) are referred to by either their name (prefixed by the environment - the data.frame etc.), their name as a string, or their index (position). For example, to access a variable called "Var1" within a dataset called "data", we could use either `data$Var1`, `data[, "Var1"]` or `data[, 1]` (if the focal variable was in the first column).
The `tidyverse` ecosystem adopts an alternative concept called tidy evaluation to make referring to columns more flexible.

- data-masking - refer to variables as if they were directly attached to the current environment (e.g. `Var1` instead of `data$Var1`). This form of variable referral is used within: `arrange()`, `filter()`, `count()`, `mutate()`, `summarise()`, `group_by()`
- tidy selection - refer to variables by their position, name or type (e.g. `starts_with("var")`). This form of variable referral is used within: `select()`, `rename()`, `pull()`
- `across()` - which brings tidy-selection semantics to data-masking (and thus functions like `mutate()`)
The following table highlights the various tidy-selection options. The examples all illustrate a hypothetical data set (`data`) with hypothetical columns (`Var1`, `Var2`, etc.) and cannot be evaluated directly. They all mostly illustrate the concept using the `select()` function.
tidy-selection | Description | Examples |
---|---|---|
Bare names | Select columns based on their names. | |
Ranges of columns | Select columns based on a range of names or column numbers. The selections will be inclusive. Prepending with a negative sign (`-`) excludes the selection. | |
Boolean helper functions: `contains()`, `starts_with()`, `ends_with()`, `matches()`, `num_range()`, `everything()`, `where()`, `all_of()`, `any_of()` | Select columns based on evaluating functions on the column names. | |
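A hedged sketch of the two referral styles (assuming `dplyr` is installed; `dat`, `Var1` and `Var2` are made up for illustration):

```r
library(dplyr)

dat <- data.frame(Var1 = 1:3, Var2 = c(10, 20, 30))

# data-masking: refer to Var1 as if it were an object in the environment
filter(dat, Var1 > 1)

# tidy selection: refer to columns by a name pattern
select(dat, starts_with("Var"))
```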
Typically, data manipulation/preparations comprise multiple steps and stages in which a series of alterations, additions etc are performed sequentially such that the data are progressively molded into a form appropriate for analysis etc. That is, to achieve the desired result, we must bolt multiple tools (verbs) together.
Traditionally, this would have involved a separate expression for each step often resulting in the generation of numerous intermediate data sets. Furthermore, in an attempt to reduce the number of intermediate steps, functions are often nested within other functions such that alterations are made inline within another function.
For example, the following pseudo-code examples illustrate the traditional and nested approaches to achieving numerous data manipulation steps:
Collectively, these practices can yield code that is very difficult to read and interpret.
A long-honoured Unix coding principle is that each application should focus on performing one action and performing it well. Performing a sequence of actions therefore involves piping (via the Unix pipe character `|`) the output of one application to the input of another, and so on in a chain. The grammar of data wrangling also adopts this principle (each tool specialises in one action, and tools are piped together to achieve a sequence of actions).
The piping (glue) operator in the tidyverse is `%>%`. An object on the left-hand side of the `%>%` operator is passed as the first argument of the function on the right-hand side.
In pseudo code, the piping approach to the above manipulation would be:
If the code is set out as above (with each verb on a separate line), it gives an opportunity to provide a short comment to the right side of each line to provide useful documentation.
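A minimal runnable sketch of this layout, using base R's native `|>` pipe (R >= 4.1), which reads the same way as `%>%` here; the vector and steps are made up:

```r
# Each verb on its own line, with a comment documenting that step
result <- c(2.3, 5.1, 8.4) |>
  log() |>      # log-transform each value
  sum() |>      # collapse to a single total
  round(2)      # round to two decimal places
result
```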
As a motivating (if not a bit extreme) example, let's say we wanted to calculate the logSumExp function:
\[ \log\left(\sum^{n}_{i=1} e^{x_i}\right) \]
[1] 9.316408
[1] 9.316408
The piping approach could be:
[1] 9.316408
[1] 9.316408
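The calculation itself is a one-liner; a sketch using a made-up vector `x` (the tutorial's own vector, which produces 9.316408, is not reproduced here):

```r
x <- c(3.4, 7.4, 1.6)

nested <- log(sum(exp(x)))              # read inside-out
piped  <- x |> exp() |> sum() |> log()  # read left-to-right

nested == piped
```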
To reiterate, the following three are equivalent:
[1] 29.653639 1601.918872 5.101634 118.918637 7140.686681 252.361318
[7] 9.175114 4.863565 1756.825350 199.466617
[1] 29.653639 1601.918872 5.101634 118.918637 7140.686681 252.361318
[7] 9.175114 4.863565 1756.825350 199.466617
[1] 29.653639 1601.918872 5.101634 118.918637 7140.686681 252.361318
[7] 9.175114 4.863565 1756.825350 199.466617
as are the following:
[1] 0.5301465 0.8679950 0.2120706 0.6792861 0.9480981 0.7427928 0.3456667
[8] 0.1991438 0.8733941 0.7239190
[1] 0.5301465 0.8679950 0.2120706 0.6792861 0.9480981 0.7427928 0.3456667
[8] 0.1991438 0.8733941 0.7239190
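To spell the equivalence out on a small made-up vector (assuming `magrittr`, which supplies the tidyverse's `%>%`):

```r
library(magrittr)

x <- c(1.5, 2.5)

a <- exp(x)       # ordinary function call
b <- x %>% exp()  # pipe into a call
d <- x %>% exp    # pipe onto a bare function name
```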
Most of the following examples will demonstrate isolated data
manipulation actions (such as filtering, summarising or joining) as this
focuses on the specific uses of these functions without the distractions
and complications of other actions. For isolated uses, piping has little
(if any) advantages. Nevertheless, in recognition that data
manipulations rarely comprise a single action (rather they are a series
of linked actions), for all remaining examples demonstrated in the
tidyverse
(dplyr
/tidyr
) context,
piping will be used.
`data.frame`s are collections of variables of identical length (yet not necessarily the same type) that are the fundamental data structures used by most modelling routines in R. Nevertheless, there are numerous properties of `data.frame`s that make them less than ideal. `tibble`s have been more recently engineered to address these shortcomings:
if a `data.frame` is very large, the `print` method can result in output that is too large to be useful. By contrast, `tibble`s truncate the output to a maximum of 10 rows and as many columns as will fit in the output. The `tibble` `print` method also displays the class (type) of each column (variable).
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
7 Control P1 M 3 61.46 98.44
8 Control P1 M 4 82.21 160.01
9 Control P1 L 1 31.73 21.22
10 Control P1 L 2 59.08 37.51
11 Control P1 L 3 94.54 119.22
12 Control P1 L 4 121.17 116.45
13 Control P2 H 1 8.14 23.93
14 Control P2 H 2 13.36 28.02
15 Control P2 H 3 33.37 37.17
16 Control P2 H 4 39.87 38.25
17 Control P2 M 1 19.95 19.73
18 Control P2 M 2 42.83 40.52
19 Control P2 M 3 62.46 4.81
20 Control P2 M 4 81.78 136.66
21 Control P2 L 1 32.76 30.70
22 Control P2 L 2 62.35 123.78
23 Control P2 L 3 90.22 113.87
24 Control P2 L 4 114.03 76.52
25 Exclusion P3 H 1 21.86 23.58
26 Exclusion P3 H 2 39.83 28.03
27 Exclusion P3 H 3 59.53 21.32
28 Exclusion P3 H 4 75.59 90.76
29 Exclusion P3 M 1 38.57 30.63
30 Exclusion P3 M 2 81.25 83.61
31 Exclusion P3 M 3 124.08 124.09
32 Exclusion P3 M 4 159.69 112.65
33 Exclusion P3 L 1 61.16 39.53
34 Exclusion P3 L 2 119.84 110.27
35 Exclusion P3 L 3 175.87 286.33
36 Exclusion P3 L 4 238.76 54.23
37 Exclusion P4 H 1 18.82 28.60
38 Exclusion P4 H 2 39.82 39.07
39 Exclusion P4 H 3 63.30 93.43
40 Exclusion P4 H 4 82.29 60.15
41 Exclusion P4 M 1 39.51 45.90
42 Exclusion P4 M 2 79.24 88.04
43 Exclusion P4 M 3 122.09 84.19
44 Exclusion P4 M 4 161.67 256.34
45 Exclusion P4 L 1 57.93 85.24
46 Exclusion P4 L 2 117.88 167.90
47 Exclusion P4 L 3 181.09 314.49
48 Exclusion P4 L 4 242.31 304.70
# A tibble: 48 × 6
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 H 3 27.5 29.8
4 Control P1 H 4 44.8 25.4
5 Control P1 M 1 21.0 20.3
6 Control P1 M 2 37.5 17.6
7 Control P1 M 3 61.5 98.4
8 Control P1 M 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
`data.frame`s have very strict column naming rules, and when these are not satisfied, the names will be altered in order to adhere to the rules. `tibble`s permit a wider range of names.
character vectors are often coerced into factors (categorical variables). This is not the case with `tibble`s.
when vectors are added that are of a different length to the others in the `data.frame`, the values of the shorter vector(s) are recycled until all vectors are the same length. This behaviour can be dangerous and is not permitted in `tibble`s (except for vectors of length one).
Var1 Var2 Var3
1 1 1 1
2 2 2 1
3 3 1 1
4 4 2 1
5 5 1 1
6 6 2 1
Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 6: Existing data.
• Size 2: Column `Var2`.
ℹ Only values of size one are recycled.
# A tibble: 6 × 2
Var1 Var2
<int> <dbl>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
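A sketch of the recycling difference (assuming the `tibble` package is installed; the columns are made up):

```r
library(tibble)

df <- data.frame(Var1 = 1:6, Var2 = 1:2)  # silently recycles Var2 three times
tb_bad <- try(tibble(Var1 = 1:6, Var2 = 1:2), silent = TRUE)  # refuses: error
tb_ok  <- tibble(Var1 = 1:6, Var2 = 1)    # only length-one values are recycled
```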
when sub-setting via column indices, `data.frame`s return a vector (when only a single index is provided) or a data.frame (if multiple indices are provided). This inconsistency is problematic in the context of a processing pipeline. `tibble`s will always return a `tibble` from column indices.
[1] P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P3
[26] P3 P3 P3 P3 P3 P3 P3 P3 P3 P3 P3 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4
Levels: P1 P2 P3 P4
`data.frame`s permit partial matching via `$` indexing. This can be problematic in the context of a processing pipeline. `tibble`s expressly forbid this.
`data.frame`s have very clumsy support for list columns (a list column is a column whose cells contain lists). List columns are better supported in `tibble`s. The ability to support list columns is integral to the functional programming routines of the `purrr` package.
In this fictitious example, I would like to store three different `data.frame`s (or any three objects for that matter) in the three cells of a variable (e.g. `Var2`). This is not possible in a `data.frame` - it will coerce the `Var2` column into multiple columns (one for each object) and thus will also recycle the length of each to ensure they are all the same length. Once the data have been coerced into a `data.frame`, the structure is lost (we can no longer access `Var2`). These issues are addressed in the `tibble`.
Many data manipulation actions involve the use of specific auxiliary functions to act on specific parts of the data set. For example, if we wish to summarise a data set, we might apply a `mean()` function to the numeric vectors (variables) and a `levels()` or `unique()` function to character or factor vectors. That is, the function is applied to each vector as a whole. Such functions are called summary functions.
On the other hand, if we wish to generate log-transformed versions of
the numeric vectors, then we would apply the log()
function to each value within those numeric vectors.
Furthermore, we might use other auxiliary functions that return vectors
of either boolean values (TRUE or FALSE) or integers that can represent
row or column indices to determine which observations to perform actions
on.
Broadly speaking, summary functions take a vector
and return a single value. Familiar examples of summary functions are
mean()
, var()
, min()
etc. By
contrast, vectorised functions take a vector and return
a vector of the same length as the original input vector.
rank()
, cumsum()
and log()
are
all examples of such vectorised (window) functions.
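A base-R sketch of the distinction (the vector is made up):

```r
x <- c(3.4, 7.4, 1.6, 4.8)

mean(x)    # summary function: length 4 in, length 1 out
cumsum(x)  # vectorised (window) function: length 4 in, length 4 out
rank(x)    # likewise returns one rank per input value
```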
The `dplyr` package introduces a number of additional summary and vectorised functions that are deemed useful augmentations of the standard set of functions available in base R. The following tables describe the most useful functions (from `base` and `dplyr`) along with the manipulation functions to which they can be applied. Those functions defined in `dplyr` are shown with the `dplyr` namespace.
There are also some operations that appear to provide a mixture of
both summary and vectorised functions. The scale()
function (which centres and scales a vector), for example,
returns a vector of scaled values as well as storing the original mean
and standard deviation (summaries) as attributes.
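For example:

```r
# scale() returns a (column) matrix of centred/scaled values, while the
# original mean and standard deviation are retained as attributes
x <- c(1, 2, 3)
s <- scale(x)
attr(s, "scaled:center")  # the mean of x
attr(s, "scaled:scale")   # the standard deviation of x
```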
To demonstrate summary and vectorised functions, the following vectors will be used.
[1] 3.389585 7.378957 1.629561 4.778440 8.873564 5.530862 2.216495 1.581772
[9] 7.471264 5.295647
[1] "a" "a" "b" "b" "a" "b" "a" "b" "a" "b"
[1] FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
Function | Description | Examples |
---|---|---|
`mean()`, `median()` | Arithmetic mean and median. Note, boolean vectors are stored as a set of 0 (FALSE) and 1 (TRUE) and thus mathematical functions operate on them as if they were numbers. | |
`sum()`, `prod()` | Sum and product. | |
`var()`, `sd()` | Variance and standard deviation. | |
`mad()`, `IQR()` | Median absolute deviation and inter-quartile range. | |
`min()`, `max()` | Minimum and maximum. | |
`quantile()` | Quantiles. | |
`dplyr::first()`, `dplyr::last()`, `dplyr::nth()` | First, last and nth value. | |
`dplyr::n()`, `dplyr::n_distinct()` | Number of values, number of distinct values. | |
Function | Description | Examples |
---|---|---|
`+`, `-`, `*`, `/`, `^`, `%/%`, `%%` | Typical arithmetic operators. | |
`log()`, `log2()`, `log10()`, `exp()` | Logarithms and exponentials. | |
`<`, `<=`, `>`, `>=`, `!=`, `==` | Logical operators. | |
`between()` | Whether a value is between two numbers. | |
`near()` | A safe way of assessing equality (`==`) in floating points. | |
`lag()`, `lead()` | Shift cases down/up by one. | |
`cummax()`, `cummin()`, `dplyr::cummean()` | Cumulative max, min and mean. | |
`cumsum()`, `cumprod()` | Cumulative sum and product. | |
`dplyr::cumall()`, `dplyr::cumany()` | Cumulative all and any (mainly for use with filtering). | |
`rank()`, `order()` | Rank and order of items. | |
`dplyr::min_rank()`, `dplyr::dense_rank()`, `dplyr::percent_rank()` | Rank in which ties = min, without gaps and expressed as a percentage. | |
`dplyr::row_number()` | Rank in which ties = first. | |
`dplyr::cume_dist()` | Cumulative empirical distribution (proportion less than current value). | |
`dplyr::ntile()` | Partition into (n) bins. | |
`dplyr::if_else()` | Elementwise (case by case) if and else. | |
`dplyr::case_when()` | Elementwise multiple if and else. | |
Sorting (`arrange`)

Most statistical analyses are invariant to the data order, and thus data reordering is typically only for aesthetics in tables and figures.
Sorting data has the potential to be one of the most dangerous forms of data manipulation - particularly in spreadsheets, in which there is no real binding between individual columns or rows. It is far too easy to accidentally sort the data in a single column (or row) without applying the ordering to all the other columns in the data set, thereby resulting in broken data.
Whilst the above apocalypse is still possible in R, the data structures and manipulation interfaces mean that you really have to try in order to break the data in this way. Furthermore, you are encouraged to store reordered data in a different object from the original data, and hence 'rolling back' is trivial.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P2 H 1 8.14 23.93
3 Control P2 H 2 13.36 28.02
4 Exclusion P4 H 1 18.82 28.60
5 Control P2 M 1 19.95 19.73
6 Control P1 H 2 20.55 25.94
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P4 L 4 242.31 304.70
2 Exclusion P3 L 4 238.76 54.23
3 Exclusion P4 L 3 181.09 314.49
4 Exclusion P3 L 3 175.87 286.33
5 Exclusion P4 M 4 161.67 256.34
6 Exclusion P3 M 4 159.69 112.65
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P2 H 1 8.14 23.93
3 Control P2 H 2 13.36 28.02
4 Exclusion P4 H 1 18.82 28.60
5 Control P1 H 2 20.55 25.94
6 Exclusion P3 H 1 21.86 23.58
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P2 H 1 8.14 23.93
3 Control P2 M 1 19.95 19.73
4 Control P1 M 1 20.99 20.31
5 Control P2 H 2 13.36 28.02
6 Exclusion P3 H 1 21.86 23.58
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P2 H 1 8.14 23.93
6 Control P2 H 2 13.36 28.02
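The sorting forms shown in the outputs above can be sketched on a small stand-in data frame (assuming `dplyr` is installed; `dat` and its values are made up, not the tutorial's `data.1`):

```r
library(dplyr)

dat <- data.frame(Trt = c("B", "A", "A"), Resp = c(5, 9, 2))

arrange(dat, Resp)        # ascending by Resp
arrange(dat, desc(Resp))  # descending by Resp
arrange(dat, Trt, Resp)   # by Trt, then by Resp within Trt
```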
We regularly want to run an analysis or generate tabular/graphic summaries for a sub-set of the data. For example, we may wish to reduce the number of columns in the data frame to remove excessive and unwanted data fields. Alternatively (or additionally), we may wish to focus on a subset of the rows. These two forms of sub-setting are:
- `select`ing - reducing the number of columns
- `filter`ing - reducing the number of rows

Selecting (`select`)

Selecting works by either including (or excluding) the column names that you indicate, or via special selection 'helper' functions that pass a vector of column indices to include in the subset data.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
The `select()` function uses tidy-select semantics.
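The selection forms demonstrated by the outputs below can be sketched on a small stand-in data frame (assuming `dplyr`; the column names mirror `data.1` but the values are made up):

```r
library(dplyr)

dat <- data.frame(Treatment = "Control", Plot = "P1", Dose = "H",
                  Time = 1, Resp1 = 8.12, Resp2 = 3.06)

select(dat, Treatment, Dose, Time, Resp1)  # include by name
select(dat, -Plot, -Resp2)                 # exclude by name
select(dat, Treatment:Time)                # inclusive range
select(dat, starts_with("R"))              # helper function
```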
Select specific variables
Treatment Dose Time Resp1
1 Control H 1 8.12
2 Control H 2 20.55
3 Control H 3 27.49
4 Control H 4 44.79
5 Control M 1 20.99
6 Control M 2 37.54
Exclude specific variables
Treatment Dose Time Resp1
1 Control H 1 8.12
2 Control H 2 20.55
3 Control H 3 27.49
4 Control H 4 44.79
5 Control M 1 20.99
6 Control M 2 37.54
Include columns Treatment
through to
Time
Treatment Plot Dose Time
1 Control P1 H 1
2 Control P1 H 2
3 Control P1 H 3
4 Control P1 H 4
5 Control P1 M 1
6 Control P1 M 2
Exclude columns Treatment
through to
Time
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Exclude columns Treatment
through to
Time
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Names containing an ‘r’ (case insensitive)
Treatment Resp1 Resp2
1 Control 8.12 3.06
2 Control 20.55 25.94
3 Control 27.49 29.85
4 Control 44.79 25.39
5 Control 20.99 20.31
6 Control 37.54 17.62
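The `contains()` helper likely produced this subset (Treatment, Resp1 and Resp2 all contain an "r"):

```r
dat.1 %>% select(contains("r")) %>% head()
```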
Names starting with ‘r’ (case insensitive)
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Names ending in ‘e’ (case insensitive)
Dose Time
1 H 1
2 H 2
3 H 3
4 H 4
5 M 1
6 M 2
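Similarly, names ending in a given string are matched with `ends_with()`:

```r
dat.1 %>% select(ends_with("e")) %>% head()
```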
Names that are exactly four characters long
Plot Dose Time
1 P1 H 1
2 P1 H 2
3 P1 H 3
4 P1 H 4
5 P1 M 1
6 P1 M 2
num_range
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
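The `num_range()` helper matches a common prefix followed by a numeric range:

```r
dat.1 %>% select(num_range("Resp", 1:2)) %>% head()
```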
Resp1 Resp2 Treatment Plot Dose Time
1 8.12 3.06 Control P1 H 1
2 20.55 25.94 Control P1 H 2
3 27.49 29.85 Control P1 H 3
4 44.79 25.39 Control P1 H 4
5 20.99 20.31 Control P1 M 1
6 37.54 17.62 Control P1 M 2
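One way to move the responses to the front while keeping all other columns is to combine a selection with `everything()` (a sketch; the original code box is not shown):

```r
dat.1 %>% select(starts_with("Resp"), everything()) %>% head()
```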
all_of
/ any_of
Select from a vector of strings
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
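Selecting from a character vector of names is done with `all_of()`:

```r
vars <- c("Resp1", "Resp2")
dat.1 %>% select(all_of(vars)) %>% head()
```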
Select from a vector of strings that may include names absent from the data (any_of)…
Resp1 Resp2
1 8.12 3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Treatment Plot Dose Time
1 Control P1 H 1
2 Control P1 H 2
3 Control P1 H 3
4 Control P1 H 4
5 Control P1 M 1
6 Control P1 M 2
Since tibble
s permit column names that have special
characters in them, in order to refer to a column that has a name with
special characters, it is necessary to enclose the name in backticks.
For example, to select the variable, Pocillopora damicornis
(which contains a space in the name - these are not permitted in
data.frame
s, yet are permitted in tibble
s) and
print
the first six rows:
Pocillopora damicornis
V1 79
V2 51
V3 42
V4 15
V5 9
V6 72
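A sketch of such a selection (the name of the tibble holding this variable is not shown, so `dat.2` here is a placeholder):

```r
## dat.2 is a hypothetical tibble containing the variable
dat.2 %>% select(`Pocillopora damicornis`) %>% head()
```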
Renaming columns (rename)
Note, it is possible to have column names renamed during a select action.
Both the rename()
and select()
functions use tidy-select semantics.
Weight
1 8.12
2 20.55
3 27.49
4 44.79
5 20.99
6 37.54
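Renaming within a select takes the form `new_name = old_name`:

```r
dat.1 %>% select(Weight = Resp1) %>% head()
```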
If we want to retain the other variables, we would also have to
include them in the select
either explicitly, or via a
helper function.
Treatment Plot Dose Time Weight Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
However, note that this might not retain the order of the columns.
Hence, for simple renaming of columns, the rename
function is more convenient.
Treatment Plot Dose Time Weight Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
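The `rename()` call keeps all columns in their original order:

```r
dat.1 %>% rename(Weight = Resp1) %>% head()
```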
It is also possible to rename columns based on a lookup (list or data.frame). This is handy for cases when data columns have conveniently abbreviated names yet you are preparing data for tabular output - and thus need more descriptive column names.
lookup <- list("Response 1" = "Resp1",
"Response 2" = "Resp2")
dat.1 %>%
rename(!!!lookup) %>%
head()
Treatment Plot Dose Time Response 1 Response 2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
In the above example, the big bang operator
!!!
force-splices the list. That is, the elements of
the list are spliced into the statement as if they had been included
directly.
To do the same from a data.frame lookup:
lookup <- tribble(
~Abbr_name, ~Long_name,
"Resp1", "Response 1",
"Resp2", "Response 2")
## Convert to list of pairs
lookup <- lookup %>%
select(Long_name, Abbr_name) %>%
deframe() %>%
list()
dat.1 %>%
rename(!!!lookup) %>%
head()
Treatment Plot Dose Time Response 1 Response 2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
Extracting a column as a vector (pull)
As indicated earlier, single column indices on tibbles return a single column tibble. To extract a single column as a vector, there is the pull function.
The pull()
function uses tidy-select semantics.
[1] 8.12 20.55 27.49 44.79 20.99 37.54 61.46 82.21 31.73 59.08
[11] 94.54 121.17 8.14 13.36 33.37 39.87 19.95 42.83 62.46 81.78
[21] 32.76 62.35 90.22 114.03 21.86 39.83 59.53 75.59 38.57 81.25
[31] 124.08 159.69 61.16 119.84 175.87 238.76 18.82 39.82 63.30 82.29
[41] 39.51 79.24 122.09 161.67 57.93 117.88 181.09 242.31
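The vector above comes from pulling a single column:

```r
dat.1 %>% pull(Resp1)
```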
Subsetting rows (filter)
Filtering selects rows for which a condition evaluates to TRUE. Hence, any logical expression or vectorised function that returns Boolean values (TRUE or FALSE) can be used for filtering.
The filter()
function uses data-masking semantics.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Helper function | Combine multiple selections … |
---|---|
if_any() |
With an OR |
if_all() |
With an AND |
Note, by default, the above searches are NOT case sensitive
Logical/Boolean function | Description |
---|---|
== |
is equal to |
!= |
is not equal to |
< |
is less than |
> |
is greater than |
<= |
is less than or equal to |
>= |
is greater than or equal to |
is.na() |
is equal to NA |
!is.na() |
is not equal to NA |
%in% |
is in the following vector |
! |
not |
& && |
logical AND (& is elementwise; && expects single values) |
| || |
logical OR (| is elementwise; || expects single values) |
xor() |
elementwise exclusive OR |
isTRUE() |
is true |
isFALSE() |
is false |
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P2 H 1 8.14 23.93
6 Control P2 H 2 13.36 28.02
7 Control P2 H 3 33.37 37.17
8 Control P2 H 4 39.87 38.25
9 Exclusion P3 H 1 21.86 23.58
10 Exclusion P3 H 2 39.83 28.03
11 Exclusion P3 H 3 59.53 21.32
12 Exclusion P3 H 4 75.59 90.76
13 Exclusion P4 H 1 18.82 28.60
14 Exclusion P4 H 2 39.82 39.07
15 Exclusion P4 H 3 63.30 93.43
16 Exclusion P4 H 4 82.29 60.15
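The 16 high-dose rows above were most likely obtained with a simple equality condition:

```r
dat.1 %>% filter(Dose == "H")
```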
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
7 Control P1 M 3 61.46 98.44
8 Control P1 M 4 82.21 160.01
9 Control P2 H 1 8.14 23.93
10 Control P2 H 2 13.36 28.02
11 Control P2 H 3 33.37 37.17
12 Control P2 H 4 39.87 38.25
13 Control P2 M 1 19.95 19.73
14 Control P2 M 2 42.83 40.52
15 Control P2 M 3 62.46 4.81
16 Control P2 M 4 81.78 136.66
17 Exclusion P3 H 1 21.86 23.58
18 Exclusion P3 H 2 39.83 28.03
19 Exclusion P3 H 3 59.53 21.32
20 Exclusion P3 H 4 75.59 90.76
21 Exclusion P3 M 1 38.57 30.63
22 Exclusion P3 M 2 81.25 83.61
23 Exclusion P3 M 3 124.08 124.09
24 Exclusion P3 M 4 159.69 112.65
25 Exclusion P4 H 1 18.82 28.60
26 Exclusion P4 H 2 39.82 39.07
27 Exclusion P4 H 3 63.30 93.43
28 Exclusion P4 H 4 82.29 60.15
29 Exclusion P4 M 1 39.51 45.90
30 Exclusion P4 M 2 79.24 88.04
31 Exclusion P4 M 3 122.09 84.19
32 Exclusion P4 M 4 161.67 256.34
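Membership of a set of levels is tested with `%in%`:

```r
dat.1 %>% filter(Dose %in% c("H", "M"))
```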
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 M 1 20.99 20.31
4 Control P2 H 1 8.14 23.93
5 Control P2 H 2 13.36 28.02
6 Control P2 M 1 19.95 19.73
7 Exclusion P3 H 1 21.86 23.58
8 Exclusion P4 H 1 18.82 28.60
between
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 2 20.55 25.94
2 Control P1 M 1 20.99 20.31
3 Control P2 M 1 19.95 19.73
4 Exclusion P3 H 1 21.86 23.58
5 Exclusion P4 H 1 18.82 28.60
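A `between()` filter consistent with the output above (the exact bounds used are not shown; 15 and 25 reproduce these rows):

```r
dat.1 %>% filter(between(Resp1, 15, 25))
```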
Keep all cases after a value of Resp1 has exceeded 200
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P3 L 4 238.76 54.23
2 Exclusion P4 H 1 18.82 28.60
3 Exclusion P4 H 2 39.82 39.07
4 Exclusion P4 H 3 63.30 93.43
5 Exclusion P4 H 4 82.29 60.15
6 Exclusion P4 M 1 39.51 45.90
7 Exclusion P4 M 2 79.24 88.04
8 Exclusion P4 M 3 122.09 84.19
9 Exclusion P4 M 4 161.67 256.34
10 Exclusion P4 L 1 57.93 85.24
11 Exclusion P4 L 2 117.88 167.90
12 Exclusion P4 L 3 181.09 314.49
13 Exclusion P4 L 4 242.31 304.70
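Keeping every case from the first exceedance onwards can be done with `cumany()`:

```r
dat.1 %>% filter(cumany(Resp1 > 200))
```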
Keep all cases until the first case of Resp1 < 25
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
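The complementary `cumall()` keeps cases only while the condition has held for every row so far:

```r
dat.1 %>% filter(cumall(Resp1 < 25))
```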
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P2 H 1 8.14 23.93
4 Control P2 H 2 13.36 28.02
5 Exclusion P3 H 1 21.86 23.58
6 Exclusion P4 H 1 18.82 28.60
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P2 H 1 8.14 23.93
4 Control P2 H 2 13.36 28.02
5 Exclusion P3 H 1 21.86 23.58
6 Exclusion P4 H 1 18.82 28.60
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P2 H 1 8.14 23.93
7 Control P2 H 2 13.36 28.02
8 Control P2 H 3 33.37 37.17
9 Control P2 H 4 39.87 38.25
10 Control P2 M 1 19.95 19.73
11 Exclusion P3 H 1 21.86 23.58
12 Exclusion P3 H 2 39.83 28.03
13 Exclusion P3 H 3 59.53 21.32
14 Exclusion P3 H 4 75.59 90.76
15 Exclusion P4 H 1 18.82 28.60
16 Exclusion P4 H 2 39.82 39.07
17 Exclusion P4 H 3 63.30 93.43
18 Exclusion P4 H 4 82.29 60.15
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 M 1 20.99 20.31
3 Control P2 H 1 8.14 23.93
4 Control P2 M 1 19.95 19.73
5 Exclusion P3 H 1 21.86 23.58
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 M 1 20.99 20.31
4 Control P1 M 2 37.54 17.62
5 Control P1 L 1 31.73 21.22
6 Control P2 H 1 8.14 23.93
7 Control P2 H 2 13.36 28.02
8 Control P2 M 1 19.95 19.73
9 Control P2 M 3 62.46 4.81
10 Exclusion P3 H 1 21.86 23.58
11 Exclusion P3 H 3 59.53 21.32
12 Exclusion P4 H 1 18.82 28.60
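An `if_any()` filter consistent with the output above (rows where either response falls below a threshold; 25 matches these rows):

```r
dat.1 %>% filter(if_any(starts_with("Resp"), ~ .x < 25))
```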
Subsetting rows by position (slice)
  Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
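The first four rows are extracted by index:

```r
dat.1 %>% slice(1:4)
```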
Random subsets of rows (sample_n)
In each of the examples, I will set the random seed to ensure we can all repeat the example exactly.
Random sample of 10 rows (without replacement)
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P3 M 3 124.08 124.09
2 Control P2 H 3 33.37 37.17
3 Control P2 H 2 13.36 28.02
4 Control P1 H 3 27.49 29.85
5 Exclusion P4 M 2 79.24 88.04
6 Exclusion P4 M 3 122.09 84.19
7 Exclusion P4 H 1 18.82 28.60
8 Exclusion P4 L 2 117.88 167.90
9 Exclusion P3 H 1 21.86 23.58
10 Exclusion P3 H 2 39.83 28.03
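A sketch of the sampling call (the actual seed used above is not shown, so the rows drawn will differ):

```r
set.seed(123)  # hypothetical seed
dat.1 %>% sample_n(10, replace = FALSE)
```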
Random sample of 25% of the rows (without replacement)
Treatment Plot Dose Time Resp1 Resp2
1 Exclusion P3 M 3 124.08 124.09
2 Control P2 H 3 33.37 37.17
3 Control P2 H 2 13.36 28.02
4 Control P1 H 3 27.49 29.85
5 Exclusion P4 M 2 79.24 88.04
6 Exclusion P4 M 3 122.09 84.19
7 Exclusion P4 H 1 18.82 28.60
8 Exclusion P4 L 2 117.88 167.90
9 Exclusion P3 H 1 21.86 23.58
10 Exclusion P3 H 2 39.83 28.03
11 Exclusion P3 H 3 59.53 21.32
12 Control P1 M 1 20.99 20.31
In R, categorical variables (factors) are actually stored as vectors of integers (1, 2, 3, …) along with an attribute that registers the names (and order) of the levels. We can see this, if we ask to see the structure of any categorical variable:
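For example, inspecting the structure of the Dose factor:

```r
str(dat.1$Dose)
```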
Factor w/ 3 levels "H","L","M": 1 1 1 1 3 3 3 3 2 2 ...
In the above, we see that the levels of the Dose
variable are “H”, “L”, “M” (by default, levels of an unordered factor
are alphabetical). The first four cases are all 1
, which is
mapped to “H”. The next four are 3
which maps to “M” and so
on.
Alternatively, we can explore the levels attribute of a
factor via the levels()
function:
[1] "H" "L" "M"
[1] "P1" "P2" "P3" "P4"
[1] "Control" "Exclusion"
Although subsets of rows (filter
, sample_n
etc) may appear to completely remove particular levels of a categorical
variable, these actions do not update factor attributes. Consequently,
many routines that operate on such factors (such as plots and
statistical models) will proceed as if all factor levels are present
(e.g. plots will contain gaps where space has been provisioned for all
levels even though there may be no data associated with all levels).
To illustrate this, we will filter the dat.1
data such
that it only includes cases for which Plot
equals “P1” and
then explore the levels attribute of some of the categorical
variables.
Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.55 25.94
3 Control P1 H 3 27.49 29.85
4 Control P1 H 4 44.79 25.39
5 Control P1 M 1 20.99 20.31
6 Control P1 M 2 37.54 17.62
7 Control P1 M 3 61.46 98.44
8 Control P1 M 4 82.21 160.01
9 Control P1 L 1 31.73 21.22
10 Control P1 L 2 59.08 37.51
11 Control P1 L 3 94.54 119.22
12 Control P1 L 4 121.17 116.45
[1] "H" "L" "M"
[1] "P1" "P2" "P3" "P4"
[1] "Control" "Exclusion"
So although the data only contains Plot
values of “P1”
(and Treatment
values of “Control”), the levels are still
listed as “P1”, “P2”, “P3”, and “P4”.
To ensure that the attributes reflect the subset data, it is
necessary to use the droplevels()
function:
[1] "H" "L" "M"
[1] "P1"
[1] "Control"
Adding/modifying columns (mutate)
To add to (mutate) a data set, a vectorised function is applied either once or across one or more existing columns.
The mutate() function uses data-masking semantics and can use tidy-select semantics via across().
Whether transforming/scaling existing variables or generating new variables derived from existing variables it is advisable that existing variables be retained (unless the data are huge and storage is tight).
Treatment Plot Dose Time Resp1 Resp2 Sum
1 Control P1 H 1 8.12 3.06 11.18
2 Control P1 H 2 20.55 25.94 46.49
3 Control P1 H 3 27.49 29.85 57.34
4 Control P1 H 4 44.79 25.39 70.18
5 Control P1 M 1 20.99 20.31 41.30
6 Control P1 M 2 37.54 17.62 55.16
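The Sum column above is the elementwise sum of the two responses:

```r
dat.1 %>% mutate(Sum = Resp1 + Resp2) %>% head()
```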
Multiple additions/modifications can be defined in a single
mutate
call and the changes occur sequentially.
Consequently, subsequent changes can build upon early changes. To
illustrate this, we will create a new variable (cResp1
)
that is a centred version of the Resp1
variable (centring
is when the mean of a variable is subtracted from each value resulting
in a variable whose mean is exactly zero, yet all other properties are
unaffected.).
Treatment Plot Dose Time Resp1 Resp2 meanResp1 cResp1
1 Control P1 H 1 8.12 3.06 75.26604 -67.14604
2 Control P1 H 2 20.55 25.94 75.26604 -54.71604
3 Control P1 H 3 27.49 29.85 75.26604 -47.77604
4 Control P1 H 4 44.79 25.39 75.26604 -30.47604
5 Control P1 M 1 20.99 20.31 75.26604 -54.27604
6 Control P1 M 2 37.54 17.62 75.26604 -37.72604
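The sequential nature of mutate means the second expression can use the column created by the first:

```r
dat.1 %>%
  mutate(meanResp1 = mean(Resp1),          # summary value, recycled
         cResp1 = Resp1 - meanResp1) %>%   # uses the column just created
  head()
```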
The following table lists some of the more useful vectorised
functions to use with mutate
. Importantly, vectorised
functions return a vector of the same length as they receive since all
mutate
edits must preserve the number of rows.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Function | Description |
---|---|
+,-,*,/,^,%/%, %% |
typical arithmetic operators |
log() ,log2() ,log10() ,
exp() |
logarithms/exponentials |
<,<=,>,>=,!=,== |
logical operators |
between() |
whether a value lies between two numbers |
near() |
a safe way of assessing equality in floating points |
Treatment Plot Dose Time Resp1 Resp2 Sum Div
1 Control P1 H 1 8.12 3.06 11.18 2.6535948
2 Control P1 H 2 20.55 25.94 46.49 0.7922128
3 Control P1 H 3 27.49 29.85 57.34 0.9209380
4 Control P1 H 4 44.79 25.39 70.18 1.7640803
5 Control P1 M 1 20.99 20.31 41.30 1.0334810
6 Control P1 M 2 37.54 17.62 55.16 2.1305335
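Multiple arithmetic columns in one call; Div here is the ratio of the responses:

```r
dat.1 %>% mutate(Sum = Resp1 + Resp2, Div = Resp1 / Resp2) %>% head()
```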
Treatment Plot Dose Time Resp1 Resp2 logResp1 expResp2
1 Control P1 H 1 8.12 3.06 2.094330 2.132756e+01
2 Control P1 H 2 20.55 25.94 3.022861 1.843312e+11
3 Control P1 H 3 27.49 29.85 3.313822 9.197934e+12
4 Control P1 H 4 44.79 25.39 3.801985 1.063499e+11
5 Control P1 M 1 20.99 20.31 3.044046 6.614864e+08
6 Control P1 M 2 37.54 17.62 3.625407 4.490232e+07
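Logarithms and exponentials are applied elementwise in the same way:

```r
dat.1 %>% mutate(logResp1 = log(Resp1), expResp2 = exp(Resp2)) %>% head()
```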
Treatment Plot Dose Time Resp1 Resp2 largeResp1
1 Control P1 H 1 8.12 3.06 FALSE
2 Control P1 H 2 20.55 25.94 FALSE
3 Control P1 H 3 27.49 29.85 TRUE
4 Control P1 H 4 44.79 25.39 TRUE
5 Control P1 M 1 20.99 20.31 FALSE
6 Control P1 M 2 37.54 17.62 TRUE
Treatment Plot Dose Time Resp1 Resp2 A B
1 Control P1 H 1 8.12 3.06 FALSE TRUE
2 Control P1 H 2 20.55 25.94 FALSE FALSE
3 Control P1 H 3 27.49 29.85 FALSE FALSE
4 Control P1 H 4 44.79 25.39 FALSE FALSE
5 Control P1 M 1 20.99 20.31 FALSE FALSE
6 Control P1 M 2 37.54 17.62 FALSE FALSE
Treatment Plot Dose Time Resp1 Resp2 mediumResp1
1 Control P1 H 1 8.12 3.06 FALSE
2 Control P1 H 2 20.55 25.94 TRUE
3 Control P1 H 3 27.49 29.85 FALSE
4 Control P1 H 4 44.79 25.39 FALSE
5 Control P1 M 1 20.99 20.31 TRUE
6 Control P1 M 2 37.54 17.62 FALSE
Function | Description |
---|---|
lag() |
shift cases down one |
lead() |
shift cases up one |
Treatment Plot Dose Time Resp1 Resp2 lagResp1
1 Control P1 H 1 8.12 3.06 NA
2 Control P1 H 2 20.55 25.94 8.12
3 Control P1 H 3 27.49 29.85 20.55
4 Control P1 H 4 44.79 25.39 27.49
5 Control P1 M 1 20.99 20.31 44.79
6 Control P1 M 2 37.54 17.62 20.99
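The lagged column shifts each value down one row (the first becomes NA):

```r
dat.1 %>% mutate(lagResp1 = lag(Resp1)) %>% head()
```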
Treatment Plot Dose Time Resp1 Resp2 leadResp1
43 Exclusion P4 M 3 122.09 84.19 161.67
44 Exclusion P4 M 4 161.67 256.34 57.93
45 Exclusion P4 L 1 57.93 85.24 117.88
46 Exclusion P4 L 2 117.88 167.90 181.09
47 Exclusion P4 L 3 181.09 314.49 242.31
48 Exclusion P4 L 4 242.31 304.70 NA
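Conversely, `lead()` shifts values up one row (so the last case becomes NA, as shown in the tail above):

```r
dat.1 %>% mutate(leadResp1 = lead(Resp1)) %>% tail()
```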
Function | Description |
---|---|
cummax() |
cumulative maximum |
cummin() |
cumulative minimum |
cummean() |
cumulative mean |
cumprod() |
cumulative product |
cumsum() |
cumulative sum |
rank() |
rank of current case (ties averaged) |
min_rank() |
rank of current case (minimum rank for ties) |
dense_rank() |
rank of current case (minimum rank for ties, no gaps) |
percent_rank() |
min_rank of current case (scaled to [0,1]) |
cume_dist() |
cumulative empirical distribution (prop. less than current rank) |
row_number() |
rank of current case (first row for ties) |
ntile() |
bin into (n) buckets |
dat.1 %>% mutate(Cummin = cummin(Resp1),
Cummax = cummax(Resp1),
Cummean = cummean(Resp1)) %>%
head()
Treatment Plot Dose Time Resp1 Resp2 Cummin Cummax Cummean
1 Control P1 H 1 8.12 3.06 8.12 8.12 8.1200
2 Control P1 H 2 20.55 25.94 8.12 20.55 14.3350
3 Control P1 H 3 27.49 29.85 8.12 27.49 18.7200
4 Control P1 H 4 44.79 25.39 8.12 44.79 25.2375
5 Control P1 M 1 20.99 20.31 8.12 44.79 24.3880
6 Control P1 M 2 37.54 17.62 8.12 44.79 26.5800
Treatment Plot Dose Time Resp1 Resp2 Cumsum Cumprod
1 Control P1 H 1 8.12 3.06 8.12 8.120000e+00
2 Control P1 H 2 20.55 25.94 28.67 1.668660e+02
3 Control P1 H 3 27.49 29.85 56.16 4.587146e+03
4 Control P1 H 4 44.79 25.39 100.95 2.054583e+05
5 Control P1 M 1 20.99 20.31 121.94 4.312569e+06
6 Control P1 M 2 37.54 17.62 159.48 1.618939e+08
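The running sum and product follow the same pattern as the other cumulative functions:

```r
dat.1 %>% mutate(Cumsum = cumsum(Resp1), Cumprod = cumprod(Resp1)) %>% head()
```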
dat.1 %>% mutate(Rank = rank(Resp1),
minRank = min_rank(Resp1),
denseRank = dense_rank(Resp1),
percentRank = percent_rank(Resp1)) %>%
head()
Treatment Plot Dose Time Resp1 Resp2 Rank minRank denseRank percentRank
1 Control P1 H 1 8.12 3.06 1 1 1 0.0000000
2 Control P1 H 2 20.55 25.94 6 6 6 0.1063830
3 Control P1 H 3 27.49 29.85 9 9 9 0.1702128
4 Control P1 H 4 44.79 25.39 20 20 20 0.4042553
5 Control P1 M 1 20.99 20.31 7 7 7 0.1276596
6 Control P1 M 2 37.54 17.62 13 13 13 0.2553191
Treatment Plot Dose Time Resp1 Resp2 cume_dist(Resp1)
1 Control P1 H 1 8.12 3.06 0.02083333
2 Control P1 H 2 20.55 25.94 0.12500000
3 Control P1 H 3 27.49 29.85 0.18750000
4 Control P1 H 4 44.79 25.39 0.41666667
5 Control P1 M 1 20.99 20.31 0.14583333
6 Control P1 M 2 37.54 17.62 0.27083333
Treatment Plot Dose Time Resp1 Resp2 row_number(Resp1)
1 Control P1 H 1 8.12 3.06 1
2 Control P1 H 2 20.55 25.94 6
3 Control P1 H 3 27.49 29.85 9
4 Control P1 H 4 44.79 25.39 20
5 Control P1 M 1 20.99 20.31 7
6 Control P1 M 2 37.54 17.62 13
Function | Description |
---|---|
if_else() |
elementwise (case by case) if and else |
case_when() |
elementwise multiple if_else |
na_if() |
elementwise replace nominated value with NA |
pmax() |
elementwise maximum across multiple columns |
pmin() |
elementwise minimum across multiple columns |
Treatment Plot Dose Time Resp1 Resp2 Size
1 Control P1 H 1 8.12 3.06 Small
2 Control P1 H 2 20.55 25.94 Small
3 Control P1 H 3 27.49 29.85 Big
4 Control P1 H 4 44.79 25.39 Big
5 Control P1 M 1 20.99 20.31 Small
6 Control P1 M 2 37.54 17.62 Big
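A two-level classification via `if_else()`; the cutoff of 25 is inferred from the output above:

```r
dat.1 %>% mutate(Size = if_else(Resp1 < 25, "Small", "Big")) %>% head()
```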
dat.1 %>% mutate(Size = case_when(Resp1 < 15 ~ "Small",
Resp1 < 25 ~ "Medium",
Resp1 >= 25 ~ "Big")) %>%
head()
Treatment Plot Dose Time Resp1 Resp2 Size
1 Control P1 H 1 8.12 3.06 Small
2 Control P1 H 2 20.55 25.94 Medium
3 Control P1 H 3 27.49 29.85 Big
4 Control P1 H 4 44.79 25.39 Big
5 Control P1 M 1 20.99 20.31 Medium
6 Control P1 M 2 37.54 17.62 Big
Summary functions (those that return a single value) are also permissible - the value will be recycled for the total length of the input vector. A table of useful summary functions can be found in the Summarise section.
Treatment Plot Dose Time Resp1 Resp2 meanResp1
1 Control P1 H 1 8.12 3.06 75.26604
2 Control P1 H 2 20.55 25.94 75.26604
3 Control P1 H 3 27.49 29.85 75.26604
4 Control P1 H 4 44.79 25.39 75.26604
5 Control P1 M 1 20.99 20.31 75.26604
6 Control P1 M 2 37.54 17.62 75.26604
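Here the single mean is recycled down the entire column:

```r
dat.1 %>% mutate(meanResp1 = mean(Resp1)) %>% head()
```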
Another important summary function is the n()
function. This function returns the total number of rows. In
the following example, we will use it to create a column that just
provides a running row counter (e.g. a row index variable).
Treatment Plot Dose Time Resp1 Resp2 N
1 Control P1 H 1 8.12 3.06 1
2 Control P1 H 2 20.55 25.94 2
3 Control P1 H 3 27.49 29.85 3
4 Control P1 H 4 44.79 25.39 4
5 Control P1 M 1 20.99 20.31 5
6 Control P1 M 2 37.54 17.62 6
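A row index can be generated from `n()`:

```r
dat.1 %>% mutate(N = 1:n()) %>% head()
```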
Applying functions across multiple columns (across)
In the section on select, a set of selection helper functions was described to facilitate convenient ways to select columns based on properties of the column names etc. The across() function allows us to bring those same selection helper functions to mutate.
The across()
function has the following
form:
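A sketch of the signature, following the dplyr documentation:

```r
across(.cols = everything(), .fns = NULL, ..., .names = NULL)
```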
where:
.cols
- a tidy selection (e.g. selection helper
function).fns
- a function (or list of functions) to apply to
each selected column.names
- a glue
specification determining
the format of the new variable names. By default the glue will be either
{.col}
(when there is only a single function) or
{.col}.{fn}
(when a list of functions) Treatment Plot Dose Time Resp1 Resp2
1 Control P1 H 1 2.094330 1.118415
2 Control P1 H 2 3.022861 3.255786
3 Control P1 H 3 3.313822 3.396185
4 Control P1 H 4 3.801985 3.234355
5 Control P1 M 1 3.044046 3.011113
6 Control P1 M 2 3.625407 2.869035
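With the default naming, the transformed columns replace the originals in place:

```r
dat.1 %>% mutate(across(starts_with("Resp"), log)) %>% head()
```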
Treatment Plot Dose Time Resp1 Resp2 lResp1 lResp2
1 Control P1 H 1 8.12 3.06 2.094330 1.118415
2 Control P1 H 2 20.55 25.94 3.022861 3.255786
3 Control P1 H 3 27.49 29.85 3.313822 3.396185
4 Control P1 H 4 44.79 25.39 3.801985 3.234355
5 Control P1 M 1 20.99 20.31 3.044046 3.011113
6 Control P1 M 2 37.54 17.62 3.625407 2.869035
where selections with naming
  Treatment Plot Dose Time Resp1 Resp2 lTime lResp1 lResp2
1 Control P1 H 1 8.12 3.06 0.0000000 2.094330 1.118415
2 Control P1 H 2 20.55 25.94 0.6931472 3.022861 3.255786
3 Control P1 H 3 27.49 29.85 1.0986123 3.313822 3.396185
4 Control P1 H 4 44.79 25.39 1.3862944 3.801985 3.234355
5 Control P1 M 1 20.99 20.31 0.0000000 3.044046 3.011113
6 Control P1 M 2 37.54 17.62 0.6931472 3.625407 2.869035
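Selecting by column type with `where()`; since Time is numeric it is also log-transformed here:

```r
dat.1 %>%
  mutate(across(where(is.numeric),
                .fns = list(l = log),
                .names = "{.fn}{.col}")) %>%
  head()
```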
num_range selections with naming
  Treatment Plot Dose Time Resp1 Resp2 lResp1 lResp2
1 Control P1 H 1 8.12 3.06 2.094330 1.118415
2 Control P1 H 2 20.55 25.94 3.022861 3.255786
3 Control P1 H 3 27.49 29.85 3.313822 3.396185
4 Control P1 H 4 44.79 25.39 3.801985 3.234355
5 Control P1 M 1 20.99 20.31 3.044046 3.011113
6 Control P1 M 2 37.54 17.62 3.625407 2.869035
dat.1 %>% mutate(across(c(Resp1, Resp2),
.fns = list(l = log, s = sqrt),
.names = "{.fn}.{.col}")) %>%
head()
Treatment Plot Dose Time Resp1 Resp2 l.Resp1 s.Resp1 l.Resp2 s.Resp2
1 Control P1 H 1 8.12 3.06 2.094330 2.849561 1.118415 1.749286
2 Control P1 H 2 20.55 25.94 3.022861 4.533211 3.255786 5.093133
3 Control P1 H 3 27.49 29.85 3.313822 5.243091 3.396185 5.463515
4 Control P1 H 4 44.79 25.39 3.801985 6.692533 3.234355 5.038849
5 Control P1 M 1 20.99 20.31 3.044046 4.581484 3.011113 4.506662
6 Control P1 M 2 37.54 17.62 3.625407 6.126989 2.869035 4.197618
Centring the response variables. Note the use of the
purrr-style lambda formula for functions that
require additional arguments; when the function is applied, the focal
variable is referred to as .x.
dat.1 %>% mutate(across(c(Resp1, Resp2),
.fns = list(c = ~ scale(.x, scale = FALSE)),
.names = "{.fn}{.col}")) %>%
head()
Treatment Plot Dose Time Resp1 Resp2 cResp1 cResp2
1 Control P1 H 1 8.12 3.06 -67.14604 -78.64958
2 Control P1 H 2 20.55 25.94 -54.71604 -55.76958
3 Control P1 H 3 27.49 29.85 -47.77604 -51.85958
4 Control P1 H 4 44.79 25.39 -30.47604 -56.31958
5 Control P1 M 1 20.99 20.31 -54.27604 -61.39958
6 Control P1 M 2 37.54 17.62 -37.72604 -64.08958
Convert Time
(a numeric) into a factor
# A tibble: 48 × 6
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <fct> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 H 3 27.5 29.8
4 Control P1 H 4 44.8 25.4
5 Control P1 M 1 21.0 20.3
6 Control P1 M 2 37.5 17.6
7 Control P1 M 3 61.5 98.4
8 Control P1 M 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
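A likely recipe for the conversion (the as_tibble is assumed, to match the tibble-style output above):

```r
dat.1 %>% mutate(Time = factor(Time)) %>% as_tibble()
```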
Change the labels of the “H” and “M” levels of Dose
# A tibble: 48 × 6
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 High 1 8.12 3.06
2 Control P1 High 2 20.6 25.9
3 Control P1 High 3 27.5 29.8
4 Control P1 High 4 44.8 25.4
5 Control P1 Medium 1 21.0 20.3
6 Control P1 Medium 2 37.5 17.6
7 Control P1 Medium 3 61.5 98.4
8 Control P1 Medium 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
'data.frame': 48 obs. of 6 variables:
$ Treatment: Factor w/ 2 levels "Control","Exclusion": 1 1 1 1 1 1 1 1 1 1 ...
$ Plot : Factor w/ 4 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Dose : Factor w/ 3 levels "High","L","Medium": 1 1 1 1 3 3 3 3 2 2 ...
$ Time : int 1 2 3 4 1 2 3 4 1 2 ...
$ Resp1 : num 8.12 20.55 27.49 44.79 20.99 ...
$ Resp2 : num 3.06 25.94 29.85 25.39 20.31 ...
- attr(*, "out.attrs")=List of 2
..$ dim : Named int [1:3] 4 3 4
.. ..- attr(*, "names")= chr [1:3] "Time" "Dose" "Plot"
..$ dimnames:List of 3
.. ..$ Time: chr [1:4] "Time=1" "Time=2" "Time=3" "Time=4"
.. ..$ Dose: chr [1:3] "Dose=H" "Dose=M" "Dose=L"
.. ..$ Plot: chr [1:4] "Plot=P1" "Plot=P2" "Plot=P3" "Plot=P4"
Change the level order of the Dose
factor to something
more natural.
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H L M
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: L M H
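The before/after pair above reflects a releveling via forcats:

```r
library(forcats)
dat.1 %>% pull(Dose)  # original level order: H L M
dat.1 %>%
  mutate(Dose = fct_relevel(Dose, "L", "M", "H")) %>%
  pull(Dose)          # new level order: L M H
```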
Change the labels and level order of the Dose
factor to
something more natural.
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H L M
dat.1 %>% mutate(
Dose = fct_relevel(Dose, c("L", "M", "H")),
Dose = fct_recode(Dose, High = 'H', Medium = 'M')
) %>%
as_tibble %>%
pull(Dose)
[1] High High High High Medium Medium Medium Medium L L
[11] L L High High High High Medium Medium Medium Medium
[21] L L L L High High High High Medium Medium
[31] Medium Medium L L L L High High High High
[41] Medium Medium Medium Medium L L L L
Levels: L Medium High
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H L M
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H M L
Change the order of Dose
levels according to the median
Resp1
values
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H L M
[1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H M L
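A sketch of reordering levels by a summary of another variable (using the median, as the caption indicates):

```r
dat.1 %>%
  mutate(Dose = fct_reorder(Dose, Resp1, median)) %>%
  pull(Dose)
```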
Data are often collected and transcribed at finer temporal/spatial scales and with greater fidelity than is required for all analyses. Furthermore, there is often a desire to present information summarised to a coarser level. Therefore an important phase of data preparation/presentation is to summarise the data into the spatial/temporal scales appropriate for the desired graphical and statistical analyses.
Summarising involves applying one or more summarising functions to one or more variables such that the data are reduced to a single row.
The summarise()
function uses data-masking
semantics and can use tidy-select semantics via
across()
.
Treatment | Plot | Dose | Time | Resp1 | Resp2 |
---|---|---|---|---|---|
Control | P1 | H | 1 | 8.12 | 3.06 |
Control | P1 | H | 2 | 20.55 | 25.94 |
Control | P1 | H | 3 | 27.49 | 29.85 |
Control | P1 | H | 4 | 44.79 | 25.39 |
Control | P1 | M | 1 | 20.99 | 20.31 |
Control | P1 | M | 2 | 37.54 | 17.62 |
Control | P1 | M | 3 | 61.46 | 98.44 |
Control | P1 | M | 4 | 82.21 | 160.01 |
Control | P1 | L | 1 | 31.73 | 21.22 |
Control | P1 | L | 2 | 59.08 | 37.51 |
Function | Description |
---|---|
n() |
number of rows/values |
n_distinct() |
number of unique rows/values |
mean() , median() |
mean/median |
sum() , prod() |
sum/product |
var() , sd() |
variance/standard deviation |
mad() , IQR() |
median absolute deviation / interquartile range |
min() , max() |
minimum / maximum |
quantile() |
nth quantile |
first() , last() , nth() |
first / last / nth value |
Calculate the number of rows in the data as well as the number of
distinct (unique) values of Dose
.
N NDose
1 48 3
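The counts above come from pairing `n()` with `n_distinct()`:

```r
dat.1 %>% summarise(N = n(), NDose = n_distinct(Dose))
```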
Calculate the mean and median of Resp1.
Mean Median
1 75.26604 61.31
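The corresponding summarise call:

```r
dat.1 %>% summarise(Mean = mean(Resp1), Median = median(Resp1))
```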
Calculate the sum and product of Resp1.
Sum Prod
1 3612.77 1.261602e+84
Calculate the variance and median absolute deviation (MAD) of Resp1.
Var MAD
1 3200.305 42.5951
Calculate the minimum and maximum of Resp1.
Min Max
1 8.12 242.31
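The extremes above come from:

```r
dat.1 %>% summarise(Min = min(Resp1), Max = max(Resp1))
```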
Calculate the quantiles of Resp1.
Q.0% Q.25% Q.50% Q.75% Q.100%
1 8.1200 36.4975 61.3100 99.4125 242.3100
Extract the first, last and fifth values of Resp1.
First Last Fifth
1 8.12 242.31 20.99
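Positional extraction uses `first()`, `last()` and `nth()`:

```r
dat.1 %>% summarise(First = first(Resp1),
                    Last  = last(Resp1),
                    Fifth = nth(Resp1, 5))
```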
Across functions
dat.1 %>% summarise(across(c(Resp1, Resp2),
list(Mean = mean, Var = var),
.names = "{.fn}.{.col}"),
N = n())
Mean.Resp1 Var.Resp1 Mean.Resp2 Var.Resp2 N
1 75.26604 3200.305 81.70958 5886.444 48
dat.1 %>% summarise(across(where(is.numeric),
list(Mean = mean, Var = var),
.names = "{.fn}.{.col}"),
across(where(is.factor), n_distinct),
N = n())
Mean.Time Var.Time Mean.Resp1 Var.Resp1 Mean.Resp2 Var.Resp2 Treatment Plot
1 2.5 1.276596 75.26604 3200.305 81.70958 5886.444 2 4
Dose N
1 3 48
Var <- c("Resp1", "Resp2", "Resp3")
dat.1 %>% summarise(across(any_of(Var),
list(Mean = mean, Var = var)))
Resp1_Mean Resp1_Var Resp2_Mean Resp2_Var
1 75.26604 3200.305 81.70958 5886.444
The count()
function provides a convenient way
to count up the number of unique combinations of factors.
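For example, tallying the rows for each Treatment/Dose combination:

```r
dat.1 %>% count(Treatment, Dose)
```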
Base R has a family of apply functions that apply a function
(such as mean()
) to each row (MARGIN = 1) or column (MARGIN = 2)
of a matrix-like object.
Apply the mean()
function to each column
(MARGIN = 2
) of a dataset comprising only “Resp1” and
“Resp2”.
Resp1 Resp2
75.26604 81.70958
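The base R call producing the column means above:

```r
apply(dat.1[, c("Resp1", "Resp2")], MARGIN = 2, mean)
```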
The above apply functions involve a sort of split, apply,
combine process. This process became the inspiration for the
plyr
package (also from the Wickham lab), which built on
and super-charged the approach. The dplyr
package represents
the next iteration, targeting data frames and
tibbles only (since most data are in data frame format).
Central to the modern split/apply/combine process is the idea of
groups. Groups are the basis of splitting the data.
Functions applied to grouped data are applied to each group (subset)
separately before the results are combined back into a single data frame
(actually tibble). Hence grouped data are most powerful when combined
with the summarise()
or mutate()
families of
functions.
Groups are defined via the group_by()
function.
For example, to indicate that all subsequent manipulations should occur
separately for each level of the Dose
variable, we
would:
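Presumably along these lines:

```r
## declare Dose as the grouping variable; the data themselves are unchanged
dat.1 %>% group_by(Dose)
```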
# A tibble: 48 × 6
# Groups: Dose [3]
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 H 3 27.5 29.8
4 Control P1 H 4 44.8 25.4
5 Control P1 M 1 21.0 20.3
6 Control P1 M 2 37.5 17.6
7 Control P1 M 3 61.5 98.4
8 Control P1 M 4 82.2 160.
9 Control P1 L 1 31.7 21.2
10 Control P1 L 2 59.1 37.5
# ℹ 38 more rows
Note, in the above, we have only defined the grouping structure, not
any subsequent manipulations. Consequently, the tibble remains
unchanged. Do note, however, that above the tabular output a
Groups
attribute is displayed as a visual
reminder.
It is in combination with other data manipulation routines that
group_by()
is most useful. For example, to calculate the
mean of “Resp1” within each Dose
:
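A sketch of the call that presumably produced the output below:

```r
## group by Dose, then collapse each group to its mean Resp1
dat.1 %>% group_by(Dose) %>%
    summarise(Mean = mean(Resp1))
```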
# A tibble: 3 × 2
Dose Mean
<fct> <dbl>
1 H 37.3
2 L 113.
3 M 76.0
In the above output, you may have noticed that the data (means) have been rounded. This is a behaviour of the print method for tibbles. When stored, the full precision of the data are retained.
Before providing a list of examples, it is worth using
group_by()
to highlight the difference between
mutate()
and summarise()
.
# A tibble: 6 × 7
# Groups: Dose [2]
Treatment Plot Dose Time Resp1 Resp2 MeanResp1
<fct> <fct> <fct> <int> <dbl> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06 37.3
2 Control P1 H 2 20.6 25.9 37.3
3 Control P1 H 3 27.5 29.8 37.3
4 Control P1 H 4 44.8 25.4 37.3
5 Control P1 M 1 21.0 20.3 76.0
6 Control P1 M 2 37.5 17.6 76.0
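A sketch of the grouped mutate() that presumably produced the six rows above:

```r
## append the per-Dose mean as a new column; row count is unchanged
dat.1 %>% group_by(Dose) %>%
    mutate(MeanResp1 = mean(Resp1)) %>%
    head(6)
```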
- mutate() maintains the number of rows. Notice how the mean for each group is repeated for each case.
- summarise() reduces the number of rows to one (per group).
- mutate() retains all the columns.
- summarise() only returns the columns specifically involved (either as the Groups or manipulated variables).

When multiple variables are used to define the groups, the subsequent manipulations are applied to each unique combination of the factors.
# A tibble: 6 × 3
# Groups: Treatment [2]
Treatment Dose Mean
<fct> <fct> <dbl>
1 Control H 24.5
2 Control L 75.7
3 Control M 51.2
4 Exclusion H 50.1
5 Exclusion L 149.
6 Exclusion M 101.
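A sketch of the call with two grouping variables:

```r
## one mean per Treatment x Dose combination
dat.1 %>% group_by(Treatment, Dose) %>%
    summarise(Mean = mean(Resp1))
```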
Importantly, once all the manipulations that need to be performed separately per group have been performed, it is good practice to remove the group attribute. Forgetting to do so will continue to cause all subsequent manipulations (and possibly analyses and graphics) to be inadvertently performed separately per group - such downstream issues can be difficult to diagnose.
To remove the group attribute, we use the
ungroup()
function.
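For example (a sketch; note the output below no longer reports a Groups attribute):

```r
## perform the grouped summary, then drop the grouping attribute
dat.1 %>% group_by(Treatment, Dose) %>%
    summarise(Mean = mean(Resp1)) %>%
    ungroup()
```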
# A tibble: 6 × 3
Treatment Dose Mean
<fct> <fct> <dbl>
1 Control H 24.5
2 Control L 75.7
3 Control M 51.2
4 Exclusion H 50.1
5 Exclusion L 149.
6 Exclusion M 101.
dat.1 %>% group_by(Treatment, Dose) %>%
summarise(meanResp1 = mean(Resp1),
varResp1 = var(Resp1),
N = n())
# A tibble: 6 × 5
# Groups: Treatment [2]
Treatment Dose meanResp1 varResp1 N
<fct> <fct> <dbl> <dbl> <int>
1 Control H 24.5 202. 8
2 Control L 75.7 1191. 8
3 Control M 51.2 611. 8
4 Exclusion H 50.1 562. 8
5 Exclusion L 149. 5192. 8
6 Exclusion M 101. 2378. 8
Grouped summaries (across)

dat.1 %>% group_by(Treatment, Dose) %>%
summarise(across(starts_with("Resp"),
list(Mean = mean, Var = var),
.names = "{.fn}{.col}"),
N = n())
# A tibble: 6 × 7
# Groups: Treatment [2]
Treatment Dose MeanResp1 VarResp1 MeanResp2 VarResp2 N
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <int>
1 Control H 24.5 202. 26.5 118. 8
2 Control L 75.7 1191. 79.9 1948. 8
3 Control M 51.2 611. 62.3 3676. 8
4 Exclusion H 50.1 562. 48.1 886. 8
5 Exclusion L 149. 5192. 170. 13391. 8
6 Exclusion M 101. 2378. 103. 4786. 8
# A tibble: 8 × 6
# Groups: Treatment, Dose [3]
Treatment Plot Dose Time Resp1 Resp2
<fct> <fct> <fct> <int> <dbl> <dbl>
1 Control P1 H 1 8.12 3.06
2 Control P1 H 2 20.6 25.9
3 Control P1 M 1 21.0 20.3
4 Control P2 H 1 8.14 23.9
5 Control P2 H 2 13.4 28.0
6 Control P2 M 1 20.0 19.7
7 Exclusion P3 H 1 21.9 23.6
8 Exclusion P4 H 1 18.8 28.6
Lengthening data (pivot_longer)

Whilst wide data formats are often more compact and typically easier to manage for data entry (particularly in the field), the data are not in the appropriate format for most analyses (traditional repeated measures and multivariate analyses are two exceptions). Most analyses require that each replicate is in its own row and thus it is necessary to rearrange or reshape (melt) the data from this wide format to the long (molten) format.
Whilst there are numerous routines in R for reshaping data, we will only explore those that are formally part of the tidyverse ecosystem.
The pivot_longer()
function (tidyr
package) is very useful for converting wide (repeated
measures-like) data into long format. The important parameters to specify
are:
where:
- data - the input dataframe or tibble
- cols - a tidy select specification of the columns to be lengthened into a single column
- names_to - a name (string) for a new column in which to store the old wide column names
- values_to - a name (string) for the new column containing the values that were previously in the old columns
- values_drop_na - whether or not to drop rows that only contain NA values in the new value column

To illustrate, we will use the data.w
dataframe.
Plot | Between | Time.0 | Time.1 | Time.2 | |
---|---|---|---|---|---|
R1 | P1 | A1 | 8 | 14 | 14 |
R2 | P2 | A1 | 10 | 12 | 11 |
R3 | P3 | A2 | 7 | 11 | 8 |
R4 | P4 | A2 | 11 | 9 | 2 |
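A sketch of the basic lengthening call that presumably produced the output below (the three Time columns are collapsed into a names column and a values column):

```r
## gather the three Time.* columns into long format
data.w %>% pivot_longer(cols = c(Time.0, Time.1, Time.2),
                        names_to = "Time",
                        values_to = "Count")
```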
# A tibble: 12 × 4
Plot Between Time Count
<fct> <fct> <chr> <int>
1 P1 A1 Time.0 8
2 P1 A1 Time.1 14
3 P1 A1 Time.2 14
4 P2 A1 Time.0 10
5 P2 A1 Time.1 12
6 P2 A1 Time.2 11
7 P3 A2 Time.0 7
8 P3 A2 Time.1 11
9 P3 A2 Time.2 8
10 P4 A2 Time.0 11
11 P4 A2 Time.1 9
12 P4 A2 Time.2 2
Selecting columns with a helper (starts_with)

data.w %>% pivot_longer(starts_with("Time"),
names_to = "Time",
names_prefix = "Time.",
values_to = "Count"
)
# A tibble: 12 × 4
Plot Between Time Count
<fct> <fct> <chr> <int>
1 P1 A1 0 8
2 P1 A1 1 14
3 P1 A1 2 14
4 P2 A1 0 10
5 P2 A1 1 12
6 P2 A1 2 11
7 P3 A2 0 7
8 P3 A2 1 11
9 P3 A2 2 8
10 P4 A2 0 11
11 P4 A2 1 9
12 P4 A2 2 2
Widening data (pivot_wider)

The opposite of making a data set longer is to make a data set wider. Whilst analytical and graphical routines might require data to be in long format, wide tabular summaries are typically more compact and familiar.
Widening is performed via the pivot_wider()
function, the most important parameters of which are:
where:
- data - the input dataframe or tibble
- id_cols - a tidy select specification of the columns that uniquely identify each case - these columns will not be widened
- names_from - a tidy select specification of the column(s) that contain the names to be used as new column names
- values_from - a tidy select specification of the column(s) that contain the values to be used as values in the new columns (i.e. the data to be widened)

To illustrate, we will use the data
dataframe.
Resp1 | Resp2 | Between | Plot | Subplot | Within |
---|---|---|---|---|---|
8 | 17 | A1 | P1 | S1 | B1 |
10 | 18 | A1 | P1 | S1 | B2 |
7 | 17 | A1 | P1 | S2 | B1 |
11 | 21 | A1 | P1 | S2 | B2 |
14 | 19 | A2 | P2 | S3 | B1 |
12 | 13 | A2 | P2 | S3 | B2 |
11 | 24 | A2 | P2 | S4 | B1 |
9 | 18 | A2 | P2 | S4 | B2 |
14 | 25 | A3 | P3 | S5 | B1 |
11 | 18 | A3 | P3 | S5 | B2 |
8 | 27 | A3 | P3 | S6 | B1 |
2 | 22 | A3 | P3 | S6 | B2 |
8 | 17 | A1 | P4 | S7 | B1 |
10 | 22 | A1 | P4 | S7 | B2 |
7 | 16 | A1 | P4 | S8 | B1 |
12 | 13 | A1 | P4 | S8 | B2 |
11 | 23 | A2 | P5 | S9 | B1 |
12 | 19 | A2 | P5 | S9 | B2 |
12 | 23 | A2 | P5 | S10 | B1 |
10 | 21 | A2 | P5 | S10 | B2 |
3 | 17 | A3 | P6 | S11 | B1 |
11 | 16 | A3 | P6 | S11 | B2 |
13 | 26 | A3 | P6 | S12 | B1 |
7 | 28 | A3 | P6 | S12 | B2 |
Note, these data are not as long as they could be. Purely “long” data
should have each observation in its own row. The data
dataframe has two observations (one for “Resp1” and one for “Resp2”) per
row.
Widen the “Resp1” variable by the levels of the Within
factor.
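A sketch of a call consistent with the output below (here id_cols is an assumption; without it, Resp2 would also be treated as an identifier and produce NAs):

```r
## spread Resp1 across the levels of Within (B1, B2)
data %>% pivot_wider(id_cols = c(Between, Plot, Subplot),
                     names_from = Within,
                     values_from = Resp1)
```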
# A tibble: 12 × 5
Between Plot Subplot B1 B2
<fct> <fct> <fct> <int> <int>
1 A1 P1 S1 8 10
2 A1 P1 S2 7 11
3 A2 P2 S3 14 12
4 A2 P2 S4 11 9
5 A3 P3 S5 14 11
6 A3 P3 S6 8 2
7 A1 P4 S7 8 10
8 A1 P4 S8 7 12
9 A2 P5 S9 11 12
10 A2 P5 S10 12 10
11 A3 P6 S11 3 11
12 A3 P6 S12 13 7
Widen the “Resp1” and “Resp2” variables by the levels of the
Within
factor.
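A sketch of a call consistent with the output below (supplying both responses to values_from yields the value_name column naming seen in the output):

```r
## spread both responses across the levels of Within
data %>% pivot_wider(id_cols = c(Between, Plot, Subplot),
                     names_from = Within,
                     values_from = c(Resp1, Resp2))
```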
# A tibble: 12 × 7
Between Plot Subplot Resp1_B1 Resp1_B2 Resp2_B1 Resp2_B2
<fct> <fct> <fct> <int> <int> <int> <int>
1 A1 P1 S1 8 10 17 18
2 A1 P1 S2 7 11 17 21
3 A2 P2 S3 14 12 19 13
4 A2 P2 S4 11 9 24 18
5 A3 P3 S5 14 11 25 18
6 A3 P3 S6 8 2 27 22
7 A1 P4 S7 8 10 17 22
8 A1 P4 S8 7 12 16 13
9 A2 P5 S9 11 12 23 19
10 A2 P5 S10 12 10 23 21
11 A3 P6 S11 3 11 17 16
12 A3 P6 S12 13 7 26 28
Alternatively, we could make the data longer before widening:
data %>%
pivot_longer(cols = starts_with("Resp")) %>%
pivot_wider(names_from = c(name, Within),
values_from = value)
# A tibble: 12 × 7
Between Plot Subplot Resp1_B1 Resp2_B1 Resp1_B2 Resp2_B2
<fct> <fct> <fct> <int> <int> <int> <int>
1 A1 P1 S1 8 17 10 18
2 A1 P1 S2 7 17 11 21
3 A2 P2 S3 14 19 12 13
4 A2 P2 S4 11 24 9 18
5 A3 P3 S5 14 25 11 18
6 A3 P3 S6 8 27 2 22
7 A1 P4 S7 8 17 10 22
8 A1 P4 S8 7 16 12 13
9 A2 P5 S9 11 23 12 19
10 A2 P5 S10 12 23 10 21
11 A3 P6 S11 3 17 11 16
12 A3 P6 S12 13 26 7 28
Uniting columns (unite)

If data are recorded with excessive fidelity, it may be useful to combine multiple fields into a single field. For example, if the date was recorded across three fields (year, month and day), we may like to combine these to form a single date field.
year | month | day | Resp1 |
---|---|---|---|
2009 | 11 | 09 | 25 |
2008 | 05 | 08 | 52 |
2009 | 05 | 23 | 18 |
2009 | 04 | 15 | 21 |
2008 | 10 | 25 | 45 |
2008 | 09 | 26 | 40 |
2008 | 07 | 05 | 38 |
2008 | 11 | 02 | 9 |
2009 | 08 | 19 | 0 |
2008 | 10 | 03 | 20 |
Let's take those data and combine (unite) the separate date components
into a single date column before coercing it into a date
object.
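A sketch of the call (the name data.d is an assumption for the dataframe shown above, which is not named in the text):

```r
## "data.d" is an assumed name for the dataframe shown above
## unite year/month/day into one string column, then coerce it to Date
data.d %>%
    unite(Date, year, month, day, sep = "-") %>%
    mutate(Date = as.Date(Date))
```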
Date Resp1
1 2009-11-09 25
2 2008-05-08 52
3 2009-05-23 18
4 2009-04-15 21
5 2008-10-25 45
6 2008-09-26 40
7 2008-07-05 38
8 2008-11-02 9
9 2009-08-19 0
10 2008-10-03 20
Separating columns (separate)

Separating variables is the opposite of uniting them. A field is separated by either indicating the character(s) to use as a separator, or else providing a fixed-width format.
Year | Resp1 |
---|---|
F33 | 23 |
M27 | 55 |
M33 | 33 |
M26 | 36 |
F21 | 12 |
F34 | 49 |
M29 | 0 |
M34 | 18 |
F32 | 42 |
F26 | 0 |
For this example, we will split the Year
column into two
columns, one that specifies the gender and one that specifies the age.
The values are going to be separated after the first character.
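A sketch of the call (the name data.s is an assumption for the dataframe shown above; a numeric sep splits at that character position):

```r
## "data.s" is an assumed name for the dataframe shown above
## split Year after the first character: Gender (1 char) then Age
data.s %>% separate(Year, into = c("Gender", "Age"), sep = 1)
```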
Gender Age Resp1
1 F 33 23
2 M 27 55
3 M 33 33
4 M 26 36
5 F 21 12
6 F 34 49
7 M 29 0
8 M 34 18
9 F 32 42
10 F 26 0
Merging (_join) data sets

It is common to have data associated with a particular study organised into a number of separate data tables (databases etc.). In fact, large data sets are best managed in databases. However, statistical analyses generally require all data to be encapsulated within a single data structure. Therefore, prior to analysis, it is necessary to bring together multiple sources.
This phase of data preparation can be one of the most difficult to get right and verify that it is correct (particularly for large data sets).
Merging (or joining) involves creating a new data set that comprises information from two data sets. The resulting joined data set contains all fields from both data sets. The data sets are aligned according to fields they have in common, and matching records in these common fields determine which row from each input data set is combined.
There are numerous alternative ways of defining what should happen in the event that common records do not occur in both sets. For example, we could specify that only fully matching records be included in the final data set. Alternatively, we could specify that all records be included from both sets and thus the resulting data set will contain missing values. The following sections describe these various options.
To illustrate the above, we will use the data.bio
and
data.chem
datasets. The data.bio
dataset
mimics a typical ecological data set in which there are a number of
ecological responses (Resp1
and Resp2
) along
with the associated hierarchical design structure. This data set is
intentionally missing data for Subplot
S3
. The
data.chem
dataset has the same hierarchical design
structure, yet records the two chemical analytes (Chem1
and
Chem2
). This data set is intentionally missing data for
Subplot
S7
.
Resp1 | Resp2 | Between | Plot | Subplot |
---|---|---|---|---|
8 | 18 | A1 | P1 | S1 |
10 | 21 | A1 | P1 | S2 |
11 | 23 | A1 | P2 | S4 |
14 | 22 | A2 | P3 | S5 |
12 | 24 | A2 | P3 | S6 |
11 | 23 | A2 | P4 | S7 |
9 | 20 | A2 | P4 | S8 |
14 | 11 | A3 | P5 | S9 |
11 | 22 | A3 | P5 | S10 |
8 | 24 | A3 | P6 | S11 |
2 | 16 | A3 | P6 | S12 |
Chem1 | Chem2 | Between | Plot | Subplot |
---|---|---|---|---|
1.452878 | 0.8858208 | A1 | P1 | S1 |
3.266253 | 0.1800177 | A1 | P1 | S2 |
1.178652 | 5.0780682 | A1 | P2 | S3 |
13.400350 | 1.5762780 | A1 | P2 | S4 |
3.779183 | 1.6222430 | A2 | P3 | S5 |
1.196657 | 4.2369184 | A2 | P3 | S6 |
5.687807 | 2.9859003 | A2 | P4 | S8 |
4.834518 | 4.1328919 | A3 | P5 | S9 |
2.002931 | 3.6043314 | A3 | P5 | S10 |
12.326867 | 1.7763576 | A3 | P6 | S11 |
4.014221 | 0.2255188 | A3 | P6 | S12 |
Full join (full_join)

A full or outer join between data.bio
and
data.chem
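A sketch of the call (by default dplyr joins on all shared columns, here Between, Plot and Subplot):

```r
## full/outer join: keep all rows from both data sets (assumed call)
data.bio %>% full_join(data.chem)
```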
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 11 23 A2 P4 S7 NA NA
7 9 20 A2 P4 S8 5.687807 2.9859003
8 14 11 A3 P5 S9 4.834518 4.1328919
9 11 22 A3 P5 S10 2.002931 3.6043314
10 8 24 A3 P6 S11 12.326867 1.7763576
11 2 16 A3 P6 S12 4.014221 0.2255188
12 NA NA A1 P2 S3 1.178652 5.0780682
Note that although all Subplots are present, there are
missing values for Subplots S3
(last row) and
S7
.
Inner join (inner_join)

An inner join between data.bio
and
data.chem
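A sketch of the call (only rows present in both data sets are retained):

```r
## inner join: keep only fully matching rows (assumed call)
data.bio %>% inner_join(data.chem)
```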
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 9 20 A2 P4 S8 5.687807 2.9859003
7 14 11 A3 P5 S9 4.834518 4.1328919
8 11 22 A3 P5 S10 2.002931 3.6043314
9 8 24 A3 P6 S11 12.326867 1.7763576
10 2 16 A3 P6 S12 4.014221 0.2255188
Note that both Subplots S3
and
S7
are missing.
Left join (left_join)

A left join between data.bio
and
data.chem
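A sketch of the call (all rows of the left data set, data.bio, are retained):

```r
## left join: keep every row of data.bio (assumed call)
data.bio %>% left_join(data.chem)
```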
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 11 23 A2 P4 S7 NA NA
7 9 20 A2 P4 S8 5.687807 2.9859003
8 14 11 A3 P5 S9 4.834518 4.1328919
9 11 22 A3 P5 S10 2.002931 3.6043314
10 8 24 A3 P6 S11 12.326867 1.7763576
11 2 16 A3 P6 S12 4.014221 0.2255188
Note that Subplot
S3
is absent and
Subplot
S7
has only NA values for the chem
data.
Right join (right_join)

A right join between data.bio
and
data.chem
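A sketch of the call (all rows of the right data set, data.chem, are retained):

```r
## right join: keep every row of data.chem (assumed call)
data.bio %>% right_join(data.chem)
```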
Resp1 Resp2 Between Plot Subplot Chem1 Chem2
1 8 18 A1 P1 S1 1.452878 0.8858208
2 10 21 A1 P1 S2 3.266253 0.1800177
3 11 23 A1 P2 S4 13.400350 1.5762780
4 14 22 A2 P3 S5 3.779183 1.6222430
5 12 24 A2 P3 S6 1.196657 4.2369184
6 9 20 A2 P4 S8 5.687807 2.9859003
7 14 11 A3 P5 S9 4.834518 4.1328919
8 11 22 A3 P5 S10 2.002931 3.6043314
9 8 24 A3 P6 S11 12.326867 1.7763576
10 2 16 A3 P6 S12 4.014221 0.2255188
11 NA NA A1 P2 S3 1.178652 5.0780682
Note that Subplot
S7
is absent and
Subplot
S3
has only NA values for the bio
data.
Anti join (anti_join)

An anti join between data.bio
and
data.chem
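A sketch of the call (rows of the left data set with no match in the right are returned):

```r
## anti join: rows of data.bio with no match in data.chem (assumed call)
data.bio %>% anti_join(data.chem)
```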
Resp1 Resp2 Between Plot Subplot
1 11 23 A2 P4 S7
Note that Subplot
S7
is returned, indicating
that it is present in the left dataset (data.bio), yet absent from the right (data.chem).
Semi join (semi_join)

A semi join between data.bio
and
data.chem
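A sketch of the call (rows of the left data set that do have a match in the right are returned, without any columns from the right):

```r
## semi join: matching rows of data.bio only (assumed call)
data.bio %>% semi_join(data.chem)
```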
Resp1 Resp2 Between Plot Subplot
1 8 18 A1 P1 S1
2 10 21 A1 P1 S2
3 11 23 A1 P2 S4
4 14 22 A2 P3 S5
5 12 24 A2 P3 S6
6 9 20 A2 P4 S8
7 14 11 A3 P5 S9
8 11 22 A3 P5 S10
9 8 24 A3 P6 S11
10 2 16 A3 P6 S12
Note that Subplots
S3
and S7
are absent and only the bio data are included.
Lookup tables provide a way of inserting a column of data into a
large data set such that the entries in the new column are determined by
a relational match within another data set (the lookup table). For
example, the main data set might contain data collected from a number of
sites (Plots). Elsewhere we may have a data set that just contains the
set of sites and their corresponding latitudes and longitudes
(geographical lookup table). We could incorporate these latitudes and
longitudes into the main data set by merging against the geographical
lookup table. In Excel, this is referred to as vlookup, in a relational
database (and in tidyverse
) it is referred to as a
join.
If we again consider our data.bio
data, but this time
also consider the data.geo
data. This latter dataset
contains the latitude and longitude of each of the plots.
Resp1 | Resp2 | Between | Plot | Subplot |
---|---|---|---|---|
8 | 18 | A1 | P1 | S1 |
10 | 21 | A1 | P1 | S2 |
11 | 23 | A1 | P2 | S4 |
14 | 22 | A2 | P3 | S5 |
12 | 24 | A2 | P3 | S6 |
11 | 23 | A2 | P4 | S7 |
9 | 20 | A2 | P4 | S8 |
14 | 11 | A3 | P5 | S9 |
11 | 22 | A3 | P5 | S10 |
8 | 24 | A3 | P6 | S11 |
2 | 16 | A3 | P6 | S12 |
Plot | LAT | LONG |
---|---|---|
P1 | 17.9605 | 145.4326 |
P2 | 17.5210 | 146.1983 |
P3 | 17.0011 | 146.3839 |
P4 | 18.2350 | 146.7934 |
P5 | 18.9840 | 146.0345 |
P6 | 20.1154 | 146.4672 |
Lookup merge (left_join)

   Resp1 Resp2 Between Plot Subplot     LAT     LONG
1 8 18 A1 P1 S1 17.9605 145.4326
2 10 21 A1 P1 S2 17.9605 145.4326
3 11 23 A1 P2 S4 17.5210 146.1983
4 14 22 A2 P3 S5 17.0011 146.3839
5 12 24 A2 P3 S6 17.0011 146.3839
6 11 23 A2 P4 S7 18.2350 146.7934
7 9 20 A2 P4 S8 18.2350 146.7934
8 14 11 A3 P5 S9 18.9840 146.0345
9 11 22 A3 P5 S10 18.9840 146.0345
10 8 24 A3 P6 S11 20.1154 146.4672
11 2 16 A3 P6 S12 20.1154 146.4672
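The merge above can presumably be reproduced with a left join against the lookup table (matched on the shared Plot column):

```r
## add LAT/LONG from the geographical lookup table (assumed call)
data.bio %>% left_join(data.geo)
```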