Data wrangling (tidyverse and friends)

Author

Murray Logan

Published

September 15, 2024

I apologize in advance, this tutorial requires quite a bit of explaining and context before it can get into the code…. Good data manipulation is an art form that requires the use of many specific tools (functions) and expert data manipulation comes from the integration of these tools together. Therefore it is necessary to have an overview of the tool set before investigating any single tool.

1 Manipulating data sets

Rarely are data transcribed and organised into such simple and ready-to-go data sets. More typically, data are spread throughout multiple sources and in a variety of formats (particularly if compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses , it is necessary to compile very focused, tidy data sets.

Wickham (2014) suggested that there are many ways to organise data, yet tidy data (data that are structured in such a consistent way as to facilitate analyses) must adhere to a fairly strict set of structural rules. Specifically, in tidy data:

  • variables are in columns
  • observations are in rows - that is, for univariate data, there should be a separate row for each response observation.

To achieve tidy data, common data preparations include:

  • Reshaping and rearranging data
  • Merging multiple data sources
  • Aggregating data to appropriate spatial/temporal scales
  • Transforming data and deriving new variables
  • Sorting and reordering data
  • Relabelling data

This workshop will concentrate on these essential data preparation phases.

Practical data manipulation will be demonstrated via a series of very small artificial datasets. These datasets are presented in tables with black font and lines and the R code required to generate those data will be presented in static code boxes either underneath or adjacent the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing the following command:

  • (if online)

    source("https://github.com/ReefCloud/workshops/tree/main/data/manipulationDatasets.R")
    ##OR
    load(url("https://github.com/ReefCloud/workshops/tree/main/data/manipulationDatasets.R"))
  • (if offline and are running this from a local version having first cloned the ReefCloud Workshop github repository

    source(file="data/manipulationDatasets.R")
    ##OR
    load(file = "data/manipulationDatasets.RData")

An overview of the example data sets follows.

An overview of the example data sets follows.

These data comprise two Responses (Resp1 and Resp2) that are each observed on four occasions (Time) from each of three dosage treatments (Dose) nested within four plots (Plot). Two of the four plots completely excluded something and the other two were controls (did not exclude the thing).

View data
Treatment Plot Dose Time Resp1 Resp2
Control P1 H 1 8.12 3.06
Control P1 H 2 20.55 25.94
Control P1 H 3 27.49 29.85
Control P1 H 4 44.79 25.39
Control P1 M 1 20.99 20.31
Control P1 M 2 37.54 17.62
Control P1 M 3 61.46 98.44
Control P1 M 4 82.21 160.01
Control P1 L 1 31.73 21.22
Control P1 L 2 59.08 37.51
Control P1 L 3 94.54 119.22
Control P1 L 4 121.17 116.45
Control P2 H 1 8.14 23.93
Control P2 H 2 13.36 28.02
Control P2 H 3 33.37 37.17
Control P2 H 4 39.87 38.25
Control P2 M 1 19.95 19.73
Control P2 M 2 42.83 40.52
Control P2 M 3 62.46 4.81
Control P2 M 4 81.78 136.66
Control P2 L 1 32.76 30.70
Control P2 L 2 62.35 123.78
Control P2 L 3 90.22 113.87
Control P2 L 4 114.03 76.52
Exclusion P3 H 1 21.86 23.58
Exclusion P3 H 2 39.83 28.03
Exclusion P3 H 3 59.53 21.32
Exclusion P3 H 4 75.59 90.76
Exclusion P3 M 1 38.57 30.63
Exclusion P3 M 2 81.25 83.61
Exclusion P3 M 3 124.08 124.09
Exclusion P3 M 4 159.69 112.65
Exclusion P3 L 1 61.16 39.53
Exclusion P3 L 2 119.84 110.27
Exclusion P3 L 3 175.87 286.33
Exclusion P3 L 4 238.76 54.23
Exclusion P4 H 1 18.82 28.60
Exclusion P4 H 2 39.82 39.07
Exclusion P4 H 3 63.30 93.43
Exclusion P4 H 4 82.29 60.15
Exclusion P4 M 1 39.51 45.90
Exclusion P4 M 2 79.24 88.04
Exclusion P4 M 3 122.09 84.19
Exclusion P4 M 4 161.67 256.34
Exclusion P4 L 1 57.93 85.24
Exclusion P4 L 2 117.88 167.90
Exclusion P4 L 3 181.09 314.49
Exclusion P4 L 4 242.31 304.70

Note as this data set is 48 rows long, the majority examples that feature this dataset will only show the first 6 (or 10) rows to save on space. This truncation will occur via the head() function.

These data comprise a Response observed on three occasions (Time.0, Time.1 and Time.2) from each of two (A1 and A2) treatments (Between, gridded or not) nested within four plots (Plot).

View data
Plot Between Time.0 Time.1 Time.2
R1 P1 A1 8 14 14
R2 P2 A1 10 12 11
R3 P3 A2 7 11 8
R4 P4 A2 11 9 2

Similar to dat.1, these data comprise two Responses (Resp1 and Resp2). However, these data have a different hierarchical structure. Each response was measured from one of two Subplots nested within one of six Plots. Within each Subplot each of two treatment levels (B1 and B2) were applied and two of the Plots where exposed to one of the three Between levels.

View data
Resp1 Resp2 Between Plot Subplot Within
8 17 A1 P1 S1 B1
10 18 A1 P1 S1 B2
7 17 A1 P1 S2 B1
11 21 A1 P1 S2 B2
14 19 A2 P2 S3 B1
12 13 A2 P2 S3 B2
11 24 A2 P2 S4 B1
9 18 A2 P2 S4 B2
14 25 A3 P3 S5 B1
11 18 A3 P3 S5 B2
8 27 A3 P3 S6 B1
2 22 A3 P3 S6 B2
8 17 A1 P4 S7 B1
10 22 A1 P4 S7 B2
7 16 A1 P4 S8 B1
12 13 A1 P4 S8 B2
11 23 A2 P5 S9 B1
12 19 A2 P5 S9 B2
12 23 A2 P5 S10 B1
10 21 A2 P5 S10 B2
3 17 A3 P6 S11 B1
11 16 A3 P6 S11 B2
13 26 A3 P6 S12 B1
7 28 A3 P6 S12 B2

The data.d data comprise of a single response (Resp1) collected on different days. The three different components of the dates have been recorded in three different columns corresponding to the year, month and day.

View data
year month day Resp1
2009 11 09 25
2008 05 08 52
2009 05 23 18
2009 04 15 21
2008 10 25 45
2008 09 26 40
2008 07 05 38
2008 11 02 9
2009 08 19 0
2008 10 03 20

The data.bio dataset represents a design that is like a cut down version of data in which there are no Within Subplot levels and thus there are also fewer rows.

View data
Resp1 Resp2 Between Plot Subplot
8 18 A1 P1 S1
10 21 A1 P1 S2
11 23 A1 P2 S4
14 22 A2 P3 S5
12 24 A2 P3 S6
11 23 A2 P4 S7
9 20 A2 P4 S8
14 11 A3 P5 S9
11 22 A3 P5 S10
8 24 A3 P6 S11
2 16 A3 P6 S12

The data.chem dataset represents the same design as the data.bio data except that rather than contain two biological response, it has two chemical measurements (Chem1 and Chem2) collected from the same sampling units as the biological responses.

View data
Chem1 Chem2 Between Plot Subplot
1.452878 0.8858208 A1 P1 S1
3.266253 0.1800177 A1 P1 S2
1.178652 5.0780682 A1 P2 S3
13.400350 1.5762780 A1 P2 S4
3.779183 1.6222430 A2 P3 S5
1.196657 4.2369184 A2 P3 S6
5.687807 2.9859003 A2 P4 S8
4.834518 4.1328919 A3 P5 S9
2.002931 3.6043314 A3 P5 S10
12.326867 1.7763576 A3 P6 S11
4.014221 0.2255188 A3 P6 S12

The data.geo dataset represent geographical information about the Plots in the data.bio dataset.

View data
Plot LAT LONG
P1 17.9605 145.4326
P2 17.5210 146.1983
P3 17.0011 146.3839
P4 18.2350 146.7934
P5 18.9840 146.0345
P6 20.1154 146.4672

The tikus dataset is a data.frame version of the data with the same name that is provided as a list within the mvabund package. The data contain a the abundance of 75 coral species from different locations (rep) over six years (time) at Tikus island. The abundance was measured as the length (in cm) of a 10m transect that intersected with the coral species.

View data

Note, these data actually have 60 rows and 77 columns. In order avoid displaying a huge table in this quick view, I have trimmed the view to just the first 10 rows and six columns.

time rep Psammocora contigua Psammocora digitata Pocillopora damicornis Pocillopora verrucosa
V1 81 1 0 0 79 32
V2 81 2 0 0 51 21
V3 81 3 0 0 42 35
V4 81 4 0 0 15 0
V5 81 5 0 0 9 0
V6 81 6 0 0 72 0
V7 81 7 0 0 0 41
V8 81 8 0 0 16 25
V9 81 9 0 0 0 38
V10 81 10 0 0 16 0

The great folks over at Rstudio have produced an excellent set of cheatsheets on a range of topics. For this tutorial, the Data Transformation Cheat Sheet and Data Tidying Cheat Sheet (mainly just the first page) are useful summaries.

https://github.com/rstudio/cheatsheets/raw/main/data-import.pdf

https://github.com/rstudio/cheatsheets/raw/main/data-transformation.pdf

https://github.com/rstudio/cheatsheets/raw/main/tidyr.pdf

https://github.com/rstudio/cheatsheets/raw/main/factors.pdf

https://github.com/rstudio/cheatsheets/raw/main/lubridate.pdf

https://github.com/rstudio/cheatsheets/raw/main/purrr.pdf

https://github.com/rstudio/cheatsheets/raw/main/regex.pdf

https://github.com/rstudio/cheatsheets/raw/main/strings.pdf

https://github.com/rstudio/cheatsheets/raw/main/tidyeval.pdf

2 tidyverse - a data manipulation ecosystem within R

There are numerous packages and base R routines devoted to data manipulation. Notable packages include data.tables, plyr, dplyr and doBy. Indeed, earlier versions of this tutorial featured examples of each of these packages. However, an entire family of packages from Hadley Wickem’s group now stands out as a comprehensive, intuitive suite of tools for data manipulation and visualisation.

Importantly, all of these packages are designed to integrate together and complement one another with a consistent interface. To simplify installing an entire data ecosystem, the tidyverse package is available. Installing this package (via install.packages('tidyverse')) will install the following packages (bold entries are automatically loaded into the global environment when loading the tidyverse package):

  • dplyr - for dataframe manipulation
  • tidyr - for data tidying
  • readr - for importing data
  • ggplot2 - for visualising data
  • purrr - for functional programming
  • tibble - for tibbles
  • stringr - for string manipulation
  • forecats - for factor (categorical) manipulation
  • lubridate - for dates/times manipulation
  • hmms - for time manipulation
  • broom - for tidying model outputs
  • DBI - for database interactions
  • haven - for importing SAS, SPSS and Stata files
  • httr - for web API’s
  • jsonlite - for working with JSON
  • rvest - for web scraping
  • xml2 - for manipulating XML structures
  • modelr - for model manipulation

As this tutorial will focus largely on the tidyverse ecosystem, we will now load (attach) this library (actually, eight libraries) to the global environment so that the functionality is available.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr   1.1.4     ✔ readr   2.1.5
✔ forcats 1.0.0     ✔ stringr 1.5.1
✔ ggplot2 3.5.1     ✔ tibble  3.2.1
✔ purrr   1.0.2     ✔ tidyr   1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Note that when loading the tidyverse package, we are given feedback about which other packages have been attached. Importantly, also notice that the output highlights a number of conflicts. In the above example, this indicates that the definition of the filter function is now taken from the the dplyr package rather than the stats package. Essentially, both packages define a function called filter and this is just indicating that in the current environment, the filter function from the dplyr package will be the one used.

On a case by case basis, this can be overruled by prefixing the function name with the package namespace. For example, to ensure that the stats package version of filter is used, we can engage it as stats::filter().

2.1 The grammar of data manipulation

Hadley and his collaborators argue that there is a grammar of data manipulation and that manipulations comprises a set of verbs. Furthermore, the philosophy adheres to the UNIX ideal that rather than attempt to build large complex tools that perform a large set of operations, it is better (with regards to usability and robustness) to build a tool set in which each tool performs a single, specific task and that the individual tools can be bolted together to achieve the complex operations.

The core of the tidyverse data manipulation ecosystem tools (verbs) can be organised into five categories within two types on the basis of what aspects of the data they operate on:

In base R, variables (columns) are referred to by either their name (as a prefix to the environment - the data.frame etc), name (as a string) or their index (position). For example to access a variable called “Var1” within a dataset called “data”, we could use either data$Var1, data[,"Var1"] or data[,1] (if the focal variable was in the first column).

The tidyverse ecosystem adopts an alternative concept called tidy evaluation to make referring to columns more flexible.

  • data-masking - refer to variables as if they were directly attached to the current environment (e.g. Var1 instead of data$Var1. This form of variable referral is used within:
    • arrange()
    • filter()
    • count()
    • mutate()
    • summarise()
    • group_by()
  • tidy selection - refer to variables by their position, name or type (e.g. starts_with("var")). This form of variable referral is used within:
    • select()
    • rename()
    • pull()
    • across() - which brings tidy selection semantics to data-masking (and thus functions like mutate())
    The following table highlights the various tidy-selection options. The examples all illustrate a hypothetical data set (data) with hypothetical columns (Var1, Var2, etc) and cannot be evaluated directly. They all mostly illustrate the concept using the select() function.
tidy-selection Description Examples
Bare names

Select columns based on their names.
Columns can be excluded by prepending the name with a negative sign (-).

select(data, Var1)
select(data, Var1, Var2)
select(data, c(Var1, Var2))
select(data, -Var1)
Ranges of columns

Select columns based on their a range of names or column numbers. The selections will be inclusive. Prepending with a negative sign (-) acts as an exclusion.

select(data, Var1:Var3)
select(data, 2:4)
Boolean helper functions
-contains()
-starts_with()
-ends_with()
-matches()
-num_range()
-everything()
-where()
-all_of()
-any_of()

Select columns based on evaluating functions on the column names
- contains() - names containing the string
- starts_with() - names starting with the string
- ends_with() - names starting with the string
- matches() - names matched with a regex
- num_range() - names that start with string followed by numbers
- everything() - all variables - useful in combination with other selections
- where() - variable inclusion predicated by a function
- all_of() - all variables included in a character vector
- any_of() - any variables included in a character vector

Each of the above can be reversed by prepending with either a exclamation sign (!) or negative sign (-).
Also note that by default, these are not case sensitive.

select(data, contains("Var"))
select(data, !contains("Var"))
select(data, starts_with("Var"))
select(data, ends_with("Var"))
select(data, matches("^.var[1-5]$"))
select(data, num_range("Var", 1:2))
select(data, Var3, everything())
select(data, where(is.numeric))
vars <- c("Var1", "Var2")
select(data, all_of(vars))
vars <- c("Var1", "Other")
select(data, any_of(vars))

2.2 Piping

Typically, data manipulation/preparations comprise multiple steps and stages in which a series of alterations, additions etc are performed sequentially such that the data are progressively molded into a form appropriate for analysis etc. That is, to achieve the desired result, we must bolt multiple tools (verbs) together.

Traditionally, this would have involved a separate expression for each step often resulting in the generation of numerous intermediate data sets. Furthermore, in an attempt to reduce the number of intermediate steps, functions are often nested within other functions such that alterations are made inline within another function.

For example, the following pseudo code examples illustrates the traditional and nested approaches to achieving numerous data manipulation steps:

Traditional

data1 <- select(data, ...)
data2 <- group_by(data1, ...)
data3 <- summarise(data2, ...)

Nested functions

data <- summarise(group_by(select(data, ...)))

Collectively, these practices can yield code that is very difficult to read and interpret.

A long honoured unix coding principle is that each application should focus on performing one action and performing that function well. In order to perform a sequence of actions therefore involves piping (via the unix pipe character |) the output of one application to the input of another application and so on in a chain. The grammar of data wrangling also adopts the principle of each tool specializing on one action and tools should be piped together to achieve a sequence of actions.

The piping (glue) operator in R (as of version 4.1) is |>. An object on the left hand side of the |> operator is passed as the first argument of the function on the right hand side.

Prior to the native adoption of the pipe character in R, piping was supported via the magrittr package. Within this package, the pipe operator is %>%. Although both the native and magrittr pipe operators are somewhat analogous, they are not homologous.

  • Syntax:

    • Native pipe: Simple |> symbol placed between the expression and function call.
    • Magrittr pipe: Double percentage signs %>% offering visual distinction from surrounding code.
  • Placeholder:

    • Native pipe: No built-in placeholder for piped-in values. Functions need to handle them explicitly.
    • Magrittr pipe: Dot placeholder (.), allowing concise access to piped-in values within the function call.
  • Functionality:

    • Native pipe: More basic functionality, primarily focused on function chaining.
    • Magrittr pipe: Offers additional features like:
      • Exposition pipe (%~$%): Displays the values passed through the pipe, facilitating debugging.
      • Interjection pipe (~>%): Modifies the piped-in value before passing it to the next function.
      • Lazy evaluation: Delays intermediate calculations until necessary, potentially improving performance.
  • Scope:

    • Native pipe: Recently introduced, integrated directly into R Base package.
    • Magrittr pipe: A separate package, offering more extensibility and customization options.
  • Adoption:

    • Native pipe: Gaining popularity due to its simplicity and integration with base R.
    • Magrittr pipe: Widely used, having established a large community and ecosystem of compatible packages.
  • Choosing between them depends on:

    • Personal preference: Some prefer the visual clarity of %>%, while others favor the simplicity of |>.
    • Project requirements: Consider the need for features like the dot placeholder or additional pipe types.
    • Team collaboration: Choose a consistent style for shared code or adhere to project standards.

Ultimately, both pipes are powerful tools for data manipulation in R. Understanding their differences helps you choose the most appropriate option for your specific needs and coding style.

In pseudo code, the piping approach to the above manipulation would be:

data <- data |>
    select(...) |>
    group_by(...) |>
    summarise(...)

If the code is set out as above (with each verb on a separate line), it gives an opportunity to provide a short comment to the right side of each line to provide useful documentation.

A more specific example

As a motivating (if not a bit extreme) example, lets say we wanted to calculate the logSumExp function:

\[ log(\sum^{n}_{i=1} e^{x_i}) \]

## Generate some data
set.seed(123)
x <- rgamma(10,5,1)
## Calculate the logSumExp
log(sum(exp(x)))
[1] 9.316408
## OR
x1 <- exp(x)
x2 <-sum(x1)
log(x2)
[1] 9.316408

The piping approach could be:

x |> exp() |> sum() |> log()
[1] 9.316408

To reiterate, the following three are equivalent:

exp(x)
 [1]   29.653639 1601.918872    5.101634  118.918637 7140.686681  252.361318
 [7]    9.175114    4.863565 1756.825350  199.466617
x |> exp()
 [1]   29.653639 1601.918872    5.101634  118.918637 7140.686681  252.361318
 [7]    9.175114    4.863565 1756.825350  199.466617

as are the following:

log(x, base=10)
 [1] 0.5301465 0.8679950 0.2120706 0.6792861 0.9480981 0.7427928 0.3456667
 [8] 0.1991438 0.8733941 0.7239190
x |> log(base=10)
 [1] 0.5301465 0.8679950 0.2120706 0.6792861 0.9480981 0.7427928 0.3456667
 [8] 0.1991438 0.8733941 0.7239190

Most of the following examples will demonstrate isolated data manipulation actions (such as filtering, summarising or joining) as this focuses on the specific uses of these functions without the distractions and complications of other actions. For isolated uses, piping has little (if any) advantages. Nevertheless, in recognition that data manipulations rarely comprise a single action (rather they are a series of linked actions), for all remaining examples demonstrated in the tidyverse (dplyr/tidyr) context, piping will be used.

2.3 Tibbles

data.frame’s are collections of variables of the identical length (yet not necessarily the same type) that the fundamental data structures used by most modelling routines in R. Nevertheless, there are numerous properties of data.frames that make them less than ideal. tibbles have been more recently engineered to address these shortcomings:

  • if a data.frame is very large, the print method can result output that is too large to be useful. By contrast, tibbles truncate the output to a maximum of 10 rows and as many columns as will fit in the output. The tibble print method also displays the class (type) of each column (variable).

    show comparison

    dat.1 |> as.data.frame() 
       Treatment Plot Dose Time  Resp1  Resp2
    1    Control   P1    H    1   8.12   3.06
    2    Control   P1    H    2  20.55  25.94
    3    Control   P1    H    3  27.49  29.85
    4    Control   P1    H    4  44.79  25.39
    5    Control   P1    M    1  20.99  20.31
    6    Control   P1    M    2  37.54  17.62
    7    Control   P1    M    3  61.46  98.44
    8    Control   P1    M    4  82.21 160.01
    9    Control   P1    L    1  31.73  21.22
    10   Control   P1    L    2  59.08  37.51
    11   Control   P1    L    3  94.54 119.22
    12   Control   P1    L    4 121.17 116.45
    13   Control   P2    H    1   8.14  23.93
    14   Control   P2    H    2  13.36  28.02
    15   Control   P2    H    3  33.37  37.17
    16   Control   P2    H    4  39.87  38.25
    17   Control   P2    M    1  19.95  19.73
    18   Control   P2    M    2  42.83  40.52
    19   Control   P2    M    3  62.46   4.81
    20   Control   P2    M    4  81.78 136.66
    21   Control   P2    L    1  32.76  30.70
    22   Control   P2    L    2  62.35 123.78
    23   Control   P2    L    3  90.22 113.87
    24   Control   P2    L    4 114.03  76.52
    25 Exclusion   P3    H    1  21.86  23.58
    26 Exclusion   P3    H    2  39.83  28.03
    27 Exclusion   P3    H    3  59.53  21.32
    28 Exclusion   P3    H    4  75.59  90.76
    29 Exclusion   P3    M    1  38.57  30.63
    30 Exclusion   P3    M    2  81.25  83.61
    31 Exclusion   P3    M    3 124.08 124.09
    32 Exclusion   P3    M    4 159.69 112.65
    33 Exclusion   P3    L    1  61.16  39.53
    34 Exclusion   P3    L    2 119.84 110.27
    35 Exclusion   P3    L    3 175.87 286.33
    36 Exclusion   P3    L    4 238.76  54.23
    37 Exclusion   P4    H    1  18.82  28.60
    38 Exclusion   P4    H    2  39.82  39.07
    39 Exclusion   P4    H    3  63.30  93.43
    40 Exclusion   P4    H    4  82.29  60.15
    41 Exclusion   P4    M    1  39.51  45.90
    42 Exclusion   P4    M    2  79.24  88.04
    43 Exclusion   P4    M    3 122.09  84.19
    44 Exclusion   P4    M    4 161.67 256.34
    45 Exclusion   P4    L    1  57.93  85.24
    46 Exclusion   P4    L    2 117.88 167.90
    47 Exclusion   P4    L    3 181.09 314.49
    48 Exclusion   P4    L    4 242.31 304.70
    dat.1 |> as_tibble()
    # A tibble: 48 × 6
       Treatment Plot  Dose   Time Resp1  Resp2
       <fct>     <fct> <fct> <int> <dbl>  <dbl>
     1 Control   P1    H         1  8.12   3.06
     2 Control   P1    H         2 20.6   25.9 
     3 Control   P1    H         3 27.5   29.8 
     4 Control   P1    H         4 44.8   25.4 
     5 Control   P1    M         1 21.0   20.3 
     6 Control   P1    M         2 37.5   17.6 
     7 Control   P1    M         3 61.5   98.4 
     8 Control   P1    M         4 82.2  160.  
     9 Control   P1    L         1 31.7   21.2 
    10 Control   P1    L         2 59.1   37.5 
    # ℹ 38 more rows
  • data.frames have very strict column naming rules and when these are not satisfied, the names will be altered in order to adhere to the rules. tibbles permit a wider range of names.

    show comparison

    data.frame('1.3' = 1:6, 'Wt (kg)' = 1:6)
      X1.3 Wt..kg.
    1    1       1
    2    2       2
    3    3       3
    4    4       4
    5    5       5
    6    6       6
    tibble('1.3' = 1:6, 'Wt (kg)' = 1:6)
    # A tibble: 6 × 2
      `1.3` `Wt (kg)`
      <int>     <int>
    1     1         1
    2     2         2
    3     3         3
    4     4         4
    5     5         5
    6     6         6
  • character vectors are often coerced into factors (categorical variables). This is not the case with tibbles.

  • when vectors are added that are of a different length to the others in the data.frame, the values of the shorter vector(s) are recycled until all vectors are the same length. This behaviour can be dangerous and is not permitted in tibbles (except for vectors of length of one).

    show comparison

    data.frame(Var1 = 1:6, Var2 = 1:2, Var3 = 1)
      Var1 Var2 Var3
    1    1    1    1
    2    2    2    1
    3    3    1    1
    4    4    2    1
    5    5    1    1
    6    6    2    1
    tibble(Var1 = 1:6, Var2 = 1:2, Var3 = 1)
    Error in `tibble()`:
    ! Tibble columns must have compatible sizes.
    • Size 6: Existing data.
    • Size 2: Column `Var2`.
    ℹ Only values of size one are recycled.
    tibble(Var1 = 1:6, Var2 = 1)
    # A tibble: 6 × 2
       Var1  Var2
      <int> <dbl>
    1     1     1
    2     2     1
    3     3     1
    4     4     1
    5     5     1
    6     6     1
  • when sub-setting via column indices, data.frames return a vector (when only a single index is provided) or a data.frame (if multiple indices are provided). This inconsistency is problematic in the context of a processing pipeline. tibble’s will always return a tibble from column indices.

    show comparison

    as.data.frame(dat.1)[,2]
     [1] P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P3
    [26] P3 P3 P3 P3 P3 P3 P3 P3 P3 P3 P3 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4
    Levels: P1 P2 P3 P4
    as_tibble(dat.1)[,2]
    # A tibble: 48 × 1
       Plot 
       <fct>
     1 P1   
     2 P1   
     3 P1   
     4 P1   
     5 P1   
     6 P1   
     7 P1   
     8 P1   
     9 P1   
    10 P1   
    # ℹ 38 more rows
  • data.frames permit partial matching via $ indexing. This can be problematic in the context of a processing pipeline. tibbles expressly forbid this.

    show comparison

    as.data.frame(dat.1)$Plo
     [1] P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P1 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P2 P3
    [26] P3 P3 P3 P3 P3 P3 P3 P3 P3 P3 P3 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4 P4
    Levels: P1 P2 P3 P4
    as_tibble(dat.1)$Plo
    Warning: Unknown or uninitialised column: `Plo`.
    NULL
  • data.frames have very clumsy support for list columns (a list column is a column whose cells contain lists). List columns are better supported in tibbles. The ability to support list columns is an integral for the functional programming routines of the purrr package.

    show comparison

    In this fictitious example, I would like to store three different data.frames (or any three objects for that matter) in the three cells of a variable (e.g. Var2). This is not possible in a data.frame - it will coerce the Var2 column into multiple columns (one for each object) and thus will also recycle the length of each to ensure they are all the same length. Once the data have been coerced into a data.frame, the structure is lost (we can no longer access Var2).

    These issues are addressed in the tibble.

    D <- data.frame(Var1 = LETTERS[1:3],
               Var2 = list(
                   data.frame(a = 1:6),
                   data.frame(a = 1:3),
                   data.frame(a = 1:2)
               ))
    D
      Var1 Var2.a Var2.a.1 Var2.a.2
    1    A      1        1        1
    2    B      2        2        2
    3    C      3        3        1
    4    A      4        1        2
    5    B      5        2        1
    6    C      6        3        2
    D$Var2
    NULL
    D <- tibble(Var1 = LETTERS[1:3],
           Var2 = list(
               data.frame(a = 1:6),
               data.frame(a = 1:3),
               data.frame(a = 1:2)
               ))
    D
    # A tibble: 3 × 2
      Var1  Var2        
      <chr> <list>      
    1 A     <df [6 × 1]>
    2 B     <df [3 × 1]>
    3 C     <df [2 × 1]>
    D$Var2
    [[1]]
      a
    1 1
    2 2
    3 3
    4 4
    5 5
    6 6
    
    [[2]]
      a
    1 1
    2 2
    3 3
    
    [[3]]
      a
    1 1
    2 2

3 Summary and vectorised functions

Many data manipulation actions involve the use of specific auxiliary functions to act on specific parts of the data set. For example, if we wish to summarise a data set, we might apply a mean() function to the numeric vectors (variables) and levels() or unique() function to character or factor vectors. That is, the function is applied to each vector as a whole. Such functions are called summary functions.

On the other hand, if we wish to generate log-transformed versions of the numeric vectors, then we would apply the log() function to each value within those numeric vectors. Furthermore, we might use other auxiliary functions that return vectors of either boolean values (TRUE or FALSE) or integers that can represent row or column indices to determine which observations to perform actions on.

Broadly speaking, summary functions take a vector and return a single value. Familiar examples of summary functions are mean(), var(), min() etc. By contrast, vectorised functions take a vector and return a vector of the same length as the original input vector. rank(), cumsum() and log() are all examples of such vectorised (window) functions.

The dplyr package introduces a number of additional summary and vectorised functions that are deemed useful augmentations of the standard set of functions available in base R. The following tables describes the most useful functions (from base and dplyr) along with which manipulation functions they can be applied. Those functions defined in dplyr include the dplyr namespace.

There are also some operations that appear to provide a mixture of both summary and vectorised functions. The scale() function (which centres and scales a vector), for example, returns a vector of scaled values as well as storing the original mean and standard deviation (summaries) as attributes.

To demonstrate summary and vectorised functions, the following vectors will be used.

## set a random seed to ensure the same values each time
set.seed(123)
## a numeric vector generated from random draws from a gamma distribution
(x <- rgamma(10, 5, 1))
 [1] 3.389585 7.378957 1.629561 4.778440 8.873564 5.530862 2.216495 1.581772
 [9] 7.471264 5.295647
## a character vector
(A <- sample(letters[1:2], size = 10, replace = TRUE))
 [1] "a" "a" "b" "b" "a" "b" "a" "b" "a" "b"
## a boolean vector
(B <- sample(c(TRUE, FALSE), size = 10, replace = TRUE))
 [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE

3.1 Summary functions

Function Description Examples
mean(), median()

Arithmetic mean and median.

Note, boolean vectors are stored as a set of 0 (FALSE) and 1 (TRUE) and thus mathematical functions operate on them as if they were numbers.

mean(x)
[1] 4.814615
median(B)
[1] 1
sum(), prod() Sum and product.
sum(x)
[1] 48.14615
prod(B)
[1] 0
var(), sd() Variance and standard deviation.
var(x)
[1] 6.692355
sd(B)
[1] 0.5163978
mad(), IQR() Median Absolute Deviation and Inter-Quartile Range.
mad(x)
[1] 3.540549
IQR(B)
[1] 1
min(), max() Minimum and maximum.
min(x)
[1] 1.581772
max(B)
[1] 1
quantile() Quantiles
quantile(x)
      0%      25%      50%      75%     100% 
1.581772 2.509767 5.037043 6.916934 8.873564 
quantile(x, p = 0.25)
     25% 
2.509767 
dplyr::first(), dplyr::last(), dplyr::nth() First, laste and nth value.
first(x)
[1] 3.389585
first(x, order_by = A)
[1] 3.389585
last(B)
[1] TRUE
nth(A, n = 4)
[1] "b"
dplyr::n(), dplyr::n_distinct()

Number of values, number of distinct values.

n() is a special case that can only be used within the context of a data.frame or tibble.

n_distinct(A)
[1] 2

3.2 Vectorised functions

Function Description Examples
+,-,*,/,^,%/%, %% typical arithmetic operators
x + 2
 [1]  5.389585  9.378957  3.629561  6.778440
 [5] 10.873564  7.530862  4.216495  3.581772
 [9]  9.471264  7.295647
x * B
 [1] 0.000000 7.378957 1.629561 4.778440
 [5] 8.873564 0.000000 2.216495 0.000000
 [9] 0.000000 5.295647
log(),log2(),log10(), exp() logarithms and exponentials
log(x)
 [1] 1.2207074 1.9986324 0.4883105 1.5641140
 [5] 2.1830765 1.7103437 0.7959271 0.4585455
 [9] 2.0110642 1.6668851
log10(x)
 [1] 0.5301465 0.8679950 0.2120706 0.6792861
 [5] 0.9480981 0.7427928 0.3456667 0.1991438
 [9] 0.8733941 0.7239190
exp(x)
 [1]   29.653639 1601.918872    5.101634
 [4]  118.918637 7140.686681  252.361318
 [7]    9.175114    4.863565 1756.825350
[10]  199.466617
<,<=,>,>=,!=,== logical operators
x < 5
 [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE
 [7]  TRUE  TRUE FALSE FALSE
B == TRUE
 [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE
 [7]  TRUE FALSE FALSE  TRUE
between() Whether a value is between two numbers
between(x, 3, 5)
 [1]  TRUE FALSE FALSE  TRUE FALSE FALSE
 [7] FALSE FALSE FALSE FALSE
near() A safe way of assessing equality (==) in floating points
x == 3.39
 [1] FALSE FALSE FALSE FALSE FALSE FALSE
 [7] FALSE FALSE FALSE FALSE
near(x, y = 3.39, tol =0.01)
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE
 [7] FALSE FALSE FALSE FALSE
lag(), lead() shift cases down/up by one
lag(x)
 [1]       NA 3.389585 7.378957 1.629561
 [5] 4.778440 8.873564 5.530862 2.216495
 [9] 1.581772 7.471264
lead(x)
 [1] 7.378957 1.629561 4.778440 8.873564
 [5] 5.530862 2.216495 1.581772 7.471264
 [9] 5.295647       NA
cummax(), cummin(), dplyr::cummean() Cumulative max, min and mean
cummax(x)
 [1] 3.389585 7.378957 7.378957 7.378957
 [5] 8.873564 8.873564 8.873564 8.873564
 [9] 8.873564 8.873564
cummin(B)
 [1] 0 0 0 0 0 0 0 0 0 0
cummean(x)
 [1] 3.389585 5.384271 4.132701 4.294136
 [5] 5.210021 5.263495 4.828209 4.422404
 [9] 4.761167 4.814615
cummean(x > 2)
 [1] 1.0000000 1.0000000 0.6666667 0.7500000
 [5] 0.8000000 0.8333333 0.8571429 0.7500000
 [9] 0.7777778 0.8000000
cummean(B)
 [1] 0.0000000 0.5000000 0.6666667 0.7500000
 [5] 0.8000000 0.6666667 0.7142857 0.6250000
 [9] 0.5555556 0.6000000
cumsum(), cumprod() Cumulative sum and product
cumsum(x)
 [1]  3.389585 10.768542 12.398103 17.176543
 [5] 26.050107 31.580969 33.797464 35.379235
 [9] 42.850499 48.146146
cumsum(x > 3)
 [1] 1 2 2 3 4 5 5 5 6 7
cumprod(B)
 [1] 0 0 0 0 0 0 0 0 0 0
dplyr::cumall(), dplyr::cumany() Cumulative all and any (mainly for use with filtering).
cumall(x)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [9] TRUE TRUE
cumall(x > 2)
 [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
 [7] FALSE FALSE FALSE FALSE
cumany(x > 4)
 [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 [7]  TRUE  TRUE  TRUE  TRUE
cumany(B)
 [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 [7]  TRUE  TRUE  TRUE  TRUE
cumall(B)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE
 [7] FALSE FALSE FALSE FALSE
rank(), order() Rank and order of items
rank(x)
 [1]  4  8  2  5 10  7  3  1  9  6
order(x)
 [1]  8  3  7  1  4 10  6  2  9  5
rank(B)
 [1] 2.5 7.5 7.5 7.5 7.5 2.5 7.5 2.5 2.5 7.5
dplyr::min_rank(), dplyr::dense_rank(), dplyr::percent_rank() Rank in which ties = min, without gaps and expressed as a percentage.
min_rank(x)
 [1]  4  8  2  5 10  7  3  1  9  6
dense_rank(x)
 [1]  4  8  2  5 10  7  3  1  9  6
percent_rank(x)
 [1] 0.3333333 0.7777778 0.1111111 0.4444444
 [5] 1.0000000 0.6666667 0.2222222 0.0000000
 [9] 0.8888889 0.5555556
rank(B)
 [1] 2.5 7.5 7.5 7.5 7.5 2.5 7.5 2.5 2.5 7.5
dplyr::row_number() Rank in which ties = first.
row_number(x)
 [1]  4  8  2  5 10  7  3  1  9  6
dplyr::cume_dist() Cumulative empirical distribution (proportion less than current value).
cume_dist(x)
 [1] 0.4 0.8 0.2 0.5 1.0 0.7 0.3 0.1 0.9 0.6
dplyr::ntile() Partition into (n) bins.
ntile(x, n = 3)
 [1] 1 3 1 2 3 2 1 1 3 2
dplyr::if_else() Elementwise (case by case) if and else.
if_else(x > 3, true = "H", false = "L")
 [1] "H" "H" "L" "H" "H" "H" "L" "L" "H" "H"
dplyr::case_when() Elementwise multiple if and else.
case_when(x <= 3 ~ "L",
          x > 3 & x <= 6 ~ "M",
          x > 6 ~ "H")
 [1] "M" "H" "L" "M" "H" "M" "L" "L" "H" "M"

4 Sorting data (arrange)

Most statistical analyses are invariant to the data order and thus data reordering is typically only for aesthetics in tables and figures.

Sorting data has the potential to be one of the most dangerous forms of data manipulation - particularly in spreadsheets in which there is no real binding between individual columns or rows. It is far to easy to accidentally sort the data in a single column (or row) without applying the ordering to all the other columns in the data set thereby resulting in broken data.

Whilst the above apocalypse is still possible in R, the data structures and manipulation interfaces mean that you really have to try to break the data in this way. Furthermore, you are encouraged to store reordered data in a different object to the original data, and hence ‘rolling’ back is trivial.

First 10 rows of the `dat.1` data.frame
Treatment Plot Dose Time Resp1 Resp2
Control P1 H 1 8.12 3.06
Control P1 H 2 20.55 25.94
Control P1 H 3 27.49 29.85
Control P1 H 4 44.79 25.39
Control P1 M 1 20.99 20.31
Control P1 M 2 37.54 17.62
Control P1 M 3 61.46 98.44
Control P1 M 4 82.21 160.01
Control P1 L 1 31.73 21.22
Control P1 L 2 59.08 37.51
dat.1 |> arrange(Resp1) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P2    H    1  8.14 23.93
3   Control   P2    H    2 13.36 28.02
4 Exclusion   P4    H    1 18.82 28.60
5   Control   P2    M    1 19.95 19.73
6   Control   P1    H    2 20.55 25.94
dat.1 |> arrange(desc(Resp1)) |>
    head()
  Treatment Plot Dose Time  Resp1  Resp2
1 Exclusion   P4    L    4 242.31 304.70
2 Exclusion   P3    L    4 238.76  54.23
3 Exclusion   P4    L    3 181.09 314.49
4 Exclusion   P3    L    3 175.87 286.33
5 Exclusion   P4    M    4 161.67 256.34
6 Exclusion   P3    M    4 159.69 112.65
dat.1 |> arrange(Dose, Resp1) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P2    H    1  8.14 23.93
3   Control   P2    H    2 13.36 28.02
4 Exclusion   P4    H    1 18.82 28.60
5   Control   P1    H    2 20.55 25.94
6 Exclusion   P3    H    1 21.86 23.58
dat.1 |> arrange(Resp1 + Resp2) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P2    H    1  8.14 23.93
3   Control   P2    M    1 19.95 19.73
4   Control   P1    M    1 20.99 20.31
5   Control   P2    H    2 13.36 28.02
6 Exclusion   P3    H    1 21.86 23.58
dat.1 |> arrange(Dose, mean(c(Resp1, Resp2))) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P1    H    2 20.55 25.94
3   Control   P1    H    3 27.49 29.85
4   Control   P1    H    4 44.79 25.39
5   Control   P2    H    1  8.14 23.93
6   Control   P2    H    2 13.36 28.02

5 Subsets of data

We regularly want to run an analysis or generate tabular/graphic summaries for a sub-set of the data. For example, we may wish to reduce the number of columns in the data frame to remove excessive and unwanted data fields. Alternatively (or additionally), we may wish to focus on a subset of the rows. These two forms of sub-setting are:

  • selecting - reducing the number of columns
  • filtering - reducing the number of rows

5.1 Subset columns (select)

Selecting works by either including (or excluding) the column names that you indicate or via special selection ‘Helper’ functions that pass a vector of column indices to include in the subset data.

First 10 rows of the `dat.1` data.frame
Treatment Plot Dose Time Resp1 Resp2
Control P1 H 1 8.12 3.06
Control P1 H 2 20.55 25.94
Control P1 H 3 27.49 29.85
Control P1 H 4 44.79 25.39
Control P1 M 1 20.99 20.31
Control P1 M 2 37.54 17.62
Control P1 M 3 61.46 98.44
Control P1 M 4 82.21 160.01
Control P1 L 1 31.73 21.22
Control P1 L 2 59.08 37.51

The select() function users tidy-select semantics.

Select specific variables

dat.1 |> select(Treatment, Dose, Time, Resp1) |>
    head() 
  Treatment Dose Time Resp1
1   Control    H    1  8.12
2   Control    H    2 20.55
3   Control    H    3 27.49
4   Control    H    4 44.79
5   Control    M    1 20.99
6   Control    M    2 37.54

Exclude specific variables

dat.1 |> select(-Plot, -Resp2) |>
    head()
  Treatment Dose Time Resp1
1   Control    H    1  8.12
2   Control    H    2 20.55
3   Control    H    3 27.49
4   Control    H    4 44.79
5   Control    M    1 20.99
6   Control    M    2 37.54

Include columns Treatment through to Time

dat.1 |> select(Treatment:Time) |>
    head() 
  Treatment Plot Dose Time
1   Control   P1    H    1
2   Control   P1    H    2
3   Control   P1    H    3
4   Control   P1    H    4
5   Control   P1    M    1
6   Control   P1    M    2

Exclude columns Treatment through to Time

dat.1 |> select(-(Treatment:Time)) |>
    head() 
  Resp1 Resp2
1  8.12  3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62

Exclude columns Treatment through to Time

dat.1 |> select(!(Treatment:Time)) |>
    head() 
  Resp1 Resp2
1  8.12  3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62

Names containing and ‘r’ (case insensitive)

dat.1 |> select(contains("R")) |>
    head() 
  Treatment Resp1 Resp2
1   Control  8.12  3.06
2   Control 20.55 25.94
3   Control 27.49 29.85
4   Control 44.79 25.39
5   Control 20.99 20.31
6   Control 37.54 17.62

Names starting with ‘r’ (case insensitive)

dat.1 |> select(starts_with("R")) |>
    head() 
  Resp1 Resp2
1  8.12  3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62

Names ending in ‘e’ (case insensitive)

dat.1 |> select(ends_with("e")) |>
    head() 
  Dose Time
1    H    1
2    H    2
3    H    3
4    H    4
5    M    1
6    M    2

Names that are exactly four characters long

dat.1 |> select(matches("^.{4}$")) |>
    head()
  Plot Dose Time
1   P1    H    1
2   P1    H    2
3   P1    H    3
4   P1    H    4
5   P1    M    1
6   P1    M    2
dat.1 |> select(num_range("Resp", 1:2)) |>
    head()
  Resp1 Resp2
1  8.12  3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
dat.1 |> select(num_range("Resp", 1:2),
                 everything()) |>
    head() 
  Resp1 Resp2 Treatment Plot Dose Time
1  8.12  3.06   Control   P1    H    1
2 20.55 25.94   Control   P1    H    2
3 27.49 29.85   Control   P1    H    3
4 44.79 25.39   Control   P1    H    4
5 20.99 20.31   Control   P1    M    1
6 37.54 17.62   Control   P1    M    2

Select from a vector of strings

Vars <- c("Resp1", "Resp2")
dat.1 |> select(all_of(Vars)) |>
    head()
  Resp1 Resp2
1  8.12  3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62

Select from string vector of strings…

Vars <- c("Resp1", "Resp2", "Resp3")
dat.1 |> select(any_of(Vars)) |>
    head() 
  Resp1 Resp2
1  8.12  3.06
2 20.55 25.94
3 27.49 29.85
4 44.79 25.39
5 20.99 20.31
6 37.54 17.62
Vars <- c("Resp1", "Resp2", "Resp3")
dat.1 |> select(-any_of(Vars)) |>
    head() 
  Treatment Plot Dose Time
1   Control   P1    H    1
2   Control   P1    H    2
3   Control   P1    H    3
4   Control   P1    H    4
5   Control   P1    M    1
6   Control   P1    M    2
dat.1 |> select(where(is.numeric)) |>
    head() 
  Time Resp1 Resp2
1    1  8.12  3.06
2    2 20.55 25.94
3    3 27.49 29.85
4    4 44.79 25.39
5    1 20.99 20.31
6    2 37.54 17.62

Since tibbles permit column names that have special characters in them, in order to refer to a column that has a name with special characters, it is necessary to enclose the name in backticks. For example, to select the variable, Pocillopora damicornis (which contains a space in the name - these are not permitted in data.frames, yet are permitted in tibbles) and print the first six rows:

tikus |>
    select(`Pocillopora damicornis`) |>
    head()
   Pocillopora damicornis
V1                     79
V2                     51
V3                     42
V4                     15
V5                      9
V6                     72

5.2 Renaming columns (rename)

Note, it is possible to have column names renamed during a select action.

Both the rename() and select() functions user tidy-select semantics.

dat.1 |>
    select("Weight" = Resp1) |>
    head()
  Weight
1   8.12
2  20.55
3  27.49
4  44.79
5  20.99
6  37.54

If we want to retain the other variables, we would also have to include them in the select either explicitly, or via a helper function.

dat.1 |>
    select(everything(), "Weight" = Resp1) |>
    head()
  Treatment Plot Dose Time Weight Resp2
1   Control   P1    H    1   8.12  3.06
2   Control   P1    H    2  20.55 25.94
3   Control   P1    H    3  27.49 29.85
4   Control   P1    H    4  44.79 25.39
5   Control   P1    M    1  20.99 20.31
6   Control   P1    M    2  37.54 17.62

However, note that this might not retain the order of the columns. Hence, for simple renaming of columns, the rename function is more convenient.

dat.1 |>
    rename("Weight" = Resp1) |>
    head()
  Treatment Plot Dose Time Weight Resp2
1   Control   P1    H    1   8.12  3.06
2   Control   P1    H    2  20.55 25.94
3   Control   P1    H    3  27.49 29.85
4   Control   P1    H    4  44.79 25.39
5   Control   P1    M    1  20.99 20.31
6   Control   P1    M    2  37.54 17.62

It is also possible to rename columns based on a lookup (list or data.frame). This is handy for cases when data columns have conveniently abbreviated names yet you are preparing data for tabular output - and thus need more descriptive column names.

lookup <- list("Response 1" = "Resp1",
               "Response 2" = "Resp2")
dat.1 |>
    rename(!!!lookup) |>
    head()
  Treatment Plot Dose Time Response 1 Response 2
1   Control   P1    H    1       8.12       3.06
2   Control   P1    H    2      20.55      25.94
3   Control   P1    H    3      27.49      29.85
4   Control   P1    H    4      44.79      25.39
5   Control   P1    M    1      20.99      20.31
6   Control   P1    M    2      37.54      17.62

In the above example, the big bang operator !!! forces-splice a list operator. That is, the elements of the list are spliced into the statement as if they had been included directly.

To do the same from a data.frame lookup..

lookup <- tribble(
    ~Abbr_name, ~Long_name,
    "Resp1",     "Response 1",
    "Resp2",     "Response 2")
## Convert to list of pairs
lookup <- lookup |>
    select(Long_name, Abbr_name) |> 
    deframe() |>
    list()
dat.1 |>
    rename(!!!lookup) |>
    head()
  Treatment Plot Dose Time Response 1 Response 2
1   Control   P1    H    1       8.12       3.06
2   Control   P1    H    2      20.55      25.94
3   Control   P1    H    3      27.49      29.85
4   Control   P1    H    4      44.79      25.39
5   Control   P1    M    1      20.99      20.31
6   Control   P1    M    2      37.54      17.62

5.3 Select a single column (pull)

As indicated earlier, single column indices on tibbles return a single column tibble. To extract a single column as a vector, there is a pull function.

The pull() function users tidy-select semantics.

dat.1 |> pull(Resp1) 
 [1]   8.12  20.55  27.49  44.79  20.99  37.54  61.46  82.21  31.73  59.08
[11]  94.54 121.17   8.14  13.36  33.37  39.87  19.95  42.83  62.46  81.78
[21]  32.76  62.35  90.22 114.03  21.86  39.83  59.53  75.59  38.57  81.25
[31] 124.08 159.69  61.16 119.84 175.87 238.76  18.82  39.82  63.30  82.29
[41]  39.51  79.24 122.09 161.67  57.93 117.88 181.09 242.31

5.4 Subset of rows (filter)

Filtering selects rows for which a condition is evaluated to be TRUE. Hence, any logical expression or vectorized function that returns Boolean values (TRUE or FALSE) can be used for filtering.

The filter() function users data-masking semantics.

First 10 rows of the `dat.1` data.frame
Treatment Plot Dose Time Resp1 Resp2
Control P1 H 1 8.12 3.06
Control P1 H 2 20.55 25.94
Control P1 H 3 27.49 29.85
Control P1 H 4 44.79 25.39
Control P1 M 1 20.99 20.31
Control P1 M 2 37.54 17.62
Control P1 M 3 61.46 98.44
Control P1 M 4 82.21 160.01
Control P1 L 1 31.73 21.22
Control P1 L 2 59.08 37.51
Helper function Combine multiple selections …
if_any() With an an OR
if_all() With an AND)

Note, by default, the above searchers are NOT case sensitive

Logical/Boolean function Description
== is equal to
!= is not equal to
< is less than
> is greater than
<= is less than or equal to
>= is greater than or equal to
is.na() is equal to NA
!is.na() is not equal to NA
%in% is in the following vector
! not
& && logical AND
| || logical OR
xor() elementwise OR
isTRUE() is true
isFALSE() is false
dat.1 |> filter(Dose == "H")
   Treatment Plot Dose Time Resp1 Resp2
1    Control   P1    H    1  8.12  3.06
2    Control   P1    H    2 20.55 25.94
3    Control   P1    H    3 27.49 29.85
4    Control   P1    H    4 44.79 25.39
5    Control   P2    H    1  8.14 23.93
6    Control   P2    H    2 13.36 28.02
7    Control   P2    H    3 33.37 37.17
8    Control   P2    H    4 39.87 38.25
9  Exclusion   P3    H    1 21.86 23.58
10 Exclusion   P3    H    2 39.83 28.03
11 Exclusion   P3    H    3 59.53 21.32
12 Exclusion   P3    H    4 75.59 90.76
13 Exclusion   P4    H    1 18.82 28.60
14 Exclusion   P4    H    2 39.82 39.07
15 Exclusion   P4    H    3 63.30 93.43
16 Exclusion   P4    H    4 82.29 60.15
dat.1 |> filter(Dose %in% c("H", "M"))
   Treatment Plot Dose Time  Resp1  Resp2
1    Control   P1    H    1   8.12   3.06
2    Control   P1    H    2  20.55  25.94
3    Control   P1    H    3  27.49  29.85
4    Control   P1    H    4  44.79  25.39
5    Control   P1    M    1  20.99  20.31
6    Control   P1    M    2  37.54  17.62
7    Control   P1    M    3  61.46  98.44
8    Control   P1    M    4  82.21 160.01
9    Control   P2    H    1   8.14  23.93
10   Control   P2    H    2  13.36  28.02
11   Control   P2    H    3  33.37  37.17
12   Control   P2    H    4  39.87  38.25
13   Control   P2    M    1  19.95  19.73
14   Control   P2    M    2  42.83  40.52
15   Control   P2    M    3  62.46   4.81
16   Control   P2    M    4  81.78 136.66
17 Exclusion   P3    H    1  21.86  23.58
18 Exclusion   P3    H    2  39.83  28.03
19 Exclusion   P3    H    3  59.53  21.32
20 Exclusion   P3    H    4  75.59  90.76
21 Exclusion   P3    M    1  38.57  30.63
22 Exclusion   P3    M    2  81.25  83.61
23 Exclusion   P3    M    3 124.08 124.09
24 Exclusion   P3    M    4 159.69 112.65
25 Exclusion   P4    H    1  18.82  28.60
26 Exclusion   P4    H    2  39.82  39.07
27 Exclusion   P4    H    3  63.30  93.43
28 Exclusion   P4    H    4  82.29  60.15
29 Exclusion   P4    M    1  39.51  45.90
30 Exclusion   P4    M    2  79.24  88.04
31 Exclusion   P4    M    3 122.09  84.19
32 Exclusion   P4    M    4 161.67 256.34
dat.1 |> filter(Resp1 < 25)
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P1    H    2 20.55 25.94
3   Control   P1    M    1 20.99 20.31
4   Control   P2    H    1  8.14 23.93
5   Control   P2    H    2 13.36 28.02
6   Control   P2    M    1 19.95 19.73
7 Exclusion   P3    H    1 21.86 23.58
8 Exclusion   P4    H    1 18.82 28.60
dat.1 |> filter(between(Resp1, 15, 25))
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    2 20.55 25.94
2   Control   P1    M    1 20.99 20.31
3   Control   P2    M    1 19.95 19.73
4 Exclusion   P3    H    1 21.86 23.58
5 Exclusion   P4    H    1 18.82 28.60

Keep all cases after a value of Resp1 has exceeded 200

dat.1 |> filter(cumany(Resp1 > 200))
   Treatment Plot Dose Time  Resp1  Resp2
1  Exclusion   P3    L    4 238.76  54.23
2  Exclusion   P4    H    1  18.82  28.60
3  Exclusion   P4    H    2  39.82  39.07
4  Exclusion   P4    H    3  63.30  93.43
5  Exclusion   P4    H    4  82.29  60.15
6  Exclusion   P4    M    1  39.51  45.90
7  Exclusion   P4    M    2  79.24  88.04
8  Exclusion   P4    M    3 122.09  84.19
9  Exclusion   P4    M    4 161.67 256.34
10 Exclusion   P4    L    1  57.93  85.24
11 Exclusion   P4    L    2 117.88 167.90
12 Exclusion   P4    L    3 181.09 314.49
13 Exclusion   P4    L    4 242.31 304.70

Keep all cases until the first case of Resp1 < 25

dat.1 |> filter(cumall(Resp1 < 25))
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P1    H    2 20.55 25.94
dat.1 |> filter(Dose == "H", Resp1 < 25)
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P1    H    2 20.55 25.94
3   Control   P2    H    1  8.14 23.93
4   Control   P2    H    2 13.36 28.02
5 Exclusion   P3    H    1 21.86 23.58
6 Exclusion   P4    H    1 18.82 28.60
##OR
dat.1 |> filter(Dose == "H" & Resp1 < 25)
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P1    H    2 20.55 25.94
3   Control   P2    H    1  8.14 23.93
4   Control   P2    H    2 13.36 28.02
5 Exclusion   P3    H    1 21.86 23.58
6 Exclusion   P4    H    1 18.82 28.60
dat.1 |> filter(Dose == "H" | Resp1 < 25)
   Treatment Plot Dose Time Resp1 Resp2
1    Control   P1    H    1  8.12  3.06
2    Control   P1    H    2 20.55 25.94
3    Control   P1    H    3 27.49 29.85
4    Control   P1    H    4 44.79 25.39
5    Control   P1    M    1 20.99 20.31
6    Control   P2    H    1  8.14 23.93
7    Control   P2    H    2 13.36 28.02
8    Control   P2    H    3 33.37 37.17
9    Control   P2    H    4 39.87 38.25
10   Control   P2    M    1 19.95 19.73
11 Exclusion   P3    H    1 21.86 23.58
12 Exclusion   P3    H    2 39.83 28.03
13 Exclusion   P3    H    3 59.53 21.32
14 Exclusion   P3    H    4 75.59 90.76
15 Exclusion   P4    H    1 18.82 28.60
16 Exclusion   P4    H    2 39.82 39.07
17 Exclusion   P4    H    3 63.30 93.43
18 Exclusion   P4    H    4 82.29 60.15
dat.1 |> filter(if_all(starts_with("Resp"), ~ . < 25))
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P1    M    1 20.99 20.31
3   Control   P2    H    1  8.14 23.93
4   Control   P2    M    1 19.95 19.73
5 Exclusion   P3    H    1 21.86 23.58
dat.1 |> filter(if_any(starts_with("Resp"), ~ . < 25))
   Treatment Plot Dose Time Resp1 Resp2
1    Control   P1    H    1  8.12  3.06
2    Control   P1    H    2 20.55 25.94
3    Control   P1    M    1 20.99 20.31
4    Control   P1    M    2 37.54 17.62
5    Control   P1    L    1 31.73 21.22
6    Control   P2    H    1  8.14 23.93
7    Control   P2    H    2 13.36 28.02
8    Control   P2    M    1 19.95 19.73
9    Control   P2    M    3 62.46  4.81
10 Exclusion   P3    H    1 21.86 23.58
11 Exclusion   P3    H    3 59.53 21.32
12 Exclusion   P4    H    1 18.82 28.60

5.5 Subset of rows by position (slice)

dat.1 |> slice(1:4)
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P1    H    2 20.55 25.94
3   Control   P1    H    3 27.49 29.85
4   Control   P1    H    4 44.79 25.39
dat.1 |> slice(c(1:4, 7))
  Treatment Plot Dose Time Resp1 Resp2
1   Control   P1    H    1  8.12  3.06
2   Control   P1    H    2 20.55 25.94
3   Control   P1    H    3 27.49 29.85
4   Control   P1    H    4 44.79 25.39
5   Control   P1    M    3 61.46 98.44

5.6 Random selection of rows (sample_n)

In each of the examples, I will set the random seed to ensure we can all repeat the example exactly.

Random sample of 10 rows (without replacement)

set.seed(123)
dat.1 |> sample_n(10, replace = FALSE)
   Treatment Plot Dose Time  Resp1  Resp2
1  Exclusion   P3    M    3 124.08 124.09
2    Control   P2    H    3  33.37  37.17
3    Control   P2    H    2  13.36  28.02
4    Control   P1    H    3  27.49  29.85
5  Exclusion   P4    M    2  79.24  88.04
6  Exclusion   P4    M    3 122.09  84.19
7  Exclusion   P4    H    1  18.82  28.60
8  Exclusion   P4    L    2 117.88 167.90
9  Exclusion   P3    H    1  21.86  23.58
10 Exclusion   P3    H    2  39.83  28.03

Random sample of 25% of the rows (without replacement)

set.seed(123)
dat.1 |> sample_frac(0.25, replace = FALSE)
   Treatment Plot Dose Time  Resp1  Resp2
1  Exclusion   P3    M    3 124.08 124.09
2    Control   P2    H    3  33.37  37.17
3    Control   P2    H    2  13.36  28.02
4    Control   P1    H    3  27.49  29.85
5  Exclusion   P4    M    2  79.24  88.04
6  Exclusion   P4    M    3 122.09  84.19
7  Exclusion   P4    H    1  18.82  28.60
8  Exclusion   P4    L    2 117.88 167.90
9  Exclusion   P3    H    1  21.86  23.58
10 Exclusion   P3    H    2  39.83  28.03
11 Exclusion   P3    H    3  59.53  21.32
12   Control   P1    M    1  20.99  20.31

5.7 Effects of filtering on factor levels

In R, categorical variables (factors) are actually stored as vectors of integers (1, 2, 3, …) along with an attribute that registers the names (and order) of the levels. We can see this, if we ask to see the structure of any categorical variable:

str(dat.1$Dose)
 Factor w/ 3 levels "H","L","M": 1 1 1 1 3 3 3 3 2 2 ...

In the above, we see that the levels of the Dose variable are “H”, “L”, “M” (by default, levels of an unordered factor are alphabetical). The first four cases are all 1, which is mapped to “H”. The next four are 3 which maps to “M” and so on.

Alternatively, we can explore the levels attribute of a factor via the levels() function:

levels(dat.1$Dose)
[1] "H" "L" "M"
levels(dat.1$Plot)
[1] "P1" "P2" "P3" "P4"
levels(dat.1$Treatment)
[1] "Control"   "Exclusion"

Although subsets of rows (filter, sample_n etc) may appear to completely remove particular levels of a categorical variable, these actions do not update factor attributes. Consequently, many routines that operate on such factors (such as plots and statistical models) will proceed as if all factor levels are present (e.g. plots will contain gaps where space has been provisioned for all levels even though there may be no data associated with all levels).

To illustrate this, we will filter the dat.1 data such that it only includes cases for which Plot equals “P1” and then explore the levels attribute of some of the categorical variables.

dat.2 <- dat.1 |>
    filter(Plot == "P1")
dat.2
   Treatment Plot Dose Time  Resp1  Resp2
1    Control   P1    H    1   8.12   3.06
2    Control   P1    H    2  20.55  25.94
3    Control   P1    H    3  27.49  29.85
4    Control   P1    H    4  44.79  25.39
5    Control   P1    M    1  20.99  20.31
6    Control   P1    M    2  37.54  17.62
7    Control   P1    M    3  61.46  98.44
8    Control   P1    M    4  82.21 160.01
9    Control   P1    L    1  31.73  21.22
10   Control   P1    L    2  59.08  37.51
11   Control   P1    L    3  94.54 119.22
12   Control   P1    L    4 121.17 116.45
levels(dat.2$Dose)
[1] "H" "L" "M"
levels(dat.2$Plot)
[1] "P1" "P2" "P3" "P4"
levels(dat.2$Treatment)
[1] "Control"   "Exclusion"

So although the data only contains Plot values of “P1” (and Treatment values of “Control”), the levels are still listed as “P1”, “P2”, “P3”, and “P4”.

To ensure that the attributes reflect the subset data, it is necessary to use the droplevels() function:

dat.2 <- dat.1 |>
    filter(Plot == "P1") |>
    droplevels()
levels(dat.2$Dose)
[1] "H" "L" "M"
levels(dat.2$Plot)
[1] "P1"
levels(dat.2$Treatment)
[1] "Control"

6 Adding/modifying columns (mutate)

To add to (mutate) a data set, a vectorised function is applied either once or across one or more existing columns.

The mutate() function users data-masking semantics and can use tidy-select semantics via across().

Whether transforming/scaling existing variables or generating new variables derived from existing variables it is advisable that existing variables be retained (unless the data are huge and storage is tight).

dat.1 |> mutate(Sum = Resp1 + Resp2) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2   Sum
1   Control   P1    H    1  8.12  3.06 11.18
2   Control   P1    H    2 20.55 25.94 46.49
3   Control   P1    H    3 27.49 29.85 57.34
4   Control   P1    H    4 44.79 25.39 70.18
5   Control   P1    M    1 20.99 20.31 41.30
6   Control   P1    M    2 37.54 17.62 55.16

Multiple additions/modifications can be defined in a single mutate call and the changes occur sequentially. Consequently, subsequent changes can build upon early changes. To illustrate this, we will create a new variable (cResp1) that is a centred version of the Resp1 variable (centring is when the mean of a variable is subtracted from each value resulting in a variable whose mean is exactly zero, yet all other properties are unaffected.).

dat.1 |> mutate(meanResp1 = mean(Resp1),
                 cResp1 = Resp1 - meanResp1) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 meanResp1    cResp1
1   Control   P1    H    1  8.12  3.06  75.26604 -67.14604
2   Control   P1    H    2 20.55 25.94  75.26604 -54.71604
3   Control   P1    H    3 27.49 29.85  75.26604 -47.77604
4   Control   P1    H    4 44.79 25.39  75.26604 -30.47604
5   Control   P1    M    1 20.99 20.31  75.26604 -54.27604
6   Control   P1    M    2 37.54 17.62  75.26604 -37.72604

The following table lists some of the more useful vectorised functions to use with mutate. Importantly, vectorised functions return a vector of the same length as they receive since all mutate edits must preserve the number of rows.

First 10 rows of the `dat.1` data.frame
Treatment Plot Dose Time Resp1 Resp2
Control P1 H 1 8.12 3.06
Control P1 H 2 20.55 25.94
Control P1 H 3 27.49 29.85
Control P1 H 4 44.79 25.39
Control P1 M 1 20.99 20.31
Control P1 M 2 37.54 17.62
Control P1 M 3 61.46 98.44
Control P1 M 4 82.21 160.01
Control P1 L 1 31.73 21.22
Control P1 L 2 59.08 37.51

6.1 Mathematical functions

Function Description
+,-,*,/,^,%/%, %% typical arithmetic operators
log(),log2(),log10(), exp() logarithms/exponentials
<,<=,>,>=,!=,== logical operators
between() whether a case is between to numbers
near() a safe way of assessing equality in floating points
dat.1 |> mutate(Sum = Resp1 + Resp2,
                 Div = Resp1 / Resp2) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2   Sum       Div
1   Control   P1    H    1  8.12  3.06 11.18 2.6535948
2   Control   P1    H    2 20.55 25.94 46.49 0.7922128
3   Control   P1    H    3 27.49 29.85 57.34 0.9209380
4   Control   P1    H    4 44.79 25.39 70.18 1.7640803
5   Control   P1    M    1 20.99 20.31 41.30 1.0334810
6   Control   P1    M    2 37.54 17.62 55.16 2.1305335
dat.1 |> mutate(logResp1 = log(Resp1),
                 expResp2 = exp(Resp2)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 logResp1     expResp2
1   Control   P1    H    1  8.12  3.06 2.094330 2.132756e+01
2   Control   P1    H    2 20.55 25.94 3.022861 1.843312e+11
3   Control   P1    H    3 27.49 29.85 3.313822 9.197934e+12
4   Control   P1    H    4 44.79 25.39 3.801985 1.063499e+11
5   Control   P1    M    1 20.99 20.31 3.044046 6.614864e+08
6   Control   P1    M    2 37.54 17.62 3.625407 4.490232e+07
dat.1 |> mutate(largeResp1 = Resp1 > 25) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 largeResp1
1   Control   P1    H    1  8.12  3.06      FALSE
2   Control   P1    H    2 20.55 25.94      FALSE
3   Control   P1    H    3 27.49 29.85       TRUE
4   Control   P1    H    4 44.79 25.39       TRUE
5   Control   P1    M    1 20.99 20.31      FALSE
6   Control   P1    M    2 37.54 17.62       TRUE
dat.1 |> mutate(
              A = Resp1 == 8.1,
              B = near(Resp1, 8.1, tol = 0.1)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2     A     B
1   Control   P1    H    1  8.12  3.06 FALSE  TRUE
2   Control   P1    H    2 20.55 25.94 FALSE FALSE
3   Control   P1    H    3 27.49 29.85 FALSE FALSE
4   Control   P1    H    4 44.79 25.39 FALSE FALSE
5   Control   P1    M    1 20.99 20.31 FALSE FALSE
6   Control   P1    M    2 37.54 17.62 FALSE FALSE
dat.1 |> mutate(mediumResp1 = between(Resp1, 15, 25)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 mediumResp1
1   Control   P1    H    1  8.12  3.06       FALSE
2   Control   P1    H    2 20.55 25.94        TRUE
3   Control   P1    H    3 27.49 29.85       FALSE
4   Control   P1    H    4 44.79 25.39       FALSE
5   Control   P1    M    1 20.99 20.31        TRUE
6   Control   P1    M    2 37.54 17.62       FALSE

6.2 Offset functions

Function Description
lag() shift cases down one
lead() shift cases up one
dat.1 |> mutate(lagResp1 = lag(Resp1)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 lagResp1
1   Control   P1    H    1  8.12  3.06       NA
2   Control   P1    H    2 20.55 25.94     8.12
3   Control   P1    H    3 27.49 29.85    20.55
4   Control   P1    H    4 44.79 25.39    27.49
5   Control   P1    M    1 20.99 20.31    44.79
6   Control   P1    M    2 37.54 17.62    20.99
dat.1 |> mutate(leadResp1 = lead(Resp1)) |>
    tail()
   Treatment Plot Dose Time  Resp1  Resp2 leadResp1
43 Exclusion   P4    M    3 122.09  84.19    161.67
44 Exclusion   P4    M    4 161.67 256.34     57.93
45 Exclusion   P4    L    1  57.93  85.24    117.88
46 Exclusion   P4    L    2 117.88 167.90    181.09
47 Exclusion   P4    L    3 181.09 314.49    242.31
48 Exclusion   P4    L    4 242.31 304.70        NA

6.3 Cumulative aggregate functions

Function Description
cummax() cumulative maximum
cummin() cumulative minimum
cummean() cumulative mean
cumprod() cumulative product
cumsum() cumulative sum
rank() rank of current case (ties averaged)
min_rank() rank of current case (minimum rank for ties)
dense_rank() rank of current case (minimum rank for ties, no gaps)
percent_rank() min_rank of current case (scaled to [0,1])
cume_dist() cumulative empirical distribution (prop. less than current rank)
row_number() rank of current case (first row for ties)
ntile() bin into (n) buckets
dat.1 |> mutate(Cummin = cummin(Resp1),
                 Cummax = cummax(Resp1),
                 Cummean = cummean(Resp1)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 Cummin Cummax Cummean
1   Control   P1    H    1  8.12  3.06   8.12   8.12  8.1200
2   Control   P1    H    2 20.55 25.94   8.12  20.55 14.3350
3   Control   P1    H    3 27.49 29.85   8.12  27.49 18.7200
4   Control   P1    H    4 44.79 25.39   8.12  44.79 25.2375
5   Control   P1    M    1 20.99 20.31   8.12  44.79 24.3880
6   Control   P1    M    2 37.54 17.62   8.12  44.79 26.5800
dat.1 |> mutate(Cumsum = cumsum(Resp1),
                 Cumprod = cumprod(Resp1)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 Cumsum      Cumprod
1   Control   P1    H    1  8.12  3.06   8.12 8.120000e+00
2   Control   P1    H    2 20.55 25.94  28.67 1.668660e+02
3   Control   P1    H    3 27.49 29.85  56.16 4.587146e+03
4   Control   P1    H    4 44.79 25.39 100.95 2.054583e+05
5   Control   P1    M    1 20.99 20.31 121.94 4.312569e+06
6   Control   P1    M    2 37.54 17.62 159.48 1.618939e+08
dat.1 |> mutate(Rank = rank(Resp1),
                 minRank = min_rank(Resp1),
                 denseRank = dense_rank(Resp1),
                 percentRank = percent_rank(Resp1)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 Rank minRank denseRank percentRank
1   Control   P1    H    1  8.12  3.06    1       1         1   0.0000000
2   Control   P1    H    2 20.55 25.94    6       6         6   0.1063830
3   Control   P1    H    3 27.49 29.85    9       9         9   0.1702128
4   Control   P1    H    4 44.79 25.39   20      20        20   0.4042553
5   Control   P1    M    1 20.99 20.31    7       7         7   0.1276596
6   Control   P1    M    2 37.54 17.62   13      13        13   0.2553191
dat.1 |> mutate(cume_dist(Resp1)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 cume_dist(Resp1)
1   Control   P1    H    1  8.12  3.06       0.02083333
2   Control   P1    H    2 20.55 25.94       0.12500000
3   Control   P1    H    3 27.49 29.85       0.18750000
4   Control   P1    H    4 44.79 25.39       0.41666667
5   Control   P1    M    1 20.99 20.31       0.14583333
6   Control   P1    M    2 37.54 17.62       0.27083333
dat.1 |> mutate(row_number(Resp1)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 row_number(Resp1)
1   Control   P1    H    1  8.12  3.06                 1
2   Control   P1    H    2 20.55 25.94                 6
3   Control   P1    H    3 27.49 29.85                 9
4   Control   P1    H    4 44.79 25.39                20
5   Control   P1    M    1 20.99 20.31                 7
6   Control   P1    M    2 37.54 17.62                13
dat.1 |> mutate(ntile(Resp1, 5)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 ntile(Resp1, 5)
1   Control   P1    H    1  8.12  3.06               1
2   Control   P1    H    2 20.55 25.94               1
3   Control   P1    H    3 27.49 29.85               1
4   Control   P1    H    4 44.79 25.39               2
5   Control   P1    M    1 20.99 20.31               1
6   Control   P1    M    2 37.54 17.62               2

6.4 Miscellaneous

Function Description
if_else() elementwise (case by case) if and else
case_when() elementwise multiple if_else
na_if() elementwise replace nominated value with NA
pmax() elementwise maximum across multiple columns
pmin() elementwise minimum across multiple columns
dat.1 |> mutate(Size = if_else(Resp1 > 25, "Big", "Small")) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2  Size
1   Control   P1    H    1  8.12  3.06 Small
2   Control   P1    H    2 20.55 25.94 Small
3   Control   P1    H    3 27.49 29.85   Big
4   Control   P1    H    4 44.79 25.39   Big
5   Control   P1    M    1 20.99 20.31 Small
6   Control   P1    M    2 37.54 17.62   Big
dat.1 |> mutate(Size = case_when(Resp1 < 15 ~ "Small",
                               Resp1 < 25 ~ "Medium",
                               Resp1 >= 25 ~ "Big")) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2   Size
1   Control   P1    H    1  8.12  3.06  Small
2   Control   P1    H    2 20.55 25.94 Medium
3   Control   P1    H    3 27.49 29.85    Big
4   Control   P1    H    4 44.79 25.39    Big
5   Control   P1    M    1 20.99 20.31 Medium
6   Control   P1    M    2 37.54 17.62    Big

6.5 Summary functions

Summary functions (those that return a single value) are also permissible - the value will be recycled for the total length of the input vector. A table of useful summary functions can be found in the Summarise section.

dat.1 |> mutate(meanResp1 = mean(Resp1)) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 meanResp1
1   Control   P1    H    1  8.12  3.06  75.26604
2   Control   P1    H    2 20.55 25.94  75.26604
3   Control   P1    H    3 27.49 29.85  75.26604
4   Control   P1    H    4 44.79 25.39  75.26604
5   Control   P1    M    1 20.99 20.31  75.26604
6   Control   P1    M    2 37.54 17.62  75.26604

Another important summary function is the n() function. This function returns the total number of rows. In the following example, we will use it to create a column that just provides a running row counter (e.g. a row index variable).

dat.1 |> mutate(N = 1:n()) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2 N
1   Control   P1    H    1  8.12  3.06 1
2   Control   P1    H    2 20.55 25.94 2
3   Control   P1    H    3 27.49 29.85 3
4   Control   P1    H    4 44.79 25.39 4
5   Control   P1    M    1 20.99 20.31 5
6   Control   P1    M    2 37.54 17.62 6

6.6 Multiple mutations (across)

In the section on select, a set of select helper functions were described to facilitate convenient ways to select columns based on properties of the column names etc. The across() function allows us to bring those same selection helper functions to mutate.

The across() function has the following form:

across(.cols, .fns, .names)

where:

  • .cols - a tidy selection (e.g. selection helper function)
  • .fns - a function (or list of functions) to apply to each selected column
  • .names - a glue specification determining the format of the new variable names. By default the glue will be either {.col} (when there is only a single function) or {.col}.{fn} (when a list of functions)
dat.1 |> mutate(across(c(Resp1, Resp2), log)) |>
    head()
  Treatment Plot Dose Time    Resp1    Resp2
1   Control   P1    H    1 2.094330 1.118415
2   Control   P1    H    2 3.022861 3.255786
3   Control   P1    H    3 3.313822 3.396185
4   Control   P1    H    4 3.801985 3.234355
5   Control   P1    M    1 3.044046 3.011113
6   Control   P1    M    2 3.625407 2.869035
dat.1 |> mutate(across(c(Resp1, Resp2),
                        .fns = log,
                        .names = "l{.col}")) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2   lResp1   lResp2
1   Control   P1    H    1  8.12  3.06 2.094330 1.118415
2   Control   P1    H    2 20.55 25.94 3.022861 3.255786
3   Control   P1    H    3 27.49 29.85 3.313822 3.396185
4   Control   P1    H    4 44.79 25.39 3.801985 3.234355
5   Control   P1    M    1 20.99 20.31 3.044046 3.011113
6   Control   P1    M    2 37.54 17.62 3.625407 2.869035
dat.1 |> mutate(across(where(is.numeric),
                        .fns = log,
                        .names = "l{.col}")) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2     lTime   lResp1   lResp2
1   Control   P1    H    1  8.12  3.06 0.0000000 2.094330 1.118415
2   Control   P1    H    2 20.55 25.94 0.6931472 3.022861 3.255786
3   Control   P1    H    3 27.49 29.85 1.0986123 3.313822 3.396185
4   Control   P1    H    4 44.79 25.39 1.3862944 3.801985 3.234355
5   Control   P1    M    1 20.99 20.31 0.0000000 3.044046 3.011113
6   Control   P1    M    2 37.54 17.62 0.6931472 3.625407 2.869035
dat.1 |> mutate(across(num_range("Resp", 1:2),
                        .fns = log,
                        .names = "l{.col}")) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2   lResp1   lResp2
1   Control   P1    H    1  8.12  3.06 2.094330 1.118415
2   Control   P1    H    2 20.55 25.94 3.022861 3.255786
3   Control   P1    H    3 27.49 29.85 3.313822 3.396185
4   Control   P1    H    4 44.79 25.39 3.801985 3.234355
5   Control   P1    M    1 20.99 20.31 3.044046 3.011113
6   Control   P1    M    2 37.54 17.62 3.625407 2.869035
dat.1 |> mutate(across(c(Resp1, Resp2),
                        .fns = list(l = log, s = sqrt),
                        .names = "{.fn}.{.col}")) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2  l.Resp1  s.Resp1  l.Resp2  s.Resp2
1   Control   P1    H    1  8.12  3.06 2.094330 2.849561 1.118415 1.749286
2   Control   P1    H    2 20.55 25.94 3.022861 4.533211 3.255786 5.093133
3   Control   P1    H    3 27.49 29.85 3.313822 5.243091 3.396185 5.463515
4   Control   P1    H    4 44.79 25.39 3.801985 6.692533 3.234355 5.038849
5   Control   P1    M    1 20.99 20.31 3.044046 4.581484 3.011113 4.506662
6   Control   P1    M    2 37.54 17.62 3.625407 6.126989 2.869035 4.197618

Centring all numeric variables (note the use of the purrr style lambda formula for functions that require additional arguments. When applying the function, the focal variable is assigned the name of .x.

dat.1 |> mutate(across(c(Resp1, Resp2),
                        .fns = list(c =  ~ scale(.x, scale = FALSE)),
                        .names = "{.fn}{.col}")) |>
    head()
  Treatment Plot Dose Time Resp1 Resp2    cResp1    cResp2
1   Control   P1    H    1  8.12  3.06 -67.14604 -78.64958
2   Control   P1    H    2 20.55 25.94 -54.71604 -55.76958
3   Control   P1    H    3 27.49 29.85 -47.77604 -51.85958
4   Control   P1    H    4 44.79 25.39 -30.47604 -56.31958
5   Control   P1    M    1 20.99 20.31 -54.27604 -61.39958
6   Control   P1    M    2 37.54 17.62 -37.72604 -64.08958

6.7 Changing vectors

Convert Time (a numeric) into a factor

dat.1 |> mutate(Time = factor(Time)) |>
    tibble()
# A tibble: 48 × 6
   Treatment Plot  Dose  Time  Resp1  Resp2
   <fct>     <fct> <fct> <fct> <dbl>  <dbl>
 1 Control   P1    H     1      8.12   3.06
 2 Control   P1    H     2     20.6   25.9 
 3 Control   P1    H     3     27.5   29.8 
 4 Control   P1    H     4     44.8   25.4 
 5 Control   P1    M     1     21.0   20.3 
 6 Control   P1    M     2     37.5   17.6 
 7 Control   P1    M     3     61.5   98.4 
 8 Control   P1    M     4     82.2  160.  
 9 Control   P1    L     1     31.7   21.2 
10 Control   P1    L     2     59.1   37.5 
# ℹ 38 more rows

Change the labels of the “H” and “M” levels of Dose

dat.1 |> mutate(Dose = fct_recode(Dose, High = 'H',  Medium = 'M')) |>
    tibble()
# A tibble: 48 × 6
   Treatment Plot  Dose    Time Resp1  Resp2
   <fct>     <fct> <fct>  <int> <dbl>  <dbl>
 1 Control   P1    High       1  8.12   3.06
 2 Control   P1    High       2 20.6   25.9 
 3 Control   P1    High       3 27.5   29.8 
 4 Control   P1    High       4 44.8   25.4 
 5 Control   P1    Medium     1 21.0   20.3 
 6 Control   P1    Medium     2 37.5   17.6 
 7 Control   P1    Medium     3 61.5   98.4 
 8 Control   P1    Medium     4 82.2  160.  
 9 Control   P1    L          1 31.7   21.2 
10 Control   P1    L          2 59.1   37.5 
# ℹ 38 more rows
dat.1 |> mutate(Dose = fct_recode(Dose, High = 'H',  Medium = 'M')) |>
    str()
'data.frame':   48 obs. of  6 variables:
 $ Treatment: Factor w/ 2 levels "Control","Exclusion": 1 1 1 1 1 1 1 1 1 1 ...
 $ Plot     : Factor w/ 4 levels "P1","P2","P3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Dose     : Factor w/ 3 levels "High","L","Medium": 1 1 1 1 3 3 3 3 2 2 ...
 $ Time     : int  1 2 3 4 1 2 3 4 1 2 ...
 $ Resp1    : num  8.12 20.55 27.49 44.79 20.99 ...
 $ Resp2    : num  3.06 25.94 29.85 25.39 20.31 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int [1:3] 4 3 4
  .. ..- attr(*, "names")= chr [1:3] "Time" "Dose" "Plot"
  ..$ dimnames:List of 3
  .. ..$ Time: chr [1:4] "Time=1" "Time=2" "Time=3" "Time=4"
  .. ..$ Dose: chr [1:3] "Dose=H" "Dose=M" "Dose=L"
  .. ..$ Plot: chr [1:4] "Plot=P1" "Plot=P2" "Plot=P3" "Plot=P4"

Change the level order of the Dose factor to something more natural.

dat.1 |> pull(Dose)
 [1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: H L M
dat.1 |>
    mutate(Dose = fct_relevel(Dose, c("L", "M", "H"))) |> 
    as_tibble() |>
    pull(Dose)
 [1] H H H H M M M M L L L L H H H H M M M M L L L L H H H H M M M M L L L L H H
[39] H H M M M M L L L L
Levels: L M H

Change the labels and level order of the Dose factor to something more natural.

dat.1 |> pull(Dose)
dat.1 |> mutate(
              Dose = fct_relevel(Dose, c("L", "M", "H")),
              Dose = fct_recode(Dose, High = 'H',  Medium = 'M')
              ) |>
    as_tibble |> 
    pull(Dose)
Error: The pipe operator requires a function call as RHS (<text>:6:5)
dat.1 |> pull(Dose)
dat.1 |> mutate(Dose = fct_reorder(Dose, 1:n())) |>
    as_tibble |> 
    pull(Dose)
Error: The pipe operator requires a function call as RHS (<text>:3:5)

Change the order of Dose levels according to the median Resp1 values

dat.1 |> pull(Dose)
dat.1 |> mutate(Dose = fct_reorder(Dose, Resp1, median)) |>
    as_tibble |> 
    pull(Dose)
Error: The pipe operator requires a function call as RHS (<text>:3:5)

7 Summarising

Data are often collected and transcribed at finer temporal/spatial scales and with greater fidelity than is required for all analyses. Furthermore, there is often a desire to present information summarised to a courser level. Therefore an important phase of data preparation/presentation is also to summarise the data into the spatial/temporal scales appropriate for the desired graphical and statistical analyses.

Summarising involves applying one or more summarising functions to one or more variables such that the data are reduced to a single row.

The summarise() function users data-masking semantics and can use tidy-select semantics via across().

First 10 rows of the `dat.1` data.frame
Treatment Plot Dose Time Resp1 Resp2
Control P1 H 1 8.12 3.06
Control P1 H 2 20.55 25.94
Control P1 H 3 27.49 29.85
Control P1 H 4 44.79 25.39
Control P1 M 1 20.99 20.31
Control P1 M 2 37.54 17.62
Control P1 M 3 61.46 98.44
Control P1 M 4 82.21 160.01
Control P1 L 1 31.73 21.22
Control P1 L 2 59.08 37.51
Function Description
n() number of rows/values
n_distinct() number of unique rows/values
mean(), median() mean/median
sum(), prod() sum/product
var(), sd() variance/standard deviation
mad(), IQR() median absolute deviation / interquartile range
min(), max() minimum / maximum
quantile() nth quantile
first(), last(), nth() first / last / nth value

Calculate the number of rows in the data as well as the number of distinct (unique) values of Dose.

dat.1 |> summarise(N = n(),
                    NDose = n_distinct(Dose))
   N NDose
1 48     3

Calculate the number of rows in the data as well as the number of distinct (unique) values of Dose.

dat.1 |> summarise(Mean = mean(Resp1),
                    Median = median(Resp1)) 
      Mean Median
1 75.26604  61.31

Calculate the number of rows in the data as well as the number of distinct (unique) values of Dose.

dat.1 |> summarise(Sum = sum(Resp1),
                    Prod = prod(Resp1)) 
      Sum         Prod
1 3612.77 1.261602e+84

Calculate the number of rows in the data as well as the number of distinct (unique) values of Dose.

dat.1 |> summarise(Var = var(Resp1),
                    MAD = mad(Resp1)) 
       Var     MAD
1 3200.305 42.5951

Calculate the number of rows in the data as well as the number of distinct (unique) values of Dose.

dat.1 |> summarise(Min = min(Resp1),
                    Max = max(Resp1)) 
   Min    Max
1 8.12 242.31

Calculate the number of rows in the data as well as the number of distinct (unique) values of Dose.

dat.1 |> summarise(Q = t(quantile(Resp1)))
      Q.0%    Q.25%    Q.50%    Q.75%   Q.100%
1   8.1200  36.4975  61.3100  99.4125 242.3100

Calculate the number of rows in the data as well as the number of distinct (unique) values of Dose.

dat.1 |> summarise(First = first(Resp1),
                    Last = last(Resp1),
                    Fifth = nth(Resp1, 5)) 
  First   Last Fifth
1  8.12 242.31 20.99
dat.1 |> summarise(across(c(Resp1, Resp2),
                           list(Mean = mean, Var = var),
                           .names = "{.fn}.{.col}"),
                    N = n()) 
  Mean.Resp1 Var.Resp1 Mean.Resp2 Var.Resp2  N
1   75.26604  3200.305   81.70958  5886.444 48
dat.1 |> summarise(across(where(is.numeric),
                           list(Mean = mean, Var = var),
                           .names = "{.fn}.{.col}"),
                    across(where(is.factor), n_distinct),
                    N = n())
  Mean.Time Var.Time Mean.Resp1 Var.Resp1 Mean.Resp2 Var.Resp2 Treatment Plot
1       2.5 1.276596   75.26604  3200.305   81.70958  5886.444         2    4
  Dose  N
1    3 48
Var <- c("Resp1", "Resp2", "Resp3")
dat.1 |> summarise(across(any_of(Var),
                           list(Mean = mean, Var = var)))
  Resp1_Mean Resp1_Var Resp2_Mean Resp2_Var
1   75.26604  3200.305   81.70958  5886.444

7.1 Count

The count() function provides a convenient way to count up the number of unique combinations of factors.

dat.1 |> count(Dose)
  Dose  n
1    H 16
2    L 16
3    M 16
dat.1 |> count(Dose, between(Resp1, 30, 50))
  Dose between(Resp1, 30, 50)  n
1    H                  FALSE 11
2    H                   TRUE  5
3    L                  FALSE 14
4    L                   TRUE  2
5    M                  FALSE 12
6    M                   TRUE  4

8 Grouping

Base R has a family of apply functions that apply a function (such as mean()) to a continuous variable separately for:

Apply the mean() function to each column (MARGIN = 2) of a dataset comprising only of “Resp1” and “Resp2”.

apply(dat.1[, c("Resp1", "Resp2")],
      MARGIN = 2,
      FUN = mean)
   Resp1    Resp2 
75.26604 81.70958 

Apply the mean() function to each level of the Dose factor.

tapply(dat.1$Resp1,
       INDEX = dat.1$Dose,
       FUN = mean)
        H         L         M 
 37.29563 112.54500  75.95750 

The above apply functions involve a sort of split, apply, combine process. This process became the inspiration for the plyr package (also out of the Wickham lab) which built on and super-charged the process. The dplyr package represents the next iteration of the process (for data frames and tibbles only as most data are in data frame format).

Central to the modern split/apply/combine process is the idea of groups. Groups are the basis of splitting the data. Functions applied to grouped data are applied to each group (subset) separately before the results are combined back into a single data frame (actually tibble). Hence grouped data are most powerful when combined with the summarise() or mutate() families of functions.

Groups are defined via the group_by() function. For example, to indicate that all subsequent manipulations should occur separately for each level of the Dose variable, we would:

dat.1 |> group_by(Dose)
# A tibble: 48 × 6
# Groups:   Dose [3]
   Treatment Plot  Dose   Time Resp1  Resp2
   <fct>     <fct> <fct> <int> <dbl>  <dbl>
 1 Control   P1    H         1  8.12   3.06
 2 Control   P1    H         2 20.6   25.9 
 3 Control   P1    H         3 27.5   29.8 
 4 Control   P1    H         4 44.8   25.4 
 5 Control   P1    M         1 21.0   20.3 
 6 Control   P1    M         2 37.5   17.6 
 7 Control   P1    M         3 61.5   98.4 
 8 Control   P1    M         4 82.2  160.  
 9 Control   P1    L         1 31.7   21.2 
10 Control   P1    L         2 59.1   37.5 
# ℹ 38 more rows

Note, in the above, we have only defined the grouping structure, not any subsequent manipulations. Consequently, the tibble remains unchanged. Do note however, that above the tabular output, the output specifies a Group attribute as a visual reminder.

It is in combination with other data manipulation routines that group_by() is most useful. For example, to calculate the mean of “Resp1” within each Dose:

dat.1 |> group_by(Dose) |>
    summarise(Mean = mean(Resp1))
# A tibble: 3 × 2
  Dose   Mean
  <fct> <dbl>
1 H      37.3
2 L     113. 
3 M      76.0

In the above output, you may have noticed that the data (means) have been rounded. This is a behaviour of the print method for tibbles. When stored, the full precision of the data are retained.

Before providing a list of examples, it is worth using group_by() to highlight the difference between mutate() and summarise().

dat.1 |> group_by(Dose) |>
    mutate(MeanResp1 = mean(Resp1)) |>
    head()
# A tibble: 6 × 7
# Groups:   Dose [2]
  Treatment Plot  Dose   Time Resp1 Resp2 MeanResp1
  <fct>     <fct> <fct> <int> <dbl> <dbl>     <dbl>
1 Control   P1    H         1  8.12  3.06      37.3
2 Control   P1    H         2 20.6  25.9       37.3
3 Control   P1    H         3 27.5  29.8       37.3
4 Control   P1    H         4 44.8  25.4       37.3
5 Control   P1    M         1 21.0  20.3       76.0
6 Control   P1    M         2 37.5  17.6       76.0
dat.1 |> group_by(Dose) |>
    summarise(Mean = mean(Resp1))
# A tibble: 3 × 2
  Dose   Mean
  <fct> <dbl>
1 H      37.3
2 L     113. 
3 M      76.0
  • mutate() maintains the number of rows. Notice how the mean for each group is repeated for each case.
  • summarise()reduces the number of rows to one (per group).
  • mutate() retains all the columns
  • summarise() only returns the columns specifically involved (either as the Groups or manipulated variables).

When multiple variables are used to define the groups, the subsequent manipulations are applied to each unique combination of the factors.

dat.1 |> group_by(Treatment, Dose) |>
    summarise(Mean = mean(Resp1))
# A tibble: 6 × 3
# Groups:   Treatment [2]
  Treatment Dose   Mean
  <fct>     <fct> <dbl>
1 Control   H      24.5
2 Control   L      75.7
3 Control   M      51.2
4 Exclusion H      50.1
5 Exclusion L     149. 
6 Exclusion M     101. 

Importantly, once all the manipulations that need to be performed separately per group have been performed, it is good practice to remove the group attribute. Forgetting to do so will continue to cause all subsequent manipulations (and possibly analyses and graphics) to be inadvertently performed separately per group - such downstream issues can be difficult to diagnose.

To remove the group attribute, we use the ungroup() function.

dat.1 |> group_by(Treatment, Dose) |>
    summarise(Mean = mean(Resp1)) |>
    ungroup()
# A tibble: 6 × 3
  Treatment Dose   Mean
  <fct>     <fct> <dbl>
1 Control   H      24.5
2 Control   L      75.7
3 Control   M      51.2
4 Exclusion H      50.1
5 Exclusion L     149. 
6 Exclusion M     101. 
dat.1 |> group_by(Treatment, Dose) |>
    summarise(meanResp1 = mean(Resp1),
              varResp1 = var(Resp1),
              N = n())
# A tibble: 6 × 5
# Groups:   Treatment [2]
  Treatment Dose  meanResp1 varResp1     N
  <fct>     <fct>     <dbl>    <dbl> <int>
1 Control   H          24.5     202.     8
2 Control   L          75.7    1191.     8
3 Control   M          51.2     611.     8
4 Exclusion H          50.1     562.     8
5 Exclusion L         149.     5192.     8
6 Exclusion M         101.     2378.     8
dat.1 |> group_by(Treatment, Dose) |>
    summarise(across(starts_with("Resp"),
                     list(Mean = mean, Var = var),
                     .names = "{.fn}{.col}"),
              N = n())
# A tibble: 6 × 7
# Groups:   Treatment [2]
  Treatment Dose  MeanResp1 VarResp1 MeanResp2 VarResp2     N
  <fct>     <fct>     <dbl>    <dbl>     <dbl>    <dbl> <int>
1 Control   H          24.5     202.      26.5     118.     8
2 Control   L          75.7    1191.      79.9    1948.     8
3 Control   M          51.2     611.      62.3    3676.     8
4 Exclusion H          50.1     562.      48.1     886.     8
5 Exclusion L         149.     5192.     170.    13391.     8
6 Exclusion M         101.     2378.     103.     4786.     8
dat.1 |> group_by(Treatment, Dose) |>
    filter(Resp1 < 25)
# A tibble: 8 × 6
# Groups:   Treatment, Dose [3]
  Treatment Plot  Dose   Time Resp1 Resp2
  <fct>     <fct> <fct> <int> <dbl> <dbl>
1 Control   P1    H         1  8.12  3.06
2 Control   P1    H         2 20.6  25.9 
3 Control   P1    M         1 21.0  20.3 
4 Control   P2    H         1  8.14 23.9 
5 Control   P2    H         2 13.4  28.0 
6 Control   P2    M         1 20.0  19.7 
7 Exclusion P3    H         1 21.9  23.6 
8 Exclusion P4    H         1 18.8  28.6 

9 Reshaping data

9.1 Wide to long (pivot_longer)

Whilst wide data formats are often more compact and typically easier to manage for data entry (particularly in the field), the data are not in the appropriate format for most analyses (traditional repeated measures and multivariate analyses are two exceptions). Most analyses require that each replicate is in its own row and thus it is necessary to be rearrange or reshape (melt) the data from this wide format to the long (molten) format.

Whilst there are numerous routines in R for reshaping data, we will only explore those that are formally part of the tidyverse ecosystem.

The pivot_longer() function (tidyr package) is very useful for converting wide (repeated measures-like) into long format. The important parameters to specify are:

pivot_longer(data, cols, names_to, values_to, values_drop_na)

where:

  • data - the input dataframe or tibble
  • cols - a tidy select specification of the columns to be lengthened into a single column
  • names_to - a name (string) to give a new column into which to store the names of the old wide column names
  • values_to - a name (string) to give the new column containing the values that were previously in the old columns.
  • values_drop_na - whether or not to drop rows that only contain NA values in the new value column.

To illustrate, we will use the dat.w dataframe.

The `data.w` data.frame
Plot Between Time.0 Time.1 Time.2
R1 P1 A1 8 14 14
R2 P2 A1 10 12 11
R3 P3 A2 7 11 8
R4 P4 A2 11 9 2
data.w |> pivot_longer(Time.0:Time.2,
                        names_to = "Time",
                        values_to = "Count")
# A tibble: 12 × 4
   Plot  Between Time   Count
   <fct> <fct>   <chr>  <int>
 1 P1    A1      Time.0     8
 2 P1    A1      Time.1    14
 3 P1    A1      Time.2    14
 4 P2    A1      Time.0    10
 5 P2    A1      Time.1    12
 6 P2    A1      Time.2    11
 7 P3    A2      Time.0     7
 8 P3    A2      Time.1    11
 9 P3    A2      Time.2     8
10 P4    A2      Time.0    11
11 P4    A2      Time.1     9
12 P4    A2      Time.2     2
data.w |> pivot_longer(starts_with("Time"),
                        names_to = "Time",
                        names_prefix = "Time.",
                        values_to = "Count"
                        )
# A tibble: 12 × 4
   Plot  Between Time  Count
   <fct> <fct>   <chr> <int>
 1 P1    A1      0         8
 2 P1    A1      1        14
 3 P1    A1      2        14
 4 P2    A1      0        10
 5 P2    A1      1        12
 6 P2    A1      2        11
 7 P3    A2      0         7
 8 P3    A2      1        11
 9 P3    A2      2         8
10 P4    A2      0        11
11 P4    A2      1         9
12 P4    A2      2         2

9.2 Long to wide (pivot_wider)

The opposite of making a data set longer is to make a data set wider. Whilst analytical and graphical routines might require data to be in long format, wide tabular summaries are typically more compact and familiar.

Widening is performed via the pivot_wider() function, the most important parameters of which are:

pivot_wider(data, id_cols, names_from, values_from

where:

  • data - the input dataframe or tibble
  • id_cols - a tidy select specification of the columns that uniquely identify the case - these columns will not be widened.
  • names_from - a tidy select specification of the column(s) that contain the names to be used as new column names.
  • values_from - a tidy select specification of the column(s) that contain the values to be used as values in the new columns (e.g, the data to be widened).

To illustrate, we will use the dat.w dataframe.

The `data` data.frame
Resp1 Resp2 Between Plot Subplot Within
8 17 A1 P1 S1 B1
10 18 A1 P1 S1 B2
7 17 A1 P1 S2 B1
11 21 A1 P1 S2 B2
14 19 A2 P2 S3 B1
12 13 A2 P2 S3 B2
11 24 A2 P2 S4 B1
9 18 A2 P2 S4 B2
14 25 A3 P3 S5 B1
11 18 A3 P3 S5 B2
8 27 A3 P3 S6 B1
2 22 A3 P3 S6 B2
8 17 A1 P4 S7 B1
10 22 A1 P4 S7 B2
7 16 A1 P4 S8 B1
12 13 A1 P4 S8 B2
11 23 A2 P5 S9 B1
12 19 A2 P5 S9 B2
12 23 A2 P5 S10 B1
10 21 A2 P5 S10 B2
3 17 A3 P6 S11 B1
11 16 A3 P6 S11 B2
13 26 A3 P6 S12 B1
7 28 A3 P6 S12 B2

Note, these data are not as long as they could be. Purely “long” data should have each observation in its own row. The data dataframe has two observations (one for “Resp1” and one for “Resp2”) per row.

Widen the “Resp1” variable by the levels of the Between factor.

data |> select(-Resp2) |>
    pivot_wider(names_from = Within,
                values_from = c(Resp1))
# A tibble: 12 × 5
   Between Plot  Subplot    B1    B2
   <fct>   <fct> <fct>   <int> <int>
 1 A1      P1    S1          8    10
 2 A1      P1    S2          7    11
 3 A2      P2    S3         14    12
 4 A2      P2    S4         11     9
 5 A3      P3    S5         14    11
 6 A3      P3    S6          8     2
 7 A1      P4    S7          8    10
 8 A1      P4    S8          7    12
 9 A2      P5    S9         11    12
10 A2      P5    S10        12    10
11 A3      P6    S11         3    11
12 A3      P6    S12        13     7

Widen the “Resp1” and “Resp2” variables by the levels of the Between factor.

data |> pivot_wider(names_from = Within,
                values_from = starts_with("Resp"))
# A tibble: 12 × 7
   Between Plot  Subplot Resp1_B1 Resp1_B2 Resp2_B1 Resp2_B2
   <fct>   <fct> <fct>      <int>    <int>    <int>    <int>
 1 A1      P1    S1             8       10       17       18
 2 A1      P1    S2             7       11       17       21
 3 A2      P2    S3            14       12       19       13
 4 A2      P2    S4            11        9       24       18
 5 A3      P3    S5            14       11       25       18
 6 A3      P3    S6             8        2       27       22
 7 A1      P4    S7             8       10       17       22
 8 A1      P4    S8             7       12       16       13
 9 A2      P5    S9            11       12       23       19
10 A2      P5    S10           12       10       23       21
11 A3      P6    S11            3       11       17       16
12 A3      P6    S12           13        7       26       28

Alternatively we could make the data longer before widening

data |>
    pivot_longer(cols = starts_with("Resp")) |>
    pivot_wider(names_from = c(name, Within),
                values_from = value)
# A tibble: 12 × 7
   Between Plot  Subplot Resp1_B1 Resp2_B1 Resp1_B2 Resp2_B2
   <fct>   <fct> <fct>      <int>    <int>    <int>    <int>
 1 A1      P1    S1             8       17       10       18
 2 A1      P1    S2             7       17       11       21
 3 A2      P2    S3            14       19       12       13
 4 A2      P2    S4            11       24        9       18
 5 A3      P3    S5            14       25       11       18
 6 A3      P3    S6             8       27        2       22
 7 A1      P4    S7             8       17       10       22
 8 A1      P4    S8             7       16       12       13
 9 A2      P5    S9            11       23       12       19
10 A2      P5    S10           12       23       10       21
11 A3      P6    S11            3       17       11       16
12 A3      P6    S12           13       26        7       28

10 Combining columns (unite)

If data are recorded with excessive fidelity, it may be useful to combine multiple fields into a single field. For example, if the date was recorded across three fields (year, month and day, we may like to combine these to form a single date field.

The `data.d` data.frame
year month day Resp1
2009 11 09 25
2008 05 08 52
2009 05 23 18
2009 04 15 21
2008 10 25 45
2008 09 26 40
2008 07 05 38
2008 11 02 9
2009 08 19 0
2008 10 03 20

Lets take those data and combine (unite) the separate date components into a single date column before coercing it into a date object.

data.d |> unite(year,month,day,col='Date',sep='-') |>
    mutate(Date = as.Date(Date))
         Date Resp1
1  2009-11-09    25
2  2008-05-08    52
3  2009-05-23    18
4  2009-04-15    21
5  2008-10-25    45
6  2008-09-26    40
7  2008-07-05    38
8  2008-11-02     9
9  2009-08-19     0
10 2008-10-03    20

11 Separating columns (separate)

Separating variables is the opposite of uniting them. A field is separated by either indicating a character(s) to use as a separator, or else providing a fixed width format.

The `data.c` data.frame
Year Resp1
F33 23
M27 55
M33 33
M26 36
F21 12
F34 49
M29 0
M34 18
F32 42
F26 0

For this example, we will split the Year column into two columns, one that specifies the gender and one that specifies the age. The values are going to be separated after the first character.

data.c |> separate(Year,
                    into = c("Gender", "Age"),
                    sep = 1)
   Gender Age Resp1
1       F  33    23
2       M  27    55
3       M  33    33
4       M  26    36
5       F  21    12
6       F  34    49
7       M  29     0
8       M  34    18
9       F  32    42
10      F  26     0

12 Merging (_join) data sets

It is common to have data associated with a particular study organised into a number of separate data tables (databases etc). In fact, large data sets are best managed in databases. However, statistical analyses generally require all data to be encapsulated within a single data structure. Therefore, prior to analysis, it is necessary to bring together multiple sources.

This phase of data preparation can be one of the most difficult to get right and verify that it is correct (particularly for large data sets).

Merging (or joining) involves creating a new data set that comprises information from two data sets. The resulting joined data set contains all fields from both data sets. The data sets are alignd together according to fields they have in common. Matching records in these common fields are used to select a row from each input data set to be combined.

There are numerous alternative ways of defining what should happen in the event that common records do not occur in both sets. For example, we could specify that only fully matching records be included in the final data set. Alternatively, we could specify that all records be included from both sets and thus the resulting data set will contain missing values. The following describe these various options.

  • left join
    • return all rows and columns from the left data set
    • return all columns from the right data set
    • new columns for unmatched rows from the right data sets receive NA values
    • when there are multiple matches, all combinations included

  • right join
    • return all rows and columns from the right data set
    • return all columns from the left data set
    • new columns for unmatched rows from the left data sets receive NA values
    • when there are multiple matches, all combinations included

  • inner join
    • return all columns from the left and right data set
    • return only rows that match from left and right data sets
    • when there are multiple matches, all combinations included

  • semi join
    • return all rows from the left data set that match with rows from the right data set
    • keep only the columns from the left data set

  • anti join
    • return only the rows from the left data set that do not match with rows from the right data set
    • keep only the columns from the left data set

  • full join
    • return all rows and columns from the left and right data set
    • unmatched rows from either left data sets receive NA values in the associated new columns

To illustrate the above, we will use the data.bio and data.chem datasets. The data.bio dataset mimics a typical ecological data set in which there are a number of ecological responses (Resp1 and Resp2) along with the associated hierarchical design structure. This data set is intentionally missing data for Subplot S3. The data.chem dataset has the same hierarchical design structure, yet records the two chemical analytes (Chem1 and Chem2). This data set is intentionally missing data for Subplot S7.

Resp1 Resp2 Between Plot Subplot
8 18 A1 P1 S1
10 21 A1 P1 S2
11 23 A1 P2 S4
14 22 A2 P3 S5
12 24 A2 P3 S6
11 23 A2 P4 S7
9 20 A2 P4 S8
14 11 A3 P5 S9
11 22 A3 P5 S10
8 24 A3 P6 S11
2 16 A3 P6 S12
Chem1 Chem2 Between Plot Subplot
1.452878 0.8858208 A1 P1 S1
3.266253 0.1800177 A1 P1 S2
1.178652 5.0780682 A1 P2 S3
13.400350 1.5762780 A1 P2 S4
3.779183 1.6222430 A2 P3 S5
1.196657 4.2369184 A2 P3 S6
5.687807 2.9859003 A2 P4 S8
4.834518 4.1328919 A3 P5 S9
2.002931 3.6043314 A3 P5 S10
12.326867 1.7763576 A3 P6 S11
4.014221 0.2255188 A3 P6 S12

A full or outer join between data.bio and data.chem

data.bio |> full_join(data.chem)
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6     11    23      A2   P4      S7        NA        NA
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188
12    NA    NA      A1   P2      S3  1.178652 5.0780682

Note that although all Subplots are present, there are missing values for Subplot S3 (last row) and S7.

An inner join between data.bio and data.chem

data.bio |> inner_join(data.chem)
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6      9    20      A2   P4      S8  5.687807 2.9859003
7     14    11      A3   P5      S9  4.834518 4.1328919
8     11    22      A3   P5     S10  2.002931 3.6043314
9      8    24      A3   P6     S11 12.326867 1.7763576
10     2    16      A3   P6     S12  4.014221 0.2255188

Note that both Subplots S3 and S7 are missing.

A left join between data.bio and data.chem

data.bio |> left_join(data.chem)
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6     11    23      A2   P4      S7        NA        NA
7      9    20      A2   P4      S8  5.687807 2.9859003
8     14    11      A3   P5      S9  4.834518 4.1328919
9     11    22      A3   P5     S10  2.002931 3.6043314
10     8    24      A3   P6     S11 12.326867 1.7763576
11     2    16      A3   P6     S12  4.014221 0.2255188

Note that both Subplot S3 are missing and Subplot S7 only has NA values for the chem data.

A left join between data.bio and data.chem

data.bio |> right_join(data.chem)
   Resp1 Resp2 Between Plot Subplot     Chem1     Chem2
1      8    18      A1   P1      S1  1.452878 0.8858208
2     10    21      A1   P1      S2  3.266253 0.1800177
3     11    23      A1   P2      S4 13.400350 1.5762780
4     14    22      A2   P3      S5  3.779183 1.6222430
5     12    24      A2   P3      S6  1.196657 4.2369184
6      9    20      A2   P4      S8  5.687807 2.9859003
7     14    11      A3   P5      S9  4.834518 4.1328919
8     11    22      A3   P5     S10  2.002931 3.6043314
9      8    24      A3   P6     S11 12.326867 1.7763576
10     2    16      A3   P6     S12  4.014221 0.2255188
11    NA    NA      A1   P2      S3  1.178652 5.0780682

Note that both Subplot S7 are missing and Subplot S3 only has NA values for the bio data.

An anti join between data.bio and data.chem

data.bio |> anti_join(data.chem)
  Resp1 Resp2 Between Plot Subplot
1    11    23      A2   P4      S7

Note that Subplot S7 is returned indicating that it in the right dataset, yet absent from the left.

An semi join between data.bio and data.chem

data.bio |> semi_join(data.chem)
   Resp1 Resp2 Between Plot Subplot
1      8    18      A1   P1      S1
2     10    21      A1   P1      S2
3     11    23      A1   P2      S4
4     14    22      A2   P3      S5
5     12    24      A2   P3      S6
6      9    20      A2   P4      S8
7     14    11      A3   P5      S9
8     11    22      A3   P5     S10
9      8    24      A3   P6     S11
10     2    16      A3   P6     S12

Note that Subplot S3 and S7 are missing and only the bio data are included.

12.1 VLOOKUP in R

Lookup tables provide a way of inserting a column of data into a large data set such that the entries in the new column are determined by a relational match within another data set (the lookup table). For example, the main data set might contain data collected from a number of sites (Plots). Elsewhere we may have a data set that just contains the set of sites and their corresponding latitudes and longitudes (geographical lookup table). We could incorporate these latitudes and longitudes into the main data set by merging against the geographical lookup table. In Excel, this is referred to as vlookup, in a relational database (and in tidyverse) it is referred to as a join.

If we again consider our data.bio data, but this time also consider the data.geo data. This later dataset contains the latitude and longitude of each of the plots.

Resp1 Resp2 Between Plot Subplot
8 18 A1 P1 S1
10 21 A1 P1 S2
11 23 A1 P2 S4
14 22 A2 P3 S5
12 24 A2 P3 S6
11 23 A2 P4 S7
9 20 A2 P4 S8
14 11 A3 P5 S9
11 22 A3 P5 S10
8 24 A3 P6 S11
2 16 A3 P6 S12
Plot LAT LONG
P1 17.9605 145.4326
P2 17.5210 146.1983
P3 17.0011 146.3839
P4 18.2350 146.7934
P5 18.9840 146.0345
P6 20.1154 146.4672
data.bio |> left_join(data.geo)
   Resp1 Resp2 Between Plot Subplot     LAT     LONG
1      8    18      A1   P1      S1 17.9605 145.4326
2     10    21      A1   P1      S2 17.9605 145.4326
3     11    23      A1   P2      S4 17.5210 146.1983
4     14    22      A2   P3      S5 17.0011 146.3839
5     12    24      A2   P3      S6 17.0011 146.3839
6     11    23      A2   P4      S7 18.2350 146.7934
7      9    20      A2   P4      S8 18.2350 146.7934
8     14    11      A3   P5      S9 18.9840 146.0345
9     11    22      A3   P5     S10 18.9840 146.0345
10     8    24      A3   P6     S11 20.1154 146.4672
11     2    16      A3   P6     S12 20.1154 146.4672

References

Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.