Introduction to R
1 Installing R
The latest version of an R installation binary (or source code) can be downloaded from one of the Comprehensive R Archive Network (or CRAN) mirrors. Having selected one of the (Australian) mirrors, follow one of the sets of instructions below (depending on your operating system).
Windows:
Download R:
- Go to the CRAN R-project website https://cran.r-project.org/ and click on “Download R for Windows”.
- Select the “base” subdirectory
- Select the “Download R-X.X.X for Windows” option (where X.X.X are a series of version and release numbers) to download.
Run the installer: Double-click the downloaded .exe file and follow the installation wizard. Accept the default settings unless you have specific needs.
Optional: Set R as the default: Check the checkbox to set R as the default for R scripts during installation. This allows you to run R scripts by double-clicking them.
Verify installation:
- Open a new command prompt (Start > Run > cmd) and type R. If the R console opens, the installation was successful.
- Alternatively, search for R in the Start menu.
macOS:
Download R:
- Go to the CRAN R-project website (https://cran.r-project.org/) and click on “Download R for macOS”.
- Choose the latest stable version that is appropriate for your architecture.
Run the installer: Double-click the downloaded .pkg file and follow the installation prompts.
Verify installation:
- Open Terminal: Go to Applications > Utilities and open Terminal.
- Type R in the Terminal window. If the R console opens, the installation was successful.
Linux:
Open Terminal: You can access Terminal through your application launcher or search bar.
Install R: The commands vary slightly depending on your Linux distribution. Here are common examples:
- Debian/Ubuntu:
sudo apt install r-base
- Fedora/CentOS:
sudo yum install R
- Arch Linux:
sudo pacman -S r
Verify installation: Type R in the Terminal window. If the R console opens, the installation was successful.
2 Basic Syntax
2.1 The R environment and command line
Upon opening R, you are presented with the R Console along with the command prompt (>). R is a command-driven application (as opposed to a ‘point-and-click’ application) and, despite the steep learning curve, there are many very good reasons for this.
Commands that you type are evaluated once the Enter key has been pressed.
Enter the following command (5+1) at the command prompt (>).
I have suppressed the command prompt (>) from almost all code blocks throughout this workshop and tutorial series to make it easier for you to cut and paste code into your own scripts or directly into R.
In this tutorial series, the R code to be entered appears to the right-hand side of the vertical bar. The number on the left side of the bar is a line number. For single-line code snippets, such as the example above, line numbers are not necessary. However, for multi-line code snippets, line numbers help with identifying and describing different parts of the code.
The above R code evaluates the command five plus one and returns the result (six). The [1] before the 6 indicates that the object immediately to its right is the first element in the returned object. In this case there is only one object returned. However, when a large set of objects (e.g. numbers) is returned, each row will start with an index number, thereby making it easier to count through the elements.
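For reference, a minimal sketch of the command discussed above, together with the output it prints. The second command is an added illustration (not part of the original example) of how the index prefix appears when a longer vector wraps over several rows of output.

```r
# evaluate five plus one; the single result is printed with the index prefix [1]
5 + 1
#> [1] 6

# a longer vector wraps across rows in the console;
# each row starts with the index of its first element
101:130
```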
- Object
- As an object-oriented language, everything in R is an object. Data, functions and even output are objects.
- Vector
- A collection of one or more objects of the same type (e.g. all numbers or all characters).
- Function
- A set of instructions carried out on one or more objects. Functions are typically wrappers for a sequence of instructions that perform specific and common tasks.
- Parameter
- The kind of information passed to a function.
- Argument
- The specific information passed to a function.
- Operator
- A symbol that has a pre-defined meaning. Familiar operators include +, -, * and /.
  - Assignment operators
    - <- assigns a name to an object (right to left: the value on the right is given the name on the left)
    - -> assigns a name to an object (left to right: the value on the left is given the name on the right)
    - = used when defining and specifying function arguments
  - Logical operators (return TRUE or FALSE)
    - < less than
    - > greater than
    - <= less than or equal
    - >= greater than or equal
    - == is the left hand side equal to the right hand side (a query)
    - != is the left hand side NOT equal to the right hand side (a query)
    - && are BOTH left hand and right hand conditions TRUE
    - || are EITHER the left hand OR right hand conditions TRUE
  - Pipe operator
    - |> pipes the output of one operation to the input of the next
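The operators listed above can be sketched in a few lines. The object names a and b and the values used are hypothetical and chosen only for illustration; the pipe example assumes R version 4.1 or later.

```r
# assignment in both directions
a <- 10        # right to left: the value 10 is given the name a
20 -> b        # left to right: the value 20 is given the name b

# comparison operators return TRUE or FALSE
a < b          # TRUE
a == b         # FALSE
a != b         # TRUE

# && and || combine two single logical conditions
(a < b) && (b > 15)   # TRUE: both conditions hold
(a > b) || (b > 15)   # TRUE: the second condition holds

# = names an argument inside a function call
round(x = 3.14159, digits = 2)

# the pipe passes the left-hand result as the first argument of the next call
c(1, 4, 9) |> sqrt()
```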
2.1.1 Expressions, Assignment and Arithmetic
Instead of evaluating a statement and printing the result directly to the console, the results of evaluations can be stored in an object via a process called ‘Assignment’. Assignment assigns a name to an object and stores the result of an evaluation in that object. The contents of an object can be viewed (printed) by typing the name of the object at the command prompt and hitting Enter.
On line 1 above, the name var1 was assigned to the result of the sum of 2 and 3. On line 2, the contents of this object are printed to the screen.
A single command (statement) can spread over multiple lines. If the Enter key is pressed before R considers the statement complete, the next line in the console will begin with the prompt + indicating that the statement is not complete. For this example, I will include the command prompt in order to demonstrate the above point.
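A minimal sketch of the two situations just described. The name var1 and the sum of 2 and 3 come from the text; the name var2 is hypothetical, and the console prompts are shown only inside comments so that the block can be pasted directly into R.

```r
var1 <- 2 + 3   # line 1: assign the name var1 to the sum of 2 and 3
var1            # line 2: print the contents of var1
#> [1] 5

# a statement spread over two lines; at the console this would appear as
#   > var2 <-
#   + 2 + 3
# where the + prompt signals that the statement is not yet complete
var2 <-
  2 + 3
var2
#> [1] 5
```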
When the contents of an object are numbers, standard arithmetic applies;
Compatible objects can be concatenated (joined together) to create objects with multiple entries. Object concatenation can be performed using the c() function.
In both examples above, objects were not assigned names. As a result, the expressions were evaluated and directly printed to the console without being stored in any way. Doing so is useful for experimenting; however, as the results are not stored, they cannot be used in subsequent actions.
In addition to the typical addition, subtraction, multiplication and division operators, there are a number of special operators, the simplest of which are the quotient or integer divide operator (%/%) and the remainder or modulus operator (%%).
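The arithmetic, concatenation and special operators described above might look as follows. The values are illustrative assumptions; var1 is recreated here so that the block runs on its own.

```r
var1 <- 2 + 3

# standard arithmetic on a stored number (results are printed, not stored)
var1 - 1      # 4
var1 * 2      # 10
var1 / 2      # 2.5

# concatenate compatible values into a single object with c()
c(1, 5, var1)

# integer (quotient) division and remainder (modulus)
7 %/% 2       # 3
7 %% 2        # 1
```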
2.1.2 Operator precedence
The rules of operator precedence are listed (highest to lowest) in the following table. Additionally, expressions within parentheses ‘()’ always have highest precedence.
Operator | Description |
---|---|
:: | namespace |
$ | component |
[ [[ | indexing |
^ | exponentiation (evaluated right to left) |
- + | sign (unary) |
: | sequence |
%special% | special operators (e.g. %/%, %%, %*%, %in%) |
* / | multiplication and division |
+ - | addition and subtraction |
> < >= <= == != | ordering and comparison |
! | logical negation (not) |
& && | logical AND |
\| \|\| | logical OR |
~ | formula |
-> ->> | assignment (left to right) |
<- <<- | assignment (right to left) |
= | argument assignment (right to left) |
? | help |
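A few short expressions illustrate how these precedence rules play out in practice. The expressions are illustrative assumptions rather than examples from the original tutorial.

```r
-2^2          # ^ binds tighter than unary minus: -(2^2) gives -4
(-2)^2        # parentheses force the negation first: 4
2^3^2         # ^ is evaluated right to left: 2^(3^2) gives 512
1:3 * 2       # : binds tighter than *: (1:3) * 2 gives 2 4 6
1:(3 * 2)     # parentheses change the end of the sequence: 1 2 3 4 5 6
-1:2          # unary minus binds tighter than : so this is (-1):2
10 %/% 3 * 2  # %/% binds tighter than *: (10 %/% 3) * 2 gives 6
```

When in doubt, adding parentheses makes the intended order explicit and costs nothing.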
2.1.3 Command history
Each time a command is entered at the R command prompt, the command is also added to a list known as the command history. The up and down arrow keys scroll backward and forward respectively through the session’s command history list and place the top most command at the current R command prompt. Scrolling through the command history enables previous commands to be rapidly re-executed, reviewed or modified and executed.
2.1.4 Object names
Everything created within R is an object. Objects are programming constructs that not only store values (the visible part of an object), they also define other properties of the object (such as the type of information contained in the object) and sometimes they also define certain routines that can be used to store, retrieve and manipulate data within the object.
Importantly, all objects within R must have unique names to which they can be referred. Names given to any object in R can comprise virtually any sequence of letters and numbers providing that the following rules are adhered to:
- Names must begin with a letter (names beginning with numbers or operators are not permitted)
- Names cannot contain the following characters: space , - + * / # % & [ ] { } ( ) ~
Whilst the above rules are necessary, the following naming conventions are also recommended:
- only use lowercase letters and numbers
- use underscores (_) to separate words (e.g. snake case)
- try to use names that are both concise and meaningful.
  - names should reflect the content of the object. One of the powerful features of R is that there is virtually no limit to the number of objects (variables, datasets, results, models, etc) that can be in use at a time. However, without careful name management, objects can rapidly become misplaced or ambiguous. Therefore, the name of an object should reflect what it is, and what has happened to it. For example, the name log_fish_wts might be given to an object that contains log transformed fish weights. Moreover, many prefer to prefix the object name with a lowercase letter that denotes the type of data contained in the object. For example, d_mean_head_length might indicate that the object contains the mean head lengths stored as a double floating point (real numbers).
  - although there are no restrictions on the length of names, shorter names are quicker to type and provide less scope for typographical errors and are therefore recommended (of course within the restrictions of the point above).
- where possible, avoid using names of common predefined functions and variables as this can provide a source of confusion for both you and R. For example, to represent the mean of a head length variable, use something like mean_head_length rather than mean (which is the name of a predefined function within R that calculates the mean of a set of numbers).
2.2 R Sessions and Workspaces
A number of objects have been created in the current session (a session encapsulates all the activity since the current instance of the R application was started). To review the names of all of the objects in the user's current workspace (storage of user created objects), use the ls() function.
You can also refine the scope of the ls() function by supplying a pattern so that only object names matching that pattern are listed.
The longer the session is running, the more objects will be created, resulting in a very cluttered workspace. Unneeded objects can be removed using the rm() function. The rm() function only performs a side effect (deleting objects); if it succeeds, it does not return any output. If it does return anything, it will be a warning or error.
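A minimal sketch of listing and removing objects. The object names (tmp_a, tmp_b, other) are hypothetical and exist only for this illustration; what ls() prints will depend on what else is in your workspace.

```r
# create a few objects to work with
tmp_a <- 1
tmp_b <- 2
other <- 3

ls()                    # list the names of all objects in the workspace
ls(pattern = "^tmp_")   # list only the names that start with "tmp_"

rm(tmp_a)               # remove a single object (no output on success)
rm(tmp_b, other)        # remove several objects in one call
```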
In the above examples, comments were appended to each line of code. Comments begin with a hash (#) character. Anything that follows a hash character will be ignored (until the end of the line).
Comments provide a convenient way to annotate your code so as to provide more explanation and clarity as to the intention and purpose of the associated code.
2.2.1 Current working directory
The R working directory (the location from which files/data are read and to which they are written) is by default either the location of the R executable (or execution path in Linux) or the user's home directory. The current working directory can be reviewed and changed (for the session) using the getwd() and setwd() functions respectively. Note that R uses the Unix/Linux style directory subdivision markers. That is, R uses the forward slash / in path names rather than the regular \ of Windows.
When using setwd(), you can provide either an absolute path (the full path) or a relative path (relative to the current location). Obviously, you will get a different result to me when you issue the following:
getwd() #review the current working directory
[1] "/home/runner/work/SUYRs_documents/SUYRs_documents/tut"
setwd("../") #change to the parent directory of the current working directory
list.files(path = getwd()) #list all files (and directories) in the current working directory
[1] "data" "default.nix" "docs" "Makefile" "pres"
[6] "resources" "tut"
2.2.2 Workspaces
Throughout an R session, all objects (including loaded packages, see Section 6) that have been added are stored within the R global environment, called the workspace. Occasionally, it is desirable to save the workspace and thus all those objects (vectors, functions, etc) that were in use during a session so that they are available during subsequent sessions. This can be done using the save.image() function. Note, this will save the workspace to a file called .RData in the current working directory (usually the R startup directory), unless a file (filename and path) is supplied as an argument to the save.image() function. A previously saved workspace can be loaded by providing a full path and filename as an argument to the load() function.
Whilst saving a workspace image can sometimes be convenient, it can also contribute greatly to organisational problems associated with large numbers of obsolete or undocumented objects. Instead, it is usually better to specifically store each of the objects you know you are going to want to have access to across sessions separately.
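The sketch below contrasts the two approaches. The use of saveRDS() and readRDS() for storing a single object is an assumption on my part (the text does not name a specific function for this), and the object name, values and temporary file paths are hypothetical.

```r
# a hypothetical object worth keeping between sessions
fish_wts <- c(12.4, 15.1, 9.8)

# store just this object in its own file (a temporary file is used here)
wts_file <- tempfile(fileext = ".rds")
saveRDS(fish_wts, file = wts_file)

# later (or in another session), read it back under any name you like
restored_wts <- readRDS(wts_file)

# by contrast, save.image() writes every object in the workspace to one file
ws_file <- tempfile(fileext = ".RData")
save.image(file = ws_file)
load(ws_file)   # restores all saved objects under their original names
```

Storing objects individually keeps a record of exactly what is carried across sessions, which is the organisational benefit argued for above.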
2.2.3 Quitting elegantly
To quit R, issue the q() command. Note that in Windows and macOS, the application can also be terminated using the standard exiting protocols.
You will then be asked whether or not you wish to save the current workspace. If you do, enter ‘Y’ otherwise enter ‘N’. Unless you have a very good reason to save the workspace, I would suggest that you do not. A workspace generated in a typical session will have numerous poorly named objects (objects created to temporarily store information whilst testing). Next time R starts, it could (likely will) restore this workspace thereby starting with a cluttered workspace, and becoming a potential source of confusion if you inadvertently refer to an object stored during a previous session. Moreover, if the workspace includes additional extension packages, these packages may also be loaded which will prevent them from being updated (often necessary when installing additional packages that depend on other packages).
2.3 Functions
As wrappers for collections of commands used together to perform a task, functions provide a convenient way of interacting with all of these commands in sequence. Most functions require one or more inputs (parameters), and while a particular function can have multiple parameters, not all are necessarily required (some could have default values). Parameters are passed to a function as arguments comprising the name of the parameter, an equals operator and the value of the parameter. Hence, arguments are specified as name/value pairs.
Consider the seq() function, which generates a sequence of values (a vector) according to the values of the arguments. We can see that the default version of this function has the following definition:
function (from = 1, to = 1, by = ((to - from)/(length.out - 1)), length.out = NULL,
along.with = NULL, ...)
- if the seq() function is called without any arguments (e.g. seq()), it will return a single number 1. Using the default arguments for the function, it returns a vector starting at 1 (from = 1), going up to 1 (to = 1) and thus having a length of 1.
- we can alter this behavior by specifically providing values for the named arguments. Supplying from = 2 and to = 10 generates a sequence of numbers from 2 to 10 incrementing by 1 (the default).
- additionally supplying by = 2 generates a sequence of numbers from 2 to 10 incrementing by 2.
- alternatively, instead of manipulating the increment of the sequence, we could specify the desired length of the sequence via the length.out argument.
- named arguments need not include the full name of the parameter, so long as it is unambiguous which parameter is being referred to. For example, length.out could be shortened to just l since there are no other parameters of this function that start with ‘l’.
- parameters can also be specified as unnamed arguments provided they are in the order specified in the function definition. For example, supplying 2, 10 and 2 (in that order) generates a sequence of numbers from 2 to 10 incrementing by 2. Note, although this is permitted, the resulting code is more difficult to read and interpret unambiguously and could easily be a source of bugs.
- named and unnamed arguments can be mixed; just remember the above rules about parameter order and unambiguous names.
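The points above can be sketched as the following calls. The limits 2 and 10 and the increment of 2 come from the text; the sequence length of 3 is an assumed value chosen for illustration.

```r
seq()                                   # all defaults: returns 1
seq(from = 2, to = 10)                  # 2 3 4 5 6 7 8 9 10
seq(from = 2, to = 10, by = 2)          # 2 4 6 8 10
seq(from = 2, to = 10, length.out = 3)  # 2 6 10
seq(from = 2, to = 10, l = 3)           # l is partially matched to length.out
seq(2, 10, 2)                           # unnamed: matched by position (from, to, by)
seq(2, 10, length.out = 3)              # unnamed and named arguments mixed
```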
2.4 Function overloading (polymorphism)
Many routines can be applied to different sorts of data. That is, they are somewhat generic. For example, we could calculate the mean (arithmetic center) of a set of numbers or we could calculate the mean of a set of dates or times. Whilst the calculations in both cases are analogous to one another, they nevertheless differ sufficiently so as to warrant separate functions.
We could name the functions that calculate the mean of a set of numbers and the mean of a set of dates as mean_numbers and mean_dates respectively. Unfortunately, as this is a relatively common situation, the number of functions to learn rapidly expands. And from the perspective of writing a function that itself contains such a generic function, we would have to write multiple instances of the function in order to handle all the types of data we might want to accommodate.
To simplify the process of applying these generic functions, R provides yet another layer that is responsible for determining which of a series of overloaded functions is likely to be applicable according to the nature of the parameters and data passed as arguments to the function. To see this in action, type mean followed by hitting the TAB key. The TAB key is used for auto-completion and therefore this procedure lists all the objects that begin with the letters ‘mean’.
In addition to an object called mean, there are additional objects that are suffixed with a ‘.’ followed by a data type. In this case, the objects mean.default, mean.Date, mean.POSIXct, mean.POSIXlt and mean.difftime are functions that respectively calculate the mean of a set of numbers, dates, times (POSIXct), times (POSIXlt) and time differences. The mean function determines which of the other functions is appropriate for the data passed and then redirects to that appropriate function. Typically, this means that it is only necessary to remember the one generic function (in this case, mean()) as the specific functions are abstracted away.
[1] 2.5
# create a sequence of dates spaced 7 days apart between 29th Feb 2000 and 30th Apr 2000
sample_dates <- seq(from = as.Date("2000-02-29"), to = as.Date("2000-04-30"), by = "7 days")
# print (view) these dates
sample_dates
[1] "2000-02-29" "2000-03-07" "2000-03-14" "2000-03-21" "2000-03-28"
[6] "2000-04-04" "2000-04-11" "2000-04-18" "2000-04-25"
# calculate the mean of these dates
mean(sample_dates)
[1] "2000-03-28"
In the above examples, we called the same function (mean) on both occasions. In the first instance, it was equivalent to calling the mean.default() function and in the second instance the mean.Date() function. Note that the seq() function is similarly overloaded.
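A self-contained sketch of this dispatch. The numeric values are an assumption (any numbers would do; these happen to have a mean of 2.5), whereas the dates are those used in the example above.

```r
# numbers: mean() hands over to mean.default()
num_values <- c(1, 2, 3, 4)
mean(num_values)
#> [1] 2.5

# dates: mean() hands over to mean.Date() and the result is itself a date
sample_dates <- seq(from = as.Date("2000-02-29"), to = as.Date("2000-04-30"), by = "7 days")
mean(sample_dates)
class(mean(sample_dates))

# list the specific functions (methods) available for the generic mean()
methods(mean)
```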
The above example also illustrates another important behaviour of function arguments. Function calls can be nested within the arguments of other functions and function arguments are evaluated before the function runs. In this way, multiple steps can be combined into a single statement (although for the sake of the code's readability and debugging, it is often better to break a problem up into smaller steps).
If a function argument itself contains a function call (as was the case above with the from = and to = arguments, both of which called the as.Date() function which converts a character string into a date object), the value of the evaluated argument is passed to the outside function. That is, evaluations are made from the inside out. The above example could have been further condensed to:
# calculate the mean of a sequence of dates spaced 7 days apart between 29th Feb 2000 and 30th Apr 2000
mean(seq(from = as.Date("2000-02-29"), to = as.Date("2000-04-30"), by = "7 days"))
[1] "2000-03-28"
2.4.1 The pipe character
As we can see from the example above, nested functions can be pretty awkward to read. As of version 4.1, R has had a pipe operator. The concept of piping dates back to the early UNIX days when separate programs were chained (‘piped’) together such that the output of one program became the input of the next and so on. This enabled each program to remain relatively simple, yet by piping sequences of programs together, rather complex results could be achieved.
Similarly, the R pipe operator (|>) enables nested functions to alternatively be expressed as a chain of functions:
# calculate the mean of a sequence of dates spaced 7 days apart between 29th Feb 2000 and 30th Apr 2000
seq(from = as.Date("2000-02-29"), to = as.Date("2000-04-30"), by = "7 days") |> mean()
[1] "2000-03-28"
To maximise code readability, it is good form to keep lines of code short (less than 80 characters). One way to do this is to place a line break after pipe characters. Moreover, a line break after each function argument allows us to have more topical and granular comments.
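One possible layout following this advice is sketched below. It reuses the date example from above; the name mean_date and the wording of the comments are my own.

```r
mean_date <- seq(
  from = as.Date("2000-02-29"),  # first date in the sequence
  to = as.Date("2000-04-30"),    # upper limit of the sequence
  by = "7 days"                  # spacing between consecutive dates
) |>
  mean()                         # mean of the resulting dates
mean_date
```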
2.5 External functions
As R is a scripting language (rather than a compiled language), it has the potential to be very slow (since syntax checking, machine instruction interpretation, etc must all take place at runtime rather than at compile time). Consequently, many of the functions are actually containers (wrappers) for external code (link libraries) precompiled in either C or Fortran. In this way, the environment can benefit from the flexibility of a scripting language whilst still maintaining most of the speed of a compiled language. Tutorial ? will introduce how to install and load external libraries.
3 Getting help
There are numerous ways of seeking help on R syntax and functions (the following are all ways of finding information about a function that calculates the mean of a vector).
- providing the name of the function as an argument to the help() function
- typing the name of the function preceded by a '?'
- to run the examples within the standard help files, use the example() function
- some packages include demonstrations that showcase their features and use cases. The demo() function provides a user-friendly way to access these demonstrations, for example to get an overview of the basic graphical procedures in R or to get a list of available demonstrations.
- if you don’t know the exact name of the function, the apropos() function is useful as it returns the names of all objects from the current search list that match a specific pattern.
- if you have no idea what the function is called, the help.search() and help.start() functions search through the regular manuals and the local HTML manuals (via a web browser) respectively for specific terms.
- to get a snapshot of the order and default values of a function’s arguments, use the args() function. For the mean() and list.files() functions respectively, this returns:
function (x, ...) NULL
function (path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE, recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE) NULL
The ... argument indicates that other arguments can also be provided that are then passed on to other functions that may be called within the main function.
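The help facilities listed above might be called as follows. The search terms are assumptions chosen to match the "mean" theme, and the calls that open viewers, browsers or graphics windows are wrapped in a check for an interactive session so that the block can also be run from a script.

```r
if (interactive()) {
  help(mean)            # help page for the mean() function
  ?mean                 # shorthand for the same thing
  example(mean)         # run the examples from the help page
  demo(graphics)        # overview of basic graphical procedures
  demo()                # list the available demonstrations
  help.search("mean")   # search the installed help for a term
  help.start()          # open the local HTML manuals in a web browser
}

apropos("^mean")        # names of objects that begin with "mean"
args(mean)              # argument names and defaults of mean()
args(list.files)        # argument names and defaults of list.files()
```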
4 Data Types
4.1 Vectors
Vectors are a collection of one or more entries (values) of the same type (class) and are the basic storage unit in R. Vectors are one-dimensional arrays (have a single dimension - length) and can be thought of as a single column of data. Each entry in a vector has a unique index (like a row number) to enable reference to particular entries in the vector.
4.1.1 Consecutive integers
To get a vector of consecutive integers, we can specify an expression of the form <first integer>:<second integer> where <first integer> and <second integer> represent the start and end of the sequence of integers respectively:
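For instance (the particular start and end values are illustrative assumptions):

```r
5:10          # 5 6 7 8 9 10
-2:2          # -2 -1 0 1 2
class(5:10)   # "integer"
```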
4.1.2 The c() function
The c() function concatenates values together into a vector. To create a vector with the numbers 1, 4, 7, 21:
As an example, we could store the temperature recorded at 10 sites:
[1] 36.1 30.6 31.0 36.3 39.9 6.5 11.2 12.8 9.7 15.9
To create a vector with the words ‘Fish’, ‘Rock’, ‘Tree’, ‘Git’:
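The three vectors described in this section can be created as sketched below. The values are taken from the text and the output shown above; the object name temperature follows its later use in this tutorial.

```r
# a numeric vector
c(1, 4, 7, 21)

# temperatures recorded at 10 sites, stored under a name for later use
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
temperature

# a character vector
c("Fish", "Rock", "Tree", "Git")
```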
4.1.3 Regular or patterned sequences (rep())
We have already seen the use of the seq() function to create sequences of entries.
Sequences of repeated entries are supported with the rep() function:
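A brief sketch of common rep() patterns. The values being repeated are illustrative assumptions.

```r
rep(4, times = 5)                 # 4 4 4 4 4
rep(c("no", "full"), times = 2)   # the whole vector repeated: no full no full
rep(c("no", "full"), each = 2)    # each entry repeated in turn: no no full full
rep(1:3, times = c(3, 2, 1))      # per-entry repeat counts: 1 1 1 2 2 3
```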
4.1.4 The paste() function
To create a sequence of quadrat labels we could use the c() function as illustrated above, e.g.
[1] "Q1" "Q2" "Q3" "Q4" "Q5" "Q6" "Q7" "Q8" "Q9" "Q10"
A more elegant way of doing this is to use the paste() function:
[1] "Q1" "Q2" "Q3" "Q4" "Q5" "Q6" "Q7" "Q8" "Q9" "Q10"
This can be useful for naming vector elements. For example, we could use the names() function to name the elements of the temperature variable according to the quadrat labels.
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
36.1 30.6 31.0 36.3 39.9 6.5 11.2 12.8 9.7 15.9
The paste() function can also be used in conjunction with other functions to generate lists of labels. For example, we could combine a vector in which the letters A, B, C, D and E (generated with the LETTERS constant) are each repeated twice consecutively (using the rep() function) with a vector that contains a 1 and a 2 to produce a character vector that labels sites in which the quadrats may have occurred.
[1] "A1" "A2" "B1" "B2" "C1" "C2" "D1" "D2" "E1" "E2"
Or, with the use of pipes:
[1] "A1" "A2" "B1" "B2" "C1" "C2" "D1" "D2" "E1" "E2"
Rather than specify that the components are not separated by any character (which is what we are doing above by indicating that the separator character should be ""), there is a version of paste() that does this automatically. It is paste0().
[1] "A1" "A2" "B1" "B2" "C1" "C2" "D1" "D2" "E1" "E2"
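The labels shown in this section could be generated as sketched below. The exact calls used in the original are not shown, so these are reconstructions that produce the same output; the names site_piped and site_paste0 are hypothetical, and the piped version assumes R 4.1 or later.

```r
# quadrat labels Q1 to Q10
quadrats <- paste("Q", 1:10, sep = "")

# name the temperature entries after the quadrats
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
names(temperature) <- quadrats
temperature

# site labels: each of A to E repeated twice, combined with 1 and 2
site <- paste(rep(LETTERS[1:5], each = 2), 1:2, sep = "")

# the same labels built as a chain of piped steps
site_piped <- LETTERS[1:5] |>
  rep(each = 2) |>
  paste(1:2, sep = "")

# paste0() is paste() with no separator
site_paste0 <- paste0(rep(LETTERS[1:5], each = 2), 1:2)
site
```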
Vector class | Description |
---|---|
integer | whole numbers |
numeric | real numbers |
character | letters |
logical | TRUE or FALSE |
Date | dates |
POSIXlt | date/time |
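One example of each vector class is sketched below. All object names and values are hypothetical and chosen only to illustrate the classes.

```r
int_vec  <- 2:4                                          # integer
num_vec  <- c(1.5, 2, 10.25)                             # numeric
chr_vec  <- c("A", "B", "AA")                            # character
lgl_vec  <- c(TRUE, FALSE, TRUE)                         # logical
date_vec <- as.Date(c("2000-02-29", "2000-03-01"))       # Date
dt_vec   <- as.POSIXlt("2000-02-29 14:30:00", tz = "UTC")  # POSIXlt (date/time)

class(int_vec)
class(num_vec)
class(chr_vec)
class(lgl_vec)
class(date_vec)
class(dt_vec)
```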
4.1.5 Factors
Factors are more than a vector of characters. Factors have additional properties that are utilized during statistical analyses and graphical procedures. To illustrate the difference, we will create a vector to represent a categorical variable indicating the level of shading applied to 10 quadrats. Firstly, we will create a character vector:
[1] "no" "no" "no" "no" "no" "full" "full" "full" "full" "full"
Now we convert this into a factor:
Notice the additional property (Levels) at the end of the output. Notice also that unless specified otherwise, the levels are ordered alphabetically. Whilst this does not impact on how the data appear in a vector, it does affect some statistical analyses, their interpretations as well as some tabular and graphical displays. If the alphabetical ordering does not reflect the natural order of the data, it is best to reorder the levels whilst defining the factor:
[1] no no no no no full full full full full
Levels: no full
A more convenient way to create a balanced (equal number of replicates) factor is to use the gl() function. To create the shading factor from above:
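The steps of this section can be sketched as follows. The calls are reconstructions consistent with the output shown above rather than the original code, and the names shade_default and shade_gl are hypothetical.

```r
# character vector: five unshaded quadrats followed by five fully shaded ones
shade <- rep(c("no", "full"), each = 5)

# default conversion: the levels are ordered alphabetically
shade_default <- factor(shade)
levels(shade_default)   # "full" "no"

# specify the level order explicitly while defining the factor
shade <- factor(shade, levels = c("no", "full"))
shade

# gl(): 2 levels, each replicated 5 times, with the given labels
shade_gl <- gl(n = 2, k = 5, labels = c("no", "full"))
shade_gl
```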
4.1.6 Matrices
Matrices have two dimensions (length and width). The entries (which must all be of the same length and type, i.e. class) are arranged in rows and columns.
We could arrange the vector of temperatures into two columns:
[,1] [,2]
[1,] 36.1 6.5
[2,] 30.6 11.2
[3,] 31.0 12.8
[4,] 36.3 9.7
[5,] 39.9 15.9
Similarly, we could arrange the vector of shading into two columns:
[,1] [,2]
[1,] "no" "full"
[2,] "no" "full"
[3,] "no" "full"
[4,] "no" "full"
[5,] "no" "full"
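Both matrices above could be produced with the matrix() function, as sketched here. The use of ncol = 2 is an assumption (the original call is not shown), and the shading values are recreated as a plain character vector so that the block runs on its own.

```r
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
shade <- rep(c("no", "full"), each = 5)

# entries fill the matrix column by column: the first five values form column 1
temp_matrix <- matrix(temperature, ncol = 2)
temp_matrix

shade_matrix <- matrix(shade, ncol = 2)
shade_matrix

dim(temp_matrix)   # number of rows and columns: 5 2
```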
As another example, we could store the X,Y coordinates for five quadrats within a grid. We start by generating separate vectors to represent the X and Y coordinates and then we bind them together using the cbind() function (which combines objects by columns):
x <- c(16.92, 24.03, 7.61, 15.49, 11.77)
y <- c(8.37, 12.93, 16.65, 12.2, 13.12)
xy <- cbind(x, y)
xy
x y
[1,] 16.92 8.37
[2,] 24.03 12.93
[3,] 7.61 16.65
[4,] 15.49 12.20
[5,] 11.77 13.12
We could alternatively combine by rows using the rbind() function.
We could even alter the row names using an inbuilt vector of uppercase letters:
x y
A 16.92 8.37
B 24.03 12.93
C 7.61 16.65
D 15.49 12.20
E 11.77 13.12
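A sketch of the row binding and row naming steps just described. The coordinate values are those given above; the calls are reconstructions consistent with the printed output, and the name xy_rows is hypothetical.

```r
x <- c(16.92, 24.03, 7.61, 15.49, 11.77)
y <- c(8.37, 12.93, 16.65, 12.2, 13.12)

# bind by rows: a matrix with 2 rows (x and y) and 5 columns
xy_rows <- rbind(x, y)
xy_rows

# bind by columns, then name the rows with the first five uppercase letters
xy <- cbind(x, y)
rownames(xy) <- LETTERS[1:5]
xy
```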
Importantly, all entries in a matrix must be of the same type. That is, they must all be numeric, or all be characters etc. If we attempt to mix a combination of data types in a matrix, then the data will all be converted into a type that can accommodate all the data. For example, if we attempt to bind together the numeric temperature data and the character site data into a matrix, then the result will be a matrix of characters (since while it is possible to convert numbers to strings, in this case the reverse is not possible).
temperature site
Q1 "36.1" "A1"
Q2 "30.6" "A2"
Q3 "31" "B1"
Q4 "36.3" "B2"
Q5 "39.9" "C1"
Q6 "6.5" "C2"
Q7 "11.2" "D1"
Q8 "12.8" "D2"
Q9 "9.7" "E1"
Q10 "15.9" "E2"
On the other hand, if we attempt to bind together the numeric temperature data and the factor shade data into a matrix, then the result will be a matrix of numbers (recall that factors are internally stored as integers, yet they have a levels property that acts rather like a lookup key).
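Both conversions can be checked with a short sketch. The objects are recreated here so that the block is self-contained; the names m_chr and m_num are hypothetical.

```r
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
site <- paste0(rep(LETTERS[1:5], each = 2), 1:2)
shade <- factor(rep(c("no", "full"), each = 5), levels = c("no", "full"))

# numeric + character: everything is converted to character
m_chr <- cbind(temperature, site)
is.character(m_chr)

# numeric + factor: the factor contributes its internal integer codes
m_num <- cbind(temperature, shade)
is.numeric(m_num)
m_num
```

In the second matrix the shade column holds 1 for "no" and 2 for "full", reflecting the order of the factor levels.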
4.1.7 Lists
Lists provide a way to group together multiple objects of different type and length. For example, whilst the contents of any single vector or matrix must all be of the one type and length (e.g. all numeric or all character), a list can contain any combination of vectors, matrices and scalars of any type. Furthermore, the objects contained in a list do not need to be of the same lengths (cf. data frames). The output of most analyses is returned as a list.
As an example, we could group together the previously created isolated vectors and matrices into a single object that encapsulates the entire experiment:
experiment <- list(
site = site,
quadrats = quadrats,
coordinates = xy,
shade = shade,
temperature = temperature
)
experiment
$site
[1] "A1" "A2" "B1" "B2" "C1" "C2" "D1" "D2" "E1" "E2"
$quadrats
[1] "Q1" "Q2" "Q3" "Q4" "Q5" "Q6" "Q7" "Q8" "Q9" "Q10"
$coordinates
x y
A 16.92 8.37
B 24.03 12.93
C 7.61 16.65
D 15.49 12.20
E 11.77 13.12
$shade
[1] no no no no no full full full full full
Levels: no full
$temperature
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
36.1 30.6 31.0 36.3 39.9 6.5 11.2 12.8 9.7 15.9
Lists can be thought of as a set of objects bound into a single container. In the example above, the list object experiment contains a copy of the site, quadrats, coordinates, shade and temperature objects.
Importantly, once a list has been created the objects within the list are not linked in any way to the original objects from which the list is formed. Consequently, any changes made to (for example) the temperature object will not be reflected in the content of the temperature object within the experiment list.
To access an object within a list, the $ operator is used as such:
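A minimal sketch using a cut-down, hypothetical version of the experiment list (three entries per component rather than ten) so that the block runs on its own. It also demonstrates that the list holds a copy that is independent of the original object.

```r
temperature <- c(36.1, 30.6, 31.0)
experiment <- list(
  shade = c("no", "no", "full"),
  temperature = temperature
)

# extract a component by name with the $ operator
experiment$temperature

# the double square bracket form gives the same result
experiment[["temperature"]]

# changing the original vector does not alter the copy stored in the list
temperature[1] <- 0
experiment$temperature[1]   # still 36.1
```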
4.1.8 Dataframes - data sets
Rarely are single biological variables collected in isolation. Rather, data are usually collected in sets of variables reflecting investigations of patterns between and/or among the different variables. Consequently, data sets are best organized into matrices of variables (vectors) all of the same lengths yet not necessarily of the same type. Hence, neither lists nor matrices represent natural storages for data sets. This is the role of data frames which are used to store a set of vectors of the same length (yet potentially different types) in a rectangular matrix.
Data frames are generated by combining multiple vectors together such that each vector becomes a separate column in the data frame. For a data frame to faithfully represent a data set, the sequence in which observations appear in the vectors must be the same for each vector, and each vector should have the same number of observations. For example, the first, second, third…etc entries in each vector must represent respectively, the observations collected from the first, second, third…etc sampling units.
Since the focus of these tutorials is on the exploration, analysis and summary of data sets, and data sets are accommodated in R by data frames, the generation, importation, exportation, manipulation and management of data frames receives extensive coverage in many other subsequent tutorials.
As a simple example of a data frame, we could again group together the previously created isolated vectors into a single object that encapsulates a data set:
data <- data.frame(
site = site,
quadrats = quadrats,
shade = shade,
temperature = temperature
)
data
site quadrats shade temperature
Q1 A1 Q1 no 36.1
Q2 A2 Q2 no 30.6
Q3 B1 Q3 no 31.0
Q4 B2 Q4 no 36.3
Q5 C1 Q5 no 39.9
Q6 C2 Q6 full 6.5
Q7 D1 Q7 full 11.2
Q8 D2 Q8 full 12.8
Q9 E1 Q9 full 9.7
Q10 E2 Q10 full 15.9
5 Object manipulation
5.1 Object information
As indicated earlier, everything in R is an object. All objects have a type or class that encapsulates the sort of information stored in the object as well as determining how other functions interact with the object. The class of an object can be reviewed with the class() function:
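A short sketch of class() in use follows. The small objects here are illustrative stand-ins (shortened versions of the vectors used in this tutorial), not the full objects created earlier.

```r
# Small illustrative objects (shortened stand-ins for the tutorial's objects)
temperature <- c(36.1, 30.6, 31.0)
shade <- factor(c("no", "no", "full"))
coordinates <- cbind(x = c(16.92, 24.03), y = c(8.37, 12.93))

class(temperature)   # "numeric"
class(shade)         # "factor"
class(coordinates)   # includes "matrix" (also "array" from R 4.0.0 onwards)
class(list(1, "a"))  # "list"
```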
There is also a family of functions prefixed with is. that evaluate whether or not an object is of a particular class (or type). The following table lists the common object query functions. All object query functions return a logical vector. Enter methods(is) for a more comprehensive list.
Function | Returns TRUE |
---|---|
is.numeric(x) | if all elements of x are numeric or integers |
is.null(x) | if x is null (the object has no length) |
is.logical(x) | if all elements of x are logical |
is.character(x) | if all elements of x are character strings |
is.vector(x) | if the object x is a vector (has only a single dimension). Returns FALSE if object has attributes other than ‘names’. |
is.factor(x) | if the object x is a factor |
is.matrix(x) | if the object x is a matrix (two dimensions, yet not a data.frame) |
is.list(x) | if the object x is a list |
is.data.frame(x) | if the object x is a data.frame |
is.na(x) | for each missing (NA) element in x |
! | (‘not’) operator as a prefix converts the above functions into ‘is.not’ |
5.1.1 Attributes
Many R objects also have a set of attributes, the number and type of which are specific to each class of object. For example, a matrix object has a specific number of dimensions as well as row and column names. The attributes of an object can be viewed using the attributes() function:
$dim
[1] 5 2
$dimnames
$dimnames[[1]]
[1] "A" "B" "C" "D" "E"
$dimnames[[2]]
[1] "x" "y"
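The output above is consistent with a call such as the following. This is a sketch that rebuilds the coordinates matrix from its printed values; the exact construction used earlier in the tutorial may differ.

```r
# Rebuild the coordinates matrix from the values printed in this tutorial
coordinates <- cbind(
  x = c(16.92, 24.03, 7.61, 15.49, 11.77),
  y = c(8.37, 12.93, 16.65, 12.20, 13.12)
)
rownames(coordinates) <- LETTERS[1:5]

# List every attribute of the matrix (dimensions plus row and column names)
attributes(coordinates)
```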
Similarly, the attr() function can be used to view and set individual attributes of an object, by specifying the name of the object and the name of the attribute (as a character string) as arguments. For example:
[1] 5 2
x y
A 16.92 8.37
B 24.03 12.93
C 7.61 16.65
D 15.49 12.20
E 11.77 13.12
attr(,"description")
[1] "coordinates of quadrats"
Note that in the above example, the attribute ‘description’ is not an in-built attribute of a matrix. When a new attribute is set, this attribute is displayed along with the object. This provides a useful way of attaching a description (or other metadata) to an object, thereby reducing the risks of the object becoming unfamiliar.
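The two results above (viewing the dimensions, then attaching a description) can be sketched as below, assuming the coordinates matrix is rebuilt from its printed values.

```r
# Rebuild the coordinates matrix from the values printed in this tutorial
coordinates <- cbind(
  x = c(16.92, 24.03, 7.61, 15.49, 11.77),
  y = c(8.37, 12.93, 16.65, 12.20, 13.12)
)
rownames(coordinates) <- LETTERS[1:5]

# View a single attribute
attr(coordinates, "dim")

# Set a new (user-defined) attribute; it is printed along with the object
attr(coordinates, "description") <- "coordinates of quadrats"
coordinates
```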
5.2 Object conversion
Objects can be converted or coerced into other objects using a family of functions with an as. prefix. Note that there are some obvious restrictions on these conversions as most objects cannot be completely accommodated by all other object types, and therefore some information (such as certain attributes) may be lost or modified during the conversion. Objects and elements that cannot be successfully coerced are returned as NA. The following table lists the common object coercion functions. Use methods(as) for a more comprehensive list.
Function | Converts object to |
---|---|
as.numeric(x) | a numeric vector (‘integer’ or ‘real’). Factors are converted to their underlying integer codes. |
as.null(x) | a NULL |
as.logical(x) | a logical vector. Non-zero numeric values are converted to TRUE and zero to FALSE. |
as.character(x) | a character (string) vector. |
as.vector(x) | a vector. All attributes (including names) are removed. |
as.factor(x) | a factor. This is an abbreviated (with respect to its argument set) version of the factor() function. |
as.matrix(x) | a matrix. Any non-numeric elements result in all matrix elements being converted to characters. |
as.list(x) | a list |
as.data.frame(x) | a data.frame. Matrix columns and list items are converted into separate vectors of the dataframe. Character vectors are converted into factors only when stringsAsFactors = TRUE (the default prior to R 4.0.0). All previous attributes are removed. |
as.Date(x) | a date |
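A few of these coercions can be sketched as follows. The input values are made up for illustration; the point is what is lost or changed in each conversion.

```r
# Character to numeric: entries that cannot be coerced become NA
# (R also issues a warning, suppressed here to keep the output clean)
nums <- suppressWarnings(as.numeric(c("1.5", "2", "abc")))
nums

# Numeric to logical: zero is FALSE, any non-zero value is TRUE
as.logical(c(0, 1, 2))

# Factor to numeric returns the underlying integer codes, not the labels
shade <- factor(c("no", "full", "no"), levels = c("no", "full"))
as.numeric(shade)
as.character(shade)

# Character to Date (note the capital D)
as.Date("2024-01-12")
```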
5.3 Indexing
Indexing is the means by which data are filtered (subsetted) to include and exclude certain entries.
5.3.1 Vector indexing
Subsets of vectors are produced by appending an index vector (enclosed in square brackets []) to a vector name. There are four common forms of vector indexing used to extract a subset of vectors:
Vector of positive integers - a set of integers that indicate which elements of the vector should be included:
Vector of negative integers - a set of integers that indicate which elements of the vector should be excluded:
Vector of character strings (referencing names) - for vectors whose elements have been named, a vector of names can be used to select elements to include:
Vector of logical values - a vector of logical values (TRUE or FALSE) the same length as the vector being subsetted. Entries corresponding to a logical TRUE are included, FALSE are excluded:
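The four forms listed above can be sketched with the temperature vector, rebuilt here from its printed values. The particular indices chosen are illustrative.

```r
# Rebuild the named temperature vector from the values printed in this tutorial
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
names(temperature) <- paste0("Q", 1:10)

# 1. Positive integers: keep the 1st, 3rd and 5th entries
temperature[c(1, 3, 5)]

# 2. Negative integers: drop the first two entries
temperature[-(1:2)]

# 3. Character strings: select entries by name
temperature[c("Q1", "Q4")]

# 4. Logical values: keep entries for which the condition is TRUE
temperature[temperature < 15]
```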
5.3.2 Matrix indexing
Similar to vectors, matrices can be indexed using positive integers, negative integers, character strings and logical vectors. However, whereas vectors have a single dimension (length), matrices have two dimensions (length and width). Hence, indexing needs to reflect this: it is necessary to specify both the row and column indices. Matrix indexing takes the form [row.indices, col.indices], where row.indices and col.indices respectively represent sequences of row and column indices. If a row or column index sequence is omitted, it is interpreted as the entire row or column respectively.
[1] 16.65
x y
7.61 16.65
A B C D E
16.92 24.03 7.61 15.49 11.77
x y
16.92 8.37
A B C D E
16.92 24.03 7.61 15.49 11.77
x y
A 16.92 8.37
B 24.03 12.93
D 15.49 12.20
If you think that last example looks awkward you would not be alone. In a later tutorial, I will introduce an alternative way of manipulating data for data frames.
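The calls that produced the six results above are not shown, so the following is a plausible reconstruction rather than the original code. In particular, the logical condition in the last call (x greater than 12) is an assumption that happens to select rows A, B and D.

```r
# Rebuild the coordinates matrix from the values printed in this tutorial
coordinates <- cbind(
  x = c(16.92, 24.03, 7.61, 15.49, 11.77),
  y = c(8.37, 12.93, 16.65, 12.20, 13.12)
)
rownames(coordinates) <- LETTERS[1:5]

coordinates[3, 2]      # a single cell: row 3, column 2
coordinates[3, ]       # all columns of row 3
coordinates[, 1]       # all rows of column 1
coordinates["A", ]     # a row selected by name
coordinates[, "x"]     # a column selected by name

# rows for which a logical condition holds (assumed condition: x > 12)
coordinates[coordinates[, "x"] > 12, ]
```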
5.3.3 List indexing
Lists consist of collections of objects that need not be of the same size or type. The objects within a list are indexed by appending an index vector (enclosed in single or double square brackets, [] or [[]]) to the list name. Single square brackets provide access to multiple list items (returned as a list), whereas double square brackets provide access to individual list items (returned according to the type of object represented by the list item). A single object within a list can also be referred to by appending a dollar sign ($) followed by the name of the object to the list name (e.g. list$object). The elements of objects within a list are indexed according to the object type. Vector indices to objects within other objects (lists) are placed within their own square brackets outside the list square brackets.
Recall the experiment list we generated earlier.
$site
[1] "A1" "A2" "B1" "B2" "C1" "C2" "D1" "D2" "E1" "E2"
$quadrats
[1] "Q1" "Q2" "Q3" "Q4" "Q5" "Q6" "Q7" "Q8" "Q9" "Q10"
$coordinates
x y
A 16.92 8.37
B 24.03 12.93
C 7.61 16.65
D 15.49 12.20
E 11.77 13.12
$shade
[1] no no no no no full full full full full
Levels: no full
$temperature
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
36.1 30.6 31.0 36.3 39.9 6.5 11.2 12.8 9.7 15.9
The following examples illustrate the common ways to subset (index) lists.
A vector of positive numbers (single brackets) - that indicate which list items should be included:
A single positive number (double brackets) - that indicates which list item should be included:
A single character string (double brackets) - that indicates which list item should be included:
Extract the first element of each list item - returned as a matrix:
site quadrats coordinates shade temperature.Q1
"A1" "Q1" "16.92" "1" "36.1"
## notice that only one element of the coordinate pair is included
## OR when the list items are not vectors
do.call(cbind, experiment)[1, ]
Warning in (function (..., deparse.level = 1) : number of rows of result is not a multiple of vector length (arg 1)
site quadrats x y shade temperature
"A1" "Q1" "16.92" "8.37" "1" "36.1"
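The list indexing forms described above can be sketched as follows. The experiment list is rebuilt here from its printed contents, and the final sapply() call is an assumed way of extracting the first element of each item; the original call is not shown.

```r
# Rebuild the experiment list from the values printed in this tutorial
site <- c("A1", "A2", "B1", "B2", "C1", "C2", "D1", "D2", "E1", "E2")
quadrats <- paste0("Q", 1:10)
coordinates <- cbind(
  x = c(16.92, 24.03, 7.61, 15.49, 11.77),
  y = c(8.37, 12.93, 16.65, 12.20, 13.12)
)
rownames(coordinates) <- LETTERS[1:5]
shade <- factor(rep(c("no", "full"), each = 5), levels = c("no", "full"))
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
names(temperature) <- quadrats
experiment <- list(
  site = site, quadrats = quadrats, coordinates = coordinates,
  shade = shade, temperature = temperature
)

# Single brackets with positive numbers: returns a (shorter) list
experiment[c(1, 3)]

# Double brackets with a single number: returns the item itself
experiment[[1]]

# Double brackets with a single character string
experiment[["temperature"]]

# Indexing within an extracted item uses its own brackets
experiment[["temperature"]][1:3]
experiment$coordinates[, "x"]

# First element of each list item (assumed reconstruction)
sapply(experiment, "[", 1)
```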
5.4 Pattern matching and replacement
An important part of filtering is the ability to detect patterns on which to base selections or exclusions. Numerical and categorical filtering rules are generally fairly straightforward; however, complex filtering rules can also be devised from character vectors. Furthermore, the ability to search and replace character strings within a character vector can also be very useful.
5.4.1 grep - index of match
The grep() function searches within a vector for matches to a pattern and returns the index of all matching entries.
## get the indexes of elements of the site vector that contain an 'A'
grep(pattern = "A", experiment$site)
[1] 1 2
## use the results of the grep as indexes to select only those 'site'
## values that contain an 'A' (note that patterns are case sensitive)
experiment$site[grep(pattern = "A", experiment$site)]
[1] "A1" "A2"
The pattern can comprise any valid regular expression and is therefore very flexible:
## get the indexes of values of the 'site' vector within the `data`
## dataframe that contain either an 'A', 'B' or 'C' followed by a '1'
grep("[A-C]1", data$site)
[1] 1 3 5
## select only those rows of the `data` dataframe that correspond to a
## 'site' value of either an 'A', 'B' or 'C' followed by a '1'
data[grep("[A-C]1", data$site), ]
site quadrats shade temperature
Q1 A1 Q1 no 36.1
Q3 B1 Q3 no 31.0
Q5 C1 Q5 no 39.9
5.4.2 regexpr - position and length of match
Rather than return the indexes of matching entries, the regexpr() function returns the position of the match within each string as well as the length of the pattern within each string (-1 values correspond to entries in which the pattern is not found).
aust <- c("adelaide", "brisbane", "canberra", "darwin", "hobart", "melbourne", "perth", "sydney")
aust
[1] "adelaide" "brisbane" "canberra" "darwin" "hobart" "melbourne"
[7] "perth" "sydney"
## get the position and length of string of characters containing an
## 'a' and an 'e' separated by any number of characters
regexpr(pattern="a.*e", aust)
[1] 1 6 2 -1 -1 -1 -1 -1
attr(,"match.length")
[1] 8 3 4 -1 -1 -1 -1 -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
5.4.3 gsub - pattern replacement
The gsub() function replaces all instances of an identified pattern within a character vector with an alternative set of characters. The similar sub() function replaces only the first instance.
[1] no no no no no full full full full full
Levels: no full
[1] "Not shaded" "Not shaded" "Not shaded" "Not shaded" "Not shaded"
[6] "full" "full" "full" "full" "full"
It is also possible to extend the functionality to accommodate perl-compatible regular expressions (via the perl = TRUE argument).
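The replacement shown above can be reproduced with a sketch like the following, in which the shade factor is rebuilt from its printed values. The last call is an illustrative use of perl-compatible syntax (the case-conversion escapes \U and \L), applied to a made-up string.

```r
# Rebuild the shade factor from the values printed in this tutorial
shade <- factor(rep(c("no", "full"), each = 5), levels = c("no", "full"))
shade

# Replace every "no" with "Not shaded"; the result is a character vector
relabelled <- gsub("no", "Not shaded", shade)
relabelled

# sub() replaces only the first match within each string, gsub() replaces all
sub("l", "L", "full")
gsub("l", "L", "full")

# perl = TRUE enables perl-compatible extensions such as \U and \L
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "not shaded", perl = TRUE)
```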
5.4.4 substr - extracting substrings
The substr() function is used to extract parts of string (set of characters) entries within character vectors and thus is useful for making truncated labels (particularly for graphical summaries). For example, if we had a character vector containing the names of the Australian capital cities and required abbreviations (first 3 characters) for graph labels:
[1] "adelaide" "brisbane" "canberra" "darwin" "hobart" "melbourne"
[7] "perth" "sydney"
[1] "ade" "bri" "can" "dar" "hob" "mel" "per" "syd"
Alternatively, we could use the abbreviate() function.
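Both approaches can be sketched as follows, using the aust vector defined earlier. The minlength value passed to abbreviate() is an illustrative choice.

```r
aust <- c("adelaide", "brisbane", "canberra", "darwin", "hobart",
          "melbourne", "perth", "sydney")

# Characters 1 to 3 of each entry
short <- substr(aust, 1, 3)
short

# abbreviate() chooses which characters to drop while keeping labels unique;
# minlength sets the minimum length of the abbreviations
abbreviate(aust, minlength = 3)
```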
5.4.5 Value matching
In addition to the above matching procedures, it is possible to compare vectors via the usual set of binary operators (x<y, x>y, x<=y, x>=y, x==y and x!=y).
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
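The two logical vectors above are consistent with comparisons such as these. The threshold of 32 is an assumption; any value between 31.0 and 36.1 gives the same result.

```r
# Rebuild the vectors from the values printed in this tutorial
shade <- factor(rep(c("no", "full"), each = 5), levels = c("no", "full"))
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
names(temperature) <- paste0("Q", 1:10)

# Which entries of shade equal "no"?
shade == "no"

# Which temperatures exceed 32 (assumed threshold)? Names are retained.
temperature > 32
```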
Note that the comparisons are made in an item-wise manner. That is, item one of the right hand vector is compared to item one of the left hand vector, item two of each vector are compared to one another, and so on. If the two vectors are not of equal length, the shorter vector is recycled (that is, it returns to the start of that vector and keeps going).
## Compare 'Q1' to items 1,3,5,7,9 of quadrats and compare 'Q3' to
## items 2,4,6,8,10.
quadrats == c('Q1','Q3')
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Be very cautious when using the binary operators x==y or x!=y to compare numeric vectors, as they do not allow for rounding errors or the finite representation of fractions and can therefore return FALSE even for values that appear identical. As an alternative, consider using a combination of all.equal() and identical():
[1] FALSE
[1] TRUE
[1] TRUE
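A well-known instance of this problem is sketched below; the specific numbers are illustrative and not necessarily those used to produce the output above.

```r
x <- 0.1 + 0.2
y <- 0.3

# Exact comparison fails because neither side is exactly representable
x == y

# all.equal() compares within a small tolerance
all.equal(x, y)

# all.equal() returns TRUE or a description of the difference, so wrap it
# to obtain a strict TRUE/FALSE
identical(all.equal(x, y), TRUE)
isTRUE(all.equal(x, y))
```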
Each of the search and replace functions listed above uses only a single search item (albeit with pattern matching that can accommodate multiple patterns). The match() function takes a vector of items to be matched (first argument) and a lookup vector (second argument) and, for each item of the first vector, returns the index of its first match within the lookup vector (or NA when there is no match). Similarly, the special binary operator %in% indicates whether or not (TRUE or FALSE) each item of the first vector is contained anywhere within the lookup vector. This latter mechanism makes a very useful filter.
## match the items within the `shade` vector against a lookup character
## vector containing only the string of "no" returning the index
## within the lookup vector
match(shade,"no")
[1] 1 1 1 1 1 NA NA NA NA NA
## same match as above, yet returning a logical vector corresponding
## to whether each item in the first vector is matched or not
shade %in% 'no'
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[1] 1 NA NA 2 NA NA NA NA NA 3
[1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
site quadrats shade temperature
Q1 A1 Q1 no 36.1
Q4 B2 Q4 no 36.3
Q10 E2 Q10 full 15.9
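The last three results above are consistent with matching against a lookup vector of three items. The sketch below assumes the lookup was made on site with the values "A1", "B2" and "E2"; the original calls are not shown, and a lookup on quadrats would give the same pattern.

```r
# Rebuild the data frame from the values printed in this tutorial
site <- c("A1", "A2", "B1", "B2", "C1", "C2", "D1", "D2", "E1", "E2")
quadrats <- paste0("Q", 1:10)
shade <- factor(rep(c("no", "full"), each = 5), levels = c("no", "full"))
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
names(temperature) <- quadrats
data <- data.frame(site = site, quadrats = quadrats,
                   shade = shade, temperature = temperature)

lookup <- c("A1", "B2", "E2")   # assumed lookup values

# Index within lookup of each site entry (NA when absent)
match(site, lookup)

# Logical version of the same match
site %in% lookup

# Use the logical vector as a row filter
data[data$site %in% lookup, ]
```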
5.5 Sorting
The sort() function is used to sort vector entries in increasing (or decreasing) order.
Q6 Q9 Q7 Q8 Q10 Q2 Q3 Q1 Q4 Q5
6.5 9.7 11.2 12.8 15.9 30.6 31.0 36.1 36.3 39.9
Q5 Q4 Q1 Q3 Q2 Q10 Q8 Q7 Q9 Q6
39.9 36.3 36.1 31.0 30.6 15.9 12.8 11.2 9.7 6.5
The order() function is used to get the position of each entry in a vector if it were sorted in increasing (or decreasing) order.
[1] 6 9 7 8 10 2 3 1 4 5
[1] 5 4 1 3 2 10 8 7 9 6
Hence the smallest entry in the temperature vector was at position (index) 6 and so on.
The rank() function is used to get the ranking of each entry in a vector if it were sorted in increasing order.
For the temperature vector, the first entry is ranked eighth in increasing order. Ranks in decreasing order can be produced by ranking the negated vector (e.g. rank(-temperature)); note that simply reversing the returned vector with rev() reorders the entries rather than reversing their ranks.
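The three functions can be compared side by side with the temperature vector, rebuilt here from its printed values.

```r
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
names(temperature) <- paste0("Q", 1:10)

# sort() returns the values themselves in sorted order
sort(temperature)
sort(temperature, decreasing = TRUE)

# order() returns the positions that would sort the vector
order(temperature)
order(temperature, decreasing = TRUE)

# indexing by order() is equivalent to sort()
temperature[order(temperature)]

# rank() returns the rank of each entry (1 = smallest)
rank(temperature)

# ranking the negated values gives ranks in decreasing order (1 = largest)
rank(-temperature)
```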
5.6 Formatting data
5.6.1 Rounding of numerical data
The ceiling() function rounds vector entries up to the nearest integer.
The floor() function rounds vector entries down to the nearest integer.
The trunc() function rounds vector entries to the nearest integer towards ‘0’ (zero).
[1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
[1] -2 -1 -1 0 0 0 1 1 2
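The differences between the three functions are easiest to see on a sequence that spans zero. The sketch below assumes the sequence printed above was generated with seq().

```r
x <- seq(-2, 2, by = 0.5)
x

ceiling(x)   # always rounds up (towards +Inf)
floor(x)     # always rounds down (towards -Inf)
trunc(x)     # drops the fractional part (towards zero)
```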
The round() function rounds vector entries to the nearest numeric with the specified number of decimal places. Digits of 5 are rounded off to the nearest even digit.
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
36 31 31 36 40 6 11 13 10 16
[1] -2 -2 -1 0 0 0 1 2 2
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
16.41 13.91 14.09 16.50 18.14 2.95 5.09 5.82 4.41 7.23
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
40 30 30 40 40 10 10 10 10 20
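The four results above are consistent with the following calls. The divisor of 2.2 in the third call is inferred from the printed values and is an assumption.

```r
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
names(temperature) <- paste0("Q", 1:10)
x <- seq(-2, 2, by = 0.5)

# Round to whole numbers (6.5 goes to the even digit, 6)
round(temperature)

# Halves are rounded to the nearest even integer
round(x)

# Round a derived quantity to two decimal places (assumed divisor)
round(temperature / 2.2, 2)

# A negative digits argument rounds to tens, hundreds, ...
round(temperature, -1)
```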
5.6.2 Notation and labelling of numeric or character data
Occasionally (mainly for graphical displays), it is necessary to be able to adjust the other aspects of the formatting of vector entries. For example, you may wish to have numbers expressed in scientific notation (2.93e-04 rather than 0.000293) or insert commas every 3 digits left of the decimal point or even add prefixes or suffixes to numbers or words. These procedures are supported via a number of functions. The uses of each function are contrasted in the following table followed by common usage examples below.
Function | Description |
---|---|
paste() | Concatenate vectors after converting into characters |
format() | Adjust decimal places, justification, padding and width of string and whether to use scientific notation |
formatC() | A version of format() that is compliant with ‘C’ style formatting. |
sprintf() | A wrapper for the ‘C’ style formatting function of the same name; provides even greater flexibility (and complexity). |
paste()
Combine multiple elements together along with other character strings.
[1] "Quadrat:1" "Quadrat:2" "Quadrat:3"
[1] "A1:Q1" "A2:Q2" "B1:Q3" "B2:Q4" "C1:Q5" "C2:Q6" "D1:Q7" "D2:Q8"
[9] "E1:Q9" "E2:Q10"
## create a formula relating temperature to quadrat, site and shade
paste(names(data)[4], paste(names(data)[-4], collapse = "+"), sep = "~")
[1] "temperature~site+quadrats+shade"
[1] "temperature~site+quadrats+shade"
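The first two results above are consistent with calls like these; the originals are not shown, so treat them as a reconstruction. The last two lines show the related collapse = argument and the paste0() shorthand.

```r
site <- c("A1", "A2", "B1", "B2", "C1", "C2", "D1", "D2", "E1", "E2")
quadrats <- paste0("Q", 1:10)

# A single string is recycled against a longer vector
paste("Quadrat", 1:3, sep = ":")

# Element-wise concatenation of two vectors
paste(site, quadrats, sep = ":")

# collapse = joins all elements into a single string
paste(c("site", "quadrats", "shade"), collapse = "+")

# paste0() is paste() with sep = ""
paste0("Q", 1:3)
```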
format()
Overloaded generic function for formatting objects (particularly numeric vectors). The most prominent features include:
- Automatically adding leading or trailing spaces to create equal width labels (via trim =, width = and justify =)
- Application of scientific notation (via scientific =)
- Rounding of numbers (via digits = and nsmall =)
- Applies to each column in a dataframe separately
## create equal width strings by adding padding to the start (left
## side) of numbers
format(temperature)
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
"36.1" "30.6" "31.0" "36.3" "39.9" " 6.5" "11.2" "12.8" " 9.7" "15.9"
## create labels with a minimum of 2 digits to the right hand side of
## the decimal place
format(temperature, nsmall = 2)
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
"36.10" "30.60" "31.00" "36.30" "39.90" " 6.50" "11.20" "12.80" " 9.70" "15.90"
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
"36" "31" "31" "36" "40" " 6" "11" "13" "10" "16"
## create labels that are scientific representations of the numbers
format(temperature, scientific = TRUE)
Q1 Q2 Q3 Q4 Q5 Q6 Q7
"3.61e+01" "3.06e+01" "3.10e+01" "3.63e+01" "3.99e+01" "6.50e+00" "1.12e+01"
Q8 Q9 Q10
"1.28e+01" "9.70e+00" "1.59e+01"
## apply formatting rules to a dataframe (notice the left
## justification of Shade and the number of decimal places of
## temperature)
format(data, justify = "left", nsmall = 2)
site quadrats shade temperature
Q1 A1 Q1 no 36.10
Q2 A2 Q2 no 30.60
Q3 B1 Q3 no 31.00
Q4 B2 Q4 no 36.30
Q5 C1 Q5 no 39.90
Q6 C2 Q6 full 6.50
Q7 D1 Q7 full 11.20
Q8 D2 Q8 full 12.80
Q9 E1 Q9 full 9.70
Q10 E2 Q10 full 15.90
formatC()
Similar to the format() function, yet also allows ‘C’ style formatting specifications:
- ‘d’ for integers
- ‘f’ for reals in the standard xxx.xxx format
- ‘e’, ‘E’ for reals in the scientific (n.ddde+nn) format
- ‘g’, ‘G’ for reals in the scientific (n.ddde+nn) format when it saves space to do so
- ‘s’ for strings
[1] 3.141593 7856.337828 15709.534064 23562.730300 31415.926536
[1] "3" "7856" "15709" "23562" "31415"
[1] "3.14e+00" "7.86e+03" "1.57e+04" "2.36e+04" "3.14e+04"
## scientific notation only if it saves space
formatC(seq(pi, pi * 10000, length = 5), format = "g", digits = 2)
[1] "3.1" "7.9e+03" "1.6e+04" "2.4e+04" "3.1e+04"
## floating point format with 1000's indicators
formatC(seq(pi, pi * 10000, length = 5), format = "f", big.mark = ",", digits = 2)
[1] "3.14" "7,856.34" "15,709.53" "23,562.73" "31,415.93"
sprintf()
Similar to the format() function, yet also allows ‘C’ style formatting specifications:
- ‘d’ for integers
- ‘f’ for reals in the standard xxx.xxx format
- ‘e’, ‘E’ for reals in the scientific (n.ddde+nn) format
- ‘g’, ‘G’ for reals in the scientific (n.ddde+nn) format when it saves space to do so
- ‘s’ for strings
[1] 3.141593 7856.337828 15709.534064 23562.730300 31415.926536
[1] "3" "7856" "15710" "23563" "31416"
## format to two decimal places with a minimum total width of 6
## characters (right justified)
sprintf("%6.2f", PI)
[1] " 3.14" "7856.34" "15709.53" "23562.73" "31415.93"
[1] "3.141593e+00" "7.856338e+03" "1.570953e+04" "2.356273e+04" "3.141593e+04"
[1] " 3.1" "7.9e+03" "1.6e+04" "2.4e+04" "3.1e+04"
[1] "A1-Q1" "A2-Q2" "B1-Q3" "B2-Q4" "C1-Q5" "C2-Q6" "D1-Q7" "D2-Q8"
[9] "E1-Q9" "E2-Q10"
[1] "val=3.1" "val=7.9e+03" "val=1.6e+04" "val=2.4e+04" "val=3.1e+04"
[1] "val= 3.1" "val=7.9e+03" "val=1.6e+04" "val=2.4e+04" "val=3.1e+04"
[1] " val=3.1" "val=7.9e+03" "val=1.6e+04" "val=2.4e+04" "val=3.1e+04"
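Several of the sprintf() results above are consistent with the format strings sketched below. PI is assumed to be the same sequence used in the formatC() examples, and the format strings are reconstructions rather than the original code.

```r
site <- c("A1", "A2", "B1", "B2", "C1", "C2", "D1", "D2", "E1", "E2")
quadrats <- paste0("Q", 1:10)
PI <- seq(pi, pi * 10000, length.out = 5)   # assumed definition of PI

# No decimal places (values are rounded, not truncated)
sprintf("%.0f", PI)

# Two decimal places, padded on the left to a minimum width of 6
sprintf("%6.2f", PI)

# Scientific notation
sprintf("%e", PI)

# %s inserts strings; the vectors are combined element-wise
sprintf("%s-%s", site, quadrats)

# Literal text can be mixed with format specifications
sprintf("val=%.2g", PI)
```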
5.7 Applying functions repetitively
As R is a programming language, it naturally has constructs for controlling flow via looping and conditional evaluation. R’s basic control-flow constructs are the topic of another tutorial. Despite the enormous flexibility gained via the usual control-flow constructs, recall that as R is a scripting language (rather than a compiled language), it is relatively slow. In particular, repetitive tasks (such as looping through a dataframe and applying the same function to different subsets of the data) are especially inefficient.
There are a number of functions in R that are designed to allow the repetitive application of a function thereby replacing the need to write loops.
Function | Description |
---|---|
rep() | Duplicates the result of a function multiple times |
replicate() | Performs a function multiple times |
apply() | Repetitively apply a function over the margins of a matrix |
tapply() | Repetitively apply a function to cells made up of unique combinations of factor levels |
lapply() | Repetitively apply a function to the elements of a list or vector and return a list |
The replicate() function repeatedly performs the function specified in the second argument the number of times indicated by the first argument. The important distinction between the replicate() function and the rep() function described earlier is that the former repeatedly performs the function, whereas the latter performs the function only once and then duplicates the result multiple times.
Since most functions produce the same result each time they are performed, for many uses, both functions produce identical results. The one group of functions that do not produce identical results each time are those involved in random number generation. Hence, the replicate() function is usually used in conjunction with random number generators (such as runif(), which will be described in greater detail in a subsequent tutorial) to produce sets of random numbers. Consider first the difference between rep() and replicate():
[1] 0.2543982 0.2543982 0.2543982 0.2543982 0.2543982
[1] 0.5708402 0.3839931 0.1320909 0.3101110 0.3772342
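The contrast can be sketched as follows. The seed is an arbitrary addition for reproducibility, so the numbers drawn will differ from those printed above.

```r
set.seed(1)   # arbitrary seed, only so that the sketch is reproducible

# runif(1) is evaluated once and its single result is duplicated five times
duplicated_draw <- rep(runif(1), 5)
duplicated_draw

# runif(1) is evaluated five separate times
separate_draws <- replicate(5, runif(1))
separate_draws
```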
When the function being run within replicate() itself produces a vector of length > 1, the replicate() function combines each of the vectors together as separate columns in a matrix:
[,1] [,2] [,3] [,4] [,5]
[1,] 0.38092711 0.07538712 0.8895882 0.1030953 0.93951287
[2,] 0.14987665 0.65785499 0.3691097 0.9966423 0.88075174
[3,] 0.44044453 0.95326855 0.7094374 0.4323875 0.01900207
[4,] 0.08567467 0.53898123 0.2046870 0.3586476 0.42991762
[5,] 0.07394379 0.96047868 0.7087122 0.1363917 0.51966159
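A matrix of the shape shown above can be produced with a sketch like this one. The seed is arbitrary, so the values will differ from those printed.

```r
set.seed(1)   # arbitrary seed, only so that the sketch is reproducible

# Each of the 5 replicates returns a vector of 5 random numbers;
# the vectors become the columns of a 5 x 5 matrix
m <- replicate(5, runif(5))
m
dim(m)
```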
5.7.1 Apply functions along matrix margins
The apply() function applies a function to the margins (1=row margins and 2=column margins) of a matrix. For example, we might have a matrix that represents the abundance of three species of moth from three habitat types:
moth <- cbind(SpA = c(25, 6, 3), SpB = c(12, 12, 3), SpC = c(7, 2, 19))
rownames(moth) <- paste("Habitat", 1:3, sep = "")
moth
SpA SpB SpC
Habitat1 25 12 7
Habitat2 6 12 2
Habitat3 3 3 19
The apply() function could be used to calculate the column means (mean abundance of each species across habitat types):
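A sketch of the column means calculation follows, along with a row-wise counterpart (the row totals are an added illustration of margin 1).

```r
moth <- cbind(SpA = c(25, 6, 3), SpB = c(12, 12, 3), SpC = c(7, 2, 19))
rownames(moth) <- paste("Habitat", 1:3, sep = "")

# Margin 2 = columns: mean abundance of each species across habitats
col_means <- apply(moth, 2, mean)
col_means

# Margin 1 = rows: total abundance within each habitat
row_totals <- apply(moth, 1, sum)
row_totals
```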
5.7.2 Pivot tables
The tapply() function applies a function to a vector separately for each level of a factorial variable. For example, if we wanted to calculate the mean temperature for each level of the shade variable:
no full
34.78 11.22
## calculate the mean temperature per shade and quadrat number combination
## quadrat number is just the second character of the site vector
## extracted via substr(site, 2, 2)
tapply(temperature, list(shade, quadnum = substr(site, 2, 2)), mean)
quadnum
1 2
no 35.66667 33.45000
full 10.45000 11.73333
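The first result in this section (mean temperature per shade level) can be reproduced with the sketch below, using vectors rebuilt from their printed values.

```r
shade <- factor(rep(c("no", "full"), each = 5), levels = c("no", "full"))
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
names(temperature) <- paste0("Q", 1:10)

# Split temperature by the levels of shade and take the mean of each group
shade_means <- tapply(temperature, shade, mean)
shade_means
```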
5.7.3 Apply a function over a list
The lapply() and sapply() functions apply a function separately to each of the objects in a list and return a list and vector/matrix respectively. For example, to find out the length of each of the objects within the experiment list:
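A sketch of both calls follows, with the experiment list rebuilt from its printed contents. Note that the length of the coordinates matrix is its number of cells (10), not its number of rows.

```r
# Rebuild the experiment list from the values printed in this tutorial
quadrats <- paste0("Q", 1:10)
coordinates <- cbind(
  x = c(16.92, 24.03, 7.61, 15.49, 11.77),
  y = c(8.37, 12.93, 16.65, 12.20, 13.12)
)
rownames(coordinates) <- LETTERS[1:5]
temperature <- c(36.1, 30.6, 31.0, 36.3, 39.9, 6.5, 11.2, 12.8, 9.7, 15.9)
names(temperature) <- quadrats
experiment <- list(
  site = c("A1", "A2", "B1", "B2", "C1", "C2", "D1", "D2", "E1", "E2"),
  quadrats = quadrats,
  coordinates = coordinates,
  shade = factor(rep(c("no", "full"), each = 5), levels = c("no", "full")),
  temperature = temperature
)

# lapply() returns a list with one result per list item
lapply(experiment, length)

# sapply() simplifies the same results to a named vector
sapply(experiment, length)
```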
6 Packages
One of the great strengths of R is the ease with which it can be extended via the creation of new functions. This means that the functionality of the environment is not limited by the development priorities and economics of a commercial enterprise. Moreover, collections of related functions can be assembled together into what is called a package or library. These packages can be distributed to others to use or modify and thus the community and capacity grows.
One of the keys to the concept of packages is that they extend the functionality when it is required. Currently (2023), there are in excess of 5000 packages available on CRAN (Comprehensive R Archive Network) and an additional 3000 packages available via other sources. If all of that functionality was available simultaneously, the environment would be impaired by bloat. In any given session, the amount of extended functionality is likely to be relatively low, therefore it makes sense to only ‘load’ the functionality into memory when it is required.
The R environment comprises the core language itself (with its built in data, memory and control structures along with parsers, error handlers and built in operators and constants) along with any number of packages. Even on a brand new install of R there are some packages. These tend to provide crucial or common functions and as such many of them are automatically loaded at the start of an R session.
Packages in R can be conceptualised using an analogy of various dictionaries in paperback format. Each dictionary (perhaps English, Spanish and French) contains the definition of words that you can look up to understand their meaning. Analogously, when evaluating a statement, the R interpreter ‘looks up’ the definitions of functions and other objects from the collections of available sets of definitions.
In order to use your dictionary, you must first have purchased this dictionary, and upon doing so you tidily place it on the shelf with all your other books. Then each time you anticipate needing this dictionary, you take it off the shelf and place it conveniently on your desk. If you anticipate needing multiple dictionaries, then they are stacked one on top of the other on your desk (last collected off the shelf will be at the top of the stack).
In the case of R packages, a package is first installed whereby it is stored in your file system. When you anticipate needing the functions in a package, you load the package. This action places the package on the top of the search stack.
Note, in both the case of the dictionaries and the R packages, purchase (dictionary) and installation (R package) is only necessary once, however, retrieving to your desk (dictionary) and loading (R packages) is necessary for each session you intend to use the resource. To further extend the analogy, in both cases it is also recommended that your resource (dictionary or R packages) be updated from time to time so as to ensure they reflect the most modern understandings.
Finally, it is possible that the same word appears in multiple dictionaries (perhaps even with completely different definitions). When attempting to discover the definition of a word, you might settle on the first occurrence that you encounter, which would be the definition from the book closest to the top of the stack (since that is the order in which you search through your dictionaries). The same is true for R packages. The definition that R uses will be the definition from the package most recently added. If, however, you know which dictionary/R package to search in (the namespace), you can of course go straight to that source and avoid any ambiguity.
To see what packages are currently loaded in your session, enter the following:
A more general alternative to using the .packages() function is to use the search() function.
[1] ".GlobalEnv" "package:stats" "package:graphics"
[4] "package:grDevices" "package:utils" "package:datasets"
[7] "package:methods" "Autoloads" "package:base"
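Both calls can be sketched as follows. The exact contents returned depend on which packages your session has attached.

```r
# Names of the packages currently attached to the session
# (the value is returned invisibly, so store or print it explicitly)
attached <- .packages()
attached

# The full search path, which always starts with the workspace and ends with base
path <- search()
path
```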
Actually, the search() function returns the locations (search path) and order of where commands are searched for. For example, when you enter a command, the first place that R searches for this command (variable, function, constant, etc) is .GlobalEnv. .GlobalEnv is the current workspace and stores all the user created objects (such as variables, dataframes etc). If the object is not found in .GlobalEnv, the search continues within the next search location (in my case the stats package) and so on. When you load an additional package (such as the MASS package), this package (along with any other packages that it depends on) will be placed towards the start of the search queue. The logic being that if you have just loaded the package, then chances are you intend to use its functionality and therefore your statements will most likely be evaluated faster (because there is likely to be less to search through before locating the relevant objects).
[1] ".GlobalEnv" "package:MASS" "package:stats"
[4] "package:graphics" "package:grDevices" "package:utils"
[7] "package:datasets" "package:methods" "Autoloads"
[10] "package:base"
Indeed, issuing the library() function in this way simply adds the package to the search path. The detach() function removes a package from the search path. Removing a package from the search path when you know its functions are not going to be required for the rest of the session speeds up the evaluation of many statements (and therefore most routines) as the engine potentially has fewer packages to traverse whilst seeking objects.
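Attaching and detaching can be sketched as below. The guard around the calls is an addition for safety, on the assumption that MASS (a recommended package) may not be installed on every system.

```r
# Only run the example when MASS is actually installed
if (requireNamespace("MASS", quietly = TRUE)) {
  # Attach the package: it is inserted near the start of the search path
  library("MASS")
  print(search())

  # Remove it from the search path again
  detach("package:MASS")
}

# After detaching, the package no longer appears on the search path
"package:MASS" %in% search()
```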
6.1 Listing installed packages
The installed.packages() function tabulates a list of all the currently installed packages available on your system along with the package path (where it resides on your system) and version number. Additional fields can be requested (including “Priority”, “Depends”, “Imports”, “LinkingTo”, “Suggests”, “Enhances”, “OS_type”, “License” and “Built”).
In the above, I have intentionally suppressed the output so as not to flood the page (I have a very large number of packages installed on my machine).
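One way to keep the output manageable is to store the result and inspect only a few columns and rows, as in this sketch. The choice of columns and the use of head() are illustrative.

```r
# installed.packages() returns a character matrix with one row per package
pkgs <- installed.packages()

# Show only the name, location and version of the first few packages
head(pkgs[, c("Package", "LibPath", "Version")])

# Total number of installed packages
nrow(pkgs)
```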
Yet more information can be obtained for any single package with the packageDescription() and library functions - the latter provides all the information of the former and then includes a descriptive index of all the functions and datasets defined within the package.
Package: MASS
Priority: recommended
Version: 7.3-60.0.1
Date: 2024-01-12
Revision: $Rev: 3621 $
Depends: R (>= 4.0), grDevices, graphics, stats, utils
Imports: methods
Suggests: lattice, nlme, nnet, survival
Authors@R: c(person("Brian", "Ripley", role = c("aut", "cre", "cph"),
email = "ripley@stats.ox.ac.uk"), person("Bill", "Venables",
role = "ctb"), person(c("Douglas", "M."), "Bates", role =
"ctb"), person("Kurt", "Hornik", role = "trl", comment =
"partial port ca 1998"), person("Albrecht", "Gebhardt", role =
"trl", comment = "partial port ca 1998"), person("David",
"Firth", role = "ctb"))
Description: Functions and datasets to support Venables and Ripley,
"Modern Applied Statistics with S" (4th edition, 2002).
Title: Support Functions and Datasets for Venables and Ripley's MASS
LazyData: yes
ByteCompile: yes
License: GPL-2 | GPL-3
URL: http://www.stats.ox.ac.uk/pub/MASS4/
Contact: <MASS@stats.ox.ac.uk>
NeedsCompilation: yes
Packaged: 2024-01-13 12:39:26 UTC; ripley
Author: Brian Ripley [aut, cre, cph], Bill Venables [ctb], Douglas M.
Bates [ctb], Kurt Hornik [trl] (partial port ca 1998), Albrecht
Gebhardt [trl] (partial port ca 1998), David Firth [ctb]
Maintainer: Brian Ripley <ripley@stats.ox.ac.uk>
Repository: CRAN
Date/Publication: 2024-01-13 13:36:16 UTC
Built: R 4.3.3; x86_64-pc-linux-gnu; 2024-08-22 01:50:57 UTC; unix
-- File: /opt/R/4.3.3/lib/R/library/MASS/Meta/package.rds
6.2 Installing packages
The R community contains some of the brightest and most generous mathematicians, statisticians and practitioners, who continue to actively develop and maintain concepts and routines. Most of these routines end up being packaged as a collection of functions and then hosted on one or more publicly available sites so that others can benefit from their efforts.
The locations of collections of packages are called repositories or ‘repos’ for short. The four main repositories are CRAN, Bioconductor, R-Forge and Github. By default, R is only ‘tuned in’ to CRAN. That is, any package queries or actions pertain just to the CRAN repositories.
To get a tabulated list of all the packages available on CRAN (warning: there are many thousands of packages, so this will be a large table - I will suppress the output):
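A sketch of such a query, assuming an internet connection. The repository URL (the CRAN cloud redirector) and the object name avail are my own choices rather than the author's.

```r
# Query the package index of a CRAN mirror (requires network access)
avail <- available.packages(repos = "https://cloud.r-project.org")

# Number of packages on offer, and the first few names and versions
nrow(avail)
head(avail[, c("Package", "Version")])
```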
6.2.1 Comprehensive R Archive Network - CRAN
CRAN is a repository of R packages mirrored across 90 sites throughout the world. Packages are installed from CRAN using the install.packages() function. The first (and only mandatory) argument to the install.packages() function is the name of the package(s) to install (pkgs =). If no other arguments are provided, the install.packages() function will search CRAN for the specified package(s) and install them along with any of their dependencies that are not yet installed on your system.
Note, unless you have started the session with administrator (root) privileges, the packages will be installed within a path inside your home folder. Whilst this is not necessarily a bad thing, it does mean that the package is not globally available to all users on your system (not that it is common to have multiple users of a single system these days). Moreover, it means that R packages reside in multiple locations across your system. The packages that came with your R install will be in one location (or a couple of related locations) and the packages that you have installed will be in another location.
To see the locations currently used on your system, you can issue the following statement.
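The base function .libPaths() reports the library locations that R searches, in order; output of the form shown next is what it returns (the exact paths will differ on your system):

```r
# List the library paths R uses to find and install packages
.libPaths()
```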
[1] "/home/runner/work/_temp/Library" "/opt/R/4.3.3/lib/R/site-library"
[3] "/opt/R/4.3.3/lib/R/library"
To install a specific package (and its dependencies), pass its name to install.packages(). The package that I have chosen to demonstrate this with (remotes) enables R packages to be installed from git repositories (such as Github) and will be featured in a later subsection.
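In its simplest form (a sketch, requiring an internet connection) the call needs nothing more than the package name:

```r
# Install remotes and any missing dependencies from CRAN
install.packages("remotes")
```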
You will be prompted to select a mirror site. In the absence of any other criterion, just select the mirror that is closest geographically to you. The terminal will then provide feedback about the progress and status of the install process. By indicating a specific repository, you can avoid being prompted for a mirror. For example, I chose to use a CRAN mirror at Melbourne University (Australia), and therefore the following statement gives me direct access.
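A sketch of nominating the repository directly. The URL below is the CRAN cloud redirector, used here as a stand-in (my assumption); substitute the address of your preferred mirror.

```r
# Naming the repository up front avoids the interactive mirror prompt
install.packages("remotes", repos = "https://cloud.r-project.org")
```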
Finally, you could provide a vector of repository names if you were unsure which repository was likely to contain the package you were after. This can also be useful if your preferred mirror regularly experiences downtime - the alternative mirror (second in the vector) is used only when the first fails.
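A sketch with two repositories. Both URLs are illustrative choices of mine (the CRAN cloud redirector and the main CRAN site), and the vector names are optional labels.

```r
# Supply several repositories as a (named) character vector
my_repos <- c(
  primary  = "https://cloud.r-project.org",
  fallback = "https://cran.r-project.org"
)
install.packages("remotes", repos = my_repos)
```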
6.2.2 Bioconductor
Bioconductor is an open source and open development project devoted to genomic data analysis tools, most of which are available as R packages. Whilst initially the packages focused primarily on the manipulation and analysis of DNA microarrays, as the scope of the project has expanded, so too has the functional scope of the packages hosted there.
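Bioconductor packages are currently installed through the BiocManager package, which is itself obtained from CRAN. A sketch, assuming an internet connection and using limma purely as an example package (my choice, not the author's):

```r
# Install the Bioconductor installer from CRAN if it is not yet present
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install a single package from Bioconductor
BiocManager::install("limma")
```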
Multiple packages can also be installed from Bioconductor in a single call.
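A sketch assuming BiocManager is already installed; the package names are illustrative:

```r
# Pass a character vector to install several Bioconductor packages at once
BiocManager::install(c("limma", "edgeR"))
```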
6.2.3 R-Forge
Unlike both CRAN and Bioconductor (which are essentially package repositories), R-Forge is an entire R package development platform. Package development is supported through a range of services including:
- version control (SVN) - allowing multiple collaborators to maintain current and historical versions of files by facilitating simultaneous editing, conflict resolution and rolling back
- daily package checking and building - so packages are always up to date
- bug tracking and feature request tools
- mailing lists and message boards
- full backup and archival system
All of this is provided within a mature, content-management-system-like web environment. Installing packages from R-Forge is the same as it is for CRAN, except that the path of the root repository needs to be specified with the repos= argument.
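A sketch of the pattern. The package name is a hypothetical placeholder (replace it with a package actually hosted on R-Forge), and the repository URL is the R-Forge root.

```r
# "somepackage" is a placeholder name, not a real package
pkg <- "somepackage"

# Same call as for CRAN, but pointing repos at R-Forge
install.packages(pkg, repos = "https://R-Forge.R-project.org")
```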
6.2.4 Github (via remotes)
Github builds upon the philosophy of the development platform promoted by the SourceForge family (including R-Forge) by adding the ability to fork a project. Forking is when the direction of a project is split so that multiple new opportunities can be explored without jeopardizing the stability and integrity of the parent source. If the change in direction proves valuable, the project (package) can either become a new package or else feed back into the development of the original package.
Hadley Wickham and Co have yet again come up with a set of outrageously useful tools (the remotes package). This package is a set of functions that simplify (albeit slightly dictatorially) the processes of installing packages from remote and local repositories (Github, Gitlab, Bitbucket etc).
In order to make use of this package to install packages from Github, the remotes package must itself be installed (we did this earlier). It is recommended that this install take place from CRAN (as outlined above). Thereafter, the remotes package can be included in the search path and the install_github() function used to retrieve and install a nominated package or packages from Github.
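A sketch, assuming remotes is installed and an internet connection is available. The repository is given in user/repo form; tidyverse/ggplot2 is my choice of example.

```r
# Add remotes to the search path
library(remotes)

# Install a package straight from its Github repository (user/repo)
install_github("tidyverse/ggplot2")
```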
As described above, Github is a development platform and therefore it is also a source of ‘bleeding edge’ development versions of packages. Whilst the development versions are less likely to be as stable or even as statistically rigorous as the final release versions, they do offer the very latest ideas and routines. They provide the very latest snapshot of where the developers are currently at.
Most of the time users only want the stable release versions of a package. However there are times when having the ability to try out new developments as they happen can be very rewarding. The install_dev() function allows for the installation of the development version of a package.
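A sketch of its use. Only the package name is supplied, and remotes works out where the development sources live from the package's CRAN metadata. The package chosen is illustrative.

```r
# Install the development version of a package that is also on CRAN
remotes::install_dev("ggplot2")
```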
The more complex devtools package (also by Hadley Wickham et al) provides a set of functions that simplify (albeit slightly dictatorially) the processes of package authoring, building, releasing and installing. Within the devtools package, the dev_mode() function provides a switch that can be used to toggle your system in and out of development mode. When in development mode, installed packages are quarantined within a separate path (R-dev) to prevent them overriding or conflicting with the stable versions that are critical for your regular analyses.
## switch to development mode
devtools::dev_mode(on = TRUE)
## install the development version of ggplot2 (Github repositories are named as user/repo)
devtools::install_github("tidyverse/ggplot2")
## use the development version of ggplot2
library(ggplot2)
## switch development mode off
devtools::dev_mode(on = FALSE)
## stable version of ggplot2 is now engaged
6.2.5 Manual download and install
Packages are made available on the various repositories in compressed form, and the files differ between Windows, macOS and Linux versions. Those web repositories all have functionality for navigating or searching through the repositories for specific packages. The packages (compressed files) can be directly downloaded from these sites.
Additionally, some packages are not available on the various repositories, and firewalls and proxies can sometimes prevent R from accessing the repositories directly. In these cases, packages must be manually downloaded and installed.
There are a number of ways to install a package that resides locally. Note, do not uncompress the packages.
- From the command line (outside of R), using R CMD INSTALL packagename, where packagename is replaced by the path and name of the compressed package.
- Using the install.packages() function by specifying repos = NULL, as in install.packages("packagename", repos = NULL), where packagename is replaced by the path (if not in the current working directory) and name of the compressed package.
- Via the Windows RGui, select the Install package(s) from local zip files… option of the Packages menu and select the compressed package.
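The install.packages() route can be sketched as follows. The file path is a hypothetical placeholder for a source package you have downloaded, and the file.exists() guard is my own addition so that the snippet does nothing when the file is absent.

```r
# Hypothetical path to a downloaded (still compressed) source package
pkg_file <- "path/to/somepackage_1.0.0.tar.gz"

# repos = NULL tells R to install from the local file, not a repository
if (file.exists(pkg_file)) {
  install.packages(pkg_file, repos = NULL, type = "source")
}
```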
6.3 Updating packages
An integral component of package management is being able to maintain an up to date system. Many packages are regularly updated so as to adopt new ideas and functionality. Indeed, it is the speed of functional evolution that sets R apart from most other statistical environments.
Along with the install.packages() function, there are three other functions to help manage and maintain the packages on your system.
- old.packages() compares the versions of packages you have installed with the versions of those packages available in the current repositories. It tabulates the names, install paths and versions of old packages on your system. Alternative repositories (than CRAN) can be indicated via the repos = argument.
- new.packages() provides a tabulated list of all the packages on the repository that are not yet in your local install. Note, with many thousands of packages available on CRAN, unless the repos= parameter is pointing to somewhere very specific (and with a narrow subset of packages) this function is rarely of much use.
- update.packages() downloads and installs newer versions of the packages identified as ‘old’ by the old.packages() function. Just like old.packages(), alternative or multiple repositories can be specified.
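A sketch of checking for, and then applying, updates. It assumes an internet connection; the repository URL and object names are my own choices.

```r
repo <- "https://cloud.r-project.org"

# Tabulate installed packages that have newer versions on the repository;
# old.packages() returns NULL when everything is up to date
old <- old.packages(repos = repo)
if (!is.null(old)) {
  print(old[, c("Package", "Installed", "ReposVer"), drop = FALSE])
}

# Update everything that is out of date without prompting for each package
update.packages(repos = repo, ask = FALSE)
```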
6.4 Package management (pak)
Package management can be a relatively complex task. These days packages are sourced from a variety and mixture of locations (CRAN, Github etc). Furthermore, most packages have a complex network of dependencies (that is, they depend on other packages). The fine folk over at RStudio have developed a package called pak that aims to provide a unified and simplified interface to package management.
This next-generation package installer offers several key advantages for the technical R user:
- Parallel downloads: pak leverages multi-core processing to download multiple packages simultaneously, significantly reducing installation time.
- Intelligent dependency resolution: pak automatically resolves package dependencies, installing the necessary versions in the correct order, ensuring a seamless experience.
- Expanded package sources: pak supports installation from diverse repositories like Bioconductor and even GitHub URLs, providing access to a broader range of cutting-edge tools.
- Fine-grained control: pak lets you specify package sources and versions explicitly, offering greater control over your R environment.
- Extensible architecture: pak exposes an API for building custom extensions and integrating seamlessly with your data science workflows.
Before we can take advantage of pak package management, it must first be installed from CRAN using the traditional package installation method.
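A sketch of that one-off installation (requires an internet connection). The requireNamespace() guard is my addition so that nothing is reinstalled when pak is already present.

```r
# Install pak from CRAN only if it is not already available
if (!requireNamespace("pak", quietly = TRUE)) {
  install.packages("pak")
}
```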
6.4.1 Dependencies
For any given package, we can see the dependencies. To illustrate, I will focus on the Matrix package.
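The call that generated the output below is not shown; pak::pkg_deps() returns a table of this shape, so a sketch (assuming pak is installed and a network connection is available) would be:

```r
# Look up the dependencies of Matrix; returns one row per package
pak::pkg_deps("Matrix")
```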
ℹ Loading metadata database
✔ Loading metadata database ... done
# A data frame: 2 × 37
ref type direct directpkg status package version license needscompilation
<chr> <chr> <lgl> <lgl> <chr> <chr> <chr> <chr> <lgl>
1 lattice stan… FALSE FALSE OK lattice 0.22-6 GPL (>… TRUE
2 instal… inst… FALSE TRUE OK Matrix 1.6-5 GPL (>… TRUE
# ℹ 28 more variables: priority <chr>, md5sum <chr>, sha256 <chr>,
# filesize <int>, built <chr>, platform <chr>, rversion <chr>,
# repotype <chr>, repodir <chr>, target <chr>, deps <list>, mirror <chr>,
# sources <list>, remote <list>, error <list>, metadata <list>,
# dep_types <list>, params <list>, sysreqs <chr>, os_type <chr>,
# cache_status <chr>, sysreqs_packages <list>, sysreqs_pre_install <chr>,
# sysreqs_post_install <chr>, sysreqs_install <chr>, lib_status <chr>, …
After some database checking, the above function returns a tibble (like a data frame, yet with some special properties that include truncated output) containing a row for each dependency. In this example, the tibble has just two rows (one for the Matrix package, and the other for its only dependency, the lattice package). To save space, the many columns have been truncated, yet are listed below the tibble.
Alternatively, we could view the dependencies as a tree.
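pak provides pkg_deps_tree() for this purpose; a sketch under the same assumptions (pak installed, network available):

```r
# Draw the dependency tree of Matrix, annotated with update status
pak::pkg_deps_tree("Matrix")
```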
Matrix 1.6-5 < 1.7-0 [old]
└─lattice 0.22-5 -> 0.22-6 [upd][bld][cmp][dl] (598.58 kB)
Key: [upd] update | [old] outdated | [dl] download | [bld] build | [cmp] compile
We can see from the above that the Matrix package depends on the lattice package.
6.4.2 Installing packages
To install a package:
- from CRAN or Bioconductor: just provide the package name as an argument
- from Github: provide the package name in the form of user/repo. You can also nominate a specific branch (user/repo@branch) or tag (user/repo@tag).
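Both forms can be sketched as follows, assuming pak is installed and a network connection is available. The packages named are illustrative choices of mine.

```r
# From CRAN (or Bioconductor): the bare package name is enough
pak::pkg_install("Matrix")

# From Github: give the repository as user/repo
pak::pkg_install("tidyverse/ggplot2")

# A branch or tag is appended after "@", for example "user/repo@branch"
```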
Similarly, pak::pkg_install() can be used for package updating. If the package has not yet been installed, it will be installed; if it has already been installed, it will instead be updated (unless it is already the most up to date version).
If the upgrade = TRUE argument is supplied, then all the dependencies will also be updated.
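A sketch of an update that also refreshes dependencies (same assumptions as before; the package is illustrative):

```r
# Install or update Matrix and bring its dependencies up to date too
pak::pkg_install("Matrix", upgrade = TRUE)
```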
6.4.3 Removing packages
Packages can be removed using the pak::pkg_remove() function.
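A sketch of removal. The package name is a hypothetical placeholder, and the check against installed.packages() is my own guard so that nothing is attempted when the package is absent.

```r
# "somepackage" is a placeholder; substitute a package you want to remove
pkg <- "somepackage"

# Only attempt removal if the package is actually installed
if (pkg %in% rownames(installed.packages())) {
  pak::pkg_remove(pkg)
}
```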
6.5 Namespaces
Early on in this tutorial, I presented a set of rules and recommendations for object naming. One recommendation that I stressed was to avoid using names for objects that are the names of common functions (like mean) so as to (hopefully) avoid conflicting with any of the functions built in to R.
Having made these recommendations, I will now say that R is not overly fragile and is sufficiently clever to enable it to resolve many naming conflicts. Object names are context specific (see also object overloading above).
When the name of an object is supplied that could be used to refer to multiple objects (for example, if you had created an object called mean there would be two objects named mean - your object and the inbuilt function), R first attempts to determine which object you are likely to have been referring to.
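A small sketch of this behaviour: when a name is used in a function call, R skips objects of that name that are not functions, so a data object called mean does not stop the built-in mean() from working.

```r
# Create a numeric vector that shares its name with a built-in function
mean <- c(2, 4, 6)

# In call position R looks for a function, so this still calls base::mean()
# on the vector named mean
mean(mean)

# Tidy up so the name refers only to the function again
rm(mean)
```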
Objects are defined and apply within certain contexts or namespaces. Namespaces define the context (environment) in which an object is available. Objects created within functions remain local to those functions. Hence if an object is created within a function, it is not available outside that function.
The namespace provides a context in which R should look for an object (such as a function). Functions defined within packages are available for use when the library is loaded. This is essentially adding the library's namespace to the list of contexts that R should search within when you confront it with an expression.
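A minimal sketch of function-local scope; the function and object names are made up for illustration.

```r
# local_value exists only while the function body is being evaluated
make_value <- function() {
  local_value <- 42
  local_value
}

make_value()

# Outside the function, no object called local_value has been created
exists("local_value")
```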
Alternatively, we can prefix the function name with the package name (its namespace) thereby explicitly indicating the context in which the function is defined and thus, the function will be found.
For example, let's say we wanted to create a sparse diagonal matrix (a matrix with values in the diagonals and blanks in the off diagonals). There is a function called Diagonal in the Matrix package. We could expose this function (and all others in the package) via the library function, or we could just prefix the function name with the package name.
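The two outcomes shown next correspond to calls along these lines. This is a sketch that assumes Matrix is installed but not attached; the try() wrapper is my addition so that the first call's error does not halt a script.

```r
# Without attaching Matrix, the bare name is not on the search path
try(Diagonal(3))

# Prefixing with the namespace finds the function without library(Matrix)
Matrix::Diagonal(3)
```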
Error in Diagonal(3): could not find function "Diagonal"
3 x 3 diagonal matrix of class "ddiMatrix"
[,1] [,2] [,3]
[1,] 1 . .
[2,] . 1 .
[3,] . . 1
Similarly, prefixing the namespace to the function name allows us to explicitly nominate exactly which function we want to use in the event that there are two functions of the same name in different packages.
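As a sketch, two of the recommended packages that ship with R, nnet and mgcv, each define a function called multinom (the choice of example is mine, and it assumes both packages are installed). The prefix makes it unambiguous which one is meant.

```r
# Two different functions share the name multinom
fit_fun    <- nnet::multinom   # fits multinomial log-linear models
family_fun <- mgcv::multinom   # a model family for use with mgcv::gam()

# Each one lives in the namespace of its own package
environmentName(environment(fit_fun))
environmentName(environment(family_fun))
```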