Basic statistical principles
1 Statistics
Statistics is a branch of mathematical sciences that relates to the collection, analysis, presentation and interpretation of data and is therefore central to most scientific pursuits. Fundamental to statistics is the concept that samples are collected and statistics are calculated to estimate populations and their parameters.
Statistical populations can represent natural biological populations (such as the Victorian koala population), although more typically they reflect somewhat artificial constructs (e.g. Victorian male koalas). A statistical population strictly refers to all the possible observations from which a sample (a subset) can be drawn and is the entity about which you wish to make conclusions.
The population parameters are the characteristics (such as population mean, variability etc) of the population that we are interested in drawing conclusions about. Since it is usually not possible to observe an entire population, the population parameters must be estimated from corresponding statistics calculated from a subset of the population known as a sample (e.g. sample mean, variability etc). Provided the sample adequately represents the population (is sufficiently large and unbiased), the sample statistics should be reliable estimates of the population parameters of interest.
It is primarily for this reason that most statistical procedures impose certain sampling and distributional assumptions on the collected data. For example, most statistical tests assume that the observations have been drawn randomly from populations (to maximize the likelihood that the sample will truly represent the population). Additional terminology fundamental to the study of ecological statistics is listed in the following table (in which the examples pertain to a hypothetical research investigation into estimating the protein content of koala milk).
Term | Definition | Example |
---|---|---|
Measurement | A single piece of recorded information reflecting a characteristic of interest (e.g. length of a leaf, pH of a water aliquot, mass of an individual, number of individuals per quadrat etc) | Protein content of the milk of a single female koala |
Observation | A single measured sampling or experimental unit (such as an individual, a quadrat, a site etc) | A small quantity of milk from a single koala |
Population | All the possible observations that could be measured and the unit about which you wish to draw conclusions (note a statistical population need not be a viable biological population) | The milk of all female koalas |
Sample | The (representative) subset of the population that is observed | A small quantity of milk collected from 15 captive female koalas. Note that such a sample may not actually reflect the defined population. Rather, it could be argued that such a sample reflects captive populations. Nevertheless, such extrapolations are common when field samples are difficult to obtain. |
Variable | A set of measurements of the same type that comprise the sample. The characteristic that differs (varies) from observation to observation | The protein content of koala milk. |
In addition to estimating population parameters, various statistical functions (or statistics) are often calculated to express the relative magnitude of trends within and between populations. For example, the degree of difference between two populations is usually described in classic frequentist statistics by a t-statistic.
Another important concept in statistics is the idea of probability. The frequentist view of the probability of an event or outcome is the proportion of times that the event or outcome is expected to occur in the long-run (after a large number of repeated sampling events). For many statistical analyses, probabilities of occurrence are used as the basis for conclusions, inferences and predictions.
Consider the vague research question “How much do Victorian male koalas weigh?”. This could be interpreted as:
- How much do each of the Victorian male koalas weigh individually?
- What is the total mass of all Victorian male koalas added together?
- What is the mass of the typical Victorian male koala?
Arguably, it is the last of these questions that is of most interest. We might also be interested in the degree to which these weights differ from individual to individual and the frequency of individuals in different weight classes.
2 Probability theory
Probability (the chance of a particular outcome per event) can be considered from two different perspectives:
- as an objective representation of the relative frequency of times that the outcome occurs from a long series (infinite) of events. Hence, it can be calculated by counting the number of times that the outcome occurs (the frequency) divided (normalized) by the total number of events (the sample space) in which it could have occurred. In order to relate this back to a hypothesis, we typically estimate the expected frequencies of outcomes when the null hypothesis is true. We will return to why there is a focus on a null hypothesis rather than a hypothesis a little later.
- as a somewhat subjective representation of the uncertainty of an outcome. That is, how reasonable is an outcome given our previous understandings and the newly observed data.
These two approaches differ substantially in their interpretation of probability (long-run chances of outcomes under certain conditions vs degree of belief). This can be represented diagrammatically. In simple probability, the probability of an outcome (e.g. \(P(A)\)) is expressed relative to a broad sample space.
Description | Calculation |
---|---|
The probability of outcome A is the frequency of times outcome A occurs divided by the total number of times the outcome could occur (the sample space). The open symbols represent alternative outcomes. | \[\begin{align*} P(A) &= \frac{freq(A)}{freq(Total)}\\ P(A) &= \frac{5}{22}\\ &= 0.227 \end{align*}\] |
The probability of outcome B is the frequency of times outcome B occurs divided by the total number of times the outcome could occur (the sample space). | \[\begin{align*} P(B) &= \frac{freq(B)}{freq(Total)}\\ P(B) &= \frac{7}{22}\\ &= 0.318 \end{align*}\] |
The probability of both outcome A AND outcome B is the frequency of times outcome A AND outcome B both occur together divided by the total number of times the outcome could occur (the sample space). | \[\begin{align*} P(AB) &= \frac{freq(A\&B)}{freq(Total)}\\ P(AB) &= \frac{2}{22}\\ &= 0.091 \end{align*}\] |
Conditional probability on the other hand, establishes the probability of a particular event conditional to (given the occurrence of) another event and therefore alters the divisor sample space. The sample space is restricted to the occurrence of the unconditional outcome.
Description | Calculation |
---|---|
The probability of outcome A occurring given that outcome B also occurs (or has occurred) is the frequency of times that outcome A AND outcome B both occur divided by the frequency of times that outcome B occurs. The frequency of outcome B occurrences becomes the divisor. | \[\begin{align*} P(A|B) &= \frac{freq(A\&B)}{freq(B)}\\ P(A|B) &= \frac{2}{7}\\ &= 0.286 \end{align*}\] |
The above representation of conditional probability can be expressed completely in terms of probability
\[\begin{align*} P(A|B) &= \frac{freq(A\&B)}{freq(B)}\Leftrightarrow \frac{P(AB)\times freq(Total)}{P(B)\times freq(Total)}\\ &= \frac{P(AB)}{P(B)} \end{align*}\]
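The arithmetic above can be reproduced in a few lines of R. This is only a minimal sketch using the counts from the worked example; the variable names are illustrative and not part of the original material.

```r
# Counts taken from the worked example above
freq_total <- 22  # size of the sample space
freq_A     <- 5   # number of times outcome A occurs
freq_B     <- 7   # number of times outcome B occurs
freq_AB    <- 2   # number of times A and B occur together

# Simple probabilities (relative to the whole sample space)
p_A  <- freq_A / freq_total
p_B  <- freq_B / freq_total
p_AB <- freq_AB / freq_total

# Conditional probability, from counts and from probabilities
p_A_given_B_counts <- freq_AB / freq_B
p_A_given_B_probs  <- p_AB / p_B

round(c(p_A, p_B, p_AB, p_A_given_B_counts, p_A_given_B_probs), 3)
# 0.227 0.318 0.091 0.286 0.286
```

Both routes to \(P(A|B)\) agree because the total frequency cancels from the numerator and denominator.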
Most probability statements take place in the context of a hypothesis (or null hypothesis). For example, a frequentist probability is the probability of the data given that the null hypothesis is true. Hence, most inferential statistics involve conditional probability.
3 Distributions
The set of observations in a sample can be represented by a sampling or frequency distribution. A frequency distribution (or just distribution) represents how often observations in certain ranges occur. For example, how many male koalas in the sample weigh between 10 and 11kg, or how many weigh more than 12kg. Such a sampling distribution can also be expressed in terms of the probability (long-run likelihood or chance) of encountering observations within certain ranges.
Probability distributions are also known as density distributions and their mathematical representations are known as density functions. For discrete outcomes (integers, such as the number of eggs laid by female silver gulls [range from 0-8]), the density represents the frequency of a certain outcome (clutch size) divided by the total number of observations (examined clutches). The following figures represent the frequency (left) and density (right) of clutch sizes from 100 nests.
For continuous outcomes however, it is highly likely that all the observed (sample) values are unique (at least in theory) and therefore all outcomes have a frequency of exactly 1. As the sample size approaches infinity, the probability of any single point value therefore approaches zero.
So we instead break the continuum into small equal-sized chunks and calculate the frequency of values within each chunk or bin (akin to a histogram). To normalize these data such that the histogram represents an area of exactly one (necessary to be considered a probability distribution), we convert each frequency to a relative frequency (divide by the total number of observations) and then divide by the chunk width.
Clearly the accuracy of the density (probability) will depend on the size of the chunk selected. The smaller the chunk, the greater the accuracy. Alternatively, integrating the density function produces an exact solution. Probability from continuous distributions is thence based on areas under the density function and is undefined for a single point along the curve.
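The binning and normalization described above can be sketched in R as follows. The koala masses are simulated from a normal distribution with an assumed mean of 12kg and standard deviation of 1kg (hypothetical values chosen purely for illustration), and the chunk width of 0.5kg is arbitrary.

```r
set.seed(1)                              # arbitrary seed so the sketch is repeatable
mass <- rnorm(1000, mean = 12, sd = 1)   # simulated (hypothetical) koala masses in kg

# Break the continuum into equal-sized chunks and count the values in each
bin_width <- 0.5
breaks <- seq(floor(min(mass)), ceiling(max(mass)), by = bin_width)
h <- hist(mass, breaks = breaks, plot = FALSE)

# density = relative frequency / chunk width
dens <- (h$counts / sum(h$counts)) / bin_width
all.equal(dens, h$density)   # agrees with the density that hist() reports
sum(dens * bin_width)        # the total area of the histogram is 1

# Probability of a mass greater than 12kg
mean(mass > 12)                                     # proportion observed in the sample
pnorm(12, mean = 12, sd = 1, lower.tail = FALSE)    # area under the density function
integrate(dnorm, lower = 12, upper = Inf, mean = 12, sd = 1)$value  # same area by integration
```

The sample proportion only approximates the area under the curve, whereas integrating the density function gives the value implied by the assumed distribution.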
For example, the probability of encountering a male koala weighing more than 12kg is equal to the proportion of male koalas in the sample that weighed greater than 12kg. It is then referred to as a probability distribution.
When a frequency distribution can be described by a mathematical function, the probability distribution is a curve. The total area under this curve is defined as 1 and thus, the area under sections of the curve represents the probability of values falling in the associated interval. Note, for a continuous variable it is not meaningful to determine the probability of a single exact value (such as the probability of encountering a koala weighing exactly 12.183kg), only ranges of values. Well, it is possible, it is just that the probability is infinitesimally small and thus meaningless.
3.1 Continuous distributions
3.1.1 The normal (Gaussian) distribution
It has long been observed that the accumulation of a very large set of independent random influences tends to converge upon a central value (central limit theorem) and that the distribution of such accumulated values follows a specific “bell shaped” curve called a normal or Gaussian distribution. The normal distribution is a symmetrical distribution in which values close to the center of the distribution are more likely and progressively larger and smaller values are less commonly encountered.
\[f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\]
At first, this might appear to be a very daunting formula. It essentially defines the density (frequency) of any value of \(x\). The exact shape of the distribution is determined by just two parameters:
- \(\mu\) - the mean. This defines the center of the distribution, the location of the peak.
- \(\sigma^2\) - the variance (or \(\sigma\), the standard deviation) which defines the variability or spread of values around the mean.
Important properties of the Gaussian distribution:
- There is no relationship between the distribution's mean (location) and variance - they are independent of one another.
- It is symmetric and unbounded and thus defined for all real numbers in the range of (\(-\infty\), \(\infty\)).
- Governed by the central limit theorem
- averages tend to converge to a central limit
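As a quick check on the density formula and the central limit behaviour listed above, the following sketch writes the Gaussian density by hand (with the exponent \(-(x-\mu)^2/(2\sigma^2)\)) and compares it with R's built-in `dnorm()`. The function name `gauss_pdf` and the chosen parameter values are illustrative assumptions only.

```r
# Gaussian density written out by hand
gauss_pdf <- function(x, mu, sigma) {
  exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))
}

x <- seq(8, 16, by = 0.5)
all.equal(gauss_pdf(x, mu = 12, sigma = 1), dnorm(x, mean = 12, sd = 1))

# Central limit behaviour: means of samples drawn from a strongly skewed
# (exponential) distribution are themselves approximately Gaussian
set.seed(1)
sample_means <- replicate(5000, mean(rexp(50, rate = 1)))
c(mean = mean(sample_means), sd = sd(sample_means))  # close to 1 and 1/sqrt(50)
```

A histogram of `sample_means` (for example via `hist(sample_means)`) should appear roughly bell shaped even though the individual values are far from normally distributed.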
As many biological measurements (such as weights, lengths etc) are influenced by an almost infinite number of factors (many of which can be considered independent and random), many biological variables also follow a Gaussian distribution. The Gaussian distribution is particularly well suited for representing the distribution of variables whose values are either
- considerably larger (or smaller) than zero (e.g. koala mass) or
- have no theoretical limits (e.g. difference in masses between sibling fledglings)
Even discrete responses (such as counts that can only logically be positive integers) can occasionally be approximately described by a Gaussian distribution, particularly if either the samples are very large and the values free from boundary conditions (such as being close to a lower limit of 0), or else we are dealing with average counts.
Since many scientific variables behave according to the central limit theorem, many of the common statistical procedures have been specifically derived for (and thus assume) an underlying Gaussian distribution from which the data are drawn. Specifically, parameter estimation, inference and hypothesis tests from simple parametric tests (regression, ANOVA etc) assume that the residuals (stochastic, unexplained components of data) are normally distributed around a mean of zero. The reliability of such tests is dependent on the degree of conformity to this assumption of normality. Likewise, many other statistical elements rely on normal distributions, and thus the normal distribution (or variants thereof) is one of the most important mathematical distributions.
3.1.2 Log-normal distribution
Many biological variables have a lower limit of zero (at least in theory). For example, a koala cannot weigh less than 0kg or there cannot be less than 0mm of rain in a month. Such circumstances can result in asymmetrical distributions that are highly truncated towards the left with a long right tail.
In such cases, the mean and median present different values (the latter arguably more reflective of the ‘typical’ value). These distributions can often be described by a log-normal distribution. Furthermore, some variables do not naturally vary on a linear scale. For example, growth rates or chemical concentrations might naturally operate on logarithmic or exponential scales. Consequently, when such data are collected on a linear scale, they might be expected to follow a non-normal (perhaps log-normal) distribution.
\[f(x;\mu,\sigma) = \frac{1}{x\sigma\sqrt{2\pi}} e^{-\frac{(\ln x-\mu)^2}{2\sigma^2}}\]
As with the Gaussian distribution, the exact shape of the log-normal distribution is determined by just two parameters:
- \(\mu\) - the mean. This defines the center of the distribution, the location of the peak.
- \(\sigma^2\) - the variance (or \(\sigma\), the standard deviation) which defines the variability or spread of values around the mean.
However, \(\mu\) and \(\sigma^2\) are the mean and variance of \(ln(x)\) rather than \(x\).
Important properties of the log-normal distribution:
- The variance is related to the mean: on the original scale, the variance is proportional to the square of the mean (\(var(x) = (e^{\sigma^2}-1)\,mean(x)^2\))
- The log-normal distribution is skewed to the right as a result of being bounded at 0, yet unbounded to the right (\(0\), \(\infty\))
- Also governed by the central limit theorem, except that it describes the distribution of values that are the product (rather than sum) of a large number of independent random factors.
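The following sketch illustrates these properties using R's `dlnorm()` and `rlnorm()`. The parameter values (\(\mu=2\), \(\sigma=0.5\) on the log scale) are arbitrary assumptions and `lnorm_pdf` is a hypothetical helper written only to mirror the density formula (with the exponent \(-(\ln x-\mu)^2/(2\sigma^2)\)).

```r
mu <- 2        # hypothetical mean of ln(x)
sigma <- 0.5   # hypothetical standard deviation of ln(x)

# Log-normal density written out by hand
lnorm_pdf <- function(x, mu, sigma) {
  exp(-(log(x) - mu)^2 / (2 * sigma^2)) / (x * sigma * sqrt(2 * pi))
}
x <- c(0.5, 1, 5, 10, 20)
all.equal(lnorm_pdf(x, mu, sigma), dlnorm(x, meanlog = mu, sdlog = sigma))

# On the original scale the mean exceeds the median (long right tail)
mean_x   <- exp(mu + sigma^2 / 2)
median_x <- exp(mu)
var_x    <- (exp(sigma^2) - 1) * mean_x^2   # variance scales with the square of the mean
c(mean = mean_x, median = median_x, variance = var_x)

# Taking logs of log-normal values yields a Gaussian variable
set.seed(1)
y <- rlnorm(10000, meanlog = mu, sdlog = sigma)
c(mean(log(y)), sd(log(y)))   # close to mu and sigma
```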
3.1.3 t-distribution
The t-distribution, also known as Student’s t-distribution, is a probability distribution that is similar to the standardised normal distribution (mean of 0, standard deviation of 1), however it is better suited to smaller sample sizes. It is characterized by a symmetric, bell-shaped curve with heavier tails than the normal distribution, making it suitable for modeling data that contain more extreme values than a normal distribution would predict. The t-distribution is often employed in hypothesis testing and confidence interval estimation when dealing with small sample sizes or when the population standard deviation is unknown. Because of its heavier tails, it also provides a more robust alternative to the normal distribution when the underlying data include outliers.
\[ f(t; \mu, \nu) = \frac{\Gamma(\frac{\nu + 1}{2})}{\sqrt{\pi \nu} \Gamma(\frac{\nu}{2})} \left( 1 + \frac{(t - \mu)^2}{\nu} \right)^{-\frac{\nu + 1}{2}} \]
Where:
- \(f(t; \mu, \nu)\) represents the probability density function of the t-distribution.
- \(\mu\) is the location parameter (mean).
- \(\nu\) is the degrees of freedom parameter, which controls the shape of the distribution (fatter tails with lower \(\nu\)).
- \(\Gamma(⋅)\) is the gamma function.
This formula resembles the Gaussian distribution but includes an additional term involving the degrees of freedom and a different power in the exponent. This difference reflects the heavier tails of the t-distribution compared to the bell-shaped normal distribution. This formula describes the shape of the t-distribution, which converges to the standard normal distribution as the degrees of freedom increase.
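The convergence towards the standard normal distribution can be seen numerically. The sketch below writes the density out by hand (`t_pdf` is an illustrative name, with the location fixed at 0 by default) and compares tail areas and critical values across several arbitrarily chosen degrees of freedom.

```r
# t density written out by hand (location mu defaults to 0)
t_pdf <- function(t, nu, mu = 0) {
  gamma((nu + 1) / 2) / (sqrt(pi * nu) * gamma(nu / 2)) *
    (1 + (t - mu)^2 / nu)^(-(nu + 1) / 2)
}
tt <- seq(-4, 4, by = 0.5)
all.equal(t_pdf(tt, nu = 5), dt(tt, df = 5))

# Heavier tails: the probability of exceeding 2 shrinks towards the
# normal value as the degrees of freedom increase
tail_t <- sapply(c(2, 5, 30, 1000), function(nu) pt(2, df = nu, lower.tail = FALSE))
tail_norm <- pnorm(2, lower.tail = FALSE)
round(c(tail_t, normal = tail_norm), 4)

# Critical values for a two-tailed 95% interval approach qnorm(0.975)
qt(0.975, df = c(5, 30, 1000))
qnorm(0.975)
```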
3.1.4 Gamma distribution
The Gamma distribution describes the distribution of waiting times until a specific number of independent events (typically deaths) have occurred. For example, if the average mortality rate is one individual per five days (rate=1/5 or scale=5), then a Gamma distribution could be used to describe the distribution of expected waiting time before 10 individuals were dead.
There are two parameterizations of the Gamma distribution:
- in terms of shape (\(k\)) and scale (\(\theta\))
\[ f(x;k,\theta) = \frac{1}{\theta^k}\frac{1}{\Gamma(k)}x^{k-1}e^{-\frac{x}{\theta}}\\ \text{for}~x\gt 0~\text{and}~k,\theta\gt 0 \]
- in terms of shape (\(\alpha\)) and rate (\(\beta\))
\[ f(x;\alpha,\beta) = \beta^\alpha\frac{1}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\\ \text{for}~x\gt 0~\text{and}~\alpha,\beta\gt 0 \]
In addition to being used to describe the distribution of waiting times, the gamma distribution can also be used as an alternative to the normal distribution when data (residuals) are skewed with a long right tail, such as when there is a relationship between mean and variance. When such data are modeled with a normal distribution, illogical negative predicted values can occur. Such values are not possible from a Gamma distribution.
The Gamma distribution is also an important conjugate prior for the precision (the inverse of the variance) of a normal distribution in Bayesian modeling. Important properties of the Gamma distribution:
- The shape parameter defines the number of events (for example, 10 deaths) and can technically be any positive number.
- shape values less than 1, the gamma distribution has a mode of 0
- shape values equal to 1, the gamma distribution is equivalent to the exponential distribution
- shape values greater than 1, the distribution becomes increasingly more symmetrical and approaches a normal distribution when the shape parameter is large.
- The scale or rate (rate=1/scale) parameter defines how often (scale) or the rate at which events are expected to occur
- The variance is related to the mean (\(variance=mean\times scale\), equivalently \(variance=\frac{mean^2}{shape}\))
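The waiting-time example (10 deaths at an average of one death per five days) can be sketched with R's `dgamma()` and `pgamma()`, which accept either a `scale` or a `rate` argument. The helper `gamma_pdf` and the 60-day threshold are illustrative assumptions.

```r
shape <- 10         # number of events (deaths) being waited for
scale <- 5          # average of five days per event
rate  <- 1 / scale  # equivalently, 0.2 events per day

x <- c(10, 30, 50, 80)

# The two parameterizations describe the same distribution
all.equal(dgamma(x, shape = shape, scale = scale),
          dgamma(x, shape = shape, rate = rate))

# Density written out by hand in the shape/scale form
gamma_pdf <- function(x, k, theta) {
  x^(k - 1) * exp(-x / theta) / (theta^k * gamma(k))
}
all.equal(gamma_pdf(x, shape, scale), dgamma(x, shape = shape, scale = scale))

# Mean and variance of the waiting time
mean_wait <- shape * scale     # 50 days
var_wait  <- shape * scale^2   # 250 = mean * scale = mean^2 / shape

# Probability that the tenth death occurs within 60 days
pgamma(60, shape = shape, scale = scale)

# With a shape of 1 the Gamma reduces to the exponential distribution
all.equal(dgamma(x, shape = 1, rate = rate), dexp(x, rate = rate))
```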
3.1.5 Uniform distribution
The uniform distribution describes a rectangular (flat) distribution within a specific range.
\[f(x;a,b) = \begin{cases} \frac{1}{b-a} & \text{for } a \leq x \leq b,\\[1em] 0 & \text{for } x \lt a \text{ or } x \gt b \end{cases} \]
Important properties of the uniform distribution:
- Has a constant probability density within the range \(a\le x\le b\) of \(\frac{1}{b-a}\) and zero outside of this range
- Whilst this distribution is rarely employed in frequentist statistics, it is occasionally used as a flat (vague) prior distribution in Bayesian modeling.
3.1.6 Exponential distribution
The exponential distribution describes the distribution of waiting times for the occurrence of a single discrete event (such as an individual death) given a constant rate (probability of occurrence per unit of time) - for example, describing longevity or the time elapsed between events (such as whale sightings). It is also useful for describing the distribution of measurements that naturally attenuate (exponentially) such as light levels penetrating to increasing water depths.
\[f(x;\lambda) = \lambda e^{-\lambda x}\]
The exponential distribution is defined by a single parameter:
- \(\lambda\) - the rate. The rate at which the event is expected to occur. The larger the rate, the steeper the curve.
Important properties of the exponential distribution:
- It is bounded by 0 on the left and limitless on the right (\(0\), \(\infty\)).
- The mean and variance are both related to the rate (\(variance=\frac{1}{\lambda^2}\), \(mean=\frac{1}{\lambda}\))
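A brief sketch of these properties using `dexp()`, `rexp()` and `pexp()`. The rate of one event per five days is a hypothetical value chosen for illustration.

```r
lambda <- 1 / 5   # hypothetical rate: one event per five days

# Density written out by hand matches dexp()
x <- c(0, 1, 5, 10)
all.equal(lambda * exp(-lambda * x), dexp(x, rate = lambda))

# Simulated waiting times: mean near 1/lambda, variance near 1/lambda^2
set.seed(1)
w <- rexp(100000, rate = lambda)
c(mean = mean(w), var = var(w))   # close to 5 and 25

# Probability of waiting no more than five days: 1 - exp(-lambda * 5)
pexp(5, rate = lambda)
```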
3.1.7 Beta distribution
The beta distribution describes the probability of success in a binomial trial and is defined on a continuous range that is bounded at both ends (\(0-1\)). As it operates in the range of \(0-1\), it is ideal for modeling proportions and percentages. However, it is also useful for modeling other continuous quantities on a finite scale. In such cases, the values are transformed (see Transformations) from the arbitrary finite scale to the \(0-1\) scale, modeled with a beta distribution, and finally the parameters are back-transformed onto the original scale.
\[f(x;a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}(1-x)^{b-1}\]
The beta distribution is defined by two shape parameters:
- \(a\) - shape parameter 1. Number of successes in binomial trial (\(a-1\))
- \(b\) - shape parameter 2. Number of failures in binomial trial (\(b-1\))
The beta distribution is also a conjugate prior for binomial, Bernoulli and geometric distributions. Important properties of the beta distribution:
- It is bounded by 0 on the left and 1 on the right (\(0\), \(1\)).
- When \(a=b\), the distribution is symmetric about \(x=0.5\)
- When \(a=b=1\), the distribution is a uniform distribution over the range 0 to 1.
- The location of the peak shifts towards 0 when \(a<b\) and towards 1 when \(a>b\).
- The variance of the distribution decreases as the total of \(a+b\) (the number of trials) increases.
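These properties can be checked with `dbeta()`. In the sketch below, `beta_pdf` and `beta_var` are illustrative helper names, and the shape values are arbitrary; the variance expression \(ab/((a+b)^2(a+b+1))\) is the standard result for the beta distribution rather than something stated in the text above.

```r
x <- seq(0.1, 0.9, by = 0.2)

# Beta density written out by hand
beta_pdf <- function(x, a, b) {
  gamma(a + b) / (gamma(a) * gamma(b)) * x^(a - 1) * (1 - x)^(b - 1)
}
all.equal(beta_pdf(x, 3, 5), dbeta(x, shape1 = 3, shape2 = 5))

# a = b = 1 is the uniform distribution on 0 to 1
all.equal(dbeta(x, 1, 1), dunif(x, min = 0, max = 1))

# a = b gives a distribution that is symmetric about 0.5
all.equal(dbeta(x, 4, 4), dbeta(1 - x, 4, 4))

# The variance shrinks as a + b increases
beta_var <- function(a, b) a * b / ((a + b)^2 * (a + b + 1))
c(beta_var(2, 2), beta_var(20, 20))

# Percentages must first be rescaled to proportions (0 to 1)
pct  <- c(12, 50, 87)   # hypothetical percentage observations
prop <- pct / 100
```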
3.2 Discrete distributions
3.2.1 Binomial distribution
The binomial distribution describes the number of ‘successes’ out of a total of \(n\) independent trials each with a set probability. On any given trial, only two outcomes (binary) are possible (0 and 1) - that is, it is a Bernoulli trial. Importantly, the binomial distribution is bounded at both ends - zero to the left and the trial size on the right. Typical binomial examples include:
- the number of surviving individuals from a pool of individuals
- the number of infected individuals from a pool of individuals
- the number of items of a particular class (e.g. males) from a pool of items
\[f(x;n,p) = \left(\begin{array}{c} n\\x \end{array}\right)p^{x}(1-p)^{n-x}\]
The binomial distribution is defined by two parameters:
- \(n\) - the total number of trials
- \(p\) - the probability of success on any given trial. Defined as any real number between 0 and 1.
The \(\left(\begin{array}{c} n\\x \end{array}\right)\) component is a normalizing constant that defines the number of ways of drawing \(x\) items out of \(n\) trials and also ensures that all probabilities add up to 1.
Important properties of the binomial distribution:
- It is bounded by 0 on the left and by \(n\) (the number of trials/individuals/quadrats etc) on the right (\(0\), \(n\)).
- Variance is proportional to \(n\) and related to the mean in that the larger the sample size, the larger the variance.
- Variance is greatest when \(p=0.5\) and decreases as \(p\) approaches 0 or 1.
- When \(n\) is large and \(p\) is away from 0 or 1, the binomial distribution approaches a normal distribution
- When \(n\) is large and \(p\) is small, the binomial distribution approaches a Poisson distribution
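The density function and the listed properties can be verified with `dbinom()`. The trial size and probability below (20 trials, \(p=0.3\)) are arbitrary illustrative values, as are those used for the Poisson approximation.

```r
n <- 20    # hypothetical number of trials
p <- 0.3   # hypothetical probability of success
x <- 0:n

# Density written out by hand matches dbinom()
all.equal(choose(n, x) * p^x * (1 - p)^(n - x), dbinom(x, size = n, prob = p))

# The probabilities of all possible outcomes sum to 1
sum(dbinom(x, size = n, prob = p))

# Mean and variance computed from the distribution equal n*p and n*p*(1-p)
mean_x <- sum(x * dbinom(x, n, p))
var_x  <- sum((x - mean_x)^2 * dbinom(x, n, p))
all.equal(c(mean_x, var_x), c(n * p, n * p * (1 - p)))

# Large n and small p: close to a Poisson distribution with lambda = n*p
max(abs(dbinom(0:20, size = 1000, prob = 0.005) - dpois(0:20, lambda = 5)))
```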
3.2.2 Poisson distribution
The Poisson distribution describes the number (counts) of independent discrete items or events (individuals, times, deaths) recorded for a given effort. The Poisson distribution is defined by a single parameter (\(\lambda\)) that describes the expected count (mean) as well as the variance in count. The Poisson distribution is bounded at the lower end by zero, yet theoretically unbounded at the upper end (\(0\),\(\infty\)).
The Poisson distribution is particularly appropriate for modeling count data as they are always truncated at zero, have no upper limit and tend to get more variable with increasing mean.
\[f(x;\lambda) = \frac{e^{-\lambda}\lambda^x}{x!}\]
The Poisson distribution is defined by a single parameter:
- \(\lambda\) - the expected value
Important properties of the Poisson distribution:
- It is bounded by 0 on the left and unbounded on the right (\(0\), \(\infty\)).
- Mean and variance are both equal to \(\lambda\).
- Assumes that the ratio of variance to mean (Dispersion) is 1 (\(D=\frac{var}{mean}=1\))
- When \(\lambda\) is large, the Poisson distribution approaches a normal distribution
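A short sketch of these properties using `dpois()`, `rpois()` and `ppois()`. The expected counts (4 and 100) are arbitrary illustrative values, and the normal comparison uses a continuity correction of 0.5, which is a common convention assumed here rather than taken from the text.

```r
lambda <- 4   # hypothetical expected count
x <- 0:10

# Density written out by hand matches dpois()
all.equal(exp(-lambda) * lambda^x / factorial(x), dpois(x, lambda = lambda))

# Simulated counts: mean and variance both near lambda, dispersion near 1
set.seed(1)
counts <- rpois(100000, lambda = lambda)
c(mean = mean(counts), var = var(counts), D = var(counts) / mean(counts))

# With a large lambda the Poisson is close to a normal distribution
# with mean lambda and standard deviation sqrt(lambda)
ppois(110, lambda = 100)
pnorm(110.5, mean = 100, sd = sqrt(100))
```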
3.2.3 Negative binomial distribution
The negative binomial distribution describes the number of failures in a sequence of independent Bernoulli trials (each with a set probability of success) before a specified number (\(n\)) of successes is obtained. The negative binomial is a useful alternative to the Poisson distribution for modeling count data for which the variance is greater than the mean (i.e. overdispersed, particularly when caused by a heterogeneous/patchy/clumped response). The negative binomial distribution is bounded at the lower end by zero, yet theoretically unbounded at the upper end (\(0\),\(\infty\)).
There are two parameterizations of the negative binomial distribution:
- in terms of the size (\(n\)) and probability (\(p\))
\[f(x;n,p) = \frac{(n+x-1)!}{(n-1)!x!}p^{n}(1-p)^x\]
- \(n\) - the number of successes to occur before stopping the count of failures. \(n\) acts as a stopping point in that the number of failures are counted until \(n\) successes are encountered.
- \(p\) - the probability of success of any single trial.
- in terms of the mean (\(\mu=n(1-p)/p\)) and overdispersion parameter or scaling factor (\(\omega\)). This parameterization is more meaningful in ecology.
\[f(x;\mu,\omega) = \frac{\Gamma(\omega+x)}{\Gamma(\omega)x!}\frac{\mu^x\omega^\omega}{(\mu+\omega)^{x+\omega}}\]
- \(\mu\) - the mean (expected number of failures).
- \(\omega\) - the dispersion or scaling factor.
Important properties of the negative binomial distribution:
- It is bounded by 0 on the left and unbounded on the right (\(0\), \(\infty\)).
- The variance is related to the mean (\(\sigma^2=\mu+\mu^2/\omega\)) - variance increases with increasing mean.
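Both parameterizations are available through R's `dnbinom()` (via `size` with either `prob` or `mu`). The sketch below assumes arbitrary values of \(n=3\) and \(p=0.4\); `nb_pdf` and `nb_pdf_mu` are illustrative helpers that write out the two density forms, with \((\mu+\omega)^{x+\omega}\) in the denominator of the second.

```r
n <- 3     # hypothetical number of successes at which counting stops
p <- 0.4   # hypothetical probability of success on any trial
x <- 0:15  # numbers of failures

# Size/probability form
nb_pdf <- function(x, n, p) {
  factorial(n + x - 1) / (factorial(n - 1) * factorial(x)) * p^n * (1 - p)^x
}
all.equal(nb_pdf(x, n, p), dnbinom(x, size = n, prob = p))

# Mean/dispersion form, with mu = n(1 - p)/p and omega = n
mu    <- n * (1 - p) / p   # 4.5
omega <- n
nb_pdf_mu <- function(x, mu, omega) {
  gamma(omega + x) / (gamma(omega) * factorial(x)) *
    mu^x * omega^omega / (mu + omega)^(x + omega)
}
all.equal(nb_pdf_mu(x, mu, omega), dnbinom(x, size = omega, mu = mu))

# The variance exceeds the mean: mu + mu^2/omega
set.seed(1)
y <- rnbinom(100000, size = omega, mu = mu)
c(simulated = var(y), expected = mu + mu^2 / omega)   # expected is 11.25
```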
4 Scale transformations
The above section on distributions illustrates the main distributions that are useful in ecology. Provided data have been collected in an unbiased manner and from well defined populations, data usually follow one of the above distributions. When data do not conform well to one of the above distributions, it is often possible to transform the scale of those data so that they may be better approximated by one of these distributions. For example, data measured on a percentage scale of 0 to 100 could be easily transformed into a scale of 0-1 (for a beta distribution), by dividing the observations by 100.
Essentially, data transformation is the process of converting the scale in which the observations were measured into another scale. I will demonstrate the principles of data transformation with two simple examples. Firstly, to illustrate the legitimacy and commonness of data transformations, imagine you had measured water temperature in a large number of streams. Let’s assume that you measured the temperature in \(\,^{\circ}\mathrm{C}\). Supposing later you required the temperatures be in \(\,^{\circ}\mathrm{F}\). You would not need to re-measure the stream temperatures. Rather, each of the temperatures could be converted from one scale (\(\,^{\circ}\mathrm{C}\)) to the other (\(\,^{\circ}\mathrm{F}\)). Such transformations are very common.
Imagine now that a botanist wanted to examine the leaf size of a particular species. The botanist decides to measure the length of a random selection of leaves using a standard linear, metric ruler and the distribution of sample observations is illustrated in the upper left hand figure of the following.
The growth rate of leaves might be expected to be greatest in small leaves and decelerate with increasing leaf size. That is, the growth rate of leaves might be expected to be logarithmic rather than linear. As a result, the distribution of leaf sizes using a linear scale might also be expected to be non-normal (log-normal). If, instead of using a linear scale, the botanist had used a logarithmic ruler, the distribution of leaf sizes may have been more like that depicted in the figure in the upper right corner.
If the distribution of observations is determined by the scale used to measure the observations, and the choice of scale (in this case the ruler) is somewhat arbitrary (a linear scale is commonly used because we find it easier to understand), then it is justifiable to convert the data from one scale to another after the data have been collected and explored. It is not necessary to re-measure the data in a different scale. Therefore, to normalize the data, the botanist can simply convert the data to logarithms.
The important points in the process of transformations are:
- The order of the data has not been altered (a large leaf measured on a linear scale is still a large leaf on a logarithmic scale), only the spacing of the data has changed
- Since the spacing of the data is purely dependent on the scale of the measuring device, there is no reason why one scale is more correct than any other scale
- For the purpose of normalization, data can be converted from one scale to another
The purpose of scale transformation is purely to normalize the data so as to satisfy the underlying assumptions of a statistical analysis. As such, it is possible to apply any function to the data. Nevertheless, certain data types respond more favourably to certain transformations due to characteristics of those data types. Common transformations into an approximate normal distribution as well as the R syntax are provided in the following table.
Nature of the data | Transformation | R syntax |
---|---|---|
Measurements (lengths, weights, etc) | \(log_e\) (natural log) | log(x) |
 | \(log_{10}\) (log base 10) | log(x, 10) or log10(x) |
 | \(log (x+1)\) | log(x+1) |
Counts (number of individuals etc) | \(\sqrt{~}\) | sqrt(x) |
Percentages (must be proportions) | \(arcsin\) | asin(sqrt(x))*180/pi |
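The transformations in the table can be applied as follows. All of the data here are simulated or invented purely for illustration (the variable names and distribution parameters are assumptions).

```r
set.seed(1)
leaf_length <- rlnorm(200, meanlog = 3, sdlog = 0.6)  # hypothetical skewed measurements
counts      <- rpois(200, lambda = 3)                 # hypothetical counts (may include zeros)
percent     <- c(5, 20, 50, 80, 95)                   # hypothetical percentages

# Measurements: logarithmic transformations
log_length   <- log(leaf_length)
log10_length <- log10(leaf_length)
all.equal(log10_length, log(leaf_length, 10))   # the two base-10 forms agree

# log(x + 1) remains defined when x contains zeros
log1p_counts <- log(counts + 1)

# Counts: square-root transformation
sqrt_counts <- sqrt(counts)

# Percentages are first converted to proportions, then arcsin transformed (degrees)
asin_prop <- asin(sqrt(percent / 100)) * 180 / pi

# The order of the observations is unchanged by a log transformation
identical(order(leaf_length), order(log_length))
```

Comparing `hist(leaf_length)` with `hist(log_length)` should show the long right tail of the raw measurements largely disappearing on the logarithmic scale.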
Whilst scale transformations of the kind outlined above are legitimate, in general it is preferable that appropriate distributions be selected for modelling rather than transforming data to adhere to specific distributional requirements. For example, it is arguably more appropriate to model against a lognormal distribution than to model a gaussian distribution on log-transformed data.
Consider the common yesteryear practice of log (or worse, square root) transformation of count data (which are often skewed with a long right tail) to satisfy normality and homogeneity of variance assumptions of classic statistical tests. A statistical model itself should reflect the expected underlying data generation process. In applying a gaussian distribution, we are implying that the data were generated via a gaussian process. However, this is not logical. From a gaussian data generation process, all real values would be possible - such as a count of 10.235.
Rather than log-transform the data to satisfy the Gaussian model assumptions, if we assume a Poisson distribution, in addition to applying a distribution for which the data are likely to fit better, we are implying that the counts have been generated via a Poisson process - a much more likely situation.
It used to be reasonably common to apply square-root transformations to normalise count data. Square-root transformations were favoured over logarithmic transformations since count data often contained numerous zero values and the logarithm of zero is undefined.
Great care and consideration must be applied prior to performing a square-root transformation in preparation for statistical model fitting. Typically, after fitting a model, various summaries are produced to provide insights into the model estimates and inferences. To be meaningful, these insights are usually presented on the same scale as the original data. The back-transformation from a square-root transformation is to square the data. However, this transformation does not apply equally across all ranges of data.
Consider the following numbers: \(-4, -2, 0, 0.5, 1, 2\).
The smallest value here is the first value (\(-4\)) and the largest value is the last value (\(2\)). However, once we square these numbers, they become \(16, 4, 0, 0.25, 1, 4\).
- the first value is now the largest value
- typically, when squared, values increase, but not values between 0 and 1 - they decrease.
So the spacing and order can change dramatically and hence such back-transformations can produce values that are meaningless in the context of the study.
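The effect of squaring on the numbers above can be checked directly. This is only a small illustration of the arithmetic; the object names are arbitrary.

```r
vals    <- c(-4, -2, 0, 0.5, 1, 2)
squared <- vals^2
squared

# Position of the largest value before and after squaring
which.max(vals)     # the last value is the largest
which.max(squared)  # the first value is now the largest

# Values strictly between 0 and 1 shrink when squared
vals[vals > 0 & vals < 1]
squared[vals > 0 & vals < 1]
```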
5 Estimates
5.1 Measures of location
Measures of location describe the center of a distribution and thus characterize the typical value of a population. There are many different measures of location (see Table below), all of which yield identical values (in the center of the distribution) when the population (and sample) follows an exactly symmetrical distribution. Whilst the mean is highly influenced by unusually large or small values (outliers) and skewed distributions, the median is more robust. The greater the degree of asymmetry and outliers, the more disparate the different measures of location.
Parameter | Description | R syntax |
---|---|---|
Estimates of location | ||
Arithmetic mean (\(\mu\)) | the sum of the values divided by the number of values (\(n\)) | mean(x) |
Trimmed mean | the arithmetic mean calculated after a fraction (typically 0.05 or \(5\%\)) of the lower and upper values have been discarded | mean(x, trim=0.05) |
Winsorized mean | the arithmetic mean calculated after the trimmed values are replaced by the upper and lower trimmed quantiles | library(psych)<br>mean(winsor(x, trim=0.05)) |
Median | the middle value | median(x) |
Minimum, maximum | the smallest and largest values | min(x), max(x) |
Estimates of spread | ||
Variance (\(\sigma^2\)) | the average squared deviation (difference) of observations from the mean | var(x) |
Standard deviation (\(\sigma\)) | square-root of the variance | sd(x) |
Median absolute deviation | the median absolute difference of observations from the median value | mad(x) |
Inter-quartile range | the difference between the 75% and 25% ranked observations | IQR(x) |
Precision and confidence | ||
Standard error of \(\bar{y}\) (\(s_{\bar{y}}\)) | the precision of an estimate \(\bar{y}\) | sd(x)/sqrt(length(x)) |
95% confidence interval of \(\mu\) | the interval with a 95% probability of containing the true mean | library(gmodels)<br>ci(x) |
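As a rough illustration of the location estimates in the table above, the following sketch uses a small made-up sample (`x` is hypothetical) containing one unusually large value. Only base R functions are used, and a trim of 0.1 is chosen so that exactly one value is removed from each end of the ten observations.

```r
# Hypothetical sample with a single large outlier (12.7)
x <- c(4.1, 4.5, 4.8, 5.0, 5.2, 5.3, 5.6, 5.9, 6.1, 12.7)

mean(x)              # pulled upwards by the outlier
mean(x, trim = 0.1)  # drops the smallest and largest value before averaging
median(x)            # middle value, largely unaffected by the outlier
range(x)             # minimum and maximum
```

In this sample the mean exceeds both the trimmed mean and the median, which reflects the sensitivity of the mean to outliers described above.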
5.2 Measures of dispersion and variability
In addition to having an estimate of the typical value (center of a distribution), it is often desirable to have an estimate of the spread of the values in the population. That is, do all Victorian male koalas weigh the same or do the weights differ substantially?
In its simplest form, the variability, or spread, of a population can be characterized by its range (difference between maximum and minimum values). However, as ranges can only increase with increasing sample size, sample ranges are likely to be a poor estimate of population spread.
Variance (\(s^2\)) describes the typical squared deviation of values from the typical (mean) value: \[s^2=\frac{\sum{(y_i-\bar{y})^2}}{n-1}\] Note that by definition, the mean value must be in the center of all the values, and thus the sum of the positive and negative deviations will always be zero. Consequently, the deviations are squared prior to summing. Unfortunately, this results in the units of the spread estimate being the square of the units of location. Standard deviation (the square-root of the variance) rectifies this issue.
Note also, that population variance (and standard deviation) estimates are calculated with a denominator of \(n-1\) rather than \(n\). The reason for this is that since the sample values are likely to be more similar to the sample mean (which is of course derived from these values) than to the fixed, yet unknown population mean, a variance calculated with a denominator of \(n\) will, on average, underestimate the population variance. That is, such a sample variance (and the corresponding standard deviation) is a biased estimate of the population parameter. Ideally, the mean and variance should be estimated from two different independent samples. However, this is not practical in most situations. Division by \(n-1\) rather than \(n\) is intended to offset this bias.
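The variance calculation and the effect of the denominator can be sketched as follows. The sample `y` and the simulation settings (samples of size 5 drawn from a normal population with variance 4) are assumptions made purely for illustration.

```r
# Variance by hand for a small hypothetical sample
y   <- c(2, 4, 4, 5, 7, 8)
n   <- length(y)
dev <- y - mean(y)

sum(dev)                    # deviations from the mean sum to zero
s2 <- sum(dev^2) / (n - 1)  # sum of squared deviations over n - 1
all.equal(s2, var(y))       # var() uses the n - 1 denominator

# Simulation: average the two estimators over many small samples
set.seed(42)
sims <- replicate(10000, {
  s <- rnorm(5, mean = 0, sd = 2)  # true population variance is 4
  c(div_n  = sum((s - mean(s))^2) / 5,
    div_n1 = var(s))
})
avg <- rowMeans(sims)
avg
```

On average, the version divided by \(n\) should fall noticeably below the true variance, whereas the version divided by \(n-1\) should be close to it.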
There are more robust (less sensitive to outliers) measures of spread including the inter-quartile range (difference between 75% and 25% ranked observations) and the median absolute deviation (MAD: the median absolute difference of observations from the median value).
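To illustrate the robustness of these measures, the sketch below adds a single extreme value to a made-up sample (both `x` and the outlier value of 50 are hypothetical) and compares how much each measure of spread changes. Note that R's `mad()` multiplies the result by a constant (1.4826 by default) to make it comparable to the standard deviation for normally distributed data; setting `constant = 1` returns the raw median absolute deviation described above.

```r
x     <- c(4.1, 4.5, 4.8, 5.0, 5.2, 5.3, 5.6, 5.9, 6.1)
x_out <- c(x, 50)  # same sample plus one extreme value

# Ratio of each spread measure with and without the outlier
sd_ratio  <- sd(x_out) / sd(x)
iqr_ratio <- IQR(x_out) / IQR(x)
mad_ratio <- mad(x_out) / mad(x)
c(sd = sd_ratio, IQR = iqr_ratio, MAD = mad_ratio)

# Raw (unscaled) MAD matches the definition directly
mad_raw <- mad(x, constant = 1)
all.equal(mad_raw, median(abs(x - median(x))))
```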
5.3 Measures of the precision of estimates - standard errors and confidence intervals
Since sample statistics are used to estimate population parameters, it is also desirable to have a measure of how good the estimates are likely to be. For example, how well the sample mean is likely to represent the true population mean. The proximity of an estimated value to the true population value is its accuracy.
Clearly, as the true value of the population parameter is never known (hence the need for statistics), it is not possible to determine the accuracy of an estimate. Instead, we measure the precision (repeatability, consistency) of the estimate. Provided an estimate is repeatable (likely to be obtained from repeated samples) and that the sample is a good, unbiased representative of the population, a precise estimate should also be accurate.
Strictly, precision is measured as the degree of spread (standard deviation) in a set of sample statistics (e.g. means) calculated from multiple samples and is called the standard error. The standard error can be estimated from a single sample by dividing the sample standard deviation by the square-root of the sample size (\(\frac{s}{\sqrt{n}}\)). The smaller the standard error of an estimate, the more precise the estimate is and thus the closer it is likely to approximate the true population parameter.
The central limit theorem (which states that the means of sufficiently large samples drawn from the same population will tend towards being normally distributed, largely regardless of the shape of that population's distribution) suggests that the distribution of repeated sample means should follow a normal distribution and thus can be described by its overall mean and standard deviation (=standard error). In fact, since the standard error of the mean is estimated from the same single sample as the mean, the standardized sample mean follows a t-distribution: a distribution that resembles the normal distribution but has heavier tails, and that approaches the normal distribution as the degrees of freedom increase.
In accordance with the properties of a normal distribution (and thus a t-distribution with infinite degrees of freedom), 68.27% of the repeated means fall within one standard error of the true mean (see Figure below). Put differently, we are 68.27% confident that the interval bound by the sample mean plus and minus one standard error will contain the true population mean. Of course, the smaller the sample size (lower the degrees of freedom), the flatter the t-distribution and thus the smaller the level of confidence for a given span of values (interval).
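The two views of the standard error described above (the spread of many sample means versus an estimate from a single sample) can be compared with a short simulation. The population mean, population standard deviation and sample size used here are arbitrary assumptions.

```r
set.seed(123)
pop_mean <- 10
pop_sd   <- 3
n        <- 30

# Standard deviation of the means of many repeated samples
means  <- replicate(5000, mean(rnorm(n, pop_mean, pop_sd)))
se_emp <- sd(means)

# Estimate from a single sample: sample sd divided by sqrt(n)
one_sample <- rnorm(n, pop_mean, pop_sd)
se_hat     <- sd(one_sample) / sqrt(n)

c(empirical = se_emp, single_sample = se_hat, theoretical = pop_sd / sqrt(n))
```

The single-sample estimate will vary from sample to sample, but it should be of a similar magnitude to the spread of the repeated sample means.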
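The 68.27% figure, and the way it shrinks for small samples, can be checked with the distribution functions in base R. The particular degrees of freedom shown are chosen arbitrarily.

```r
# Probability within +/- 1 standard error under a normal distribution
conf_norm <- pnorm(1) - pnorm(-1)
round(conf_norm, 4)

# The same span under t-distributions with increasing degrees of freedom
dfs    <- c(2, 5, 10, 30, 100)
conf_t <- pt(1, df = dfs) - pt(-1, df = dfs)
round(setNames(conf_t, paste0("df=", dfs)), 4)
```

The probabilities for the t-distributions are all below the normal value and approach it as the degrees of freedom increase.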
This concept can be easily extended to produce intervals associated with other degrees of confidence (such as 95%) by determining the percentiles (and thus number of standard errors away from the mean) between which the nominated percentage (e.g. 95%) of the values lie. The 95% confidence interval is thus defined as:
\[P\{\bar{y}-t_{0.05(n-1)}s_{\bar{y}}\le\mu\le\bar{y}+t_{0.05(n-1)}s_{\bar{y}}\}=0.95\]
where \(\bar{y}\) is the sample mean, \(s_{\bar{y}}\) is the standard error, \(t_{0.05(n-1)}\) is the two-tailed 5% critical value (that is, the 97.5th percentile) of a t-distribution with \(n-1\) degrees of freedom, and \(\mu\) is the unknown population mean.
For a 95% confidence interval, there is a 95% probability that the interval will contain the true mean; that is, in the long run, 95% of intervals constructed in this way from repeated samples would contain it. Note, this interpretation is about the interval, not the true population value, which remains fixed (albeit unknown). The narrower the interval, the more confidence is placed in inferences about the estimated parameter.
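A minimal sketch of the confidence interval calculation in base R follows. The sample `y` is made up for illustration, and `t.test()` is used only as a cross-check of the hand calculation.

```r
# Hypothetical sample
y <- c(4.1, 4.5, 4.8, 5.0, 5.2, 5.3, 5.6, 5.9, 6.1, 6.4)

n      <- length(y)
ybar   <- mean(y)
se     <- sd(y) / sqrt(n)         # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # two-tailed 5% critical value

ci <- c(lower = ybar - t_crit * se, upper = ybar + t_crit * se)
ci

# Cross-check against the interval reported by a one-sample t-test
t.test(y)$conf.int
```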
The left hand figure above illustrates a Normal distribution displaying percentage quantiles (grey) and probabilities (areas under the curve) associated with a range of standard deviations beyond the mean. The right hand figure displays 20 possible 95% confidence intervals from 20 samples (\(n=30\)) drawn from the one population. Bold intervals are those that do not include the true population mean. In the long run, 5% of such intervals will not include the population mean (\(\mu\)).
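The long-run behaviour described for the right hand figure can be approximated by simulation. The population mean and standard deviation below are arbitrary assumptions; the sample size of 30 matches the figure description, and 2000 repeated samples are used in place of 20 to make the long-run proportion more stable.

```r
set.seed(2024)
mu    <- 50   # assumed true population mean
sigma <- 10   # assumed population standard deviation
n     <- 30

# For each repeated sample, record whether its 95% interval contains mu
covers <- replicate(2000, {
  y  <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(y)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

mean(covers)      # proportion of intervals containing the true mean
1 - mean(covers)  # proportion missing it
```

The proportion of intervals that contain the true mean should be close to 0.95, with roughly 5% of intervals missing it.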
6 Degrees of freedom
The concept of degrees of freedom is sufficiently abstract and foreign to those new to statistical principles that it warrants special attention. The degrees of freedom refers to how many observations in a sample are “free to vary” (theoretically take on any value) when calculating independent estimates of population parameters (such as population variance and standard deviation).
In order for any inferences about a population to be reliable, each population parameter estimate (such as the mean and the variance) must be independent of one another. Yet they are usually all obtained from a single sample and to estimate variance, a prior estimate of the mean is required. Consequently, mean and variance estimated from the same sample cannot strictly be independent of one another.
When estimating the population variance (and thus standard deviation) from sample observations, not all of the observations can be considered independent of the estimate of population mean. The value of at least one of the observations in the sample is constrained (not free to vary).
If, for example, there were four observations in a sample with a mean of 5, then the first three of these can theoretically take on any value, yet the fourth value must be such that the sum of the values is still 20.
The degrees of freedom therefore indicates how many independent observations are involved in the estimation of a population parameter. A 'cost' of a single degree of freedom is incurred for each prior estimate required in the calculation of a population parameter.
The shape of the probability distributions of coefficients (such as those in linear models etc) and statistics depends on the number of degrees of freedom associated with the estimates. The greater the degrees of freedom, the narrower the probability distribution and thus the greater the statistical power. Power is the probability of detecting an effect if an effect genuinely occurs.
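The constraint in this example can be made concrete with a few lines of R. The three 'free' values below are chosen arbitrarily.

```r
target_mean <- 5
n           <- 4

# The first three observations can take any values
free <- c(2, 9, 3)

# The fourth is then fixed so that the four values sum to n * target_mean = 20
last <- n * target_mean - sum(free)
obs  <- c(free, last)

obs
mean(obs)
```

Whatever three values are chosen, the fourth is fully determined by the mean, leaving \(n-1=3\) degrees of freedom.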
Degrees of freedom (and thus power) are positively related to sample size (the greater the number of replicates, the greater the degrees of freedom and power) and negatively related to the number of variables and prior required parameters (the greater the number of parameters and variables, the lower the degrees of freedom and power).
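The relationship between sample size, degrees of freedom and power can be sketched using `power.t.test()` from base R's stats package. The scenario is an assumption made for illustration: a two-sample t-test with a true difference of 1 unit, a standard deviation of 1.5 and a 5% significance level.

```r
# Replicates per group
ns <- c(5, 10, 20, 40)

# Degrees of freedom for a two-sample t-test with n replicates per group
dfs <- 2 * (ns - 1)

# Two-tailed 5% critical values shrink as degrees of freedom increase
crit <- qt(0.975, df = dfs)

# Power to detect the assumed difference for each sample size
pow <- sapply(ns, function(n) {
  power.t.test(n = n, delta = 1, sd = 1.5, sig.level = 0.05)$power
})

data.frame(n = ns, df = dfs, critical_t = round(crit, 3), power = round(pow, 3))
```

As the number of replicates increases, the critical value decreases (a narrower reference distribution) and the power rises.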