Running R

There are a number of ways to start R.

If R is fully installed locally on the computer:
  1. Click the START button from the Windows task bar
  2. Click Program Files
  3. Click R2.2
  4. Click RGui
If running R from the CD:
  1. Go to the R directory
  2. Go to the rw2020 subdirectory
  3. Go to the bin subdirectory
  4. Double-click on the RGui.exe file
The RGui window should shortly appear.


End of instructions

  R object classes

Object classes define how information is stored and displayed. The basic storage unit in R is called a vector. A vector is an array of one or more entries of the same class. The common classes include
  1. numeric - stores a number, e.g. 1, 2.345
  2. character - stores alphanumeric characters, e.g. 'a', 'fish', 'color1'
  3. logical - stores either TRUE or FALSE
So the entries (1, 2, 3 & 4) might make up a numeric vector, whereas the entries ('Big', 'Small' & 'Tiny') would make up a character vector. To determine the class type of an object, use the following syntax (where name is the object whose class is to be determined).

> class(name)
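For example (a minimal sketch; the object names and values are invented for illustration):

```r
# create two objects and query their classes
x <- 2.345
y <- c('Big', 'Small', 'Tiny')
class(x)  # "numeric"
class(y)  # "character"
```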

End of instructions

  R object assignment

Assigning entries is basically the act of defining a new object name and specifying what that object contains. For example, if we wanted to store the number 10.513 as John Howard's IQ, we would instruct R to create a new object called (say) IQ and assign the value 10.513 to it. That is, we instruct R that IQ equals 10.513.
In R, the assignment operator is <- rather than =.

> name <- entry

So to assign IQ the value of 10.513 in R

> IQ <- 10.513

End of instructions

  Print contents

In R, print means to output (list) the contents of an object. By default, the contents are output to the screen (the default output device), although output can also be redirected to a file or printer if necessary. The contents of an object are 'printed' by using the 'print()' function or by simply entering the name of the object. Consider the following;

> numbers <- c(1, 4, 6, 7, 4, 345, 36, 78)
> print(numbers)
[1] 1 4 6 7 4 345 36 78
> numbers
[1] 1 4 6 7 4 345 36 78

The first line of this syntax generates and populates the numeric vector called 'numbers'. The second line uses the print function to tell R to list the contents of the 'numbers' object - the output of which appears on the third line. The fourth and fifth lines illustrate that the same outcome can be achieved by simply entering the name of the object.

End of instructions

  R vectors - variables

In biology, a variable is a collection of observations of the same type. For example, a variable might consist of the observed weights of individuals within a sample of 10 bush rats. Each item (or element) in the variable is of the same type (a weight) and will have been measured comparably (same techniques and units). Biological variables are therefore best represented in R by vectors.
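As a hedged sketch, such a variable might be stored as follows (the weights are invented for illustration):

```r
# hypothetical weights (in grams) of a sample of 10 bush rats
weights <- c(221, 205, 198, 231, 215, 195, 208, 224, 201, 217)
class(weights)   # "numeric"
length(weights)  # 10
```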

End of instructions

  R Factors

There are a number of ways in which this can be done. One way is to use the 'factor' function (which converts a vector into a factor) in conjunction with the 'c' (concatenation) function.

> name <- factor(c(list of characters/words))

Another way is to use the 'gl' function (which generates factors according to specified patterns of their levels)

> name <- gl(number of levels, number of replicates, length of data set, lab=c(list of level names))

Hence, consider the following alternative solutions;

> sex <- factor(c('Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male'))
OR
> sex <- factor(c(rep('Female',6),rep('Male',6)))
OR
> sex <- gl(2,6,12,lab=c('Female','Male'))

The second option uses the 'rep()' function, which in this case is used to repeat each level name (e.g. 'Female') 6 times.
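Whichever form is used, the resulting factor can be checked with the 'levels()' and 'table()' functions (a minimal sketch):

```r
sex <- gl(2, 6, 12, lab = c('Female', 'Male'))
levels(sex)  # "Female" "Male"
table(sex)   # 6 observations per level
```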

End of instructions

  Saving R scripts

To create a script file
  • Go to the RGui File menu
  • Select the New script submenu
  • A new R Editor window will appear. Copy and paste the important lines of R syntax into this window.
  • Go to the RGui File menu
  • Select the Save As submenu
  • From the Save script as dialog box, find a good location to save the script file (one that you will remember)
  • Provide a name for the file (any name will do, but one that reflects the purpose of the script is best). The file can end in any extension as it is just a plain text file. However, to help you distinguish it as an R script file in future, it is recommended that you end the filename with '.R'

End of instructions

  Quitting R

There are two ways to quit R elegantly
  1. Go to the RGui File menu and select Quit
  2. In the R Console, type

    > q()

Either way you will be prompted to save the workspace; generally this is neither necessary nor desirable.

End of instructions

  The R working directory

The R working directory is the location that R looks in to read/write files. When you load R, it initially sets the working directory to a location within the R subdirectory of your computer. However, we can alter the working directory so that it points to another location. This not only prevents having to search for files, it also removes the need to specify a path as well as a filename in various operations. To change the working directory:
  1. Go to the RGui File menu
  2. Select the Change dir submenu
  3. Locate the directory that contains your data and/or scripts
  4. Click OK
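The working directory can also be changed and queried from the R Console with the 'setwd()' and 'getwd()' functions. A minimal sketch (tempdir() is used so the example runs anywhere; substitute your own path, e.g. 'C:/data'):

```r
setwd(tempdir())  # point R at a directory of your choosing
getwd()           # confirm the current working directory
```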

End of instructions

  Read in (source) an R script

There are two ways to read in (source) an R script
  1. Go to the RGui File menu and select Source R code
  2. In the R Console, type

    > source('filename')

    where 'filename' is the name of the R script file.
All commands in the file will be executed sequentially starting at the first line. Note that R will ignore everything that is on a line following the '#' character. This provides a way of inserting comments into script files (a practice that is highly encouraged as it provides a way of reminding yourself what each line of syntax does).

Note that there is an alternative way of using scripts

  1. Go to the RGui File menu and select the Open script submenu
  2. This will display the R script file in the R Editor window
  3. Syntax can then be directly cut and pasted into the R Console. Again, each line will be executed sequentially and comments are ignored.
When working with multi-step analyses, it is recommended that you have the R Editor open and that syntax be entered directly into the editor and then pasted into the R Console. Then, provided the R script is saved regularly, your data and analyses are safe from computer disruptions.

End of instructions

  R workspace

The R workspace stores all the objects that are created during an R session. To view a list of objects occupying the current R workspace

> ls()

The workspace is capable of storing huge numbers of objects, and thus it can become cluttered very rapidly (with objects that you have created but forgotten about, or with many objects that contain similar contents). Whilst this does not affect the performance of R, it can lead to chronic confusion. It is advisable that you remove all unwanted objects when you have finished with them. To remove an object;

> rm(object name)

where object name is the name of the object to be removed.
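For example (a sketch with throw-away object names):

```r
junk1 <- 1
junk2 <- 'old'
ls()              # lists "junk1" and "junk2" (amongst any other objects)
rm(junk1, junk2)  # remove both objects
# to empty the entire workspace (use with care):
rm(list = ls())
```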

When you exit R, you will be prompted for whether you want to save the workspace. You can also save the current workspace at any time by;
  • Go to the RGui File menu
  • Select the Save Workspace submenu
  • Provide a location and file name for the workspace. The default and recommended (though not necessary) file extension is .RData
A saved workspace is not in an easily readable text format. Saving a workspace enables you to retrieve an entire session's work instantly, which can be useful if you are forced to perform your analyses over multiple sittings. Furthermore, by providing different names, it is possible to maintain different workspaces for different projects.

End of instructions

  Removing objects

To remove an object

> rm(object name)

where object name is the name of the object to be removed.

End of instructions

  Exporting from Excel

  • Go to the Excel File menu
  • Select the Save As submenu
  • Provide a path and filename for the file
  • Change the Save as type to CSV (Comma delimited)
  • Click the Save button
  • You will be warned that only the selected sheet can be saved in this format - click OK
  • You will also be warned that some of the formatting might be lost - again click OK. The formatting that it is referring to includes things like colors and fonts, neither of which are important for data analysis.

End of instructions

  Reading data into R

Ensure that the working directory is pointing to the path containing the file to be imported before proceeding.
To import data into R, we read the contents of a file into a data frame. The general format of the command for reading data into a data frame is

> name <- read.table('filename.csv', header=T, sep=',', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, filename.csv is the name of the csv file that you created in Excel and column is the number of the column that contains the row labels (if there are any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file). If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example

> phasmid <- read.table('phasmid.csv', header=T, sep=',', row.names=1)

End of instructions

  Cutting and pasting data into R

To import data into R via the clipboard, copy the data in Excel (including a row containing column headings - variable names) to the clipboard (CNTRL-C). Then in R use a modification of the read.table function:

> name <- read.table('clipboard', header=T, sep='\t', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, clipboard is used to indicate that the data are on the clipboard, and column is the number of the column that contains the row labels (if there are any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep='\t' indicates that entries in the clipboard are separated by a TAB. If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example

> phasmid <- read.table('clipboard', header=T, sep='\t', row.names=1)

End of instructions

  Writing a copy of the data to an R script file

> dump('data', "")

where 'data' is the name of the object that you wish to be stored.
Cut and paste the output of this command directly into the top of your R script. In future, when the script file is read, it will include a copy of your data set, preserved exactly.
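As a hedged sketch of what this produces (the object name and values are invented):

```r
numbers <- c(1, 4, 6, 7)
dump('numbers', "")  # writes the syntax that recreates 'numbers' to the screen
```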

End of instructions

  R writing data

To export data from R, we write the contents of a data frame to a file. The general format of the command for writing data from a data frame to a file is

> write.table(data frame name, 'filename.csv', quote=F, sep=',')

where data frame name is a name of the data frame to be saved and filename.csv is the name of the csv file that is to be created (or overwritten). The argument quote=F indicates that quotation marks should not be placed around entries in the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file).

As an example

> write.table(LEAVES, 'leaves.csv', quote=F, sep=',')

End of instructions

  R vectors and data frames

A vector is simply an array (list) of entries of the same type. For example, a numeric vector is just a list of numbers. A vector has only a single dimension - length - and thus the first entry in the vector has an index of 1 (is in position 1), the second has an index of 2 (position 2), etc. The index of an entry provides a convenient way to access entries. If a vector contained the numbers 10, 5, 8, 3.1, then the entry with index 1 is 10, and the entry with index 4 is 3.1. Consider the following examples

> var <- c(4,8,2,6,9,2)
> var
[1] 4 8 2 6 9 2
> var[1]
[1] 4
> var[5]
[1] 9
> var[3:5]
[1] 2 6 9

A data frame on the other hand is not a single vector, but rather a collection of vectors. It is arranged like a matrix, consisting of columns and rows, and therefore has both length and width. Consequently, each entry has a row index and a column index. Consider the following examples

> dv <- c(4,8,2,6,9,2)
> iv <- c('a','a','a','b','b','b')
> data <- data.frame(iv,dv)
> data
iv dv
1 a 4
2 a 8
3 a 2
4 b 6
5 b 9
6 b 2
> data[1] #list the first column (as a data frame)
iv
1 a
2 a
3 a
4 b
5 b
6 b
> data[,1] #list contents of first column
[1] a a a b b b
Levels: a b
> data[1,] #list contents of first row
iv dv
1 a 4
> data[3,2] #list the entry in row 3, column 2
[1] 2

End of instructions

  R 'fix()' function

The 'fix()' function provides a very rudimentary spreadsheet for data editing and entry.

> fix(data frame name)

The spreadsheet (Data Editor) will appear. A new variable can be created by clicking on a column heading and then providing a name for the variable as well as an indication of whether the contents are expected to be numeric (numbers) or character (contain some letters). Data are added by clicking on a cell and providing an entry. Closing the Data Editor will cause the changes to come into effect.

End of instructions

  R transformations

The following table illustrates the use of common data transformation functions on a single numeric vector (old_var). In each case a new vector is created (new_var)

  Transformation                  Syntax
  loge (natural log)              > new_var <- log(old_var)
  log10                           > new_var <- log10(old_var)
  square root                     > new_var <- sqrt(old_var)
  arcsin                          > new_var <- asin(sqrt(old_var))
  scale (mean=0, unit variance)   > new_var <- scale(old_var)

where old_var is the name of the original variable that requires transformation and new_var is the name of the new variable (vector) that is to contain the transformed data. Note that the original vector (variable) is not altered.

End of instructions

  Random numbers

Random numbers are usually given as a fraction (e.g. 0.713791167) between 0 and 1. These can be used to generate a series of integer numbers (e.g. 7, 1, 3, 71, 379, 1167 ...) or floating points (e.g. 0.71, 7.13, 71.37, 713.7, 91.167, ...), which in turn can be used to obtain random sampling units (based on individual IDs or random grid coordinates respectively).

R uses the Mersenne-Twister Random Number Generator (RNG) with a random number sequence cycle of 2^19937-1. All random number generators have what is known as a 'seed'. This is a number that uniquely identifies a series of random number sequences. Strictly, computer generated random numbers are 'pseudo-random' as the sequences themselves are predefined. However, with such a large number of possible sequences (2^19937-1), for all intents and purposes they are random.
By default, the initial random seed is generated from the computer clock (milliseconds field) and is therefore unbiased. However, it is also possible to specify a random seed. This is often useful for error-checking functions, and it also facilitates learning how to perform randomizations, as the same outcomes can be repeated. To set the seed

> set.seed(number)

where number is an integer.
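For example, setting the same seed before two runs reproduces the same random sequence (a minimal sketch; the seed value is arbitrary):

```r
set.seed(13)
runif(3)  # three uniform random numbers
set.seed(13)
runif(3)  # the identical three numbers again
```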

End of instructions

  R loading packages

R is a highly modular piece of software, consisting of a very large number of packages. Each package defines a set of functions that can be used to perform specific tasks. Packages also include help files, example data sets and command scripts that provide information about the full use of the functions.
The modularized nature of R means that only the packages that are necessary to perform the current tasks need to be loaded into memory. This results in a very `light-weight', fast statistical software package. Furthermore, the functionality of R can be easily extended by the creation of additional packages, rather than re-writing all the software. As a result of this, and the open source license, new statistics are added to R nearly as soon as statisticians publish them. New and revised packages can be freely downloaded from the (CRAN) website at any time.
During the installation process of R, a large number of packages are also installed.
To install additional packages from RGui (note that this only ever needs to be done once!)
  • Go to the RGui Packages menu
  • Select the Install package(s) from local zip files... submenu
  • Locate and select the name of the recently downloaded (or acquired) package in zip format.

When R is first started, only the 'base' package is loaded. This package contains an extensive collection of functions for performing most of the basic data manipulation and statistical procedures. To load additional packages for use during a session:

> library(package)

where package is the name of the package to load.

End of instructions

  Frequency histogram

> hist(variable)

where variable is the name of the numeric vector (variable) for which the histogram is to be constructed.

End of instructions

  Summary Statistics

# mean
> mean(variable)
# variance
> var(variable)
# standard deviation
> sd(variable)
# variable length
> length(variable)
# median
> median(variable)
# maximum
> max(variable)
# minimum
> min(variable)
# standard error of mean
> sd(variable)/sqrt(length(variable))
# interquartile range (the 25th and 75th percentiles)
> quantile(variable, c(.25,.75))
# 95% confidence interval (half-width, normal approximation)
> qnorm(0.975,0,1) * (sd(variable)/sqrt(length(variable)))

where variable is the name of the numeric vector (variable).

End of instructions

  Summary Statistics by groups

> tapply(dv,factor,func)

The tapply() function applies the function func to the numeric vector dv separately for each level of the factor factor. For example

# calculate the mean separately for each group
> tapply(dv,factor,mean)
# calculate the variance separately for each group
> tapply(dv,factor,var)
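Using a hypothetical numeric vector and grouping factor (values invented):

```r
dv  <- c(4, 8, 2, 6, 9, 2)
sex <- factor(c(rep('Female', 3), rep('Male', 3)))
tapply(dv, sex, mean)  # Female ~4.67, Male ~5.67
```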

End of instructions

  Selecting subsets of data

# select the first 5 entries in the variable
> variable[1:5]
# select all but the first entry in the variable
> variable[-1]
# select all cases of variable that are less than num
> variable[variable<num]
# select all cases of variable1 that are equal to label
> variable1[variable1==label]

where variable is the name of a numeric vector (variable) and num is a number. variable1 is the name of a character vector and label is a character string (word), e.g. 'Big'. Logical statements can also be used to exclude values. For example, to exclude all values of DOC greater than 170 (to exclude an outlier), and therefore include only those values less than 170, use DOC[DOC < 170]. Alternatively, you could exclude an outlier using its STREAM name: STREAM[STREAM != 'Santa Cruz'].

End of instructions

  Normal probabilities

> pnorm(c(value), mean=mean, sd=sd, lower.tail=FALSE)

This will calculate the area under a normal distribution (beyond the value of value) with a mean of mean and a standard deviation of sd. The argument lower.tail=FALSE indicates that the area is calculated for values greater than value. For example

> pnorm(c(2.9), mean=2.025882, sd=0.4836265, lower.tail=FALSE)

End of instructions

  Boxplots

> boxplot(variable)

where variable is the name of the numeric vector (variable)

End of instructions

  Boxplots

A boxplot is a graphical representation of the distribution of observations based on the 5-number summary that includes the median (50%), quartiles (25% and 75%) and data range (0% - smallest observation and 100% - largest observation). The box demonstrates where the middle 50% of the data fall and the whiskers represent the minimum and maximum observations. Outliers (extreme observations) are also represented when present.
The figure below demonstrates the relationship between the distribution of observations and a boxplot of the same observations. Normally distributed data results in symmetrical boxplots, whereas skewed distributions result in asymmetrical boxplots with each segment getting progressively larger (or smaller).


End of instructions

  EDA - Normal distribution

Parametric statistical hypothesis tests assume that the population measurements follow a specific distribution. Most commonly, statistical hypothesis tests assume that the population measurements are normally distributed (Question 4 highlights the specific reasoning for this).

While it is usually not possible to directly examine the distribution of measurements in the whole population (for the same reasons that it is not possible to directly measure population parameters such as the population mean), if the sampling is adequate (unbiased and sufficiently replicated), the distribution of measurements in a sample should reflect the population distribution. That is, a sample of measurements taken from a normally distributed population should itself be normally distributed.

Tests on data that do not meet the distributional requirements may not be reliable, and thus, it is essential that the distribution of data be examined routinely prior to any formal statistical analysis.
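A simple first check is to plot the sample and look for symmetry (a sketch with invented data):

```r
y <- c(2.1, 2.5, 1.9, 2.3, 2.0, 2.6, 2.2, 2.4)
hist(y)     # roughly symmetrical and bell-shaped?
boxplot(y)  # median near the centre of the box, whiskers of similar length?
```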


End of instructions

  EDA - Homogeneity of variances for regression

The assumption of homogeneity of variance is equally important for regression analysis, and in particular it is the prospect of a relationship between the mean and variance of the y-values across the x-values that is of greatest concern. Strictly, the assumption is that the distributions of y-values at each x-value are equally varied and that there is no relationship between mean and variance. However, as we only have a single y-value for each x-value, it is difficult to determine whether the assumption of homogeneity of variance is likely to have been violated (the mean of one value is meaningless and variability cannot be assessed from a single value). The figure below depicts the ideal (and almost never realized) situation in which there are multiple response variable (DV) observations for each of the levels of the predictor variable (IV). Boxplots are included that represent normality and variability.

Inferential statistics is based around repeatability and the likelihood of data re-occurrence. Consider the following scatterplot that has a regression line (linear smoother - line of best fit) fitted through the data (data points symbolized by letters).

Points that are close to the line (points A and B) of best fit (ie those points that are close to the predicted values) are likely to be highly repeatable (a repeat sampling effort is likely to yield a similar value). Conversely, points that are far from the line of best fit (such as points F and G) are likely to take on a larger variety of values.
The distance between a point (observed value) and the line of best fit (predicted value) is called a residual. A residual plot plots each of the residuals against the predicted y-values. When there is a trend of increasing (or decreasing) spread of data along a regression line, the residuals will form a wedge-shaped pattern. Thus a wedge-shaped pattern in the residual plot suggests that the assumption of homogeneity of variance may have been violated.

Of course, it is also possible to assess the assumption of homogeneity of variance by simply viewing a scatterplot with a linear smoother (linear regression line) and determining whether there is a general increase or decrease in the distance that the values are from the line of best fit.
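A minimal sketch of producing a residual plot (the data are invented):

```r
x   <- 1:10
y   <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.3, 17.9, 20.2)
fit <- lm(y ~ x)                # fit the linear regression
plot(resid(fit) ~ fitted(fit))  # look for a wedge-shaped pattern
```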


End of instructions

  EDA - Linearity

The most common methods for analysing the relationship or association between variables assume that the variables are linearly related (or more correctly, that they do not covary according to some function other than linear). For example, to examine the presence and strength of the relationship between two variables (such as those depicted below), the more basic statistical tests require that the variables not be related by any function other than a linear (straight-line) function, if any function at all.

There is no evidence that the data represented in Figure (a) are related (covary) according to any function other than linear. Likewise, there is no evidence that the data represented in Figure (b) are related (covary) according to any function other than linear (if at all). However, there is strong evidence that the data represented in Figure (c) are not related linearly. Hence Figure (c) depicts evidence for a violation in the statistical necessity of linearity.


End of instructions

  EDA - Scatterplots

Scatterplots are constructed to present the relationships, associations and trends between continuous variables. By convention, dependent variables are arranged on the Y-axis and independent variables on the X-axis. The variable measurements are used as coordinates and each replicate is represented by a single point.

Figure (c) above displays a linear smoother (linear `line-of-best-fit') through the data. Different smoothers help identify different trends in data.
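A minimal sketch of constructing a scatterplot with a linear smoother (the variable names and values are invented):

```r
iv <- c(1, 2, 3, 4, 5, 6)
dv <- c(2.2, 3.8, 6.1, 7.9, 10.2, 11.8)
plot(dv ~ iv)        # dependent variable on the Y-axis
abline(lm(dv ~ iv))  # add the linear line of best fit
```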

End of instructions

  Population & sample

A population refers to all possible observations. Therefore, population parameters refer to the characteristics of the whole population. For example the population mean.

A sample represents a collected subset of the population's observations and is used to represent the entire population. Sample statistics are the characteristics of the sample (e.g. sample mean) and are used to estimate population parameters.


End of instructions

  Linear model fitting

> name.lm <- lm(dv~iv, data=data)

where name.lm is a name you provide for the output, dv is the name of the dependent variable, iv is the name of the independent variable and data is the name of the data frame (data set).
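The fitted model is then examined with 'summary()'. A sketch (the data frame name and values are invented):

```r
macro    <- data.frame(iv = 1:6, dv = c(2.1, 4.2, 5.9, 8.1, 9.8, 12.2))
macro.lm <- lm(dv ~ iv, data = macro)
summary(macro.lm)  # coefficients, R-squared and hypothesis tests
```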

End of instructions

  Logistic Regression

> name.glm <- glm(dv~iv, family=binomial, data=data)

where name.glm is a name you provide for the fitted model, dv is the name of the dependent variable, iv is the name of the independent variable and data is the name of the data frame (data set).

End of instructions

  Multiple linear model fitting

> name.lm <- lm(dv~iv1+iv2, data=data)

where name.lm is a name you provide for the output, dv is the name of the dependent variable and iv1, iv2 etc are the independent variables. data is the name of the data frame (data set).

End of instructions

  ANOVA

> name.aov <- lm(dv~factor,data=data)
# OR
> day.aov <- aov(dv~factor,data=data)

where dv is the name of the numeric vector (dependent variable), factor is the name of the factor (categorical variable) and data is the name of the data frame (data set). The '~' is used in formulas to indicate that the left hand side is modelled as a function of the right hand side.
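Using the dv and iv vectors from the earlier data frame example (a minimal sketch):

```r
dv <- c(4, 8, 2, 6, 9, 2)
iv <- factor(c('a', 'a', 'a', 'b', 'b', 'b'))
day.aov <- aov(dv ~ iv)
summary(day.aov)  # ANOVA table: df, sums of squares, F and P values
</imports>
```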

End of instructions

  Nested ANOVA

A nested ANOVA is a simple extension of the single factor ANOVA. An additional factor (Factor B) is added that is nested within the main factor of interest (Factor A). Nested factors are incorporated in situations where researchers envisage potentially large amounts of variation between different sets of replicates within the main factor.
For example, a limnologist investigating the effect of flow (Factor A: high, medium and low) on macroinvertebrate density, might collect samples from a number of high, medium and low flow streams. However, the limnologist might expect that many other factors (perhaps more important than stream flow) also influence the density of macroinvertebrates and that even samples collected in close proximity to one another might be expected to vary considerably in density of macroinvertebrates. Such variability increases the unexplained variability and therefore has the potential to make it very difficult to pick up the effect of the actual factor of interest (in this case flow).
In an attempt to combat this, the researcher decides to collect multiple samples from within each stream. If some of the unexplained variability can be explained by variation between samples within streams, then the test of flow will be more sensitive.
This is called a nested ANOVA; an additional factor (Factor B: stream) has been introduced within the main factor (Factor A: flow). Factor A is a fixed factor as we are only interested in the specific levels chosen (e.g. high, medium and low flow). Factor B, however, is a random factor as we are not just interested in the specific streams that were sampled; we are interested in all the possible streams that they represent. In a nested ANOVA the categories of Factor B nested within each level of Factor A are different. E.g. the streams used for each flow type are different. Therefore it is not possible to test for an interaction between flow type and stream.

There are two null hypotheses being tested in a two factor nested ANOVA. H0 main effect: no effect of Factor A (flow): μ1 = μ2 = ... = μi = μ (as for single factor ANOVA). H0 nesting effect: no effect of Factor B (stream) nested within Factor A (flow). Technically, it is that there is no added variance due to differences between all the possible levels of Factor B within any level of Factor A. E.g., that there is no added variance due to differences between all possible high flow streams, or between all possible medium flow streams, or between all possible low flow streams. The diagram shows the layout of a nested design with three treatments (different pattern fills).
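A hedged sketch of fitting such a model with 'aov()' and an Error() term (the data frame, column names and values are all invented):

```r
# 3 flow levels, 2 streams nested within each flow, 3 samples per stream
macro <- data.frame(
  flow    = gl(3, 6, 18, lab = c('high', 'medium', 'low')),
  stream  = gl(6, 3, 18),
  density = c(12, 15, 11, 14, 16, 13, 18, 20, 17,
              19, 22, 21, 25, 23, 26, 24, 27, 22)
)
macro.aov <- aov(density ~ flow + Error(stream), data = macro)
summary(macro.aov)  # flow is tested against the stream (nested) stratum
```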

End of instructions

  Randomized complete block (RCB) designs

These are designs where one factor is a blocking variable (Factor B) and the other variable (Factor A) is the main treatment factor of interest. These designs are used in situations where the environment is potentially heterogeneous enough to substantially increase the variability in the response variable and thus obscure the effects of the main factor. Experimental units are grouped into blocks (units of space or time that are likely to have less variable background conditions) and then each level of the treatment factor is applied to a single unit within each block. By grouping the experimental units together into blocks, some of the total variability in the response variable that is not explained by the main factor may be explained by the blocks. If so, the residual (unexplained) variation will be reduced resulting in more precise estimates of parameters and more powerful tests of treatments.

A plant ecologist was investigating the effects of leaf damage on plant chemical defenses. Since individual trees vary greatly in the degree of chemical defense, she applied each of three artificial leaf damage treatments (heavy, medium and light damage) to each tree: one of the three treatments to each of three branches within eight randomly selected trees. Trees were the blocks. The diagram shows the layout of a blocking design. The main factor has three treatments (pattern fills), each of which is applied randomly within 3 randomly selected blocks. The decision to include a blocking factor (rather than a single factor, completely randomized design) depends on whether the experimental units (replicates) within a block are likely to be more similar to one another than those between blocks, and whether the reduction in MSResidual is likely to compensate for the loss of residual df.
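A hedged sketch of fitting the ecologist's design with 'aov()' and an Error() term for the blocks (the data frame, column names and values are all invented):

```r
# 3 damage treatments applied within each of 8 trees (blocks)
defense <- data.frame(
  tree     = gl(8, 3, 24),
  damage   = gl(3, 1, 24, lab = c('heavy', 'medium', 'light')),
  chemical = c(10, 12, 14,  9, 11, 13, 11, 13, 15,  8, 10, 12,
               10, 13, 14,  9, 12, 14, 11, 12, 16,  8, 11, 13)
)
defense.aov <- aov(chemical ~ damage + Error(tree), data = defense)
summary(defense.aov)  # damage is tested within the block (tree) stratum
```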

End of instructions

  Split-plot designs

Split-plot designs can be thought of as a combination of a randomized complete block (RCB) and nested designs. In their simplest form, they represent a RCB design, with a second factor applied to the whole blocks. One factor (Factor C) is applied to experimental units (usually randomly positioned) within each block (now referred to as plots, or Factor B). These factors are the Within-plot factors. A second factor (Between-plots factor, or Factor A) is then applied to the whole plots, and these plots are replicated. Hence a nested design applies between Factor A and the plots, and a blocking design applies between Factor C and the plots. Factor A is also crossed with Factor C, and therefore there is a test of the interaction between the two main factors.


Note that if there are no replicate units of Factor C within each plot (that is, if each level of Factor C is applied only once to each plot), then the interaction of Factor C by Plots within Factor A represents the overall residual. However, since the plots are usually random factors, the effects of Factor A must be tested against the Plots within Factor A term.
If there are replicate units of Factor C within each plot, then the Factor A by Plots within Factor A by Factor C term is the overall residual.
Source (A, C fixed; B random)   F-ratio
A                               MSA / MSB
B                               MSB / MSResidual
C                               MSC / MSB:C
A:C                             MSA:C / MSB:C
B:C                             MSB:C / MSResidual

For example, a marine biologist investigated the effects of estuary flow on the abundance of mussels and barnacles, and whether these effects varied with tidal height. A split-plot design was used whereby randomly selected estuary sites were the plots, six tidal heights (from 0 to 3.6 m) were applied within each site (the within-plot effect), and each site was of either high or low flow (the between-plots effect). Diagram illustrates the layout of a split-plot design. There are three treatments (levels of Factor C, fill patterns for squares) applied within each plot (ovals). One of two treatments (levels of Factor A, fill for ovals) was applied to whole plots. Note that treatments for both Factor A and C are replicated and fully crossed.
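An analysis of this kind can be sketched in R by placing the plots in an error stratum, which produces the F-ratios tabulated above (all names and simulated values here are hypothetical):

```r
# hypothetical split-plot data: 6 sites (plots), 3 of each flow
# level (between-plots Factor A), with 6 tidal heights
# (within-plot Factor C) in every site
set.seed(1)
estuary <- data.frame(
  SITE    = factor(rep(1:6, each = 6)),
  FLOW    = factor(rep(c("high", "low"), each = 18)),
  TIDE    = factor(rep(1:6, 6)),
  MUSSELS = rnorm(36)
)
# FLOW is tested in the SITE stratum (i.e. against plots within
# flow); TIDE and FLOW:TIDE are tested in the within-site stratum
estuary.aov <- aov(MUSSELS ~ FLOW * TIDE + Error(SITE), data = estuary)
summary(estuary.aov)
```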

End of instructions

  Repeated Measures

Simple repeated measures designs are similar to randomized complete block (RCB) designs. In the same way that we could explain some of the residual variation by blocking, we can also reduce the residual variation by removing the between-replicate variation that exists at the start of the experiment. This is achieved by applying each treatment (the repeated measures) to the same subjects, rather than using different subjects for each treatment. In this way it is the same as an RCB. Where it differs from an RCB is that the treatments cannot all be applied to the blocks (subjects) at the same time. For example, it is not possible to determine the effects of different drugs on the heart rate of rats if each of the drugs is applied at the same time! Each of the treatments should be applied in random order; however, when the repeated factor is time, this is not possible. The subjects (blocks or plots) are always a random factor. If a second factor is applied at the level of the whole subject and over each of the repeated measures (e.g. subjects are of one of two species), the analysis then resembles a split-plot design, where this second factor is the between subjects effect.


An agricultural scientist was investigating the effects of fiber on the digestibility of pasture grasses by cows. Ten randomly selected cows of two different breeds (five of each) were each fed a random sequence of four food types (of different fiber levels). After a week of feeding on a single food type, the digestibility of the food was assessed. Individual cows represented the subjects. Breed of cow was the between subjects effect and food fiber level was the within subjects effect. Diagram illustrates the layout of a 2 factor repeated measures design. Rectangles represent subjects. Squares represent the within subject treatments (the random sequence represents the order in which each treatment is applied). Hence there are four repeated measures. There are two levels of the between subjects effect (pattern fill of rectangles).
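Since the subjects (cows) act as blocks, such an analysis can be sketched in R the same way as a split-plot (the data frame, variable names and simulated values below are hypothetical):

```r
# hypothetical repeated measures data: 10 cows (subjects), 2 breeds
# (between subjects), 4 fiber levels fed to every cow (within subjects)
set.seed(1)
cows <- data.frame(
  COW    = factor(rep(1:10, each = 4)),
  BREED  = factor(rep(c("A", "B"), each = 20)),
  FIBER  = factor(rep(1:4, 10)),
  DIGEST = rnorm(40)
)
# BREED is tested in the between-cow stratum; FIBER and
# BREED:FIBER are tested against the within-cow residual
cows.aov <- aov(DIGEST ~ BREED * FIBER + Error(COW), data = cows)
summary(cows.aov)
```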

End of instructions

  Nested ANOVA assumptions

In ANOVA, the assumptions of normality and variance homogeneity apply to the residuals for the test of interest. In this example, the nested factor (Factor B, SITE) is random. Therefore, the normality and homogeneity of variance assumptions for Factor A (SEASON) must be assessed from the distribution, mean and variance of the response variable (S4DUGES) for each level of Factor A, using the means of Factor B within each level of Factor A as the observations (since the error term for Factor A (SEASON) is the variation in the response variable between levels of Factor B within each level of Factor A, i.e. MSSITE). Calculate the mean for each level of Factor B (SITE), and use these means to calculate the mean, variance and distribution of each level of Factor A (SEASON).

# aggregate the data set by calculating the means for each level of the nesting factor
> name.ag <- aggregate(dataset, list(Name1=FactorA, Name2=FactorB), mean)
# OR perhaps more efficiently
> library(nlme)
> name.ag <- gsummary(dataset, groups=FactorB)
# use the aggregated means to produce the boxplots
> boxplot(dv~FactorA, name.ag)

where name.ag is a name given to the new aggregated data set, dataset is the name of the original data set, and Name1 and Name2 are names to appear in the aggregated data set that represent FactorA (the main factor) and FactorB (the nesting factor) respectively. Finally, dv is the name of the dependent variable.

This data set generates two boxplots, each constructed from only 3 values and thus of questionable value.

End of instructions

  Nested ANOVA assumptions

# fit the hierarchical linear model
> name.aov <- aov(dv~FactorA + Error(FactorB), data=data)
# print out the ANOVA strata
> summary(name.aov)

where name.aov is a name provided to store the fitted model, dv is the dependent variable, FactorA is the main fixed factor and FactorB is the random nested factor. data is the name of the data set. The 'Error()' argument defines additional error (residual) strata; hence the above template indicates that FactorA should be tested against FactorB.
For example:

# fit the hierarchical linear model
> curdies.aov <- aov(S4DUGES~SEASON + Error(SITE), data=curdies)
# print out the ANOVA strata
> summary(curdies.aov)

In recognition that FactorB is a random nested factor, and thus of little interest, the F-ratio for FactorB is not provided. Since the effect of FactorB is tested against the overall residual (error) term, to obtain the F-ratio for the test of FactorB, fit the model without defining the additional error stratum. Note that when you do this, all terms are tested against the residual term and therefore the tests of any fixed factors above the random factors (in the hierarchy) will be incorrect.

> curdies.aov1 <- aov(S4DUGES~SEASON + SITE, data=curdies)
> summary(curdies.aov1)

End of instructions

  Two-factor ANOVA

> name.aov <- aov(dv~factor1 * factor2, data=data)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set).

End of instructions

  R Compound symmetry/sphericity

# source Murray's syntax for dealing with sphericity
> library(biology)
# after fitting a randomized block, repeated measures or split-plot
> epsi.GG.HF(name.aov)

where name.aov is the name of the fitted randomized block, repeated measures or split-plot model. The epsi.GG.HF function returns a table of epsilon values corresponding to each of the model terms.

End of instructions

  Fitting linear contrasts to repeated measures

# Define the contrast coefficients for the factorial variable
> contrasts(factor) <- cbind(c(numeric list),...)
# Confirm that they are orthogonal
> round(crossprod(contrasts(factor)),2)
# Define a list of associated labels for the factorial variables defined contrasts
> name.list <- list(factor=list("Label"=1, ...))

where factor is the name of the factorial variable and numeric list is a set of contrast coefficients separated by commas (e.g. 1,0,-1 or 2,0,-1,-1; there should be as many contrast coefficients as there are levels of the factor, and the numbers should add up to 0). name.list is a list of labels associated with the defined contrasts; this list is used to label the terms in the ANOVA table. For example:

# Define the contrast coefficients for factorial variable
> contrasts(partridge$GROUP) <- cbind(c(0,0,0,-1,1), c(-2,-2,-2,3,3))
# Confirm that they are orthogonal
> round(crossprod(contrasts(partridge$GROUP)), 2)
# Define a list of associated labels for the factorial variables defined contrasts
> partridge.list <- list(GROUP=list("8virg vs 1virg"=1, "Partners vs Controls"=2))

End of instructions

  ANOVA output with planned contrasts

> summary(name.aov, split=list)

where name.aov is the name of the fitted linear model and list is the list of labels associated with the defined contrasts. For example:

> summary(partridge.aov, split=partridge.list)

End of instructions

  Saving graphs

1. With the R Graphics Window active, Select the File menu
2. Select the save as submenu
3. Select the Jpeg submenu
4. Select the 100% quality... submenu

The Jpeg file dialog box will be displayed.
5. Provide a filename and path for the picture
6. Click the Save button

End of instructions

  Importing graphs

1. Open up the other application (such as Word), or if it is already open, switch control (focus) to the other application using either Alt-tab or the Windows navigation buttons
2. Select the Insert menu. (in Word)
3. Select the Picture submenu. (in Word)
4. Select the From file submenu. (in Word)

The Insert Picture dialog box will be displayed.
5. Locate the picture file and click the Insert button.
6. Done! Just add a caption.

End of instructions

  Fitting linear contrasts to repeated measures

# source Murray's biology library
> library(biology)
# define polynomial contrasts
> contrasts(dataset$Repeated) <- 'contr.poly'
> name.aov <- aov(dv~Error(subject/C(Repeated,poly,1)) + C(Repeated,poly,1), data=dataset)
> AnovaM(name.aov, split=list(Repeated=list('Linear'=1, 'Quad'=2,...)))

where dv is the name of the numeric vector (dependent variable), Repeated is the name of the repeated factor within the dataset data frame and subject is the name of the blocking factor.
For example

> library(biology)
> contrasts(driscoll$YEAR) <- 'contr.poly'
> driscoll.aov <- aov(CALLS~Error(BLOCK/C(YEAR,poly,1)) + C(YEAR,poly,1), data=driscoll)
> AnovaM(driscoll.aov, split=list(YEAR=list('Linear'=1, 'Quad'=2)))

End of instructions

  Linear regression diagnostics

> plot(name.lm)

where name.lm is the name of a fitted linear model. You will be prompted to hit return to cycle through a series of four diagnostic plots. The first (and most important of these) is the Residual Plot.

End of instructions

  Residual plot

There is a residual for each point. The residuals are the differences between the predicted Y-values (points on the least squares regression line) for each X-value and the observed Y-values for each X-value. A residual plot plots the residuals (Y-axis) against the predicted Y-values (X-axis). Plot a) shows a uniform spread of data around the least squares regression line and thus there is an even scatter of points in the residual plot (Plot b). In Plot c) the deviations of points from the predicted Y-values become greater with increasing X-values and thus there is a wedge-shaped pattern in the residuals. Such a wedge suggests a violation of the homogeneity of variance assumption, and thus the regression may be unreliable.



End of instructions

  Regression output

#print a summary of the linear model coefficients
> summary(name.lm)
#print a summary of the ANOVA table
> anova(name.lm)

where name.lm is the name of a fitted linear model.

End of instructions

  ANOVA table

#print a summary of the ANOVA table
> anova(name.aov)

where name.aov is the name of a fitted linear model.

End of instructions

  Copy and Paste text

  1. Highlight the text (table) you want to copy
  2. Select the Edit.. menu..
  3. Select the Copy.. submenu..
  4. Switch focus to the program you want to paste the text into (e.g. Microsoft Word)
  5. Select the Edit.. menu. (in Word)
  6. Select the Paste.. submenu. (in Word)
  7. To format the table in Word
    1. Add the word 'Source' to the top left hand corner of the first row of the ANOVA table to ensure that each table column has a heading, and remove the stars


    2. Add commas after each entry in the table (except the last entry in each row) and remove any spaces between entries. This step defines the columns of the table, so it is important that the commas are placed in the correct positions. Note that 'Sum' and 'Sq' are not separate entries; the entry is 'Sum Sq'.

      1. Highlight the table in Word
      2. Select the Table menu. (in Word)
      3. Select the Convert submenu. (in Word)
      4. Select the Text to Table submenu. (in Word)
      5. The Convert Text to Table dialog box will appear

      6. Ensure that Commas is selected as the text separator
      7. Click the button
      8. The Table AutoFormat dialog box will appear

      9. Select the Table Classic 1 Table style from the Tables styles list
      10. Unselect the First column, Last row and Last column checkboxes
      11. Click the button

End of instructions

  Fully factorial ANOVA assumption checking

The assumptions of normality and homogeneity of variance apply to each of the factor level combinations, since it is the replicate observations of these combinations that are the test residuals. If the design has two factors (IV1 and IV2) with three and four levels (groups) respectively, then a boxplot for each of the 3x4=12 combinations needs to be constructed. It is recommended that a variable (called GROUP) be set up to represent the combination of the two categorical (factor) variables.

Simply construct a boxplot with the dependent variable on the y-axis and GROUP on the x-axis. Visually assess whether the data from each group are normal (or at least that the groups are not all consistently skewed in the same direction), and whether the spread of data in each group is similar (or at least not related to the mean for that group). The GROUP variable can also assist in calculating the mean and variance for each group, to further assess variance homogeneity.
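One way to set up the GROUP variable is with R's interaction() function (the data frame and variable names here are hypothetical):

```r
# hypothetical data set with two crossed factors (3 x 4 levels)
set.seed(1)
dataset <- data.frame(
  IV1 = factor(rep(c("a", "b", "c"), each = 8)),
  IV2 = factor(rep(c("w", "x", "y", "z"), 6)),
  DV  = rnorm(24)
)
# combine the two factors into a single 12-level factor
dataset$GROUP <- interaction(dataset$IV1, dataset$IV2)
# one boxplot per factor level combination
boxplot(DV ~ GROUP, data = dataset)
# group means and variances to further assess variance homogeneity
tapply(dataset$DV, dataset$GROUP, mean)
tapply(dataset$DV, dataset$GROUP, var)
```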

End of instructions

  Interaction plot

Interaction plots display the degree of consistency (or lack thereof) of the effect of one factor across each level of another factor. Interaction plots can be either bar or line graphs; however, line graphs are more effective. The x-axis represents the levels of one factor, and a separate line is drawn for each level of the other factor. The following interaction plots represent two factors, A (with levels A1, A2, A3) and B (with levels B1, B2).


The parallel lines of the first plot clearly indicate no interaction. The effect of factor A is consistent for both levels of factor B and vice versa. The middle plot demonstrates a moderate interaction and the bottom plot illustrates a severe interaction. Clearly, whether or not there is an effect of factor B (e.g. B1 > B2 or B2 > B1) depends on the level of factor A. The trend is not consistent.

End of instructions

  Multifactor ANOVA

Statistical models that incorporate more than one categorical predictor variable are broadly referred to as factorial (multifactor) analysis of variance. There are two main reasons for the inclusion of multiple factors:

  1. To attempt to reduce the unexplained (or residual) variation in the response variable and therefore increase the sensitivity of the main effect test.
  2. To examine the interaction between factors. That is, whether the effect of one factor on the response variable is dependent on other factor(s) (consistent across all levels of other factor(s)).

End of instructions

  Two factor ANOVA

Statistical models that incorporate more than one categorical predictor variable are broadly referred to as factorial (multifactor) analysis of variance. There are two main reasons for the inclusion of multiple factors:

  1. To attempt to reduce the unexplained (or residual) variation in the response variable and therefore increase the sensitivity of the main effect test.
  2. To examine the interaction between factors. That is, whether the effect of one factor on the response variable is dependent on other factor(s) (consistent across all levels of other factor(s)).

Fully factorial linear models are used when the design incorporates two or more factors (independent, categorical variables) that are crossed with each other. That is, all combinations of the factors are included in the design and every level (group) of each factor occurs in combination with every level of the other factor(s). Furthermore, each combination is replicated.

In fully factorial designs both factors are equally important (potentially), and since all factors are crossed, we can also test whether there is an interaction between the factors (does the effect of one factor depend on the level of the other factor(s)). Graphs above depict a) interaction, b) no interaction.


For example, Quinn (1988) investigated the effects of season (two levels, winter/spring and summer/autumn) and adult density (four levels: 8, 15, 30 and 45 animals per 225 cm2) on the number of limpet egg masses. As the design was fully crossed (all four densities were used in each season), he was also able to test for an interaction between season and density. That is, was the effect of density consistent across both seasons, and was the effect of season consistent across all densities?

Diagram shows layout of 2 factor fully crossed design. The two factors (each with two levels) are color (black or gray) and pattern (solid or striped). There are three experimental units (replicates) per treatment combination.
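A design of this kind translates directly into the two-factor aov() template given earlier; the sketch below uses hypothetical names and simulated counts purely for illustration:

```r
# hypothetical fully crossed data: 2 seasons x 4 densities, n = 3
set.seed(1)
quinn <- data.frame(
  SEASON  = factor(rep(c("winter/spring", "summer/autumn"), each = 12)),
  DENSITY = factor(rep(rep(c(8, 15, 30, 45), each = 3), 2)),
  EGGS    = rpois(24, 5)
)
# the * operator fits both main effects and the interaction
quinn.aov <- aov(EGGS ~ SEASON * DENSITY, data = quinn)
anova(quinn.aov)
```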


End of instructions

  Tukey's Test

# load the 'multcomp' package
> library(multcomp)
# perform Tukey's test
> summary(glht(data.aov, linfct=mcp(factor="Tukey")))

where data.aov is the name of the fitted ANOVA model and factor is the name of the factor (categorical or factor variable) whose levels are to be compared. The argument mcp(factor="Tukey") indicates that Tukey's p-value adjustments should be made for all pairwise comparisons of the levels of factor.


The coefficients list the specific comparisons made. Each line represents a specific hypothesis test including difference between groups, t-value, raw p value and Tukey's adjusted p value (the other values presented are not important for BIO3011).

End of instructions

  Tukey's Test

# load the 'multcomp' package
> library(multcomp)
# perform Tukey's test
> summary(simtest(dv~factor1*factor2, whichf='factor', data=data, type="Tukey"))
# OR the more up to date method
> summary(glht(data.aov, linfct=mcp(factor="Tukey")))

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables), and data is the name of the data frame (data set). The argument type="Tukey" indicates that Tukey's p-value adjustments should be made. When the linear model contains multiple categorical predictor variables, the argument whichf='' must be used to indicate which factor the Tukey's test should be performed on.

The coefficients list the specific comparisons made. Each line represents a specific hypothesis test including difference between groups, t-value, raw p value and Tukey's adjusted p value (the other values presented are not important for BIO3011).

End of instructions

  Simple Main Effects

> library(biology)
> AnovaM(mainEffects(quinn.aov, at=SEASON=='Autumn'), type="III")

To subset, type the name of the other categorical variable followed by == (two equals signs) and the name of one of the levels of this factor (surrounded by single quotation marks). For example, if you had a categorical variable called TEMPERATURE consisting of three levels (High, Medium and Low), to perform an ANOVA using only the High temperature data, the subset expression would be TEMPERATURE=='High'.

End of instructions

  Convert numeric variable into a factor

> data$var <- as.factor(data$var)

where data is the name of the data set and var is the name of the variable to be made into a factor. While there is no visible evidence that the data are any different, R will now treat the variable as a factor rather than as a numeric variable.

End of instructions

  Aggregate a data set

> name.ag <- aggregate(dataset, list(NAME=FactorA, NAME1=FactorB), mean)
# OR
> library(nlme)
> name.ag <- gsummary(dataset, groups=dataset$FactorB)

where name.ag is a name you provide for the aggregated data set, dataset is the name of the data set, NAME is the name for the first factor in the resulting data set, FactorA is the name of the main factor, NAME1 is the name for the second factor in the resulting data set and FactorB is the name of the plot (blocking) factor.
For example:

> cu.ag <- aggregate(copper, list(COPPER=copper$COPPER, PLATE=copper$PLATE), mean)
# OR
> library(nlme)
> cu.ag <- gsummary(copper, groups=copper$PLATE)

End of instructions

  Boxplot from aggregated data set

Having generated an aggregated data set, it is possible to use the aggregated means to test the assumptions of normality and homogeneity of variance.

> boxplot(name.ag$dv~name.ag$FactorA)

where name.ag is a name of the aggregated data set, dv is the name of the dependent variable and FactorA is the name of the main factor.
For example:

> boxplot(cu.ag$WORMS ~ cu.ag$COPPER)

End of instructions

  R Split-plot

# Fit the linear model with the different error strata
> name.aov <- aov(dv ~ FactorA + FactorC + FactorA:FactorC + Error(FactorB), data=dataset)
> AnovaM(name.aov)
# alternatively, if sphericity is an issue
> AnovaM(name.aov, RM=T)

where name.aov is a name to provide the fitted linear model, dv is the name of the dependent variable, FactorA is the name of the between plots factor (Factor A), FactorB is the name of the block factor (random) and FactorC is the name of the within plots factor. dataset is the name of the data set.
For example:

> copper.aov <- aov(WORMS ~ COPPER + DIST + COPPER:DIST + Error(PLATE), data=copper)
> AnovaM(copper.aov, RM=T)

End of instructions

  R Interaction plot

> interaction.plot(dataset$Factor1, dataset$Factor2, dataset$dv)

where Factor1 is the name of the categorical variable (factor) with the most levels (this will be used for the x-axis), Factor2 is the name of the other factorial variable (this will be used to produce the different lines) and dv is the name of the dependent variable within the dataset data set.
For example:

> interaction.plot(copper$DIST, copper$COPPER, copper$WORMS)

End of instructions

  R Randomized Block

# Fit the linear model with the different error strata
> name.aov <- aov(dv ~ FactorA + Error(FactorB), data=dataset)
> AnovaM(name.aov)

where name.aov is a name to provide the fitted linear model, dv is the name of the dependent variable, FactorA is the name of the within block factor (Factor A), FactorB is the name of the blocking factor (random) and dataset is the name of the data set.
For example:

> tobacco.aov <- aov(NUMBER ~ TREATMENT + Error(LEAF), data=tobacco)
> AnovaM(tobacco.aov)
# OR if sphericity is likely to be an issue
> AnovaM(tobacco.aov, RM=T)

where tobacco.aov is the name of the fitted model; the RM=T argument is used when sphericity is likely to be an issue.

End of instructions

  Plot of Means

  1. Select the Graphs.. menu..
  2. Select the Plot Means.. submenu..
The Plot Means dialog box will appear
  1. Select the factorial variables from the Factors list (do not select GROUP - this was only for assumption checking!)
  2. Select the dependent variable from the Response Variable list
  3. Click the
The interaction plot (means plot) will appear in the RGui graphics window.

End of instructions

  Symbols on bargraphs

Add symbols to the bars such that common symbols signify non-significant comparisons and different symbols signify significant differences.

> Mbargraph(day$BARNACLE, day$TREAT, symbols=c('AB','A','BC','C'))

where day$BARNACLE is the dependent variable, day$TREAT is the factor and the symbols argument supplies the list of labels to add above the bars. Note, there should be as many labels as there are bars.

In the following graph the mean of group a was found to be significantly different from the means of both groups b and c whilst the means of groups b and c were not found to be significantly different from one another.


End of instructions

  Reordering factor levels

> dv <- factor(dv, levels=c(ordered levels))

where dv is the name of the factor (categorical variable) and ordered levels is a list of the factor levels listed in the preferred order.
Note that the data will not appear to have changed at all, it is only during the presentation of results and graphs that reordering the factor levels makes a difference.

End of instructions

  Planned Comparisons

# Define the contrast coefficients for factorial variable
> name.cont <- list(factor = c(numeric list))
# Define the list of associated labels for the factorial variable
> name.list <- list(factor =list('label'=1))
# Fit the linear model using the defined contrasts
> name.aov <- aov(dv~factor, data = data, contrasts = name.cont)
# Print out the ANOVA table incorporating the define contrasts
> summary(name.aov, split = name.list)

where name.cont is a name provided to call the list of contrast coefficients, factor is the name of the factorial variable, numeric list is a set of contrast coefficients separated by commas (e.g. 1,0,-1 or 2,0,-1,-1; there should be as many contrast coefficients as there are levels of the factor, and the numbers should add up to 0). name.list is a list of labels (label) associated with the list of contrast coefficients. Note that as it is difficult to define multiple planned comparisons at once, the list of labels will only consist of a single label. This list is used to label the terms in the ANOVA table. name.aov is a name provided for the fitted ANOVA (linear) model, dv is the dependent variable and data is the data frame (data set). For example:

# Define the contrast coefficients for factorial variable
> partridge.cont <- list(GROUP = c(1,0,0,0,-1))
# Define the list of associated labels for the GROUP factor
> partridge.list <- list(GROUP = list("1 vs 5" = 1))
# Fit the linear model using the defined contrasts
> partridge.aov <- aov(LONGEVITY~GROUP, data = partridge, contrasts = partridge.cont)
# Print out the ANOVA table incorporating the define contrasts
> summary(partridge.aov, split = partridge.list)

End of instructions

  Fixed vs Random Factors

Fixed factors are factors whose levels represent the specific populations of interest. For example, a factor that comprises 'high', 'medium' and 'low' temperature treatments is a fixed factor - we are singularly interested in comparing those three populations. Conclusions about the effects of a fixed factor are restricted to the specific treatment levels investigated and for any subsequent experiments to be comparable, the same specific treatments of the factor would need to be used.

In contrast, Random factors are factors whose levels are randomly chosen from all the possible levels or populations and are used as random representatives of the populations. For example, five random temperature treatments could be used to represent a full spectrum of temperature treatments. In this case, conclusions are extrapolated to all the possible treatment (temperature) levels and for subsequent experiments, a new random set of treatments of the factor would be selected.

End of instructions

  Specific comparisons of means

Following a significant ANOVA result, it is often desirable to specifically compare group means to determine which groups are significantly different. However, multiple comparisons lead to two statistical problems. Firstly, multiple significance tests increase the Type I error rate (α, the probability of falsely rejecting H0). E.g., testing for differences between 5 groups requires ten pairwise comparisons. If the α for each test is 0.05 (5%), then the probability of at least one Type I error for the family of 10 tests is 0.4 (40%). Secondly, the outcome of each test needs to be independent (orthogonality). E.g. if A>B and B>C then we already know the result of A vs. C.
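These figures can be verified directly in R:

```r
# number of pairwise comparisons among 5 groups
choose(5, 2)
# probability of at least one Type I error across 10 independent
# tests, each conducted at alpha = 0.05 (approximately 0.40)
1 - (1 - 0.05)^10
```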

Post-hoc unplanned pairwise comparisons (e.g. Tukey's test) compare all possible pairs of group means and are useful in an exploratory fashion to reveal major differences. Tukey's test controls the family-wise Type I error rate to no more than 0.05. However, this reduces the power of each of the pairwise comparisons, and only very large differences are detected (a consequence that is exacerbated with an increasing number of groups).

Planned comparisons are specific comparisons that are usually planned during the design stage of the experiment. No more than p-1 (where p is the number of groups) comparisons can be made; however, each comparison (provided it is orthogonal to the others) can be tested at α = 0.05. Amongst all possible pairwise comparisons, specific comparisons are selected, while others that are meaningless (within the biological context of the investigation) are ignored.

Planned comparisons are defined using what are known as contrasts coefficients. These coefficients are a set of numbers that indicate which groups are to be compared, and what the relative contribution of each group is to the comparison. For example, if a factor has four groups (A, B, C and D) and you wish to specifically compare groups A and B, the contrast coefficients for this comparison would be 1, -1, 0,and 0 respectively. This indicates that groups A and B are to be compared (since their signs oppose) and that groups C and D are omitted from the specific comparison (their values are 0).

It is also possible to compare the average of two or more groups with the average of other groups. For example, to compare the average of groups A and B with the average of group C, the contrast coefficients would be 1, 1, -2, and 0 respectively. Note that 0.5, 0.5, -1 and 0 would be equivalent.

The following points relate to the specification of contrast coefficients:

  1. The sum of the coefficients should equal 0
  2. Groups to be compared should have opposing signs
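Both points are easy to check in R for the example contrasts above (A vs B, and the average of A and B vs C):

```r
# contrast comparing group A with group B (C and D omitted)
c1 <- c(1, -1, 0, 0)
# contrast comparing the average of A and B with group C (D omitted)
c2 <- c(1, 1, -2, 0)
# each set of coefficients sums to 0
sum(c1)
sum(c2)
# the two contrasts are also orthogonal (their cross product is 0)
sum(c1 * c2)
```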

End of instructions

  Copy and Paste graphs

  1. Graph window, click the right mouse button
  2. Select either the copy as metafile menu (if you intend to modify/edit the graph after it is pasted into another program) or copy as bitmap (if you don't intend to modify the graph after it is pasted into another program)
  3. Switch control (focus) to the other program (such as Microsoft Word) using either Alt-tab or the Windows navigation buttons
  4. Select the Edit.. menu. (in Word)
  5. Select the Paste.. submenu. (in Word)
  6. Switch focus back to RGui once you have finished and have produced a caption.

End of instructions

  ANOVA output

> anova(day.aov)

where day.aov is the name of the fitted ANOVA (linear) model.

End of instructions

  R2 value

The R2 value is the proportion of the total variability in the dependent variable that can be explained by its linear relationship with the independent variable, and is thus a measure of the strength of the relationship. The R2 value ranges from 0 (no relationship) to 1 (extremely tight relationship). The R2 value for the data represented in the following relationship would be 1 - that is, all of the variation in DV can be explained by a change in IV.

The R2 value for the data represented in the next relationship, however, would be substantially less than 1 - whilst some of the variation in DV can be explained by a change in IV, changes in IV alone do not account for all the changes in DV.
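In R, the R2 value is reported by summary() of a fitted linear model; the sketch below uses hypothetical simulated data:

```r
# simulate a strong (but not perfect) linear relationship
set.seed(1)
IV <- 1:20
DV <- 2 * IV + rnorm(20, sd = 0.5)
IV.lm <- lm(DV ~ IV)
# proportion of the variation in DV explained by IV (close to 1)
summary(IV.lm)$r.squared
```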


End of instructions

  R2 value

Essentially you need to reconstruct the linear equation (y=mx+c) that relates the dependent variable with the independent variable (including the estimated slope and y-intercept), substitute the new x value, and solve for y. Remember that if any transformations were used, then the transformation functions (eg log) need to be incorporated into the linear equation.
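A sketch of this in R, letting predict() do the substitution rather than solving by hand (all variable names and values here are hypothetical):

```r
set.seed(1)                                   # hypothetical data
iv <- 1:10
dv <- 3 * iv + 2 + rnorm(10)                  # underlying line: y = 3x + 2
fit <- lm(dv ~ iv)
coef(fit)                                     # c (intercept) and m (slope)
predict(fit, newdata = data.frame(iv = 5.5))  # solve y = mx + c at x = 5.5
```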

End of instructions

  Randomized complete block (RCB) designs

These are designs where one factor is a blocking variable (Factor B) and the other variable (Factor A) is the main treatment factor of interest. These designs are used in situations where the environment is potentially heterogeneous enough to substantially increase the variability in the response variable and thus obscure the effects of the main factor. Experimental units are grouped into blocks (units of space or time that are likely to have less variable background conditions) and then each level of the treatment factor is applied to a single unit in each block. By grouping the experimental units together into blocks, some of the total variability in the response variable that is not explained by the main factor may be explained by the blocks. If so, the residual (unexplained) variation will be reduced, resulting in more precise estimates of parameters and more powerful tests of treatments.

For example, a plant ecologist was investigating the effects of leaf damage on plant chemical defenses. Since individual trees vary greatly in their degree of chemical defense, she applied each of three artificial leaf damage treatments (heavy, medium and light damage) to one of three branches within each of eight randomly selected trees. Trees were the blocks. The diagram shows the layout of a blocking design. The main factor has three treatments (pattern fills), each of which is applied randomly within 3 randomly selected blocks. The decision to include a blocking factor (rather than a single factor, completely randomized design) depends on whether the experimental units (replicates) within a block are likely to be more similar to one another than those between blocks, and whether the reduction in MSResidual is likely to compensate for the loss of residual df.

End of instructions

  Sample statistics

1. Select the Statistics menu..
2. Select the Summaries.. submenu..
3. Select the Numerical Summaries.. submenu..
The Numerical Summaries dialog box will be displayed.
4. Select the variable for which the summary statistic(s) are to be generated
5. Select the appropriate statistic(s)
6. Click OK

End of instructions

  Observations, variables & populations

Observations are the sampling units (e.g. quadrats) or experimental units (e.g. individual organisms, aquaria) that make up the sample.

Variables are the actual properties measured by the individual observations (e.g. length, number of individuals, rate, pH, etc). Random variables (those variables whose values are not known for sure before the onset of sampling) can be either continuous (can theoretically be any value within a range, e.g. length, weight, pH, etc) or categorical (can only take on certain discrete values, such as counts - number of organisms etc).

Populations are defined as all the possible observations that we are interested in.

A sample represents a collected subset of the population's observations and is used to represent the entire population. Sample statistics are the characteristics of the sample (e.g. sample mean) and are used to estimate population parameters.


End of instructions

  Standard error and precision

A good indication of how reliable a single estimate is likely to be is how precise that estimate is. Precision is a measure of how repeatable an outcome is. If we could repeat the sampling regime multiple times and each time calculate the sample mean, we could examine how similar the sample means are. A measure of precision is therefore the degree of variability between the individual sample means from repeated sampling events.

Sample number           Sample mean
1                       12.1
2                       12.7
3                       12.5
Mean of sample means    12.433
SD of sample means      0.306

The table above lists three sample means and also illustrates a number of important points;
  1. Each sample yields a different sample mean
  2. The mean of the sample means should be the best estimate of the true population mean
  3. The more similar (consistent) the sample means are, the more precise any single estimate of the population mean is likely to be
The standard deviation of the sample means from repeated sampling is called the Standard error of the mean.

It is impractical to repeat the sampling effort multiple times. However, it is possible to estimate the standard error (and therefore the precision of any individual sample mean) using the standard deviation (SD) of a single sample and the size (n) of this single sample: SE = SD/√n.
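A minimal sketch of this estimate in R (the sample values below are hypothetical):

```r
y <- c(11.8, 12.3, 12.9, 12.1, 12.6)  # a single hypothetical sample
se <- sd(y) / sqrt(length(y))         # SE = SD / sqrt(n)
se                                    # approximately 0.19
```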

The smaller the standard error (SE) of the mean, the more precise (and possibly more reliable) the estimate of the mean is likely to be.

End of instructions

  Confidence intervals


A 95% confidence interval is an interval that we are 95% confident will contain the true population mean. There is less than a 5% chance that this interval will not contain the true population mean, and therefore it is very unlikely that the interval excludes the true mean. The frequentist approach to statistics dictates that if multiple samples are collected from a population and an interval is calculated for each sample, 95% of these intervals will contain the true population mean and 5% will not. Therefore there is a 95% probability that any single sample interval will contain the population mean.

The interval is expressed as the mean ± half the interval. The confidence interval is obviously affected by the degree of confidence. In order to have a higher degree of confidence that an interval is likely to contain the population mean, the interval would need to be larger.
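In R, a t-based 95% confidence interval for a mean can be extracted from t.test(); a sketch with hypothetical sample values, also showing how a higher degree of confidence widens the interval:

```r
y <- c(11.8, 12.3, 12.9, 12.1, 12.6)       # hypothetical sample
t.test(y)$conf.int                         # 95% confidence interval
t.test(y, conf.level = 0.99)$conf.int      # 99% interval is wider
```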

End of instructions

  Data transformations


Essentially transforming data is the process of converting the scale in which the observations were measured into another scale.

I will demonstrate the principles of data transformation with two simple examples.
Firstly, to illustrate the legality and frequency of data transformations, imagine you had measured water temperature in a large number of streams. Naturally, you would probably have measured the temperature in °C. Suppose you later wanted to publish your results in an American journal and the editor requested that the results be in °F. You would not need to re-measure the stream temperatures. Rather, each of the temperatures could be converted from one scale (°C) to the other (°F). Such transformations are very common.

To further illustrate the process of transformation, imagine a botanist wanted to examine the size of a particular species' leaves. The botanist decides to measure the length of a random selection of leaves using a standard linear, metric ruler, and the distribution of the sample observations, as represented by a histogram and boxplot, is as follows.


The growth rate of leaves might be expected to be greatest in small leaves and decelerate with increasing leaf size. That is, the growth rate of leaves might be expected to be logarithmic rather than linear. As a result, the distribution of leaf sizes using a linear scale might also be expected to be non-normal (log-normal). If, instead of using a linear scale, the botanist had used a logarithmic ruler, the distribution of leaf sizes may have been more like the following.


If the distribution of observations is determined by the scale used to measure the observations, and the choice of scale (in this case the ruler) is somewhat arbitrary (a linear scale is commonly used because we find it easier to understand), then it is justifiable to convert the data from one scale to another after the data has been collected and explored. It is not necessary to re-measure the data in a different scale. Therefore, to normalize the data, the botanist can simply convert the data to logarithms.

The important points in the process of transformations are;
  1. The order of the data has not been altered (a large leaf measured on a linear scale is still a large leaf on a logarithmic scale), only the spacing of the data
  2. Since the spacing of the data is purely dependent on the scale of the measuring device, there is no reason why one scale is more correct than any other scale
  3. For the purpose of normalization, data can be converted from one scale to another following exploration
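These points can be illustrated in R with simulated log-normal data (the 'leaf lengths' below are purely hypothetical):

```r
set.seed(1)
leaf <- rlnorm(100, meanlog = 2, sdlog = 0.5)  # right-skewed on a linear scale
logleaf <- log(leaf)                           # converted to a log scale
# the order of the observations is preserved; only the spacing changes
all(order(leaf) == order(logleaf))
hist(logleaf)                                  # approximately normal
```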

End of instructions

  Data transformations

1. Select the Data menu..
2. Select the Manage variables in active data set.. submenu..
3. Select the Compute new variable.. submenu..
The Compute new variable dialog box will be displayed.
4. Enter a name for a new variable
Note, this should be a unique name that is not already in the data set (otherwise it will be overwritten). It is good practice to simply prepend the name of the variable whose values are to be transformed with the name of the transformation function (e.g. logH - this reminds us that this new variable contains the log transformed H observations).
5. Enter the name of a transformation function with the name of the variable whose values are to be transformed as its sole argument (e.g. log10(H) - this performs a log (base 10) transformation of the H observations)
6. Click OK

7. Confirm the transformation by clicking the appropriate confirmation button when prompted

End of instructions

  Successful transformations


Since the objective of a transformation is primarily to normalize the data (although variance homogeneity and linearization may also be beneficial side-effects) the success of a transformation is measured by whether or not it has improved the normality of the data. It is not measured by whether the outcome of a statistical analysis is more favorable!

End of instructions

  Transformations

Biological data measured on a linear scale often follows a log-normal distribution. If, however, such data were collected using a logarithmic scale (for example the length of leaves measured using a logarithmic ruler rather than a standard linear ruler), the observations will be normally distributed. Linear scales are usually used because they are the easiest for the feeble human mind to comprehend, but other than that, no scale is more valid for making measurements than any other scale (linear, logarithmic, ...).
 Measurements collected in one scale can be easily converted to another scale, and the resulting observations would be as if they had been measured in the new scale from the start. Therefore, it is valid to convert observations from one scale to another (transformation) for the purpose of normalizing data.
 Note, transformations only alter the distances between observations, not the order of the observations

End of instructions

  Data transformations (Log)

1. Select the Data menu..
2. Select the Manage variables in active data set submenu..
3. Select the Compute new variable submenu..
The Compute New Variable dialog box will be displayed.
4. Enter the name of a new variable in the New variable name box. As this is a name to refer to a new variable, it should not be an existing name
Hint: if performing a log transform, use the existing variable name with an 'L' in front
5. In the Expression to compute enter a transformation (scaling) function with the name of the existing variable to be transformed as its sole argument (e.g. log10(DOC))
6. Click OK
A new variable will be created, and this variable will contain the transformed values of an existing variable

End of instructions

  Scatterplots

# load the 'car' library
> library(car)
# generate a scatterplot
> scatterplot(dv~iv,data=data)

where dv is the dependent variable, iv is the independent variable and data is the data frame (data set).

End of instructions

  Boxplot on scatterplot axes

Re-open the Scatterplot dialog box and set up your scatterplot again:
1. Select the independent variable for the x axis
2. Select the dependent variable for the y axis
3. Ensure that the Marginal boxplots option is selected
4. Click OK
A new scatterplot will be created that includes boxplots on the axes.

End of instructions

  Lowess smoother

Re-open the Scatterplot dialog box and set up your scatterplot again:
1. Select the independent variable for the x axis
2. Select the dependent variable for the y axis
3. Ensure that the Smooth line option is selected
4. Click OK
A new scatterplot will be created that includes a lowess smoother through the data.

End of instructions

  Plotting y against x


The statement 'plot variable1 against variable2' is interpreted as 'plot y against x'. That is variable1 is considered to be the dependent variable and is plotted on the y-axis and variable2 is the independent variable and thus is plotted on the x-axis.

End of instructions

  Scatterplot matrix (SPLOM)

# make sure the car package is loaded
> library(car)
> scatterplot.matrix(~var1 + var2 + var3, data=data, diagonal='boxplot')
# OR
> pairs(~var1 + var2 + var3, data=data)

where var1, var2 etc are the names of numeric vectors in the data data frame (data set)

End of instructions

  EDA - Homogeneity of variance

Many statistical hypothesis tests assume that the populations from which the sample measurements were collected are equally varied with respect to all possible measurements. For example, are the growth rates of one population of plants (the population treated with a fertilizer) more or less varied than the growth rates of another population (the control population treated with water only)? If they are, then the results of many statistical hypothesis tests that compare means may be unreliable. If the populations are equally varied, then the tests are more likely to be reliable.
Obviously, it is not possible to determine the variability of entire populations. However, if sampling is unbiased and sufficiently replicated, then the variability of samples should be directly proportional to the variability of their respective populations. Consequently, samples that have substantially different degrees of variability provide strong evidence that the populations are likely to be unequally varied.
There are a number of options available to determine the likelihood that populations are equally varied from samples.
1. Calculate the variance or standard deviation of each sample. If the variability of one sample is more than 2 times greater or less than that of the other sample(s), then there is some evidence for unequal population variability (non-homogeneity of variance)
2. Construct boxplots of the measurements for each population, and compare the lengths of the boxplots. The length of a symmetrical (and thus normally distributed) boxplot is a robust measure of the spread of values, and thus an indication of the spread of population values. Therefore if any of the boxplots are more than 2 times longer or shorter than the other boxplot(s), then there is again, some evidence for unequal population variability (non-homogeneity of variance)
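A sketch of both checks in R with simulated samples (means, standard deviations and sample sizes are all hypothetical):

```r
set.seed(1)
g1 <- rnorm(20, mean = 10, sd = 1)   # simulated sample from population 1
g2 <- rnorm(20, mean = 12, sd = 3)   # simulated, more variable population 2
var(g1); var(g2)    # a variance ratio well above 2 suggests non-homogeneity
boxplot(g1, g2)     # compare the lengths of the boxplots
```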


End of instructions

  Boxplots

1. Select the Graphs menu..
2. Select the Boxplot.. submenu..
The Boxplot dialog box will be displayed.
3. Select the variable for which the boxplot is to be generated
4. Click

The Groups dialog box will be displayed.
5. Select the appropriate grouping variable (factor) from the list. This variable should contain the two or more levels for which separate boxplots are to be constructed.
6. Click OK in the Groups dialog box

7. Click OK

End of instructions

  t test

The frequentist approach to hypothesis testing is based on estimating the probability of collecting the observed sample(s) when the null hypothesis is true. That is, how rare would it be to collect samples that displayed the observed degree of difference if the samples had been collected from identical populations. The degree of difference between two collected samples is objectively represented by a t statistic.

t = (ȳ1 - ȳ2) / √(s²/n1 + s²/n2)

where ȳ1 and ȳ2 are the sample means of groups 1 and 2 respectively, and the denominator √(s²/n1 + s²/n2) represents the degree of precision in the difference between the means, taking into account the degree of variability of each sample (s² is the pooled sample variance, and n1 and n2 are the sample sizes).
If the null hypothesis is true (that is, the means of population 1 and population 2 are equal), the degree of difference (t value) between an unbiased sample collected from population 1 and an unbiased sample collected from population 2 should be close to zero (0). It is unlikely that an unbiased sample collected from population 1 will have a mean substantially higher or lower than an unbiased sample from population 2. That is, it is unlikely that such samples could result in a very large (positive or negative) t value if the null hypothesis (of no difference between populations) was true. The question is, how large (either positive or negative) does the t value need to be before we conclude that the null hypothesis is unlikely to be true?

What is needed is a method by which we can determine the likelihood of any conceivable t value when the null hypothesis is true. This can be done via simulation. We can simulate the collection of random samples from two identical populations, calculate the t value for each pair of samples, and thereby build up the distribution of possible t values.

Let's say a silviculturalist was interested in comparing the size of Eucalyptus regnans seedlings grown under shade and full sun conditions. In this case we have two populations. One population represents all the possible E. regnans seedlings grown in shade conditions, and the other population represents all the possible E. regnans seedlings grown in full sun conditions. If we had grown 200 seedlings under shade conditions and 200 seedlings under full sun conditions, then these samples can be used to assess the null hypothesis that the mean size of an infinite number (population) of E. regnans seedlings grown under shade conditions is the same as the mean size of an infinite number (population) of E. regnans seedlings grown under full sun conditions. That is, that the population means are equal.

We can firstly simulate the sizes of 200 seedlings grown under shade conditions and another 200 seedlings grown under full sun conditions that could arise naturally when shading has no effect on seedling growth. That is, we can simulate one possible outcome when the null hypothesis is true and shading has no effect on seedling growth.
Now we can calculate the degree of difference between the mean sizes of seedlings grown under the two different conditions, taking into account natural variation (that is, we can calculate a t value using the formula from above). From this simulation we may have found that the mean sizes of seedlings grown in shade and full sun were 31.5cm and 33.7cm respectively and the degree of difference (t value) was 0.25. This represents only one possible outcome (t value). We now repeat this simulation process a large number of times (e.g. 1000) and generate a histogram (or more correctly, a distribution) of the t value outcomes that are possible when the null hypothesis is true.
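This simulation can be sketched in R; the population mean and standard deviation below are hypothetical, and the two populations are deliberately identical so that the null hypothesis is true:

```r
set.seed(1)
tvals <- replicate(1000, {
  shade <- rnorm(200, mean = 32, sd = 10)  # null true: identical populations
  sun   <- rnorm(200, mean = 32, sd = 10)
  t.test(shade, sun, var.equal = TRUE)$statistic
})
hist(tvals)            # distribution of t values when the null is true
mean(abs(tvals) > 2)   # proportion of |t| > 2 arising by chance alone
```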

It should be obvious that when the null hypothesis is true (and the two populations are the same), the majority of t values calculated from samples containing 200 seedlings will be close to zero (0) - indicating only small degrees of difference between the samples. However, it is also important to notice that it is possible (albeit less likely) to have samples that are quite different from one another (large positive or negative t values) just by pure chance (for example t values greater than 2).

It turns out that it is not necessary to perform these simulations each time you test a null hypothesis. There is a mathematical formula to estimate the t distribution appropriate for any given sample size (or more correctly, degrees of freedom) when the null hypothesis is true. In this case, the t distribution is for (200-1)+(200-1)=398 degrees of freedom.

At this stage we would calculate the t value from our actual observed samples (the 200 real seedlings grown under each of the conditions). We then compare this t value to our distribution of all possible t values (t distribution) to determine how likely our t value is when the null hypothesis is true. The simulated t distribution suggests that very large (positive or negative) t values are unlikely. The t distribution allows us to calculate the probability of obtaining a value greater than a specific t value. For example, we could calculate the probability of obtaining a t value of 2 or greater when the null hypothesis is true, by determining the area under the t distribution beyond a value of 2.
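In R, this area under the t distribution can be obtained with pt(); a sketch for t = 2 and 398 degrees of freedom:

```r
# probability of obtaining a t value of 2 or greater when the null is true
pt(2, df = 398, lower.tail = FALSE)   # approximately 0.023
```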

If the calculated t value was 2, then the probability of obtaining this t value (or one greater) when the null hypothesis is true is approximately 0.023 (or 2.3%). Since the probability of obtaining our t value or one greater (and therefore the probability that a sample of 200 seedlings grown in the shade would differ so much in size from a sample of 200 seedlings grown in full sun) is so low, we would conclude that the null hypothesis of no effect of shading on seedling growth is likely to be false. Thus, we have provided strong evidence that shading conditions do affect seedling growth!


End of instructions

  t-test hypothesis testing demonstration


1. Select the Demonstrations menu..
2. Select the t distribution submenu..

The t distribution demonstration dialog box will be displayed.

The default configuration is set up to collect two random samples (one containing 8 observations, the other 6) from normally distributed populations with means of 800 and standard deviations of 20.

The intention of this demonstration is to examine the range and frequency of outcomes that are possible when the null hypothesis is true (and the two populations to be compared are equal). That is, what levels of similarity between two random samples are likely and what levels are unlikely. In the frequentist approach to hypothesis testing, likelihood is measured by how often (or how frequently) an event occurs from a large number of opportunities.

Start off setting the Number of simulations to 1000. This will collect 1000 random samples from Population 1 and 2.

Assumptions - In simulating the repeated collection of samples from both male and female populations, we make a number of assumptions that have important consequences for the reliability of the statistical tests.

Firstly, the function used to simulate the collection of random samples of, say, 8 male fulmars (and 6 female fulmars) generates these samples from normal distributions with a given mean and standard deviation. Thus, each simulated t value and the distribution of t values from multiple runs represent the situation in which the samples are collected from populations that are normally distributed. Likewise, the mathematical t distribution also makes this assumption.

Furthermore, the sampling function used the same standard deviation for each collected sample. Hence, each simulated t value and the distribution of t values from multiple runs represent the situation in which the samples are collected from populations that are equally varied. Likewise, the mathematical t distribution also makes this assumption.

By altering the variability and degree of normality of the populations, you can see the effects that normality and homogeneity of variance have on how much a distribution of t values differs from the mathematical t-distribution. Violations of either of these two assumptions (normality and homogeneity of variance) have the real potential to compromise the reliability of the statistical conclusions and thus it is important for the data to satisfy these assumptions.

End of instructions

  Basic steps of Hypothesis testing


Step 1 - Clearly establish the statistical null hypothesis. Therefore, start off by considering the situation where the null hypothesis is true - e.g. when the two population means are equal

Step 2 - Establish a critical statistical criterion (e.g. alpha = 0.05)

Step 3 - Collect independent, unbiased samples

Step 4 - Assess the assumptions relevant to the statistical hypothesis test. For a t test:
  1.  Normality
  2.  Homogeneity of variance

Step 5 - Calculate the test statistic appropriate for the null hypothesis (e.g. a t value)


Step 6 - Compare observed test statistic to a probability distribution for that test statistic when the null hypothesis is true for the appropriate degrees of freedom (e.g. compare the observed t value to a t distribution).

Step 7 - If the observed test statistic is greater (in magnitude) than the critical value for that test statistic (based on the predefined critical criterion), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis, and therefore the null hypothesis is rejected; otherwise we conclude that there is insufficient evidence to reject the null hypothesis. Alternatively, we calculate the probability of obtaining the observed test statistic (or one of greater magnitude) when the null hypothesis is true. If this probability is less than our predefined critical criterion (e.g. 0.05), the null hypothesis is rejected; otherwise we conclude that there is insufficient evidence to reject it.
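The steps above can be sketched in R with simulated data (all group names, means and sample sizes here are hypothetical):

```r
set.seed(1)
grp <- factor(rep(c("control", "treatment"), each = 15))
dv  <- c(rnorm(15, mean = 10, sd = 2),   # simulated control group
         rnorm(15, mean = 13, sd = 2))   # simulated treatment group
boxplot(dv ~ grp)                        # step 4: assess normality & variance
t.test(dv ~ grp, var.equal = TRUE)       # steps 5-7: t value, df and p-value
```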

End of instructions

  Pooled variance t-test

> t.test(dv~factor,data=data,var.equal=T)

where dv is the name of the dependent variable, factor is the name of the categorical/factorial variable and data is the name of the data frame (data set). The argument var.equal=T indicates a pooled variances t-test

End of instructions

  Separate variance t-test

> t.test(dv~factor,data=data,var.equal=F)

where dv is the name of the dependent variable, factor is the name of the categorical/factorial variable and data is the name of the data frame (data set). The argument var.equal=F indicates a separate variances t-test

End of instructions

  Bar graphs

# load the 'biology' extension library written by Murray.
> library(biology)
# use the Mbargraph() function to construct a bargraph
> Mbargraph(furness$METRATE, furness$SEX, errorbars="se")

End of instructions

  Model I vs Model II Regression

Model I regression (fixed X regression)

The purpose of simple linear regression analysis is to estimate the slope and y-intercept of a straight line through a data set. The line represents the expected (and predicted) average values of Y for given values of X in the population, based on the collected sample. The line (and thus the slope and intercept) is typically the line that minimizes the vertical spread of the observations from the line. This process of fitting the line of best fit is called Ordinary Least Squares (OLS), and is depicted in figure OLS Y vs X (top left) below. This fitting process assumes that there is no uncertainty in the predictor (X) variable and therefore that the observed values of X are exactly as they are in the population (there is no measurement error nor natural variation).

Model I regression is the classical regression analysis and the form of analysis most commonly used. However, it is only a valid method of estimating the model parameters (slope and intercept) when the levels of the predictor variable are not measured, but rather are set by the researcher. For example, if we were interested in the effects of temperature on the growth rate of plant seedlings, we could analyse the data using model I regression provided the different temperatures were specifically set by us (eg. controlling temperatures in green houses or growth cabinets). If on the other hand, the plants had been grown under a range of temperatures that were not specifically set by us and we had simply measured what the temperatures were, model I regression would not be appropriate.

Note that if instead of regressing Y against X we regressed X against Y, the slope of the line of best fit would be different since the line is now the one that minimizes the horizontal spread of the observations from the line (see figure OLS X vs Y - top right). The OLS regressions of Y on X and X on Y represent the two extremes of the population slope estimates as they depict the situations in which all of the variation is in the vertical and horizontal axes respectively. Typically, we might expect some uncertainty in both X and Y variables, and thus the true trend line should lie somewhere between these two extremes (see the bottom right figure below).
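The difference between the two OLS fits can be seen directly in R; a sketch with simulated data in which both variables contain uncertainty (the true slope of 1 is hypothetical):

```r
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)                 # both variables vary around the trend
b.yx <- coef(lm(y ~ x))[2]         # slope minimizing vertical spread
b.xy <- 1 / coef(lm(x ~ y))[2]     # implied slope minimizing horizontal spread
c(b.yx, b.xy)                      # the true slope lies between these extremes
```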


Model II regression (random X regression)

Model II regression estimates the line of best fit by incorporating uncertainty in both Y and X variables, and is therefore more suitable for estimating model parameters (slope and intercept) from data in which both variables have been measured. There are at least three forms of model II regression and these differ by what assumptions they place on the relative degree of uncertainty in Y and X variables

  • Major Axis (MA)-this form of regression estimates the line of best fit by minimizing the perpendicular spread of observations from the line (see figure MA - center left). In doing so, it is assumed that the degree of uncertainty in both X and Y variables is the same. This assumption can really only hold when both variables are measured on the same scale or have the same range of values, since variables with higher values will typically have a higher degree of variance. The greater the difference in degrees of uncertainty, the more the MA line of fit will approach either of the OLS lines. Consequently, MA regression is of limited use.
  • Ranged Major Axis (Ranged MA)-attempts to address the shortcomings of MA regression, by first standardizing the variables by their ranges (thus equalizing their degrees of uncertainties) then performing MA regression before back-transforming the slope of the estimated line into the original units. See figure Ranged MA - center right below.
  • Reduced Major Axis (RMA)-is an alternative that estimates the line of best fit by minimizing the (right-angle) triangular areas bounded by the observed values and the line of best fit (see figure RMA - bottom left). Essentially, the slope of the RMA line is the geometric mean of the slopes of the OLS Y vs X and OLS X vs Y lines.
The bottom right figure depicts each of the regression lines through a data set. Note the following:
  1. All estimated lines pass through a central point. This point is the mean value of Y and the mean value of X. The different regression fitting procedures essentially just rotate the line of best fit (major axis) through the data set. They differ in the degree to which the major axis is rotated.
  2. That the OLS lines form the extreme estimated lines of best fit.
  3. The RMA line lies between the two OLS lines (its slope is the geometric mean of their slopes)
  4. Since for this example, the Y and X variables were measured on substantially different scales (where the degree of uncertainty in X is far greater than the degree of uncertainty in Y), the MA regression fitting procedure is equivalent to OLS X vs Y (which assumes all uncertainty is in X and none in Y)

Implications

Given that most instances of regression in biology are actually more suitable for model II regression, why are examples of model II in the literature so rare and why do most statistical software default to model I (OLS) regression procedures (most do not even offer model II procedures)? Is it that biologists are ignorant or slack? It turns out that as far as the main hypothesis of interest is concerned (that the population slope equals zero), it does not matter - RMA and OLS procedures give the same p-values (p-values cannot easily be computed for MA and Ranged MA procedures).

If the purpose of the regression is to generate a model that can be used for prediction (predicting new values of the dependent variable from new values of the independent variable), then only OLS fitted models are valid. The reason for this is that the new independent value represents a theoretical fixed value with no uncertainty. This new value must be used in a model that assumed no uncertainty in the independent (predictor) variable. The OLS procedure is the only one that minimizes the vertical spread of the observations from the line.

Therefore, if regression analysis is being performed for the purpose of investigating whether there is a relationship between two continuous variables and/or to generate a predictive model, then model I (OLS) regression is perfectly valid. Note, however, that if slopes and intercepts are reported, it is probably worth presenting both model I and model II slopes.

If on the other hand the nature of the relationship is important, and therefore it is important to get estimates of the slope and y-intercept that best represent the population relationship, then it is necessary to make the model II corrections. Good examples of where this is important are:
  • size scaling applications - for example modeling the relationship between body size and metabolic rate - as it is usually only the slope (scaling factor) that is of interest
  • when comparing relationship slopes across studies
As a final point, if the R2 value is very high, then estimated slopes will be very similar, irrespective of which line fitting procedure is used!


End of instructions

  Model II regression analysis

# make sure that the biology library is loaded
> library(biology)
# fit the model II regression
> summary(lm.II(log10(INDIV)~log10(AREA), data, type="RMA"))

where log10(INDIV) is the name of the dependent (response) variable, log10(AREA) is the name of the independent (predictor) variable, data is the name of the dataset and type="RMA" indicates the form of regression to fit. Other types possible are; "MA" and "OLS".

End of instructions

  Raw or transformed data?

Generally it is better to plot original (untransformed) data in any graphical summaries. Firstly, this allows people to examine the results without having to make any mental back-transformations. Secondly, transformations are usually only performed so as to satisfy the assumptions of a statistical test. A graph is not a statistical test. Rather, it is a summary, and as such has no such assumptions.

End of instructions

  Error bars on bargraphs

# make sure that the Hmisc library is loaded
> library(Hmisc)
# plot the error bars
> errbar(xs, means, means+ses, means-ses, add=T)

where xs is a vector indicating the x coordinates of where the error bars are to be located (this vector is returned by the barplot() function), means is a vector containing the group means and ses is the name of a vector containing the standard errors (or standard deviations).

End of instructions

  Output of t-test


The following outputs are based on simulated data sets;
1.  Pooled variance t-test for populations with equal (or nearly equal) variances

2.  Separate variance t-test for populations with unequal variances

End of instructions

  Paired t-test

> t.test(cat1, cat2, data=data, paired=T)

where cat1 and cat2 are two numeric vectors (variables) in the data data frame (data set). The argument paired=T indicates a paired t-test.

End of instructions

  Fitting Polynomial regression models


> name.lm <- lm(dv~iv+I(iv^2), data=data)

where name.lm is a name you provide for the output, dv is the name of the dependent variable, iv is the name of the independent variable and data is the name of the data frame (data set). The function I() essentially preserves the meaning of the object in brackets so that it is treated as a separate term in the model. The above template defines a second order polynomial (quadratic). To define higher order polynomials, add additional instances of iv raised to the appropriate power, each enclosed within an I() function.
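As a runnable sketch (simulated data; all object names here are made up for illustration):

```r
# Simulate data with a known quadratic (second order) relationship
set.seed(1)
iv <- seq(-2, 2, length = 30)
dv <- 1 + iv - 2 * iv^2 + rnorm(30, sd = 0.3)

# Second order polynomial (quadratic)
quad.lm <- lm(dv ~ iv + I(iv^2))
# Third order polynomial (cubic): one further I() term
cubic.lm <- lm(dv ~ iv + I(iv^2) + I(iv^3))
summary(quad.lm)
```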

End of instructions


  Fitting a power model (non-linear model)


For this example, I will illustrate the fitting of the power model, Species = α(Area)^β.

> name.nls <- nls(dv~alpha*(iv)^beta, data=data, start=list(alpha=0.1, beta=0.1))

where name.nls is a name you provide for the output, dv is the name of the dependent variable, iv is the name of the independent variable and data is the name of the data frame (data set).

In this case, alpha and beta are undefined and thus R understands that these represent coefficients that the model fitting procedure needs to estimate. Non-linear modelling is an iterative model fitting process in which different combinations of estimated coefficients are cleverly evaluated to yield the best possible model fit. However, you must supply a starting value for each of these coefficients, hence the start=list() argument.
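As a self-contained sketch (simulated data; the variable names and starting values are purely illustrative):

```r
# Simulate data following a power relationship: species = 3 * area^0.25
set.seed(1)
area <- runif(30, 1, 100)
species <- 3 * area^0.25 * exp(rnorm(30, sd = 0.05))

# Fit the power model; the start values need only be rough guesses
sim.nls <- nls(species ~ alpha * area^beta,
               start = list(alpha = 2, beta = 0.3))
coef(sim.nls)  # estimates should lie close to alpha = 3 and beta = 0.25
```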

End of instructions


  Testing the null hypothesis that β1=β2=...=0

Testing the null hypothesis that each of the slopes is equal to each other and to zero (β1=β2=...=0) is done by comparing the fit of the full model (Y=c+β1X1+β2X2) to the fit of a reduced model that depicts β1 and β2 both equalling zero (ie Y=c).
This specific null hypothesis is summarized when you summarize the linear model fit. It is the F-ratio etc given at the end of the output (highlighted in the picture below).

End of instructions

  Comparing the fit of two models

> anova(full.lm, reduced.lm)

where full.lm is the name of the full model and reduced.lm is the name of the reduced model (although it doesn't matter which model is put first - this only affects the sign of some of the resulting values - and you only take absolute values anyway).
A modified ANOVA table will be presented. The bottom line summarizes the results of comparing the two models. This line indicates the F-ratio, degrees of freedom and P-value for the comparison.

End of instructions

  Model balance

Balanced models are those models that have equal sample sizes for each level of a treatment. Unequal sample sizes are generally problematic for statistical analyses as they can be the source of unequal variances. They are additionally problematic for multifactor designs because of the way that total variability is partitioned into the contributions of each of the main effects as well as the contributions of interactions (ie the calculations of sums of squares and degrees of freedom). Sums of squares (and degrees of freedom) are usually (type I SS) partitioned additively. That is:
SSTotal=SSA+SSB+SSAB+SSResidual, and
dfTotal=dfA+dfB+dfAB+dfResidual.
In practice, these formulas are used in reverse. For example, if the model was entered as DV~A+B+AB, first SSResidual is calculated, followed by SSAB, then SSB. SSA is thus calculated as the total sums of squares minus what has already been partitioned off to other terms (SSA=SSTotal-(SSB+SSAB+SSResidual)). If the model is balanced, then it doesn't matter in what order the model is constructed (DV~B+A+AB will be equivalent).

However, in unbalanced designs, SS cannot be partitioned additively; that is, the actual SS of each of the terms does not add up to SSTotal (SSTotal ≠ SSA+SSB+SSAB+SSResidual). Of course, when SS is partitioned it all must add up. Consequently, the SS of the last term to be calculated will not be correct. Therefore, in unbalanced designs, the SS values for the terms differ (and therefore so do the F-ratios etc) depending on the order in which the terms are added - an undesirable situation.

To account for this, there are alternative methods of calculating SS for the main effects (all provide the same SS for interaction terms and residuals).

  • Type I - are the additive SS, calculated by comparing the fit of successive addition of the model terms in a hierarchical sequence (interaction terms first).
  • Type II - SS for any given model term are determined by comparing the fit (deviance) of a model with the term (and all other terms of the same or lower level) to a model without that term, but with the other terms of the same or lower level.
  • Type III - SS for any given model term are determined by comparing the fit (deviance) of the full model with the term to a model without that term.
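The order dependence of Type I SS is easy to demonstrate on a small made-up unbalanced data set (all names below are illustrative):

```r
# Deliberately non-proportional cell sizes: 2, 3, 3 and 2 replicates
set.seed(1)
dat <- data.frame(
  A = factor(rep(c("a1", "a2"), each = 5)),
  B = factor(c("b1", "b1", "b2", "b2", "b2", "b1", "b1", "b1", "b2", "b2"))
)
dat$DV <- rnorm(10) + as.numeric(dat$A)

# Type I (sequential) SS: the SS attributed to a main effect changes
# with the order in which the terms are entered into the model
anova(lm(DV ~ A * B, data = dat))
anova(lm(DV ~ B * A, data = dat))
```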

It turns out that it is relatively easy to determine whether a model is balanced - use the following syntax:

> !is.list(replications(formula,data))
#OR
> library(biology)
> is.balanced(formula,data)

where formula is the model formula (for example, MASS~SITUATION*MONTH) and data is the name of the data set.
If this function returns TRUE, your model is balanced. If it returns FALSE, then your model is not balanced and you should consider using Type II or III SS methods.
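For example, on small made-up data sets (names illustrative), the base-R version of the check behaves as follows:

```r
# Balanced: 3 replicates in each of 2 groups
bal <- data.frame(A = factor(rep(c("a1", "a2"), each = 3)), DV = rnorm(6))
!is.list(replications(DV ~ A, bal))    # TRUE

# Unbalanced: 2 versus 4 replicates
unbal <- data.frame(A = factor(rep(c("a1", "a2"), c(2, 4))), DV = rnorm(6))
!is.list(replications(DV ~ A, unbal))  # FALSE
```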

End of instructions

  Non-parametric tests

Non-parametric tests do not place any distributional limitations on data sets and are therefore useful when the assumption of normality is violated. There are a number of alternatives to parametric tests; the most common are:

1. Randomization tests - rather than assume that a test statistic calculated from the data follows a specific mathematical distribution (such as a normal distribution), these tests generate their own test statistic distribution by repeatedly re-sampling or re-shuffling the original data and recalculating the test statistic each time. A p value is subsequently calculated as the proportion of random test statistics that are greater than or equal to the test statistic based on the original (un-shuffled) data.

2. Rank-based tests - these tests operate not on the original data, but rather on data that have first been ranked (each observation is assigned a rank, such that the largest observation is given the value of 1, the next largest is 2 and so on). It turns out that the probability distribution of a rank-based test statistic is identical for any data set of a given sample size.

End of instructions

  Randomization (permutation) tests

The basis of hypothesis testing (when comparing two populations) is to determine how often we would expect to collect a specific sample (or one more unusual) if the populations were identical. That is, would our sample be considered unusual under normal circumstances? To determine this, we need to estimate what sort of samples we would expect under normal circumstances. The theoretical t-distribution mimics what it would be like to randomly resample repeatedly (with a given sample size) from such identical populations.

A single sample from the (proposed) population(s) provides a range of observations that are possible. Completely shuffling those data (or labels) should mimic the situation when there is no effect (the null hypothesis true situation), since the data are then effectively assigned randomly with respect to the effect. In the case of a t-test comparing the means of two groups, randomly shuffling the data is likely to result in similar group means. This provides one possible outcome (t-value) that might be expected when the null hypothesis is true. If this shuffling process is repeated multiple times, a distribution of all of the outcomes (t-values) that are possible when the null hypothesis is true can be generated. We can now determine how unusual our real (unshuffled) t-value is by comparing it to the built-up t-distribution. If the real t-value is expected to occur less than 5% of the time, we reject the null hypothesis.

Randomization tests - rather than assume that a test statistic calculated from the data follows a specific mathematical distribution (such as a normal distribution), these tests generate their own test statistic distribution by repeatedly re-sampling or re-shuffling the original data and recalculating the test statistic each time. A p value is subsequently calculated as the proportion of random test statistics that are greater than or equal to the test statistic based on the original (un-shuffled) data.

End of instructions

  Defining the randomization statistic

> stat <- function(data, indices){
    t <- t.test(data[,2]~data[,1])$stat
    t
}

The above function takes a data set (data) and calculates a test statistic (in this case a t-statistic). The illustrated function uses the t.test() function to perform a t-test of column 2 ([,2]) against column 1 ([,1]). The value of the t-statistic is stored in an object called 't'. The function returns the value of this object. The indentations are not important; they are purely used to improve human readability.

End of instructions

  Defining the data shuffling procedure

rand.gen <- function(data, mle){
    out <- data
    out[,1] <- sample(out[,1], replace=F)
    out
}

The above function defines how the data should be shuffled. The first line creates a temporary object (out) that stores the original data set (data) so that any alterations do not affect the original data set. The second line uses the sample() function to shuffle the first column (the group labels). The third line of the function just returns the shuffled data set.

End of instructions

  Perform randomization test (bootstrapping)

library(boot)
coots.boot <- boot(coots, stat, R=100, sim="parametric", ran.gen=rand.gen)

The sequence begins by ensuring that the boot package has been loaded into memory. Then the boot() function is used to repeatedly (R=100: repeat 100 times) perform the defined statistical procedure (stat). The ran.gen=rand.gen parameter defines how the data set should be altered prior to performing the statistical procedure, and the sim="parametric" parameter indicates that all randomizations are possible (as opposed to a permutation, where each configuration can only occur once). The outcome of the randomization test (bootstrap) is stored in an object called coots.boot.

End of instructions

  Calculating p-value from randomization test

p.length <- length(coots.boot$t[coots.boot$t >= coots.boot$t0])+1
print(p.length/(coots.boot$R + 1))

The coots.boot object contains a list of all of the t-values calculated from the resampling (shuffling and recalculating) procedure (coots.boot$t) as well as the actual t-value from the actual data set (coots.boot$t0). It also stores the number of randomizations that it performed (coots.boot$R).

The first line in the above sequence determines how many of the possible t-values (including the actual t-value) are greater than or equal to the actual t-value. It does this by specifying a list of coots.boot$t values that are greater than or equal to the coots.boot$t0 value (coots.boot$t[coots.boot$t >= coots.boot$t0]) and calculating the length of this list using the length() function. One (1) is added to this length, since the actual t-value must be included as a possible t-value.
The second line expresses the number of randomization t-values greater or equal to the actual t-value as a fraction of the total number of randomizations (again adding 1 to account for the actual situation). This is interpreted as any other p-value - the probability of obtaining our sample statistic (and thus our observations) when the null hypothesis is true.

End of instructions

  Effect size

In a t-test, the effect size is the absolute difference between the expected population means. Therefore if the expected population means of population A and B are 10 and 16 respectively, then the effect size is 6.

Typically, effect size is estimated by considering the magnitude of effect that is either likely or that would be regarded as biologically significant. For example, if the overall population mean was estimated to be 12 and the researchers regarded a 15% increase to be biologically significant, then the effect size is the difference between 12 and a mean 15% greater than 12 (12+(12*0.15)=13.8). Therefore the effect size is 13.8-12=1.8.
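The arithmetic above can be reproduced directly in R:

```r
mean1 <- 12                      # estimated overall population mean
mean2 <- mean1 + (mean1 * 0.15)  # mean following a 15% increase
effect.size <- mean2 - mean1
effect.size                      # 1.8
```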

End of instructions

  Tolerance and Variance Inflation

The tolerance value for a predictor variable is 1-r2 from the regression of this predictor variable against the other predictor variables included in the linear model. Low tolerances (less than 0.2, approaching 0.1) signify that the predictor variable is correlated with one or more of the other predictor variables.

Variance inflation factor (VIF) is the inverse of tolerance and is just another way of expressing the degree of correlation between a predictor variable and other predictor variables in a linear model. VIF values greater than 5 (approaching 10) indicate strong collinearity and are therefore of concern.
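Both quantities can be computed by hand in base R (simulated predictors; the names are illustrative):

```r
set.seed(1)
x2 <- rnorm(30)
x3 <- rnorm(30)
x1 <- 0.8 * x2 + rnorm(30, sd = 0.5)  # x1 deliberately correlated with x2

# Tolerance: 1 - r2 from regressing x1 against the other predictors
tolerance <- 1 - summary(lm(x1 ~ x2 + x3))$r.squared
vif <- 1 / tolerance                  # variance inflation factor
c(tolerance = tolerance, vif = vif)
```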

End of instructions

  Best regression model

Clearly, a regression model that includes all of the predictor variables will explain variation in the response variable better than models containing only a subset of the predictor variables. But how much better? Models with fewer predictor variables are easier and more precise to use for predictive purposes. Moreover, we are most often purely interested in determining which predictors are most important in explaining variation in the response variable.

Therefore, the best regression model should be a model that only includes the most important predictor variables. It should be the most efficient model - that is the model that explains the greatest amount of variation in the response variable with the least number of predictors.

There are a number of ways of determining which model is 'best'. These are based on:
  1. Adjusted r2 - this is the r2 adjusted to take into account the number of predictor variables. It can be calculated as 1-(MSresidual/MStotal), using mean squares rather than sums of squares so that the number of predictors is accounted for. Larger values indicate better, more parsimonious models.
  2. AIC - Akaike Information Criterion - based on (log)likelihoods, this is a relative measure of goodness of fit of a model given the set of data. Lower values indicate better, more parsimonious models.
  3. BIC - Schwarz Bayesian Information Criterion - is similar to AIC, but it penalizes more heavily against models with more predictors. Lower values indicate better, more parsimonious models.
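A sketch with simulated data (names illustrative), using the built-in AIC() and BIC() functions to compare candidate models:

```r
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)           # x2 is unrelated to the response by construction
y <- 2 * x1 + rnorm(50)

full.lm <- lm(y ~ x1 + x2)
reduced.lm <- lm(y ~ x1)
AIC(full.lm, reduced.lm)  # lower values indicate more parsimonious models
BIC(full.lm, reduced.lm)
```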

End of instructions

  Hierarchical Partitioning

Hierarchical partitioning determines the individual and joint contributions of each predictor variable by partitioning the r2 of the full model (containing all predictors). The individual contribution of a predictor variable is determined by calculating the improvement in fit (by r2) of a series of models (a hierarchy of models) with and without the predictor variable. Joint contributions are calculated from the differences between the partial r2 of a series of models and the independent contributions.

End of instructions

  Hierarchical Partitioning

> loyn2 <- data.frame(logAREA=log10(loyn$AREA), logDIST=log10(loyn$DIST), logLDIST=log10(loyn$LDIST), ALT=loyn$ALT,GRAZE=loyn$GRAZE, YEARS=loyn$YR.ISOL)
> library(hier.part)
> loyn.hier<-hier.part(loyn$ABUND, loyn2, gof="Rsqu")
> loyn.hier

The first step is to create a data frame that contains only the predictor variables. Then we load a package called hier.part by Chris Walsh and Ralph MacNally (Monash University). This contains the functions necessary to perform hierarchical partitioning in R.
Finally, we use the hier.part() function to perform the hierarchical partitioning and print out the results. The gof= parameter specifies in this case to use r2 values.

End of instructions

  Correlation coefficients to z-scores

> r<-sqrt(loyn.hier$IJ["Total"])
> 0.5*log((1+r)/(1-r))

The first step extracts the list of r2 values from the Total contribution column of the hierarchical partitioning output. At the same time, these are converted into correlation coefficients by taking their square roots. The second line converts the correlation coefficients into standard z-scores and prints them out.

End of instructions

  Randomization procedure for hierarchical partitioning

> rand.hp(loyn$ABUND, loyn2, fam = "gaussian", gof = "Rsqu", num.reps=100)$Iprobs

Note that this procedure is set to randomize and calculate 100 times. This can take up to 5 minutes to run! Predictor variables considered to contribute significantly to the explanation of variation in the response variable (from an independent perspective) are signified by a '*'.

End of instructions

  Power and ANOVA in R

Start off by defining some of the values, such as the usual (background or control) population mean and standard deviation.

> mean1 <- 1.65
> sd1 <- 0.51

Then define the effect size and expected between-group variation. This reflects the amount of change (usually as a percentage) as well as the nature of the differences between groups. The following illustration defines between-group variation based on three of the groups having equal population means that are 50% greater than the mean of another group.

> es<- .50
> mean2 <- mean1+(mean1*es)
> b.var <- var(c(mean1, mean2, mean2, mean2))

An alternative definition of expected between-group variation might represent two of the groups having equal population means that are 50% greater than those of the other two groups.

> es<- .50
> mean2 <- mean1+(mean1*es)
> b.var <- var(c(mean1, mean1, mean2, mean2))

Alternatively, we could define the expected between group variation in terms of detecting a 50% increase in means according to some polynomial trend. The expected means are then calculated as a regular sequence.

> es<- .50
> mean2 <- mean1+(mean1*es)
> b.var <- var(seq(mean1, mean2, length=4))

Having defined the expected between group variation, power analysis can be performed as follows:

> power.anova.test(groups=4, power=.8, between.var=b.var, within.var=sd1^2)

These instructions can be altered by customising the number of groups, power etc.

End of instructions

  Power envelope for ANOVA

A power envelope is the area between the maximum and minimum power curves. When the design incorporates four groups (A, B, C, and D), the easiest pattern to detect (maximum power relationship) is A=B<C=D (or equivalently, A=B>C=D), and the most difficult pattern (minimum power relationship) is A<B=C<D (or A>B=C>D), where B and C are equidistant from both A and D. Similarly, if there are five groups (A, B, C, D and E), the easiest pattern to detect (maximum power relationship) is A=B<C=D=E (or equivalently, A=B>C=D=E, A=B=C>D=E or A=B=C<D=E), and the most difficult pattern (minimum power relationship) is A<B=C=D<E (or A>B=C=D>E), where B, C and D are equidistant from both A and E.

> x=seq(3,30,l=100)
# generate maximum power curve
> b.var <- var(c(mean1, mean1, mean2, mean2))
> plot(x, power.anova.test(groups=4, n=x, between.var=b.var, within.var=sd1^2)$power, ylim=c(0,1), ylab="power", xlab="n", type="l")
# put in a line at power=0.8
> abline(0.8, 0, lty=2)
# generate the minimum power curve
> m<-mean(c(mean1,mean2))
> b.var <- var(c(mean1,m,m,mean2))
> points(x, power.anova.test(groups=4, n=x, between.var=b.var, within.var=sd1^2)$power, type="l", lty=4)

End of instructions

  Simple linear regression

Linear regression analysis explores the linear relationship between a continuous response variable (DV) and a single continuous predictor variable (IV). A line of best fit (one that minimizes the sum of squared deviations between the observed y values and the predicted y values at each x) is fitted to the data to estimate the linear model.

As plot a) shows, a slope of zero would indicate no relationship – an increase in IV is not associated with a consistent change in DV.
Plot b) shows a positive relationship (positive slope) – an increase in IV is associated with an increase in DV.
Plot c) shows a negative relationship.

Testing whether the population slope (as estimated from the sample slope) differs from zero is one way of evaluating the relationship. Regression analysis also determines how much of the total variability in the response variable can be explained by its linear relationship with IV, and how much remains unexplained.

The line of best fit (relating DV to IV) takes on the form of
y=mx + c
Response variable = (slope*predictor   variable) + y-intercept
If all points fall exactly on the line, then this model (right-hand part of equation) explains all of the variability in the response variable. That is, all variation in DV can be explained by differences in IV. Usually, however, the model does not explain all of the variability in the response variable (due to natural variation), some is left unexplained (referred to as error).

Therefore the linear model takes the form of;
Response variable = model + error.
A high proportion of variability explained relative to unexplained (error) is suggestive of a relationship between response and predictor variables.
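A minimal sketch on simulated data (names illustrative), extracting the fitted model and the proportion of variability it explains:

```r
set.seed(1)
iv <- 1:20
dv <- 3 + 0.5 * iv + rnorm(20)  # known intercept 3 and slope 0.5, plus error
fit.lm <- lm(dv ~ iv)
coef(fit.lm)                    # y-intercept and slope estimates
summary(fit.lm)$r.squared       # proportion of variability explained
```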

For example, a mammalogist was investigating the effects of tooth wear on the amount of time free ranging koalas spend feeding per 24 h. Therefore, the amount of time spent feeding per 24 h was measured for a number of koalas that varied in their degree of tooth wear (ranging from unworn through to worn). Time spent feeding was the response variable and degree of tooth wear was the predictor variable.

End of instructions

  RMA regression slope

The model II (RMA - reduced major axis) regression slope is the average of the slope of the regression of Y on X (the OLS regression slope) and the inverse of the slope of the regression of X on Y. It is the slope of the regression line that minimizes the sum of the triangular areas bounded by the regression line and horizontal and vertical lines from the observed points to this regression line.

> christ.lm2 <- lm(RIPDENS ~ CWDBASAL, data=christ)
> slope <- mean(c(christ.lm$coef[2], 1/christ.lm2$coef[2]))


End of instructions

  RMA regression y-intercept

All regression lines, whether fitted by OLS or RMA (or MA variants), go through the point whose X and Y coordinates are mean X and mean Y. Essentially, all regression lines rotate about this point. Having estimated the RMA regression slope, the RMA y-intercept is estimated by simply solving the linear equation (y=bx+a) for a, using the means for X and Y as the known values.

> intercept <- mean(christ$CWDBASAL) - (slope*mean(christ$RIPDENS))


End of instructions

  Correlation and simple linear regression power analysis

To estimate sample size given degree of correlation and targeted power.

> library(pwr)
> pwr.r.test(power=.8,r=sqrt(.92))

To estimate power given degree of correlation and sample size (n).

> library(pwr)
> pwr.r.test(n=20,r=sqrt(.92))


End of instructions

  Correlation and simple linear regression power analysis plots

To estimate the relationship between power and sample size given degree of correlation for a range of sample sizes.

> source("power.R")
> power.r.plot(n=4:10,r=sqrt(.92))

To estimate the relationship between power and sample size given degree of correlation for a range of power.

> source("power.R")
> power.r.plot(power=seq(.6,.99,l=100),r=sqrt(.92))


End of instructions

  Further exploration of three way log-linear interactions

Start with the female data. The female data can be accessed through the object sinclair.split[["FEMALE"]]. This is the "FEMALE" data of the sinclair.split object.

> sinclair.glmR <- glm(COUNT~DEATH+MARROW, family=poisson, data=sinclair.split[["FEMALE"]])
> sinclair.glmF <- glm(COUNT~DEATH*MARROW, family=poisson, data=sinclair.split[["FEMALE"]])
> anova(sinclair.glmR, sinclair.glmF, test="Chisq")

The first line fits the reduced log-linear model. This is the model that does not contain the two way interaction term. The second line fits the full log-linear model - the model that does contain the two way interaction. The third line generates the analysis of deviance table. This is essentially presenting the results of the hypothesis test that tests whether there is a two way interaction or not.

This same procedure can then be repeated for the "MALE" data.

End of instructions

  Odds ratios for specific two-way interactions

Start with the female data.

> library(epitools)
> female.tab <- xtabs(COUNT ~ DEATH + MARROW, data=sinclair.split[["FEMALE"]])
> oddsratio.wald(t(female.tab))$measure
> oddsratio.wald(t(female.tab),rev="b")$measure

  1. The first line loads a package called 'epitools'. This package contains the function (oddsratio.wald()) that is necessary to calculate the odds ratios.
  2. The second line converts the female data set into a cross table.
  3. The third line calculates the odds ratios and 95% confidence intervals (after transposing the table). Transposing is done with the t() function and is necessary because oddsratio.wald() expects to encounter an r x 2 table (that is, a table that has only two columns), and the odds ratios are calculated for the rows. Odds ratios are calculated by contrasting one of the levels of the row variable against the other levels. As a result, the odds ratios etc are not calculated for the first level.
  4. The final line performs the odds ratio calculations again, except this time a different level is chosen as the reference level so that all pairwise comparisons have been covered.


This same procedure can then be repeated for the "MALE" data.

End of instructions

  ANOVA

Analysis of variance, or ANOVA, partitions the variation in the response (DV) variable into that explained and that unexplained by one or more categorical predictor variables, called factors. The ratio of this partitioning can then be used to test the null hypothesis (H0) that the population group or treatment means are equal. A single factor ANOVA investigates the effect of a single factor, with 2 or more groups (levels), on the response variable. Single factor ANOVAs are completely randomized (CR) designs. That is, there is no restriction on the random allocation of experimental or sampling units to factor levels.
Single factor ANOVA tests the H0 that there is no difference between the population group means.
H0: µ1 = µ2 = .... = µi = µ
If the effect of the ith group is defined as (αi = µi - µ), then this can be written as
H0: α1 = α2 = ... = αi = 0
If one or more of the αi are different from zero, the null hypothesis will be rejected.

Keough and Raimondi (1995) investigated the degree to which biofilms (films of diatoms, algal spores, bacteria, and other organic material that develop on hard surfaces) influence the settlement of polychaete worms. They had four categories (levels) of the biofilm treatment: sterile substrata, lab developed biofilms without larvae, lab developed biofilms with larvae (any larvae), and field developed biofilms without larvae. Biofilm plates were placed in the field in a completely randomized array. After a week, the number of polychaete species settled on each plate was recorded. The diagram illustrates an example of the spatial layout of a single factor with four treatments (four levels of the treatment factor, each with a different pattern fill) and four experimental units (replicates) for each treatment.

End of instructions

  Factorial boxplots

> boxplot(dv~factor,data=data)

where dv is the name of the numeric vector (dependent variable), factor is the name of the factor (categorical or factor variable) and data is the name of the data frame (data set). The '~' is used in formulas to indicate that the left hand side is modelled as a function of the right hand side.

End of instructions

  Factorial boxplots

> boxplot(dv~factor1*factor2,data=data)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set). The '~' is used in formulas to indicate that the left hand side is modelled as a function of the right hand side.

End of instructions

  Wilcoxon test

> wilcox.test(dv ~ factor, data=data)

where dv is the dependent variable and factor is the categorical variable from the data data frame (data set).

End of instructions

  t-test degrees of freedom

Degrees of freedom is the number of observations that are free to vary when estimating a parameter. For a t-test, df is calculated as
df = (n1-1)+(n2-1)
where n1 is the sample size of population 1 and n2 is the sample size of population 2.
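For example, with samples of 10 and 12 observations:

```r
n1 <- 10
n2 <- 12
df <- (n1 - 1) + (n2 - 1)
df  # 20
```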

End of instructions

  One-tailed critical t-value

One-tailed critical t-values are calculated from a t distribution (of given df). The area under the t distribution curve above (or below) this value is 0.05 (or some other specified probability). Essentially, we calculate a quantile of a specified probability from a t distribution with df degrees of freedom.

# calculate critical t value (α = 0.05) for upper tail (e.g. A larger than B)
> qt(0.05, df=df, lower.tail=F)
# calculate critical t value (α = 0.05) for lower tail (e.g. B larger than A)
> qt(0.05, df=df, lower.tail=T)

where 0.05 is the specified α value and df is the specified degrees of freedom. Note that it is not necessary to have collected the data before calculating the critical t-value; you only need to know the sample sizes (to get df).
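For instance, with df = 20 (a hypothetical value):

```r
# upper-tail critical t-value for alpha = 0.05 and 20 degrees of freedom
qt(0.05, df = 20, lower.tail = FALSE)  # approximately 1.725
```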

It actually doesn't matter whether you specify lower.tail as TRUE or FALSE. The t-distribution is a symmetrical distribution centered around 0; therefore, the critical t-value is the same magnitude for both the Lower tail (e.g. population 2 greater than population 1) and the Upper tail (e.g. population 1 greater than population 2), except that the Lower tail value is always negative. As it is often less confusing to work with positive values, it is recommended that you use Upper tail values. An example of a t-distribution with Upper tail for a one-tailed test is depicted below. Note that this is not part of the t quantiles output!


End of instructions

  Two-tailed critical t-value

Two-tailed critical t-values are simply calculated from a t distribution (of given df). The area under the t distribution curve above the positive of this value, plus the area below its negative, is 0.05 (or some other specified probability). Essentially we calculate a quantile of a specified probability from a t distribution with df degrees of freedom. In a two-tailed test, half of the probability is associated with the area above the positive critical t-value and the other half with the area below the negative critical t-value. Therefore, when we use a quantile to calculate this critical t-value, we specify the probability as α/2, since α/2 applies to each tail.

# calculate critical t value (α = 0.05) for upper tail (e.g. A different to B)
> qt(0.05/2, df=df, lower.tail=F)

where 0.05 is the specified α value and df is the specified degrees of freedom. Note that it is not necessary to have collected the data before calculating the critical t-value; you only need to know the sample sizes (to get df).
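For instance, with df = 20 (a hypothetical value), the probability is halved before taking the quantile:

```r
# two-tailed critical t-value for alpha = 0.05 and 20 degrees of freedom
qt(0.05/2, df = 20, lower.tail = FALSE)  # approximately 2.086
```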

Again, it actually doesn't matter whether you select Lower tail as either TRUE or FALSE. For a symmetrical distribution, centered around 0, the critical t-value is the same for both Lower tail (e.g. population 2 greater than population 1) and Upper tail (e.g. population 1 greater than population 2), except that the Lower tail is always negative. As it is often less confusing to work with positive values, it is recommended that you use Upper tail values. An example of a t-distribution with Upper tail for a two-tailed test is depicted below. Note that this is not part of the t quantiles output!

End of instructions

  Plotting mean vs variance

# first calculate the means and variances
> means <- tapply(data$dv, data$factor, mean)
> vars <- tapply(data$dv, data$factor, var)
# then plot the mean against variance
> plot(means,vars)

where dv is the name of the numeric vector (dependent variable), factor is the name of the factor (categorical or factor variable) and data is the name of the data frame (data set).
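Putting the steps together with fabricated data (the dv and factor names are placeholders):

```r
# Hypothetical single-factor data set (fabricated values)
set.seed(1)
data <- data.frame(dv = c(rnorm(10, mean = 5), rnorm(10, mean = 10)),
                   factor = factor(rep(c("A", "B"), each = 10)))
# group means and variances
means <- tapply(data$dv, data$factor, mean)
vars <- tapply(data$dv, data$factor, var)
# plot the group variances against the group means
plot(means, vars)
```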

End of instructions

  Plotting mean vs variance

# first calculate the means and variances
> means <- tapply(data$dv, list(data$factor1, data$factor2), mean)
> vars <- tapply(data$dv, list(data$factor1, data$factor2), var)
# then plot the mean against variance
> plot(means,vars)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set).

End of instructions

  Analysing frequencies

Analysis of frequencies is similar to Analysis of Variance (ANOVA) in some ways. Variables contain two or more classes that are defined from either natural categories or from a set of arbitrary class limits in a continuous variable. For example, the classes could be sexes (male and female) or color classes derived by splitting the light scale into a set of wavelength bands. Unlike ANOVA, in which an attribute (e.g. length) is measured for a set number of replicates and the means of different classes (categories) are compared, when analyzing frequencies, the number of replicates that fall into each of the defined classes is counted and these observed frequencies are compared to predicted (expected) frequencies.

Analysis of frequencies tests whether a sample of observations came from a population in which the observed frequencies match some expected or theoretical frequencies. It is based on the chi-square (χ²) statistic, which follows a chi-square distribution (the distribution of squared values drawn from a standard normal distribution, hence its long right tail).

When there is only one categorical variable, expected frequencies are calculated from theoretical ratios. When there is more than one categorical variable, the data are arranged in a contingency table that reflects the cross-classification of sampling or experimental units into the classes of the two or more variables. The most common form of contingency table analysis (model I) tests a null hypothesis of independence between the categorical variables and is analogous to the test of an interaction in multifactorial ANOVA. Hence, frequency analysis provides hypothesis tests for solely categorical data. Although analysis of frequencies provides a way to analyse data in which both the predictor and response variables are categorical, the variables are not distinguished as predictor or response in the analysis itself, so the establishment of causality is a matter of interpretation only.


End of instructions

  Goodness of fit test

The goodness-of-fit test compares observed frequencies of each class within a single categorical variable to the frequencies expected of each of the classes on a theoretical basis. It tests the null hypothesis that the sample came from a population in which the observed frequencies match the expected frequencies.

For example, an ecologist investigating factors that might lead to deviations from a 1:1 offspring sex ratio, hypothesized that when the investment in one sex is considerably greater than in the other, the offspring sex ratio should be biased towards the less costly sex. He studied two species of wasps, one of which had males that were considerably larger (and thus more costly to produce) than females. For each species, he compared the offspring sex ratio to a 1:1 ratio using a goodness-of-fit test.

End of instructions

  R Goodness of fit test

> chisq.test(c(counts))
#OR
> chisq.test(c(counts),p=c(.5,.5))

where counts is a comma separated list of observed counts or frequencies and .5,.5 is a comma separated list of expected proportions (which must sum to 1). When p is omitted, equal proportions are assumed. For example

> chisq.test(c(40,50))
#OR
> chisq.test(c(40,50),p=c(.5,.5))
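For the example above, the statistic can be checked by hand: with 90 observations and an expected 1:1 ratio, the expected count in each class is 45, so χ² = (40-45)²/45 + (50-45)²/45 ≈ 1.11.

```r
# goodness-of-fit test against an implied 1:1 ratio
res <- chisq.test(c(40, 50))
res$statistic  # X-squared, approximately 1.111
res$expected   # expected counts: 45 and 45
```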

End of instructions

  Contingency tables

Contingency tables are the cross-classification of two (or more) categorical variables. A 2 x 2 (two way) table takes on the following form:

                      Variable 2
                  Class 1    Class 2
Variable 1
    Class 1         f11        f12
    Class 2         f21        f22

Where f11, f12 etc. are the frequencies in each cell (each Variable 1 x Variable 2 combination).

Contingency tables test the null hypothesis that the data came from a population in which variable 1 is independent of variable 2 and vice-versa. That is, it is a test of independence.

For example, a plant ecologist examined the effect of heat on seed germination. A contingency test was used to determine whether germination (2 categories - yes or no) was independent of the heat treatment (2 categories - heated or not heated).

End of instructions

  R Cross tables

> name.tab <- xtabs(counts ~ cat1 + cat2, data=data)

where name.tab is a name for the resulting cross table, counts are the observed counts/frequencies, cat1 and cat2 are the categorical variables and data is the name of the data frame (dataset)
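A sketch using the seed germination example from the previous section, with fabricated counts (the numbers and the names seeds, heat and germinated are illustrative only):

```r
# Fabricated counts for a heat x germination experiment (illustrative only)
seeds <- data.frame(heat = rep(c("heated", "unheated"), each = 2),
                    germinated = rep(c("yes", "no"), times = 2),
                    counts = c(55, 10, 31, 41))
# cross-tabulate the counts by the two categorical variables
seeds.tab <- xtabs(counts ~ heat + germinated, data = seeds)
seeds.tab
```

The resulting table can then be passed to chisq.test() as described in the next section.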

End of instructions

  R Contingency tables

> name.x2 <- chisq.test(name.tab, correct=F)

where name.x2 is a name to provide for the fitted model and name.tab is a name of a cross table.

End of instructions

  Contingency tables from raw data sets

> name.tab <- table(data)

where name.tab is a name to give the resulting cross table and data is the name of the data set that contains the raw data.

End of instructions

  Multivariate analysis

Multivariate data sets include multiple variables recorded from a number of replicate sampling or experimental units, sometimes referred to as objects. If, for example, the objects are whole organisms or ecological sampling units, the variables could be morphometric measurements or abundances of various species respectively. The variables may be all response variables (MANOVA) or they could be response variables, predictor variables or a combination of both (PCA and MDS).

The aim of multivariate analysis is to reduce the many variables to a smaller number of new derived variables that adequately summarize the original information and can be used for further analysis. In addition, Principal components analysis (PCA) and Multidimensional scaling (MDS) also aim to reveal patterns in the data, especially among objects, that could not be found by analyzing each variable separately.

End of instructions

  Bray-Curtis dissimilarity

For two objects (i = 1 and 2) and a number of variables (j = 1 to p), the Bray-Curtis dissimilarity is calculated as BC = 1 - [2 Σ min(y1j, y2j)] / [Σ (y1j + y2j)], where the sums are taken over all p variables.

  • y1j and y2j are the values of variable j in object 1 and object 2.
  • min(y1j , y2j) is the lesser value of variable j for object 1 and object 2.
  • p is the number of variables.
For example

           Var 1   Var 2   Var 3
Object 1     3       6       9
Object 2     6      12      18

BC = 1 - [2(3+6+9)/(18+36)] = 0.33, where (3+6+9) is the sum of the lesser abundance of each variable across the two objects and (18+36) is the sum of the total abundances of the two objects.

Bray-Curtis dissimilarity is well suited to species abundance data because it ignores variables that have zeros for both objects, and it reaches a constant maximum value when two objects have no variables in common. However, its value is determined mainly by variables with high values (e.g. species with high abundances) and thus results are biased towards trends displayed in such variables. It ranges from 0 (objects completely similar) to 1 (objects completely dissimilar).
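The worked example can be checked in R (object values taken from the table above):

```r
obj1 <- c(3, 6, 9)
obj2 <- c(6, 12, 18)
# Bray-Curtis: 1 minus twice the summed minima over the summed abundances
BC <- 1 - 2 * sum(pmin(obj1, obj2)) / sum(obj1 + obj2)
BC  # 0.333
```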

End of instructions

  Euclidean distance

For two objects (i = 1 and 2) and a number of variables (j = 1 to p), the Euclidean distance is calculated as EUC = sqrt[Σ (y1j - y2j)^2], where the sum is taken over all p variables.

  • y1j and y2j are the values of variable j in object 1 and object 2.
  • p is the number of variables.
For example

           Var 1   Var 2   Var 3
Object 1     3       6       9
Object 2     6      12      18

EUC = sqrt[(3-6)^2 + (6-12)^2 + (9-18)^2] = 11.2

Euclidean distance is a very important, general, distance measure in multivariate statistics. It is based on simple geometry as a measure of distance between two objects in multidimensional space. It is bounded below by zero (obtained for two objects with exactly the same values for all variables) and has no upper limit, even when two objects have no variables in common with positive values. Since it does not reach a constant maximum value when two objects have no variables in common, it is not ideal for species abundance data.
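The worked example can be checked in R (object values taken from the table above):

```r
obj1 <- c(3, 6, 9)
obj2 <- c(6, 12, 18)
# Euclidean distance: square root of the summed squared differences
EUC <- sqrt(sum((obj1 - obj2)^2))
EUC  # approximately 11.22
```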

End of instructions

  Distance measures
  1. Select the Statistics menu..
  2. Select the Dimensional analysis.. submenu..
  3. Select the Distance measures.. submenu..
The Distance Measures dialog box will be displayed.
  1. Enter a name for the output distance matrix - the default name is usually fine
  2. Select the variables (species) to include in the analysis from the Variables list
  3. Select the objects (samples/sites) to include in the analysis from the Samples/Sites list
  4. Select the Type of distance from the options
  5. Click the button

The matrix of distance values will appear in the R Commander output window.

End of instructions

  Multidimensional Scaling

Multidimensional scaling (MDS), has two aims

  • to reduce a large number of variables down to a small number of summary variables (axes) that explain most of the variation (and patterns) in the original data.
  • to reveal patterns in the data that could not be found by analysing each variable separately. For example examining which sites are most similar to one another based on the abundances of many plant species

MDS starts with a dissimilarity matrix which represents the degree of difference between all objects (such as sites or quadrats) based on all the original variables. MDS then attempts to reproduce these patterns between objects by arranging the objects in multidimensional space with the distances between each pair of objects representing their relative dissimilarity. Each of the dimensions (plot axes) therefore represents a new variable that can be used to describe the relationships between objects. A greater number of new variables (axes) increases the degree to which the original patterns between objects are retained; however, less data reduction is achieved. Furthermore, it is difficult to envisage points (objects) arranged in more than three dimensions.

MDS can be broken down into the following steps

  1. Calculate a matrix of dissimilarities between objects from a data matrix (raw or standardized) in which objects (e.g. quadrats) are in rows and variables (e.g. plants) are in columns.
  2. After deciding on the number of dimensions (axes), arrange the objects in ordination (multidimensional) space (essentially a graph containing as many axes as selected dimensions) either at random or using the coordinates of a previous ordination.
  3. Move the location of the objects in space iteratively so that at each successive step, the match between the ordination inter-object distances and the dissimilarities is improved. The match between ordination distances and object dissimilarities is measured by the stress value - the lower the stress, the better the fit (stress < 0.15 is usually considered acceptable; ideally < 0.1).
  4. The final arrangement (configuration) of objects is achieved when further iterative moving of objects no longer results in a substantial decrease in stress. Note, if final stress is > 0.2, the number of selected dimensions is inadequate.
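The steps above can be sketched with MASS::isoMDS (non-metric MDS; the MASS package ships with R). The data here are fabricated, and note that isoMDS reports stress as a percentage rather than a proportion:

```r
library(MASS)  # for isoMDS (non-metric MDS)
set.seed(2)
# Fabricated data matrix: 8 objects (rows) x 5 variables (columns)
sites <- matrix(rnorm(40), nrow = 8)
# Step 1: dissimilarity matrix (Euclidean here; a Bray-Curtis matrix could be substituted)
d <- dist(sites)
# Steps 2-4: iteratively arrange the objects in 2 dimensions to minimize stress
mds <- isoMDS(d, k = 2)
mds$stress  # final stress, expressed as a percentage
plot(mds$points, xlab = "MDS1", ylab = "MDS2")  # the ordination plot
```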

End of instructions

  Multidimensional scaling
  1. Select the Statistics menu..
  2. Select the Dimensional analysis.. submenu..
  3. Select the Multidimensional scaling.. submenu..
The Multidimensional Scaling dialog box will be displayed.
  1. Select the name of the distance matrix to be used from the Distance matrix list
  2. Enter a name for the resulting MDS output in the Enter a name for model box. The default name is usually fine.
  3. Enter the number of new dimensions (axes) to be calculated - usually 2 or 3.
  4. Select the objects (samples/sites) to be included in the MDS from the Samples list
  5. Click the button

A large amount of information will appear in the R Commander output window (most of which you can ignore!).

  1. Scroll up through the output and locate the stress value.
  2. The RGui command prompt will prompt you to hit the return key to examine a couple of plots.

The first of these plots is a Shepard plot which represents the relationship between the original distances (y-axis) and the new MDS distances (x-axis).
The second plot is the final ordination plot. This represents the arrangement of objects in multidimensional space.

End of instructions

  View data set
  1. Click the button within R Commander
A spreadsheet (that can't be edited) containing the data set will appear.

End of instructions

  Linear relationship vs linear model

Linear models are those statistical models in which a series of parameters (coefficients of the predictor variables) is arranged as a linear combination. That is, within the model, no parameter appears as a multiplier, divisor or exponent to any other parameter. Importantly, the term 'linear' in this context does not pertain to the nature of the relationship between the response variable and the predictor variable(s), and thus linear models are not restricted to 'linear' (straight-line) relationships. The following examples should help make the distinction.

A linear model representing a linear relationship (e.g. y = b0 + b1x)
A linear model representing a curvilinear relationship (e.g. the polynomial y = b0 + b1x + b2x^2)
A non-linear model representing a curvilinear relationship (e.g. the exponential y = b0e^(b1x))

End of instructions