
Workshop 7 - Nested, randomized block, repeated measures and partly-nested designs (mixed effects models)

4 August 2011

Basic statistics references

  • Logan (2010) - Chpt 12-14
  • Quinn & Keough (2002) - Chpt 9-11
Very basic overview of Nested ANOVA

Randomized block

Nested ANOVA - one between factor

In an unusually detailed preparation for an Environmental Effects Statement for a proposed discharge of dairy wastes into the Curdies River, in western Victoria, a team of stream ecologists wanted to describe the basic patterns of variation in a stream invertebrate thought to be sensitive to nutrient enrichment. As an indicator species, they focused on a small flatworm, Dugesia, and started by sampling populations of this worm at a range of scales. They sampled in two seasons, representing different flow regimes of the river - Winter and Summer. Within each season, they sampled three randomly chosen (well, haphazardly, because sites are nearly always chosen to be close to road access) sites. A total of six sites in all were visited, 3 in each season. At each site, they sampled six stones, and counted the number of flatworms on each stone.

Download Curdies data set

Format of curdies.csv data file
SEASON  SITE  DUGESIA  S4DUGES
WINTER  1     0.648    0.897
...     ...   ...      ...
WINTER  2     1.016    1.004
...     ...   ...      ...
WINTER  3     0.689    0.991
...     ...   ...      ...
SUMMER  4     0        0
...     ...   ...      ...

Each row represents a different stone

SEASON   Season in which flatworms were counted - fixed factor
SITE     Site from which flatworms were counted - nested within SEASON (random factor)
DUGESIA  Number of flatworms counted on a particular stone
S4DUGES  Fourth-root transformation of the DUGESIA variable

Open the curdies data file.
Show code

> curdies <- read.table("curdies.csv", header = T, sep = ",", strip.white = T)

> head(curdies)

  SEASON SITE   DUGESIA   S4DUGES
1 WINTER    1 0.6476829 0.8970995
2 WINTER    1 6.0961516 1.5713175
3 WINTER    1 1.3105639 1.0699526
4 WINTER    1 1.7252788 1.1460797
5 WINTER    1 1.4593867 1.0991136
6 WINTER    1 1.0575610 1.0140897

The SITE variable is supposed to represent a random categorical variable (which site). However, because the contents of this variable are numbers, R initially treats them as numbers, and therefore considers the variable to be numeric rather than categorical. To force R to treat this variable as a factor (categorical), it is necessary to first convert the numeric variable into a factor (HINT).
Show code

> curdies$SITE <- as.factor(curdies$SITE)

Notice that in this data set each level of the nested (random) factor is labelled uniquely (sites 1-3 in winter, 4-6 in summer) - the same label must never be repeated across levels of the nesting factor.
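If, hypothetically, the sites had instead been numbered 1-3 within each season (so that the same site label appeared in both seasons), unique labels could be constructed first. A minimal sketch (the SITEID name is made up for illustration):

#A hypothetical sketch: build unique site labels from the season-site combination

> curdies$SITEID <- interaction(curdies$SEASON, curdies$SITE)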

Q1-1. What are the main hypotheses being tested?
  1. H0 Effect 1:
  2. H0 Effect 2:
Q1-2. In the table below, list the assumptions of nested ANOVA along with how violations of each assumption are diagnosed and/or the risks of violations are minimized.

Assumption    Diagnostic/Risk Minimization
I.
II.
III.

Q1-3. Check these assumptions (HINT).
Show code

> library(nlme)

> curdies.ag <- gsummary(curdies, form = ~SEASON/SITE, mean)

Note that for the effects of SEASON (Factor A in a nested model) there are only three values (site means) for each of the two seasons. Therefore, boxplots are of limited value! Is there, however, any evidence of violations of the assumptions (HINT)?
Show code

> boxplot(DUGESIA ~ SEASON, curdies.ag)

(Y or N)
If so, assess whether a transformation will address the violations (HINT) and then make the appropriate corrections
Show code

> curdies$S4DUGES <- sqrt(sqrt(curdies$DUGESIA))

> curdies.ag <- gsummary(curdies, form = ~SEASON/SITE, mean)

> boxplot(S4DUGES ~ SEASON, curdies.ag)

Q1-4. For each of the tests, state which error (or residual) term (stated as a Mean Square) will be used as the numerator and denominator to calculate the F-ratio. Also include the degrees of freedom associated with each term.
Effect    Numerator (Mean Sq, df)    Denominator (Mean Sq, df)
SEASON
SITE

Q1-5. If there is no evidence of violations, test the model;
S4DUGES = SEASON + SITE + CONSTANT
using a nested ANOVA
(HINT). Fill out the table below (HINT), making sure that you treat SITE as a random factor when compiling the overall results.
Show code

> curdies.aov <- aov(S4DUGES ~ SEASON + Error(SITE), data = curdies)

> summary(curdies.aov)

Error: SITE
          Df Sum Sq Mean Sq F value   Pr(>F)   
SEASON     1 5.5709  5.5709  34.496 0.004198 **
Residuals  4 0.6460  0.1615                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 30 4.5555 0.15185               

> summary(lm(curdies.aov))

Call:
lm(formula = curdies.aov)
Residuals:
    Min      1Q  Median      3Q     Max 
-0.3811 -0.2618 -0.1381  0.1652  0.9023 
Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)    0.3811     0.1591   2.396  0.02303 * 
SEASONWINTER   0.7518     0.2250   3.342  0.00224 **
SITE2          0.1389     0.2250   0.618  0.54156   
SITE3         -0.2651     0.2250  -1.178  0.24798   
SITE4         -0.0303     0.2250  -0.135  0.89376   
SITE5         -0.2007     0.2250  -0.892  0.37955   
SITE6              NA         NA      NA       NA   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Residual standard error: 0.3897 on 30 degrees of freedom
Multiple R-squared: 0.5771, Adjusted R-squared: 0.5066 
F-statistic: 8.188 on 5 and 30 DF,  p-value: 5.718e-05 

> library(gmodels)

> ci(lm(curdies.aov))

                Estimate    CI lower  CI upper Std. Error     p-value
(Intercept)   0.38112229  0.05622464 0.7060200  0.1590863 0.023031342
SEASONWINTER  0.75181980  0.29234513 1.2112945  0.2249821 0.002241586
SITE2         0.13892774 -0.32054693 0.5984024  0.2249821 0.541560368
SITE3        -0.26507141 -0.72454609 0.1944033  0.2249821 0.247982767
SITE4        -0.03030096 -0.48977564 0.4291737  0.2249821 0.893763109
SITE5        -0.20066007 -0.66013474 0.2588146  0.2249821 0.379548166

Q1-6. Using the output above, complete the following ANOVA table, and summarize the estimated mean for summer and the effect size of season.
Source of variation    df    Mean Sq    F-ratio    P-value
SEASON
SITE
Residuals   

Estimate    Mean    Lower 95% CI    Upper 95% CI
Summer
Effect size (Winter-Summer)

Normally, we are not interested in formally testing the effect of the nested factor. To get the correct F test for the nested factor (SITE), examine a representation of the ANOVA table of the fitted linear model that assumes all factors are fixed (HINT).
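A minimal sketch of what that amounts to (assuming the curdies data frame from above): refit the model with aov() and no Error() term, so that every term - including SITE - is tested against the overall residual.

> summary(aov(S4DUGES ~ SEASON + SITE, data = curdies))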

Q1-7. What are your conclusions (statistical and biological)?

Q1-8. Now analyse these data using a linear mixed effects model (REML).
Show code - lme

> library(nlme)

> curdies.lme <- lme(S4DUGES ~ SEASON, random = ~1 | SITE, data = curdies)

> summary(curdies.lme)

Linear mixed-effects model fit by REML
 Data: curdies 
       AIC     BIC    logLik
  46.42966 52.5351 -19.21483
Random effects:
 Formula: ~1 | SITE
        (Intercept)  Residual
StdDev:  0.04008598 0.3896803
Fixed effects: S4DUGES ~ SEASON 
                 Value Std.Error DF  t-value p-value
(Intercept)  0.3041353 0.0947195 30 3.210905  0.0031
SEASONWINTER 0.7867589 0.1339536  4 5.873369  0.0042
 Correlation: 
             (Intr)
SEASONWINTER -0.707
Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max 
-1.2064782 -0.7680513 -0.2480590  0.3531126  2.3209354 
Number of Observations: 36
Number of Groups: 6 


#Calculate the confidence interval for the effect size of the main effect of season

> intervals(curdies.lme)

Approximate 95% confidence intervals
 Fixed effects:
                 lower      est.     upper
(Intercept)  0.1106923 0.3041353 0.4975783
SEASONWINTER 0.4148441 0.7867589 1.1586737
attr(,"label")
[1] "Fixed effects:"
 Random Effects:
  Level: SITE 
                       lower       est.    upper
sd((Intercept)) 1.837163e-07 0.04008598 8746.562
 Within-group standard error:
    lower      est.     upper 
0.3025499 0.3896803 0.5019033 


#Examine the ANOVA table with Type I (sequential) SS

> anova(curdies.lme)

            numDF denDF   F-value p-value
(Intercept)     1    30 108.45711  <.0001
SEASON          1     4  34.49647  0.0042

Show code - lmer

> library(lme4)

> curdies.lmer <- lmer(S4DUGES ~ SEASON + (1 | SITE), data = curdies)

> summary(curdies.lmer)

Linear mixed model fit by REML 
Formula: S4DUGES ~ SEASON + (1 | SITE) 
   Data: curdies 
   AIC   BIC logLik deviance REMLdev
 46.44 52.77 -19.22    32.52   38.44
Random effects:
 Groups   Name        Variance   Std.Dev.  
 SITE     (Intercept) 1.1647e-13 3.4127e-07
 Residual             1.5299e-01 3.9113e-01
Number of obs: 36, groups: SITE, 6
Fixed effects:
             Estimate Std. Error t value
(Intercept)   0.30414    0.09219   3.299
SEASONWINTER  0.78676    0.13038   6.034
Correlation of Fixed Effects:
            (Intr)
SEASONWINTE -0.707


#Calculate the confidence interval for the effect size of the main effect of season

> library(languageR)

> pvals.fnc(curdies.lmer, withMCMC = FALSE, addPlot = F)

$fixed
             Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)    0.3041   0.3059     0.0791     0.5443 0.0176   0.0023
SEASONWINTER   0.7868   0.7848     0.4724     1.1253 0.0014   0.0000
$random
    Groups        Name Std.Dev. MCMCmedian MCMCmean HPD95lower HPD95upper
1     SITE (Intercept)   0.0000     0.0562   0.0827     0.0000     0.2652
2 Residual               0.3911     0.3878   0.3928     0.3032     0.4922


#Examine the effect of season via a likelihood ratio test

> curdies.lmer1 <- lmer(S4DUGES ~ SEASON + (1 | SITE), data = curdies, REML = F)

> curdies.lmer2 <- lmer(S4DUGES ~ 1 + (1 | SITE), data = curdies, REML = F)

> anova(curdies.lmer1, curdies.lmer2, test = "F")

Data: curdies
Models:
curdies.lmer2: S4DUGES ~ 1 + (1 | SITE)
curdies.lmer1: S4DUGES ~ SEASON + (1 | SITE)
              Df    AIC    BIC  logLik  Chisq Chi Df Pr(>Chisq)    
curdies.lmer2  3 51.831 56.581 -22.916                             
curdies.lmer1  4 40.519 46.853 -16.259 13.312      1  0.0002637 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Alternatively, look into ezStats in the ez package.
Show code - ezStats

> library(ez)

> ezStats(dv = .(S4DUGES), between = .(SEASON), wid = .(SITE), data = curdies)

  SEASON N      Mean        SD      FLSD
1 SUMMER 3 0.3041353 0.1081703 0.3719147
2 WINTER 3 1.0908942 0.2052556 0.3719147

Show code - ezAnova

> library(ez)

> ezANOVA(dv = .(S4DUGES), between = .(SEASON), wid = .(SITE), data = curdies, type = 3)

$ANOVA
  Effect DFn DFd        F           p p<.05       ges
1 SEASON   1   4 34.49649 0.004197673     * 0.8960944
$`Levene's Test for Homogeneity of Variance`
  DFn DFd         SSn        SSd         F         p p<.05
1   1   4 0.006891131 0.05167755 0.5333945 0.5056576      

Plot the predicted mean fourth-root transformed number of Dugesia (and confidence interval) for each season
Show code

> library(gplots)

> library(AICcmodavg)

> curdies.predict <- predictSE.mer(curdies.lmer, newdata = data.frame(SEASON = c("SUMMER", "WINTER")))

> par(mar = c(5, 5, 1, 1))

> plot(1:2, curdies.predict$fit, type = "n", axes = F, ann = F, ylim = range(0, 1.5), xlim = c(0.5, 2.5))

> plotCI(curdies.predict$fit, uiw = curdies.predict$se.fit * 2, las = 1, add = T)

> axis(1, at = c(1, 2), lab = levels(curdies$SEASON))

> axis(2, las = 1)

> mtext("Number of Dugesia per stone", 2, line = 3, cex = 1.25)

> mtext("Season", 1, line = 3, cex = 1.25)

> box(bty = "l")

Or the more flexible way, which requires more code and thought:
Show code

> newdata <- expand.grid(SEASON = c("SUMMER", "WINTER"), S4DUGES = 0)

> mod.mat <- model.matrix(terms(curdies.lmer), newdata)

> newdata$S4DUGES = mod.mat %*% fixef(curdies.lmer)

> pvar1 <- (diag(mod.mat %*% tcrossprod(vcov(curdies.lmer), mod.mat)))

> tvar1 <- pvar1 + VarCorr(curdies.lmer)$SITE[1]

> newdata <- data.frame(newdata, plo = newdata$S4DUGES - 2 * sqrt(pvar1), phi = newdata$S4DUGES + 2 * sqrt(pvar1), tlo = newdata$S4DUGES - 2 * sqrt(tvar1), thi = newdata$S4DUGES + 2 * sqrt(tvar1))

> par(mar = c(5, 5, 1, 1))

> plot(S4DUGES ~ as.numeric(SEASON), newdata, type = "n", axes = F, ann = F, ylim = range(0, newdata$phi), xlim = c(0.5, 2.5))

> points(curdies$DUGESIA ~ jitter(as.numeric(curdies$SEASON)), pch = 16, col = "grey")

> with(newdata, arrows(as.numeric(SEASON), plo, as.numeric(SEASON), phi, ang = 90, length = 0.05, code = 3))

> with(newdata, points(as.numeric(SEASON), S4DUGES, pch = 21, bg = "black", col = "black"))

> axis(1, at = c(1, 2), lab = newdata$SEASON)

> axis(2, las = 1)

> mtext("Number of Dugesia per stone", 2, line = 3, cex = 1.25)

> mtext("Season", 1, line = 3, cex = 1.25)

> box(bty = "l")

Plot the predicted mean number of Dugesia back-transformed to the original scale (and confidence interval) for each season. Note that these data were modelled with a Gaussian distribution (which might not have been appropriate)
Show code

> newdata[, -1] <- newdata[, -1]^4

> par(mar = c(5, 5, 1, 1))

> plot(S4DUGES ~ as.numeric(SEASON), newdata, type = "n", axes = F, ann = F, ylim = range(newdata$plo, newdata$phi), xlim = c(0.5, 2.5))

> points(curdies$DUGESIA ~ jitter(as.numeric(curdies$SEASON)), pch = 16, col = "grey")

> with(newdata, arrows(as.numeric(SEASON), plo, as.numeric(SEASON), phi, ang = 90, length = 0.05, code = 3))

> with(newdata, points(as.numeric(SEASON), S4DUGES, pch = 21, bg = "black", col = "black"))

> axis(1, at = c(1, 2), lab = newdata$SEASON)

> axis(2, las = 1)

> mtext("Number of Dugesia per stone", 2, line = 3, cex = 1.25)

> mtext("Season", 1, line = 3, cex = 1.25)

> box(bty = "l")

Note that back-transforming the standard errors has the reverse of the desired effect (since they are less than 1).
Show code

> library(gplots)

> library(AICcmodavg)

> curdies.predict <- predictSE.mer(curdies.lmer, newdata = data.frame(SEASON = c("SUMMER", "WINTER")))

> curdies.predict$fit <- curdies.predict$fit^2

> curdies.predict$se.fit <- curdies.predict$se.fit^2

> par(mar = c(5, 5, 1, 1))

> plot(1:2, curdies.predict$fit, type = "n", axes = F, ann = F, ylim = range(0, 1.5), xlim = c(0.5, 2.5))

> plotCI(curdies.predict$fit, uiw = curdies.predict$se.fit * 2, las = 1, add = T)

> axis(1, at = c(1, 2), lab = levels(curdies$SEASON))

> axis(2, las = 1)

> mtext("Number of Dugesia per stone", 2, line = 3, cex = 1.25)

> mtext("Season", 1, line = 3, cex = 1.25)

> box(bty = "l")
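An alternative sketch (my own suggestion, not part of the original workshop): back-transform the interval limits rather than the standard errors, which sidesteps the problem noted above.

> curdies.predict <- predictSE.mer(curdies.lmer, newdata = data.frame(SEASON = c("SUMMER", "WINTER")))

#Back-transform (fourth power) the limits calculated on the fourth-root scale

> lwr <- (curdies.predict$fit - 2 * curdies.predict$se.fit)^4

> upr <- (curdies.predict$fit + 2 * curdies.predict$se.fit)^4

> cbind(fit = curdies.predict$fit^4, lwr, upr)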

Q1-9. Where is the major variation in numbers of flatworms? Between seasons, among sites or among stones?
Show code

> library(nlme)

> nlme:::VarCorr(lme(S4DUGES ~ 1, random = ~1 | SEASON/SITE, curdies))

            Variance     StdDev    
SEASON =    pdLogChol(1)           
(Intercept) 0.300522947  0.54819973
SITE =      pdLogChol(1)           
(Intercept) 0.001606859  0.04008565
Residual    0.151850793  0.38968037
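A rough follow-up sketch (using the variance components printed above): express each component as a percentage of the total.

> vc <- c(SEASON = 0.300522947, SITE = 0.001606859, Residual = 0.151850793)

> round(100 * vc/sum(vc), 1)

  SEASON     SITE Residual 
    66.2      0.4     33.4 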




Q1-10. How might this information influence the design of future experiments on Dugesia in terms of:
  1. What influences the abundance of Dugesia
  2. Where best to focus sampling effort to maximize statistical power? (see the sketch below)
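As a rough guide (a hand-rolled sketch; the se.season helper and the plugged-in variance components from Q1-9 are my own), compare the standard error of a season mean under different allocations of sites and stones:

> se.season <- function(n.site, n.stone, v.site = 0.0016, v.resid = 0.1519) sqrt(v.site/n.site + v.resid/(n.site * n.stone))

> round(se.season(n.site = 3, n.stone = 6), 4)   #the original design

[1] 0.0947

> round(se.season(n.site = 6, n.stone = 3), 4)   #more sites, fewer stones per site

[1] 0.0933

With almost all of the variation among stones, precision is driven mainly by the total number of stones; the site-level component is too small to matter much.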

Q1-11. Finally, construct an appropriate summary figure to accompany the above analyses. Note that this should use the correct replicates for depicting error.
Show code

> opar <- par(mar = c(5, 5, 1, 1))

> means <- tapply(curdies.ag$DUGESIA, curdies.ag$SEASON, mean)

> sds <- tapply(curdies.ag$DUGESIA, curdies.ag$SEASON, sd)

> lens <- tapply(curdies.ag$DUGESIA, curdies.ag$SEASON, length)

> ses <- sds/sqrt(lens)

> xs <- barplot(means, beside = T, ann = F, axes = F, ylim = c(0, max(means + ses)), axisnames = F)

> arrows(xs, means - ses, xs, means + ses, code = 3, length = 0.05, ang = 90)

> axis(1, at = xs, lab = c("Summer", "Winter"))

> mtext("Season", 1, line = 3, cex = 1.25)

> axis(2, las = 1)

> mtext("Mean number of Dugesia per stone", 2, line = 3, cex = 1.25)

> box(bty = "l")

Randomized block

A plant pathologist wanted to examine the effects of two different strengths of tobacco virus on the number of lesions on tobacco leaves. She knew from pilot studies that leaves were inherently very variable in response to the virus. In an attempt to account for this leaf to leaf variability, both treatments were applied to each leaf. Eight individual leaves were divided in half, with half of each leaf inoculated with weak strength virus and the other half inoculated with strong virus. So the leaves were blocks and each treatment was represented once in each block. A completely randomised design would have had 16 leaves, with 8 whole leaves randomly allocated to each treatment.

Download Tobacco data set

Format of tobacco.csv data file
LEAF  TREATMENT  NUMBER
1     Strong     35.898
1     Weak       25.02
2     Strong     34.118
2     Weak       23.167
3     Strong     35.702
3     Weak       24.122
...   ...        ...

LEAF       The blocking factor - Factor B
TREATMENT  Categorical representation of the strength of the tobacco virus - the main factor of interest (Factor A)
NUMBER     Number of lesions on that part of the tobacco leaf - response variable

Open the tobacco data file.

Show code

> tobacco <- read.table("tobacco.csv", header = T, sep = ",", strip.white = T)

> head(tobacco)

  LEAF TREATMENT   NUMBER
1   L1    Strong 35.89776
2   L1      Weak 25.01984
3   L2    Strong 34.11786
4   L2      Weak 23.16740
5   L3    Strong 35.70215
6   L3      Weak 24.12191

Since each level of treatment was applied to each leaf, the data represents a randomized block design with leaves as blocks.

The variable LEAF contains a list of leaf identification numbers and is supposed to represent a categorical blocking variable. However, because the contents of this variable are numbers, R initially treats them as numbers, and therefore considers the variable to be numeric rather than categorical. To force R to treat this variable as a factor (categorical), it is necessary to first convert the numeric variable into a factor (HINT).
Show code

> tobacco$LEAF <- as.factor(tobacco$LEAF)

Q2-1. What are the main hypotheses being tested?
  1. H0 Factor A:
  2. H0 Factor B:
Q2-2. In the table below, list the assumptions of a randomized block design along with how violations of each assumption are diagnosed and/or the risks of violations are minimized.
  1. Assumption    Diagnostic/Risk Minimization
    I.
    II.
    III.
    IV.
  2. Is the proposed model balanced?
    (Y or N)
  3. Show code

    > replications(NUMBER ~ LEAF + TREATMENT, data = tobacco)

         LEAF TREATMENT 
            2         8 

    > !is.list(replications(NUMBER ~ LEAF + TREATMENT, data = tobacco))

    [1] TRUE


Q2-3. There are a number of ways of diagnosing block by within block interactions.
  1. The simplest method is to plot the number of lesions for each treatment and leaf combination (i.e. an interaction plot) (HINT). Any evidence of an interaction? Note that we cannot formally test for an interaction because we don't have replicates for each treatment-leaf combination!
  2. Show code

    > library(car)

    > interaction.plot(tobacco$LEAF, tobacco$TREATMENT, tobacco$NUMBER)

  3. We can also examine the residual plot from the fitted linear model. A curvilinear pattern in which the residual values switch from positive to negative and back again (or vice versa) over the range of predicted values implies that the scale (magnitude but not polarity) of the main treatment effects differs substantially across the range of blocks (HINT). Any evidence of an interaction?
  4. Show code

    > residualPlots(lm(NUMBER ~ LEAF + TREATMENT, data = tobacco))

               Test stat Pr(>|t|)
    LEAF              NA       NA
    TREATMENT         NA       NA
    Tukey test     -0.18    0.858

  5. Perform a Tukey's test for non-additivity evaluated at α = 0.10 or even α = 0.25. This (curvilinearity) test formally tests for the presence of a quadratic trend in the relationship between residuals and predicted values. As such, it too is only appropriate for simple interactions of scale (HINT). Any evidence of an interaction?
  6. Show code

    > library(alr3)

    > tukey.nonadd.test(lm(NUMBER ~ LEAF + TREATMENT, data = tobacco))

          Test     Pvalue 
    -0.1795228  0.8575272 

  7. Examine a lattice graphic to determine the consistency of the effect across each of the blocks (HINT).
  8. Show code

    > library(lattice)

    > print(barchart(NUMBER ~ TREATMENT | LEAF, data = tobacco))


Q2-4. Analyse these data with a classic randomized block ANOVA
(HINT) to test the H0 that there is no difference in the number of lesions produced by the different strength viruses. Complete the table below.
Show code

> tobacco.aov <- aov(NUMBER ~ TREATMENT + Error(LEAF), data = tobacco)

> summary(tobacco.aov)

Error: LEAF
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  7 292.09  41.727               
Error: Within
          Df Sum Sq Mean Sq F value   Pr(>F)   
TREATMENT  1 248.34 248.339  17.172 0.004328 **
Residuals  7 101.23  14.462                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Source of variation    df    Mean Sq    F-ratio    P-value
LEAF (block)   
TREAT
Residuals   


Q2-5. Interpret the results.

Q2-6. Now analyse these data with a linear mixed effects model (REML).
Show code

> library(nlme)

> tobacco.lme <- lme(NUMBER ~ TREATMENT, random = ~1 | LEAF, data = tobacco, method = "ML")

> summary(tobacco.lme)

Linear mixed-effects model fit by maximum likelihood
 Data: tobacco 
       AIC      BIC    logLik
  102.4911 105.5814 -47.24553
Random effects:
 Formula: ~1 | LEAF
        (Intercept) Residual
StdDev:    3.453752 3.557307
Fixed effects: NUMBER ~ TREATMENT 
                 Value Std.Error DF   t-value p-value
(Intercept)   34.94013  1.873989  7 18.644793  0.0000
TREATMENTWeak -7.87938  1.901460  7 -4.143858  0.0043
 Correlation: 
              (Intr)
TREATMENTWeak -0.507
Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-1.73025327 -0.51797920  0.01211696  0.43723932  1.52636286 
Number of Observations: 16
Number of Groups: 8 

> library(gmodels)

> ci(tobacco.lme)

               Estimate  CI lower  CI upper Std. Error DF      p-value
(Intercept)   34.940130  30.50885 39.371409   1.873989  7 3.169056e-07
TREATMENTWeak -7.879381 -12.37562 -3.383142   1.901460  7 4.328219e-03

> anova(tobacco.lme)

            numDF denDF  F-value p-value
(Intercept)     1     7 368.4997  <.0001
TREATMENT       1     7  17.1716  0.0043
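A small sketch (this step is not in the original output): extract the leaf and within-leaf variance components with VarCorr to see how much variation the blocking absorbed. The variances below are simply the squares of the standard deviations reported in the summary above.

> nlme:::VarCorr(tobacco.lme)

LEAF = pdLogChol(1) 
            Variance StdDev  
(Intercept) 11.92840 3.453752
Residual    12.65443 3.557307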

Show code

> library(lme4)

> tobacco.lmer <- lmer(NUMBER ~ TREATMENT + (1 | LEAF), data = tobacco)

> summary(tobacco.lmer)

Linear mixed model fit by REML 
Formula: NUMBER ~ TREATMENT + (1 | LEAF) 
   Data: tobacco 
   AIC  BIC logLik deviance REMLdev
 96.71 99.8 -44.35    94.49   88.71
Random effects:
 Groups   Name        Variance Std.Dev.
 LEAF     (Intercept) 13.632   3.6922  
 Residual             14.462   3.8029  
Number of obs: 16, groups: LEAF, 8
Fixed effects:
              Estimate Std. Error t value
(Intercept)     34.940      1.874  18.645
TREATMENTWeak   -7.879      1.901  -4.144
Correlation of Fixed Effects:
            (Intr)
TREATMENTWk -0.507


#Calculate the confidence interval for the effect size of the main effect of treatment

> library(languageR)

> pvals.fnc(tobacco.lmer, withMCMC = FALSE, addPlot = F)

$fixed
              Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)     34.940   34.898      30.83     38.789 0.0001    0.000
TREATMENTWeak   -7.879   -7.826     -12.82     -2.224 0.0072    0.001
$random
    Groups        Name Std.Dev. MCMCmedian MCMCmean HPD95lower HPD95upper
1     LEAF (Intercept)   3.6922     1.1625   1.3775     0.0000     3.8555
2 Residual               3.8029     5.0287   5.1814     3.2469     7.5238


#Examine the effect of treatment via a likelihood ratio test

> tobacco.lmer1 <- lmer(NUMBER ~ TREATMENT + (1 | LEAF), data = tobacco, REML = F)

> tobacco.lmer2 <- lmer(NUMBER ~ 1 + (1 | LEAF), data = tobacco, REML = F)

> anova(tobacco.lmer1, tobacco.lmer2, test = "F")

Data: tobacco
Models:
tobacco.lmer2: NUMBER ~ 1 + (1 | LEAF)
tobacco.lmer1: NUMBER ~ TREATMENT + (1 | LEAF)
              Df    AIC    BIC  logLik  Chisq Chi Df Pr(>Chisq)   
tobacco.lmer2  3 110.47 112.79 -52.235                            
tobacco.lmer1  4 102.49 105.58 -47.246 9.9786      1   0.001584 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Show code - ezAnova

> library(ez)

> ezANOVA(dv = .(NUMBER), within = .(TREATMENT), wid = .(LEAF), data = tobacco, type = 3)

$ANOVA
     Effect DFn DFd        F           p p<.05       ges
2 TREATMENT   1   7 17.17156 0.004328219     * 0.3870228

Plot the predicted mean number of lesions (and confidence interval) for each treatment
Show code

> newdata <- expand.grid(TREATMENT = c("Strong", "Weak"), NUMBER = 0)

> mod.mat <- model.matrix(terms(tobacco.lmer), newdata)

> newdata$NUMBER = mod.mat %*% fixef(tobacco.lmer)

> pvar1 <- (diag(mod.mat %*% tcrossprod(vcov(tobacco.lmer), mod.mat)))

> tvar1 <- pvar1 + VarCorr(tobacco.lmer)$LEAF[1]

> newdata <- data.frame(newdata, plo = newdata$NUMBER - 2 * sqrt(pvar1), phi = newdata$NUMBER + 2 * sqrt(pvar1), tlo = newdata$NUMBER - 2 * sqrt(tvar1), thi = newdata$NUMBER + 2 * sqrt(tvar1))

> par(mar = c(5, 5, 1, 1))

> plot(NUMBER ~ as.numeric(TREATMENT), newdata, type = "n", axes = F, ann = F, ylim = range(newdata$plo, newdata$phi), xlim = c(0.5, 2.5))

> with(newdata, arrows(as.numeric(TREATMENT), plo, as.numeric(TREATMENT), phi, ang = 90, length = 0.05, code = 3))

> with(newdata, points(as.numeric(TREATMENT), NUMBER, pch = 21, bg = "black", col = "black"))

> axis(1, at = c(1, 2), lab = newdata$TREATMENT)

> axis(2, las = 1)

> mtext("Number of lesions", 2, line = 3, cex = 1.25)

> mtext("Season", 1, line = 3, cex = 1.25)

> box(bty = "l")

Split-plot

In an attempt to understand the effects on marine animals of short-term exposure to toxic substances, such as might occur following a spill or a major increase in storm water flows, it was decided to examine the toxicant in question, copper, as part of a field experiment in Hong Kong. The experiment consisted of small sources of Cu (small, hemispherical plaster blocks, impregnated with copper), which released the metal into sea water over 4 or 5 days. The organism whose response to Cu was being measured was a small polychaete worm, Hydroides, that attaches to hard surfaces in the sea and is one of the first species to colonize any surface that is submerged. The biological questions focused on whether the timing of exposure to Cu affects the overall abundance of these worms. The time period of interest was the first or second week after a surface became available.

The experimental setup consisted of sheets of black perspex (settlement plates), which provided good surfaces for these worms. Each plate had a plaster block bolted to its centre, and the dissolving block would create a gradient of [Cu] across the plate. Over the two weeks of the experiment, a given plate would have plain plaster blocks (Control), or a block containing copper in the first week followed by a plain block, or a plain block in the first week followed by a dose of copper in the second week. After two weeks in the water, plates were removed and the worms counted back in the laboratory. Without a clear idea of how sensitive these worms are to copper, an effect of the treatments might show up as an overall difference in the density of worms across a plate, or it could show up as a gradient in abundance across the plate, with a different gradient in different treatments. Therefore, on each plate, the density of worms (#/cm2) was recorded at each of four distances from the center of the plate.

Download Copper data set

Format of copper.csv data file
COPPER  PLATE  DIST  WORMS
...     ...    ...   ...

COPPER  Categorical listing of the copper treatment (control = no copper applied, week 2 = copper applied in the second week and week 1 = copper applied in the first week) applied to whole plates. Factor A (between plot factor).
PLATE   Substrate provided for polychaete worm colonization and the unit to which the copper treatment was applied. These are the plots (Factor B). Numbers in this column represent numerical labels given to each plate.
DIST    Categorical listing of the four concentric distances from the center of the plate (the source of the copper treatment), with 1 being the closest and 4 the furthest. Factor C (within plot factor).
WORMS   Density (#/cm2) of worms measured. Response variable.

Open the Copper data set
Show code

> copper <- read.table("copper.csv", header = T, sep = ",", strip.white = T)

> head(copper)

   COPPER PLATE DIST WORMS
1 control   200    4 11.50
2 control   200    3 13.00
3 control   200    2 13.50
4 control   200    1 12.00
5 control    39    4 17.75
6 control    39    3 13.75

Notice that both the PLATE variable and the DIST variable contain only numbers. Make sure that you define both of these as factors (HINT)
Show code

> copper$PLATE <- factor(copper$PLATE)

> copper$DIST <- factor(copper$DIST)

Q3-1. What are the main hypotheses being tested?
  1. H0 Main Effect 1 (Factor A):
  2. H0 Main Effect 2 (Factor C):
  3. H0 Main Effect 3 (A*C):

Q3-2. The usual ANOVA assumptions apply to split-plot designs, and these can be tested by constructing boxplots for each of the main effects. However, it is important to consider what data the boxplots should be based upon. For each of the main hypothesis tests, describe what data should be used to construct boxplots (remember that the assumptions of normality and homogeneity of variance apply to the residuals of each hypothesis test, and therefore the data used in the construction of boxplots for each hypothesis test should represent the respective residuals, or sources of unexplained variation).
  1. H0 Main Effect 1 (Factor A):
  2. H0 Main Effect 2 (Factor C):
  3. H0 Main Effect 3 (A*C):

Q3-3. For each of the hypothesis tests, indicate which Mean Square term should be used as the residual (denominator) in the F-ratio calculation. Note, COPPER and DIST are fixed factors and PLATE is a random factor.
  1. H0 Main Effect 1 (Factor A): F-ratio = MSCOPPER/MS
  2. H0 Main Effect 2 (Factor C): F-ratio = MSDIST/MS
  3. H0 Main Effect 3 (A*C): F-ratio = MSCOPPER:DIST/MS

Q3-4. Construct a boxplot to investigate the assumptions as they apply to the test of H0 Main Effect 1 (Factor A): This is done in two steps

  1. Aggregate the data set by the mean number of WORMS within each plate
    (HINT)
  2. Show code

    > library(plyr)

    > cu.ag <- ddply(copper, ~COPPER + PLATE, function(df) data.frame(WORMS = mean(df$WORMS)))

    #OR

    > library(plyr)

    > cu.ag <- ddply(copper, ~COPPER + PLATE, summarise, WORMS = mean(WORMS, na.rm = TRUE))

    #OR

    > cu.ag <- aggregate(copper, list(COPPER = copper$COPPER, PLATE = copper$PLATE), mean)

    #OR

    > library(nlme)

    > cu.ag <- gsummary(copper, form = ~COPPER/PLATE, base:::mean)

  3. Construct a boxplot of aggregated mean number of WORMS against COPPER treatment
    (HINT)
  4. Show code

    > boxplot(WORMS ~ COPPER, data = cu.ag)

  5. Any evidence of violations of the assumptions (y or n)?


Q3-5. Construct a boxplot to investigate the assumptions as they apply to the test of H0 Main Effect 2 (Factor C): Since Factor C is tested against the overall residual in this case, this is a relatively straightforward procedure (HINT).
Show code

> boxplot(WORMS ~ DIST, data = copper)

  1. Any evidence of violations of the assumptions (y or n)?


Q3-6. Construct a boxplot to investigate the assumptions as they apply to the test of H0 the main interaction effect (A:C): Since A:C is tested against the overall residual, this is a relatively straightforward procedure (HINT).
Show code

> boxplot(WORMS ~ COPPER:DIST, data = copper)

  1. Any evidence of violations of the assumptions (y or n)?


Q3-7. In addition to the above assumptions, the test of PLATE assumes that there is no PLATE by DIST interaction as this is the overall residual (the replicates). That is, the test assumes that the effect of DIST is consistent in all PLATEs. Construct an interaction plot to examine whether there is any evidence of an interaction between PLATE and DIST (HINT)
Show code

> library(car)

> with(copper, interaction.plot(PLATE, DIST, WORMS))

> library(alr3)

> tukey.nonadd.test(lm(WORMS ~ PLATE + DIST, data = copper))

         Test        Pvalue 
-4.122434e+00  3.748907e-05 

  1. Any evidence of an interaction (y or n)?
  2. Is the design (model) unbalanced? (HINT) (Y or N)
    Show code

    > !is.list(replications(WORMS ~ COPPER + Error(PLATE) + DIST + COPPER:DIST, data = copper))

    [1] TRUE

  3. Is there any evidence of issues with sphericity? (HINT) (Y or N)
  4. Show code

    > library(biology)

    > epsi.GG.HF(aov(WORMS ~ COPPER + Error(PLATE) + DIST + COPPER:DIST, data = copper))

    $GG.eps
    [1] 0.5203159
    $HF.eps
    [1] 0.584099
    $sigma
    [1] 1.806771


Q3-8. Write out the linear model



Q3-9. Perform a split-plot ANOVA
(HINT), and complete the following table (HINT). To obtain the hypothesis test for the random factor (Factor B: PLATE), run the model as if all factors were fixed, so that all terms are tested against the overall residuals (HINT)
Show code

> copper.aov <- aov(WORMS ~ COPPER + DIST + COPPER:DIST + Error(PLATE), data = copper)

> AnovaM(copper.aov, RM = T)

   Sphericity Epsilon Values 
-------------------------------
 Greenhouse.Geisser Huynh.Feldt
          0.5203159    0.584099
Anova Table (Type I tests)
Response: WORMS
Error: PLATE
          Df Sum Sq Mean Sq F value    Pr(>F)    
COPPER     2 783.73  391.87  128.02 8.051e-09 ***
Residuals 12  36.73    3.06                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Error: Within
            Df  Sum Sq Mean Sq F value    Pr(>F)    
DIST         3 153.803  51.268 28.3753 1.356e-09 ***
COPPER:DIST  6  53.419   8.903  4.9276 0.0009028 ***
Residuals   36  65.044   1.807                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Greenhouse-Geisser corrected ANOVA table
Response: WORMS
Error: PLATE
               Df Sum Sq Mean Sq F value    Pr(>F)    
COPPER     1.0406 783.73  391.87  128.02 8.021e-08 ***
Residuals 12.0000  36.73    3.06                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Error: Within
                 Df  Sum Sq Mean Sq F value    Pr(>F)    
DIST         1.5609 153.803  51.268 28.3753 2.692e-07 ***
COPPER:DIST  3.1219  53.419   8.903  4.9276  0.005222 ** 
Residuals   36.0000  65.044   1.807                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Huynh-Feldt corrected ANOVA table
Response: WORMS
Error: PLATE
               Df Sum Sq Mean Sq F value    Pr(>F)    
COPPER     1.1682 783.73  391.87  128.02 5.259e-08 ***
Residuals 12.0000  36.73    3.06                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Error: Within
                 Df  Sum Sq Mean Sq F value    Pr(>F)    
DIST         1.7523 153.803  51.268 28.3753 1.126e-07 ***
COPPER:DIST  3.5046  53.419   8.903  4.9276   0.00397 ** 
Residuals   36.0000  65.044   1.807                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Source of variation    df    Mean Sq    F-ratio    P-value
COPPER
PLATE   
DIST
COPPER:DIST
Residuals   

Q3-10. Construct an interaction plot showing the density of worms against distance from the Cu source, with each treatment as a different line (or different bars). HINT
Show code

> interaction.plot(copper$DIST, copper$COPPER, copper$WORMS)


Q3-11. What conclusions would you draw from the analysis (and graph)?

Q3-12. In order to tease out the interaction, we could analyse the effects of the main factor (COPPER) separately for each Distance and/or investigate the effects of Distance separately for each of the three copper treatments. Recall that when performing such simple main effects, it is necessary to use the residual terms from the global analyses as these are likely to be better estimates (as they are based on a larger amount of data).
  1. Investigate the effects of copper separately for each of the four distances HINT
    Show code

    > AnovaM(mainEffects(copper.aov, at = DIST == 1), type = "III")

    Error: PLATE
              Df Sum Sq Mean Sq F value    Pr(>F)    
    INT        2 783.73  391.87  128.02 8.051e-09 ***
    Residuals 12  36.73    3.06                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    Error: Within
              Df  Sum Sq Mean Sq F value    Pr(>F)    
    INT        9 207.222 23.0247  12.743 8.382e-09 ***
    Residuals 36  65.044  1.8068                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


    > AnovaM(mainEffects(copper.aov, at = DIST == 2), type = "III")

    Error: PLATE
              Df Sum Sq Mean Sq F value    Pr(>F)    
    INT        2 783.73  391.87  128.02 8.051e-09 ***
    Residuals 12  36.73    3.06                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    Error: Within
              Df  Sum Sq Mean Sq F value    Pr(>F)    
    INT        9 207.222 23.0247  12.743 8.382e-09 ***
    Residuals 36  65.044  1.8068                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


    > AnovaM(mainEffects(copper.aov, at = DIST == 3), type = "III")

    Error: PLATE
              Df Sum Sq Mean Sq F value    Pr(>F)    
    INT        2 783.73  391.87  128.02 8.051e-09 ***
    Residuals 12  36.73    3.06                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    Error: Within
              Df  Sum Sq Mean Sq F value    Pr(>F)    
    INT        9 207.222 23.0247  12.743 8.382e-09 ***
    Residuals 36  65.044  1.8068                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


    What conclusions would you draw from these analyses?

  2. Investigate the effects of distance separately for each of the three copper treatments HINT
    Show code

    > AnovaM(mainEffects(copper.aov, at = COPPER == "control"), type = "III")

    Error: PLATE
              Df Sum Sq Mean Sq F value    Pr(>F)    
    INT        2 783.73  391.87  128.02 8.051e-09 ***
    Residuals 12  36.73    3.06                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    Error: Within
              Df  Sum Sq Mean Sq F value    Pr(>F)    
    INT        6 188.597 31.4328 17.3972 2.506e-09 ***
    DIST       3  18.625  6.2083  3.4361   0.02691 *  
    Residuals 36  65.044  1.8068                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


    > AnovaM(mainEffects(copper.aov, at = COPPER == "Week 1"), type = "III")

    Error: PLATE
              Df Sum Sq Mean Sq F value    Pr(>F)    
    INT        2 783.73  391.87  128.02 8.051e-09 ***
    Residuals 12  36.73    3.06                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    Error: Within
              Df  Sum Sq Mean Sq F value    Pr(>F)    
    INT        6 188.059 31.3432 17.3477 2.599e-09 ***
    DIST       3  19.163  6.3875  3.5353   0.02419 *  
    Residuals 36  65.044  1.8068                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


    > AnovaM(mainEffects(copper.aov, at = COPPER == "Week 2"), type = "III")

    Error: PLATE
              Df Sum Sq Mean Sq F value    Pr(>F)    
    INT        2 783.73  391.87  128.02 8.051e-09 ***
    Residuals 12  36.73    3.06                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    Error: Within
              Df  Sum Sq Mean Sq F value    Pr(>F)    
    INT        6  37.788   6.298  3.4857  0.008066 ** 
    DIST       3 169.434  56.478 31.2592 3.968e-10 ***
    Residuals 36  65.044   1.807                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

    What conclusions would you draw from these three analyses?

Mixed models

Show code

> copper.lme <- lme(WORMS ~ COPPER * DIST, random = ~1 | PLATE, data = copper, method = "ML")

> copper.lme1 <- lme(WORMS ~ COPPER * DIST, random = ~1 | PLATE, data = copper, method = "ML", correlation = corCompSymm(form = ~1 | PLATE))

> copper.lme2 <- lme(WORMS ~ COPPER * DIST, random = ~1 | PLATE, data = copper, method = "ML", correlation = corAR1(form = ~1 | PLATE))

> anova(copper.lme, copper.lme1)

            Model df      AIC      BIC    logLik   Test      L.Ratio p-value
copper.lme      1 14 228.2842 257.6050 -100.1421                            
copper.lme1     2 15 230.2842 261.6993 -100.1421 1 vs 2 1.024887e-10       1


Check whether a first order autoregressive correlation structure is more appropriate

> copper.lme2 <- update(copper.lme, correlation = corAR1(form = ~1 | PLATE))

> anova(copper.lme, copper.lme2)

            Model df      AIC      BIC     logLik   Test  L.Ratio p-value
copper.lme      1 14 228.2842 257.6050 -100.14209                        
copper.lme2     2 15 222.9325 254.3477  -96.46625 1 vs 2 7.351676  0.0067


# Now fit the 'best' model with REML

> copper.lme3 <- update(copper.lme2, method = "REML")

> anova.lme(copper.lme3)

            numDF denDF  F-value p-value
(Intercept)     1    36 980.6144  <.0001
COPPER          2    12  93.0577  <.0001
DIST            3    36  25.1142  <.0001
COPPER:DIST     6    36   4.2520  0.0025
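As a follow-up sketch (not shown in the original), the estimated first-order autocorrelation parameter (Phi) and the confidence intervals of the fixed effects for the REML fit can be extracted with intervals():

> intervals(copper.lme3)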

Show code

> detach(package:nlme)

> copper.lmer <- lmer(WORMS ~ COPPER * DIST + (1 | PLATE), data = copper)

Examine the fixed effects coefficients
Show code

> summary(copper.lmer)@coefs

> library(languageR)

> pvals.fnc(copper.lmer, withMCMC = FALSE, addPlot = F)

$fixed
                   Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)           10.85  10.8586     9.5809     12.184 0.0001   0.0000
COPPERWeek 1          -3.60  -3.6040    -5.4196     -1.744 0.0001   0.0003
COPPERWeek 2         -10.60 -10.6050   -12.4661     -8.789 0.0001   0.0000
DIST2                  1.15   1.1468    -0.7585      2.829 0.2074   0.1825
DIST3                  1.55   1.5459    -0.1836      3.360 0.0830   0.0745
DIST4                  2.70   2.6984     0.9173      4.550 0.0054   0.0026
COPPERWeek 1:DIST2    -0.05  -0.0453    -2.5593      2.476 0.9552   0.9670
COPPERWeek 2:DIST2     0.05   0.0540    -2.4253      2.541 0.9654   0.9670
COPPERWeek 1:DIST3    -0.30  -0.3022    -2.7646      2.198 0.8132   0.8040
COPPERWeek 2:DIST3     2.20   2.1969    -0.2649      4.787 0.0852   0.0735
COPPERWeek 1:DIST4     0.05   0.0448    -2.5683      2.504 0.9700   0.9670
COPPERWeek 2:DIST4     4.90   4.8992     2.3596      7.395 0.0010   0.0002
$random
    Groups        Name Std.Dev. MCMCmedian MCMCmean HPD95lower HPD95upper
1    PLATE (Intercept)   0.5599     0.2627   0.2841     0.0000     0.7267
2 Residual               1.3442     1.4133   1.4279     1.1428     1.7589

Broad interpretation: In general, compared to the number of worms settling on the control plates, fewer worms settled in the Week 1 treatment and fewer still in the Week 2 treatment. The increase in the number of settled worms gets progressively larger at distances further from the source (DIST 1), and the trend between DIST 1 (the base level) and DIST 4 is significantly steeper for the Week 2 treatment than for the control.

Examine the model fit measures
Show code

> summary(copper.lmer)@AICtab

      AIC      BIC    logLik deviance  REMLdev
 218.2515 247.5723 -95.12575 200.2842 190.2515

Examine the variance components
Show code

> data.frame(summary(copper.lmer)@REmat)

    Groups        Name Variance Std.Dev.
1    PLATE (Intercept)  0.31354  0.55995
2 Residual              1.80677  1.34416

Examine the predicted means and confidence intervals
Show code

> newdata <- expand.grid(COPPER = levels(copper$COPPER), DIST = levels(copper$DIST), WORMS = 0)

> mod.mat <- model.matrix(terms(copper.lmer), newdata)

> newdata$WORMS = mod.mat %*% fixef(copper.lmer)

> pvar1 <- diag(mod.mat %*% tcrossprod(vcov(copper.lmer), mod.mat))

> tvar1 <- pvar1 + VarCorr(copper.lmer)$PLATE[1]

> newdata <- data.frame(newdata, plo = newdata$WORMS - 2 * sqrt(pvar1), phi = newdata$WORMS + 2 * sqrt(pvar1), tlo = newdata$WORMS - 2 * sqrt(tvar1), thi = newdata$WORMS + 2 * sqrt(tvar1))

> par(mar = c(5, 5, 1, 1))

> plot(WORMS ~ as.numeric(DIST), newdata, type = "n", axes = F, ann = F, ylim = range(newdata$plo, newdata$phi), xlim = c(0.5, 4.5))

> with(subset(newdata, COPPER == "control"), arrows(as.numeric(DIST),

> plo, as.numeric(DIST), phi, ang = 90, length = 0.05, code = 3))

> with(subset(newdata, COPPER == "control"), lines(as.numeric(DIST),

> WORMS, lty = 1))

> with(subset(newdata, COPPER == "control"), points(as.numeric(DIST),

> WORMS, pch = 21, bg = "black", col = "black"))

> with(subset(newdata, COPPER == "Week 1"), arrows(as.numeric(DIST),

> plo, as.numeric(DIST), phi, ang = 90, length = 0.05, code = 3))

> with(subset(newdata, COPPER == "Week 1"), lines(as.numeric(DIST),

> WORMS, lty = 2))

> with(subset(newdata, COPPER == "Week 1"), points(as.numeric(DIST),

> WORMS, pch = 21, bg = "grey", col = "black"))

> with(subset(newdata, COPPER == "Week 2"), arrows(as.numeric(DIST),

> plo, as.numeric(DIST), phi, ang = 90, length = 0.05, code = 3))

> with(subset(newdata, COPPER == "Week 2"), lines(as.numeric(DIST),

> WORMS, lty = 3))

> with(subset(newdata, COPPER == "Week 2"), points(as.numeric(DIST),

> WORMS, pch = 21, bg = "white", col = "black"))

> axis(1, at = as.numeric(newdata$DIST), lab = newdata$DIST)

> axis(2, las = 1)

> mtext("Number of settled worms", 2, line = 3, cex = 1.25)

> mtext("Distance from Copper", 1, line = 3, cex = 1.25)

> legend("bottomright", legend = c("Control", "Week 1", "Week 2"),

> pch = c(21, 21, 21), pt.bg = c("black", "grey", "white"),

> lty = c(1, 2, 3), bty = "n", title = "Treatment")

> box(bty = "l")

> library(AICcmodavg)

> data.frame(newdata, predictSE.mer(copper.lmer, newdata, print.matrix = T))

    COPPER DIST WORMS        plo       phi        tlo       thi   fit    se.fit
1  control    1 10.85  9.5476118 12.152388  9.1323308 12.567669 10.85 0.6511941
2   Week 1    1  7.25  5.9476118  8.552388  5.5323308  8.967669  7.25 0.6511941
3   Week 2    1  0.25 -1.0523882  1.552388 -1.4676692  1.967669  0.25 0.6511941
4  control    2 12.00 10.6976118 13.302388 10.2823308 13.717669 12.00 0.6511941
5   Week 1    2  8.35  7.0476118  9.652388  6.6323308 10.067669  8.35 0.6511941
6   Week 2    2  1.45  0.1476118  2.752388 -0.2676692  3.167669  1.45 0.6511941
7  control    3 12.40 11.0976118 13.702388 10.6823308 14.117669 12.40 0.6511941
8   Week 1    3  8.50  7.1976118  9.802388  6.7823308 10.217669  8.50 0.6511941
9   Week 2    3  4.00  2.6976118  5.302388  2.2823308  5.717669  4.00 0.6511941
10 control    4 13.55 12.2476118 14.852388 11.8323308 15.267669 13.55 0.6511941
11  Week 1    4 10.00  8.6976118 11.302388  8.2823308 11.717669 10.00 0.6511941
12  Week 2    4  7.85  6.5476118  9.152388  6.1323308  9.567669  7.85 0.6511941

Now the data are modelled against a Poisson distribution with polynomial contrasts for the distance variable.
Show code

> contrasts(copper$DIST) <- contr.poly

> copper.lmer <- lmer(WORMS ~ COPPER * DIST + (1 | PLATE), data = copper, family = "poisson")

Examine the fixed effects coefficients
Show code

> summary(copper.lmer)@coefs

                        Estimate Std. Error     z value     Pr(>|z|)
(Intercept)          2.498288400 0.06422083 38.90152634 0.000000e+00
COPPERWeek 1        -0.361810036 0.10032970 -3.60621071 3.107009e-04
COPPERWeek 2        -1.890274446 0.25973697 -7.27764880 3.396883e-13
DIST.L               0.156402669 0.12875085  1.21476998 2.244538e-01
DIST.Q              -0.006024973 0.12844166 -0.04690824 9.625864e-01
DIST.C               0.027693947 0.12813173  0.21613652 8.288813e-01
COPPERWeek 1:DIST.L  0.063303015 0.20090958  0.31508212 7.526993e-01
COPPERWeek 2:DIST.L  2.382701845 0.63043195  3.77947508 1.571593e-04
COPPERWeek 1:DIST.Q  0.016654058 0.20065940  0.08299665 9.338542e-01
COPPERWeek 2:DIST.Q -0.535807563 0.51947394 -1.03144262 3.023333e-01
COPPERWeek 1:DIST.C  0.032271051 0.20040891  0.16102603 8.720729e-01
COPPERWeek 2:DIST.C  0.062310438 0.37717623  0.16520245 8.687846e-01

Interpretation: Compared to the number of worms settling on control plates, fewer worms settle in the Week 1 treatment and fewer still in the Week 2 copper treatment. Settlement increases with increasing distance from the copper source, and in the Week 2 treatment this rate of increase is more than twice as great (a significantly steeper linear trend).
Show code - ezAnova

> library(ez)

> ezANOVA(dv = .(WORMS), between = .(COPPER), within = .(DIST), wid = .(PLATE), data = copper, type = 3)

$ANOVA
       Effect DFn DFd          F            p p<.05       ges
2      COPPER   2  12 128.021440 8.051223e-09     * 0.8850657
3        DIST   3  36  28.375324 1.356185e-09     * 0.6017852
4 COPPER:DIST   6  36   4.927645 9.028004e-04     * 0.3442068
$`Mauchly's Test for Sphericity`
       Effect         W           p p<.05
3        DIST 0.2311286 0.008004879     *
4 COPPER:DIST 0.2311286 0.008004879     *
$`Sphericity Corrections`
       Effect       GGe        p[GG] p[GG]<.05      HFe        p[HF] p[HF]<.05
3        DIST 0.5203159 6.355564e-06         * 0.584099 2.048030e-06         *
4 COPPER:DIST 0.5203159 1.021592e-02         * 0.584099 7.346469e-03         *

Repeated Measures

In an honours thesis (1992), Mullens was investigating the ways that cane toads (Bufo marinus) respond to conditions of hypoxia. Toads show two different kinds of breathing patterns, lung or buccal, requiring them to be treated separately in the experiment. Her aim was to expose toads to a range of O2 concentrations, and record their breathing patterns, including parameters such as the expired volume for individual breaths. It was desirable to have around 8 replicates to compare the responses of the two breathing types, and the complication is that animals are expensive, and different individuals are likely to have different O2 profiles (leading to possibly reduced power). There are two main design options for this experiment;

  • One animal per O2 treatment, 8 concentrations, 2 breathing types. With 8 replicates the experiment would require 128 animals, but it could be analysed as a completely randomized design
  • One O2 profile per animal, so that each animal would be used 8 times and only 16 animals are required (8 lung and 8 buccal breathers)
Mullens decided to use the second option so as to reduce the number of animals required (on financial and ethical grounds). By selecting this option, she did not have a set of independent measurements for each oxygen concentration, but rather repeated measurements on each animal across the 8 oxygen concentrations.
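The animal-count arithmetic behind the two options, as a quick sketch:

> 8 * 2 * 8   #O2 concentrations x breathing types x replicates (option 1)

[1] 128

> 2 * 8       #breathing types x replicates, with each animal reused for all 8 concentrations (option 2)

[1] 16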

Download Mullens data set

Format of mullens.csv data file
BREATH  TOAD  O2LEVEL  FREQBUC  SFREQBUC
lung    a     0        10.6     3.256
lung    a     5        18.8     4.336
lung    a     10       17.4     4.171
lung    a     15       16.6     4.074
...     ...   ...      ...      ...

BREATH    Categorical listing of the breathing type treatment (buccal = buccal breathing toads, lung = lung breathing toads). This is the between subjects (plots) effect and applies to whole toads (since a single toad can only be one breathing type - either lung or buccal). Equivalent to Factor A (between plots effect) in a split-plot design
TOAD      These are the subjects (equivalent to the plots in a split-plot design: Factor B). The letters in this variable represent the labels given to each individual toad.
O2LEVEL   0 through to 50 represent the different oxygen concentrations (0% to 50%). The different oxygen concentrations are equivalent to the within plot effects in a split-plot design (Factor C).
FREQBUC   The frequency of buccal breathing - the response variable
SFREQBUC  Square root transformed frequency of buccal breathing - the response variable

Open the mullens data file. HINT.
Show code

> mullens <- read.table("mullens.csv", header = T, sep = ",", strip.white = T)

> head(mullens)

  BREATH TOAD O2LEVEL FREQBUC SFREQBUC
1   lung    a       0    10.6 3.255764
2   lung    a       5    18.8 4.335897
3   lung    a      10    17.4 4.171331
4   lung    a      15    16.6 4.074310
5   lung    a      20     9.4 3.065942
6   lung    a      30    11.4 3.376389

Notice that the O2LEVEL variable contains only numbers. Make sure that you define it as a factor (HINT)

Show code

> mullens$O2LEVEL <- factor(mullens$O2LEVEL)

Q4-1. What are the main hypotheses being tested?
  1. H0 Main Effect 1 (Factor A):
  2. H0 Main Effect 2 (Factor C):
  3. H0 Main Effect 3 (A*C):

Q4-2. We will now address all of the assumptions.

  1. Construct a boxplot to investigate the assumptions as they apply to the test of H0 Main Effect 1 (Factor A): Start by aggregating the data set by TOAD
    (since the toads are the replicates for the between subject effect - BREATH) (HINT). Then construct a boxplot of aggregated mean FREQBUC against BREATH treatment
    (HINT). Any evidence of violations of the assumptions (y or n)?
    Show code

    > library(plyr)

    > mullens.ag <- ddply(mullens, ~BREATH + TOAD, function(df) data.frame(FREQBUC = mean(df$FREQBUC)))

    > boxplot(FREQBUC ~ BREATH, mullens.ag)


    Try a square-root transformation!
  2. Show code

    > library(plyr)

    > mullens.ag <- ddply(mullens, ~BREATH + TOAD, function(df) data.frame(SFREQBUC = mean(sqrt(df$FREQBUC))))

    > boxplot(SFREQBUC ~ BREATH, mullens.ag)

  3. Construct a boxplot to investigate the assumptions as they apply to the test of H0 Main Effect 2 (Factor C): Since Factor C is tested against the overall residual in this case, this is a relatively straightforward procedure (HINT). Any evidence of violations of the assumptions (y or n)?
  4. Show code

    > boxplot(SFREQBUC ~ O2LEVEL, mullens)

  5. Construct a boxplot to investigate the assumptions as they apply to the test of H0 the main interaction effect (A:C): Since A:C is tested against the overall residual, this is a relatively straightforward procedure (HINT). Any evidence of violations of the assumptions (y or n)?
  6. Show code

    > boxplot(SFREQBUC ~ BREATH * O2LEVEL, mullens)

  7. In addition to the above assumptions, the test of TOAD assumes that there is no TOAD by O2LEVEL interaction as this is the overall residual (the replicates). That is, the test assumes that the effect of O2LEVEL is consistent in all TOADs. Construct an interaction plot to examine whether there is any evidence of an interaction between TOAD and O2LEVEL (HINT). Any evidence of an interaction (y or n)?
  8. Show code

    > with(mullens, interaction.plot(TOAD, O2LEVEL, SFREQBUC))

    > library(alr3)

    > tukey.nonadd.test(lm(SFREQBUC ~ TOAD + O2LEVEL, data = mullens))

           Test      Pvalue 
    3.079699333 0.002072097 

  9. You must also check to see whether the proposed model is balanced.
    Show code

    > !is.list(replications(SFREQBUC ~ BREATH + Error(TOAD) + O2LEVEL + BREATH:O2LEVEL, data = mullens))

    [1] FALSE

    Is it? (Y or N)
  10. Finally, since the design is a repeated measures design, there is a good chance that the assumption of sphericity has been violated.
    Show code

    > library(biology)

    > epsi.GG.HF(aov(SFREQBUC ~ BREATH + O2LEVEL + BREATH:O2LEVEL + Error(TOAD), data = mullens))

    $GG.eps
    [1] 0.4281775
    $HF.eps
    [1] 0.5172755
    $sigma
    [1] 0.7531308

    Has it (HINT)? (Y or N)


Q4-3. Assume that the assumption of compound symmetry/sphericity will be violated and perform a split-plot ANOVA (repeated measures)
(HINT), and complete the following table with corrected p-values (HINT).
Show code

> mullens.aov <- aov(SFREQBUC ~ BREATH * O2LEVEL + Error(TOAD), data = mullens)

> library(biology)

> AnovaM(mullens.aov, RM = T)

   Sphericity Epsilon Values 
-------------------------------
 Greenhouse.Geisser Huynh.Feldt
          0.4281775   0.5172755
Anova Table (Type I tests)
Response: SFREQBUC
Error: TOAD
          Df  Sum Sq Mean Sq F value  Pr(>F)  
BREATH     1  39.921  39.921  5.7622 0.02678 *
Residuals 19 131.634   6.928                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Error: Within
                Df  Sum Sq Mean Sq F value    Pr(>F)    
O2LEVEL          7  38.969  5.5670  7.3918 1.710e-07 ***
BREATH:O2LEVEL   7  56.372  8.0531 10.6928 1.228e-10 ***
Residuals      133 100.166  0.7531                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Greenhouse-Geisser corrected ANOVA table
Response: SFREQBUC
Error: TOAD
                Df  Sum Sq Mean Sq F value  Pr(>F)  
BREATH     0.42818  39.921  39.921  5.7622 0.04785 *
Residuals 19.00000 131.634   6.928                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Error: Within
                     Df  Sum Sq Mean Sq F value    Pr(>F)    
O2LEVEL          2.9972  38.969  5.5670  7.3918  0.000129 ***
BREATH:O2LEVEL   2.9972  56.372  8.0531 10.6928 2.435e-06 ***
Residuals      133.0000 100.166  0.7531                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Huynh-Feldt corrected ANOVA table
Response: SFREQBUC
Error: TOAD
                Df  Sum Sq Mean Sq F value  Pr(>F)  
BREATH     0.51728  39.921  39.921  5.7622 0.04393 *
Residuals 19.00000 131.634   6.928                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
Error: Within
                     Df  Sum Sq Mean Sq F value    Pr(>F)    
O2LEVEL          3.6209  38.969  5.5670  7.3918 4.095e-05 ***
BREATH:O2LEVEL   3.6209  56.372  8.0531 10.6928 4.223e-07 ***
Residuals      133.0000 100.166  0.7531                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Source of variation    df    Mean Sq    F-ratio    P-value    GG.P    HF.P
BREATH
TOAD
O2LEVEL
BREATH:O2LEVEL
Residuals

Q4-4. Construct an interaction plot
showing the frequency of buccal breathing against oxygen level, with each breathing type as different lines (or different bars). HINT
Show code

> with(mullens, interaction.plot(O2LEVEL, BREATH, FREQBUC))


Q4-5. What conclusions would you draw from the analysis (and graph)?

Mixed models

Show code

> library(lme4)

> mullens.lmer <- lmer(SFREQBUC ~ BREATH * O2LEVEL + (1 | TOAD),
+     data = mullens)

Examine the fixed effects coefficients
Show code

> summary(mullens.lmer)@coefs

> library(languageR)

> pvals.fnc(mullens.lmer, withMCMC = FALSE, addPlot = F)

$fixed
                     Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
(Intercept)            4.9410   4.9384     4.3270     5.5424 0.0001   0.0000
BREATHlung            -3.3914  -3.3876    -4.3599    -2.3899 0.0001   0.0000
O2LEVEL5              -0.2174  -0.2189    -0.9361     0.5244 0.5640   0.5240
O2LEVEL10             -0.5476  -0.5484    -1.2543     0.1900 0.1394   0.1097
O2LEVEL15             -0.8687  -0.8702    -1.5651    -0.1471 0.0178   0.0117
O2LEVEL20             -1.5444  -1.5463    -2.2855    -0.8327 0.0002   0.0000
O2LEVEL30             -1.4644  -1.4656    -2.1589    -0.7132 0.0002   0.0000
O2LEVEL40             -2.2457  -2.2460    -2.9810    -1.5347 0.0001   0.0000
O2LEVEL50             -2.4600  -2.4599    -3.1459    -1.7343 0.0001   0.0000
BREATHlung:O2LEVEL5    1.0362   1.0376    -0.0997     2.2189 0.0766   0.0622
BREATHlung:O2LEVEL10   2.3327   2.3326     1.1724     3.5081 0.0001   0.0000
BREATHlung:O2LEVEL15   2.3729   2.3726     1.1888     3.5310 0.0001   0.0000
BREATHlung:O2LEVEL20   3.2985   3.2967     2.1247     4.4251 0.0001   0.0000
BREATHlung:O2LEVEL30   2.9767   2.9738     1.8095     4.1667 0.0001   0.0000
BREATHlung:O2LEVEL40   3.6168   3.6121     2.4003     4.7480 0.0001   0.0000
BREATHlung:O2LEVEL50   3.4669   3.4647     2.3277     4.6389 0.0001   0.0000
$random
    Groups        Name Std.Dev. MCMCmedian MCMCmean HPD95lower HPD95upper
1     TOAD (Intercept)   0.8786     0.6016   0.6089     0.4299     0.8014
2 Residual               0.8678     0.9305   0.9341     0.8218     1.0581

Welcome to the end of Workshop 7

  Reading data into R

Ensure that the working directory is pointing to the path containing the file to be imported before proceeding.
To import data into R, we read the contents of a file into a data frame. The general format of the command for reading data into a data frame is

> name <- read.table('filename.csv', header=T, sep=',', row.names=column)

where name is the name you wish the data frame to be referred to by, filename.csv is the name of the csv file that you created in Excel and column is the number of the column that had row labels (if there were any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file). If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument.

As an example

> phasmid <- read.table('phasmid.csv', header=T, sep=',', row.names=1)

End of instructions

  Nested ANOVA

A nested ANOVA is a simple extension of the single factor ANOVA. An additional factor (Factor B) is added that is nested within the main factor of interest (Factor A). Nested factors are incorporated in situations where researchers envisage potentially large amounts of variation between different sets of replicates within the main factor.
For example, a limnologist investigating the effect of flow (Factor A: high, medium and low) on macroinvertebrate density might collect samples from a number of high, medium and low flow streams. However, the limnologist might expect that many other factors (perhaps more important than stream flow) also influence the density of macroinvertebrates, and that even samples collected in close proximity to one another might be expected to vary considerably in macroinvertebrate density. Such variability increases the unexplained variability and therefore has the potential to make it very difficult to pick up the effect of the actual factor of interest (in this case flow).
In an attempt to combat this, the researcher decides to collect multiple samples from within each stream. If some of the unexplained variability can be explained by variation between samples within streams, then the test of flow will be more sensitive.
This is called a nested ANOVA; an additional factor (Factor B: stream) has been introduced within the main factor (Factor A: flow). Factor A is a fixed factor as we are only interested in the specific levels chosen (e.g. high, medium and low flow). Factor B, however, is a random factor as we are not just interested in the specific streams that were sampled, but in all the possible streams that they represent. In a nested ANOVA the categories of Factor B nested within each level of Factor A are different (e.g. the streams used for each flow type are different), and therefore it is not possible to test for an interaction between flow type and stream.

There are two null hypotheses being tested in a two factor nested ANOVA. H0 main effect: no effect of Factor A (flow); H0: μ1 = μ2 = ... = μi = μ (as for single factor ANOVA). H0 nesting effect: no effect of Factor B (stream) nested within Factor A (flow). Technically, it is that there is no added variance due to differences between all the possible levels of Factor B within any level of Factor A. E.g., that there is no added variance due to differences between all possible high flow streams, between all possible medium flow streams, or between all possible low flow streams. Diagram shows the layout of a nested design with three treatments (different pattern fills).
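As a concrete sketch, the limnologist's design described above could be fitted as follows (the data frame and variable names here are invented for illustration only):

# hypothetical data frame 'streams' with DENSITY (response), FLOW (fixed factor)
# and STREAM (random factor nested within FLOW)
> streams$STREAM <- as.factor(streams$STREAM)
# FLOW is tested against the variation among streams within flow types
> streams.aov <- aov(DENSITY ~ FLOW + Error(STREAM), data=streams)
> summary(streams.aov)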

End of instructions

  Convert numeric variable into a factor

> data$var <- as.factor(data$var)

where data is the name of the data set and var is the name of the variable to be made into a factor. While there is no visible evidence that the data are any different, R will now treat the variable as a factor rather than as a numeric variable.

End of instructions

  Nested ANOVA assumptions

In ANOVA, the assumptions of normality and variance homogeneity apply to the residuals for the test of interest. In this example, the nested factor (Factor B, SITE) is random. Therefore, the normality and homogeneity of variance assumptions for Factor A (SEASON) must be assessed using the means of Factor B within each level of Factor A as the observations, since the error term for Factor A (SEASON) is the variation in the response variable between categories of Factor B within each level of Factor A (MSSITE). Calculate the mean from each level of Factor B (SITE), and use these means to calculate the mean, variance and distribution of each level of Factor A (SEASON).

# aggregate the data set by calculating the means for each level of the nesting factor
> name.ag <- aggregate(dataset, list(Name1=FactorA, Name2=FactorB), mean)
# OR perhaps more efficiently
> library(nlme)
> name.ag <- gsummary(dataset, groups=FactorB)
# use the aggregated means to produce the boxplots
> boxplot(dv~FactorA, name.ag)

where name.ag is a name given to the new aggregated data set, dataset is the name of the original data set, Name1 and Name2 are names to appear in the aggregated data set that represent FactorA (the main factor) and FactorB (the nesting factor) respectively. Finally, dv is the name of the dependent variable.

This data set generates two boxplots, each constructed from only 3 values and thus of questionable value.

End of instructions

  Nested ANOVA assumptions

# fit the hierarchical linear model
> name.aov <- aov(dv~FactorA + Error(FactorB), data=data)
# print out the ANOVA strata
> summary(name.aov)

where name.aov is a name provided to store the fitted model, dv is the dependent variable, FactorA is the main fixed factor and FactorB is the random nested factor. data is the name of the data set. The 'Error()' argument defines additional error (residual) strata; hence the above template indicates that FactorA should be tested against FactorB.
For example:

# fit the hierarchical linear model
> curdies.aov <- aov(S4DUGES~SEASON + Error(SITE), data=curdies)
# print out the ANOVA strata
> summary(curdies.aov)

In recognition that FactorB is a random nested factor, and thus of little interest, the F-ratio is not provided. Since the effect of FactorB is tested against the overall residual (error) term, to obtain the F-ratio for the test of FactorB, fit the model without defining the additional error stratum. Note that when you do this, all terms are tested against the residual term, and therefore any fixed factors above the random factors in the hierarchy will be tested incorrectly.

> curdies.aov1 <- aov(S4DUGES~SEASON + SITE, data=curdies)
> summary(curdies.aov1)

End of instructions

  Two factor ANOVA

Statistical models that incorporate more than one categorical predictor variable are broadly referred to as factorial (multifactor) analysis of variance. There are two main reasons for the inclusion of multiple factors:

  1. To attempt to reduce the unexplained (or residual) variation in the response variable and therefore increase the sensitivity of the main effect test.
  2. To examine the interaction between factors. That is, whether the effect of one factor on the response variable depends on the level of the other factor(s), or is consistent across all levels of the other factor(s).

Fully factorial linear models are used when the design incorporates two or more factors (independent, categorical variables) that are crossed with each other. That is, all combinations of the factors are included in the design and every level (group) of each factor occurs in combination with every level of the other factor(s). Furthermore, each combination is replicated.

In fully factorial designs both factors are equally important (potentially), and since all factors are crossed, we can also test whether there is an interaction between the factors (does the effect of one factor depend on the level of the other factor(s)). Graphs above depict a) interaction, b) no interaction.


For example, Quinn (1988) investigated the effects of season (two levels, winter/spring and summer/autumn) and adult density (four levels: 8, 15, 30 and 45 animals per 225 cm²) on the number of limpet egg masses. As the design was fully crossed (all four densities were used in each season), he was also able to test for an interaction between season and density. That is, was the effect of density consistent across both seasons, and was the effect of season consistent across all densities.
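In R, Quinn's design corresponds to a fully factorial model along the following lines (a sketch only; the response and variable names here are assumed for illustration, not taken from the actual data set):

# hypothetical data frame 'quinn' with EGGMASSES (response), SEASON and DENSITY
> quinn$DENSITY <- as.factor(quinn$DENSITY)   # density levels are numbers, so force a factor
> quinn.aov <- aov(EGGMASSES ~ SEASON * DENSITY, data=quinn)
> summary(quinn.aov)   # tests SEASON, DENSITY and the SEASON:DENSITY interaction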

Diagram shows layout of 2 factor fully crossed design. The two factors (each with two levels) are color (black or gray) and pattern (solid or striped). There are three experimental units (replicates) per treatment combination.


End of instructions

  Two-factor ANOVA

> name.aov <- aov(dv~factor1 * factor2, data=data)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set).

End of instructions

  Fully factorial ANOVA assumption checking

The assumptions of normality and homogeneity of variance apply to each of the factor level combinations, since it is the replicate observations of these that are the test residuals. If the design has two factors (IV1 and IV2) with three and four levels (groups) respectively within these factors, then a boxplot of each of the 3x4=12 combinations needs to be constructed. It is recommended that a variable (called GROUP) be setup to represent the combination of the two categorical (factor) variables.

Simply construct a boxplot with the dependent variable on the y-axis and GROUP on the x-axis. Visually assess whether the data from each group are normal (or at least that the groups are not all consistently skewed in the same direction), and whether the spread of data in each group is similar (or at least not related to the mean for that group). The GROUP variable can also assist in calculating the mean and variance for each group, to further assess variance homogeneity.
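For example, the GROUP variable can be generated with the interaction() function (a minimal sketch, assuming a data frame data with factors IV1 and IV2 and dependent variable dv):

# combine the two factors into a single grouping variable
> data$GROUP <- interaction(data$IV1, data$IV2)
# one boxplot per factor-level combination
> boxplot(dv ~ GROUP, data=data)
# group means and variances to further assess variance homogeneity
> tapply(data$dv, data$GROUP, mean)
> tapply(data$dv, data$GROUP, var)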

End of instructions

  Factorial boxplots

> boxplot(dv~factor1*factor2,data=data)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set). The '~' in a formula indicates that the left hand side is modelled as a function of the right hand side.

End of instructions

  Plotting mean vs variance

# first calculate the means and variances
> means <- tapply(data$dv, list(data$factor1, data$factor2), mean)
> vars <- tapply(data$dv, list(data$factor1, data$factor2), var)
# then plot the mean against variance
> plot(means,vars)

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables) and data is the name of the data frame (data set).

End of instructions

  Model balance

Balanced models are those models that have equal sample sizes for each level of a treatment. Unequal sample sizes are generally problematic for statistical analyses as they can be the source of unequal variances. They are additionally problematic for multifactor designs because of the way that total variability is partitioned into the contributions of each of the main effects as well as the contributions of interactions (i.e. the calculations of sums of squares and degrees of freedom). Sums of squares (and degrees of freedom) are usually (Type I SS) partitioned additively. That is:
SSTotal=SSA+SSB+SSAB+SSResidual, and
dfTotal=dfA+dfB+dfAB+dfResidual.
In practice, these formulas are used in reverse. For example, if the model was entered as DV~A+B+AB, first SSResidual is calculated, followed by SSAB, then SSB. SSA is thus calculated as the total sums of squares minus what has already been partitioned off to the other terms (SSA = SSTotal - (SSB + SSAB + SSResidual)). If the model is balanced, then it doesn't matter in what order the model is constructed (DV~B+A+AB will be equivalent).

However, in unbalanced designs, SS cannot be partitioned additively; that is, the actual SS of each of the terms does not add up to SSTotal (SSTotal ≠ SSA + SSB + SSAB + SSResidual). Of course, when SS is partitioned it all must add up. Consequently, the SS of the last term to be calculated will not be correct. Therefore, in unbalanced designs, the SS values for the terms differ (and therefore so do the F-ratios etc.) depending on the order in which the terms are added - an undesirable situation.
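The order dependence is easy to demonstrate with a small invented unbalanced data set (all names here are arbitrary):

# toy unbalanced two-factor data set: one cell has two replicates, the others one
> set.seed(1)
> toy <- data.frame(A = factor(c('a1','a1','a1','a2','a2')),
+                   B = factor(c('b1','b2','b2','b1','b2')),
+                   DV = rnorm(5))
# Type I SS depend on the order in which the terms are entered
> anova(aov(DV ~ A + B, data=toy))
> anova(aov(DV ~ B + A, data=toy))   # SS for A and B differ from the line above
# the balance check described below returns FALSE for this design
> !is.list(replications(DV ~ A * B, data=toy))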

To account for this, there are alternative methods of calculating SS for the main effects (all provide same SS for interaction terms and residuals).

It turns out that it is relatively easy to determine whether or not the model is balanced - use the following syntax:

> !is.list(replications(formula,data))

where formula is the model formula (for example, MASS~SITUATION*MONTH) and data is the name of data set.
If this function returns a TRUE, your model is balanced. If it returns a FALSE then your model is not balanced and you should consider using Type II or III SS methods.

End of instructions

  Linear regression diagnostics

> plot(name.lm)

where name.lm is the name of a fitted linear model. You will be prompted to hit return to cycle through a series of four diagnostic plots. The first (and most important of these) is the Residual Plot.

End of instructions

  ANOVA table

#print a summary of the ANOVA table
> anova(name.aov)

where name.aov is the name of a fitted linear model.

End of instructions

  Interaction plot

Interaction plots display the degree of consistency (or lack thereof) of the effect of one factor across each level of another factor. Interaction plots can be either bar or line graphs; however, line graphs are more effective. The x-axis represents the levels of one factor, and a separate line is drawn for each level of the other factor. The following interaction plots represent two factors, A (with levels A1, A2, A3) and B (with levels B1, B2).


The parallel lines of the first plot clearly indicate no interaction. The effect of factor A is consistent for both levels of factor B and vice versa. The middle plot demonstrates a moderate interaction and the bottom plot illustrates a severe interaction. Clearly, whether or not there is an effect of factor B (e.g. B1 > B2 or B2 > B1) depends on the level of factor A. The trend is not consistent.

End of instructions

  R Interaction plot

> interaction.plot(dataset$Factor1, dataset$Factor2, dataset$dv)

where Factor1 is the name of the categorical variable (factor) with the most levels (this will be used for the x-axis), Factor2 is the name of the other factorial variable (this will be used to produce the different lines) and dv is the name of the dependent variable within the data set dataset.
For example:

> interaction.plot(copper$DIST, copper$COPPER, copper$WORMS)

End of instructions

  Specific comparisons of means

Following a significant ANOVA result, it is often desirable to specifically compare group means to determine which groups are significantly different. However, multiple comparisons lead to two statistical problems. Firstly, multiple significance tests increase the rate of Type I errors (α, the probability of falsely rejecting H0). E.g., testing for differences between 5 groups requires ten pairwise comparisons. If the α for each test is 0.05 (5%), then the probability of at least one Type I error for the family of 10 tests is 0.4 (40%). Secondly, the outcome of each test needs to be independent (orthogonality). E.g. if A>B and B>C then we already know the result of A vs. C.
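The 40% figure quoted above follows directly from the complement rule; in R:

# probability of at least one Type I error across 10 independent tests at alpha = 0.05
> 1 - (1 - 0.05)^10
[1] 0.4012631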

Post-hoc unplanned pairwise comparisons (e.g. Tukey's test) compare all possible pairs of group means and are useful in an exploratory fashion to reveal major differences. Tukey's test controls the family-wise Type I error rate to no more than 0.05. However, this reduces the power of each of the pairwise comparisons, and only very large differences are detected (a consequence that worsens with an increasing number of groups).

Planned comparisons are specific comparisons that are usually planned during the design stage of the experiment. No more than p-1 comparisons can be made (where p is the number of groups); however, each comparison (provided it is orthogonal to the others) can be tested at α = 0.05. Amongst all possible pairwise comparisons, specific comparisons are selected, while others that are meaningless (within the biological context of the investigation) are ignored.

Planned comparisons are defined using what are known as contrasts coefficients. These coefficients are a set of numbers that indicate which groups are to be compared, and what the relative contribution of each group is to the comparison. For example, if a factor has four groups (A, B, C and D) and you wish to specifically compare groups A and B, the contrast coefficients for this comparison would be 1, -1, 0,and 0 respectively. This indicates that groups A and B are to be compared (since their signs oppose) and that groups C and D are omitted from the specific comparison (their values are 0).

It is also possible to compare the average of two or more groups with the average of other groups. For example, to compare the average of groups A and B with the average of group C, the contrast coefficients would be 1, 1, -2, and 0 respectively. Note that 0.5, 0.5, -1 and 0 would be equivalent. A code sketch is given after the points below.

The following points relate to the specification of contrast coefficients:

  1. The sum of the coefficients should equal 0
  2. Groups to be compared should have opposing signs
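A minimal sketch of defining and testing a planned comparison in R follows (assuming a factor TREATMENT with four levels A, B, C and D in a data frame data, and a dependent variable dv; all names are illustrative):

# define the contrast coefficients: compare group A with group B
> contrasts(data$TREATMENT) <- cbind(c(1, -1, 0, 0))
> data.aov <- aov(dv ~ TREATMENT, data=data)
# partition the treatment sums of squares to test the planned comparison
> summary(data.aov, split=list(TREATMENT=list('A vs B'=1)))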

End of instructions

  Tukey's Test

# load the 'multcomp' package
> library(multcomp)
# perform Tukey's test
> summary(simtest(dv~factor1*factor2, whichf='factor', data=data, type="Tukey"))
# OR the more up to date method
> summary(glht(data.aov, linfct=mcp(factor="Tukey")))

where dv is the name of the numeric vector (dependent variable), factor1 and factor2 are the names of the factors (categorical or factor variables), and data is the name of the data frame (data set). The argument type="Tukey" indicates that Tukey's p-value adjustments should be made. When the linear model contains multiple categorical predictor variables, the argument whichf='' must be used to indicate which factor the Tukey's test should be performed on.

The coefficients list the specific comparisons made. Each line represents a specific hypothesis test including difference between groups, t-value, raw p value and Tukey's adjusted p value (the other values presented are not important for BIO3011).

End of instructions

  Simple Main Effects

> library(biology)
> quinn.aov1 <- aov(SQRTRECRUITS~DENSITY, data=quinn, subset=c(SEASON=='Summer'))
> AnovaM(lm(quinn.aov1), lm(quinn.aov),type="II")

To subset, type the name of the other categorical variable followed by == (two equals signs) and the name of one of the levels of this factor (surrounded by single quotation marks). For example, if you had a categorical variable called TEMPERATURE and it consisted of three levels (High, Medium and Low), to perform an ANOVA using only the High temperature data, the subset expression would be TEMPERATURE=='High'.
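Following the TEMPERATURE example above, the call would look like this (a sketch using the same invented names):

# ANOVA of the response against FACTOR using only the High temperature data
> name.aov <- aov(dv ~ FACTOR, data=data, subset=c(TEMPERATURE=='High'))
> summary(name.aov)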

End of instructions

  Randomized complete block (RCB) designs

These are designs where one factor is a blocking variable (Factor B) and the other variable (Factor A) is the main treatment factor of interest. These designs are used in situations where the environment is potentially heterogeneous enough to substantially increase the variability in the response variable and thus obscure the effects of the main factor. Experimental units are grouped into blocks (units of space or time that are likely to have less variable background conditions) and then each level of the treatment factor is applied to a single unit within each block. By grouping the experimental units together into blocks, some of the total variability in the response variable that is not explained by the main factor may be explained by the blocks. If so, the residual (unexplained) variation will be reduced resulting in more precise estimates of parameters and more powerful tests of treatments.

A plant ecologist was investigating the effects of leaf damage on plant chemical defenses. Since individual trees vary greatly in the degree of chemical defense, she used trees as blocks: within each of eight randomly selected trees, each of three artificial leaf damage treatments (heavy, medium and light damage) was applied to one of three branches. The diagram shows the layout of a blocking design. The main factor has three treatments (pattern fills), each of which is applied randomly within 3 randomly selected blocks. The decision to include a blocking factor (rather than a single factor, completely randomized design) depends on whether the experimental units (replicates) within a block are likely to be more similar to one another than those between blocks, and whether the reduction in MSResidual is likely to compensate for the loss of residual df.
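The plant ecologist's design could be fitted along these lines (a hypothetical sketch; the data frame and variable names are invented):

# DEFENSE: chemical defense response; DAMAGE: fixed treatment (heavy/medium/light)
# TREE: random blocking factor (the eight trees)
> trees.aov <- aov(DEFENSE ~ DAMAGE + Error(TREE), data=trees)
> summary(trees.aov)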

End of instructions

  Randomized blocks

> name.aov <- aov(dv~factorA + Error(factorB), data=data)

where dv is the name of the numeric vector (dependent variable), factorA and factorB represent the main factor of interest and the blocking factor respectively.

End of instructions

  Split-plot designs

Split-plot designs can be thought of as a combination of a randomized complete block (RCB) design and a nested design. In their simplest form, they represent a RCB design with a second factor applied to the whole blocks. One factor (Factor C) is applied to experimental units (usually randomly positioned) within each block (now referred to as plots, or Factor B); this is the within-plot factor. A second factor (the between-plots factor, or Factor A) is then applied to the whole plots, and these plots are replicated. Hence a nested design applies between Factor A and the plots, and a blocking design applies between Factor C and the plots. Factor A is also crossed with Factor C, and therefore there is a test of the interaction between the two main factors.


Note that if there are no replicate units of Factor C within each plot (that is, if each level of Factor C is applied only once to each plot), then the interaction of Factor C by Plots within Factor A represents the overall residual. However, since the plots are usually random factors, the effects of Factor A must be tested against the Plots within Factor A term.
If there are replicate units of Factor C within each plot, then the Factor A by Plots within Factor A by Factor C term is the overall residual.

Source    F-ratio (A, C fixed; B random)
A         MSA/MSB
B         MSB/MSResidual
C         MSC/MSB:C
A:C       MSA:C/MSB:C
B:C       MSB:C/MSResidual

For example, a marine biologist investigated the effects of estuary flow on the abundance of mussels and barnacles, and whether these effects varied with tidal height. A split-plot design was used whereby randomly selected estuary sites were the plots, six tidal heights (from 0 to 3.6 m) were applied within each site (the within-plot effect), and each site was of either high or low flow (the between-plots effect). Diagram illustrates the layout of a split-plot design. There are three treatments (levels of Factor C, fill patterns for squares) applied within each plot (ovals). One of two treatments (levels of Factor A, fill for ovals) was applied to whole plots. Note that treatments for both Factor A and C are replicated and fully crossed.
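As an illustrative sketch (invented names), the marine biologist's split-plot could be fitted as:

# FLOW: between-plots factor; TIDAL: within-plot factor; SITE: random plot
> mussels.aov <- aov(ABUND ~ FLOW * TIDAL + Error(SITE), data=mussels)
> summary(mussels.aov)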

End of instructions

  Aggregate a data set

> name.ag <- aggregate(dataset, list(NAME=FactorA, NAME1=FactorB), mean)
# OR
> library(nlme)
> name.ag <- gsummary(dataset, groups=dataset$FactorB)

where name.ag is a name you provide for the aggregated data set, dataset is the name of the data set, NAME is a name for the first factor in the resulting data set, FactorA is the name of the main factor, NAME1 is a name for the second factor in the resulting data set and FactorB is the name of the plot (blocking) factor.
For example:

> cu.ag <- aggregate(copper, list(COPPER=copper$COPPER, PLATE=copper$PLATE), mean)
# OR
> library(nlme)
> cu.ag <- gsummary(copper, groups=copper$PLATE)

End of instructions

  Boxplot from aggregated data set

Having generated an aggregated data set, it is possible to use the aggregated means to test the assumptions of normality and homogeneity of variance.

> boxplot(name.ag$dv~name.ag$FactorA)

where name.ag is a name of the aggregated data set, dv is the name of the dependent variable and FactorA is the name of the main factor.
For example:

> boxplot(cu.ag$WORMS ~ cu.ag$COPPER)

End of instructions

  R Compound symmetry/sphericity

# source Murray's syntax for dealing with sphericity
> library(biology)
# after fitting a randomized block, repeated measures or split-plot
> epsi.GG.HF(name.aov)

where name.aov is the name of the fitted randomized block, repeated measures or split-plot model. The epsi.GG.HF function returns a table of epsilon values corresponding to each of the model terms.

End of instructions

  R Split-plot

# Fit the linear model with the different error strata
> name.aov <- aov(dv ~ FactorA + FactorC + FactorA:FactorC + Error(FactorB), data=dataset)
> AnovaM(name.aov)
# alternatively, if sphericity is an issue
> AnovaM(name.aov, RM=T)

where name.aov is a name to provide the fitted linear model, dv is the name of the dependent variable, FactorA is the name of the between plots factor (Factor A), FactorB is the name of the block factor (random) and FactorC is the name of the within plots factor. dataset is the name of the data set.
For example:

> copper.aov <- aov(WORMS ~ COPPER + DIST + COPPER:DIST + Error(PLATE), data=copper)
> AnovaM(copper.aov, RM=T)

End of instructions

  Repeated Measures

Simple repeated measures designs are similar to randomized complete block (RCB) designs. In the same way that we could explain some of the residual variation by blocking, we can also reduce the residual variation by removing the between-replicate variation that exists at the start of the experiment. This is achieved by applying each treatment (the repeated measures) to the same subjects, rather than using different subjects for each treatment. In this way it is the same as a RCB design. Where it differs from a RCB design is that each of the treatments cannot be applied to the blocks (subjects) at the same time. For example, it is not possible to determine the effects of different drugs on the heart rate of rats if each of the drugs is applied at the same time! Each of the treatments should be applied in random order; however, when the repeated factor is time, this is not possible. The subjects (blocks or plots) are always a random factor. If a second factor is applied at the level of the whole subject and over each of the repeated measures (e.g. subjects are of one of two species), the analysis then resembles a split-plot design, where this second factor is the between-subjects effect.


An agricultural scientist was investigating the effects of fiber on the digestibility of pasture grasses by cows. Ten randomly selected cows of two different breeds (five of each) were each fed a random sequence of four food types (of different fiber levels). After a week of feeding on a single food type, the digestibility of the food was assessed. Individual cows represented the subjects. Breed of cow was the between-subjects effect and food fiber level was the within-subjects effect. Diagram illustrates the layout of a two-factor repeated measures design. Rectangles represent subjects. Squares represent the within-subject treatments (the random sequence represents the order in which each treatment is applied). Hence there are four repeated measures. There are two levels of the between-subjects effect (pattern fill of rectangles).
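The cow example could be analysed with the same split-plot template (a hypothetical sketch; the names are invented):

# BREED: between-subjects effect; FIBER: within-subjects (repeated) effect; COW: random subject
> library(biology)
> cows.aov <- aov(DIGEST ~ BREED * FIBER + Error(COW), data=cows)
> AnovaM(cows.aov, RM=T)   # sphericity-corrected tables, as elsewhere in this workshop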

End of instructions
