Workshop 9 - Generalized linear models

14 September 2011

Basic statistics references

  • Logan (2010) - Chpt 16-17
  • Quinn & Keough (2002) - Chpt 13-14
Generalized linear models

Poisson t-test

A marine ecologist was interested in examining the effects of wave exposure on the abundance of the striped limpet Siphonaria diemenensis on rocky intertidal shores. To do so, a single quadrat was placed on 10 exposed (to large waves) shores and 10 sheltered shores. From each quadrat, the number of Siphonaria diemenensis were counted.

Download Limpets data set
Format of limpets.csv data files
Count  Shore
1      sheltered
3      sheltered
2      sheltered
1      sheltered
4      sheltered
...    ...

Shore  Categorical description of the shore type (sheltered or exposed) - Predictor variable
Count  Number of limpets per quadrat - Response variable

Open the limpet data set.

Show code

> limpets <- read.table("limpets.csv", header = T, sep = ",", strip.white = T)

> limpets

   Count     Shore
1      2 sheltered
2      2 sheltered
3      0 sheltered
4      6 sheltered
5      0 sheltered
6      3 sheltered
7      3 sheltered
8      0 sheltered
9      1 sheltered
10     2 sheltered
11     5   exposed
12     3   exposed
13     6   exposed
14     3   exposed
15     4   exposed
16     7   exposed
17    10   exposed
18     3   exposed
19     5   exposed
20     2   exposed

The design clearly calls for a t-test, as we are interested in comparing two population means (the mean number of striped limpets on exposed versus sheltered shores). Recall that a parametric t-test assumes that the residuals (and the populations from which the samples are drawn) are normally distributed and equally varied.

Q1-1. Is there any evidence that these assumptions have been violated?
Show code

> boxplot(Count ~ Shore, data = limpets)

  1. Has the assumption of normality been violated?
  2. Has the assumption of homogeneity of variance been violated?

At this point we could transform the data in an attempt to satisfy normality and homogeneity of variance. However, there are zero values in the data (which complicate common transformations such as the log), and moreover, it has been demonstrated that transforming count data is generally less effective than modelling the counts directly with a Poisson distribution.
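
As a quick illustration of the transformation route (a minimal sketch only; limpets.lmT is just an illustrative name, and log(Count + 1) is one common, if imperfect, way of accommodating the zero counts):

> limpets.lmT <- lm(log(Count + 1) ~ Shore, data = limpets)

> plot(limpets.lmT, which = 1:2)  #residual and Q-Q diagnostics for the transformed fit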

Q1-2. Let's instead fit the same model with a Poisson (log link) distribution.
Show code

> limpets.glm <- glm(Count ~ Shore, family = poisson, data = limpets)

  1. Check the model for lack of fit via:
    • Pearson Χ²
      Show code

      > limpets.resid <- sum(resid(limpets.glm, type = "pearson")^2)

      > 1 - pchisq(limpets.resid, limpets.glm$df.resid)

      [1] 0.07874903


      #No evidence of a lack of fit
    • Deviance (G²)
      Show code

      > 1 - pchisq(limpets.glm$deviance, limpets.glm$df.resid)

      [1] 0.05286849


      #No evidence of a lack of fit
    • Standardized residuals (plot)
      Show code

      > plot(limpets.glm)

  2. Estimate the dispersion parameter to evaluate over dispersion in the fitted model
    • Via Pearson residuals
      Show code

      > limpets.resid/limpets.glm$df.resid

      [1] 1.500731


      #No evidence of over dispersion
    • Via deviance
      Show code

      > limpets.glm$deviance/limpets.glm$df.resid

      [1] 1.591496


      #No evidence of over dispersion
  3. Compare the expected number of zeros to the actual number of zeros to determine whether the model might be zero inflated.
    • Calculate the proportion of zeros in the data
      Show code

      > limpets.tab <- table(limpets$Count == 0)

      > limpets.tab/sum(limpets.tab)

      FALSE  TRUE 
       0.85  0.15 

    • Now work out the proportion of zeros we would expect from a Poisson distribution with a lambda equal to the mean count of limpets in this study.
      Show code

      > limpets.mu <- mean(limpets$Count)

      > cnts <- rpois(1000, limpets.mu)

      > limpets.tab <- table(cnts == 0)

      > limpets.tab/sum(limpets.tab)

      FALSE  TRUE 
      0.955 0.045 

      There are slightly more zeros observed (15%) than simulated from a Poisson with the same mean (about 5%), but with only three zero counts among twenty quadrats this is not convincing evidence of zero inflation (a more direct check is sketched below).
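
      As a more direct check (a minimal sketch), the theoretical Poisson probability of a zero count can be computed exactly and compared with the observed proportion of 0.15:

      > dpois(0, lambda = mean(limpets$Count))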

Q1-3. We can now test the null hypothesis that there is no effect of shore type on the abundance of striped limpets:
  1. Wald z-tests (these are equivalent to t-tests)
  2. Show code

    > summary(limpets.glm)

    Call:
    glm(formula = Count ~ Shore, family = poisson, data = limpets)
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -1.94936  -0.88316   0.07192   0.57905   2.36619  
    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)      1.5686     0.1443  10.868  < 2e-16 ***
    Shoresheltered  -0.9268     0.2710  -3.419 0.000628 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    (Dispersion parameter for poisson family taken to be 1)
        Null deviance: 41.624  on 19  degrees of freedom
    Residual deviance: 28.647  on 18  degrees of freedom
    AIC: 85.567
    Number of Fisher Scoring iterations: 5

  3. Wald Chisq-test. Note that Chisq = z²
  4. Show code

    > library(car)

    > Anova(limpets.glm, test.statistic = "Wald")

    Analysis of Deviance Table (Type II tests)
    Response: Count
              Df  Chisq Pr(>Chisq)    
    Shore      1 11.691   0.000628 ***
    Residuals 18                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

  5. Likelihood ratio test (LRT), analysis of deviance Chisq-tests (these are equivalent to ANOVA)
  6. Show code

    > limpets.glmN <- glm(Count ~ 1, family = poisson, data = limpets)

    > anova(limpets.glmN, limpets.glm, test = "Chisq")

    Analysis of Deviance Table
    Model 1: Count ~ 1
    Model 2: Count ~ Shore
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)    
    1        19     41.624                          
    2        18     28.647  1   12.977 0.0003154 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


    # OR

    > anova(limpets.glm, test = "Chisq")

    Analysis of Deviance Table
    Model: poisson, link: log
    Response: Count
    Terms added sequentially (first to last)
          Df Deviance Resid. Df Resid. Dev P(>|Chi|)    
    NULL                     19     41.624              
    Shore  1   12.977        18     28.647 0.0003154 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


    # OR

    > Anova(limpets.glm, test.statistic = "LR")

    Analysis of Deviance Table (Type II tests)
    Response: Count
          LR Chisq Df Pr(>Chisq)    
    Shore   12.977  1  0.0003154 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Q1-4. And for those who do not suffer from a p-value affliction, we can alternatively (or in addition) calculate the confidence intervals for the estimated parameters (note that these are on the log link scale).
Show code

> library(gmodels)

> ci(limpets.glm)

                Estimate  CI lower   CI upper Std. Error      p-value
(Intercept)     1.568616  1.265374  1.8718579  0.1443376 1.643085e-27
Shoresheltered -0.926762 -1.496204 -0.3573203  0.2710437 6.279766e-04

Poisson ANOVA

We once again return to a modified example from Quinn and Keough (2002). In Exercise 1 of Workshop 5, we introduced an experiment by Day and Quinn (1989) that examined how rock surface type affected the recruitment of barnacles to a rocky shore. The experiment had a single factor, surface type, with 4 treatments or levels: algal species 1 (ALG1), algal species 2 (ALG2), naturally bare surfaces (NB) and artificially scraped bare surfaces (S). There were 5 replicate plots for each surface type and the response (dependent) variable was the number of newly recruited barnacles on each plot after 4 weeks.

Download Day data set

Format of day.csv data files
TREAT  BARNACLE
ALG1   27
...    ...
ALG2   24
...    ...
NB     9
...    ...
S      12
...    ...

TREAT     Categorical listing of surface types. ALG1 = algal species 1, ALG2 = algal species 2, NB = naturally bare surface, S = scraped bare surface.
BARNACLE  The number of newly recruited barnacles on each plot after 4 weeks.

Open the day data file.
Show code

> day <- read.table("day.csv", header = T, sep = ",", strip.white = T)

> head(day)

  TREAT BARNACLE
1  ALG1       27
2  ALG1       19
3  ALG1       18
4  ALG1       23
5  ALG1       25
6  ALG2       24

We previously analysed these data via a classic linear model (ANOVA), as they appeared to conform to the linear model assumptions. Alternatively, we could analyse these data with a generalized linear model (Poisson error distribution). Note that as the boxplots were all fairly symmetrical and equally varied, and the sample means are well away from zero (in fact there are no zeros in the data), we might suspect that whether we fit the model as a linear or generalized linear model is probably not going to be of great consequence. Nevertheless, it does provide a good comparison between the two frameworks.
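
For reference, a minimal sketch of the classic linear model fitted to the same data (day.lm is simply an illustrative name), so that its conclusions can be compared with those from the Poisson GLM fitted below:

> day.lm <- lm(BARNACLE ~ TREAT, data = day)

> summary(day.lm)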

Q2-1. Using boxplots, re-examine the assumptions of normality and homogeneity of variance. Note that when sample sizes are small (as is the case with this data set), these ANOVA assumptions cannot reliably be checked using boxplots, since boxplots require at least 5 replicates (and preferably more) from which to calculate the median and quartiles. As with regression analysis, it is the assumption of homogeneity of variances (and in particular, whether there is a relationship between the mean and variance) that is of most concern for ANOVA.
Show code

> boxplot(BARNACLE ~ TREAT, data = day)

Q2-2. Fit the generalized linear model (GLM) relating the number of newly recruited barnacles to substrate type
Show code - fit the model

> day.glm <- glm(BARNACLE ~ TREAT, family = poisson, data = day)

  1. Check the model for lack of fit via:
    • Pearson Χ²
      Show code

      > day.resid <- sum(resid(day.glm, type = "pearson")^2)

      > 1 - pchisq(day.resid, day.glm$df.resid)

      [1] 0.3641093


      #No evidence of a lack of fit
    • Standardized residuals (plot)
      Show code

      > plot(day.glm)

  2. Estimate the dispersion parameter to evaluate over dispersion in the fitted model
    • Via Pearson residuals
      Show code

      > day.resid/day.glm$df.resid

      [1] 1.083574


      #No evidence of over dispersion
    • Via deviance
      Show code

      > day.glm$deviance/day.glm$df.resid

      [1] 1.075851


      #No evidence of over dispersion
  3. Compare the expected number of zeros to the actual number of zeros to determine whether the model might be zero inflated.
    • Calculate the proportion of zeros in the data
      Show code

      > day.tab <- table(day$BARNACLE == 0)

      > day.tab/sum(day.tab)

      #All barnacle counts are non-zero, so the observed proportion of zeros is 0

    • Now work out the proportion of zeros we would expect from a Poisson distribution with a lambda equal to the mean barnacle count in this study.
      Show code

      > day.mu <- mean(day$BARNACLE)

      > cnts <- rpois(1000, day.mu)

      > day.tab <- table(cnts == 0)

      > day.tab/sum(day.tab)

      #With a mean of around 20 barnacles per plot, a Poisson distribution essentially
      #never generates zeros, so the expected proportion of zeros is also effectively 0

      There are no zeros at all in these data, so zero inflation is clearly not an issue.

Q2-3. We can now test the null hypothesis that there is no effect of substrate type on the number of newly recruited barnacles:
  1. Wald z-tests (these are equivalent to t-tests)
  2. Show code

    > summary(day.glm)

    Call:
    glm(formula = BARNACLE ~ TREAT, family = poisson, data = day)
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -1.6748  -0.6522  -0.2630   0.5699   1.7380  
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  3.10906    0.09449  32.903  < 2e-16 ***
    TREATALG2    0.23733    0.12638   1.878 0.060387 .  
    TREATNB     -0.40101    0.14920  -2.688 0.007195 ** 
    TREATS      -0.52884    0.15518  -3.408 0.000654 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    (Dispersion parameter for poisson family taken to be 1)
        Null deviance: 54.123  on 19  degrees of freedom
    Residual deviance: 17.214  on 16  degrees of freedom
    AIC: 120.34
    Number of Fisher Scoring iterations: 4

  3. Wald Chisq-test.
  4. Show code

    > library(car)

    > Anova(day.glm, test.statistic = "Wald")

    Analysis of Deviance Table (Type II tests)
    Response: BARNACLE
              Df  Chisq Pr(>Chisq)    
    TREAT      3 36.041  7.342e-08 ***
    Residuals 16                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

  5. Likelihood ratio test (LRT), analysis of deviance Chisq-tests (these are equivalent to ANOVA)
  6. Show code

    > day.glmN <- glm(BARNACLE ~ 1, family = poisson, data = day)

    > anova(day.glmN, day.glm, test = "Chisq")

    Analysis of Deviance Table
    Model 1: BARNACLE ~ 1
    Model 2: BARNACLE ~ TREAT
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)    
    1        19     54.123                          
    2        16     17.214  3   36.909  4.81e-08 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


    # OR

    > anova(day.glm, test = "Chisq")

    Analysis of Deviance Table
    Model: poisson, link: log
    Response: BARNACLE
    Terms added sequentially (first to last)
          Df Deviance Resid. Df Resid. Dev P(>|Chi|)    
    NULL                     19     54.123              
    TREAT  3   36.909        16     17.214  4.81e-08 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


    # OR

    > Anova(day.glm, test.statistic = "LR")

    Analysis of Deviance Table (Type II tests)
    Response: BARNACLE
          LR Chisq Df Pr(>Chisq)    
    TREAT   36.909  3   4.81e-08 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Q2-4. Let's now get the confidence intervals for the parameter estimates. Note that these are still on the log (link) scale!
Show code

> library(gmodels)

> ci(day.glm)

              Estimate    CI lower    CI upper Std. Error       p-value
(Intercept)  3.1090610  2.90874874  3.30937318 0.09449112 1.977474e-237
TREATALG2    0.2373282 -0.03057639  0.50523276 0.12637573  6.038705e-02
TREATNB     -0.4010108 -0.71730958 -0.08471194 0.14920422  7.195385e-03
TREATS      -0.5288441 -0.85780588 -0.19988237 0.15517757  6.544249e-04
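
Since these estimates and confidence limits are on the log (link) scale, a minimal sketch of back-transforming them is simply to exponentiate: the intercept then becomes a mean count and each treatment coefficient a multiplicative change relative to ALG1.

> exp(ci(day.glm)[, 1:3])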

Q2-5. Although we have now established that there is a statistical difference between the group means, we do not yet know which group(s) differ from which other(s). For these data, a Tukey's multiple comparison test (to determine which surface type groups differ from each other in terms of the number of barnacles) could be appropriate.
Show code

> library(multcomp)

> summary(glht(day.glm, linfct = mcp(TREAT = "Tukey")))

 Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts
Fit: glm(formula = BARNACLE ~ TREAT, family = poisson, data = day)
Linear Hypotheses:
                 Estimate Std. Error z value Pr(>|z|)    
ALG2 - ALG1 == 0   0.2373     0.1264   1.878  0.23529    
NB - ALG1 == 0    -0.4010     0.1492  -2.688  0.03570 *  
S - ALG1 == 0     -0.5288     0.1552  -3.408  0.00379 ** 
NB - ALG2 == 0    -0.6383     0.1427  -4.472  < 0.001 ***
S - ALG2 == 0     -0.7662     0.1490  -5.143  < 0.001 ***
S - NB == 0       -0.1278     0.1688  -0.757  0.87234    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
(Adjusted p values reported -- single-step method)
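
The Tukey estimates above are differences on the log scale; as a minimal sketch (day.tukey is just an illustrative name), exponentiating the simultaneous confidence intervals expresses each comparison as a ratio of mean barnacle counts:

> day.tukey <- confint(glht(day.glm, linfct = mcp(TREAT = "Tukey")))

> exp(day.tukey$confint)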

Q2-6. I am now in the mood to derive other parameters from the model:
  1. Cell (population) means
    Show code
    #On a link (log) scale

    > day.means <- predict(day.glm, newdata = data.frame(TREAT = levels(day$TREAT)), se = T)

    > day.means <- data.frame(day.means[1], day.means[2])

    > rownames(day.means) <- levels(day$TREAT)

    > day.ci <- qt(0.975, day.glm$df.residual) * day.means[, 2]

    > day.means <- data.frame(day.means, lwr = day.means[, 1] - day.ci, upr = day.means[, 1] + day.ci)

              fit     se.fit      lwr      upr
    ALG1 3.109061 0.09449112 2.908749 3.309373
    ALG2 3.346389 0.08391814 3.168491 3.524288
    NB   2.708050 0.11547003 2.463265 2.952836
    S    2.580217 0.12309146 2.319275 2.841159


    #On the scale of the response

    > day.means <- predict(day.glm, newdata = data.frame(TREAT = levels(day$TREAT)), se = T, type = "response")

    > day.means <- data.frame(day.means[1], day.means[2])

    > rownames(day.means) <- levels(day$TREAT)

    > day.ci <- qt(0.975, day.glm$df.residual) * day.means[, 2]

    > day.means <- data.frame(day.means, lwr = day.means[, 1] - day.ci, upr = day.means[, 1] + day.ci)

          fit   se.fit       lwr      upr
    ALG1 22.4 2.116601 17.913006 26.88699
    ALG2 28.4 2.383275 23.347683 33.45232
    NB   15.0 1.732051 11.328217 18.67178
    S    13.2 1.624807  9.755563 16.64444

    Or the long way
    Show code
    #On a link (log) scale

    > cmat <- cbind(c(1, 0, 0, 0), c(1, 1, 0, 0), c(1, 0, 1, 0), c(1, 0, 0, 1))

    > day.mu <- t(day.glm$coef %*% cmat)

    > day.se <- sqrt(diag(t(cmat) %*% vcov(day.glm) %*% cmat))

    > day.means <- data.frame(fit = day.mu, se = day.se)

    > rownames(day.means) <- levels(day$TREAT)

    > day.ci <- qt(0.975, day.glm$df.residual) * day.se

    > day.means <- data.frame(day.means, lwr = day.mu - day.ci, upr = day.mu + day.ci)

              fit         se      lwr      upr
    ALG1 3.109061 0.09449112 2.908749 3.309373
    ALG2 3.346389 0.08391814 3.168491 3.524288
    NB   2.708050 0.11547003 2.463265 2.952836
    S    2.580217 0.12309146 2.319275 2.841159


    #On a response scale

    > cmat <- cbind(c(1, 0, 0, 0), c(1, 1, 0, 0), c(1, 0, 1, 0), c(1, 0, 0, 1))

    > day.mu <- t(day.glm$coef %*% cmat)

    > day.se <- sqrt(diag(t(cmat) %*% vcov(day.glm) %*% cmat)) * abs(poisson()$mu.eta(day.mu))  #delta-method standard errors on the response scale

    > day.mu <- exp(day.mu)

    > day.means <- data.frame(fit = day.mu, se = day.se)

    > rownames(day.means) <- levels(day$TREAT)

    > day.ci <- qt(0.975, day.glm$df.residual) * day.se

    > day.means <- data.frame(day.means, lwr = day.mu - day.ci, upr = day.mu + day.ci)

          fit       se       lwr      upr
    ALG1 22.4 2.116601 17.913006 26.88699
    ALG2 28.4 2.383275 23.347683 33.45232
    NB   15.0 1.732051 11.328217 18.67178
    S    13.2 1.624807  9.755563 16.64444

  2. Differences between each treatment mean and the global mean
    Show code
    #On a link (log) scale

    > cmat <- rbind(c(1, 0, 0, 0), c(1, 1, 0, 0), c(1, 0, 1, 0), c(1, 0, 0, 1))

    > gm.cmat <- apply(cmat, 2, mean)

    > for (i in 1:nrow(cmat)) cmat[i, ] <- cmat[i, ] - gm.cmat

    > day.glm$coef %*% gm.cmat  #the grand mean on the link (log) scale

    > day.mu <- t(day.glm$coef %*% t(cmat))

    > day.se <- sqrt(diag(cmat %*% vcov(day.glm) %*% t(cmat)))

    > day.means <- data.frame(fit = day.mu, se = day.se)

    > rownames(day.means) <- levels(day$TREAT)

    > day.ci <- qt(0.975, day.glm$df.residual) * day.se

    > day.means <- data.frame(day.means, lwr = day.mu - day.ci, upr = day.mu + day.ci)

                fit         se          lwr        upr
    ALG1  0.1731317 0.08510443 -0.007281663  0.3535450
    ALG2  0.4104599 0.07937005  0.242202862  0.5787169
    NB   -0.2278791 0.09718613 -0.433904466 -0.0218537
    S    -0.3557125 0.10175575 -0.571425004 -0.1399999


    #On a response scale

    > cmat <- rbind(c(1, 0, 0, 0), c(1, 1, 0, 0), c(1, 0, 1, 0), c(1, 0, 0, 1))

    > gm.cmat <- apply(cmat, 2, mean)

    > for (i in 1:nrow(cmat)) cmat[i, ] <- cmat[i, ] - gm.cmat

    > day.glm$coef %*% gm.cmat

    > day.mu <- t(day.glm$coef %*% t(cmat))

    > day.se <- sqrt(diag(cmat %*% vcov(day.glm) %*% t(cmat))) * abs(poisson()$mu.eta(day.mu))

    > day.mu <- exp(abs(day.mu))  #exp(abs(.)) expresses each effect as the magnitude of its fold-difference from the grand mean

    > day.means <- data.frame(fit = day.mu, se = day.se)

    > rownames(day.means) <- levels(day$TREAT)

    > day.ci <- qt(0.975, day.glm$df.residual) * day.se

    > day.means <- data.frame(day.means, lwr = day.mu - day.ci, upr = day.mu + day.ci)

              fit         se       lwr      upr
    ALG1 1.189023 0.10119110 0.9745071 1.403538
    ALG2 1.507511 0.11965122 1.2538616 1.761160
    NB   1.255933 0.07738159 1.0918918 1.419975
    S    1.427197 0.07129761 1.2760529 1.578341

Poisson t-test with overdispersion

The same marine ecologist that featured earlier was also interested in examining the effects of wave exposure on the abundance of another intertidal marine limpet Cellana tramoserica on rocky intertidal shores. Again, a single quadrat was placed on 10 exposed (to large waves) shores and 10 sheltered shores. From each quadrat, the number of smooth limpets (Cellana tramoserica) were counted.

Download LimpetsSmooth data set

Format of limpetsSmooth.csv data files
Count  Shore
4      sheltered
1      sheltered
2      sheltered
0      sheltered
4      sheltered
...    ...

Shore  Categorical description of the shore type (sheltered or exposed) - Predictor variable
Count  Number of limpets per quadrat - Response variable

Open the limpetsSmooth data set.

Show code

> limpets <- read.table("limpetsSmooth.csv", header = T, sep = ",",

> strip.white = T)

> limpets

   Count     Shore
1      3 sheltered
2      1 sheltered
3      6 sheltered
4      0 sheltered
5      4 sheltered
6      3 sheltered
7      8 sheltered
8      1 sheltered
9     13 sheltered
10     0 sheltered
11     2   exposed
12    14   exposed
13     2   exposed
14     3   exposed
15    31   exposed
16     1   exposed
17    13   exposed
18     8   exposed
19    14   exposed
20    11   exposed

Q3-1. Let's again fit the same model (count as a function of shore type) with a Poisson distribution.
Show code

> limpets.glm <- glm(Count ~ Shore, family = poisson, data = limpets)

  1. Check the model for lack of fit via:
    • Pearson Χ²
      Show code

      > limpets.resid <- sum(resid(limpets.glm, type = "pearson")^2)

      > 1 - pchisq(limpets.resid, limpets.glm$df.resid)

      [1] 4.440892e-16


      #Strong evidence of a lack of fit
    • Deviance (G²)
      Show code

      > 1 - pchisq(limpets.glm$deviance, limpets.glm$df.resid)

      [1] 1.887379e-15


      #Strong evidence of a lack of fit
    • Standardized residuals (plot)
      Show code

      > plot(limpets.glm)

  2. Estimate the dispersion parameter to evaluate over dispersion in the fitted model
    • Via Pearson residuals
      Show code

      > limpets.resid/limpets.glm$df.resid

      [1] 6.358197


      #Evidence of over dispersion
    • Via deviance
      Show code

      > limpets.glm$deviance/limpets.glm$df.resid

      [1] 6.177846


      #Evidence of over dispersion
  3. Compare the expected number of zeros to the actual number of zeros to determine whether the model might be zero inflated.
    • Calculate the proportion of zeros in the data
      Show code

      > limpets.tab <- table(limpets$Count == 0)

      > limpets.tab/sum(limpets.tab)

      FALSE  TRUE 
        0.9   0.1 

    • Now work out the proportion of zeros we would expect from a Poisson distribution with a lambda equal to the mean count of limpets in this study.
      Show code

      > limpets.mu <- mean(limpets$Count)

      > cnts <- rpois(1000, limpets.mu)

      > limpets.tab <- table(cnts == 0)

      > limpets.tab/sum(limpets.tab)

      FALSE  TRUE 
      0.999 0.001 

      There are somewhat more zeros observed (10%) than the Poisson expectation (well under 1%), but with only two zero counts out of twenty this modest excess is more plausibly a symptom of the general overdispersion discussed below than of a separate zero-generating process.

Clearly there is an issue with overdispersion. There are multiple potential reasons for this. It could be that there are major sources of variability that are not captured within the sampling design. Alternatively, it could be that the data are very clumped. For example, species abundances are often clumped - individuals are either not present, or else when they are present they occur in groups.
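
A quick way to see this (a minimal sketch) is to compare the group means and variances: for a Poisson distribution the variance should be roughly equal to the mean, whereas here it is many times larger.

> with(limpets, tapply(Count, Shore, mean))

> with(limpets, tapply(Count, Shore, var))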

Q3-2. As part of the routine exploratory data analysis:
  1. Explore the distribution of counts from each population
    Show code

    > boxplot(Count ~ Shore, data = limpets)

    > rug(jitter(limpets$Count), side = 2)

Q3-3. There are generally two ways of handling this form of overdispersion (a brief comparison of the resulting standard errors is sketched after the two fits below).
  1. Fit the model with a quasipoisson family, in which the dispersion parameter is not assumed to be 1. The degree of variability (and how it differs from the mean) is estimated, and the standard errors are inflated accordingly.
    Show code

    > limpets.glmQ <- glm(Count ~ Shore, family = quasipoisson, data = limpets)

    1. t-tests
    2. Show code

      > summary(limpets.glmQ)

      Call:
      glm(formula = Count ~ Shore, family = quasipoisson, data = limpets)
      Deviance Residuals: 
          Min       1Q   Median       3Q      Max  
      -3.6352  -2.6303  -0.4752   1.0449   5.3451  
      Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
      (Intercept)      2.2925     0.2534   9.046 4.08e-08 ***
      Shoresheltered  -0.9316     0.4767  -1.954   0.0664 .  
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
      (Dispersion parameter for quasipoisson family taken to be 6.358223)
          Null deviance: 138.18  on 19  degrees of freedom
      Residual deviance: 111.20  on 18  degrees of freedom
      AIC: NA
      Number of Fisher Scoring iterations: 5

    3. Wald F-test. Note that F = t²
    4. Show code

      > library(car)

      > Anova(limpets.glmQ, test.statistic = "F")

      Analysis of Deviance Table (Type II tests)
      Response: Count
                     SS Df     F  Pr(>F)  
      Shore      26.978  1 4.243 0.05417 .
      Residuals 114.448 18                
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


      #OR

      > anova(limpets.glmQ, test = "F")

      Analysis of Deviance Table
      Model: quasipoisson, link: log
      Response: Count
      Terms added sequentially (first to last)
            Df Deviance Resid. Df Resid. Dev     F  Pr(>F)  
      NULL                     19     138.18                
      Shore  1   26.978        18     111.20 4.243 0.05417 .
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

  2. Fit the model with a negative binomial distribution (which accounts for the clumping).
    Show code

    > library(MASS)

    > limpets.glmNB <- glm.nb(Count ~ Shore, data = limpets)

    Check the fit
    Show code

    > limpets.resid <- sum(resid(limpets.glmNB, type = "pearson")^2)

    > 1 - pchisq(limpets.resid, limpets.glmNB$df.resid)

    [1] 0.4333663

    #Deviance

    > 1 - pchisq(limpets.glmNB$deviance, limpets.glmNB$df.resid)

    [1] 0.215451

    1. Wald z-tests (these are equivalent to t-tests)
    2. Show code

      > summary(limpets.glmNB)

      Call:
      glm.nb(formula = Count ~ Shore, data = limpets, init.theta = 1.283382111, 
          link = log)
      Deviance Residuals: 
          Min       1Q   Median       3Q      Max  
      -1.8929  -1.0926  -0.2313   0.3943   1.5320  
      Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
      (Intercept)      2.2925     0.2967   7.727  1.1e-14 ***
      Shoresheltered  -0.9316     0.4377  -2.128   0.0333 *  
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
      (Dispersion parameter for Negative Binomial(1.2834) family taken to be 1)
          Null deviance: 26.843  on 19  degrees of freedom
      Residual deviance: 22.382  on 18  degrees of freedom
      AIC: 122.02
      Number of Fisher Scoring iterations: 1
                    Theta:  1.283 
                Std. Err.:  0.506 
       2 x log-likelihood:  -116.018 

    3. Wald Chisq-test. Note that Chisq = z²
    4. Show code

      > library(car)

      > Anova(limpets.glmNB, test.statistic = "Wald")

      Analysis of Deviance Table (Type II tests)
      Response: Count
                Df  Chisq Pr(>Chisq)  
      Shore      1 4.5297    0.03331 *
      Residuals 18                    
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

    5. Likelihood ratio test (LRT), analysis of deviance tests
    6. Show code

      > anova(limpets.glmNB, test = "F")

      Analysis of Deviance Table
      Model: Negative Binomial(1.2834), link: log
      Response: Count
      Terms added sequentially (first to last)
            Df Deviance Resid. Df Resid. Dev      F  Pr(>F)  
      NULL                     19     26.843                 
      Shore  1   4.4611        18     22.382 4.4611 0.03468 *
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

      # OR

      > anova(limpets.glmNB, test = "Chisq")

      Analysis of Deviance Table
      Model: Negative Binomial(1.2834), link: log
      Response: Count
      Terms added sequentially (first to last)
            Df Deviance Resid. Df Resid. Dev P(>|Chi|)  
      NULL                     19     26.843            
      Shore  1   4.4611        18     22.382   0.03468 *
      ---
      Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
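
To see how the two approaches differ, here is a minimal sketch (se.shore is just an illustrative helper) comparing the standard error of the Shore effect from the three fits: the quasi-Poisson fit inflates the Poisson standard error by the square root of the estimated dispersion, whereas the negative binomial re-models the mean-variance relationship.

> se.shore <- function(fit) coef(summary(fit))["Shoresheltered", "Std. Error"]

> c(poisson = se.shore(limpets.glm), quasipoisson = se.shore(limpets.glmQ), negbin = se.shore(limpets.glmNB))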

Poisson t-test with zero-inflation

Yet again our rather fixated marine ecologist has returned to the rocky shore with an interest in examining the effects of wave exposure on the abundance of yet another intertidal marine limpet, Patelloida latistrigata, on rocky intertidal shores. It would appear that this ecologist is either lazy, satisfied with the methodology, or suffering from some superstition disorder (that prevents them from deviating too far from a series of tasks), and thus, yet again, a single quadrat was placed on 10 exposed (to large waves) shores and 10 sheltered shores. From each quadrat, the number of scaley limpets (Patelloida latistrigata) were counted. Initially, faculty were mildly interested in the research of this ecologist (although let's be honest, most academics usually only display interest in the work of the other members of their school out of politeness - Oh, I did not say that, did I?). Nevertheless, after years of attending seminars by this ecologist in which the only difference is the target species, faculty have started to display more disturbing levels of animosity towards our ecologist. In fact, only last week, the members of the school's internal ARC review panel (when presented with yet another wave exposure proposal) were rumored to take great pleasure in concocting numerous bogus applications involving hedgehogs and balloons just so they could rank our ecologist's proposal outside of the top 50 prospects and ... Actually, I may just have digressed!

Download LimpetsScaley data set

Format of limpetsScaley.csv data files
Count  Shore
4      sheltered
1      sheltered
2      sheltered
0      sheltered
4      sheltered
...    ...

Shore  Categorical description of the shore type (sheltered or exposed) - Predictor variable
Count  Number of limpets per quadrat - Response variable

Open the limpetsScaley data set.

Show code

> limpets <- read.table("limpetsScaley.csv", header = T, sep = ",",

> strip.white = T)

> limpets

   Count     Shore
1      2 sheltered
2      1 sheltered
3      3 sheltered
4      1 sheltered
5      0 sheltered
6      0 sheltered
7      1 sheltered
8      0 sheltered
9      2 sheltered
10     1 sheltered
11     4   exposed
12     9   exposed
13     3   exposed
14     1   exposed
15     3   exposed
16     0   exposed
17     0   exposed
18     7   exposed
19     8   exposed
20     5   exposed

Q4-1 As part of the routine exploratory data analysis:
  1. Explore the distribution of counts from each population
    Show code

    > boxplot(Count ~ Shore, data = limpets)

    > rug(jitter(limpets$Count), side = 2)

  2. Compare the expected number of zeros to the actual number of zeros to determine whether the model might be zero inflated.
    • Calculate the proportion of zeros in the data
      Show code

      > limpets.tab <- table(limpets$Count == 0)

      > limpets.tab/sum(limpets.tab)

      FALSE  TRUE 
       0.75  0.25 

    • Now work out the proportion of zeros we would expect from a Poisson distribution with a lambda equal to the mean count of limpets in this study.
      Show code

      > limpets.mu <- mean(limpets$Count)

      > cnts <- rpois(1000, limpets.mu)

      > limpets.tab <- table(cnts == 0)

      > limpets.tab/sum(limpets.tab)

      FALSE  TRUE 
       0.91  0.09 

      There are substantially more zeros observed (25%) than simulated from a Poisson with the same mean (about 9%), which suggests that the data may well be zero inflated (an exact check is sketched below).
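
      As an exact alternative to the simulation (a minimal sketch), the theoretical Poisson probability of a zero count can be compared directly with the observed proportion of 0.25:

      > dpois(0, lambda = mean(limpets$Count))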

Q4-2. Let's then fit these data via a zero-inflated Poisson (ZIP) model.
Show code

> library(pscl)


> limpets.zip <- zeroinfl(Count ~ Shore | 1, dist = "poisson", data = limpets)

In the zeroinfl() formula, the part before the | specifies the count (Poisson) model and the part after it specifies the zero-inflation (binomial) model; here the zero-inflation component contains only an intercept.

  1. Check the model for lack of fit via:
    • Pearson Χ²
      Show code

      > limpets.resid <- sum(resid(limpets.zip, type = "pearson")^2)

      > 1 - pchisq(limpets.resid, limpets.zip$df.resid)

      [1] 0.2810221


      #No evidence of a lack of fit
  2. Estimate the dispersion parameter to evaluate over dispersion in the fitted model
    • Via Pearson residuals
      Show code

      > limpets.resid/limpets.zip$df.resid

      [1] 1.168722


      #No evidence of over dispersion

Q4-3. We can now test the null hypothesis that there is no effect of shore type on the abundance of scaley limpets (a brief sketch interpreting the zero-inflation component follows the tests below):
  1. Wald z-tests (these are equivalent to t-tests)
  2. Show code

    > summary(limpets.zip)

    Call:
    zeroinfl(formula = Count ~ Shore | 1, data = limpets, dist = "poisson")
    Pearson residuals:
         Min       1Q   Median       3Q      Max 
    -1.54025 -0.93957 -0.04699  0.84560  1.76938 
    Count model coefficients (poisson with log link):
                   Estimate Std. Error z value Pr(>|z|)    
    (Intercept)      1.6002     0.1619   9.886  < 2e-16 ***
    Shoresheltered  -1.3810     0.3618  -3.817 0.000135 ***
    Zero-inflation model coefficients (binomial with logit link):
                Estimate Std. Error z value Pr(>|z|)  
    (Intercept)  -1.6995     0.7568  -2.246   0.0247 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    Number of iterations in BFGS optimization: 13 
    Log-likelihood: -37.53 on 3 Df

  3. Chisq-test. Note that Chisq = z²
  4. Show code

    > library(car)

    > Anova(limpets.zip, test.statistic = "Chisq")

    Analysis of Deviance Table (Type II tests)
    Response: Count
              Df Chisq Pr(>Chisq)    
    Shore      1 14.57  0.0001351 ***
    Residuals 17                     
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

  5. F-test.
  6. Show code

    > library(car)

    > Anova(limpets.zip, test.statistic = "F")

    Analysis of Deviance Table (Type II tests)
    Response: Count
              Df     F   Pr(>F)   
    Shore      1 14.57 0.001379 **
    Residuals 17                  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
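
A minimal sketch of interpreting the zero-inflation component: its intercept (-1.6995 in the summary above) is on the logit scale, so back-transforming gives the estimated probability that a quadrat is a structural (excess) zero.

> plogis(coef(limpets.zip, model = "zero"))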

Welcome to the end of Workshop 9

  Generalized linear models

Generalized linear models (GLMs) extend the application range of linear modelling by accommodating non-constant variances as well as alternative residual distributions from the exponential family (distributions defined by a location parameter and a dispersion parameter), such as the binomial and Poisson distributions. Generalized linear models have three components:
  1. The random component that specifies the conditional distribution of the response variable. Such distributions are characterised by some function of the mean (canonical or location parameter) and a function of the variance (dispersion parameter). Note that for binomial and Poisson distributions, the dispersion parameter is 1, whereas for the Gaussian (normal) distribution the dispersion parameter is the error variance and is assumed to be independent of the mean.
    Gaussian: ε ~ Normal(μ, σ²)
    Poisson: ε ~ Poisson(μ)
  2. The systematic component that represents the linear combination of predictors (which can be categorical, continuous, polynomial or other contrasts). This is identical to that of general linear models.
  3. The link function which links the expected values of the response (random component) to the linear combination of predictors (systematic component). The generalized linear model can thus be represented as:
    g(μ) = β0 + β1X1 + β2X2 + ...
    where g(μ) represents the link function and β0 , β1 and β2 represent parameters broadly analogous to those of general linear models. Although there are many commonly employed link functions, typically the exact form of the link function depends on the nature of the random response distribution. Some of the canonical (natural) link function and distribution pairings that are suitable for different forms of generalized linear models are listed in the following table.
  4. Response variable   Predictor variable(s)        Residual distribution   Link
     Continuous          Continuous or Categorical    Gaussian                Identity: g(μ) = μ
     Binary              Continuous or Categorical    Binomial                Logit: g(μ) = loge(μ/(1-μ))
     Counts              Continuous or Categorical    Poisson                 Log: g(μ) = loge(μ)
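
As a minimal sketch (re-using the limpet example from this workshop), the three components map onto a single R call: the formula supplies the systematic component, the family argument the random component, and the link argument the link function.

> limpets.glm <- glm(Count ~ Shore, family = poisson(link = "log"), data = limpets)

> exp(coef(limpets.glm))  #back-transform the coefficients from the log (link) scale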

End of instructions