> class(name)
> name <- entry
> IQ <- 10.513
> numbers <- c(1, 4, 6, 7, 4, 345, 36, 78)
> print(numbers)
[1] 1 4 6 7 4 345 36 78
> numbers
[1] 1 4 6 7 4 345 36 78
> name <- factor(c(list of characters/words))
> name <- gl(number of levels, number of replicates, length of data set, lab=c(list of level names))
> sex <- factor(c('Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male'))
OR
> sex <- factor(c(rep('Female',6), rep('Male',6)))
OR
> sex <- gl(2, 6, 12, lab=c('Female','Male'))
> q()
> source('filename')
Note that there is an alternative way of using scripts.
> ls()
> rm(object name)
> name <- read.table('filename.csv', header=T, sep=',', row.names=column, strip.white=T)
> phasmid <- read.table('phasmid.csv', header=T, sep=',', row.names=1)
> name <- read.table('clipboard', header=T, sep='\t', row.names=column, strip.white=T)
> phasmid <- read.table('clipboard', header=T, sep='\t', row.names=1)
> phasmid <- dump('data', "")
> write.table(data frame name, 'filename.csv', quote=F, sep=',')
> write.table(LEAVES, 'leaves.csv', quote=F, sep=',')
> var <- c(4,8,2,6,9,2)
> var
[1] 4 8 2 6 9 2
> var[1]
[1] 4
> var[5]
[1] 9
> var[3:5]
[1] 2 6 9
> dv <- c(4,8,2,6,9,2)
> iv <- c('a','a','a','b','b','b')
> data <- data.frame(iv,dv)
> data
  iv dv
1  a  4
2  a  8
3  a  2
4  b  6
5  b  9
6  b  2
> data[1] # list the first column (as a data frame)
  iv
1  a
2  a
3  a
4  b
5  b
6  b
> data[,1] # list the contents of the first column
[1] a a a b b b
Levels: a b
> data[1,] # list the contents of the first row
  iv dv
1  a  4
> data[3,2] # list the entry in row 3, column 2
[1] 2
> fix(data frame name)
> new_var <- log(old_var)
> new_var <- log10(old_var)
> new_var <- sqrt(old_var)
> new_var <- asin(sqrt(old_var))
> new_var <- scale(old_var)
> set.seed(number)
> library(package)
> hist(variable)
# mean
> mean(variable)
# variance
> var(variable)
# standard deviation
> sd(variable)
# variable length
> length(variable)
# median
> median(variable)
# maximum
> max(variable)
# minimum
> min(variable)
# standard error of mean
> sd(variable)/sqrt(length(variable))
# interquartile range
> quantile(variable, c(.25,.75))
# 95% confidence interval
> qnorm(0.975,0,1) * (sd(variable)/sqrt(length(variable)))
> tapply(dv,factor,func)
# calculate the mean separately for each group
> tapply(dv,factor,mean)
# calculate the variance separately for each group
> tapply(dv,factor,var)
# select the first 5 entries in the variable
> variable[1:5]
# select all but the first entry in the variable
> variable[-1]
# select all cases of variable that are less than num
> variable[variable<num]
# select all cases of variable1 for which variable is equal to 'Big'
> variable1[variable=='Big']
> pnorm(c(value), mean=mean, sd=sd, lower.tail=FALSE)
> pnorm(c(2.9), mean=2.025882, sd=0.4836265, lower.tail=FALSE)
> boxplot(variable)
> name.lm <- lm(dv~iv, data=data)
> name.glm <- glm(dv~iv, family=binomial, data=data)
> name.lm <- lm(dv~iv1+iv2, data=data)
> name.aov <- lm(dv~factor, data=data)
# OR
> day.aov <- aov(dv~factor, data=data)
A nested ANOVA is a simple extension of the single factor ANOVA. An additional factor (Factor B) is added that is nested within the main factor of interest (Factor A). Nested factors are incorporated in situations where researchers envisage potentially large amounts of variation between different sets of replicates within the main factor. For example, a limnologist investigating the effect of flow (Factor A: high, medium and low) on macroinvertebrate density might collect samples from a number of high, medium and low flow streams. However, the limnologist might expect that many other factors (perhaps more important than stream flow) also influence the density of macroinvertebrates and that even samples collected in close proximity to one another might vary considerably in macroinvertebrate density. Such variability increases the unexplained variability and therefore has the potential to make it very difficult to detect the effect of the actual factor of interest (in this case flow). In an attempt to combat this, the researcher decides to collect multiple samples from within each stream. If some of the unexplained variability can be explained by variation between samples within streams, then the test of flow will be more sensitive. This is called a nested ANOVA: an additional factor (Factor B: stream) has been introduced within the main factor (Factor A: flow). Factor A is a fixed factor as we are only interested in the specific levels chosen (e.g. high, medium and low flow). Factor B, however, is a random factor as we are not just interested in the specific streams that were sampled, we are interested in all the possible streams that they represent. In a nested ANOVA the categories of Factor B nested within each level of Factor A are different (e.g. the streams used for each flow type are different), and therefore it is not possible to test for an interaction between flow type and stream. There are two null hypotheses being tested in a two factor nested ANOVA. H0 main effect: no effect of Factor A (flow); H0: µ1 = µ2 = ... = µi = µ (as for single factor ANOVA). H0 nesting effect: no effect of Factor B (stream) nested within Factor A (flow); technically, that there is no added variance due to differences between all the possible levels of Factor B within any level of Factor A, e.g. no added variance due to differences between all possible high flow streams, between all possible medium flow streams, or between all possible low flow streams. Diagram shows layout of nested design with three treatments (different pattern fills).
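As a minimal sketch of the limnologist example (assuming a hypothetical data frame called streams with columns DENSITY, FLOW and STREAM), the nested random factor is supplied as an error stratum so that flow is tested against the among-stream variation:
# fit the nested model; STREAM defines the error stratum for the test of FLOW
> streams.aov <- aov(DENSITY ~ FLOW + Error(STREAM), data=streams)
# print out the ANOVA strata
> summary(streams.aov)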
These are designs where one factor is a blocking variable (Factor B) and the other variable (Factor A) is the main treatment factor of interest. These designs are used in situations where the environment is potentially heterogeneous enough to substantially increase the variability in the response variable and thus obscure the effects of the main factor. Experimental units are grouped into blocks (units of space or time that are likely to have less variable background conditions) and then each level of the treatment factor is applied to a single unit within each block. By grouping the experimental units together into blocks, some of the total variability in the response variable that is not explained by the main factor may be explained by the blocks. If so, the residual (unexplained) variation will be reduced, resulting in more precise estimates of parameters and more powerful tests of treatments. A plant ecologist was investigating the effects of leaf damage on plant chemical defenses. Since individual trees vary greatly in the degree of chemical defense, she applied each of three artificial leaf damage treatments (heavy, medium and light damage) to each tree, one treatment to each of three branches within eight randomly selected trees. Trees were the blocks. The diagram shows the layout of a blocking design. The main factor has three treatments (pattern fills), each of which is applied randomly within 3 randomly selected blocks. The decision to include a blocking factor (rather than a single factor, completely randomized design) depends on whether the experimental units (replicates) within a block are likely to be more similar to one another than those between blocks and whether the reduction in MSResidual is likely to compensate for the loss of residual df.
Split-plot designs can be thought of as a combination of a randomized complete block (RCB) design and a nested design. In their simplest form, they represent a RCB design with a second factor applied to the whole blocks. One factor (Factor C) is applied to experimental units (usually randomly positioned) within each block (now referred to as plots, or Factor B). This is the within-plot factor. A second factor (the between-plots factor, or Factor A) is then applied to the whole plots, and these plots are replicated. Hence a nested design applies between Factor A and the plots, and a blocking design applies between Factor C and the plots. Factor A is also crossed with Factor C, and therefore there is a test of the interaction between the two main factors. Note that if there are no replicate units of Factor C within each plot (that is, if each level of Factor C is applied only once to each plot), then the interaction of Factor C by plots within Factor A represents the overall residual. However, since the plots are usually random factors, the effects of Factor A must be tested against the plots within Factor A term. If there are replicate units of Factor C within each plot, then the Factor A by plots within Factor A by Factor C term is the overall residual.
Simple repeated measures designs are similar to randomized complete block (RCB) designs. In the same way that we could explain some of the residual variation by blocking, we can also reduce the residual variation by removing the between-replicate variation that exists at the start of the experiment. This is achieved by applying each treatment (repeated measures) to the same subjects, rather than using different subjects for each treatment. In this way it is the same as a RCB design. Where it differs from a RCB design is that each of the treatments cannot be applied to the blocks (subjects) at the same time. For example, it is not possible to determine the effects of different drugs on the heart rate of rats if each of the drugs is applied at the same time! Each of the treatments should be applied in random order; however, when the repeated factor is time, this is not possible. The subjects (blocks or plots) are always a random factor. If a second factor is applied at the level of the whole subject and over each of the repeated measures (e.g. subjects are of one of two species), the analysis then resembles a split-plot design, where this second factor is the between-subjects effect. An agricultural scientist was investigating the effects of fiber on the digestibility of pasture grasses by cows. Ten randomly selected cows of two different breeds (five of each) were each fed a random sequence of four food types (of different fiber levels). After a week of feeding on a single food type, the digestibility of the food was assessed. Individual cows represented the subjects. Breed of cow was the between-subjects effect and food fiber level was the within-subjects effect. Diagram illustrates the layout of a 2 factor repeated measures design. Rectangles represent subjects. Squares represent the within-subject treatments (random sequence represents the order in which each treatment is applied). Hence there are four repeated measures. There are two levels of the between-subjects effect (pattern fill of rectangles).
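A minimal sketch of the cow example (assuming a hypothetical data frame called cows with columns DIGEST, BREED, FOOD and COW): the between-subjects factor (BREED) is tested against the among-cow variation, while the within-subjects factor (FOOD) and its interaction with BREED are tested within cows.
# fit the model with COW (the subjects) as the error stratum
> cows.aov <- aov(DIGEST ~ BREED * FOOD + Error(COW), data=cows)
# print out the ANOVA strata
> summary(cows.aov)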
In ANOVA, the assumptions of normality and variance homogeneity apply to the residuals for the test of interest. In this example, the nested factor (Factor B, SITE) is random. Therefore, the normality and homogeneity of variance assumptions for Factor A (SEASON) must be assessed from the distribution, mean and variance of the response variable (S4DUGES) for each level of Factor A, using the means of Factor B within each level of Factor A as the observations (since the error term for Factor A (SEASON) is the variation in the response variable between categories of Factor B within each level of Factor A, i.e. MSSITE). Calculate the mean for each level of Factor B (SITE), and use these means to calculate the mean, variance and distribution of each level of Factor A (SEASON).
# aggregate the data set by calculating the means for each level of the nesting factor
> name.ag <- aggregate(dataset, list(Name1=FactorA, Name2=FactorB), mean)
# OR perhaps more efficiently
> library(lme4)
> name.ag <- gsummary(dataset, groups=FactorB)
# use the aggregated means to produce the boxplots
> boxplot(dv~FactorA, name.ag)
This data set generates two boxplots, each constructed from only 3 values and thus of questionable value.
# fit the hierarchical linear model
> name.aov <- aov(dv~FactorA + Error(FactorB), data=data)
# print out the ANOVA strata
> summary(name.aov)
# fit the hierarchical linear model
> curdies.aov <- aov(S4DUGES~SEASON + Error(SITE), data=curdies)
# print out the ANOVA strata
> summary(curdies.aov)
> curdies.aov1 <- aov(S4DUGES~SEASON + SITE, data=curdies)
> summary(curdies.aov1)
> name.aov <- aov(dv~factor1 * factor2, data=data)
# source Murray's syntax for dealing with sphericity
> library(biology)
# after fitting a randomized block, repeated measures or split-plot
> epsi.GG.HF(name.aov)
# Define the contrast coefficients for the factorial variable
> contrasts(factor) <- cbind(c(numeric list),...)
# Confirm that they are orthogonal
> round(crossprod(contrasts(factor)),2)
# Define a list of associated labels for the factorial variable's defined contrasts
> name.list <- list(factor=list("Label"=1, ...))
# Define the contrast coefficients for the factorial variable
> contrasts(partridge$GROUP) <- cbind(c(0,0,0,-1,1), c(-2,-2,-2,3,3))
# Confirm that they are orthogonal
> round(crossprod(contrasts(partridge$GROUP)), 2)
# Define a list of associated labels for the factorial variable's defined contrasts
> partridge.list <- list(GROUP=list("8virg vs 1virg"=1, "Partners vs Controls"=2))
> summary(name.aov, split=list)
> summary(partridge.aov, split=partridge.list)
# source Murray's biology library
> library(biology)
# define polynomial contrasts
> contrasts(dataset$Repeated) <- 'contr.poly'
> name.aov <- aov(dv~Error(subject/C(Repeated,poly,1)) + C(Repeated,poly,1), data=dataset)
> AnovaM(name.aov, split=list(Repeated=list('Linear'=1, 'Quad'=2,...)))
> library(biology)
> contrasts(driscoll$YEAR) <- 'contr.poly'
> driscoll.aov <- aov(CALLS~Error(BLOCK/C(YEAR,poly,1)) + C(YEAR,poly,1), data=driscoll)
> AnovaM(driscoll.aov, split=list(YEAR=list('Linear'=1, 'Quad'=2,...)))
> plot(name.lm)
# print a summary of the linear model coefficients
> summary(name.lm)
# print a summary of the ANOVA table
> anova(name.lm)
# print a summary of the ANOVA table
> anova(name.aov)
The assumptions of normality and homogeneity of variance apply to each of the factor level combinations, since it is the replicate observations of these that are the test residuals. If the design has two factors (IV1 and IV2) with three and four levels (groups) respectively within these factors, then a boxplot of each of the 3x4=12 combinations needs to be constructed. It is recommended that a variable (called GROUP) be setup to represent the combination of the two categorical (factor) variables.
Simply construct a boxplot with the dependent variable on the y-axis and GROUP on the x-axis. Visually assess whether the data from each group are normal (or at least that the groups are not all consistently skewed in the same direction), and whether the spread of data in each group is similar (or at least not related to the mean for that group). The GROUP variable can also assist in calculating the mean and variance for each group, to further assess variance homogeneity.
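A minimal sketch of this approach (assuming a hypothetical data frame called data with factors IV1 and IV2 and a response DV):
# create a GROUP variable representing each combination of the two factors
> data$GROUP <- interaction(data$IV1, data$IV2)
# boxplot of the response for each of the factor combinations
> boxplot(DV ~ GROUP, data=data)
# mean and variance for each combination, to assess variance homogeneity
> tapply(data$DV, data$GROUP, mean)
> tapply(data$DV, data$GROUP, var)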
Interaction plots display the degree of consistency (or lack thereof) of the effect of one factor across each level of another factor. Interaction plots can be either bar or line graphs; however, line graphs are more effective. The x-axis represents the levels of one factor, and a separate line is drawn for each level of the other factor. The following interaction plots represent two factors, A (with levels A1, A2, A3) and B (with levels B1, B2).
The parallel lines of the first plot clearly indicate no interaction. The effect of factor A is consistent for both levels of factor B and vice versa. The middle plot demonstrates a moderate interaction and the bottom plot illustrates a severe interaction. Clearly, whether or not there is an effect of factor B (e.g. B1 > B2 or B2 > B1) depends on the level of factor A. The trend is not consistent.
Statistical models that incorporate more than one categorical predictor variable are broadly referred to as multifactor analysis of variance. There are two main reasons for the inclusion of multiple factors:
Fully factorial linear models are used when the design incorporates two or more factors (independent, categorical variables) that are crossed with each other. That is, all combinations of the factors are included in the design and every level (group) of each factor occurs in combination with every level of the other factor(s). Furthermore, each combination is replicated.
In fully factorial designs both factors are equally important (potentially), and since all factors are crossed, we can also test whether there is an interaction between the factors (does the effect of one factor depend on the level of the other factor(s)). Graphs above depict a) interaction, b) no interaction.
For example, Quinn (1988) investigated the effects of season (two levels, winter/spring and summer/autumn) and adult density (four levels, 8, 15, 30 and 45 animals per 225 cm²) on the number of limpet egg masses. As the design was fully crossed (all four densities were used in each season), he was also able to test for an interaction between season and density. That is, was the effect of density consistent across both seasons and, was the effect of season consistent across all densities.
Diagram shows layout of 2 factor fully crossed design. The two factors (each with two levels) are color (black or gray) and pattern (solid or striped). There are three experimental units (replicates) per treatment combination.
# load the 'multcomp' package
> library(multcomp)
# perform Tukey's test
> summary(simtest(dv~factor, data=data, type="Tukey"))
# OR the more up to date method
> summary(glht(data.aov, linfct=mcp(factor="Tukey")))
# load the 'multcomp' package
> library(multcomp)
# perform Tukey's test
> summary(simtest(dv~factor1*factor2, whichf='factor', data=data, type="Tukey"))
# OR the more up to date method
> summary(glht(data.aov, linfct=mcp(factor="Tukey")))
> library(biology)
> quinn.aov1 <- aov(SQRTRECRUITS~DENSITY, data=quinn, subset=c(SEASON=='Summer'))
> AnovaM(lm(quinn.aov1), lm(quinn.aov), type="II")
> data$var <- as.factor(data$var)
> name.ag <- aggregate(dataset, list(NAME=FactorA, NAME1=FactorB), mean)
# OR
> library(lme4)
> name.ag <- gsummary(dataset, groups=dataset$FactorA)
> cu.ag <- aggregate(copper, list(COPPER=copper$COPPER, PLATE=copper$PLATE), mean)
# OR
> library(lme4)
> cu.ag <- gsummary(copper, groups=copper$PLATE)
> boxplot(name.ag$dv~name.ag$FactorA)
> boxplot(cu.ag$WORMS ~ cu.ag$COPPER)
# Fit the linear model with the different error strata
> name.aov <- aov(dv ~ FactorA + FactorC + FactorA:FactorC + Error(FactorB), data=dataset)
> AnovaM(name.aov)
# alternatively, if sphericity is an issue
> AnovaM(name.aov, RM=T)
> copper.aov <- aov(WORMS ~ COPPER + DIST + COPPER:DIST + Error(PLATE), data=copper)
> AnovaM(copper.aov, RM=T)
> interaction.plot(dataset$Factor1, dataset$Factor2, dataset$dv)
> interaction.plot(copper$DIST, copper$COPPER, copper$WORMS)
# Fit the linear model with the different error strata
> name.aov <- aov(dv ~ FactorA + Error(FactorB), data=dataset)
> AnovaM(name.aov)
> tobacco.aov <- aov(NUMBER ~ TREATMENT + Error(LEAF), data=tobacco)
> AnovaM(tobacco.aov)
# OR if sphericity is likely to be an issue
> AnovaM(tobacco.aov, RM=T)
> Mbargraph(day$BARNACLE, day$TREAT, symbols=c('AB','A','BC','C'))
> dv <- factor(dv, levels=c(ordered levels))
# Define the contrast coefficients for the factorial variable
> name.cont <- list(factor = c(numeric list))
# Define the list of associated labels for the factorial variable
> name.list <- list(factor = list('label'=1))
# Fit the linear model using the defined contrasts
> name.aov <- aov(dv~factor, data = data, contrasts = name.cont)
# Print out the ANOVA table incorporating the defined contrasts
> summary(name.aov, split = name.list)
# Define the contrast coefficients for the factorial variable
> partridge.cont <- list(GROUP = c(1,0,0,0,-1))
# Define the list of associated labels for the GROUP factor
> partridge.list <- list(GROUP = list("1 vs 5" = 1))
# Fit the linear model using the defined contrasts
> partridge.aov <- aov(LONGEVITY~GROUP, data = partridge, contrasts = partridge.cont)
# Print out the ANOVA table incorporating the defined contrasts
> summary(partridge.aov, split = partridge.list)
Fixed factors are factors whose levels represent the specific populations of interest. For example, a factor that comprises 'high', 'medium' and 'low' temperature treatments is a fixed factor - we are singularly interested in comparing those three populations. Conclusions about the effects of a fixed factor are restricted to the specific treatment levels investigated and for any subsequent experiments to be comparable, the same specific treatments of the factor would need to be used. In contrast, Random factors are factors whose levels are randomly chosen from all the possible levels or populations and are used as random representatives of the populations. For example, five random temperature treatments could be used to represent a full spectrum of temperature treatments. In this case, conclusions are extrapolated to all the possible treatment (temperature) levels and for subsequent experiments, a new random set of treatments of the factor would be selected.
Following a significant ANOVA result, it is often desirable to specifically compare group means to determine which groups are significantly different. However, multiple comparisons lead to two statistical problems. Firstly, multiple significance tests increase the Type I error rate (α, the probability of falsely rejecting H0). E.g., testing for differences between 5 groups requires ten pairwise comparisons. If the α for each test is 0.05 (5%), then the probability of at least one Type I error for the family of 10 tests is approximately 0.4 (40%). Secondly, the outcome of each test needs to be independent (orthogonality). E.g. if A>B and B>C then we already know the result of A vs. C.
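The figures quoted above can be verified directly in R:
# number of pairwise comparisons among 5 groups
> choose(5, 2)
[1] 10
# probability of at least one Type I error across 10 tests, each at alpha=0.05
> 1 - (1 - 0.05)^10
[1] 0.4012631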
Post-hoc unplanned pairwise comparisons (e.g. Tukey's test) compare all possible pairs of group means and are useful in an exploratory fashion to reveal major differences. Tukey's test controls the family-wise Type I error rate to no more than 0.05. However, this reduces the power of each of the pairwise comparisons, so only very large differences are detected (a consequence that is exacerbated with an increasing number of groups).
Planned comparisons are specific comparisons that are usually planned during the design stage of the experiment. No more than p-1 (where p is the number of groups) comparisons can be made; however, each comparison (provided it is orthogonal to the others) can be tested at α = 0.05. Amongst all possible pairwise comparisons, specific comparisons are selected, while others that are meaningless (within the biological context of the investigation) are ignored.
Planned comparisons are defined using what are known as contrast coefficients. These coefficients are a set of numbers that indicate which groups are to be compared, and what the relative contribution of each group is to the comparison. For example, if a factor has four groups (A, B, C and D) and you wish to specifically compare groups A and B, the contrast coefficients for this comparison would be 1, -1, 0 and 0 respectively. This indicates that groups A and B are to be compared (since their signs oppose) and that groups C and D are omitted from the specific comparison (their values are 0).
It is also possible to compare the average of two or more groups with the average of other groups. For example, to compare the average of groups A and B with the average of group C, the contrast coefficients would be 1, 1, -2 and 0 respectively. Note that 0.5, 0.5, -1 and 0 would be equivalent.
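A minimal sketch of these two example comparisons (assuming a hypothetical factor called TREAT, within a data frame called data, with the four levels A, B, C and D):
# define two contrasts: A vs B, and the average of A and B vs C
> contrasts(data$TREAT) <- cbind(c(1,-1,0,0), c(1,1,-2,0))
# confirm that the defined contrasts are orthogonal (off-diagonal elements should be 0)
> round(crossprod(contrasts(data$TREAT)), 2)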
The following points relate to the specification of contrast coefficients:
> anova(day.aov)
These are designs where one factor is a blocking variable (Factor B) and the other variable (Factor A) is the main treatment factor of interest. These designs are used in situations where the environment is potentially heterogeneous enough to substantially increase the variability in the response variable and thus obscure the effects of the main factor. Experimental units are grouped into blocks (units of space or time that are likely to have less variable background conditions) and then each level of the treatment factor is applied to a single unit in each block. By grouping the experimental units together into blocks, some of the total variability in the response variable that is not explained by the main factor may be explained by the blocks. If so, the residual (unexplained) variation will be reduced, resulting in more precise estimates of parameters and more powerful tests of treatments.
For example, a plant ecologist was investigating the effects of leaf damage on plant chemical defenses. Since individual trees vary greatly in the degree of chemical defense, she applied each of three artificial leaf damage treatments (heavy, medium and light damage) to each tree, one treatment to each of three branches within eight randomly selected trees. Trees were the blocks. The diagram shows the layout of a blocking design. The main factor has three treatments (pattern fills), each of which is applied randomly within 3 randomly selected blocks. The decision to include a blocking factor (rather than a single factor, completely randomized design) depends on whether the experimental units (replicates) within a block are likely to be more similar to one another than those between blocks and whether the reduction in MSResidual is likely to compensate for the loss of residual df.
# load the 'car' library
> library(car)
# generate a scatterplot
> scatterplot(dv~iv, data=data)
# make sure the car package is loaded
> library(car)
> scatterplot.matrix(~var1 + var2 + var3, data=data, diagonal='boxplot')
# OR
> pairs(~var1 + var2 + var3, data=data)
Start off setting the Number of simulations to 1000. This will collect 1000 random samples from Population 1 and 2.
Assumptions - In simulating the repeated collection of samples from both male and female populations, we make a number of assumptions that have important consequences for the reliability of the statistical tests. Firstly, the function used to simulate the collection of random samples of, say, 8 male fulmars (and 6 female fulmars) generates these samples from normal distributions of a given mean and standard deviation. Thus, each simulated t value, and the distribution of t values from multiple runs, represents the situation when the samples are collected from populations that are normally distributed. Likewise, the mathematical t distribution also makes this assumption. Furthermore, the sampling function uses the same standard deviation for each collected sample. Hence, each simulated t value, and the distribution of t values from multiple runs, represents the situation when the samples are collected from populations that are equally varied. Again, the mathematical t distribution makes this assumption. By altering the variability and degree of normality of the populations, you can see the effects that normality and homogeneity of variance have on how much a distribution of t values differs from the mathematical t distribution. Violations of either of these two assumptions (normality and homogeneity of variance) have the real potential to compromise the reliability of the statistical conclusions and thus it is important for the data to satisfy these assumptions.
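A minimal sketch of such a simulation (purely illustrative, with hypothetical sample sizes of 8 and 6 and a standard normal population) compares the simulated t values to the mathematical t distribution:
# simulate 1000 t values for samples of 8 'males' and 6 'females'
# drawn from identical normal populations (mean 0, sd 1)
> set.seed(1)
> tvals <- replicate(1000, t.test(rnorm(8, 0, 1), rnorm(6, 0, 1), var.equal=T)$stat)
# compare the simulated distribution to the mathematical t distribution (df = 8+6-2)
> hist(tvals, freq=F)
> curve(dt(x, df=12), add=T)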
Step 1 - Clearly establish the statistical null hypothesis. Therefore, start off by considering the situation where the null hypothesis is true - e.g. when the two population means are equal
Step 2 - Establish a critical statistical criterion (e.g. alpha = 0.05)
Step 3 - Collect samples consisting of independent, unbiased observations
Step 4 - Assess the assumptions relevant to the statistical hypothesis test. For a t test: 1. Normality 2. Homogeneity of variance
Step 5 - Calculate test statistic appropriate for null hypothesis (e.g. a t value)
Step 6 - Compare observed test statistic to a probability distribution for that test statistic when the null hypothesis is true for the appropriate degrees of freedom (e.g. compare the observed t value to a t distribution).
Step 7 - If the observed test statistic is greater (in magnitude) than the critical value for that test statistic (based on the predefined critical criteria), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis and therefore the null hypothesis is rejected, otherwise we conclude that there is insufficient evidence to reject the null hypothesis. Alternatively, we calculate the probability of obtaining the observed test statistic (or one of greater magnitude) when the null hypothesis is true. If this probability is less than our predefined critical criteria (e.g. 0.05), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis and therefore the null hypothesis is rejected, otherwise we conclude that there is insufficient evidence to reject the null hypothesis.
> t.test(dv~factor,data=data,var.equal=T)
> t.test(dv~factor,data=data,var.equal=F)
# load the 'biology' extension library written by Murray
> library(biology)
# use the Mbargraph() function to construct a bargraph
> Mbargraph(furness$METRATE, furness$SEX, errorbars="se")
The purpose of simple linear regression analysis is to estimate the slope and y-intercept of a straight line through a data set. The line represents the expected (and predicted) average values of X and Y in the population, based on the collected sample. The line (and thus the slope and intercept) is typically the line that minimizes the vertical spread of the observations from the line. This process of fitting the line of best fit is called Ordinary Least Squares (OLS), and is depicted in figure OLS Y vs X (top left) below. This fitting process assumes that there is no uncertainty in the predictor (X) variable and therefore that the observed values of X are exactly as they are in the population (there is no measurement error nor natural variation).
Model I regression is the classical regression analysis and the form of analysis most commonly used. However, it is only a valid method of estimating the model parameters (slope and intercept) when the levels of the predictor variable are not measured, but rather are set by the researcher. For example, if we were interested in the effects of temperature on the growth rate of plant seedlings, we could analyse the data using model I regression provided the different temperatures were specifically set by us (eg. controlling temperatures in green houses or growth cabinets). If on the other hand, the plants had been grown under a range of temperatures that were not specifically set by us and we had simply measured what the temperatures were, model I regression would not be appropriate.
Note that if instead of regressing Y against X we regressed X against Y, the slope of the line of best fit would be different since the line is now the one that minimizes the horizontal spread of the observations from the line (see figure OLS X vs Y - topright). The OLS regressions of Y on X and X on Y represent the two extremes of the population slope estimates as they depict the situations in which all of the variation is in the vertical and horizontal axes respectively. Typically, we might expect some uncertainty in both X and Y variables, and thus the true trend line should lie somewhere between these two extremes (see the bottom right figure below).
Model II regression estimates the line of best fit by incorporating uncertainty in both Y and X variables, and is therefore more suitable for estimating model parameters (slope and intercept) from data in which both variables have been measured. There are at least three forms of model II regression, and these differ by what assumptions they place on the relative degree of uncertainty in the Y and X variables.
Given that most instances of regression in biology are actually more suitable for model II regression, why are examples of model II in the literature so rare and why do most statistical software packages default to model I (OLS) regression procedures (most do not even offer model II procedures)? Is it that biologists are ignorant or slack? It turns out that as far as the main hypothesis of interest is concerned (that the population slope equals zero), it does not matter - RMA and OLS procedures give the same p-values (p-values cannot be easily computed for MA and Ranged MA procedures).
If the purpose of the regression is to generate a model that can be used for the purpose of prediction (predict new values of the dependent variable from new values of the independent values), then only OLS fitted models are valid. The reason for this, is that the new independent value represents a theoretical fixed value with no uncertainty. This new value must be used in a model that assumed no uncertainty in the independent (predictor) variable. The OLS procedure is the only one that minimizes the vertical spread of the observations from the line.
Therefore, if regression analysis is being performed for the purpose of investigating whether there is a relationship between two continuous variables and/or to generate a predictive model, then model I (OLS) regression is perfectly valid. Note, however, that if slopes and intercepts are reported, it is probably worth presenting both model I and model II slopes.
# make sure that the biology library is loaded
> library(biology)
# fit the model II regression
> summary(lm.II(log10(INDIV)~log10(AREA), data, type="RMA"))
# make sure that the Hmisc library is loaded
> library(Hmisc)
# plot the error bars
> errbar(xs, means, means+ses, means-ses, add=T)
The following outputs are based on simulated data sets:
1. Pooled variance t-test for populations with equal (or nearly so) variances
2. Separate variance t-test for populations with unequal variances
> t.test(cat1, cat2, data=data, paired=T)
> name.lm <- lm(dv~iv+I(iv^2), data=data)
> anova(full.lm, reduced.lm)
Balanced models are those models that have equal sample sizes for each level of a treatment. Unequal sample sizes are generally problematic for statistical analyses as they can be the source of unequal variances. They are additionally problematic for multifactor designs because of the way that total variability is partitioned into the contributions of each of the main effects as well as the contributions of interactions (i.e. the calculations of sums of squares and degrees of freedom). Sums of squares (and degrees of freedom) are usually (type I SS) partitioned additively. That is: SSTotal = SSA + SSB + SSAB + SSResidual, and dfTotal = dfA + dfB + dfAB + dfResidual. In practice, these formulas are used in reverse. For example, if the model was entered as DV~A+B+A:B, first SSResidual is calculated, followed by SSAB, then SSB. SSA is thus calculated as the total sums of squares minus what has already been partitioned off to the other terms (SSA = SSTotal - (SSB + SSAB + SSResidual)). If the model is balanced, then it doesn't matter in what order the model is constructed (DV~B+A+A:B will be equivalent).
However, in unbalanced designs, SS cannot be partitioned additively; that is, the actual SS of each of the terms does not add up to SSTotal (SSTotal ≠ SSA + SSB + SSAB + SSResidual). Of course, when SS is partitioned it all must add up. Consequently, the SS of the last term to be calculated will not be correct. Therefore, in unbalanced designs, the SS values for the terms are different (and therefore so are the F-ratios etc.) depending on the order in which the terms are added - an undesirable situation.
To account for this, there are alternative methods of calculating SS for the main effects (all provide the same SS for interaction terms and residuals).
# returns TRUE if the design is balanced (replications() returns a list when the design is unbalanced)
> !is.list(replications(formula, data))
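To see the order dependence described above, the model can be fitted with the terms entered in each order; a minimal sketch (assuming a hypothetical unbalanced data frame called data with factors A and B and a response DV):
# Type I (sequential) SS with A fitted first
> anova(aov(DV ~ A * B, data=data))
# Type I (sequential) SS with B fitted first - the SS for A and B will differ if unbalanced
> anova(aov(DV ~ B * A, data=data))
# Type II SS (via the car package) do not depend on the order of the terms
> library(car)
> Anova(aov(DV ~ A * B, data=data), type="II")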
Non-parametric tests do not place any distributional limitations on data sets and are therefore useful when the assumption of normality is violated. There are a number of alternatives to parametric tests, the most common are;
1. Randomization tests - rather than assume that a test statistic calculated from the data follows a specific mathematical distribution (such as a normal distribution), these tests generate their own test statistic distribution by repeatedly re-sampling or re-shuffling the original data and recalculating the test statistic each time. A p value is subsequently calculated as the proportion of random test statistics that are greater than or equal to the test statistic based on the original (un-shuffled) data.
2. Rank-based tests - these tests operate not on the original data, but rather on data that has first been ranked (each observation is assigned a rank, such that the largest observation is given the value of 1, the next highest is 2 and so on). It turns out that the probability distribution of any rank-based test statistic for a given sample size is identical regardless of the underlying distribution of the original data.
The basis of hypothesis testing (when comparing two populations) is to determine how often we would expect to collect a specific sample (or one more unusual) if the populations were identical. That is, would our sample be considered unusual under normal circumstances? To determine this, we need to estimate what sort of samples we would expect under normal circumstances. The theoretical t-distribution mimics what it would be like to randomly resample repeatedly (with a given sample size) from such identical populations.
A single sample from the (proposed) population(s) provides a range of observations that are possible. Completely shuffling those data (or labels) should mimic the situation when there is no effect (the situation when the null hypothesis is true), since the data are then effectively assigned randomly with respect to the effect. In the case of a t-test comparing the means of two groups, randomly shuffling the data is likely to result in similar group means. This provides one possible outcome (t-value) that might be expected when the null hypothesis is true. If this shuffling process is repeated multiple times, a distribution of all of the outcomes (t-values) that are possible when the null hypothesis is true can be generated. We can now determine how unusual our real (unshuffled) t-value is by comparing it to the built-up t-distribution. If the real t-value is expected to occur less than 5% of the time, we reject the null hypothesis.
Randomization tests - rather than assume that a test statistic calculated from the data follows a specific mathematical distribution (such as a normal distribution), these tests generate their own test statistic distribution by repeatedly re-sampling or re-shuffling the original data and recalculating the test statistic each time. A p value is subsequently calculated as the proportion of random test statistics that are greater than or equal to the test statistic based on the original (un-shuffled) data.
# define the statistic to be calculated on each data set (here a t value)
> stat <- function(data, indices) {
    t <- t.test(data[,2] ~ data[,1])$stat
    t
  }
# define how each randomized data set is generated (shuffle the grouping factor)
rand.gen <- function(data, mle) {
    out <- data
    out[,1] <- sample(out[,1], replace=F)
    out
  }
# perform the randomizations (R=100) using the boot package
library(boot)
coots.boot <- boot(coots, stat, R=100, sim="parametric", ran.gen=rand.gen)
# p value: proportion of randomized t values greater than or equal to the observed t value
p.length <- length(coots.boot$t[coots.boot$t >= coots.boot$t0]) + 1
print(p.length/(coots.boot$R + 1))
> loyn2 <- data.frame(logAREA=log10(loyn$AREA), logDIST=log10(loyn$DIST), logLDIST=log10(loyn$LDIST), ALT=loyn$ALT, GRAZE=loyn$GRAZE, YEARS=loyn$YR.ISOL)
> library(hier.part)
> loyn.hier <- hier.part(loyn$ABUND, loyn2, gof="Rsqu")
> loyn.hier
The first step is to create a data frame that contains only the predictor variables. Then we load a package called hier.part by Chris Walsh and Ralph MacNally (Monash University). This contains the functions necessary to perform hierarchical partitioning in R. Finally, we use the hier.part() function to perform the hierarchical partitioning and print out the results. The gof= parameter specifies the goodness-of-fit measure, in this case r2 values.
> r <- sqrt(loyn.hier$IJ["Total"])
> log((abs((r+1)/(r-1)))^2)
The first step extracts the list of r2 values from the Total contribution column of the hierarchical partitioning output. At the same time, these are converted into correlation coefficients by taking their square roots. The second line converts the correlation coefficients into standard z-scores and prints them out.
> rand.hp(loyn$ABUND, loyn2, fam = "gaussian", gof = "Rsqu", num.reps=100)$Iprobs
Note that this procedure is set to randomize and calculate 100 times. This can take up to 5 minutes to run! Predictor variables considered to contribute significantly to the explanation of variation in the response variable (from an independent perspective) are signified by a '*'.
> mean1 <- 1.65
> sd1 <- 0.51
> es <- .50
> mean2 <- mean1+(mean1*es)
> b.var <- var(c(mean1, mean2, mean2, mean2))
> es <- .50
> mean2 <- mean1+(mean1*es)
> b.var <- var(c(mean1, mean1, mean2, mean2))
> es <- .50
> mean2 <- mean1+(mean1*es)
> b.var <- var(seq(mean1, mean2, length=4))
> power.anova.test(group=4, power=.8, between.var=b.var, within.var=sd1^2)
These instructions can be altered by customising the group size, power, etc.
> x <- seq(3, 30, l=100)
# generate the maximum power curve
> b.var <- var(c(mean1, mean1, mean2, mean2))
> plot(x, power.anova.test(group=4, n=x, between.var=b.var, within.var=sd1^2)$power, ylim=c(0,1), ylab="power", xlab="n", type="l")
# put in a line at power=0.8
> abline(0.8, 0, lty=2)
# generate the minimum power curve
> m <- mean(c(mean1, mean2))
> b.var <- var(c(mean1, m, m, mean2))
> points(x, power.anova.test(group=4, n=x, between.var=b.var, within.var=sd1^2)$power, type="l", lty=4)
Linear regression analysis explores the linear relationship between a continuous response variable (DV) and a single continuous predictor variable (IV). A line of best fit (one that minimizes the sum of squares deviations between the observed y values and the predicted y values at each x) is fitted to the data to estimate the linear model.
As plot a) shows, a slope of zero would indicate no relationship - an increase in IV is not associated with a consistent change in DV. Plot b) shows a positive relationship (positive slope) - an increase in IV is associated with an increase in DV. Plot c) shows a negative relationship.
Testing whether the population slope (as estimated from the sample slope) differs from zero is one way of evaluating the relationship. Regression analysis also determines how much of the total variability in the response variable can be explained by its linear relationship with IV, and how much remains unexplained. The line of best fit (relating DV to IV) takes on the form of y = mx + c, that is: response variable = (slope x predictor variable) + y-intercept. If all points fall exactly on the line, then this model (the right-hand part of the equation) explains all of the variability in the response variable. That is, all variation in DV can be explained by differences in IV. Usually, however, the model does not explain all of the variability in the response variable (due to natural variation), and some is left unexplained (referred to as error). Therefore the linear model takes the form of: response variable = model + error. A high proportion of variability explained relative to unexplained (error) is suggestive of a relationship between the response and predictor variables. For example, a mammalogist was investigating the effects of tooth wear on the amount of time free ranging koalas spend feeding per 24 h. The amount of time spent feeding per 24 h was measured for a number of koalas that varied in their degree of tooth wear (ranging from unworn through to worn). Time spent feeding was the response variable and degree of tooth wear was the predictor variable.
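A minimal sketch of the koala example (assuming a hypothetical data frame called koala with columns FEEDING and TOOTHWEAR):
# fit the linear model (response ~ predictor)
> koala.lm <- lm(FEEDING ~ TOOTHWEAR, data=koala)
# estimated y-intercept and slope
> coef(koala.lm)
# proportion of variation explained (r squared) and test of the slope
> summary(koala.lm)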
The Model II (RMA - reduced major axis) regression slope is the average of the slope of the regression of Y on X (the OLS regression slope) and the inverse of the slope of the regression of X on Y. It is the slope of the regression line that minimizes the sum of the triangular areas bounded by the regression line and the horizontal and vertical lines from the observed points to this regression line.
> christ.lm2 <- lm(RIPDENS ~ CWDBASAL, data=christ)
> slope <- mean(c(christ.lm$coef[2], 1/christ.lm2$coef[2]))
All regression lines, whether fitted by OLS or RMA (or MA variants), go through the point whose X and Y coordinates are mean X and mean Y. Essentially, all regression lines rotate about this point. Having estimated the RMA regression slope, the RMA y-intercept is estimated by simply solving the linear equation (y=bx+a) for a, using the means for X and Y as the known values.
> intercept <- mean(christ$CWDBASAL) - (slope*mean(christ$RIPDENS))
> library(pwr)
> pwr.r.test(power=.8, r=sqrt(.92))
> library(pwr)
> pwr.r.test(n=20, r=sqrt(.92))
> source("power.R") > power.r.plot(n=4:10,r=sqrt(.92))
> source("power.R") > power.r.plot(power=seq(6,.99,l=100),r=sqrt(.92))
> sinclair.glmR <- glm(COUNT~DEATH+MARROW, family=poisson, data=sinclair.split[["FEMALE"]])
> sinclair.glmF <- glm(COUNT~DEATH*MARROW, family=poisson, data=sinclair.split[["FEMALE"]])
> anova(sinclair.glmR, sinclair.glmF, test="Chisq")
The first line fits the reduced log-linear model. This is the model that does not contain the two way interaction term. The second line fits the full log-linear model - the model that does contain the two way interaction. The third line generates the analysis of deviance table. This essentially presents the results of the hypothesis test of whether there is a two way interaction or not. The same procedure can then be repeated for the "MALE" data.
> library(epitools)
> female.tab <- xtabs(COUNT ~ DEATH + MARROW, data=sinclair.split[["FEMALE"]])
> oddsratio.wald(t(female.tab))$measure
> oddsratio.wald(t(female.tab), rev="b")$measure
Analysis of variance, or ANOVA, partitions the variation in the response (DV) variable into that explained and that unexplained by one or more categorical predictor variables, called factors. The ratio of this partitioning can then be used to test the null hypothesis (H0) that the population group or treatment means are equal. A single factor ANOVA investigates the effect of a single factor, with 2 or more groups (levels), on the response variable. Single factor ANOVAs are completely randomized (CR) designs. That is, there is no restriction on the random allocation of experimental or sampling units to factor levels. Single factor ANOVA tests the H0 that there is no difference between the population group means: H0: µ1 = µ2 = .... = µi = µ. If the effect of the ith group is defined as αi = µi - µ, then this can be written as H0: α1 = α2 = ... = αi = 0. If one or more of the αi are different from zero, the null hypothesis will be rejected.
Keough and Raymondi (1995) investigated the degree to which biofilms (films of diatoms, algal spores, bacteria, and other organic material that develop on hard surfaces) influence the settlement of polychaete worms. They had four categories (levels) of the biofilm treatment: sterile substrata, lab developed biofilms without larvae, lab developed biofilms with larvae (any larvae), and field developed biofilms without larvae. Biofilm plates were placed in the field in a completely randomized array. After a week, the number of polychaete species settled on each plate was recorded. The diagram illustrates an example of the spatial layout of a single factor design with four treatments (four levels of the treatment factor, each with a different pattern fill) and four experimental units (replicates) for each treatment.
> boxplot(dv~factor,data=data)
> boxplot(dv~factor1*factor2,data=data)
> wilcox.test(dv ~ factor, data=data)
# calculate critical t value (alpha = 0.05) for the upper tail (e.g. A larger than B)
> qt(0.05, df=df, lower.tail=F)
# calculate critical t value (alpha = 0.05) for the lower tail (e.g. B larger than A)
> qt(0.05, df=df, lower.tail=T)
It actually doesn't matter whether you set Lower tail to TRUE or FALSE. The t-distribution is a symmetrical distribution, centered around 0; therefore, the critical t-value is the same for both Lower tail (e.g. population 2 greater than population 1) and Upper tail (e.g. population 1 greater than population 2), except that the Lower tail is always negative. As it is often less confusing to work with positive values, it is recommended that you use Upper tail values. An example of a t-distribution with Upper tail for a one-tailed test is depicted below. Note that this is not part of the t quantiles output!
# calculate critical t value (alpha = 0.05) for a two-tailed test (e.g. A different to B)
> qt(0.05/2, df=df, lower.tail=T)
Again, it actually doesn't matter whether you select Lower tail as either TRUE or FALSE. For a symmetrical distribution, centered around 0, the critical t-value is the same for both Lower tail (e.g. population 2 greater than population 1) and Upper tail (e.g. population 1 greater than population 2), except that the Lower tail is always negative. As it is often less confusing to work with positive values, it is recommended that you use Upper tail values. An example of a t-distribution with Upper tail for a two-tailed test is depicted below. Note that this is not part of the t quantiles output!
# first calculate the means and variances
> means <- tapply(data$dv, data$factor, mean)
> vars <- tapply(data$dv, data$factor, var)
# then plot the mean against variance
> plot(means, vars)
# first calculate the means and variances
> means <- tapply(data$dv, list(data$factor1, data$factor2), mean)
> vars <- tapply(data$dv, list(data$factor1, data$factor2), var)
# then plot the mean against variance
> plot(means, vars)
Analysis of frequencies is similar to Analysis of Variance (ANOVA) in some ways. Variables contain two or more classes that are defined from either natural categories or from a set of arbitrary class limits in a continuous variable. For example, the classes could be sexes (male and female) or color classes derived by splitting the light scale into a set of wavelength bands. Unlike ANOVA, in which an attribute (e.g. length) is measured for a set number of replicates and the means of different classes (categories) are compared, when analyzing frequencies, the number of replicates (observed) that fall into each of the defined classes are counted and these frequencies are compared to predicted (expected) frequencies.
Analysis of frequencies tests whether a sample of observations came from a population where the observed frequencies match some expected or theoretical frequencies. Analysis of frequencies is based on the chi-squared (χ2) statistic, which follows a chi-square distribution (the distribution of squared values from a standard normal distribution, hence its long right tail).
When there is only one categorical variable, expected frequencies are calculated from theoretical ratios. When there is more than one categorical variable, the data are arranged in a contingency table that reflects the cross-classification of sampling or experimental units into the classes of the two or more variables. The most common form of contingency table analysis (model I) tests a null hypothesis of independence between the categorical variables and is analogous to the test of an interaction in multifactorial ANOVA. Hence, frequency analysis provides hypothesis tests for solely categorical data. Although analysis of frequencies provides a way to analyse data in which both the predictor and response variables are categorical, since variables are not distinguished as either predictor or response in the analysis, establishment of causality is only of importance for interpretation.
The goodness-of-fit test compares observed frequencies of each class within a single categorical variable to the frequencies expected of each of the classes on a theoretical basis. It tests the null hypothesis that the sample came from a population in which the observed frequencies match the expected frequencies.
For example, an ecologist investigating factors that might lead to deviations from a 1:1 offspring sex ratio, hypothesized that when the investment in one sex is considerably greater than in the other, the offspring sex ratio should be biased towards the less costly sex. He studied two species of wasps, one of which had males that were considerably larger (and thus more costly to produce) than females. For each species, he compared the offspring sex ratio to a 1:1 ratio using a goodness-of-fit test.
> chisq.test(c(counts))
# OR
> chisq.test(c(counts), p=c(.5,.5))
> chisq.test(c(40,50))
# OR
> chisq.test(c(40,50), p=c(.5,.5))
Contingency tables are the cross-classification of two (or more) categorical variables. A 2 x 2 (two way) table takes on the following form:
Contingency tables test the null hypothesis that the data came from a population in which variable 1 is independent of variable 2 and vice-versa. That is, it is a test of independence. For example a plant ecologist examined the effect of heat on seed germination. Contingency test was used to determine whether germination (2 categories - yes or no) was independent of the heat treatment (2 categories heated or not heated).
> name.tab <- xtabs(counts ~ cat1 + cat2, data=data)
> name.x2 <- chisq.test(name.tab, correct=F)
> name.tab <- table(data)
Multivariate data sets include multiple variables recorded from a number of replicate sampling or experimental units, sometimes referred to as objects. If, for example, the objects are whole organisms or ecological sampling units, the variables could be morphometric measurements or abundances of various species respectively. The variables may be all response variables (MANOVA) or they could be response variables, predictor variables or a combination of both (PCA and MDS).
The aim of multivariate analysis is to reduce the many variables to a smaller number of new derived variables that adequately summarize the original information and can be used for further analysis. In addition, Principal components analysis (PCA) and Multidimensional scaling (MDS) also aim to reveal patterns in the data, especially among objects, that could not be found by analyzing each variable separately.
For two objects (i = 1 and 2) and a number of variables (j = 1 to p).
BC = 1 - [2(3+6+9)/(18+36)] = 0.33, where (3+6+9) is the sum of the lesser abundance of each variable across the two objects and (18+36) is the sum of all abundances in both objects.
Bray-Curtis dissimilarity is well suited to species abundance data because it ignores variables that have zeros for both objects, and it reaches a constant maximum value when two objects have no variables in common. However, its value is determined mainly by variables with high values (e.g. species with high abundances) and thus results are biased towards trends displayed in such variables. It ranges from 0 (objects completely similar) to 1 (objects completely dissimilar).
EUC = √[(3-6)² + (6-12)² + (9-18)²] = 11.2
Euclidean distance is a very important, general, distance measure in multivariate statistics. It is based on simple geometry as a measure of distance between two objects in multidimensional space. It is only bounded by a zero for two objects with exactly the same values for all variables and has no upper limits, even when two objects have no variables in common with positive values. Since it does not reach a constant maximum value when two objects have no variables in common, it is not ideal for species abundance data.
The matrix of distance values will appear in the R Commander output window.
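The worked examples above can also be reproduced directly at the R prompt; a minimal sketch (assuming the vegan package is installed) for the two objects with abundances (3, 6, 9) and (6, 12, 18):
# the two objects as rows of a matrix
> dat <- rbind(c(3,6,9), c(6,12,18))
# Bray-Curtis dissimilarity (vegan package)
> library(vegan)
> vegdist(dat, method="bray")
# Euclidean distance (base R)
> dist(dat, method="euclidean")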
Multidimensional scaling (MDS) has two aims.
MDS starts with a dissimilarity matrix which represents the degree of difference between all objects (such as sites or quadrats) based on all the original variables. MDS then attempts to reproduce these patterns between objects by arranging the objects in multidimensional space with the distances between each pair of objects representing their relative dissimilarity. Each of the dimensions (plot axes) therefore represents a new variable that can be used to describe the relationships between objects. A greater number of new variables (axes) increases the degree to which the original patterns between objects are retained, however less data reduction has occurred. Furthermore, it is difficult to envisage points (objects) arranged in more than three dimensions.
MDS can be broken down into the following steps
A large amount of information will appear in the R Commander output window (most of which you can ignore!).
The first of these plots is a Shepard plot, which represents the relationship between the original distances (y-axis) and the new MDS distances (x-axis). The second plot is the final ordination plot. This represents the arrangement of objects in multidimensional space.
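Outside R Commander, a minimal MDS sketch (assuming the vegan package and a hypothetical site-by-species abundance matrix called sites) produces the same two plots:
# perform non-metric MDS in two dimensions using Bray-Curtis dissimilarities
> library(vegan)
> sites.mds <- metaMDS(sites, distance="bray", k=2)
# Shepard (stress) plot: original dissimilarities vs ordination distances
> stressplot(sites.mds)
# final ordination plot of the objects in two dimensions
> plot(sites.mds, type="t")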
Linear models are those statistical models in which a series of parameters (the coefficients of the predictor variables) are arranged as a linear combination. That is, within the model, no parameter appears as either a multiplier, divisor or exponent to any other parameter. Importantly, the term 'linear' in this context does not pertain to the nature of the relationship between the response variable and the predictor variable(s), and thus linear models are not restricted to 'linear' (straight-line) relationships. The following examples should help make the distinction.
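For example (using hypothetical variables y and x in a data frame called data), the quadratic model below is still a linear model because each parameter enters only as a coefficient in a linear combination, whereas the exponential model is non-linear because one parameter appears as an exponent:
# a linear model, despite describing a curved (quadratic) relationship
> name.lm <- lm(y ~ x + I(x^2), data=data)
# a non-linear model (parameters a and b estimated by non-linear least squares; starting values are illustrative only)
> name.nls <- nls(y ~ a * exp(b * x), data=data, start=list(a=1, b=0.1))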