End of instructions
> class(name)
> name <- entry
> IQ <- 10.513
> numbers <- c(1, 4, 6, 7, 4, 345, 36, 78) > print(numbers) [1] 1 4 6 7 4 345 36 78 > numbers [1] 1 4 6 7 4 345 36 78
> name <- factor(c(list of characters/words))
> name <- gl(number of levels, number of replicates, length of data set, lab=c(list of level names)))
> sex <- factor(c('Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male')) OR > sex <- factor(c(rep('Female',6),rep('Male',6))) OR > sex <- gl(2,6,12,lab=c('Female','Male'))
> q()
> source('filename')
Note that there is an alternative way of using scripts
> ls()
> rm(object name)
> name <- read.table('filename.csv', header=T, sep=',', row.names=column)
> phasmid <- read.data('phasmid.csv', header=T, sep=',', row.names=1)
> write.data(data frame name, 'filename.csv', quote=F, sep=',')
> write.data(LEAVES, 'leaves.csv', quote=F, sep=',')
> var <- c(4,8,2,6,9,2) > var [1] 4 8 2 6 9 2 > var[1] [1] 4 > var[5] [1] 9 > var[3:5] [1] 2 6 9
> dv <- c(4,8,2,6,9,2) > iv <- c('a','a','a','b','b','b') > data <- data.frame(iv,dv) > data iv dv 1 a 4 2 a 8 3 a 2 4 b 6 5 b 9 6 b 2 > data[1] #list first section iv 1 a 2 a 3 a 4 b 5 b 6 b > data[,1] #list contents of first column [1] a a a b b b Levels: a b > data[1,] #list contents of first row iv dv 1 a 4 > data[3,2] #list the entry in row 3, column 2 [1] 2
> fix(data frame name)
> old_var <- log(new_var)
> old_var <- log10(new_var)
> old_var <- sqrt(new_var)
> old_var <- asin(sqrt(new_var))
> old_var <- scale(new_var)
> set.seed(number)
> library(package)
> hist(variable)
# mean > mean(variable) # variance > var(variable) # standard deviation > sd(variable) # variable length > length(variable) # median > median(variable) # maximum > max(variable) # minimum > min(variable) # standard error of mean > sd(variable)/sqrt(length(variable)) # interquartile range > quantile(variable, c(.25,.75)) # 95% confidence interval > qnorm(0.975,0,1)*(sd(variable)/sqrt(length(variable)))
> tapply(dv,factor,func)
# calculate the mean separately for each group > tapply(dv,factor,mean) # calculate the mean separately for each group > tapply(dv,factor,var)
# select the first 5 entries in the variable > variable[1:5] # select all but the first entry in the variable > variable[-1] # select all cases of variable that are less than num > variable[variable<num] # select all cases of variable that are equal to 'Big' > variable1[variable<label]
> pnorm(c(value), mean=mean, sd=sd, lower.tail=FALSE)
> pnorm(c(2.9), mean=2.025882, sd=0.4836265, lower.tail=FALSE)
> boxplot(variable)
> name.lm <- lm(dv~iv, data=data)
> name.lm <- lm(dv~iv1+iv2, data=data)
> day.aov <- lm(dv~factor,data=data) # OR > day.aov <- aov(dv~factor,data=data)
A nested ANOVA is a simple extension of the single factor ANOVA. An additional factor (Factor B) is added that is nested within the main factor of interest (Factor A). Nested factors are incorporated in situations where researches envisage potentially large amounts of variation between different sets of replicates within the main Factor. For example, a limnologist investigating the effect of flow (Factor A: high, medium and low) on macroinvertebate density, might collect samples from a number of high, medium and low flow streams. However, the limnologist might expect that many other factors (perhaps more important than stream flow) also influence the density of macroinvertebrates and that even samples collected in close proximity to one another might be expected to vary considerably in density of macroinvertebrates. Such variability increases the unexplained variability and therefore has the potential to make it very difficult to pick up the effect of the actual factor of interest (in this case flow). In an attempt to combat this, the researcher decides to collect multiple samples from within each stream. If some of the unexplained variability can be explained by variation between samples within streams, then the test of flow will be more sensitive. This is called a nested ANOVA, an additional factor (Factor B: stream) has been introduce within the main factor (Factor A: flow). Factor A is a fixed factor as we are only interested in the specific levels chosen (e.g. high, medium and low flow). Factor B however, is a random as we are not just interested in the specific streams that were sampled, we are interested in all the possible streams that they represent. In a nested ANOVA the categories of the Factor B nested within each level of Factor A are different. E.g. the streams used for each flow type are different. Therefore it is not possible to test for an interaction between Flow type and stream. There are two null hypotheses being tested in a two factor nested ANOVA. H0 main effect: no effect of Factor A (flow) H0: = &mu1 = &mu2 = ... = &mui = &mu (as for single factor ANOVA). H0 nesting effect: no effect of Factor B (stream) nested within Factor A (flow). Technically it is that there is no added variance due to differences between all the possible levels of Factor B with any level of Factor A. E.g., that there is no added variance due to differences between all possible high flow streams, or between all possible medium flow streams or between all possible low flow streams. Diagram shows layout of nested design with three treatments (different pattern fills).
These are designs where one factor is a blocking variable (Factor B) and the other variable (Factor A) is the main treatment factor of interest. These designs are used in situations where the environment is potentially heterogeneous enough to substantially increase the variability in the response variable and thus obscure the effects of the main factor. Experimental units are grouped into blocks (units of space or time that are likely to have less variable background conditions) and then each level of the treatment factor is applied to a single unit within each block. By grouping the experimental units together into blocks, some of the total variability in the response variable that is not explained by the main factor may be explained by the blocks. If so, the residual (unexplained) variation will be reduced resulting in more precise estimates of parameters and more powerful tests of treatments. A plant ecologist was investigating the effects of leaf damage on plant chemical defenses. Since individual trees vary greatly in the degree of chemical defense, she applied each of three artificial leaf damage treatments (heavy, medium and light damage) to each tree. She applied one of three treatments to three branches within eight randomly selected trees. Trees were the blocks. The diagram shows the layout of a blocking design. The main factor has three treatments (pattern fills), each of which are applied randomly within 3 randomly selected blocks. The decision to include a blocking factor (rather than single factor, completely random) depends on whether the experimental units (replicates) within a block are likely to be more similar to one another than those between blocks and whether the reduction in MSResidual is likely to compensate for the for the loss of residual df.
Split-plot designs can be thought of as a combination of a randomized complete block (RCB) and nested designs. In their simplest form, they represent a RCB design, with a second factor applied to the whole blocks. One factor (Factor C) is applied to experimental units (usually randomly positioned) within each block (now referred to as plots, or Factor B). These factors are the Within-plot factors. A second factor (Between-plots factor, or Factor A) is then applied to the whole plots, and these plots are replicated. Hence a nested design applies between Factor A and the plots, and a blocking design applies between Factor C and the plots. Factor A is also crossed with Factor C, and therefore there is a test of the interaction between the two main factors. Note that if there are no replicate units of Factor C within each plot (that is if each level of Factor C is applied only once to each plot), then the interaction of Factor C by Plots within Factor A represents the overall residual. However, since the plots are usually random factors, the effects of Factor A must be tested against the Plots within Factor A term.If there there are replicate units of Factor C within each plot, then the Factor A by Plots within Factor A by Factor C is the overal residual.
Simple repeated measures designs and similar to Randomized complete block (RCB) designs. In the same way that we could explain some of the residual variation by blocking, we can also reduce the residual variation by removing the between-replicate variation that exists at the start of the experiment. This is achieved by applying each treatments (repeated measures) to the same subjects, rather than using different subjects for each treatment. In this way it is the same as a RCB. Where it differs from a RCB is that each of the treatments cannot be applied to the blocks (subjects) at the same time. For example, it is not possible to determine the effects of different drugs on the heart rate of rats if each of the drugs is applied at the same time! Each of the treatments should be applied in random order, however, when the repeated factor is time, this is not possible. The subjects (blocks or plots) are always a random factor. If a second factor is applied at the level of the whole subject and over each of the repeated measures (e.g. subjects are of one of two species), the analysis then resembles a splitplot design, where this second factor is the between subjects effect. A agricultural scientist was investigating the effects of fiber on the digestibility of pastural grasses by cows. Ten randomly selected cows of two different breeds (five of each) were each fed a random sequence of four food types (of different fiber levels). After a week of feeding on a single food type, the digestibility of the food was assessed. Individual cows represented the subjects. Breed of cow was the between subjects effect and food fiber level was the within subjects effect. Diagram illustrates layout of 2 Factor repeated measures. Rectangles represent subjects. Squares represent the within subject treatments (random sequence represents the order in which each treatment is applied). Hence there are four repeated measures. There are two levels of the between subjects effect (pattern fill of rectangles).
In ANOVA, the assumptions of normality and variance homogeneity apply to the residuals for the test of interest. In this example, the nested factor (Factor B, SITE) is random. Therefore, the normality and heterogeneity of variance assumptions for Factor A (SEASON) must be based on the distribution, mean and variance of the response variable (S4DUGES) for each level of Factor A using the means of Factor B within each Factor A as the observations (since the error term for Factor A (SEASON) is the variation in the response variable between categories of Factor B within each level of Factor A (MSSITE). Calculate the mean from each level of Factor B (SITE), and use these means to calculate the mean, variance and distribution of each level of Factor A (SEASON).
# calculate the means from each level of Factor B (SITE) > means <- tapply(curdies$S4DUGES, curdies$SITE, mean) # calculate the variance form each level of Factor B (SITE) > vars <- tapply(curdies$S4DUGES, curdies$SITE, var) # plot mean against variance > plot(means~vars) # use the means to produce the boxplots > boxplot(means[1:3],means[4:6])
This data set generates two boxplots, each constructed from only 3 values and thus of questionable value.
# fit the heirachical linear model > name.aov <- aov(dv~FactorA + Error(FactorB), data=data) # print out the ANOVA strata > summary(name.aov)
# fit the heirachical linear model > curdies.aov <- aov(S4DUGES~SEASON + Error(SITE), data=curdies) # print out the ANOVA strata > summary(curdies.aov)
> curdies.aov1 <- aov(S4DUGES~SEASON + SITE, data=curdies) > summary(curdies.aov)
> name.aov <- aov(dv~factor1 * factor2, data=data)
> plot(name.lm)
#print a summary of the linear model coefficients > summary(name.lm) #print a summary of the ANOVA table > anova(name.lm)
> anova(name.aov)
The assumptions of normality and homogeneity of variance apply to each of the factor level combinations, since it is the replicate observations of these that are the test residuals. If the design has two factors (IV1 and IV2) with three and four levels (groups) respectively within these factors, then a boxplot of each of the 3x4=12 combinations needs to be constructed. It is recommended that a variable (called GROUP) be setup to represent the combination of the two categorical (factor) variables.
Simply construct a boxplot with the dependent variable on the y-axis and GROUP on the x-axis. Visually assess whether the data from each group is normal (or at least that the groups are not all consistently skewed in the same direction), and whether the spread of data is each group is similar (or at least not related to the mean for that group). The GROUP variable can also assist in calculating the mean and variance for each group, to further assess variance homogeneity.
Interaction plots display the degree of consistency (or lack of) of the effect of one factor across each level of another factor. Interaction plots can be either bar or line graphs, however line graphs are more effective. The x-axis represents the levels of one factor, and a separate line in drawn for each level of the other factor. The following interaction plots represent two factors, A (with levels A1, A2, A3) and B (with levels B1, B2).
The parallel lines of first plot clearly indicate no interaction. The effect of factor A is consistent for both levels of factor B and visa versa. The middle plot demonstrates a moderate interaction and bottom plot illustrates a severe interaction. Clearly, whether or not there is an effect of factor B (e.g. B1 > B2 or B2 > B1) depends on the level of factor A. The trend is not consistent.
Statistical models that incorporate more than one categorical predictor variable are broadly referred to as multivariate analysis of variance. There are two main reasons for the inclusion of multiple factors:
Fully factorial linear models are used when the design incorporates two or more factors (independent, categorical variables) that are crossed with each other. That is, all combinations of the factors are included in the design and every level (group) of each factor occurs in combination with every level of the other factor(s). Furthermore, each combination is replicated.
In fully factorial designs both factors are equally important (potentially), and since all factors are crossed, we can also test whether there is an interaction between the factors (does the effect of one factor depend on the level of the other factor(s)). Graphs above depict a) interaction, b) no interaction.
For example, Quinn (1988) investigated the effects of season (two levels, winter/spring and summer/autumn) and adult density (four levels, 8, 15, 30 and 45 animals per 225cm2) on the number of limpet egg masses. As the design was fully crossed (all four densities were used in each season), he was also able to test for an interaction between season and density. That is, was the effect of density consistent across both seasons and, was the effect of season consistent across all densities.
Diagram shows layout of 2 factor fully crossed design. The two factors (each with two levels) are color (black or gray) and pattern (solid or striped). There are three experimental units (replicates) per treatment combination.
# load the 'multcomp' package > library(multcomp) # perform Tukey's test > summary(simtest(dv~factor, data=data, type="Tukey"))
# load the 'multcomp' package > library(multcomp) # perform Tukey's test > summary(simtest(dv~factor1*factor2, whichf='factor', data=data, type="Tukey"))
> data$var <- as.factor(data$var)
> name.ag <- aggregate(dataset, list(NAME=FactorA, NAME1=FactorB), mean)
> cu.ag <- aggregate(copper, list(COPPER=copper$COPPER, PLATE=copper$PLATE), mean)
> boxplot(name.ag$dv~name.ag$FactorA)
> boxplot(cu.ag$WORMS ~ , cu.ag$COPPER)
# Fit the linear model with the different error strata > name.aov <- aov(dv ~ FactorA + FactorC + FactorA:FactorC + Error(FactorB), data=dataset)) > summary(name.aov)
> copper.aov <- aov(WORMS ~ COPPER + DIST + COPPER:DIST + Error(PLATE), data=copper)) > summary(copper.aov)
# Fit the linear model with the different error strata > name.aov <- aov(dv ~ FactorA + Error(FactorB), data=dataset)) > summary(name.aov)
> tobacco.aov <- aov(NUMBER ~ TREATMENT + Error(LEAF), data=tobacco)) > summary(tobacco.aov)
> text(xs, means+5, lab=c(labels))
> levels(dv) <- c(ordered levels)
# Define the contrast coefficients for factorial variable > name.cont <- list(factor = c(numeric list)) # Define the list of associated labels for the factorial variable > name.list <- list(factor =list('label'=1)) # Fit the linear model using the defined contrasts > name.aov <- aov(dv~factor, data = data, contrasts = name.cont) # Print out the ANOVA table incorporating the define contrasts > summary(name.aov, split = name.list)
# Define the contrast coefficients for factorial variable > partridge.cont <- list(GROUP = c(1,0,0,0,-1)) # Define the list of associated labels for the GROUP factor > partridge.list <- list(GROUP = list("1 vs 5" = 1)) # Fit the linear model using the defined contrasts > partridge.aov <- aov(LONGEVITY~GROUP, contrasts = partridge.cont) # Print out the ANOVA table incorporating the define contrasts > summary(partridge.aov, split = partridge.list)
Following a significant ANOVA result, it is often desirable to specifically compare group means to determine which groups are significantly different. However, multiple comparisons lead to two statistical problems. Firstly, multiple significance tests increase the Type I errors (&alpha, the probability of falsely rejecting H0). E.g., testing for differences between 5 groups requires ten pairwise comparisons. If the &alpha for each test is 0.05 (5%), then the probability of at least one Type I error for the family of 10 tests is 0.4 (40%). Secondly, the outcome of each test needs to be independent (orthogonality). E.g. if A>B and B>C then we already know the result of A vs. C.
Post-hoc unplanned pairwise comparisons (e.g. Tukey's test) compare all possible pairs of group means and are useful in an exploratory fashion to reveal major differences. Tukey's€Â™s test control the family-wise Type I error rate to no more that 0.05. However, this reduces the power of each of the pairwise comparisons, and only very large differences are detected (a consequence that exacerbates with an increasing number of groups).
Planned comparisons are specific comparisons that are usually planned during the design stage of the experiment. No more than (p-1, where p is the number of groups) comparisons can be made, however, each comparison (provided it is non-orthogonal) can be tested at &alpha = 0.05. Amongst all possible pairwise comparisons, specific comparisons are selected, while other meaningless (within the biological context of the investigation) are ignored.
> anova(day.aov)
These are designs where one factor is a blocking variable (Factor B) and the other variable (Factor A) is the main treatment factor of interest. These designs are used in situations where the environment is potentially heterogeneous enough to substantially increase the variability in the response variable and thus obscure the effects of the main factor. Experimental units are grouped into blocks (units of space or time that are likely to have less variable background conditions) and then each level of the treatment factor is applied to a single unit in each block. By grouping the experimental units together into blocks , some of the total variability in the response variable that is not explained by the main factor may be explained by the blocks. If so, the residual (unexplained) variation will be reduced resulting in more precise estimates of parameters and more powerful tests of treatments.
For example, a plant ecologist was investigating the effects of leaf damage on plant chemical defenses. Since individual trees vary greatly in the degree of chemical defense, she applied each of three artificial leaf damage treatments (heavy, medium and light damage) to each tree. She applied one of three treatments to three branches within eight randomly selected trees. Trees were the blocks. The diagram shows the layout of a blocking design. The main factor has three treatments (pattern fills), each of which are applied randomly within 3 randomly selected blocks. The decision to include a blocking factor (rather than single factor, completely random) depends on whether the experimental units (replicates) within a block are likely to be more similar to one another than those between blocks and whether the reduction in MSResidual is likely to compensate for the for the loss of residual df.
# load the 'car' library > library(car) # generate a scatterplot > scatterplot(dv~iv,data=data)
# make sure the car package is loaded > library(car) > scatterplot.matrix(~var1+var2+var3,data=data, diagonal='boxplot')
Start off setting the Number of simulations to 1000. This will collect 1000 random samples from Population 1 and 2.
Assumptions - In simulating the repeated collection of samples from both male and female populations, we make a number of assumptions that have important consequences for the reliability of the statistical tests. Firstly, the function used to simulate the collection random samples of say 8 male fulmars (and 6 female fulmars), generates these samples from normal distributions of a given mean and standard deviation. Thus, each simulated t value and the distribution of t values from multiple runs represents the situation for when the samples are collected from a population that is normally distributed. Likewise, the mathematical t distribution also makes this assumption. Furthermore, the sampling function used the same standard deviation for each collected sample. Hence, each simulated t values and the distribution of t values from multiple runs represents the situation for when the samples are collected from populations that are equally varied. Likewise, the mathematical t distribution also makes this assumption. By altering the variability and degree of normality of the populations, you can see the effects that normality and homogeneity of variance have on how much a distribution of t values differs from the mathematical t-distribution. Violations of either of these two assumptions (normality and homogeneity of variance) have the real potential to compromise the reliability of the statistical conclusions and thus it is important for the data to satisfy these assumptions.
Step 1 - Clearly establish the statistical null hypothesis. Therefore, start off by considering the situation where the null hypothesis is true - e.g. when the two population means are equal
Step 2 - Establish a critical statistical criteria (e.g. alpha = 0.05)
Step 3 - Collect samples consisting of independent, unbiased samples
Step 4 - Assess the assumptions relevant to the statistical hypothesis test. For a t test: 1. Normality 2. Homogeneity of variance
Step 5 - Calculate test statistic appropriate for null hypothesis (e.g. a t value)
Step 6 - Compare observed test statistic to a probability distribution for that test statistic when the null hypothesis is true for the appropriate degrees of freedom (e.g. compare the observed t value to a t distribution).
Step 7 - If the observed test statistic is greater (in magnitude) than the critical value for that test statistic (based on the predefined critical criteria), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis and therefore the null hypothesis is rejected, otherwise we conclude that there is insufficient evidence to reject the null hypothesis. Alternatively, we calculate the probability of obtaining the observed test statistic (or one of greater magnitude) when the null hypothesis is true. If this probability is less than our predefined critical criteria (e.g. 0.05), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis and therefore the null hypothesis is rejected, otherwise we conclude that there is insufficient evidence to reject the null hypothesis.
> t.test(dv~factor,data=data,var.equal=T)
> t.test(dv~factor,data=data,var.equal=F)
# calculate the means > means <- tapply(dv, factor, mean) # calculate the standard deviation > sds <- tapply(dv, factor, sd) # calculate the lengths > ns <- tapply(dv, factor, length) # calculate the standard errors > ses <- sd/sqrt(ns) # plot the bars > xs <- barplot(means,beside=T) # load package containing error bar function > library(Hmisc) # plot the error bars > errbar(xs, means, means+ses, means-ses, add=t)
# make sure that the Hmisc library is loaded > library(Hmisc) # plot the error bars > errbar(xs, means, means+ses, means-ses, add=t)
The following output are based on a simulated data sets; 1.  Pooled variance t-test for populations with equal (or nearly so) variances 2.  Separate variance t-test for population with unequal variances
> t.test(cat1, cat2, data=data, paired=T)
Non-parametric tests do not place any distributional limitations on data sets and are therefore useful when the assumption of normality is violated. There are a number of alternatives to parametric tests, the most common are;
1. Randomization tests - rather than assume that a test statistic calculated from the data follows a specific mathematical distribution (such as a normal distribution), these tests generate their own test statistic distribution by repeatedly re-sampling or re-shuffling the original data and recalculating the test statistic each time. A p value is subsequently calculated as the proportion of random test statistics that are greater than or equal to the test statistic based on the original (un-shuffled) data. 2. Rank-based tests - these tests operate not on the original data, but rather data that has first been ranked (each observation is assigned a ranking, such that the largest observation is given the value of 1, the next highest is 2 and so on). It turns out that the probability distribution of any rank based test statistic for a is identical.
Linear regression analysis explores the linear relationship between a continuous response variable (DV) and a single continuous predictor variable (IV). A line of best fit (one that minimizes the sum of squares deviations between the observed y values and the predicted y values at each x) is fitted to the data to estimate the linear model.
As plot a) shows, a slope of zero, would indicate no relationship – a increase in IV is not associated with a consistent change in DV. Plot b) shows a positive relationship (positive slope) – an increase in IV is associated with an increase in DV. Plot c) shows a negative relationship.
Testing whether the population slope (as estimated from the sample slope) is one way of evaluating the relationship. Regression analysis also determines how much of the total variability in the response variable can be explained by its linear relationship with IV, and how much remains unexplained. The line of best fit (relating DV to IV) takes on the form of y=mx + c Response variable = (slope*predictor   variable) + y-intercept If all points fall exactly on the line, then this model (right-hand part of equation) explains all of the variability in the response variable. That is, all variation in DV can be explained by differences in IV. Usually, however, the model does not explain all of the variability in the response variable (due to natural variation), some is left unexplained (referred to as error). Therefore the linear model takes the form of; Response variable = model + error. A high proportion of variability explained relative to unexplained (error) is suggestive of a relationship between response and predictor variables. For example, a mammalogist was investigating the effects of tooth wear on the amount of time free ranging koalas spend feeding per 24 h. Therefore, the amount of time spent feeding per 24 h was measured for a number of koalas that varied in their degree of tooth wear (ranging from unworn through to worn). Time spent feeding was the response variable and degree of tooth wear was the predictor variable.
Analysis of variance, or ANOVA, partitions the variation in the response (DV) variable into that explained and that unexplained by one of more categorical predictor variables, called factors. The ratio of this partitioning can then be used to test the null hypothesis (H0) that population group or treatment means are equal. A single factor ANOVA investigates the effect of a single factor, with 2 or more groups (levels), on the response variable. Single factor ANOVA's are completely randomized (CR) designs. That is, there is no restriction on the random allocation of experimental or sampling units to factor levels. Single factor ANOVA tests the H0 that there is no difference between the population group means.H0: = µ1 = µ2 = .... = µi = µ If the effect of the ith group is:(&alphai = µi - µ) the this can be written as H0: = &alpha1 = &alpha2 = ... = &alphai = 0 If one or more of the I are different from zero, the hull hypothesis will be rejected.
Keough and Raymondi (1995) investigated the degree to which biofilms (films of diatoms, algal spores, bacteria, and other organic material that develop on hard surfaces) influence the settlement of polychaete worms. They had four categories (levels) of the biofilm treatment: sterile substrata, lab developed biofilms without larvae, lab developed biofilms with larvae (any larvae), field developed biofilms without larvae. Biofilm plates where placed in the field in a completely randomized array. After a week the number of polychaete species settled on each plate was then recorded. The diagram illustrates an example of the spatial layout of a single factor with four treatments (four levels of the treatment factor, each with a different pattern fill) and four experimental units (replicates) for each treatment.
> boxplot(dv~factor,data=data)
> boxplot(dv~factor1*factor2,data=data)
> wilcox.test(dv ~ factor, data=data)
# calculate critical t value (&alpha =0.05) for upper tail (e.g. A larger than B) > qt(0.05,df=df,lower.tail=F) # calculate critical t value (&alpha =0.05) for lower tail (e.g. B larger than A) > qt(0.05,df=df,lower.tail=T)
It actually doesn't matter whether you select Lower tail is TRUE or FALSE. The t-distribution is a symmetrical distribution, centered around 0, therefore, the critical t-value is the same for both Lower tail (e.g. population 2 greater than population 1) and Upper tail (e.g. population 1 greater than population 2), except that the Lower tail is always negative. As it is often less confusing to work with positive values, it is recommended that you use Upper tail values. An example of a t-distribution with Upper tail for a one-tailed test is depicted below. Note that this is not part of the t quantiles output!
# calculate critical t value (&alpha =0.05) for upper tail (e.g. A different to B) > qt(0.05/2,df=df, lower.tail=T)
Again, it actually doesn't matter whether you select Lower tail as either TRUE or FALSE. For a symmetrical distribution, centered around 0, the critical t-value is the same for both Lower tail (e.g. population 2 greater than population 1) and Upper tail (e.g. population 1 greater than population 2), except that the Lower tail is always negative. As it is often less confusing to work with positive values, it is recommended that you use Upper tail values. An example of a t-distribution with Upper tail for a two-tailed test is depicted below. Note that this is not part of the t quantiles output!
# first calculate the means and variances > means <- tapply(data$dv, data$factor, mean) > vars <- tapply(data$dv, data$factor, var) # then plot the mean against variance > plot(means,vars)
# first calculate the means and variances > means <- tapply(data$dv, list(data$factor1, data$factor2), mean) > vars <- tapply(data$dv, list(data$factor1, data$factor2), var) # then plot the mean against variance > plot(means,vars)
Analysis of frequencies is similar to Analysis of Variance (ANOVA) in some ways. Variables contain two or more classes that are defined from either natural categories or from a set of arbitrary class limits in a continuous variable. For example, the classes could be sexes (male and female) or color classes derived by splitting the light scale into a set of wavelength bands. Unlike ANOVA, in which an attribute (e.g. length) is measured for a set number of replicates and the means of different classes (categories) are compared, when analyzing frequencies, the number of replicates (observed) that fall into each of the defined classes are counted and these frequencies are compared to predicted (expected) frequencies.
Analysis of frequencies tests whether a sample of observations came from a population where the observed frequencies match some expected or theoretical frequencies. Analysis of frequencies is based on the chi-squared (X2) statistic, which follows a chi-square distribution (squared values from a standard normal distribution thus long right tail).
When there is only one categorical variable, expected frequencies are calculated from theoretical ratios. When there are more than one categorical variables, the data are arranged in a contingency table that reflects the cross-classification of sampling or experimental units into the classes of the two or more variables. The most common form of contingency table analysis (model I), tests a null hypothesis of independence between the categorical variables and is analogous to the test of an interaction in multifactorial ANOVA. Hence, frequency analysis provides hypothesis tests for solely categorical data. Although, analysis of frequencies provides a way to analyses data in which both the predictor and response variable are both categorical, since variables are not distinguished as either predictor or response in the analysis, establishment of causality is only of importance for interpretation.
The goodness-of-fit test compares observed frequencies of each class within a single categorical variable to the frequencies expected of each of the classes on a theoretical basis. It tests the null hypothesis that the sample came from a population in which the observed frequencies match the expected frequencies.
For example, an ecologist investigating factors that might lead to deviations from a 1:1 offspring sex ratio, hypothesized that when the investment in one sex is considerably greater than in the other, the offspring sex ratio should be biased towards the less costly sex. He studied two species of wasps, one of which had males that were considerably larger (and thus more costly to produce) than females. For each species, he compared the offspring sex ratio to a 1:1 ratio using a goodness-of-fit test.
The Goodness of fit test output will appear in the R Commander output window.
Contingency tables are the cross-classification of two (or more) categorical variables. A 2 x 2 (two way) table takes on the following form:
Contingency tables test the null hypothesis that the data came from a population in which variable 1 is independent of variable 2 and vice-versa. That is, it is a test of independence. For example a plant ecologist examined the effect of heat on seed germination. Contingency test was used to determine whether germination (2 categories - yes or no) was independent of the heat treatment (2 categories heated or not heated).
The Pearson's chi-squared test will appear in the R Commander output window.
The Pearson's chi-squared test and residuals will appear in the R Commander output window.
Multivariate data sets include multiple variables recorded from a number of replicate sampling or experimental units, sometimes referred to as objects. If, for example, the objects are whole organisms or ecological sampling units, the variables could be morphometric measurements or abundances of various species respectively. The variables may be all response variables (MANOVA) or they could be response variables, predictor variables or a combination of both (PCA and MDS).
The aim of multivariate analysis is to reduce the many variables to a smaller number of new derived variables that adequately summarize the original information and can be used for further analysis. In addition, Principal components analysis (PCA) and Multidimensional scaling (MDS) also aim to reveal patterns in the data, especially among objects, that could not be found by analyzing each variable separately.
For two objects (i = 1 and 2) and a number of variables (j = 1 to p).
BC = 1-[(2)(3+6+9)/(18+36)] = 0.33 where (3+6+9) is the lesser abundance of each variable when it occurs in each object.
Bray-Curtis dissimilarity is well suited to species abundance data because it ignores variables that have zeros for both objects, and it reaches a constant maximum value when two objects have no variables in common. However, its value is determined mainly by variables with high values (e.g. species with high abundances) and thus results are biased towards trends displayed in such variables. It ranges from 0 (objects completely similar) to 1 (objects completely dissimilar).
EUC = [(3-6)2+(6-12)2+(9-18)2] = 11.2
Euclidean distance is a very important, general, distance measure in multivariate statistics. It is based on simple geometry as a measure of distance between two objects in multidimensional space. It is only bounded by a zero for two objects with exactly the same values for all variables and has no upper limits, even when two objects have no variables in common with positive values. Since it does not reach a constant maximum value when two objects have no variables in common, it is not ideal for species abundance data.
The matrix of distance values will appear in the R Commander output window.
Multidimensional scaling (MDS), has two aims
MDS starts with a dissimilarity matrix which represents the degree of difference between all objects (such as sites or quadrats) based on all the original variables. MDS then attempts to reproduce these patterns between objects by arranging the objects in multidimensional space with the distances between each pair of objects representing their relative dissimilarity. Each of the dimensions (plot axes) therefore represents a new variable that can be used to describe the relationships between objects. A greater number of new variables (axes) increases the degree to which the original patterns between objects are retained, however less data reduction has occurred. Furthermore, it is difficult to envisage points (objects) arranged in more than three dimensions.
MDS can be broken down into the following steps
A large amount of information will appear in the R Commander output window (most of which you can ignore!.
The first of these plots is a Sheppard plot which represents the relationship between the original distances (y-axis) and the new MDS distances (x-axis). The second plot is the final ordination plot. This represents the arrangement of objects in multidimensional space.