> class(name)
> name <- entry
> IQ <- 10.513
> numbers <- c(1, 4, 6, 7, 4, 345, 36, 78)
> print(numbers)
[1]   1   4   6   7   4 345  36  78
> numbers
[1]   1   4   6   7   4 345  36  78
> name <- factor(c(list of characters/words))
> name <- gl(number of levels, number of replicates, length of data set, labels=c(list of level names))
> sex <- factor(c('Female', 'Female', 'Female', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male'))
OR
> sex <- factor(c(rep('Female',6),rep('Male',6)))
OR
> sex <- gl(2,6,12,labels=c('Female','Male'))
> q()
> source('filename')
Note that there is an alternative way of using scripts
> ls()
> rm(object name)
> name <- read.table('filename.csv', header=T, sep=',', row.names=column)
> phasmid <- read.table('phasmid.csv', header=T, sep=',', row.names=1)
> write.table(data frame name, 'filename.csv', quote=F, sep=',')
> write.table(LEAVES, 'leaves.csv', quote=F, sep=',')
> var <- c(4,8,2,6,9,2)
> var
[1] 4 8 2 6 9 2
> var[1]
[1] 4
> var[5]
[1] 9
> var[3:5]
[1] 2 6 9
> dv <- c(4,8,2,6,9,2)
> iv <- c('a','a','a','b','b','b')
> data <- data.frame(iv,dv)
> data
  iv dv
1  a  4
2  a  8
3  a  2
4  b  6
5  b  9
6  b  2
> data[1] #list first section
  iv
1  a
2  a
3  a
4  b
5  b
6  b
> data[,1] #list contents of first column
[1] a a a b b b
Levels: a b
> data[1,] #list contents of first row
  iv dv
1  a  4
> data[3,2] #list the entry in row 3, column 2
[1] 2
> fix(data frame name)
> new_var <- log(old_var)
> new_var <- log10(old_var)
> new_var <- sqrt(old_var)
> new_var <- asin(sqrt(old_var))
> new_var <- scale(old_var)
> set.seed(number)
> library(package)
Now perform the calculations separately on the original and modified data sets. (To change the active data set, click on the blue data set display next to where it says Data set on the button bar of Rcmdr.)
The assumptions of normality and homogeneity of variance apply to each of the factor level combinations, since it is the replicate observations of these that are the test residuals. If the design has two factors (IV1 and IV2) with three and four levels (groups) respectively within these factors, then a boxplot of each of the 3x4=12 combinations needs to be constructed. It is recommended that a variable (called GROUP) be set up to represent the combination of the two categorical (factor) variables.
Simply construct a boxplot with the dependent variable on the y-axis and GROUP on the x-axis. Visually assess whether the data from each group are normal (or at least that the groups are not all consistently skewed in the same direction), and whether the spread of data in each group is similar (or at least not related to the mean for that group). The GROUP variable can also assist in calculating the mean and variance for each group, to further assess variance homogeneity.
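As a rough sketch of the GROUP approach described above, the following builds a combined factor with interaction() and uses it for a boxplot and for per-group means and variances. The data frame dat, its simulated DV values and the level names are assumptions made purely for illustration.

```r
# Hypothetical two-factor data: IV1 (3 levels) x IV2 (4 levels), 2 replicates each
set.seed(1)
dat <- expand.grid(IV1 = c("a", "b", "c"),
                   IV2 = c("w", "x", "y", "z"),
                   rep = 1:2)
dat$DV <- rnorm(nrow(dat), mean = 10, sd = 2)  # simulated response values

# GROUP represents each IV1 x IV2 combination (3 x 4 = 12 groups)
dat$GROUP <- interaction(dat$IV1, dat$IV2)

# Boxplot of the dependent variable for each combination
boxplot(DV ~ GROUP, data = dat)

# Mean and variance per group, to help judge variance homogeneity
group_means <- tapply(dat$DV, dat$GROUP, mean)
group_vars  <- tapply(dat$DV, dat$GROUP, var)
```

Plotting group_vars against group_means is one way to check whether spread is related to the mean.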
Interaction plots display the degree of consistency (or lack thereof) of the effect of one factor across each level of another factor. Interaction plots can be either bar or line graphs; however, line graphs are more effective. The x-axis represents the levels of one factor, and a separate line is drawn for each level of the other factor. The following interaction plots represent two factors, A (with levels A1, A2, A3) and B (with levels B1, B2).
The parallel lines of the first plot clearly indicate no interaction. The effect of factor A is consistent for both levels of factor B and vice versa. The middle plot demonstrates a moderate interaction and the bottom plot illustrates a severe interaction. Clearly, whether or not there is an effect of factor B (e.g. B1 > B2 or B2 > B1) depends on the level of factor A. The trend is not consistent.
Statistical models that incorporate more than one categorical predictor variable are broadly referred to as multifactor analysis of variance. There are two main reasons for the inclusion of multiple factors:
Fully factorial linear models are used when the design incorporates two or more factors (independent, categorical variables) that are crossed with each other. That is, all combinations of the factors are included in the design and every level (group) of each factor occurs in combination with every level of the other factor(s). Furthermore, each combination is replicated.
In fully factorial designs both factors are equally important (potentially), and since all factors are crossed, we can also test whether there is an interaction between the factors (does the effect of one factor depend on the level of the other factor(s)). Graphs above depict a) interaction, b) no interaction.
For example, Quinn (1988) investigated the effects of season (two levels, winter/spring and summer/autumn) and adult density (four levels, 8, 15, 30 and 45 animals per 225 cm²) on the number of limpet egg masses. As the design was fully crossed (all four densities were used in each season), he was also able to test for an interaction between season and density. That is, was the effect of density consistent across both seasons, and was the effect of season consistent across all densities?
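A fully crossed two-factor design of this shape could be analysed along the following lines. This is only a sketch: the data are simulated (they are not Quinn's), and the names limpets, SEASON, DENSITY and EGGS, as well as the choice of three replicates per combination, are assumptions.

```r
set.seed(42)
# Simulated stand-in data: 2 seasons x 4 densities x 3 replicates
limpets <- expand.grid(SEASON  = c("spring", "summer"),
                       DENSITY = factor(c(8, 15, 30, 45)),
                       rep     = 1:3)
limpets$EGGS <- rnorm(nrow(limpets), mean = 2, sd = 0.4)  # made-up response

# SEASON * DENSITY expands to both main effects plus their interaction
fit <- aov(EGGS ~ SEASON * DENSITY, data = limpets)
summary(fit)

# Interaction plot: one line per season across the density levels
interaction.plot(limpets$DENSITY, limpets$SEASON, limpets$EGGS)
```

Roughly parallel lines in the interaction plot would be consistent with a non-significant SEASON:DENSITY term.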
Diagram shows layout of 2 factor fully crossed design. The two factors (each with two levels) are color (black or gray) and pattern (solid or striped). There are three experimental units (replicates) per treatment combination.
Following a significant ANOVA result, it is often desirable to specifically compare group means to determine which groups are significantly different. However, multiple comparisons lead to two statistical problems. Firstly, multiple significance tests increase the Type I error rate (α, the probability of falsely rejecting H0). E.g., testing for differences between 5 groups requires ten pairwise comparisons. If the α for each test is 0.05 (5%), then the probability of at least one Type I error for the family of 10 tests is 0.4 (40%). Secondly, the outcome of each test needs to be independent (orthogonality). E.g. if A>B and B>C then we already know the result of A vs. C.
Post-hoc unplanned pairwise comparisons (e.g. Tukey's test) compare all possible pairs of group means and are useful in an exploratory fashion to reveal major differences. Tukey's test controls the family-wise Type I error rate to no more than 0.05. However, this reduces the power of each of the pairwise comparisons, and only very large differences are detected (a consequence that is exacerbated by an increasing number of groups).
Planned comparisons are specific comparisons that are usually planned during the design stage of the experiment. No more than p-1 comparisons (where p is the number of groups) can be made; however, each comparison (provided it is orthogonal to the others) can be tested at α = 0.05. Amongst all possible pairwise comparisons, specific comparisons are selected, while others that are meaningless (within the biological context of the investigation) are ignored.
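The 40% figure quoted above can be reproduced with a few lines of arithmetic. The sketch assumes the ten tests are independent, which is the usual simplification behind this calculation; the variable names are illustrative.

```r
alpha  <- 0.05                  # per-test Type I error rate
groups <- 5
n_tests <- choose(groups, 2)    # number of pairwise comparisons among 5 groups

# Probability of at least one Type I error across the family of tests,
# assuming the tests are independent
familywise <- 1 - (1 - alpha)^n_tests
round(familywise, 2)
```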
Start off setting the Number of simulations to 1000. This will collect 1000 random samples from Population 1 and 2.
Assumptions - In simulating the repeated collection of samples from both male and female populations, we make a number of assumptions that have important consequences for the reliability of the statistical tests. Firstly, the function used to simulate the collection of random samples of, say, 8 male fulmars (and 6 female fulmars) generates these samples from normal distributions of a given mean and standard deviation. Thus, each simulated t value, and the distribution of t values from multiple runs, represents the situation in which the samples are collected from a population that is normally distributed. Likewise, the mathematical t distribution also makes this assumption. Furthermore, the sampling function used the same standard deviation for each collected sample. Hence, each simulated t value, and the distribution of t values from multiple runs, represents the situation in which the samples are collected from populations that are equally varied. Likewise, the mathematical t distribution also makes this assumption. By altering the variability and degree of normality of the populations, you can see the effects that normality and homogeneity of variance have on how much a distribution of t values differs from the mathematical t distribution. Violations of either of these two assumptions (normality and homogeneity of variance) have the real potential to compromise the reliability of the statistical conclusions, and thus it is important for the data to satisfy these assumptions.
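The simulation described above can be approximated directly in R. This is a minimal sketch, not the simulator used in the exercise: the population mean (100) and standard deviation (10) are arbitrary assumptions, and both populations are given identical parameters so that the null hypothesis is true.

```r
set.seed(123)
n_sims <- 1000

# Each run draws 8 "males" and 6 "females" from the same normal population
# and records the pooled-variance t value
sim_t <- replicate(n_sims, {
  males   <- rnorm(8, mean = 100, sd = 10)
  females <- rnorm(6, mean = 100, sd = 10)
  t.test(males, females, var.equal = TRUE)$statistic
})

# Compare the simulated t values with the mathematical t distribution (df = 8 + 6 - 2)
crit <- qt(0.975, df = 8 + 6 - 2)
prop_beyond <- mean(abs(sim_t) > crit)   # should be close to 0.05
hist(sim_t, breaks = 30, freq = FALSE)
curve(dt(x, df = 12), add = TRUE)
```

Replacing rnorm() with a skewed generator, or giving the two samples different sd values, shows how violations of the assumptions shift prop_beyond away from 0.05.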
Step 1 - Clearly establish the statistical null hypothesis. Therefore, start off by considering the situation where the null hypothesis is true - e.g. when the two population means are equal.
Step 2 - Establish a critical statistical criterion (e.g. alpha = 0.05)
Step 3 - Collect samples consisting of independent, unbiased observations
Step 4 - Assess the assumptions relevant to the statistical hypothesis test. For a t test: 1. Normality 2. Homogeneity of variance
Step 5 - Calculate test statistic appropriate for null hypothesis (e.g. a t value)
Step 6 - Compare observed test statistic to a probability distribution for that test statistic when the null hypothesis is true for the appropriate degrees of freedom (e.g. compare the observed t value to a t distribution).
Step 7 - If the observed test statistic is greater (in magnitude) than the critical value for that test statistic (based on the predefined critical criteria), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis and therefore the null hypothesis is rejected, otherwise we conclude that there is insufficient evidence to reject the null hypothesis. Alternatively, we calculate the probability of obtaining the observed test statistic (or one of greater magnitude) when the null hypothesis is true. If this probability is less than our predefined critical criteria (e.g. 0.05), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis and therefore the null hypothesis is rejected, otherwise we conclude that there is insufficient evidence to reject the null hypothesis.
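Steps 5 to 7 can be sketched in R as follows, showing that the critical-value route and the probability route lead to the same decision. The two samples are hypothetical measurements invented for illustration.

```r
alpha <- 0.05

# Hypothetical measurements (not from any real study)
males   <- c(52.1, 49.8, 55.3, 51.0, 53.6, 50.4, 54.2, 52.7)
females <- c(47.9, 50.2, 46.5, 49.1, 48.3, 47.0)

# Step 5: calculate the test statistic (pooled-variance t test)
res   <- t.test(males, females, var.equal = TRUE)
t_obs <- unname(res$statistic)
df    <- unname(res$parameter)

# Steps 6 and 7, route 1: compare |t| with the critical value of the t distribution
t_crit <- qt(1 - alpha / 2, df)
reject_by_critical <- abs(t_obs) > t_crit

# Steps 6 and 7, route 2: compare the p value with alpha
reject_by_p <- res$p.value < alpha
```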
The following outputs are based on simulated data sets:
1.  Pooled variance t-test for populations with equal (or nearly so) variances
2.  Separate variance t-test for populations with unequal variances
1.  Select the Statistics menu
2.  Select the Means .. submenu
3.  Select the Paired t-test submenu
Non-parametric tests do not place any distributional limitations on data sets and are therefore useful when the assumption of normality is violated. There are a number of alternatives to parametric tests, the most common of which are:
1. Randomization tests - rather than assume that a test statistic calculated from the data follows a specific mathematical distribution (such as a normal distribution), these tests generate their own test statistic distribution by repeatedly re-sampling or re-shuffling the original data and recalculating the test statistic each time. A p value is subsequently calculated as the proportion of random test statistics that are greater than or equal to the test statistic based on the original (un-shuffled) data.
2. Rank-based tests - these tests operate not on the original data, but rather on data that have first been ranked (each observation is assigned a ranking, such that the largest observation is given the value of 1, the next highest is 2 and so on). It turns out that the probability distribution of any rank-based test statistic for a given sample size is identical.
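Both approaches can be sketched in a few lines of R. The two samples a and b are hypothetical, the difference in means is just one possible choice of test statistic, and 999 shuffles is an arbitrary number. wilcox.test() is base R's rank-based two-sample test.

```r
set.seed(1)
# Hypothetical samples from two groups
a <- c(12.1, 14.3, 11.8, 15.2, 13.7, 14.9)
b <- c(10.2, 11.5,  9.8, 12.4, 10.9, 11.1)

# Randomization test: shuffle the pooled data and recompute the statistic each time
obs    <- mean(a) - mean(b)
pooled <- c(a, b)
n_perm <- 999
perm <- replicate(n_perm, {
  shuffled <- sample(pooled)
  mean(shuffled[1:6]) - mean(shuffled[7:12])
})
# Proportion of shuffled statistics at least as extreme as the observed one
p_value <- mean(abs(perm) >= abs(obs))

# Rank-based alternative (Wilcoxon rank-sum / Mann-Whitney test)
rank_test <- wilcox.test(a, b)
```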
Linear regression analysis explores the linear relationship between a continuous response variable (DV) and a single continuous predictor variable (IV). A line of best fit (one that minimizes the sum of squared deviations between the observed y values and the predicted y values at each x) is fitted to the data to estimate the linear model.
As plot a) shows, a slope of zero would indicate no relationship: an increase in IV is not associated with a consistent change in DV. Plot b) shows a positive relationship (positive slope): an increase in IV is associated with an increase in DV. Plot c) shows a negative relationship.
Testing whether the population slope (as estimated from the sample slope) differs from zero is one way of evaluating the relationship. Regression analysis also determines how much of the total variability in the response variable can be explained by its linear relationship with IV, and how much remains unexplained. The line of best fit (relating DV to IV) takes the form
y = mx + c
Response variable = (slope × predictor variable) + y-intercept
If all points fall exactly on the line, then this model (right-hand part of the equation) explains all of the variability in the response variable. That is, all variation in DV can be explained by differences in IV. Usually, however, the model does not explain all of the variability in the response variable (due to natural variation); some is left unexplained (referred to as error). Therefore the linear model takes the form
Response variable = model + error
A high proportion of variability explained relative to unexplained (error) is suggestive of a relationship between response and predictor variables.
For example, a mammalogist was investigating the effects of tooth wear on the amount of time free-ranging koalas spend feeding per 24 h. Therefore, the amount of time spent feeding per 24 h was measured for a number of koalas that varied in their degree of tooth wear (ranging from unworn through to worn). Time spent feeding was the response variable and degree of tooth wear was the predictor variable.
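A regression of this kind might be fitted in R roughly as follows. The koala data frame, its column names (WEAR, FEEDING) and all of its values are invented for illustration and do not come from the study described.

```r
# Hypothetical data: tooth wear score and hours spent feeding per 24 h
koala <- data.frame(
  WEAR    = c(1, 2, 3, 4, 5, 6, 7, 8),
  FEEDING = c(4.1, 4.6, 4.4, 5.2, 5.5, 5.3, 6.0, 6.4)
)

# Response variable = model + error, with model = intercept + slope * WEAR
fit <- lm(FEEDING ~ WEAR, data = koala)

coef(fit)                       # y-intercept (c) and slope (m)
r_squared <- summary(fit)$r.squared  # proportion of variability explained
anova(fit)                      # partitions variability into explained and unexplained
summary(fit)                    # includes the t test of slope = 0

plot(FEEDING ~ WEAR, data = koala)
abline(fit)                     # line of best fit
```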
Analysis of variance, or ANOVA, partitions the variation in the response (DV) variable into that explained and that unexplained by one or more categorical predictor variables, called factors. The ratio of this partitioning can then be used to test the null hypothesis (H0) that population group or treatment means are equal. A single factor ANOVA investigates the effect of a single factor, with 2 or more groups (levels), on the response variable. Single factor ANOVAs are completely randomized (CR) designs. That is, there is no restriction on the random allocation of experimental or sampling units to factor levels. Single factor ANOVA tests the H0 that there is no difference between the population group means.
H0: µ1 = µ2 = ... = µi = µ
If the effect of the ith group is αi = µi - µ, then this can be written as
H0: α1 = α2 = ... = αi = 0
If one or more of the αi are different from zero, the null hypothesis will be rejected.
Keough and Raymondi (1995) investigated the degree to which biofilms (films of diatoms, algal spores, bacteria, and other organic material that develop on hard surfaces) influence the settlement of polychaete worms. They had four categories (levels) of the biofilm treatment: sterile substrata, lab developed biofilms without larvae, lab developed biofilms with larvae (any larvae), and field developed biofilms without larvae. Biofilm plates were placed in the field in a completely randomized array. After a week, the number of polychaete species settled on each plate was recorded. The diagram illustrates an example of the spatial layout of a single factor with four treatments (four levels of the treatment factor, each with a different pattern fill) and four experimental units (replicates) for each treatment.
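A single factor ANOVA for a design like this, followed by Tukey's test for unplanned pairwise comparisons, could be run along these lines. The biofilm data frame is simulated (it is not the published data), and the names TREAT and SPECIES and the four replicates per treatment are assumptions.

```r
set.seed(7)
# Simulated stand-in data: 4 treatments x 4 replicate plates
biofilm <- data.frame(
  TREAT = factor(rep(c("sterile", "lab_no_larvae", "lab_larvae", "field_no_larvae"),
                     each = 4)),
  SPECIES = rpois(16, lambda = 6)   # made-up counts of settled species
)

# Partition variation in SPECIES into between-treatment and residual components
fit <- aov(SPECIES ~ TREAT, data = biofilm)
summary(fit)

# All pairwise comparisons with family-wise error control
tukey <- TukeyHSD(fit)
tukey
```

With four groups, TukeyHSD() reports choose(4, 2) = 6 pairwise comparisons.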
Analysis of frequencies is similar to Analysis of Variance (ANOVA) in some ways. Variables contain two or more classes that are defined from either natural categories or from a set of arbitrary class limits in a continuous variable. For example, the classes could be sexes (male and female) or color classes derived by splitting the light scale into a set of wavelength bands. Unlike ANOVA, in which an attribute (e.g. length) is measured for a set number of replicates and the means of different classes (categories) are compared, when analyzing frequencies, the number of replicates (observed) that fall into each of the defined classes is counted and these frequencies are compared to predicted (expected) frequencies.
Analysis of frequencies tests whether a sample of observations came from a population where the observed frequencies match some expected or theoretical frequencies. Analysis of frequencies is based on the chi-squared (χ²) statistic, which follows a chi-square distribution (the distribution of squared values from a standard normal distribution, hence its long right tail).
When there is only one categorical variable, expected frequencies are calculated from theoretical ratios. When there is more than one categorical variable, the data are arranged in a contingency table that reflects the cross-classification of sampling or experimental units into the classes of the two or more variables. The most common form of contingency table analysis (model I) tests a null hypothesis of independence between the categorical variables and is analogous to the test of an interaction in multifactorial ANOVA. Hence, frequency analysis provides hypothesis tests for solely categorical data. Although analysis of frequencies provides a way to analyze data in which both the predictor and response variables are categorical, variables are not distinguished as either predictor or response in the analysis, so establishment of causality is only of importance for interpretation.
The goodness-of-fit test compares observed frequencies of each class within a single categorical variable to the frequencies expected of each of the classes on a theoretical basis. It tests the null hypothesis that the sample came from a population in which the observed frequencies match the expected frequencies.
For example, an ecologist investigating factors that might lead to deviations from a 1:1 offspring sex ratio, hypothesized that when the investment in one sex is considerably greater than in the other, the offspring sex ratio should be biased towards the less costly sex. He studied two species of wasps, one of which had males that were considerably larger (and thus more costly to produce) than females. For each species, he compared the offspring sex ratio to a 1:1 ratio using a goodness-of-fit test.
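Outside R Commander, a goodness-of-fit test against a 1:1 ratio can be run with chisq.test(). The offspring counts below are hypothetical and serve only to show the mechanics.

```r
# Hypothetical offspring counts for one wasp species
offspring <- c(Female = 62, Male = 38)

# Goodness-of-fit test against a theoretical 1:1 sex ratio
gof <- chisq.test(offspring, p = c(0.5, 0.5))

gof$expected    # expected frequencies under the 1:1 ratio
gof$statistic   # chi-squared statistic
gof$p.value
```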
The Goodness of fit test output will appear in the R Commander output window.
Contingency tables are the cross-classification of two (or more) categorical variables. A 2 x 2 (two way) table takes on the following form:
Contingency tables test the null hypothesis that the data came from a population in which variable 1 is independent of variable 2 and vice versa. That is, it is a test of independence. For example, a plant ecologist examined the effect of heat on seed germination. A contingency table test was used to determine whether germination (2 categories - yes or no) was independent of the heat treatment (2 categories - heated or not heated).
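A test of independence on a 2 x 2 table can be sketched with chisq.test() as below. The germination counts are hypothetical, and correct = FALSE (no continuity correction) is a choice made here so the statistic matches the plain Pearson formula.

```r
# Hypothetical counts: heat treatment (rows) by germination outcome (columns)
seeds <- matrix(c(30, 10,
                  15, 25),
                nrow = 2, byrow = TRUE,
                dimnames = list(TREATMENT  = c("heated", "not heated"),
                                GERMINATED = c("yes", "no")))

# Pearson's chi-squared test of independence, without continuity correction
test <- chisq.test(seeds, correct = FALSE)

test$expected    # frequencies expected if germination is independent of heating
test$residuals   # Pearson residuals: which cells depart most from expectation
test$statistic
test$p.value
```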
The Pearson's chi-squared test will appear in the R Commander output window.
The Pearson's chi-squared test and residuals will appear in the R Commander output window.
Multivariate data sets include multiple variables recorded from a number of replicate sampling or experimental units, sometimes referred to as objects. If, for example, the objects are whole organisms or ecological sampling units, the variables could be morphometric measurements or abundances of various species respectively. The variables may be all response variables (MANOVA) or they could be response variables, predictor variables or a combination of both (PCA and MDS).
The aim of multivariate analysis is to reduce the many variables to a smaller number of new derived variables that adequately summarize the original information and can be used for further analysis. In addition, Principal components analysis (PCA) and Multidimensional scaling (MDS) also aim to reveal patterns in the data, especially among objects, that could not be found by analyzing each variable separately.
For two objects (i = 1 and 2) and a number of variables (j = 1 to p).
BC = 1-[(2)(3+6+9)/(18+36)] = 0.33 where (3+6+9) is the sum, over all variables, of the lesser of the two objects' abundances, and 18 and 36 are the total abundances of the two objects.
Bray-Curtis dissimilarity is well suited to species abundance data because it ignores variables that have zeros for both objects, and it reaches a constant maximum value when two objects have no variables in common. However, its value is determined mainly by variables with high values (e.g. species with high abundances) and thus results are biased towards trends displayed in such variables. It ranges from 0 (objects completely similar) to 1 (objects completely dissimilar).
EUC = √[(3-6)² + (6-12)² + (9-18)²] = 11.2
Euclidean distance is a very important, general distance measure in multivariate statistics. It is based on simple geometry as a measure of distance between two objects in multidimensional space. It is bounded only by zero, for two objects with exactly the same values for all variables, and has no upper limit, even when two objects have no variables in common with positive values. Since it does not reach a constant maximum value when two objects have no variables in common, it is not ideal for species abundance data.
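Both worked values above can be checked by hand in R. The two abundance vectors are inferred from the numbers in the worked examples (the lesser abundances 3, 6, 9 and the totals 18 and 36), so treat them as an assumption rather than as data given explicitly in the text.

```r
# Abundances of three variables in two objects (inferred from the worked examples)
obj1 <- c(3, 6, 9)
obj2 <- c(6, 12, 18)

# Bray-Curtis dissimilarity: 1 - 2 * (sum of lesser abundances) / (sum of all abundances)
bc <- 1 - 2 * sum(pmin(obj1, obj2)) / (sum(obj1) + sum(obj2))

# Euclidean distance: square root of the summed squared differences
euc <- sqrt(sum((obj1 - obj2)^2))

round(bc, 2)
round(euc, 1)

# dist() from base R gives the same Euclidean distance
euc_dist <- as.numeric(dist(rbind(obj1, obj2)))
```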
The matrix of distance values will appear in the R Commander output window.
Multidimensional scaling (MDS) has two aims
MDS starts with a dissimilarity matrix which represents the degree of difference between all objects (such as sites or quadrats) based on all the original variables. MDS then attempts to reproduce these patterns between objects by arranging the objects in multidimensional space with the distances between each pair of objects representing their relative dissimilarity. Each of the dimensions (plot axes) therefore represents a new variable that can be used to describe the relationships between objects. A greater number of new variables (axes) increases the degree to which the original patterns between objects are retained, however less data reduction has occurred. Furthermore, it is difficult to envisage points (objects) arranged in more than three dimensions.
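The general workflow (dissimilarity matrix, then arrangement in a small number of dimensions) can be sketched with base R's cmdscale(), which performs classical (metric) scaling. This is a swapped-in technique for illustration and may differ from the routine the R Commander menus call. The abundance matrix is simulated, and Euclidean distance is used only because dist() provides it directly.

```r
set.seed(10)
# Simulated abundances: 6 sites (objects) x 4 species (variables)
abund <- matrix(rpois(24, lambda = 5), nrow = 6,
                dimnames = list(paste0("site", 1:6), paste0("sp", 1:4)))

# Dissimilarity matrix between all pairs of objects
d <- dist(abund)

# Arrange the objects in two dimensions (two new derived variables)
ord <- cmdscale(d, k = 2)

# Ordination plot: objects positioned by their relative dissimilarity
plot(ord, type = "n", xlab = "Axis 1", ylab = "Axis 2")
text(ord, labels = rownames(abund))

# Shepard-style check: ordination distances against original distances
plot(as.vector(dist(ord)), as.vector(d),
     xlab = "Ordination distance", ylab = "Original distance")
```

Increasing k retains more of the original pattern at the cost of less data reduction.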
MDS can be broken down into the following steps
A large amount of information will appear in the R Commander output window (most of which you can ignore!).
The first of these plots is a Shepard plot, which represents the relationship between the original distances (y-axis) and the new MDS distances (x-axis). The second plot is the final ordination plot. This represents the arrangement of objects in multidimensional space.