Running R

There are a number of ways to start R.

If R is fully installed locally on the computer

Double Click on the RGui icon on the desktop

Click the START button from the Windows task bar
Click Program Files
Click R2.8
Click RGui

From the CD.

Go to the R directory
Go to the rw2800 subdirectory
Go to the bin subdirectory
Double Click on the RGui.exe file

The RGui window should shortly appear.

End of instructions

Running Rcmdr

To start R Commander:

Select the Packages menu from the RGui application
Select the Load packages.. submenu

The Select one dialog box will appear

Select Rcmdr from the list
Click

The R Commander window should shortly appear. You may be warned that a number of packages are not available and that this may restrict the availability of certain procedures in Rcmdr. You are prompted to download missing packages. Just respond by clicking the button. The packages that are missing are of no importance for BIO3011 and their absence is of no consequence to us.

End of instructions

Frequency histogram

1. Select the Graphs menu..
2. Select the Histogram.. submenu..
The Histogram dialog box will be displayed.
3. Select the variable for which the histogram is to be generated
4. Click

The graph will appear in an R graphics window of the RGui.
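If you prefer to type commands, the menu steps above correspond roughly to a single call to hist(). This is a minimal sketch only; the vector doc and its values are invented for illustration.

```r
# Hypothetical measurements, invented purely for illustration
doc <- c(1.8, 2.4, 2.9, 3.1, 3.3, 3.8, 4.2, 4.6, 5.0, 7.9)

# Draw a frequency histogram; hist() also returns the bin counts invisibly
h <- hist(doc, main = "Frequency histogram", xlab = "DOC")
h$counts
```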

End of instructions

Summary Statistics

Select the Statistics menu..
Select the Summaries.. submenu..
Select the Basic statistics.. submenu..

The Basic Statistics dialog box will be displayed.

Select the response variable for which the statistics/parameters are to be generated from the Variable list

Select the appropriate statistics from the list of available statistics

Click
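The same basic statistics can be obtained from the command line. The sketch below assumes a hypothetical numeric vector called doc (values invented) and uses only base R functions.

```r
# Hypothetical response variable, invented for illustration
doc <- c(1.8, 2.4, 2.9, 3.1, 3.3, 3.8, 4.2, 4.6, 5.0, 7.9)

# Location and spread
doc_mean <- mean(doc)
doc_sd <- sd(doc)
doc_median <- median(doc)

# Quartiles
quantile(doc, probs = c(0.25, 0.5, 0.75))
```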

End of instructions

Selecting subsets of data

1. Select the Data menu..
2. Select the Active data set.. submenu..
3. Select the Subset active data set.. submenu..
The Subset Data Set dialog box will be displayed.

4. To include all the variables, ensure that the Include all variables checkbox is checked
5. In the Subset expression box enter a logical statement that indicates how the data are to be subsetted. For example, to exclude all values of DOC that are greater than 170 (to exclude the outlier) and therefore only include those values that are less than 170, enter DOC < 170. Alternatively, you could choose to exclude the outlier using its STREAM name. To exclude this point enter STREAM != 'Santa Cruz'.
6. In the Enter name for new data set box enter a name for a new data set. Since subsetting works by generating a new data set based on the existing data set, by convention the name that you supply should reflect the data set on which it is based as well as the way in which it is modified (subsetted). In this case a good name might be lovett_minus_SantaCruz.
7. Click

Now perform the calculations separately on the original and modified data sets. (To change the active data set, click on the blue data set display next to where it says Data set on the button bar of Rcmdr.)
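The two subset expressions described above can also be tried directly at the command line with subset(). In this sketch the data frame is a small invented stand-in for the lovett data set: the column names STREAM and DOC follow the text, but the rows and values are hypothetical.

```r
# Hypothetical stand-in for the lovett data set (values invented)
lovett <- data.frame(
  STREAM = c("Santa Cruz", "Stream A", "Stream B", "Stream C"),
  DOC = c(180, 120, 95, 140)
)

# Keep only rows with DOC below 170 (drops the outlier)
lovett_minus_SantaCruz <- subset(lovett, DOC < 170)

# Equivalent here: drop the outlier by its STREAM name
by_name <- subset(lovett, STREAM != "Santa Cruz")

nrow(lovett_minus_SantaCruz)
```

Note that the original lovett data frame is left untouched; subset() returns a new data set, which is why a new name is needed.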

End of instructions

Normal probabilities

Select the Distributions menu..
Select the Continuous distributions.. submenu..
Select the Normal distribution.. submenu..
Select the Normal probabilities.. submenu..
The Normal probabilities dialog box will be displayed.

Enter the value (e.g. the probability will be the area under the curve greater than this value)

Enter the mean and standard deviation of the population/sample

Select the Upper tail option

Click
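The upper tail probability requested in this dialog box can also be calculated with pnorm(). The value, mean and standard deviation below are arbitrary numbers chosen for illustration.

```r
# Hypothetical inputs: value of interest, population mean and standard deviation
value <- 115
mu <- 100
sigma <- 15

# Area under the normal curve ABOVE the value (upper tail)
upper <- pnorm(value, mean = mu, sd = sigma, lower.tail = FALSE)

# Area BELOW the value (lower tail); the two tails sum to 1
lower <- pnorm(value, mean = mu, sd = sigma)
upper
```

Here the value is exactly one standard deviation above the mean, so the upper tail is roughly 0.16.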

End of instructions

Boxplots

End of instructions

EDA - Normal distribution

Parametric statistical hypothesis tests assume that the population measurements follow a specific distribution. Most commonly, statistical hypothesis tests assume that the population measurements are normally distributed (Question 4 highlights the specific reasoning for this).

While it is usually not possible to directly examine the distribution of measurements in the whole population to determine whether this requirement is satisfied (for the same reasons that it is not possible to directly measure population parameters such as the population mean), if the sampling is adequate (unbiased and sufficiently replicated), the distribution of measurements in a sample should reflect the population distribution. That is, a sample of measurements taken from a normally distributed population should itself be approximately normally distributed.

Tests on data that do not meet the distributional requirements may not be reliable, and thus, it is essential that the distribution of data be examined routinely prior to any formal statistical analysis.

End of instructions

EDA - Homogeneity of variances for regression

The assumption of homogeneity of variance is equally important for regression analysis and, in particular, it is the prospect of a relationship between the mean and variance of y-values across x-values that is of the greatest concern. Strictly, the assumption is that the distributions of y-values at each x-value are equally varied and that there is no relationship between mean and variance. However, as we usually only have a single y-value for each x-value, it is difficult to determine whether the assumption of homogeneity of variance is likely to have been violated (the mean of one value is meaningless and variability cannot be assessed from a single value). The figure below depicts the ideal (and almost never realistic) situation in which there are multiple response variable (DV) observations for each of the levels of the predictor variable (IV). Boxplots are included to represent normality and variability.

Inferential statistics is based around repeatability and the likelihood of data re-occurrence. Consider the following scatterplot that has a regression line (linear smoother - line of best fit) fitted through the data (data points symbolized by letters).

Points that are close to the line (points A and B) of best fit (ie those points that are close to the predicted values) are likely to be highly repeatable (a repeat sampling effort is likely to yield a similar value). Conversely, points that are far from the line of best fit (such as points F and G) are likely to take on a larger variety of values.
The distance between a point (observed value) and the line of best fit (predicted value) is called a residual. A residual plot plots each of the residuals against the expected (predicted) y-values. When there is a trend of increasing (or decreasing) spread of data along a regression line, the residuals will form a wedge-shaped pattern. Thus a wedge-shaped pattern in the residual plot suggests that the assumption of homogeneity of variance may have been violated.

Of course, it is also possible to assess the assumption of homogeneity of variance by simply viewing a scatterplot with a linear smoother (linear regression line) and determining whether there is a general increase or decrease in the distance that the values are from the line of best fit.
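A residual plot of the kind described above can be sketched in a few lines. The data here are simulated (not from any real study) so that the spread of y deliberately increases with x, which should produce the wedge shape.

```r
# Simulated data in which the spread of y grows with x (invented for illustration)
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 0.1 * x)

# Fit the line of best fit
fit <- lm(y ~ x)

# Residuals against predicted (fitted) values; a wedge suggests unequal variance
plot(fitted(fit), resid(fit), xlab = "Predicted y", ylab = "Residual")
abline(h = 0, lty = 2)
```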

End of instructions

EDA - Linearity

The most common methods for analysing the relationship or association between variables assume that the variables are linearly related (or more correctly, that they do not covary according to some function other than linear). For example, to examine the presence and strength of the relationship between two variables (such as those depicted below), it is a requirement of the more basic statistical tests that the variables not be related by any function other than a linear (straight line) function, if any function at all.

There is no evidence that the data represented in Figure (a) are related (covary) according to any function other than linear. Likewise, there is no evidence that the data represented in Figure (b) are related (covary) according to any function other than linear (if at all). However, there is strong evidence that the data represented in Figure (c) are not related linearly. Hence Figure (c) depicts evidence for a violation in the statistical necessity of linearity.

End of instructions

Linear regression analysis

1. Select the Statistics menu..
2. Select the Fit models.. submenu..
3. Select the Linear model.. submenu..
The Linear Model dialog box will be displayed.

4. Enter a name to call the resulting output (the default is fine)
5. Double click on the dependent (response) variable from the Variables box. It will be added to the box on the left side of the ~
6. Double click on the independent (predictor) variable from the Variables box. It will be added to the box on the right side of the ~
Together these construct the model formula in the form of DV~IV
7. Click
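The model formula built in the dialog box (DV~IV) is the same formula that lm() accepts at the command line. A minimal sketch, using a small invented data set with columns named DV and IV:

```r
# Hypothetical data set (values invented for illustration)
d <- data.frame(
  IV = 1:5,
  DV = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Response on the left of the ~, predictor on the right
model <- lm(DV ~ IV, data = d)

# Estimated y-intercept and slope
coef(model)

# Fuller output, including the test that the slope equals zero
summary(model)
```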

End of instructions

Model II regression analysis

Select the Statistics menu..
Select the Fit models.. submenu..
Select the Model II.. submenu..

The Model II dialog box will be displayed.

Enter a name to call the resulting output (the default is fine)

Double click on the dependent (response) variable from the Variables box. It will be added to the box on the left side of the ~

Double click on the independent (predictor) variable from the Variables box. It will be added to the box on the right side of the ~

Together these construct the model formula in the form of DV~IV

Select the appropriate regression fitting procedure from amongst the available options

Click

End of instructions

ANOVA

1. Select the Statistics menu..
2. Select the Fit models.. submenu..
3. Select the Linear model.. submenu..
The Linear Model dialog box will be displayed.

4. Enter a name to call the resulting output (the default is fine)
5. Double click on the dependent (response) variable from the Variables box. It will be added to the box on the left side of the ~
6. Double click on the independent (factorial) variable from the Variables box. It will be added to the box on the right side of the ~
Together these construct the model formula in the form of DV~IV
7. Click

A brief summary will appear in the R Commander output window
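For a single-factor ANOVA the formula is again DV~IV, except that IV is now a categorical (factor) variable. The sketch below uses invented data with three groups of four replicates; anova() is one way (an assumption about workflow, not the only route) to display the ANOVA table for the fitted model.

```r
# Hypothetical data: three groups (a, b, c) with four replicates each
d <- data.frame(
  DV = c(4.1, 3.8, 4.5, 4.0, 5.2, 5.6, 5.1, 5.4, 6.8, 7.1, 6.5, 7.0),
  IV = factor(rep(c("a", "b", "c"), each = 4))
)

# Fit the linear model with a factor as the predictor
model <- lm(DV ~ IV, data = d)

# ANOVA table: degrees of freedom, sums of squares, F-ratio and p-value
tab <- anova(model)
tab
```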

End of instructions

Two-factor ANOVA

Select the Statistics menu..
Select the Fit models.. submenu..
Select the Linear model.. submenu..

The Linear Model dialog box will be displayed.

Enter a name to call the resulting output (the default is fine)
Double click on the dependent (response) variable from the Variables box. It will be added to the box on the left side of the ~
Double click on one of the independent (factorial) variables from the Variables box. It will be added to the box on the right side of the ~
Click on the * button. This is a multiplication symbol, and it will be added to the right of the first independent (factorial) variable. Then double click on the second independent (factorial) variable so that it is added after the *.
Together these construct the model formula in the form of DV~IV1*IV2 which R expands into DV~IV1+IV2+IV1:IV2 (that is, the effect of factor IV1, the effect of factor IV2 and the interaction of IV1 and IV2)
Click

A brief summary will appear in the R Commander output window
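You can check how R expands the * shorthand without fitting anything, by asking for the terms of the formula. DV, IV1 and IV2 are just the placeholder names used above, not real variables.

```r
# Ask R which terms the shorthand formula DV ~ IV1 * IV2 contains
expanded <- attr(terms(DV ~ IV1 * IV2), "term.labels")

# Two main effects plus their interaction
expanded
```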

End of instructions

Model I vs Model II Regression

Model I regression (fixed X regression)

The purpose of simple linear regression analysis is to estimate the slope and y-intercept of a straight line through a data set. The line represents the expected (and predicted) average value of Y for each value of X in the population, based on the collected sample. The line (and thus the slope and intercept) is typically the one that minimizes the vertical spread (the sum of squared vertical deviations) of the observations from the line. This process of fitting the line of best fit is called Ordinary Least Squares (OLS), and is depicted in figure OLS Y vs X (top left) below. This fitting process assumes that there is no uncertainty in the predictor (X) variable and therefore that the observed values of X are exactly as they are in the population (there is neither measurement error nor natural variation).

Model I regression is the classical regression analysis and the form of analysis most commonly used. However, it is only a valid method of estimating the model parameters (slope and intercept) when the levels of the predictor variable are not measured, but rather are set by the researcher. For example, if we were interested in the effects of temperature on the growth rate of plant seedlings, we could analyse the data using model I regression provided the different temperatures were specifically set by us (eg. controlling temperatures in green houses or growth cabinets). If on the other hand, the plants had been grown under a range of temperatures that were not specifically set by us and we had simply measured what the temperatures were, model I regression would not be appropriate.

Note that if instead of regressing Y against X we regressed X against Y, the slope of the line of best fit would be different since the line is now the one that minimizes the horizontal spread of the observations from the line (see figure OLS X vs Y - top right). The OLS regressions of Y on X and X on Y represent the two extremes of the population slope estimates as they depict the situations in which all of the variation is in the vertical and horizontal axes respectively. Typically, we might expect some uncertainty in both X and Y variables, and thus the true trend line should lie somewhere between these two extremes (see the bottom right figure below).

Model II regression (random X regression)

Model II regression estimates the line of best fit by incorporating uncertainty in both Y and X variables, and is therefore more suitable for estimating model parameters (slope and intercept) from data in which both variables have been measured. There are at least three forms of model II regression and these differ in the assumptions they place on the relative degree of uncertainty in the Y and X variables.

Major Axis (MA) - this form of regression estimates the line of best fit by minimizing the perpendicular spread of observations from the line (see figure MA - center left). In doing so, it is assumed that the degree of uncertainty in both X and Y variables is the same. This assumption can really only hold when both variables are measured on the same scale or have the same range of values, since variables with higher values will typically have a higher degree of variance. The greater the difference in degrees of uncertainty, the more the MA line of fit will approach either of the OLS lines. Consequently, MA regression is of limited use.
Ranged Major Axis (Ranged MA) - attempts to address the shortcomings of MA regression by first standardizing the variables by their ranges (thus equalizing their degrees of uncertainty), then performing MA regression before back-transforming the slope of the estimated line into the original units. See figure Ranged MA - center right below.
Reduced Major Axis (RMA) - an alternative that estimates the line of best fit by minimizing the (right-angle) triangular areas bounded by the observed values and the line of best fit (see figure RMA - bottom left). Essentially, the line of best fit is an average (strictly, the geometric mean of the slopes) of the OLS Y vs X and OLS X vs Y lines.

The bottom right figure depicts each of the regression lines through a data set. Note the following:

All estimated lines pass through a central point. This point is the mean value of Y and the mean value of X. The different regression fitting procedures essentially just rotate the line of best fit (major axis) through the data set. They differ in the degree to which the major axis is rotated.
The OLS lines form the extreme estimated lines of best fit.
The RMA line lies between the two OLS lines.
Since, for this example, the Y and X variables were measured on substantially different scales (where the degree of uncertainty in X is far greater than the degree of uncertainty in Y), the MA regression fitting procedure is equivalent to OLS X vs Y (it assumes all uncertainty is in X and none in Y)
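The relationships between the fitting procedures can be explored numerically. The sketch below is not the Rcmdr Model II dialog; it simply computes the standard textbook slope formulas for OLS, MA and RMA from the sample variances and covariance of simulated (invented) data. Ranged MA is omitted for brevity.

```r
# Simulated data with a positive relationship (invented for illustration)
set.seed(42)
x <- rnorm(50, mean = 10, sd = 2)
y <- 3 + 1.5 * x + rnorm(50, sd = 2)

sx2 <- var(x)
sy2 <- var(y)
sxy <- cov(x, y)

# OLS of Y on X (minimizes vertical spread)
b_ols_yx <- sxy / sx2
# OLS of X on Y (minimizes horizontal spread), re-expressed as a Y vs X slope
b_ols_xy <- sy2 / sxy
# Reduced major axis: ratio of standard deviations, signed by the covariance
b_rma <- sign(sxy) * sqrt(sy2 / sx2)
# Major axis: slope of the first principal axis
b_ma <- (sy2 - sx2 + sqrt((sy2 - sx2)^2 + 4 * sxy^2)) / (2 * sxy)

slopes <- c(OLS_YX = b_ols_yx, MA = b_ma, RMA = b_rma, OLS_XY = b_ols_xy)

# Every line passes through (mean of X, mean of Y), which fixes the intercepts
intercepts <- mean(y) - slopes * mean(x)
rbind(slope = slopes, intercept = intercepts)
```

With these formulas the two OLS slopes bracket the MA and RMA slopes, and the RMA slope squared equals the product of the two OLS slopes.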

Implications

Given that most instances of regression in biology are actually more suitable for model II regression, why are examples of model II in the literature so rare, and why does most statistical software default to model I (OLS) regression procedures (most packages do not even offer model II procedures)? Is it that biologists are ignorant or slack? It turns out that as far as the main hypothesis of interest is concerned (that the population slope equals zero), it does not matter: RMA and OLS procedures give the same p-values (p-values cannot easily be computed for MA and Ranged MA procedures).

If the purpose of the regression is to generate a model that can be used for the purpose of prediction (predict new values of the dependent variable from new values of the independent values), then only OLS fitted models are valid. The reason for this, is that the new independent value represents a theoretical fixed value with no uncertainty. This new value must be used in a model that assumed no uncertainty in the independent (predictor) variable. The OLS procedure is the only one that minimizes the vertical spread of the observations from the line.

Therefore, if regression analysis is being performed for the purpose of investigating whether there is a relationship between two continuous variables and/or to generate a predictive model, then model I (OLS) regression is perfectly valid. Note, however, that if slopes and intercepts are reported, it is probably worth presenting both model I and model II slopes.

If on the other hand the nature of the relationship is important, and therefore it is important to get estimates of the slope and y-intercept that best represent the population relationship, then it is necessary to make the model II corrections. Good examples of where this is important are:

size scaling applications - for example modeling the relationship between body size and metabolic rate - as it is usually only the slope (scaling factor) that is of interest
when comparing relationship slopes across studies

End of instructions

Copy and Paste text

Firstly, prepare the Rcommander output
1. Add the word 'Source' in the top left hand corner of the first row of the ANOVA table.
2. Add commas after each entry in the table (except the last entry in each row) and remove any spaces between entries. This step defines the columns of the table, so it is important that they are placed in the correct positions. Note that 'Sums' and 'Sq' are not separate entries, the entry is 'Sums Sq'.
3. Select (highlight) the table, click the right mouse button and select cut from the popup menu
Switch focus to the program you want to paste the text into (e.g. Microsoft Word)
Select the Edit menu. (in Word)
Select the Paste.. submenu. (in Word)
To format the table in Word

Highlight the table in Word
Select the Table menu. (in Word)
Select the Convert submenu. (in Word)
Select the Text to Table submenu. (in Word)

The Convert Text to Table dialog box will appear

Ensure that Commas is selected as the text separator
Click the button

The Table AutoFormat dialog box will appear

Select the Table Classic 1 Table style from the Tables styles list
Unselect the First column, Last row and Last column checkboxes
Click the button

Click the button
Neaten the table by altering the column spacings

Finally, add a meaningful caption above the table and alter the name of the categorical variable so that it is more meaningful to the reader

End of instructions

Fully factorial ANOVA assumption checking

The assumptions of normality and homogeneity of variance apply to each of the factor level combinations, since it is the replicate observations of these that are the test residuals. If the design has two factors (IV1 and IV2) with three and four levels (groups) respectively within these factors, then a boxplot of each of the 3x4=12 combinations needs to be constructed. It is recommended that a variable (called GROUP) be setup to represent the combination of the two categorical (factor) variables.

Simply construct a boxplot with the dependent variable on the y-axis and GROUP on the x-axis. Visually assess whether the data from each group are normal (or at least that the groups are not all consistently skewed in the same direction), and whether the spread of data in each group is similar (or at least not related to the mean for that group). The GROUP variable can also assist in calculating the mean and variance for each group, to further assess variance homogeneity.
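One way to build such a GROUP variable at the command line is with interaction(), which combines the levels of two factors. This is a sketch under stated assumptions: the data frame, the level names and the response values are all invented, with three levels of IV1, four levels of IV2 and three replicates per combination.

```r
# Hypothetical 3 x 4 design with three replicates per combination
d <- expand.grid(
  IV1 = c("a1", "a2", "a3"),
  IV2 = c("b1", "b2", "b3", "b4"),
  rep = 1:3
)
set.seed(1)
d$DV <- rnorm(nrow(d), mean = 10)

# One level of GROUP for each of the 3 x 4 = 12 factor combinations
d$GROUP <- interaction(d$IV1, d$IV2)

# Boxplot of the response for each combination
boxplot(DV ~ GROUP, data = d, las = 2)

# Mean and variance per combination, to check for a mean-variance relationship
group_means <- tapply(d$DV, d$GROUP, mean)
group_vars <- tapply(d$DV, d$GROUP, var)
plot(group_means, group_vars)
```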

End of instructions

Interaction plot

Interaction plots display the degree of consistency (or lack thereof) of the effect of one factor across each level of another factor. Interaction plots can be either bar or line graphs; however, line graphs are more effective. The x-axis represents the levels of one factor, and a separate line is drawn for each level of the other factor. The following interaction plots represent two factors, A (with levels A1, A2, A3) and B (with levels B1, B2).

The parallel lines of the first plot clearly indicate no interaction. The effect of factor A is consistent for both levels of factor B and vice versa. The middle plot demonstrates a moderate interaction and the bottom plot illustrates a severe interaction. Clearly, whether or not there is an effect of factor B (e.g. B1 > B2 or B2 > B1) depends on the level of factor A. The trend is not consistent.
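A line-style interaction plot can be drawn with the base R function interaction.plot(). The cell means below are invented so that the lines cross, mimicking a severe interaction between factors A and B.

```r
# Hypothetical cell means for factors A (A1-A3) and B (B1, B2)
d <- expand.grid(A = c("A1", "A2", "A3"), B = c("B1", "B2"))
d$y <- c(10, 12, 14,   # B1 at A1, A2, A3
         11, 15, 9)    # B2 at A1, A2, A3

# Levels of A on the x-axis, one line per level of B
interaction.plot(x.factor = d$A, trace.factor = d$B, response = d$y,
                 xlab = "Factor A", ylab = "Mean response", trace.label = "Factor B")

# Difference between B2 and B1 at each level of A; unequal values mean non-parallel lines
cell <- tapply(d$y, list(d$A, d$B), mean)
b_effect <- cell[, "B2"] - cell[, "B1"]
b_effect
```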

End of instructions

Non-linear modelling

Select the Statistics menu..
Select the Fit models submenu..
Select the Nonlinear model.. submenu..

The Non-Linear Model dialog box will appear

Enter a name to call the resulting output (the default is fine)
Double click on the dependent (response) variable from the Variables box. It will be added to the box on the left side of the ~
Construct the non-linear model on the right side of the ~. The picture above illustrates a power model. The parameters to be estimated should be denoted by letters that do not otherwise represent variables in your data set.

Multiplication is represented by a *
^ is used to define a power

Non-linear modeling works by iteratively altering the values of the unknown parameters so as to achieve the best possible fit of the model to your data. You must however give the modeling process a starting estimate of the parameters. Technically, the closer your estimates are to the best fit estimates, the quicker the process will calculate, however, for most simple models computers are sufficiently powerful to achieve an instantaneous fit irrespective of the estimates you provide. Nevertheless, you must provide starting estimates for these parameters. For most models, a value of 1 is sufficient. If there are multiple parameters, these should be separated by a comma as shown in the picture above.
Click the

The results of the non-linear modeling will appear in the R Commander output window. The output will include the general formula of the model along with estimates of each of the parameters. Each line represents a specific hypothesis test and includes the parameter estimate, precision of the estimate, t-value and associated p-value.
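The same kind of power model can be fitted at the command line with nls(). This is a sketch only: the data are simulated from a known power curve with a little noise, the parameter names a and b are arbitrary, and both starting estimates are set to 1 as suggested above.

```r
# Simulated data following y = 2 * x^1.5 plus noise (invented for illustration)
set.seed(1)
x <- seq(1, 5, by = 0.25)
y <- 2 * x^1.5 + rnorm(length(x), sd = 0.2)

# Power model: * is multiplication, ^ is 'to the power of'
# a and b are the unknown parameters; start gives their starting estimates
fit <- nls(y ~ a * x^b, start = list(a = 1, b = 1))

# Parameter estimates with standard errors, t-values and p-values
summary(fit)
coef(fit)
```

If the fitting process fails to converge for your own data, starting estimates closer to plausible values usually help.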

End of instructions

Fitting Polynomial regression models

Select the Statistics menu..
Select the Fit models submenu..
Select the Linear model.. submenu..

The Linear Model dialog box will be displayed.

Enter a name to call the resulting output (the default is fine)
Double click on the dependent (response) variable from the Variables box. It will be added to the box on the left side of the ~
Double click on the independent (predictor) variable from the Variables box. It will be added to the box on the right side of the ~. So far we have defined a simple linear model (first order polynomial)
To define a second order polynomial (quadratic), add to the right hand side of the ~ (in this order):
- a plus (+) sign
- the independent variable from the Variables box
- a power or hat (^) sign - computer notation for 'to the power of'
- a 2 - to signify that the term is the independent variable to the power of 2
- finally, enclose the independent variable, hat and 2 with an I and brackets - see picture above. I() is a function that essentially preserves the polynomial term
To construct higher order polynomials, simply repeat the steps above with the powers 3, 4, ...
Click the

A summary of the model (including the model formula and estimated model parameters/coefficients) will appear in the Rcmdr Output Window.
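The role of I() can be seen by fitting a quadratic at the command line. The data here are simulated from a known quadratic (invented for illustration); the variable names x and y are placeholders.

```r
# Simulated quadratic relationship with a little noise
set.seed(1)
x <- 1:10
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(10, sd = 0.5)

# Second order polynomial: I() protects x^2 so it is treated as arithmetic
quad <- lm(y ~ x + I(x^2))
coef(quad)

# Without I(), the ^ is interpreted as formula syntax and the squared term is lost
no_I <- attr(terms(y ~ x + x^2), "term.labels")
no_I
```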

End of instructions

Plotting trendlines

Select the Graphs menu..
Select the Trendlines.. submenu..

The Add Trendlines dialog box will appear

There are two ways to define the type of trendline to add

To construct a trendline from a fitted model, select a model from the models list box
To define a generic trendline from the variables, select the predictor (independent) and response (dependent) variables from the x-variable and y-variable boxes respectively and select the type of regression model from the options.

For polynomials, indicate the order of the polynomial. A first order (1) polynomial is a straight line.

A legend can be included on the plot. Select a location of the legend from the options
The legend can include the Equation of the line, r-squared value (R2) and adjusted r-squared value (Adj. R2)
The color of the line and legend text can be altered by clicking on the button
If you don't want a legend, deselect Equation, R2 and Adj R2
Click the

The trendline and legend will be added to the current plot in the R Graphics window.

End of instructions

Two factor ANOVA

Statistical models that incorporate more than one categorical predictor variable are broadly referred to as multifactor analysis of variance. There are two main reasons for the inclusion of multiple factors:

To attempt to reduce the unexplained (or residual) variation in the response variable and therefore increase the sensitivity of the main effect test.
To examine the interaction between factors. That is, whether the effect of one factor on the response variable depends on the level of the other factor(s), rather than being consistent across all levels of the other factor(s).

Fully factorial linear models are used when the design incorporates two or more factors (independent, categorical variables) that are crossed with each other. That is, all combinations of the factors are included in the design and every level (group) of each factor occurs in combination with every level of the other factor(s). Furthermore, each combination is replicated.

In fully factorial designs both factors are equally important (potentially), and since all factors are crossed, we can also test whether there is an interaction between the factors (does the effect of one factor depend on the level of the other factor(s)). Graphs above depict a) interaction, b) no interaction.

For example, Quinn (1988) investigated the effects of season (two levels, winter/spring and summer/autumn) and adult density (four levels, 8, 15, 30 and 45 animals per 225cm²) on the number of limpet egg masses. As the design was fully crossed (all four densities were used in each season), he was also able to test for an interaction between season and density. That is, was the effect of density consistent across both seasons and, was the effect of season consistent across all densities.

Diagram shows layout of 2 factor fully crossed design. The two factors (each with two levels) are color (black or gray) and pattern (solid or striped). There are three experimental units (replicates) per treatment combination.

End of instructions

Simple Main Effects

Before proceeding make sure you have already computed the global (multifactor ANOVA) and remember what you called the resulting model output.

Select the Statistics menu..
Select the Fit models.. submenu..
Select the Linear model.. submenu..

The Linear Model dialog box will be displayed.

Enter a name to call the resulting output (the default is fine)

Double click on the dependent (response) variable from the Variables box. It will be added to the box on the left side of the ~

Double click on the independent (categorical) variable that you are primarily interested in from the Variables box. It will be added to the box on the right side of the ~
Together these construct the model formula in the form of DV~IV
In the Subset expression box, type the name of the other categorical variable followed by == (two equals signs) and the name of one of the levels of this factor (surrounded by single quotation marks). For example, if you had a categorical variable called TEMPERATURE that consisted of three levels (High, Medium and Low), to perform an ANOVA using only the High temperature data, the subset expression would be TEMPERATURE=='High'.
Click

This will have performed the simple main effects ANOVA. To view the ANOVA table with the correct residual term (i.e. that from the original global anova):

Select the Models menu..
Select the Hypothesis tests.. submenu..
Select the ANOVA table.. submenu..

The Anova table dialog box will appear

Select the model that corresponds to the simple main effect from the Models list
Select the Global model from the Error terms list. This is the model that contains the correct residual term that should be used for testing the simple main effects
Click

The ANOVA table for the simple main effect hypothesis test will appear in the R Commander output window.
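The logic of testing a simple main effect against the global residual term can be sketched by hand in base R. This is an illustration of the calculation, not what the Rcmdr dialog runs internally; the data are simulated and the factor name SPECIES is hypothetical (TEMPERATURE follows the example above).

```r
# Hypothetical 3 x 3 design with four replicates per combination
set.seed(1)
d <- expand.grid(
  SPECIES = c("s1", "s2", "s3"),
  TEMPERATURE = c("High", "Medium", "Low"),
  rep = 1:4
)
d$DV <- rnorm(nrow(d), mean = 10)

# Global (two-factor) model and the simple main effect model for High only
global <- lm(DV ~ SPECIES * TEMPERATURE, data = d)
sme <- lm(DV ~ SPECIES, data = d, subset = TEMPERATURE == "High")

# Effect mean square from the subset model, residual mean square from the global model
ms_effect <- anova(sme)["SPECIES", "Mean Sq"]
df_effect <- anova(sme)["SPECIES", "Df"]
ms_resid <- anova(global)["Residuals", "Mean Sq"]
df_resid <- anova(global)["Residuals", "Df"]

# F-ratio and p-value using the global residual term
F_value <- ms_effect / ms_resid
p_value <- pf(F_value, df_effect, df_resid, lower.tail = FALSE)
c(F = F_value, p = p_value)
```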

End of instructions

Tukey's Test

From within the bar graph dialog box, enter a series of symbols that are to appear above the error bars of each bar. The following formatting rules are important:

Each symbol should be surrounded by a set of quotation marks (e.g. 'A')
Symbols should be listed with commas separating each symbol (e.g. 'A','B','A')
There should be as many symbols as there are bars on the graph. If fewer symbols are required than there are bars, then blank symbols (' ') are used for those bars not requiring a symbol
The following are all valid symbol definitions for a factor with 4 groups

'A','A','B','B'

'A', ' ', ' ','B'

' ', ' ', '*', '**'

Define the series of symbols such that common symbols signify non-significant comparisons and differences between symbols signify significant differences.
Click the button

In the following graph the mean of group a was found to be significantly different from the means of both groups b and c whilst the means of groups b and c were not found to be significantly different from one another.

End of instructions

Reordering factor levels

Select the Data menu
Select the Manage variables in active data set.. submenu
Select the Reorder factor levels.. submenu
Select the factorial variable whose levels you wish to reorder
Click and accept that it is OK to overwrite the existing variable

The Reorder Factor Levels dialog box will appear

The dialog box lists the old order of the factor levels and prompts you to specify the new order. Edit the numbers in the New order boxes such that 1 signifies the factor level to be first in the order, 2 to be second and so on.
Click

Note that the data as viewed or edited using the View data set or Edit data set buttons respectively will not appear to have changed at all, but internally, R knows how to order the factor levels during analysis and plotting.
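The same reordering can be done at the command line by passing the desired order to factor(). The variable name temp and its values are invented for illustration.

```r
# Hypothetical factor; levels are alphabetical by default
temp <- factor(c("High", "Low", "Medium", "Low", "High"))
old_levels <- levels(temp)

# Specify the new order explicitly
temp <- factor(temp, levels = c("Low", "Medium", "High"))
new_levels <- levels(temp)

# The data values themselves are unchanged; only the level order differs
as.character(temp)
```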

End of instructions

Defining contrasts

Select the Data menu
Select the Manage variables in active data set submenu
Select the Define contrasts for a factor.. submenu

The Set Contrasts for Factor dialog box will appear

Select the factorial variable and the Other (specify) option

The Specify Contrasts dialog box will appear

It is possible to define up to n-1 planned comparisons (where n is the number of levels of the factorial variable). Note, it is not necessary to define the maximum number of comparisons.
Provide a name for the comparison. For example, if the first comparison is to compare level a with level b, perhaps make the label "a vs b"
For each planned comparison that you are defining, enter the contrast coefficients corresponding to each level of the factorial variable. For example, if the factor levels are high, medium and low and you wish to compare high and low, give high a 1, medium a 0 and low a -1
Click

Set Contrasts dialog box

An ANOVA table (incorporating the specific planned comparisons) will appear in the R Commander output window.
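Contrast coefficients can also be assigned at the command line with contrasts(). The sketch below assumes a hypothetical factor TEMP with levels high, medium and low (data simulated) and defines the single planned comparison "high vs low" described above; R fills in the remaining contrast itself.

```r
# Hypothetical data: three temperature levels with four replicates each
set.seed(1)
d <- data.frame(
  TEMP = factor(rep(c("high", "medium", "low"), each = 4),
                levels = c("high", "medium", "low")),
  DV = c(rnorm(4, mean = 12), rnorm(4, mean = 10), rnorm(4, mean = 9))
)

# One planned comparison: high = 1, medium = 0, low = -1
contrasts(d$TEMP) <- cbind("high vs low" = c(1, 0, -1))
contrasts(d$TEMP)

# ANOVA table with the factor's sum of squares split out for the planned comparison
fit <- aov(DV ~ TEMP, data = d)
summary(fit, split = list(TEMP = list("high vs low" = 1)))
```

The order of the coefficients must match the order of the factor levels, which is why the levels are set explicitly when the factor is created.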

End of instructions

Planned Comparisons

Having already performed the ANOVA..

Select the Models menu
Select the Hypothesis tests.. submenu
Select the Planned Comparisons.. submenu
Select the factorial variable and a contrast matrix will appear. It is possible to define up to n-1 planned comparisons (where n is the number of levels of the factorial variable). Note, it is not necessary to define the maximum number of comparisons.
Provide a name for the comparison. For example, if the first comparison is to compare level a with level b, perhaps make the label "a vs b".
For each planned comparison that you are defining, enter the contrast coefficients corresponding to each level of the factorial variable. For example, if the factor levels are high, medium and low and you wish to compare high and low, give high a 1, medium a 0 and low a -1.
Click and accept that it is OK to overwrite the existing variable

An ANOVA table (incorporating the specific planned comparisons) will appear in the R Commander output window.

End of instructions

Specific comparisons of means

Following a significant ANOVA result, it is often desirable to specifically compare group means to determine which groups are significantly different. However, multiple comparisons lead to two statistical problems. Firstly, multiple significance tests increase the Type I error rate (α, the probability of falsely rejecting H₀). For example, testing for differences between 5 groups requires ten pairwise comparisons. If the α for each test is 0.05 (5%), then the probability of at least one Type I error for the family of 10 tests is approximately 0.4 (40%). Secondly, the outcome of each test needs to be independent (orthogonality). For example, if A>B and B>C then we already know the result of A vs. C.
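The 40% figure can be checked with a couple of lines of R. This sketch assumes, for simplicity, that the ten tests are independent.

```r
groups <- 5
alpha <- 0.05

# Number of pairwise comparisons among the groups.
n_tests <- choose(groups, 2)

# Probability of at least one Type I error across the family,
# assuming (for simplicity) that the tests are independent.
familywise <- 1 - (1 - alpha)^n_tests

n_tests
round(familywise, 2)
```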

Post-hoc unplanned pairwise comparisons (e.g. Tukey's test) compare all possible pairs of group means and are useful in an exploratory fashion to reveal major differences. Tukey's test controls the family-wise Type I error rate to no more than 0.05. However, this reduces the power of each of the pairwise comparisons, and only very large differences are detected (a problem that worsens as the number of groups increases).

Planned comparisons are specific comparisons that are usually planned during the design stage of the experiment. No more than p-1 comparisons (where p is the number of groups) can be made; however, each comparison (provided it is orthogonal to the others) can be tested at α = 0.05. Amongst all possible pairwise comparisons, specific comparisons are selected, while others that are meaningless (within the biological context of the investigation) are ignored.

Planned comparisons are defined using what are known as contrast coefficients. These coefficients are a set of numbers that indicate which groups are to be compared, and what the relative contribution of each group is to the comparison. For example, if a factor has four groups (A, B, C and D) and you wish to specifically compare groups A and B, the contrast coefficients for this comparison would be 1, -1, 0 and 0 respectively. This indicates that groups A and B are to be compared (since their signs oppose) and that groups C and D are omitted from the specific comparison (their values are 0).

It is also possible to compare the average of two or more groups with the average of other groups. For example, to compare the average of groups A and B with the average of group C, the contrast coefficients would be 1, 1, -2 and 0 respectively. Note that 0.5, 0.5, -1 and 0 would be equivalent.

The following points relate to the specification of contrast coefficients:

The sum of the coefficients should equal 0
Groups to be compared should have opposing signs
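These rules are easy to check at the console. The sketch below uses the two example contrasts for groups A, B, C and D; the orthogonality check (sum of products equal to zero) assumes equal group sizes.

```r
# Contrast coefficients for groups A, B, C and D.
a_vs_b  <- c(1, -1,  0, 0)
ab_vs_c <- c(1,  1, -2, 0)

# Each set of coefficients should sum to zero.
sum(a_vs_b)
sum(ab_vs_c)

# Two contrasts are orthogonal (independent) when the sum of the
# products of their coefficients is zero (assuming equal group sizes).
sum(a_vs_b * ab_vs_c)

# Rescaling a contrast does not change the comparison being made.
half <- ab_vs_c / 2
half
```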

End of instructions

Saving graphs

1. With the R Graphics Window active, Select the File menu
2. Select the save as submenu
3. Select the Jpeg submenu
4. Select the 100% quality... submenu

The Jpeg file dialog box will be displayed.
5. Provide a filename and path for the picture
6. Click

End of instructions

Importing graphs

1. Open up the other application (such as Word), or if it is already open, switch control (focus) to the other application using either Alt-tab or the Windows navigation buttons
2. Select the Insert menu. (in Word)
3. Select the Picture submenu. (in Word)
4. Select the From file submenu. (in Word)

The Insert Picture dialog box will be displayed.
5. Locate the picture file and click .
6. Done! Just add a caption.

End of instructions

ANOVA output with planned contrasts

Select the Models menu..
Select the Hypothesis tests.. submenu..
Select the ANOVA table.. submenu..

The Anova table dialog box will appear

Select the model to be tabulated from the Models list
Click the Split ANOVA table check button. A table listing the factor(s) in the model and the contrast names that were defined when the contrasts for the factor(s) were defined will be listed. Note that there will be n-1 (where n is the number of groups) defined comparisons.
Delete the text in the table for the comparisons that you are not interested in. The text for required comparisons can be modified if necessary

Click

The ANOVA table will appear in the R Commander output window. The table will be further split according to the defined planned comparisons (contrasts)
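A console sketch of a split ANOVA table is given below. It relies on the split argument of summary() for aov fits; the data frame, factor name and contrast labels are hypothetical, and this is not necessarily the exact command R Commander issues.

```r
set.seed(1)
# Hypothetical data: three groups of five observations each.
dat <- data.frame(
  treat = factor(rep(c("a", "b", "c"), each = 5)),
  y     = rnorm(15, mean = rep(c(10, 12, 15), each = 5))
)

# Two planned comparisons, stored as columns of the contrast matrix.
contrasts(dat$treat) <- cbind("a vs b"  = c(1, -1, 0),
                              "ab vs c" = c(1, 1, -2))

fit <- aov(y ~ treat, data = dat)

# Split the treat row of the ANOVA table by contrast column.
tab <- summary(fit, split = list(treat = list("a vs b" = 1, "ab vs c" = 2)))
tab
```

Each contrast appears as its own row beneath the factor, with a single degree of freedom.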

End of instructions

Sample statistics

1. Select the Statistics menu..
2. Select the Summaries.. submenu..
3. Select the Numerical Summaries.. submenu..
The Numerical Summaries dialog box will be displayed.
4. Select the variable for which the summary statistic(s) are to be generated
5. Select the appropriate statistic(s)
6. Click

End of instructions

Observations, variables & populations

Observations are the sampling units (e.g. quadrats) or experimental units (e.g. individual organisms, aquaria) that make up the sample.

Variables are the actual properties measured on the individual observations (e.g. length, number of individuals, rate, pH, etc). Random variables (those variables whose values are not known for sure before the onset of sampling) can be either continuous (can theoretically be any value within a range, e.g. length, weight, pH, etc) or categorical (can only take on certain discrete values, such as counts - number of organisms etc).

Populations are defined as all the possible observations that we are interested in.

A sample represents a collected subset of the population's observations and is used to represent the entire population. Sample statistics are the characteristics of the sample (e.g. sample mean) and are used to estimate population parameters.

End of instructions

Standard error and precision

A good indication of how good a single estimate is likely to be is how precise the measure is. Precision is a measure of how repeatable an outcome is. If we could repeat the sampling regime multiple times and each time calculate the sample mean, then we could examine how similar each of the sample means are. So a measure of precision is the degree of variability between the individual sample means from repeated sampling events.

Sample number	Sample mean
1	12.1
2	12.7
3	12.5
Mean of sample means	12.433
SD of sample means	0.306

The table above lists three sample means and also illustrates a number of important points;

Each sample yields a different sample mean
The mean of the sample means should be the best estimate of the true population mean
The more similar (consistent) the sample means are, the more precise any single estimate of the population mean is likely to be

The standard deviation of the sample means from repeated sampling is called the Standard error of the mean.

It is impractical to repeat the sampling effort multiple times. However, it is possible to estimate the standard error (and therefore the precision of any individual sample mean) using the standard deviation (SD) of a single sample and the size (n) of this single sample.

The smaller the standard error (SE) of the mean, the more precise (and possibly more reliable) the estimate of the mean is likely to be.
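The figures in the table, and the usual single-sample estimate SE = SD / √n, can be reproduced as follows. The eight measurements in one_sample are invented for illustration.

```r
# Sample means from the table above.
sample_means <- c(12.1, 12.7, 12.5)
mean(sample_means)  # mean of sample means
sd(sample_means)    # SD of sample means (the standard error)

# Estimating the standard error from a single sample as SD / sqrt(n).
standard_error <- function(x) sd(x) / sqrt(length(x))

# Hypothetical single sample of eight measurements.
one_sample <- c(11.8, 12.9, 12.2, 13.1, 12.4, 11.9, 12.6, 12.7)
standard_error(one_sample)
```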

End of instructions

Confidence intervals

A 95% confidence interval is an interval that we are 95% confident will contain the true population mean. In other words, there is less than a 5% chance that the interval will not contain the true population mean. The frequentist approach to statistics dictates that if multiple samples are collected from a population and the interval is calculated for each sample, 95% of these intervals will contain the true population mean and 5% will not. In this sense, there is a 95% probability that the interval from any single sample will contain the population mean.

The interval is expressed as the mean ± half the interval. The confidence interval is obviously affected by the degree of confidence. In order to have a higher degree of confidence that an interval is likely to contain the population mean, the interval would need to be larger.
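A minimal sketch of the calculation, assuming a t-based interval (mean ± t × SE) and a made-up sample of ten measurements:

```r
# Hypothetical sample of ten measurements.
x <- c(12.1, 12.7, 12.5, 11.9, 13.0, 12.2, 12.8, 12.4, 12.6, 12.3)

# Confidence interval for the mean: mean +/- t * SE.
conf_int <- function(x, level = 0.95) {
  n    <- length(x)
  se   <- sd(x) / sqrt(n)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * se
  c(lower = mean(x) - half, upper = mean(x) + half)
}

ci95 <- conf_int(x, 0.95)
ci99 <- conf_int(x, 0.99)
ci95
ci99  # wider than the 95% interval
```

The helper conf_int() is written here for illustration only; t.test(x)$conf.int returns the same 95% interval.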

End of instructions

Data transformations

Essentially transforming data is the process of converting the scale in which the observations were measured into another scale.

I will demonstrate the principles of data transformation with two simple examples.
Firstly, to illustrate the legality and frequency of data transformations, imagine you had measured water temperature in a large number of streams. Naturally, you would have probably measured the temperature in °C. Suppose you later wanted to publish your results in an American journal and the editor requested that the results be in °F. You would not need to re-measure the stream temperature. Rather, each of the temperatures could be converted from one scale (°C) to the other (°F). Such transformations are very common.

To further illustrate the process of transformation, imagine a botanist wanted to examine the size of a particular species' leaves for some reason. The botanist decides to measure the length of a random selection of leaves using a standard linear, metric ruler, and the distribution of sample observations as represented by a histogram and boxplot is as follows.

The growth rate of leaves might be expected to be greatest in small leaves and decelerate with increasing leaf size. That is, the growth rate of leaves might be expected to be logarithmic rather than linear. As a result, the distribution of leaf sizes using a linear scale might also be expected to be non-normal (log-normal). If, instead of using a linear scale, the botanist had used a logarithmic ruler, the distribution of leaf sizes may have been more like the following.

If the distribution of observations is determined by the scale used to measure the observations, and the choice of scale (in this case the ruler) is somewhat arbitrary (a linear scale is commonly used because we find it easier to understand), then it is justifiable to convert the data from one scale to another after the data has been collected and explored. It is not necessary to re-measure the data in a different scale. Therefore to normalize the data, the botanist can simply convert the data to logarithms.

The important points in the process of transformations are;

The order of the data has not been altered (a large leaf measured on a linear scale is still a large leaf on a logarithmic scale), only the spacing of the data
Since the spacing of the data is purely dependent on the scale of the measuring device, there is no reason why one scale is more correct than any other scale
For the purpose of normalization, data can be converted from one scale to another following exploration
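The first point can be illustrated with a few invented leaf lengths: the order of the observations is the same on both scales, and only the spacing differs.

```r
# Hypothetical leaf lengths (mm) with a long right tail.
leaf_length <- c(12, 15, 18, 21, 25, 30, 38, 52, 75, 140)
log_length  <- log10(leaf_length)

# The ordering of the observations is identical on both scales...
identical(order(leaf_length), order(log_length))

# ...but the spacing between neighbouring values changes.
diff(leaf_length)
round(diff(log_length), 2)
```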

End of instructions

Data transformations

1. Select the Data menu..
2. Select the Manage variables in active data set.. submenu..
3. Select the Compute new variable.. submenu..
The Compute new variable dialog box will be displayed.

4. Enter a name for a new variable
Note, this should be a unique name that is not already in the data set (otherwise it will be overwritten). It is good practice to simply prepend the name of the variable whose values are to be transformed with the name of the transformation function (e.g. logH - this reminds us that this new variable contains the log transformed H observations).
5. Enter the name of a transformation function with the name of the variable whose values are to be transformed as its sole argument (e.g. log10(H) - this performs a log (base 10) transformation of the H observations)
6. Click

7. Confirm the transformation by clicking either or

End of instructions

Transformations

Biological data measured on a linear scale often follows a log-normal distribution. If however, such data was collected using a logarithmic scale (for example the length of leaves measured using a logarithmic rule rather than a standard linear rule), the observations will be normally distributed. Linear scales are usually used because they are the easiest for the feeble human mind to comprehend, but other than that, no scale is more valid for making measurements than any other scale (linear, logarithmic, ...).
Measurements collected in one scale can be easily converted to another scale, and the resulting observations would be as if they had been measured in the new scale from the start. Therefore, it is valid to convert observations from one scale to another (transformation) for the purpose of normalizing data.
Note, transformations only alter the distances between observations, not the order of the observations

End of instructions

Data transformations (Log)

1. Select the Data menu..
2. Select the Manage variables in active data set submenu..
3. Select the Compute new variable submenu..
The Compute New Variable dialog box will be displayed.

4. Enter the name of a new variable in the New variable name box. As this is a name to refer to a new variable, it should not be an existing name
Hint: if performing a log transform, use the existing variable name with an 'L' in front
5. In the Expression to compute box, enter a transformation (scaling) function with the name of the existing variable to be transformed as its sole argument (e.g. log10(DOC))
6. Click
A new variable will be created, and this variable will contain the transformed values of an existing variable
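At the console, the same step amounts to a single assignment. The data frame name streams and its values are placeholders; only the DOC column name and the log10() call come from the instructions above.

```r
# Hypothetical data set with a DOC column.
streams <- data.frame(DOC = c(12, 35, 60, 110, 170))

# Add the log (base 10) transformed values as a new variable, LDOC.
streams$LDOC <- log10(streams$DOC)
streams
```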

End of instructions

Scatterplots

1. Select the Graphs menu..
2. Select the Scatterplot... submenu..

The Scatterplot dialog box will be displayed.
3. Select the independent variable for the x axis
4. Select the dependent variable for the y axis
5. Click
A new scatterplot will be created that by default includes boxplots on the axes as well as both a linear regression smoother (dashed red line) and lowess smoother (solid red line).

End of instructions

Boxplot on scatterplot axes

From the Scatterplot dialog box, set up your scatterplot again:
1. Select the independent variable for the x axis
2. Select the dependent variable for the y axis
3. Ensure that the Marginal boxplots option is selected
4. Click
A new scatterplot will be created that includes boxplots on the axes.

End of instructions

Lowess smoother

From the Scatterplot dialog box, set up your scatterplot again:
1. Select the independent variable for the x axis
2. Select the dependent variable for the y axis
3. Ensure that the Smooth line option is selected
4. Click
A new scatterplot will be created that includes a lowess smoother through the data.
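For comparison, a lowess curve can also be computed directly with base R's lowess(). The data below are simulated, and the plotting calls are a plain base-graphics sketch rather than the command R Commander generates.

```r
set.seed(42)
# Hypothetical data with a curved relationship.
x <- seq(1, 20, length.out = 40)
y <- 3 * log(x) + rnorm(40, sd = 0.3)

# lowess() returns the smoothed curve as x and y coordinates.
smooth_fit <- lowess(x, y)
# A straight-line fit for comparison.
linear_fit <- lm(y ~ x)

# Draw the scatterplot with both lines when run interactively.
if (interactive()) {
  plot(x, y)
  abline(linear_fit, lty = 2)
  lines(smooth_fit, lty = 1)
}
```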

End of instructions

Scatterplot matrix (SPLOM)

1. Select the Graphs menu..
2. Select the Scatterplot matrix... submenu..

The Scatterplot Matrix dialog box will be displayed.
3. Select the continuous variables to be included in the scatterplot matrix (hold down the Ctrl key while selecting variables)
4. Select the Boxplots option from the On Diagonal: options list
5. Click
A new scatterplot matrix will be created that by default includes boxplots in the diagonals as well as both a linear regression smoother (dashed black line) and lowess smoother (solid black line).

End of instructions

EDA - Homogeneity of variance

Many statistical hypothesis tests assume that the populations from which the sample measurements were collected are equally varied with respect to all possible measurements. For example, are the growth rates of one population of plants (the population treated with a fertilizer) more or less varied than the growth rates of another population (the control population that is treated purely with water)? If they are, then the results of many statistical hypothesis tests that compare means may be unreliable. If the populations are equally varied, then the tests are more likely to be reliable.
Obviously, it is not possible to determine the variability of entire populations. However, if sampling is unbiased and sufficiently replicated, then the variability of samples should reflect the variability of their respective populations. Consequently, samples that have substantially different degrees of variability provide strong evidence that the populations are likely to be unequally varied.
There are a number of options for using samples to assess whether populations are likely to be equally varied.

Calculate the variance or standard deviation of each sample. If the value for one sample is more than 2 times greater or smaller than that of the other sample(s), then there is some evidence for unequal population variability (non-homogeneity of variance).
Construct boxplots of the measurements for each population, and compare the lengths of the boxplots. The length of a symmetrical (and thus normally distributed) boxplot is a robust measure of the spread of values, and thus an indication of the spread of population values. Therefore if any of the boxplots are more than 2 times longer or shorter than the other boxplot(s), then there is again some evidence for unequal population variability (non-homogeneity of variance).
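Both checks can be sketched numerically. The growth rates below are invented, and the interquartile range is used as a stand-in for the length of the box in a boxplot.

```r
# Hypothetical growth rates for two groups of plants.
growth <- data.frame(
  rate  = c(4.1, 5.0, 4.6, 5.3, 4.8, 3.2, 7.9, 5.5, 9.1, 2.4),
  group = factor(rep(c("control", "fertilizer"), each = 5))
)

# Sample variance and interquartile range (box length) for each group.
vars <- tapply(growth$rate, growth$group, var)
iqrs <- tapply(growth$rate, growth$group, IQR)

# Ratio of largest to smallest; values above 2 hint at unequal spread.
max(vars) / min(vars)
max(iqrs) / min(iqrs)
```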

End of instructions

Boxplots

1. Select the Graphs menu..
2. Select the Boxplot.. submenu..
The Boxplot dialog box will be displayed.
3. Select the variable for which the boxplot is to be generated
4. Click

5. Select the appropriate grouping variable (factor) from the list. This variable should contain two or more levels, for each of which a separate boxplot will be constructed.
The Groups dialog box will be displayed.
6. Click in the Groups dialog box

7. Click

End of instructions

t test

The frequentist approach to hypothesis testing is based on estimating the probability of collecting the observed sample(s) when the null hypothesis is true. That is, how rare would it be to collect samples that displayed the observed degree of difference if the samples had been collected from identical populations. The degree of difference between two collected samples is objectively represented by a t statistic.

t = (y₁ - y₂) / √(s²/n₁ + s²/n₂)

where y₁ and y₂ are the sample means of groups 1 and 2 respectively, and √(s²/n₁ + s²/n₂) represents the degree of precision in the difference between means by taking into account the degree of variability of each sample.
If the null hypothesis is true (that is, the means of population 1 and population 2 are equal), the degree of difference (t value) between an unbiased sample collected from population 1 and an unbiased sample collected from population 2 should be close to zero (0). It is unlikely that an unbiased sample collected from population 1 will have a mean substantially higher or lower than an unbiased sample from population 2. That is, it is unlikely that such samples could result in a very large (positive or negative) t value if the null hypothesis (of no difference between populations) was true. The question is, how large (either positive or negative) does the t value need to be before we conclude that the null hypothesis is unlikely to be true?

What is needed is a method by which we can determine the likelihood of any conceivable t value when the null hypothesis is true. This can be done via simulation. We can simulate the collection of random samples from two identical populations many times and record how often each possible t value arises.

Let's say a vivoculturalist was interested in comparing the size of Eucalyptus regnans seedlings grown under shade and full sun conditions. In this case we have two populations. One population represents all the possible E. regnans seedlings grown in shade conditions, and the other population represents all the possible E. regnans seedlings grown in full sun conditions. If we had grown 200 seedlings under shade conditions and 200 seedlings under full sun conditions, then these samples can be used to assess the null hypothesis that the mean size of an infinite number (population) of E. regnans seedlings grown under shade conditions is the same as the mean size of an infinite number (population) of E. regnans seedlings grown under full sun conditions. That is, that the population means are equal.

We can firstly simulate the sizes of 200 seedlings grown under shade conditions and another 200 seedlings grown under full sun conditions that could arise naturally when shading has no effect on seedling growth. That is, we can simulate one possible outcome when the null hypothesis is true and shading has no effect on seedling growth.
Now we can calculate the degree of difference between the mean sizes of seedlings grown under the two different conditions, taking into account natural variation (that is, we can calculate a t value using the formula from above). From this simulation we may have found that the mean sizes of seedlings grown in shade and full sun were 31.5 cm and 33.7 cm respectively and the degree of difference (t value) was 0.25. This represents only one possible outcome (t value). We now repeat this simulation process a large number of times (1000) and generate a histogram (or more correctly, a distribution) of the t value outcomes that are possible when the null hypothesis is true.

It should be obvious that when the null hypothesis is true (and the two populations are the same), the majority of t values calculated from samples containing 200 seedlings will be close to zero (0), indicating only small degrees of difference between the samples. However, it is also important to notice that it is possible (albeit less likely) to have samples that are quite different from one another (large positive or negative t values) just by pure chance (for example t values greater than 2).
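The simulation described above might be sketched as follows. The population mean (32 cm) and standard deviation (5 cm) are made-up values chosen only so that both samples come from the same population; the helper name simulate_t is likewise hypothetical.

```r
set.seed(123)

# Simulate one experiment in which the null hypothesis is true:
# both samples of 200 seedlings come from the same population.
# The population mean (32 cm) and SD (5 cm) are made-up values.
simulate_t <- function(n = 200, mean = 32, sd = 5) {
  shade <- rnorm(n, mean, sd)
  sun   <- rnorm(n, mean, sd)
  unname(t.test(shade, sun, var.equal = TRUE)$statistic)
}

# Repeat the simulation 1000 times and collect the t values.
t_values <- replicate(1000, simulate_t())

# Most t values are near zero; only a few exceed 2 in magnitude.
mean(t_values)
mean(abs(t_values) > 2)

if (interactive()) hist(t_values, breaks = 30)
```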

It turns out that it is not necessary to perform these simulations each time you test a null hypothesis. There is a mathematical formula to estimate the t distribution appropriate for any given sample size (or more correctly, degrees of freedom) when the null hypothesis is true. In this case, the t distribution is for (200-1)+(200-1)=398 degrees of freedom.

At this stage we would calculate the t value from our actual observed samples (the 200 real seedlings grown under each of the conditions). We then compare this t value to our distribution of all possible t values (t distribution) to determine how likely our t value is when the null hypothesis is true. The simulated t distribution suggests that very large (positive or negative) t values are unlikely. The t distribution allows us to calculate the probability of obtaining a value greater than a specific t value. For example, we could calculate the probability of obtaining a t value of 2 or greater when the null hypothesis is true, by determining the area under the t distribution beyond a value of 2.

If the calculated t value was 2, then the probability of obtaining this t value (or one greater) when the null hypothesis is true is 0.023 (or 2.3%). Since the probability of obtaining our t value or one greater (and therefore the probability of a sample of 200 seedlings grown in the shade being so different in size from a sample of 200 seedlings grown in full sun) is so low, we would conclude that the null hypothesis of no effect of shading on seedling growth is likely to be false. Thus, we have provided some strong evidence that shading conditions do affect seedling growth!
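The area under the t distribution beyond a given t value can be obtained with pt(). The sketch below uses the t value of 2 and the 398 degrees of freedom from the example; the doubling for a two-tailed test is added here for completeness.

```r
t_observed <- 2
df <- (200 - 1) + (200 - 1)  # 398 degrees of freedom

# Area under the t distribution at or beyond the observed t value.
p_upper <- pt(t_observed, df = df, lower.tail = FALSE)
p_upper

# Counting equally extreme values in both tails doubles this area.
p_two_sided <- 2 * p_upper
p_two_sided
```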

End of instructions

t-test hypothesis testing demonstration

1. Select the Demonstrations menu..
2. Select the tdistribution submenu..

The t distribution demonstration dialog box will be displayed.

The default configuration is set up to collect two random samples (one containing 8 observations, the other 6) from normally distributed populations with means of 800 and standard deviations of 20.

The intention of this demonstration is to examine the range and frequency of outcomes that are possible when the null hypothesis is true (and the two populations to be compared are equal). That is, what levels of similarity between two random samples are likely and what levels are unlikely. In the frequentist approach to hypothesis testing, likelihood is measured by how often (or how frequently) an event occurs from a large number of opportunities.

Start off setting the Number of simulations to 1000. This will collect 1000 random samples from Population 1 and 2.

Assumptions - In simulating the repeated collection of samples from both male and female populations, we make a number of assumptions that have important consequences for the reliability of the statistical tests.

Firstly, the function used to simulate the collection of random samples of say 8 male fulmars (and 6 female fulmars) generates these samples from normal distributions of a given mean and standard deviation. Thus, each simulated t value and the distribution of t values from multiple runs represents the situation for when the samples are collected from a population that is normally distributed. Likewise, the mathematical t distribution also makes this assumption.

Furthermore, the sampling function used the same standard deviation for each collected sample. Hence, each simulated t value and the distribution of t values from multiple runs represents the situation for when the samples are collected from populations that are equally varied. Likewise, the mathematical t distribution also makes this assumption.

By altering the variability and degree of normality of the populations, you can see the effects that normality and homogeneity of variance have on how much a distribution of t values differs from the mathematical t-distribution. Violations of either of these two assumptions (normality and homogeneity of variance) have the real potential to compromise the reliability of the statistical conclusions and thus it is important for the data to satisfy these assumptions.
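The effect of unequal variances can also be explored at the console. The sketch below is not the demonstration's own code: it uses the default sample sizes (8 and 6), mean (800) and standard deviation (20) quoted above, while the larger standard deviation of 80 and the helper name rejection_rate are arbitrary choices for illustration.

```r
set.seed(2024)

# Proportion of simulated pooled-variance t tests that reject a true
# null hypothesis at alpha = 0.05, for samples of 8 and 6 observations.
rejection_rate <- function(sd1, sd2, n1 = 8, n2 = 6, sims = 5000) {
  p <- replicate(sims, {
    s1 <- rnorm(n1, mean = 800, sd = sd1)
    s2 <- rnorm(n2, mean = 800, sd = sd2)
    t.test(s1, s2, var.equal = TRUE)$p.value
  })
  mean(p < 0.05)
}

equal_var   <- rejection_rate(sd1 = 20, sd2 = 20)  # assumption met
unequal_var <- rejection_rate(sd1 = 20, sd2 = 80)  # assumption violated

equal_var    # should sit close to 0.05
unequal_var  # drifts away from 0.05
```

When the smaller sample comes from the more variable population, as here, the pooled test rejects a true null hypothesis more often than the nominal 5%.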

End of instructions

Basic steps of Hypothesis testing

Step 1 - Clearly establish the statistical null hypothesis. Therefore, start off by considering the situation where the null hypothesis is true - e.g. when the two population means are equal

Step 2 - Establish a critical statistical criterion (e.g. alpha = 0.05)

Step 3 - Collect samples consisting of independent, unbiased observations

Step 4 - Assess the assumptions relevant to the statistical hypothesis test. For a t test:
1. Normality
2. Homogeneity of variance

Step 5 - Calculate test statistic appropriate for null hypothesis (e.g. a t value)

Step 6 - Compare observed test statistic to a probability distribution for that test statistic when the null hypothesis is true for the appropriate degrees of freedom (e.g. compare the observed t value to a t distribution).

Step 7 - If the observed test statistic is greater (in magnitude) than the critical value for that test statistic (based on the predefined critical criterion), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis, and therefore the null hypothesis is rejected. Otherwise we conclude that there is insufficient evidence to reject the null hypothesis. Alternatively, we calculate the probability of obtaining the observed test statistic (or one of greater magnitude) when the null hypothesis is true. If this probability is less than our predefined critical criterion (e.g. 0.05), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis, and therefore the null hypothesis is rejected. Otherwise we conclude that there is insufficient evidence to reject the null hypothesis.
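The two routes in Step 7 always lead to the same decision. A small sketch with simulated (hypothetical) samples and a two-tailed pooled variance t test:

```r
set.seed(7)
# Hypothetical samples from two populations.
group1 <- rnorm(10, mean = 50, sd = 5)
group2 <- rnorm(10, mean = 56, sd = 5)
alpha  <- 0.05

res   <- t.test(group1, group2, var.equal = TRUE)
t_obs <- unname(res$statistic)
df    <- unname(res$parameter)

# Approach 1: compare |t| with the two-tailed critical value.
t_crit <- qt(1 - alpha / 2, df)
reject_by_critical <- abs(t_obs) > t_crit

# Approach 2: compare the p value with alpha.
reject_by_p <- res$p.value < alpha

# Both approaches lead to the same decision.
c(critical = reject_by_critical, p_value = reject_by_p)
```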

End of instructions

Pooled variance t-test

1. Select the Statistics menu..
2. Select the Means submenu..
3. Select the Independent samples t-test... submenu..

The Independent samples t-test dialog box will be displayed.
4. Select the name of the grouping (categorical, factorial, independent) variable.
5. Select the name of the response (dependent) variable.
6. Select Yes under Assume equal variances?
7. Click
The output of the pooled variance t-test will appear in the R output window.
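A console sketch of the same test, using t.test() with var.equal = TRUE. The data frame and its column names are hypothetical, and R Commander may add further arguments to the call it generates.

```r
set.seed(10)
# Hypothetical data: a response measured in two groups.
dat <- data.frame(
  response = c(rnorm(8, mean = 20, sd = 3), rnorm(6, mean = 24, sd = 3)),
  group    = factor(rep(c("A", "B"), times = c(8, 6)))
)

# var.equal = TRUE requests the pooled variance t-test.
pooled <- t.test(response ~ group, data = dat, var.equal = TRUE)
pooled

# Degrees of freedom are n1 + n2 - 2.
pooled$parameter
```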

End of instructions

Separate variance t-test

1. Select the Statistics menu..
2. Select the Means submenu..
3. Select the Independent samples t-test... submenu..

The Independent samples t-test dialog box will be displayed.
4. Select the name of the grouping (categorical, factorial, independent) variable.
5. Select the name of the response (dependent) variable.
6. Select No under Assume equal variances?
7. Click
The output of the separate variance t-test will appear in the R output window.
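At the console, setting var.equal = FALSE in t.test() gives the separate variance (Welch) version of the test. The data below are simulated so that one group is much more variable than the other; all names are hypothetical.

```r
set.seed(11)
# Hypothetical data: group B is far more variable than group A.
dat <- data.frame(
  response = c(rnorm(8, mean = 20, sd = 2), rnorm(6, mean = 24, sd = 8)),
  group    = factor(rep(c("A", "B"), times = c(8, 6)))
)

# var.equal = FALSE requests the separate variance (Welch) t-test.
separate <- t.test(response ~ group, data = dat, var.equal = FALSE)
separate

# The degrees of freedom are adjusted downwards and need not be whole.
separate$parameter
```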

End of instructions

Error bars on bargraphs

From the Bar Graph dialog box
1. Select the type of error bars to include
2. Confirm that the correct dependent and categorical variables etc are highlighted
3. Click

The bargraph will be drawn with error bars and will appear in RGui.

End of instructions

Output of t-test

The following outputs are based on simulated data sets;
1. Pooled variance t-test for populations with equal (or nearly so) variances

2. Separate variance t-test for populations with unequal variances

End of instructions

Non-parametric tests

Non-parametric tests do not place any distributional limitations on data sets and are therefore useful when the assumption of normality is violated. There are a number of alternatives to parametric tests, the most common are;

1. Randomization tests - rather than assume that a test statistic calculated from the data follows a specific mathematical distribution (such as a normal distribution), these tests generate their own test statistic distribution by repeatedly re-sampling or re-shuffling the original data and recalculating the test statistic each time. A p value is subsequently calculated as the proportion of random test statistics that are greater than or equal to the test statistic based on the original (un-shuffled) data.

2. Rank-based tests - these tests operate not on the original data, but rather on data that has first been ranked (each observation is assigned a ranking, such that the largest observation is given the value of 1, the next highest is 2 and so on). It turns out that the probability distribution of any rank-based test statistic for a given sample size is identical, whatever the original data.
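A randomization test can be sketched in a few lines of base R. The two groups of observations are invented, and the absolute difference between group means is chosen here as the test statistic; other statistics could be used in the same way.

```r
set.seed(99)
# Hypothetical observations from two groups.
a <- c(4.2, 5.1, 3.8, 4.9, 5.4, 4.4)
b <- c(6.3, 5.9, 7.1, 6.6, 5.8, 7.4)

values <- c(a, b)
labels <- rep(c("a", "b"), times = c(length(a), length(b)))

# Test statistic: absolute difference between the group means.
mean_diff <- function(v, g) abs(mean(v[g == "a"]) - mean(v[g == "b"]))
observed  <- mean_diff(values, labels)

# Re-shuffle the group labels many times, recalculating each time.
shuffled <- replicate(5000, mean_diff(values, sample(labels)))

# p value: proportion of shuffled statistics >= the observed one.
p_value <- mean(shuffled >= observed)
p_value
```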

End of instructions

Simple linear regression

Linear regression analysis explores the linear relationship between a continuous response variable (DV) and a single continuous predictor variable (IV). A line of best fit (one that minimizes the sum of squared deviations between the observed y values and the predicted y values at each x) is fitted to the data to estimate the linear model.

As plot a) shows, a slope of zero would indicate no relationship - an increase in IV is not associated with a consistent change in DV.
Plot b) shows a positive relationship (positive slope) - an increase in IV is associated with an increase in DV.
Plot c) shows a negative relationship.

Testing whether the population slope (as estimated from the sample slope) differs from zero is one way of evaluating the relationship. Regression analysis also determines how much of the total variability in the response variable can be explained by its linear relationship with IV, and how much remains unexplained.

The line of best fit (relating DV to IV) takes on the form of
y=mx + c
Response variable = (slope*predictor variable) + y-intercept
If all points fall exactly on the line, then this model (right-hand part of equation) explains all of the variability in the response variable. That is, all variation in DV can be explained by differences in IV. Usually, however, the model does not explain all of the variability in the response variable (due to natural variation); some is left unexplained (referred to as error).

Therefore the linear model takes the form of:
Response variable = model + error.
A high proportion of variability explained relative to unexplained (error) is suggestive of a relationship between response and predictor variables.
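
As a rough illustration of the line of best fit and of the explained versus unexplained variability, the following Python sketch estimates the slope and y-intercept by least squares and reports the proportion of variability explained. The function name fit_line and the data values are hypothetical and chosen only for the example.

```python
def fit_line(x, y):
    """Least-squares estimates for y = slope * x + intercept, plus R-squared."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    predicted = [slope * xi + intercept for xi in x]
    ss_total = sum((yi - y_bar) ** 2 for yi in y)                   # total variability
    ss_error = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # unexplained
    r_squared = 1 - ss_error / ss_total                             # proportion explained
    return slope, intercept, r_squared


# Hypothetical predictor (x) and response (y) values, invented for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = fit_line(x, y)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.3f}")
```

An R-squared close to 1 means that most of the variability in the response is explained by the model and little is left as error.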

For example, a mammalogist was investigating the effects of tooth wear on the amount of time free ranging koalas spend feeding per 24 h. Therefore, the amount of time spent feeding per 24 h was measured for a number of koalas that varied in their degree of tooth wear (ranging from unworn through to worn). Time spent feeding was the response variable and degree of tooth wear was the predictor variable.

End of instructions

ANOVA

Analysis of variance, or ANOVA, partitions the variation in the response (DV) variable into that explained and that unexplained by one or more categorical predictor variables, called factors. The ratio of this partitioning can then be used to test the null hypothesis (H₀) that population group or treatment means are equal. A single factor ANOVA investigates the effect of a single factor, with 2 or more groups (levels), on the response variable. Single factor ANOVAs are completely randomized (CR) designs. That is, there is no restriction on the random allocation of experimental or sampling units to factor levels.
Single factor ANOVA tests the H₀ that there is no difference between the population group means.
H₀: µ₁ = µ₂ = ... = µ_i = µ
If the effect of the ith group is:
(α_i = µ_i - µ) then this can be written as
H₀: α₁ = α₂ = ... = α_i = 0
If one or more of the α_i are different from zero, the null hypothesis will be rejected.
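
The partitioning of variation into explained (between groups) and unexplained (within groups) components can be sketched in Python as below. This is only an illustrative calculation of the F-ratio for a single factor design; the function name one_way_anova and the three small groups are assumptions made for the example.

```python
def one_way_anova(groups):
    """Return (SS between groups, SS within groups, F-ratio) for a single factor."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation explained by the factor: group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Unexplained variation: observations around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_ratio


# Hypothetical data for three groups, invented for illustration
groups = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]
ss_between, ss_within, f_ratio = one_way_anova(groups)
print(ss_between, ss_within, f_ratio)
```

A large F-ratio indicates that the variation between group means is large relative to the variation within groups, which is evidence against H₀.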

Keough and Raimondi (1995) investigated the degree to which biofilms (films of diatoms, algal spores, bacteria, and other organic material that develop on hard surfaces) influence the settlement of polychaete worms. They had four categories (levels) of the biofilm treatment: sterile substrata, lab developed biofilms without larvae, lab developed biofilms with larvae (any larvae), field developed biofilms without larvae. Biofilm plates were placed in the field in a completely randomized array. After a week the number of polychaete species settled on each plate was then recorded. The diagram illustrates an example of the spatial layout of a single factor with four treatments (four levels of the treatment factor, each with a different pattern fill) and four experimental units (replicates) for each treatment.

End of instructions

Factorial boxplots

Select the Graphs menu
Select the Boxplot.. submenu

The Boxplot dialog box will appear

Select the dependent variable from the Variable list
Click the button

The Groups dialog box will appear

Select the factorial (independent) variable from the Groups variable list
Click the button
Click the button

End of instructions

One-tailed critical t-value

Select the Distributions menu
Select the Continuous Distributions.. submenu
Select the t distribution.. submenu
Select the t quantiles.. submenu

The t Quantiles dialog box will appear

Enter the α value (usually 0.05) in the Probabilities box
Enter the degrees of freedom for the test in the Degrees of freedom box
It actually doesn't matter whether you select Lower tail or Upper tail. For a symmetrical distribution, centered around 0, the critical t-value is the same for both Lower tail (e.g. population 2 greater than population 1) and Upper tail (e.g. population 1 greater than population 2), except that the Lower tail is always negative. As it is often less confusing to work with positive values, it is recommended that you use Upper tail values. An example of a t-distribution with Upper tail for a one-tailed test is depicted below. Note that this is not part of the t quantiles output!

Click the button

The critical t-value will appear in the R Commander output window

End of instructions

Two-tailed critical t-value

Select the Distributions menu
Select the Continuous Distributions.. submenu
Select the t distribution.. submenu
Select the t quantiles.. submenu

The t Quantiles dialog box will appear

Enter the α value (usually 0.025; recall that two-tailed tests assign 0.05/2 to each tail) in the Probabilities box
Enter the degrees of freedom for the test in the Degrees of freedom box
It actually doesn't matter whether you select Lower tail or Upper tail. For a symmetrical distribution, centered around 0, the critical t-value is the same for both Lower tail (e.g. population 2 greater than population 1) and Upper tail (e.g. population 1 greater than population 2), except that the Lower tail is always negative. As it is often less confusing to work with positive values, it is recommended that you use Upper tail values. An example of a t-distribution with Upper tail for a two-tailed test is depicted below. Note that this is not part of the t quantiles output!

Click the button

The critical t-value will appear in the R Commander output window
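
For readers who want to see where such critical values come from, the following Python sketch approximates upper-tail critical t-values numerically. It is not how R Commander computes them; the function names (t_pdf, upper_tail_prob, critical_t), the integration settings and the example of 10 degrees of freedom are all assumptions made for illustration.

```python
import math


def t_pdf(t, df):
    """Probability density of the t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail_prob(t, df, steps=2000):
    """P(T > t) for t >= 0, integrating the density over [0, t] by Simpson's rule."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def critical_t(prob, df):
    """Upper-tail critical value: the t for which P(T > t) = prob (by bisection).

    Assumes the answer lies between 0 and 100.
    """
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail_prob(mid, df) > prob:
            lo = mid  # tail area still too large, so move right
        else:
            hi = mid
    return (lo + hi) / 2


one_tailed = critical_t(0.05, 10)   # one-tailed test, alpha = 0.05
two_tailed = critical_t(0.025, 10)  # two-tailed test, 0.05/2 in each tail
print(f"one-tailed: {one_tailed:.3f}")  # about 1.812
print(f"two-tailed: {two_tailed:.3f}")  # about 2.228
```

The corresponding Lower tail values are simply the negatives of these, because the t distribution is symmetrical around 0.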

End of instructions

Plotting mean vs variance

Firstly you need to calculate the mean and variance for each level of the factorial variable.

Select the Statistics menu..
Select the Summaries.. submenu..
Select the Basic statistics.. submenu..

The Basic Statistics dialog box will be displayed.

Note the default name that is entered in the Table name box. This can be changed to any name not already in use if you would prefer a more informative name.
Select the dependent variable
Select at least the Mean and Variance options
Select the Mean vs Var plot option
Click the button

The Groups dialog box will be displayed.

Select the categorical variable from the Groups variable list
Click in the Groups dialog box
Click buttons

The plot of Mean vs Variance will appear in the RGui graphics window
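
The quantities behind this plot are simply the mean and variance of the dependent variable within each level of the categorical variable. A minimal Python sketch of that calculation is given below; the function name mean_variance_by_group and the data are hypothetical and used only for illustration.

```python
from statistics import mean, variance


def mean_variance_by_group(values, groups):
    """Return {group: (mean, sample variance)} for each level of the factor."""
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    return {g: (mean(v), variance(v)) for g, v in by_group.items()}


# Hypothetical dependent variable and factor levels, invented for illustration
values = [2, 4, 6, 10, 20, 30]
groups = ["A", "A", "A", "B", "B", "B"]
summary = mean_variance_by_group(values, groups)
for group, (m, v) in summary.items():
    print(group, m, v)  # each (mean, variance) pair is one point on the plot
```

If the variance increases with the mean across groups, as in this invented example, the plotted points will trend upwards from left to right.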

End of instructions

Analysing frequencies

Analysis of frequencies is similar to Analysis of Variance (ANOVA) in some ways. Variables contain two or more classes that are defined from either natural categories or from a set of arbitrary class limits in a continuous variable. For example, the classes could be sexes (male and female) or color classes derived by splitting the light scale into a set of wavelength bands. Unlike ANOVA, in which an attribute (e.g. length) is measured for a set number of replicates and the means of different classes (categories) are compared, when analyzing frequencies, the number of replicates (observed) that fall into each of the defined classes is counted and these frequencies are compared to predicted (expected) frequencies.

Analysis of frequencies tests whether a sample of observations came from a population where the observed frequencies match some expected or theoretical frequencies. Analysis of frequencies is based on the chi-square (χ²) statistic, which follows a chi-square distribution (the distribution of squared values from a standard normal distribution, hence its long right tail).

When there is only one categorical variable, expected frequencies are calculated from theoretical ratios. When there is more than one categorical variable, the data are arranged in a contingency table that reflects the cross-classification of sampling or experimental units into the classes of the two or more variables. The most common form of contingency table analysis (model I) tests a null hypothesis of independence between the categorical variables and is analogous to the test of an interaction in multifactorial ANOVA. Hence, frequency analysis provides hypothesis tests for solely categorical data. Although analysis of frequencies provides a way to analyse data in which both the predictor and response variables are categorical, variables are not distinguished as either predictor or response in the analysis, so establishment of causality is only of importance for interpretation.

End of instructions

Goodness of fit test

The goodness-of-fit test compares observed frequencies of each class within a single categorical variable to the frequencies expected of each of the classes on a theoretical basis. It tests the null hypothesis that the sample came from a population in which the observed frequencies match the expected frequencies.

For example, an ecologist investigating factors that might lead to deviations from a 1:1 offspring sex ratio, hypothesized that when the investment in one sex is considerably greater than in the other, the offspring sex ratio should be biased towards the less costly sex. He studied two species of wasps, one of which had males that were considerably larger (and thus more costly to produce) than females. For each species, he compared the offspring sex ratio to a 1:1 ratio using a goodness-of-fit test.

End of instructions

Use the Number of Columns slider to indicate the number of categories
Modify the column labels in the Enter table table to reflect the data categories
Enter the observed counts (frequencies) in the Enter table table
In the Enter expected ratio table, enter the expected counts or ratio
Click the button

The Goodness of fit test output will appear in the R Commander output window.
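
The calculation behind a goodness-of-fit test can be sketched in Python as follows. The function names (chi_square_gof, p_value_df1) and the observed counts are hypothetical and are used only to illustrate a test against a 1:1 expected ratio. The p value shortcut shown applies only when there is 1 degree of freedom (two classes).

```python
import math


def chi_square_gof(observed, expected_ratio):
    """Chi-square statistic, degrees of freedom and expected counts."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return chi_sq, df, expected


def p_value_df1(chi_sq):
    """Upper-tail p value, valid for 1 degree of freedom only.

    With 1 df the statistic is a squared standard normal value,
    so P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_sq / 2))


# Hypothetical counts of male and female offspring, invented for illustration
observed = [40, 60]
chi_sq, df, expected = chi_square_gof(observed, [1, 1])
p_value = p_value_df1(chi_sq)
print(f"chi-square = {chi_sq:.2f}, df = {df}, p = {p_value:.4f}")
```

The expected ratio can be given either as a ratio (such as 1:1) or as expected counts, because it is rescaled to the total number of observations before the statistic is calculated.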

End of instructions

Multivariate analysis

Multivariate data sets include multiple variables recorded from a number of replicate sampling or experimental units, sometimes referred to as objects. If, for example, the objects are whole organisms or ecological sampling units, the variables could be morphometric measurements or abundances of various species respectively. The variables may be all response variables (MANOVA) or they could be response variables, predictor variables or a combination of both (PCA and MDS).

The aim of multivariate analysis is to reduce the many variables to a smaller number of new derived variables that adequately summarize the original information and can be used for further analysis. In addition, Principal components analysis (PCA) and Multidimensional scaling (MDS) also aim to reveal patterns in the data, especially among objects, that could not be found by analyzing each variable separately.

End of instructions

Bray-Curtis dissimilarity

For two objects (i = 1 and 2) and a number of variables (j = 1 to p):

BC = 1 - [2 Σ min(y_1j , y_2j)] / [Σ (y_1j + y_2j)], with both sums taken over j = 1 to p.

y_1j and y_2j are the values of variable j in object 1 and object 2.
min(y_1j , y_2j) is the lesser value of variable j for object 1 and object 2.
p is the number of variables.

For example

           Var 1   Var 2   Var 3
Object 1     3       6       9
Object 2     6      12      18

BC = 1-[(2)(3+6+9)/(18+36)] = 0.33, where (3+6+9) is the sum of the lesser value of each variable across the two objects, and 18 and 36 are the totals of all variables for object 1 and object 2 respectively.
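
The worked example above can be reproduced with a short Python sketch. The function name bray_curtis is a hypothetical choice for this illustration; the data are the values for Object 1 and Object 2 from the example.

```python
def bray_curtis(obj1, obj2):
    """Bray-Curtis dissimilarity between two objects measured on the same variables."""
    # Sum of the lesser value of each variable across the two objects
    shared = sum(min(a, b) for a, b in zip(obj1, obj2))
    # Sum of all values in both objects
    total = sum(obj1) + sum(obj2)
    return 1 - 2 * shared / total


bc = bray_curtis([3, 6, 9], [6, 12, 18])
print(round(bc, 2))  # 0.33
```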

Bray-Curtis dissimilarity is well suited to species abundance data because it ignores variables that have zeros for both objects, and it reaches a constant maximum value when two objects have no variables in common. However, its value is determined mainly by variables with high values (e.g. species with high abundances) and thus results are biased towards trends displayed in such variables. It ranges from 0 (objects completely similar) to 1 (objects completely dissimilar).

End of instructions

Euclidean distance

For two objects (i = 1 and 2) and a number of variables (j = 1 to p):

EUC = √[Σ (y_1j - y_2j)²], with the sum taken over j = 1 to p.

y_1j and y_2j are the values of variable j in object 1 and object 2.
p is the number of variables.

For example

           Var 1   Var 2   Var 3
Object 1     3       6       9
Object 2     6      12      18

EUC = √[(3-6)²+(6-12)²+(9-18)²] = √126 = 11.2
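
The same example can be checked with a short Python sketch, in which Euclidean distance is taken as the square root of the summed squared differences between the two objects. The function name euclidean is a hypothetical choice for this illustration.

```python
import math


def euclidean(obj1, obj2):
    """Euclidean distance between two objects measured on the same variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(obj1, obj2)))


euc = euclidean([3, 6, 9], [6, 12, 18])
print(round(euc, 1))  # 11.2
```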

Euclidean distance is a very important, general distance measure in multivariate statistics. It is based on simple geometry as a measure of distance between two objects in multidimensional space. It is bounded below by zero (for two objects with exactly the same values for all variables) and has no upper limit, even when two objects have no variables in common with positive values. Since it does not reach a constant maximum value when two objects have no variables in common, it is not ideal for species abundance data.

End of instructions

Multidimensional Scaling

Multidimensional scaling (MDS) has two aims:

to reduce a large number of variables down to a small number of summary variables (axes) that explain most of the variation (and patterns) in the original data.
to reveal patterns in the data that could not be found by analysing each variable separately. For example, examining which sites are most similar to one another based on the abundances of many plant species.

MDS starts with a dissimilarity matrix which represents the degree of difference between all objects (such as sites or quadrats) based on all the original variables. MDS then attempts to reproduce these patterns between objects by arranging the objects in multidimensional space with the distances between each pair of objects representing their relative dissimilarity. Each of the dimensions (plot axes) therefore represents a new variable that can be used to describe the relationships between objects. A greater number of new variables (axes) increases the degree to which the original patterns between objects are retained; however, less data reduction has occurred. Furthermore, it is difficult to envisage points (objects) arranged in more than three dimensions.

MDS can be broken down into the following steps

Calculate a matrix of dissimilarities between objects from a data matrix (raw or standardized) in which objects (e.g. quadrats) are in rows and variables (e.g. plants) are in columns.
After deciding on the number of dimensions (axes), arrange the objects in ordination (multidimensional) space (essentially a graph containing as many axes as selected dimensions) either at random or using the coordinates of a previous ordination.
Move the location of the objects in space iteratively so that at each successive step, the match between the ordination inter-object distances and the dissimilarities is improved. The match between ordination distances and object dissimilarities is measured by the stress value: the lower the stress, the better the fit (stress below 15% is acceptable, ideally below 10%).
The final arrangement (configuration) of objects is achieved when further iterative moving of objects no longer results in a substantial decrease in stress. Note, if final stress is > 0.2 (that is, 20%), the number of selected dimensions is inadequate.
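
To make the idea of stress more concrete, the Python sketch below measures the mismatch between the inter-object distances of a configuration and the original dissimilarities. It is a simplification and an assumption of this example: the dissimilarities are used directly as the target distances, whereas non-metric MDS software first converts them to fitted values from a monotone regression. The function name stress and the small three-object example are hypothetical, and stress is expressed here as a proportion rather than a percentage.

```python
import math
from itertools import combinations


def stress(coords, dissimilarities):
    """Stress of a configuration, as a proportion (0 = perfect match).

    coords: list of points (one tuple of axis scores per object)
    dissimilarities: square matrix of dissimilarities between objects
    """
    numerator = 0.0
    denominator = 0.0
    for i, j in combinations(range(len(coords)), 2):
        d = math.dist(coords[i], coords[j])  # ordination distance
        numerator += (d - dissimilarities[i][j]) ** 2
        denominator += d ** 2
    return math.sqrt(numerator / denominator)


# Hypothetical dissimilarities between three objects, invented for illustration
dissimilarities = [[0, 3, 5],
                   [3, 0, 4],
                   [5, 4, 0]]

good_fit = stress([(0, 0), (3, 0), (3, 4)], dissimilarities)  # distances 3, 5, 4
poor_fit = stress([(0, 0), (1, 0), (2, 0)], dissimilarities)  # distances 1, 2, 1
print(good_fit, poor_fit)
```

An iterative MDS procedure repeatedly moves the points so that a measure of this kind decreases, stopping when further moves no longer reduce it substantially.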

End of instructions

Multidimensional scaling

Select the Statistics menu..
Select the Dimensional analysis.. submenu..
Select the Multidimensional scaling.. submenu..

The Multidimensional Scaling dialog box will be displayed.

Select the name of the distance matrix to be used from the Distance matrix list
Enter a name for the resulting MDS output in the Enter a name for model box. The default name is usually fine.
Enter the number of new dimensions (axes) to be calculated - usually 2 or 3.
Select the Shepard diagram checkbox
Select the objects (samples/sites) to be included in the MDS from the Samples list
Click the button

If you selected one or both of the diagram checkboxes (Shepard and Final Configuration), you will be sent to RGui where the diagram(s) appear in R Graphics Window(s). If you selected both Shepard diagram and Final Configuration, there will be multiple R Graphics Windows on top of one another - just move one of them to the side.
The Shepard plot represents the relationship between the original distances (y-axis) and the new MDS distances (x-axis), and the final ordination plot represents the arrangement of objects in multidimensional space.

A large amount of information will appear in the R Commander output window (most of which you can ignore!).

Scroll up through the output and locate the stress value.

End of instructions

Linear models are those statistical models in which a series of parameters are arranged as a linear combination. That is, within the model, no parameter appears as either a multiplier, divisor or exponent to any other parameter. Importantly, the term 'linear' in this context does not pertain to the nature of the relationship between the response variable and the predictor variable(s), and thus linear models are not restricted to 'linear' (straight-line) relationships.

A linear model representing a linear relationship
A linear model representing a curvilinear relationship
A non-linear model representing a curvilinear relationship