

Workshop 2 - Data importation and exploratory data analysis

23 April 2011

Basic statistics references

  • Logan (2010) - Chpt 1, 2 & 6
  • Quinn & Keough (2002) - Chpt 1, 2, 3 & 4

Basic syntax

Q1-1. Complete the following table by assigning the following entries (numbers etc.) to the corresponding object names and determining the object class for each.

Name    Entry              Syntax    Class
a       100                hint
b       Big                hint
var1    100 & 105 & 110    hint
var2    5 + 6              hint
var3    150 to 250         hint

  1. Print out the contents of the vector you called 'a'. Notice that the output appears on the line under the syntax that you entered, and that the output is preceded by a [1]. This indicates that the value returned (100) is the first entry in the vector.
  2. Print out the contents of the vector called 'b'. Again notice that the output is preceded by [1].
  3. Print out the contents of the vector called 'var1'.
  4. Print out the contents of the vector called 'var2'. Notice that the output contains the result of the expression (11) rather than the expression itself.
  5. Print out the contents of the vector called 'var3'. Notice that the output contains 101 entries (150 to 250) and that it spans multiple lines on the screen. Each new line begins with a number in square brackets [] to indicate the index of the entry that immediately follows.
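As a guide, here is a minimal sketch of one possible set of assignments for the table above:

> a <- 100
> b <- "Big"
> var1 <- c(100, 105, 110)
> var2 <- 5 + 6
> var3 <- 150:250
> class(a)
[1] "numeric"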

Variables - vectors

Q1-2. Generate the following numeric vectors (variables)
  1. The numbers 1, 4, 7 & 9 (call the object y)
  2. The numbers 10 to 25 (call the object y1)

  3. The sequence of numbers 2, 4, 6, 8...20 (call the object y2)
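A minimal sketch of how these vectors might be generated:

> y <- c(1, 4, 7, 9)
> y1 <- 10:25
> y2 <- seq(2, 20, by = 2)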


Q1-3. Generate the following character vectors (factorial/categorical variables)
  1. A factor that lists the sex of individuals as 6 females followed by 6 males

  2. A factor called TEMPERATURE that lists 10 cases of 'High', 10 of 'Medium' and 10 of 'Low'
  3. A factor called TEMPERATURE that lists 'High', 'Medium' and 'Low' alternating 10 times
  4. A factor called TEMPERATURE that lists 10 cases of 'High', 8 cases of 'Medium' and 11 cases of 'Low'
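A minimal sketch of some possible solutions, using factor(), rep() and gl() (all described in the instruction boxes at the end of this page):

> SEX <- gl(2, 6, 12, lab = c("Female", "Male"))
> TEMPERATURE <- gl(3, 10, 30, lab = c("High", "Medium", "Low"))
> TEMPERATURE <- factor(rep(c("High", "Medium", "Low"), 10))
> TEMPERATURE <- factor(c(rep("High", 10), rep("Medium", 8), rep("Low", 11)))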

Q1-4. Print out the contents of the 'TEMPERATURE' factor. A list of factor levels will be printed on the screen. This will take up multiple lines, each of which will start with a number in square brackets [ ] to indicate the index number of the entry immediately to the right. At the end, the output also indicates what the names of the factor levels are. These are listed in alphabetical order.


Data sets - Data frames(R)

Rarely is only a single biological variable collected. Data are usually collected in sets of variables reflecting tests of relationships, differences between groups, multiple characterizations etc. Consequently, data sets are best organized into collections of variables (vectors). Such collections are called data frames in R.
Data frames are generated by combining multiple vectors together, whereby each vector becomes a separate column in the data frame. In order for a data frame to represent the data properly, the sequence in which observations appear in the vectors (variables) must be the same for each vector and each vector should have the same number of observations. For example, the first observations from each of the vectors to be included in the data frame must represent observations collected from the same sampling unit.

To demonstrate the use of data frames in R, we will use fictitious data representing the areas of leaves of two species of Japanese Boxwood.

Format of the fictitious data set
PLANT    SPECIES    AREA
P1       B.semp     25
P2       B.semp     22
P3       B.semp     29
P4       B.micro    15
P5       B.micro    17
P6       B.micro    20

PLANT - An identifier for each individual plant that was measured (a single leaf was measured from each individual plant)
SPECIES - Categorical listing of whether the individual plant was Buxus sempervirens (B.semp) or Buxus microphyllum (B.micro)
AREA - The surface area (mm2) of the leaf measured - Response variable
[Image: Leaves]

Q2-1. Let's create the data set in a series of steps. Use the textbox provided in part 7 below to record the R syntax used in each step.
  1. First create the categorical (factor) variable containing the listing of B.semp three times and B.micro three times
  2. Now create the dependent variable (numeric vector) containing the leaf areas
  3. Combine the two variables (vectors) into a single data set (data frame) called LEAVES
  4. Print (to the screen) the contents of this new data set called LEAVES
  5. You will have noticed that the names of the rows are listed as 1 to 6 (this is the default). In the table above, we can see that there is a variable called PLANT that listed unique plant identification labels. These labels are of no use for any statistics, however, they are useful for identifying particular observations. Consequently it would be good to incorporate these labels as row names in the data set. Create a variable called PLANT that contains a listing of the plant identifications
  6. Use this plant identification label variable to define the row names in the data frame called LEAVES
  7. In the textbox provided below, list each of the lines of R syntax required to generate the data set
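A minimal sketch of one possible set of answers for these steps:

> SPECIES <- factor(c(rep("B.semp", 3), rep("B.micro", 3)))
> AREA <- c(25, 22, 29, 15, 17, 20)
> LEAVES <- data.frame(SPECIES, AREA)
> LEAVES
> PLANT <- c("P1", "P2", "P3", "P4", "P5", "P6")
> rownames(LEAVES) <- PLANT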

The above syntax forms a list of instructions that R can perform. Such lists are called scripts. Scripts offer the following:

  • Enable a sequence of tasks such as data entry, analysis and graphical preparation to be repeated quickly and precisely
  • Ensure that the sequence of tasks used to complete an analysis are permanently documented
  • Simplify performing many similar analyses
  • Simplify sharing of data, analyses and techniques

Q2-3.To see how to use a script,
  1. close down R
  2. restart R
  3. Change the working directory (path)
    to the location where you saved the script file in Q2-2 above
  4. Source the script file
Q2-4.There are now at least four objects in the R workspace.
These should be LEAVES (the data frame - data set), PLANT (the list of plant IDs), SPECIES (the character vector of plant species) and AREA (the numeric vector of leaf areas).
  1. Print (list on screen) the contents of the AREA vector. Note that this lists the contents of the stand-alone AREA vector - it is not the same as asking for the contents of the AREA vector within the LEAVES data frame. For example, multiply all of the numbers in the AREA vector by 2. Now print the contents of the AREA vector and then the LEAVES data frame. Notice that only the values in the stand-alone AREA vector have changed - the values within the AREA vector of the LEAVES data frame were not affected.
  2. To avoid confusion and clutter, it is therefore always best to remove the single vectors once a data frame has been created. Remove the PLANT, SPECIES and AREA vectors.
  3. Notice what happens when you now try to access the AREA vector.
  4. To access a variable from within a data frame, we use the $ sign. Print the contents of the AREA vector within the LEAVES data frame.

Q2-5.Since data are stored in vectors, it is possible to access single entries or specific groups of entries. A specific entry is accessed via its index.
To investigate the range of options, complete the following table.
Access                                                                 Syntax
print the LEAVES data set                                              hint
print the first leaf area in the LEAVES data set                       hint
print the first 3 leaf areas in the LEAVES data set                    hint
print a list of leaf areas that are greater than 20                    hint
print a list of leaf areas for the B.microphyllum species              hint
print the section of the data set that contains the B.microphyllum species    hint
alter the second leaf area from 22 to 23                               hint
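A minimal sketch of the kinds of indexing expressions involved (assuming the LEAVES data frame constructed above):

> LEAVES
> LEAVES$AREA[1]
> LEAVES$AREA[1:3]
> LEAVES$AREA[LEAVES$AREA > 20]
> LEAVES$AREA[LEAVES$SPECIES == "B.micro"]
> LEAVES[LEAVES$SPECIES == "B.micro", ]
> LEAVES$AREA[2] <- 23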

Q2-6. Although it is possible to do some data editing this way, for more major editing procedures it is better to either return to Excel or use the 'fix()' function.
Use the 'fix()' function to make a number of changes to the data frame (data set) including adding another column of data (that might represent another variable).

Q2-7.Sometimes it is necessary to transform
a variable from one scale to another. While it is possible to modify an existing variable (vector), it is safer to create a new variable that contains the altered values. Examine the use of R for common transformations.
Transform the leaf areas to log (base 10).
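For example, a log (base 10) transformation might look like the following (creating a new variable rather than overwriting the original):

> LEAVES$LOGAREA <- log10(LEAVES$AREA)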


Importing data and data files

Although it is possible to generate a data set from scratch using the procedures demonstrated in the above demonstration module, often data sets are better managed with spreadsheet software. R is not designed to be a spreadsheet, and thus it is necessary to import data into R. We will use the following small data set (in which the feeding metabolic rate of stick insects fed two different diets was recorded) to demonstrate how a data set is imported into R.

Format of the fictitious data set
PHASMID    DIET     MET.RATE
P1         tough    1.25
P2         tough    1.22
P3         tough    1.29
P4         soft     1.51
P5         soft     1.55
P6         soft     1.48

PHASMID - An identifier for each individual stick insect (phasmid) that was measured
DIET - Categorical listing of whether the food consumed was considered to be tough or soft
MET.RATE - The feeding metabolic rate (mg O2/min/g) of phasmids - Response variable

Q3-1. Importing data into R from Excel is a multi-stage process.
  1. Enter the above data set into Excel and save the sheet as a comma delimited text file (CSV). Ensure that the column titles (variable names) are in the first row and that you take note of where the file is saved. To see the format of this file, open it in Notepad (the Windows accessory program). Notice that it is just a straight text file; there is no encryption or encoding.
  2. Ensure that the current working directory is set to the location of this file
  3. Read (import) the data set into a data table
    . Since data exploration and analysis cannot begin until the data is imported into R, the syntax of this step would usually be on the first line in a new script file that is stored with the comma delimited text file version of the data set.
  4. To ensure that the data have been successfully imported, print the data frame
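A minimal sketch of the import step (assuming the file was saved as phasmid.csv in the current working directory):

> phasmid <- read.table("phasmid.csv", header = T, sep = ",", row.names = 1, strip.white = T)
> phasmid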

Q3-2. As well as importing files, it is often necessary to save a data set (data frame) - particularly if it has been modified and you wish to retain the changes. To demonstrate how to export a data set, we need a data frame (data set) to export. If the LEAVES data frame (from Exercise 2 above) is no longer present, regenerate the LEAVES data set from above using the script file that was generated in Q2-2. To export an R data frame to a text file, you need to write the data frame to a file.
  1. Examine the contents of this comma delimited text file using Notepad
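A minimal sketch of the export step referred to above, using the standard write.table() function:

> write.table(LEAVES, "leaves.csv", quote = F, sep = ",")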

Q3-3. Alternatively, it is also possible to copy and paste data from Excel into R (via the clipboard). Although this method is quicker, there is no record in an R script file as to which Excel file the data originally came from. Furthermore, changes to the Excel data sheet will not be accounted for. Read the data in from the clipboard.


Q3-4. Since there is no link between the data and the script when data are imported via the clipboard, it is recommended that the data be stored as a structure within your R script above any commands that use these data. Place a copy of the data within the R script file that you generated earlier.

Population parameters

The little spotted kiwi (Apteryx owenii) is a very rare flightless bird that is extinct on mainland New Zealand and survives as 1000 individuals on Kapiti Island. In order to monitor the population, researchers in the recovery team systematically captured all of the individuals in the population over a two week period. Each individual was weighed, banded, assessed and released. The file kiwi.csv lists the weights of each individual male little spotted kiwi in the population.

Download Kiwi data set

Format of kiwi.csv data file
Band     Weight
64955    1.749
65318    2.551
64612    1.768
64393    2.327
64092    2.127
...      ...

Band - Unique bird identification band number
Weight - Weight (kg) of the individual male birds
[Image: Spotted kiwi]

Open the kiwi data file. HINT.

Generate a frequency histogram
of male kiwi weights. HINT. This distribution represents the population (all possible observations) of male kiwi weights.  Note that this is the statistical population and not a biological population - obviously a biological population entirely lacking in females would not last long!

Q4-1. How would you describe the shape of the distribution?

Since we have the weights of all male kiwi in the population, it is possible to calculate population parameters (such as the population mean and standard deviation) directly!

Q4-2. What is the mean (a location measure) and standard deviation (a measure of spread) of the population?
  1. Mean HINT
  2. SD HINT

Assuming the population is normally distributed, it is possible to calculate the probability that a randomly recaptured male kiwi will weigh greater than a particular value, less than a particular value, or weigh within a range of weights. This probability is just the area under a particular region of a normal distribution and can be calculated using normal probabilities.

Q4-3. Assuming that the population is normally distributed, what is the probability of recapturing a male little spotted kiwi that weighs greater than 2.9 kg? HINT

For data sets with large numbers of observations, the distribution of observations can be examined via a histogram - as demonstrated above. However, histograms are only meaningful for summarizing large data sets. For smaller data sets other exploratory tools (such as boxplots)
are necessary. To appreciate the relationship between boxplots and the underlying distribution of data, construct a boxplot
of male kiwi weights. HINT

Samples as estimates of populations

Here is a modified example from Quinn and Keough (2002). Lovett et al. (2000) studied the chemistry of forested watersheds in the Catskill Mountains in New York State. They had 38 sites and recorded the concentrations of ten chemical variables, averaged over three years. We will look at two of these variables, dissolved organic carbon (DOC) and hydrogen ions (H).

Download Lovett data set

Format of lovett.csv data files
STREAM          DOC      H
Hunter          180.4    0.48
West Kill       108.8    0.24
Mill            104.7    0.47
Kelly Hollow    84.5     0.23
Pigeon          82.4     0.37

STREAM - Name of the site (stream) from which observations were collected
DOC - Dissolved organic carbon concentration (mmol.L-1)
H - Hydrogen ion concentration (mmol.L-1)
[Image: A river in the Catskill Mountains]

Open the lovett data file.

Q5-1. What is the purpose of sampling?

Before continuing, make sure you are clear on what the observations, variables and populations
are.

Construct a boxplot
of dissolved organic carbon (DOC) from the sample observations. HINT

Q5-2. How would you describe the boxplot?

Q5-3. Are there any outliers? (Y or N)

Provided the data were collected without bias (ideally random) and with adequate replication, the sample should reflect the entire population. Therefore sample statistics
should be good estimates of the population parameters.

Q5-4. Calculate the sample mean
HINT

The mean of a sample is considered to be a location characteristic of the sample. Along with the mean, it is often desirable to characterize the spread of data in a sample - that is to determine how variable the sample is.

Q5-5. Calculate the sample standard deviation
HINT

For most purposes, the sample itself is of little interest - it is purely used to estimate the population. Therefore it is necessary to be able to estimate how well the sample mean estimates the true population mean. The Standard error (SE) of the mean is a measure of the precision
of the mean.

Q5-6. Calculate the standard error of the mean HINT

Following on from the idea of precision of the mean, is the concept of confidence intervals
, by which an interval is calculated that we are 95% confident will contain the true population mean.

Q5-7. Calculate the 95% confidence interval of the mean. HINT
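A minimal sketch of these calculations (assuming the DOC observations are in a vector called DOC, e.g. lovett$DOC, and using a normal quantile for the confidence interval as in the hint):

> mean(DOC)
> sd(DOC)
> sd(DOC)/sqrt(length(DOC))                    # standard error of the mean
> qnorm(0.975) * sd(DOC)/sqrt(length(DOC))     # half-width of the 95% confidence interval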

Construct a boxplot
of hydrogen concentration (H) from the sample observations HINT

Q5-8. How would you describe the boxplot?

Many statistical analyses assume that the population from which the sample was collected is normally distributed. However, biological data is not always normally distributed. To normalize the data, try transforming
to logs.
HINT

Q5-9. Does the transformation successfully normalize
these data? (Y or N)

Earlier we identified the presence of an outlier. To investigate the impact of this outlier on a range of summary statistics, calculate the following measures of location (mean and median) and spread (standard deviation and interquartile range) for DOC, with and without the outlying observation, and complete the table below.

Summary Statistic       DOC     Modified DOC
Mean                    HINT    HINT
Median                  HINT    HINT
Variance                HINT    HINT
Standard deviation      HINT    HINT
Inter-quartile range    HINT    HINT

Q5-10. Which measures of location and spread are most robust to inclusion and exclusion of a single unusual observation?

Exploratory data analysis

Sánchez-Piñero & Polis (2000) studied the effects of seabirds on tenebrionid beetles on islands in the Gulf of California. These beetles are the dominant consumers on these islands and it was envisaged that seabirds leaving guano and carrion would increase beetle productivity. They had a sample of 25 islands and recorded the beetle density, the type of bird colony (roosting, breeding, no birds), % cover of guano and % plant cover of annuals and perennials.

Download Sanchez data set

Format of sanchez.csv data files
COLTYPE    BEETLE96    GUANO    PLANT96
...        ...         ...      ...
...        ...         ...      ...
...        ...         ...      ...

COLTYPE - Type of bird colony (N = no birds, R = roosting, B = breeding)
BEETLE96 - Abundance of beetles (number per carrion trap) in 1996
GUANO - % cover of guano on island in 1995 and 1996
PLANT96 - % cover of total plants (annual and perennial) on island in 1996
[Image: Tenebrionid beetle]

Open the sanchez data file.

Q6-1. For percentage plant cover, calculate the following summary statistics separately for each colony type and complete the table below.

Summary Statistic     No colonies    Roosting colony    Breeding colony
Mean                  HINT
Variance              HINT
Standard deviation    HINT


  1. Which colony type has the greatest variance? (N, R or B)
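A minimal sketch using tapply() (assuming the data frame is called sanchez):

> tapply(sanchez$PLANT96, sanchez$COLTYPE, mean)
> tapply(sanchez$PLANT96, sanchez$COLTYPE, var)
> tapply(sanchez$PLANT96, sanchez$COLTYPE, sd)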

Normality

Before proceeding, make sure you are familiar with the significance of normally distributed sample data and thus why it is necessary to examine the distribution of sample data
as part of routine exploratory data analysis (EDA) prior to any formal data analysis.

Q6-2. Construct a boxplot
for total 1996 beetle abundance for each colony type separately. HINT
  1. Are there any outliers identified? (Y or N)
  2. Describe the shape of each distribution.
  3. Now transform the response variable to logs and redraw the boxplots HINT. Does this change (improve?) the shape of the distributions? (Y or N)

Linearity

Often it is necessary to examine the nature of the relationship or association between variables as part of routine exploratory data analysis (EDA) prior to any formal data analysis. The nature of relationships/associations between continuous data is explored using scatterplots.

Q6-3. Construct a scatterplot
for beetle abundance against
total 1996 plant cover (HINT).
  1. Is there any evidence of non-linearity? (Y or N)
  2. Note that the boxplots also enable us to explore the normality of both variables (populations). Is there any evidence of non-normality? (Y or N)

Sánchez-Piñero & Polis (2000) measured a number of continuous variables (% cover of guano, % cover of plants and abundance of beetles). Therefore, they might be interested in exploring the relationships between each of these variables. That is, the relationship between guano and plants, guano and beetles, and beetles and plants. While it is possible to create separate scatterplots for each pair (in this case three separate scatterplots), a scatterplot matrix is usually more informative and efficient.

Q6-4. Construct a scatterplot matrix or SPLOM
for % of guano, % of plant cover and beetle abundance HINT. Are there any obvious relationships?

Homogeneity of variance

Many statistical hypothesis tests assume that populations are equally varied. For hypothesis tests that compare populations (such as t-tests - see the hypothesis testing section below), it is important that one of the populations is not substantially more or less variable than the other population(s). Thus, such tests assume homogeneity of variance.

Q6-5. Construct and examine boxplots
of beetle abundance for each of the three colony types. HINT
  1. Firstly, is there any evidence of non-normality? (Y or N)
  2. Try square-root transforming the beetle variable (preferred over a log transformation for count data, since log(0) is not defined; the function is sqrt) and use this transformed variable to reconstruct the boxplots. Note that it may be necessary to perform a fourth-root transformation (which is just the square-root transformation applied twice) in order to normalize these highly skewed data. This can be done using the expression sqrt(sqrt(BEETLE96)) HINT or HINT. If this successfully normalizes the data, focus on whether there is any evidence that the populations are equally varied. Was a fourth-root transformation successful? (Y or N)
  3. Try calculating the variance or standard deviation
    of beetle abundance for each colony type separately (remember to use the transformed data, as the raw data was obviously non-normal and non-normality often results in unequal variances). Do these values provide any evidence for unequally varied populations? (Y or N)

Hypothesis testing

Furness & Bryant (1996) studied the energy budgets of breeding northern fulmars (Fulmarus glacialis) in Shetland. As part of their study, they recorded the body mass and metabolic rate of eight male and six female fulmars.

Download Furness data set

Format of furness.csv data files
SEX       METRATE    BODYMASS
MALE      2950       875
FEMALE    1956       765
MALE      2308       780
MALE      2135       790
MALE      1945       788

SEX - Sex of breeding northern fulmars (Fulmarus glacialis)
METRATE - Metabolic rate (hJ/day)
BODYMASS - Body mass (g)
[Image: Northern fulmars]

Open the furness data file.

Q7-1. The researchers were interested in testing whether there is a difference in the metabolic rate of male and female breeding northern fulmars. In light of this, list the following:
  1. The biological hypotheses of interest
  2. The biological null hypotheses
  3. The statistical null hypotheses (H0)

The appropriate statistical test for testing the null hypothesis that the means of two independent populations are equal is a t-test.

Before proceeding, make sure you understand what is meant by normality
and equal variance
as well as the principles of hypothesis testing using a t-test.

Q7-2. For the null hypothesis test of interest (that the mean population metabolic rates of males and females were the same), calculate the degrees of freedom.

Q7-3. Calculate the critical t-values for the following null hypotheses (α = 0.05)
  1. The metabolic rate of males is higher than that of females (one-tailed test)
    HINT
  2. The metabolic rate of males is the same as that of females (two-tailed test)
    HINT
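A minimal sketch using the qt() quantile function (df = 8 + 6 - 2 = 12 for this design):

> qt(0.05, df = 12, lower.tail = FALSE)     # one-tailed critical value
> qt(0.025, df = 12, lower.tail = FALSE)    # two-tailed critical value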

Since most hypothesis tests follow the same basic procedure, confirm that you understand the basic steps of hypothesis tests.

Q7-4.In the table below, list the assumptions of a t-test along with how violations of each assumption are diagnosed and/or the risks of violations are minimized.

Assumption    Diagnostic/Risk Minimization
I.
II.
III.

So, we wish to investigate whether or not male and female fulmars have the same metabolic rates, and we intend to use a t-test to test the null hypothesis that the population mean metabolic rate of males is equal to the population mean metabolic rate of females. Having identified the important assumptions of a t-test, use the samples to evaluate whether the assumptions are likely to be violated and thus whether a t-test is likely to be reliable.

Q7-5. Is there any evidence that: HINT
  1. The assumption of normality has been violated?
  2. The assumption of homogeneity of variance has been violated?

Q7-6. Perform a t-test to examine the effect of sex on the metabolic rate of fulmars using either (whichever is most appropriate) a pooled variance t-test (for when population variances are very similar HINT) or a separate variance t-test (for when the variance of one population is likely to be up to 2.5 times greater or less than that of the other population HINT). Ensure that you are familiar with the output of a t-test.
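A minimal sketch of both forms (assuming the data frame is called furness):

> t.test(METRATE ~ SEX, data = furness, var.equal = TRUE)     # pooled variance t-test
> t.test(METRATE ~ SEX, data = furness, var.equal = FALSE)    # separate variance (Welch) t-test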

  1. What is the t-value? (Excluding the sign. The sign will depend on whether you compared males to females or females to males, and thus only indicates which group had the higher mean.)
  2. What is the df (degrees of freedom)?
  3. What is the p-value?

Q7-7. Write the results out as though you were writing a research paper/thesis. For example (select the phrase that applies and fill in gaps with your results): 
The mean metabolic rate of male fulmars was (choose correct option)
(t = , df = , P = )
the mean metabolic rate of female fulmars.

Q7-8. Construct a bar graph showing the mean metabolic rate of male and female fulmars and an indication of the precision of the means with error bars. HINT
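A minimal sketch of one way to build such a graph in base R (assuming the furness data frame; the hint may use a ready-made plotting function instead):

> means <- tapply(furness$METRATE, furness$SEX, mean)
> ses <- tapply(furness$METRATE, furness$SEX, function(x) sd(x)/sqrt(length(x)))
> xs <- barplot(means, ylim = c(0, max(means + ses)))
> arrows(xs, means + ses, xs, means - ses, angle = 90, code = 3, length = 0.1)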

Paired data

Here is a modified example from Quinn and Keough (2002). Elgar et al. (1996) studied the effect of lighting on the web structure of an orb-spinning spider. They set up wooden frames with two different light regimes (controlled by black or white mosquito netting), light and dim. A total of 17 orb spiders were allowed to spin their webs in both a light frame and a dim frame, with six days 'rest' between trials for each spider, and the vertical and horizontal diameter of each web was measured. Whether each spider was allocated to a light or dim frame first was randomized. The H0's were that each of the two variables (vertical diameter and horizontal diameter of the orb web) was the same in dim and light conditions. Elgar et al. (1996) correctly treated these as paired comparisons because the same spider spun her web in a light frame and a dark frame.

Download Elgar data set

Format of elgar.csv data files
PAIR    VERTDIM    HORIZDIM    VERTLIGH    HORIZLIGH
...     ...        ...         ...         ...
...     ...        ...         ...         ...
...     ...        ...         ...         ...

PAIR - Name given to each pair of webs spun by a particular spider
VERTDIM - The vertical dimension or height (mm) of webs spun in dim conditions
HORIZDIM - The horizontal dimension or width (mm) of webs spun in dim conditions
VERTLIGH - The vertical dimension or height (mm) of webs spun in light conditions
HORIZLIGH - The horizontal dimension or width (mm) of webs spun in light conditions
[Image: Coastal orb-weaving spider]
Note: for paired t-tests, it is traditional for categories to be column labels rather than entries in a categorical variable. Compare the structure of the elgar data set (paired t-test) with that of the furness (standard t-test) data set.

Open the elgar data file.

Q8-1. What is an appropriate statistical test for testing an hypothesis about the difference in dimensions of webs spun in light versus dark conditions? Explain why.

Q8-2. The actual H0 is that the mean of the differences between the pairs (light and dim for each spider) equals zero. Use a paired t-test to test the H0 that the mean of the differences in vertical diameter (HINT) and, separately, in horizontal diameter (HINT) of the web between the pairs (light and dim for each spider) equals zero.
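A minimal sketch of these two tests (assuming the data frame is called elgar):

> t.test(elgar$VERTLIGH, elgar$VERTDIM, paired = TRUE)
> t.test(elgar$HORIZLIGH, elgar$HORIZDIM, paired = TRUE)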

Q8-3. Write the results out as though you were writing a research paper/thesis. For example (select the phrase that applies and fill in gaps with your results): 
The mean vertical diameter of spider webs in dim conditions was (choose correct option)
(t = , df = , P = )
the vertical dimensions in light conditions.
The mean horizontal diameter of spider webs in dim conditions was (choose correct option)
(t = , df = , P = )
the horizontal dimensions in light conditions.

Non-parametric tests

We will now revisit the data set of Furness & Bryant (1996) that was used in the hypothesis testing section above to investigate the effects of gender on the metabolic rates of breeding northern fulmars (Fulmarus glacialis). Furness & Bryant (1996) also recorded the body mass of the eight male and six female fulmars they captured.

Since the males and female fulmars were all independent of one another, a t-test would be appropriate to test the null hypothesis of no difference in mean body weight of male and female fulmars.

Q9-1. Are the assumptions underlying this test met? (Y or N) Hint: check the relative sizes of the two sample variances and the distribution of body weight for each sex.

When the distributional assumptions are violated, parametric tests are unreliable. Under these circumstances, non-parametric tests
can be very useful.

Q9-2. The Wilcoxon-Mann-Whitney test is described as a non-parametric test for comparing two groups.
  1. What null hypothesis does this test actually evaluate?
  2. What are the underlying assumptions of a Wilcoxon-Mann-Whitney test?

Q9-3. If the assumptions are met, test the null hypothesis of no difference in body weight between male and female fulmars using a Wilcoxon test
HINT. Based on this outcome, what are your conclusions?
  1. Statistical:
  2. Biological (include trend):
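A minimal sketch (assuming the furness data frame):

> wilcox.test(BODYMASS ~ SEX, data = furness)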

Q9-4.Construct a bar graph
showing the mean mass of male and female fulmars and an indication of the precision of the means with error bars. HINT

Randomization (permutation) test

A wildlife ecologist responsible for the management of a significant population of southern brown bandicoot, Isoodon obesulus, was interested in determining the impacts that picnickers were having on the health of bandicoots in the park. In particular, he was interested in determining whether bandicoots that occupied areas frequented by picnickers were heavier (and thus fatter) than bandicoots that occurred in other woodland areas. Fifty adult male bandicoots were captured from picnic and woodland areas and the weights of all individuals were measured.

Download Bandicoot data set

Format of bandicoots.csv data files
AREA        WEIGHT
PICNIC      875
PICNIC      765
...         780
WOODLAND    790
WOODLAND    788

AREA - Area from which bandicoots were captured (Picnic or Woodland)
WEIGHT - Capture weight (g) of bandicoots

Open the bandicoots data file.

Q10-1. Are the assumptions underlying this test met? (Y or N) Hint: check the relative sizes of the two sample variances and the distribution of body weight for each area.

Q10-2. There is a clear problem with non-normality and this problem cannot be fixed by a transformation. Why?

When the assumptions of the t-test have been violated, the distribution of all possible t-values cannot reliably be assumed to follow a mathematical t-distribution. So, when the null hypothesis is true and there is no effect of AREA on the WEIGHT of bandicoots, what t-values would we expect?

Q10-3. In this case, it is unlikely that observations (individual bandicoots) were collected via strictly random sampling - it would be logistically impossible to do so. Therefore, since the variances are not wildly unequal, a randomization (or permutation) test is appropriate. Such a test repeatedly shuffles the sample data, each time calculating a specific statistic. So while the assumptions of the t-test have been violated, and therefore the distribution of all possible t-values cannot reliably be assumed to follow a mathematical t-distribution, we can generate a distribution of possible t-values from the randomized t-statistics.

Q10-4. A randomization test involves the following sequence of tasks (see the sketch below):
  1. Define a new function that accepts a data set and returns an appropriate statistic. In this case, since we are comparing two populations, a t-statistic is appropriate. Define an appropriate function.
  2. Next we need to define a function that alters the structure of the data. In this case, we need to define a function that randomly shuffles the categorical variable (group labels) around. Define an appropriate function.
  3. Then we use the 'boot()' function to repeatedly calculate the statistic, each time on the randomly altered data. In this case, we want to repeatedly (100 times) calculate the t-statistic from the data set in which the group labels have been randomly shuffled. Perform the bootstrapping.
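A minimal sketch of these three steps (assuming the data frame is called bandicoots and that the boot package is installed; the function and object names are illustrative only):

> library(boot)
> stat <- function(data, indices) t.test(WEIGHT ~ AREA, data = data)$statistic
> rand.gen <- function(data, mle) {
+     out <- data
+     out$AREA <- sample(out$AREA, replace = FALSE)    # shuffle the group labels
+     out
+ }
> bandicoots.boot <- boot(bandicoots, stat, R = 100, sim = "parametric", ran.gen = rand.gen)
> sum(bandicoots.boot$t >= bandicoots.boot$t0)/length(bandicoots.boot$t)    # proportion of randomized t-values at least as large as the sample t-value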

Q10-5. What is the actual sample t-value in this case? HINT

Q10-6. Having performed the randomization procedure to calculate a set of t-values that we might expect to obtain when the null hypothesis is true, we can now explore the distribution of these t-values. This distribution is in effect a t-distribution. However, rather than being a general, theoretical t-distribution that can be applied to all populations (provided assumptions are met), this distribution is specific to the populations that we are interested in.
  1. Examine the distribution of these t-values (the t-distribution). HINT
  2. Determine what proportion of resampled t-values are as great or greater than our actual sample t-value. This then represents the probability of obtaining our sample t-value when the null hypothesis is true, and is thereby interpreted like any other p-value. What is the p-value?

Q10-7. What conclusions would you draw from the analysis?

Q10-8. Given that generated t-distribution is specific to the population(s) that we are interested in, and is more robust than parametric statistical analyses, we might wonder why parametric analyses are prefered over randomization procedures. Why are parametric analyses prefered?

Power analysis

An ornithologist studying various populations of the Australian magpie, Gymnorhina tibicen, was primarily interested in whether the growth of urban magpies might be stunted as a result of the increased consumption of processed foods. To investigate this hypothesis she intended to measure the total body lengths (in centimeters) of a number of birds from both urban and rural locations. The null hypothesis of interest is that the population mean length of urban magpies is equal to that of rural magpies, and thus a t-test is an appropriate test. Previous research had indicated that the mean body length of rural magpies was 36.87 cm with a standard deviation of 2.

Q11-1. If the ornithologist considered a 10% decrease in mean body length to be of biological significance:
  1. What effect size
    is she interested in detecting? HINT
  2. In order to have an 80% chance of detecting such an effect (if one really exists), how many replicate birds would the ornithologist need to measure from each population (assume significance level of 0.05)? HINT
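A minimal sketch using the base-R power.t.test() function (a 10% decrease corresponds to a delta of 0.1 x 36.87 = 3.687 cm):

> power.t.test(delta = 3.687, sd = 2, power = 0.8, sig.level = 0.05)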

Q11-2. Often it is difficult to obtain estimates of the likely population standard deviation. Similarly, it can be difficult to estimate the effect size (delta). Consequently, it is often preferable to perform power calculations for a range of standard deviations or effect sizes and plot the relationships for each parameter set. To assist the ornithologist to determine sample sizes, estimate the following:
  1. Relationship between power and sample size for a range of effect sizes (3, 4, 5 & 6). HINT. Note, you need to load the biology library
  2. Relationship between power and sample size for a range of standard deviations (1.8,2,2.2). HINT. Note, you need to load the biology library
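As a plain base-R alternative to the biology library helpers, the first of these relationships might be sketched as follows (effect sizes as listed in the question):

> n <- 2:30
> plot(n, sapply(n, function(x) power.t.test(n = x, delta = 3, sd = 2)$power),
+      type = "l", ylim = c(0, 1), xlab = "Sample size", ylab = "Power")
> for (d in c(4, 5, 6))
+     lines(n, sapply(n, function(x) power.t.test(n = x, delta = d, sd = 2)$power))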

Q11-3. Alternatively, the ornithologist's sampling efforts may be constrained somewhat by either ethics or by difficulties in capturing birds, and thus the ornithologist may wish to estimate what the minimum detectable effect size would be for a given range of sample sizes. To assist the ornithologist to determine sample sizes, estimate the following:
  1. Relationship between minimum detectable effect size and sample size for a range of standard deviations (1.8,2,2.2). HINT
  2. Relationship between minimum detectable effect size and sample size for a range of power (0.7,0.8,0.9). HINT

Welcome to the end of Workshop 2

  Print contents

In R, print means to output (list) the contents of an object. By default, the contents are output to the screen (the default output device). It is also possible to redirect output to a file or printer if necessary. The contents of an object are 'printed' by using the 'print()' function or by just entering the name of the object. Consider the following:

> numbers <- c(1, 4, 6, 7, 4, 345, 36, 78)

> print(numbers)

[1]   1   4   6   7   4 345  36  78

> numbers

[1]   1   4   6   7   4 345  36  78

The first line of this syntax generates and populates the numeric vector called 'numbers'. The second line uses the print function to tell R to list the contents of the 'numbers' object - the output of which appears on the third line. The fourth and fifth lines illustrate that the same outcome can be achieved by simply entering the name of the object.

End of instructions

  Assigning entries

Assigning entries is basically the act of defining a new object name and specifying what that object contains. For example, if we wanted to store the number 10.513 as John Howard's IQ, we would instruct R to create a new object called (say) IQ and assign the value 10.513 to it. That is, we instruct R that IQ equals 10.513.
In R, the assignment operator is <- instead of =.

> name <- entry

So to assign IQ the value of 10.513 in R

> IQ <- 10.513

End of instructions

  R object classes

Object classes define how information is stored and displayed. The basic storage unit in R is called a vector. A vector is an array of one or more entries of the same class. The common classes include
  1. numeric - stores a number e.g. 1, 2.345 etc
  2. character - stores alphanumeric characters e.g. 'a', 'fish', 'color1'
  3. logical - stores either TRUE or FALSE
So the entries (1, 2, 3 & 4) might make up a numeric vector, whereas the entries ('Big', 'Small' & 'Tiny') would make up a character vector. To determine the class type of an object, use the following syntax (where name represents the object whose class is to be determined):

> class(name)

End of instructions

  R Factors

There are a number of ways in which this can be done. One way is to use the 'factor' (makes a list into a factor) function in conjunction with the 'c' (concatenation) function.

> name <- factor(c(list of characters/words))

Another way is to use the 'gl' function (which generates factors according to specified patterns of their levels)

> name <- gl(number of levels, number of replicates, length of data set, lab=c(list of level names))

Hence, consider the following alternative solutions;

> sex <- factor(c("Female", "Female", "Female", "Female", "Female",
+     "Female", "Male", "Male", "Male", "Male", "Male", "Male"))

> sex <- factor(c(rep("Female", 6), rep("Male", 6)))

> sex <- gl(2, 6, 12, lab = c("Female", "Male"))

The second option uses the 'rep()' function which in this case is used to repeat the level name (eg 'Female') 6 times.

End of instructions

  Quitting R

There are two ways to quit R elegantly
  1. Go to the RGui File menu and select Quit
  2. In the R Console, type

    > q()

Either way you will be prompted to save the workspace; generally this is not necessary or even desirable.

End of instructions

  The R working directory

The R working directory is the location that R looks in to read/write files. When you load R, it initially sets the working directory to a location within the R subdirectory of your computer. However, we can alter the working directory so that it points to another location. This not only prevents having to search for files, it also removes the need to specify a path as well as a filename in various operations.
  1. Go to the RGui File menu
  2. Select the Change dir submenu
  3. Locate the directory that contains your data and/or scripts
  4. Click OK

End of instructions

  Removing objects

To remove an object

> rm(object name)

where object name is the name of the object to be removed.

End of instructions

  Read in (source) an R script

There are two ways to read in (source) an R script
  1. Go to the RGui File menu and select Source R code
  2. In the R Console, type

    > source('filename')

    where 'filename' is the name of the R script file.
All commands in the file will be executed sequentially starting at the first line. Note that R will ignore everything that is on a line following the '#' character. This provides a way of inserting comments into script files (a practice that is highly encouraged as it provides a way of reminding yourself what each line of syntax does).

Note that there is an alternative way of using scripts

  1. Go to the RGui File menu and select the Open script submenu
  2. This will display the R script file in the R Editor window
  3. Syntax can then be directly cut and pasted into the R Console. Again, each line will be executed sequentially and comments are ignored.
When working with multi-step analyses, it is recommended that you have the R Editor open and that syntax is entered directly into the editor and then pasted into the R Console. Then, provided the R script is saved regularly, your data and analyses are safe from computer disruptions.

End of instructions

  Reading data into R

Ensure that the working directory is pointing to the path containing the file to be imported before proceeding.
To import data into R, we read the contents of a file into a data frame. The general format of the command for reading data into a data frame is

> name <- read.table('filename.csv', header=T, sep=',', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, filename.csv is the name of the csv file that you created in Excel and column is the number of the column that had row labels (if there were any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file). If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example

> phasmid <- read.table("phasmid.csv", header = T, sep = ",", row.names = 1,
+     strip.white = T)

End of instructions

  Exporting from Excel

  • Go to the Excel File menu
  • Select the Save As submenu
  • Provide a path and filename for the file
  • Change the Save as type to CSV (Comma delimited)
  • Click the OK button.
  • You will be warned that only the selected sheet can be saved in this format - click OK
  • You will also be warned that some of the formatting might be lost - again click OK. The formatting that it is referring to are things like colors and fonts, neither of which are important for data analysis.

End of instructions

  R vectors - variables

In biology, a variable is a collection of observations of the same type. For example, a variable might consist of the observed weights of individuals within a sample of 10 bush rats. Each item (or element) in the variable is of the same type (a weight) and will have been measured comparably (same techniques and units). Biological variables are therefore best represented in R by vectors.

End of instructions

  R workspace

The R workspace stores all the objects that are created during an R session. To view a list of objects occupying the current R workspace

> ls()

The workspace is capable of storing huge numbers of objects, and thus it can become cluttered (with objects that you have created but forgotten about, or many similar objects that contain similar contents) very rapidly. Whilst this does not affect the performance of R, it can lead to chronic confusion. It is advisable that you remove all unwanted objects when you have finished with them. To remove an object:

> rm(object name)

where object name is the name of the object to be removed.

When you exit R, you will be prompted for whether you want to save the workspace. You can also save the current workspace at any time by:
  • Go to the RGui File menu
  • Select the Save Workspace submenu
  • Provide a location and file name for the workspace. The default and recommended (though not necessary) file extension is .RData
A saved workspace is not in easily readable text format. Saving a workspace enables you to retrieve an entire session's work instantly, which can be useful if you are forced to perform your analyses over multiple sittings. Furthermore, by providing different names, it is possible to maintain different workspaces for different projects.

End of instructions

  R indexing

A vector is simply an array (list) of entries of the same type. For example, a numeric vector is just a list of numbers. A vector has only a single dimension - length - and thus the first entry in the vector has an index of 1 (is in position 1), the second has an index of 2 (position 2), etc. The index of an entry provides a convenient way to access entries. If a vector contained the numbers 10, 5, 8, 3.1, then the entry with index 1 is 10, and at index 4 is the entry 3.1. Consider the following examples

> var <- c(4, 8, 2, 6, 9, 2)

> var

[1] 4 8 2 6 9 2

> var[5]

[1] 9

> var[3:5]

[1] 2 6 9

A data frame on the other hand is not a single vector, but rather a collection of vectors. It is arranged like a matrix, consisting of columns and rows, and therefore has both length and width. Consequently, each entry has a row index and a column index. Consider the following examples

> dv <- c(4, 8, 2, 6, 9, 2)

> iv <- c("a", "a", "a", "b", "b", "b")

> data <- data.frame(iv, dv)

> data

  iv dv
1  a  4
2  a  8
3  a  2
4  b  6
5  b  9
6  b  2

# list the first column (returned as a data frame)

> data[1]

  iv
1  a
2  a
3  a
4  b
5  b
6  b

#list contents of first column

> data[, 1]

[1] a a a b b b
Levels: a b

#list contents of first row

> data[1, ]

  iv dv
1  a  4

#list the entry in row 3, column 2

> data[3, 2]

[1] 2

End of instructions

  R 'fix()' function

The 'fix()' function provides a very rudimentary spreadsheet for data editing and entry.

> fix(data frame name)

The spreadsheet (Data Editor) will appear. A new variable can be created by clicking on a column heading and then providing a name for the variable as well as an indication of whether the contents are expected to be numeric (numbers) or character (contain some letters). Data are added by clicking on a cell and providing an entry. Closing the Data Editor will cause the changes to come into effect.

End of instructions

  R transformations

The following table illustrates the use of common data transformation functions on a single numeric vector (old_var). In each case a new vector is created (new_var).

Transformation    Syntax
log (base e)

> new_var <- log(old_var)

log (base 10)

> new_var <- log10(old_var)

square root

> new_var <- sqrt(old_var)

arcsin

> new_var <- asin(sqrt(old_var))

scale (mean=0, unit variance)

> new_var <- scale(old_var)

where old_var is the name of the original variable that requires transformation and new_var is the name of the new variable (vector) that is to contain the transformed data. Note that the original vector (variable) is not altered.

End of instructions

  Data transformations


Essentially transforming data is the process of converting the scale in which the observations were measured into another scale.

I will demonstrate the principles of data transformation with two simple examples.
Firstly, to illustrate the legality and frequency of data transformations, imagine you had measured water temperature in a large number of streams. Naturally, you would probably have measured the temperature in °C. Suppose you later wanted to publish your results in an American journal and the editor requested that the results be in °F. You would not need to re-measure the stream temperature. Rather, each of the temperatures could be converted from one scale (°C) to the other (°F). Such transformations are very common.

To further illustrate the process of transformation, imagine a botanist wanted to examine the size of the leaves of a particular species. The botanist decides to measure the length of a random selection of leaves using a standard linear, metric ruler, and the distribution of sample observations as represented by a histogram and boxplot are as follows.


The growth rate of leaves might be expected to be greatest in small leaves and decelerate with increasing leaf size. That is, the growth rate of leaves might be expected to be logarithmic rather than linear. As a result, the distribution of leaf sizes using a linear scale might also be expected to be non-normal (log-normal). If, instead of using a linear scale, the botanist had used a logarithmic ruler, the distribution of leaf sizes may have been more like the following.


If the distribution of observations is determined by the scale used to measure the observations, and the choice of scale (in this case the ruler) is somewhat arbitrary (a linear scale is commonly used because we find it easier to understand), then it is justifiable to convert the data from one scale to another after the data have been collected and explored. It is not necessary to re-measure the data on a different scale. Therefore, to normalize the data, the botanist can simply convert the data to logarithms.

The important points in the process of transformations are;
  1. The order of the data has not been altered (a large leaf measured on a linear scale is still a large leaf on a logarithmic scale), only the spacing of the data
  2. Since the spacing of the data is purely dependent on the scale of the measuring device, there is no reason why one scale is more correct than any other scale
  3. For the purpose of normalization, data can be converted from one scale to another following exploration

End of instructions

  R writing data

To export data from R, we write the contents of a data frame to a file. The general format of the command for writing a data frame to a file is

> write.table(data frame name, 'filename.csv', quote=F, sep=',')

where data frame name is the name of the data frame to be saved and filename.csv is the name of the csv file that is to be created (or overwritten). The argument quote=F indicates that quotation marks should not be placed around entries in the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file).

As an example

> write.table(LEAVES, 'leaves.csv', quote=F, sep=',')

End of instructions

  Cutting and pasting data into R

To import data into R via the clipboard, copy the data in Excel (including a row containing column headings - variable names) to the clipboard (CTRL-C). Then in R use a modification of the read.table function:

> name <- read.table('clipboard', header=T, sep='\t', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, 'clipboard' is used to indicate that the data are on the clipboard, and column is the number of the column that had row labels (if there were any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep='\t' indicates that entries in the clipboard are separated by a TAB. If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example

> phasmid <- read.table("clipboard", header = T, sep = "\t", row.names = 1,
+     strip.white = T)

End of instructions

  Writing a copy of the data to an R script file

> dump('data', "")

where 'data' is the name of the object that you wish to be stored.
Cut and paste the output of this command directly into the top of your R script. In future, when the script file is read, it will include a copy of your data set, preserved exactly.

End of instructions

  Frequency histogram

> hist(variable)

where variable is the name of the numeric vector (variable) for which the histogram is to be constructed.

End of instructions

  Summary Statistics

# mean
> mean(variable)
# variance
> var(variable)
# standard deviation
> sd(variable)
# variable length
> length(variable)
# median
> median(variable)
# maximum
> max(variable)
# minimum
> min(variable)
# standard error of mean
> sd(variable)/sqrt(length(variable))
# interquartile range
> quantile(variable, c(.25,.75))
# 95% confidence interval
> qnorm(0.975,0,1)*(sd(variable)/sqrt(length(variable)))

where variable is the name of the numeric vector (variable).

End of instructions

  Normal probabilities

> pnorm(c(value), mean=mean, sd=sd, lower.tail=FALSE)

This will calculate the area under a normal distribution (beyond the value of value) with a mean of mean and a standard deviation of sd. The argument lower.tail=FALSE indicates that the area is calculated for values greater than value. For example

> pnorm(c(2.9), mean=2.025882, sd=0.4836265, lower.tail=FALSE)

End of instructions

  Boxplots

> boxplot(variable)

where variable is the name of the numeric vector (variable)

End of instructions

  Boxplots

A boxplot is a graphical representation of the distribution of observations based on the 5-number summary that includes the median (50%), quartiles (25% and 75%) and data range (0% - smallest observation and 100% - largest observation). The box demonstrates where the middle 50% of the data fall and the whiskers represent the minimum and maximum observations. Outliers (extreme observations) are also represented when present.
The figure below demonstrates the relationship between the distribution of observations and a boxplot of the same observations. Normally distributed data results in symmetrical boxplots, whereas skewed distributions result in asymmetrical boxplots with each segment getting progressively larger (or smaller).


End of instructions

  Observations, variables & populations

Observations are the sampling units (e.g. quadrats) or experimental units (e.g. individual organisms, aquaria) that make up the sample.

Variables are the actual properties measured by the individual observations (e.g. length, number of individuals, rate, pH, etc). Random variables (those variables whose values are not known for sure before the onset of sampling) can be either continuous (can theoretically be any value within a range, e.g. length, weight, pH, etc) or categorical (can only take on certain discrete values, such as counts - number of organisms etc).

Populations are defined as all the possible observations that we are interested in.

A sample represents a collected subset of the population's observations and is used to represent the entire population. Sample statistics are the characteristics of the sample (e.g. sample mean) and are used to estimate population parameters.


End of instructions

  Population & sample

A population refers to all possible observations. Therefore, population parameters refer to the characteristics of the whole population. For example the population mean.

A sample represents a collected subset of the population's observations and is used to represent the entire population. Sample statistics are the characteristics of the sample (e.g. sample mean) and are used to estimate population parameters.


End of instructions

  Standard error and precision

A good indication of how reliable a single estimate is likely to be is how precise the measure is. Precision is a measure of how repeatable an outcome is. If we could repeat the sampling regime multiple times and each time calculate the sample mean, then we could examine how similar each of the sample means is. So a measure of precision is the degree of variability between the individual sample means from repeated sampling events.

Sample number           Sample mean
1                       12.1
2                       12.7
3                       12.5
Mean of sample means    12.433
SD of sample means      0.306

The table above lists three sample means and also illustrates a number of important points:
  1. Each sample yields a different sample mean
  2. The mean of the sample means should be the best estimate of the true population mean
  3. The more similar (consistent) the sample means are, the more precise any single estimate of the population mean is likely to be
The standard deviation of the sample means from repeated sampling is called the Standard error of the mean.

It is impractical to repeat the sampling effort multiple times, however, it is possible to estimate the standard error (and therefore the precision of any individual sample mean) using the standard deviation (SD) of a single sample and the size (n) of this single sample.

The smaller the standard error (SE) of the mean, the more precise (and possibly more reliable) the estimate of the mean is likely to be.

End of instructions

  Confidence intervals


A 95% confidence interval is an interval that we are 95% confident will contain the true population mean; there is less than a 5% chance that the interval will fail to contain it. The frequentist approach to statistics dictates that if multiple samples are collected from a population and an interval calculated for each sample, 95% of these intervals will contain the true population mean and 5% will not. Therefore there is a 95% probability that any single sample interval will contain the population mean.

The interval is expressed as the mean ± half the interval. The confidence interval is obviously affected by the degree of confidence. In order to have a higher degree of confidence that an interval is likely to contain the population mean, the interval would need to be larger.
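One common calculation of a 95% confidence interval from a single sample is mean ± t(0.975, n-1) × SE. As a minimal sketch in R (y is the hypothetical sample from the box above):

# 95% confidence interval for the population mean
> y <- rnorm(20, mean=12.4, sd=1)
> mean(y) + qt(c(0.025, 0.975), df=length(y)-1) * sd(y)/sqrt(length(y))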

End of instructions

  Successful transformations


Since the objective of a transformation is primarily to normalize the data (although variance homogeneity and linearization may also be beneficial side-effects), the success of a transformation is measured by whether or not it has improved the normality of the data. It is not measured by whether the outcome of a statistical analysis is more favorable!
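A quick way to judge success is to re-examine the distribution after transforming. As a sketch (variable is a placeholder for a positively skewed numeric vector):

# compare the distribution before and after a log transformation
> hist(variable)
> hist(log(variable))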

End of instructions

  Selecting subsets of data

# select the first 5 entries in the variable
> variable[1:5]
# select all but the first entry in the variable
> variable[-1]
# select all cases of variable that are less than num
> variable[variable < num]
# select all cases of variable that are equal to 'Big'
> variable1[variable1 == label]

where variable is the name of a numeric vector (variable), num is a number, variable1 is the name of a character vector and label is a character string (word). The logical statement indicates how the data are to be subsetted. For example, to exclude all values of DOC that are greater than 170 (and thereby exclude the outlier), include only those values that are less than 170 by entering DOC < 170. Alternatively, you could choose to exclude the outlier using its STREAM name by entering STREAM != 'Santa Cruz'.
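The same logic can be applied to whole data frames with the subset() function. The following sketch assumes a data frame called data containing DOC and STREAM variables, as in the example above:

# include only those rows with DOC values less than 170
> data.sub <- subset(data, DOC < 170)
# alternatively, exclude the outlier by its STREAM name
> data.sub <- subset(data, STREAM != 'Santa Cruz')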

End of instructions

  Summary Statistics by groups

> tapply(dv,factor,func)

The tapply() function applies the function func to the numeric vector dv separately for each level of the factor factor. For example:

# calculate the mean separately for each group
> tapply(dv,factor,mean)
# calculate the variance separately for each group
> tapply(dv,factor,var)

End of instructions

  EDA - Normal distribution

Parametric statistical hypothesis tests assume that the population measurements follow a specific distribution. Most commonly, statistical hypothesis tests assume that the population measurements are normally distributed (Question 4 highlights the specific reasoning for this).

It is usually not possible to directly examine the distribution of measurements in the whole population (for the same reasons that it is not possible to directly measure population parameters such as the population mean). However, if the sampling is adequate (unbiased and sufficiently replicated), the distribution of measurements in a sample should reflect the distribution in the population. That is, a sample of measurements taken from a normally distributed population should itself be normally distributed.

Tests on data that do not meet the distributional requirements may not be reliable, and thus, it is essential that the distribution of data be examined routinely prior to any formal statistical analysis.
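Common graphical checks in R are a histogram and a normal quantile-quantile plot. A sketch (variable is a placeholder for any numeric vector, as in the other code boxes):

# histogram of the sample measurements
> hist(variable)
# normal quantile-quantile plot; points close to the line suggest normality
> qqnorm(variable)
> qqline(variable)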


End of instructions

  Factorial boxplots

> boxplot(dv~factor,data=data)

where dv is the name of the numeric vector (dependent variable), factor is the name of the factor (categorical or factor variable) and data is the name of the data frame (data set). The '~' is used in formulas to indicate that the left hand side is modeled as a function of the right hand side.

End of instructions

  EDA - Linearity

The most common methods for analysing the relationship or association between variables assume that the variables are linearly related (or more correctly, that they do not covary according to some function other than linear). For example, to examine the presence and strength of the relationship between two variables (such as those depicted below), the more basic statistical tests require that the variables are not related by any function other than a linear (straight-line) function, if any function at all.

There is no evidence that the data represented in Figure (a) are related (covary) according to any function other than linear. Likewise, there is no evidence that the data represented in Figure (b) are related (covary) according to any function other than linear (if at all). However, there is strong evidence that the data represented in Figure (c) are not related linearly. Hence Figure (c) depicts evidence of a violation of the linearity assumption.


End of instructions

  EDA - Scatterplots

Scatterplots are constructed to present the relationships, associations and trends between continuous variables. By convention, dependent variables are arranged on the Y-axis and independent variables on the X-axis. The variable measurements are used as coordinates and each replicate is represented by a single point.

Figure (c) above displays a linear smoother (linear `line-of-best-fit') through the data. Different smoothers help identify different trends in data.


End of instructions

  Scatterplots

# load the 'car' library
> library(car)
# generate a scatterplot
> scatterplot(dv~iv,data=data)

where dv is the dependent variable, iv is the independent variable and data is the data frame (data set).

End of instructions

  Plotting y against x


The statement 'plot variable1 against variable2' is interpreted as 'plot y against x'. That is, variable1 is considered the dependent variable and is plotted on the y-axis, and variable2 is the independent variable and thus is plotted on the x-axis.
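The formula interface in R follows this convention directly. A sketch (variable1, variable2 and data are placeholders):

# plot variable1 (y-axis) against variable2 (x-axis)
> plot(variable1 ~ variable2, data=data)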

End of instructions

  Scatterplot matrix (SPLOM)

# make sure the car package is loaded
> library(car)
# generate a scatterplot matrix
> scatterplot.matrix(~var1 + var2 + var3, data=data, diagonal='boxplot')
# OR
> pairs(~var1 + var2 + var3, data=data)

where var1, var2 etc are the names of numeric vectors in the data data frame (data set)

End of instructions

  EDA - Homogeneity of variance

Many statistical hypothesis tests assume that the populations from which the sample measurements were collected are equally varied with respect to all possible measurements. For example, are the growth rates of one population of plants (a population treated with fertilizer) more or less varied than the growth rates of another population (a control population treated only with water)? If they are, then the results of many statistical hypothesis tests that compare means may be unreliable; if the populations are equally varied, then the tests are more likely to be reliable.
Obviously, it is not possible to determine the variability of entire populations. However, if sampling is unbiased and sufficiently replicated, then the variability of samples should be directly proportional to the variability of their respective populations. Consequently, samples that have substantially different degrees of variability provide strong evidence that the populations are likely to be unequally varied.
There are a number of options available to assess, from samples, whether the populations are likely to be equally varied.
1. Calculate the variance or standard deviation of the samples. If one sample's value is more than 2 times greater or less than that of the other sample(s), then there is some evidence for unequal population variability (non-homogeneity of variance) - see the sketch after this list
2. Construct boxplots of the measurements for each population, and compare the lengths of the boxplots. The length of a symmetrical (and thus normally distributed) boxplot is a robust measure of the spread of values, and thus an indication of the spread of population values. Therefore, if any of the boxplots are more than 2 times longer or shorter than the other boxplot(s), then there is, again, some evidence for unequal population variability (non-homogeneity of variance)
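A minimal sketch of the first option in R (dv, factor and data are placeholders, as in the other code boxes):

# calculate the variance of each group
> vars <- with(data, tapply(dv, factor, var))
# ratio of the largest to the smallest group variance; ratios much
# greater than about 2 suggest non-homogeneity of variance
> max(vars)/min(vars)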


End of instructions

  t test

The frequentist approach to hypothesis testing is based on estimating the probability of collecting the observed sample(s) when the null hypothesis is true. That is, how rare would it be to collect samples that displayed the observed degree of difference if the samples had been collected from identical populations. The degree of difference between two collected samples is objectively represented by a t statistic.

The t statistic is calculated as t = (y1 - y2) / √(s1²/n1 + s2²/n2), where y1 and y2 are the sample means of groups 1 and 2 respectively, and the denominator √(s1²/n1 + s2²/n2) represents the degree of precision in the difference between the means by taking into account the degree of variability of each sample.
If the null hypothesis is true (that is, the means of population 1 and population 2 are equal), the degree of difference (t value) between an unbiased sample collected from population 1 and an unbiased sample collected from population 2 should be close to zero (0). It is unlikely that an unbiased sample collected from population 1 will have a mean substantially higher or lower than an unbiased sample from population 2. That is, it is unlikely that such samples could result in a very large (positive or negative) t value if the null hypothesis (of no difference between populations) was true. The question is, how large (either positive or negative) does the t value need to be before we conclude that the null hypothesis is unlikely to be true?

What is needed is a method by which we can determine the likelihood of any conceivable t value when the null hypothesis is true. This can be done via simulation: we can simulate the collection of random samples from two identical populations and build up the distribution of all possible t values.

Let's say a silviculturalist was interested in comparing the size of Eucalyptus regnans seedlings grown under shade and full sun conditions. In this case we have two populations: one population represents all the possible E. regnans seedlings grown in shade conditions, and the other population represents all the possible E. regnans seedlings grown in full sun conditions. If we had grown 200 seedlings under shade conditions and 200 seedlings under full sun conditions, then these samples can be used to assess the null hypothesis that the mean size of an infinite number (population) of E. regnans seedlings grown under shade conditions is the same as the mean size of an infinite number (population) of E. regnans seedlings grown under full sun conditions. That is, that the population means are equal.

We can firstly simulate the sizes of 200 seedlings grown under shade conditions and another 200 seedlings grown under full sun conditions that could arise naturally when shading has no effect on seedling growth. That is, we can simulate one possible outcome when the null hypothesis is true and shading has no effect on seedling growth.
Now we can calculate the degree of difference between the mean sizes of seedlings grown under the two different conditions, taking into account natural variation (that is, we can calculate a t value using the formula above). From this simulation we may have found that the mean sizes of seedlings grown in shade and full sun were 31.5cm and 33.7cm respectively and the degree of difference (t value) was 0.25. This represents only one possible outcome (t value). We now repeat this simulation process a large number of times (e.g. 1000) and generate a histogram (or more correctly, a distribution) of the t value outcomes that are possible when the null hypothesis is true.
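This simulation can be mimicked in R. The following is a minimal sketch; the common population mean of 32 cm and standard deviation of 10 are hypothetical values chosen purely for illustration:

# generate 1000 t values from samples drawn from identical populations
> set.seed(1)
> tvals <- replicate(1000, {
    shade <- rnorm(200, mean=32, sd=10)  # 200 simulated shade-grown seedlings
    sun <- rnorm(200, mean=32, sd=10)    # 200 simulated sun-grown seedlings
    t.test(shade, sun, var.equal=T)$statistic
})
# distribution of t values possible when the null hypothesis is true
> hist(tvals)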

It should be obvious that when the null hypothesis is true (and the two populations are the same), the majority of t values calculated from samples containing 200 seedlings will be close to zero (0) - indicating only small degrees of difference between the samples. However, it is also important to notice that it is possible (albeit less likely) to have samples that are quite different from one another (large positive or negative t values) just by pure chance (for example t values greater than 2).

It turns out that it is not necessary to perform these simulations each time you test a null hypothesis. There is a mathematical formula to estimate the t distribution appropriate for any given sample size (or more correctly, degrees of freedom) when the null hypothesis is true. In this case, the t distribution is for (200-1)+(200-1)=398 degrees of freedom.

At this stage we would calculate the t value from our actual observed samples (the 200 real seedlings grown under each of the conditions). We then compare this t value to our distribution of all possible t values (t distribution) to determine how likely our t value is when the null hypothesis is true. The simulated t distribution suggests that very large (positive or negative) t values are unlikely. The t distribution allows us to calculate the probability of obtaining a value greater than a specific t value. For example, we could calculate the probability of obtaining a t value of 2 or greater when the null hypothesis is true, by determining the area under the t distribution beyond a value of 2.
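This tail area can be obtained directly from the cumulative t distribution in R. A sketch (t.value and df are placeholders for the observed t value and the degrees of freedom):

# probability of obtaining a t value this large or larger
# when the null hypothesis is true
> pt(t.value, df=df, lower.tail=F)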

If the calculated t value was 2, then the probability of obtaining this t value (or one greater) when the null hypothesis is true is 0.012 (or 1.2%). Since the probability of obtaining our t value or one greater (and therefore the probability of a sample of 200 seedlings grown in the shade differing so much in size from a sample of 200 seedlings grown in full sun) is so low, we would conclude that the null hypothesis of no effect of shading on seedling growth is likely to be false. Thus, we have provided strong evidence that shading conditions do affect seedling growth!


End of instructions

  t-test degrees of freedom

Degrees of freedom is the number of observations that are free to vary when estimating a parameter. For a t-test, df is calculated as
df = (n1-1)+(n2-1)
where n1 and n2 are the sample sizes of populations 1 and 2 respectively.

End of instructions

  One-tailed critical t-value

One-tail critical t-values are just calculated from a t distribution (of given df). The area under the t distribution curve above (or below) this value is 0.05 (or some other specified probability). Essentially we calculate a quantile of a specified probability from a t distribution of df degrees of freedom

# calculate critical t value (α = 0.05) for upper tail (e.g. A larger than B)
> qt(0.05, df=df, lower.tail=F)
# calculate critical t value (α = 0.05) for lower tail (e.g. B larger than A)
> qt(0.05, df=df, lower.tail=T)

where 0.05 is the specified α value and df is the specified degrees of freedom. Note that it is not necessary to have collected the data before calculating the critical t-value, you only need to know sample sizes (to get df).

It actually doesn't matter whether you set lower.tail to TRUE or FALSE. The t-distribution is a symmetrical distribution centered around 0; therefore, the critical t-value is the same for both the lower tail (e.g. population 2 greater than population 1) and the upper tail (e.g. population 1 greater than population 2), except that the lower-tail value is always negative. As it is often less confusing to work with positive values, it is recommended that you use upper-tail values. An example of a t-distribution with the upper tail marked for a one-tailed test is depicted below. Note that this is not part of the t quantiles output!


End of instructions

  Two-tailed critical t-value

Two-tail critical t-values are just calculated from a t distribution (of given df). The combined area under the t distribution curve above the positive of this value and below the negative of this value is 0.05 (or some other specified probability). Essentially we calculate a quantile of a specified probability from a t distribution of df degrees of freedom. In a two-tailed test, half of the probability is associated with the area above the positive critical t-value and the other half with the area below the negative critical t-value. Therefore, when we use the quantile to calculate this critical t-value, we specify the probability as α/2 - since α/2 applies to each tail.

# calculate critical t value (α = 0.05) for the upper tail (e.g. A different to B)
> qt(0.05/2, df=df, lower.tail=F)

where 0.05 is the specified α value and df is the specified degrees of freedom. Note that it is not necessary to have collected the data before calculating the critical t-value, you only need to know sample sizes (to get df).

Again, it doesn't matter whether you set lower.tail to TRUE or FALSE. For a symmetrical distribution centered around 0, the critical t-value is the same for both the lower tail (e.g. population 2 greater than population 1) and the upper tail (e.g. population 1 greater than population 2), except that the lower-tail value is always negative. As it is often less confusing to work with positive values, it is recommended that you use upper-tail values. An example of a t-distribution with the upper tail marked for a two-tailed test is depicted below. Note that this is not part of the t quantiles output!

End of instructions

  Basic steps of Hypothesis testing


Step 1 - Clearly establish the statistical null hypothesis. Therefore, start off by considering the situation where the null hypothesis is true - e.g. when the two population means are equal

Step 2 - Establish a critical statistical criterion (e.g. α = 0.05)

Step 3 - Collect independent, unbiased samples

Step 4 - Assess the assumptions relevant to the statistical hypothesis test. For a t test:
  1.  Normality
  2.  Homogeneity of variance

Step 5 - Calculate test statistic appropriate for null hypothesis (e.g. a t value)


Step 6 - Compare observed test statistic to a probability distribution for that test statistic when the null hypothesis is true for the appropriate degrees of freedom (e.g. compare the observed t value to a t distribution).

Step 7 - If the observed test statistic is greater (in magnitude) than the critical value for that test statistic (based on the predefined critical criterion), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis, and the null hypothesis is therefore rejected; otherwise, we conclude that there is insufficient evidence to reject the null hypothesis. Alternatively, we calculate the probability of obtaining the observed test statistic (or one of greater magnitude) when the null hypothesis is true. If this probability is less than our predefined critical criterion (e.g. 0.05), the null hypothesis is rejected; otherwise, we conclude that there is insufficient evidence to reject it.

End of instructions

  Pooled variance t-test

> t.test(dv~factor,data=data,var.equal=T)

where dv is the name of the dependent variable, factor is the name of the categorical/factorial variable and data is the name of the data frame (data set). The argument var.equal=T indicates a pooled variances t-test

End of instructions

  Separate variance t-test

> t.test(dv~factor,data=data,var.equal=F)

where dv is the name of the dependent variable, factor is the name of the categorical/factorial variable and data is the name of the data frame (data set). The argument var.equal=F indicates a separate variances t-test

End of instructions

  Output of t-test


The following outputs are based on simulated data sets:
1.  Pooled variance t-test for populations with equal (or nearly so) variances

2.  Separate variance t-test for population with unequal variances

End of instructions

  Bar graphs

# load the 'biology' extension library written by Murray.
> library(biology)
# calculate the means
> means <- tapply(dv, factor, mean)
# calculate the standard deviation
> sds <- tapply(dv, factor, sd)
# calculate the lengths
> ns <- tapply(dv, factor, length)
# calculate the standard errors
> ses <- sds/sqrt(ns)
# plot the bars
> xs <- barplot(means,beside=T)
# load package containing error bar function
> library(Hmisc)
# plot the error bars
> errbar(xs, means, means+ses, means-ses, add=T)

OR

# use the Mbargraph() function to construct a bargraph
> Mbargraph(furness$METRATE, furness$SEX, errorbars="se")

End of instructions

  Paired t-test

> with(data, t.test(cat1, cat2, paired=T))

where cat1 and cat2 are two numeric vectors (variables) in the data data frame (data set). The argument paired=T indicates a paired t-test.

End of instructions

  Non-parametric tests

Non-parametric tests do not place any distributional limitations on data sets and are therefore useful when the assumption of normality is violated. There are a number of alternatives to parametric tests; the most common are:

1. Randomization tests - rather than assume that a test statistic calculated from the data follows a specific mathematical distribution (such as a normal distribution), these tests generate their own test statistic distribution by repeatedly re-sampling or re-shuffling the original data and recalculating the test statistic each time. A p value is subsequently calculated as the proportion of random test statistics that are greater than or equal to the test statistic based on the original (un-shuffled) data.

2. Rank-based tests - these tests operate not on the original data, but rather on data that have first been ranked (each observation is assigned a rank, such that the largest observation is given the value of 1, the next largest is 2, and so on). It turns out that, for a given sample size, the probability distribution of a rank-based test statistic is identical regardless of the distribution of the original data.

End of instructions

  Wilcoxon test

> wilcox.test(dv ~ factor, data=data)

where dv is the dependent variable and factor is the categorical variable from the data data frame (data set).

End of instructions

  Randomization (permutation) tests

The basis of hypothesis testing (when comparing two populations) is to determine how often we would expect to collect a specific sample (or one more unusual) if the populations were identical. That is, would our sample be considered unusual under normal circumstances? To determine this, we need to estimate what sort of samples we would expect under normal circumstances. The theoretical t-distribution mimics what it would be like to randomly resample repeatedly (with a given sample size) from such identical populations.

A single sample from the (proposed) population(s) provides a range of observations that are possible. Completely shuffling those data (or their labels) should mimic the situation when there is no effect (the null hypothesis true situation), since the data are then effectively assigned randomly with respect to the effect. In the case of a t-test comparing the means of two groups, randomly shuffling the data is likely to result in similar group means. This provides one possible outcome (t-value) that might be expected when the null hypothesis is true. If this shuffling process is repeated multiple times, a distribution of all of the outcomes (t-values) that are possible when the null hypothesis is true can be generated. We can then determine how unusual our real (unshuffled) t-value is by comparing it to this built-up t-distribution. If the real t-value is expected to occur less than 5% of the time, we reject the null hypothesis.


End of instructions

  Defining the randomization statistic

> stat <- function(data, indices){
    # t statistic for column 2 (response) against column 1 (group labels)
    t <- t.test(data[,2]~data[,1])$statistic
    t
}

The above function takes a data set (data) and calculates a test statistic (in this case a t-statistic). The illustrated function uses the t.test() function to perform a t-test on column 2 ([,2]) against column 1 ([,1]). The value of the t-statistic is stored in an object called 't'. The function returns the value of this object. The indentations are not important, they are purely used to improve human readability.

End of instructions

  Defining the data shuffling procedure

> rand.gen <- function(data, mle){
    out <- data
    out[,1] <- sample(out[,1], replace=F)
    out
}

The above function defines how the data should be shuffled. The first line creates a temporary object (out) that stores the original data set (data) so that any alterations do not affect the original data set. The second line uses the sample() function to shuffle the first column (the group labels). The third line of the function just returns the shuffled data set.

End of instructions

  Perform randomization test (bootstrapping)

> library(boot)
> coots.boot <- boot(coots, stat, R=100, sim="parametric", ran.gen=rand.gen)

The sequence begins by ensuring that the boot package has been loaded into memory. Then the boot() function is used to repeatedly (R=100: repeat 100 times) perform the defined statistical procedure (stat). The ran.gen=rand.gen parameter defines how the data set should be altered prior to performing the statistical procedure, and the sim="parametric" parameter indicates that all randomizations are possible (as opposed to a permutation, where each configuration can only occur once). The outcome of the randomization test (bootstrap) is stored in an object called coots.boot.

End of instructions

  Calculating p-value from randomization test

> p.length <- length(coots.boot$t[coots.boot$t >= coots.boot$t0]) + 1
> print(p.length/(coots.boot$R + 1))

The coots.boot object contains a list of all of the t-values calculated from the resampling (shuffling and recalculating) procedure (coots.boot$t) as well as the actual t-value from the actual data set (coots.boot$t0). It also stores the number of randomizations that it performed (coots.boot$R).

The first line in the above sequence determines how many of the possible t values (including the actual t-value) are greater than or equal to the actual t-value. It does this by selecting the coots.boot$t values that are greater than or equal to the coots.boot$t0 value (coots.boot$t[coots.boot$t >= coots.boot$t0]) and calculating the length of this selection using the length() function. One (1) is added to this length, since the actual t-value must be included as a possible t-value.
The second line expresses the number of randomization t-values greater or equal to the actual t-value as a fraction of the total number of randomizations (again adding 1 to account for the actual situation). This is interpreted as any other p-value - the probability of obtaining our sample statistic (and thus our observations) when the null hypothesis is true.

End of instructions

  Effect size

In a t-test, the effect size is the absolute difference between the expected population means. Therefore if the expected population means of population A and B are 10 and 16 respectively, then the effect size is 6.

Typically, effect size is estimated by considering the magnitude of effect that is either likely or would be regarded as biologically significant. For example, if the overall population mean was estimated to be 12 and the researchers regarded a 15% increase as biologically significant, then the effect size is the difference between 12 and a mean 15% greater than 12 (12 + (12 × 0.15) = 13.8). Therefore the effect size is 13.8 - 12 = 1.8.

End of instructions