

Workshop 2 - Data importation and exploratory data analysis

23 April 2011

Basic statistics references

  • Logan (2010) - Chpt 1, 2 & 6
  • Quinn & Keough (2002) - Chpt 1, 2, 3 & 4

Basic syntax

Q1-1. Complete the following table by assigning the following entries (numbers etc.) to the corresponding object names and determining the object class for each.

Name    Entry              Syntax    Class
a       100                hint
b       Big                hint
var1    100 & 105 & 110    hint
var2    5 + 6              hint
var3    150 to 250         hint

  1. Print out the contents of the vector you called 'a'. Notice that the output appears on the line under the syntax that you entered, and that the output is preceded by a [1]. This indicates that the value returned (100) is the first entry in the vector.
  2. Print out the contents of the vector called 'b'. Again notice that the output is preceded by [1].
  3. Print out the contents of the vector called 'var1'.
  4. Print out the contents of the vector called 'var2'. Notice that the output contains the result of the expression (11) rather than the expression itself.
  5. Print out the contents of the vector called 'var3'. Notice that the output contains 101 entries (150 to 250) and that it spans multiple lines on the screen. Each new line begins with a number in square brackets [] to indicate the index of the entry that immediately follows.
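As a guide, here is a minimal sketch of one possible set of assignments for the table above:

> a <- 100
> b <- "Big"
> var1 <- c(100, 105, 110)
> var2 <- 5 + 6
> var3 <- 150:250
> class(a)
[1] "numeric"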

Variables - vectors

Q1-2. Generate the following numeric vectors (variables)
  1. The numbers 1, 4, 7 & 9 (call the object y)
  2. The numbers 10 to 25 (call the object y1)

  3. The sequence of numbers 2, 4, 6, 8...20 (call the object y2)
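A minimal sketch of how these vectors might be generated:

> y <- c(1, 4, 7, 9)
> y1 <- 10:25
> y2 <- seq(2, 20, by = 2)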


Q1-3. Generate the following character vectors (factorial/categorical variables)
  1. A factor that lists the sex of individuals as 6 females followed by 6 males

  2. A factor called TEMPERATURE that lists 10 cases of 'High', 10 of 'Medium' and 10 of 'Low'
  3. A factor called TEMPERATURE that lists 'High', 'Medium' and 'Low' alternating 10 times
  4. A factor called TEMPERATURE that lists 10 cases of 'High', 8 cases of 'Medium' and 11 cases of 'Low'
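A minimal sketch of some possible solutions, using factor(), rep() and gl() (all described in the instruction boxes at the end of this page):

> SEX <- gl(2, 6, 12, lab = c("Female", "Male"))
> TEMPERATURE <- gl(3, 10, 30, lab = c("High", "Medium", "Low"))
> TEMPERATURE <- factor(rep(c("High", "Medium", "Low"), 10))
> TEMPERATURE <- factor(c(rep("High", 10), rep("Medium", 8), rep("Low", 11)))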

Q1-4. Print out the contents of the 'TEMPERATURE' factor. A list of factor levels will be printed on the screen. This will take up multiple lines, each of which will start with a number in square brackets [ ] to indicate the index number of the entry immediately to the right. At the end, the output also indicates what the names of the factor levels are. These are listed in alphabetical order.


Data sets - Data frames(R)

Rarely is only a single biological variable collected. Data are usually collected in sets of variables reflecting tests of relationships, differences between groups, multiple characterizations etc. Consequently, data sets are best organized into collections of variables (vectors). Such collections are called data frames in R.
Data frames are generated by combining multiple vectors together, whereby each vector becomes a separate column in the data frame. In order for a data frame to represent the data properly, the sequence in which observations appear in the vectors (variables) must be the same for each vector and each vector should have the same number of observations. For example, the first observations from each of the vectors to be included in the data frame must represent observations collected from the same sampling unit.

To demonstrate the use of data frames in R, we will use fictitious data representing the areas of leaves of two species of Japanese Boxwood.

Format of the fictitious data set
PLANT    SPECIES    AREA
P1       B.semp     25
P2       B.semp     22
P3       B.semp     29
P4       B.micro    15
P5       B.micro    17
P6       B.micro    20

PLANT - An identifier for each individual plant that was measured (a single leaf was measured from each individual plant)
SPECIES - Categorical listing of whether the individual plant was Buxus sempervirens (B.semp) or Buxus microphyllum (B.micro)
AREA - The surface area (mm2) of the leaf measured - Response variable
[Image: Leaves]

Q2-1. Let's create the data set in a series of steps. Use the textbox provided in part 7 below to record the R syntax used in each step.
  1. First create the categorical (factor) variable containing the listing of B.semp three times and B.micro three times
  2. Now create the dependent variable (numeric vector) containing the leaf areas
  3. Combine the two variables (vectors) into a single data set (data frame) called LEAVES
  4. Print (to the screen) the contents of this new data set called LEAVES
  5. You will have noticed that the names of the rows are listed as 1 to 6 (this is the default). In the table above, we can see that there is a variable called PLANT that listed unique plant identification labels. These labels are of no use for any statistics, however, they are useful for identifying particular observations. Consequently it would be good to incorporate these labels as row names in the data set. Create a variable called PLANT that contains a listing of the plant identifications
  6. Use this plant identification label variable to define the row names in the data frame called LEAVES
  7. In the textbox provided below, list each of the lines of R syntax required to generate the data set
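A minimal sketch of one possible set of answers for these steps:

> SPECIES <- factor(c(rep("B.semp", 3), rep("B.micro", 3)))
> AREA <- c(25, 22, 29, 15, 17, 20)
> LEAVES <- data.frame(SPECIES, AREA)
> LEAVES
> PLANT <- c("P1", "P2", "P3", "P4", "P5", "P6")
> rownames(LEAVES) <- PLANT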

The above syntax forms a list of instructions that R can perform. Such lists are called scripts. Scripts offer the following:

  • Enable a sequence of tasks such as data entry, analysis and graphical preparation to be repeated quickly and precisely
  • Ensure that the sequence of tasks used to complete an analysis are permanently documented
  • Simplify performing many similar analyses
  • Simplify sharing of data, analyses and techniques

Q2-3.To see how to use a script,
  1. close down R
  2. restart R
  3. Change the working directory (path)
    to the location where you saved the script file in Q2-2 above
  4. Source the script file
Q2-4.There are now at least four objects in the R workspace.
These should be LEAVES (the data frame - data set), PLANT (the list of plant IDs), SPECIES (the character vector of plant species) and AREA (the numeric vector of leaf areas).
  1. Print (list on screen) the contents of the AREA vector. Note that this lists the contents of the stand-alone AREA vector - it is not the same as asking for the contents of the AREA vector within the LEAVES data frame. For example, multiply all of the numbers in the AREA vector by 2. Now print the contents of the AREA vector and then the LEAVES data frame. Notice that only the values in the stand-alone AREA vector have changed - the values within the AREA vector of the LEAVES data frame were not affected.
  2. To avoid confusion and clutter, it is therefore always best to remove the single vectors once a data frame has been created. Remove the PLANT, SPECIES and AREA vectors.
  3. Notice what happens when you now try to access the AREA vector.
  4. To access a variable from within a data frame, we use the $ sign. Print the contents of the AREA vector within the LEAVES data frame.

Q2-5.Since data are stored in vectors, it is possible to access single entries or specific groups of entries. A specific entry is accessed via its index.
To investigate the range of options, complete the following table.
Access                                                                 Syntax
print the LEAVES data set                                              hint
print the first leaf area in the LEAVES data set                       hint
print the first 3 leaf areas in the LEAVES data set                    hint
print a list of leaf areas that are greater than 20                    hint
print a list of leaf areas for the B.microphyllum species              hint
print the section of the data set that contains the B.microphyllum species    hint
alter the second leaf area from 22 to 23                               hint
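A minimal sketch of the kinds of indexing expressions involved (assuming the LEAVES data frame constructed above):

> LEAVES
> LEAVES$AREA[1]
> LEAVES$AREA[1:3]
> LEAVES$AREA[LEAVES$AREA > 20]
> LEAVES$AREA[LEAVES$SPECIES == "B.micro"]
> LEAVES[LEAVES$SPECIES == "B.micro", ]
> LEAVES$AREA[2] <- 23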

Q2-6. Although it is possible to do some data editing this way, for more major editing procedures it is better to either return to Excel or use the 'fix()' function.
Use the 'fix()' function to make a number of changes to the data frame (data set) including adding another column of data (that might represent another variable).

Q2-7.Sometimes it is necessary to transform
a variable from one scale to another. While it is possible to modify an existing variable (vector), it is safer to create a new variable that contains the altered values. Examine the use of R for common transformations.
Transform the leaf areas to log (base 10).
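For example, a log (base 10) transformation might look like the following (creating a new variable rather than overwriting the original):

> LEAVES$LOGAREA <- log10(LEAVES$AREA)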


Importing data and data files

Although it is possible to generate a data set from scratch using the procedures demonstrated in the above demonstration module, often data sets are better managed with spreadsheet software. R is not designed to be a spreadsheet, and thus it is necessary to import data into R. We will use the following small data set (in which the feeding metabolic rate of stick insects fed two different diets was recorded) to demonstrate how a data set is imported into R.

Format of the fictitious data set
PHASMID    DIET     MET.RATE
P1         tough    1.25
P2         tough    1.22
P3         tough    1.29
P4         soft     1.51
P5         soft     1.55
P6         soft     1.48

PHASMID - An identifier for each individual stick insect (phasmid) that was measured
DIET - Categorical listing of whether the food consumed was considered to be tough or soft
MET.RATE - The feeding metabolic rate (mg O2/min/g) of phasmids - Response variable

Q3-1. Importing data into R from Excel is a multi-stage process.
  1. Enter the above data set into Excel and save the sheet as a comma delimited text file (CSV). Ensure that the column titles (variable names) are in the first row and that you take note of where the file is saved. To see the format of this file, open it in Notepad (the Windows accessory program). Notice that it is just a straight text file; there is no encryption or encoding.
  2. Ensure that the current working directory is set to the location of this file
  3. Read (import) the data set into a data table
    . Since data exploration and analysis cannot begin until the data is imported into R, the syntax of this step would usually be on the first line in a new script file that is stored with the comma delimited text file version of the data set.
  4. To ensure that the data have been successfully imported, print the data frame
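A minimal sketch of the import step (assuming the file was saved as phasmid.csv in the current working directory):

> phasmid <- read.table("phasmid.csv", header = T, sep = ",", row.names = 1, strip.white = T)
> phasmid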

Q3-2. As well as importing files, it is often necessary to save a data set (data frame) - particularly if it has been modified and you wish to retain the changes. To demonstrate how to export a data set, we need a data frame (data set) to export. If the LEAVES data frame (from Exercise 2 above) is no longer present, regenerate the LEAVES data set from above using the script file that was generated in Q2-2. To export an R data frame to a text file, you need to write the data frame to a file.
  1. Examine the contents of this comma delimited text file using Notepad
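A minimal sketch of the export step referred to above, using the standard write.table() function:

> write.table(LEAVES, "leaves.csv", quote = F, sep = ",")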

Q3-3. Alternatively, it is also possible to copy and paste data from Excel into R (via the clipboard). Although this method is quicker, there is no record in an R script file as to which Excel file the data originally came from. Furthermore, changes to the Excel data sheet will not be accounted for. Read the data in from the clipboard.


Q3-4. Since there is no link between the data and the script when data are imported via the clipboard, it is recommended that the data be stored as a structure within your R script above any commands that use these data. Place a copy of the data within the R script file that you generated earlier.

Population parameters

The little spotted kiwi (Apteryx owenii) is a very rare flightless bird that is extinct on mainland New Zealand and survives as 1000 individuals on Kapiti Island. In order to monitor the population, researchers in the recovery team systematically captured all of the individuals in the population over a two week period. Each individual was weighed, banded, assessed and released. The file kiwi.csv lists the weights of each individual male little spotted kiwi in the population.

Download Kiwi data set

Format of kiwi.csv data file
Band     Weight
64955    1.749
65318    2.551
64612    1.768
64393    2.327
64092    2.127
...      ...

Band - Unique bird identification band number
Weight - Weight (kg) of the individual male birds
[Image: Spotted kiwi]

Open the kiwi data file. HINT.

Generate a frequency histogram
of male kiwi weights. HINT. This distribution represents the population (all possible observations) of male kiwi weights.  Note that this is the statistical population and not a biological population - obviously a biological population entirely lacking in females would not last long!

Q4-1. How would you describe the shape of the distribution?

Since we have the weights of all male kiwi in the population, it is possible to calculate population parameters (such as the population mean and standard deviation) directly!

Q4-2. What is the mean (a location measure) and standard deviation (a measure of spread) of the population?
  1. Mean HINT
  2. SD HINT

Assuming the population is normally distributed, it is possible to calculate the probability that a randomly recaptured male kiwi will weigh greater than a particular value, less than a particular value, or weigh within a range of weights. This probability is just the area under a particular region of a normal distribution and can be calculated using normal probabilities.

Q4-3. Assuming that the population is normally distributed, what is the probability of recapturing a male little spotted kiwi that weighs greater than 2.9 kg? HINT

For data sets with large numbers of observations, the distribution of observations can be examined via a histogram - as demonstrated above. However, histograms are only meaningful for summarizing large data sets. For smaller data sets other exploratory tools (such as boxplots)
are necessary. To appreciate the relationship between boxplots and the underlying distribution of data, construct a boxplot
of male kiwi weights. HINT

Samples as estimates of populations

Here is a modified example from Quinn and Keough (2002). Lovett et al. (2000) studied the chemistry of forested watersheds in the Catskill Mountains in New York State. They had 38 sites and recorded the concentrations of ten chemical variables, averaged over three years. We will look at two of these variables, dissolved organic carbon (DOC) and hydrogen ions (H).

Download Lovett data set

Format of lovett.csv data files
STREAM          DOC      H
Hunter          180.4    0.48
West Kill       108.8    0.24
Mill            104.7    0.47
Kelly Hollow    84.5     0.23
Pigeon          82.4     0.37

STREAM - Name of the site (stream) from which observations were collected
DOC - Dissolved organic carbon concentration (mmol.L-1)
H - Hydrogen ion concentration (mmol.L-1)
[Image: A river in the Catskill Mountains]

Open the lovett data file.

Q5-1. What is the purpose of sampling?

Before continuing, make sure you are clear on what the observations, variables and populations
are.

Construct a boxplot
of dissolved organic carbon (DOC) from the sample observations. HINT

Q5-2. How would you describe the boxplot?

Q5-3. Are there any outliers? (Y or N)

Provided the data were collected without bias (ideally random) and with adequate replication, the sample should reflect the entire population. Therefore sample statistics
should be good estimates of the population parameters.

Q5-4. Calculate the sample mean
HINT

The mean of a sample is considered to be a location characteristic of the sample. Along with the mean, it is often desirable to characterize the spread of data in a sample - that is to determine how variable the sample is.

Q5-5. Calculate the sample standard deviation
HINT

For most purposes, the sample itself is of little interest - it is purely used to estimate the population. Therefore it is necessary to be able to estimate how well the sample mean estimates the true population mean. The Standard error (SE) of the mean is a measure of the precision
of the mean.

Q5-6. Calculate the standard error of the mean HINT

Following on from the idea of precision of the mean, is the concept of confidence intervals
, by which an interval is calculated that we are 95% confident will contain the true population mean.

Q5-7. Calculate the 95% confidence interval of the mean. HINT
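A minimal sketch of these calculations (assuming the DOC observations are in a vector called DOC, e.g. lovett$DOC, and using a normal quantile for the confidence interval as in the hint):

> mean(DOC)
> sd(DOC)
> sd(DOC)/sqrt(length(DOC))                    # standard error of the mean
> qnorm(0.975) * sd(DOC)/sqrt(length(DOC))     # half-width of the 95% confidence interval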

Construct a boxplot
of hydrogen concentration (H) from the sample observations HINT

Q5-8. How would you describe the boxplot?

Many statistical analyses assume that the population from which the sample was collected is normally distributed. However, biological data is not always normally distributed. To normalize the data, try transforming
to logs.
HINT

Q5-9. Does the transformation successfully normalize
these data? (Y or N)

Earlier we identified the presence of an outlier. To investigate the impact of this outlier on a range of summary statistics, calculate the following measures of location (mean and median) and spread (standard deviation and interquartile range) for DOC, with and without the outlying observation, and complete the table below.

Summary Statistic       DOC     Modified DOC
Mean                    HINT    HINT
Median                  HINT    HINT
Variance                HINT    HINT
Standard deviation      HINT    HINT
Inter-quartile range    HINT    HINT

Q5-10. Which measures of location and spread are most robust to inclusion and exclusion of a single unusual observation?

Exploratory data analysis

Sánchez-Piñero & Polis (2000) studied the effects of seabirds on tenebrionid beetles on islands in the Gulf of California. These beetles are the dominant consumers on these islands and it was envisaged that seabirds leaving guano and carrion would increase beetle productivity. They had a sample of 25 islands and recorded the beetle density, the type of bird colony (roosting, breeding, no birds), % cover of guano and % plant cover of annuals and perennials.

Download Sanchez data set

Format of sanchez.csv data files
COLTYPE    BEETLE96    GUANO    PLANT96
...        ...         ...      ...
...        ...         ...      ...
...        ...         ...      ...

COLTYPE - Type of bird colony (N = no birds, R = roosting, B = breeding)
BEETLE96 - Abundance of beetles (number per carrion trap) in 1996
GUANO - % cover of guano on island in 1995 and 1996
PLANT96 - % cover of total plants (annual and perennial) on island in 1996
[Image: Tenebrionid beetle]

Open the sanchez data file.

Q6-1. For percentage plant cover, calculate the following summary statistics separately for each colony type and complete the table below.

Summary Statistic     No colonies    Roosting colony    Breeding colony
Mean                  HINT
Variance              HINT
Standard deviation    HINT


  1. Which colony type has the greatest variance? (N, R or B)
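A minimal sketch using tapply() (assuming the data frame is called sanchez):

> tapply(sanchez$PLANT96, sanchez$COLTYPE, mean)
> tapply(sanchez$PLANT96, sanchez$COLTYPE, var)
> tapply(sanchez$PLANT96, sanchez$COLTYPE, sd)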

Normality

Before proceeding, make sure you are familiar with the significance of normally distributed sample data and thus why it is necessary to examine the distribution of sample data
as part of routine exploratory data analysis (EDA) prior to any formal data analysis.

Q6-2. Construct a boxplot
for total 1996 beetle abundance for each colony type separately. HINT
  1. Are there any outliers identified? (Y or N)
  2. Describe the shape of each distribution.
  3. Now transform the response variable to logs and redraw the boxplots HINT. Does this change (improve?) the shape of the distributions? (Y or N)

Linearity

Often it is necessary to examine the nature of the relationship or association between variables as part of routine exploratory data analysis (EDA) prior to any formal data analysis. The nature of relationships/associations between continuous data is explored using scatterplots.

Q6-3. Construct a scatterplot
for beetle abundance against
total 1996 plant cover (HINT).
  1. Is there any evidence of non-linearity? (Y or N)
  2. Note that the boxplots also enable us to explore the normality of both variables (populations). Is there any evidence of non-normality? (Y or N)

Sánchez-Piñero & Polis (2000) measured a number of continuous variables (% cover of guano, % cover of plants and abundance of beetles). Therefore, they might be interested in exploring the relationships between each of these variables. That is, the relationship between guano and plants, guano and beetles, and beetles and plants. While it is possible to create separate scatterplots for each pair (in this case three separate scatterplots), a scatterplot matrix is usually more informative and efficient.

Q6-4. Construct a scatterplot matrix or SPLOM
for % of guano, % of plant cover and beetle abundance HINT. Are there any obvious relationships?

Homogeneity of variance

Many statistical hypothesis tests assume that populations are equally varied. For hypothesis tests that compare populations (such as t-tests - see the hypothesis testing section below), it is important that one of the populations is not substantially more or less variable than the other population(s). Thus, such tests assume homogeneity of variance.

Q6-5. Construct and examine boxplots
of beetle abundance for each of the three colony types. HINT
  1. Firstly, is there any evidence of non-normality? (Y or N)
  2. Try square-root transforming the beetle variable (preferred over a log transformation for count data, since log(0) is not defined; the function is sqrt) and use this transformed variable to reconstruct the boxplots. Note that it may be necessary to perform a fourth-root transformation (which is just the square-root transformation applied twice) in order to normalize these highly skewed data. This can be done using the expression sqrt(sqrt(BEETLE96)) HINT or HINT. If this successfully normalizes the data, focus on whether there is any evidence that the populations are equally varied. Was a fourth-root transformation successful? (Y or N)
  3. Try calculating the variance or standard deviation
    of beetle abundance for each colony type separately (remember to use the transformed data, as the raw data was obviously non-normal and non-normality often results in unequal variances). Do these values provide any evidence for unequally varied populations? (Y or N)

Hypothesis testing

Furness & Bryant (1996) studied the energy budgets of breeding northern fulmars (Fulmarus glacialis) in Shetland. As part of their study, they recorded the body mass and metabolic rate of eight male and six female fulmars.

Download Furness data set

Format of furness.csv data files
SEX       METRATE    BODYMASS
MALE      2950       875
FEMALE    1956       765
MALE      2308       780
MALE      2135       790
MALE      1945       788

SEX - Sex of breeding northern fulmars (Fulmarus glacialis)
METRATE - Metabolic rate (hJ/day)
BODYMASS - Body mass (g)
[Image: Northern fulmars]

Open the furness data file.

Q7-1. The researchers were interested in testing whether there is a difference in the metabolic rate of male and female breeding northern fulmars. In light of this, list the following:
  1. The biological hypotheses of interest
  2. The biological null hypotheses
  3. The statistical null hypotheses (H0)

The appropriate statistical test for testing the null hypothesis that the means of two independent populations are equal is a t-test.

Before proceeding, make sure you understand what is meant by normality
and equal variance
as well as the principles of hypothesis testing using a t-test.

Q7-2. For the null hypothesis test of interest (that the mean population metabolic rates of males and females were the same), calculate the degrees of freedom.

Q7-3. Calculate the critical t-values for the following null hypotheses (α = 0.05)
  1. The metabolic rate of males is higher than that of females (one-tailed test)
    HINT
  2. The metabolic rate of males is the same as that of females (two-tailed test)
    HINT
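A minimal sketch using the qt() quantile function (df = 8 + 6 - 2 = 12 for this design):

> qt(0.05, df = 12, lower.tail = FALSE)     # one-tailed critical value
> qt(0.025, df = 12, lower.tail = FALSE)    # two-tailed critical value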

Since most hypothesis tests follow the same basic procedure, confirm that you understand the basic steps of hypothesis tests.

Q7-4.In the table below, list the assumptions of a t-test along with how violations of each assumption are diagnosed and/or the risks of violations are minimized.

Assumption    Diagnostic/Risk Minimization
I.
II.
III.

So, we wish to investigate whether or not male and female fulmars have the same metabolic rates, and we intend to use a t-test to test the null hypothesis that the population mean metabolic rate of males is equal to the population mean metabolic rate of females. Having identified the important assumptions of a t-test, use the samples to evaluate whether the assumptions are likely to be violated and thus whether a t-test is likely to be reliable.

Q7-5. Is there any evidence that: HINT
  1. The assumption of normality has been violated?
  2. The assumption of homogeneity of variance has been violated?

Q7-6. Perform a t-test to examine the effect of sex on the metabolic rate of fulmars using either (whichever is most appropriate) a pooled variance t-test (for when population variances are very similar HINT) or a separate variance t-test (for when the variance of one population is likely to be up to 2.5 times greater or less than that of the other population HINT). Ensure that you are familiar with the output of a t-test.
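A minimal sketch of both forms (assuming the data frame is called furness):

> t.test(METRATE ~ SEX, data = furness, var.equal = TRUE)     # pooled variance t-test
> t.test(METRATE ~ SEX, data = furness, var.equal = FALSE)    # separate variance (Welch) t-test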

  1. What is the t-value? (Excluding the sign. The sign will depend on whether you compared males to females or females to males, and thus only indicates which group had the higher mean.)
  2. What is the df (degrees of freedom)?
  3. What is the p-value?

Q7-7. Write the results out as though you were writing a research paper/thesis. For example (select the phrase that applies and fill in gaps with your results): 
The mean metabolic rate of male fulmars was (choose correct option)
(t = , df = , P = )
the mean metabolic rate of female fulmars.

Q7-8. Construct a bar graph showing the mean metabolic rate of male and female fulmars and an indication of the precision of the means with error bars. HINT
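A minimal sketch of one way to build such a graph in base R (assuming the furness data frame; the hint may use a ready-made plotting function instead):

> means <- tapply(furness$METRATE, furness$SEX, mean)
> ses <- tapply(furness$METRATE, furness$SEX, function(x) sd(x)/sqrt(length(x)))
> xs <- barplot(means, ylim = c(0, max(means + ses)))
> arrows(xs, means + ses, xs, means - ses, angle = 90, code = 3, length = 0.1)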

Paired data

Here is a modified example from Quinn and Keough (2002). Elgar et al. (1996) studied the effect of lighting on the web structure of an orb-spinning spider. They set up wooden frames with two different light regimes (controlled by black or white mosquito netting), light and dim. A total of 17 orb spiders were allowed to spin their webs in both a light frame and a dim frame, with six days 'rest' between trials for each spider, and the vertical and horizontal diameter of each web was measured. Whether each spider was allocated to a light or dim frame first was randomized. The H0's were that each of the two variables (vertical diameter and horizontal diameter of the orb web) was the same in dim and light conditions. Elgar et al. (1996) correctly treated these as paired comparisons because the same spider spun her web in a light frame and a dark frame.

Download Elgar data set

Format of elgar.csv data files
PAIR    VERTDIM    HORIZDIM    VERTLIGH    HORIZLIGH
...     ...        ...         ...         ...
...     ...        ...         ...         ...
...     ...        ...         ...         ...

PAIR - Name given to each pair of webs spun by a particular spider
VERTDIM - The vertical dimension or height (mm) of webs spun in dim conditions
HORIZDIM - The horizontal dimension or width (mm) of webs spun in dim conditions
VERTLIGH - The vertical dimension or height (mm) of webs spun in light conditions
HORIZLIGH - The horizontal dimension or width (mm) of webs spun in light conditions
[Image: Coastal orb-weaving spider]
Note: for paired t-tests, it is traditional for categories to be column labels rather than entries in a categorical variable. Compare the structure of the elgar data set (paired t-test) with that of the furness (standard t-test) data set.

Open the elgar data file.

Q8-1. What is an appropriate statistical test for testing an hypothesis about the difference in dimensions of webs spun in light versus dark conditions? Explain why.

Q8-2. The actual H0 is that the mean of the differences between the pairs (light and dim for each spider) equals zero. Use a paired t-test to test the H0 that the mean of the differences in vertical diameter (HINT) and, separately, in horizontal diameter (HINT) of the web between the pairs (light and dim for each spider) equals zero.
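A minimal sketch of these two tests (assuming the data frame is called elgar):

> t.test(elgar$VERTLIGH, elgar$VERTDIM, paired = TRUE)
> t.test(elgar$HORIZLIGH, elgar$HORIZDIM, paired = TRUE)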

Q8-3. Write the results out as though you were writing a research paper/thesis. For example (select the phrase that applies and fill in gaps with your results): 
The mean vertical diameter of spider webs in dim conditions was (choose correct option)
(t = , df = , P = )
the vertical dimensions in light conditions.
The mean horizontal diameter of spider webs in dim conditions was (choose correct option)
(t = , df = , P = )
the horizontal dimensions in light conditions.

Non-parametric tests

We will now revisit the data set of Furness & Bryant (1996) that was used in the hypothesis testing section above to investigate the effects of gender on the metabolic rates of breeding northern fulmars (Fulmarus glacialis). Furness & Bryant (1996) also recorded the body mass of the eight male and six female fulmars they captured.

Since the males and female fulmars were all independent of one another, a t-test would be appropriate to test the null hypothesis of no difference in mean body weight of male and female fulmars.

Q9-1. Are the assumptions underlying this test met? (Y or N) Hint: check the relative sizes of the two sample variances and the distribution of body weight for each sex.

When the distributional assumptions are violated, parametric tests are unreliable. Under these circumstances, non-parametric tests
can be very useful.

Q9-2. The Wilcoxon-Mann-Whitney test is described as a non-parametric test for comparing two groups.
  1. What null hypothesis does this test actually evaluate?
  2. What are the underlying assumptions of a Wilcoxon-Mann-Whitney test?

Q9-3. If the assumptions are met, test the null hypothesis of no difference in body weight between male and female fulmars using a Wilcoxon test
HINT. Based on this outcome, what are your conclusions?
  1. Statistical:
  2. Biological (include trend):
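A minimal sketch (assuming the furness data frame):

> wilcox.test(BODYMASS ~ SEX, data = furness)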

Q9-4.Construct a bar graph
showing the mean mass of male and female fulmars and an indication of the precision of the means with error bars. HINT

Randomization (permutation) test

A wildlife ecologist responsible for the management of a significant population of southern brown bandicoot, Isoodon obesulus, was interested in determining the impacts that picnickers were having on the health of bandicoots in the park. In particular, he was interested in determining whether bandicoots that occupied areas frequented by picnickers were heavier (and thus fatter) than bandicoots that occurred in other woodland areas. Fifty adult male bandicoots were captured from picnic and woodland areas and the weights of all individuals were measured.

Download Bandicoot data set

Format of bandicoots.csv data files
AREA        WEIGHT
PICNIC      875
PICNIC      765
...         780
WOODLAND    790
WOODLAND    788

AREA - Area from which bandicoots were captured (Picnic or Woodland)
WEIGHT - Capture weight (g) of bandicoots

Open the bandicoots data file.

Q10-1. Are the assumptions underlying this test met? (Y or N) Hint: check the relative sizes of the two sample variances and the distribution of body weight for each area.

Q10-2. There is a clear problem with non-normality and this problem cannot be fixed by a transformation. Why?

When the assumptions of the t-test have been violated, the distribution of all possible t-values cannot reliably be assumed to follow a mathematical t-distribution. So, when the null hypothesis is true and there is no effect of AREA on the WEIGHT of bandicoots, what t-values would we expect?

Q10-3. In this case, it is unlikely that observations (individual bandicoots) were collected via strictly random sampling - it would be logistically impossible to do so. Therefore, since the variances are not wildly unequal, a randomization (or permutation) test is appropriate. Such a test repeatedly shuffles the sample data, each time calculating a specific statistic. So while the assumptions of the t-test have been violated, and therefore the distribution of all possible t-values cannot reliably be assumed to follow a mathematical t-distribution, we can generate a distribution of possible t-values from the randomized t-statistics.

Q10-4. A randomization test involves the following sequence of tasks (see the sketch below):
  1. Define a new function that accepts a data set and returns an appropriate statistic. In this case, since we are comparing two populations, a t-statistic is appropriate. Define an appropriate function.
  2. Next we need to define a function that alters the structure of the data. In this case, we need to define a function that randomly shuffles the categorical variable (group labels) around. Define an appropriate function.
  3. Then we use the 'boot()' function to repeatedly calculate the statistic, each time on the randomly altered data. In this case, we want to repeatedly (100 times) calculate the t-statistic from the data set in which the group labels have been randomly shuffled. Perform the bootstrapping.
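A minimal sketch of these three steps (assuming the data frame is called bandicoots and that the boot package is installed; the function and object names are illustrative only):

> library(boot)
> stat <- function(data, indices) t.test(WEIGHT ~ AREA, data = data)$statistic
> rand.gen <- function(data, mle) {
+     out <- data
+     out$AREA <- sample(out$AREA, replace = FALSE)    # shuffle the group labels
+     out
+ }
> bandicoots.boot <- boot(bandicoots, stat, R = 100, sim = "parametric", ran.gen = rand.gen)
> sum(bandicoots.boot$t >= bandicoots.boot$t0)/length(bandicoots.boot$t)    # proportion of randomized t-values at least as large as the sample t-value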

Q10-5. What is the actual sample t-value in this case? HINT

Q10-6. Having performed the randomization procedure to calculate a set of t-values that we might expect to obtain when the null hypothesis is true, we can now explore the distribution of these t-values. This distribution is in effect a t-distribution. However, rather than being a general, theoretical t-distribution that can be applied to all populations (provided assumptions are met), this distribution is specific to the populations that we are interested in.
  1. Examine the distribution of these t-values (the t-distribution). HINT
  2. Determine what proportion of resampled t-values are as great or greater than our actual sample t-value. This then represents the probability of obtaining our sample t-value when the null hypothesis is true, and is thereby interpreted like any other p-value. What is the p-value?

Q10-7. What conclusions would you draw from the analysis?

Q10-8. Given that generated t-distribution is specific to the population(s) that we are interested in, and is more robust than parametric statistical analyses, we might wonder why parametric analyses are prefered over randomization procedures. Why are parametric analyses prefered?

Power analysis

An ornithologist studying various populations of the Australian magpie, Gymnorhina tibicen, was primarily interested in whether the growth of urban magpies might be stunted as a result of the increased consumption of processed foods. To investigate this hypothesis she intended to measure the total body lengths (in centimeters) of a number of birds from both urban and rural locations. The null hypothesis of interest is that the population mean length of urban magpies is equal to that of rural magpies, and thus a t-test is an appropriate test. Previous research had indicated that the mean body length of rural magpies was 36.87 cm with a standard deviation of 2.

Q11-1. If the ornithologist considered a 10% decrease in mean body length to be of biological significance:
  1. What effect size
    is she interested in detecting? HINT
  2. In order to have an 80% chance of detecting such an effect (if one really exists), how many replicate birds would the ornithologist need to measure from each population (assume significance level of 0.05)? HINT
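A minimal sketch using the base-R power.t.test() function (a 10% decrease corresponds to a delta of 0.1 x 36.87 = 3.687 cm):

> power.t.test(delta = 3.687, sd = 2, power = 0.8, sig.level = 0.05)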

Q11-2. Often it is difficult to obtain estimates of the likely population standard deviation. Similarly, it can be difficult to estimate the effect size (delta). Consequently, it is often preferable to perform power calculations for a range of standard deviations or effect sizes and plot the relationships for each parameter set. To assist the ornithologist to determine sample sizes, estimate the following:
  1. Relationship between power and sample size for a range of effect sizes (3, 4, 5 & 6). HINT. Note, you need to load the biology library
  2. Relationship between power and sample size for a range of standard deviations (1.8,2,2.2). HINT. Note, you need to load the biology library
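As a plain base-R alternative to the biology library helpers, the first of these relationships might be sketched as follows (effect sizes as listed in the question):

> n <- 2:30
> plot(n, sapply(n, function(x) power.t.test(n = x, delta = 3, sd = 2)$power),
+      type = "l", ylim = c(0, 1), xlab = "Sample size", ylab = "Power")
> for (d in c(4, 5, 6))
+     lines(n, sapply(n, function(x) power.t.test(n = x, delta = d, sd = 2)$power))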

Q11-3. Alternatively, the ornithologist's sampling efforts may be constrained somewhat by either ethics or by difficulties in capturing birds, and thus the ornithologist may wish to estimate what the minimum detectable effect size would be for a given range of sample sizes. To assist the ornithologist to determine sample sizes, estimate the following:
  1. Relationship between minimum detectable effect size and sample size for a range of standard deviations (1.8,2,2.2). HINT
  2. Relationship between minimum detectable effect size and sample size for a range of power (0.7,0.8,0.9). HINT

Welcome to the end of Workshop 2

  Print contents

In R, print means to output (list) the contents of an object. By default, the contents are output to the screen (the default output device). It is also possible to redirect output to a file or printer if necessary. The contents of an object are 'printed' by using the 'print()' function or by just entering the name of the object. Consider the following:

> numbers <- c(1, 4, 6, 7, 4, 345, 36, 78)

> print(numbers)

[1]   1   4   6   7   4 345  36  78

> numbers

[1]   1   4   6   7   4 345  36  78

The first line of this syntax generates and populates the numeric vector called 'numbers'. The second line uses the print function to tell R to list the contents of the 'numbers' object - the output of which appears on the third line. The fourth and fifth lines illustrate that the same outcome can be achieved by simply entering the name of the object.

End of instructions

  Assigning entries

Assigning entries is basically the act of defining a new object name and specifying what that object contains. For example, if we wanted to store the number 10.513 as John Howard's IQ, we would instruct R to create a new object called (say) IQ and assign the value 10.513 to it. That is, we instruct R that IQ equals 10.513.
In R, the assignment operator is <- instead of =.

> name <- entry

So to assign IQ the value of 10.513 in R

> IQ <- 10.513

End of instructions

  R object classes

Object classes define how information is stored and displayed. The basic storage unit in R is called a vector. A vector is an array of one or more entries of the same class. The common classes include
  1. numeric - stores a number e.g. 1, 2.345 etc
  2. character - stores alphanumeric characters e.g. 'a', 'fish', 'color1'
  3. logical - stores either TRUE or FALSE
So the entries (1, 2, 3 & 4) might make up a numeric vector, whereas the entries ('Big', 'Small' & 'Tiny') would make up a character vector. To determine the class type of an object, use the following syntax (where name represents the object whose class is to be determined):

> class(name)

End of instructions

  R Factors

There are a number of ways in which this can be done. One way is to use the 'factor' (makes a list into a factor) function in conjunction with the 'c' (concatenation) function.

> name <- factor(c(list of characters/words))

Another way is to use the 'gl' function (which generates factors according to specified patterns of their levels)

> name <- gl(number of levels, number of replicates, length of data set, lab=c(list of level names))

Hence, consider the following alternative solutions;

> sex <- factor(c("Female", "Female", "Female", "Female", "Female",
+     "Female", "Male", "Male", "Male", "Male", "Male", "Male"))

> sex <- factor(c(rep("Female", 6), rep("Male", 6)))

> sex <- gl(2, 6, 12, lab = c("Female", "Male"))

The second option uses the 'rep()' function which in this case is used to repeat the level name (eg 'Female') 6 times.

End of instructions

  Quitting R

There are two ways to quit R elegantly
  1. Go to the RGui File menu and select Quit
  2. In the R Console, type

    > q()

Either way you will be prompted to save the workspace; generally this is not necessary or even desirable.

End of instructions

  The R working directory

The R working directory is the location that R looks in to read/write files. When you load R, it initially sets the working directory to a location within the R subdirectory of your computer. However, we can alter the working directory so that it points to another location. This not only prevents having to search for files, it also removes the need to specify a path as well as a filename in various operations.
  1. Go to the RGui File menu
  2. Select the Change dir submenu
  3. Locate the directory that contains your data and/or scripts
  4. Click OK

End of instructions

  Removing objects

To remove an object

> rm(object name)

where object name is the name of the object to be removed.

End of instructions

  Read in (source) an R script

There are two ways to read in (source) an R script
  1. Go to the RGui File menu and select Source R code
  2. In the R Console, type

    > source('filename')

    where 'filename' is the name of the R script file.
All commands in the file will be executed sequentially starting at the first line. Note that R will ignore everything that is on a line following the '#' character. This provides a way of inserting comments into script files (a practice that is highly encouraged as it provides a way of reminding yourself what each line of syntax does).

Note that there is an alternative way of using scripts

  1. Go to the RGui File menu and select the Open script submenu
  2. This will display the R script file in the R Editor window
  3. Syntax can then be directly cut and pasted into the R Console. Again, each line will be executed sequentially and comments are ignored.
When working with multi-step analyses, it is recommended that you have the R Editor open and that syntax is entered directly into the editor and then pasted into the R Console. Then, provided the R script is saved regularly, your data and analyses are safe from computer disruptions.

End of instructions

  Reading data into R

Ensure that the working directory is pointing to the path containing the file to be imported before proceeding.
To import data into R, we read the contents of a file into a data frame. The general format of the command for reading data into a data frame is

> name <- read.table('filename.csv', header=T, sep=',', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, filename.csv is the name of the csv file that you created in Excel and column is the number of the column that had row labels (if there were any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file). If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example

> phasmid <- read.table("phasmid.csv", header = T, sep = ",", row.names = 1,
+     strip.white = T)

End of instructions

  Exporting from Excel

  • Go to the Excel File menu
  • Select the Save As submenu
  • Provide a path and filename for the file
  • Change the Save as type to CSV (Comma delimited)
  • Click the OK button.
  • You will be warned that only the selected sheet can be saved in this format - click OK
  • You will also be warned that some of the formatting might be lost - again click OK. The formatting that it is referring to are things like colors and fonts, neither of which are important for data analysis.

End of instructions

  R vectors - variables

In biology, a variable is a collection of observations of the same type. For example, a variable might consist of the observed weights of individuals within a sample of 10 bush rats. Each item (or element) in the variable is of the same type (a weight) and will have been measured comparably (same techniques and units). Biological variables are therefore best represented in R by vectors.

End of instructions

  R workspace

The R workspace stores all the objects that are created during an R session. To view a list of objects occupying the current R workspace

> ls()

The workspace is capable of storing huge numbers of objects, and thus it can become cluttered (with objects that you have created but forgotten about, or many similar objects that contain similar contents) very rapidly. Whilst this does not affect the performance of R, it can lead to chronic confusion. It is advisable that you remove all unwanted objects when you have finished with them. To remove an object:

> rm(object name)

where object name is the name of the object to be removed.

When you exit R, you will be prompted for whether you want to save the workspace. You can also save the current workspace at any time by:
  • Go to the RGui File menu
  • Select the Save Workspace submenu
  • Provide a location and file name for the workspace. The default and recommended (though not necessary) file extension is .RData
A saved workspace is not in easily readable text format. Saving a workspace enables you to retrieve an entire session's work instantly, which can be useful if you are forced to perform your analyses over multiple sittings. Furthermore, by providing different names, it is possible to maintain different workspaces for different projects.

End of instructions

  R indexing

A vector is simply an array (list) of entries of the same type. For example, a numeric vector is just a list of numbers. A vector has only a single dimension - length - and thus the first entry in the vector has an index of 1 (is in position 1), the second has an index of 2 (position 2), etc. The index of an entry provides a convenient way to access entries. If a vector contained the numbers 10, 5, 8, 3.1, then the entry with index 1 is 10, and at index 4 is the entry 3.1. Consider the following examples

> var <- c(4, 8, 2, 6, 9, 2)

> var

[1] 4 8 2 6 9 2

> var[5]

[1] 9

> var[3:5]

[1] 2 6 9

A data frame on the other hand is not a single vector, but rather a collection of vectors. It is arranged like a matrix, consisting of columns and rows, and therefore has both length and width. Consequently, each entry has a row index and a column index. Consider the following examples

> dv <- c(4, 8, 2, 6, 9, 2)

> iv <- c("a", "a", "a", "b", "b", "b")

> data <- data.frame(iv, dv)

> data

  iv dv
1  a  4
2  a  8
3  a  2
4  b  6
5  b  9
6  b  2

# list the first column (returned as a data frame)

> data[1]

  iv
1  a
2  a
3  a
4  b
5  b
6  b

#list contents of first column

> data[, 1]

[1] a a a b b b
Levels: a b

#list contents of first row

> data[1, ]

  iv dv
1  a  4

#list the entry in row 3, column 2

> data[3, 2]

[1] 2

End of instructions

  R 'fix()' function

The 'fix()' function provides a very rudimentary spreadsheet for data editing and entry.

> fix(data frame name)

The spreadsheet (Data Editor) will appear. A new variable can be created by clicking on a column heading and then providing a name for the variable as well as an indication of whether the contents are expected to be numeric (numbers) or character (contain some letters). Data are added by clicking on a cell and providing an entry. Closing the Data Editor will cause the changes to come into effect.

End of instructions

  R transformations

The following table illustrates the use of common data transformation functions on a single numeric vector (old_var). In each case a new vector is created (new_var).

Transformation    Syntax
log (base e)

> new_var <- log(old_var)

log (base 10)

> new_var <- log10(old_var)

square root

> new_var <- sqrt(old_var)

arcsin

> new_var <- asin(sqrt(old_var))

scale (mean=0, unit variance)

> new_var <- scale(old_var)

where old_var is the name of the original variable that requires transformation and new_var is the name of the new variable (vector) that is to contain the transformed data. Note that the original vector (variable) is not altered.

End of instructions

  Data transformations


Essentially transforming data is the process of converting the scale in which the observations were measured into another scale.

I will demonstrate the principles of data transformation with two simple examples.
Firstly, to illustrate the legality and frequency of data transformations, imagine you had measured water temperature in a large number of streams. Naturally, you would probably have measured the temperature in °C. Suppose you later wanted to publish your results in an American journal and the editor requested that the results be in °F. You would not need to re-measure the stream temperature. Rather, each of the temperatures could be converted from one scale (°C) to the other (°F). Such transformations are very common.

To further illustrate the process of transformation, imagine a botanist wanted to examine the size of the leaves of a particular species. The botanist decides to measure the length of a random selection of leaves using a standard linear, metric ruler, and the distribution of sample observations as represented by a histogram and boxplot are as follows.


The growth rate of leaves might be expected to be greatest in small leaves and decelerate with increasing leaf size. That is, the growth rate of leaves might be expected to be logarithmic rather than linear. As a result, the distribution of leaf sizes using a linear scale might also be expected to be non-normal (log-normal). If, instead of using a linear scale, the botanist had used a logarithmic ruler, the distribution of leaf sizes may have been more like the following.


If the distribution of observations is determined by the scale used to measure the observations, and the choice of scale (in this case the ruler) is somewhat arbitrary (a linear scale is commonly used because we find it easier to understand), then it is justifiable to convert the data from one scale to another after the data have been collected and explored. It is not necessary to re-measure the data on a different scale. Therefore, to normalize the data, the botanist can simply convert the data to logarithms.

The important points in the process of transformations are;
  1. The order of the data has not been altered (a large leaf measured on a linear scale is still a large leaf on a logarithmic scale), only the spacing of the data
  2. Since the spacing of the data is purely dependent on the scale of the measuring device, there is no reason why one scale is more correct than any other scale
  3. For the purpose of normalization, data can be converted from one scale to another following exploration

End of instructions

  R writing data

To export data from R, we write the contents of a data frame to a file. The general format of the command for writing a data frame to a file is

> write.table(data frame name, 'filename.csv', quote=F, sep=',')

where data frame name is the name of the data frame to be saved and filename.csv is the name of the csv file that is to be created (or overwritten). The argument quote=F indicates that quotation marks should not be placed around entries in the file. The argument sep=',' indicates that entries in the file are separated by a comma (hence a comma delimited file).

As an example

> write.table(LEAVES, 'leaves.csv', quote=F, sep=',')

End of instructions

  Cutting and pasting data into R

To import data into R via the clipboard, copy the data in Excel (including a row containing column headings - variable names) to the clipboard (CTRL-C). Then in R use a modification of the read.table function:

> name <- read.table('clipboard', header=T, sep='\t', row.names=column, strip.white=T)

where name is a name you wish the data frame to be referred to as, 'clipboard' is used to indicate that the data are on the clipboard, and column is the number of the column that had row labels (if there were any). The argument header=T indicates that the variable (vector) names will be created from the names supplied in the first row of the file. The argument sep='\t' indicates that entries in the clipboard are separated by a TAB. If the data file does not contain row labels, or you are not sure whether it does, it is best to omit the row.names=column argument. The strip.white=T argument ensures that no leading or trailing spaces are left in character names (these can be particularly nasty in categorical variables!).

As an example

> phasmid <- read.table("clipboard", header = T, sep = "\t", row.names = 1,
+     strip.white = T)

End of instructions

  Writing a copy of the data to an R script file

> dump('data', "")

where 'data' is the name of the object that you wish to be stored.
Cut and paste the output of this command directly into the top of your R script. In future, when the script file is read, it will include a copy of your data set, preserved exactly.

End of instructions

  Frequency histogram

> hist(variable)

where variable is the name of the numeric vector (variable) for which the histogram is to be constructed.

End of instructions

  Summary Statistics

# mean
> mean(variable)
# variance
> var(variable)
# standard deviation
> sd(variable)
# variable length
> length(variable)
# median
> median(variable)
# maximum
> max(variable)
# minimum
> min(variable)
# standard error of mean
> sd(variable)/sqrt(length(variable))
# interquartile range
> quantile(variable, c(.25,.75))
# 95% confidence interval
> qnorm(0.975,0,1)*(sd(variable)/sqrt(length(variable)))

where variable is the name of the numeric vector (variable).

End of instructions

  Normal probabilities

> pnorm(c(value), mean=mean, sd=sd, lower.tail=FALSE)

This will calculate the area under a normal distribution (beyond the value of value) with a mean of mean and a standard deviation of sd. The argument lower.tail=FALSE indicates that the area is calculated for values greater than value. For example

> pnorm(c(2.9), mean=2.025882, sd=0.4836265, lower.tail=FALSE)

End of instructions

  Boxplots

> boxplot(variable)

where variable is the name of the numeric vector (variable)

End of instructions

  Boxplots

A boxplot is a graphical representation of the distribution of observations based on the 5-number summary that includes the median (50%), quartiles (25% and 75%) and data range (0% - smallest observation and 100% - largest observation). The box demonstrates where the middle 50% of the data fall and the whiskers represent the minimum and maximum observations. Outliers (extreme observations) are also represented when present.
The figure below demonstrates the relationship between the distribution of observations and a boxplot of the same observations. Normally distributed data results in symmetrical boxplots, whereas skewed distributions result in asymmetrical boxplots with each segment getting progressively larger (or smaller).


End of instructions

  Observations, variables & populations

Observations are the sampling units (e.g. quadrats) or experimental units (e.g. individual organisms, aquaria) that make up the sample.

Variables are the actual properties measured by the individual observations (e.g. length, number of individuals, rate, pH, etc). Random variables (those variables whose values are not known for sure before the onset of sampling) can be either continuous (can theoretically be any value within a range, e.g. length, weight, pH, etc) or categorical (can only take on certain discrete values, such as counts - number of organisms etc).

Populations are defined as all the possible observations that we are interested in.

A sample represents a collected subset of the population's observations and is used to represent the entire population. Sample statistics are the characteristics of the sample (e.g. sample mean) and are used to estimate population parameters.


End of instructions

  Population & sample

A population refers to all possible observations. Therefore, population parameters refer to the characteristics of the whole population. For example the population mean.

A sample represents a collected subset of the population's observations and is used to represent the entire population. Sample statistics are the characteristics of the sample (e.g. sample mean) and are used to estimate population parameters.


End of instructions

  Standard error and precision

A good indication of how reliable a single estimate is likely to be is how precise the measure is. Precision is a measure of how repeatable an outcome is. If we could repeat the sampling regime multiple times and each time calculate the sample mean, then we could examine how similar each of the sample means is. So a measure of precision is the degree of variability between the individual sample means from repeated sampling events.

Sample number           Sample mean
1                       12.1
2                       12.7
3                       12.5
Mean of sample means    12.433
SD of sample means      0.306

The table above lists three sample means and also illustrates a number of important points:
  1. Each sample yields a different sample mean
  2. The mean of the sample means should be the best estimate of the true population mean
  3. The more similar (consistent) the sample means are, the more precise any single estimate of the population mean is likely to be
The standard deviation of the sample means from repeated sampling is called the Standard error of the mean.

It is impractical to repeat the sampling effort multiple times, however, it is possible to estimate the standard error (and therefore the precision of any individual sample mean) using the standard deviation (SD) of a single sample and the size (n) of this single sample.

The smaller the standard error (SE) of the mean, the more precise (and possibly more reliable) the estimate of the mean is likely to be.

End of instructions

  Confidence intervals


A 95% confidence interval is an interval that we are 95% confident will contain the true population mean; there is less than a 5% chance that the interval will fail to contain it. The frequentist approach to statistics dictates that if multiple samples are collected from a population and an interval calculated for each sample, 95% of these intervals will contain the true population mean and 5% will not. Therefore there is a 95% probability that any single sample interval will contain the population mean.

The interval is expressed as the mean ± half the interval. The confidence interval is obviously affected by the degree of confidence. In order to have a higher degree of confidence that an interval is likely to contain the population mean, the interval would need to be larger.
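One common calculation of a 95% confidence interval from a single sample is mean ± t(0.975, n-1) × SE. As a minimal sketch in R (y is the hypothetical sample from the box above):

# 95% confidence interval for the population mean
> y <- rnorm(20, mean=12.4, sd=1)
> mean(y) + qt(c(0.025, 0.975), df=length(y)-1) * sd(y)/sqrt(length(y))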

End of instructions

  Successful transformations


Since the objective of a transformation is primarily to normalize the data (although variance homogeneity and linearization may also be beneficial side-effects), the success of a transformation is measured by whether or not it has improved the normality of the data. It is not measured by whether the outcome of a statistical analysis is more favorable!
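A quick way to judge success is to re-examine the distribution after transforming. As a sketch (variable is a placeholder for a positively skewed numeric vector):

# compare the distribution before and after a log transformation
> hist(variable)
> hist(log(variable))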

End of instructions

  Selecting subsets of data

# select the first 5 entries in the variable
> variable[1:5]
# select all but the first entry in the variable
> variable[-1]
# select all cases of variable that are less than num
> variable[variable < num]
# select all cases of variable that are equal to 'Big'
> variable1[variable1 == label]

where variable is the name of a numeric vector (variable), num is a number, variable1 is the name of a character vector and label is a character string (word). The logical statement indicates how the data are to be subsetted. For example, to exclude all values of DOC that are greater than 170 (and thereby exclude the outlier), include only those values that are less than 170 by entering DOC < 170. Alternatively, you could choose to exclude the outlier using its STREAM name by entering STREAM != 'Santa Cruz'.
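The same logic can be applied to whole data frames with the subset() function. The following sketch assumes a data frame called data containing DOC and STREAM variables, as in the example above:

# include only those rows with DOC values less than 170
> data.sub <- subset(data, DOC < 170)
# alternatively, exclude the outlier by its STREAM name
> data.sub <- subset(data, STREAM != 'Santa Cruz')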

End of instructions

  Summary Statistics by groups

> tapply(dv,factor,func)

The tapply() function applies the function func to the numeric vector dv separately for each level of the factor factor. For example:

# calculate the mean separately for each group
> tapply(dv,factor,mean)
# calculate the variance separately for each group
> tapply(dv,factor,var)

End of instructions

  EDA - Normal distribution

Parametric statistical hypothesis tests assume that the population measurements follow a specific distribution. Most commonly, statistical hypothesis tests assume that the population measurements are normally distributed (Question 4 highlights the specific reasoning for this).

It is usually not possible to directly examine the distribution of measurements in the whole population (for the same reasons that it is not possible to directly measure population parameters such as the population mean). However, if the sampling is adequate (unbiased and sufficiently replicated), the distribution of measurements in a sample should reflect the distribution in the population. That is, a sample of measurements taken from a normally distributed population should itself be normally distributed.

Tests on data that do not meet the distributional requirements may not be reliable, and thus, it is essential that the distribution of data be examined routinely prior to any formal statistical analysis.
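Common graphical checks in R are a histogram and a normal quantile-quantile plot. A sketch (variable is a placeholder for any numeric vector, as in the other code boxes):

# histogram of the sample measurements
> hist(variable)
# normal quantile-quantile plot; points close to the line suggest normality
> qqnorm(variable)
> qqline(variable)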


End of instructions

  Factorial boxplots

> boxplot(dv~factor,data=data)

where dv is the name of the numeric vector (dependent variable), factor is the name of the factor (categorical or factor variable) and data is the name of the data frame (data set). The '~' is used in formulas to indicate that the left hand side is modeled as a function of the right hand side.

End of instructions

  EDA - Linearity

The most common methods for analysing the relationship or association between variables assume that the variables are linearly related (or more correctly, that they do not covary according to some function other than linear). For example, to examine the presence and strength of the relationship between two variables (such as those depicted below), the more basic statistical tests require that the variables are not related by any function other than a linear (straight-line) function, if any function at all.

There is no evidence that the data represented in Figure (a) are related (covary) according to any function other than linear. Likewise, there is no evidence that the data represented in Figure (b) are related (covary) according to any function other than linear (if at all). However, there is strong evidence that the data represented in Figure (c) are not related linearly. Hence Figure (c) depicts evidence of a violation of the linearity assumption.


End of instructions

  EDA - Scatterplots

Scatterplots are constructed to present the relationships, associations and trends between continuous variables. By convention, dependent variables are arranged on the Y-axis and independent variables on the X-axis. The variable measurements are used as coordinates and each replicate is represented by a single point.

Figure (c) above displays a linear smoother (linear `line-of-best-fit') through the data. Different smoothers help identify different trends in data.


End of instructions

  Scatterplots

# load the 'car' library
> library(car)
# generate a scatterplot
> scatterplot(dv~iv,data=data)

where dv is the dependent variable, iv is the independent variable and data is the data frame (data set).

End of instructions

  Plotting y against x


The statement 'plot variable1 against variable2' is interpreted as 'plot y against x'. That is, variable1 is considered the dependent variable and is plotted on the y-axis, and variable2 is the independent variable and thus is plotted on the x-axis.
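The formula interface in R follows this convention directly. A sketch (variable1, variable2 and data are placeholders):

# plot variable1 (y-axis) against variable2 (x-axis)
> plot(variable1 ~ variable2, data=data)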

End of instructions

  Scatterplot matrix (SPLOM)

# make sure the car package is loaded
> library(car)
# generate a scatterplot matrix
> scatterplot.matrix(~var1 + var2 + var3, data=data, diagonal='boxplot')
# OR
> pairs(~var1 + var2 + var3, data=data)

where var1, var2 etc are the names of numeric vectors in the data data frame (data set)

End of instructions

  EDA - Homogeneity of variance

Many statistical hypothesis tests assume that the populations from which the sample measurements were collected are equally varied with respect to all possible measurements. For example, are the growth rates of one population of plants (a population treated with fertilizer) more or less varied than the growth rates of another population (a control population treated only with water)? If they are, then the results of many statistical hypothesis tests that compare means may be unreliable; if the populations are equally varied, then the tests are more likely to be reliable.
Obviously, it is not possible to determine the variability of entire populations. However, if sampling is unbiased and sufficiently replicated, then the variability of samples should be directly proportional to the variability of their respective populations. Consequently, samples that have substantially different degrees of variability provide strong evidence that the populations are likely to be unequally varied.
There are a number of options available to assess, from samples, whether the populations are likely to be equally varied.
1. Calculate the variance or standard deviation of the samples. If one sample's value is more than 2 times greater or less than that of the other sample(s), then there is some evidence for unequal population variability (non-homogeneity of variance) - see the sketch after this list
2. Construct boxplots of the measurements for each population, and compare the lengths of the boxplots. The length of a symmetrical (and thus normally distributed) boxplot is a robust measure of the spread of values, and thus an indication of the spread of population values. Therefore, if any of the boxplots are more than 2 times longer or shorter than the other boxplot(s), then there is, again, some evidence for unequal population variability (non-homogeneity of variance)
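A minimal sketch of the first option in R (dv, factor and data are placeholders, as in the other code boxes):

# calculate the variance of each group
> vars <- with(data, tapply(dv, factor, var))
# ratio of the largest to the smallest group variance; ratios much
# greater than about 2 suggest non-homogeneity of variance
> max(vars)/min(vars)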


End of instructions

  t test

The frequentist approach to hypothesis testing is based on estimating the probability of collecting the observed sample(s) when the null hypothesis is true. That is, how rare would it be to collect samples that displayed the observed degree of difference if the samples had been collected from identical populations. The degree of difference between two collected samples is objectively represented by a t statistic.

The t statistic is calculated as t = (y1 - y2) / √(s1²/n1 + s2²/n2), where y1 and y2 are the sample means of groups 1 and 2 respectively, and the denominator √(s1²/n1 + s2²/n2) represents the degree of precision in the difference between the means by taking into account the degree of variability of each sample.
If the null hypothesis is true (that is, the means of population 1 and population 2 are equal), the degree of difference (t value) between an unbiased sample collected from population 1 and an unbiased sample collected from population 2 should be close to zero (0). It is unlikely that an unbiased sample collected from population 1 will have a mean substantially higher or lower than an unbiased sample from population 2. That is, it is unlikely that such samples could result in a very large (positive or negative) t value if the null hypothesis (of no difference between populations) was true. The question is, how large (either positive or negative) does the t value need to be before we conclude that the null hypothesis is unlikely to be true?

What is needed is a method by which we can determine the likelihood of any conceivable t value when the null hypothesis is true. This can be done via simulation: we can simulate the collection of random samples from two identical populations and build up the distribution of all possible t values.

Let's say a silviculturalist was interested in comparing the size of Eucalyptus regnans seedlings grown under shade and full sun conditions. In this case we have two populations: one population represents all the possible E. regnans seedlings grown in shade conditions, and the other population represents all the possible E. regnans seedlings grown in full sun conditions. If we had grown 200 seedlings under shade conditions and 200 seedlings under full sun conditions, then these samples can be used to assess the null hypothesis that the mean size of an infinite number (population) of E. regnans seedlings grown under shade conditions is the same as the mean size of an infinite number (population) of E. regnans seedlings grown under full sun conditions. That is, that the population means are equal.

We can firstly simulate the sizes of 200 seedlings grown under shade conditions and another 200 seedlings grown under full sun conditions that could arise naturally when shading has no effect on seedling growth. That is, we can simulate one possible outcome when the null hypothesis is true and shading has no effect on seedling growth.
Now we can calculate the degree of difference between the mean sizes of seedlings grown under the two different conditions, taking into account natural variation (that is, we can calculate a t value using the formula above). From this simulation we may have found that the mean sizes of seedlings grown in shade and full sun were 31.5cm and 33.7cm respectively and the degree of difference (t value) was 0.25. This represents only one possible outcome (t value). We now repeat this simulation process a large number of times (e.g. 1000) and generate a histogram (or more correctly, a distribution) of the t value outcomes that are possible when the null hypothesis is true.
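This simulation can be mimicked in R. The following is a minimal sketch; the common population mean of 32 cm and standard deviation of 10 are hypothetical values chosen purely for illustration:

# generate 1000 t values from samples drawn from identical populations
> set.seed(1)
> tvals <- replicate(1000, {
    shade <- rnorm(200, mean=32, sd=10)  # 200 simulated shade-grown seedlings
    sun <- rnorm(200, mean=32, sd=10)    # 200 simulated sun-grown seedlings
    t.test(shade, sun, var.equal=T)$statistic
})
# distribution of t values possible when the null hypothesis is true
> hist(tvals)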

It should be obvious that when the null hypothesis is true (and the two populations are the same), the majority of t values calculated from samples containing 200 seedlings will be close to zero (0) - indicating only small degrees of difference between the samples. However, it is also important to notice that it is possible (albeit less likely) to have samples that are quite different from one another (large positive or negative t values) just by pure chance (for example t values greater than 2).

It turns out that it is not necessary to perform these simulations each time you test a null hypothesis. There is a mathematical formula to estimate the t distribution appropriate for any given sample size (or more correctly, degrees of freedom) when the null hypothesis is true. In this case, the t distribution is for (200-1)+(200-1)=398 degrees of freedom.

At this stage we would calculate the t value from our actual observed samples (the 200 real seedlings grown under each of the conditions). We then compare this t value to our distribution of all possible t values (t distribution) to determine how likely our t value is when the null hypothesis is true. The simulated t distribution suggests that very large (positive or negative) t values are unlikely. The t distribution allows us to calculate the probability of obtaining a value greater than a specific t value. For example, we could calculate the probability of obtaining a t value of 2 or greater when the null hypothesis is true, by determining the area under the t distribution beyond a value of 2.
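This tail area can be obtained directly from the cumulative t distribution in R. A sketch (t.value and df are placeholders for the observed t value and the degrees of freedom):

# probability of obtaining a t value this large or larger
# when the null hypothesis is true
> pt(t.value, df=df, lower.tail=F)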

If the calculated t value was 2, then the probability of obtaining this t value (or one greater) when the null hypothesis is true is 0.012 (or 1.2%). Since the probability of obtaining our t value or one greater (and therefore the probability of a sample of 200 seedlings grown in the shade differing so much in size from a sample of 200 seedlings grown in full sun) is so low, we would conclude that the null hypothesis of no effect of shading on seedling growth is likely to be false. Thus, we have provided strong evidence that shading conditions do affect seedling growth!


End of instructions

  t-test degrees of freedom

Degrees of freedom is the number of observations that are free to vary when estimating a parameter. For a t-test, df is calculated as
df = (n1-1)+(n2-1)
where n1 and n2 are the sample sizes of populations 1 and 2 respectively.

End of instructions

  One-tailed critical t-value

One-tail critical t-values are just calculated from a t distribution (of given df). The area under the t distribution curve above (or below) this value is 0.05 (or some other specified probability). Essentially we calculate a quantile of a specified probability from a t distribution of df degrees of freedom

# calculate critical t value (α = 0.05) for upper tail (e.g. A larger than B)
> qt(0.05, df=df, lower.tail=F)
# calculate critical t value (α = 0.05) for lower tail (e.g. B larger than A)
> qt(0.05, df=df, lower.tail=T)

where 0.05 is the specified α value and df is the specified degrees of freedom. Note that it is not necessary to have collected the data before calculating the critical t-value, you only need to know sample sizes (to get df).

It actually doesn't matter whether you set lower.tail to TRUE or FALSE. The t-distribution is a symmetrical distribution centered around 0; therefore, the critical t-value is the same for both the lower tail (e.g. population 2 greater than population 1) and the upper tail (e.g. population 1 greater than population 2), except that the lower-tail value is always negative. As it is often less confusing to work with positive values, it is recommended that you use upper-tail values. An example of a t-distribution with the upper tail marked for a one-tailed test is depicted below. Note that this is not part of the t quantiles output!


End of instructions

  Two-tailed critical t-value

Two-tail critical t-values are just calculated from a t distribution (of given df). The combined area under the t distribution curve above the positive of this value and below the negative of this value is 0.05 (or some other specified probability). Essentially we calculate a quantile of a specified probability from a t distribution of df degrees of freedom. In a two-tailed test, half of the probability is associated with the area above the positive critical t-value and the other half with the area below the negative critical t-value. Therefore, when we use the quantile to calculate this critical t-value, we specify the probability as α/2 - since α/2 applies to each tail.

# calculate critical t value (α = 0.05) for the upper tail (e.g. A different to B)
> qt(0.05/2, df=df, lower.tail=F)

where 0.05 is the specified α value and df is the specified degrees of freedom. Note that it is not necessary to have collected the data before calculating the critical t-value, you only need to know sample sizes (to get df).

Again, it doesn't matter whether you set lower.tail to TRUE or FALSE. For a symmetrical distribution centered around 0, the critical t-value is the same for both the lower tail (e.g. population 2 greater than population 1) and the upper tail (e.g. population 1 greater than population 2), except that the lower-tail value is always negative. As it is often less confusing to work with positive values, it is recommended that you use upper-tail values. An example of a t-distribution with the upper tail marked for a two-tailed test is depicted below. Note that this is not part of the t quantiles output!

End of instructions

  Basic steps of Hypothesis testing


Step 1 - Clearly establish the statistical null hypothesis. Therefore, start off by considering the situation where the null hypothesis is true - e.g. when the two population means are equal

Step 2 - Establish a critical statistical criterion (e.g. α = 0.05)

Step 3 - Collect independent, unbiased samples

Step 4 - Assess the assumptions relevant to the statistical hypothesis test. For a t test:
  1.  Normality
  2.  Homogeneity of variance

Step 5 - Calculate test statistic appropriate for null hypothesis (e.g. a t value)


Step 6 - Compare observed test statistic to a probability distribution for that test statistic when the null hypothesis is true for the appropriate degrees of freedom (e.g. compare the observed t value to a t distribution).

Step 7 - If the observed test statistic is greater (in magnitude) than the critical value for that test statistic (based on the predefined critical criterion), we conclude that it is unlikely that the observed samples could have come from populations that fulfill the null hypothesis, and the null hypothesis is therefore rejected; otherwise, we conclude that there is insufficient evidence to reject the null hypothesis. Alternatively, we calculate the probability of obtaining the observed test statistic (or one of greater magnitude) when the null hypothesis is true. If this probability is less than our predefined critical criterion (e.g. 0.05), the null hypothesis is rejected; otherwise, we conclude that there is insufficient evidence to reject it.

End of instructions

  Pooled variance t-test

> t.test(dv~factor,data=data,var.equal=T)

where dv is the name of the dependent variable, factor is the name of the categorical/factorial variable and data is the name of the data frame (data set). The argument var.equal=T indicates a pooled variances t-test

End of instructions

  Separate variance t-test

> t.test(dv~factor,data=data,var.equal=F)

where dv is the name of the dependent variable, factor is the name of the categorical/factorial variable and data is the name of the data frame (data set). The argument var.equal=F indicates a separate variances t-test

End of instructions

  Output of t-test


The following outputs are based on simulated data sets:
1.  Pooled variance t-test for populations with equal (or nearly so) variances

2.  Separate variance t-test for population with unequal variances

End of instructions

  Bar graphs

# load the 'biology' extension library written by Murray.
> library(biology)
# calculate the means
> means <- tapply(dv, factor, mean)
# calculate the standard deviation
> sds <- tapply(dv, factor, sd)
# calculate the lengths
> ns <- tapply(dv, factor, length)
# calculate the standard errors
> ses <- sds/sqrt(ns)
# plot the bars
> xs <- barplot(means,beside=T)
# load package containing error bar function
> library(Hmisc)
# plot the error bars
> errbar(xs, means, means+ses, means-ses, add=T)

OR

# use the Mbargraph() function to construct a bargraph
> Mbargraph(furness$METRATE, furness$SEX, errorbars="se")

End of instructions

  Paired t-test

> with(data, t.test(cat1, cat2, paired=T))

where cat1 and cat2 are two numeric vectors (variables) in the data data frame (data set). The argument paired=T indicates a paired t-test.

End of instructions

  Non-parametric tests

Non-parametric tests do not place any distributional limitations on data sets and are therefore useful when the assumption of normality is violated. There are a number of alternatives to parametric tests; the most common are:

1. Randomization tests - rather than assume that a test statistic calculated from the data follows a specific mathematical distribution (such as a normal distribution), these tests generate their own test statistic distribution by repeatedly re-sampling or re-shuffling the original data and recalculating the test statistic each time. A p value is subsequently calculated as the proportion of random test statistics that are greater than or equal to the test statistic based on the original (un-shuffled) data.

2. Rank-based tests - these tests operate not on the original data, but rather on data that have first been ranked (each observation is assigned a rank, such that the largest observation is given the value of 1, the next largest is 2, and so on). It turns out that, for a given sample size, the probability distribution of a rank-based test statistic is identical regardless of the distribution of the original data.

End of instructions

  Wilcoxon test

> wilcox.test(dv ~ factor, data=data)

where dv is the dependent variable and factor is the categorical variable from the data data frame (data set).

End of instructions

  Randomization (permutation) tests

The basis of hypothesis testing (when comparing two populations) is to determine how often we would expect to collect a specific sample (or one more unusual) if the populations were identical. That is, would our sample be considered unusual under normal circumstances? To determine this, we need to estimate what sort of samples we would expect under normal circumstances. The theoretical t-distribution mimics what it would be like to randomly resample repeatedly (with a given sample size) from such identical populations.

A single sample from the (proposed) population(s) provides a range of observations that are possible. Completely shuffling those data (or their labels) should mimic the situation when there is no effect (the null hypothesis true situation), since the data are then effectively assigned randomly with respect to the effect. In the case of a t-test comparing the means of two groups, randomly shuffling the data is likely to result in similar group means. This provides one possible outcome (t-value) that might be expected when the null hypothesis is true. If this shuffling process is repeated multiple times, a distribution of all of the outcomes (t-values) that are possible when the null hypothesis is true can be generated. We can then determine how unusual our real (unshuffled) t-value is by comparing it to this built-up t-distribution. If the real t-value is expected to occur less than 5% of the time, we reject the null hypothesis.


End of instructions

  Defining the randomization statistic

> stat <- function(data, indices){
    # t statistic for column 2 (response) against column 1 (group labels)
    t <- t.test(data[,2]~data[,1])$statistic
    t
}

The above function takes a data set (data) and calculates a test statistic (in this case a t-statistic). The illustrated function uses the t.test() function to perform a t-test on column 2 ([,2]) against column 1 ([,1]). The value of the t-statistic is stored in an object called 't'. The function returns the value of this object. The indentations are not important, they are purely used to improve human readability.

End of instructions

  Defining the data shuffling procedure

> rand.gen <- function(data, mle){
    out <- data
    out[,1] <- sample(out[,1], replace=F)
    out
}

The above function defines how the data should be shuffled. The first line creates a temporary object (out) that stores the original data set (data) so that any alterations do not affect the original data set. The second line uses the sample() function to shuffle the first column (the group labels). The third line of the function just returns the shuffled data set.

End of instructions

  Perform randomization test (bootstrapping)

> library(boot)
> coots.boot <- boot(coots, stat, R=100, sim="parametric", ran.gen=rand.gen)

The sequence begins by ensuring that the boot package has been loaded into memory. Then the boot() function is used to repeatedly (R=100: repeat 100 times) perform the defined statistical procedure (stat). The ran.gen=rand.gen parameter defines how the data set should be altered prior to performing the statistical procedure, and the sim="parametric" parameter indicates that all randomizations are possible (as opposed to a permutation, where each configuration can only occur once). The outcome of the randomization test (bootstrap) is stored in an object called coots.boot.

End of instructions

  Calculating p-value from randomization test

> p.length <- length(coots.boot$t[coots.boot$t >= coots.boot$t0]) + 1
> print(p.length/(coots.boot$R + 1))

The coots.boot object contains a list of all of the t-values calculated from the resampling (shuffling and recalculating) procedure (coots.boot$t) as well as the actual t-value from the actual data set (coots.boot$t0). It also stores the number of randomizations that it performed (coots.boot$R).

The first line in the above sequence determines how many of the possible t values (including the actual t-value) are greater than or equal to the actual t-value. It does this by selecting the coots.boot$t values that are greater than or equal to the coots.boot$t0 value (coots.boot$t[coots.boot$t >= coots.boot$t0]) and calculating the length of this selection using the length() function. One (1) is added to this length, since the actual t-value must be included as a possible t-value.
The second line expresses the number of randomization t-values greater or equal to the actual t-value as a fraction of the total number of randomizations (again adding 1 to account for the actual situation). This is interpreted as any other p-value - the probability of obtaining our sample statistic (and thus our observations) when the null hypothesis is true.

End of instructions

  Effect size

In a t-test, the effect size is the absolute difference between the expected population means. Therefore if the expected population means of population A and B are 10 and 16 respectively, then the effect size is 6.

Typically, effect size is estimated by considering the magnitude of effect that is either likely or would be regarded as biologically significant. For example, if the overall population mean was estimated to be 12 and the researchers regarded a 15% increase as biologically significant, then the effect size is the difference between 12 and a mean 15% greater than 12 (12 + (12 × 0.15) = 13.8). Therefore the effect size is 13.8 - 12 = 1.8.

End of instructions