Before beginning this worksheet make sure that you have read the R manual and have either fully installed R (according to the instructions in the R manual) or are running R from the Research Methods CD provided. Other versions of R and Rcmdr obtained elsewhere have not been customized for Bio3011 and therefore do not offer all the required features 'out of the box'.
Run R. The picture above depicts RGui 2.0.0 running on a Windows system. RGui itself consists of a number of windows (although not all of them are necessarily on display at any one time).
Everything in R is an object. For example, a single number is an object, a variable is an object, output is an object, a data set is an object, etc. Furthermore, every object has a unique name (that you provide) so that it can be referred to. To get a feel for data storage and access in R, try the following:
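As a minimal sketch of storing and retrieving objects (the object names and values here are made up for illustration and are not part of the worksheet's own data):

```r
# Store a single number in an object called 'x' (hypothetical name)
x <- 10
x                      # typing an object's name prints its contents

# Store several observations in one object (a vector); values are made up
weights <- c(12.1, 9.8, 11.4)

# Output from a function can itself be stored as a new object
mean_weight <- mean(weights)
mean_weight

# List the names of the objects currently stored in the workspace
ls()
```

Assigning to a name that is already in use replaces the earlier contents, which is why each object needs its own name.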
Rarely is only a single biological variable collected. Data are usually collected in sets of variables reflecting tests of relationships, differences between groups, multiple characterizations, etc. Consequently, data sets are best organized into collections of variables (vectors). Such collections are called data frames in R. Data frames are generated by combining multiple vectors, whereby each vector becomes a separate column in the data frame. In order for a data frame to represent the data properly, the sequence in which observations appear in the vectors (variables) must be the same for each vector, and each vector should have the same number of observations. For example, the first observations from each of the vectors to be included in the data frame must represent observations collected from the same sampling unit.
To demonstrate the use of data frames in R, we will use fictitious data representing the areas of leaves of two species of Japanese Boxwood.
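A data frame of this kind could be built roughly as follows. This is only a sketch: the leaf areas, the species labels ("sp1", "sp2") and the object names are all invented for illustration.

```r
# Made-up leaf areas; the first three leaves are from one species,
# the last three from the other
area <- c(2.1, 2.4, 1.9, 3.0, 3.3, 2.8)

# A grouping variable in the same order as the observations in 'area'
# ("sp1" and "sp2" are placeholder labels, not real species names)
species <- factor(rep(c("sp1", "sp2"), each = 3))

# Combine the two vectors; each vector becomes a column
leaves <- data.frame(species, area)
leaves

# Access a single column, and summarize leaf area by species
leaves$area
tapply(leaves$area, leaves$species, mean)
```

Note that both vectors have six observations and that row 1 of each refers to the same leaf, as required for the data frame to represent the data properly.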
The above syntax forms a list of instructions that R can perform. Such lists are called scripts. Scripts offer the following:
Although it is possible to generate a data set from scratch using the procedures shown in the demonstration module above, data sets are often better managed with spreadsheet software. R is not designed to be a spreadsheet, and thus it is necessary to import data into R. We will use the following small data set (in which the feeding metabolic rate of stick insects fed two different diets was recorded) to demonstrate how a data set is imported into R.
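One common route is to save the spreadsheet as a comma-separated (CSV) file and read it with read.csv(). The sketch below is self-contained: it first writes a tiny stand-in file to a temporary location, so the file path, column names and values are hypothetical and are not the worksheet's actual stick insect data.

```r
# Create a small stand-in CSV file (hypothetical columns and values),
# playing the role of a file exported from spreadsheet software
csv_path <- tempfile(fileext = ".csv")
writeLines(c("diet,metabolic_rate",
             "A,1.2",
             "A,1.4",
             "B,2.1",
             "B,1.9"),
           csv_path)

# Import the file; the first row supplies the column names
insects <- read.csv(csv_path, header = TRUE)

# Check that the import worked as expected
str(insects)
head(insects)
```

With a real file, csv_path would instead be the location of the exported file on your computer.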
Randomization is a fundamental concept in sampling design. In order for a sample to truly represent an entire population (all the possible observations), the sample must be collected without bias (intentional or otherwise). Ideally, samples should be collected randomly. For example, the location of quadrats in an area should be determined by a random grid, and individuals from a population to be measured should be selected at random. Likewise, the application of treatments should also occur at random.
Given the importance of randomization, it is necessary to be able to perform randomizations, randomly order treatments, and generate random numbers, random coordinates, etc.
One of the simplest tools to use for random sampling and treatment allocation is a set of random numbers. Random numbers are usually given as a fraction (e.g. 0.713791167) between 0 and 1. These can be used to generate a series of integer numbers (e.g. 7, 1, 3, 71, 379, 1167 ...) or floating points (e.g. 0.71, 7.13, 71.37, 713.7, 91.167, ...), which in turn can be used to obtain random sampling units (based on individual IDs or random grid coordinates respectively).
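In R, such numbers can be generated directly rather than assembled from digits. The following is a sketch under assumed conditions: a population of 100 numbered individuals, a 50 m by 20 m study area, and two treatments with five replicates each are all hypothetical choices made for the example.

```r
# Fixing the seed only makes this example repeatable;
# leave it out when generating numbers for a real study
set.seed(1)

# Five random fractions between 0 and 1
u <- runif(5)
u

# Ten random individual IDs drawn without replacement
# from an assumed population numbered 1 to 100
ids <- sample(1:100, 10)
ids

# Five random grid coordinates within an assumed 50 m x 20 m area
coords <- data.frame(x = round(runif(5, min = 0, max = 50), 1),
                     y = round(runif(5, min = 0, max = 20), 1))
coords

# Random order of treatment allocation for ten sampling units
treatments <- sample(rep(c("control", "treatment"), each = 5))
treatments
```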
Let's say you were interested in examining the changes that occur in rocky intertidal invertebrate communities along an exposed rock platform. To do so, you had decided to establish five transects, each running perpendicular to the beach and extending from the high water mark to the low tide water line. Every 10 meters along the transects, three 0.5 x 0.5 m quadrats were to be used to count the number and diversity of invertebrate species (see diagram below). To minimize bias, the quadrats were to be positioned at random distances (0 to 5 meters) and in random directions (either left or right) from the transect line. Therefore, it was necessary to generate a list of random distances (from 0 to 5 meters) and directions (left or right) to use for the positioning of each quadrat.
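A list of this kind might be generated as sketched below. The transect length is not stated, so the number of sampling stations per transect (four here) is purely an assumption, as are the object and column names.

```r
n_transects <- 5   # from the design described above
n_quadrats  <- 3   # quadrats at each 10 m station
n_stations  <- 4   # ASSUMPTION: stations per transect (depends on transect length)

# One row for every quadrat in the design
positions <- expand.grid(quadrat  = 1:n_quadrats,
                         station  = 1:n_stations,
                         transect = 1:n_transects)
n <- nrow(positions)

# Random distance from the transect line, 0 to 5 m, to the nearest 0.1 m
positions$distance_m <- round(runif(n, min = 0, max = 5), 1)

# Random side of the transect line for each quadrat
positions$direction <- sample(c("left", "right"), n, replace = TRUE)

head(positions)
```

Because the directions are sampled with replacement, each quadrat has an equal and independent chance of falling on either side of the line.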
Note again that it is usually a good idea to generate 50% more random numbers, sequences, etc. than you expect to need, to allow for invalid or inappropriate combinations. For example, in the above situation you may have made an a priori decision not to measure rock pools on the rock platform, since they provide very different conditions that may not truly represent the location along the intertidal zone. Consequently, if a random sequence dictates that a quadrat is to be placed in a rock pool, this sequence would be skipped. Likewise, if a random sequence dictated that a quadrat should be placed in a location that overlaps a previously sampled area, it might also be appropriate to skip to the next random sequence.
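The 50% allowance is simple to build in. In this sketch the number of positions needed (60) and the rows marked as unusable are hypothetical; in practice, a position would only be judged invalid once it was located in the field.

```r
n_needed   <- 60                       # ASSUMPTION: positions required
n_generate <- ceiling(1.5 * n_needed)  # 50% more than needed

# Generate the longer list of random distances and directions
spare <- data.frame(
  distance_m = round(runif(n_generate, min = 0, max = 5), 1),
  direction  = sample(c("left", "right"), n_generate, replace = TRUE)
)

# Hypothetical field outcome: positions 3 and 10 fell in rock pools
usable <- rep(TRUE, n_generate)
usable[c(3, 10)] <- FALSE

# Work down the list in order, skipping unusable positions,
# until the required number has been obtained
chosen <- head(spare[usable, ], n_needed)
nrow(chosen)
```

Working through the list in its generated order, rather than picking replacements by eye, keeps the skipped positions from introducing bias.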