Data manipulations
19 Oct 2011
In the preceding workshops, a great deal of emphasis was placed on exploring and analysing prepared data. However, data are rarely transcribed and organized into such simple, ready-to-go data sets. More typically, data are spread across multiple sources and in a variety of formats (particularly when compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses, it is necessary to compile very focused data sets.
Common data preparations include:
- Reshaping and rearranging data
- Merging multiple data sources
- Aggregating data to appropriate spatial/temporal scales
- Transforming data and deriving new variables
- Sorting and reordering data
- Relabelling data
Practical data manipulation will be demonstrated via a series of very small artificial datasets. These datasets are presented in tables with black font and lines, and the R code required to generate those data is presented in static code boxes either underneath or adjacent to the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing one of the following commands:
- (if online)
> source(url("http://users.monash.edu.au/~murray/AIMS-R-users/downloads/data/manipulationDatasets.R"))
- (if offline and running this from a local version, run the following from within the /downloads/data directory)
> source(file = "manipulationDatasets.R")

The results of data manipulations are displayed in tables with blue font and lines. Clicking on these tables will reveal the R source code that was used to perform the manipulation along with the actual result as it appears in R (not formatted for HTML).
Reshaping datasets
Wide to long (melt)
Whilst wide data formats are often more compact and typically easier to manage for data entry (particularly in the field), they are not in the appropriate format for most analyses (traditional repeated measures being one exception). Most analyses require that each replicate is in its own row, and thus it is necessary to rearrange or reshape (melt) the data from this wide format into the long (molten) format.
The melt function (reshape package) is very useful for converting wide (repeated measures-like) data into long format. The important parameters to specify are:
- id: the set of factors already in long format
- measure.var: the columns that define the repeated measure and should be melted into a single long column
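As a minimal sketch of melting (the small wide data frame here is hypothetical, invented for illustration; the column and level names mirror those used elsewhere on this page):

```r
library(reshape)

## A small hypothetical wide data set: one row per subplot,
## with the two within-subplot measurements in columns B1 and B2
data.wide <- data.frame(Plot = c("P1", "P1", "P2", "P2"),
    Subplot = paste("S", 1:4, sep = ""),
    B1 = c(8, 7, 14, 11), B2 = c(10, 11, 12, 9))

## Melt the B1/B2 columns into a single long column;
## id lists the factors already in long format
melt(data.wide, id = c("Plot", "Subplot"),
    measure.var = c("B1", "B2"), variable_name = "Within")
```

The result has one row per Plot/Subplot/Within combination, with the measurements stacked in a single `value` column.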
Long to wide (cast)
The cast function (reshape package) is very useful for converting long datasets into wide (repeated measures-like) format. The heart of the function definition is a formula of the form LHS ~ RHS where:
- LHS: a plus-separated set of variables to retain in long format
- RHS: the variable to be cast into wide format
An example of a long data set (data):

   Resp1 Resp2 Between Plot Subplot Within
1      8    17      A1   P1      S1     B1
2     10    18      A1   P1      S1     B2
3      7    17      A1   P1      S2     B1
4     11    21      A1   P1      S2     B2
5     14    19      A2   P2      S3     B1
6     12    13      A2   P2      S3     B2
7     11    24      A2   P2      S4     B1
8      9    18      A2   P2      S4     B2
9     14    25      A3   P3      S5     B1
10    11    18      A3   P3      S5     B2
11     8    27      A3   P3      S6     B1
12     2    22      A3   P3      S6     B2
13     8    17      A1   P4      S7     B1
14    10    22      A1   P4      S7     B2
15     7    16      A1   P4      S8     B1
16    12    13      A1   P4      S8     B2
17    11    23      A2   P5      S9     B1
18    12    19      A2   P5      S9     B2
19    12    23      A2   P5     S10     B1
20    10    21      A2   P5     S10     B2
21     3    17      A3   P6     S11     B1
22    11    16      A3   P6     S11     B2
23    13    26      A3   P6     S12     B1
24     7    28      A3   P6     S12     B2

Generating R code:

> set.seed(1)
> data <- expand.grid(Within = paste("B", 1:2, sep = ""),
+     Subplot = paste("S", 1:2, sep = ""),
+     Plot = paste("P", 1:6, sep = ""))
> data$Subplot <- gl(12, 2, 24, lab = paste("S", 1:12, sep = ""))
> data$Between <- gl(3, 4, 24, lab = paste("A", 1:3, sep = ""))
> data$Resp1 <- rpois(24, 10)
> data$Resp2 <- rpois(24, 20)
> data <- with(data, data.frame(Resp1, Resp2, Between, Plot,
+     Subplot, Within))

Widen (cast) Resp1 for repeated measures (Within):

> library(reshape)
> cast(data, Between + Plot + Subplot ~ Within, value = "Resp1")
   Between Plot Subplot B1 B2
1       A1   P1      S1  8 10
2       A1   P1      S2  7 11
3       A1   P4      S7  8 10
4       A1   P4      S8  7 12
5       A2   P2      S3 14 12
6       A2   P2      S4 11  9
7       A2   P5      S9 11 12
8       A2   P5     S10 12 10
9       A3   P3      S5 14 11
10      A3   P3      S6  8  2
11      A3   P6     S11  3 11
12      A3   P6     S12 13  7
Widen (cast) both Resp1 and Resp2 for repeated measures:

> library(reshape)
> d1 <- cast(data, Between + Plot + Subplot ~ Within, value = "Resp1")
> d2 <- cast(data, Between + Plot + Subplot ~ Within, value = "Resp2")
> merge(d1, d2, by = c("Between", "Plot", "Subplot"),
+     suffixes = c("Resp1", "Resp2"))
   Between Plot Subplot B1Resp1 B2Resp1 B1Resp2 B2Resp2
1       A1   P1      S1       8      10      17      18
2       A1   P1      S2       7      11      17      21
3       A1   P4      S7       8      10      17      22
4       A1   P4      S8       7      12      16      13
5       A2   P2      S3      14      12      19      13
6       A2   P2      S4      11       9      24      18
7       A2   P5     S10      12      10      23      21
8       A2   P5      S9      11      12      23      19
9       A3   P3      S5      14      11      25      18
10      A3   P3      S6       8       2      27      22
11      A3   P6     S11       3      11      17      16
12      A3   P6     S12      13       7      26      28
Merging data sets
It is common to have data associated with a particular study organized into a number of separate data tables (databases etc). In fact, large data sets are best managed in databases. However, statistical analyses generally require all data to be encapsulated within a single data structure. Therefore, prior to analysis, it is necessary to bring together multiple sources.
This phase of data preparation can be one of the most difficult to get right and verify.
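The merge() function's all, all.x and all.y arguments control which non-matching rows are retained (analogous to inner, full, left and right joins). A minimal sketch, using two small hypothetical tables invented for illustration:

```r
## Two hypothetical tables sharing a "Site" key; S1 has no chemistry
## data and S4 has no biological data
a <- data.frame(Site = c("S1", "S2", "S3"), Resp = c(8, 10, 7))
b <- data.frame(Site = c("S2", "S3", "S4"), Chem = c(1.4, 3.2, 1.1))

merge(a, b, by = "Site")                # inner join: S2 and S3 only
merge(a, b, by = "Site", all = TRUE)    # full join: S1-S4, NAs filled in
merge(a, b, by = "Site", all.x = TRUE)  # left join: keep all rows of a
merge(a, b, by = "Site", all.y = TRUE)  # right join: keep all rows of b
```

Each of these four forms is demonstrated on the bio/chem data below.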
Biological data set, missing Subplot 3 (data.bio):

   Resp1 Resp2 Between Plot Subplot
1      8    18      A1   P1      S1
2     10    21      A1   P1      S2
4     11    23      A1   P2      S4
5     14    22      A2   P3      S5
6     12    24      A2   P3      S6
7     11    23      A2   P4      S7
8      9    20      A2   P4      S8
9     14    11      A3   P5      S9
10    11    22      A3   P5     S10
11     8    24      A3   P6     S11
12     2    16      A3   P6     S12

Generating R code:

> set.seed(1)
> data.bio <- expand.grid(Subplot = paste("S", 1:2, sep = ""),
+     Plot = paste("P", 1:6, sep = ""))
> data.bio$Subplot <- gl(12, 1, 12, lab = paste("S", 1:12, sep = ""))
> data.bio$Between <- gl(3, 4, 12, lab = paste("A", 1:3, sep = ""))
> data.bio$Resp1 <- rpois(12, 10)
> data.bio$Resp2 <- rpois(12, 20)
> data.bio <- with(data.bio, data.frame(Resp1, Resp2, Between, Plot, Subplot))
> data.bio <- data.bio[-3, ]

Physio-chemical data set, missing Subplot 7 (data.chem):

   Chem1 Chem2 Between Plot Subplot
1   1.45  0.89      A1   P1      S1
2   3.27  0.18      A1   P1      S2
3   1.18  5.08      A1   P2      S3
4  13.40  1.58      A1   P2      S4
5   3.78  1.62      A2   P3      S5
6   1.20  4.24      A2   P3      S6
8   5.69  2.99      A2   P4      S8
9   4.83  4.13      A3   P5      S9
10  2.00  3.60      A3   P5     S10
11 12.33  1.78      A3   P6     S11
12  4.01  0.23      A3   P6     S12

Generating R code:

> set.seed(1)
> data.chem <- expand.grid(Subplot = paste("S", 1:2, sep = ""),
+     Plot = paste("P", 1:6, sep = ""))
> data.chem$Subplot <- gl(12, 1, 12, lab = paste("S", 1:12, sep = ""))
> data.chem$Between <- gl(3, 4, 12, lab = paste("A", 1:3, sep = ""))
> data.chem$Chem1 <- rlnorm(12, 1)
> data.chem$Chem2 <- rlnorm(12, 0.5)
> data.chem <- with(data.chem, data.frame(Chem1, Chem2, Between, Plot, Subplot))
> data.chem <- data.chem[-7, ]

Merge bio and chem data (only keep full matches):

> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"))
   Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1       A1   P1      S1     8    18  1.45  0.89
2       A1   P1      S2    10    21  3.27  0.18
3       A1   P2      S4    11    23 13.40  1.58
4       A2   P3      S5    14    22  3.78  1.62
5       A2   P3      S6    12    24  1.20  4.24
6       A2   P4      S8     9    20  5.69  2.99
7       A3   P5     S10    11    22  2.00  3.60
8       A3   P5      S9    14    11  4.83  4.13
9       A3   P6     S11     8    24 12.33  1.78
10      A3   P6     S12     2    16  4.01  0.23
Merge bio and chem data (keep all data):

> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"), all = TRUE)
   Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1       A1   P1      S1     8    18  1.45  0.89
2       A1   P1      S2    10    21  3.27  0.18
3       A1   P2      S3    NA    NA  1.18  5.08
4       A1   P2      S4    11    23 13.40  1.58
5       A2   P3      S5    14    22  3.78  1.62
6       A2   P3      S6    12    24  1.20  4.24
7       A2   P4      S7    11    23    NA    NA
8       A2   P4      S8     9    20  5.69  2.99
9       A3   P5      S9    14    11  4.83  4.13
10      A3   P5     S10    11    22  2.00  3.60
11      A3   P6     S11     8    24 12.33  1.78
12      A3   P6     S12     2    16  4.01  0.23
Merge bio and chem data (only keep full BIO matches):

> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"), all.x = TRUE)
   Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1       A1   P1      S1     8    18  1.45  0.89
2       A1   P1      S2    10    21  3.27  0.18
3       A1   P2      S4    11    23 13.40  1.58
4       A2   P3      S5    14    22  3.78  1.62
5       A2   P3      S6    12    24  1.20  4.24
6       A2   P4      S7    11    23    NA    NA
7       A2   P4      S8     9    20  5.69  2.99
8       A3   P5      S9    14    11  4.83  4.13
9       A3   P5     S10    11    22  2.00  3.60
10      A3   P6     S11     8    24 12.33  1.78
11      A3   P6     S12     2    16  4.01  0.23
Merge bio and chem data (keep full CHEM matches):

> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"), all.y = TRUE)
   Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1       A1   P1      S1     8    18  1.45  0.89
2       A1   P1      S2    10    21  3.27  0.18
3       A1   P2      S3    NA    NA  1.18  5.08
4       A1   P2      S4    11    23 13.40  1.58
5       A2   P3      S5    14    22  3.78  1.62
6       A2   P3      S6    12    24  1.20  4.24
7       A2   P4      S8     9    20  5.69  2.99
8       A3   P5      S9    14    11  4.83  4.13
9       A3   P5     S10    11    22  2.00  3.60
10      A3   P6     S11     8    24 12.33  1.78
11      A3   P6     S12     2    16  4.01  0.23
VLOOKUP in R
Biological data set (data.bio1):

   Resp1 Resp2 Between Plot Subplot
1      8    18      A1   P1      S1
2     10    21      A1   P1      S2
3      7    16      A1   P2      S3
4     11    23      A1   P2      S4
5     14    22      A2   P3      S5
6     12    24      A2   P3      S6
7     11    23      A2   P4      S7
8      9    20      A2   P4      S8
9     14    11      A3   P5      S9
10    11    22      A3   P5     S10
11     8    24      A3   P6     S11
12     2    16      A3   P6     S12

Generating R code:

> set.seed(1)
> data.bio1 <- expand.grid(Subplot = paste("S", 1:2, sep = ""),
+     Plot = paste("P", 1:6, sep = ""))
> data.bio1$Subplot <- gl(12, 1, 12, lab = paste("S", 1:12, sep = ""))
> data.bio1$Between <- gl(3, 4, 12, lab = paste("A", 1:3, sep = ""))
> data.bio1$Resp1 <- rpois(12, 10)
> data.bio1$Resp2 <- rpois(12, 20)
> data.bio1 <- with(data.bio1, data.frame(Resp1, Resp2, Between, Plot, Subplot))

Geographical data set, a lookup table (data.geo):

  Plot     LAT     LONG
1   P1 17.9605 145.4326
2   P2 17.5210 146.1983
3   P3 17.0011 146.3839
4   P4 18.2350 146.7934
5   P5 18.9840 146.0345
6   P6 20.1154 146.4672

> data.geo <- expand.grid(Plot = paste("P", 1:6, sep = ""))
> data.geo$LAT <- c(17.9605, 17.521, 17.0011, 18.235, 18.984, 20.1154)
> data.geo$LONG <- c(145.4326, 146.1983, 146.3839, 146.7934, 146.0345,
+     146.4672)

Incorporate (merge) the lat/longs into the bio data:

> merge(data.bio1, data.geo, by = "Plot")
   Plot Resp1 Resp2 Between Subplot     LAT     LONG
1    P1     8    18      A1      S1 17.9605 145.4326
2    P1    10    21      A1      S2 17.9605 145.4326
3    P2     7    16      A1      S3 17.5210 146.1983
4    P2    11    23      A1      S4 17.5210 146.1983
5    P3    14    22      A2      S5 17.0011 146.3839
6    P3    12    24      A2      S6 17.0011 146.3839
7    P4    11    23      A2      S7 18.2350 146.7934
8    P4     9    20      A2      S8 18.2350 146.7934
9    P5    14    11      A3      S9 18.9840 146.0345
10   P5    11    22      A3     S10 18.9840 146.0345
11   P6     8    24      A3     S11 20.1154 146.4672
12   P6     2    16      A3     S12 20.1154 146.4672
Aggregating data sets
Data are often collected and transcribed at finer temporal/spatial scales and with greater fidelity than is required for all analyses. Therefore an important phase of data preparation is also to summarize the data into the spatial/temporal scales appropriate for the desired graphical and statistical analyses.
Example data set (data.1):

  Resp1 Resp2 Between Plot
1     8    22      A1   P1
2    10    18      A1   P2
3     7    26      A2   P3
4    11    21      A2   P4
5    14    17      A3   P5
6    12    18      A3   P6

Generating R code:

> set.seed(1)
> data.1 <- expand.grid(Plot = paste("P", 1:6, sep = ""))
> data.1$Between <- gl(3, 2, 6, lab = paste("A", 1:3, sep = ""))
> data.1$Resp1 <- rpois(6, 10)
> data.1$Resp2 <- rpois(6, 20)
> data.1 <- with(data.1, data.frame(Resp1, Resp2, Between, Plot))

The major functions useful for summarizing/aggregating are listed below. (Note that the associated outputs as displayed on this page will differ in extent and formatting from those displayed in R.)
> tapply(data.1$Resp1, data.1$Between, function(x) {
+     print(x)
+     return(paste(x))
+ })
[1]  8 10
[1]  7 11
[1] 14 12
$A1
[1] "8"  "10"
$A2
[1] "7"  "11"
$A3
[1] "14" "12"
> library(reshape)
> cast(data.1, ~Between, value = "Resp1", fun = function(x) {
+     print(x)
+     return(paste(x))
+ })
  value A1_X1 A1_X2 A2_X1 A2_X2 A3_X1 A3_X2
1 (all)     8    10     7    11    14    12
> library(plyr)
> ddply(data.1, ~Between, function(x) {
+     print(x)
+     return(paste(x))
+ })
  Between        V1        V2      V3  V4
1      A1  c(8, 10) c(22, 18) c(1, 1) 1:2
2      A2  c(7, 11) c(26, 21) c(2, 2) 3:4
3      A3 c(14, 12) c(17, 18) c(3, 3) 5:6
> library(doBy)
> summaryBy(Resp1 ~ Between, data.1, FUN = function(x) {
+     print(x)
+     return(paste(x))
+ })
  Between Resp1.FUN1 Resp1.FUN2
1      A1          8         10
2      A2          7         11
3      A3         14         12
The main features of each of the above functions are compared and contrasted in the following table.
  Function   Variables passed  Passed as   Functions             Returns
  tapply     Single specified  Vectors     Single                Matrix
  cast       Single specified  Vectors     Multiples as list     Dataframe
  summaryBy  All specified     Vectors     Multiples as list     Dataframe
  ddply      All               Dataframes  Defined per variable  Dataframe

A typical data set (data):

   Resp1 Resp2 Between Plot Subplot Within
1      8    17      A1   P1      S1     B1
2     10    18      A1   P1      S1     B2
3      7    17      A1   P1      S2     B1
4     11    21      A1   P1      S2     B2
5     14    19      A2   P2      S3     B1
6     12    13      A2   P2      S3     B2
7     11    24      A2   P2      S4     B1
8      9    18      A2   P2      S4     B2
9     14    25      A3   P3      S5     B1
10    11    18      A3   P3      S5     B2
11     8    27      A3   P3      S6     B1
12     2    22      A3   P3      S6     B2
13     8    17      A1   P4      S7     B1
14    10    22      A1   P4      S7     B2
15     7    16      A1   P4      S8     B1
16    12    13      A1   P4      S8     B2
17    11    23      A2   P5      S9     B1
18    12    19      A2   P5      S9     B2
19    12    23      A2   P5     S10     B1
20    10    21      A2   P5     S10     B2
21     3    17      A3   P6     S11     B1
22    11    16      A3   P6     S11     B2
23    13    26      A3   P6     S12     B1
24     7    28      A3   P6     S12     B2

Generating R code:

> set.seed(1)
> data <- expand.grid(Within = paste("B", 1:2, sep = ""),
+     Subplot = paste("S", 1:2, sep = ""), Plot = paste("P", 1:6, sep = ""))
> data$Subplot <- gl(12, 2, 24, lab = paste("S", 1:12, sep = ""))
> data$Between <- gl(3, 4, 24, lab = paste("A", 1:3, sep = ""))
> data$Resp1 <- rpois(24, 10)
> data$Resp2 <- rpois(24, 20)
> data <- with(data, data.frame(Resp1, Resp2, Between, Plot, Subplot, Within))

Pivot tables, data summaries
Resp1 means (Between):

> tapply(data$Resp1, data$Between, mean)
    A1     A2     A3
 9.125 11.375  8.625

Resp1 means and standard deviations (Between):

> library(reshape)
> cast(data, Between ~ ., value = "Resp1", fun = c(mean, sd))
  Between   mean     sd
1      A1  9.125 1.8851
2      A2 11.375 1.5059
3      A3  8.625 4.4381

> library(doBy)
> summaryBy(Resp1 ~ Between, data, FUN = list(mean, sd))
  Between Resp1.mean Resp1.sd
1      A1      9.125   1.8851
2      A2     11.375   1.5059
3      A3      8.625   4.4381

Resp1 means (Between/Within):

> tapply(data$Resp1, list(data$Between, data$Within), mean)
     B1    B2
A1  7.5 10.75
A2 12.0 10.75
A3  9.5  7.75

> library(reshape)
> cast(data, Between ~ Within, value = "Resp1", fun = mean)
  Between   B1    B2
1      A1  7.5 10.75
2      A2 12.0 10.75
3      A3  9.5  7.75

Or, with row and column margins:

> cast(data, Between ~ Within, value = "Resp1", fun = mean,
+     margins = c("grand_col", "grand_row"))
  Between      B1    B2  (all)
1      A1  7.5000 10.75 9.1250
2      A2 12.0000 10.75 11.375
3      A3  9.5000  7.75 8.6250
4   (all)  9.6667  9.75 9.7083
Resp1 means and standard deviations (Between/Within):

> library(reshape)
> cast(data, Between ~ Within, value = "Resp1", fun = c(mean, sd))
  Between B1_mean  B1_sd B2_mean  B2_sd
1      A1     7.5 0.5774   10.75 0.9574
2      A2    12.0 1.4142   10.75 1.5000
3      A3     9.5 5.0662    7.75 4.2720

Resp1 means (Between/Within), long format:

> library(plyr)
> ddply(data, ~Between + Within, summarize, Resp1_means = mean(Resp1))
  Between Within Resp1_means
1      A1     B1        7.50
2      A1     B2       10.75
3      A2     B1       12.00
4      A2     B2       10.75
5      A3     B1        9.50
6      A3     B2        7.75

Or with doBy:

> library(doBy)
> summaryBy(Resp1 ~ Between + Within, data = data)
  Between Within Resp1.mean
1      A1     B1       7.50
2      A1     B2      10.75
3      A2     B1      12.00
4      A2     B2      10.75
5      A3     B1       9.50
6      A3     B2       7.75
Resp1 and Resp2 means (Between/Within):

> library(plyr)
> ddply(data, ~Between + Within, summarize, Resp1 = mean(Resp1),
+     Resp2 = mean(Resp2))
  Between Within Resp1 Resp2
1      A1     B1  7.50 16.75
2      A1     B2 10.75 18.50
3      A2     B1 12.00 22.25
4      A2     B2 10.75 17.75
5      A3     B1  9.50 23.75
6      A3     B2  7.75 21.00

Or with doBy:

> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between + Within, data = data)
  Between Within Resp1.mean Resp2.mean
1      A1     B1       7.50      16.75
2      A1     B2      10.75      18.50
3      A2     B1      12.00      22.25
4      A2     B2      10.75      17.75
5      A3     B1       9.50      23.75
6      A3     B2       7.75      21.00

Resp1 and Resp2 means (Between/Within), long format:

> library(plyr)
> library(reshape)
> melt(ddply(data, ~Between + Within, summarize, Resp1 = mean(Resp1),
+     Resp2 = mean(Resp2)), measure.vars = c("Resp1", "Resp2"),
+     variable_name = "Resp")
   Between Within  Resp value
1       A1     B1 Resp1  7.50
2       A1     B2 Resp1 10.75
3       A2     B1 Resp1 12.00
4       A2     B2 Resp1 10.75
5       A3     B1 Resp1  9.50
6       A3     B2 Resp1  7.75
7       A1     B1 Resp2 16.75
8       A1     B2 Resp2 18.50
9       A2     B1 Resp2 22.25
10      A2     B2 Resp2 17.75
11      A3     B1 Resp2 23.75
12      A3     B2 Resp2 21.00

Resp1 means and standard deviations (Between/Within):

> library(plyr)
> ddply(data, ~Between + Within, summarize, means = mean(Resp1),
+     sds = sd(Resp1))
  Between Within means    sds
1      A1     B1  7.50 0.5774
2      A1     B2 10.75 0.9574
3      A2     B1 12.00 1.4142
4      A2     B2 10.75 1.5000
5      A3     B1  9.50 5.0662
6      A3     B2  7.75 4.2720

Or with doBy:

> library(doBy)
> summaryBy(Resp1 ~ Between + Within, data = data, FUN = list(mean, sd))
  Between Within Resp1.mean Resp1.sd
1      A1     B1       7.50   0.5774
2      A1     B2      10.75   0.9574
3      A2     B1      12.00   1.4142
4      A2     B2      10.75   1.5000
5      A3     B1       9.50   5.0662
6      A3     B2       7.75   4.2720
Resp1 and Resp2 means and standard deviations (Between/Within), wide format:

> library(plyr)
> ddply(data, ~Between + Within, function(df)
+     data.frame(Resp1 = mean(df$Resp1), SD1 = sd(df$Resp1),
+         Resp2 = mean(df$Resp2), SD2 = sd(df$Resp2)))
  Between Within Resp1  SD1 Resp2  SD2
1      A1     B1  7.50 0.58 16.75 0.50
2      A1     B2 10.75 0.96 18.50 4.04
3      A2     B1 12.00 1.41 22.25 2.22
4      A2     B2 10.75 1.50 17.75 3.40
5      A3     B1  9.50 5.07 23.75 4.57
6      A3     B2  7.75 4.27 21.00 5.29

Or with doBy:

> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between + Within, data = data,
+     FUN = list(mean, sd))
  Between Within Resp1.mean Resp2.mean Resp1.sd Resp2.sd
1      A1     B1       7.50      16.75     0.58     0.50
2      A1     B2      10.75      18.50     0.96     4.04
3      A2     B1      12.00      22.25     1.41     2.22
4      A2     B2      10.75      17.75     1.50     3.40
5      A3     B1       9.50      23.75     5.07     4.57
6      A3     B2       7.75      21.00     4.27     5.29

Resp1 and Resp2 means and standard deviations (Between/Within), long format:

> library(plyr)
> library(reshape2)
> d1 <- ddply(data, ~Between + Within, function(df)
+     melt(data.frame(Resp1 = mean(df$Resp1), Resp2 = mean(df$Resp2)),
+         var = "Resp", value.name = "Mean"))
> d2 <- ddply(data, ~Between + Within, function(df)
+     melt(data.frame(Resp1 = sd(df$Resp1), Resp2 = sd(df$Resp2)),
+         var = "Resp", value.name = "Sd"))
> d3 <- merge(d1, d2)
> d3
   Between Within  Resp  Mean   Sd
1       A1     B1 Resp1  7.50 0.58
2       A1     B1 Resp2 16.75 0.50
3       A1     B2 Resp1 10.75 0.96
4       A1     B2 Resp2 18.50 4.04
5       A2     B1 Resp1 12.00 1.41
6       A2     B1 Resp2 22.25 2.22
7       A2     B2 Resp1 10.75 1.50
8       A2     B2 Resp2 17.75 3.40
9       A3     B1 Resp1  9.50 5.07
10      A3     B1 Resp2 23.75 4.57
11      A3     B2 Resp1  7.75 4.27
12      A3     B2 Resp2 21.00 5.29

Hierarchical aggregations
It is useful to be able to aggregate the data to a different level of replication. Two common reasons are:
- Exploratory data analysis, to assess normality and homogeneity of variance on residuals approximated at the appropriate level of replication for a given test
- Calculating observations (typically means of subreplicates) appropriate for the desired scale of the analysis
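Base R's aggregate() can produce comparable hierarchical summaries and is worth knowing as a point of comparison. A minimal sketch (assuming the data data frame generated above):

```r
## Plot-level means of both responses, for each Between/Plot combination;
## the formula interface mirrors the summaryBy calls below
aggregate(cbind(Resp1, Resp2) ~ Between + Plot, data = data, FUN = mean)

## Subplot-level means (one more grouping term in the formula)
aggregate(cbind(Resp1, Resp2) ~ Between + Plot + Subplot, data = data,
    FUN = mean)
```

Unlike ddply and summaryBy, aggregate applies a single function; for multiple summary statistics at once, the package-based approaches below are more convenient.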
Warning: gsummary() (nlme package) appears unable to resolve the base scope if functions such as mean are overloaded by other packages. Consequently, it is a good idea, when specifying the FUN parameter of gsummary, to qualify the scope, e.g. use base:::mean instead of just mean.
Plot means (ignoring the Subplot and Within columns):

> library(plyr)
> ddply(data, ~Plot, summarize, Between = unique(Between),
+     Resp1 = mean(Resp1), Resp2 = mean(Resp2))
  Plot Between Resp1 Resp2
1   P1      A1  9.00 18.25
2   P2      A2 11.50 18.50
3   P3      A3  8.75 23.00
4   P4      A1  9.25 17.00
5   P5      A2 11.25 21.50
6   P6      A3  8.50 21.75
> library(nlme)
> gsummary(data, form = ~Plot, base:::mean)
   Resp1 Resp2 Between Plot Subplot Within
P1  9.00 18.25      A1   P1      S1     B1
P2 11.50 18.50      A2   P2      S3     B1
P3  8.75 23.00      A3   P3      S5     B1
P4  9.25 17.00      A1   P4      S7     B1
P5 11.25 21.50      A2   P5     S10     B1
P6  8.50 21.75      A3   P6     S11     B1
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between + Plot, data = data, FUN = mean)
  Between Plot Resp1.mean Resp2.mean
1      A1   P1       9.00      18.25
2      A1   P4       9.25      17.00
3      A2   P2      11.50      18.50
4      A2   P5      11.25      21.50
5      A3   P3       8.75      23.00
6      A3   P6       8.50      21.75
Subplot means (ignoring the Within column):

> library(plyr)
> ddply(data, ~Plot + Subplot, summarize, Between = unique(Between),
+     Resp1 = mean(Resp1), Resp2 = mean(Resp2))
   Plot Subplot Between Resp1 Resp2
1    P1      S1      A1   9.0  17.5
2    P1      S2      A1   9.0  19.0
3    P2      S3      A2  13.0  16.0
4    P2      S4      A2  10.0  21.0
5    P3      S5      A3  12.5  21.5
6    P3      S6      A3   5.0  24.5
7    P4      S7      A1   9.0  19.5
8    P4      S8      A1   9.5  14.5
9    P5      S9      A2  11.5  21.0
10   P5     S10      A2  11.0  22.0
11   P6     S11      A3   7.0  16.5
12   P6     S12      A3  10.0  27.0
> library(nlme)
> gsummary(data, form = ~Plot/Subplot, base:::mean)
       Resp1 Resp2 Between Plot Subplot Within
P1/S1    9.0  17.5      A1   P1      S1     B1
P1/S2    9.0  19.0      A1   P1      S2     B1
P2/S3   13.0  16.0      A2   P2      S3     B1
P2/S4   10.0  21.0      A2   P2      S4     B1
P3/S5   12.5  21.5      A3   P3      S5     B1
P3/S6    5.0  24.5      A3   P3      S6     B1
P4/S7    9.0  19.5      A1   P4      S7     B1
P4/S8    9.5  14.5      A1   P4      S8     B1
P5/S9   11.5  21.0      A2   P5      S9     B1
P5/S10  11.0  22.0      A2   P5     S10     B1
P6/S11   7.0  16.5      A3   P6     S11     B1
P6/S12  10.0  27.0      A3   P6     S12     B1
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between + Plot + Subplot, data = data,
+     FUN = mean)
   Between Plot Subplot Resp1.mean Resp2.mean
1       A1   P1      S1        9.0       17.5
2       A1   P1      S2        9.0       19.0
3       A1   P4      S7        9.0       19.5
4       A1   P4      S8        9.5       14.5
5       A2   P2      S3       13.0       16.0
6       A2   P2      S4       10.0       21.0
7       A2   P5      S9       11.5       21.0
8       A2   P5     S10       11.0       22.0
9       A3   P3      S5       12.5       21.5
10      A3   P3      S6        5.0       24.5
11      A3   P6     S11        7.0       16.5
12      A3   P6     S12       10.0       27.0
Transformations and Derivatives
A very simple dataset (data.s):

  Between Plot Resp1 Resp2
1      A1   P1     8    13
2      A1   P2    10    22
3      A2   P3     7    23
4      A2   P4    11    22

Generating R code:

> set.seed(1)
> data.s <- expand.grid(Plot = paste("P", 1:4, sep = ""))
> data.s$Between <- gl(2, 2, 4, lab = paste("A", 1:2, sep = ""))
> data.s <- with(data.s, data.frame(Between, Plot, Resp1 = rpois(4, 10),
+     Resp2 = rpois(4, 20)))

Data transformations
The difference between transform and mutate is that mutate works on each column sequentially and can therefore derive new columns from columns created earlier in the same call. summaryBy, by contrast, returns a new dataframe rather than building upon the old one.

Scale (log) transform a single variable (Resp1):

> library(plyr)
> transform(data.s, logResp1 = log(Resp1))
  Between Plot Resp1 Resp2 logResp1
1      A1   P1     8    13 2.079442
2      A1   P2    10    22 2.302585
3      A2   P3     7    23 1.945910
4      A2   P4    11    22 2.397895
> library(plyr)
> mutate(data.s, logResp1 = log(Resp1))
  Between Plot Resp1 Resp2 logResp1
1      A1   P1     8    13 2.079442
2      A1   P2    10    22 2.302585
3      A2   P3     7    23 1.945910
4      A2   P4    11    22 2.397895
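The sequential behaviour of mutate can be sketched as follows (the derived column names cResp1 and cResp1sq are hypothetical, chosen for illustration):

```r
library(plyr)

## data.s as generated above
set.seed(1)
data.s <- expand.grid(Plot = paste("P", 1:4, sep = ""))
data.s$Between <- gl(2, 2, 4, lab = paste("A", 1:2, sep = ""))
data.s <- with(data.s, data.frame(Between, Plot, Resp1 = rpois(4, 10),
    Resp2 = rpois(4, 20)))

## mutate evaluates its arguments sequentially, so cResp1 can be used
## immediately to derive cResp1sq within the same call
mutate(data.s, cResp1 = Resp1 - mean(Resp1), cResp1sq = cResp1^2)

## transform evaluates all arguments against the original data frame,
## so the equivalent call fails: cResp1 is not yet a column
## transform(data.s, cResp1 = Resp1 - mean(Resp1), cResp1sq = cResp1^2)
```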
Scale (log) transform multiple variables (Resp1 and Resp2):

> library(plyr)
> transform(data.s, logResp1 = log(Resp1), logResp2 = log(Resp2))
  Between Plot Resp1 Resp2 logResp1 logResp2
1      A1   P1     8    13 2.079442 2.564949
2      A1   P2    10    22 2.302585 3.091042
3      A2   P3     7    23 1.945910 3.135494
4      A2   P4    11    22 2.397895 3.091042
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ ., data.s, FUN = list(log), id = ~Resp1 + Resp2)
  Between Plot Resp1.log Resp2.log Resp1 Resp2
1      A1   P1  2.079442  2.564949     8    13
2      A1   P2  2.302585  3.091042    10    22
3      A2   P3  1.945910  3.135494     7    23
4      A2   P4  2.397895  3.091042    11    22
Alternatively:

> library(doBy)
> summaryBy(Resp1 + Resp2 ~ ., data.s, FUN = list(c, log))
  Between Plot Resp1.c Resp2.c Resp1.log Resp2.log
1      A1   P1       8      13  2.079442  2.564949
2      A1   P2      10      22  2.302585  3.091042
3      A2   P3       7      23  1.945910  3.135494
4      A2   P4      11      22  2.397895  3.091042
Deriving new variables
Center a single variable (Resp2), centered within the entire variable:

> library(plyr)
> transform(data.s, cResp2 = Resp2 - mean(Resp2))
  Between Plot Resp1 Resp2 cResp2
1      A1   P1     8    13     -7
2      A1   P2    10    22      2
3      A2   P3     7    23      3
4      A2   P4    11    22      2
Center a single variable (Resp2), centered within each level of the between variable:

> library(plyr)
> ddply(data.s, ~Between, transform, cResp2 = Resp2 - mean(Resp2))
  Between Plot Resp1 Resp2 cResp2
1      A1   P1     8    13   -4.5
2      A1   P2    10    22    4.5
3      A2   P3     7    23    0.5
4      A2   P4    11    22   -0.5
Difference in centred means sorted by Between (a derivative of derivatives):

> library(plyr)
> ddply(data.s, ~Between, mutate, cResp1 = Resp1 - mean(Resp1),
+     cResp2 = Resp2 - mean(Resp2), cDiff = cResp1 - cResp2)
  Between Plot Resp1 Resp2 cResp1 cResp2 cDiff
1      A1   P1     8    13     -1   -4.5   3.5
2      A1   P2    10    22      1    4.5  -3.5
3      A2   P3     7    23     -2    0.5  -2.5
4      A2   P4    11    22      2   -0.5   2.5
Scale (log) transformation of Resp1 and Resp2, sorted by Between:

> library(plyr)
> ddply(data.s, ~Between, transform, logResp1 = log(Resp1),
+     logResp2 = log(Resp2))
  Between Plot Resp1 Resp2 logResp1 logResp2
1      A1   P1     8    13 2.079442 2.564949
2      A1   P2    10    22 2.302585 3.091042
3      A2   P3     7    23 1.945910 3.135494
4      A2   P4    11    22 2.397895 3.091042
Alterations
Sorting data
Most statistical analyses are invariant to the data order and thus data reordering is typically only for aesthetics in tables and figures.
Sorting data has the potential to be one of the most dangerous forms of data manipulation - particularly in spreadsheets in which there is no real binding between individual columns or rows. It is far to easy to accidentally sort the data in a single column (or row) without applying the ordering to all the other columns in the data set thereby resulting in broken data.
Whilst the above apocalypse is still possible in R, its data structures and manipulation interfaces mean that you really have to try in order to break the data in this way. Furthermore, you are encouraged to store reordered data in a different object from the original data, which makes 'rolling back' trivial.
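As a minimal sketch (using a small stand-in data frame with made-up values, not the workshop data), the safe idiom is to capture the row permutation with order() and keep the result in a new object so the original remains intact:

```r
# a small stand-in data frame (hypothetical values)
dd <- data.frame(Plot = c("P1", "P2", "P3"),
                 Resp = c(10, 7, 12),
                 LAT  = c(17.9, 17.5, 17.0))

# order() returns the row permutation; indexing with it reorders ALL
# columns together, so rows can never become misaligned
dd.sorted <- dd[order(dd$LAT), ]

dd          # original object, still in entry order
dd.sorted   # sorted by LAT: P3, P2, P1
```

Because the sorted copy lives in a separate object, rolling back is simply a matter of using the original object again.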
An example of a small omnibus dataset (data.m)

   Plot Resp1 Resp2 Between Subplot     LAT     LONG
1    P1     8    18      A1      S1 17.9605 145.4326
2    P1    10    21      A1      S2 17.9605 145.4326
3    P2     7    16      A1      S3 17.5210 146.1983
4    P2    11    23      A1      S4 17.5210 146.1983
5    P3    14    22      A2      S5 17.0011 146.3839
6    P3    12    24      A2      S6 17.0011 146.3839
7    P4    11    23      A2      S7 18.2350 146.7934
8    P4     9    20      A2      S8 18.2350 146.7934
9    P5    14    11      A3      S9 18.9840 146.0345
10   P5    11    22      A3     S10 18.9840 146.0345
11   P6     8    24      A3     S11 20.1154 146.4672
12   P6     2    16      A3     S12 20.1154 146.4672

Sort data by LAT

> data.m[order(data.m$LAT), ]
   Plot Resp1 Resp2 Between Subplot     LAT     LONG
5    P3    14    22      A2      S5 17.0011 146.3839
6    P3    12    24      A2      S6 17.0011 146.3839
3    P2     7    16      A1      S3 17.5210 146.1983
4    P2    11    23      A1      S4 17.5210 146.1983
1    P1     8    18      A1      S1 17.9605 145.4326
2    P1    10    21      A1      S2 17.9605 145.4326
7    P4    11    23      A2      S7 18.2350 146.7934
8    P4     9    20      A2      S8 18.2350 146.7934
9    P5    14    11      A3      S9 18.9840 146.0345
10   P5    11    22      A3     S10 18.9840 146.0345
11   P6     8    24      A3     S11 20.1154 146.4672
12   P6     2    16      A3     S12 20.1154 146.4672

Sort data by Resp1 then Resp2

> data.m[order(data.m$Resp1, data.m$Resp2), ]
   Plot Resp1 Resp2 Between Subplot     LAT     LONG
12   P6     2    16      A3     S12 20.1154 146.4672
3    P2     7    16      A1      S3 17.5210 146.1983
1    P1     8    18      A1      S1 17.9605 145.4326
11   P6     8    24      A3     S11 20.1154 146.4672
8    P4     9    20      A2      S8 18.2350 146.7934
2    P1    10    21      A1      S2 17.9605 145.4326
10   P5    11    22      A3     S10 18.9840 146.0345
4    P2    11    23      A1      S4 17.5210 146.1983
7    P4    11    23      A2      S7 18.2350 146.7934
6    P3    12    24      A2      S6 17.0011 146.3839
9    P5    14    11      A3      S9 18.9840 146.0345
5    P3    14    22      A2      S5 17.0011 146.3839
Re-levelling (sorting) factors
In the previous section, we altered the order of the records in the data set. However, it is important to realize that categorical variables (factors) in a data frame have a property (called levels) that defines the order of the levels of the factor, and that this property is completely independent of the order of the records in the data structure. By default, factor levels are ordered alphabetically, irrespective of their order in the data set. The order of factor levels dictates the order in which groups are plotted in figures and tabulated in tables. Consequently, re-levelling factors is often desirable.
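In miniature (using a toy factor rather than the workshop data), the idea looks like this; relevel() is a base R convenience for promoting a single level to first place:

```r
# levels default to alphabetical order, regardless of entry order
cond <- factor(c("H", "M", "L", "M"))
levels(cond)                  # "H" "L" "M"

# specify the full ordering explicitly...
factor(cond, levels = c("L", "M", "H"))

# ...or, with relevel(), just promote one level to first place
# (the remaining levels keep their existing order)
levels(relevel(cond, ref = "L"))   # "L" "H" "M"
```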
An example of a data set (data.2)

   Plot Cond Between  Temp
1    P1    H      A1 15.74
2    P2    H      A1 23.84
3    P3    H      A1 13.64
4    P4    H      A2 37.95
5    P1    M      A2 25.30
6    P2    M      A2 13.80
7    P3    M      A3 26.87
8    P4    M      A3 29.38
9    P1    L      A3 27.76
10   P2    L      A4 18.95
11   P3    L      A4 37.12
12   P4    L      A4 25.90

> set.seed(1)
> data.2 <- expand.grid(Plot = paste("P", 1:4, sep = ""),
+     Cond = c("H", "M", "L"))
> data.2$Cond <- factor(as.character(data.2$Cond))
> data.2$Between <- gl(4, 3, 12, lab = paste("A", 1:4, sep = ""))
> data.2$Temp <- rnorm(12, 22, 10)

Default alphabetical order

> levels(data.2$Cond)
[1] "H" "L" "M"

Re-levelled into L, M, H

> data.2$Cond <- factor(data.2$Cond, levels = c("L", "M", "H"))
> levels(data.2$Cond)
[1] "L" "M" "H"
Relabelling
In this case, the L, M, H labels are efficient, yet not as informative as we might like for labels on a graphic. Instead, we would probably prefer something like "Low", "Medium" and "High".
> data.2$Cond <- factor(data.2$Cond, levels = c("L", "M", "H"),
+     lab = c("Low", "Medium", "High"))
> data.2
   Plot   Cond Between  Temp
1    P1   High      A1 15.74
2    P2   High      A1 23.84
3    P3   High      A1 13.64
4    P4   High      A2 37.95
5    P1 Medium      A2 25.30
6    P2 Medium      A2 13.80
7    P3 Medium      A3 26.87
8    P4 Medium      A3 29.38
9    P1    Low      A3 27.76
10   P2    Low      A4 18.95
11   P3    Low      A4 37.12
12   P4    Low      A4 25.90
Subsetting
We regularly want to run an analysis or generate a graphic for a subset of the data. Take for example the data used here. Warning: following any subsetting, it is crucial that you then instruct R to redefine the factor levels of any factors in the subsetted data set. When subsetting a data set, all factors simply inherit the levels property of the original factors, and the new factors therefore potentially define more levels than actually occur in the subsetted data set. This, along with the solution, is illustrated below.
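In miniature (with a toy factor, separate from the workshop data), the inheritance problem and the droplevels() fix look like this:

```r
f <- factor(c("H", "M", "L"))

# subsetting keeps only the "H" value, but the factor still carries
# the full set of levels inherited from f
sub <- f[f == "H"]
levels(sub)               # "H" "L" "M"

# droplevels() discards the levels that no longer occur
levels(droplevels(sub))   # "H"
```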
An example of a data set (data.2)

   Plot Cond Between  Temp
1    P1    H      A1 15.74
2    P2    H      A1 23.84
3    P3    H      A2 13.64
4    P4    H      A2 37.95
5    P1    M      A3 25.30
6    P2    M      A3 13.80
7    P3    M      A4 26.87
8    P4    M      A4 29.38
9    P1    L      A1 27.76
10   P2    L      A1 18.95
11   P3    L      A2 37.12
12   P4    L      A2 25.90

Effect of subsetting on factor levels

# examine the levels of the Cond factor
> levels(data.2$Cond)
[1] "H" "L" "M"
# subset the dataset to just Cond H,
# examine the subset data and then
# examine the levels of the Cond factor
> data.3 <- subset(data.2, Cond == "H")
> data.3
  Plot Cond Between  Temp
1   P1    H      A1 15.74
2   P2    H      A1 23.84
3   P3    H      A2 13.64
4   P4    H      A2 37.95
> levels(data.3$Cond)
[1] "H" "L" "M"
> levels(data.3$Between)
[1] "A1" "A2" "A3" "A4"
# NOTICE that the levels property is unaffected by subsetting

Correcting factor levels post subsetting

# subset the dataset to just Cond H
> data.3 <- subset(data.2, Cond == "H")
# drop the unused factor levels from all factors
> data.3 <- droplevels(data.3)
# examine the levels of the Cond and Between factors
> levels(data.3$Cond)
[1] "H"
> levels(data.3$Between)
[1] "A1" "A2"
# NOTICE that the levels property now reflects the subset data

Or, if we only want to correct one of the factors:

# drop the unused factor levels from just the Cond factor
> data.3 <- subset(data.2, Cond == "H")
> data.3$Cond <- factor(data.3$Cond)
> levels(data.3$Cond)
[1] "H"
> levels(data.3$Between)
[1] "A1" "A2" "A3" "A4"
# NOTICE that only the Cond levels now reflect the subset data

Keep only the observations where Cond is H

> subset(data.2, Cond == "H")
  Plot Cond Between  Temp
1   P1    H      A1 15.74
2   P2    H      A1 23.84
3   P3    H      A2 13.64
4   P4    H      A2 37.95

or, equivalently, via indexing:

> data.2[data.2$Cond == "H", ]

Keep only the observations where Cond is either H or M

> subset(data.2, Cond %in% c("H", "M"))
  Plot Cond Between  Temp
1   P1    H      A1 15.74
2   P2    H      A1 23.84
3   P3    H      A2 13.64
4   P4    H      A2 37.95
5   P1    M      A3 25.30
6   P2    M      A3 13.80
7   P3    M      A4 26.87
8   P4    M      A4 29.38

or, equivalently:

> data.2[data.2$Cond %in% c("H", "M"), ]

Keep only the observations with H or M condition AND Temp < 30

> subset(data.2, Cond %in% c("H", "M") & Temp < 30)
  Plot Cond Between  Temp
1   P1    H      A1 15.74
2   P2    H      A1 23.84
3   P3    H      A2 13.64
5   P1    M      A3 25.30
6   P2    M      A3 13.80
7   P3    M      A4 26.87
8   P4    M      A4 29.38

or, equivalently:

> data.2[data.2$Cond %in% c("H", "M") & data.2$Temp < 30, ]
More complex manipulations
Here is an example data set that comprises two Responses measured in each of four years from each of four Plots. What we want is to derive a new measure that is the change in abundance between a certain year (2008) and all subsequent years - and of course, we want to do this separately for each Plot.
An example data set (data.t)

   Plot Year Resp1 Resp2
1    P1 2008     0    36
2    P1 2009    48     0
3    P1 2010    12     0
4    P1 2011     0    15
5    P2 2008    19    34
6    P2 2009    18     0
7    P2 2010    21    36
8    P2 2011    45    12
9    P3 2008    40    49
10   P3 2009    38     0
11   P3 2010     9    18
12   P3 2011    22    42
13   P4 2008    20     0
14   P4 2009    22    14
15   P4 2010    32     0
16   P4 2011    38    27

Generating R code

> set.seed(1)
> data.t <- expand.grid(Year = 2008:2011,
+     Plot = paste("P", 1:4, sep = ""))[, c(2, 1)]
> data.t <- data.frame(data.t,
+     Resp1 = rnbinom(16, 5, mu = 30) * as.numeric(replicate(4, rbinom(4, 1, 0.8))),
+     Resp2 = rnbinom(16, 5, mu = 30) * as.numeric(replicate(4, rbinom(4, 1, 0.8))))

As this is ultimately a demonstration of a complex data manipulation, we will perform the manipulation in increasingly complex iterations. Should you purely be interested in the final solution, you may wish to skip to the final point.
- Calculate the change (difference in abundance between 2008 and all subsequent years) for a single variable (Resp1)
> library(plyr)
> ddply(data.t, ~Plot, function(df) {
+     t1 <- df$Resp1[df$Year == 2008]
+     yr <- df$Year
+     t2 <- df$Resp1
+     data.frame(Year = yr, Resp1 = df$Resp1, delta = t2 - t1)
+ })
   Plot Year Resp1 delta
1    P1 2008     0     0
2    P1 2009    48    48
3    P1 2010    12    12
4    P1 2011     0     0
5    P2 2008    19     0
6    P2 2009    18    -1
7    P2 2010    21     2
8    P2 2011    45    26
9    P3 2008    40     0
10   P3 2009    38    -2
11   P3 2010     9   -31
12   P3 2011    22   -18
13   P4 2008    20     0
14   P4 2009    22     2
15   P4 2010    32    12
16   P4 2011    38    18
- As 2008 is the year to which all others are compared, we don't really want the 2008 data in the final data set, so let's suppress it.
> ddply(data.t, ~Plot, function(df) {
+     t1 <- df$Resp1[df$Year == 2008]
+     df <- df[df$Year > 2008, ]
+     yr <- df$Year
+     t2 <- df$Resp1
+     data.frame(Year = yr, Resp1 = df$Resp1, delta = t2 - t1)
+ })
   Plot Year Resp1 delta
1    P1 2009    48    48
2    P1 2010    12    12
3    P1 2011     0     0
4    P2 2009    18    -1
5    P2 2010    21     2
6    P2 2011    45    26
7    P3 2009    38    -2
8    P3 2010     9   -31
9    P3 2011    22   -18
10   P4 2009    22     2
11   P4 2010    32    12
12   P4 2011    38    18
- Let's now extend this to multiple variables (Resp1 and Resp2). This is achieved with the following modifications:
- The response variables to apply the delta calculations to are specified by their column indices in code rows 2 and 5 (e.g. 3:4)
- t1 and t2 now store vectors and matrices respectively. Therefore we need to use the sweep function in order to subtract each item of the t1 vector from each item in the corresponding columns of the t2 matrix. The sweep function takes the values in one matrix (or vector) and "sweeps" (subtracts by default) them through either the rows (1) or columns (2) of another matrix (perform the action repeatedly along the rows or columns).
- Finally, to ensure that the results of the sweep are correctly packaged into rows and appended to the splitting variable (Plot), we need to construct and return a dataframe to the ddply function.
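Since sweep() does the heavy lifting in the steps above, a tiny stand-alone illustration (with made-up numbers, unrelated to the workshop data) may help:

```r
# a toy matrix: columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)
v <- c(1, 3, 5)   # one value per column

# sweep v through the columns (MARGIN = 2): subtract v[j] from column j
swept <- sweep(m, 2, v)
swept
#      [,1] [,2] [,3]
# [1,]    0    0    0
# [2,]    1    1    1

# any two-argument function can be supplied via FUN, not just subtraction
sweep(m, 2, v, FUN = "/")
```

It is this FUN argument that the final iteration below exploits to apply a custom standardizing function instead of plain subtraction.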
> ddply(data.t, ~Plot, function(df) {
+     t1 <- as.matrix(df[df$Year == 2008, 3:4])
+     df <- df[df$Year > 2008, ]
+     yr <- df$Year
+     t2 <- as.matrix(df[, 3:4])
+     return(data.frame(Year = yr, Resp1 = df$Resp1, Resp2 = df$Resp2,
+         sweep(t2, 2, t1)))
+ })
   Plot Year Resp1 Resp2 Resp1.1 Resp2.1
1    P1 2009    48     0      48     -36
2    P1 2010    12     0      12     -36
3    P1 2011     0    15       0     -21
4    P2 2009    18     0      -1     -34
5    P2 2010    21    36       2       2
6    P2 2011    45    12      26     -22
7    P3 2009    38     0      -2     -49
8    P3 2010     9    18     -31     -31
9    P3 2011    22    42     -18      -7
10   P4 2009    22    14       2      14
11   P4 2010    32     0      12       0
12   P4 2011    38    27      18      27
- So far, the metric representing the difference between before and after has been fairly crude: it is simply the raw difference, whose magnitude is unbounded and depends on the magnitude and scale of the variables, making it meaningless for comparisons. A more sophisticated metric would standardize the differences such that they represent the proportional difference (divide by the initial value). Of course, this introduces additional issues, such as what happens if the initial value is 0? In such cases, any increase is effectively an infinite increase. However, if the initial value and a subsequent value are both zero, then a sensible value of delta would be 0. Writing the initial abundance as A1 and a subsequent abundance as A2:

      Δ = 100 × (A2 - A1) / A1    if A1 ≠ 0
      Δ = 10,000                  if A1 = 0 and A2 ≠ 0
      Δ = 0                       if A1 = 0 and A2 = 0

  We can then further scale this metric, via log(1 + Δ/101), so that it is symmetrically bounded around zero (approximately -4.615 to 4.615 on a log scale).
* An alternative way of dealing with these issues is to replace the zeros with a value 2 standard deviations below the smallest value in the set. This technique is often employed where zeros arise from the resolution of instruments (a minimum detection limit higher than the sample being measured) or from an otherwise inability to detect very small quantities. Similarly, it could be argued that a measured abundance of zero here could also be the result of an inability to detect something present in very low quantities. Hence, we could replace the zeros with estimates of the lowest detection value or similar.

The standardized and scaled option is achieved with the following modifications:
- As indicated in the previous version above, the sweep function performs a simple subtraction by default. We want the sweep function to apply a different function, one that we define in accordance with the formula and rules above. So we first create a function that can perform this task.
- Have the sweep function call our specifically-defined function.
> delta <- function(t2, t1) {
+     d <- 100 * ((t2 - t1)/t1)
+     d[t1 == 0 & t2 == 0] <- 0
+     d[t1 == 0 & t2 != 0] <- 10000
+     log(1 + (d/101))
+ }
> ddply(data.t, ~Plot, function(df) {
+     t1 <- as.matrix(df[df$Year == 2008, 3:4])
+     df <- df[df$Year > 2008, ]
+     yr <- df$Year
+     t2 <- as.matrix(df[, 3:4])
+     return(data.frame(Year = yr, Resp1 = df$Resp1, Resp2 = df$Resp2,
+         sweep(t2, 2, t1, delta)))
+ })
   Plot Year Resp1 Resp2 Resp1.1 Resp2.1
1    P1 2009    48     0    4.61   -4.62
2    P1 2010    12     0    4.61   -4.62
3    P1 2011     0    15    0.00   -0.86
4    P2 2009    18     0   -0.05   -4.62
5    P2 2010    21    36    0.10    0.06
6    P2 2011    45    12    0.86   -1.02
7    P3 2009    38     0   -0.05   -4.62
8    P3 2010     9    18   -1.46   -0.98
9    P3 2011    22    42   -0.59   -0.15
10   P4 2009    22    14    0.09    4.61
11   P4 2010    32     0    0.47    0.00
12   P4 2011    38    27    0.64    4.61
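The zero-replacement alternative mentioned above is not part of the workshop code, but it might be sketched as follows. The helper name and the exact replacement rule are assumptions; note in particular that "2 standard deviations below the smallest value" can go negative on raw abundances, so some positive floor is usually needed in practice:

```r
# hypothetical helper: replace zeros with a small positive stand-in for
# a detection limit
replaceZeros <- function(x) {
    lowest <- min(x[x > 0])
    candidate <- lowest - 2 * sd(x[x > 0])
    # fall back to half the smallest non-zero value if the candidate
    # replacement is not positive
    x[x == 0] <- if (candidate > 0) candidate else lowest/2
    x
}

replaceZeros(c(0, 12, 48, 30))   # the zero becomes 6 (half the smallest value)
```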
- The above procedure is perfectly adequate for dealing with a single ubiquitous event (disturbance), but what happens if there have been multiple different disturbances that begin and end at different times in different plots? Comparing each time period to a single, fixed time (2008 in this case) is then not logical. Let's say that in addition to our data set, we had also compiled a data set that defined the times prior to and post each disturbance for each Plot. This data set could then be used as a lookup table (as in this section above). In this example we will incorporate the post-disturbance date, yet we will not restrict the resulting data beyond this date (that is, we will continue to explore the impacts of the disturbance after the primary condition has ceased).
- Merge the data and the lookup table such that the prior and post dates are included in the data set.
- Limit the first date of the comparison for each Plot such that the first measure is relative to the defined Prior disturbance time for that Plot.
> data.v <- expand.grid(Plot = paste("P", 1:4, sep = ""))
> data.v$Prior <- c(2008, 2008, 2009, 2008)
> data.v$Post <- c(2011, 2010, 2011, 2011)
> data.v
  Plot Prior Post
1   P1  2008 2011
2   P2  2008 2010
3   P3  2009 2011
4   P4  2008 2011
> data.u <- merge(data.t, data.v, by = "Plot")
> delta <- function(t2, t1) {
+     d <- 100 * ((t2 - t1)/t1)
+     d[t1 == 0 & t2 == 0] <- 0
+     d[t1 == 0 & t2 != 0] <- 10000
+     log(1 + (d/101))
+ }
> ddply(data.u, ~Plot, function(df) {
+     t1 <- as.matrix(df[df$Year == df$Prior, 3:4])
+     df <- df[df$Year > df$Prior, ]
+     yr <- df$Year
+     t2 <- as.matrix(df[, 3:4])
+     return(data.frame(Year = yr, Resp1 = df$Resp1, Resp2 = df$Resp2,
+         sweep(t2, 2, t1, delta)))
+ })
   Plot Year Resp1 Resp2 Resp1.1 Resp2.1
1    P1 2009    48     0    4.61   -4.62
2    P1 2010    12     0    4.61   -4.62
3    P1 2011     0    15    0.00   -0.86
4    P2 2009    18     0   -0.05   -4.62
5    P2 2010    21    36    0.10    0.06
6    P2 2011    45    12    0.86   -1.02
7    P3 2010     9    18   -1.41    4.61
8    P3 2011    22    42   -0.54    4.61
9    P4 2009    22    14    0.09    4.61
10   P4 2010    32     0    0.47    0.00
11   P4 2011    38    27    0.64    4.61
Welcome to the end of this tutorial