Data manipulations
19 Oct 2011
In the preceding workshops, a great deal of emphasis was placed on exploring and analysing prepared data. However, data are rarely transcribed and organized into such simple, ready-to-go data sets. More typically, data are spread across multiple sources and in a variety of formats (particularly when compiled by multiple workers or instruments). Consequently, prior to any formal statistical analyses, it is necessary to compile very focused data sets.
Common data preparations include:
- Reshaping and rearranging data
- Merging multiple data sources
- Aggregating data to appropriate spatial/temporal scales
- Transforming data and deriving new variables
- Sorting and reordering data
- Relabelling data
Practical data manipulation will be demonstrated via a series of very small artificial datasets. These datasets are presented in tables with black font and lines, and the R code required to generate those data is presented in static code boxes either underneath or adjacent to the table. A very basic description of the table and the name of the data frame are displayed above the table. The entire collection of datasets used in this workshop can be obtained by issuing one of the following commands:
- (if online)
> source(url("http://users.monash.edu.au/~murray/AIMS-R-users/downloads/data/manipulationDatasets.R"))
- (if offline and running this from a local version, run the following from within the /downloads/data directory)
> source(file = "manipulationDatasets.R")

The results of data manipulations are displayed in tables with blue font and lines. Clicking on these tables will reveal the R source code that was used to perform the manipulation along with the actual result as it appears in R (not formatted for HTML).
Reshaping datasets
Wide to long (melt)
Whilst wide data formats are often more compact and typically easier to manage for data entry (particularly in the field), they are not in the appropriate format for most analyses (traditional repeated measures being one exception). Most analyses require that each replicate is in its own row, and thus it is necessary to rearrange or reshape (melt) the data from this wide format into the long (molten) format.
The melt function (reshape package) is very useful for converting wide (repeated measures-like) data into long format. The important parameters to specify are:
- id: the set of factors already in long format
- measure.var: the columns that define the repeated measure and should be melted into a single long column
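As a minimal sketch of melting (the small wide data frame here is hypothetical, invented for illustration; the column and level names mirror those used elsewhere on this page):

```r
library(reshape)

## A small hypothetical wide data set: one row per subplot,
## with the two within-subplot measurements in columns B1 and B2
data.wide <- data.frame(Plot = c("P1", "P1", "P2", "P2"),
    Subplot = paste("S", 1:4, sep = ""),
    B1 = c(8, 7, 14, 11), B2 = c(10, 11, 12, 9))

## Melt the B1/B2 columns into a single long column;
## id lists the factors already in long format
melt(data.wide, id = c("Plot", "Subplot"),
    measure.var = c("B1", "B2"), variable_name = "Within")
```

The result has one row per Plot/Subplot/Within combination, with the measurements stacked in a single `value` column.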
Long to wide (cast)
The cast function (reshape package) is very useful for converting long datasets into wide (repeated measures-like) format. The heart of the function definition is a formula of the form LHS ~ RHS where:
- LHS: a plus-separated set of variables to retain in long format
- RHS: the variable to be cast into wide format
An example of a long data set (data):

   Resp1 Resp2 Between Plot Subplot Within
1      8    17      A1   P1      S1     B1
2     10    18      A1   P1      S1     B2
3      7    17      A1   P1      S2     B1
4     11    21      A1   P1      S2     B2
5     14    19      A2   P2      S3     B1
6     12    13      A2   P2      S3     B2
7     11    24      A2   P2      S4     B1
8      9    18      A2   P2      S4     B2
9     14    25      A3   P3      S5     B1
10    11    18      A3   P3      S5     B2
11     8    27      A3   P3      S6     B1
12     2    22      A3   P3      S6     B2
13     8    17      A1   P4      S7     B1
14    10    22      A1   P4      S7     B2
15     7    16      A1   P4      S8     B1
16    12    13      A1   P4      S8     B2
17    11    23      A2   P5      S9     B1
18    12    19      A2   P5      S9     B2
19    12    23      A2   P5     S10     B1
20    10    21      A2   P5     S10     B2
21     3    17      A3   P6     S11     B1
22    11    16      A3   P6     S11     B2
23    13    26      A3   P6     S12     B1
24     7    28      A3   P6     S12     B2

Generating R code:

> set.seed(1)
> data <- expand.grid(Within = paste("B", 1:2, sep = ""),
+     Subplot = paste("S", 1:2, sep = ""),
+     Plot = paste("P", 1:6, sep = ""))
> data$Subplot <- gl(12, 2, 24, lab = paste("S", 1:12, sep = ""))
> data$Between <- gl(3, 4, 24, lab = paste("A", 1:3, sep = ""))
> data$Resp1 <- rpois(24, 10)
> data$Resp2 <- rpois(24, 20)
> data <- with(data, data.frame(Resp1, Resp2, Between, Plot,
+     Subplot, Within))

Widen (cast) Resp1 for repeated measures (Within):

> library(reshape)
> cast(data, Between + Plot + Subplot ~ Within, value = "Resp1")
   Between Plot Subplot B1 B2
1       A1   P1      S1  8 10
2       A1   P1      S2  7 11
3       A1   P4      S7  8 10
4       A1   P4      S8  7 12
5       A2   P2      S3 14 12
6       A2   P2      S4 11  9
7       A2   P5      S9 11 12
8       A2   P5     S10 12 10
9       A3   P3      S5 14 11
10      A3   P3      S6  8  2
11      A3   P6     S11  3 11
12      A3   P6     S12 13  7
Widen (cast) both Resp1 and Resp2 for repeated measures:

> library(reshape)
> d1 <- cast(data, Between + Plot + Subplot ~ Within, value = "Resp1")
> d2 <- cast(data, Between + Plot + Subplot ~ Within, value = "Resp2")
> merge(d1, d2, by = c("Between", "Plot", "Subplot"),
+     suffixes = c("Resp1", "Resp2"))
   Between Plot Subplot B1Resp1 B2Resp1 B1Resp2 B2Resp2
1       A1   P1      S1       8      10      17      18
2       A1   P1      S2       7      11      17      21
3       A1   P4      S7       8      10      17      22
4       A1   P4      S8       7      12      16      13
5       A2   P2      S3      14      12      19      13
6       A2   P2      S4      11       9      24      18
7       A2   P5     S10      12      10      23      21
8       A2   P5      S9      11      12      23      19
9       A3   P3      S5      14      11      25      18
10      A3   P3      S6       8       2      27      22
11      A3   P6     S11       3      11      17      16
12      A3   P6     S12      13       7      26      28
Merging data sets
It is common to have data associated with a particular study organized into a number of separate data tables (databases etc). In fact, large data sets are best managed in databases. However, statistical analyses generally require all data to be encapsulated within a single data structure. Therefore, prior to analysis, it is necessary to bring together multiple sources.
This phase of data preparation can be one of the most difficult to get right and verify.
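The merge() function's all, all.x and all.y arguments control which non-matching rows are retained (analogous to inner, full, left and right joins). A minimal sketch, using two small hypothetical tables invented for illustration:

```r
## Two hypothetical tables sharing a "Site" key; S1 has no chemistry
## data and S4 has no biological data
a <- data.frame(Site = c("S1", "S2", "S3"), Resp = c(8, 10, 7))
b <- data.frame(Site = c("S2", "S3", "S4"), Chem = c(1.4, 3.2, 1.1))

merge(a, b, by = "Site")                # inner join: S2 and S3 only
merge(a, b, by = "Site", all = TRUE)    # full join: S1-S4, NAs filled in
merge(a, b, by = "Site", all.x = TRUE)  # left join: keep all rows of a
merge(a, b, by = "Site", all.y = TRUE)  # right join: keep all rows of b
```

Each of these four forms is demonstrated on the bio/chem data below.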
Biological data set, missing Subplot 3 (data.bio):

   Resp1 Resp2 Between Plot Subplot
1      8    18      A1   P1      S1
2     10    21      A1   P1      S2
4     11    23      A1   P2      S4
5     14    22      A2   P3      S5
6     12    24      A2   P3      S6
7     11    23      A2   P4      S7
8      9    20      A2   P4      S8
9     14    11      A3   P5      S9
10    11    22      A3   P5     S10
11     8    24      A3   P6     S11
12     2    16      A3   P6     S12

Generating R code:

> set.seed(1)
> data.bio <- expand.grid(Subplot = paste("S", 1:2, sep = ""),
+     Plot = paste("P", 1:6, sep = ""))
> data.bio$Subplot <- gl(12, 1, 12, lab = paste("S", 1:12, sep = ""))
> data.bio$Between <- gl(3, 4, 12, lab = paste("A", 1:3, sep = ""))
> data.bio$Resp1 <- rpois(12, 10)
> data.bio$Resp2 <- rpois(12, 20)
> data.bio <- with(data.bio, data.frame(Resp1, Resp2, Between, Plot, Subplot))
> data.bio <- data.bio[-3, ]

Physio-chemical data set, missing Subplot 7 (data.chem):

   Chem1 Chem2 Between Plot Subplot
1   1.45  0.89      A1   P1      S1
2   3.27  0.18      A1   P1      S2
3   1.18  5.08      A1   P2      S3
4  13.40  1.58      A1   P2      S4
5   3.78  1.62      A2   P3      S5
6   1.20  4.24      A2   P3      S6
8   5.69  2.99      A2   P4      S8
9   4.83  4.13      A3   P5      S9
10  2.00  3.60      A3   P5     S10
11 12.33  1.78      A3   P6     S11
12  4.01  0.23      A3   P6     S12

Generating R code:

> set.seed(1)
> data.chem <- expand.grid(Subplot = paste("S", 1:2, sep = ""),
+     Plot = paste("P", 1:6, sep = ""))
> data.chem$Subplot <- gl(12, 1, 12, lab = paste("S", 1:12, sep = ""))
> data.chem$Between <- gl(3, 4, 12, lab = paste("A", 1:3, sep = ""))
> data.chem$Chem1 <- rlnorm(12, 1)
> data.chem$Chem2 <- rlnorm(12, 0.5)
> data.chem <- with(data.chem, data.frame(Chem1, Chem2, Between, Plot, Subplot))
> data.chem <- data.chem[-7, ]

Merge bio and chem data (only keep full matches):

> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"))
   Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1       A1   P1      S1     8    18  1.45  0.89
2       A1   P1      S2    10    21  3.27  0.18
3       A1   P2      S4    11    23 13.40  1.58
4       A2   P3      S5    14    22  3.78  1.62
5       A2   P3      S6    12    24  1.20  4.24
6       A2   P4      S8     9    20  5.69  2.99
7       A3   P5     S10    11    22  2.00  3.60
8       A3   P5      S9    14    11  4.83  4.13
9       A3   P6     S11     8    24 12.33  1.78
10      A3   P6     S12     2    16  4.01  0.23
Merge bio and chem data (keep all data):

> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"), all = TRUE)
   Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1       A1   P1      S1     8    18  1.45  0.89
2       A1   P1      S2    10    21  3.27  0.18
3       A1   P2      S3    NA    NA  1.18  5.08
4       A1   P2      S4    11    23 13.40  1.58
5       A2   P3      S5    14    22  3.78  1.62
6       A2   P3      S6    12    24  1.20  4.24
7       A2   P4      S7    11    23    NA    NA
8       A2   P4      S8     9    20  5.69  2.99
9       A3   P5      S9    14    11  4.83  4.13
10      A3   P5     S10    11    22  2.00  3.60
11      A3   P6     S11     8    24 12.33  1.78
12      A3   P6     S12     2    16  4.01  0.23
Merge bio and chem data (only keep full BIO matches):

> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"), all.x = TRUE)
   Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1       A1   P1      S1     8    18  1.45  0.89
2       A1   P1      S2    10    21  3.27  0.18
3       A1   P2      S4    11    23 13.40  1.58
4       A2   P3      S5    14    22  3.78  1.62
5       A2   P3      S6    12    24  1.20  4.24
6       A2   P4      S7    11    23    NA    NA
7       A2   P4      S8     9    20  5.69  2.99
8       A3   P5      S9    14    11  4.83  4.13
9       A3   P5     S10    11    22  2.00  3.60
10      A3   P6     S11     8    24 12.33  1.78
11      A3   P6     S12     2    16  4.01  0.23
Merge bio and chem data (keep full CHEM matches):

> merge(data.bio, data.chem, by = c("Between", "Plot", "Subplot"), all.y = TRUE)
   Between Plot Subplot Resp1 Resp2 Chem1 Chem2
1       A1   P1      S1     8    18  1.45  0.89
2       A1   P1      S2    10    21  3.27  0.18
3       A1   P2      S3    NA    NA  1.18  5.08
4       A1   P2      S4    11    23 13.40  1.58
5       A2   P3      S5    14    22  3.78  1.62
6       A2   P3      S6    12    24  1.20  4.24
7       A2   P4      S8     9    20  5.69  2.99
8       A3   P5      S9    14    11  4.83  4.13
9       A3   P5     S10    11    22  2.00  3.60
10      A3   P6     S11     8    24 12.33  1.78
11      A3   P6     S12     2    16  4.01  0.23
VLOOKUP in R
Biological data set (data.bio1):

   Resp1 Resp2 Between Plot Subplot
1      8    18      A1   P1      S1
2     10    21      A1   P1      S2
3      7    16      A1   P2      S3
4     11    23      A1   P2      S4
5     14    22      A2   P3      S5
6     12    24      A2   P3      S6
7     11    23      A2   P4      S7
8      9    20      A2   P4      S8
9     14    11      A3   P5      S9
10    11    22      A3   P5     S10
11     8    24      A3   P6     S11
12     2    16      A3   P6     S12

Generating R code:

> set.seed(1)
> data.bio1 <- expand.grid(Subplot = paste("S", 1:2, sep = ""),
+     Plot = paste("P", 1:6, sep = ""))
> data.bio1$Subplot <- gl(12, 1, 12, lab = paste("S", 1:12, sep = ""))
> data.bio1$Between <- gl(3, 4, 12, lab = paste("A", 1:3, sep = ""))
> data.bio1$Resp1 <- rpois(12, 10)
> data.bio1$Resp2 <- rpois(12, 20)
> data.bio1 <- with(data.bio1, data.frame(Resp1, Resp2, Between, Plot, Subplot))

Geographical data set, a lookup table (data.geo):

  Plot     LAT     LONG
1   P1 17.9605 145.4326
2   P2 17.5210 146.1983
3   P3 17.0011 146.3839
4   P4 18.2350 146.7934
5   P5 18.9840 146.0345
6   P6 20.1154 146.4672

> data.geo <- expand.grid(Plot = paste("P", 1:6, sep = ""))
> data.geo$LAT <- c(17.9605, 17.521, 17.0011, 18.235, 18.984, 20.1154)
> data.geo$LONG <- c(145.4326, 146.1983, 146.3839, 146.7934, 146.0345,
+     146.4672)

Incorporate (merge) the lat/longs into the bio data:

> merge(data.bio1, data.geo, by = "Plot")
   Plot Resp1 Resp2 Between Subplot     LAT     LONG
1    P1     8    18      A1      S1 17.9605 145.4326
2    P1    10    21      A1      S2 17.9605 145.4326
3    P2     7    16      A1      S3 17.5210 146.1983
4    P2    11    23      A1      S4 17.5210 146.1983
5    P3    14    22      A2      S5 17.0011 146.3839
6    P3    12    24      A2      S6 17.0011 146.3839
7    P4    11    23      A2      S7 18.2350 146.7934
8    P4     9    20      A2      S8 18.2350 146.7934
9    P5    14    11      A3      S9 18.9840 146.0345
10   P5    11    22      A3     S10 18.9840 146.0345
11   P6     8    24      A3     S11 20.1154 146.4672
12   P6     2    16      A3     S12 20.1154 146.4672
Aggregating data sets
Data are often collected and transcribed at finer temporal/spatial scales and with greater fidelity than is required for all analyses. Therefore an important phase of data preparation is also to summarize the data into the spatial/temporal scales appropriate for the desired graphical and statistical analyses.
Example data set (data.1):

  Resp1 Resp2 Between Plot
1     8    22      A1   P1
2    10    18      A1   P2
3     7    26      A2   P3
4    11    21      A2   P4
5    14    17      A3   P5
6    12    18      A3   P6

Generating R code:

> set.seed(1)
> data.1 <- expand.grid(Plot = paste("P", 1:6, sep = ""))
> data.1$Between <- gl(3, 2, 6, lab = paste("A", 1:3, sep = ""))
> data.1$Resp1 <- rpois(6, 10)
> data.1$Resp2 <- rpois(6, 20)
> data.1 <- with(data.1, data.frame(Resp1, Resp2, Between, Plot))

The major functions useful for summarizing/aggregating are listed below. (Note that the associated outputs as displayed on this page will differ in extent and formatting from those displayed in R.)
> tapply(data.1$Resp1, data.1$Between, function(x) {
+     print(x)
+     return(paste(x))
+ })
[1]  8 10
[1]  7 11
[1] 14 12
$A1
[1] "8"  "10"
$A2
[1] "7"  "11"
$A3
[1] "14" "12"
> library(reshape)
> cast(data.1, ~Between, value = "Resp1", fun = function(x) {
+     print(x)
+     return(paste(x))
+ })
  value A1_X1 A1_X2 A2_X1 A2_X2 A3_X1 A3_X2
1 (all)     8    10     7    11    14    12
> library(plyr)
> ddply(data.1, ~Between, function(x) {
+     print(x)
+     return(paste(x))
+ })
  Between        V1        V2      V3  V4
1      A1  c(8, 10) c(22, 18) c(1, 1) 1:2
2      A2  c(7, 11) c(26, 21) c(2, 2) 3:4
3      A3 c(14, 12) c(17, 18) c(3, 3) 5:6
> library(doBy)
> summaryBy(Resp1 ~ Between, data.1, FUN = function(x) {
+     print(x)
+     return(paste(x))
+ })
  Between Resp1.FUN1 Resp1.FUN2
1      A1          8         10
2      A2          7         11
3      A3         14         12
The main features of each of the above functions are compared and contrasted in the following table.
  Function   Variables passed  Passed as   Functions             Returns
  tapply     Single specified  Vectors     Single                Matrix
  cast       Single specified  Vectors     Multiples as list     Dataframe
  summaryBy  All specified     Vectors     Multiples as list     Dataframe
  ddply      All               Dataframes  Defined per variable  Dataframe

A typical data set (data):

   Resp1 Resp2 Between Plot Subplot Within
1      8    17      A1   P1      S1     B1
2     10    18      A1   P1      S1     B2
3      7    17      A1   P1      S2     B1
4     11    21      A1   P1      S2     B2
5     14    19      A2   P2      S3     B1
6     12    13      A2   P2      S3     B2
7     11    24      A2   P2      S4     B1
8      9    18      A2   P2      S4     B2
9     14    25      A3   P3      S5     B1
10    11    18      A3   P3      S5     B2
11     8    27      A3   P3      S6     B1
12     2    22      A3   P3      S6     B2
13     8    17      A1   P4      S7     B1
14    10    22      A1   P4      S7     B2
15     7    16      A1   P4      S8     B1
16    12    13      A1   P4      S8     B2
17    11    23      A2   P5      S9     B1
18    12    19      A2   P5      S9     B2
19    12    23      A2   P5     S10     B1
20    10    21      A2   P5     S10     B2
21     3    17      A3   P6     S11     B1
22    11    16      A3   P6     S11     B2
23    13    26      A3   P6     S12     B1
24     7    28      A3   P6     S12     B2

Generating R code:

> set.seed(1)
> data <- expand.grid(Within = paste("B", 1:2, sep = ""),
+     Subplot = paste("S", 1:2, sep = ""), Plot = paste("P", 1:6, sep = ""))
> data$Subplot <- gl(12, 2, 24, lab = paste("S", 1:12, sep = ""))
> data$Between <- gl(3, 4, 24, lab = paste("A", 1:3, sep = ""))
> data$Resp1 <- rpois(24, 10)
> data$Resp2 <- rpois(24, 20)
> data <- with(data, data.frame(Resp1, Resp2, Between, Plot, Subplot, Within))

Pivot tables, data summaries
Resp1 means (Between):

> tapply(data$Resp1, data$Between, mean)
    A1     A2     A3
 9.125 11.375  8.625

Resp1 means and standard deviations (Between):

> library(reshape)
> cast(data, Between ~ ., value = "Resp1", fun = c(mean, sd))
  Between   mean     sd
1      A1  9.125 1.8851
2      A2 11.375 1.5059
3      A3  8.625 4.4381

> library(doBy)
> summaryBy(Resp1 ~ Between, data, FUN = list(mean, sd))
  Between Resp1.mean Resp1.sd
1      A1      9.125   1.8851
2      A2     11.375   1.5059
3      A3      8.625   4.4381

Resp1 means (Between/Within):

> tapply(data$Resp1, list(data$Between, data$Within), mean)
     B1    B2
A1  7.5 10.75
A2 12.0 10.75
A3  9.5  7.75

> library(reshape)
> cast(data, Between ~ Within, value = "Resp1", fun = mean)
  Between   B1    B2
1      A1  7.5 10.75
2      A2 12.0 10.75
3      A3  9.5  7.75

Or, with row and column margins:

> cast(data, Between ~ Within, value = "Resp1", fun = mean,
+     margins = c("grand_col", "grand_row"))
  Between      B1    B2  (all)
1      A1  7.5000 10.75 9.1250
2      A2 12.0000 10.75 11.375
3      A3  9.5000  7.75 8.6250
4   (all)  9.6667  9.75 9.7083
Resp1 means and standard deviations (Between/Within):

> library(reshape)
> cast(data, Between ~ Within, value = "Resp1", fun = c(mean, sd))
  Between B1_mean  B1_sd B2_mean  B2_sd
1      A1     7.5 0.5774   10.75 0.9574
2      A2    12.0 1.4142   10.75 1.5000
3      A3     9.5 5.0662    7.75 4.2720

Resp1 means (Between/Within), long format:

> library(plyr)
> ddply(data, ~Between + Within, summarize, Resp1_means = mean(Resp1))
  Between Within Resp1_means
1      A1     B1        7.50
2      A1     B2       10.75
3      A2     B1       12.00
4      A2     B2       10.75
5      A3     B1        9.50
6      A3     B2        7.75

Or with doBy:

> library(doBy)
> summaryBy(Resp1 ~ Between + Within, data = data)
  Between Within Resp1.mean
1      A1     B1       7.50
2      A1     B2      10.75
3      A2     B1      12.00
4      A2     B2      10.75
5      A3     B1       9.50
6      A3     B2       7.75
Resp1 and Resp2 means (Between/Within):

> library(plyr)
> ddply(data, ~Between + Within, summarize, Resp1 = mean(Resp1),
+     Resp2 = mean(Resp2))
  Between Within Resp1 Resp2
1      A1     B1  7.50 16.75
2      A1     B2 10.75 18.50
3      A2     B1 12.00 22.25
4      A2     B2 10.75 17.75
5      A3     B1  9.50 23.75
6      A3     B2  7.75 21.00

Or with doBy:

> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between + Within, data = data)
  Between Within Resp1.mean Resp2.mean
1      A1     B1       7.50      16.75
2      A1     B2      10.75      18.50
3      A2     B1      12.00      22.25
4      A2     B2      10.75      17.75
5      A3     B1       9.50      23.75
6      A3     B2       7.75      21.00

Resp1 and Resp2 means (Between/Within), long format:

> library(plyr)
> library(reshape)
> melt(ddply(data, ~Between + Within, summarize, Resp1 = mean(Resp1),
+     Resp2 = mean(Resp2)), measure.vars = c("Resp1", "Resp2"),
+     variable_name = "Resp")
   Between Within  Resp value
1       A1     B1 Resp1  7.50
2       A1     B2 Resp1 10.75
3       A2     B1 Resp1 12.00
4       A2     B2 Resp1 10.75
5       A3     B1 Resp1  9.50
6       A3     B2 Resp1  7.75
7       A1     B1 Resp2 16.75
8       A1     B2 Resp2 18.50
9       A2     B1 Resp2 22.25
10      A2     B2 Resp2 17.75
11      A3     B1 Resp2 23.75
12      A3     B2 Resp2 21.00

Resp1 means and standard deviations (Between/Within):

> library(plyr)
> ddply(data, ~Between + Within, summarize, means = mean(Resp1),
+     sds = sd(Resp1))
  Between Within means    sds
1      A1     B1  7.50 0.5774
2      A1     B2 10.75 0.9574
3      A2     B1 12.00 1.4142
4      A2     B2 10.75 1.5000
5      A3     B1  9.50 5.0662
6      A3     B2  7.75 4.2720

Or with doBy:

> library(doBy)
> summaryBy(Resp1 ~ Between + Within, data = data, FUN = list(mean, sd))
  Between Within Resp1.mean Resp1.sd
1      A1     B1       7.50   0.5774
2      A1     B2      10.75   0.9574
3      A2     B1      12.00   1.4142
4      A2     B2      10.75   1.5000
5      A3     B1       9.50   5.0662
6      A3     B2       7.75   4.2720
Resp1 and Resp2 means and standard deviations (Between/Within), wide format:

> library(plyr)
> ddply(data, ~Between + Within, function(df)
+     data.frame(Resp1 = mean(df$Resp1), SD1 = sd(df$Resp1),
+         Resp2 = mean(df$Resp2), SD2 = sd(df$Resp2)))
  Between Within Resp1  SD1 Resp2  SD2
1      A1     B1  7.50 0.58 16.75 0.50
2      A1     B2 10.75 0.96 18.50 4.04
3      A2     B1 12.00 1.41 22.25 2.22
4      A2     B2 10.75 1.50 17.75 3.40
5      A3     B1  9.50 5.07 23.75 4.57
6      A3     B2  7.75 4.27 21.00 5.29

Or with doBy:

> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between + Within, data = data,
+     FUN = list(mean, sd))
  Between Within Resp1.mean Resp2.mean Resp1.sd Resp2.sd
1      A1     B1       7.50      16.75     0.58     0.50
2      A1     B2      10.75      18.50     0.96     4.04
3      A2     B1      12.00      22.25     1.41     2.22
4      A2     B2      10.75      17.75     1.50     3.40
5      A3     B1       9.50      23.75     5.07     4.57
6      A3     B2       7.75      21.00     4.27     5.29

Resp1 and Resp2 means and standard deviations (Between/Within), long format:

> library(plyr)
> library(reshape2)
> d1 <- ddply(data, ~Between + Within, function(df)
+     melt(data.frame(Resp1 = mean(df$Resp1), Resp2 = mean(df$Resp2)),
+         var = "Resp", value.name = "Mean"))
> d2 <- ddply(data, ~Between + Within, function(df)
+     melt(data.frame(Resp1 = sd(df$Resp1), Resp2 = sd(df$Resp2)),
+         var = "Resp", value.name = "Sd"))
> d3 <- merge(d1, d2)
> d3
   Between Within  Resp  Mean   Sd
1       A1     B1 Resp1  7.50 0.58
2       A1     B1 Resp2 16.75 0.50
3       A1     B2 Resp1 10.75 0.96
4       A1     B2 Resp2 18.50 4.04
5       A2     B1 Resp1 12.00 1.41
6       A2     B1 Resp2 22.25 2.22
7       A2     B2 Resp1 10.75 1.50
8       A2     B2 Resp2 17.75 3.40
9       A3     B1 Resp1  9.50 5.07
10      A3     B1 Resp2 23.75 4.57
11      A3     B2 Resp1  7.75 4.27
12      A3     B2 Resp2 21.00 5.29

Hierarchical aggregations
It is useful to be able to aggregate the data to a different level of replication. Two common reasons are:
- Exploratory data analysis, to assess normality and homogeneity of variance on residuals approximated at the appropriate level of replication for a given test
- Calculating observations (typically means of subreplicates) appropriate for the desired scale of the analysis
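Base R's aggregate() can produce comparable hierarchical summaries and is worth knowing as a point of comparison. A minimal sketch (assuming the data data frame generated above):

```r
## Plot-level means of both responses, for each Between/Plot combination;
## the formula interface mirrors the summaryBy calls below
aggregate(cbind(Resp1, Resp2) ~ Between + Plot, data = data, FUN = mean)

## Subplot-level means (one more grouping term in the formula)
aggregate(cbind(Resp1, Resp2) ~ Between + Plot + Subplot, data = data,
    FUN = mean)
```

Unlike ddply and summaryBy, aggregate applies a single function; for multiple summary statistics at once, the package-based approaches below are more convenient.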
Warning: gsummary() (nlme package) appears unable to resolve the base scope if functions such as mean are overloaded by other packages. Consequently, it is a good idea, when specifying the FUN parameter of gsummary, to qualify the scope, e.g. use base:::mean instead of just mean.
Plot means (ignoring the Subplot and Within columns):

> library(plyr)
> ddply(data, ~Plot, summarize, Between = unique(Between),
+     Resp1 = mean(Resp1), Resp2 = mean(Resp2))
  Plot Between Resp1 Resp2
1   P1      A1  9.00 18.25
2   P2      A2 11.50 18.50
3   P3      A3  8.75 23.00
4   P4      A1  9.25 17.00
5   P5      A2 11.25 21.50
6   P6      A3  8.50 21.75
> library(nlme)
> gsummary(data, form = ~Plot, base:::mean)
   Resp1 Resp2 Between Plot Subplot Within
P1  9.00 18.25      A1   P1      S1     B1
P2 11.50 18.50      A2   P2      S3     B1
P3  8.75 23.00      A3   P3      S5     B1
P4  9.25 17.00      A1   P4      S7     B1
P5 11.25 21.50      A2   P5     S10     B1
P6  8.50 21.75      A3   P6     S11     B1
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between + Plot, data = data, FUN = mean)
  Between Plot Resp1.mean Resp2.mean
1      A1   P1       9.00      18.25
2      A1   P4       9.25      17.00
3      A2   P2      11.50      18.50
4      A2   P5      11.25      21.50
5      A3   P3       8.75      23.00
6      A3   P6       8.50      21.75
Subplot means (ignoring the Within column):

> library(plyr)
> ddply(data, ~Plot + Subplot, summarize, Between = unique(Between),
+     Resp1 = mean(Resp1), Resp2 = mean(Resp2))
   Plot Subplot Between Resp1 Resp2
1    P1      S1      A1   9.0  17.5
2    P1      S2      A1   9.0  19.0
3    P2      S3      A2  13.0  16.0
4    P2      S4      A2  10.0  21.0
5    P3      S5      A3  12.5  21.5
6    P3      S6      A3   5.0  24.5
7    P4      S7      A1   9.0  19.5
8    P4      S8      A1   9.5  14.5
9    P5      S9      A2  11.5  21.0
10   P5     S10      A2  11.0  22.0
11   P6     S11      A3   7.0  16.5
12   P6     S12      A3  10.0  27.0
> library(nlme)
> gsummary(data, form = ~Plot/Subplot, base:::mean)
       Resp1 Resp2 Between Plot Subplot Within
P1/S1    9.0  17.5      A1   P1      S1     B1
P1/S2    9.0  19.0      A1   P1      S2     B1
P2/S3   13.0  16.0      A2   P2      S3     B1
P2/S4   10.0  21.0      A2   P2      S4     B1
P3/S5   12.5  21.5      A3   P3      S5     B1
P3/S6    5.0  24.5      A3   P3      S6     B1
P4/S7    9.0  19.5      A1   P4      S7     B1
P4/S8    9.5  14.5      A1   P4      S8     B1
P5/S9   11.5  21.0      A2   P5      S9     B1
P5/S10  11.0  22.0      A2   P5     S10     B1
P6/S11   7.0  16.5      A3   P6     S11     B1
P6/S12  10.0  27.0      A3   P6     S12     B1
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ Between + Plot + Subplot, data = data,
+     FUN = mean)
   Between Plot Subplot Resp1.mean Resp2.mean
1       A1   P1      S1        9.0       17.5
2       A1   P1      S2        9.0       19.0
3       A1   P4      S7        9.0       19.5
4       A1   P4      S8        9.5       14.5
5       A2   P2      S3       13.0       16.0
6       A2   P2      S4       10.0       21.0
7       A2   P5      S9       11.5       21.0
8       A2   P5     S10       11.0       22.0
9       A3   P3      S5       12.5       21.5
10      A3   P3      S6        5.0       24.5
11      A3   P6     S11        7.0       16.5
12      A3   P6     S12       10.0       27.0
Transformations and Derivatives
A very simple dataset (data.s):

  Between Plot Resp1 Resp2
1      A1   P1     8    13
2      A1   P2    10    22
3      A2   P3     7    23
4      A2   P4    11    22

Generating R code:

> set.seed(1)
> data.s <- expand.grid(Plot = paste("P", 1:4, sep = ""))
> data.s$Between <- gl(2, 2, 4, lab = paste("A", 1:2, sep = ""))
> data.s <- with(data.s, data.frame(Between, Plot, Resp1 = rpois(4, 10),
+     Resp2 = rpois(4, 20)))

Data transformations
The difference between transform and mutate is that mutate works on each column sequentially and can therefore derive new columns from columns created earlier in the same call. summaryBy, by contrast, returns a new dataframe rather than building upon the old one.

Scale (log) transform a single variable (Resp1):

> library(plyr)
> transform(data.s, logResp1 = log(Resp1))
  Between Plot Resp1 Resp2 logResp1
1      A1   P1     8    13 2.079442
2      A1   P2    10    22 2.302585
3      A2   P3     7    23 1.945910
4      A2   P4    11    22 2.397895
> library(plyr)
> mutate(data.s, logResp1 = log(Resp1))
  Between Plot Resp1 Resp2 logResp1
1      A1   P1     8    13 2.079442
2      A1   P2    10    22 2.302585
3      A2   P3     7    23 1.945910
4      A2   P4    11    22 2.397895
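The sequential behaviour of mutate can be sketched as follows (the derived column names cResp1 and cResp1sq are hypothetical, chosen for illustration):

```r
library(plyr)

## data.s as generated above
set.seed(1)
data.s <- expand.grid(Plot = paste("P", 1:4, sep = ""))
data.s$Between <- gl(2, 2, 4, lab = paste("A", 1:2, sep = ""))
data.s <- with(data.s, data.frame(Between, Plot, Resp1 = rpois(4, 10),
    Resp2 = rpois(4, 20)))

## mutate evaluates its arguments sequentially, so cResp1 can be used
## immediately to derive cResp1sq within the same call
mutate(data.s, cResp1 = Resp1 - mean(Resp1), cResp1sq = cResp1^2)

## transform evaluates all arguments against the original data frame,
## so the equivalent call fails: cResp1 is not yet a column
## transform(data.s, cResp1 = Resp1 - mean(Resp1), cResp1sq = cResp1^2)
```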
Scale (log) transform multiple variables (Resp1 and Resp2):

> library(plyr)
> transform(data.s, logResp1 = log(Resp1), logResp2 = log(Resp2))
  Between Plot Resp1 Resp2 logResp1 logResp2
1      A1   P1     8    13 2.079442 2.564949
2      A1   P2    10    22 2.302585 3.091042
3      A2   P3     7    23 1.945910 3.135494
4      A2   P4    11    22 2.397895 3.091042
> library(doBy)
> summaryBy(Resp1 + Resp2 ~ ., data.s, FUN = list(log), id = ~Resp1 + Resp2)
  Between Plot Resp1.log Resp2.log Resp1 Resp2
1      A1   P1  2.079442  2.564949     8    13
2      A1   P2  2.302585  3.091042    10    22
3      A2   P3  1.945910  3.135494     7    23
4      A2   P4  2.397895  3.091042    11    22
Alternatively:

> library(doBy)
> summaryBy(Resp1 + Resp2 ~ ., data.s, FUN = list(c, log))
  Between Plot Resp1.c Resp2.c Resp1.log Resp2.log
1      A1   P1       8      13  2.079442  2.564949
2      A1   P2      10      22  2.302585  3.091042
3      A2   P3       7      23  1.945910  3.135494
4      A2   P4      11      22  2.397895  3.091042
Deriving new variables
Center a single variable (Resp2), centered within the entire variable:

> library(plyr)
> transform(data.s, cResp2 = Resp2 - mean(Resp2))
  Between Plot Resp1 Resp2 cResp2
1      A1   P1     8    13     -7
2      A1   P2    10    22      2
3      A2   P3     7    23      3
4      A2   P4    11    22      2
Center a single variable (Resp2), centered within each level of the between variable:

> library(plyr)
> ddply(data.s, ~Between, transform, cResp2 = Resp2 - mean(Resp2))
  Between Plot Resp1 Resp2 cResp2
1      A1   P1     8    13   -4.5
2      A1   P2    10    22    4.5
3      A2   P3     7    23    0.5
4      A2   P4    11    22   -0.5
Difference in centred means sorted by Between (a derivative of derivatives):

> library(plyr)
> ddply(data.s, ~Between, mutate, cResp1 = Resp1 - mean(Resp1),
+     cResp2 = Resp2 - mean(Resp2), cDiff = cResp1 - cResp2)
  Between Plot Resp1 Resp2 cResp1 cResp2 cDiff
1      A1   P1     8    13     -1   -4.5   3.5
2      A1   P2    10    22      1    4.5  -3.5
3      A2   P3     7    23     -2    0.5  -2.5
4      A2   P4    11    22      2   -0.5   2.5
Scale (log) transformation of Resp1 and Resp2, sorted by Between:

> library(plyr)
> ddply(data.s, ~Between, transform, logResp1 = log(Resp1),
+     logResp2 = log(Resp2))
  Between Plot Resp1 Resp2 logResp1 logResp2
1      A1   P1     8    13 2.079442 2.564949
2      A1   P2    10    22 2.302585 3.091042
3      A2   P3     7    23 1.945910 3.135494
4      A2   P4    11    22 2.397895 3.091042
Alterations
Sorting data
Most statistical analyses are invariant to the data order and thus data reordering is typically only for aesthetics in tables and figures.
Sorting data has the potential to be one of the most dangerous forms of data manipulation - particularly in spreadsheets in which there is no real binding between individual columns or rows. It is far to easy to accidentally sort the data in a single column (or row) without applying the ordering to all the other columns in the data set thereby resulting in broken data.
Whilst the above apocalypse is still possible in R, its data structures and manipulation interfaces mean that you really have to try in order to break the data in this way. Furthermore, you are encouraged to store reordered data in a different object from the original data, which makes 'rolling back' trivial.
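As a minimal sketch (using a small stand-in data frame with made-up values, not the workshop data), the safe idiom is to capture the row permutation with order() and keep the result in a new object so the original remains intact:

```r
# a small stand-in data frame (hypothetical values)
dd <- data.frame(Plot = c("P1", "P2", "P3"),
                 Resp = c(10, 7, 12),
                 LAT  = c(17.9, 17.5, 17.0))

# order() returns the row permutation; indexing with it reorders ALL
# columns together, so rows can never become misaligned
dd.sorted <- dd[order(dd$LAT), ]

dd          # original object, still in entry order
dd.sorted   # sorted by LAT: P3, P2, P1
```

Because the sorted copy lives in a separate object, rolling back is simply a matter of using the original object again.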
An example of a small omnibus dataset (data.m)

   Plot Resp1 Resp2 Between Subplot     LAT     LONG
1    P1     8    18      A1      S1 17.9605 145.4326
2    P1    10    21      A1      S2 17.9605 145.4326
3    P2     7    16      A1      S3 17.5210 146.1983
4    P2    11    23      A1      S4 17.5210 146.1983
5    P3    14    22      A2      S5 17.0011 146.3839
6    P3    12    24      A2      S6 17.0011 146.3839
7    P4    11    23      A2      S7 18.2350 146.7934
8    P4     9    20      A2      S8 18.2350 146.7934
9    P5    14    11      A3      S9 18.9840 146.0345
10   P5    11    22      A3     S10 18.9840 146.0345
11   P6     8    24      A3     S11 20.1154 146.4672
12   P6     2    16      A3     S12 20.1154 146.4672

Sort data by LAT

> data.m[order(data.m$LAT), ]
   Plot Resp1 Resp2 Between Subplot     LAT     LONG
5    P3    14    22      A2      S5 17.0011 146.3839
6    P3    12    24      A2      S6 17.0011 146.3839
3    P2     7    16      A1      S3 17.5210 146.1983
4    P2    11    23      A1      S4 17.5210 146.1983
1    P1     8    18      A1      S1 17.9605 145.4326
2    P1    10    21      A1      S2 17.9605 145.4326
7    P4    11    23      A2      S7 18.2350 146.7934
8    P4     9    20      A2      S8 18.2350 146.7934
9    P5    14    11      A3      S9 18.9840 146.0345
10   P5    11    22      A3     S10 18.9840 146.0345
11   P6     8    24      A3     S11 20.1154 146.4672
12   P6     2    16      A3     S12 20.1154 146.4672

Sort data by Resp1 then Resp2

> data.m[order(data.m$Resp1, data.m$Resp2), ]
   Plot Resp1 Resp2 Between Subplot     LAT     LONG
12   P6     2    16      A3     S12 20.1154 146.4672
3    P2     7    16      A1      S3 17.5210 146.1983
1    P1     8    18      A1      S1 17.9605 145.4326
11   P6     8    24      A3     S11 20.1154 146.4672
8    P4     9    20      A2      S8 18.2350 146.7934
2    P1    10    21      A1      S2 17.9605 145.4326
10   P5    11    22      A3     S10 18.9840 146.0345
4    P2    11    23      A1      S4 17.5210 146.1983
7    P4    11    23      A2      S7 18.2350 146.7934
6    P3    12    24      A2      S6 17.0011 146.3839
9    P5    14    11      A3      S9 18.9840 146.0345
5    P3    14    22      A2      S5 17.0011 146.3839
Re-levelling (sorting) factors
In the previous section, we altered the order of the records in the data set. However, it is important to realize that categorical variables (factors) in a data frame have a property (called levels) that defines the order of the levels of the factor, and that this property is completely independent of the order of the records in the data structure. By default, factor levels are ordered alphabetically, irrespective of their order in the data set. The order of factor levels dictates the order in which groups are plotted in figures and tabulated in tables. Consequently, re-levelling factors is often desirable.
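In miniature (using a toy factor rather than the workshop data), the idea looks like this; relevel() is a base R convenience for promoting a single level to first place:

```r
# levels default to alphabetical order, regardless of entry order
cond <- factor(c("H", "M", "L", "M"))
levels(cond)                  # "H" "L" "M"

# specify the full ordering explicitly...
factor(cond, levels = c("L", "M", "H"))

# ...or, with relevel(), just promote one level to first place
# (the remaining levels keep their existing order)
levels(relevel(cond, ref = "L"))   # "L" "H" "M"
```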
An example of a data set (data.2)

   Plot Cond Between  Temp
1    P1    H      A1 15.74
2    P2    H      A1 23.84
3    P3    H      A1 13.64
4    P4    H      A2 37.95
5    P1    M      A2 25.30
6    P2    M      A2 13.80
7    P3    M      A3 26.87
8    P4    M      A3 29.38
9    P1    L      A3 27.76
10   P2    L      A4 18.95
11   P3    L      A4 37.12
12   P4    L      A4 25.90

> set.seed(1)
> data.2 <- expand.grid(Plot = paste("P", 1:4, sep = ""),
+     Cond = c("H", "M", "L"))
> data.2$Cond <- factor(as.character(data.2$Cond))
> data.2$Between <- gl(4, 3, 12, lab = paste("A", 1:4, sep = ""))
> data.2$Temp <- rnorm(12, 22, 10)

Default alphabetical order

> levels(data.2$Cond)
[1] "H" "L" "M"

Re-levelled into L, M, H

> data.2$Cond <- factor(data.2$Cond, levels = c("L", "M", "H"))
> levels(data.2$Cond)
[1] "L" "M" "H"
Relabelling
In this case, the L, M, H labels are efficient, yet not as informative as we might like for labels on a graphic. Instead, we would probably prefer something like "Low", "Medium" and "High".
> data.2$Cond <- factor(data.2$Cond, levels = c("L", "M", "H"),
+     lab = c("Low", "Medium", "High"))
> data.2
   Plot   Cond Between  Temp
1    P1   High      A1 15.74
2    P2   High      A1 23.84
3    P3   High      A1 13.64
4    P4   High      A2 37.95
5    P1 Medium      A2 25.30
6    P2 Medium      A2 13.80
7    P3 Medium      A3 26.87
8    P4 Medium      A3 29.38
9    P1    Low      A3 27.76
10   P2    Low      A4 18.95
11   P3    Low      A4 37.12
12   P4    Low      A4 25.90
Subsetting
We regularly want to run an analysis or generate a graphic for a subset of the data. Take for example the data used here. Warning: following any subsetting, it is crucial that you then instruct R to redefine the factor levels of any factors in the subsetted data set. When subsetting a data set, all factors simply inherit the levels property of the original factors, and the new factors therefore potentially define more levels than actually occur in the subsetted data set. This, along with the solution, is illustrated below.
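In miniature (with a toy factor, separate from the workshop data), the inheritance problem and the droplevels() fix look like this:

```r
f <- factor(c("H", "M", "L"))

# subsetting keeps only the "H" value, but the factor still carries
# the full set of levels inherited from f
sub <- f[f == "H"]
levels(sub)               # "H" "L" "M"

# droplevels() discards the levels that no longer occur
levels(droplevels(sub))   # "H"
```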
An example of a data set (data.2)

   Plot Cond Between  Temp
1    P1    H      A1 15.74
2    P2    H      A1 23.84
3    P3    H      A2 13.64
4    P4    H      A2 37.95
5    P1    M      A3 25.30
6    P2    M      A3 13.80
7    P3    M      A4 26.87
8    P4    M      A4 29.38
9    P1    L      A1 27.76
10   P2    L      A1 18.95
11   P3    L      A2 37.12
12   P4    L      A2 25.90

Effect of subsetting on factor levels

# examine the levels of the Cond factor
> levels(data.2$Cond)
[1] "H" "L" "M"
# subset the dataset to just Cond H,
# examine the subset data and then
# examine the levels of the Cond factor
> data.3 <- subset(data.2, Cond == "H")
> data.3
  Plot Cond Between  Temp
1   P1    H      A1 15.74
2   P2    H      A1 23.84
3   P3    H      A2 13.64
4   P4    H      A2 37.95
> levels(data.3$Cond)
[1] "H" "L" "M"
> levels(data.3$Between)
[1] "A1" "A2" "A3" "A4"
# NOTICE that the levels property is unaffected by subsetting

Correcting factor levels post subsetting

# subset the dataset to just Cond H
> data.3 <- subset(data.2, Cond == "H")
# drop the unused factor levels from all factors
> data.3 <- droplevels(data.3)
# examine the levels of the Cond and Between factors
> levels(data.3$Cond)
[1] "H"
> levels(data.3$Between)
[1] "A1" "A2"
# NOTICE that the levels property now reflects the subset data

Or, if we only want to correct one of the factors:

# drop the unused factor levels from just the Cond factor
> data.3 <- subset(data.2, Cond == "H")
> data.3$Cond <- factor(data.3$Cond)
> levels(data.3$Cond)
[1] "H"
> levels(data.3$Between)
[1] "A1" "A2" "A3" "A4"
# NOTICE that only the Cond levels now reflect the subset data

Keep only the observations where Cond is H

> subset(data.2, Cond == "H")
  Plot Cond Between  Temp
1   P1    H      A1 15.74
2   P2    H      A1 23.84
3   P3    H      A2 13.64
4   P4    H      A2 37.95

or, equivalently, via indexing:

> data.2[data.2$Cond == "H", ]

Keep only the observations where Cond is either H or M

> subset(data.2, Cond %in% c("H", "M"))
  Plot Cond Between  Temp
1   P1    H      A1 15.74
2   P2    H      A1 23.84
3   P3    H      A2 13.64
4   P4    H      A2 37.95
5   P1    M      A3 25.30
6   P2    M      A3 13.80
7   P3    M      A4 26.87
8   P4    M      A4 29.38

or, equivalently:

> data.2[data.2$Cond %in% c("H", "M"), ]

Keep only the observations with H or M condition AND Temp < 30

> subset(data.2, Cond %in% c("H", "M") & Temp < 30)
  Plot Cond Between  Temp
1   P1    H      A1 15.74
2   P2    H      A1 23.84
3   P3    H      A2 13.64
5   P1    M      A3 25.30
6   P2    M      A3 13.80
7   P3    M      A4 26.87
8   P4    M      A4 29.38

or, equivalently:

> data.2[data.2$Cond %in% c("H", "M") & data.2$Temp < 30, ]
More complex manipulations
Here is an example data set that comprises two Responses measured in each of four years from each of four Plots. What we want is to derive a new measure that is the change in abundance between a certain year (2008) and all subsequent years - and of course, we want to do this separately for each Plot.
An example data set (data.t)

   Plot Year Resp1 Resp2
1    P1 2008     0    36
2    P1 2009    48     0
3    P1 2010    12     0
4    P1 2011     0    15
5    P2 2008    19    34
6    P2 2009    18     0
7    P2 2010    21    36
8    P2 2011    45    12
9    P3 2008    40    49
10   P3 2009    38     0
11   P3 2010     9    18
12   P3 2011    22    42
13   P4 2008    20     0
14   P4 2009    22    14
15   P4 2010    32     0
16   P4 2011    38    27

Generating R code

> set.seed(1)
> data.t <- expand.grid(Year = 2008:2011,
+     Plot = paste("P", 1:4, sep = ""))[, c(2, 1)]
> data.t <- data.frame(data.t,
+     Resp1 = rnbinom(16, 5, mu = 30) * as.numeric(replicate(4, rbinom(4, 1, 0.8))),
+     Resp2 = rnbinom(16, 5, mu = 30) * as.numeric(replicate(4, rbinom(4, 1, 0.8))))

As this is ultimately a demonstration of a complex data manipulation, we will perform the manipulation in increasingly complex iterations. Should you purely be interested in the final solution, you may wish to skip to the final point.
- Calculate the change (difference in abundance between 2008 and all subsequent years) for a single variable (Resp1)
> library(plyr)
> ddply(data.t, ~Plot, function(df) {
+     t1 <- df$Resp1[df$Year == 2008]
+     yr <- df$Year
+     t2 <- df$Resp1
+     data.frame(Year = yr, Resp1 = df$Resp1, delta = t2 - t1)
+ })
   Plot Year Resp1 delta
1    P1 2008     0     0
2    P1 2009    48    48
3    P1 2010    12    12
4    P1 2011     0     0
5    P2 2008    19     0
6    P2 2009    18    -1
7    P2 2010    21     2
8    P2 2011    45    26
9    P3 2008    40     0
10   P3 2009    38    -2
11   P3 2010     9   -31
12   P3 2011    22   -18
13   P4 2008    20     0
14   P4 2009    22     2
15   P4 2010    32    12
16   P4 2011    38    18
- As 2008 is the year to which all others are compared, we don't really want the 2008 data in the final data set, so let's suppress it.
> ddply(data.t, ~Plot, function(df) {
+     t1 <- df$Resp1[df$Year == 2008]
+     df <- df[df$Year > 2008, ]
+     yr <- df$Year
+     t2 <- df$Resp1
+     data.frame(Year = yr, Resp1 = df$Resp1, delta = t2 - t1)
+ })
   Plot Year Resp1 delta
1    P1 2009    48    48
2    P1 2010    12    12
3    P1 2011     0     0
4    P2 2009    18    -1
5    P2 2010    21     2
6    P2 2011    45    26
7    P3 2009    38    -2
8    P3 2010     9   -31
9    P3 2011    22   -18
10   P4 2009    22     2
11   P4 2010    32    12
12   P4 2011    38    18
- Let's now extend this to multiple variables (Resp1 and Resp2). This is achieved with the following modifications:
- The response variables to apply the delta calculations to are specified by their column indices in code rows 2 and 5 (e.g. 3:4)
- t1 and t2 now store vectors and matrices respectively. Therefore we need to use the sweep function in order to subtract each item of the t1 vector from each item in the corresponding columns of the t2 matrix. The sweep function takes the values in one matrix (or vector) and "sweeps" (subtracts by default) them through either the rows (1) or columns (2) of another matrix (perform the action repeatedly along the rows or columns).
- Finally, to ensure that the results of the sweep are correctly packaged into rows and appended to the splitting variable (Plot), we need to construct and return a dataframe to the ddply function.
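Since sweep() does the heavy lifting in the steps above, a tiny stand-alone illustration (with made-up numbers, unrelated to the workshop data) may help:

```r
# a toy matrix: columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)
v <- c(1, 3, 5)   # one value per column

# sweep v through the columns (MARGIN = 2): subtract v[j] from column j
swept <- sweep(m, 2, v)
swept
#      [,1] [,2] [,3]
# [1,]    0    0    0
# [2,]    1    1    1

# any two-argument function can be supplied via FUN, not just subtraction
sweep(m, 2, v, FUN = "/")
```

It is this FUN argument that the final iteration below exploits to apply a custom standardizing function instead of plain subtraction.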
> ddply(data.t, ~Plot, function(df) {
+     t1 <- as.matrix(df[df$Year == 2008, 3:4])
+     df <- df[df$Year > 2008, ]
+     yr <- df$Year
+     t2 <- as.matrix(df[, 3:4])
+     return(data.frame(Year = yr, Resp1 = df$Resp1, Resp2 = df$Resp2,
+         sweep(t2, 2, t1)))
+ })
   Plot Year Resp1 Resp2 Resp1.1 Resp2.1
1    P1 2009    48     0      48     -36
2    P1 2010    12     0      12     -36
3    P1 2011     0    15       0     -21
4    P2 2009    18     0      -1     -34
5    P2 2010    21    36       2       2
6    P2 2011    45    12      26     -22
7    P3 2009    38     0      -2     -49
8    P3 2010     9    18     -31     -31
9    P3 2011    22    42     -18      -7
10   P4 2009    22    14       2      14
11   P4 2010    32     0      12       0
12   P4 2011    38    27      18      27
- So far, the metric representing the difference between before and after has been fairly crude: it is simply the raw difference, whose magnitude is unbounded and depends on the magnitude and scale of the variables, making it meaningless for comparisons. A more sophisticated metric would standardize the differences such that they represent the proportional difference (divide by the initial value). Of course, this introduces additional issues, such as what happens if the initial value is 0? In such cases, any increase is effectively an infinite increase. However, if the initial value and a subsequent value are both zero, then a sensible value of delta would be 0. Writing the initial abundance as A1 and a subsequent abundance as A2:

      Δ = 100 × (A2 - A1) / A1    if A1 ≠ 0
      Δ = 10,000                  if A1 = 0 and A2 ≠ 0
      Δ = 0                       if A1 = 0 and A2 = 0

  We can then further scale this metric, via log(1 + Δ/101), so that it is symmetrically bounded around zero (approximately -4.615 to 4.615 on a log scale).
* An alternative way of dealing with these issues is to replace the zeros with a value 2 standard deviations below the smallest value in the set. This technique is often employed where zeros arise from the resolution of instruments (a minimum detection limit higher than the sample being measured) or from an otherwise inability to detect very small quantities. Similarly, it could be argued that a measured abundance of zero here could also be the result of an inability to detect something present in very low quantities. Hence, we could replace the zeros with estimates of the lowest detection value or similar.

The standardized and scaled option is achieved with the following modifications:
- As indicated in the previous version above, the sweep function performs a simple subtraction by default. We want the sweep function to apply a different function, one that we define in accordance with the formula and rules above. So we first create a function that can perform this task.
- Have the sweep function call our specifically-defined function.
> delta <- function(t2, t1) {
+     d <- 100 * ((t2 - t1)/t1)
+     d[t1 == 0 & t2 == 0] <- 0
+     d[t1 == 0 & t2 != 0] <- 10000
+     log(1 + (d/101))
+ }
> ddply(data.t, ~Plot, function(df) {
+     t1 <- as.matrix(df[df$Year == 2008, 3:4])
+     df <- df[df$Year > 2008, ]
+     yr <- df$Year
+     t2 <- as.matrix(df[, 3:4])
+     return(data.frame(Year = yr, Resp1 = df$Resp1, Resp2 = df$Resp2,
+         sweep(t2, 2, t1, delta)))
+ })
   Plot Year Resp1 Resp2 Resp1.1 Resp2.1
1    P1 2009    48     0    4.61   -4.62
2    P1 2010    12     0    4.61   -4.62
3    P1 2011     0    15    0.00   -0.86
4    P2 2009    18     0   -0.05   -4.62
5    P2 2010    21    36    0.10    0.06
6    P2 2011    45    12    0.86   -1.02
7    P3 2009    38     0   -0.05   -4.62
8    P3 2010     9    18   -1.46   -0.98
9    P3 2011    22    42   -0.59   -0.15
10   P4 2009    22    14    0.09    4.61
11   P4 2010    32     0    0.47    0.00
12   P4 2011    38    27    0.64    4.61
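The zero-replacement alternative mentioned above is not part of the workshop code, but it might be sketched as follows. The helper name and the exact replacement rule are assumptions; note in particular that "2 standard deviations below the smallest value" can go negative on raw abundances, so some positive floor is usually needed in practice:

```r
# hypothetical helper: replace zeros with a small positive stand-in for
# a detection limit
replaceZeros <- function(x) {
    lowest <- min(x[x > 0])
    candidate <- lowest - 2 * sd(x[x > 0])
    # fall back to half the smallest non-zero value if the candidate
    # replacement is not positive
    x[x == 0] <- if (candidate > 0) candidate else lowest/2
    x
}

replaceZeros(c(0, 12, 48, 30))   # the zero becomes 6 (half the smallest value)
```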
- The above procedure is perfectly adequate for dealing with a single ubiquitous event (disturbance), but what happens if there have been multiple different disturbances that begin and end at different times in different plots? Comparing each time period to a single, fixed time (2008 in this case) is then not logical. Let's say that in addition to our data set, we had also compiled a data set that defined the times prior to and post each disturbance for each Plot. This data set could then be used as a lookup table (as in this section above). In this example we will incorporate the post-disturbance date, yet we will not restrict the resulting data beyond this date (that is, we will continue to explore the impacts of the disturbance after the primary condition has ceased).
- Merge the data and the lookup table such that the prior and post dates are included in the data set.
- Limit the first date of the comparison for each Plot such that the first measure is relative to the defined Prior disturbance time for that Plot.
> data.v <- expand.grid(Plot = paste("P", 1:4, sep = ""))
> data.v$Prior <- c(2008, 2008, 2009, 2008)
> data.v$Post <- c(2011, 2010, 2011, 2011)
> data.v
  Plot Prior Post
1   P1  2008 2011
2   P2  2008 2010
3   P3  2009 2011
4   P4  2008 2011
> data.u <- merge(data.t, data.v, by = "Plot")
> delta <- function(t2, t1) {
+     d <- 100 * ((t2 - t1)/t1)
+     d[t1 == 0 & t2 == 0] <- 0
+     d[t1 == 0 & t2 != 0] <- 10000
+     log(1 + (d/101))
+ }
> ddply(data.u, ~Plot, function(df) {
+     t1 <- as.matrix(df[df$Year == df$Prior, 3:4])
+     df <- df[df$Year > df$Prior, ]
+     yr <- df$Year
+     t2 <- as.matrix(df[, 3:4])
+     return(data.frame(Year = yr, Resp1 = df$Resp1, Resp2 = df$Resp2,
+         sweep(t2, 2, t1, delta)))
+ })
   Plot Year Resp1 Resp2 Resp1.1 Resp2.1
1    P1 2009    48     0    4.61   -4.62
2    P1 2010    12     0    4.61   -4.62
3    P1 2011     0    15    0.00   -0.86
4    P2 2009    18     0   -0.05   -4.62
5    P2 2010    21    36    0.10    0.06
6    P2 2011    45    12    0.86   -1.02
7    P3 2010     9    18   -1.41    4.61
8    P3 2011    22    42   -0.54    4.61
9    P4 2009    22    14    0.09    4.61
10   P4 2010    32     0    0.47    0.00
11   P4 2011    38    27    0.64    4.61
Welcome to the end of this tutorial