In this article we will show you how to use the R statistical package for analyzing the output of DroughtSpotter result files. These files are in a .csv format which stands for comma separated values. Over time the results from the DroughSpotter can reach big sizes causing commonly used spreadsheet programs to either work very slowly, refuse to load the files or just crash. The R statistical package however does not pose such problems and it is a more professional way of working with large data sets. Since it can be hard to start with R we want to show you the most common tasks to be done and how to achieve them with R. To successfully start make sure the R software for doing the calculations and RStudio which is a nice, handy frontend to R are installed on your computer.

To use R, you should use RStudio since it provides a more natural way for most users to operate the software. You could use R directly though it is command line driven, meaning every operation you want to do has to be typed. With RStudio you still type, but it can be used with the mouse, there are many handy functionalities to help you on the way, extensive documentation and the ability to create graphs and preview/use them immediately. With R, to get a graph you would first type the commands to actually provide the data for the graph, save the graph and then open it in a separate program to look if it is the expected result which can be quite tedious.

Loading the .csv files

To load data into RStudio from DroughtSpotter .csv files, first load up RStudio. From there the first thing to do is to set the working directory. Set it to a convenient place where the .csv file is also located. Then to load the file use the read.table command in R.

> ObjDemoExperiment <- read.table(file = "./result_demo_experiment.csv", header = TRUE, skip = 3, sep="\t", stringsAsFactors = FALSE)
What we have done here is use the read.table function in R to load the file "result_demo_experiment.csv" into an object in R. Keep in mind that everything in R in which you put any kind of information is called an object that you can use and manipulate in R using functions, such as the read.table function. You see two other options above, header and sep. These are options for the read.table function to alter it's behavior. The skip option tells the read.table function to skip the first line of the .csv since these are not useful data. A DroughtSpotter .csv file has the following extra lines that we thus remove:


The header option tells the read.table function that the first line after the removal of the first rows  actually has our column names, so we can reference and use them later in R. The sep function tells the read.table function which separator is used. Although the .csv function is an abbreviation of comma separated values, it is misleading. Values can be separated by more then just comma's such as semicolon (;), dots (.), spaces ( ) or TABS (\t). In the example we use here the values are separated by TABS, so the option reads sep="\t". One more thing to keep in mind, is quoting with "", these are used by the R software to interpret commands as one instead of multiple, because some characters have a special meaning to R such as spaces, slashes and others. To prevent R from acting on these and just read them as characters we use quotes. If your command does something strange, the first thing to check is the use of quotes. Sometimes they are not needed such as with the option "header" in the example. The value for header is either FALSE or TRUE. With or without quotes, it is unambiguous to R so they are not needed.

Now that we have loaded the file into R, we can have a first look at it. Just type the object name as on the first line below, the rest is what follows after the command.

> ObjDemoExperiment
              Timestamp Row Column Weight_g Irrigation_amount_g Irrigation_mode Temp_Celsius CO2_ppm RH_. PAR_umol.m2.s Reliable Status
1      2017.07.05 10:30   E      4   2631.8                 0.0               ?            0       0    0             0       ok      M
2      2017.07.05 10:30   F      4   2631.9                 0.0               ?            0       0    0             0       ok      M
3      2017.07.05 10:30   A     13   2625.1                 0.0               ?            0       0    0             0       ok      M
4      2017.07.05 10:30   A     14   2681.8                 0.0               ?            0       0    0             0       ok      M
5      2017.07.05 10:30   B     13   2630.7                 0.0               ?            0       0    0             0       ok      M
6      2017.07.05 10:30   A     21   2627.8                 0.0               ?            0       0    0             0       ok      M
7      2017.07.05 10:30   A     22   2625.9                 0.0               ?            0       0    0             0       ok      M
8      2017.07.05 10:30   B     21   2636.8                 0.0               ?            0       0    0             0       ok      M
9      2017.07.05 10:30   A     25   2652.5                 0.0               ?            0       0    0             0       ok      M
10     2017.07.05 10:30   A     27   2657.5                 0.0               ?            0       0    0             0       ok      M
11     2017.07.05 10:30   B     26   2649.8                 0.0               ?            0       0    0             0       ok      M
12     2017.07.05 10:30   E     11   2634.4                 0.0               ?            0       0    0             0       ok      M
13     2017.07.05 10:30   E     12   2634.3                 0.0               ?            0       0    0             0       ok      M
14     2017.07.05 10:30   F     12   2632.6                 0.0               ?            0       0    0             0       ok      M
15     2017.07.05 10:30   E     20   2635.1                 0.0               ?            0       0    0             0       ok      M
16     2017.07.05 10:30   F     17   2625.4                 0.0               ?            0       0    0             0       ok      M
17     2017.07.05 10:30   F     20   2628.3                 0.0               ?            0       0    0             0       ok      M
18     2017.07.05 10:30   E     27   2664.8                 0.0               ?            0       0    0             0       ok      M
19     2017.07.05 10:30   F     25   2669.6                 0.0               ?            0       0    0             0       ok      M
20     2017.07.05 10:30   F     26   2681.8                 0.0               ?            0       0    0             0       ok      M
21     2017.07.05 10:30   E     23   2625.7                 0.0               ?            0       0    0             0       ok      M
22     2017.07.05 10:30   E     24   2636.2                 0.0               ?            0       0    0             0       ok      M
23     2017.07.05 10:30   F     21   2626.4                 0.0               ?            0       0    0             0       ok      M
24     2017.07.05 10:30   F     24   2628.8                 0.0               ?            0       0    0             0       ok      M
25     2017.07.05 10:30   E     14   2625.4                 0.0               ?            0       0    0             0       ok      M
26     2017.07.05 10:30   E     16   2631.3                 0.0               ?            0       0    0             0       ok      M
27     2017.07.05 10:30   F     14   2629.2                 0.0               ?            0       0    0             0       ok      M
28     2017.07.05 10:30   F     15   2634.7                 0.0               ?            0       0    0             0       ok      M
29     2017.07.05 10:30   F     16   2626.9                 0.0               ?            0       0    0             0       ok      M
30     2017.07.05 10:30   A      3   2624.9                 0.0               ?            0       0    0             0       ok      M
31     2017.07.05 10:30   A      4   2633.0                 0.0               ?            0       0    0             0       ok      M
32     2017.07.05 10:30   B      2   2632.7                 0.0               ?            0       0    0             0       ok      M
33     2017.07.05 10:30   B      3   2634.0                 0.0               ?            0       0    0             0       ok      M
34     2017.07.05 10:30   C      2   2634.7                 0.0               ?            0       0    0             0       ok      M
35     2017.07.05 10:30   C      3   2654.5                 0.0               ?            0       0    0             0       ok      M
36     2017.07.05 10:30   A      9   2631.8                 0.0               ?            0       0    0             0       ok      M
37     2017.07.05 10:30   A     10   2633.2                 0.0               ?            0       0    0             0       ok      M
38     2017.07.05 10:30   B      9   2626.1                 0.0               ?            0       0    0             0       ok      M
39     2017.07.05 10:30   B     11   2662.1                 0.0               ?            0       0    0             0       ok      M
40     2017.07.05 10:30   B     12   2631.1                 0.0               ?            0       0    0             0       ok      M
41     2017.07.05 10:30   C     12   2630.2                 0.0               ?            0       0    0             0       ok      M
 [ reached getOption("max.print") -- omitted 151701 rows ]
>

Generating a histogram of the number of observations per DroughtSpotter scale

To quality check the data, we want to take the location information and the number of its occurrences in it's file. Every row represents one measurement, at one timestamp of one specific scale on the DroughSpotter. So to generate a histogram of the measurement frequencies we need the information from the Row and Column in our object. To ease our workflow, we first create a new column in our object where the value of Row and Column are pasted together, another more technical term for it is to concatenate. To do so we use the paste command. Our total command will look like below:

> ObjDemoExperiment$Scale <- paste(ObjDemoExperiment$Row, ObjDemoExperiment$Column, sep="")
The command makes use of the dollar sign ($) to create and access columns in our object. The command ObjDemoExperiment$Scale creates the column "Scale" in our object since it does not exist yet. The commands ObjDemoExperiment$Row and ObjDemoExperiment$Column get the data currently stored in our object for these columns, they already exist on our object. The paste command actually combines the data from ObjDemoExperiment$Row and ObjDemoExperiment$Column per line into new column "Scale". The sep="" command is an option to paste and tells it to just paste them behind each other, so no characters are inserted in between them. It will result in the following output in our ObjDemoExperiment:
> ObjDemoExperiment Timestamp Row Column Weight_g Irrigation_amount_g Irrigation_mode Temp_Celsius CO2_ppm RH_. PAR_umol.m2.s Reliable Status Scale 1 2017.07.05 10:30 E 4 2631.8 0.0 ? 0 0 0 0 ok M E4 2 2017.07.05 10:30 F 4 2631.9 0.0 ? 0 0 0 0 ok M F4 3 2017.07.05 10:30 A 13 2625.1 0.0 ? 0 0 0 0 ok M A13 4 2017.07.05 10:30 A 14 2681.8 0.0 ? 0 0 0 0 ok M A14 5 2017.07.05 10:30 B 13 2630.7 0.0 ? 0 0 0 0 ok M B13 6 2017.07.05 10:30 A 21 2627.8 0.0 ? 0 0 0 0 ok M A21 7 2017.07.05 10:30 A 22 2625.9 0.0 ? 0 0 0 0 ok M A22 8 2017.07.05 10:30 B 21 2636.8 0.0 ? 0 0 0 0 ok M B21 9 2017.07.05 10:30 A 25 2652.5 0.0 ? 0 0 0 0 ok M A25 10 2017.07.05 10:30 A 27 2657.5 0.0 ? 0 0 0 0 ok M A27 11 2017.07.05 10:30 B 26 2649.8 0.0 ? 0 0 0 0 ok M B26 12 2017.07.05 10:30 E 11 2634.4 0.0 ? 0 0 0 0 ok M E11 13 2017.07.05 10:30 E 12 2634.3 0.0 ? 0 0 0 0 ok M E12 14 2017.07.05 10:30 F 12 2632.6 0.0 ? 0 0 0 0 ok M F12 15 2017.07.05 10:30 E 20 2635.1 0.0 ? 0 0 0 0 ok M E20 16 2017.07.05 10:30 F 17 2625.4 0.0 ? 0 0 0 0 ok M F17 17 2017.07.05 10:30 F 20 2628.3 0.0 ? 0 0 0 0 ok M F20 18 2017.07.05 10:30 E 27 2664.8 0.0 ? 0 0 0 0 ok M E27 19 2017.07.05 10:30 F 25 2669.6 0.0 ? 0 0 0 0 ok M F25 20 2017.07.05 10:30 F 26 2681.8 0.0 ? 0 0 0 0 ok M F26 21 2017.07.05 10:30 E 23 2625.7 0.0 ? 0 0 0 0 ok M E23 22 2017.07.05 10:30 E 24 2636.2 0.0 ? 0 0 0 0 ok M E24 23 2017.07.05 10:30 F 21 2626.4 0.0 ? 0 0 0 0 ok M F21 24 2017.07.05 10:30 F 24 2628.8 0.0 ? 0 0 0 0 ok M F24 25 2017.07.05 10:30 E 14 2625.4 0.0 ? 0 0 0 0 ok M E14 26 2017.07.05 10:30 E 16 2631.3 0.0 ? 0 0 0 0 ok M E16 27 2017.07.05 10:30 F 14 2629.2 0.0 ? 0 0 0 0 ok M F14 28 2017.07.05 10:30 F 15 2634.7 0.0 ? 0 0 0 0 ok M F15 29 2017.07.05 10:30 F 16 2626.9 0.0 ? 0 0 0 0 ok M F16 30 2017.07.05 10:30 A 3 2624.9 0.0 ? 0 0 0 0 ok M A3 31 2017.07.05 10:30 A 4 2633.0 0.0 ? 0 0 0 0 ok M A4 32 2017.07.05 10:30 B 2 2632.7 0.0 ? 0 0 0 0 ok M B2 33 2017.07.05 10:30 B 3 2634.0 0.0 ? 0 0 0 0 ok M B3 34 2017.07.05 10:30 C 2 2634.7 0.0 ? 0 0 0 0 ok M C2 35 2017.07.05 10:30 C 3 2654.5 0.0 ? 0 0 0 0 ok M C3 36 2017.07.05 10:30 A 9 2631.8 0.0 ? 0 0 0 0 ok M A9 37 2017.07.05 10:30 A 10 2633.2 0.0 ? 0 0 0 0 ok M A10 38 2017.07.05 10:30 B 9 2626.1 0.0 ? 0 0 0 0 ok M B9 [ reached getOption("max.print") -- omitted 151704 rows ]
>

Notice the new column added named Scale. To create the plot, we need to load into R software suitable for drawing plots. There is a lot of software available for this purpose in R, but we will use the software named ggplot2. To load it issue the command:

> require(ggplot2)
If it does not work, it probably means it is not installed. To do so:

> install.packages(ggplot2)
and execute the require command again. To draw the plot, we will need to tell ggplot2 from which object we want to use the data, and from which column in it to calculate the frequencies and create a histogram. The following command does this:

ggplot(data.frame(ObjDemoExperiment),aes(x=Scale)) + geom_bar()
To break it down, the data.frame above tells ggplot2 what type of object we have which is a data.frame. The aes option is used to map data to the axis, here we tell it that the x axis should have the data from the column Scale in our object. The "+ geom_bar()" tells ggplot what kind of plot we want, geom_bar results in a histogram for categorical variables which is what our column Scales has. The results will be the following in a windows in R:




We can see that for every scale there is the same amount of measurements, bar a few missed measurements which is negligible. Though to make it more clear, we can create a report in numbers with the command count:

> count(ObjDemoExperiment$Scale)
     x freq
1  A10 2809
2  A13 2809
3  A14 2809
4  A19 2809
5  A20 2809
6  A21 2809
7  A22 2811
8  A25 2809
9  A27 2809
10  A3 2809
11  A4 2809
12  A9 2809
13 B11 2809
14 B12 2811
15 B13 2809
16 B17 2809
17 B18 2810
18 B19 2809
19  B2 2811
20 B21 2811
21 B26 2809
22  B3 2809
23  B9 2809
24 C12 2811
25 C18 2812
26  C2 2811
27  C3 2809
28  D7 2812
29  D8 2812
30 E11 2809
31 E12 2809
32 E14 2809
33 E16 2814
34 E20 2809
35 E23 2815
36 E24 2811
37 E27 2809
38  E4 2813
39  E5 2809
40  E6 2809
41  E7 2812
42 F12 2809
43 F14 2809
44 F15 2809
45 F16 2813
46 F17 2812
47 F20 2809
48 F21 2809
49 F24 2812
50 F25 2809
51 F26 2809
52  F4 2811
53  F5 2811
54  F6 2809
>
To create a histogram of the frequencies above to quickly see the range:

> ObjDemoExperimentFreq <- count(ObjDemoExperiment$Scale); hist(ObjDemoExperimentFreq$freq)
will yield the following output:


To break down the command, we first have to understand that these were actually two commands. These are:

> ObjDemoExperimentFreq <- count(ObjDemoExperiment$Scale)
> hist(ObjDemoExperimentFreq$freq)
If you just type the commands as above with an enter in between, it will do exactly the same as our one-line command, though there the commands are separated by a semicolon (;). The first command Takes the output from the previously used count command and places it in a newly created object named ObjDemoExperimentFreq. The second command hist creates a plot from this objects column named "freq" and draws the plot.

Generating a line graph

To generate a line graph where every scale is included but differently coloured per scale we again use ggplot2. The command to quickly create a graph is:

> ggplot(ObjDemoExperiment) + geom_line(aes(x=Timestamp, y=Weight_g, colour=Scale, group=Scale))

Again we tell ggplot2 what Object we work on, the type of graph we want we geom_line, the variables for the X (Timestamp) and the Y (Weight_g) axis. To let ggplot2 automatically select good colours for our graph, the colour=Scale option is included. Thus it will colour the line graphs per our values in the Scale column in the ObjDemoExperiment data set. The group=Scale option tells it that these values should be grouped together. If you do not use it the result will not look well. The resulting graph from the above is:


From the graph we can quickly see that there are some outliers which distort our graph. 

Removing outliers

To remove the outliers from our data, we can use the graph to coarsely pick our range. We can say that any measurement below 1000 grams and above 5000 grams is an outlier. To achieve it, use the following command:
> ObjDemoExperimentSubset <- ObjDemoExperiment[!((ObjDemoExperiment$Weight_g > 1000) & (ObjDemoExperiment$Weight_g < 5000)) , ]
It does look a bit more complicated, so let us break it down. First we make sure our data goes into a new Object named ObjDemoExperimentSubset, as to leave the original data untouched. With ObjDemoExperiment$Weight_g > 1000 we tell it that any value above 1000 should be kept, with (ObjDemoExperiment$Weight_g < 5000) we tell R to keep any value below 500. The ampersand (&) is used to link the two conditions together, a value has to satisfy both conditions (logical AND) to be kept. From here on however we must take special care about the exclamation mark (!) which inverts our selection. So we first tell it to keep any value above 1000 and below 5000 and then we invert the result with the exclamation mark. Thus it will keep any value below 1000 and above 5000 which will the data in our newly created ObjDemoExperimentSubset object. The result is as follows:
> ObjDemoExperimentSubset
              Timestamp Row Column Weight_g Irrigation_amount_g Irrigation_mode Temp_Celsius CO2_ppm RH_. PAR_umol.m2.s Reliable Status Scale
8759   2017.07.06 13:20   B     26     34.5                   0               ?            0       0    0             0       ok      P   B26
62003  2017.07.13 09:20   B     26   5970.2                   0               ?            0       0    0             0       ok      P   B26
126071 2017.07.21 14:30   E     27     -2.2                   0               ?            0       0    0             0       ok      P   E27
126126 2017.07.21 14:40   F     25     25.0                   0               ?            0       0    0             0       ok      P   F25
These measurements would thus be removed. Now let us actually remove them from our dataset. All we have to do is remove the exclamation mark producing the inversion. Thus the result will be as below:

> ObjDemoExperimentSubset <- ObjDemoExperiment[((ObjDemoExperiment$Weight_g > 1000) & (ObjDemoExperiment$Weight_g < 5000)) , ] > ObjDemoExperimentSubset Timestamp Row Column Weight_g Irrigation_amount_g Irrigation_mode Temp_Celsius CO2_ppm RH_. PAR_umol.m2.s Reliable Status Scale 1 2017.07.05 10:30 E 4 2631.8 0.0 ? 0 0 0 0 ok M E4 2 2017.07.05 10:30 F 4 2631.9 0.0 ? 0 0 0 0 ok M F4 3 2017.07.05 10:30 A 13 2625.1 0.0 ? 0 0 0 0 ok M A13 4 2017.07.05 10:30 A 14 2681.8 0.0 ? 0 0 0 0 ok M A14 5 2017.07.05 10:30 B 13 2630.7 0.0 ? 0 0 0 0 ok M B13 6 2017.07.05 10:30 A 21 2627.8 0.0 ? 0 0 0 0 ok M A21 7 2017.07.05 10:30 A 22 2625.9 0.0 ? 0 0 0 0 ok M A22 8 2017.07.05 10:30 B 21 2636.8 0.0 ? 0 0 0 0 ok M B21 9 2017.07.05 10:30 A 25 2652.5 0.0 ? 0 0 0 0 ok M A25 10 2017.07.05 10:30 A 27 2657.5 0.0 ? 0 0 0 0 ok M A27 11 2017.07.05 10:30 B 26 2649.8 0.0 ? 0 0 0 0 ok M B26 12 2017.07.05 10:30 E 11 2634.4 0.0 ? 0 0 0 0 ok M E11 13 2017.07.05 10:30 E 12 2634.3 0.0 ? 0 0 0 0 ok M E12 14 2017.07.05 10:30 F 12 2632.6 0.0 ? 0 0 0 0 ok M F12 15 2017.07.05 10:30 E 20 2635.1 0.0 ? 0 0 0 0 ok M E20 16 2017.07.05 10:30 F 17 2625.4 0.0 ? 0 0 0 0 ok M F17 17 2017.07.05 10:30 F 20 2628.3 0.0 ? 0 0 0 0 ok M F20 18 2017.07.05 10:30 E 27 2664.8 0.0 ? 0 0 0 0 ok M E27 19 2017.07.05 10:30 F 25 2669.6 0.0 ? 0 0 0 0 ok M F25 20 2017.07.05 10:30 F 26 2681.8 0.0 ? 0 0 0 0 ok M F26 21 2017.07.05 10:30 E 23 2625.7 0.0 ? 0 0 0 0 ok M E23 22 2017.07.05 10:30 E 24 2636.2 0.0 ? 0 0 0 0 ok M E24 23 2017.07.05 10:30 F 21 2626.4 0.0 ? 0 0 0 0 ok M F21 24 2017.07.05 10:30 F 24 2628.8 0.0 ? 0 0 0 0 ok M F24 25 2017.07.05 10:30 E 14 2625.4 0.0 ? 0 0 0 0 ok M E14 26 2017.07.05 10:30 E 16 2631.3 0.0 ? 0 0 0 0 ok M E16 27 2017.07.05 10:30 F 14 2629.2 0.0 ? 0 0 0 0 ok M F14 28 2017.07.05 10:30 F 15 2634.7 0.0 ? 0 0 0 0 ok M F15 29 2017.07.05 10:30 F 16 2626.9 0.0 ? 0 0 0 0 ok M F16 30 2017.07.05 10:30 A 3 2624.9 0.0 ? 0 0 0 0 ok M A3 31 2017.07.05 10:30 A 4 2633.0 0.0 ? 0 0 0 0 ok M A4 32 2017.07.05 10:30 B 2 2632.7 0.0 ? 0 0 0 0 ok M B2 33 2017.07.05 10:30 B 3 2634.0 0.0 ? 0 0 0 0 ok M B3 34 2017.07.05 10:30 C 2 2634.7 0.0 ? 0 0 0 0 ok M C2 35 2017.07.05 10:30 C 3 2654.5 0.0 ? 0 0 0 0 ok M C3 36 2017.07.05 10:30 A 9 2631.8 0.0 ? 0 0 0 0 ok M A9 37 2017.07.05 10:30 A 10 2633.2 0.0 ? 0 0 0 0 ok M A10 38 2017.07.05 10:30 B 9 2626.1 0.0 ? 0 0 0 0 ok M B9 [ reached getOption("max.print") -- omitted 151700 rows ]
>
We could find the number of lines in our output by looking back at our ObjDemoExperiment, it will tell us:
 [ reached getOption("max.print") -- omitted 151704 rows ]
Whereas our new ObjDemoExperimentSubset will yield:

[ reached getOption("max.print") -- omitted 151700 rows ]
An easier way to just get the number of observation is to use the command nrow like below:
> nrow(ObjDemoExperiment)
[1] 151742
> nrow(ObjDemoExperimentSubset)
[1] 151738
We see there are four observations less as we expected. The new graph will look as the following: