---
title: "A Journey into the Contents of a Blue Vegetable Basket"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
This is a personal project that I started a year ago out of curiosity and because I wanted something on which to practice my data manipulation and analysis skills in R, the use of Markdown, and eventually also R Shiny.
The Birsmattehof farm in Therwil focuses on the production of organic vegetables. These can be bought directly from the farm, from its stands at various weekly markets around Basel, or they can be ordered for weekly delivery. I am a customer of the third option: 46 times a year we receive a basket full of freshly harvested organic vegetables, the contents of which are a surprise, as the variety and amount are determined by the season and the local weather conditions. On average, our basket order contains 3.5 - 5 kilograms per week.
Over the course of many months, I weighed and noted all vegetables that we received in the basket. I then correlated these data with local meteorological data in the hope of finding some interesting trends.
More information about the farm and its products can be found on their website https://www.birsmattehof.ch/.
![Birsmattehof, Therwil](C:\Users\gartens1\Desktop\BMH\BirsmattehofBild.png)
## Installing and Loading the Necessary R Packages
For the transformation, analysis, and visualisation of the data, several packages were used:
**tidyverse**, version 1.3.1
Citation: Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686
**dplyr**, version 1.0.5
Citation: Hadley Wickham, Romain Francois, Lionel Henry and Kirill Mueller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.5. https://CRAN.R-project.org/package=dplyr
**stringr**, version 1.4.0
Citation: Hadley Wickham (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr
**lubridate**, version 1.7.10
Citation: Garrett Grolemund, Hadley Wickham (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1-25. URL https://www.jstatsoft.org/v40/i03/.
**plotly**, version 4.10.0
Citation: C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020. URL https://plotly-r.com.
```{r Installing and Loading the Necessary R Packages, message = FALSE}
### Installing Packages:
# install.packages("tidyverse")
# install.packages("dplyr")
# install.packages("stringr")
# install.packages("lubridate")
# install.packages("plotly")
### Loading Packages
library(dplyr)
library(stringr)
library(tidyverse)
library(lubridate)
library(plotly)
### Obtaining Citations
# citation(package="dplyr", lib.loc = NULL, auto = NULL)
# citation(package="tidyverse", lib.loc = NULL, auto = NULL)
# citation(package="stringr", lib.loc = NULL, auto = NULL)
# citation(package="lubridate", lib.loc = NULL, auto = NULL)
# citation(package="plotly", lib.loc = NULL, auto = NULL)
```
## Importing Data
Two raw data files were imported for this project.
The first data set, "BMH_Veggiedata_V1.csv", is a csv file containing all vegetable weights collected from the 10th of November 2020 to the 21st of December 2021. This file contains a column for the vegetable category (e.g. cabbage), a second column for the specific vegetable (e.g. red cabbage), and then one column of weights for each date on which a basket was collected.
The second file, "weather.csv", is a csv file containing meteorological data from Basel for the same time period. These data were downloaded from Meteoblue (meteoblue AG, Greifengasse 38, CH-4058, Basel, Switzerland) at https://www.meteoblue.com/en/weather/archive/export/basel_switzerland_2661604 on 23.12.2021. This file contains a combined date-time column with one row for every hour of the day, and one column each for measurements of temperature, sunshine, precipitation, snowfall, humidity, and cloud cover.
These data sets will be referred to as the "Vegetable Data Set" and the "Weather Data Set", respectively, from here on out.
```{r Importing Files, message = FALSE, results = "hide"}
# Setting working directory and listing files
setwd("C:/Users/gartens1/Desktop/BMH")
list.files()
# Vegetable Data
filepath <- "C:/Users/gartens1/Desktop/BMH/BMH_Veggidata_V1.csv"
rawdata <- read.csv(filepath, header = TRUE, sep = ";")
# Weather Data
filepath2 <- "C:/Users/gartens1/Desktop/BMH/weather.csv"
rawdata2 <- read.csv(filepath2, header = TRUE, sep = ",")
```
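The absolute `C:/Users/...` paths above only resolve on the machine the analysis was written on. The following is a minimal sketch of a more portable import, not the code used for this analysis: `data_dir` and the tiny example file are made up for illustration, and a temporary directory stands in for the real data folder. It also shows where the "X" prefix on the date columns comes from, since `read.csv()` makes column names syntactically valid by default.

```r
# Hypothetical data folder; a temporary directory stands in for the real one
data_dir <- tempdir()
veg_file <- file.path(data_dir, "veggie_example.csv")

# Tiny made-up file in the same semicolon-separated layout as the vegetable data
writeLines(c("Vegetable Category;Vegetable;10.11.2020;17.11.2020",
             "Cabbage;Red Cabbage;500;",
             "Root Vegetable;Carrots;300;450"),
           veg_file)

# Build the path with file.path() instead of calling setwd()
example_raw <- read.csv(veg_file, header = TRUE, sep = ";")

# Column names were passed through make.names(): spaces become dots and
# names starting with a digit receive an "X" prefix
colnames(example_raw)

# Empty numeric cells are read as NA
example_raw
```

Building paths with `file.path()` relative to one configurable folder means only `data_dir` has to change when the project moves.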
## Tidy Data
The tidy data format requires that each variable has its own column and each observation has its own row. This format provides a standard way of structuring a dataset, which allows for easier analysis. As neither of the two datasets is tidy, the first step of this analysis deals with wrangling the data into this format. The following sections describe each step of the transformation for both data sets.
### Vegetable Data
With the `head()` function a sneak peek at the original dataset can be taken to see its structure. All unneeded columns and rows are removed. The last row contains the summed weight of all vegetables in each basket. The last three columns contain calculations done in the original Excel file. These rows and columns are not needed, as these calculations will be performed on the tidy data set using R code. To remove them in a way that still works should additional rows (e.g. new vegetable types) and columns (e.g. more measurement dates) be added to the original dataset, the variables `nr` and `nc` were created, where the former is the number of rows minus 1 and the latter is the number of columns minus 3. The original `rawdata` dataset was then subset accordingly and stored as `data1`.
The 3.5 - 5 kilograms of vegetables that we receive weekly throughout the year are split into 46 baskets. The vegetables are delivered weekly except during the months of January and February, when the basket contains double the amount of produce but is only delivered every two weeks. Also, no basket is delivered in the week between Christmas and New Year. To account for this, the recorded data from the dates 12.01.2021, 26.01.2021, 09.02.2021, 23.02.2021, and 09.03.2021 are halved and copied into the weeks of 05.01.2021, 19.01.2021, 02.02.2021, 16.02.2021, and 02.03.2021, respectively.
```{r Selecting Relevant Rows/Columns from Vegetable Dataset, message = FALSE, results = "hide"}
# Taking a look at the original dataset
head(rawdata)
# Removing unnecessary columns/rows
nr <- nrow(rawdata) - 1
nc <- ncol(rawdata) - 3
data1 <- rawdata[c(1:nr),c(1:nc)]
# Splitting double-sized baskets into two weeks
data1$X05.01.2021 <- (data1$X12.01.2021)/2
data1$X12.01.2021 <- (data1$X12.01.2021)/2
data1$X19.01.2021 <- (data1$X26.01.2021)/2
data1$X26.01.2021 <- (data1$X26.01.2021)/2
data1$X02.02.2021 <- (data1$X09.02.2021)/2
data1$X09.02.2021 <- (data1$X09.02.2021)/2
data1$X16.02.2021 <- (data1$X23.02.2021)/2
data1$X23.02.2021 <- (data1$X23.02.2021)/2
data1$X02.03.2021 <- (data1$X09.03.2021)/2
data1$X09.03.2021 <- (data1$X09.03.2021)/2
```
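The ten assignments above all follow one pattern: halve the delivered basket and write the same half into the week without a delivery. A minimal sketch of that pattern as a loop, using a made-up stand-in for `data1` (the names `example_wide` and `double_weeks` are hypothetical and only two of the five date pairs are shown):

```r
# Made-up stand-in for data1: one row per vegetable, one column per week
example_wide <- data.frame(
  Vegetable   = c("Carrots", "Leek"),
  X05.01.2021 = c(NA, NA),
  X12.01.2021 = c(1000, 600),
  X19.01.2021 = c(NA, NA),
  X26.01.2021 = c(800, NA)
)

# Names are the delivery dates, values are the skipped weeks before them
double_weeks <- c(X12.01.2021 = "X05.01.2021",
                  X26.01.2021 = "X19.01.2021")

for (delivered in names(double_weeks)) {
  skipped <- double_weeks[[delivered]]
  half <- example_wide[[delivered]] / 2
  example_wide[[skipped]]   <- half  # week without a delivery gets one half
  example_wide[[delivered]] <- half  # delivery week keeps the other half
}

example_wide
```

Keeping the date pairs in one named vector means a further biweekly period would only need a new entry rather than two new lines of code.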
To tidy the vegetable dataset, I first generated a vector `cnames` containing all column names except the first two using `colnames()`. The date columns carry the prefix "X" in their names (e.g. "X02.02.2021"), which was removed using `str_remove()`. The vector was then converted to class `Date` using the `as.Date()` function. Each date was repeated `nr` times with `rep()`, and the result was transformed into the dataframe `date` with `as.data.frame()`.
Similarly, two dataframes `vegcat` and `veg` were generated, containing the vegetable categories and vegetables (columns 1 and 2, respectively) repeated `nc - 2` times (the two columns holding the vegetable names and their categories are subtracted from `nc`).
The dataframe `weight` containing all weight measurements was generated by unlisting all columns (except 1 and 2) of `data1`.
These four dataframes were combined using `cbind()` to generate the dataframe `veggiedata`, and all rows containing NA values were removed using `na.omit()`. Finally, the four columns of the tidy dataframe `veggiedata` were renamed appropriately with `colnames()`.
```{r Tidying the Vegetable Dataset}
# generating date vector and removing "X"
cnames <- colnames(data1[-c(1,2)])
cnames <- str_remove(cnames, "[X]")
cnames <- as.Date(cnames, format = "%d.%m.%Y")
# generating dataframe of dates, each repeated "nr" times
date <- as.data.frame(rep(cnames, each = nr))
# generating dataframes of all vegetables and vegetable categories "nc" minus 2 times.
vegcat <- as.data.frame(rep(data1$Vegetable.Category[1:nr], times = nc-2))
veg <- as.data.frame(rep(data1$Vegetable[1:nr], times = nc-2))
# generating a dataframe of all weights
weight <- as.data.frame(unlist(data1[,-c(1,2)]))
# cbind and remove all NAs
veggiedata <- na.omit(cbind(date, vegcat, veg, weight))
# rename columns
colnames(veggiedata) <- c("Date", "Vegetable_Category", "Vegetable", "Weight")
```
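The same wide-to-long reshaping can also be expressed with `pivot_longer()` from the tidyr package, which is part of the tidyverse. This is only a sketch of an alternative on made-up data, not the method used above; `example_wide` and `example_long` are hypothetical names.

```r
library(tidyr)

# Made-up stand-in for data1 with two date columns
example_wide <- data.frame(
  Vegetable.Category = c("Cabbage", "Root Vegetable"),
  Vegetable          = c("Red Cabbage", "Carrots"),
  X10.11.2020        = c(500, 300),
  X17.11.2020        = c(NA, 450)
)

# One row per vegetable and date; rows without a weight are dropped
example_long <- pivot_longer(
  example_wide,
  cols = -c(Vegetable.Category, Vegetable),
  names_to = "Date",
  values_to = "Weight",
  values_drop_na = TRUE
)

# Strip the "X" prefix and convert the column names to proper dates
example_long$Date <- as.Date(sub("^X", "", example_long$Date),
                             format = "%d.%m.%Y")

example_long
```

Compared with the `rep()` and `cbind()` approach, the rows come out ordered by vegetable rather than by date, so a subsequent `arrange(Date)` may be wanted if the order matters.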
The `veggiedata` dataframe contains four columns: `Date`, `Vegetable_Category`, `Vegetable`, and `Weight`. Two calculations were performed on these columns. Firstly, the total weight of each vegetable basket was calculated by summing all of the individual weights. Secondly, the percentage of the basket weight contributed by each vegetable was calculated (e.g. on 2020-11-10, Red Cabbage accounted for 16.8% of the total basket weight). Both of these calculations were performed using the `summarise()` function. These two newly calculated columns were added to the original dataframe as `Basket_Weight` and `Percent`, respectively, forming a new dataframe `veggiedata2`.
The total weight of all baskets as well as the average weight of each basket were calculated and stored as the variables `weightsum` and `weightaverage`, respectively. Additionally, the total weight of each vegetable category and each vegetable was calculated and stored in the dataframes `categoryweights` and `vegetableweights`.
``` {r Transforming Vegetable Dataset, message = FALSE, results = "hide"}
# Calculating the total basket weight by measurement date and the percentage of each vegetable weight in each basket
basketweight <- veggiedata %>%
group_by(Date) %>%
summarise(Weight = sum(Weight))
veggieweight<- merge(veggiedata, basketweight, by = "Date", all = TRUE)
colnames(veggieweight) <- c("Date", "Vegetable_Category", "Vegetable", "Weight", "Basket_Weight")
veggiedata2 <- veggieweight %>%
summarise(Date = Date,
Vegetable_Category = Vegetable_Category,
Vegetable = Vegetable,
Weight = Weight,
Basket_Weight = Basket_Weight,
Percent = (veggieweight$Weight / veggieweight$Basket_Weight)*100) %>%
mutate_if(is.numeric, ~round(., 1))
# Calculating total weight of all baskets and average basket weight
weightsum <- sum(basketweight$Weight)
weightaverage <- round(mean(basketweight$Weight), digits = 1)
# Calculating total weight of each vegetable category and each vegetable
categoryweights <- veggiedata %>%
group_by(Vegetable_Category) %>%
summarise(Weight = sum(Weight)) %>%
arrange(desc(Weight))
vegetableweights <- veggiedata %>%
group_by(Vegetable) %>%
summarise(Weight = sum(Weight)) %>%
arrange(desc(Weight))
```
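As a compact illustration of the two calculations, the basket total and the percentage can also be obtained in one grouped `mutate()` without a separate merge. This is a sketch on made-up numbers, not the code used above; `example_long` and `example_pct` are hypothetical names.

```r
library(dplyr)

# Made-up tidy data: two vegetables in the first basket, one in the second
example_long <- data.frame(
  Date      = as.Date(c("2020-11-10", "2020-11-10", "2020-11-17")),
  Vegetable = c("Red Cabbage", "Carrots", "Carrots"),
  Weight    = c(500, 1500, 450)
)

example_pct <- example_long %>%
  group_by(Date) %>%
  # sum(Weight) is evaluated per basket because of the grouping
  mutate(Basket_Weight = sum(Weight),
         Percent = round(Weight / Basket_Weight * 100, 1)) %>%
  ungroup()

example_pct
```

Because `mutate()` keeps every row, the result has the same shape as `veggiedata2`: one row per vegetable and date, with the basket total repeated within each date.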
For neatness, the environment was cleared of all unneeded objects, keeping only the final tidy `veggiedata2` dataframe, the `categoryweights`, `vegetableweights`, and `basketweight` dataframes, and the two variables `weightaverage` and `weightsum`.
```{r Clearing the Environment}
# remove all unneeded objects from environment
rm(veggiedata, veggieweight, data1, date, rawdata, veg, vegcat, weight, cnames, filepath, nc, nr)
```
### Weather Data
With the `head()` function a sneak peek at the original dataset can be taken to see its structure. The number of rows shown was set to `15` to be able to properly see the format. The first 9 rows of the dataset are not needed and were deleted to generate `data2`. The next step was to transform the date-time column into a properly formatted date. This was done using `str_split_fixed()` to split the original column at the letter `"T"` into two columns, date and time. The date column was converted to class `Date` using `ymd()` (the time column was not needed for this analysis). This date column was bound to the data using `cbind()` to generate the dataframe `weather`, and the original, messy date-time column was removed. The class of the columns containing the weather data was changed to `numeric` using `transform()`. Finally, all columns were renamed appropriately using `colnames()`.
```{r Selecting Relevant Rows/Columns from Weather Dataset, message = FALSE, results = "hide"}
# Taking a look at the original dataset
head(rawdata2, 15)
# Remove unnecessary rows at the top
data2 <- rawdata2[-c(1:9),]
# fix date/time issue: split string and generate column only with date
date2 <- as.data.frame(str_split_fixed(data2[,1], "T", 2))
date2 <- ymd(date2[,1])
# bind date column to the dataframe and remove all unneeded columns
weather <- cbind(date2, data2)
weather <- weather[,-c(2)]
# change class of the measurement columns from character to numeric
weather <- transform(weather,
                     Basel = as.numeric(Basel),
                     Basel.1 = as.numeric(Basel.1),
                     Basel.2 = as.numeric(Basel.2),
                     Basel.3 = as.numeric(Basel.3),
                     Basel.4 = as.numeric(Basel.4),
                     Basel.5 = as.numeric(Basel.5))
# rename columns appropriately
colnames(weather) <- c("Date", "Temperature_C", "Sunshine_min", "Precipitation_mm", "Snowfall_cm", "Humidity_per", "Cloudcover_per")
```
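To make the date handling concrete, here is a small base R sketch of the same idea on made-up values. The timestamp layout `YYYYMMDDTHHMM` is an assumption for illustration only and is not confirmed by the text above; the object names are hypothetical.

```r
# Made-up timestamps in an assumed "YYYYMMDDTHHMM" layout
stamps <- c("20201110T0000", "20201110T0100", "20201111T0000")

# Keep only the part before the "T" and parse it as a date
date_part <- sub("T.*$", "", stamps)
dates <- as.Date(date_part, format = "%Y%m%d")
dates

# Measurements that arrive as text can be converted in the same step
temperature_chr <- c("7.4", "7.1", "6.8")
temperature_num <- as.numeric(temperature_chr)
class(temperature_num)
```

`sub()` with `as.Date()` does the same job here as `str_split_fixed()` followed by `ymd()`, just without creating the unused time column.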
Although the `weather` dataframe now appears neat, it is not yet in the tidy format. In the next steps, individual dataframes for each weather condition were generated with just one value per day (e.g. the average daily temperature, the daily sum of precipitation, etc.). For the measurements of temperature, humidity, and cloud cover, the daily average values were calculated by grouping the `weather` dataset by `Date` and calculating the mean value. Similarly, for the measurements of sunshine, snowfall, and precipitation, the daily totals were calculated by grouping the `weather` dataset by `Date` and calculating the sum of each. In order to generate the final `weather2` dataframe, an additional column with the name of the measurement (e.g. `"Temperature_DailyAverage"`, `"Precipitation_DailySum"`) was made and placed in between the date and the measurement value columns. Using `nrow()` the number of rows (i.e. the number of measurement days) was calculated and stored as `days`. This variable was then used to generate the correct length for the column containing the `"Condition"`. For each weather condition the three columns were renamed to `"Date"`, `"Condition"`, and `"Measurement"`. The six individual dataframes were then joined using `rbind()` to generate the final compact tidy dataframe `weather2`.
``` {r Generating Tidy Datasets for each Weather Condition & Transforming Dataset, message = FALSE, results = "hide"}
# Temperature: daily average temperature (average)
temperature <- weather %>%
group_by(Date) %>%
summarise(Temperature_C = mean(Temperature_C)) %>%
ungroup()
days <- nrow(temperature)
temp1 <- as.data.frame(rep("Temperature_DailyAverage", times = days))
temperature <- cbind(temperature$Date, temp1, temperature$Temperature_C)
colnames(temperature) <- c("Date", "Condition", "Measurement")
# Sunshine: daily sunshine minutes (sum)
sunshine <- weather %>%
group_by(Date) %>%
summarise(Sunshine_min = sum(Sunshine_min)) %>%
ungroup()
sun1 <- as.data.frame(rep("Sunshine_DailyAverage", times = days))
sunshine <- cbind(sunshine$Date, sun1, sunshine$Sunshine_min)
colnames(sunshine) <- c("Date", "Condition", "Measurement")
# Precipitation: daily precipitation (sum)
precipitation <- weather %>%
group_by(Date) %>%
summarise(Precipitation_mm = sum(Precipitation_mm)) %>%
ungroup()
precip1 <- as.data.frame(rep("Precipitation_DailySum", times = days))
precipitation <- cbind(precipitation$Date, precip1, precipitation$Precipitation_mm)
colnames(precipitation) <- c("Date", "Condition", "Measurement")
# Snowfall: daily snowfall (sum)
snowfall <- weather %>%
group_by(Date) %>%
summarise(Snowfall_cm = sum(Snowfall_cm)) %>%
ungroup()
snow1 <- as.data.frame(rep("Snowfall_DailySum", times = days))
snowfall <- cbind(snowfall$Date, snow1, snowfall$Snowfall_cm)
colnames(snowfall) <- c("Date", "Condition", "Measurement")
# Humidity: daily average humidity (average)
humidity <- weather %>%
group_by(Date) %>%
summarise(Humidity_per = mean(Humidity_per)) %>%
ungroup()
hum1 <- as.data.frame(rep("Humidity_DailyAverage", times = days))
humidity <- cbind(humidity$Date, hum1, humidity$Humidity_per)
colnames(humidity) <- c("Date", "Condition", "Measurement")
# Cloud Cover: daily average cloud cover (average)
cloudcover <- weather %>%
group_by(Date) %>%
summarise(Cloudcover_per = mean(Cloudcover_per)) %>%
ungroup()
cloud1 <- as.data.frame(rep("Cloudcover_DailyAverage", times = days))
cloudcover <- cbind(cloudcover$Date, cloud1, cloudcover$Cloudcover_per)
colnames(cloudcover) <- c("Date", "Condition", "Measurement")
# make one dataframe
weather2 <- rbind(cloudcover, precipitation, humidity, temperature, snowfall, sunshine)
weather2$Measurement <- round(weather2$Measurement, digits = 1)
```
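The six blocks above repeat the same group, summarise, and label steps. As a sketch of a shorter route to the same `Date` / `Condition` / `Measurement` layout, all daily values can be computed in one `summarise()` and then stacked with `pivot_longer()`. The data here are made up and limited to two conditions, and the object names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Made-up hourly data: two readings on each of two days
example_hourly <- data.frame(
  Date             = as.Date(c("2021-06-01", "2021-06-01",
                               "2021-06-02", "2021-06-02")),
  Temperature_C    = c(10, 20, 12, 18),
  Precipitation_mm = c(0, 1.5, 0.2, 0.3)
)

example_daily <- example_hourly %>%
  group_by(Date) %>%
  # one summary column per condition: means for levels, sums for amounts
  summarise(Temperature_DailyAverage = mean(Temperature_C),
            Precipitation_DailySum   = sum(Precipitation_mm)) %>%
  # stack the summary columns into Condition / Measurement pairs
  pivot_longer(cols = -Date,
               names_to = "Condition",
               values_to = "Measurement")

example_daily
```

The column names given inside `summarise()` become the values of `Condition`, so no separate `rep()` call or `days` counter is needed.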
For neatness, the environment was cleared of all unneeded objects except for the final tidy `weather2` dataframe.
```{r Clearing the Environment 2}
# remove all unneeded objects from environment
rm(weather, cloud1, cloudcover, hum1, humidity, precip1, precipitation, snow1, snowfall, sun1, sunshine, temp1, temperature, data2, rawdata2, days, filepath2, date2)
```
## Data Exploration
In this section, I explored the `veggiedata2` dataset to see if I could find interesting correlations with the `weather2` dataset. For this exploratory data analysis I used `ggplot()` and `plot_ly()` to graphically display the data, as well as a number of functions from the `tidyverse` package.
### Distribution of different Vegetable Types throughout the Year
To visualise the distribution of different vegetable categories in each basket throughout the entire measurement period, the plot `allveg1` was generated with `ggplot()`. A horizontal line marking the average basket weight was added to the plot using `geom_hline()` with the value of `weightaverage` as its y-intercept. Similarly, the plot `allveg2` was created using `plot_ly()`. A helper function `hline` was defined for drawing the horizontal line that indicates the average basket weight; in the current code the line is added with `add_trace()` instead. In my opinion, one of the clear advantages of the `allveg2` plot is the interactive nature of the graph itself, even though I find the syntax of `ggplot()` easier to navigate. As is evident from the `allveg2` code and the final plot, there is room for improvement.
The heaviest basket was delivered on 11.05.2021 and weighed 6.742 kg; the lightest baskets, on 02.03.2021 and 09.03.2021, weighed 2.286 kg each. The average basket weight was 4.801 kg, which surpasses the expected average weight of 4.25 kg (the expected weight range is between 3.5 kg and 5 kg).
```{r data exploration 1}
# Plotting (ggplot):
allveg1 <- ggplot(data=veggiedata2,
aes(x=Date, y=Weight)) +
geom_bar(stat = "identity",
aes(fill = Vegetable_Category), color = "black") +
xlab("Date") +
ylab("Weight (gr)") +
ggtitle("Distribution of Vegetable Types throughout the Year") +
theme(legend.key.size = unit(0.25, 'cm'),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_text(vjust = 0)) +
scale_x_date(date_breaks = "1 month",
date_labels = "%m/%y") +
scale_y_continuous(name = "Weight (gr)", breaks = seq(0, 7000, 1000)) +
geom_hline(yintercept = weightaverage,
col = "black",
lwd = 0.5,
linetype = "dotted") +
annotate("text",
x = as.Date("2021-02-01"),
y = c(5430, 5230, 5030),
label = c("Mean Basket","Weight:", paste0(weightaverage, "gr")),
size = 2.5)
allveg1
# Plotting (plotly):
hline <- function(y = 0, color = "blue") {
list(
type = "line",
x0 = 0,
x1 = 1,
xref = "paper",
y0 = y,
y1 = y,
line = list(color = color,
width = 0.5)
)
} # function taken from https://stackoverflow.com/questions/34093169/horizontal-vertical-line-in-plotly
# define custom colours
custom_col <- c("darkmagenta", "lemonchiffon", "aliceblue", "aquamarine", "beige","navy", "blue",
"cadetblue", "blueviolet", "cyan", "#F08080", "deeppink", "#FF00FF", "gold", "darkcyan",
"yellow", "mediumvioletred")
custom_col <- setNames(custom_col, c("Tomato", "Potato", "Herbs", "Celery", "Aubergine", "Squash", "Fennel", "Capsicum", "Salad", "Onions", "Cucumber", "Cabbage", "Root Vegetable", "Leafy Greens", "Corn", "Beans", "Other"))
allveg2 <- plot_ly(data = veggiedata2,
x = ~Date,
y = ~Weight,
color = ~Vegetable_Category,
text = ~Vegetable,
colors = custom_col,
textposition = 'none',
type = "bar",
marker = list(line = list(color = 'rgb(8,48,107)',
width = 0.5)),
hovertemplate = paste('<b>Vegetable</b>: %{text}' ,
'<br><b>Weight</b>: %{y}gr',
'<br><b>Date</b>: %{x}',
'<extra></extra>'),
hoverinfo = 'text' ) %>%
layout(barmode = 'stack',
title = "Distribution of Vegetable Types throughout the Year",
plot_bgcolor='#e5ecf6',
xaxis = list(
title = "",
type = 'date'),
yaxis = list(title = "Weight (grams)"),
legend = list(orientation = 'h', size = 3),
font = list(family = "Arial")) %>%
add_trace(y=weightaverage,
type = "scatter",
mode="lines+markers",
showlegend = FALSE)
# add_text(showlegend = FALSE,
# x = "2021-02-01",
# y = 5200,
# text = paste('Average Basket Weight: <br>', weightaverage, 'grams'),
# textfont = list(color = "blue",
# size = 7,
# family = "Arial")) %>%
allveg2
# allveg3<-add_trace(allveg2,
# y=weightaverage, type = "scatter", mode="lines+markers", showlegend = FALSE)
# allveg3
# finding the heaviest and lightest baskets
heaviest <- basketweight[which.max(basketweight$Weight),]
print(heaviest)
lightest <- basketweight[which.min(basketweight$Weight),]
print(lightest)
```
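The `hline` helper in the chunk above returns a shape description but is never passed to the plot. The following sketch shows how such a shape could be attached through the `shapes` argument of `layout()`. It is an illustration only: `hline_shape`, `mean_line`, and the two bars are made up, and the value 4801 stands in for `weightaverage`.

```r
# Same idea as the hline helper: a horizontal line spanning the full plot width
hline_shape <- function(y = 0, color = "blue") {
  list(type = "line",
       x0 = 0, x1 = 1, xref = "paper",  # "paper" = fraction of the plot width
       y0 = y, y1 = y,
       line = list(color = color, width = 0.5))
}

# 4801 is used here in place of weightaverage
mean_line <- hline_shape(4801)

# Only build the plot when plotly is installed
if (requireNamespace("plotly", quietly = TRUE)) {
  example_plot <- plotly::plot_ly(
    x = as.Date(c("2021-01-05", "2021-01-12")),
    y = c(4500, 5100),
    type = "bar"
  )
  # Shapes are drawn on top of the traces and do not appear in the legend
  example_plot <- plotly::layout(example_plot, shapes = list(mean_line))
}
```

Unlike a line added with `add_trace()`, a shape is not tied to the data points, so it spans the plot from edge to edge regardless of the date range.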
### Correlation of Vegetable Categories and Daily Average Temperature
To take a closer look at the vegetable distribution in the basket throughout the year, subsets of `veggiedata2` were plotted and correlated with the daily average temperature from the `weather2` dataset. By filtering for the term `Cabbage` in `Vegetable_Category`, the dataframe `cabbage` was generated from `veggiedata2`. Similarly, by filtering for the term `Temperature_DailyAverage` in `Condition` in the `weather2` dataframe, the dataframe `temp` was generated. Using `ggplot()` the barplot `plotcabbage` was created showing the distribution of cabbage types over time. A line graph of the daily average temperature was then added to this plot to create `cabbagetemp`. The code for the `cabbagetemp` plot was a bit tricky given the secondary y-axis. Generally speaking, secondary axes can be very misleading; however, in this case I feel that the graph clearly displays the weight and temperature on the left and right y-axis, respectively, and that it is clear to the reader which bars and curves belong to which axis.
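The secondary axis rests on a linear rescaling between the temperature range and the weight range: temperatures are mapped onto the weight scale for drawing, and the axis labels are mapped back. A minimal sketch of that arithmetic, using the same axis limits as the plotting code (0 to 4000 g and -5 to 30 degrees C) but hypothetical helper names:

```r
ylim_primary   <- c(0, 4000)  # weight axis in grams
ylim_secondary <- c(-5, 30)   # temperature axis in degrees C

# Slope and intercept of the line that maps one range onto the other
slope     <- diff(ylim_primary) / diff(ylim_secondary)
intercept <- ylim_primary[1] - slope * ylim_secondary[1]

# Temperature -> position on the weight axis (used when drawing the line)
to_primary <- function(temp) intercept + slope * temp

# Position on the weight axis -> temperature (used for the axis labels)
to_secondary <- function(weight) (weight - intercept) / slope

to_primary(c(-5, 30))  # the temperature limits land on the weight limits
to_secondary(2000)     # the middle of the weight axis, read as a temperature
```

The two functions are exact inverses of each other, which is what keeps the temperature curve and the labels on the right-hand axis in agreement.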
Subsets of the `cabbage` dataframe based on the seasons were generated to find the total amount of cabbage received in each of these time periods. The most cabbage was received in the winter (12.394kg) and the least in the autumn (2.986kg), with spring and summer falling in between (9.295kg and 5.146kg, respectively). Warmer temperatures appear to correlate with less cabbage. Because the data for autumn 2020 and winter 2021 are incomplete, the values for these seasons are not representative in this comparison.
```{r data exploration 2, message = FALSE}
# Cabbage Types and Temperature
cabbage <- veggiedata2 %>%
filter (Vegetable_Category == "Cabbage")
cabbage$Vegetable <- gsub("Giant Kohlrabi", "Kohlrabi", cabbage$Vegetable)
cabbagecolours <- c("Broccoli" = "turquoise4", "Cauliflower" = "mediumpurple2", "Chinacabbage" = "darkseagreen2", "Red Cabbage" = "lightpink4", "White Cabbage" = "cornsilk2", "Green Cabbage" = "chartreuse2", "Brusselsprouts" = "seagreen4", "Kale" = "darkslategrey", "Sauerkraut" = "thistle3", "Kohlrabi" = "lightpink2")
xint <- c(as.Date(20/12, origin = "2020-12-01"),
as.Date(21/03, origin = "2021-03-01"),
as.Date(21/06, origin = "2021-06-01"),
as.Date(21/09, origin = "2021-09-01"),
as.Date(21/12, origin = "2021-12-01"))
plotcabbage <- ggplot(data=cabbage, aes(x=Date, y = Weight)) +
geom_bar(stat = "identity", aes(fill = Vegetable), color = "black") +
xlab("Date") +
ylab("Weight (gr)") +
ggtitle("The Distribution of Cabbage Types \n throughout the Seasons of the Year") +
scale_y_continuous(breaks = seq(0,4000, by = 250)) +
scale_x_date(date_breaks = "1 month", date_labels = "%m/%y") +
scale_fill_manual(name = "Cabbage Type",
values = cabbagecolours) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5),
axis.title.x = element_text(size = 10),
axis.text.y = element_text(hjust = 1.2),
axis.title.y = element_text(size = 10, vjust = 1),
# legend.position = c(0.83,0.79),
legend.key.size = unit(0.2, 'cm'),
legend.text = element_text(size = 6.5),
legend.background = element_blank(),
panel.background = element_rect(fill = "grey96")) +
geom_vline(xintercept = xint,
linetype = "dashed") +
geom_text(aes(x = as.Date(02/21, origin = "2021-01-15"), y = 3500, label = "Winter")) +
geom_text(aes(x = as.Date(02/21, origin = "2021-04-15"), y = 3500, label = "Spring")) +
geom_text(aes(x = as.Date(02/21, origin = "2021-07-15"), y = 3500, label = "Summer")) +
geom_text(aes(x = as.Date(02/21, origin = "2021-10-15"), y = 3500, label = "Autumn"))
#plotcabbage
temp <- weather2 %>%
filter (weather2$Condition == "Temperature_DailyAverage")
ylim.veg <- c(0, 4000)
ylim.temp <- c(-5, 30)
b <- diff(ylim.veg)/diff(ylim.temp)
a <- ylim.veg[1] - b*ylim.temp[1]
cabbagetemp <- plotcabbage +
geom_line(data = temp, aes(y = a + (Measurement)*b, colour = ..y..),
linetype = "solid", alpha = 0.6, show.legend = F) +
geom_smooth(method = loess, data = temp, aes(y = a + (Measurement)*b, colour = ..y..),
linetype = "solid", alpha = 0.6, show.legend = F) +
scale_y_continuous("Weight (gr)",
sec.axis = sec_axis(~(.- a)/b,
name = "Average Daily Temperature (C)")) +
scale_colour_gradient(low = "purple", high = "orange")
cabbagetemp
# how many kilos of cabbage per season?
cabbage$Date <- as.Date(cabbage$Date, format= "%Y-%m-%d")
cabbage_autumn20 <- cabbage %>%
filter(Date < "2020-11-30") %>%
summarise(Season = "Autumn_2020",
Total_Weight_gr = sum(Weight))
cabbage_winter20 <- cabbage %>%
filter(Date > "2020-12-01" & Date < "2021-02-28") %>%
summarise(Season = "Winter_2020",
Total_Weight_gr = sum(Weight))
cabbage_spring21 <- cabbage %>%
filter(Date > "2021-03-01" & Date < "2021-05-31") %>%
summarise(Season = "Spring_2021",
Total_Weight_gr = sum(Weight))
cabbage_summer21 <- cabbage %>%
filter(Date > "2021-06-01" & Date < "2021-08-31") %>%
summarise(Season = "Summer_2021",
Total_Weight_gr = sum(Weight))
cabbage_autumn21 <- cabbage %>%
filter(Date > "2021-09-01" & Date < "2021-11-30") %>%
summarise(Season = "Autumn_2021",
Total_Weight_gr = sum(Weight))
cabbage_winter21 <- cabbage %>%
filter(Date > "2021-12-01" & Date < "2022-02-28") %>%
summarise(Season = "Winter_2021",
Total_Weight_gr = sum(Weight))
cabbage_seasontotal <- rbind(cabbage_autumn20, cabbage_winter20, cabbage_spring21, cabbage_summer21, cabbage_autumn21, cabbage_winter21)
print(cabbage_seasontotal)
rm(cabbage_autumn20, cabbage_winter20, cabbage_spring21, cabbage_summer21, cabbage_autumn21, cabbage_winter21)
```
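The seasonal filters above use strict `<` and `>` comparisons, so a basket dated exactly on a boundary day such as 2020-12-01 falls into neither of the neighbouring seasons. One possible way to assign every date to exactly one season is to derive the season from the month. This is a sketch with made-up weights, assuming meteorological seasons (winter = December to February, and so on); `season_of` is a hypothetical helper.

```r
# Map each date to a meteorological season based on its month
season_of <- function(dates) {
  month_num <- as.integer(format(dates, "%m"))
  seasons <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
               "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")
  seasons[month_num]
}

# Made-up basket dates, including two that sit on month boundaries
example_dates   <- as.Date(c("2020-11-24", "2020-12-01",
                             "2021-02-28", "2021-06-01"))
example_weights <- c(400, 600, 500, 300)

# Total weight per season; every date is counted exactly once
totals <- tapply(example_weights, season_of(example_dates), sum)
totals
```

This version does not separate the winter of 2020/21 from the winter of 2021/22; a year label would have to be added to the grouping to keep them apart as in the chunk above.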
A similar analysis was performed for root vegetables using the subset `roots` from the original dataframe `veggiedata2`. The most root vegetables were received in the winter and spring, with 12.339kg and 13.983kg, respectively. Autumn follows closely with 10.592kg, whereas the summer months only see a total of 3.995kg. Here, too, warmer temperatures appear to correlate with fewer root vegetables.
```{r data exploration 3}
# Root Vegetable Types
roots <- veggiedata2 %>%
filter (Vegetable_Category == "Root Vegetable")
# To simplify the analysis, some vegetable names are combined
unique(roots$Vegetable)
roots$Vegetable <- gsub("Rettich", "Radish", roots$Vegetable)
roots$Vegetable <- gsub("Radishes", "Radish", roots$Vegetable)
roots$Vegetable <- gsub("[()]", "", roots$Vegetable)
roots$Vegetable <- gsub("Carrots Multicolour", "Carrots", roots$Vegetable)
roots$Vegetable <- gsub("Carrots Orange", "Carrots", roots$Vegetable)
rootscolours <- c("Carrots" = "coral1", "Radish" = "maroon2", "Salsify" = "khaki3", "Beetroot" = "maroon4", "Parsnip" = "cornsilk2", "Turnip" = "cornsilk4")
plotroots <- ggplot(data=roots, aes(x=Date, y = Weight)) +
geom_bar(stat = "identity", aes(fill = Vegetable), colour = "black") +
xlab("Date") +
ylab("Weight (gr)") +
ggtitle("The Distribution of Root Vegetables Types \n throughout the Seasons of the Year") +
scale_y_continuous(breaks = seq(0,4000, by = 250)) +
scale_x_date(date_breaks = "1 month", date_labels = "%m/%y") +
scale_fill_manual(name = "Root Vegetable Type",
values = rootscolours) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5),
axis.title.x = element_text(size = 10),
axis.text.y = element_text(hjust = 1.2),
axis.title.y = element_text(size = 10, vjust = 1),
# legend.position = c(0.85,0.85),
legend.key.size = unit(0.25, 'cm'),
legend.background = element_blank(),
panel.background = element_rect(fill = "grey96")) +
geom_vline(xintercept = xint,
linetype = "dashed") +
annotate("text", x = as.Date("2021-01-15"), y = 3600, label = "Winter") +
annotate("text", x = as.Date("2021-04-15"), y = 3600, label = "Spring") +
annotate("text", x = as.Date("2021-07-15"), y = 3600, label = "Summer") +
annotate("text", x = as.Date("2021-10-15"), y = 3600, label = "Autumn")
# plotroots
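# Daily average temperature, to be overlaid on the weight axis
temp <- weather2 %>%
  filter(Condition == "Temperature_DailyAverage")
# Axis ranges for weight (gr) and temperature (C)
ylim.roots <- c(0, 4000)
ylim.temp <- c(-5, 30)
# Linear rescaling from temperature to the weight axis: weight = c + d * temperature
d <- diff(ylim.roots)/diff(ylim.temp)
c <- ylim.roots[1] - d*ylim.temp[1]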
rootstemp <- plotroots +
  geom_line(data = temp, aes(y = c + Measurement * d, colour = after_stat(y)),
            linetype = "solid", alpha = 0.6, show.legend = FALSE) +
  geom_smooth(method = "loess", data = temp, aes(y = c + Measurement * d, colour = after_stat(y)),
              linetype = "solid", alpha = 0.6, show.legend = FALSE) +
  # This replaces the y scale set in plotroots, so the breaks are repeated here
  scale_y_continuous("Weight (gr)",
                     breaks = seq(0, 4000, by = 250),
                     sec.axis = sec_axis(~ (. - c) / d,
                                         name = "Average Daily Temperature (C)")) +
  scale_colour_gradient(low = "purple", high = "orange")
rootstemp
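# Total weight of root vegetables per season (summed in grams)
# Note: the strict comparisons below exclude baskets dated exactly on a boundary day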
roots$Date <- as.Date(roots$Date, format= "%Y-%m-%d")
roots_autumn20 <- roots %>%
filter(Date < "2020-11-30") %>%
summarise(Season = "Autumn_2020",
Total_Weight_gr = sum(Weight))
roots_winter20 <- roots %>%
filter(Date > "2020-12-01" & Date < "2021-02-28") %>%
summarise(Season = "Winter_2020",
Total_Weight_gr = sum(Weight))
roots_spring21 <- roots %>%
filter(Date > "2021-03-01" & Date < "2021-05-31") %>%
summarise(Season = "Spring_2021",
Total_Weight_gr = sum(Weight))
roots_summer21 <- roots %>%
filter(Date > "2021-06-01" & Date < "2021-08-31") %>%
summarise(Season = "Summer_2021",
Total_Weight_gr = sum(Weight))
roots_autumn21 <- roots %>%
filter(Date > "2021-09-01" & Date < "2021-11-30") %>%
summarise(Season = "Autumn_2021",
Total_Weight_gr = sum(Weight))
roots_winter21 <- roots %>%
filter(Date > "2021-12-01" & Date < "2022-02-28") %>%
summarise(Season = "Winter_2021",
Total_Weight_gr = sum(Weight))
roots_seasontotal <- rbind(roots_autumn20, roots_winter20, roots_spring21, roots_summer21, roots_autumn21, roots_winter21)
print(roots_seasontotal)
rm(roots_autumn20, roots_winter20, roots_spring21, roots_summer21, roots_autumn21, roots_winter21)
```
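The temperature overlay in the chunk above relies on a linear rescaling between the two axes: temperatures are mapped onto the weight axis for drawing, and `sec_axis()` applies the inverse so that the right-hand axis is labelled in degrees. The sketch below checks that mapping in isolation. The names `slope`, `intercept`, `to_weight_axis()` and `to_temp_axis()` are illustrative only; they correspond to `d`, `c` and the two formulas used in the chunk.

```r
# Same axis ranges as in the chunk above
ylim.roots <- c(0, 4000)  # primary axis: weight in grams
ylim.temp  <- c(-5, 30)   # secondary axis: temperature in degrees C

# Grams per degree, and the weight-axis position of 0 degrees
slope     <- diff(ylim.roots) / diff(ylim.temp)
intercept <- ylim.roots[1] - slope * ylim.temp[1]

# Forward mapping (used when drawing the temperature line)
to_weight_axis <- function(temp) intercept + slope * temp
# Inverse mapping (used by sec_axis() to label the secondary axis)
to_temp_axis <- function(weight) (weight - intercept) / slope

# The ends of the temperature range should land on the ends of the weight range
to_weight_axis(ylim.temp)

# Mapping forward and back should return the original temperature
to_temp_axis(to_weight_axis(10))
```

Because the two functions are exact inverses, any temperature drawn on the plot reads back correctly on the secondary axis, provided the same `c` and `d` are used in both the geoms and `sec_axis()`.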
In the following section, I looked at the amounts of the two types of carrots relative to all other vegetables present in each basket. The percentage of each basket's weight made up of carrots is relatively uniform throughout the year, ranging between 10.4% and 16.0%, with the winter season of 2020 and the autumn season of 2021 having the lowest and highest shares, respectively. Of the 53 baskets received, 43 contained carrots and 10 did not.
```{r data exploration 4}
# Amount of Carrots
carrotscolours <- c("Carrots (Orange)" = "orange", "Carrots (Multicolour)" = "mediumpurple2")
# Setting the order for the bar chart
vegorder <- unique(veggiedata2$Vegetable)
vegorder2 <- vegorder[!vegorder %in% c("Carrots (Orange)", "Carrots (Multicolour)")]
vegorder3 <- c(vegorder2, "Carrots (Multicolour)", "Carrots (Orange)")
veggiedata3 <- veggiedata2 %>%
arrange(Date, factor(Vegetable, levels = vegorder3))
veggiedata3$Vegetable <- factor(veggiedata3$Vegetable, levels = vegorder3)
veggiedata3$Percent <- paste(veggiedata3$Percent, "%", sep="")
# Plotting
plotcarrots <- ggplot(veggiedata3, aes(x=Date, y = Weight, fill = Vegetable)) +
geom_bar(stat = "identity") +
xlab("Date") +
ylab("Weight (gr)") +
ggtitle("Percentage of Carrots in each \n Vegetable Basket throughout the Seasons of the Year") +
scale_y_continuous(breaks = seq(0,8000, by = 500)) +
scale_x_date(date_breaks = "1 month", date_labels = "%m/%y") +
scale_fill_manual(values = carrotscolours) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
legend.position = "bottom",
legend.key.size = unit(0.5, 'cm'),
legend.title = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5),
axis.title.x = element_text(size = 10),
axis.text.y = element_text(hjust = 1.2),
axis.title.y = element_text(size = 10, vjust = 1),
panel.background = element_rect(fill = "grey96")) +
geom_vline(xintercept = xint, linetype = "dashed") +
annotate("text", x = as.Date("2021-01-15"), y = 7500, label = "Winter") +
annotate("text", x = as.Date("2021-04-15"), y = 7500, label = "Spring") +
annotate("text", x = as.Date("2021-07-15"), y = 7500, label = "Summer") +
annotate("text", x = as.Date("2021-10-15"), y = 7500, label = "Autumn") +
geom_text(aes(label=ifelse(Vegetable == "Carrots (Orange)" | Vegetable == "Carrots (Multicolour)",
Percent, "")),
size = 1.5,
angle = 90,
position = position_stack(vjust = 0.5))
plotcarrots
# Average percentage share of carrots per season
carrots <- veggiedata2 %>%
filter(Vegetable == "Carrots (Orange)" | Vegetable == "Carrots (Multicolour)")
carrots$Date <- as.Date(carrots$Date, format= "%Y-%m-%d")
carrots_autumn20 <- carrots %>%
filter(Date < "2020-11-30") %>%
summarise(Season = "Autumn_2020",
Average_Percent = mean(Percent))
carrots_winter20 <- carrots %>%
filter(Date > "2020-12-01" & Date < "2021-02-28") %>%
summarise(Season = "Winter_2020",
Average_Percent = mean(Percent))
carrots_spring21 <- carrots %>%
filter(Date > "2021-03-01" & Date < "2021-05-31") %>%
summarise(Season = "Spring_2021",
Average_Percent = mean(Percent))
carrots_summer21 <- carrots %>%
filter(Date > "2021-06-01" & Date < "2021-08-31") %>%
summarise(Season = "Summer_2021",
Average_Percent = mean(Percent))
carrots_autumn21 <- carrots %>%
filter(Date > "2021-09-01" & Date < "2021-11-30") %>%
summarise(Season = "Autumn_2021",
Average_Percent = mean(Percent))
carrots_winter21 <- carrots %>%
filter(Date > "2021-12-01" & Date < "2022-02-28") %>%
summarise(Season = "Winter_2021",
Average_Percent = mean(Percent))
carrots_seasontotal <- rbind(carrots_autumn20, carrots_winter20, carrots_spring21, carrots_summer21, carrots_autumn21, carrots_winter21)
print(carrots_seasontotal)
rm(carrots_autumn20, carrots_winter20, carrots_spring21, carrots_summer21, carrots_autumn21, carrots_winter21)
# how many times were carrots received?
basket_number <- sum(!duplicated(veggiedata2$Date))
yes_carrots <- sum(!duplicated(carrots$Date))
no_carrots <- basket_number - yes_carrots
```
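The basket counts at the end of the chunk above treat every distinct delivery date as one basket. As a small sketch of that counting logic on invented data (the data frame `toy_baskets` is hypothetical, and the assumption that one date equals one basket is carried over from the chunk), `n_distinct()` gives the same result as `sum(!duplicated(...))`:

```r
library(dplyr)

# Toy data: three delivery dates, two of which include carrots
toy_baskets <- data.frame(
  Date = as.Date(c("2021-01-05", "2021-01-05", "2021-01-12",
                   "2021-01-19", "2021-01-19")),
  Vegetable = c("Carrots (Orange)", "Leek", "Leek",
                "Carrots (Multicolour)", "Carrots (Orange)"),
  stringsAsFactors = FALSE
)

# Keep only the carrot rows, as in the chunk above
toy_carrots <- toy_baskets %>%
  filter(Vegetable %in% c("Carrots (Orange)", "Carrots (Multicolour)"))

# Count distinct delivery dates overall and among the carrot rows
basket_number <- n_distinct(toy_baskets$Date)
yes_carrots   <- n_distinct(toy_carrots$Date)
no_carrots    <- basket_number - yes_carrots

c(baskets = basket_number, with_carrots = yes_carrots, without_carrots = no_carrots)
```

A basket containing both carrot types is counted once, because the count is taken over dates rather than over rows.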
```{r data exploration 5}
```