-
Notifications
You must be signed in to change notification settings - Fork 0
/
Multilevel Analysis of Vic Road Accidents.Rmd
513 lines (389 loc) · 28.4 KB
/
Multilevel Analysis of Vic Road Accidents.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
---
title: "Analysis of Road Accidents Severity using Multilevel Modeling"
author: "Leah Nguyen"
date: "11/9/2021"
output:
html_document:
toc: true
toc_float: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
Road accidents may cause deaths and it may leave a lot of people seriously injured. In 2016, 5.34 road deaths occurred in Australia per 100,000 people (Bureau of Infrastructure, Transport and Regional Economics 2018). The number of vehicles registered each year in Australia is increasing with more than 1.5% per annum, inline with a similar increase in the population (Australian Bureau of Statistics 2019, Australian Bureau of Statistics 2018). There is thus an increasing call for road safety by the government, which can be noticed with the Towards Zero campaign setup by the Victorian government. The dataset that was analysed was collected also collected from a government website; from the Department of Planning, Transport and Infrastructure.
This study dives deeper into the vehicle aspect of road accidents to answer the question of whether vehicle’s make and model have an impact on the severity of road accidents or not.
## Backgound, Research Quesion and Model Specification
The background for this assignment is informed by the initial research question “Which factors influence road fatalities in Victoria conditioned on there being an accident in the first place?”, and by the takeaway from our group assignment AT2. The previous study included a detailed exploratory data analysis (EDA) for the road accident data,, which was used to choose the independent variables, and a logistic regression on the dataset to find out what were the causes of high severity and fatalities in road accidents of Victorian Roads.
The final result of the analysis as shown in Appendix 1 had concluded that type of accident, light condition, type of road and type of vehicle during the time of the accident had the most effect on the severity of road accidents. Low p (hence, statistically significant) values were observed on the vehicle’s part, such as the age of the vehicle and the type of vehicle (like buses, vans, taxis and cars).
However, there may be an underlying effect within, where the make and the model of the vehicles would be causing an impact on the accidents’ severity. Such a structure is not considered by the previous model and without looking into more details, it cannot be assumed that Vehicle Type, in general, is significant. It is also worthwhile to find out how predictors on the different levels account in increasing or decreasing the severity of accidents. The group predictors (such as the age of the vehicle, damage done to a vehicle, its speed) can be conjugate with different hierarchies, and their effects can be analysed separately using a multilevel model. Therefore, I conduct my new analysis and reframe the research question as "Accounting for the hierarchical structure of the data, what are the effects of vehicles and other group predictors (vehicle age, damage, speed) on injured severity an fatalities?".
I also investigated different types of statistical modelling and finally chose a a simple multilevel model to factor the underlying variables. Using multilevel fixed effects model, inferences can be derived beyond a group. Using multilevel random effects model, the effect of both types of group level predictor variables and group level dummy variables can be estimated (Magnusson, 2015).
## Exploratory Data Analysis
### Structure of the dataset
The total dataset consists of 28 Vehicle types, from light such as bicycles to heavy ones. With the aim of making this study more specific, it will factor out two wheelers and heavy wheelers and will only take everyday vehicles from the small category of vehicles only, viz. car, station wagon, utility and taxi only.
```{r include=FALSE}
##################################################
## This script prepares the training and test
## datasets for both EDA and Modelling
##################################################
# Load Libraries
library(sqldf)
library(lme4)
library(tidyverse)
```
```{r include=FALSE}
# read data from source
accident <- read.csv("dataset/Road Crashes/ACCIDENT.csv")
vehicles <- read.csv("dataset/Road Crashes/VEHICLE.csv")
person <- read.csv("dataset/Road Crashes/PERSON.csv")
```
```{r include=FALSE}
# Temporary variable to store vehicle's types and descriptions
vehicle_type <- sqldf("select distinct VEHICLE_TYPE, \"Vehicle.Type.Desc\" from vehicles")
```
```{r include=FALSE}
# Join accident, vehicles and people datasets
avp_df1 <- sqldf("select a.ACCIDENT_NO, a.ACCIDENTDATE, a.SPEED_ZONE, v.VEHICLE_ID, v.VEHICLE_TYPE,
v.VEHICLE_MAKE, v.VEHICLE_MODEL,
v.\"Vehicle.Type.Desc\" as Vehicle_Type_Desc,
v.VEHICLE_YEAR_MANUF, v.LEVEL_OF_DAMAGE,
p.PERSON_ID,
p.INJ_LEVEL, p.\"Inj.Level.Desc\" as Inj_Level_Desc
from vehicles v
inner join accident a on
a.ACCIDENT_NO = v.ACCIDENT_NO
inner join person p on
p.ACCIDENT_NO = a.ACCIDENT_NO
and p.VEHICLE_ID = v.VEHICLE_ID
where v.VEHICLE_TYPE in (1, 2, 4)")
```
```{r include=FALSE}
# Join vehicle's model
vehicle_model <- sqldf("select VEHICLE_MAKE || VEHICLE_MODEL AS VEHICLE_MODEL, count(*) as my_Count
from vehicles where
VEHICLE_TYPE in (1, 2, 4) and VEHICLE_MODEL > 0
group by VEHICLE_MAKE || VEHICLE_MODEL
having my_Count >= 100")
```
```{r include=FALSE}
# Filter vehicles by model and remove unneeded ones
df <- sqldf("select ACCIDENT_NO, ACCIDENTDATE, VEHICLE_MAKE, VEHICLE_MODEL,
Vehicle_Type_Desc, min(INJ_LEVEL) as SEVERITY,
VEHICLE_YEAR_MANUF, LEVEL_OF_DAMAGE, SPEED_ZONE
from avp_df1
where VEHICLE_MAKE || VEHICLE_MODEL IN (select VEHICLE_MODEL from vehicle_model)
AND VEHICLE_MAKE NOT LIKE '%UNKN%'
group by ACCIDENT_NO, ACCIDENTDATE, VEHICLE_MAKE, VEHICLE_MODEL, Vehicle_Type_Desc,
VEHICLE_YEAR_MANUF, LEVEL_OF_DAMAGE")
```
```{r include=FALSE}
# Format date, years and number of years
df$ACCIDENTDATE <- as.Date(df$ACCIDENTDATE, "%d/%m/%Y")
df$ACCIDENT_YEAR <- as.numeric(format(df$ACCIDENTDATE, "%Y"))
df$NUM_OF_YEARS <- df$ACCIDENT_YEAR - df$VEHICLE_YEAR_MANUF
```
```{r include=FALSE}
# Create a training dataset
train <- df
```
```{r include=FALSE}
# Fix the categorical dependent variable- Severity- Yes and No
# If Severety is 1 or 2, put "Yes" else "No"
train$SEVERITY[train$SEVERITY == "2"] <- "1"
train$SEVERITY[train$SEVERITY == "3"] <- "2"
train$SEVERITY[train$SEVERITY == "4"] <- "2"
train$SEVERITY[train$SEVERITY == "1"] <- "0"
train$SEVERITY[train$SEVERITY == "2"] <- "1"
train$SEVERE[train$SEVERITY == "0"] <- "Yes"
train$SEVERE[train$SEVERITY == "1"] <- "No"
# Fix vehicle's make typos
train$VEHICLE_MAKE <- as.character(train$VEHICLE_MAKE)
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "AUD "] <- "AUDI"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "AUDI "] <- "AUDI"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "B.M. "] <- "B M W "
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "L RO "] <- "L ROV "
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "FOR "] <- "FORD"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "FOR"] <- "FORD"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "FORD "] <- "FORD"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "LANDRO"] <- "LAND R"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "ROVER "] <- "ROVER"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "L ROV "] <- "LAND R"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "NIS "] <- "NISSAN"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "SUB "] <- "SUBARU"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "HYND "] <- "HYUNDAI"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "HYNDAI"] <- "HYUNDAI"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "HYU "] <- "HYUNDAI"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "HYUNDA"] <- "HYUNDAI"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "HYUNDI"] <- "HYUNDAI"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "MISTUB"] <- "MITSUB"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "R ROV "] <- "ROVER"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "VLK "] <- "VOLKSW"
train$VEHICLE_MAKE[train$VEHICLE_MAKE == "VOLKS "] <- "VOLKSW"
train$VEHICLE_MAKE <- as.factor(train$VEHICLE_MAKE)
# Fix vehicle's Model typos
train$VEHICLE_MODEL <- as.character(train$VEHICLE_MODEL)
names(train$VEHICLE_MODEL) <- gsub("\\s+", "", names(train$VEHICLE_MODEL))
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "LANDCR"] <- "LANCER"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "COROLL"] <- "COROLA"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "C/DORE"] <- "C'DORE"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "COMMO "] <- "COMM"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "COMM "] <- "COMM"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "COMMOD"] <- "COMM"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "EXEC "] <- "EXECUT"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "ELANTR"] <- "ELNTRA"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "LANTRA"] <- "ELNTRA"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "CR-V "] <- "CRV"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "CRV "] <- "CRV"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "FALCO "] <- "FALCON"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "MAZDA3"] <- "3"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "3 "] <- "3"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "RAV4 "] <- "RAV4 "
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "RAV 4 "] <- "RAV4 "
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "SANTA "] <- "SANTAF"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "SKYLNE "] <- "SKYLIN"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "SPORTA"] <- "SPTAGE"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "SKYLNE"] <- "SKYLIN"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "SKYLNE"] <- "SKYLIN"
train$VEHICLE_MODEL[train$VEHICLE_MODEL == "CX-5 "] <- "CX5 "
train$VEHICLE_MODEL <- as.factor(train$VEHICLE_MODEL)
# Change type of variable of Number of years from factor to numeric
train$NUM_OF_YEARS <- as.numeric(train$NUM_OF_YEARS)
# Put NAs for not needed values, i.e. have years of vehicle from 0 to 30 years
train$NUM_OF_YEARS[train$NUM_OF_YEARS < 0 ] <- NA
train$NUM_OF_YEARS[train$NUM_OF_YEARS > 30 ] <- NA
train$LEVEL_OF_DAMAGE <- as.factor(train$LEVEL_OF_DAMAGE)
train$LEVEL_OF_DAMAGE [train$LEVEL_OF_DAMAGE == '9']<- NA
# set Severe's type as factor
train$SEVERE <- factor(train$SEVERE,
levels=c("No", "Yes"),
labels=c("No", "Yes"))
# Once again,
# Change type of variable of Number of years from factor to numeric
train$NUM_OF_YEARS <- as.numeric(train$NUM_OF_YEARS)
# If speeds are greater than 100 or less than 0 then put NA
train$SPEED_ZONE[train$SPEED_ZONE >110 ] <- NA
train$NUM_OF_YEARS[train$NUM_OF_YEARS <0 ] <- NA
```
```{r include=FALSE}
##################################################
## This script is used to do EDA on the data
##################################################
# Load Libraries
library(ggplot2)
library(GGally)
library(reshape2)
library(lme4)
library(compiler)
library(parallel)
library(boot)
library(lattice)
library(sqldf)
library(data.tree)
library(DiagrammeR)
```
```{r include=FALSE}
# Get the list of vehicle's models
# create variables to be used
vehicle_models <- sqldf("select distinct VEHICLE_MAKE, VEHICLE_MODEL from train limit 7")
vehicle_models$pathString <- paste("Vehicle",
vehicle_models$VEHICLE_MAKE,
vehicle_models$VEHICLE_MODEL,
sep = "/")
vehicle_models_copy <- as.Node(vehicle_models)
```
```{r include=FALSE}
# Join vehicle, accident and person data
vehicle_models <- sqldf("select a.ACCIDENT_NO, a.ACCIDENTDATE, a.SPEED_ZONE, v.VEHICLE_ID, v.VEHICLE_TYPE,
v.VEHICLE_MAKE, v.VEHICLE_MODEL,
substr(v.\"Vehicle.Type.Desc\", 0, 10) as Vehicle_Type_Desc,
v.VEHICLE_YEAR_MANUF, v.LEVEL_OF_DAMAGE,
p.PERSON_ID,
p.INJ_LEVEL, p.\"Inj.Level.Desc\" as Inj_Level_Desc
from vehicles v
inner join accident a on
a.ACCIDENT_NO = v.ACCIDENT_NO
inner join person p on
p.ACCIDENT_NO = a.ACCIDENT_NO
and p.VEHICLE_ID = v.VEHICLE_ID")
```
The total dataset consists of 28 Vehicle types, from light such as bicycles to heavy ones. With the aim of making this study more specific, it will factor out two wheelers and heavy wheelers and will only take everyday vehicles from the small category of vehicles only, viz. car, station wagon, utility and taxi only.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Plot a bar graph using the above dataset
ggplot(vehicle_models, aes(x=Vehicle_Type_Desc)) + geom_bar()+
theme(axis.text.x = element_text(angle = 90))+
xlab("Vehicle Type")+
ylab("Count of Accidents")
```
<center>
**Figure 1.** A bar graph showing the count of individual vehicle types in the dataset
</center>
The data-frame used in the analysis is as follows:
| **Column Name** | **Description** |
|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ACCIDENT_NO | The Unique ID of the accident |
| ACCIDENTDATE | The date of the accident |
| VEHICLE_MAKE | The Make of the vehicle |
| VEHICLE_MODEL | The Model of the vehicle |
| Vehicle_Type_Desc | Description of the Type of the Vehicle (Car, Station Wagon) |
| VEHICLE_YEAR_MANUF | The year of Manufacture of the Vehicle |
| LEVEL_OF_DAMAGE | Damage Level: 1-6, 1 includes severely damaged |
| SPEED_ZONE | The speed zone of the road which the accident occurred (This is assumed to be the speed of vehicle during crash and is taken as continuous- as speed of the vehicle is not always equal to speed zones speed) |
| ACCIDENT_YEAR | Year of the accident, derived from `ACCIDENTDATE` (continuous) |
| NUM_OF_YEARS | Age of Car, obtained from `ACCIDENT_YEAR` - `VEHICLE_YEAR_MANUF` |
| SEVERE | Is the accident severe? Binary: Yes and No |
Now, checking for collinearity between the continuous variables `NUM_OF_YEARS` and `SPEED_ZONE`, we see that there it is minimal. Hence, both of them can be used in our model.
```{r message=FALSE, warning=FALSE}
# plot using GGpairs
ggpairs(train[, c("NUM_OF_YEARS", "SPEED_ZONE")],
cardinality_threshold = NULL)
```
<center>
**Figure 2.** Checking collinearity between continuous residual variables
</center>
Looking at the distribution of severity on continuous variables (Figure 3), Number of Years (i.e. the age of car) and Speed Zone, we find that there is not much impact of the age of car on the severity of accidents, whereas with regards to the speed zone, accidents are more severe at higher speed zone areas.
```{r echo=FALSE, message=FALSE, warning=FALSE}
tmp <- melt(train[, c("SEVERE", "Vehicle_Type_Desc", "NUM_OF_YEARS", "SPEED_ZONE")])
# Fourth plot
ggplot(tmp, aes(factor(SEVERE), y = value, fill=factor(SEVERE))) +
geom_boxplot() +
facet_wrap(~variable, scales="free_y")
```
<center>
**Figure 3.** A box plot showing the distribution of continuous variables
</center>
Second test is to check for the ratios of the response variable, `SEVERE`. From Figure 4, it can be seen that there are no zeros, and also some of the ratios are large which may be due to the dataset being from the year 2006 to 2018. We will need to be careful with this during modeling our data.
```{r echo=FALSE}
# check data and results
vehicle_models <- table(train$VEHICLE_MAKE, train$SEVERE)
vehicle_models_copy <- as.data.frame.matrix(vehicle_models)
vehicle_models_copy <- cbind(vehicle_models_copy, rownames(vehicle_models))
vehicle_binary <- sqldf("select \"rownames(vehicle_models)\" as VEHICLE_MAKE, No, Yes, (100.0 * No )/(Yes+No) as No_Percentage,
(100.0 * Yes )/(Yes+No) as Yes_Percentage
from vehicle_models_copy")
knitr::kable(vehicle_binary)
```
<center>
**Figure 4.** A table showing the number and percentage of Response variable
</center>
Now, regarding the levels in the data, a vehicle’s make and Vehicle’s model is a sample of wider population of model and make respectively. Hence, they can be used as levels for multilevel analysis. It is useful to compare between higher level units; hence we will be comparing the vehicle’s makes.
The data has a 2-level nested structure hierarchy. The lowest level (Level 2) is vehicle’s model, and highest level (Level 1) is the vehicle’s make, as shown in Figure 6.
![](Hierarchy.png)
<center>
**Figure 5.** A plot and a relationship diagram showing 2-level hierarchy in the dataset
</center>
Thus, we will be taking the `VEHICLE_MAKE` and `VEHICLE_MODEL` as our two levels for multilevel modeling. There are no predictor variables for VEHICLE_MAKE in this dataset. For `VEHICLE_MODEL` level, will take `LEVEL_OF_DAMAGE`, `SPEED_ZONE` and `NUM_OF_YEARS` as predictor variables.
The response variable, `SEVERE` is binary, having values ‘No’ and ‘Yes’. Hence, from the nature of the data and the response variable, the type of regression that we will be using is a logistic multilevel regression to access whether the vehicle’s attributes play a significant impact on the severity of accidents or not.
## Multilevel Model Design and Interpretation
Multilevel model treats units of analysis as independent observations and helps in drawing conclusions about the effects in those observations individually. The motivation to take up multilevel modeling can be due to the wish to make amendments to previously made inferences, or an interest in group effects of a research, or even to estimate the group effects using the group level predictors together. Whatever the motivation might be, there are a series of steps that are performed while doing multilevel regression. We will be performing each of those steps, taking inferences and moving onto the next until the best combination of variables is found.
We will be using `glmer` function of `lme4` package in R to compute the multi-level models with the optimizer `bobyqa` and value of nAGQ as 0, for optimizing the speed of the modeling. To compare the models, we will use the function AIC which stands for ‘An Information Criterion’ with log likelihood of the model. Lower value of AIC means a better model.
Now, the first step of doing multilevel modeling is to create a null model, with no effects, just the intercepts. A null model can be taken as the base model for analysis, and it will help to determine whether controlling the factor is useful or not.
From the data, we have more ‘No’ and less ‘Yes’. The lower coded group is ‘No’ in this analysis.
```{r include=FALSE}
#load libraries
library(lme4)
```
```{r echo=FALSE}
#MLM Step 1- Fixed Effects
m1 <- glm(SEVERE ~ 1,
family= binomial(link= "logit"),
data= train)
summary(m1)
m1$aic
```
<center>
**Figure 6.** A null model with only the response variable SEVERE
</center>
After the null model, a random intercept model is made using `glmer` using random intercept of `VEHICLE_MAKE`. From this model, we can see that the intercept’s estimate has changed towards the ‘No’ category. The variance due to random effect `VEHICLE_MAKE` is very small, so it is not significant.
```{r echo=FALSE}
m2 <- glmer(SEVERE ~ 1 + (1 | VEHICLE_MAKE),
data = train,
family = binomial(link= "logit"),
control = glmerControl(optimizer = "bobyqa"),
nAGQ = 0
)
summary(m2)
AIC(logLik(m2))
```
<center>
**Figure 7.** A random intercept model with VEHICLE_MAKE independent variable
</center>
Now, again adding a second random intercept VEHICLE_MODEL to the above model to note the changes in the variance of the intercept (Figure 9), we can see that the estimate again changes more towards ‘No’. The variance of `VEHICLE_MODEL` is a little more than that of `VEHICLE_MAKE`, hence it can be said that `VEHICLE_MODEL` plays a bigger role.
```{r echo=FALSE}
#MLM Step 2.1- Random intercept
m2.1 <- glmer(SEVERE ~ 1 + (1 | VEHICLE_MAKE) + (1 | VEHICLE_MODEL),
data = train,
family = binomial(link= "logit"),
control = glmerControl(optimizer = "bobyqa"),
nAGQ = 0
)
summary(m2.1)
AIC(logLik(m2.1))
```
<center>
**Figure 8.** A random intercept model with VEHICLE_MAKE independent variable
</center>
Adding fixed variables NUM_OF_YEARS and `SPEED_ZONE` to the model, we have slight change in the intercepts. The positive value of Estimate indicates that adding these variables has increases the chance of getting a ‘Yes’, though they are very small. This means that there is a slight increase in the chances of having a higher severity when the vehicle gets older, or when the vehicle is in the higher speed zone (essentially meaning that the vehicle is in higher speed). This is shown in Figure 9.
```{r echo=FALSE}
#MLM Step 3
#MLM Step 3- Random Intercept
m3 <- glmer(SEVERE ~ NUM_OF_YEARS + SPEED_ZONE + (1 | VEHICLE_MAKE) + (1 | VEHICLE_MODEL),
data = train,
family = binomial(link= "logit"),
control = glmerControl(optimizer = "bobyqa"),
nAGQ = 0
)
summary(m3)
AIC(logLik(m3))
```
<center>
**Figure 9.** A random intercept model with added Level 2 fixed variables
</center>
Now, the last thing to do is to add a random slope in the model. A subset of the dataset where the accident year is 2018 was sued to see the effect as the computation did not complete for the entire dataset. It can be seen that the `NUM_OF_YEARS` has had a little impact on the Variance of the Random Effects model, so it is also not that significant. Adding other intercept values also does not have considerable effect on the data. Hence, we will be excluding it from our analysis.
```{r echo=FALSE}
#MLM Step 4 Random Slope Random Intercept
m4 <- glmer(SEVERE ~ NUM_OF_YEARS + SPEED_ZONE +
(NUM_OF_YEARS | VEHICLE_MAKE) + (1 | VEHICLE_MODEL),
data = subset(train, format(as.Date(ACCIDENTDATE),"%Y")==2018),
family = binomial(link= "logit"),
control = glmerControl(optimizer = "bobyqa"),
nAGQ = 0
)
summary(m4)
```
<center>
**Figure 10.** A random slope random intercept model applied on a subset of dataset to analyse the effects
</center>
We will now be taking the AIC to compare between our models and finding out the best model among them. Figure 11 shows the AIC values for all the models discussed above. It shows that the model `m3`, which takes `NUM_OF_YEARS` and `SPEED_ZONE` as the fixed variables and the random intercepts of our multilevel variables `VEHICLE_MAKE` and `VEHICLE_MODEL` gives the most efficient model.
```{r echo=FALSE}
#Test for dispersion for the best model, which is m3
m1$aic
AIC(logLik(m2))
AIC(logLik(m2.1))
AIC(logLik(m3))
```
<center>
**Figure 11.** Performance of models, compared using AIC
</center>
Though we have compared our models and found that our model m3 is the best fitting model, it has variance of intercepts only 0.081 for `VEHICLE_MODEL` and 0.0617 for `VEHICLE_MAKE`. Since the model is a same slope and random intercept model, the parallel regression lines which are drawn are separated by a very small distance. This implies that the regression lines drawn by using our independent variables will be very close to each other, which means that there is no effect of either `VEHICLE_TYPE` or `VEHICLE_MODEL` on our response variable individually. So, the answer to whether the vehicle’s make or model impacts severity of accidents is negative.
## Conclusion
The original study conducted by the group found that type of vehicle played a significant role in the severity of accidents. It had given an impression that except two wheelers, all the vehicle types were statistically significant. However, this assumption is contradicted by analysing deeper and performing multilevel modeling by taking a subset and using the vehicle’s make and model. The reason that the previous study had given those to be statistically significant may be due to overfitting, since the data points were in the order of 180,000.
The main thing to take from this study is that the simple regression models, though would give a good approximation, may be incorrect. It is necessary to go deeper into details if one were to set a firm assumption on the model. Lastly, this study has come up with the conclusion that the vehicle’s attributes are not statistically significant when it comes to the severity of the accidents.
## Reflection
In my short but intense experience with modelling, I have personally come to the conclusion that the temptation to rely on algorithms is immense. Combined with computing power it makes us believe we have an artificial intelligence crystal ball at our finger tips.
The multilevel method can adjust the model assumption to better fit data with a hierarchy structure. It provides a better model fit by allowing random effect on the location and DayType variable to adapt to the two-level nested structure of this fuel data set, compare to linear model as previously discussed in the model justification session.
This model helps to provide a better estimate of the influence of predictor variables toward the response variable as it considers the dependency between different variables.
## References
1. Taimur, U., Liping, F., & F. M. M., Luis. (2016). Injury severity analysis: comparison of multilevel logistic regression models and effects of collision data aggregation. Retrieved from https://link.springer.com/article/10.1007/s40534-016-0096-4
2. Gudmundur, F., & Fred, L. M. (2004). Differences in male and female injury severities in sportutility vehicle, minivan, pickup and passenger car accidents. Retrieved from https://www.sciencedirect.com/science/article/pii/S0001457502001355
3. Bureau of Infrastructure, Transport and Regional Economics 2018, International road safety comparisons 2016, BITRE, Canberra.
4. Magnusson, K. (2015). Training for multilevel modeling. Retrieved from http://www.bristol.ac.uk/cmm/learning/multilevel-models/samples.html
5. Taylor, J. J. (2011). Confusing Stats Terms Explained: Residual. Retrieved from http://www.statsmakemecry.com/smmctheblog/confusing-stats-terms-explainedresidual.html
6. Blissett, R. (2017). Multilevel Modeling in R. Retrieved from https://rpubs.com/rslbliss/r_mlm_ws
7. Dr. Buchanan, E. M. (2019). R - Binary Logistic Multilevel Models. Viewed from https://www.youtube.com/watch?v=6MexZiX-2W8
8. Jovanovic, M. (2015). R Playbook: Introduction to Multilevel/Hierarchical Models. Retrieved from https://complementarytraining.net/r-playbook-introduction-to-multilevelhierarchical-models/
9. Borowski, T. G. Chapter 16: Multilevel Modeling. Retrieved from https://ademos.people.uic.edu/Chapter16.html
## Appendix
![](previous_study.png)
<center>
**Appendix 1:** The final model of our group’s study on Road Accidents
</center>