Return to MUSA 801 Projects Page
Our client is the Department of General Services of Baltimore City Government and its fleet repair and management team. They face an unusual business problem: they are spending too much on vehicles, and too many vehicles are returning to the shop too often. Or at least this is their perception; they asked us to analyse their data set and give a mathematical definition of what exactly a "comeback" is. We start by exploring vehicle spending and developing some tools that we revisit throughout the analysis. This ultimately leads to a definition of a comeback. Finally, we predict the probability of a comeback to help the city better manage its fleet and reduce its spending.
This step involves setting up libraries and themes, loading all the relevant data, and converting it into the required format.
library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(caret)
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(rgdal)
library(tidycensus)
library(rmarkdown)
library(readxl)
library(lubridate)
library(dygraphs)
library(zoo)
library(xts)
library(rmdformats)
This step reads our data into the desired format; below we take a quick glimpse at it.
setwd("/Users/akshaynagar/Desktop/To_Org/In_Out_Temp")
dat <- read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 1) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 2)) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 3)) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 4)) %>%
rename(EPCycleLength = EPCycleLenghth) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 5)) %>%
mutate(PostDate = ymd(as.character(PostDate)))
glimpse(dat)
## Observations: 925,256
## Variables: 21
## $ VehicleCost <dbl> 209353, 209353, 209353, 209353, 209353, 209353, 209353,…
## $ EhKey <chr> "000024", "000024", "000024", "000024", "000024", "0000…
## $ WHKey <chr> "0000539542", "0000546043", "0000546043", "0000546725",…
## $ Make <chr> "PIERCE", "PIERCE", "PIERCE", "PIERCE", "PIERCE", "PIER…
## $ Model <chr> "SABER", "SABER", "SABER", "SABER", "SABER", "SABER", "…
## $ MajGroup <chr> "Fire", "Fire", "Fire", "Fire", "Fire", "Fire", "Fire",…
## $ EqcDesc <chr> "FIRE DEPARTMENT", "FIRE DEPARTMENT", "FIRE DEPARTMENT"…
## $ Dept <chr> "64-00", "64-00", "64-00", "64-00", "64-00", "64-00", "…
## $ EPCycleLength <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ CycleDESC <chr> "MONTHS", "MONTHS", "MONTHS", "MONTHS", "MONTHS", "MONT…
## $ BhLineChr <dbl> 879.18, 87.71, 81.33, 146.48, 131.57, 131.57, 249.90, 3…
## $ AcctType <chr> "SU", "LA", "PA", "LA", "LA", "LA", "LA", "LA", "LA", "…
## $ PostDate <date> 2014-07-15, 2014-07-09, 2014-08-21, 2014-07-15, 2014-0…
## $ MechID <dbl> NA, 5217, NA, 5151, 5010, 5025, 5178, 5205, 5232, 5247,…
## $ WHShop <chr> "TRK", "FIR", "FIR", "FWY", "FIR", "FIR", "FIR", "FIR",…
## $ SHPDESC <chr> "TRUCK SIDE - CENTRAL", "FIRE MAINTENANCE", "FIRE MAINT…
## $ Reason <chr> "9", "R", "R", NA, "8", "8", "8", "8", "R", "R", "2", "…
## $ RTRDESC <chr> "Road Call (Vendor)", "Agency Request", "Agency Request…
## $ RGDESC <chr> "Tire Service", "Engine Exhaust", "Electrical", "Igniti…
## $ RCDESC <chr> "Tire Heavy", "Misc Parts", "System", "System", "Comple…
## $ RRDESC <chr> "Replace L/F", "Repair", "Part Issue", "Road Service", …
This step creates a time lag for all the data. It takes into account EhKey, referred to as Vehicle ID in this documentation, and WHKey, referred to as Trip ID. Keeping both the Vehicle ID and the Trip ID, and taking the initial date of each trip along with each successive date, we compute a time-lag table. Please note that this is done to create a powerful tool that will help us better understand and define comeback rates; the lag itself is not the comeback rate.
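The chunk that builds the fleet-wide lag table (`dat_02`, used in the plot below) is not shown. A minimal sketch of how it could be constructed is given here; the grouping is an assumption based on the single-vehicle code later in this section.

dat_02 <- dat %>%
  group_by(EhKey, WHKey) %>%                 # one row per vehicle + trip
  dplyr::slice(which.min(PostDate)) %>%      # first posting date of each trip
  group_by(EhKey) %>%
  arrange(PostDate) %>%
  mutate(lagTime = PostDate - lag(PostDate)) # days since the vehicle's previous trip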
A comeback trip is best explained in the chart below. The trips in red are comeback trips: repairs that concern the same part, where the previous repair's warranty has not yet lapsed. A regular trip is similar, except the same part is repaired after it has successfully lived out its warranty. If two repairs concern two different parts, they are non-related trips. This is best shown with examples: split sample data sets, date examples, and examples on actual parts with actual warranty periods, taken for a fire truck with Vehicle ID 24, all illustrated in detail in the chart below.
In summary, a comeback is simply a repair trip on the same part before its warranty has elapsed, while a regular trip occurs after the warranty has elapsed, as illustrated in the summary below:
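A toy example of that rule, with hypothetical numbers: a trip is a comeback exactly when the gap since the previous repair of the same system is shorter than the warranty (EPCycle) expressed in days.

lagTime_days  <- 45       # hypothetical gap between two repairs of the same part
warranty_days <- 6 * 30   # e.g. a 6-month EPCycle, as for the fire trucks
ifelse(lagTime_days < warranty_days, "Comeback", "Regular Trip")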
This step counts and displays the number of trips where the comeback occurs within 5 days, 10 days, and so on, for all the data between 2015 and 2019 and for all vehicles.
ggplot(dat_02 %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 100))+
geom_histogram(aes(time), binwidth = 5)+
geom_vline(xintercept=30,
color = "#F57434", size=1)+
labs(
title = "Time since last maintenance, All data 2015-2019",
subtitle = "Max time 100 days, Bar = 5 days",
x="Days",
y="Number of records" )
The Baltimore City Government uses Pierce Saber fire trucks; Pierce is a leading American company that specialises in making fire trucks.
Now we focus on one fire truck with Vehicle ID '000024'. Here we analyse and plot, using the time-lag data, the duration between consecutive repairs for the truck.
fireTruck <- dat %>%
filter(EhKey == '000024') %>%
arrange(PostDate)
fireTruck_time <- fireTruck %>%
group_by(EhKey, WHKey) %>%  # group by vehicle and trip together; successive group_by() calls would keep only the last key
dplyr::slice(which.min(PostDate))%>%
mutate(PostDate = ymd(as.character(PostDate))) %>%
group_by(EhKey) %>%
arrange(PostDate) %>%
mutate(lagTime = PostDate - lag(PostDate))
fireTruck_time %>%
ggplot( aes(x=PostDate, y=lagTime)) +
geom_line() +
geom_point()+
geom_hline(yintercept=60, linetype="dashed",
color = "#F57434", size=1)+
labs(
title = "Duration between two repairment, FireTruck 000024",
x="Date",
y="Duration")
Here we plot the cost of repairs of the fire truck. These costs are cumulative, so we can see the amount spent on the truck's repairs over time.
fireTruck %>%
ggplot(aes(x=PostDate, y=cumsum(BhLineChr))) + geom_line() + geom_point()+
labs(
title = "Cumulative Maintenance fee, FireTruck 000024",
x="Date",
y="Dollar" )
The cars used in the Baltimore fleet for police are 2009 Chevrolet Impalas. 'The Impala is Chevrolet's popular flagship passenger car and is generally among the better selling American made automobiles in the United States.'
Here, as for the fire trucks, we do a similar exercise for a police car (EhKey == '078983'), focusing on some repair types specific to cars; the thinking, however, is very similar. We must note that the warranty period, or EPCycleLength, is much shorter here. This is understandable, as police cars are used much more often than fire trucks: the nature of their job demands more frequent trips and more extensive travel and usage.
ps1 <- dat %>%
filter(EhKey == '078983') %>%
arrange(PostDate)
ps1_time <- ps1 %>%
group_by(EhKey, WHKey) %>%  # group by vehicle and trip together
dplyr::slice(which.min(PostDate))%>%
mutate(PostDate = ymd(as.character(PostDate))) %>%
group_by(EhKey) %>%
arrange(PostDate) %>%
mutate(lagTime = PostDate - lag(PostDate))
ps1_time %>%
ggplot( aes(x=PostDate, y=lagTime)) +
geom_line() +
geom_point()+
geom_hline(yintercept=30, linetype="dashed",
color = "#F57434", size=1)+
labs(
title = "Duration between two repairment, Police Car 078983",
x="Date",
y="Duration")
Police_car <- dat %>%
filter(EhKey == '078983') %>%
arrange(PostDate)
Police_car %>%
ggplot(aes(x=PostDate, y=cumsum(BhLineChr))) + geom_line() + geom_point()+
labs(
title = "Cumulative Maintenance fee, Police_car 078983",
x="Date",
y="Dollar" )
# Calculate the time since last maintenance of each vehicle and each part
all_duration <- dat %>%
group_by(EhKey,RGDESC,WHKey) %>%
summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
arrange(EhKey,RGDESC,PostDate)%>%
mutate(lagTime = PostDate - lag(PostDate))
# Add department information
group_list <- dat %>%
group_by(EhKey) %>%
summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')
# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
# Exploration 1: How many comeback and Regular Trip?
all_duration_plot<-
all_duration %>%
group_by(comeback) %>%
summarise(number = n())
At this step, after some individual explorations, we summarise the total number of comebacks and regular trips to analyse their difference and move further towards our objective of ultimately reducing comebacks.
ggplot(data=all_duration_plot[!is.na(all_duration_plot$comeback),], aes(x=comeback, y=number)) +
geom_bar(stat="identity",width=0.2,fill = c("#c70000","#00c700"))+
geom_text(aes(label=number), position=position_dodge(width=0.9), vjust=-0.25)+
labs(
title = "How many comeback and Regular Trip totally?",
y="Total number")
At this step we compute the average cost of comebacks and regular trips to analyse their difference, so that later on we can do a detailed cost-benefit analysis.
# Exploration 3: Average Cost different between comeback and Regular Trip
all_duration_plot<-
all_duration %>%
group_by(comeback) %>%
summarise(Cost = mean(Totalcost))
ggplot(data=all_duration_plot[!is.na(all_duration_plot$comeback),], aes(x=comeback, y=round(Cost, digits = 2))) +
geom_bar(stat="identity",width=0.2,fill = c("#c70000","#00c700"))+
geom_text(aes(label=round(Cost, digits = 2)), position=position_dodge(width=0.9), vjust=-0.25)+
labs(
title = "How much money cost for each comeback/Regular Trip?",
y="Average cost" )
This exercise covers all of the data and examines the total cost to the city within a period of time. This helps the city see which months or periods were more cost efficient and which were not.
# Calculate the time since last maintenance of each vehicle and each part
all_duration <- dat %>%
group_by(EhKey,RGDESC,WHKey) %>%
summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
arrange(EhKey,RGDESC,PostDate)%>%
mutate(lagTime = PostDate - lag(PostDate))
# Add department information
group_list <- dat %>%
group_by(EhKey) %>%
summarise(EqcDesc = first(EqcDesc), EPCycleLength=first(EPCycleLength), Make=first(Make),Model=first(Model),CycleDESC =first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')
# Interactive - Cost's change accross time
all_duration_try <- all_duration %>%
group_by(PostDate) %>%
summarise(Totalcost_byDay = sum(Totalcost))
library(dygraphs)
library(zoo)
library(xts)
my_weekday_date <- all_duration_try %>% dplyr::select(PostDate) %>% pull
my_weekday_view <- all_duration_try %>% dplyr::select(Totalcost_byDay) %>% pull
my_zoo_weekday <- zoo(my_weekday_view, my_weekday_date)
my_zoo_weekday_xts <- as.xts(my_zoo_weekday, order.by = my_weekday_date)
dygraph(my_zoo_weekday_xts, main = "Cost of each day") %>%
dyRangeSelector(dateWindow = c("2017-01-01", "2018-05-17"))
Here we have a scatter plot that shows total cost and lag time for a vehicle type we can select, which is helpful for quickly scrolling through and comparing vehicles. We also have a distribution plot that shows the lag time for a selectable vehicle type, likewise useful for quick comparison.
# Interactive - Scatterplot of cost vs duration by all groups
library(plotly)
# Build one dropdown button per unique value of a column, instead of listing all 21 by hand
make_buttons <- function(values) {
  lapply(values, function(v) {
    list(method = "restyle",
         args = list("transforms[0].value", v),
         label = v)
  })
}
eqc_values <- unique(all_duration$EqcDesc)
p <- sample_n(all_duration, 10000) %>%
  plot_ly(
    type = 'scatter', x = ~lagTime, y = ~Totalcost, text = ~RGDESC,
    hoverinfo = 'text',
    mode = 'markers',
    transforms = list(
      list(
        type = 'filter',
        target = ~EqcDesc,
        operation = '=',
        value = eqc_values[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = make_buttons(eqc_values)
        )
      )
    )
# Same scatterplot with red markers
p2 <- sample_n(all_duration, 10000) %>%
  plot_ly(
    type = 'scatter', x = ~lagTime, y = ~Totalcost, text = ~RGDESC,
    hoverinfo = 'text',
    mode = 'markers',
    marker = list(color = 'rgba(255, 100, 100,0.9)'),
    transforms = list(
      list(
        type = 'filter',
        target = ~EqcDesc,
        operation = '=',
        value = eqc_values[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = make_buttons(eqc_values)
        )
      )
    )
p
# Duration distribution by repair part type
rg_values <- unique(all_duration$RGDESC)
p3 <- all_duration %>%
  filter(lagTime > 5, EqcDesc == "FIRE DEPARTMENT") %>%
  plot_ly(
    type = 'histogram', x = ~lagTime,
    transforms = list(
      list(
        type = 'filter',
        target = ~RGDESC,
        operation = '=',
        value = rg_values[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = make_buttons(rg_values)
        )
      )
    )
# Same histogram with red bars
p4 <- all_duration %>%
  filter(lagTime > 5, EqcDesc == "FIRE DEPARTMENT") %>%
  plot_ly(
    type = 'histogram', x = ~lagTime,
    marker = list(color = 'rgba(255, 100, 100,0.9)'),
    transforms = list(
      list(
        type = 'filter',
        target = ~RGDESC,
        operation = '=',
        value = rg_values[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = make_buttons(rg_values)
        )
      )
    )
p3
Here we move our analysis to each department. In the graphs below we see how many trips happen before the warranty period ends and how many live past it, which helps us quickly visualise and analyse the data: all trips to the left of the EPCycle line are comebacks, and all to the right are not.
p1 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "FIRE DEPARTMENT"))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 180), colour = "#FA7800",size=1)+
geom_text(aes(x=180, label="EPCycle 6 months", y=1000), colour="blue")+
labs(
title = " Fire vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
p2<-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 1,EqcDesc == "POLICE DEPARTMENT"))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 30), colour = "#FA7800",size=1)+
geom_text(aes(x=30, label="EPCycle 1 months", y=9200), colour="blue")+
labs(
title = " Police vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
p3 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365,EPCycleLength == 3, EqcDesc == "DPW - Solid Waste"))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 90), colour = "#FA7800",size=1)+
geom_text(aes(x=90, label="EPCycle 3 months", y=3500), colour="blue")+
labs(
title = "Solid Waste vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
p4 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365,EPCycleLength == 4, EqcDesc == "DPW Water & Wastewater"))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 120), colour = "#FA7800",size=1)+
geom_text(aes(x=120, label="EPCycle 4 months", y=2400), colour="blue")+
labs(
title = " Water & Wastewater vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
p5 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 1,EqcDesc == "Light Duty Emergency"))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 30), colour = "#FA7800",size=1)+
geom_text(aes(x=30, label="EPCycle 1 months", y=3000), colour="blue")+
labs(
title = " Light Duty Emergency vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
p6 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EqcDesc == "Light Duty Non-Emergency",EPCycleLength == 4))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 120), colour = "#FA7800",size=1)+
geom_text(aes(x=120, label="EPCycle 4 months", y=550), colour="blue")+
labs(
title = " Light Duty Non-Emergency vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
grid.arrange(p1, p2,p3,p4,p5,p6, nrow = 2, ncol = 3 )
Furthering our analysis by department: now that we understand what a comeback and a regular trip are, we show the total number of each by department type below.
# Re-derive the duration table (same preparation as above): time since last
# maintenance of each vehicle and each part
all_duration <- dat %>%
  group_by(EhKey,RGDESC,WHKey) %>%
  summarise(Totalcost = sum(BhLineChr), PostDate = min(PostDate)) %>%
  arrange(EhKey,RGDESC,PostDate) %>%
  mutate(lagTime = PostDate - lag(PostDate))
# Add department information
group_list <- dat %>%
  group_by(EhKey) %>%
  summarise(EqcDesc = first(EqcDesc), EPCycleLength = first(EPCycleLength),
            Make = first(Make), Model = first(Model), CycleDESC = first(CycleDESC))
all_duration <- right_join(group_list, all_duration, by = 'EhKey')
# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"] <- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"] <- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"] <- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength * all_duration$unit
# identify comeback
all_duration$comeback[all_duration$lagTime < all_duration$EPCycleLength_day] <- "Comeback"
all_duration$comeback[all_duration$lagTime >= all_duration$EPCycleLength_day] <- "Regular Trip"
p1 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "FIRE DEPARTMENT"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "Fire Department Comeback Data",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
p2 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 1, EqcDesc == "POLICE DEPARTMENT"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "Police Department Comeback Data",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
p3 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 3, EqcDesc == "DPW - Solid Waste"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "DPW - Solid Waste Comeback Data",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
p4 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "DPW Water & Wastewater"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "DPW Water & Wastewater",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
p5 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "Light Duty Emergency"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "Light Duty Emergency",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
p6 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "Light Duty Non-Emergency"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "Light Duty Non-Emergency Comeback Data",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
grid.arrange(p1, p2,p3,p4,p5,p6, nrow = 2, ncol = 3 )
These are the steps for preparing the data for feature engineering, an essential preparatory step.
#======Data Preparation======
# Load Data
# Rename a column in between sheets 4 and 5 because it's spelled wrong in all but sheet 5
# Turn PostDate into a lubridate
# Calculate the time since last maintenance of each vehicle and each part
all_duration <- dat %>%
group_by(EhKey,RGDESC,WHKey) %>%
summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
arrange(EhKey,RGDESC,PostDate)%>%
mutate(lagTime = PostDate - lag(PostDate))%>%
mutate(leadTime = lead(PostDate)-PostDate)
# Add department information
group_list <- dat %>%
group_by(EhKey) %>%
summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')
# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit
# identify comeback
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
all_duration$comeback_next[all_duration$leadTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback_next[all_duration$leadTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
Here we perform feature engineering, creating valuable features that will help generate an intelligent predictive model. This is also seen in the table below:
# Rolling cost: total spend in the 7 / 30 / 182 days before each trip
df_1 <- all_duration %>%
  arrange(EhKey) %>%
  mutate(date = ymd(PostDate)) %>%
  arrange(PostDate)
library(tidyr)
# Sum Totalcost over the `window` days preceding row i (applied per vehicle below)
rollsumbyperiod <- function(i, window, df){
  interval <- seq(df$date[i] - window, df$date[i] - 1, 1)
  sum(df$Totalcost[df$date %in% interval])
}
roll_sum_window <- function(df, window){
  out <- sapply(seq_along(df$date), function(m) rollsumbyperiod(m, window, df))
  out[is.nan(out)] <- 0
  out
}
df_2 <- df_1 %>%
  arrange(EhKey, PostDate) %>%
  nest(-EhKey) %>%
  group_by(EhKey) %>%
  mutate(past1week_cost  = purrr::map(data, roll_sum_window, window = 7),
         past1month_cost = purrr::map(data, roll_sum_window, window = 30),
         past6month_cost = purrr::map(data, roll_sum_window, window = 182)) %>%
  unnest()
# Rolling mean cost over the same windows
rollmeanbyperiod <- function(i, window, df){
  interval <- seq(df$date[i] - window, df$date[i] - 1, 1)
  mean(df$Totalcost[df$date %in% interval])
}
roll_mean_window <- function(df, window){
  out <- sapply(seq_along(df$date), function(m) rollmeanbyperiod(m, window, df))
  out[is.nan(out)] <- 0
  out
}
df_2 <- df_2 %>%
  arrange(EhKey, PostDate) %>%
  nest(-EhKey) %>%
  group_by(EhKey) %>%
  mutate(past1week_ave_cost  = purrr::map(data, roll_mean_window, window = 7),
         past1month_ave_cost = purrr::map(data, roll_mean_window, window = 30),
         past6month_ave_cost = purrr::map(data, roll_mean_window, window = 182)) %>%
  unnest()
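# Dividing each rolling total cost by the matching rolling average cost
# recovers the number of trips in that window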
df_2$past1week_count = df_2$past1week_cost/df_2$past1week_ave_cost
df_2$past1month_count = df_2$past1month_cost/df_2$past1month_ave_cost
df_2$past6month_count = df_2$past6month_cost/df_2$past6month_ave_cost
df_2[["past1week_count"]][is.na(df_2[["past1week_count"]])] <- 0
df_2[["past1month_count"]][is.na(df_2[["past1month_count"]])] <- 0
df_2[["past6month_count"]][is.na(df_2[["past6month_count"]])] <- 0
# Rolling comeback counts: recode the comeback flags to 0/1, then count
# comebacks in the 7 / 30 / 182 days before each trip
df_2$comeback_next_10=0
df_2$comeback_10=0
df_2 <- within(df_2, comeback_next_10[comeback_next=='Comeback'] <- 1)
df_2 <- within(df_2, comeback_10[comeback=='Comeback'] <- 1)
rollcomebackbyperiod <- function(i, window, df){
  interval <- seq(df$date[i] - window, df$date[i] - 1, 1)
  sum(df$comeback_10[df$date %in% interval])
}
roll_comeback_window <- function(df, window){
  out <- sapply(seq_along(df$date), function(m) rollcomebackbyperiod(m, window, df))
  out[is.nan(out)] <- 0
  out
}
df_2 <- df_2 %>%
  arrange(EhKey, PostDate) %>%
  nest(-EhKey) %>%
  group_by(EhKey) %>%
  mutate(past1week_comeback  = purrr::map(data, roll_comeback_window, window = 7),
         past1month_comeback = purrr::map(data, roll_comeback_window, window = 30),
         past6month_comeback = purrr::map(data, roll_comeback_window, window = 182)) %>%
  unnest()
df_2$same_part_lasttime = df_2$lagTime
# comeback rate of each model and make
model_rate<-df_2%>%
group_by(Model)%>%
summarise(comeback_rate_model = mean(comeback_10))
df_2 = left_join(df_2, model_rate, by = "Model")
make_rate<-df_2%>%
group_by(Make)%>%
summarise(comeback_rate_make = mean(comeback_10))
df_2 = left_join(df_2, make_rate, by = "Make")
# monthly department-average comeback rate
df_2$year = format(df_2$date,'%Y')
df_2$month = format(df_2$date,'%m')
dpt_month_rate<-df_2%>%
group_by(EqcDesc,year,month)%>%
summarise(comeback_rate_dpt_ym = mean(comeback_10))
df_2 = left_join(df_2, dpt_month_rate, by = c("EqcDesc","year","month"))
# yearly department-average comeback rate
dpt_month_rate<-df_2%>%
group_by(EqcDesc,year)%>%
summarise(comeback_rate_dpt_y = mean(comeback_10))
df_2 = left_join(df_2, dpt_month_rate, by = c("EqcDesc","year"))
# comeback rate of each system of each type of vehicle
RGDESC_model_rate<-df_2%>%
group_by(RGDESC,Model)%>%
summarise(comeback_rate_RGDESC = mean(comeback_10))
df_2 = left_join(df_2, RGDESC_model_rate, by = c("RGDESC","Model"))
# === Categorical Creation ===
df_backup = df_2
#df_2 = df_backup
# Department as a categorical variable (big/small)
dpt_n<-df_2%>%
group_by(EqcDesc)%>%
summarise(number_of_vehicle = n())
df_2 = left_join(df_2, dpt_n, by = c("EqcDesc"))
df_2 <- df_2 %>%
mutate(dpt_big = case_when(number_of_vehicle>=10000 ~ "big department",
number_of_vehicle<10000 ~ "small department"))
# Department as a categorical variable (H, M , L, VL)
dpt_rate<-df_2%>%
group_by(EqcDesc)%>%
summarise(comeback_rate_dpt = mean(comeback_10))
df_2 = left_join(df_2, dpt_rate, by = c("EqcDesc"))
df_2 <- df_2 %>%
mutate(dpt_risk = case_when(comeback_rate_dpt>=0.5 ~ "High risk",
comeback_rate_dpt>=0.25 ~ "Medium risk",
comeback_rate_dpt>=0.15 ~ "Low risk",
TRUE ~ "Very Low risk"))
# Make as a categorical variable (ex. risk score 0-5)
df_2 <- df_2 %>%
mutate(make_risk = case_when(comeback_rate_make>=0.5 ~ "High risk",
comeback_rate_make>=0.25 ~ "Medium risk",
comeback_rate_make>=0.15 ~ "Low risk",
TRUE ~ "Very Low risk"))
# Model as a categorical variable (ex. risk score 0-5)
df_2 <- df_2 %>%
mutate(model_risk = case_when(comeback_rate_model>=0.8 ~ "Very High risk",
comeback_rate_model>=0.6 ~ "High risk",
comeback_rate_model>=0.3 ~ "Medium risk",
comeback_rate_model>=0.15 ~ "Low risk",
TRUE ~ "Very Low risk"))
# RGDesc as a categorical variable (1-10)
df_2 <- df_2 %>%
mutate(part_risk = case_when(comeback_rate_RGDESC>=0.8 ~ "Very High risk",
comeback_rate_RGDESC>=0.5 ~ "High risk",
comeback_rate_RGDESC>=0.2 ~ "Medium risk",
comeback_rate_RGDESC>0 ~ "Low risk",
TRUE ~ "Very Low risk"))
# Workshop: attach the workshop for each trip first, then compute its comeback rate
workshop_list <- dat %>%
group_by(WHKey) %>%
summarise(WHShop = first(WHShop))
df_2 <- left_join(df_2,workshop_list,by='WHKey')
Shop_rate<-df_2%>%
group_by(WHShop)%>%
summarise(comeback_rate_shop = mean(comeback_10))
df_2 = left_join(df_2, Shop_rate, by = c("WHShop"))
df_2 <- df_2 %>%
mutate(Shop_risk = case_when(comeback_rate_shop>=0.7 ~ "High risk",
comeback_rate_shop>=0.5 ~ "Medium risk",
comeback_rate_shop>=0.25 ~ "Low risk",
TRUE ~ "Very Low risk"))
# month
df_2 <- df_2 %>%
mutate(month_cat = case_when(month=="04" ~ "April",
month=="05" ~ "May",
month=="06" ~ "June",
TRUE ~ "else"))
RG_list <- dat %>%
group_by(WHKey) %>%
summarise(N_RG = n())
df_2 <- left_join(df_2,RG_list,by='WHKey')
The correlation matrix is shown below. For the prediction models in the next step, we should not select pairs of features that are strongly correlated; we keep only features whose pairwise Pearson correlation is smaller than 0.8.
library(corrplot)
df_Cor <-
df_2 %>%
dplyr::select(Totalcost,past1week_cost,past1month_cost, past6month_cost,past1week_count,past1month_count,past6month_count,past1week_ave_cost, past1month_ave_cost,past6month_ave_cost,past1week_comeback,past1month_comeback,past6month_comeback, comeback_rate_model,comeback_rate_make,comeback_rate_dpt_ym,comeback_rate_dpt_y,comeback_rate_dpt,comeback_rate_RGDESC,comeback_rate_shop)
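# df_2 is grouped by EhKey, so select() carries the grouping column along;
# drop it before computing correlations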
df_Cor <- subset(df_Cor, select=-c(EhKey))
M <- cor(df_Cor,use="pairwise.complete.obs")
corrplot(M, method = "square",tl.cex = 0.5)
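As a convenience, the same 0.8 cutoff can be checked programmatically with caret (already loaded). This is a supplementary sketch under the assumption that the correlation matrix has no missing entries; it is not part of the original selection, which was done by reading the matrix.

# Names of features that findCorrelation suggests dropping at the 0.8 cutoff
high_cor <- caret::findCorrelation(M, cutoff = 0.8, names = TRUE)
high_cor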
The main purpose of this model is to predict whether the next trip is a comeback or not. Thus, the logistic regression model seems to be a good choice: it can predict the probability that an observation belongs to a particular group, such as Comeback or Regular Trip. Before training the model, we first separate the dataset into a training set and a test set. Below are the results of the prediction model. We can see that whether the current trip is a comeback and the comeback rate of the vehicle system are the most significant predictors.
options(scipen=10000000)
library(tidyverse)
library(caret)
library(knitr)
library(pscl)
library(plotROC)
library(pROC)
palette5 <- c("#3F6973","#49A695","#F2CD88","#F2A679","#D9725B")
palette4 <- c("#00c700","#0060C7","#c70000","#DEA60B")
palette3 <- c("#00c700","#DEA60B","#c70000")
palette2 <- c("#00c700","#c70000")
df_model <- na.omit(df_2)
df_model <- subset(df_model, select=-c(EhKey,WHKey,PostDate,unit,comeback,comeback_next,RGDESC,Model))
samp <- sample(nrow(df_model),20000)
df_model_sample<- df_model[samp,]
df_model_sample_1<-df_model_sample
set.seed(11233)
trainIndex <- createDataPartition(df_model_sample$comeback_next_10, p = .70,
list = FALSE,
times = 1)
df_Train <- df_model_sample[ trainIndex,]
df_Test <- df_model_sample[-trainIndex,]
new_predictor <-c("comeback_next_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength_day","comeback_10")
Logit_Model_1 <- glm(comeback_next_10 ~ .,
data=df_Train %>% dplyr::select(new_predictor),
family="binomial" (link="logit"))
testProbs_0 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
Probs_1 = predict(Logit_Model_1, df_Test, type= "response"))
testProbs_0 <- testProbs_0%>%
mutate(predOutcome_1 = as.factor(ifelse(testProbs_0$Probs_1 > 0.5 , 1, 0)))
cm_1 <-confusionMatrix(testProbs_0$predOutcome_1, testProbs_0$Outcome,positive = "1")
summary(Logit_Model_1)
##
## Call:
## glm(formula = comeback_next_10 ~ ., family = binomial(link = "logit"),
## data = df_Train %>% dplyr::select(new_predictor))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4960 -0.8208 -0.5066 0.8416 2.2100
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.39145893 0.12907065 -18.528 < 0.0000000000000002 ***
## comeback_rate_RGDESC 4.49741627 0.11915269 37.745 < 0.0000000000000002 ***
## past1week_comeback 0.07323652 0.03727047 1.965 0.0494 *
## past1week_count -0.02903997 0.01718635 -1.690 0.0911 .
## past6month_ave_cost -0.00001241 0.00001254 -0.990 0.3224
## comeback_rate_shop 0.25015478 0.22090930 1.132 0.2575
## year2015 -0.11462763 0.09629985 -1.190 0.2339
## year2016 -0.17703553 0.09618185 -1.841 0.0657 .
## year2017 -0.12350266 0.09607567 -1.285 0.1986
## year2018 0.38964065 0.09779777 3.984 0.000067723030075 ***
## year2019 2.31234258 0.20844934 11.093 < 0.0000000000000002 ***
## month02 -0.00974146 0.09416247 -0.103 0.9176
## month03 -0.21190666 0.09516550 -2.227 0.0260 *
## month04 -0.21299367 0.09977696 -2.135 0.0328 *
## month05 -0.09052411 0.09837170 -0.920 0.3575
## month06 -0.02178538 0.09724580 -0.224 0.8227
## month07 0.04344010 0.09957893 0.436 0.6627
## month08 -0.03400486 0.09560648 -0.356 0.7221
## month09 0.11187255 0.09550588 1.171 0.2415
## month10 -0.12491258 0.09651563 -1.294 0.1956
## month11 0.17711476 0.09747132 1.817 0.0692 .
## month12 0.11231099 0.09766530 1.150 0.2502
## EPCycleLength_day 0.00317469 0.00042179 7.527 0.000000000000052 ***
## comeback_10 0.43751845 0.04503216 9.716 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19300 on 13999 degrees of freedom
## Residual deviance: 14700 on 13976 degrees of freedom
## AIC: 14748
##
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2710 1065
## 1 577 1648
##
## Accuracy : 0.7263
## 95% CI : (0.7149, 0.7376)
## No Information Rate : 0.5478
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4388
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.6074
## Specificity : 0.8245
## Pos Pred Value : 0.7407
## Neg Pred Value : 0.7179
## Prevalence : 0.4522
## Detection Rate : 0.2747
## Detection Prevalence : 0.3708
## Balanced Accuracy : 0.7160
##
## 'Positive' Class : 1
##
In general, an ensemble model is a technique for combining two or more algorithms, producing a more robust prediction model that incorporates the predictions of all the base machine learning models. In this project, we use a weighted-average logistic regression ensemble: different weights are applied to the predictions of the different logit models, and we take the weighted average of the models' outputs.
new_predictor <-c("comeback_next_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength_day","CycleDESC","N_RG")
Logit_Model_1 <- glm(comeback_next_10 ~ .,
data=df_Train %>% dplyr::select(new_predictor),
family="binomial" (link="logit"))
testProbs <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
Probs_1 = predict(Logit_Model_1, df_Test, type= "response"))
testProbs <- testProbs%>%
mutate(predOutcome_1 = as.factor(ifelse(testProbs$Probs_1 > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs$predOutcome_1, testProbs$Outcome,positive = "1")
test_predictor <-c("comeback_next_10","comeback_10")
Logit_Model_2 <- glm(comeback_next_10 ~ .,
data=df_Train %>% dplyr::select(test_predictor),
family="binomial" (link="logit"))
testProbs_2 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
Probs = predict(Logit_Model_2, df_Test, type= "response"))
testProbs_2 <- testProbs_2%>%
mutate(predOutcome = as.factor(ifelse(testProbs_2$Probs > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs_2$predOutcome, testProbs_2$Outcome,positive = "1")
testProbs$Probs_2 = testProbs_2$Probs
testProbs$predOutcome_2 = testProbs_2$predOutcome
new_predictor <-c("comeback_next_10","dpt_big","dpt_risk","make_risk","model_risk","part_risk","Shop_risk")
Logit_Model_new <- glm(comeback_next_10 ~ .,
data=df_Train %>% dplyr::select(new_predictor),
family="binomial" (link="logit"))
testProbs_3 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
Probs = predict(Logit_Model_new, df_Test, type= "response"))
testProbs_3 <- testProbs_3%>%
mutate(predOutcome = as.factor(ifelse(testProbs_3$Probs > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs_3$predOutcome, testProbs_3$Outcome,positive = "1")
testProbs$Probs_3 = testProbs_3$Probs
testProbs$predOutcome_3 = testProbs_3$predOutcome
testProbs$Probs_Ensemble = (0.9*testProbs$Probs_1+0.09*testProbs$Probs_2+0.01*testProbs$Probs_3)
testProbs <- testProbs%>%
mutate(predOutcome_Ensemble = as.factor(ifelse(testProbs$Probs_Ensemble > 0.5 , 1, 0)))
cm_2 <-confusionMatrix(testProbs$predOutcome_Ensemble, testProbs$Outcome,positive = "1")
cm_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2720 1098
## 1 567 1615
##
## Accuracy : 0.7225
## 95% CI : (0.711, 0.7338)
## No Information Rate : 0.5478
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4301
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.5953
## Specificity : 0.8275
## Pos Pred Value : 0.7401
## Neg Pred Value : 0.7124
## Prevalence : 0.4522
## Detection Rate : 0.2692
## Detection Prevalence : 0.3637
## Balanced Accuracy : 0.7114
##
## 'Positive' Class : 1
##
Random forest is also an ensemble learning method, one that combines a multitude of decision trees. Here we fit a Random Forest specifically:
library(randomForest)
df_Train_fac=df_Train %>% mutate_if(is.character, as.factor)
df_Test_fac=df_Test %>% mutate_if(is.character, as.factor)
new_predictor <-c("comeback_next_10","comeback_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength","CycleDESC","N_RG")
RF_Model_1 <- randomForest(comeback_next_10 ~ .,
data=df_Train_fac %>% dplyr::select(new_predictor),
norm.votes = TRUE, proximity = TRUE)
testProbs_RF <- data.frame(Outcome = as.factor(df_Test_fac$comeback_next_10),
Probs = predict(RF_Model_1, df_Test_fac, type= "response"))
testProbs_RF <- testProbs_RF%>%
mutate(predOutcome = as.factor(ifelse(testProbs_RF$Probs > 0.5 , 1, 0)))
cm_3 <-confusionMatrix(testProbs_RF$predOutcome, testProbs_RF$Outcome,positive = "1")
cm_3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2683 1052
## 1 604 1661
##
## Accuracy : 0.724
## 95% CI : (0.7125, 0.7353)
## No Information Rate : 0.5478
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4348
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.6122
## Specificity : 0.8162
## Pos Pred Value : 0.7333
## Neg Pred Value : 0.7183
## Prevalence : 0.4522
## Detection Rate : 0.2768
## Detection Prevalence : 0.3775
## Balanced Accuracy : 0.7142
##
## 'Positive' Class : 1
##
XGBoost is an optimized distributed gradient boosting library; the algorithm is implemented under the gradient boosting framework. It uses parallel tree boosting, which helps it address this kind of data science problem quickly and accurately.
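The chunk that trains `xgb_Model_1` and defines `all_predictor_xgb` is not shown above. A minimal sketch of how it could look follows; the predictor selection and the parameters here are our assumptions, not the authors' exact settings.

# Hypothetical training sketch: xgboost needs an all-numeric matrix, so we
# assume all_predictor_xgb holds the numeric predictor columns
library(xgboost)
all_predictor_xgb <- df_Train %>%
  dplyr::select(where(is.numeric), -comeback_next_10) %>%
  names()
xgb_Model_1 <- xgboost(
  data = as.matrix(df_Train %>% dplyr::select(all_of(all_predictor_xgb))),
  label = df_Train$comeback_next_10,
  objective = "binary:logistic",  # outputs probabilities, thresholded at 0.5 below
  nrounds = 100,                  # assumed; the actual round count is not shown
  verbose = 0
)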
testProbs_xgb <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
Probs = predict(xgb_Model_1, as.matrix(df_Test %>% dplyr::select(all_predictor_xgb))))
testProbs_xgb <- testProbs_xgb%>%
mutate(predOutcome = as.factor(ifelse(testProbs_xgb$Probs > 0.5 , 1, 0)))
cm_4 <-confusionMatrix(testProbs_xgb$predOutcome, testProbs_xgb$Outcome,positive = "1")
cm_4
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2432 1047
## 1 855 1666
##
## Accuracy : 0.683
## 95% CI : (0.6711, 0.6948)
## No Information Rate : 0.5478
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.3562
##
## Mcnemar's Test P-Value : 0.00001189
##
## Sensitivity : 0.6141
## Specificity : 0.7399
## Pos Pred Value : 0.6608
## Neg Pred Value : 0.6991
## Prevalence : 0.4522
## Detection Rate : 0.2777
## Detection Prevalence : 0.4202
## Balanced Accuracy : 0.6770
##
## 'Positive' Class : 1
##
The table below compares the different models. Logistic regression has the highest accuracy and XGBoost the highest sensitivity, but the Random Forest model offers the best balance, combining near-top accuracy with strong sensitivity, though its specificity is not the best. Why sensitivity is the most important index will be explained later.
library(kableExtra)
model_table <- data.frame(Model = c("Logistic Regression","Ensemble Logistic","Random Forest","XGBoost"),
Accuracy=c(cm_1$overall['Accuracy'],cm_2$overall['Accuracy'],cm_3$overall['Accuracy'],cm_4$overall['Accuracy']),
Sensitivity =c(cm_1$byClass['Sensitivity'],cm_2$byClass['Sensitivity'],cm_3$byClass['Sensitivity'],cm_4$byClass['Sensitivity']),
Specificity =c(cm_1$byClass['Specificity'],cm_2$byClass['Specificity'],cm_3$byClass['Specificity'],cm_4$byClass['Specificity']))
model_table %>%
gather("Type", "Value",-Model) %>%
ggplot(aes(Model, Value, fill = Type)) +
labs(title = "Model Comparison") +
scale_fill_manual(values = palette3) +
geom_bar(position = "dodge", stat = "identity") +
theme_bw()
| Model | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| Logistic Regression | 0.7263333 | 0.6074456 | 0.8244600 |
| Ensemble Logistic | 0.7225000 | 0.5952820 | 0.8275023 |
| Random Forest | 0.7240000 | 0.6122374 | 0.8162458 |
| XGBoost | 0.6830000 | 0.6140804 | 0.7398844 |
The plot below shows the distribution of predicted probabilities for each observed outcome, comeback and regular trip, shown as 1 and 0 respectively. We can see that the model performs better when predicting regular trips than when predicting comebacks.
# ggplot(testProbs, aes(x = Probs_Ensemble, fill = as.factor(Outcome))) +
# geom_density() +
# facet_grid(Outcome ~ .) +
# scale_fill_manual(values = palette2) +
# labs(x = "Churn", y = "Density of probabilities",
# title = "Distribution of predicted probabilities by observed outcome") +
# theme(strip.text.x = element_text(size = 18),
# legend.position = "none")
ggplot(testProbs_RF, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
scale_fill_manual(values = palette2) +
labs(x = "Predicted probability of comeback", y = "Density of probabilities",
title = "Distribution of predicted probabilities by observed outcome") +
theme(strip.text.x = element_text(size = 18),
legend.position = "none")
If we set 0.5 as the threshold, a trip is classified as a comeback whenever its predicted probability exceeds 0.5. There were 1,661 true positives, instances where an observed comeback was correctly predicted as a comeback. There were 604 false positives, instances where a regular trip was incorrectly predicted as a comeback. There were 2,683 true negatives, instances where an observed regular trip was correctly predicted. Finally, there were 1,052 false negatives, instances where a comeback was incorrectly predicted as a regular trip.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2683 1052
## 1 604 1661
##
## Accuracy : 0.724
## 95% CI : (0.7125, 0.7353)
## No Information Rate : 0.5478
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4348
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.6122
## Specificity : 0.8162
## Pos Pred Value : 0.7333
## Neg Pred Value : 0.7183
## Prevalence : 0.4522
## Detection Rate : 0.2768
## Detection Prevalence : 0.3775
## Balanced Accuracy : 0.7142
##
## 'Positive' Class : 1
##
From the result above, we can also see that the Sensitivity and Specificity of this model are 61.2% and 81.6%, and the Accuracy is 72.4%.
We can see from the ROC curve that the curve sits well above the diagonal, which means the model makes a meaningful prediction. The Area Under the Curve metric, or AUC, of this model is 0.7965, so its ranking of trips is substantially better than chance.
# ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs_Ensemble)) +
# geom_roc(n.cuts = 50, labels = FALSE, colour = "#D9725B") +
# style_roc(theme = theme_grey) +
# geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
# labs(title = "ROC Curve")
ggplot(testProbs_RF, aes(d = as.numeric(testProbs_RF$Outcome), m = Probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "#c70000") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve")
## Area under the curve: 0.7965
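One way to read the AUC: it is, up to ties, the probability that a randomly chosen comeback receives a higher predicted probability than a randomly chosen regular trip. A minimal sketch, assuming testProbs_RF is still in memory:
# Monte-Carlo estimate of P(score of a comeback > score of a regular trip);
# it should land near the 0.7965 reported above
p_comeback <- testProbs_RF$Probs[testProbs_RF$Outcome == 1]
p_regular <- testProbs_RF$Probs[testProbs_RF$Outcome == 0]
mean(sample(p_comeback, 100000, replace = TRUE) > sample(p_regular, 100000, replace = TRUE))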
For model generalizability, we mainly focus on accuracy differences across departments and vehicle systems. From the results, we can see that our model generalizes well: accuracy varies little across departments, is about the same across vehicle systems, and in every case stays close to the overall accuracy of the model.
df_model_1 <- na.omit(df_2)
df_model_sample_1<- df_model_1[samp,]
df_Test_1 <- df_model_sample_1[-trainIndex,]
df_G <- df_Test_1
df_G$Outcome = testProbs_RF$Outcome
df_G$Probs = testProbs_RF$Probs
df_G$predOutcome = testProbs_RF$predOutcome
df_G$correct_10 = (df_G$predOutcome == df_G$Outcome)
df_G_n_dpt <- df_G %>%
group_by(EqcDesc) %>%
summarise(number=n(),Correct = sum(correct_10=="TRUE"),Accuracy = Correct/number)%>%
arrange(-number)
df_G_n_dpt=df_G_n_dpt[1:6,]
p1 <- ggplot(data=df_G_n_dpt, aes(x=EqcDesc, y=Accuracy)) +
geom_bar(stat="identity", fill="#00c700",position = 'dodge',color="#00c700")+
scale_y_continuous(labels = scales::percent)+
labs(x = 'Top 6 Biggest Department', y = 'Accuracy', title = "Model's Generalizability - Department")+
theme(axis.text.x = element_text(angle = 20, hjust = 1))
df_G_n_part <- df_G %>%
group_by(RGDESC) %>%
summarise(number=n(),Correct = sum(correct_10=="TRUE"),Accuracy = Correct/number)%>%
arrange(-number)
df_G_n_part=df_G_n_part[1:6,]
p2 <- ggplot(data=df_G_n_part, aes(x=RGDESC, y=Accuracy)) +
geom_bar(stat="identity", fill="#00c700",position = 'dodge',color="#00c700")+
scale_y_continuous(labels = scales::percent)+
labs(x = 'Top 6 Most Repaired Vehicle System', y = 'Accuracy', title = "Model's Generalizability - System")+
theme(axis.text.x = element_text(angle = 20, hjust = 1))
grid.arrange(p1, p2, nrow = 1, ncol = 2 )
If we can correctly predict whether the next trip will be a comeback, we can help the government save a lot of money. Specifically, if we know the next trip will be a comeback, we can ask for the vehicle to come to the workshop before the warranty period ends, so that it does not break down on the road. If we attach a cost to each type of result in the confusion matrix, we can calculate how much money is saved or lost in total when the repair schedule is changed based on our prediction. Let's assume that whenever we predict the next trip to be a comeback, we spend $30 on an additional inspection, which detects the hidden danger and keeps the vehicle from coming back before the warranty. Based on the statistics above, we also assume that a comeback trip costs $600 and a regular trip costs $470. The total cost is then:
cost_benefit_table <-
testProbs_RF %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
True_Positive = sum(n[predOutcome==1 & Outcome==1]),
False_Negative = sum(n[predOutcome==0 & Outcome==1]),
False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
gather(Variable, Count) %>%
bind_cols(data.frame(Each_Trip_Cost = c(
"470","30+470","600","30+470")))%>%
mutate(Cost =
ifelse(Variable == "True_Negative", Count * 470,
ifelse(Variable == "True_Positive", Count * (30+470),
ifelse(Variable == "False_Negative",Count * 600,
ifelse(Variable == "False_Positive",Count * (30+470), 0)))))%>%
bind_cols(data.frame(Description = c(
"We predicted next repair trip is not come back and it didn't comeback",
"We predicted next repair trip is come back and it cameback next trip",
"We predicted next repair trip is not come back but it cameback next trip",
"We predicted next repair trip is come back and it didn't comeback")))
kable(cost_benefit_table, align = "l",
caption = "Cost/Benefit Table") %>% kable_styling()
Variable | Count | Each_Trip_Cost | Cost | Description |
---|---|---|---|---|
True_Negative | 2683 | 470 | 1261010 | We predicted next repair trip is not come back and it didn’t comeback |
True_Positive | 1661 | 30+470 | 830500 | We predicted next repair trip is come back and it cameback next trip |
False_Negative | 1052 | 600 | 631200 | We predicted next repair trip is not come back but it cameback next trip |
False_Positive | 604 | 30+470 | 302000 | We predicted next repair trip is come back and it didn’t comeback |
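As a quick arithmetic check, summing the four cost rows reproduces the total cost of the 0.5-threshold policy, with the counts copied from the table above:
2683*470 + 1661*(30+470) + 1052*600 + 604*(30+470)
## [1] 3024710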
Then we can optimize the cost/benefit analysis by changing the threshold, as discussed above. The figure below plots the count of each confusion matrix metric by threshold.
iterateThresholds <- function(data) {
x = .01
all_prediction <- data.frame()
while (x <= 1) {
this_prediction <-
testProbs_RF %>%
mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
True_Positive = sum(n[predOutcome==1 & Outcome==1]),
False_Negative = sum(n[predOutcome==0 & Outcome==1]),
False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Cost =
ifelse(Variable == "True_Negative", Count * 470,
ifelse(Variable == "True_Positive", Count * (30+470),
ifelse(Variable == "False_Negative",Count * 600,
ifelse(Variable == "False_Positive",Count * (30+470), 0)))),Threshold=x)
all_prediction <- rbind(all_prediction, this_prediction)
x <- x + .01
}
return(all_prediction)
}
whichThreshold <- iterateThresholds(testProbs_RF)
whichThreshold %>%
ggplot(.,aes(Threshold, Count, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette4) +
labs(title = "Number by confusion matrix type and threshold cut off",
y="Count") +
guides(colour=guide_legend(title="Confusion Matrix"))
For each type of result we already know the cost, so we simply add them all up to get the total cost, which tells us how much money we can save at most. A threshold of 0 means we predict every trip to be a comeback; a threshold of 1 means we predict every trip to be a regular trip, which is what the department is doing right now, in other words the real case. We can see that the total cost is lowest when the threshold equals 0.28. This means that if the predicted probability of a comeback is below 0.28, we recommend no additional deep check; if it is above 0.28, we recommend a deep additional check, because the vehicle has a high probability of coming back before the warranty.
whichThreshold_revenue <-
whichThreshold %>%
group_by(Threshold) %>%
summarize(Total_Cost = sum(Cost))
whichThreshold_revenue %>%
dplyr::select(Threshold, Total_Cost) %>%
gather(Variable, Value, -Threshold) %>%
ggplot(aes(Threshold, Value, colour = Variable)) +
geom_point() +
scale_colour_manual(values = "#c70000") +
labs(title = "Total Cost by threshold cut off",
y="Total Cost")+
geom_vline(xintercept = pull(arrange(whichThreshold_revenue, Total_Cost)[1,1]))
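Rather than reading the optimal cutoff off the chart, we can pull it directly from whichThreshold_revenue; a one-line sketch:
# threshold with the lowest total cost (around 0.28 on this test set)
whichThreshold_revenue %>% arrange(Total_Cost) %>% dplyr::slice(1)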
Now let's focus on the money. This test dataset covers 6,000 trips in total, with a business-as-usual cost of $3,166,120. If mechanics instead decide whether to run the deep check based on our prediction, the total cost drops to $2,984,820, saving the Department of General Services $181,300. Around 5.72% of the total cost can be saved just with this simple prediction model.
old_cost <- max(whichThreshold_revenue$Total_Cost)
new_cost <- min(whichThreshold_revenue$Total_Cost)
save_money <- old_cost-new_cost
kable(data.frame("Business_as_Usual" = c(old_cost),"Using_our_model" = c(new_cost),"Money_Save"=c(save_money),"Percent_Save(%)"=c(save_money/old_cost*100)),
caption = "How Much We Save",align = "l") %>% kable_styling()
Business_as_Usual | Using_our_model | Money_Save | Percent_Save(%) |
---|---|---|---|
3166120 | 2984820 | 181300 | 5.726252 |
# load every package used in this report, silencing startup messages
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(sf))
suppressPackageStartupMessages(library(QuantPsyc))
suppressPackageStartupMessages(library(RSocrata))
suppressPackageStartupMessages(library(viridis))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(spatstat))
suppressPackageStartupMessages(library(spdep))
suppressPackageStartupMessages(library(FNN))
suppressPackageStartupMessages(library(grid))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(rgdal))
suppressPackageStartupMessages(library(tidycensus))
suppressPackageStartupMessages(library(rmarkdown))
suppressPackageStartupMessages(library(readxl))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(dygraphs))
suppressPackageStartupMessages(library(zoo))
suppressPackageStartupMessages(library(xts))
suppressPackageStartupMessages(library(rmdformats))
setwd("/Users/akshaynagar/Desktop/To_Org/In_Out_Temp")
dat <- read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 1) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 2)) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 3)) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 4)) %>%
rename(EPCycleLength = EPCycleLenghth) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 5)) %>%
mutate(PostDate = ymd(as.character(PostDate)))
glimpse(dat)
dat_01 <- dat %>%
group_by(EhKey) %>%
group_by(WHKey) %>%
dplyr::slice(which.min(PostDate))
dat_02 <- dat_01 %>%
mutate(PostDate = ymd(as.character(PostDate))) %>%
group_by(EhKey) %>%
arrange(PostDate) %>%
mutate(lagTime = PostDate - lag(PostDate))
ggplot(dat_02 %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 100))+
geom_histogram(aes(time), binwidth = 5)+
geom_vline(xintercept=30,
color = "#F57434", size=1)+
labs(
title = "Time since last maintenance, All data 2015-2019",
subtitle = "Max time 100 days, Bar = 5 days",
x="Days",
y="Number of records" )
fireTruck <- dat %>%
filter(EhKey == '000024') %>%
arrange(PostDate)
fireTruck_time <- fireTruck %>%
group_by(EhKey) %>%
group_by(WHKey) %>%
dplyr::slice(which.min(PostDate))%>%
mutate(PostDate = ymd(as.character(PostDate))) %>%
group_by(EhKey) %>%
arrange(PostDate) %>%
mutate(lagTime = PostDate - lag(PostDate))
fireTruck_time %>%
ggplot( aes(x=PostDate, y=lagTime)) +
geom_line() +
geom_point()+
geom_hline(yintercept=60, linetype="dashed",
color = "#F57434", size=1)+
labs(
title = "Duration between two repairment, FireTruck 000024",
x="Date",
y="Duration")
#DONT INCLUDE
fireTruck %>%
ggplot(aes(x=PostDate, y=cumsum(BhLineChr))) + geom_line() + geom_point()+
labs(
title = "Cumulative Maintenance fee, FireTruck 000024",
x="Date",
y="Dollar" )
ps1 <- dat %>%
filter(EhKey == '078983') %>%
arrange(PostDate)
ps1_time <- ps1 %>%
group_by(EhKey) %>%
group_by(WHKey) %>%
dplyr::slice(which.min(PostDate))%>%
mutate(PostDate = ymd(as.character(PostDate))) %>%
group_by(EhKey) %>%
arrange(PostDate) %>%
mutate(lagTime = PostDate - lag(PostDate))
ps1_time %>%
ggplot( aes(x=PostDate, y=lagTime)) +
geom_line() +
geom_point()+
geom_hline(yintercept=30, linetype="dashed",
color = "#F57434", size=1)+
labs(
title = "Duration between two repairment, Police Car 078983",
x="Date",
y="Duration")
Police_car <- dat %>%
filter(EhKey == '078983') %>%
arrange(PostDate)
Police_car %>%
ggplot(aes(x=PostDate, y=cumsum(BhLineChr))) + geom_line() + geom_point()+
labs(
title = "Cumulative Maintenance fee, Police_car 078983",
x="Date",
y="Dollar" )
# Calculate the time since last maintenance of each vehicle and each part
all_duration <- dat %>%
group_by(EhKey,RGDESC,WHKey) %>%
summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
arrange(EhKey,RGDESC,PostDate)%>%
mutate(lagTime = PostDate - lag(PostDate))
# Add department information
group_list <- dat %>%
group_by(EhKey) %>%
summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')
# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
# Exploration 1: How many comebacks and regular trips?
all_duration_plot<-
all_duration %>%
group_by(comeback) %>%
summarise(number = n())
ggplot(data=all_duration_plot[!is.na(all_duration_plot$comeback),], aes(x=comeback, y=number)) +
geom_bar(stat="identity",width=0.2,fill = c("#c70000","#00c700"))+
geom_text(aes(label=number), position=position_dodge(width=0.9), vjust=-0.25)+
labs(
title = "How many comeback and Regular Trip totally?",
y="Total number")
# Exploration 3: Average cost difference between comebacks and regular trips
all_duration_plot<-
all_duration %>%
group_by(comeback) %>%
summarise(Cost = mean(Totalcost))
ggplot(data=all_duration_plot[!is.na(all_duration_plot$comeback),], aes(x=comeback, y=round(Cost, digits = 2))) +
geom_bar(stat="identity",width=0.2,fill = c("#c70000","#00c700"))+
geom_text(aes(label=round(Cost, digits = 2)), position=position_dodge(width=0.9), vjust=-0.25)+
labs(
title = "How much money cost for each comeback/Regular Trip?",
y="Average cost" )
# Calculate the time since last maintenance of each vehicle and each part
all_duration <- dat %>%
group_by(EhKey,RGDESC,WHKey) %>%
summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
arrange(EhKey,RGDESC,PostDate,)%>%
mutate(lagTime = PostDate - lag(PostDate))
# Add department information
group_list <- dat %>%
group_by(EhKey) %>%
summarise(EqcDesc = first(EqcDesc), EPCycleLength=first(EPCycleLength), Make=first(Make),Model=first(Model),CycleDESC =first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')
# Interactive - Cost's change accross time
all_duration_try <- all_duration %>%
group_by(PostDate) %>%
summarise(Totalcost_byDay = sum(Totalcost))
library(dygraphs)
library(zoo)
library(xts)
my_weekday_date <- all_duration_try %>% dplyr::select(PostDate) %>% pull
my_weekday_view <- all_duration_try %>% dplyr::select(Totalcost_byDay) %>% pull
my_zoo_weekday <- zoo(my_weekday_view, my_weekday_date)
my_zoo_weekday_xts <- as.xts(my_zoo_weekday, order.by = my_weekday_date)
dygraph(my_zoo_weekday_xts, main = "Cost of each day") %>%
dyRangeSelector(dateWindow = c("2017-01-01", "2018-05-17"))
# Interactive - Scatterplot of cost vs duration by all groups
library(plotly)
p <- sample_n(all_duration, 10000) %>%
plot_ly(
type = 'scatter', x = ~lagTime, y = ~Totalcost, text = ~RGDESC,
hoverinfo = 'text',
mode = 'markers',
transforms = list(
list(
type = 'filter',
target = ~EqcDesc,
operation = '=',
value = unique(all_duration$EqcDesc)[1]
)
)) %>% layout(
updatemenus = list(
list(
type = 'dropdown',
active = 0,
# build one dropdown button per department programmatically
buttons = lapply(unique(all_duration$EqcDesc), function(v) {
list(method = "restyle",
args = list("transforms[0].value", v),
label = v)
})
)
)
)
p2 <- sample_n(all_duration, 10000) %>%
plot_ly(
type = 'scatter', x = ~lagTime, y = ~Totalcost, text = ~RGDESC,
hoverinfo = 'text',
mode = 'markers',
marker = list(color = 'rgba(255, 100, 100,0.9)'),
transforms = list(
list(
type = 'filter',
target = ~EqcDesc,
operation = '=',
value = unique(all_duration$EqcDesc)[1]
)
)) %>% layout(
updatemenus = list(
list(
type = 'dropdown',
active = 0,
# build one dropdown button per department programmatically
buttons = lapply(unique(all_duration$EqcDesc), function(v) {
list(method = "restyle",
args = list("transforms[0].value", v),
label = v)
})
)
)
)
p
p2
# Duration distribution by repair part type
p3 <- all_duration %>%
filter( lagTime > 5, EqcDesc == "FIRE DEPARTMENT")%>%
plot_ly(
type = 'histogram', x = ~lagTime,
transforms = list(
list(
type = 'filter',
target = ~RGDESC,
operation = '=',
value = unique(all_duration$RGDESC)[1]
)
)) %>% layout(
updatemenus = list(
list(
type = 'dropdown',
active = 0,
# build one dropdown button per repair system programmatically
buttons = lapply(unique(all_duration$RGDESC), function(v) {
list(method = "restyle",
args = list("transforms[0].value", v),
label = v)
})
)
)
)
p4 <- all_duration %>%
filter( lagTime > 5, EqcDesc == "FIRE DEPARTMENT")%>%
plot_ly(
type = 'histogram', x = ~lagTime,
marker = list(color = 'rgba(255, 100, 100,0.9)'),
transforms = list(
list(
type = 'filter',
target = ~RGDESC,
operation = '=',
value = unique(all_duration$RGDESC)[1]
)
)) %>% layout(
updatemenus = list(
list(
type = 'dropdown',
active = 0,
# build one dropdown button per repair system programmatically
buttons = lapply(unique(all_duration$RGDESC), function(v) {
list(method = "restyle",
args = list("transforms[0].value", v),
label = v)
})
)
)
)
p3
p4
p1 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "FIRE DEPARTMENT"))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 180), colour = "#FA7800",size=1)+
geom_text(aes(x=180, label="EPCycle 6 months", y=1000), colour="blue")+
labs(
title = " Fire vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
p2<-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 1,EqcDesc == "POLICE DEPARTMENT"))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 30), colour = "#FA7800",size=1)+
geom_text(aes(x=30, label="EPCycle 1 months", y=9200), colour="blue")+
labs(
title = " Police vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
p3 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365,EPCycleLength == 3, EqcDesc == "DPW - Solid Waste"))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 90), colour = "#FA7800",size=1)+
geom_text(aes(x=90, label="EPCycle 3 months", y=3500), colour="blue")+
labs(
title = "Solid Waste vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
p4 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365,EPCycleLength == 4, EqcDesc == "DPW Water & Wastewater"))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 120), colour = "#FA7800",size=1)+
geom_text(aes(x=120, label="EPCycle 4 months", y=2400), colour="blue")+
labs(
title = " Water & Wastewater vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
p5 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 1,EqcDesc == "Light Duty Emergency"))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 30), colour = "#FA7800",size=1)+
geom_text(aes(x=30, label="EPCycle 1 months", y=3000), colour="blue")+
labs(
title = " Light Duty Emergency vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
p6 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EqcDesc == "Light Duty Non-Emergency",EPCycleLength == 4))+
geom_histogram(aes(time), binwidth = 10)+
geom_vline(aes(xintercept = 120), colour = "#FA7800",size=1)+
geom_text(aes(x=120, label="EPCycle 4 months", y=550), colour="blue")+
labs(
title = " Light Duty Non-Emergency vehicle",
subtitle = "Max time 365 days, Bar = 10 days",
x="Days",
y="Number of records" )
grid.arrange(p1, p2,p3,p4,p5,p6, nrow = 2, ncol = 3 )
dat <- read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 1) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 2)) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 3)) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 4)) %>%
rename(EPCycleLength = EPCycleLenghth) %>%
rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 5)) %>%
mutate(PostDate = ymd(as.character(PostDate)))
# Calculate the time since last maintenance of each vehicle and each part
all_duration <- dat %>%
group_by(EhKey,RGDESC,WHKey) %>%
summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
arrange(EhKey,RGDESC,PostDate)%>%
mutate(lagTime = PostDate - lag(PostDate))
# Add department information
group_list <- dat %>%
group_by(EhKey) %>%
summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')
if(all_duration$CycleDESC == "MONTHS" )
{
all_duration$CycleDESC_day = all_duration$EPCycleLength*30
}
if(all_duration$CycleDESC == "DO NOT USE SET A MONTHLY DATE")
{
all_duration$CycleDESC_day = all_duration$EPCycleLength*30
}
# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit
# identify comeback
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
# # Exploration 1: How many comeback and Regular Trip?
# barplot(table(all_duration$comeback), main="How many comeback and Regular Trip?",
# xlab="Yes or No", ylab = "Count")
#
# # Exploration 2: Comeback & Department
# t_1 <- table(all_duration$EqcDesc)
# name = t_1 > 10000
#
# counts <- table(all_duration$comeback, all_duration$EqcDesc)
# barplot(counts, main="Car Distribution by Gears and VS",
# xlab="Number of Gears", col=c("darkblue","red"),
# legend = rownames(counts))
#====================
p1 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "FIRE DEPARTMENT"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "Fire Department Comeback Data",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
p2 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 1, EqcDesc == "POLICE DEPARTMENT"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "Police Department Comeback Data",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
p3 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 3, EqcDesc == "DPW - Solid Waste"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "DPW - Solid Waste Comeback Data",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
p4 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "DPW Water & Wastewater"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "DPW Water & Wastewater",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
p5 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "Light Duty Emergency"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "Light Duty Emergency",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
p6 <-ggplot(all_duration %>%
mutate(time = as.numeric(lagTime, "days")) %>%
filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "Light Duty Non-Emergency"),aes(x=comeback))+
geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
labs(
title = "Light Duty Non-Emergency Comeback Data",
x="Comeback or a Regular Non Comeback Repair Trip",
y="Total Count" )
grid.arrange(p1, p2,p3,p4,p5,p6, nrow = 2, ncol = 3 )
#======Data Preparation======
# Load Data
# Rename a column in between sheets 4 and 5 because it's spelled wrong in all but sheet 5
# Turn PostDate into a lubridate
# Calculate the time since last maintenance of each vehicle and each part
all_duration <- dat %>%
group_by(EhKey,RGDESC,WHKey) %>%
summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
arrange(EhKey,RGDESC,PostDate)%>%
mutate(lagTime = PostDate - lag(PostDate))%>%
mutate(leadTime = lead(PostDate)-PostDate)
# Add department information
group_list <- dat %>%
group_by(EhKey) %>%
summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')
# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit
# identify comeback
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
all_duration$comeback_next[all_duration$leadTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback_next[all_duration$leadTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
# rolling-window features: past cost totals first, then means, then comeback counts
df_1 <- all_duration %>%
arrange(EhKey) %>%
mutate(date = ymd(PostDate))%>%
arrange(PostDate)
library(tidyr)
library(geosphere)
rollavgbyperiod <- function(i,window,df){
startdate <- df$date[i]-window
enddate <- df$date[i]-1
interval <- seq(startdate,enddate,1)
tmp <- df$Totalcost[df$date %in% interval]
return(sum(tmp))
}
roll_function_col_7 <- function(df){
window <-7
Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
Pastweekaverage[is.nan(Pastweekaverage)] <- 0
return(Pastweekaverage)
}
roll_function_col_30 <- function(df){
window <-30
Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
Pastweekaverage[is.nan(Pastweekaverage)] <- 0
return(Pastweekaverage)
}
roll_function_col_182 <- function(df){
window <-182
Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
Pastweekaverage[is.nan(Pastweekaverage)] <- 0
return(Pastweekaverage)
}
df_2 <- df_1 %>%
arrange(EhKey,PostDate)%>%
nest(-EhKey) %>%
group_by(EhKey) %>%
mutate(past1week_cost = purrr::map(data, roll_function_col_7),
past1month_cost = purrr::map(data, roll_function_col_30),
past6month_cost = purrr::map(data, roll_function_col_182)) %>%
unnest()
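To see what rollavgbyperiod actually does at this point (a rolling sum, as noted above), here is a hypothetical four-row example:
# charges posted on Jan 1, 3, 10 and 11; evaluating row 4 (Jan 11) with a
# 7-day lookback covers Jan 4 through Jan 10, so only the Jan 10 charge counts
toy <- data.frame(date = as.Date("2019-01-01") + c(0, 2, 9, 10),
Totalcost = c(100, 50, 25, 0))
rollavgbyperiod(4, 7, toy)
## [1] 25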
# rolling mean
rollavgbyperiod <- function(i,window,df){
startdate <- df$date[i]-window
enddate <- df$date[i]-1
interval <- seq(startdate,enddate,1)
tmp <- df$Totalcost[df$date %in% interval]
return(mean(tmp))
}
roll_function_col_7 <- function(df){
window <-7
Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
Pastweekaverage[is.nan(Pastweekaverage)] <- 0
return(Pastweekaverage)
}
roll_function_col_30 <- function(df){
window <-30
Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
Pastweekaverage[is.nan(Pastweekaverage)] <- 0
return(Pastweekaverage)
}
roll_function_col_182 <- function(df){
window <-182
Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
Pastweekaverage[is.nan(Pastweekaverage)] <- 0
return(Pastweekaverage)
}
# rolling average and count
df_2 <- df_2 %>%
arrange(EhKey,PostDate)%>%
nest(-EhKey) %>%
group_by(EhKey) %>%
mutate(past1week_ave_cost = purrr::map(data, roll_function_col_7),
past1month_ave_cost = purrr::map(data, roll_function_col_30),
past6month_ave_cost = purrr::map(data, roll_function_col_182)) %>%
unnest()
df_2$past1week_count = df_2$past1week_cost/df_2$past1week_ave_cost
df_2$past1month_count = df_2$past1month_cost/df_2$past1month_ave_cost
df_2$past6month_count = df_2$past6month_cost/df_2$past6month_ave_cost
df_2[["past1week_count"]][is.na(df_2[["past1week_count"]])] <- 0
df_2[["past1month_count"]][is.na(df_2[["past1month_count"]])] <- 0
df_2[["past6month_count"]][is.na(df_2[["past6month_count"]])] <- 0
# rolling comeback
df_2$comeback_next_10=0
df_2$comeback_10=0
df_2 <- within(df_2, comeback_next_10[comeback_next=='Comeback'] <- 1)
df_2 <- within(df_2, comeback_10[comeback=='Comeback'] <- 1)
rollavgbyperiod <- function(i,window,df){
startdate <- df$date[i]-window
enddate <- df$date[i]-1
interval <- seq(startdate,enddate,1)
tmp <- df$comeback_10[df$date %in% interval]
return(sum(tmp))
}
roll_function_col_7 <- function(df){
window <-7
Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
Pastweekaverage[is.nan(Pastweekaverage)] <- 0
return(Pastweekaverage)
}
roll_function_col_30 <- function(df){
window <-30
Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
Pastweekaverage[is.nan(Pastweekaverage)] <- 0
return(Pastweekaverage)
}
roll_function_col_182 <- function(df){
window <-182
Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
Pastweekaverage[is.nan(Pastweekaverage)] <- 0
return(Pastweekaverage)
}
df_2 <- df_2 %>%
arrange(EhKey,PostDate)%>%
nest(-EhKey) %>%
group_by(EhKey) %>%
mutate(past1week_comeback = purrr::map(data, roll_function_col_7),
past1month_comeback = purrr::map(data, roll_function_col_30),
past6month_comeback = purrr::map(data, roll_function_col_182)) %>%
unnest()
df_2$same_part_lasttime = df_2$lagTime
# comeback rate of each model and make
model_rate<-df_2%>%
group_by(Model)%>%
summarise(comeback_rate_model = mean(comeback_10))
df_2 = left_join(df_2, model_rate, by = "Model")
make_rate<-df_2%>%
group_by(Make)%>%
summarise(comeback_rate_make = mean(comeback_10))
df_2 = left_join(df_2, make_rate, by = "Make")
# comeback rate of monthly department average
df_2$year = format(df_2$date,'%Y')
df_2$month = format(df_2$date,'%m')
dpt_month_rate<-df_2%>%
group_by(EqcDesc,year,month)%>%
summarise(comeback_rate_dpt_ym = mean(comeback_10))
df_2 = left_join(df_2, dpt_month_rate, by = c("EqcDesc","year","month"))
# comeback rate of yearly department average
dpt_month_rate<-df_2%>%
group_by(EqcDesc,year)%>%
summarise(comeback_rate_dpt_y = mean(comeback_10))
df_2 = left_join(df_2, dpt_month_rate, by = c("EqcDesc","year"))
# comeback rate of each system of each type of vehicle
RGDESC_model_rate<-df_2%>%
group_by(RGDESC,Model)%>%
summarise(comeback_rate_RGDESC = mean(comeback_10))
df_2 = left_join(df_2, RGDESC_model_rate, by = c("RGDESC","Model"))
# === Categorical Creation ===
df_backup = df_2
#df_2 = df_backup
# Department as a categorical variable (big/small)
dpt_n<-df_2%>%
group_by(EqcDesc)%>%
summarise(number_of_vehicle = n())
df_2 = left_join(df_2, dpt_n, by = c("EqcDesc"))
df_2 <- df_2 %>%
mutate(dpt_big = case_when(number_of_vehicle>=10000 ~ "big department",
number_of_vehicle<10000 ~ "small department"))
# Department as a categorical variable (H, M , L, VL)
dpt_rate<-df_2%>%
group_by(EqcDesc)%>%
summarise(comeback_rate_dpt = mean(comeback_10))
df_2 = left_join(df_2, dpt_rate, by = c("EqcDesc"))
df_2 <- df_2 %>%
mutate(dpt_risk = case_when(comeback_rate_dpt>=0.5 ~ "High risk",
comeback_rate_dpt>=0.25 ~ "Medium risk",
comeback_rate_dpt>=0.15 ~ "Low risk",
TRUE ~ "Very Low risk"))
# Make as a categorical variable (ex. risk score 0-5)
df_2 <- df_2 %>%
mutate(make_risk = case_when(comeback_rate_make>=0.5 ~ "High risk",
comeback_rate_make>=0.25 ~ "Medium risk",
comeback_rate_make>=0.15 ~ "Low risk",
TRUE ~ "Very Low risk"))
# Model as a categorical variable (ex. risk score 0-5)
df_2 <- df_2 %>%
mutate(model_risk = case_when(comeback_rate_model>=0.8 ~ "Very High risk",
comeback_rate_model>=0.6 ~ "High risk",
comeback_rate_model>=0.3 ~ "Medium risk",
comeback_rate_model>=0.15 ~ "Low risk",
TRUE ~ "Very Low risk"))
# RGDesc as a categorical variable (1-10)
df_2 <- df_2 %>%
mutate(part_risk = case_when(comeback_rate_RGDESC>=0.8 ~ "Very High risk",
comeback_rate_RGDESC>=0.5 ~ "High risk",
comeback_rate_RGDESC>=0.2 ~ "Medium risk",
comeback_rate_RGDESC>0 ~ "Low risk",
TRUE ~ "Very Low risk"))
# Workshop(get the workshop first)
workshop_list <- dat %>%
group_by(WHKey) %>%
summarise(WHShop = first(WHShop))
df_2 <- left_join(df_2,workshop_list,by='WHKey')
Shop_rate<-df_2%>%
group_by(WHShop)%>%
summarise(comeback_rate_shop = mean(comeback_10))
df_2 = left_join(df_2, Shop_rate, by = c("WHShop"))
df_2 <- df_2 %>%
mutate(Shop_risk = case_when(comeback_rate_shop>=0.7 ~ "High risk",
comeback_rate_shop>=0.5 ~ "Medium risk",
comeback_rate_shop>=0.25 ~ "Low risk",
TRUE ~ "Very Low risk"))
# month
df_2 <- df_2 %>%
mutate(month_cat = case_when(month=="04" ~ "April",
month=="05" ~ "May",
month=="06" ~ "June",
TRUE ~ "else"))
RG_list <- dat %>%
group_by(WHKey) %>%
summarise(N_RG = n())
df_2 <- left_join(df_2,RG_list,by='WHKey')
library(corrplot)
df_Cor <-
df_2 %>%
dplyr::select(Totalcost,past1week_cost,past1month_cost, past6month_cost,past1week_count,past1month_count,past6month_count,past1week_ave_cost, past1month_ave_cost,past6month_ave_cost,past1week_comeback,past1month_comeback,past6month_comeback, comeback_rate_model,comeback_rate_make,comeback_rate_dpt_ym,comeback_rate_dpt_y,comeback_rate_dpt,comeback_rate_RGDESC,comeback_rate_shop)
df_Cor <- subset(df_Cor, select=-c(EhKey))
M <- cor(df_Cor,use="pairwise.complete.obs")
corrplot(M, method = "square",tl.cex = 0.5)
options(scipen=10000000)
library(tidyverse)
library(caret)
library(knitr)
library(pscl)
library(plotROC)
library(pROC)
palette5 <- c("#3F6973","#49A695","#F2CD88","#F2A679","#D9725B")
palette4 <- c("#00c700","#0060C7","#c70000","#DEA60B")
palette3 <- c("#00c700","#DEA60B","#c70000")
palette2 <- c("#00c700","#c70000")
df_model <- na.omit(df_2)
df_model <- subset(df_model, select=-c(EhKey,WHKey,PostDate,unit,comeback,comeback_next,RGDESC,Model))
samp <- sample(nrow(df_model),20000)
df_model_sample<- df_model[samp,]
df_model_sample_1<-df_model_sample
set.seed(11233)
trainIndex <- createDataPartition(df_model_sample$comeback_next_10, p = .70,
list = FALSE,
times = 1)
df_Train <- df_model_sample[ trainIndex,]
df_Test <- df_model_sample[-trainIndex,]
new_predictor <-c("comeback_next_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength_day","comeback_10")
Logit_Model_1 <- glm(comeback_next_10 ~ .,
data=df_Train %>% dplyr::select(new_predictor),
family="binomial" (link="logit"))
testProbs_0 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
Probs_1 = predict(Logit_Model_1, df_Test, type= "response"))
testProbs_0 <- testProbs_0%>%
mutate(predOutcome_1 = as.factor(ifelse(testProbs_0$Probs_1 > 0.5 , 1, 0)))
cm_1 <-confusionMatrix(testProbs_0$predOutcome_1, testProbs_0$Outcome,positive = "1")
summary(Logit_Model_1)
cm_1
new_predictor <-c("comeback_next_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength_day","CycleDESC","N_RG")
Logit_Model_1 <- glm(comeback_next_10 ~ .,
data=df_Train %>% dplyr::select(new_predictor),
family="binomial" (link="logit"))
testProbs <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
Probs_1 = predict(Logit_Model_1, df_Test, type= "response"))
testProbs <- testProbs%>%
mutate(predOutcome_1 = as.factor(ifelse(testProbs$Probs_1 > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs$predOutcome_1, testProbs$Outcome,positive = "1")
test_predictor <-c("comeback_next_10","comeback_10")
Logit_Model_2 <- glm(comeback_next_10 ~ .,
data=df_Train %>% dplyr::select(test_predictor),
family="binomial" (link="logit"))
testProbs_2 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
Probs = predict(Logit_Model_2, df_Test, type= "response"))
testProbs_2 <- testProbs_2%>%
mutate(predOutcome = as.factor(ifelse(testProbs_2$Probs > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs_2$predOutcome, testProbs_2$Outcome,positive = "1")
testProbs$Probs_2 = testProbs_2$Probs
testProbs$predOutcome_2 = testProbs_2$predOutcome
new_predictor <-c("comeback_next_10","dpt_big","dpt_risk","make_risk","model_risk","part_risk","Shop_risk")
Logit_Model_new <- glm(comeback_next_10 ~ .,
data=df_Train %>% dplyr::select(new_predictor),
family="binomial" (link="logit"))
testProbs_3 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
Probs = predict(Logit_Model_new, df_Test, type= "response"))
testProbs_3 <- testProbs_3%>%
mutate(predOutcome = as.factor(ifelse(testProbs_3$Probs > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs_3$predOutcome, testProbs_3$Outcome,positive = "1")
testProbs$Probs_3 = testProbs_3$Probs
testProbs$predOutcome_3 = testProbs_3$predOutcome
testProbs$Probs_Ensemble = (0.9*testProbs$Probs_1+0.09*testProbs$Probs_2+0.01*testProbs$Probs_3)
testProbs <- testProbs%>%
mutate(predOutcome_Ensemble = as.factor(ifelse(testProbs$Probs_Ensemble > 0.5 , 1, 0)))
cm_2 <-confusionMatrix(testProbs$predOutcome_Ensemble, testProbs$Outcome,positive = "1")
cm_2
library(randomForest)
df_Train_fac=df_Train %>% mutate_if(is.character, as.factor)
df_Test_fac=df_Test %>% mutate_if(is.character, as.factor)
new_predictor <-c("comeback_next_10","comeback_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength","CycleDESC","N_RG")
RF_Model_1 <- randomForest(comeback_next_10 ~ .,
data=df_Train_fac %>% dplyr::select(new_predictor),
norm.votes = TRUE, proximity = TRUE)
testProbs_RF <- data.frame(Outcome = as.factor(df_Test_fac$comeback_next_10),
Probs = predict(RF_Model_1, df_Test_fac, type= "response"))
testProbs_RF <- testProbs_RF%>%
mutate(predOutcome = as.factor(ifelse(testProbs_RF$Probs > 0.5 , 1, 0)))
cm_3 <-confusionMatrix(testProbs_RF$predOutcome, testProbs_RF$Outcome,positive = "1")
cm_3
library(xgboost)
all_predictor_xgb <- c("Totalcost","past1week_cost","past1month_cost",
"past6month_cost","past1week_count","past1month_count","past6month_count","past1week_ave_cost",
"past1month_ave_cost","past6month_ave_cost","comeback_10","past1week_comeback","past1month_comeback",
"past6month_comeback","comeback_rate_model","comeback_rate_make","comeback_rate_dpt_ym","comeback_rate_dpt_y",
"comeback_rate_RGDESC","comeback_rate_shop","comeback_rate_dpt","N_RG")
new_predictor <-c("comeback_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","EPCycleLength","N_RG")
xgb_Model_1 <- xgboost(data=as.matrix(df_Train %>% dplyr::select(all_predictor_xgb)),label = df_Train$comeback_next_10,max.depth = 50, eta = 1, nthread = 2, nrounds = 10, objective = "binary:logistic")
testProbs_xgb <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
Probs = predict(xgb_Model_1, as.matrix(df_Test %>% dplyr::select(all_predictor_xgb))))
testProbs_xgb <- testProbs_xgb%>%
mutate(predOutcome = as.factor(ifelse(testProbs_xgb$Probs > 0.5 , 1, 0)))
cm_4 <-confusionMatrix(testProbs_xgb$predOutcome, testProbs_xgb$Outcome,positive = "1")
cm_4
library(kableExtra)
model_table <- data.frame(Model = c("Logistic Regression","Ensemble Logistic","Random Forest","XGBoost"),
Accuracy=c(cm_1$overall['Accuracy'],cm_2$overall['Accuracy'],cm_3$overall['Accuracy'],cm_4$overall['Accuracy']),
Sensitivity =c(cm_1$byClass['Sensitivity'],cm_2$byClass['Sensitivity'],cm_3$byClass['Sensitivity'],cm_4$byClass['Sensitivity']),
Specificity =c(cm_1$byClass['Specificity'],cm_2$byClass['Specificity'],cm_3$byClass['Specificity'],cm_4$byClass['Specificity']))
model_table %>%
gather("Type", "Value",-Model) %>%
ggplot(aes(Model, Value, fill = Type)) +
labs(title = "Model Comparison") +
scale_fill_manual(values = palette3) +
geom_bar(position = "dodge", stat = "identity") +
theme_bw()
kable(model_table, align = "l",
caption = "Model Comparison") %>% kable_styling()
# ggplot(testProbs, aes(x = Probs_Ensemble, fill = as.factor(Outcome))) +
# geom_density() +
# facet_grid(Outcome ~ .) +
# scale_fill_manual(values = palette2) +
# labs(x = "Churn", y = "Density of probabilities",
# title = "Distribution of predicted probabilities by observed outcome") +
# theme(strip.text.x = element_text(size = 18),
# legend.position = "none")
ggplot(testProbs_RF, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
scale_fill_manual(values = palette2) +
labs(x = "Churn", y = "Density of probabilities",
title = "Distribution of predicted probabilities by observed outcome") +
theme(strip.text.x = element_text(size = 18),
legend.position = "none")
caret::confusionMatrix(testProbs_RF$predOutcome, testProbs_RF$Outcome,
positive = "1")
# ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs_Ensemble)) +
# geom_roc(n.cuts = 50, labels = FALSE, colour = "#D9725B") +
# style_roc(theme = theme_grey) +
# geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
# labs(title = "ROC Curve")
ggplot(testProbs_RF, aes(d = as.numeric(testProbs_RF$Outcome), m = Probs)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "#c70000") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve")
# auc(testProbs$Outcome, testProbs$Probs_Ensemble)
auc(testProbs_RF$Outcome, testProbs_RF$Probs)
df_model_1 <- na.omit(df_2)
df_model_sample_1<- df_model_1[samp,]
df_Test_1 <- df_model_sample_1[-trainIndex,]
df_G <- df_Test_1
df_G$Outcome = testProbs_RF$Outcome
df_G$Probs = testProbs_RF$Probs
df_G$predOutcome = testProbs_RF$predOutcome
df_G$correct_10 = (df_G$predOutcome == df_G$Outcome)
df_G_n_dpt <- df_G %>%
group_by(EqcDesc) %>%
summarise(number=n(),Correct = sum(correct_10=="TRUE"),Accuracy = Correct/number)%>%
arrange(-number)
df_G_n_dpt=df_G_n_dpt[1:6,]
p1 <- ggplot(data=df_G_n_dpt, aes(x=EqcDesc, y=Accuracy)) +
geom_bar(stat="identity", fill="#00c700",position = 'dodge',color="#00c700")+
scale_y_continuous(labels = scales::percent)+
labs(x = 'Top 6 Biggest Department', y = 'Accuracy', title = "Model's Generalizability - Department")+
theme(axis.text.x = element_text(angle = 20, hjust = 1))
df_G_n_part <- df_G %>%
group_by(RGDESC) %>%
summarise(number=n(),Correct = sum(correct_10=="TRUE"),Accuracy = Correct/number)%>%
arrange(-number)
df_G_n_part=df_G_n_part[1:6,]
p2 <- ggplot(data=df_G_n_part, aes(x=RGDESC, y=Accuracy)) +
geom_bar(stat="identity", fill="#00c700",position = 'dodge',color="#00c700")+
scale_y_continuous(labels = scales::percent)+
labs(x = 'Top 6 Most Repaired Vehicle System', y = 'Accuracy', title = "Model's Generalizability - System")+
theme(axis.text.x = element_text(angle = 20, hjust = 1))
grid.arrange(p1, p2, nrow = 1, ncol = 2 )
cost_benefit_table <-
testProbs_RF %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
True_Positive = sum(n[predOutcome==1 & Outcome==1]),
False_Negative = sum(n[predOutcome==0 & Outcome==1]),
False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
gather(Variable, Count) %>%
bind_cols(data.frame(Each_Trip_Cost = c(
"470","30+470","600","30+470")))%>%
mutate(Cost =
ifelse(Variable == "True_Negative", Count * 470,
ifelse(Variable == "True_Positive", Count * (30+470),
ifelse(Variable == "False_Negative",Count * 600,
ifelse(Variable == "False_Positive",Count * (30+470), 0)))))%>%
bind_cols(data.frame(Description = c(
"We predicted next repair trip is not come back and it didn't comeback",
"We predicted next repair trip is come back and it cameback next trip",
"We predicted next repair trip is not come back but it cameback next trip",
"We predicted next repair trip is come back and it didn't comeback")))
kable(cost_benefit_table, align = "l",
caption = "Cost/Benefit Table") %>% kable_styling()
iterateThresholds <- function(data) {
x = .01
all_prediction <- data.frame()
while (x <= 1) {
this_prediction <-
testProbs_RF %>%
mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
count(predOutcome, Outcome) %>%
summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
True_Positive = sum(n[predOutcome==1 & Outcome==1]),
False_Negative = sum(n[predOutcome==0 & Outcome==1]),
False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
gather(Variable, Count) %>%
mutate(Cost =
ifelse(Variable == "True_Negative", Count * 470,
ifelse(Variable == "True_Positive", Count * (30+470),
ifelse(Variable == "False_Negative",Count * 600,
ifelse(Variable == "False_Positive",Count * (30+470), 0)))),Threshold=x)
all_prediction <- rbind(all_prediction, this_prediction)
x <- x + .01
}
return(all_prediction)
}
whichThreshold <- iterateThresholds(testProbs_RF)
whichThreshold %>%
ggplot(.,aes(Threshold, Count, colour = Variable)) +
geom_point() +
scale_colour_manual(values = palette4) +
labs(title = "Number by confusion matrix type and threshold cut off",
y="Count") +
guides(colour=guide_legend(title="Confusion Matrix"))
whichThreshold_revenue <-
whichThreshold %>%
group_by(Threshold) %>%
summarize(Total_Cost = sum(Cost))
whichThreshold_revenue %>%
dplyr::select(Threshold, Total_Cost) %>%
gather(Variable, Value, -Threshold) %>%
ggplot(aes(Threshold, Value, colour = Variable)) +
geom_point() +
scale_colour_manual(values = "#c70000") +
labs(title = "Total Cost by threshold cut off",
y="Total Cost")+
geom_vline(xintercept = pull(arrange(whichThreshold_revenue, Total_Cost)[1,1]))