1 Introduction

1.1 Client & Business Problem

Our Client is the Department of General Services of Baltimore City Government and their fleet repair & management team. They are having an unusual Business problem, they are spending too much on vehicles and too many are coming back too often. Or at least this is what they perceive, they want us to analyse their data set and give a mathematical definition of ‘what exactly is a comeback’. We start by exploring two vehicle spending and developing some tools to keep visiting thorough our analysis. This leads to ultimately defining a comeback. Lastly we finally predict the chance of comeback to help the city better manage their fleet and reduce their spending.

1.2 Preparation

This step involves setting Up libraries and themes and loading all the relevant data and converting it into the appropriate required format.

library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(caret)
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(rgdal)
library(tidycensus)
library(knitr)
library(rmarkdown)
library(readxl)
library(lubridate)
library(dygraphs)
library(zoo)
library(xts)
library(rmdformats)
library(tidyverse)
library(readxl)
library(lubridate)
library(tidyverse)

This is the step for accessing our data in the desired format and below we can even have a quick glimpse at it.

setwd("/Users/akshaynagar/Desktop/To_Org/In_Out_Temp")




dat <- read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 1) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 2)) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 3)) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 4)) %>%
  rename(EPCycleLength = EPCycleLenghth) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 5)) %>%
  mutate(PostDate = ymd(as.character(PostDate)))


glimpse(dat)

## Observations: 925,256
## Variables: 21
## $ VehicleCost   <dbl> 209353, 209353, 209353, 209353, 209353, 209353, 209353,…
## $ EhKey         <chr> "000024", "000024", "000024", "000024", "000024", "0000…
## $ WHKey         <chr> "0000539542", "0000546043", "0000546043", "0000546725",…
## $ Make          <chr> "PIERCE", "PIERCE", "PIERCE", "PIERCE", "PIERCE", "PIER…
## $ Model         <chr> "SABER", "SABER", "SABER", "SABER", "SABER", "SABER", "…
## $ MajGroup      <chr> "Fire", "Fire", "Fire", "Fire", "Fire", "Fire", "Fire",…
## $ EqcDesc       <chr> "FIRE DEPARTMENT", "FIRE DEPARTMENT", "FIRE DEPARTMENT"…
## $ Dept          <chr> "64-00", "64-00", "64-00", "64-00", "64-00", "64-00", "…
## $ EPCycleLength <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ CycleDESC     <chr> "MONTHS", "MONTHS", "MONTHS", "MONTHS", "MONTHS", "MONT…
## $ BhLineChr     <dbl> 879.18, 87.71, 81.33, 146.48, 131.57, 131.57, 249.90, 3…
## $ AcctType      <chr> "SU", "LA", "PA", "LA", "LA", "LA", "LA", "LA", "LA", "…
## $ PostDate      <date> 2014-07-15, 2014-07-09, 2014-08-21, 2014-07-15, 2014-0…
## $ MechID        <dbl> NA, 5217, NA, 5151, 5010, 5025, 5178, 5205, 5232, 5247,…
## $ WHShop        <chr> "TRK", "FIR", "FIR", "FWY", "FIR", "FIR", "FIR", "FIR",…
## $ SHPDESC       <chr> "TRUCK SIDE - CENTRAL", "FIRE MAINTENANCE", "FIRE MAINT…
## $ Reason        <chr> "9", "R", "R", NA, "8", "8", "8", "8", "R", "R", "2", "…
## $ RTRDESC       <chr> "Road Call (Vendor)", "Agency Request", "Agency Request…
## $ RGDESC        <chr> "Tire Service", "Engine Exhaust", "Electrical", "Igniti…
## $ RCDESC        <chr> "Tire Heavy", "Misc Parts", "System", "System", "Comple…
## $ RRDESC        <chr> "Replace L/F", "Repair", "Part Issue", "Road Service", …

1.3 Creating General Time Lag

This step involves creating a time lag for all the data, this takes into account EhKey, which is referred to as Vehicle ID in this documentation, and WHKey which is referred to as Trip ID for the purpose of this documentation, keeping both the Vehicle ID, and the Trip ID, and taking into account the initial date for each of these along with each successive date, a Time lag data is created, please note this is done for the purpose of creating a powerful tool, which will help us better understand and define comeback rates, this in itself is not the comeback rate.

dat_01  <-  dat %>%
  group_by(EhKey) %>%
  group_by(WHKey) %>%
  dplyr::slice(which.min(PostDate))
dat_02 <- dat_01 %>%
  mutate(PostDate = ymd(as.character(PostDate))) %>%
  group_by(EhKey) %>%
  arrange(PostDate) %>%
  mutate(lagTime = PostDate - lag(PostDate))

2 What is a Comeback

2.1 Comeback Defining

A comeback Trip is thus best explained in Detail in the following Chart Below. The Image in the Red is comeback trips these are trip that always talk about the same part and where warranty gets lapsed, a regular is similar except the same part is getting repaired after it has successfully lived it’s warranty, if two parts are different for two different repairs, then these are Non related trips. This is best shown with the case of examples and with splitting sample data sets, as well as with date examples and examples on actual parts with actual warranty periods taken for a fire truck let’s say with Vehicle ID 24, in the following example, illustrated in detail in the chart below.

2.2 Comeback Summary

In summary A comeback is thus simply stated as a Repair Trip on the same part before it’s warranty elapsed while a regular trip is after the warranty has elapsed this is illustrated in the summary below:

2.3 Time Since last maintenance

This step involves counting and displaying the number of trips where comeback is within 5 days, 10 days and so on for all the Data between 2015 to 2019 and for all the vehicles.

ggplot(dat_02 %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 100))+ 
  geom_histogram(aes(time), binwidth = 5)+
  geom_vline(xintercept=30, 
                color = "#F57434", size=1)+
  labs(
    title = "Time since last maintenance, All data 2015-2019",
    subtitle = "Max time 100 days, Bar = 5 days",
    x="Days",
    y="Number of records" )

3 Explore

3.1 FireTruck Fleet of Baltimore

The Baltimore City Government uses Pierce Saber make fire trucks, which is a leading American company that specifically specialises in Making Fire trucks.

3.2 Specific Fire Truck Data 1

Now we focus on one fire truck with vehicle ID ‘000024’, here we analyse and plot using time lag data, the duration between two repairs for the truck.

fireTruck <- dat %>%
  filter(EhKey == '000024') %>%
  arrange(PostDate)



fireTruck_time  <-  fireTruck %>%
  group_by(EhKey) %>%
  group_by(WHKey) %>%
  dplyr::slice(which.min(PostDate))%>%
  mutate(PostDate = ymd(as.character(PostDate))) %>%
  group_by(EhKey) %>%
  arrange(PostDate) %>%
  mutate(lagTime = PostDate - lag(PostDate))
fireTruck_time %>%
  ggplot( aes(x=PostDate, y=lagTime)) +
  geom_line() +
  geom_point()+
  geom_hline(yintercept=60, linetype="dashed", 
                color = "#F57434", size=1)+
  labs(
    title = "Duration between two repairment, FireTruck 000024",
    x="Date",
    y="Duration")

3.3 Fire Truck Costing

Here we plot the costing of repairs of the fire truck, these costs are accumulative. And here we can see the amount spent on the repairs of the truck, with respect of time.

#DONT INCLUDE 

fireTruck  %>%
  ggplot(aes(x=PostDate, y=cumsum(BhLineChr))) + geom_line() + geom_point()+
  labs(
    title = "Cumulative Maintenance fee, FireTruck 000024",
    x="Date",
    y="Dollar" )

3.4 Police Car

The Cars used in Baltimore fleet for police are the Chevrolet Impala 2009. ‘The Impala is Chevrolet’s popular flagship passenger car and is generally among the better selling American made automobiles in the United States.’

3.5 Same Exercise Police Car 1

Here as for the Fire Trucks we do a similar exercise for a Police Car, focusing on some different repair types as per car specific, however the thinking is very similar to the work on Police Car, this is for Police Car vehicle ID (EhKey == ‘078983’). We must note that warranty period or EPCycleLength is much shorter here, this is understandable as Police cars are used much more of then fire trucks due the nature of their job demanding much more frequent trips and much more extensive travel and usage.

ps1 <- dat %>%
  filter(EhKey == '078983') %>%
  arrange(PostDate)



ps1_time  <-  ps1 %>%
  group_by(EhKey) %>%
  group_by(WHKey) %>%
  dplyr::slice(which.min(PostDate))%>%
  mutate(PostDate = ymd(as.character(PostDate))) %>%
  group_by(EhKey) %>%
  arrange(PostDate) %>%
  mutate(lagTime = PostDate - lag(PostDate))
ps1_time %>%
  ggplot( aes(x=PostDate, y=lagTime)) +
  geom_line() +
  geom_point()+
  geom_hline(yintercept=30, linetype="dashed", 
                color = "#F57434", size=1)+
  labs(
    title = "Duration between two repairment, Police Car 078983",
    x="Date",
    y="Duration")

Police_car <- dat %>%
  filter(EhKey == '078983') %>%
  arrange(PostDate)


Police_car  %>%
  ggplot(aes(x=PostDate, y=cumsum(BhLineChr))) + geom_line() + geom_point()+
  labs(
    title = "Cumulative Maintenance fee, Police_car 078983",
    x="Date",
    y="Dollar" )

# Calculate the time since last maintenance of each vehicle and each part
all_duration  <-  dat %>%
  group_by(EhKey,RGDESC,WHKey) %>%
  summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
  arrange(EhKey,RGDESC,PostDate)%>%
  mutate(lagTime = PostDate - lag(PostDate))

# Add department information
group_list <- dat %>%
  group_by(EhKey) %>%
  summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')


# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit






all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
# Exploration 1: How many comeback and Regular Trip?
all_duration_plot<-
  all_duration %>%
  group_by(comeback) %>%
  summarise(number = n())

4 Total Data

4.1 How Many Total Comebacks

At this step after doing some individual explorations we summarise the total number of comebacks and regular trips to analyse their difference and move further ahead in our objective of ultimately reducing comebacks.

ggplot(data=all_duration_plot[!is.na(all_duration_plot$comeback),], aes(x=comeback, y=number)) +
  geom_bar(stat="identity",width=0.2,fill = c("#c70000","#00c700"))+
  geom_text(aes(label=number), position=position_dodge(width=0.9), vjust=-0.25)+
  labs(
    title = "How many comeback and Regular Trip totally?",
    y="Total number")

4.2 Average Cost of Comebacks

At this step the average cost of comebacks and regular trips to analyse their difference and so that later on we can do a detailed cost benefit analysis.

# Exploration 3: Average Cost different between comeback and Regular Trip
all_duration_plot<-  
  all_duration %>% 
  group_by(comeback) %>% 
  summarise(Cost = mean(Totalcost))

ggplot(data=all_duration_plot[!is.na(all_duration_plot$comeback),], aes(x=comeback, y=round(Cost, digits = 2))) +
  geom_bar(stat="identity",width=0.2,fill = c("#c70000","#00c700"))+
  geom_text(aes(label=round(Cost, digits = 2)), position=position_dodge(width=0.9), vjust=-0.25)+
  labs(
    title = "How much money cost for each comeback/Regular Trip?",
    y="Average cost" )

5 Tools

5.1 Cost Analysis

This exercise is for all of the data, and this takes into account how much are the total costs to the city within a period of time, this helps them see which months or what times were more cost efficient and which were not.

# Calculate the time since last maintenance of each vehicle and each part
all_duration  <-  dat %>%
  group_by(EhKey,RGDESC,WHKey) %>%
  summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
  arrange(EhKey,RGDESC,PostDate,)%>%
  mutate(lagTime = PostDate - lag(PostDate))

# Add department information
group_list <- dat %>%
  group_by(EhKey) %>%
  summarise(EqcDesc = first(EqcDesc), EPCycleLength=first(EPCycleLength), Make=first(Make),Model=first(Model),CycleDESC =first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')

# Interactive - Cost's change accross time




all_duration_try  <-  all_duration %>%
  group_by(PostDate) %>%
  summarise(Totalcost_byDay = sum(Totalcost))

library(dygraphs)
library(zoo)
library(xts)
library(Lahman)

my_weekday_date <- all_duration_try %>% dplyr::select(PostDate) %>% pull
my_weekday_view <- all_duration_try %>% dplyr::select(Totalcost_byDay) %>% pull
my_zoo_weekday <- zoo(my_weekday_view, my_weekday_date)
my_zoo_weekday_xts <- as.xts(my_zoo_weekday, order.by = my_weekday_date)

dygraph(my_zoo_weekday_xts, main = "Cost of each day") %>% 
  dyRangeSelector(dateWindow = c("2017-01-01", "2018-05-17"))

5.2 Interactive Plots

Over here we have a scatter plot that shows us the total cost and lag time by vehicle type that we can select, this is helpful to quickly scroll through and compare vehicles, and here we also have a Distribution plot that shows us the total lag time by vehicle type that we can select, this is helpful to quickly scroll through and compare vehicles.

# Interactive - Scatterplot of cost vs duration by all groups
library(plotly)  
p <- sample_n(all_duration, 10000) %>%
  plot_ly(
    type = 'scatter', x = ~lagTime, y = ~Totalcost, text = ~RGDESC,
    hoverinfo = 'text',
    mode = 'markers',
    transforms = list(
      list(
        type = 'filter',
        target = ~EqcDesc,
        operation = '=',
        value = unique(all_duration$EqcDesc)[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = list(
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[1]),
                 label = unique(all_duration$EqcDesc)[1]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[2]),
                 label = unique(all_duration$EqcDesc)[2]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[3]),
                 label = unique(all_duration$EqcDesc)[3]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[3]),
                 label = unique(all_duration$EqcDesc)[4]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[5]),
                 label = unique(all_duration$EqcDesc)[5]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[6]),
                 label = unique(all_duration$EqcDesc)[6]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[7]),
                 label = unique(all_duration$EqcDesc)[7]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[8]),
                 label = unique(all_duration$EqcDesc)[8]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[9]),
                 label = unique(all_duration$EqcDesc)[9]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[10]),
                 label = unique(all_duration$EqcDesc)[10]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[11]),
                 label = unique(all_duration$EqcDesc)[11]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[12]),
                 label = unique(all_duration$EqcDesc)[12]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[13]),
                 label = unique(all_duration$EqcDesc)[13]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[14]),
                 label = unique(all_duration$EqcDesc)[14]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[15]),
                 label = unique(all_duration$EqcDesc)[15]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[16]),
                 label = unique(all_duration$EqcDesc)[16]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[17]),
                 label = unique(all_duration$EqcDesc)[17]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[18]),
                 label = unique(all_duration$EqcDesc)[18]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[19]),
                 label = unique(all_duration$EqcDesc)[19]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[20]),
                 label = unique(all_duration$EqcDesc)[20]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[21]),
                 label = unique(all_duration$EqcDesc)[21])
          )
        )
      )
    )
p2 <- sample_n(all_duration, 10000) %>%
  plot_ly(
    type = 'scatter', x = ~lagTime, y = ~Totalcost, text = ~RGDESC,
    hoverinfo = 'text',
    mode = 'markers',
    marker = list(color = 'rgba(255, 100, 100,0.9)'),
    transforms = list(
      list(
        type = 'filter',
        target = ~EqcDesc,
        operation = '=',
        value = unique(all_duration$EqcDesc)[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = list(
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[1]),
                 label = unique(all_duration$EqcDesc)[1]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[2]),
                 label = unique(all_duration$EqcDesc)[2]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[3]),
                 label = unique(all_duration$EqcDesc)[3]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[3]),
                 label = unique(all_duration$EqcDesc)[4]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[5]),
                 label = unique(all_duration$EqcDesc)[5]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[6]),
                 label = unique(all_duration$EqcDesc)[6]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[7]),
                 label = unique(all_duration$EqcDesc)[7]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[8]),
                 label = unique(all_duration$EqcDesc)[8]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[9]),
                 label = unique(all_duration$EqcDesc)[9]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[10]),
                 label = unique(all_duration$EqcDesc)[10]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[11]),
                 label = unique(all_duration$EqcDesc)[11]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[12]),
                 label = unique(all_duration$EqcDesc)[12]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[13]),
                 label = unique(all_duration$EqcDesc)[13]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[14]),
                 label = unique(all_duration$EqcDesc)[14]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[15]),
                 label = unique(all_duration$EqcDesc)[15]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[16]),
                 label = unique(all_duration$EqcDesc)[16]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[17]),
                 label = unique(all_duration$EqcDesc)[17]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[18]),
                 label = unique(all_duration$EqcDesc)[18]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[19]),
                 label = unique(all_duration$EqcDesc)[19]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[20]),
                 label = unique(all_duration$EqcDesc)[20]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[21]),
                 label = unique(all_duration$EqcDesc)[21])
          )
        )
      )
    )

p

p2

# Duration distribution by repair part type
p3 <- all_duration %>%
  filter( lagTime > 5, EqcDesc == "FIRE DEPARTMENT")%>%
  plot_ly(
    type = 'histogram', x = ~lagTime, 
    transforms = list(
      list(
        type = 'filter',
        target = ~RGDESC,
        operation = '=',
        value = unique(all_duration$RGDESC)[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = list(
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[1]),
                 label = unique(all_duration$RGDESC)[1]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[2]),
                 label = unique(all_duration$RGDESC)[2]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[3]),
                 label = unique(all_duration$RGDESC)[3]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[3]),
                 label = unique(all_duration$RGDESC)[4]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[5]),
                 label = unique(all_duration$RGDESC)[5]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[6]),
                 label = unique(all_duration$RGDESC)[6]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[7]),
                 label = unique(all_duration$RGDESC)[7]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[8]),
                 label = unique(all_duration$RGDESC)[8]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[9]),
                 label = unique(all_duration$RGDESC)[9]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[10]),
                 label = unique(all_duration$RGDESC)[10]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[11]),
                 label = unique(all_duration$RGDESC)[11]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[12]),
                 label = unique(all_duration$RGDESC)[12]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[13]),
                 label = unique(all_duration$RGDESC)[13]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[14]),
                 label = unique(all_duration$RGDESC)[14]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[15]),
                 label = unique(all_duration$RGDESC)[15]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[16]),
                 label = unique(all_duration$RGDESC)[16]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[17]),
                 label = unique(all_duration$RGDESC)[17]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[18]),
                 label = unique(all_duration$RGDESC)[18]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[19]),
                 label = unique(all_duration$RGDESC)[19]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[20]),
                 label = unique(all_duration$RGDESC)[20]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[21]),
                 label = unique(all_duration$RGDESC)[21])
          )
        )
      )
    )

p4 <- all_duration %>%
  filter( lagTime > 5, EqcDesc == "FIRE DEPARTMENT")%>%
  plot_ly(
    type = 'histogram', x = ~lagTime, 
    marker = list(color = 'rgba(255, 100, 100,0.9)'),
    transforms = list(
      list(
        type = 'filter',
        target = ~RGDESC,
        operation = '=',
        value = unique(all_duration$RGDESC)[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = list(
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[1]),
                 label = unique(all_duration$RGDESC)[1]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[2]),
                 label = unique(all_duration$RGDESC)[2]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[3]),
                 label = unique(all_duration$RGDESC)[3]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[3]),
                 label = unique(all_duration$RGDESC)[4]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[5]),
                 label = unique(all_duration$RGDESC)[5]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[6]),
                 label = unique(all_duration$RGDESC)[6]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[7]),
                 label = unique(all_duration$RGDESC)[7]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[8]),
                 label = unique(all_duration$RGDESC)[8]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[9]),
                 label = unique(all_duration$RGDESC)[9]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[10]),
                 label = unique(all_duration$RGDESC)[10]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[11]),
                 label = unique(all_duration$RGDESC)[11]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[12]),
                 label = unique(all_duration$RGDESC)[12]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[13]),
                 label = unique(all_duration$RGDESC)[13]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[14]),
                 label = unique(all_duration$RGDESC)[14]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[15]),
                 label = unique(all_duration$RGDESC)[15]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[16]),
                 label = unique(all_duration$RGDESC)[16]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[17]),
                 label = unique(all_duration$RGDESC)[17]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[18]),
                 label = unique(all_duration$RGDESC)[18]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[19]),
                 label = unique(all_duration$RGDESC)[19]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[20]),
                 label = unique(all_duration$RGDESC)[20]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[21]),
                 label = unique(all_duration$RGDESC)[21])
          )
        )
      )
    )
p3

p4

6 Exploration by Department

6.1 Warranty(EPCycle) Data

Here we move our Analysis to working with each department, In the graphs below we see how many trips are happening before the warranty period and how many are living past their warranty period this will help us quickly visualise and analyse the data all the trips to the left of the epcycle are comebacks and all to the right are not.

p1 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "FIRE DEPARTMENT"))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 180), colour = "#FA7800",size=1)+
  geom_text(aes(x=180, label="EPCycle 6 months", y=1000), colour="blue")+
  labs(
    title = " Fire vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )

p2<-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 1,EqcDesc == "POLICE DEPARTMENT"))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 30), colour = "#FA7800",size=1)+
  geom_text(aes(x=30, label="EPCycle 1 months", y=9200), colour="blue")+
  labs(
    title = " Police vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )


p3 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365,EPCycleLength == 3, EqcDesc == "DPW - Solid Waste"))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 90), colour = "#FA7800",size=1)+
  geom_text(aes(x=90, label="EPCycle 3 months", y=3500), colour="blue")+
  labs(
    title = "Solid Waste vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )

p4 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365,EPCycleLength == 4, EqcDesc == "DPW Water & Wastewater"))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 120), colour = "#FA7800",size=1)+
  geom_text(aes(x=120, label="EPCycle 4 months", y=2400), colour="blue")+
  labs(
    title = " Water & Wastewater vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )

p5 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 1,EqcDesc == "Light Duty Emergency"))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 30), colour = "#FA7800",size=1)+
  geom_text(aes(x=30, label="EPCycle 1 months", y=3000), colour="blue")+
  labs(
    title = " Light Duty Emergency vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )

p6 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EqcDesc == "Light Duty Non-Emergency",EPCycleLength == 4))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 120), colour = "#FA7800",size=1)+
  geom_text(aes(x=120, label="EPCycle 4 months", y=550), colour="blue")+
  labs(
    title = " Light Duty Non-Emergency vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )

grid.arrange(p1, p2,p3,p4,p5,p6, nrow = 2, ncol = 3 )

6.2 Comeback & Regular Trips

Furthering our analysis to work with each department, Now after understanding what is a comeback and a regular trip we show the total number of each of these by department type as below.

dat <- read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 1) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 2)) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 3)) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 4)) %>%
  rename(EPCycleLength = EPCycleLenghth) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 5)) %>%
  mutate(PostDate = ymd(as.character(PostDate)))

# Calculate the time since last maintenance of each vehicle and each part
all_duration  <-  dat %>%
  group_by(EhKey,RGDESC,WHKey) %>%
  summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
  arrange(EhKey,RGDESC,PostDate)%>%
  mutate(lagTime = PostDate - lag(PostDate))

# Add department information
group_list <- dat %>%
  group_by(EhKey) %>%
  summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')

if(all_duration$CycleDESC == "MONTHS" )
  {
  all_duration$CycleDESC_day = all_duration$EPCycleLength*30
}
if(all_duration$CycleDESC == "DO NOT USE SET A MONTHLY DATE")
{
  all_duration$CycleDESC_day = all_duration$EPCycleLength*30
}

# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit

# identify comeback
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"

# # Exploration 1: How many comeback and Regular Trip?
# barplot(table(all_duration$comeback),  main="How many comeback and Regular Trip?",
#         xlab="Yes or No", ylab = "Count")
# 
# # Exploration 2: Comeback & Department
# t_1 <- table(all_duration$EqcDesc)
# name = t_1 > 10000
# 
# counts <- table(all_duration$comeback, all_duration$EqcDesc)
# barplot(counts, main="Car Distribution by Gears and VS",
#         xlab="Number of Gears", col=c("darkblue","red"),
#         legend = rownames(counts))

#====================

# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit

# identify comeback
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"

# # Exploration 1: How many comeback and Regular Trip?
# barplot(table(all_duration$comeback),  main="How many comeback and Regular Trip?",
#         xlab="Yes or No", ylab = "Count")
# 
# # Exploration 2: Comeback & Department
# t_1 <- table(all_duration$EqcDesc)
# name = t_1 > 10000
# 
# counts <- table(all_duration$comeback, all_duration$EqcDesc)
# barplot(counts, main="Car Distribution by Gears and VS",
#         xlab="Number of Gears", col=c("darkblue","red"),
#         legend = rownames(counts))

#====================

p1 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "FIRE DEPARTMENT"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
    title = "Fire Department Comeback Data",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )

p2 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 1, EqcDesc == "POLICE DEPARTMENT"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
    title = "Police Department Comeback Data",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )

p3 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 3, EqcDesc == "DPW - Solid Waste"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
    title = "DPW - Solid Waste Comeback Data",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )

p4 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "DPW Water & Wastewater"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
    title = "DPW Water & Wastewater",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )

p5 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "Light Duty Emergency"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
     title = "Light Duty Emergency",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )



p6 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "Light Duty Non-Emergency"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
    title = "Light Duty Non-Emergency Comeback Data",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )








grid.arrange(p1, p2,p3,p4,p5,p6, nrow = 2, ncol = 3 )

7 Feature Engineering

7.1 Data Preparation

These are the steps for preparing data to apply feature engineering, this is an essential preparatory step.

#======Data Preparation======




# Load Data
# Rename a column in between sheets 4 and 5 because it's spelled wrong in all but sheet 5
# Turn PostDate into a lubridate



# Calculate the time since last maintenance of each vehicle and each part
all_duration  <-  dat %>%
  group_by(EhKey,RGDESC,WHKey) %>%
  summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
  arrange(EhKey,RGDESC,PostDate)%>%
  mutate(lagTime = PostDate - lag(PostDate))%>%
  mutate(leadTime = lead(PostDate)-PostDate)

# Add department information
group_list <- dat %>%
  group_by(EhKey) %>%
  summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')


# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit

# identify comeback
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
all_duration$comeback_next[all_duration$leadTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback_next[all_duration$leadTime>=all_duration$EPCycleLength_day]<- "Regular Trip"

7.2 Features

Over here we deal with feature engineering and create valuable features that will help generate an intelligent predictive model. This is also seen in the table below:

# rolling average
df_1 <- all_duration %>%
  arrange(EhKey) %>%
  mutate(date = ymd(PostDate))%>%
  arrange(PostDate)


library(tidyr)
library(geosphere)
rollavgbyperiod <- function(i,window,df){
  startdate <- df$date[i]-window
  enddate <- df$date[i]-1
  interval <- seq(startdate,enddate,1)
  tmp <- df$Totalcost[df$date %in% interval]
  return(sum(tmp))
}
roll_function_col_7 <- function(df){
  window <-7
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_30 <- function(df){
  window <-30
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_182 <- function(df){
  window <-182
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}


df_2 <- df_1 %>% 
  arrange(EhKey,PostDate)%>%
  nest(-EhKey) %>%
  group_by(EhKey) %>%
  mutate(past1week_cost = purrr::map(data, roll_function_col_7),
         past1month_cost = purrr::map(data, roll_function_col_30),
         past6month_cost = purrr::map(data, roll_function_col_182)) %>%
  unnest()


# rolling mean
rollavgbyperiod <- function(i,window,df){
  startdate <- df$date[i]-window
  enddate <- df$date[i]-1
  interval <- seq(startdate,enddate,1)
  tmp <- df$Totalcost[df$date %in% interval]
  return(mean(tmp))
}
roll_function_col_7 <- function(df){
  window <-7
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_30 <- function(df){
  window <-30
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_182 <- function(df){
  window <-182
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}



# rolling average and count
df_2 <- df_2 %>%
  arrange(EhKey,PostDate)%>%
  nest(-EhKey) %>%
  group_by(EhKey) %>%
  mutate(past1week_ave_cost = purrr::map(data, roll_function_col_7),
         past1month_ave_cost = purrr::map(data, roll_function_col_30),
         past6month_ave_cost = purrr::map(data, roll_function_col_182)) %>%
  unnest()

df_2$past1week_count = df_2$past1week_cost/df_2$past1week_ave_cost
df_2$past1month_count = df_2$past1month_cost/df_2$past1month_ave_cost
df_2$past6month_count = df_2$past6month_cost/df_2$past6month_ave_cost
df_2[["past1week_count"]][is.na(df_2[["past1week_count"]])] <- 0
df_2[["past1month_count"]][is.na(df_2[["past1month_count"]])] <- 0
df_2[["past6month_count"]][is.na(df_2[["past6month_count"]])] <- 0

# rolling comeback
df_2$comeback_next_10=0
df_2$comeback_10=0
df_2 <- within(df_2, comeback_next_10[comeback_next=='Comeback'] <- 1)
df_2 <- within(df_2, comeback_10[comeback=='Comeback'] <- 1)

rollavgbyperiod <- function(i,window,df){
  startdate <- df$date[i]-window
  enddate <- df$date[i]-1
  interval <- seq(startdate,enddate,1)
  tmp <- df$comeback_10[df$date %in% interval]
  return(sum(tmp))
}
roll_function_col_7 <- function(df){
  window <-7
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_30 <- function(df){
  window <-30
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_182 <- function(df){
  window <-182
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
df_2 <- df_2 %>%
  arrange(EhKey,PostDate)%>%
  nest(-EhKey) %>%
  group_by(EhKey) %>%
  mutate(past1week_comeback = purrr::map(data, roll_function_col_7),
         past1month_comeback = purrr::map(data, roll_function_col_30),
         past6month_comeback = purrr::map(data, roll_function_col_182)) %>%
  unnest()

df_2$same_part_lasttime = df_2$lagTime




# comeback rate of each model and make
model_rate<-df_2%>%
  group_by(Model)%>%
  summarise(comeback_rate_model = mean(comeback_10))
df_2 = left_join(df_2, model_rate, by = "Model")

make_rate<-df_2%>%
  group_by(Make)%>%
  summarise(comeback_rate_make = mean(comeback_10))
df_2 = left_join(df_2, make_rate, by = "Make")

# comeback rate of monthlyly department average
df_2$year = format(df_2$date,'%Y')
df_2$month = format(df_2$date,'%m')
dpt_month_rate<-df_2%>%
  group_by(EqcDesc,year,month)%>%
  summarise(comeback_rate_dpt_ym = mean(comeback_10))
df_2 = left_join(df_2, dpt_month_rate, by = c("EqcDesc","year","month"))

# comeback rate of yearly department average
df_2$year = format(df_2$date,'%Y')
df_2$month = format(df_2$date,'%m')
dpt_month_rate<-df_2%>%
  group_by(EqcDesc,year)%>%
  summarise(comeback_rate_dpt_y = mean(comeback_10))
df_2 = left_join(df_2, dpt_month_rate, by = c("EqcDesc","year"))


# comeback rate of each system of each type of vehicle
RGDESC_model_rate<-df_2%>%
  group_by(RGDESC,Model)%>%
  summarise(comeback_rate_RGDESC = mean(comeback_10))
df_2 = left_join(df_2, RGDESC_model_rate, by = c("RGDESC","Model"))

# === Categorical Creation ===
df_backup = df_2
#df_2 = df_backup

# Department as a categorical variable (big/small)
dpt_n<-df_2%>%
  group_by(EqcDesc)%>%
  summarise(number_of_vehicle = n())
df_2 = left_join(df_2, dpt_n, by = c("EqcDesc"))

df_2 <- df_2 %>% 
  mutate(dpt_big = case_when(number_of_vehicle>=10000 ~ "big department",
                                number_of_vehicle<10000 ~ "small department"))



# Department as a categorical variable (H, M , L, VL)
dpt_rate<-df_2%>%
  group_by(EqcDesc)%>%
  summarise(comeback_rate_dpt = mean(comeback_10))
df_2 = left_join(df_2, dpt_rate, by = c("EqcDesc"))
df_2 <- df_2 %>% 
  mutate(dpt_risk = case_when(comeback_rate_dpt>=0.5 ~ "High risk",
                             comeback_rate_dpt>=0.25 ~ "Medium risk",
                             comeback_rate_dpt>=0.15 ~ "Low risk",
                             TRUE ~ "Very Low risk"))

# Make as a categorical variable  (ex. risk score 0-5)
df_2 <- df_2 %>% 
  mutate(make_risk = case_when(comeback_rate_make>=0.5 ~ "High risk",
                             comeback_rate_make>=0.25 ~ "Medium risk",
                             comeback_rate_make>=0.15 ~ "Low risk",
                             TRUE ~ "Very Low risk"))

# Model as a categorical variable  (ex. risk score 0-5)
df_2 <- df_2 %>% 
  mutate(model_risk = case_when(comeback_rate_model>=0.8 ~ "Very High risk",
                             comeback_rate_model>=0.6 ~ "High risk",
                             comeback_rate_model>=0.3 ~ "Medium risk",
                             comeback_rate_model>=0.15 ~ "Low risk",
                             TRUE ~ "Very Low risk"))

# RGDesc as a categorical variable (1-10)
df_2 <- df_2 %>% 
  mutate(part_risk = case_when(comeback_rate_RGDESC>=0.8 ~ "Very High risk",
                             comeback_rate_RGDESC>=0.5 ~ "High risk",
                             comeback_rate_RGDESC>=0.2 ~ "Medium risk",
                             comeback_rate_RGDESC>0 ~ "Low risk",
                             TRUE ~ "Very Low risk"))
# Workshop(get the workshop first)
workshop_list <- dat %>%
  group_by(WHKey) %>%
  summarise(WHShop = first(WHShop))
df_2 <- left_join(df_2,workshop_list,by='WHKey')

Shop_rate<-df_2%>%
  group_by(WHShop)%>%
  summarise(comeback_rate_shop = mean(comeback_10))
df_2 = left_join(df_2, Shop_rate, by = c("WHShop"))
df_2 <- df_2 %>% 
  mutate(Shop_risk = case_when(comeback_rate_shop>=0.7 ~ "High risk",
                             comeback_rate_shop>=0.5 ~ "Medium risk",
                             comeback_rate_shop>=0.25 ~ "Low risk",
                             TRUE ~ "Very Low risk"))


# month
df_2 <- df_2 %>% 
  mutate(month_cat = case_when(month=="04" ~ "April",
                               month=="05" ~ "May",
                               month=="06" ~ "June",
                             TRUE ~ "else"))

RG_list <- dat %>%
  group_by(WHKey) %>%
  summarise(N_RG = n())
df_2 <- left_join(df_2,RG_list,by='WHKey')

Corelation Matrix is shown below. For the prediction models in the next step, we can not select those features that are strongly correlated (Pearson correlation smaller than 0.8).

library(corrplot)
df_Cor <-
  df_2 %>%
  dplyr::select(Totalcost,past1week_cost,past1month_cost,                                                  past6month_cost,past1week_count,past1month_count,past6month_count,past1week_ave_cost,                                                  past1month_ave_cost,past6month_ave_cost,past1week_comeback,past1month_comeback,past6month_comeback, comeback_rate_model,comeback_rate_make,comeback_rate_dpt_ym,comeback_rate_dpt_y,comeback_rate_dpt,comeback_rate_RGDESC,comeback_rate_shop)
df_Cor <- subset(df_Cor, select=-c(EhKey))
M <- cor(df_Cor,use="pairwise.complete.obs")
corrplot(M, method = "square",tl.cex = 0.5)

8 Prediction Model

8.1 Logistic Regression

The Main purpose of this model is going to predict whether the next the next trip is a comeback or not. Thus, the Logistic regression model seems to be a good choice. It can predict the probability that an observation is part of a particular group just like Comeback or Regular Trip. Before training the model, we should separate the dataset to training set and test set first. Below shows the result of the prediction model. We can see that the whether this trip is comeback or not and the comeback rate of this system are the most significant predictor.

options(scipen=10000000)
library(tidyverse)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)

palette5 <- c("#3F6973","#49A695","#F2CD88","#F2A679","#D9725B")
palette4 <- c("#00c700","#0060C7","#c70000","#DEA60B")
palette3 <- c("#00c700","#DEA60B","#c70000")
palette2 <- c("#00c700","#c70000")
df_model <- na.omit(df_2)

df_model <- subset(df_model, select=-c(EhKey,WHKey,PostDate,unit,comeback,comeback_next,RGDESC,Model))

samp <- sample(nrow(df_model),20000)
df_model_sample<- df_model[samp,]

df_model_sample_1<-df_model_sample

set.seed(11233)

trainIndex <- createDataPartition(df_model_sample$comeback_next_10, p = .70,
                                  list = FALSE,
                                  times = 1)
df_Train <- df_model_sample[ trainIndex,]
df_Test  <- df_model_sample[-trainIndex,]


new_predictor <-c("comeback_next_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength_day","comeback_10")

Logit_Model_1 <- glm(comeback_next_10 ~ .,
                  data=df_Train %>% dplyr::select(new_predictor),
                  family="binomial" (link="logit"))
testProbs_0 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
                        Probs_1 = predict(Logit_Model_1, df_Test, type= "response"))
testProbs_0 <- testProbs_0%>%
  mutate(predOutcome_1  = as.factor(ifelse(testProbs_0$Probs_1 > 0.5 , 1, 0)))
cm_1 <-confusionMatrix(testProbs_0$predOutcome_1, testProbs_0$Outcome,positive = "1")
summary(Logit_Model_1)

## 
## Call:
## glm(formula = comeback_next_10 ~ ., family = binomial(link = "logit"), 
##     data = df_Train %>% dplyr::select(new_predictor))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4960  -0.8208  -0.5066   0.8416   2.2100  
## 
## Coefficients:
##                         Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)          -2.39145893  0.12907065 -18.528 < 0.0000000000000002 ***
## comeback_rate_RGDESC  4.49741627  0.11915269  37.745 < 0.0000000000000002 ***
## past1week_comeback    0.07323652  0.03727047   1.965               0.0494 *  
## past1week_count      -0.02903997  0.01718635  -1.690               0.0911 .  
## past6month_ave_cost  -0.00001241  0.00001254  -0.990               0.3224    
## comeback_rate_shop    0.25015478  0.22090930   1.132               0.2575    
## year2015             -0.11462763  0.09629985  -1.190               0.2339    
## year2016             -0.17703553  0.09618185  -1.841               0.0657 .  
## year2017             -0.12350266  0.09607567  -1.285               0.1986    
## year2018              0.38964065  0.09779777   3.984    0.000067723030075 ***
## year2019              2.31234258  0.20844934  11.093 < 0.0000000000000002 ***
## month02              -0.00974146  0.09416247  -0.103               0.9176    
## month03              -0.21190666  0.09516550  -2.227               0.0260 *  
## month04              -0.21299367  0.09977696  -2.135               0.0328 *  
## month05              -0.09052411  0.09837170  -0.920               0.3575    
## month06              -0.02178538  0.09724580  -0.224               0.8227    
## month07               0.04344010  0.09957893   0.436               0.6627    
## month08              -0.03400486  0.09560648  -0.356               0.7221    
## month09               0.11187255  0.09550588   1.171               0.2415    
## month10              -0.12491258  0.09651563  -1.294               0.1956    
## month11               0.17711476  0.09747132   1.817               0.0692 .  
## month12               0.11231099  0.09766530   1.150               0.2502    
## EPCycleLength_day     0.00317469  0.00042179   7.527    0.000000000000052 ***
## comeback_10           0.43751845  0.04503216   9.716 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19300  on 13999  degrees of freedom
## Residual deviance: 14700  on 13976  degrees of freedom
## AIC: 14748
## 
## Number of Fisher Scoring iterations: 5

cm_1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2710 1065
##          1  577 1648
##                                                
##                Accuracy : 0.7263               
##                  95% CI : (0.7149, 0.7376)     
##     No Information Rate : 0.5478               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4388               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.6074               
##             Specificity : 0.8245               
##          Pos Pred Value : 0.7407               
##          Neg Pred Value : 0.7179               
##              Prevalence : 0.4522               
##          Detection Rate : 0.2747               
##    Detection Prevalence : 0.3708               
##       Balanced Accuracy : 0.7160               
##                                                
##        'Positive' Class : 1                    
##

8.2 Ensemble Model

In general, ensemble model is a technique of combining two or more algorithms. Doing this is to make a more robust prediction model which is able to incorporate the predictions from all base machine learning models. In this project, we are going to use weighted average Logistic Regression model. In this model, different weights are applied to each predictions from different logit models then taking the weighted average of each models’ output.

new_predictor <-c("comeback_next_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength_day","CycleDESC","N_RG")

Logit_Model_1 <- glm(comeback_next_10 ~ .,
                  data=df_Train %>% dplyr::select(new_predictor),
                  family="binomial" (link="logit"))
testProbs <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
                        Probs_1 = predict(Logit_Model_1, df_Test, type= "response"))
testProbs <- testProbs%>%
  mutate(predOutcome_1  = as.factor(ifelse(testProbs$Probs_1 > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs$predOutcome_1, testProbs$Outcome,positive = "1")

test_predictor <-c("comeback_next_10","comeback_10")

Logit_Model_2 <- glm(comeback_next_10 ~ .,
                  data=df_Train %>% dplyr::select(test_predictor),
                  family="binomial" (link="logit"))

testProbs_2 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
                        Probs = predict(Logit_Model_2, df_Test, type= "response"))
testProbs_2 <- testProbs_2%>%
  mutate(predOutcome  = as.factor(ifelse(testProbs_2$Probs > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs_2$predOutcome, testProbs_2$Outcome,positive = "1")
testProbs$Probs_2 = testProbs_2$Probs
testProbs$predOutcome_2 = testProbs_2$predOutcome

new_predictor <-c("comeback_next_10","dpt_big","dpt_risk","make_risk","model_risk","part_risk","Shop_risk")

Logit_Model_new <- glm(comeback_next_10 ~ .,
                  data=df_Train %>% dplyr::select(new_predictor),
                  family="binomial" (link="logit"))
testProbs_3 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
                        Probs = predict(Logit_Model_new, df_Test, type= "response"))
testProbs_3 <- testProbs_3%>%
  mutate(predOutcome  = as.factor(ifelse(testProbs_3$Probs > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs_3$predOutcome, testProbs_3$Outcome,positive = "1")
testProbs$Probs_3 = testProbs_3$Probs
testProbs$predOutcome_3 = testProbs_3$predOutcome

testProbs$Probs_Ensemble = (0.9*testProbs$Probs_1+0.09*testProbs$Probs_2+0.01*testProbs$Probs_3)
testProbs <- testProbs%>%
  mutate(predOutcome_Ensemble  = as.factor(ifelse(testProbs$Probs_Ensemble > 0.5 , 1, 0)))
cm_2 <-confusionMatrix(testProbs$predOutcome_Ensemble, testProbs$Outcome,positive = "1")
cm_2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2720 1098
##          1  567 1615
##                                                
##                Accuracy : 0.7225               
##                  95% CI : (0.711, 0.7338)      
##     No Information Rate : 0.5478               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4301               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.5953               
##             Specificity : 0.8275               
##          Pos Pred Value : 0.7401               
##          Neg Pred Value : 0.7124               
##              Prevalence : 0.4522               
##          Detection Rate : 0.2692               
##    Detection Prevalence : 0.3637               
##       Balanced Accuracy : 0.7114               
##                                                
##        'Positive' Class : 1                    
##

8.3 Random Forest

Random forest is also an ensemble learning method which combine a multitude of decision trees. Here we use specifically Random Forest:

library(randomForest)
df_Train_fac=df_Train %>% mutate_if(is.character, as.factor)
df_Test_fac=df_Test %>% mutate_if(is.character, as.factor)

new_predictor <-c("comeback_next_10","comeback_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength","CycleDESC","N_RG")
RF_Model_1 <- randomForest(comeback_next_10 ~ .,
                  data=df_Train_fac %>% dplyr::select(new_predictor),
                  norm.votes = TRUE, proximity = TRUE)

testProbs_RF <- data.frame(Outcome = as.factor(df_Test_fac$comeback_next_10),
                        Probs = predict(RF_Model_1, df_Test_fac, type= "response"))
testProbs_RF <- testProbs_RF%>%
  mutate(predOutcome  = as.factor(ifelse(testProbs_RF$Probs > 0.5 , 1, 0)))
cm_3 <-confusionMatrix(testProbs_RF$predOutcome, testProbs_RF$Outcome,positive = "1")
cm_3

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2683 1052
##          1  604 1661
##                                                
##                Accuracy : 0.724                
##                  95% CI : (0.7125, 0.7353)     
##     No Information Rate : 0.5478               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4348               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.6122               
##             Specificity : 0.8162               
##          Pos Pred Value : 0.7333               
##          Neg Pred Value : 0.7183               
##              Prevalence : 0.4522               
##          Detection Rate : 0.2768               
##    Detection Prevalence : 0.3775               
##       Balanced Accuracy : 0.7142               
##                                                
##        'Positive' Class : 1                    
##

8.4 XGBoost

XGBoost is an optimized distributed gradient boosting library. This algorithm implements under the Gradient Boosting framework. Parallel tree boosting is used in this algorithm which helps it to address & solve this data science problem fast and accurately.

testProbs_xgb <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
                        Probs = predict(xgb_Model_1, as.matrix(df_Test %>% dplyr::select(all_predictor_xgb))))
testProbs_xgb <- testProbs_xgb%>%
  mutate(predOutcome  = as.factor(ifelse(testProbs_xgb$Probs > 0.5 , 1, 0)))

cm_4 <-confusionMatrix(testProbs_xgb$predOutcome, testProbs_xgb$Outcome,positive = "1")
cm_4

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2432 1047
##          1  855 1666
##                                                
##                Accuracy : 0.683                
##                  95% CI : (0.6711, 0.6948)     
##     No Information Rate : 0.5478               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3562               
##                                                
##  Mcnemar's Test P-Value : 0.00001189           
##                                                
##             Sensitivity : 0.6141               
##             Specificity : 0.7399               
##          Pos Pred Value : 0.6608               
##          Neg Pred Value : 0.6991               
##              Prevalence : 0.4522               
##          Detection Rate : 0.2777               
##    Detection Prevalence : 0.4202               
##       Balanced Accuracy : 0.6770               
##                                                
##        'Positive' Class : 1                    
##

8.5 Goodness of Fit

From the table below, we can see the comparison between different models. The Random Forest model is the best for it has the highest Accuracy and Sensitivity, though its specificity is not the best. Why the sensitivity is the most important index will be explained later.

library(kableExtra)
model_table <- data.frame(Model = c("Logistic Regression","Ensemble Logistic","Random Forest","XGBoost"),
           Accuracy=c(cm_1$overall['Accuracy'],cm_2$overall['Accuracy'],cm_3$overall['Accuracy'],cm_4$overall['Accuracy']),
           Sensitivity =c(cm_1$byClass['Sensitivity'],cm_2$byClass['Sensitivity'],cm_3$byClass['Sensitivity'],cm_4$byClass['Sensitivity']),
           Specificity =c(cm_1$byClass['Specificity'],cm_2$byClass['Specificity'],cm_3$byClass['Specificity'],cm_4$byClass['Specificity']))
model_table %>%
  gather("Type", "Value",-Model) %>%
  ggplot(aes(Model, Value, fill = Type)) +
  labs(title = "Model Comparison") +
  scale_fill_manual(values = palette3) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_bw()

kable(model_table, align = "l",
      caption = "Model Comparison") %>% kable_styling()

Model Comparison
Model	Accuracy	Sensitivity	Specificity
Logistic Regression	0.7263333	0.6074456	0.8244600
Ensemble Logistic	0.7225000	0.5952820	0.8275023
Random Forest	0.7240000	0.6122374	0.8162458
XGBoost	0.6830000	0.6140804	0.7398844

The plot below the distribution of predicted probabilities for each observed outcome, comeback and regular trip, respectively shown as 1 and 0. We can see that the model performance better when prediction regular trips than predicting comeback trips.

# ggplot(testProbs, aes(x = Probs_Ensemble, fill = as.factor(Outcome))) + 
#   geom_density() +
#   facet_grid(Outcome ~ .) +
#   scale_fill_manual(values = palette2) +
#   labs(x = "Churn", y = "Density of probabilities",
#        title = "Distribution of predicted probabilities by observed outcome") +
#   theme(strip.text.x = element_text(size = 18),
#         legend.position = "none")
ggplot(testProbs_RF, aes(x = Probs, fill = as.factor(Outcome))) +
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Churn", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

If we set 0.5 as threshold, which means that, when probability is bigger than 0.5, the case will be determined as comeback. There were 1672 true positives, instances where observed trip was correctly predicted as comeback. There were 1041 false positives, instances where trip was comeback trip incorrectly predicted as regular trip. There were 2714 true negatives, instances where observed regular trip was correctly predicted. Finally, there were 573 false negatives, instances where Regular trip was incorrectly predicted as comeback.

caret::confusionMatrix(testProbs_RF$predOutcome, testProbs_RF$Outcome,
                       positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2683 1052
##          1  604 1661
##                                                
##                Accuracy : 0.724                
##                  95% CI : (0.7125, 0.7353)     
##     No Information Rate : 0.5478               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4348               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.6122               
##             Specificity : 0.8162               
##          Pos Pred Value : 0.7333               
##          Neg Pred Value : 0.7183               
##              Prevalence : 0.4522               
##          Detection Rate : 0.2768               
##    Detection Prevalence : 0.3775               
##       Balanced Accuracy : 0.7142               
##                                                
##        'Positive' Class : 1                    
##

From the result above, we can also see that the Sensitivity and Specificity of this model is 61.6% and 82.5%. And the Accuracy is 73.1%.

8.6 ROC Curves

We can see from the ROC Curves that, the curve is well above the diagonal. Which means that this model can make the meaningful prediction. The Area Under the Curve metric or AUC of this model is 0.8066 Nearly 80% percent of the data are explained.

# ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs_Ensemble)) +
#   geom_roc(n.cuts = 50, labels = FALSE, colour = "#D9725B") +
#   style_roc(theme = theme_grey) +
#   geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
#   labs(title = "ROC Curve")
ggplot(testProbs_RF, aes(d = as.numeric(testProbs_RF$Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#c70000") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve")

# auc(testProbs$Outcome, testProbs$Probs_Ensemble)
auc(testProbs_RF$Outcome, testProbs_RF$Probs)

## Area under the curve: 0.7965

8.7 Generalizability

For the Model generalizability, we are mainly focusing the accuracy differences on different departments and vehicle systems. From the results, we can see that our model has good generalizability, the accuracy of different departments is not much different, and the accuracy of different car systems is about the same. All are very close to the overall accuracy of the model.

df_model_1 <- na.omit(df_2)

df_model_sample_1<- df_model_1[samp,]
df_Test_1  <- df_model_sample_1[-trainIndex,]

df_G <- df_Test_1
df_G$Outcome = testProbs_RF$Outcome
df_G$Probs = testProbs_RF$Probs
df_G$predOutcome = testProbs_RF$predOutcome

df_G$correct_10 = (df_G$predOutcome == df_G$Outcome)


df_G_n_dpt <- df_G %>%
  group_by(EqcDesc) %>%
  summarise(number=n(),Correct = sum(correct_10=="TRUE"),Accuracy = Correct/number)%>%
  arrange(-number)

df_G_n_dpt=df_G_n_dpt[0:6,]


p1 <- ggplot(data=df_G_n_dpt, aes(x=EqcDesc, y=Accuracy)) +
  geom_bar(stat="identity", fill="#00c700",position = 'dodge',color="#00c700")+
  scale_y_continuous(labels = scales::percent)+
  labs(x = 'Top 6 Biggest Department', y = 'Accuracy', title = "Model's Generalizability - Department")+
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

df_G_n_part <- df_G %>%
  group_by(RGDESC) %>%
  summarise(number=n(),Correct = sum(correct_10=="TRUE"),Accuracy = Correct/number)%>%
  arrange(-number)


df_G_n_part=df_G_n_part[0:6,]


p2 <- ggplot(data=df_G_n_part, aes(x=RGDESC, y=Accuracy)) +
  geom_bar(stat="identity", fill="#00c700",position = 'dodge',color="#00c700")+
  scale_y_continuous(labels = scales::percent)+
  labs(x = 'Top 6 Most Repaired Vehicle System', y = 'Accuracy', title = "Model's Generalizability - System")+
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

grid.arrange(p1, p2, nrow = 1, ncol = 2 )

9 Costs Benefit analysis

If we can correctly predict whether next trip is comeback, we can help the government to save a lot of money. To be specific, if we know whether next trip is comeback, we can request the vehicle to come to the workshop earlier than the warranty. So that, the car won’t break on the road. If we can get the money we save or lost for each type of result in the confusion matrix, we can see that totally how much money we can save or lost is we change the repair time based on our prediction. Let’s assume that if we predict the next trip is a comeback, we will spend $30 to make a additional check, which can detect the hidden danger and make the vehicle do not comeback before the warranty. According to the statistic above, we also assume that a comeback trip will cost 600 dollars and a regular trip will cost 470 dollars. The Total Cost may be:

True Negative Cost: “We correctly predicted next repair trip is not come back.” We only spend 470 dollars for the next regular repair trip.
True Positive Cost: “We correctly predicted next repair trip is come back.” We spend 30 dollars to make a additional check and make the next trip from comeback to a regular repair trip which cost 470 dollars instead of 600 dollars.
False Negative Cost: “We incorrectly predicted next repair trip is not come back, but it come back.” We do not make additional check, but the comeback trip cost 600 dollars.
False Positive Cost: “We incorrectly predicted next repair trip is come back, but it do not come back.” We spend 30 dollars to make a additional check, but next trip is a regular trip which cost 470 dollars.

cost_benefit_table <-
  testProbs_RF %>%
  count(predOutcome, Outcome) %>%
  summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
            True_Positive = sum(n[predOutcome==1 & Outcome==1]),
            False_Negative = sum(n[predOutcome==0 & Outcome==1]),
            False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
  gather(Variable, Count) %>%
  bind_cols(data.frame(Each_Trip_Cost = c(
    "470","30+470","600","30+470")))%>%
  mutate(Cost =
           ifelse(Variable == "True_Negative", Count * 470,
           ifelse(Variable == "True_Positive", Count * (30+470),
           ifelse(Variable == "False_Negative",Count * 600,
           ifelse(Variable == "False_Positive",Count * (30+470), 0)))))%>%
  
  bind_cols(data.frame(Description = c(
    "We predicted next repair trip is not come back and it didn't comeback",
    "We predicted next repair trip is come back and it cameback next trip",
    "We predicted next repair trip is not come back but it cameback next trip",
    "We predicted next repair trip is come back and it didn't comeback")))

kable(cost_benefit_table, align = "l",
      caption = "Cost/Benefit Table") %>% kable_styling()

Cost/Benefit Table
Variable	Count	Each_Trip_Cost	Cost	Description
True_Negative	2683	470	1261010	We predicted next repair trip is not come back and it didn’t comeback
True_Positive	1661	30+470	830500	We predicted next repair trip is come back and it cameback next trip
False_Negative	1052	600	631200	We predicted next repair trip is not come back but it cameback next trip
False_Positive	604	30+470	302000	We predicted next repair trip is come back and it didn’t comeback

Then, we can optimize the cost and benefit analysis by changing the threshold as we have said above. The figure below plots the count for each confusion matrix metric by threshold.

iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
    
    this_prediction <-
      testProbs_RF %>%
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
      gather(Variable, Count) %>%
      mutate(Cost =
               ifelse(Variable == "True_Negative", Count * 470,
               ifelse(Variable == "True_Positive", Count * (30+470),
               ifelse(Variable == "False_Negative",Count * 600,
               ifelse(Variable == "False_Positive",Count * (30+470), 0)))),Threshold=x)
    
    all_prediction <- rbind(all_prediction, this_prediction)
    x <- x + .01
  }
  return(all_prediction)
}

whichThreshold <- iterateThresholds(testProbs_RF)
whichThreshold %>%
  ggplot(.,aes(Threshold, Count, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette4) +    
  labs(title = "Number by confusion matrix type and threshold cut off",
       y="Count") +
  guides(colour=guide_legend(title="Confusion Matrix"))

For each type of result, we already know the cost. Thus, we simply add them all up and we can know how much is the total cost. By doing this, we can know how much money we can save at most. If threshold equals to 0, it means that we predict all trips to be comeback. If threshold equals to 1, it means that we predict all trips to be regular trips, which the department doing right now, or in other words, the real case. We can see that, when threshold equals to 0.28, the total cost is the lowest. Which means that, if the predicted probability of comeback is less than 0.28, we recommend not doing any deep additional check. If the predicted probability of comeback is larger than 0.28, we recommend to do some deep additional check, because the vehicle has a high probability to comeback before the warranty.

whichThreshold_revenue <- 
  whichThreshold %>% 
    group_by(Threshold) %>% 
    summarize(Total_Cost = sum(Cost)) 
whichThreshold_revenue %>%
  dplyr::select(Threshold, Total_Cost) %>%
  gather(Variable, Value, -Threshold) %>%
  ggplot(aes(Threshold, Value, colour = Variable)) +
    geom_point() +
    scale_colour_manual(values = "#c70000") +
    labs(title = "Total Cost by threshold cut off",
       y="Total Cost")+
    geom_vline(xintercept = pull(arrange(whichThreshold_revenue, Total_Cost)[1,1]))

Then let’s focusing on the money. In this test dataset, totally 6000 trips are researched. And in these 6000 trips, the previous cost is $ 3,166,120 dollars. While if we recommend mechanics to do deep check or not based on our perdition result, the total cost will be 2,984,820 dollars, which can help the general service department to save 181,300 dollars. Around 5.72% of total cost can be saved just based on this simple prediction model.

old_cost <- max(whichThreshold_revenue$Total_Cost)
new_cost <- min(whichThreshold_revenue$Total_Cost)

save_money <- old_cost-new_cost



kable(data.frame("Business_as_Usual" = c(old_cost),"Using_our_model" = c(new_cost),"Money_Save"=c(save_money),"Percent_Save(%)"=c(save_money/old_cost*100)),
      caption = "How Much We Save",align = "l") %>% kable_styling()

How Much We Save
Business_as_Usual	Using_our_model	Money_Save	Percent_Save…
3166120	2984820	181300	5.726252

Code Appendix

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(sf))
suppressPackageStartupMessages(library(QuantPsyc))
suppressPackageStartupMessages(library(RSocrata))
suppressPackageStartupMessages(library(viridis))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(spatstat))
suppressPackageStartupMessages(library(spdep))
suppressPackageStartupMessages(library(FNN))
suppressPackageStartupMessages(library(grid))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(rgdal))
suppressPackageStartupMessages(library(tidycensus))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(rmarkdown))
suppressPackageStartupMessages(library(readxl))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(dygraphs))
suppressPackageStartupMessages(library(zoo))
suppressPackageStartupMessages(library(xts))
suppressPackageStartupMessages(library(rmdformats))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(readxl))
suppressPackageStartupMessages(library(lubridate))


library(tidyverse)
library(sf)
library(QuantPsyc)
library(RSocrata)
library(viridis)
library(caret)
library(spatstat)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(rgdal)
library(tidycensus)
library(knitr)
library(rmarkdown)
library(readxl)
library(lubridate)
library(dygraphs)
library(zoo)
library(xts)
library(rmdformats)
library(tidyverse)
library(readxl)
library(lubridate)
library(tidyverse)


setwd("/Users/akshaynagar/Desktop/To_Org/In_Out_Temp")




dat <- read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 1) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 2)) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 3)) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 4)) %>%
  rename(EPCycleLength = EPCycleLenghth) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 5)) %>%
  mutate(PostDate = ymd(as.character(PostDate)))


glimpse(dat)

dat_01  <-  dat %>%
  group_by(EhKey) %>%
  group_by(WHKey) %>%
  dplyr::slice(which.min(PostDate))
dat_02 <- dat_01 %>%
  mutate(PostDate = ymd(as.character(PostDate))) %>%
  group_by(EhKey) %>%
  arrange(PostDate) %>%
  mutate(lagTime = PostDate - lag(PostDate))

ggplot(dat_02 %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 100))+ 
  geom_histogram(aes(time), binwidth = 5)+
  geom_vline(xintercept=30, 
                color = "#F57434", size=1)+
  labs(
    title = "Time since last maintenance, All data 2015-2019",
    subtitle = "Max time 100 days, Bar = 5 days",
    x="Days",
    y="Number of records" )

fireTruck <- dat %>%
  filter(EhKey == '000024') %>%
  arrange(PostDate)



fireTruck_time  <-  fireTruck %>%
  group_by(EhKey) %>%
  group_by(WHKey) %>%
  dplyr::slice(which.min(PostDate))%>%
  mutate(PostDate = ymd(as.character(PostDate))) %>%
  group_by(EhKey) %>%
  arrange(PostDate) %>%
  mutate(lagTime = PostDate - lag(PostDate))
fireTruck_time %>%
  ggplot( aes(x=PostDate, y=lagTime)) +
  geom_line() +
  geom_point()+
  geom_hline(yintercept=60, linetype="dashed", 
                color = "#F57434", size=1)+
  labs(
    title = "Duration between two repairment, FireTruck 000024",
    x="Date",
    y="Duration")

#DONT INCLUDE 

fireTruck  %>%
  ggplot(aes(x=PostDate, y=cumsum(BhLineChr))) + geom_line() + geom_point()+
  labs(
    title = "Cumulative Maintenance fee, FireTruck 000024",
    x="Date",
    y="Dollar" )


ps1 <- dat %>%
  filter(EhKey == '078983') %>%
  arrange(PostDate)



ps1_time  <-  ps1 %>%
  group_by(EhKey) %>%
  group_by(WHKey) %>%
  dplyr::slice(which.min(PostDate))%>%
  mutate(PostDate = ymd(as.character(PostDate))) %>%
  group_by(EhKey) %>%
  arrange(PostDate) %>%
  mutate(lagTime = PostDate - lag(PostDate))
ps1_time %>%
  ggplot( aes(x=PostDate, y=lagTime)) +
  geom_line() +
  geom_point()+
  geom_hline(yintercept=30, linetype="dashed", 
                color = "#F57434", size=1)+
  labs(
    title = "Duration between two repairment, Police Car 078983",
    x="Date",
    y="Duration")


Police_car <- dat %>%
  filter(EhKey == '078983') %>%
  arrange(PostDate)


Police_car  %>%
  ggplot(aes(x=PostDate, y=cumsum(BhLineChr))) + geom_line() + geom_point()+
  labs(
    title = "Cumulative Maintenance fee, Police_car 078983",
    x="Date",
    y="Dollar" )

# Calculate the time since last maintenance of each vehicle and each part
all_duration  <-  dat %>%
  group_by(EhKey,RGDESC,WHKey) %>%
  summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
  arrange(EhKey,RGDESC,PostDate)%>%
  mutate(lagTime = PostDate - lag(PostDate))

# Add department information
group_list <- dat %>%
  group_by(EhKey) %>%
  summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')


# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit






all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
# Exploration 1: How many comeback and Regular Trip?
all_duration_plot<-
  all_duration %>%
  group_by(comeback) %>%
  summarise(number = n())





ggplot(data=all_duration_plot[!is.na(all_duration_plot$comeback),], aes(x=comeback, y=number)) +
  geom_bar(stat="identity",width=0.2,fill = c("#c70000","#00c700"))+
  geom_text(aes(label=number), position=position_dodge(width=0.9), vjust=-0.25)+
  labs(
    title = "How many comeback and Regular Trip totally?",
    y="Total number")
# Exploration 3: Average Cost different between comeback and Regular Trip
all_duration_plot<-  
  all_duration %>% 
  group_by(comeback) %>% 
  summarise(Cost = mean(Totalcost))

ggplot(data=all_duration_plot[!is.na(all_duration_plot$comeback),], aes(x=comeback, y=round(Cost, digits = 2))) +
  geom_bar(stat="identity",width=0.2,fill = c("#c70000","#00c700"))+
  geom_text(aes(label=round(Cost, digits = 2)), position=position_dodge(width=0.9), vjust=-0.25)+
  labs(
    title = "How much money cost for each comeback/Regular Trip?",
    y="Average cost" )
# Calculate the time since last maintenance of each vehicle and each part
all_duration  <-  dat %>%
  group_by(EhKey,RGDESC,WHKey) %>%
  summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
  arrange(EhKey,RGDESC,PostDate,)%>%
  mutate(lagTime = PostDate - lag(PostDate))

# Add department information
group_list <- dat %>%
  group_by(EhKey) %>%
  summarise(EqcDesc = first(EqcDesc), EPCycleLength=first(EPCycleLength), Make=first(Make),Model=first(Model),CycleDESC =first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')
# Interactive - Cost's change accross time




all_duration_try  <-  all_duration %>%
  group_by(PostDate) %>%
  summarise(Totalcost_byDay = sum(Totalcost))

library(dygraphs)
library(zoo)
library(xts)
library(Lahman)

my_weekday_date <- all_duration_try %>% dplyr::select(PostDate) %>% pull
my_weekday_view <- all_duration_try %>% dplyr::select(Totalcost_byDay) %>% pull
my_zoo_weekday <- zoo(my_weekday_view, my_weekday_date)
my_zoo_weekday_xts <- as.xts(my_zoo_weekday, order.by = my_weekday_date)

dygraph(my_zoo_weekday_xts, main = "Cost of each day") %>% 
  dyRangeSelector(dateWindow = c("2017-01-01", "2018-05-17"))

# Interactive - Scatterplot of cost vs duration by all groups
library(plotly)  
p <- sample_n(all_duration, 10000) %>%
  plot_ly(
    type = 'scatter', x = ~lagTime, y = ~Totalcost, text = ~RGDESC,
    hoverinfo = 'text',
    mode = 'markers',
    transforms = list(
      list(
        type = 'filter',
        target = ~EqcDesc,
        operation = '=',
        value = unique(all_duration$EqcDesc)[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = list(
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[1]),
                 label = unique(all_duration$EqcDesc)[1]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[2]),
                 label = unique(all_duration$EqcDesc)[2]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[3]),
                 label = unique(all_duration$EqcDesc)[3]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[3]),
                 label = unique(all_duration$EqcDesc)[4]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[5]),
                 label = unique(all_duration$EqcDesc)[5]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[6]),
                 label = unique(all_duration$EqcDesc)[6]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[7]),
                 label = unique(all_duration$EqcDesc)[7]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[8]),
                 label = unique(all_duration$EqcDesc)[8]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[9]),
                 label = unique(all_duration$EqcDesc)[9]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[10]),
                 label = unique(all_duration$EqcDesc)[10]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[11]),
                 label = unique(all_duration$EqcDesc)[11]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[12]),
                 label = unique(all_duration$EqcDesc)[12]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[13]),
                 label = unique(all_duration$EqcDesc)[13]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[14]),
                 label = unique(all_duration$EqcDesc)[14]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[15]),
                 label = unique(all_duration$EqcDesc)[15]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[16]),
                 label = unique(all_duration$EqcDesc)[16]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[17]),
                 label = unique(all_duration$EqcDesc)[17]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[18]),
                 label = unique(all_duration$EqcDesc)[18]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[19]),
                 label = unique(all_duration$EqcDesc)[19]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[20]),
                 label = unique(all_duration$EqcDesc)[20]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[21]),
                 label = unique(all_duration$EqcDesc)[21])
          )
        )
      )
    )
p2 <- sample_n(all_duration, 10000) %>%
  plot_ly(
    type = 'scatter', x = ~lagTime, y = ~Totalcost, text = ~RGDESC,
    hoverinfo = 'text',
    mode = 'markers',
    marker = list(color = 'rgba(255, 100, 100,0.9)'),
    transforms = list(
      list(
        type = 'filter',
        target = ~EqcDesc,
        operation = '=',
        value = unique(all_duration$EqcDesc)[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = list(
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[1]),
                 label = unique(all_duration$EqcDesc)[1]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[2]),
                 label = unique(all_duration$EqcDesc)[2]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[3]),
                 label = unique(all_duration$EqcDesc)[3]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[3]),
                 label = unique(all_duration$EqcDesc)[4]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[5]),
                 label = unique(all_duration$EqcDesc)[5]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[6]),
                 label = unique(all_duration$EqcDesc)[6]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[7]),
                 label = unique(all_duration$EqcDesc)[7]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[8]),
                 label = unique(all_duration$EqcDesc)[8]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[9]),
                 label = unique(all_duration$EqcDesc)[9]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[10]),
                 label = unique(all_duration$EqcDesc)[10]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[11]),
                 label = unique(all_duration$EqcDesc)[11]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[12]),
                 label = unique(all_duration$EqcDesc)[12]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[13]),
                 label = unique(all_duration$EqcDesc)[13]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[14]),
                 label = unique(all_duration$EqcDesc)[14]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[15]),
                 label = unique(all_duration$EqcDesc)[15]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[16]),
                 label = unique(all_duration$EqcDesc)[16]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[17]),
                 label = unique(all_duration$EqcDesc)[17]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[18]),
                 label = unique(all_duration$EqcDesc)[18]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[19]),
                 label = unique(all_duration$EqcDesc)[19]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[20]),
                 label = unique(all_duration$EqcDesc)[20]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$EqcDesc)[21]),
                 label = unique(all_duration$EqcDesc)[21])
          )
        )
      )
    )

p

p2
# Duration distribution by repair part type
p3 <- all_duration %>%
  filter( lagTime > 5, EqcDesc == "FIRE DEPARTMENT")%>%
  plot_ly(
    type = 'histogram', x = ~lagTime, 
    transforms = list(
      list(
        type = 'filter',
        target = ~RGDESC,
        operation = '=',
        value = unique(all_duration$RGDESC)[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = list(
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[1]),
                 label = unique(all_duration$RGDESC)[1]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[2]),
                 label = unique(all_duration$RGDESC)[2]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[3]),
                 label = unique(all_duration$RGDESC)[3]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[3]),
                 label = unique(all_duration$RGDESC)[4]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[5]),
                 label = unique(all_duration$RGDESC)[5]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[6]),
                 label = unique(all_duration$RGDESC)[6]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[7]),
                 label = unique(all_duration$RGDESC)[7]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[8]),
                 label = unique(all_duration$RGDESC)[8]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[9]),
                 label = unique(all_duration$RGDESC)[9]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[10]),
                 label = unique(all_duration$RGDESC)[10]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[11]),
                 label = unique(all_duration$RGDESC)[11]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[12]),
                 label = unique(all_duration$RGDESC)[12]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[13]),
                 label = unique(all_duration$RGDESC)[13]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[14]),
                 label = unique(all_duration$RGDESC)[14]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[15]),
                 label = unique(all_duration$RGDESC)[15]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[16]),
                 label = unique(all_duration$RGDESC)[16]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[17]),
                 label = unique(all_duration$RGDESC)[17]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[18]),
                 label = unique(all_duration$RGDESC)[18]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[19]),
                 label = unique(all_duration$RGDESC)[19]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[20]),
                 label = unique(all_duration$RGDESC)[20]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[21]),
                 label = unique(all_duration$RGDESC)[21])
          )
        )
      )
    )

p4 <- all_duration %>%
  filter( lagTime > 5, EqcDesc == "FIRE DEPARTMENT")%>%
  plot_ly(
    type = 'histogram', x = ~lagTime, 
    marker = list(color = 'rgba(255, 100, 100,0.9)'),
    transforms = list(
      list(
        type = 'filter',
        target = ~RGDESC,
        operation = '=',
        value = unique(all_duration$RGDESC)[1]
      )
    )) %>% layout(
      updatemenus = list(
        list(
          type = 'dropdown',
          active = 0,
          buttons = list(
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[1]),
                 label = unique(all_duration$RGDESC)[1]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[2]),
                 label = unique(all_duration$RGDESC)[2]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[3]),
                 label = unique(all_duration$RGDESC)[3]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[3]),
                 label = unique(all_duration$RGDESC)[4]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[5]),
                 label = unique(all_duration$RGDESC)[5]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[6]),
                 label = unique(all_duration$RGDESC)[6]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[7]),
                 label = unique(all_duration$RGDESC)[7]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[8]),
                 label = unique(all_duration$RGDESC)[8]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[9]),
                 label = unique(all_duration$RGDESC)[9]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[10]),
                 label = unique(all_duration$RGDESC)[10]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[11]),
                 label = unique(all_duration$RGDESC)[11]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[12]),
                 label = unique(all_duration$RGDESC)[12]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[13]),
                 label = unique(all_duration$RGDESC)[13]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[14]),
                 label = unique(all_duration$RGDESC)[14]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[15]),
                 label = unique(all_duration$RGDESC)[15]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[16]),
                 label = unique(all_duration$RGDESC)[16]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[17]),
                 label = unique(all_duration$RGDESC)[17]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[18]),
                 label = unique(all_duration$RGDESC)[18]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[19]),
                 label = unique(all_duration$RGDESC)[19]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[20]),
                 label = unique(all_duration$RGDESC)[20]),
            list(method = "restyle",
                 args = list("transforms[0].value", unique(all_duration$RGDESC)[21]),
                 label = unique(all_duration$RGDESC)[21])
          )
        )
      )
    )
p3




p4
p1 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "FIRE DEPARTMENT"))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 180), colour = "#FA7800",size=1)+
  geom_text(aes(x=180, label="EPCycle 6 months", y=1000), colour="blue")+
  labs(
    title = " Fire vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )

p2<-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 1,EqcDesc == "POLICE DEPARTMENT"))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 30), colour = "#FA7800",size=1)+
  geom_text(aes(x=30, label="EPCycle 1 months", y=9200), colour="blue")+
  labs(
    title = " Police vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )


p3 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365,EPCycleLength == 3, EqcDesc == "DPW - Solid Waste"))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 90), colour = "#FA7800",size=1)+
  geom_text(aes(x=90, label="EPCycle 3 months", y=3500), colour="blue")+
  labs(
    title = "Solid Waste vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )

p4 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365,EPCycleLength == 4, EqcDesc == "DPW Water & Wastewater"))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 120), colour = "#FA7800",size=1)+
  geom_text(aes(x=120, label="EPCycle 4 months", y=2400), colour="blue")+
  labs(
    title = " Water & Wastewater vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )

p5 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 1,EqcDesc == "Light Duty Emergency"))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 30), colour = "#FA7800",size=1)+
  geom_text(aes(x=30, label="EPCycle 1 months", y=3000), colour="blue")+
  labs(
    title = " Light Duty Emergency vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )

p6 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EqcDesc == "Light Duty Non-Emergency",EPCycleLength == 4))+ 
  geom_histogram(aes(time), binwidth = 10)+
  geom_vline(aes(xintercept = 120), colour = "#FA7800",size=1)+
  geom_text(aes(x=120, label="EPCycle 4 months", y=550), colour="blue")+
  labs(
    title = " Light Duty Non-Emergency vehicle",
    subtitle = "Max time 365 days, Bar = 10 days",
    x="Days",
    y="Number of records" )

grid.arrange(p1, p2,p3,p4,p5,p6, nrow = 2, ncol = 3 )

dat <- read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 1) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 2)) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 3)) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 4)) %>%
  rename(EPCycleLength = EPCycleLenghth) %>%
  rbind(read_xlsx(path = "Comeback_data_12_6_2019.xlsx", sheet = 5)) %>%
  mutate(PostDate = ymd(as.character(PostDate)))

# Calculate the time since last maintenance of each vehicle and each part
all_duration  <-  dat %>%
  group_by(EhKey,RGDESC,WHKey) %>%
  summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
  arrange(EhKey,RGDESC,PostDate)%>%
  mutate(lagTime = PostDate - lag(PostDate))

# Add department information
group_list <- dat %>%
  group_by(EhKey) %>%
  summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')

if(all_duration$CycleDESC == "MONTHS" )
  {
  all_duration$CycleDESC_day = all_duration$EPCycleLength*30
}
if(all_duration$CycleDESC == "DO NOT USE SET A MONTHLY DATE")
{
  all_duration$CycleDESC_day = all_duration$EPCycleLength*30
}

# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit

# identify comeback
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"

# # Exploration 1: How many comeback and Regular Trip?
# barplot(table(all_duration$comeback),  main="How many comeback and Regular Trip?",
#         xlab="Yes or No", ylab = "Count")
# 
# # Exploration 2: Comeback & Department
# t_1 <- table(all_duration$EqcDesc)
# name = t_1 > 10000
# 
# counts <- table(all_duration$comeback, all_duration$EqcDesc)
# barplot(counts, main="Car Distribution by Gears and VS",
#         xlab="Number of Gears", col=c("darkblue","red"),
#         legend = rownames(counts))

#====================





# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit

# identify comeback
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"

# # Exploration 1: How many comeback and Regular Trip?
# barplot(table(all_duration$comeback),  main="How many comeback and Regular Trip?",
#         xlab="Yes or No", ylab = "Count")
# 
# # Exploration 2: Comeback & Department
# t_1 <- table(all_duration$EqcDesc)
# name = t_1 > 10000
# 
# counts <- table(all_duration$comeback, all_duration$EqcDesc)
# barplot(counts, main="Car Distribution by Gears and VS",
#         xlab="Number of Gears", col=c("darkblue","red"),
#         legend = rownames(counts))

#====================

p1 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "FIRE DEPARTMENT"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
    title = "Fire Department Comeback Data",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )

p2 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 1, EqcDesc == "POLICE DEPARTMENT"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
    title = "Police Department Comeback Data",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )

p3 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 3, EqcDesc == "DPW - Solid Waste"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
    title = "DPW - Solid Waste Comeback Data",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )

p4 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "DPW Water & Wastewater"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
    title = "DPW Water & Wastewater",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )

p5 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "Light Duty Emergency"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
     title = "Light Duty Emergency",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )



p6 <-ggplot(all_duration %>%
         mutate(time = as.numeric(lagTime, "days")) %>%
         filter( time > 1, time < 365, EPCycleLength == 6, EqcDesc == "Light Duty Non-Emergency"),aes(x=comeback))+
  geom_bar(width=0.2,fill = c("#c70000","#00c700"))+
  labs(
    title = "Light Duty Non-Emergency Comeback Data",
    x="Comeback or a Regular Non Comeback Repair Trip",
    y="Total Count" )








grid.arrange(p1, p2,p3,p4,p5,p6, nrow = 2, ncol = 3 )
#======Data Preparation======




# Load Data
# Rename a column in between sheets 4 and 5 because it's spelled wrong in all but sheet 5
# Turn PostDate into a lubridate



# Calculate the time since last maintenance of each vehicle and each part
all_duration  <-  dat %>%
  group_by(EhKey,RGDESC,WHKey) %>%
  summarise(Totalcost = sum(BhLineChr),PostDate =min(PostDate))%>%
  arrange(EhKey,RGDESC,PostDate)%>%
  mutate(lagTime = PostDate - lag(PostDate))%>%
  mutate(leadTime = lead(PostDate)-PostDate)

# Add department information
group_list <- dat %>%
  group_by(EhKey) %>%
  summarise(EqcDesc = first(EqcDesc),EPCycleLength= first(EPCycleLength),Make=first(Make),Model=first(Model),CycleDESC=first(CycleDESC))
all_duration <- right_join(group_list,all_duration,by='EhKey')


# convert all epcycle to days
all_duration$unit[all_duration$CycleDESC == "MONTHS"]<- 30
all_duration$unit[all_duration$CycleDESC == "WEEKS"]<- 7
all_duration$unit[all_duration$CycleDESC == "DAYS"]<- 1
all_duration$EPCycleLength_day = all_duration$EPCycleLength*all_duration$unit

# identify comeback
all_duration$comeback[all_duration$lagTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback[all_duration$lagTime>=all_duration$EPCycleLength_day]<- "Regular Trip"
all_duration$comeback_next[all_duration$leadTime<all_duration$EPCycleLength_day]<- "Comeback"
all_duration$comeback_next[all_duration$leadTime>=all_duration$EPCycleLength_day]<- "Regular Trip"

# rolling average
df_1 <- all_duration %>%
  arrange(EhKey) %>%
  mutate(date = ymd(PostDate))%>%
  arrange(PostDate)


library(tidyr)
library(geosphere)
rollavgbyperiod <- function(i,window,df){
  startdate <- df$date[i]-window
  enddate <- df$date[i]-1
  interval <- seq(startdate,enddate,1)
  tmp <- df$Totalcost[df$date %in% interval]
  return(sum(tmp))
}
roll_function_col_7 <- function(df){
  window <-7
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_30 <- function(df){
  window <-30
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_182 <- function(df){
  window <-182
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}


df_2 <- df_1 %>% 
  arrange(EhKey,PostDate)%>%
  nest(-EhKey) %>%
  group_by(EhKey) %>%
  mutate(past1week_cost = purrr::map(data, roll_function_col_7),
         past1month_cost = purrr::map(data, roll_function_col_30),
         past6month_cost = purrr::map(data, roll_function_col_182)) %>%
  unnest()


# rolling mean
rollavgbyperiod <- function(i,window,df){
  startdate <- df$date[i]-window
  enddate <- df$date[i]-1
  interval <- seq(startdate,enddate,1)
  tmp <- df$Totalcost[df$date %in% interval]
  return(mean(tmp))
}
roll_function_col_7 <- function(df){
  window <-7
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_30 <- function(df){
  window <-30
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_182 <- function(df){
  window <-182
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}



# rolling average and count
df_2 <- df_2 %>%
  arrange(EhKey,PostDate)%>%
  nest(-EhKey) %>%
  group_by(EhKey) %>%
  mutate(past1week_ave_cost = purrr::map(data, roll_function_col_7),
         past1month_ave_cost = purrr::map(data, roll_function_col_30),
         past6month_ave_cost = purrr::map(data, roll_function_col_182)) %>%
  unnest()

df_2$past1week_count = df_2$past1week_cost/df_2$past1week_ave_cost
df_2$past1month_count = df_2$past1month_cost/df_2$past1month_ave_cost
df_2$past6month_count = df_2$past6month_cost/df_2$past6month_ave_cost
df_2[["past1week_count"]][is.na(df_2[["past1week_count"]])] <- 0
df_2[["past1month_count"]][is.na(df_2[["past1month_count"]])] <- 0
df_2[["past6month_count"]][is.na(df_2[["past6month_count"]])] <- 0

# rolling comeback
df_2$comeback_next_10=0
df_2$comeback_10=0
df_2 <- within(df_2, comeback_next_10[comeback_next=='Comeback'] <- 1)
df_2 <- within(df_2, comeback_10[comeback=='Comeback'] <- 1)

rollavgbyperiod <- function(i,window,df){
  startdate <- df$date[i]-window
  enddate <- df$date[i]-1
  interval <- seq(startdate,enddate,1)
  tmp <- df$comeback_10[df$date %in% interval]
  return(sum(tmp))
}
roll_function_col_7 <- function(df){
  window <-7
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_30 <- function(df){
  window <-30
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
roll_function_col_182 <- function(df){
  window <-182
  Pastweekaverage <- sapply(1:length(df$date),function(m) rollavgbyperiod(m,window,df))
  Pastweekaverage[is.nan(Pastweekaverage)] <- 0
  return(Pastweekaverage)
}
df_2 <- df_2 %>%
  arrange(EhKey,PostDate)%>%
  nest(-EhKey) %>%
  group_by(EhKey) %>%
  mutate(past1week_comeback = purrr::map(data, roll_function_col_7),
         past1month_comeback = purrr::map(data, roll_function_col_30),
         past6month_comeback = purrr::map(data, roll_function_col_182)) %>%
  unnest()

df_2$same_part_lasttime = df_2$lagTime




# comeback rate of each model and make
model_rate<-df_2%>%
  group_by(Model)%>%
  summarise(comeback_rate_model = mean(comeback_10))
df_2 = left_join(df_2, model_rate, by = "Model")

make_rate<-df_2%>%
  group_by(Make)%>%
  summarise(comeback_rate_make = mean(comeback_10))
df_2 = left_join(df_2, make_rate, by = "Make")

# comeback rate of monthlyly department average
df_2$year = format(df_2$date,'%Y')
df_2$month = format(df_2$date,'%m')
dpt_month_rate<-df_2%>%
  group_by(EqcDesc,year,month)%>%
  summarise(comeback_rate_dpt_ym = mean(comeback_10))
df_2 = left_join(df_2, dpt_month_rate, by = c("EqcDesc","year","month"))

# comeback rate of yearly department average
df_2$year = format(df_2$date,'%Y')
df_2$month = format(df_2$date,'%m')
dpt_month_rate<-df_2%>%
  group_by(EqcDesc,year)%>%
  summarise(comeback_rate_dpt_y = mean(comeback_10))
df_2 = left_join(df_2, dpt_month_rate, by = c("EqcDesc","year"))


# comeback rate of each system of each type of vehicle
RGDESC_model_rate<-df_2%>%
  group_by(RGDESC,Model)%>%
  summarise(comeback_rate_RGDESC = mean(comeback_10))
df_2 = left_join(df_2, RGDESC_model_rate, by = c("RGDESC","Model"))

# === Categorical Creation ===
df_backup = df_2
#df_2 = df_backup

# Department as a categorical variable (big/small)
dpt_n<-df_2%>%
  group_by(EqcDesc)%>%
  summarise(number_of_vehicle = n())
df_2 = left_join(df_2, dpt_n, by = c("EqcDesc"))

df_2 <- df_2 %>% 
  mutate(dpt_big = case_when(number_of_vehicle>=10000 ~ "big department",
                                number_of_vehicle<10000 ~ "small department"))



# Department as a categorical variable (H, M , L, VL)
dpt_rate<-df_2%>%
  group_by(EqcDesc)%>%
  summarise(comeback_rate_dpt = mean(comeback_10))
df_2 = left_join(df_2, dpt_rate, by = c("EqcDesc"))
df_2 <- df_2 %>% 
  mutate(dpt_risk = case_when(comeback_rate_dpt>=0.5 ~ "High risk",
                             comeback_rate_dpt>=0.25 ~ "Medium risk",
                             comeback_rate_dpt>=0.15 ~ "Low risk",
                             TRUE ~ "Very Low risk"))

# Make as a categorical variable  (ex. risk score 0-5)
df_2 <- df_2 %>% 
  mutate(make_risk = case_when(comeback_rate_make>=0.5 ~ "High risk",
                             comeback_rate_make>=0.25 ~ "Medium risk",
                             comeback_rate_make>=0.15 ~ "Low risk",
                             TRUE ~ "Very Low risk"))

# Model as a categorical variable  (ex. risk score 0-5)
df_2 <- df_2 %>% 
  mutate(model_risk = case_when(comeback_rate_model>=0.8 ~ "Very High risk",
                             comeback_rate_model>=0.6 ~ "High risk",
                             comeback_rate_model>=0.3 ~ "Medium risk",
                             comeback_rate_model>=0.15 ~ "Low risk",
                             TRUE ~ "Very Low risk"))

# RGDesc as a categorical variable (1-10)
df_2 <- df_2 %>% 
  mutate(part_risk = case_when(comeback_rate_RGDESC>=0.8 ~ "Very High risk",
                             comeback_rate_RGDESC>=0.5 ~ "High risk",
                             comeback_rate_RGDESC>=0.2 ~ "Medium risk",
                             comeback_rate_RGDESC>0 ~ "Low risk",
                             TRUE ~ "Very Low risk"))
# Workshop(get the workshop first)
workshop_list <- dat %>%
  group_by(WHKey) %>%
  summarise(WHShop = first(WHShop))
df_2 <- left_join(df_2,workshop_list,by='WHKey')

Shop_rate<-df_2%>%
  group_by(WHShop)%>%
  summarise(comeback_rate_shop = mean(comeback_10))
df_2 = left_join(df_2, Shop_rate, by = c("WHShop"))
df_2 <- df_2 %>% 
  mutate(Shop_risk = case_when(comeback_rate_shop>=0.7 ~ "High risk",
                             comeback_rate_shop>=0.5 ~ "Medium risk",
                             comeback_rate_shop>=0.25 ~ "Low risk",
                             TRUE ~ "Very Low risk"))


# month
df_2 <- df_2 %>% 
  mutate(month_cat = case_when(month=="04" ~ "April",
                               month=="05" ~ "May",
                               month=="06" ~ "June",
                             TRUE ~ "else"))

RG_list <- dat %>%
  group_by(WHKey) %>%
  summarise(N_RG = n())
df_2 <- left_join(df_2,RG_list,by='WHKey')


library(corrplot)
df_Cor <-
  df_2 %>%
  dplyr::select(Totalcost,past1week_cost,past1month_cost,                                                  past6month_cost,past1week_count,past1month_count,past6month_count,past1week_ave_cost,                                                  past1month_ave_cost,past6month_ave_cost,past1week_comeback,past1month_comeback,past6month_comeback, comeback_rate_model,comeback_rate_make,comeback_rate_dpt_ym,comeback_rate_dpt_y,comeback_rate_dpt,comeback_rate_RGDESC,comeback_rate_shop)
df_Cor <- subset(df_Cor, select=-c(EhKey))
M <- cor(df_Cor,use="pairwise.complete.obs")
corrplot(M, method = "square",tl.cex = 0.5)
options(scipen=10000000)
library(tidyverse)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)

palette5 <- c("#3F6973","#49A695","#F2CD88","#F2A679","#D9725B")
palette4 <- c("#00c700","#0060C7","#c70000","#DEA60B")
palette3 <- c("#00c700","#DEA60B","#c70000")
palette2 <- c("#00c700","#c70000")
df_model <- na.omit(df_2)

df_model <- subset(df_model, select=-c(EhKey,WHKey,PostDate,unit,comeback,comeback_next,RGDESC,Model))

samp <- sample(nrow(df_model),20000)
df_model_sample<- df_model[samp,]

df_model_sample_1<-df_model_sample

set.seed(11233)

trainIndex <- createDataPartition(df_model_sample$comeback_next_10, p = .70,
                                  list = FALSE,
                                  times = 1)
df_Train <- df_model_sample[ trainIndex,]
df_Test  <- df_model_sample[-trainIndex,]


new_predictor <-c("comeback_next_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength_day","comeback_10")

Logit_Model_1 <- glm(comeback_next_10 ~ .,
                  data=df_Train %>% dplyr::select(new_predictor),
                  family="binomial" (link="logit"))
testProbs_0 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
                        Probs_1 = predict(Logit_Model_1, df_Test, type= "response"))
testProbs_0 <- testProbs_0%>%
  mutate(predOutcome_1  = as.factor(ifelse(testProbs_0$Probs_1 > 0.5 , 1, 0)))
cm_1 <-confusionMatrix(testProbs_0$predOutcome_1, testProbs_0$Outcome,positive = "1")
summary(Logit_Model_1)
cm_1
new_predictor <-c("comeback_next_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength_day","CycleDESC","N_RG")

Logit_Model_1 <- glm(comeback_next_10 ~ .,
                  data=df_Train %>% dplyr::select(new_predictor),
                  family="binomial" (link="logit"))
testProbs <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
                        Probs_1 = predict(Logit_Model_1, df_Test, type= "response"))
testProbs <- testProbs%>%
  mutate(predOutcome_1  = as.factor(ifelse(testProbs$Probs_1 > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs$predOutcome_1, testProbs$Outcome,positive = "1")

test_predictor <-c("comeback_next_10","comeback_10")

Logit_Model_2 <- glm(comeback_next_10 ~ .,
                  data=df_Train %>% dplyr::select(test_predictor),
                  family="binomial" (link="logit"))

testProbs_2 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
                        Probs = predict(Logit_Model_2, df_Test, type= "response"))
testProbs_2 <- testProbs_2%>%
  mutate(predOutcome  = as.factor(ifelse(testProbs_2$Probs > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs_2$predOutcome, testProbs_2$Outcome,positive = "1")
testProbs$Probs_2 = testProbs_2$Probs
testProbs$predOutcome_2 = testProbs_2$predOutcome

new_predictor <-c("comeback_next_10","dpt_big","dpt_risk","make_risk","model_risk","part_risk","Shop_risk")

Logit_Model_new <- glm(comeback_next_10 ~ .,
                  data=df_Train %>% dplyr::select(new_predictor),
                  family="binomial" (link="logit"))
testProbs_3 <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
                        Probs = predict(Logit_Model_new, df_Test, type= "response"))
testProbs_3 <- testProbs_3%>%
  mutate(predOutcome  = as.factor(ifelse(testProbs_3$Probs > 0.5 , 1, 0)))
cm <-confusionMatrix(testProbs_3$predOutcome, testProbs_3$Outcome,positive = "1")
testProbs$Probs_3 = testProbs_3$Probs
testProbs$predOutcome_3 = testProbs_3$predOutcome

testProbs$Probs_Ensemble = (0.9*testProbs$Probs_1+0.09*testProbs$Probs_2+0.01*testProbs$Probs_3)
testProbs <- testProbs%>%
  mutate(predOutcome_Ensemble  = as.factor(ifelse(testProbs$Probs_Ensemble > 0.5 , 1, 0)))
cm_2 <-confusionMatrix(testProbs$predOutcome_Ensemble, testProbs$Outcome,positive = "1")
cm_2
library(randomForest)
df_Train_fac=df_Train %>% mutate_if(is.character, as.factor)
df_Test_fac=df_Test %>% mutate_if(is.character, as.factor)

new_predictor <-c("comeback_next_10","comeback_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","year","month","EPCycleLength","CycleDESC","N_RG")
RF_Model_1 <- randomForest(comeback_next_10 ~ .,
                  data=df_Train_fac %>% dplyr::select(new_predictor),
                  norm.votes = TRUE, proximity = TRUE)

testProbs_RF <- data.frame(Outcome = as.factor(df_Test_fac$comeback_next_10),
                        Probs = predict(RF_Model_1, df_Test_fac, type= "response"))
testProbs_RF <- testProbs_RF%>%
  mutate(predOutcome  = as.factor(ifelse(testProbs_RF$Probs > 0.5 , 1, 0)))
cm_3 <-confusionMatrix(testProbs_RF$predOutcome, testProbs_RF$Outcome,positive = "1")
cm_3

library(xgboost)
all_predictor_xgb <- c("Totalcost","past1week_cost","past1month_cost",
                                                  "past6month_cost","past1week_count","past1month_count","past6month_count","past1week_ave_cost",
                                                  "past1month_ave_cost","past6month_ave_cost","comeback_10","past1week_comeback","past1month_comeback",
                                                "past6month_comeback","comeback_rate_model","comeback_rate_make","comeback_rate_dpt_ym","comeback_rate_dpt_y",
                                              "comeback_rate_RGDESC","comeback_rate_shop","comeback_rate_dpt","N_RG")

new_predictor <-c("comeback_10","comeback_rate_RGDESC","past1week_comeback","past1week_count","past6month_ave_cost","comeback_rate_shop","EPCycleLength","N_RG")

xgb_Model_1 <- xgboost(data=as.matrix(df_Train %>% dplyr::select(all_predictor_xgb)),label = df_Train$comeback_next_10,max.depth = 50, eta = 1, nthread = 2, nrounds = 10, objective = "binary:logistic")


testProbs_xgb <- data.frame(Outcome = as.factor(df_Test$comeback_next_10),
                        Probs = predict(xgb_Model_1, as.matrix(df_Test %>% dplyr::select(all_predictor_xgb))))
testProbs_xgb <- testProbs_xgb%>%
  mutate(predOutcome  = as.factor(ifelse(testProbs_xgb$Probs > 0.5 , 1, 0)))

cm_4 <-confusionMatrix(testProbs_xgb$predOutcome, testProbs_xgb$Outcome,positive = "1")
cm_4

library(kableExtra)
model_table <- data.frame(Model = c("Logistic Regression","Ensemble Logistic","Random Forest","XGBoost"),
           Accuracy=c(cm_1$overall['Accuracy'],cm_2$overall['Accuracy'],cm_3$overall['Accuracy'],cm_4$overall['Accuracy']),
           Sensitivity =c(cm_1$byClass['Sensitivity'],cm_2$byClass['Sensitivity'],cm_3$byClass['Sensitivity'],cm_4$byClass['Sensitivity']),
           Specificity =c(cm_1$byClass['Specificity'],cm_2$byClass['Specificity'],cm_3$byClass['Specificity'],cm_4$byClass['Specificity']))
model_table %>%
  gather("Type", "Value",-Model) %>%
  ggplot(aes(Model, Value, fill = Type)) +
  labs(title = "Model Comparison") +
  scale_fill_manual(values = palette3) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_bw()
kable(model_table, align = "l",
      caption = "Model Comparison") %>% kable_styling()

# ggplot(testProbs, aes(x = Probs_Ensemble, fill = as.factor(Outcome))) + 
#   geom_density() +
#   facet_grid(Outcome ~ .) +
#   scale_fill_manual(values = palette2) +
#   labs(x = "Churn", y = "Density of probabilities",
#        title = "Distribution of predicted probabilities by observed outcome") +
#   theme(strip.text.x = element_text(size = 18),
#         legend.position = "none")
ggplot(testProbs_RF, aes(x = Probs, fill = as.factor(Outcome))) +
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Churn", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "none")


caret::confusionMatrix(testProbs_RF$predOutcome, testProbs_RF$Outcome,
                       positive = "1")
# ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs_Ensemble)) +
#   geom_roc(n.cuts = 50, labels = FALSE, colour = "#D9725B") +
#   style_roc(theme = theme_grey) +
#   geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
#   labs(title = "ROC Curve")
ggplot(testProbs_RF, aes(d = as.numeric(testProbs_RF$Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#c70000") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve")
# auc(testProbs$Outcome, testProbs$Probs_Ensemble)
auc(testProbs_RF$Outcome, testProbs_RF$Probs)


df_model_1 <- na.omit(df_2)

df_model_sample_1<- df_model_1[samp,]
df_Test_1  <- df_model_sample_1[-trainIndex,]

df_G <- df_Test_1
df_G$Outcome = testProbs_RF$Outcome
df_G$Probs = testProbs_RF$Probs
df_G$predOutcome = testProbs_RF$predOutcome

df_G$correct_10 = (df_G$predOutcome == df_G$Outcome)


df_G_n_dpt <- df_G %>%
  group_by(EqcDesc) %>%
  summarise(number=n(),Correct = sum(correct_10=="TRUE"),Accuracy = Correct/number)%>%
  arrange(-number)

df_G_n_dpt=df_G_n_dpt[0:6,]


p1 <- ggplot(data=df_G_n_dpt, aes(x=EqcDesc, y=Accuracy)) +
  geom_bar(stat="identity", fill="#00c700",position = 'dodge',color="#00c700")+
  scale_y_continuous(labels = scales::percent)+
  labs(x = 'Top 6 Biggest Department', y = 'Accuracy', title = "Model's Generalizability - Department")+
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

df_G_n_part <- df_G %>%
  group_by(RGDESC) %>%
  summarise(number=n(),Correct = sum(correct_10=="TRUE"),Accuracy = Correct/number)%>%
  arrange(-number)


df_G_n_part=df_G_n_part[0:6,]


p2 <- ggplot(data=df_G_n_part, aes(x=RGDESC, y=Accuracy)) +
  geom_bar(stat="identity", fill="#00c700",position = 'dodge',color="#00c700")+
  scale_y_continuous(labels = scales::percent)+
  labs(x = 'Top 6 Most Repaired Vehicle System', y = 'Accuracy', title = "Model's Generalizability - System")+
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

grid.arrange(p1, p2, nrow = 1, ncol = 2 )

cost_benefit_table <-
  testProbs_RF %>%
  count(predOutcome, Outcome) %>%
  summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
            True_Positive = sum(n[predOutcome==1 & Outcome==1]),
            False_Negative = sum(n[predOutcome==0 & Outcome==1]),
            False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
  gather(Variable, Count) %>%
  bind_cols(data.frame(Each_Trip_Cost = c(
    "470","30+470","600","30+470")))%>%
  mutate(Cost =
           ifelse(Variable == "True_Negative", Count * 470,
           ifelse(Variable == "True_Positive", Count * (30+470),
           ifelse(Variable == "False_Negative",Count * 600,
           ifelse(Variable == "False_Positive",Count * (30+470), 0)))))%>%
  
  bind_cols(data.frame(Description = c(
    "We predicted next repair trip is not come back and it didn't comeback",
    "We predicted next repair trip is come back and it cameback next trip",
    "We predicted next repair trip is not come back but it cameback next trip",
    "We predicted next repair trip is come back and it didn't comeback")))

kable(cost_benefit_table, align = "l",
      caption = "Cost/Benefit Table") %>% kable_styling()
iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
    
    this_prediction <-
      testProbs_RF %>%
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
      gather(Variable, Count) %>%
      mutate(Cost =
               ifelse(Variable == "True_Negative", Count * 470,
               ifelse(Variable == "True_Positive", Count * (30+470),
               ifelse(Variable == "False_Negative",Count * 600,
               ifelse(Variable == "False_Positive",Count * (30+470), 0)))),Threshold=x)
    
    all_prediction <- rbind(all_prediction, this_prediction)
    x <- x + .01
  }
  return(all_prediction)
}

whichThreshold <- iterateThresholds(testProbs_RF)
whichThreshold %>%
  ggplot(.,aes(Threshold, Count, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette4) +    
  labs(title = "Number by confusion matrix type and threshold cut off",
       y="Count") +
  guides(colour=guide_legend(title="Confusion Matrix"))
whichThreshold_revenue <- 
  whichThreshold %>% 
    group_by(Threshold) %>% 
    summarize(Total_Cost = sum(Cost)) 
whichThreshold_revenue %>%
  dplyr::select(Threshold, Total_Cost) %>%
  gather(Variable, Value, -Threshold) %>%
  ggplot(aes(Threshold, Value, colour = Variable)) +
    geom_point() +
    scale_colour_manual(values = "#c70000") +
    labs(title = "Total Cost by threshold cut off",
       y="Total Cost")+
    geom_vline(xintercept = pull(arrange(whichThreshold_revenue, Total_Cost)[1,1]))

Baltimore Fleet Management

Akshay Nagar & Miaoyan Zhang