Return to MUSA 801 Projects Page

This project was produced as part of the University of Pennsylvania’s Master of Urban Spatial Analytics Spring 2020 Practicum (MUSA 801) taught by Ken Steif, Michael Fichman, and Matt Harris. We would like to thank the team at Guilford County Child Protective Services for providing much of the data used in this analysis.

This document is intended to enable others to replicate a study of social worker assignment to child welfare cases. We first introduce the context of this project, followed by an exploration of the data and predictive modeling procedures. We’ve included hyperlinks throughout the document, plus a closing appendix with details on data transformation and code.

Section 1: Motivation

In our use case, we’re working to optimize the assignment of social workers to child welfare cases, based on the predicted difficulty of each case. Our report focuses on social workers in Child Protective Services (CPS) in Guilford County, NC.

Currently, social workers in Guilford County are assigned to new cases alphabetically based on the social worker’s last name. This would result in an equitable distribution of cases only if cases were similar in duration and difficulty. Instead, because cases differ in duration and difficulty, social worker caseloads vary widely.

To improve this condition, we first had to define a difficult caseload. We determined that a difficult caseload is one that includes (1) a disproportionate number of cases and/or (2) cases that ultimately end in the most stressful outcome for a child: removal from the home. A case that ends in removal from the home is referred to as a “transfer to services”.

With this definition of difficulty in place, we explored the data and determined that the most experienced social workers are most likely to be overburdened by difficult caseloads. To improve this condition, we built a model to predict if a case will result in a transfer to services. To put this predictive modeling to practical use, we developed an app to help social worker managers assign cases with difficulty in mind.

Our ultimate goal is to make it easier for the managers of social workers to assign their team members equitable caseloads. By making this administrative task easier and its result more equitable, we hope to enable social workers to spend more time on their important and highly demanding work with families. To preview our app, click here.

Child Welfare in the US

In the US, more than 5 million reports of suspected child abuse are made annually. These reports include physical harm, emotional harm, and sexual abuse and exploitation. The potential consequences of abuse are severe - the US Department of Health and Human Services reports that an estimated 1,720 children died from abuse and neglect in fiscal year 2017. Individuals who experience abuse in childhood are more at risk of experiencing or perpetrating abuse as adults, with accompanying personal and financial impacts.

Section 2: About Guilford County

Our use case focuses on Guilford County, NC, the third most populous county in North Carolina. Guilford includes the city of Greensboro, the third most populous city in the state. As of the 2010 census, Guilford County had a population of over 500,000 people in 192,064 households; 30% of households included children under the age of 18.

The Guilford County Department of Social Services receives more than 4,000 reports of suspected child abuse each year, roughly 3% of the 125,000 reports made annually across all counties in North Carolina. These cases are received, investigated, and serviced by a system of intake worker, social worker, and interventionist teams.

Section 3: Guilford County Social Services, At a Glance

An investigation into child abuse begins with a report, which starts the “intake process”. After the initial intake process, a case may require up to three additional stages before a determination is made. This analysis focuses on Stage 2: Assessment.

Stage 1: Reporting and intake

A person who suspects a child in Guilford County is experiencing abuse can contact the Guilford County Department of Social Services by phone, email, fax, or in person. Most child abuse allegations are reported by phone.

Stage 2: Assessment (focus of this analysis)

When a case enters assessment, it is assigned to one social worker. Currently, social workers are assigned to cases alphabetically by last name. The potential difficulty of a case is not considered in the assignment process. This assignment process is the existing condition that we intend to improve.

When cases are assigned inequitably, some social workers end up with a greater number of cases, and/or a greater number of difficult cases, in their caseload. An overwhelming caseload may have a negative impact on a social worker’s job performance, mental health, and tenure in the position. The ramifications of poor performance are particularly costly in social work, where the outcome at stake is a child’s welfare.

Stage 3: Ongoing Services

As stated previously, we expect that the most difficult cases are those that ultimately transfer to services, meaning they make it to Stage 3: Ongoing Services. These Stage 3 cases are the ones our predictive model will work to identify preemptively.

In-Home Services are wrap-around services provided to a family while the child remains in the home. The case transitions to an in-home services team.

Permanency Planning (out-of-home services) means a child will be transitioned out of the home. This could result in a transition to foster care, an adoptive family, or emancipation, among other possible outcomes. The case transitions to a permanency planning team.

Existing Condition: Alphabetical Assignment (inequitable caseloads)

Improved Condition: Assignment Based on Difficulty (equitable caseloads)

Section 4: Exploratory Analysis

To understand the existing condition, we focused our exploratory analysis on four areas:

  • 4a. Prevalence of “Difficult” Cases (those that result in transfer to services)
  • 4b. Number of Cases Assigned (by time period and by social worker)
  • 4c. Duration of Employment Among Guilford’s Social Workers
  • 4d. Difference Between Individual Caseloads and Teamwide Average

4a. Prevalence of “Difficult” Cases

The Sankey chart below illustrates the progression from Stage 1 (intake) to the outcome of the assessment period (transfer to services or close without transfer). Although the vast majority of cases are screened in initially, only a small portion ultimately transfer to services.

4b. Number of Cases Assigned

Cases Assigned by Time Period

New reports are made to Guilford County CPS nearly every day. The total number of cases assigned each month during our study period ranged from 570 to 800.

The number of cases assigned each day varies widely, from 1 to 23. On average, there are between 6 and 11 cases assigned each day, depending on the month.

Cases Assigned by Social Worker

The number of cases on each social worker’s caseload varies. Figure 4b.3 is a heatmap in which cells in red represent months when a social worker had a particularly large caseload (up to 36 cases).

Social workers are arranged by months of employment, with the most experienced social workers at the top. A careful look at Figure 4b.3 reveals that the social workers with the highest number of cases in a month tend to be those who were in their roles for all, or nearly all, of the 31 months of our study period. This leads to the next section of our analysis: a comparison of social workers by duration of employment.

4c. Duration of Employment Among Guilford’s Social Workers

Since 2017, 87 individuals have worked as social workers for Guilford County. We will define “experience” as employment duration with Guilford, though we realize social workers may have valuable experience outside Guilford.

How many of the 87 social workers were actively working over the full 31 months of our study? Just 17 individuals, less than 20% of the team.

To explore the differences in employment duration, we divided all social workers into four groups, omitting one social worker with an irregular employment history.
Experience Group    Months in Role    n
Group 4             30 or more        21
Group 3             16 to 29          20
Group 2             8 to 15           26
Group 1             Less than 8       19

During our study period, the total number of social workers assigned to active cases ranged from 41 in September 2017 to 49 in January through March 2019. Employment among Group 4 (experienced) social workers is stable. In contrast, the number of Group 1 social workers, those with less experience, ranges widely, from 1 in September 2018 to 13 in January 2020.

4d. Difference Between Individual Caseloads and Teamwide Average

As depicted in Figures 4b.1 and 4b.2, the number of cases assigned varies monthly and daily, making it difficult to determine a set “limit” to the number of cases that should be in a caseload. To determine which social workers are most overburdened, we refined our analysis to compare individual caseloads to team-wide averages.

To do this, we compared two values for every social worker and every month of their employment: (1) the number of active cases assigned to the social worker in a month and (2) the average number of cases assigned, that same month, across all social workers. The difference between these two values indicates whether a social worker had a large or small caseload compared to peers.
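The computation itself is a simple grouped summary; a minimal dplyr sketch is below, assuming a hypothetical data frame panel with one row per social worker per month (the full version, built step by step as panel4 and panel5, appears in Appendix Part 2).

#Sketch: deviation of each social worker's monthly caseload from the team-wide mean
#Assumes columns sw, month, and monthlyCases (hypothetical names)
library(dplyr)
caseload_deviation <- panel %>%
  group_by(month) %>%
  mutate(team_avg = mean(monthlyCases)) %>% #(2) average caseload across all social workers that month
  ungroup() %>%
  mutate(delta = monthlyCases - team_avg)   #(1) minus (2): positive delta = above-average caseload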

We found that Group 4 social workers, those with the longest duration of employment, tend to have caseloads larger than the team-wide average. Group 1 social workers, those with the shortest duration of employment, tend to have caseloads that fall well below the team average.

The histogram below, Figure 4d.1, displays these deviations for individual social workers, ordered by the social worker’s length of employment.

This difference between experienced and less experienced social workers held when we looked at averages across experience groups (Figure 4d.2). We averaged individual deviations within experience groups and found that Group 4, the most experienced social workers, continued to have caseloads larger than the team-wide average.

Why do the most experienced social workers consistently have the largest caseloads? Our analysis revealed that because cases are assigned alphabetically, without considering existing caseloads, it takes many months for a new social worker to accumulate a significant caseload. As less experienced social workers come and go from Guilford, the most experienced social workers are overburdened by disproportionately large caseloads.


Section 5: Predictive Model

Synthetic Data

To develop our model, we could not use the child assessment data provided by Guilford County, because it lacks the rich set of fields needed to build a model. Instead, we searched for a dataset with characteristics that would let us simulate predicting case difficulty from information collected at the intake phase (when a report is created). We built a synthetic dataset from a completely unrelated one: a dataset used by a bank to predict whether customers were likely to open a savings account, based on demographics and previous interactions with the bank. See the variables below:

We selected this dataset because its variables can emulate several characteristics of an intake dataset, including a binary (yes/no) outcome variable. This matters because we could transform the data in a way that preserves the statistical characteristics of the original (bank) dataset, maintaining its predictive power, while emulating the fields we expect to be present in a CPS intake survey form. The binary outcome variable matters because we want to predict whether or not cases are transferred to ongoing services. After rescaling some variables (like adjusting ages), renaming some categorical variables (we assumed that profession is an important feature in the bank dataset, so we created synthetic allegation types from it), and creating some binary variables (the bank dataset does not collect racial data, so we used the contact mode, either phone or cellular, to create this feature), we had a dataset that simulates what we expect to see at intake (the full transformation code is in Appendix Part 3):

Exploratory Analysis

The following exploratory analysis of the model dataset provides some insight into the cases coming into CPS. A histogram of the distribution of word counts shows a large range (max: 49,280), but the median word count is 1,120. Most cases came from households with no previous reports. Comparing cases that went to services with those that did not, by allegation type, helps show how allegation type relates to a case’s likelihood of transferring to services.
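For reference, a minimal sketch of the word-count histogram, assuming the synthetic data frame dat with the engineered notes_wordcount field built in Appendix Part 3:

#Sketch: distribution of narrative word counts in the synthetic intake data
library(ggplot2)
ggplot(dat, aes(x = notes_wordcount)) +
  geom_histogram(bins = 60, fill = "#feb24c") +
  labs(title = "Distribution of Narrative Word Counts",
       x = "Word Count", y = "Number of Cases")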

Improper supervision is the most common allegation type. The percentage of cases that resulted in a transfer to services differs by allegation type. Reviewing the relationship of the reporter to the potential victim, we see that the most common reporter type was a parent, followed by an unrelated observer. More of the children involved in cases were non-white than white, and non-white children appear to be transferred at a higher rate. Most cases screened in at the intake phase involve children under 10, and most of the children transferred to ongoing services are between 2 and 4 years old:

Model Building

Our modeling approach required classification algorithms: our outcome variable is binary, indicating whether a case was transferred to ongoing services (1) or not (0). To create our predictions, or more precisely our risk scores, we experimented with three approaches: a Logistic Regression model, an XGBoost model (XGB), and a Random Forest model (RF). For practical reasons, we used the caret package to build and analyze our logistic model; the RF and XGB models were built with the newer tidymodels libraries. The charts below show the distribution of the 50-fold cross-validation accuracies and the 10 most valuable predictors.
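A minimal sketch of the three fits is below. The training data frame train_dat and its factor outcome got_services (levels "no"/"yes") are assumed names standing in for our prepared dataset (see Appendix Part 3 for the wrangling).

#Sketch: one caret fit and two tidymodels fits, under the assumptions named above
library(caret)
library(tidymodels)
set.seed(801)

#Logistic regression via caret, with 50-fold cross-validation
ctrl <- trainControl(method = "cv", number = 50)
logit_fit <- train(got_services ~ ., data = train_dat,
                   method = "glm", family = "binomial", trControl = ctrl)

#Random forest and XGBoost via tidymodels, resampled on the same style of folds
folds <- vfold_cv(train_dat, v = 50)
rf_spec  <- rand_forest(trees = 500) %>% set_engine("ranger") %>% set_mode("classification")
xgb_spec <- boost_tree(trees = 500) %>% set_engine("xgboost") %>% set_mode("classification")

rf_res <- fit_resamples(
  workflow() %>% add_formula(got_services ~ .) %>% add_model(rf_spec),
  resamples = folds, metrics = metric_set(accuracy, roc_auc))
collect_metrics(rf_res) #summarizes the cross-validation accuracies plotted below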

Note that the number of children in a household is an important predictor, as is the length of the narrative (notes_wordcount) and the month of intake (reports made in summer months and in March are more likely to end in services).

Density Plots

The density plots below show the number of observations at each predicted probability of being transferred to ongoing services (ranging from 0 to 1), colored by the known outcome. The orange portions of the plots show the distribution of probabilities for observations that were ultimately transferred, whereas the yellow curves show the probabilities for cases that were not. In general, all three models are quite effective at assigning low probabilities to cases that weren’t transferred (notice that the peaks of the yellow density plots are close to 0). When it comes to assigning a high risk score to cases that were ultimately transferred, the Random Forest model has the worst performance: it peaks around .25 and barely assigns any risk score above .75. The Logistic Regression and XGB models are very similar in this respect.

These density plots show the distribution of the probabilities assigned to cases. For a real application of this model, though, we must decide at which probability a case should be classified as one that will be transferred. For example, if we treat any risk score above .25 in the RF model as a predicted transfer, we incorrectly classify as difficult all the cases in the yellow curve to the right of the .25 mark, and incorrectly classify as not difficult all the cases in the orange curve to the left of it. Calibrating this sweet spot between positive and negative detections is very important.

A better way of looking at the performance of these models is a ROC curve, which shows the tradeoffs of each model across different calibrations (detection thresholds). The Y-axis shows the portion of transferred cases correctly identified as transferred (true positives); the X-axis shows the portion of non-transferred cases incorrectly identified as transferred (false positives). As we move along the curves from left to right, both measures increase, so the closer a curve is to the top left corner of the plot, the better. This plot confirms that the Random Forest model performs slightly worse. A more mathematical way to assess these curves is the area under the curve (AUC), which computes exactly what it says. The values for the Logistic Regression, XGB, and RF models are 0.936, 0.938, and 0.924, meaning the differences are marginal.

Another important property to assess is generalizability. In the CPS case, it is important to understand how similarly each model performs for different populations. The two sets of ROC curves below show how each model performs for cases with male and female children, and for cases with white and non-white children. All models perform similarly with respect to the children’s gender, but they show a significant difference when detecting positives in cases with white versus non-white children. We return to generalizability as we calibrate the models and analyze their outputs.

The series of bar plots below shows how each model performs at different detection thresholds. Think of these as vertical lines on the density plots, or points on the ROC curves. The bar charts report three metrics for each setting: accuracy, sensitivity, and specificity. Accuracy is the percentage of observations correctly classified at that threshold (either as positive or negative). Sensitivity is the detection rate of true positives (observations transferred to ongoing services correctly flagged as such). Specificity is the detection rate of true negatives (observations not transferred to ongoing services correctly flagged as such). It is very important to keep these concepts in mind, since they are associated with different costs:

TRUE POSITIVES: As mentioned, these are observations transferred to ongoing service correctly flagged as such.

TRUE NEGATIVES: As mentioned, these are observations not transferred to ongoing service correctly flagged as such.

FALSE POSITIVES: These are observations not transferred to ongoing services that are incorrectly flagged as though they would be. In practical terms, false positives place more burden on social workers: a case that does not need ongoing services but is flagged as though it does consumes some of a social worker’s carrying capacity. Because our model tries to optimize case assignment against carrying capacity, this is not so severe, as the capacity becomes available again once the case is closed. At worst, false positives produce a poor allocation of the workforce.

FALSE NEGATIVES: This is the scenario we most want to avoid. A false negative means that a case that needs ongoing services is flagged as though it does not. In our allocation system, it would be considered an easy case and either allocated to a less experienced social worker or given to someone who already has their hands full. Hence, it is important to optimize the sensitivity of the model to avoid false negatives. By observing the charts below, we established that a 0.12 threshold offers the best balance between accuracy, specificity, and sensitivity.
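A minimal sketch of that calibration step, assuming a hypothetical data frame preds holding each case's true outcome (obs, 1 = transferred) and predicted probability (prob):

#Sketch: accuracy, sensitivity, and specificity at a range of detection thresholds
library(dplyr)
library(purrr)
threshold_metrics <- map_dfr(seq(0.05, 0.50, by = 0.01), function(t) {
  flag <- as.integer(preds$prob >= t)
  tibble(threshold   = t,
         accuracy    = mean(flag == preds$obs),
         sensitivity = sum(flag == 1 & preds$obs == 1) / sum(preds$obs == 1), #true positive rate
         specificity = sum(flag == 0 & preds$obs == 0) / sum(preds$obs == 0)) #true negative rate
})
filter(threshold_metrics, threshold >= 0.10, threshold <= 0.15) #inspect the neighborhood of 0.12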

After establishing the .12 threshold, the next step is to look at generalizability. We chose the Logistic Regression model as our final model because it holds the smallest spread between predictions for cases with white and non-white children (see the bar charts below). Furthermore, we consider explainability an important feature of a machine learning model: logistic regression is a more approachable method than XGB or RF for a lay audience, so a government agency has better awareness of the tool its staff is using when the model is mathematically simpler. The plots show that accuracy for cases with white children is higher, although sensitivity is lower. This illustrates why accuracy alone cannot be the defining metric for evaluating a model.

Looking at the predictions of our final Logistic Regression model, we see that it predicts a much higher rate of children getting services than is observed in the dataset. Because we calibrated the model to maximize true positives, this reflects the tradeoff we accepted: to avoid false negatives, the condition we most want to avoid, we calibrated the model so that it produces more false positives than false negatives.

Next, we look at the percentages of the predicted conditions. The bar chart below shows the percentage of each prediction condition per race. The rate of false negatives is very low, as we planned it to be. False positives are more common in the non-white population.
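The per-race breakdown of prediction conditions can be computed directly from the flagged predictions; a sketch under the same assumed preds data frame, with an added race column:

#Sketch: share of each prediction condition (TP/FP/TN/FN) per race at the 0.12 threshold
library(dplyr)
preds %>%
  mutate(flag = as.integer(prob >= 0.12),
         condition = case_when(
           flag == 1 & obs == 1 ~ "true_positive",
           flag == 1 & obs == 0 ~ "false_positive",
           flag == 0 & obs == 0 ~ "true_negative",
           TRUE                 ~ "false_negative")) %>%
  count(race, condition) %>%
  group_by(race) %>%
  mutate(pct = n / sum(n)) #percentage of each condition within each race, as in the bar chart described above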

Section 6: Conclusion

The proposed process creates an allocation based on carrying capacity and difficulty, matching social workers with the cases where they are most needed (for more experienced social workers) or with cases adequate to their current carrying capacity. We believe this is a more equitable way of assigning cases, since it considers the current caseload, the social worker’s experience, and the expected difficulty of the case. We began to outline a comparative allocation process, comparing the caseloads of social workers allocated by alphabetical order, as is currently done, with our allocation based on experience, carrying capacity, and case difficulty. However, working with two datasets that were never meant to be used together hindered this effort, and we could not finish it in the time we had. The next step in this project would be to build this allocation comparison algorithm to show how our allocation process is more equitable and efficient than the current alphabetical assignment. The ideal way to do that is with complete data from Child Protective Services.

Appendix

APPENDIX PART 1: Data Transformation

Data Transformation Needed to Determine Duration of Cases

Time periods, particularly the duration of individual cases, proved important to our study. Before we could assess the differences in caseloads among social workers, we had to refine our case duration variable.

All cases have a case assignment date (the date the case was assigned to a social worker). Most also had a case close date (the date the case was either closed or transferred to ongoing services). Of the 6,895 unique cases included in our study, 890 did not have a closing date. Because the duration of each case from assignment to close was important to our study, we had to either apply a close date to those 890 cases or drop them. Based on the analysis below, we did a combination of both.

The 6,002 cases with close dates range in duration from 1 to 703 days. Most, however, last less than 100 days, as depicted in the histogram of case duration distribution below.

The mean duration of cases with a close date is 67 days and the standard deviation is 50 days.
Without Feature Engineering           mean       median   min    max       sd
Cases with Close Dates (n= 6,002)     66.7 days  53 days  1 day  730 days  49.7

Some of the 890 cases without a close date were presumably still open at the time of our data collection. However, some presumably lacked a close date due to a clerical error. We considered two options for dealing with missing close dates; we chose the option that resulted in a distribution of case duration most similar to that of the 6,002 cases above.

Version 1: Add a close date (the date of data collection, 2/7/2020) to all cases that lack a close date.

Version 2: Add a close date (the date of data collection, 2/7/2020) to all cases that lack a close date AND began after 11/1/2019. Drop cases that lack a close date and began before 11/1/2019.

Version 2 resulted in a distribution more similar to that of the 6,002 cases with close dates.
Version              Details                                                       Mean Duration  Median Duration  Min  Max  Standard Deviation
Version 1 (n=6893)   Closing date 2/7/2020 added to 890 cases                      91.0           56               1    914  130.6
Version 2 (n=6437)   Closing date 2/7/2020 added to 434 cases, 456 cases deleted   65.2           52               1    730  48.8

APPENDIX PART 2: Exploratory Analysis Code

#Load packages 
install.packages("rlang")
library(rlang)
#install.packages('boxr')
#library(boxr)
library(plyr)
library(dplyr)
library(tidyverse)
library(SparkR)
library(lubridate)
library(ggplot2)
library(alluvial)
library(formattable)
library(grDevices)
library(qwraps2)
library(kableExtra)
library(forcats)
library(gganimate)

#AESTHETICS:
yl_1 <- c('#FFEDA0') #beige (heatmap low end)
or_1 <- c('#FEB24C') #GROUP 3: orange
rd_1 <- c('#FC4E2A') #GROUP 4: red-orange 
rd_2 <- c('#B6280C') #dark red
gy_1 <- c('#BABABA') #GROUP 2: light gray
gy_2 <- c('#404040') #GROUP 1: dark gray

gr_all2 <- c(gy_2, rd_1) #Group 1 and Group 4 colors, for the two-group comparison
gr_all4 <- c(gy_2, gy_1, or_1, rd_1) #Groups 1 through 4, in plotting order
#heatmap color scale defined in plot function

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )}


#EXPLORATORY ANALYSIS: SOCIAL WORKERS AND CASELOADS
###PART 1: df of number of cases active for each social worker by month 

#1. Load in data on cases from the assessment phase 

load(file="20200507_ToStart_Exploratory.RData") ##this environment contains just a df of real cases (data from Guilford County), but with all personal information replaced (the names of social workers were replaced using the RandNames package and names of managers were removed)

assess_sens1 <- assess_sens1_fakenames 

#2. Goal: To organize cases into a count of the number of cases active for each social worker any given month. 
##Find all combinations of social workers and month/years 

##Determine total number of months/years between August 2017 and February 2020: 31 months, and build a df
allmonthyears <- data.frame("mmyy" = c(seq(as.Date("2017/09/01"), by = "month", length.out = 31)))

##Determine total number of all social workers: 95 social workers 
SWList <- as.data.frame(unique(assess_sens1$`Assessment SW`))
##Create a separate list of all social worker names. Remove blanks and set as a dataframe
SWList<-  as.data.frame(SWList[!(SWList[1]==""), ])
names(SWList)[1] <- "Names" #handy way to rename columns in simple df: call the df name and the number of the column (first column is 1)
SWList <- na.exclude(SWList)

###Note: Determine the total number of combinations of social workers and months/years: 2945
#length(unique(SWList$Names)) #95
#length(unique(allmonthyears$mmyy)) #31
#length(unique(SWList$Names))* length(unique(allmonthyears$mmyy)) #2945 possible combinations

##Convert dates to date format so they can be worked with 
assess_sens1$'Case Assigned FIX' <- as.Date(assess_sens1$`Case Assigned Date`, format= "%m/%d/%Y") #starting date
assess_sens1$'Latest Case Decision FIX' <- as.Date(assess_sens1$`Latest Case Decision Submitted Date`, format= "%m/%d/%Y") #closing date
##Calculate duration from start to close for each case (add as a field)
assess_sens1 <- assess_sens1 %>% 
  dplyr::mutate(duration_assigntoClose = as.Date(assess_sens1$`Latest Case Decision FIX`) - as.Date(assess_sens1$`Case Assigned FIX`))
#table(is.na(assess_sens1$duration_assigntoClose)) #4768 NAs (missing close dates duplicated bc cases aren't unique)

##Collapse rows to include just one row for every unique case (based on Case Assessment Reference): 6895 unique cases
assess_unique <- assess_sens1 %>%  #RENAMED assessbyMonth to assess_unique 
  dplyr::distinct(`Assessment Case Reference`, .keep_all = TRUE) #EXAMPLE of dplyr distinct function: df %>% distinct(id, id2, .keep_all = TRUE)

##Note: Hundreds of cases have NA duration- *because* no close date was entered. 
#table(is.na(assess_unique$duration_assigntoClose)) #890 are NA
#table(is.na(assess_unique$'Latest Case Decision FIX')) #890 are NA
#table(is.na(assess_unique$'Case Assigned FIX')) #2 are NA

#3. Goal: Decide how to deal with cases that lack a close date (feature engineering of duration of cases)
##look at duration in cases that DO have closing dates
assess_closeExist <- assess_unique[complete.cases(assess_unique), ] #df of unique cases that do have closing dates entered. 
##What is the average duration of cases that *do* have close dates? A: 66.67 days
#summary(assess_closeExist$duration_assigntoClose) #class: difftime, mode: numeric

##histogram of durations among cases with close dates - only works if duration is numeric
hist_closeExist_duration <- ggplot(data=assess_closeExist, aes(x= duration_assigntoClose, y=frequency(duration_assigntoClose)))+
  geom_col(fill=or_1, width=15) + labs(x='Case Duration (days)', y='Frequency')
plot(hist_closeExist_duration)+labs(title='Case Durations from Assignment to Close', caption='Figure X') + plotTheme() #FOR RMD: The 6,002 cases with close dates range in duration from 1 to 703 days. Most, however, last less than 100 days. Considered this when deciding what to do with cases that lack a closing date.

df_closeExist_duration <- data.frame( #FOR RMD: df of summary stats of cases with closing dates (before feature engineering)
  "Without Feature Engineering" = c("Cases with Close Dates (n= 6,002)"),
  "mean" = round(mean(assess_closeExist[["duration_assigntoClose"]]),digits=1),
  "median" = median(assess_closeExist[["duration_assigntoClose"]]),
  "min" = min(assess_closeExist[["duration_assigntoClose"]]),
  "max" = max(assess_closeExist[["duration_assigntoClose"]]),
  "sd" = round(sd(assess_closeExist[["duration_assigntoClose"]]),digits=1))

df_closeExist_duration %>% #FOR RMD: display summary stats of cases with closing dates
  kable('html') %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  column_spec(1, bold = T, border_right = T)

n_Group <- panel8 %>% 
  dplyr::mutate(count=1)%>% 
  dplyr::group_by(group_num)%>%
  dplyr::mutate(n= sum(count)) %>% 
  dplyr::distinct(group_num, group, n)
  
table_group <- data.frame(
  "Experience Group" = c("Group 4", 'Group 3', 'Group 2', 'Group 1'),
  "Months at Guilford" = c("30 or more",'16 to 29','8 to 15' ,'Less than 8'),
  "Number in Group" = c(21, 20, 26, 19)
)

table_group %>% 
  kable('html', col.names = c('Experience Group', 'Months in Role', 'n')) %>% 
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  column_spec(2, border_right = T, border_left = T) %>% 
  row_spec(1, color='white', background=rd_1) %>% 
  row_spec(2, color='white', background= or_1) %>% 
  row_spec(3, color='white', background=gy_1) %>% 
  row_spec(4, color='white', background=gy_2)

##Create versions of assess_unique that handle missing close dates differently
##Version 1: add a close date to all cases with NA close dates
##Version 2: add a close date to cases with NA close dates AND start dates after 11/1/2019
##If Latest Case Decision FIX == NA, add a field with term "No Case Decisions"
assess_unique <- as.data.frame(assess_unique) %>% 
  dplyr::mutate("Any Case Decisions" = !is.na(`Latest Case Decision FIX`)) #TRUE when a close date exists; FALSE for the 890 cases without one
#summary(assess_unique$`Any Case Decisions`) #successfully labeled 890 cases as "FALSE" because there's no latest case decision FIX date

##Version 1: all cases with NA have a "fake date" of 2020-02-07 (the data collection date) added as closing date. 
fake_LatestCaseDecision <- list(date1 = "2020-02-07")
assess_1 <- assess_unique %>% #assess_1 is version 1 of feature engineering. df with *all* unique cases- close date of 2/7/2020 added to those without a closing date
  dplyr::mutate("Latest Case Decision" = dplyr::coalesce(assess_unique$`Latest Case Decision FIX`, as.Date("2020-02-07")))
assess_1$`Latest Case Decision` <- as.Date(assess_1$`Latest Case Decision`, format = "%m/%d/%Y") 
assess_1$duration_assigntoClose <- as.Date(assess_1$`Latest Case Decision`) - as.Date(assess_1$`Case Assigned FIX`)
#summary(assess_1$duration_assigntoClose) #class: difftime, mode: numeric
##assess_1 <- dplyr::select(assess_1, -c(`Latest Case Decision FIX`)) #Need to eliminate more columns
#assess_sens1$duration_assigntoClose <- as.integer(assess_sens1$duration_assigntoClose) #Needed?
assess_1$duration_assigntoClose <- as.numeric(assess_1$duration_assigntoClose) #need to remove NA/ #need to keep for assessExpand to work!
assess_1 <- assess_1[!is.na(assess_1$`Case Assigned FIX`), ]

#table(is.na(assess_1$duration_assigntoClose)) #no NAs should remain

##Version 2: drop cases with NA close dates and start dates before 11/1/2019.. add a "fake date" of 2020-02-07 as a closing date to the remaining NA. 
##Drop cases if Any Case Decisions == FALSE and Case Assignment FIX <= 10/31/2019
assess_2 <- assess_1 #df after 2nd version of feature engineering: close date of 2/7/2020 added, but only to cases that began after 11/1/2019 (cases that began before 11/1/2019 without a closing date are dropped)

assess_2 <- assess_2[!(assess_2$`Case Assigned FIX`<= as.Date("2019-10-31") & assess_2$`Any Case Decisions`==FALSE),] #drop rows based on criteria..
assess_2$duration_assigntoClose <- as.integer(assess_2$duration_assigntoClose) #need to keep for assessExpand to work!

#Decide to go with version 2 (assess_2)
#summary(assess_1$duration_assigntoClose)
#summary(assess_2$duration_assigntoClose)
#hist_closeDateALL_duration <- hist(assess_1$duration_assigntoClose) #version 1 includes a wider range of durations, with some notable outliers
#hist_closeDateTRIM_duration <- hist(assess_2$duration_assigntoClose) #version 2 durations are more concentrated below 100 days
##Means cutting 456 cases assigned before 10/31/2019 but never got any entry in latest case decision field

##Display results of feature engineering in tables.
df_duration_bf_af_engineer <- data.frame( #FOR RMD: df of summary statistics of the effects of two versions of feature engineering
  "Feature_Engineering" = c('Version 1 (n=6893)', 'Version 2 (n=6437)'),
  "Details" = c('Closing date 2/7/2020 added to 890 cases', 'Closing date 2/7/2020 added to 434 cases, 456 cases deleted'),
  "mean" = c(round(mean(assess_1[["duration_assigntoClose"]]), digits=1), round(mean(assess_2[["duration_assigntoClose"]]),digits=1)),
  "median" = c(median(assess_1[["duration_assigntoClose"]]), median(assess_2[["duration_assigntoClose"]])),
  "min" = c(min(assess_1[["duration_assigntoClose"]]), min(assess_2[["duration_assigntoClose"]])),
  "max" = c(max(assess_1[["duration_assigntoClose"]]), max(assess_2[["duration_assigntoClose"]])),
  "sd" = c(round(sd(assess_1[["duration_assigntoClose"]]),digits=1), round(sd(assess_2[["duration_assigntoClose"]]),digits=1))
)

df_duration_bf_af_engineer %>%#FOR RMD:display summary statistics of two versions of feature engineering. Chose to use V2 because it maintains mean and sd duration very similar to that of cases with closing dates. 
  kable('html', col.names = c('Version', 'Details', 'Mean Duration', 'Median Duration', 'Min', 'Max', 'Standard Deviation')) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>% 
  column_spec(1, bold = T, border_right = T)

#Pause to check for NA in duration field and prep for next step
#table(is.na(assess_2$duration_assigntoClose)) #no NAs
##Rename assess_2 assess_z, just to keep track of it (and in case we want to switch to another version of assess)
assess_z <- assess_2
#sapply(assess_z, class)
assess_z$duration_assigntoClose <- as.integer(assess_z$duration_assigntoClose)
#assess_z$duration_assigntoClose <- as.numeric(assess_z$duration_assigntoClose) #needs different format here!!! -REMOVE??
#summary(assess_z$duration_assigntoClose)
#table(is.na(assess_z$duration_assigntoClose)) #check for NAs in duration, but there aren't any - hooray! 

##4. Complete the setting up of assessment df.. "Expand" the df of SW and mmyy to get a row for every day that the case exists (to then groupby to get a count of active cases per person by month). 
assessExpand <- assess_z[rep(row.names(assess_z), times = assess_z$duration_assigntoClose), ] ##EXAMPLE CODE: # EXAMPLE of rep function: expanded <-mydf[rep(row.names(mydf), mydf$ Days.in.Flight), ]
##Now that theres a row for every day a case existed.. Add dates in sequence so that every row represents a day a case existed.. 
assessAddDates <- data.frame(assessExpand, Date = assessExpand$`Case Assigned FIX` + (sequence(assess_z$duration_assigntoClose)-1)) #add enough dates to fill dates for full duration # EXAMPLE of expand function: adddates <- data.frame(expand, Date = expand$Flight.Start+(sequence(3)-1))
#bin all the newly added sequence dates to get month 
assessAddMonths <- assessAddDates %>%
  dplyr::mutate(bin_mmyy = floor_date(Date, unit = c("month")))

##5. Create the "panel" and then add columns as needed.. laid out step-by-step because some panels are used later to organize data differently 
#Panel1: create just one row for every month that each case exists- plus a count of the days in that month the case existed! 
panel1 <- assessAddMonths %>%
  dplyr::mutate(freq1 = 1) %>% 
  dplyr::group_by(Assessment.Case.Reference, Assessment.SW, bin_mmyy)%>% #group by plain column names (passing the df again via $ created an oddly named grouping column)
  dplyr::summarize(sumgroup = sum(freq1))

#Panel2: group by social worker, month to get a count of cases per month by SW (to later compare to full panel so month-year without cases are included)
panel2 <- panel1 %>%
  dplyr::mutate(freq2 = 1) %>% 
  dplyr::group_by(Assessment.SW, bin_mmyy)%>%
  dplyr::summarize(monthlyCases = sum(freq2))

#Panel3: Add a column for career cases for each SW 
panel3 <- panel2 %>% 
  dplyr::group_by(Assessment.SW) %>% 
  dplyr::summarize(career = sum(monthlyCases))
panel3 <- left_join(panel2, panel3, "Assessment.SW")
#Reorder panel3 so SW with most cases appear first.. EX: newdata <- mtcars[order(mpg, -cyl),] sorts df by two fields: mpg ascending and cyl descending
panel3 <- panel3[order(-panel3$career),]

#Panel4: Add a column to show average (across all working SW) of number of cases by month
panel4 <- panel3 %>% 
  dplyr::group_by(bin_mmyy) %>% 
  dplyr::summarize(average = mean(monthlyCases))
panel4 <- left_join(panel3, panel4, "bin_mmyy") 

#Panel5: Add a column for difference to average and SW individual monthly caseload
panel5 <- panel4 %>% 
  dplyr::mutate(delta = monthlyCases - average)

#Panel6: Add a column for average difference to mean monthly cases per SW
panel6 <- panel5 %>% 
  dplyr::group_by(Assessment.SW) %>% 
  dplyr::summarize(delta2 = mean(delta))

#Panel7: add average delta per SW to monthly pannel
panel7 <- left_join(panel5, panel6, "Assessment.SW")

#Panel8: add count of months SW worked for Guilford, based on active cases AND add a column to assign SW to groups (quartered based on active months of employment)
panel8 <- panel7 %>% 
  dplyr::group_by(Assessment.SW) %>% 
  dplyr::summarize(months = dplyr::n_distinct(bin_mmyy)) #Use panel 8 for key counts of cases by SW
panel8 <- panel8 %>% 
  dplyr::mutate(group = ifelse(panel8$months < 8, "Group 1 (<8 months)", ifelse(panel8$months <=15, "Group 2 (<16 months)", ifelse(panel8$months <= 29, "Group 3 (<30 months)", "Group 4 (30+ months)")))) %>%  #need character to easily assign color palette to historgrams
  dplyr::mutate(group_num = ifelse(panel8$months < 8, 1, ifelse(panel8$months <=15, 2, ifelse(panel8$months <= 29, 3, 4)))) #need numeric to order

#Panel9: join to add number of months to SW individual monthly caseload (final panel before shift to add SW/months with 0 cases)
#Add a column to assign each social worker to a group based on their active months of employment (up to 31)
panel9 <- right_join(panel8, panel7, "Assessment.SW")
panel9 <- panel9[order(-panel9$months),]#Use panel9 for whole workforce plotting

#Plot based on panel9: Line graph showing cases for just one person at a time, from each group. 
line_4EX <- ggplot(panel9 %>% 
                   dplyr::filter(Assessment.SW == 'Alexa Jordan'),
                   aes(x=bin_mmyy, y=monthlyCases, group=as.factor(Assessment.SW))) + 
            geom_line(size=1.5, color=rd_1) + theme(legend.position = "none") #rd_1 is the Group 4 color (the original gr_1 was never defined)
plot(line_4EX) + labs(title='Caseload Size of a Group 4 Social Worker Over Time', x='Month', y='Number of Cases') +
  plotTheme()

#Plot of one person from each group overlayed
line_Group4EX <- ggplot(panel9 %>% 
                          dplyr::filter(Assessment.SW == 'Alexa Jordan' | Assessment.SW == 'Devon Meyer' | Assessment.SW == 'Ana Terry' | Assessment.SW == 'Nevaeh Fowler'),
                          aes(x=bin_mmyy, y=monthlyCases, color=as.factor(Assessment.SW), group=as.factor(Assessment.SW))) + 
                 geom_line(size=1.5) + theme(legend.position = "none") +
                 scale_color_manual(values=gr_all4)
plot(line_Group4EX)+ theme(legend.position = "bottom") + labs(color="Experience Group", title='Caseload Size of One Social Worker from Each Group', x='Month', y='Number of Cases') +
                      plotTheme() #label the color aesthetic (not fill) so the legend title updates correctly

#Takeaways from panel8: length of SW careers (in months)
summary(panel8$months) #mean career length (time with active cases) = 16.69, median = 15.00 
summary(panel8$months == 31) #17 SW have worked all 31 months- the full time frame in study
summary(panel8$months >= 12) #49 of 87 have worked for at least a year (56%)
summary(panel8$months >= 18) #37 of 87 have worked for at least a 1.5 year (43%)

#summarize the number of SW employed each month
panel8_monthlystaff <- assessAddMonths %>% 
  dplyr::group_by(bin_mmyy) %>% 
  dplyr::distinct(Assessment.SW, bin_mmyy) %>% 
  dplyr::mutate(monthlyStaff= 1) %>% 
  dplyr::summarise(monthlyStaff = sum(monthlyStaff))
#simple hist of employment by month (replaced with a stack by employment group version)
#hist_monthlystaff <-ggplot(data=panel8_monthlystaff, aes(x= bin_mmyy, y=monthlyStaff, fill=hist_fill)) + 
#                    theme(legend.position = "bottom")+
#                    geom_col(fill=hist_fill, width=20)
#plot(hist_monthlystaff) + labs(title='Social Workers Employed by Month', x='Month', y='Number Employed')

#Bonus panel: group caseload sizes by groups to get average among high and low experience SW
panel_byQu <- panel9 %>% 
  dplyr::group_by(group, bin_mmyy) %>% 
  dplyr::summarise(monthly_cases = mean(monthlyCases)) %>% 
  dplyr::mutate(group_num = ifelse(group == 'Group 1 (<8 months)', 1, ifelse(group == 'Group 2 (<16 months)', 2, ifelse(group == 'Group 3 (<30 months)', 3, 4))))
panel_byQu <- dplyr::filter(panel_byQu, bin_mmyy != '2020-02-01')

#filter panel to include just the most experienced (Group 4) and least experienced (Group 1)
panel_Qu1_4 <- panel_byQu %>% 
  dplyr::filter(group_num != 2, group_num != 3) #keep Groups 1 and 4 (the original filter dropped 1 and 3, contradicting the comparison described)
#plot two lines to show the difference in caseload accumulation between the two groups
line_Qu1_4 <- ggplot(panel_Qu1_4, aes(x=bin_mmyy, y=monthly_cases, group=group_num)) + 
              geom_line(size = 1.5, aes(color=group)) + theme(legend.position = "bottom", plot.subtitle = element_text(size = 10)) +
              scale_color_manual(values=gr_all2)

plot(line_Qu1_4) + transition_reveal(bin_mmyy) +
  labs(title='Average Caseload Sizes Over Time \nComparison of Most and Least Experienced Groups', #add to RMD: note that experienced social workers' caseloads peak when less experienced SW caseloads fall- the lines travel almost inversely. This is because when the social work staff includes fewer new people, there are fewer people overall to carry the cases.
       subtitle='Less experienced social workers rarely accumulate as many cases as experienced colleagues',
       x='Month', y='Monthly Cases: Group Average', color='Group') + 
  plotTheme()

#Bonus: number of SW in each group employed over time
panel_QuCareer <- panel9 %>% 
  dplyr::group_by(group, bin_mmyy) %>% 
  dplyr::mutate(n= 1) %>% 
  dplyr::summarise(employed = sum(n)) #%>% 
  #dplyr::mutate(group_num = ifelse(group == 'Group 1 (<8 months)', 1, ifelse(group == 'Group 2 (<16 months)', 2, ifelse(group == 'Group 3 (<30 months)', 3, 4))))
panel_QuCareer <- dplyr::filter(panel_QuCareer, bin_mmyy != '2020-02-01')

#stacked bar chart of the number of SW employed every month, broken down by employment group
stacked_careerbyGroup <- ggplot(panel_QuCareer, aes(x=bin_mmyy, y=employed, fill=group)) + 
                         geom_bar(position="stack", stat="identity") +
                         scale_fill_manual(values=gr_all4) + theme(legend.position = "bottom")

plot(stacked_careerbyGroup) + labs(title = 'Social Worker Employment, by Group', 
                                   x='Month', y='Number Employed', fill='Experience Group') +
                              plotTheme()

#Bonus panel: Show how much EACH INDIVIDUAL SW deviates from the workforce caseload mean each month - note obvious similarity between employment groups
panel8_delta <- left_join(panel8, panel6) %>% 
  dplyr::filter(Assessment.SW != "Kylie Myers")# has odd history- left for large swath of time
#histogram of EACH INDIVIDUAL SW deviation from mean across career 
hist_SWdeltas <-ggplot(data=panel8_delta, aes(x= reorder(Assessment.SW, -months), y=delta2, fill=group)) + 
                geom_bar(stat="identity") + theme(axis.text.x = element_text(angle=45), legend.position = "bottom") +
                scale_fill_manual(values= gr_all4)
plot(hist_SWdeltas) + labs(title="Deviation from Workforce Caseload Mean, \nIndividual Careers", 
                           x="Social Worker Name", y="Deviation", fill="Experience Group") + 
                      plotTheme()

#Bonus panel: Show that same deviation but averages across employment groups- Group 4 averages above mean and Group 1 averages below
panel8_delta_Qu <- panel8_delta %>% 
  dplyr::group_by(group) %>% 
  dplyr::summarise(quMean = mean(delta2)) %>% 
  dplyr::mutate(group_num = ifelse(group == 'Group 1 (<8 months)', 1, ifelse(group == 'Group 2 (<16 months)', 2, ifelse(group == 'Group 3 (<30 months)', 3, 4))))

hist_deltasbyQuarter <- ggplot(data=panel8_delta_Qu, aes(x= reorder(group, -group_num), y=quMean, fill=group)) +
                        geom_bar(stat='identity') + theme(axis.text.x = element_blank(), legend.position = "bottom", axis.ticks.x = element_blank()) + 
                        scale_fill_manual(values=gr_all4)
  
plot(hist_deltasbyQuarter) + labs(title="Deviation from Workforce Caseload Mean, by group",
                                  x ="", y = "Deviation from Workforce Caseload Mean", fill='Experience Group') +
                             plotTheme()

#Plot the heatmap of caseload size of all SW overtime: 
heatmap <- ggplot(panel9, aes(x = bin_mmyy, y = reorder(Assessment.SW, months))) + geom_tile(aes(fill = monthlyCases),colour = "white", na.rm = TRUE) +
           scale_fill_gradient(low = yl_1, high = rd_1) + theme(legend.position = "bottom")

plot(heatmap) + labs(title='Caseloads of Social Workers Over Time',subtitle='Cells in red represent months in which a social worker had up to 40 cases.',
                     x='Month',y='Social Worker Name', fill='Caseload Size') +
                plotTheme()

###PART 2: Employment length of SW

###PART 3: Panels based on CASE ASSIGNMENT (rather than cumulative count)
#1. Goal: Build panels to count number of cases ASSIGNED by month to each SW
## Find total possible combinations of SW and month/yr of employment
panel_start <- expand.grid(uniqueSW=unique(SWList$Names), 
               uniqueYr = unique(allmonthyears$mmyy))

#duplicate panel1 so we can rename columns without potentially messing up any previous work
ass_panel1 <- panel_start #ass_panel1 is just all of the possible combinations of month/year and SW
#rename relevant columns (binned month and SW) in panel1 to make anti_join work
names(ass_panel1)[names(ass_panel1)=="uniqueSW"] <- "AssessmentSW" 
names(ass_panel1)[names(ass_panel1)=="uniqueYr"] <- "month_assigned" 

#*** Build panel to count number of cases ASSIGNED by month to each SW (will be missing columns when SW didn't have cases)
#start with assessAddMonths, the expanded case-by-month df built for the first panel
ass_panel2 <- assessAddMonths %>% 
  dplyr::distinct(Assessment.Case.Reference, bin_mmyy, .keep_all=TRUE) #one row per case per month (passing the df again inside the pipe was a bug)
#rename important columns to remove spaces and easily reference them in dplyr
names(ass_panel2)[names(ass_panel2)=="Case.Assigned.Date"] <- "CaseAssignedDate" 
names(ass_panel2)[names(ass_panel2)=="Assessment.SW"] <- "AssessmentSW" 
#reformat CaseAssignedDate as a date (wasn't working in floor_date before reformat)
#ass_panel2$CaseAssignedDate <- as.Date(ass_panel2$Case.Assigned.Date, format= "%m/%d/%Y") 
#Determine the month of assignment by binning CaseAssignedDate (using floor_date) then group_by and summarize to get count of assignments per month
ass_panel3 <- ass_panel2 %>% 
  #dplyr::mutate(month_assigned = floor_date(, unit = c("month"))) %>%
  dplyr::mutate(freq2 = 1) %>% 
  dplyr::group_by(AssessmentSW, bin_mmyy)%>%
  dplyr::summarize(sumgroup = sum(freq2))

ass_panel3$AssessmentSW <- as.factor(ass_panel3$AssessmentSW) 
#df2 <- transform(df,id=as.numeric(factor(sample)))
ass_panel3 <- transform(ass_panel3, id=as.numeric(factor(AssessmentSW))) #need to refresh on this section :)
#EXAMPLE from SO for removing NA and blank values (only need to remove blanks, but whatever: df[!(is.na(df$start_pc) | df$start_pc==""), ]
ass_panel1 <- ass_panel1[!(is.na(ass_panel1$AssessmentSW) | ass_panel1$AssessmentSW==""), ] #cut the blank SW fields from panel1 
ass_panel1 <- transform(ass_panel1, id=as.numeric(factor(AssessmentSW))) #add ID for each SW (should work because IDs are assigned alphabetically by first name)
#Do the anti_join
names(ass_panel3)[names(ass_panel3)=="bin_mmyy"] <- "month_assigned" 
ass_antipanel <- anti_join(ass_panel1, ass_panel3, by=c("month_assigned","id"))
#add a column in antipanel for sumgroup 
ass_antipanel <- ass_antipanel %>% 
  dplyr::mutate(sumgroup = 0)
#Join ass_panel4 and ass_antipanel (the panel with monthly assigned cases greater than 0 and the anti-panel of SW/months that were missing because 0 cases were assigned)
names(ass_panel3)[names(ass_panel3)=="Assessment.SW"] <- "AssessmentSW" #rename so they match for the rbind
ass_fullpanel <- rbind.data.frame(ass_panel3, ass_antipanel) 
#turn full panel into a matrix
ass_matrix <- data.matrix(ass_fullpanel)
#turn full panel into a df
ass_df <- data.frame(ass_fullpanel)
#plot heatmap.. x is assignment month and y is SW name
#ggplot(dayHour, aes(hour, wday)) + geom_tile(aes(fill = N),colour = "white", na.rm = TRUE) +
#  scale_fill_gradient(low = col1, high = col2)

#delete unnecessary column (id) from ass_df
ass_df = subset(ass_df, select = -c(id))

##Cases assigned by day
ass_byDay <- assess_unique %>% 
  dplyr::group_by(`Case Assigned FIX`) %>% 
  dplyr::summarise(ass2day = sum(`Assess Caseref Count`)) %>% 
  dplyr::mutate(month = floor_date(`Case Assigned FIX`, unit = c("month")))

#floor date to get month and then mean of # of daily assignments for each month- add as a line to same plot
ass_monthlyMean <- ass_byDay %>% 
  dplyr::group_by(month) %>% 
  dplyr::summarise(month_mean = mean(ass2day))

ass_byDayandMonth <- left_join(ass_byDay, ass_monthlyMean) %>% 
  dplyr::filter(month != "2020-03-01") %>% 
  dplyr::filter(month != "2020-02-01")

#scatterplot of daily assignments (shows variability)
points_daily_ass <- ggplot(ass_byDayandMonth, aes(x=`Case Assigned FIX`, y=ass2day)) +
  geom_point(size=2)

line_DaymeanforMonth <- ggplot(data=ass_byDayandMonth, aes(x=month, y=month_mean)) +
  geom_line(color='#2480A6', size=1.5)

plot(line_DaymeanforMonth)

#jointly plot scatter of daily assignments with line of monthly mean of daily assignments by defining one x and two y

line_scatter <- ggplot(data=ass_byDayandMonth, aes(x=`Case Assigned FIX`)) +
  geom_point(aes(y=ass2day), color = gy_1)+
  geom_line(aes(y= month_mean), color=or_1, size=1.5)

plot(line_scatter) + #transition_reveal(`Case Assigned FIX`) +
  labs(title='Daily Cases (points) \nAverage Daily Cases Each Month (line)', 
       subtitle='The count of cases assigned daily varied widely- from 1 to 23. \nWhen daily counts are averaged per month, the range narrows.',
       x='Date of Assignment', y='Number of Assignments') +
  plotTheme()

#
ass_df_months<- left_join(ass_df, panel8, by= c("AssessmentSW" = "Assessment.SW"))
ass_df_months <- ass_df_months %>% 
  dplyr::mutate(AssessmentSW = fct_reorder(AssessmentSW, months)) #fct_reorder is from forcats library
ass_df_months <- na.exclude(ass_df_months)

ass_byMonth <- ass_df_months %>% 
  dplyr::group_by(month_assigned) %>% 
  dplyr::distinct(AssessmentSW, month_assigned) %>% 
  dplyr::mutate(monthlyAss= 1) %>% 
  dplyr::summarise(monthlyAss = sum(monthlyAss))

ass_byMonth <- ass_byMonth %>% 
  dplyr::filter(month_assigned != "2017-08-01") %>% 
  dplyr::filter(month_assigned != "2020-03-01")

line_assbyMonth <- ggplot(data=ass_byMonth, aes(x=month_assigned, y=monthlyAss)) +
  geom_line(color=or_1, size=1.5)

plot(line_assbyMonth) + labs(title='Cases Assigned Monthly',subtitle = 'The number of cases assigned monthly varied from a \nhigh of 71 in March/April 2018 to a low of 64 in July 2019.', 
                             x='Month', y='Number of Assignments') +
  plotTheme()

#Remove dot plots, but could be useful in the future
#Make a dot plot showing how the case loads of each SW changes monthly 
#p <- ggplot(dat, aes(factor(Day), factor(ID)))
#p + geom_dotplot(binaxis = "y", stackdir = "center", binpositions="all") + 
#  geom_point(aes(size=Dose))
#filter by date
#filter(FL_DATE >= as.Date("2014-01-05"))
#Q4_6months <- panel9 %>% 
#  dplyr::filter(bin_mmyy > as.Date("2019-06-01")) %>% 
#  dplyr::filter(group == "Quarter 4")
#group_top10_6months <- group_top10 %>% 
#  dplyr::filter(bin_mmyy > as.Date("2019-06-01"))
#sapply(top10part, class)
#dot_Q4 <- ggplot(Q4_6months, aes(factor(bin_mmyy), factor(Assessment.SW)))
#dotplot_Q4 <- dot_Q4 + geom_dotplot(binaxis = "y", stackdir = "center", binpositions = "all", binwidth = 0 ) +
#  geom_point(aes(size=monthlyCases))
#plot(dotplot_Q4)

#outofdate does not exist.. 
#dot_Rosa <- ggplot(justRosa, aes(factor(bin_mmyy), factor(Assessment.SW)))
#dot_Rosa + geom_dotplot(binaxis = "y", stackdir = "center", binpositions = "all") +
#  geom_point(aes(size=monthlyCases))

APPENDIX PART 3: Predictive Modeling Code

## DATA WRANGLING

library(tidyverse)
library(scales)
library(reshape2)
library(knitr)
library(boxr)

dat <- bank #`bank` is the raw bank-marketing data frame, loaded separately; renamed so the wrangling code below runs

s <- function (x) {
  summary(as.factor(x))
}

palette_3_colors <- c("#ffeda0","#feb24c","#fc4e2a") #beige, orange, red-orange
palette_2_colors <- c("#feb24c", "#fc4e2a")
palette_1_colors <- c("#feb24c")

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )}

knitr::kable(head(dat), "markdown")

#### DATA WRANGLING ####

# Simulate child age (0-18) by rescaling the bank `age` field to [0,1] and scaling by 18
dat$child_age <- rescale(dat$age)
dat$child_age <- as.integer(round(dat$child_age * 18))

s(dat$child_age)

# Simulate the allegation type from the bank `job` field
dat$allegation_type <- dplyr::case_when(
  dat$job == "admin." ~ "improper_care",
  dat$job == "blue-collar" ~ "substance_abuse",
  dat$job == "entrepreneur" ~ "dependency",
  dat$job == "housemaid" ~ "sexual_abuse",
  dat$job == "management" ~ "physical_abuse",
  dat$job == "retired" ~ "improper_supervision",
  dat$job == "self-employed" ~ "moral_turpitude",
  dat$job == "services" ~ "injurious_environment",
  dat$job == "student " ~ "abandonment",
  dat$job == "technician" ~ "domestic_violence",
  dat$job == "unemployed" ~ "emotional_abuse",
  dat$job == "unknown" ~ "human_trafficking",
  TRUE ~ "Unknown"
)

s(dat$allegation_type)

# Simulate the reporter's relation to the child from the bank `education` field
dat$reporter_relation <- dplyr::case_when(
  dat$education == "basic.4y" ~ "healthcare_professional",
  dat$education == "basic.6y" ~ "friend",
  dat$education == "basic.9y" ~ "teacher",
  dat$education == "high.school" ~ "unrelated_observer",
  dat$education == "illiterate" ~ "child",
  dat$education == "professional.course" ~ "neighbor",
  dat$education == "university.degree" ~ "parent",
  dat$education == "unknown" ~ "anonymous",
  TRUE ~ "Unknown"
)

# Simulate the parents' marital status from the bank `marital` field
dat$child_parents_marital <- dplyr::case_when(
  dat$marital == "married" ~ "married",
  dat$marital == "divorced" ~ "single_father",
  dat$marital == "single" ~ "single_mother",
  dat$marital == "unknown" ~ "unknown",
  TRUE ~ "no_record"
)

s(dat$y)

# Outcome variable: 1 if the case was transferred to services
dat$got_services <- dplyr::if_else(dat$y == "yes", 1, 0)
s(dat$got_services)

s(dat$loan)

# Simulate visible injury from the bank `loan` field
dat$visible_injury <- dplyr::if_else(dat$loan == "yes", 1, 0)

summary(dat$duration)

# Simulate a case-notes word count from the bank `duration` field
dat$notes_wordcount <- (dat$duration + 10)*10

summary(dat$notes_wordcount)

# Simulate children in the household (1-5): rescale `nr.employed` to 0-4, then reverse the scale
dat$children_in_hh <- rescale(dat$nr.employed)
dat$children_in_hh <- as.integer(round(dat$children_in_hh * 4))
dat$children_in_hh <- dplyr::case_when(
  dat$children_in_hh == 0 ~ 5,
  dat$children_in_hh == 1 ~ 4,
  dat$children_in_hh == 2 ~ 3,
  dat$children_in_hh == 3 ~ 2,
  dat$children_in_hh == 4 ~ 1,
  TRUE ~ 10
)

s(dat$children_in_hh)
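
# The case_when above is simply the reversal x -> 5 - x applied to the rounded
# 0-4 values; a compact equivalent, demonstrated on a stand-in vector:
5L - c(0L, 1L, 2L, 3L, 4L)  # returns 5 4 3 2 1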

# Simulate the number of allegations: rescale `euribor3m` to 0-18, then reverse the scale
# (rescaled values of 6-9 have no branch below, so they fall through to the catch-all of 20)
dat$nr_allegations <- rescale(dat$euribor3m)
dat$nr_allegations <- as.integer(round(dat$nr_allegations * 18))
dat$nr_allegations <- dplyr::case_when(
  dat$nr_allegations == 0 ~ 15,
  dat$nr_allegations == 1 ~ 14,
  dat$nr_allegations == 2 ~ 13,
  dat$nr_allegations == 3 ~ 12,
  dat$nr_allegations == 4 ~ 11,
  dat$nr_allegations == 5 ~ 10,
  dat$nr_allegations == 10 ~ 9,
  dat$nr_allegations == 11 ~ 8,
  dat$nr_allegations == 12 ~ 7,
  dat$nr_allegations == 13 ~ 6,
  dat$nr_allegations == 14 ~ 5,
  dat$nr_allegations == 15 ~ 4,
  dat$nr_allegations == 16 ~ 3,
  dat$nr_allegations == 17 ~ 2,
  dat$nr_allegations == 18 ~ 1,
  TRUE ~ 20)

s(dat$nr_allegations)

# Floor the allegation count at the number of children in the household
dat$nr_allegations <- dplyr::if_else(dat$children_in_hh > dat$nr_allegations, dat$children_in_hh, dat$nr_allegations)

summary(dat$campaign)
s(dat$contact)

# Simulate race from the bank contact channel
dat$race <- dplyr::if_else(dat$contact == "cellular", "white", "black")
s(dat$race)

s(dat$default)

# Simulate (rare) child fatality from the bank `default` field
dat$child_fatality <- dplyr::if_else(dat$default == "yes", 1, 0)

s(dat$child_fatality)

dat$how_many_previous_reports <- dat$previous

dat$got_service_before <- dplyr::case_when(
  dat$poutcome == "success" ~ "Yes",
  dat$poutcome == "failure" ~ "No",
  TRUE ~ "Unknown"
)

# Simulate sex from the bank `housing` field, and a keyword count from `campaign`
dat$sex <- dplyr::if_else(dat$housing == "no", "Male", "Female")

dat$count_keywords_in_notes <- rescale(dat$campaign)
dat$count_keywords_in_notes <- as.integer(round((dat$count_keywords_in_notes * 25)+10))


dat_final <- dat %>% 
  dplyr::select(
    month, day_of_week, child_age, sex, race, allegation_type, reporter_relation, visible_injury, notes_wordcount, 
    count_keywords_in_notes, children_in_hh, nr_allegations, child_fatality, how_many_previous_reports, 
    got_service_before, got_services)

knitr::kable(head(dat_final), "markdown")

#### EXPLORATORY ANALYSIS ####

Lhist_wordcount <- ggplot(dat_final) +
  geom_histogram(aes(notes_wordcount), bins = 100, fill = palette_1_colors) +
  labs(title = "Distribution of Word Count in Notes") + xlab("") + ylab("") +
  plotTheme()

Lhist_prvReport <- ggplot(dat_final) +
  geom_histogram(aes(how_many_previous_reports), bins = 100, fill = palette_1_colors) +
  labs(title = "Distribution of Previous Reports from the Same Household") + xlab("") + ylab("") +
  plotTheme()

Lhist_allegations <- ggplot(dat_final) + 
  geom_bar(aes(x = allegation_type, y = ..count.., fill = as.factor(got_services)), position = "dodge") +
  coord_flip() +
  scale_fill_manual(values = palette_2_colors) +
  labs(title = "Allegations Count", fill = "Transfered to Services") + xlab("") + ylab("") +
  plotTheme()

allegation_diff <- dat_final %>% 
  dplyr::mutate(count_event = 1) %>% 
  dplyr::group_by(got_services, allegation_type) %>%
  dplyr::summarise(total = sum(count_event))

allegation_diff_Y <- dplyr::filter(allegation_diff, got_services == 1) %>% dplyr::rename(total_Y = total)
allegation_diff_N <- dplyr::filter(allegation_diff, got_services == 0) %>% dplyr::rename(total_N = total)
allegation_diff <- left_join(allegation_diff_Y, allegation_diff_N, by = "allegation_type") %>% 
  dplyr::mutate(diff = total_N - total_Y, pct_diff = total_Y/(total_Y + total_N))

Lhist_pcttoservices <- ggplot(allegation_diff) +
  geom_bar(aes(x = reorder(allegation_type, pct_diff), y = pct_diff), stat = "identity", fill = palette_1_colors) + 
  coord_flip() + xlab("") + ylab("") +
  labs(title = "Share of Cases Transferred to Services, per Allegation Type") +
  plotTheme()
  
Lhist_svc_reporter <- ggplot(dat_final) + 
  geom_bar(aes(x = reporter_relation, y = ..count.., fill = as.factor(got_services)), position = "dodge") +
  scale_fill_manual(values = palette_2_colors) +
  coord_flip() +
  plotTheme()

# Relabel the simulated race variable (swap the two categories)
dat_final$race <- if_else(dat_final$race == "white", "non-white", "white")

Lhist_svc_race <- ggplot(dat_final) +
  geom_bar(aes(x = race, y = ..count.., fill = as.factor(got_services)), position = "dodge") +
  scale_fill_manual(values = palette_2_colors) +
  plotTheme()

Lhist_svc_race + labs(title="Case Transfer to Services \nby Race", x="Race", y="Number of Cases", fill="Transfer to Services")+theme(legend.position = "bottom")

Lhist_svc_age <- ggplot(dat_final) +
  geom_bar(aes(x = child_age, y = ..count.., fill = as.factor(got_services)), position = "dodge") +
  scale_fill_manual(values = palette_2_colors) +
  plotTheme()

Lhist_svc_age + theme(legend.position = 'bottom') + labs(title = "Case Transfer to Services \nby Child Age", fill = "Transfer to Services") + ylab("Number of Cases") + xlab("Child Age")

remove(s, allegation_diff, allegation_diff_N, allegation_diff_Y)

## SIMULATION, THREE MODELS

#### INITIAL SETTINGS ####

library(tidyverse)
library(caret)
library(gridExtra)
library(plotROC)
library(pROC)
library(reshape2)
library(tidymodels)
library(workflows)
library(tune)

palette_3_colors <- c("#ffeda0","#feb24c","#fc4e2a")
palette_2_colors <- c("#feb24c", "#fc4e2a")
palette_1_colors <- c("#feb24c")

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )}

#### FEATURE ENGINEERING ####
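
# Quick rarity check motivating the drop below (an added sketch, not in the
# original script): the share of cases with a recorded fatality
prop.table(table(dat_final$child_fatality))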

# Drop child_fatality as a feature: the event is too rare to carry predictive power
dat_final <- dat_final %>% 
  select(-child_fatality)

# Prepare features: factor-encode the categoricals, log-transform the skewed counts
dat_final <- dat_final %>%
  dplyr::mutate(
    dplyr::across(c(got_services, month, day_of_week, sex, race, allegation_type,
                    reporter_relation, visible_injury, children_in_hh,
                    how_many_previous_reports, got_service_before), as.factor),
    child_age = log(child_age + 0.1),  # +0.1 avoids taking log(0)
    notes_wordcount = log(notes_wordcount),
    count_keywords_in_notes = log(count_keywords_in_notes),
    nr_allegations = log(nr_allegations)
  )

#### REGRESSIONS ####

# Split train and test set

set.seed(3456)

# 75/25 split of the full simulated dataset, stratified on the outcome
data_split <- rsample::initial_split(dat_final, strata = "got_services", prop = 0.75)

cps_train <- tidymodels::training(data_split)
cps_test  <- testing(data_split)

# 3-fold cross-validation splits (used to tune the tree-based models)
cv_splits_v5 <- vfold_cv(cps_train, v = 3, strata = "got_services")

# Create Logistic Model

logit_model <- glm(got_services ~ .,
                   data = cps_train, 
                   family = binomial(link = "logit")
)

summary(logit_model)
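
# Odds ratios make the logit coefficients easier to read (a standard
# transformation, added here; not part of the original script):
exp(coef(logit_model))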

# Cross-validate the logistic model with caret (50 folds)
ctrl <- trainControl(method = "cv",
                     number = 50,
                     savePredictions = TRUE)

cvFit <- train(got_services ~ .,
               data = cps_train, 
               family = binomial(link = "logit"),
               trControl = ctrl, method = "glm")

ggplot(as.data.frame(cvFit$resample), aes(Accuracy)) + 
  geom_histogram(bins = 50, fill = palette_1_colors) +
  scale_x_continuous(limits = c(0, 1)) +
  labs(title = "Distribution of Fold Accuracies",
       x = "Accuracy",
       y = "Count") +
  plotTheme()

# Pull the logistic coefficients into a data frame
predictors <- as.data.frame(coef(cvFit$finalModel))
predictors$Coefficient <- rownames(predictors)
rownames(predictors) <- NULL
names(predictors)[1] <- "Values"
predictors <- predictors %>% filter(Coefficient != "(Intercept)")

# Top 10 largest (most positive) coefficients
predictors_top_10 <- predictors[with(predictors, order(-Values)),]
predictors_top_10 <- predictors_top_10[1:10,]

ggplot(predictors_top_10) +
  geom_bar(aes(x = reorder(Coefficient, Values), y = Values), stat = 'identity',
           fill = palette_1_colors, show.legend = FALSE) +
  coord_flip() +
  labs(title = "Most Important Predictors (largest positive coefficients)") +
  xlab("") +
  ylab("") +
  plotTheme()

# Test-set probability of transfer (class "1") from the cross-validated logistic model
cps_test$predictions_logit <- predict(cvFit, cps_test, type="prob")[,2]

# Create the preprocessing recipe shared by the RF and XGB models:
# dummy-encode nominal predictors (except the outcome) and drop zero-variance columns

model_rec <- recipe(got_services ~ ., data = cps_train) %>%
  step_dummy(all_nominal(), -got_services) %>%
  step_zv(all_predictors())

# Inspect the processed training data
glimpse(model_rec %>% prep() %>% juice())
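
# juice() above returns the processed training set; applying the same trained
# steps to new data would use bake() — a sketch (assumes cps_test as defined above):
model_rec %>% prep() %>% bake(new_data = cps_test)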

## Model specifications

rf_plan <- rand_forest() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 1000) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

XGB_plan <- boost_tree() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 100) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

# Hyperparameter grids for the random forest and XGBoost models

rf_grid <- expand.grid(mtry = c(2,5), 
                       min_n = c(1,5))

xgb_grid <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))

# Create Workflow
rf_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(rf_plan)

xgb_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(XGB_plan)

# Fit each workflow on the cross-validation resamples and compute metrics

control <- control_resamples(save_pred = TRUE, verbose = TRUE)

rf_tuned <- rf_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_v5,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(roc_auc, sens, spec))

xgb_tuned <- xgb_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_v5,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(roc_auc, sens, spec))
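
# A quick visual check of the tuning results (an added sketch; autoplot() has
# a method for tune_grid() output):
autoplot(rf_tuned)
autoplot(xgb_tuned)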

# Best hyperparameter sets by ROC AUC
show_best(rf_tuned, metric = "roc_auc", n = 15)
show_best(xgb_tuned, metric = "roc_auc", n = 15)

rf_best_params  <- select_best(rf_tuned, metric = "roc_auc")
xgb_best_params <- select_best(xgb_tuned, metric = "roc_auc")

## Final workflow
rf_best_wf <- finalize_workflow(rf_wf, rf_best_params)
xgb_best_wf    <- finalize_workflow(xgb_wf, xgb_best_params)

### Collect out-of-fold predictions for the best parameter sets
rf_OOF_preds     <- collect_predictions(rf_tuned) %>% 
  filter(mtry  == rf_best_params$mtry[1],
         min_n == rf_best_params$min_n[1])

xgb_OOF_preds    <- collect_predictions(xgb_tuned) %>% 
  filter(mtry  == xgb_best_params$mtry[1],
         min_n == xgb_best_params$min_n[1])

# last_fit() emulates the final step: after choosing the best model, refit on
# the full training set and evaluate once on the held-out test set.
rf_val_fit <- rf_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(roc_auc, sens, spec))

xgb_val_fit <- xgb_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(roc_auc, sens, spec))
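
# Test-set metrics for the finalized workflows (an added check using tune's
# standard accessor):
collect_metrics(rf_val_fit)
collect_metrics(xgb_val_fit)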

# collect test set predictions
rf_val_pred     <- collect_predictions(rf_val_fit)
xgb_val_pred    <- collect_predictions(xgb_val_fit)

# Attach each model's predicted probability of class "1" to the test set
cps_test$predictions_xgb <- xgb_val_pred$.pred_1
cps_test$predictions_rf <- rf_val_pred$.pred_1

allpred_long <- melt(cps_test, id.vars = c("got_services", "race", "sex"), measure.vars = c("predictions_logit", "predictions_xgb", "predictions_rf"))

remove(control, cps_test, cps_train, cv_splits_v5, data_split, logit_model, model_rec, rf_best_params, rf_best_wf,
       rf_grid, rf_OOF_preds, rf_plan, rf_tuned, rf_val_fit, rf_val_pred, rf_wf, xgb_best_params, xgb_best_wf, xgb_grid,
       xgb_OOF_preds, XGB_plan, xgb_tuned, xgb_val_fit, xgb_val_pred, xgb_wf)
remove(ctrl, cvFit, dat_final)
remove(predictors, predictors_top_10, palette_1_colors, palette_2_colors, palette_3_colors, plotTheme)


## DATA VISUALIZATIONS
library(tidyverse)
library(plotROC)
library(pROC)
library(caret)
library(reshape2)
library(gridExtra)

palette_3_colors <- c("#ffeda0","#feb24c","#fc4e2a")
palette_2_colors <- c("#feb24c", "#fc4e2a")
palette_1_colors <- c("#feb24c")

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )}

allpred_long <- allpred_long %>% 
  dplyr::rename(Transfered_OS = got_services,
                Model = variable,
                Probability = value,
                Race = race,
                Sex = sex)

allpred_long$Race <- if_else(allpred_long$Race == "white", "White", "Non-white")
allpred_long$Model <- if_else(allpred_long$Model == "predictions_logit", "Logistic", 
                              if_else(allpred_long$Model == "predictions_xgb", "XGB", "RF"))

#Density Plots

Lmodel_density <- ggplot(allpred_long, aes(x = Probability, fill=as.factor(Transfered_OS))) +
  geom_density(alpha=0.35, color = "grey") +
  scale_fill_manual(values = palette_2_colors,
                    labels=c("Not transferred","Transferred")) +
  labs(title = "Model Comparison",
       x="Predicted Probabilities",y="Density", fill = "Transferred to OS?") +
  theme(legend.position="bottom") +
  facet_wrap(~Model, ncol = 3) +
  plotTheme()

#ROC Curves

auc(allpred_long$Transfered_OS[allpred_long$Model == "Logistic"], 
    allpred_long$Probability[allpred_long$Model == "Logistic"])

auc(allpred_long$Transfered_OS[allpred_long$Model == "XGB"], 
    allpred_long$Probability[allpred_long$Model == "XGB"])

auc(allpred_long$Transfered_OS[allpred_long$Model == "RF"], 
    allpred_long$Probability[allpred_long$Model == "RF"])
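
# The same three AUCs in a single pass (a compact alternative sketch):
allpred_long %>% 
  dplyr::group_by(Model) %>% 
  dplyr::summarise(AUC = as.numeric(pROC::auc(Transfered_OS, Probability)))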

Lmodel_ROC1 <- ggplot(allpred_long) + 
  geom_roc(aes(d = as.numeric(as.character(Transfered_OS)), m = Probability, color = as.factor(Model)), n.cuts = 50, labels = FALSE) + 
  scale_color_manual(values = palette_3_colors) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curves by Model", color = "Model") +
  plotTheme()

Lmodel_ROC_race <- ggplot(allpred_long) + 
  geom_roc(aes(d = as.numeric(as.character(Transfered_OS)), m = Probability, color = as.factor(Race)), n.cuts = 50, labels = FALSE) + 
  scale_color_manual(values = palette_2_colors) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curves by Race", color = "Race") +
  facet_wrap(~Model, nrow = 1) +
  plotTheme()

Lmodel_ROC_sex <- ggplot(allpred_long) + 
  geom_roc(aes(d = as.numeric(as.character(Transfered_OS)), m = Probability, color = as.factor(Sex)), n.cuts = 50, labels = FALSE) + 
  scale_color_manual(values = palette_2_colors) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curves by Sex", color = "Sex") +
  facet_wrap(~Model, nrow = 1) +
  plotTheme()

# Confusion-matrix metrics (sensitivity, specificity, accuracy) at a given threshold

check_calibration <- function(probability) { 
  
  allpred_long$Classification <- if_else(allpred_long$Probability > probability, 1, 0)
  
  Logistic <- dplyr::filter(allpred_long, Model == "Logistic")
  XGB <- dplyr::filter(allpred_long, Model == "XGB")
  RF <- dplyr::filter(allpred_long, Model == "RF")
  
  conf_Logistic <- caret::confusionMatrix(reference = as.factor(Logistic$Transfered_OS), 
                                          data = as.factor(Logistic$Classification), 
                                          positive = "1")
  conf_XGB <- caret::confusionMatrix(reference = as.factor(XGB$Transfered_OS), 
                                     data = as.factor(XGB$Classification), 
                                     positive = "1")
  conf_RF <- caret::confusionMatrix(reference = as.factor(RF$Transfered_OS), 
                                    data = as.factor(RF$Classification), 
                                    positive = "1")
  
  conf_Logistic_DF <- as.data.frame(conf_Logistic$byClass)
  conf_XGB_DF <- as.data.frame(conf_XGB$byClass)
  conf_RF_DF <- as.data.frame(conf_RF$byClass)
  
  ALLconfusion_matrix <- cbind(conf_Logistic_DF, conf_XGB_DF)
  ALLconfusion_matrix <- cbind(ALLconfusion_matrix, conf_RF_DF)
  ALLconfusion_matrix$indicator <- row.names(ALLconfusion_matrix)
  
  ALLconfusion_matrix <- ALLconfusion_matrix %>% 
    dplyr::rename(Logistic = `conf_Logistic$byClass`, XGB = `conf_XGB$byClass`, RF = `conf_RF$byClass`)
  
  acc_log <- as.numeric(conf_Logistic$overall[1])
  acc_xgb <- as.numeric(conf_XGB$overall[1])
  acc_rf <- as.numeric(conf_RF$overall[1])
  
  acc <- data.frame(acc_log, acc_xgb, acc_rf, "Accuracy") %>% 
    dplyr::rename(
      Logistic = acc_log,
      XGB = acc_xgb,
      RF = acc_rf,
      indicator = `X.Accuracy.`
    )
  
  ALLconfusion_matrix <- rbind(ALLconfusion_matrix, acc) %>% 
    filter(indicator == "Sensitivity" | indicator == "Specificity" | indicator == "Accuracy")
  confusion_matrix_long <- melt(ALLconfusion_matrix, id.vars = "indicator")
  
  ggplot(data = confusion_matrix_long, aes(x = variable, y = value, fill = indicator)) +
    geom_bar(stat = "identity", position = position_dodge(.7), width = .6, color = NA) +
    geom_text(aes(label = round(value, digits = 3)), position = position_dodge(.7), vjust = -.35, size = 2.75) +
    labs(title = "Threshold for detecting a 1:", subtitle = as.character(probability), fill = "") + 
    xlab("") + ylab("") +
    scale_fill_manual(values = palette_3_colors) +
    plotTheme() +
    theme(legend.position="bottom") 
}

models_10 <- check_calibration(.10)
models_12 <- check_calibration(.12)
models_14 <- check_calibration(.14)
models_16 <- check_calibration(.16)
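
# An alternative to eyeballing the four panels above: the threshold that
# maximizes Youden's J on the logistic predictions (an added sketch using pROC;
# not how the original threshold was chosen):
roc_log <- pROC::roc(allpred_long$Transfered_OS[allpred_long$Model == "Logistic"],
                     allpred_long$Probability[allpred_long$Model == "Logistic"])
pROC::coords(roc_log, x = "best", best.method = "youden")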

# Helper: extract the legend grob from a ggplot (for a shared legend)
g_legend <- function(a.gplot){
  tmp <- ggplot_gtable(ggplot_build(a.gplot))
  leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
  legend <- tmp$grobs[[leg]]
  return(legend)
}

mylegend<-g_legend(models_14)

grid.arrange(arrangeGrob(models_10 + theme(legend.position="none"),
                         models_12 + theme(legend.position="none"),
                         models_14 + theme(legend.position="none"),
                         models_16 + theme(legend.position="none"),
                         nrow=2),
             mylegend, nrow=2,heights=c(10, 1))

# Check generalizability

check_generalizability <- function(df, title_df, probability) { 
  
  df$Classification <- if_else(df$Probability > probability, 1, 0)
  
  Logistic = filter(df, Model == "Logistic")
  XGB = filter(df, Model == "XGB")
  RF = filter(df, Model == "RF")
  
  conf_Logistic <- caret::confusionMatrix(reference = as.factor(Logistic$Transfered_OS), 
                                          data = as.factor(Logistic$Classification), 
                                          positive = "1")
  conf_XGB <- caret::confusionMatrix(reference = as.factor(XGB$Transfered_OS), 
                                     data = as.factor(XGB$Classification), 
                                     positive = "1")
  conf_RF <- caret::confusionMatrix(reference = as.factor(RF$Transfered_OS), 
                                    data = as.factor(RF$Classification), 
                                    positive = "1")
  
  conf_Logistic_DF <- as.data.frame(conf_Logistic$byClass)
  conf_XGB_DF <- as.data.frame(conf_XGB$byClass)
  conf_RF_DF <- as.data.frame(conf_RF$byClass)
  
  ALLconfusion_matrix <- cbind(conf_Logistic_DF, conf_XGB_DF)
  ALLconfusion_matrix <- cbind(ALLconfusion_matrix, conf_RF_DF)
  ALLconfusion_matrix$indicator = row.names(ALLconfusion_matrix)
  
  ALLconfusion_matrix <- ALLconfusion_matrix %>% 
    dplyr::rename(Logistic = `conf_Logistic$byClass`, XGB = `conf_XGB$byClass`, RF = `conf_RF$byClass`)
  
  acc_log <- as.numeric(conf_Logistic$overall[1])
  acc_xgb <- as.numeric(conf_XGB$overall[1])
  acc_rf <- as.numeric(conf_RF$overall[1])
  
  acc <- data.frame(acc_log, acc_xgb, acc_rf, "Accuracy") %>% 
    dplyr::rename(
      Logistic = acc_log,
      XGB = acc_xgb,
      RF = acc_rf,
      indicator = `X.Accuracy.`
    )
  
  ALLconfusion_matrix <- rbind(ALLconfusion_matrix, acc) %>% 
    filter(indicator == "Sensitivity" | indicator == "Specificity" | indicator == "Accuracy")
  confusion_matrix_long <- melt(ALLconfusion_matrix, id.vars = "indicator")
  
  ggplot(data = confusion_matrix_long, aes(x = variable, y = value, fill = indicator)) +
    geom_bar(stat = "identity", position = position_dodge(.7), width = .6, color = NA) +
    geom_text(aes(label = round(value, digits = 3)), position = position_dodge(.7), vjust = -.35, size = 2.75) +
    labs(title = title_df, fill = "") + 
    xlab("") + ylab("") +
    scale_fill_manual(values = palette_3_colors) +
    plotTheme() +
    theme(legend.position="bottom")
}

nonwhite <- dplyr::filter(allpred_long, Race == "Non-white")
white_df <- dplyr::filter(allpred_long, Race == "White")
male <- dplyr::filter(allpred_long, Sex == "Male")
female <- dplyr::filter(allpred_long, Sex == "Female")

plot_nonwhite <- check_generalizability(nonwhite, "Results for Non-white Individuals", .12)
plot_whites <- check_generalizability(white_df, "Results for White Individuals", .12)
plot_males <- check_generalizability(male, "Results for Males", .12)
plot_females <- check_generalizability(female, "Results for Females", .12)

mylegend2 <- g_legend(plot_nonwhite)

grid.arrange(arrangeGrob(plot_nonwhite + theme(legend.position="none"),
                         plot_whites + theme(legend.position="none"),
                         plot_males + theme(legend.position="none"),
                         plot_females + theme(legend.position="none"),
                         nrow=2),
             mylegend2, nrow=2,heights=c(10, 1))

remove(nonwhite, white_df, male, female, mylegend, mylegend2, plot_nonwhite, plot_females, plot_males, plot_whites,
       models_14, models_16, models_10, models_12)

# We will use the logistic model, as it has the best generalizability across races

# Classify at the 0.12 threshold chosen above
allpred_long$Predicted <- if_else(allpred_long$Probability > .12, 1, 0)

results_final <- filter(allpred_long, Model == "Logistic")

nonwhite <- filter(results_final, Race == "Non-white")
white <- filter(results_final, Race == "White")

conf_nonwhite <- caret::confusionMatrix(reference = as.factor(nonwhite$Transfered_OS), 
                                        data = as.factor(nonwhite$Predicted), 
                                        positive = "1")
conf_white <- caret::confusionMatrix(reference = as.factor(white$Transfered_OS), 
                                     data = as.factor(white$Predicted), 
                                     positive = "1")

predicted_observed_plot <- results_final %>%
  mutate(count_event = 1) %>%
  group_by(Race) %>%
  summarise(
    total = sum(count_event),
    predicted = sum(count_event[Predicted == 1]),
    observed = sum(count_event[Transfered_OS ==1]),
    pct_predicted = predicted/total,
    pct_observed = observed/total
  ) %>%
  dplyr::select(Race, pct_predicted, pct_observed)

predicted_observed_plot <- melt(predicted_observed_plot, id.vars = "Race")

predicted_observed_plot$variable <- if_else(predicted_observed_plot$variable == "pct_predicted", "Predicted", "Observed")

ggplot(predicted_observed_plot, aes(x = variable, y = value, fill = Race)) +
  scale_fill_manual(values = palette_2_colors) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Observed and predicted tranfer to Ongoing Services") +
  xlab("") + ylab("") +
  plotTheme()

confusion_matrix_plot <- results_final %>% 
  dplyr::mutate(conf_mtx = case_when(
    Transfered_OS == 1 & Predicted == 1 ~ "True Positive",
    Transfered_OS == 0 & Predicted == 0 ~ "True Negative",
    Transfered_OS == 1 & Predicted == 0 ~ "False Negative",
    Transfered_OS == 0 & Predicted == 1 ~ "False Positive",
    TRUE ~ "Unknown"
  )) %>% 
  dplyr::select(Race, conf_mtx)

confusion_matrix_plot <- confusion_matrix_plot %>% 
  dplyr::mutate(count_event = 1) %>% 
  group_by(Race, conf_mtx) %>% 
  summarise(total = sum(count_event))

confusion_matrix_plot <- confusion_matrix_plot %>% 
  dplyr::group_by(Race) %>% 
  dplyr::mutate(pct_total = (total / sum(total)) * 100) %>% 
  dplyr::ungroup()

ggplot(confusion_matrix_plot, aes(x = conf_mtx, y = pct_total, fill = Race)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = palette_2_colors) +
  labs(title = "Confusion Matrix Rates by Race", subtitle = "Percentage of predicted conditions") +
  geom_text(aes(label = round(pct_total, digits = 2)), position = position_dodge(.7), vjust = -.35, size = 2.75) +
  plotTheme() +
  xlab("") + ylab("")