Predicting Voluntary Employee Turnover

Jiali Yao | Olivia Scalora | Johnathan Clementi

header

This project was produced as part of the University of Pennsylvania’s Master of Urban Spatial Analytics Spring 2022 Practicum (MUSA 801) taught by Michael Fichman, and Matt Harris. We would like to give special thanks to June Harley, Graham Rothrock, David Albright, Sherri Bigelow and Jason Jones of Guilford County Human Resources, for providing data, insight and support throughout the semester. This project would not have been possible without them.

This document begins with an abstract detailing the motivation and use case for the project. This is followed by a series of appendices that discuss the data cleaning, panel creation, data visualization, feature engineering, and model results in greater depth. The full reproducable code is provided at the end of the report. Navigate through the document either by using the table of contents at the left, or scrolling through sequentially.

For specific questions about project, please contact the students
jialiyao99@gmail.com | oliviascalora@gmail.com | johnathan.clementi@gmail.com

1. Introduction

1.1 Abstract

Governments and firms invest precious time and money into training their workforce. Each employee also holds institutional knowledge that is crucial to the organization functioning effectively. When an employee leaves (when there is “turnover”), they take with them that knowledge, and the impact of their departure is felt not just during their absence, but during the training of their successor.The Work Institute conservatively estimates that the cost to lose a U.S. worker is $15,000 ¹.

An employee’s satisfaction with their position at a company has many layers. What are their job expectations? What is their starting salary? How long have they been employed with the company and how many promotions have they received in that amount of time? Is there opportunity for personal and professional growth and pay raises? With these questions in mind, we can begin to conceptualize a system that uses employee satisfaction and other individual factors as indicators of how likely an employee is to leave their position within the county.

1.2 Use Case

Using County level government employee data, the goal of this project is to create a proof-of concept analytical and informational system that will associate a turnover risk score with each employee such that Human Resources and supervisors can implement a “stay interview” process to address employees predicted to be most likely to leave their job within a specified period of time (i.e., 6 months, 12 months, 18 months).

1.3 Employee Profile Example

Below, we highlight two examples of what an individual’s experience through Guilford County employment might look like.

Hire Date: January 02, 2008

Start Position: Eligibility Caseworker

Department: Social Services

Start Salary: $29,338.00

End Salary: $44,920.27

Termination Date: July 10, 2015

Tenure: 7 years 6 months

Jane Doe was hired in January of 2008 as an Eligibility Caseworker in the Social Services department. She was 27, paid hourly and made just under $30,000 a year. She lives in the city of Greensboro, and has a short 7 minute commute to work, assuming she drives a car. Over the next two years, Jane was given 5 minor raises, and became a Lead Eligibility Caseworker. By the end of 2012, Jane was promoted to Eligibility Supervisor 1, a salaried position paying about $41,000. By July of 2015, Jane was no longer employed with the county.

Hire Date: May 01, 1995

Start Position: Social Work Supervisor: Protective Service

Department: Social Services

Start Salary: $54,946.27

End Salary: $84,814.73

Termination Date: NA - Still Active

Tenure: 27 years

John began his position at Guilford County as a Child Protection Social Worker in 1999, and his starting salary was around $55k. He lives just outside of Greensboro and drives about 15 minutes to work. John has held this position with consistent pay raises for 22 years and is still actively employed with Guilford County Social Services today.

2. Exploratory Data Analysis

2.1 Defining Turnover

When an employee’s tenure ends, the turnover type can be categorized as VOLUNTARY, INVOLUNTARY or RETIRE. If an individual is currently employed, they are categorized as ACTIVE. For our use case, we are most interested in employees who leave county employment voluntarily. Turnover type is classified below.

1. Voluntary
- Job dissatisfaction
- Accepted job elsewhere
- Stops reporting to work

2. Involuntary
- Fired - unsatisfactory job performance
- Laid off - budget reasons or position elimination
- Medical reasons or death

3. Retired

4. Active
- Still employed with the county.

TURNOVER, the dependent variable for our predictive model, has two possible outcomes (binary).
1 indicated VOLUNTARY turnover
0 indicates ACTIVE, RETIRE, and INVOLUNTARY termination

2.2 Data Cleaning and Wrangling

The data that power this study are administrative data collected by the Human Resources department of Guilford County. There are six primary collections of data, consisting of a total of 28 data sets. These data collections are: Personal Information, Position Transactions, Termination Reports, Job Information, Department Information, and Employee Evaluations. These collections correspond with the data cleaning code chunks found in the appendix. Held within these collections are both cross-sectional and time-series data. Cross-sectional data are data that represent a single point in time, whereas time-series data contain multiple observations for an individual across time.
Our data required a considerable amount of data cleaning and wrangling as many details about an employee’s tenure with the County are spread across the many tables we received from our client. Columns removed from these data sets were removed for having not enough information (too-many NA’s) or were deemed not relevant to our study upon consultation with our client.

1. Personal Information
The Employee Extract table is the primary container of employee personal information. It contains the unique ID for each employee, first name & last name, address columns, employment status at the date in which the data were exported (January 2022), hire date, termination date (if applicable), the Employee ID of their supervisor, and information regarding pay and benefits. To clean these data, we removed some extraneous columns, parsed the date columns and filtered out part-time and non-benefited employees based on consultation with our client.
Additional personal information comes from the table known as PAEmployeeExtract, including the gender, race / ethnicity, marital status, smoker status, and age of the employee at the time the data were exported. Both Employee Extract and PAEmployeeExtract are cross-sectional and only have observations for an employee at the time the data were exported.

2. Position Transactions
The position transactions (PositionTrans) data are time-series and contain a record of every time an employee received a raise, was promoted, was demoted, or changed jobs. This data set contains important information about how much an employee was making before and after one of these job changes. The time range for these data begin in March of 1997 and extend to the end of 2021, however 2006 is the first year with granular data. Therefore, we restrict our study to employees that were employeed with the county between 2006 and 2021.

3. Termination Reports
The termination reports (PA840) data are time-series and contain a record for every employee turnover event. Paired with the Reason Code data dictionary, these data contain information detailing when an employee left employment with the county and the reason they gave for leaving. Since the goal of the project is to predict when an employee will voluntarily leave (as opposed to predicting when an employee will be terminated with cause), we separate out these turnover events into three categories: Voluntary, Involuntary, and Retire, as detailed in Section 2.1 above.

4. Job Information
The job title tables (HR206 and HR206_Jobcodes) contain job titles and corresponding job codes. These data are useful as the job codes held in the Personal Information section are numeric, rather than containing the actual job title. These data were joined with the employee personal information via the job code number.

5. Department Information
The HR202 table contains the department title (known as process level). This department information, much like the job titles, is numeric in the Personal information and after being cleaned, is joined with the personal information and position transaction reports. These data were joined with the employee personal information via the department number.

6. Employee Evaluations
We also had access to employee evaluation reports containing details about an employee’s performance, however these data were ultimately not included in our final model as they did not match the time availability and granularity as our other data sets.

7. Commute Estimation
To get a sense of an employee’s possible commute to work, we made use of the Google Maps Geocoding and Routing API using the employee addresses held in the personal information tables. The main County office was set as the destination point, which may be inaccurate for some employees.

2.3 Panel Creation

The next step of our process is to create a data set with our dependent and independent variables that will allow us to detect at which point in time any given employee is at risk for leaving their position voluntarily. This data set will consist of observations for each employee at each time observation at monthly intervals. This type of data in which there is an observation for every individual at every time record is known as a panel or long-form data set. The panel allows us to correlate events that occur prior to turnover with the turnover event.

The first step in the creation of the panel is to combine all of the cross-sectional data into one table. To do this, we use the employee’s unique identifier, EMPLOYEE, to join the tables, resulting in a dataframe with the following columns: EMP_STATUS, EMPLOYEE, Calculated_Age, SEX, SMOKER, EEO_CLASS, TRUE_MAR_STAT, EXEMPT_EMP, SUPERVISOR, HIRE_SOURCE, DEPARTMENT, JOB_CODE, "JBC_DESCRIPTION", FNCTN_GROUP_DESCRIPTION, DEPARTMENT, NEXT_REV_CODE, DATE_HIRED, ADJ_HIRE_DATE, TERM_DATE, NEW_HIRE_DATE, PAY_RATE, STAND_HOURS, PENSION_PLAN, ANNUAL_HOURS, mean_distance, mean_duration, BIRTHDATE

Next, we create a panel by using the cleaned position transactions data set. Originally, this data set only contains a record for an employee when they experience some job change. However we need a data set that contains a record for each employee for each month in our study window (2006-2021). This is done by using the expand.grid function. We then create a unique identifier for each employee-month combination so that we can join in the time-series data from the termination reports. We also fill each employee’s information to the rows above and below.

Once the panel has been created, we join the cross-sectional data to the panel via the EMPLOYEE ID. We also join in pay plan information so we can later explore an employee’s salary to the market rate.

Next we use the features we have added to the panel to engineer our dependent variable: turnover. If you recall, when we cleaned the termination reports, we created a variable called turnover, with choices of Voluntary, Involuntary, and Retire. We use this variable to create a turnover window which will help us associate events occurring before an employee turnover with the turnover event itself. To do this, we lead the turnover variable by 6, 12, and 18 months.

Conceptual window of turnover

We then mark any employee-month rows after turnover as post-termination and any rows before our turnover windows as active. Finally, we mark employees as pre-hire if their employee-month row occurs before their hire-date. The logic behind these classifications involves creating a series of binary classifiers: afterHire, beforeTerm, and the composite of those: currentStatus. An employee’s currentStatus is 1 if they were actively employeed with the County that month, and a 0 if they were not.

2.4 Temporal Trends

The data we are working with describes 1,481 individuals’ experience through Guilford County employment. The data set observes each employee monthly between the years 2006 and 2021, giving every employee numerous observations. At any point in time, an employee’s status for TURNOVER will reflect whether they are PRE-HIRE, ACTIVE, terminated (VOLUNTARY, INVOLUNTARY, or RETIRE), or POST-TERM. Through creating a time lagged version of an employee’s current TURNOVER status, we are able to predict 6, 12 or 18 months ahead of time whether or not the employee will have a voluntary turnover event.

Figures x and x illustrate the trends in TURNOVER observations across time. As depicted by figure x, the count of employees actively employed with Guilford County has been steadily declining since the start of our study period. In January of 2006, Guilford County had 1123 active employees, while in January of 2022 it had only 777 active employees. Interestingly, the count of voluntary turnover observations has a normal distribution across our study period, visualized in figure x. The peak of voluntary turnover occurred in 2016 with 154 observed events. The month of September had the highest count of 16 employees leaving employment by choice.

2.5 Exploratory Analysis

To explore our data further, we formulate a series of hypotheses. We are looking for trends and patterns in our predictor variables that may explain voluntary turnover.

Hypothesis 1

Employees with shorter tenure are more likely to leave voluntarily as opposed to employees with long tenure.

According to the Work Institute 2019 Turnover Report, over one third (38.6%) of new employees quit within 365 days or less. ² We test this with our data by categorizing tenure by number of years employed. The employee timeline plot (figure x) illustrates each employee’s tenure through county employment by turnover type. The timelines of employees who terminate employment voluntarily are strikingly short in comparison to involuntary and retirees. Figure x confirms that lower tenure categories correspond to higher voluntary turnover rates. However, the “1 to 5 years” category has an alarmingly high percentage of voluntary turnover rate at 87% — 25% higher than the “Less than 1 year” category.

Hypothesis 2

Employees whose salary is under the market rate for their job position are more likely to leave their position for other work than employees who are paid at least the market rate salary for their position.

Market rate values for each position are sourced from Human Resources salary data provided by the Human Resources team at Guilford County. There is a huge discrepancy in the amount of Guilford County employees who are paid at or below the market rate for their position versus employees who are paid above the market rate. From our sample of county employees, 451 are paid above the market rate, while almost double the amount, 851, are paid below the market salary rate for their respective positions. Figure x illustrates the relationship between turnover type and an employee’s salary. Clearly, there is high turnover for individuals who are paid below the market rate. Voluntary turnover has a slightly higher proportion than involuntary turnover, but they are both significant. Figure x below gives us a clearer idea of the turnover that is happening in relation to salary. For employees who are paid above the market rate of their salary, the majority leave county employment through retirement. Conversely, almost 60% of employees who are paid below the market rate leave their positions voluntarily.

COUNT OF EMPLOYEES
Above Market Rate: 451 Below Market Rate: 851

Hypothesis 3

Employees who have longer commute times are more likely to leave their positions for other work than employees who have shorter commute times.

According to the Commuting and Well Being study from the University of West England, job satisfaction decreases as commute time increases. ³ Using Google’s mapsapi R package, we’ve geocoded the address on file for each employee and calculated the average commute duration and distance in miles to work. There are limitations to this calculation because it assumes all Guilford County employees have the same place of work at the Guilford County Government Office in Greensboro, NC. We recognize that place of work will vary depending on position and department for each employee, and the following figures should be viewed with this consideration.

Figure x illustrates the relationship between turnover incidents and commute duration. The relative frequency of VOLUNTARY Turnover between under 15 minutes commute time to up to 1 hour commute time remains consistent at around 40%. VOLUNTARY turnover spikes to 75% in the 1 to 2 hour commute category while ACTIVE employment drops from 50% to 17.5%. The majority of employees (60%) who commute over 2 hours to work are also leave employment voluntarily.

Figure x similarly examines the relationship between distance in miles from each employee home to Guilford County Government Offices in Downtown Greensboro. As we would expect, the relative amount of VOLUNTARY turnover increases as distance increases. The majority of employees who live less than 5 miles away from work are still actively employed, while 66% of employees who live over 60 miles away from their place of work have left their positions voluntarily.

Hypothesis 4

Employee age bracket will have significance in turnover events. Employees who are younger are more likely to leave their position voluntarily.

A report from PEW Research Center ⁴ states that adults younger than 30 are far more likely than older adults to have voluntarily left their job last year: 37% of young adults say they did this, compared with 17% of those ages 30 to 49, 9% of those ages 50 to 64 and 5% of those ages 65 and older. We test this with Guilford County data and expect to see higher percentage of turnover in lower age brackets.

Figure x illustrates the high degree of voluntary turnover across all age brackets as a result of high levels of voluntary turnover County-wide. However, the “Under 30” and “50 to 65” age brackets have the highest percentage of voluntary turnover at 62% and 70% respectively. Figure x depicts voluntary turnover by age bracket and reveals that nearly 50% of voluntary turnover are employees under 30 years old.

Hypothesis 5

Work environment and employee experience will vary by County Department. Unsatisfactory work environments will result in more voluntary turnover.

An employee’s experience in a position with the County will be heavily influenced by the department under which they are employed. Factors that affect work environments and relative satisfaction include the demographic makeup of the department, supervisor performance, and job expectations.
Each County Department falls under a function group. The groupings are as follows: Departments

First, because we know tenure is a significant indicator of voluntary turnover from previous exploratory plots, we plot the average tenure of each department group. On the left, figure x tells us that Health and Community Development have the shorter average tenure. On average, an employee within the Health department will work for just over 10 years. In contrast, Law Enforcement employees have the longest average tenure of over 20 years of employment.
Figure x on the right illustrates count of turnover type by department group. As the short tenure leads us to expect, the Health department and Community development both have disproportionately high instances of VOLUNTARY turnover compared to INVOLUNTARY. Conversely, Law Enforcement is one of the groups with the lowest proportion of VOLUNTARY turnover. Financial Administration is the only department grouping that does not align with our hypothesis of voluntary turnover being a function of short average tenure.

Engineered Features

Relative Pay Raise

To explore the relationship of voluntary turnover with department groups further, we engineer new features for each employee observation that describe an employee’s relative position within their respective departments at any given point in time. The first feature we engineer is relative pay raise. The feature is engineered by calculating the average raise (percent change) within each department within each year. Then, we calculate each individual employee’s pay raise (percent change) that year while they are actively employed. If the percent change of an employee’s pay is equivalent or more than the department average percent change in pay, they receive a 1. If the employee’s percent change in pay is less than the department average that year, they receive a 0. Figure X plots county-wide voluntary turnover by relative pay raise. The plot tells us that 66.8 % of voluntary turnover instances are of employees who’s pay raise that year was below the average for the department.

Figure x below illustrates the relationship between employee and county level percent change in pay over time for instances of VOLUNTARY turnover. The blue line represents the average voluntary turnover employee pay raise for each year, while the dark purple line represents the average county level percent change in pay for each year. The county level pay raise is calculated by taking the mean department average across the county for each year. To interpret this plot, we can say that in 2010, on average, the employees who left Guilford County voluntarily were paid 1 - 2% below the average for their respective department, while in 2020 the average percent change in pay across the county was lower than the average individual percent change in pay. This is likely due to the COVID19 Pandemic affecting turnover. This means for a short amount of time, relative pay raise was not affecting voluntary turnover the way we would expect it to. We can see the relationship has again switched back to percent change in pay across the county being higher than percent change in pay for the individual employee.

Relative Ethnicity

The next feature we engineer regarding relative position within a department is whether or not an employee identifies as the same EEO Class, or ethnicity as the majority within their department. We create this feature by grouping each department at each time observation within our study period. We filter the data where the employees are active, and calculate the majority EEO_CLASS for that observed month. If an employee is the same EEO_CLASS as the majority of their respective department at that point in time, they are classified as 1. If they are not the majority, they are classified as 0. The plots below explore the relationship to relative ethnicity within an employee’s department and VOLUNTARY turnover.

Figure x illustrates that 62.1% of employees who are not the majority EEO class MAJ_EEO within their department decide to leave voluntarily. The high percentage of voluntary turnover amongst employees who are MAJ_EEO within their department is also worth noting. The significance of this feature may not be as strong as we expected. Figure x examines the temporal trends in voluntary turnover and relative ethnicity. This plot reiterates the deduction that an employee’s EEO class relative to the department majority may not be significant in predicting voluntary turnover. We can see that only in 2010 and 2021 the employees not in the majority EEO class turned over voluntarily at a higher rate than employees who were in the majority EEO class.

Relative Gender

Next we engineer a feature for majority gender MAJ_GENDER using the same process as MAJ_EEO above. We calculate the majority gender for each department at each point in time. If an employee identifies as the majority gender, they are classified as 1, otherwise they are classified as 0. Figure x tells us that a significant proportion of employees who are not the same gender as the department majority are turning over voluntarily. 64.6% of employees not the majority gender voluntarily leave employment, while an equal proportion of employees who are the same gender as the majority in their respective department leave employment voluntarily and involuntarily.
Furthermore, when we plot the proportion of majority gender vs not majority gender within voluntary turnover occurrences over time, we can see that employees who leave employment voluntarily are consistently not the majority gender within their respective departments. The proportions after 2010 are quite striking with over 75% of voluntary turnover resulting from employees that of the majority gender year after year.

Relative Age

Finally, we look at the relationship of relative age group and voluntary turnover. Age groups are defined as “20s”, “30s”, “40s”, “50s” and “60 or older”. We calculate the average age group for a department at each point in time. For the MAJ_AGE_GROUP variable, if an employee falls under that age group at that point in time, they are majority and are classified as 1, otherwise they are classified as 0. Figure x tells us that despite being being in the majority age group relative to their departments, 81.4% of employees in this group leave employment voluntarily, while 51.1% of employees not in the majority age group leave voluntarily.
The significance of relative age becomes clear when we examine the temporal trend of voluntary turnover by MAJ_AGE_GROUP. Like the plots above, figure x describes the proportion of MAJ_AGE_GROUP in voluntary turnover for each year of our study period. Since 2010, an employee’s age relative to the majority age group of their respective department has increased the likelihood of voluntary turnover. In 2018, over 80% of voluntary turnover occurred from individuals who were not in the same age group as the majority of their coworkers.

3. Predictive Model Building

Our goal here is to predict the status of an employee 6 months from now using their personal and working information. We will use the data from the last 15 years to build the model, for which we have created the study.panel as described in the previous section, so that we are able to capture the dynamic changes in the employees over the years as well as take the time effect on the model into consideration.

We want to optimize our model to have good generalizability. In other words, our model needs to adapt properly to new, previously unseen data, for different types of groups of people, locations or situations. As shown below, even though there is no geographic elements in this case, we still generalize our model on the department level and then choose the optimized parameters to build the model.

3.1 Split the data

The characteristics of the data are chronological and very imbalanced, and we can only use the previous data to test/predict the future. Therefore, instead of splitting the data randomly, we will first split the data into the training set and testing set based on the year: 2006-2016 as training, 2017-2022 as testing;

then we will do the oversampling on the minority level of the outcome (VOLUNTARY) so that we could have a balanced dataset without losing information. The method we use to rebalance the data is smote(Synthetic Minority Oversampling Technique), the concept behind this is clustering analysis, specifically K means, based on which it can generate synthetic observations without break the patterns of the original data.

Moreover, since the future prediction will always have the same characteristics as our sample data, the oversampling would only be operated on the training set, not the testing set.

3.2 Modeling & Evaluation

After the EDA and hypotheses that we make in the previous section, we finally select 17 features (including both categorical and continuous variables) to predict the employee turnover.

We compare each of the 3 following regression methods: logistic regression, RF (random forest), and XGB (extreme gradient boosting). Each of these three models can be appropriate for determining a binary outcome, in which “true” would signify that an employee is going to be active or leave voluntarily, and the positive event would be VOLUNTARY. The model would output a risk score of 0 to 1 for each employee, where each score is interpreted as the probability that the employee will leave the County voluntarily within 6 months.

We fit all 3 models on the testing set to evaluate which model performs better. Several metrics are used, we mainly focus on balanced accuracy, sensitivity, and specificity. As mentioned before, our data is highly imbalanced, so accuracy is not suitable as a metric anymore, instead, we use balanced accuracy, which has considered the imbalance in the calculation. As for sensitivity and specificity, the higher the value the better the performance, and ideally, these two value themselves should be relatively balanced. All metrics considered, we will select random forest to do the final prediction. However, further adjustment is needed.

From top to bottom: Logistic Regression,Random Forest,extreme gradient boosting

**Metrics of 3 Models**

Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	Precision	Recall	F1	Prevalence	Detection Rate	Detection Prevalence	Balanced Accuracy

0.835	0.434	0.041	0.989	0.041	0.835	0.078	0.028	0.023	0.574	0.634

**Metrics of 3 Models**

Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	Precision	Recall	F1	Prevalence	Detection Rate	Detection Prevalence	Balanced Accuracy

0.443	0.890	0.104	0.982	0.104	0.443	0.169	0.028	0.012	0.119	0.667

**Metrics of 3 Models**

Sensitivity	Specificity	Pos Pred Value	Neg Pred Value	Precision	Recall	F1	Prevalence	Detection Rate	Detection Prevalence	Balanced Accuracy

0.361	0.893	0.088	0.980	0.088	0.361	0.142	0.028	0.010	0.114	0.627

Next, we reset the threshold based on the distribution of predicted probabilities by actual status. As plotted below, in order to optimize the confusion matrix, we set the threshold to 0.2, meaning if the probability is greater than 0.2, then we predict as VOLUNTARY. Now, the balanced accuracy becomes 0.72(compares to initial 0.67), sensitivity is 0.84(compares to initial 0.44), and specificity is 0.63(compares to initial 0.89). Even though specificity drops, according to the cost that will be explained later, we are more happy to see the sensitivity increases significantly. Moreover, the values of sensitivity and specificity become more balanced, which means our model’s ability to correctly predict both employee status

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  ACTIVE VOLUNTARY
##   ACTIVE      2114        18
##   VOLUNTARY   1257        79
##                                              
##                Accuracy : 0.6324             
##                  95% CI : (0.6161, 0.6484)   
##     No Information Rate : 0.972              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0613             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.81443            
##             Specificity : 0.62711            
##          Pos Pred Value : 0.05913            
##          Neg Pred Value : 0.99156            
##              Prevalence : 0.02797            
##          Detection Rate : 0.02278            
##    Detection Prevalence : 0.38524            
##       Balanced Accuracy : 0.72077            
##                                              
##        'Positive' Class : VOLUNTARY          
##

The Receiver Operating Characteristic Curve or ROC Curve is useful because it visualizes trade-offs for two important confusion metrics, while also providing a single goodness of fit indicator. In the figure below, the y-axis of the ROC curve shows the rate of true positives (observed VOLUNTARY, predicted as VOLUNTARY) for each threshold from 0.01 to 1. The x-axis shows the rate of false positives (observed ACTIVE, predicted as ACTIVE) for each threshold. AUC is another quick goodness of fit measure to guide feature selection across different models, the area under curve is 0.78, so our model is relatively good.

## Area under the curve: 0.78

4. Cost and Benefit Analysis

Keeping the valuable employees is maintaining the existing asset of the County, so the County will not be generating any revenue from this process. The goal here is to reduce the cost of turnover. The cost of turnover is 20% of the average compensation, and based on time cost: A stay interview takes 30 min on average, $28/hr per interview, therefore a 30-min stay interview would cost $14.

The model has an associated ‘Confusion Matrix’ which will output the amount of all 4 possible combinations of prediction outcomes of our binary dependent variable. The components of this matrix converted to cost are:

True positive: “We predicted turnover, conducted a stay interview”: So the cost is only from the interview which is $28*0.5 = $14; however, even with an intervention, we still anticipate there is 20% chance that the employee would leave, so the equation is ($28*0.5)*80%+($62275*20% +$28*0.5)*20%, and then the total cost would be $2505.6;

False negative: “We predicted no turnover, did not conduct a stay interview, but the employee left”: the cost is ($62275*20%)= $12,000;

True negative: “We predicted no turnover, did not conduct a stay interview, and the employee stayed active”: so employees continue to bring value to the county and the County pays for the salary, so the cost would be $0;

False positive: “We predicted turnover and conducted the interview on the employee was not going to leave”:so the cost is from the interview which is $14.

Ideally, we would use this analysis to find the best threshold to get the lowest cost, but because the information we have on this are just monetary cost, and the cost of false positive, incorrectly predicted ACTIVE as VOLUNTARY and give the interview, is only $14 per employee, which is nothing compared to false negative or true positive. So eventually, as shown below, the analysis would prefer we predict every observation as VOLUNTARY, which is clearly not reasonable. In order to do a solid CAB analysis to decide threshold, we need more information, such as,

What is the success rate of the stay interview?
How many stay interviews can be conducted in a week?
If active employees are asked to have a stay interview, will they doubt themselves and potentially generate the idea of leaving?
Are there any other intervention methods, what do they cost?
Does the stay interview cost more than just purely time cost?

…

Here, we use cost and benefit analysis to simply calculate how our model can save by predicting ahead and intervening compared to having no intervention at all. The total cost is $439,730.4.If we do not do any intervention, then the cost would be (102*$62,275*20%) = $1,270,410. The our model could save the County ($1,270,410-$439,730.4)/$1,270,410= 65.4%.

Cost/Benefit Table
Variable	Count	Each_Employee_Cost	Cost
True_Negative	2114	$0	0.0
True_Positive	79	$2505.6	197942.4
False_Negative	18	$12455	224190.0
False_Positive	1257	$14	17598.0

5. Conclusion

5.1 Discussion

The goal of our analysis was to produce a system which will allow Guilford County Human Resources to more efficiently allocate their limited resources towards conducting stay interviews with the intention of increasing employee retention. Our cost benefit analysis emphasizes the utility of this model by highlighting the County’s potential to save resources against the business as usual system. The Human Resources Turnover Dashboard provides a proof-of-concept application to be used by County employers to assess turnover risk across employees and departments.

View Turnover Dashboard

There are clear limitations in our analysis and model, and room for improvement. We were graciously supplied with ample and clean data by Guilford County Human Resources Department. When building the panel data, we were unable to join certain data sets due to timeline or common identifier incompatibility. The time-series nature of our panel expects static variables that do not change over time (sex, race, etc.) and dynamic variables that change over time (job title, salary, age, etc.). We did not have access to all of the dynamic variables, therefore certain variables which are represented as static throughout an individual’s employment tenure may have changed in reality (commute duration, marital status, smoker status). Additionally, we omitted valuable employee evaluation data from the analysis due to incompatible timelines. Ideally, we would have multiple evaluation scores for each employee represented in our panel, however the data set provided insufficient dates and employees. A join would have resulted too great a loss of data from our panel.

5.2 Acknowledgments

Despite these challenges, our analysis has resulted in a proof-of-concept system that is more efficient than the current business as usual approach to employee retention and stay interview process. Again, we would like to thank our professors, Michael Fichman and Matt Harris, for their continued support and critique throughout the spring semester. Additionally, the involvement of Guilford County Human Resources team, June Harley, Graham Rothrock, David Albright, Sherri Bigelow and Jason Jones, has been indispensable and greatly appreciated. Without the hard work of this team behind gathering data, providing data dictionaries, definitions, insight, suggestions and support, this analysis would not have been possible.

5.3 Code Appendix

knitr::opts_chunk$set(echo = TRUE)

# Load Libraries 
library(tidyverse)
library(rlang)
library(nnet)
library(dplyr)
library(tidyr)
library(lubridate, warn.conflicts = FALSE)
library(kableExtra)
library(RColorBrewer)
library(ggthemes)
library(data.table)
library(ggplot2)
library(raster)
library(sf)
library(janitor)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(ROCR)
library(grid)
library(gridExtra)
library(stargazer)
library(ggcorrplot)
library(DataExplorer)
library(mapsapi)
library(leaflet)
require(tidyr)
library(xml2)
library(htmltools)
library(tidymodels)
library(viridis)

options(scipen=999)
options(tigris_class = "sf")
C = "#8856A7"
B= "#9ebcda"
active<- "#7fd4b6"
pal1<- c("#dfd885", "#aba7cf")
pal3<- c("#dfd885", "#aba7cf", "#7fd4b6")
pal3_ret<- c("#dfd885", "#aba7cf", "#e79899")
pal4<- c("#dfd885", "#aba7cf", "#7fd4b6", "#e79899")
pal5<- c("#f3efea", "#e4dde1", "#a797a1", "#8f7b87", "#434343")
pal6<- c("#dfd885", "#aba7cf", "#7fd4b6", "#e79899", "#9ebcda", "#E8BF99")
source("loaddata.R")
# CLEAN EMPLOYEE EXTRACT
EmployeeExtract_cl <- EmployeeExtract %>%
  dplyr::select(-COMPANY, -NICK_NAME, -MIDDLE_NAME, -HM_ACCOUNT, -HM_ACCT_UNIT, -EIC_STATUS, -MIDDLE_INIT, -SHIFT,-PAY_STEP,-TAX_PROVINCE,-PIN, -WORK_STATE, -WC_STATE, -EMAIL_ADDRESS, -ACTIVITY,-ACCT_CATEGORY,-SEC_LVL,-CURRENCY_CODE, -CURR_ND,-COUNTRY_CODE,-NAME_SUFFIX,-WORK_COUNTRY, -FST_DAY_WORKED, -EST_EXPENSE, -DEATH_DATE,-LAST_NAME_UC,-FIRST_NAME_UC,-TIPPED,-MOVING_EXP, -HM_SUB_ACCT)

#Identify DV: 0/No = Active employee; 1/Yes = Terminated, left or Retired 
EmployeeExtract_cl <- EmployeeExtract_cl %>%
  mutate(EMP_STATUS = case_when(
    EMP_STATUS == "T1"|EMP_STATUS == "T2"  ~ "T",
    EMP_STATUS == "R1"|EMP_STATUS == "R2"|EMP_STATUS == "R9"  ~ "R",
    EMP_STATUS == "A5"|EMP_STATUS == "AS"|EMP_STATUS == "A1" |EMP_STATUS == "SA"|EMP_STATUS == "AC"|EMP_STATUS == "AA" |EMP_STATUS == "A3" |EMP_STATUS == "A2"|EMP_STATUS == "AE"|EMP_STATUS == "AF"|EMP_STATUS == "A4"|EMP_STATUS == "AG"|EMP_STATUS == "AB"|EMP_STATUS == "L1"|EMP_STATUS == "L2"  ~ "Active"))

# Lubridate Dates in EmployeeExtract 
EmployeeExtract_cl <- EmployeeExtract_cl %>%
  mutate(DATE_HIRED = lubridate::ymd(DATE_HIRED),
         ADJ_HIRE_DATE = lubridate::ymd(ADJ_HIRE_DATE), 
         ANNIVERS_DATE = lubridate::ymd(ANNIVERS_DATE),
         TERM_DATE = lubridate::ymd(TERM_DATE),
         CREATION_DATE = lubridate::ymd(CREATION_DATE),
         PEND_ACT_DATE = lubridate::ymd(PEND_ACT_DATE),
         NEW_HIRE_DATE = lubridate::ymd(NEW_HIRE_DATE),
         LAST_DAY_PAID = lubridate::ymd(LAST_DAY_PAID))
#### CLEAN PAEMPLOYEE EXTRACT ####
PAEE_clean <- PAEmployeeExtract %>%
  dplyr::select(EMPLOYEE, EEO_CLASS, SEX, VETERAN, Calculated_Age, NEXT_REV_CODE,
                HIRE_SOURCE, LAST_REVIEW, FNCTN_GROUP, SMOKER, TRUE_MAR_STAT)
#### CLEAN HR863_PositionTransactions
PT_clean <- PositionTrans %>%
  dplyr::select(EMPLOYEE, PAY_RATE, EMP_EFFECT_DATE, JOB_CODE, PROCESS_LEVEL, DESCRIPTION, SALARY_CLASS, PAY_GRADE, CHG_REASON, )
PT_clean <- filter(PT_clean, EMPLOYEE != 0)
PT_clean <- PT_clean %>%
  mutate(PAY_RATE = str_replace(PAY_RATE, "\\$", ""),
         PAY_RATE = str_replace(PAY_RATE, "\\,", ""), 
         PAY_RATE = as.numeric(PAY_RATE)) %>%
  mutate(eff_year = lubridate::year(EMP_EFFECT_DATE),
         eff_month = lubridate::month(EMP_EFFECT_DATE)) %>%
  mutate(eff_yr_month = paste0(eff_year,  "/", eff_month)) %>%
  mutate(eff_yr_month = lubridate::ym(eff_yr_month)) %>%
  # Create foreign key to join Position Transaction data to panel data
  mutate(emp_yr_month_KEY = paste0(EMPLOYEE, "_", eff_yr_month))
raises <- PT_clean%>%
  group_by(EMPLOYEE)%>%
  arrange(EMP_EFFECT_DATE)%>%
  mutate(RAISE_amt = PAY_RATE - lag(PAY_RATE),
         pct_change = round((PAY_RATE/lag(PAY_RATE) - 1) * 100,2))%>%
  arrange(EMPLOYEE)%>%
  dplyr::select(EMPLOYEE, EMP_EFFECT_DATE, DESCRIPTION, SALARY_CLASS,
                PAY_RATE, RAISE_amt, pct_change)
# Clean PA 840
ReasonCode_cl <- ReasonCode
names(ReasonCode_cl) <- gsub(x = names(ReasonCode_cl), pattern = "-", replacement = "_")
names(ReasonCode_cl) <- gsub(x = names(ReasonCode_cl), pattern = " ", replacement = "_")
ReasonCode_cl<- ReasonCode_cl%>%rename(LOG_REASON = ACT_REASON_CD)
ReasonCode_cl <- ReasonCode_cl %>% dplyr::select(-DATE_STAMP)
PA840_cl <- PA840
names(PA840_cl) <- gsub(x = names(PA840_cl), pattern = "-", replacement = "_")
names(PA840_cl) <- gsub(x = names(PA840_cl), pattern = " ", replacement = "_")

PA840_cl <- PA840_cl %>%
  rename(ChangeReasonGroup = "Items_To_Select_Name", # Rename colname to something more informative
         EMPLOYEE = LOG_EMPLOYEE,
         LOG_REASON = LOG_REASON_1) %>% 
  dplyr::select(-c(LOG_ANT_END_DATE, LOG_ERR, LOG_REASON_2, # These columns are entirely NA's
                   LOG_POS_LEVEL, # This column only has 1 value 
                   LOG_CURRENCY, LOG_USER_ID)) %>% # Currency not useful, UserID is personal identifiers
  mutate(BEG_DATE = lubridate::mdy(BEG_DATE),
         END_DATE = lubridate::mdy(END_DATE),
         LOG_EFFECT_DATE = lubridate::mdy(LOG_EFFECT_DATE),
         LOG_DATE_STAMP = lubridate::mdy(LOG_DATE_STAMP)) %>%
  # Create employee-date unique identifier for panel joining
  mutate(eff_year = lubridate::year(LOG_EFFECT_DATE),
         eff_month = lubridate::month(LOG_EFFECT_DATE)) %>%
  mutate(eff_yr_month = paste0(eff_year,  "/", eff_month)) %>%
  mutate(eff_yr_month = lubridate::ym(eff_yr_month)) %>%
  # Create foreign key to join Position Transaction data to panel data
  mutate(emp_yr_month_KEY = paste0(EMPLOYEE, "_", eff_yr_month))

PA840_cl <- PA840_cl %>%
  filter(ChangeReasonGroup == c("Status")) %>% # "Termination Date"
  left_join(ReasonCode_cl, by = c("LOG_REASON" = "LOG_REASON"))%>%
  mutate(TURNOVER = case_when(
    DESCRIPTION %in% c("TER-Leaving Locality","TER-Accepted Other Employment",
                       "TER-WC Settlement Resignation", "TER-Return to School", "TER-Spouse Transferred",
                       "TER-Home Duties","TER-Stopped Reporting to Work","TER-Other","TER-Voluntary During Probation")~ "VOLUNTARY",
    DESCRIPTION %in% c("RET-Service Retirement", "RET-Voluntary Enhanced Retire",
                       "RET-Disability Retirement", "Separation with Pending RET", 
                       "RET-Early Retirement", "RET-Retire & Posn Eliminated", 
                       "RET-Elected Official Reorg", "RET-Vol Ret w/MH Div Qual VER",
                       "RET-MH Divest-Other Employment", "Retiree-End Sep Allowance",
                       "RET-Pos Eliminated w/MH Divest", "RET-DIS with WC Settlement",
                       "RET-SER with WC Settlement")~ "RETIRE",
    DESCRIPTION %in% c("TER-Reg 24", "TER-Temporary Appt Ended",
                       "TER-Retiree Callback", "TER-Disciplinary-Unspecified",
                       "TER-Elimination of Position", "TER-End Probationary Period",
                       "TER-Deceased", "TER-Grant Funding Ended",
                       "TER-Elected Official Reorg", "TER-MH Divest-Other Employment",
                       "TER-Never Worked since HIR/REH", "TER-Reg 24 During Probation","TER-Failed to Meet Job Expectn",
                       "TER-City of Greensboro Transfr","TER-Retiree Callback-MH Divest", "TER-Spouse Transferred",
                       "TER-Pos Eliminated w/MH Divest",  "RET-Pos Elim w/MH Div Qual VER",
                       "TER-Deceased (Hourly)", "TER-Deceased (RCB)", "TER-WC Separation", "TER-Approved Family LOA Ended",
                       "TER-WC Separation-Hourly", "TER-HRLNever Wkd since HIR/REH")~ "INVOLUNTARY",
    DESCRIPTION == "ACTIVE" ~"ACTIVE"))

PA840_cl$TURNOVER[is.na(PA840_cl$TURNOVER)] <- "ACTIVE"

# Count the number of duplicate employees in the cleaned 
PA840_DupEmps <- data.frame(table(PA840_cl$EMPLOYEE)) %>%
  rename(EMPLOYEE = "Var1")
PA840_DupEmps <- PA840_DupEmps %>% filter(Freq > 1)

# Clean HR 206

names(HR206) <- gsub(x = names(HR206), pattern = "-", replacement = "_")
names(HR206) <- gsub(x = names(HR206), pattern = " ", replacement = "_")
HR206cleaned <- HR206 %>%
  # Remove redundant features or features with lots of NA's
  dplyr::select(-EXEMPT_DESC, -JBC_JOB_CLASS, -JCL_DESCRIPTION, -SGH_DESCRIPTION,-JBC_EXP_DIST_CO, -JBC_EXP_ACCT_UNIT,
         -SALARY_CLASS_DESC, -JBC_M3_FLAG, -M3_DESC, -JBC_TIPPED, -JBC_EEO_CAT, -Items_To_Select_Name, 
         - LOG_PRE_VALUE, -LOG_NEW_VALUE, -LOG_DATE_STAMP, -LOG_USER_ID) %>%
  mutate(JBC_EFFECTIVE_DATE = lubridate::mdy(JBC_EFFECTIVE_DATE),
         HRH_BEG_DATE = lubridate::mdy(HRH_BEG_DATE))
HR206_FT_emp <- HR206cleaned %>%
  filter(Exempt == "N")

##HR206_JobCodes
HR206_JobCodes_cl <- HR206_JobCodes %>%
  dplyr::select(Code, JBC_DESCRIPTION)%>%
  rename(JOB_CODE = "Code")
# Clean HR 893

HR893cleaned <- HR893_allEES %>%
  dplyr::select(-FIRST_NAME, -LAST_NAME) %>%
  mutate(PAY_RATE = str_replace(PAY_RATE, "\\$", ""),
         PAY_RATE = str_replace(PAY_RATE, "\\,", ""), 
         PAY_RATE = as.numeric(PAY_RATE)) %>%
  mutate(DATE_HIRED = lubridate::ymd(DATE_HIRED),
         ADJ_HIRE_DATE = lubridate::ymd(ADJ_HIRE_DATE),
         BIRTHDATE = lubridate::ymd(BIRTHDATE),
         BEN_DATE_1 = lubridate::ymd(BEN_DATE_1),
         BEG_DATE = lubridate::ymd(BEG_DATE)) %>%
  mutate(EMP_STATUS = case_when(
    EMP_STATUS == "T1"|EMP_STATUS == "T2"|EMP_STATUS == "L1"|EMP_STATUS == "L2"  ~ "T",
    EMP_STATUS == "R1"|EMP_STATUS == "R2"|EMP_STATUS == "R9"  ~ "R",
    EMP_STATUS == "A5"|EMP_STATUS == "AS"|EMP_STATUS == "A1" |EMP_STATUS == "SA"|EMP_STATUS == "AC"|EMP_STATUS == "AA" |EMP_STATUS == "A3" |EMP_STATUS == "A2"|EMP_STATUS == "AE"|EMP_STATUS == "AF"|EMP_STATUS == "A4"|EMP_STATUS == "AG"|EMP_STATUS == "AB"  ~ "Active"))
# Clean HR 202
names(HR202_ProcessLevel) <- gsub(x = names(HR202_ProcessLevel), pattern = "-", replacement = "_")
names(HR202_ProcessLevel) <- gsub(x = names(HR202_ProcessLevel), pattern = " ", replacement = "_")
HR202_ProcessLevel_cl = rename(HR202_ProcessLevel, PROCESS_LEVEL = PRS_PROCESS_LEVEL)
HR202_ProcessLevel_cl = rename(HR202_ProcessLevel_cl, DEPARTMENT = DPT_DEPARTMENT)
HR202_ProcessLevel_cl <- HR202_ProcessLevel_cl %>%
  dplyr::select(-DPT_DEP_DIST_CO, -DPT_DEP_ACCT_UNIT, -DPT_DEP_ACCOUNT, -DPT_DEP_SUB_ACCT) %>%
  na.omit()

#Clean Halogen (Evaluation Data)

Halogen <- Halogen[ , !duplicated(colnames(Halogen))]  # Remove duplicate column names
Halogen_cl<- 
  Halogen %>%
  dplyr::select( -'Subject First Name', -'Subject Last Name', -'Evaluator First Name',
                -'Evaluator Last Name', -'Manager First Name', -'Manager Last Name',
                 -"Group Name")%>%
  rename(EMPLOYEE = `Subject ID (EEID)`)
#Update column names to lowercase _ format
names(Halogen_cl) <- gsub(x = casefold(names(Halogen_cl), upper=T), pattern = " ", replacement = "_")                         
#make overall_score a numeric to calculate the mean  
Halogen_cl[, 3] <- sapply(Halogen_cl[, 3], as.numeric)          
Halogen_cl[, 9] <- sapply(Halogen_cl[, 9], as.numeric)  
Halogen_cl2<-
  Halogen_cl%>%
  group_by(EMPLOYEE) %>%
  summarise(MEAN_SCORE = round(mean(OVERALL_SCORE),3))%>%
  left_join(Halogen_cl, ., by = "EMPLOYEE")%>%
  dplyr::select(-OVERALL_SCORE)%>%
  distinct(EMPLOYEE, .keep_all=T)%>%
  mutate(CALC_SCORE = round((as.numeric(MEAN_SCORE)/SCORING_SCALE)*100, 2))%>%
  mutate(MEAN_SCORE = replace(MEAN_SCORE, which(is.na(MEAN_SCORE)), 'NO SCORE'),
         CALC_SCORE = replace(CALC_SCORE, which(is.na(CALC_SCORE)), 'NO SCORE'))%>%
  dplyr::select(EMPLOYEE, PROCESS_TITLE, PROCESS_STATUS, 
                PROCESS_END_DATE, CALC_SCORE, MEAN_SCORE, SUBJECT_DIVISION, 
                `SUPERVISOR_ID#`, TERMINATION_DATE, SUBJECT_HIRE_DATE, SUBJECT_DEPARTMENT)
Halogen_panel<-
  Halogen_cl%>%
  group_by(EMPLOYEE) %>%
  mutate(EVAL_SCORE = round((as.numeric(OVERALL_SCORE)/SCORING_SCALE)*100, 2))%>%
  mutate(EVAL_SCORE = replace(EVAL_SCORE, which(is.na(EVAL_SCORE)), 'NO SCORE'))%>%
  filter(EVAL_SCORE != 'NO SCORE')%>%
  mutate(PROCESS_END_DATE = lubridate::dmy(PROCESS_END_DATE))%>%
  mutate(eff_year = lubridate::year(PROCESS_END_DATE),
         eff_month = lubridate::month(PROCESS_END_DATE)) %>%
  mutate(eff_yr_month = paste0(eff_year,  "/", eff_month)) %>%
  mutate(eff_yr_month = lubridate::ym(eff_yr_month)) %>%
  drop_na(eff_year)%>%
  # Create foreign key to join Position Transaction data to panel data
  mutate(emp_yr_month_KEY = paste0(EMPLOYEE, "_", eff_yr_month))%>%
  dplyr::select(EMPLOYEE, emp_yr_month_KEY, PROCESS_TITLE,EVAL_SCORE, PROCESS_STATUS, 
                PROCESS_END_DATE,SUBJECT_DIVISION,`SUPERVISOR_ID#`, TERMINATION_DATE, 
                SUBJECT_HIRE_DATE)
#Geocode_Script.R EVAL = FALSE, INCLUDE = TRUE
#Pull Address string to merge EMPLOYEE identifyer to our geocoded data
EMPLOYEE_ID<- unique(PT_panelJoin$EMPLOYEE)
addresses<- EmployeeExtract_cl%>%
  dplyr::select(EMPLOYEE,ADDR1,ADDR2,CITY,STATE,ZIP)%>% 
  unite(ADDRESS, ADDR1, ADDR2, CITY, STATE, ZIP, sep = ' ')
# %>%
#   filter(EMPLOYEE %in% EMPLOYEE_ID)
write.csv(addresses, "addresses.csv")
#Code takes long time to run 
geocoded_doc <- mp_geocode( addresses = addresses$ADDRESS, key = key, quiet = TRUE)
geocoded_Addresses <- mp_get_points(geocoded_doc)
write.csv(geocoded_Addresses, 'Geocoded_Emp_Address.csv')
#Clean geocoded Address data for running through mp_direction function
geocoded_Address<- geocoded_Address %>%
    mutate(EMPLOYEE = USER_EMPLOYEE)%>%dplyr::select(-USER_EMPLOYEE, -IN_Address, -OID_)%>%
  filter(EMPLOYEE %in% EMPLOYEE_ID)
geocoded_Address<- merge(addresses, geocoded_Address, by = "EMPLOYEE")%>%
  dplyr::select(EMPLOYEE, everything())
geocoded_Address<- geocoded_Address%>%  
  st_as_sf(coords = c("X", "Y"), crs = 4326)
  
#County Buildings point in Downtown Greensboro
county_point<- c(-79.79159,36.07139)
#Create data frame to populate with loop results
commute_results <- data.frame(EMPLOYEE = geocoded_Address$EMPLOYEE,
                              ADDRESS = geocoded_Address$ADDRESS,
                              point    = geocoded_Address$geometry,
                              mean_distance = NA,
                              mean_duration = NA)
i<- 1
# For loop calculates mean distance and mean duration from 3 possible routes between
# each employee's home address and the county government offices
for(i in seq_len(nrow(geocoded_Address))){
  # cat(i,"\n")
  doc_i = mp_directions(
    origin = geocoded_Address[i,"geometry"],
    destination = county_point,
    alternatives = TRUE, key = key, quiet = TRUE)
  r_i = mp_get_routes(doc_i)
  dist_i <- mean(r_i$distance_m, na.rm=TRUE)
  dur_i  <- mean(r_i$duration_s, na.rm=TRUE)/60
  commute_results[i,"mean_distance"] <- dist_i
  commute_results[i,"mean_duration"] <- dur_i
  }
# Export to box 
write_csv(commute_results, "commute_results_panel.csv")
### Joining Datasets  
#Join Function Group Description
PAEE_clean<- left_join(PAEE_clean, JobFnctnGroup_Dict, on = 'FNCTN_GROUP')
#### join the main dataset####
Employee_Join <-merge(x = EmployeeExtract_cl, y = PAEE_clean, by = "EMPLOYEE", all.y = TRUE)
#### join pay_plan_S####
Pay_Plan_S1<- Pay_Plan_S%>%
  rename('JOB_CODE' = 'JOB CLASS', 'JOB_CLASS_TITLE'='JOB CLASS TITLE')%>%
  dplyr::select(JOB_CODE,JOB_CLASS_TITLE, MIN, MAX, MARKET) %>%
  mutate(MIN = str_replace(MIN, "\\$", ""),
         MIN = str_replace(MIN, "\\,", ""),
         MAX = str_replace(MAX, "\\$", ""),
         MAX = str_replace(MAX, "\\,", ""),
         MARKET = str_replace(MARKET, "\\$", ""),
         MARKET = str_replace(MARKET, "\\,", "")) %>%
  mutate(MIN = as.integer(MIN),
         MAX = as.integer(MAX),
         MARKET = as.integer(MARKET))
#Employee_Join <-merge(x = Employee_Join, y = Pay_Plan_S1, by = "JOB_CODE", all.x = TRUE)
Employee_Join <- merge(x = Employee_Join, y = HR202_ProcessLevel_cl, by = "DEPARTMENT") %>%
  dplyr::select(-PROCESS_LEVEL.y) %>%
  rename("PROCESS_LEVEL" = "PROCESS_LEVEL.x")
Employee_Join <- left_join(Employee_Join, JobFnctnGroup_Dict, on = 'FNCTN_GROUP')%>%
    dplyr::select(-LAST_REVIEW)
Employee_Join <- merge(x = Employee_Join, y = HR206_JobCodes_cl, by = "JOB_CODE")
Employee_Join <- left_join(x = Employee_Join, 
                           y = (HR893cleaned %>% 
                                  dplyr::select(EMPLOYEE, BIRTHDATE) %>% 
                                  distinct(EMPLOYEE, BIRTHDATE)), 
                           by = "EMPLOYEE")

### Data for Prediction
predictionData <-merge(x = Employee_Join, y = commute_results, by = "EMPLOYEE")
#n10495 <- predictionData %>% filter(EMPLOYEE == 10495)
predictionData <- predictionData %>%
  dplyr::select(EMP_STATUS, EMPLOYEE, Calculated_Age, SEX, SMOKER, EEO_CLASS, TRUE_MAR_STAT, EXEMPT_EMP, SUPERVISOR, HIRE_SOURCE, DEPARTMENT, JOB_CODE, "JBC_DESCRIPTION", FNCTN_GROUP_DESCRIPTION, NEXT_REV_CODE, DATE_HIRED, ADJ_HIRE_DATE, TERM_DATE, NEW_HIRE_DATE, PAY_RATE, STAND_HOURS, PENSION_PLAN, ANNUAL_HOURS, mean_distance, mean_duration, BIRTHDATE)
predictionData <- predictionData %>%
  mutate(PENSION_PLAN = case_when(PENSION_PLAN == "N" ~ "No",
                                  PENSION_PLAN == "" ~ "Yes"),
         HIRE_SOURCE = case_when(
           HIRE_SOURCE == "AGENCY" ~ "AGENCY",
           HIRE_SOURCE == "CLIENTREF" ~ "CLIENTREF",
           HIRE_SOURCE == "EMPLOYEE"|HIRE_SOURCE == "EMPLOYMENT" ~ "EMPLOYEE",
           HIRE_SOURCE == "JOB FAIR" ~ "JOB FAIR",
           HIRE_SOURCE == "NEWSPAPER" ~ "NEWSPAPER",
           HIRE_SOURCE == "COUNTY WEB" ~ "COUNTY WEB",
           HIRE_SOURCE == "OTHER WEB"|HIRE_SOURCE == "OTHER WEBS" ~ "OTHER WEB",
           HIRE_SOURCE == "OTHER" ~ "OTHER",
           HIRE_SOURCE == "PHONE INQU" ~ "PHONE INQU",
           HIRE_SOURCE == "UNKNOWN"|HIRE_SOURCE == "" ~ "UNKNOWN",
           HIRE_SOURCE == "WALK-IN" ~ "WALK-IN",
           HIRE_SOURCE == "OPEN HOUSE" ~ "OPEN HOUSE"),
         NEXT_REV_CODE = case_when(
           NEXT_REV_CODE == "12-MONTH"|NEXT_REV_CODE == "12 MONTH" ~ "12 MONTH",
           NEXT_REV_CODE == "6-MONTH"|NEXT_REV_CODE == "6 MONTH" ~ "6 MONTH",
           NEXT_REV_CODE == "9-MONTH" ~ "9 MONTH",
           NEXT_REV_CODE == "ANNUAL" ~ "ANNUAL",
           NEXT_REV_CODE == "TRIAL"|NEXT_REV_CODE == "TRIAL12 MO" ~ "TRIAL",
           NEXT_REV_CODE == "" ~ "UNKOWN",
    ))
#Join Department Description to predictionData
Dept_Descr<- HR865_2021%>%
  dplyr::select(DEPT, NAME)%>%distinct()%>%
  rename(DEPARTMENT = DEPT)
Dept_Descr$DEPARTMENT<- as.numeric(Dept_Descr$DEPARTMENT)
predictionData<- predictionData%>%
  mutate(DEPARTMENT = as.numeric(gsub("(^\\d{3}).*", "\\1", DEPARTMENT)))%>%
  left_join(., Dept_Descr, by = 'DEPARTMENT')%>%
  rename(DEPT_NAME = NAME)
  
# predictionData_viz WAS CREATED STRICTLY FOR VISUALIZATION PURPOSES 
# IT DOES NOT TAKE INTO ACCOUNT MULTIPLE TERMINATION EVENTS FOR AN EMPLOYEE

# join reason codes and description to prediction data 
predictionData_viz<-PA840_cl%>%
  dplyr::select(EMPLOYEE, TURNOVER, DESCRIPTION, LOG_REASON)%>%
  # rename(EMPLOYEE = LOG_EMPLOYEE,
  #        LOG_REASON = LOG_REASON_1)%>%
  left_join(predictionData, ., by = "EMPLOYEE")%>%
   mutate(LOG_REASON = replace(LOG_REASON, is.na(LOG_REASON), "ACTIVE"),
          DESCRIPTION = replace(DESCRIPTION, is.na(DESCRIPTION), "ACTIVE"),
          TURNOVER = replace(TURNOVER, is.na(TURNOVER), "ACTIVE"))
#Df is 1660 observations because of multiple observations for unique employees
### Building the Panel
# Prepare Cleaned position transaction data for creating timeseries panel
PT_panelJoin <- PT_clean %>%
  filter(eff_year >= 2006) %>%
  # In order to find employees that had periods of only salaried employment or both salaried and hourly,
  # BUT NOT just hourly, we use the below logic:
  mutate(SALARY_CLASS_BIN = case_when(SALARY_CLASS == "S" ~ 1,
                                      SALARY_CLASS == "H" ~ 0)) %>%
  group_by(EMPLOYEE) %>% # Want to look at individual employee's salary/hourly history
  mutate(SALARY_CLASS_SUM = sum(SALARY_CLASS_BIN)) %>% # add up the binary values, 
  filter(SALARY_CLASS_SUM > 0) %>% # if an employee has a 0 value, it means they only worked as an hourly employee
  # if they have a value > 0, then they worked as just a salaried or salaried and hourly employee
  ungroup() %>%
  dplyr::select(-c(SALARY_CLASS_BIN, SALARY_CLASS_SUM))
# 2006-04-01 is missing from the data - there were no position transactions for that month, but we need to add it into the panel
PT_panelJoin[nrow(PT_panelJoin)+1,] <- data.frame(99999, # dummy value for EMPLOYEE
                       0, # dummy value for PAY_RATE
                       ymd('2006-04-01'),  #IMPORTANT VALUE FOR EMP_EFFECT_DATE
                       99999, # dummy value for JOB_CODE
                       99999, # dummy value for PROCESS_LEVEL
                       '99999', # dummy value for DESCRIPTION
                       '99999', # dummy value for SALARY_CLASS
                       '99999', # dummy value for PAY_GRADE
                       '99999', # dummy value for CHG_REASON
                       99999, # dummy value for eff_year
                       99999, # dummy value for eff_month
                       ymd('2006-04-01'), #IMPORTANT VALUE FOR eff_yr_month
                       '99999') # dummy value for emp_yr_month_KEY
  
# Create dataset containing a row for each employee at each month from 2006 -> 2022
study.panel <- expand.grid(EMPLOYEE = unique(PT_panelJoin$EMPLOYEE), eff_yr_month = unique(PT_panelJoin$eff_yr_month))
# Remove the dummy employee
study.panel <- study.panel %>%
  filter(!EMPLOYEE == 99999)
PT_panelJoin <- PT_panelJoin %>%
  filter(!EMPLOYEE == 99999)
# Create unique ID for each employee 
study.panel <- study.panel %>%
  mutate(emp_yr_month_KEY = paste0(EMPLOYEE, "_", eff_yr_month))

# Start joining DV's into the panel
study.panel <- study.panel %>%
  left_join(PT_clean, by = "emp_yr_month_KEY") %>%
  dplyr::select(-c(EMPLOYEE.y, eff_yr_month.y)) %>%
  dplyr::rename(EMPLOYEE = EMPLOYEE.x,
                eff_yr_month = eff_yr_month.x)
# Use dtplyr fill to carry data forward in time for missing values
# See this link for details: https://dtplyr.tidyverse.org/reference/fill.dtplyr_step.html
studypanelCols_toFill <- colnames(study.panel)[-c(1,2,3)]
study.panel <- study.panel %>%
  dplyr::group_by(EMPLOYEE) %>% # Group by employee so data is only filled from the same employee
  arrange(EMPLOYEE, eff_yr_month) %>% # Put columns in order so fill will work correctly
  fill(all_of(studypanelCols_toFill), .direction = "downup") # Fill NA values up and down
# Join in PA840_cl (Termination Report)
study.panel <- study.panel %>%
  left_join((PA840_cl %>% 
               dplyr::select(emp_yr_month_KEY, LOG_ACTION_CODE, LOG_REASON, DESCRIPTION, LOG_PRE_VALUE, LOG_NEW_VALUE, TURNOVER) %>%
               rename(TERM_DESCRIPTION = DESCRIPTION)), # DESCRIPTION IS ALREADY USED FOR THE EMP'S JOB TITLE
             by = "emp_yr_month_KEY")

# Join Static features from engineered employee information dataset
study.panel <- study.panel %>%
  left_join(predictionData, by = "EMPLOYEE") %>%
  dplyr::rename(PT_PAY_RATE = PAY_RATE.x,
                LATEST_PAY_RATE = PAY_RATE.y,
                PT_JOB_CODE = JOB_CODE.x,
                LATEST_JOB_CODE = JOB_CODE.y) %>%
  dplyr::select(-Calculated_Age,-LOG_ACTION_CODE, -LOG_REASON, -TERM_DESCRIPTION, -LOG_PRE_VALUE, -LOG_NEW_VALUE) %>%
  mutate(TERM_DATE = case_when(TERM_DATE == ymd('1753/01/01') ~ ymd(today()), 
                               TERM_DATE != ymd('1753/01/01') ~ TERM_DATE)) 
study.panel <- study.panel %>% drop_na(EMP_STATUS)
# join salary market rates
study.panel_t<- study.panel %>%
  mutate(JOB_CODE = as.numeric(LATEST_JOB_CODE))%>%
  left_join(., Pay_Plan_S1, by= "JOB_CODE")%>%
  mutate(SALARY_COMPARE = case_when(LATEST_PAY_RATE > MARKET ~ "Above Market",
                                   LATEST_PAY_RATE < MARKET ~ "Below Market", 
                                   LATEST_PAY_RATE == MARKET ~ "0"))

study.panel <- study.panel %>%
  group_by(EMPLOYEE) %>%
  mutate(lead_TURNOVER_6mo = lead(TURNOVER, n = 6, order_by = eff_yr_month),
         lead_TURNOVER_12mo = lead(TURNOVER, n = 12, order_by = eff_yr_month),
         lead_TURNOVER_18mo = lead(TURNOVER, n = 18, order_by = eff_yr_month),
         Turnover = TURNOVER) %>%
  fill(TURNOVER, .direction = "up") %>%
  fill(lead_TURNOVER_6mo, .direction = "down") %>% 
  fill(lead_TURNOVER_12mo, .direction = "down") %>%
  fill(lead_TURNOVER_18mo, .direction = "down") %>%
  fill(Turnover, .direction = "down")
study.panel <- study.panel %>%
  group_by(EMPLOYEE) %>%
  mutate(TURNOVER = ifelse(EMP_STATUS == "Active" & DATE_HIRED == ADJ_HIRE_DATE,"A",TURNOVER))%>% ## filter out active employees, they do not have a turnover(avoid including returned employees, whose status are "Active" but have previous T/R)
  mutate(TURNOVER = ifelse(is.na(TURNOVER), "POST_TERM", TURNOVER)) %>%
  mutate(TURNOVER_6mo_flag = case_when(lead_TURNOVER_6mo == TURNOVER ~ TURNOVER,
                                       TURNOVER == "POST_TERM" ~ "POST_TERM"),
         TURNOVER_12mo_flag = case_when(lead_TURNOVER_12mo == TURNOVER ~ TURNOVER,
                                        TURNOVER == "POST_TERM" ~ "POST_TERM"),
         TURNOVER_18mo_flag = case_when(lead_TURNOVER_18mo == TURNOVER ~ TURNOVER,
                                        TURNOVER == "POST_TERM" ~ "POST_TERM"),
         Turnover = case_when(Turnover == TURNOVER ~ TURNOVER,
                              TURNOVER == "POST_TERM" ~ "POST_TERM")
         ) %>%
  dplyr::select(-c(lead_TURNOVER_6mo, lead_TURNOVER_12mo, lead_TURNOVER_18mo)) %>%
  mutate(TURNOVER = ifelse(is.na(TURNOVER), "ACTIVE", TURNOVER)) %>%
  mutate(TURNOVER_6mo_flag = ifelse(is.na(TURNOVER_6mo_flag), "ACTIVE", TURNOVER_6mo_flag),
         TURNOVER_12mo_flag = ifelse(is.na(TURNOVER_12mo_flag), "ACTIVE", TURNOVER_12mo_flag),
         TURNOVER_18mo_flag = ifelse(is.na(TURNOVER_18mo_flag), "ACTIVE", TURNOVER_18mo_flag),
         Turnover = ifelse(is.na(Turnover), "ACTIVE", Turnover))


# Create employment bounds - create regions for hired employees
study.panel <- study.panel %>%
  mutate(currentAge = round(time_length(duration(as.duration(eff_yr_month - BIRTHDATE)), unit = "years"),2),
         afterHire = case_when(DATE_HIRED <= eff_yr_month ~ 1, # Emp is active if current yr/month is after hire date
                               DATE_HIRED > eff_yr_month ~ 0), # Emp is NOT active if current yr/month is before hire date
         return = ifelse(ADJ_HIRE_DATE <= eff_yr_month & ADJ_HIRE_DATE != DATE_HIRED, 1,0),
         beforeTerm = case_when(TERM_DATE == ymd(today()) ~ 1, # Emp is active if they have been given dummy val for Term
                                TERM_DATE >= eff_yr_month ~ 1, # Emp is is active if current yr/month is before termination date
                                TERM_DATE < eff_yr_month ~ 0)) %>% # Emp is NOT active if current yr/month is after termination date
  mutate(currentStatus = case_when(afterHire == 1 & beforeTerm == 1 ~ 1, 
                                   # employee is active if they are between hire and termination date
                                   afterHire == 0 & beforeTerm == 1 ~ 0,
                                   afterHire == 1 & beforeTerm == 0 ~ 0)) %>%
  mutate(TURNOVER = ifelse(afterHire == 0 & TURNOVER == "A", "PRE_HIRE", TURNOVER))%>%
  mutate(Turnover = ifelse(TURNOVER == "PRE_HIRE", "PRE_HIRE", Turnover),
    TURNOVER_6mo_flag = ifelse(TURNOVER == "PRE_HIRE", "PRE_HIRE", TURNOVER_6mo_flag),
         TURNOVER_12mo_flag = ifelse(TURNOVER == "PRE_HIRE", "PRE_HIRE", TURNOVER_6mo_flag),
         TURNOVER_18mo_flag = ifelse(TURNOVER == "PRE_HIRE", "PRE_HIRE", TURNOVER_6mo_flag))%>%
  
    mutate(TURNOVER = ifelse(return == 1 & TURNOVER == "POST_TERM" , "R_ACTIVE", TURNOVER))%>% # return employees who are active now
  mutate(Turnover = ifelse(TURNOVER == "R_ACTIVE", "ACTIVE", Turnover),
    TURNOVER_6mo_flag = ifelse(TURNOVER == "R_ACTIVE", "ACTIVE", TURNOVER_6mo_flag),
         TURNOVER_12mo_flag = ifelse(TURNOVER == "R_ACTIVE", "ACTIVE", TURNOVER_6mo_flag),
         TURNOVER_18mo_flag = ifelse(TURNOVER == "R_ACTIVE", "ACTIVE", TURNOVER_6mo_flag))%>%
  dplyr::select(-c(afterHire, beforeTerm, return, TURNOVER)) %>%
    rename( "TURNOVER" = Turnover)%>%
  group_by(EMPLOYEE) %>%
  mutate(lag_PT_JOB_CODE = lag(PT_JOB_CODE, order_by = eff_yr_month)
         ) %>%
  # When lag is run, the first row will get an NA value because there aren't any values 'in front' of it
  # Therefore, we will fill in those NA's so they don't mess up the calculations
  mutate(lag_PT_JOB_CODE = case_when(!is.na(lag_PT_JOB_CODE) ~ lag_PT_JOB_CODE, 
                                     is.na(lag_PT_JOB_CODE) ~ PT_JOB_CODE)) %>% 
  mutate(jobChange = case_when(lag_PT_JOB_CODE != PT_JOB_CODE ~ 1, # If job codes are different, there was a job change
                               lag_PT_JOB_CODE == PT_JOB_CODE ~ 0)) %>%  # If job codes are the same, no job change
  mutate(jobChangeCount = cumsum(jobChange)) %>% # Sequentially count the number of job changes
  ungroup()
# Write panel to box
# box_write(study.panel, "study_panel.csv")
#tenure_m = if current date > term date 
#calculate tenure in months and in years 
study.panel<- study.panel%>%
  mutate(TENURE_M = round(time_length(duration(as.duration(eff_yr_month - DATE_HIRED)), unit = "months"),2),
         TENURE_Y = round(time_length(duration(as.duration(eff_yr_month - DATE_HIRED)), unit = "years"),2))
study.panel<- study.panel%>%  
  mutate(TENURE_M = if_else(TENURE_M < 0, 0, TENURE_M),
         TENURE_Y = if_else(TENURE_Y < 0, 0, TENURE_Y))
library(viridis)
# create visualization study panel - where each employee only shows up where current status = 1, 
# time observations occur yearly instead of monthly (01/01/XX)
study.panel.viz<- study.panel%>%
  filter(currentStatus==1)%>%
  mutate(current_year = lubridate::year(eff_yr_month),
         current_month = lubridate::month(eff_yr_month),)%>%
  mutate(DayDate=lubridate::day(eff_yr_month)) %>% mutate(MonthDate=lubridate::month(eff_yr_month)) %>%
  unite(DayMonth, c("DayDate", "MonthDate"), sep = "-")%>%
  filter(DayMonth == '1-1')%>%
  left_join(., predictionData%>%
    mutate(n = floor(runif(nrow(.), min=0, max=nrow(.))))%>%
    mutate(TERM_DATE = replace(TERM_DATE, year(TERM_DATE) == 1753, today()),
           TENURE = round(time_length(duration(as.duration(TERM_DATE -DATE_HIRED)), unit = "years"),2))%>%
           dplyr::select("EMPLOYEE", "n"), 
            by = "EMPLOYEE")

study.panel<- study.panel%>%
  mutate(yr_mo_dept = paste0(eff_yr_month, DEPT_NAME, sep = '_'))

study.panel$TURNOVER <- factor(study.panel$TURNOVER, 
                               levels = c("VOLUNTARY","INVOLUNTARY", "RETIRE", "ACTIVE"))
#create columns to calculate new features with
#reclassify new columns
study.panel<- study.panel%>%
  group_by(eff_yr_month, DEPT_NAME)%>%
  mutate(dept_MAJ_EEO = max(EEO_CLASS),
         dept_MAJ_SEX = max(SEX),
         dept_AVG_AGE = mean(currentAge))%>%
  mutate(dept_AVG_AGE = case_when(dept_AVG_AGE < 30 ~ '20s',
                                  dept_AVG_AGE >= 30 & dept_AVG_AGE < 40 ~ '30s',
                                  dept_AVG_AGE >= 40 & dept_AVG_AGE <50 ~ '40s',
                                  dept_AVG_AGE >= 50 & dept_AVG_AGE <60 ~ '50s',
                                  dept_AVG_AGE >= 60~ '60s'),
         AGE_GROUP = case_when(currentAge < 30 ~ '20s',
                                  currentAge >= 30 & currentAge < 40 ~ '30s',
                                  currentAge >= 40 & currentAge <50 ~ '40s',
                                  currentAge >= 50 & currentAge <60 ~ '50s',
                                  currentAge >= 60~ '60s'))%>%
  ungroup()%>%
  mutate(MAJ_EEO = ifelse(EEO_CLASS == dept_MAJ_EEO, 1, 0),
         MAJ_GENDER = ifelse(SEX == dept_MAJ_SEX, 1, 0),
         MAJ_AGE_GROUP = ifelse(AGE_GROUP == dept_AVG_AGE, 1, 0))%>%dplyr::select(-c(dept_MAJ_EEO, dept_MAJ_SEX, dept_AVG_AGE, AGE_GROUP))
  
study.panel_newfeat<- study.panel%>%dplyr::select(emp_yr_month_KEY, MAJ_EEO, MAJ_GENDER)

#average percent change in pay - countywide and department 
#DEPT_YR_RAISE =  pct_change in pay that year >= dept_avg_raise that year ~ 1, else 0
#dept_avg_raise (% change)  --- limited to our observations 
#pct_change_pay -- employee's % change in pay that year

PAY_RAISE<- study.panel%>%
    dplyr::select(EMPLOYEE, emp_yr_month_KEY, eff_yr_month, yr_mo_dept, DEPT_NAME, JBC_DESCRIPTION, PT_PAY_RATE, TURNOVER)%>%
    mutate(year = lubridate::year(eff_yr_month),
           HRLY_PAY = ifelse(PT_PAY_RATE > 100, (PT_PAY_RATE/2080), PT_PAY_RATE))%>%
    group_by(EMPLOYEE, year)%>%
    mutate(low_pay = min(HRLY_PAY),
           high_pay = max(HRLY_PAY),
           pay_dif= round((high_pay - low_pay),2),
           pct_change_pay = (round((pay_dif)/(low_pay),2)*100))%>%
    ungroup()%>%group_by(year, DEPT_NAME)%>%
  mutate(pct_change_pay = ifelse(is.infinite(pct_change_pay), 100, pct_change_pay))%>%
  mutate(dept_avg_raise = round(mean(pct_change_pay),2))%>%
  dplyr::select(EMPLOYEE, eff_yr_month, emp_yr_month_KEY, year, DEPT_NAME, pct_change_pay, dept_avg_raise, TURNOVER)%>%
  mutate(YR_DEPT_RAISE = ifelse(pct_change_pay >= dept_avg_raise, 1, 0))%>%ungroup()


study.panel<- PAY_RAISE%>%
  dplyr::select(emp_yr_month_KEY, dept_avg_raise, pct_change_pay, YR_DEPT_RAISE)%>%
  left_join(study.panel, ., by = 'emp_yr_month_KEY')

Health<- c("Public Health", "Animal Services", "Mental Health")
PUBLIC_WELFARE <- c("Social Services", "Child Support Enforcement", "Family Justice Center", "Transportation-Human Services", 
                    "Emergency Services", "Transportation-Human Services")
finance_admin <- c("Purchasing", "Security", "Human Resources", "Elections",
                   "Tax", "Internal Audit", "Veterans' Services", "Finance", "Information Technology", "Risk Management", 
                   "Commissioners", "Budget & Management Services", "Clerk to the Board", "Register of Deeds", "Legal", 
                   "Administration", "Property Management/Courts")
community_development<- c("Community&Economic Development", "Planning", "Facilities", "PLN/Solid Waste", "PLN/Soil & Water",
                          "PLN/Inspections", "Parks (Prev Prop Mgmt)")

study.panel<- study.panel%>%
  mutate(deptGroup = case_when(DEPT_NAME %in% Health ~ "Health", 
                                            DEPT_NAME == "Law Enforcement" ~ "Law Enforcement", 
                                            DEPT_NAME %in% PUBLIC_WELFARE ~ "Public Welfare", 
                                            DEPT_NAME %in% finance_admin ~ "Financial Administration",
                                            DEPT_NAME %in% community_development~ "Community Development", 
                                            DEPT_NAME == "Juvenile Detention"~ "Corrections"))

PAY_RAISE<- PAY_RAISE%>%
  mutate(deptGroup = case_when(DEPT_NAME %in% Health ~ "Health", 
                                            DEPT_NAME == "Law Enforcement" ~ "Law Enforcement", 
                                            DEPT_NAME %in% PUBLIC_WELFARE ~ "Public Welfare", 
                                            DEPT_NAME %in% finance_admin ~ "Financial Administration",
                                            DEPT_NAME %in% community_development~ "Community Development", 
                                            DEPT_NAME == "Juvenile Detention"~ "Corrections"))

PAY_RAISE$deptGroup<- factor(PAY_RAISE$deptGroup,
         levels = c("Health", "Law Enforcement", "Financial Administration", "Public Welfare", "Corrections", "Community Development"))
### Visualize Trends in Data 
EMPLOYEE_ID<- unique(PT_panelJoin$EMPLOYEE)
addresses<- EmployeeExtract_cl%>%
  dplyr::select(EMPLOYEE,ADDR1,ADDR2,CITY,STATE,ZIP)%>% 
  unite(ADDRESS, ADDR1, ADDR2, CITY, STATE, ZIP, sep = ' ')%>%
  filter(EMPLOYEE %in% EMPLOYEE_ID)
geocoded_Address<- geocoded_Address %>%
    mutate(EMPLOYEE = USER_EMPLOYEE)%>%dplyr::select(-USER_EMPLOYEE, -IN_Address, -OID_)%>%
  filter(EMPLOYEE %in% EMPLOYEE_ID)
geocoded_Address<- merge(addresses, geocoded_Address, by = "EMPLOYEE")%>%
  dplyr::select(EMPLOYEE, everything())
geocoded_Address<- geocoded_Address%>%  
  st_as_sf(coords = c("X", "Y"), crs = 4326)
#County Buildings point in Downtown Greensboro
county_point<- c(-79.79159,36.07139)
doc_2 = mp_directions(
  origin = c(-79.77857679329264, 36.09399181230865),
  destination = county_point,
  alternatives = F, key = key, quiet = TRUE)
r_2 = mp_get_routes(doc_2)
emp_2 <- paste(sep = "<br/>",
                 "<b>JANE DOE</b>",
                 "2.1 mi", "7 mins")
#Make a map!
leaflet(width = '70%') %>% 
  addProviderTiles(providers$CartoDB.Positron)%>%
  # setView(lng = -79.78344390867515,lat = 36.08276000733181, zoom = 14)%>%
  addPolylines(data = r_2, opacity = 1, weight = 5, color = "red")%>%
  addMarkers(-79.79159,36.07139, label = "Guilford County Offices")%>% 
  addMarkers(-79.77857679329264, 36.09399181230865, label = "Employee Home", popup = emp_2)
doc_3 = mp_directions(
  origin = c(-79.86746340934306, 36.02906630084408),
  destination = county_point,
  alternatives = F, key = key, quiet = TRUE)
r_3 = mp_get_routes(doc_3)
emp_3 <- paste(sep = "<br/>",
                 "<b>JOHN SMITH</b>",
                 "7 mi", "15 mins")
#Make a map!
leaflet(width = '70%') %>%
  addProviderTiles(providers$CartoDB.Positron)%>%
  addPolylines(data = r_3, opacity = 1, weight = 5, color = "red")%>%
  addMarkers(-79.79159,36.07139, label = "Guilford County Offices")%>%
  addMarkers(-79.86746340934306, 36.02906630084408, label = "Employee Home", popup = emp_3)
predictionData_viz$TURNOVER <- factor(predictionData_viz$TURNOVER,
                                 levels = c("VOLUNTARY", "INVOLUNTARY","RETIRE", "ACTIVE"))
predictionData_viz_T<- predictionData_viz%>%
  filter(predictionData_viz$TURNOVER %in% c("VOLUNTARY","INVOLUNTARY"))
#count of employee status - unique employees in preditionData / study panel
predictionData_viz%>%na.omit()%>%
  filter(TURNOVER %in% c("VOLUNTARY","INVOLUNTARY", "RETIRE"))%>%
  ggplot(aes(TURNOVER))+
  geom_bar(aes(fill = TURNOVER))+
  geom_text(aes(label = ..count..), stat = "count",
            vjust = 2, 
            colour = "white",
            fontface = "bold")+
  scale_fill_manual(values = pal3_ret)+
  labs(title = "Count of Employees by Turnover Status",
       caption = "figure x")+
  guides(fill = "none")+
  theme_minimal()
# Visualize active employees with Guilford County over time
study.panel%>%
  filter(currentStatus==1)%>% na.omit() %>%
     filter(TURNOVER%in% "ACTIVE") %>%
     group_by(eff_yr_month, TURNOVER) %>%
     tally() %>%
     ggplot(aes(x=eff_yr_month, y=n, color=TURNOVER)) +
     geom_smooth(size = 1, se = FALSE) +
     scale_color_manual(values = active)+
     labs(title = "Temporal Trends in Active Employees", x = "Year", y = "Number of Active Employees",
          color = "",
       caption = "figure x")+
     theme_minimal()+ 
    theme(axis.text = element_text(face = "bold", color = "grey35"),
          axis.title = element_text(face = "bold", color = "grey35"), 
          axis.line = element_line(size = .5, color= "grey35"),
          panel.grid.major.y  =element_line(color = 'grey60',linetype = 3, size = .5), 
          panel.grid.minor = element_blank(), panel.grid.major.x=element_blank())

# Visualize count of voluntary and involuntary turnover over time
study.panel %>% na.omit()%>%
  filter(currentStatus==1) %>%
  filter(TURNOVER%in% c("VOLUNTARY", "INVOLUNTARY")) %>%
  group_by(eff_yr_month, TURNOVER) %>%
  tally() %>%
  ggplot(aes(x=eff_yr_month, y=n, color=TURNOVER)) +
    geom_smooth(size = 1, se = FALSE) +
    scale_color_manual(values = pal3)+
    labs(title = "Temporal Trends in Turnover", x = "Year", y = "Number of Employees Leaving",
         color = '',
       caption = "figure x")+
    theme_minimal()+ 
    theme(axis.text = element_text(face = "bold", color = "grey35"),
          axis.title = element_text(face = "bold", color = "grey35"), 
          axis.line = element_line(size = .5, color= "grey35"),
          panel.grid.major.y  =element_line(color = 'grey60',linetype = 3, size = .5), 
          panel.grid.minor = element_blank(), panel.grid.major.x=element_blank())

study.panel.viz$TURNOVER <- factor(study.panel.viz$TURNOVER, 
                                   levels = c("VOLUNTARY", "INVOLUNTARY", "RETIRE"))
# visualize tenure timeline
# yellow lines = voluntary turnover
study.panel.viz%>%
  filter(TURNOVER %in% c("VOLUNTARY", "INVOLUNTARY", "RETIRE"))%>%
  distinct(EMPLOYEE, .keep_all = T)%>%
  ggplot() + 
     geom_segment(aes(x=DATE_HIRED, xend=TERM_DATE, y=n, yend=n, colour = TURNOVER, size = TURNOVER), alpha = .9)+
  labs(x = 'Employee Timeline', y = 'Unique Employee', title = 'Guilford County Employee Timeline History',
       subtitle = "  ",caption = "figure x")+
  scale_color_manual(values = pal3_ret)+
  scale_size_manual(values = c(4, 1.5,1.5))+
  guides(color = F, size= F)+
  theme_minimal()+ 
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        axis.line = element_line(size =.5, color = "grey55"),
        axis.title = element_text(face = "bold"),
        axis.text = element_text(face = "bold"),
        axis.text.y = element_blank(),
        panel.grid = element_blank(),
        panel.grid.major.x = element_line(size = 1, color = "grey75", linetype = 3))
#Tenure vs Turnover Type
tenure_turn<- study.panel%>%filter(currentStatus == 1)%>%
  filter(TURNOVER %in% c('VOLUNTARY', 'INVOLUNTARY', 'RETIRE'))%>%
  distinct(EMPLOYEE, .keep_all=T)%>%
  dplyr::select(EMPLOYEE, TURNOVER, DEPARTMENT, TENURE_Y)%>%
  group_by(TURNOVER)%>%
  mutate(TENURE = case_when(TENURE_Y < 1 ~ "Less than 1 year",
                            TENURE_Y >=1 & TENURE_Y <5 ~ "1 to 5 years", 
                            TENURE_Y >=5 & TENURE_Y <10 ~ "5 to 10 years",
                            TENURE_Y >=10 & TENURE_Y <20 ~ "10 to 20 years",
                            TENURE_Y >= 20 ~ "Over 20 years" ))

tenure_turn$TENURE = factor(tenure_turn$TENURE, levels= c("Less than 1 year",
                                                              "1 to 5 years",
                                                              "5 to 10 years",
                                                              "10 to 20 years",
                                                              "Over 20 years"))
tenure_turn$TURNOVER = factor(tenure_turn$TURNOVER, levels= c("VOLUNTARY",
                                                              "INVOLUNTARY",
                                                              "RETIRE"))

#Visualize 
tenure_turn%>%na.omit()%>%filter(TURNOVER == "VOLUNTARY")%>%
  ggplot(aes(TENURE, group = TURNOVER)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), color = "#8f7b87", stat="count") +
  geom_text(aes(label = scales::percent(..prop.., accuracy = .1),
                   y= ..prop.. ), stat= "count",
            nudge_y = .055,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(values = pal5,
                    labels = c("Less than 1 year","1 to 5 years",
                                "5 to 10 years","10 to 20 years",
                                "Over 20 years"))+
  labs(title = 'Voluntary Turnover by Tenure Category',
       x = "", fill = "Turnover Status",
       caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  theme_minimal()+
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        axis.title = element_text(face = "bold"),
        axis.text.x = element_blank())

#Visualize 
tenure_turn%>%na.omit()%>%
  ggplot(aes(TURNOVER, group = TENURE)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop.., accuracy = .1),
                   y= ..prop.. ), stat= "count",
            nudge_y = .055,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(values = pal3_ret,
                    labels = c("VOLUNTARY", "INVOLUNTARY", "RETIRE"))+
  labs(title = 'Turnover Type by Tenure Category',
       x = "", fill = "Turnover Status",
       caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  facet_wrap(~TENURE,
             strip.position="bottom",
             nrow=1)+theme_bw()+
  theme_minimal()+
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        strip.text = element_text(face = "bold"),
        axis.title = element_text(face = "bold"),
        axis.text.x = element_blank())
#Create market_salary dataframe from prediction data 
market_salary<- predictionData_viz%>%dplyr::select(EMPLOYEE, EMP_STATUS, PAY_RATE, JOB_CODE, TURNOVER, DESCRIPTION)%>%
  left_join(., Pay_Plan_S1, by = "JOB_CODE")%>%distinct(EMPLOYEE, .keep_all = T)%>%
  mutate(SALARY_COMPARE = case_when(PAY_RATE > MARKET ~ "Above Market",
                                   PAY_RATE < MARKET ~ "Below Market", 
                                   PAY_RATE == MARKET ~ "0"))
market_salary$TURNOVER <- 
  factor(market_salary$TURNOVER,
         levels = c("VOLUNTARY", "INVOLUNTARY","ACTIVE", "RETIRE"))


# consider dropping? not accurate representation of active employees in our data set
market_salary%>%na.omit()%>%filter(TURNOVER %in% c('VOLUNTARY','INVOLUNTARY'))%>%
ggplot(aes(SALARY_COMPARE, group =TURNOVER)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), color ="#8f7b87", stat="count") +
  geom_text(aes(label = scales::percent(..prop..),
                   y= ..prop.. ), stat= "count",
            nudge_y = .025,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
   scale_fill_manual(values = c("#e4dde1","#8f7b87"),
                     labels= c("Above Market",
                               "Below Market"))+
  labs(title = "Employee Status by Relative Salary",
       x = "", fill = "Relative Salary Rate", caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  facet_wrap(~TURNOVER,
             strip.position="bottom", nrow=1)+theme_minimal()+
  theme(axis.text.x = element_blank(),
        plot.background = element_rect('#f5f2e9', color = NA))
market_salary%>%na.omit()%>%
  filter(TURNOVER %in% c("VOLUNTARY", "INVOLUNTARY", "RETIRE"))%>%
ggplot(aes(TURNOVER, group = SALARY_COMPARE)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..),
                   y= ..prop.. ), stat= "count",
            nudge_y = .025,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(values = pal3_ret,
                    labels= c("VOLUNTARY", "INVOLUNTARY", "RETIRE"))+
  labs(title = "Employee Status by Pay Rate to Market Rate Comparison",
       x = "", fill = "Turnover Status", caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  facet_wrap(~SALARY_COMPARE,
             strip.position="bottom")+theme_minimal()+
  theme(axis.text.x = element_blank(),
        plot.background = element_rect('#f5f2e9', color = NA))


#Reclassify commute duration
COMMUTE <- predictionData_viz%>%
  dplyr::select(EMPLOYEE, TURNOVER, mean_distance, mean_duration)%>%
  mutate(DURATION_CLASS = case_when(mean_duration<15 ~ "Under 15 min",
                                    mean_duration>=10 & mean_duration<45~ "15 to 45 min",
                                    mean_duration>=45 & mean_duration<60~ "45min to 1hr",
                                    mean_duration>=60 & mean_duration<120~ "1 to 2hr",
                                    mean_duration>=120~ "over 2 hours"))%>%
  mutate(miles= round((mean_distance / 1609), 2))%>%
  mutate(DIST_CLASS = case_when(miles < 5~ "Less than 5mi",
                                miles >= 5 & miles < 10~ "5 to 10mi",
                                miles >= 10 & miles <40~ "10 to 40mi",
                                miles >= 40 & miles <60 ~ "40 to 60mi",
                                miles >= 60 ~ "Over 60mi"))

#Factor commute distance classes
COMMUTE$DURATION_CLASS<- 
  factor(COMMUTE$DURATION_CLASS,
         levels = c("Under 15 min", "15 to 45 min","45min to 1hr", 
                    "1 to 2hr", "over 2 hours"))
#Reclassify commute duration - convert meters to miles

COMMUTE$DIST_CLASS<- 
  factor(COMMUTE$DIST_CLASS,
         levels = c("Less than 5mi", "5 to 10mi","10 to 40mi", 
                    "40 to 60mi", "Over 60mi"))
COMMUTE$TURNOVER <- 
  factor(COMMUTE$TURNOVER,
         levels = c("VOLUNTARY", "INVOLUNTARY","RETIRE", "ACTIVE"))

COMMUTE_T<- COMMUTE%>%
  filter(COMMUTE$TURNOVER %in% c('VOLUNTARY', 'INVOLUNTARY', 'ACTIVE'))
grid.arrange((
COMMUTE_T%>%na.omit()%>%
  ggplot(aes(TURNOVER, group = DURATION_CLASS)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop.., accuracy = .1),
                   y= ..prop.. ), stat= "count",
            nudge_y = .025,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(values = pal3,
                    labels = c("VOLUNTARY", "INVOLUNTARY", 'ACTIVE'))+
  labs(title = "Turnover by Employee Commute Time to Work",
       x = "", fill = "Employee Status",
       caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  facet_wrap(~DURATION_CLASS,
             strip.position="bottom",
             nrow=1)+theme_minimal()+
  theme(axis.text.x = element_blank(),
        title= element_text(size = 12),
        strip.text = element_text(size = 11),
        plot.background = element_rect('#f5f2e9', color = NA),
        panel.background = element_rect('#f5f2e9', color = NA),
        legend.background = element_rect('#f5f2e9', color = NA),
        strip.background = element_rect('#f5f2e9', color = NA))),
#Distance Class
(COMMUTE_T%>%na.omit()%>%
  ggplot(aes(TURNOVER, group = DIST_CLASS)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop.., accuracy = .1),
                   y= ..prop.. ), stat= "count",
            nudge_y = .025,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(values = pal3,
                    labels = c("VOLUNTARY", "INVOLUNTARY", 'ACTIVE'))+
  labs(title = "Turnover by Employee Distance (mi) to Work",
       x = "", fill = "Employee Status",
       caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  facet_wrap(~DIST_CLASS,
             strip.position="bottom",
             nrow=1)+theme_minimal()+
  theme(axis.text.x = element_blank(),
        plot.background = element_rect('#f5f2e9', color = NA),
        title= element_text(size = 12),
        strip.text = element_text(size = 11),
        panel.background = element_rect('#f5f2e9', color = NA),
        legend.background = element_rect('#f5f2e9', color = NA),
        strip.background = element_rect('#f5f2e9', color = NA))), ncol = 1)

# Mutate data to create age categories - filter by turnover events 
age_turn<- study.panel%>%
  filter(TURNOVER %in% c("VOLUNTARY","INVOLUNTARY", "RETIRE"))%>%
  mutate(AGE_HIRED =(lubridate::year(DATE_HIRED) - lubridate::year(BIRTHDATE)),
         AGE_TERM =(lubridate::year(TERM_DATE) - lubridate::year(BIRTHDATE)))%>%
  mutate(AGE_TERM = round(replace(AGE_TERM, which(AGE_TERM<0), currentAge),0))%>%
  mutate(AGE_TERM=  case_when(AGE_TERM < 30 ~ "Under 30",
                               AGE_TERM >= 30 & AGE_TERM < 40 ~ "30 to 40",
                               AGE_TERM >= 40 & AGE_TERM < 50 ~ "40 to 50",
                               AGE_TERM >= 50 & AGE_TERM < 65 ~ "50 to 65",
                               AGE_TERM >= 65 ~ "Above 65"),
         AGE_HIRED = case_when(AGE_HIRED < 30 ~ "Under 30",
                               AGE_HIRED >= 30 & AGE_HIRED < 40 ~ "30 to 40",
                               AGE_HIRED >= 40 & AGE_HIRED < 50 ~ "40 to 50",
                               AGE_HIRED >= 50 & AGE_HIRED < 65 ~ "50 to 65",
                               AGE_HIRED >= 65 ~ "Above 65"),
         CALC_AGE= case_when(currentAge < 30 ~ "Under 30",
                               currentAge >= 30 & currentAge < 40 ~ "30 to 40",
                               currentAge >= 40 & currentAge < 50 ~ "40 to 50",
                               currentAge >= 50 & currentAge < 65 ~ "50 to 65",
                               currentAge >= 65 ~ "Above 65"))%>%na.omit()

#Factor levels of age categories and turnover
age_turn$AGE_HIRED = factor(age_turn$AGE_HIRED,
                     levels = c("Under 30", "30 to 40", 
                                "40 to 50", "50 to 65", "Above 65"))
age_turn$AGE_TERM = factor(age_turn$AGE_HIRED,
                    levels = c("Under 30", "30 to 40", 
                               "40 to 50", "50 to 65", "Above 65"))
age_turn$CALC_AGE = factor(age_turn$CALC_AGE,
                    levels = c("Under 30","30 to 40", 
                               "40 to 50", "50 to 65", "Above 65"))
age_turn$TURNOVER = factor(age_turn$TURNOVER, 
                      levels = c("VOLUNTARY", "INVOLUNTARY", "RETIRE"))

# check age by turnover type
age_turn%>%filter(TURNOVER == 'VOLUNTARY')%>%
  ggplot(aes(AGE_TERM, group = TURNOVER)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), color ="#8f7b87", stat="count") +
  geom_text(aes(label = scales::percent(..prop.., accuracy = .1),
                   y= ..prop.. ), stat= "count",
            nudge_y = .025,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
   scale_fill_manual(values = pal5,
     labels = c("Under 30", "30 to 40", 
                               "40 to 50", "50 to 65", "Above 65"))+
  labs(title = "Age of Employee at Termination by Turnover Type",
       x = " ", fill = "Age Group",
       caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  theme(
        plot.background = element_rect('#f5f2e9', color = NA),
        panel.background = element_rect('#f5f2e9', color = NA),
        legend.background = element_rect('#f5f2e9', color = NA),
        strip.background = element_rect('#f5f2e9', color = NA))

#plot turnover by age at time of termination
age_turn%>%
  ggplot(aes(TURNOVER, group = AGE_TERM)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop.., accuracy = .1),
                   y= ..prop.. ), stat= "count",
            nudge_y = .025,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(values = pal3_ret,
                    labels = c("VOLUNTARY", "INVOLUNTARY", 'RETIRE'))+
  labs(title = "Turnover by Age of Employee At Termination",
       x = "Age Category", fill = "Turnover Status",
       caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  facet_wrap(~AGE_TERM,
             strip.position="bottom",
             nrow=1)+theme_minimal()+
  theme(axis.text.x = element_blank(),
        plot.background = element_rect('#f5f2e9', color = NA),
        panel.background = element_rect('#f5f2e9', color = NA),
        legend.background = element_rect('#f5f2e9', color = NA),
        strip.background = element_rect('#f5f2e9', color = NA))

#Average Employee Tenure by County Department 
func_labs<- c("Health", "Law Enforcement", "Financial Administration", "Public Welfare", "Corrections", "Community Development")

study.panel$deptGroup<- factor(study.panel$deptGroup,
         levels = c("Health", "Law Enforcement", "Financial Administration", "Public Welfare", "Corrections", "Community Development"))

#Voluntary turnover by function group 
fnctn_gr<- study.panel%>%
  group_by(EMPLOYEE)%>%
  mutate(index = row_number(EMPLOYEE))%>%
  filter(TURNOVER %in% c('VOLUNTARY', 'INVOLUNTARY'))%>%
  ungroup()%>%
  dplyr::select(index,TURNOVER,deptGroup)%>%
  group_by(deptGroup)%>%
  add_count()%>%add_count(TURNOVER)%>%
  mutate(PCT= round((nn/n)*100,2))%>%
  distinct(TURNOVER, deptGroup, .keep_all=T)%>%
  dplyr::select(TURNOVER, deptGroup, PCT)%>%ungroup()

#vizualize turnover by function group
fnctn_gr$deptGroup<- factor(fnctn_gr$deptGroup,
         levels = c("Health", "Law Enforcement", "Financial Administration", "Public Welfare", "Corrections", "Community Development"))

grid.arrange(
(study.panel%>%filter(currentStatus==1)%>%
  group_by(EMPLOYEE) %>% 
  top_n(1, TENURE_Y)%>%
  ungroup()%>%
  group_by(deptGroup)%>%
  summarise_at(vars(TENURE_Y), list(mean_tenure = mean))%>%
  ggplot()+
  geom_bar(aes(x= deptGroup,y = mean_tenure),fill = "#dfd885", stat= 'identity')+
  guides(fill = 'none',)+
  labs(title = 'Average Employee Tenure By \nCounty Department Function Group',
       y = 'Average Tenure (Years)', x = ' ', caption = "figure x")+
  coord_flip()+
  theme_minimal()+
  theme(panel.background = element_rect(fill ='#f5f2e9', color = NA),
        plot.background = element_rect(fill = '#f5f2e9', color = NA),
        axis.text.y= element_text(face = "bold"),
        panel.grid.major.x  =element_line(color = 'grey60',linetype = 3, size = .5), 
        panel.grid.minor.x  =element_line(color = 'grey85',linetype = 3, size = .25),
        panel.grid.minor.y = element_blank(), 
        panel.grid.major.y=element_blank())),
(fnctn_gr%>%
  ggplot(aes(deptGroup, fill = TURNOVER))+
  geom_bar(aes(y = PCT), position = "dodge", stat="identity")+
  labs(title = "Turnover Type by \nCounty Department Function Group", 
       caption = "figure x", y = "Percent", x = " ")+
  guides()+
  scale_fill_manual(values = pal1)+
  # scale_x_discrete(labels= func_labs)+
  coord_flip()+
  theme_minimal()+
  theme(panel.background = element_rect(fill ='#f5f2e9', color = NA),
        plot.background = element_rect(fill = '#f5f2e9', color = NA),
        axis.text.y= element_text(face = "bold"),
        panel.grid.major.x  =element_line(color = 'grey60',linetype = 3, size = .5), 
        panel.grid.minor.x  =element_line(color = 'grey85',linetype = 3, size = .25),
        panel.grid.minor.y = element_blank(), 
        panel.grid.major.y=element_blank())), nrow = 1)
#How do we want to visualize this new feature data
PAY_RAISE$TURNOVER = factor(PAY_RAISE$TURNOVER, levels= c("VOLUNTARY",
                                                              "INVOLUNTARY",
                                                              "RETIRE"))

YR_DEPT_NAMES <- c(`0` = "Below Average \nDept. Pay Raise (0) \n",
                   `1` = "Above Average \nDept. Pay Raise (1) ")

# PAY_RAISE%>%na.omit()%>%
#   filter(TURNOVER %in% c('VOLUNTARY', 'INVOLUNTARY', 'RETIRE'))%>%
#   dplyr::select(EMPLOYEE, TURNOVER, DEPT_NAME, YR_DEPT_RAISE)%>%
#   group_by(TURNOVER)%>%
#   ggplot(aes(TURNOVER, group = YR_DEPT_RAISE)) +
#   geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
#   geom_text(aes(label = scales::percent(..prop..),
#                    y= ..prop.. ), stat= "count",
#             nudge_y = .025,
#             color = "grey35",
#             fontface = "bold") +
#   scale_y_continuous(labels=scales::percent) +
#   scale_fill_manual(values = pal3_ret,
#                     labels = c("VOLUNTARY", "INVOLUNTARY", "RETIRE"))+
#   labs(title = '  ',
#        x = "", fill = "Turnover Status")+
#   ylab("relative frequencies") +
#   facet_wrap(~YR_DEPT_RAISE,
#              strip.position="bottom",
#              labeller = as_labeller(YR_DEPT_NAMES),
#              nrow=1)+
#   theme_minimal()+
#   theme(plot.background = element_rect('#f5f2e9', color = NA),
#         axis.text.x = element_blank(),
#         panel.grid.major.y  =element_line(color = 'grey60',linetype = 3, size = .5), 
#           panel.grid.minor = element_blank(), panel.grid.major.x=element_blank())


PAY_RAISE%>%na.omit()%>%filter(TURNOVER == "VOLUNTARY")%>%
ggplot(aes(YR_DEPT_RAISE, group = TURNOVER)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), color ="#8f7b87", stat="count") +
  geom_text(aes(label = scales::percent(..prop.., accuracy = .1),
                   y= ..prop.. ), stat= "count",
            nudge_y = .025,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
  scale_x_discrete(labels = YR_DEPT_NAMES) +
   scale_fill_manual(values = c("#f3efea", "#8f7b87"), 
                     labels = YR_DEPT_NAMES)+
  labs(title = "Voluntary Turnover Frequency by \nRelative Percent Change in Pay",
       x = " ", fill = " ",
       caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        panel.background = element_rect('#f5f2e9', color = NA),
        legend.background = element_rect('#f5f2e9', color = NA),
        strip.background = element_rect('#f5f2e9', color = NA))

PAY_RAISE%>%na.omit()%>%group_by(year)%>%
  filter(TURNOVER %in% c("VOLUNTARY"))%>%
  summarise_at(vars(pct_change_pay, dept_avg_raise), list(avg = mean))%>%
  left_join(PAY_RAISE, ., by = "year")%>%
  filter(TURNOVER %in% c("VOLUNTARY"))%>%
  ggplot(aes(x=year))+
  geom_smooth(aes(y = pct_change_pay_avg, color = "#e4dde1"), se=F, size = 1)+
  geom_smooth(aes(y = dept_avg_raise_avg, color = "#8f7b87"), se=F, size = 1)+
  scale_colour_manual(name="", values=c("#7fd4b6" , "#8f7b87"), 
                      labels = c("Average Employee Percent Change in Pay", "County Percent Change in Pay"))+
  labs(y = "Percent", x = "Year", title = "Relative Percent Change in Pay Over Time", subtitle= "Voluntary Turnover", 
       caption = "figure x")+
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        panel.background = element_rect('#f5f2e9', color = NA),
        legend.background = element_rect('#f5f2e9', color = NA),
        strip.background = element_rect('#f5f2e9', color = NA), 
        strip.text = element_text(face = "bold"),
        legend.position="bottom")

#MAJ EEO
MAJ_EEO_NAMES <- c(`0` = "Not in the Majority EEO Class (0)",
                   `1` = "Majority EEO Class (1)")
study.panel$MAJ_EEO= factor(study.panel$MAJ_EEO, levels= c(0,1))

study.panel$TURNOVER = factor(study.panel$TURNOVER, levels= c("VOLUNTARY",
                                                              "INVOLUNTARY",
                                                              "RETIRE"))
# bar chart - overall percent - non temporal

study.panel%>%
  dplyr::select(EMPLOYEE, eff_yr_month, emp_yr_month_KEY, TURNOVER, DEPT_NAME, MAJ_EEO, MAJ_GENDER, MAJ_AGE_GROUP)%>%
  mutate(year = lubridate::year(eff_yr_month))%>%na.omit()%>%
  filter(TURNOVER %in% c('VOLUNTARY', 'INVOLUNTARY', 'RETIRE'))%>%
  group_by(TURNOVER)%>%
  ggplot(aes(TURNOVER, group = MAJ_EEO)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..),
                   y= ..prop.. ), stat= "count",
            nudge_y = .025,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(values = pal3_ret,
                    labels = c("VOLUNTARY", "INVOLUNTARY", "RETIRE"))+
  labs(title = 'Turnover Type by MAJ_EEO',
       x = "", fill = "Turnover Status",
       caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  facet_wrap(~MAJ_EEO,
             strip.position="bottom",
             labeller = as_labeller(MAJ_EEO_NAMES),
             nrow=1)+
  theme_minimal()+
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        axis.text.x = element_blank(),
        panel.grid.major.y  =element_line(color = 'grey60',linetype = 3, size = .5), 
          panel.grid.minor = element_blank(), panel.grid.major.x=element_blank())


study.panel%>%na.omit()%>%filter(TURNOVER == "VOLUNTARY")%>%
  mutate(year = lubridate::year(eff_yr_month))%>%dplyr::select(TURNOVER, year, MAJ_EEO)%>%
  group_by(year)%>%arrange(year)%>%filter(year != "2022")%>%
  ggplot(aes(MAJ_EEO, group = year))+
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat= "count")+
            scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(name = "", values = c("#c0b5bc", "#8f7b87"), labels= c("Not Majority EEO Class", "Majority EEO Class"))+
          facet_wrap(~year,nrow=1, strip.position = "bottom")+
    labs(x = "year", title = "Voluntary Turnover by Relative Ethnicity Over Time", caption = "figure x")+
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        panel.background = element_rect('#f5f2e9', color = NA),
        legend.background = element_rect('#f5f2e9', color = NA),
        strip.background = element_rect('#f5f2e9', color = NA),
        axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.grid.major.y= element_line(color = "grey", linetype = 2))
#MAJ GENDER
MAJ_G_NAMES <- c(`0` = "Not the Majority Gender (0)",
                   `1` = "Majority Gender (1)")


# bar chart - overall percent - non temporal
study.panel%>%
  dplyr::select(EMPLOYEE, eff_yr_month, emp_yr_month_KEY, TURNOVER, DEPT_NAME, MAJ_EEO, MAJ_GENDER, MAJ_AGE_GROUP)%>%
  mutate(year = lubridate::year(eff_yr_month))%>%na.omit()%>%
  filter(TURNOVER %in% c('VOLUNTARY', 'INVOLUNTARY', 'RETIRE'))%>%
  group_by(TURNOVER)%>%
  ggplot(aes(TURNOVER, group = MAJ_GENDER)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..),
                   y= ..prop.. ), stat= "count",
            nudge_y = .025,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(values = pal3_ret,
                    labels = c("VOLUNTARY", "INVOLUNTARY", "RETIRE"))+
  labs(title = 'Turnover Type by Majority Gender',
       x = "", fill = "Turnover Status",
       caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  facet_wrap(~MAJ_GENDER,
             strip.position="bottom",
             labeller = as_labeller(MAJ_G_NAMES),
             nrow=1)+
  theme_minimal()+
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        axis.text.x = element_blank(),
        panel.grid.major.y  =element_line(color = 'grey60',linetype = 3, size = .5), 
          panel.grid.minor = element_blank(), panel.grid.major.x=element_blank())

study.panel%>%na.omit()%>%filter(TURNOVER == "VOLUNTARY")%>%
  mutate(year = lubridate::year(eff_yr_month))%>%dplyr::select(TURNOVER, year, MAJ_GENDER)%>%
  group_by(year)%>%arrange(year)%>%filter(year != "2022")%>%
  ggplot(aes(MAJ_GENDER, group = year))+
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat= "count")+
            scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(name = "", values = c("#c0b5bc", "#8f7b87"), labels= MAJ_G_NAMES)+
    labs(x = "year", title = "Voluntary Turnover by Relative Gender Over Time", caption = "figure x")+
          facet_wrap(~year,nrow=1, strip.position = "bottom")+
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        panel.background = element_rect('#f5f2e9', color = NA),
        legend.background = element_rect('#f5f2e9', color = NA),
        strip.background = element_rect('#f5f2e9', color = NA),
        axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.grid.major.y= element_line(color = "grey", linetype = 2))

#MAJ AGE GROUP
MAJ_AGE_NAMES <- c(`0` = "Not the Majority Age Group",
                   `1` = "Majority Age Group")
study.panel%>%
  dplyr::select(EMPLOYEE, eff_yr_month, emp_yr_month_KEY, TURNOVER, DEPT_NAME, MAJ_EEO, MAJ_GENDER, MAJ_AGE_GROUP)%>%
  mutate(year = lubridate::year(eff_yr_month))%>%na.omit()%>%
  filter(TURNOVER %in% c('VOLUNTARY', 'INVOLUNTARY', 'RETIRE'))%>%
  group_by(TURNOVER)%>% 
  ggplot(aes(TURNOVER, group = MAJ_AGE_GROUP)) +
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") +
  geom_text(aes(label = scales::percent(..prop..),
                   y= ..prop.. ), stat= "count",
            nudge_y = .025,
            color = "grey35",
            fontface = "bold") +
  scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(values = pal3_ret,
                    labels = c("VOLUNTARY", "INVOLUNTARY", "RETIRE"))+
  labs(title = 'Turnover Type by Majority Age Group',
       x = "", fill = "Turnover Status",
       caption = "figure x")+
  ylab("relative frequencies") +
  guides()+
  facet_wrap(~MAJ_AGE_GROUP,
             strip.position="bottom",
             labeller = as_labeller(MAJ_AGE_NAMES),
             nrow=1)+
  theme_minimal()+
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        axis.text.x = element_blank(),
        panel.grid.major.y  =element_line(color = 'grey60',linetype = 3, size = .5), 
          panel.grid.minor = element_blank(), panel.grid.major.x=element_blank())

study.panel%>%na.omit()%>%filter(TURNOVER == "VOLUNTARY")%>%
  mutate(year = lubridate::year(eff_yr_month))%>%dplyr::select(TURNOVER, year, MAJ_AGE_GROUP)%>%
  group_by(year)%>%arrange(year)%>%filter(year != "2022")%>%
  ggplot(aes(MAJ_AGE_GROUP, group = year))+
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat= "count")+
            scale_y_continuous(labels=scales::percent) +
  scale_fill_manual(name = "", values = c("#c0b5bc", "#8f7b87"), labels= MAJ_AGE_NAMES)+
  labs(x = "year", title = "Voluntary Turnover by Relative Age Group Over Time", caption = "figure x")+
          facet_wrap(~year,nrow=1, strip.position = "bottom")+
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        panel.background = element_rect('#f5f2e9', color = NA),
        legend.background = element_rect('#f5f2e9', color = NA),
        strip.background = element_rect('#f5f2e9', color = NA),
        axis.text.x = element_blank(), 
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        panel.grid.major.y= element_line(color = "grey", linetype = 2))

## filter out features not related to the prediction
P_Data <- study.panel %>%
  dplyr::select( -EMPLOYEE, -eff_yr_month, -EMP_EFFECT_DATE, -CHG_REASON, -eff_month, -EMP_STATUS, -LATEST_JOB_CODE, -JBC_DESCRIPTION, -DATE_HIRED, 
                -ADJ_HIRE_DATE, -NEW_HIRE_DATE, -TERM_DATE, -LATEST_PAY_RATE, 
                -ANNUAL_HOURS, -BIRTHDATE, -currentStatus,-PROCESS_LEVEL, -DESCRIPTION, -lag_PT_JOB_CODE, -jobChange, -TENURE_Y, -DEPARTMENT,
                -emp_yr_month_KEY, -TURNOVER_12mo_flag, -TURNOVER, -TURNOVER_18mo_flag, -PT_JOB_CODE, -PAY_GRADE, -mean_distance)%>%
  filter(TURNOVER_6mo_flag == "ACTIVE"|TURNOVER_6mo_flag == "VOLUNTARY")%>%
  na.omit()%>%
  filter(SMOKER != "")%>%
  rename("TimeAtCounty(month)" = TENURE_M)

## reclassify the departments
P_Data <- P_Data %>% mutate(DEPARTMENT = case_when(DEPT_NAME == "Animal Services" ~ "Animal Services",
                                                   DEPT_NAME == "Budget & Management Services" ~ "Budget & Management Services",
                                                   DEPT_NAME == "Child Support Enforcement" ~ "Child Support",
                                                   DEPT_NAME == "Elections" ~ "Board of Elections",
                                                   DEPT_NAME == "Emergency Services" ~ "Emergency Services",
                                                   DEPT_NAME == "Family Justice Center" ~ "Family Justice Center",
                                                   DEPT_NAME == "Internal Audit"|DEPT_NAME == "Finance" ~ "Finance",
                                                   DEPT_NAME == "Human Resources" ~ "Human Resources",
                                                   DEPT_NAME == "Juvenile Detention" ~ "Juvenile Detention",
                                                   DEPT_NAME == "Purchasing" ~ "Purchasing",
                                                   DEPT_NAME == "Register of Deeds" ~ "Register of Deeds",
                                                   DEPT_NAME == "Risk Management" ~ "Risk Management",
                                                   DEPT_NAME == "Tax" ~ "Tax",
                                                   DEPT_NAME == "Veterans' Services" ~ "Veterans' Services",
                                                   DEPT_NAME == "Administration" ~ "Administration",
                                                   DEPT_NAME == "Commissioners" ~ "Board of Commissioners",
                                                   DEPT_NAME == "Clerk to the Board" ~ "Clerk to the Board",
                                                   DEPT_NAME == "Parks (Prev Prop Mgmt)" ~ "County Parks",
                                                   DEPT_NAME == "Facilities"|DEPT_NAME == "Property Management/Courts" ~ "Faclities and Property Management",
                                                   DEPT_NAME == "Public Health"|DEPT_NAME == "Social Services"|DEPT_NAME == "Transportation-Human Services"|DEPT_NAME == "Mental Health" ~ "Human services",
                                                   DEPT_NAME == "Information Technology" ~ "Information Technology",
                                                   DEPT_NAME == "Law Enforcement"|DEPT_NAME == "Legal" ~ "Legal - County Attorney",
                                                   DEPT_NAME == "Planning"|DEPT_NAME == "PLN/Solid Waste"|DEPT_NAME == "Community&Economic Development"|DEPT_NAME == "PLN/Inspections"|DEPT_NAME == "PLN/Soil & Water" ~ "Planning & Development",
                                                   DEPT_NAME == "Security" ~ "Security"
                                                   ))

P_Data <- P_Data %>% dplyr::select(-DEPT_NAME)%>%na.omit()


###split the data
training1 <- P_Data %>% filter(eff_year %in% (2006:2016))%>%dplyr::select(-eff_year)
testing1 <- P_Data %>% filter(eff_year %in% (2017:2022))%>%dplyr::select(-eff_year)

#write.csv(training1, "training.csv") 
#write.csv(testing1, "testing.csv") 

######balance data in python######

###balancing class using SMOTE_NC for TRAINSET(generate artificial data based on existing data) -- more details in file `overSampling.ipynb`

###read balanced data
set.seed(42)
balancedTraining <- balancedTraining %>% 
  dplyr::select(-SUPERVISOR,-FNCTN_GROUP_DESCRIPTION,-HIRE_SOURCE,-NEXT_REV_CODE)%>%
  filter(EEO_CLASS != "AP", TRUE_MAR_STAT != "L")

Testing <- Testing %>% 
  dplyr::select(-V1,-SUPERVISOR,-FNCTN_GROUP_DESCRIPTION,-HIRE_SOURCE,-NEXT_REV_CODE)


### Extract Sample datasets due to the work load for analysis on the whole data
data_split <- initial_split(balancedTraining, strata = "TURNOVER_6mo_flag", prop = 0.03)
TrainingSample <- training(data_split)
data_split1 <- initial_split(Testing, strata = "TURNOVER_6mo_flag", prop = 0.06)
TestingSample  <- training(data_split1)


####### tidymodels ######
### Cross Validation
## LOGOCV on DEPARTMENT with group_vfold_cv()
cv_splits <- group_vfold_cv(TrainingSample,  
                            group = "DEPARTMENT")
# print(cv_splits)

### Create Recipes

# Feature Creation
model_rec <- recipe(TURNOVER_6mo_flag ~ ., 
                    data = TrainingSample) %>%
  update_role(DEPARTMENT, new_role = "DEPARTMENT") %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), threshold = 0.005) %>%
  step_dummy(all_nominal(), -all_outcomes(), -DEPARTMENT)%>%
  step_zv(all_predictors()) %>%
  prep()


# summary(model_rec)
# See the data after all transformations
# glimpse(model_rec %>% juice())


## Model specifications
glmnet_plan <- 
  logistic_reg() %>% 
  set_args(penalty  = tune()) %>%
  set_args(mixture  = tune()) %>%
  set_engine("glmnet")

rf_plan <- rand_forest() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 1000) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

XGB_plan <- boost_tree() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 100) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

# Hyperparameter grid for glmnet (penalization)
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25), 
                           mixture = seq(0,1,0.25))
rf_grid <- expand.grid(mtry = c(2,5), 
                       min_n = c(1,5))
xgb_grid <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))

# create workflow
glmnet_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(glmnet_plan) 

rf_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(rf_plan)
xgb_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(XGB_plan)


# fit model to workflow and calculate metrics
control <- control_resamples(save_pred = TRUE, verbose = TRUE)
metrics <- metric_set(roc_auc,sensitivity, precision, recall)

glmnet_tuned <- glmnet_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits,
                  grid      = glmnet_grid,
                  control   = control,
                  metrics   = metrics)
rf_tuned <- rf_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metrics)

xgb_tuned <- xgb_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metrics)


## metrics across grid
# autoplot(xgb_tuned)
# collect_metrics(xgb_tuned)
## 'Best' by some metric and margin
# show_best(glmnet_tuned, metric = "roc_auc", n = 24)
# show_best(rf_tuned, metric = "roc_auc", n = 15)
# show_best(xgb_tuned, metric = "roc_auc", n = 15)

glmnet_best_params <- select_best(glmnet_tuned, metric = "roc_auc")
rf_best_params     <- select_best(rf_tuned, metric = "roc_auc"    )
xgb_best_params    <- select_best(xgb_tuned, metric = "roc_auc"   )

## Final workflow
glmnet_best_wf <- finalize_workflow(glmnet_wf, glmnet_best_params)
rf_best_wf     <- finalize_workflow(rf_wf, rf_best_params)
xgb_best_wf    <- finalize_workflow(xgb_wf, xgb_best_params)


# last_fit() emulates the process where, after determining the best model, the final fit on the entire training set is needed and is then evaluated on the test set.
custom_split <- make_splits(TrainingSample, assessment = TestingSample)

glmnet_val_fit <- glmnet_best_wf %>% 
  last_fit(split     = custom_split,
           control   = control,
           metrics   = metrics)

rf_val_fit <- rf_best_wf %>% 
  last_fit(split     = custom_split,
           control   = control,
           metrics   = metrics)


xgb_val_fit <- xgb_best_wf %>% 
  last_fit(split     = custom_split,
           control   = control,
           metrics   = metrics)

### logistic ###
############## we import the test results to make the process fast, comment out following code when running through ################

#glmnet_pre <- glmnet_val_fit %>%
  #extract_workflow() %>%
  #predict(TestingSample)%>% mutate(ACTUAL = TestingSample$TURNOVER_6mo_flag)

#testProbs<-glmnet_val_fit %>%
  #extract_workflow() %>%
  #predict(TestingSample,type='prob')

#testProbs <- testProbs %>%
  #mutate(ACTUAL  = ifelse(glmnet_pre$ACTUAL == "VOLUNTARY", "VOLUNTARY","ACTIVE"))
#testProbs$predOutcome <- as.factor(glmnet_pre$.pred_class)
#testProbs$ACTUAL <- as.factor(testProbs$ACTUAL)

#///////////// comment the this chunck when running through 
testProbs <- testProbs %>% dplyr::select(-V1)
testProbs$ACTUAL <- as.factor(testProbs$ACTUAL)
testProbs$predOutcome <- as.factor(testProbs$predOutcome)
#///////////// comment the this chunck when running through ////////////#

logistic_summary <- caret::confusionMatrix(testProbs$predOutcome, testProbs$ACTUAL,
                       positive = "VOLUNTARY")


### rf ###
############## comment out following code when running through ################
#rf_pre <- rf_val_fit %>%
  #extract_workflow() %>%
  #predict(TestingSample)%>% mutate(ACTUAL = TestingSample$TURNOVER_6mo_flag)

#testProbs2<-rf_val_fit %>%
  #extract_workflow() %>%
  #predict(TestingSample,type='prob')

#testProbs2 <- testProbs2 %>%
  #mutate(ACTUAL  = ifelse(rf_pre$ACTUAL == "VOLUNTARY", "VOLUNTARY","ACTIVE"))
#testProbs2$predOutcome <- as.factor(rf_pre$.pred_class)
#testProbs2$ACTUAL <- as.factor(testProbs2$ACTUAL)

#///////////// comment the this chunck when running through
testProbs2 <- testProbs2 %>% dplyr::select(-V1)
testProbs2$ACTUAL <- as.factor(testProbs2$ACTUAL)
testProbs2$predOutcome <- as.factor(testProbs2$predOutcome)
#///////////// comment the this chunck when running through ////////////#

rf_summary <- confusionMatrix(testProbs2$predOutcome, testProbs2$ACTUAL,
                       positive = "VOLUNTARY")

#### xgb ###
############## comment out following code when running through ################
#xgb_pre <- xgb_val_fit %>%
  #extract_workflow() %>%
#predict(TestingSample)%>% mutate(ACTUAL = TestingSample$TURNOVER_6mo_flag)

#testProbs3<-xgb_val_fit %>%
  #extract_workflow() %>%
  #predict(TestingSample,type='prob')

#testProbs3 <- testProbs3 %>%
  #mutate(ACTUAL  = ifelse(xgb_pre$ACTUAL == "VOLUNTARY", "VOLUNTARY","ACTIVE"))
#testProbs3$predOutcome <- as.factor(xgb_pre$.pred_class)
#testProbs3$ACTUAL <- as.factor(testProbs3$ACTUAL)

#///////////// comment the this chunck when running through
testProbs3 <- testProbs3 %>% dplyr::select(-V1)
testProbs3$ACTUAL <- as.factor(testProbs3$ACTUAL)
testProbs3$predOutcome <- as.factor(testProbs3$predOutcome)
#///////////// comment the this chunck when running through ////////////#
xgb_summary <- caret::confusionMatrix(testProbs3$predOutcome, testProbs$ACTUAL,
                       positive = "VOLUNTARY")

library(stargazer)
stargazer(logistic_summary$byClass, rf_summary$byClass, xgb_summary$byClass, type = "html",
          title = "Metrics of 3 Models",
          header = FALSE,
          single.row = TRUE,
          column.labels=c("Logistic Regression","Random Forest","extreme gradient boosting"))


#### choose RF and further optimization
## plot the probability and choose a new threshold
ggplot(testProbs2, aes(x = .pred_VOLUNTARY, fill = as.factor(ACTUAL))) + 
  geom_density() +
  facet_grid(ACTUAL ~ .) +
  scale_fill_manual(values = pal1,
                          name = "Status") +
  labs(x = "Turnover", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome\n",
       caption = '') +
    theme_minimal()+ 
  theme(plot.background = element_rect('#f5f2e9', color = NA),
        axis.line = element_line(size =.5, color = "grey55"),
        axis.title = element_text(face = "bold"),
        axis.text = element_text(face = "bold"),
        panel.grid = element_blank(),
        panel.grid.major.x = element_line(size = 1, color = "grey75", linetype = 3))

## after setting the threshold
testProbs_threshold <- 
  testProbs2 %>%
  mutate(predOutcome  = ifelse(testProbs2$.pred_VOLUNTARY > 0.2, "VOLUNTARY", "ACTIVE"))
testProbs_threshold$predOutcome <- as.factor(testProbs_threshold$predOutcome)
testProbs_threshold$ACTUAL <- as.factor(testProbs_threshold$ACTUAL)

caret::confusionMatrix(testProbs_threshold$predOutcome, testProbs_threshold$ACTUAL,positive = "VOLUNTARY")
auc(testProbs_threshold$ACTUAL,testProbs_threshold$.pred_VOLUNTARY)

testProbs_threshold$ACTUAL_num <- ifelse(testProbs_threshold$ACTUAL == "VOLUNTARY",1,0)

ggplot(testProbs_threshold, aes(d = as.numeric(testProbs_threshold$ACTUAL_num), m = .pred_VOLUNTARY)) +
     geom_roc(n.cuts = 50, labels = FALSE, colour =  "#e79899" ) +
     geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
     labs(title = "ROC Curve - Turnover_Model")+
     theme_minimal()+ 
     theme(plot.background = element_rect('#f5f2e9', color = NA),
           axis.line = element_line(size =.5, color = "grey55"),
           axis.title = element_text(face = "bold"),
           axis.text = element_text(face = "bold"),
           panel.grid = element_blank(),
           panel.grid.major.x = element_line(size = 1, color = "grey75", linetype = 3))
iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
    
    this_prediction <-
      testProbs_threshold %>%
      mutate(predOutcome = ifelse(.pred_VOLUNTARY > x, 1, 0)) %>%
      count(predOutcome, ACTUAL_num) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & ACTUAL_num==0]),
                True_Positive = sum(n[predOutcome==1 & ACTUAL_num==1]),
                False_Negative = sum(n[predOutcome==0 & ACTUAL_num==1]),
                False_Positive = sum(n[predOutcome==1 & ACTUAL_num==0])) %>%
      gather(Variable, Count) %>%
      mutate(Cost =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive", Count * 2505.6,
               ifelse(Variable == "False_Negative",Count * 12455,
               ifelse(Variable == "False_Positive",Count * 14, 0)))),Threshold=x)
    
    all_prediction <- rbind(all_prediction, this_prediction)
    x <- x + .01
  }
  return(all_prediction)
}

whichThreshold <- iterateThresholds(testProbs_threshold)

whichThreshold_revenue <- 
  whichThreshold %>% 
    group_by(Threshold) %>% 
    summarize(Total_Cost = sum(Cost)) 

## plot the cost and the threshold
whichThreshold_revenue %>%
  dplyr::select(Threshold, Total_Cost) %>%
  gather(Variable, Value, -Threshold) %>%
  ggplot(aes(Threshold, Value, colour = Variable)) +
    geom_point() +
    scale_colour_manual(values = "#c70000") +
    labs(title = "Total Cost by threshold cut off (not resonable to choose the threshold based on this)",
       y="Total Cost")+
    geom_vline(xintercept = pull(arrange(whichThreshold_revenue, Total_Cost)[1,1]))+
  theme_minimal()+ 
     theme(plot.background = element_rect('#f5f2e9', color = NA),
           axis.line = element_line(size =.5, color = "grey55"),
           axis.title = element_text(face = "bold"),
           axis.text = element_text(face = "bold"),
           panel.grid = element_blank(),
           panel.grid.major.x = element_line(size = 1, color = "grey75", linetype = 3))


testProbs_threshold$predOutcome_num <-ifelse(testProbs_threshold$predOutcome=="VOLUNTARY",1,0)

cost_benefit_table <-
  testProbs_threshold %>%
  count(predOutcome_num, ACTUAL_num) %>%
  summarize(True_Negative = sum(n[predOutcome_num==0 & ACTUAL_num==0]),
            True_Positive = sum(n[predOutcome_num==1 & ACTUAL_num==1]),
            False_Negative = sum(n[predOutcome_num==0 & ACTUAL_num==1]),
            False_Positive = sum(n[predOutcome_num==1 & ACTUAL_num==0])) %>%
  gather(Variable, Count) %>%
  bind_cols(data.frame(Each_Employee_Cost = c(
    "$0","$2505.6","$12455","$14")))%>%
  mutate(Cost =
           ifelse(Variable == "True_Negative", Count * 0,
           ifelse(Variable == "True_Positive", Count * 2505.6,
           ifelse(Variable == "False_Negative",Count * 12455,
           ifelse(Variable == "False_Positive",Count * 14, 0)))))

kable(cost_benefit_table, align = "l",
      caption = "Cost/Benefit Table") %>% kable_styling()

MAHAN, THOMAS F., ED.D., Work Institute: Retention Report 2019, info.workinstitute.com.↩︎
MAHAN, THOMAS F., ED.D., Work Institute: Retention Report 2019,info.workinstitute.com.↩︎
Chatterjee, K., Clark, B., Martin, A. & Davis, A. (2017). The Commuting and Wellbeing Study: Understanding the Impact of Commuting on People’s Lives. UWE Bristol, UK.travelbehaviour.files.wordpress.com ↩︎
PARKER, KIM, PEW Research Center, “Majority of workers who quit a job in 2021 cite low pay, no opportunities for advancement, feeling disrespected.” March, 9, 2022.www.pewresearch.org ↩︎