• 1. Introduction
    • 1.1 Abstract
    • 1.2 Motivation
    • 1.3 Methods in brief
    • 1.4 Modeling strategy
    • 1.5 Data
  • 2. Exploratory Analysis
    • 2.1 Understand the Inmate Profile
    • 2.2 What are some of the characteristics of Recidivism?
    • 2.3 Spatial Analysis
  • 3. Feature Engineering and Model Development
    • 3.1 Model Development
    • 3.2 Model validation
    • 3.3 Cross-validation
  • 4. Converting predictions into actionable intelligence
    • 4.1 Cost Analysis
  • 5. Conclusions
  • Appendix


About this Document

This project was produced as part of the University of Pennsylvania Master of Urban Spatial Analytics Spring 2019 Practicum (MUSA 801) with the assistance of instructors Ken Steif and Michael Fichman. Special thanks to Jason Jones, Analytics & Innovation Manager for Guilford County, NC who provided data access and content expertise for this project. Without Jason, Ken and Michael, this project would not have been possible.

This document begins with a case study of predicting recidivism among inmates in North Carolina prisons in partnership with Guilford County, North Carolina. This document presents context on recidivism in Guilford County, the rationale for predicting recidivism, the fairness concerns associated with such an exercise, and an app driven by a predictive algorithm as proof of concept. This is followed by a series of appendices that discuss the data wrangling, data visualization, data sources, feature engineering, and model results in greater depth. Navigate through the document either by using the table of contents at the left, or by clicking the hyperlinks throughout the document.


1. Introduction

1.1 Abstract

Preventing recidivism, the repeat incarceration of a previously released offender, is an opportunity to break the cycle of incarceration that affects hundreds of thousands of people in the US. Guilford County, North Carolina has implemented a program they hope will reduce recidivism by offering resources and job training to ex-offenders at the time of release. However, this program relies on a knowledgeable staff member to connect ex-offenders to recidivism prevention programs. We believe available data and machine learning techniques can be leveraged to support the decision making process by quantifying a recidivism risk score. In developing the recidivism risk score algorithm, we aim to ensure the algorithm is racially fair, not systematically over- or under-predicting recidivism risk for ex-offenders of different races.

1.2 Motivation

Use Case

Recidivism, or the repeat incarceration of an ex-offender, is a preventable problem. When an ex-offender re-offends, the community and the individual both suffer social and financial loss. Beyond the direct cost of re-imprisonment, about $30,000 per year per inmate (Vera Institute), harder-to-quantify costs such as family separation and lost income drive up the enormous total cost of repeat imprisonment.

Given these high costs, even a marginal reduction in recidivism would yield large personal, family, and community benefits. Programs in New York and Washington State have shown that ex-offender training can produce a notable decrease in recidivism. Guilford County already partners with several local nonprofits that provide programs, such as job training, to ex-offenders upon their release from prison. However, resources are limited and not every released ex-offender can be placed in a program. As a result, the county must decide whom to prioritize when allocating program placements.

Our group hopes to add one more tool to decision-makers’ tool chests, with our recidivism risk prediction app. This app provides decision-makers with recidivism likelihood scores for each prisoner, calculated from the prisoner’s demographic characteristics, criminal history, and behavior in prison. Decision-makers can use these predictions as an additional piece of information when determining which ex-offenders would most benefit from a program slot.

Fairness

We are not the first group to develop a risk prediction algorithm in the criminal justice field, and several previous attempts have met with strong criticism. These attempts, such as the Northpointe company’s COMPAS model, have been critiqued by journalists and researchers, including ProPublica and Dartmouth’s Julia Dressel and Hany Farid, as racially biased, due in part to their propensity to wrongly predict black offenders as repeat offenders at higher rates than white offenders. Ensuring fairness in these predictions is of great importance to prevent unequal treatment based on race. A traditional solution, expressly excluding race from the model, is insufficient because other variables, such as education, residence, and certain crime types, are strongly correlated with race and serve as race proxies in the model. Because such bias can be baked into a predictive algorithm, our task is to ‘open the black box’ to understand the extent of racial bias and how it may affect the allocation of job training resources. We do so by exploring the trade-off in model predictions between maximizing social benefits and maximizing fairness. Furthermore, this model is open-source and accessible. This transparency aims to give decision-makers an understanding of the trade-offs between accuracy and fairness to better inform their decisions when using the algorithm’s predictions.

1.3 Methods in brief

Using a machine learning algorithm trained on an offender’s criminal history, prison behavior, current crime characteristics, and demographics, we generate a risk score that estimates the likelihood that an ex-offender will recidivate. Prior to making predictions, the data provided by Guilford County was cleaned and matched, and new features were engineered. The model was then subjected to cross-validation and assessed for accuracy and fairness. Cross-validation consists of splitting the data set into multiple subsets, fitting the model on a portion of the data, testing it on the remaining portion, and assessing the prediction error. By repeating this process hundreds of times, we can see whether the model predicts consistently on unseen data. Beyond this, we also compare differences in the incorrect predictions by race and sex to see whether the model generalizes well across groups. This is of particular concern, as recent research has found that widely differing base rates of recidivism between groups can severely impact a model’s capability to predict fairly across groups.

1.4 Modeling strategy

To improve the fairness and accuracy of our predictions, this predictive algorithm combines four separate predictive models using a technique called ensembling. The four separate predictions are each organized around a theme: prisoner demographic characteristics (not including race), current crime, past crime and incarceration, and behavior in prison. Each of these four models generates a risk score based on data within that “story”. All four of these predictions are then used as inputs to a final ensemble model, which balances the four component scores into a single recidivism prediction.
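As a sketch of this two-stage structure, the snippet below stubs out the four story sub-models and combines their scores. All function names, feature names, and weights here are hypothetical illustrations; in the actual model each story is a fitted classifier, and a final random forest (not a weighted average) combines the four scores.

```python
# Sketch of the two-stage ensemble: four "story" sub-models each emit a
# risk score, and a final model combines them into one prediction.
# Sub-model logic and weights are illustrative stand-ins.

def demographic_story(inmate):
    # hypothetical stub: higher risk for a younger age category
    return 0.7 if inmate["age_cat"] == "<25" else 0.4

def current_crime_story(inmate):
    return 0.6 if inmate["felonCharge"] else 0.3

def prior_crime_story(inmate):
    return min(1.0, 0.2 * inmate["totalSentenceCount"])

def prison_story(inmate):
    return min(1.0, 0.1 * inmate["totalDisciplineInfractions"])

def ensemble_risk(inmate, weights=(0.2, 0.15, 0.45, 0.2)):
    """Combine the four story scores into a single risk score.
    The real model fits a random forest on these four inputs;
    a weighted average stands in for it in this sketch."""
    scores = (
        demographic_story(inmate),
        current_crime_story(inmate),
        prior_crime_story(inmate),
        prison_story(inmate),
    )
    return sum(w * s for w, s in zip(weights, scores))
```

Because each story only sees its own slice of the data, no single variable can dominate the final prediction, which is the motivation for the ensemble design described above.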

1.5 Data

Our data comes from two sources: the North Carolina State Department of Corrections and the Guilford County Sheriff’s Office. From the State Department of Corrections, we have records of prison inmates since 1921 (with more extensive and reliable coverage since the 1970s). These contain inmates’ demographic characteristics, criminal history, and behavioral record in prison. From the Sheriff’s Office, we have arrest records from 2012 to 2018. These include spatial information about the ex-offender’s place of arrest and home location. While these records contained a wealth of data, joining them was a challenge because offender records in the two databases share no common identifier. To match an inmate with an arrest, we performed a probabilistic join based on name, date of birth, gender, and race.
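A minimal sketch of such a probabilistic join is below. The field names, similarity weights, and 0.85 acceptance threshold are illustrative assumptions, not the project’s actual matching rules.

```python
from difflib import SequenceMatcher

def match_score(prison_rec, arrest_rec):
    """Score how likely two records describe the same person.
    Field names and weights are illustrative; they would be tuned
    against the real DOC and Sheriff's Office schemas."""
    name_sim = SequenceMatcher(
        None, prison_rec["name"].lower(), arrest_rec["name"].lower()
    ).ratio()
    dob_match = 1.0 if prison_rec["dob"] == arrest_rec["dob"] else 0.0
    sex_match = 1.0 if prison_rec["sex"] == arrest_rec["sex"] else 0.0
    race_match = 1.0 if prison_rec["race"] == arrest_rec["race"] else 0.0
    # weight exact identifiers more heavily than fuzzy name similarity
    return (0.4 * name_sim + 0.3 * dob_match
            + 0.15 * sex_match + 0.15 * race_match)

def best_match(prison_rec, arrest_records, threshold=0.85):
    """Return the highest-scoring arrest record, or None below threshold."""
    scored = [(match_score(prison_rec, a), a) for a in arrest_records]
    score, rec = max(scored, key=lambda t: t[0])
    return rec if score >= threshold else None
```

The threshold controls the precision/recall trade-off of the join: raising it excludes dubious matches at the cost of dropping some true ones.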

After extensive wrangling, we developed a sample of roughly 15,000 ex-offenders imprisoned during the last several decades who were rearrested between 2012 and 2018. This sample differs from the total prison population of Guilford County in one significant way. Since only ex-offenders who were re-arrested in Guilford County during this period are included, ex-offenders who were not re-arrested in this area or time frame (or at all) are excluded from the sample. Those who are re-arrested are more likely to be re-incarcerated than those who are not, so the effect is to enrich the sample with repeat offenders. Since base rates of recidivism are higher among blacks and males, the data used for modeling contains more members of these groups than the broader prison population.


2. Exploratory Analysis

To understand the extent and scope of recidivism in Guilford County, we explore some of the characteristics of recidivism and repeat offenders within the county data. By first visualizing the data, we are able to see the historic and spatial patterns of crime and policing.

2.1 Understand the Inmate Profile

(All Inmates in DOC System)

The DOC’s inmate data set contains inmate information dating back to 1921, the starting date of the oldest sentence in the record. The data set covers over 400,000 individuals who were or are currently incarcerated within the state of North Carolina. For this project, recidivism was defined as re-incarceration following release from prison. Using this definition, we identified first-time and repeat offenders and investigated patterns and associations with other available information.

The plots below illustrate the gender and racial composition of inmates incarcerated in North Carolina during the study period.

Males make up nearly 90% of the current and historic prison population. Additionally, over 60% of incarcerated persons are black, even though black residents make up 22% of the state population and 33% of Guilford County residents. This stark difference in base rates of incarceration underscores the importance of assessing the racial fairness of our algorithm.

2.2 What are some of the characteristics of Recidivism?

For each incarceration, the state identifies the most serious offense of that sentence. Below you can see the frequency of the most common offenses. Notably, repeat offenders commit all of these offenses more frequently than first-time offenders.

Statewide, 40% of inmates were repeat offenders; in Guilford County, the share in our data is higher, at 48%. This is comparable to statistics from the Department of Justice, which estimated a nationally representative recidivism rate of 49.7% within three years of release. While the majority of inmates did not recidivate, 21% were incarcerated three or more times.

Understandably, first incarcerations are more common at younger ages. However, repeat offenders tend to be younger at their first incarceration than first-time offenders.

Note: By computing the difference between an inmate’s date of birth and their earliest sentence start date, we calculate their age at the time of first incarceration. We treat this as the onset of the inmate’s criminal activity, disregarding previous arrests or offenses that did not result in prison time.
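The age calculation described in the note can be sketched as follows (the category boundaries come from the data dictionary; the function names are illustrative):

```python
from datetime import date

def age_at_first_incarceration(dob, sentence_starts):
    """Age in whole years at the earliest sentence start date."""
    first = min(sentence_starts)
    years = first.year - dob.year
    # subtract a year if the birthday hadn't occurred yet that year
    if (first.month, first.day) < (dob.month, dob.day):
        years -= 1
    return years

def age_category(age):
    """Bucket into the age bands used in the data dictionary."""
    if age < 25:
        return "<25"
    if age <= 45:
        return "25-45"
    return ">45"
```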

2.3 Spatial Analysis

Using jail population data from 2012-2018 in Guilford County, we were able to identify the home census tract for most Guilford County offenders. This allowed us to examine the spatial distribution of jail-to-jail repeat offenders. To highlight the impact of recidivism, we mapped the ratio of repeat offenders to first-time offenders by census tract. Areas around Greensboro and High Point demonstrated a nearly two-to-one ratio of repeat to first-time offenders, while the more rural (and more white) areas of the county demonstrate a lower recidivism ratio. While base rates of crime in these areas may vary, this pattern is likely the result of systematic over-policing of communities of color combined with low socio-economic opportunity in these areas.

In this map darker colors are associated with a higher proportion of repeat offenders. A Recidivism Ratio of 2 indicates that tract had two repeat offenders for every first time offender during this time period.

Figure: Recidivism Ratio by census tract (legend scale 0.5 to 2.0)
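The recidivism ratio mapped above can be computed per tract as in this sketch (the record layout is a hypothetical simplification of the jail data):

```python
from collections import defaultdict

def recidivism_ratio_by_tract(offenders):
    """Ratio of repeat to first-time offenders per home census tract.
    Each offender is a (tract_id, is_repeat) pair."""
    counts = defaultdict(lambda: {"repeat": 0, "first": 0})
    for tract, is_repeat in offenders:
        counts[tract]["repeat" if is_repeat else "first"] += 1
    return {
        tract: c["repeat"] / c["first"]
        for tract, c in counts.items()
        if c["first"] > 0  # skip tracts with no first-time offenders
    }
```

A ratio of 2.0 corresponds to the two-to-one tracts described above.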

Additionally, we calculated the length of stay for offenders held in the Guilford County Jail between January 2012 and December 2018. Using current estimates for the daily cost to incarcerate, we calculated how much was spent to hold residents from each tract in the county. Not surprisingly, the tracts adjacent to Greensboro and High Point were associated with the highest costs to incarcerate.

Figure: Incarceration costs by census tract (legend scale $500,000 to $2,000,000)



3. Feature Engineering and Model Development

To predict an ex-offender’s risk of recidivating, we built an algorithm using 22 variables related to the ex-offender, their previous crimes, and their in-prison behavior. The final model combines, or ensembles, four individual risk scores into a final risk score for each ex-offender. The four component scores, or stories, are demographic characteristics, current crime characteristics, prior crime characteristics, and in-prison behaviors. The component risk scores were calculated and combined using a random forest machine learning algorithm.


Data Dictionary

Variable Category Symbol Description
Dependent Variable recidivism Recidivate (1), Not Recidivate (0).
Independent Variable Demographic Story sex Dummy variable. Male or Female.
age_cat Age category. <25, 25-45, >45.
ageCatAtFirstIncarceration Age category at first incarceration. <25, 25-45, >45.
Current Crime Story felonCharge Dummy variable. Felony or misdemeanor.
mostSeriousCurrentOffense Most serious current offense type category.
CurrCrimeViolence2 Dummy variable. Violent or Non-violent.
Prior Crime Story totalSentenceCount Total number of sentences recorded across all documents.
prior_Incar_Flag Dummy variable. Incarcerated prior to the current offense or not.
Jail_Charge_TotCnt Total number of charges in the jail record.
juvnileOffensesFlag Dummy variable. Committed an offense as a juvenile or not.
Cnt_Violent Total count of violent crimes in records.
Cnt_NonViolent Total count of non-violent crimes in records.
Cnt_Felony_Tot Total count of felony charges.
Cnt_Misd_Tol Total count of misdemeanor charges.
Prison Story custody_class_code Custody class code category. Security level in prison.
specialCharFlag Dummy variable. Has special characteristics or not (e.g., Split).
inmateSpecialCharacteristic Inmate special characteristic category.
days_served_in_doc_custody Number of days served in DOC custody.
totalDisciplineInfractions Total number of discipline infractions in prison.
latestDiscInfrac Latest discipline infraction type in prison.
TypeOfLastInmateMovement Type of last inmate movement category.


A fifth domain of features, developed to include neighborhood demographics such as median household income, percent of non-white residents, and percent college educated residents, was ultimately excluded from the model due to incomplete data. This explicitly spatial information presented an opportunity to explore whether recidivism varies spatially in a way that is not captured implicitly through the inclusion of individual-specific demographic and criminal history in the four stories of the model. If a story comprised of spatial features such as these added additional predictive power or fairness to our model, this would indicate some information which predicts recidivism varies spatially and is not accounted for by traditional modelling approaches. This would have been a novel contribution to efforts to predict recidivism. We hope that future research with more complete place-based data can delve into this approach more.

3.1 Model Development

To build a predictive model of recidivism risk, we used an iterative process that involved trying several modeling techniques. The final model uses a Random Forest algorithm to generate each of the four component risk scores. These component scores were then combined to produce the overall risk score, or probability of the ex-offender recidivating. This ensemble method limits the predictive strength of any one variable and keeps room in the model for less powerful predictors that may improve the fairness or generalizability of the results.

Here you can see the relative contribution of each sub-model to the predictions as well as the variable importance within the sub-models.

We can see that the prior crimes ‘story’ is the sub-model which most strongly impacts the model’s final predictions. Prison activities, both positive and negative, and demographic variables were also significant contributors to the prediction. Interestingly, though, it seems that details of the current imprisonment contribute the least to our predictions of recidivism.

3.2 Model validation

Before interpreting predictions from this model, it is crucial to ensure that the model is both 1) accurate, and 2) generalizable (i.e. fair) across different races and sexes.

Accuracy is a fairly simple concept: of all ex-offenders, recidivists or not, for how many did the model make the correct prediction? Accuracy is essentially the percentage of the model’s predictions that were correct. Accuracy can be further broken down into the metrics of sensitivity and specificity. Sensitivity is also known as the true positive rate; it is the percentage of all recidivists whom the model correctly predicted to recidivate. Specificity is the true negative rate; it is the percentage of all non-recidivists whom the model correctly predicted to not recidivate. While the broader metric of accuracy is useful for gauging the model’s overall predictive power, sensitivity and specificity provide useful information about the source of that power. For example, a model predicting a rare event may have high accuracy and specificity while having low sensitivity, simply because it is easier to predict the frequent absence of the event than the rare event itself.
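These three metrics follow directly from the confusion-matrix counts defined in the reference table; a minimal sketch:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (TPR), and specificity (TNR) from counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of recidivists caught
        "specificity": tn / (tn + fp),  # share of non-recidivists cleared
    }
```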


Generalizability is somewhat trickier to understand, in large part because the measurement of interest changes with the context. Here, the measurement we focus on is the false negative rate. False negatives are ex-offenders predicted not to recidivate who were later re-incarcerated. In this context, false negatives are ex-offenders who could have received training and other supports but did not; in other words, they were not given resources from which they might have benefited. If the percentage of ex-offenders who fall into this category is larger for one race, then the algorithm is not generalizable to all ex-offenders and may disadvantage some on the basis of race.
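As a sketch, the false negative rate can be computed per group from labeled predictions like so (the record layout is illustrative):

```python
from collections import defaultdict

def false_negative_rate_by_group(records):
    """False negative rate per group, FN / (FN + TP): the share of
    actual recidivists the model missed. Each record is a
    (group, actually_recidivated, predicted_recidivate) triple."""
    fn = defaultdict(int)
    tp = defaultdict(int)
    for group, actual, predicted in records:
        if actual and predicted:
            tp[group] += 1
        elif actual and not predicted:
            fn[group] += 1  # a missed opportunity for intervention
    return {g: fn[g] / (fn[g] + tp[g]) for g in set(fn) | set(tp)}
```

Comparing these per-group rates is the fairness check used throughout section 3.2.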

An easy “eye test” for accuracy can be found in the histogram of predicted probabilities presented below. The histogram separates ex-offenders into buckets based on their predicted likelihood of recidivating. This risk score ranges from extremely unlikely (0-2.5%) on the left to extremely likely (97.5-100%) on the right. The height of each bar represents the number of ex-offenders with that predicted likelihood. If the buckets at the ends of the distribution (representing the extremely unlikely and extremely likely ex-offenders) are tall compared to the middle, this indicates good discrimination between high and low likelihood, though not necessarily high accuracy. Conversely, if the bars in the middle buckets are the tallest, the model did not effectively identify predictors of recidivism or non-recidivism. Our model displays the bi-modal distribution indicative of a high-performing model.

The ROC (receiver operating characteristic) curve is a performance measure for classification problems across possible threshold values. It measures the model’s capability to distinguish between classes; in this case, Recidivate and Not Recidivate. The AUC (area under the curve) summarizes the trade-off between sensitivity and specificity, and a value closer to one (the maximum) indicates better separation of recidivists and non-recidivists. Our final model displayed strong goodness-of-fit, with an AUC of 0.937 and good discrimination in the predicted probabilities. When looking at the ROC curves by race, we see that black and white ex-offenders have very similar curves, indicating comparable performance regardless of the ex-offender’s racial group.
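The AUC can also be read as a rank statistic: the probability that a randomly chosen recidivist receives a higher score than a randomly chosen non-recidivist. A small sketch of that computation (an O(n²) version for clarity, not the implementation used in the project):

```python
def auc(scores_pos, scores_neg):
    """AUC via the Mann-Whitney formulation: the probability that a
    random positive (recidivist) outscores a random negative
    (non-recidivist), counting ties as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```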


To ensure that our model generalizes fairly, we compared the results across sexes and races. The table below lists the statistical measurements we used to examine the model’s performance and fairness.

Reference Table
Statistical Measures Abbreviation Description Calculation
True positive TP Ex-offenders correctly identified as recidivating
False positive FP Ex-offenders incorrectly identified as recidivating
True negative TN Ex-offenders correctly identified as NOT recidivating
False negative FN Ex-offenders incorrectly identified as NOT recidivating
Accuracy Ability to correctly classify recidivists and non-recidivists (TP+TN)/(TP+TN+FP+FN)
Sensitivity Sens Percentage of all recidivists whom the model correctly predicted to recidivate TP/(TP+FN)
Specificity Spec Percentage of all NON-recidivists whom the model correctly predicted to not recidivate TN/(TN+FP)
Note:
Positive for Recidivate, negative for Not Recidivate

Fairness by Race

An ideal model would predict perfectly, with no false positives or false negatives and an accuracy of 100%. Since perfect predictions do not exist, we focus on differences in the wrong predictions between race and sex groups to assess fairness. A false negative prediction represents a possible missed opportunity to prevent recidivism by providing training and resources. If these missed opportunities are not equally distributed across races, the algorithm will generate systematically unfair outcomes. Differences in the true positive and true negative rates, by contrast, reflect real-world patterns of policing and incarceration rather than the ability of the algorithm to predict recidivism.

The chart and histogram below illustrate the prediction characteristics at a threshold of 0.43 (see section 4.1 for threshold selection details).

race Sensitivity Specificity Accuracy AUC
BLACK 0.94 0.88 0.92 0.91
WHITE 0.94 0.89 0.91 0.91

Fairness by Sex

We used the same strategy to evaluate fairness between the sexes. Since the base rate of male recidivism is significantly higher than that of females, we expected a noticeable difference in the true positive and true negative rates across sex. As with the racial groups, the false negative rates for the sexes differed by only 0.1 percentage points, indicating our model is about equally likely to mispredict for males and females.

sex Sensitivity Specificity Accuracy AUC
FEMALE 0.92 0.93 0.93 0.93
MALE 0.94 0.87 0.91 0.91

For both race and sex, we see good agreement between the false negative and false positive rates, with differences of less than 0.2 percentage points. The differences in true positive and true negative results reflect differing base rates between populations, not the fit of the model. Overall, our model performs fairly across race and sex at a threshold of 0.43.

3.3 Cross-validation

Using 100-fold cross-validation, we evaluated the robustness and generalizability of the model. In this process, the available data is divided into 100 parts; the model is trained on 99 of them and tested against the remaining part. This is repeated 99 more times with a new test group each time, ensuring that each hundredth of the data has a chance to be the test set. The narrow span of these graphs indicates that the model is a good fit for these data. However, additional research is needed on the few small outliers that cause spikes in the sensitivity (true positive rate) metric for this model.
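The fold construction underlying k-fold cross-validation can be sketched as below (index-based and unshuffled for clarity; the project’s actual pipeline may shuffle and stratify):

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k disjoint test folds.
    Each fold serves once as the test set; the rest is training data."""
    # spread the remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds
```

With k = 100 this yields the 100 train/test splits described above, each holding out one hundredth of the sample.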


4. Converting predictions into actionable intelligence

Now that we’ve built it… what can be done with it? Well, let’s think about the damage that can be done if we get the predictions wrong.

4.1 Cost Analysis

For this project, we set out to develop a risk score that represents an ex-offender’s probability of recidivating, with fair and generalizable predictions. A score closer to 0 means the ex-offender is less likely to recidivate, while a score closer to 100 means they are more likely to recidivate. Following release, an inmate can have one of two outcomes: recidivate or not recidivate. Our goal is to provide information to help prioritize resources for inmates who are likely to recidivate. To assign each ex-offender a predicted outcome, we must translate our likelihood predictions, which range from 0 to 100%, into a “yes” or “no”. This translation is performed by choosing a threshold above which we classify the ex-offender as predicted to recidivate; anyone below the threshold is classified as predicted to not recidivate. The threshold selected greatly impacts the predictions, and every possible threshold value offers a trade-off between accuracy and cost. One of several ways to select an optimal threshold is to assign a cost to each potential outcome and choose the threshold that minimizes these costs.

Costs associated with each type of prediction, based on reported figures, are included below. Some basic assumptions are:

1. Resources are given to all ex-offenders who are predicted to recidivate, based on the threshold.

2. A 10% reduction in the return-to-prison rate for program participants (New York State Pay-for-Success Program).

3. $36,000/Year/Prisoner: Cost to Imprison (Vera Institute of Justice, 2016)

4. $2,000/Prisoner: Cost of a Program to Support a Recently Released Ex-Offender

5. $10,600/Prisoner: Contribution to Society of a Non-Recidivating Ex-Offender

If everyone who is predicted to recidivate is enrolled in the program, we can calculate the societal cost by summing the costs of each of the four possible outcomes, multiplied by the number of ex-offenders predicted to have that outcome. Since the number of ex-offenders who fall into each outcome varies with the threshold used, we can optimize our model by choosing the threshold that minimizes the overall cost across all four outcomes. The range of thresholds (from 0% to 100%), and our choice of threshold (43%) to minimize costs, can be seen in the plot below.
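The threshold search can be sketched as a sweep that scores every candidate threshold with the per-outcome values from the cost-benefit table and keeps the one with the best net benefit. The function and variable names are illustrative:

```python
# Per-person values from the cost-benefit table (negative = cost):
# TN +$10.6K, TP -$33.3K, FN -$36K, FP +$8.6K.
COSTS = {"tn": 10_600, "tp": -33_300, "fn": -36_000, "fp": 8_600}

def net_benefit(predictions, threshold):
    """Sum the signed value of each outcome at a given threshold.
    Each prediction is a (probability, actually_recidivated) pair."""
    total = 0
    for prob, actual in predictions:
        predicted = prob >= threshold
        if predicted and actual:
            total += COSTS["tp"]
        elif predicted and not actual:
            total += COSTS["fp"]
        elif not predicted and actual:
            total += COSTS["fn"]
        else:
            total += COSTS["tn"]
    return total

def best_threshold(predictions, steps=100):
    """Return the threshold in [0, 1] with the highest net benefit
    (equivalently, the lowest overall cost)."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda t: net_benefit(predictions, t))
```

Run over the full sample, a sweep like this is what selects the 0.43 threshold used in the table below.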

Using this threshold, we calculated the aggregate costs and benefits predicted for all the ex-offenders in a given category. The Count figure is the number of ex-offenders predicted to fall into this category.

Outcome Count Per Person Cost/Benefit Aggregate Cost/Benefit Description
True_Negative 5093 $10.6K $54 Million Inmate predicted not to recidivate and does not recidivate
True_Positive 6930 -$33.3K -$231 Million Inmate predicted to recidivate and does recidivate
False_Negative 429 -$36K -$15 Million Inmate predicted not to recidivate but actually recidivates
False_Positive 679 $8.6K $6 Million Inmate predicted to recidivate but actually does not recidivate
Note:
Ensemble Model - Cost-Benefit Table - Threshold 0.43

Using this threshold, we compared false positive and the more costly false negative results and found that our predicted costs were within 1% of each other.

Additionally, we tuned the model to minimize the difference in false negative rates between groups. This produced a threshold value of 0.02, meaning virtually every inmate would get the training. Providing everyone with the training may be the fairest solution; however, it is certainly not optimal, since inmates with an extremely low risk of recidivating would receive training they may not need, at great expense.


5. Conclusions

It is possible to create a recidivism risk score that predicts equally well for blacks and whites without including race in the model. Additionally, we demonstrated that a cost/benefit equation can be derived to inform the threshold for predicting repeat offense. We hope this work inspires others to build upon this proof of concept and develop transparent predictive algorithms that assess racial, or other, fairness.

Beyond the development of these tools, we have also sought to apply them in a meaningful way. To assist with the decision-making process, we developed a web-based tool to review and prioritize placement of ex-offenders into support programs based on these predictions. To take the application for a test drive with de-identified data, visit: https://bucklerd.github.io/MUSA801_Recidivism/.

For more information related to the data wrangling and modeling, visit the appendices.


Appendix

Model Comparisons

Benefit by Confusion Matrix Type and Threshold Cut Off

Observed Vs Predicted Recidivism By Race

False Negative Rates By Race Plot

Variable Importance Within Kitchen Sink Model

ROC Plots

Regression Result Tables

Result Table
modelNum modelName Description Accuracy Sensitivity Specificity AUC Notes
1 Kitchen Sink Model Logistic Regression 0.8119 0.8345 0.7843 0.894 Kitchen Sink Model
2 GML Ensemble Model Ensemble Logistic Regression; Four Individual Stories. 0.7935 0.8155 0.7676 0.888
3 Random Forest + GML Ensemble Model Random Forest and gml; Four Individual Stories. 0.8440 0.8680 0.8164 0.925
4 Gradient Boosting Model Gradient Boosting Model 0.8225 0.8516 0.7876 0.902
5 Ensemble Model - Spatial Ensemble Random Forest Model - Include Spatial Feature 0.8216 0.8400 0.7949 0.898 Including Spatial features, sample size reduced; Spatial feature is not powerful predictor
6 Ensemble Random Forest Model with Race Ensemble Model Final; Random Forest; Four Individual Stories. 0.8674 0.8751 0.8573 0.940 Race included
7 Ensemble Random Forest Model - Final Model Random Forest; Four Individual Stories; Without Race 0.8568 0.8764 0.8331 0.937 Threshold = 0.43, maximize cost-benefit; threshold = 0.02, maximize fairness