• 1. Introduction
    • 1.1 Abstract
    • 1.2 Motivation
    • 1.3 Methods in brief
    • 1.4 Modeling strategy
    • 1.5 Data
  • 2. Exploratory Analysis
    • 2.1 Understand the Inmate Profile
    • 2.2 What are some of the characteristics of Recidivism?
    • 2.3 Spatial Analysis
  • 3. Feature Engineering and Model Development
    • 3.1 Model Development
    • 3.2 Model validation
    • 3.3 Cross-validation
  • 4. Converting predictions into actionable intelligence
    • 4.1 Cost Analysis
  • 5. Conclusions
  • Appendix


About this Document

This project was produced as part of the University of Pennsylvania Master of Urban Spatial Analytics Spring 2019 Practicum (MUSA 801) with the assistance of instructors Ken Steif and Michael Fichman. Special thanks to Jason Jones, Analytics & Innovation Manager for Guilford County, NC who provided data access and content expertise for this project. Without Jason, Ken and Michael, this project would not have been possible.

This document begins with a case study of predicting recidivism among inmates in North Carolina prisons in partnership with Guilford County, North Carolina. This document presents context on recidivism in Guilford County, the rationale for predicting recidivism, the fairness concerns associated with such an exercise, and an app driven by a predictive algorithm as proof of concept. This is followed by a series of appendices that discuss the data wrangling, data visualization, data sources, feature engineering, and model results in greater depth. Navigate through the document either by using the table of contents at the left, or by clicking the hyperlinks throughout the document.


1. Introduction

1.1 Abstract

Preventing recidivism, the repeat incarceration of a previously released offender, is an opportunity to break the cycle of incarceration that affects hundreds of thousands of people in the US. Guilford County, North Carolina has implemented a program that it hopes will reduce recidivism by offering resources and job training to ex-offenders at the time of release. However, this program relies on a knowledgeable staff member to connect ex-offenders to recidivism prevention programs. We believe available data and machine learning techniques can be leveraged to support the decision-making process by quantifying a recidivism risk score. In developing the recidivism risk score algorithm, we aim to ensure the algorithm is racially fair, not systematically over- or under-predicting recidivism risk for ex-offenders of different races.

1.2 Motivation

Use Case

Recidivism, or the repeat incarceration of an ex-offender, is a preventable problem. When an ex-offender re-offends, the community and the individual both suffer social and financial loss. Beyond the direct cost of re-imprisonment, which is about $30,000 per year per inmate (VERA Institute), harder-to-quantify costs such as family separation and lost income drive up the enormous cost of repeat imprisonment.

Given these high costs, even a marginal reduction in recidivism would result in large personal, family and community benefits. It has been shown in New York and Washington State that an ex-offender training program can result in a notable decrease in recidivism. Guilford County already partners with several local nonprofits which provide programs, such as job training, to ex-offenders upon their release from prison. However, resources are limited and not every released ex-offender can be placed in a program. As a result, the county must decide who to prioritize in allocating program placements.

Our group hopes to add one more tool to decision-makers’ tool chests with our recidivism risk prediction app. This app provides decision-makers with recidivism likelihood scores for each prisoner, calculated from the prisoner’s demographic characteristics, criminal history, and behavior in prison. Decision-makers can use these predictions as an additional piece of information when determining which ex-offenders would most benefit from a program slot.

Fairness

We are not the first group to develop a risk prediction algorithm in the criminal justice field, and several previous attempts have met with strong criticism. These previous attempts, such as the Northpointe company’s COMPAS model, have been critiqued by journalists and researchers, such as ProPublica and Dartmouth’s Julia Dressel and Hany Farid, as racially biased, due in part to their propensity to wrongly predict black offenders as repeat offenders more often than white offenders. Ensuring fairness in these predictions is of great importance to prevent unequal treatment based on race. A traditional solution, to expressly exclude race from the model, is insufficient because other variables, such as education, residence, and certain crime types, are strongly correlated with race and serve as race indicators in the model. While this bias is baked into a predictive algorithm, our task is to ‘open the black box’ to understand the extent of racial bias and how it may affect the allocation of job training resources. We do so by exploring the trade-off in model predictions between maximizing social benefits and maximizing fairness. Furthermore, this model is open-source and accessible. This transparency aims to provide decision-makers with an understanding of the trade-offs between accuracy and fairness to better inform their decisions when using the algorithm’s predictions.

1.3 Methods in brief

Using a machine learning algorithm that draws on an offender’s criminal history, prison behavior, current crime characteristics, and demographics, we generate a risk score that estimates the likelihood that an ex-offender will recidivate. Prior to making predictions, the data provided by Guilford County was cleaned and matched, and features were engineered. The model was then subjected to cross-validation and assessed for accuracy and fairness. Cross-validation consists of splitting the data set into multiple sets, fitting the model on a portion of the data, testing it on the remaining set, and then assessing the error in the prediction. By completing this process hundreds of times, we are able to see whether the model predicts consistently on unseen data. Beyond this, we also compare differences in the incorrect predictions by race and sex to see if the model generalizes well across these groups. This is of particular concern as recent research has found that widely differing base rates of recidivism between groups can severely impact the capability of a model to predict fairly across groups.
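The validation loop described above can be illustrated with a minimal Python sketch. It is not the project's code: the synthetic records, the single `priors` feature, the threshold "model", the group labels `A` and `B`, and the choice of repeated random train/test splits are all assumptions made for illustration. It shows the two checks the text names: consistency of accuracy on held-out data, and a comparison of one kind of incorrect prediction (false positives) between groups.

```python
import random

def make_records(n, seed=0):
    """Synthetic stand-in data (invented): prior incarcerations, group, outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        priors = rng.randint(0, 6)
        # Invented relationship: more priors means a higher chance of recidivating.
        recid = rng.random() < 0.15 + 0.12 * priors
        rows.append({"group": rng.choice(["A", "B"]), "priors": priors, "recid": recid})
    return rows

def accuracy(rows, threshold):
    """Share of rows where 'priors >= threshold' agrees with the observed outcome."""
    return sum((r["priors"] >= threshold) == r["recid"] for r in rows) / len(rows)

def fit_threshold(train):
    """Stand-in model: the prior-count cutoff with the best training accuracy."""
    return max(range(0, 8), key=lambda t: accuracy(train, t))

def false_positive_rate(rows, threshold):
    """Share of non-recidivists who were flagged as high risk."""
    negatives = [r for r in rows if not r["recid"]]
    if not negatives:
        return None
    return sum(r["priors"] >= threshold for r in negatives) / len(negatives)

def repeated_cv(rows, n_rounds=100, test_share=0.25, seed=1):
    """Refit on a random training split each round and score the held-out rest."""
    rng = random.Random(seed)
    accs, fprs = [], {}
    for _ in range(n_rounds):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_share)
        test, train = shuffled[:cut], shuffled[cut:]
        threshold = fit_threshold(train)
        accs.append(accuracy(test, threshold))
        for group in sorted({r["group"] for r in test}):
            members = [r for r in test if r["group"] == group]
            fpr = false_positive_rate(members, threshold)
            if fpr is not None:
                fprs.setdefault(group, []).append(fpr)
    return {
        "mean_accuracy": sum(accs) / len(accs),
        "min_accuracy": min(accs),
        "max_accuracy": max(accs),
        "mean_fpr_by_group": {g: sum(v) / len(v) for g, v in fprs.items()},
    }

results = repeated_cv(make_records(600))
print(results)
```

A narrow gap between the minimum and maximum held-out accuracy suggests the model predicts consistently on unseen data, while a large gap in mean false positive rate between groups would flag a fairness problem worth investigating.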

1.4 Modeling strategy

To improve the fairness and accuracy of our predictions, this predictive algorithm combines four separate predictive models using a technique called ensembling. The four separate predictions are each organized around a theme: prisoner demographic characteristics (not including race), current crime, past crime and incarceration, and behavior in prison. Each of these four models generates a risk score based on data within that “story”. All four of these predictions are then used as inputs for the final ensemble model. The risk score generated by this final ensemble model incorporates and attempts to balance the four thematic predictions.
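As a rough illustration of this two-layer structure, the sketch below fits one small logistic regression per theme and then a final logistic regression on the four resulting scores. Everything specific in it is an assumption: the feature names, the synthetic data, the hand-written gradient-descent fitter, and the choice of logistic regression for both layers (the source does not say which learners were used). The ensemble is fit on rows the theme models did not see, a common precaution against leakage.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability from weights [bias, w1, ...] and a feature list x."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Plain batch gradient descent for logistic regression."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

# Hypothetical features per theme, all scaled to [0, 1]; race is deliberately absent.
THEMES = {
    "demographics": ["age_scaled"],
    "current_crime": ["offense_severity"],
    "criminal_history": ["prior_incarcerations"],
    "prison_behavior": ["infractions"],
}

rng = random.Random(42)

def make_row():
    """One synthetic person; the outcome relationship is invented."""
    row = {name: rng.random() for cols in THEMES.values() for name in cols}
    p = sigmoid(-1.0 + 2.0 * row["prior_incarcerations"]
                + row["infractions"] - row["age_scaled"])
    row["recid"] = 1 if rng.random() < p else 0
    return row

rows = [make_row() for _ in range(400)]
base_rows, meta_rows = rows[:200], rows[200:]

# Layer 1: one model per theme, each seeing only its own columns.
theme_models = {
    theme: fit_logistic([[r[c] for c in cols] for r in base_rows],
                        [r["recid"] for r in base_rows])
    for theme, cols in THEMES.items()
}

def theme_scores(row):
    """The four thematic risk scores for one person."""
    return [predict(theme_models[t], [row[c] for c in cols])
            for t, cols in THEMES.items()]

# Layer 2: the ensemble takes the four scores as its only inputs.
ensemble = fit_logistic([theme_scores(r) for r in meta_rows],
                        [r["recid"] for r in meta_rows])

def risk_score(row):
    return predict(ensemble, theme_scores(row))

print(round(risk_score(rows[0]), 3))
```

The ensemble weights indicate how much each thematic score contributes to the final risk score, which is one way to inspect how the four predictions are balanced.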

1.5 Data

Our data comes from two sources: the North Carolina State Department of Corrections and the Guilford County Sheriff’s Office. From the State Department of Corrections, we have records of prison inmates since 1921 (with more extensive and reliable coverage since the 1970s). These contain inmates’ demographic characteristics, criminal history, and behavioral record in prison. From the Sheriff’s Office, we have arrest records from 2012-2018. These include spatial information about the ex-offender’s place of arrest and home location. While these records contained a wealth of data, joining them was a challenge because offender records in each database had no common identifier. To match an inmate with an arrest, we had to perform a probabilistic join based on name, date of birth, gender, and race.
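One way such a probabilistic join can work is sketched below in Python. This is an illustration under assumptions, not the matching procedure used in the project: the field names, the weights, the acceptance threshold, and the use of `difflib.SequenceMatcher` for name similarity are all hypothetical choices. Each candidate pair receives a score built from fuzzy name similarity plus exact agreement on date of birth, gender, and race, and a match is accepted only above the threshold.

```python
from difflib import SequenceMatcher

# Assumed weights and cutoff; real values would need tuning against known matches.
WEIGHTS = {"name": 0.5, "dob": 0.3, "gender": 0.1, "race": 0.1}
THRESHOLD = 0.85

def normalize(name):
    """Lowercase, drop commas, and sort tokens so 'Smith, John' ~ 'John Smith'."""
    return " ".join(sorted(name.lower().replace(",", " ").split()))

def match_score(inmate, arrest):
    """Weighted agreement between one inmate record and one arrest record."""
    name_sim = SequenceMatcher(
        None, normalize(inmate["name"]), normalize(arrest["name"])
    ).ratio()
    score = WEIGHTS["name"] * name_sim
    for field in ("dob", "gender", "race"):
        if inmate[field] == arrest[field]:
            score += WEIGHTS[field]
    return score

def best_match(arrest, inmates):
    """Return (inmate, score) for the top candidate, or (None, score) if too weak."""
    score, inmate = max(
        ((match_score(i, arrest), i) for i in inmates), key=lambda pair: pair[0]
    )
    return (inmate, score) if score >= THRESHOLD else (None, score)

# Invented example records.
inmates = [
    {"id": "DOC-1", "name": "Smith, John A", "dob": "1980-03-14", "gender": "M", "race": "W"},
    {"id": "DOC-2", "name": "Smith, Joan", "dob": "1975-07-01", "gender": "F", "race": "W"},
]
arrest = {"name": "JOHN SMITH", "dob": "1980-03-14", "gender": "M", "race": "W"}
stranger = {"name": "Maria Lopez", "dob": "1990-11-30", "gender": "F", "race": "H"}

matched, matched_score = best_match(arrest, inmates)
unmatched, unmatched_score = best_match(stranger, inmates)
print(matched["id"] if matched else None, unmatched)
```

Raising the threshold trades missed matches for fewer false links, so the cutoff would normally be checked against a hand-reviewed sample of pairs.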

After a lot of hard work, we developed a sample of roughly 15,000 ex-offenders imprisoned during the last several decades who were rearrested between 2012 and 2018. This sample differs from the total prison population of Guilford County in one significant way. Since only ex-offenders who have been re-arrested in Guilford County in the past half decade are included in our sample, ex-offenders who are not re-arrested in this area or time frame (or at all) are excluded from the sample. Those who are re-arrested are more likely to be re-incarcerated than those who are not re-arrested. The effect of this is to enrich the sample with repeat offenders. Since base rates of recidivism are higher among blacks and males, the data used for modeling contains a larger share of these groups than the broader prison population.


2. Exploratory Analysis

To understand the extent and scope of recidivism in Guilford County, we explore some of the characteristics of recidivism and repeat offenders within the county data. By first visualizing the data, we are able to see the historic and spatial patterns of crime and policing.

2.1 Understand the Inmate Profile

(All Inmates in DOC System)

The DOC’s inmate data set contains inmate information dating back to 1921, which is the starting date of the oldest sentence in this record. This data set contains over 400,000 individuals who were or are currently incarcerated within the state of North Carolina. For this project, recidivism was defined as re-incarceration following release from prison. Using this definition, we identified first-time and repeat offenders and investigated patterns and associations with other available information.

The plots below illustrate the gender and racial composition of inmates incarcerated in North Carolina during the study period.

Males make up nearly 90% of the current and historic prison population. Additionally, over 60% of incarcerated persons are black, even though blacks represent 22% of the state population and 33% of Guilford County residents. This stark difference in base rates of incarceration highlights the importance of assessing the racial fairness of our algorithm.

2.2 What are some of the characteristics of Recidivism?

For each incarceration, the state identifies the most serious offense of that sentence. Below you can see the frequency of the most common offenses. It is interesting to note that repeat offenders commit all of these offenses more frequently than first-time offenders.

Statewide, 40% of inmates were repeat offenders; in Guilford County, this percentage is higher within our data, at 48%. This is comparable to statistics reported by the Department of Justice, which estimated a nationally representative recidivism rate of 49.7% within 3 years of release. While the majority of inmates did not recidivate, 21% were incarcerated 3 or more times.

Understandably, more people are first incarcerated when they are younger than when they are older. Repeat offenders, in particular, tend to be younger when they are first incarcerated.

Note: By computing the difference between an inmate’s date of birth and their earliest sentence start date, we are able to calculate their age at the time of their first incarceration. We treat this as the onset of the inmate’s criminal activity, disregarding previous arrests or offenses that did not result in a prison sentence.
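The calculation in the note above can be sketched in a few lines of Python. The function names and the example dates are hypothetical; the only logic taken from the text is that age at first incarceration is the difference between the date of birth and the earliest sentence start date.

```python
from datetime import date

def age_on(birth, when):
    """Completed years of age on a given date."""
    had_birthday = (when.month, when.day) >= (birth.month, birth.day)
    return when.year - birth.year - (0 if had_birthday else 1)

def age_at_first_incarceration(birth, sentence_starts):
    """Age on the earliest sentence start date among all of a person's sentences."""
    return age_on(birth, min(sentence_starts))

# Invented example: three sentences listed out of chronological order.
birth = date(1975, 6, 15)
sentence_starts = [date(2001, 2, 10), date(1996, 3, 1), date(2010, 9, 9)]

first_age = age_at_first_incarceration(birth, sentence_starts)
is_repeat_offender = len(sentence_starts) > 1
print(first_age, is_repeat_offender)  # 20 True
```

Counting completed years rather than dividing a day difference by 365 avoids off-by-one errors around birthdays and leap years.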

2.3 Spatial Analysis

Using jail population data from 2012-2018 in Guilford County, we were able to identify the home census tract for most Guilford County offenders. This data allowed us to look at the spatial distribution of jail-to-jail repeat offenders. To highlight the impact of recidivism, we mapped the ratio of repeat offenders to first-time offenders by census tract. Areas around Greensboro and High Point show a nearly two-to-one ratio of repeat offenders, while the more rural (and more white) areas of the county show a lower recidivism ratio. While the base rates of crime in these areas may vary, this pattern is likely the result of systematic over-policing of communities of color in combination with low socio-economic opportunity in these areas.

In this map, darker colors are associated with a higher proportion of repeat offenders. A Recidivism Ratio of 2 indicates that a tract had two repeat offenders for every first-time offender during this time period.
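The Recidivism Ratio can be computed per tract along the following lines. This Python sketch rests on assumptions not stated in the source: bookings are represented as hypothetical `(person_id, tract_id)` pairs, a repeat offender is anyone with more than one jail booking in the period, and a person's first recorded tract is treated as their home tract.

```python
from collections import Counter

def recidivism_ratio_by_tract(bookings):
    """Ratio of repeat to first-time offenders for each home census tract."""
    bookings_per_person = Counter()
    home_tract = {}
    for person_id, tract_id in bookings:
        bookings_per_person[person_id] += 1
        # Assumption: the first tract recorded for a person is their home tract.
        home_tract.setdefault(person_id, tract_id)

    repeat, first_time = Counter(), Counter()
    for person_id, n_bookings in bookings_per_person.items():
        bucket = repeat if n_bookings > 1 else first_time
        bucket[home_tract[person_id]] += 1

    ratios = {}
    for tract_id in set(repeat) | set(first_time):
        # A tract with no first-time offenders has an undefined ratio.
        ratios[tract_id] = (
            repeat[tract_id] / first_time[tract_id] if first_time[tract_id] else None
        )
    return ratios

# Invented bookings: tract T1 has two repeat offenders and one first-time offender.
bookings = [
    ("p1", "T1"), ("p1", "T1"),
    ("p2", "T1"), ("p2", "T1"), ("p2", "T1"),
    ("p3", "T1"),
    ("p4", "T2"), ("p5", "T2"),
    ("p6", "T2"), ("p6", "T2"),
]
ratios = recidivism_ratio_by_tract(bookings)
print(ratios["T1"], ratios["T2"])  # 2.0 0.5
```

Tracts with very few offenders can produce extreme or undefined ratios, so a minimum-count filter may be worth applying before mapping.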

[Figure: Recidivism Ratio Plot. Map legend: Recidivism Ratio, scale 0.5 to 2.0]