1. Introduction

1.1 How to Use This Document

This project was produced as part of the University of Pennsylvania Master of Urban Spatial Analytics Spring 2018 Practicum (MUSA 801), instructed by Ken Steif, Michael Fichman, and Karl Dailey. This document begins with a case study of predicting spatial risk of opioid overdoses in Providence, Rhode Island and is followed by a series of appendices that discuss data wrangling, data visualization, data sources, feature engineering, and model results. Navigate through the document either by using the panel at the left, or by clicking the hyperlinks throughout the document.

1.2 Abstract

This project seeks to build a spatial risk model of opioid overdose events for the City of Providence, Rhode Island by examining current overdose locations, community protective resources, risk factors, and neighborhood characteristics. Assigning a level of risk to each area of the city can assist Providence and local stakeholders in strategically allocating resources in a way that will achieve the greatest impact. As of January 2018, Providence is implementing a Safe Stations program, where people struggling with substance abuse can come to any of the City’s 12 fire stations to be connected with supportive services. The spatial risk model will help Providence’s Department of Healthy Communities determine other areas at high risk of overdose events where the City could site additional interventions or supplement their communications efforts.

1.3 Using Safe Stations to Combat the Opioid Epidemic

Opioid overdose events have skyrocketed in recent years, fueling a decline in United States life expectancy for the second year in a row and leading President Trump to declare the crisis a public health emergency. A Centers for Disease Control and Prevention report released in March 2018 found that emergency room visits for suspected opioid overdoses increased 30% between July 2016 and September 2017 (CDC, 2018). More than 115 Americans die every day from opioid overdoses, and Rhode Island has been hit especially hard by the crisis, with a rate of overdose deaths per 100,000 persons significantly higher than the national average.

<b>Figure 1.3a</b>

The crisis has also escalated much faster in Rhode Island compared to the nation, with the rate of overdose deaths per 100,000 persons increasing nearly six-fold between 1999 and 2016.

<b>Figure 1.3b</b>

Since 2014, over 1,000 people in Rhode Island have died from opioid overdoses. Over one quarter of these deaths have occurred in Providence.

<b>Figure 1.3c</b>

In 2017, almost 700 overdose events were recorded in Providence. In the face of such an overwhelming public health crisis, one of the tools that the City of Providence is using to address the opioid epidemic is its new Safe Stations program. Launched in January 2018, the program turns Providence’s 12 fire stations into 24/7 centers where trained staff connect individuals with supportive services. Based on a program created in Manchester, New Hampshire, Safe Stations demonstrates how a network of public, nonprofit, and private actors is working together to address the mounting problem of opioid overdoses.

The premise of Safe Stations is that they are a community-based response that reduces barriers to accessing resources and provides a safe community for people struggling with substance abuse. As can be seen below, because of their nature as emergency response facilities, Providence’s fire stations are relatively dispersed throughout the city’s neighborhoods.

<b>Figure 1.3d</b>

This analysis seeks to determine whether Safe Stations are the assets best located to provide services for those struggling with substance abuse. Assigning an overdose risk level to each area of Providence can be a valuable tool. Knowledge of risk could allow the City and other stakeholders to assess the cost of providing Safe Stations in these locations relative to the potential benefits of opening stations in additional city facilities in high-risk areas, to determine high-risk areas where Safe Stations should be more heavily advertised, and to allocate overdose prevention and harm reduction resources, such as Narcan trainings, needle exchange centers, or safe injection sites, in areas at the highest risk.

Providing citywide spatial information regarding the risk of opioid overdose events empowers the city and nonprofits to continue to make their public health interventions more robust and could serve as a model for cities around the country that are interested in how to implement similar initiatives with the greatest impact.

1.4 Methods

The spatial risk model built for this analysis differs from a common approach to allocating resources: the kernel density map. Kernel density maps, or hotspot maps, interpolate past events across space, suggesting that resources are best allocated where many overdoses have occurred in the past. A kernel density map of Providence’s 2017 overdoses can be seen below:

<b>Figure 1.4a</b>
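
The hotspot baseline can be sketched directly as a Gaussian kernel density surface computed from past event points. The coordinates, bandwidth, and synthetic points below are illustrative assumptions, not the analysis’s actual data or parameters.

```python
import numpy as np

def kernel_density(grid_xy, events_xy, bandwidth=1000.0):
    """Gaussian kernel density of past overdose events, evaluated at a set
    of grid points; higher values mark the 'hotspots'."""
    diff = grid_xy[:, None, :] - events_xy[None, :, :]   # (n_grid, n_events, 2)
    sq_dist = (diff ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2 * bandwidth ** 2)).sum(axis=1)

# Synthetic stand-in for the 2017 overdose points (coordinates in feet)
rng = np.random.default_rng(0)
events = rng.uniform(0, 10_000, (700, 2))
grid = np.stack(np.meshgrid(np.arange(0, 10_000, 500),
                            np.arange(0, 10_000, 500)), axis=-1).reshape(-1, 2)
surface = kernel_density(grid.astype(float), events)
```

Because the surface is built only from past event locations, it can say where overdoses clustered but not why, which is the limitation the spatial risk model addresses.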

In contrast, the spatial risk model predicts overdose events as a function of environmental risk and protective factors that vary across space. The unit of analysis for this model breaks Providence down into a set of 500-foot by 500-foot grid cells, or a “fishnet” (shown in the map below left). The count of overdoses per grid cell serves as the dependent variable in the model (shown in the map below right).

<b>Figure 1.4b</b>

At the fishnet level, it is clear that while the number of overdoses has been increasing in Providence, they remain relatively rare events: only 18% of grid cells contain any of the approximately 700 overdoses recorded in 2017. This distribution of overdoses is taken into account throughout the modeling process. As the intent of this analysis is to identify high-risk locations that are actionable from a policy standpoint, grid cells allow for the determination of risk levels specific to an area the size of a city block. At the same time, the grid cells are large enough that they do not reveal personally identifiable information or skew the distribution of overdoses within the fishnet.

<b>Figure 1.4c</b>
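
The fishnet aggregation can be sketched as follows, assuming projected coordinates in feet; the study-area extent and event points are synthetic stand-ins for Providence’s boundary and overdose data.

```python
import numpy as np

CELL = 500.0  # cell size in feet (500 ft x 500 ft, as in the analysis)

def fishnet_counts(xs, ys, xmin, ymin, nx, ny):
    """Count points per grid cell of a fishnet covering the study area."""
    counts = np.zeros((ny, nx), dtype=int)
    col = ((np.asarray(xs) - xmin) // CELL).astype(int)
    row = ((np.asarray(ys) - ymin) // CELL).astype(int)
    inside = (col >= 0) & (col < nx) & (row >= 0) & (row < ny)
    np.add.at(counts, (row[inside], col[inside]), 1)
    return counts

# Synthetic example: 700 events over a 40 x 40 cell fishnet
rng = np.random.default_rng(0)
xs = rng.uniform(0, 40 * CELL, 700)
ys = rng.uniform(0, 40 * CELL, 700)
counts = fishnet_counts(xs, ys, 0.0, 0.0, 40, 40)
share_nonzero = (counts > 0).mean()  # share of cells with at least one event
```

The `share_nonzero` quantity is the statistic the paragraph above reports: in the actual 2017 data, only 18% of cells contained any overdose.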

Independent variables are also included at the fishnet level. Therefore, the final dataset includes a series of characteristics about each 500-foot-square area of the city, along with a count of how many overdoses have occurred there. For the purposes of this analysis, the independent variables were split into three categories, with the hypothesis that each would have a different relationship with overdoses:

  • Community protective resources, such as hospitals or parks, which are amenities where other public health interventions could possibly be sited
  • Risk factors, such as crime or 311 complaints
  • Neighborhood-level characteristics, such as neighborhood boundaries or census information

The relationship of each of these three categories with overdose events is explored in depth in the explanatory analysis and model building sections of this document.

1.5 Modeling Strategy

To create the spatial risk model, we began with an extensive data wrangling process (described further in the data wrangling appendix). The overdose location data used in the model was provided through a tremendous effort by the City of Providence. The data was pulled from records of emergency medical services responses, which did not contain an indicator of whether or not the event was an overdose. Therefore, clinicians examined data from 2017 and pulled together a list of approximately 700 events that they considered to be overdoses. This created a one-of-a-kind dataset exclusively for this analysis. As this dataset contains health information, overdoses are always shown at an aggregated level to ensure the protection of personal privacy.

As with any data, there is potential for error in terms of whether all overdose events were captured and whether all recorded events were in fact overdoses. Additionally, several address entry errors had to be manually geolocated to ensure accuracy, another potential source of error. Furthermore, it is worth noting that this is a small sample, due both to the size of Providence (the city’s population is 179,000) and to the fact that the data was collected only for 2017. While this analysis is worth conducting, the sample size precludes certain tests of model generalizability, such as testing across years. These limitations will be discussed throughout the analysis as relevant, but are necessary to mention at the outset.

In addition to overdose events, the data wrangling process included the collection of a wide variety of data regarding community protective resources, risk factors, and neighborhood characteristics in the City of Providence (listed in the data sources appendix). Following the data collection, we undertook a feature engineering process to extract different types of predictive power from the variables, and all of the features are listed in the feature engineering appendix. Once the entire dataset of overdoses and features had been created, we conducted an explanatory analysis process to determine which variables were correlated with overdoses.

With an understanding of which variables were correlated with overdose events, we constructed different predictive models, informed by machine learning techniques. We then conducted a validation process to ensure that the final model was generalizable across the different areas of Providence. The modeling process and model results are described in detail in the model building section and the model results appendix.

1.6 Data

In addition to the overdose count dependent variable, several data sources were used to determine risk factors and protective assets in Providence, including:

  • Opioid Overdose Locations
  • Community Protective Resources
  • Risk Factors
  • Neighborhood Characteristics

The appendices below include a data dictionary with a full list of variables and their sources; a data wrangling appendix that explains how the data was compiled, cleaned, and feature engineered; a list of the final features; a data visualization appendix that explains how to make basic graphics for this analysis; and a table that lists the results from each model.

2. Exploratory Analysis

2.1 Feature Engineering

To begin the exploratory analysis process, we constructed a dataset through an extensive data wrangling and feature engineering process (described in the data wrangling appendix). Each of the Community Protective Resource and Risk Factor variables was feature engineered to extract different levels of explanatory power. An example can be seen below in Figure 2.1a, which shows drug crimes in the past 180 days, distance to the nearest drug crime, average distance to the three nearest drug crimes (or “nearest neighbor”), and density of drug crime. The distance, nearest neighbor, and density variables are each included in the fishnet grid cells as potential independent variables in the modeling process. While each variable measures the same drug crime points, they measure spatial distribution differently. Distance, density, and nearest neighbor maps appear throughout the exploratory analysis.

<b>Figure 2.1a</b>
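
A minimal sketch of these three re-expressions of a single point layer, computed per fishnet cell centroid; the coordinates, bandwidth, and `drug_crimes` points are hypothetical.

```python
import numpy as np

def dist_matrix(cells, points):
    """Pairwise distances from grid-cell centroids to event points."""
    diff = cells[:, None, :] - points[None, :, :]
    return np.hypot(diff[..., 0], diff[..., 1])

def engineer_features(cells, points, bandwidth=1000.0):
    """Three re-expressions of one point layer per grid cell: distance to
    the nearest point, average distance to the 3 nearest ("nearest
    neighbor"), and a Gaussian kernel density."""
    d = dist_matrix(cells, points)
    nearest = d.min(axis=1)
    knn3 = np.sort(d, axis=1)[:, :3].mean(axis=1)
    density = np.exp(-(d / bandwidth) ** 2).sum(axis=1)
    return nearest, knn3, density

rng = np.random.default_rng(1)
cells = rng.uniform(0, 10_000, (100, 2))        # fishnet centroids (feet)
drug_crimes = rng.uniform(0, 10_000, (40, 2))   # hypothetical event points
nearest, knn3, density = engineer_features(cells, drug_crimes)
```

Each array becomes a candidate column in the fishnet dataset; the modeling process then chooses which re-expression of each layer carries the most predictive power.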

2.2 Are Fire Stations the Best-Located Amenities to Combat the Opioid Crisis?

The Safe Stations program has been implemented at Providence’s 12 fire stations. The central question of this analysis is whether these Safe Stations are the most efficient locations to connect those struggling with substance abuse to supportive services, or whether the program should be implemented at additional city facilities. To examine this question, we first evaluated existing Safe Stations’ proximity to overdose locations. The map below shows where overdoses are occurring within a quarter-mile of a fire station, a distance people could walk in less than five minutes to receive aid.

<b>Figure 2.2a</b>

As can be seen in the bar chart below, some stations experience considerably more overdoses within a quarter-mile than others. Engine 8, located in West End, had over 20 overdoses occur within a quarter-mile in 2017. In contrast, neighboring Engine 11 near the Reservoir neighborhood experienced one overdose within a quarter-mile. This metric can inform which stations may need additional resources.

<b>Figure 2.2b</b>

Counting the number of overdoses within a quarter-mile of fire stations demonstrates which Safe Stations may experience the most activity. However, it does not show whether fire stations are the best locations for Safe Stations within the entire city of Providence. We defined “best location” as a place where a relatively high number of opioid overdoses are occurring. We developed a hypothesis test that asks whether the average distance from fire stations to overdoses is shorter than one would expect from random chance alone. Figure 2.2c below visualizes the distribution of 999 randomly permuted sets of 12 points (simulating alternative sets of fire stations) and their corresponding distances to the nearest 10 observed overdose events. The vertical line marks the actual distance between Providence’s 12 fire stations and their 10 nearest overdose events. 63.7% of randomly generated point sets were closer to overdoses than the actual fire station locations, suggesting that Providence could locate Safe Stations, or supplementary efforts, closer to where overdoses are actually occurring.

<b>Figure 2.2c</b>
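
The hypothesis test can be sketched as follows: place 999 random sets of 12 points, compute each set’s average distance to its 10 nearest overdoses, and report the share of random sets that beat the observed stations. All coordinates below are synthetic.

```python
import numpy as np

def mean_dist_to_k_nearest(stations, events, k=10):
    """Average distance from each station to its k nearest overdose events."""
    diff = stations[:, None, :] - events[None, :, :]
    d = np.hypot(diff[..., 0], diff[..., 1])
    return np.sort(d, axis=1)[:, :k].mean()

def permutation_test(stations, events, bounds, n_perm=999, seed=0):
    """Share of randomly placed station sets that sit closer, on average,
    to their 10 nearest overdoses than the observed stations do."""
    xmin, ymin, xmax, ymax = bounds
    rng = np.random.default_rng(seed)
    observed = mean_dist_to_k_nearest(stations, events)
    closer = 0
    for _ in range(n_perm):
        fake = np.column_stack([rng.uniform(xmin, xmax, len(stations)),
                                rng.uniform(ymin, ymax, len(stations))])
        if mean_dist_to_k_nearest(fake, events) < observed:
            closer += 1
    return closer / n_perm

rng = np.random.default_rng(1)
events = rng.uniform(0, 10_000, (700, 2))     # synthetic overdose points
stations = rng.uniform(0, 10_000, (12, 2))    # synthetic fire stations
share_closer = permutation_test(stations, events, (0, 0, 10_000, 10_000))
```

In the analysis, the returned share was 63.7%, meaning the observed stations were closer to overdoses than only about a third of random placements.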

2.3 Is Proximity to Community Protective Resources Associated with Overdose Events?

As mentioned above, the independent variables in this analysis were divided into three categories. The first of these categories is Community Protective Resources, which are considered to be neighborhood amenities, many of which could support additional public health interventions. These resources include locations such as hospitals, libraries, schools, and bus stops. The full list of amenities can be found in the feature appendix below. The hypothesis was that areas containing higher numbers of these community protective resources would have a lower number of opioid overdoses. However, this analysis demonstrates that, in reality, the strength of the relationship varies by resource, and that distance to these resources tends to be negatively correlated with overdoses; in other words, overdoses tend to be denser near community protective resources.

This section will discuss variables that ultimately proved to be useful in the model building process. While the dependent variable for this analysis is the count of overdose events, it is easier to explore and visualize correlation patterns when count has been recoded to a spatial density and, therefore, correlation plots below use the logged density of overdoses as a proxy for overdose counts.

One set of community resources that this analysis examines is education facilities, testing the correlation between proximity to different types of facilities and overdose events. As can be seen below, both private schools and colleges are clustered in College Hill, as well as the Downtown area of the city. The scatter plots below demonstrate that proximity to all three types of schools is correlated with a higher density of overdoses; that is, overdose density increases closer to an educational facility. However, public school proximity has a slightly higher correlation with overdoses.

<b>Figure 2.3a</b>

Proximity to public transit is another variable that was considered in the model. In this case, public transit is represented by RIPTA bus stop locations. As can be seen below, distance to bus stops is also negatively correlated with overdoses, meaning that overdose density is higher near bus stops.

<b>Figure 2.3b</b>

Another community resource to note is food vendors. We wanted to examine whether supermarkets and SNAP vendors differ in their correlation with the density of overdoses, with the hypothesis that areas less well-served by supermarkets would have a higher number of overdoses. As can be seen below, there are far fewer supermarkets in Providence than SNAP vendors, and they are primarily concentrated in the northern portion of the city. SNAP vendors, in comparison, are spread throughout the city, with a strong presence along commercial corridors in the central and southern portions where supermarkets are absent. Examining the correlations of these amenities with overdoses shows that both have a negative correlation between distance and the density of overdoses. However, SNAP vendors have a more strongly negative correlation, indicating that proximity to a SNAP vendor is a stronger indicator of a higher number of overdoses than proximity to a supermarket.

<b>Figure 2.3c</b>

The correlation between distance to parks and overdose incidents was also explored. Given that parks serve as one of the best measures of public gathering spaces, the hypothesis was that proximity to parks may be associated with a higher number of overdose events. Correlation tests supported this hypothesis: the density of overdoses increases closer to parks in Providence. However, this correlation was weaker than that of any other community resource mentioned in this section.

<b>Figure 2.3d</b>

While the hypothesis at the outset of the exploratory analysis process was that areas with a greater number of community resources would have experienced fewer overdose events in 2017, that is not actually the case. While the correlation between each community resource and overdoses varies, proximity to all community resources tends to suggest a higher density of overdoses.

2.4 Is Proximity to Risk Factors Associated with Overdose Events?

The second category of variables that this analysis examines is risk factors, with the hypothesis that areas containing greater numbers of risk factors would experience higher numbers of overdoses. To explore this hypothesis, we looked at three categories of risk factors: alcohol & tobacco vendors, 311 complaint calls, and crime. Our hypothesis was largely confirmed across the risk factors we examined, and risk factors tended to have a greater correlation with overdose density than community resources did.

2.4.1 Alcohol & Tobacco Vendors

Based on past research indicating that alcohol and tobacco vendors are associated with higher rates of crime, we wanted to explore whether proximity to these vendors in Providence was associated with higher overdose density. Using business license data, we pulled the locations of tobacco vendors and liquor licenses in Providence. The maps below show that tobacco vendors tend to cluster in the southwest area of the city, while liquor licenses are tightly clustered in the center of the city. Scatterplots illustrate a similar relationship for each of these variables: higher numbers of overdoses occur in close proximity to both tobacco vendors and liquor licenses and decrease in number farther away.

<b>Figure 2.4.1a</b>

2.4.2 311 Data

The City provided data regarding 311 complaints that served as the basis for several risk factor categories, such as vacant buildings, abandoned vehicles, streetlight outages, graffiti, trash, and overgrown lots. These variables largely serve as an indicator of neighborhood conditions and levels of investment in an area. Therefore, the hypothesis was that areas with a higher number of complaints would experience a greater number of overdoses. Due to the extensive nature of the 311 call dataset from the City’s open data site (shown below), we grouped this data into broader categories to better extract meaning from the data (listed in the feature appendix).

<b>Figure 2.4.2a</b>

In Figure 2.4.2b below, we have highlighted four variables that ultimately proved to be useful in the modeling process due to their correlation with overdose incidents: abandoned buildings, abandoned cars, code violations, and overgrowth. The positive relationships between the density of these variables and overdoses suggest that greater numbers of overdose events are happening in areas with greater numbers of 311 complaints. The strongest correlations with overdoses are those of code violation complaints and abandoned building complaints, suggesting that disinvestment in the built environment has a strong relationship with overdose events.

<b>Figure 2.4.2b</b>

Figure 2.4.2c below highlights these same four variables, comparing their density and nearest neighbor (average distance to the three nearest complaints) forms. These spatial differences in the features were considered throughout the modeling process.

<b>Figure 2.4.2c</b>

2.4.3 Crime

A third category of risk factors that we examined was crime data, with the hypothesis that overdoses would be strongly correlated with areas where other crimes were occurring. Specifically, we evaluated drug crimes (removing low-level, marijuana-related crimes) and assaults. The density maps below show that these crimes tend to be clustered toward the center of the city. Correlation tests confirmed our hypothesis that for both drug crimes and assaults, as the density of crime events increases, so does the number of overdose events. Furthermore, the correlation between these types of crime and overdose density was the highest by far of any variable that we examined.

<b>Figure 2.4.3a</b>

2.5 Are Certain Neighborhood Characteristics Associated with Overdose Events?

2.5.1 Are Certain Neighborhoods Experiencing a Greater Number of Overdose Events?

<b>Figure 2.5.1a</b>

The third category of data explored in this analysis related to broader neighborhood characteristics, such as geographic boundaries and demographic information. We first wanted to explore if overdoses are spatially concentrated in particular neighborhoods. The above bar plot shows that overdose events are not evenly distributed throughout the city. Certain neighborhoods, such as West End, Downtown, and Wanskuck see more overdoses than other areas of the city. However, overdoses are not isolated to a single part of the city and are taking place across many neighborhoods. The inclusion of neighborhood variables in the final model was important in order to capture some of the variation in overdose events across space.

<b>Figure 2.5.1b</b>

2.5.2 Are Demographic Characteristics of Neighborhoods Associated with the Number of Overdose Events?

Another measure of neighborhood characteristics is demographic data from the Census, which we included at the block group level. The maps below illustrate the citywide distribution of population density, median household income, unemployment rate, and poverty rate. The accompanying scatterplots show some association between the count of overdoses and these demographic variables; more overdoses occur in areas with higher population density, lower median household income, higher unemployment rates, and higher percentages of families living in poverty. However, population density and median household income have a much higher correlation with overdose events than the other two variables. The equity implications of using economic variables were considered throughout the model building process.

<b>Figure 2.5.2a</b>

3. Model Building

The extensive data wrangling, feature engineering, and exploratory analysis processes informed the creation of a model to predict the risk of overdose events across the city of Providence. The following section will outline the process of selecting features and constructing models, validating models, and converting predictions into actionable intelligence.

3.1 Feature Selection & Model Development

Just as we divided the exploratory analysis into three categories of variables that tell different stories, we structured our model building process in a similar manner. We wanted to explore how well each of these ‘stories’ predicted overdose risk on its own. By building separate spatial risk models for community protective resources, risk factors, and neighborhood characteristics, we sought to capture the predictive power of each story before combining them into a final ensemble model that harnesses the predictive power of all three.

To account for the fact that overdoses are rare events, and therefore have not occurred in the majority of grid cells in the city, Poisson and Negative Binomial models were explored for each variable category. The variable selection process for each of the three models was informed by machine learning processes that identified the variables with the greatest predictive power across many iterations. The results of the top performing models are explored in further detail below. This process, and the results of all tested models, are further detailed in the modeling appendix.
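
A minimal version of the Poisson count model can be sketched with Newton-Raphson iterations on the log-likelihood; in practice a statistics library offering Poisson and negative binomial GLMs would be used, and the fishnet data below are synthetic.

```python
import numpy as np

def fit_poisson(X, y, n_iter=50):
    """Minimal Poisson regression via Newton-Raphson.
    X includes an intercept column; returns fitted coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)             # expected overdose count per cell
        grad = X.T @ (y - mu)             # gradient of the log-likelihood
        hess = X.T @ (X * mu[:, None])    # Fisher information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Synthetic fishnet: rare counts driven by a single risk feature
rng = np.random.default_rng(3)
risk = rng.normal(size=2000)
X = np.column_stack([np.ones_like(risk), risk])
y = rng.poisson(np.exp(-1.0 + 0.5 * risk)).astype(float)
beta = fit_poisson(X, y)
```

The log link keeps predicted counts nonnegative, which suits a dependent variable that is zero in most grid cells; the negative binomial variant additionally models overdispersion.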

3.1.1 Community Protective Resources Model

The community protective resources model demonstrated that SNAP vendors and bus stops provided the strongest predictive power within this category of variables. The below graph shows the variables included in the final community protective resources model, in order of their relative predictive importance in the model.

<b>Figure 3.1.1a</b>

The model containing the variables shown above was informed by a machine learning process called XGBoost that determined the relative importance of each community protective resource variable across many iterations. For variables with multiple feature engineered forms (distance, density, and nearest neighbor), only the nearest neighbor features were included to avoid duplication. The final model was then honed to include only features that the machine learning process demonstrated to be important.

<b>Figure 3.1.1b</b>

Each model created in this analysis is fit on a random subset of 60% of the city’s fishnet grid cells (a “training set”), allowing its accuracy to be tested on the 40% of grid cells that were set aside (the “test set”). Residuals are the differences between the observed number of overdoses in a test set grid cell and the number of overdoses that the model predicted for that grid cell; the mean absolute error averages the absolute value of these residuals across the test set. The graph below shows the distribution of residuals for the community protective resources model, which are largely clustered between 0 and 1 overdose event. The mean absolute error for this model is 0.4234 overdose events.

<b>Figure 3.1.1c</b>
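
The 60/40 split and error calculation can be sketched as below; `mean_model` is a hypothetical stand-in for any of the fitted count models.

```python
import numpy as np

def split_and_mae(X, y, fit, train_frac=0.6, seed=0):
    """Fit on a random 60% of fishnet cells, report the mean absolute
    error of the residuals on the held-out 40%."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    train, test = idx[:cut], idx[cut:]
    predict = fit(X[train], y[train])          # fit returns a predictor
    residuals = y[test] - predict(X[test])     # observed minus predicted
    return np.abs(residuals).mean()

# Hypothetical stand-in model: predict the training-set mean everywhere
def mean_model(X_train, y_train):
    m = y_train.mean()
    return lambda X_new: np.full(len(X_new), m)

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
y = rng.poisson(0.4, 1000).astype(float)
mae = split_and_mae(X, y, mean_model)
```

Because most cells contain zero overdoses, even a naive model achieves a small-looking MAE, which is why the models are also compared against each other and against the kernel density baseline.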

3.1.2 Risk Factors Model

In the final risk factors model, assaults provide the strongest predictive power, followed by 311 code violation complaints and drug crimes.

<b>Figure 3.1.2a</b>

The machine learning process also showed assaults to have, by far, the strongest predictive power.

<b>Figure 3.1.2b</b>

Similar to the community protective resources model, the residuals for the risk factors model are clustered between 0 and 1. The mean absolute error for this model is 0.4155 overdose events, lower than that of community protective resources.

<b>Figure 3.1.2c</b>

3.1.3 Neighborhood Characteristics Model

Assessing the variable importance of the neighborhood characteristics model demonstrated that the West End, Valley, and Silver Lake neighborhoods had, by far, the most importance in the model. The remaining variables in this model were either insignificant or carried much less weight relative to the top three neighborhoods.

<b>Figure 3.1.3a</b>

The machine learning process shows that, on its own, price per acre provides the strongest predictive power. This pattern likely does not emerge in the full neighborhood characteristics model because holding neighborhood constant within the model controls for much of the variation in census-based and land value variables.

<b>Figure 3.1.3b</b>

Residuals for the neighborhood model are also clustered between 0 and 1. The model shows a mean absolute error of 0.4486, higher than the errors for the community protective resources and risk factor models.

<b>Figure 3.1.3c</b>

3.1.4 Ensembling Models

Since each of the three models described above tells a different story about predicting overdose risk in Providence, we combined them into a single model through an ensembling process aimed at reducing the overall error. This section describes the process of comparing the predictions and errors of the three stories and using them to create a final ensemble model.

The outcome of each of the three models described above was a predicted risk map of overdose events across the City of Providence, shown below. In the prediction maps, it is clear how the variables in each model strongly inform its predictions. For example, in the community protective resources model, the bus routes are evident and in the neighborhood model there are hard boundaries between the different neighborhoods. Furthermore, the predictions for all models range between 0 and roughly 4 overdose events per grid cell, while the actual number of overdoses per grid cell in the city ranges from 0 to 22. However, risk predictions are better considered on a relative scale, indicating high- and low-risk areas rather than predicting a specific number of events.

<b>Figure 3.1.4a</b>

As described above, each model was trained on 60% of the fishnet grid cells, allowing for its prediction accuracy to be tested on the remaining 40% of the grid cells. The below set of maps shows the residuals, or errors, that each model produced when used to predict on the test set. Since these maps only include the 40% test set of the fishnet grid cells, the full city is not represented. The residuals range from 0 to roughly 20, a much greater range than the predictions themselves. However, the graphs above of the distribution of residuals across all models show that very few grid cells have errors over 3. Similar to the predictions maps, there are variations in errors across the different models; the risk factors model has lower errors in the city’s outer neighborhoods, and is closely followed by the community protective resources model. In contrast, the neighborhood characteristics model has much higher errors in the city’s outer neighborhoods and in the northern half of the city, likely because it contains neighborhood boundaries as a variable, which can be arbitrary at times.

<b>Figure 3.1.4b</b>

The discrepancies in the above maps demonstrate that each of these three models captures different spatial risk patterns. Combining these models into a single ensemble model improved their predictive power and decreased the overall error. To create the ensemble model, we ran each model 100 times with randomized training and test sets, producing an average prediction for each grid cell. Averaging the results of many versions of a model reduces the error within each grid cell. We then built a model containing these averaged predictions from each of the three models. Ultimately, the process minimizes the error from each of the three models and combines them into a single, larger model.
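
The averaging-and-combining steps can be sketched as follows; the base predictor is a deliberately trivial stand-in for the fitted story models, and the least-squares combination is one simple way to stack the three stories’ averaged predictions.

```python
import numpy as np

def averaged_predictions(X, y, fit_predict, n_runs=100, train_frac=0.6, seed=0):
    """Average each grid cell's held-out prediction over many random
    train/test splits (the analysis used 100 runs per model)."""
    rng = np.random.default_rng(seed)
    sums, hits = np.zeros(len(y)), np.zeros(len(y))
    for _ in range(n_runs):
        idx = rng.permutation(len(y))
        cut = int(train_frac * len(y))
        train, test = idx[:cut], idx[cut:]
        sums[test] += fit_predict(X[train], y[train], X[test])
        hits[test] += 1
    return sums / np.maximum(hits, 1)          # avoid dividing by zero

def stack(preds, y):
    """Combine the stories: regress observed counts on each story's
    averaged predictions and return the combined prediction."""
    Z = np.column_stack([np.ones(len(y))] + list(preds))
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return Z @ coef

def base_model(X_train, y_train, X_new):
    # Trivial stand-in: predict the training-set mean everywhere
    return np.full(len(X_new), y_train.mean())

rng = np.random.default_rng(2)
y = rng.poisson(0.4, 500).astype(float)
X = rng.normal(size=(500, 3))
p1 = averaged_predictions(X, y, base_model, seed=10)
p2 = averaged_predictions(X, y, base_model, seed=11)
p3 = averaged_predictions(X, y, base_model, seed=12)
final = stack([p1, p2, p3], y)
```

Stacking rewards whichever story's averaged predictions track the observed counts best, which is why weakly correlated input models add more value than redundant ones.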

A successful ensemble model requires that the input models not be strongly correlated with one another. Therefore, each model’s variables were culled in order to reduce correlation of variables within the three different models. The final ensemble model contains the following variables:

  • Community Protective Resources: distance to the 3 nearest SNAP vendors, distance to bus stops, distance to parks
  • Risk Factors: distance to the 3 nearest assault events, distance to code violations, distance to the 3 nearest overgrowth complaints
  • Neighborhood Characteristics: neighborhood boundaries

As can be seen in the graph below, the risk factors model provides the strongest predictive power in the final model, followed by the community protective resources model. The neighborhood characteristics model contributed the least predictive power, which is consistent with the higher level of error it contained on its own.

<b>Figure 3.1.4c</b>


Combining the three stories reduced the model’s overall error, lowering the mean absolute error to 0.415. A map of the ensemble model’s predictions shows that the combined model captures more of the nuance in the higher-risk area toward the center of the city, with predictions ranging from 0-8 compared to 0-3 for the individual stories. The map of residuals, again shown on the test set containing 40% of the data, shows that the range of errors has decreased to 0-17, compared with a high of 20 in the previous models.

<b>Figure 3.1.4d</b>


3.2 Model Validation

We validated this model through several measures of goodness of fit, which are explained in this section. First, we compared the results of the spatial risk model to a traditional kernel density map. Then, we conducted different cross-validation and spatial cross-validation tests to ensure that the model is generalizable, both on different fishnet test and training sets and across different areas of the city. The process that we used to conduct these tests is further detailed in the modeling appendix below.

3.2.1 Comparing to Kernel Density

As described earlier, kernel density maps, or hotspot maps, identify higher-risk areas of the city based on where overdose events have occurred in the past. It is important to evaluate if our model performs better than this traditional approach so that it can be used to inform and improve the way resources are allocated to combat the opioid crisis. The below graphs compare each of the models described above to a traditional kernel density approach by putting their predictions on the same scale and applying “risk levels” to that scale. Then, actual overdoses that occurred within the test set are counted based on which risk level they fall into. For this use case, the aim is to capture points in both of the top two risk levels; if all of the test set points were in the 90-100% risk category, the model would be overfit on the limited data that Providence has for 2017.
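
The comparison behind these graphs can be sketched as follows: scale each model's predictions to percentiles, cut them into risk levels, and sum the observed test-set overdoses falling in each level. The function name and the exact cutoffs below are illustrative assumptions:

```r
# Count observed overdoses captured in each prediction-based risk level
risk_capture <- function(pred, observed_counts) {
  pct <- ecdf(pred)(pred) * 100                     # percentile of each cell's prediction
  level <- cut(pct, breaks = c(0, 30, 50, 70, 90, 100),
               labels = c("0-29%", "30-49%", "50-69%", "70-89%", "90-100%"),
               include.lowest = TRUE)
  tapply(observed_counts, level, sum)               # overdoses captured per risk level
}
```

The same binning applied to a kernel density surface gives the baseline against which each model is compared.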

As can be seen in the below graphs, while the first three models predict only marginally better than the kernel density approach, the final ensemble model performs noticeably better in both the 70-89% and 90-100% risk categories.

<b>Figure 3.2.1a</b>


3.2.2 Cross Validation

Cross-validation is important because it determines whether the model predicts well on many different subsets of the data, not just the initial test set. The cross-validation process involved splitting the fishnet into 100 random test sets and measuring average error for each model. The below histograms show the distribution of errors across these 100 random test sets for each of the four models. For all models, the error clusters below 0.5 overdose events. As with the distribution of errors shown for each model above, there are a few significantly larger outliers, likely accounting for the discrepancy between the prediction values and the areas with higher counts of overdoses. The histogram for the ensemble model shows that it has a higher number of test sets with a mean absolute error below 0.5 overdoses, and the value of its outlier errors is lower than that of the other models. This distribution of error indicates that across many subsets of the fishnet grid cells, the ensemble model performs with the least amount of error.
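
A sketch of the loop that produces such a distribution of errors; `lm()` again stands in for the project's actual model, and `Count` is the assumed response column:

```r
# For each of n_runs random 60/40 splits, fit on the training cells and
# record the mean absolute error on the held-out cells
cv_mae <- function(data, formula, n_runs = 100, train_frac = 0.6) {
  replicate(n_runs, {
    train_idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
    fit  <- lm(formula, data = data[train_idx, , drop = FALSE])
    test <- data[-train_idx, , drop = FALSE]
    mean(abs(predict(fit, newdata = test) - test$Count))  # MAE on the 40% test set
  })
}
# hist() of the returned vector gives the distribution of errors across splits
```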

<b>Figure 3.2.2a</b>


3.2.3 Spatial Cross Validation

Throughout the modeling process, the four models have been trained on random subsets of 60% of the city’s fishnet grid cells and tested on the remaining 40% of grid cells. However, testing the model on randomly selected grid cells in Providence is not enough to determine the generalizability of the model; the model also must be tested in different parts of the city, a process called spatial cross-validation. How well does the model predict risk in parts of Providence that have different characteristics? If there are discrepancies between neighborhoods, it is important to be aware of them when implementing any policies based on the model. The spatial cross-validation process involved running the models several times with each of Providence’s census tracts as a test set, predicting on that tract, and calculating the prediction error for each tract. The prediction error in this case was calculated by normalizing the average error for each census tract by the mean count of overdoses per grid cell in the tract. This was determined to be the best measure of spatial autocorrelation of errors for this use case and the process is further described in the modeling appendix.
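
The leave-one-tract-out procedure might be sketched as follows. The `tract` and `Count` column names and the `lm()` call are illustrative assumptions; tracts with no overdoses would need special handling before normalizing:

```r
# Hold out each census tract in turn, predict on it, and normalize the
# tract's MAE by its mean overdose count per grid cell
spatial_cv <- function(data, formula) {
  sapply(unique(data$tract), function(tr) {
    train <- data[data$tract != tr, ]               # train on every other tract
    test  <- data[data$tract == tr, ]               # hold out one census tract
    fit   <- lm(formula, data = train)
    mae   <- mean(abs(predict(fit, newdata = test) - test$Count))
    mae / mean(test$Count)                          # normalized prediction error
  })
}
```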

The maps below show the outcome of the spatial cross-validation process. While the errors across the models are somewhat clustered, there are clear differences between the four models. The ensemble model shows the fewest tracts with high overdose prediction error. While there are higher errors in the southern portion of the city, this area includes neighborhoods such as Reservoir and South Elmwood that did not experience high numbers of overdoses in 2017. Regardless, this distribution of errors must be kept in mind when utilizing the risk predictions.

<b>Figure 3.2.3a</b>


In addition to determining if a model is generalizable across space, it is also important to evaluate if it is generalizable across different social and neighborhood conditions. To do this, we evaluated how well our model predicts overdose events across three social patterns: population density, median household income, and families living in poverty.

The below table shows the level of error the ensemble model produces when predicting overdose events across four quartiles of population density in Providence. For this model, the level of error increases as the city becomes less dense, indicating that the model better predicts risk in areas closer to Downtown Providence. This is consistent with the errors observed in the map above, where the southern portion of the city had higher errors.

Density              Normalized Mean Absolute Error
Bottom 25% Density   0.6611757
Bottom 50% Density   0.5656666
Top 25% Density      0.2552651
Top 50% Density      0.3987325
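
A sketch of how a normalized error table like the one above can be computed in base R. The column names are assumptions, and the sketch uses disjoint quartiles; the overlapping Bottom/Top 50% groups in the text could be computed similarly with logical subsets:

```r
# Bin grid cells into population-density quartiles and compute a
# normalized MAE (mean absolute error / mean OD count) within each bin
density_errors <- function(df) {
  q <- cut(df$pop.dens,
           breaks = quantile(df$pop.dens, probs = seq(0, 1, 0.25)),
           include.lowest = TRUE,
           labels = c("Bottom 25%", "25-50%", "50-75%", "Top 25%"))
  tapply(seq_len(nrow(df)), q, function(i)
    mean(df$abs.error[i]) / mean(df$Count[i]))      # normalized MAE per quartile
}
```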

However, when examining other demographic and economic characteristics, the level of error is more consistent. The below table shows the ensemble model’s error in predicting overdose events in high-income areas (top 25% citywide) and low-income areas (bottom 25% citywide). While high-income neighborhoods show somewhat less error than low-income neighborhoods, the discrepancy is much smaller than that observed for population density.

Income Level   Normalized Mean Absolute Error
High Income    0.3894371
Low Income     0.4971342

Lastly, the below table shows the model’s level of error in predicting overdose events in high-poverty areas (top 25% citywide) and low-poverty areas (bottom 25% citywide). While the model predicts slightly better for high-poverty areas, the difference in error levels between these two groups is very low.

Poverty Rate   Normalized Mean Absolute Error
High Poverty   0.4335743
Low Poverty    0.4824218

3.3 Converting Predictions into Actionable Intelligence

3.3.1 Building a Web Application

The creation of the final ensemble model and its corresponding risk predictions allowed us to return to the question of whether Safe Stations are the resources best located in Providence to respond to the opioid crisis. Should more resources be allocated to stations that are in higher-risk areas? Should additional Safe Stations be opened in high-risk areas at other publicly-owned facilities? Could advertising for existing facilities be supplemented in high-risk areas to point people to the resources that they need? To convert the ensemble model’s final risk predictions into actionable intelligence, we built a web application that the City of Providence and other stakeholders can use to understand risk in an area and what facilities are best suited to respond.

The web application shows the final risk prediction map and allows users to examine a neighborhood’s average prediction level and compare it to that of other Providence neighborhoods. Then, it allows users to view the locations of community protective resources, such as grocery stores, libraries, hospitals, etc. Each facility has a visible quarter-mile buffer and displays a count of how many overdoses occurred within that distance in 2017. The goal is that with the risk prediction map, along with the counts of nearby actual overdoses, the City can clearly see which facilities are best suited to host or advertise additional interventions. The final web application can be found at this link.

3.3.2 Looking Ahead

The final spatial risk ensemble model predicts overdoses better than a kernel density map in the 90-100% and 70-89% risk areas of Providence. Prediction errors are higher in the southern and eastern portions of the city, and errors vary significantly based on population density - a fact that must be considered when utilizing this model’s predictions. However, the model is generalizable across neighborhoods of Providence with different economic characteristics. In the future, this model can be improved through the inclusion of more data. As the City continues to collect information about the location of overdoses, this model can be honed by using all of the 2017 data to predict for 2018, or by introducing seasonality variables. Continually adding overdose data will allow the City of Providence to keep making improvements that minimize the model’s error and improve its generalizability across the City.

4. Data Source Appendix

Dataset Source Link
Overdose Points Providence Healthy Communities Office NA
311 Data Providence Healthy Communities Office NA
Every Home Properties Providence Healthy Communities Office NA
Providence Parcel Boundaries 2017 Providence Open Data Portal https://data.providenceri.gov/Neighborhoods/Providence-Parcel-Boundaries-2017/78bu-i8at
Neighborhood Boundaries Providence Open Data Portal https://data.providenceri.gov/Reference/Neighborhood-Boundaries/yn24-dv5q
Police Case Logs Providence Open Data Portal https://data.providenceri.gov/Public-Safety/Providence-Police-Case-Log-Past-180-days/rz3y-pz8v
Active Business Licenses Providence Open Data Portal https://data.providenceri.gov/Neighborhoods/Active-Business-Licenses-in-Providence/ui7z-kv69
SNAP Vendors Providence Open Data Portal https://data.providenceri.gov/Neighborhoods/SNAP-Vendor-List/cvbu-xhqr
Schools Rhode Island GIS http://www.rigis.org/datasets/schools
Libraries Rhode Island GIS http://www.rigis.org/datasets/5c67754a919a456db108bf01a6f5ce5a_0?geometry=-72.085%2C41.75%2C-70.779%2C41.929
RIPTA Bus Stops Rhode Island GIS http://www.rigis.org/datasets/ripta-bus-stops
Hospitals Rhode Island GIS http://www.rigis.org/datasets/0ba978ef54a44b0a93ddd305773733b9_0
Emergency Medical Services Rhode Island GIS http://www.rigis.org/datasets/716b626f823e4204a2d2374593d5c8ba_0
Law Enforcement Facilities Rhode Island GIS http://www.rigis.org/datasets/law-enforcement
Fire Stations Rhode Island GIS http://www.rigis.org/datasets/fire-stations
Correctional Institutions Rhode Island GIS http://www.rigis.org/datasets/correctional-institutions
American Community Survey, 2016 5-Year Estimates US Census Bureau https://factfinder.census.gov/faces/nav/jsf/pages/index.xhtml

5. Feature Appendix

Variable Description Distance Density X3.Nearest.Neighbor Source
Count Count of ODs in grid cell N N N OD points
dummy.od Yes/No OD in grid cell N N N OD points
od Opioid overdose events Y Y Y OD points
statecap State Capitol Y N N Point location
cityhall City Hall Y N N Point location
nhood Neighborhood name N N N Neighborhood Boundaries
cbd Downtown Y N N Neighborhood Boundaries
priceperacre Average price per acre of parcels in grid cell N N N Parcel Boundaries 2017
assault Assaults in past 180 days Y Y Y Police Case Logs
drugcrime Drug crime in past 180 days Y Y Y Police Case Logs
supermarket Supermarkets Y Y Y Business Licenses
grocery Grocery stores (includes corner stores) Y Y Y Business Licenses
drycleaner Dry cleaners and laundromats Y Y Y Business Licenses
tobacco Tobacco vendors Y Y Y Business Licenses
liq Liquor licenses Y Y Y Business Licenses
snap SNAP vendors Y Y Y SNAP Vendors
college Colleges and universities Y Y Y Schools
prvschool Private schools Y Y Y Schools
pubschool Public schools Y Y Y Schools
overgrowth Overgrowth complaints Y Y Y 311 Data
light Streetlight outage complaints Y Y Y 311 Data
water Water-related complaints Y Y Y 311 Data
traffic High-speed traffic complaints Y Y Y 311 Data
graffiti Graffiti complaints Y Y Y 311 Data
code Code violation complaints Y Y Y 311 Data
trash Trash complaints Y Y Y 311 Data
abldgs Abandoned building complaints Y Y Y 311 Data
acars Abandoned cars complaints Y Y Y 311 Data
ehp Every Home Properties Y Y Y Every Home Properties
library Libraries Y Y Y Libraries
bus RIPTA bus stops Y Y Y RIPTA Bus Stops
hospital Hospitals Y Y Y Hospitals
ems Emergency medical services Y Y Y Emergency Medical Services
law Law enforcement facilities Y Y Y Law Enforcement Facilities
fire Fire stations Y Y Y Fire Stations
prison Correctional institutions Y Y Y Correctional Institutions
pop.dens Population density N N N American Community Survey, 2016
MedHHInc Median household income N N N American Community Survey, 2016
PctPv Poverty rate N N N American Community Survey, 2016
PctUnemp Unemployment rate N N N American Community Survey, 2016

6. Data Wrangling Appendix

Compiling the data needed to complete the above analysis requires wrangling data from a variety of sources. The below appendix will explain the various methods undertaken to create the dataset necessary for a spatial risk model of opioid overdoses. These methods include creating a fishnet, filtering variables, creating distance and nearest neighbor variables, creating kernel density variables, utilizing the tidycensus API, and joining these and other variables to a fishnet.

6.1 Set Up

To begin, load the necessary libraries and read in the data.

#Load the Libraries
  library(knitr)
  library(tidyverse)
  library(tidycensus)
  library(sf)
  library(ggmap)
  library(maptools)
  library(sp)
  library(spatstat)
  library(FNN)
  library(rgdal)
  library(raster)
  library(rasterVis)
  library(grid)
  library(gridExtra)
  library(RColorBrewer)
  library(stargazer)
  library(ggplot2)
  library(rgeos)
  library(kableExtra)
  library(caret)
  library(MASS)
  library(pscl)
  library(glmnet)
  library(Metrics)
  library(corrplot)
  library(QuantPsyc)
  library(scales)
  library(spdep)
  library(xgboost)
  library(Ckmeans.1d.dp)
  library(gbm)
  library(randomForest)
  library(GGally)

#Turn off scientific notation
  options(scipen=999)

#Set working directory
  #setwd("C:/Users/jbutz/Box Sync/Providence")
  setwd("C:/Users/AnnieStreetman.DESKTOP-9NUSKDD/Desktop/Box Sync/Providence")

#Read in the data and project it immediately!
  ODpoints <- st_read("updated_data.shp") %>% 
              st_transform(3438)
  boundary <- st_read("Providence_Boundary_unproj.shp") %>% 
              st_transform(3438)
  roads <-    st_read("Major_Roads_unproj.shp") %>% 
              st_transform(3438) 
  parks <-    st_read("Parks_OpenSpace_unproj2.shp") %>% 
              st_transform(3438)
  water <-    st_read("area_water_unproj.shp") %>% 
              st_transform(3438)
  nhoods <-   st_read("Neighborhoods_unproj.shp") %>% 
              st_transform(3438)
  parcels <-  st_read("Parcels_updatedFeb_unproj.shp") %>% 
              st_transform(3438)
  counties <- st_read("tl_2012_44_cousub.shp") %>% 
              st_transform(3438)
  AllBiz <-   st_read("AllBiz_unproj.shp")  %>% 
              st_transform(3438)
  SNAPvend <- st_read("SNAP_vendors_unproj.shp") %>% 
              st_transform(3438)
  Vacant <-   st_read("EveryHomeProperties_unproj.shp") %>% 
              st_transform(3438)
  schools <-  st_read("schools_unproj.shp") %>% 
              st_transform(3438)
  prisons <-  st_read("correctional_institutions_unproj.shp") %>% 
              st_transform(3438)
  fire <-     st_read("fire_station_unproj.shp") %>% 
              st_transform(3438)
  hospital <- st_read("hospitals_unproj.shp") %>% 
              st_transform(3438)
  law.enf <-  st_read("Law_Enforcement_unproj.shp") %>% 
              st_transform(3438)
  EMS <-      st_read("EMS_unproj.shp") %>% 
              st_transform(3438)
  bus <-      st_read("bus_stops_unproj.shp") %>% 
              st_transform(3438)
  library <-  st_read("Libraries.shp") %>% 
              st_transform(3438)
  Calls311 <- st_read("311_unproj.shp") %>% 
              st_transform(3438)
  grocery <-  st_read("groc_unproj.shp") %>% 
              st_transform(3438)
  drugcrime <-st_read("drug_proj.shp") %>% 
              st_transform(3438)
  assaults <- st_read("assaults_proj2.shp") %>% 
              st_transform(3438)
  cityhall <- st_read("CityHall.shp") %>% 
              st_transform(3438)
  statecap <- st_read("TrainStations.shp") %>% 
              st_transform(3438)
  wards <-    st_read("Wards_proj.shp")%>% 
              st_transform(3438) %>% 
              dplyr::select(district_1) %>% 
              rename(ward = district_1)

6.2 Unit of Analysis - Making a Fishnet

One of the central decisions for any model building process is determining the unit of analysis. In a spatial risk model, the unit of analysis determines whether the model interprets risk by neighborhood, census tract, or block. In this model, the unit of analysis is a grid of 500-by-500-foot cells covering the entire city of Providence - otherwise referred to as a “fishnet.” The unit of analysis is discussed in further depth in the case study above, but this code will explain how to create a fishnet. Ultimately, all dependent and independent variables will be joined to the fishnet, giving each 500-foot grid cell in Providence a value for each variable that can then be used in the model.

You can see from the below maps that the fishnet breaks Providence into 2,262 grid cells. Centroid points of the grid cells will be used to measure distance and any values that must be joined to grid cells will be averaged within their cells.
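
The fishnet creation itself can be sketched with `sf::st_make_grid`; the 500 ft cell size follows the text (EPSG 3438 is in US feet), and the object names are chosen to match the code used later in this appendix:

```r
#Create the fishnet - a minimal sketch, assuming the projected
#boundary object read in above
  provGrid <- 
    st_make_grid(boundary, cellsize = 500) %>% #500 ft x 500 ft cells
    st_sf()
    
  provGrid_proj <- provGrid[boundary, ] #keep only cells that intersect the city
  
  fishnetPoints <- st_centroid(provGrid_proj) #centroid points used for distance measures
```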

Once the fishnet has been made, variables can be aggregated into them in numerous ways that will be outlined in further detail below. In this use case, the first step is to count the number of overdose events that have occurred within each 500-foot grid cell. This will serve as the dependent variable in the spatial risk model.

#Count of ODs per fishnet grid cell  
    Count.FN <- 
      aggregate(ODpoints, provGrid_proj, length) %>%
      bind_cols(provGrid_proj) %>%
      dplyr::select(Count)    
    
    Count.FN$Count[is.na(Count.FN$Count)] <- 0 #replace NAs with 0 (cells with no overdoses)
    
    summary(Count.FN$Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2971  0.0000 21.0000

6.3 Preparing the Independent Variables

Many of the independent variables used in the model were filtered out of much larger publicly available datasets. For this analysis, several variables were filtered out of Providence’s business licenses, 311 data, school locations, and neighborhood boundaries. The below code demonstrates how to filter out variables, using Providence’s 311 data as an example.

#Filtering 311 data - using the Request Type column to create several variables

  AbandonedProperties <- #new dataset name
      Calls311 %>% #dataset to filter out of 
      filter(Request_Ty == "Abandoned Property" | 
             Request_Ty == "Abandoned Property Needs Reboarding") #Request_ty = this OR that
      
  AbandonedCars <- 
      Calls311 %>%
      filter(Request_Ty == "Abandoned Vehicles")
  
  Trash <- 
      Calls311 %>%
      filter(Request_Ty == "Dumping of trash, bulky items on Public Street or Vacant Lot" |
             Request_Ty == "Dumping or Trash in Public Park" | 
             Request_Ty == "Overflowing Dumpster" |
             Request_Ty == "Trash/debris on private property or Lot" |
             Request_Ty == "Trash/litter on public street or sidewalk")
  
  CodeViolations <- 
      Calls311 %>%
      filter(Request_Ty == "General Permit Issues" | 
             Request_Ty == "Substandard Housing Code Issues or Violations")
  
  Graffiti <- 
      Calls311 %>%
      filter(Request_Ty == "Graffiti" | 
             Request_Ty == "Graffiti in a Park")
      
  HighTraffic <- 
      Calls311 %>%
      filter(Request_Ty == "Crosswalks and Road Striping" | 
             Request_Ty == "Potholes" | 
             Request_Ty == "Road Repair or Replacement" | 
             Request_Ty == " Sidewalk - Repair Existing" | 
             Request_Ty == "Speed bumps/Traffic calming" | 
             Request_Ty == "Speeding on Streets" |
             Request_Ty == "Traffic/Walk Signal Malfunction")
      
  Water <- 
      Calls311 %>%
      filter(Request_Ty == "Broken Water Pipe" | 
             Request_Ty == "Catch basin blocked or broken" | 
             Request_Ty == "Storm Drain Backup" | 
             Request_Ty == "Street Flooding")
      
  Light <- 
      Calls311 %>%
      filter(Request_Ty == "Decorative Street Light" | 
             Request_Ty == "Street Light" )
      
  Overgrowth <- 
      Calls311 %>%
      filter(Request_Ty == "Overgrowth/High Grass on Private Property" |
             Request_Ty == "Overgrowth/High Grass on Public Property")

6.4 Distance and Nearest Neighbor Variables

The next two sections walk through some of the most important variables to include in a spatial risk model - distance- and density-based variables. For example, a variable indicating whether or not a grid cell contains a licensed tobacco vendor is not as illustrative as a variable that tells each grid cell its distance to the nearest tobacco vendor. This is explained in further depth in the feature engineering discussion above. This section explains how to create two types of variables: distance variables and nearest neighbor variables. Nearest neighbor variables take the average proximity to the n nearest instances of a feature - for example, the 3 nearest licensed tobacco vendors - and therefore often provide a better measure of proximity to certain types of amenities or disamenities. Distance and nearest neighbor variables use almost exactly the same code, a few examples of which are shown below. For this analysis, distance and distance-to-3-nearest-neighbor variables were generated for every point variable, such as tobacco vendors, liquor licenses, and public schools; the full list is included in the data dictionary appendix. These values were then joined back to the final fishnet for use in the spatial risk model.

#Distance Variables
#Create matrix function for datasets
  dataXY_function <- function(data) {
    
    dataXY <- 
      data %>% 
      st_coordinates() %>%
      as.data.frame() %>%
      dplyr::select(X,Y) %>%
      as.matrix()
    
    return(dataXY)
  }

#Create function for distance to X nearest neighbors
  nn_function <- function(measureFrom,measureTo,k) {
    
    nn <-   
      get.knnx(measureTo, measureFrom, k)$nn.dist
    
    output <-
      as.data.frame(nn) %>%
      rownames_to_column(var = "thisPoint") %>%
      gather(points, point_distance, V1:ncol(.)) %>%
      arrange(as.numeric(thisPoint)) %>%
      group_by(thisPoint) %>%
      summarize(pointDistance = mean(point_distance)) %>%
      arrange(as.numeric(thisPoint)) %>% 
      dplyr::select(-thisPoint)
    
    return(output)  
  }

#Create Fishnet Points Matrix
  fishnetPointsXY <- dataXY_function(fishnetPoints)
  
#OD points  
  #Create Matrix and run distance function
    odXY <- dataXY_function(ODpoints)
    dist.od.1 <- nn_function(fishnetPointsXY, odXY, 1) #For distance, the neighbor value is set to 1
  
  #Tabular and Spatial Join to fishnet
    dist.od <- 
      fishnetPoints %>% 
      bind_cols(dist.od.1)
    
    dist.od.FN <- 
      aggregate(dist.od, provGrid_proj, mean) %>%
      bind_cols(Count.FN) %>% 
      rename(OD.dist = pointDistance)
    
#Tobacco Stores 
  #Create Matrix and run distance function
    tobaccoXY <- dataXY_function(tobacco)
    dist.tobacco.1 <- nn_function(fishnetPointsXY, tobaccoXY, 1)
  
  #Tabular and Spatial Join to fishnet
    dist.tobacco <- 
      fishnetPoints %>% 
      bind_cols(dist.tobacco.1)
    
    dist.tobacco.FN <- 
      aggregate(dist.tobacco, provGrid_proj, mean) %>%
      bind_cols(dist.od.FN) %>% #join this back to the most recently made dataset
      rename(tobacco.dist = pointDistance)
#Nearest Neighbor Variables
#OD
  #Run nearest neighbor function
    nn.od.1 <- nn_function(fishnetPointsXY, odXY, 3) #change the number to change the number of neighbors measured
    
  #Tabular and Spatial Join to fishnet
    nn.od <- 
      fishnetPoints %>% 
      bind_cols(nn.od.1)

    nn.od.FN <- 
      aggregate(nn.od, provGrid_proj, mean) %>%
      bind_cols(dist.statecap.FN) %>% #join to most recently made dataset
      rename(OD.nn = pointDistance)
    
#Tobacco Stores 
  #Run nearest neighbor function
    nn.tobacco.1 <- nn_function(fishnetPointsXY, tobaccoXY, 3)
    
  #Tabular and Spatial Join to fishnet
    nn.tobacco <- 
      fishnetPoints %>% 
      bind_cols(nn.tobacco.1)
    
    nn.tobacco.FN <- 
      aggregate(nn.tobacco, provGrid_proj, mean) %>%
      bind_cols(nn.od.FN) %>%
      rename(tobacco.nn = pointDistance)

6.5 Kernel Density Variables

Another way to measure the relative proximity and magnitude of variables is through their kernel density. The following code explains how to measure kernel density and join the average density value within each grid cell to the fishnet. These examples are just two of the many variables whose density was compiled in the fishnet; the full list is included in the data dictionary appendix.

#Create an owin window of the Providence boundary
  boundary$geometry #find your xmin and xmax 
## Geometry set for 1 feature 
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 335598.3 ymin: 251068.8 xmax: 361734 ymax: 283675.8
## epsg (SRID):    3438
## proj4string:    +proj=tmerc +lat_0=41.08333333333334 +lon_0=-71.5 +k=0.99999375 +x_0=99999.99998983997 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
## POLYGON ((359521.1 255864.1, 359523.1 255864.7,...
  provWindow <- owin(c(335598.3,361734), c(251068.8,283675.8))
    
#Make Fishnet a spatial polygon to create a mask for visualizing and extracting data 
  mask <- 
    provGrid_proj %>% 
    as('Spatial') %>%
    as('SpatialPolygons')
    
#Public Schools
  #Create Density Map and Join to Fishnet  
    ppp.publicschools <- as.ppp(publicschoolXY, provWindow)
    dens.publicschools.1 <- raster(density(ppp.publicschools, sigma = 1500)) #sigma is your search distance
    dens.publicschools <- raster::extract(dens.publicschools.1, mask, fun = mean)
    
    dens.publicschools.FN <- 
      cbind(dens.publicschools, nn.assaults.FN) %>% #join to the most recent dataset
      rename(pubschool.dens = dens.publicschools)

#Private Schools
  #Create Density Map and Join to Fishnet  
    ppp.privateschools <- as.ppp(privateschoolXY, provWindow)
    dens.privateschools.1 <- raster(density(ppp.privateschools, sigma = 1500))
    dens.privateschools <- raster::extract(dens.privateschools.1, mask, fun = mean)
    
    dens.privateschools.FN <- 
      cbind(dens.privateschools, dens.publicschools.FN) %>% 
      rename(privschool.dens = dens.privateschools)

6.6 Census Data

To add neighborhood characteristics, the Census API was used to pull a block group shapefile and variables thought to impact the number of overdoses, including population density, poverty rate, unemployment rate, and median household income. The code below utilizes the tidycensus package.
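
The ACS pull that creates `BG2016` is not shown; a minimal sketch using `tidycensus::get_acs` might look like the following. The ACS variable IDs are illustrative assumptions, not necessarily the project's full list, and a Census API key must first be registered with `census_api_key()`:

```r
#A sketch of the block group pull for the 2012-2016 5-year estimates
  library(tidycensus)

  BG2016 <- get_acs(geography = "block group",
                    variables = c(MedHHInc = "B19013_001",  #median household income
                                  Pop      = "B01003_001"), #total population
                    state = "RI", county = "Providence",
                    year = 2016, geometry = TRUE, output = "wide")
```

Note that with `output = "wide"`, tidycensus appends E (estimate) and M (margin of error) suffixes to the supplied names, so some renaming is needed before the summary code below.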

censustable <- 
  BG2016 %>% 
  as.data.frame() %>% 
  dplyr::select(-GEOID, -year, -geometry, -area, -Pop) %>% 
  rename("Median Household Income" = MedHHInc,
         "Poverty Rate" = pctPov,
         "Unemployment Rate" = pctUnemp)

censustable$`Median Household Income`[is.na(censustable$`Median Household Income`)] <- 0

stargazer(censustable, type = "text", title = "Census Variable Summary Statistics")
## 
## Census Variable Summary Statistics
## ===============================================================
## Statistic                N     Mean     St. Dev.   Min    Max  
## ---------------------------------------------------------------
## Median Household Income 156 42,392.290 31,316.450   0   218,583
## Poverty Rate            156   0.209      0.192    0.000  0.913 
## Unemployment Rate       156   0.112      0.090    0.000  0.528 
## ---------------------------------------------------------------

6.7 Joining Neighborhood Characteristics to the Fishnet

Lastly, two additional neighborhood characteristics were added to the fishnet: the neighborhood name and the average price per acre of parcels.

#Joining the neighborhoods to the fishnet 
  finalfishnetneighb <- 
    st_join(fishnetPoints, nhoods) %>% 
    dplyr::select(-shape_area, -shape_len)
  
  finalfishnetneighb2 <- 
    st_join(finalfishnet, finalfishnetneighb) %>% 
    dplyr::select(-uniID.y) %>% 
    rename(nhood = lname)
    
  finalfishnet <- finalfishnetneighb2
  
#Take a look at the output  
  table(finalfishnet$nhood)
## 
##             Blackstone                Charles           College Hill 
##                    167                    100                     87 
##               Downtown               Elmhurst                Elmwood 
##                     86                    117                     79 
##           Federal Hill              Fox Point               Hartford 
##                     59                     62                     78 
##                   Hope Lower South Providence                 Manton 
##                     54                     79                     54 
##             Mount Hope         Mount Pleasant             Olneyville 
##                     95                    140                     63 
##              Reservoir            Silver Lake             Smith Hill 
##                     91                    105                     64 
##          South Elmwood Upper South Providence                 Valley 
##                    110                     68                     53 
##               Wanskuck        Washington Park                Wayland 
##                    128                    151                     52 
##               West End 
##                    104
#Joining the average parcel value to the fishnet
  parcels$ass_total <- as.numeric(parcels$ass_total)
  parcels$area_ac <- as.numeric(parcels$area_ac)

  land.value <- 
    parcels %>% 
    mutate(priceperacre = ifelse(area_ac > 0, ass_total/area_ac, 0)) %>% 
    dplyr::select(-unit, -tax_year, -owner_city, -ass_bldg, -ass_total, -ownername, -parc_add, 
                  -area_ac, -shape_leng, -owner_zip, -cama_link, -owner_ctry, -area_sf, -ownerstate, -shape_area, 
                  -propid, -plat, -ass_land, -owner_add)
  
  finalfishnetLV <- 
    aggregate(land.value, finalfishnet, mean) %>% 
    bind_cols(finalfishnet) %>%  
    dplyr::select(-lot, -geometry1)
  
  finalfishnet <- finalfishnetLV
  
#Take a look at the output
  summary(finalfishnet$priceperacre)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    0.075    0.520    0.846  196.121    2.998 5343.722        8

6.8 Finalizing the Data

After preparing all of these variables (listed in the data dictionary here) and joining them into a fishnet, the final dataset used for model building is a set of 500-foot grid cells containing the following fields.

names(fishnet)
##   [1] "priceperacre"     "MedHHInc"         "pctPv"           
##   [4] "PctUnemp"         "assault.dens"     "drugcrime.dens"  
##   [7] "supermarket.dens" "grocery.dens"     "library.dens"    
##  [10] "drycleaner.dens"  "ems.dens"         "law.dens"        
##  [13] "hospital.dens"    "fire.dens"        "prison.dens"     
##  [16] "tobacco.dens"     "ehp.dens"         "overgrowth.dens" 
##  [19] "light.dens"       "water.dens"       "traffic.dens"    
##  [22] "graffiti.dens"    "code.dens"        "trash.dens"      
##  [25] "abldgs.dens"      "acars.dens"       "bus.dens"        
##  [28] "snap.dens"        "liq.dens"         "od.dens"         
##  [31] "college.dens"     "prvschool.dens"   "pubschool.dens"  
##  [34] "uniID"            "assault.nn"       "drugcrime.nn"    
##  [37] "supermarket.nn"   "grocery.nn"       "library.nn"      
##  [40] "drycleaner.nn"    "ems.nn"           "law.nn"          
##  [43] "hospital.nn"      "fire.nn"          "ehp.nn"          
##  [46] "overgrowth.nn"    "light.nn"         "water.nn"        
##  [49] "traffic.nn"       "graffiti.nn"      "code.nn"         
##  [52] "trash.nn"         "abldgs.nn"        "acars.nn"        
##  [55] "college.nn"       "prvschool.nn"     "pubschool.nn"    
##  [58] "bus.nn"           "snap.nn"          "liq.nn"          
##  [61] "tobacco.nn"       "od.nn"            "park.dist"       
##  [64] "statecap.dist"    "cityhall.dist"    "assault.dist"    
##  [67] "drugcrime.dist"   "cbd.dist"         "supermarket.dist"
##  [70] "grocery.dist"     "law.dist"         "drycleaner.dist" 
##  [73] "ems.dist"         "prison.dist"      "fire.dist"       
##  [76] "hospital.dist"    "ehp.dist"         "overgrowth.dist" 
##  [79] "light.dist"       "water.dist"       "traffic.dist"    
##  [82] "graffiti.dist"    "code.dist"        "trash.dist"      
##  [85] "abldgs.dist"      "acars.dist"       "college.dist"    
##  [88] "prvschool.dist"   "pubschool.dist"   "bus.dist"        
##  [91] "snap.dist"        "liq.dist"         "tobacco.dist"    
##  [94] "od.dist"          "Count"            "pop.dens"        
##  [97] "nhood"            "dummy.OD"         "geometry"        
## [100] "countbucket"      "countbucket2"     "a1"              
## [103] "d1"               "n1"               "e1"
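The fishnet itself is built upstream of this appendix. For readers constructing one from scratch, below is a minimal sketch using the sf package; the toy boundary polygon is invented for illustration, and in practice you would read the real city boundary with st_read() and project it to a feet-based CRS (such as Rhode Island State Plane) before gridding.

```r
library(sf)

#toy square standing in for the projected city boundary (units in feet)
boundary.toy <- st_sf(geometry = st_sfc(st_polygon(list(rbind(
  c(0, 0), c(2000, 0), c(2000, 2000), c(0, 2000), c(0, 0))))))

#square 500 ft grid over the boundary's bounding box
grid <- st_make_grid(boundary.toy, cellsize = 500)

#keep only the cells that intersect the boundary, with an ID for later joins
fishnet.sketch <- st_sf(geometry = grid)
fishnet.sketch <- fishnet.sketch[boundary.toy, ]
fishnet.sketch$uniqueID <- seq_len(nrow(fishnet.sketch))

nrow(fishnet.sketch)  #16 cells for this 2000 x 2000 toy square
```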

7. Data Visualization Appendix

7.1 Setting the Aesthetics

It is useful at the beginning of the process to set a few basic aesthetic norms that can be reused throughout the project. The below code walks through creating several color palettes, a map theme, a basemap, and a plot theme.

#Set a few color palettes using the RColorBrewer library; these can be used later in ggplots
  display.brewer.all() #shows your color palette options

  brewer.pal("YlOrRd", n = 5) #provides you with hex codes for n number of colors needed in your palette
## [1] "#FFFFB2" "#FECC5C" "#FD8D3C" "#F03B20" "#BD0026"
#Copy and paste this output into named palettes
  #General Palette     
  palette5 <- c("#FFFFB2", "#FECC5C", "#FD8D3C", "#F03B20", "#BD0026")
  palette5_inverse <- c("#BD0026", "#F03B20", "#FD8D3C", "#FECC5C", "#FFFFB2")
  palette10 <- c("#FFFFFF", "#FFFFCC", "#FFEDA0", "#FED976", "#FEB24C", "#FD8D3C", "#FC4E2A",   "#E31A1C", "#BD0026", "#800026")
  palette10_inverse <- c("#800026", "#BD0026", "#E31A1C", "#FC4E2A", "#FD8D3C", "#FEB24C",   "#FED976", "#FFEDA0", "#FFFFCC", "#FFFFFF")
  
  #Categorical Palettes
  amenities_palette10 <- c("#FFFFFF", "#F7FCFD", "#E5F5F9", "#CCECE6", "#99D8C9", "#66C2A4",   "#41AE76", "#238B45", "#006D2C", "#00441B")
  
  amenities_palette10_inv <- c("#00441B", "#006D2C", "#238B45", "#41AE76", "#66C2A4", "#99D8C9",   "#CCECE6", "#E5F5F9", "#F7FCFD", "#FFFFFF")
  
  amenities_palette5 <- c("#EDF8FB", "#B2E2E2", "#66C2A4", "#2CA25F", "#006D2C")
  
  risk_palette10 <- c("#FFFFFF", "#F7FCFD", "#E0ECF4", "#BFD3E6", "#9EBCDA", "#8C96C6", "#8C6BB1",   "#88419D", "#810F7C", "#4D004B")
  risk_palette10_inv <- c("#4D004B", "#810F7C", "#88419D","#8C6BB1", "#8C96C6",   "#9EBCDA","#BFD3E6", "#E0ECF4", "#F7FCFD", "#FFFFFF")
  
  risk_palette5 <- c("#EDF8FB", "#B3CDE3", "#8C96C6", "#8856A7", "#810F7C")
  
  neighb_palette10 <- c("#FFFFFF", "#FFF7FB", "#ECE7F2", "#D0D1E6", "#A6BDDB", "#74A9CF",   "#3690C0", "#0570B0", "#045A8D",
                        "#023858")
  neighb_palette5 <- c("#F1EEF6", "#BDC9E1", "#74A9CF", "#2B8CBE", "#045A8D")
#A mapTheme function establishes aesthetics such as background color and axis labels that can be added to all maps
  mapTheme <- function(base_size = 12) {
    theme(
      text = element_text(color = "black"),
      plot.title = element_text(size = 16, colour = "black", face = "bold"),
      plot.subtitle=element_text(size = 12, face="italic"),
      plot.caption=element_text(hjust=0),
      axis.ticks = element_blank(),
      panel.background = element_rect(fill = "white"),
      axis.title = element_blank(),
      axis.text = element_blank(),
      panel.grid.major = element_line(colour="white"),
      panel.grid.minor = element_line(colour = "white"),
      panel.border = element_blank()
    )
  }

#Setting a plotTheme in addition to a mapTheme allows you to set consistent aesthetics for all plots
  plotTheme <- function(base_size = 12) {
    theme(
      plot.title = element_text(size = 16, face = "bold"),
      plot.subtitle=element_text(size =  12, face="italic"),
      strip.text.x = element_text(size = 16), 
      text = element_text(size = 12),
      axis.text = element_text(size = 14),
      axis.title = element_text(size = 14, face = "italic"),
      strip.background = element_rect(colour="white", fill="white")
      )
  }
    
#Make a basemap by creating a legend column for each layer and plotting them together
    boundary <- 
      boundary %>%
      dplyr::select(geometry) %>%
      mutate(Legend = "Providence Boundary")
    
    parks <-
      parks %>%
      dplyr::select(geometry) %>%
      mutate(Legend = "Parks")
    
    roads <-
      roads %>%
      dplyr::select(geometry) %>%
      mutate(Legend = "Roads")
    
    water <-
      water %>%
      dplyr::select(geometry) %>%
      mutate(Legend = "Water")
    
    BaseMap <- ggplot() + 
      geom_sf(data = boundary, aes(), fill = "grey85", color = NA, size = 1) +
      geom_sf(data = parks, aes(), fill = "grey75", color = NA) + #plot the parks layer prepared above (fill color is arbitrary)
      geom_sf(data = roads, aes(), fill = NA, color = "white") +
      geom_sf(data = water, aes(), fill = "slategray2", color = NA) +
      mapTheme()
    
    BaseMap + labs(title = "Base Map")

7.2 Mapping Data

Maps can convey many kinds of data and are often a clear way to both explore and communicate complicated spatial patterns. This section takes you through creating the different map types used in this project, including mapping data by fishnet, by polygon, and by points.

Maps by fishnet grid cell are especially useful in conveying spatial patterns from the feature engineering section, including count, density, and distance variables. The below code creates a map, in deciles, of the distance to parks across the city.

  ggplot() + 
        geom_sf(data=fishnet, aes(fill=factor(ntile(park.dist, 10))), color = NA) +
        scale_fill_manual(values = amenities_palette10_inv,
                          name = "Decile \n Breaks",
                          labels = (c("Low", "", "", "", "", "", "", "", "",  "High"))) +
        geom_sf(data = boundary, aes(), fill = NA, color = "grey40", size = 0.5) +
        labs(title = 'Distance to Parks') +
        mapTheme()
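For intuition, the ntile() call above assigns each cell to one of ten equal-count bins; a rough base R equivalent on an invented toy vector:

```r
#ntile(x, 10) assigns equal-count decile bins; findInterval() with
#quantile breaks approximates it in base R
x <- 1:10  #toy stand-in for a fishnet variable like park.dist

deciles <- quantile(x, probs = seq(0, 1, by = 0.1))
bins <- findInterval(x, deciles, rightmost.closed = TRUE)

bins  #1 2 3 4 5 6 7 8 9 10 -- one value per decile here
```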

Maps by larger polygons, such as neighborhood boundaries or census tracts, allow the assessment of data available at those levels, including census data. The below code manually bins census data into quintiles and then maps those quintiles across the city.

BGmaps$IncBucket <- factor(
  cut(as.numeric(BGmaps$MedHHInc), 
      c(0, 22041, 33657, 42375, 60152, 219000)),
  labels = c("$0 to $22,041", 
             "$22,041 to $33,657", 
             "$33,657 to $42,375",
             "$42,375 to $60,152",
             "$60,152 to $218,583")
)

ggplot() +
  geom_sf(data = BGmaps, aes(fill=IncBucket), colour = "black") +
  labs(title= "Median Household Income") +
  scale_fill_manual(values = neighb_palette5, 
                    name = "Income\n(Quintile Breaks)") +
  mapTheme()
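The break points in the cut() call above are hard-coded; they can instead be computed from the data with quantile(). A toy illustration on invented values:

```r
#quintile break points divide a variable into five equal-count groups
inc <- c(10, 20, 30, 40, 50)  #hypothetical incomes, in thousands

breaks <- unname(quantile(inc, probs = seq(0, 1, by = 0.2)))
breaks  #10 18 26 34 42 50
```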

Mapping data points can also be useful in examining data across the city. The below code walks you through creating a map of station points in Providence on the basemap.

# map the station points
  fire$X <- (st_coordinates(fire)[,1])
  fire$Y <- (st_coordinates(fire)[,2])

  BaseMap + 
    geom_sf(data = fire, aes(), color = "#FD8D3C", size = 2.5) +
    labs(title = "Safe Stations", subtitle = "Located at Providence's 12 Fire Stations") +
    scale_colour_manual(values = c("#BD0026")) +
    geom_text(data = fire, aes(label = Station_NU, x = X, y = Y), fontface = "bold", 
                  vjust = -0.75, color = "grey15")

7.3 Other Ways to Visualize Data

Bar charts can be useful to compare data by different groupings. The below code walks through the process of determining the count of overdoses within 1/4 mile of fire stations and graphing that in a bar chart.

  #create quarter-mile buffers around the fire stations
  stationsBuffer_QtrMi <- 
    st_buffer(fire, 1320)

  #determine the count of overdoses within 1/4-mile of each fire station
     count.by.station <- 
      aggregate(ODpoints, stationsBuffer_QtrMi, length) %>% 
      bind_cols(stationsBuffer_QtrMi) %>% 
      dplyr::select(Station_NU, Count)

  #replace NA counts (stations with no overdoses within the buffer) with 0
    count.by.station$Count[is.na(count.by.station$Count)] <- 0

  #plot the data in a bar chart  
    ggplot(data = count.by.station, aes(x = reorder(Station_NU, -Count), y = Count, fill = "#BD0026")) +
      geom_bar(stat = "identity", color = "black", size = 1) + 
      scale_fill_manual(values = c("#BD0026")) +
      labs(title = "Count of Overdose Incidents within a 1/4-Mile of Each Fire Station", 
           x = "Fire Station", y = "Count of Overdoses Within 1/4-Mile") +
      theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1),
            plot.title = element_text(face = "bold", size = 16), axis.title = element_text(face = "italic")) 

Scatterplots can be a useful way of examining the relationship between variables. The below code creates a scatterplot of logged overdose density against distance to public schools.

  ggplot() + 
  geom_point(data = fishnet, aes(pubschool.dist, log(od.dens)), color = "#006D2C") +
  labs(title = "Distance to Public Schools by Number of Overdoses", 
       subtitle = "Distance to Public Schools & Logged OD Density Have a Correlation of -0.54",
       x = "Distance to Public Schools (ft)",
       y = "Logged Overdose Density") +
  geom_abline(intercept = -13.33735471, slope = -0.00031019, 
              color="black", size = 1.25) +
  plotTheme() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
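The correlations quoted in the subtitles come from cor(), applied in the project to fishnet columns (e.g., cor(fishnet$pubschool.dist, log(fishnet$od.dens))). A sanity check on invented values:

```r
#cor() returns the Pearson correlation coefficient
x <- c(1, 2, 3, 4, 5)
y <- c(10, 8, 6, 4, 2)  #perfectly decreasing with x

cor(x, y)  #-1, a perfect negative correlation
```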

7.4 Arranging graphics for ease of comparison

It is often much easier to see differences and similarities between visuals when they are arranged next to each other. This section walks you through creating visual layouts with both the grid.arrange and facet_wrap methods.

#Using grid.arrange to explore the distribution of tobacco vendors and liquor licenses and their relationship with overdoses
tobacco.map <- ggplot() + 
                  geom_sf(data=fishnet, aes(fill=factor(ntile(tobacco.nn, 10))), color = NA) +
                  scale_fill_manual(values = risk_palette10_inv,
                                    name = "Decile \n Breaks",
                                    labels = (c("Low", "", "", "", "", "", "", "", "", "High"))) +
                  geom_sf(data = boundary, aes(), fill = NA, color = "grey40", size = 0.5) +
                  labs(title = 'Distance to 3 Nearest Tobacco Vendors') +
                  mapTheme()

liquor.map <- ggplot() + 
                  geom_sf(data=fishnet, aes(fill=factor(ntile(liq.nn, 10))), color = NA) +
                  scale_fill_manual(values = risk_palette10_inv,
                                    name = "Decile \n Breaks",
                                    labels = (c("Low", "", "", "", "", "", "", "", "", "High"))) +
                  geom_sf(data = boundary, aes(), fill = NA, color = "grey40", size = 0.5) +
                  labs(title = 'Distance to 3 Nearest Liquor Licenses') +
                  mapTheme()

tobacco.scatterplot <- 
  ggplot() + 
  geom_point(data = fishnet, aes(tobacco.nn, log(od.dens)), color = "#810F7C") +
  labs(title = "Distance to 3 Nearest Tobacco Vendors by Number of Overdoses", 
       subtitle = "Distance to Tobacco Vendors & Logged OD Density Have a Correlation of -0.70",
       x = "Distance to 3 Nearest Tobacco Vendors (ft)",
       y = "Logged Overdose Density") +
  geom_abline(intercept = -12.97287653, slope = -0.00066801, 
       color="black", size = 1.25) +  
  plotTheme() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

liquor.scatterplot <- 
  ggplot() + 
  geom_point(data = fishnet, aes(liq.nn, log(od.dens)), color = "#810F7C") +
  labs(title = "Distance to 3 Nearest Liquor Licenses by Number of Overdoses", 
       subtitle = "Distance to Liquor & Logged OD Density Have a Correlation of -0.49",
       x = "Distance to 3 Nearest Liquor Licenses (ft)",
       y = "Logged Overdose Density") +
  geom_abline(intercept = -13.15816145, slope = -0.00053226, 
       color="black", size = 1.25) +  
  plotTheme() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

tlplots <- list(tobacco.map, tobacco.scatterplot, liquor.map, liquor.scatterplot)

tllay <- rbind(c(1,2,2),
             c(3,4,4))

grid.arrange(grobs = tlplots, layout_matrix = tllay)

  #Using facet wrap to view model predictions
    fishnet$a1 <- predict.glm(pFit_a2, as.data.frame(fishnet), type = "response")
    fishnet$d1 <- predict.glm(pFit_d1, as.data.frame(fishnet), type = "response")
    fishnet$n1 <- predict.glm(pFit_n1, as.data.frame(fishnet), type = "response")
  
    preda1 <- 
      fishnet %>% 
      dplyr::select(a1) %>% 
      rename(prediction = a1) %>% 
      mutate(model = "Community Protective Resources Model")    
    
    predd1 <- 
      fishnet %>% 
      dplyr::select(d1) %>% 
      rename(prediction = d1) %>% 
      mutate(model = "Risk Factors Model")    
    
    predn1 <- 
      fishnet %>% 
      dplyr::select(n1) %>% 
      rename(prediction = n1) %>% 
      mutate(model = "Neighborhood Model")    
  
    predictionmap.df <- rbind(preda1, predd1, predn1)
    predictionmap.df$facet <- factor(predictionmap.df$model, levels = c("Community Protective Resources Model",
                                                                        "Risk Factors Model", 
                                                                        "Neighborhood Model"))
    
    predmap <- ggplot() + 
      geom_sf(data = predictionmap.df, aes(fill = factor(ntile(prediction, 10))), color = NA) + 
      scale_fill_manual(values = palette10,
                        name="Predictions \n(Decile Breaks)", 
                        labels = c("0", "", "", "", "","","","","","3.84")) +
      geom_sf(data = boundary, aes(), fill = NA, color = "grey40", size = 0.5) +
      mapTheme()
    
    predmap + facet_wrap(~facet, ncol = 3) + labs(title = "Predicted Overdose Events") +
      theme(plot.title = element_text(face = "bold", size = 20), strip.text.x = element_text(size = 16),
            strip.background = element_rect(colour="white", fill="white"))

8. Modeling Appendix

8.1 Building a Model

Building a model that achieves the strongest fit for your use case involves extensive testing of different combinations of variables and model types. The below appendix explains the methods used in developing a model to predict the count of overdose events in Providence: separating a training and test set; building Poisson, negative binomial, XGBoost, and ensemble models; evaluating the strength of variables in each model; and using the model to predict on the full data set. It also walks through model validation metrics used to evaluate the strength of each model, including calculating mean absolute error, mapping the spatial distribution of residuals, making a histogram of residuals, comparing the model to kernel density maps, cross-validating the model across many folds, and spatially cross-validating the model across different parts of the city.

8.1.1 Separating a Training and Test Set

  # 60 / 40 split for testing regressions
  set.seed(777)  #set a random seed so the results can be reproduced
  partition <- createDataPartition(fishnet$Count, p = 0.6, list = FALSE) #create a partition
  fishnetTrain <- fishnet[partition, ] #Training set is this partition of the fishnet (60%)
  fishnetTest <- fishnet[-partition, ] #Test set is the remaining portion of the fishnet data (40%)

8.1.2 Poisson Models

  #training the model
  pFit <- glm(Count ~ snap.nn + bus.dist + park.dist + statecap.dist + supermarket.nn,
                 family = "poisson", data = fishnetTrain)
  
  #generating predictions
  fishnetTest$pred.pFit <- predict.glm(pFit, as.data.frame(fishnetTest), type = "response")

8.1.3 Negative Binomial Models

  #training the model
  nbFit <- train(Count ~ snap.nn + bus.dist + park.dist + statecap.dist + supermarket.nn,
                          data = fishnetTrain %>% 
                            as.data.frame() %>% 
                            dplyr::select(-geometry), 
                          method = "glm.nb")
  
  #generating predictions
  fishnetTest$pred.nbFit <- predict(nbFit, as.data.frame(fishnetTest))

8.1.4 Creating Variable Importance Graphs to Compare Variables in a Model

  standardized <- as.data.frame(lm.beta(pFit))
  standardized$variable <- row.names(standardized)
  colnames(standardized)[1] <- "std_coefficient"
  standardized$absCoef <- abs(standardized$std_coefficient)
  
  ggplot(standardized, aes(x=reorder(variable,-absCoef), y=absCoef, fill=variable)) + 
    geom_bar(stat="identity", color="black", size = 0.75) +
    scale_fill_manual(values = c("#006d2c", "#006d2c", "#006d2c", "#006d2c", "#006d2c")) +
    labs(x= "Variable",
         y="Relative Importance in Model",
         title= "Variable Importance",
         subtitle = "Community Protective Resources Model") + 
    scale_x_discrete(labels = c("SNAP Vendors", "Bus Stops", "State Capital", "Supermarkets", "Parks")) +
    theme(legend.position = "none") +
    plotTheme()

8.1.5 Running XGBoost

  #create a data set that is just the dependent variable from the training set
    y_train <- fishnetTrain$Count

  #make a fishnetTest without the geometry column for running XGBoost predictions
    fishnetTest.xgb <-
      fishnetTest %>% 
      as.data.frame() %>% 
      dplyr::select(-geometry)

  #create a training set that is all of the variables you'd like to compare
    train.a <- 
      fishnetTrain %>% 
      as.data.frame() %>% 
      dplyr::select(supermarket.nn, grocery.nn, library.nn, drycleaner.nn, ems.nn, law.nn,
                    hospital.nn, fire.nn, bus.nn, snap.nn, college.nn, prvschool.nn,
                    pubschool.nn, statecap.dist, cityhall.dist, cbd.dist, park.dist)

  #run XGBoost
    fit_xgb <- xgboost(data.matrix(train.a), y_train
                         , max_depth = 7
                         , eta = 0.02
                         , nthread = 2
                         , nrounds = 3000
                         , subsample = .7
                         , colsample_bytree = .7
                         , booster = "gbtree"
                         , eval_metric = "rmse"
                         , objective="count:poisson")
    
  #generating predictions
    fishnetTest$pred.xgb <- predict(fit_xgb, data.matrix(fishnetTest.xgb))


  #show variable importance
    importance_matrix1 <- xgb.importance(colnames(train.a), model = fit_xgb)
    
    var.imp.plot.xgb1 <-ggplot(data = importance_matrix1, aes(x=reorder(Feature,Gain), y=Gain, fill="#BD0026")) +
      geom_bar(stat = "identity", colour = "black", size =1) +
      scale_fill_manual (values = "#BD0026") +
      labs(title = "XGBoost: Variable Importance",
           x="Variable",
           y="Variable Importance") +
      theme(legend.position = "none")
    
    var.imp.plot.xgb1 + coord_flip()

8.1.6 Creating an Ensemble model

  #create separate data sets containing the variables in each story
    story.a <- 
      fishnet %>% 
      as.data.frame() %>% 
      dplyr::select(snap.nn, bus.dist, park.dist, Count)
    
    story.d <- 
      fishnet %>% 
      as.data.frame() %>% 
      dplyr::select(assault.nn, code.dist, overgrowth.nn, Count)
    
    story.nhood <- 
      fishnet %>% 
      as.data.frame() %>% 
      dplyr::select(nhood, Count)

  #create the ensemble function
    ensembleFunction <- function(inputStory, iterations, sampleSize) {
      #create an empty dataframe
      endList <- data.frame(matrix(NA, nrow = nrow(inputStory), ncol = 0))
      
      #build `iterations` models, each trained on a random sample containing a `sampleSize` share of the rows
      for (i in 1: iterations ) {
        sample <- sample.int(n = nrow(inputStory), size = floor(sampleSize *nrow(inputStory)), 
                             replace = F)
        train <- inputStory[sample, ]
        
        #train the model (naming the response in the formula excludes Count from the "." predictors)
        thisTrain <- glm(Count ~ ., family = "poisson", data = train)
        
        #create a vector of predictions for this model.
        thisPrediction = predict.glm(thisTrain, as.data.frame(fishnet), type = "response") 
        
        #append the predictions to the data frame         
        endList <- cbind(endList, thisPrediction)
      }
      
      #label each prediction column 1 through n
      colnames(endList) <- paste("prediction", 1:ncol(endList), sep = "")
      
      #create a data frame of average predictions. 
      return(data.frame(mean.Prediction = rowMeans(endList)))
    }
    
  #run for the 3 different stories
    allStories.predictions <- 
      data.frame(
        cbind(
          story1.mean.predictions = ensembleFunction(story.a, 100, .9)[, 1],
          story2.mean.predictions = ensembleFunction(story.d, 100, .9)[, 1],
          story3.mean.predictions = ensembleFunction(story.nhood, 100, .9)[, 1]
        ))
    
  #look at the correlation between stories (don't want it to be too high)
    cor(allStories.predictions)
##                         story1.mean.predictions story2.mean.predictions
## story1.mean.predictions               1.0000000               0.6058870
## story2.mean.predictions               0.6058870               1.0000000
## story3.mean.predictions               0.3786943               0.4590718
##                         story3.mean.predictions
## story1.mean.predictions               0.3786943
## story2.mean.predictions               0.4590718
## story3.mean.predictions               1.0000000
    ggpairs(allStories.predictions,diag = list("continuous"="blank"))

    cor.test(allStories.predictions$story1.mean.predictions, allStories.predictions$story2.mean.predictions)
## 
##  Pearson's product-moment correlation
## 
## data:  allStories.predictions$story1.mean.predictions and allStories.predictions$story2.mean.predictions
## t = 36.077, df = 2244, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5790371 0.6314242
## sample estimates:
##      cor 
## 0.605887
  #partition ensemble predictions 60/40 for train/test
    ensembleTrain <- allStories.predictions[partition, ]
    ensembleTest <- allStories.predictions[-partition, ]
  
  #create a model from the 3 stories' predictions
    pFit_e <- glm(fishnetTrain$Count ~ ., data=ensembleTrain, family = "poisson")

8.1.7 Generating Prediction Maps

 #predict on the full data set
    fishnet$pred.pFit <- predict.glm(pFit, as.data.frame(fishnet), type = "response")
    
  #mapping the predictions
    ggplot() + 
      geom_sf(data = fishnet, aes(fill = factor(ntile(pred.pFit, 10))), color = NA) + 
      scale_fill_manual(values = palette10,
                        name="Predictions \n(Decile Breaks)", 
                        labels = c("Low", "", "", "", "", "", "", "", "", "High")) +
      geom_sf(data = boundary, aes(), fill = NA, color = "grey40", size = 0.5) +
      labs(title="Map of Predictions") +
      mapTheme()

8.2 Model Validation Metrics

There are many different validation metrics that can be used to evaluate the relative success of each model. The below section walks through the validation metrics used in this project, including examining the distribution of errors, comparing to a traditional kernel density model, cross validating across many folds, and spatially cross validating across different areas of the city.
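Two metrics used repeatedly below, mean absolute error (MAE) and mean absolute percentage error (MAPE), are simple to compute directly; a toy example on made-up counts:

```r
#MAE averages the absolute errors; MAPE divides each error by the actual
#value, which is why it blows up when actual counts of 0 are present
actual    <- c(1, 2, 4, 5)
predicted <- c(0.5, 2.5, 3, 6)

mae  <- mean(abs(actual - predicted))           #0.75
mape <- mean(abs(actual - predicted) / actual)  #0.3
```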

8.2.1 Predicting on the Test Set and Calculating Error

This code walks you through using the models to predict on the test set, calculating errors, and examining the distribution of errors, both in quantity and across space.

  #Predicting on the Test Set
    fishnetTest$pred.pFit <- predict.glm(pFit, as.data.frame(fishnetTest), type = "response") #predict on the test set
    fishnetTest$pred.pFit[is.na(fishnetTest$pred.pFit)] <- 0  #replace any NA predictions with 0
  
  #Calculating the Mean Absolute Error
    #take the absolute value of the actual count minus the predicted count
    fishnetTest$absError.pFit <- abs(fishnetTest$Count - fishnetTest$pred.pFit) 
    #take the mean of this value
    mean(fishnetTest$absError.pFit) 
## [1] 0.4234824
  #Visualizing the distribution of residuals
    ggplot(fishnetTest, aes(absError.pFit)) +
      geom_histogram(bins = 20, fill = "#006d2c", color = "black", size = 0.75) +
      labs (x = "Residuals",
            y = "Count",
            title = "Distribution of Residuals, Community Protective Resources Model",
            subtitle = "Mean Absolute Error is 0.4234 Overdose Events") +
      plotTheme()

  #Mapping the residuals of each model
    PlainMap + 
      geom_sf(data = fishnetTest, aes(fill = factor(ntile(absError.pFit, 10))), color = NA) + 
      scale_fill_manual(values = palette10,
                        name="Residuals \n(Decile Breaks)", labels = c("Low", "", "", "", "", "", "", "", "", "High")) +
      labs(title="Map of Test Set Residuals") +
      mapTheme()

8.2.2 Comparing Predictions to Kernel Density

It is important to evaluate how well your model performs in comparison to the way things are currently done, in this case a kernel density hotspot map. This code walks you through building a function that will put the models on the same scale, divide them into risk categories, and output a map that compares the relative predictions of the two models.
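The core trick in the function below is percentile reclassification with .bincode(); on an invented toy vector, the same idea with quartiles looks like this:

```r
#reclassify a vector into quartile bins (1-4) using quantile breaks;
#the project function does the same with 100 percentile bins
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

q <- quantile(x, probs = seq(0, 1, by = 0.25))
bins <- .bincode(x, breaks = q, include.lowest = TRUE)

bins  #1 1 2 2 3 3 4 4
```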

  #Create a function that allows you to input the model and spit out the comparison graph
    cv.kd_function <- function(cv.kd_data) {
      
      #reclassify predictions into 100 percentile bins
      cv.kd_data$p.r1 <- as.numeric(.bincode(cv.kd_data$p,
                                             breaks = quantile(cv.kd_data$p, seq(0, 1, by = 0.01), na.rm = TRUE),
                                             include.lowest = TRUE))
      
      
      #reclassify predictions a second time - into 5 risk levels
      cv.kd_data$p.r2 <- cut(cv.kd_data$p.r1, c(-1, 30, 50, 70, 90, 100),
                             labels = c("5", "4", "3", "2", "1"))
      
      #summarize by group
      p.summ <- 
        cv.kd_data %>%
        group_by(p.r2) %>%
        summarize(p.count = sum(Count)) %>% 
        rename(group = p.r2) %>% 
        filter(!is.na(group))
      
      #reclassify kernel density into 100 percentile bins (labels = FALSE returns bin numbers directly)
      cv.kd_data$kd.r1 <- cut(cv.kd_data$od.dens,
                              breaks = quantile(cv.kd_data$od.dens, seq(0, 1, by = 0.01), na.rm = TRUE),
                              labels = FALSE, include.lowest = TRUE)
      
      #reclassify kernel density a second time - into 5 risk levels
      cv.kd_data$kd.r2 <- cut(cv.kd_data$kd.r1, c(-1, 30, 50, 70, 90, 100),
                              labels = c("5", "4", "3", "2", "1"))
      #summarize by group
      kd.summ <- 
        cv.kd_data %>%
        group_by(kd.r2) %>%
        summarize(kd.count = sum(Count)) %>%
        rename(group = kd.r2) %>% 
        filter(!is.na(group))
      
      #join these two tables
      countComparisons <- merge(p.summ, kd.summ)
      
      countComparisons <- cbind(
        countComparisons,
        data.frame(Category = c("90% - 100%", "70% - 89%", "50% - 69%", "30% - 49%", "1% - 29%")))
      
      countComparisons <- 
        countComparisons %>%
        dplyr::mutate(kernelPct = kd.count / sum(kd.count),
                      fittedPct = p.count / sum(p.count))
      
      countComparisonsLong <-
        countComparisons %>% 
        gather(Variable, Value, kernelPct:fittedPct)
      
      output <- 
        ggplot(data=countComparisonsLong, aes(Category,Value)) +
        geom_bar(aes(fill = Variable), position = "dodge", stat="identity", color = "black") +
        scale_fill_manual(values = c("#BD0026", "#FECC5C"),
                          labels=c("Spatial Risk Model", "Kernel Density")) +
        labs(x= "Predicted Risk Levels",
             y="Percent of Test Set Cases",
             title= "Goodness of Fit: Spatial Risk Model vs. Kernel Density hotspot")
      
      return(output)  
    }  
  
  #Then, for each model, you will just need to create a data frame with the correct column names and run the function
    cv.kd_data <- 
      fishnetTest %>%
      as.data.frame() %>%
      dplyr::select(Count, pred.pFit, od.dens, geometry) %>% 
      rename(p = pred.pFit)
    
    cv.kd_function(cv.kd_data)

8.2.3 Cross-Validation Across Many Folds

The cross-validation process allows you to determine whether the model predicts well on many different subsets of the data, not just the initial test set. The below code walks through this cross-validation process across 100 folds.
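The fold assignment in the code below relies on cut() over a row-index sequence; on a small toy vector the mechanics are easy to see:

```r
#cut(seq, breaks = k, labels = FALSE) splits row indices into k roughly
#equal contiguous folds -- hence the random shuffle beforehand
folds.toy <- cut(seq(1, 12), breaks = 4, labels = FALSE)
folds.toy  #1 1 1 2 2 2 3 3 3 4 4 4
```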

#Randomly shuffle the data
    fishnetTrain2 <- fishnetTrain[sample(nrow(fishnetTrain)),]
    
    #Create 100 equally sized folds
    folds <- cut(seq(1, nrow(fishnetTrain2)), breaks = 100, labels = FALSE)
    
    #Perform 100-fold cross validation
    endList <- data.frame()
    for(i in 1:100){
      #Segment the shuffled data by fold using the which() function 
      testIndexes <- which(folds == i, arr.ind = TRUE)
      testData <- fishnetTrain2[testIndexes, ]
      trainData <- fishnetTrain2[-testIndexes, ]
      
      #Re-fit the model on this fold's training data, then predict on the held-out fold
      thisTrain <- glm(Count ~  supermarket.nn + grocery.nn + library.nn + drycleaner.nn + ems.nn + 
                         law.nn + hospital.nn + fire.nn + bus.nn + snap.nn + college.nn + prvschool.nn +
                         pubschool.nn + statecap.dist + cityhall.dist + cbd.dist,
                       family = "poisson", data = trainData)
      thisPrediction <- predict.glm(thisTrain, as.data.frame(testData), type = "response")
      
      #Calculate MAPE and MAE (note: MAPE is undefined for cells where Count is 0)
      thisMAPE <- mean(abs(testData$Count - thisPrediction) / testData$Count) 
      thisMAE <- mean(abs(testData$Count - thisPrediction))  
      
      #cbind this fold's MAPE and MAE and append to the running results
      thisList <- cbind(thisMAPE, thisMAE)
      endList <- rbind(endList, thisList)
    }
    resulta <- endList
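
With the fold-level errors collected, a quick summary gives the average error and its spread across folds. This is a minimal sketch: it substitutes toy values for `endList` (which the loop above builds) so the snippet runs on its own.

```r
# Toy stand-in for endList: one MAPE/MAE pair per fold (5 folds here)
endList <- data.frame(thisMAPE = c(0.52, 0.48, 0.61, 0.55, 0.49),
                      thisMAE  = c(0.41, 0.39, 0.44, 0.42, 0.40))

# Average and spread of the error across folds; a stable model shows
# low fold-to-fold variance in MAE
foldSummary <- data.frame(meanMAE  = mean(endList$thisMAE),
                          sdMAE    = sd(endList$thisMAE),
                          meanMAPE = mean(endList$thisMAPE))
foldSummary
```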

8.2.4 Spatial Cross Validation Across the Whole City

It is important to determine how well the model predicts across different areas of the city. The below code walks you through the spatial cross-validation process across Providence's census tracts.

tracts2 <- tracts
  
  #Make the Function
    spatialCV <- function(model) {
      
      #initialize variables  
      training <- 0
      testing <- 0
      tuner <- 0
      currentUniqueID <- 0
      uniqueID_List <- 0
      y <- 0
      endList <- list()
      
      #create a list of all the spatial group unique ids in the data frame (i.e. all the census tracts)    
      uniqueID_List <- unique(model[["tracts"]])  
      x <- 1
      y <- length(uniqueID_List)
      
      #create a counter and while it is less than the number of tracts...  
      while(x <= y) 
      {
        
        #select the current tract    
        currentUniqueID <- uniqueID_List[x]
        
        #create a training set of grid cells outside that tract and a test set of cells
        #within that tract
        training <- model[ which(model[["tracts"]] != uniqueID_List[x]),]
        testing <- model[ which(model[["tracts"]] == uniqueID_List[x]),]
        
        trainingX <- training[ , -which(names(training) %in% c("Count"))]
        testingX <- testing[ , -which(names(testing) %in% c("Count"))]
        
        trainY <- training[["Count"]]
        testY <- testing[["Count"]]
        
        #fit a model - note the dependent variable is hardwired to Count    
        tuner <- glm(trainY ~ ., family = "poisson", 
                     data = trainingX %>% as.data.frame() %>% dplyr::select(-tracts, -geometry))
        
        #generate predictions for the test tract
        thisPrediction <- predict.glm(tuner, testingX, type = "response")
        
        #calculate MAE and the count of test records (to get a sense of sample size)
        thisMAE <- mean(abs(testY - thisPrediction))  
        countTestObs <- nrow(testingX)
        
        #add the current tract and its MAE to the output    
        thisList <- cbind(currentUniqueID, thisMAE, countTestObs)
        #create a final list to output
        endList <- rbind(endList, thisList)
        #iterate counter    
        x <- x + 1 
        
      } 
      #return the final list of tracts and associated MAEs  
      return (as.data.frame(endList))
    }
  
  #create models
    a.model <-  
      fishnettracts %>% 
      dplyr::select(snap.nn, bus.dist, park.dist, statecap.dist, supermarket.nn, Count, tracts) 
  
  #Run the spatial cross validation on the model
    SCVa <- spatialCV(a.model)
    
    SCVA <- 
      SCVa %>%
      mutate(tracts = as.character(currentUniqueID),
             a.FNcells = as.numeric(countTestObs),
             a.MAE = as.numeric(thisMAE)) %>% 
      dplyr::select(-thisMAE, -currentUniqueID)
    
  #Join back to the dataset
    forjoining <- 
      SCVA %>% 
      group_by(tracts) %>% 
      summarize(mean(a.MAE)) %>% 
      rename(a.meanMAE = `mean(a.MAE)`)
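
The per-tract errors can then be attached back to the tract table for mapping. The sketch below uses base R's `merge()` with hypothetical toy versions of `tracts2` and `forjoining`; the real objects come from the steps above, and with an sf object the merged result keeps its geometry.

```r
# Hypothetical stand-ins: a tract lookup table and the per-tract mean MAE table
tracts2 <- data.frame(tracts = c("T001", "T002", "T003"),
                      population = c(3200, 4100, 2800))
forjoining <- data.frame(tracts = c("T001", "T002", "T003"),
                         a.meanMAE = c(0.38, 0.45, 0.51))

# Left join the MAE onto the tract table by the shared "tracts" key
tractsWithMAE <- merge(tracts2, forjoining, by = "tracts", all.x = TRUE)
tractsWithMAE
```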

It is also important to test whether the model generalizes across different social and neighborhood characteristics. The below code walks you through running spatial cross validation to determine how well the model predicts across areas with different population densities.

tractstable <- read.csv("SCVtracts2.csv", header = TRUE, sep = ",")

tractstable$e.karl[tractstable$e.karl == Inf] <- 0

SCVresults1 <- tractstable %>% group_by(DensityQuartile) 

SCVresults1.5 <- SCVresults1 %>% summarize(results = mean(e.karl)) 

colnames(SCVresults1.5) <- c("Density", "Normalized Mean Absolute Error")

kable(SCVresults1.5, "html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| Density            | Normalized Mean Absolute Error |
|--------------------|--------------------------------|
| Bottom 25% Density | 0.6611757                      |
| Bottom 50% Density | 0.5656666                      |
| Top 25% Density    | 0.2552651                      |
| Top 50% Density    | 0.3987325                      |

8.3 Model Results

The below table shows the mean absolute error across the many different models run as part of this project.

| modelNum | modelName | Description | MAE |
|----------|-----------|-------------|-----|
| 1  | Community Protective Services All                  | All Variables Poisson                   | 0.4223 |
| 2  | Community Protective Services All                  | All Variables Negative Binomial         | 0.4219 |
| 3  | Community Protective Services Model                | Significant Variables Poisson           | 0.4235 |
| 4  | Community Protective Services Model                | Significant Variables Negative Binomial | 0.4237 |
| 5  | Community Protective Services Ensemble Input Model | Ensemble Input Poisson                  | 0.4244 |
| 6  | Community Protective Services Ensemble Input Model | Ensemble Input Negative Binomial        | 0.4250 |
| 7  | Risk Factor All                                    | All Variables Poisson                   | 0.4177 |
| 8  | Risk Factor All                                    | All Variables Negative Binomial         | 0.4172 |
| 9  | Risk Factor Model                                  | Significant Variables Poisson           | 0.4155 |
| 10 | Risk Factors Model                                 | Significant Variables Negative Binomial | 0.4237 |
| 11 | Risk Factors Ensemble Input Model                  | Ensemble Input Poisson                  | 0.4224 |
| 12 | Risk Factors Ensemble Input Model                  | Ensemble Input Negative Binomial        | 0.4219 |
| 13 | Neighborhood Characteristics All                   | All Variables Poisson                   | 0.4482 |
| 14 | Neighborhood Characteristics All                   | All Variables Negative Binomial         | 0.4481 |
| 15 | Neighborhood Characteristics Model                 | Significant Variables Poisson           | 0.4486 |
| 16 | Neighborhood Characteristics Model                 | Significant Variables Negative Binomial | 0.4484 |
| 17 | Neighborhood Characteristics Ensemble Input Model  | Ensemble Input Poisson                  | 0.4650 |
| 18 | Neighborhood Characteristics Ensemble Input Model  | Ensemble Input Negative Binomial        | 0.4481 |
| 19 | Final Ensemble Model                               | Ensemble Model Poisson                  | 0.4145 |
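
Each MAE in the table comes from the same computation: fit the count model on a training set, predict on held-out cells, and average the absolute errors. The self-contained sketch below uses simulated data; the variable names are illustrative, not the project's actual features.

```r
set.seed(801)

# Simulated fishnet: one illustrative predictor and a Poisson count outcome
n <- 500
dat <- data.frame(snap.nn = runif(n, 0, 2))
dat$Count <- rpois(n, lambda = exp(0.2 + 0.6 * dat$snap.nn))

# Split the cells, fit a Poisson regression, and score the held-out cells
trainIdx <- sample(n, 400)
fit <- glm(Count ~ snap.nn, family = "poisson", data = dat[trainIdx, ])
pred <- predict(fit, newdata = dat[-trainIdx, ], type = "response")

# Mean absolute error, the metric reported for every model above
mae <- mean(abs(dat$Count[-trainIdx] - pred))
mae
```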