This project was produced as part of the University of Pennsylvania Master of Urban Spatial Analytics Spring 2018 Practicum (MUSA 801), instructed by Ken Steif, Michael Fichman, and Karl Dailey. This document begins with a case study of predicting spatial risk of opioid overdoses in Providence, Rhode Island and is followed by a series of appendices that discuss data wrangling, data visualization, data sources, feature engineering, and model results. Navigate through the document either by using the panel at the left, or by clicking the hyperlinks throughout the document.
This project seeks to build a spatial risk model of opioid overdose events for the City of Providence, Rhode Island by examining current overdose locations, community protective resources, risk factors, and neighborhood characteristics. Assigning a level of risk to each area of the city can assist Providence and local stakeholders in strategically allocating resources in a way that will achieve the greatest impact. As of January 2018, Providence is implementing a Safe Stations program, where people struggling with substance abuse can come to any of the City’s 12 fire stations to be connected with supportive services. The spatial risk model will help Providence’s Department of Healthy Communities determine other areas at high risk of overdose events where the City could site additional interventions or supplement their communications efforts.
Opioid overdose events have skyrocketed in recent years, fueling a decline in United States life expectancy for the second year in a row and leading President Trump to declare the crisis a public health emergency. A Centers for Disease Control and Prevention report released in March 2018 found that emergency room visits for suspected opioid overdoses increased 30% between July 2016 and September 2017 (CDC, 2018). More than 115 Americans die every day from opioid overdoses, and Rhode Island has been hit especially hard by the crisis, with a rate of overdose deaths per 100,000 persons significantly higher than the national average.
The crisis has also escalated much faster in Rhode Island compared to the nation, with the rate of overdose deaths per 100,000 persons increasing nearly six-fold between 1999 and 2016.
Since 2014, over 1,000 people in Rhode Island have died from opioid overdoses. Over one quarter of these deaths have occurred in Providence.
In 2017, almost 700 overdose events were recorded in Providence. In the face of such an overwhelming public health crisis, one of the tools that the City of Providence is using to address the opioid crisis is its new Safe Stations program. Launched in January 2018, the program turns Providence’s 12 fire stations into 24/7 centers where trained staff connect individuals with supportive services. Based on a program created in Manchester, New Hampshire, Safe Stations demonstrates how a network of public, nonprofit, and private actors is working together to address the mounting problem of opioid overdoses.
The premise of Safe Stations is that a community-based response reduces barriers to accessing resources and provides a safe environment for people struggling with substance abuse. As can be seen below, because of their nature as emergency response facilities, Providence’s fire stations are relatively dispersed throughout the city’s neighborhoods.
This analysis seeks to determine whether Safe Stations are the assets best located to provide services for those struggling with substance abuse. Assigning an overdose risk level to each area of Providence can be a valuable tool. Knowledge of risk could allow the City and other stakeholders to assess the cost of providing Safe Stations in these locations relative to the potential benefits of opening stations in additional city facilities in high-risk areas, to determine high-risk areas where Safe Stations should be more heavily advertised, and to allocate overdose prevention and harm reduction resources, such as Narcan trainings, needle exchange centers, or safe injection sites, in areas at the highest risk.
Providing citywide spatial information regarding the risk of opioid overdose events empowers the city and nonprofits to continue to make their public health interventions more robust and could serve as a model for cities around the country that are interested in how to implement similar initiatives with the greatest impact.
The spatial risk model built for this analysis differs from a common approach to allocating resources: kernel density maps. Kernel density maps, or hotspot maps, interpolate past events across space, suggesting that resources are best allocated where many overdoses have occurred in the past. A kernel density map of Providence’s 2017 overdoses can be seen below:
In contrast, the spatial risk model predicts overdose events as a function of environmental risk and protective factors that vary across space. The unit of analysis for this model breaks down Providence into a set of 500-foot by 500-foot grid cells or a “fishnet” (shown in the map at the bottom left). The count of overdoses per grid cell serves as the dependent variable in the model (shown in the map to the bottom right).
At the fishnet level, it is clear that while the number of overdoses has been increasing in Providence, overdoses remain relatively rare events: only 18% of grid cells contain any of the approximately 700 overdoses recorded in 2017. This distribution of overdoses is taken into account throughout the modeling process. As the intent of this analysis is to identify high-risk locations that are actionable from a policy standpoint, grid cells allow for the determination of risk levels specific to an area the size of a city block. At the same time, the grid cells are large enough that they do not reveal personally identifiable information or skew the distribution of overdoses within the fishnet.
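The fishnet aggregation described above can be sketched in a few lines. This is an illustrative Python example, not the project's actual code (the original workflow likely used GIS tooling); the coordinates below are hypothetical and assumed to be in a projected system measured in feet.

```python
import numpy as np

def fishnet_counts(xs, ys, cell_size=500.0):
    """Count points per cell of a fishnet laid over the points' extent.

    xs, ys are projected coordinates in feet (hypothetical inputs);
    returns a dict mapping (col, row) cell indices to point counts.
    """
    x0, y0 = np.min(xs), np.min(ys)
    cols = ((np.asarray(xs) - x0) // cell_size).astype(int)
    rows = ((np.asarray(ys) - y0) // cell_size).astype(int)
    counts = {}
    for c, r in zip(cols.tolist(), rows.tolist()):
        counts[(c, r)] = counts.get((c, r), 0) + 1
    return counts

# Hypothetical overdose coordinates (feet)
xs = [10.0, 40.0, 600.0, 1300.0]
ys = [20.0, 90.0, 550.0, 80.0]
print(fishnet_counts(xs, ys))  # {(0, 0): 2, (1, 1): 1, (2, 0): 1}
```

Each grid cell's count then becomes one row of the modeling dataset, joined with the independent variables computed for the same cell.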
Independent variables are also included at the fishnet level. Therefore, the final dataset includes a series of characteristics about each 500-foot-square area of the city, along with a count of how many overdoses have occurred there. For the purposes of this analysis, data regarding independent variables was split into one of three categories, with the hypothesis that each would have a different relationship with overdoses: first, community protective resources, such as hospitals or parks, which are amenities where other public health interventions could possibly be sited; second, risk factors, such as crime or 311 complaints; third, neighborhood-level characteristics, such as neighborhood boundaries or census information. The relationship of each of these three categories with overdose events is explored in-depth in the explanatory analysis and model building sections of this document.
To create the spatial risk model, we began with an extensive data wrangling process (described further in the data wrangling appendix). The overdose location data used in the model was provided through a tremendous effort by the City of Providence. The data was pulled from records of emergency medical services responses, which did not contain an indicator of whether or not the event was an overdose. Therefore, clinicians examined data from 2017 and pulled together a list of approximately 700 events that they considered to be overdoses. This created a one-of-a-kind dataset exclusively for this analysis. As this dataset contains health information, overdoses are always shown at an aggregated level to ensure the protection of personal privacy.
As with any data, there is potential for error in terms of whether all overdose events were captured and whether all events recorded after the fact were overdoses. Additionally, there were several address entry errors that had to be manually geolocated in order to ensure accuracy, which is a source of additional potential error. Furthermore, it is worth mentioning that this data is a small sample size, due to both the size of Providence (the city’s population is 179,000) and the fact that this data was collected only for 2017. While this analysis is worth conducting, the sample size precludes certain tests of model generalizability, such as testing across years. These limitations will be discussed throughout the analysis as relevant, but are necessary to mention at the outset.
In addition to overdose events, the data wrangling process included the collection of a wide variety of data regarding community protective resources, risk factors, and neighborhood characteristics in the City of Providence (listed in the data sources appendix). Following the data collection, we undertook a feature engineering process to extract different types of predictive power from the variables, and all of the features are listed in the feature engineering appendix. Once the entire dataset of overdoses and features had been created, we conducted an explanatory analysis process to determine which variables were correlated with overdoses.
With an understanding of which variables were correlated with overdose events, we constructed different predictive models, informed by machine learning techniques. We then conducted a validation process to ensure that the final model was generalizable across the different areas of Providence. The modeling process and model results are described in detail in the model building section and the model results appendix.
In addition to the overdose count dependent variable, several data sources were used to determine risk factors and protective assets in Providence, including:
The appendices below include a data dictionary with a full list of variables and their sources; a data wrangling appendix that explains how data was compiled, cleaned, and feature engineered; a list of the final features; a data visualization appendix that explains how to make basic graphics for this analysis; and a table that lists the results from each model.
To begin the exploratory analysis process, we constructed a dataset through an extensive data wrangling and feature engineering process (described in the data wrangling appendix). Each of the Community Protective Resource and Risk Factor variables was feature engineered to extract different levels of explanatory power. An example of this can be seen below in Figure 2.1a, which shows drug crimes in the past 180 days, distance to each drug crime, distance to the three nearest drug crimes (or “nearest neighbor”), and density of drug crime. The distance, nearest neighbor, and density variables are each included in the fishnet grid cells as potential independent variables in the modeling process. While each variable measures the same drug crime points, they measure spatial distribution differently. Distance, density, and nearest neighbor maps appear throughout the exploratory analysis.
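The feature-engineered forms in Figure 2.1a can be illustrated with a small sketch. This is hypothetical Python, not the project's code: the cell centers and crime coordinates are made up, distances are Euclidean in feet, and the Gaussian kernel density stands in for whatever kernel the analysis actually used.

```python
import numpy as np

def nn_features(cell_centers, points, k=3, bandwidth=1000.0):
    """For each fishnet cell center, compute three views of the same point
    layer (e.g. drug crimes): distance to the nearest point, mean distance
    to the k nearest points, and a simple Gaussian kernel density."""
    C = np.asarray(cell_centers, float)   # (n_cells, 2)
    P = np.asarray(points, float)         # (n_points, 2)
    d = np.linalg.norm(C[:, None, :] - P[None, :, :], axis=2)  # pairwise distances
    nearest = d.min(axis=1)
    knn = np.sort(d, axis=1)[:, :k].mean(axis=1)
    density = np.exp(-(d / bandwidth) ** 2).sum(axis=1)
    return nearest, knn, density

# Hypothetical cell centers and drug crime locations (feet)
cells = [(250.0, 250.0), (750.0, 250.0)]
crimes = [(0.0, 0.0), (300.0, 300.0), (900.0, 100.0), (1000.0, 1000.0)]
nearest, knn3, dens = nn_features(cells, crimes)
```

Although all three features summarize the same points, they behave differently in a model: nearest distance reacts to a single close event, the k-nearest-neighbor mean smooths over local clusters, and density summarizes the whole surrounding pattern.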
The Safe Stations program has been implemented at Providence’s 12 fire stations. The central question of this analysis is whether these Safe Stations are the most efficient locations to connect those struggling with substance abuse to supportive services, or if this program should be implemented at additional city facilities. In order to examine this question, we first evaluated existing Safe Stations’ proximity to overdose locations. The below map shows where overdoses are occurring within a quarter-mile distance from a fire station, where people could walk less than five minutes to the station to receive aid.
As can be seen in the below bar chart, there are stations that are experiencing considerably more overdoses within a quarter-mile and others that experience almost no overdoses within a quarter-mile. Engine 8, located in West End, had over 20 overdoses occur within a quarter-mile in 2017. In contrast, neighboring Engine 11 near the Reservoir neighborhood experienced one overdose within a quarter-mile. This metric can inform which stations may be in need of additional resources.
Counting the number of overdoses within a quarter-mile of fire stations demonstrates which Safe Stations may experience the most activity. However, it does not show whether fire stations are the best locations for Safe Stations within the entire city of Providence. We defined “best location” as a place where a relatively high number of opioid overdoses occur. We developed a hypothesis test that asks whether the average distance from overdoses to fire stations is closer than one would expect due to random chance alone. Figure 2.2c below visualizes the distribution of 999 randomly placed sets of 12 points (each simulating a set of fire stations) and their corresponding distances to the nearest 10 observed overdose events. The vertical line marks the actual distance between Providence’s 12 fire stations and their 10 nearest overdose events. 63.7% of the randomly generated station sets were closer to overdoses than the actual fire station locations, suggesting that Providence could locate Safe Stations, or supplementary efforts, closer to where overdoses are actually occurring.
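The permutation test behind Figure 2.2c can be sketched as follows, assuming projected coordinates in feet. The simulated overdose and station locations here are random placeholders, so the resulting share will not reproduce the 63.7% reported above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical projected coordinates (feet): observed overdoses and the
# 12 actual fire stations; real values would come from the city's data.
overdoses = rng.uniform(0, 30000, size=(700, 2))
stations = rng.uniform(0, 30000, size=(12, 2))

def mean_knn_distance(stations_xy, events_xy, k=10):
    """Average distance from each station to its k nearest overdose events."""
    d = np.linalg.norm(stations_xy[:, None, :] - events_xy[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :k].mean()

observed = mean_knn_distance(stations, overdoses)

# 999 randomly placed sets of 12 points simulate alternative station layouts.
permuted = np.array([
    mean_knn_distance(rng.uniform(0, 30000, size=(12, 2)), overdoses)
    for _ in range(999)
])

# Share of random layouts that sit closer to overdoses than the real stations.
share_closer = (permuted < observed).mean()
print(round(share_closer, 3))
```

A share well above 0.5, as found in the analysis, means most random layouts beat the actual station locations on proximity to overdoses.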
As mentioned above, the independent variables in this analysis were divided into three categories. The first of these categories is Community Protective Resources, which are considered to be neighborhood amenities, many of which could support additional public health interventions. These resources include locations such as hospitals, libraries, schools, and bus stops. The full list of amenities can be found in the feature appendix below. The hypothesis is that areas containing higher numbers of these community protective resources would have a lower number of opioid overdoses. However, this analysis demonstrates that, in reality, the relationship varies depending on the resource, and distance to these resources tends to be negatively correlated with overdose density: that is, overdoses tend to cluster near these amenities.
This section will discuss variables that ultimately proved to be useful in the model building process. While the dependent variable for this analysis is the count of overdose events, it is easier to explore and visualize correlation patterns when count has been recoded to a spatial density and, therefore, correlation plots below use the logged density of overdoses as a proxy for overdose counts.
One community resource that this analysis examines is the correlation between proximity to different types of education facilities and overdose events. As can be seen below, both private schools and colleges are clustered in College Hill, as well as the Downtown area of the city. The below scatter plots demonstrate that proximity to all three types of schools (public schools, private schools, and colleges) is correlated with a higher density of overdoses: the closer to an educational facility, the higher the density of overdoses. However, public school proximity has a slightly stronger correlation with overdoses.
Proximity to public transit is another variable that was considered in the model. In this case, public transit is represented by RIPTA bus stop locations. As can be seen below, distance to bus stops is also negatively correlated with overdose density, meaning overdoses tend to occur closer to bus stops.
Another community resource to note is food vendors. We wanted to examine whether there was a difference in correlation with density of overdoses between supermarkets and SNAP vendors, with the hypothesis that areas that are less well-served by supermarkets would have a higher number of overdoses. As can be seen below, there are many fewer supermarkets in Providence compared to SNAP vendors, and they are primarily concentrated in the northern portion of the City. SNAP vendors, in comparison, appear to be spread throughout the city, with a strong presence along commercial corridors in the central and southern portions where there is an absence of supermarkets. Examining the correlations of these amenities with overdoses shows that both have a negative correlation between proximity and the density of overdoses. However, SNAP vendors have a more negative correlation, indicating that proximity to a SNAP vendor tends to be an indicator of a higher number of overdoses compared to supermarkets.
The correlation between distance to parks and overdose incidents was also explored. Given that parks serve as one of the best measures of public gathering spaces, the hypothesis was that proximity to parks may be associated with a higher number of overdose events. Correlation tests supported this hypothesis: overdose density increases closer to parks in Providence. However, this correlation was weaker than that of any other community resource mentioned in this section.
While the hypothesis at the outset of the exploratory analysis process was that areas with a greater number of community resources would have experienced fewer overdose events in 2017, that is not actually the case. While the correlation between each community resource and overdoses varies, proximity to all community resources tends to suggest a higher density of overdoses.
The second category of variables that this analysis examines is risk factors, with the hypothesis that areas containing greater numbers of risk factors would experience higher numbers of overdoses. To explore this hypothesis, we looked at data related to three categories of risk factors: alcohol & tobacco vendors, data from 311 complaint calls, and crime data. Our hypothesis was largely confirmed throughout the risk factors we examined, and risk factors tended to have a greater correlation to density of overdoses than community resources.
Based on past research indicating that alcohol and tobacco vendors are associated with higher rates of crime, we wanted to explore if proximity to these vendors in Providence was associated with higher overdose density. Using business license data, we pulled the locations of tobacco vendors and liquor licenses in Providence. The below maps show that tobacco vendors tend to cluster more in the southwest area of the city and liquor licenses are very clustered in the center of the city. Scatterplots illustrate a similar relationship between each of these variables; higher numbers of overdoses occur in close proximity to both tobacco vendors and liquor licenses and decrease in number farther away.
The City provided data regarding 311 complaints that served as the basis for several risk factor categories, such as vacant buildings, abandoned vehicles, streetlight outages, graffiti, trash, and overgrown lots. These variables largely serve as indicators of neighborhood conditions and levels of investment in an area. Therefore, the hypothesis was that areas with a higher number of complaints would experience a greater number of overdoses. Due to the extensive nature of the 311 call dataset from the City’s open data site (shown below), we grouped this data into broader categories to better extract meaning from the data (listed in the feature appendix).
In Figure 2.4.2b below, we have highlighted four variables that ultimately proved to be useful in the modeling process due to their correlation with overdose incidents: abandoned buildings, abandoned cars, code violations, and overgrowth. The positive relationships between the density of these variables and overdoses suggest that greater numbers of overdose events are happening in areas with greater numbers of 311 complaints. The strongest correlations with overdoses are those of code violation complaints and abandoned building complaints, suggesting that disinvestment in the built environment has a strong relationship with overdose events.
Figure 2.4.2c below highlights these same four variables and the difference between their density and proximity to the three nearest neighbors. Throughout the modeling process, the spatial differences in these features were considered.
A third category of risk factors that we examined was crime data, with the hypothesis that overdoses would be strongly correlated with areas where other crimes were occurring. Specifically, we evaluated drug crimes (removing low-level, marijuana-related crimes) and assaults. The density maps below show that these crimes tend to be clustered toward the center of the city. Correlation tests confirmed our hypothesis that for both drug crimes and assaults, as the density of crime events increases, so does the number of overdose events. Furthermore, the correlation between these types of crime and overdose density was the highest by far of any variable that we examined.
The third category of data explored in this analysis related to broader neighborhood characteristics, such as geographic boundaries and demographic information. We first wanted to explore if overdoses are spatially concentrated in particular neighborhoods. The above bar plot shows that overdose events are not evenly distributed throughout the city. Certain neighborhoods, such as West End, Downtown, and Wanskuck see more overdoses than other areas of the city. However, overdoses are not isolated to a single part of the city and are taking place across many neighborhoods. The inclusion of neighborhood variables in the final model was important in order to capture some of the variation in overdose events across space.
Another measure of neighborhood characteristics is demographic data from the Census, which we included at the block group level. The below maps illustrate the citywide distribution of population density, median household income, unemployment rate, and poverty rate. The accompanying scatterplots show some association between count of overdoses and these demographic variables; more overdoses occur in areas with higher population density, lower median household income, higher unemployment rates, and higher percentages of families living in poverty. However, population density and median household income have a much higher correlation with overdose events than the other two variables. The equity implications of using economic-based variables were considered throughout the model building process.
The extensive data wrangling, feature engineering, and exploratory analysis processes informed the creation of a model to predict the risk of overdose events across the city of Providence. The following section will outline the process of selecting features and constructing models, validating models, and converting predictions into actionable intelligence.
Just as we divided the exploratory analysis into three categories of variables that tell different stories, we structured our model building process in a similar manner. We wanted to explore how well each of these ‘stories’ predicted overdose risk on its own. By building different spatial risk models for community protective resources, risk factors, and neighborhood characteristics, we sought to capture the predictive power of each of these stories separately before combining them into a final ensemble model that harnesses the predictive power of all three.
To account for the fact that overdoses are rare events, and therefore have not occurred in the majority of grid cells in the city, Poisson and Negative Binomial models were explored for each variable category. The variable selection process for each of the three models was informed by machine learning processes that identified the variables with the greatest predictive power across many iterations. The results of the top performing models are explored in further detail below. This process, and the results of all tested models, are further detailed in the modeling appendix.
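As a sketch of the count-model setup (not the authors' code, which may well have been written in R), scikit-learn's `PoissonRegressor` fits a Poisson regression on simulated rare-event counts; a negative binomial model would follow the same pattern with an additional dispersion parameter. The features and coefficients below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(1)

# Hypothetical fishnet data: two engineered features per grid cell and a
# rare-event count outcome drawn from a Poisson process for illustration.
n_cells = 2000
X = rng.normal(size=(n_cells, 2))
rate = np.exp(-1.5 + 0.8 * X[:, 0] - 0.4 * X[:, 1])   # mostly-zero counts
y = rng.poisson(rate)

# Fit a Poisson GLM; alpha is a small L2 penalty for numerical stability.
model = PoissonRegressor(alpha=1e-4, max_iter=300).fit(X, y)
print(model.coef_)   # should roughly recover the simulated 0.8 and -0.4
```

Because the dependent variable is a count that is zero in most cells, this family of models is a better fit than ordinary least squares, which assumes a continuous, symmetric outcome.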
The community protective resources model demonstrated that SNAP vendors and bus stops provided the strongest predictive power within this category of variables. The below graph shows the variables included in the final community protective resources model, in order of their relative predictive importance in the model.
The model containing the variables shown above was informed by a machine learning process called XGBoost that determined the relative importance of each community protective resource variable across many iterations. For variables with multiple feature engineered forms (distance, density, and nearest neighbor), only the nearest neighbor features were included to avoid duplication. The final model was then honed to include only features that were demonstrated to be important in the machine learning process.
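The variable-importance ranking can be reproduced in spirit with any gradient boosting library. The sketch below uses scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost, with made-up nearest-neighbor feature names; only the first two features drive the simulated counts, so they should dominate the ranking.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)

# Hypothetical nearest-neighbor features for each grid cell; only the first
# two actually drive the simulated overdose counts.
feature_names = ["snap_nn3", "bus_stop_nn3", "park_nn3", "library_nn3"]
X = rng.normal(size=(1500, 4))
y = rng.poisson(np.exp(-1.0 + 0.9 * X[:, 0] + 0.5 * X[:, 1]))

gbm = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X, y)

# Rank features by the model's impurity-based importance scores.
ranked = sorted(zip(feature_names, gbm.feature_importances_),
                key=lambda kv: -kv[1])
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```

Running many such fits on randomized subsets, as the analysis did, and keeping only consistently important features guards against a single lucky split inflating a variable's apparent importance.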
Each model created in this analysis utilizes a random subset of 60% of the city’s fishnet grid cells (a “training set”) to fit the model, allowing the model’s accuracy to be tested on the 40% of grid cells that were set aside (the “test set”). Residuals are the differences between the observed number of overdoses in a test set grid cell and the number of overdoses that the model predicted for that grid cell; the mean absolute error averages the absolute values of these residuals across all test set cells. The below graph shows the distribution of residuals for the community protective resources model, which are largely clustered between 0 and 1 overdose event. The mean absolute error for this model is 0.4234 overdose events.
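The train/test mechanics can be sketched in a few lines, using hypothetical per-cell counts and a deliberately naive stand-in model (predict the citywide mean) just to show where the residuals and the mean absolute error come from.

```python
import numpy as np

rng = np.random.default_rng(3)

def split_and_mae(y_true, y_pred_fn, train_frac=0.6):
    """Randomly assign 60% of grid cells to a training set, then score the
    remaining 40% with mean absolute error. `y_pred_fn` stands in for any
    fitted model's predictions on the test cells."""
    n = len(y_true)
    idx = rng.permutation(n)
    cut = int(train_frac * n)
    train_idx, test_idx = idx[:cut], idx[cut:]
    residuals = y_true[test_idx] - y_pred_fn(test_idx)
    return np.abs(residuals).mean()

# Hypothetical counts and a naive model that always predicts the mean count.
y = rng.poisson(0.4, size=1000)
mae = split_and_mae(y, lambda test_idx: np.full(len(test_idx), y.mean()))
print(round(mae, 3))
```

A real model replaces the constant predictor, but the error arithmetic is identical: the reported 0.4234 is this same average of absolute test-set residuals.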
In the final risk factors model, assaults provide the strongest predictive power, followed by 311 code violation complaints, and drug crimes.
The machine learning process also showed assaults to have, by far, the strongest predictive power.
Similar to the community protective resources model, the residuals for the risk factors model are clustered between 0 and 1. The mean absolute error for this model is 0.4155 overdose events, lower than that of community protective resources.
Assessing the variable importance of the neighborhood characteristics model demonstrated that the West End, Valley, and Silver Lake neighborhoods had, by far, the most importance in the model. The remaining variables in this model were either insignificant or carried much less weight relative to the top three neighborhoods.
Machine learning shows that, on its own, price per acre provides the strongest predictive power. This pattern does not emerge in the full neighborhood characteristics model, likely because holding neighborhood constant within the model controls for much of the variation in the census-based and land value variables.
Residuals for the neighborhood model are also clustered between 0 and 1. The model shows a mean absolute error of 0.4486, higher than the errors for the community protective resources and risk factor models.
Since each of the three models described above tells a different story about predicting overdose risk in Providence, we combined them into a single model through an ensembling process aimed at reducing overall error. This section describes the process of comparing the predictions and errors of the three different stories and using them to create a final ensemble model.
The outcome of each of the three models described above was a predicted risk map of overdose events across the City of Providence, shown below. In the prediction maps, it is clear how the variables in each model strongly inform its predictions. For example, in the community protective resources model, the bus routes are evident, and in the neighborhood model there are hard boundaries between the different neighborhoods. Furthermore, the predictions for all models range between 0 and roughly 4 overdose events per grid cell, while the actual number of overdoses per grid cell in the city ranges from 0 to 22. However, risk predictions are better considered on a relative scale, indicating high- and low-risk areas rather than predicting a specific number of events.
As described above, each model was trained on 60% of the fishnet grid cells, allowing for its prediction accuracy to be tested on the remaining 40% of the grid cells. The below set of maps shows the residuals, or errors, that each model produced when used to predict on the test set. Since these maps only include the 40% test set of the fishnet grid cells, the full city is not represented. The residuals range from 0 to roughly 20, a much greater range than the predictions themselves. However, the graphs above of the distribution of residuals across all models show that very few grid cells have errors over 3. Similar to the predictions maps, there are variations in errors across the different models; the risk factors model has lower errors in the city’s outer neighborhoods, and is closely followed by the community protective resources model. In contrast, the neighborhood characteristics model has much higher errors in the city’s outer neighborhoods and in the northern half of the city, likely because it contains neighborhood boundaries as a variable, which can be arbitrary at times.
The discrepancies in the above maps demonstrate that each of these three models is capturing different spatial risk patterns. Combining these models into a single ensemble model improved their predictive power and decreased the model’s overall error. To create the ensemble model, we utilized a process that ran each model 100 times with randomized training and test sets - allowing for an average prediction for each grid cell. Averaging the results of many versions of the model allows for the reduction of errors within each grid cell. Then, we built a model containing these averaged predictions for each of the three models. Ultimately, the process minimizes the error from each of three models, and runs them as a larger, combined model.
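A stylized version of the ensembling arithmetic is sketched below, with hypothetical prediction columns standing in for the three stories' averaged outputs: each "model" is simulated as truth plus noise, averaged over 100 runs, and the ensemble is an ordinary least-squares combination of the three averaged columns.

```python
import numpy as np

rng = np.random.default_rng(4)

n_cells = 800
# Hypothetical per-cell overdose counts; in the real workflow the three
# prediction columns come from the protective-resources, risk-factor, and
# neighborhood models, each averaged over 100 random train/test splits.
truth = rng.poisson(0.5, size=n_cells).astype(float)

def averaged_predictions(noise_sd, runs=100):
    """Average one model's noisy predictions over many randomized runs."""
    preds = truth[None, :] + rng.normal(0, noise_sd, size=(runs, n_cells))
    return preds.mean(axis=0)

P = np.column_stack([averaged_predictions(s) for s in (0.6, 0.4, 0.9)])

# Ensemble: regress the outcome on the three averaged prediction columns.
A = np.column_stack([np.ones(n_cells), P])
weights, *_ = np.linalg.lstsq(A, truth, rcond=None)
ensemble = A @ weights
mae = np.abs(ensemble - truth).mean()
```

Two effects reduce error here, mirroring the text: averaging over 100 runs shrinks each model's noise, and the regression weights lean on whichever stories are most reliable.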
A successful ensemble model requires that the input models not be strongly correlated with one another. Therefore, each model’s variables were culled in order to reduce correlation of variables within the three different models. The final ensemble model contains the following variables:
As can be seen in the graph below, the risk factors model provides the strongest predictive power to the final model, followed by the community protective resources model. The neighborhood characteristics model contributed the least predictive power, which is consistent with the higher level of errors that the model exhibited on its own.
Combining the stories reduced the model’s overall error, lowering the mean absolute error to 0.415. Looking at a map of the predictions for the ensemble model shows that the combined model captures more of the nuance in the higher risk area toward the center, with predictions ranging from 0-8 compared to 0-3 for the individual stories. The map of residuals, again shown on the test set containing 40% of the data, shows that the range of errors has decreased, and is now 0-17 compared with a high of 20 in previous models.
We validated this model through several measures of goodness of fit, which are explained in this section. First, we compared the results of the spatial risk model to a traditional kernel density map. Then, we conducted different cross-validation and spatial cross-validation tests to ensure that the model is generalizable, both on different fishnet test and training sets and across different areas of the city. The process that we used to conduct these tests is further detailed in the modeling appendix below.
As described earlier, kernel density maps, or hotspot maps, identify higher-risk areas of the city based on where overdose events have occurred in the past. It is important to evaluate if our model performs better than this traditional approach so that it can be used to inform and improve the way resources are allocated to combat the opioid crisis. The below graphs compare each of the models described above to a traditional kernel density approach by putting their predictions on the same scale and applying “risk levels” to that scale. Then, actual overdoses that occurred within the test set are counted based on which risk level they fall into. For this use case, the aim is to capture points in both of the top two risk levels; if all of the test set points were in the 90-100% risk category, the model would be overfit on the limited data that Providence has for 2017.
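The risk-level comparison can be sketched as percentile binning of predicted risk, counting what share of observed test-set overdoses falls into each band. The risk scores and counts here are simulated, and the band cut-points roughly mirror the levels discussed above.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical predicted risk for test-set grid cells and observed overdose
# counts in those cells (correlated with risk for illustration).
risk = rng.uniform(size=1000)
observed = rng.poisson(1.5 * risk)

def capture_by_risk_level(risk, observed, bands=(0, 30, 50, 70, 90, 100)):
    """Bucket cells into percentile risk levels and report the share of all
    observed overdoses falling in each level."""
    edges = np.percentile(risk, bands)
    shares = {}
    for lo, hi, lo_p, hi_p in zip(edges[:-1], edges[1:], bands[:-1], bands[1:]):
        if hi_p == 100:
            in_band = (risk >= lo) & (risk <= hi)   # top band is inclusive
        else:
            in_band = (risk >= lo) & (risk < hi)
        shares[f"{lo_p}-{hi_p}%"] = observed[in_band].sum() / observed.sum()
    return shares

shares = capture_by_risk_level(risk, observed)
```

Running the same binning on a kernel density surface yields the baseline: whichever approach captures a larger share of held-out overdoses in its top risk bands is the more useful allocation tool.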
As can be seen in the below graphs, while the first three models predict only marginally better than the kernel density approach, the final ensemble model performs noticeably better in both the 70-89% and 90-100% risk categories.
Cross-validation is important because it determines whether the model predicts well on many different subsets of the data, not just the initial test set. The cross-validation process involved splitting the fishnet into 100 random test sets and measuring average error for each model. The below histograms show the distribution of errors across these 100 random test sets for each of the four models. For all models, the error clusters below 0.5 overdose events. As with the distribution of errors shown for each model above, there are a few significantly larger outliers, likely accounting for the discrepancy between the prediction values and the areas with higher counts of overdoses. The histogram for the ensemble model shows that it has a higher number of test sets with a mean absolute error below 0.5 overdoses, and the value of its outlier errors is lower than that of the other models. This distribution of error indicates that across many subsets of the fishnet grid cells, the ensemble model performs with the least amount of error.
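A minimal sketch of that random cross-validation loop is below, with a simple lm() on synthetic data standing in for the project's models; each iteration draws a fresh 60/40 split and records the test-set MAE.

```r
set.seed(1)
# Synthetic fishnet: one predictor and a count outcome
fake.fishnet <- data.frame(x = runif(500))
fake.fishnet$Count <- rpois(500, exp(0.5 * fake.fishnet$x))

# 100 random 60/40 splits, recording test-set mean absolute error
maes <- sapply(1:100, function(i) {
  inTrain <- sample(nrow(fake.fishnet), size = 0.6 * nrow(fake.fishnet))
  fit  <- lm(Count ~ x, data = fake.fishnet[inTrain, ])
  pred <- predict(fit, fake.fishnet[-inTrain, ])
  mean(abs(fake.fishnet$Count[-inTrain] - pred))
})
hist(maes, main = "Distribution of MAE across 100 random test sets")
```

The resulting histogram is the same type of plot shown for each of the four models.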
Throughout the modeling process, the four models have been trained on random subsets of 60% of the city’s fishnet grid cells and tested on the remaining 40% of grid cells. However, testing the model on randomly selected grid cells in Providence is not enough to determine the generalizability of the model; the model also must be tested in different parts of the city, a process called spatial cross-validation. How well does the model predict risk in parts of Providence that have different characteristics? If there are discrepancies between neighborhoods, it is important to be aware of them when implementing any policies based on the model. The spatial cross-validation process involved running the models several times with each of Providence’s census tracts as a test set, predicting on that tract, and calculating the prediction error for each tract. The prediction error in this case was calculated by normalizing the average error for each census tract by the mean count of overdoses per grid cell in the tract. This was determined to be the best measure of spatial autocorrelation of errors for this use case and the process is further described in the modeling appendix.
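The leave-one-tract-out procedure can be sketched as follows, with synthetic tract labels standing in for Providence's census tracts and a simple lm() in place of the project's models. Each tract is held out in turn, and its MAE is normalized by the tract's mean overdose count per cell.

```r
set.seed(1)
# Synthetic grid cells grouped into five "tracts"
cells <- data.frame(tract = rep(paste0("T", 1:5), each = 40),
                    x     = runif(200))
cells$Count <- rpois(200, exp(0.5 * cells$x))

# Hold out each tract, train on the rest, compute normalized MAE
norm.mae <- sapply(unique(cells$tract), function(tr) {
  train <- cells[cells$tract != tr, ]
  test  <- cells[cells$tract == tr, ]
  pred  <- predict(lm(Count ~ x, data = train), test)
  mean(abs(test$Count - pred)) / mean(test$Count)
})
norm.mae
```

Mapping these per-tract values produces the spatial cross-validation maps discussed below.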
The maps below show the outcome of the spatial cross-validation process. While the errors across the models are somewhat clustered, there are clear differences between the four models. The ensemble model shows the fewest tracts with high overdose prediction error. While there are higher errors in the southern portion of the city, this area includes neighborhoods such as Reservoir and South Elmwood that did not experience high numbers of overdoses in 2017. Regardless, this distribution of errors must be kept in mind when utilizing the risk predictions.
In addition to determining if a model is generalizable across space, it is also important to evaluate if it is generalizable across different social and neighborhood conditions. To do this, we evaluated how well our model predicts overdose events across three social patterns: population density, median household income, and families living in poverty.
The below table shows the level of error the ensemble model produces when predicting overdose events across four quartiles of population density in Providence. For this model, the level of error increases as the city becomes less dense, indicating that the model better predicts risk in areas closer to Downtown Providence. This is consistent with the errors observed in the map above, where the southern portion of the city had higher errors.
Density | Normalized Mean Absolute Error |
---|---|
Bottom 25% Density | 0.6611757 |
Bottom 50% Density | 0.5656666 |
Top 25% Density | 0.2552651 |
Top 50% Density | 0.3987325 |
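A table like the one above can be produced by binning grid cells on a census variable and computing the normalized MAE within each bin. This sketch uses synthetic densities, counts, and predictions, and quartile bins in place of the overlapping top/bottom groupings shown in the table.

```r
set.seed(1)
# Stand-ins for grid-cell density, actual counts, and model predictions
cells <- data.frame(pop.dens = runif(400),
                    Count    = rpois(400, 1),
                    pred     = runif(400, 0, 2))

# Quartile bins of population density
dens.bin <- cut(cells$pop.dens,
                quantile(cells$pop.dens, c(0, .25, .5, .75, 1)),
                labels = c("Bottom 25%", "25-50%", "50-75%", "Top 25%"),
                include.lowest = TRUE)

# Normalized MAE within each density bin
nmae.by.dens <- tapply(seq_len(nrow(cells)), dens.bin, function(i)
  mean(abs(cells$Count[i] - cells$pred[i])) / mean(cells$Count[i]))
nmae.by.dens
```

The same pattern, with income or poverty quartiles in place of density, yields the two tables that follow.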
However, when examining other demographic and economic characteristics, the level of error is more consistent. The below table shows the ensemble model’s error in predicting overdose events in high-income areas (top 25% citywide) and low-income areas (bottom 25% citywide). While high-income neighborhoods show somewhat less error than low-income neighborhoods, the discrepancy is much smaller than that observed for population density.
Income Level | Normalized Mean Absolute Error |
---|---|
High Income | 0.3894371 |
Low Income | 0.4971342 |
Lastly, the below table shows the model’s level of error in predicting overdose events in high-poverty areas (top 25% citywide) and low-poverty areas (bottom 25% citywide). While the model predicts slightly better for high-poverty areas, the difference in error levels between these two groups is very low.
Poverty Rate | Normalized Mean Absolute Error |
---|---|
High Poverty | 0.4335743 |
Low Poverty | 0.4824218 |
The creation of the final ensemble model and its corresponding risk predictions allowed us to return to the question of whether Safe Stations are the resources best located in Providence to respond to the opioid crisis. Should more resources be allocated to stations that are in higher-risk areas? Should additional Safe Stations be opened in high-risk areas at other publicly-owned facilities? Could advertising for existing facilities be supplemented in high-risk areas to point people to the resources that they need? To convert the ensemble model’s final risk predictions into actionable intelligence, we built a web application that the City of Providence and other stakeholders can use to understand risk in an area and what facilities are best suited to respond.
The web application shows the final risk prediction map and allows users to examine a neighborhood’s average prediction level and compare it to that of other Providence neighborhoods. Then, it allows users to view the locations of community protective resources, such as grocery stores, libraries, hospitals, etc. Each facility has a visible quarter-mile buffer and displays a count of how many overdoses occurred within that distance in 2017. The goal is that with the risk prediction map, along with the counts of nearby actual overdoses, the City can clearly see which facilities are best suited to host or advertise additional interventions. The final web application can be found at this link.
The final spatial risk ensemble model predicts overdoses better than a kernel density map in the 90-100% and 70-89% risk areas of Providence. Prediction errors are higher in the southern and eastern portions of the city, and errors vary significantly based on population density - a fact that must be considered when utilizing this model’s predictions. However, the model is generalizable across neighborhoods of Providence with different economic characteristics. In the future, this model can be improved through the inclusion of more data. As the City continues to collect information about the location of overdoses, this model can be honed by using all of the 2017 data to predict for 2018, or by introducing seasonality variables. Continually adding overdose data can allow the City of Providence to keep making improvements that minimize the model’s error and improve its generalizability across the City.
Variable | Description | Distance | Density | 3 Nearest Neighbors | Source |
---|---|---|---|---|---|
Count | Count of ODs in grid cell | N | N | N | OD points |
dummy.od | Yes/No OD in grid cell | N | N | N | OD points |
od | Opioid overdose events | Y | Y | Y | OD points |
statecap | State Capitol | Y | N | N | Point location |
cityhall | City Hall | Y | N | N | Point location |
nhood | Neighborhood name | N | N | N | Neighborhood Boundaries |
cbd | Downtown | Y | N | N | Neighborhood Boundaries |
priceperacre | Average price per acre of parcels in grid cell | N | N | N | Parcel Boundaries 2017 |
assault | Assaults in past 180 days | Y | Y | Y | Police Case Logs |
drugcrime | Drug crime in past 180 days | Y | Y | Y | Police Case Logs |
supermarket | Supermarkets | Y | Y | Y | Business Licenses |
grocery | Grocery stores (includes corner stores) | Y | Y | Y | Business Licenses |
drycleaner | Dry cleaners and laundromats | Y | Y | Y | Business Licenses |
tobacco | Tobacco vendors | Y | Y | Y | Business Licenses |
liq | Liquor licenses | Y | Y | Y | Business Licenses |
snap | SNAP vendors | Y | Y | Y | SNAP Vendors |
college | Colleges and universities | Y | Y | Y | Schools |
prvschool | Private schools | Y | Y | Y | Schools |
pubschool | Public schools | Y | Y | Y | Schools |
overgrowth | Overgrowth complaints | Y | Y | Y | 311 Data |
light | Out streetlight complaints | Y | Y | Y | 311 Data |
water | Water-related complaints | Y | Y | Y | 311 Data |
traffic | High-speed traffic complaints | Y | Y | Y | 311 Data |
graffiti | Graffiti complaints | Y | Y | Y | 311 Data |
code | Code violation complaints | Y | Y | Y | 311 Data |
trash | Trash complaints | Y | Y | Y | 311 Data |
abldgs | Abandoned building complaints | Y | Y | Y | 311 Data |
acars | Abandoned cars complaints | Y | Y | Y | 311 Data |
ehp | Every Home Properties | Y | Y | Y | Every Home Properties |
library | Libraries | Y | Y | Y | Libraries |
bus | RIPTA bus stops | Y | Y | Y | RIPTA Bus Stops |
hospital | Hospitals | Y | Y | Y | Hospitals |
ems | Emergency medical services | Y | Y | Y | Emergency Medical Services |
law | Law enforcement facilities | Y | Y | Y | Law Enforcement Facilities |
fire | Fire stations | Y | Y | Y | Fire Stations |
prison | Correctional institutions | Y | Y | Y | Correctional Institutions |
pop.dens | Population density | N | N | N | American Community Survey, 2016 |
MedHHInc | Median household income | N | N | N | American Community Survey, 2016 |
PctPv | Poverty rate | N | N | N | American Community Survey, 2016 |
PctUnemp | Unemployment rate | N | N | N | American Community Survey, 2016 |
Compiling the data needed to complete the above analysis requires wrangling data from a variety of sources. The below appendix will explain the various methods undertaken to create the dataset necessary for a spatial risk model of opioid overdoses. These methods include creating a fishnet, filtering variables, creating distance and nearest neighbor variables, creating kernel density variables, utilizing the tidycensus API, and joining these and other variables to a fishnet.
To begin, load the necessary libraries and read in the data.
#Load the Libraries
library(knitr)
library(tidyverse)
library(tidycensus)
library(sf)
library(ggmap)
library(maptools)
library(sp)
library(spatstat)
library(FNN)
library(rgdal)
library(raster)
library(rasterVis)
library(grid)
library(gridExtra)
library(RColorBrewer)
library(stargazer)
library(ggplot2)
library(rgeos)
library(kableExtra)
library(caret)
library(MASS)
library(pscl)
library(glmnet)
library(Metrics)
library(corrplot)
library(QuantPsyc)
library(scales)
library(spdep)
library(xgboost)
library(Ckmeans.1d.dp)
library(gbm)
library(randomForest)
library(GGally)
#Turn off scientific notation
options(scipen=999)
#Set working directory
#setwd("C:/Users/jbutz/Box Sync/Providence")
setwd("C:/Users/AnnieStreetman.DESKTOP-9NUSKDD/Desktop/Box Sync/Providence")
#Read in the data and project it immediately!
ODpoints <- st_read("updated_data.shp") %>%
st_transform(3438)
boundary <- st_read("Providence_Boundary_unproj.shp") %>%
st_transform(3438)
roads <- st_read("Major_Roads_unproj.shp") %>%
st_transform(3438)
parks <- st_read("Parks_OpenSpace_unproj2.shp") %>%
st_transform(3438)
water <- st_read("area_water_unproj.shp") %>%
st_transform(3438)
nhoods <- st_read("Neighborhoods_unproj.shp") %>%
st_transform(3438)
parcels <- st_read("Parcels_updatedFeb_unproj.shp") %>%
st_transform(3438)
counties <- st_read("tl_2012_44_cousub.shp") %>%
st_transform(3438)
AllBiz <- st_read("AllBiz_unproj.shp") %>%
st_transform(3438)
SNAPvend <- st_read("SNAP_vendors_unproj.shp") %>%
st_transform(3438)
Vacant <- st_read("EveryHomeProperties_unproj.shp") %>%
st_transform(3438)
schools <- st_read("schools_unproj.shp") %>%
st_transform(3438)
prisons <- st_read("correctional_institutions_unproj.shp") %>%
st_transform(3438)
fire <- st_read("fire_station_unproj.shp") %>%
st_transform(3438)
hospital <- st_read("hospitals_unproj.shp") %>%
st_transform(3438)
law.enf <- st_read("Law_Enforcement_unproj.shp") %>%
st_transform(3438)
EMS <- st_read("EMS_unproj.shp") %>%
st_transform(3438)
bus <- st_read("bus_stops_unproj.shp") %>%
st_transform(3438)
library <- st_read("Libraries.shp") %>%
st_transform(3438)
Calls311 <- st_read("311_unproj.shp") %>%
st_transform(3438)
grocery <- st_read("groc_unproj.shp") %>%
st_transform(3438)
drugcrime <-st_read("drug_proj.shp") %>%
st_transform(3438)
assaults <- st_read("assaults_proj2.shp") %>%
st_transform(3438)
cityhall <- st_read("CityHall.shp") %>%
st_transform(3438)
statecap <- st_read("TrainStations.shp") %>%
st_transform(3438)
wards <- st_read("Wards_proj.shp")%>%
st_transform(3438) %>%
dplyr::select(district_1) %>%
rename(ward = district_1)
One of the central decisions for any model-building process is determining the unit of analysis. In a spatial risk model, the unit of analysis determines whether the model interprets risk by neighborhood, census tract, or block. In this model, the unit of analysis is a grid of 500-by-500-foot cells covering the entire city of Providence - otherwise referred to as a “fishnet.” The unit of analysis is discussed in further depth in the case study above, but this section explains how to create a fishnet. Ultimately, all dependent and independent variables will be joined to the fishnet, giving each 500-foot cell in Providence a value for each variable that can then be used in the model.
You can see from the below maps that the fishnet breaks Providence into 2,262 grid cells. Centroid points of the grid cells will be used to measure distance and any values that must be joined to grid cells will be averaged within their cells.
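A minimal sketch of fishnet construction with sf::st_make_grid() is below. A small square polygon stands in for the projected Providence boundary, and the object names are illustrative; the project's provGrid_proj and fishnetPoints are presumably built along these lines from the boundary shapefile.

```r
library(sf)

# A 2500 x 2500 ft square stands in for the projected city boundary
boundary.sketch <- st_sfc(st_polygon(list(rbind(c(0, 0), c(2500, 0),
                                                c(2500, 2500), c(0, 2500),
                                                c(0, 0)))))

# Build a grid of 500-ft square cells over the boundary's bounding box
grid.geom <- st_make_grid(boundary.sketch, cellsize = 500)
grid.geom <- grid.geom[boundary.sketch]   # keep cells intersecting the boundary

# Attach a unique ID per cell and take centroids for distance measures
provGrid.sketch <- st_sf(uniID = seq_along(grid.geom), geometry = grid.geom)
fishnetPoints.sketch <- st_centroid(provGrid.sketch)
length(grid.geom)
```

For the real boundary, the intersection step trims the grid to the city's irregular outline, which is how Providence ends up with 2,262 cells.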
Once the fishnet has been made, variables can be aggregated to its grid cells in numerous ways, which are outlined in further detail below. In this use case, the first step is to count the number of overdose events that have occurred within each 500-foot grid cell. This count will serve as the dependent variable in the spatial risk model.
#Count of ODs per fishnet grid cell
Count.FN <-
aggregate(ODpoints, provGrid_proj, length) %>%
bind_cols(provGrid_proj) %>%
dplyr::select(Count)
Count.FN$Count[is.na(Count.FN$Count)] <- 0 #replace NAs with 0
summary(Count.FN$Count)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2971 0.0000 21.0000
In order to set up many of the independent variables used in the model, they were filtered out of much larger publicly available datasets. For this analysis, several variables were filtered out of Providence’s business licenses, 311 data, school locations, and neighborhood boundaries. The below code demonstrates how to filter out variables, using Providence’s 311 data as an example.
#Filtering 311 data - using the Request Type column to create several variables
AbandonedProperties <- #new dataset name
Calls311 %>% #dataset to filter out of
filter(Request_Ty == "Abandoned Property" |
Request_Ty == "Abandoned Property Needs Reboarding") #Request_ty = this OR that
AbandonedCars <-
Calls311 %>%
filter(Request_Ty == "Abandoned Vehicles")
Trash <-
Calls311 %>%
filter(Request_Ty == "Dumping of trash, bulky items on Public Street or Vacant Lot" |
Request_Ty == "Dumping or Trash in Public Park" |
Request_Ty == "Overflowing Dumpster" |
Request_Ty == "Trash/debris on private property or Lot" |
Request_Ty == "Trash/litter on public street or sidewalk")
CodeViolations <-
Calls311 %>%
filter(Request_Ty == "General Permit Issues" |
Request_Ty == "Substandard Housing Code Issues or Violations")
Graffiti <-
Calls311 %>%
filter(Request_Ty == "Graffiti" |
Request_Ty == "Graffiti in a Park")
HighTraffic <-
Calls311 %>%
filter(Request_Ty == "Crosswalks and Road Striping" |
Request_Ty == "Potholes" |
Request_Ty == "Road Repair or Replacement" |
Request_Ty == " Sidewalk - Repair Existing" |
Request_Ty == "Speed bumps/Traffic calming" |
Request_Ty == "Speeding on Streets" |
Request_Ty == "Traffic/Walk Signal Malfunction")
Water <-
Calls311 %>%
filter(Request_Ty == "Broken Water Pipe" |
Request_Ty == "Catch basin blocked or broken" |
Request_Ty == "Storm Drain Backup" |
Request_Ty == "Street Flooding")
Light <-
Calls311 %>%
filter(Request_Ty == "Decorative Street Light" |
Request_Ty == "Street Light" )
Overgrowth <-
Calls311 %>%
filter(Request_Ty == "Overgrowth/High Grass on Private Property" |
Request_Ty == "Overgrowth/High Grass on Public Property")
The next two sections will walk through some of the most important variables to include in a spatial risk model - distance- and density-based variables. For example, including a variable indicating whether or not a grid cell contains a licensed tobacco vendor is not as illustrative as including a variable that tells each grid cell its distance to the nearest tobacco vendor. This is explained in further depth in the feature engineering discussion above. This section explains how to create two types of variables: distance variables and nearest neighbor variables. Nearest neighbor variables take the average distance to the n nearest instances of a feature - for example, the n nearest licensed tobacco vendors - and therefore often provide a better measure of proximity to certain types of amenities or disamenities. Distance and nearest neighbor variables use almost exactly the same code, a few examples of which are shown below. For this analysis, distance and distance-to-3-nearest-neighbors variables were generated for every point variable, such as tobacco vendors, liquor licenses, and public schools - the full list is included in the data dictionary appendix. These values were then joined back to the final fishnet for use in the spatial risk model.
#Distance Variables
#Create matrix function for datasets
dataXY_function <- function(data) {
dataXY <-
data %>%
st_coordinates() %>%
as.data.frame() %>%
dplyr::select(X, Y) %>%
as.matrix()
return(dataXY)
}
#Create function for distance to X nearest neighbors
nn_function <- function(measureFrom,measureTo,k) {
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint)
return(output)
}
#Create Fishnet Points Matrix
fishnetPointsXY <- dataXY_function(fishnetPoints)
#OD points
#Create Matrix and run distance function
odXY <- dataXY_function(ODpoints)
dist.od.1 <- nn_function(fishnetPointsXY, odXY, 1) #For distance, the neighbor value is set to 1
#Tabular and Spatial Join to fishnet
dist.od <-
fishnetPoints %>%
bind_cols(dist.od.1)
dist.od.FN <-
aggregate(dist.od, provGrid_proj, mean) %>%
bind_cols(Count.FN) %>%
rename(OD.dist = pointDistance)
#Tobacco Stores
#Create Matrix and run distance function
tobaccoXY <- dataXY_function(tobacco)
dist.tobacco.1 <- nn_function(fishnetPointsXY, tobaccoXY, 1)
#Tabular and Spatial Join to fishnet
dist.tobacco <-
fishnetPoints %>%
bind_cols(dist.tobacco.1)
dist.tobacco.FN <-
aggregate(dist.tobacco, provGrid_proj, mean) %>%
bind_cols(dist.od.FN) %>% #join this back to the most recently made dataset
rename(tobacco.dist = pointDistance)
#Nearest Neighbor Variables
#OD
#Run nearest neighbor function
nn.od.1 <- nn_function(fishnetPointsXY, odXY, 3) #change the number to change the number of neighbors measured
#Tabular and Spatial Join to fishnet
nn.od <-
fishnetPoints %>%
bind_cols(nn.od.1)
nn.od.FN <-
aggregate(nn.od, provGrid_proj, mean) %>%
bind_cols(dist.statecap.FN) %>% #join to most recently made dataset
rename(OD.nn = pointDistance)
#Tobacco Stores
#Run nearest neighbor function
nn.tobacco.1 <- nn_function(fishnetPointsXY, tobaccoXY, 3)
#Tabular and Spatial Join to fishnet
nn.tobacco <-
fishnetPoints %>%
bind_cols(nn.tobacco.1)
nn.tobacco.FN <-
aggregate(nn.tobacco, provGrid_proj, mean) %>%
bind_cols(nn.od.FN) %>%
rename(tobacco.nn = pointDistance)
Another way to measure the relative proximity and magnitude of variables is through their kernel density. The following code explains how to measure kernel density and join the average density value in each grid cell to the fishnet. These examples are just two of the many variables whose density was compiled in the fishnet; the full list is included in the data dictionary appendix.
#Create an owin window of the Providence boundary
boundary$geometry #find your xmin and xmax
## Geometry set for 1 feature
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 335598.3 ymin: 251068.8 xmax: 361734 ymax: 283675.8
## epsg (SRID): 3438
## proj4string: +proj=tmerc +lat_0=41.08333333333334 +lon_0=-71.5 +k=0.99999375 +x_0=99999.99998983997 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs
## POLYGON ((359521.1 255864.1, 359523.1 255864.7,...
provWindow <- owin(c(335598.3,361734), c(251068.8,283675.8))
#Make Fishnet a spatial polygon to create a mask for visualizing and extracting data
mask <-
provGrid_proj %>%
as('Spatial') %>%
as('SpatialPolygons')
#Public Schools
#Create Density Map and Join to Fishnet
ppp.publicschools <- as.ppp(publicschoolXY, provWindow)
dens.publicschools.1 <- raster(density(ppp.publicschools, sigma = 1500)) #sigma is your search distance
dens.publicschools <- raster::extract(dens.publicschools.1, mask, fun = mean)
dens.publicschools.FN <-
cbind(dens.publicschools, nn.assaults.FN) %>% #join to the most recent dataset
rename(pubschool.dens = dens.publicschools)
#Private Schools
#Create Density Map and Join to Fishnet
ppp.privateschools <- as.ppp(privateschoolXY, provWindow)
dens.privateschools.1 <- raster(density(ppp.privateschools, sigma = 1500))
dens.privateschools <- raster::extract(dens.privateschools.1, mask, fun = mean)
dens.privateschools.FN <-
cbind(dens.privateschools, dens.publicschools.FN) %>%
rename(privschool.dens = dens.privateschools)
To add neighborhood characteristics, the Census API was used to pull a block group shapefile and variables thought to impact the number of overdoses, including population density, poverty rate, unemployment rate, and median household income. The code below utilizes the tidycensus package.
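The pull itself can be sketched as below. The get_acs() call is commented out because it requires a Census API key (see tidycensus::census_api_key()); the variable codes shown are standard ACS table IDs and are illustrative, not necessarily the exact set used in the project. The runnable portion demonstrates the downstream wrangling on a stand-in data frame.

```r
# Sketch of the ACS pull (requires an API key; codes are illustrative):
# library(tidycensus)
# BG2016 <- get_acs(geography = "block group", state = "RI",
#                   county = "Providence", year = 2016, geometry = TRUE,
#                   variables = c(Pop = "B01003_001", MedHHInc = "B19013_001"))

# Downstream wrangling, shown on a stand-in block-group data frame
BG.sketch <- data.frame(Pop      = c(1200, 800),   # total population
                        area     = c(0.5, 0.2),    # area in square miles
                        MedHHInc = c(41000, NA))   # median household income

BG.sketch$pop.dens <- BG.sketch$Pop / BG.sketch$area   # persons per sq mi
BG.sketch$MedHHInc[is.na(BG.sketch$MedHHInc)] <- 0     # NA incomes set to 0
BG.sketch$pop.dens
```

The NA-to-zero step mirrors the handling applied to the census table below.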
censustable <-
BG2016 %>%
as.data.frame() %>%
dplyr::select(-GEOID, -year, -geometry, -area, -Pop) %>%
rename("Median Household Income" = MedHHInc,
"Poverty Rate" = pctPov,
"Unemployment Rate" = pctUnemp)
censustable$`Median Household Income`[is.na(censustable$`Median Household Income`)] <- 0
stargazer(censustable, type = "text", title = "Census Variable Summary Statistics")
##
## Census Variable Summary Statistics
## ===============================================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------------------------
## Median Household Income 156 42,392.290 31,316.450 0 218,583
## Poverty Rate 156 0.209 0.192 0.000 0.913
## Unemployment Rate 156 0.112 0.090 0.000 0.528
## ---------------------------------------------------------------
Lastly, two additional neighborhood characteristics were added to the fishnet: first, the neighborhood name, and second, the average assessed land value per acre of parcels in each grid cell.
#Joining the neighborhoods to the fishnet
finalfishnetneighb <-
st_join(fishnetPoints, nhoods) %>%
dplyr::select(-shape_area, -shape_len)
finalfishnetneighb2 <-
st_join(finalfishnet, finalfishnetneighb) %>%
dplyr::select(-uniID.y) %>%
rename(nhood = lname)
finalfishnet <- finalfishnetneighb2
#Take a look at the output
table(finalfishnet$nhood)
##
## Blackstone Charles College Hill
## 167 100 87
## Downtown Elmhurst Elmwood
## 86 117 79
## Federal Hill Fox Point Hartford
## 59 62 78
## Hope Lower South Providence Manton
## 54 79 54
## Mount Hope Mount Pleasant Olneyville
## 95 140 63
## Reservoir Silver Lake Smith Hill
## 91 105 64
## South Elmwood Upper South Providence Valley
## 110 68 53
## Wanskuck Washington Park Wayland
## 128 151 52
## West End
## 104
#Joining the average parcel value to the fishnet
parcels$ass_total <- as.numeric(parcels$ass_total)
parcels$area_ac <- as.numeric(parcels$area_ac)
land.value <-
parcels %>%
mutate(priceperacre = ifelse(area_ac > 0, ass_total/area_ac, 0)) %>%
dplyr::select(-unit, -tax_year, -owner_city, -ass_bldg, -ass_total, -ownername, -parc_add,
-area_ac, -shape_leng, -owner_zip, -cama_link, -owner_ctry, -area_sf, -ownerstate, -shape_area,
-propid, -plat, -ass_land, -owner_add)
finalfishnetLV <-
aggregate(land.value, finalfishnet, mean) %>%
bind_cols(finalfishnet) %>%
dplyr::select(-lot, -geometry1)
finalfishnet <- finalfishnetLV
#Take a look at the output
summary(finalfishnet$priceperacre)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.075 0.520 0.846 196.121 2.998 5343.722 8
After preparing all of these variables (listed in the data dictionary here) and joining them into a fishnet, the final dataset used for model building is a set of 500-foot grid cells containing the following fields.
names(fishnet)
## [1] "priceperacre" "MedHHInc" "pctPv"
## [4] "PctUnemp" "assault.dens" "drugcrime.dens"
## [7] "supermarket.dens" "grocery.dens" "library.dens"
## [10] "drycleaner.dens" "ems.dens" "law.dens"
## [13] "hospital.dens" "fire.dens" "prison.dens"
## [16] "tobacco.dens" "ehp.dens" "overgrowth.dens"
## [19] "light.dens" "water.dens" "traffic.dens"
## [22] "graffiti.dens" "code.dens" "trash.dens"
## [25] "abldgs.dens" "acars.dens" "bus.dens"
## [28] "snap.dens" "liq.dens" "od.dens"
## [31] "college.dens" "prvschool.dens" "pubschool.dens"
## [34] "uniID" "assault.nn" "drugcrime.nn"
## [37] "supermarket.nn" "grocery.nn" "library.nn"
## [40] "drycleaner.nn" "ems.nn" "law.nn"
## [43] "hospital.nn" "fire.nn" "ehp.nn"
## [46] "overgrowth.nn" "light.nn" "water.nn"
## [49] "traffic.nn" "graffiti.nn" "code.nn"
## [52] "trash.nn" "abldgs.nn" "acars.nn"
## [55] "college.nn" "prvschool.nn" "pubschool.nn"
## [58] "bus.nn" "snap.nn" "liq.nn"
## [61] "tobacco.nn" "od.nn" "park.dist"
## [64] "statecap.dist" "cityhall.dist" "assault.dist"
## [67] "drugcrime.dist" "cbd.dist" "supermarket.dist"
## [70] "grocery.dist" "law.dist" "drycleaner.dist"
## [73] "ems.dist" "prison.dist" "fire.dist"
## [76] "hospital.dist" "ehp.dist" "overgrowth.dist"
## [79] "light.dist" "water.dist" "traffic.dist"
## [82] "graffiti.dist" "code.dist" "trash.dist"
## [85] "abldgs.dist" "acars.dist" "college.dist"
## [88] "prvschool.dist" "pubschool.dist" "bus.dist"
## [91] "snap.dist" "liq.dist" "tobacco.dist"
## [94] "od.dist" "Count" "pop.dens"
## [97] "nhood" "dummy.OD" "geometry"
## [100] "countbucket" "countbucket2" "a1"
## [103] "d1" "n1" "e1"
It is also very useful at the beginning of the process to set a few basic aesthetic norms that you can use throughout the project. The below code walks through creating several color palettes, a map theme, a basemap, and a plot theme.
#Set a few color palettes using the RColorBrewer library; these can be used later in ggplots
display.brewer.all() #shows your color palette options
brewer.pal("YlOrRd", n = 5) #provides you with hex codes for n number of colors needed in your palette
## [1] "#FFFFB2" "#FECC5C" "#FD8D3C" "#F03B20" "#BD0026"
#Copy and paste this output into named palettes
#General Palette
palette5 <- c("#FFFFB2", "#FECC5C", "#FD8D3C", "#F03B20", "#BD0026")
palette5_inverse <- c("#BD0026", "#F03B20", "#FD8D3C", "#FECC5C", "#FFFFB2")
palette10 <- c("#FFFFFF", "#FFFFCC", "#FFEDA0", "#FED976", "#FEB24C", "#FD8D3C", "#FC4E2A", "#E31A1C", "#BD0026", "#800026")
palette10_inverse <- c("#800026", "#BD0026", "#E31A1C", "#FC4E2A", "#FD8D3C", "#FEB24C", "#FED976", "#FFEDA0", "#FFFFCC", "#FFFFFF")
#Categorical Palettes
amenities_palette10 <- c("#FFFFFF", "#F7FCFD", "#E5F5F9", "#CCECE6", "#99D8C9", "#66C2A4", "#41AE76", "#238B45", "#006D2C", "#00441B")
amenities_palette10_inv <- c("#00441B", "#006D2C", "#238B45", "#41AE76", "#66C2A4", "#99D8C9", "#CCECE6", "#E5F5F9", "#F7FCFD", "#FFFFFF")
amenities_palette5 <- c("#EDF8FB", "#B2E2E2", "#66C2A4", "#2CA25F", "#006D2C")
risk_palette10 <- c("#FFFFFF", "#F7FCFD", "#E0ECF4", "#BFD3E6", "#9EBCDA", "#8C96C6", "#8C6BB1", "#88419D", "#810F7C", "#4D004B")
risk_palette10_inv <- c("#4D004B", "#810F7C", "#88419D","#8C6BB1", "#8C96C6", "#9EBCDA","#BFD3E6", "#E0ECF4", "#F7FCFD", "#FFFFFF")
risk_palette5 <- c("#EDF8FB", "#B3CDE3", "#8C96C6", "#8856A7", "#810F7C")
neighb_palette10 <- c("#FFFFFF", "#FFF7FB", "#ECE7F2", "#D0D1E6", "#A6BDDB", "#74A9CF", "#3690C0", "#0570B0", "#045A8D",
"#023858")
neighb_palette5 <- c("#F1EEF6", "#BDC9E1", "#74A9CF", "#2B8CBE", "#045A8D")
#A mapTheme function establishes aesthetics such as background color and axis labels that can be added to all maps
mapTheme <- function(base_size = 12) {
theme(
text = element_text(color = "black"),
plot.title = element_text(size = 16, colour = "black", face = "bold"),
plot.subtitle=element_text(size = 12, face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_rect(fill = "white"),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_line(colour="white"),
panel.grid.minor = element_line(colour = "white"),
panel.border = element_blank()
)
}
#Setting a plotTheme in addition to a mapTheme allows you to set consistent aesthetics for all plots
plotTheme <- function(base_size = 12) {
theme(
plot.title = element_text(size = 16, face = "bold"),
plot.subtitle=element_text(size = 12, face="italic"),
strip.text.x = element_text(size = 16),
text = element_text(size = 12),
axis.text = element_text(size = 14),
axis.title = element_text(size = 14, face = "italic"),
strip.background = element_rect(colour="white", fill="white")
)
}
#Make a basemap by creating a legend column for each layer and plotting them together
boundary <-
boundary %>%
dplyr::select(geometry) %>%
mutate(Legend = "Providence Boundary")
parks <-
parks %>%
dplyr::select(geometry) %>%
mutate(Legend = "Parks")
roads <-
roads %>%
dplyr::select(geometry) %>%
mutate(Legend = "Roads")
water <-
water %>%
dplyr::select(geometry) %>%
mutate(Legend = "Water")
BaseMap <- ggplot() +
geom_sf(data = boundary, aes(), fill = "grey85", color = NA, size = 1) +
geom_sf(data = roads, aes(), fill = NA, color = "white") +
geom_sf(data = water, aes(), fill = "slategray2", color = NA) +
mapTheme()
BaseMap + labs(title = "Base Map")
Mapping can be used to convey a variety of different data and is often a clear way to both explore and convey complicated spatial patterns. This section takes you through creating the different map types used in this project, including mapping data by fishnet, by polygon, and by points.
Maps by fishnet grid cell are especially useful in conveying spatial patterns from the feature engineering section, including count, density, and distance variables. The below code creates a map, in deciles, of the distance to parks across the city.
ggplot() +
geom_sf(data=fishnet, aes(fill=factor(ntile(park.dist, 10))), color = NA) +
scale_fill_manual(values = amenities_palette10_inv,
name = "Decile \n Breaks",
labels = (c("Low", "", "", "", "", "", "", "", "", "High"))) +
geom_sf(data = boundary, aes(), fill = NA, color = "grey40", size = 0.5) +
labs(title = 'Distance to Parks') +
mapTheme()
Maps by larger polygon, such as neighborhood boundaries or census tracts, allow the assessment of data available at those levels, including census data. The below code manually bins census data into quantiles and then maps those quantiles across the city.
BGmaps$IncBucket <- factor(
cut(as.numeric(BGmaps$MedHHInc),
c(0, 22041, 33657, 42375, 60152, 219000)),
labels = c("$0 to $22,041",
"$22,041 to $33,657",
"$33,657 to $42,375",
"$42,375 to $60,152",
"$60,152 to $218,583")
)
ggplot() +
geom_sf(data = BGmaps, aes(fill=IncBucket), colour = "black") +
labs(title= "Median Household Income") +
scale_fill_manual(values = neighb_palette5,
name = "Income\n(Quintile Breaks)") +
mapTheme()
Mapping data points can also be useful in examining data across the city. The below code walks you through creating a map of station points in Providence on the basemap.
# map the station points
fire$X <- (st_coordinates(fire)[,1])
fire$Y <- (st_coordinates(fire)[,2])
BaseMap +
geom_sf(data = fire, aes(), color = "#FD8D3C", size = 2.5) +
labs(title = "Safe Stations", subtitle = "Located at Providence's 12 Fire Stations") +
geom_text(data = fire, aes(label = Station_NU, x = X, y = Y), fontface = "bold",
vjust = -0.75, color = "grey15")
Bar charts can be useful to compare data by different groupings. The below code walks through the process of determining the count of overdoses within 1/4 mile of fire stations and graphing that in a bar chart.
#create quarter-mile buffers around the fire stations
stationsBuffer_QtrMi <-
st_buffer(fire, 1320)
#determine the count of overdoses within 1/4-mile of each fire station
count.by.station <-
aggregate(ODpoints, stationsBuffer_QtrMi, length) %>%
bind_cols(stationsBuffer_QtrMi) %>%
dplyr::select(Station_NU, Count)
#remove 0s
count.by.station$Count[is.na(count.by.station$Count)] <- 0
#plot the data in a bar chart
ggplot(data = count.by.station, aes(x = reorder(Station_NU, -Count), y = Count, fill = "#BD0026")) +
geom_bar(stat = "identity", color = "black", size = 1) +
scale_fill_manual(values = c("#BD0026")) +
labs(title = "Count of Overdose Incidents within a 1/4-Mile of Each Fire Station",
x = "Fire Station", y = "Count of Overdoses Within 1/4-Mile") +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold", size = 16), axis.title = element_text(face = "italic"))
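One caveat on the `st_buffer()` call above: sf interprets the distance argument in the units of the layer's coordinate reference system, so `1320` equals a quarter mile only because this project's layers use a foot-based State Plane projection. The conversion behind that magic number (base R, self-contained):

```r
# st_buffer() distances are in CRS units; this project's layers are in feet,
# so a quarter mile is written as 1320
feet_per_mile   <- 5280
meters_per_foot <- 0.3048
qtr_mile_ft <- feet_per_mile / 4             # 1320, the value passed to st_buffer()
qtr_mile_m  <- qtr_mile_ft * meters_per_foot # ~402.3, the equivalent for a meter-based CRS
```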
Scatterplots can be a useful way of examining the relationship between variables. The below code shows how to create a scatterplot that shows the relationship between the count of overdose events and distance to public schools.
ggplot() +
geom_point(data = fishnet, aes(pubschool.dist, log(od.dens)), color = "#006D2C") +
labs(title = "Distance to Public Schools by Number of Overdoses",
subtitle = "Distance to Public Schools & Logged OD Density Have a Correlation of -0.54",
x = "Distance to Public Schools (ft)",
y = "Number of Overdoses") +
geom_abline(intercept = -13.33735471, slope = -0.00031019,
color="black", size = 1.25) +
plotTheme() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
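The intercept and slope passed to `geom_abline()` above are hard-coded; fitting the line directly keeps the plot and the fit in sync. A sketch using simulated stand-ins for `fishnet$pubschool.dist` and `fishnet$od.dens`:

```r
# Hypothetical stand-ins for fishnet$pubschool.dist and fishnet$od.dens
set.seed(1)
pubschool.dist <- runif(200, 0, 8000)
od.dens <- exp(-13.3 - 0.0003 * pubschool.dist + rnorm(200, sd = 0.5))
# Fit the trend line; coef() supplies the intercept and slope for geom_abline()
fit <- lm(log(od.dens) ~ pubschool.dist)
coef(fit)
```

Alternatively, adding `geom_smooth(method = "lm", se = FALSE)` to the ggplot call fits and draws the trend line in one step.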
It is often much easier to see differences and similarities between visuals when they are arranged next to each other. This section walks you through the process of creating visual layouts with both the grid.arrange and facet_wrap methods.
#using grid.arrange to explore the distribution of tobacco vendors and liquor licenses and their relationship with overdoses
tobacco.map <- ggplot() +
geom_sf(data=fishnet, aes(fill=factor(ntile(tobacco.nn, 10))), color = NA) +
scale_fill_manual(values = risk_palette10_inv,
name = "Decile \n Breaks",
labels = (c("Low", "", "", "", "", "", "", "", "", "High"))) +
geom_sf(data = boundary, aes(), fill = NA, color = "grey40", size = 0.5) +
labs(title = 'Distance to 3 Nearest Tobacco Vendors') +
mapTheme()
liquor.map <- ggplot() +
geom_sf(data=fishnet, aes(fill=factor(ntile(liq.nn, 10))), color = NA) +
scale_fill_manual(values = risk_palette10_inv,
name = "Decile \n Breaks",
labels = (c("Low", "", "", "", "", "", "", "", "", "High"))) +
geom_sf(data = boundary, aes(), fill = NA, color = "grey40", size = 0.5) +
labs(title = 'Distance to 3 Nearest Liquor Licenses') +
mapTheme()
tobacco.scatterplot <-
ggplot() +
geom_point(data = fishnet, aes(tobacco.nn, log(od.dens)), color = "#810F7C") +
labs(title = "Distance to 3 Nearest Tobacco Vendors by Number of Overdoses",
subtitle = "Distance to Tobacco Vendors & Logged OD Density Have a Correlation of -0.70",
x = "Distance to 3 Nearest Tobacco Vendors (ft)",
y = "Number of Overdoses") +
geom_abline(intercept = -12.97287653, slope = -0.00066801,
color="black", size = 1.25) +
plotTheme() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
liquor.scatterplot <-
ggplot() +
geom_point(data = fishnet, aes(liq.nn, log(od.dens)), color = "#810F7C") +
labs(title = "Distance to 3 Nearest Liquor Licenses by Number of Overdoses",
subtitle = "Distance to Liquor & Logged OD Density Have a Correlation of -0.49",
x = "Distance to 3 Nearest Liquor Licenses (ft)",
y = "Number of Overdoses") +
geom_abline(intercept = -13.15816145, slope = -0.00053226,
color="black", size = 1.25) +
plotTheme() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
tlplots <- list(tobacco.map, tobacco.scatterplot, liquor.map, liquor.scatterplot)
tllay <- rbind(c(1,2,2),
c(3,4,4))
grid.arrange(grobs = tlplots, layout_matrix = tllay)
#Using facet wrap to view model predictions
fishnet$a1 <- predict.glm(pFit_a2, as.data.frame(fishnet), type = "response")
fishnet$d1 <- predict.glm(pFit_d1, as.data.frame(fishnet), type = "response")
fishnet$n1 <- predict.glm(pFit_n1, as.data.frame(fishnet), type = "response")
preda1 <-
fishnet %>%
dplyr::select(a1) %>%
rename(prediction = a1) %>%
mutate(model = "Community Protective Resources Model")
predd1 <-
fishnet %>%
dplyr::select(d1) %>%
rename(prediction = d1) %>%
mutate(model = "Risk Factors Model")
predn1 <-
fishnet %>%
dplyr::select(n1) %>%
rename(prediction = n1) %>%
mutate(model = "Neighborhood Model")
predictionmap.df <- rbind(preda1, predd1, predn1)
predictionmap.df$facet <- factor(predictionmap.df$model, levels = c("Community Protective Resources Model",
"Risk Factors Model",
"Neighborhood Model"))
predmap <- ggplot() +
geom_sf(data = predictionmap.df, aes(fill = factor(ntile(prediction, 10))), color = NA) +
scale_fill_manual(values = palette10,
name="Predictions \n(Decile Breaks)",
labels = c("0", "", "", "", "","","","","","3.84")) +
geom_sf(data = boundary, aes(), fill = NA, color = "grey40", size = 0.5) +
mapTheme()
predmap + facet_wrap(~facet, ncol = 3) + labs(title = "Predicted Overdose Events") +
theme(plot.title = element_text(face = "bold", size = 20), strip.text.x = element_text(size = 16),
strip.background = element_rect(colour="white", fill="white"))
Building a model that achieves the strongest fit for your use case involves extensive testing of different combinations of variables and model types. The below appendix explains the methods used in developing a model to predict the count of overdose events in Providence: separating training and test sets; building Poisson, negative binomial, XGBoost, and ensemble models; evaluating the strength of variables in each model; and using the model to predict on the full data set. It also walks through the validation metrics used to evaluate each model, including calculating mean absolute error, mapping the spatial distribution of residuals, making a histogram of residuals, comparing the model to kernel density maps, cross validating the model across many folds, and spatially cross validating the model across different parts of the city.
# 60 / 40 split for testing regressions
set.seed(777) #set a random seed to allow for duplication of results
partition <- createDataPartition(fishnet$Count, p = 0.6, list = FALSE) #create a partition
fishnetTrain <- fishnet[partition, ] #Training set is this partition of the fishnet (60%)
fishnetTest <- fishnet[-partition, ] #Test set is the remaining portion of the fishnet data (40%)
#training the model
pFit <- glm(Count ~ snap.nn + bus.dist + park.dist + statecap.dist + supermarket.nn,
family = "poisson", data = fishnetTrain)
#generating predictions
fishnetTest$pred.pFit <- predict.glm(pFit, as.data.frame(fishnetTest), type = "response")
#training the model
nbFit <- train(Count ~ snap.nn + bus.dist + park.dist + statecap.dist + supermarket.nn,
data = fishnetTrain %>%
as.data.frame() %>%
dplyr::select(-geometry),
method = "glm.nb")
#generating predictions
fishnetTest$pred.nbFit <- predict(nbFit, as.data.frame(fishnetTest))
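The choice between the Poisson and negative binomial fits above turns on overdispersion: Poisson regression assumes the variance of the count equals its mean, while the negative binomial allows the variance to exceed it. A quick diagnostic, using simulated counts as a hypothetical stand-in for `fishnet$Count`:

```r
# Simulated overdispersed counts standing in for fishnet$Count
set.seed(7)
counts <- rnbinom(2000, size = 0.5, mu = 0.4)
c(mean = mean(counts), variance = var(counts))
# Variance well above the mean signals overdispersion, favoring glm.nb
var(counts) > mean(counts)
```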
standardized <- as.data.frame(lm.beta(pFit))
standardized$variable <- row.names(standardized)
colnames(standardized)[1] <- "std_coefficient"
standardized$absCoef <- abs(standardized$std_coefficient)
ggplot(standardized, aes(x=reorder(variable,-absCoef), y=absCoef, fill=variable)) +
geom_bar(stat="identity", color="black", size = 0.75) +
scale_fill_manual(values = c("#006d2c", "#006d2c", "#006d2c", "#006d2c", "#006d2c")) +
labs(x= "Variable",
y="Relative Importance in Model",
title= "Variable Importance",
subtitle = "Community Protective Resources Model") +
scale_x_discrete(labels = c("SNAP Vendors", "Bus Stops", "State Capital", "Supermarkets", "Parks")) +
theme(legend.position = "none") +
plotTheme()
#create a data set that is just the dependent variable from the training set
y_train <- fishnetTrain$Count
#make a fishnetTest without geometry column for running xg boost predictions
fishnetTest.xgb <-
fishnetTest %>%
as.data.frame() %>%
dplyr::select(-geometry)
#create a training set that is all of the variables you'd like to compare
train.a <-
fishnetTrain %>%
as.data.frame() %>%
dplyr::select(supermarket.nn, grocery.nn, library.nn, drycleaner.nn, ems.nn, law.nn,
hospital.nn, fire.nn, bus.nn, snap.nn, college.nn, prvschool.nn,
pubschool.nn, statecap.dist, cityhall.dist, cbd.dist, park.dist)
#run XG Boost
fit_xgb <- xgboost(data.matrix(train.a), y_train
, max_depth = 7
, eta = 0.02
, nthread = 2
, nrounds = 3000
, subsample = .7
, colsample_bytree = .7
, booster = "gbtree"
, eval_metric = "rmse"
, objective="count:poisson")
#generating predictions
fishnetTest$pred.xgb <- predict(fit_xgb, data.matrix(fishnetTest.xgb))
#show variable importance
importance_matrix1 <- xgb.importance(colnames(train.a), model = fit_xgb)
var.imp.plot.xgb1 <-ggplot(data = importance_matrix1, aes(x=reorder(Feature,Gain), y=Gain, fill="#BD0026")) +
geom_bar(stat = "identity", colour = "black", size =1) +
scale_fill_manual (values = "#BD0026") +
labs(title = "XG Boost: Variable Importance",
x="Variable",
y="Variable Importance") +
theme(legend.position = "none")
var.imp.plot.xgb1 + coord_flip()
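With predictions from several models attached to the test set, their accuracy can be compared on a common metric such as mean absolute error. A sketch with hypothetical vectors standing in for `fishnetTest$Count` and two prediction columns:

```r
# Hypothetical stand-ins for fishnetTest$Count and two prediction columns
actual   <- c(0, 2, 1, 0, 5, 0, 1)
pred.p   <- c(0.2, 1.5, 0.8, 0.4, 3.9, 0.1, 1.2)
pred.xgb <- c(0.1, 1.9, 1.1, 0.2, 4.6, 0.0, 0.9)
# Mean absolute error: average size of the miss, in overdose events
mae <- function(a, p) mean(abs(a - p))
sapply(list(poisson = pred.p, xgboost = pred.xgb), mae, a = actual)
```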
#create separate data sets containing the variables in each story
story.a <-
fishnet %>%
as.data.frame() %>%
dplyr::select(snap.nn, bus.dist, park.dist, Count)
story.d <-
fishnet %>%
as.data.frame() %>%
dplyr::select(assault.nn, code.dist, overgrowth.nn, Count)
story.nhood <-
fishnet %>%
as.data.frame() %>%
dplyr::select(nhood, Count)
#create the ensemble function
ensembleFunction <- function(inputStory, iterations, sampleSize) {
#create an empty dataframe
endList <- data.frame(matrix(NA, nrow = nrow(inputStory), ncol = 0))
#build n models with a m% test set
for (i in 1: iterations ) {
sample <- sample.int(n = nrow(inputStory), size = floor(sampleSize *nrow(inputStory)),
replace = F)
train <- inputStory[sample, ]
#train the model
thisTrain <- glm(train$Count ~ ., family = "poisson", data = train)
#create a vector of predictions for this model.
thisPrediction = predict.glm(thisTrain, as.data.frame(fishnet), type = "response")
#append the predictions to the data frame
endList <- cbind(endList, thisPrediction)
}
#label each prediction column 1 through n
colnames(endList) <- paste("prediction", 1:ncol(endList), sep = "")
#create a data frame of average predictions.
return(data.frame(mean.Prediction = rowMeans(endList)))
}
#run for the 3 different stories
allStories.predictions <-
data.frame(
cbind(
story1.mean.predictions = ensembleFunction(story.a, 100, .9)[, 1],
story2.mean.predictions = ensembleFunction(story.d, 100, .9)[, 1],
story3.mean.predictions = ensembleFunction(story.nhood, 100, .9)[, 1]
))
#look at the correlation between stories (don't want it to be too high)
cor(allStories.predictions)
## story1.mean.predictions story2.mean.predictions
## story1.mean.predictions 1.0000000 0.6058870
## story2.mean.predictions 0.6058870 1.0000000
## story3.mean.predictions 0.3786943 0.4590718
## story3.mean.predictions
## story1.mean.predictions 0.3786943
## story2.mean.predictions 0.4590718
## story3.mean.predictions 1.0000000
ggpairs(allStories.predictions,diag = list("continuous"="blank"))
cor.test(allStories.predictions$story1.mean.predictions, allStories.predictions$story2.mean.predictions)
##
## Pearson's product-moment correlation
##
## data: allStories.predictions$story1.mean.predictions and allStories.predictions$story2.mean.predictions
## t = 36.077, df = 2244, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5790371 0.6314242
## sample estimates:
## cor
## 0.605887
#partition ensemble predictions 60/40 for train/test
ensembleTrain <- allStories.predictions[partition, ]
ensembleTest <- allStories.predictions[-partition, ]
#create a model from the 3 stories' predictions
pFit_e <- glm(fishnetTrain$Count ~ ., data=ensembleTrain, family = "poisson")
#predict on the full data set
fishnet$pred.pFit <- predict.glm(pFit, as.data.frame(fishnet), type = "response")
#mapping the predictions
ggplot() +
geom_sf(data = fishnet, aes(fill = factor(ntile(pred.pFit, 10))), color = NA) +
scale_fill_manual(values = palette10,
name="Predictions \n(Decile Breaks)",
labels = c("Low", "", "", "", "", "", "", "", "", "High")) +
geom_sf(data = boundary, aes(), fill = NA, color = "grey40", size = 0.5) +
labs(title="Map of Predictions") +
mapTheme()
There are many different validation metrics that can be used to evaluate the relative success of each model. The below section walks through the validation metrics used in this project, including examining the distribution of errors, comparing to a traditional kernel density model, cross validating across many folds, and spatially cross validating across different areas of the city.
This code walks you through using the models to predict on the test set, calculating errors, and examining the distribution of errors, both in quantity and across space.
#Predicting on the Test Set
fishnetTest$pred.pFit <- predict.glm(pFit, as.data.frame(fishnetTest), type = "response") #predict on the test set
fishnetTest$pred.pFit[is.na(fishnetTest$pred.pFit)] <- 0 #replace any NAs with 0
#Calculating the Mean Absolute Error
#take the absolute value of the actual count minus the predicted count
fishnetTest$absError.pFit <- abs(fishnetTest$Count - fishnetTest$pred.pFit)
#take the mean of this value
mean(fishnetTest$absError.pFit)
## [1] 0.4234824
#Visualizing the distribution of residuals
ggplot(fishnetTest, aes(absError.pFit)) +
geom_histogram(bins = 20, fill = "#006d2c", color = "black", size = 0.75) +
labs (x = "Residuals",
y = "Count",
title = "Distribution of Residuals, Community Protective Services Model",
subtitle = "Mean Absolute Error is 0.4234 Overdose Events") +
plotTheme()
#Mapping the residuals of each model
PlainMap +
geom_sf(data = fishnetTest, aes(fill = factor(ntile(absError.pFit, 10))), color = NA) +
scale_fill_manual(values = palette10,
name="Residuals \n(Decile Breaks)", labels = c("Low", "", "", "", "", "", "", "", "", "High")) +
labs(title="Map of Test Set Residuals") +
mapTheme()
It is important to evaluate how well your model performs in comparison to the way things are currently done, in this case a kernel density hotspot map. This code walks you through building a function that will put the models on the same scale, divide them into risk categories, and output a map that compares the relative predictions of the two models.
#Create a function that allows you to input the model and spit out the comparison graph
cv.kd_function <- function(cv.kd_data) {
#reclassify predictions into 100 quantiles
cv.kd_data$p.r1 <- as.numeric(.bincode(cv.kd_data$p,
breaks = quantile(cv.kd_data$p, seq(0, 1, by = 0.01), na.rm = TRUE)))
#reclassify predictions a second time - into 5 risk levels
cv.kd_data$p.r2 <- cut(cv.kd_data$p.r1, c(-1, 30, 50, 70, 90, 100), na.rm = TRUE,
labels = c("5", "4", "3","2","1"))
#summarize by group
p.summ <-
cv.kd_data %>%
group_by(p.r2) %>%
summarize(p.count = sum(Count)) %>%
rename(group = p.r2) %>%
filter(!is.na(group))
#reclassify kernel density into 100 quantiles
cv.kd_data$kd.r1 <- cut(cv.kd_data$od.dens,
breaks = quantile(cv.kd_data$od.dens, seq(0, 1, by = 0.01), na.rm = TRUE),
labels = FALSE)
#reclassify kernel density a second time - into 5 risk levels
cv.kd_data$kd.r2 <- cut(cv.kd_data$kd.r1, c(-1, 30, 50, 70, 90, 100), na.rm = TRUE,
labels = c("5", "4", "3","2","1"))
#summarize by group
kd.summ <-
cv.kd_data %>%
group_by(kd.r2) %>%
summarize(kd.count = sum(Count)) %>%
rename(group = kd.r2) %>%
filter(!is.na(group))
#join these two tables
countComparisons <- merge(p.summ, kd.summ)
countComparisons <- cbind(
countComparisons,
data.frame(Category = c("90% - 100%", "70% - 89%", "50% - 69%", "30% - 49%", "1% - 29%")))
countComparisons <-
countComparisons %>%
dplyr::mutate(kernelPct = kd.count / sum(kd.count),
fittedPct = p.count / sum(p.count))
countComparisonsLong <-
countComparisons %>%
gather(Variable, Value, kernelPct:fittedPct)
output <-
ggplot(data=countComparisonsLong, aes(Category,Value)) +
geom_bar(aes(fill = Variable), position = "dodge", stat="identity", color = "black") +
scale_fill_manual(values = c("#BD0026", "#FECC5C"),
labels=c("Spatial Risk Model", "Kernel Density")) +
labs(x= "Predicted Risk Levels",
y="Percent of Test Set Cases",
title= "Goodness of Fit: Spatial Risk Model vs. Kernel Density Hotspot")
return(output)
}
#Then, for each model, you will just need to create a data frame with the correct column names and run the function
cv.kd_data <-
fishnetTest %>%
as.data.frame() %>%
dplyr::select(Count, pred.pFit, od.dens, geometry) %>%
rename(p = pred.pFit)
cv.kd_function(cv.kd_data)
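The two-step reclassification inside `cv.kd_function()` — percentile codes first, then five risk levels — can be seen in isolation on a small synthetic vector of predictions:

```r
# Synthetic predictions standing in for cv.kd_data$p
set.seed(3)
p <- rexp(500)
# Step 1: assign each prediction its percentile bin (1-100)
pctl <- .bincode(p, breaks = quantile(p, seq(0, 1, by = 0.01)), include.lowest = TRUE)
# Step 2: collapse percentiles into 5 risk levels (1 = highest risk)
risk <- cut(pctl, c(-1, 30, 50, 70, 90, 100), labels = c("5", "4", "3", "2", "1"))
table(risk)
```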
The cross validation process allows you to determine whether the model predicts well on many different subsets of the data, not just the initial test set. The below code walks through this cross validation process across 100 folds.
#Randomly shuffle the data
fishnetTrain2 <- fishnetTrain[sample(nrow(fishnetTrain)),]
#Create 100 equally sized folds
folds <- cut(seq(1, nrow(fishnetTrain2)), breaks=100, labels=FALSE)
#Perform 100-fold cross validation
endList <- data.frame()
for(i in 1:100){
#Segment your data by fold using the which() function
testIndexes <- which(folds==i, arr.ind=TRUE)
testData <- fishnetTrain2[testIndexes, ]
trainData <- fishnetTrain2[-testIndexes, ]
#Train the model on this fold's training portion, then predict on the held-out fold
thisTrain <- glm(Count ~ supermarket.nn + grocery.nn + library.nn + drycleaner.nn + ems.nn +
law.nn + hospital.nn + fire.nn + bus.nn + snap.nn + college.nn + prvschool.nn +
pubschool.nn + statecap.dist + cityhall.dist + cbd.dist,
family = "poisson", data = trainData)
thisPrediction <- predict.glm(thisTrain, as.data.frame(testData), type = "response")
#calculate MAPE and MAE (MAPE is Inf for folds containing zero-count cells)
thisMAPE <- mean(abs(testData$Count - thisPrediction) / testData$Count )
thisMAE <- mean(abs(testData$Count - thisPrediction))
#rbind this fold's MAPE and MAE onto the running results
thisList <- cbind(thisMAPE, thisMAE)
endList <- rbind(endList, thisList)
}
resulta <- endList
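After the loop finishes, the per-fold errors collected in `resulta` can be summarized into a single cross-validated score. A sketch with a hypothetical stand-in for the results data frame:

```r
# Hypothetical fold-level results standing in for `resulta`
resulta <- data.frame(thisMAPE = c(0.42, Inf, 0.39, 0.45),
                      thisMAE  = c(0.41, 0.44, 0.38, 0.43))
# Overall MAE across folds
mean(resulta$thisMAE)
# MAPE is Inf for folds with zero-count cells; screen those out before averaging
mean(resulta$thisMAPE[is.finite(resulta$thisMAPE)])
```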
It is important to determine how well the model predicts across different areas of the city. The below code walks you through the spatial cross validation process across Providence's census tracts.
tracts2 <- tracts
#Make the Function
spatialCV <- function(model) {
#initialize variables
training <- 0
testing <- 0
tuner <- 0
currentUniqueID <- 0
uniqueID_List <- 0
y <- 0
endList <- list()
#create a list of all the spatial group unique ids in the data frame (i.e., all the census tracts)
uniqueID_List <- unique(model[["tracts"]])
x <- 1
y <- length(uniqueID_List)
#create a counter and while it is less than the number of tracts...
while(x <= y)
{
#call the current tract
currentUniqueID <- uniqueID_List[x]
#create a training set comprised of units not in that tract and a test set of units
#that are in that tract
training <- model[ which(model[["tracts"]] != uniqueID_List[x]),]
testing <- model[ which(model[["tracts"]] == uniqueID_List[x]),]
trainingX <- training[ , -which(names(training) %in% c("Count"))]
testingX <- testing[ , -which(names(testing) %in% c("Count"))]
trainY <- training[["Count"]]
testY <- testing[["Count"]]
#tune a model - note I have hardwired the dependent variable and the trainControl
tuner <- glm(trainY ~ ., family = "poisson",
data = trainingX %>% as.data.frame() %>% dplyr::select(-tracts, -geometry))
#come up with predictions on the test tract
thisPrediction <- predict.glm(tuner, testingX, type = "response")
#calculate the MAE and the count of test records for a given training set (to get a sense of sample size)
thisMAE <- mean(abs(testY - thisPrediction))
countTestObs <- nrow(testingX)
#add the current tract and its MAE to the final list
thisList <- cbind(currentUniqueID, thisMAE, countTestObs)
#create a final list to output
endList <- rbind(endList, thisList)
#iterate counter
x <- x + 1
}
#return the final list of tracts and associated MAEs
return (as.data.frame(endList))
}
#create models
a.model <-
fishnettracts %>%
dplyr::select(snap.nn, bus.dist, park.dist, statecap.dist, supermarket.nn, Count, tracts)
#Run the spatial cross validation on the model
SCVa <- spatialCV(a.model)
SCVA <-
SCVa %>%
mutate(tracts = as.character(currentUniqueID),
a.FNcells = as.numeric(countTestObs),
a.MAE = as.numeric(thisMAE)) %>%
dplyr::select(-thisMAE, -currentUniqueID)
#Join back to the dataset
forjoining <-
SCVA %>%
group_by(tracts) %>%
summarize(mean(a.MAE)) %>%
rename(a.meanMAE = `mean(a.MAE)`)
It is also important to test whether the model generalizes across different social and neighborhood characteristics. The below code walks you through running spatial cross validation to determine how well a model predicts across areas with different population densities.
tractstable <- read.csv("SCVtracts2.csv", header = TRUE, sep = ",")
tractstable$e.karl[tractstable$e.karl == Inf] <- 0
SCVresults1 <- tractstable %>% group_by(DensityQuartile)
SCVresults1.5 <- SCVresults1 %>% summarize(results = mean(e.karl))
colnames(SCVresults1.5) <- c("Density", "Normalized Mean Absolute Error")
kable(SCVresults1.5, "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Density | Normalized Mean Absolute Error |
---|---|
Bottom 25% Density | 0.6611757 |
Bottom 50% Density | 0.5656666 |
Top 25% Density | 0.2552651 |
Top 50% Density | 0.3987325 |
The below table shows the mean absolute error across the many different models run as part of this project.
modelNum | modelName | Description | MAE |
---|---|---|---|
1 | Community Protective Services All | All Variables Poisson | 0.4223 |
2 | Community Protective Services All | All Variables Negative Binomial | 0.4219 |
3 | Community Protective Services Model | Significant Variables Poisson | 0.4235 |
4 | Community Protective Services Model | Significant Variables Negative Binomial | 0.4237 |
5 | Community Protective Services Ensemble Input Model | Ensemble Input Poisson | 0.4244 |
6 | Community Protective Services Ensemble Input Model | Ensemble Input Negative Binomial | 0.4250 |
7 | Risk Factor All | All Variables Poisson | 0.4177 |
8 | Risk Factor All | All Variables Negative Binomial | 0.4172 |
9 | Risk Factor Model | Significant Variables Poisson | 0.4155 |
10 | Risk Factors Model | Significant Variables Negative Binomial | 0.4237 |
11 | Risk Factors Ensemble Input Model | Ensemble Input Poisson | 0.4224 |
12 | Risk Factors Ensemble Input Model | Ensemble Input Negative Binomial | 0.4219 |
13 | Neighborhood Characteristics All | All Variables Poisson | 0.4482 |
14 | Neighborhood Characteristics All | All Variables Negative Binomial | 0.4481 |
15 | Neighborhood Characteristics Model | Significant Variables Poisson | 0.4486 |
16 | Neighborhood Characteristics Model | Significant Variables Negative Binomial | 0.4484 |
17 | Neighborhood Characteristics Ensemble Input Model | Ensemble Input Poisson | 0.4650 |
18 | Neighborhood Characteristics Ensemble Input Model | Ensemble Input Negative Binomial | 0.4481 |
19 | Final Ensemble Model | Ensemble Model Poisson | 0.4145 |