This project was produced as a part of the University of Pennsylvania’s Master of Urban Spatial Analytics Spring 2019 Practicum (MUSA 801). The course was instructed by Ken Steif, Michael Fichman, and Matthew Harris, whom we’d like to thank for their guidance. The group would also like to thank the City of Louisville for the opportunity to work on this project. Data Officer Michael Schnuerle and Major Joseph Williams of the Louisville Fire Department were integral partners in the development of this work.
This document is intended to enable others to replicate the methodology of a study of structural fires. Products of this course should be considered works in progress. We first introduce the context of this project, followed by the data, methodology, modeling, and guiding appendices. A table of contents is provided for easy navigation, along with select hyperlinks throughout the document.
Building fires are a relatively rare occurrence, but fire risk, particularly in an older city like Louisville, KY, is omnipresent. This project seeks to predict geospatial fire risk in Louisville and to use this risk to help inform a smoke detector outreach program. Drawing on a variety of sources, including fire data, property data, and environmental variables, the model assigns a risk score across different areas of the city, which can be used to identify the areas of highest latent fire risk. While Louisville currently has a smoke detector program, through which citizens can request a smoke detector via 311, the model results illustrate vulnerabilities in the city’s current system. This work can help direct efforts to mitigate potential fires in at-risk communities, and public locations throughout the city could be utilized for expanded smoke detector programming.
Fires, as with many environmental risks, are rare enough that they are often disregarded - at least until one is directly affected. However, fire events have increased in the city of Louisville in recent years, and more residents are at risk than ever.
Over 2,600 fires have occurred in the city since 2013, causing more than $29 million in property damage over a three-year period (insert source). Furthermore, as with most cities, the vast majority of structural fires occur on residential properties. Each affected home suffered an average of $20,000 in damage, which adds an enormous burden to an inherently traumatic situation and is often a pivotal moment in a household’s finances. Additionally, the increase in fires is a strain on the Louisville Fire Department and its emergency response services.
Currently, the city mitigates residential fire risk through three main approaches: 21 fire stations service the central part of the city; the city’s inspection system targets a number of residential properties throughout the year; and a focused outreach program provides smoke detectors to residents who call 311 to request installation. We believe there is an opportunity in this final category to establish a more effective system. Relying primarily on inspections can be costly and difficult to arrange for residential properties, but smoke detector outreach can be expanded with less manpower and applied at a broader scale. Last year, there were 700 installations, but we believe we can help the city not just to increase this quantity, but to get these devices to those who need them most.
To do so, we have built an algorithm that predicts fire risk, with an emphasis on latent risk, rather than depending only on previous occurrences. We have sought to identify where future fires may occur, so that Louisville can mitigate the risk before the fire ever starts.
We compiled the following data for our analysis: environmental data from Louisville’s Open Data website, fire data from the Louisville Fire Department, and property valuation data from the City of Louisville. For further details on these sources, please see Appendix: Data Dictionary.
The data was also “wrangled” before being explored in the following section. This process included various transformations of the data in order to optimize the predictive ability of each variable. For details on this procedure, please see Feature Engineering and Appendix: Data Wrangling. Through exploratory analysis and the modeling process, the final dataset was narrowed down to a set of best predictors. For results on how the data ultimately performed in the models, please see Model Building.
This report seeks to create a model that will predict fire risk throughout the city, even where fires have not previously occurred. However, traditional spatial models simply interpolate across space, assuming that past events are the best indicator of general risk. A kernel density map, also known as a “hotspot map”, is an example of this. The map below illustrates the kernel density alongside a map where fire events are counted for each 500 x 500 foot grid cell citywide.
A spatial risk model is an alternative approach that utilizes neighborhood, environmental, and demographic variables to help identify latent risk; it does not assume that past fires are the best indication of future ones and therefore allows risk to vary across the city. In order to best account for this variation, the city was divided into 500 x 500 foot grid cells, translating to a city block level unit of analysis. Together, the grid cells form a “fishnet.”
Mapping fires to the fishnet clearly conveys the rarity of such an event, with only 16 percent of grid cells containing at least one fire. This observation subsequently shaped our modeling approach.
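Using the fishnet assembled in Appendix: Data Wrangling, this share can be checked directly; a one-line sketch:
#share of grid cells that contain at least one fire (~16% in our data)
mean(fishnet$countFire > 0, na.rm = TRUE)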
Other variables were also added to the fishnet, so that each grid cell contained neighborhood, environmental, and demographic characteristics, as well as data about past fires. Yet, while including a plenitude of variables thoroughly illustrates past experience, a critical objective of this report is to balance the accuracy and generalizability of our predictions. If our model is too accurate, it may overly correlate with past fires, thereby obscuring the latent risk in areas that have not had any. Instead, our model aims to be generalizable as well as accurate, allowing it to be applied to the entire city without simply reiterating previous trends. Consequently, we divided our variables into separate “stories”, or lenses through which we could explore the relationship to fires differently: a building-level story, a neighborhood story, and a demographic story. These stories are modeled separately and then combined to maximize predictive ability. In this way, we prevent any one set of variables from “overfitting” the model and sacrificing generalizability in favor of accuracy.
In order to best capture the relationship between our data and the fires, we “engineered” variables to include in our dataset. Many of these variables have more explanatory power as varied spatial measurements: each began simply as a count of points within each grid cell, and was then also engineered as a distance, a density, and an average of the nearest neighbors. The example below shows how these measurements differ spatially:
For an explanation of which measurements were used for each variable, please see Appendix: Data Dictionary.
The spatial clustering of fires we have identified implies that there are other factors that may contribute to the propensity of fires to occur where they have before. We thus explored the characteristics of the environment of past fires to determine whether such variables could help predict latent risk.
The scatter plots above show a relationship between some of the features we engineered and the density of fires, indicating that environmental variables indeed correlate with fires. In particular, there appear to be strong correlations for electrical permits and vacant structures. Failed inspections and fireplace density tend to have slightly looser relationships. Using this analysis, we assembled an array of variables to model fires in order to assess risk.
Finally, we examined whether current mitigation resources are properly aligned to risk. As mentioned in the Introduction section, these resources include fire stations, inspections, and requests for smoke detector installation. Below, we observe that the current distribution of these services is somewhat correlated with fires such that the distance to these resources decreases with higher densities of fires. This means that they tend to be located in areas that are helpful to mitigation efforts.
We note that this trend holds especially true for smoke detectors, as seen above. If residents are in fact requesting smoke detectors where there are more fires, then this points to an opportunity to improve upon the current allocation process by providing a more proactive outreach program. We hope to do this by identifying areas of high risk in which to more effectively and efficiently target this resource.
An important step in the modeling process is to determine which variables are the most powerful predictors. Finding stronger variables allows us to slim down the size of our dataset, making for a more efficient model. In order to find such predictors, we first identified significant variables among our entire dataset using a Poisson generalized linear model. We then iterated through a series of machine learning models to supplement the selection process and cross-referenced the results to see which variables showed the most predictive power. An example can be seen below using an eXtreme Gradient Boosting (XGB) model:
In this particular model, distance to vacancy was by far the most predictive variable, followed by interior violations and the year in which the homes were built. As we iterated through our models, this method helped us home in on other variables with a similar effect, indicating that homes in poor or vacant condition are associated with higher risk. Offsite owners followed closely behind, showing that potential landlord and owner negligence may correlate with fire risk.
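As a sketch of the first step, significant variables can be extracted from the all-variable Poisson model (assumed here to be stored as regPoi, matching the results table in the appendix):
#pull predictors significant at p < 0.05 from the all-variable Poisson model
coefs <- summary(regPoi)$coefficients
signifVars <- setdiff(rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05], "(Intercept)")
signifVars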
After selecting these variables, we also wanted to ensure that they were sufficiently independent of one another in order to avoid redundancy as well as issues with multicollinearity, which would undermine the predictive ability of our model. We thus divided the variables by story and plotted the pairwise correlations.
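A minimal sketch of producing one of these plots with ggcorrplot, assuming the building story’s numeric variables are held in a data frame named vars.building:
#pairwise correlations for one story (vars.building is an assumed name)
corr.building <- cor(vars.building, use = "pairwise.complete.obs")
ggcorrplot(corr.building, type = "lower", tl.cex = 6) +
labs(title = "Pairwise Correlations", subtitle = "The Building Story")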
From these plots we can determine that there are no major issues with multicollinearity, which would be problematic for any correlation value above 0.8. There are a couple of exceptions of high correlation between variables that were engineered from the same feature, such as the distance to the nearest reinspection and the distance to the 3 nearest reinspections. In general, though, we see that variables of similar scales (e.g. census data in the demographic story) and variables from the same dataset (e.g. distance to HVAC permits and distance to electrical permits) tend to be more strongly correlated. Through this process, we conclude that we can proceed with these final variables.
Initially, we attempted to model with all 202 of our variables. However, we quickly realized that in doing so, some of our variables were monopolizing the predictive power of the model and leading it to overfit - becoming too good at predicting past fires instead of latent risk. To mitigate this concern, we divided the variables into three different stories: building level, neighborhood level, and demographic elements, believing that each told a slightly different story about what leads to fire risk. By separating them, each could show its distinct relationship to fire occurrences, and therefore help make our model more generalizable.
The predictions from each story are mapped to show the distribution of risk as predicted by each set of variables. The stories are also validated, using test and training sets, to show the error in their predictions. For testing, a subset of 60% of the city’s grid cells is set aside as a training set. The model trained on that set is then tested on the remaining 40%, the test set, to see how well it predicts values it has not seen. The error is the difference between the observed number of fires per grid cell and the number the model predicted, a measure of the model’s accuracy.
The building story is based on variables that are specific to the building scale, often relating to the internal characteristics of the structures themselves. After feature selection, we chose 20 variables in the building story, covering the fire inspection and property valuation data. Important predictors in this category included the type of construction, whether the owners were offsite, and the parcel density. The most important variable by far was a construction type identified in the data as “other”, neither frame nor brick.
One can see in the map below that this building-level scale creates nuance within the predictions.
The model produces a mean absolute error of 0.374, meaning that its predictions are accurate to within 0.374 of a fire on average. While this relatively low error is promising, we noticed that the 1st and 5th quintiles of prediction error are interspersed, which means this model has predicted fires in areas that have not seen fire occurrences in the past, thereby identifying latent risk.
The neighborhood story is made up of variables that describe the built environment and landscape. Some of the best predictors in this story included the distance to electrical permits, the distance to interior violations, and the distance to vacancy. These results help us see that structures that need more maintenance are, not surprisingly, more likely to carry fire risk.
As a model, the map shows more accuracy than the building story, particularly in areas with zero previous fires, such as the city park in the bottom left corner. This means that our neighborhood story may help us identify lower-risk places more accurately, helping us avoid distributing smoke detectors to areas where they would be less useful.
The error is also lower in this story, with a mean absolute error of 0.365 of a fire. The map below shows features similar to the previous story, with high and low predictions interspersed. Building on the model’s strengths, it is decidedly more accurate in areas with no past fires, meaning that it can add to the predictive power of our final ensemble.
In the demographic story, we used 6 variables, including population, social status indicators, and economic indicators. The best predictors in this model included median household income, population, and median rent. Population helps us to understand that the fires occur in denser locations, and the latter two predictors seem to correlate with the need for maintenance shown in other stories. In this case, people may be struggling with the financial burden of safe upkeep.
From the predicted fire map, we can see that the model generated by this story predicted the entire west area as a high-risk zone, meaning that the census tract boundaries underlying this map likely obscured much of the variance needed for higher accuracy.
The errors in this model are higher than in the other stories, again likely because of the scale of the information. This model predicts within 0.428 of a fire. The map shows a starker contrast in prediction accuracy, with much of the west side, where fires are concentrated, being less accurate.
We ensembled the three stories into one final model, using the prediction results from each story as independent variables. The average prediction error is 0.399 of a fire. The most powerful story was the neighborhood story, carrying the strong predictors of distance to vacancy and interior violations. The combined strengths of the model ultimately present a story of dense, lower-income areas suffering from neglect or deferred maintenance.
From the clusters of high numbers of predicted fire events, it is evident that our model identified latent fire risk rather than simply refitting previous fires. Comparing the prediction results of the three stories and the final model, we noticed that the final model combines the spatial characteristics of each story. One can see some of the strong demarcations of census tracts, as well as the nuance of the neighborhood and building stories.
The error is slightly larger in the final model than in the most accurate model, the neighborhood story. One can see on the map below that while some of the errors are very low, at 0.106 of a fire, the range of errors is larger than in the individual stories. This makes sense, given that the final model is an ensemble of different spatial relationships.
Ultimately, this loss in accuracy can be considered a tradeoff for generalizability. Validation in the following sections will explore how well the final model performs.
We undertook several methods of validation to see how the final model performs. We begin by comparing it to a kernel density map, then test it through cross validation, and then test that cross validation across neighborhoods to determine the generalizability of the model.
Comparing against a kernel density model is one of the most important forms of validation we can perform. The standard approach to risk mitigation is often to look at a kernel density map to identify risk hot spots. We wanted to be certain that our final model performed better than this traditional approach, to prove that we could offer a better alternative for smoke detector allocation. The map below shows how the two spatial models compare, with our model showing far more intricacy.
The plot below is the clearest way to measure this comparison. It shows how well each model performs across risk categories, from lowest to highest risk. For our purposes, we aimed to identify the highest risk, and our model clearly surpasses the kernel density map in this category. This is important, signaling that our model can identify the areas of highest latent fire risk for smoke detector outreach.
Cross validation helps verify that the model is generalizable across the dataset, and not only effective on particular subsets. To perform this analysis, we run a procedure that iterates through 100 random test sets and averages the error for each of the models. The results of this analysis can be seen in the table below.
The average errors remain comparable to those of the individual stories. The neighborhood story still predicts better than the others, and the ensemble model predicts less well than the neighborhood variables alone, but better than the other stories. These errors show that our final model is accurate, but has traded some of the accuracy of the neighborhood story in order to be more generalizable.
Model Number | Model Name | Description | Average MAE |
---|---|---|---|
1 | regStories.census | Poisson regression; Demographic Variables (6) | 0.434124 |
2 | regStories.neighborhood | Poisson regression; Neighborhood Variables (20) | 0.377525 |
3 | regStories.building | Poisson regression; Building Variables (20) | 0.403998 |
4 | regStories.ensembling | Poisson regression; Using the CV Prediction result from 3 stories | 0.399978 |
Spatial cross-validation is a robust way to assess the generalizability of our model across space. We have done this by looking at the distribution of fire prediction errors within different neighborhoods. In this process, the fishnet grid cells of one neighborhood are set aside as a test set, while the rest of the neighborhoods are used to train the model. By doing so, we can better assess whether there is intrinsic bias baked into our model. The spatial distributions of error vary for our three stories, but all of them perform fairly in the downtown area. This is likely because the risk prediction model trades pinpoint accuracy, overfit to past fire events, for broader trends of generalizable risk.
The spatial cross-validation map of the building story shows a slight segregation of errors; we can see lower errors in the west side of the city and higher errors in neighborhoods to the east. Therefore, our model predicts better in areas that tend to have more fires than in those that do not. The neighborhood story also has a notable amount of clustering, with lower errors concentrated in the southwest and higher errors in the northeast of the city. Again, it is evident that our model predicts better in areas that have experienced fires, but in this case, we note that it has also performed relatively well regardless of past events, which is a promising indication of generalizability. Spatial cross-validation of the demographic story is less consistent with the other models. While its highest predictive power is primarily in the westernmost neighborhoods, some of the largest errors exist in adjacent neighborhoods. In this sense, it is the most generalizable story because its errors are more randomly distributed throughout space, rather than clustering in certain areas. As such, the demographic story makes an important contribution to the overall generalizability of our final ensembled model, which also has slightly less clustering of errors than the neighborhood and building-level stories, and allows our model to predict fire risk at a reasonable scale for the city of Louisville.
Using this final model, we have created risk score categories that are reasonably accurate yet generalizable for the city of Louisville. We then used these categories to build a web app that allows the Louisville Fire Department to identify areas of high risk in which to locate outreach programs. Specifically, the app presents a set of potential venues in at-risk communities such as community centers, churches, and schools, along with a predicted impact assessment within a user-defined radius. These interactions then yield actionable intelligence by enabling the fire department to determine the best locations at which to host outreach programs such as smoke detector distribution events, thereby mitigating risk of fire in Louisville.
Our data came from several sources, two of which are publicly available:
One can see the final variables used in the model listed below:
Variable | Description | Source |
---|---|---|
TotalPop | Total population | US Census |
MedHHInc | Median household income | US Census |
MedRent | Median rent | US Census |
percentWhite | Percent white | US Census |
percentBachelors | Percent with a bachelors degree | US Census |
percentPoverty | Percent of population in poverty | US Census |
countInsp | Count of inspections | Louisville Fire Department |
dens.offsiteowns | Density of offsite owners | City of Louisville |
nInspHome | Count of home inspections | Louisville Fire Department |
dist.InspHome3 | Distance to 3 nearest home inspections | Louisville Fire Department |
nInspOth | Count of other inspections | Louisville Fire Department |
dist.InspOth5 | Distance to 5 nearest other inspections | Louisville Fire Department |
dist.InspRe1 | Distance to nearest reinspection | Louisville Fire Department |
countParcels | Count of parcels (building density) | City of Louisville |
WALL_TYPE | Property wall type | City of Louisville |
YEAR_BUILT | Year property was built | City of Louisville |
nFail | Count of failed inspections | Louisville Fire Department |
dist.Reinsp3 | Distance to 3 nearest reinspections | Louisville Fire Department |
nInspGen | Count of general inspections | Louisville Fire Department |
dens.fireplaces | Density of properties with fireplaces | Louisville Open Data |
dens.built80s | Density of properties built in the 1980s | City of Louisville |
dens.InspPreplan | Density of preplan inspections | Louisville Fire Department |
dens.bath12 | Density of properties with less than 3 bathrooms | City of Louisville |
Prop_Type | Property type | City of Louisville |
dens.built10s | Density of properties built in the 2010s | City of Louisville |
distPerm_fdet3 | Distance to 3 nearest fire detector permits | Louisville Open Data |
dist.elec | Distance to nearest permit - electric | Louisville Open Data |
dist.fdet | Distance to nearest permit - fire detectors | Louisville Open Data |
dist.fsup | Distance to nearest permit - fire suppression | Louisville Open Data |
dist.hvac | Distance to nearest permit - HVAC | Louisville Open Data |
dist.rhd.1 | Distance to nearest permit - range hood | Louisville Open Data |
dist.str | Distance to nearest permit - structural inspection | Louisville Open Data |
dist.vacant | Distance to nearest 311 complaint - vacant property | Louisville Open Data |
dens.tarcStops | Density of TARC bus stops | Louisville Open Data |
distOpenB3 | Distance to 3 nearest 311 complaints - open burning | Louisville Open Data |
distTrash3 | Distance to 3 nearest 311 complaints - trash | Louisville Open Data |
distIntV3 | Distance to 3 nearest 311 complaints - interior violations | Louisville Open Data |
distPest3 | Distance to 3 nearest 311 complaints - pests | Louisville Open Data |
dens.warnSirens | Density of warning sirens | Louisville Open Data |
distExtV3 | Distance to 3 nearest 311 complaints - exterior violations | Louisville Open Data |
distOverG3 | Distance to 3 nearest 311 complaints - overgrowth | Louisville Open Data |
distAbanV3 | Distance to 3 nearest 311 complaints - abandoned vehicles | Louisville Open Data |
dens.fireStations | Density of fire stations | Louisville Open Data |
dist.histProperties | Distance to nearest historical property | Louisville Open Data |
dens.parcTrees | Density of parcel trees | Louisville Open Data |
The data for this report came from multiple sources and required wrangling, both to acquire the data and to get it into a usable condition. The following methodology is necessary for processing the data sources and building a final dataset. Below, we introduce how to acquire the dependent variable, build a fishnet, process independent variables, run distance and density functions, and finally combine everything into a completed dataset.
The first step to starting the report is to prime your file. This includes loading in libraries and necessary data sources:
library(dplyr)
library(tidyverse)
library(sf)
library(viridis)
library(corrplot)
library(spdep)
library(leaflet)
library(ggplot2)
library(ggcorrplot)
library(lubridate)
library(FNN)
library(caret)
library(xgboost)
library(spatstat)
library(raster)
library(rasterVis)
library(randomForest)
library(tidycensus) # for get_acs(), used in the census wrangling below
library(QuantPsyc) # assumed source of lm.beta(), used for standardized coefficients below
studyArea <- st_read("Louisville_KY_Fire_Districts.shp") %>% # previously named LFD
st_union() %>%
st_transform(crs=102679)
fires <-
read.csv("Building_Fires_Cleaned_wLonLat.csv") %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
st_transform(crs = 102679)
The next step was to prepare the unit of analysis by creating a fishnet of the study area. Below, we set our cell size to 500 x 500 feet, creating a set of grid cells specific enough to show nuanced detail, but not so fine-grained as to single out individual properties.
# create a 500x500 foot fishnet
fishnet <-
st_make_grid(st_union(studyArea), cellsize = 500) %>%
st_intersection(studyArea, .) %>%
st_sf() %>%
mutate(uniqueID = rownames(.))
The dependent variable must be joined to the fishnet so that each grid cell has an associated fire count. Below, we show how to aggregate fires to the grid cell level. Each independent variable will also be aggregated, as shown later in this section, eventually giving each grid cell many associated variable values.
# aggregate fire events as count of fires per grid cell
fishnet <-
fires %>%
dplyr::select() %>% # drop all attributes, keeping only the geometry
mutate(countFire = 1) %>%
aggregate(., fishnet, length) %>%
mutate(countFire = ifelse(is.na(countFire), 0, countFire), uniqueID = rownames(.))
Before joining to the fishnet, however, the independent variables must go through a series of filtering and processing steps. Filtering helps make the large data sources more manageable, and processing makes the remaining data more useful for our analysis.
# select meaningful fields from raw inspection data and rename them
inspections <- inspections %>%
dplyr::select(Inspection_Completed_Date,
Inspection_Number,
Inspection_Type,
Inspection_Passed_Flag,
Inspection_Reason,
Inspection_Reinspection_Flag,
Inspection_Assigned_To_Shift,
Inspection_Staff_Hours,
Inspection_Assigned_To_Station,
Inspection_Assigned_To_Unit,
Inspection_Assigned_To_Inspector_Full_Name_List,
Inspection_Violation_Notice_Sent_Flag,
Inspection_Reinspection_Parent_Inspection_Number,
Inspection_Status,
Agency_ID_Internal,
Location_ID_Internal,
Occupants.Occupant_Address_City_Name,
Occupants.Occupant_Full_Address)
names(inspections) <- c("Completed_Date",
"Inspection_Num",
"Inspection_Type",
"Passed",
"Reason",
"Reinspection",
"Assigned_Shift",
"Staff_Hrs",
"Assigned_Station",
"Assigned_Unit",
"Assigned_Names",
"Violation_Notice",
"Reinspection_Ref",
"Status",
"Agency_ID",
"Location_ID",
"City",
"Full Address")
For other variables, data was aggregated into groups that we found to be very related:
# simplify 311 data categories into broader labels
pests <- complaints %>%
filter(service_name %in% c("BEDBUGS CODES", "BEDBUGS HEALTH", "PESTS",
"VAP STRUC PETS", "VAP STRUC RATS", "VAP LOT PESTS",
"VAP LOT RATS", "HEALTH CONCERNS", "RAT BAITING"))
trash <- complaints %>%
filter(service_name %in% c("GARBAGE MISSED", "GARBAGE VIOLATIO", "ILLEGAL DUMPING",
"JUNK MISSED", "degt VIOLATION", "ROADSIDE LITTER",
"TRASH PVT PROP", "STREET FURNITURE", "VAP LOT TRASH",
"VAP STRUC TRASH", "YARD WASTE MISS", "YARD WASTE VIOL"))
We also engineered some variables to extract more explanatory value from them. In this example, we suspected that fire risk would vary by the decade in which a structure was built, so we engineered the year built into decade indicators to see if there were patterns in terms of when structures were constructed:
# feature engineer decade-built indicator variables
PVA <- PVA %>%
mutate(BUILTPRE30S = ifelse(YEAR_BUILT < 1930 & YEAR_BUILT > 0, 1, 0),
BUILT30s = ifelse(YEAR_BUILT < 1940 & YEAR_BUILT >= 1930, 1, 0),
BUILT40s = ifelse(YEAR_BUILT < 1950 & YEAR_BUILT >= 1940, 1, 0),
BUILT50s = ifelse(YEAR_BUILT < 1960 & YEAR_BUILT >= 1950, 1, 0),
BUILT60s = ifelse(YEAR_BUILT < 1970 & YEAR_BUILT >= 1960, 1, 0),
BUILT70s = ifelse(YEAR_BUILT < 1980 & YEAR_BUILT >= 1970, 1, 0),
BUILT80s = ifelse(YEAR_BUILT < 1990 & YEAR_BUILT >= 1980, 1, 0),
BUILT90s = ifelse(YEAR_BUILT < 2000 & YEAR_BUILT >= 1990, 1, 0),
BUILT00s = ifelse(YEAR_BUILT < 2010 & YEAR_BUILT >= 2000, 1, 0),
BUILT10s = ifelse(YEAR_BUILT < 2020 & YEAR_BUILT >= 2010, 1, 0))
Other variables came from the Census API, from which we gathered the variables needed for our analysis. The code below gathers the variables, renames them so that they are recognizable, and then calculates percentages:
#gather census variables
tract10 <-
get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E","B15001_050E",
"B15001_009E","B19013_001E","B25058_001E",
"B06012_002E"),
year = 2010, state='KY', county='Jefferson', geometry=T) %>%
dplyr::select(variable, estimate) %>%
as.data.frame() %>%
spread(variable,estimate) %>%
rename(TotalPop=B25026_001,
NumberWhites=B02001_002,
TotalFemaleBachelors=B15001_050,
TotalMaleBachelors=B15001_009,
MedHHInc=B19013_001,
MedRent=B25058_001,
TotalPoverty=B06012_002) %>%
mutate(percentWhite = ifelse(TotalPop > 0, NumberWhites / TotalPop,0),
percentBachelors = ifelse(TotalPop > 0, ((TotalFemaleBachelors + TotalMaleBachelors) / TotalPop),0),
percentPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
year = "2010") %>%
dplyr::select(-NumberWhites,-TotalFemaleBachelors,-TotalMaleBachelors,-TotalPoverty) %>%
st_sf() %>%
st_transform(102679)
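The tract-level census variables must then be attached to the fishnet. Below is a minimal sketch of one way to do this, assuming each grid cell takes the values of the tract containing its centroid (the exact join in our workflow may differ):
#join tract attributes to grid cells via cell centroids
fishnet.census <- fishnet %>%
st_centroid() %>%
st_join(tract10) %>%
st_drop_geometry() %>%
dplyr::select(uniqueID, TotalPop, MedHHInc, MedRent, percentWhite, percentBachelors, percentPoverty) %>%
left_join(fishnet, ., by = "uniqueID")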
Kernel density is a way to show hot spots of different variables, giving us another way of expressing each variable’s predictive power. Below, we show how to transform variables into this type of spatial feature:
#create kernel density function
dataXY_function <- function(data) {
dataXY <- data %>%
st_coordinates() %>%
as.data.frame() %>%
dplyr::select(X,Y) %>%
as.matrix()
return(dataXY)
}
kernel_function <- function(DATASET, WINDOW, MASK, SIGMA) {
data.xy <- dataXY_function(DATASET)
data.ppp <- as.ppp(data.xy, WINDOW)
data.dens <- spatstat::density.ppp(data.ppp, sigma = SIGMA)
density <- raster::extract(raster(data.dens), MASK, fun = mean) %>%
as.data.frame()
names(density) <- c("dens")
return(density)
}
Below, one determines the bounding box of the study area, in order to limit the processing area of the raster:
#determine bounding box for study area
boundingBox <- st_bbox(studyArea)
xMinMax <- c(boundingBox["xmin"], boundingBox["xmax"])
yMinMax <- c(boundingBox["ymin"], boundingBox["ymax"])
studyWindow <- owin(xMinMax, yMinMax)
mask <- fishnet %>% as('Spatial') %>% as('SpatialPolygons')
Then one can process each variable. Below, inspection data is turned into kernel densities using the mask created above, with a search bandwidth (sigma) of 1500 feet:
homeInsp <- kernel_function(homeInsp, studyWindow, mask, 1500)
newConstInsp <- kernel_function(newConstInsp, studyWindow, mask, 1500)
genInsp <- kernel_function(genInsp, studyWindow, mask, 1500)
otherInsp <- kernel_function(otherInsp, studyWindow, mask, 1500)
reInsp <- kernel_function(reInsp, studyWindow, mask, 1500)
planRevInsp <- kernel_function(planRevInsp, studyWindow, mask, 1500)
prePlanInsp <- kernel_function(prePlanInsp, studyWindow, mask, 1500)
Another important method of processing independent variables is finding their relative distances. Two kinds of distances are measured: the distance to the single nearest object, and the average distance to the n nearest neighbors. They use the same code, but the k value changes, as shown below. These variables are then joined back to the fishnet with the others.
#distance function
nn_function <- function(measureFrom, measureTo, k) {
nn <- get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
as.data.frame()
names(output) <- c("dist")
return(output)
}
Once the function is created, each variable can be processed as desired. In this example, the fishnet is first converted to points at each grid cell’s centroid. From these center points, distances can be processed to each individual variable. Below, different kinds of building permits were transformed to show both the distance to the nearest neighbor and the average distance to the nearest three neighbors.
fishxy <- dataXY_function(st_centroid(fishnet)) # measure from each grid cell's centroid
distPermBldg3 <- nn_function(fishxy, dataXY_function(perm.bldg), 3)
distPermElec3 <- nn_function(fishxy, dataXY_function(perm.elec), 3)
distPermHvac3 <- nn_function(fishxy, dataXY_function(perm.hvac), 3)
distPermBldg1 <- nn_function(fishxy, dataXY_function(perm.bldg), 1)
distPermElec1 <- nn_function(fishxy, dataXY_function(perm.elec), 1)
distPermHvac1 <- nn_function(fishxy, dataXY_function(perm.hvac), 1)
Finally, all of the variables are joined together to create a master dataset. The code below shows how to join the newly created variables to the final fishnet, including some density and distance examples:
fishnet <- fishnet %>%
mutate(dens.InspHome = homeInsp$dens,
dens.InspNewC = newConstInsp$dens,
dens.InspGen = genInsp$dens,
dens.InspOth = otherInsp$dens,
dens.InspRe = reInsp$dens,
dens.InspPlan = planRevInsp$dens,
dens.InspPreplan = prePlanInsp$dens,
distPerm_bldg3 = distPermBldg3$dist,
distPerm_elec3 = distPermElec3$dist,
distPerm_hvac3 = distPermHvac3$dist)
Visualization is important for communicating your material clearly. We began by creating a cohesive visual theme to minimize distractions and make the data simpler to understand.
To begin, we picked a color palette using the viridis package and settled on the “plasma” theme.
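As a quick sketch, candidate palettes can be previewed with the scales package (a ggplot2 dependency):
#preview the plasma palette from the viridis package
scales::show_col(viridis::plasma(8))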
Another step was to choose map and plot themes, making for a more cohesive look:
#map theme
mapTheme <- function(base_size = 12) {
theme(
text = element_text(color = "black"),
plot.title = element_text(size = 16, colour = "black", face = "bold"),
plot.subtitle=element_text(size = 12, face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_rect(fill = "white"),
axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_line(colour="white"),
panel.grid.minor = element_line(colour = "white"),
panel.border = element_blank()
)
}
#plot theme
plotTheme <- function(base_size = 12) {
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"),
panel.border = element_rect(colour = "black", fill=NA, size=2))
}
While most of the maps used raster plots or the fishnet in their visualization, the count data was sometimes set on a basemap. The following code shows how to make one, starting with loading a few basic shapefiles usually found on open data portals. We used the fire districts as our boundary:
# base map
LFD <- st_read("Louisville_KY_Fire_Districts.shp")
# transform the map to the correct projection, then run "union" to dissolve borders
LFD <- st_transform(st_union(LFD), crs = 102679)
# water
river <- st_read("waterClip.shp")
# parks
parks <- st_read("parkClip.shp")
# highways
roads <- st_read("hwyClip.shp")
#map it
baseMap <- ggplot() +
geom_sf(data = LFD, aes(), fill = "grey85", color = NA, size = 1) +
geom_sf(data = roads, aes(), fill = NA, color = "white") +
geom_sf(data = river, aes(), fill = "slategray2", color = NA) +
geom_sf(data = parks, aes(), fill = "palegreen", color = NA) +
mapTheme()
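As a hypothetical usage example, layers can then be added on top of the base map:
#example: overlay fire counts per grid cell on the base map
baseMap +
geom_sf(data = fishnet, aes(fill = countFire), color = NA) +
scale_fill_viridis_c(option = "plasma", name = "Fire Count") +
labs(title = "Structural Fires per Grid Cell")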
This report used several different techniques in order to perform the modeling. We’ll illustrate these for you below.
The initial step is to divide the data into test and training sets. First, create a partition that divides your data by percentage. In this case, we used 0.6 as the parameter, creating a training set of 60% and leaving the remaining 40% for the test set.
#test and training sets
partition.census <- createDataPartition(fishnet.census$countFire, p = 0.6, list = FALSE) #create a partition
train.census <- fishnet.census[partition.census, ]
test.census <- fishnet.census[-partition.census, ]
Once the data is divided, one can run models. In this example, we show how to run a Poisson model to create our demographic story.
#run a poisson model
regStories.census <- glm(countFire~., family="poisson", data=train.census)
summary(regStories.census)
#check the error
test.census$predFit <- predict.glm(regStories.census, as.data.frame(test.census), type = "response")
test.census <- test.census %>%
mutate(predFit = ifelse(is.na(predFit), 0, predFit))
test.census$absError <- abs(test.census$countFire - test.census$predFit)
mean(test.census$absError)
One way to check how your models are performing as you run them is to examine how the individual variables perform. A bar plot of standardized coefficients is a good way to measure this.
#create standardized coefficients
standardized <- as.data.frame(lm.beta(regStories.ensembling))
standardized$variable <- row.names(standardized)
colnames(standardized)[1] <- "std_coefficient"
standardized$absCoef <- abs(standardized$std_coefficient)
#plot the results
ggplot(standardized, aes(x=reorder(variable,-absCoef), y=absCoef, fill=variable)) +
geom_bar(stat="identity", color="black", size = 0.75) +
scale_fill_viridis_d(option = 'plasma')+
labs(x= "Variable",
y="Relative Importance in Model",
title= "Variable Importance",
subtitle = "The Ensemble Model") +
theme(legend.position = "none") +
scale_x_discrete(labels = c("Building Story", "Neighborhood Story", "Demographic Story")) +
plotTheme()
One machine learning model that we attempted to use was XGBoost (eXtreme Gradient Boosting). It vastly overfit our results, so we ultimately did not use it for the final model, but it was helpful for identifying what to aim for.
#run xg boost
fitXgb.building <- xgboost(data.matrix(train.building %>% dplyr::select(-countFire))
, train.building$countFire
, max_depth = 7
, eta = 0.02
, nthread = 2
, nrounds = 1000
, subsample = .7
, colsample_bytree = .7
, booster = "gbtree"
, eval_metric = "mae"
, objective="count:poisson")
#determine the error
test.building$predFit <- predict(fitXgb.building, data.matrix(test.building %>% dplyr::select(-countFire)))
test.building <- test.building %>%
mutate(predFit = ifelse(is.na(predFit), 0, predFit))
test.building$absError <- abs(test.building$countFire - test.building$predFit)
mean(test.building$absError)
The model is particularly helpful for getting a perspective on significant variables; this especially aided our variable selection.
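The importance ranking itself can be extracted from the fitted model; a sketch using xgboost’s built-in helpers:
#compute variable importance from the fitted model
importanceMatrix <- xgb.importance(model = fitXgb.building)
#plot the 15 most important features by gain
xgb.plot.importance(importanceMatrix, top_n = 15)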
Once the Poisson models are developed, one can create the ensemble of the various stories. In this example, we take the three stories, bind their predictions together, and use them as variables in another Poisson regression.
#bind the stories
vars.StoriesPred <- cbind(resultStoriesCV$countFire,resultStoriesCV$predStories1, resultStoriesCV$predStories2, resultStoriesCV$predStories3) %>%
as.data.frame() %>%
rename(countFire = V1, predStories1 = V2, predStories2 = V3, predStories3 = V4)
#run the regression
regStories.ensembling <- glm(countFire ~., family = 'poisson', data = vars.StoriesPred)
summary(regStories.ensembling)
resultStoriesCV <- cbind(resultStoriesCV, regStories.ensembling$fitted.values)
resultStoriesCV <- resultStoriesCV %>% st_as_sf()
resultStoriesCV <- resultStoriesCV %>%
rename(predEnsemble = regStories.ensembling.fitted.values) %>%
mutate(error.ensembling = predEnsemble - countFire) %>% st_as_sf()
mean(abs(resultStoriesCV$error.ensembling))
Below, we show how to cross-validate the model. We set the number of folds to 100, run the validation, and save the results.
#cross validation
#set the iteration and a seed
fitControl100.census <- trainControl(method = 'repeatedcv', number = 100)
set.seed(825)
# run 100-fold cross-validation
poiFit100.census <- train(countFire ~ .,
data = vars.census[-4049,],
method = "glm",
trControl = fitControl100.census,
family = poisson(link = "log"),
na.action = na.omit)
# quick check of mean and standard deviation for the 100 models
mean(poiFit100.census$resample[,3])
sd(poiFit100.census$resample[,3])
# prediction using the cross-validated model
resultStoriesCV1 <- predict(poiFit100.census, vars.census[-4049,])
# combine the results with other models to keep track
resultStoriesCV <- cbind(vars_noNA$uniqueID, vars_noNA$countFire, resultStoriesCV1) %>%
as.data.frame() %>%
rename(uniqueID = V1,
countFire = V2,
predStories1 = resultStoriesCV1) %>%
mutate(error1 = predStories1 - countFire,
uniqueID = as.character(uniqueID))
resultStoriesCV <- left_join(resultStoriesCV, fire_net500) %>% st_as_sf()
Another form of validation is to look at how the model performs across space, using spatial cross-validation. In this case, we looked at how the model performed across neighborhoods.
#spatial cross validation
spatialCV <- function(dataFrame, uniqueID, dependentVariable) {
training <- 0
testing <- 0
tuner <- 0
currentUniqueID <- 0
uniqueID_List <- 0
y <- 0
endList <- list()
#create a list that is all the spatial group unique ids in the data frame (i.e. all the neighborhoods)
uniqueID_List <- unique(dataFrame[[uniqueID]])
x <- 1
y <- length(uniqueID_List)
#create a counter and, while it is less than the number of neighborhoods...
while(x <= y) {
currentUniqueID <- uniqueID_List[x] #call a current neighborhood
#create a training set comprised of units not in that neighborhood and a test set of units in it
training <- dataFrame[ which(dataFrame[[uniqueID]] != uniqueID_List[x]),]
testing <- dataFrame[ which(dataFrame[[uniqueID]] == uniqueID_List[x]),]
trainingX <- training[ , -which(names(training) %in% c(dependentVariable))]
testingX <- testing[ , -which(names(testing) %in% c(dependentVariable))]
trainY <- training[[dependentVariable]]
testY <- testing[[dependentVariable]]
#fit a model on the training neighborhoods - note the dependent variable is hardwired
tuner <- lm(trainY ~ ., data = trainingX)
#come up with predictions for the test neighborhood
thisPrediction <- predict(tuner, testingX)
#calculate the MAE and the count of records for a given test set (to get a sense of sample size)
thisMAE <- mean(abs(testY - thisPrediction))
countTestObs <- nrow(testingX)
#add the current neighborhood, its MAE, and its observation count to the final list
thisList <- cbind(currentUniqueID, thisMAE, countTestObs)
endList <- rbind(endList, thisList) #build the final list to output
x <- x + 1 #iterate counter
}
return(as.data.frame(endList)) #return the final list of neighborhoods and associated MAEs
}
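A hypothetical call, assuming the fishnet has been joined to a neighborhood layer with a name column called nhood:
#run leave-one-neighborhood-out cross validation and summarize the errors
#(fishnet.nhood is an assumed name: the fishnet as a plain data frame with geometry dropped)
spatialCV.results <- spatialCV(fishnet.nhood, "nhood", "countFire")
mean(unlist(spatialCV.results$thisMAE))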
The most important measure of validation in this report is likely the kernel density comparison. The code below shows how to set it up and, ultimately, visualize it.
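Note that the comparison below assumes a kernel density estimate of the fire points themselves, dens.fire, which can be produced with the kernel_function defined earlier in this appendix (the 1000-foot sigma is an assumption):
#kernel density of fire events, extracted to the fishnet mask (row 4049 dropped to match below)
dens.fire <- kernel_function(fires, studyWindow, mask, 1000)$dens[-4049]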
#kernel density
# the df with KDE and geometry - line 4049 was removed due to NAs
dens.fire.sf <- cbind(fishnet500[-4049,], dens.fire, fire_net500[-4049,]$countFire) %>% st_as_sf()
names(dens.fire.sf)[3] <- 'fire_count'
pred.fire.sf <- cbind(fishnet500[-4049,], resultStoriesCV$predEnsemble, resultStoriesCV$countFire) %>% st_as_sf()
names(pred.fire.sf)[2] <- 'prediction'; names(pred.fire.sf)[3] <- 'fire_count'
# group KDE data with categories
dens.fire.sf <- dens.fire.sf %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(dens.fire.sf$dens.fire,100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%"
)) %>%
dplyr::select(label,Risk_Category,fire_count)
# group prediction results into categories
pred.fire.sf <- pred.fire.sf %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(prediction,100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%"
)) %>%
dplyr::select(label,Risk_Category,fire_count)
# visualize the comparison as a map
rbind(dens.fire.sf, pred.fire.sf) %>%
tidyr::gather(Variable,Value,-label, -Risk_Category, -geometry) %>%
filter(!is.na(Risk_Category)) %>%
ggplot() +
geom_sf(aes(fill=Risk_Category), colour=NA) +
facet_wrap(~label) +
scale_fill_viridis_d(option = "plasma") +
labs(title="Comparison of Kernel Density and Risk Predictions",
fill='Risk Category') +
mapTheme()
#visualize a comparison bar plot
rbind(dens.fire.sf, pred.fire.sf) %>%
tidyr::gather(Variable,Value,-label, -Risk_Category, -geometry) %>%
group_by(label, Risk_Category) %>%
summarize(fire_counts = sum(Value)) %>%
ungroup() %>%
filter(!is.na(Risk_Category)) %>%
group_by(label) %>%
mutate(Rate_of_test_set_fires = fire_counts/sum(fire_counts)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_fires)) +
geom_bar(aes(fill=label), position = 'dodge', stat='identity') +
scale_fill_manual(values = c('#6A00A8FF', '#FCA636FF')) +
labs(title='Comparison of Kernel Density and Risk Predictions',
x = 'Risk Category',
y = 'Rate of Fire Prediction') +
plotTheme()
Throughout the process, it’s helpful to create a results table. Here’s how to do so:
resultsTable <- rbind(resultsTable, data.frame(modelNum = 5,
modelName= "regStories.ensembling",
Description = "Poisson regression; Using the CV Prediction result from 3 stories",
MAE_SingleSet = mean(abs(resultStoriesCV$error.ensembling)),
avg_MAE_cv = poiFit100.ensembling$results[,4],
sd_MAE_cv = poiFit100.ensembling$results[,7],
err_MoransI = poiFit100.ensembling_moran$estimate[1],
err_MoransI_P = poiFit100.ensembling_moran$p.value)) %>% as.data.frame()
The model results for this report (using the results table) are included below. The final model is based on models 12-15:
modelNum | modelName | Description | MAE_SingleSet | avg_MAE_cv | sd_MAE_cv | err_MoransI | err_MoransI_P |
---|---|---|---|---|---|---|---|
1 | regPoi | Poisson regression; All Variables (189) | 0.187400 | 0.230300 | 0.100000 | -0.039200 | 1.0000000 |
2 | regPoi2 | Poisson regression; Significant Variables (20) | 0.243900 | NA | NA | NA | NA |
3 | regPoi3 | Poisson regression; Fire less Variables (187) | 0.269300 | 1.222100 | 8.486700 | -0.029400 | -0.0294000 |
4 | regStories.census | Poisson regression; Demographic Variables (6) | 0.432000 | 0.432900 | 0.083800 | 0.113200 | 0.0000000 |
5 | regStories.neighborhood | Poisson regression; Neighborhood Variables (110) | 0.387000 | 0.379200 | 0.087600 | -0.005300 | 0.7354000 |
6 | regStories.building | Poisson regression; Building Variables (110) | 0.392500 | 1.405800 | 10.268200 | 0.013600 | 0.0461000 |
7 | regStories.ensembling | Poisson regression; Using the CV Prediction result from 3 stories | 0.402400 | 0.863000 | 4.621800 | 0.039500 | 0.0000000 |
8 | RF | Random Forest; w/o distFires1, distFires3 (202) | 0.120700 | 0.114000 | 0.016700 | -0.019400 | 0.9750000 |
9 | XGBoosting | Gradient Boosting (202) | 0.058700 | 0.127500 | 0.016600 | -0.078900 | 1.0000000 |
10 | RF.woFire | Random Forest regression w/o distance to fire; ntree=100, mtry=204 (202) | 0.364200 | 0.385600 | 0.029400 | 0.033100 | 0.0010000 |
11 | XGBoosting.woFire | Gradient Boosting w/o distFire1/distFire3 (202) | 0.175300 | 0.310600 | 0.022700 | -0.001400 | 0.0001000 |
12 | regStories.census | Poisson regression; Demographic Variables (6) | 0.428157 | 0.434124 | 0.083703 | 0.113082 | 0.0000000 |
13 | regStories.neighborhood | Poisson regression; Neighborhood Variables (20) | 0.365737 | 0.377525 | 0.087690 | 0.042087 | 0.0000001 |
14 | regStories.building | Poisson regression; Building Variables (20) | 0.373927 | 0.403998 | 0.152531 | 0.051176 | 0.0000000 |
15 | regStories.ensembling | Poisson regression; Using the CV Prediction result from 3 stories | 0.399084 | 0.399978 | 0.092833 | 0.090696 | 0.0000000 |