
This project was produced as part of the University of Pennsylvania’s Master of Urban Spatial Analytics Spring 2019 Practicum (MUSA 801) taught by Ken Steif, Michael Fichman, and Matt Harris. We would like to thank the Philadelphia Fire Department for providing useful information and data.

This document is intended to enable others to replicate the methodology of a study of structural fires and hydrant inspections. We first introduce the context of the project, followed by the data, methodology, and modeling, with appendices to guide replication. A table of contents is provided for easy navigation, along with select hyperlinks throughout the document.

1. Introduction

1.1 Motivation

In November 2019, Fire Commissioner Adam Thiel announced that fires in Philadelphia had caused 32 deaths in 2019 alone, the highest annual toll since 2014. Fires result in costly structural damage and, in the worst case, the loss of human life. A city must therefore be prepared to fight fires, and that readiness is contingent on working fire hydrants. At present, hydrants are inspected by the Philadelphia Fire Department (PFD) and maintained by the Philadelphia Water Department (PWD). Inspections are carried out by the local PFD engines within their engine locals, in no particular order, which makes inefficient use of firefighters’ time.

According to data from the Philadelphia Fire Department, recorded fire incidents in Philadelphia have increased sharply since 2017. Although the spike in fire counts after 2017 resulted from a change in the procedure for recording fires (more information can be found in this article), inspecting fire hydrants efficiently and managing fire risk has become a new challenge for the department.

1.2 Use Case

A hydrant in urgent need of inspection could be the last on the list, yet firefighters must spend time and energy inspecting every hydrant regardless. To address this, we propose using past data on fires and factors related to fire incidents to identify areas in Philadelphia where fires are more likely; specifically, we built a latent risk model. Based on the modeling results, we then use social, hydrant, and land use factors to extend the fire model into a prioritization that identifies which hydrants are most urgent to maintain. Taking into consideration both fire risk and these related priorities, we aim to generate a list for local engines to inspect hydrants and for the PWD to then maintain them. This would ensure that in each engine district most hydrants are in good condition most of the time, allowing firefighters to focus on a few hydrants rather than inspecting all of them.

To summarize, the use case is:

  • Prioritize inspections for local fire engine districts

  • Ensure that at all times, especially in areas at high risk of fire, most hydrants are in good condition

  • Improve the inspection efficiency and reduce the need to inspect good hydrants

2. Data

Our goal is to provide decision-makers with insight into the built-environment and intrinsic characteristics that contribute to the priority of fire hydrant inspections, together with the distribution of fire risk over the City. Therefore, besides fire and hydrant information, factors that may explain fire causes are considered in our data selection, including public safety, demographics, properties, facilities, and environment. For our analysis, we primarily use:

  • Data from the Philadelphia Fire Department and Philadelphia Water Department, in order to work with the most accurate and up-to-date information on fire incidents and fire hydrants;

  • Open data sources (OpenDataPhilly), so that other municipalities with similar open data repositories can refer to or even reproduce our analysis.

The data was also “wrangled” before being explored in the following section. This process included various transformations of the data to optimize the predictive ability of each variable; for details on this procedure, please see the Feature Engineering section. Through exploratory analysis and the modeling process, the final dataset was narrowed down to a set of best predictors. For results on how the data ultimately performed in the models, please see Model Building.

3. Exploratory Analysis

A hydrant should have a higher priority of being inspected if

  1. there is a latent risk of fire in the vicinity;

  2. the hydrant is close to social impact factors, such as vulnerable populations and certain facilities (we rely on these factors because it is not feasible to model hydrant conditions directly).

Based on the above, we built a latent risk model to predict areas with higher fire risk. A latent fire risk model is an alternative approach that utilizes neighborhood, environmental, and demographic variables to help identify the distribution of fire counts; it does not assume that past fires are the best indication of future ones and therefore allows risk to vary across the city. In order to best account for this variation, the city was divided into 800 x 800 ft grid cells, translating to a city-block-level unit of analysis. (The National Fire Protection Association code states that the maximum distance between hydrants shall not exceed 800 ft for detached one- and two-family dwellings, and 500 ft for other buildings, so we decided that an 800 x 800 ft grid cell would be an appropriate unit of analysis.)

Together, the grid cells form a “fishnet.” Mapping fires to the fishnet conveys the relative rarity of such events, with only 43% of grid cells containing at least one fire. This observation subsequently shaped our modeling approach. In addition to fires, other variables were also added to the fishnet, so that each grid cell contained risk factors, built environment data, demographic characteristics, and time-space characteristics.

For our analysis, we decided to further explore fire risk in Philadelphia using data from 2018 and 2019, since the procedure for recording fires changed after 2017. The following plot shows the distribution of fires in Philadelphia in 2018 and 2019 using kernel densities.

In addition to the predicted result of the fire risk model, three extra factors that we felt could be important are also taken into consideration when determining the final priority of fire hydrants in each engine district: Industrial Area, Social Impact, and Hydrant Age (the calculation method is introduced in section 5).

3.1 Feature engineering

In order to predict fire risk at each location citywide, we need to account for the count of fires as well as measures of exposure to risk factors associated with fire, like blight. Feature engineering is the process of measuring this exposure.

To best extrapolate the relationship between features and fires, we “engineered” features to include in our fishnet. Many of these variables have more explanatory power as varied spatial measurements, so when aggregating and engineering features we adopted three measurements:

1. Kernel Density: the average spatial density of a given feature per grid cell.

2. Count: the count of a given feature per grid cell.

3. Nearest Neighbor Distance: the average distance from each grid cell to a given number of nearest features.

The features used in the model were classified into four categories: built environment factors, risk factors, demographic factors, and time-spatial factors. A brief sketch of how such features are computed appears below. For an explanation of which measurements were used for each variable, please see Appendix: Data Dictionary.
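As an illustration, here is a minimal sketch of one feature of each type, reusing the helper functions defined in Appendix 7.2 (the unsafe_18 violations object and the fishnet come from the appendix code; treat this as a sketch rather than the production pipeline):

# kernel density, count, and nearest-neighbor versions of one feature
unsafe.KD <- kernelDensity(unsafe_18, fishnet, 'unsafe_KD')      # density per grid cell
unsafe.count <- count_net(unsafe_18, fishnet, 'unsafe_count')    # count per grid cell
fishnet$unsafe_18.nn3 <- nn_function(                            # mean distance to 3 nearest
  st_coordinates(st_centroid(fishnet)),
  st_coordinates(unsafe_18), 3)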

3.2 Hydrant history map in Philadelphia

In 1803, Frederick Graff, Sr. designed, for the then recently constructed Philadelphia water works, a stand-pipe hydrant intended to remain permanently in position and to be constantly charged with water. After that, hydrants in Philadelphia continued to expand as the city grew. In the 1900s, fire hydrants were mainly located in the city center; in the 1950s, their distribution gradually began to spread around the city; by the 2000s, fire hydrants basically covered the whole city, with hydrant density in the downtown area greater than in the surrounding areas.

In the present day, hydrant density is highest in Center City and dissipates outward from it. This could mean that firefighters spend more time inspecting hydrants in Center City due to the higher concentration. Of course, the high hydrant density in Center City may also be related to the higher frequency of fire incidents in the downtown area.

3.3 Do fires cluster in space?

Before jumping into feature engineering, it is worth studying the spatial correlation of fire incidents to understand if fires have a tendency to cluster in Philadelphia. We explored the spatial distribution of fires (2018 data) using the Local Moran’s I statistic for spatial autocorrelation. The GIF shows the fire clusters identified at different p-value thresholds. Does the presence of one fire event correspond with a higher volume of others nearby? If so, we would expect to observe clusters of fires, as opposed to a random distribution of these events across Philadelphia, and indeed, we see that this is the case in the map below. In fact, the amount of clustering is statistically significant.

Yet, despite this clear relationship, relying too heavily on past events could undermine the ability to identify latent risk; our model could be “overfit” on occurrences of previous fires or lack thereof. Future fires are by no means limited to such trends. What else, then, are fires related to?

3.4 Does the presence of blight affect density of fires?

The spatial clustering of fires we have identified implies that other factors may contribute to the propensity of fires to occur where they have before. In order to predict the latent risk of fire, Licenses and Inspections (L&I) violations that could potentially increase the risk of fire were analyzed. Four different L&I violations were considered:

1. Unsafe Structures
A structure is deemed unsafe if it is dangerous to the life, health, and safety of the public or its occupants, either because it does not protect/warn occupants in the event of a fire, or because it contains unsafe/damaged equipment. It therefore felt necessary to examine how the presence of unsafe structures could increase the risk of fire. Given such a relationship, this factor would be useful in predicting the risk of fire and identifying areas where a working hydrant is more crucial.

2. Imminently Dangerous Structures
An imminently dangerous structure is one that ‘IS IN IMMINENT DANGER OF COLLAPSE YOU MUST REPAIR OR DEMOLISH SAID STRUCTURE IMMEDIATELY’. The presence of such structures could add to the risk of fire in the area.

3. Properties without Smoke Alarms
Such properties lack the required smoke alarms and thus put lives in danger should there be a fire, making it necessary to ensure hydrants near these areas are in working condition.

We ran a closeness test to see how strongly these features relate to fires. The bold lines indicate the observed distance between each feature and fires. The histograms show the distances from 1,000 simulations in which we randomly scattered these blight points across Philadelphia. The bold lines are far to the left of the random histograms, indicating that fires are indeed closer to these features than would be expected by random chance.

Subsequently, we calculated the kernel density of each violation type, the distance from each grid cell to the 5 nearest violations, and the violation count in each grid cell, then explored the relationship between these measures and the kernel density of fire. The following scatter plots show the relationship between some of the features we engineered and the kernel density of fires, indicating that the presence of L&I violations indeed correlates with the presence of fires. In particular, there appears to be a strong correlation between fire density and unsafe structures.

3.5 Could residential property attributes relate to the density of fires?

Next, we examined whether residential property attributes relate to fire risk, including median house age and average house value. To be more specific, we aggregated the property age and property value (per sq ft) to the fishnet, and then calculated the median house age and average value per size. Finally, we compared the relationship between those features and fire density.

The following scatter plots indicate that property age has a stronger relationship with fire kernel density than average house value (per sq ft), and the areas with the oldest properties tend to coincide with the highest fire risk areas.

3.6 Does land use affect the density of fires?

Finally, we explored the relationship between different land use types and fire risk. Three land use types are assumed to be related to fire risk: commercial, industrial, and residential areas. To test our hypothesis, we calculated the percentage of each land use type for each grid cell, and then plotted the correlation scatter plots.

Below, we observe that the current distribution of land use types is somewhat correlated with fires, such that fire risk increases with a higher percentage of residential area.

4. Modeling

4.1 Feature Selection

Having explored the relationship between various features and the kernel density of fires, we proceeded to build our model to predict fire risk in Philadelphia. The processes described in section 3 were repeated for more variables, such as crime, 311 requests, tree density, and the relationship with fires over the past 5 years.

To better visualize these correlations, we classified all of the features into 4 stories: the built environment story, risk factor story, demographic story, and time-spatial lag story. The following correlation matrices show the relationship between the fire count (count), which is the value the model will predict, and the predictors used in the model; a higher absolute value indicates a stronger correlation. A sketch of how one such matrix can be generated follows the list below.

The appendices in part 7 include a data dictionary with a full list of variables and their data sources, as well as how those features were compiled, cleaned, and feature engineered.

1. Built Environment Story
2. Risk factor Story
3. Demographic Story
4. Time-Spatial Lag story
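As a sketch, the risk factor matrix can be generated as follows, assuming a fishnet_data object that holds the fire count alongside the engineered variables from the data dictionary (corrplot is loaded in the appendix code):

risk_vars <- fishnet_data %>%
  st_set_geometry(NULL) %>%
  dplyr::select(count, unsafe_18.nn3, ID_18.nn3, ALM_18.nn3,
                unsafe_count, ID_count, ALM_count)
corrplot(cor(risk_vars, use = "pairwise.complete.obs"),
         method = "color", type = "lower")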

4.2 Model Selection

Having created a dataset of the features described in 4.1, we proceeded to test four different algorithms to identify which would best predict fire counts in each grid cell. The models were selected using 2018 data (each instance in the model corresponded to one grid cell): 70% of the 2018 data was used to train the model, while the remaining 30% was set aside for testing.

The four models used were a linear regression model (LM), a generalized linear regression model (GLM), a random forest model (RF), and gradient boosted decision trees (XGB). The GLM, RF, and XGB models could be further tuned by selecting certain hyper-parameters. This was done using grid-search cross-validation, and the set of parameters that returned the lowest mean absolute error for each model is reported in the following table; a sketch of this step follows.
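Below is a minimal sketch of the split and tuning step using caret (loaded in the appendix code); fishnet_2018 and the mtry grid are illustrative assumptions, not the exact objects and values we used:

set.seed(801)
in_train <- createDataPartition(fishnet_2018$count, p = 0.7, list = FALSE)  # 70/30 split
train_2018 <- st_set_geometry(fishnet_2018[in_train, ], NULL)
test_2018  <- st_set_geometry(fishnet_2018[-in_train, ], NULL)

rf_fit <- train(count ~ ., data = train_2018,
                method = "rf", metric = "MAE",                 # minimize mean absolute error
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = expand.grid(mtry = c(2, 6, 10)))    # grid search over mtry
rf_fit$bestTune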

The same table shows the results of each of these optimized models on two different datasets: one without spatial structure variables, and a second including the engine district and neighborhood that each grid cell belongs to. As a measure of performance, the mean absolute error on the count of fires was used. The random forest and gradient boosted decision trees outperformed the linear models on both datasets, probably because the underlying relationship between the predictors and fire is not linear.

We used four methods to validate our models: 5-fold CV on 2018 data, a 70/30 train-test split, LOGOCV on engine locals, and LOGOCV on neighborhoods. K-fold cross-validation and the 70/30 train-test split help verify that the model is generalizable across the dataset, and not only effective on subsets. Leave-One-Group-Out (LOGO) cross-validation is a robust way to assess the generalizability of our model across space. The errors from LOGO cross-validation are smaller when the engine district and neighborhood information is added as predictors, which indicates that our model generalizes well across both engine districts and neighborhoods.
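A sketch of the LOGO step using caret's groupKFold, where NAME is the neighborhood column from the data dictionary and everything else is illustrative:

folds <- groupKFold(fishnet_2018$NAME)                 # one fold per held-out neighborhood
logo_fit <- train(count ~ ., data = st_set_geometry(fishnet_2018, NULL),
                  method = "rf", metric = "MAE",
                  trControl = trainControl(method = "cv", index = folds))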

Based on the above table, the Random Forest model performed the best, and was thus selected to predict 2019 fires.

4.3 Model Validation

After selecting the best model, Random Forest, it was trained on the entire 2018 dataset. This trained model was then used to predict 2019 fires.

4.3.1 Accuracy

On 2019 data, there was an average of 1.118 fires per grid cell in Philadelphia; the mean prediction error is only about 11% of this value.

Accuracy of Model

Test Set    Mean Prediction Error    Mean Observed Fires
2018 Data   0.008                    1.195
2019 Data   0.130                    1.118

The figure below provides another way to visualize the accuracy of predictions. The observed fire counts for each grid cell are split into ten deciles, running from lowest to highest, and the mean predicted and observed counts for each decile are plotted. The prediction for 2018 is more accurate than the prediction for 2019, which is expected given that we used the 2018 data as our training dataset.

It is interesting to note that for the 2019 prediction, the model over predicts in the 1st through 8th deciles but under predicts in the 9th and 10th deciles; in other words, we over predict in low fire risk areas and under predict in high fire risk areas to some extent. While the over predictions in the lower deciles are small, the under prediction in the 10th decile is off by about 1 fire. Though this is still a small difference, future iterations of this model could include some random noise to model an unexpected shock that could produce sudden spikes in fires in certain parts of the city. A sketch of the decile comparison follows.
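A minimal sketch of this decile comparison, assuming observed counts and model predictions for the 2019 grid cells (fishnet_2019 is an illustrative name):

decile_compare <- fishnet_2019 %>%
  st_set_geometry(NULL) %>%
  mutate(predicted = predict(rf_fit, newdata = .),
         decile = ntile(count, 10)) %>%           # split observed counts into deciles
  group_by(decile) %>%
  summarise(mean_observed  = mean(count),
            mean_predicted = mean(predicted))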

4.3.2 Generalizability

A well generalized fire risk model, one that predicts equally well in all engine districts and neighborhoods, is one which can learn fire risk patterns that apply at both citywide and local spatial scales. Here, we focus on the generalizability of our model across space. We have done this by looking at the Mean_MAE and Sd_MAE values from LOGO cross-validation: in this process, the fishnet grid cells of one engine district/neighborhood are set aside as a test set, while the rest are used to train the model. By doing so, we can better assess whether there is intrinsic bias baked into our model.

The following table demonstrates the generalizability of the model across engine district spaces and neighborhood spaces. The errors on each grid cell were grouped by the neighborhood they fall in, the mean error was calculated for each neighborhood, and the average and standard deviation of these neighborhood means were then computed. The same was done for engine districts. A smaller standard deviation indicates a model that is better able to generalize across that space. From the table below, the model generalizes well across engine districts, but less so across different neighborhoods in Philadelphia.
Generalizability of Model on 2019 Data

Group              Average MAE    Standard Deviation MAE
By Neighborhood    0.254          0.496
By Engine Number   0.189          0.296

However, in the context of hydrant inspections, it is more important for the model to generalize well across engine districts, since inspections are carried out by each engine. The following maps indicate the predicted fire count and the MAE for each engine district; we can conclude that our model generalizes well within the high fire risk engine districts. A sketch of the grouped-error calculation follows.
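A sketch of the grouped-error summary behind the table, assuming fishnet_2019 carries the observed count, a predicted column, and the NAME / ENGINE_NUM columns from the data dictionary:

group_mae <- function(df, group_col) {
  df %>%
    st_set_geometry(NULL) %>%
    mutate(error = abs(predicted - count)) %>%
    group_by(.data[[group_col]]) %>%
    summarise(MAE = mean(error)) %>%                      # one MAE per group
    summarise(Mean_MAE = mean(MAE), Sd_MAE = sd(MAE))     # spread across groups
}
group_mae(fishnet_2019, "NAME")          # by neighborhood
group_mae(fishnet_2019, "ENGINE_NUM")    # by engine district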

4.3.3 Does this model allocate better than traditional fire hotspots?

Fire departments the world over have moved to hot spot fire prediction, targeting hydrant inspection resources to the places where fire incidents are most concentrated. The standard approach to risk mitigation is often to look at a kernel density map to identify risk hot spots, so comparing our model with a kernel density model is one important form of validation we can perform.

For each grid cell and model type (density vs. risk prediction), there is now an associated risk category and observed fire count. Below, a map is generated of the risk categories for both model types, with the test points (true 2019 fires) overlaid. A strongly fit model should show that the highest risk category is uniquely targeted to places with a high density of fires. Ideally, our model should perform better than this traditional approach, showing that we offer a better alternative for predicting fire.

Finally, the rate of observed fire points by risk category and model type is calculated and plotted; our model captures a greater share of observed fire incidents in the highest risk category than the kernel density approach does. A sketch of this calculation follows.
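A sketch of the capture-rate calculation, assuming fishnet_2019 holds the observed 2019 count, the 2018 fire kernel density (Fire18KD), and our predicted counts:

capture_rate <- fishnet_2019 %>%
  st_set_geometry(NULL) %>%
  dplyr::select(count, Fire18KD, predicted) %>%
  gather(model, risk, -count) %>%               # long form: one row per cell per model
  group_by(model) %>%
  mutate(risk_category = ntile(risk, 5)) %>%    # 5 risk categories per model
  group_by(model, risk_category) %>%
  summarise(pct_of_fires = sum(count) / sum(fishnet_2019$count))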

5. Hydrant Priority Score

While we believed that fire risk is a very important factor in prioritizing hydrant inspections, we acknowledged that there might be other reasons for wanting to inspect certain hydrants. This list is far from exhaustive, but for the purpose of this project, we outlined three additional factors which we felt could be important. To account for the magnitude of industrial fires, the presence of industrial land was added. To protect vulnerable communities, the social impact factor was generated. Finally, to account for the hydrant’s intrinsic characteristic, the hydrant age was added.

5.1 Selection of final priorities

  1. Fire Risk: For each grid cell, we take the predicted fire count from the fire risk model.

  2. Industrial Area: For each grid cell, we get the kernel density of industrial parcels from the land use data.

  3. Social Impact: For each grid cell, we combine the distance to schools, median income, population density, and median age to produce a social impact score (a sketch follows this list). To be more specific, the process of calculating the social score is as follows:

    1. The distance to schools from each grid cell, and the average median_age, average median_income, and average population_density of the census tracts intersecting each grid cell, were calculated.

    2. Each of these 4 values (at the grid cell level) was normalized.

    3. Transformations were applied to each of the four factors to reflect priority (higher number = higher priority), and the transformation rules are as follows:
      Lower distance to school == higher priority. Thus a transformation of y = -x was used.
      Lower median income == higher priority. Thus a transformation of y = -x was used.
      Lower and higher ages == higher priority. Since the factor was normalized with mean 0 and standard deviation 1, a transformation of y = x^2 was used, so that more negative and more positive values receive a higher priority.
      Lower population density == lower priority. Thus a transformation of y = x was used.

    4. After the transformations, median_age was normalized again to a mean of 0 and sd of 1; the rest retained their normalized states even after transformation.

    5. For each grid cell, the sum of these four normalized factors was added together to create the social score. (higher score == higher priority)

  4. Hydrant Age (specific to a hydrant): For each hydrant, its age, taken as 2020 minus the year built.
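A minimal sketch of steps 1-5 above, where z() performs the normalization and the column names (dist_school, med_inc, med_age, pop_density) are illustrative assumptions:

z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

social_score <- fishnet_data %>%
  mutate(score_school = -z(dist_school),     # closer to school -> higher priority
         score_income = -z(med_inc),         # lower income -> higher priority
         score_age    = z(z(med_age)^2),     # very young/old -> higher priority, re-normalized
         score_pop    = z(pop_density)) %>%  # denser -> higher priority
  mutate(social = score_school + score_income + score_age + score_pop)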

5.2 Calculating the final score for each potential combination

As the fire risk, industrial, and social impact scores were on a grid cell level, they had to be aggregated to the hydrants. The process of calculating the different scores for each hydrant is as follows:

  1. For each of the first three priorities, create a measure for it per grid cell. (A raw value)

  2. For each of the first three priorities at a grid cell level, aggregate it to a hydrant. This was done using a 400 ft radius buffer around each hydrant; the score for each hydrant was then the mean over all grid cells intersecting the hydrant’s buffer.

  3. To each hydrant, append the age.

  4. Normalize all four scores at the hydrant level. (Mean=0, sd=1)

  5. Calculate various combinations of these normalized scores. This was done in a for loop (see the sketch after this list).

  6. After the combination was calculated for each hydrant, the final priority scores were grouped into deciles at the last step. This means that for each combination, 10% of all hydrants are in each decile.
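A sketch of steps 2-6; fishnet_scores (the grid-cell-level scores), hydrant_id, and z() from the sketch above are illustrative assumptions, while YEAR_INSTA is the install-year field used in the appendix code:

hydrant_scores <- hydrants %>%
  mutate(hydrant_id = row_number(), age = 2020 - YEAR_INSTA) %>%
  st_buffer(400) %>%                       # 400 ft buffer around each hydrant
  st_join(fishnet_scores) %>%              # grid cells intersecting each buffer
  st_set_geometry(NULL) %>%
  group_by(hydrant_id, age) %>%
  summarise(fire_risk  = mean(fire_risk),
            industrial = mean(industrial),
            social     = mean(social)) %>%
  ungroup() %>%
  mutate(fire_risk = z(fire_risk), industrial = z(industrial),
         social = z(social), age = z(age))  # normalize at the hydrant level

combos <- expand.grid(f = 0:1, i = 0:1, s = 0:1, a = 0:1)[-1, ]  # all non-empty combinations
for (r in seq_len(nrow(combos))) {
  w <- unlist(combos[r, ])
  total <- with(hydrant_scores,
                w[1]*fire_risk + w[2]*industrial + w[3]*social + w[4]*age)
  hydrant_scores[[paste(w, collapse = "")]] <- ntile(-total, 10)  # decile 1 = highest priority
}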

The 4-digit code indicates which of the four priorities (fire risk, industrial area, social impact, hydrant age) are taken into consideration. For example, 0110 means that in this combination, we only add the industrial area and social impact scores into our inspection priority score.

The resulting combinations grouped into deciles by engine district is as follows:

By grouping the scores into deciles by engine district, we can get a sense of the inspection priority for each engine. This would be important for the firefighters who are tasked with inspecting hydrants in their engine district. To get a broader perspective, we also grouped the scores into deciles across the whole city, following the same 4-digit codes. The result is as follows:

The above two maps show the decile score (1-10, with 1 being the highest priority) for each hydrant. For the deciles by engine district, deciles were taken with respect to each engine district in Philadelphia, meaning each engine district has an equal distribution of scores 1-10. The latter, the deciles by Philadelphia, was calculated with respect to the entire city, so the distribution of priorities is not even in each engine district. Since hydrants are inspected by engine district, it seems reasonable to report the priority scores by engine district, as in the former; but should the Fire Chief wish to examine the distribution of priority scores across Philadelphia, this is also possible with the latter. A brief sketch of the two groupings follows.
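The two groupings can be sketched as follows, assuming each hydrant in the illustrative hydrant_scores also carries its ENGINE_NUM and a chosen combination column, priority:

hydrant_scores %>%
  group_by(ENGINE_NUM) %>%
  mutate(decile_by_engine = ntile(-priority, 10)) %>%   # deciles within each district
  ungroup() %>%
  mutate(decile_citywide = ntile(-priority, 10))        # deciles across the whole city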

6. Looking Forward

Based on the final model, we have created an algorithm to set the inspection priority of each hydrant that is reasonably accurate yet generalizable for both the engine districts and the whole city. As the city continues to collect information about fire incidents, this model can be re-trained using 2019 data to predict risk for 2020 and 2021, and features accounting for different seasons can also be included. Furthermore, given the unprecedented nature of the current COVID-19 situation, fire risk might differ completely from normal circumstances, and discretion must be exercised when following this model.

In the future, this model can be improved through the inclusion of more data. In particular, it would be useful to include data from the Philadelphia Water Department on the results of water quality tests done across the city. This data could be used to build a second model of the underground risk factors of hydrant quality, accounting for the intrinsic water and hydrant characteristics that are important in prioritizing inspections of hydrants.

One shortfall of the current model is the exclusion of a few hydrants in Wyndmoor and Tinicum Township, which fall outside Philadelphia’s city limits, making it difficult to get risk factor data for hydrants in these areas. Future work on this project could include methods to aggregate the same risk factor data to these hydrants.

Finally, the algorithm outlined in this report was used to build a web application that allows the Philadelphia Fire Department to identify areas of high fire risk, as well as to customize their own prioritization model using the scores grouped into deciles by engine district from section 5. Our hope is that the model and application can be used by the department in their inspections, and that more work can be done to improve them further. For more information about the App, please click here.

7. Appendices

7.1 Data Dictionary

The following tables list the engineered features:
Table 1. features related to risk factor
Feature Description Data Source
crime average distance to closest 3 crimes, count and kernel density Crime(OpenDataPhilly)
unsafe_18.nn3 average distance from each cell center to 3 nearest L&I unsafe structures L&I violations (OpenDataPhilly)
ID_18.nn3 average distance from each cell center to 3 nearest L&I imminently dangerous structures L&I violations (OpenDataPhilly)
ALM_18.nn3 average distance from each cell center to 3 nearest L&I violations that lack proper fire extinguisher for commercial cooking L&I violations (OpenDataPhilly)
unsafe_count the count of L&I unsafe structures within each grid cell L&I violations (OpenDataPhilly)
ID_count the count of L&I imminently dangerous structures within each grid cell L&I violations (OpenDataPhilly)
ALM_count the count of L&I violations that lack proper fire extinguisher for commercial cooking within each grid cell L&I violations (OpenDataPhilly)
hydrant_Down18.nn3 average distance from each cell center to 3 nearest 311 requests about hydrant knocked down 311 requests (OpenDataPhilly)
hydrant_req.nn3 average distance from each cell center to 3 nearest 311 requests about hydrant request 311 requests (OpenDataPhilly)
smoke_detecor18.nn3 average distance from each cell center to 3 nearest 311 requests about smoke detector 311 requests (OpenDataPhilly)
complainfire18.nn3 average distance from each cell center to 3 nearest 311 requests about the complaints against fire or EMS 311 requests (OpenDataPhilly)
fireres_or_com18.nn3 average distance from each cell center to 3 nearest 311 requests about the fire residential or commercial 311 requests (OpenDataPhilly)
hydrant_Down18_count the count of 311 requests about hydrant knocked down within each grid cell 311 requests (OpenDataPhilly)
hydrant_req_count the count of 311 requests about hydrant request within each grid cell 311 requests (OpenDataPhilly)
smoke_detecor18_count the count of 311 requests about smoke detector within each grid cell 311 requests (OpenDataPhilly)
complainfire18_count the count of 311 requests about the complaints against fire or EMS within each grid cell 311 requests (OpenDataPhilly)
fireres_or_com18_count the count of 311 requests about the fire residential or commercial within each grid cell 311 requests (OpenDataPhilly)
Table 2. features related to built environment factor
Feature Description Data Source
hydrant_KD kernel density of fire hydrants in each grid cell Water department
val_per_size average value of houses per size in grid cell (data is over past 5 years, not specific to a year) Office of Property Assessments
building_age median building age in grid cell Office of Property Assessments
fire_district NND to closest 1 fire station Open Data Philly
tree_cover average distance to closest 20 trees, count and kernel density Open Data Philly
percent_residential the percent of parcels labelled as such in each grid cell PDPD
percent_commercial the percent of parcels labelled as such in each grid cell PDPD
percent_industrial the percent of parcels labelled as such in each grid cell PDPD
Table 3. features related to demographic factor
Feature Description Data Source
Total_Pop total population for the tracts in each grid cell ACS (2013-2018)
Med_Inc median income for the tracts in each grid cell ACS (2013-2018)
Med_Age median age for the tracts in each grid cell ACS (2013-2018)
Percent_White the percent of white people for the tracts in each grid cell ACS (2013-2018)
Percent_Bachelor the percent of people who hold a bachelor’s degree for the tracts in each grid cell ACS (2013-2018)
Table 4. features related to time-space factor
Feature Description Data Source
lagfire18.nn5 the distance from each grid cell center to 5 nearest fire accidents in 2018 Fire department
lagfire17.nn5 the distance from each grid cell center to 5 nearest fire accidents in 2017 Fire department
lagfire16.nn5 the distance from each grid cell center to 5 nearest fire accidents in 2016 Fire department
lagfire15.nn5 the distance from each grid cell center to 5 nearest fire accidents in 2015 Fire department
ENGINE_NUM engine number Fire department
NAME neighborhood name Neighborhood
fire.isSig whether this cell is fire significant Fire department
fire.isSig.dist the distance to the nearest fire hotspot cell Fire department

7.2 Code

7.2.1 Data Preparation

  1. Preparation
library(tigris)
options(tigris_use_cache = TRUE) 
library(tidycensus)
library(viridis)
library(riem)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(gifski)
library(png)
library(ggsn)
library(gganimate)
library(tidyverse)   # includes ggplot2, dplyr, and readr
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(car)         # test collinearity
library(gmodels)
library(MASS)
library(e1071)
library(Hmisc)
library(Boruta)
library(corrplot)
library(stargazer)
library(GoodmanKruskal)
library(spatstat)
library(readxl)
library(QuantPsyc)
library(lubridate)
library(jsonlite)
library(scales)
library(ggpmisc)     # to get R squared

options(scipen =999)

# create mapTheme() and plotTheme() functions used to standardize the map and plot outputs 
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    # the probabilities and na.rm belong inside quantile()
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}
palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")


q10 <- function(variable) {as.factor(ntile(variable, 10))}
  2. Standardize Functions
##### count/kernel density/knn functions #####
# count function 
count_net <- function(data,fishnetGrid, name){
  count<- data %>% #get count by fishnet
    dplyr::select() %>% 
    mutate(count = 1) %>% 
    aggregate(., fishnetGrid, sum) %>%
    mutate(count = ifelse(is.na(count), 0, count),uniqueID = rownames(.))%>%
    plyr::rename(.,c('count' = name))
  return (count)
}

# kernel density function 
kernelDensity <- function(data, fishnetGrid, name){
  count<- data %>% #get count by fishnet
    dplyr::select() %>% 
    mutate(count = 1) %>% 
    aggregate(., fishnetGrid, sum) %>%
    mutate(count = ifelse(is.na(count), 0, count),
           uniqueID = rownames(.),
           cvID = sample(round(nrow(fishnetGrid) / 24), size=nrow(fishnetGrid), replace = TRUE))
  
  point_ppp <- as.ppp(st_coordinates(data), W = st_bbox(count))
  KD <- spatstat::density.ppp(point_ppp, 1000)
  
  KD_dataframe<-as.data.frame(KD) %>% 
    st_as_sf(coords = c("x", "y"), crs = st_crs(fishnetGrid)) %>%
    aggregate(.,fishnetGrid, mean)%>%plyr::rename(.,c('value' = name))
  
  return(KD_dataframe)
}

kernelDensity_year <- function(pointData, year, fishnetGrid, name){
  data<-pointData %>%filter(Year == year)
  count<- data %>% #get count by fishnet
    dplyr::select() %>% 
    mutate(count = 1) %>% 
    aggregate(., fishnetGrid, sum) %>%
    mutate(count = ifelse(is.na(count), 0, count),
           uniqueID = rownames(.),
           cvID = sample(round(nrow(fishnetGrid) / 24), size=nrow(fishnetGrid), replace = TRUE))
  
  point_ppp <- as.ppp(st_coordinates(data), W = st_bbox(count))
  KD <- spatstat::density.ppp(point_ppp, 1000)
  
  KD_dataframe<-as.data.frame(KD) %>% 
    st_as_sf(coords = c("x", "y"), crs = st_crs(fishnetGrid)) %>%
    aggregate(.,fishnetGrid, mean)%>%plyr::rename(.,c('value' = name))
  
  return(KD_dataframe)
}

#Nearest neighbor (NND) function
nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    dplyr::summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)  
}
  3. Load Data and Create Fishnet
##### DATA (external sources)#####
#load the boundary data
philly<- st_read("http://data.phl.opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson") %>%
  st_set_crs(4326) %>%
  na.omit() %>%
  st_transform(2272)

# load the neighborhood data
neighbor <- 
  st_read("C:/Users/HP/Desktop/MUSA800_practium/data/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.shp")%>%
  na.omit() %>%
  st_transform(crs=2272)
# plot the neighbourhood
ggplot() +
  geom_sf(data=philly,fill=NA) +
  geom_sf(data=neighbor,fill=NA) +
  labs(title = "neighborhoods in Philly") +
  mapTheme()

# load the census tracts data
tracts <- 
  st_read("http://data.phl.opendata.arcgis.com/datasets/ccdc0d022f4949a7b11333bd37231aef_0.geojson")%>%
  st_set_crs(4326) %>%
  na.omit() %>%
  st_transform(crs=2272)

##### DATA (original)#####
# load the fire data
fire_excel<-read_excel('C:/Users/HP/Desktop/MUSA800_practium/origin_data/2015 to 2019 fires for u of pa report run 1620.xlsx')
fire.sf <- 
  fire_excel%>% 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant") %>%
  st_set_crs(4326) %>%
  st_transform(2272)
fire.sf <- 
  st_intersection(fire.sf, philly)

#Plot all fire points (2015-2019)
#ggplot() +
#geom_sf(data = philly, fill = "grey40") +
#geom_sf(data = fire.sf , colour = "#FA7800", size = .75) +
#labs(title = "Philly fires") +
#mapTheme()

# load hydrant inspection data
inspect <- 
  st_read("C:/Users/HP/Desktop/MUSA800_practium/origin_data/PDF_Hydrant_Inspections_2019_for_MUSA/PDF_Hydrant_Inspections_2019.shp") %>%
  st_transform(crs=2272) 

# load the hydrant data
hydrants <- 
  st_read("C:/Users/HP/Desktop/MUSA800_practium/origin_data/Hydrants_2019/Hydrants_2019.shp") %>%
  st_transform(crs=2272) 

# load the engine locals data
engine_new <- 
  st_read("C:/Users/HP/Desktop/MUSA800_practium/origin_data/EngineLocals/EngineLocals_New.shp") %>%
  st_transform(crs=2272) 
engine_old <- 
  st_read("C:/Users/HP/Desktop/MUSA800_practium/origin_data/EngineLocals/EngineLocals_Old.shp") %>%
  st_transform(crs=2272) 


##### Create 800*800 fishnet #####
# the philly boundary was already loaded above

# Fire hydrants shall be provided for detached one- and two-family dwellings in accordance with both of the following: 
# (1) The maximum distance to a fire hydrant from the closest point on the building shall not exceed 600 ft (183 m).
# (2) The maximum distance between fire hydrants shall not exceed 800 ft (244 m).
# create the 800ft grid cell fishnet
fishnet <- 
  st_make_grid(philly, cellsize = 800) %>%
  st_sf()
# clip the fishnet by the boundary
fishnet <- 
  fishnet[philly,] %>%
  mutate(uniqueID = rownames(.)) %>%
  dplyr::select(uniqueID)

##### Load fire data to the fishnet#####
# fire.sf was already loaded above
fire.sf <- 
  fire.sf %>%
  mutate(date=dmy(alm_date),
         Year=year(date),
         Month=month(date))

fire.sf_18 <- fire.sf[fire.sf$Year==2018,]
fire.sf_19 <- fire.sf[fire.sf$Year==2019,]

f18_KD <- kernelDensity_year(fire.sf, 2018, fishnet, 'Fire18KD')
f19_KD <- kernelDensity_year(fire.sf, 2019, fishnet, 'Fire19KD')


# join 2018 fire data to the fishnet
fire_fishnet <- 
  fire.sf_18 %>% 
  dplyr::select() %>% 
  mutate(count = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(count = ifelse(is.na(count), 0, count),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))

7.2.2 Exploratory Analysis

  1. Hydrant History Map
hydrant_kernel_map <- function(hydrants,year1,year2){
  hydrants_year <- 
    hydrants %>%
    filter(YEAR_INSTA > year1 & YEAR_INSTA <= year2) 
  
  hydrant_year_fishnet <- 
    hydrants_year%>%
    dplyr::select() %>% 
    mutate(count = 1) %>% 
    aggregate(.,fishnet, sum) %>%
    mutate(count = ifelse(is.na(count), 0, count),
           uniqueID = rownames(.),
           cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
  
  hydrants_ppp_year <- as.ppp(st_coordinates(hydrants_year), W = st_bbox(hydrant_year_fishnet))
  hydrants_KD_year <- spatstat::density.ppp(hydrants_ppp_year, 1000)
  
  as.data.frame(hydrants_KD_year) %>%
    st_as_sf(coords = c("x", "y"), crs = st_crs(hydrant_year_fishnet)) %>%
    aggregate(.,hydrant_year_fishnet, mean) %>%
    ggplot() +
    geom_sf(aes(fill=value),colour=NA) +
    ##geom_sf(data = sample_n(fire.sf, 1500), size = .5) +
    labs(title = "Kernel density of fire hydrants",subtitle=
           paste("1800 -",year2)) +
    scale_fill_viridis(option='inferno',limits=c(-0.000001,0.000023)) +
    mapTheme()
}

grid.arrange(hydrant_kernel_map(hydrants,1800,1900),
             hydrant_kernel_map(hydrants,1800,1950),
             hydrant_kernel_map(hydrants,1800,2000),
             hydrant_kernel_map(hydrants,1800,2019),ncol=4)
  2. Relationship between Fire Significance & Fire Density
fire2018<-
  fire.sf %>%
  filter(Year == 2018)

fire2018_fishnet <- 
  fire2018 %>% 
  dplyr::select() %>% 
  mutate(count = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(count = ifelse(is.na(count), 0, count),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))

fire_ppp_2018 <- as.ppp(st_coordinates(fire2018), W = st_bbox(fire2018_fishnet))
fire_KD_2018 <- spatstat::density.ppp(fire_ppp_2018, 1000)

# Local Moran's I
fire2018_fishnet.nb <- poly2nb(fire2018_fishnet, queen=TRUE)
fire2018_fishnet.weights <- nb2listw(fire2018_fishnet.nb, style="W", zero.policy=TRUE)

#Hotspots GIF
sequence=seq(0, 0.05, by=0.001)

for (i in sequence){
  temp <-
    fire2018_fishnet %>% 
    mutate(fire.isSig = ifelse(localmoran(fire2018_fishnet$count, 
                                          fire2018_fishnet.weights)[,5] <= i, "Statistically significant", "Not significant"))
  
  chart <-ggplot() + 
    geom_sf(data = temp, aes(fill = fire.isSig),colour=NA) +
    labs(title = "Significantly clustered fires (local hotspots)",
         subtitle = i) +
    scale_fill_manual(values = c("grey", "red"), name = "Values")+
    mapTheme()
  
  ggsave(file=paste0("./GIF2/",i,".jpg"), plot = chart, width = 8, height = 4.5, units = "in", dpi=300)
  print(paste0("processing: ", i))
}

library(magick)
list.files(path = "./GIF2/", pattern = "*.jpg", full.names = T) %>% 
  map(image_read) %>% # reads each path file
  image_join() %>% # joins image
  image_animate(fps=4) %>% # animates, can opt for number of loops
  image_write("./GIF2/2018hotspots.gif")
  3. Relationship between Blight & Fire Density
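# Note: readData() is a project helper not shown in this appendix; it is assumed
# to read the Carto SQL API response and return an sf object with a Year column.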
unsafe_structures<-readData("https://phl.carto.com/api/v2/sql?q=SELECT%20*%20FROM%20li_violations%20WHERE%20(violationtype%20=%20%27PM15-108.1%27)%20AND%20(violationdate%20%3E=%20%272015-01-01%27)") 
ID_structures<-readData("https://phl.carto.com/api/v2/sql?q=SELECT%20*%20FROM%20li_violations%20WHERE%20(violationtype%20=%20%27PM15-110.1%27)%20AND%20(violationdate%20%3E=%20%272015-01-01%27)")
ALM_structures<-readData("https://phl.carto.com/api/v2/sql?q=SELECT%20*%20FROM%20li_violations%20WHERE%20(violationtype%20=%20%27FC-907.3/20%27)%20AND%20(violationdate%20%3E=%20%272015-01-01%27)")

# get just 2018 name and geometry
unsafe_18 <- unsafe_structures[unsafe_structures$Year==2018,]
ID_18 <- ID_structures[ID_structures$Year==2018,]
ALM_18 <- ALM_structures[ALM_structures$Year==2018,]

# dataframes to store results
rnd_viol_p_results_total <- NULL
rnd_viol_vec_total <- NULL
viol_mean_NN <- NULL

viol_names <-list('Unsafe Structure', 'Imminently Dangerous Buildings', 'Lacking Proper Smoke Alarms')
viol_list <- list(unsafe_18, ID_18, ALM_18)
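# Assumed settings (defined earlier in the project; the text reports 1000 simulations):
k_nearest_nb <- 3
simulations  <- 1000
library(doParallel)  # attaches foreach and parallel for the %dopar% loop below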


cl <- makePSOCKcluster(detectCores()-1)
registerDoParallel(cl)
for(j in seq_along(viol_list)){
  viol_desc <-viol_list[[j]]
# get nearest neighbor distance to observed fires
  viol_fire_NN <- nn_function(st_coordinates(viol_desc),st_coordinates(fire2018), k_nearest_nb)
  viol_fire_NN <- mean(viol_fire_NN)  # nn_function returns a numeric vector
# generate random simulations
  rnd_viol_results <- foreach(i = seq_len(simulations),
  .packages = c('sf', 'dplyr', 'FNN', 'tibble', 'tidyr'),
  .combine  = c) %dopar% {
    cat(i,"\n")
    # get random samples, calculate distance
    viol_fire_NN_rnd <- sf::st_sample(philly, nrow(viol_desc)) %>%
    st_coordinates(.) %>%
    nn_function(., st_coordinates(fire2018), k_nearest_nb)
  }
# store requests
# p values of this particular request
  rnd_viol_p_results <- data.frame(p =map_dbl(rnd_viol_results,function(x)1-ecdf(x)(viol_fire_NN)),
  Feature = viol_names[[j]])
  # add this to the list with all requests
  rnd_viol_p_results_total <- rbind(rnd_viol_p_results_total, rnd_viol_p_results)
  # some vector
  rnd_viol_vec <- data.frame(dist = as.numeric(map_dbl(rnd_viol_results, mean)), Feature = viol_names[[j]])
  # add to total
  rnd_viol_vec_total <- rbind(rnd_viol_vec_total, rnd_viol_vec)
  # store the observed mean nn (to be plotted as line)
  viol_mean_NN <- rbind(viol_mean_NN, data.frame(mean = viol_fire_NN, Feature =viol_names[[j]]))
}

# Stop parallel cluster
stopCluster(cl)

rnd_viol_vec_total$Feature <- factor(rnd_viol_vec_total$Feature ,levels = as.character(arrange(viol_mean_NN,mean)$Feature))


VIOL_FIRE_VS_NN_plot <- ggplot(data = rnd_viol_vec_total, aes(x = dist, group = Feature, fill =
                                                              Feature)) +
  geom_histogram(bins = 50) +
  geom_vline(data = viol_mean_NN, aes(xintercept = mean), size = 2) +
  #scale_x_continuous(breaks=seq(400,4500,200), labels = seq(400,4500,200)) +
  facet_wrap(~Feature, ncol = 1, scales = "free_y") +
  labs(x = paste0("Mean NN Distance (k = ",k_nearest_nb,")"),
  title = "L&I Violations"
  #,caption = "Figure 5.8"
  ) +
  scale_fill_viridis_d() +
  plotTheme() +
  theme(
  legend.position = "none",
  strip.text = element_text(size = 11, family = "sans", face = "plain", hjust = 0),
  strip.background = element_rect(fill = "white")
  )

VIOL_FIRE_VS_NN_plot

f18_KD <- kernelDensity_year(fire.sf, 2018, fishnet, 'Fire18KD')
f17_KD <- kernelDensity_year(fire.sf, 2017, fishnet, 'Fire17KD')
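# aggViolation() is a project helper not shown here; it appears to join one
# year's violation features (NN distance, count, kernel density) to the fishnet
# along with the fire kernel density passed in.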

unsafe_18<-aggViolation(unsafe_structures, 2018, fishnet, f18_KD)


USKD_18<- kernelDensity_year(unsafe_structures, 2018, fishnet, 'KD')%>%
  ggplot() +
  geom_sf(aes(fill=KD),colour=NA) +
  labs(title = 2018) +
  scale_fill_viridis(option='inferno',limits=c(-0.000001,0.000009)) +
  mapTheme()

year18<-unsafe_18%>%st_set_geometry(NULL)%>%
  gather(Variable, Value, -Fire18KD)  %>%
   ggplot(aes(Value, Fire18KD)) +
     geom_point() +
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
    stat_poly_eq(formula = y ~ x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label = paste(..rr.label.., sep =" ")), 
               parse = TRUE, label.x = 60, label.y = 70) +    
     facet_wrap(~Variable, ncol = 3, scales = "free") + labs(subtitle = '2018')+
     plotTheme()

IDKD_18<- kernelDensity_year(ID_structures, 2018, fishnet, 'KD')%>%
  ggplot() +
  geom_sf(aes(fill=KD),colour=NA) +
  labs(title = 2018) +
  scale_fill_viridis(option='inferno',limits=c(-0.000001,0.000009)) +
  mapTheme()

ID_18<-aggViolation(ID_structures, 2018, fishnet, f18_KD)
year18_ID<-ID_18%>%st_set_geometry(NULL)%>%
  gather(Variable, Value, -Fire18KD)  %>%
   ggplot(aes(Value, Fire18KD)) +
     geom_point() +
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
    stat_poly_eq(formula = y ~ x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label = paste(..rr.label.., sep =" ")), 
               parse = TRUE, label.x = 60, label.y = 70) +    
     facet_wrap(~Variable, ncol = 3, scales = "free") + labs(subtitle = '2018') +
    plotTheme()
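# CE_structures is a fourth violation type, assumed to be loaded via readData()
# in the same way as the violations above (its query is not shown here).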

CEKD_17<- kernelDensity_year(CE_structures, 2017, fishnet, 'KD')%>%
  ggplot() +
  geom_sf(aes(fill=KD),colour=NA) +
  labs(title = 2017) +
  scale_fill_viridis(option='inferno') +
  mapTheme()

CE_17<-aggViolation(CE_structures, 2017, fishnet, f17_KD)

 
year17_CE<-CE_17%>%st_set_geometry(NULL)%>%
  gather(Variable, Value, -Fire17KD)  %>%
   ggplot(aes(Value, Fire17KD)) +
     geom_point() +
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
    stat_poly_eq(formula = y ~ x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label = paste(..rr.label.., sep =" ")), 
               parse = TRUE, label.x = 60, label.y = 70) +    
     facet_wrap(~Variable, ncol = 3, scales = "free") + labs(subtitle = '2017')+
     plotTheme()

ALMKD_18<- kernelDensity_year(ALM_structures, 2018, fishnet, 'KD')%>%
  ggplot() +
  geom_sf(aes(fill=KD),colour=NA) +
  labs(title = 2018) +
  scale_fill_viridis(option='inferno') +
  mapTheme()

ALM_18<-aggViolation(ALM_structures, 2018, fishnet, f18_KD)

year18_ALM<-ALM_18%>%st_set_geometry(NULL)%>%
  gather(Variable, Value, -Fire18KD)  %>%
   ggplot(aes(Value, Fire18KD)) +
     geom_point() +
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
    stat_poly_eq(formula = y ~ x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label = paste(..rr.label.., sep =" ")), 
               parse = TRUE, label.x = 60, label.y = 70) +    
     facet_wrap(~Variable, ncol = 3, scales = "free") + labs(subtitle = '2018')+
     plotTheme()

ALMKD_18
year18_ALM
  4. Relationship between Property Values & Fire Density
raw<-fromJSON('https://phl.carto.com/api/v2/sql?q=SELECT%20census_tract,%20category_code,%20market_value,%20%20parcel_number,%20sale_price,%20total_area,%20total_livable_area,%20year_built,%20zip_code,%20ST_Y(the_geom)%20AS%20lat,%20ST_X(the_geom)%20AS%20lng%20FROM%20opa_properties_public%20WHERE%20category_code%20=%20%271%27')
formated<- as.data.frame(raw$rows)

formated <- na.omit(formated)

final_data <-formated%>%
   dplyr::filter(formated$total_livable_area != 0 & formated$market_value != 0 & formated$year_built!='0000')%>%
  dplyr::select(market_value, total_livable_area, year_built, census_tract, zip_code, parcel_number, lat , lng) %>%na.omit(., cols=c("lng", "lat"))%>%
    st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
  st_transform(crs=2272)%>%
    mutate(
      X = st_coordinates(.)[,1],
      Y = st_coordinates(.)[,2],
      val_per_size = market_value/total_livable_area,
      year_built = as.numeric(year_built),
    )

# more mutate for the year built
final_data<- final_data%>%mutate(year_cat=  cut2(final_data$year_built, c(1650, 1700, 1750, 1800, 1850, 1900, 1950, 2000, 2020)), building_age = 2020-year_built)

values_plot<-ggplot() + 
  geom_sf(data=philly)+
  geom_sf(data = final_data, aes(colour = q5(val_per_size)), show.legend = "point", size = .5) +
  scale_colour_viridis_d(option='inferno',
                         direction=-1,
                      labels=qBr(final_data,"val_per_size"),
                      name="Quintile\nBreaks") +
  labs(title = "House Values in Philadelphia",
       subtitle='House Values= market value/total livable area.\nData from Office of Property Assesments.') +
  mapTheme()

values_plot

# age instead continous
age_plot<-ggplot() + 
  geom_sf(data=philly)+
  geom_sf(data = final_data, aes(colour = building_age), show.legend = "point", size = .5) +
  scale_colour_viridis(option = 'inferno', name="Building Age", direction=-1) +
  labs(title = "Age of Residential Houses in Philadelphia", subtitle='Data from Office of Property Assesments.') + mapTheme()

age_plot

# Mean value of houses and age of houses per fishnet 
fishnet_meanVal<-final_data %>% 
    dplyr::select('val_per_size') %>% 
    mutate(count = 1) %>% 
    aggregate(., fishnet, mean)%>%
    mutate(count = ifelse(is.na(count), 0, count),
           uniqueID = rownames(.),
           cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))%>%
  mutate(val_per_size=replace_na(val_per_size,0))

fishnet_medianAge<-final_data %>% 
    dplyr::select('building_age') %>% 
    mutate(count = 1) %>% 
    aggregate(., fishnet, median)%>%
    mutate(count = ifelse(is.na(count), 0, count),
           uniqueID = rownames(.),
           cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))%>%
  mutate(building_age=replace_na(building_age,0))

#### PLOTS 
## Value plot 

houseVal_fishnet_plot<-ggplot() + 
  geom_sf(data=philly)+
  geom_sf(data = fishnet_meanVal, aes(colour=val_per_size)) +
  scale_colour_viridis(option='inferno',
                         direction=-1,
                      name="Average Value Per Size Houses") +
  labs(title = "Average House Values in Philadelphia", subtitle='Values are aggregated by 800ft grid cells.\nData from Office of Property Assesments.') +
  mapTheme()

houseVal_fishnet_plot

### Age plot 
houseAge_fishnet_plot<-ggplot() + 
  geom_sf(data=philly)+
  geom_sf(data = fishnet_medianAge, aes(colour=building_age)) +
  scale_colour_viridis(option='inferno',
                         direction=-1,
                      name="Median Age of Houses") +
  labs(title = "Median House Age in Philadelphia", subtitle='Ages are aggregated by 800ft grid cells.\nData from Office of Property Assesments.') +
  mapTheme()

houseAge_fishnet_plot

#### SCATTER PLOTS ### 
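# fishnet_data_fire is the fully assembled fishnet dataset (fire kernel
# densities plus the engineered features); its construction is not shown in
# this appendix.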
houseData_2018<-fishnet_data_fire%>%
  dplyr::select('Fire18KD', 'building_age', 'val_per_size')%>%
  st_set_geometry(NULL)%>%
  gather(Variable, Value, -Fire18KD)%>%
   ggplot(aes(Value, Fire18KD)) +
     geom_point() +
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
    stat_poly_eq(formula = y ~ x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label = paste(..rr.label.., sep =" ")), 
               parse = TRUE, label.x = 60, label.y = 70) +    
     facet_wrap(~Variable, ncol = 2, scales = "free") + labs(title = 'Residential Property Data',subtitle = '2018 Fire Data')+plotTheme()


houseData_2019<-fishnet_data_fire%>%
  dplyr::select('Fire19KD', 'building_age', 'val_per_size')%>%
  st_set_geometry(NULL)%>%
  gather(Variable, Value, -Fire19KD)%>%
   ggplot(aes(Value, Fire19KD)) +
     geom_point() +
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
    stat_poly_eq(formula = y ~ x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label = paste(..rr.label.., sep =" ")), 
               parse = TRUE, label.x = 60, label.y = 70) +    
     facet_wrap(~Variable, ncol = 2, scales = "free") + labs(subtitle = '2019 Fire Data')

houseData_2018
  1. Relationship between Land Use and Fire Density
landUse<-st_read("http://data-phl.opendata.arcgis.com/datasets/e433504739bd41049de5d8f4a22d34ba_0.geojson")%>%st_transform(crs=2272)


landUse<-landUse%>%mutate(C_DIG1DESC = as.factor(C_DIG1DESC),
                                  C_DIG1 = as.numeric(C_DIG1))%>%
  mutate(is_residential = case_when(
    C_DIG1==1 ~ 1,
    C_DIG1!=1 ~ 0
  ),
  is_commercial=case_when(
    C_DIG1==2~1,
    C_DIG1!=2~0
  ),
  is_industrial = case_when(
    C_DIG1==3 ~ 1,
    C_DIG1!=3 ~ 0
  ))
  
# quick sanity checks on the recoded land use flags
class(landUse$is_residential)
summary(landUse)
sum(landUse$C_DIG1==1)

landuseforplot<-landUse%>%mutate(is_residential=as.factor(is_residential),
                                 is_commercial=as.factor(is_commercial),
                                 is_industrial=as.factor(is_industrial))


residential_plot<-ggplot() + 
  geom_sf(data=philly)+
  geom_sf(data = landuseforplot, aes(colour = is_residential)) +
  scale_colour_viridis_d(option = 'D', name="Residential", labels=c('No', 'Yes')) +
  labs(title = "Land Uses in Philadelphia", subtitle='Data from Philadelphia Department of Planning and Development.') + mapTheme()


commercial_plot<-ggplot() + 
  geom_sf(data=philly)+
  geom_sf(data = landuseforplot, aes(colour = is_commercial)) +
  scale_colour_viridis_d(option = 'D', name="Commercial", labels=c('No', 'Yes')) +
  labs(title = "Land Uses in Philadelphia", subtitle='Data from Philadelphia Department of Planning and Development.') + mapTheme()


industrial_plot<-ggplot() + 
  geom_sf(data=philly)+
  geom_sf(data = landuseforplot, aes(colour = is_industrial)) +
  scale_colour_viridis_d(option = 'D', name="Industrial", labels=c('No', 'Yes')) +
  labs(title = "Land Uses in Philadelphia", subtitle='Data from Philadelphia Department of Planning and Development.') + mapTheme()


grid.arrange(residential_plot, commercial_plot, industrial_plot, ncol=3, top="Land Uses in Philadelphia")

# since is_residential is 1 for residential parcels, the mean over a cell gives the share of residential parcels
fishnet_residential<-landUse %>% 
    dplyr::select(is_residential) %>%
    mutate(count = 1) %>%
    aggregate(., fishnet, mean)%>%
    mutate(count = ifelse(is.na(count), 0, count),
           uniqueID = rownames(.),
           cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE),
           percent_residential = is_residential)

fishnet_commercial<-landUse %>% 
    dplyr::select(is_commercial) %>% 
    mutate(count = 1) %>% 
    aggregate(., fishnet, mean)%>%
    mutate(count = ifelse(is.na(count), 0, count),
           uniqueID = rownames(.),
           cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE),
           percent_commercial = is_commercial)

fishnet_industrial<-landUse %>% 
    dplyr::select(is_industrial) %>% 
    mutate(count = 1) %>% 
    aggregate(., fishnet, mean)%>%
    mutate(count = ifelse(is.na(count), 0, count),
           uniqueID = rownames(.),
           cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE),
           percent_industrial = is_industrial)

landuse_2018<-fishnet_data_fire%>%
  dplyr::select('Fire18KD', 'percent_residential', 'percent_commercial','percent_industrial')%>%
  st_set_geometry(NULL)%>%
  gather(Variable, Value, -Fire18KD)%>%
   ggplot(aes(Value, Fire18KD)) +
     geom_point() +
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
    stat_poly_eq(formula = y ~ x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label = paste(..rr.label.., sep =" ")), 
               parse = TRUE, label.x = 60, label.y = 70) +    
     facet_wrap(~Variable, ncol = 3, scales = "free") + labs(title = 'Land Use Data',subtitle = '2018 Fire Data')+
    plotTheme()


landuse_2019<-fishnet_data_fire%>%
  dplyr::select('Fire19KD', 'percent_residential', 'percent_commercial','percent_industrial')%>%
  st_set_geometry(NULL)%>%
  gather(Variable, Value, -Fire19KD)%>%
   ggplot(aes(Value, Fire19KD)) +
     geom_point() +
     geom_smooth(method = "lm", se=F, colour = "#FA7800") +
    stat_poly_eq(formula = y ~ x,
               eq.with.lhs = "italic(hat(y))~`=`~",
               aes(label = paste(..rr.label.., sep =" ")), 
               parse = TRUE, label.x = 60, label.y = 70) +    
     facet_wrap(~Variable, ncol = 3, scales = "free") + labs(title = 'Land Use Data',subtitle = '2019 Fire Data')



#grid.arrange(landuse_2018, landuse_2019, ncol=1, top="Kernel Density of Fires to Land Use Data\n Source:  Philadelphia Department of Planning and Development")

landuse_2018

7.2.3 Feature Engineering

  1. Risk Factor Story
# flag statistically significant fire hotspots (Local Moran's I, p <= 0.05)
fire_fishnet.nb <- poly2nb(fire_fishnet, queen=TRUE)
fire_fishnet.weights <- nb2listw(fire_fishnet.nb, style="W", zero.policy=TRUE)
# add fire.isSig (hotspot flag) and fire.isSig.dist (distance to the nearest hotspot)
fire_fishnet <-
  fire_fishnet %>% 
  mutate(fire.isSig = ifelse(localmoran(fire_fishnet$count, 
                                        fire_fishnet.weights)[,5] <= 0.05, 1, 0)) %>% # column 5 holds the p-value
  mutate(fire.isSig.dist = nn_function(st_coordinates(st_centroid(fire_fishnet)),
                                        st_coordinates(st_centroid(
                                          filter(fire_fishnet, fire.isSig == 1))), 1 ))
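
# nn_function(), used above and throughout this appendix, returns the average
# distance from each origin point to its k nearest target points. A minimal
# sketch assuming the FNN package; the project's own helper may differ:
nn_function <- function(measureFrom, measureTo, k) {
  nn <- FNN::get.knnx(measureTo, measureFrom, k)$nn.dist  # distances to the k nearest targets
  rowMeans(nn)                                            # average over the k neighbors
}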
##### Hydrant (count/kernel density---KD is better) #####
# add hydrant count
var_fishnet <- 
  count_net(hydrants,fishnet,"hydrant_count") %>%
  as.data.frame(.)%>%
  dplyr::select(hydrant_count,uniqueID)%>%
  merge(as.data.frame(fire_fishnet),. , by="uniqueID") %>%
  st_sf()
# add hydrant kernel density
var_fishnet <- 
  kernelDensity(hydrants, fishnet, 'hydrant_KD') %>%
  st_join(.,var_fishnet,join=st_equals)
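
# count_net() and kernelDensity() above are project helpers. Hedged sketches
# of both follow, mirroring the aggregation patterns used later in this
# appendix; the project's actual definitions may differ.
count_net <- function(points, net, var_name) {
  # count the points falling in each fishnet cell; empty cells get 0
  out <- points %>%
    dplyr::select() %>%
    mutate(count = 1) %>%
    aggregate(., net, sum) %>%
    mutate(count = ifelse(is.na(count), 0, count),
           uniqueID = rownames(.))
  names(out)[names(out) == "count"] <- var_name
  out
}

kernelDensity <- function(points, net, var_name) {
  # Gaussian kernel density of the points, averaged back to the fishnet cells
  pts_ppp <- as.ppp(st_coordinates(points), W = st_bbox(net))
  kd <- spatstat::density.ppp(pts_ppp, 1000)  # bandwidth of 1000 ft (EPSG:2272)
  out <- as.data.frame(kd) %>%
    st_as_sf(coords = c("x", "y"), crs = st_crs(net)) %>%
    aggregate(., net, mean)
  names(out)[names(out) == "value"] <- var_name
  out
}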

##### 2018 L&I violations (count/kernel density/knn---knn is better) #####
# load L&I violations data via the Carto SQL API
readData<-function(link){
  raw<-fromJSON(link)
  formated<- as.data.frame(raw$rows) %>%
    dplyr::select(censustract,violationdate,violationdescription, status, casestatus, casepriority, geocode_x,geocode_y) %>%
    mutate('Year' = year(ymd_hms(violationdate)))%>%
    na.omit(., cols=c("geocode_x", "geocode_y"))%>%
    st_as_sf(coords = c("geocode_x", "geocode_y"), crs = 2272) %>%
    mutate(
      X = st_coordinates(.)[,1],
      Y = st_coordinates(.)[,2])
  return(formated)
}

# unsafe structures
unsafe_structures<-readData("https://phl.carto.com/api/v2/sql?q=SELECT%20*%20FROM%20li_violations%20WHERE%20(violationtype%20=%20%27PM15-108.1%27)%20AND%20(violationdate%20%3E=%20%272015-01-01%27)")
unsafe <- unsafe_structures[unsafe_structures$Year==2018,]
#unsafe <- unsafe_structures[unsafe_structures$Year>=2015 && unsafe_structures$Year<=2019,]
# Imminently Dangerous
ID_structures<-readData("https://phl.carto.com/api/v2/sql?q=SELECT%20*%20FROM%20li_violations%20WHERE%20(violationtype%20=%20%27PM15-110.1%27)%20AND%20(violationdate%20%3E=%20%272015-01-01%27)")
ID <- ID_structures[ID_structures$Year==2018,]
# Lacking Proper Fire Extinguisher for Commercial Cooking
# CE_structures<-readData("https://phl.carto.com/api/v2/sql?q=SELECT%20*%20FROM%20li_violations%20WHERE%20(violationtype%20=%20%27FC-904.11/5%27)%20AND%20(violationdate%20%3E=%20%272015-01-01%27)")
# CE <- CE_structures[CE_structures$Year==2018,]
# Lacking smoke Alarm
ALM_structures<-readData("https://phl.carto.com/api/v2/sql?q=SELECT%20*%20FROM%20li_violations%20WHERE%20(violationtype%20=%20%27FC-907.3/20%27)%20AND%20(violationdate%20%3E=%20%272015-01-01%27)")
ALM <- ALM_structures[ALM_structures$Year==2018,]

# knn method
var_fishnet$unsafe.nn3 <-nn_function(st_coordinates(st_centroid(var_fishnet)),st_coordinates(st_centroid(unsafe)),3) 
var_fishnet$ID.nn3 <-nn_function(st_coordinates(st_centroid(var_fishnet)),st_coordinates(st_centroid(ID)),3) 
var_fishnet$ALM.nn3 <-nn_function(st_coordinates(st_centroid(var_fishnet)),st_coordinates(st_centroid(ALM)),3) 
#var_fishnet$CE.nn3 <-nn_function(st_coordinates(st_centroid(var_fishnet)),st_coordinates(st_centroid(CE)),3) 

# add unsafe count
 var_fishnet <-
   count_net(unsafe,fishnet,"unsafe_count") %>%
   as.data.frame(.)%>%
   dplyr::select(unsafe_count,uniqueID)%>%
   merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
   st_sf()
 
 # add ID_18 count
 var_fishnet <-
   count_net(ID,fishnet,"ID_count") %>%
   as.data.frame(.)%>%
   dplyr::select(ID_count,uniqueID)%>%
   merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
   st_sf()
 
 # add ALM count
 var_fishnet <-
   count_net(ALM,fishnet,"ALM_count") %>%
   as.data.frame(.)%>%
   dplyr::select(ALM_count,uniqueID)%>%
   merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
   st_sf()

##### 2018 311 requests(count/kernel density/knn---knn is better) #####
# load 311 data 
req <- read.csv("C:/Users/HP/Desktop/MUSA800_practium/data/public_cases_fc.csv") 
req <- 
  req %>%
  mutate(Year=year(requested_datetime)) %>%
  dplyr::select(lon,lat,service_name,Year)%>%
  na.omit()%>%
  st_as_sf(.,coords=c("lon","lat"), crs=4326)%>%
  st_transform(2272)
table(req$service_name)

hydrant_Down<- req %>%
  filter(.,service_name=="Hydrant Knocked Down (No Water)" & Year==2018)   
hydrant_req<- req %>%
  filter(.,service_name=="Hydrant Request" & Year==2018)  
smoke_dector<- req%>%
  filter(.,service_name=="Smoke Detector" & Year==2018)    
complainfire<- req %>%
  filter(.,service_name=="Complaints against Fire or EMS" & Year==2018)
FireRes_or_Com<- req %>%
  filter(.,service_name=="Fire Residential or Commercial" & Year==2018) 

# knn method
var_fishnet$hydrant_Down.nn3 <-nn_function(st_coordinates(st_centroid(var_fishnet)),st_coordinates(st_centroid(hydrant_Down)),3) 
var_fishnet$hydrant_req.nn3 <-nn_function(st_coordinates(st_centroid(var_fishnet)),st_coordinates(st_centroid(hydrant_req)),3) 
var_fishnet$smoke_dector.nn3 <-nn_function(st_coordinates(st_centroid(var_fishnet)),st_coordinates(st_centroid(smoke_dector)),3) 
var_fishnet$complainfire.nn3 <-nn_function(st_coordinates(st_centroid(var_fishnet)),st_coordinates(st_centroid(complainfire)),3) 
var_fishnet$FireRes_or_Com.nn3 <-nn_function(st_coordinates(st_centroid(var_fishnet)),st_coordinates(st_centroid(FireRes_or_Com)),3) 

# add hydrant_Down count
var_fishnet <-
  count_net(hydrant_Down,fishnet,"hydrant_Down_count") %>%
  as.data.frame(.)%>%
  dplyr::select(hydrant_Down_count,uniqueID)%>%
  merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
  st_sf()

# add hydrant_req count
var_fishnet <-
  count_net(hydrant_req,fishnet,"hydrant_req_count") %>%
  as.data.frame(.)%>%
  dplyr::select(hydrant_req_count,uniqueID)%>%
  merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
  st_sf()

# add smoke_dector count
var_fishnet <-
  count_net(smoke_dector,fishnet,"smoke_dector_count") %>%
  as.data.frame(.)%>%
  dplyr::select(smoke_dector_count,uniqueID)%>%
  merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
  st_sf()

# add complainfire count
var_fishnet <-
  count_net(complainfire,fishnet,"complainfire_count") %>%
  as.data.frame(.)%>%
  dplyr::select(complainfire_count,uniqueID)%>%
  merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
  st_sf()

# add FireRes_or_Com count
var_fishnet <-
  count_net(FireRes_or_Com,fishnet,"FireRes_or_Com_count") %>%
  as.data.frame(.)%>%
  dplyr::select(FireRes_or_Com_count,uniqueID)%>%
  merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
  st_sf()

##### 2018 crime data #####
crime_18<- st_read("C:/Users/HP/Desktop/MUSA800_practium/data/crime2018_fishnet.geojson") 
var_fishnet <-
  crime_18 %>%
  as.data.frame(.)%>%
  dplyr::select(crime2018.nn3,crime2018_count,uniqueID)%>%
  merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
  st_sf()
  1. Built Environment Story
##### tree data #####
tree<- st_read("C:/Users/HP/Desktop/MUSA800_practium/data/tree_fishnet.geojson")
var_fishnet <-
  tree %>%
  as.data.frame(.)%>%
  dplyr::select(tree.nn20,tree_count,uniqueID)%>%
  merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
  st_sf()

##### fire station data #####
station<- st_read("C:/Users/HP/Desktop/MUSA800_practium/data/FireStation_fishnet.geojson") 
var_fishnet <-
  station %>%
  as.data.frame(.)%>%
  dplyr::select(FireStation.nn1,uniqueID)%>%
  merge(as.data.frame(var_fishnet),. , by="uniqueID") %>%
  st_sf()

##### residential building data #####
raw<-fromJSON('https://phl.carto.com/api/v2/sql?q=SELECT%20census_tract,%20category_code,%20market_value,%20%20parcel_number,%20sale_price,%20total_area,%20total_livable_area,%20year_built,%20zip_code,%20ST_Y(the_geom)%20AS%20lat,%20ST_X(the_geom)%20AS%20lng%20FROM%20opa_properties_public%20WHERE%20category_code%20=%20%271%27')
formated<- as.data.frame(raw$rows)
formated <- na.omit(formated)
final_data <-formated%>%
  dplyr::filter(formated$total_livable_area != 0 & formated$market_value != 0 & formated$year_built!='0000')%>%
  dplyr::select(market_value, total_livable_area, year_built, census_tract, zip_code, parcel_number, lat , lng) %>%na.omit(., cols=c("lng", "lat"))%>%
  st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
  st_transform(crs=2272)%>%
  mutate(
    X = st_coordinates(.)[,1],
    Y = st_coordinates(.)[,2],
    val_per_size = market_value/total_livable_area,
    year_built = as.numeric(year_built)
  )

# bin year built into 50-year categories (cut2() is from the Hmisc package) and compute building age
final_data<- final_data%>%mutate(year_cat = cut2(year_built, c(1650, 1700, 1750, 1800, 1850, 1900, 1950, 2000, 2020)), 
                                 building_age = 2020-year_built)

# Mean value of houses per fishnet 
fishnet_meanVal<-final_data %>% 
  dplyr::select('val_per_size') %>% 
  #mutate(count = 1) %>% 
  aggregate(., fishnet, mean)%>%
  mutate(#count = ifelse(is.na(count), 0, count),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))%>%
  mutate(val_per_size=replace_na(val_per_size,0))

houseVal_fishnet_plot<-ggplot() + 
  geom_sf(data=philly)+
  geom_sf(data = fishnet_meanVal, aes(colour=val_per_size)) +
  scale_colour_viridis(option='inferno',
                       direction=-1,
                       name="Mean Value per Sq Ft") +
  labs(title = "Average House Values in Philadelphia", subtitle='Values are aggregated by 800ft grid cells.\nData from the Office of Property Assessment.') +
  mapTheme()

houseVal_fishnet_plot

# median age of house per fishnet
fishnet_medianAge<-final_data %>% 
  dplyr::select('building_age') %>% 
  # mutate(count = 1) %>% 
  aggregate(., fishnet, median)%>%
  mutate(# count = ifelse(is.na(count), 0, count),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))%>%
  mutate(building_age=replace_na(building_age,0))

houseAge_fishnet_plot<-ggplot() + 
  geom_sf(data=philly)+
  geom_sf(data = fishnet_medianAge, aes(colour=building_age)) +
  scale_colour_viridis(option='inferno',
                       direction=-1,
                       name="Median Age of Houses") +
  labs(title = "Median House Age in Philadelphia", subtitle='Ages are aggregated by 800ft grid cells.\nData from Office of Property Assesments.') +
  mapTheme()

houseAge_fishnet_plot

# join the above data
val_age <-st_join(fishnet_meanVal, fishnet_medianAge %>% dplyr::select('building_age') %>% st_centroid(.))%>%
  dplyr::select(-uniqueID,-cvID)
var_fishnet <- st_join(var_fishnet,val_age,join=st_equals,left=TRUE)
# fishnet_fireKD<-st_join(f19_KD, f18_KD %>% st_centroid(.))
# fishnet_data_fire<-st_join(fishnet_final_data, fishnet_fireKD %>% st_centroid(.))

##### land use data #####
landUse<-st_read("http://data-phl.opendata.arcgis.com/datasets/e433504739bd41049de5d8f4a22d34ba_0.geojson")%>%st_transform(crs=2272)

landUse<-landUse%>%mutate(C_DIG1DESC = as.factor(C_DIG1DESC),
                          C_DIG1 = as.numeric(C_DIG1))%>%
  mutate(is_residential = case_when(
    C_DIG1==1 ~ 1,
    C_DIG1!=1 ~ 0
  ),
  is_commercial=case_when(
    C_DIG1==2~1,
    C_DIG1!=2~0
  ),
  is_industrial = case_when(
    C_DIG1==3 ~ 1,
    C_DIG1!=3 ~ 0
  ))

class(landUse$is_residential)
summary(landUse)
sum(landUse$C_DIG1==1)

landuseforplot<-landUse%>%mutate(is_residential=as.factor(is_residential),
                                 is_commercial=as.factor(is_commercial),
                                 is_industrial=as.factor(is_industrial))
# join the land use shares to the fishnet
# since is_residential is 1 for residential parcels, the mean over a cell gives the share of residential parcels
fishnet_residential<-landUse %>% 
  dplyr::select(is_residential) %>%
  mutate(count = 1) %>%
  aggregate(., fishnet, mean)%>%
  mutate(count = ifelse(is.na(count), 0, count),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE),
         percent_residential = is_residential)

fishnet_commercial<-landUse %>% 
  dplyr::select(is_commercial) %>% 
  mutate(count = 1) %>% 
  aggregate(., fishnet, mean)%>%
  mutate(count = ifelse(is.na(count), 0, count),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE),
         percent_commercial = is_commercial)

fishnet_industrial<-landUse %>% 
  dplyr::select(is_industrial) %>% 
  mutate(count = 1) %>% 
  aggregate(., fishnet, mean)%>%
  mutate(count = ifelse(is.na(count), 0, count),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE),
         percent_industrial = is_industrial)

land_type <-st_join(fishnet_residential,fishnet_commercial %>% dplyr::select('percent_commercial') %>% st_centroid(.))%>%
  dplyr::select(-uniqueID,-cvID,-count,-is_residential)
land_type <-st_join(land_type,fishnet_industrial %>% dplyr::select('percent_industrial') %>% st_centroid(.))
var_fishnet <- st_join(var_fishnet,land_type,join=st_equals,left=TRUE)
  1. Demographic Story
##### census tracts data####
# load the census data
census <- get_acs(geography = "tract",
                  variables=c("B01003_001", "B19013_001", 
                              "B02001_002", "B01002_001","B06009_005"),
                  key="d72b594e4d0f9b9b34217cdea8a4bcbc60354e21",
                  state=42,
                  geometry=TRUE,
                  county=101,
                  output="wide")
census1 <- census%>%
  rename(Total_Pop=B01003_001E,
         Med_Inc=B19013_001E,
         Med_Age=B01002_001E,
         White_Pop=B02001_002E,
         Bachelor_Pop=B06009_005E) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop,Bachelor_Pop, Med_Age,geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,Percent_Bachelor=Bachelor_Pop/ Total_Pop)%>%
  dplyr::select(- White_Pop,-Bachelor_Pop)

var_fishnet <- 
  st_join(var_fishnet,census1%>% st_transform(2272),st_covered_by,left=TRUE) 

var_fishnet[is.na(var_fishnet)] <- 0  # cells without census coverage get 0
  1. Spatio-Temporal Factor Story
##### load the 2015-2018 fire data into the fishnet separately #####
# load the fire data
fire_excel<-read_excel('C:/Users/HP/Desktop/MUSA800_practium/origin_data/2015 to 2019 fires for u of pa report run 1620.xlsx')
fire.sf <- 
  fire_excel%>% 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant") %>%
  st_set_crs(4326) %>%
  st_transform(2272)
fire.sf <- 
  st_intersection(fire.sf, philly)
fire.sf <- 
  fire.sf %>%
  mutate(date=dmy(alm_date),
         Year=year(date),
         Month=month(date))

fire.sf_18 <- fire.sf[fire.sf$Year==2018,]
fire.sf_17 <- fire.sf[fire.sf$Year==2017,]
fire.sf_16 <- fire.sf[fire.sf$Year==2016,]
fire.sf_15 <- fire.sf[fire.sf$Year==2015,]

# average distance from each grid cell centroid to its 5 nearest fires, for each year (2015-2018)
fishnet$lagfire18.nn5 <-nn_function(st_coordinates(st_centroid(fishnet)),st_coordinates(st_centroid(fire.sf_18)),5) 
fishnet$lagfire17.nn5 <-nn_function(st_coordinates(st_centroid(fishnet)),st_coordinates(st_centroid(fire.sf_17)),5) 
fishnet$lagfire16.nn5 <-nn_function(st_coordinates(st_centroid(fishnet)),st_coordinates(st_centroid(fire.sf_16)),5) 
fishnet$lagfire15.nn5 <-nn_function(st_coordinates(st_centroid(fishnet)),st_coordinates(st_centroid(fire.sf_15)),5) 

##### engine & neighborhood #####
# load the engine locals data
engine_new <- 
  st_read("C:/Users/HP/Desktop/MUSA800_practium/origin_data/EngineLocals/EngineLocals_New.shp") %>%
  st_transform(crs=2272) 

fishnet<- 
  st_join(fishnet, dplyr::select(engine_new, ENGINE_NUM),st_nearest_feature,left=TRUE) %>%
  st_set_geometry(NULL) %>%
  left_join(dplyr::select(fishnet, geometry, uniqueID)) %>%
  st_sf() %>%
  na.omit()

# load the neighborhood data
neighbor <- 
  st_read("C:/Users/HP/Desktop/MUSA800_practium/data/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.shp")%>%
  na.omit() %>%
  st_transform(crs=2272)

fishnet <-
  st_join(fishnet, dplyr::select(neighbor, NAME),st_nearest_feature,left=TRUE) %>%
  st_set_geometry(NULL) %>%
  left_join(dplyr::select(fishnet, geometry, uniqueID)) %>%
  st_sf() %>%
  na.omit()
  1. Correlation Matrix
# correlation matrix between the predictors and the dependent variable (fire count);
# `train` is the 2018 training fishnet assembled from var_fishnet above
col <- colorRampPalette(c("#61c0bf", "#bbded6", "#f8f8f8","#ffb6b9","#fc7978"))

Cor_env <- 
  train %>% 
  st_set_geometry(NULL) %>%
  dplyr::select(count,fire.isSig,fire.isSig.dist,hydrant_KD,tree.nn20,tree_count,percent_residential,
                percent_commercial,percent_industrial,val_per_size,building_age) 
corrplot(cor(Cor_env), method = "color", col = col(10),
         type = "upper",  number.cex = .7,
         addCoef.col = "white", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
         diag = TRUE)

Cor_risk <- 
  train %>% 
  st_set_geometry(NULL) %>%
  dplyr::select(count,unsafe.nn3,ID.nn3,ALM.nn3,unsafe_count,ID_count,ALM_count,hydrant_Down.nn3,hydrant_req.nn3,smoke_dector.nn3,complainfire.nn3,FireRes_or_Com.nn3,
                hydrant_Down_count,hydrant_req_count,smoke_dector_count,complainfire_count,FireRes_or_Com_count,crime2018.nn3) # crime2018.nn3 matches the column created above
corrplot(cor(Cor_risk), method = "color", col = col(20),
         type = "upper", number.cex = .7,
         addCoef.col = "white", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
         diag = TRUE)

Cor_demo <- 
  train %>% 
  st_set_geometry(NULL) %>%
  dplyr::select(count,Total_Pop,Med_Inc,Med_Age,Percent_White,Percent_Bachelor) 
corrplot(cor(Cor_demo), method = "color", col = col(20),
         type = "upper", order = "alphabet", number.cex = .7,
         addCoef.col = "white", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
         diag = TRUE)

Cor_lag <- 
  train %>% 
  st_set_geometry(NULL) %>%
  dplyr::select(count,lagfire18.nn5,lagfire17.nn5,lagfire16.nn5,lagfire15.nn5) 
corrplot(cor(Cor_lag), method = "color", col = col(10),
         type = "upper", order = "alphabet", number.cex = .7,
         addCoef.col = "white", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
         diag = TRUE)

# correlation matrix between predictors
 Cor <- 
   train %>% 
   st_set_geometry(NULL) %>%
   dplyr::select(-uniqueID, -cvID,-hydrant_count,-count,-percent_residential,-percent_commercial,-percent_industrial,-NAME) 
 
 corrplot(cor(Cor), method = "color", col = col(20),
          type = "upper", order = "alphabet", number.cex = .5,
          addCoef.col = "white", # Add coefficient of correlation
          tl.col = "black", tl.srt = 90, # Text label color and rotation
          diag = TRUE)

7.2.4 Modeling and Validation

  1. Model Selection
#Package installs -------------------------------------------------------------
load.fun <- function(x) { 
  x <- as.character(x) 
  if (isTRUE(x %in% .packages(all.available = TRUE))) { 
    require(x, character.only = TRUE)
    print(paste0(x, " : already installed; requiring"))
  } else { 
    print(paste0(x, " : not installed; installing"))
    install.packages(x)
    print(paste0(x, " : installed and requiring"))
    require(x, character.only = TRUE)
  } 
} 

########### Required Packages ###########
packages = c("dplyr", "bayesplot", "lme4", "rstan", "shinystan", "RcppEigen",
             "tidyverse", "tidyr", "AmesHousing", "broom", "caret", "dials", "doParallel", "e1071", "earth",
             "ggrepel", "glmnet", "ipred", "klaR", "kknn", "pROC", "rpart", "randomForest",
             "sessioninfo", "tidymodels","ranger", "recipes", "workflows", "themis","xgboost",
             "sf", "nngeo", "mapview")

for (i in seq_along(packages)) {
  pkg <- as.character(packages[i])
  load.fun(pkg)
}

session_info()

#########################################

set.seed(717)

theme_set(theme_bw())

"%!in%" <- Negate("%in%")
g <- glimpse

# for plots
library(gridExtra)
library(grid)
library(ggplot2)

# for kernel density
library(raster)

# formatting options
options(scipen=999)
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall=k))
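# e.g. specify_decimal(3.14159, 2) returns "3.14" (a character string, handy for tables)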

data18 <- st_read("C:/Users/HP/Desktop/MUSA800_practium/new_train18.geojson")%>%
  as.data.frame(.)%>% # convert to dataframe
  dplyr::select(-geometry, -cvID) %>%
  mutate(fire.isSig=as.factor(fire.isSig),
         ENGINE_NUM = as.factor(ENGINE_NUM))


data19 <- st_read("C:/Users/HP/Desktop/MUSA800_practium/new_test19.geojson")%>%
  as.data.frame(.)%>% # convert to dataframe
  dplyr::select(-geometry, -cvID)%>%
  mutate(fire.isSig=as.factor(fire.isSig),
         ENGINE_NUM = as.factor(ENGINE_NUM))

# train and test on 2018 data FIRST!
data_split <- rsample::initial_split(data18, strata = "count", prop = 0.75)

data_train <- training(data_split) # note; only 2018 data
data_test  <- testing(data_split)

# Neighborhood 
cv_splits_hood <- group_vfold_cv(data18, group = "NAME")

# Engine Locals 
cv_splits_engine <- group_vfold_cv(data18, group = "ENGINE_NUM")

## This kfold CV is to determine best hyperparameter settings 
cv_splits_v5 <- vfold_cv(data18, v = 5) # v is number of splits
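
# note: group_vfold_cv() above performs leave-one-group-out CV -- each resample
# holds out an entire neighborhood or engine local, probing spatial
# generalizability -- while vfold_cv() is ordinary k-fold CV over grid cells.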

model_rec <- recipe(count ~ ., data = data_train)%>%
  update_role(uniqueID, new_role='id variable')%>%   # keep the ID out of the predictors
  step_zv(all_numeric(), -count)%>%                  # drop zero-variance predictors
  step_center(all_numeric(), -count) %>%             # center numeric predictors
  step_scale(all_numeric(), -count)%>%               # scale numeric predictors
  step_other(NAME, threshold = 0.005) %>%            # pool rare neighborhoods into "other"
  step_other(ENGINE_NUM, threshold = 0.005) %>%      # pool rare engine locals into "other"
  step_dummy(all_nominal(), -uniqueID)               # one-hot encode nominal predictors
 

glimpse(model_rec %>% prep() %>% juice()) # total 162 features 
summary(model_rec)
# next: feature importance and predictor correlations

# 1: linear regression
lm_plan <- 
  linear_reg() %>% 
  set_engine("lm")

# 2: regularized linear regression (with penalties)
glmnet_plan <- 
  linear_reg() %>% 
  set_args(penalty  = tune()) %>% # lambda
  set_args(mixture  = tune()) %>% # type of regularization L1 and L2
  set_engine("glmnet")

# 3. Random forest 
rf_plan <- rand_forest() %>%
  set_args(mtry  =tune()) %>% # number of features selected for each tree
  set_args(min_n = tune()) %>% # minimum samples at each node for a splits
  # set_args(tree_depth = 120)%>%
  set_args(trees = 1000) %>%  # number of trees 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

# 4. Boosted decision trees (gradient boosting) -- combine weak learners into a strong learner
XGB_plan <- boost_tree() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 1000) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

# Hyperparameter grids 
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .20), 
                           mixture = seq(0,1,0.1))
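
# note: in parsnip, `penalty` is glmnet's lambda (regularization strength) and
# `mixture` is alpha (0 = pure ridge/L2, 1 = pure lasso/L1), so this grid
# spans 6 x 11 = 66 combinations.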

rf_grid <- expand.grid(min_n = c(1,5), mtry=c(10,20,30)) # try different min_n values; maybe a higher mtry
# a low min_n risks overfitting
# alternative: tune on a random subsample of the training data

xgb_grid <- expand.grid(mtry = c(3,5), 
                           min_n = c(1,5))

# create workflow
lm_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(lm_plan)
glmnet_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(glmnet_plan)
rf_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(rf_plan)
xgb_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(XGB_plan)


# fit model to workflow and calculate metrics
control <- control_resamples(save_pred = TRUE, verbose = TRUE)

lm_tuned <- lm_wf %>%
  tune::fit_resamples(.,
                  resamples = cv_splits_v5,
                  control   = control,
                  metrics   = metric_set(mae, rsq, mape)) 

# glmnet tuning works once no raw factor remains as a feature (the recipe dummy-encodes them)
glmnet_tuned <- glmnet_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_v5,
                  grid      = glmnet_grid,
                  control   = control,
                  metrics   = metric_set(mae, rsq))


rf_tuned <- rf_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_v5,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(mae, rsq))


xgb_tuned <- xgb_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_v5,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(mae, rsq))

## select best parameters
## metrics across grid
# autoplot(xgb_tuned)
# collect_metrics(xgb_tuned)
## 'Best' by some metric and margin
show_best(lm_tuned, metric = "mae", n = 15, maximize = FALSE)
show_best(glmnet_tuned, metric = "mae", n = 15, maximize = FALSE)
show_best(rf_tuned, metric = "mae", n = 15, maximize = FALSE)
show_best(xgb_tuned, metric = "mae", n = 15, maximize = FALSE)



lm_best_params     <- select_best(lm_tuned, metric = "mae", maximize = FALSE)
glmnet_best_params <- select_best(glmnet_tuned, metric = "mae", maximize = FALSE)
rf_best_params     <- select_best(rf_tuned, metric = "mae", maximize = FALSE)
xgb_best_params    <- select_best(xgb_tuned, metric = "mae", maximize = FALSE)


## setting models to have best params
## Final workflow
lm_best_wf     <- finalize_workflow(lm_wf, lm_best_params)
glmnet_best_wf <- finalize_workflow(glmnet_wf, glmnet_best_params)
rf_best_wf     <- finalize_workflow(rf_wf, rf_best_params)
xgb_best_wf    <- finalize_workflow(xgb_wf, xgb_best_params)

# BEST: RANDOM FOREST, mtry = 5, min_n = 1, MAE = 0.5565 (5-fold CV, s.d. = 0.0250)


## SAVE PREDICTIONS!
lm_OOF_preds     <- collect_predictions(lm_tuned) 
glmnet_OOF_preds <- collect_predictions(glmnet_tuned) %>% 
  filter(penalty == glmnet_best_params$penalty[1],
         mixture == glmnet_best_params$mixture[1])
rf_OOF_preds     <- collect_predictions(rf_tuned) %>% 
  filter(mtry  == rf_best_params$mtry[1],
         min_n == rf_best_params$min_n[1])
xgb_OOF_preds    <- collect_predictions(xgb_tuned) %>% 
  filter(mtry  == xgb_best_params$mtry[1],
         min_n == xgb_best_params$min_n[1])

ggplot(collect_metrics(glmnet_tuned) %>% filter(.metric == "mae"),
       aes(x=factor(penalty),y=factor(mixture),fill=mean)) + 
  geom_raster() +
  theme_minimal() +
  coord_fixed() +
  labs(x="Penalty",y="Mixture",title="GLM Hyperparameter Tune Grid Results") +
  scale_fill_viridis_c(option = "A") +
  facet_wrap(~.metric)



ggplot(collect_metrics(rf_tuned) %>% filter(.metric == "mae"),
       aes(x=factor(mtry),y=factor(min_n),fill=mean)) + 
  geom_raster() +
  theme_minimal() +
  coord_fixed() +
  labs(x="mtry",y="min.node.size",title="RF Hyperparameter Tune Grid Results") +
  scale_fill_viridis_c(option = "A") +
  facet_wrap(~.metric)


ggplot(collect_metrics(xgb_tuned) %>% filter(.metric == "mae"),
       aes(x=factor(mtry),y=factor(min_n),fill=mean)) + 
  geom_raster() +
  theme_minimal() +
  coord_fixed() +
  labs(x="mtry",y="min.node.size",title="XGB Hyperparameter Tune Grid Results") +
  scale_fill_viridis_c(option = "A") +
  facet_wrap(~.metric)

# note: last_fit() warns "The `...` are not used in this function but one or more objects were passed: 'control'" -- the control argument is simply ignored there

lm_val_fit <- lm_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(mae, rsq))

glmnet_val_fit <- glmnet_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(mae, rsq))

rf_val_fit <- rf_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(mae, rsq))

xgb_val_fit <- xgb_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(mae, rsq))

# collect test set predictions
lm_val_pred     <- collect_predictions(lm_val_fit)
glmnet_val_pred <- collect_predictions(glmnet_val_fit)
rf_val_pred     <- collect_predictions(rf_val_fit)
xgb_val_pred    <- collect_predictions(xgb_val_fit)

# show test set metrics
collect_metrics(lm_val_fit)
collect_metrics(glmnet_val_fit)
collect_metrics(rf_val_fit)
collect_metrics(xgb_val_fit)

# BEST ON TEST SET: RF BEST PARAMS, MAE = 0.5874638


lm_CV_Eg <- lm_best_wf %>%
  tune::fit_resamples(.,
                      resamples = cv_splits_engine,
                      control   = control,
                      metrics   = metric_set(mae, rsq))
glmnet_CV_Eg <- glmnet_best_wf %>%
  tune::fit_resamples(.,
                      resamples = cv_splits_engine,
                      control   = control,
                      metrics   = metric_set(mae, rsq))
rf_CV_Eg <- rf_best_wf %>%
  tune::fit_resamples(.,
                      resamples = cv_splits_engine,
                      control   = control,
                      metrics   = metric_set(mae, rsq))

xgb_CV_Eg <- xgb_best_wf %>%
  tune::fit_resamples(.,
                      resamples = cv_splits_engine,
                      control   = control,
                      metrics   = metric_set(mae, rsq))

## 'Best' by some metric and margin (but should only have 1 entry)
show_best(lm_CV_Eg, metric = "mae", n = 1, maximize = FALSE)
show_best(glmnet_CV_Eg, metric = "mae", n = 15, maximize = FALSE)
show_best(rf_CV_Eg, metric = "mae", n = 15, maximize = FALSE)
show_best(xgb_CV_Eg, metric = "mae", n = 15, maximize = FALSE)
show_best(rf_CV_Eg, metric = "rsq", n = 15, maximize = FALSE)

lm_CV_Nh <- lm_best_wf %>% 
  tune::fit_resamples(.,
                      resamples = cv_splits_hood,
                      control   = control,
                      metrics   = metric_set(mae, rsq))

glmnet_CV_Nh <- glmnet_best_wf %>% 
   tune::fit_resamples(.,
                      resamples = cv_splits_hood,
                      control   = control,
                      metrics   = metric_set(mae, rsq))
rf_CV_Nh <- rf_best_wf %>% 
   tune::fit_resamples(.,
                      resamples = cv_splits_hood,
                      control   = control,
                      metrics   = metric_set(mae, rsq))

xgb_CV_Nh <- xgb_best_wf %>% 
   tune::fit_resamples(.,
                      resamples = cv_splits_hood,
                      control   = control,
                      metrics   = metric_set(mae, rsq))

## 'Best' by some metric and margin (but should only have 1 entry)
show_best(lm_CV_Nh, metric = "mae", n = 1, maximize = FALSE)
show_best(glmnet_CV_Nh, metric = "mae", n = 15, maximize = FALSE)
show_best(rf_CV_Nh, metric = "mae", n = 15, maximize = FALSE)
show_best(xgb_CV_Nh, metric = "mae", n = 15, maximize = FALSE)
show_best(rf_CV_Nh, metric = "rsq", n = 15, maximize = FALSE)
  1. The Best Model: RF
rf_plan <- rand_forest() %>%
  set_args(mtry  =30 )%>% # number of features selected for each tree
  set_args(min_n = 5 )%>% # minimum samples at each node for a splits
  # set_args(tree_depth = 120)%>%
  set_args(trees = 1000) %>%  # number of trees 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

rf_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(rf_plan)

best_fit<- rf_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(mae))
best_pred_trainTest_18    <- collect_predictions(best_fit)

collect_metrics(best_fit) # MAE = 0.5325 on the held-out 2018 test split

### try predictions on the 2019 data
final_best_model <-
  fit(rf_wf, data18)

pred_data19 <- data19 %>%   # drop the outcome before predicting
  dplyr::select(-count)
predictions19<-predict(final_best_model, pred_data19)%>%
  rename(pred_19 = .pred)
predictions19$uniqueID<-data19$uniqueID
predictions19$observed_19<- data19$count
predictions19 <- predictions19[, c("uniqueID", "observed_19", "pred_19")]


predictions19$error19<- predictions19$pred_19 - predictions19$observed_19

# mean absolute error (the abs() matters: a raw mean error lets over- and under-predictions cancel out)
MAE19<-mean(abs(predictions19$error19))
MAE19

## DECILES
FIRST<- predictions19%>% mutate(fire_decile_obs = ntile(observed_19, 10))%>%
   group_by(fire_decile_obs) %>%
    summarize(meanObserved = mean(observed_19, na.rm=T),
              meanPrediction = mean(pred_19, na.rm=T))%>%
   gather(Variable, mean, -fire_decile_obs)%>%
  ggplot(aes(fire_decile_obs, mean, shape = Variable)) +
      geom_point(size = 2) + geom_path(aes(group = fire_decile_obs), colour = "black") +
      scale_shape_manual(values = c(2, 17)) + xlim(0,10) +
      labs(title = "mtry=30, min_n=5")

# predictions19_2: analogous predictions from a second RF fit with mtry = 40, min_n = 5 (fit not shown)
SECOND<- predictions19_2%>% mutate(fire_decile_obs = ntile(observed_19, 10))%>%
   group_by(fire_decile_obs) %>%
    summarize(meanObserved = mean(observed_19, na.rm=T),
              meanPrediction = mean(pred_19, na.rm=T))%>%
   gather(Variable, mean, -fire_decile_obs)%>%
  ggplot(aes(fire_decile_obs, mean, shape = Variable)) +
      geom_point(size = 2) + geom_path(aes(group = fire_decile_obs), colour = "black") +
      scale_shape_manual(values = c(2, 17)) + xlim(0,10) +
      labs(title = "mtry=40, min_n=5")

grid.arrange(FIRST, SECOND, ncol=2, top="Predicted and observed fire counts by observed fire count decile", left="Mean Fire Counts", bottom="Note: both models were trained on 2018 fire data.")

## variable importance using the current best model 
library(vip) 
library(tidyquant) # provides palette_light() and theme_tq()
final_best_model <-
  fit(rf_best_wf, data18)

vip(final_best_model, aesthetics = list(fill = palette_light()["blue"])) +
    labs(title = "Random Forest Importance - Best Features") +
    theme_tq() 
final_best_model <-
  fit(rf_wf, data18)

# predictions on 2018 data
pred_data <-data18 %>%
  dplyr::select(-count)
predictions18<-predict(final_best_model, pred_data)%>%
  rename(pred_18 = .pred)
predictions18$observed_18 <- data18$count
predictions18$uniqueID<-data18$uniqueID
predictions18 <- predictions18[, c("uniqueID", "observed_18", "pred_18")]

# predictions on 2019 data
pred_data19<-data19 %>% 
  dplyr::select(-count)
predictions19<-predict(final_best_model, pred_data19)%>%
  rename(pred_19 = .pred)
predictions19$uniqueID<-data19$uniqueID
predictions19$observed_19<- data19$count
predictions19 <- predictions19[, c("uniqueID", "observed_19", "pred_19")]


# check that the observations matched correctly
obs18<-data18%>%mutate(uniqueID=as.numeric(uniqueID))%>%
  arrange(.,uniqueID)
sum(obs18$count - predictions18$observed_18) # should be 0
obs19<-data19%>%mutate(uniqueID=as.numeric(uniqueID))%>%
  arrange(.,uniqueID)
sum(obs19$count - predictions19$observed_19) # should be 0

# export out (merge the 2018 and 2019 prediction frames first)
write.csv(merge(predictions18, predictions19, by = "uniqueID"),
          "C:/Users/leann/OneDrive/Desktop/MUSA 800/Modelling/predictions_v1.csv", row.names = TRUE)
  1. Model Validation
# Accuracy
predictions18$error18<- predictions18$pred_18 - predictions18$observed_18
predictions19$error19<- predictions19$pred_19 - predictions19$observed_19

# mean absolute errors (abs() keeps over- and under-predictions from cancelling out)
MAE18<-mean(abs(predictions18$error18))
MAE19<-mean(abs(predictions19$error19))

spaces19<-data19%>%dplyr::select(uniqueID, ENGINE_NUM, NAME)

pred19_with_space <- merge(spaces19,predictions19,by="uniqueID")%>%dplyr::select(-uniqueID)

# mean and s.d. of MAE across neighborhoods
nhood_data<-pred19_with_space%>%group_by(NAME)%>% summarise(
  meanError = mean(abs(error19))
)
mean(nhood_data$meanError)
sd(nhood_data$meanError)

# and across engine locals
engine_data<-pred19_with_space%>%group_by(ENGINE_NUM)%>% summarise(
  meanError = mean(abs(error19))
)
mean(engine_data$meanError)
sd(engine_data$meanError)

Group = c('By Neighborhood', 'By Engine Number')
average_MAE=c(specify_decimal(mean(nhood_data$meanError),3),specify_decimal(mean(engine_data$meanError),3))
sd_MAE = c(specify_decimal(sd(nhood_data$meanError),3), specify_decimal(sd(engine_data$meanError),3))
generalizability19<-data.frame(Group, average_MAE, sd_MAE)
headers<-c('Group', 'Average MAE', 'Standard Deviation MAE')
names(generalizability19)<-headers

library(kableExtra)
generalizability19%>%kable(caption =   "Generalizability of Model on 2019 Data") %>% kable_styling("striped", full_width = T) 

# just simple MAE
col_headings <- c('Test Set', 'Mean Absolute Error','Mean Observed Fires')
Test_Set = c('2018 Data', '2019 Data')
MAE = c(specify_decimal(MAE18,3), specify_decimal(MAE19,3))
actuals = c(specify_decimal(mean(predictions18$observed_18),3),specify_decimal(mean(predictions19$observed_19),3))
error_df<-data.frame(Test_Set, MAE, actuals)
names(error_df) <- col_headings
error_df%>%kable(caption =   "Accuracy of Model")%>% kable_styling("striped", full_width =F)


actual_mean_19<-mean(predictions19$observed_19)

deciles_18<-predictions18%>% mutate(fire_decile_obs = ntile(observed_18, 10))%>%
   group_by(fire_decile_obs) %>%
    summarize(meanObserved = mean(observed_18, na.rm=T),
              meanPrediction = mean(pred_18, na.rm=T))%>%
   gather(Variable, mean, -fire_decile_obs)%>%
  ggplot(aes(fire_decile_obs, mean, shape = Variable)) +
      geom_point(size = 2) + geom_path(aes(group = fire_decile_obs), colour = "black") +
      scale_shape_manual(values = c(2, 17)) + xlim(0,10) +
      labs(title = "2018")

  
deciles_19<-predictions19%>% mutate(fire_decile_obs = ntile(observed_19, 10))%>%
   group_by(fire_decile_obs) %>%
    summarize(meanObserved = mean(observed_19, na.rm=T),
              meanPrediction = mean(pred_19, na.rm=T))%>%
   gather(Variable, mean, -fire_decile_obs)%>%
  ggplot(aes(fire_decile_obs, mean, shape = Variable)) +
      geom_point(size = 2) + geom_path(aes(group = fire_decile_obs), colour = "black") +
      scale_shape_manual(values = c(2, 17)) + xlim(0,10) +
      labs(title = "2019")

grid.arrange(deciles_18, deciles_19, ncol=2, top="Predicted and observed fire counts by observed fire count decile", left="Mean Fire Counts", bottom="Note: both models were trained on 2018 fire data.")

# generalizability maps
# engineAggregatesFinal holds the 2019 predictions aggregated to engine locals (assembled outside this chunk)
engineAggregatesFinal%>%mutate(
  predFires=as.numeric(predFires)
)%>%
  ggplot()+
   geom_sf(data=philly)+
    geom_sf(aes(fill = predFires))+
   scale_fill_viridis(option="inferno", name="Fire Counts")+
   labs(title = 'Predicted 2019 Fires by Engine') +
  mapTheme()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"),
strip.text = element_text(size=25))

# `predictions` is the merged 2018/2019 prediction frame assembled in the Export Predictions step below;
# `engine_shapefile` holds the engine local polygons
error_fishnet<-predictions%>%mutate(error19=abs(pred_19-observed_19))%>%dplyr::select(error19)
error_engines<-aggregate(x=error_fishnet, by=engine_shapefile, FUN=mean)
error_engines%>%
  ggplot()+
   geom_sf(data=philly)+
    geom_sf(aes(fill = error19))+
   scale_fill_viridis(option="inferno", name="MAE 2019")+
   labs(title = 'MAE on 2019 Fire Predictions by Engine',
        subtitle= 'The absolute error is of each grid cell, average is taken for each engine.') +
  mapTheme()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(), axis.line = element_line(colour = "black"),
strip.text = element_text(size=25))

# compared with kernel density
library(spatstat)
library(readxl)
library(lubridate)
philly<- st_read("C:/Users/leann/OneDrive/Desktop/MUSA 800/Data/City_Limits")%>%st_set_crs(4326) %>%
  na.omit() %>%
  st_transform(2272)
fire_excel<-read_excel('C:/Users/leann/OneDrive/Desktop/MUSA 800/Data/2015 to 2019 fires for u of pa report run 1620.xlsx')
fire.sf <- 
  fire_excel%>% 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant") %>%
  st_set_crs(4326) %>%
  st_transform(2272)
fire.sf <- 
  st_intersection(fire.sf, philly)
# standardize the date fields
fire.sf <- 
  fire.sf %>%
  mutate(date=dmy(alm_date),
         Year=year(date),
         Month=month(date))

fishnet <- 
  st_make_grid(philly, cellsize = 800) %>%
  st_sf()
# clip the fishnet by the boundary
fishnet <- 
  fishnet[philly,] %>%
  mutate(uniqueID = rownames(.)) %>%
  dplyr::select(uniqueID)%>%
  st_set_crs(2272)

# 2018 fire density 
fire2018<-
  fire.sf %>%
  filter(Year == 2018)
fire2018_fishnet <- 
  fire2018 %>% 
  dplyr::select() %>% 
  mutate(count = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(count = ifelse(is.na(count), 0, count),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
fire_ppp_2018 <- as.ppp(st_coordinates(fire2018), W = st_bbox(fire2018_fishnet))
fire_KD_2018<- spatstat::density.ppp(fire_ppp_2018, 1000)
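# note: sigma = 1000 is the Gaussian kernel bandwidth; EPSG:2272 is measured
# in US survey feet, so fire points are smoothed over roughly 1,000 ft.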

fire_KDE_sf_18 <- as.data.frame(fire_KD_2018) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(fire2018_fishnet)) %>%
  aggregate(., fire2018_fishnet, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  bind_cols( aggregate(dplyr::select(fire2018) %>% mutate(firecount = 1), ., length) %>%  mutate(firecount = replace_na(firecount, 0))) %>%
  dplyr::select(label, Risk_Category, firecount)

# 2019 fire density 
fire2019<-
  fire.sf %>%
  filter(Year == 2019)
fire2019_fishnet <- 
  fire2019 %>% 
  dplyr::select() %>% 
  mutate(count = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(count = ifelse(is.na(count), 0, count),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
fire_ppp_2019 <- as.ppp(st_coordinates(fire2019), W = st_bbox(fire2019_fishnet))
fire_KD_2019 <- spatstat::density.ppp(fire_ppp_2019, 1000)

fire_KDE_sf <- as.data.frame(fire_KD_2019) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(fire2019_fishnet)) %>%
  aggregate(., fire2019_fishnet, mean) %>%
  mutate(label = "Kernel Density",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  bind_cols( aggregate(dplyr::select(fire2019) %>% mutate(firecount = 1), ., length) %>%  mutate(firecount = replace_na(firecount, 0))) %>%
  dplyr::select(label, Risk_Category, firecount)

geom19<-st_read("new_test19.geojson")%>%st_set_crs(2272)%>%
  dplyr::select('uniqueID')

pred_risk_sf <- predictions19%>%
  mutate(label = "Risk Predictions",
         Risk_Category = ntile(pred_19, 100),
         Risk_Category = case_when(Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%mutate(
             firecount=observed_19
           )%>%dplyr::select(label, Risk_Category, firecount, uniqueID)

pred_risk_sf<- merge(geom19,pred_risk_sf,by="uniqueID")%>%dplyr::select(-uniqueID)

library(viridis)
rbind(fire_KDE_sf,pred_risk_sf) %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    # geom_sf(data = sample_n(fire2019, 1000), size = .05, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE, option='inferno') +
    labs(title="Comparison of Kernel Density and Risk Predictions of Fire 2019",
         subtitle="Relative to Observered fires 2019 (in black)") +
    mapTheme()

rbind(fire_KDE_sf,pred_risk_sf) %>%
  st_set_geometry(NULL) %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(firecount = sum(Value)) %>%
  ungroup%>%
  group_by(label) %>%
  mutate(Fraction_of_Fires_in_category = firecount / sum(firecount)) %>%
    ggplot(aes(Risk_Category,Fraction_of_Fires_in_category)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE, option='inferno', begin=0.18, end=0.55 )
  1. Export Predictions
# merge 2018 and 2019 predictions for exporting out

# join both data frames 
predictions <-merge(predictions18, predictions19, by="uniqueID")

# join the geometries 
geom19<-st_read("new_test19.geojson")%>%st_set_crs(2272)%>%
  dplyr::select('uniqueID', 'ENGINE_NUM', 'NAME')

predictions<-merge(geom19, predictions, by='uniqueID')

# add deciles to predictions 
predictions_w_dec<-predictions%>%dplyr::mutate(obs18_decile = ntile(observed_18, 10),
                                               pred18_decile = ntile(pred_18, 10),
                                               obs19_decile = ntile(observed_19, 10),
                                               pred19_decile = ntile(pred_19, 10))



# add the kernel densities of observed fires, aggregated to the prediction fishnet  
fire_ppp_2018_2 <- as.ppp(st_coordinates(fire2018), W = st_bbox(predictions))
fire_KD_2018_2 <- spatstat::density.ppp(fire_ppp_2018_2, 1000)
fire_KDE_sf_218 <- as.data.frame(fire_KD_2018_2) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(predictions)) %>%
  aggregate(., predictions, mean)
fire_ppp_2019_2 <- as.ppp(st_coordinates(fire2019), W = st_bbox(predictions))
fire_KD_2019_2 <- spatstat::density.ppp(fire_ppp_2019_2, 1000)
fire_KDE_sf_219 <- as.data.frame(fire_KD_2019_2) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(predictions)) %>%
  aggregate(., predictions, mean)

## JOIN TO DATASET 

predictions_w_KD<-st_join(predictions_w_dec, fire_KDE_sf_218, join=st_equals)%>%dplyr::rename(obs18_KD = value)%>%st_join(., fire_KDE_sf_219, join=st_equals)%>%dplyr::rename(obs19_KD = value)

### add risk categories to the predictions using the kernel density of observed fires
predictions_w_risk<-predictions_w_KD%>%
  mutate(Risk_Obs_18 = ntile(obs18_KD, 100),
         Risk_Obs_18 = case_when(Risk_Obs_18 >= 90 ~ "90% to 100%",
           Risk_Obs_18 >= 70 & Risk_Obs_18 <= 89 ~ "70% to 89%",
           Risk_Obs_18 >= 50 & Risk_Obs_18 <= 69 ~ "50% to 69%",
           Risk_Obs_18 >= 30 & Risk_Obs_18 <= 49 ~ "30% to 49%",
           Risk_Obs_18 >= 1 & Risk_Obs_18 <= 29 ~ "1% to 29%"))%>%
  mutate(Risk_Obs_19 = ntile(obs19_KD, 100),
         Risk_Obs_19 = case_when(Risk_Obs_19 >= 90 ~ "90% to 100%",
           Risk_Obs_19 >= 70 & Risk_Obs_19 <= 89 ~ "70% to 89%",
           Risk_Obs_19 >= 50 & Risk_Obs_19 <= 69 ~ "50% to 69%",
           Risk_Obs_19 >= 30 & Risk_Obs_19 <= 49 ~ "30% to 49%",
           Risk_Obs_19 >= 1 & Risk_Obs_19 <= 29 ~ "1% to 29%"))%>%
  mutate(Risk_pred_18 = ntile(pred_18, 100),
         Risk_pred_18 = case_when(Risk_pred_18 >= 90 ~ "90% to 100%",
           Risk_pred_18 >= 70 & Risk_pred_18 <= 89 ~ "70% to 89%",
           Risk_pred_18 >= 50 & Risk_pred_18 <= 69 ~ "50% to 69%",
           Risk_pred_18 >= 30 & Risk_pred_18 <= 49 ~ "30% to 49%",
           Risk_pred_18 >= 1 & Risk_pred_18 <= 29 ~ "1% to 29%"))%>%
  mutate(Risk_pred_19 = ntile(pred_19, 100),
         Risk_pred_19 = case_when(Risk_pred_19 >= 90 ~ "90% to 100%",
           Risk_pred_19 >= 70 & Risk_pred_19 <= 89 ~ "70% to 89%",
           Risk_pred_19 >= 50 & Risk_pred_19 <= 69 ~ "50% to 69%",
           Risk_pred_19 >= 30 & Risk_pred_19 <= 49 ~ "30% to 49%",
           Risk_pred_19 >= 1 & Risk_pred_19 <= 29 ~ "1% to 29%"))

# export out

write.csv(as.data.frame(predictions_w_risk)%>%dplyr::select(-geometry),"C:/Users/leann/OneDrive/Desktop/MUSA 800/Modelling/predictions_v3.csv", row.names = TRUE)

st_write(predictions_w_risk, "C:/Users/leann/OneDrive/Desktop/MUSA 800/Modelling/predictions_v3.shp")

data_frame <-st_read("C:/Users/leann/OneDrive/Desktop/MUSA 800/Modelling/predictions_v3")%>%st_set_crs(.,2272)

st_crs(data_frame)

data_frame%>%
  ggplot() +
    geom_sf(aes(fill = Rs_O_19), colour = NA) +
    # geom_sf(data = sample_n(fire2019, 1000), size = .05, colour = "black") +
    scale_fill_viridis(discrete = TRUE, option='inferno') +
    labs(title="Kernel Density of Fires in 2019",
         subtitle="Categorised into quintiles.",
         fill = "Fires 2019") +
    mapTheme()

7.2.5 Hydrant Priority Score Calculation

  1. The Latent Fire Risk Score
fire_risk <- st_read("predictions_v3")%>%st_set_crs(., 2272)

fire_score<-fire_risk%>%dplyr::select(pred_19)
  1. The Industrial Area Score
landUse<-st_read('C:/Users/leann/OneDrive/Desktop/MUSA 800/Data/Land_Use')%>%st_transform(crs=2272)

landUse<-landUse%>%mutate(C_DIG1DESC = as.factor(C_DIG1DESC),
                                  C_DIG1 = as.numeric(C_DIG1))%>%
  mutate(is_residential = case_when(
    C_DIG1==1 ~ 1,
    C_DIG1!=1 ~ 0
  ),
  is_industrial = case_when(
    C_DIG1==3 ~ 1,
    C_DIG1!=3 ~ 0
  ))

# to get kernel density, each is_industrial parcel needs to be a point

industrial_parcels <- landUse %>% 
    dplyr::filter(is_industrial == 1)%>%st_centroid(.) # these are now points.


industrial_ppp <- as.ppp(st_coordinates(industrial_parcels), W = st_bbox(fishnet))
industrial_score<- spatstat::density.ppp(industrial_ppp, 1000)%>%as.data.frame(.)%>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(fishnet)) %>%
    aggregate(.,fishnet, mean)

## industrial_score: kernel density of industrial parcels per grid cell (not yet normalized)
  1. Social Factor Score
# distance to the nearest school
# load schools (546 in total)
schools<- st_read("http://data.phl.opendata.arcgis.com/datasets/d46a7e59e2c246c891fbee778759717e_0.geojson") %>%
  st_set_crs(4326) %>%
  st_transform(2272)

ggplot() +
    geom_sf(data = philly, colour = NA) +
    geom_sf(data = schools, size = .05, colour = "black")+mapTheme()

social = fishnet
social$school.nn1=nn_function(st_coordinates(st_centroid(fishnet)),st_coordinates(schools),1)

# so each fishnet grid cell now has nearest distance to school.

## census data
census <- get_acs(geography = "tract",
                  variables=c("B01003_001", "B19013_001", 
                              "B01002_001"),
                  key="d72b594e4d0f9b9b34217cdea8a4bcbc60354e21",
                  state=42,
                  geometry=TRUE,
                  county=101,
                  output="wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E) %>%
  dplyr::select(Total_Pop, Med_Inc, Med_Age,GEOID, geometry)%>%st_set_crs(4326)%>%st_transform(2272)

# get population density per census tract
census$area = st_area(census)
census<-census%>%mutate(Pop_density = Total_Pop/area)

census_fishnet <- st_join(fishnet,census,st_intersects,left=TRUE)%>%
  group_by(uniqueID)%>%
  summarise(pop_dense = mean(Pop_density),
            med_age=mean(Med_Age),
            med_inc = mean(Med_Inc))%>%
  as.data.frame(.)%>%
  dplyr::select(-geometry)
# left = TRUE keeps all fishnet rows
census_fishnet[is.na(census_fishnet)] <- 0

# return one dataframe for the social score
social<-merge(social, census_fishnet, by= 'uniqueID')
  


# standardize all values: (value - mean) / s.d.
social.scaled <- social %>% mutate_at(c("med_inc", "med_age", "school.nn1", "pop_dense"), ~(scale(.) %>% as.vector))

# all values now have mean 0 and s.d. 1

# transform each factor so that larger values mean higher priority:
# med_inc and school.nn1: decreasing linear, y = -x (lower income / farther school = higher priority)
# pop_dense: increasing linear, y = x
# med_age: prioritize both the young and the old: y = x^2 (the most negative and most positive values get the highest scores)

social.scaled.transformed <- social.scaled%>%mutate(
  med_inc = -1*med_inc, 
  school.nn1 = -1*school.nn1,
  pop_dense = pop_dense,
  med_age = scale(med_age*med_age)%>%as.vector
)

social_score<-social.scaled.transformed%>%
  mutate(score = (med_inc + med_age + school.nn1 + pop_dense)/4)%>%dplyr::select(uniqueID, score)

# the score is per grid cell and not yet normalized
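
# Design note: the composite above weights the four standardized factors
# equally, a policy choice rather than a statistical result. A hypothetical
# weighted variant (the 0.4/0.2/0.2/0.2 weights below are illustrative only)
# could be swapped in after consulting stakeholders:
social_score_weighted <- social.scaled.transformed %>%
  mutate(score = 0.4 * med_inc + 0.2 * med_age +
                 0.2 * school.nn1 + 0.2 * pop_dense) %>%
  dplyr::select(uniqueID, score)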
  4. Aggregating the Above Three Scores to Hydrants
industrial_parcels_count<- industrial_parcels %>% 
  dplyr::select() %>% 
  mutate(count= 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(industrial = ifelse(is.na(count), 0, count))%>%dplyr::select(industrial)


priorities_fishnet<-st_join(fire_risk%>%
                              dplyr::select(uniquID, pred_19)%>%
                              rename(fire_count =pred_19,
                                     ID = uniquID), st_centroid(industrial_parcels_count), st_intersects, left=TRUE)%>%
  st_join(., st_centroid(social)%>%
            dplyr::select(-uniqueID))%>%st_set_crs(2272)
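The specify_decimal() helper used below is assumed to be defined earlier in the workflow. If not, a minimal sketch matching its usage here (round x to k decimal places and return trimmed text) would be:

# minimal sketch of the assumed specify_decimal helper
specify_decimal <- function(x, k) trimws(format(round(x, k), nsmall = k))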

cleaned_priorities<-priorities_fishnet%>%mutate_at(c("fire_count", "med_inc", "med_age", "school.nn1"), ~(specify_decimal(.,0)))%>%
  mutate(pop_dense = stringr::str_extract(pop_dense*100, "^.{5}")) # population density rescaled to per 100 sq ft, truncated to 5 characters
cleaned_priorities[is.na(cleaned_priorities)] <- 0


st_write(cleaned_priorities, "C:/Users/leann/OneDrive/Desktop/MUSA 800/Modelling/fishnet_priorities_raw.shp",  delete_layer = TRUE)

# data frame of fishnet grid cells with un-normalized individual factor scores
scores_fishnet<-st_join(fire_score, st_centroid(industrial_score), st_intersects, left=TRUE)%>%
  st_join(., social_score%>%st_centroid(.))%>%
  rename(social=score,
         industrial=value,
         fire=pred_19)

st_write(scores_fishnet, "C:/Users/leann/OneDrive/Desktop/MUSA 800/Modelling/fishnet_scores_raw.shp",  delete_layer = TRUE)

hydrants<-st_read("C:/Users/leann/OneDrive/Desktop/MUSA 800/Data/PDF_Hydrant_Inspections_2019")%>%
  st_set_crs(2272)%>%
  dplyr::select(HYDRANTNUM, STATION, DATEOFLAST, YEAR_INSTA, BATTALION)%>%
  mutate(age = 2020 - YEAR_INSTA) # reference the column directly; `hydrants` does not yet exist inside this pipe
# fill invalid hydrant ages with the average valid hydrant age
temp<-
  hydrants%>%
  filter(age>=0 & age<2020) # exclude impossible ages from the average
avg_age = mean(temp$age)
hydrants$age[hydrants$age<0] <- avg_age
hydrants$age[hydrants$age>=2020] <- avg_age
hydrants$age = specify_decimal(hydrants$age,0)


## 400 ft radius buffer (800 ft diameter) around each hydrant
hydrant.buffers <-st_buffer(hydrants%>%dplyr::select(HYDRANTNUM), 400)

# aggregate the fishnet scores to hydrant buffers
# z contains all the grid cells that intersect each hydrant's buffer.
z <- st_join(hydrant.buffers, scores_fishnet,
            join = st_intersects, left=TRUE) 
# want to get the summary of values of grid cells in the buffer
z_by_hydrant<-z%>%
  group_by(HYDRANTNUM)%>%
  summarise(industrial = mean(industrial), fire = mean(fire), social = mean(social), n_cells =n())%>%na.omit()
## z_by_hydrant does not contain the hydrants outside Philly anymore
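As a quick, hypothetical sanity check (not part of the original pipeline), comparing row counts shows how many hydrants fell outside the fishnet and were dropped:

# hydrants dropped by na.omit() above (those whose buffers matched no fishnet cells)
nrow(hydrant.buffers) - nrow(z_by_hydrant)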


## add in the hydrant buffer information to each hydrant
scores_hydrants<-merge(hydrants,
                       z_by_hydrant%>%as.data.frame(.)%>%dplyr::select(-geometry), 
                       by='HYDRANTNUM')


# normalize the scores!!, make sure all are numeric
scores_hydrants$age<-as.numeric(scores_hydrants$age)

norm_scores_hydrants<-scores_hydrants%>%mutate_at(c("fire", "industrial", "social", "age"), ~(scale(.) %>% as.vector))

sd(norm_scores_hydrants$social) # sanity check: should be ~1 after scaling

# each score is the average of the grid cells in a 400 ft radius buffer around the hydrant.
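To confirm the standardization, a hypothetical check (assuming the mutate_at above ran cleanly) is to summarize each normalized column; means should be near 0 and standard deviations near 1:

# hypothetical check: each normalized score should have mean ~0 and s.d. ~1
norm_scores_hydrants %>%
  sf::st_drop_geometry() %>%
  dplyr::summarise_at(c("fire", "industrial", "social", "age"),
                      list(mean = mean, sd = sd))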

engine<-st_read("C:/Users/leann/OneDrive/Desktop/MUSA 800/Data/EngineLocals")%>%st_set_crs(2272)

norm_w_engin<-st_join(
  norm_scores_hydrants,
  engine,
  join = st_intersects,
  suffix = c(".x", ".y"),
  left = TRUE
)

# remove duplicate rows
norm_w_engin<-norm_w_engin[!duplicated(norm_w_engin$HYDRANTNUM),]
sum(is.na(norm_w_engin$ENGINE_NUM)) # 9 NA

norm_w_engin['11415', "ENGINE_NUM"]<- 37
norm_w_engin['18912', "ENGINE_NUM"]<- 56
norm_w_engin['19502', "ENGINE_NUM"]<- 54
sum(is.na(norm_w_engin$ENGINE_NUM)) # 6 NA

## Replace remaining 6 NAs with 9999
norm_w_engin$ENGINE_NUM<-norm_w_engin$ENGINE_NUM%>% replace_na(9999)
sum(is.na(norm_w_engin$ENGINE_NUM)) # 0 NA

## Keep ENGINE_NUM as the engine identifier and drop redundant columns
norm_w_engin<-norm_w_engin%>%dplyr::select(-BATTALION.x, -BATTALION.y, -STATION, -Shape_Leng, -Shape_Area, -n_cells)
  5. Given the Normalized Scores per Hydrant, Calculate Different Combinations
combinations_raw<-norm_w_engin%>%dplyr::select(HYDRANTNUM, ENGINE_NUM, DATEOFLAST, YEAR_INSTA)

norm_fire<-norm_w_engin$fire
norm_industrial<-norm_w_engin$industrial
norm_social<-norm_w_engin$social
norm_age<-norm_w_engin$age

for(fire in (0:1)){
  # include fire risk? (0 = no, 1 = yes)
  for (industrial in (0:1)){
    # include the industrial score?
    for (social in (0:1)){
      # include the social score?
      for (age in (0:1)){
        # calculate the combination score
        combi_score = fire*norm_fire + industrial*norm_industrial + social*norm_social + age*norm_age
        # generate the combination name, e.g. "1010"
        combi_name = paste(fire, industrial, social, age, sep="")
        # add this new combination to the data frame
        combinations_raw$temp <- combi_score
        # rename the column to the combination name
        names(combinations_raw)[names(combinations_raw) == "temp"] <- combi_name
      }
    }
  }
}
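A quick hypothetical check that all 16 combination columns were created, named by their four inclusion flags (fire, industrial, social, age):

# hypothetical check: should return the 16 combination column names, "0000" through "1111"
grep("^[01]{4}$", names(combinations_raw), value = TRUE)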


# within each engine, hydrants are grouped into deciles PER THAT ENGINE.  
# descending ntile --> 1 means HIGHEST PRIORITY. 
combinations_dec_ENGINES<-norm_w_engin%>%dplyr::select(HYDRANTNUM, ENGINE_NUM, DATEOFLAST, YEAR_INSTA)

norm_fire<-norm_w_engin$fire
norm_industrial<-norm_w_engin$industrial
norm_social<-norm_w_engin$social
norm_age<-norm_w_engin$age

track = 0

engines = unique(combinations_dec_ENGINES$ENGINE_NUM)

for(fire in (0:1)){
  # include fire risk? (0 = no, 1 = yes)
  for (industrial in (0:1)){
    # include the industrial score?
    for (social in (0:1)){
      # include the social score?
      for (age in (0:1)){
        track = track +1
        print(paste('starting next combination: ', track))
        # calculate the combination score
        combi_score = fire*norm_fire + industrial*norm_industrial + social*norm_social + age*norm_age
        # generate the combination name, e.g. "1010"
        combi_name = paste(fire, industrial, social, age, sep="")
        # add this new combination to the data frame
        combinations_dec_ENGINES$temp <- combi_score
        # convert to deciles BY ENGINE; kept numeric here and converted to factor later for plotting
        e_count = 0
        for (engine in engines){
          combinations_dec_ENGINES[combinations_dec_ENGINES$ENGINE_NUM==engine,]$temp <-
            ntile(desc(combinations_dec_ENGINES[combinations_dec_ENGINES$ENGINE_NUM==engine,]$temp), 10)
          e_count=e_count+1
          print(paste('engines iterated: ', e_count))
        }
        # rename the column to the combination name
        names(combinations_dec_ENGINES)[names(combinations_dec_ENGINES) == "temp"] <- combi_name
      }
    }
  }
}

# set the 0000 combination (no factors selected) to 10 across the board (10 = lowest priority)
combinations_dec_ENGINES<-combinations_dec_ENGINES%>%mutate(`0000` = 10)

# have to gather the combis
for_plots_eng <- combinations_dec_ENGINES%>%
  as.data.frame(.)%>%
  dplyr::select(-geometry, -ENGINE_NUM,-DATEOFLAST, -YEAR_INSTA)%>%
  gather(., key='combi', value='score', -HYDRANTNUM)
# add back geometry after gathering
for_plots_eng<-merge(hydrants%>%dplyr::select(HYDRANTNUM),
                 for_plots_eng,
                 by='HYDRANTNUM')%>%
  mutate(score = as.numeric(score))%>%
  mutate(score = as.factor(score))

class(for_plots_eng$score) # sanity check: score should be a factor


for_plots_eng%>%
  ggplot() +
    geom_sf(data=philly)+
    geom_sf(aes(colour = score), size=1, show.legend='point') +
    facet_wrap(~combi, ncol = 4) +
    scale_colour_viridis_d(option="inferno",
                           name = "Priority Score",
                           labels = c("Highest", "2", "3", "4", "5", "6", "7", "8", "9", "Lowest"),
                           direction = -1) +
    labs(title = 'Priorities for Different Combinations: Deciles by Engine',
         subtitle = 'Subplot titles give the combination fire_industrial_social_age,\nwhere 0 = not included and 1 = included') +
  mapTheme()+ # apply the base map theme first so the size overrides below are not clobbered
  theme(legend.key.size = unit(2, "cm"),
        legend.key.width = unit(2,"cm"),
        legend.title = element_text(color = "black", size = 30),
        legend.text = element_text(color = "black", size=20),
        plot.title = element_text(size=50),
        plot.subtitle = element_text(size=30))+
  guides(colour = guide_legend(override.aes = list(size=10)))
  
# these deciles are with respect to the entire city, so an engine's hydrants could fall entirely in decile 10
combinations_deciles<-norm_w_engin%>%dplyr::select(HYDRANTNUM, ENGINE_NUM, DATEOFLAST, YEAR_INSTA)


norm_fire<-norm_w_engin$fire
norm_industrial<-norm_w_engin$industrial
norm_social<-norm_w_engin$social
norm_age<-norm_w_engin$age

for(fire in (0:1)){
  # include fire risk? (0 = no, 1 = yes)
  for (industrial in (0:1)){
    # include the industrial score?
    for (social in (0:1)){
      # include the social score?
      for (age in (0:1)){
        # calculate the combination score
        combi_score = fire*norm_fire + industrial*norm_industrial + social*norm_social + age*norm_age
        # generate the combination name, e.g. "1010"
        combi_name = paste(fire, industrial, social, age, sep="")
        # add this new combination to the data frame
        combinations_deciles$temp <- combi_score
        # convert to citywide deciles; kept numeric here and converted to factor later for plotting
        combinations_deciles$temp <- ntile(desc(combinations_deciles$temp), 10)
        # rename the column to the combination name
        names(combinations_deciles)[names(combinations_deciles) == "temp"] <- combi_name
      }
    }
  }
}

combinations_deciles<-combinations_deciles%>%mutate(`0000` = 10)
# because deciles were done for each combination, there should be an equal spread of 1s-10s for each combination!
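Because ntile() was applied citywide for each combination, each column should contain roughly equal counts of deciles 1 through 10; a hypothetical spot check on one combination:

# hypothetical spot check: deciles 1-10 should appear in roughly equal counts
table(combinations_deciles$`1111`)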

# have to gather the combis
for_plots <- combinations_deciles%>%
  as.data.frame(.)%>%
  dplyr::select(-geometry, -ENGINE_NUM,-DATEOFLAST, -YEAR_INSTA)%>%
  gather(., key='combi', value='score', -HYDRANTNUM)
# add back geometry after gathering
for_plots<-merge(hydrants%>%dplyr::select(HYDRANTNUM),
                 for_plots,
                 by='HYDRANTNUM')%>%
  mutate(score = as.numeric(score))%>%
  mutate(score = as.factor(score))

class(for_plots$score) # sanity check: score should be a factor

for_plots%>%
  ggplot() +
    geom_sf(data=philly)+
    geom_sf(aes(colour = score), size=1, show.legend='point') +
    facet_wrap(~combi, ncol = 4) +
    scale_colour_viridis_d(option="inferno",
                           name = "Priority Score",
                           labels = c("Highest", "2", "3", "4", "5", "6", "7", "8", "9", "Lowest"),
                           direction = -1) +
    labs(title = 'Priorities for Different Combinations: Deciles across the Entire City',
         subtitle = 'Subplot titles give the combination fire_industrial_social_age,\nwhere 0 = not included and 1 = included') +
  mapTheme()+ # apply the base map theme first so the size overrides below are not clobbered
  theme(legend.key.size = unit(2, "cm"),
        legend.key.width = unit(2,"cm"),
        legend.title = element_text(color = "black", size = 30),
        legend.text = element_text(color = "black", size=20),
        plot.title = element_text(size=50),
        plot.subtitle = element_text(size=30))+
  guides(colour = guide_legend(override.aes = list(size=10)))