Project Background

This project is the culmination of a semester-long graduate practicum course in urban spatial analytics (MUSA801) in the Department of City & Regional Planning at the University of Pennsylvania, taught by Ken Steif, Matthew Harris, and Michael Fichman. The work was completed as a proof of concept for the City of Syracuse Innovation Team. All methods are posted within and open-sourced, with the intent that the research will help other interested municipalities create their own prediction tools. For additional information on this work, please email one of the project team members.

ScottDBetz@gmail.com | jibran.n.khan@gmail.com | jyuan@nyu.edu

This document begins with a case study predicting risk of health & safety code violations in homes in Syracuse, New York, and is followed by a series of appendices that discuss data wrangling, data visualization, data sources, feature engineering, and model results. Navigate through the document either by using the panel at the left, or by clicking the hyperlinks throughout the document.




Abstract

Building codes are designed to ensure the health and safety of residents. Building code enforcement has historically been reactive: inspectors are dispatched to a property only when a call comes in from a concerned tenant or neighbor. This study implements a more proactive approach, developing a data-driven tool that prioritizes the homes most at risk of health and safety violations so that city officials can more effectively triage their targeted homes for inspection. The estimated risk scores are then fed into a web-based application aimed at helping inspection agencies better allocate their limited inspection resources.



1. Introduction

Motivation

How can machine learning enhance code enforcement and help municipal governments protect the health and safety of their residents?



The map below demonstrates how our data-driven approach can enhance code enforcement by directing the expertise and effort of enforcement officials to the properties where the largest impact can be made. Per the map, Syracuse inspectors would be able to surface far more buildings liable for health & safety violations if they used predictive modeling to inform their inspections.



Why is code enforcement so important for Syracuse?


1. The housing stock is aging

According to data provided by the City of Syracuse, its homes were built, on average, in 1926. Older homes are far more likely to be culprits of building code violations (particularly, violations putting inhabitants’ health and safety at risk) than newer homes.
(data source: Syracuse property records)



2. Syracuse is a city of renters

The city is majority renters and the data show a pattern of owner-occupied units converting to rental units.
(data source: American Community Survey)



3. Renters are disproportionately burdened by housing costs

As the chart below demonstrates, Syracuse renters are far more rent-burdened than their peers nationwide and in New York City. In Syracuse, renters spend roughly 34% of their monthly income on rent.
(data source: American Community Survey)

Cost-income ratio and rent burden by tenure:

Name            Tenure   Cost      Income    Cost-Income Ratio   Rent Burden >30%   Rent Burden >50%
United States   OWN      $13,552   $73,252   18.5%               23.9%               9.6%
United States   RENT     $11,106   $36,653   30.3%               50.6%              25.6%
New York state  OWN      $17,600   $86,275   20.4%               29.8%              13.4%
New York state  RENT     $12,874   $40,357   31.9%               53.5%              29.1%
New York City   OWN      $20,396   $92,707   22.0%               36.3%              18.7%
New York City   RENT     $14,278   $44,480   32.1%               53.8%              29.3%
Syracuse        OWN      $10,626   $62,876   16.9%               19.2%               8.1%
Syracuse        RENT     $7,854    $23,236   33.8%               56.2%              34.6%
Rochester       OWN      $9,929    $53,384   18.6%               24.5%              10.7%
Rochester       RENT     $8,094    $22,546   35.9%               61.1%              34.4%

4. The situation exacerbates the racial wealth gap

The map below displays the percent of non-white residents in each Syracuse neighborhood. This pattern of segregation is not dissimilar to many other American cities. What is most concerning, however, is how race intersects with housing tenure.
(data source: American Community Survey)

Households of color in Syracuse are far more likely to rent their residence than to own it; the reverse is true for white households. Roughly 1 out of every 7 Syracuse homeowners is Black, whereas over 1 out of every 3 Syracuse renters is Black.
(data source: American Community Survey 2017 5-year Estimates)





Code Enforcement Process

We can target the Department of Code Enforcement’s (DOCE) limited resources to properties with a high risk of code violation. Especially important are violations that have damaging health or safety outcomes.

The current process used by Syracuse (and likely most other cities) is relatively reactive, with building code inspections taking place due to one of the following occurrences:

  • Properties are scheduled for regular inspection every 3-5 years

  • Building inspectors drive through neighborhoods and spot buildings that may require inspection

  • The Department of Code Enforcement receives inbound reports (e.g. via 311 calls) from local residents or personnel

This current system will fail to respond to homes violating health & safety codes if these homes’ issues go unreported. This may particularly be the case with non-superficial issues, such as with the risk of lead poisoning from plumbing not up to code.

Although a regular 3-5 year inspection cadence sounds reasonable in theory, the City still needs some means of prioritizing homes: even at that cadence, the number of homes needing inspection exceeds the bandwidth of the Department of Code Enforcement. Inspecting every home on a 3-5 year cycle would require the City to inspect over 8,500 homes per year, or at least 170 homes each week. Even if the City had the bandwidth to conduct that many inspections, the depth of each inspection would be relatively shallow without prioritization.



Suggested Future Process for Building Code Inspections

DOCE should have, at any given moment, a comprehensive list of all homes in the City with their respective health & safety risk score. Doing so will allow inspectors to not only efficiently allocate resources where they are most critical, but this moreover allows for inspectors to understand the difference in magnitude of health & safety risk across homes.

Moreover, a prioritized list of health & safety risk scores can be useful for the City beyond just the Department of Code Enforcement; for example, the Health Department could use these scores to understand where children may be most at-risk of lead poisoning. The Department of Public Safety could also use these scores to understand which homes may be most at-risk of flooding or other disasters.





Methods in Brief

Our analysis is based on both publicly available data and data provided by the City of Syracuse. We began by cleaning the data and conducting an initial exploratory analysis to better understand the City of Syracuse and, in particular, the salient characteristics of its housing stock. We surfaced the trends and characteristics most apparent among residences with historical code violations, particularly those characteristics that were not salient among properties inspected but not found to have a health/safety violation. These results informed the engineering of features used to predict whether a home is more or less likely to be in violation of health & safety building codes, producing a prioritized array of risk scores for all Syracuse residential parcels. As a final step, all results were cross-validated to ensure that the model does not disproportionately burden poor renters or renters of color.





2. Data

Caveats

The data are inherently biased in the following ways:

  • We only observe historical violations, and only those that were reported. There may well be homes that were inspected for which we have no inspection data.

  • There are likely building code violation categories that the City of Syracuse has, to date, not searched for, so no data exist to predict violations in those categories.

Nonetheless, we are confident that we have built a working proof of concept that is more accurate and efficient than the current inspection process.



Unit of Analysis: Parcels

There are over 40,000 total parcels in Syracuse. Of these, approximately 34,000 are residential properties. Given the heterogeneity of commercial buildings and the relative homogeneity of residential buildings, our analysis focuses on residential parcels for predicting code violations. Each parcel is tied to a parcel ID, which is stored across datasets as “SBL number,” “PrintKey,” and “Address.” However, the integrity of all of these identifiers aside from Address is unreliable: 3,022 parcels had a different parcel ID and address associated with their PrintKey. Because Address has the fewest issues, we use street address as the means of identifying individual parcels.
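Joining on street addresses is only reliable if the strings are normalized first. A minimal sketch of such a helper (the function name and exact cleaning rules are our own, mirroring the toupper/trimws/str_remove steps used in the appendix code):

```r
library(stringr)

# Normalize a street address so it can serve as a join key across datasets.
# Hypothetical helper; cleaning rules follow the appendix wrangling code.
normalize_address <- function(num, street) {
  street %>%
    toupper() %>%          # "Cherry St" and "CHERRY ST" become identical
    str_squish() %>%       # trim and collapse repeated whitespace
    str_remove_all("[.]") %>%  # "ST." and "ST" become identical
    paste(num, .)          # prepend the street number
}
```

For example, `normalize_address("215-17", " cherry  st. ")` yields `"215-17 CHERRY ST"`, matching the address format used in the violations data.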







Outcome of Interest: Health + Safety Violations

The dependent variable to be used is code violations. This will be considered both as a binary dependent variable (i.e. “1” for parcels which have had code violations previously, “0” for parcels which have never demonstrated a code violation) and as a count dependent variable (the aggregate number of violations at each parcel is considered). For the purpose of our study, we will focus exclusively on health & safety code violations.

Many properties have multiple violations. For example, violations issued to one particular Syracuse property:

Parcel ID       19248              19248              19248
Violation #     17808              19833              19836
Violation Date  March 01, 2013     May 01, 2013       May 01, 2013
Address         215-17 Cherry St   215-17 Cherry St   215-17 Cherry St
Code Violated   2010 IMC           2010 IMC           2010 IMC
                Section 304.15     Section 504.1      Section 704.2
                Doors              General            Smoke alarms
HS Indicator    HEALTH & SAFETY    HEALTH & SAFETY
HS Reclass                                            SAFETY ONLY

Many violations are
not health & safety

Code Violated Count
2010 IMC - Section 303.1 - Swimming Pools 20
2010 IMC - Section 303.2 - Enclosures 7
2010 IMC - Section 303.3 (1) - Outdoor Swimming Pool (Barrier) 12
2010 IMC - Section 303.3 (10.1) - Swimming Pool (Ladder) 2
2010 IMC - Section 303.6.1 - Swimming pool and spa alarms 2

Some of the most common
violations are not
health & safety

Code Violated Count
SPCC - Section 27-72 (e) -Trash & Debris 4020
SPCC - Section 27-72 (f) - Overgrowth 3646
2010 IMC - Section 305.3 - Interior surfaces 2742
2010 IMC - Section 504.1 - General 1569
2010 IMC - Section 308.1 - Infestation 1449
2010 IMC - Section 305.1 - General 994
SPCC - Section 27-32 (d) Protective coating for wood surfaces 978
2010 IMC - Section 603.1 - Mechanical appliances 971
2010 IMC - Section 304.13 - Window, skylight and door frames 968
SPCC - Section 27-57 (a) (19) - Switch/Outlet is Damaged/ Unserviceable 949
2010 IMC - Section 704.2 - Smoke alarms 937
2015 IMPC - 305.3 - Interior Surfaces 935
2010 IMC - Section 304.15 - Doors 881
2010 IMC - Section 107.1.3 - Premises Unfit for Human Occupancy 785
SPCC - Section 27-31 (c) Structural members 773




Violations

Approximately 8,000 of 41,000 parcels have at least one violation – 2,600 of these are health and safety-related violations, whereas approximately 5,000 are other violations.



Categorizing Violations

Although there were 244 different categories of building code violations observed in Syracuse, the aim of our project has been to focus on those linked to health & safety dangers. We categorized all previous violations into 3 buckets: health-related violations, safety-related violations, and “other.” Our analysis focuses only on the first two categories.





Feature Engineering

The independent variables to be used include:

  • Parcel & building characteristics, e.g. number of residences on the parcel, location of the landlord relative to the parcel, assessed parcel value

  • Physical features, e.g. property age, property square footage, number of stories

  • Public services, e.g. water services, heating

  • 311 calls related to the property

To predict houses with code violations, features are created that more adequately identify predictive trends in the data. Our exploratory analysis suggested which attributes may offer predictive power in determining risk of violation, and this guided our creation of features (variables) from which predictions could be generated. In aggregate, 742 features were created for each individual parcel.

In order to generate as much predictive power as possible from the various observations, we created a vast array of features related to each of the above observations. For example, we took into consideration a wide array of distances (50ft - 1000ft+) when measuring distance to various landmarks. The variety and breadth of features will maximize the number of possibilities with which our model can predict building code violations.
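A distance-ring feature of this kind can be sketched with the sf package. The layer and column names below are hypothetical, and we assume both layers share a projected CRS measured in feet:

```r
library(sf)

# Sketch: count 311 calls within several radii of each parcel.
# `parcels` and `calls` are assumed to be sf layers in a projected,
# foot-based CRS; names and radii are illustrative.
buffer_counts <- function(parcels, calls, radii_ft = c(50, 100, 250, 500, 1000)) {
  for (r in radii_ft) {
    # sparse list: for each parcel, the indices of calls within r feet
    hits <- st_is_within_distance(parcels, calls, dist = r)
    parcels[[paste0("calls_within_", r, "ft")]] <- lengths(hits)
  }
  parcels
}
```

Repeating this pattern over multiple point layers (permits, fires, sales) is one way a feature count in the hundreds accumulates quickly.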





Independent Variables

Our independent variables come from 7 different categories: 1) parcel characteristics, 2) building characteristics, 3) property sales, 4) building permits, 5) geographic features, 6) public services, and 7) 311 calls. A full list of all independent variables can be found below, organized by category.

Parcel characteristics include those related to the parcel’s size, tax details, and owner information. Building characteristics include data regarding the parcel’s buildings’ physical details such as building heights, floor square footage, and amenity details (kitchens, bathrooms). Property sales data includes assessed home values, including sale prices for nearby homes. Building permits data refers to previous permits obtained by parcels and their nearby parcels, such as permits for block parties, renovation, and new construction. Geographic features data is that of nearest physical amenities (canals, rivers, highways). Public services data is that of nearest fires and emergencies to the respective parcel. 311 calls variables are those created based on calls made to Syracuse’s public cityline for non-urgent maintenance requests.

Parcel Characteristics | Building Characteristics | Property Sales | Building Permits | Geographic Features | Public Services | 311 Calls



Property Attributes compared to Violation History

Ultimately we want to understand which parcel-level attributes are strikingly associated with the properties that have violated, and which parcel-level attributes are strikingly not associated with the properties that have violated. These will help guide our algorithm’s predictions across the entire city.





3. Exploratory Analysis

The data show both surprising and unsurprising trends in the risk factors associated with health and safety violations. Some of the most salient include: 1) the landlord’s distance to the parcel, 2) the price at which the parcel sold relative to its assessed value, 3) the density of residences on the parcel, 4) 311 Cityline calls routed to departments other than DOCE, and 5) neighborhood.

What are some of the things that matter most in predicting housing risk?

Where landlord lives

The further away the landlord lives, generally, the higher the risk of health and safety violations. As seen below, risk of violations appears highest when the landlord is either out of state or out of country.

Sale price

If the landlord purchased the property far under its appraised value, the property is generally far more likely to have a health and/or safety violation. Properties that sell below appraised value usually do so due to foreclosure, bankruptcy, or other financial or management troubles.

All sale prices were adjusted for housing price inflation to current dollars and compared to the present appraised value.

Type of residential

Higher density housing, namely multifamily apartments, is generally at higher risk of health & safety violations than single family housing.

311 calls

311 Cityline calls routed to the Department of Code Enforcement are not the best 311 predictors of violations. Rather, calls routed to other departments – such as Department of Public Works, Fire Prevention, and Health Department – can be more significant predictors.

Neighborhood

When mapped, it is apparent that the distribution of inspections is uneven across neighborhoods and some neighborhoods have higher rates of non-hazardous violations. The rate of non-hazardous violations represents an inefficiency in the distribution of inspectors to properties. The neighborhoods with higher percent non-hazardous would benefit the most from an enhanced prioritization tool.







4. Modeling & Validation

We are using a limited sample of Syracuse residential parcels (for which we have building code violation data) to predict the presence of building code violations across all residential parcels in the city. In order to do this, we must first develop strong predictive capabilities from our sample set that can be used to make accurate predictions across the city.

Predictive models need to be able to make accurate generalizations from a small sample across a larger population. As shown below, our model is approximately twice as accurate at predicting violations as the current process.





As outlined earlier, the two types of violations to be predicted are 1) health violations and 2) safety violations. Because different kinds of violations are present under each of these two categories, and because these two areas pose different kinds of risk to citizens, we have built two separate predictive models - one which predicts health violations across Syracuse homes, and another which predicts safety violations across homes. The same overall model has been built for both of these violation types, which ultimately run in parallel.





For the first step in our model, we tested the predictive power of each of the independent variables. We did this via bivariate GLM (generalized linear model), in which a linear regression model was created for each independent variable (there are over 740) against one dependent variable (the presence of a violation). Per standard linear regression, a line of best fit is generated between each independent variable and the dependent variable (violation). We used these bivariate GLMs to help determine which variables were most closely interacting with the dependent variable by measuring the AUC (area under curve) for each independent variable. We sorted all 740+ variables by their respective AUC, and then created a shortlist of the top 100 variables based on AUC.
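The bivariate screen described above can be sketched as follows. The column names and the pROC dependency are assumptions, not the project's exact code:

```r
library(pROC)

# Sketch of the bivariate screen: fit one logistic GLM per candidate
# predictor and rank predictors by in-sample AUC. `dat` is assumed to
# hold a binary outcome `violation` plus one column per engineered feature.
screen_features <- function(dat, outcome = "violation") {
  preds <- setdiff(names(dat), outcome)
  aucs <- sapply(preds, function(p) {
    fit  <- glm(reformulate(p, outcome), data = dat, family = binomial)
    prob <- predict(fit, type = "response")
    as.numeric(pROC::auc(dat[[outcome]], prob, quiet = TRUE))
  })
  sort(aucs, decreasing = TRUE)  # shortlist = names of the top 100 entries
}
```

Ranking by AUC rather than p-values keeps the screen focused on classification performance, which is what the downstream model is judged on.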

We then completed a backwards stepwise regression using these top 100 variables. In backwards stepwise regression, variables are removed from the model one at a time to see whether removing certain independent variables improves the overall predictive power of the model. We ran this procedure with each of three regression methods: GLM (generalized linear model), RF (random forest), and XGB (extreme gradient boosting). Each of these models is appropriate for determining a true/false outcome, where true signifies that a property has a health/safety violation and false signifies that it does not. The model outputs a risk score between 0 and 1 for each property, interpreted as the probability that the property is in violation.





GLM, as explained earlier, creates a line of best fit and is particularly useful when explanatory variable results are not normally distributed. Random Forest, in the words of data scientist Niklas Donges, is a regression technique which “builds multiple decision trees and merges them together to get a more accurate and stable prediction.” Extreme gradient boost is similar to random forest in that it builds decision trees, but unlike RF, XGB “builds one tree at a time, where each new tree helps to correct errors made by a previously trained tree.” We ran backwards stepwise regression using each of these three techniques in order to determine which of the three regression methods would most accurately predict violations. We found Extreme Gradient Boost (XGB) to be the most effective in predicting violations, and thus built our model using Extreme Gradient Boost.
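A minimal sketch of fitting such an XGBoost classifier in R follows; the hyperparameters shown are illustrative defaults, not the tuned values used in the study, and `X`/`y` are hypothetical names for the feature matrix and outcome:

```r
library(xgboost)

# Sketch: extreme gradient boosting on the shortlisted features.
# `X` is a numeric matrix of predictors, `y` a 0/1 violation outcome.
fit <- xgboost(
  data      = xgb.DMatrix(as.matrix(X), label = y),
  objective = "binary:logistic",  # yields a 0-1 risk score per parcel
  nrounds   = 200,                # number of trees, built sequentially
  eta       = 0.1,                # learning rate: each tree corrects prior errors
  max_depth = 6,
  verbose   = 0
)

risk_score <- predict(fit, as.matrix(X))  # probability of violation
```

The sequential tree-building (`nrounds` trees, each fit to the previous trees' residual errors) is exactly the "one tree at a time" behavior that distinguishes XGB from random forest's parallel trees.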





5. Model Results

Threshold analysis

Of course, every municipal government must face a trade-off between safety assurance and resources spent. Although risk predictions may be more accurate at surfacing buildings likely to have health/safety violations than a random survey, there will inevitably be situations in which buildings are predicted to have high health/safety risk, when in fact they are completely up to health/safety code (i.e. “false positive”). Inevitably, there will also be situations in which buildings predicted to be clear of health/safety risk are, in reality, in violation of health/safety code standards and inflicting preventable risk upon their inhabitants (i.e. “false negative”). Given that false negatives pose a real threat to civilians’ health and safety whereas false positives only promote extra cautiousness, it is in cities’ best interest to minimize false negatives, and to prioritize minimizing false negatives over minimizing false positives. The extent to which cities can minimize false negatives depends on their own resource constraints (manpower, funds).


Costs associated with building code inspections:

  • Namely labor (inspector, management personnel): approximately $500 per house

  • This assumes that any building code violation remediation costs are then paid for by the landlord.


Costs associated with a building in violation of code:

  • Safety hazard costs: lives can be lost or impacted by safety violations left unresolved. For example, in the City of Philadelphia, a Salvation Army thrift store collapsed in 2013, killing seven people. In the past 3 years, over 20 workers have died on-site of construction projects that failed to uphold building codes. Even when injuries are not fatal, they can be severely debilitating.

  • Health hazard costs: building code violations that can lead to health risks, such as those leading to lead or mold exposure, can cause permanent health defects among those exposed to these harmful substances.

  • The city also faces an impact when a building collapses or a lead/mold outbreak becomes publicized, as these negatively impact the value of nearby homes. These will translate into lower property values and subsequently impact property tax revenue collected by the city.

  • Collectively, we estimate an approximate cost of $10,000 per false negative, although of course the cost of one’s life or livelihood cannot be fully quantified.


Costs associated with each result:

  • True positive: residence predicted to be violating building codes, and it is. Cost: $500 (cost of an inspection)

  • True negative: residence predicted to be clear of building code violations, and it is. Cost: $0 (no inspection takes place)

  • False positive: residence predicted to have a violation, but it is clear. Cost: $500 (cost of an inspection)

  • False negative: residence predicted to be clear of violations, but it actually has violations. Cost: $10,000 (estimated cost of health impact, safety impact, property tax impact)
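With these per-outcome costs, the threshold analysis reduces to minimizing total expected cost over candidate cutoffs. A sketch, where the vector names `score` and `actual` are hypothetical:

```r
# Sketch of the threshold analysis: total cost at each candidate cutoff,
# using the per-outcome costs above ($500 per inspection, $10,000 per
# missed violation). `score` holds predicted risk, `actual` the 0/1 truth.
threshold_cost <- function(score, actual, cutoffs = seq(0.01, 0.99, by = 0.01)) {
  sapply(cutoffs, function(t) {
    inspected <- score >= t
    fn <- sum(!inspected & actual == 1)  # violations we fail to catch
    sum(inspected) * 500 + fn * 10000    # TPs and FPs each cost one inspection
  })
}

costs <- threshold_cost(score, actual)
# the optimal cutoff is cutoffs[which.min(costs)]
```

Because a false negative costs 20x a false positive, the cost-minimizing cutoff sits well below 0.5, pushing the model toward inspecting more homes rather than fewer.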





Map of predicted violations



Cross Validation



Currently, the process used by the City of Syracuse is approximately 38% effective at finding code violators, so any increase above this baseline will help the city minimize building violation risk.



Neighborhood

The chart below shows a cross-validation by neighborhood across Syracuse. As seen below, some neighborhoods demonstrate more accurate predictions than others. For example, the downtown CBD area and southeast neighborhoods demonstrate the highest predictive accuracy.



Race

Mapping our model accuracy against the percent non-white makeup of each census tract reveals that the model may have some inherent bias. Specifically, the most white-dominated neighborhoods are those where our model demonstrated the greatest accuracy. Conversely, the neighborhoods with the highest percent non-white population are those where the model displayed the least predictive accuracy.



Plotting these same variables confirms that the model is biased toward whiter neighborhoods: the model’s accuracy decreases as a tract’s non-white share increases. Though a similar trend appears as a census tract’s total population increases, the accuracy lost in non-white tracts is greater than that lost in more populated tracts.



Model Accuracy by Race
(chart: accuracy plotted for non-white and white tract shares, sized by population density)



However, model sensitivity (true positive rate) increases greatly as percent non-white increases. This indicates that the model is likely overpredicting in neighborhoods of color, which would likely result in overinspection of those neighborhoods. There are two ways to interpret these findings: the first is that these predictions could disproportionately overburden neighborhoods of color with inspections; the second is that public services would be better distributed to where they are needed most. The model shows that rental housing quality in Syracuse’s neighborhoods of color tends to be higher risk, a social injustice that this data-driven approach seeks to mend.



Model Sensitivity by Race
(chart: sensitivity plotted for non-white and white tract shares, sized by population density)



Housing Type

The last cross-validation we performed is by housing typology, to measure whether the model is biased toward any specific housing type. The graph below shows relatively little variation between typologies, which indicates that the model is relatively unbiased between single-family and apartment rental units. However, the graph also shows slightly higher accuracy for single-family properties and slightly lower accuracy for two- and three-family properties. The lower accuracy for two/three-family residences is very likely because there are far fewer of these property types than the other three, which inherently diminishes prediction accuracy.







6. Conclusion

Our analysis finds that cities such as Syracuse can use predictive modeling to prioritize and triage the order in which they inspect local buildings for code violations, particularly those code violations related to health & safety dangers. Using data from a limited sample size of previous health & safety violations, a predicted risk score has been generated for every individual parcel, offering local inspectors a clearly prioritized list of buildings to inspect for health & safety violations.

In an ideal world, every building would be inspected as frequently as possible, so that the risk of an undetected health/safety violation is virtually eliminated. That, however, would be cost prohibitive, so we instead assessed the cost to society at each candidate threshold. We estimate that the greatest social cost comes from not inspecting a home that is actually in violation. Accordingly, our model determined that the 11,500 parcels with the highest risk scores should be inspected in order to minimize societal cost. This equates to a parcel-level risk score threshold of 0.24 (out of 1).

Without question, there are clear limitations to our model, and any municipal government looking to implement our predictive model will need to make minor adjustments based on the data available and assumptions underlying the data integrity.





Appendix

The Code

Data Wrangling

Dependent Variable: Violations Wrangling

library(readxl)     # read_xlsx
library(lubridate)  # year(), month()
library(dplyr)

violations <- read_xlsx("Upenn MUSA/Violations.xlsx") %>%
  mutate(year  = year(violation_date),
         month = month(violation_date))


violation_categories <- violations %>% 
  select(codebook_name)

imc_crosswalk <- dplyr::select(violation_categories,codebook_name) %>% 
  dplyr::mutate(new_name=codebook_name) 

imc_crosswalk$new_name <- imc_crosswalk$new_name %>%
  gsub("2015","",.) %>%
  gsub("2010","",.) %>%
  gsub("Section","",.) %>%
  gsub("IMPC","IMC",.) %>%
  gsub("IPMC","IMC-",.) %>%
  gsub("IFC","10.",.)%>%
  gsub("R-2","",.)%>%
  gsub("R-3","",.)%>%
  gsub("R-4","",.)%>%
  gsub("I-1","",.)%>%
  gsub("R-1","",.)%>%
  gsub(" ","",.) %>%
  tolower(.) 

imc_crosswalk$new_name <- imc_crosswalk$new_name %>%
  gsub("imc","",.) %>% 
  gsub("-","",.) %>% 
  gsub("/","",.) %>% 
  gsub("&","",.) %>% 
  gsub("\\(1\\)","",.) %>%
  gsub("\\(","",.) %>%
  gsub("\\)","",.) %>%
  gsub("107.1.3", "108.1.3",.)%>%
  gsub("704.2.1.1", "704.2",.)%>%
  gsub("704.2.1.2", "704.2",.)%>%
  gsub("107.4","108.5",.)%>%
  gsub("305.1.1","108.1",.)%>%
  gsub("108.1.1","108.1",.)%>%
  gsub("19nycrrpart1203","10.105.1.1",.)%>%
  gsub("spcc2757a7wiringequip.notsecured", "604.3",.)%>%
  gsub("spcc2732dprotectivecoatingforwoodsurfaces", "304.2",.) %>%
  gsub("spcc2757a19switchoutletisdamagedunserviceable", "604.3",.)%>%
  gsub("spcc2757a2electricalwiringandequipment", "604.3",.)%>%
  gsub("spcc2757a5abandonedelectricwireequip.", "604.3",.)%>%
  gsub("spcc2731cstructuralmembers", "304.4",.)%>%
  gsub("spcc2732bstairs,porchesandrailings","304.10",.)%>%
  gsub("spcc2757a16receptacleoutletswitchlackscover","604.3",.)%>%
  gsub("spcc2757a17improperextens.corduse","604.3",.)%>%
  gsub("spcc2743amandatorysmokedetectors","704.2",.)%>%
  gsub("10.915.1general","705.1",.)%>%
  gsub("spcc2752c7sewagedrainingsystem","506.2",.)%>%
  gsub("spcc2744abprohibitedaccumulationsandstorage","10.610.1",.)%>%
  gsub("spcc2743e2interconnectedalarmsystem","704.2",.)%>%
  gsub("spcc2743b2smokedetectorrequirements","704.2",.)%>%
  gsub("spcc27115greoccupyinganunfitpremises","108.5",.)%>%
  gsub(",","",.) %>% 
  gsub("[a-z]","",.) %>%
  gsub("2772","302.1",.)

violation_categories_1 <- read_csv("viol_cat_final.csv")


imc_crosswalk <- imc_crosswalk%>%  distinct() %>% 
  left_join(.,violation_categories_1,by=c("new_name"="codebook_number")) %>%
  dplyr::mutate(health_violation=(carbon_monoxide + heating + lead + mold_infestation)) %>%
  dplyr::mutate(safety_violation=(electrical_hazard+fire_hazard+structural))%>%
  na.omit()

violations_new <- left_join(violations,imc_crosswalk, by=c("codebook_name"="codebook_name")) %>%
 filter(status_type_name != "Void") %>%
  select(parcel_id,
         carbon_monoxide,
         electrical_hazard,
         heating,
         fire_hazard,
         lead,
         mold_infestation,
         exterior_hygene,
         structural,
         vacate_order,
         water_supply,
         without_permit,
         health_violation,
         safety_violation)%>%
  aggregate(.~parcel_id,., FUN=sum)

rm(violation_categories)
rm(imc_crosswalk)
rm(violation_categories_1)
rm(violations)

# Convert the per-parcel counts to binary indicators
# (1 = at least one violation of that type, 0 = none)
violations_new <- violations_new %>%
  mutate(across(-parcel_id, ~ ifelse(. > 0, 1, 0)))





Dependent Variable: Joining Violations to Parcels

library(tidyverse)
library(sf)
library(magrittr)

# NOTE! 
# each lot (SBL) contains multiple sub-parcels (parcel_id)
# each parcel (parcel_id) contains multiple violations
# lot (SBL) should be used as the unit of analysis
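The hierarchy in the note above can be sketched on toy data (all identifiers below are made up for illustration) to show why violations must be rolled up from parcel_id to SBL before modeling:

```r
library(dplyr)

# Toy data: one lot (SBL) holds several sub-parcels (parcel_id),
# each of which can carry several violations
toy_map  <- tibble(SBL = c("001.-01-01", "001.-01-01", "002.-02-02"),
                   parcel_id = c(11, 12, 21))
toy_viol <- tibble(parcel_id = c(11, 11, 12),
                   violation_number = c("V1", "V2", "V3"))

# Roll violations up to the lot, as the joins below do at scale
lot_counts <- toy_map %>%
  left_join(toy_viol, by = "parcel_id") %>%
  group_by(SBL) %>%
  summarize(count_viol = sum(!is.na(violation_number)))

lot_counts
# SBL 001.-01-01 -> 3 violations; SBL 002.-02-02 -> 0
```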

#load in parcels
parcels2018Jul <- read_sf("https://opendata.arcgis.com/datasets/1b346804e1364a5eb85ccb53302e3c91_0.geojson",
                          stringsAsFactors = FALSE)
p <- parcels2018Jul %>% as_tibble %>%
  select(SBL      = PRINTKEY,
         AddNum.p = StNum,
         AddSt.p  = StName) %>%
  mutate(AddSt.p  = AddSt.p %>% toupper %>% trimws %>% str_remove("[.]"))

#load in violations
violations <- readxl::read_excel("../data/UPenn MUSA/Violations.xlsx") %>%
  select(-zone_type_description) %>% unique %>%
  filter(status_type_name != "Void")
hs_key <- read_csv("HS_KEY/hs_key.csv") %>%
  select(codebook_name,HEALTH,SAFETY,HS)

v <- violations %>%
  select(violation_number,
         parcel_id,
         AddNum.v = number,
         AddSt.v  = address,
         codebook_name) %>%
  mutate(AddSt.v  = AddSt.v %>% toupper %>% trimws %>% str_remove("[.]")) %>%
  left_join(hs_key, by = "codebook_name")

#load in crosswalk
mapping <- readxl::read_excel("../data/UPenn MUSA/Parcel Mapping.xlsx") 
m <- mapping %>%
  select(SBL      = identifier,
         parcel_id,
         AddNum.m = number,
         AddSt.m  = address) %>%
  mutate(AddSt.m  = AddSt.m %>% toupper %>% trimws %>% str_remove("[.]"))

# join

join <- p %>%
  left_join(m, by = "SBL") %>%
  left_join(v, by = "parcel_id")

dep_viol <- join %>% 
  select(-starts_with("Add")) %>% 
  group_by(SBL) %>%
  mutate(count_viol  = sum(!is.na(violation_number)),
         count_HEALTH= sum(HEALTH == TRUE, na.rm = TRUE),
         count_SAFETY= sum(SAFETY == TRUE, na.rm = TRUE),
         count_HS    = sum(HS == TRUE, na.rm = TRUE),
         has_viol    = count_viol > 0,
         has_HEALTH  = count_HEALTH > 0,
         has_SAFETY  = count_SAFETY > 0,
         has_HS      = count_HS > 0,
         inspected   = has_viol) %>%
  select(-parcel_id,-violation_number,-(codebook_name:HS)) %>% 
  ungroup %>% distinct

save(dep_viol, file = "dep_viol.RData")
load("dep_viol.RData")
# sanity check: count usable (finite or character) values per column
dep_viol %>% summarize_all(~sum(is.finite(.) | is.character(.))) %>% t %>% View

# addresses

address <-
  rbind(join[,c("SBL","AddNum.p","AddSt.p")] %>%
          set_colnames(c("SBL","AddNum","AddSt")),
        join[,c("SBL","AddNum.m","AddSt.m")] %>%
          set_colnames(c("SBL","AddNum","AddSt")),
        join[,c("SBL","AddNum.v","AddSt.v")] %>%
          set_colnames(c("SBL","AddNum","AddSt"))) %>%
  as_tibble() %>% 
  drop_na %>% distinct

write_csv(address, "address_lookup.csv")





Independent Variables: Urban Services

library(tidyverse)
library(sf)
library(magrittr)
library(parallel)

options(scipen=999)

### FUNCTION TO RUN IN CHUNKS
run_chunks <- function(split_p, y, prefix = NULL, suffix = NULL) {
  cl = makeCluster(detectCores()-1)
  out <- split_p %>%
    pbapply::pblapply(
      function(chunkx,disty) {
        library(dplyr)
        library(sf)
        library(magrittr)
        distances <- chunkx %>%
          st_distance(disty) %>%
          as_tibble
        chunkout <- tibble(
          nn_dist1 = distances %>% apply(1,FUN = function(row) {sort(row)[1]}),
          nn_dist2 = distances %>% apply(1,FUN = function(row) {sort(row)[1:2]} %>% mean),
          nn_dist5 = distances %>% apply(1,FUN = function(row) {sort(row)[1:5]} %>% mean),
          nn_dist10= distances %>% apply(1,FUN = function(row) {sort(row)[1:10]} %>% mean),
          nn_50ft  = distances %>% apply(1,FUN = function(row) {row %>% extract(. <= 50) %>% length}),
          nn_500ft = distances %>% apply(1,FUN = function(row) {row %>% extract(. <= 500) %>% length}),
          nn_1320ft= distances %>% apply(1,FUN = function(row) {row %>% extract(. <= 1320) %>% length}),
          nn_1mi   = distances %>% apply(1,FUN = function(row) {row %>% extract(. <= 5280) %>% length})
        ) 
        return(chunkout)
      }, disty = y, cl = cl) %>% # pass the cluster so chunks actually run in parallel
    bind_rows %>%
    rename_all(function(.){paste0(prefix,.,suffix)})
  stopCluster(cl)
  return(out)
}
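The per-chunk logic inside run_chunks() can be seen on a small illustrative distance matrix (made-up values, in feet; rows are parcels, columns are e.g. fire incidents):

```r
# Illustrative distance matrix: 2 parcels x 3 incidents, in feet
distances <- matrix(c(30, 400, 900,
                      60,  70, 2000),
                    nrow = 2, byrow = TRUE)

nn_dist1 <- apply(distances, 1, function(row) sort(row)[1])          # nearest neighbor
nn_dist2 <- apply(distances, 1, function(row) mean(sort(row)[1:2]))  # mean of two nearest
nn_500ft <- apply(distances, 1, function(row) sum(row <= 500))       # count within 500 ft

nn_dist1   # 30 60
nn_dist2   # 215 65
nn_500ft   # 2 2
```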


# Parcels July 2018
# https://data.syrgov.net/datasets/1b346804e1364a5eb85ccb53302e3c91_0.geojson
parcels2018Jul <- read_sf("https://opendata.arcgis.com/datasets/1b346804e1364a5eb85ccb53302e3c91_0.geojson",
                          stringsAsFactors = FALSE)
p18Jul <- parcels2018Jul %>%
  select(SBL = PRINTKEY, geometry) %>% 
  st_transform(2261)
p <- p18Jul %>% st_centroid

split_p <- p %>% split((seq(nrow(p18Jul))-1) %/% 100)

# Fire Incidents Jan 2008 - June 2018
# http://data.syrgov.net/datasets/3aeb821352dd4675be0a659e0e6b8682_0
fire <- read_sf("https://opendata.arcgis.com/datasets/3aeb821352dd4675be0a659e0e6b8682_0.geojson",
                stringsAsFactors = FALSE)

f <- fire %>%
  st_transform(2261) %>%
  mutate(year = lubridate::year(alarmdate),
         incidentcat = incidenttype %>% 
           str_sub(1,nchar(incidenttype)-2)) %>%
  filter(incidentcat %in% c(
    1, # fire
    2, # emergency
    3, # hazardous condition
    4, # service call
    9  # special
  )) %>%
  filter(year %>% between(2013,2018)) %>%
  select(incidentcat,geometry)


make_fire <- function(y,suffix) run_chunks(split_p,
                                   y = y,
                                   suffix = suffix,
                                   prefix = "fire_")

#p_fire <- make_fire(y = f[1:7,], suffix = "_test")
#rm(p_fire)

p_fire <- cbind(make_fire(y = f %>% filter(incidentcat == 1),
                          suffix = "_fire"),
                make_fire(y = f %>% filter(incidentcat == 2),
                          suffix = "_emerg"),
                #make_fire(y = f %>% filter(incidentcat == 3),
                #          suffix = "_hazard"),              # no hazards with incidentcat == 3
                make_fire(y = f %>% filter(incidentcat == 4),
                          suffix = "_svccall"),
                make_fire(y = f %>% filter(incidentcat == 9),
                          suffix = "_special"),
                make_fire(y = f,
                          suffix = "_any"))

# Property Sales 1956 - 2018 (data supplied by Sam)

sales <- readxl::read_xlsx("../data/UPenn MUSA/sales.xlsx") %>% 
  transmute(SALE_PRICE,
            SBL = PRINT_KEY,
            year = lubridate::year(DEED_DATE)) %>%
  arrange(SBL,desc(year)) %>% 
  filter(!duplicated(SBL))

#Adjust sale prices for inflation
##Call in New York State housing price index data and index sale prices to 2017 levels
hpi <- Quandl::Quandl("FMAC/HPI", order = 'asc') %>%
  mutate(year = lubridate::year(Date)) %>%
  group_by(year) %>%
  summarize(hpi = mean(NY)) %>% 
  rbind(c(2018,174.06169), # backfill years missing from the series
        c(1956,23.63157),
        c(1969,23.63157),
        c(1971,23.63157),
        c(1972,23.63157),
        c(1973,23.63157),
        c(1974,23.63157))
hpi2017 <- hpi %>% filter(year == 2017) %>% pull(hpi)
hpi$adj_factor <- (hpi2017/hpi$hpi)
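The adjustment factor works as a simple ratio of index levels. A worked example with illustrative (made-up) index values:

```r
# Inflation adjustment via a house price index ratio
hpi_base <- 174   # index level in the base year (2017)
hpi_sale <- 58    # index level in the year the property sold
adj_factor <- hpi_base / hpi_sale

50000 * adj_factor   # a $50,000 sale is ~$150,000 at base-year price levels
```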

s <- p %>%
  left_join(parcels2018Jul %>% 
              as.data.frame %>% 
              select(SBL = PRINTKEY, AssessedVa)) %>%
  left_join(sales) %>%
  left_join(hpi) %>%
  transmute(SBL, 
            SALE_PRICE_NA = SALE_PRICE,
            SALE_PRICE_ADJ_NA = ifelse(is.na(SALE_PRICE), NA, SALE_PRICE*adj_factor),
            SALE_PCT_ofval_NA = ifelse(AssessedVa != 0, SALE_PRICE_ADJ_NA / AssessedVa, NA),
            
            adj_gt10 = ifelse(SALE_PRICE > 10, SALE_PRICE_ADJ_NA, NA),
            pct_gt10 = ifelse(SALE_PRICE > 10, SALE_PCT_ofval_NA, NA),
            
            SALE_PRICE_ADJ_rank = ifelse(SALE_PRICE > 10,
                                         ntile(adj_gt10,10) %>% as.character(),
                                         "$10 OR LESS"),
            SALE_PRICE_ADJ_rank = ifelse(is.na(SALE_PRICE_ADJ_rank),
                                         "!UNKNOWN",SALE_PRICE_ADJ_rank),
            
            SALE_PCT_rank       = ifelse(SALE_PRICE > 10,
                                         ntile(pct_gt10,10) %>% as.character(),
                                         "$10 OR LESS"),
            SALE_PCT_rank       = ifelse(is.na(SALE_PCT_rank),
                                         "!UNKNOWN",SALE_PCT_rank)) %>%
  select(-adj_gt10,-pct_gt10)
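The decile-style ranks above come from dplyr's ntile(), which splits a numeric vector into n roughly equal-sized buckets. A minimal illustration with toy prices:

```r
library(dplyr)

# ntile() assigns bucket ranks 1..n in order of the values
prices <- c(100, 200, 300, 400, 500)
ranks  <- ntile(prices, 5)

ranks   # 1 2 3 4 5
```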

# TO DO: DIST to sales LOW (rank 1,2,3) and VERY LOW (rank 1,2)

sell_home <- function(y,prefix,suffix = NULL) run_chunks(split_p,
                                                  y = y,
                                                  prefix = prefix,
                                                  suffix = suffix)

#p_sale <- sell_home(y = s %>% filter(SALE_PCT_rank == "$10 OR LESS"),
#                    "SALE_PCT_","_$10orLESS")
#rm(p_sale)


p_sale <- cbind(sell_home(y = s %>% filter(SALE_PCT_rank == "$10 OR LESS"),
                          "SALE_PCT_","_$10orLESS"),
                sell_home(y = s %>% filter(SALE_PCT_rank %in% c("1","2")),
                          "SALE_PCT_","_VERYLOW"),
                sell_home(y = s %>% filter(SALE_PCT_rank %in% c("1","2","3")),
                          "SALE_PCT_","_LOW"),
                sell_home(y = s %>% filter(SALE_PCT_rank %in% c("8","9","10")),
                          "SALE_PCT_","_HIGH"),
                sell_home(y = s %>% filter(SALE_PCT_rank %in% c("9","10")),
                          "SALE_PCT_","_VERYHIGH"),
                sell_home(y = s %>% filter(SALE_PRICE_ADJ_rank %in% c("1","2")),
                          "SALE_ADJ_","_VERYLOW"),
                sell_home(y = s %>% filter(SALE_PRICE_ADJ_rank %in% c("1","2","3")),
                          "SALE_ADJ_","_LOW"),
                sell_home(y = s %>% filter(SALE_PRICE_ADJ_rank %in% c("8","9","10")),
                          "SALE_ADJ_","_HIGH"),
                sell_home(y = s %>% filter(SALE_PRICE_ADJ_rank %in% c("9","10")),
                          "SALE_ADJ_","_VERYHIGH"))


##Crime January 2013 - December 2018

#Call in Crime Files and format long to wide
crime <- bind_rows(st_read("https://opendata.arcgis.com/datasets/0583c4cbea2d4edf9f13e8dcbe21eefa_0.geojson",
                           stringsAsFactors = FALSE),
                   st_read("https://opendata.arcgis.com/datasets/b80f28e504fc47148bd1a078955f44b1_0.geojson",
                           stringsAsFactors = FALSE),
                   st_read("https://opendata.arcgis.com/datasets/3fb76f5173474148ab4f0e10e7a2aae3_0.geojson",
                           stringsAsFactors = FALSE),
                   st_read("https://opendata.arcgis.com/datasets/01741069abbb4792aea378fe6f35d09b_0.geojson",
                           stringsAsFactors = FALSE)) %>% as_tibble %>% select(-geometry) %>%
  mutate(ADDRESS = gsub(" ","",ADDRESS))

streets <- st_read("https://opendata.arcgis.com/datasets/47cadd4fe5b344f895c5a3e672463899_0.geojson") %>% 
  transmute(ADDRESS = gsub(" ","",BLKSTREET),
            geometry) %>%
  st_transform(2261)

#Tabular join crimes to street centerlines
cr <- crime %>%
  group_by(ADDRESS,CODE_DEFINED) %>%
  summarize(Freq = n()) %>% ungroup %>%
  spread(CODE_DEFINED,Freq,fill = 0) %>% 
  mutate_all(~replace(., is.na(.), 0)) %>%
  set_colnames(colnames(.) %>% 
                 make.names(unique=TRUE) %>% 
                 str_replace_all("\\.","_")) %>%
  mutate(ANY = rowSums(select(.,-ADDRESS), na.rm = TRUE)) %>%
  transmute(ADDRESS, ANY,
            ASSAULT = AGGRAVATED_ASSAULT + SIMPLE_ASSAULT + OFFN_AGAINST_FAMILY,
            THEFT   = LARCENY + BURGLARY + MV_THEFT + ROBBERY + STOLEN_PROPERTY, 
            ARSON, 
            MURDER,
            WHITE_COL=EXTORTION + FORGERY_COUNTERFEIT + FRAUD,
            SEX_OFF = SEX_OFFENSES + PROSTITUTION_PATRON_PROMOTING,
            DRUG_OFF= POSSESSION_MARIJUANA + POSSESSION_USE_DRUG +
              SALE_MANUFACTURE_MARIJUANA + SALE_MANUFACTURE_CONTROLLED_SUBSTANCE,
            LOITER  = DISORDERLY_CONDUCT + LOITERING,
            VANDAL  = CRIMINAL_MISCHIEF,
            WEAPON  = POSSESSION_USE_DANGEROUS_WEAPONS)
c <- streets %>% inner_join(cr, by = "ADDRESS")
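The long-to-wide step above turns one row per (block, offense) into one column per offense. A toy version with made-up values:

```r
library(dplyr)
library(tidyr)

# Toy crime log: long format, one row per recorded offense
toy <- tibble(ADDRESS      = c("100BLKMAINST", "100BLKMAINST", "200BLKOAKST"),
              CODE_DEFINED = c("LARCENY", "ARSON", "LARCENY"))

# Count per block and offense, then spread to one column per offense
wide <- toy %>%
  count(ADDRESS, CODE_DEFINED, name = "Freq") %>%
  spread(CODE_DEFINED, Freq, fill = 0)

wide
# one row per block, one column per offense, zeros where none occurred
```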

do_crime <- function(y,suffix = NULL) run_chunks(split_p,
                                                 y = y,
                                                 prefix = "crime_",
                                                 suffix = suffix)

p_crime <- cbind(do_crime(c %>% filter(ANY > 6),
                          suffix = "_ANY"),
                 do_crime(c %>% filter(ASSAULT > 2),
                          suffix = "_ASSAULT"),
                 do_crime(c %>% filter(THEFT > 1),
                          suffix = "_THEFT"),
                 do_crime(c %>% filter(ARSON > 0),
                          suffix = "_ARSON"),
                 do_crime(c %>% filter(MURDER > 0),
                          suffix = "_MURDER"),
                 do_crime(c %>% filter(WHITE_COL > 0),
                          suffix = "_WHITE_COL"),
                 do_crime(c %>% filter(SEX_OFF > 0),
                          suffix = "_SEX_OFF"),
                 do_crime(c %>% filter(DRUG_OFF > 1),
                          suffix = "_DRUG_OFF"),
                 do_crime(c %>% filter(LOITER > 0),
                          suffix = "_LOITER"),
                 do_crime(c %>% filter(VANDAL > 1),
                          suffix = "_VANDAL"),
                 do_crime(c %>% filter(WEAPON > 0),
                          suffix = "_WEAPON")) 

#Type of Water Tap

#Call in water tap data and keep only domestic water taps
water_tap <- st_read("https://opendata.arcgis.com/datasets/e1deb6e9e4b74071af272982d8f9994e_0.geojson",
                     stringsAsFactors = FALSE)

w <- water_tap %>% 
  filter(STYP=="DOMESTIC") %>% 
  select(PTYPE,SERV_INSTALL) %>% 
  st_transform(2261) 

# Spatially join each property polygon to the water tap points that fall within it
p_tap <- p18Jul %>% 
  st_join(w) %>%
  as_tibble %>% select(-geometry) %>%
  mutate(PTYPE = gsub(" ","",PTYPE)) %>%
  group_by(SBL,PTYPE) %>%
  summarize(freq = n()) %>% ungroup() %>%
  spread(PTYPE, freq, fill = 0) %>% 
  mutate_all(~replace(., is.na(.), 0)) %>%
  select(CASTIRON,COPPER,DUCTILE,GAL.IRON,LEAD,OTHER) %>%
  mutate("has_castIron" = ifelse(CASTIRON > 0,TRUE,FALSE),
         "has_copper"   = ifelse(COPPER   > 0,TRUE,FALSE),
         "has_ductile"  = ifelse(DUCTILE  > 0,TRUE,FALSE),
         "has_gal.iron" = ifelse(GAL.IRON > 0,TRUE,FALSE),
         "has_lead"     = ifelse(LEAD     > 0,TRUE,FALSE),
         "has_other"    = ifelse(OTHER    > 0,TRUE,FALSE)) %>%
  rename_all(~paste0("tap_",.))


#Building Permits 2013 - 2018

permits <- st_read("https://opendata.arcgis.com/datasets/3f68d80ac391452fbbc1d5086f8fd9bb_0.geojson",
                   stringsAsFactors = FALSE)

pm <- permits %>% as_tibble %>%
    filter(lubridate::year(application_date) <= 2018) %>%
    select(SBL = identifier,
           permit_type_description, 
           project_valuation) %>%
    group_by(SBL,permit_type_description) %>%
    summarize(freq = n()) %>%
    spread(permit_type_description,freq,fill = 0) %>% 
    mutate_all(~replace(., is.na(.), 0)) %>%
    transmute(pm_BLOCK_PARTY      = `Block Party (Business District Street Closing)` +
                `Block Party (Residential Street Closing)`,
              pm_RENOVATION       =`Commercial Renovation / Remodel` + `Residentail Remodel`, # "Residentail" is spelled this way in the source data
              pm_NEW_CONSTRUCTION =`New 1-2 Family Home` + `New Commercial Construction`,
              pm_has_BLOCK_PARTY  = ifelse(pm_BLOCK_PARTY > 0, TRUE,FALSE),
              pm_has_RENOVATION   = ifelse(pm_RENOVATION > 0, TRUE,FALSE),
              pm_has_NEW_CONST    = ifelse(pm_NEW_CONSTRUCTION > 0, TRUE,FALSE))
p_pm <- p %>% 
  select(SBL) %>%
  left_join(pm) %>% 
  select(-SBL) %>% 
  mutate_at(vars(-geometry),~replace(., is.na(.), FALSE))

file_permit <- function(y,suffix = NULL) run_chunks(split_p,
                                                    y = y,
                                                    prefix = "pm_",
                                                    suffix = suffix)

p_permits <- cbind(p_pm %>% as_tibble %>% select(-geometry),
                   file_permit(y = p_pm %>% filter(pm_has_NEW_CONST),
                               suffix = "_NEW_CONST"),
                   file_permit(y = p_pm %>% filter(pm_has_RENOVATION),
                               suffix = "_RENOVATION"))

indep_svcs <- p18Jul %>%
  as.data.frame %>%
  select(SBL) %>%
  cbind(p_fire,
        p_sale,
        p_crime,
        p_tap,
        p_permits)

save(indep_svcs,file = "indep_svcs.RData")
#load("indep_svcs.RData")
indep_svcs %>% summarize_all(~sum(is.finite(.) | is.character(.))) %>% t %>% View





Independent Variables: Parcel Data

library(tidyverse)
library(sf)
library(magrittr)

# Parcels July 2018
# https://data.syrgov.net/datasets/1b346804e1364a5eb85ccb53302e3c91_0.geojson
parcels2018Jul <- read_sf("https://opendata.arcgis.com/datasets/1b346804e1364a5eb85ccb53302e3c91_0.geojson",
                          stringsAsFactors = FALSE)

options(tigris_class = "sf")
ZIP <- tigris::zctas(cb = TRUE,
                     starts_with = "132")
USA <- c("AE","AK","AL","AR","AZ","CA","CO","CT","DC","DE","FL","GA","HI","IA","ID","IL","IN","KS","KY","LA","MA","MD","ME","MI","MN","MO","MS","MT","NC","ND","NE","NH","NJ","NM","NV","NY","OH","OK","OR","PA","PR","RI","SC","SD","TN","TX","UT","VA","VI","VT","WA","WI","WV","WY")

Jul <- parcels2018Jul %>%
  rename(SBL = PRINTKEY) %>%
  mutate_at(vars(Owner:Add3),
            str_replace_all,
            pattern     = "[[:punct:]]",
            replacement = "") %>%
  st_join(ZIP %>% 
            select(ZCTA5CE10,geometry) %>% 
            st_transform(4326),
          largest = TRUE) %>%
  as.data.frame

p_Jul <- Jul %>%
  transmute(SBL,
            size_FRONT_ft = FRONTFEET,
            size_DEPTH_ft = DEPTH,
            size_ACRES    = ACRES,
            
            owner_onsite  = ifelse(FullAdd == Add3,TRUE,FALSE),
            owner_inzip   = ifelse(ZIP == ZCTA5CE10,TRUE,FALSE),
            owner_inzip   = ifelse(is.na(owner_inzip),FALSE,owner_inzip),
            owner_syr     = agrepl("SYRACUSE",Add4,
                                   ignore.case = TRUE),
            owner_state   = Add4 %>% 
                              trimws %>%
                              str_replace_all("[[:punct:]]","") %>%
                              str_sub(-2,-1),
            owner_state   = ifelse(owner_state %in% USA, 
                                   owner_state,
                                   "International"),
            owner_ny      = ifelse(owner_state == "NY",
                                   TRUE, FALSE),
            owner_intl    = ifelse(owner_state == "International",
                                   TRUE, FALSE),
            
            nbhd_sura     = ifelse(SURA_NRSA == "Y",TRUE,FALSE), # neighborhood revitalization area
            nbhd          = Nhood %>% trimws,
            nbhd_downtown = ifelse(nbhd == "Downtown",TRUE,FALSE),
            nbhd_lakefront= ifelse(nbhd == "Lakefront",TRUE,FALSE),
            nbhd_univhill = ifelse(nbhd == "University Hill",TRUE,FALSE),
            nbhd_quad     = Quad,
            nbhd_TNT      = TNT_NAME,  # tomorrow's neighborhood today
            nbhd_special  = ifelse(Special_Nh == "SND",TRUE,FALSE),
            
            nh_assess     = Assessment %>% as.character,
            nh_tract      = ifelse(CensusTrac == "Null","",CensusTrac),
            nh_council    = CC_Dist,
            nh_legis      = COUNTY_LEG,
            nh_WARD       = WARD,
            
            viol_ct       = NumOpenVio, # number of violations on the super-parcel
            viol_has      = ifelse(NumOpenVio >= 1,  TRUE,FALSE),
            
            condition     = Condition %>% trimws, # includes blanks
            cond_num_NA   = NA,                          # doesn't include blanks
            cond_num_NA   = ifelse(condition == "1-Worst",1,cond_num_NA),
            cond_num_NA   = ifelse(condition == "2",      2,cond_num_NA),
            cond_num_NA   = ifelse(condition == "3",      3,cond_num_NA),
            cond_num_NA   = ifelse(condition == "4",      4,cond_num_NA),
            cond_num_NA   = ifelse(condition == "5-Best", 5,cond_num_NA),
            
            cond_vacant   = ifelse(VacantBuil == "Y",TRUE,FALSE),
            cond_notvacant= ifelse(VacantBuil == "N",TRUE,FALSE),
            cond_vacdate_NA=ifelse(DVDATE == 0,NA,DVDATE),
            
            occupancy     = Occupancy %>% trimws,
            occup_vacant  = ifelse(substr(occupancy,1,6) == "Vacant",TRUE,FALSE),
            occup_vacres  = ifelse(occupancy == "Vacant Residential",TRUE,FALSE),
            occup_owner   = ifelse(occupancy == "Owner Occupied",    TRUE,FALSE),
            occup_rental  = ifelse(occupancy == "Rental Occupied",   TRUE,FALSE),
            occup_com     = ifelse(occupancy == "Occupied Commercial",TRUE,FALSE),
            
            tax_delinq_amt= AmntDelinq,
            tax_delinq_has= ifelse(AmntDelinq > 0,   TRUE,FALSE),
            tax_redem1_amt= RedemptAmn,
            tax_redem1_has= ifelse(RedemptAmn > 0,   TRUE,FALSE),
            tax_seizable  = ifelse(Seizable == "Y",  TRUE,FALSE),
            tax_val       = CityTaxabl,
            tax_val_stars = STARS,
            tax_val_starc = STARC,
            tax_star      = ifelse(STAR == "Y",      TRUE,FALSE),
            
            water_svc     = WaterServi,
            water_inactive= ifelse(WaterServi == "I",TRUE,FALSE),
            water_overdue = OverdueWat,
            water_over_has= ifelse(OverdueWat > 0,   TRUE,FALSE),
            
            tax_bankrupt  = ifelse(Bankruptcy == "Y",TRUE,FALSE),
            tax_trust     = ifelse(TaxTrust   == "Y",TRUE,FALSE),
            tax_senior    = ifelse(SENIOR_EXE == "Y",TRUE,FALSE),
            tax_veteran   = ifelse(VET_EXEMPT == "Y",TRUE,FALSE),
            tax_redem2_amt= Redemption,
            tax_redem2_has= ifelse(Redemption > 0,   TRUE,FALSE),
            
            tax_phase     = Phase %>% trimws,
            tax_phase_NA  = match(tax_phase,  # "Phase I".."Phase XV" -> 1..15, NA otherwise
                                  paste("Phase", as.character(utils::as.roman(1:15)))),
            
            nh_zoneclass  = ZONEA,
            nh_zone       = str_extract(DIST,"[^ |^:]+"),
            
            cond_year_NA = ifelse(YearBuilt != "0000", YearBuilt %>% as.numeric, NA),
            cond_decade  = ifelse(is.na(cond_year_NA),"!UNKNOWN",
                                  as.integer(cond_year_NA/10) %>% paste0("0s")),
            cond_1860_1900=ifelse(cond_year_NA %in% c(1860:1909),
                                  TRUE,FALSE),
            
            landuse      = LandUse %>% trimws,
            landuse_apt  = ifelse(landuse == "Apartment",    TRUE,FALSE),
            landuse_res1 = ifelse(landuse == "Single Family",TRUE,FALSE),
            landuse_res2 = ifelse(landuse == "Two Family",   TRUE,FALSE),
            landuse_res3 = ifelse(landuse == "Three Family", TRUE,FALSE),
            landuse_res4 = ifelse(landuse == "Multiple Residence",TRUE,FALSE),
            landuse_comm = ifelse(landuse == "Commercial",   TRUE,FALSE),
            landuse_park = ifelse(landuse == "Parks",        TRUE,FALSE),
            
            size_units_NA= ifelse(Units > 0, Units, NA),
            size_units   = Units,
            size_units   = ifelse(Units == 0,               "0 units",size_units),
            size_units   = ifelse(Units == 1,               "1 unit", size_units),
            size_units   = ifelse(Units %in% c(2:5),      "2-5 units",size_units),
            size_units   = ifelse(Units %in% c(6:20),    "6-20 units",size_units),
            size_units   = ifelse(Units %in% c(21:100),"21-100 units",size_units),
            size_units   = ifelse(Units > 100,           ">100 units",size_units),
            size_units_gt20=ifelse(Units > 20,TRUE,FALSE),
            
            tax_val_land = AssessedLa,
            tax_val_assess=AssessedVa) %>%
  mutate_all(list(~ifelse(. == "","!UNKNOWN",.))) %>%
  rename_all(function(.){paste0("p_",.)}) %>%
  rename(SBL = 1)
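The cond_decade label above is built by integer division of the year. A worked example:

```r
library(magrittr)

# Truncate the year to its decade, then append "0s"
year   <- 1987
decade <- as.integer(year / 10) %>% paste0("0s")

decade   # "1980s"
```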

test_Jul <- cbind(Num_values = apply(p_Jul, 2, function(col) length(unique(col))),
                  Num_NAs    = apply(p_Jul, 2, function(col) sum(is.na(col) | is.nan(col) | is.infinite(col))),
                  Num_Unknown= apply(p_Jul, 2, function(col) sum(col == "!UNKNOWN")))

# 311 Calls
calls311 <- readxl::read_excel("../data/UPenn MUSA/Cityline Complaints (311).xlsx") %>%
  mutate(complaint_number = uniqtag::make_unique(.$complaint_number))
  # status "issued" refers to Certificate of Use, i.e. an issued certificate means there are no problems

b_calls311 <- calls311 %>%
  filter(status_type_name != "Void" & status_type_name != "Issued") %>%
  mutate(status_type_name = str_remove(status_type_name,"^[x |v ]") %>% trimws,
         calls = ifelse(status_type_name %in% c("Completed",
                                                "Denied - OpenViolations",
                                                "In Review - Violations",
                                                "Open",
                                                "Open; Needs DPW Boardup",
                                                "Post Judgement",
                                                "Referred to Law",
                                                "Revoked",
                                                "Violations"), 
                        "cvio","coth"),
         dept  = str_remove_all(department_name,"[^[:alnum:]]")) %>%
  
  # count each type of call for each dept and parcel_id
  group_by(parcel_id,dept,calls) %>% 
  count %>% ungroup %>%
  
  # get count of ALL calls, drop count of "other"
  spread(calls,n) %>%
  group_by(parcel_id,dept) %>%
  mutate(call = sum(coth,cvio,na.rm = TRUE)) %>%
  select(-coth) %>%
  gather(calls,n,c("cvio","call"), -parcel_id, -dept) %>%
  
  # filter by depts
  # TO DO: review this part with scott
  # TO DO: group by interior (lead, fire, etc), exterior violations
  spread(dept,n) %>%
  group_by(parcel_id,calls) %>%
  mutate(any    = sum(CertificateofUse,Codes,
                      Collections,Demolition,
                      DPWBuildingServices,DPWCurbing,
                      DPWISR,DPWSanitation,
                      DPWSewers,DPWSidewalk,
                      DPWStreetCleaning,DPWStreetRepair,
                      DPWTransportation,FirePreventionBureau,
                      Law,NBDCaseMgmt,
                      OnondagaCountyHealth,ParksRecDogControl,
                      ParksRecForestry,ParksRecGrounds,
                      Police,Unassigned,
                      VacantProperties,Water,
                      na.rm = TRUE),
         trash  = sum(Codes,
                      DPWBuildingServices,
                      DPWISR,
                      DPWSanitation,
                      na.rm = TRUE),
         bldg   = sum(FirePreventionBureau,
                      Law,
                      NBDCaseMgmt,
                      OnondagaCountyHealth,
                      na.rm = TRUE),
         abandon= sum(Police,
                      VacantProperties,
                      na.rm = TRUE),
         DPW    = sum(DPWBuildingServices,
                      DPWCurbing,
                      DPWISR,
                      DPWSanitation,
                      DPWSewers,
                      DPWSidewalk,
                      DPWStreetCleaning,
                      DPWStreetRepair,
                      DPWTransportation,
                      na.rm = TRUE),
         parksrec=sum(ParksRecDogControl,
                      ParksRecForestry,
                      ParksRecGrounds,
                      na.rm = TRUE),
         public = sum(DPWCurbing,
                      DPWSewers,
                      DPWSidewalk,
                      DPWStreetRepair,
                      DPWTransportation,
                      ParksRecForestry,
                      ParksRecGrounds,
                      Water,
                      na.rm = TRUE)) %>%
  ungroup() %>%
  gather(dept,n,-parcel_id,-calls) %>%
  
  # prepare for aggregate
  unite(col,calls,dept) %>%
  spread(col,n)

mapping <- readxl::read_excel("../data/UPenn MUSA/Parcel Mapping.xlsx") %>%
  select(SBL = identifier, parcel_id)

p_calls311 <- b_calls311 %>%
  right_join(mapping, by = "parcel_id") %>%
  select(-parcel_id) %>%
  group_by(SBL) %>%
  summarize_all(list(sum = ~sum(.),
                     any = ~any(.))) %>%
  filter(!is.na(SBL)) %>%
  replace(is.na(.),FALSE) # coerces to 0 for numeric columns
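The sum/any roll-up to the lot level can be sketched on toy data (made-up IDs and counts; here the any-indicator is written explicitly as a greater-than-zero test):

```r
library(dplyr)

# Toy parcel-level 311 counts for two lots
toy <- tibble(SBL  = c("A", "A", "B"),
              call = c(2, 0, 1),
              cvio = c(1, 0, 0))

# One row per lot: a total and an any-indicator for each call column
rolled <- toy %>%
  group_by(SBL) %>%
  summarize_all(list(sum = ~sum(.), any = ~any(. > 0)))

rolled
# lot A: call_sum 2, cvio_any TRUE; lot B: call_sum 1, cvio_any FALSE
```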

# Res and Com bldgs
bldg_com <- readxl::read_excel("../data/UPenn MUSA/ComBuilding.xlsx") %>%
  rename(SBL = PRINT_KEY)
bldg_res <- readxl::read_excel("../data/UPenn MUSA/ResBuilding.xlsx") %>%
  rename(SBL = PRINT_KEY)

com <- bldg_com %>% 
  transmute(SBL,
            com_has            = TRUE,
            com_perim_ft       = BLDG_PERIMETER,
            com_grossflr_sqft  = GROSS_FLOOR_AREA,
            bldg_sqft          = GROSS_FLOOR_AREA,
            com_story_height   = STORY_HEIGHT,
            com_elev_nbr       = NBR_ELEV,
            com_elev_has       = ifelse(NBR_ELEV > 0,TRUE,FALSE),
            com_bsmt_perim_ft  = BSMNT_PERIMETER,
            com_bsmt_sqft      = BSMNT_SQFT,
            com_bsmt_has       = ifelse(BSMNT_SQFT > 0,TRUE,FALSE),
            bldg_stories       = NBR_STORIES,
            bldg_multistory    = ifelse(NBR_STORIES > 1,TRUE,FALSE),
            bldg_class         = PROP_CLASS_DESC %>% as.character)

res <- bldg_res %>%
  transmute(SBL,
            res_has            = TRUE,
            res_nbrkitch       = NBR_KITCHENS,
            res_nbrbed         = NBR_BEDROOMS,
            res_nbrbed_lt3     = ifelse(NBR_BEDROOMS < 3,TRUE,FALSE),
            res_nbrbed_3       = ifelse(NBR_BEDROOMS == 3,TRUE,FALSE),
            res_nbrbed_gt3     = ifelse(NBR_BEDROOMS > 3,TRUE,FALSE),
            res_nbrbed_gt4     = ifelse(NBR_BEDROOMS > 4,TRUE,FALSE),
            res_nbrbath        = NBR_FULL_BATHS,
            res_nbrbath_1      = ifelse(NBR_FULL_BATHS == 1,TRUE,FALSE),
            res_nbrbath_2      = ifelse(NBR_FULL_BATHS == 2,TRUE,FALSE),
            res_nbrbath_3      = ifelse(NBR_FULL_BATHS >= 3,TRUE,FALSE),
            res_nbrfire        = NBR_FIREPLACES,
            res_nbrfire_1      = ifelse(NBR_FIREPLACES > 0,TRUE,FALSE),
            res_centralair     = ifelse(CENTRAL_AIR > 0, TRUE,FALSE),
            res_sqft           = SFLA,
            bldg_sqft          = SFLA,
            res_yearbuilt      = YR_BUILT,
            res_stories        = NBR_STORIES,
            bldg_stories       = NBR_STORIES,
            bldg_multistory    = ifelse(NBR_STORIES > 1,TRUE,FALSE),
            res_fuel           = FUEL_TYPE_DESC,
            res_fuel_gas       = ifelse(FUEL_TYPE_DESC == "Natural Gas",TRUE,FALSE),
            res_fuel_elec      = ifelse(FUEL_TYPE_DESC == "Electric",TRUE,FALSE),
            res_fuel_oil       = ifelse(FUEL_TYPE_DESC == "Oil",TRUE,FALSE),
            res_fuel_burn      = ifelse(FUEL_TYPE_DESC %in% c("Coal","Oil","Wood"),TRUE,FALSE),
            res_fuel_none      = ifelse(FUEL_TYPE_DESC == "None",TRUE,FALSE),
            res_heat           = HEAT_TYPE_DESC,
            res_heat_hotair    = ifelse(HEAT_TYPE_DESC == "Hot air",TRUE,FALSE),
            res_heat_steam     = ifelse(HEAT_TYPE_DESC == "Hot wtr/stm",TRUE,FALSE),
            res_heat_electric  = ifelse(HEAT_TYPE_DESC == "Electric",TRUE,FALSE),
            res_heat_none      = ifelse(HEAT_TYPE_DESC == "No central",TRUE,FALSE),
            res_sqft_first     = FIRST_STORY,
            res_sqft_second    = SECOND_STORY,
            res_sqft_addl      = ADDL_STORY,
            res_sqft_half      = HALF_STORY,
            res_sqft_threeqtr  = THREE_QTR_STORY,
            res_story_second   = ifelse(SECOND_STORY > 0, TRUE, FALSE),
            res_story_addl     = ifelse(ADDL_STORY > 0, TRUE, FALSE),
            res_story_half     = ifelse(HALF_STORY > 0, TRUE, FALSE),
            res_story_threeqtr = ifelse(THREE_QTR_STORY > 0, TRUE, FALSE),
            bldg_class         = PROP_CLASS_DESC %>% as.character)

p_bldg <- bind_rows(com,res) %>% 
  group_by(SBL) %>%
  mutate_if(is.numeric,list(max  = ~max(., na.rm = TRUE),
                            min  = ~min(., na.rm = TRUE),
                            mean = ~mean(., na.rm = TRUE))) %>%
  mutate_if(is.logical,list(sum  = ~sum(., na.rm = TRUE),
                            any  = ~any(., na.rm = TRUE),
                            all  = ~all(., na.rm = TRUE),
                            mean = ~mean(., na.rm = TRUE))) %>%
  select(-(com_has:res_story_threeqtr)) %>%
  ungroup %>% unique %>% 
  right_join(Jul %>% select(SBL), by = "SBL") %>%
  mutate_all(list(~replace(.,!is.finite(.) & !is.character(.),FALSE))) # numeric columns are coerced to 0
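
The final replace() call leans on R's coercion rules: the FALSE fill becomes 0 in numeric columns and stays FALSE in logical ones. A minimal illustration:

```r
# FALSE is coerced to the column's type: 0 for numerics, FALSE for logicals
replace(c(1, NA, Inf), !is.finite(c(1, NA, Inf)), FALSE)  # 1 0 0
replace(c(TRUE, NA),   !is.finite(c(TRUE, NA)),   FALSE)  # TRUE FALSE
```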

#sales

sales <- readxl::read_xlsx("../data/UPenn MUSA/sales.xlsx")
s <- sales %>% 
  transmute(SALE_PRICE,
            SBL = PRINT_KEY,
            year = lubridate::year(DEED_DATE)) %>%
  arrange(SBL,desc(year)) %>% 
  filter(!duplicated(SBL))

#Adjust sale prices for inflation
##Call in housing price index data from New York state and set to 2018 values
hpi <- Quandl::Quandl("FMAC/HPI", order = 'asc') %>%
  mutate(year = lubridate::year(Date)) %>%
  group_by(year) %>%
  summarize(hpi = mean(NY)) %>% 
  rbind(c(2018,174.06169),
        c(1956,23.63157),
        c(1969,23.63157),
        c(1971,23.63157),
        c(1972,23.63157),
        c(1973,23.63157),
        c(1974,23.63157))
hpi2018 <- hpi %>% filter(year == 2018) %>% pull(hpi)
hpi$adj_factor <- (hpi2018/hpi$hpi)
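
Each sale is restated in 2018 dollars by multiplying it by the ratio of the 2018 index to the index in the sale year. A worked example (the index value for the sale year is illustrative):

```r
# a $50,000 sale in a year with HPI 87.03 is restated against the 2018 HPI of 174.06
adj_factor <- 174.06 / 87.03  # = 2
50000 * adj_factor            # = 100000, i.e. $100,000 in 2018 dollars
```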

p_sales <- Jul %>% 
  select(SBL, AssessedVa) %>%
  left_join(s) %>%
  left_join(hpi) %>%
  transmute(SBL, 
            SALE_PRICE_NA = SALE_PRICE,
            SALE_PRICE_ADJ_NA = ifelse(is.na(SALE_PRICE), NA, SALE_PRICE*adj_factor),
            SALE_PCT_ofval_NA = ifelse(AssessedVa != 0, SALE_PRICE_ADJ_NA / AssessedVa, NA),
            
            adj_gt10 = ifelse(SALE_PRICE > 10, SALE_PRICE_ADJ_NA, NA),
            pct_gt10 = ifelse(SALE_PRICE > 10, SALE_PCT_ofval_NA, NA),
            
            SALE_PRICE_ADJ_rank = ifelse(SALE_PRICE > 10,
                                         ntile(adj_gt10,10) %>% as.character(),
                                         "$10 OR LESS"),
            SALE_PRICE_ADJ_rank = ifelse(is.na(SALE_PRICE_ADJ_rank),
                                         "!UNKNOWN",SALE_PRICE_ADJ_rank),
            
            SALE_PCT_rank       = ifelse(SALE_PRICE > 10,
                                         ntile(pct_gt10,10) %>% as.character(),
                                         "$10 OR LESS"),
            SALE_PCT_rank       = ifelse(is.na(SALE_PCT_rank),
                                         "!UNKNOWN",SALE_PCT_rank)) %>%
  select(-adj_gt10,-pct_gt10)

# use rank (deciles)
ggplot(p_sales) + 
  geom_histogram(aes(x = SALE_PRICE_NA)) +
  scale_x_log10(breaks = 10^seq(0,8,1))
ggplot(p_sales %>% filter(SALE_PRICE_NA > 10)) + 
  geom_histogram(aes(x = SALE_PRICE_ADJ_NA)) +
  scale_x_log10(breaks = 10^seq(0,8,1))
ggplot(p_sales %>% filter(SALE_PRICE_NA > 10)) + 
  geom_histogram(aes(x = SALE_PCT_ofval_NA)) +
  scale_x_log10(breaks = 10^seq(-8,4,1))


# OWNERS WITH MOST VIOLATIONS AND MOST PARCELS

load("dep_viol.RData")

Jul_own <- Jul %>% 
  colnames %>% 
  extract(17:21) %>%
  lapply(function(col) {
    out <- Jul %>% 
      left_join(dep_viol, by = "SBL") %>%
      group_by_at(col) %>%
      summarize(Count = n(),
                Area  = sum(as.numeric(ACRES),      na.rm = TRUE),
                VLand = sum(as.numeric(AssessedLa), na.rm = TRUE),
                Value = sum(as.numeric(AssessedVa), na.rm = TRUE),
                has_viol   = sum(has_viol,      na.rm = TRUE),
                has_HEALTH = sum(has_HEALTH,    na.rm = TRUE),
                has_SAFETY = sum(has_SAFETY,    na.rm = TRUE),
                has_HS     = sum(has_HS,        na.rm = TRUE)) %>%
      arrange(desc(Count))
  })

owners <- Jul_own %>% lapply(function(own) {
  cols  <- colnames(own)[2:9]
  cols %>% lapply(function(col) {
    own %>% 
      arrange(desc(.data[[col]])) %>% 
      extract(1:10,1) %>%
      return
  }) %>% 
    bind_rows %>% 
    distinct %>%
    filter_at(1, all_vars(. != " ")) %>%
    extract2(1) %>%
    return
})

p_own <- Jul %>% select(SBL,17:21) %>%
  mutate(Owner  = ifelse(Owner  %in% owners[[1]],Owner, "!OTHER"),
         Owner2 = ifelse(Owner2 %in% owners[[2]],Owner2,"!OTHER"),
         Add1   = ifelse(Add1   %in% owners[[3]],Add1,  "!OTHER"),
         Add2   = ifelse(Add2   %in% owners[[4]],Add2,  "!OTHER"),
         Add3   = ifelse(Add3   %in% owners[[5]],Add3,  "!OTHER")) %>%
  rename_all(function(.){paste0("p_own_",.)}) %>%
  rename(SBL = 1)

#JOIN

indep_files <- Jul %>%
  as.data.frame %>%
  select(SBL) %>%
  left_join(p_bldg,     by = "SBL") %>%
  left_join(p_calls311, by = "SBL") %>%
  left_join(p_Jul,      by = "SBL") %>%
  left_join(p_sales,    by = "SBL") %>%
  left_join(p_own,      by = "SBL")

save(indep_files,file = "indep_files.RData")
save.image(file = "indep_files_working.RData")
#load("indep_files_working.RData")
#load("indep_files.RData")
indep_files %>% summarize_all(~sum(is.finite(.) | is.character(.))) %>% t %>% View





Independent Variables: Spatial Data

library(tidyverse)
library(sf)
library(magrittr)
library(osmdata)

# Parcels July 2018
# https://data.syrgov.net/datasets/1b346804e1364a5eb85ccb53302e3c91_0.geojson
parcels2018Jul <- read_sf("https://opendata.arcgis.com/datasets/1b346804e1364a5eb85ccb53302e3c91_0.geojson",
                          stringsAsFactors = FALSE)
p18Jul <- parcels2018Jul %>%
  select(SBL = PRINTKEY, geometry) %>% 
  st_transform(2261)
p <- p18Jul %>% st_centroid

get_osm_sf <- function(bbox,key,value) {
  if(require(dplyr,quietly = TRUE) & require(osmdata,quietly = TRUE)) {
    get <- opq(bbox = bbox) %>%
      add_osm_feature(key = key,
                      value = value) %>%
      osmdata_sf(stringsAsFactors = FALSE) %>%
      lapply(function(elem) {
        if("sf" %in% class(elem)) {
          elem %>% 
            st_set_crs(4326) %>%
            st_transform(2261) %>%
            return
        } else {return(elem)}
      })
    return(get)
  } else {
    warning("Required packages 'dplyr' and 'osmdata' are not installed.")
  }
}
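
get_osm_sf() returns the usual osmdata list (osm_points, osm_lines, osm_polygons, ...) with every sf element reprojected to EPSG:2261. A typical call looks like the sketch below; the key/value pair is illustrative, and the query requires a network connection to the Overpass API:

```r
parks <- get_osm_sf(bbox = bb, key = "leisure", value = "park")
parks$osm_polygons  # sf polygons, already in EPSG:2261 (NY State Plane Central, ft)
```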
bb <- getbb("Syracuse, NY")
bb["y","min"] = bb["y","min"] - 0.0003
#bb <- parcels2018Jul %>% st_bbox

print_pdf <- function(sf,names) {
  if(require(ggplot2,quietly = TRUE)) {
    l <- lapply(c(1:nrow(sf)),function(i){
      ggplot() +
        geom_sf(data = sf,
                fill = "black", 
                color = "black") + 
        geom_sf(data = sf[i,],
                fill = "red",
                color = "red") +
        ggtitle(sf[i,names] %>%
                  sapply(as.character) %>%
                  paste(collapse = " "))
    })
    
    pdf(paste0(deparse(substitute(sf)),".pdf"))
    invisible(lapply(l,print))
    dev.off()
  } else {
    warning("Required package 'ggplot2' not installed.")
  }
}
  
#### Distance to Features ####

# Water

water_osm <- get_osm_sf(bbox = bb, key = "natural", value = "water") # query once; reuse both geometry types
water <- plyr::rbind.fill(
  water_osm$osm_polygons,
  water_osm$osm_multipolygons
) %>% st_sf

#qplot(data = water, geom = "sf", fill = water, color = water)
#print_pdf(water,c("osm_id","name","water"))

w <- list(all   = water,
          onond = water %>% filter(name  == "Onondaga Lake"),
          hiawa = water %>% filter(name  == "Hiawatha Lake"),
          rsvr  = water %>% filter(water == "reservoir"),
          canal = water %>% filter(osm_id %in% c("156751577","156751581")),
          river = water %>% filter(osm_id %in% c("218059758","219168506")))

p_water <- w %>% 
  sapply(function(w_) {
           p %>% 
             st_distance(st_union(w_)) %>%
             as_tibble %>%
             return
           },
         simplify = FALSE,
         USE.NAMES = TRUE) %>% 
  plyr::llply(unlist) %>%
  as_tibble %>%
  set_names(paste0("w_",names(.)))

# Highways

h <- list(motor = get_osm_sf(bbox = bb,
                             key = "highway",
                             value = "motorway") %>% 
            extract2("osm_lines"),
          junct = get_osm_sf(bbox = bb, 
                             key = "highway",
                             value = "motorway") %>%
            extract2("osm_points") %>% 
            filter(highway == "motorway_junction"),
          link  = get_osm_sf(bbox = bb,
                             key = "highway",
                             value = "motorway_link") %>%
            extract2("osm_lines"),
          trunk = get_osm_sf(bbox = bb,
                             key = "highway",
                             value = "trunk") %>%
            extract2("osm_lines"))
h[["major"]] <- rbind(h[["motor"]][,"geometry"],h[["trunk"]][,"geometry"])

p_hwy <- h %>% 
  sapply(function(h_) {
           p %>% 
             st_distance(st_union(h_)) %>%
             as_tibble %>%
             return
         },
         simplify = FALSE,
         USE.NAMES = TRUE) %>% 
  plyr::llply(unlist) %>%
  as_tibble %>%
  set_names(paste0("h_",names(.)))

# Rail

rail <- list(all = get_osm_sf(bbox = bb,
                                key = "railway",
                                value = "rail") %>% 
               extract2("osm_lines"),
             abd  = get_osm_sf(bbox = bb,
                                 key = "railway",
                                 value = "abandoned") %>% 
               extract2("osm_lines"),
             dis  = get_osm_sf(bbox = bb,
                                 key = "railway",
                                 value = "disused") %>% 
               extract2("osm_lines"))
r <- list(all  = rail[["all"]],
          old  = rbind(rail[["abd"]][,"geometry"],rail[["dis"]][,"geometry"]),
          main = rail[["all"]] %>% filter(usage == "main"),
          branch=rail[["all"]] %>% filter(usage == "branch"),
          spur = rail[["all"]] %>% filter(service == "spur"),
          siding=rail[["all"]] %>% filter(service == "siding"),
          yard = rail[["all"]] %>% filter(service == "yard"),
          svc  = rail[["all"]] %>% filter(!is.na(service)))

p_rail <- r %>% 
  sapply(function(r_) {
           p %>% 
             st_distance(st_union(r_)) %>%
             as_tibble %>%
             return
         },
         simplify = FALSE,
         USE.NAMES = TRUE) %>% 
  plyr::llply(unlist) %>%
  as_tibble %>%
  set_names(paste0("r_",names(.)))

# Transit

transit <- list(train = get_osm_sf(bbox = bb,
                                   key = "public_transport",
                                   value = "stop_position") %>%
                  extract2("osm_points"),
                bus   = get_osm_sf(bbox = bb,
                                   key = "highway",
                                   value = "bus_stop") %>%
                  extract2("osm_points"),
                busrte= get_osm_sf(bbox = bb,
                                   key = "route",
                                   value = "bus") %>%
                  extract2("osm_points"))

t <- list(train  = transit[["train"]],
          amtrak = transit[["train"]] %>% filter(operator == "Amtrak"),
          notrak = transit[["train"]] %>% filter(is.na(operator)),
          bus    = transit[["bus"]],
          busrte = transit[["busrte"]],
          shelter= transit[["bus"]] %>% filter(shelter == "yes"),
          noshelt= transit[["bus"]] %>% filter(shelter == "no"),
          bench  = transit[["bus"]] %>% filter(bench == "yes"),
          nobench= transit[["bus"]] %>% filter(bench == "no"))

p_transit <- t %>% 
  sapply(function(t_) {
           p %>% 
             st_distance(st_union(t_)) %>%
             as_tibble %>%
             return
         },
         simplify = FALSE,
         USE.NAMES = TRUE) %>% 
  plyr::llply(unlist) %>%
  as_tibble %>%
  set_names(paste0("t_",names(.)))

# CBD

nbhd <- read_sf("https://opendata.arcgis.com/datasets/4df344e3ad8f45a8b5625967c7b3bbe3_0.geojson") %>%
  st_transform(2261)

n <- list(cbd  = nbhd %>% filter(NAME == "Downtown"),
          univ = nbhd %>% filter(NAME == "University Hill"),
          dist = nbhd %>% filter(Neigh_Cate == "DISTRICT"))

p_nbhd <- n %>% 
  sapply(function(n_) {
           p %>% 
             st_distance(st_union(n_)) %>%
             as_tibble %>%
             return
         },
         simplify = FALSE,
         USE.NAMES = TRUE) %>% 
  plyr::llply(unlist) %>%
  as_tibble %>%
  set_names(paste0("n_",names(.)))
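
The same centroid-to-feature distance computation is repeated for water, highways, rail, transit, and neighborhoods; it could be collapsed into one helper. A sketch (`dist_to_features` is a hypothetical name, assuming the centroids `p` and the feature lists defined above):

```r
# distance from every parcel centroid to the union of each feature layer
dist_to_features <- function(pts, features, prefix) {
  features %>%
    sapply(function(f) as.numeric(st_distance(pts, st_union(f))),
           simplify = FALSE, USE.NAMES = TRUE) %>%
    as_tibble %>%
    set_names(paste0(prefix, names(.)))
}
# e.g. p_nbhd <- dist_to_features(p, n, "n_")
```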

# TO DO: amenities, land use, leisure

## all distance vars
indep_osm <- p18Jul %>%
  as.data.frame %>%
  select(-geometry) %>%
  cbind(p_water) %>%
  cbind(p_hwy) %>%
  cbind(p_rail) %>%
  cbind(p_transit) %>%
  cbind(p_nbhd) %>%
  
  mutate_at(vars(-SBL),
            list(nn_50ft   = ~. <= 50,
                 nn_1320ft = ~. <= 1320,
                 nn_1mi    = ~. <= 5280)) %>%
  
  mutate(t_noshelt_closer = t_noshelt < t_shelter, # TRUE when the nearest unsheltered stop is closer than the nearest sheltered one
         t_nobench_closer = t_nobench < t_bench)

save(indep_osm, file = "indep_osm.RData")
load("indep_osm.RData")
indep_osm %>% summarize_all(~sum(is.finite(.) | is.character(.))) %>% t %>% View





Modeling

library(tidyverse)
library(sf)
library(magrittr)

setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) # set the working directory to this script's folder (RStudio only)
options(scipen = 999)

theme_plot <- theme(
  text = element_text( color = "black"),
  plot.title = element_text(size = 14,colour = "black"),
  plot.subtitle = element_text(face="italic"),
  plot.caption = element_text(hjust=0),
  axis.ticks = element_blank(),
  panel.background = element_blank(),
  panel.grid.major = element_line("grey80", size = 0.1),
  panel.grid.minor = element_blank(),
  panel.border = element_blank(),
  #panel.border = element_rect(colour = "black", fill=NA, size=2),
  strip.background = element_rect(fill = "grey80", color = "white"),
  strip.text = element_text(size=12),
  axis.title = element_text(size=12),
  axis.text = element_text(size=10),
  plot.background = element_blank(),
  legend.background = element_blank(),
  #legend.title = element_text(colour = "black", face = "italic"),
  #legend.text = element_text(colour = "black", face = "italic"),
  strip.text.x = element_text(size = 14)
)

load("dep_viol.RData")
load("indep_files.RData")
load("indep_osm.RData")
load("indep_svcs.RData")

dep_viol   <- dep_viol   %>% mutate_at(vars(starts_with("has_")),~ifelse(. == TRUE, "VIOL", "NONE"))
indep_svcs <- indep_svcs %>% rename_all(~sub('\\$','',.x))

data <- dep_viol %>%
  left_join(indep_files, by = c("SBL" = "SBL")) %>% 
  left_join(indep_osm,   by = c("SBL" = "SBL")) %>%
  left_join(indep_svcs,  by = c("SBL" = "SBL")) %>%
  
  select(-contains("_NA")) %>%
  mutate_if(is.character,~ifelse(is.na(.) | (. == ""),"!UNKNOWN",.)) %>%
  mutate(res = ifelse(p_occupancy %in% c("Owner Occupied","Rental Occupied","Vacant Residential") |
                              p_landuse   %in% c("Apartment","Multiple Residence","Single Family","Three Family","Two Family"),
                              TRUE,FALSE)) #%>%
  #mutate_all(~ifelse(!is.finite(.)&!is.character(.),FALSE,.)) %>%    #test for NAs
  #drop_na

data %>% summarize_all(~sum(is.finite(.) | is.character(.))) %>% t %>% View # test for NAs

dat_res <- data %>% filter(res == TRUE) %>% select(-res)

bivdata <- data %>% 
  filter(inspected == TRUE) %>%
  mutate_at(vars(starts_with("has_")),
            ~factor(.,levels = c("NONE","VIOL"))) %>%
  select(has_HEALTH,has_SAFETY,everything()) %>%
  select(-(SBL:inspected))
bivdat_res <- dat_res %>%
  filter(inspected == TRUE) %>%
  mutate_at(vars(starts_with("has_")),
            ~factor(.,levels = c("NONE","VIOL"))) %>%
  select(has_HEALTH,has_SAFETY,everything()) %>%
  select(-(SBL:inspected))


#function for bivariate models
run_biv <- function(dep,indep,data,...,
                      parallel = TRUE,
                      return_models = NULL) {
  
  if(!require(dplyr)) {
    install.packages("dplyr")
    require(dplyr)
  }
  if(!require(magrittr)) {
    install.packages("magrittr")
    require(magrittr)
  }
  
  if (!(is.vector(dep) & length(dep) == 1 &
        is.vector(indep) &
        is.data.frame(data))) {
    stop("
         `dep` must be a vector of length 1
         `indep` must be a vector of column names or indices
         `data` must be class \"data.frame\"
         ")
  }
  if(!(dep %in% colnames(data) & 
       all(unlist(indep) %in% colnames(data)) )) {
    stop("
         `dep` and `indep` must be columns names or indices of `data`
         ")
  }
  
  dots <- list(...,
               empty = NULL)
  dots <- dots[-which(sapply(dots, is.null))]
  
  dots$method <- ifelse("method" %in% names(dots),      
                        dots$method,     
                        "glm")
  if (dots$method %in% c("glm","rf")) {
    dots$family <- ifelse("family" %in% names(dots),
                          dots$family,
                          "binomial")
  } else {dots$family <- NULL}
  dots$trctrl <- ifelse("trainControl" %in% names(dots),
                        dots$trainControl,
                        ifelse(dots$method == "gbm", "repeatedcv", "cv"))
  dots$folds  <- ifelse("folds" %in% names(dots),  
                        dots$folds,       
                        ifelse(grepl("cv", dots$trctrl), 10, 25))
  
  if(!require(pbapply)) {
    install.packages("pbapply")
    require(pbapply)
  }
  if(!require(parallel)) {
    install.packages("parallel")
    require(parallel)
  }
  if(!require(doParallel)) {
    install.packages("doParallel")
    require(doParallel)
  }
  if(!require(caret)) {
    install.packages("caret")
    require(caret)
  }
  if(!require(pROC)) {
    install.packages("pROC")
    require(pROC)
  }
  
  oldw <- getOption("warn")
  options(warn = -1)
  
  cores <- detectCores()
  
  if (parallel & cores < length(indep)) {
    cores <- ifelse(cores == 1,1,cores-1) %>% makeCluster
  } else {cores <- NULL}

  
  results <- pblapply(
    cl = cores, 
    indep, function(ind) {
      require(magrittr)
      require(dplyr)
      require(caret)
      require(pROC)
      require(broom)
      
      if(parallel & is.null(cores)) {
        require(parallel)
        require(doParallel)
        inner_cl <- makeCluster(max(1, detectCores() - 1))
        registerDoParallel(inner_cl)
        on.exit({stopCluster(inner_cl); registerDoSEQ()}, add = TRUE)
      }
      
      regdat <- data %>% extract(c(dep,ind))
      formula <- paste(dep,"~",
                       paste(ind,collapse = " + "))
      
      if(dots$method == "gbm") {
        model <- train(as.formula(formula), data = regdat,
                       method = dots$method, family = dots$family,
                       trControl = trainControl(dots$trctrl, dots$folds,
                                                classProbs = TRUE,
                                                summaryFunction = twoClassSummary),
                      verbose = FALSE)
      } else {
        model <- train(as.formula(formula), data = regdat,
                       method = dots$method, family = dots$family,
                       trControl = trainControl(dots$trctrl, dots$folds,
                                                classProbs = TRUE,
                                                summaryFunction = twoClassSummary))
      }
      
      curvedat <- data.frame(obs = regdat[[dep]] %>% as.factor,
                             prb = predict(model, newdata = regdat, type = "prob")[[levels(model)[2]]])
      curve <- roc(curvedat$obs,curvedat$prb#,
                   #controls = levels(model)[1], cases = levels(model)[2]
      )
      
      row <- data.frame(var = paste(ind,collapse = ", "),
                        auc = auc(curve) %>% as.numeric,
                        cut = coords(curve,"best")[1],
                        tnr = coords(curve,"best")[2],
                        tpr = coords(curve,"best")[3],
                        acc = curvedat %>%
                          mutate(cut = coords(curve,"best")["threshold"],
                                 fit = ifelse(prb >= cut,levels(model)[2],levels(model)[1]),
                                 acc = ifelse(fit == obs,TRUE,FALSE)) %>%
                          summarize(Accuracy = sum(acc,na.rm = TRUE)/n()) %>%
                          as.numeric,
                        sd_auc = model$resample$ROC %>% sd,
                        sd_tnr = model$resample$Spec %>% sd,
                        sd_tpr = model$resample$Sens %>% sd,
                        stringsAsFactors = FALSE)
      
      result <- list(var = row$var,
                     row = row)
      
      if (isTRUE(return_models == "full")) {
        result$model <- model
      } else if (isTRUE(return_models == "tidy")) {
        result$model <- tidy(model$finalModel)
      }
      
      return(result)
      
    }
  )
  
  v  <- indep
  df <- lapply(results, function(result) result[["row"]]) %>% bind_rows
  m  <- lapply(results, function(result) result[["model"]]) %>% set_names(sapply(v, paste, collapse = ", "))
  
  out <- list(vars = v, summary = df, models = m)
  
  if (!is.null(cores)) stopCluster(cores) # release the outer cluster, if one was created
  
  options(warn = oldw)
  return(out)
}

### FUNCTION run_biv ###
# iteratively fits one cross-validated model per independent variable (or variable set) using caret::train

# dep = name of y variable as a string. e.g. "varname". MUST BE A FACTOR OR NUMERIC

# indep = list of x variables. e.g. c("predictor1","predictor2","predictor3")
### indep can handle list of lists, e.g. list(c("x1","x2"),c("x3","x4")) will produce two models (y ~ x1 + x2 and y ~ x3 + x4)
### indep can be "." (all variables)

# ... is any argument that caret::train can handle
### default is glm binomial, 10-fold cv, twoClassSummary

# if parallel is TRUE, will use n-1 cores where n is the total number of cores

# return_models = "tidy" returns a list of broom::tidy objects.
# return_models = "full" returns a list of caret::train objects
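
Putting the documentation above together, a typical call fits one bivariate model per predictor and collects the cross-validated summaries (the predictor names here are illustrative):

```r
biv_example <- run_biv(dep   = "has_HEALTH",
                       indep = c("res_yearbuilt_max", "bldg_sqft_max"),
                       data  = bivdat_res,
                       parallel = FALSE,
                       return_models = "tidy")
biv_example$summary  # one row per predictor: AUC, best cutoff, TNR/TPR, accuracy
```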

set.seed(505050)

#### feature selection ####

# simple variable importance using GLM

biv_H <- run_biv("has_HEALTH",
                   colnames(bivdat_res)[3:length(bivdat_res)],
                   data = bivdat_res,
                   parallel = FALSE,
                   return_models = "tidy")
biv_H$summary <- biv_H$summary %>% arrange(desc(auc))
biv_S <- run_biv("has_SAFETY",
                   colnames(bivdat_res)[3:length(bivdat_res)],
                   data = bivdat_res,
                   parallel = FALSE,
                   return_models = "tidy")
biv_S$summary <- biv_S$summary %>% arrange(desc(auc))

ggplot(biv_H$summary %>% 
         mutate(Variable = reorder(var,auc),
                Importance = auc)) +
  geom_col(aes(x = Variable, y = Importance, fill = Importance),
           width = 1) +
  coord_flip(ylim = c(0.45,0.75)) +
  scale_fill_gradientn(colors = c("#140b34","#84206b","#e55c30","#f6d746"),
                       values = c(        0,    0.333,    0.637,        1),
                       limits = c(0.5,0.75)) + 
  theme_plot + theme(panel.grid.major.y = element_blank()) + 
  ggtitle("Variable Importance")

ggplot(biv_S$summary %>% 
         mutate(Variable = reorder(var,auc),
                Importance = auc)) +
  geom_col(aes(x = Variable, y = Importance, fill = Importance),
           width = 1) +
  coord_flip(ylim = c(0.45,0.75)) +
  scale_fill_gradientn(colors = c("#140b34","#84206b","#e55c30","#f6d746"),
                       values = c(        0,    0.333,    0.637,        1),
                       limits = c(0.5,0.75)) + 
  theme_plot + theme(panel.grid.major.y = element_blank())

ggplot(biv_H$summary %>%
         mutate(Variable = reorder(var,auc),
                Importance = auc) %>% 
         extract(1:50,)) +
  geom_col(aes(x = Variable, y = Importance, fill = Importance),
           width = 1) +
  coord_flip(ylim = c(0.45,0.75)) +
  scale_fill_gradientn(colors = c("#140b34","#84206b","#e55c30","#f6d746"),
                       values = c(        0,    0.333,     0.637,       1),
                       limits = c(0.6,0.75)) + 
  theme_plot + theme(panel.grid.major.y = element_blank())

ggplot(biv_S$summary %>%
         mutate(Variable = reorder(var,auc),
                Importance = auc) %>% 
         extract(1:50,)) +
  geom_col(aes(x = Variable, y = Importance, fill = Importance),
           width = 1) +
  coord_flip(ylim = c(0.45,0.75)) +
  scale_fill_gradientn(colors = c("#140b34","#84206b","#e55c30","#f6d746"),
                       values = c(        0,    0.333,     0.637,       1),
                       limits = c(0.6,0.75)) + 
  theme_plot + theme(panel.grid.major.y = element_blank())

# collinearity

bivauc <- left_join(biv_H$summary[,c("var","auc")],
                 biv_S$summary[,c("var","auc")],
                 by = "var", suffix = c(".H",".S")) %>%
  mutate(auc_avg = (auc.H + auc.S)/2)

bivcor <- cor(bivdat_res %>%
                select(bivauc %>% arrange(desc(auc_avg)) %>% pull(var)) %>%
                select_if(is.numeric) %>%
                select(1:100))

corrplot::corrplot(bivcor)

biv <- bivauc %>% left_join(
  bivcor %>% 
    abs %>% 
    rowMeans(na.rm = TRUE) %>%
    tibble::enframe(name = "var", value = "cor"),
  by = "var"
)

#### MODELING ####

set.seed(606060)
library(caret)

# initialize once; the braced blocks below are re-run iteratively and append results here
H <- list()
H_ <- tibble()
S <- list()
S_ <- tibble()

type <- "xgb"

##################### HEALTH

testall <- bivdat_res %>% 
  select(all_of(biv %>% arrange(desc(auc.H)) %>% pull(var)),has = has_HEALTH) %>% 
  mutate(has = has %>% as.numeric - 1) %>%
  select(has,1:200)
testall <- dummyVars("~.",testall) %>%
  predict(newdata = testall)
keep <- colnames(testall) # first time only: start from all columns; later passes reuse `keep` from varImp() below


#run from here

testnext <- bivdat_res %>% 
  select(all_of(biv %>% arrange(desc(auc.H)) %>% pull(var)),has = has_HEALTH) %>% 
  mutate(has = has %>% as.numeric - 1) %>%
  select(has,1:130)
# factors must be one-hot encoded.
testnext <- dummyVars("~.",testnext) %>%
  predict(newdata = testnext)

{
  testdat <- cbind(
    #testnext,
    testall[,c("has",keep)]
  )
  testdat <- testdat[,!duplicated(t(testdat))] # drop duplicated columns; duplicated() compares rows, hence the transpose
  #testdat is a matrix
  
  # if(type == "glm") {
  #   cl <- parallel::makeCluster(parallel::detectCores()-1)
  #   doParallel::registerDoParallel()
  #   mod <- caret::train(has ~ ., data = test,
  #                       method = "glm", family = "binomial",
  #                       trControl = caret::trainControl("cv",10,
  #                                                       savePredictions = TRUE,
  #                                                       classProbs = TRUE))
  #   test$prb <- mod %>% predict(type = "prob", newdata = testdat) %>% pull("VIOL")
  #   parallel::stopCluster(cl)
  #   foreach::registerDoSEQ()
  # }
  if(type == "xgb") {
    mod <- train(has ~ ., data = testdat,
                 method    = "xgbTree",
                 trControl = trainControl(method = "cv",
                                          number = 5,
                                          savePredictions = TRUE),
                 nthreads = parallel::detectCores() - 1,
                 objective = "binary:logistic")
    test <- as.data.frame(testdat)            # `test` collects observed outcomes and fitted probabilities
    test$prb <- predict(mod, newdata = testdat)
    names <- mod$finalModel$xnames
  }
  
  crv <- pROC::roc(test$has,test$prb)
  test <- test %>% 
    mutate(cut = pROC::coords(crv,"best")["threshold"],
           fit = ifelse(prb >= cut,"VIOL","NONE"),
           acc = (fit == "VIOL") == (has == 1)) # `has` is numeric 0/1, so compare on the same scale
  
  keep <- varImp(mod)$importance %>% 
    rownames_to_column("var") %>%
    filter(Overall > 0) %>%
    pull(var) %>%
    gsub("`","",.)
  
  H[[
    paste(attributes(mod$terms)$term.labels,collapse = " + ")
    ]] <- list(type  = type,
               name  = paste(attributes(mod$terms)$term.labels,collapse = " + "),
               terms = attributes(mod$terms)$term.labels,
               keep  = keep,
               nvar  = length(attributes(mod$terms)$term.labels),
               model = mod$finalModel,
               probs = test %>%
                 select(has,prb,fit),
               tune  = mod$bestTune,
               curve = crv,
               auc = crv$auc,
               cut = pROC::coords(crv,"best")[1] %>% as.numeric,
               tnr = pROC::coords(crv,"best")[2] %>% as.numeric,
               tpr = pROC::coords(crv,"best")[3] %>% as.numeric,
               acc = test %>%
                 summarize(Accuracy = sum(acc,na.rm = TRUE)/n()) %>%
                 as.numeric,
               sd  = sd(test$prb),
               mean = mean(test$prb),
               hist = hist(test$prb)
               )
  H_ <- H_ %>% bind_rows(
    H[[paste(attributes(mod$terms)$term.labels,collapse = " + ")]] %>%
      extract(c("type","name","nvar","auc","cut","tnr","tpr","acc","mean","sd"))
    )
  

  
  View(varImp(mod)$importance)
  hist(test$prb)
  crv$auc
}

# indices chosen by inspecting the H_ summary from the iterative runs above
keep <- H[[8]]$keep
final_H <- H[[10]]

##################### SAFETY

testall <- bivdat_res %>% 
  select(all_of(biv %>% arrange(desc(auc.S)) %>% pull(var)),has = has_SAFETY) %>% 
  mutate(has = has %>% as.numeric - 1) %>%
  select(has,1:200)
testall <- dummyVars("~.",testall) %>%
  predict(newdata = testall)
keep <- colnames(testall) #first time only


#run from here

testnext <- bivdat_res %>% 
  select(all_of(biv %>% arrange(desc(auc.S)) %>% pull(var)),has = has_SAFETY) %>% 
  mutate(has = has %>% as.numeric - 1) %>%
  select(has,1:100)
# factors must be one-hot encoded.
testnext <- dummyVars("~.",testnext) %>%
  predict(newdata = testnext)

{
  testdat <- cbind(
    #testnext,
    testall[,c("has",keep)]
  )
  testdat <- testdat[,!duplicated(t(testdat))] # drop duplicated columns; duplicated() compares rows, hence the transpose
  #testdat is a matrix
  
  # if(type == "glm") {
  #   cl <- parallel::makeCluster(parallel::detectCores()-1)
  #   doParallel::registerDoParallel()
  #   mod <- caret::train(has ~ ., data = test,
  #                       method = "glm", family = "binomial",
  #                       trControl = caret::trainControl("cv",10,
  #                                                       savePredictions = TRUE,
  #                                                       classProbs = TRUE))
  #   test$prb <- mod %>% predict(type = "prob", newdata = testdat) %>% pull("VIOL")
  #   parallel::stopCluster(cl)
  #   foreach::registerDoSEQ()
  # }
  if(type == "xgb") {
    mod <- train(has ~ ., data = testdat,
                 method    = "xgbTree",
                 trControl = trainControl(method = "cv",
                                          number = 5,
                                          savePredictions = TRUE),
                 nthread   = parallel::detectCores() - 1, # xgboost's parameter is `nthread`
                 objective = "binary:logistic")
    # With a binary:logistic objective the raw predictions are probabilities.
    test$prb <- predict(mod, newdata = testdat)
    names <- mod$finalModel$xnames
  }
  
  # ROC curve and Youden-optimal probability cutoff.
  crv <- pROC::roc(test$has, test$prb)
  test <- test %>% 
    mutate(cut = pROC::coords(crv, "best")["threshold"] %>% as.numeric,
           fit = ifelse(prb >= cut, "VIOL", "NONE"),
           # compare the 0/1 prediction to the 0/1 outcome
           acc = as.numeric(prb >= cut) == has)
  
  # Retain only variables with nonzero importance for the next pass
  # (strip the backticks caret adds around non-syntactic names).
  keep <- varImp(mod)$importance %>% 
    rownames_to_column("var") %>%
    filter(Overall > 0) %>%
    pull(var) %>%
    gsub("`", "", .)
  
  # Store this iteration's model and diagnostics, keyed by its formula terms.
  S[[
    paste(attributes(mod$terms)$term.labels,collapse = " + ")
    ]] <- list(type  = type,
               name  = paste(attributes(mod$terms)$term.labels,collapse = " + "),
               terms = attributes(mod$terms)$term.labels,
               keep  = keep,
               nvar  = length(attributes(mod$terms)$term.labels),
               model = mod$finalModel,
               probs = test %>%
                 select(has,prb,fit),
               tune  = mod$bestTune,
               curve = crv,
               auc = crv$auc,
               cut = pROC::coords(crv,"best")[1] %>% as.numeric, # threshold
               tnr = pROC::coords(crv,"best")[2] %>% as.numeric, # specificity
               tpr = pROC::coords(crv,"best")[3] %>% as.numeric, # sensitivity
               acc = test %>%
                 summarize(Accuracy = sum(acc,na.rm = TRUE)/n()) %>%
                 as.numeric,
               sd  = sd(test$prb),
               mean = mean(test$prb),
               hist = hist(test$prb) # side effect: also draws the histogram
    )
  # Append a one-row summary for this iteration to the comparison table.
  S_ <- S_ %>% bind_rows(
    S[[paste(attributes(mod$terms)$term.labels,collapse = " + ")]] %>%
      extract(c("type","name","nvar","auc","cut","tnr","tpr","acc","mean","sd"))
  )
  
  
  
  # Interactive checks: variable importance, probability distribution, AUC.
  View(varImp(mod)$importance)
  hist(test$prb)
  crv$auc
}

# Hand-picked after reviewing the S_ summary table: iteration 3 supplied the
# retained feature set; iteration 14 is the final SAFETY model.
keep <- S[[3]]$keep

final_S <- S[[14]]

##################### MAKE PREDICTIONS

# Checkpoint: uncomment to save or restore the fitted models.
#save(H,H_,S,S_,final_H,final_S,file = "model_final.RData")
#load("model_final.RData")

# Score every parcel: assemble predictors in the same bivariate-AUC order
# used during training.
pred_dat <- dat_res %>% 
  select(biv %>% arrange(desc(auc.H)) %>% pull(var))

# Health
# Strip the backticks caret adds around non-syntactic feature names so they
# match the column names produced by dummyVars below.
final_H$model$feature_names <- final_H$model$feature_names %>% gsub("`", "", .)

# One-hot encode, then align columns to the model's expected feature order.
pred_dat_H <- dummyVars("~.", pred_dat) %>%
  predict(newdata = pred_dat) %>%
  as_tibble %>%
  select(all_of(final_H$model$feature_names)) %>%
  as.matrix

dat_res$prb_HEALTH <- predict(final_H$model, newdata = pred_dat_H)
# Classify at the cutoff chosen during model selection.
dat_res$pred_HEALTH <- ifelse(dat_res$prb_HEALTH > final_H$cut, "VIOL", "NONE")

# Safety: same pipeline as the health model.
final_S$model$feature_names <- final_S$model$feature_names %>% gsub("`", "", .)

pred_dat_S <- dummyVars("~.", pred_dat) %>%
  predict(newdata = pred_dat) %>%
  as_tibble %>%
  select(all_of(final_S$model$feature_names)) %>%
  as.matrix

dat_res$prb_SAFETY <- predict(final_S$model, newdata = pred_dat_S)
dat_res$pred_SAFETY <- ifelse(dat_res$prb_SAFETY > final_S$cut, "VIOL", "NONE")

##################### OUTPUT

final_res <- dat_res

# Attach predictions to the master parcel table (joined on SBL) and compute
# the probability of any violation, treating the health and safety risks as
# independent: P(H or S) = P(H) + P(S) - P(H)P(S).
final_data <-
  data %>%
  left_join(final_res %>%
              select(SBL,
                     starts_with("pred_"),
                     starts_with("prb_")),
            by = "SBL") %>% 
  mutate(prb_viol = prb_HEALTH + prb_SAFETY - (prb_HEALTH * prb_SAFETY)) %>%
  select(SBL, inspected, residential = res,
         ends_with("_viol"),
         ends_with("_HS"),
         ends_with("_HEALTH"),
         ends_with("_SAFETY"),
         everything())

# Slim output table carrying only identifiers, flags, and risk scores.
final <- final_data %>% select(SBL, inspected, residential,
                               ends_with("_viol"),
                               ends_with("_HS"),
                               ends_with("_HEALTH"),
                               ends_with("_SAFETY"))

# Persist the full modeling objects for downstream use.
save(final_data,
     H, H_, final_H,
     S, S_, final_S,
     biv, bivcor,
     file = "model.RData")

save(final, file = "final.RData")

save.image(file = "model_working.RData")
#load("model_working.RData")