• 1.0 Introduction
    • 1.1 How to use this document
    • 1.2 Abstract
    • 1.3 Motivation
  • 2.0 Data Wrangling and Cleaning
    • 2.1 Data Stories and Feature Engineering
  • 3.0 Exploratory Analysis
    • 3.1 Where are evictions in Philadelphia, and how has that changed over time?
    • 3.2 Where are illegal rentals in Philadelphia and what types of properties are being illegally rented?
    • 3.3 What is the relationship between key variables and illegal rentals?
  • 4.0 Modeling Strategy
    • 4.1 Model Building
    • 4.2 Validation
    • 4.3 Cost Benefit Analysis
  • 5.0 Application and Interface
  • 6.0 Conclusions
  • 7.0 Code Appendix
    • 7.1 Data Wrangling, Feature Engineering, and Modeling
    • 7.2 Visuals


1.0 Introduction

1.1 How to use this document

This project was produced as part of the University of Pennsylvania’s Master of Urban Spatial Analytics Spring 2018 Practicum (MUSA 801) taught by Ken Steif, Michael Fichman, and Matt Harris. We would like to thank Jonathan Pyle of Philadelphia Legal Assistance for providing feedback and data.

The following document presents an analysis of illegal rentals in Philadelphia and an interactive tool for the Philadelphia Department of Licenses and Inspections (L&I) that will ultimately help proactively prioritize inspections to stymie illegal rentals before they turn into illegal evictions. The tool will provide a risk score for every rental property in the city, indicating each property’s likelihood of being rented illegally. This document includes a detailed overview of the use case, analysis and key findings, as well as the relevant code needed to replicate the analysis.

1.2 Abstract

In Philadelphia, all rental units are required to be certified prior to leasing. Through certification, landlords must obtain a rental license, at a cost of about $55 annually. If a unit is not certified, but is still rented out, it is considered an illegal rental. If a landlord is renting illegally, then any subsequent tenant eviction is also illegal, meaning the tenant has no legal recourse.

With no official record of either the rental or the eviction, it is difficult to gauge how often landlords are renting out their properties illegally. The only record we have of illegal rentals is the Department of Licenses and Inspections (L&I) Code Violation 9-3902 (renting without a license), but this count far underestimates the problem.

Currently, L&I conducts inspections reactively, according to received 311 complaints. However, L&I only employs 17 inspectors, and there is a backlog of complaints still to be inspected. A statistical model that can predict property-specific illegal rental risk would allow L&I to proactively address illegal rentals before they turn into illegal evictions. Our tool predicts, for every non-owner-occupied home in Philadelphia, the likelihood that the property is being rented to a tenant without a rental license, helping L&I prioritize inspections more efficiently.

1.3 Motivation

Between 2005 and 2016, rental prices in Philadelphia rose by 10 percent, while wages rose by only 0.3 percent. As a result, over half of Philadelphia renters are rent-burdened (spending over one-third of income on housing), and nearly one-third of these renters spend over 50 percent of their income on housing.

Figure 1. Median rent over time in Philadelphia

Source: American Community Survey 2012-2016


In 2016 alone, there were 24,000 evictions, impacting 9 percent of households in Philadelphia. Evictions have real consequences, particularly because they disproportionately impact already vulnerable populations, including single-mother, low-income, and black households (Mayor’s Taskforce on Eviction Prevention and Response, June 2018).

Figure 2. Inequity of evictions in Philadelphia

Source: Mayor’s Taskforce on Eviction Prevention and Response, June 2018


As rental prices rise and evictions increase in frequency, so too does homelessness. There are an estimated 5,700 homeless people in Philadelphia, and about 20 percent of people in shelters list eviction as the primary cause of their homelessness.

Evictions are a pressing issue, so much so that Mayor Jim Kenney established the Mayor’s Taskforce on Eviction Prevention and Response in September 2017. One action item from the taskforce was to gain a better understanding of illegal rentals and, by extension, illegal evictions in the city.

Our illegal rental predictive model uses a data-driven approach that would allow L&I to proactively address illegal rentals before they turn into illegal evictions, and utilize their limited resources more efficiently. Catching illegal rentals before they turn into illegal evictions protects tenants because legal evictions prohibit landlords from locking tenants out, require landlords to provide eviction notice, and ensure a court process. The true number of illegal evictions is unknowable, but curbing illegal rental violations serves as a concrete way for the city to begin addressing housing insecurity and homelessness.

2.0 Data Wrangling and Cleaning

This analysis was built on publicly available data, largely from Philadelphia’s open data portal. Philadelphia Community Legal Services also provided us with scraped eviction docket data. Because we were working with messy administrative data, our primary data wrangling challenge was bringing together and cleaning the following datasets into a single dataset:

  • Property Assessments (from the Office of Property Assessment via OpenDataPhiladelphia)
  • Licensing and Inspection Code Violations (from Licenses and Inspections via OpenDataPhiladelphia)
  • Business Licenses (from Licenses and Inspections via OpenDataPhiladelphia)

Additional data was collected to inform both our exploratory analysis and our model development:

  • Docket of Eviction Cases 2014-2016 (from Community Legal Aid Services)
  • Tax Delinquency (via OpenDataPhiladelphia)
  • Socioeconomic and Demographic Data (2012-2016 American Community Survey)

To make the data more manageable, we limited it to the years 2016-2018. Once the cleaning was complete, we were left with a dataset of about 305,000 individual parcels and over 200 variables.

Our final dataset included a “universe” of all potential rental properties in Philadelphia. We limited our data to properties that we understood not to be owner-occupied, because inspecting owner-occupied properties for illegal rental violations would be a poor use of L&I’s limited resources. We determined the universe of rental properties by filtering our dataset to include only properties zoned as residential where no homestead (owner-occupied) tax exemption was claimed.

Figure 3. The universe of potential rental properties

Source: Office of Property Assessments, Open Data Philly, 2016-2018


With our data coming from many sources, the properties in one dataset were not always present in another. As a result, when we joined the different datasets together, those properties, or rows, appear to be missing data. For example, if a property from the OPA dataset does not match any property in the violations dataset based on its OPA account number, we assume that the property does not have a violation. It is important to note that any such assumption inevitably biases our data. Because across all of our datasets we assumed that missing data meant the absence of a matching record, we bias our data toward underestimating violations, licenses, and tax delinquencies.
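
The join-and-assume step described above can be sketched as follows, using made-up example data (the column names `opa_account_num` and `n_violations` are illustrative, not the real schema):

```r
# Tiny stand-ins for the OPA and violations datasets (hypothetical columns)
opa <- data.frame(opa_account_num = c("100", "200", "300"),
                  stringsAsFactors = FALSE)
violations <- data.frame(opa_account_num = "200", n_violations = 2,
                         stringsAsFactors = FALSE)

# Left join on the OPA account number; unmatched properties get NA
joined <- merge(opa, violations, by = "opa_account_num", all.x = TRUE)

# Assume a missing match means zero violations
# (this is the assumption that biases counts downward)
joined$n_violations[is.na(joined$n_violations)] <- 0
```

The same NA-to-zero recoding was applied wherever a dataset contributed counts or flags.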

Figure 4. Table of properties from each dataset

Percentage of properties joined from cleaned datasets to OPA


2.1 Data Stories and Feature Engineering

We sorted our data into four different “stories” that together paint a full picture of illegal rentals. These four stories are used to estimate an illegal rental risk score for each property citywide. This ensures our model is sensitive to a diverse range of variables.

The dependent variable for our analysis - that is, the variable we are trying to predict - is illegal rental violations, which we encode in our data as ‘0’ for properties that have not received an illegal rental violation and ‘1’ for properties that have.

Story 1: Violation History

Hypothesis 1: Is violation history correlated with illegal rentals?

To determine each property’s violation history, we looked at the L&I violations dataset. We created variables indicating the status of a citation at a property (e.g. active or inactive), whether a property had multiple citations (aside from illegal rental), whether an owner had multiple citations (aside from illegal rental), and whether a property had interior or exterior citations. Violations indicate that a landlord has been willing to provide non-code-compliant housing and has not consistently abided by the housing code in the past.

Story 2: Legal History

Hypothesis 2: Are past legal events correlated with illegal rentals?

To determine each property’s legal history, we looked at tax delinquency, the eviction docket, and business licenses. Using the docket data, we wanted to identify repeat offenders and properties where multiple families have been evicted. To do this, we aggregated evictions by landlord and by address to get an eviction count for every property and every landlord. This lets us see whether a landlord has legally evicted tenants in the past.
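
The aggregation step can be sketched like this; the docket column names (`plaintiff`, `address`) are assumed for illustration:

```r
# Hypothetical slice of the eviction docket
docket <- data.frame(
  plaintiff = c("Smith LLC", "Smith LLC", "Jones"),
  address   = c("1 Main St", "2 Oak St", "1 Main St"),
  stringsAsFactors = FALSE
)

# Eviction count per landlord
evict_by_landlord <- aggregate(address ~ plaintiff, data = docket, FUN = length)
names(evict_by_landlord)[2] <- "eviction_count"

# Eviction count per property
evict_by_address <- aggregate(plaintiff ~ address, data = docket, FUN = length)
names(evict_by_address)[2] <- "eviction_count"
```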

Using the tax delinquency data, we determined how much was owed in taxes and for how long. This serves as a proxy for a landlord that has not abided by legal processes in the past.

We also used OPA data to determine whether or not a property has an absentee landlord. A property was considered to have an absentee landlord if the tax mailing address was different from the parcel address. We further categorized absentee landlords depending on whether they were located in the greater Philadelphia area (New Jersey or Delaware) or out-of-region. We created these variables based on the assumption that a landlord who is not physically present would find it more difficult to maintain their paperwork and licensing.
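
A minimal sketch of these flags, assuming hypothetical column names (`mailing_address` and `mailing_state` stand in for the real OPA fields):

```r
# Made-up parcels with their tax mailing addresses
opa <- data.frame(
  parcel_address  = c("1 Main St", "2 Oak St", "3 Elm St"),
  mailing_address = c("1 Main St", "9 Pine Ave", "5 Beach Rd"),
  mailing_state   = c("PA", "NJ", "FL"),
  stringsAsFactors = FALSE
)

# Absentee if the tax mailing address differs from the parcel address
opa$absentee <- ifelse(opa$mailing_address != opa$parcel_address, 1, 0)

# Split absentees into greater-Philadelphia (NJ/DE) vs. out-of-region
opa$absentee_regional <- ifelse(opa$absentee == 1 &
                                  opa$mailing_state %in% c("NJ", "DE"), 1, 0)
opa$absentee_far <- ifelse(opa$absentee == 1 &
                             !opa$mailing_state %in% c("NJ", "DE"), 1, 0)
```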

We also used the business license data to determine the status of rental licenses for every property, in order to see whether licenses had lapsed or been consistently renewed; this also served as a legal compliance indicator.

Story 3: Housing Characteristics

Hypothesis 3: Are housing characteristics correlated with illegal rentals?

To determine which physical characteristics of a house are correlated with illegal rentals, we looked almost exclusively at the OPA property dataset, which includes data on housing condition, size, features, value, and owner. The majority of the features required little manipulation. In cases where there were significant amounts of missing data, we created a “dummy” column that flags for our model that the variable is missing significant data.

Story 4: Geospatial and Demographics

Hypothesis 4: Are geospatial and demographic characteristics correlated with illegal rentals?

Because there is already extensive research on who is disproportionately impacted by eviction, we wanted to see whether these demographic characteristics were also linked to illegal rentals in Philadelphia. We used census data to create variables for the populations known to be disproportionately impacted by legal evictions (female-headed households, black households, low-income households), to see if that link held true for illegal evictions.

We also calculated the average distance from each violation and each license to the closest five violations or licenses, respectively, as a proxy for spatial patterns in illegal renting. This variable shows whether rental licenses or violations cluster, the idea being that certain behaviors are contagious: if owners see other owners renting without a rental license, or see other illegal behavior, they may be more likely to mirror that behavior.
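
The nearest-neighbor feature can be computed as below; the appendix loads the FNN package for this, but the same quantity is sketched here in base R on made-up coordinates:

```r
# Fake violation locations (the real analysis used parcel coordinates)
set.seed(801)
viol_xy <- cbind(x = runif(20), y = runif(20))

# Pairwise distance matrix; exclude each point from its own neighbor set
d <- as.matrix(dist(viol_xy))
diag(d) <- Inf

# Average distance from each violation to its five nearest violations
avg_dist_5nn <- apply(d, 1, function(row) mean(sort(row)[1:5]))
```

Smaller values indicate tighter clustering of violations around a property.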

We also assigned each property to a neighborhood and council district to see if, at a different scale, there is geographic clustering of illegal rentals that varies from neighborhood to neighborhood or council district to council district.

3.0 Exploratory Analysis

Our exploratory analysis focuses on understanding the rental market in Philadelphia, with a primary focus on evictions and illegal rentals. Through our exploratory analysis, we answered the following questions:

  • Where are evictions in Philadelphia, and how has that changed over time?
  • Where are illegal rentals in Philadelphia, what types of properties are being illegally rented, and how has that changed over time?
  • What is the correlation between key variables and illegal rentals?

3.1 Where are evictions in Philadelphia, and how has that changed over time?

Understanding where legal evictions are occurring is beneficial for understanding the nature of the rental market in Philadelphia, which ultimately provides insight into illegal rentals.

Between 2016 and 2018, there were about 55,700 evictions in the city, representing about one-fifth of all rental properties. As the chart below shows, the annual number peaked in 2017.

Figure 5. Eviction Frequency

Source: Court Docket, 2016-2018


The spatial pattern of these 55,700 eviction events between 2016 and 2018 is best seen in the density hotspot map shown in section 3.2.

3.2 Where are illegal rentals in Philadelphia and what types of properties are being illegally rented?

Between 2016 and 2018, around 7,500 citations were given out by L&I inspectors to individuals renting their property illegally. It is worth noting that this figure is almost certainly under-reported given the limited inspection resources L&I has at its disposal.

Over the same period, citations were issued for illegally renting a single-family home, illegally renting a multi-family home, and illegally renting a room, with the distribution heavily favoring single-family homes.

Figure 6. Distribution of illegal rental violation types

Source: L&I, Open Data Philly, 2016-2018


The density map below shows the distribution of illegal rental violations between 2016 and 2018 and compares it to the distribution of evictions over the same period. Evictions are more widespread across the city; however, there are similar hotspots in South and West Philadelphia.

Figure 7. Density of illegal rental violations

Source: L&I, Open Data Philly, 2016-2018


3.3 What is the relationship between key variables and illegal rentals?

Comparing homes with and without illegal rental violations provides a strong basis for understanding the relationship between illegal rentals and explanatory variables, allowing us to see which variables are associated with higher rates of illegal rental violations.

As the charts below show, illegal rentals are more prevalent in properties that have had interior and exterior violations, are tax delinquent, have out-of-state landlords, and are located on parcels with multiple units. They are less prevalent in properties built before 1900.

Figure 8. Relationship between variables and illegal rental violations

Source: L&I, Office of Property Assessments, Department of Records, Open Data Philly


Given that evictions disproportionately impact low income populations and people of color, understanding the relationship between these populations and illegal rental violations is also telling. The maps below show evictions, illegal rentals, tax delinquency, poverty, and minority populations by census tract. The data indicate that evictions and illegal rental violations are more prevalent in areas with higher minority and lower income populations.

Figure 9. Spatial distribution of demographic variables

Source: ACS 2012-2016, Department of Records, Court Docket, L&I, Open Data Philly


4.0 Modeling Strategy

The goal of our model is to learn from observed violations to predict a relative violation risk score for rental properties, including those homes that L&I does not regularly inspect. In order to ensure that certain data, namely the violation history, does not overpower our model, we used an ensemble model, which takes our four unique stories about violations, legal history, housing characteristics, and demographics and geospatial components and combines them into one model that weighs all four stories equally. Our model was validated using several metrics to ensure it is both accurate and generalizable.

4.1 Model Building

Illegal rentals are a rare occurrence, comprising less than 0.1 percent of our final dataset, in part because illegal rentals are likely undercounted. As a result, models trained on the entire dataset predicted that almost no properties would be illegal rentals. To create a more balanced dataset, we randomly sampled our data so that the modeling dataset had four times as many non-violations as violations.
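
The 4:1 downsampling can be sketched as follows, on simulated labels (column names are illustrative):

```r
# Simulated dataset with rare positives
set.seed(801)
df <- data.frame(id = 1:1000, violation = c(rep(1, 50), rep(0, 950)))

pos <- df[df$violation == 1, ]
neg <- df[df$violation == 0, ]

# Keep every violation and a random sample of four times as many non-violations
neg_sample <- neg[sample(nrow(neg), 4 * nrow(pos)), ]
balanced <- rbind(pos, neg_sample)
```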

To build our predictive model, we estimated separate models for each of the four stories and then ensembled them into one final predictive model. To determine which variables were most significant, we used stepwise selection. The stepwise model added and deleted variables one at a time, letting us see which variables are significant and how the variables correlate with each other. Through this process, we winnowed our more than 200 independent variables down to 24 variables spread across the four stories.
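
Stepwise selection can be illustrated with base R's step(), on simulated data standing in for one story (the real models used the project's variables):

```r
# Simulated story data: x1 and x2 drive the outcome, x3 is pure noise
set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.5 * d$x1 - d$x2))

full <- glm(y ~ x1 + x2 + x3, data = d, family = binomial)

# step() adds and drops one variable at a time, keeping the AIC-best model
selected <- step(full, direction = "both", trace = 0)
```

The retained terms can be inspected with `names(coef(selected))`.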

Figure 10. Four data stories

Variables included in the four stories


With our final variables determined, we plugged these stories into the ensemble model, resulting in a relative risk score for rental properties across the city.
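
One simple way to weight the four stories equally, as described above, is to average their predicted probabilities into a single risk score. The numbers below are illustrative, and the project's exact ensembling method may differ:

```r
# Hypothetical predicted probabilities from the four story models
# (one row per property)
story_preds <- data.frame(
  violations = c(0.9, 0.1),
  legal      = c(0.7, 0.2),
  housing    = c(0.8, 0.3),
  geodemo    = c(0.6, 0.2)
)

# Each story contributes one quarter of the final score
risk_score <- rowMeans(story_preds)
```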

4.2 Validation

The predictive model provides an illegal rental risk score for all rental properties in Philadelphia. This risk score is used to classify properties as ‘worthy of inspection’ or ‘not worthy of inspection.’ This ‘worthiness’ classification provides a basis for measuring the accuracy and generalizability of our model.

While our continuous relative risk score is useful for L&I, we set a threshold to determine specified outcomes, listed in the table below. These outcomes allow for a more direct comparison of costs between our data driven approach and L&I’s business as usual.

Figure 11. Potential model outcomes

Model outcomes


We ultimately set the threshold at 0.65, which yields a true positive rate of 80 percent and a true negative rate of 93 percent. At this threshold, the model has high true positive and true negative rates as well as low false negative and false positive rates; false negatives carry high social costs, while false positives burden L&I with unnecessary inspections. Note that this threshold is adjustable, depending on which outcome is most important to the user.
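
Applying the threshold and computing the confusion-matrix rates looks like this, on made-up scores and observed outcomes:

```r
# Hypothetical risk scores and observed violation outcomes
score    <- c(0.9, 0.5, 0.7, 0.3, 0.8, 0.1)
observed <- c(1,   1,   0,   0,   1,   0)

# Classify as 'worthy of inspection' at the 0.65 threshold
predicted <- ifelse(score >= 0.65, 1, 0)

tp <- sum(predicted == 1 & observed == 1)  # correctly flagged violations
tn <- sum(predicted == 0 & observed == 0)  # correctly cleared properties
fp <- sum(predicted == 1 & observed == 0)  # unnecessary inspections
fn <- sum(predicted == 0 & observed == 1)  # missed violations

tpr <- tp / (tp + fn)  # true positive rate (sensitivity)
tnr <- tn / (tn + fp)  # true negative rate (specificity)
```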

Figure 12. Model results

Confusion matrix for final model


The receiver operating characteristic (ROC) curve below shows the trade-off between false positives and true positives. A model with a perfect area under the curve (AUC) of 1 would almost certainly be overfit and not generalizable. Our model is very accurate, with an AUC of 0.97, meaning that 97 percent of the time it assigns a higher risk score to a randomly chosen violation than to a randomly chosen non-violation.
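
This ranking interpretation of AUC can be checked directly on a small made-up example: the AUC equals the fraction of (violation, non-violation) pairs that the model ranks correctly.

```r
# Hypothetical scores and outcomes
score    <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2)
observed <- c(1,   1,   0,   1,   0,   0,   0)

pos <- score[observed == 1]  # scores of violations
neg <- score[observed == 0]  # scores of non-violations

# Fraction of positive/negative pairs ranked correctly
auc <- mean(outer(pos, neg, ">"))
```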

Figure 13. ROC curve

ROC curve for the final model


Our model has proven to be accurate, but does it predict equally well in different parts of the city? A generalizable model is imperative for creating a useful inspection tool for L&I, as the agency needs a tool that predicts as well in dense South Philly as it does in more suburban Northeast Philly. The below maps show the generalizability of the model.

Figure 14. Neighborhood level sensitivity and specificity

Map of true positive and true negative rates by neighborhood


True positive and true negative rates vary across neighborhoods. Our model does a better job of predicting true negatives than true positives; this is unsurprising because there are far more negatives than positives. Geographically, our model fares best in predicting true negatives in the more suburban Northeast and Northwest and in predicting true positives in South and West Philadelphia. South and West Philadelphia have traditionally seen more disinvestment as compared to other parts of the city. Given this history, it would not be surprising for L&I to prioritize their inspections in these areas. It would then follow that where there are more inspections there are more illegal rental violations, both predicted and observed.

4.3 Cost Benefit Analysis

Ultimately, one of the key decision points in choosing between a data-driven approach and business as usual is the social and economic cost of each approach. In other words, does our model help L&I use their resources more efficiently, and does it reduce the high social cost of an illegal rental violation and an illegal eviction? To determine this, we performed a cost-benefit analysis, calculating a back-of-the-envelope cost for each outcome and comparing it to the current cost of doing business for 1,000 properties.

We estimated the following costs:
  • $1,180 for every true positive
  • $50 for every true negative
  • $50 for every false positive
  • An unknown, but high social cost for every family displaced by a false negative
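
The quantifiable portion of the comparison is simple arithmetic over the per-outcome costs above. The outcome counts below are illustrative, not the project's actual results, and the unknown false-negative cost is left out of the sum:

```r
# Hypothetical outcome counts for a batch of 1,000 properties
counts <- c(tp = 100, tn = 800, fp = 60, fn = 40)

# Per-outcome costs from the list above; the false-negative cost is unknown
costs <- c(tp = 1180, tn = 50, fp = 50)

# Quantifiable cost; the social cost of the false negatives comes on top
known_cost <- sum(counts[names(costs)] * costs)
```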

Figure 16. Aggregated costs and benefits

Cost Benefit for Business as Usual vs. Data-Driven Approach


When the rates of true positives, true negatives, false positives, and false negatives are compared, the data-driven approach is more effective, saving L&I money and avoiding the high social cost of false negatives.

5.0 Application and Interface

Our web-application allows L&I inspectors to identify parcels that have a high illegal rental risk score. Inspectors can open a map of Philadelphia and parcels with “high scores” will populate the map. The inspector can filter the parcels by the risk score so that they can prioritize and focus on inspecting homes with the highest risk. We chose to focus on Gray’s Ferry for this demonstration because it has the highest rate of illegal rentals.

Users can click on a parcel to get more information about the unit such as the risk score, unique parcel number (OPA account number), housing characteristics, legal and violation history, and inspection history. Because inspectors should prioritize inspection on properties that have been inspected less recently, we also included a slider that filters units according to last inspected date. Lastly, an interactive histogram changes to reflect the risk scores for the screen extent, so that inspectors can see if there are clusters of high priority risk scores and can efficiently prioritize not just individual units but also entire neighborhoods or areas.

Click here to access the app

6.0 Conclusions

There is a growing body of work quantifying the far-reaching implications of formal evictions; however, illegal evictions remain an underreported phenomenon that impacts already disenfranchised communities. If Philadelphia wants to address its growing housing instability and curb evictions, the city must curb illegal rentals before they turn into illegal evictions. Given L&I’s limited bandwidth, a data-driven approach to inspections provides the city with best practices on how to efficiently utilize limited funds and resources, while also increasing housing security for Philadelphia’s most vulnerable residents.

7.0 Code Appendix

The following shows the code used in this analysis. Section 7.1 covers data wrangling, feature engineering, and modeling. Section 7.2 covers the visuals created for this analysis.

7.1 Data Wrangling, Feature Engineering, and Modeling

First we set up our workspace.

#load libraries
require(ggthemes)
require(IDPmisc)
library(corrplot)
library(stargazer)
library(sjPlot)
library(data.table)
library(ggplot2)
library(ggmap)
library(maps)
library(sp)
library(raster)
library(lattice)
library(latticeExtra)
library(RColorBrewer)
library(rasterVis)
library(dplyr)
library(stringr)
library(tidyverse)
library(tidyr)
library(stats)
library(grid)
library(gridExtra)
library(rgdal)
library(sf)
library(caret)
library(tidycensus)
library(FNN)
library(Rcpp)
library(viridisLite)
library(viridis)
library(jsonlite)
library(stringi)
library(downloader)
library(lubridate)
library(spdep)
library(randomForest)
library(kableExtra)


#remove scientific notation
options(scipen=999)

Next we began reading in our datasets, cleaning each dataset, and engineering features.

##### OPA DATA #####

#read data

opa_orig<-fread('opa_properties_public.csv', header=TRUE, sep=",", colClasses='numeric')

#make copy
opafull<-opa_orig

#look for empties and NA's
colSums(opafull=='')
colSums(is.na(opafull))

#first cleaning of unnecessary columns
opafull <- subset(opa_orig,select=-c(assessment_date,beginning_point,book_and_page,building_code_description,
                                  category_code_description, cross_reference, geographic_ward,
                                  house_number,market_value_date,objectid, parcel_shape, recording_date,
                                  registry_number, street_designation,street_direction,suffix,year_built_estimate,
                                  date_exterior_condition, mailing_address_1, mailing_address_2,mailing_care_of,
                                  building_code,category_code, owner_2,street_code))

sort(names(opafull))

#distribution of NA values across the dataset
colSums(is.na(opafull))

#distribution of empty values across the dataset
colSums(opafull=='')


##Narrow down opa dataset to possible rental houses

#by zoning code (those that begin with R)
opa1<- opafull %>% filter(str_detect(zoning, '^R'))

#by homestead exempt homes
sum(is.na(opa1$homestead_exemption)) #1270 NA
sum(opa1$homestead_exemption=='') #NA empties
#0 not exempt; 1 is exempt(=homeowners)
table(opa1$homestead_exemption)
#filter by 0, which are the not exempt units; 
#consider NA's which we do not know, but assume is a non-exempt/rental
opa1 <- opa1 %>% filter(opa1$homestead_exemption == 0 | is.na(opa1$homestead_exemption))
table(opa1$homestead_exemption)
sum(is.na(opa1$homestead_exemption))

#OPA dataset cut down from 580k to 510k to 305k


#####################################################################################

#wrangling variables with empties

#basements
table(opa1$basements)

#create variables for basement types
opa1$basement_dummy <- ifelse(opa1$basements=='ZZ',1,0)
opa1$basement_none <- ifelse(opa1$basements=='No_Basement',1,0)
opa1$basement_full <- ifelse(opa1$basements == 'Full' , 1,0)
opa1$basement_partial <- ifelse(opa1$basements == 'Partial', 1,0)
opa1$basement_sizeNotKnown <- ifelse(opa1$basements == 'Size_Unknown', 1,0)

opa1$basement_dummy <- as.factor(opa1$basement_dummy)
opa1$basement_none <- as.factor(opa1$basement_none)
opa1$basement_full <- as.factor(opa1$basement_full)
opa1$basement_partial <- as.factor(opa1$basement_partial)
opa1$basement_sizeNotKnown <- as.factor(opa1$basement_sizeNotKnown)


#central_air
table(opa1$central_air)

#create variables for central air
opa1$central_air_dummy <- ifelse(opa1$central_air_new=='ZZ',1,0)
opa1$central_air_y <- ifelse(opa1$central_air_new==1,1,0)
opa1$central_air_n <- ifelse(opa1$central_air_new==0,1,0)

opa1$central_air_dummy <- as.factor(opa1$central_air_dummy)
opa1$central_air_y <- as.factor(opa1$central_air_y)
opa1$central_air_n <- as.factor(opa1$central_air_n)

#unfinished (u = unfinished/buildings under construction)
table(opa1$unfinished)

#create variables for unfinished buildings
opa1$unfinished_unit<-ifelse(opa1$unfinished=='',0,1)
opa1$unfinished_unit<-as.factor(opa1$unfinished_unit)


#unit (specific unit in condo, building complex, etc, where there are subproperties)
table(opa1$unit)

#create variable for unit type
#1-is a subunit; 0-not a subunit
opa1$unit_new <- ifelse(opa1$unit=='',0,1)
opa1$unit_new <- as.factor(opa1$unit_new)

#utility
table(opa1$utility)

#create variables for utility type
#1 - remodeled/modernized; 0 - no remodeling
opa1$utility_new <- ifelse(opa1$utility=='None',0,1)
opa1$utility_new <- as.factor(opa1$utility_new)

#garage type
table(opa1$garage_type)

#create variables for garage type
opa1$garage_type_dummy <- ifelse(opa1$garage_type_new=='ZZ',1,0)
opa1$garage_type_y <- ifelse(opa1$garage_type_new=='Has_Garage',1,0)
opa1$garage_type_n <- ifelse(opa1$garage_type_new=='None',1,0)

opa1$garage_type_dummy <- as.factor(opa1$garage_type_dummy)
opa1$garage_type_y <- as.factor(opa1$garage_type_y)
opa1$garage_type_n <- as.factor(opa1$garage_type_n)


#general construction type
table(opa1$general_construction)

#create variable for construction type
opa1$general_construction_dummy<-ifelse(opa1$general_construction_new=='ZZ',1,0)
opa1$general_construction_brickY<-ifelse(opa1$general_construction_new=='Common_Brick',1,0)
opa1$general_construction_brickN<-ifelse(opa1$general_construction_new=='Not_Common_Brick',1,0)

opa1$general_construction_dummy <- as.factor(opa1$general_construction_dummy)
opa1$general_construction_brickY <- as.factor(opa1$general_construction_brickY)
opa1$general_construction_brickN <- as.factor(opa1$general_construction_brickN)


#house_extension
table(opa1$house_extension)

#create variable for house extension
opa1$house_extension_n<-ifelse(opa1$house_extension_new=='No_Extension',1,0)
opa1$house_extension_n <- as.factor(opa1$house_extension_n)


#heater type
table(opa1$type_heater)

#heater Yes/No
opa1$type_heater_yn <- ifelse(opa1$type_heater=='','ZZ', 
                              ifelse(opa1$type_heater=='0','NoHeater','HasHeater'))


#other building
table(opa1$other_building)

#create variable for other building
opa1$other_bldg_yn <- ifelse(opa1$other_building=='Y',1,0)
opa1$other_bldg_yn <- as.factor(opa1$other_bldg_yn)

#wrangling variables with NA's from the OPA dataset

#remove one row with several NA columns throughout - this should remove only one row
opa1 <- opa1[!is.na(opa1$number_of_bedrooms),]

#check
colSums(is.na(opa1))
colSums(opa1=='')


#exempt building; assuming NA is not exempt
table(opa1$exempt_building)

#make variable for exempt building
opa1$exempt_building_dummy <- ifelse(is.na(opa1$exempt_building),1,0)
opa1$exempt_building_y <- ifelse(is.na(opa1$exempt_building),0, ifelse(opa1$exempt_building==0,0,1))

opa1$exempt_building_dummy <- as.factor(opa1$exempt_building_dummy)
opa1$exempt_building_y <- as.factor(opa1$exempt_building_y)


#exempt land; assuming NA is not exempt
table(opa1$exempt_land)

#make variable for exempt land
opa1$exempt_land_dummy <- ifelse(is.na(opa1$exempt_land),1,0)
opa1$exempt_land_y <- ifelse(is.na(opa1$exempt_land),0, ifelse(opa1$exempt_land==0,0,1))

opa1$exempt_land_dummy<-as.factor(opa1$exempt_land_dummy)
opa1$exempt_land_y<-as.factor(opa1$exempt_land_y)


#taxable building; assuming NA is not taxable
table(opa1$taxable_building)

#make variable for taxable building
opa1$taxable_building_dummy <- ifelse(is.na(opa1$taxable_building),1,0)
opa1$taxable_building_y <- ifelse(is.na(opa1$taxable_building),0, ifelse(opa1$taxable_building==0,0,1))

opa1$taxable_building_dummy<-as.factor(opa1$taxable_building_dummy)
opa1$taxable_building_y<-as.factor(opa1$taxable_building_y)


#taxable land; assuming NA is not taxable
table(opa1$taxable_land)

#make variable for taxable land
opa1$taxable_land_dummy <- ifelse(is.na(opa1$taxable_land),1,0)
opa1$taxable_land_y <- ifelse(is.na(opa1$taxable_land),0, ifelse(opa1$taxable_land==0,0,1))

opa1$taxable_land_dummy<-as.factor(opa1$taxable_land_dummy)
opa1$taxable_land_y<-as.factor(opa1$taxable_land_y)

#number of stories
table(opa1$number_stories)

#make variables for number of stories
#compare numerically: the string comparison >= "4" would misclassify e.g. "10"
opa1$number_stories_num<-as.numeric(as.character(opa1$number_stories))
opa1$story1<-ifelse(opa1$number_stories_num == 1, 1,0)
opa1$story2<-ifelse(opa1$number_stories_num == 2, 1,0)
opa1$story3<-ifelse(opa1$number_stories_num == 3, 1,0)
opa1$story4plus<-ifelse(opa1$number_stories_num >= 4, 1,0)

opa1$story1<-as.factor(opa1$story1)
opa1$story2<-as.factor(opa1$story2)
opa1$story3<-as.factor(opa1$story3)
opa1$story4plus<-as.factor(opa1$story4plus)

#year built; **remove NA's later (only 3)
class(opa1$year_built)
opa1$year_built_copy<-as.numeric(as.character(opa1$year_built))

#make variables for year built
opa1$built_pre1900<-ifelse(opa1$year_built_copy <= 1900, 1,0) 
opa1$built_btw1900and1950<-ifelse(opa1$year_built_copy > 1900 & opa1$year_built_copy < 1950, 1,0)
opa1$built_post1950<-ifelse(opa1$year_built_copy >= 1950, 1,0)

opa1$built_pre1900<-as.factor(opa1$built_pre1900)
opa1$built_btw1900and1950<-as.factor(opa1$built_btw1900and1950)
opa1$built_post1950<-as.factor(opa1$built_post1950)

#exterior condition; aggregate into avg/not avg;
#assumes NA are not avg (21,814 NA's)
table(opa1$exterior_condition)
sum(is.na(opa1$exterior_condition))

opa1$exterior_condition_new <- opa1$exterior_condition
opa1$exterior_condition_new <- ifelse(is.na(opa1$exterior_condition_new),999,opa1$exterior_condition_new)
opa1$exterior_condition_new <- as.factor(opa1$exterior_condition_new)

#make variables for exterior condition
opa1$extcond_dummy <- ifelse(opa1$exterior_condition_new==999,1,0)
opa1$extcond_avg <- ifelse(opa1$exterior_condition_new==2 | 
                           opa1$exterior_condition_new==3 | 
                           opa1$exterior_condition_new==4, 1,0)

opa1$extcond_dummy <- as.factor(opa1$extcond_dummy)
opa1$extcond_avg <- as.factor(opa1$extcond_avg)


#interior condition
#assumes NA are not avg (22,059 NA's)
table(opa1$interior_condition)
sum(is.na(opa1$interior_condition))

opa1$interior_condition_new <- opa1$interior_condition
opa1$interior_condition_new <- ifelse(is.na(opa1$interior_condition_new),999,opa1$interior_condition_new)
opa1$interior_condition_new <- as.factor(opa1$interior_condition_new)

#make variables for interior condition
opa1$intcond_dummy <- ifelse(opa1$interior_condition_new==999,1,0)
opa1$intcond_avg <- ifelse(opa1$interior_condition_new==2 | 
                           opa1$interior_condition_new==3 | 
                           opa1$interior_condition_new==4, 1,0)

opa1$intcond_dummy <- as.factor(opa1$intcond_dummy)
opa1$intcond_avg <- as.factor(opa1$intcond_avg)

#garage spaces
class(opa1$garage_spaces)
table(opa1$garage_spaces)

#subdivide garage spaces
opa1$garage_spaces_temp <-opa1$garage_spaces
opa1$garage_spaces_temp <- ifelse(is.na(opa1$garage_spaces_temp),999,
                                  ifelse(opa1$garage_spaces_temp>=4,4,opa1$garage_spaces_temp))

#make variables for garage spaces
opa1$garagespace0<-ifelse(opa1$garage_spaces_temp == 0, 1,0)
opa1$garagespace1<-ifelse(opa1$garage_spaces_temp == 1, 1,0)
opa1$garagespace2<-ifelse(opa1$garage_spaces_temp == 2, 1,0)
opa1$garagespace3<-ifelse(opa1$garage_spaces_temp == 3, 1,0)
opa1$garagespace4plus<-ifelse(opa1$garage_spaces_temp >= 4, 1,0)

opa1$garage_spaces_temp <- as.factor(opa1$garage_spaces_temp)
opa1$garagespace0 <- as.factor(opa1$garagespace0)
opa1$garagespace1 <- as.factor(opa1$garagespace1)
opa1$garagespace2 <- as.factor(opa1$garagespace2)
opa1$garagespace3 <- as.factor(opa1$garagespace3)
opa1$garagespace4plus <- as.factor(opa1$garagespace4plus)

#1-there is garage; 0-no garage, assuming NA is no garage
opa1$garage_yn <- ifelse(is.na(opa1$garage_spaces) | opa1$garage_spaces==0,0,1)
opa1$garage_yn <- as.factor(opa1$garage_yn)

table(opa1$garage_yn)

#Miscellaneous columns

opa1$depth_copy <- as.numeric(as.character(opa1$depth))
opa1$depth_dummy <- ifelse(opa1$depth==0,1,0)
opa1$depth_dummy <- as.factor(opa1$depth_dummy)

opa1$frontage_copy <- opa1$frontage
opa1$frontage_dummy <- ifelse(opa1$frontage==0,1,0)
opa1$frontage_dummy <- as.factor(opa1$frontage_dummy)

opa1$total_area_copy<-opa1$total_area
opa1$total_area_dummy<-ifelse(opa1$total_area==0,1,0)
opa1$total_area_dummy <- as.factor(opa1$total_area_dummy)

opa1$total_livable_area_copy <- as.numeric(as.character(opa1$total_livable_area))
opa1$total_livable_area_dummy <- ifelse(opa1$total_livable_area==0,1,0)
opa1$total_livable_area_dummy <- as.factor(opa1$total_livable_area_dummy)

opa1$market_value_copy <- opa1$market_value
opa1$market_value_copy <- ifelse(is.na(opa1$market_value_copy),0,opa1$market_value_copy)
opa1$market_value_dummy <- ifelse(opa1$market_value_copy==0,1,0) 
opa1$market_value_dummy <- as.factor(opa1$market_value_dummy)


#MORE OPA FEATURE ENGINEERING

#Create absentee landlord variable (defined as outside of Philadelphia)
sum(is.na(opa1$mailing_city_state))
sum(opa1$mailing_city_state=='') #138666 missing

#0 - owner mailing address is in Philadelphia; 1 - owner is outside the city (absentee landlord)
#"PHILD" and "PHILK" catch common misspellings of Philadelphia in the mailing data
opa1$absenteelandlord<-ifelse(str_detect(opa1$mailing_city_state,"^(PHILA|PHILD|PHILK)"), 0, 1)
opa1$absenteelandlord <- as.factor(opa1$absenteelandlord)
table(opa1$absenteelandlord)


#further break down into people that are not in Pennsylvania or New Jersey
#1-out of state; 0-in state
opa1$outofstateLL<-ifelse(grepl("PA", opa1$mailing_city_state), 0, ifelse(grepl("NJ", opa1$mailing_city_state), 0, 1))
#1-out of philly, 0-in philly
opa1$outofPhillyLL<-ifelse(grepl("PHILA", opa1$mailing_city_state), 0,1)

#make into factors
opa1$outofstateLL<-as.factor(opa1$outofstateLL)
opa1$outofPhillyLL<-as.factor(opa1$outofPhillyLL)


#check final opa dataset
colSums(is.na(opa1))
colSums(opa1=='')



#subset to useful columns for final data modeling

#by removing edited/feature engineered columns
opa_clean <- subset(opa1,select=-c(basements,central_air,garage_type,unfinished,unit,utility,general_construction,
                                   house_extension,site_type,topography,type_heater,other_building,view_type,
                                   exempt_building,exempt_land,fireplaces,taxable_building,taxable_land,
                                   year_built,depth,frontage,number_of_bedrooms, number_of_rooms,number_stories,
                                   number_of_bathrooms,quality_grade,exterior_condition,interior_condition,
                                   garage_spaces,homestead_exemption,fuel,separate_utilities,sewer,
                                   mailing_street,market_value))

colSums(opa_clean=='')
colSums(is.na(opa_clean))

opa_clean$parcel_number_copy <- as.numeric(opa1$parcel_number)

#copy for knn later
opa_final<-opa_clean


##### VIOLATIONS DATA #####


#read data
violations<-read.csv('li_violations.csv')

#make copy
violations2 <- violations

#check empties and NA
colSums(is.na(violations2))
colSums(violations2=='')

#first pass: drop unnecessary columns
violations2<-subset(violations2,select=-c(geocode_x,geocode_y,objectid,casenumber,apfailkey,
                                         the_geom_webmercator))

unlist(lapply(violations2,class))

#limit violations to 2016/17/18
violations2$year<-substring(violations2$violationdate,1,4)
violations_clean<-subset(violations2, violations2$year=="2016" | violations2$year=='2017' | violations2$year=='2018')

#cuts down violations 1,319,904 to 340,684

#create binary for illegal rental -
#### 9-3902(1) = 1 and 2 family homes, 9-3902(2) = Multi-family homes, 9-3902(3) = rooming unit or house, 9-3902(4) = hotel
violations_clean$illegalRental <- ifelse(violations_clean$violationtype == '9-3902 (1)' | 
                                           violations_clean$violationtype == '9-3902 (2)' |
                                           violations_clean$violationtype == '9-3902 (3)',1,0)

#create binary for single family illegal rentals
violations_clean$SFIllegalRent<-ifelse(violations_clean$violationtype == '9-3902 (1)', 1,0)
violations_clean$SFIllegalRent<-as.factor(violations_clean$SFIllegalRent)

#create binary for multi-family illegal rentals
violations_clean$MFIllegalRent<-ifelse(violations_clean$violationtype == '9-3902 (2)', 1,0)
violations_clean$MFIllegalRent<-as.factor(violations_clean$MFIllegalRent)

#create binary for rooming unit or house illegal rental
violations_clean$RoomIllegalRent<-ifelse(violations_clean$violationtype == '9-3902 (3)', 1,0)
violations_clean$RoomIllegalRent<-as.factor(violations_clean$RoomIllegalRent)

#violation status
#table(violations_clean$status)
#table(violations_clean$casestatus)
violations_clean$issue_fixed<-ifelse(violations_clean$status=="COMPLIED",1, ifelse(violations_clean$status=="RESOLVE", 1,0))
violations_clean$active_case<-ifelse(violations_clean$casestatus=="CLOSED", 0, 1)

violations_clean$issue_fixed<-as.factor(violations_clean$issue_fixed)
violations_clean$active_case<-as.factor(violations_clean$active_case)


#Number of Previous Violations

violation_count<-violations_clean%>% 
  group_by(opa_account_num)%>%
  summarize(viol_count=n())

violation_count_other<-violations_clean %>%
  filter(violationtype != '9-3902 (1)')%>%
  filter(violationtype != '9-3902 (2)')%>%
  filter(violationtype != '9-3902 (3)')%>%
  group_by(opa_account_num) %>%
  summarize(other_violations=n())

#violation_count_other<-violation_count_other[!is.na(violation_count_other$opa_account_num),]

violations_clean$opa_account_num_copy<-as.numeric(as.character(violations_clean$opa_account_num))
violation_count$opa_account_num_copy<-as.numeric(as.character(violation_count$opa_account_num))
violation_count_other$opa_account_num_copy<-as.numeric(as.character(violation_count_other$opa_account_num))

violations_clean <- left_join(violations_clean,violation_count, by='opa_account_num_copy')
violations_clean <- left_join(violations_clean,violation_count_other, by='opa_account_num_copy')

violations_clean$other_violations<-as.numeric(violations_clean$other_violations)
violations_clean$other_violations<-ifelse(!is.na(violations_clean$other_violations),
                                          violations_clean$other_violations, 0)


#Landlord with previous violations
landlord_count<-violations_clean%>% 
  group_by(ownername)%>%
  summarize(owner_count=n())
landlord_count <- landlord_count[-c(1), ]

violations_clean <- left_join(violations_clean,landlord_count, by='ownername')

#flag records with no owner name on record, then assume they have 1 owner only
#(the dummy must be computed before the NAs are overwritten)
violations_clean$owner_count_dummy<-ifelse(is.na(violations_clean$owner_count),1,0)
violations_clean$owner_count_dummy<-as.factor(violations_clean$owner_count_dummy)
violations_clean$owner_count<-ifelse(is.na(violations_clean$owner_count),1,violations_clean$owner_count)

#Type of violation
#table(violations_clean$violationdescription)
violations_clean$exterior<-ifelse(grepl("EXT", violations_clean$violationdescription), 1,0)
violations_clean$interior<-ifelse(grepl("INT", violations_clean$violationdescription), 1,0)


#Length of time since last violation (based on Jan 2019 date)

#make copy
violations_clean2<-violations_clean
#calculate time since last violation
violations_clean2$now<-as.Date('2019-01-31')
violations_clean2$violationsdate2<-as.Date(substring(violations_clean2$violationdate,1,10))

violations_clean3 <-
  violations_clean2 %>%
  mutate(now2 = ymd(now),
         then2 = ymd(violationsdate2))

#time length in days
violations_clean3$lentim_sincelast_viol <- violations_clean3$now2-violations_clean3$then2
class(violations_clean3$lentim_sincelast_viol)
violations_clean3$lentim_sincelast_viol <- as.numeric(as.character(violations_clean3$lentim_sincelast_viol))

#sum exterior, interior and illegalRental by opaaccount
violations_summed <-
  violations_clean3 %>%
  group_by(opa_account_num_copy) %>%
  summarise(had_extvio = sum(exterior),
            had_intvio = sum(interior),
            had_illegalRent = sum(illegalRental))
violations_summed <- violations_summed[!is.na(violations_summed$opa_account_num_copy), ]


violations_clean4<-left_join(violations_clean3,violations_summed, by='opa_account_num_copy')

#history of illegal rental and interior and exterior violations
#could be biased if a property was never cited for illegal rental

#hist_ columns as binary
#1-had prior history of illegal rental; 0-no history of illegal rental
violations_clean4$hist_illegalRent<-ifelse(violations_clean4$had_illegalRent==0,0,1)
#1-had prior history of external violation; 0-no history of external violation
violations_clean4$hist_extviol <- ifelse(violations_clean4$had_extvio==0,0,1)
#1-had prior history of internal violation; 0-no history of internal violation
violations_clean4$hist_intviol <- ifelse(violations_clean4$had_intvio==0,0,1)


#make numeric columns as factors since binary
violations_clean4$illegalRental<-as.factor(violations_clean4$illegalRental)
violations_clean4$exterior<-as.factor(violations_clean4$exterior)
violations_clean4$interior<-as.factor(violations_clean4$interior)
violations_clean4$hist_illegalRent<-as.factor(violations_clean4$hist_illegalRent)
violations_clean4$hist_extviol<-as.factor(violations_clean4$hist_extviol)
violations_clean4$hist_intviol<-as.factor(violations_clean4$hist_intviol)

#make binary for multiple units
#update status
violations_clean4$status2 <- sub("^$", "OPEN", violations_clean4$status)

violations_clean4$multipleunits<-ifelse(violations_clean4$unit=="",0,1)

##make binary for having had other violations (1 = at least one non-illegal-rental violation; 0 = none)
violations_clean4$had_anyviolation<-ifelse(violations_clean4$other_violations >0, 1,0)
table(violations_clean4$had_anyviolation)


#remove unnecessary columns
sort(names(violations_clean4))
colSums(is.na(violations_clean4))


vio_clean_full<-subset(violations_clean4, select=-c(the_geom, addresskey, zip, censustract, 
                                                    aptype, casegroup, casepriority, now, now2, 
                                                    then2, violationsdate2,prioritydesc,
                                                    caseresolutioncode,casestatus, status,unit))

#set column classes accordingly
unlist(lapply(vio_clean_full,class))

vio_clean_full$year<-as.factor(vio_clean_full$year)
vio_clean_full$status2<-as.factor(vio_clean_full$status2)
vio_clean_full$had_anyviolation<-as.factor(vio_clean_full$had_anyviolation) #not really needed
vio_clean_full$multipleunits<-as.factor(vio_clean_full$multipleunits)

vio_clean_full$mostrecentinsp2<-as.Date(vio_clean_full$mostrecentinsp)

#keep only one record per unique OPA account (the most recent inspection)
violations_clean_unique <-
  vio_clean_full %>%
  group_by(opa_account_num.x) %>%
  slice(which.max(mostrecentinsp2))

names(violations_clean_unique)
colSums(is.na(violations_clean_unique)) #1 NA - can remove later
#violations_clean_unique<- violations_clean_unique[!is.na(violations_clean_unique$opa_account_num),]

#copy for knn later
violations_final <- violations_clean_unique


##############################################################################################

##### BUSINESS LICENSES #####

#read data
business_orig<- st_read('li_business_licenses.shp')

#make copy
business<-business_orig
sort(names(business))


#create ACTIVE / INACTIVE column
business$ActiveInactive <- ifelse(business$licensesta=='Active','Active','Inactive,Expired,Closed,Revoked')

#subset to RENTAL and HIGH RISE license types
business2 <- subset(business, business$licensetyp == 'Rental' | business$licensetyp == 'High Rise')

#make LAT and LNG coordinates from geometry
business2$lng <- st_coordinates(business2)[,1]
business2$lat <- st_coordinates(business2)[,2]

#check type
class(business2) #sf;data.frame

#check for NA's
sum(is.na(business2$opaaccount)) #4237 NA's
#cut out NA's
business2<-business2[!is.na(business2$opaaccount),]

class(business2$opaaccount) #factor
business2$opaaccount_Copy <- as.numeric(as.character(business2$opaaccount))
class(business2$opaaccount_Copy) #numeric

class(business2$fulladdres) #factor
business2$fulladdres_Copy <- as.character(business2$fulladdres)
class(business2$fulladdres_Copy) #character

class(business2) #sf; data.frame
business2_df <- as.data.frame(business2)
class(business2_df) #data.frame

table(droplevels(business2_df$licensetyp)) # 207596 rentals; 613 high rise
table(business2_df$ActiveInactive)         # 883639 active; 124570 inactive (~12% inactive)
table(business2_df$legalentit)             # 66433 company; 141577 individual

#feature engineering
#Number of previous licenses
business_temp <- 
  business2_df %>%
  group_by(fulladdres) %>%
  summarise(num_prev_lic=n())

business2_df<- left_join(business2_df,business_temp, by='fulladdres')

#Active or inactive based on most recent license date
business3 <- business2_df
business3$mostrecent2<- as.Date(as.character(business3$mostrecent))
business3$status_mostrecent <- ifelse (business3$mostrecent2 < as.Date('2018-01-31'), 'inactive','active')

#Length of time since last license if most recent is inactive
business3$license_year <- format(as.Date(business3$mostrecent, format="%d/%m/%Y"),"%Y")

class(business3$license_year) #character
business3$license_year <- as.integer(business3$license_year)
class(business3$license_year) #integer

business3$time_since_last <- ifelse(business3$num_prev_lic == 1 & business3$status_mostrecent == 'active', "only one active",
                                    ifelse (business3$num_prev_lic == 1 & business3$status_mostrecent == 'inactive', 2018-business3$license_year, 
                                            ifelse(business3$num_prev_lic > 1 & business3$status_mostrecent == 'active', 'newlicense',
                                                   ifelse(business3$num_prev_lic > 1 & business3$status_mostrecent == 'inactive', 2018-business3$license_year, 00000))))

business3_noDup_mostrec <-
  business3 %>%
  group_by(opaaccount_Copy) %>%
  slice(which.max(as.Date(mostrecent2)))

#check for consistency
#temp<-subset(business3_noDup_mostrec,select=c('ActiveInactive','status_mostrecent'))
table(business3_noDup_mostrec$ActiveInactive)
table(business3_noDup_mostrec$status_mostrecent)

names(business3_noDup_mostrec)

business3_noDup_mostrec$activitystat<-ifelse(business3_noDup_mostrec$ActiveInactive=="Active",1,0)
business3_noDup_mostrec$activitystat<-as.factor(business3_noDup_mostrec$activitystat)
class(business3_noDup_mostrec$activitystat)

#clean columns again
license_final<-subset(business3_noDup_mostrec, select=-c(objectid,licensenum,revenuecod,
                                                         censustrac,legalfirst,legallastn,
                                                         business_n,eclipse_ad,licensesta,
                                                         fulladdres,zip,ActiveInactive))


#variable classes to change
class(license_final$num_prev_lic)
class(license_final$mostrecent2)
class(license_final$status_mostrecent)
class(license_final$license_year)
class(license_final$time_since_last)
class(license_final$numberofun)

license_final$status_mostrecent<-as.factor(license_final$status_mostrecent)
license_final$time_since_last<-as.factor(license_final$time_since_last)
license_final$numberofun<-as.factor(license_final$numberofun)

#fixed effect for unknown inactive date
license_final$ZZinactivedate<-ifelse(is.na(license_final$inactiveda), 1,0)
license_final$ZZinactivedate<-as.factor(license_final$ZZinactivedate)

license_final #cleaned business license data

#####   TAX DELINQUENCY   #####
tax_delinquency <- read.csv('tax_delinquency_clean.csv')
names(tax_delinquency)

#clean tax delinquency, remove columns that we don't need
tax_delinquency_clean<-subset(tax_delinquency,select=-c(index,objectid,zip_code,zip_4,co_owner,mailing_address,
                                                        mailing_city,mailing_state,mailing_zip))

names(tax_delinquency_clean)

#check class of columns
class(tax_delinquency_clean$owes)
class(tax_delinquency_clean$num_years_owed)
class(tax_delinquency_clean$total_due)
tax_delinquency_clean$owes<-as.factor(tax_delinquency_clean$owes)

#####   CENSUS DATA   #####

#read in census data
demos<-read.csv('censuscombined.csv')
names(demos)
unlist(lapply(demos,class))

#remove tracts with no recorded white or black population (proxy for unpopulated tracts)
demos <- subset(demos, demos$pct_white!= 0 & demos$pct_black!=0)
#remove tracts where total population is 0
demos <- subset(demos, tot_population > 0)


#recalculate pct of each race accordingly
demos$pct_white <- demos$num_white/demos$tot_population
demos$pct_black <- demos$num_black/demos$tot_population
demos$pct_AmIndian <- demos$num_AmIndian/demos$tot_population
demos$pct_asian <- demos$num_asian/demos$tot_population
demos$pct_nativeHawaiian <- demos$num_nativeHawaiian/demos$tot_population
demos$pct_other <- demos$num_other/demos$tot_population
demos$pct_mixedrace <- demos$num_mixedrace/demos$tot_population
demos$pct_notwhite <- demos$num_notwhite/demos$tot_population
demos$pct_foreignborn <- demos$num_foreignborn/demos$tot_population
demos$pct_naturalizedcitizen <- demos$num_naturalizedcitizen/demos$tot_population
demos$pct_noncitizen <- demos$num_noncitizen/demos$tot_population
demos$pct_female_hh <- demos$tot_female_hh/demos$tot_population
demos$per_elevated <- demos$num_elevated/demos$tot_population
#numeric columns
demos$Pop_pov_Est <- as.numeric(demos$Pop_pov_Est) #integer to numeric
demos$Pop_below_pov <- as.numeric(demos$Pop_below_pov) #integer to numeric
#factors must go through as.character first, or as.numeric returns the level codes
demos$pct_pop_below <- as.numeric(as.character(demos$pct_pop_below)) #factor to numeric
demos$pct_rent_Inc <- as.numeric(as.character(demos$pct_rent_of_Inc))
demos$rent_burden <- as.numeric(demos$rent_burden) #integer to numeric
demos$num_screened <- as.numeric(demos$num_screened) #integer to numeric
demos$num_elevated <- as.numeric(demos$num_elevated) #integer to numeric
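The conversions above hinge on an R subtlety worth flagging: calling `as.numeric()` directly on a factor returns the internal level codes rather than the values. A minimal standalone illustration (toy vector, not the project data):

```r
#as.numeric() on a factor yields the level codes, silently corrupting the data
x <- factor(c("2.5", "10", "3.7"))
as.numeric(x)                #2 1 3 -- the level codes (levels sort as characters)
as.numeric(as.character(x))  #2.5 10.0 3.7 -- the actual values
```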

unlist(lapply(demos,class))

#remove unnecessary columns
demos <- subset(demos, select=-c(num_white,num_black,num_AmIndian,num_asian,
                                 num_nativeHawaiian,num_other,num_mixedrace,
                                 num_notwhite,num_foreignborn,num_naturalizedcitizen,
                                 num_noncitizen,tot_female_hh))

summary(demos)
unlist(lapply(demos,class))


#check where NAs were introduced by coercion
colSums(is.na(demos))

#fixed effect for lead NAs
demos$ZZleadelev<-ifelse(is.na(demos$num_elevated), 1,0)
demos$ZZleadelev<-as.factor(demos$ZZleadelev)

#rename census demos
#names(demos)[1]<-"FIPS"
demos$FIPS<-demos$tract #numeric


##################DOCKET############################

#join in docket data
phillyshape<-st_read('phillyproj2010.shp')
phillyshape <- phillyshape %>% st_set_crs(4326)
phillyshape <- st_transform(phillyshape, 4326)

evictionsshape<-st_read('evictions_clip.shp')
evictionsshape <- evictionsshape %>% st_set_crs(4326)
evictionsshape <- st_transform(evictionsshape, 4326)
#sanity check that the two layers overlap (the result is not stored)
st_intersection(phillyshape, evictionsshape)

evictionsdf2<-read.csv('docket_clean.csv')
names(evictionsshape)
names(evictionsdf2)

#join evictions shape and docket by id

class(evictionsshape$id) #factor
evictionsshape$id_copy<-as.character(evictionsshape$id)
class(evictionsshape$id_copy) #character

class(evictionsdf2$id)   #factor
evictionsdf2$id_copy<-as.character(evictionsdf2$id)
class(evictionsdf2$id_copy) #character

evictions_join<-left_join(evictionsshape,evictionsdf2, by = "id_copy")
names(evictions_join)


#CLEAN DOCKET DATA
evictions_clean<-subset(evictions_join, select=c(ARC_Addres, year.x,month.x,plaintiff.y,X,Y, 
                                                 eviction_count.y,case_count))
evictions_clean<-as.data.frame((evictions_clean))

#rename counts
names(evictions_clean)[2]<-"evict_year"
names(evictions_clean)[3]<-"evict_month"
names(evictions_clean)[7]<-"home_evict_count"
names(evictions_clean)[8]<-"LL_eviction_count"

sum(is.na(evictions_clean$LL_eviction_count))
sum(is.na(evictions_clean$home_evict_count))

#filter to eviction years 2016-2018
evictions_clean2<-subset(evictions_clean, evictions_clean$evict_year== 2016 | 
                           evictions_clean$evict_year== 2017 | 
                           evictions_clean$evict_year== 2018)

#remove complete duplicates
evictions_clean3 <- evictions_clean2[!duplicated(evictions_clean2), ]

#group by address and find eviction sums for each address
docket_count<-evictions_clean3%>% 
  group_by(ARC_Addres)%>%
  summarize(sum_home_evict = sum(home_evict_count),
            sum_LL_evict = sum(LL_eviction_count))

#join back to evictions_clean3
evictions_clean4 <- left_join(evictions_clean3,docket_count, by='ARC_Addres')

class(evictions_clean4$evict_year) #factor
evictions_clean4$evict_year_copy <- as.numeric(as.character(evictions_clean4$evict_year))

#keep one record per unique address (most recent eviction year)
evictions_unique <- 
  evictions_clean4 %>%
  group_by(ARC_Addres) %>%
  slice(which.max(evict_year_copy))


####KNN violations and licenses###
#KNN FUNCTION
nn_function <- function(measureFrom,measureTo,k) {
  
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
  
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint)
  
  return(output)  
}
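As a quick sanity check, `nn_function` can be exercised on a toy pair of coordinate matrices (this snippet is illustrative only and assumes `FNN`, `dplyr`, `tidyr`, and `tibble` are loaded, as elsewhere in this appendix):

```r
#toy example: two origin points and two target points on a unit square
from <- matrix(c(0, 0,
                 1, 1), ncol = 2, byrow = TRUE)
to   <- matrix(c(0, 1,
                 1, 0), ncol = 2, byrow = TRUE)
#each origin is exactly distance 1 from both targets,
#so the mean distance to the k = 2 nearest neighbors is 1 for each row
nn_function(from, to, 2)
```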

#KNN
#create clean sf without nas in geometry
license_final2<- license_final[!is.na(license_final$lat), ]
license.sf<-st_as_sf(license_final2, coords=c('lng','lat'),crs=4326)

rentals_geo<-opa_final[!is.na(opa_final$lat), ]
names(rentals_geo)
sum(is.na(rentals_geo$lat))
rentals.sf<-st_as_sf(rentals_geo, coords=c('lng','lat'), crs=4326)

violations_geo<-violations_final[!is.na(violations_final$lat), ]
violations.sf<-st_as_sf(violations_geo, coords=c('lng','lat'), crs=4326)

#pull the coordinates
rentals.xy <- 
  rentals.sf %>%
  cbind(.,st_coordinates(rentals.sf))  %>%
  st_set_geometry(NULL) %>%
  dplyr::select(X,Y) %>%
  as.matrix()

license.xy <- 
  license.sf %>%
  cbind(.,st_coordinates(license.sf))  %>%
  st_set_geometry(NULL) %>%
  dplyr::select(X,Y) %>%
  as.matrix()

violations.xy <- 
  violations.sf %>%
  cbind(.,st_coordinates(violations.sf))  %>%
  st_set_geometry(NULL) %>%
  dplyr::select(X,Y) %>%
  as.matrix()

#add KNN as columns
knn_licenses_5<-nn_function(rentals.xy,license.xy,5)
rentals_geo <- cbind(rentals_geo, knn_licenses_5) 
names(rentals_geo)
colnames(rentals_geo)[colnames(rentals_geo)=="pointDistance"] <- "knn_lic_5"

knn_violations_5<-nn_function(rentals.xy,violations.xy,5)
rentals_geo<-cbind(rentals_geo, knn_violations_5)
colnames(rentals_geo)[colnames(rentals_geo)=="pointDistance"] <- "knn_viol_5"

names(rentals_geo)
opa_final2<-as.data.frame(rentals_geo)



opa_final2        #opa - housing characteristics
license_final     #business licenses

With each dataset cleaned, we joined them together and engineered a few additional features.
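Each of the joins below follows the same pattern: coerce the key column in both tables to a common numeric class, `left_join`, then count the non-matches. A toy sketch of that pattern (hypothetical data frames, not the project datasets):

```r
library(dplyr)

#toy stand-ins for the OPA and business license tables
opa_toy     <- data.frame(parcel_number = factor(c("101", "102", "103")))
license_toy <- data.frame(opaaccount = factor(c("102", "103")),
                          licensetyp = "Rental")

#harmonize key classes before joining (factors go through as.character first)
opa_toy$parcel_number_copy  <- as.numeric(as.character(opa_toy$parcel_number))
license_toy$opaaccount_Copy <- as.numeric(as.character(license_toy$opaaccount))

join_toy <- left_join(opa_toy, license_toy,
                      by = c("parcel_number_copy" = "opaaccount_Copy"))
sum(is.na(join_toy$licensetyp))  #1 -- one parcel has no matching license
```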

### JOIN #1: OPA + BUSINESS ###

#check common column
names(opa_final2) #parcel_number_copy
names(license_final2) #opaaccount_copy

#check correct class before join
class(opa_final2$parcel_number_copy) #numeric
class(license_final2$opaaccount_Copy) #numeric

#make join
join1 <- left_join(opa_final2,license_final2,by=c('parcel_number_copy'='opaaccount_Copy'))
names(join1)

#see how many didn't join
sum(is.na(join1$num_prev_lic)) #188,834 didn't join

#check correct join
join1 %>%
  dplyr::select(location,fulladdres_Copy) %>%
  head()


######### Additional FE column ##########

#1 - rental but no license; 0 - rental with license
#a rental with no license is any OPA record that didn't join to a business license
join1$rental_butNolic <- ifelse(is.na(join1$licensetyp), 1,0)
join1$rental_butNolic <- as.factor(join1$rental_butNolic)


##############################################################################################

### JOIN #2: OPA/BUSINESS + VIOLATIONS ###

#check common column
sort(names(join1)) #parcel_number_copy
names(violations_final) #opa_account_num_copy

#check correct class before join
class(join1$parcel_number_copy) #numeric
class(violations_final$opa_account_num_copy) #numeric

#make join
join2 <- left_join(join1,violations_final, by=c('parcel_number_copy' = 'opa_account_num_copy'))
names(join2)

#see how many didn't join
sum(is.na(join2$viol_count)) #247,043 didn't join



##############################################################################################

### JOIN #3: OPA/BUSINESS/VIOLATIONS + TAX DELINQUENCY ###

#check common column
sort(names(join2)) #parcel_number_copy
names(tax_delinquency_clean) #opa_number

#check correct class before join
class(join2$parcel_number_copy) #numeric
class(tax_delinquency_clean$opa_number) #integer
tax_delinquency_clean$opa_number_copy <- as.numeric(tax_delinquency_clean$opa_number)
class(tax_delinquency_clean$opa_number_copy) #numeric

#join by parcel_number_copy and opa_number_copy

#make join
join3 <- left_join(join2, tax_delinquency_clean, by=c('parcel_number_copy'= 'opa_number_copy'))
names(join3)

#see how many didn't join
sum(is.na(join3$owes)) #280,005 didn't join


##############################################################################################

### JOIN #4: OPA/BUSINESS/VIOLATIONS/TAX DELINQUENCY + CENSUS ###

# SPATIAL JOIN: join3 + census tracts #

#open census tracts shapefile
census_tracts<-st_read('Census_Tracts_2010.shp') 

#test
ggplot() + geom_sf(data=census_tracts)

#make join3 data frame in sf object
join3_geo<-st_as_sf(join3, coords=c('lng.x','lat.x'), crs=4326)
join3_geo

#check
head(join3$lat.x)
head(join3$lng.x)
head(join3_geo$geometry)

#spatial join census_tracts to join3_geo
join4<-st_join(join3_geo, census_tracts, join=st_within)


##############################################################################################

### JOIN #5: OPA/BUSINESS/VIOLATIONS/TAX DELINQUENCY/CENSUS_TRACTS + DEMOGRAPHIC INFO ###

#check common column
join4$FIPS_geoidCopy <- as.numeric(as.character(join4$GEOID10))
sort(names(join4)) #FIPS_geoidCopy
names(demos) #FIPS

#check correct class before join
class(join4$FIPS_geoidCopy) #numeric
class(demos$FIPS) #numeric

#make join
join5<-left_join(join4, demos, by=c("FIPS_geoidCopy" = "FIPS"))
names(join5)

#see how many didn't join
sum(is.na(join5$pct_white)) # 2,347 didn't join



##############################################################################################

### JOIN #6: OPA/BUSINESS/VIOLATIONS/TAX DELINQUENCY/CENSUS_TRACTS/DEMOGRAPHIC INFO + DOCKET ###

#check common column
sort(names(join5)) #location
names(evictions_unique) #ARC_Addres

#check correct class before join
class(join5$location) #factor
join5$location_copy <- as.character(join5$location)
class(join5$location_copy) #character

class(evictions_unique$ARC_Addres) #factor
evictions_unique$ARC_Addres_copy <- as.character(evictions_unique$ARC_Addres)
class(evictions_unique$ARC_Addres_copy) #character

#make join
join6 <-left_join(join5, evictions_unique, by=c("location_copy"="ARC_Addres_copy"))
names(join6)

#see how many didn't join
sum(is.na(join6$evict_year)) # 294,060  didn't join


##############################################################################################

### JOIN #7: OPA/BUSINESS/VIOLATIONS/TAX DELINQUENCY/CENSUS_TRACTS/DEMOGRAPHIC INFO/DOCKET ###
###          + PHILLY NEIGHBORHOODS

#Add in Philly neighborhoods shapefile
phillyneighbs<-st_read('Neighbs_4326.shp')

#clean
phillyneighbs<-subset(phillyneighbs,select=c(MAPNAME, geometry))

#test
ggplot() + geom_sf(data=phillyneighbs)

#remove unused geometry columns
join6_copy<-subset(join6,select=-c(geometry.y,geometry))
#reassign active geometry column
st_geometry(join6_copy)<-'geometry.x'
#remove unused geometry columns
join6_copy<-subset(join6_copy,select=-c(geometry))

#check
head(join6_copy$geometry.x)

#spatial join phillyneighbs to join6_copy
join7<-st_join(join6_copy, phillyneighbs, join=st_within)

Next, we cleaned the joined dataset to prepare it for exploratory analysis and model building.

##### CLEANING #####
colSums(is.na(join7))

#remove unnecessary columns first
data <- subset(join7, select=-c(geocode_y,geocode_x,street_add,opaaccount,business_m,lng.y,lat.y,fulladdres_Copy,
                                lng,opa_account_num.x,address,ownername,lat.x.x,opa_account_num.y,opa_number,
                                lat.y.y,lon,OBJECTID,STATEFP10,COUNTYFP10,TRACTCE10,NAME10,NAMELSAD10,MTFCC10,
                                FUNCSTAT10,ALAND10,AWATER10,LOGRECNO,FIPS_geoidCopy,tract,location_copy,ARC_Addres,
                                plaintiff.y,X,Y,street_address))


colSums(is.na(data))

#assume NA for illegalRental is 0; 
data$illegalRental_dv <- data$illegalRental
data$illegalRental_dv <- ifelse(is.na(data$illegalRental) | data$illegalRental=='0',0,1)

table(data$illegalRental_dv)
#0-302586;1-2225

colSums(is.na(data))
#make copy
data2<-data

#checked that all rows with NAs in these columns have illegalRental == 0
data2<-data2[!is.na(data2$off_street_open),]
data2<-data2[!is.na(data2$GEOID10),]

#second filtering of columns
data2<-subset(data2,select=-c(illegalRental,SFIllegalRent,MFIllegalRent,RoomIllegalRent,
                              mostrecent,initialiss,numberofun,legalname,evict_year,licensetyp))


colSums(is.na(data2))

data2$tot_pop2<-data2$tot_population
data2$tot_pop2<-ifelse(is.na(data2$tot_pop2),0,data2$tot_pop2)

data2$pct_white2<-data2$pct_white
data2$pct_white2<-ifelse(is.na(data2$pct_white2),0,data2$pct_white2)

data2$pct_black2<-data2$pct_black
data2$pct_black2<-ifelse(is.na(data2$pct_black2),0,data2$pct_black2)

data2$pct_AmIndian2<-data2$pct_AmIndian
data2$pct_AmIndian2<-ifelse(is.na(data2$pct_AmIndian2),0,data2$pct_AmIndian2)

data2$pct_asian2<-data2$pct_asian
data2$pct_asian2<-ifelse(is.na(data2$pct_asian2),0,data2$pct_asian2)

data2$pct_nativeHawaiian2<-data2$pct_nativeHawaiian
data2$pct_nativeHawaiian2<-ifelse(is.na(data2$pct_nativeHawaiian2),0,data2$pct_nativeHawaiian2)

data2$pct_other2<-data2$pct_other
data2$pct_other2<-ifelse(is.na(data2$pct_other2),0,data2$pct_other2)

data2$pct_mixedrace2<-data2$pct_mixedrace
data2$pct_mixedrace2<-ifelse(is.na(data2$pct_mixedrace2),0,data2$pct_mixedrace2)

data2$pct_notwhte2<-data2$pct_notwhte
data2$pct_notwhte2<-ifelse(is.na(data2$pct_notwhte2),0,data2$pct_notwhte2)

data2$pct_foreignborn2<-data2$pct_foreignborn
data2$pct_foreignborn2<-ifelse(is.na(data2$pct_foreignborn2),0,data2$pct_foreignborn2)

data2$pct_naturalizedcitizen2<-data2$pct_naturalizedcitizen
data2$pct_naturalizedcitizen2<-ifelse(is.na(data2$pct_naturalizedcitizen2),0,data2$pct_naturalizedcitizen2)

data2$pct_noncitizen2<-data2$pct_noncitizen
data2$pct_noncitizen2<-ifelse(is.na(data2$pct_noncitizen2),0,data2$pct_noncitizen2)

data2$Pop_pov_Est2<-data2$Pop_pov_Est
data2$Pop_pov_Est2<-ifelse(is.na(data2$Pop_pov_Est2),0,data2$Pop_pov_Est2)

data2$Pop_below_pov2<-data2$Pop_below_pov
data2$Pop_below_pov2<-ifelse(is.na(data2$Pop_below_pov2),0,data2$Pop_below_pov2)

data2$pct_pop_below2<-data2$pct_pop_below
data2$pct_pop_below2<-ifelse(is.na(data2$pct_pop_below2),0,data2$pct_pop_below2)

data2$pct_rent_Inc2<-data2$pct_rent_Inc
data2$pct_rent_Inc2<-ifelse(is.na(data2$pct_rent_Inc2),0,data2$pct_rent_Inc2)

data2$rent_burden2<-data2$rent_burden
data2$rent_burden2<-ifelse(is.na(data2$rent_burden2),0,data2$rent_burden2)

data2$pct_female_hh2<-data2$pct_female_hh
data2$pct_female_hh2<-ifelse(is.na(data2$pct_female_hh2),0,data2$pct_female_hh2)

data2$per_elevated2<-data2$per_elevated
data2$per_elevated2<-ifelse(is.na(data2$per_elevated2),0,data2$per_elevated2)

data2$ZZleadelev2<-data2$ZZleadelev
data2$ZZleadelev2<-ifelse(is.na(data2$ZZleadelev2) | data2$ZZleadelev2=='0','0','1')
data2$ZZleadelev2<-as.factor(data2$ZZleadelev2)

data2$home_evict_count2<-data2$home_evict_count
data2$home_evict_count2<-ifelse(is.na(data2$home_evict_count2),0,data2$home_evict_count2)

data2$LL_eviction_count2<-data2$LL_eviction_count
data2$LL_eviction_count2<-ifelse(is.na(data2$LL_eviction_count2),0,data2$LL_eviction_count2)
data2$LL_eviction_count2<-as.numeric(data2$LL_eviction_count2)

data2$sum_home_evict2_dummy<-ifelse(is.na(data2$sum_home_evict),1,0)
data2$sum_home_evict2_dummy<-as.factor(data2$sum_home_evict2_dummy)
data2$sum_home_evict2<-data2$sum_home_evict
data2$sum_home_evict2<-ifelse(is.na(data2$sum_home_evict2),0,data2$sum_home_evict2)

data2$sum_LL_evict2_dummy<-ifelse(is.na(data2$sum_LL_evict),1,0)
data2$sum_LL_evict2_dummy<-as.factor(data2$sum_LL_evict2_dummy)
data2$sum_LL_evict2<-data2$sum_LL_evict
data2$sum_LL_evict2<-ifelse(is.na(data2$sum_LL_evict2),0,data2$sum_LL_evict2)
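The block above repeats the same copy-then-impute pattern once per census variable. A more compact equivalent is sketched below; it assumes dplyr 1.0 or later and the column names used in this script, and it excludes tot_population, whose copy is named tot_pop2 rather than tot_population2.

```r
library(dplyr)

#census columns whose "2"-suffixed copies follow the "{column}2" naming pattern
census_cols <- c("pct_white","pct_black","pct_AmIndian","pct_asian",
                 "pct_nativeHawaiian","pct_other","pct_mixedrace","pct_notwhte",
                 "pct_foreignborn","pct_naturalizedcitizen","pct_noncitizen",
                 "Pop_pov_Est","Pop_below_pov","pct_pop_below","pct_rent_Inc",
                 "rent_burden","pct_female_hh","per_elevated")

#create the "2"-suffixed copies with NAs replaced by 0, in one call
data2 <- data2 %>%
  mutate(across(all_of(census_cols),
                ~ifelse(is.na(.x), 0, .x),
                .names = "{.col}2"))
```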


#remove columns
data3<- subset(data2, select=-c(tot_population,pct_white,pct_black,pct_AmIndian,pct_asian,
                                pct_nativeHawaiian,pct_other,pct_mixedrace,pct_notwhte,
                                pct_foreignborn,pct_naturalizedcitizen,pct_noncitizen, 
                                Pop_pov_Est,Pop_below_pov,pct_pop_below,pct_rent_Inc, 
                                rent_burden,pct_female_hh,num_elevated,num_screened,
                                per_elevated,pct_rent_of_Inc,home_evict_count,LL_eviction_count,
                                sum_home_evict,sum_LL_evict,ZZleadelev))


colSums(is.na(data3))

data3$owes2<-data3$owes
data3$owes2<-ifelse(is.na(data3$owes2),0,data3$owes2)
data3$owes2<-as.factor(data3$owes2)

data3$hist_extviol2<-data3$hist_extviol
data3$hist_extviol2<-ifelse(is.na(data3$hist_extviol2) | data3$hist_extviol2=='0','0','1')
data3$hist_extviol2<-as.factor(data3$hist_extviol2)

data3$hist_intviol2<-data3$hist_intviol
data3$hist_intviol2<-ifelse(is.na(data3$hist_intviol2) | data3$hist_intviol2=='0','0','1')
data3$hist_intviol2<-as.factor(data3$hist_intviol2)

data3$hist_illegalRent2<-data3$hist_illegalRent
data3$hist_illegalRent2<-ifelse(is.na(data3$hist_illegalRent2) | data3$hist_illegalRent2=='0','0','1')
data3$hist_illegalRent2<-as.factor(data3$hist_illegalRent2)

data3$had_anyviolation2<-data3$had_anyviolation
data3$had_anyviolation_dummy<-ifelse(is.na(data3$had_anyviolation2),1,0)
data3$had_anyviolation_dummy<-as.factor(data3$had_anyviolation_dummy)
data3$had_anyviolation2<-ifelse(is.na(data3$had_anyviolation2),'0',
                                ifelse(data3$had_anyviolation2=='1','1','0'))
data3$had_anyviolation2<-as.factor(data3$had_anyviolation2)

data3$other_violations2<-data3$other_violations
data3$other_violations2<-ifelse(is.na(data3$other_violations2),0,data3$other_violations2+1)

data3$issue_fixed_dummy<-ifelse(is.na(data3$issue_fixed),1,0)
data3$issue_fixed_dummy<-as.factor(data3$issue_fixed_dummy)

data3$issue_fixed_y<-ifelse(is.na(data3$issue_fixed),'0',
                            ifelse(data3$issue_fixed=='1','1','0'))
data3$issue_fixed_y<-as.factor(data3$issue_fixed_y)

data3$issue_fixed_n<-ifelse(is.na(data3$issue_fixed),'0',
                            ifelse(data3$issue_fixed=='0','1','0'))
data3$issue_fixed_n<-as.factor(data3$issue_fixed_n)


data3$active_case_dummy<-ifelse(is.na(data3$active_case),1,0)
data3$active_case_dummy<-as.factor(data3$active_case_dummy)

data3$active_case_y<-ifelse(is.na(data3$active_case),'0',
                            ifelse(data3$active_case=='1','1','0'))
data3$active_case_y<-as.factor(data3$active_case_y)

data3$active_case_n<-ifelse(is.na(data3$active_case),'0',
                            ifelse(data3$active_case=='0','1','0'))
data3$active_case_n<-as.factor(data3$active_case_n)


data3$total_due_dummy<-ifelse(is.na(data3$total_due),1,0)
data3$total_due_dummy<-as.factor(data3$total_due_dummy)
data3$total_due2<-data3$total_due 
data3$total_due2<-ifelse(is.na(data3$total_due2),0,data3$total_due2)

data3$num_years_owed_dummy<-ifelse(is.na(data3$num_years_owed),1,0)
data3$num_years_owed_dummy<-as.factor(data3$num_years_owed_dummy)

#0 because there is nothing due
data3$num_years_owed2<-data3$num_years_owed 
data3$num_years_owed2<-ifelse(is.na(data3$num_years_owed2),0,data3$num_years_owed2)

data3$lentim_sincelast_viol_dummy<-ifelse(is.na(data3$lentim_sincelast_viol),1,0)
data3$lentim_sincelast_viol_dummy<-as.factor(data3$lentim_sincelast_viol_dummy)

#0 because never had violation
data3$lentim_sincelast_viol2<-data3$lentim_sincelast_viol 
data3$lentim_sincelast_viol2<-ifelse(is.na(data3$lentim_sincelast_viol2),0,data3$lentim_sincelast_viol2)

data3$viol_count_dummy<-ifelse(is.na(data3$viol_count),1,0)
data3$viol_count_dummy<-as.factor(data3$viol_count_dummy)

#0 because never had violation
data3$viol_count2<-data3$viol_count
data3$viol_count2<-ifelse(is.na(data3$viol_count2),0,data3$viol_count2)


data4 <- subset(data3, select=-c(owes,had_extvio,had_intvio,had_anyviolation,had_illegalRent,
                                 hist_illegalRent,hist_extviol,hist_intviol,
                                 exterior,interior,issue_fixed,active_case,owner,
                                 total_due,num_years_owed,lentim_sincelast_viol,
                                 viol_count,violationtype,other_violations))

colSums(is.na(data4))


#Length of time since last license if most recent is inactive
#NA's we assume the license is active
data4$time_since_last_dummy<-ifelse(is.na(data4$time_since_last),1,0)
data4$time_since_last_dummy<-as.factor(data4$time_since_last_dummy)
data4$time_since_last2<-data4$time_since_last
data4$time_since_last2<-ifelse(is.na(data4$time_since_last2),'only one active',data4$time_since_last2)
data4$time_since_last2<-as.factor(data4$time_since_last2)

data4$owner_count_dummy<-ifelse(is.na(data4$owner_count),1,0)
data4$owner_count_dummy<-as.factor(data4$owner_count_dummy)

#assumes NA is at least 1 owner
data4$owner_count<-ifelse(is.na(data4$owner_count),1,data4$owner_count)

#assumes NA is not a multiple unit
data4$multipleunits_dummy<-ifelse(is.na(data4$multipleunits),1,0)
data4$multipleunits_dummy<-as.factor(data4$multipleunits_dummy)

data4$multipleunits_y<-ifelse(is.na(data4$multipleunits),'0',
                              ifelse(data4$multipleunits=='1','1','0'))
data4$multipleunits_y<-as.factor(data4$multipleunits_y)


data4$ZZinactivedate<-ifelse(is.na(data4$ZZinactivedate),1,0)
data4$ZZinactivedate<-as.factor(data4$ZZinactivedate)

#number of previous business license
#dummy for unknown number of previous licenses
data4$num_prev_lic_dummy<-ifelse(is.na(data4$num_prev_lic),1,0)
data4$num_prev_lic_dummy<-as.factor(data4$num_prev_lic_dummy)
#assumes at least 1 if NA
data4$num_prev_lic<-ifelse(is.na(data4$num_prev_lic),1,data4$num_prev_lic)


data4$status_mostrecent_dummy<-ifelse(is.na(data4$status_mostrecent),1,0)
data4$status_mostrecent_dummy<-as.factor(data4$status_mostrecent_dummy)

data4$status_mostrecent_active<-ifelse(is.na(data4$status_mostrecent),'0',
                                       ifelse(data4$status_mostrecent=='active','1','0'))
data4$status_mostrecent_active<-as.factor(data4$status_mostrecent_active)
data4$status_mostrecent_inactive<-ifelse(is.na(data4$status_mostrecent),'0',
                                         ifelse(data4$status_mostrecent=='inactive','1','0'))
data4$status_mostrecent_inactive<-as.factor(data4$status_mostrecent_inactive)

#individual or company
data4$legalentit_dummy<-ifelse(is.na(data4$legalentit),1,0)
data4$legalentit_dummy<-as.factor(data4$legalentit_dummy)

data4$legalentit_indiv<-ifelse(is.na(data4$legalentit),'0',
                               ifelse(data4$legalentit=='Individual','1','0'))
data4$legalentit_indiv<-as.factor(data4$legalentit_indiv)

data4$legalentit_compny<-ifelse(is.na(data4$legalentit),'0',
                                ifelse(data4$legalentit=='Company','1','0'))
data4$legalentit_compny<-as.factor(data4$legalentit_compny)


data5<-subset(data4,select=-c(violationdescription,caseresolutiondate,violationdate, 
                              caseaddeddate,organization,time_since_last,mostrecent2,
                              inactiveda,license_year,councildis,mostrecentinsp,
                              owner_count_dummy,year,mostrecentinsp2,status2,legalentit,
                              status_mostrecent,multipleunits))

colSums(is.na(data5))

data6<-subset(data5,select=-c(evict_month,evict_year_copy,activitystat))

colSums(is.na(data6))

Next, we began building the models, using stepwise regression to help with model selection.

#the below shows the model-building process for the violations story only; the process is the same for all four stories

###stepwise regression
library(MASS)
library(leaps)
library(caret)

violationsMod<-glm(illegalRental_dv~., data=trainingViolations, family=binomial)


violationsStepMod<-stepAIC(violationsMod, direction = 'both', trace=FALSE)
summary(violationsStepMod)


##tuning the stepwise
violationModels <- regsubsets(illegalRental_dv~., data = trainingViolations, nvmax = 5, method = "seqrep") 
summary(violationModels)


# Set seed for reproducibility 
set.seed(123) 

# Set up repeated k-fold cross-validation 
train.control <- trainControl(method = "cv", number = 10) 

# Train the model 
Violation.step.model <- train(illegalRental_dv ~., data = trainingViolations, method = "leapSeq", tuneGrid = data.frame(nvmax = 1:5), trControl = train.control ) 
Violation.step.model$results


#Find best nvmax
Violation.step.model$bestTune

#Summary of the best model
summary(Violation.step.model$finalModel)

#find regression coefficients, meaning run a model with only the recommended variables
finalViolationModel<-glm(illegalRental_dv~issue_fixed_y+issue_fixed_dummy+viol_count2+hist_intviol2+hist_extviol2+owner_count+lentim_sincelast_viol2+active_case_y, data=trainingViolations, family=binomial)


summary(finalViolationModel)

####train and validate stories individually

trainingViolations_sample <-
  mutate(trainingViolations_sample, illegalRental_dv2 = ifelse(illegalRental_dv == 1, "illegal", "not_illegal"))


##caret train

##sensitivity = true positive
##specificity = true negative

ctrl <- trainControl(method = "cv", number = 50, classProbs=TRUE, summaryFunction=twoClassSummary) 

cvFitViolations<-train(illegalRental_dv2 ~ ., data = trainingViolations_sample %>%
                         dplyr::select(illegalRental_dv2,issue_fixed_y,issue_fixed_dummy,viol_count2,hist_intviol2,hist_extviol2,owner_count,
                                       lentim_sincelast_viol2,active_case_y), 
                       method='glm', family='binomial', metric='ROC', na.action=na.omit, trControl=ctrl)

cvFitViolations


#cross validation outputs
cvFitViolations$resample %>%
  dplyr::select(-Resample) %>%
  gather() %>%
  ggplot(., aes(value)) + 
  geom_histogram(bins=35) +
  facet_wrap(~key) +
  scale_x_continuous(limits = c(0, 1)) +
  labs(x="Goodness of Fit",
       y="Count",
       title="CV Goodness of Fit Metrics - Violations Story")

Next, because illegal rentals are a rare event that the models struggled to predict on the full dataset, we took a random sample of the data and used that smaller, more balanced dataset to train and test the models.

####make subsets for each of the stories

####create a smaller sample where non-illegal rentals are 4X illegal rentals
##create a subset of the data 
data7_IR<-subset(data7, data7$illegalRental_dv == 1)
table(data7_IR$illegalRental_dv)

data7_noIR<-subset(data7, data7$illegalRental_dv == 0)
table(data7_noIR$illegalRental_dv)

#select 8900 random rows from the non-illegal-rental subset (2225*4=8900)
random_0 <- data7_noIR[sample(nrow(data7_noIR), 8900), ]

#rbind 8900 rows of 0 to the 2225 rows of 1
newsample <- rbind(data7_IR,random_0)

#shuffle rows - generate a random ordering
set.seed(1) ## make reproducible here, but not if generating many random samples
rand <- sample(nrow(newsample))
newsample<-newsample[rand,]

houseCharacterStory_sample<-subset(newsample, select=c(illegalRental_dv,quality_grade_copy,central_air_dummy,central_air_y,central_air_n,
                                            unfinished_unit,general_construction_dummy,general_construction_brickY,
                                            general_construction_brickN,house_extension_n,type_heater_yn,type_heater_dummy,
                                            other_bldg_yn,exempt_building_dummy,exempt_building_y,exempt_land_dummy,
                                            exempt_land_y,taxable_building_dummy,taxable_building_y,taxable_land_dummy,
                                            taxable_land_y,bedrooms0,bedrooms1,bedrooms2,bedrooms3,bedrooms4,bedrooms5plus,
                                            rooms0to2,rooms3to5,rooms6to8,rooms9to11,bathrooms1,bathrooms2to3,bathrooms4plus,story1,
                                            story2,story3,story4plus,year_built_copy,extcond_dummy,extcond_avg,intcond_dummy,
                                            intcond_avg,depth_copy,depth_dummy,frontage_copy,frontage_dummy,total_area_copy,
                                            total_area_dummy,market_value_copy,market_value_dummy,multipleunits_dummy,multipleunits_n,multipleunits_y))



violationsStory_sample<-subset(newsample, select=c(illegalRental_dv, issue_fixed_y, issue_fixed_dummy, viol_count2, viol_count_dummy,
                                        hist_intviol2, hist_extviol2, owner_count, lentim_sincelast_viol2, active_case_y))


legalStory_sample<-subset(newsample, select=c(illegalRental_dv,absenteelandlord, outofPhillyLL, outofstateLL, owes2, status_mostrecent_inactive, status_mostrecent_active,  
                                   num_prev_lic,num_prev_lic_dummy, time_since_last2, time_since_last_dummy,rental_butNolic,
                                   ZZinactivedate))



demosGeoStory_sample<-subset(newsample, select=c(illegalRental_dv,pct_white2,pct_black2,pct_AmIndian2,pct_asian2,pct_nativeHawaiian2,
                                      pct_other2,pct_mixedrace2,pct_notwhte2, pct_foreignborn2,
                                      pct_naturalizedcitizen2,pct_noncitizen2,
                                      pct_pop_below2,rent_burden2,pct_female_hh2,per_elevated2,
                                      knn_lic_5, knn_viol_5, MAPNAME))

##the below shows the prediction steps for the violations story only; the process is the same for all four stories

###create training and test set for new sample
inTrain <- createDataPartition(
  y = violationsStory_sample$illegalRental_dv, 
  p = .75, list = FALSE)
trainingViolations_sample <- violationsStory_sample[ inTrain,] #the new training set
testViolations_sample <- violationsStory_sample[-inTrain,]  #the new test set

##final models with sample data
finalViolationModel<-glm(illegalRental_dv~issue_fixed_y+issue_fixed_dummy+viol_count2+hist_intviol2+hist_extviol2+owner_count+lentim_sincelast_viol2+active_case_y, data=trainingViolations_sample, family=binomial)
finalHouseCharacterModel<-glm(illegalRental_dv~extcond_avg+year_built_copy+story3+multipleunits_dummy+multipleunits_n, data=trainingHouseCharacter_sample, family=binomial)
finalLegalModel<-glm(illegalRental_dv~outofstateLL+owes2+num_prev_lic_dummy+num_prev_lic+time_since_last2+status_mostrecent_active,data=trainingLegal_sample, family=binomial)
finalDemosGeo<-glm(illegalRental_dv~knn_viol_5+knn_lic_5+per_elevated2+pct_black2, data=trainingDemosGeo_sample, family=binomial)

#### predict on sample data
#sample data
sample_test_violation <- predict(finalViolationModel, testViolations_sample, type='response')
testViolations_sample$classProbs <- sample_test_violation

sampletestProbs <- data.frame(Class=testViolations_sample$illegalRental_dv,
                              Probs=sample_test_violation)
head(sampletestProbs)

#distribution plot 
ggplot(sampletestProbs, aes(x = Probs, fill=as.factor(Class))) + 
  geom_density() +
  facet_grid(Class ~ .) +
  scale_fill_manual(values = c("dodgerblue4", "darkgreen"),
                    name = "",
                    labels = c("Not Illegal Rental","Illegal Rental")) +
  labs(x="Illegal Rental", y="Density of probabilities") +
  theme(strip.text.x = element_text(size = 18))


# distribution of the prediction score grouped by known outcome
ggplot( sampletestProbs, aes( Probs, color = as.factor(Class) ) ) + 
  geom_density( size = 1 ) +
  ggtitle( "Test Set's Predicted Score (Sample Data)" ) + 
  scale_fill_manual(values = c('dodgerblue4', 'darkgreen'),
                    name = "data", 
                    labels = c( "Not Illegal", "Illegal"))

The code below builds the ensemble model using a random forest regression.

#####ENSEMBLE 1 - Random Forest MODEL######

#violationsStory
#legalStory
#demosGeoStory
#houseCharacterStory

sum(is.na(data7))
data7$time_since_last2<-as.numeric(data7$time_since_last2)
newsample$year_built_copy<-as.factor(newsample$year_built_copy)

story.Violations <- newsample %>% dplyr::select(illegalRental_dv,issue_fixed_y,issue_fixed_dummy,viol_count2,hist_intviol2,hist_extviol2,owner_count,lentim_sincelast_viol2,active_case_y)
story.Legal <- newsample %>% dplyr::select(illegalRental_dv,outofstateLL,owes2,num_prev_lic_dummy,num_prev_lic,time_since_last2,status_mostrecent_active)
story.housingCharacter <- newsample %>% dplyr::select(illegalRental_dv,extcond_avg,year_built_copy,story3,multipleunits_dummy,multipleunits_n)
story.DemosGeo <-newsample %>% dplyr::select(illegalRental_dv,knn_viol_5,knn_lic_5,per_elevated2,pct_black2)

ensembleFunction <- function(inputStory, iterations, sampleSize) {
  #create an empty dataframe
  endList <- data.frame(matrix(NA, nrow = nrow(inputStory), ncol = 0))
  #build n models with a m% test set
  for (i in 1: iterations ) {
    sample <- sample.int(n = nrow(inputStory), size = floor( sampleSize *nrow(inputStory)), replace = F)
    train <- inputStory[sample, ]
    #train the model
    thisTrain <- randomForest(illegalRental_dv ~ ., data = train)
    #create a vector of predictions for this model.
    thisPrediction <- predict(thisTrain, newsample, type='response')
    #append the predictions to the data frame         
    endList <- cbind(endList, thisPrediction)
  }
  #label each prediction column 1 through n
  colnames(endList) <- paste("prediction", 1:ncol(endList), sep = "")
  #return a data frame of average predictions (median and sd are turned off for now)
  return(data.frame(mean.Prediction = rowMeans(endList, na.rm = TRUE)))
}

allStories.predictions_RF <- 
  data.frame(
    cbind(
      story1.mean.predictions = ensembleFunction(story.Violations,100,.4)[,1],
      story2.mean.predictions = ensembleFunction(story.Legal,100,.4)[,1],
      story3.mean.predictions = ensembleFunction(story.housingCharacter,100,.4)[,1],
      story4.mean.predictions = ensembleFunction(story.DemosGeo,100,.4)[,1]
    ))

head(allStories.predictions_RF)

#cbind all story predictions back to newsample dataset
names(newsample)
IR<-subset(newsample, select=c(illegalRental_dv))
allStories.predictions2_rf<-cbind(allStories.predictions_RF, IR)
class(allStories.predictions_RF)

##run random forest regression on predictions
finalReg_rf <- randomForest(illegalRental_dv ~ ., data=allStories.predictions2_rf)
summary(finalReg_rf)


#final predictions
finalRegTest_rf <- predict(finalReg_rf, allStories.predictions2_rf, type='response')
allStories.predictions2_rf$classProbs <- finalRegTest_rf

finalRegProbs_rf <- data.frame(Class=allStories.predictions2_rf$illegalRental_dv,
                            Probs=finalRegTest_rf)
head(finalRegProbs_rf)

#confusion matrix with selected threshold
finalRegProbs_rf$predClass  = ifelse(finalRegProbs_rf$Probs > .65 ,1,0)
head(finalRegProbs_rf)
caret::confusionMatrix(factor(finalRegProbs_rf$predClass), 
                       factor(finalRegProbs_rf$Class), positive='1')


####ROC/AUC curves for sample RF ensemble
#install.packages('plotROC')  #run once if the package is not yet installed
library(plotROC)
names(finalRegProbs_rf)

palette<-c("#c51b8a")
ggplot(finalRegProbs_rf, aes(d = Class, m = Probs, colour = "#c51b8a")) + 
  geom_roc(n.cuts = 50, labels = FALSE) + 
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = '#feebe2') +
  labs(title = 'Final Model ROC', 
       subtitle = 'AUC = 0.97', 
       x = "False positive fraction", 
       y = "True positive fraction") +
  scale_colour_manual(values=palette)+
  theme_minimal()+
  theme(plot.title=element_text(hjust=0.5), legend.position='none')


#save AUC
sample_auc <- auc(finalRegProbs_rf$Class, finalRegProbs_rf$Probs) ##AUC = 0.97


###How well does our model predict across neighborhoods

neighbs<-subset(newsample, select = c(MAPNAME))
head(neighbs)
finalRegProbs_rf_neighbs<-cbind(finalRegProbs_rf, neighbs)
head(finalRegProbs_rf_neighbs)

finalRegProbs_rf_neighbs2<-
  data.frame(class = finalRegProbs_rf_neighbs$Class, 
             nhood = finalRegProbs_rf_neighbs$MAPNAME, 
             probs = finalRegProbs_rf_neighbs$Probs, 
             predClass = finalRegProbs_rf_neighbs$predClass)


head(finalRegProbs_rf_neighbs2)  



options(yardstick.event_first = FALSE)


finalRegProbs_rf_neighbs2 %>%
  dplyr::select(-probs) %>%
  gather(Variable, predClass, -nhood, -class) %>%
  group_by(nhood) %>%
  summarize(Sensitivity = round(yardstick::sens_vec(factor(class),factor(predClass)),2),
            Specificity = round(yardstick::spec_vec(factor(class),factor(predClass)),2),
            Accuracy = round(yardstick::accuracy_vec(factor(class),factor(predClass)),2),
            AUC = round(Metrics::auc(factor(class),factor(predClass)),2))
  
trueFalsePos_withNhood<-
  finalRegProbs_rf_neighbs2 %>%
    dplyr::select(-probs) %>%
    mutate(truePositive = ifelse(predClass == 1 & class == 1, 1,0),
           trueNegative = ifelse(predClass == 0 & class == 0, 1,0)) %>%
    group_by(nhood) %>%
    summarize(sumTruePositive = sum(truePositive),
              sumTrueNegative = sum(trueNegative),
              totalObservations = n()) %>%
    mutate(truePositiveRate = sumTruePositive / totalObservations,
           trueNegativeRate = sumTrueNegative / totalObservations) %>%
    left_join(phillyneighbs, by=c("nhood"="MAPNAME")) %>%
    st_sf() 

head(trueFalsePos_withNhood)
trueFalsePos_withNhood<-subset(trueFalsePos_withNhood, trueFalsePos_withNhood$sumTruePositive>=1)
trueFalsePos_withNhood<-subset(trueFalsePos_withNhood, trueFalsePos_withNhood$totalObservations>=6)
head(trueFalsePos_withNhood)

7.2 Visuals

The visuals presented here largely, but not exclusively, feed into the exploratory analysis. To begin, we set up our visuals workspace.

#maptheme
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

#Here is a function that will create quintile breaks. Note that the inclusion of `round` might not be advantageous if you are dealing with say, percentages.
quintileBreaks <- function(df,variable) {
  as.character(round(quantile(df[[variable]],
                              c(.01,.2,.4,.6,.8),na.rm=T)))
}
#Next, we can create a custom palette
palette5 <- c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494")
palette5.2 <- c("#feebe2","#fbb4b9","#f768a1","#c51b8a","#7a0177")


#read in shapes
phillyshape2<-st_read('C:/Users/Fay/Desktop/GIS/Philly Shape Files/phillyproj2010.shp')%>%st_transform(crs=4326)

Next we make maps and graphics.

#Read in point data
#licenses
rental_license_sf<- st_read('li_business_licenses.shp')%>%st_transform(crs=4326)
rental_license_sf<-rental_license_sf%>%
  filter(licensesta == 'Active')%>%   #keep active licenses; drop inactive, expired, closed, revoked
  filter(licensetyp == 'Rental' | licensetyp == 'High Rise')
rental_license_sf$lng <- st_coordinates(rental_license_sf)[,1]
rental_license_sf$lat <- st_coordinates(rental_license_sf)[,2]

#EVICTIONS
evictionsshape<-st_read('evictions_4326.shp')%>%st_transform(crs=4326)%>%st_intersection(phillyshape2)
evictionsshape$lng <- st_coordinates(evictionsshape)[,1]
evictionsshape$lat <- st_coordinates(evictionsshape)[,2]

#Illegal Rentals
violations2<-violations

#create sf
illegalRentalsf<-st_as_sf(violations_clean, coords=c('lng','lat'),crs=4326)
illegalRentalsf$lng <- st_coordinates(illegalRentalsf)[,1]
illegalRentalsf$lat <- st_coordinates(illegalRentalsf)[,2]

#IMPORT TAX DELINQUENCY
tax_delinquency_4326<-tax_delinquency_4326%>%
  st_transform(crs=4326)%>%
  st_intersection(phillyshape2)
names(tax_delinquency_4326)
tax_delinquency_4326$lng <- st_coordinates(tax_delinquency_4326)[,1]
tax_delinquency_4326$lat <- st_coordinates(tax_delinquency_4326)[,2]
class(tax_delinquency_4326$lat)
tax_delinquency_4326 <- tax_delinquency_4326[!is.na(tax_delinquency_4326$lat),]

#bar plot of kind of IR
names(illegal_rentals)
palette3<-c("#f768a1","#c51b8a","#7a0177")
ggplot(violations_clean, aes(x=factor(IRtype)))+
  geom_bar(stat="count", width=0.9, fill=palette3)+
  theme_minimal()+
  labs(x="Kind of Illegal Rental",
       y="Count",
       title="Illegal Rentals by Type")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))

#*******************************************************************************************************************

#Density maps for illegal rentals and evictions

city <- st_read("http://data.phl.opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson")

#creating a 'window' aka 'owin' in spatstat  

#create a bounding box for city
cityBox <- st_convex_hull(city)

#Convert `city` into the window we need for spatstat. Note its class.
city.window <-
  as.owin(c(
    #xrange
    min(st_coordinates(cityBox)[,1]),
    max(st_coordinates(cityBox)[,1]),
    #yrange
    min(st_coordinates(cityBox)[,2]),
    max(st_coordinates(cityBox)[,2]))) 


#read in data
evictionpts <- st_read("evictions_4326.shp")

#note the latter fits in the former
plot(city.window)
plot(city,add=T)

#take a sample of your points, remove NAs, use st_difference to get rid of duplicate geometries
evictionpts2 <- evictionpts [1:1000,] %>% 
  mutate(x = st_coordinates(.)[,1],
         y = st_coordinates(.)[,2]) %>%
  filter(!is.na(x)) %>%
  st_difference()

#create the ppp and note the window
evictionpts2.ppp <- as.ppp(cbind(evictionpts2$x,evictionpts2$y), city.window)

#create the density 
evictionDensity <- raster(density(evictionpts2.ppp, sigma = .005, eps = .001))
plot(evictionDensity)

#to map - 
#Start by clipping the raster. The mask has to be a 'SpatialPolygons' 

evictionDensity.clip <- mask(evictionDensity, as(as(city,'Spatial'),'SpatialPolygons'))

#plot
palette5 <- c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494")
palette5.2 <- c("#feebe2","#fbb4b9","#f768a1","#c51b8a","#7a0177")


gplot(evictionDensity.clip) +   #gplot() is from the rasterVis package
  geom_raster(aes(fill = value))+
  scale_fill_distiller(palette="RdPu")+
  labs(title="Density of Evictions")+
  mapTheme()+
  theme(legend.position = "none") 

###repeat the process for illegal rentals

#**************************************************************************************************************#
#CONFUSION MATRICES
#read in data
#confusionmatrix<-read.csv('rf0.65.csv')  #earlier version of the file
confusionmatrix<-read.csv('rf0.65_2.csv')

confusionmatrix$Rate<-as.numeric(confusionmatrix$Rate)
names(confusionmatrix)[1]<-"Threshold"

##plot the confusion matrix for 0.4

confusionmatrix$variable <- factor(confusionmatrix$variable, levels = confusionmatrix$variable[order(-confusionmatrix$Rate)])

#define the fill palette before its first use (also used for the input-rows chart below)
palette4<-c("#fbb4b9","#f768a1","#c51b8a","#7a0177")

ggplot(confusionmatrix, aes(x=variable, y=Rate))+
  geom_bar(stat="identity", width=0.9, fill=palette4)+
  theme_minimal()+
  labs(x="",
       y="Rate (%)",
       title="Threshold 0.65",
       subtitle="Sensitivity: 0.8, Specificity: 0.93, Accuracy: 0.91")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))


#**************************************************************************************************************#
#BAR CHART FOR INPUT ROWS

#read in csv
rowcount<-read.csv('properties_bar_chart_dat2.csv')

palette4<-c("#fbb4b9","#f768a1","#c51b8a","#7a0177")

#bar plot of input row counts by property type
ggplot(rowcount, aes(x=reorder(Source,-Count), y=Count))+
  geom_bar(stat="identity", width=0.9, fill=palette4)+
  theme_minimal()+
  labs(x="",
       y="Count",
       title="Rows by Property Type")+
  theme(plot.title = element_text(hjust = 0.5))


#*************************************************************************************************************

#annual rent
annual_rent<-read.csv('annual_rent.csv')
names(annual_rent)[1]<-"Year"

#PLOT
ggplot(annual_rent, aes(x=Year, y=Median.Rent, group=1)) +
  geom_point(color="#c51b8a")+geom_line(color="#c51b8a")+
  labs(title="Median Rent in Philadelphia",
       x="Year",
       y="Rent (USD)")+
  theme_minimal()+
  #apply theme_minimal first so the custom theme settings below are not overridden
  theme(plot.title = element_text(hjust = 0.5),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "transparent",colour = NA),
        plot.background = element_rect(fill = "transparent",colour = NA))


#PLOT ACCURACY BY NEIGHBORHOOD
#cbind sensitivity and specificity to phillyneighbs3

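One way to derive and attach those neighborhood-level rates is sketched below; it assumes a test-set data frame `testResults` with `nbhd`, `observed`, and `predicted` columns (all hypothetical names):

```r
#per-neighborhood sensitivity and specificity from observed vs. predicted labels
nbhdRates <- testResults %>%
  group_by(nbhd) %>%
  summarise(sensitivity = sum(predicted == 1 & observed == 1) /
                          sum(observed == 1),
            specificity = sum(predicted == 0 & observed == 0) /
                          sum(observed == 0))

#join the rates back to the neighborhood polygons
phillyneighbs3 <- left_join(phillyneighbs3, nbhdRates, by = "nbhd")
```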
#sensitivity
ggplot() +
  geom_sf(data=phillyneighbs3, fill='gray', colour='gray')+
  geom_sf(data=phillyneighbs3, aes(fill=factor(ntile(sensitivity,5))))+
  scale_fill_manual(values=palette5.2,
                    labels = quintileBreaks(phillyneighbs3,"sensitivity")) +
  labs(title="Sensitivity by Neighborhood",
       fill="Sensitivity \nby Quantile")+
  mapTheme()

#Specificity
ggplot() +
  geom_sf(data=phillyneighbs3, aes(fill=factor(ntile(specificity,5))))+
  scale_fill_manual(values=palette5.2,
                    labels = quintileBreaks(phillyneighbs3,"specificity")) +
  labs(title="Specificity by Neighborhood",
       fill="Specificity by Quantile")+
  mapTheme()

A final step was developing correlation plots for key variables.

##correlation plots for exploratory analysis: for each variable, plot the percent
##illegal rental (y axis) for the 0 and 1 groups (x axis) to show how illegal
##rentals are distributed across variables

IR<-subset(data7, data7$illegalRental_dv == '1')
NIR<-subset(data7, data7$illegalRental_dv == '0')

#history of interior violations
table(IR$hist_intviol2) #208 - 9%
table(NIR$hist_intviol2) #17618 - 6%
208/2225*100
17618/302539*100

#history of exterior violations
table(IR$hist_extviol2) #494 - 22%
table(NIR$hist_extviol2) #35835 - 12%
494/2225*100
35835/302539*100

#tax delinquency
table(IR$owes2) #383 - 17%
table(NIR$owes2) #24423 - 8%
383/2225*100
24423/302539*100

#out of state LL
table(IR$outofstateLL) #689 - 31%
table(NIR$outofstateLL) #152278 - 50%
689/2225*100
152278/302539*100

table(IR$outofPhillyLL) #1244 - 56%
table(NIR$outofPhillyLL) #198511 - 66%
1244/2225*100
198511/302539*100

#multiple units
table(IR$multipleunits_y) #33 - 1%
table(NIR$multipleunits_y) #271 - .09%
33/2225*100
271/302539*100

#built pre 1900
table(IR$built_pre1900) #138 - 6%
table(NIR$built_pre1900) #50126 - 16%
138/2225*100
50126/302539*100
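The manual table() counts and arithmetic above could also be computed in one pass; a dplyr sketch using the column names from the code above, assuming each flag is a 0/1 indicator:

```r
#percent of properties with each flag, split by illegal rental status
data7 %>%
  group_by(illegalRental_dv) %>%
  summarise(pct_intviol  = mean(hist_intviol2 == 1) * 100,
            pct_extviol  = mean(hist_extviol2 == 1) * 100,
            pct_taxdel   = mean(owes2 == 1) * 100,
            pct_outstate = mean(outofstateLL == 1) * 100,
            pct_pre1900  = mean(built_pre1900 == 1) * 100)
```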

correlations<-read.csv('correlations.csv')

#coerce all PctIR columns to numeric in one step
pctCols <- grep("^PctIR", names(correlations), value = TRUE)
correlations[pctCols] <- lapply(correlations[pctCols], as.numeric)


###plots
palette5.2 <- c("#feebe2","#fbb4b9","#f768a1","#c51b8a","#7a0177")

histIntViol<-ggplot(correlations, aes(x =IRStat , y = PCTIntViol)) + 
  geom_bar(stat='identity', width =0.9, fill='#fbb4b9') +
  theme_minimal() +
  ylim(0,60) +
  labs(y = "% interior violations", 
       title = 'Properties with interior violations') +
  theme(axis.title.x=element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))
histIntViol

histExtViol<-ggplot(correlations, aes(x =IRStat , y = PCTExtViol)) + 
  geom_bar(stat='identity', width =0.9, fill='#fbb4b9') +
  theme_minimal() +
  ylim(0,60) +
  labs(y = "% exterior violations", 
       title = 'Properties with exterior violations') +
  theme(axis.title.x=element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))
histExtViol

taxdel<-ggplot(correlations, aes(x =IRStat , y = PCTTD)) + 
  geom_bar(stat='identity', width =0.9, fill='#f768a1') +
  theme_minimal() +
  ylim(0,60) +
  labs(y = "% tax delinquent", 
       title = 'Properties with Tax Delinquency') +
  theme(axis.title.x=element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))
taxdel


LL<-ggplot(correlations, aes(x =IRStat , y = PCTLL)) + 
  geom_bar(stat='identity', width =0.9, fill='#f768a1') +
  theme_minimal() +
  ylim(0,60) +
  labs(y = "% out of state landlords", 
       title = 'Properties with out of state landlords') +
  theme(axis.title.x=element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))
LL

pre19<-ggplot(correlations, aes(x =IRStat , y = PCTPre1900)) + 
  geom_bar(stat='identity', width =0.9, fill='#c51b8a') +
  theme_minimal() +
  ylim(0,60) +
  labs(y = "% built before 1900", 
       title = 'Properties built before 1900')+
  theme(axis.title.x=element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))
pre19

MU<-ggplot(correlations, aes(x =IRStat , y = PCTMU)) + 
  geom_bar(stat='identity', width =0.9, fill='#c51b8a') +
  theme_minimal() +
  ylim(0,60) +
  labs(y = "% multiple units", 
       title = 'Properties with multiple units')+
  theme(axis.title.x=element_blank()) +
  theme(plot.title = element_text(hjust = 0.5))
MU

grid.arrange(histIntViol, histExtViol, taxdel, LL, pre19, MU, nrow = 3)