A Development Risk Assessment Tool for Adverse Possession

Final Project for the Master of Urban Spatial Analytics Smart Cities Practicum (MUSA 801) at the University of Pennsylvania School of Design.


Introduction

When life hands you a vacant lot, make a garden.

This has been the philosophy of the enterprising residents of Philadelphia, a city that, like many in the United States, is home to thousands of properties that lie vacant in neighborhoods from Kensington to West Philadelphia. Rather than let them sit idle, Philadelphians have taken it upon themselves to tend to these lots as their own, converting them into cherished public assets. These properties come in all shapes and sizes, from small side yards to sprawling community gardens that span multiple parcels.

Recently, however, a threat has emerged that could result in the sale of scores of these vacant properties.

The City of Philadelphia has been foreclosing on vacant properties in bulk to recover delinquent taxes. Following foreclosure, the City will initiate a forced sale through a public auction known as a sheriff sale, transferring the property to the highest bidder. Neither the original owner nor the neighbors who have cared for the property in the owner’s absence can stop such a sale once it’s underway.

As if that were not enough, vacant lots in Philadelphia face an additional threat. In the late 1990s, the City of Philadelphia sold over 30,000 tax liens to U.S. Bank, a private financial institution, in an effort to raise funds for public education. (This effort proved largely unsuccessful.) As a result, today over two thousand vacant lots in Philadelphia are encumbered by liens owned by U.S. Bank, entitling the bank to unilaterally foreclose on those properties, much like the City.

In light of these trends, neighbors and communities are rightly concerned that they may lose properties that, for years, have constructively been theirs. But there is reason for hope.

La Finquita

A range of legal and policy options may be available to these stakeholders depending on their specific circumstances. Advocacy efforts are currently underway to halt sheriff’s sales for certain classes of vacant lots. Some policy solutions already exist, especially for vacant lots that are City-owned. In some cases, stakeholders may even have legal rights: Neighbors who have tended to vacant lots adjoining their properties may have a path to ownership through a legal doctrine known as adverse possession.

A variety of grassroots and legal advocacy organizations have taken up the mantle to explore these options, and more, to ensure that vacant lots remain in the care of the neighbors and communities who have invested in them for years. In collaboration with Philadelphia Legal Assistance (PLA), we are creating a data dashboard and web app to assist PLA in weighing viable legal and policy interventions to protect its clients. In addition, the web app is designed to help residents, policymakers, and members of the public understand the scope of this problem.

Our role is thus three-pronged:

  1. to visualize and explore vacant properties in Philadelphia;
  2. to aggregate relevant data on vacant lots to assist PLA in assessing viable legal and policy interventions; and
  3. to design an algorithm to predict which properties are at greatest risk of disposition and development.

Exploratory Analysis

### A. Geographic boundaries
# set CRS to Pennsylvania South NAD83 in meters
phlcrs <- 'EPSG:32129'


if (SWITCH == T) {
  # get geometry for tracts in Philadelphia
  phltracts <- 
    tigris::tracts(state = 42, county = 101) %>%
    dplyr::select(GEOID, geometry) %>%
    st_transform(st_crs(phlcrs))
  
  # get geometry for blocks in Philadelphia
  phlblocks <- 
    tigris::blocks(state = 42, county = 101) %>%
    dplyr::select(GEOID20, geometry) %>%
    st_transform(st_crs(phlcrs))
  
  
  # get geometry for the county of Philadelphia
  phlcounty <-
    phltracts %>%
    st_union()

  } else {
    phltracts <- readRDS("../data/local/phltracts.Rds")
    phlblocks <- readRDS("../data/local/phlblocks.Rds")
    phlcounty <- readRDS("../data/local/phlcounty.Rds")
    
  }

saveRDS(phltracts, "../data/local/phltracts.Rds")
saveRDS(phlblocks, "../data/local/phlblocks.Rds")
saveRDS(phlcounty, "../data/local/phlcounty.Rds")


# set up the fishnet bins as the most granular unit of analysis
fishnet <-
  phlcounty %>%
  st_make_grid(
    .,
    cellsize = 250,
    square = T) %>%
  .[phlcounty] %>%                     # clip to Philadelphia County boundary
  st_sf() %>%
  mutate(uniqueID = rownames(.))
  
  
# Create a fishnet grid boundary of Philadelphia County
phlFishnet <-
  fishnet %>%
  st_union()



### B. Demographics

if (SWITCH == T) {
  
  # Get API key loaded locally
  censusKey <- read_file('../censusapikey.R')
  
  # load census api key from local file
  census_api_key(censusKey)
  
  # set consulted year
  year <- 2019
  
  # get variables names and codes
  acsVariableList <- load_variables(year, "acs5", cache = TRUE)
  
  # set census variables to consult
  censusVars <-
    c("B02001_001E", # Total population 
      "B11001_002E", # Total households
      "B02001_002E", # Total white population 
      "B19013_001E", # Median household income
      "B25003_001E") # Tenure of household
  
  # Get the data
  demographics <-
    get_acs(geography = "tract",
            variables = censusVars,
            year = year,
            state = 42,
            county = 101,
            geometry= T,
            output = "wide") %>%
    st_transform(st_crs(phlcrs)) %>%
    rename(totalPop = "B02001_001E",
           totalHHs = "B11001_002E",
           whitePop = "B02001_002E",
           medHHInc = "B19013_001E",
           tenureHH = "B25003_001E") %>%
    dplyr::select(-NAME, -starts_with("B"))%>%
    replace(is.na(.), 0) %>%
    mutate(pctWhite = ifelse(totalPop > 0, whitePop / totalPop, 0),
           pctTenur = ifelse(totalPop > 0, tenureHH / totalPop, 0)) %>%
    mutate(nhMajMin = ifelse(pctWhite > 0.5, 1, 0),
           nhIncome = ifelse(medHHInc > mean(.$medHHInc), 1, 0),
           nhTenure = ifelse(pctTenur > 0.5, 1, 0)) %>%
    dplyr::select(nhMajMin, nhIncome, nhTenure) %>%
    st_transform(st_crs(phlcrs))
  
  } else {
    
    # OR read locally
    demographics <- readRDS("../data/local/demographics.Rds")
  }

# SAVE locally
saveRDS(demographics, "../data/local/demographics.Rds")

  

# Create a mask for the majority minority census tracts fishnet cells

majminFishnet <- 
  demographics %>%
  filter(nhMajMin == 1) %>%
  st_union() %>%
  st_intersection(fishnet) %>%
  fishnet[.,] %>%
  st_union()


### C. Properties

# [source]('https://www.opendataphilly.org/dataset/opa-property-assessments')

# Load properties from 1999 until now from the Office of Property Assessment.
if (SWITCH == T) {
  properties <- read_csv('https://opendata-downloads.s3.amazonaws.com/opa_properties_public.csv')
  
  # Selected variables of interest
  propertiesVars <-
    c("category_code",                  # KEEP ------------------------------------determines if it is VACANT LAND
      #"exterior_condition",             # KEEP ------------------------------------how the exterior appears based on observation.
      #"frontage",                       # MODEL
      "location",                       # ADDRESS
      "market_value",                   # KEEP ------------------------------------the certified market value of the property.
      #"off_street_open",                # UNNECESSARY
      "owner_1",                        # KEEP
      "owner_2",                        # KEEP
      "parcel_number",                  # KEEP to JOIN***
      #"parcel_shape",                   # MODEL
      "sale_date",                      # KEEP
      #"sale_price",                     # KEEP
      "total_area",                     # KEEP
      #"unfinished",                     # UNNECESSARY
      "year_built",                     # KEEP
      "zoning",                         # KEEP
      "pin",                            # KEEP to JOIN ****
      "lat",                            # KEEP****
      "lng")                            # KEEP****
  
  # Data wrangling
  propertiesData <- properties %>%
    dplyr::select(all_of(propertiesVars)) %>%
    filter(!is.na(lat), !is.na(lng)) %>%
    st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%   # x = longitude, y = latitude
    st_transform(st_crs(phlcrs)) %>%
    st_join(demographics)
  
  } else {
    
    propertiesData <- readRDS("../data/local/propertiesData.rds")
    
  }
  
# SAVE locally
saveRDS(propertiesData, file = "../data/local/propertiesData.rds")

To understand the scale of the challenges facing vacant properties in Philadelphia, we visualize some of the key datasets pertaining to those properties.

emptyCanvas <-
  ggplot(data.frame(x = seq(0, 100), y = seq(0, 100)), aes(x, y)) +
    geom_point(alpha = 0) +
    colTheme()


# All properties map
propertiesMap <- 
  ggplot() +
  geom_sf(data = phlcounty, color = NA, fill = colors[1], size = 0) + 
  geom_sf(data = propertiesData,
          color = palette[1],
          size = 0.25,
          alpha = 0.5) +
  labs(title = 'All Properties',
       subtitle = 'Properties in Philadelphia County',
       caption = 'OpenDataPhilly') +
  mapTheme() +
  plaTheme()

# create bar
propertiesBar <-
  data.frame(countProps = c(nrow(propertiesData)),
             proportion = c(1),
             type = c('all'),
             condition = c('Property in Philadelphia')) %>%
  ggplot(aes(x =type, y = proportion, fill = condition )) + 
  geom_col(position = "stack",
           width = 0.5) +
  labs(title = 'All properties') +
  geom_text(aes(label = paste0(round(proportion*100, digits=2), "%")),
            position = position_stack(vjust = 0.5),
            size = 4.5,
            color = colors[1]) +
  geom_text(aes(label = condition),
            position = position_stack(vjust = 0.5),
            vjust = 4,
            size = 3.5,
            color = colors[1]) +
  scale_fill_manual(values = c(palette[1]),
                    name="") +
  colTheme() +
  coord_flip()
  

# aggregate both in empty canvas
propertiesChart <- 
  emptyCanvas +
  annotation_custom(ggplotGrob(propertiesMap),
                       xmin = 0,
                       xmax = 100, 
                       ymin = 0,
                       ymax = 100) +
  annotation_custom(ggplotGrob(propertiesBar),
                       xmin = 51,
                       xmax = 91, 
                       ymin = 15,
                       ymax = 30)

propertiesChart


Total Vacant Lots

# [source]('https://www.opendataphilly.org/dataset/vacant-property-indicators')

# VACANT LAND

if (SWITCH == T) {
  
  # download OpenDataPhilly Vacant Property Indicator - Lots
  vacantLand <- read_csv('https://opendata.arcgis.com/datasets/19c35fb02d544a9bad0032b58268c9f9_0.csv')

  # Select useful variables --- NO GEOMETRY
  vacantLandVars <-
    c("ADDRESS",                        # KEEP for JOIN CHECK
      "BLDG_DESC",                      # KEEP ------------------------------------Building description from OPA
      "OPA_ID",                         # KEEP to JOIN****
      "LAND_RANK")                      # KEEP
  
  
  # Data wrangling
  vacantLandData <-
    vacantLand %>%
    dplyr::select(vacantLandVars) %>%
    mutate(vacant = 'vacant')
  
  
  # JOIN to propertiesData
  # propertiesData <---> vacantLandData
  vacantLandProps <-
    propertiesData %>%
    inner_join(vacantLandData, by=c('parcel_number'='OPA_ID'))
  
  } else {
    
    # OR read locally
    vacantLandProps <- readRDS("../data/local/vacantLandProps.rds")
    
    }

# save locally
saveRDS(vacantLandProps, file = "../data/local/vacantLandProps.rds")

Today, there are over 27,000 vacant properties across Philadelphia. No two of these properties are the same: Some are expansive, covering multiple parcels; others are small side yards adjacent to homes.

A map of these vacant properties also reveals that they tend to cluster in space and are unevenly distributed across the city. We hypothesize that these patterns are not arbitrary but are attributable to certain demographic and socio-economic divisions that pervade Philadelphia (a claim that will be validated later in this analysis).

# map of vacant lots
vacantMap <- 
  vacantLandData %>%
  mutate(vacant = 'vacant') %>%
  right_join(propertiesData, by = c('OPA_ID' = 'parcel_number')) %>%
  mutate(vacant = replace_na(vacant, 'not vacant')) %>%
  st_sf() %>%
  ggplot() +
  geom_sf(data = phlcounty, color = NA, fill = colors[1], size = 0) + 
  geom_sf(aes(color = vacant), size = 0.02) +
  labs(title = 'Vacant Land',
       subtitle = 'Properties in Philadelphia County',
       caption = 'OpenDataPhilly') +
  scale_color_manual("Legend", values= c(colors[3], palette[2]), labels= c("not vacant", "vacant")) +
  guides(colour = guide_legend(override.aes = list(size=2))) +
  mapTheme() +
  plaTheme() +
  theme(legend.position = "none")


# create bar
vacantBar <- 
  vacantLandData %>%
  mutate(vacant = 'vacant') %>%
  right_join(propertiesData, by = c('OPA_ID' = 'parcel_number')) %>%
  mutate(vacant = replace_na(vacant, 'not vacant')) %>%
  group_by(vacant) %>%
  summarise(countVacant = n()) %>%
  mutate(proportion = countVacant/sum(.$countVacant),
         type = "vacant",
         condition = case_when(
           vacant == 'vacant' ~ "Vacant",
           vacant == 'not vacant' ~ "Not Vacant")) %>%
  ggplot(aes(x =  type, y = proportion, fill = factor(vacant))) + 
  geom_col(position = "stack",
           width = 0.5) +
  labs(title = 'All properties') +
  geom_text(aes(label = paste0(round(proportion*100, digits=2), "%")),
            position = position_stack(vjust = 0.5),
            size = 4.5,
            color = colors[1]) +
  geom_text(aes(label = condition),
            position = position_stack(vjust = 0.5),
            vjust = 4,
            size = 3.5,
            color = colors[1]) +
  scale_fill_manual(values = c(palette[1], palette[2]),
                    name="") +
  colTheme() +
  coord_flip()


# aggregate both in empty canvas
vacantChart <- 
  emptyCanvas +
  annotation_custom(ggplotGrob(vacantMap),
                       xmin = 0,
                       xmax = 100, 
                       ymin = 0,
                       ymax = 100) +
  annotation_custom(ggplotGrob(vacantBar),
                       xmin = 51,
                       xmax = 91, 
                       ymin = 15,
                       ymax = 30)

vacantChart


Tax Delinquent Vacant Lots

# [source]('https://www.opendataphilly.org/dataset/property-tax-delinquencies')

# Real Estate Delinquencies from OpenData Philly:

if (SWITCH == T) {
  
  # read from endpoint
  delinquencies <- read_csv('https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+real_estate_tax_delinquencies&filename=real_estate_tax_delinquencies&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator')


  # Variables - Tax Delinquencies (53 vars)
  delinquenciesVars <-
    c("opa_number",                     # KEEP to JOIN****
      #"street_address",                 # KEEP to JOIN check****
      "total_due",                      # KEEP ------------------------------------The total amount owner owes.
      "is_actionable",                  # KEEP?*** --------------------------------Action can be taken against the delinquent property.
      "num_years_owed",                 # KEEP ***for adverse possession reference
      "most_recent_year_owed",          # KEEP for reference
      "oldest_year_owed",               # KEEP for reference
      "most_recent_payment_date",       # KEEP for reference
      "total_assessment",               # KEEP for reference
      "sheriff_sale",                   # KEEP*** ---------------------------------Property is in the Sheriff Sale Process (any stage).
      "liens_sold_1990s",               # KEEP*** ---------------------------------Property was included in 1997 Lien Sale.
      "liens_sold_2015")                # KEEP*** ---------------------------------Property is included in Recent Lien Sales.
  
  
  # Data wrangling
  delinquenciesData <-
    delinquencies %>%
    dplyr::select(all_of(delinquenciesVars)) %>%
    mutate(opa_number = sprintf("%09d", opa_number))
  
  
  # JOIN to properties
  # propertiesData <---> delinquenciesData
  delinquenciesProps <-
    propertiesData %>%
    inner_join(delinquenciesData, by = c('parcel_number'='opa_number')) %>%
    st_sf()
  
  
  # map vacant properties (points)
  delinquenciesVacantProps <-
    left_join(
      vacantLandProps,
      delinquenciesProps %>%
        st_drop_geometry() %>%
        mutate(delinquentStatus = 1), by = "parcel_number") %>%
    mutate(
      delinquentStatus = ifelse(is.na(delinquentStatus) == F, 1, 0),
      delinquentType = ifelse(liens_sold_1990s == T, 2, delinquentStatus), # delinquent type: 0 not, 1 yes, 2 us bank
      delinquentType = replace_na(delinquentType, 0))
  
  } else {
    
    delinquenciesProps <- readRDS("../data/local/delinquenciesProps.rds")
    delinquenciesVacantProps <- readRDS("../data/local/delinquenciesVacantProps.rds")
    
    }

# SAVE locally
saveRDS(delinquenciesProps, file = "../data/local/delinquenciesProps.rds")
saveRDS(delinquenciesVacantProps, file = "../data/local/delinquenciesVacantProps.rds")

So long as the owner of a vacant property keeps up with her property taxes, there is no risk that the property will be foreclosed upon.

However, it is when a property falls into a state of tax delinquency—that is, when the owner fails to pay tax on the property—that such a property becomes vulnerable to a forced sale. Much like a mortgage lender, the City is entitled to foreclose on properties that are tax delinquent and may put them up for a sheriff’s sale to collect on back taxes.

As seen in the map below, not all vacant properties in Philadelphia are tax delinquent, but a substantial share—27.5 percent—are. Within certain legal parameters, the City may exercise its right to foreclose on these properties, many of which serve the very community uses that organizations such as Philadelphia Legal Assistance have set out to protect.

# Map delinquent properties in relation to Vacant properties
delinquentMap <- 
  ggplot() +
  geom_sf(data = phlcounty, color = NA, fill = colors[1], size = 0) + 
  geom_sf(data = delinquenciesVacantProps, 
          aes(color = factor(delinquentStatus)), 
          size = 0.02) +
  labs(
    title = 'Delinquent Vacant Land',
    subtitle = 'Properties in Philadelphia County',
    caption = 'OpenDataPhilly') +
  scale_color_manual(
    "Legend",
    values= c(colors[3], palette[3]),
    labels= c("not delinquent", "delinquent")) +
  guides(colour = guide_legend(override.aes = list(size=2))) +
  mapTheme() +
  plaTheme() +
  theme(legend.position = "none")


# create bar
delinquentBar <- 
  delinquenciesVacantProps %>%
  st_drop_geometry() %>%
  group_by(delinquentStatus) %>%
  summarise(countDelinquent = n()) %>%
  mutate(proportion = countDelinquent/sum(.$countDelinquent),
         type = "delinquent",
         condition = case_when(
           delinquentStatus == 1 ~ "Delinquent",
           delinquentStatus == 0 ~ "Not delinquent")) %>%
  ggplot(aes(x =  type, y = proportion, fill = factor(delinquentStatus))) + 
  geom_col(position = "stack",
           width = 0.5) +
  labs(title = 'Vacant Lots') +
  geom_text(aes(label = paste0(round(proportion*100, digits=2), "%")),
            position = position_stack(vjust = 0.5),
            size = 4.5,
            color = colors[1]) +
  geom_text(aes(label = condition),
            position = position_stack(vjust = 0.5),
            vjust = 4,
            size = 3.5,
            color = colors[1]) +
  scale_fill_manual(values = c(palette[1], palette[3]),
                    name="") +
  colTheme() +
  coord_flip()


# aggregate both in empty canvas
delinquentChart <- 
  emptyCanvas +
  annotation_custom(ggplotGrob(delinquentMap),
                       xmin = 0,
                       xmax = 100, 
                       ymin = 0,
                       ymax = 100) +
  annotation_custom(ggplotGrob(delinquentBar),
                       xmin = 51,
                       xmax = 91, 
                       ymin = 15,
                       ymax = 30)


delinquentChart

Just as the distribution of vacant properties exhibits discernible spatial patterns, so too does the distribution of tax delinquent vacant properties. The 27.5 percent of vacant properties that are tax delinquent cluster in North Philadelphia, indicating that the properties most exposed to tax foreclosure risk lie close to one another.

Where are these delinquencies concentrated?

# map vacant properties (fishnet)
delinquenciesVacantNet <-
  delinquenciesVacantProps %>%
  mutate(vacant = 1) %>%
  select(delinquentStatus, vacant) %>%
  aggregate(fishnet, sum) %>% 
  mutate(countDelinquent = replace_na(delinquentStatus, 0),
         vacant = replace_na(vacant, 0)) %>%
  select(-delinquentStatus)


# delinquencies heat map
ggplot() +
  geom_sf(data = phlFishnet, color = NA, fill = colors[1]) + 
  geom_sf(data = subset(delinquenciesVacantNet, vacant > 0), aes(fill = countDelinquent), color = NA, inherit.aes = FALSE) + 
  scale_fill_viridis(
    option = "mako",
    trans = "log1p",
    direction = 1,
    breaks = c(0,5,25,75)) +
  guides(
    size = F,
    fill= guide_colorbar(barheight = 7, title.position = "top")) +
  labs(title = 'Delinquent Vacant Land Distribution',
       subtitle = 'Properties in Philadelphia County',
       caption = 'OpenDataPhilly') +
  mapTheme() +
  plaTheme()

#mapSave_ggplot("delinquentVacantProperties_fishnet")

U.S. Bank Liens

A further subclass of vacant properties in Philadelphia comprises those on which taxes are owed to U.S. Bank, a private financial institution that purchased a large portfolio of tax liens from the City of Philadelphia in the late 1990s. Among all vacant properties in Philadelphia, over 5 percent fall into this category and are subject to liens owned by U.S. Bank. Furthermore, of all vacant properties that are tax delinquent, more than 20 percent are encumbered by debt to U.S. Bank, entitling it to initiate foreclosure proceedings just like the City.

# US Bank Liens comparison map
usBankMap <- 
  delinquenciesVacantProps %>%
  filter(delinquentType > 0) %>%
  ggplot() +
  geom_sf(data = phlcounty, color = NA, fill = colors[1]) + 
  geom_sf(aes(color = factor(delinquentType)), 
          size = 0.02) +
  scale_color_manual("Legend", values = c(colors[3], palette[4]),
                     labels= c("vacant delinquent", "US Bank lien")) +
  labs(title = 'US Bank Delinquent',
       subtitle = 'Delinquent Vacant Land Properties in Philadelphia County',
       caption = 'OpenDataPhilly') +
  guides(colour = guide_legend(override.aes = list(size=2))) +
  mapTheme() +
  plaTheme() +
  theme(legend.position = "none")


# create bar
usBankBar <- 
  delinquenciesVacantProps %>%
  st_drop_geometry() %>%
  filter(delinquentStatus == 1) %>%
  group_by(liens_sold_1990s) %>%
  summarise(countUsBank = n()) %>%
  mutate(proportion = countUsBank/sum(.$countUsBank),
         type = "UsBank",
         condition = case_when(
           liens_sold_1990s == T ~ "US Bank Lien",
           liens_sold_1990s == F ~ "Not US Bank Lien")) %>%
  ggplot(aes(x =  type, y = proportion, fill = liens_sold_1990s)) + 
  geom_col(position = "stack",
           width = 0.5) +
  labs(title = 'Delinquent Vacant Lots') +
  geom_text(aes(label = paste0(round(proportion*100, digits=2), "%")),
            position = position_stack(vjust = 0.5),
            size = 4.5,
            color = colors[1]) +
  geom_text(aes(label = condition),
            position = position_stack(vjust = 0.5),
            vjust = 4.5,
            size = 3.5,
            color = colors[1]) +
  scale_fill_manual(values = c(palette[1], palette[4]),
                    name="") +
  colTheme() +
  coord_flip()


# aggregate both in empty canvas
usBankChart <- 
  emptyCanvas +
  annotation_custom(ggplotGrob(usBankMap),
                       xmin = 0,
                       xmax = 100, 
                       ymin = 0,
                       ymax = 100) +
  annotation_custom(ggplotGrob(usBankBar),
                       xmin = 51,
                       xmax = 91, 
                       ymin = 15,
                       ymax = 30)


usBankChart


Sheriff Sales

# [source]('https://www.opendataphilly.org/dataset/real-estate-transfers')

# OpenDataPhilly Endpoint of Property Assessment History:

if (SWITCH == T) {
  # read CSV from AMAZON AWS
  # transfers <- read_csv('https://opendata-downloads.s3.amazonaws.com/rtt_summary.csv')
  
  # read CSV from ESRI
  transfers <- read_csv('https://opendata.arcgis.com/datasets/88e5bc291b834606bd49f6fd6dca226e_0.csv')

  # OpenDataPhilly - Transfers (48 variables)
  transfersVars <- 
    c("document_type",                # KEEP*** -----------------------------------refers to type of Real Estate Transaction
      "display_date",                 # KEEP***
      "street_address",               # KEEP for JOIN check
      "grantors",                     # KEEP --------------------------------------seller (on deeds), or borrower (on mortgages)
      "grantees",                     # KEEP --------------------------------------buyer, recipient, new owner, or lien holder
      "total_consideration",          # KEEP?** -----------------------------------good exchanged for the real estate (usually money)
      "opa_account_num")              # KEEP****to join

  
  # Data wrangling
  transfersData <- 
    transfers %>%
    dplyr::select(all_of(transfersVars)) %>%
    filter(!is.na(opa_account_num)) %>%
    filter(document_type %in% c('DEED',
                                'DEED LAND BANK',
                                'DEED MISCELLANEOUS',
                                'DEED MISCELLANEOUS TAXABLE',
                                'DEED OF CONDEMNATION',
                                'DEED SHERIFF'))


  # JOIN to properties
  # propertiesData <---> transfersData
  
  # if a parcel_number has multiple deeds, order the factor levels so that
  # "DEED SHERIFF" comes first and distinct() keeps the sheriff deed over others
  transfersProps <-
    propertiesData %>%
    inner_join(transfersData, by = c('parcel_number' = 'opa_account_num')) %>% 
    mutate(
      document_type = factor(
        document_type,
        levels = c("DEED SHERIFF",
                   "DEED",
                   "DEED MISCELLANEOUS",
                   "DEED OF CONDEMNATION",
                   "DEED LAND BANK",
                   "DEED MISCELLANEOUS TAXABLE")))
  
  } else {
    
    # OR read locally
    transfersProps <- readRDS("../data/local/transfersProps.rds")
    
    }

# SAVE locally

saveRDS(transfersProps, file = "../data/local/transfersProps.rds")

Once either the City or U.S. Bank petitions to foreclose on a tax delinquent property in court, unless the property owner appears to contest the foreclosure or the court independently finds a reason not to foreclose, the court will grant the petition and allow the property to be sold via a sheriff sale. As alluded to in the Introduction, communities and neighbors who may have cared for the property for years in the owner’s absence frequently are not even made aware that a property has been listed, let alone sold, at a sheriff sale.

# SHERIFF SALES

# Sheriff Sales before 2021 from real estate transfers data (completed)
sheriffSales_20 <-
  transfersProps %>%
  filter(document_type == "DEED SHERIFF") %>%
  arrange(parcel_number, document_type) %>%
  distinct(parcel_number, .keep_all = T) # remove duplicates


# sheriff sales after 2021 from client
sheriffSales_21 <-
  read_csv('./data/sheriffSales_21.csv') %>%
  distinct(OPA, .keep_all= T ) # remove duplicates


# join all sheriff sales info we have together
## sheriff sales data from transfers
sheriffProps <-
  left_join(
    delinquenciesVacantProps,
    st_drop_geometry(sheriffSales_20),
    by = "parcel_number") %>%
  distinct(parcel_number, .keep_all = T) %>%
  merge(., sheriffSales_21,
        by.x = "parcel_number", by.y = "OPA", all.x = T, no.dups = T) %>%
  mutate(pastSheriffSale = ifelse(document_type == "DEED SHERIFF", 1, 0)) %>%   #  past records only
  mutate(pastSheriffSale = replace_na(pastSheriffSale, 0)) %>%
  mutate(futureSheriffSale = ifelse(document_type == "DEED SHERIFF" | is.na(Status) == F | sheriff_sale == "Y", 1, 0)) %>% # future
  mutate(futureSheriffSale = replace_na(futureSheriffSale, 0)) %>%
  mutate(allSheriffSales = ifelse(pastSheriffSale == 1 | futureSheriffSale == 1, 1, 0),
         sheriffSaleYear = year(as.Date(display_date))) 


# SAVE locally
saveRDS(sheriffProps, file = "../data/local/sheriffProps.rds")

The map below shows where sheriff’s sales are taking place in Philadelphia. Unsurprisingly, these properties cluster in the same areas of North Philadelphia that have the highest concentration of tax delinquent vacant properties. Nearly 32 percent of all tax delinquent properties either have already been sold at a sheriff’s sale or are scheduled for one.

# Map sheriff sales in relation to vacant delinquent properties
sheriffMap <- 
  ggplot() +
  geom_sf(data = phlcounty, color = NA, fill = colors[1]) + 
  geom_sf(data = subset(sheriffProps, delinquentStatus == 1), 
          aes(color = factor(allSheriffSales)), 
          size = 0.2) +
  scale_color_manual("Legend", values= c(colors[3], palette[5]),
                     labels= c("vacant delinquent", "sheriffs sale")) +
  labs(title = 'Sheriff Sales (historical)',
       subtitle = 'Vacant Land Properties in Philadelphia County',
       caption = 'OpenDataPhilly') +
  guides(colour = guide_legend(override.aes = list(size=2))) +
  mapTheme() +
  plaTheme() +
  theme(legend.position = "none")


# create bar
sheriffBar <- 
  sheriffProps %>%
  st_drop_geometry() %>%
  dplyr::select(allSheriffSales) %>%
  group_by(allSheriffSales) %>%
  summarize(countSheriff = n()) %>%
  mutate(proportion = countSheriff/sum(.$countSheriff),
         type = "Sheriff",
         condition = case_when(
           allSheriffSales == 1 ~ "Sheriff Sale",
           allSheriffSales == 0 ~ "Has not been sold in Sheriff Sale")) %>%
  ggplot(aes(x =  type, y = proportion, fill = factor(allSheriffSales))) + 
  geom_col(position = "stack",
           width = 0.5) +
  labs(title = 'Vacant Lots') +
  geom_text(aes(label = paste0(round(proportion*100, digits=2), "%")),
            position = position_stack(vjust = 0.5),
            size = 4.5,
            color = colors[1]) +
  geom_text(aes(label = condition),
            position = position_stack(vjust = 0.5),
            vjust = 4.5,
            size = 3.5,
            color = colors[1]) +
  scale_fill_manual(values = c(palette[1], palette[5]),
                    name="") +
  colTheme() +
  coord_flip()



# aggregate both in empty canvas
sheriffChart <- 
  emptyCanvas +
  annotation_custom(ggplotGrob(sheriffMap),
                       xmin = 0,
                       xmax = 100, 
                       ymin = 0,
                       ymax = 100) +
  annotation_custom(ggplotGrob(sheriffBar),
                       xmin = 51,
                       xmax = 91, 
                       ymin = 15,
                       ymax = 30)


sheriffChart

The urgency of the present problem can be further understood when sheriff sales are displayed as a function of time. In the past decade, the number of yearly sheriff sales has increased dramatically, as shown in the chart below. One plausible theory explaining this recent uptick is that the value of vacant property appreciates with the overall real estate market. The chart loosely supports this theory: Sheriff sales fell to an all-time low in the immediate aftermath of the 2008 subprime mortgage crisis, rose sharply over the past five to ten years, and briefly dropped off again in the wake of the coronavirus pandemic. We expect this recent downtick to be temporary, however, because the City has moved sheriff sales online, opening properties to a wider pool of potential buyers.

# plot a chart of sheriff sales
sheriffProps %>%
  filter(delinquentStatus == 1) %>%
  group_by(pastSheriffSale, sheriffSaleYear) %>%
  summarise(countSheriffSale = n()) %>%
  st_drop_geometry() %>%
  filter(pastSheriffSale == 1) %>%
  ggplot(aes(x = as.numeric(sheriffSaleYear), y= countSheriffSale)) + 
  geom_area(fill = palette[5], alpha = 0.67) + 
  geom_vline(xintercept = 2017, color = colors[3], size = 1) +
  xlab("year") +
  scale_x_continuous(
    breaks = seq(from = 1999, to = 2021, by = 2),
    limits = c(1999, 2021)) +
  ylab("sheriff's sales") +
  labs(title = 'Sheriff Sales: Historical and Scheduled',
       subtitle = 'Vacant Land Properties in Philadelphia County',
       caption = 'OpenDataPhilly & Adam Butler') +
  theme(legend.position = "none",
        panel.background = element_rect(fill = colors[2]),
        panel.grid.major.x =  element_blank(),
        panel.grid.minor.x =  element_blank(),
        plot.title = element_text(size = 20, face = "bold"),
        plot.subtitle = element_text(color = colors[1], face = "plain")
        )


# [source]('https://www.opendataphilly.org/dataset/licenses-and-inspections-building-permits')

# PERMITS

if (SWITCH == T) {
  
  # Load PERMITS data from OpenDataPhilly's Carto API
  permits <- read_csv('https://phl.carto.com/api/v2/sql?q=SELECT+*,+ST_Y(the_geom)+AS+lat,+ST_X(the_geom)+AS+lng+FROM+permits&filename=permits&format=csv&skipfields=cartodb_id')

  
  # List the permit data set variables needed
  permitsVars <-
    c('parcel_id_num',                 # KEEP for JOIN
      'permittype',                    # KEEP*****
      'permitdescription',             # KEEP for reference
      'typeofwork',                    # KEEP****
      'permitissuedate',               # KEEP***
      'status',                        # KEEP
      'applicanttype',                 # MAYBE
      'opa_account_num',               # KEEP for JOIN****
      'address',                       # KEEP for JOIN check
      'lng','lat'
      )
  
  # EXAMINE VARIABLES
  # permittype
  permitTypes <- as.data.frame(table(permits$permittype))
  
  # select permit types of interest
  permitCats <- 
    c('BP_NEWCNST',             #    4984   # "2007-01-02" to "2020-02-20"
      'BUILDING',               #    8521   # "2015-01-12" to NOW ("2022-01-29")
      'RESIDENTIAL BUILDING',   #   15909   # "2015-01-06" to NOW ("2022-01-29")
      'ZONING',                 #   11187   # "2015-12-02" to NOW ("2022-01-29")
      'ZP_ADMIN',               #     803
      'ZP_USE',                 #    9651   # "2007-01-02" to "2020-03-12"
      'ZP_ZON/USE',             #   12921   # "2007-01-02" to "2020-03-12"
      'ZP_ZONING'               #    7617   # "2007-01-02" to "2020-03-12"
      )
  
  # Old categories
  permitCatsA <- 
    c('BP_NEWCNST',             #    4984   # "2007-01-02" to "2020-02-20"
      'ZP_ADMIN',               #     803
      'ZP_USE',                 #    9651   # "2007-01-02" to "2020-03-12"
      'ZP_ZON/USE',             #   12921   # "2007-01-02" to "2020-03-12"
      'ZP_ZONING')              #    7617   # "2007-01-02" to "2020-03-12"
  
  # New categories
  permitCatsB <- 
    c('BUILDING',               #    8521   # "2015-01-12" to NOW ("2022-01-29")
      'RESIDENTIAL BUILDING',   #   15909   # "2015-01-06" to NOW ("2022-01-29")
      'ZONING')                 #   11187   # "2015-12-02" to NOW ("2022-01-29")
  
  
  # Building categories
  permitBuildingCats <- 
    c('BP_NEWCNST',             #    4984   # "2007-01-02" to "2020-02-20"
      'BUILDING',               #    8521   # "2015-01-12" to NOW ("2022-01-29")
      'RESIDENTIAL BUILDING')   #   15909   # "2015-01-06" to NOW ("2022-01-29")
  
  # Zoning categories
  permitZoningCats <- 
    c('ZP_ADMIN',               #     803
      'ZP_USE',                 #    9651   # "2007-01-02" to "2020-03-12"
      'ZP_ZON/USE',             #   12921   # "2007-01-02" to "2020-03-12"
      'ZP_ZONING',              #    7617   # "2007-01-02" to "2020-03-12"
      'ZONING')                 #   11187   # "2015-12-02" to NOW ("2022-01-29")
  
  
  # Data wrangling
  # Filter data to the selected PERMIT TYPES (zoning categories)
  permitsData <- permits %>%
    dplyr::select(all_of(permitsVars)) %>%
    filter(permittype %in% permitZoningCats) %>%
    filter(!is.na(lat), !is.na(lng)) %>%
    st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%   # lat/lng are WGS84 degrees
    st_transform(st_crs(phlcrs))
  
  # type of work across SELECTED categories (48):
  permitTypesOfWork <- as.data.frame(table(permitsData$typeofwork)) 
  
  # All categories
  permitTypeOfWorkCats <- 
    c(#'ADD',                                              #   1665 ***OLD ***ZONING {New construction attached or added to existing}
      #'ADDITION AND/OR ALTERATION',                       #  17051 ***NEW ***BUILDING (6290)  ***RESIDENTIAL BUILDING (10761)
      #'CHANGE OF USE',                                    #   3046 ***NEW ***ZONING {General change of use}
      'COMBINED LOT LINE RELOCATION AND NEW DEVELOPMENT', #    313 ***NEW ***ZONING {Combine or redraw lots} ***SIGNIFICANT!
      'COMDEM',                                           #   1163 ***OLD ***ZONING {Complete demolition}
      'ENTIRE',                                           #   2910 ***OLD ***ZONING (-2890) ***BUILDING {Entire structure}
      'ENTSTR',                                           #   3638 ***OLD ***ZONING
      'FULL DEMOLITION',                                  #    644 ***NEW ***ZONING {Full demolition}
      'LOT LINE RELOCATION',                              #    317 ***NEW ***ZONING {Combine or redraw lots} ***SIGNIFICANT!
      'LOTLIN',                                           #   2527 ***OLD ***ZONING {Combine or redraw lots} ***SIGNIFICANT!
      'NEW CONSTRUCTION',                                 #<- 6292 ***NEW ***BUILDING (1812) ***RESIDENTIAL BUILDING (4480)
      'NEW CONSTRUCTION (SHELL ONLY)',                    #<-   19 ***NEW ***BUILDING
      #'NEW CONSTRUCTION (STAND ALONE)',                   #<-   48 ***NEW ***RESIDENTIAL BUILDING
      'NEW CONSTRUCTION, ADDITION, GFA CHANGE',           #<- 4986 ***NEW ***ZONING {New construction attached or added to existing}
      'NEWCON'#                                           #<- 5539 ***OLD ***ZONING (-3) ***BUILDING {New construction of structure}
      #'PARTCH',                                           #   3317 ***OLD ***ZONING {Change in use}
      #'SFADD',                                            #   3149 ***OLD ***ZONING {Add square footage to existing}
      )
      
  # Filter data with PERMIT TYPES (building or zoning)
  permitsZoningData <- permitsData %>%
    filter(typeofwork %in% permitTypeOfWorkCats)
  
  # JOIN to properties
  # permitsZoningData <---> propertiesData
  permitsProps <-
    propertiesData %>%
    inner_join(st_drop_geometry(permitsZoningData),  by = c('parcel_number'='opa_account_num'))
  
  } else {
  
  # OR load locally
  permitsProps <- readRDS("../data/local/permitsProps.rds")
  
  }
  
# SAVE locally
saveRDS(permitsProps, file = "../data/local/permitsProps.rds")

Demographic Patterns

Because vacant properties are frequently a signal of disinvestment, where they cluster can reveal much about the socio-economic profile of a city. In Philadelphia, the distribution of vacant properties is markedly uneven, mirroring the spatial inequality of the city at large.

Race is one context in which this unevenness is apparent. Both vacant land and tax delinquent properties are overrepresented in majority–minority neighborhoods relative to majority white neighborhoods. Because properties tend to become vacant and tax delinquent where economic investment is low, this overrepresentation suggests that investment in Philadelphia falls along racial lines.

# set color palette
color1 <- "#ff6e14" 
color2 <- "#00bbc2"

majMin <- c('nhMajMin','non-white','white')
income <- c('nhIncome', 'below', 'over')
tenure <- c('nhTenure', 'not-owner', 'owner')


# define function to sum
getNeighFactor <-
  function(dataframe, name, vars) {
    dataframe %>%
      group_by(dataframe[[vars[1]]]) %>%
      summarize(nhCount = sum(n())) %>%
      na.omit() %>%
      mutate(var = c(vars[2],vars[3]),
             dataset = name) %>%
      dplyr::select(-1) %>%
      st_drop_geometry() 
  }

demoVarNames <- list(
  'properties' = 'Properties',
  'permits' = 'Permits',
  'vacantLand' = 'Vacant Lots',
  'delinquencies' = "Delinquencies"
)

dLabeller <- function(variable, value){
  return(demoVarNames[value])
}

# define function to plot in one single chart
plotNeighFactors <- 
  function(factors, title) {
    allFactors <- rbind(
    getNeighFactor(propertiesData, 'properties', factors),
    getNeighFactor(vacantLandProps, 'vacantLand', factors),
    getNeighFactor(delinquenciesProps, 'delinquencies', factors),
    getNeighFactor(permitsProps, 'permits', factors)) %>%
    mutate(factor = factor(dataset, levels = c("properties", "vacantLand", "delinquencies", "permits")))
    
    # get bar chart
    object1 <- allFactors %>%
    #filter(dataset != 'properties') %>%
    ggplot(aes(var, nhCount, fill = var)) + 
    geom_bar(stat = "identity", width = .8) +
    scale_fill_manual(values = c(color1, color2)) +
    labs(title="Neighborhood effects",
         subtitle = title,
         x = "",
         y = "properties") +
    geom_text(aes(label = var, color = var), vjust = -.5) +
    scale_colour_manual(values = c(color1, color2)) +
    facet_wrap(~factor, ncol = 4, strip.position = "bottom", labeller = dLabeller) +
    plotTheme() +
    theme(axis.text.x = element_blank(),
          legend.position = "none",
          panel.border = element_blank(),
          panel.background = element_rect(fill='transparent', color=NA),
          panel.grid.major.x = element_blank(),
          strip.text.x = element_text(size = 16, color = colors[1], hjust=0.5),
          strip.background=element_rect(fill='transparent', color=NA),
          plot.title = element_text(size = 20, face = "bold", hjust = 0.866),
          plot.subtitle = element_text(color = colors[1], face = "plain", hjust = 0.9),
          plot.background = element_rect(fill='transparent', color=NA),
          )
    
    # produce map
    object2 <-
      demographics %>%
      dplyr::select(factors[1]) %>%
      mutate(var = case_when(
        .[[factors[1]]] == 0 ~ factors[2],
        .[[factors[1]]] == 1 ~ factors[3])) %>%
      dplyr::select(-factors[1]) %>%
      ggplot() +
      geom_sf(aes(fill = var), color = NA) +
      scale_fill_manual(values = c(color1, color2)) +
      mapTheme() +
      theme(axis.text.x = element_blank(),
            legend.position = "none",
            panel.border = element_blank(),
            panel.background = element_rect(fill = "#ffffff"),
            panel.grid.major.x = element_blank(),
            plot.background = element_rect(fill='transparent', color=NA))
    
    # put together and ggplot
    grid.arrange(object1,
                 object2,
                 ncol = 2,
                 widths = c(2, 1))
    }


plotNeighFactors(majMin, "Number of properties by Majority racial composition")

ggsave(plotNeighFactors(majMin, "Number of properties by Majority racial composition"), filename = "../disparity02.png",  bg = "transparent")

Median income is another demographic feature affecting the distribution of vacant properties in Philadelphia. Vacant and tax delinquent properties appear in greater numbers in areas where household incomes fall below the citywide median than in those above it. The reason may be self-evident: higher-income households are naturally better equipped to maintain their properties and keep up with property taxes than lower-income households.

color1 <- "#fc754c"
color2 <- "#4c84f5" 

plotNeighFactors(income, "Number of properties by Average Household Income")

ggsave(plotNeighFactors(income, "Number of properties by Average Household Income"), filename = "../disparity01.png",  bg = "transparent")

A third division, home ownership, also appears to be correlated with a neighborhood’s relative share of vacant properties. Neighborhoods with a higher proportion of renters see more vacant and tax delinquent properties than those composed predominantly of homeowners.

color1 <- "#fc5a4c"
color2 <- "#4cbaf5"

plotNeighFactors(tenure, "Number of properties by Amount of Property Ownership")


Development Activity

As the real estate market has continued to heat up in Philadelphia over the past decade, development activity has increased at a steady clip. Vacant and tax delinquent properties have lately become attractive to developers as a result, so to understand which vacant properties may be at greatest risk of disposition, one must also understand which areas are seeing the steepest rise in development activity.

A readily available proxy for development activity is zoning permits. For all forms of new real estate development, developers must obtain a zoning permit from the City of Philadelphia, even where the development is allowed by right. As such, a map of zoning permits in Philadelphia, shown below, can illustrate where development activity is clustering.

# Set time constraints
initYr <- 2011
startYr <- 2013
endYr <- 2016

permitLag <- 3


# Permits (Y variable)
# by fishnet cell and two scales of time (quarters and years)
permitsNet <-
  permitsProps %>%
  st_join(fishnet, left = F) %>%
  mutate(year = year(permitissuedate),
         month = month(permitissuedate),
         quarter = case_when(
           month %in% c(1, 2, 3) ~ 1,
           month %in% c(4, 5, 6) ~ 2,
           month %in% c(7, 8, 9) ~ 3,
           month %in% c(10, 11, 12) ~ 4)) %>%
  mutate(period = paste(year, quarter, sep='-')) %>%
  dplyr::select(-month)


# empty panel with all possible time/space combinations by fishnet grids YEARS
# for INDEPENDENT VARIABLES
yearPanel <- 
  expand.grid(year = seq(startYr, endYr), 
              uniqueID = unique(fishnet$uniqueID))

# empty panel with all possible time/space combinations by fishnet grids YEARS
# for PERMITS / DEPENDENT VARIABLE
permitPanel <- 
  expand.grid(year = seq(startYr + permitLag, endYr + permitLag), 
              uniqueID = unique(fishnet$uniqueID))


# create empty panel with all possible time/space combinations by fishnet grids YEARS EXT
extYearPanel <- 
  expand.grid(year = seq(initYr, endYr), 
              uniqueID = unique(fishnet$uniqueID))


# Select the zoning permits issued over the lagged interval (2016 through 2019, i.e., three years after each feature year)
if (SWITCH == T) {
  
  permitsNetYears <-
    permitsNet %>%
    filter(year <= endYr + permitLag,
           year >= startYr + permitLag) %>%
    group_by(year, uniqueID) %>%
    summarize(permitCount = sum(n())) %>%
    st_drop_geometry() %>%
    right_join(permitPanel) %>% 
    replace(is.na(.), 0) %>%
    mutate(permitDummy = as.factor(ifelse(permitCount > 0, 1, 0)),
           permitDummyNumeric = ifelse(permitCount > 0, 1, 0))
  
  } else {
    
    # OR read locally
    permitsNetYears <- readRDS("../data/local/permitNetYears.Rds")
    
  }

# SAVE locally
saveRDS(permitsNetYears, "../data/local/permitNetYears.Rds")

# historic permits per fishnet for chart


# empty panel with all possible time/space combinations by fishnet grids YEARS
# for INDEPENDENT VARIABLES
allYearsPanel <- 
  expand.grid(year = unique(permitsNet$year), 
              uniqueID = as.character(unique(fishnet$uniqueID)))

allPeriodsPanel <- 
  expand.grid(period = as.factor(unique(permitsNet$period)), 
              uniqueID = unique(fishnet$uniqueID))


# join years panel to fishnet geometry
permitsPanelAllYears <- 
  permitsNet %>%
  dplyr::select(uniqueID, year) %>%
  mutate(uniqueID = as.factor(uniqueID)) %>%
  st_drop_geometry() %>%
  right_join(allYearsPanel) %>% 
  replace(is.na(.), 0) %>%
  left_join(fishnet) %>%
  st_sf() %>%
  filter(year < 2022) %>%
  mutate(lustrum = case_when(
    year %in% seq(2007, 2011) ~ '2007-2011',
    year %in% seq(2012, 2016) ~ '2012-2016',
    year %in% seq(2017, 2021) ~ '2017-2021'))

# join periods panel to fishnet geometry
permitsPanelAllPeriods <- 
  permitsNet %>%
  dplyr::select(uniqueID, period) %>%
  mutate(uniqueID = as.factor(uniqueID),
         period = as.factor(period)) %>%
  st_drop_geometry() %>%
  right_join(allPeriodsPanel) %>% 
  replace(is.na(.), 0) %>%
  left_join(fishnet) %>%
  st_sf()
  


if (SWITCH == T) {
  # count zoning permits by year and fishnet cell (all years)
  permitsAll <- 
    permitsPanelAllYears %>%
    filter(year != '2022') %>%
    dplyr::select(-lustrum) %>%
    group_by(year, uniqueID) %>%
    summarize(permitCount = n())
  
  } else {
  
    # OR read locally
    permitsAll <- readRDS("../data/local/permitsAll.Rds")
    
  }

# SAVE locally
saveRDS(permitsAll, "../data/local/permitsAll.Rds")
# plot map of permits over all years
permitsAll %>%
  ggplot() +
  geom_sf(aes(fill = permitCount), color = NA) +
  scale_fill_viridis(
    option = "magma",
    limits = c(1, 120),
    trans = "log1p",
    breaks = c(0, 5, 25, 50, 100),
    direction = 1) +
  labs(title = "Development Activity in Philadelphia",
       subtitle = "Zoning permits issued between 2007 and 2021",
       fill = "number of permits",
       caption = "Source: OpenDataPhilly") +
  guides(fill= guide_colorbar(barheight = 6, title.position = "top", label.position = "right")) +
  mapTheme() +
  plaTheme()

Where development is occurring is only one relevant component, however; equally important is when. The maps below display how development activity has shifted and intensified over time.

Much like with vacant properties, development activity tracks with specific demographic patterns, as well. The maps below show majority–minority Philadelphia neighborhoods with majority white neighborhoods removed, in order to underscore the steady movement of development activity into majority–minority neighborhoods and out of majority white ones, a process most commonly known as gentrification.

# classify zoning permits by five-year periods

if (SWITCH == T) {
  
  permitsLustrums <- 
    permitsPanelAllYears %>%
    filter(year < 2022) %>%
    mutate(lustrum = case_when(
             year %in% seq(2007, 2011) ~ '2007-2011',
             year %in% seq(2012, 2016) ~ '2012-2016',
             year %in% seq(2017, 2021) ~ '2017-2021'
           )) %>%
    group_by(lustrum, uniqueID) %>%
    summarize(permitCount = sum(n()))
  
  } else {
    
    permitsLustrums <- readRDS("../data/local/permitLustrums.Rds")
    
}

# SAVE locally
saveRDS(permitsLustrums, "../data/local/permitLustrums.Rds")



# side-by-side five-year period zoning permits
permitsLustrums %>%
  ggplot() +
  geom_sf(aes(fill = permitCount), color = NA) +
  scale_fill_viridis(
    option = "magma",
    limits = c(5,120),
    trans = "log1p",
    breaks = c(0,5,20,40,120),
    direction = 1) +
  labs(
    title = "Zoning permits issued in Philadelphia",
    subtitle = "by five-year periods. All neighborhoods.",
    fill = "number of permits",
    caption = "Source: OpenDataPhilly") +
  guides(
    size = F,
    fill= guide_colorbar(barwidth = 21, title.position = "top")) +
  facet_wrap(~lustrum, ncol = 3) +
  mapTheme() +
  plaTheme() +
  theme(
    legend.position = "bottom",
    legend.background = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill = colors[2]),
    panel.grid = element_blank(),
    strip.background = element_rect(fill = colors[2]),
    strip.text.x = element_text(size = 12, color = colors[1], hjust=0.01)
    )

# side-by-side five-year period zoning permits
permitsLustrums %>%
  ggplot() +
  geom_sf(aes(fill = permitCount), color = NA) +
  scale_fill_viridis(
    option = "magma",
    limits = c(5,120),
    trans = "log1p",
    breaks = c(0,30,60,120),
    direction = 1) +
  labs(
    title = "Zoning permits issued in Philadelphia",
    subtitle = "by five-year periods. Non-white majority neighborhoods.",
    fill = "number of permits",
    caption = "Source: OpenDataPhilly") +
  guides(
    size = F,
    fill= guide_colorbar(barwidth = 21, title.position = "top")) +
  geom_sf(data = majminFishnet, fill = colors[2], color = colors[2], alpha= 1) +
  geom_sf(data = phlFishnet, fill = NA, colour = colors[1]) +
  facet_wrap(~lustrum, ncol = 3) +
  mapTheme() +
  plaTheme() +
  theme(
    legend.position = "bottom",
    legend.background = element_blank(),
    panel.border = element_blank(),
    panel.background = element_rect(fill = colors[2]),
    panel.grid = element_blank(),
    strip.background = element_rect(fill = colors[2]),
    strip.text.x = element_text(size = 12, color = colors[1], hjust=0.01)
    )


Modeling

Although visualizing vacant lot data reveals much on its own, more can be done to leverage that data to make relevant predictions. Here, we use publicly accessible data to design a prediction model assessing the risk that vacant properties across Philadelphia will be acquired and developed.

Understanding development risk achieves several important goals for a range of stakeholders. Neighbors and property owners can know if a side yard or urban garden in their neighborhood is at heightened risk and seek aid from organizations such as Philadelphia Legal Assistance. Government and elected officials can tailor policies to protect community assets in their jurisdictions. Community members, the press, and public advocates can call attention to areas where development pressures are especially acute. And of course, legal aid organizations such as PLA can anticipate which neighborhoods are likely to see the most development interest and craft their legal strategy accordingly.

In the first part of this section, we offer a brief overview of our modeling approach. Next, we explain the various features we fed into our prediction model and how we manipulated publicly accessible data to engineer these features. Finally, we report on the results of our model, assessing its accuracy and viability as a predictive tool.


Modeling Approach

Model Output/Dependent Variable

The first step in creating a model that will predict development risk is to determine what specific piece of data represents the closest proxy for development risk. That data ultimately will serve as the model’s output, also known as its dependent variable. By predicting for that proxy, the model effectively will predict for development risk.

Here, our model treats zoning permits as the final output given how closely these permits are linked to development demand. For virtually all new development projects, the City of Philadelphia requires that developers obtain a zoning permit, so developers will frequently prioritize applying for these permits early in the project timeline to confirm that the project is feasible. Zoning permits thus also serve as a leading indicator of where development will likely concentrate in the future. (Importantly, the requirement to obtain a zoning permit holds even for projects that are permitted by right under the zoning code.) We filtered the full set of zoning permits down to the types most relevant to this use case, such as permits for new construction, demolition, and lot line changes.
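
As a point of reference, this filtering step reuses the permit type and work type selections defined in the Exploratory Analysis code above; a condensed restatement (a sketch, not additional processing) looks like:

# Sketch: restrict raw permits to the selected zoning permit types and work types,
# reusing `permits`, `permitZoningCats`, and `permitTypeOfWorkCats` from the
# Exploratory Analysis section above.
relevantPermits <-
  permits %>%
  filter(permittype %in% permitZoningCats,
         typeofwork %in% permitTypeOfWorkCats)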

Furthermore, our model specifically predicts zoning permits rather than building permits, for two reasons. First, whereas developers will try to obtain zoning approval early in the project timeline, as mentioned above, they will often wait to apply for building permits until later; building permits are therefore more of a lagging indicator than zoning permits. More importantly, whereas developers will seek zoning permits as a matter of course for any targeted project, they will often refrain from seeking a building permit until the project’s feasibility is assured. This is a crucial difference for the use case here, since our model aims to predict development risk, not necessarily actual development. Neighbors and community members can be harmed not only when a vacant property is ultimately developed into a physical building, but also when that property is first purchased, as new owners will frequently fence it off and eliminate the property’s community use. For the purposes of our model, we therefore care less about whether a given vacant property is actually ripe for development and more about whether a developer is interested in the property in the first place.


Training and Testing Split

Data is divided into training and testing sets according to year, with a three-year lag between our predictive features and our dependent variable, zoning permits. For our independent variables, 2013, 2014, and 2015 form the training set, whereas 2016 is used to test the model’s accuracy. For our dependent variable, the corresponding zoning permits issued in 2016, 2017, and 2018 form the training set, while 2019 permits are used for testing. Because we developed the model iteratively, we further partitioned the training data so that we would always test and evaluate the models on new, unseen data.
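
A minimal sketch of this split is shown below; `featuresNetYears` is a hypothetical feature panel keyed by fishnet uniqueID and feature year, standing in for the panels assembled in the following sections, while `permitLag` and `permitsNetYears` come from the code above.

# Sketch of the year-based split with a three-year lag (illustrative object names;
# `featuresNetYears` is a hypothetical feature panel keyed by uniqueID and year).
modelPanel <-
  featuresNetYears %>%
  left_join(
    permitsNetYears %>%
      transmute(uniqueID,
                featureYear = year - permitLag,   # permits in year t + 3 attach to features in year t
                permitCount),
    by = c("uniqueID", "year" = "featureYear"))

trainPanel <- filter(modelPanel, year %in% 2013:2015)   # outcomes: 2016-2018 permits
testPanel  <- filter(modelPanel, year == 2016)          # outcome:  2019 permits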


Geographic Units: fishnet

To assign development risk scores, we created a “fishnet” grid of equally sized cells covering all of Philadelphia. Each cell measures 250 meters by 250 meters. Every cell is characterized by a long list of data features (described below) that serve as the inputs our model uses to predict the number of zoning permits in that cell.
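
The grid itself is constructed earlier in our code; for reference, a minimal sketch of how such a fishnet can be built with sf is shown below, assuming phlcounty holds the city boundary already projected to phlcrs (a CRS measured in meters).

# minimal sketch of the fishnet construction (assumes `phlcounty` is the city
# boundary, already projected to `phlcrs`, a CRS measured in meters)
fishnet <-
  st_make_grid(phlcounty, cellsize = 250, square = TRUE) %>% # 250 m x 250 m cells
  st_sf() %>%
  st_filter(phlcounty) %>%            # keep only cells that intersect the city
  mutate(uniqueID = row_number())     # ID used to join features to cells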


Feature Engineering

Features List
  1. salesCount: total count of property sales
  2. meanSqftPrice: average price per square foot for most recent sale
  3. pop: total population
  4. percWhite: percentage of total population that is white
  5. medInc: median income
  6. medRent: median rent price
  7. popChange1yr: percent change in population over one year
  8. popChange2yr: percent change in population over two years
  9. percWhiteChange1yr: percent change in white population over one year
  10. percWhiteChange2yr: percent change in white population over two years
  11. medIncChange1yr: percent change in median income over one year
  12. medIncChange2yr: percent change in median income over two years
  13. medRentChange1yr: percent change in median rent over one year
  14. medRentChange2yr: percent change in median rent over two years
  15. sheriffSalesCount: count of sheriff’s sales
  16. meanPrice: average price of most recent sale
  17. meanDebt: average amount of delinquent debt
  18. indOwnerDelta: change in number of discrete property owners
  19. cumulativeVacantArea: total area constituting vacant property
  20. totalVacantLots: total number of vacant lots
  21. jobCount: total number of jobs
  22. lag1years: change in total number of jobs over one year
  23. lag2years: change in total number of jobs over two years
  24. parks: total number of parks
  25. licenses / licensesNN: number of business licenses and nearest neighbors of such (see the nearest-neighbor helper sketch after this list)
  26. parks / parksNN: number of parks and nearest neighbors of such
  27. schools / schoolsNN: number of schools and nearest neighbors of such
  28. transit / transitNN: number of transit and trolley stops and nearest neighbors of such
  29. crime / crimeNN: incidents of violent crime and nearest neighbors of such
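
Several features in this list are average nearest-neighbor distances (the NN variants). They rely on two small helpers, st_c and nn_function, which are defined earlier in the document and used throughout the code below; the sketch that follows shows what such helpers can look like (illustrative implementations, not necessarily the exact ones we use).

# illustrative sketch of the nearest-neighbor helpers used below
# st_c: shorthand for extracting a coordinate matrix from an sf object
st_c <- function(sf_obj) st_coordinates(sf_obj)

# nn_function: for each point in `measureFrom`, the average distance to its
# k nearest points in `measureTo` (uses the FNN package)
nn_function <- function(measureFrom, measureTo, k) {
  nn_dist <- FNN::get.knnx(measureTo, measureFrom, k = k)$nn.dist
  rowMeans(nn_dist)
}
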
# variables to keep
salesVars <- 
  c("geometry",
    "document_type",
    "display_date",
    "grantors",
    "grantees",
    "total_consideration",
    "year",
    "period",
    "total_area")


if (SWITCH == T) {
  # if a parcel_number has multiple sales, we only get the latest one
  salesProps <-
    transfersProps %>%
    drop_na(total_area) %>% # drop sales that have NA for their area
    mutate(document_type = as.character(document_type)) %>%  # to ignore Gillian deed factor levels
    group_by(parcel_number) %>%
    slice(which.max(display_date)) %>% # to get only the latest one
    mutate(year = year(display_date),
           month = month(display_date),
           quarter = case_when(
             month %in% c(1, 2, 3) ~ 1,
             month %in% c(4, 5, 6) ~ 2,
             month %in% c(7, 8, 9) ~ 3,
             month %in% c(10, 11, 12) ~ 4)) %>%
    mutate(period = paste(year, quarter, sep='-')) %>%
    st_sf()
  
  
  # select and transform features + filter by period + JOIN into fishnet
  salesPropsNet <- 
    salesProps %>%
    dplyr::select(all_of(salesVars)) %>%
    filter(year <= endYr,
           year >= startYr,
           total_consideration < 4e8,
           total_consideration > 1e4) %>%  # filter outliers and most "symbolic" transfers
    mutate(sqftPrice = ifelse(total_area > 0, total_consideration/total_area, 0)) %>% # calculate price by sqft
    st_join(fishnet, left = F) %>%
    st_drop_geometry()
  
  } else {
  salesPropsNet <- readRDS("../data/local/salesPropsNet.rds")
}


if (SWITCH == T) {
  
  # SALES COUNT
  # Get sales count by fishnet from 2013 to 2016
  salesNetYears <- 
    salesPropsNet %>%
    group_by(year, uniqueID) %>%
    dplyr::summarize(salesCount = sum(n())) %>%
    right_join(yearPanel) %>%
    replace(is.na(.), 0)
  

  # PRICE BY SQ FOOT
  # Get sqft price paid by fishnet from 2013 to 2016
  sqftNetYears <-
    salesPropsNet %>%
    group_by(year, uniqueID) %>%
    dplyr::summarize(meanSqftPrice = mean(sqftPrice)) %>%
    right_join(yearPanel) %>%
    replace(is.na(.), 0)

  
  # MEAN PRICE
  # Get average sale price by fishnet from 2013 to 2016
  priceNetYears <-
    salesPropsNet %>%
    group_by(year, uniqueID) %>%
    dplyr::summarize(meanPrice = mean(total_consideration)) %>%
    right_join(yearPanel) %>%
    replace(is.na(.), 0)
  
  
  # INDIVIDUAL OWNERS
  # Get change in number of individual owners by fishnet from 2013 to 2016
  individualOwnersNetYears <- 
    salesPropsNet %>%
    group_by(year, uniqueID) %>%
    dplyr::summarize(
      numIndGrantors = n_distinct(grantors),
      numIndGrantees = n_distinct(grantees)) %>%
    mutate(indOwnerDelta = abs(numIndGrantors - numIndGrantees)) %>%
    dplyr::select(-numIndGrantors, -numIndGrantees) %>%
    right_join(yearPanel) %>%
    replace(is.na(.), 0)
  
  } else {
    
    # OR read locally
    salesNetYears <- readRDS("../data/local/salesNetYears.rds")
    sqftNetYears <- readRDS("../data/local/sqftNetYears.rds")
    priceNetYears <- readRDS("../data/local/priceNetYears.rds")
    individualOwnersNetYears <- readRDS("../data/local/individualOwnersNetYears.rds")
    
    }



# DEBT
# All delinquent properties

if (SWITCH == T) {
  
  debtNetYears <-
    delinquenciesProps %>%
    filter(oldest_year_owed <= endYr) %>% # take out the ones whose debt started after 2018
    rename('year' = most_recent_year_owed) %>%
    st_join(., fishnet) %>%
    group_by(year, uniqueID) %>%
    dplyr::summarize(meanDebt = mean(total_due)) %>%
    filter(year >= startYr) %>%
    st_drop_geometry() %>%
    right_join(yearPanel) %>%
    replace(is.na(.), 0)
  
  } else {
    
    # OR read locally
    debtNetYears <- readRDS("../data/local/debtNetYears.rds")
    
  }



# VACANT LOTS
# get count of vacant lots and cumulative vacant area
# ***THIS ONE HAS NO YEAR DIFFERENCE

if (SWITCH == T) {
  
  vacantNet <- 
    vacantLandProps %>%
    drop_na(total_area) %>% # drop sales that have NA for their area
    st_sf() %>%
    st_join(., fishnet) %>%
    group_by(uniqueID) %>%
    dplyr::summarize(cumulativeVacantArea = sum(total_area),
              totalVacantLots = sum(n())) %>%
    st_drop_geometry() %>%
    right_join(yearPanel) %>%
    replace(is.na(.), 0)
  
  } else {
    
    # OR read locally
    vacantNet <- readRDS("../data/local/vacantNet.rds")
    
  }



# SHERIFF SALES
# Count of sheriff sales by fishnet square
if (SWITCH == T) {
  # get properties sold on a sheriff sale
  sheriffSales <-
    sheriffProps %>%
    mutate(year = year(sale_date)) %>%
    select(allSheriffSales, year) %>%
    filter(year %in% seq(startYr, endYr))
  
  
  # join to fishnet and years panel
  sheriffSalesNetYears <-
    sheriffSales %>%
    dplyr::select(year, allSheriffSales) %>%
    st_join(fishnet) %>%
    group_by(year, uniqueID) %>%
    dplyr::summarize(sheriffSalesCount = sum(n())) %>%
    st_drop_geometry() %>%
    right_join(yearPanel) %>%
    replace(is.na(.), 0)
  
  } else {
    
    # OR read locally
    sheriffSalesNetYears <- readRDS("../data/local/sheriffSalesNetYears.rds")
    
    }


# SAVE locally

saveRDS(salesPropsNet, file = "../data/local/salesPropsNet.rds")
saveRDS(salesNetYears, file = "../data/local/salesNetYears.rds")
saveRDS(sqftNetYears, file = "../data/local/sqftNetYears.rds")
saveRDS(priceNetYears, file = "../data/local/priceNetYears.rds")
saveRDS(individualOwnersNetYears, file = "../data/local/individualOwnersNetYears.rds")
saveRDS(debtNetYears, file = "../data/local/debtNetYears.rds")
saveRDS(vacantNet, file = "../data/local/vacantNet.rds")
saveRDS(sheriffSalesNetYears, file = "../data/local/sheriffSalesNetYears.rds")
if (SWITCH == T) {
  
  # census variables to request to the API
  censusVars <- c("B01001_001", "B01001A_001", "B19013_001", "B25064_001")
  
  # for loop to get demographic data for years and change in selected variables
  for(year in seq(initYr, endYr)) {
    yr <- as.character(year - 2000)
    yr1 <- as.character(year - 2000 - 1)
    yr2 <- as.character(year - 2000 - 2)
    
    # get this year data
    demoData <-
      get_acs(
        geography = "tract",
        variables = censusVars, 
        year = year, 
        state = "PA", 
        geometry = T, 
        county = c("Philadelphia"),
        output = "wide") %>%
      mutate(!!paste0('percWhite', yr) := B01001A_001E/B01001_001E) %>%
      rename(!!paste0('pop', yr) := B01001_001E,
             !!paste0('medInc', yr) := B19013_001E,
             !!paste0('medRent', yr) := B25064_001E) %>%
      dplyr::select(-ends_with('M'), -NAME, -B01001A_001E) %>%
      st_drop_geometry()
    
  
    if (year > initYr) {
      demoData <-
        full_join(lastYearData, demoData, by = 'GEOID')
      
      if (year >= startYr) {
        
        medIncYr <- demoData[paste0('medInc', yr)]
        medIncYr1 <- demoData[paste0('medInc', yr1)]
        medIncYr2 <- demoData[paste0('medInc', yr2)]    
        
        medRentYr <- demoData[paste0('medRent', yr)]
        medRentYr1 <- demoData[paste0('medRent', yr1)]
        medRentYr2 <- demoData[paste0('medRent', yr2)]
  
        demoData <-
          demoData %>%
          mutate(popChange1 = ((demoData[paste0('pop', yr)] - demoData[paste0('pop', yr1)])/demoData[paste0('pop', yr1)])[,1],
                 popChange2 = ((demoData[paste0('pop', yr)] - demoData[paste0('pop', yr2)])/demoData[paste0('pop', yr2)])[,1]) %>%
          rename(!!paste0('popChange1yr', yr, yr1) := popChange1,
                 !!paste0('popChange2yr', yr, yr2) := popChange2) %>%
          mutate(percWhiteChange1 = (demoData[paste0('percWhite', yr)] - demoData[paste0('percWhite', yr1)])[,1],
                 percWhiteChange2 = (demoData[paste0('percWhite', yr)] - demoData[paste0('percWhite', yr2)])[,1]) %>%
          rename(!!paste0('percWhiteChange1yr', yr, yr1) := percWhiteChange1,
                 !!paste0('percWhiteChange2yr', yr, yr2) := percWhiteChange2) %>%
          mutate(medIncChange1 = ((medIncYr - medIncYr1) / medIncYr1)[,1],
                 medIncChange2 = ((medIncYr - medIncYr2) / medIncYr2)[,1]) %>%
          rename(!!paste0('medIncChange1yr', yr, yr1) := medIncChange1,
                 !!paste0('medIncChange2yr', yr, yr2) := medIncChange2) %>%
          mutate(medRentChange1 = ((medRentYr - medRentYr1) / medRentYr1)[,1],
                 medRentChange2 = ((medRentYr - medRentYr2) / medRentYr2)[,1]) %>%
          rename(!!paste0('medRentChange1yr', yr, yr1) := medRentChange1,
                 !!paste0('medRentChange2yr', yr, yr2) := medRentChange2)
        }
      }
  
    lastYearData <-
      demoData %>%
      replace(is.na(.), 1)
    
  }
  
  # variables from reference lag years to take out
  initYrVars <- c()
  
  for (var in c('pop', 'medInc' , 'medRent', 'percWhite')) {
    initYrVars <- c(initYrVars,
                    paste0(var, substr(as.character(initYr), 3, 4)),
                    paste0(var, substr(as.character(initYr+1), 3, 4)))
  }
  
  
  # join to census tracts
  demoJoined <- 
    demoData %>%
    left_join(phltracts, by = "GEOID") %>%
    dplyr::select(-all_of(initYrVars)) %>%
    st_sf()
  
  # get all variables names
  demoVars <- colnames(demoJoined)
  
  
  # This function to iterate over year variable, interpolates them to the census
  # tract geometry to a fishnet cell and JOIN them by year and fishnet cell
  createFeature <-
    function(input, varList, fishnet) {
      output <- data.frame(matrix(ncol = 3, nrow = 0))
      
      name <- sub("[0-9]{2,4}$", "", varList[1])
      
      for (var in varList) {
        year = as.numeric(paste0(20, str_sub(var, -2, -1))) + 1
      
        data <-
          st_interpolate_aw(input[var], fishnet, extensive = T) %>%
          as.data.frame(.) %>%
          left_join(fishnet, ., by = "geometry") %>%
          rename(!!name := var) %>%
          mutate(year = year) %>%
          st_drop_geometry() %>%
          select(year, uniqueID, name)
        
        output <- rbind(output, data)
        
        }
      return(output)
      }
  
  # population variables
  varsPop <- as.vector(demoVars[grep('pop[0-9]', demoVars)])
  varsPopChange1yr <- as.vector(demoVars[grep('popChange1yr', demoVars)])
  varsPopChange2yr <- as.vector(demoVars[grep('popChange2yr', demoVars)])
  
  # white variables
  varsWhite <- as.vector(demoVars[grep('percWhite[0-9]', demoVars)])
  varsWhiteChange1yr <- as.vector(demoVars[grep('percWhiteChange1yr', demoVars)])
  varsWhiteChange2yr <- as.vector(demoVars[grep('percWhiteChange2yr', demoVars)])
  
  # Median income variables
  varsMedInc <- as.vector(demoVars[grep('medInc[0-9]', demoVars)])
  varsMedIncChange1yr <- as.vector(demoVars[grep('medIncChange1yr', demoVars)])
  varsMedIncChange2yr <- as.vector(demoVars[grep('medIncChange2yr', demoVars)])
  
  # median Rent variables
  varsMedRent <- as.vector(demoVars[grep('medRent[0-9]', demoVars)])
  varsMedRentChange1yr <- as.vector(demoVars[grep('medRentChange1yr', demoVars)])
  varsMedRentChange2yr <- as.vector(demoVars[grep('medRentChange2yr', demoVars)])
  
  
  
  # Panel with all demographic information
  demoNetYears <- 
    yearPanel %>%
    left_join(., createFeature(demoJoined, varsPop, fishnet), by = c('year', 'uniqueID')) %>%
    left_join(., createFeature(demoJoined, varsPopChange1yr, fishnet), by = c('year', 'uniqueID')) %>%
    left_join(., createFeature(demoJoined, varsPopChange2yr, fishnet), by = c('year', 'uniqueID')) %>%
    left_join(., createFeature(demoJoined, varsWhite, fishnet), by = c('year', 'uniqueID')) %>%
    left_join(., createFeature(demoJoined, varsWhiteChange1yr, fishnet), by = c('year', 'uniqueID')) %>%
    left_join(., createFeature(demoJoined, varsWhiteChange2yr, fishnet), by = c('year', 'uniqueID')) %>%
    left_join(., createFeature(demoJoined, varsMedInc, fishnet), by = c('year', 'uniqueID')) %>%
    left_join(., createFeature(demoJoined, varsMedIncChange1yr, fishnet), by = c('year', 'uniqueID')) %>%
    left_join(., createFeature(demoJoined, varsMedIncChange2yr, fishnet), by = c('year', 'uniqueID')) %>%
    replace(is.na(.), 0) %>%
    mutate_if(is.numeric, function(x) ifelse(is.infinite(x), 0, x))
  
  # Panel with all rent information
  rentNetYears <- 
    yearPanel %>%
    left_join(., createFeature(demoJoined, varsMedRent, fishnet), by = c('year', 'uniqueID')) %>%
    left_join(., createFeature(demoJoined, varsMedRentChange1yr, fishnet), by = c('year', 'uniqueID')) %>%
    left_join(., createFeature(demoJoined, varsMedRentChange2yr, fishnet), by = c('year', 'uniqueID')) %>%
    replace(is.na(.), 0) %>%
    mutate_if(is.numeric, function(x) ifelse(is.infinite(x), 0, x))
  
} else {
  # OR read locally
  demoNetYears <- readRDS("../data/local/demoNet.rds")
  rentNetYears <- readRDS("../data/local/rentNet.rds")
}

# SAVE locally
saveRDS(demoNetYears, file = "../data/local/demoNet.rds")
saveRDS(rentNetYears, file = "../data/local/rentNet.rds")
# ALL exposure features

if (SWITCH == T) {
  
  # SCHOOLS
  # [source](https://metadata.phila.gov/#home/datasetdetails/5543866320583086178c4ef1/)
  
  # take out special and Kindergardens
  schoolsSelected <- c(
    "ELEMENTARY/MIDDLE",
    "ELEMENTARY/MIDDLE/HIGH",
    "HIGH SCHOOL",
    "ELEMENTARY SCHOOL",
    "MIDDLE/HIGH",
    "MIDDLE SCHOOL")
  
  # get school locations
  schoolsData <-
    st_read('https://opendata.arcgis.com/datasets/d46a7e59e2c246c891fbee778759717e_0.geojson') %>%
    st_transform(st_crs(phlcrs)) %>%
    filter(GRADE_LEVEL %in% schoolsSelected) %>%
    dplyr::select(geometry) %>%
    mutate(legend = 'schools')
  
  
  
  # PARKS
  # [source](https://metadata.phila.gov/#home/datasetdetails/5dc1aeb93741fa001504b10b/representationdetails/5dc1aeb93741fa001504b10f/)
  # get parks polygon data
  parksData <-
    st_read('https://opendata.arcgis.com/datasets/d52445160ab14380a673e5849203eb64_0.geojson') %>%
    st_transform(st_crs(phlcrs)) %>%
    filter(!PROPERTY_CLASSIFICATION %in% c('WATERSHED_PARK', 'REGIONAL_PARK')) %>% # Take out big parks and leave nested one
    dplyr::select(geometry) %>%
    st_centroid(.) %>%
    mutate(legend = 'parks')
  
  
  
  # TRANSIT
  # read locally # TODO: create cloud source version
  transitData <-
    rbind(
      read_csv("./data/SEPTA_-_Highspeed_Stations.csv") %>%
        mutate(mode = 'subway') %>%
        st_as_sf(coords = c("Longitude","Latitude"), crs = 4269) %>%
        st_transform(st_crs(phlcrs)) %>%
        dplyr::select(mode, geometry),
      read_csv("./data/SEPTA_-_Trolley_Stops.csv") %>%
        mutate(mode = 'trolley') %>%
        st_as_sf(coords = c("Longitude","Latitude"), crs = 4269) %>%
        st_transform(st_crs(phlcrs)) %>%
        dplyr::select(mode, geometry)) %>%
    mutate(legend = 'transit') %>%
    dplyr::select(-mode)
  
  
  
  # FOOD LICENSES
  # ODP Business Licenses Dataset
  # [source]('https://metadata.phila.gov/#home/datasetdetails/5543865a20583086178c4ed2/representationdetails/5e985a5e344ed50018936bb8/')
  
  # set variables for wrangling
  licenseVars <- 
    c('initialissuedate',
      'inactivedate',
      'licensetype',
      'geometry')
  
  # get licenses data from carto API
  licenses <- 
    st_read('https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+business_licenses&filename=business_licenses&format=geojson&skipfields=cartodb_id')
  
  # set license types related to food
  licenseTypes <- c(
    "Food Preparing and Serving",
    "Food Manufacturer / Wholesaler",
    "Food Establishment, Retail Perm Location (Large)",
    "Food Establishment, Retail Permanent Location",
    "Food Preparing and Serving (30+ SEATS)",
    "Food Caterer",
    "Sidewalk Cafe",
    "Public Garage / Parking Lot",
    "Curb Market",
    "Food Establishment, Outdoor",
    "Sidewalk Cafe (Temporary)"
    )
  
  # select only food licenses that were initially issued before 2020
  licensesData <- 
    licenses %>%
    st_transform(st_crs(phlcrs)) %>%
    dplyr::select(licenseVars) %>%
    filter(licensetype %in% licenseTypes,
           initialissuedate < as.Date.character(paste0(as.character(endYr), '-01-01')),
           !inactivedate > as.Date.character(paste0(as.character(startYr), '-01-01'))) %>%
    dplyr::select(geometry) %>%
    mutate(legend = 'licenses') %>%
    mutate(coords = st_coordinates(.)) %>%
    na.omit() %>%
    dplyr::select(-coords)
  
  
  
  # CRIME
  # [source](https://metadata.phila.gov/#home/datasetdetails/5543868920583086178c4f8e/representationdetails/570e7621c03327dc14f4b68d/)
  
  crimeEndpoint <- 'https://phl.carto.com/api/v2/sql?filename=incidents_part1_part2&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator&q=SELECT%20text_general_code,dispatch_date%20,%20ST_Y(the_geom)%20AS%20lat,%20ST_X(the_geom)%20AS%20lng%20FROM%20incidents_part1_part2%20WHERE%20dispatch_date_time%20%3E=%20%27'
  
  crimeStartDate <- paste0(startYr, '-01-01')
  crimeEndDate <- paste0(endYr + 1, '-01-01')
  
  crimeQuery <- paste0(crimeStartDate, '%27%20AND%20dispatch_date_time%20%3C%20%27', crimeEndDate)
  
  crime <- read_csv(paste0(crimeEndpoint, crimeQuery,'%27'))
  
  
  
  # set crime types to consider (keep only ASSAULTS)
  crimeTypes <- c(
    "Aggravated Assault Firearm",
    "Aggravated Assault No Firearm",
    "Rape",
    "Other Assaults",
    "Homicide - Criminal")
  
  
  # wrangle crime data
  crimeData <- 
    crime %>%
    st_as_sf(coords = c('lng','lat'), crs = 4326) %>%
    st_transform(st_crs(phlcrs)) %>%
    mutate(year = year(dispatch_date)) %>%
    rename('crimeType' = text_general_code) %>%
    filter(crimeType %in% crimeTypes) %>%
    st_filter(., phlcounty) %>%
    dplyr::select(-dispatch_date, -crimeType)
  
  
  # get nearest neighbor data
  # set empty vector
  crimeYears <- c()
  
  # loop through years adding crime data for each year
  for (yr in startYr:endYr) {
    # get crime data coordinates list for each year
    # (loop variable named yr so it is not masked by the `year` column in filter)
    data <-
      crimeData %>%
      filter(year == yr) %>%
      st_c()
    
    # get distance to fishnet cells centroids
    crimeNN <-
      fishnet %>%
      mutate(
        crimeNN = nn_function(st_c(st_centroid(.)), data, 3),
        year = yr) %>%
      st_drop_geometry()
    
    # add to empty list
    crimeYears <-
      rbind(crimeYears, crimeNN)
    
  }
  
  
  # add both count and exposure to panel
  crimeNetYears <-
    merge(
      crimeData %>%
        st_join(., fishnet, join = st_within) %>%
        st_drop_geometry() %>%
        group_by(year, uniqueID) %>%
        summarize(crime = n()) %>%
        right_join(yearPanel) %>%
        replace(is.na(.), 0),
      crimeYears)
  
  
  
  # JOIN AND CALCULATE NEAREST NEIGHBOR DISTANCE
  # Ordinance violations
  exposureNet <- 
    rbind(schoolsData,
          parksData,
          transitData,
          licensesData) %>%
    st_join(., fishnet, join = st_within) %>%
    st_drop_geometry() %>%
    group_by(uniqueID, legend) %>%
    summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(legend, count, fill = 0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup() %>%
    mutate(schoolsNN =
             nn_function(st_c(st_centroid(.)), st_c(schoolsData), 3),
           parksNN =
             nn_function(st_c(st_centroid(.)), st_c(parksData), 3),
           transitNN =
             nn_function(st_c(st_centroid(.)), st_c(transitData), 3),
           licensesNN = 
             nn_function(st_c(st_centroid(.)), st_c(licensesData), 3)) %>%
    st_drop_geometry()
  
  
  # JOIN ALL
  # TURN previous table into YEARS and then join with CRIME table
  exposureNetYears <-
    merge(
      exposureNet %>%
        right_join(yearPanel),
      crimeNetYears)
  
  } else {
    # OR read locally
    exposureNetYears <- readRDS("../data/local/exposureNet.rds")
  }
  
# JOBS
# FROM Longitudinal Origin Destination Survey
# [source]('https://lehd.ces.census.gov/data/lodes/LODES7/pa/od/')

if (SWITCH == T) {
  
  # state the endpoints for getting the jobs OD data
  jobsEndpoints <- c()
  
  for (year in seq(initYr, endYr)) {
    jobsEndpoints <-
      c(jobsEndpoints,
        paste0('https://lehd.ces.census.gov/data/lodes/LODES7/pa/od/pa_od_aux_JT00_',
               year,
               '.csv.gz'))
    }
  
  # create empty list
  jobsAllYears <- c()
  
  # for every download endpoint, unzip, read csv and add to jobsAllYears
  # set initial year of looping
  loopYear <- initYr
  
  # loop through links
  for (dataset in jobsEndpoints) {
    jobsData <- read_csv(dataset) %>%
      mutate(year = loopYear)
    jobsAllYears <-
      rbind(jobsAllYears, jobsData)
    loopYear <- loopYear + 1
  }
  
  # turn data by blocks into points with number of jobs in that block
  jobsData <- 
    jobsAllYears %>%
    mutate(GEOID20 = as.character(w_geocode)) %>%
    dplyr::select(GEOID20, year) %>%
    group_by(year, GEOID20) %>%
    dplyr::summarize(count = sum(n())) %>%
    inner_join(phlblocks) %>%
    st_sf() %>%
    st_centroid(.)
  
  # get jobs by year into a fishnet
  jobsNetAllYears <- 
    jobsData %>%
    st_join(fishnet) %>%
    group_by(year, uniqueID) %>%
    dplyr::summarize(jobCount = sum(count)) %>%
    st_drop_geometry() %>%
    right_join(extYearPanel) %>%
    replace(is.na(.), 0)
  
  # Get change in jobs in 1 and 2 year intervals
  # HARDCODED!
  jobsLaggedNetYears <-
    merge(
      jobsNetAllYears %>%
        pivot_wider(names_from = year, values_from = jobCount) %>%
        mutate(lag1yr2013 = `2013` - `2012`,
               lag1yr2014 = `2014` - `2013`,
               lag1yr2015 = `2015` - `2014`,
               lag1yr2016 = `2016` - `2015`) %>%
        dplyr::select(-starts_with('20')) %>%
        pivot_longer(cols = colnames(.)[-1], names_to = 'year', values_to = 'lag1years') %>%
        mutate(year = as.numeric(substr(year, 7, 10))),
      jobsNetAllYears %>%
        pivot_wider(names_from = year, values_from = jobCount) %>%
        mutate(lag2yr2013 = `2013` - `2011`,
               lag2yr2014 = `2014` - `2012`,
               lag2yr2015 = `2015` - `2013`,
               lag2yr2016 = `2016` - `2014`) %>%
        dplyr::select(-starts_with('20')) %>%
        pivot_longer(cols = colnames(.)[-1], names_to = 'year', values_to = 'lag2years') %>%
        mutate(year = as.numeric(substr(year, 7, 10))))
  
  
  # Merge everything together for the fishnet/years for 16-17-18-19
  jobsNetYears <-
    merge(
      jobsNetAllYears %>%
        filter(year >= startYr),
      jobsLaggedNetYears)
  } else {
    # OR read locally
    jobsNetYears <- readRDS("../data/local/jobsNet.rds")
  }

# SAVE locally
saveRDS(exposureNetYears, file = "../data/local/exposureNet.rds")
saveRDS(jobsNetYears, file = "../data/local/jobsNet.rds")
# join all features together

if (SWITCH == T) {
  
  completeNet <- 
  permitsNetYears %>%
  mutate(permitYear = year,
         year = year - 3) %>%
  left_join(salesNetYears) %>%
  left_join(priceNetYears) %>%
  left_join(sqftNetYears) %>%
  left_join(debtNetYears) %>%
  left_join(rentNetYears) %>%
  left_join(individualOwnersNetYears) %>%
  left_join(vacantNet) %>%
  left_join(sheriffSalesNetYears) %>%
  left_join(demoNetYears) %>%
  left_join(jobsNetYears) %>%
  left_join(exposureNetYears)
  
  
  featuresNet <-
  completeNet %>%
  left_join(fishnet) %>%
  arrange(uniqueID) %>%
  ungroup() %>%
  dplyr::select(-uniqueID) %>%
  st_sf()
  
  } else {
    
    featuresNet <- readRDS("../data/local/featuresNet.rds")
    
    }

saveRDS(featuresNet, file = "../data/local/featuresNet.rds")

Features Exploration

As an initial step, we examined the relationship between the dependent variable, permitCount, and each feature in the list above, visualizing them in a series of scatter plots.

# Function to plot histograms of each variables/features in a data set
plotFeaturesHistogram <-
  function(dataset) {
    dataset <- 
      dataset %>%
      dplyr::select(!where(is.factor)) %>%
      st_drop_geometry()
    
    
    nm <- names(dataset)
    
    #this iterates over the variable names in the featuresNet dataframe
    for (i in seq_along(nm)) {
      print(
        ggplot(dataset) +
          geom_histogram(
            aes_string(x = nm[i]),
            fill = "#18B6C4",
            color = "white") +
          labs(title = paste("Distribution of", nm[i])))
      }
  }


# Function to plot scatterplots
plotXY <-
  function(net, ind_var_list, dep_var) {
    
    dat_by.stop_ACS <-
      net %>%
      dplyr::select(dep_var, all_of(ind_var_list)) %>%
      st_drop_geometry() %>%
      gather(key, value, -dep_var) %>%
      mutate(key = fct_relevel(key, ind_var_list))
    
    
    plot <- ggplot(dat_by.stop_ACS) +
      geom_point(aes_string("value", dep_var), color="#18B6C4") +
      facet_wrap_paginate(~ key, scales = "free", ncol = 2, nrow = 2)
    
    
    for (i in seq(n_pages(plot))) {
      print(
        ggplot(dat_by.stop_ACS) +
          geom_point(aes_string("value", dep_var), color="#18B6C4") +
          geom_smooth(method = "glm", aes_string(x="value", y=dep_var), color="#10767F", size = 1) +
          scale_y_continuous(limits=c(-1.5, 30)) +
          scale_x_continuous(name = substitute(ind_var_list)) +
          facet_wrap_paginate(~ key, scales = "free_x", ncol = 3, page=i) +
          labs(title = paste("Relations between",
                             substitute(ind_var_list),
                             "and permits count"),
               subtitle = "(continous outcomes for numeric variables)") +
          plaTheme() +
          theme(
            legend.position = "bottom",
            legend.background = element_blank(),
            panel.border = element_blank(),
            panel.background = element_rect(fill = colors[2]),
            panel.grid = element_blank(),
            strip.background = element_rect(fill = colors[2]),
            strip.text.x = element_text(size = 12, color = colors[1], hjust=0.01)
            )
        )
      
      # Save the plots locally
      ggsave(paste("../plots/scatterplots/",
                   substitute(ind_var_list), i, ".png", sep=""),
             plot = last_plot(),
             dpi = 300,
             width = 8,
             height = 5,
             units = "in")
      }
    }


# Set features categories

varsTime <-
  c("year",
    "lag1years",
    "lag2years")

varsRealEstate <-
  c("salesCount",
    "indOwnerDelta")

varsPrice <-
  c("meanSqftPrice",
    "meanPrice",
    "medRent",
    "medRentChange1yr",
    "medRentChange2yr",
    "meanDebt")

varsVacancy <-
  c("cumulativeVacantArea",
    "totalVacantLots",
    "sheriffSalesCount")

varsPopulation <-
  c("pop",
    "popChange1yr",
    "popChange2yr")

varsRace <-
  c("percWhite",
    "percWhiteChange1yr",
    "percWhiteChange2yr")

varsIncome <-
  c("medInc",
    "medIncChange1yr",
    "medIncChange2yr",
    "jobCount")

varsAmenities <-
  c("licenses",
    "licensesNN",
    "parks",
    "parksNN",
    "schools",
    "schoolsNN",
    "transit",
    "transitNN",
    "crime",
    "crimeNN")
# Plot features by categories

plotXY(featuresNet, varsPrice, "permitCount")

# Plot features by categories
# plotXY(featuresNet, varsTime, "permitCount")
plotXY(featuresNet, varsVacancy, "permitCount")

# Plot features by categories
plotXY(featuresNet, varsRace, "permitCount")

# Plot features by categories
plotXY(featuresNet, varsAmenities, "permitCount")

doRegression <-
  function (dep, indep) {
    modsum <- summary(lm (dep ~ indep))
  
  modtab <- c(
    modsum$coefficients[, 1], 
    modsum$coefficients[, 2], 
    modsum$coefficients[, 3],    
    modsum$coefficients[, 4],  
    modsum$adj.r.squared)
  
  round(modtab,digits = 6)
}

getBivariate <-
  function(dat_dep, dat_ind, method = "original") {
    
    # empty lists
    tab <- c()
    rname <- c()
    dep_i = 0
    ind_i = 0
    
    # loop through dependent variables (columns of dat_dep)
    for (y in dat_dep) {
      if (ind_i == length(dat_dep)) {
        break
      }
    
    # advance the dependent-variable column index
    dep_i = dep_i + 1
    
    # name the dependent variable according to the chosen transformation
    if (method == "logY") {
      dep_name = paste("log", colnames(dat_dep[dep_i]), sep = "_")
      } else if (
        method == "logposY") {
        dep_name = paste("logpos", colnames(dat_dep[dep_i]), sep = "_")
        } else {
      dep_name = colnames(dat_dep[dep_i])
        }
    
    # loop through independent variables
    for (x in dat_ind) {
      if (ind_i == length(dat_ind)) {
        ind_i = 0
        }
      ind_i = ind_i + 1
      
      ind_name = colnames(dat_ind[ind_i])
      res <- doRegression(y, x)
      tab <- rbind(tab, res)
      rname <- append(rname, paste(dep_name, ind_name, sep="~"))
    }
  }
  
  # turn results into dataframe and name columns
  tab <- as.data.frame(tab)
  colnames(tab) = c("Int", "Beta", "stErrorInt", "StErrorBeta", "TSTATInt", "TSTATBeta", "PVALINT", "PVALBeta", "R2")
  rownames(tab) = rname
  
  return(tab)
}



modelSpecs <- function(model, test, output = "confusionMatrix") {
    
  modelOut <-
    data.frame(outcome = as.factor(test$permitDummy),
               probs = predict(model, test, type="response"))
  
  # calculate AUC
  modelAUC <- pROC::auc(as.factor(test$permitDummy), as.numeric(modelOut$probs))
  
  # calculate threshold and confusion matrix to evaluate model
  pred <-
    prediction(modelOut[is.na(modelOut$probs)==FALSE,]$probs,
               modelOut[is.na(modelOut$probs)==FALSE,]$outcome)
  
  f_perf <-
    performance(pred,"f")

  f_score <-
    c(f_perf@y.values[[1]])
  
  cutoff <-
    c(f_perf@x.values[[1]])
  
  f_scoreTable <- 
    data.frame(
    cbind(f_score, cutoff))
  
  fscore <- f_scoreTable[which.max(f_scoreTable$f_score),]
  
  modelOut <- 
    modelOut %>%
    mutate(predOutcome = as.factor(ifelse(modelOut$probs > fscore$cutoff, 1, 0)))
  
  confusionMatrix <- 
    caret::confusionMatrix(
      modelOut$predOutcome,
      modelOut$outcome, 
      positive = "1")
  
  
  if (output == "confusionMatrix") {
    return(confusionMatrix)
  } else if(output == "AUC") {
  return(modelAUC)
  } else if(output == "fscore") {
    return(fscore)
  }
}


# CROSSVALIDATION FUNCTIONS

crossValidate <- function(dataset, id, dependentVariable, indVariables) {

  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
  
    thisFold <- i
    
    cat("This hold out fold is", thisFold, "\n")
  
    foldTrain <-
      filter(dataset,
             dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    
    foldTest <-
      filter(dataset,
             dataset[[id]] == thisFold) %>%
      as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    
    model <-
      randomForest(
        dependentVariable ~ ., 
      data = foldTrain)
    
    thisPrediction <- 
      mutate(foldTest, Prediction = ifelse(predict(model, foldTest, type = "response") > 0.3523, 1, 0))
    
      
    allPredictions <-
      rbind(allPredictions, thisPrediction)
      
    }
    return(allPredictions)
}


# get Accuracy function
getAccuracy <-
  function(cm) {
    acc <- cm$overall[['Accuracy']]
    return(acc)
  }
# dependent variable as a continuous count outcome
dep <-
  featuresNet %>%
  dplyr::select(permitCount) %>%
  st_drop_geometry()


# dependent variable as a binary outcome (0 or 1)
depBinary <-
  featuresNet %>%
  dplyr::select(permitDummyNumeric) %>%
  st_drop_geometry()


# independent variables without year and dependent variables.
indep <-
  featuresNet %>%
  dplyr::select(
    -permitCount,
    -year,
    -permitYear,
    -permitDummyNumeric,
    -permitDummy) %>%
  st_drop_geometry()


# run bivariate regressions
# continuous outcome
regCont <-
  as.data.frame(getBivariate(dep, indep))


# binary outcome
regBinary <-
  as.data.frame(getBivariate(depBinary, indep))


# SAVE results locally
write.csv(regCont, "../data/regressions/reg.csv")
write.csv(regBinary, "../data/regressions/reg_dummy.csv")

Across the list of chosen features, several features demonstrate strong correlations to one another, as seen in the correlation matrix below. Median rent and median income are closely correlated to demographic features including population, population change, and percent white population. Sales count also registers a strong correlation to median rent, which would indicate that the hotter the sales market, the higher the rent.

var_corr <- indep

# SELECTION
var_corr_some <-
  var_corr %>%
  dplyr::select(
    salesCount,
    indOwnerDelta, 
    meanSqftPrice,
    medRent,
    medRentChange2yr,
    meanDebt,
    totalVacantLots,
    pop,
    popChange2yr,
    percWhite,
    percWhiteChange2yr, 
    medInc,
    medIncChange2yr,
    licenses,
    parksNN,
    schoolsNN,
    transitNN,
    crime)


ggcorrplot(
  outline.col = "white",
  type = "lower",
  round(cor(var_corr_some), 1), 
  lab = T,
  p.mat = cor_pmat(var_corr_some),
  colors = c(palette[2], "white", palette[6]),
  insig = "blank") +  
  labs(title = "Correlation Matrix",
       subtitle = "Important numeric variables") +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(color = colors[1], face = "plain"))

The same correlation matrix approach can also be applied within groups of related variables, such as development prospect, demographic, and amenity variables.

# group 1 - SALES, REAL ESTATE, VACANCY
var_corr_sales <-
  var_corr %>%
  dplyr::select(
    varsRealEstate,
    varsPrice,
    varsVacancy)


ggcorrplot(
  outline.col = "white",
  type = "lower",
  round(cor(var_corr_sales), 1), 
  lab = T,
  p.mat = cor_pmat(var_corr_sales),
  colors = c(palette[2], "white", palette[6]),
  insig = "blank") +  
  labs(title = "Correlation Matrix",
       subtitle="Development prospect variables") +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(color = colors[1], face = "plain"))

# group 2 - DEMOGRAPHICS & INCOME
var_corr_demo <-
  var_corr %>%
  dplyr::select(
    all_of(varsPopulation),
    all_of(varsRace),
    all_of(varsIncome))


ggcorrplot(
  outline.col = "white",
  type = "lower",
  round(cor(var_corr_demo), 1), 
  lab = T,
  p.mat = cor_pmat(var_corr_demo),
  colors = c(palette[2], "white", palette[6]),
  insig = "blank") +  
  labs(title = "Correlation Matrix",
       subtitle = "Demographic variables")+
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(color = colors[1], face = "plain"))

# group 3 - AMENITIES
var_corr_amen <-
  var_corr %>%
  dplyr::select(all_of(varsAmenities))


ggcorrplot(
  outline.col = "white",
  type = "lower",
  round(cor(var_corr_amen), 1), 
  lab = T,
  p.mat = cor_pmat(var_corr_amen),
  colors = c(palette[2], "white", palette[6]),
  insig = "blank") +  
  labs(title = "Correlation Matrix",
       subtitle = "Variables of amenities") +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(color = colors[1], face = "plain"))


Features can also be mapped with reference to the model’s final output. Each of the following violin plot maps answers one basic question: how does a given feature manifest in areas that have seen zoning permit applications versus those that have not? Answering that question offers a preliminary sense of each feature’s relationship (or lack thereof) to the model’s dependent variable.

# set color palette
color1 <- "#18B6C4" 
color2 <- "#18B6C4"


featuresNetOnly <-
  featuresNet %>%
  dplyr::select(
    -year,
    -permitYear,
    -permitDummy,
    -permitDummyNumeric) %>%
  st_drop_geometry()


#for (var in colnames(featuresNetOnly)) {
 
getViolinMap <- function(var, title) { 
  
  # produce violin plot
  violinPlot <-
    featuresNet %>%
    ggplot() +
    geom_violin(
      aes_string(x=var, y="permitDummy", fill = "permitDummy"),
      color = "#808080") +
    coord_flip() +
    scale_fill_manual(values = c(color1, color2)) +
    scale_y_discrete(labels=c("no permits", "had permits")) +
    labs(title = title,
         subtitle = paste0("Feature: ", var),
         x = "",
         y = "") +
    theme(
      legend.position = "none",
      plot.background = element_blank(),
      panel.border = element_blank(),
      panel.background = element_rect(fill = "#ffffff"),
      panel.grid.major.x = element_blank(),
      strip.background = element_rect(fill = "#ffffff"),
      strip.text.y = element_text(size = 12, color = colors[1], hjust=0.05)
      )
  
  # produce corresponding map
  mapPlot <- 
    featuresNet %>%
    ggplot() +
      geom_sf(
        data = featuresNet,
        aes_string(fill = var),
        color = NA,
        inherit.aes = F) +
      scale_fill_viridis(
        option = "mako",
        name = "value",
        begin = 0.3,
        # trans = "log1p",
        direction = 1) +
      mapTheme() +
      theme(axis.text.x = element_blank(),
            legend.position = c(0.85, 0.2),
            panel.border = element_blank(),
            panel.background = element_rect(fill = "#ffffff"),
            panel.grid.major.x = element_blank(),
            legend.title=element_text(size=12), 
            legend.text=element_text(size=9),
            plot.title = element_text(size = 16, face = "bold"),
            plot.subtitle = element_text(color = colors[1], face = "plain"))
  
  plot <-
    grid.arrange(
      violinPlot,
      mapPlot,
      ncol = 2,
      widths = c(2, 3))
  
  return(plot)
  
}

Begin with the model’s strongest independent variable, the count of sales. In the violin plot map below, one can discern a clear relationship between the count of sales and whether an area saw zoning permits; namely, areas that saw zoning permit applications registered more sales than areas without.

getViolinMap(colnames(featuresNetOnly)[2], "Sales Count by fishnet cell")

The same relationship can be observed in demographic data, such as median rent and median income. Areas that saw zoning permit applications generally skew higher in terms of median rent and median income than those without zoning permits.

getViolinMap(colnames(featuresNetOnly)[6], "Median Rent by fishnet cell")

getViolinMap(colnames(featuresNetOnly)[19], "Median Income by fishnet cell")

Most relevant to the use case here, areas that saw zoning permit applications contain a larger proportion of vacant land than areas that did not. Intuitively, this relationship makes sense: areas with vacant land are attractive as development sites since there is no need to demolish existing improvements.

getViolinMap(colnames(featuresNetOnly)[11], "Total Vacant Land by fishnet cell")

Finally, features that may be categorized as amenities also show a discernible relationship to the model’s output variable. Here, areas seeing zoning permit applications tend to have higher numbers of business licenses than those that do not. This relationship lends support to one’s intuition that developers will seek to build in places with greater commercial activity.

getViolinMap(colnames(featuresNetOnly)[25], "Food Licenses by fishnet cell")


Model Selection and Tuning

# hold out on last year (2016)
featuresNetTest <-
  featuresNet %>%
  filter(year == endYr)


# train on first three years
featuresNetTrain <-
  featuresNet %>%
  filter(year %in% seq(startYr, endYr-1))


# set random seed to replicate results
set.seed(326)


# set partition boundary
trainIndex <-
  createDataPartition(
    featuresNetTrain$permitCount,
    p = .75,
    list = F,
    times = 1)


# get train set for non-chronological validation
train <-
  featuresNetTrain[trainIndex,] 

# get test set for non-chronological validation
test <-
  featuresNetTrain[-trainIndex,]

In designing the final model, we tried and tested two primary model types for accuracy: binomial logit and random forest. As shown in the table below, each model was run, examined, and tuned before we determined the optimal choice.

modelFeaturesA <- 
  c('permitDummy',
    'year',
    'salesCount',
    'indOwnerDelta', 
    'meanSqftPrice', 
    'medRent',
    'medRentChange2yr',
    'meanDebt',
    'totalVacantLots',
    'pop',
    'popChange2yr',
    'percWhite',
    'percWhiteChange2yr',
    'medInc',
    'medIncChange2yr',
    'licenses',
    'parksNN',
    'schoolsNN',
    'transitNN',
    'crime'
    )

# Model A (kitchen sink): select the variables with the highest R-squared from every theoretical group
biModelA <-
    glm(as.factor(permitDummy) ~ .,
        family = "binomial"(link = "logit"), 
        data = train %>%
          st_drop_geometry() %>%
          dplyr::select(all_of(modelFeaturesA)))

summary(biModelA)

modelSpecsA <- modelSpecs(biModelA, test)
modelSpecs(biModelA, test, "AUC")
# Model B: eliminate correlated variables, keeping salesCount

modelFeaturesB <- 
  modelFeaturesA[
    ! modelFeaturesA %in%
      c(
        'medRent',
        'pop',
        'percWhite'
        )]

# make regression
biModelB <-
    glm(permitDummy ~ .,
        family = "binomial"(link = "logit"), 
        data = train %>%
          st_drop_geometry() %>%
          dplyr::select(all_of(modelFeaturesB)))

summary(biModelB)

# get model correlation matrix
modelSpecsB <- modelSpecs(biModelB, test)

# get AUC
modelSpecs(biModelB, test, "AUC")
# Model C: eliminate correlated variables, keeping medRent

modelFeaturesC <- 
  modelFeaturesA[
    ! modelFeaturesA %in%
      c(
        'salesCount',
        'pop',
        'percWhite',
        'parksNN',
        'schoolsNN',
        'transitNN',
        'crime'
        )]

biModelC <-
    glm(permitDummy ~ ., 
        family="binomial"(link="logit"), 
      data = train %>%
        st_drop_geometry() %>%
        dplyr::select(all_of(modelFeaturesC)))

summary(biModelC)

modelSpecsC <- modelSpecs(biModelC, test)
modelSpecs(biModelC, test, "AUC")
# Model D: eliminate correlated variables, keeping medRent, eliminate insignificant vars
# set variables
modelFeaturesD <- 
  modelFeaturesA[
    ! modelFeaturesA %in%
      c(
      'medRentChange2yr',
      'meanDebt',
      'salesCount',
      'pop',
      'percWhite',
      'parksNN',
      'schoolsNN',
      'transitNN',
      'crime'
      )]


# run regression
biModelD <-
    glm(permitDummy ~ ., 
        family = "binomial"(link = "logit"), 
        data = train %>%
          st_drop_geometry() %>%
          dplyr::select(all_of(modelFeaturesD)))

# get model summary
summary(biModelD)

# get correlation matrix
modelSpecsD <- modelSpecs(biModelD, test)

# get AUC
modelSpecs(biModelD, test, "AUC")
# Random Forest Model A (kitchen sink): set variables
rfModelFeaturesA <- 
  c(
    'permitDummyNumeric',
    'year',
    'salesCount',
    'indOwnerDelta',
    'meanSqftPrice', 
    'medRent',
    'medRentChange2yr',
    'meanDebt',
    'totalVacantLots',
    'pop',
    'popChange2yr',
    'percWhite',
    'percWhiteChange2yr', 
    'medInc',
    'medIncChange2yr',
    'licenses',
    'parksNN',
    'schoolsNN',
    'transitNN',
    'crime'
  )

# run random forest regression
rfModelA <- 
  randomForest(
    permitDummyNumeric ~ .,
    data = train %>%
          st_drop_geometry() %>%
          dplyr::select(all_of(rfModelFeaturesA)))

# get importance
importance(rfModelA)

# get correlation matrix
modelSpecsRFA <- modelSpecs(rfModelA, test)

# get AUC
modelSpecs(rfModelA, test, "AUC")
# Random Forest Model B: important features only

rfModelFeaturesB <- 
  rfModelFeaturesA[
    ! rfModelFeaturesA %in%
      c(
      'year',
      #'indOwnerDelta',
      'meanDebt',
      'salesCount',
      'pop',
      'percWhite',
      'parksNN',
      'schoolsNN',
      'transitNN',
      'crime'
      )]


# run random forest regression
rfModelB <- 
  randomForest(
    permitDummyNumeric ~ .,
    data = train %>%
          st_drop_geometry() %>%
          dplyr::select(all_of(rfModelFeaturesB)))

# get importance
importance(rfModelB)

# get correlation matrix
modelSpecsRFB <- modelSpecs(rfModelB, test)

# get AUC
modelSpecs(rfModelB, test, "AUC")

First, representative features from the full feature list were fed into a binomial logit model to test its accuracy as-is (the “kitchen sink” model). Next, the feature list was trimmed by removing features that were strongly correlated with one another. Last, the feature list was reduced further to include only those features that met a threshold level of significance in the model.

We then turned to the other model type, random forest. Because random forest models have the advantage of handling correlated features, only steps one and three from the binomial logit process were repeated here.

Ultimately, after comparing accuracy, sensitivity, and specificity across the two model types, a clear winner emerged: the random forest kitchen sink model, which showed the best overall balance of accuracy, sensitivity, and specificity, as summarized in the table below.

# model types
modelTypes <- 
  c('Binomial logit',
    'Binomial logit',
    'Binomial logit',
    'Binomial logit',
    'Random Forest',
    'Random Forest')

# models
models <- c(
  'Kitchen Sink',
  'W/o correlations - sales',
  'W/o correlations - rent',
  'Significant Only',
  'Kitchen Sink',
  'Important Only')


# Accuracy
accuModels <-  
  c(modelSpecsA[3][[1]][[1]],
    modelSpecsB[3][[1]][[1]],
    modelSpecsC[3][[1]][[1]],
    modelSpecsD[3][[1]][[1]],
    modelSpecsRFA[3][[1]][[1]],
    modelSpecsRFB[3][[1]][[1]])


# Sensitivity
sensModels <-  
  c(modelSpecsA[4][[1]][[1]],
    modelSpecsB[4][[1]][[1]],
    modelSpecsC[4][[1]][[1]],
    modelSpecsD[4][[1]][[1]],
    modelSpecsRFA[4][[1]][[1]],
    modelSpecsRFB[4][[1]][[1]])


# Specificity
specModels <-  
  c(modelSpecsA[4][[1]][[2]],
    modelSpecsB[4][[1]][[2]],
    modelSpecsC[4][[1]][[2]],
    modelSpecsD[4][[1]][[2]],
    modelSpecsRFA[4][[1]][[2]],
    modelSpecsRFB[4][[1]][[2]])


data.frame('Model type' = modelTypes,
           Model = models,
           Accuracy = accuModels,
           Sensitivity = sensModels,
           Specificity = specModels) %>%
  mutate_if(is.numeric, format, digits= 4, nsmall = 0) %>%
  kable(caption = "Model Comparison") %>%
    kable_material("striped", full_width = F) %>%
  row_spec(5, bold=T, hline_after = T)

Figure: Model Results

Within the optimal model, we can identify the most important features. Fittingly, of all the input features, total vacant lots had the largest influence on the model’s accuracy. In contrast, factors such as average delinquent debt and the change in discrete owners did not play an outsized role in the model.

featureRfModelA <-
  data.frame(Feature = row.names(importance(rfModelA)),
             Importance = importance(rfModelA)[, 1]) %>%
  mutate(vacant = ifelse(Feature == "totalVacantLots", "vacant", "not vacant"))


ggplot(
  featureRfModelA,
  aes(x = reorder(Feature, Importance),
      y = Importance,
      fill = vacant)) +
  geom_bar(
    stat = "identity") +
  scale_fill_manual(
    values = c(palette[1], palette[2])) +
  coord_flip() +
  guides(fill = FALSE) +
  xlab("") + 
  labs(
    title = "Feature Importance",
    subtitle = "Random Forest Model A") +
  theme(plot.title = element_text(size=12, hjust = 0.5, face = "bold"),
        plot.subtitle = element_text(size=10, hjust = 0.5),
        legend.position = "top") +
  plaTheme()

How the random forest algorithm arrives at its final prediction can be visualized, as seen below. It is important to note, however, that this decision tree represents only a single sample; in practice, the random forest model relies on many such trees.

plot(ctree(rfModelA, test),
     main = "Random Forest Decision Tree\nBest Model",
     type = "simple")


Results

tabPreds <-         
  featuresNetTest %>%
  st_sf() %>%
  mutate(probs = predict(rfModelA,
                         featuresNetTest,
                         type="response"),
         predOutcome = as.factor(ifelse(probs >= 0.3523, 1, 0)))

As stated above, the model achieves an accuracy score of roughly 0.847. These results can be seen in the maps below. The map on the left displays the model’s predictions as a continuous probability from 0 to 1. The map on the right converts those continuous results into a binary score at a threshold of 0.3523; that is, any predicted probability above 0.3523 is classified as 1 and anything below as 0.

paletteM <-
  c('#414081',
    '#D0EFD8')

# map of binary observations
obsDevBinary <- 
  ggplot() +
  geom_sf(
    data = tabPreds,
    aes(fill = as.factor(permitDummy)),
    color = NA,
    inherit.aes = F) +
  scale_fill_manual(
    values = c(paletteM[1], paletteM[2]), 
    labels = c("No", "Yes"), 
    name="Zoning permits issued") +
  labs(title = "Observed Development",
       subtitle = "Binary Outcome") +
  mapTheme() +
  plaTheme() +
  theme(plot.title = element_text(size = 12, hjust = .5, face = "bold"),
        plot.subtitle = element_text(size = 10, hjust = .5, face = "plain"),
              legend.position = c(0.8, 0.15))


# map of continuous predictions
predDevCont <- 
  ggplot() +
  geom_sf(
    data = tabPreds,
    aes(fill = probs),
    color = NA,
    inherit.aes = F) +
  scale_fill_viridis(
    option = "mako",
    name = "development risk",
    begin = 0.3,
    direction = 1) +
  labs(title = "Predicted Development",
       subtitle = "Continuous Outcome") +
  guides(
    size = F,
    fill= guide_colorbar(barheight = 7, title.position = "top")) +
  mapTheme() +
  plaTheme() +
  theme(plot.title = element_text(size = 12, hjust = .5, face = "bold"),
        plot.subtitle = element_text(size = 10, hjust = .5, face = "plain"),
              legend.position = c(0.8, 0.2))



predDevBinary <- 
  ggplot() +
  geom_sf(
    data = tabPreds, aes(fill = predOutcome),
    color=NA,
    inherit.aes = FALSE) +
  scale_fill_manual(
    values = c(paletteM[1], paletteM[2]), 
    labels = c("No", "Yes"), 
    name = "Zoning permits issued") +
  labs(title = "Predicted Development",
       subtitle = "Binary outcome") +
  mapTheme() +
  plaTheme() +
  theme(plot.title = element_text(size = 12, hjust = .5, face = "bold"),
        plot.subtitle = element_text(size = 10, hjust = .5, face = "plain"),
              legend.position = c(0.8, 0.15))


# side-by-side plot
grid.arrange(predDevCont, predDevBinary, ncol = 2)

The results can be validated against observed outcomes from 2019. The left map below shows where zoning permits actually were or were not issued in 2019, while the right map shows where the model predicted them to be.

# side-by-side plot

grid.arrange(obsDevBinary, predDevBinary, ncol = 2)

A confusion matrix was produced to examine the relative tradeoffs of the model between true positives and negatives and false positives and negatives.

cm <-
  tabPreds %>%
  conf_mat(permitDummy, predOutcome)

autoplot(cm, type = "heatmap") +
  labs(title = "Confusion Matrix",
       subtitle = "Best Model: Random Forest Kitchen Sink") +
  scale_fill_viridis(option="mako", 
                     begin = 0.3,
                     end = 0.5,
                     alpha = 0.67) +
  plaTheme() +
  theme(plot.title = element_text(size = 12, hjust = .5, face = "bold"),
        plot.subtitle = element_text(size = 10, hjust = .5, face = "plain"),
        legend.position = "none")


Conclusion

As sheriff’s sales continue to escalate across Philadelphia, it becomes ever more critical that organizations such as Philadelphia Legal Assistance offer help to neighbors and community members who have devoted their time and resources to beautifying vacant properties in their neighborhoods. To provide such help, such organizations must have easy access to data that can help inform appropriate legal and policy strategies. Our web app provides just that: an interactive dashboard communicating important information about vacant properties across Philadelphia, giving PLA a snapshot profile of all at-risk vacant properties. Moreover, our predictive algorithm gives these organizations the advantage of foresight, enabling them to know which properties are at greatest risk of development and to preempt the sale of cherished community assets across the city.


Code Appendix

2019 Independent Variables for Dashboard Predictions

# Set time constraints

predYr <- 2019
permitLag <- 3


# empty panel with all possible time/space combinations by fishnet grids YEARS
# for INDEPENDENT VARIABLES
yearPanel19 <- 
  expand.grid(year = predYr, 
              uniqueID = unique(fishnet$uniqueID))


# create empty panel with all possible time/space combinations by fishnet grids YEARS EXT
extYearPanel19 <- 
  expand.grid(year = seq(predYr-2, predYr), 
              uniqueID = unique(fishnet$uniqueID))



if (SWITCH == T) {
  # if a parcel_number has multiple sales, we only get the latest one
  salesProps <-
    transfersProps %>%
    drop_na(total_area) %>% # drop sales that have NA for their area
    mutate(document_type = as.character(document_type)) %>%  # to ignore Gillian deed factor levels
    group_by(parcel_number) %>%
    slice(which.max(display_date)) %>% # to get only the latest one
    mutate(year = year(display_date),
           month = month(display_date),
           quarter = case_when(
             month %in% c(1, 2, 3) ~ 1,
             month %in% c(4, 5, 6) ~ 2,
             month %in% c(7, 8, 9) ~ 3,
             month %in% c(10, 11, 12) ~ 4)) %>%
    mutate(period = paste(year, quarter, sep='-')) %>%
    st_sf()
  } else {
    
    salesProps <- readRDS('../data/salesProps.Rds')
    
  }

saveRDS(salesProps, '../data/salesProps.Rds')

  
# select and transform features + filter by period + JOIN into fishnet
salesPropsNet19 <- 
  salesProps %>%
  dplyr::select(all_of(salesVars)) %>%
  filter(year == predYr,
         total_consideration < 4e8,
         total_consideration > 1e4) %>%  # filter outliers and most "symbolic" transfers
  mutate(sqftPrice = ifelse(total_area > 0, total_consideration/total_area, 0)) %>% # calculate price by sqft
  st_join(fishnet, left = F) %>%
  st_drop_geometry()


# SALES COUNT
# Get sales count by fishnet in 2019
salesNetYears19 <- 
  salesPropsNet19 %>%
  group_by(year, uniqueID) %>%
  dplyr::summarize(salesCount = sum(n())) %>%
  right_join(yearPanel19) %>%
  replace(is.na(.), 0)


# PRICE BY SQ FOOT
# Get sqft price paid by fishnet in 2019
sqftNetYears19 <-
  salesPropsNet19 %>%
  group_by(year, uniqueID) %>%
  dplyr::summarize(meanSqftPrice = mean(sqftPrice)) %>%
  right_join(yearPanel19) %>%
  replace(is.na(.), 0)


# MEAN PRICE
# Get average sale price by fishnet in 2019
priceNetYears19 <-
  salesPropsNet19 %>%
  group_by(year, uniqueID) %>%
  dplyr::summarize(meanPrice = mean(total_consideration)) %>%
  right_join(yearPanel19) %>%
  replace(is.na(.), 0)


# INDIVIDUAL OWNERS
# Get change in number of individual owners by fishnet in 2019
individualOwnersNetYears19 <- 
  salesPropsNet19 %>%
  group_by(year, uniqueID) %>%
  dplyr::summarize(
    numIndGrantors = n_distinct(grantors),
    numIndGrantees = n_distinct(grantees)) %>%
  mutate(indOwnerDelta = abs(numIndGrantors - numIndGrantees)) %>%
  dplyr::select(-numIndGrantors, -numIndGrantees) %>%
  right_join(yearPanel19) %>%
  replace(is.na(.), 0)
  

# DEBT
# All delinquent properties

debtNetYears19 <-
  delinquenciesProps %>%
  filter(oldest_year_owed <= predYr) %>% # take out the ones whose debt started after 2019
  rename('year' = most_recent_year_owed) %>%
  st_join(., fishnet) %>%
  group_by(year, uniqueID) %>%
  dplyr::summarize(meanDebt = mean(total_due)) %>%
  filter(year >= predYr) %>%
  st_drop_geometry() %>%
  right_join(yearPanel19) %>%
  replace(is.na(.), 0)
  


# VACANT LOTS
# get count of vacant lots and cumulative vacant area
# NOTE: this feature does not vary by year

vacantNet19 <- 
  vacantLandProps %>%
  drop_na(total_area) %>% # drop parcels that have NA for their area
  st_sf() %>%
  st_join(., fishnet) %>%
  group_by(uniqueID) %>%
  dplyr::summarize(cumulativeVacantArea = sum(total_area),
                   totalVacantLots = sum(n())) %>%
  st_drop_geometry() %>%
  right_join(yearPanel19) %>%
  replace(is.na(.), 0)
  


# SHERIFF SALES
# Count of sheriff sales by fishnet square

# get properties sold on a sheriff sale
sheriffSales19 <-
  sheriffProps %>%
  mutate(year = year(sale_date)) %>%
  dplyr::select(allSheriffSales, year) %>%
  filter(year == predYr)


# join to fishnet and years panel
sheriffSalesNetYears19 <-
  sheriffSales19 %>%
  dplyr::select(year, allSheriffSales) %>%
  st_join(fishnet) %>%
  group_by(year, uniqueID) %>%
  dplyr::summarize(sheriffSalesCount = sum(n())) %>%
  st_drop_geometry() %>%
  right_join(yearPanel19) %>%
  replace(is.na(.), 0)
  


# SAVE locally
saveRDS(salesNetYears19, file = "../data/local/salesNetYears19.rds")
saveRDS(sqftNetYears19, file = "../data/local/sqftNetYears19.rds")
saveRDS(priceNetYears19, file = "../data/local/priceNetYears19.rds")
saveRDS(individualOwnersNetYears19, file = "../data/local/individualOwnersNetYears19.rds")
saveRDS(debtNetYears19, file = "../data/local/debtNetYears19.rds")
saveRDS(vacantNet19, file = "../data/local/vacantNet19.rds")
saveRDS(sheriffSalesNetYears19, file = "../data/local/sheriffSalesNetYears19.rds")



getDemographic <- function(demYr, vars, panel) {

  for(year in seq(demYr-2, demYr)) {
    yr <- as.character(year - 2000)
    yr1 <- as.character(year - 2000 - 1)
    yr2 <- as.character(year - 2000 - 2)
    
    # get this year data
    demoData <-
      get_acs(
        geography = "tract",
        variables = vars, 
        year = year, 
        state = "PA", 
        geometry = T, 
        county = c("Philadelphia"),
        output = "wide") %>%
      mutate(!!paste0('percWhite', yr) := B01001A_001E/B01001_001E) %>%
      rename(!!paste0('pop', yr) := B01001_001E,
             !!paste0('medInc', yr) := B19013_001E,
             !!paste0('medRent', yr) := B25064_001E) %>%
      dplyr::select(-ends_with('M'), -NAME, -B01001A_001E) %>%
      st_drop_geometry() 
    
    
    if (year > demYr-2) {
      demoData <-
        full_join(lastYearData, demoData, by = 'GEOID')
      
      if (year == demYr) {
        
        medIncYr <- demoData[paste0('medInc', yr)]
        medIncYr1 <- demoData[paste0('medInc', yr1)]
        medIncYr2 <- demoData[paste0('medInc', yr2)]    
        
        medRentYr <- demoData[paste0('medRent', yr)]
        medRentYr1 <- demoData[paste0('medRent', yr1)]
        medRentYr2 <- demoData[paste0('medRent', yr2)]
        
        demoData <-
          demoData %>%
          mutate(popChange1 = ((demoData[paste0('pop', yr)] - demoData[paste0('pop', yr1)])/demoData[paste0('pop', yr1)])[,1],
                 popChange2 = ((demoData[paste0('pop', yr)] - demoData[paste0('pop', yr2)])/demoData[paste0('pop', yr2)])[,1]) %>%
          rename(!!paste0('popChange1yr', yr, yr1) := popChange1,
                 !!paste0('popChange2yr', yr, yr2) := popChange2) %>%
          mutate(percWhiteChange1 = (demoData[paste0('percWhite', yr)] - demoData[paste0('percWhite', yr1)])[,1],
                 percWhiteChange2 = (demoData[paste0('percWhite', yr)] - demoData[paste0('percWhite', yr2)])[,1]) %>%
          rename(!!paste0('percWhiteChange1yr', yr, yr1) := percWhiteChange1,
                 !!paste0('percWhiteChange2yr', yr, yr2) := percWhiteChange2) %>%
          mutate(medIncChange1 = ((medIncYr - medIncYr1) / medIncYr1)[,1],
                 medIncChange2 = ((medIncYr - medIncYr2) / medIncYr2)[,1]) %>%
          rename(!!paste0('medIncChange1yr', yr, yr1) := medIncChange1,
                 !!paste0('medIncChange2yr', yr, yr2) := medIncChange2) %>%
          mutate(medRentChange1 = ((medRentYr - medRentYr1) / medRentYr1)[,1],
                 medRentChange2 = ((medRentYr - medRentYr2) / medRentYr2)[,1]) %>%
          rename(!!paste0('medRentChange1yr', yr, yr1) := medRentChange1,
                 !!paste0('medRentChange2yr', yr, yr2) := medRentChange2)
      }
    }
    
    lastYearData <-
      demoData %>%
      replace(is.na(.), 0)
    
  }
  
  # variables from reference lag years to take out
  initYrVars <- c()
  
  for (var in c('pop', 'medInc' , 'medRent', 'percWhite')) {
    initYrVars <- c(initYrVars,
                    paste0(var, substr(as.character(demYr-2), 3, 4)),
                    paste0(var, substr(as.character(demYr-1), 3, 4)))
  }
    
  # join to census tracts
  demoJoined <- 
    demoData %>%
    left_join(phltracts, by = "GEOID") %>%
    dplyr::select(-all_of(initYrVars)) %>%
    st_sf()
  
  return(demoJoined)
  
}
  

censusVars <- c("B01001_001", "B01001A_001", "B19013_001", "B25064_001")

# get all demographic and rent data, with 1- and 2-year deltas, for the prediction year
demoJoined19 <- getDemographic(predYr, censusVars, yearPanel19)

# get column names
demoVars19 <- colnames(demoJoined19)

  
# helper: areal-weighted interpolation of tract-level variables into the fishnet for a given year
createFeatureYear <-
  function(input, varList, fishnet, year) {
    output <- data.frame(matrix(ncol = 3, nrow = 0))
    
    name <- sub("[0-9]{2,4}$", "", varList[1])
    print(name)
    
    for (var in varList) {
      data <-
        st_interpolate_aw(input[var], fishnet, extensive = T) %>%
        as.data.frame(.) %>%
        left_join(fishnet, ., by = "geometry") %>%
        rename(!!name := var) %>%
        mutate(year = year) %>%
        st_drop_geometry() %>%
        dplyr::select(year, uniqueID, all_of(name))
      
      output <- rbind(output, data)
      
    }
    return(output)
  }


demoNetYears19 <-
  yearPanel19 %>%
  left_join(., createFeatureYear(demoJoined19, "pop19", fishnet, 2019), by = c('year', 'uniqueID')) %>%
  left_join(., createFeatureYear(demoJoined19, "popChange1yr1918", fishnet, 2019), by = c('year', 'uniqueID')) %>%
  left_join(., createFeatureYear(demoJoined19, "popChange2yr1917", fishnet, 2019), by = c('year', 'uniqueID')) %>%
  left_join(., createFeatureYear(demoJoined19, "percWhite19", fishnet, 2019), by = c('year', 'uniqueID')) %>%
  left_join(., createFeatureYear(demoJoined19, "percWhiteChange1yr1918", fishnet, 2019), by = c('year', 'uniqueID')) %>%
  left_join(., createFeatureYear(demoJoined19, "percWhiteChange2yr1917", fishnet, 2019), by = c('year', 'uniqueID')) %>%
  left_join(., createFeatureYear(demoJoined19, "medInc19", fishnet, 2019), by = c('year', 'uniqueID')) %>%
  left_join(., createFeatureYear(demoJoined19, "medIncChange1yr1918", fishnet, 2019), by = c('year', 'uniqueID')) %>%
  left_join(., createFeatureYear(demoJoined19, "medIncChange2yr1917", fishnet, 2019), by = c('year', 'uniqueID')) %>%
  replace(is.na(.), 0) %>%
  mutate_if(is.numeric, function(x) ifelse(is.infinite(x), 0, x))


# Panel with all rent information
rentNetYears19 <- 
  yearPanel19 %>%
  left_join(., createFeatureYear(demoJoined19, varsMedRent19, fishnet, 2019), by = c('year', 'uniqueID')) %>%
  left_join(., createFeatureYear(demoJoined19, varsMedRentChange1yr19, fishnet, 2019), by = c('year', 'uniqueID')) %>%
  left_join(., createFeatureYear(demoJoined19, varsMedRentChange2yr19, fishnet, 2019), by = c('year', 'uniqueID')) %>%
  replace(is.na(.), 0) %>%
  mutate_if(is.numeric, function(x) ifelse(is.infinite(x), 0, x))
  



# exposure features that change by YEAR

# FOOD LICENSES
# ODP Business Licenses Dataset
# [source]('https://metadata.phila.gov/#home/datasetdetails/5543865a20583086178c4ed2/representationdetails/5e985a5e344ed50018936bb8/')


# select only food licenses initially issued before the start of the prediction year
licensesData19 <- 
  licenses %>%
  st_transform(st_crs(phlcrs)) %>%
  dplyr::select(all_of(licenseVars)) %>%
  filter(licensetype %in% licenseTypes,
         initialissuedate < as.Date.character(paste0(as.character(predYr), '-01-01')),
         !inactivedate > as.Date.character(paste0(as.character(predYr), '-01-01'))) %>%
  dplyr::select(geometry) %>%
  mutate(legend = 'licenses') %>%
  mutate(coords = st_coordinates(.)) %>%
  na.omit() %>%
  dplyr::select(-coords)



# CRIME
# [source](https://metadata.phila.gov/#home/datasetdetails/5543868920583086178c4f8e/representationdetails/570e7621c03327dc14f4b68d/)

# URL-encoded date bounds for the API query: dispatch_date_time from predYr-01-01 up to (but not including) (predYr + 1)-01-01
crimeQuery19 <- paste0(paste0(predYr, '-01-01'), '%27%20AND%20dispatch_date_time%20%3C%20%27', paste0(predYr + 1, '-01-01'))
crime19 <- read_csv(paste0(crimeEndpoint, crimeQuery19, '%27'))


# wrangle crime data
crimeData19 <- 
  crime19 %>%
  st_as_sf(coords = c('lng','lat'), crs = 4326) %>%
  st_transform(st_crs(phlcrs)) %>%
  mutate(year = year(dispatch_date)) %>%
  rename('crimeType' = text_general_code) %>%
  filter(crimeType %in% crimeTypes) %>%
  st_filter(., phlcounty) %>%
  dplyr::select(-dispatch_date, -crimeType)


# get nearest neighbor data
# get distance to fishnet cells centroids
crimeYears19 <-
  fishnet %>%
  mutate(
    crimeNN = nn_function(st_c(st_centroid(.)), crimeData19 %>% st_c(), 3),
    year = predYr) %>%
  st_drop_geometry()
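
For reference, the crimeNN feature above relies on two helpers, st_c and nn_function, which are defined earlier in the document. The sketch below shows what they are assumed to do: st_c as shorthand for st_coordinates, and nn_function as the average distance from each origin point to its k nearest destination points (here via the FNN package). The actual definitions may differ in detail.

# Assumed helper definitions, shown only for reference (the originals are defined
# earlier in the document and may differ).
st_c <- sf::st_coordinates  # extract a coordinate matrix from an sf object

nn_function <- function(measureFrom, measureTo, k) {
  # distances from each origin point to its k nearest destination points
  nn <- FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
  # average those k distances to get one exposure value per origin
  rowMeans(nn)
}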

  
# add both count and exposure to panel
crimeNetYears19 <-
  merge(
    crimeData19 %>%
      st_join(., fishnet, join = st_within) %>%
      st_drop_geometry() %>%
      group_by(uniqueID) %>%
      summarize(crime = n()) %>%
      right_join(yearPanel19) %>%
      replace(is.na(.), 0),
    crimeYears19)



# JOIN AND CALCULATE NEAREST NEIGHBOR DISTANCE
# Schools, parks, transit stops, and food licenses
exposureNet19 <- 
  rbind(schoolsData,
        parksData,
        transitData,
        licensesData19) %>%
  st_join(., fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, legend) %>%
  summarize(count = n()) %>%
  full_join(fishnet) %>%
  spread(legend, count, fill = 0) %>%
  st_sf() %>%
  dplyr::select(-`<NA>`) %>%
  na.omit() %>%
  ungroup() %>%
  mutate(schoolsNN =
           nn_function(st_c(st_centroid(.)), st_c(schoolsData), 3),
         parksNN =
           nn_function(st_c(st_centroid(.)), st_c(parksData), 3),
         transitNN =
           nn_function(st_c(st_centroid(.)), st_c(transitData), 3),
         licensesNN = 
           nn_function(st_c(st_centroid(.)), st_c(licensesData19), 3)) %>%
  st_drop_geometry()


# JOIN ALL
# TURN previous table into YEARS and then join with CRIME table
exposureNetYears19 <-
  merge(
    exposureNet19 %>%
      right_join(yearPanel19),
    crimeNetYears19)
  

  
# JOBS
# FROM Longitudinal Origin Destination Survey
# [source]('https://lehd.ces.census.gov/data/lodes/LODES7/pa/od/')
  
# state the endpoints for getting the jobs OD data
jobsEndpoints <- c()

for (year in seq(predYr-2, predYr)) {
  jobsEndpoints <-
    c(jobsEndpoints,
      paste0('https://lehd.ces.census.gov/data/lodes/LODES7/pa/od/pa_od_aux_JT00_',
             year,
             '.csv.gz'))
}

# create an empty object to accumulate each year's origin-destination data
jobsAllYears19 <- c()

# for every download endpoint, unzip, read csv and add to jobsAllYears
# set initial year of looping
loopYear <- predYr-2

# loop through links
for (dataset in jobsEndpoints) {
  jobsData <- read_csv(dataset) %>%
    mutate(year = loopYear)
  jobsAllYears19 <-
    rbind(jobsAllYears19, jobsData)
  loopYear <- loopYear + 1
}
  
# turn data by blocks into points with number of jobs in that block
jobsData19 <- 
  jobsAllYears19 %>%
  mutate(GEOID20 = as.character(w_geocode)) %>%
  dplyr::select(GEOID20, year) %>%
  group_by(year, GEOID20) %>%
  dplyr::summarize(count = sum(n())) %>%
  inner_join(phlblocks) %>%
  st_sf() %>%
  st_centroid(.)
  
# get jobs by year into a fishnet
jobsNetAllYears19 <- 
  jobsData19 %>%
  st_join(fishnet) %>%
  group_by(year, uniqueID) %>%
  dplyr::summarize(jobCount = sum(count)) %>%
  st_drop_geometry() %>%
  right_join(extYearPanel19) %>%
  replace(is.na(.), 0)
  
# Get change in jobs over 1- and 2-year intervals
# NOTE: the column names below are hardcoded to 2019; a generalized sketch follows this block
jobsLaggedNetYears19 <-
  merge(
    jobsNetAllYears19 %>%
      pivot_wider(names_from = year, values_from = jobCount) %>%
      mutate(lag1yr2019 = `2019` - `2018`) %>%
      dplyr::select(-starts_with('20')) %>%
      pivot_longer(cols = colnames(.)[-1], names_to = 'year', values_to = 'lag1years') %>%
      mutate(year = as.numeric(substr(year, 7, 10))),
    jobsNetAllYears19 %>%
      pivot_wider(names_from = year, values_from = jobCount) %>%
      mutate(lag2yr2019 = `2019` - `2017`) %>%
      dplyr::select(-starts_with('20')) %>%
      pivot_longer(cols = colnames(.)[-1], names_to = 'year', values_to = 'lag2years') %>%
      mutate(year = as.numeric(substr(year, 7, 10))))
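
The lag calculation above hardcodes the 2019 column names. Below is a minimal sketch, not part of the original pipeline, of how the same 1- and 2-year deltas could be computed for any prediction year, assuming jobsNetAllYears19 keeps its year/uniqueID/jobCount layout.

# Sketch only: generalized 1- and 2-year job-count deltas keyed off predYr
jobsLaggedSketch <-
  jobsNetAllYears19 %>%
  pivot_wider(names_from = year, values_from = jobCount, names_prefix = "jobs") %>%
  mutate(year = predYr,
         lag1years = .data[[paste0("jobs", predYr)]] - .data[[paste0("jobs", predYr - 1)]],
         lag2years = .data[[paste0("jobs", predYr)]] - .data[[paste0("jobs", predYr - 2)]]) %>%
  dplyr::select(uniqueID, year, lag1years, lag2years)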
  
  
# Merge the prediction-year job counts with the lagged job changes
jobsNetYears19 <-
  merge(
    jobsNetAllYears19 %>%
      filter(year == predYr),
    jobsLaggedNetYears19)


# SAVE locally
saveRDS(exposureNetYears19, file = "../data/local/exposureNet19.rds")
saveRDS(jobsNetYears19, file = "../data/local/jobsNet19.rds")


# join all features together
completeNet19 <- 
  salesNetYears19 %>%
  left_join(priceNetYears19) %>%
  left_join(sqftNetYears19) %>%
  left_join(debtNetYears19) %>%
  left_join(rentNetYears19) %>%
  left_join(individualOwnersNetYears19) %>%
  left_join(vacantNet19) %>%
  left_join(sheriffSalesNetYears19) %>%
  left_join(demoNetYears19) %>%
  left_join(jobsNetYears19) %>%
  left_join(exposureNetYears19)


# ungroup years and sort columns
# take out geometry and uniqueID
featuresNet19 <-
  completeNet19 %>%
  left_join(fishnet) %>%
  arrange(uniqueID) %>%
  ungroup() %>%
  dplyr::select(-uniqueID) %>%
  st_sf()


# SAVE locally
saveRDS(featuresNet19, file = "../data/local/featuresNet19.rds")
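
With the 2019 feature fishnet saved, the dashboard predictions themselves come from scoring featuresNet19 with the model trained earlier in the document. The lines below are only a sketch: devModel is a hypothetical placeholder for the fitted model object (assumed to be one whose predict() returns a numeric vector, such as a glm), and the actual scoring step may differ.

# Sketch only: devModel is a hypothetical placeholder for the model fit earlier
# in the document; substitute the actual trained object.
dashboardPreds19 <- featuresNet19
dashboardPreds19$predictedRisk <-
  predict(devModel, newdata = st_drop_geometry(featuresNet19))

# SAVE locally for the web app
saveRDS(dashboardPreds19, file = "../data/local/dashboardPreds19.rds")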

Kensington Neighborhood Data

majMinPolygon <- demographics %>%
  filter(nhMajMin == 0) %>%
  dplyr::select(geometry) %>%
  st_union() %>%
  st_sf()

majMinJSON <- 
  majMinPolygon %>%
  st_transform('EPSG:4326') %>%
  sf_geojson(.) 

write(majMinJSON, "./json/majMin.json")


incomePolygon <- demographics %>%
  filter(nhIncome == 0) %>%
  dplyr::select(geometry) %>%
  st_union() %>%
  st_sf()

incomeJSON <- 
  incomePolygon %>%
  st_transform('EPSG:4326') %>%
  sf_geojson(.)

write(incomeJSON, "./json/income.json")


# Kensington Example

kensingtonTracts <- 
  c("42101014400",
    "42101015600",
    "42101015700",
    "42101016200")


kensington <- 
  phltracts %>%
  filter(GEOID %in% kensingtonTracts) %>%
  st_union() %>%
  st_sf()


# print the bounding box of the Kensington study area in WGS84
kensington %>%
  st_transform('EPSG:4326') %>%
  st_bbox()


kensingtonJSON <- 
  kensington %>%
  st_transform('EPSG:4326') %>%
  sf_geojson(.)

write(kensingtonJSON, "./json/kensington.json")


# Get vacant lots in South Kensington
kenVacantLand <-
  vacantLandProps %>%
  dplyr::select(geometry) %>%
  st_filter(., kensington)

kenVacantLandJSON <- 
  kenVacantLand %>%
  st_transform('EPSG:4326') %>%
  sf_geojson(.) %>%
  # strip the GeoJSON Point wrappers so only bare [lng, lat] coordinate pairs remain
  gsub("\\{\"type\":\"Point\",\"coordinates\":", '', .) %>%
  gsub("\\}", ',', .)
  
write(kenVacantLandJSON, "./json/kenVacantLand.json")



kenVacantProps <-
  vacantLandProps %>%
  dplyr::select(geometry, ADDRESS) %>%
  filter(ADDRESS %in% c('1601 N LAWRENCE ST',
                        '1603 N LAWRENCE ST',
                        '1605 N LAWRENCE ST'
                        ))

kenVacantPropsJSON <- 
  kenVacantProps %>%
  st_transform('EPSG:4326') %>%
  sf_geojson(.) %>%
  gsub("\\{\"type\":\"Point\",\"coordinates\":", '', .) %>%
  gsub("\\}", ',', .)
  
write(kenVacantPropsJSON, "./json/kenVacantProps.json")



# Get delinquent lots in South Kensington
kenDelinquent <-
  delinquenciesVacantProps %>%
  filter(delinquentType != 0) %>%
  dplyr::select(geometry) %>%
  st_filter(., kensington)

kenDelinquentJSON <- 
  kenDelinquent %>%
  st_transform('EPSG:4326') %>%
  sf_geojson(.) %>%
  gsub("\\{\"type\":\"Point\",\"coordinates\":", '', .) %>%
  gsub("\\}", ',', .)
  
write(kenDelinquentJSON, "./json/kenDelinquent.json")


# Get US Bank lien delinquencies (citywide; the Kensington filter is left commented out)
kenUsBank <-
  delinquenciesVacantProps %>%
  filter(delinquentType == 2) %>%
  dplyr::select(geometry) #%>%
  #st_filter(., kensington)

kenUsBankJSON <- 
  kenUsBank %>%
  st_transform('EPSG:4326') %>%
  sf_geojson(.) %>%
  gsub("\\{\"type\":\"Point\",\"coordinates\":", '', .) %>%
  gsub("\\}", ',', .)
  
write(kenUsBankJSON, "./json/kenUsBankAll.json")
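
The quick-look maps below use m(), a small mapping helper presumably defined elsewhere in the document. A hypothetical stand-in, assuming it simply plots an sf layer for visual inspection:

# Hypothetical stand-in for the quick-look helper m(); the original may differ.
m <- function(layer) {
  plot(sf::st_geometry(layer), col = 'grey70')
}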


m(kensington)
m(kenVacantLand)


# Get vacant delinquent properties in South Kensington (the sheriff sale filter is left commented out)
kenSheriff <-
  vacantDelinquenciesProps %>%
  # filter(allSheriffSales == 1) %>%
  dplyr::select(street_address, geometry) %>%
  st_filter(., kensington)

kenSheriffJSON <- 
  kenSheriff %>%
  st_transform('EPSG:4326') %>%
  sf_geojson(.) %>%
  gsub("\\{\"type\":\"Point\",\"coordinates\":", '', .) %>%
  gsub("\\}", ',', .)
  
write(kenSheriffJSON, "./json/kenSheriff.json")