Find the MUSA Practicum Homepage here

Visit the Eviction Watch Dashboard here

This project has been developed in collaboration with the University of Pennsylvania’s Master of Urban Spatial Analytics Spring 2023 Practicum (MUSA 801), instructed by Michael Fichman and Matt Harris. We extend our sincere gratitude to Lauren Parker of Community Legal Services, Jonathan Pyle from Philadelphia Legal Assistance, and Steven Suffian of Who Owns Philadelphia. Their invaluable contributions have been instrumental in bringing this project to fruition.

The document includes collection of appendices such as data cleaning, data visualization, feature engineering, and model outcomes. To ensure reproducibility, the complete code is furnished at the end of the report. To navigate this document, you may utilize the table of contents situated on the left or scroll through the content in a sequential manner.

For more information on this project contact , or .

1. A Quick Brief

1.1 Evictions and Housing Displacement

Eviction is a critical issue that has far-reaching consequences for individuals, families, and communities. Evictions can lead to displacement, financial strain, and long-lasting negative impacts on mental and physical well-being. In recent years, the need for effective eviction prevention strategies has become even more pressing, particularly in urban areas like Philadelphia, where the cost of living continues to rise and vulnerable populations are at increased risk of losing their homes.

As of May 1st, 2023, a study by The Eviction Lab shows that in the 10 states and 34 cities that they track, there were 7,456 evictions in the past week alone.1 Evictions have a disproportionately adverse effect on people of color and single-family households. Black households account for 68% of eviction filings in Philadelphia. Additionally, 60% of the evictees identify as female, according to the same study.2

In response to this pressing need, our team is developing an innovative app, “Eviction Watch” designed to predict evictions at the property level, ultimately assisting Community Legal Services in Philadelphia to anticipate and prevent these evictions. By leveraging advanced analytics and machine learning techniques, our app aims to provide accurate and timely information about eviction risks, empowering legal services and stakeholders to make proactive and data-driven decisions about resource allocation.

1.2 The Story in Philadelphia

There are an estimated 311,986 renter households in the city, with a typical monthly rent of $1,181. Over the past 12 months, the average eviction filing rate has been 5% per 100 renter households. From 2016 to 2022, the eviction court in Philadelphia has received over 111,000 eviction filings. These statistics shed light on the prevalence and impact of evictions on Philadelphia’s renters.

To highlight the issue of housing instability in Philadelphia, this section presents two fictional stories of individuals living in the city. It is important to note that these characters are purely fictional and are not based on any real-life instances.

Property Address: 4815 Pine Street
Owner Name: Cityscape Developments Inc.
Owner Type: For Profit
No. of Code Violations: 0
Subsidy Expiration: 12th June, 2022
Evicted On: 7th July, 2022

Kate, a single mother of two, resided in an affordable housing unit in University City owned by a private development company. She is employed in an administrative role at the local VA Hospital. Her children attended Harrington Elementary School, which was conveniently situated near their home. Unfortunately, the subsidy on Kate’s housing unit was set to expire in June of 2022, which placed her at risk of eviction. The escalating rental prices in the University City area made it increasingly challenging for Kate to secure alternative affordable housing options within the vicinity. Maintaining residence in the area was crucial to ensure manageable commuting times to her workplace and her children’s educational institution. Kate was evicted in july of 2022. Kate’s predicament exemplifies the struggles faced by single parents in urban environments amidst the rising costs of living.


Property Address: 841 N American Street
Owner Name: Cedar Grove Management Company
Owner Type: For Profit
No. of Eviction Filings on Property: 6
No. of Code Violations: 3
Active Subsidies: 0

Andrew, a fitness trainer, resides in a studio apartment in the Northern Liberties neighborhood, situated within a property owned by Cedar Grove Management Company. He works as a fitness trainer based in the Southwest Center City area, which necessitates daily commuting. Over the past year, the management company has pursued several eviction cases, with the apparent intent to renovate and upgrade properties to attract new tenants at higher rental rates. Consequently, Andrew faces a significant risk of eviction by the end of the year. This phenomenon is not uncommon in gentrifying neighborhoods throughout Philadelphia, where socio-economic dynamics often lead to the displacement of long-standing residents.

1.3 Eviction Process in Philadelphia

The diagram below illustrates the eviction process in Philadelphia, it can be broken down into four major steps. It is crucial to understand the eviction process in Philadelphia to comprehend the time and legal limitations that present challenges for an intervention against an eviction filing to occur. 3

  1. Eviction Notice: It begins when a tenant receives a written eviction notice, which states that their landlord wants them to move out of their unit within a specified time frame usually specified in the lease agreement or at least 10 days.
  2. Eviction Complaint Filed: If the tenant does not move out in the specified time, this could escalate into an eviction complaint filed against a tenant in the municipal court. Which is when the legal process of eviction begins. The Court send a notice to tenant by mail which states the reasons for eviction, and the date and time the tenant must appear in court.
  3. Hearing Date: The tenant be present in court hearing to defend against any claims made against them by the landlord. Failure to attend the hearing date or late arrival would lead to a default judgement against the tenant. If the tenant loses at the hearing, they have 10 calendar days to file an appeal to a higher court.
  4. Eviction: After the 10-day appeal period has expired in the event of a ruling against the tenant, the landlord may file for a Writ of Possession. This notifies the tenant that an eviction will take place on or after 11 days from the day the Writ of Possession is served. The eviction of the tenant cannot occur before 21 days after the court judgement, which includes the time period for an appeal to take place and the Writ of Possession.

1.4 Client and Use Case

The client, Community Legal Services (CLS) provides free legal civil services to low income residents of Philadelphia. The CLS Housing Unit represents private, public, and subsidized housing tenants in matters involving eviction, illegal lockouts, and substandard housing. The unit also uses systems advocacy and litigation to address issues ranging from lead paint elimination, to federal housing policy changes, to tenant eviction laws.4 For more information, visit the CLS website.

Community Legal Services currently provides legal assistance, advocacy, and representation as needed. That is, their ability to provide their services is limited to the calls and walk-ins they receive from tenants in need of their services.

Using EvictionWatch, Community Legal Services will anticipate eviction filing frequency at a property level, in order to proactively allocate legal resources to tenants at risk of facing eviction.


1.5 Point of Intervention

Time is of the essence when intervening against an eviction, the earlier CLS can step in, the better the outcome tenants can have. CLS would like to have a more proactive approach to intervention, which would mean that they can perform outreach when an eviction notice is given, or even before that happens. A reactive approach to intervention would be to step in after an eviction complaint is filed to the municipal court or after a negative judgment is passed against the tenant. We hope to provide CLS with a proactive advertisement methodology for outreach and intervention.

Utilizing the EvictionWatch application, Community Legal Services (CLS) is positioned to deliver timely intervention and support to tenants facing eviction risks. By providing legal assistance and representation, CLS can facilitate negotiations for payment arrangements, mediate between defendants and plaintiffs, and ultimately contribute to the prevention of eviction. This targeted intervention strategy aims to empower vulnerable tenants and promote housing stability in the face of ongoing displacement pressures. Ultimately, CLS could increase impact by providing its services to a wider range of tenants at risk in Philadelphia.

2. Exploratory Data Analysis

2.1 Municipal Court Filings

The primary dataset utilized in this study is derived from municipal court filings spanning the years 2016-2022. This comprehensive dataset features individual-level data, including the filing dates, defendant and plaintiff names, and addresses, as well as information pertaining to the outcome of each court case. Moreover, the dataset provides building-level information, such as the rental amount due, geographic location, property classification (i.e., commercial or residential), and ownership status, indicating whether the property is managed by a public housing agency, such as the Pennsylvania Housing Finance Agency (PHFA). The dataset also comprises a column labeled EVICTION_COUNT, which enumerates the eviction cases filed during the specified time frame. By analyzing this rich dataset, we were able to gain valuable insights into the patterns and trends associated with eviction filings and outcomes, as well as the potential factors that contribute to housing instability within the city of Philadelphia.


An examination of the annual eviction filing counts reveals distinct trends over time. Notably, there was a 62.5% decrease in the number of filings during 2020. This substantial reduction can largely be attributed to the eviction moratorium instituted by the Centers for Disease Control and Prevention (CDC). Designed to protect eligible tenants from eviction due to non-payment of rent, the CDC’s moratorium took effect in September 2020 and was subsequently extended on multiple occasions before ultimately expiring on July 31, 2021. There was an increase in eviction filings commencing in 2021, highlighting the potential impact of the moratorium’s expiration on housing stability.



The frequency of evictions appears to be subject to seasonal variations, with certain months exhibiting consistently higher counts compared to others. Notably, there is a marked increase in evictions between the months of August and October each year, while April sees the lowest numbers. While it is difficult to draw definitive conclusions about the underlying factors that drive these trends, it is possible that the higher eviction numbers in the fall months could be due, in part, to the influx of students returning to universities and colleges. Further research is needed to explore the underlying causes and dynamics of these patterns in greater detail.



The graph displayed illustrates the relationship between eviction filings and owner type for properties in Philadelphia. The findings indicate that properties owned by profit agencies have a higher likelihood of experiencing eviction filings compared to public entities such as the Philadelphia Housing Authority (PHA).The lower history of eviction filings by public entities like PHA may suggest a more tenant-friendly approach in public housing management, compared to private ownership models. Additionally, it is noteworthy that a significant proportion of eviction filings occur in non-subsidized units, which warrants further exploration in subsequent sections.


2.2 Property Level Data

The dataset used for this study was obtained from the Office of Property Assessments (OPA) and comprises of property-level information for all properties in Philadelphia. This dataset is dynamic and is updated frequently to capture the latest information on property sales, assessed property value, and zoning type. Our analysis shows that certain variables such as MARKET_VALUE, CATEGORY_CODE, SALES_PRICE and TOTAL_AREA demonstrate a correlation with evictions, which may provide insight into the factors that contribute to the likelihood of eviction. From this data, we can see that 4.3% of properties have at least one eviction filed since 2016.

The bar charts below displays the relationship between market values and eviction filings of properties in Philadelphia. The plot suggests that properties with lower market values are more prone to eviction filings than higher valued properties. Specifically, the plot indicates that the majority of eviction filings occur in properties valued below $5 million, which is consistent with the trend seen with variables that are directly related to market value such as taxable land and livable area. This suggests that lower property values may be an indicator of housing instability and a higher risk of eviction filings. However, there also seems to be an increase in eviction filings in properties valued higher than 20 million. This finding is a contradiction to the fact that higher valued properties tend to have more stable tenants, and are therefore less likely to experience eviction filings. Further analysis may be needed to determine the factors that contribute to this trend.

The second bar chart in this section represents eviction filings in relation to the year of construction and age of homes. The analysis of the data indicates that the highest number of eviction filings is observed in homes built between the years 1950 and 2000. This observation is consistent with the fact that homes constructed after 2000 are relatively new and have a higher price. Similarly, older homes that are considered historic also have a high value and lower probability of evictions. Therefore, homes constructed between 1950 and 2000, which need renovation, are more likely to face evictions. The renovations of these homes could lead to an increase in their market value and prompt the eviction of the previous tenants to make way for newer ones, as has been observed in parts of University City.

2.3 Licenses and Subsidies

This section investigates the association between housing subsidies, code violations, and eviction rates to identify the underlying factors of housing instability and propose relevant policy interventions. Data on housing subsidies is obtained from the National Housing Preservation Database (NHPD) while code violations data is obtained from the Licenses and Inspections dataset hosted on Open Data Philly. Interestingly, the data shows that there were a significant number of eviction filings in properties with no active subsidies, indicating the presence of other factors that contribute to housing instability.


The analysis of eviction filings in properties with code violations reveals a bimodal pattern, with a higher rate of filings observed in properties with either no violations or a high count of violations. This suggests the existence of a complex relationship between code violations and eviction, which merits further investigation to inform targeted interventions aimed at reducing housing instability. Other variables of interest that could add value to this data would be the quality of the housing, and the financial situation of tenants. Understanding the complexity of these factors is crucial for designing effective policies to reduce eviction rates in Philadelphia


2.4 Data Wrangling and Cleaning

Links to all data sources will be provided at the end of this document

This study utilized two main types of data sources: property level data and individual level data. The property level data was obtained from several sources such as the National Housing Preservation Database (NHPD) and the Office of Property Assessment (OPA). Meanwhile, the individual level data was collected from the 5 year American Community Survey (ACS) conducted in 2021. These data sources were then linked to the municipal court filing data using various tools and variables including the PASSYUNK parser for standardizing addresses, PARCEL_NUMBERS, and GEOID. In particular, variables such as OWNER_TYPE, RACE, RENT_BURDEN, CODE_VIOLATIONS etc. were included in the model building process.

2.5 Feature Engineering

To optimize the significant variables in our datasets and capture their nuances, we created new features with clearer categories and limits. The newly engineered features were used in our regression model which we will discuss in the subsequent section. The new variables that we engineered are as follows:

Owner Type: We grouped the properties from the OPA dataset into for-profit and public owners. Using the case_when function, we filtered the data for keywords such as CITY, PHILADELPHIA, AUTHORITY, PHILA, COMMONWEALTH, PA and RESTORATIONS to identify the government-owned properties in Philadelphia separating them from for-profit owned properties.

Age: We derived the age of the properties from the OPA dataset in years from the year_built column. We did so by subtracting the year the buildings were built by 2023.

Eviction counts by month: Using the filing dates provided in the municipal court filing dataset, we created new categories for the sum of eviction filing per month and year from 2016 to 2022. These new categories helped us account for the temporal nature of eviction filings, adding time lag as one of the independent variables in our model. We derived our dependent variable from this transformation as well, which is the sum of evictions from the first six months of 2022.

Market value brackets: We decided to run our model using the market value bracket with the highest number of eviction filings, which we identified in the EDA section as properties valued below 50 million dollars. This was done to improve the accuracy of our model and to prevent our model from being skewed by the wide variations in market values and outliers.

We recognize that we significantly narrowed the number of properties we are predicting outcomes for, but we found it valuable in our modeling process. Given more time, we would have repeated our modeling processes on properties with a higher market value and compared our findings from each market value bracket side-by-side.

2.6 Spatial Features

We accounted for the spatial processes that inform eviction filings by adding the average nearest-neighbor distance and neighborhood effects in Philadelphia. The nearest neighbor bunching informs our model if eviction fillings tend to cluster and if their proximity can impact the risk of an eviction being filed nearby. Adding neighborhood effects to our model was important to capture the localized characteristics of eviction filings, which vary from neighborhood to neighborhood.

Eviction_properties: We calculated the average nearest neighbor distance from each eviction filing to its 2 nearest eviction filings.

Neighborhood: Using the st_join function, we categorized each property by its corresponding neighborhoods.

3. Model Building

3.1 Introducing the Zero-Inflated Poisson Model

A zero-inflated Poisson model is a statistical model used to analyze count data with a high number of zeros. This type of model is helpful in this situation because it takes into account the excess number of zeros in the dependent variable (EVICTION_COUNT), as seen in the graph. By using this model, we can estimate the probability of a property experiencing an eviction while accounting for the fact that a large number of properties have never experienced an eviction. This model can also help identify the factors that are associated with a higher likelihood of eviction filings, allowing our client to better understand the dynamics of eviction in the city and develop targeted interventions to prevent evictions.

3.2 Splitting the Data

This study acknowledges that the eviction data used in the analysis varies depending on the time and the trends from 2019-2021 may not be the best predictors due to the eviction moratorium. To address this issue, we have randomly sampled the eviction data from 2022 and divided it into training and test sets, using a 75-25 split. This method allows for the development and testing of models on more recent data, which is likely to better represent current eviction patterns. We have also used the 2022 data in EvictionWatch to predict eviction filings across Philadelphia for 2023.

3.3 Building the Model

The process of building a model is an iterative process, various combinations of variables were tested which lead us back to feature engineering and further fine-tuning of our variables. Using the Zero-Inflated Poisson Model, we ran 5 regressions with a different combination of variables. These models are explained below:
  1. *Model 1 – Numerical values only: The first model consists of variables with numerical values such as market value, total livable area, violation count, rent burden, and taxable land.
  2. Model 2 – Plus categorical variables: The second model consists of variables with numerical values from model one, plus categorical variables such as category code (descriptive categories describing building conditions), owner types, and active subsidies (in binary form: Yes or No).
  3. Model 3 – Plus spatial lag: The third model consists of variables including the numerical and categorical variables in the previous model, plus a spatial lag variable from the calculated average nearest neighbor distance.
  4. Model 4 – Plus time lag: The fourth model consists of numerical, categorical, and spatial lag features found in the previous model, plus time lag features that account for the eviction counts from one, two, three, and six months prior as well as one and two years prior to January 2022.
  5. Model 5 – Plus neighborhood effects: The fourth model consists of numerical, categorical, spatial lag, and time lag features found in the previous model, plus a variable accounting for neighborhood effects.
  6. The summary of the 5 regression models can be found in the table below:


3.4 Evaluating the Model

The performance of our models was evaluated using the MAE and MAPE metrics. The MAE measures the average absolute difference between the predicted and actual values, while the MAPE measures the average percentage difference between the predicted and actual values. Given that we ran 5 models, we utilized the results from MAE and MAPE metrics to identify the appropriate model to use for our final eviction filing predictions. A lower MAE indicates better performance of the model since it means that the predictions are closer to the actual values.

As seen in adjacent figure, the lowest MAE value can be seen in the model with the neighborhood effects.

3.5 Model 5: Plus Neighborhood Effects

Our final model was Model 5, which included neighborhood effects. The MAE value of the model was found to be 0.04, indicating that the average absolute difference between the predicted and actual values is 0.04 units. Similarly, the MAPE value of the model was found to be 3.6%, indicating that on average, the predicted values differ from the actual values by 3.6%. These values suggest that our model has moderate accuracy in predicting the outcome variable. However, it is important to note that model accuracy is not synonymous with model generalizability. We would be discussing generalizability further in the next section.

4. Model Generalizability

When predicting eviction filings in the future using a machine learning model, it is critical that the model is both accurate and generalizable. Generalizability is crucial because eviction filings can vary based on different factors, such as geography, demographics, owner types, and property characteristics. Therefore, a model that can generalize well across different contexts can provide more robust and reliable predictions. There are 3 patterns found in our modeling results. Our model is:

  1. Good at predicting filings in properties with a history of the high frequency of eviction filings.
  2. Good at predicting filings in properties with no history of eviction filings.
  3. Bad at predicting filings in properties with a history of the low frequency of eviction filings.
  4. We will illustrate these instances using 3 examples from our prediction results. First Instance (image to the left): Properties with a history of high-frequency eviction filings. Second Instance (image in the middle): Properties with no history of evictions. Third Instance (image to the right): Properties with a history of lower frequency of eviction filings




    As seen in the examples above, our current model fails to capture the nuances that cause low-frequency eviction filings. In the next section, we would discuss some of the steps we would have taken to improve the model given more time.

    Visit the Eviction Watch Dashboard

5. Next Steps

Our current model provides valuable insights into predicting eviction filings in the future, but it’s important to recognize that there may be additional factors or variables that we have not yet considered in our modeling approach. We must also acknowledge the limitations of the accuracy and generalizability of our models. Given more time, there are steps we could take to improve our model such as:

  1. Feature engineering property-level variables to capture nuances that differentiate one property from another.
  2. Categorizing properties valued above 50 million dollars into brackets and running separate models for each bracket. (Since our model currently captures properties valued below 50 million dollars.
  3. Exploring modeling on a smaller geography such as the Philadelphia planning district level instead of at the city level.
  4. Trying different modeling techniques or algorithms that may perform better on our zero-inflated dataset.
  5. Using cross-validation techniques to evaluate the performance of our model on new datasets and ensure that it is generalizable.
  6. An accurate and generalizable machine learning model for predicting eviction filings in the future can help identify at-risk individuals and households, inform policy decisions, and provide valuable insights to prevent housing insecurity and improve the lives of individuals and families. Despite its limitations, our current model provides a starting point for further exploration and refinement. By collaborating with practitioners in the field of housing and continuing to refine our approach over time, we can create more effective solutions to address the complex challenges of eviction and housing insecurity.

6. Acknowledgements

We would like to acknowledge the invaluable contributions of Lauren Parker of Community Legal Services, Jonathan Pyle from Philadelphia Legal Assistance, and Steven Suffian of Who Owns Philadelphia, for their assistance in the development of this project. We would like to express their gratitude to the instructors of the University of Pennsylvania’s Master of Urban Spatial Analytics Spring 2023 Practicum (MUSA 801); Michael Fichman and Matthew Harris, for their guidance and support throughout this project.

8. Code Appendix

# read the data
eviction <- read.csv("evictions_out.csv") #dataset from steve 

other <- read.csv("other.csv") # dataset from other datasets

eviction_all <- read.csv("summary-table.csv") # dataset from summary table

opa <- st_read("opa_properties_public.geojson") 

modeldata <- read.csv("Model_0316.csv") # dataset with properties with subsidies

opa_keep <- opa %>% 
  dplyr::select(. , parcel_number, location, mailing_street) %>% 
  dplyr::filter(., location!=mailing_street) %>% 
  st_drop_geometry() %>% 
  dplyr::select(., parcel_number)

opa_keep$parcel_number <- as.integer(opa_keep$parcel_number)

### Adding subsidy info to OPA data


# Subset Active Subsidies
trim_modeldata <- modeldata %>% 
  dplyr::select(OwnerType, ActiveSubsidies, TotalUnits, PropertyStatus, objectid)

# Join 
other_01 <- other
other_01<- other_01 %>%
  left_join(trim_modeldata, by = "objectid") 

### Initial Feature Engineering 

# Grouping Owner types 
## Select Owner
ownerdata <- opa %>%  dplyr::select(owner_1, parcel_number)
  
## Create a frequency table of the words
freq_table <- table(ownerdata$owner_1)

## Sort the table by frequency
sorted_table <- sort(freq_table, decreasing = TRUE)

## Print the top 10 most frequent words
head(sorted_table, n = 40)

## Grouping Owner types into Public and Private
ownerdata <- ownerdata %>%  
  mutate(ownertype= case_when(grepl("CITY|PHILADELPHIA|AUTHORITY|PHILA|COMMONWEALTH|PA|RESTORATIONS", owner_1) ~ "Public",
    TRUE ~ "forProfit"))

ownerdata$parcel_number <- as.numeric(ownerdata$parcel_number)
ownerdata$geometry <- NULL
ownerdata$owner_1 <- NULL

# Join Owner data to other datasets
other_01 <- other_01 %>%
  left_join(ownerdata, by = "parcel_number") 

# Grouping Owner Types for subsidized housing
other_01 <- other_01  %>% 
     mutate(other_01, OwnerType = (case_when(OwnerType == "For Profit" ~ "forProfit",
        OwnerType == "Multiple" ~ "forProfit",
        OwnerType == "Limited Dividend" ~ "forProfit",
        OwnerType == "Profit Motivated" ~ "forProfit", 
        OwnerType == "Non-Profit" ~ "Non-Profit", 
        OwnerType == "Public Entity" ~ "Public ")))
   
# Age of homes  
other_01 <- mutate (other_01, Age = 2023 - year_built)

# count the evictions by different months and join the right address
eviction$month <- substr(eviction$id, 7, 8)
eviction$year_2 <- substr(eviction$id, 4, 5)

evictions_locations <- eviction_all%>% 
  dplyr::select(., id, clean_address, census, zip) 
evictions_links <- evictions_locations %>% 
  left_join(eviction, by = "id") %>% 
  na.omit()

## Create dataset for modeling

evictions_links <- evictions_links %>% 
  mutate(., year_head = 20) %>% 
  unite(combined, year_head, year_2, month, sep = "")

count <- evictions_links %>% 
  group_by(parcel_number, combined) %>% 
  summarise(count = n()) %>% 
  ungroup()
count <- count %>% 
  spread(., combined, count)
count <- replace(count, is.na(count), 0)

opa_with_counts_0328 <- other_01 %>%
  left_join(count, by = "parcel_number") 

# it is an inner join because we filtered the buildings that are not housing 
dataset_total <- opa_with_counts_0328
opa_keep <- opa_keep %>% 
  inner_join(dataset_total, by = "parcel_number")
dataset_total <- opa_keep
dataset_total[, 49:127] <- lapply(dataset_total[, 49:127], function(x) replace(x, is.na(x), 0))
dataset_total$evictions_total <- rowSums(dataset_total[, 49:127], na.rm = TRUE)

### Adding geometry

PHITracts <-
  tigris::tracts(state = "PA", county = "Philadelphia") %>%
  dplyr::select(GEOID)%>%
  filter(GEOID!="36061000100")

dataset_total <- st_as_sf(dataset_total, coords = c("lng", "lat"),
                 crs = st_crs(PHITracts), agr = "constant")

ggplot() + 
  geom_sf(data=PHITracts,color="grey50",size=0.5,fill = "transparent")+
  geom_sf(data = dataset_total, aes(color = q5(market_value)),size=1, alpha=0.5) +
  geom_sf(data=st_union(PHITracts),color="black",size=3,fill = "transparent")+
  scale_color_manual(values = pal5,
                    name = "Market Value") +
  labs(title="Market Value 2022") +
  mapTheme(10,75) + 
  theme(legend.position = "bottom",
        panel.border = element_rect(color = "black",fill = "transparent", size = 2))

## Adding Spatial Features Nearest Neighbor


table_not_zero <- dataset_total %>% 
  dplyr::filter(., evictions_total!=0)

# Using NN function to get the distance from 2 nearest neighbor which has eviction
table_model <- dataset_total %>% 
  mutate(evictions_properties = nn_function(st_coordinates(dataset_total), st_coordinates(table_not_zero), 2))

## Feature engineering additional variables


# Make active subsidies a binary column
table_model<- mutate(table_model, activeSubsidies_binary = as.numeric(case_when(ActiveSubsidies>=1 ~ "1", ActiveSubsidies==0 ~ "0")))

table_model$activeSubsidies_binary[is.na(table_model$activeSubsidies_binary)] <- 0

# trim outliers based on market value 
lower_bound <- 80000
upper_bound <- 50000000

trimmed_table_model <- table_model[table_model$market_value  >= lower_bound & table_model$market_value  <= upper_bound,]

# trim outliers based on evictions
lower_bound2 <- 0
upper_bound2 <- 400

trimmed_table_model <- table_model[table_model$evictions_total >= lower_bound2 & table_model$evictions_total  <= upper_bound2,]

# Create category for 6 month intervals 
## First 6 months of 2021
trimmed_table_model <- mutate(trimmed_table_model, First6months_2021 = (`202106`+ `202105`+`202104`+ `202103`+ `202102`+ `202101`)) 
## Last 6 months of 2021
trimmed_table_model <- mutate(trimmed_table_model, Last6months_2021 = (`202112`+ `202111`+ `202110`+ `202109`+`202108`+`202107`))
## First 6 months of 2022
trimmed_table_model <- mutate(trimmed_table_model, First6months_2022 = (`202206`+ `202205`+`202204`+ `202203`+ `202202`+ `202201`))

### Add Philadelphia neighborhoods 


neighborhood <- st_read("https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson") 

neighborhood <- neighborhood %>% 
  dplyr::select(name, geometry)

neighborhood <- st_transform(neighborhood, st_crs(trimmed_table_model))

trimmed_table_model<- st_join(trimmed_table_model,neighborhood, left = TRUE)
trimmed_table_model <- trimmed_table_model%>%
  rename(neighborhood = name)


## Select Variables for model 

model_df <- trimmed_table_model 
model_df <-model_df %>% 
  dplyr::select(., market_value, total_livable_area, violation_count, rent_burden_total.2021, White.2021, taxable_land, TotalUnits, Age, category_code, OwnerType, ownertype, activeSubsidies_binary, evictions_properties,`202112`, `202111`, `202110`, `202109`, `202106`, `202012`, `201912`, evictions_total, First6months_2021, Last6months_2021, First6months_2022, neighborhood)

model_df [is.na(model_df)] <- 0

### Split into Train and Test sets

model_df  <- st_read(paste0(getwd(),"/Modeling/model_df.csv"))
finaltrain  <- st_read(paste0(getwd(),"/Modeling/finaltrain.csv"))
finaltest  <- st_read(paste0(getwd(),"/Modeling/finaltest.csv"))


model_df  <- initial_split(model_df , strata = `First6months_2022`, prop = 0.75)
finaltrain <- training(model_df)
finaltest  <- testing(model_df)

## Zero-inflated Poisson MOdel, dependent variable = First 6 months of 2022 


install.packages("pscl")
library(pscl)

pal <- c("#2e2345","#524d60","#a57569","#d19b75","#e4dbc4")

ggplot(MAE_1, aes(x = MODELS_1, y = MAE, fill = "#2e2345")) + 
  geom_bar(stat = "identity") +
  labs(x = "\nYear and months", y = "MAE\n") + 
  theme_ipsum()+
  scale_fill_manual(values = pal)


# reg1 only numeric predictors
reg1 <- zeroinfl(`First6months_2022` ~ market_value + total_livable_area + violation_count + rent_burden_total.2021+ taxable_land, data = finaltrain)
summary(reg1)

# reg2 + categorical predictors
reg2 <- zeroinfl(`First6months_2022` ~ market_value + total_livable_area + violation_count + rent_burden_total.2021+ taxable_land
                 + category_code + ownertype + OwnerType + activeSubsidies_binary, 
                 data = finaltrain)
summary(reg2)  

# reg8 + spatial lags
reg3 <- zeroinfl(`First6months_2022` ~ market_value + total_livable_area + violation_count + rent_burden_total.2021 + taxable_land
                 + category_code + ownertype + OwnerType + activeSubsidies_binary 
                 + evictions_properties, 
                 data = finaltrain)  
summary(reg3)

# reg4 + time lags
reg4 <- zeroinfl(`First6months_2022` ~ market_value + total_livable_area + violation_count + rent_burden_total.2021 + taxable_land 
                 + category_code + ownertype + OwnerType + activeSubsidies_binary 
                 + evictions_properties
                  +`202112`+ `202111` + `202110` + `202109` + `202106` + `202012`+ `201912`,
                 data = finaltrain) 
summary(reg4)  

# reg5 + neighborhood effects
reg5 <- zeroinfl(`First6months_2022` ~ market_value + total_livable_area + violation_count + rent_burden_total.2021 + taxable_land 
                 + category_code + ownertype + OwnerType + activeSubsidies_binary 
                 + evictions_properties + neighborhood +
                  +`202112`+ `202111` + `202110` + `202109` + `202106` + `202012`+ `201912`,
                 data = finaltrain) 
summary(reg4)  

## Calculate the MAE for all regressions

result1 <- finaltest
result2 <- finaltest
result3 <- finaltest
result4 <- finaltest
result5 <- finaltest

result1$pred1 <- predict(reg1, newdata = finaltest, type = "response")
result2$pred2 <- predict(reg2, newdata = finaltest, type = "response")
result3$pred3 <- predict(reg3, newdata = finaltest, type = "response")
result4$pred4 <- predict(reg4, newdata = finaltest, type = "response")
result5$pred5 <- predict(reg5, newdata = finaltest, type = "response")

result1 <- result1 %>% 
  dplyr::select(., `First6months_2022`, pred1)
result2 <- result2 %>% 
  dplyr::select(., `First6months_2022`, pred2)
result3 <- result3 %>% 
  dplyr::select(., `First6months_2022`, pred3)
result4 <- result4 %>% 
  dplyr::select(., `First6months_2022`, pred4)
result5 <- result5 %>% 
  dplyr::select(., `First6months_2022`, pred5)

mae1 <- mean(abs(result1$pred1 - finaltest$`First6months_2022`))
mae2 <- mean(abs(result2$pred2 - finaltest$`First6months_2022`))
mae3 <- mean(abs(result3$pred3 - finaltest$`First6months_2022`))
mae4 <- mean(abs(result4$pred4 - finaltest$`First6months_2022`))
mae5 <- mean(abs(result5$pred5 - finaltest$`First6months_2022`))

mape1 <- mean(abs((result1$pred1 - finaltest$`First6months_2022`) / finaltest$`First6months_2022`)) * 100
mape2 <- mean(abs((result2$pred2 - finaltest$`First6months_2022`) / finaltest$`First6months_2022`)) * 100
mape3 <- mean(abs((result3$pred3 - finaltest$`First6months_2022`) / finaltest$`First6months_2022`)) * 100
mape4 <- mean(abs((result4$pred4 - finaltest$`First6months_2022`) / finaltest$`First6months_2022`)) * 100
mape5 <- mean(abs((result5$pred5 - finaltest$`First6months_2022`) / finaltest$`First6months_2022`)) * 100

MAE_1 <- data.frame(MODELS_1 = c("Numeric Only", "Plus categorical", "Plus Spatial Lag", "Plus Time Lag", "Plus Neighborhoods"), MAE = c(mae1, mae2, mae3, mae4, mae5))

ggplot(MAE_1, aes(x = MODELS_1, y = MAE, fill = "#2e2345")) + 
  geom_bar(stat = "identity") +
  scale_fill_identity() +
  labs(x = "Year and months", y = "MAE") + 
  plotTheme()

## plot eviction as a function of continuos variables

train1 <- finaltrain
train1$pred1 <- predict(reg1, newdata = finaltrain, type = "response")
train1<- train1 %>% 
  dplyr::select(., `First6months_2022`, pred1)

ggplot(data = train1, aes(x = `First6months_2022`, y = pred1)) +
  geom_point(size = 3,alpha=0.5) + geom_smooth(method = "lm", se=F, colour = "#a57569",size=4) +
  labs(title = "eviction numbers as a function of continuous variables") +
  plotTheme()


## Comparing model 9 (Plus Time Lag) with model 10 (Neighborhood effect)

finaltest <-
  finaltest %>%
  mutate(Regression = "Numeric Predictors",
         Eviction.Predict = predict(reg1, finaltest),
         Eviction.Error = Eviction.Predict - First6months_2022,
         Eviction.AbsError = abs(Eviction.Predict - First6months_2022),
         Eviction.APE = (abs(Eviction.Predict - First6months_2022)) / Eviction.Predict)

finaltest_category <-
  finaltest %>%
  mutate(Regression = "Categorical Predictors",
         Eviction.Predict = predict(reg2, finaltest),
         Eviction.Error = Eviction.Predict - First6months_2022,
         Eviction.AbsError = abs(Eviction.Predict - First6months_2022),
         Eviction.APE = (abs(Eviction.Predict - First6months_2022)) / Eviction.Predict)


finaltest_spatial <-
  finaltest %>%
  mutate(Regression = "Spatial Lag",
         Eviction.Predict = predict(reg3, finaltest),
         Eviction.Error = Eviction.Predict - First6months_2022,
         Eviction.AbsError = abs(Eviction.Predict - First6months_2022),
         Eviction.APE = (abs(Eviction.Predict - First6months_2022)) / Eviction.Predict)

finaltest_time <-
  finaltest %>%
  mutate(Regression = "Time Lag",
         Eviction.Predict = predict(reg4, finaltest),
         Eviction.Error = Eviction.Predict - First6months_2022,
         Eviction.AbsError = abs(Eviction.Predict - First6months_2022),
         Eviction.APE = (abs(Eviction.Predict - First6months_2022)) / Eviction.Predict)


finaltest_nhood <-
  finaltest %>%
  mutate(Regression = "Neighborhood Effects",
         Eviction.Predict = predict(reg5, finaltest),
         Eviction.Error = Eviction.Predict - First6months_2022,
         Eviction.AbsError = abs(Eviction.Predict - First6months_2022),
         Eviction.APE = (abs(Eviction.Predict - First6months_2022)) / Eviction.Predict)

#bothRegressions <- 
# rbind(
#  dplyr::select(finaltest, starts_with("Eviction"), Regression, neighborhood, First6months_2022),
#    dplyr::select(finaltest_nhood, starts_with("Eviction"),First6months_2022, Regression, neighborhood))

#st_drop_geometry(bothRegressions) %>%
#  gather(Variable, Value, -Regression, -neighborhood) %>%
# filter(Variable == "Eviction.AbsError" | Variable == "Eviction.APE") %>%
#  group_by(Regression, Variable) %>%
#    summarize(meanValue = mean(Value, na.rm = T)) %>%
#    spread(Variable, meanValue) %>%
#    kable() %>% kable_styling()

fiveRegressions <- 
  rbind(
    dplyr::select(finaltest, starts_with("Eviction"), Regression, neighborhood, First6months_2022),
    dplyr::select(finaltest_category, starts_with("Eviction"),First6months_2022, Regression, neighborhood),
    dplyr::select(finaltest_spatial, starts_with("Eviction"),First6months_2022, Regression, neighborhood),
    dplyr::select(finaltest_time, starts_with("Eviction"),First6months_2022, Regression, neighborhood),    
    dplyr::select(finaltest_nhood, starts_with("Eviction"),First6months_2022, Regression, neighborhood))

st_drop_geometry(fiveRegressions) %>%
  gather(Variable, Value, -Regression, -neighborhood) %>%
  filter(Variable == "Eviction.AbsError" | Variable == "Eviction.APE") %>%
  group_by(Regression, Variable) %>%
  summarize(meanValue = mean(Value, na.rm = T)) %>%
  spread(Variable, meanValue) %>%
  kable(col.names = c("Regression", "Absolute Error", "APE"), digits = 2) %>%
  kable_styling()

library(kableExtra)

my_table <- st_drop_geometry(fiveRegressions) %>%
  gather(Variable, Value, -Regression, -neighborhood) %>%
  filter(Variable == "Eviction.AbsError" | Variable == "Eviction.APE") %>%
  group_by(Regression, Variable) %>%
  summarize(meanValue = mean(Value, na.rm = T)) %>%
  spread(Variable, meanValue) %>%
  kable(col.names = c("Regression", "Absolute Error", "APE"), digits = 2) %>%
  kable_styling()

save_kable(my_table, "/home/user/Documents/Chris/MUSA-Practicum/my_table.html")

B<-fiveRegressions %>%
  dplyr::select(Eviction.Predict, First6months_2022, Regression) %>%
    ggplot(aes(First6months_2022, Eviction.Predict)) +
  geom_point(size = 0.5,alpha=0.5) +
  stat_smooth(aes(First6months_2022, First6months_2022), 
             method = "lm", se = FALSE, size = 1, colour="#524d60") + 
  stat_smooth(aes(Eviction.Predict, First6months_2022), 
              method = "lm", se = FALSE, size = 1, colour="#a57569") +
  facet_wrap(~Regression) +
  labs(title="Predicted evictions as a function of observed evictions\n",
       subtitle="purple line represents a perfect prediction; peach line represents prediction\n") +
  theme_ipsum()+
  theme(plot.title =  element_text(size = 15, hjust = 0.5),
        axis.title.x = element_text(size = 14, hjust = 0.5),
        axis.title.y = element_text(size = 14, hjust = 0.5),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10)) +
  labs(x = "\nEviction Prediction ", y = "Evictions in 2022 (Jan-June)\n") 
  
  ggsave("/home/user/Documents/Chris/MUSA-Practicum/Predicted-observed2.png",B, device = 'png',
       width = 7.81, height = 6.01, units = "in",dpi=600)
  
  
B<-fiveRegressions %>%
  dplyr::select(Eviction.Predict, First6months_2022, Regression) %>%
  ggplot(aes(First6months_2022, Eviction.Predict)) +
  geom_point(size = 0.5,alpha=0.5) +
  stat_smooth(aes(First6months_2022, First6months_2022), 
             method = "lm", se = FALSE, size = 1, colour="#524d60") + 
  stat_smooth(aes(Eviction.Predict, First6months_2022), 
              method = "lm", se = FALSE, size = 1, colour="#a57569") +
  facet_wrap(~Regression) +
  labs(title="Predicted evictions as a function of observed evictions",
       subtitle="purple line represents a perfect prediction; peach line represents prediction") +
  theme_ipsum()+
  theme(plot.background = element_rect(fill = "white"),
        plot.title =  element_text(size = 15, hjust = 0.5),
        axis.title.x = element_text(size = 14, hjust = 0.5),
        axis.title.y = element_text(size = 14, hjust = 0.5),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10)) +
  labs(x = "\nEviction Prediction ", y = "Evictions in 2022 (Jan-June)\n")

## Cross-Validation

CV_model <- zeroinfl(First6months_2022 ~ market_value + total_livable_area + violation_count + rent_burden_total.2021 + taxable_land 
                 + category_code + ownertype + OwnerType + activeSubsidies_binary 
                 + evictions_properties
                  +`202112`+ `202111` + `202110` + `202109` + `202106` + `202012`+ `201912`,
                 data = finaltrain, family = "poisson")

cv <- cv.glm(finaltrain, CV_model, K = 10)


zip_cv_results <- data.frame(
  fold = rep(1:10, 2),
  mse = c(cv$crit[1:10,"deviance"]/cv$size[1:10], cv$crit[11:20,"deviance"]/cv$size[11:20]),
  part = rep(c("count", "zero"), each = 10)
)
ggplot(zip_cv_results, aes(x = fold, y = mse, color = part)) +
  geom_point() +
  geom_line() +
  xlab("Number of folds") +
  ylab("Mean squared error") +
  scale_color_manual(values = c("count" = "blue", "zero" = "red")

 
###LOGO-CV

install.packages("groupdata2")
library(groupdata2)
install.packages("MASS")
library(MASS)

groups <- finaltrain$name
zip_model <- zeroinfl(First6months_2022 ~market_value + total_livable_area + violation_count + rent_burden_total.2021 + taxable_land + TotalUnits
                 + category_code + ownertype + OwnerType + activeSubsidies_bin 
                 + evictions_properties
                  +`202112`+ `202111` + `202110` + `202109` + `202106` + `202012`+ `201912`,
                 data = finaltrain)

logo_indices <- groupKFold(groups, k = length(unique(groups)))

library(purrr)
cv_results <- map_dfr(logo_indices, function(indices) {
  train_data <- data[-indices, ]
  test_data <- data[indices, ]
  train_model <- zeroinfl(First6months_2022 ~market_value + total_livable_area + violation_count + rent_burden_total.2021 + taxable_land + TotalUnits
                 + category_code + ownertype + OwnerType + activeSubsidies_bin 
                 + evictions_properties
                  +`202112`+ `202111` + `202110` + `202109` + `202106` + `202012`+ `201912`,
                 data = finaltrain)
  test_pred <- predict(train_model, newdata = test_data, type = "response")
  mse <- mean((test_data$count - test_pred)^2)
  data.frame(mse = mse, fold = "logo")
})