Return to MUSA 801 Projects Page

1 Introduction

This project was produced as part of the University of Pennsylvania’s Master of Urban Spatial Analytics Spring 2020 Practicum (MUSA 801) taught by Ken Steif, Michael Fichman, and Matt Harris. We would like to thank the City of Louisville for providing feedback and data.

1.1 How to use this document

The following document presents an analysis of affordable housing in Louisville, KY and introduces EquiLiving, an interactive tool for the Louisville Affordable Housing Trust Fund (LAHTF) designed to help proactively prioritize the allocation of affordable housing in neighborhoods with the greatest opportunity, before it becomes more expensive to develop in those parts of the city. EquiLiving provides a score for each housing market area in the city, indicating that neighborhood's opportunity for affordable housing.

This document begins with a case study predicting home sales prices in Louisville, KY, and is followed by a series of appendices that discuss data wrangling, data visualization, data sources, feature engineering, and model results. Navigate through the document either by using the panel at the left, or by clicking the hyperlinks throughout the document.

1.2 Abstract

Home prices have increased over the last decade throughout Louisville, making it more difficult for low-income families to live in neighborhoods of opportunity: areas with good school systems, low crime rates, and low poverty. The Louisville Affordable Housing Trust Fund has been working on this problem by developing affordable housing that allows low-income families to live in these types of neighborhoods. For the Trust Fund to provide the greatest amount of opportunity for low-income households, it needs to build affordable housing in inexpensive neighborhoods today, before they gentrify into neighborhoods of opportunity tomorrow. By analyzing Louisville's current housing market to predict future home prices, we designed a predictive planning tool that uses future trends to help determine where to build affordable housing.

1.3 Motivation

Louisville, like many American cities, has experienced high levels of development pressure over the years as demand for city living has increased. Single-family home prices have increased by almost 6% (adjusted for inflation) since 2008, and new construction permits have increased 75% (from 617 to 1,082) since 2010. This development demand may be making Louisville less affordable for all residents, particularly low-income communities and communities of color: Louisville has a shortage of roughly 30,000 affordable and available housing units for extremely low-income households (at or below 30% of Area Median Income, or AMI) (Louisville Housing Needs Assessment, 2019). The goal is to ensure that, despite recent neighborhood change, Louisville remains a place where all can thrive regardless of race or class.

Among a host of equity-related interventions currently underway in Louisville, the City has created a Louisville Affordable Housing Trust Fund (AHTF) to provide bridge financing for the development of affordable housing. In this project we explore the neighborhood change process in Louisville using parcel and neighborhood level administrative data. We then use this information to develop a price-index that can forecast home sale prices across space and time. Such forecasts may enable planners to better allocate limited affordable housing resources in the communities that will experience major changes in the socioeconomic makeup of neighborhoods in the future.

According to Opportunity Insights, a Harvard University research team, many low-income households are segregated into areas with lower economic opportunity, decreasing their chances of upward mobility (Opportunity Insights). "Every year a child spends growing up in an area with better outcomes causes the child to have better outcomes in adulthood" (Opportunity Insights). Growing up in a neighborhood of opportunity, such as a higher-income neighborhood with lower crime rates and better schools, therefore gives a low-income child a better chance of earning a higher income than his or her parents. Developing affordable housing in Louisville's neighborhoods of opportunity can help low-income families become socially mobile.

1.4 What is the Louisville Affordable Housing Trust Fund (LAHTF)?

The Housing Trust Fund, established in 2008, invests in affordable housing for residents who do not earn enough income to afford market-rate housing. Since 2014, the AHTF has allocated over $20 million to fund almost 1,600 units throughout Louisville to address the affordable housing shortage. It primarily provides grants and loans for affordable housing projects focused on new construction, rehabilitation, rental assistance, and more, with priority given to projects integrating low-income housing with market-rate housing (2020 Funding Application Guidelines and Project/Program Policies).

1.5 AHTF Allocation Process

There are six primary steps in the AHTF allocation process:
1. Housing Trust Fund Assessment: an annual assessment is completed to determine the demand for affordable housing in Louisville. This report outlines the City’s current housing market conditions to identify the types of housing projects the City needs.
2. Funding: Based on the recommendations from the Housing Needs Assessment Report, the City of Louisville provides an annual budget to the LAHTF.
3. Budget & Allocation: Based on the funding received, an annual budget is created with specific requirements for how the funds will be allocated. For example, the Louisville Metro Government allocated $5 million in funds, of which $4.75 million is used for development activities, $250,000 for program administration, and $25,000 for supportive housing services.
4. Request for Proposals (RFP): Each year the LAHTF issues an RFP for projects that meet the City's housing needs. These projects are primarily focused on bridge financing.
5. Review Process: Proposals are reviewed by the LAHTF Program Committee, consisting of LAHTF Board Members, Kentucky AHTF experts, the Louisville Metro Housing Authority, and Louisville Metro Government. A variety of factors, including historical projects and current affordability trends, are considered, but future changes in home prices are not.
6. Investment: The funds are provided to fund development of affordable housing projects.

Improvement to the Allocation Process:
To improve this process, EquiLiving, a planning tool, will be developed to provide home price predictions, helping the Trust Fund identify future changes in neighborhood home prices and better allocate funds for investment in affordable housing.

1.6 Modeling Strategy

To predict neighborhoods of opportunity in Louisville, we forecast the change in home sale prices throughout Louisville using the repeat sales method, a technique that calculates changes in the sale price of the same piece of real estate over a given timeframe. This relatively simple approach can estimate shifts in home prices over periods ranging from months to years. A major difficulty in generating home price indices is the heterogeneity of real estate units in terms of their location and features. Moreover, the composition of the housing stock, and of the homes sold, changes over time (months or years), adding to the difficulty.

Hedonic variables such as the built area of the house, the number of bedrooms, or other house characteristics could be used to control for differences in the real estate features of each home. However, models with hedonic variables face issues when data availability or accuracy is a problem. The repeat sales model addresses these problems with a fixed effect for each house in the dataset, which stands in for the hedonic variables. Perhaps the most well-known housing index that uses the repeat-sales method is the Case-Shiller National Home Price Index, which measures changes in the value of the U.S. residential housing market by tracking the sale and resale prices of single-family homes that, as the name suggests, have transacted more than once.
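The repeat-sales regression described above can be sketched as follows. This is a minimal illustration on toy data, not the project's actual code; the column names `parcel_id`, `quarter`, and `price`, and the use of the fixest package, are assumptions.

```r
# Sketch of a repeat-sales index regression on toy data.
library(fixest)

set.seed(801)
sales <- data.frame(
  parcel_id = rep(1:50, each = 2),                      # each parcel sells twice
  quarter   = factor(sample(1:8, 100, replace = TRUE)), # sale quarter
  price     = exp(log(150000) + rnorm(100, sd = 0.1))   # toy sale prices
)

# Regress log price on quarter dummies, absorbing a fixed effect per
# parcel; the fixed effect stands in for the parcel's unobserved
# hedonic characteristics.
fit <- feols(log(price) ~ quarter | parcel_id, data = sales)

# Exponentiated coefficients trace the index relative to the baseline quarter.
index <- exp(coef(fit))
```

Plotting `index` against the quarter gives the familiar repeat-sales index curve, with each point interpreted relative to the baseline quarter.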

Price Index Example

2 Data

2.1 Sources

The table below lists the data used in our analysis, including home sales data from the Louisville Property Value Assessment office, permit and land use data from Louisville's Open Data website, socioeconomic data from the US Census Bureau, and project data from the LAHTF.

| Dataset | Source | Years |
|---|---|---|
| Historic Home Sales | Louisville Property Value Assessment | 2000-2019 |
| Historic Permits | Open Data Louisville | 2003-2019 |
| American Community Survey, 5-Year Estimates | US Census Bureau | 2000-2018 |
| LAHTF Projects | Louisville Affordable Housing Trust Fund | 2014-2019 |
| Land Use | Open Data Louisville | NA |

The data was also "wrangled" before being explored in the following section. This process included various transformations of the data to optimize the predictive ability of each variable. For details on this procedure, please see the Appendix.

2.2 Variables

| Dataset | Variable | Description |
|---|---|---|
| Census | TotalPop | Total population |
| Census | change_pct_owner_00_10 | Change in homeownership rate from 2000 to 2010 |
| Census | gini_index | Gini coefficient |
| Census | medHomePrice | Median home price |
| Census | medInc/MedHHInc | Median household income |
| Census | medGrossRent/MedRent | Median rent price |
| Census | pct_belowPov | Percent of families below the poverty level |
| Census | pct_black | Percentage of population that is Black |
| Census | pct_white | Percentage of population that is white |
| Census | pct_owner | Percentage of owner-occupied housing |
| Census | pct_SF | Percentage of single-family homes |
| Census | pct_MF | Percentage of multi-family homes |
| Census | pctPoverty | Percentage of population below the poverty line |
| Census | pctBachelors | Percentage of population with at least a bachelor's degree |
| Land Use | Per_Vacant_Land | Percent of vacant land |
| Land Use | tot_Vacant_Parcels | Total number of vacant parcels |
| Home Sales | Price_2019 | Home sale price adjusted for inflation |
| Home Sales | Price | Home sale price |
| NA | HMA/H_Mkt_Area | Housing market area |

2.3 Unit of Geometry

For this analysis, we use housing market areas (HMAs) as our geographic unit of analysis to predict home sale values. Louisville's 21 housing market areas, defined by Louisville Metro Government based on residential neighborhoods, employment centers, and landmarks, are composed of 2010 census tracts, as shown in the map below (Louisville Housing Needs Assessment, 2019). Since the socioeconomic data is only available at the census tract level, part of our exploratory data analysis is at the census tract level; however, we show and analyze spatial trends at the HMA level when possible, since our model predictions are for each HMA. As shown in the map below, the HMAs located inside Interstate 264 are considered the urban core, an urban setting comprising 8 HMAs (Northwest Core, West Core, Southwest Core, Downtown, University, Northeast Core, East Core, Southeast Core), while the non-urban core, a more suburban environment, consists of 13 HMAs. Throughout this study, we compare trends inside and outside the urban core.

3 Exploratory Data Analysis

The goal of our exploratory analysis is twofold: first, to explore where the LAHTF has invested in projects and to reverse-engineer the allocation process for HTF investments; second, to explore the process of neighborhood change in Louisville, focusing on development and socioeconomic trends across space.

Through our exploratory analysis, we answer the following questions:
  • Where has Louisville’s Affordable Housing Trust Fund allocated funds?
  • What is the relationship between key variables and neighborhoods receiving LAHTF projects?
  • What are the time and space trends of home prices across Louisville?
  • What are the time and space trends of permits across Louisville?
  • What are the relationships between socioeconomic characteristics and home prices across Louisville?

3.1. Where has LAHTF allocated funds?

Between 2014 and 2019, the LAHTF funded $20.3 million across 184 projects, resulting in ~1,600 units. Since 2016, the total amount invested in AHTF projects has increased each year, resulting in more affordable units being developed. The chart below shows an annual summary of the LAHTF's investments.

| Year | Projects | Units | Dollars ($) |
|---|---|---|---|
| 2014 | 12 | 15 | 282,000 |
| 2015 | 9 | 9 | 152,000 |
| 2016 | 7 | 7 | 197,000 |
| 2017 | 37 | 331 | 2,390,000 |
| 2018 | 58 | 734 | 6,347,000 |
| 2019 | 61 | 477 | 9,976,000 |

The maps below show the spatial distribution of AHTF projects, units developed, and dollars invested across Louisville's housing market areas. Although a majority of the AHTF projects occurred inside the urban core, along its northwest portion, most of these projects resulted in single-family homes, while a greater amount of dollars was invested outside the urban core in multi-family units. A couple of neighborhoods in the urban core have multi-family units, but the majority of multi-family units are outside it.

Figure 3.1: Allocation of AHTF Investments by Housing Market Area (2014-2019)

3.2 What is the relationship between key variables and neighborhoods receiving LAHTF projects?

Louisville’s Affordable Housing Trust Fund allocation process prioritizes projects in neighborhoods based on the following criteria:

  • Declining homeownership
  • Blocks with multiple vacant or abandoned buildings or lots to develop complete blocks
  • Single-family homes in the urban core (typically the urban area with lower income levels)
  • Multi-family homes outside the urban core (typically the suburbs with higher income levels)

Comparing neighborhoods with and without AHTF projects provides a strong basis for understanding the relationships between AHTF investment and our exploratory variables, allowing us to see which variables are higher in neighborhoods with AHTF projects.

The 2010 mean values of different variables are plotted in two groups (neighborhoods with and without affordable housing projects) to determine which types of neighborhoods received LAHTF investments from 2014-2019. These plots provide supporting evidence for the allocation criteria that the LAHTF considers when prioritizing investments. As shown in the charts below, from 2014-19 the AHTF allocated funds to projects in neighborhoods with a lower percent change in homeownership, a higher percentage of vacant land, lower median household income, higher income inequality, and a higher percentage of Black residents.

Figure 3.2: Feature associations with the likelihood of a funded affordable housing project through the LAHTF

3.5 What are the relationships between a neighborhood's socioeconomic characteristics and home sale prices?

Home sale prices vary across space based on the socioeconomic makeup of a neighborhood; typically, higher-income neighborhoods have higher home prices. In Louisville, the same story holds true: there is a strong positive relationship between sale prices and both median household income and median rent, as shown below.

Based on past AHTF investments, a majority of these projects are in areas with the lowest average home prices in Louisville: areas with high poverty rates, low percentages of white residents and bachelor's degrees, and low rents and household incomes. This is illustrated by the green dots in the charts below, which represent census tracts with at least one AHTF project.

Figure 3.6: Mean Home Sale Price (2018) as a function of select socioeconomic characteristics

Figure 3.7: Socioeconomic Characteristics Across Space

4 Model Development & Validation

In our project, we used the percent increase in home sale prices as an indicator to identify the areas most likely to gentrify: the housing market areas with the highest percentage increase in home sale prices. To this end, our model was developed using the repeat sales method, with monthly periods, to predict future home sale prices for housing market areas throughout Louisville (see the Modeling Strategy section for more details). This method is similar to the Case-Shiller National Home Price Index, tracking the sale and resale prices of single-family homes that have transacted more than once. In this section, we discuss the data features used in the model, the development of the price index for Louisville, the validation of the model results, and the home sale price predictions for 2019-2023.

4.1 Model Building

First, we created date features such as year, quarter, and months_since_00 (the number of months elapsed since January 1, 2000) using the lubridate package in R. The data used for building the model ranged from January 2000 to December 2018; data for 2019 was incomplete and thus excluded. Additionally, sale prices less than $5,000 or greater than $2,500,000 were considered outliers and removed.
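The date features above can be sketched with lubridate as follows; the column name `sale_date` is an assumption for illustration.

```r
library(lubridate)

# Toy sales table with an assumed sale_date column
sales <- data.frame(sale_date = as.Date(c("2000-03-15", "2015-07-01")))

sales$year    <- year(sales$sale_date)
sales$quarter <- quarter(sales$sale_date)

# Whole months elapsed since January 1, 2000
sales$months_since_00 <- interval(ymd("2000-01-01"), sales$sale_date) %/% months(1)

sales$months_since_00  # 2 and 186
```

The `interval(...) %/% months(1)` idiom counts whole calendar months, which is robust to varying month lengths.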

Repeat sales indices are typically calculated using a linear regression. Here, sale prices were modeled as a function of year quarters from 2000-2018 and house fixed effects. The plot below shows the regression coefficients plotted against time (in year quarters). Each coefficient is an average interpreted relative to the baseline quarter (January-March 2000), and plotting these coefficients is how a repeat-sales index is typically presented.

Figure 4.1: Jefferson County Price Index: 2000 - 2018 (95% Confidence Intervals Included)

The indices in the plot above show the repeat sales model used for inference, but our interest is in using it to forecast. Next, we fit another regression which hypothesizes that sale price is a function of a neighborhood fixed effect, a house fixed effect, and controls for time (months, quarters, or years).

The time fixed effect controls for factors that vary over the years (2000-2018), and consequently accounts for inflation in the dollar value. The neighborhood fixed effect controls for factors that vary across housing market areas, such as the location of amenities or disamenities. The house fixed effect accounts for factors that differ from one real estate unit to another, such as house characteristics, and for whether a house was sold more than once within the study period. Sale price data from January 2000 was used as the baseline for calculating price indices. For this regression, we first split the data into a training set and a test set: the training set was a random sample of 75% of the data points, and the remaining 25% was held out as the test set for validation.
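A hedged sketch of this split and regression, with toy data standing in for the real sales table (the column names `parcel_id`, `H_Mkt_Area`, `months_since_00`, and `price`, and the use of fixest, are assumptions):

```r
library(fixest)

set.seed(801)
n <- 800
# Toy repeat-sales data: each parcel transacts 4 times in one market area
sales <- data.frame(
  parcel_id       = rep(1:200, each = 4),
  H_Mkt_Area      = rep(sample(LETTERS[1:5], 200, replace = TRUE), each = 4),
  months_since_00 = sample(0:227, n, replace = TRUE)
)
sales$price <- 100000 * exp(0.002 * sales$months_since_00 + rnorm(n, sd = 0.1))

# 75% random training sample; the held-out 25% is used for validation
train_idx <- sample(n, floor(0.75 * n))
train <- sales[train_idx, ]
test  <- sales[-train_idx, ]

# Price as a function of time, absorbing neighborhood and house fixed effects
fit  <- feols(log(price) ~ months_since_00 | H_Mkt_Area + parcel_id, data = train)
pred <- exp(predict(fit, newdata = test))
```

Note that parcels seen only in the test set produce NA predictions, since their fixed effects were never estimated; in practice repeat-sales data keeps most parcels in both sets.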

4.2 Validation

MAPE by Year

Mean Absolute Percent Error (MAPE) measures the percentage by which predicted values deviate from observed values, indicating the accuracy of the forecast. We calculated the absolute percent error for the test set, then averaged it for each year from 2000 to 2018, as shown in the plot below. Values range between 0.4 and 0.7 over the study period, with the lowest MAPE observed around the 2008 recession. The errors are high because the model only performs time/space averaging: the only controls for variation in house features are the dummy variables created by the house fixed effect. However, this level of smoothing is appropriate given the use case.
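The MAPE calculation itself is simple; here is a minimal sketch on hypothetical observed and predicted prices:

```r
# MAPE: the mean of |observed - predicted| / observed
mape <- function(obs, pred) mean(abs(obs - pred) / obs, na.rm = TRUE)

# Hypothetical observed and predicted sale prices
obs  <- c(100000, 200000, 150000)
pred <- c(110000, 180000, 150000)

mape(obs, pred)  # (0.10 + 0.10 + 0.00) / 3, about 0.067
```

Grouping the absolute percent errors by sale year (or by housing market area) before averaging yields the plots shown in this section.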

Figure 4.2: MAPE by Year

MAPE by Housing Market Areas

MAPE was also visualized by housing market area, as shown in the plot below. Areas with the highest MAPE are colored light blue, and areas with the lowest MAPE are colored dark blue. Most housing market areas have similar shades of blue except the West Core, Northwest Core, Downtown, and the three outer suburban market areas in the eastern part of Louisville, which are lighter blue. This indicates that the model works well for, and generalizes across, the darker blue housing market areas, since it gives them similar values of MAPE.

Figure 4.3: MAPE by Housing Market Area

4.3 Predictions

Predicting home prices from 2019 - 2023:

With a model trained on space/time real estate trends from 2000 to 2018, we used it in a predictive context to forecast sale prices between 2019 and 2023. To derive a sample of houses for which sale prices could be forecast, we randomly sampled 100,000 homes sold since 2013 (assuming ~20,000 homes are sold each year, about 100,000 homes would be sold over 5 years). Sales since 2013 were selected to represent modern parcels and homes. Each of the 100,000 sampled homes was assigned a random sale month from 2019-2023; this random sampling reduces selection bias. The predicted prices were then used to identify the sale price change from 2018 to 2023.
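The sampling step can be sketched as follows, with a toy table standing in for the real homes-sold-since-2013 data (the name `recent_sales` is an assumption):

```r
set.seed(801)

# Toy stand-in for homes sold since 2013 (the real table is much larger)
recent_sales <- data.frame(parcel_id = 1:5000)

# Sample 100,000 homes with replacement ...
future <- recent_sales[sample(nrow(recent_sales), 100000, replace = TRUE), , drop = FALSE]

# ... and assign each a random sale month between January 2019 and
# December 2023 (in months_since_00 terms: Jan 2019 = 228, Dec 2023 = 287)
future$months_since_00 <- sample(228:287, 100000, replace = TRUE)
```

Feeding `future` to the fitted model then produces the forecast prices for 2019-2023.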

As shown in the plot below, the mean observed sale price gradually increased from 2000-2018, while the model's mean predicted sale prices, highlighted in blue, are expected to increase at a faster rate from 2019-2023.

Figure 4.4: Average Home Prices per Month (2000-2023)

When comparing the mean observed and predicted sale prices across urban core and non-urban core HMAs, as shown below, the trends are similar in these different markets, but the rate of increase and the month-to-month variability differ. For example, mean predicted home prices in the non-urban core are expected to increase faster than in the urban core, and prices vary less from month to month in the non-urban core.

Figure 4.5: Average Home Prices per Month (2000-2023): Urban vs. Non-Urban Core

The plot below shows the mean observed sale prices (2000-2018) and mean predicted sale prices (2019-2023) for each month, for each housing market area.

Figure 4.6: Jefferson County Average Home Prices per Month (2000 - 2023) for Each Housing Market Area

Finally, we calculated the mean percent change in home prices from 2018-2023 for each housing market area to determine which areas are expected to offer the greatest opportunity: the neighborhoods with the lowest home prices today but the greatest increase in home prices in the future. The darker blue areas in the map below, with higher percent changes in home sale prices, are more likely to offer opportunity for lower-income households than areas with lower or negative percentage changes.

Figure 4.7: Percent Change in Home Sale Price by HMA (2018-2023)

Predictions Near AHTF Investments:

So far, the LAHTF has invested $20 million in ~1,600 units in Louisville over the last six years (2014-2019). When comparing home price predictions to affordable housing, the goal is to invest in areas that will have the highest chance of opportunity. To determine whether these units are being developed in the neighborhoods with the greatest opportunity, we calculated the percent change in home sale price (2018-2023) of properties within 1/2 mile of each LAHTF project and compared it to the areas containing a majority of the affordable housing units financed by the LAHTF. This prediction tool can help the LAHTF not only evaluate past investments but also determine which high-opportunity areas to invest in moving forward.
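The half-mile aggregation can be sketched with sf on toy coordinates. The choice of EPSG:2246 (Kentucky North, US survey feet) is an assumption; any projected CRS in feet works the same way.

```r
library(sf)

# One toy project and two toy homes in a feet-based projected CRS
projects <- st_as_sf(data.frame(x = 0, y = 0), coords = c("x", "y"), crs = 2246)
homes <- st_as_sf(
  data.frame(x = c(1000, 5000), y = c(0, 0), pct_change = c(0.20, -0.10)),
  coords = c("x", "y"), crs = 2246
)

# Buffer each project by 1/2 mile (2,640 feet) and keep homes inside it
buffers <- st_buffer(projects, dist = 2640)
nearby  <- homes[st_intersects(homes, buffers, sparse = FALSE), ]

mean(nearby$pct_change)  # only the home 1,000 ft away falls inside, so 0.20
```

Averaging `pct_change` within each project's buffer gives the per-project figure mapped in this section.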

Figure 4.8:

Use Cases: Evaluating Past AHTF Investments

As shown in the map above, a majority of affordable housing units financed by the LAHTF are in the North Core, Downtown, and Central Preston. However, the North Core is expected to have the least amount of opportunity compared to other HMAs, such as the West Core and Central Preston, which have greater opportunity for affordable housing. Listed below are use cases for three different HMAs (Northwest Core, West Core, and Central Preston), where past AHTF investments are analyzed based on the home sale predictions.

Northwest Core

The LAHTF has allocated almost 25% (or ~$5 million) of its investments in the Northwest Core, located on the west side of the urban core, resulting in over 250 units. However, these investments are expected to provide lower opportunity for residents over the next five years: the average home price surrounding most of the LAHTF projects in the Northwest Core is expected to decline by over 25% over the next five years, as indicated by the yellow dots in the map below. This means that housing in the near future will most likely cost less than or the same as it does today, likely keeping the neighborhood in a similar socioeconomic state.

West Core

The LAHTF has allocated ~5% (or ~$1 million) of its investments in the West Core, located directly south of the Northwest Core, resulting in only ~50 units. A majority of these investments are expected to provide greater opportunity for residents over the next five years: the average home price surrounding most of the LAHTF projects in the West Core is expected to increase by over 15% over the next five years, as indicated by the dark blue dots in the map below. This means that housing in the near future will most likely cost more than it does today, likely improving the area's socioeconomic state. Considering that the average home sale price is ~$150,000 and expected to increase, it is more expensive than other areas for purchasing affordable housing, but there is still potential for this area to provide opportunity for lower-income households in the future.

Central Preston

The LAHTF has allocated ~10% (or ~$2.5 million) of its investments in Central Preston, located in the center of the non-urban core, resulting in ~400 units. Similar to the West Core, the average home price surrounding a majority of these multi-family investments in Central Preston is expected to increase by over 15% over the next five years, as indicated by the dark blue dots in the map below. This means that housing in the near future will most likely cost more than it does today, likely improving the area's socioeconomic state. Considering that the average home sale price is ~$100,000 and expected to increase, it is an area where affordable housing can be purchased at a lower price today while creating more opportunity in the future.


5 Interactive Map

We designed a web-based interactive map to visualize the prediction results and to suggest potential locations where the cost of building affordable housing projects is low today, while the benefits will be greatest tomorrow. The application assists stakeholders in affordable housing development in interacting with and visualizing the housing market predictions. The front page of the application is shown below.

Figure 5.1: Front Page

5.1 Application Logic

The website, EquiLiving, visualizes the home sale price predictions for neighborhoods in Louisville, KY. By predicting the future housing market, EquiLiving will help the Louisville Metro Affordable Housing Trust Fund (LAHTF) committees make better decisions about allocating funding to the neighborhoods that need it most.

5.2 How to Use the App?

  • The home page visualizes the AHTF Potential Score, developed by EquiLiving, which evaluates each neighborhood's potential for new AHTF projects. On a scale from 0 to 10, a higher score means that the cost of development is low now (a low initial home sale price) and the benefit of an AHTF project will be great in the future (a dramatic increase in home sale price).

Figure 5.2: Homepage
  • In the Neighborhoods Explorer page, you can map the initial home sale price in 2018, the change in home sale price from 2018 to 2023, and the AHTF Potential Score of each neighborhood in Louisville. Click a specific neighborhood you are interested in to see its initial 2018 sale price, change in home sale price from 2018 to 2023, AHTF Potential Score, and home sale price trend since 2000.

Figure 5.3: Neighborhoods Explorer page

  • In the AHTF Projects Explorer page, you can look at all the existing AHTF projects in Louisville. Click a specific project to explore its address, total number of units, year built, total dollars invested, and average home sale price trend within 1/2 mile.

Figure 5.4: AHTF Projects Explorer page

5.3 Example Use Case

To better illustrate how to use the EquiLiving website in practice, here is an example use case.

Mary is the executive director of the Louisville Metro Affordable Housing Trust Fund (LAHTF). With $1 million of funding available this year, Mary and her team are looking for neighborhoods to invest in. For the LAHTF, the most important thing is to address the affordable housing shortage for individuals whose wages are not enough to maintain a stable place to live.

In the past, Mary would consider neighborhoods with a median household income lower than 80% of the city's Area Median Income. However, the housing situation keeps changing, especially for residents who live in areas that will see a significant increase in home sale prices in the future; such residents are highly likely to be priced out. Nevertheless, it is difficult to know which neighborhoods will face dramatic home price increases.

Now, with EquiLiving, Mary and her team can locate neighborhoods to invest in. Opening the website, they see a map of the AHTF Potential Score for the city. The potential score, developed by EquiLiving and ranging from 0 to 10, describes a neighborhood's potential for new affordable housing projects. Neighborhoods with a score of 10 are the most suitable for such development: the cost to develop new projects is low, because current home sale prices are low, while the benefit will be great, because home sale prices are predicted to increase significantly in the future.
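The document does not spell out how the score is computed; one plausible construction, assuming each neighborhood's 2018 mean sale price and predicted 2018-2023 percent change are the only inputs, is to rescale both to [0, 1] and sum them:

```r
# Rescale a vector to the [0, 1] range
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical HMA inputs: current price and predicted percent change
hma <- data.frame(
  price_2018 = c(90000, 150000, 250000),
  pct_change = c(0.31, 0.15, -0.05)
)

# Low current price and high predicted growth both raise the 0-10 score
hma$score <- 10 * rescale01(rescale01(-hma$price_2018) + rescale01(hma$pct_change))
```

Under this sketch, the cheapest, fastest-growing neighborhood scores 10 and the most expensive, declining one scores 0.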

Figure 5.5: Homepage of App

Mary therefore decides to take a look at the neighborhood with a potential score of 10, Iroquois Park, where home sale prices are predicted to increase by 31%. By clicking on the neighborhood, Mary also sees the home sale price trend of Iroquois Park since 2000. Iroquois Park has seen continuous home sale price growth since 2015, and prices will continue growing according to the prediction. Moreover, because the average home sale price in Iroquois Park is relatively low, the cost to develop new affordable housing projects there is ideal: low cost, but great opportunity.

Figure 5.6: Iroquois Park Detail

However, Mary still wants to know whether an existing affordable housing project already serves Iroquois Park. Clicking on the AHTF Projects Explorer tab, Mary takes a look at the existing AHTF projects there.

Figure 5.7: Iroquois Park Projects

There are not many existing projects there, and most of them are located in the northern part of the neighborhood; the southern part has not been well served yet. How does the existing AHTF project in the southern part of Iroquois Park perform? Clicking on the project, the detail page shows that it is located at 317 Kenny Boulevard: a $9,000 project developed in 2017 funding just one unit. Moreover, the home sale price trend within half a mile of the project shows that the surrounding housing market is experiencing continuous home sale price growth. In conclusion, by using EquiLiving, Mary has found that the southern part of Iroquois Park can be a very suitable area to focus on.

Figure 5.8: Project Detail

6 Appendix

6.1 Code

Settings

Load Libraries
library(tidycensus)
library(tidyverse)
library(ggplot2)
library(viridis)
library(sf)
library(grid)
library(gridExtra)
library(rgdal)
library(NLP)
library(dplyr)
library(lubridate)
library(scales)
library(blscrapeR)
library(forcats)
library(reshape)
library(spdep)
library(lwgeom)
library(geosphere)
library(rasterImage)
library(png)
library(knitr)
library(kableExtra)
library(fixest)
Load functions
##### +Functions####
#Helper for labeling quintile maps: returns the quantile break values for a variable
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}
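A minimal sketch of how these helpers are typically combined in a quintile map. The `tracts18` object and its `MedHHInc` field come from the ACS load step later in this appendix, and `palette5` is defined under Palettes; the exact map shown here is illustrative, not one of the report's figures:

```r
#Hypothetical usage: shade census tracts by quintiles of median household income.
#q5() bins the variable into five classes; qBr() supplies readable break labels.
ggplot(tracts18) +
  geom_sf(aes(fill = q5(MedHHInc)), colour = NA) +
  scale_fill_manual(values = palette5,
                    labels = qBr(tracts18, "MedHHInc"),
                    name = "Median HH Income\n(quintile breaks)") +
  mapTheme()
```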
Load map themes and color palette
##### + Map Themes####
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "#2D3F50"),
    plot.title = element_text(size = 14,colour = "#2D3F50"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=1),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank()
  )
}
plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "#2D3F50"),
    plot.title = element_text(size = 14,colour = "#2D3F50", face = 'bold'),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=1),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("#2D3F50", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_rect(fill = "#2D3F50", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "#2D3F50", face = "italic"),
    legend.text = element_text(colour = "#2D3F50", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

#### +Palettes ####
palette5 <- c("#EAF46E","#83CE7B","#3AA083","#266E75","#2D3F50")
palette2 <- c("#EAF46E","#2D3F50")

#If only using two colors from the palette, use these two colors
#palette5[1] #Yellow
#palette5[5] #Blue

#If using all five colors to create a discrete color scheme
#Add this line of code after ggplot arguments
#scale_fill_manual(values = palette5)
#Or
#scale_color_manual(values = palette5)

#If creating a continuous color scheme showing low to high
#Add this line of code after ggplot arguments
#scale_fill_gradientn(colours = palette5, na.value = "grey50", guide = "colourbar", aesthetics = "fill")

#If creating a continuous color scheme showing a diverging trend, for example, -1 to 1
#Add this line of code after ggplot arguments
#scale_color_gradientn(colours = palette5, na.value = "grey50", guide = "colourbar", aesthetics = "colour")
Load data
#### +PVA Home Sale####
sales <- sf::st_read(dsn = "./pvadata.gdb",layer="remf_master")

#### +Urban Core Shapefile####
Urban_Core <- sf::st_read("./Urban_Core/UrbanCore.shp")%>%  
  st_transform(crs = 2246)

#### +Parcel ####
parcel_poly <- sf::st_read("./Parcel_Polygon/Parcel.shp")%>%  
  st_transform(crs = 2246) 
  
#### +Affordable Housing Trust Fund Projects####
Aff_House <- read.csv("./Affordable_Housing_Trust_Fund/Affordable_Housing_Geocoded_v3_3-30-2020.csv")

#### +Housing Market Areas####
hma.tracts <- 
  st_read("./Housing_Market_Areas/Housing_Market_Area_Census_Tracts_Shapefile/Housing_Mkt_Area_Census_Tracts.shp")  %>% 
  st_transform(102679)

#### +Census Data####
#Using census API
key <- "#INSERT YOUR OWN API KEY"
census_api_key(key, overwrite = TRUE, install = TRUE)
readRenviron("~/.Renviron")
Sys.getenv("CENSUS_API_KEY")

#Import 2018 5-Year ACS data to calculate the following variables:
#Median Household Income, Median Rent, Total Population, % Bachelors Degree, % Poverty, % White
tracts18 <- 
  get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E","B15001_050E",
                                             "B15001_009E","B19013_001E","B25058_001E",
                                             "B06012_002E", "B15001_017E", "B15001_025E", "B15001_033E",
                                             "B15001_041E", "B15001_058E","B15001_066E",
                                             "B15001_074E", "B15001_082E","B15001_001E"), 
          year = 2018, state=21, county=111, geometry=T) %>%
  st_transform(st_crs(hma.tracts)) %>%
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  dplyr::rename(TotalPop = B25026_001,
                NumberWhites = B02001_002,
                TotalBachelorsPop = B15001_001,
                MedHHInc = B19013_001,
                MedRent = B25058_001,
                TotalPoverty = B06012_002) %>%
  dplyr::mutate(TotalFemaleBachelors = B15001_050+B15001_058+B15001_066+B15001_074+B15001_082,
                TotalMaleBachelors = B15001_009 +B15001_017+B15001_025+B15001_033+B15001_041)%>%
  dplyr::mutate(percentWhite = ifelse(TotalPop > 0, NumberWhites / TotalPop,0),
                percentBachelors = ifelse(TotalPop > 0, ((TotalFemaleBachelors + TotalMaleBachelors) / TotalBachelorsPop),0),
                percentPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
                GEOID = as.numeric(GEOID)) %>%
  dplyr::select(-NumberWhites, -TotalFemaleBachelors, -TotalMaleBachelors, -TotalPoverty,-B15001_050,-B15001_058,-B15001_066,-B15001_074,-B15001_082,-B15001_009,-B15001_017,-B15001_025,-B15001_033,-B15001_041) 

#Import 2010 5-Year ACS Data
tracts10 <- 
  get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E","B15001_050E",
                                             "B15001_009E","B19013_001E","B25058_001E",
                                             "B06012_002E", "B15001_017E", "B15001_025E", "B15001_033E",
                                             "B15001_041E", "B15001_058E","B15001_066E",
                                             "B15001_074E", "B15001_082E","B15001_001E"), 
          year = 2010, state=21, county=111, geometry=T) %>%
  st_transform(st_crs(hma.tracts)) %>%
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  dplyr::rename(TotalPop = B25026_001,
                NumberWhites = B02001_002,
                TotalBachelorsPop = B15001_001,
                MedHHInc = B19013_001,
                MedRent = B25058_001,
                TotalPoverty = B06012_002) %>%
  dplyr::mutate(TotalFemaleBachelors = B15001_050+B15001_058+B15001_066+B15001_074+B15001_082,
                TotalMaleBachelors = B15001_009 +B15001_017+B15001_025+B15001_033+B15001_041)%>%
  dplyr::mutate(percentWhite = ifelse(TotalPop > 0, NumberWhites / TotalPop,0),
                percentBachelors = ifelse(TotalPop > 0, ((TotalFemaleBachelors + TotalMaleBachelors) / TotalBachelorsPop),0),
                percentPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
                GEOID = as.numeric(GEOID)) %>%
  dplyr::select(-NumberWhites, -TotalFemaleBachelors, -TotalMaleBachelors, -TotalPoverty,-B15001_050,-B15001_058,-B15001_066,-B15001_074,-B15001_082,-B15001_009,-B15001_017,-B15001_025,-B15001_033,-B15001_041)

Data Cleansing

Affordable Housing
####+AHTF####
#Convert datatype from factor to numeric
Aff_House <- 
  Aff_House %>%
  mutate(x = as.numeric(as.character(Latitude)),
         y  = as.numeric(as.character(Longitude)),
         TotalUnits = as.numeric(as.character(TotalUnits)),
         Amount = as.numeric(as.character(Amount)))

#Convert to sf
Aff_House <-
  Aff_House %>% 
  st_as_sf(coords = c("y", "x"), crs = 4326, agr = "constant") %>%
  st_transform(crs = 2246) 
Sales
#### +Inflation Adjustment for Home Sales ####
#Inflation CPI Data
#Inflation CPI data: inflation_adjust() returns average CPI by year from the BLS API
cpi <- inflation_adjust(2019)
#Change datatype from FACTOR to CHARACTER
sales$PROP_ADDRESS <- as.character(sales$PROP_ADDRESS)

cpi <- cpi%>%
  #Convert year from character to numeric
  mutate(year = as.numeric(year))%>%
  #Add inflation rate column: 255.65658 is the 2019 average CPI, so this converts each year's dollars into 2019 dollars
  mutate(Inflation_rate = (255.65658/avg_cpi))
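As a worked example of the adjustment (the 2000 CPI value below is approximate, for illustration only):

```r
#Illustrative only: a home that sold for $100,000 in 2000
avg_cpi_2000 <- 172.2                       #approximate 2000 average CPI
inflation_rate <- 255.65658 / avg_cpi_2000  #~1.48
100000 * inflation_rate                     #~$148,500 in 2019 dollars
```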

#Currently, each row contains the last 3 home sale records for each parcel.
#The goal is one home sale record per row, which results in up to 3 rows per parcel, each representing an individual sale.
sales_long <- sales %>%
  #convert the data from wide to long format where each row represents an individual sales record
  reshape(direction = "long",
          varying = list(c("TRANSFER_DATE1", "TRANSFER_DATE2", "TRANSFER_DATE3"), c("CONSIDERATION1", "CONSIDERATION2", "CONSIDERATION3")),
          v.names = c("SaleDate","Price"))%>%
  #Remove all records where there is no sales 
  filter(SaleDate != "1900-01-01")%>%
  #Create a Year column based on date the home sold
  mutate(Year = year(SaleDate),
  #Create MonthYear column based on date the home sold
         MonthYear = format(SaleDate, "%Y/%m"))%>%
  #Keep records sold after 1947, since CPI data is available starting in 1947
  filter(Year > 1947)%>%
  #Remove $0 sales
  filter(Price != 0)
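The wide-to-long reshape can be illustrated on a toy parcel (column names mirror the PVA data; values are invented):

```r
#Toy example of the wide-to-long reshape
toy <- data.frame(PARCELID = "A1",
                  TRANSFER_DATE1 = "2015-03-01", CONSIDERATION1 = 150000,
                  TRANSFER_DATE2 = "2010-06-15", CONSIDERATION2 = 120000,
                  TRANSFER_DATE3 = "1900-01-01", CONSIDERATION3 = 0)
reshape(toy, direction = "long",
        varying = list(c("TRANSFER_DATE1","TRANSFER_DATE2","TRANSFER_DATE3"),
                       c("CONSIDERATION1","CONSIDERATION2","CONSIDERATION3")),
        v.names = c("SaleDate","Price"))
#Yields three rows for parcel A1; the 1900-01-01 placeholder row is then filtered out
```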

#Join Sales and CPI data to adjust home sales prices for inflation 
sales_long <- sales_long%>% 
  left_join(cpi, by = c("Year" = "year"))%>%
  #Create a new price column that converts each sale to 2019 dollars using the inflation rate
  mutate(Price_2019 = Price * Inflation_rate)%>%
  dplyr::select(-avg_cpi,-adj_value,-base_year,-pct_increase)

#Merge sales_long with Parcels
Sales_long_n_Parcels <- left_join(sales_long, parcel, by = c("PARCELID" = "PARCELI"))

#Merge Tax data with sales
Sales_long_n_Parcels_2 <- left_join(sales_long, tax_data, by = c("PARCELID" = "PARCELID"))

#Remove NA - these are addresses that have only either a permit or a sale
Sales_long_n_Parcels <- Sales_long_n_Parcels[!is.na(Sales_long_n_Parcels$Shap_Ar), ]

Sales_long_n_Parcels.sf <- st_as_sf(Sales_long_n_Parcels)%>%
  st_transform(crs = 2246) 

Sales_long_n_Parcels.sf <- st_make_valid(Sales_long_n_Parcels.sf)

#Create centroid from polygon
Sales_long_n_Parcels.sf$geometry <- st_centroid(Sales_long_n_Parcels.sf$geometry)

Sales_long_n_Parcels.sf <- Sales_long_n_Parcels.sf %>%
  as.data.frame() %>%
  st_as_sf()


#Join Sales to Census Tracts
Sales_Tracts_n_Mkt_Area <- st_join(Sales_long_n_Parcels.sf,
                                   tract_2000_geo%>%
                                     st_transform(crs=2246),
                                   join=st_within,
                                   left = TRUE)

Sales_Tracts_n_Mkt_Area <- Sales_Tracts_n_Mkt_Area %>%
  # mutate(GEOID = as.character(GEOID))%>%
  left_join(Housing_Mkt_Area_Tracts, by = c("GEOID" = "GEOID_tract"))

#Rename sales data
dat <- Sales_Tracts_n_Mkt_Area

#Create date features using lubridate, including months_since_00, the number of months that have elapsed since Jan 1 2000.
dat <-
  as.data.frame(dat) %>%
  dplyr::select(PARCELID, Price, SaleDate, H_Mkt_Area, Price_2019, PROP_ADDRESS, geometry) %>%
  mutate(year = year(SaleDate),
         quarter = quarter(SaleDate),
         month = month(SaleDate),
         year_quarter = paste(year(SaleDate), quarter(SaleDate), sep="-"),
         year_month = paste(year(SaleDate), month(SaleDate), sep="-"),
         #Calculate months since Jan 2000 which is the number of months that has elapsed since Jan 1 2000.  
         months_since_00 = ymd(SaleDate) - ymd(20000101)) %>%
  mutate(months_since_00 = round(time_length(months_since_00, "months"))) %>%
  #Select sales from Jan 2000 through Dec 2018 - exclude 2019 since data for 2019 is incomplete
  filter(ymd(SaleDate) >= "2000-01-01" & year(SaleDate) < 2019,
         Price > 5000 & Price < 2500000) %>% #remove outliers
  na.omit()

#Create dat.sf by converting dat to simple features, projecting, and spatially joining so that each sale is assigned the Housing Market Area it is located in
dat.sf <-
  st_sf(dat) %>% st_transform(102679) %>%
  #Spatial join sales with Housing Market Areas
  st_join(hma.tracts) %>% 
  #Exclude Airport HMA since there are very few sales in this area
  filter(H_Mkt_Area != "Airport")

dat.df <- as.data.frame(dat.sf)

Modeling

Create price index for every quarter since 2000
#This estimates a large house-price fixed-effects regression: log Price as a function of year-by-quarter effects and house fixed effects. Note the latter go to the right of the pipe, |, which makes estimation fast.
index.quarter <- feols(log(Price) ~ year_quarter | as.factor(PARCELID), data=dat)

#Create a new data frame of the estimated quarter/year fixed effects coefficients, p-values and an upper and lower confidence interval.
#Note: these coefficients are averages, each interpreted relative to the baseline month (which is the first month of the time series). 
index.quarter.df <-
  data.frame(Quarter = rownames_to_column(index.quarter$coeftable, "Quarter")[,1],
             Price = index.quarter$coeftable[,1],
             P_Value = index.quarter$coeftable[,4],
             Confidence_2.5 = confint(index.quarter)[,1],
             Confidence_97.5 = confint(index.quarter)[,2]) %>%
  mutate(Quarter = substr(Quarter,13,18),
         Significant = ifelse(P_Value <= 0.05, "Yes","No"))
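Because the index is estimated in log points relative to the baseline quarter, a coefficient converts to an approximate percent difference via exp(). For example (coefficient value hypothetical):

```r
#A quarter coefficient of 0.25 log points relative to the baseline quarter
exp(0.25) - 1   #~0.284, i.e. prices roughly 28% above the baseline quarter
```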

#Create plot of coefficients.  This  plot is typically how repeat-sales indexes are visualized.
index.quarter.df %>%
  ggplot(aes(Quarter, Price, group=1)) +
  geom_ribbon(aes(ymin = Confidence_2.5, ymax = Confidence_97.5), fill = "grey70") +
  geom_point(aes(colour=Significant)) +
  scale_color_manual(values=c("#266E75","#266E75"))+
  geom_line() +
  labs(title="Jefferson County Price Index: 2000 - 2018", subtitle="95% Confidence Intervals included",
       y="Price Index") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  theme(legend.position = "none")
Calculate the sale price indices for 2000-2018.
#For reproducibility
set.seed(101)

#Create test and train datasets by splitting dat.df: Train (75%) / Test (25%)
sample <- sample.int(n = nrow(dat.df), size = floor(.75*nrow(dat.df)), replace = F)
dat.train <- dat.df[sample, ]
dat.test  <- dat.df[-sample, ]

#Instead of year-by-quarter fixed effects, this regression estimates log Price as a function of neighborhood (Housing Market Area), time (months since Jan 2000), and year and house fixed effects (via PARCELID).
index.monthlynum <- feols(log(Price) ~ H_Mkt_Area + months_since_00 | as.factor(year)  + as.factor(PARCELID)   , data=dat.train)

#Results of the fixed effect regression model.
summary(index.monthlynum)
Calculate Error for Regression
#Mean Absolute Percent Error (MAPE) indicates the accuracy of the forecast: it measures, on average, by what percentage predicted values deviate from observed values. We calculated the absolute percent error for the test set, then averaged it for each year from 2000 to 2018 (plot below). MAPE ranges between 0.4 and 0.7 over the study period, with the lowest values observed around the 2008 recession.

#Calculate MAPE on the test set
error <-
  dat.test %>%
  mutate(Predicted_Price = exp(predict(index.monthlynum, .))) %>%
  mutate(Error = Price - Predicted_Price ) %>%
  mutate(absPercentError = abs(Price - Predicted_Price) / Price)
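For intuition, the absolute percent error for a single sale (numbers hypothetical):

```r
#A home that sold for $100,000 but was predicted at $120,000
price     <- 100000
predicted <- 120000
abs(price - predicted) / price   #APE = 0.2, i.e. 20%
#MAPE is simply the mean of these values across all test-set sales
```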

#Calculate MAPE by year.
error %>%
  group_by(year) %>%
  summarize(MAPE = mean(absPercentError, na.rm=T)) %>%
  ggplot(aes(year, MAPE)) +
  geom_bar(stat="identity", fill ="#3AA083" ) +
  ggtitle("MAPE by year")

##MAPE by HMA
error %>%
  group_by(H_Mkt_Area) %>%
  summarize(MAPE = mean(absPercentError, na.rm=T)) %>%
  left_join(hma, by=c("H_Mkt_Area" = "H_Mkt_A")) %>%
  st_sf() %>%
  ggplot() + geom_sf(aes(fill=MAPE)) +
  scale_fill_gradientn(colours = palette5, na.value = "grey50",guide = "colourbar",aesthetics = "fill") +
  ggtitle("MAPE by Housing Market Area")+
  mapTheme()


#Plot error as a function of sale price
ggplot(error, aes(Price, Error)) +
  geom_point(color="#2D3F50") + geom_smooth(method="lm", color = "#83CE7B") +
  ggtitle("Error as a function of Price")
Create prediction dataset for next 5 years (2019-2023)
#Since ~20,000 residential single-family parcels are sold per year (based on the training dataset for Louisville), a new dataset of 100,000 sample homes is created (20,000/yr x 5 years).


#Create a new dataset for future predictions (2019-2023) by randomly sampling 100k houses that sold since 2013.
future.monthlynum <-
  #Select parcels sold since 2013 from the df dataframe
  filter(dat.df, year >= 2013) %>%
  #Randomly select 100k parcels with replacement
  sample_n(100000, replace = T) %>%
  dplyr::select(PARCELID, year, months_since_00, H_Mkt_Area, Price, GEOID_t, SaleDate, geometry) %>%
  #Set Price = 0 (prediction values will be assigned to these parcels later)
  mutate(Price = 0,
         #Assign a random sale month to each of the 100k parcels in the prediction dataset
         months_since_00 = sample(c(229:288),100000, replace=T))


#Next, mutate a predicted Price for those manufactured future sales. These are bound to the observed sales from dat.df (not just the training set). The months indicator is converted to a factor for the group_by below. Sales are labeled as either Predicted or Observed; note that rbind creates a long-form dataset by stacking the two.
dat.monthly.prediction <-
  rbind(
    future.monthlynum %>% 
      #Create predicted Price for those 100k future sales 
      mutate(Price = exp(predict(index.monthlynum, .)),
             Legend = "Predicted"),
    dplyr::select(dat.df, PARCELID, year, months_since_00, H_Mkt_Area, Price, GEOID_t, SaleDate, geometry) %>%
      mutate(Legend = "Observed")) %>%
  mutate(months_since_00 = as.factor(months_since_00))

#Plot average sale prices of observed (2000-2018) and predicted (2019-2023) sales
dat.monthly.prediction %>%
  group_by(months_since_00, Legend) %>%
  summarize(Mean_Predicted_Price = mean(Price,na.rm=T))  %>%
  ggplot(aes(as.numeric(months_since_00), Mean_Predicted_Price, colour=Legend)) +
  geom_point() +
  geom_smooth(aes(colour=Legend),method="loess") +
  scale_color_manual(values=c("#83CE7B","#266E75"))+
  labs(title="Jefferson County Average Home Prices per month: 2000 - 2023", subtitle="95% Confidence Intervals included") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Calculate Sales Predictions for Each Housing Market Area
#Convert the Predicted SaleDates from their randomly sampled values to actual dates, calculated as the first month, "2000-01-01", plus months_since_00. We use an ifelse to do this; the as.character/as.Date coercion is needed because ifelse() drops the Date class from its result.
dat.monthly.prediction <-
  dat.monthly.prediction %>% 
  mutate(months_since_00 = as.numeric(months_since_00)) %>%
  #Overwrite the SaleDate field for predictions (2019-2023) so that it reflects the date implied by months_since_00 rather than the randomly sampled record's original date
  mutate(SaleDate = ifelse(Legend == "Observed", as.character(SaleDate),
                           as.character(as.Date(ymd("2000-01-01") + months(months_since_00))))) %>%
  mutate(year = year(SaleDate))
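For example, a simulated sale with months_since_00 = 240 maps back to January 2020 under the Jan 2000 baseline:

```r
#Reconstruct a calendar date from months elapsed since Jan 2000
ymd("2000-01-01") + months(240)   #"2020-01-01"
```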

#calculate Price_Change between 2018 and 2023, on a percentage basis. Below, we first filter by the year of SaleDate to get only those observed and predicted sales from the end of both our observed and predicted data, respectively. Sales are grouped by year and HMA and the Mean_Price is taken. To calculate Price_Change the dataset is converted to wide.
price.change.hma <-
  dat.monthly.prediction %>%
    #filter dataset to years 2018 and 2023
    filter(year(SaleDate) == 2018 | year(SaleDate) == 2023) %>%
    #Group sales date by year and housing market area
    group_by(year, H_Mkt_Area) %>%
    #average price for every housing market area/year
    summarize(Mean_Price = mean(Price, na.rm=T)) %>%
    #convert to wide format
    pivot_wider(names_from=year, values_from=Mean_Price) %>%
    mutate(Price_Change = ((`2023` - `2018`) / `2023`) * 100)
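Note that Price_Change is computed relative to the 2023 mean rather than the 2018 base, so increases are bounded above by 100%. A worked example with hypothetical means:

```r
#Hypothetical HMA: mean sale price $150,000 in 2018 and $200,000 in 2023
((200000 - 150000) / 200000) * 100   #25, i.e. a 25% change relative to the 2023 mean
```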

#Plot the percent change as a map
left_join(price.change.hma, hma, by=c("H_Mkt_Area" = "H_Mkt_A")) %>% 
  st_sf() %>%
  ggplot() +
    geom_sf(aes(fill=Price_Change)) +
    scale_fill_gradient2(midpoint = 0, low = "#78281F", mid = "#FDEDEC",
                              high = "#154360", space = "Lab",
                         name="Percent Price\nChange",
                         breaks = c(-50,0,50), 
                         labels = c("-50%", "0%", "50%"))
Calculate Average Sales Nearby Existing Affordable Housing Trust Fund Projects
#Create 1/2 mile Buffer surrounding all existing affordable housing trust fund projects
AH_Buffer <- 
  Aff_House %>%
  st_transform(102679) %>%
  st_buffer(2640) %>%   #2640 ft = 1/2 mile in the KY North state plane (ft)
  mutate(Legend = "Buffer")

#Join Sales with AH Buffers
price.AH <- dat.monthly.prediction %>%
  st_sf()%>%
  st_transform(102679) %>%
  st_join(AH_Buffer)

#Remove NA - removes sales that are not located within 1/2 mile of an existing affordable housing trust fund project
price.AH <- price.AH[!is.na(price.AH$City), ]

#Calculate % change of average home prices (2018-2023) within 1/2 mile of every affordable housing trust fund project 
price.change.AH <-
  price.AH %>%
  filter(year(SaleDate) == 2018 | year(SaleDate) == 2023) %>%
  group_by(year, Address) %>%
  st_drop_geometry()%>%
  summarize(Mean_Price = mean(Price, na.rm=T)) %>%
  pivot_wider(names_from=year, values_from=Mean_Price) %>%
  mutate(Price_Change = ((`2023` - `2018`) / `2023`) * 100)

#Spatial join the existing affordable housing projects with housing market areas, assigning each project to the housing market area it is located in
price.change.AH.geo <- price.change.AH%>%
  left_join(Aff_House, by=c("Address" = "Address"))%>%
  st_sf() %>%
  st_join(hma)

Data Visualizations: Maps & Plots

6.2 Works Cited

2020 Funding Application Guidelines and Project/Program Policies. (n.d.). Louisville Affordable Housing Trust Fund.

Louisville Affordable Housing Trust Fund – Open the door! (n.d.). Retrieved May 11, 2020, from http://loutrustfund.org/

Louisville Housing Needs Assessment. (2019). Louisville Metro Government Office of Housing & Community Development.

Opportunity Insights. (n.d.). Opportunity Insights. Retrieved May 11, 2020, from https://opportunityinsights.org/