library(dplyr)
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(ggplot2)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(caret)
library(purrr)
library(FNN)
library(stargazer)
library(spatstat)
library(raster)
library(spdep)
library(grid)
library(mapview)
library(stringr)
library(ggcorrplot)
library(scales)
library(colorspace)
library(rgdal)
library(RColorBrewer)
library(rasterVis)
library(sp)
library(ggpubr)
library(leaflet)
library(transformr)
library(jtools)
library(randomForest)
library(e1071) # SVM
library(xgboost)
library(readr)
library(car)
palette_5 <- c("#0c1f3f", "#08519c", "#3bf0c0", "#e6a52f", "#e76420")
#palette_5blues <-c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette_4 <-c("#08519c","#3bf0c0","#e6a52f","#e76420")
palette_2 <-c("#e6a52f","#3FB0C0")
palette_3 <-c("#e6a52f","#3FB0C0", "#e76420")
palette_5_mako <- c("#0B0405", "#3E356B", "#357BA2", "#49C1AD", "#DEF5E5")
palette_2_mako <- c("#3E356B", "#49C1AD")
#show_col(viridis_pal(option="G")(5))
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 16,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.text.x = element_text(size = 14))
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 16,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
# RData from EDA part
# load("Data/EDA2.RData")
# the only thing need to load 04/26
#load("Data/All_feature_models.RData")
load("Data/Final_Workspace.RData")
knitr::include_graphics("Images/ElPasoRoads_cropped.png")
This project is part of the Spring 2022 Master of Urban Spatial Analytics/Smart Cities Practicum course (MUSA 801) at the University of Pennsylvania taught by Michael Fichman and Matt Harris.
Thank you to Alex Hoffman, AICP at the El Paso Capital Improvements Department for connecting us with all of the data and background context, and for his support and enthusiasm for bringing data analytics to the city.
We’d also like to thank MUSA Faculty members Michael Fichman, Matt Harris, and Mjumbe Poe for their superior guidance and support throughout this semester.
The City of El Paso, Texas has been experiencing tremendous growth over the past decade. With a growing population comes more pressure - literally - on roads.
There is hope that passage of the Bipartisan Infrastructure Bill in Congress will allow for more funding for streets and maintenance projects to help the city’s road network become safer and more resilient - but the eventual funding from “Build Back Better” still leaves the decision of which local roads to update up to the city.
The City of El Paso’s Capital Improvement Department (Planning Division) wishes to improve their system for deciding where to allocate capital improvement funds for roadway projects. Presently, this is done by integrating spatial data sets reflecting current conditions to determine where there is need and opportunity.
There is currently no established prioritization system for deciding which roads to improve - or even which to add to the queue for improvement. Pavement Condition Index (PCI) scores inform the decision making, but many of the decisions are ad hoc. For example, constituents raise concerns about certain roads, the department looks at the PCI score, and if it is below a certain threshold, the road is added to the list of projects for improvement. This is a very reactive process, and the department would like it to be more proactive.
But PCI also does not tell the whole story, and the client has expressed interest in exploring other factors that may drive a new prioritization system.
This project is two-pronged:
First, we have the predictive model. We predict PCI based on 2018 historical data and extensive feature engineering, which we discuss shortly. This part of the project is mostly a modeling exercise in the academic setting of MUSA 801, but the client could always choose to integrate our modeled PCI outputs into tools later on.
The second part of our project is the prioritization system, which takes the form of a web application. We incorporate PCI (our modeled score, which can be replaced with the PCI used by the city once their 2021 study is complete) as just one piece of a resource allocation and prioritization system. We explore factors to drive the new system that include both built and social environmental variables, thus bringing a lens of equity into the project.
A Pavement Condition Index (PCI) measures the quality of a specific road segment. The index was developed by the United States Army Corps of Engineers (USACE) initially as an airfield pavement rating system, but it was later modified for roadway pavements. The index was standardized by the American Society for Testing and Materials (ASTM).
The Capital Improvements Department in El Paso hired a private contractor in 2018 to conduct a digital image scan of the city’s roads and evaluate them against a wide range of conditions. While the contractor’s exact metrics are proprietary, PCIs are generally based on factors such as the presence of potholes, bumps, or cracks. The index runs qualitatively from Failing to Good, or quantitatively from 0 to 100. The chart below, adapted from Colorado State University, shows a significant drop in the condition of a road after a certain amount of time. We consider the cost savings calculations from this study when building the second part of our project, the decision-making tool. This would be a strong advocacy tool for improving a road as well as a strong planning tool that considers the road’s lifetime. For the first portion of the project, this made us curious about how roads are constructed and whether there are any earlier-stage indicators that could show signs of weakening conditions.
knitr::include_graphics("Images/PCI-plot-white.jpg")
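The two PCI scales described above can be linked by binning the numeric score into rating bands. The following is only a minimal sketch: the scores are made up, and the break points and labels are an assumption taken from the standard ASTM D6433 rating scale, not from the contractor’s proprietary method.

```r
# Illustrative PCI scores only; the real values come from the 2018 survey.
pci_scores <- c(3, 18, 32, 47, 61, 78, 92, 100)

# ASSUMPTION: break points and labels follow the standard ASTM D6433 scale,
# which may differ from the bands used by the city or its contractor.
pci_breaks <- c(0, 10, 25, 40, 55, 70, 85, 100)
pci_labels <- c("Failed", "Serious", "Very Poor", "Poor",
                "Fair", "Satisfactory", "Good")

# right = TRUE makes each band (lower, upper]; include.lowest keeps a score of 0.
pci_rating <- cut(pci_scores, breaks = pci_breaks, labels = pci_labels,
                  right = TRUE, include.lowest = TRUE, ordered_result = TRUE)

data.frame(pci = pci_scores, rating = pci_rating)
```

How scores that fall exactly on a boundary are assigned is a choice made here for illustration; it can be flipped with `right = FALSE`.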
To hypothesize what conditions could weaken a road over time, we wanted to understand the different parts of a road. We identified three main road anatomy-related features to pay attention to: the earth foundation, the roadbed base, and the road surface.
knitr::include_graphics("Images/road_cross_section.png")
The importance of the surface is more obvious, as potholes, bumps, and cracks are noticeable to the everyday road user. However, some roads have no base layer, are built with weak materials, or sit in areas prone to flooding, all of which can weaken the road’s structure as it ages. These are all important aspects of a road’s anatomy, and we explored their relationship to PCI during our exploratory data analysis.
For this section, we loaded and processed data provided by the City of El Paso and other open data sources. These datasets formed the basis for our data wrangling and feature engineering. The feature summary tables below include the sources for each dataset.
We considered how various factors can contribute to the overall “wear and tear” of a road, and grouped our features into three overarching categories:
* Road Conditions
* Environment
* Road Network
The far right column indicates if the feature was ultimately used in our final model. The next several subsections go into detail on how these features were engineered and what decisions went into determining their inclusion in the modeling.
The road conditions group contains features that describe the physical properties of a road.
Feature | Type | Source | Used in Final Model? |
---|---|---|---|
Roadbed Base Material | Categorical | TxDOT | Yes |
Roadbed Surface Material | Categorical | TxDOT | Yes |
Roadbed Width | Numeric | TxDOT | No |
Potholes (by Road Length) | Numeric | Engineered from City of El Paso Streets & Maintenance Data | Yes |
Road Age | Numeric | City of El Paso | Yes |
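Features marked “by Road Length” are event counts normalized by segment length, so that long and short segments are comparable. A minimal sketch of that normalization follows; the column names, the sample values, and the choice of miles as the unit are all assumptions made for illustration.

```r
# Hypothetical segment table: IDs, lengths in feet, and counts are made up.
segments <- data.frame(
  segment_id    = c("A", "B", "C"),
  length_ft     = c(5280, 2640, 1320),
  pothole_count = c(4, 4, 0)
)

# ASSUMPTION: the rate is expressed per mile; any length unit works the same way.
segments$length_mi <- segments$length_ft / 5280
segments$potholes_per_mi <- segments$pothole_count / segments$length_mi

segments
```

Segments A and B have the same raw count, but B is half as long, so its rate is twice as high.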
The environment group includes features that may indirectly affect road conditions, such as whether there is water nearby, what the majority land use category is, or how many city amenities (entertainment, restaurants, or shops) are near a road segment, which would indicate heavy travel on those roads.
Feature | Type | Source | Used in Final Model? |
---|---|---|---|
Land Use | Categorical | City of El Paso | Yes |
Land Cover | Categorical | USGS National Land Cover Database | No |
High Flood Risk Area or Not | Binary | Engineered from FEMA 2020 Preliminary Flood Zones | Yes |
Distance to Hydrology | Numeric | Engineered from TIGER/Line Texas Geodatabase - Hydrology | Yes |
Distance to Food/Drink Amenity | Numeric | Engineered from OpenStreetMap (3rd nearest neighbor) | Yes |
Distance to Car Facility Amenity | Numeric | Engineered from OpenStreetMap (3rd nearest neighbor) | Yes |
Distance to Entertainment Amenity | Numeric | Engineered from OpenStreetMap (3rd nearest neighbor) | Yes |
Below Interstate 10 or Not | Numeric | Engineered from TIGER/Line Texas Geodatabase - Roads | Yes |
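The amenity distances in the table are nearest-neighbor features. The sketch below shows one way such a feature can be computed in base R; the coordinates are made up, and because the table does not say whether the value is the distance to the third-nearest amenity or the average over the three nearest, both are returned. For large datasets, `FNN::get.knnx()` does the neighbor search far more efficiently than the full distance matrix used here.

```r
# Hypothetical projected coordinates (e.g. feet); not real El Paso data.
segment_xy <- rbind(c(0, 0), c(10, 0))
amenity_xy <- rbind(c(1, 0), c(0, 2), c(3, 0), c(0, 50), c(12, 0))

# Euclidean distance from every "from" point (rows) to every "to" point (columns).
dist_matrix <- function(from, to) {
  dx <- outer(from[, 1], to[, 1], "-")
  dy <- outer(from[, 2], to[, 2], "-")
  sqrt(dx^2 + dy^2)
}

knn_distance <- function(from, to, k = 3) {
  d <- dist_matrix(from, to)
  # For each row, keep the k smallest distances in ascending order.
  nearest <- matrix(
    unlist(lapply(seq_len(nrow(d)), function(i) sort(d[i, ])[seq_len(k)])),
    ncol = k, byrow = TRUE
  )
  data.frame(kth_dist  = nearest[, k],       # distance to the k-th nearest point
             mean_dist = rowMeans(nearest))  # average over the k nearest points
}

knn_distance(segment_xy, amenity_xy, k = 3)
```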
The road network feature group consists of data that describes how the different road segments interact with one another.
Feature | Type | Source | Used in Final Model? |
---|---|---|---|
Road Class | Categorical | City of El Paso | Yes |
Distance to Major Intersection | Numeric | Engineered from City of El Paso Centerlines Data | Yes |
Distance to Major Arterial | Numeric | Engineered from TxDOT | Yes |
Crashes (By Road Length) | Numeric | City of El Paso | Yes |
Vehicle Miles Traveled (VMT) | Numeric | Replica, via City of El Paso | Yes |
Traffic Jams (By Road Length) | Numeric | Engineered from Waze, via City of El Paso | No |
Because the client wanted to include an aspect of equity in the decision-making process, we analyzed data from the US Census. Data from the 5-year American Community Survey (ACS) and the national TIGER/Line datasets help us better understand the population that relies on the roads.
The demographic and socioeconomic breakdown of El Paso’s population allows the city to be mindful of equity while allocating funds for road improvement projects. Although this is an important factor to include in the decision making process, we ultimately determined this data would be most useful in the web application and not in the predictive model as it is unlikely the private contractor considered equity when assigning the PCI scores in 2018.
We looked at race, ethnicity, age, income, and transportation to work data through the ACS 2019 5-year dataset to see the demographic breakdown of the city.
ggplot(pop_pyramid, aes(x = variable, fill = Sex,
y = ifelse(test = Sex == "Male",
yes = -value, no = value))) +
geom_bar(stat = "identity") +
# geom_line(aes(x = "15 to 19 years"), color = "red", size=1) +
scale_y_continuous(labels = abs, limits = max(pop_pyramid$value) * c(-1,1)) +
scale_fill_manual(values=palette_2_mako)+
labs(title = "Population Pyramid", x = "Age Group", y = "Population by Gender") +
coord_flip() + plotTheme()
#pct transport to work map
ggplot()+
geom_sf(data=EP_econ, aes(fill=pct_transport_to_work), color="grey")+
scale_fill_viridis(option='G', direction=-1)+
labs(title="Percent Population with Transportation to Work in 2019",
fill="% Transport \nto Work",
subtitle="Census Tracts in El Paso, TX", caption="Source: US Census, ACS 2019") + mapTheme()
The population pyramid shows that El Paso’s population is young, and the working age population accounts for a large proportion of the total. We can infer that this means many people commute to work and, thus, rely on safe roads on a daily basis. To confirm, we generated a map of the percent of the population that has a means of transportation to work from the ACS. The map shows very little of the population works from home so safe roads are crucial to the city’s population.
#pct white map
pctWhite_map <- ggplot()+
geom_sf(data=EP_race, aes(fill=pctWhite), color="grey")+
scale_fill_viridis(option='G', direction=-1)+
labs(title="Percent White",
fill="%",
subtitle="Census Tracts in El Paso, TX",
caption = "Source: US Census, ACS 2019\n\nNote: Gray tracts indicate no data") + mapTheme()
#ethnicity map - Hispanic or Latino
ethnicity_hisp_lat_map <- ggplot()+
geom_sf(data=EP_ethnicity, aes(fill=pctHL), color="grey")+
scale_fill_viridis(option='G', direction=-1)+
labs(title="Percent Hispanic or Latino",
fill="%",
subtitle="Census Tracts in El Paso, TX",
caption = "Source: US Census, ACS 2019\n\nNote: Gray tracts indicate no data") + mapTheme()
grid.arrange(pctWhite_map, ethnicity_hisp_lat_map, ncol=2)
These maps show that the majority of the city’s population identifies as White or Hispanic or Latino. As pointed out by our client, El Paso has a history of disinvestment in the minority communities that live south of Interstate 10. This trend can be further observed in the median income map, where the tracts with the lowest median incomes are below this highway. This can mean poor data collection in these census tracts, which is important to be aware of when creating our model and which reinforces the motivation to include equity in the decision-making tool.
#Median household income
ggplot()+
geom_sf(data=EP_econ, aes(fill=med_hh_income), color="grey")+
scale_fill_viridis(option='G', direction=-1)+
labs(title="Median Household Income in 2019",
fill="Dollars ($)",
subtitle="Census Tracts in El Paso, TX", caption="Source: US Census, ACS 2019\n\nNote: Gray tracts indicate no data") + mapTheme()
We also imported the hydrology features from the US Census TIGER/Line Geodatabase. The map below shows that El Paso does not have an extensive hydrology network, and water is concentrated mostly along the southwestern border of the city.
ggplot()+
geom_sf(data=El_Paso_city, aes(), color="grey")+
geom_sf(data = EPhydrology, color = '#357BA2', alpha = 0.9, show.legend = T)+
labs(title="Hydrology Across the City",
subtitle="El Paso, TX", caption="Source: US Census - TIGER/Line Shapefiles") + mapTheme()
In this portion of our exploratory data analysis, we dug deeper into existing road centerlines data (EPCenterline) provided by the City of El Paso.
Data cleaning and wrangling on the EPCenterline data layer included:
EPCenterline to centerline_with_age
After speaking to our client, we kept only the four pavement categories that fall under the city’s jurisdiction for maintenance: LOCAL, MINOR, MAJOR, and COLLECTOR. Most of the segments belong to the LOCAL category.
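The class filter described above can be sketched as follows. The small table stands in for the centerline attributes, and the `FREEWAY` value is a hypothetical example of a class outside the city’s jurisdiction; the real data may use different labels.

```r
# Hypothetical stand-in for the centerline attribute table.
centerlines <- data.frame(
  segment_id = 1:6,
  CLASS = c("LOCAL", "FREEWAY", "MINOR", "LOCAL", "COLLECTOR", "MAJOR")
)

# The four classes the city maintains, as named in the text.
city_classes <- c("LOCAL", "MINOR", "MAJOR", "COLLECTOR")

# Keep only city-maintained segments, then count segments per class.
city_centerlines <- centerlines[centerlines$CLASS %in% city_classes, ]
class_counts <- table(city_centerlines$CLASS)

class_counts
```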
ggplot() +
geom_sf(data = EPCenterline, aes(color = CLASS), alpha=0.8, size=0.6, show.legend = "line") +
scale_color_manual(values=palette_4)+
labs(title = "Road Centerlines by Class",
color="Class",
subtitle = "El Paso, TX") + mapTheme()
PCI values are assigned at the road segment level, so we selected road segments as our spatial unit of analysis.
Here we focus on fundamental visualizations of the features of EPCenterline_with_PCI. As is shown in the plots below, LOCAL and COLLECTOR segments have higher average PCI values, while segments in the MAJOR class tend to have more problems in pavement condition. When it comes to different planning areas, segments in Northwest El Paso and the Art Craft region have better pavement conditions, while the central region performs poorly.
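A comparison of average PCI by road class like the one described above can be computed with a grouped mean. This is a sketch on made-up values; only the column names `CLASS` and `PCI_2018` are taken from the plotting code in this report.

```r
# Made-up scores for illustration; not the city's data.
road_pci <- data.frame(
  CLASS    = c("LOCAL", "LOCAL", "MAJOR", "MAJOR", "COLLECTOR"),
  PCI_2018 = c(90, 80, 60, 50, 85)
)

# Mean PCI per class, sorted from best to worst average condition.
mean_pci <- aggregate(PCI_2018 ~ CLASS, data = road_pci, FUN = mean)
mean_pci <- mean_pci[order(-mean_pci$PCI_2018), ]

mean_pci
```

The same call works for planning areas by swapping the grouping column.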
ggplot(EPCenterline_with_PCI, aes(y=CLASS)) +
geom_bar(width=0.5, color="black", fill = "#357BA2") +
labs(title = "Road Centerlines by Class",
y="Class",
x="Count",
subtitle = "El Paso, TX") + plotTheme()
We plotted the PCI distribution for the cleaned dataset, shown below, and detected some negative PCI values. According to our client, negative PCI scores are assigned to segments such as highways, interstates, and private roads, which fall outside the city’s maintenance responsibility. After removing segments with negative PCIs, the new distribution shows three peaks, at the value ranges of 98-100, 80-85, and 58-63.
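Dropping the negative scores and checking where the remaining values cluster can be sketched as below. The scores are made up, and the 5-point bin width is an assumption chosen only to make peaks easy to spot.

```r
# Made-up scores; negative values mimic the placeholder scores described above.
pci_raw <- c(-1, -1, 99, 100, 82, 84, 60, 59, 45, 98)

# Keep only valid scores on the 0 to 100 scale.
pci_valid <- pci_raw[pci_raw >= 0]

# ASSUMPTION: 5-point bins; count how many segments fall in each.
bins <- cut(pci_valid, breaks = seq(0, 100, by = 5), include.lowest = TRUE)
bin_counts <- table(bins)

# Show only the bins that contain at least one segment.
bin_counts[bin_counts > 0]
```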
# unique(center_line$PCI_2018)
ggplot(EPCenterline_with_PCI, aes(y=PCI_2018)) +
geom_bar(width=0.6, color="transparent", fill = "#357BA2") +
labs(title = "PCI 2018 Score Distribution",
x="Count",
y="PCI",
subtitle = "El Paso, TX") + plotTheme()
The interactive map below colors the road segments by PCI score to show the spatial distribution of scores across the city. Higher scores - denoted by the darker purples - tend to be located toward the outer edges of the city, especially in the eastern portions.
library(mapview)
mapview(EPCenterline_with_PCI, zcol="PCI_2018", color = c("#DEF5E5", "#49C1AD", "#357BA2", "#3E356B", "#0B0405"), popup=FALSE, layer.name = "Road Centerlines by 2018 PCI Scores")