
This project was done as part of the University of Pennsylvania’s Master of Urban Spatial Analytics (MUSA) Spring 2022 Practicum. The instructors were Michael Fichman and Matthew Harris. We would like to thank them for their tireless help with this project. The group would also like to thank the team at US Ignite, Palak Agarwal and Kyle Compton, for their help in supplying the data and high-level guidance.

This report documents the data analysis pipeline in sequence: data processing, exploration, and visualization, followed by building the traffic prediction model through feature engineering, model training, and validation testing.

Aerial image of Fort Carson. Source: https://theclio.com/entry/51720

0 Introduction

0.1 Motivation

Ever since the dawn of the automobile age, traffic jams have appeared wherever large numbers of drivers gather. Although nearly everyone experiences congestion at some point, its causes and potential mitigations are far less commonly understood. Without policy measures such as dynamic highway tolling, road network use becomes a classic “tragedy of the commons,” in which unfettered open access to a shared resource results in systemic collapse and disutility for all who use it. Although the COVID-19 pandemic has shifted the usual travel and commute patterns across the United States, especially for those who have the option to work from home, congestion will only shift to different locations and hours of the day. Even so, many other workers are required to work on site and still face congestion day to day. Novel ways of predicting traffic conditions and alerting drivers ahead of time, especially around known major traffic generators such as large employers, will help regular commuters make informed decisions about when and where to travel, and will benefit the wider public as well.

In this study, we take the case of Colorado Springs, Colorado, a mid-sized city dominated by a few major employers, chiefly the US Armed Forces. To the south of the city and separate from it lies Fort Carson, established in 1942 and one of the largest Army bases in the country, with a total area of 137,000 acres. Although Census data put the population of the Fort Carson CDP (census-designated place) at only 13,800, mainly active-duty soldiers, including the “associated” population of retirees, family members, and civilian employees brings the total closer to 125,000, or about 26% of the population of Colorado Springs.

Before the pandemic, Colorado Springs experienced steadily rising congestion, which peaked in 2019 according to the Texas Transportation Institute’s Urban Mobility Report. Measured in hours of travel delay, in aggregate and per person, this came to 20,010 hours and 48 hours respectively, right at the average for mid-sized cities. What is unusual is that this congestion is concentrated among car commuters and on Interstate highways: both statistics are significantly above the average for mid-sized cities[1]. Taken together with the large employment center of Fort Carson (as well as the Air Force Academy and Peterson Air Force Base), it is apparent that these large employment sites, located close to Colorado Springs’ highways, are a major driver of regional traffic volumes and congestion.

Congestion along I-25 in Colorado Springs, South Gap. Source: KRDO

0.2 Project: Smart Base Artificial Intelligence for Traffic and Weather (AI4TW)

At present, there is little to no active management of traffic flow at Fort Carson. Because a military base restricts access to and from its premises, a ring of seven gates surrounding Fort Carson controls traffic movement in and out of the base. During hours of heavy congestion, base operators selectively close gates in order to redirect traffic to less congested highways. In addition, when inclement weather is forecast, the base goes through an overnight protocol to either delay base opening or limit opening to critical operating personnel only. The goal of the project team and US Ignite is therefore to develop a sophisticated predictive model of traffic congestion for Fort Carson and Colorado Springs, based on historical data, that will help both commuters and base operators identify and avoid potential traffic jams.

Our primary data source and response variable is a spatio-temporal dataset of traffic jams and irregularities around Colorado Springs, provided by the Waze for Cities data sharing program through US Ignite. We transform this data into a defined congestion index (CI), and gather other candidate predictors of interest, such as historic accidents, weather patterns, roadway infrastructure features, and more.

Outline of data workflow

The use cases for the predictions yielded by this project are twofold. First, we seek to send alerts and live map updates of predicted traffic disruptions, around two hours in advance, to help commuters plan their drive to and from Fort Carson. Second, base operators will have predictive information at their fingertips in order to direct drivers to specific gates or to close gates entirely.

1 Data preparation

First, the relevant boundaries were created and read in, including for Fort Carson, the city of Colorado Springs, the county, and study border. The main border for analysis purposes is a predefined bounding box over most of the Colorado Springs metropolitan area, covering all of the relevant Waze traffic data.

Bounding box of analysis

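# Packages used by the code in this section
library(sf)        # st_read, st_transform, st_intersection, st_union, st_write
library(dplyr)     # %>%, filter, mutate, group_by, summarise, left_join
library(tibble)    # add_column
library(readxl)    # read_xlsx
library(lubridate) # ymd_h, with_tz, hour, date

# `crs` is the project's target coordinate reference system; it must be defined before this point
border.county = st_read("data/boundary_county.geojson") %>% 
  st_transform(crs)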

border.city = st_read("data/boundary_city.geojson") %>% 
  st_transform(crs)

border = st_read("data/boundary_research.geojson") %>% 
  st_transform(crs)

border.military = st_read("data/boundary_military.geojson") %>% 
  st_transform(crs) %>% 
  filter(NAME == "Fort Carson") %>% 
  st_intersection(border)
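As an illustration of how a rectangular study border of this kind can be built and projected, here is a minimal sketch. The corner coordinates and the EPSG code are assumptions chosen only for the example (EPSG:26913, NAD83 / UTM zone 13N, covers the Colorado Springs area); they are not the bounding box or the `crs` value used in this project.

```r
library(sf)

# Hypothetical corner coordinates in longitude/latitude (WGS84).
# These are NOT the project's actual study border.
bbox_demo <- st_bbox(
  c(xmin = -105.0, ymin = 38.6, xmax = -104.6, ymax = 39.0),
  crs = st_crs(4326)
)

# Turn the bounding box into a polygon geometry
border_demo <- st_as_sfc(bbox_demo)

# Assumed projected CRS for the example: NAD83 / UTM zone 13N
crs_demo <- 26913
border_proj <- st_transform(border_demo, crs_demo)

# Area of the box in square kilometres, computed in the projected CRS
area_km2 <- as.numeric(st_area(border_proj)) / 1e6
```

Working in a projected CRS means that lengths and areas come out in metres, which is convenient for later steps such as measuring jam lengths.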

Waze data was loaded with two major categories: irregularities and jams. Irregularities are automatically generated by users’ GPS data and identified by the Waze system as irregular or atypical by taking into account historical speed data for the same day of the week and hour. On the other hand, jams are either irregularities determined by the app to be significantly slower than normal, or user-generated reports shared by Waze users.

blacklist <- c("N Cheyenne Canyon Rd","High Dr")

waze.ir = read_xlsx("./data/Waze_Irregularities_1215.xlsx", sheet = "Waze_Irregularities_1215") %>%
  st_as_sf(wkt = 'geo',crs = 4326)

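waze.jam = st_read("data/waze.jam.geojson") %>%
  # drop the blacklisted streets, which were under construction
  filter(!street %in% blacklist) %>% 
  rename(jam_id = id) %>% 
  # date and hour are parsed as UTC, then converted to local time.
  # Note: "MST" is a fixed UTC-7 offset and does not follow daylight saving time.
  mutate(ymdh = ymd_h(paste0(date, hour),tz = "UTC"),
         ymdh = with_tz(ymdh,"MST"),
         hour = hour(ymdh),
         date = date(ymdh))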
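The time conversion step can be checked on a small example. The sketch below uses two made-up UTC records (not project data) to show how a fixed "MST" offset differs from the "America/Denver" zone, which follows daylight saving time. It pastes the date and hour with a space between them, an assumption made here to keep the two fields from running together when parsed.

```r
library(lubridate)

# Hypothetical UTC records: one in October (daylight time), one in December
date_utc <- c("2021-10-15", "2021-12-15")
hour_utc <- c(14, 14)

# Parse "YYYY-MM-DD H" as a UTC timestamp
ts_utc <- ymd_h(paste(date_utc, hour_utc), tz = "UTC")

# "MST" is always UTC-7; "America/Denver" is UTC-6 while daylight time is in effect
ts_mst    <- with_tz(ts_utc, "MST")
ts_denver <- with_tz(ts_utc, "America/Denver")

hour(ts_mst)     # 7 7
hour(ts_denver)  # 8 7
```

For records that fall in daylight time, the two choices differ by one hour, which matters when hourly peaks are compared against local commute times.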
# Several records can share one jam_id.
# Keep the most severe record per jam (the one with the lowest speed).

waze.jam.uniqueid.severe =
  waze.jam %>%
  group_by(jam_id) %>%
  arrange(speed) %>%
  slice(1)

# Aggregate the records that share a jam_id:
#   numeric columns: mean or median
#   categorical columns: value from the most severe record
#   geometry: union of all geometries for that jam_id

waze.jam.uniqueid.uniongeom =
  waze.jam %>%
  group_by(jam_id) %>%
  summarise(geometry = st_union(geometry))

# add_column() matches rows by position, so both tables must be in the same jam_id order
waze.jam.uniqueid.avg.left =
  waze.jam.uniqueid.severe %>%
  st_drop_geometry() %>%
  dplyr::select(-where(is.numeric),jam_id) %>%
  add_column(geometry = waze.jam.uniqueid.uniongeom$geometry) %>%
  st_sf()

waze.jam.uniqueid.avg =
  waze.jam %>%
  st_drop_geometry() %>%
  group_by(jam_id) %>%
  summarise(across(where(is.numeric), mean)) %>%
  left_join(waze.jam.uniqueid.avg.left,by="jam_id") %>%
  st_sf()

waze.jam.uniqueid.median =
  waze.jam %>%
  st_drop_geometry() %>%
  group_by(jam_id) %>%
  summarise(across(where(is.numeric), median)) %>%
  left_join(waze.jam.uniqueid.avg.left,by="jam_id" ) %>%
  st_sf()

waze.jam.uniqueid.severe %>% 
  st_write('data/waze.jam.uniqueid.severe.geojson',append=F)
waze.jam.uniqueid.avg %>% 
  st_write('data/waze.jam.uniqueid.avg.geojson',append=F)
waze.jam.uniqueid.median %>% 
  st_write('data/waze.jam.uniqueid.median.geojson',append=F)

# the most severe jam in one id (lowest speed)
waze.jam.uniqueid.severe =
  st_read('data/waze.jam.uniqueid.severe.geojson')
# average in one id
waze.jam.uniqueid.avg =
  st_read('data/waze.jam.uniqueid.avg.geojson')
# median in one id
waze.jam.uniqueid.median =
  st_read('data/waze.jam.uniqueid.median.geojson')
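The three per-jam summaries can be illustrated on a toy table. This is a minimal sketch with made-up records and no geometry column; the street names and values are hypothetical, and `slice_min()` plus a join by `jam_id` is one alternative way to express the same idea, not the code used above.

```r
library(dplyr)

# Hypothetical records: jam_id 1 is reported three times, jam_id 2 once
jams <- tibble(
  jam_id = c(1, 1, 1, 2),
  speed  = c(4.0, 2.5, 3.0, 6.0),  # m/s
  delay  = c(60, 120, 90, 30),     # seconds
  street = c("A St", "A St", "A St", "B Ave")
)

# Most severe record per jam: the row with the lowest speed
severe <- jams %>%
  group_by(jam_id) %>%
  slice_min(speed, n = 1, with_ties = FALSE) %>%
  ungroup()

# Mean and median of the numeric columns per jam
avg <- jams %>%
  group_by(jam_id) %>%
  summarise(across(where(is.numeric), mean))

med <- jams %>%
  group_by(jam_id) %>%
  summarise(across(where(is.numeric), median))

# Attach the categorical column by key rather than by row position
combined <- avg %>%
  left_join(select(severe, jam_id, street), by = "jam_id")
```

Joining on `jam_id` keeps the result correct even if the two tables end up in a different row order.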

The resulting jam dataset has the following columns:

| Name | Type | Explanation | NA (%) |
|---|---|---|---|
| type | Category | Alert type | 98.5 |
| speed | Number | Current average speed on jammed segments (m/s) | 0 |
| speedKMH | Number | Speed (km/h) | 0 |
| length | Number | Jam length (m) | 0 |
| delay | Number | Delay of jam (s) | 0 |
| street | Text | .. | 4.39 |
| city | Text | .. | 4.27 |
| level | Category | Traffic congestion level | 0 |

2 Data exploration

2.1 Data coverage and selection

Overall, the Waze data covered 2.5 years and comprised nearly 3 million rows. After reviewing the data for overall patterns, we decided early on to subset it to October-December 2021 in order to save processing time. This period was fairly recent, reflected post-pandemic traffic patterns, and was largely clear of major unexplained events and disruptions at Fort Carson that could throw off our predictions. The subset reduced waze.jam to 344,529 rows. On further review, we also decided to omit the Waze irregularities dataset from the analysis, as it had too many NA values to analyze reasonably without extensive imputation.
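A subset of this kind can be expressed as a simple date filter. The sketch below uses a made-up stand-in table with a `date` column; the variable names `study_start` and `study_end` are introduced here for illustration only.

```r
library(dplyr)

# Hypothetical stand-in for waze.jam with a Date column named `date`
jams <- tibble(
  jam_id = 1:4,
  date   = as.Date(c("2021-09-30", "2021-10-01", "2021-12-31", "2022-01-01"))
)

# Study period: October through December 2021, inclusive
study_start <- as.Date("2021-10-01")
study_end   <- as.Date("2021-12-31")

jams_study <- jams %>%
  filter(date >= study_start, date <= study_end)

nrow(jams_study)  # 2
```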

2.2 Time patterns

As expected, the jam dataset shows strong periodicity, both by hour and by day. The x-axis shows time from Monday to Sunday, and the y-axis shows the week number from 1 to 54.

There are 3 main takeaways.

  • There is some seasonal tendency in 2021: there are more jams in the summer and autumn.
  • There is a strong periodic pattern between weeks, so the previous week's data is very important for predicting the jams of a given week.
  • Looking more closely within a week, there is a similar periodic pattern from day to day: the jams on Tuesday are similar to those on Monday.
    However, there are still some sudden changes in jams, which may be caused by factors such as weather or noise in the data. These will be a challenge in our modeling.
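The weekly and daily periodicity described above suggests time-lag features. The following is a minimal sketch on synthetic hourly counts, not the project's feature engineering; the column names `n_jam`, `lag_1day`, and `lag_1week` are hypothetical, and the row-based lags assume a complete hourly panel with no missing hours.

```r
library(dplyr)
library(lubridate)

# Synthetic hourly jam counts over three full weeks, starting on a Monday
hours <- seq(ymd_h("2021-10-04 00", tz = "UTC"), by = "hour", length.out = 24 * 21)
panel <- tibble(
  ymdh  = hours,
  n_jam = rep(c(0, 1, 3, 1), length.out = length(hours))
)

panel <- panel %>%
  arrange(ymdh) %>%
  mutate(
    week      = isoweek(ymdh),
    weekday   = wday(ymdh, label = TRUE, week_start = 1),
    lag_1day  = dplyr::lag(n_jam, 24),      # same hour on the previous day
    lag_1week = dplyr::lag(n_jam, 24 * 7)   # same hour and weekday one week earlier
  )

# Total jams per week and weekday: the table behind a week-by-weekday heatmap
weekly <- panel %>%
  count(week, weekday, wt = n_jam, name = "n_jam")
```

The first 168 rows have no value for `lag_1week`, so a model using this feature needs at least one week of history before its first prediction.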