library(tidyverse)
library(tidylog)
library(janitor)
library(scales)
library(sf)
library(tidymodels)
library(knitr)
library(kableExtra)
library(patchwork)
library(DALEXtra)
library(pdp)
library(boxr)
library(mapview)

# Box authentication (run once per session)
box_auth()

# Project color palette: matched to maps and charts throughout
roadway_colors <- c("#ff9500", "#ffd000", "#00badb", "#156082")
names(roadway_colors) <- c("Major Arterial", "Minor Arterial", "Collector", "Local")

palette_orange <- "#ff9500"   # Major Arterial / highlight
palette_yellow <- "#ffd000"   # Minor Arterial
palette_cyan   <- "#00badb"   # Collector
palette_blue   <- "#156082"   # Local / primary headings

speed_detail_colors <- c(
  "1-10mph over speed limit"         = "#ff9500",
  "11-15mph over speed limit"        = "#ffd000",
  "16-20mph over speed limit"        = "#00badb",
  "More than 20mph over speed limit" = "#156082"
)

theme_set(theme_light(base_size = 12))

A note to the reader: This document presents the complete analysis pipeline for the MUSA 801 Practicum project on off-peak roadway safety, conducted in partnership with the City of Philadelphia’s Office of Transportation and Infrastructure Systems (OTIS). Raw data are stored in a shared Box repository and loaded via the boxr API. Code chunks marked eval=FALSE contain the exact processing logic but are not re-executed here, to avoid re-downloading large datasets; pre-processed outputs are loaded directly from Box where needed. Raw and processed data that can be shared publicly are also stored in the project GitHub repository.


1. Use Case

1.1 Background

Most roadway design criteria target peak-hour travel, when congestion is most severe. Yet a quieter, more lethal phenomenon unfolds each night: when the same roads that efficiently move rush-hour traffic fall silent, they become something far more dangerous.

Wide arterials and multi-lane collectors, engineered to manage thousands of vehicles per hour, become unobstructed corridors after 10 PM. A crash at 50 mph on an empty road at 2 AM is not an anomaly — it is a structural outcome, predicted by the design of the road itself.

“Speed is the number one determinant of severity in a crash — as speed increases so does the probability of crash fatalities and serious injuries.”
— DVRPC Arterial Typology & Speed Management Framework, 2022

The Off-Peak Safety Paradox

Philadelphia’s Office of Multimodal Planning documented this pattern directly in an analysis of eight multilane arterials on the city’s High Injury Network (OTIS Office of Multimodal Planning, “Evaluating the Off-Peak Impacts of Designing for Peak Hour Operations,” TESC, December 2024; internal report). The finding is stark:

Evening hours (8 PM–3 AM) represent only 29% of the day and carry just 14% of daily traffic volume, yet they account for 46% of all pedestrian crashes in which someone is killed or seriously injured (KSI). Peak hours (7–10 AM, 3–7 PM), despite covering the same share of the day (29%) and carrying three times the traffic share (42%), produce only 26% of KSI crashes. The conclusion is direct: Level of Service (LOS) A, meaning free-flow traffic, is not safe. Excess capacity kills.
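
The shares quoted above can be combined into a simple over-representation ratio: a period's share of KSI crashes divided by its share of traffic. The sketch below uses only the percentages reported in the OTIS analysis; the column name `ksi_per_traffic` is our own label, not something taken from the report.

```r
# Shares reported in the OTIS analysis (percent of the daily total)
shares <- data.frame(
  period        = c("Evening (8 PM-3 AM)", "Peak (7-10 AM, 3-7 PM)"),
  traffic_share = c(14, 42),
  ksi_share     = c(46, 26)
)

# Over-representation: KSI share per unit of traffic share (1 = proportional)
shares$ksi_per_traffic <- shares$ksi_share / shares$traffic_share

# How much more over-represented evening KSI crashes are than peak ones
evening_vs_peak <- shares$ksi_per_traffic[1] / shares$ksi_per_traffic[2]

print(shares)
round(evening_vs_peak, 1)
```

By this measure, evening hours carry about 3.3 times their traffic share of KSI crashes and peak hours about 0.6 times, a gap of roughly 5.3 to 1.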

The hourly KSI-to-demand ratio tells the same story at finer resolution: during overnight hours, the KSI share (orange) rises sharply while the traffic share (black) falls — creating the inverse pattern that defines the off-peak paradox. The divergence is most extreme between midnight and 7 AM, when pedestrian KSI rates are 3–5× higher per unit of traffic than during midday hours.

Ped KSI crashes vs. hourly traffic share by hour of day across Philadelphia multilane arterials. Orange = pedestrian KSI share; Black = hourly traffic share (% of AADT). Source: OTIS Office of Multimodal Planning, TESC, December 2024 (internal report).

“LOS A is dangerous for vulnerable road users because it allows for unmetered speeding by drivers.”
— OTIS Office of Multimodal Planning, Evaluating the Off-Peak Impacts of Designing for Peak Hour Operations, TESC, December 2024 (internal report)

Why Road Design Makes Drivers Speed

The off-peak paradox is not simply about fewer cars at night. It is about what wide, multi-lane roads do to driver cognition and behavior. Research by Tice et al. (2021), using naturalistic driving data from 200 urban locations, found that corridor width is the single strongest predictor of 85th percentile speed (r = 0.686) — stronger than speed limits, land use, or neighborhood density.

The mechanism operates through driver attention. Wide corridors with few visual interruptions trigger a near-automatic driving state — the same attentional mode activated on highways — in which drivers perceive minimal risk and unconsciously increase speed.

The same study found that at speeds above 45 mph, drivers cannot reliably decode facial expressions — meaning they may not register a pedestrian’s presence before passing them. This is not distraction or recklessness; it is the predictable neurological response to a road environment designed for highway-speed throughput.

A natural experiment during COVID-19 lockdowns confirmed this mechanism: when pedestrian activity dropped during shelter-in-place orders, driving speeds on urban arterials increased substantially — the same roads, the same drivers, but an environment that no longer signaled human presence (Tice et al., 2021).

The Philadelphia Context

DVRPC’s 2022 Arterial Typology and Speed Management Framework identified 623 miles of arterial roads in Philadelphia — 50% PennDOT-owned, 50% city-owned — as requiring structured speed management. The report found that because most city streets already operate below target speeds during congested peak hours, standard traffic engineering’s focus on peak capacity systematically neglects the more dangerous off-peak window:

“Installing traffic calming can greatly improve target speed-compliance at off-peak times when speeding is more likely and LOS levels have minimal delay.”
— DVRPC Arterial Typology & Speed Management Framework, 2022

Research from Northeastern University (Furth et al., 2024) further demonstrated that on multilane arterials — where physical calming cannot always be deployed — redesigned signal timing (“Safe Waves”) reduced the number of speeding vehicles by 75% with fewer than 2 seconds of added travel delay per intersection. The implication: the off-peak speeding problem is not intractable. It is a design problem, and it has design solutions.

This Project

The City of Philadelphia’s Office of Transportation and Infrastructure Systems (OTIS), Office of Multimodal Planning, commissioned this study to shift from a reactive safety paradigm — responding to crash histories — toward a predictive, planning-oriented one. Rather than waiting for the next fatality to identify a dangerous corridor, the goal is to predict which road segments are structurally predisposed to off-peak speeding and simulate which interventions would reduce that risk most effectively.

1.2 Study Area

The study covers Philadelphia, Pennsylvania, with a focus on 446 unique road segments equipped with speed measurement sensors distributed across arterial and collector streets citywide. These segments span a range of road types — from major arterials like Roosevelt Boulevard to neighborhood collectors — and carry speed monitoring data from 2022 to 2025.

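Each segment is identified by a unique seg_id drawn from the Philadelphia Street Centerline dataset. All spatial analysis uses EPSG:2272 (NAD83 / Pennsylvania South), a projected coordinate system whose linear unit is the US survey foot, so distances and buffers are measured directly in feet. Layers are reprojected to EPSG:4326 only for interactive map display.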
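
As a small illustration of why the projected CRS matters, the sketch below reprojects two points from longitude/latitude to EPSG:2272 and measures the distance between them in feet. The coordinates are approximate and chosen only for demonstration; they are not taken from the project data.

```r
library(sf)

# Two illustrative points in longitude/latitude (WGS84); coordinates are
# approximate and used only for demonstration
pts_wgs84 <- st_sfc(
  st_point(c(-75.1635, 39.9526)),  # near City Hall
  st_point(c(-75.1818, 39.9557)),  # near 30th Street Station
  crs = 4326
)

# Reproject to EPSG:2272, whose linear unit is the US survey foot
pts_2272 <- st_transform(pts_wgs84, 2272)

# Distances are now expressed in feet
dist_ft <- st_distance(pts_2272[1], pts_2272[2])
dist_ft

# Buffers take their distance in feet as well: a 100 ft buffer
buf_100ft <- st_buffer(pts_2272[1], dist = 100)
st_area(buf_100ft)
```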

# Load centerlines geometry (reused in crash maps later)
centerlines_geometry <- box_read_rds(2139915462983) %>%
  mutate(seg_id = as.character(seg_id)) %>%
  st_transform(crs = "EPSG:2272")

# Load processed speed data — monitoring point coordinates + seg_id
speed_pts_raw <- box_read_rds(2193174265310)

# Monitored segments: filter centerlines to the 446 monitored seg_ids
monitored_seg_ids <- speed_pts_raw %>%
  distinct(seg_id) %>%
  mutate(seg_id = as.character(seg_id)) %>%
  pull(seg_id)

monitored_segments <- centerlines_geometry %>%
  filter(seg_id %in% monitored_seg_ids) %>%
  st_transform("EPSG:4326")

# Monitoring point locations (one point per recordnum)
speed_pts <- speed_pts_raw %>%
  distinct(recordnum, speed_measurement_longitude, speed_measurement_latitude,
           speed_measurement_road) %>%
  filter(!is.na(speed_measurement_longitude), !is.na(speed_measurement_latitude)) %>%
  st_as_sf(coords = c("speed_measurement_longitude", "speed_measurement_latitude"),
           crs = "EPSG:4326")

mapviewOptions(legend.pos = "bottomright")
mapview(monitored_segments,
        color      = palette_orange,
        lwd        = 2,
        alpha      = 0.9,
        label      = monitored_segments$road,
        layer.name = paste0("Speed sensor segments (n = ", nrow(monitored_segments), ")"),
        map.types  = "CartoDB.Positron")

1.3 Methodology Overview

Our methodology proceeds in three stages, each building on the last:

Stage 1 — Establish the problem empirically. Using five years of crash records (2020–2024), we quantify whether off-peak crashes are truly more lethal, and whether speeding-involved crashes concentrate outside peak hours. These analyses provide the empirical grounding for the modeling work.

Stage 2 — Build a predictive model. We train a Random Forest regression model to predict the percentage of vehicles speeding on a given segment during a given hour, as a function of structural road characteristics, temporal controls, and contextual features. The model achieves R² = 0.920 and a mean absolute error of 4 percentage points on held-out test data.
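
For readers unfamiliar with the two metrics, here is a minimal sketch of how R² and mean absolute error can be computed on held-out predictions with yardstick (part of tidymodels). The data and the column name `pct_speeding` are made up for illustration. Note that yardstick's `rsq()` is the squared correlation between truth and estimate; whether the project reports that or the traditional R² (`rsq_trad()`) is not stated here, so treat the choice as an assumption.

```r
library(tibble)
library(yardstick)

# Toy held-out predictions: observed vs. predicted % of vehicles speeding
# (values are made up for illustration only)
toy_preds <- tibble(
  pct_speeding = c(10, 20, 30, 40),
  .pred        = c(12, 18, 33, 36)
)

# Bundle the two metrics so they are computed in one call
eval_metrics <- metric_set(rsq, mae)

toy_metrics <- eval_metrics(toy_preds, truth = pct_speeding, estimate = .pred)
toy_metrics
```

On this toy data the absolute errors are 2, 2, 3, and 4 percentage points, so the MAE is 2.75.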

Stage 3 — Support planning decisions. The model enables counterfactual simulation: if this road were placed on a road diet (lanes reduced), or a separated bike lane added, how much would predicted speeding rates change? These outputs help planners prioritize investments where structural changes produce the largest safety gains.
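
The counterfactual idea can be sketched in a few lines: fit a model, copy the data, change one design variable, and compare predictions. The example below uses synthetic data and a plain linear model as a stand-in for the project's Random Forest; the variable names (`lanes`, `has_bikelane`, `pct_speeding`) and the effect sizes are hypothetical.

```r
set.seed(801)

# Synthetic segments (not project data): speeding share rises with lane count
toy_segments <- data.frame(
  lanes        = sample(2:6, 200, replace = TRUE),
  has_bikelane = sample(0:1, 200, replace = TRUE)
)
toy_segments$pct_speeding <- 5 + 8 * toy_segments$lanes -
  6 * toy_segments$has_bikelane + rnorm(200, sd = 3)

# Stand-in model; the project itself uses a Random Forest
toy_fit <- lm(pct_speeding ~ lanes + has_bikelane, data = toy_segments)

# Counterfactual: remove one lane from every segment with 4+ lanes
road_diet <- toy_segments
wide <- road_diet$lanes >= 4
road_diet$lanes[wide] <- road_diet$lanes[wide] - 1L

# Compare predictions under the observed and the modified design
baseline_pred <- predict(toy_fit, newdata = toy_segments)
diet_pred     <- predict(toy_fit, newdata = road_diet)
mean_change   <- mean(diet_pred[wide] - baseline_pred[wide])
mean_change   # negative = predicted speeding share falls
```

The same pattern applies to any fitted model with a `predict()` method: only the rows and columns that describe the intervention change, and everything else is held fixed.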


2. Exploratory Data Analysis

2.1 Data Sources

All data were assembled from public open data portals, PennDOT administrative databases, and client-provided files stored in a shared Box repository.

| # | Dataset | Source | Role in Analysis |
|---|---------|--------|------------------|
| 1 | Crash Records (2020–2024) | PennDOT Statewide / OpenDataPhilly | KSI validation; crash rates per segment |
| 2 | Speed Data | Philadelphia County Speed CSVs (Box) | Primary modeling outcome (% speeding) |
| 3 | Volume Data | PennDOT TIRE axle counts (Box) | Traffic volume denominator |
| 4 | Street Centerlines | OpenDataPhilly | Base network; seg_id join key |
| 5 | RMS Admin Segments | GISDATA_RMSADMIN.shp (Box) | Road geometry; surface width |
| 6 | Complete Streets / DVRPC | DVRPC | Road typology, lanes, width |
| 7 | Traffic Calming Devices | PennDOT Open Data / Box | Speed humps, cushions (road diet proxy) |
| 8 | Intersection Controls | OpenDataPhilly | Stop signs, signals (node friction) |
| 9 | Streetlights / Street Poles | OpenDataPhilly | Nighttime visibility proxy |
| 10 | Bike Network | OpenDataPhilly | Bike lane type (road diet indicator) |
| 11 | Bus Stops / Transit | SEPTA | Pedestrian activity zones |
| 12 | Red Light Cameras | PPA / OpenDataPhilly | Enforcement; driver behavior signal |
| 13 | PA TIP Projects | DVRPC / PennDOT | Planned/recent construction context |
| 14 | OpenStreetMap (OSM) | osmdata R package | Supplemental road characteristics (pre-hand-check) |
| 15 | OPA Properties Data | OpenDataPhilly | Parcel density as land-use activity proxy |

2.2 Crash Analysis (2020–2024)

2.2.1 Data Overview

# Load and harmonize five years of PennDOT crash records
crash_years <- list(
  box_read_csv("2129578929490"),  # 2020
  box_read_csv("2129584283678"),  # 2021
  box_read_csv("2129572753523"),  # 2022
  box_read_csv("2129590204560"),  # 2023
  box_read_csv("2129573871742")   # 2024
)

# Harmonize schema differences across years
all_crashes <- crash_years %>%
  map(~ mutate(.x,
               WEATHER1    = as.character(WEATHER1),
               TCD_FUNC_CD = as.character(TCD_FUNC_CD))) %>%
  bind_rows() %>%
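  filter(COUNTY == "67")    # Philadelphia only

philly_crashes <- all_crashes %>%
  mutate(
    # TIME_OF_DAY is assumed to be HHMM. Pad to four digits before slicing so
    # that values read as integers (e.g. 930 for 09:30) do not yield hour 93.
    hour_raw        = str_sub(str_pad(as.character(TIME_OF_DAY), 4, pad = "0"), 1, 2),
    HOUR_OF_DAY     = as.integer(hour_raw),
    # 99 is the "unknown time" code
    HOUR_OF_DAY     = if_else(HOUR_OF_DAY == 99, NA_integer_, HOUR_OF_DAY),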

    is_ksi          = if_else(FATAL_OR_SUSP_SERIOUS_INJ == "1", 1, 0, missing = 0),

    time_period     = case_when(
      HOUR_OF_DAY >= 7  & HOUR_OF_DAY < 10 ~ "AM Peak (7–10)",
      HOUR_OF_DAY >= 16 & HOUR_OF_DAY < 19 ~ "PM Peak (16–19)",
      HOUR_OF_DAY >= 10 & HOUR_OF_DAY < 16 ~ "Midday (10–16)",
      HOUR_OF_DAY >= 19 & HOUR_OF_DAY < 22 ~ "Evening (19–22)",
      HOUR_OF_DAY >= 22 | HOUR_OF_DAY < 7  ~ "Night/Late (22–7)",
      TRUE ~ "Unknown"
    ),

    is_peak         = if_else(
      time_period %in% c("AM Peak (7–10)", "PM Peak (16–19)"), 1, 0),
    is_weekend      = if_else(DAY_OF_WEEK %in% c(1, 7), 1, 0, missing = 0),
    SPEEDING_RELATED = as.integer(SPEEDING_RELATED)
  ) %>%
  mutate(
    time_period = factor(time_period, levels = c(
      "AM Peak (7–10)", "Midday (10–16)", "PM Peak (16–19)",
      "Evening (19–22)", "Night/Late (22–7)", "Unknown"))
  )
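
One thing worth checking when an hour is derived from an HHMM field: if the CSV reader parses the column as a number, leading zeros are lost, and taking the first two characters of 930 gives 93 rather than 9. The helper below is a hypothetical sketch (the name `hhmm_to_hour` is ours) that pads to four digits first and treats any hour outside 0–23 as missing.

```r
library(stringr)

# Hypothetical helper: extract the hour from an HHMM time-of-day value
hhmm_to_hour <- function(x) {
  padded <- str_pad(as.character(x), width = 4, pad = "0")
  hour   <- as.integer(str_sub(padded, 1, 2))
  # Treat anything outside 0-23 (e.g. the 99 "unknown" code) as missing
  ifelse(hour > 23, NA_integer_, hour)
}

hhmm_to_hour(c(5, 930, 1745, 2359, 9999))
# returns 0 9 17 23 NA
```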

The PennDOT crash records come as three sub-tables per year (crash master, flags, and roadway), which we join on CRN, the crash record number. After filtering to Philadelphia (COUNTY == "67") and harmonizing schema differences across years, the combined dataset contains more than 16,000 crashes with GPS coordinates, timestamps, severity indicators, and roadway context variables.
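
The three-way join on CRN could look roughly like the sketch below. The toy tables and every column other than CRN and COUNTY are illustrative stand-ins, not the actual PennDOT schema.

```r
library(dplyr)
library(purrr)

# Toy stand-ins for one year's sub-tables; columns other than CRN and COUNTY
# are illustrative
crash_master <- tibble::tibble(CRN = c(1, 2, 3), COUNTY = c("67", "67", "02"))
crash_flags  <- tibble::tibble(CRN = c(1, 2, 3), SPEEDING_RELATED = c(0, 1, 0))
crash_road   <- tibble::tibble(CRN = c(1, 2, 3), SPEED_LIMIT = c(25, 35, 45))

# Join all three on the crash record number, then keep Philadelphia only
crashes_joined <- list(crash_master, crash_flags, crash_road) %>%
  reduce(left_join, by = "CRN") %>%
  filter(COUNTY == "67")

crashes_joined
```

If any sub-table can hold more than one row per CRN, a left join will duplicate crashes, so a quick `count(CRN)` check on each table before joining is cheap insurance.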

Crash distributions by posted speed limit, lane count, hour of day, and day of week (2020–2024). Crashes concentrate on 25–35 mph corridors, on 2–4 lane roads, in afternoon hours, and on weekdays.

2.2.2 The Off-Peak Paradox: Fewer Crashes, More Deaths

This is the empirical heart of the project. When crash counts and KSI rates are plotted by hour of day, a striking inversion emerges:

  • Peak hours (7–10 AM, 4–7 PM): Crash volume is highest, but the KSI rate is lowest. Congestion keeps speeds down; most crashes are low-severity.
  • Night/Late hours (10 PM–7 AM): Crash volume is lowest, but the KSI rate climbs steadily, peaking between 2–4 AM.

You are far more likely to be in a fender-bender at 5 PM than at 3 AM — but far more likely to be killed or seriously injured at 3 AM than at 5 PM.
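
The hourly comparison behind this statement reduces to two numbers per hour: the crash count and the share of those crashes that are KSI. A minimal sketch on toy data (the values are invented, and only the column names follow the code above):

```r
library(dplyr)

# Toy crash records (illustrative values, not project data)
toy_crashes <- tibble::tibble(
  HOUR_OF_DAY = c(17, 17, 17, 17, 3, 3),
  is_ksi      = c(0,  0,  0,  1,  1, 0)
)

# Crash volume and KSI rate (% of crashes that are KSI) for each hour
hourly_ksi <- toy_crashes %>%
  group_by(HOUR_OF_DAY) %>%
  summarise(
    total_crashes = n(),
    ksi_rate      = mean(is_ksi) * 100,
    .groups = "drop"
  )

hourly_ksi
```

In this toy example the 5 PM hour has twice as many crashes as 3 AM but half the KSI rate (25% versus 50%), which is the shape of the inversion described above.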

2.2.3 Peak vs. Off-Peak: Statistical Test

peak_comparison <- philly_crashes %>%
  filter(!is.na(HOUR_OF_DAY)) %>%
  mutate(period_type = if_else(is_peak == 1, "Peak Hours", "Off-Peak Hours")) %>%
  group_by(period_type) %>%
  summarise(
    total_crashes  = n(),
    ksi_crashes    = sum(is_ksi),
    ksi_rate       = ksi_crashes / total_crashes * 100,
    speeding_rate  = sum(SPEEDING_RELATED, na.rm = TRUE) / total_crashes * 100
  )

# Two-sample test of equal proportions (chi-squared) for the KSI rate difference.
# Reuses peak_comparison so that crashes with an unknown hour are excluded from
# both groups; rows are ordered Off-Peak Hours, then Peak Hours.
ksi_test <- prop.test(
  x = peak_comparison$ksi_crashes,
  n = peak_comparison$total_crashes
)

KSI rate is significantly higher during off-peak hours, a difference confirmed by a chi-squared test (p < 0.05).

A chi-squared proportion test confirms that the KSI rate difference between peak and off-peak hours is statistically significant (p < 0.05), consistent across all five years of data (2020–2024).
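
For reference, the pieces of a `prop.test()` result that support a statement like this can be pulled out as shown below. The counts are illustrative only and are not the project's results.

```r
# Illustrative counts only (not the project's results):
# KSI crashes and total crashes for off-peak and peak periods
ksi_counts   <- c(off_peak = 300, peak = 100)
crash_totals <- c(off_peak = 10000, peak = 6000)

toy_test <- prop.test(x = ksi_counts, n = crash_totals)

toy_test$estimate   # KSI proportion in each group
toy_test$p.value    # p-value for the null hypothesis of equal proportions
toy_test$conf.int   # confidence interval for the difference in proportions
```

Reporting the two estimated proportions and the confidence interval alongside the p-value gives readers the size of the gap, not only its significance.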

2.2.4 Spatial Distribution

crash_summary <- box_read_csv(2195155235316) %>%
  mutate(seg_id = as.character(seg_id))

# centerlines_geometry already loaded in study-area-map chunk
crash_network <- centerlines_geometry %>%
  left_join(crash_summary, by = "seg_id") %>%
  mutate(total_crashes = replace_na(total_crashes, 0),
         ksi_crashes   = replace_na(ksi_crashes,   0))

Total Crash Density by Segment (2020–2024)

crash_with_crashes <- crash_network %>% filter(total_crashes > 0) %>% arrange(total_crashes)

ggplot() +
  geom_sf(data = crash_network, color = "#e0e4ea", linewidth = 0.15) +
  geom_sf(data = crash_with_crashes, aes(color = total_crashes), linewidth = 1.0) +
  scale_color_viridis_c(name = "Total crashes\n(2020–2024)",
                        trans = "log1p", option = "inferno",
                        direction = -1,
                        labels = scales::comma) +
  labs(title   = "Total Crash Density by Segment",
       caption = "Source: PennDOT crash records, Philadelphia Street Centerlines") +
  theme_void() +
  theme(plot.title      = element_text(size = 14, face = "bold", color = palette_blue),
        legend.position = "right")

Crashes are distributed across the entire city, with the highest concentrations along major corridors in North, West, and South Philadelphia. The spatial pattern closely tracks traffic volume — high-density street grids and major arterials dominate the crash count map.

KSI Crash Density by Segment (2020–2024)

crash_with_ksi <- crash_network %>% filter(ksi_crashes > 0) %>% arrange(ksi_crashes)

ggplot() +
  geom_sf(data = crash_network, color = "#e0e4ea", linewidth = 0.15) +
  geom_sf(data = crash_with_ksi, aes(color = ksi_crashes), linewidth = 1.0) +
  scale_color_viridis_c(name = "KSI crashes\n(2020–2024)",
                        trans = "log1p", option = "mako",
                        direction = -1,
                        labels = scales::comma) +
  labs(title   = "KSI Crash Density by Segment",
       caption = "Source: PennDOT crash records, Philadelphia Street Centerlines") +
  theme_void() +
  theme(plot.title      = element_text(size = 14, face = "bold", color = palette_blue),
        legend.position = "right")