Estimating Bike Counts in DC

You can view the full MUSA project page and our dashboard.

Instructors

Michael Fishman & Matthew Harris

Client Contacts

Executive Summary

Problem Statement: Current counter data is limited in spatial and temporal coverage, providing an incomplete picture of cycling in DC and making it challenging to fully understand ridership patterns and optimize network planning.

Use Case: Essentially addressing a problem of incomplete information, our analysis estimates and summarizes micro-mobility ridership across DC’s networks.

Key Data Challenges: Automated counters are too sparse across space to provide a representative picture of DC

Modelling Solution: Estimate peak hour counts with manual counter data, then use these estimates, augmented with other data sources, to predict hourly counts in locations without automated counters

Model Performance: Final model configurations for both the temporal and spatial models outperformed the static models used in existing workflows in Portland, Bend, Dallas, and Charlotte

Strengths and Applications: Provides an overview of bike commuting patterns and infrastructure utilization during critical time periods (peak hours), which may support planning decisions at the city-wide scale. It is not a real-time, granular, or fully “realistic” representation of cyclist volumes.

Limitations:

  1. It remains impossible to test and comment on the generalizability of the temporal model without sufficient data points across DC

  2. Inherent selection biases in the data arising from data collection decisions, such as the prioritization of key transport corridors for counter installation

  3. Areas may have daily patterns which differ substantially from conventional dual peak patterns, reducing the confidence of estimates in these areas

Recommendations:

  1. Use predictions made outside of peak hours, and in areas farther from known automated records, with caution

  2. Expand the coverage of automated counters across a greater variety of locations, which will allow for further model iteration and testing in currently unmeasured areas

  3. Utilize domain knowledge and understanding of local contexts within DC for validating model outputs

Introduction

The District Department of Transportation wishes to understand trip volumes and time-space patterns of micro-mobility (bicycle and scooter) use in Washington, DC. Estimating and summarizing such patterns will give the department a better idea of cycling mobility flows across the city, which can inform a variety of analyses and decisions surrounding infrastructure provision and interventions, or understanding relationships with equity and demographics. However, as the following sections will show, while Washington, D.C.’s bike network is extensive, the current counter data is limited in spatial and temporal coverage. This provides an incomplete picture of cycling in DC, making it challenging to fully understand ridership patterns and optimize network planning. The principal goal of this project is thus to estimate the volume of micro-mobility ridership across the city. The rest of this report is structured as follows: first, it introduces the motivation and use case, before conducting a short review of typical approaches and considerations in modelling and estimating bicycle counts. Then, it illustrates the key data issues regarding spatial and temporal sparsity of the provided counter data and outlines our predictive solution. It then explores selected predictors before comparing the performance of various model configurations. It concludes by assessing the generalizability of the final model for different demographics and settings and offering recommendations for existing workflows.

Client Background

Motivation & Use Case

Bicycle count data is important for a variety of planning purposes, including measuring changes over time, prioritizing projects, planning and designing future infrastructure, providing inputs for multi-modal models, assessing overall footfall, adjusting signals, and supporting property research, among other things (Nordback et al., 2018). The growing interest in bicycling and cycling facilities creates a need to understand how people utilize the mode and available facilities for getting to important places in their daily lives. This need has not gone unnoticed by DDOT, which has been carrying out its own data collection program for the past few years. However, as we reveal in subsequent sections, there are limits to the ability of the collected data to generalize to the entire city and its network.

Essentially addressing a problem of incomplete information, our analysis estimates and provides summaries of micro-mobility ridership across DC’s networks. By revealing the distribution and gaps in available data, our study highlights opportunities to enhance monitoring strategies, understand how patterns vary across the city with various demographic and infrastructure features, and guide targeted infrastructure improvements. This is facilitated by a dashboard which provides various functions for exploring the phenomena to aid decision-making.

Literature Review

In recent years, the need for accurate, network-wide estimates of bicycle activity has grown in response to increased investment in cycling infrastructure and a push for data-driven planning. Traditionally, cities have relied on permanent counters and short-duration manual counts to monitor bicycle traffic, yet these methods offer limited spatial and temporal coverage. As a result, researchers have developed several modeling approaches to extrapolate counts across space and time, and to integrate newer, non-traditional data sources.

The Role of Bicycle Count Data

Bicycle count data serve various planning and evaluation purposes, including measuring trends, prioritizing infrastructure investments, calibrating travel models, conducting before-and-after studies, and assessing safety and equity outcomes (Nordback et al., 2018). However, stationary counters can only reflect activity at specific points, leaving large portions of the network unmeasured. To address this, researchers have explored methods to extrapolate short-duration counts to average annual daily bicycle traffic (AADBT) using grouping techniques such as cluster analysis, spatial indexing, and travel pattern similarities (Team 2025).

Modeling Approaches

Direct Demand Models

One common strategy is the direct demand model, which regresses observed counts against built environment, network, and demographic variables. Key predictors often include:

  • Land use: employment and population density, retail access, and land use mix.
  • Transportation infrastructure: bike facility density, proximity to bridges, and centrality in the network.
  • Sociodemographics: race, age, education, and student population.
  • Accessibility: distance to jobs, schools, and the central business district.
  • Temporal factors: day of week, time of day, and weather conditions.

Poisson and negative binomial models are frequently used due to the count-based nature of the data (Team 2025), as sketched below.
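As an illustration of this family of models, the sketch below fits Poisson and negative binomial regressions in R; the dataset seg_counts and the predictor names are hypothetical stand-ins for a segment-level table of counts and built environment measures.

library(MASS)  # glm.nb for the negative binomial model

# Poisson direct demand model: observed counts regressed on
# hypothetical built environment and accessibility predictors
m_pois <- glm(
  avg_count ~ pop_density + emp_density + bike_facility_density + dist_cbd,
  family = poisson(link = "log"),
  data = seg_counts)

# Negative binomial alternative, relaxing the Poisson assumption
# that the variance equals the mean (useful under overdispersion)
m_nb <- glm.nb(
  avg_count ~ pop_density + emp_density + bike_facility_density + dist_cbd,
  data = seg_counts)

AIC(m_pois, m_nb)  # compare model fit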

Machine Learning and Data Fusion

Recent research has emphasized data fusion techniques that combine traditional counts with crowdsourced or third-party data sources, such as Strava, StreetLight, and bikeshare systems. The study by Kothuri et al. (2022) demonstrated that integrating these sources with contextual variables (termed “static” data) can enhance model performance. In their comparative study across six cities, models that combined static, Strava, and StreetLight data consistently outperformed those using single sources, especially in medium- to high-volume contexts. However, low-volume sites remained difficult to model accurately, partly due to sparse data and bias in source coverage (Kothuri et al. 2022).

The study further evaluated Random Forest regression and found it performed comparably to Poisson models, with better results in data-rich settings. While machine learning allows for flexible modeling of complex relationships, it demands substantial data and tuning to avoid overfitting or underperformance in low-data environments.

Connectivity and Network Measures

Complementing count-based models, Miah et al. (2023) explored network connectivity using graph theory and Level of Traffic Stress (LTS). Their study assessed how link-level characteristics—such as speed, lane width, and motor vehicle volume—affect ridership by user type (e.g., children vs. confident adults). The results highlighted significant connectivity disparities, with large portions of networks inaccessible to low-stress riders, thereby limiting cycling uptake in equity-seeking communities (Miah, Mattingly, and Hyun 2023).

Exploratory Analysis: Modeling Considerations

Our Target Variable: Weekday Hourly Bike Counts

Data Quality and Coverage

The locations of counters and the nature of their counts are important considerations when utilising them for modelling unknown counts, and can limit the types of models and inferences which can be made from data. Importantly, stationary counters can only tell us about the traffic passing by them, during the times which they are active.

The two counter types differ greatly in spatial and temporal coverage:

  • Manual Counters:

    • Greater density of locations, especially if aggregated across 4 years

    • Data collected only during peak hours, once per year, during time windows of 3 to 4 hours

  • Automatic Counters:

    • Poor spatial coverage, with only 3 to 4 counters per month, and a total of around 30 counters deployed in 2024

    • Provide 15-minute counts

To address this, we focused on generating estimates at an hourly timescale.

Manual Counters Spatial Distribution

The manual counters—shown in purple—are more widespread, but they’re only used occasionally and often during peak hours.

# Drop rows missing coordinates, convert to point geometry (x = lon, y = lat),
# and match the basemap CRS for plotting
manual <- manual %>% drop_na(x, y)
manual <- st_as_sf(manual, coords = c("x", "y"), crs = 4326) 
manual <- st_transform(manual, st_crs(basemap))
ggplot() +
  geom_sf(data = basemap, fill = "grey90", color = "white") +
  geom_sf(data = bike_lane, color = "grey30", size = 0.5, alpha = 0.7) +
  geom_sf(data = manual, color = "#E3D7F5", size = 3, shape = 21, fill = "#562DAB") +
  labs(title = "Manual Counters Locations") +
  theme_minimal() +  
  theme(
    legend.position = "right",   
    panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(),  
    axis.text = element_blank(), 
    axis.ticks = element_blank())

Automatic Counters Spatial Distribution

These are fairly limited in number and tend to be clustered in specific parts of the city. Because automated counters are concentrated in certain areas, they don’t reflect the diversity of biking environments across the whole network.

auto <- auto %>% drop_na(Latitude, Longitude)
auto <- st_as_sf(auto, 
                  coords = c("Longitude", "Latitude"),  # note the order: X = Longitude, Y = Latitude
                  crs = 4326)

# Clip to the DC bounding box while coordinates are still in lon/lat degrees,
# then match the basemap CRS (filtering after st_transform would compare
# projected coordinates against degree-based bounds)
auto_cut <- auto %>%
  filter(st_coordinates(.)[, 1] > -77.12 & st_coordinates(.)[, 1] < -76.90,
         st_coordinates(.)[, 2] > 38.79 & st_coordinates(.)[, 2] < 39.00) %>%
  st_transform(st_crs(basemap))
auto <- st_transform(auto, st_crs(basemap))
ggplot() +
  geom_sf(data = basemap, fill = "grey90", color = "white") +
  geom_sf(data = bike_lane, color = "grey30", size = 0.5, alpha = 0.7) +
  geom_sf(data = auto_cut, color = "#E3D7F5", size = 3, shape = 21, fill = "#9C7BD8") +
  labs(title = "Automatic Counters Locations") +
  theme_minimal() +  
  theme(
    legend.position = "right",    
    panel.grid.major = element_blank(),   
    panel.grid.minor = element_blank(),  
    axis.text = element_blank(),   
    axis.ticks = element_blank())

By definition, count data do not tell us anything about the network or times outside of the records. However, through a predictive modelling approach, we can use the data to understand how patterns differ across space and time, and predict what they might be in areas outside the temporal and spatial scope of observation.

Exploring Temporal & Spatial Variations in Bike Counts

In order to understand how patterns vary more broadly across space and time, we first explored spatial patterns using Manual count data and then the temporal patterns using automated counts.

Spatial Patterns

Temporal Patterns

Automated counts range from roughly 1 to 30 per 15-minute interval, with most intervals recording fewer than 10 riders, averaging about 30 to 40 per hour.

main %>% filter(Flow_Count>0) %>% ggplot()+
  geom_histogram(aes(x=Flow_Count),fill= "#9C7BD8", color='white')+
  xlim(0,30)+
  labs(title= 'Distribution of Automated Counts',
       subtitle='15 minute intervals')+
  theme_minimal()

Daily patterns of counts are roughly even across different days of the year. While the patterns below seem to suggest that certain months are characterised by higher than normal counts, this reflects differences in which counter locations were active rather than true seasonal effects.

main %>% st_drop_geometry() %>% filter(Flow_Count>0) %>% group_by(interval_60) %>% 
  summarise(mean_count= mean(Flow_Count)) %>% 
  ggplot()+
  geom_line(aes(x=interval_60,y= mean_count), size=0.3, color= "#9C7BD8")+
  labs(title= 'Automated Counts across study period',
       subtitle='1 hour intervals')+
  theme_minimal()

Weekday patterns differ greatly from weekend patterns, exhibiting the dual-peak pattern also commonly seen in other forms of transit (buses and rail). This strongly suggests that the recorded counts are largely commuter-centric in nature.

main %>% filter(Flow_Count>0) %>% mutate(hour = hour(interval_60),dotw = wday(interval_60, label=TRUE)) %>% group_by(dotw,hour) %>% 
   st_drop_geometry() %>%  summarise(mean_count= mean(Flow_Count)) %>% 
  ggplot()+
  geom_area(aes(x=hour,y= mean_count), fill="#9C7BD8", color='transparent')+
  labs(title= 'Automated Counts across study period',
       subtitle='1 hour intervals, by day of week')+
  facet_wrap(~dotw)+
  theme_minimal()

The twin peak pattern is also mostly generalizable to all locations, with a few notable exceptions. 1st St NE exhibited rather anomalous patterns and was excluded from the training process. Notably, sites such as East Capitol Street and the Marvin Gaye Trail have patterns which deviate slightly from the norm: the former is characterised by a much larger evening peak, while the latter has much more erratic patterns, likely due to its low average counts.

main %>% filter(Flow_Count>0) %>% mutate(hour = hour(interval_60),dotw = wday(interval_60, label=TRUE)) %>% subset(!(dotw %in% c('Sat', 'Sun'))) %>% group_by(Site.Name, hour) %>% 
   st_drop_geometry() %>%  summarise(mean_count= mean(Flow_Count)) %>% 
  ggplot()+
  geom_area(aes(x=hour,y= mean_count), fill='#9C7BD8', color='transparent')+
  labs(title= 'Automated Counts across Locations ',
       subtitle='1 hour intervals')+
       ylab('Mean Count')+
  facet_wrap( ~ Site.Name, 
           scales='free',
             ncol= 5)+
  theme_minimal()

The consistency of such temporal patterns across different sites is important, as it allows us to estimate counts across different times of the day in unknown locations, by assuming that counts vary similarly over the course of the day across locations and differ only in magnitude (i.e. the peak bike counts).
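A minimal sketch of this scaling logic, under our stated assumption of a shared daily profile: a normalized hourly profile is derived from the automated counters and rescaled by a location’s estimated peak count (the peak_estimate value here is purely illustrative).

# Shared hourly profile: each hour's mean count as a share of the peak hour
profile <- main %>%
  st_drop_geometry() %>%
  filter(Flow_Count > 0) %>%
  mutate(hour = hour(interval_60)) %>%
  group_by(hour) %>%
  summarise(mean_count = mean(Flow_Count)) %>%
  mutate(share_of_peak = mean_count / max(mean_count))

# Rescale the shared profile by an unmeasured location's estimated peak
peak_estimate <- 120  # illustrative peak hour count from the spatial model
profile <- profile %>% mutate(est_hourly_count = share_of_peak * peak_estimate)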

auto_counts %>%
  mutate(time_of_day = case_when(
    hour(interval_60) < 7 | hour(interval_60) > 18 ~ "Overnight",
    hour(interval_60) >= 7 & hour(interval_60) < 10 ~ "AM Rush",
    hour(interval_60) >= 10 & hour(interval_60) < 15 ~ "Mid-Day",
    hour(interval_60) >= 15 & hour(interval_60) <= 18 ~ "PM Rush")) %>%
  ggplot() +
  geom_histogram(aes(x = as.numeric(Flow_Count)), fill = "#9C7BD8", color = 'white') +
  labs(title = 'Distribution of Automated Counts by Time of Day') +
  facet_wrap(~time_of_day) +
  theme_minimal()

Our Approach: Estimate Spatial Patterns with Manual Counts & Temporal Patterns from Automated Counts

Typical approaches for obtaining estimates of bike counts in areas without counters can be classified into three main types: 1) direct extrapolation, 2) travel demand modelling, and 3) direct demand models.

Direct extrapolation is based on the principle of spatial autocorrelation: that areas near each other have similar patterns of counts. This often entails grouping sites with similar characteristics and using expansion factors to extrapolate short-duration counts to annual estimates. Grouping may be done visually, or through cluster analysis or spatial variables.

Travel demand models describe patterns as decisions to travel for activities using specific modes of transport; they often suffer from their large aggregation scale and the difficulty of validating them with external data.

As such, given the availability of various external data sources, direct demand models, which regress (or predict) observed counts on known characteristics of the surrounding built environment, are popular methods for estimating counts. The accuracy of such models depends on the variables selected as predictors, as well as their suitability for extrapolation to the entire network; strongly predictive characteristics not present in areas without records should be included with caution.

Crucially, the majority of reviewed approaches are concerned with obtaining annually aggregated estimates of bicycle volume; in other words, where there tend to be the most or fewest cyclists on the network. Our use case and motivation, however, require that we also understand when and how bike counts vary.

Given the consistency of hourly patterns across locations, the wealth of temporal data from automated counters allows us to estimate how patterns vary across the day, barring a few limitations and caveats which are explained further below.

As such, we propose a modelling workflow which maximizes the degree of spatial and temporal information we can obtain from the two data sources.

Overall Modelling Workflow

[Figure: model overview diagram]

Exploratory Analysis: Relevant Predictors and Data Sources

To model and estimate how flows vary across space and time, it is important to identify relevant and useful predictors.

Characteristics typically included for prediction are (Muner and Sener):

  • Socio-demographic: race/ethnicity, education, students, age

  • Network measure: centrality, bridges, bicycle facility density, number of lanes

  • Land use: population density, employment density, commercial retail density, industrial use, open space, land use mix

  • Accessibility: to jobs, bike trail entrances, transit stops, and schools; distance from the CBD

  • Temporal: weather, time of day

Based on Kothuri et al.’s study and other literature, we identified a number of variables and data sources for both the spatial and temporal models. [insert diagram]

Spatial Model

Variables chosen for the spatial model included….

Temporal Model

Given that our approach cannot rely on precise spatial grouping variables (e.g. streets) in locations without records, for the temporal model we considered various spatial variables to provide information on spatial differences in the magnitude of bicycle counts. These included predicted values from the spatial model, a kernel density estimate of the manual count data, and an engineered count feature from Capital Bikeshare data.

Capital Bikeshare Data

Our analysis reveals a characteristic “twin-peak” trend in cycling activity, closely mirroring the morning and evening commuter surges seen in other transport modes. This pattern highlights how bike usage aligns with standard workday rhythms, suggesting that cycling also plays a meaningful role during peak travel times.

By incorporating these repeated daily peaks into our modeling approach, we can better predict where and when people are most likely to ride. Understanding how traffic volume shifts at specific locations allows us to fine-tune infrastructure, plan more effectively for congestion, and ultimately deliver a safer, more efficient cycling experience throughout Washington, D.C.

# Derive hour, date, weekday, and weekend features from the counter timestamps
process_auto_counts <- function(auto_counts) {
    auto_counts$interval_60 <- as.POSIXct(auto_counts$interval_60, origin="1970-01-01", tz="UTC")
    auto_counts$hour <- hour(auto_counts$interval_60)
    auto_counts$day <- as.Date(auto_counts$interval_60)
    auto_counts$weekday <- wday(auto_counts$interval_60, week_start = 1)
    auto_counts$is_weekend <- auto_counts$weekday %in% c(6, 7)
    return(auto_counts)
}

auto_counts <- process_auto_counts(auto_counts)

# Recode missing hour/day-of-week values and flag weekends in the bikeshare data
process_cabi_data <- function(cabi_data) {
    cabi_data$hour <- ifelse(is.na(cabi_data$hour), -1, as.integer(cabi_data$hour))
    cabi_data$dotw <- ifelse(is.na(cabi_data$dotw), -1, as.integer(cabi_data$dotw))
    cabi_data$is_weekend <- cabi_data$dotw %in% c(6, 7)
    return(cabi_data)
}
cabi_data <- process_cabi_data(cabi_sum)

plot_hourly_patterns <- function(auto_counts, cabi_data) {
    hourly_auto <- auto_counts %>% group_by(hour) %>% summarize(Flow_Count = mean(Flow_Count, na.rm=TRUE))
    hourly_cabi <- cabi_data %>% group_by(hour) %>% summarize(cabi_count = mean(cabi_count, na.rm=TRUE))
    
    p1 <- ggplot(hourly_auto, aes(x = hour, y = Flow_Count, fill = "Auto Counts")) + 
        geom_bar(stat="identity", color = "#C4AEE8") + 
        scale_fill_manual(values = c("Auto Counts" = "#562DAB")) + 
        ggtitle("Average Hourly Auto Counts") + 
        xlab("Hour of Day") + 
        ylab("Average Count") + 
        theme_minimal()

    p2 <- ggplot(hourly_cabi, aes(x = hour, y = cabi_count, fill = "CABI Usage")) + 
        geom_bar(stat="identity", color = "#C4AEE8") + 
        scale_fill_manual(values = c("CABI Usage" = "#6A5ACD")) + 
        ggtitle("Average Hourly CABI Usage") + 
        xlab("Hour of Day") + 
        ylab("Average Count") + 
        theme_minimal()

    weekday_auto <- auto_counts %>% group_by(hour, is_weekend) %>% summarize(Flow_Count = mean(Flow_Count, na.rm=TRUE))

    weekday_cabi <- cabi_data %>% group_by(hour, is_weekend) %>% summarize(cabi_count = mean(cabi_count, na.rm=TRUE))

    p3 <- ggplot(weekday_auto, aes(x = hour, y = Flow_Count, color = factor(is_weekend))) + 
        geom_line(size = 1.2) + 
        scale_color_manual(values = c("FALSE" = "#C4AEE8", "TRUE" = "#6A5ACD")) + 
        ggtitle("Weekday vs Weekend - Auto Counts") + 
        xlab("Hour of Day") + 
        ylab("Average Count") + 
        theme_minimal()

    p4 <- ggplot(weekday_cabi, aes(x = hour, y = cabi_count, color = factor(is_weekend))) +
        geom_line(size = 1.2) + 
        scale_color_manual(values = c("FALSE" = "#C4AEE8", "TRUE" = "#6A5ACD")) + 
        ggtitle("Weekday vs Weekend - CABI Usage") + 
        xlab("Hour of Day") + 
        ylab("Average Count") + 
        theme_minimal()
    return(list(p1, p2, p3, p4))
}

analyze_peak_hours <- function(auto_counts, cabi_data) {
    morning_peak <- 7:9
    evening_peak <- 16:18

    auto_peaks <- data.frame(
        morning = mean(auto_counts$Flow_Count[auto_counts$hour %in% morning_peak], na.rm=TRUE),
        evening = mean(auto_counts$Flow_Count[auto_counts$hour %in% evening_peak], na.rm=TRUE),
        off_peak = mean(auto_counts$Flow_Count[!(auto_counts$hour %in% c(morning_peak, evening_peak))], na.rm=TRUE))

    cabi_peaks <- data.frame(
        morning = mean(cabi_data$cabi_count[cabi_data$hour %in% morning_peak], na.rm=TRUE),
        evening = mean(cabi_data$cabi_count[cabi_data$hour %in% evening_peak], na.rm=TRUE),
        off_peak = mean(cabi_data$cabi_count[!(cabi_data$hour %in% c(morning_peak, evening_peak))], na.rm=TRUE)
    )
    return(list(Auto_Counts = auto_peaks, CABI_Usage = cabi_peaks))
}

plots <- plot_hourly_patterns(auto_counts, cabi_data)
peak_stats <- analyze_peak_hours(auto_counts, cabi_data)
print(plots[[2]])

print(plots[[4]])

Engineered Features

Using the Capital Bikeshare data, we created bikeshare counts for every hour by reconstructing possible paths from origin-destination data. All unique origin-destination pairs were routed over a network built from the OSM extract using Rapid Realistic Multimodal Routing for R, also known as r5r. For each hour, we counted the number of times each route was taken and then spatially joined this count to each segment. The routing was based on a simple fastest-route argument with basic cycling speeds, relying on rather simple assumptions given the “broadness” of our intended prediction and the hourly nature of our target variable. The code for cleaning and route assignment may be reviewed in the appendix.
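The full cleaning and routing code is in the appendix; the fragment below is a simplified sketch of the r5r routing step, assuming an OSM extract in a data/osm folder and an OD table od with start and end coordinate columns (these names are illustrative, not the exact ones used).

library(r5r)

# Build the routable network from the OSM extract in the folder
r5r_core <- setup_r5(data_path = "data/osm")

# Fastest cycling route for each unique origin-destination pair
routes <- detailed_itineraries(
  r5r_core,
  origins      = od %>% select(id, lon = start_lng, lat = start_lat),
  destinations = od %>% select(id, lon = end_lng, lat = end_lat),
  mode = "BICYCLE",
  departure_datetime = as.POSIXct("2024-06-04 08:00:00"))

# Attach routed trips to the bike lane segments they pass through,
# so trips can then be counted per segment and hour
seg_routes <- st_join(segments, routes, join = st_intersects)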

Land Use

To understand how land use influences bike counts, we applied a 300-meter buffer around each bike lane segment. This allows us to capture the mix of land uses within close proximity to each bike route, rather than just relying on a single adjacent land use type.
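A minimal sketch of this buffering step, assuming segments is in a projected CRS measured in metres and land_use is a polygon layer with a hypothetical land_use_type column:

# 300 m buffer around each segment, joined to overlapping land use polygons
seg_buffers <- segments %>%
  st_buffer(dist = 300) %>%
  st_join(land_use, join = st_intersects)

# Land use mix within each segment's buffer
seg_lu_mix <- seg_buffers %>%
  st_drop_geometry() %>%
  count(seg_id, land_use_type)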

The highest bike counts are observed in residential, commercial, and institutional areas, indicating that these environments support frequent cycling—likely due to a mix of commuting, errands, and recreational trips.

seg_nongem <- st_drop_geometry(segments)
landuse <- seg_nongem %>%
  select(Commercial, Industrial, Institutional, Mixed, Recreational, Religious, Residential, Vacant, Other) %>%
  summarise_all(sum) %>%
  pivot_longer(cols = everything(), names_to = "Land_Use_Type", values_to = "Count")

ggplot(landuse, aes(x = reorder(Land_Use_Type, Count), y = Count, fill = Land_Use_Type)) +   
  geom_col(width = 0.7) +  # Adjust bar width (default is 0.9, reduce for thinner bars)
  coord_flip() +   
scale_fill_manual(values = c(
  "Vacant"        = "#562DAB",  # Very dark purple
  "Residential"   = "#6A3BB8",
  "Institutional" = "#7A50C7",
  "Commercial"    = "#8E68CC",
  "Religious"     = "#9C7BD8",
  "Mixed"         = "#A989E0",
  "Other"         = "#C4AEE8",
  "Industrial"    = "#D6C4F0",
  "Recreational"  = "#E3D7F5"   # Lightest
), guide = "none") +  
  labs(title = "Distribution of Bike Lanes Across Adjacent Land Use Types",
       x = "Land Use Type", y = "Count") +   
  theme_minimal() +
  theme(
    legend.position = "right",   # Keep legend on the right
    panel.grid.major = element_blank())

Census Data

We examined the spatial distribution of key census indicators across our study area to better understand the local socio-demographic context. Our maps reveal marked differences in variables such as median income, the rate of public transit commuting, and the proportion of residents who use bicycles for commuting. Areas with higher median incomes may have distinct mobility patterns compared to regions with lower incomes, and these economic disparities can influence the demand for cycling infrastructure. By overlaying these data on our geographic base map, we can begin to see how the underlying socio-economic factors correlate with the spatial distribution of transportation modes.

In addition, the visualizations of transit and bicycle commuting rates serve as essential proxies for local travel behavior. Regions with higher bicycle commuting rates, for example, might already exhibit a strong cycling culture that can support further network investments, while areas with prominent public transit usage suggest the potential for integrated transport solutions that merge cycling with existing transit services. These insights are critical for identifying data gaps in our current bike counter coverage and for forecasting future demand, ultimately guiding more targeted and effective infrastructure planning in Washington, D.C.

ggplot() +
  geom_sf(data = basemap, fill = "grey90", color = "white") +
  geom_sf(
    data = segments,
    aes(color = median_incomeE),
    alpha = 0.8) +
    scale_color_gradientn(
    colors = rev(pp),
    name = "Income",
    na.value = "grey80") +
  labs(title = "Median Income") +
  theme_minimal()

ggplot() +
  geom_sf(data = basemap, fill = "grey90", color = "white") +
  geom_sf(
    data = segments,
    aes(color = commute_bicycleE),
    alpha = 0.8) +
    scale_color_gradientn(
    colors = rev(pp),
    name = "commute_bicycle") +
  labs(title = "Bicycle Commuters") +
  theme_minimal()

ggplot() +
  geom_sf(data = basemap, fill = "grey90", color = "white") +
  geom_sf(
    data = segments,
    aes(color = pct_white),
    alpha = 0.8) +
    scale_color_gradientn(
    colors = rev(pp),
    name = "pct_white") +
  labs(title = "Percentage of White") +
  theme_minimal()

ggplot() +
  geom_sf(data = basemap, fill = "grey90", color = "white") +
  geom_sf(
    data = segments,
    aes(color = edu_bachelors_plusE),
    alpha = 0.8) +
    scale_color_gradientn(
    colors = rev(pp),
    name = " edu_bachelor_plus") +
  labs(title = "Education Level") +
  theme_minimal()

Network Characteristics

We also examined network characteristics as additional inputs to our model. Betweenness centrality measures how often a node falls on the shortest path between other nodes.
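The betweenness_max measure plotted below was precomputed; as an illustrative sketch (not our exact pipeline), edge betweenness can be derived from the bike lane geometry with sfnetworks and tidygraph:

library(sfnetworks)
library(tidygraph)

# Build a network from the bike lane lines and score each edge by how
# many shortest paths between node pairs pass through it
bike_net <- as_sfnetwork(bike_lane, directed = FALSE) %>%
  activate("edges") %>%
  mutate(betweenness = centrality_edge_betweenness())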

ggplot() +
  geom_sf(data = basemap, fill = "grey97", color = "white") +
  geom_sf(
    data = segments,
    aes(color = betweenness_max),
    alpha = 0.8) +
    scale_color_gradientn(
    colors = rev(pp),
    name = "Betweenness") +
  labs(title = "Bike Lane Network: Binned Betweenness Centrality") +
  theme_minimal()

Exploratory Data Analysis with Manual Counts

In this section, we explore how various segment-level characteristics—such as roadway design, land use, and traffic conditions—relate to observed manual bicycle counts in Washington, DC. This exploratory analysis aims to identify potential predictors of cyclist volumes and inform our modeling efforts to estimate unobserved counts citywide.

# Keep only segments that have manual counts, then attach those counts
segments_manual <- segments %>%
  semi_join(seg_manual, by = "seg_id") %>%     
  left_join(seg_manual, by = "seg_id")

Overview of Average Current Manual Counts

ggplot() +
  geom_sf(data = basemap, fill = "grey97", color = "white") +
  geom_sf(
    data = segments_manual,
    aes(color = avg_count),
    alpha = 0.8) +
    scale_color_gradientn(
    colors = rev(pp),
    name = "Average Bike Counts") +
  labs(title = "Manual Bike Counts") +
  theme_minimal()

ggplot(segments_manual, aes(x = avg_count)) +
  geom_histogram(bins = 30, fill = "#7A50C7", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Average Manual Counts",
       x = "Average Count", y = "Number of Segments")

Continuous Variables with Manual Average Counts

To better understand what factors are associated with cyclist activity, we explored how average manual bike counts (avg_count) relate to several continuous segment-level variables. Scatterplots with linear regression lines were used to examine relationships between avg_count and predictors such as annual average daily traffic (AADT), median household income, streetlight count, pavement condition index (PCI), and bike infrastructure scores. These visualizations revealed varying degrees of correlation, suggesting that both traffic volume and built environment features may play a role in influencing observed cyclist volumes on different street segments.

[Figure: scatterplots of continuous predictors against average manual counts]
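The code producing these panels is omitted above; a single panel might be sketched as follows, where aadt stands in for whichever continuous predictor is being examined:

ggplot(segments_manual, aes(x = aadt, y = avg_count)) +
  geom_point(alpha = 0.5, color = "#9C7BD8") +
  geom_smooth(method = "lm", color = "#562DAB") +  # linear trend line
  labs(title = "Average Manual Count vs AADT",
       x = "AADT", y = "Average Manual Count") +
  theme_minimal()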

Categorical Variables with Manual Average Counts

In addition to numeric predictors, we analyzed binary and categorical variables, including land use types and physical street characteristics. Boxplots were used to compare average manual counts across categories such as residential, commercial, and recreational land uses, as well as the presence or absence of buffers, parking lanes, and double yellow lines. Results suggest that segments with residential land use or buffered bike lanes generally experienced higher ridership, while others like vacant land or unbuffered segments saw lower cyclist activity. These findings provide preliminary evidence that both land use context and roadway design elements contribute to differences in observed bicycle volumes.

binary_vars <- c(
  "Commercial.x", "Industrial.x", "Institutional.x", "Mixed.x",
  "Recreational.x", "Religious.x", "Residential.x", "Vacant.x", "Other.x")

segment_binary <- segments_manual %>%
  mutate(across(all_of(binary_vars), ~ factor(., levels = c(0, 1), labels = c("No", "Yes"))))

# One boxplot per binary variable; .data[[var]] replaces the deprecated aes_string()
binary_plots <- lapply(binary_vars, function(var) {
  ggplot(segment_binary, aes(x = .data[[var]], y = avg_count)) +
    geom_boxplot(fill = "#C4AEE8", alpha = 0.5, color = "grey70") +
    theme_minimal() +
    labs(title = paste("Manual count by", var)) +
    theme(plot.title = element_text(size = 10))
})

wrap_plots(binary_plots, ncol = 3)

landuse_vars <- c(
  "Commercial.x", "Industrial.x", "Institutional.x", "Mixed.x",
  "Recreational.x", "Religious.x", "Residential.x", "Vacant.x", "Other.x"
)

# Convert 0/1 to "No"/"Yes" for better labeling
segment_lu <- segments_manual %>%
  mutate(across(all_of(landuse_vars), ~ factor(., levels = c(0, 1), labels = c("No", "Yes"))))

# Pivot to long format
segment_lu_long <- segment_lu %>%
  pivot_longer(
    cols = all_of(landuse_vars),
    names_to = "land_use_type",
    values_to = "present"
  )

ggplot(segment_lu_long, aes(x = present, y = avg_count, fill = present)) +
  stat_summary(fun = mean, geom = "bar", width = 0.8) +
  facet_wrap(~ land_use_type, ncol = 3, scales = "free_y") +  
  scale_fill_manual(values = c("No" = "#D6C4F0", "Yes" = "#562DAB")) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(size = 10),
    strip.text = element_text(size = 11, face = "bold"),
    plot.title = element_text(size = 12),
    legend.position = "none"
  ) +
  labs(
    title = "Average Manual Count by Land Use Presence",
    x = "Land Use Present",
    y = "Average Manual Count"
  )

Model

As aforementioned, our approach estimates spatial peak hour patterns with manual counts and spatial predictors, before using the peak hour estimates to supplement a temporal model for estimating hourly traffic from automated counters, as seen in the diagram below. This section briefly explains the types of models utilized, undertakes a detailed comparison of performance metrics for the two models, and then assesses the generalizability of estimates from the two final selections. Detailed code for both modelling workflows may be referred to in the appendix.

Models Utilized

As part of our exploratory analyses, we used and compared three families of models for estimating both peak hour counts across space and location-specific hourly counts within a day: regression models, random forest models, and XGBoost models, evaluated in terms of RMSE, MAPE and %RMSE.

Linear Model

Predicts the dependent variable as a linear combination of independent variables

Use-case: when the relationship between the predictors and the response is linear

Strengths:

  • ease of interpretation

  • fast to train and evaluate

Limitations:

  • assumes linearity, performing poorly on complex nonlinear patterns

  • does not handle multicollinearity well, which is often present among time predictors; the values of adjacent hours are related to one another

LASSO

A variation of regression models which uses L1 regularization to shrink some model coefficients and force others to be exactly zero (a brief sketch follows the limitations below)

Strengths:

  • L1 based feature selection.

Limitations:

  • Used only for the spatial model, as it was found to penalise hour-based fixed effects, due to their strong collinearity with other time-based fixed effects (other hours or times of day) or their low individual contribution
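A hedged sketch of such an L1-penalised fit with glmnet, where X (a numeric predictor matrix) and y (the counts) are placeholders:

library(glmnet)

# cv.glmnet tunes the penalty strength lambda by cross-validation;
# alpha = 1 selects the pure L1 (LASSO) penalty
lasso_fit <- cv.glmnet(x = X, y = y, alpha = 1)

# Coefficients at the best lambda; many are shrunk to exactly zero
coef(lasso_fit, s = "lambda.min")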

Random Forest Models

An ensemble of decision trees built using bootstrapped samples and random subsets of features at each split. Decision trees model outcomes as a series of decisions and splits along branches, which can capture the full range of variation in values and how different variables affect the outcome at different stages.

Strengths:

  • capturing complex relationships which are nonlinear with interactions

  • implicit feature selection: uninformative features are rarely chosen for splits, making the model robust to noise where we have a large number of predictors relative to observations

Limitations:

  • difficulty of interpretation: such models are often ‘black boxes’, where it is relatively difficult to understand how each variable contributes to the prediction compared to a linear model

  • slow fitting

  • requires tuning and pruning to prevent overfitting

XG-Boost

A powerful, boosted tree-based ensemble that builds trees sequentially, optimizing for the errors of prior trees using gradient descent. Trees are added iteratively to correct residual errors, and overfitting is controlled via regularisation.

Strengths:

  • capturing complex relationships which are nonlinear with interactions

  • feature selection

  • accurate and efficient and able to handle missing data

Limitations:

  • complexity of tuning process

  • interpretability

Model Evaluation Workflow

Given that most of our selected models required extensive tuning of various hyperparameters, we evaluated both models with a two-stage validation process: we first tuned the models via k-fold cross-validation, then evaluated performance by training the tuned model on the entire training dataset and testing it on the holdout set. For all tree-based models, we utilised permutation importance rather than impurity importance. Impurity importance measures how much each feature reduces the variance of the model; each time a feature is selected to split a node, the impurity reduction is added to its score. Permutation importance randomly shuffles each feature’s values across the dataset and records the impact on performance. The latter is often much better, as it is not biased by the cardinality (or continuity) of features and is less sensitive to multicollinearity; its iterative nature makes it more accurate at the expense of processing time.
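A condensed sketch of this two-stage workflow in tidymodels; object names are illustrative and the full code is in the appendix.

library(tidymodels)

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger", importance = "permutation") %>%  # permutation, not impurity
  set_mode("regression")

# Stage 1: tune hyperparameters by k-fold cross-validation on the training set
folds <- vfold_cv(train_set, v = 5)
tuned <- tune_grid(rf_spec, avg_count ~ ., resamples = folds, grid = 10)

# Stage 2: refit the best configuration on all training data,
# then evaluate once on the holdout set
final_rf <- finalize_model(rf_spec, select_best(tuned, metric = "rmse")) %>%
  fit(avg_count ~ ., data = train_set)
rf_pred <- predict(final_rf, test_set)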

Spatial Model

For the spatial model, we used an 80-20 train-test split, with training points selected at random, given that observations revealed most areas contained a manual counter. As high volume locations were relatively rare, we over-sampled locations with over 100 counts by replicating these points and adding random noise, in order to improve modelling performance. For the cross-validation process, we conducted a leave-one-group-out split using block groups, in order to improve the ability of our model to generalise across different locations in DC.
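A minimal sketch of the oversampling and grouped cross-validation steps, assuming high-volume rows are replicated with small Gaussian noise added to the target (the replication factor, noise scale, and GEOID grouping column are illustrative):

set.seed(1234)
high_vol <- train_set %>% filter(avg_count > 100)

# Replicate high-volume rows and jitter the target so duplicates
# are not exact copies
augmented <- high_vol[rep(seq_len(nrow(high_vol)), times = 3), ] %>%
  mutate(avg_count = avg_count + rnorm(n(), mean = 0, sd = 5))
train_set <- bind_rows(train_set, augmented)

# Leave-one-group-out style cross-validation over block groups
folds <- group_vfold_cv(train_set, group = GEOID)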

Temporal Model

We conducted a time-based split, training on the first 2-3 weeks of each month and testing on the last 1-2 weeks. Such a split evaluates model performance in the ‘future’ and avoids the seasonal confounds of a month-by-month split. Given the sparsity of recorded locations, we were unable to adopt a spatially differentiated approach to testing, as splitting by location would leave far too few unique observations across space to ensure generalizability. Instead, we conducted cross-validation based on wards, to account in part for spatial generalizability.
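A sketch of the time-based split, assuming the day of the month separates ‘past’ from ‘future’ within each month, and a hypothetical ward column drives the grouped cross-validation:

# Train on roughly the first three weeks of each month, test on the remainder
train_time <- auto_counts %>% filter(day(interval_60) <= 21)
test_time  <- auto_counts %>% filter(day(interval_60) > 21)

# Cross-validation grouped by ward for partial spatial generalizability
folds <- group_vfold_cv(train_time, group = ward)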

The code for both performance evaluation workflows may be referred to in the Appendix.

Spatial Model Performance

Given that the models employed are able to select features, we primarily tested the models for differences in performance with different spatial fixed effects and the addition of capital bike-share data. Four models are highlighted in the sections below for comparison, namely:

  • Predictor set + block group geoid

  • Predictor set + wards

  • Predictor set + block group geoid + capital bikeshare estimates

  • Predictor set + wards + capital bikeshare estimates

Comparing Spatial Fixed effects

Comparing the GEOID model with the wards model, utilising wards led to better overall performance across all metrics, with a much lower root mean squared error of 21 compared to 23. This unsurprisingly suggests that the additional variation provided by the much more granular block groups leads to overfitting. The performance improvements are evident across locations of varying volumes. In both configurations, the random forest regressor performed best, albeit minimally so.

With Geoid
load("Spatial_geoid.rData")
# average error for each model
ggplot(data = OOF_preds %>% 
         dplyr::select(model, RMSE) %>% 
         distinct(), 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "#562DAB") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  theme_bw()

test_set %>% select(avg_count) %>% cbind(rf_pred) %>% summarise(
  MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
  SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((avg_count - .pred)^2)),
  RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100) %>% 
    kbl() %>%
  kable_styling()
MAPE        SMAPE       RMSE        RMPSE
63.31059    43.60809    23.84898    113.4964
test_set %>% mutate(load= ifelse(avg_count<50, "low", ifelse(avg_count<100, "mid","high"))) %>% select(avg_count, load) %>% cbind(rf_pred) %>%
  group_by(load) %>%  summarise(
  MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
  SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((avg_count - .pred)^2)),
  RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100, 
  avg_count=mean(avg_count))%>% 
    kbl() %>%
  kable_styling()
load    MAPE        SMAPE       RMSE        RMPSE       avg_count
high    15.04321    16.90076    35.48023    18.74176    181.54412
low     91.35428    58.52571    20.67447    142.68820   21.33500
mid     18.56912    21.15760    14.49096    22.44028    65.19231
With Wards
load("Spatial_wards.rData")

# average error for each model
ggplot(data = OOF_preds %>% 
         dplyr::select(model, RMSE) %>% filter(model!='glmnet') %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_bar(stat="identity", fill='#6A5ACD') +
  # geom_label(aes(label = paste0(round(RMSE,1)))) +
  theme_minimal()

test_set %>% select(avg_count) %>% cbind(rf_pred) %>% summarise(
  MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
  SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((avg_count - .pred)^2)),
  RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100)%>% 
    kbl() %>%
  kable_styling()
MAPE        SMAPE      RMSE        RMPSE
55.02091    40.0635    21.47979    100.2898
test_set %>% mutate(load= ifelse(avg_count<50, "low", ifelse(avg_count<100, "mid","high"))) %>% select(avg_count, load) %>% cbind(rf_pred) %>%
  group_by(load) %>%  summarise(
  MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
  SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((avg_count - .pred)^2)),
  RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100,
  avg_count=mean(avg_count))%>% 
    kbl() %>%
  kable_styling()
load    MAPE        SMAPE       RMSE        RMPSE       avg_count
high    12.26562    13.36645    30.24426    14.95944    181.54412
low     79.22764    54.34520    19.42408    126.07106   21.33500
mid     17.82886    20.04539    13.85533    21.73968    65.19231

Capital Bikeshare addition

Adding Capital Bikeshare estimates led to a reduction in overall model performance, with root mean squared error worsening by roughly 1 across both model types. This suggests that the dataset, which provides rich information about bikeshare ridership across the city, may have led to overfitting, with the model relying too heavily on the information present in CaBi at the expense of accuracy. This is perhaps unsurprising, considering that Capital Bikeshare users represent a rather small subset of the population of bicyclists in DC.

Wards with Cabi
load("Spatial_wards_cabi.rData")

# average error for each model
ggplot(data = OOF_preds %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "#562DAB") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  theme_bw()

test_set %>% select(avg_count) %>% cbind(rf_pred) %>% summarise(
  MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
  SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((avg_count - .pred)^2)),
  RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100
  )%>% 
    kbl() %>%
  kable_styling()
MAPE        SMAPE       RMSE        RMPSE
55.26133    39.98131    22.19401    100.758
test_set %>% mutate(load= ifelse(avg_count<50, "low", ifelse(avg_count<100, "mid","high"))) %>% select(avg_count, load) %>% cbind(rf_pred) %>%
  group_by(load) %>%  summarise(
  MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
  SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((avg_count - .pred)^2)),
  RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100,
  avg_count=mean(avg_count))%>% 
    kbl() %>%
  kable_styling()
load    MAPE        SMAPE       RMSE        RMPSE       avg_count
high    12.56056    13.83365    31.74215    15.73629    181.54412
low     79.47908    53.92993    19.80589    126.60565   21.33500
mid     17.95560    20.52588    14.31423    22.38721    65.19231
Geoid with cabi
load("Spatial_geoid_cabi.rData")

# average error for each model
ggplot(data = OOF_preds %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
  aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "#562DAB") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  theme_bw()

test_set %>% select(avg_count) %>% cbind(rf_pred) %>% summarise(
  MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
  SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((avg_count - .pred)^2)),
  RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100
  )%>% 
    kbl() %>%
  kable_styling()
MAPE        SMAPE       RMSE        RMPSE
62.75195    42.94631    23.19022    113.2666
test_set %>% mutate(load= ifelse(avg_count<50, "low", ifelse(avg_count<100, "mid","high"))) %>% select(avg_count, load) %>% cbind(rf_pred) %>%
  group_by(load) %>%  summarise(

  MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
  SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((avg_count - .pred)^2)),
  RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100,
  avg_count=mean(avg_count))%>% 
    kbl() %>%
  kable_styling()
load    MAPE        SMAPE       RMSE        RMPSE       avg_count
high    13.78588    15.20753    33.40671    16.95501    181.54412
low     91.06790    58.30417    20.72968    142.50011   21.33500
mid     17.87702    20.15139    14.04603    21.74086    65.19231

Performance across different volume locales

All four models exhibit the worst performance in terms of RMSE but the best performance in terms of mean absolute percentage error within high volume locales, highlighting how errors need to be considered in context. The spatial wards model had an RMSE of 30 in high volume locations averaging more than 150 counts, but this corresponded to a MAPE of 12 and an RMPSE of 15, relatively low percentage errors. Comparatively, an RMSE of 19 is associated with an RMPSE of 126 in low volume locations, despite such absolute errors likely being inconsequential for most planning decisions. This is most evident from the plot below, which plots the RMSE alongside scatter points of counts, illustrating how most of the true errors are rather low but are amplified when considered in proportion to the lower counts.

# Use the same load thresholds (50/100) in both layers so bars and points agree
test_set %>%
  mutate(load = ifelse(avg_count < 50, "low", ifelse(avg_count < 100, "mid", "high"))) %>%
  select(avg_count, load) %>% cbind(rf_pred) %>% group_by(load) %>%
  summarise(
    MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
    SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
    RMSE = sqrt(mean((avg_count - .pred)^2)),
    RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100,
    avg_count = mean(avg_count)) %>%
  ggplot() +
  geom_col(aes(x = load, y = RMSE), fill = "#562DAB") +
  geom_jitter(test_set %>%
                mutate(load = ifelse(avg_count < 50, "low", ifelse(avg_count < 100, "mid", "high"))) %>%
                select(avg_count, load) %>% cbind(rf_pred),
              mapping = aes(x = load, y = avg_count), width = 0.3, alpha = 1, color = "#E3D7F5") +
  theme_minimal()

Temporal Model Performance

The task of predicting for unknown locations without records means that we cannot rely on past data, such as time lags, to make estimates, as such information does not exist for the unknown locations we are predicting for. However, there is strong evidence that bicycle counts vary rather similarly across most of DC, so a reasonable estimate can be derived by modelling hourly counts as a function of spatial and temporal variation at each location.

For the temporal model, we were hence interested in which spatially-varying predictors provided the best performance improvements when modelled alongside simple time-based features such as time of day and hour. At earlier stages of the modelling process, we also found that taking the logarithm of bicycle counts (a transformation sketched after the list below) and including block group fixed effects greatly improved model performance. In the interest of clarity and conciseness, the rest of the section highlights five models for comparison, namely:

  • No spatial-varying predictors

  • Cabi with KDE

  • CABI, KDE, Spatial predictions

  • CABI & spatial predictions

  • Spatial Predictions & KDE
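A minimal sketch of the log transform mentioned above, using log1p to handle zero counts (an assumption on our part) and back-transforming predictions to the count scale:

# Model on log(1 + count) to stabilise variance in the skewed target
train_set <- train_set %>% mutate(log_count = log1p(Flow_Count))

# ... fit the model on log_count ...

# Back-transform predictions to the original count scale
rf_pred <- rf_pred %>% mutate(.pred = expm1(.pred))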

Train and Test Data Specification

Model Performance

log without cabi, kde, spatial

The base model with only basic time fixed effects has a rather high RMSE of 133 and a high MAPE of 133%. This poor performance is attributable to the fact that the model essentially predicts hourly bike counts as a function of time and distinguishes locations only by block group; most of the variation across locations is hence unaccounted for.

load("Time_log_TOD_nopred_nocabi_nokde_rmse133.Rdata")
test_set %>% 
  select(Flow_Count) %>% 
  cbind(rf_pred) %>% 
  summarise(
    MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
    SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
    RMSE = sqrt(mean((Flow_Count - .pred)^2)),
    RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100)%>% 
    kbl() %>%
  kable_styling()
MAPE        SMAPE       RMSE        RMPSE
133.6887    114.5037    133.3843    420.3019
test_set %>% mutate(load= ifelse(Flow_Count<50, "low", ifelse(Flow_Count<150, "mid","high"))) %>% select(Flow_Count, load) %>% cbind(rf_pred) %>% 
  group_by(load) %>%  summarise(
  
  MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
  SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((Flow_Count - .pred)^2)),
  RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
  Flow_Count=max(Flow_Count)) %>% 
    kbl() %>%
  kable_styling()
load    MAPE         SMAPE       RMSE        RMPSE       Flow_Count
high    69.33513     111.9741    320.08043   72.98628    1560
low     166.40720    116.0051    11.90840    514.27378   48
mid     70.27283     111.1914    67.48759    71.33713    147
log with cabi, kde without spatial predictions

Including capital bikeshare and kernel density estimates is able to improve performance substantially, essentially halving the error rates across most metrics.

load("Time_log_TOD_cabi_kde_nopred_rmse68.Rdata")
# # average error for each model
# ggplot(data = OOF_preds %>% 
#          dplyr::select(model, RMSE) %>% 
#          distinct() , 
#        aes(x = model, y = RMSE, group = 1)) +
#   geom_path(color = "red") +
#   geom_label(aes(label = paste0(round(RMSE,1)))) +
#   theme_bw()
# 
# # OOF predicted versus actual
# ggplot(OOF_preds, aes(x = Flow_Count, y = .pred,
#                       group = model
#                       )) +
#   geom_point(alpha = 0.3) +
#   geom_abline(linetype = "dashed", color = "red") +
#   geom_smooth(method = "lm", color = "blue") +
#   # coord_equal() +
#   facet_wrap(~model, nrow = 2) +
#   theme_bw()
test_set %>% 
  select(Flow_Count) %>% 
  cbind(rf_pred) %>% 
  summarise(
    MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
    SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
    RMSE = sqrt(mean((Flow_Count - .pred)^2)),
    RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100)%>% 
    kbl() %>%
  kable_styling()
MAPE        SMAPE       RMSE        RMPSE
74.04041    89.90652    68.70751    198.1228
test_set %>% mutate(load= ifelse(Flow_Count<50, "low", ifelse(Flow_Count<150, "mid","high"))) %>% select(Flow_Count, load) %>% cbind(rf_pred) %>% 
  group_by(load) %>%  summarise(
  MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
  SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((Flow_Count - .pred)^2)),
  RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
  Flow_Count=max(Flow_Count))%>% 
    kbl() %>%
  kable_styling()
load    MAPE        SMAPE        RMSE        RMPSE       Flow_Count
high    30.62307    36.99418     161.41049   36.25464    1560
low     94.10425    113.94237    12.87767    241.54565   48
mid     38.90830    48.67405     42.36437    51.15077    147
log with cabi, kde with spatial predictions

Including all spatially engineered features, with capital bikeshare data, spatial predictions and the kernel density estimates decreases model performance, suggesting overfitting.

load("Time_log_TOD_cabi_kde_pred_rmse70.Rdata")

# # average error for each model
# ggplot(data = OOF_preds %>% 
#          dplyr::select(model, RMSE) %>% 
#          distinct() , 
#        aes(x = model, y = RMSE, group = 1)) +
#   geom_path(color = "red") +
#   geom_label(aes(label = paste0(round(RMSE,1)))) +
#   theme_bw()
# 
# # OOF predicted versus actual
# ggplot(OOF_preds, aes(x = Flow_Count, y = .pred,
#                       group = model
#                       )) +
#   geom_point(alpha = 0.3) +
#   geom_abline(linetype = "dashed", color = "red") +
#   geom_smooth(method = "lm", color = "blue") +
#   # coord_equal() +
#   facet_wrap(~model, nrow = 2) +
#   theme_bw() 
test_set %>% 
  select(Flow_Count) %>% 
  cbind(rf_pred) %>% 
  summarise(
    MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
    SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
    RMSE = sqrt(mean((Flow_Count - .pred)^2)),
    RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100)%>% 
    kbl() %>%
  kable_styling()
MAPE        SMAPE       RMSE       RMPSE
73.63031    90.07854    70.3944    205.4105
test_set %>% mutate(load= ifelse(Flow_Count<50, "low", ifelse(Flow_Count<150, "mid","high"))) %>% select(Flow_Count, load) %>% cbind(rf_pred) %>% 
  group_by(load) %>%  
  summarise(
  MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
  SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((Flow_Count - .pred)^2)),
  RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
  Flow_Count=max(Flow_Count))%>% 
    kbl() %>%
  kable_styling()
load    MAPE        SMAPE        RMSE        RMPSE       Flow_Count
high    31.03486    37.67099     165.59025   36.69641    1560
low     93.30793    114.00810    12.98295    250.53798   48
mid     39.18761    48.77115     42.85780    51.67716    147
log with cabi, spatial predictions no kde

The model with CaBi data and spatial predictions alone does not perform as well as the previous model, with an RMSE of 73. However, it interestingly has a lower %RMSE (RMPSE) than the model with all spatial variables in areas with medium volumes (mean < 150), highlighting the importance of considering different model configurations for the intended planning decision and locale.

load("Time_log_TOD_cabi_rmse73.Rdata")




# # average error for each model
# ggplot(data = OOF_preds %>% 
#          dplyr::select(model, RMSE) %>% 
#          distinct() , 
#        aes(x = model, y = RMSE, group = 1)) +
#   geom_path(color = "red") +
#   geom_label(aes(label = paste0(round(RMSE,1)))) +
#   theme_bw()
# 
# # OOF predicted versus actual
# ggplot(OOF_preds, aes(x = Flow_Count, y = .pred,
#                       group = model
#                       )) +
#   geom_point(alpha = 0.3) +
#   geom_abline(linetype = "dashed", color = "red") +
#   geom_smooth(method = "lm", color = "blue") +
#   # coord_equal() +
#   facet_wrap(~model, nrow = 2) +
#   theme_bw()
test_set %>% 
  select(Flow_Count) %>% 
  cbind(rf_pred) %>% 
  summarise(
    MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
    SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
    RMSE = sqrt(mean((Flow_Count - .pred)^2)),
    RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100)%>% 
    kbl() %>%
  kable_styling()
    MAPE  SMAPE    RMSE    RMPSE
84.79347    NaN 73.3175 342.3214
test_set %>% mutate(load= ifelse(Flow_Count<50, "low", ifelse(Flow_Count<150, "mid","high"))) %>% select(Flow_Count, load) %>% cbind(rf_pred) %>% 
  group_by(load) %>%  summarise(
  MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
  SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((Flow_Count - .pred)^2)),
  RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
  Flow_Count=max(Flow_Count))%>% 
    kbl() %>%
  kable_styling()
load       MAPE    SMAPE      RMSE     RMPSE  Flow_Count
high   31.91663 40.08549 173.07089  37.87136        1560
low   109.97425      NaN  13.88127 419.80960          48
mid    39.16885 49.98051  41.91161  48.30286         147
Log with Spatial Predictions and KDE, No CaBi

The model with the best overall performance includes the spatial predictions and kernel density estimates and excludes Capital Bikeshare features, with an RMSE of 65.5. Performance across locations and times of different volumes is also much better than in the other two models.

load("Time_log_TOD_no_cabi_kde.Rdata")
test_set %>% 
  select(Flow_Count) %>% 
  cbind(rf_pred) %>% 
  summarise(
    MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
    SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
    RMSE = sqrt(mean((Flow_Count - .pred)^2)),
    RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100
  )%>% 
    kbl() %>%
  kable_styling()
    MAPE  SMAPE     RMSE    RMPSE
81.29022    NaN 65.49156 215.6808
test_set %>% mutate(load= ifelse(Flow_Count<50, "low", ifelse(Flow_Count<150, "mid","high"))) %>% select(Flow_Count, load) %>% cbind(rf_pred) %>% 
  group_by(load) %>%  summarise(
  MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
  SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((Flow_Count - .pred)^2)),
  RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
  Flow_Count=mean(Flow_Count))%>% 
    kbl() %>%
  kable_styling()
load       MAPE    SMAPE      RMSE     RMPSE  Flow_Count
high   27.49657 32.57583 153.18023  34.05269   386.81405
low   106.11811      NaN  13.24458 263.42707    12.24130
mid    37.87989 46.81794  41.63423  50.26581    86.62022

From the plot of values and RMSE below, we similarly observe that errors are low relative to actual values for high- and medium-volume locations. For low-volume locations, raw prediction errors are also small despite higher percentage errors. Nonetheless, the impact of these errors will depend on the specific application of the model.

test_set %>% group_by(Site_Name) %>% 
 mutate(load= ifelse(mean(Flow_Count)<50, "low", ifelse(mean(Flow_Count)<100, "mid","high")) )%>% 
  select(Flow_Count,load, time_of_day) %>% cbind(rf_pred) %>% group_by( load,time_of_day) %>%  summarise(
  MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
  SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((Flow_Count - .pred)^2)),
  RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
  Flow_Count= mean(Flow_Count)
)%>% ggplot() +
  geom_col(aes(x=time_of_day, y=RMSE), fill = "#E3D7F5", color = "grey30")+
  geom_jitter(test_set %>% group_by(Site_Name) %>% 
 mutate(load= ifelse(mean(Flow_Count)<50, "low", ifelse(mean(Flow_Count)<100, "mid","high")) )%>% 
  select(Flow_Count,load, time_of_day) %>% cbind(rf_pred),mapping=aes(x=time_of_day, y=Flow_Count),width=0.3, alpha=0.1, color = "#6A3BB8")+
  # geom_col(aes(x=hour, y=MAPE))+
    facet_wrap(~load)+
  theme_minimal()

Final Model Configurations

Model Generalizability

Having selected the best model as our final model, we conducted a series of tests of generalizability across space and time.

Temporal Model

Building on the spatial strengths of our model, our temporal predictions also show a distinct pattern of accuracy. The temporal model performs best during daytime peak hours, with significantly lower error rates between 7-10am and 4-7pm.

The blue bars show average flow count throughout the day, while the red bars represent the Mean Absolute Percentage Error. Notice how the error rates decrease substantially during precisely the hours when cycling volumes are highest. This is particularly valuable because daytime peak hours are when bike lanes operate near capacity and when planning decisions have the greatest impact on cyclist safety and efficiency.

test_set %>% select(Flow_Count,hour) %>% cbind(rf_pred) %>% group_by(hour) %>%  summarise(
  MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
  SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((Flow_Count - .pred)^2)),
  RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
  Flow_Count=mean(Flow_Count)
)%>% ggplot() +
# geom_col(aes(x=hour, y=RMSE))+
  geom_col(aes(x=hour, y=MAPE), fill= 'red', alpha=0.5) +
  geom_col(aes(x=hour, y=Flow_Count), fill= 'blue', alpha=0.3) +
  theme_minimal() +
  theme()

Figure: Temporal MAPE by hour of day

Spatial Model

Looking at the geographic distribution of our spatial model's performance, the model achieves its highest accuracy when predicting bike counts in lanes within the Central Business District. On the map below, darker purple dots represent locations with lower error rates, which are clearly concentrated in the CBD area.

Figure: MAPE & Average Counts
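
A map along these lines can be reproduced from the final predictions. The following is a sketch rather than the exact dashboard code; it assumes test_set and rf_pred are row-aligned, that test_set retains a seg_id key, and that an sf segments layer carries the geometry for each seg_id.

# Sketch: site-level MAPE map (assumptions noted above)
test_set %>%
  cbind(rf_pred) %>%
  group_by(seg_id) %>%
  summarise(MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100) %>%
  left_join(segments %>% select(seg_id), by = "seg_id") %>%
  st_as_sf() %>%
  ggplot() +
  geom_sf(aes(color = MAPE), size = 3) +
  scale_color_viridis_c() +  # viridis: darker purple = lower MAPE
  theme_minimal() +
  labs(title = "Spatial distribution of MAPE across counter locations")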

Moreover, beyond technical performance, we also evaluated our model through an equity lens. An important consideration for any urban planning tool is whether it performs equally well across different communities. Interestingly, our analysis shows that the model performs better within minority-populated areas, with lower error rates on average compared to majority-white areas.

Figure: MAPE & Percentage of White Population
test_set %>% select(avg_count,GEOID,pct_white,pct_black,pct_asian, pct_hispanic) %>% cbind(rf_pred) %>% mutate(maj_white= ifelse(pct_white>pct_black+pct_asian+pct_hispanic, "Yes","No")) %>% group_by(maj_white) %>%  summarise(
  MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
  SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
  RMSE = sqrt(mean((avg_count - .pred)^2)),
  RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100,
  avg_count=mean(avg_count)) %>%
  pivot_longer(
    cols = c(MAPE, SMAPE, RMSE, RMPSE, avg_count), 
    names_to = "metric",
    values_to = "value"
  ) %>% drop_na() %>% ggplot()+ 
  geom_bar(mapping=aes(fill = maj_white,x = metric, y=value),stat = "identity",
           position = position_dodge(width = 1)) +
  scale_fill_manual(values = c("Yes" = "#E3D7F5", "No" = "#6A5ACD"))+
  theme_minimal()

Findings

Performance comparisons with other cities (Kothuri et al. 2022)

- Our peak-hour static estimates are more accurate than existing estimates made in other cities, with a much lower MAPE overall.

- The model also performs exceptionally well within high-volume locales, especially in comparison to other cities.

- The hourly dynamic estimates likewise perform much better than the static estimates reported in that study.

Observations

Feature importance

To demonstrate the relative importance of features for both models, the feature importance of the model configurations containing all features is presented for comparison. As previously mentioned, these values are based on permutation importance, i.e., how much the model's error increases when a feature's values are randomly shuffled, averaged over several permutation cycles.
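
For intuition, permutation importance can be computed by hand along the following lines (a minimal sketch, not ranger's internal implementation; `fit` and `df` are hypothetical placeholders for a fitted workflow and its data):

# Minimal sketch of permutation importance (hypothetical objects: a fitted
# workflow `fit` and a data frame `df` containing the outcome column)
permutation_importance <- function(fit, df, outcome = "Flow_Count", n_cycles = 5) {
  rmse_of  <- function(d) sqrt(mean((d[[outcome]] - predict(fit, d)$.pred)^2))
  baseline <- rmse_of(df)
  sapply(setdiff(names(df), outcome), function(p) {
    mean(replicate(n_cycles, {
      shuffled <- df
      shuffled[[p]] <- sample(shuffled[[p]])  # break the feature's link to the outcome
      rmse_of(shuffled) - baseline            # importance = increase in error
    }))
  })
}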

Spatial Model

Notably, Capital Bikeshare estimates have the greatest influence, followed by the percentage of white population and being located in Ward 2. These are followed by network and street characteristics such as PCI, the infrastructure score and the presence of dual protected lanes.

# Recolor the bars via vip's `aesthetics` argument; adding a second geom_col()
# on top of the vip plot would draw a duplicated set of bars
p <- vip::vip(full_fit, num_features = 30, aesthetics = list(fill = "#7A50C7"))
p

Temporal Model

For the temporal model, the manual KDE has the greatest permutation importance, followed by whether the time of day is overnight. Capital Bikeshare counts are relatively less important. Each hour-of-day dummy receives a roughly similar importance value, highlighting how the random forest captures the association between individual hours and volumes, whereas a LASSO regression would shrink many of these dummies to exactly zero (and a ridge regression would shrink them heavily) because each individually explains little variance.
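
This contrast can be illustrated directly by fitting a penalized regression on the hour dummies alone (an illustrative sketch using the assumed `dat` object, not part of our workflow):

# Illustration: LASSO shrinks individually weak hour dummies to exactly zero,
# whereas the random forest can still exploit them in splits
# (assumes `dat` holds a factor hour and a numeric Flow_Count)
library(glmnet)
X <- model.matrix(Flow_Count ~ hour, data = dat)[, -1]  # one-hot hour dummies
lasso_fit <- cv.glmnet(X, dat$Flow_Count, alpha = 1)    # alpha = 1 -> LASSO
coef(lasso_fit, s = "lambda.1se")  # weak hour coefficients are typically exactly zero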

# Same recoloring approach as above, again avoiding a duplicated bar layer
p2 <- vip::vip(full_fit, num_features = 30, aesthetics = list(fill = "#7A50C7"))
p2

Conclusions and Recommendations

What can our model not tell us?

“All models are wrong, but some models are useful” – George Box

It is important to highlight the inherent limitations of our predictions arising from data sparsity. While we attempt, as far as possible, to ensure the generalizability of the model across different and unknown locations, the low number of automated counter locations means that the true accuracy of the model within much of DC remains unknown and untestable. Our modelling workflow attempts to circumvent this through spatial cross-validation during the tuning phase, but it is not possible to test and comment on the generalizability of the temporal model without sufficient data points across DC for a rigorous spatial train-test validation process.
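
As a reminder of how that spatial cross-validation works (this mirrors the Spatial Model Workflow code in the appendix), each tuning fold holds out all observations from one block group, so hyperparameters are always scored on spatially unseen areas:

# Leave-one-group-out spatial CV (mirrors the appendix workflow; assumes
# train_set carries a GEOID column): each fold's assessment set is an
# entire block group the model never saw during training
library(rsample)
cv_splits_geo <- group_vfold_cv(train_set, group = "GEOID")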

The selection of automated counter sites also introduces an inherent bias to our model which cannot be addressed by model specifications or workflows. Selection biases may arise from data collection decisions, such as the prioritization of key transport corridors during the site selection process. This means that training and testing are restricted to these types of locations, which may admittedly be one reason for the consistency of patterns observed across automated counter locations.

The strong accuracy and generalizability of our spatial model provides a good degree of confidence in the temporal model's accuracy during peak hours in unknown locations. Nonetheless, given the aforementioned caveats and limitations, predicted values outside of these peak hours should be used with caution, particularly within locales far from recorded locations, and above all where existing knowledge suggests daily patterns that differ substantially from the conventional dual-peak pattern. Observations from our exploratory analysis suggest this may be a greater issue for areas with low volume counts.

What can our model tell us?

This model and its outputs should be seen as an initial step in estimating and summarizing bike counts across DC. At the current stage of data completeness, it is likely useful for making general conclusions about how bicycle volumes differ across various neighbourhoods in DC across broad periods of the day. As our dashboard demonstrates, this is nonetheless useful for visualizing how biking patterns vary across DC and how they may relate to various built environment and demographic characteristics. Most of our modelling considerations have focused on enabling such broad, non-granular analyses. For instance, we opted for a simple kernel density estimate instead of more complex network-based interpolation measures that account for cyclists being bound to the street network. Information from more complex and accurate features will become more useful once counts are available at a greater spatial granularity, supporting models that can capture the true complexity of biking patterns in DC. Where more complete flow data exists, approaches such as gravity models or Bayesian flow networks could be employed to model bicycle counts; these are currently unfeasible given the large data requirements of such approaches.

Recommendations

As such, much remains to be done to test and validate this model. In the long term, it is recommended that DDOT expand the coverage of its automated counters and do so across a greater variety of locations, which will allow for further model iterations and testing in currently unknown locations. This process will likely reveal rather different conclusions about which features are important and also generate different conclusions about bicycling patterns in DC. As data coverage improves, more accurate and generalizable conclusions can be made using this model.

Where constraints surrounding data collection exist, domain knowledge and an understanding of local contexts within DC will be extremely useful for validating model outputs. This may range from simply asking "does this value make sense in this location?" to conducting detailed site surveys and interviews. The flaws inherent in the data and this model are a strong reminder of the limits of data-driven approaches to urban practice. We can only model as well as the data allows, and where much remains unknown, it is imperative that urban practitioners evaluate model outputs from a critical perspective.

Appendices: Sample Code

Capital Bikeshare

cabi <- read.csv('202408-capitalbikeshare-tripdata.csv')

cabi$duration <- difftime( as.POSIXct( cabi$ended_at),as.POSIXct( cabi$started_at), units = "hours") 
library(sf)
library(tidyr)
library(dplyr)
cabi <- cabi[!is.na(cabi$end_lat),]
start_points <- st_as_sf(cabi, coords = c("start_lng", "start_lat"), crs = 4326)
end_points <- st_as_sf(cabi, coords = c("end_lng", "end_lat"), crs = 4326)
cabi$distance_km <- as.numeric(st_distance(start_points, end_points, by_element = TRUE))/1000
cabi <- cabi %>%  cbind(cabi %>% unite("OD", start_lng,start_lat,end_lng,end_lat, sep = "_") %>% select(OD))
trip_sum= cabi %>% mutate(speed= distance_km/as.numeric(duration)) %>%  group_by(started_at,start_lng,start_lat,end_lng,end_lat, OD) %>% summarise(mean_speed=mean(speed), trip_count= n())

library(ggplot2)
library(sf)
# install.packages('r5r')
# Sys.setenv(JAVA_HOME = "C:/Program Files/Eclipse Adoptium/jdk-21.0.6.7-hotspot/bin/java.exe") 
library(r5r)
options(java.parameters = "-Xmx8G")
r5r_core <- setup_r5(data_path = 'C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/Practicum DCDOT/Cabi')
# library(rJava)
dc_bound <- st_read('Washington_DC_Boundary.geojson')

  street_network <- street_network_to_sf(r5r_core) 
origins_sf <-  trip_sum %>% st_as_sf(coords = c("start_lng", "start_lat"), crs = 4326)
ggplot() +
  geom_sf(data = street_network$vertices, color = "gray", size = 0.3) +
  geom_sf(data = origins_sf) +
  geom_sf(data = dc_bound, fill = NA) +  # fill = NA so the boundary does not hide the points
  theme_minimal() +
  labs(title = "R5R Street Network and Origins", color = "Origin ID")


origins_sf <- st_as_sf(trip_sum, coords = c("start_lng", "start_lat"), crs = 4326)
destinations_sf <- st_as_sf(trip_sum, coords = c("end_lng", "end_lat"), crs = 4326)

# Ensure CRS matches
dc_bound <- st_transform(dc_bound, crs = st_crs(origins_sf))

# Filter origins and destinations inside the shapefile
origins_within <- st_intersects(origins_sf,dc_bound, sparse = FALSE)
destinations_within <- st_intersects(destinations_sf, dc_bound, sparse = FALSE)

# Keep only rows where both origin and destination are inside the shapefile
trip_filtered <- trip_sum[origins_within & destinations_within, ]  %>% ungroup() %>% mutate(id=row_number())

#snap points
snapped_points <- find_snap(r5r_core, trip_filtered %>% rename('lon'='start_lng', 'lat'='start_lat') %>% select (id, lat, lon))
trip_filtered$start_lng <- snapped_points$lon
trip_filtered$start_lat <- snapped_points$lat

snapped_points <- find_snap(r5r_core, trip_filtered %>% rename('lon'='end_lng', 'lat'='end_lat') %>% select (id, lat, lon))
trip_filtered$end_lng <- snapped_points$lon
trip_filtered$end_lat <- snapped_points$lat
trip_filtered <- trip_filtered %>% filter(trip_count>10)


all_itineraries <- list()

# Loop through each trip
for (i in 1:nrow(trip_filtered)) {
  # Extract trip details
  origin <- data.frame(
    id = i,
    lon = trip_filtered$start_lng[i],
    lat = trip_filtered$start_lat[i]  # use the filtered data consistently (was trip_sum)
  )
  
  destination <- data.frame(
    id = i,
    lon = trip_filtered$end_lng[i],
    lat = trip_filtered$end_lat[i]
  )
  
  bike_speed <- trip_filtered$mean_speed[i]+3
  
  # Ensure bike speed is positive; otherwise, set a default value
  if (bike_speed <= 0) {
    bike_speed <- 12  # default to 12 km/h if the observed speed is non-positive (r5r expects km/h)
  }
  
  # Generate itinerary
  itinerary <- detailed_itineraries(
    r5r_core,
    origins = origin,
    destinations = destination,
    mode = "BICYCLE",
    bike_speed = bike_speed
  )
  
  # Store results
  all_itineraries[[i]] <- itinerary
  print(i)
}

bike_trips_df <- do.call(rbind, all_itineraries)

trip_sum$started_at <- as.POSIXct(trip_sum$started_at)  # convert in place (cabi_sf is not created until below)
library(lubridate)
trip_sum<- trip_sum %>% mutate(
              hour= hour(started_at),
              dotw= wday(started_at))
trip_sum <- trip_sum %>%group_by(hour,dotw,OD) %>% summarise(trip_count=sum(trip_count))
cabi_sf <- trip_sum %>% right_join(trip_filtered %>% select(OD,id))

cabi_sf <- bike_trips_df %>% select(from_id,geometry) %>% right_join(cabi_sf %>% mutate(id=as.character(id)), by= c("from_id"="id"))

Centrality

library(tidygraph)
library(sfnetworks)

seg_network <- st_cast(seg_network, "LINESTRING")
seg_network <- as_sfnetwork(seg_network)

seg_network <- seg_network %>%
  activate("nodes") %>%
  mutate(
    degree = centrality_degree(),         # Number of direct connections
    betweenness = centrality_betweenness(), # Control over information flow
    closeness = centrality_closeness()      # Average distance to all nodes
  )
library(igraph)

# Convert to an igraph object for connectivity analysis
graph_obj <- as.igraph(seg_network)

# Check if the network is fully connected
network_connected <- is_connected(graph_obj)
print(network_connected)

# Identify connected components
components_info <- components(graph_obj)
print(components_info)

Spatial Model Workflow

set.seed(717)

library(spatialsample)
# library(tidymodels)
library(recipes)
library(rsample)
library(parsnip)
library(workflows)
library(yardstick)
library(dplyr)
library(lubridate)
library(tidyverse)
library(tune)
library(sf)
theme_set(theme_bw())

"%!in%" <- Negate("%in%")
g <- glimpse

nn_function <- function(measureFrom,measureTo,k) {
  library(FNN)
  nn <-   
    FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}


get_week_of_month <- function(date) {
  week(date) - week(floor_date(date, "month")) + 1
}
#######train test split######
setwd('C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/GeoCloud/DC_DOT')

cabi_sum <- read.csv('Data/cabi_sum.csv') 
segments <- st_read("C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/GeoCloud/DC_DOT/Data/segments_network.geojson"
)
cabi_max <- cabi_sum %>% group_by(seg_id) %>% summarise(cabi_count=max(cabi_count))

# cabi_sum <- cabi_sum[cabi_sum$hour %in% c(7,8,9,10, 15, 16, 17, 18),] 
load("C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/GeoCloud/DC_DOT/model/data_joined(1).RData")
manual_counts <- manual_234 %>% st_drop_geometry()
  # read.csv("Data/manual_21_to_24.csv")

# Function to extract the numeric hour and an AM/PM flag from a time string
extract_hour_ampm <- function(time_str) {
  time_str <- str_trim(time_str)  # Remove extra spaces
  
  # Extract AM/PM
  am_pm <- ifelse(grepl("AM", time_str), "AM", "PM")
  
  # Extract numeric hour
  hour <- as.numeric(str_extract(time_str, "^[0-9]+"))
  
  return(data.frame(Hour = hour, AM_PM = am_pm, stringsAsFactors = FALSE))
}


manual <- segments %>%right_join(manual_counts %>% select(-SUBBLOCKKEY,-SUBBLOCKID) %>% ungroup(), by='seg_id' )%>%  filter(!is.na(Total))%>% 
  mutate(
    Start_Hour_AMPM = lapply(Start_Time, extract_hour_ampm),
    End_Hour_AMPM = lapply(End_Time, extract_hour_ampm)
  ) %>%
  tidyr::unnest_wider(Start_Hour_AMPM, names_sep = "_") %>%
  tidyr::unnest_wider(End_Hour_AMPM, names_sep = "_") %>%
  rename(Start_Hour = Start_Hour_AMPM_Hour, Start_AMPM = Start_Hour_AMPM_AM_PM,
         End_Hour = End_Hour_AMPM_Hour, End_AMPM = End_Hour_AMPM_AM_PM)%>%
  mutate(
    Hours_Duration = ifelse(End_AMPM == Start_AMPM, End_Hour - Start_Hour,
                            (12 - Start_Hour) + End_Hour)  # Adjust if AM/PM changes
  ) %>% group_by(seg_id, Start_AMPM) %>% mutate(Total=sum(Total))%>% ##remove direction and get totals
  ##
  mutate(avg_count= Total/Hours_Duration) %>% 
  ungroup() 
# centrality <- read.csv("Data/network.csv")


main <- manual[ ,c(1,2:4, 10:18, 22:63, 95, 126,137, 141,146,148, 164, 168)] 
main %>% write.csv("manual_full.csv")
main <- main[!is.na(main$seg_id), ]
main <- main %>% mutate(Ward=as.factor(Ward))
main <- main %>% left_join(cabi_max)
# main <- main %>% left_join(centrality)
# main <- main %>% st_drop_geometry() %>% select(-geometry)
# main <- main %>% mutate(obs_weight = if_else(avg_count > 100, 10, if_else(avg_count > 25, 5, 1)))


# View(main[is.na(main$SUBBLOCKKEY), ])

### Initial Split for Training and Test
data_split <- initial_split(main , strata = "avg_count", prop = 0.80)
train_set <- training(data_split)
test_set  <- testing(data_split)

# This generates synthetic examples focused on rare/extreme values
train_set <- train_set %>%
  mutate(count_bin = cut(avg_count, breaks = 5)) 
train_set %>% ggplot(aes(x=count_bin))+geom_bar()
rare_bins <- train_set %>%
  group_by(count_bin) %>%
  summarise(count = n()) %>%
  filter(count < 50)  # 
# Select high-count rows
high_count_rows <- train_set %>% filter(avg_count > 100)

# Create synthetic rows by adding random noise to the high-count values
# (four noisy copies of each high-count row; adjust sd to control the noise level)
synthetic_rows <- purrr::map_dfr(1:4, ~ high_count_rows %>%
  mutate(avg_count = avg_count + rnorm(n(), mean = 0, sd = 5)))

# Combine original data with synthetic rows
train_set_synthetic <- bind_rows(train_set, synthetic_rows)
train_set_synthetic %>% ggplot(aes(x=count_bin))+geom_bar()
# Replace with newer spatialsample package approach
library(spatialsample)

# Spatial CV using newer approach

## LOGOCV on Neighborhood with group_vfold_cv()
cv_splits_geo <- group_vfold_cv(train_set,  
                                group = "GEOID")



# Get the first fold's split object
fold_1 <- cv_splits_geo$splits[[1]]

# Extract the analysis (training) set from the first fold
analysis_set_1 <- analysis(fold_1)

# Extract the assessment (validation) set from the first fold
assessment_set_1 <- assessment(fold_1)

### Create Recipes
test_set%>% select(where(~ is.factor(.) || is.character(.))) %>% colnames()
# Feature Creation
model_rec <- recipe(avg_count ~ ., data = train_set_synthetic
  %>% select(-seg_id,-count_bin,-BLOCKKEY, -has_parking,
  -has_buffer, -has_doubleyellow, 
  -GEOID #geoid,,
  # -cabi_count,
  # -Ward
  # -TOTALBIKELANES.x, -TOTALBIKELANEWIDTH.x,
  # -TRAIL_CLASS, -CENTERLINE,- LIGHTING
  ))%>% 
  update_role(Ward, new_role = "Neighborhood") %>%
  # update_role(obs_weight, new_role = "case_weight") %>% 
  step_other(Ward, threshold = 0.005) %>%
  step_mutate_at(all_predictors()&all_numeric(), fn = ~ ifelse(is.na(.), 1e-6, .)) %>%
  step_novel(all_nominal()) %>% 
  step_unknown(all_nominal()) %>%
  step_dummy(all_nominal(), one_hot = TRUE) %>% 
# step_log(avg_count, offset = 1, skip = TRUE) %>%
# step_nzv(all_predictors()&all_numeric(),-avg_count,freq_cut = 95/5, unique_cut = 10) %>%
  step_normalize(where(is.double),-avg_count)


# # step_zv(all_predictors()) %>%
# step_center(where(is.double) )%>%
#   step_scale(where(is.double) )

# step_ns(y,x, options = list(df = 4),keep_original_cols=T)

glimpse(model_rec %>% prep() %>% juice())

# transform new data
model_rec %>% prep() %>%  bake(new_data = main %>%select(-seg_id,-BLOCKKEY,-has_parking, -has_buffer, -has_doubleyellow,
                                                         
                                                         # -TOTALBIKELANES.x, -TOTALBIKELANEWIDTH.x,
                                                         # -TRAIL_CLASS, -CENTERLINE,- LIGHTING
                                                         ))

cor_matrix <- cor(main %>% select(where(is.numeric)),use='complete.obs')
cor_matrix['avg_count',]
########################################model testing###################################


## Model specifications
lm_plan <- 
  linear_reg() %>% 
  set_engine("lm", case.weights = TRUE)

glmnet_plan <-
  linear_reg() %>%
  set_args(penalty  = tune()) %>%
  set_args(mixture  = tune()) %>%
  set_engine("glmnet",case.weights = TRUE)%>%
  set_mode("regression")

rf_plan <- rand_forest() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 2000) %>%  # parsnip's argument is `trees`, not `n_trees`
  set_engine("ranger", importance = "permutation", case.weights = TRUE) %>%  
  set_mode("regression")


XGB_plan <- boost_tree() %>%
  set_args(trees = tune()) %>%  
  set_args(learn_rate = tune()) %>%  # Learning rate (shrinkage)
  set_args(tree_depth = tune()) %>%  # Maximum tree depth
  set_args(min_n = tune()) %>%       # Minimum node size
  # set_args(colsample_bytree = tune()) %>% # Feature subsampling
  set_args(early_stopping_rounds = 10) %>%       # Early stopping
  set_engine("xgboost",validation = 0.2, case.weights = TRUE,
             # objective="count:poisson",
             lambda=0.1, alpha=0.1) %>% 
  set_mode("regression")


# Hyperparameter grid for glmnet (penalization)
glmnet_grid <- expand.grid(
  penalty = 10^seq(-4, 2, length.out = 50),  # Wider range of penalties
  mixture = seq(0, 1, by = 0.25)
)

# Hyperparameter grid for random forest (tuning mtry and min_n)
rf_grid <- expand.grid(
  mtry = c( 13, 30, 45, 80, 110), 
  min_n = c(5, 50, 100, 500)
)

# Hyperparameter grid for XGBoost (tuning n_estimators with early stopping)
xgb_grid <- expand.grid(
  trees = seq(10, 800, by = 50),  
  learn_rate = c(0.01, 0.1, 0.3),  
  tree_depth = c( 5, 6, 7),  
  min_n = c( 5, 50, 80) 
  # colsample_bytree = c(0.5, 0.7, 1)
)
#ntrees for xgboost does not need to be as big

# create workflow
lm_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(lm_plan)
glmnet_wf <-
  workflow() %>%
  add_recipe(model_rec) %>%
  add_model(glmnet_plan)
rf_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(rf_plan)
xgb_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(XGB_plan)


# fit model to workflow and calculate metrics
control <- control_grid(
  save_pred = TRUE,         # ensures predictions are saved
  extract = function(x) x,  # stores the full model object
  verbose = TRUE)
metrics <- metric_set(rmse, rsq, mape, smape)
lm_tuned <- lm_wf %>%
  tune::fit_resamples(.,
                      resamples = cv_splits_geo,
                      control   = control,
                      metrics   = metrics)

glmnet_tuned <- glmnet_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo,
                  grid      = glmnet_grid,
                  control   = control,
                  metrics   = metrics)

rf_tuned <- rf_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metrics)

xgb_tuned <- xgb_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metrics)


# ## metrics across grid
# collect_metrics(xgb_tuned)
# autoplot(xgb_tuned)
# collect_metrics(glmnet_tuned)



## 'Best' by some metric and margin
show_best(lm_tuned, metric = "rmse", n = 15)
show_best(glmnet_tuned, metric = "rmse", n = 15)
show_best(rf_tuned, metric = "rmse", n = 15)
show_best(xgb_tuned, metric = "rmse", n = 15)

lm_best_params     <- select_best(lm_tuned, metric = "rmse"    )
glmnet_best_params <- select_best(glmnet_tuned, metric = "rmse")
rf_best_params     <- select_best(rf_tuned, metric = "rmse"    )
xgb_best_params    <- select_best(xgb_tuned, metric = "rmse"   )


## Final workflow
lm_best_wf     <- finalize_workflow(lm_wf, lm_best_params)
glmnet_best_wf <- finalize_workflow(glmnet_wf, glmnet_best_params)
rf_best_wf     <- finalize_workflow(rf_wf, rf_best_params)
xgb_best_wf    <- finalize_workflow(xgb_wf, xgb_best_params)


# last_fit() emulates the process where, after determining the best model, the final fit on the entire training set is needed and is then evaluated on the test set.
lm_val_fit_geo <- lm_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metrics)

glmnet_val_fit_geo <- glmnet_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metrics)

rf_val_fit_geo <- rf_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metrics)

xgb_val_fit_geo <- xgb_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metrics)

##########################################################

# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds <- collect_predictions(lm_tuned) 

glmnet_best_OOF_preds <- collect_predictions(glmnet_tuned) %>%
  filter(penalty  == glmnet_best_params$penalty[1] & mixture == glmnet_best_params$mixture[1])

rf_best_OOF_preds <- collect_predictions(rf_tuned) %>%
  filter(mtry  == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])

xgb_best_OOF_preds <- collect_predictions(xgb_tuned) %>%
  filter(min_n == xgb_best_params$min_n[1]& .config == xgb_best_params$.config)

# collect validation set predictions from last_fit model
lm_val_pred_geo     <- collect_predictions(lm_val_fit_geo)
glmnet_val_pred_geo <- collect_predictions(glmnet_val_fit_geo)
rf_val_pred_geo     <- collect_predictions(rf_val_fit_geo)
xgb_val_pred_geo    <- collect_predictions(xgb_val_fit_geo)




# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds <- rbind(
                  data.frame(dplyr::select(lm_best_OOF_preds, .pred, avg_count), model = "lm") ,
                   data.frame(dplyr::select(glmnet_best_OOF_preds, .pred, avg_count), model = "glmnet"),
                   data.frame(dplyr::select(rf_best_OOF_preds, .pred, avg_count), model = "rf"),
                   data.frame(dplyr::select(xgb_best_OOF_preds, .pred, avg_count), model = "xgb")
                   # %>% mutate(.pred=expm1(.pred))
                  ) %>% 
  group_by(model) %>% 
  mutate(
         RMSE = yardstick::rmse_vec(avg_count, .pred),
         MAE  = yardstick::mae_vec(avg_count, .pred),
         MAPE = yardstick::mape_vec(avg_count, .pred)) %>% 
  ungroup() %>% 
  mutate(model = factor(model, levels=c(
    "lm",
    "glmnet",
    "rf",
    "xgb"
    )))



# average error for each model
ggplot(data = OOF_preds %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "red") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  theme_bw()

# OOF predicted versus actual
ggplot(OOF_preds, aes(x = avg_count, y = .pred, group = model)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  # coord_equal() +
  facet_grid(~model) +
  ylim(0,300)+
  theme_bw()
full_fit    <-rf_best_wf %>% fit(train_set)

full_fit_lm     <- rf_best_wf %>% fit(main)


rf_pred_pm = predict(full_fit_lm, new_data = segments %>% left_join(manual_counts %>% select(seg_id,Ward, TOTALBIKELANES, TOTALBIKELANEWIDTH,
                                                                                             TRAIL_CLASS, CENTERLINE, LIGHTING), by='seg_id') %>% left_join(cabi_max) %>%
                       # rename(TOTALBIKELANES.x= TOTALBIKELANES, TOTALBIKELANEWIDTH.x= TOTALBIKELANEWIDTH) %>%

                       mutate(Start_AMPM= 'PM',
                                 Type.y=ifelse(is.na(road_class), "Trails", "Lanes"
                                               ),
                              Ward=as.factor(Ward)))
# rf_pred_pm$.pred<- rf_pred_pm$.pred %>% expm1()

rf_pred_am = predict(full_fit_lm, new_data = segments %>% left_join(manual_counts %>% select(seg_id,Ward, TOTALBIKELANES, TOTALBIKELANEWIDTH,
                                                                                                          TRAIL_CLASS, CENTERLINE, LIGHTING),by='seg_id') %>% left_join(cabi_max)%>%
                                    # rename(TOTALBIKELANES.x= TOTALBIKELANES, TOTALBIKELANEWIDTH.x= TOTALBIKELANEWIDTH) %>%

                                    mutate(Start_AMPM= 'AM',
                                           Type.y=ifelse(is.na(road_class), "Trails", "Lanes"
                                           ),
                                           Ward=as.factor(Ward)))
# rf_pred_am$.pred<- rf_pred_am$.pred %>% expm1()


segments %>% left_join(manual_counts %>% select(seg_id,Ward, TOTALBIKELANES, TOTALBIKELANEWIDTH,
                                                        TRAIL_CLASS, CENTERLINE, LIGHTING),by='seg_id') %>% select(seg_id,GEOID,Ward) %>%
  cbind(rf_pred_am,rf_pred_pm) %>% distinct() %>%
  st_write('spatial_predictions6.csv', append = FALSE)

save(full_fit, test_set, rf_pred,OOF_preds, file = "model/Time_log_no_time_of_day_kde.RData")

Temporal Model Workflow

load.fun <- function(x) { 
  x <- as.character(x) 
  if(isTRUE(x %in% .packages(all.available=TRUE))) { 
    eval(parse(text=paste("require(", x, ")", sep=""))) 
    print(paste(c(x, " : already installed; requiring"), collapse=''))
  } else { 
    #update.packages()
    print(paste(c(x, " : not installed; installing"), collapse=''))
    eval(parse(text=paste("install.packages('", x, "')", sep=""))) 
    print(paste(c(x, " : installed and requiring"), collapse=''))
    eval(parse(text=paste("require(", x, ")", sep=""))) 
  } 
} 

########### Required Packages ###########
packages = c(
  # Core ML
  "tidymodels", "tidyverse", 
  # Specific model packages
  "glmnet", "ranger", "xgboost", 
  # Newer spatial packages
  "sf", "stars", "terra", "spatialsample", "nngeo", "mapview",
  # Newer extensions
  "finetune", "themis", "applicable", "vetiver",
  # Existing packages
  "bayesplot", "lme4", "RcppEigen", "AmesHousing"
)

for (i in seq_along(packages)) {
  load.fun(as.character(packages[i]))
}

set.seed(717)
theme_set(theme_bw())

"%!in%" <- Negate("%in%")
g <- glimpse

nn_function <- function(measureFrom,measureTo,k) {
  library(FNN)
  nn <-   
    FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}


get_week_of_month <- function(date) {
  week(date) - week(floor_date(date, "month")) + 1
}
#######train test split######
setwd('C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/GeoCloud/DC_DOT/')
segments <-read.csv(
#   # "Data/Lanes_KumaHao.geojson"
  "spatial_predictions6.csv"  # written by the spatial model workflow above
  )
load("data_joined(1).RData")
# auto_counts <- st_read("Data/auto_2024_by_hr.csv")
auto_counts$interval_60 <- as.numeric(auto_counts$interval_60)
auto_counts$interval_60 <- as.POSIXct(auto_counts$interval_60, origin = "1970-01-01", tz = "UTC")
library(lubridate)
auto_counts <- auto_counts %>% 
  mutate(WkoMn = get_week_of_month(interval_60))

# full_weeks %>% select(Site_Name) %>% distinct()

cabi_sum <- read.csv('C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/GeoCloud/DC_DOT/Data/cabi_sum.csv') %>% select(-X) %>% drop_na()
main <- segments  %>% left_join(auto_counts, by= "seg_id") %>% mutate(Flow_Count= as.numeric(Flow_Count)) %>%
  filter(!is.na(interval_60)) %>% 
  mutate(hour = as.numeric(hour(interval_60)),dotw = as.numeric(wday(interval_60, label=TRUE))) %>% st_drop_geometry()
main <- main %>% 
  left_join(cabi_sum, by= c('seg_id','hour', 'dotw')) 

main <- main  %>% 
  mutate(cabi_count = ifelse(is.na(cabi_count), 0, cabi_count))

# main <- main %>% drop_na()
main$hour= as.factor(main$hour)
main$.pred <- as.numeric( main$.pred)
main$.pred.1 <- as.numeric( main$.pred.1)
library(ggplot2)


main_summary <- main %>%
  count(Site_Name, dotw, WkoMn) %>% mutate(count=if_else(n>0, 1, 0))


# ggplot(weekdays, aes(y= Flow_Count, x= hour)) +geom_point()
# weekdays %>% group_by(hour) %>% summarise(mean=mean(Flow_Count)) %>% View()

ggplot(main_summary, aes(x = dotw, y = as.factor(WkoMn), fill = count)) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(name = "Data Collected") +
  facet_wrap(~ Site_Name, scales = "free_y") +  # One heatmap per site
  labs(
    title = "Data Collection by Week of Month and Day of Week",
    x = "Day of the Week", y = "Week of the Month"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

main <- main %>% mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                             time_of_day = case_when(hour(interval_60) < 7 | hour(interval_60) > 18 ~ "Overnight",
                                                     hour(interval_60) >= 7 & hour(interval_60) < 10 ~ "AM Rush",
                                                     hour(interval_60) >= 10 & hour(interval_60) < 15 ~ "Mid-Day",
                                                     hour(interval_60) >= 15 & hour(interval_60) <= 18 ~ "PM Rush"))

main <- main %>% group_by(Site_Name,interval_60) %>% mutate(Flow_Count= sum(Flow_Count)) %>% ungroup() %>% select(-Flow_Name,-Flow_Identifier) %>% distinct() ### remove directionality by summing counts across both flow directions at each site and hour

bg <- st_read("Data/bg/bg2020.GEOJSON")
bg$GEOID <-bg$GEOCODE
manual_clean <- manual_234[complete.cases(st_coordinates(manual_234)), ] %>% st_transform(3857) 
# create KDE surface of the manual counts (sf.kde from spatialEco; bw in meters under EPSG:3857)
library(spatialEco)
kde_hp = sf.kde(manual_clean%>% select(Total), y=manual_clean$Total, bw=1000)
kde_df <- as.data.frame(kde_hp, xy = TRUE) %>%
  rename(density = 3)  
# map the result
ggplot(kde_df, aes(x = x, y = y, fill = density)) +
  geom_raster() +
  scale_fill_viridis_c(option = "plasma") +
  coord_equal() +
  theme_minimal() +
  labs(title = "Kernel Density Estimation", fill = "Density") 
# Convert sf points to SpatVector
pts_vect <- vect(main %>% st_as_sf())

# Extract KDE values
kde_vals <- extract(kde_hp, pts_vect)

# Combine with original sf object
main$manual_kde <- kde_vals[,2] 

# library(tidymodels)
library(rsample)
library(parsnip)
library(workflows)
library(yardstick)
main <- main %>% mutate(YearMonth = floor_date(interval_60, "month"))
main$GEOID <- as.factor(main$GEOID)
main$Ward<- as.factor(main$Ward)
main$dotw <- as.factor(main$dotw)
dat <- main %>% select(1:6, 8,9 ,12:18) %>% ungroup()%>%
  mutate(month=month(interval_60)) %>% 
    filter(month==8, weekend=="Weekday")
random_week_per_site <- dat %>%
  group_by(Site_Name) %>%
  distinct(WkoMn) %>%  # Get unique weeks per site
  filter(WkoMn %in% c(4:6)) %>% filter(Site_Name!='1st St NE') %>% 
  filter(Site_Name!= 'R Street NW')



# test_month <- sample(unique(dat$YearMonth), 1)
# 3. Create train and test sets (just key columns for matching)
test_set <- dat %>%
  inner_join(random_week_per_site, by = c("Site_Name", "WkoMn"))

train_set <- dat %>%anti_join(test_set)

# 4. Match and extract indices using row-wise logic
train_indices <- dat%>%
  mutate(row_id = row_number()) %>%
  semi_join(train_set, by = c("Site_Name", "interval_60")) %>%
  pull(row_id)

test_indices <- dat %>%
  mutate(row_id = row_number()) %>%
  semi_join(test_set, by = c("Site_Name", "interval_60")) %>%
  pull(row_id)

# 5. Create rsample split
library(rsample)
data_split <- make_splits(
  list(analysis = train_indices, assessment = test_indices),
  data = dat
)


## LOGOCV on Neighborhood with group_vfold_cv()
cv_splits_geo <- group_vfold_cv(train_set,  
                                group = "Ward")

## Accessing CV split data
glimpse(cv_splits_geo[1,1][[1]][[1]][[1]])

# Get the first fold's split object
fold_1 <- cv_splits_geo$splits[[1]]

# Extract the analysis (training) set from the first fold
analysis_set_1 <- analysis(fold_1)

# Extract the assessment (validation) set from the first fold
assessment_set_1 <- assessment(fold_1)

### Create Recipes
test_set%>% select(where(~ is.factor(.) || is.character(.))) %>% colnames()

# dat$Flow_Count <- log1p(dat$Flow_Count)
# train_set$Flow_Count <- log1p(train_set$Flow_Count)
# test_set$Flow_Count <- log1p(test_set$Flow_Count)
model_rec <- recipe(Flow_Count ~ ., data = train_set %>% 
                      select(-seg_id, -Site_Name,-GEOID, -interval_60, -month,
                             -WkoMn,-cabi_count)) %>%
  update_role(Ward, new_role = "Neighborhood") %>%
  step_other(Ward, threshold = 0.005) %>%
  step_normalize(where(is.double), -Flow_Count) %>% 
  step_unknown(all_nominal()) %>% 
  step_dummy(all_nominal(), one_hot = TRUE) %>% 
  # step_interact(~ hour:.pred) %>%   # Specify interaction term correctly
  step_log(Flow_Count, offset = 1, skip = TRUE)
    
# step_nzv(all_numeric(),-Flow_Count,freq_cut = 95/5, unique_cut = 10) %>%

  # step_zv(all_predictors()) %>%
  # step_center(all_predictors(), -Flow_Count, -has_buffer, -has_parking, -has_doubleyellow) %>%
  # step_scale(all_predictors(), -Flow_Count, -has_buffer, -has_parking, -has_doubleyellow)
  
# step_ns(Latitude, Longitude, options = list(df = 4),keep_original_cols=TRUE)

glimpse(model_rec %>% prep() %>% juice())

# transform new data
model_rec %>% prep() %>%  bake(new_data = dat)



set.seed(42)
## Model specifications
lm_plan <- 
  linear_reg() %>% 
  set_engine("lm")

glmnet_plan <-
  linear_reg() %>%
  set_args(penalty  = tune()) %>%
  set_args(mixture  = tune()) %>%
  set_engine("glmnet")%>%
  set_mode("regression")


rf_plan <- rand_forest() %>%
  set_args(mtry  = tune()) %>%
  set_args(trees = 2000) %>%  # parsnip's argument is `trees`, not `n_trees`
  set_args(min_n = tune()) %>%
  set_engine("ranger", importance = "permutation") %>%  
  set_mode("regression")

XGB_plan <- boost_tree() %>%
  set_args(trees = tune()) %>%  
  set_args(learn_rate = tune()) %>%  # Learning rate (shrinkage)
  set_args(tree_depth = tune()) %>%  # Maximum tree depth
  set_args(min_n = tune()) %>%       # Minimum node size
  # set_args(colsample_bytree = tune()) %>% # Feature subsampling
  set_args(early_stopping_rounds = 10) %>%        # Early stopping
  set_engine("xgboost",validation = 0.2) %>% 
  set_mode("regression")

# poisson_mod <- 
#   poisson_reg() %>% 
#   set_engine("glm") %>% 
#   set_mode("regression")


# Hyperparameter grid for glmnet (penalization)
glmnet_grid <- expand.grid(
  penalty = 10^seq(-4, 2, length.out = 50),  # Wider range of penalties
  mixture = seq(0, 1, by = 0.25)
)

# Hyperparameter grid for random forest (tuning mtry and min_n)
rf_grid <- expand.grid(
  mtry = c( 45, 80, 110), 
  min_n = c(5, 50, 100, 500)
)
# Hyperparameter grid for XGBoost (tuning n_estimators with early stopping)
xgb_grid <- expand.grid(
  trees = seq(50, 500, by = 50),  
  learn_rate = c(0.01, 0.1, 0.3),  
  tree_depth = c( 5, 7, 15,25),  
  min_n = c( 50, 200, 400) 
  # colsample_bytree = c(0.5, 0.7, 1)
)


library(poissonreg)
# create workflow
lm_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(lm_plan)
# poisson_wf <- workflow() %>%  
  # add_recipe(model_rec) %>% 
  # add_model(poisson_mod)
  
glmnet_wf <-
  workflow() %>%
  add_recipe(model_rec) %>%
  add_model(glmnet_plan)
rf_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(rf_plan)
xgb_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(XGB_plan)





# fit model to workflow and calculate metrics
control <- control_resamples(save_pred = TRUE, verbose = TRUE)
metrics <- metric_set(rmse, rsq, mape, smape)
lm_tuned <- lm_wf %>%
  tune::fit_resamples(.,
                      resamples = cv_splits_geo,
                      control   = control,
                      metrics   = metrics)

glmnet_tuned <- glmnet_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo,
                  grid      = glmnet_grid,
                  control   = control,
                  metrics   = metrics)

rf_tuned <- rf_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metrics)

xgb_tuned <- xgb_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metrics)

## 'Best' by some metric and margin
show_best(lm_tuned, metric = "rmse", n = 15)
# show_best(glmnet_tuned, metric = "rmse", n = 15)
show_best(rf_tuned, metric = "rmse", n = 15)
show_best(xgb_tuned, metric = "rmse", n = 15)

lm_best_params     <- select_best(lm_tuned, metric = "rmse"    )
# glmnet_best_params <- select_best(glmnet_tuned, metric = "rmse")
rf_best_params     <- select_best(rf_tuned, metric = "rmse"    )
xgb_best_params    <- select_best(xgb_tuned, metric = "rmse"   )


## Final workflow
lm_best_wf     <- finalize_workflow(lm_wf, lm_best_params)
# glmnet_best_wf <- finalize_workflow(glmnet_wf, glmnet_best_params)
rf_best_wf     <- finalize_workflow(rf_wf, rf_best_params)
xgb_best_wf    <- finalize_workflow(xgb_wf, xgb_best_params)


# last_fit() emulates the process where, after determining the best model, the final fit on the entire training set is needed and is then evaluated on the test set.
lm_val_fit_geo <- lm_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metrics)

# glmnet_val_fit_geo <- glmnet_best_wf %>% 
#   last_fit(split     = data_split,
#            control   = control,
#            metrics   = metrics)

rf_val_fit_geo <- rf_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metrics)

xgb_val_fit_geo <- xgb_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metrics)
# poisson_val_fit_geo <- poisson_wf %>% 
#   last_fit(split     = data_split,
#            control   = control,
#            metrics   = metrics)

##########################################################

# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds <- collect_predictions(lm_tuned)

# glmnet_best_OOF_preds <- collect_predictions(glmnet_tuned) %>%
#   filter(penalty  == glmnet_best_params$penalty[1] & mixture == glmnet_best_params$mixture[1])

rf_best_OOF_preds <- collect_predictions(rf_tuned) %>%
  filter(mtry  == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])

xgb_best_OOF_preds <- collect_predictions(xgb_tuned) %>%
  filter( min_n == xgb_best_params$min_n[1])

# collect validation set predictions from last_fit model
lm_val_pred_geo     <- collect_predictions(lm_val_fit_geo)
# glmnet_val_pred_geo <- collect_predictions(glmnet_val_fit_geo)  # glmnet last_fit is commented out above
rf_val_pred_geo     <- collect_predictions(rf_val_fit_geo)
xgb_val_pred_geo    <- collect_predictions(xgb_val_fit_geo)
# poisson_val_pred_geo    <- collect_predictions(poisson_val_fit_geo)



# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds <- rbind(
  data.frame(dplyr::select(lm_best_OOF_preds, .pred, Flow_Count)%>% mutate(.pred=expm1(.pred)), model = "lm") ,
   # data.frame(dplyr::select(poisson_val_pred_geo, .pred, Flow_Count), model = "poisson")
  # data.frame(dplyr::select(glmnet_best_OOF_preds, .pred, Flow_Count), model = "glmnet") ,
  data.frame(dplyr::select(rf_best_OOF_preds, .pred, Flow_Count)%>% mutate(.pred=expm1(.pred)), model = "rf"),
  data.frame(dplyr::select(xgb_best_OOF_preds, .pred, Flow_Count) %>% mutate(.pred=expm1(.pred)), model = "xgb")
  
) %>% 
  group_by(model) %>% 
  mutate(
    RMSE = yardstick::rmse_vec(Flow_Count, .pred),
    MAE  = yardstick::mae_vec(Flow_Count, .pred),
    MAPE = yardstick::mape_vec(Flow_Count, .pred)) %>% 
  ungroup() %>% 
  mutate(model = factor(model, levels=c(
    "lm",
    "glmnet",
    "rf",
    "xgb"
  )))
# average error for each model
ggplot(data = OOF_preds %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "red") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  theme_bw()

# OOF predicted versus actual
ggplot(OOF_preds, aes(x = Flow_Count, y = .pred,
                      group = model
                      )) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  # coord_equal() +
  facet_wrap(~model, nrow = 2) +
  theme_bw()

full_fit     <- rf_best_wf %>% workflows::fit(train_set)
rf_pred <- predict(full_fit, test_set %>% select(-Flow_Count))
rf_pred$.pred=expm1(rf_pred$.pred)  # back-transform from the log1p scale applied in the recipe
full_fit$fit$fit$fit  # inspect the underlying ranger model

References

Kothuri, Sirisha, Joseph Broach, Nathan McNeil, Kate Hyun, Stephen Mattingly, Md Mintu Miah, Krista Nordback, and Frank Proulx. 2022. “Exploring Data Fusion Techniques to Estimate Network-Wide Bicycle Volumes.” NITC-RR-1269. National Institute for Transportation and Communities. https://nitc.trec.pdx.edu/research/project/1269/.
Miah, Md Mintu, Stephen P Mattingly, and Kate Kyung Hyun. 2023. “Evaluation of Bicycle Network Connectivity Using Graph Theory and Level of Traffic Stress.” Journal of Transportation Engineering, Part A: Systems 149 (9): 04023080. https://doi.org/10.1061/JTEPBS.TEENG-7776.
Smart Cities Practicum Team. 2025. “Modelling Ideas for Bicycle Count Estimation.”