Estimating Bike Counts in DC
You can view the full MUSA project page and our BEAUTIFUL Dashboard.
Instructors
Michael Fishman & Matthew Harris
Client Contacts
Nellie Moore, Senior Civic Design Researcher, DDOT (nellie.moore@dc.gov)
David Balick, Trails Planner, DDOT (david.balick@dc.gov)
Jack Crum, Data Scientist, The Lab @ DC (jack.crum@dc.gov)
Executive Summary
Problem Statement: Current counter data is limited in spatial and temporal coverage, providing an incomplete picture of cycling in DC and making it difficult to fully understand ridership patterns and optimize network planning.
Use Case: Addressing a problem of incomplete information, our analysis estimates and summarizes micro-mobility ridership across DC’s networks.
Key Data Challenges: Automated counters are too sparse across space to provide a representative picture of cycling in DC.
Modelling Solution: Estimate peak hour counts from manual counter data, then use those estimates to predict hourly counts at locations without automated counters, augmented with other data sources.
Model Performance: The final configurations of both the temporal and spatial models outperformed static models from existing workflows recorded in Portland, Bend, Dallas, and Charlotte.
Strengths and Applications: Provides an overview of bike commuting patterns and infrastructure utilization during critical time periods (peak hours), which may support planning decisions at the city-wide scale. It is not a real-time, granular, or fully realistic representation of cyclist volumes.
Limitations:
It remains impossible to test and comment on the generalizability of the temporal model without sufficient data points across DC
Inherent selection biases in the data arise from collection decisions, such as prioritizing key transport corridors for counter installation
Some areas may have daily patterns that differ substantially from the conventional dual-peak pattern, reducing the confidence of estimates in these areas
Recommendations:
Use predictions made outside of peak hours, and in areas further from known automated records, with caution
Expand the coverage of automated counters across a greater variety of locations, enabling further model iterations and testing in currently unmeasured areas
Use domain knowledge and understanding of local contexts within DC to validate model outputs
Introduction
The District Department of Transportation wishes to understand trip volumes and time-space patterns of micro mobility (bicycle and scooter) use in Washington, DC. Estimating and summarizing such patterns will give the department a better picture of cycling flows across the city, which can inform a variety of analyses and decisions surrounding infrastructure provision and interventions, as well as understanding of relations with equity and demographics. However, as the following sections will show, while DC’s bike network is extensive, the current counter data is limited in spatial and temporal coverage, providing an incomplete picture of cycling patterns and making it challenging to optimize network planning. The principal goal of this project is thus to estimate the volume of micro mobility ridership across the city. The rest of this report is structured as follows: first, it introduces the motivation and use-case, before conducting a short review of typical approaches and considerations in modelling and estimating bicycle counts. Then, it illustrates the key data issues regarding spatial and temporal sparsity of the provided counter data and outlines our predictive solution. It then explores selected predictors before comparing the performance of various model configurations. It concludes by assessing the generalizability of the final model for different demographics and settings and offering recommendations for existing workflows.
Client Background
Motivation & Use Case
Bicycle count data is important for a variety of planning purposes, including measuring changes over time, prioritizing projects, planning and designing future infrastructure, providing inputs for multi-modal travel models, assessing overall footfall, adjusting signals, and supporting property research, among other things (Nordback et al. 2018). The growing interest in bicycling and cycling facilities creates a need to understand how people use the mode and its facilities to reach important places in their daily lives. This need has not gone unnoticed by DDOT, which has been carrying out its own data collection program for the past few years. However, as we reveal in subsequent sections, there are limits to the ability of the collected data to generalize to the entire city and its network.
Addressing a problem of incomplete information, our analysis estimates and summarizes micro-mobility ridership across DC’s networks. By revealing the distribution and gaps in available data, our study highlights opportunities to enhance monitoring strategies, understand how patterns vary across the city with various demographic and infrastructure features, and guide targeted infrastructure improvements. This is facilitated by a dashboard which provides various functions for exploring these phenomena to aid decision making.
Literature Review
In recent years, the need for accurate, network-wide estimates of bicycle activity has grown in response to increased investment in cycling infrastructure and a push for data-driven planning. Traditionally, cities have relied on permanent counters and short-duration manual counts to monitor bicycle traffic, yet these methods offer limited spatial and temporal coverage. As a result, researchers have developed several modeling approaches to extrapolate counts across space and time, and to integrate newer, non-traditional data sources.
The Role of Bicycle Count Data
Bicycle count data serve various planning and evaluation purposes, including measuring trends, prioritizing infrastructure investments, calibrating travel models, conducting before-and-after studies, and assessing safety and equity outcomes (Nordback et al., 2018). However, stationary counters can only reflect activity at specific points, leaving large portions of the network unmeasured. To address this, researchers have explored methods to extrapolate short-duration counts to average annual daily bicycle traffic (AADBT) using grouping techniques such as cluster analysis, spatial indexing, and travel pattern similarities (Team 2025).
Modeling Approaches
Direct Demand Models
One common strategy is the direct demand model, which regresses observed counts against built environment, network, and demographic variables. Key predictors often include:
- Land use: employment and population density, retail access, and land use mix.
- Transportation infrastructure: bike facility density, proximity to bridges, and centrality in the network.
- Sociodemographics: race, age, education, and student population.
- Accessibility: distance to jobs, schools, and the central business district.
- Temporal factors: day of week, time of day, and weather conditions.
Poisson and negative binomial models are frequently used due to the count-based nature of the data (Team 2025).
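To make the direct demand approach concrete, here is a minimal sketch fitting a negative binomial model to simulated segment-level data. All names (segments_df, pop_density, bike_facility_density, dist_to_cbd) are illustrative placeholders rather than columns from any dataset in this report.
library(MASS) # glm.nb() for negative binomial regression
set.seed(1)
# Simulated segment-level data: a count outcome plus built-environment predictors
segments_df <- data.frame(
avg_count = rnbinom(200, mu = 40, size = 2),
pop_density = runif(200, 1000, 20000),
bike_facility_density = runif(200, 0, 5),
dist_to_cbd = runif(200, 0, 10))
# Negative binomial regression accommodates the overdispersion typical of bicycle counts
nb_fit <- glm.nb(avg_count ~ pop_density + bike_facility_density + dist_to_cbd, data = segments_df)
# Estimates for unmeasured segments come from the exponentiated linear predictor
predict(nb_fit, newdata = segments_df[1:5, ], type = "response")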
Machine Learning and Data Fusion
Recent research has emphasized data fusion techniques that combine traditional counts with crowdsourced or third-party data sources, such as Strava, StreetLight, and bikeshare systems. The study by Kothuri et al. (2022) demonstrated that integrating these sources with contextual variables (termed “static” data) can enhance model performance. In their comparative study across six cities, models that combined static, Strava, and StreetLight data consistently outperformed those using single sources, especially in medium- to high-volume contexts. However, low-volume sites remained difficult to model accurately, partly due to sparse data and bias in source coverage (Kothuri et al. 2022).
The study further evaluated Random Forest regression and found it performed comparably to Poisson models, with better results in data-rich settings. While machine learning allows for flexible modeling of complex relationships, it demands substantial data and tuning to avoid overfitting or underperformance in low-data environments.
Connectivity and Network Measures
Complementing count-based models, Miah et al. (2023) explored network connectivity using graph theory and Level of Traffic Stress (LTS). Their study assessed how link-level characteristics—such as speed, lane width, and motor vehicle volume—affect ridership by user type (e.g., children vs. confident adults). The results highlighted significant connectivity disparities, with large portions of networks inaccessible to low-stress riders, thereby limiting cycling uptake in equity-seeking communities (Miah, Mattingly, and Hyun 2023).
Exploratory Analysis: Modeling Considerations
Our Target Variable: Weekday Hourly Bike Counts
Data Quality and Coverage
The locations of counters and the nature of their counts are important considerations when utilising them for modelling unknown counts, and can limit the types of models and inferences which can be made from data. Importantly, stationary counters can only tell us about the traffic passing by them, during the times which they are active.
The two counter types differ greatly in spatial and temporal coverage:
Manual Counters:
Greater density of locations, especially if aggregated across 4 years
Data collected only during peak hours, once per year, over periods of 3 to 4 hours
Automatic Counters:
Poor spatial coverage, with only 3 to 4 counters active per month and a total of around 30 counters deployed in 2024
Provides counts at 15-minute intervals
To address this, we focused on generating estimates at an hourly timescale.
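As a minimal sketch of that aggregation step, assuming a raw table auto_15min with a timestamp column Time (both names hypothetical), rolled up to the Site.Name, interval_60, and Flow_Count columns used later in this report:
library(dplyr)
library(lubridate)
# Aggregate 15-minute automated counts to hourly totals per site
hourly_counts <- auto_15min %>%
mutate(interval_60 = floor_date(Time, unit = "hour")) %>%
group_by(Site.Name, interval_60) %>%
summarise(Flow_Count = sum(Flow_Count, na.rm = TRUE), .groups = "drop")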
Manual Counters Spatial Distribution
The manual counters—shown in purple—are more widespread, but they’re only used occasionally and often during peak hours.
manual <- manual %>% drop_na(x, y)
manual <- st_as_sf(manual, coords = c("x", "y"), crs = 4326)
manual <- st_transform(manual, st_crs(basemap))
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(data = bike_lane, color = "grey30", size = 0.5, alpha = 0.7) +
geom_sf(data = manual, color = "#E3D7F5", size = 3, shape = 21, fill = "#562DAB") +
labs(title = "Manual Counters Locations") +
theme_minimal() +
theme(
legend.position = "right",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
Automatic Counters Spatial Distribution
These are fairly limited in number and tend to be clustered in specific parts of the city. Because automated counters are concentrated in certain areas, they don’t reflect the diversity of biking environments across the whole network.
auto <- auto %>% drop_na(Latitude, Longitude)
auto <- st_as_sf(auto,
coords = c("Longitude", "Latitude"), # note the order: X = Longitude, Y = Latitude
crs = 4326)
auto <- st_transform(auto, st_crs(basemap))
auto_cut <- auto %>%
filter(st_coordinates(.)[, 1] > -77.12 & st_coordinates(.)[, 1] < -76.90,
st_coordinates(.)[, 2] > 38.79 & st_coordinates(.)[, 2] < 39.00)
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(data = bike_lane, color = "grey30", size = 0.5, alpha = 0.7) +
geom_sf(data = auto_cut, color = "#E3D7F5", size = 3, shape = 21, fill = "#9C7BD8") +
labs(title = "Automatic Counters Locations") +
theme_minimal() +
theme(
legend.position = "right",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
By definition, count data do not tell us anything about the network or times outside of the records. However, through a predictive modelling approach, we can use the data to understand how patterns differ across space and time, and predict what they might be in areas outside the temporal and spatial scope of the records.
Exploring Temporal & Spatial Variations in Bike Counts
In order to understand how patterns vary more broadly across space and time, we first explored spatial patterns using Manual count data and then the temporal patterns using automated counts.
Spatial Patterns
Temporal Patterns
Automated counts range from around 1 to 30 per interval, with most 15-minute intervals recording fewer than 10, averaging roughly 30 to 40 per hour.
main %>% filter(Flow_Count>0) %>% ggplot()+
geom_histogram(aes(x=Flow_Count),fill= "#9C7BD8", color='white')+
xlim(0,30)+
labs(title= 'Distribution of Automated Counts',
subtitle='15 minute intervals')+
theme_minimal()
Daily patterns of counts are roughly even across different days of the year. While the plot below may seem to suggest that certain months have higher than normal counts, this is due to differences in counter locations.
main %>% st_drop_geometry() %>% filter(Flow_Count>0) %>% group_by(interval_60) %>%
summarise(mean_count= mean(Flow_Count)) %>%
ggplot()+
geom_line(aes(x=interval_60,y= mean_count), size=0.3, color= "#9C7BD8")+
labs(title= 'Automated Counts across study period',
subtitle='1 hour intervals')+
theme_minimal()
Weekday patterns differ greatly from weekend patterns, exhibiting the dual peak pattern commonly seen in other forms of transit (buses and rail). This strongly suggests that the recorded counts are largely commuter-centric in nature.
main %>% filter(Flow_Count>0) %>% mutate(hour = hour(interval_60),dotw = wday(interval_60, label=TRUE)) %>% group_by(dotw,hour) %>%
st_drop_geometry() %>% summarise(mean_count= mean(Flow_Count)) %>%
ggplot()+
geom_area(aes(x=hour,y= mean_count), fill="#9C7BD8", color='transparent')+
labs(title= 'Automated Counts across study period',
subtitle='1 hour intervals, by day of week')+
facet_wrap(~dotw)+
theme_minimal()
The twin peak pattern is also mostly generalizable to all locations, with a few notable exceptions. 1st St NE exhibited rather anomalous patterns and was excluded from the training process. Notably, streets such as East Capitol Street and the Marvin Gaye Trail have patterns which deviate slightly from the norm. The former is characterised by a much larger evening peak, while the latter has much more erratic patterns, likely due to the low average counts at these sites.
main %>% filter(Flow_Count>0) %>% mutate(hour = hour(interval_60),dotw = wday(interval_60, label=TRUE)) %>% subset(!(dotw %in% c('Sat', 'Sun'))) %>% group_by(Site.Name, hour) %>%
st_drop_geometry() %>% summarise(mean_count= mean(Flow_Count)) %>%
ggplot()+
geom_area(aes(x=hour,y= mean_count), fill='#9C7BD8', color='transparent')+
labs(title= 'Automated Counts across Locations ',
subtitle='1 hour intervals')+
ylab('Mean Count')+
facet_wrap( ~ Site.Name,
scales='free',
ncol= 5)+
theme_minimal()
The consistency of such temporal patterns across different sites is important, as it allows us to estimate counts across different times of the day in unknown locations, by assuming that counts vary similarly over time across locations and differ only in magnitude (i.e. the peak bike counts).
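The sketch below stylizes this assumption for illustration (our actual workflow feeds these quantities into the temporal model rather than scaling directly): derive a normalized hourly profile from the automated counters, then scale it by a location-specific peak estimate.
library(dplyr)
library(lubridate)
# Shared hourly profile: each hour's average count as a share of the peak hour
hourly_profile <- main %>%
sf::st_drop_geometry() %>%
mutate(hour = hour(interval_60)) %>%
group_by(hour) %>%
summarise(mean_count = mean(Flow_Count, na.rm = TRUE)) %>%
mutate(share = mean_count / max(mean_count)) # equals 1 at the peak hour
# Hypothetical peak-hour estimate for an uncountered segment (e.g., from a spatial model)
peak_estimate <- 55
hourly_profile %>% mutate(est_count = share * peak_estimate)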
auto_counts%>%
mutate(time_of_day = case_when(hour(interval_60) < 7 | hour(interval_60) > 18 ~ "Overnight",
hour(interval_60) >= 7 & hour(interval_60) < 10 ~ "AM Rush",
hour(interval_60) >= 10 & hour(interval_60) < 15 ~ "Mid-Day",
hour(interval_60) >= 15 & hour(interval_60) <= 18 ~ "PM Rush"))%>%
ggplot()+
geom_histogram(aes(x=as.numeric(Flow_Count)), fill="#9C7BD8", color='white')+
# labs(title="Figure 4b: Mean Number of Hourly Trips Per Station, New York, August, 2024",
# subtitle='Based on a 20% random sample out of 4.5m trips',
# x="Number of trips",
# y="Frequency")+
facet_wrap(~time_of_day)+
theme_minimal()
Our Approach: Estimate Spatial Patterns with Manual Counts & Temporal Patterns from Automated Counts
Typical approaches for obtaining estimates of bike counts in areas without counters can be classified into three main types: 1) direct extrapolation, 2) travel demand modelling, and 3) direct demand models.
Direct extrapolation is based on the principle of spatial autocorrelation: that areas near each other have similar patterns of counts. This often entails grouping sites with similar characteristics and using expansion factors to extrapolate short-duration counts to annual estimates. Grouping may be done visually, or through cluster analysis or spatial variables.
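A minimal sketch of factor-based extrapolation under those principles; perm_counts and short_counts are hypothetical tables in which each site is assigned to a factor_group of similar sites.
library(dplyr)
# Expansion factor per group: annual average daily bicycle traffic (AADBT)
# divided by the average volume during the short-count period, from permanent counters
factors <- perm_counts %>%
group_by(factor_group) %>%
summarise(expansion = mean(aadbt) / mean(short_period_volume))
# Scale each short-duration count by its group's factor to estimate AADBT
short_counts %>%
left_join(factors, by = "factor_group") %>%
mutate(aadbt_estimate = observed_volume * expansion)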
Travel demand models describe patterns as decisions to travel for activities using specific modes of transport; they often suffer from their large aggregation scale and the difficulty of validating them against external data.
As such, given the availability of various external data sources, direct demand models - which regress (or predict) observed counts on known characteristics of the surrounding built environment - are popular methods for estimating counts. The accuracy of such models depends on the predictors selected and their suitability for extrapolation to the entire network - meaning that strongly predictive characteristics absent from the areas without records should be included with caution.
Crucially, the majority of reviewed approaches are concerned with obtaining annually aggregated estimates of bicycle volume - in other words, where there tend to be the most or fewest cyclists on the network. Our use-case and motivation, however, require that we also understand when and how bike counts vary.
Given the consistency of hourly patterns across locations, the wealth of temporal data from automated counters allows us to estimate how patterns vary across the day, barring a few limitations and caveats which are explained further below.
As such, we propose a modelling workflow which maximizes the degree of spatial and temporal information we can obtain from the two data sources.
Overall Modelling Workflow
Exploratory Analysis: Relevant Predictors and Data Sources
To model and estimate how flows vary across space and time, it is important to identify relevant and useful predictors.
Typical characteristics used for prediction include (Muner and Sener):
Socio-demographic: race/ethnicity, education, students, age
Network measures: centrality, bridges, bicycle facility density, number of lanes
Land use: population density, employment density, commercial retail density, industrial use, open space, land use mix
Accessibility: to jobs, bike trail entrances, transit stops, schools, and distance from the CBD
Temporal: weather, time of day
Based on Kothuri et al.’s study and other literature, we identified a number of variables and data sources for both the spatial and temporal models. [insert diagram ]
Spatial Model
Variables chosen for the spatial model included….
Temporal Model
Because our approach cannot rely on accurate spatial grouping variables (e.g., streets) for locations without records, for the temporal model we considered various spatial variables to provide information about spatial differences in the magnitude of bicycle counts. These included predicted values from the spatial model, a kernel density estimate of the manual count data, and engineered count features from Capital Bikeshare data.
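As a minimal sketch of the kernel density feature, the function below evaluates a Gaussian-kernel-smoothed surface of the manual counts at segment centroids. The kernel form and 500 m bandwidth are illustrative assumptions, the manual layer is assumed to carry an avg_count column, and both layers are assumed to share a projected CRS in meters.
library(sf)
library(units)
# Kernel-smoothed intensity of manual counts at each target point
kde_at_points <- function(targets, sources, values, bandwidth = 500) {
d <- drop_units(st_distance(targets, sources)) # pairwise distance matrix in meters
w <- exp(-d^2 / (2 * bandwidth^2)) # Gaussian kernel weights
as.vector(w %*% values) # kernel-weighted sum of counts
}
segments$manual_kde <- kde_at_points(st_centroid(segments), manual, manual$avg_count)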
Land Use
To understand how land use influences bike counts, we applied a 300-meter buffer around each bike lane segment. This allows us to capture the mix of land uses within close proximity to each bike route, rather than just relying on a single adjacent land use type.
The highest bike counts are observed in residential, commercial, and institutional areas, indicating that these environments support frequent cycling—likely due to a mix of commuting, errands, and recreational trips.
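A minimal sketch of the buffering step, assuming a land use polygon layer landuse_polys with a use_type column (hypothetical names), a seg_id identifier on the bike lane layer, and a projected CRS:
library(sf)
library(dplyr)
# 300 m buffer around each bike lane segment
seg_buffers <- st_buffer(bike_lane, dist = 300)
# Overlay land use polygons and compute each use type's share of buffer area
inter <- st_intersection(seg_buffers, landuse_polys)
inter$area <- st_area(inter)
landuse_mix <- inter %>%
st_drop_geometry() %>%
group_by(seg_id, use_type) %>%
summarise(area = sum(area), .groups = "drop_last") %>%
mutate(share = as.numeric(area / sum(area))) # proportion of each buffer's area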
seg_nongem <- st_drop_geometry(segments)
landuse <- seg_nongem %>%
select(Commercial, Industrial, Institutional, Mixed, Recreational, Religious, Residential, Vacant, Other) %>%
summarise_all(sum) %>%
pivot_longer(cols = everything(), names_to = "Land_Use_Type", values_to = "Count")
ggplot(landuse, aes(x = reorder(Land_Use_Type, Count), y = Count, fill = Land_Use_Type)) +
geom_col(width = 0.7) + # Adjust bar width (default is 0.9, reduce for thinner bars)
coord_flip() +
scale_fill_manual(values = c(
"Vacant" = "#562DAB", # Very dark purple
"Residential" = "#6A3BB8",
"Institutional" = "#7A50C7",
"Commercial" = "#8E68CC",
"Religious" = "#9C7BD8",
"Mixed" = "#A989E0",
"Other" = "#C4AEE8",
"Industrial" = "#D6C4F0",
"Recreational" = "#E3D7F5" # Lightest
), guide = "none") +
labs(title = "Distribution of Bike Lanes Across Adjacent Land Use Types",
x = "Land Use Type", y = "Count") +
theme_minimal() +
theme(
legend.position = "right", # Keep legend on the right
panel.grid.major = element_blank())
Census Data
We examined the spatial distribution of key census indicators across our study area to better understand the local socio-demographic context. Our maps reveal marked differences in variables such as median income, the rate of public transit commuting, and the proportion of residents who use bicycles for commuting. Areas with higher median incomes may have distinct mobility patterns compared to regions with lower incomes, and these economic disparities can influence the demand for cycling infrastructure. By overlaying these data on our geographic base map, we can begin to see how the underlying socio-economic factors correlate with the spatial distribution of transportation modes.
In addition, the visualizations of transit and bicycle commuting rates serve as essential proxies for local travel behavior. Regions with higher bicycle commuting rates, for example, might already exhibit a strong cycling culture that can support further network investments, while areas with prominent public transit usage suggest the potential for integrated transport solutions that merge cycling with existing transit services. These insights are critical for identifying data gaps in our current bike counter coverage and for forecasting future demand, ultimately guiding more targeted and effective infrastructure planning in Washington, D.C.
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(
data = segments,
aes(color = median_incomeE),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = "Income",
na.value = "grey80") +
labs(title = "Median Income") +
theme_minimal()
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(
data = segments,
aes(color = commute_bicycleE),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = "commute_bicycle") +
labs(title = "Bicycle Commuters") +
theme_minimal()
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(
data = segments,
aes(color = pct_white),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = "pct_white") +
labs(title = "Percentage of White") +
theme_minimal()
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(
data = segments,
aes(color = edu_bachelors_plusE),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = " edu_bachelor_plus") +
labs(title = "Education Level") +
theme_minimal()
Network Characteristics
We also examined network characteristics as additional inputs to our model. Betweenness centrality measures how often a node falls on the shortest path between other pairs of nodes.
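A minimal sketch of how such a measure can be derived with sfnetworks and tidygraph, assuming the bike lane lines can be cast to a routable network (the package choice and weighting are illustrative, not necessarily what produced the map below):
library(sfnetworks)
library(tidygraph)
# Build an undirected network from the bike lane lines and score each node by
# how often it lies on shortest paths between other node pairs
net <- as_sfnetwork(bike_lane, directed = FALSE) %>%
activate("edges") %>%
mutate(weight = edge_length()) %>% # impedance: segment length
activate("nodes") %>%
mutate(betweenness = centrality_betweenness(weights = weight))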
ggplot() +
geom_sf(data = basemap, fill = "grey97", color = "white") +
geom_sf(
data = segments,
aes(color = betweenness_max),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = "Betweenness") +
labs(title = "Bike Lane Network: Binned Betweenness Centrality") +
theme_minimal()
Exploratory Data Analysis with Manual Counts
In this section, we explore how various segment-level characteristics—such as roadway design, land use, and traffic conditions—relate to observed manual bicycle counts in Washington, DC. This exploratory analysis aims to identify potential predictors of cyclist volumes and inform our modeling efforts to estimate unobserved counts citywide.
segments_manual <- segments %>%
semi_join(seg_manual, by = "seg_id") %>%
left_join(seg_manual, by = "seg_id")
Overview of Average Current Manual Counts
ggplot() +
geom_sf(data = basemap, fill = "grey97", color = "white") +
geom_sf(
data = segments_manual,
aes(color = avg_count),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = "Average Bike Counts") +
labs(title = "Manual Bike Counts") +
theme_minimal()
ggplot(segments_manual, aes(x = avg_count)) +
geom_histogram(bins = 30, fill = "#7A50C7", color = "white") +
theme_minimal() +
labs(title = "Distribution of Average Manual Counts",
x = "Average Count", y = "Number of Segments")Continuous Variables with Manual Average Counts
To better understand what factors are associated with cyclist activity, we explored how average manual bike counts (avg_count) relate to several continuous segment-level variables. Scatterplots with linear regression lines were used to examine relationships between avg_count and predictors such as annual average daily traffic (AADT), median household income, streetlight count, pavement condition index (PCI), and bike infrastructure scores. These visualizations revealed varying degrees of correlation, suggesting that both traffic volume and built environment features may play a role in influencing observed cyclist volumes on different street segments.
Categorical Variables with Manual Average Counts
In addition to numeric predictors, we analyzed binary and categorical variables, including land use types and physical street characteristics. Boxplots were used to compare average manual counts across categories such as residential, commercial, and recreational land uses, as well as the presence or absence of buffers, parking lanes, and double yellow lines. Results suggest that segments with residential land use or buffered bike lanes generally experienced higher ridership, while others like vacant land or unbuffered segments saw lower cyclist activity. These findings provide preliminary evidence that both land use context and roadway design elements contribute to differences in observed bicycle volumes.
binary_vars <- c(
"Commercial.x", "Industrial.x", "Institutional.x", "Mixed.x",
"Recreational.x", "Religious.x", "Residential.x", "Vacant.x", "Other.x")
segment_binary <- segments_manual %>%
mutate(across(all_of(binary_vars), ~ factor(., levels = c(0, 1), labels = c("No", "Yes"))))
binary_plots <- lapply(binary_vars, function(var) {
ggplot(segment_binary, aes(x = .data[[var]], y = avg_count)) + # .data[[var]] replaces the deprecated aes_string()
geom_boxplot(fill = "#C4AEE8", alpha = 0.5, color = "grey70") +
theme_minimal() +
labs(title = paste("Manual by", var)) + # parenthesis fixed: theme() was nested inside labs()
theme(plot.title = element_text(size = 10)) # size 10 assumed; the original size = 1 was unreadably small
})
wrap_plots(binary_plots, ncol = 3)
landuse_vars <- c(
"Commercial.x", "Industrial.x", "Institutional.x", "Mixed.x",
"Recreational.x", "Religious.x", "Residential.x", "Vacant.x", "Other.x"
)
# Convert 0/1 to "No"/"Yes" for better labeling
segment_lu <- segments_manual %>%
mutate(across(all_of(landuse_vars), ~ factor(., levels = c(0, 1), labels = c("No", "Yes"))))
# Pivot to long format
segment_lu_long <- segment_lu %>%
pivot_longer(
cols = all_of(landuse_vars),
names_to = "land_use_type",
values_to = "present"
)
ggplot(segment_lu_long, aes(x = present, y = avg_count, fill = present)) +
stat_summary(fun = mean, geom = "bar", width = 0.8) +
facet_wrap(~ land_use_type, ncol = 3, scales = "free_y") +
scale_fill_manual(values = c("No" = "#D6C4F0", "Yes" = "#562DAB")) +
theme_minimal() +
theme(
panel.grid = element_blank(),
axis.text.x = element_text(size = 10),
strip.text = element_text(size = 11, face = "bold"),
plot.title = element_text(size = 12),
legend.position = "none"
) +
labs(
title = "Average Manual Count by Land Use Presence",
x = "Land Use Present",
y = "Average Manual Count"
)
Model
As noted above, our approach estimates spatial peak hour patterns from manual counts and spatial predictors, before using the peak hour estimates to supplement a temporal model that estimates hourly traffic from the automated counters - as seen in the diagram below. This section briefly explains the types of models used, undertakes a detailed comparison of performance metrics for the two models, and then assesses the generalizability of estimates from the two final selections. Detailed code for both modelling workflows may be found in the appendix.
Models Utilized
As part of our exploratory analyses, we compared three families of models for their performance in estimating both peak hour counts across space and location-specific hourly counts across a day - namely linear regression (including a LASSO variant), random forest, and XGBoost models - evaluated in terms of RMSE, MAPE, and %RMSE.
Linear Model
Predicts the dependent variable as a linear combination of independent variables
Use-case: when the relationship between the predictors and the response is linear
Strengths:
ease of interpretation
fast to train and evaluate
Limitations:
assumes linearity, performing poorly on complex nonlinear patterns
does not handle multicollinearity well, which is often present among time predictors - the values of each hour are related to one another
LASSO
A variation of regression models which uses L1 regularization to shrink some model coefficients and force others to be exactly zero
Strengths:
- L1 based feature selection.
Limitations:
- Used only for the spatial model, as it was found to penalise hour-based fixed effects due to their strong collinearity with other time-based fixed effects (other hours or time of day) or their low individual contribution
Random Forest Models
An ensemble of decision trees built using bootstrapped samples and random subsets of features at each split. Decision trees model outcomes as a series of decisions and splits (branches), which can capture the full range of variation in values and how different variables affect it at different stages.
Strengths:
captures complex nonlinear relationships and interactions
implicit feature selection: irrelevant features are rarely chosen for splits, making the model robust to noise when we have a large number of predictors relative to observations
Limitations:
difficult to interpret - such models are often ‘black boxes’, where it is relatively hard to understand how each variable contributes compared to a linear model
slow to fit
requires tuning and pruning to prevent overfitting
XGBoost
A powerful boosted tree ensemble that builds trees sequentially, optimizing against the errors of prior trees using gradient descent. Trees are added iteratively to correct residual errors, and overfitting is controlled via regularisation.
Strengths:
capturing complex relationships which are nonlinear with interactions
feature selection
accurate and efficient and able to handle missing data
Limitations:
complexity of tuning process
interpretability
Model Evaluation Workflow
Given that most of our selected models required extensive tuning of various hyperparameters, we evaluated both models using a two-stage validation process: we first tuned the models via k-fold cross-validation, before evaluating performance by training the tuned model on the entire training dataset and testing on the holdout set. For all tree-based models, we used permutation importance rather than impurity importance. Impurity importance measures how much each feature reduces the variance of the model: each time a feature is selected to split a node, the impurity reduction is added to its score. Permutation importance instead randomly shuffles feature values across the dataset and records the impact on performance. The latter is often preferable as it is not biased by the cardinality (or continuity) of features and is less sensitive to multicollinearity; its iterative nature makes it more accurate at the expense of processing time.
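A compressed tidymodels sketch of that two-stage workflow (object and column names such as model_df and GEOID are illustrative; the tuned workflows actually used are in the appendix):
library(tidymodels)
split <- initial_split(model_df, prop = 0.8)
train_set <- training(split)
test_set <- testing(split)
# Random forest configured for permutation importance rather than impurity importance
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("regression")
wf <- workflow() %>% add_model(rf_spec) %>% add_formula(avg_count ~ .)
# Stage 1: tune hyperparameters with grouped (leave-one-group-out) cross-validation
folds <- group_vfold_cv(train_set, group = GEOID)
tuned <- tune_grid(wf, resamples = folds, grid = 20)
# Stage 2: refit the best configuration on all training data, evaluate on the holdout
final_fit <- wf %>%
finalize_workflow(select_best(tuned, metric = "rmse")) %>%
last_fit(split)
collect_metrics(final_fit)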
Spatial Model
For the spatial model, we used an 80-20 train-test split, with training data points selected at random, given that most areas contained a manual counter. As high volume locations were relatively rare, we over-sampled locations with over 100 counts by replicating these points and adding random noise, in order to improve modelling performance (see the sketch below). For the cross-validation process, we conducted a leave-one-group-out split using block groups, to improve the ability of our model to generalise across different locations in DC.
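A minimal sketch of the oversampling step (the replication count and noise scale here are illustrative, not the exact values used):
library(dplyr)
# Replicate rare high-volume rows and jitter the target with small random noise,
# so the model sees more (slightly varied) examples of busy segments
high_vol <- train_set %>% filter(avg_count > 100)
augmented <- high_vol[rep(seq_len(nrow(high_vol)), times = 3), ] %>%
mutate(avg_count = avg_count + rnorm(n(), mean = 0, sd = 5))
train_set_aug <- bind_rows(train_set, augmented)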
Temporal Model
We conducted a time-based split, training on the first two to three weeks of each month and testing on the last one to two weeks. Such a split evaluates model performance ‘in the future’ and also avoids conflating seasonal differences, as a month-by-month split would. Given the sparsity of recorded locations, we were unable to adopt a spatially differentiated approach to testing, as splitting by location would leave far too few unique observations across space to ensure generalizability. Instead, we conducted cross-validation based on wards to account in part for spatial generalizability.
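A minimal sketch of the time-based split on the hourly data, using day 21 as an illustrative cut point:
library(dplyr)
library(lubridate)
# Train on roughly the first three weeks of each month; test on the remainder,
# so evaluation is always 'in the future' relative to the training window
train_set <- main %>% filter(mday(interval_60) <= 21)
test_set <- main %>% filter(mday(interval_60) > 21)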
The code for both performance evaluation workflows may be referred to in the Appendix.
Spatial Model Performance
Given that the models employed are able to select features, we primarily tested the models for differences in performance with different spatial fixed effects and the addition of capital bike-share data. Four models are highlighted in the sections below for comparison, namely:
Predictor set + block group geoid
Predictor set + wards
Predictor set + block group geoid + capital bikeshare estimates
Predictor set + wards + capital bikeshare estimates
Comparing Spatial Fixed effects
Comparing the GEOID model with the wards model, wards led to better overall performance across all metrics, with a much lower root mean squared error of 21 compared to 23. This unsurprisingly suggests that the additional variation provided by the much more granular block groups leads to overfitting. The performance improvements are evident across locations of varying volumes. In both models, the random forest regressor performed best, albeit minimally so.
With Geoid
load("Spatial_geoid.rData")
# average error for each model
ggplot(data = OOF_preds %>%
dplyr::select(model, RMSE) %>%
distinct(),
aes(x = model, y = RMSE, group = 1)) +
geom_path(color = "#562DAB") +
geom_label(aes(label = paste0(round(RMSE,1)))) +
theme_bw()
test_set %>% select(avg_count) %>% cbind(rf_pred) %>% summarise(
MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
RMSE = sqrt(mean((avg_count - .pred)^2)),
RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100) %>%
kbl() %>%
kable_styling()

| MAPE | SMAPE | RMSE | RMPSE |
|---|---|---|---|
| 63.31059 | 43.60809 | 23.84898 | 113.4964 |
test_set %>% mutate(load= ifelse(avg_count<50, "low", ifelse(avg_count<100, "mid","high"))) %>% select(avg_count, load) %>% cbind(rf_pred) %>%
group_by(load) %>% summarise(
MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
RMSE = sqrt(mean((avg_count - .pred)^2)),
RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100,
avg_count=mean(avg_count))%>%
kbl() %>%
kable_styling()

| load | MAPE | SMAPE | RMSE | RMPSE | avg_count |
|---|---|---|---|---|---|
| high | 15.04321 | 16.90076 | 35.48023 | 18.74176 | 181.54412 |
| low | 91.35428 | 58.52571 | 20.67447 | 142.68820 | 21.33500 |
| mid | 18.56912 | 21.15760 | 14.49096 | 22.44028 | 65.19231 |
With Wards
load("Spatial_wards.rData")
# average error for each model
ggplot(data = OOF_preds %>%
dplyr::select(model, RMSE) %>% filter(model!='glmnet') %>%
distinct() ,
aes(x = model, y = RMSE, group = 1)) +
geom_bar(stat="identity", fill='#6A5ACD') +
# geom_label(aes(label = paste0(round(RMSE,1)))) +
theme_minimal()
test_set %>% select(avg_count) %>% cbind(rf_pred) %>% summarise(
MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
RMSE = sqrt(mean((avg_count - .pred)^2)),
RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100)%>%
kbl() %>%
kable_styling()

| MAPE | SMAPE | RMSE | RMPSE |
|---|---|---|---|
| 55.02091 | 40.0635 | 21.47979 | 100.2898 |
test_set %>% mutate(load= ifelse(avg_count<50, "low", ifelse(avg_count<100, "mid","high"))) %>% select(avg_count, load) %>% cbind(rf_pred) %>%
group_by(load) %>% summarise(
MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
RMSE = sqrt(mean((avg_count - .pred)^2)),
RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100,
avg_count=mean(avg_count))%>%
kbl() %>%
kable_styling()

| load | MAPE | SMAPE | RMSE | RMPSE | avg_count |
|---|---|---|---|---|---|
| high | 12.26562 | 13.36645 | 30.24426 | 14.95944 | 181.54412 |
| low | 79.22764 | 54.34520 | 19.42408 | 126.07106 | 21.33500 |
| mid | 17.82886 | 20.04539 | 13.85533 | 21.73968 | 65.19231 |
Performance across different volume locales
All four models exhibit the worst performance in terms of RMSE but the best performance in terms of Mean Absolute Percentage Error within high volume locales, highlighting how errors need to be considered in context. The spatial wards model had an RMSE of 30 in high volume locations with more than 150 counts on average, but this corresponded to a MAPE of 12 and an RMPSE of 15 - relatively low percentage errors. Comparatively, an RMSE of 19 is associated with an RMPSE of 126 in low volume locations, despite such absolute errors likely being inconsequential for most planning decisions. This is most evident from the plot below, which plots the RMSE alongside scatter points of the counts, illustrating how most of the true errors are rather low, but are amplified when considered in proportion to the lower counts.
test_set %>%
mutate(load= ifelse(avg_count<20, "low", ifelse(avg_count<50, "mid","high")) )%>%
select(avg_count,load) %>% cbind(rf_pred) %>% group_by( load) %>% summarise(
MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
RMSE = sqrt(mean((avg_count - .pred)^2)),
RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100,
avg_count= mean(avg_count)
)%>% ggplot() +
geom_col(aes(x=load, y=RMSE), fill = "#562DAB")+
geom_jitter(test_set %>%
mutate(load= ifelse(avg_count<50, "low", ifelse(avg_count<100, "mid","high")) )%>%
select(avg_count,load) %>% cbind(rf_pred),mapping=aes(x=load, y=avg_count),width=0.3, alpha=1, color = "#E3D7F5")+
# geom_col(aes(x=hour, y=MAPE))+
theme_minimal()
Temporal Model Performance
Predicting for unknown locations without records means that we cannot rely on past data, such as time lags, to make estimates, as such information does not exist for the locations we are predicting. However, there is strong evidence that bicycle counts vary rather similarly across most of DC, and hence a reasonable estimate can be derived by modelling hourly counts as a function of spatial variation and temporal variation at each location.
For the temporal model, we were hence interested in which spatially varying predictors provided the best performance improvements when modelled alongside simple time-based features such as time of day and hour. At earlier stages of the modelling process, we also found that taking the logarithm of bicycle counts and including block group fixed effects greatly improved model performance (see the sketch after the list below). In the interest of clarity and conciseness, the rest of this section highlights five models for comparison, namely:
No spatially varying predictors
CABI with KDE
CABI, KDE & spatial predictions
CABI & spatial predictions
Spatial predictions & KDE
Train and Test Data Specification
Model Performance
Log Model without CABI, KDE, or Spatial Predictions
The base model with only basic time fixed effects has a rather high RMSE of 133 and a MAPE of 133%. This poor performance is attributable to the model essentially predicting hourly bike counts as a function of time alone, distinguishing spatial differences only via block groups. Most of the variation across locations is hence unaccounted for.
load("Time_log_TOD_nopred_nocabi_nokde_rmse133.Rdata")
test_set %>%
select(Flow_Count) %>%
cbind(rf_pred) %>%
summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100)%>%
kbl() %>%
kable_styling()

| MAPE | SMAPE | RMSE | RMPSE |
|---|---|---|---|
| 133.6887 | 114.5037 | 133.3843 | 420.3019 |
test_set %>% mutate(load= ifelse(Flow_Count<50, "low", ifelse(Flow_Count<150, "mid","high"))) %>% select(Flow_Count, load) %>% cbind(rf_pred) %>%
group_by(load) %>% summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
Flow_Count=max(Flow_Count)) %>%
kbl() %>%
kable_styling()

| load | MAPE | SMAPE | RMSE | RMPSE | Flow_Count |
|---|---|---|---|---|---|
| high | 69.33513 | 111.9741 | 320.08043 | 72.98628 | 1560 |
| low | 166.40720 | 116.0051 | 11.90840 | 514.27378 | 48 |
| mid | 70.27283 | 111.1914 | 67.48759 | 71.33713 | 147 |
Log Model with CABI and KDE, without Spatial Predictions
Including Capital Bikeshare and kernel density estimate features improves performance substantially, essentially halving the error rates across most metrics.
load("Time_log_TOD_cabi_kde_nopred_rmse68.Rdata")
test_set %>%
select(Flow_Count) %>%
cbind(rf_pred) %>%
summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100)%>%
kbl() %>%
kable_styling()

| MAPE | SMAPE | RMSE | RMPSE |
|---|---|---|---|
| 74.04041 | 89.90652 | 68.70751 | 198.1228 |
test_set %>% mutate(load= ifelse(Flow_Count<50, "low", ifelse(Flow_Count<150, "mid","high"))) %>% select(Flow_Count, load) %>% cbind(rf_pred) %>%
group_by(load) %>% summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
Flow_Count=max(Flow_Count))%>%
kbl() %>%
kable_styling()

| load | MAPE | SMAPE | RMSE | RMPSE | Flow_Count |
|---|---|---|---|---|---|
| high | 30.62307 | 36.99418 | 161.41049 | 36.25464 | 1560 |
| low | 94.10425 | 113.94237 | 12.87767 | 241.54565 | 48 |
| mid | 38.90830 | 48.67405 | 42.36437 | 51.15077 | 147 |
Log Model with CABI, KDE, and Spatial Predictions
Including all the spatially engineered features - Capital Bikeshare data, spatial predictions, and kernel density estimates - decreases model performance, suggesting overfitting.
load("Time_log_TOD_cabi_kde_pred_rmse70.Rdata")
test_set %>%
select(Flow_Count) %>%
cbind(rf_pred) %>%
summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100)%>%
kbl() %>%
kable_styling()

| MAPE | SMAPE | RMSE | RMPSE |
|---|---|---|---|
| 73.63031 | 90.07854 | 70.3944 | 205.4105 |
test_set %>% mutate(load= ifelse(Flow_Count<50, "low", ifelse(Flow_Count<150, "mid","high"))) %>% select(Flow_Count, load) %>% cbind(rf_pred) %>%
group_by(load) %>%
summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
Flow_Count=max(Flow_Count))%>%
kbl() %>%
kable_styling()

| load | MAPE | SMAPE | RMSE | RMPSE | Flow_Count |
|---|---|---|---|---|---|
| high | 31.03486 | 37.67099 | 165.59025 | 36.69641 | 1560 |
| low | 93.30793 | 114.00810 | 12.98295 | 250.53798 | 48 |
| mid | 39.18761 | 48.77115 | 42.85780 | 51.67716 | 147 |
Log Model with CABI and Spatial Predictions, without KDE
The model with CABI data and spatial predictions alone does not perform as well as the previous model, with an RMSE of 73. Interestingly, however, it has a lower %RMSE (RMPSE) than the model with all spatial variables in areas with medium volumes (mean < 150), highlighting the importance of considering different model configurations for the intended planning decision and locale.
load("Time_log_TOD_cabi_rmse73.Rdata")
test_set %>%
select(Flow_Count) %>%
cbind(rf_pred) %>%
summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100)%>%
kbl() %>%
kable_styling()

| MAPE | SMAPE | RMSE | RMPSE |
|---|---|---|---|
| 84.79347 | NaN | 73.3175 | 342.3214 |
test_set %>% mutate(load= ifelse(Flow_Count<50, "low", ifelse(Flow_Count<150, "mid","high"))) %>% select(Flow_Count, load) %>% cbind(rf_pred) %>%
group_by(load) %>% summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
Flow_Count=max(Flow_Count))%>%
kbl() %>%
kable_styling()

| load | MAPE | SMAPE | RMSE | RMPSE | Flow_Count |
|---|---|---|---|---|---|
| high | 31.91663 | 40.08549 | 173.07089 | 37.87136 | 1560 |
| low | 109.97425 | NaN | 13.88127 | 419.80960 | 48 |
| mid | 39.16885 | 49.98051 | 41.91161 | 48.30286 | 147 |
Log Model with Spatial Predictions and KDE, without CABI
The model with the best overall performance includes the spatial predictions and kernel density estimates and excludes Capital Bikeshare features, with an RMSE of 65.5. Performance across locations and times of different volumes is also much better than in the other models.
load("Time_log_TOD_no_cabi_kde.Rdata")
test_set %>%
select(Flow_Count) %>%
cbind(rf_pred) %>%
summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100
)%>%
kbl() %>%
kable_styling()

| MAPE | SMAPE | RMSE | RMPSE |
|---|---|---|---|
| 81.29022 | NaN | 65.49156 | 215.6808 |
test_set %>% mutate(load= ifelse(Flow_Count<50, "low", ifelse(Flow_Count<150, "mid","high"))) %>% select(Flow_Count, load) %>% cbind(rf_pred) %>%
group_by(load) %>% summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
Flow_Count=mean(Flow_Count))%>%
kbl() %>%
kable_styling()

| load | MAPE | SMAPE | RMSE | RMPSE | Flow_Count |
|---|---|---|---|---|---|
| high | 27.49657 | 32.57583 | 153.18023 | 34.05269 | 386.81405 |
| low | 106.11811 | NaN | 13.24458 | 263.42707 | 12.24130 |
| mid | 37.87989 | 46.81794 | 41.63423 | 50.26581 | 86.62022 |
From the plot of values and RMSE below, we similarly observe that the errors are rather low in relation to actual values for high and medium volume locations. For low volume locations, the raw prediction errors are rather low despite the higher percentage errors. Nonetheless, the impact of these errors will depend on the specific application of the model.
test_set %>% group_by(Site_Name) %>%
mutate(load= ifelse(mean(Flow_Count)<50, "low", ifelse(mean(Flow_Count)<100, "mid","high")) )%>%
select(Flow_Count,load, time_of_day) %>% cbind(rf_pred) %>% group_by( load,time_of_day) %>% summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
Flow_Count= mean(Flow_Count)
)%>% ggplot() +
geom_col(aes(x=time_of_day, y=RMSE), fill = "#E3D7F5", color = "grey30")+
geom_jitter(test_set %>% group_by(Site_Name) %>%
mutate(load= ifelse(mean(Flow_Count)<50, "low", ifelse(mean(Flow_Count)<100, "mid","high")) )%>%
select(Flow_Count,load, time_of_day) %>% cbind(rf_pred),mapping=aes(x=time_of_day, y=Flow_Count),width=0.3, alpha=0.1, color = "#6A3BB8")+
# geom_col(aes(x=hour, y=MAPE))+
facet_wrap(~load)+
theme_minimal()
Final Model Configurations
Model Generalizability
Having selected the best model as our final model, we conducted a series of tests of generalizability across space and time.
Temporal Model
Building on the spatial strengths of our model, our temporal predictions also show a distinct pattern of accuracy. The temporal model performs best during daytime peak hours, with significantly lower error rates between 7-10am and 4-7pm.
The blue bars show average flow count throughout the day, while the red bars represent the Mean Absolute Percentage Error. Notice how the error rates decrease substantially during precisely the hours when cycling volumes are highest. This is particularly valuable because daytime peak hours are when bike lanes operate near capacity and when planning decisions have the greatest impact on cyclist safety and efficiency.
test_set %>% select(Flow_Count,hour) %>% cbind(rf_pred) %>% group_by(hour) %>% summarise(
MAPE = mean(abs((Flow_Count + 1 - .pred) / (Flow_Count + 1))) * 100,
SMAPE = mean(2 * abs(Flow_Count - .pred) / (abs(Flow_Count) + abs(.pred))) * 100,
RMSE = sqrt(mean((Flow_Count - .pred)^2)),
RMPSE = sqrt(mean(((Flow_Count - .pred) / (Flow_Count + 1))^2)) * 100,
Flow_Count=mean(Flow_Count)
)%>% ggplot() +
# geom_col(aes(x=hour, y=RMSE))+
geom_col(aes(x=hour, y=MAPE), fill= 'red', alpha=0.5) +
geom_col(aes(x=hour, y=Flow_Count), fill= 'blue', alpha=0.3) +
theme_minimal() +
theme()
Spatial Model
Looking at the distribution of our spatial model’s performance geographically, it achieves its highest accuracy when predicting bike counts in lanes within the Central Business District. On this map, the darker purple dots represent locations with lower error rates, clearly concentrated in the CBD area.
Moreover, beyond technical performance, we also evaluated our model through an equity lens. An important consideration for any urban planning tool is whether it performs equally well across different communities. Interestingly, our analysis shows that our model actually performs better within minority-populated areas, with lower error rates on average compared to majority-white areas.
test_set %>% select(avg_count,GEOID,pct_white,pct_black,pct_asian, pct_hispanic) %>% cbind(rf_pred) %>% mutate(maj_white= ifelse(pct_white>pct_black+pct_asian+pct_hispanic, "Yes","No")) %>% group_by(maj_white) %>% summarise(
MAPE = mean(abs((avg_count + 1 - .pred) / (avg_count + 1))) * 100,
SMAPE = mean(2 * abs(avg_count - .pred) / (abs(avg_count) + abs(.pred))) * 100,
RMSE = sqrt(mean((avg_count - .pred)^2)),
RMPSE = sqrt(mean(((avg_count - .pred) / (avg_count + 1))^2)) * 100,
avg_count=mean(avg_count)) %>%
pivot_longer(
cols = c(MAPE, SMAPE, RMSE, RMPSE, avg_count),
names_to = "metric",
values_to = "value"
) %>% drop_na() %>% ggplot()+
geom_bar(mapping=aes(fill = maj_white,x = metric, y=value),stat = "identity",
position = position_dodge(width = 1)) +
scale_fill_manual(values = c("Yes" = "#E3D7F5", "No" = "#6A5ACD"))+
theme_minimal()
Findings
Performance comparisons with other cities (Kothuri et al. 2022)
Our peak hour static estimates are more accurate than existing estimates made in other cities, with much lower MAPE overall
It also performs exceptionally well within high volume locales, especially in comparison to other cities.
The hourly dynamic estimates also perform much better than static estimates made in the study.
Observations
Feature importance
To demonstrate the relative importance of features in both models, the feature importances of the models containing all features are presented for comparison. As previously mentioned, these values are based on permutation importance - i.e., how much randomly shuffling each feature’s values degrades model performance, averaged over several permutation cycles.
Spatial Model
Notably, Capital Bikeshare estimates have the greatest influence, followed by the percentage of white population and location in Ward 2. These are followed by network and street characteristics such as PCI, the infrastructure score, and the presence of dual protected lanes.
p <- vip::vip(full_fit, num_features = 30)
# So instead, modify the internal geom:
p + ggplot2::geom_col(fill = "#7A50C7") # may result in doubled bars unless cleanedTemporal Model
For the temporal model, the manual-count KDE has the greatest permutation importance, followed by whether the time of day is overnight. Capital Bikeshare counts are relatively less important. Each hour of the day is assigned roughly similar importance, highlighting how the random forest captures the association between each hour and different volumes, whereas a LASSO or ridge regression would remove these features due to their low individual proportion of variance explained.
p2 <- vip::vip(full_fit, num_features = 30)
# So instead, modify the internal geom:
p2 + ggplot2::geom_col(fill = "#7A50C7") # may result in doubled bars unless cleanedObserved Trends
To visualize our model predictions across space and time, we developed an interactive Dashboard using Leaflet and JavaScript. The dashboard allows users to explore predicted bicycle volumes at the street-segment level throughout Washington, DC.
As shown in the examples below, selecting a specific street reveals both spatial metrics—such as predicted average hourly ridership and land use context—and temporal patterns across the week. For example, on 9th St in Chinatown, our model predicts 31 riders per hour at peak time, while the temporal panel shows variation across days. Another example from G St in the GWU neighborhood highlights the comparative view, placing predicted ridership, local demographics, and infrastructure context side by side.
Conclusions and Recommendations
What can our model not tell us?
“All models are wrong, but some models are useful” – George Box
It is important to highlight the inherent limitations of our predictions given the sparsity of the data. While we attempt, as far as possible, to ensure the generalizability of the model across different and unseen locations, the low number of automated counter locations means that the model's true accuracy across much of DC remains unknown and untestable. Our modelling workflow partially circumvents this through spatial cross-validation during the tuning phase (see the spatial model workflow in the appendix), but it is not possible to test or comment on the generalizability of the temporal model without sufficient data points across DC for a rigorous spatial train-test validation process.
The selection of automated counter sites also introduces an inherent bias to our model which cannot be addressed by model specifications or workflows. Selection biases may arise from data collection decisions, such as the prioritization of key transport corridors for counter installation during the site selection process. This means that training and testing are restricted to these types of locations, which may partly explain the consistency of the patterns observed at automated counter sites.
The strong accuracy and generalizability of our spatial model gives us a good degree of confidence in the temporal model's accuracy during peak hours at unseen locations. Nonetheless, given the aforementioned caveats and limitations, predicted values outside of peak hours should be used with caution, particularly in locales far from recorded locations and where existing knowledge suggests that daily patterns may differ substantially from the conventional dual-peak pattern. Our exploratory analysis suggests that this may be a greater issue for areas with low volume counts.
What can our model tell us?
This model and its outputs should be seen as an initial step in estimating and summarizing bike counts across DC. At the current stage of data completeness, it is most useful for drawing general conclusions about how bicycle volumes differ across neighbourhoods in DC and across broad periods of the day. As our dashboard demonstrates, this is nonetheless valuable for visualizing how biking patterns vary across DC and how they may relate to various built environment and demographic characteristics. Most of our modelling decisions were aimed at enabling such broad, non-granular analyses. For instance, we opted for a simple kernel density estimate instead of more complex network-based interpolation measures that account for cyclists being bound to the street network. More complex and accurate features will become more useful once counts are available at greater spatial granularity, allowing models that capture the true complexity of biking patterns in DC. Where more complete flow data exists, models such as gravity models or Bayesian flow networks could be employed to model bicycle counts, though this is currently unfeasible given the data requirements of such approaches.
Recommendations
As such, much remains to be done to test and validate this model. In the long term, we recommend that DDOT expand the coverage of its automated counters and do so across a greater variety of locations, which will allow for further model iteration and testing in currently unknown locations. This process will likely revise conclusions about which features matter and about bicycling patterns in DC more broadly. As data coverage improves, more accurate and generalizable conclusions can be drawn from this model.
Where constraints surrounding data collection exist, domain knowledge and an understanding of local contexts within DC will be extremely useful for validating model outputs. This may range from simply asking "does this value make sense in this location?" to conducting detailed site surveys and interviews. The flaws inherent in the data and this model are a strong reminder of the limits of data-driven approaches in urban practice. We can only model as well as the data allows, and where much remains unknown, it is imperative that urban practitioners evaluate model outputs from a critical perspective.
Appendices: Sample Code
Centrality
library(tidygraph)
library(sfnetworks)
seg_network <- st_cast(seg_network, "LINESTRING")
seg_network <- as_sfnetwork(seg_network)
seg_network <- seg_network %>%
  activate("nodes") %>%
  mutate(
    degree = centrality_degree(),            # number of direct connections
    betweenness = centrality_betweenness(),  # how often a node lies on shortest paths
    closeness = centrality_closeness()       # average distance to all other nodes
  )
library(igraph)
# Convert to an igraph object for connectivity analysis
graph_obj <- as.igraph(seg_network)
# Check if the network is fully connected
network_connected <- is_connected(graph_obj)
print(network_connected)
# Identify connected components
components_info <- components(graph_obj)
print(components_info)
Spatial Model Workflow
set.seed(717)
library(spatialsample)
# library(tidymodels)
library(recipes)
library(rsample)
library(parsnip)
library(workflows)
library(yardstick)
library(dplyr)
library(lubridate)
library(tidyverse)
library(tune)
library(sf)
theme_set(theme_bw())
"%!in%" <- Negate("%in%")
g <- glimpse
# Mean distance from each point in measureFrom to its k nearest neighbours in measureTo
nn_function <- function(measureFrom, measureTo, k) {
library(FNN)
nn <-
FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
get_week_of_month <- function(date) {
week(date) - week(floor_date(date, "month")) + 1
}
#######train test split######
setwd('C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/GeoCloud/DC_DOT')
cabi_sum <- read.csv('Data/cabi_sum.csv')
segments <- st_read("C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/GeoCloud/DC_DOT/Data/segments_network.geojson"
)
cabi_max <- cabi_sum %>% group_by(seg_id) %>% summarise(cabi_count=max(cabi_count))
# cabi_sum <- cabi_sum[cabi_sum$hour %in% c(7,8,9,10, 15, 16, 17, 18),]
load("C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/GeoCloud/DC_DOT/model/data_joined(1).RData")
manual_counts <- manual_234 %>% st_drop_geometry()
# read.csv("Data/manual_21_to_24.csv")
# Helper: extract the numeric hour and AM/PM indicator from a time string
extract_hour_ampm <- function(time_str) {
time_str <- str_trim(time_str) # Remove extra spaces
# Extract AM/PM
am_pm <- ifelse(grepl("AM", time_str), "AM", "PM")
# Extract numeric hour
hour <- as.numeric(str_extract(time_str, "^[0-9]+"))
return(data.frame(Hour = hour, AM_PM = am_pm, stringsAsFactors = FALSE))
}
manual <- segments %>%right_join(manual_counts %>% select(-SUBBLOCKKEY,-SUBBLOCKID) %>% ungroup(), by='seg_id' )%>% filter(!is.na(Total))%>%
mutate(
Start_Hour_AMPM = lapply(Start_Time, extract_hour_ampm),
End_Hour_AMPM = lapply(End_Time, extract_hour_ampm)
) %>%
tidyr::unnest_wider(Start_Hour_AMPM, names_sep = "_") %>%
tidyr::unnest_wider(End_Hour_AMPM, names_sep = "_") %>%
rename(Start_Hour = Start_Hour_AMPM_Hour, Start_AMPM = Start_Hour_AMPM_AM_PM,
End_Hour = End_Hour_AMPM_Hour, End_AMPM = End_Hour_AMPM_AM_PM)%>%
mutate(
Hours_Duration = ifelse(End_AMPM == Start_AMPM, End_Hour - Start_Hour,
(12 - Start_Hour) + End_Hour) # Adjust if AM/PM changes
) %>% group_by(seg_id, Start_AMPM) %>% mutate(Total=sum(Total))%>% ##remove direction and get totals
##
mutate(avg_count= Total/Hours_Duration) %>%
ungroup()
# centrality <- read.csv("Data/network.csv")
# Select predictor columns by position
main <- manual[, c(1, 2:4, 10:18, 22:63, 95, 126, 137, 141, 146, 148, 164, 168)]
main %>% write.csv("manual_full.csv")
main <- main[!is.na(main$seg_id), ]
main <- main %>% mutate(Ward=as.factor(Ward))
main <- main %>% left_join(cabi_max)
# main <- main %>% left_join(centrality)
# main <- main %>% st_drop_geometry() %>% select(-geometry)
# main <- main %>% mutate(obs_weight = if_else(avg_count > 100, 10, if_else(avg_count > 25, 5, 1)))
# View(main[is.na(main$SUBBLOCKKEY), ])
### Initial Split for Training and Test
data_split <- initial_split(main , strata = "avg_count", prop = 0.80)
train_set <- training(data_split)
test_set <- testing(data_split)
# This generates synthetic examples focused on rare/extreme values
train_set <- train_set %>%
mutate(count_bin = cut(avg_count, breaks = 5))
train_set %>% ggplot(aes(x = count_bin)) + geom_bar()
rare_bins <- train_set %>%
group_by(count_bin) %>%
summarise(count = n()) %>%
filter(count < 50) #
# Select high-count rows
high_count_rows <- train_set %>% filter(avg_count > 100)
# Create synthetic rows: four jittered copies of the high-count rows
# (adjust sd to control the noise level)
synthetic_rows <- purrr::map_dfr(1:4, ~ high_count_rows %>%
  mutate(avg_count = avg_count + rnorm(n(), mean = 0, sd = 5)))
# Combine original data with synthetic rows
train_set_synthetic <- bind_rows(train_set, synthetic_rows)
train_set_synthetic %>% ggplot(aes(x = count_bin)) + geom_bar()
# Replace with newer spatialsample package approach
library(spatialsample)
# Spatial CV using newer approach
## LOGOCV on Neighborhood with group_vfold_cv()
cv_splits_geo <- group_vfold_cv(train_set,
group = "GEOID")
# Get the first fold's split object
fold_1 <- cv_splits_geo$splits[[1]]
# Extract the analysis (training) set from the first fold
analysis_set_1 <- analysis(fold_1)
# Extract the assessment (validation) set from the first fold
assessment_set_1 <- assessment(fold_1)
### Create Recipes
# List the factor/character columns that will need encoding in the recipe
test_set %>% select(where(~ is.factor(.) || is.character(.))) %>% colnames()
# Feature Creation
model_rec <- recipe(avg_count ~ ., data = train_set_synthetic
%>% select(-seg_id,-count_bin,-BLOCKKEY, -has_parking,
-has_buffer, -has_doubleyellow,
-GEOID #geoid,,
# -cabi_count,
# -Ward
# -TOTALBIKELANES.x, -TOTALBIKELANEWIDTH.x,
# -TRAIL_CLASS, -CENTERLINE,- LIGHTING
))%>%
update_role(Ward, new_role = "Neighborhood") %>%
# update_role(obs_weight, new_role = "case_weight") %>%
step_other(Ward, threshold = 0.005) %>%
step_mutate_at(all_predictors()&all_numeric(), fn = ~ ifelse(is.na(.), 1e-6, .)) %>%
step_novel(all_nominal()) %>%
step_unknown(all_nominal()) %>%
step_dummy(all_nominal(), one_hot = TRUE) %>%
# step_log(avg_count, offset = 1, skip = TRUE) %>%
# step_nzv(all_predictors()&all_numeric(),-avg_count,freq_cut = 95/5, unique_cut = 10) %>%
step_normalize(where(is.double),-avg_count)
# # step_zv(all_predictors()) %>%
# step_center(where(is.double) )%>%
# step_scale(where(is.double) )
# step_ns(y,x, options = list(df = 4),keep_original_cols=T)
glimpse(model_rec %>% prep() %>% juice())
# transform new data
model_rec %>% prep() %>% bake(new_data = main %>%select(-seg_id,-BLOCKKEY,-has_parking, -has_buffer, -has_doubleyellow,
# -TOTALBIKELANES.x, -TOTALBIKELANEWIDTH.x,
# -TRAIL_CLASS, -CENTERLINE,- LIGHTING
))
cor_matrix <- cor(main %>% select(where(is.numeric)),use='complete.obs')
cor_matrix['avg_count',]
########################################model testing###################################
## Model specifications
# Note: case weights would need add_case_weights() in the workflow; the
# obs_weight role above is commented out, so no case weights are passed here.
lm_plan <-
  linear_reg() %>%
  set_engine("lm")
glmnet_plan <-
  linear_reg() %>%
  set_args(penalty = tune()) %>%
  set_args(mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("regression")
rf_plan <- rand_forest() %>%
  set_args(mtry = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 2000) %>%  # parsnip's argument is 'trees' (passed to ranger as num.trees)
  set_engine("ranger", importance = "permutation") %>%
  set_mode("regression")
XGB_plan <- boost_tree() %>%
  set_args(trees = tune()) %>%
  set_args(learn_rate = tune()) %>%  # learning rate (shrinkage)
  set_args(tree_depth = tune()) %>%  # maximum tree depth
  set_args(min_n = tune()) %>%       # minimum node size
  # set_args(colsample_bytree = tune()) %>% # feature subsampling
  set_args(early_stopping_rounds = 10) %>%  # early stopping
  set_engine("xgboost", validation = 0.2,
             # objective = "count:poisson",
             lambda = 0.1, alpha = 0.1) %>%
  set_mode("regression")
# Hyperparameter grid for glmnet (penalization)
glmnet_grid <- expand.grid(
penalty = 10^seq(-4, 2, length.out = 50), # Wider range of penalties
mixture = seq(0, 1, by = 0.25)
)
# Hyperparameter grid for random forest (mtry and min_n)
rf_grid <- expand.grid(
  mtry = c(13, 30, 45, 80, 110),
  min_n = c(5, 50, 100, 500)
)
# Hyperparameter grid for XGBoost (trees, learning rate, depth, node size; early stopping caps trees)
xgb_grid <- expand.grid(
trees = seq(10, 800, by = 50),
learn_rate = c(0.01, 0.1, 0.3),
tree_depth = c( 5, 6, 7),
min_n = c( 5, 50, 80)
# colsample_bytree = c(0.5, 0.7, 1)
)
# trees for xgboost does not need to be as large as for the random forest
# create workflow
lm_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(lm_plan)
glmnet_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(glmnet_plan)
rf_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(rf_plan)
xgb_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(XGB_plan)
# fit model to workflow and calculate metrics
# Single control object used for both fit_resamples() and tune_grid()
control <- control_grid(
  save_pred = TRUE,         # ensure predictions are saved
  extract = function(x) x,  # store the full model object
  verbose = TRUE)
metrics <- metric_set(rmse, rsq, mape, smape)
lm_tuned <- lm_wf %>%
tune::fit_resamples(.,
resamples = cv_splits_geo,
control = control,
metrics = metrics)
glmnet_tuned <- glmnet_wf %>%
tune::tune_grid(.,
resamples = cv_splits_geo,
grid = glmnet_grid,
control = control,
metrics = metrics)
rf_tuned <- rf_wf %>%
tune::tune_grid(.,
resamples = cv_splits_geo,
grid = rf_grid,
control = control,
metrics = metrics)
xgb_tuned <- xgb_wf %>%
tune::tune_grid(.,
resamples = cv_splits_geo,
grid = xgb_grid,
control = control,
metrics = metrics)
# ## metrics across grid
# collect_metrics(xgb_tuned)
# autoplot(xgb_tuned)
# collect_metrics(glmnet_tuned)
## 'Best' by some metric and margin
show_best(lm_tuned, metric = "rmse", n = 15)
show_best(glmnet_tuned, metric = "rmse", n = 15)
show_best(rf_tuned, metric = "rmse", n = 15)
show_best(xgb_tuned, metric = "rmse", n = 15)
lm_best_params <- select_best(lm_tuned, metric = "rmse" )
glmnet_best_params <- select_best(glmnet_tuned, metric = "rmse")
rf_best_params <- select_best(rf_tuned, metric = "rmse" )
xgb_best_params <- select_best(xgb_tuned, metric = "rmse" )
## Final workflow
lm_best_wf <- finalize_workflow(lm_wf, lm_best_params)
glmnet_best_wf <- finalize_workflow(glmnet_wf, glmnet_best_params)
rf_best_wf <- finalize_workflow(rf_wf, rf_best_params)
xgb_best_wf <- finalize_workflow(xgb_wf, xgb_best_params)
# last_fit() emulates the process where, after determining the best model, the final fit on the entire training set is needed and is then evaluated on the test set.
lm_val_fit_geo <- lm_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metrics)
glmnet_val_fit_geo <- glmnet_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metrics)
rf_val_fit_geo <- rf_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metrics)
xgb_val_fit_geo <- xgb_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metrics)
##########################################################
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds <- collect_predictions(lm_tuned)
glmnet_best_OOF_preds <- collect_predictions(glmnet_tuned) %>%
filter(penalty == glmnet_best_params$penalty[1] & mixture == glmnet_best_params$mixture[1])
rf_best_OOF_preds <- collect_predictions(rf_tuned) %>%
filter(mtry == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])
xgb_best_OOF_preds <- collect_predictions(xgb_tuned) %>%
filter(min_n == xgb_best_params$min_n[1]& .config == xgb_best_params$.config)
# collect validation set predictions from last_fit model
lm_val_pred_geo <- collect_predictions(lm_val_fit_geo)
glmnet_val_pred_geo <- collect_predictions(glmnet_val_fit_geo)
rf_val_pred_geo <- collect_predictions(rf_val_fit_geo)
xgb_val_pred_geo <- collect_predictions(xgb_val_fit_geo)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds <- rbind(
data.frame(dplyr::select(lm_best_OOF_preds, .pred, avg_count), model = "lm") ,
data.frame(dplyr::select(glmnet_best_OOF_preds, .pred, avg_count), model = "glmnet"),
data.frame(dplyr::select(rf_best_OOF_preds, .pred, avg_count), model = "rf"),
data.frame(dplyr::select(xgb_best_OOF_preds, .pred, avg_count), model = "xgb")
# %>% mutate(.pred=expm1(.pred))
) %>%
group_by(model) %>%
mutate(
RMSE = yardstick::rmse_vec(avg_count, .pred),
MAE = yardstick::mae_vec(avg_count, .pred),
MAPE = yardstick::mape_vec(avg_count, .pred)) %>%
ungroup() %>%
mutate(model = factor(model, levels=c(
"lm",
"glmnet",
"rf",
"xgb"
)))
# average error for each model
ggplot(data = OOF_preds %>%
dplyr::select(model, RMSE) %>%
distinct() ,
aes(x = model, y = RMSE, group = 1)) +
geom_path(color = "red") +
geom_label(aes(label = paste0(round(RMSE,1)))) +
theme_bw()
# OOF predicted versus actual
ggplot(OOF_preds, aes(x = avg_count, y = .pred, group = model)) +
geom_point(alpha = 0.3) +
geom_abline(linetype = "dashed", color = "red") +
geom_smooth(method = "lm", color = "blue") +
# coord_equal() +
facet_grid(~model) +
ylim(0,300)+
theme_bw()
full_fit <- rf_best_wf %>% fit(train_set)
# Despite the name, this is the RF workflow refit on all labelled data for prediction
full_fit_lm <- rf_best_wf %>% fit(main)
rf_pred_pm = predict(full_fit_lm, new_data = segments %>% left_join(manual_counts %>% select(seg_id,Ward, TOTALBIKELANES, TOTALBIKELANEWIDTH,
TRAIL_CLASS, CENTERLINE, LIGHTING), by='seg_id') %>% left_join(cabi_max) %>%
# rename(TOTALBIKELANES.x= TOTALBIKELANES, TOTALBIKELANEWIDTH.x= TOTALBIKELANEWIDTH) %>%
mutate(Start_AMPM= 'PM',
Type.y=ifelse(is.na(road_class), "Trails", "Lanes"
),
Ward=as.factor(Ward)))
# rf_pred_pm$.pred<- rf_pred_pm$.pred %>% expm1()
rf_pred_am = predict(full_fit_lm, new_data = segments %>% left_join(manual_counts %>% select(seg_id,Ward, TOTALBIKELANES, TOTALBIKELANEWIDTH,
TRAIL_CLASS, CENTERLINE, LIGHTING),by='seg_id') %>% left_join(cabi_max)%>%
# rename(TOTALBIKELANES.x= TOTALBIKELANES, TOTALBIKELANEWIDTH.x= TOTALBIKELANEWIDTH) %>%
mutate(Start_AMPM= 'AM',
Type.y=ifelse(is.na(road_class), "Trails", "Lanes"
),
Ward=as.factor(Ward)))
# rf_pred_am$.pred<- rf_pred_am$.pred %>% expm1()
segments %>% left_join(manual_counts %>% select(seg_id,Ward, TOTALBIKELANES, TOTALBIKELANEWIDTH,
TRAIL_CLASS, CENTERLINE, LIGHTING),by='seg_id') %>% select(seg_id,GEOID,Ward) %>%
cbind(rf_pred_am,rf_pred_pm) %>% distinct() %>%
st_write('spatial_predictions6.csv', append = FALSE)
save(full_fit, test_set, rf_pred, OOF_preds, file = "model/Time_log_no_time_of_day_kde.RData")
Temporal Model Workflow
# Install (if needed) and then load a package given its name as a string
load.fun <- function(x) {
  x <- as.character(x)
  if (!requireNamespace(x, quietly = TRUE)) {
    message(x, " : not installed; installing")
    install.packages(x)
  }
  library(x, character.only = TRUE)
}
########### Required Packages ###########
packages = c(
# Core ML
"tidymodels", "tidyverse",
# Specific model packages
"glmnet", "ranger", "xgboost",
# Newer spatial packages
"sf", "stars", "terra", "spatialsample", "nngeo", "mapview",
# Newer extensions
"finetune", "themis", "applicable", "vetiver",
# Existing packages
"bayesplot", "lme4", "RcppEigen", "AmesHousing"
)
for (pkg in packages) {
  load.fun(pkg)
}
set.seed(717)
theme_set(theme_bw())
"%!in%" <- Negate("%in%")
g <- glimpse
# Mean distance from each point in measureFrom to its k nearest neighbours in measureTo
nn_function <- function(measureFrom, measureTo, k) {
library(FNN)
nn <-
FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
get_week_of_month <- function(date) {
week(date) - week(floor_date(date, "month")) + 1
}
#######train test split######
setwd('C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/GeoCloud/DC_DOT/')
segments <- read.csv(
  # "Data/Lanes_KumaHao.geojson"
  "spatial_predictions6.csv"  # peak-hour predictions written by the spatial workflow above
)
load("data_joined(1).RData")
# auto_counts <- st_read("Data/auto_2024_by_hr.csv")
auto_counts$interval_60 <- as.numeric(auto_counts$interval_60)
auto_counts$interval_60 <- as.POSIXct(auto_counts$interval_60, origin = "1970-01-01", tz = "UTC")
library(lubridate)
auto_counts <- auto_counts %>%
mutate(WkoMn = get_week_of_month(interval_60))
# full_weeks %>% select(Site_Name) %>% distinct()
cabi_sum <- read.csv('C:/Users/Xian Lu Lee/OneDrive - PennO365/Datasets/GeoCloud/DC_DOT/Data/cabi_sum.csv') %>% select(-X) %>% drop_na()
main <- segments %>% left_join(auto_counts, by= "seg_id") %>% mutate(Flow_Count= as.numeric(Flow_Count)) %>%
filter(!is.na(interval_60)) %>%
mutate(hour = as.numeric(hour(interval_60)),dotw = as.numeric(wday(interval_60, label=TRUE))) %>% st_drop_geometry()
main <- main %>%
left_join(cabi_sum, by= c('seg_id','hour', 'dotw'))
main <- main %>%
mutate(cabi_count = ifelse(is.na(cabi_count), 0, cabi_count))
# main <- main %>% drop_na()
main$hour= as.factor(main$hour)
main$.pred <- as.numeric( main$.pred)
main$.pred.1 <- as.numeric( main$.pred.1)
library(ggplot2)
main_summary <- main %>%
count(Site_Name, dotw, WkoMn) %>% mutate(count=if_else(n>0, 1, 0))
# ggplot(weekdays, aes(y= Flow_Count, x= hour)) +geom_point()
# weekdays %>% group_by(hour) %>% summarise(mean=mean(Flow_Count)) %>% View()
ggplot(main_summary, aes(x = dotw, y = as.factor(WkoMn), fill = count)) +
geom_tile(color = "white") +
scale_fill_viridis_c(name = "Data Collected") +
facet_wrap(~ Site_Name, scales = "free_y") + # One heatmap per site
labs(
title = "Data Collection by Week of Month and Day of Week",
x = "Day of the Week", y = "Week of the Month"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
main <- main %>% mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval_60) < 7 | hour(interval_60) > 18 ~ "Overnight",
hour(interval_60) >= 7 & hour(interval_60) < 10 ~ "AM Rush",
hour(interval_60) >= 10 & hour(interval_60) < 15 ~ "Mid-Day",
hour(interval_60) >= 15 & hour(interval_60) <= 18 ~ "PM Rush"))
# Remove directionality: sum both directions per site and hour
main <- main %>% group_by(Site_Name, interval_60) %>%
  mutate(Flow_Count = sum(Flow_Count)) %>% ungroup() %>%
  select(-Flow_Name, -Flow_Identifier) %>% distinct()
bg <- st_read("Data/bg/bg2020.GEOJSON")
bg$GEOID <-bg$GEOCODE
manual_clean <- manual_234[complete.cases(st_coordinates(manual_234)), ] %>% st_transform(3857)
# Create a KDE surface from the manual counts (bandwidth = 1000 m, EPSG:3857)
library(spatialEco)  # provides sf.kde()
kde_hp <- sf.kde(manual_clean %>% select(Total), y = manual_clean$Total, bw = 1000)
kde_df <- as.data.frame(kde_hp, xy = TRUE) %>%
rename(density = 3)
# map the result
ggplot(kde_df, aes(x = x, y = y, fill = density)) +
geom_raster() +
scale_fill_viridis_c(option = "plasma") +
coord_equal() +
theme_minimal() +
labs(title = "Kernel Density Estimation", fill = "Density")
# Convert sf points to SpatVector
pts_vect <- vect(main %>% st_as_sf())
# Extract KDE values at each point (terra:: prefix avoids masking by tidyr::extract)
kde_vals <- terra::extract(kde_hp, pts_vect)
# Combine with original sf object
main$manual_kde <- kde_vals[,2]
# library(tidymodels)
library(rsample)
library(parsnip)
library(workflows)
library(yardstick)
main <- main %>% mutate(YearMonth = floor_date(interval_60, "month"))
main$GEOID <- as.factor(main$GEOID)
main$Ward<- as.factor(main$Ward)
main$dotw <- as.factor(main$dotw)
dat <- main %>% select(1:6, 8,9 ,12:18) %>% ungroup()%>%
mutate(month=month(interval_60)) %>%
filter(month==8, weekend=="Weekday")
# Hold-out selection: weeks 4-6 at each site (two short-record sites excluded)
random_week_per_site <- dat %>%
  group_by(Site_Name) %>%
  distinct(WkoMn) %>%  # unique weeks per site
  filter(WkoMn %in% c(4:6)) %>%
  filter(Site_Name != '1st St NE') %>%
  filter(Site_Name != 'R Street NW')
# test_month <- sample(unique(dat$YearMonth), 1)
# 3. Create train and test sets (just key columns for matching)
test_set <- dat %>%
inner_join(random_week_per_site, by = c("Site_Name", "WkoMn"))
train_set <- dat %>% anti_join(test_set)
# 4. Match and extract indices using row-wise logic
train_indices <- dat%>%
mutate(row_id = row_number()) %>%
semi_join(train_set, by = c("Site_Name", "interval_60")) %>%
pull(row_id)
test_indices <- dat %>%
mutate(row_id = row_number()) %>%
semi_join(test_set, by = c("Site_Name", "interval_60")) %>%
pull(row_id)
# 5. Create rsample split
library(rsample)
data_split <- make_splits(
list(analysis = train_indices, assessment = test_indices),
data = dat
)
## LOGOCV on Neighborhood with group_vfold_cv()
cv_splits_geo <- group_vfold_cv(train_set,
group = "Ward")
## Accessing CV split data
glimpse(cv_splits_geo[1,1][[1]][[1]][[1]])
# Get the first fold's split object
fold_1 <- cv_splits_geo$splits[[1]]
# Extract the analysis (training) set from the first fold
analysis_set_1 <- analysis(fold_1)
# Extract the assessment (validation) set from the first fold
assessment_set_1 <- assessment(fold_1)
### Create Recipes
# List the factor/character columns that will need encoding in the recipe
test_set %>% select(where(~ is.factor(.) || is.character(.))) %>% colnames()
# dat$Flow_Count <- log1p(dat$Flow_Count)
# train_set$Flow_Count <- log1p(train_set$Flow_Count)
# test_set$Flow_Count <- log1p(test_set$Flow_Count)
model_rec <- recipe(Flow_Count ~ ., data = train_set %>%
select(-seg_id, -Site_Name,-GEOID, -interval_60, -month,
-WkoMn,-cabi_count)) %>%
update_role(Ward, new_role = "Neighborhood") %>%
step_other(Ward, threshold = 0.005) %>%
step_normalize(where(is.double), -Flow_Count) %>%
step_unknown(all_nominal()) %>%
step_dummy(all_nominal(), one_hot = TRUE) %>%
# step_interact(~ hour:.pred) %>% # Specify interaction term correctly
step_log(Flow_Count, offset = 1, skip = TRUE)
# step_nzv(all_numeric(),-Flow_Count,freq_cut = 95/5, unique_cut = 10) %>%
# step_zv(all_predictors()) %>%
# step_center(all_predictors(), -Flow_Count, -has_buffer, -has_parking, -has_doubleyellow) %>%
# step_scale(all_predictors(), -Flow_Count, -has_buffer, -has_parking, -has_doubleyellow)
# step_ns(Latitude, Longitude, options = list(df = 4),keep_original_cols=TRUE)
glimpse(model_rec %>% prep() %>% juice())
# transform new data
model_rec %>% prep() %>% bake(new_data = dat)
set.seed(42)
## Model specifications
lm_plan <-
linear_reg() %>%
set_engine("lm")
glmnet_plan <-
linear_reg() %>%
set_args(penalty = tune()) %>%
set_args(mixture = tune()) %>%
set_engine("glmnet")%>%
set_mode("regression")
rf_plan <- rand_forest() %>%
  set_args(mtry = tune()) %>%
  set_args(trees = 2000) %>%  # parsnip's argument is 'trees' (passed to ranger as num.trees)
  set_args(min_n = tune()) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("regression")
XGB_plan <- boost_tree() %>%
set_args(trees = tune()) %>%
set_args(learn_rate = tune()) %>% # Learning rate (shrinkage)
set_args(tree_depth = tune()) %>% # Maximum tree depth
set_args(min_n = tune()) %>% # Minimum node size
# set_args(colsample_bytree = tune()) %>% # Feature subsampling
set_args(early_stopping_rounds = 10) %>% # Early stopping
set_engine("xgboost",validation = 0.2) %>%
set_mode("regression")
# poisson_mod <-
# poisson_reg() %>%
# set_engine("glm") %>%
# set_mode("regression")
# Hyperparameter grid for glmnet (penalization)
glmnet_grid <- expand.grid(
penalty = 10^seq(-4, 2, length.out = 50), # Wider range of penalties
mixture = seq(0, 1, by = 0.25)
)
# Hyperparameter grid for random forest (mtry and min_n)
rf_grid <- expand.grid(
  mtry = c(45, 80, 110),
  min_n = c(5, 50, 100, 500)
)
# Hyperparameter grid for XGBoost (trees, learning rate, depth, node size; early stopping caps trees)
xgb_grid <- expand.grid(
trees = seq(50, 500, by = 50),
learn_rate = c(0.01, 0.1, 0.3),
tree_depth = c( 5, 7, 15,25),
min_n = c( 50, 200, 400)
# colsample_bytree = c(0.5, 0.7, 1)
)
library(poissonreg)
# create workflow
lm_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(lm_plan)
# poisson_wf <- workflow() %>%
# add_recipe(model_rec) %>%
# add_model(poisson_mod)
glmnet_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(glmnet_plan)
rf_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(rf_plan)
xgb_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(XGB_plan)
# fit model to workflow and calculate metrics
control <- control_resamples(save_pred = TRUE, verbose = TRUE)
metrics <- metric_set(rmse, rsq, mape, smape)
lm_tuned <- lm_wf %>%
tune::fit_resamples(.,
resamples = cv_splits_geo,
control = control,
metrics = metrics)
glmnet_tuned <- glmnet_wf %>%
tune::tune_grid(.,
resamples = cv_splits_geo,
grid = glmnet_grid,
control = control,
metrics = metrics)
rf_tuned <- rf_wf %>%
tune::tune_grid(.,
resamples = cv_splits_geo,
grid = rf_grid,
control = control,
metrics = metrics)
xgb_tuned <- xgb_wf %>%
tune::tune_grid(.,
resamples = cv_splits_geo,
grid = xgb_grid,
control = control,
metrics = metrics)
## 'Best' by some metric and margin
show_best(lm_tuned, metric = "rmse", n = 15)
# show_best(glmnet_tuned, metric = "rmse", n = 15)
show_best(rf_tuned, metric = "rmse", n = 15)
show_best(xgb_tuned, metric = "rmse", n = 15)
lm_best_params <- select_best(lm_tuned, metric = "rmse" )
# glmnet_best_params <- select_best(glmnet_tuned, metric = "rmse")
rf_best_params <- select_best(rf_tuned, metric = "rmse" )
xgb_best_params <- select_best(xgb_tuned, metric = "rmse" )
## Final workflow
lm_best_wf <- finalize_workflow(lm_wf, lm_best_params)
# glmnet_best_wf <- finalize_workflow(glmnet_wf, glmnet_best_params)
rf_best_wf <- finalize_workflow(rf_wf, rf_best_params)
xgb_best_wf <- finalize_workflow(xgb_wf, xgb_best_params)
# last_fit() emulates the process where, after determining the best model, the final fit on the entire training set is needed and is then evaluated on the test set.
lm_val_fit_geo <- lm_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metrics)
# glmnet_val_fit_geo <- glmnet_best_wf %>%
# last_fit(split = data_split,
# control = control,
# metrics = metrics)
rf_val_fit_geo <- rf_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metrics)
xgb_val_fit_geo <- xgb_best_wf %>%
last_fit(split = data_split,
control = control,
metrics = metrics)
# poisson_val_fit_geo <- poisson_wf %>%
# last_fit(split = data_split,
# control = control,
# metrics = metrics)
##########################################################
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds <- collect_predictions(lm_tuned)
# glmnet_best_OOF_preds <- collect_predictions(glmnet_tuned) %>%
# filter(penalty == glmnet_best_params$penalty[1] & mixture == glmnet_best_params$mixture[1])
rf_best_OOF_preds <- collect_predictions(rf_tuned) %>%
filter(mtry == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])
xgb_best_OOF_preds <- collect_predictions(xgb_tuned) %>%
filter( min_n == xgb_best_params$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo <- collect_predictions(lm_val_fit_geo)
# glmnet_val_pred_geo <- collect_predictions(glmnet_val_fit_geo)  # glmnet last_fit is commented out above
rf_val_pred_geo <- collect_predictions(rf_val_fit_geo)
xgb_val_pred_geo <- collect_predictions(xgb_val_fit_geo)
# poisson_val_pred_geo <- collect_predictions(poisson_val_fit_geo)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds <- rbind(
data.frame(dplyr::select(lm_best_OOF_preds, .pred, Flow_Count)%>% mutate(.pred=expm1(.pred)), model = "lm") ,
# data.frame(dplyr::select(poisson_val_pred_geo, .pred, Flow_Count), model = "poisson")
# data.frame(dplyr::select(glmnet_best_OOF_preds, .pred, Flow_Count), model = "glmnet") ,
data.frame(dplyr::select(rf_best_OOF_preds, .pred, Flow_Count)%>% mutate(.pred=expm1(.pred)), model = "rf"),
data.frame(dplyr::select(xgb_best_OOF_preds, .pred, Flow_Count) %>% mutate(.pred=expm1(.pred)), model = "xgb")
) %>%
group_by(model) %>%
mutate(
RMSE = yardstick::rmse_vec(Flow_Count, .pred),
MAE = yardstick::mae_vec(Flow_Count, .pred),
MAPE = yardstick::mape_vec(Flow_Count, .pred)) %>%
ungroup() %>%
mutate(model = factor(model, levels=c(
"lm",
"glmnet",
"rf",
"xgb"
)))
# average error for each model
ggplot(data = OOF_preds %>%
dplyr::select(model, RMSE) %>%
distinct() ,
aes(x = model, y = RMSE, group = 1)) +
geom_path(color = "red") +
geom_label(aes(label = paste0(round(RMSE,1)))) +
theme_bw()
# OOF predicted versus actual
ggplot(OOF_preds, aes(x = Flow_Count, y = .pred,
group = model
)) +
geom_point(alpha = 0.3) +
geom_abline(linetype = "dashed", color = "red") +
geom_smooth(method = "lm", color = "blue") +
# coord_equal() +
facet_wrap(~model, nrow = 2) +
theme_bw()
full_fit <- rf_best_wf %>% workflows::fit(train_set)
rf_pred <- predict(full_fit, test_set %>% select(-Flow_Count))
rf_pred$.pred <- expm1(rf_pred$.pred)  # back-transform from the log1p scale used in the recipe
full_fit$fit$fit$fit                   # inspect the underlying ranger model object