Estimating Bike Counts in DC
You can view the full MUSA project page and our BEAUTIFUL Dashboard.
Instructors
Michael Fishman & Matthew Harris
Client Contacts
Nellie Moore, Senior Civic Design Researcher, DDOT (nellie.moore@dc.gov)
David Balick, Trails Planner, DDOT (david.balick@dc.gov)
Jack Crum, Data Scientist, The Lab @ DC (jack.crum@dc.gov)
Executive Summary
Problem Statement: Current counter data is limited in spatial and temporal coverage. This gives an incomplete picture of cycling patterns in DC and makes it difficult to optimize network planning.
Use Case: To address this problem of incomplete information, our analysis estimates and summarizes micro-mobility ridership across DC’s networks.
Key Data Challenges: Automated counters are too sparse across space to provide a representative picture of DC.
Modelling Solution: Estimate peak-hour counts from manual counter data, then use these estimates, augmented with other data sources, to estimate hourly counts in locations without automated counters.
Model Performance: Final model configurations for both the temporal and spatial models outperformed static models in existing workflows recorded in Portland, Bend, Dallas and Charlotte.
Strengths and Applications: Provides an overview of bike commuting patterns and infrastructure utilization during critical time periods (peak hours), which may support planning decisions at the city-wide scale. It is not a “real-time”, granular or “realistic” representation of cyclist volumes.
Limitations:
Remains impossible to test and comment on the generalizability of the temporal model without sufficient data points across DC
Inherent selection biases with the data arising from data collection decisions such as prioritizing of key transport corridors for counter installation
Areas may have daily patterns which differ substantially from conventional dual peak patterns, reducing the confidence of estimates in these areas
Recommendations:
Utilize predictions made outside of peak hours in areas further from known automated records with caution
Expand the coverage of automated counters across a greater variety of locations, which will allow for further model iterations and testing in currently unknown locations
Utilize domain knowledge and understanding of local contexts within DC for validating model outputs
Introduction
The District Department of Transportation (DDOT) wishes to understand trip volumes and time-space patterns of micro-mobility (bicycle and scooter) use in Washington, DC. Estimating and summarizing such patterns will give the department a better idea of cycling flows across the city, which can be used for a variety of analyses and decisions surrounding infrastructure provision and interventions, or for understanding relations with equity and demographics. However, as the following sections will show, while DC’s bike network is extensive, the current counter data is limited in spatial and temporal coverage. This provides an incomplete picture of cycling patterns in DC, making it challenging to optimize network planning. The principal goal of this project is thus to estimate the volume of micro-mobility ridership across the city. The rest of this report is structured as follows: first, it introduces the motivation and use case, before conducting a short review of typical approaches and considerations in modelling and estimating bicycle counts. Then, it illustrates the key data issues regarding spatial and temporal sparsity of the provided counter data and outlines our predictive solution. It then explores selected predictors before comparing the performance of various model configurations. It concludes by assessing the generalizability of the final model for different demographics and settings and offering recommendations for existing workflows.
Client Background
Motivation & Use Case
Bicycle count data is important for a variety of planning purposes, including measuring changes over time, prioritizing projects, planning and designing future infrastructure, providing inputs for multi-modal models, assessing overall footfall, signal adjustments and property research, among other things (Nordback et al., 2018). The growing interest in bicycling and cycling facilities creates a need to understand how people use the mode and the available facilities to reach important places in their daily lives. This need has not gone unnoticed by DDOT, which has been carrying out its own data collection program for the past few years. However, as we show in subsequent sections, there are limits to how well the collected data generalizes to the entire city and its network.
To address this problem of incomplete information, our analysis estimates and summarizes micro-mobility ridership across DC’s networks. By revealing the distribution and gaps in available data, our study highlights opportunities to enhance monitoring strategies, understand how patterns vary across the city with various demographic and infrastructure features, and guide targeted infrastructure improvements. This is facilitated by a dashboard which provides various functions for exploring the phenomena to aid decision-making processes.
Literature Review
In recent years, the need for accurate, network-wide estimates of bicycle activity has grown in response to increased investment in cycling infrastructure and a push for data-driven planning. Traditionally, cities have relied on permanent counters and short-duration manual counts to monitor bicycle traffic, yet these methods offer limited spatial and temporal coverage. As a result, researchers have developed several modeling approaches to extrapolate counts across space and time, and to integrate newer, non-traditional data sources.
The Role of Bicycle Count Data
Bicycle count data serve various planning and evaluation purposes, including measuring trends, prioritizing infrastructure investments, calibrating travel models, conducting before-and-after studies, and assessing safety and equity outcomes (Nordback et al., 2018). However, stationary counters can only reflect activity at specific points, leaving large portions of the network unmeasured. To address this, researchers have explored methods to extrapolate short-duration counts to average annual daily bicycle traffic (AADBT) using grouping techniques such as cluster analysis, spatial indexing, and travel pattern similarities (Team 2025).
Modeling Approaches
Direct Demand Models
One common strategy is the direct demand model, which regresses observed counts against built environment, network, and demographic variables. Key predictors often include:
- Land use: employment and population density, retail access, and land use mix.
- Transportation infrastructure: bike facility density, proximity to bridges, and centrality in the network.
- Sociodemographics: race, age, education, and student population.
- Accessibility: distance to jobs, schools, and the central business district.
- Temporal factors: day of week, time of day, and weather conditions.
Poisson and negative binomial models are frequently used due to the count-based nature of the data (Team 2025).
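As a minimal sketch of the direct demand idea, the R code below fits a Poisson and a negative binomial regression to simulated counts. The data and the predictor names (`emp_density`, `bike_lane`) are hypothetical and do not come from this project; the point is only to show how overdispersion in count data can be checked and why a negative binomial model is often preferred in that case.

```r
set.seed(1)
n <- 200

# Hypothetical site-level predictors (illustrative names, not the report's variables)
sites <- data.frame(
  emp_density = runif(n, 0, 5),
  bike_lane   = rbinom(n, 1, 0.4)
)

# Simulate overdispersed counts whose log-mean depends on the predictors
mu <- exp(1 + 0.3 * sites$emp_density + 0.5 * sites$bike_lane)
sites$count <- rnbinom(n, size = 2, mu = mu)

# Poisson regression (base R) and negative binomial regression (MASS)
pois_fit <- glm(count ~ emp_density + bike_lane, family = poisson, data = sites)
nb_fit   <- MASS::glm.nb(count ~ emp_density + bike_lane, data = sites)

# Pearson chi-square over residual degrees of freedom;
# values well above 1 point to overdispersion relative to Poisson
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)

# Lower AIC indicates the better-fitting model on this simulated data
AIC(pois_fit, nb_fit)
```

Calling `MASS::glm.nb()` with the namespace prefix avoids attaching MASS, which would otherwise mask `dplyr::select()`.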
Machine Learning and Data Fusion
Recent research has emphasized data fusion techniques that combine traditional counts with crowdsourced or third-party data sources, such as Strava, StreetLight, and bikeshare systems. The study by Kothuri et al. (2022) demonstrated that integrating these sources with contextual variables (termed “static” data) can enhance model performance. In their comparative study across six cities, models that combined static, Strava, and StreetLight data consistently outperformed those using single sources, especially in medium- to high-volume contexts. However, low-volume sites remained difficult to model accurately, partly due to sparse data and bias in source coverage (Kothuri et al. 2022).
The study further evaluated Random Forest regression and found it performed comparably to Poisson models, with better results in data-rich settings. While machine learning allows for flexible modeling of complex relationships, it demands substantial data and tuning to avoid overfitting or underperformance in low-data environments.
Connectivity and Network Measures
Complementing count-based models, Miah et al. (2023) explored network connectivity using graph theory and Level of Traffic Stress (LTS). Their study assessed how link-level characteristics—such as speed, lane width, and motor vehicle volume—affect ridership by user type (e.g., children vs. confident adults). The results highlighted significant connectivity disparities, with large portions of networks inaccessible to low-stress riders, thereby limiting cycling uptake in equity-seeking communities (Miah, Mattingly, and Hyun 2023).
Exploratory Analysis: Modeling Considerations
Our Target Variable: Weekday Hourly Bike Counts
Data Quality and Coverage
The locations of counters and the nature of their counts are important considerations when utilising them for modelling unknown counts, and can limit the types of models and inferences which can be made from data. Importantly, stationary counters can only tell us about the traffic passing by them, during the times which they are active.
The two counter types differ greatly in spatial and temporal coverage:
Manual Counters:
Greater density of locations, especially if aggregated across 4 years
Data collected only during peak hours, once per year, within time periods ranging from 3 to 4 hours
Automatic Counters:
Poor spatial coverage, with only 3 to 4 counters active per month, and a total of around 30 counters deployed in 2024
Provide counts at 15-minute intervals
To balance these differences, we focused on generating estimates at an hourly timescale
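Moving from 15-minute records to an hourly timescale can be sketched as below. The column names `Flow_Count` and `interval_60` mirror those used in the plotting code later in this report, but the data here is made up and the exact way `interval_60` was built in the project is an assumption.

```r
library(dplyr)
library(lubridate)

# Hypothetical 15-minute records for a single counter (values are invented)
counts_15 <- tibble(
  timestamp  = ymd_hm("2024-05-01 08:00") + minutes(15 * 0:7),
  Flow_Count = c(3, 5, 4, 6, 2, 1, 3, 2)
)

# Floor each timestamp to the start of its hour, then sum counts within the hour
counts_60 <- counts_15 %>%
  mutate(interval_60 = floor_date(timestamp, "hour")) %>%
  group_by(interval_60) %>%
  summarise(Flow_Count = sum(Flow_Count), .groups = "drop")

counts_60
```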
Manual Counters Spatial Distribution
The manual counters—shown in purple—are more widespread, but they’re only used occasionally and often during peak hours.
manual <- manual %>% drop_na(x, y)
manual <- st_as_sf(manual, coords = c("x", "y"), crs = 4326)
manual <- st_transform(manual, st_crs(basemap))
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(data = bike_lane, color = "grey30", size = 0.5, alpha = 0.7) +
geom_sf(data = manual, color = "#E3D7F5", size = 3, shape = 21, fill = "#562DAB") +
labs(title = "Manual Counters Locations") +
theme_minimal() +
theme(
legend.position = "right",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
Automatic Counters Spatial Distribution
These are fairly limited in number and tend to be clustered in specific parts of the city. Because automated counters are concentrated in certain areas, they don’t reflect the diversity of biking environments across the whole network.
auto <- auto %>% drop_na(Latitude, Longitude)
auto <- st_as_sf(auto,
coords = c("Longitude", "Latitude"), # note the order: X = Longitude, Y = Latitude
crs = 4326)
auto <- st_transform(auto, st_crs(basemap))
auto_cut <- auto %>%
filter(st_coordinates(.)[, 1] > -77.12 & st_coordinates(.)[, 1] < -76.90,
st_coordinates(.)[, 2] > 38.79 & st_coordinates(.)[, 2] < 39.00)
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(data = bike_lane, color = "grey30", size = 0.5, alpha = 0.7) +
geom_sf(data = auto_cut, color = "#E3D7F5", size = 3, shape = 21, fill = "#9C7BD8") +
labs(title = "Automatic Counters Locations") +
theme_minimal() +
theme(
legend.position = "right",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank())
By definition, count data do not tell us anything about the parts of the network or the times outside of the records. However, through a predictive modelling approach, we can attempt to use the data to understand how patterns differ across space and time, and predict how they might look in areas outside the temporal and spatial scope of the records.
Exploring Temporal & Spatial Variations in Bike Counts
In order to understand how patterns vary more broadly across space and time, we first explored spatial patterns using Manual count data and then the temporal patterns using automated counts.
Spatial Patterns
Temporal Patterns
Automated counts range from about 1 to 30 per 15-minute interval, with most below 10, averaging about 30 to 40 per hour.
main %>% filter(Flow_Count>0) %>% ggplot()+
geom_histogram(aes(x=Flow_Count),fill= "#9C7BD8", color='white')+
xlim(0,30)+
labs(title= 'Distribution of Automated Counts',
subtitle='15 minute intervals')+
theme_minimal()
Daily patterns of counts are roughly even across different days of the year. While the patterns below seem to suggest that certain months are characterised by higher than normal counts, this is due to the differences in counter locations.
main %>% st_drop_geometry() %>% filter(Flow_Count>0) %>% group_by(interval_60) %>%
summarise(mean_count= mean(Flow_Count)) %>%
ggplot()+
geom_line(aes(x=interval_60,y= mean_count), size=0.3, color= "#9C7BD8")+
labs(title= 'Automated Counts across study period',
subtitle='1 hour intervals')+
theme_minimal()
Weekday patterns differ greatly from weekend patterns, exhibiting the dual-peak pattern also commonly seen in other forms of transit (buses and rail). This strongly suggests that the recorded counts are largely commuter-centric.
main %>% filter(Flow_Count>0) %>% mutate(hour = hour(interval_60),dotw = wday(interval_60, label=TRUE)) %>% group_by(dotw,hour) %>%
st_drop_geometry() %>% summarise(mean_count= mean(Flow_Count)) %>%
ggplot()+
geom_area(aes(x=hour,y= mean_count), fill="#9C7BD8", color='transparent')+
labs(title= 'Automated Counts across study period',
subtitle='1 hour intervals, by day of week')+
facet_wrap(~dotw)+
theme_minimal()
The twin-peak pattern is also mostly generalizable to all locations, with a few notable exceptions. 1st St NE exhibited rather anomalous patterns and was excluded from the training process. Notably, streets such as East Capitol Street and the Marvin Gaye Trails have patterns which deviate slightly from the norm. The former is characterised by a much larger evening peak, while the latter sites have much more erratic patterns, likely due to the low average counts within these sites.
main %>% filter(Flow_Count>0) %>% mutate(hour = hour(interval_60),dotw = wday(interval_60, label=TRUE)) %>% subset(!(dotw %in% c('Sat', 'Sun'))) %>% group_by(Site.Name, hour) %>%
st_drop_geometry() %>% summarise(mean_count= mean(Flow_Count)) %>%
ggplot()+
geom_area(aes(x=hour,y= mean_count), fill='#9C7BD8', color='transparent')+
labs(title= 'Automated Counts across Locations ',
subtitle='1 hour intervals')+
ylab('Mean Count')+
facet_wrap( ~ Site.Name,
scales='free',
ncol= 5)+
theme_minimal()
The consistency of such temporal patterns across different sites is important, as it allows us to estimate counts across different times of the day in unknown locations by assuming that counts follow a similar profile over time at every location and differ only in magnitude (i.e. the peak bike counts).
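The scaling assumption described above can be illustrated with a small sketch: a shared hourly profile is normalised so that its busiest hour equals 1, then multiplied by a peak-hour count for a location without a counter. The profile values, the peak count and the helper `estimate_hourly()` are all hypothetical; this is not the fitted temporal model, only the intuition behind it.

```r
# Hypothetical shared weekday profile: mean count for each hour 0..23 across known sites
profile <- c(1, 1, 1, 1, 2, 5, 12, 25, 30, 18, 12, 11,
             12, 12, 13, 16, 24, 32, 26, 15, 9, 6, 3, 2)

# Normalise so the busiest hour has a value of 1
shape <- profile / max(profile)

# Scale the common shape by a peak-hour count at an unmonitored location
estimate_hourly <- function(peak_count, shape) {
  peak_count * shape
}

est <- estimate_hourly(peak_count = 40, shape = shape)
round(est, 1)
```

Under this assumption every location shares the same relative profile, so the reliability of the estimates depends on how closely a site follows the dual-peak pattern.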
# Label each hourly record with a time-of-day period,
# then plot the distribution of counts within each period
auto_counts %>%
  mutate(time_of_day = case_when(
    hour(interval_60) < 7 | hour(interval_60) > 18 ~ "Overnight",
    hour(interval_60) >= 7 & hour(interval_60) < 10 ~ "AM Rush",
    hour(interval_60) >= 10 & hour(interval_60) < 15 ~ "Mid-Day",
    hour(interval_60) >= 15 & hour(interval_60) <= 18 ~ "PM Rush")) %>%
  ggplot() +
  geom_histogram(aes(x = as.numeric(Flow_Count)), fill = "#9C7BD8", color = "white") +
  facet_wrap(~time_of_day) +
  theme_minimal()
Our Approach: Estimate Spatial Patterns with Manual Counts & Temporal Patterns from Automated Counts
Typical approaches for obtaining estimates of bike counts in areas without counters can be classified into three main types: 1) Direct Extrapolation, 2) Travel Demand Modelling and 3) Direct Demand Models.
Direct extrapolation is based on the principle of spatial autocorrelation: areas near each other have similar patterns of counts. This often entails grouping sites based on similar characteristics and using factors to extrapolate short-duration counts to annual estimates. The grouping may be done visually, or through cluster analysis or spatial variables.
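A simplified version of factor-based extrapolation is sketched below, expanding a short count to a daily total rather than an annual one. The reference counts, the counted hours and the short count are all hypothetical; real factoring methods also adjust for day of week and season.

```r
# Hypothetical hourly counts over one day at a continuous counter in the same site group
reference <- c(2, 1, 1, 1, 3, 8, 20, 40, 45, 25, 15, 15,
               15, 15, 18, 25, 40, 50, 35, 20, 12, 8, 4, 2)
names(reference) <- 0:23

# Share of the daily total observed during the hours covered by the short count
counted_hours <- c("7", "8")
share <- sum(reference[counted_hours]) / sum(reference)

# Expand a hypothetical two-hour manual count to a daily estimate
short_count <- 34
daily_estimate <- short_count / share
daily_estimate
```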
Travel demand modelling describes patterns as decisions to travel for activities using specific modes of transport. Such models often suffer from their large aggregation scale and from difficulty in validating against external data.
As such, given the availability of various external data sources, direct demand models, which regress (or predict) observed counts on known characteristics of the surrounding built environment, are popular methods for estimating counts. The accuracy of such models depends on the types of variables selected as predictors, as well as their suitability for extrapolating to the entire network. This means that strongly predictive characteristics which are not present in areas without records should be included with caution.
Crucially, the majority of reviewed approaches are concerned with obtaining annually aggregated estimates of bicycle volume, in other words, where there tend to be the most or fewest cyclists on the network. Our use case and motivation, however, require that we also understand when and how bike counts vary.
Given the consistency of hourly patterns across locations, the wealth of temporal data from automated counters allows us to estimate how patterns vary across the day, barring a few limitations and caveats which are explained further below.
As such, we propose a modelling workflow which can maximize the degree of spatial and temporal information we can obtain from the 2 data sources.
Overall Modelling Workflow
Exploratory Analysis: Relevant Predictors and Data Sources
To model and estimate how flows vary across space and time, it is important to identify relevant and useful predictors.
Typical characteristics included for prediction include (Muner and Sener, ):
Socio-demographic: race/ethnicity, education, students, age
Network measure: centrality, bridges, bicycle facility density, number of lanes
Land use: population density, employment density, commercial retail density, industrial use, open space, land use mix
Accessibility: to jobs, bike trail entrances, transit stops, schools, distance from CBD
Temporal: weather, time of day
Based on Kothuri et al’s study and other literature, we identified a number of variables and data sources for both spatial and temporal models. [insert diagram ]
Spatial Model
Variables chosen for the spatial model included….
Temporal Model
Because our approach cannot rely on accurate spatial grouping variables (e.g. streets) in locations without automated counters, we considered various spatial variables for the temporal model to provide information about spatial differences in the magnitude of bicycle counts. These included predicted values from the spatial model, a kernel density estimate of manual count data, and an engineered count feature from Capital Bikeshare data.
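To make the kernel density feature concrete, the sketch below computes a Gaussian kernel-weighted intensity of manual counts at arbitrary query points. The function `kde_counts()`, the site coordinates and the 300 m bandwidth are hypothetical choices for illustration; the bandwidth and kernel actually used in the project are not stated here.

```r
# Gaussian kernel-weighted intensity of counts at query points.
# Coordinates are assumed to be in a projected CRS with metre units.
kde_counts <- function(qx, qy, x, y, counts, bandwidth) {
  vapply(seq_along(qx), function(i) {
    d2 <- (x - qx[i])^2 + (y - qy[i])^2          # squared distance to every count site
    sum(counts * exp(-d2 / (2 * bandwidth^2))) / (2 * pi * bandwidth^2)
  }, numeric(1))
}

# Hypothetical manual count sites: two busy sites close together, one quiet site far away
sites <- data.frame(x = c(0, 100, 2000), y = c(0, 0, 0), count = c(50, 30, 10))

dens <- kde_counts(qx = c(50, 1900), qy = c(0, 0),
                   x = sites$x, y = sites$y, counts = sites$count,
                   bandwidth = 300)
dens
```

A query point near the two busy sites receives a much higher value than one near the quiet site, which is the spatial magnitude signal such a feature is meant to carry.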
Land Use
To understand how land use influences bike counts, we applied a 300-meter buffer around each bike lane segment. This allows us to capture the mix of land uses within close proximity to each bike route, rather than just relying on a single adjacent land use type.
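The buffering step can be sketched with sf as follows. The geometries are synthetic, land-use parcels are simplified to points, and EPSG:32618 (UTM zone 18N) is assumed only so that distances are in metres; the project's actual layers and CRS may differ.

```r
library(sf)

# Hypothetical bike lane segment in a projected CRS with metre units
seg <- st_sf(
  seg_id = 1,
  geometry = st_sfc(st_linestring(rbind(c(0, 0), c(500, 0))), crs = 32618)
)

# Hypothetical land-use parcels, simplified to points
parcels <- st_sf(
  land_use = c("Residential", "Commercial", "Industrial"),
  geometry = st_sfc(st_point(c(100, 200)),
                    st_point(c(400, -250)),
                    st_point(c(250, 900)),
                    crs = 32618)
)

# 300 m buffer around the segment, then find parcels falling inside it
buf  <- st_buffer(seg, dist = 300)
hits <- st_intersects(buf, parcels)[[1]]

# Tally land-use types within the buffer
table(parcels$land_use[hits])
```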
The highest bike counts are observed in residential, commercial, and institutional areas, indicating that these environments support frequent cycling—likely due to a mix of commuting, errands, and recreational trips.
seg_nongem <- st_drop_geometry(segments)
landuse <- seg_nongem %>%
select(Commercial, Industrial, Institutional, Mixed, Recreational, Religious, Residential, Vacant, Other) %>%
summarise_all(sum) %>%
pivot_longer(cols = everything(), names_to = "Land_Use_Type", values_to = "Count")
ggplot(landuse, aes(x = reorder(Land_Use_Type, Count), y = Count, fill = Land_Use_Type)) +
geom_col(width = 0.7) + # Adjust bar width (default is 0.9, reduce for thinner bars)
coord_flip() +
scale_fill_manual(values = c(
"Vacant" = "#562DAB", # Very dark purple
"Residential" = "#6A3BB8",
"Institutional" = "#7A50C7",
"Commercial" = "#8E68CC",
"Religious" = "#9C7BD8",
"Mixed" = "#A989E0",
"Other" = "#C4AEE8",
"Industrial" = "#D6C4F0",
"Recreational" = "#E3D7F5" # Lightest
), guide = "none") +
labs(title = "Distribution of Bike Lanes Across Adjacent Land Use Types",
x = "Land Use Type", y = "Count") +
theme_minimal() +
theme(
legend.position = "right", # Keep legend on the right
panel.grid.major = element_blank())
Census Data
We examined the spatial distribution of key census indicators across our study area to better understand the local socio-demographic context. Our maps reveal marked differences in variables such as median income, the rate of public transit commuting, and the proportion of residents who use bicycles for commuting. Areas with higher median incomes may have distinct mobility patterns compared to regions with lower incomes, and these economic disparities can influence the demand for cycling infrastructure. By overlaying these data on our geographic base map, we can begin to see how the underlying socio-economic factors correlate with the spatial distribution of transportation modes.
In addition, the visualizations of transit and bicycle commuting rates serve as essential proxies for local travel behavior. Regions with higher bicycle commuting rates, for example, might already exhibit a strong cycling culture that can support further network investments, while areas with prominent public transit usage suggest the potential for integrated transport solutions that merge cycling with existing transit services. These insights are critical for identifying data gaps in our current bike counter coverage and for forecasting future demand, ultimately guiding more targeted and effective infrastructure planning in Washington, D.C.
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(
data = segments,
aes(color = median_incomeE),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = "Income",
na.value = "grey80") +
labs(title = "Median Income") +
theme_minimal()
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(
data = segments,
aes(color = commute_bicycleE),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = "commute_bicycle") +
labs(title = "Bicycle Commuters") +
theme_minimal()
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(
data = segments,
aes(color = pct_white),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = "pct_white") +
labs(title = "Percentage of White") +
theme_minimal()
ggplot() +
geom_sf(data = basemap, fill = "grey90", color = "white") +
geom_sf(
data = segments,
aes(color = edu_bachelors_plusE),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = " edu_bachelor_plus") +
labs(title = "Education Level") +
theme_minimal()
Network Characteristics
We also examined network characteristics as additional predictors for our model. Betweenness centrality measures how often a node falls on the shortest paths between other nodes.
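The definition of betweenness centrality can be checked on a toy network. The sketch below uses igraph on a hypothetical five-node graph; it is an assumption that a comparable routine produced the segment-level `betweenness_max` values mapped in this section.

```r
library(igraph)

# Hypothetical toy network: intersections are nodes, street links are edges
edges <- data.frame(from = c("A", "B", "C", "C"),
                    to   = c("B", "C", "D", "E"))
g <- graph_from_data_frame(edges, directed = FALSE)

# Number of shortest paths between other node pairs that pass through each node
bt <- betweenness(g, directed = FALSE)
bt
```

Node C connects every branch of the toy network, so it lies on the most shortest paths, while the end nodes A, D and E lie on none.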
ggplot() +
geom_sf(data = basemap, fill = "grey97", color = "white") +
geom_sf(
data = segments,
aes(color = betweenness_max),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = "Betweenness") +
labs(title = "Bike Lane Network: Binned Betweenness Centrality") +
theme_minimal()
Exploratory Data Analysis with Manual Counts
In this section, we explore how various segment-level characteristics—such as roadway design, land use, and traffic conditions—relate to observed manual bicycle counts in Washington, DC. This exploratory analysis aims to identify potential predictors of cyclist volumes and inform our modeling efforts to estimate unobserved counts citywide.
segments_manual <- segments %>%
semi_join(seg_manual, by = "seg_id") %>%
left_join(seg_manual, by = "seg_id")
Overview of Average Current Manual Counts
ggplot() +
geom_sf(data = basemap, fill = "grey97", color = "white") +
geom_sf(
data = segments_manual,
aes(color = avg_count),
alpha = 0.8) +
scale_color_gradientn(
colors = rev(pp),
name = "Average Bike Counts") +
labs(title = "Manual Bike Counts") +
theme_minimal()
ggplot(segments_manual, aes(x = avg_count)) +
geom_histogram(bins = 30, fill = "#7A50C7", color = "white") +
theme_minimal() +
labs(title = "Distribution of Average Manual Counts",
x = "Average Count", y = "Number of Segments")
Continuous Variables with Manual Average Counts
To better understand what factors are associated with cyclist activity, we explored how average manual bike counts (avg_count) relate to several continuous segment-level variables. Scatterplots with linear regression lines were used to examine relationships between avg_count and predictors such as annual average daily traffic (AADT), median household income, streetlight count, pavement condition index (PCI), and bike infrastructure scores. These visualizations revealed varying degrees of correlation, suggesting that both traffic volume and built environment features may play a role in influencing observed cyclist volumes on different street segments.
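A scatterplot of this kind can be sketched as below. The data is simulated, so the relationship shown is invented; only the column names `avg_count` and `AADT` follow the text, and the plot styling is an assumption that echoes the palette used elsewhere in this report.

```r
library(ggplot2)

set.seed(42)

# Simulated segment-level data (hypothetical values, not project results)
demo <- data.frame(AADT = runif(100, 1000, 20000))
demo$avg_count <- 20 + 0.002 * demo$AADT + rnorm(100, sd = 10)

# Scatterplot with a fitted linear regression line
p <- ggplot(demo, aes(x = AADT, y = avg_count)) +
  geom_point(color = "#7A50C7", alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "grey30") +
  labs(title = "Average Manual Count vs AADT (simulated)",
       x = "AADT", y = "Average Manual Count") +
  theme_minimal()

# Pearson correlation as a numeric companion to the plot
r <- cor(demo$AADT, demo$avg_count)
r
```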