Link to our web app: https://roshiniganesh.shinyapps.io/MUSA-Charlotte-ShinyApp/

This project is part of the MUSA 8010 Master of Urban Spatial Analytics (MUSA)/Smart Cities Practicum, Spring 2024, conducted in collaboration with the Charlotte Area Transit System (CATS). We would like to give special thanks to our instructors Michael Fichman and Matthew Harris, as well as to Carlos Parada, Bruce Jones, and Jessica Odette from CATS, for providing data, insight, and support over the course of the semester.

1. Introduction

In 2021, the Charlotte Area Transit System (CATS) embarked on ‘The Bus Priority Study’ as part of the city’s larger Strategic Mobility Plan, aiming to enhance the speed, reliability, and convenience of bus trips for riders. CATS wishes to understand the recent dynamics of local bus transit and the financial and equity ramifications of hypothetical alterations to the network. To achieve this goal, CATS set out to address three primary objectives:

  1. Understanding recent bus ridership dynamics.
  2. Exploring the present reality of bus accessibility, considering ridership trends and demographic distributions.
  3. Examining network efficiencies and possibilities.

In pursuit of the three objectives, several actions must be undertaken:

  1. Identify underperforming bus stops.
  2. Study demographic, spatial, and economic factors in the Charlotte area that potentially influence bus ridership and performance.
  3. Alter the current bus network of stops and routes to optimize transit services.

The goal of this project is to create a proof-of-concept analytical and informational system to identify underperforming stops or routes in Charlotte’s bus transit network.

2. Use case

Our team will support CATS transportation planners in improving bus line efficiency by analyzing historical ridership trends at each bus stop to identify and predict underperforming stops and routes. We aim to understand the dynamics of the bus system and then develop a model that forecasts demand, modeling monthly bus demand by line and stop using automatic passenger counter (APC) data (boarding and alighting counts).

With this, we will propose alternative network options through a web-based dashboard that presents predictions in the context of the demographic, spatial, and financial factors that influence ridership.

3. Data

3.1 Datasets considered

The ridership data, passenger boardings and alightings collected by on-board sensors, are grouped into monthly totals and daily averages for each stop. This ridership measure, the sum of boardings and alightings, is the dependent variable of this project.

To analyze and forecast performance trends, we used ridership data from CATS for the years 2017 through 2023. Additionally, we identified demographic variables from the American Community Survey, spatial variables such as proximity to amenities, and economic variables that may influence ridership, our dependent variable.
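As a hedged illustration only (the raw APC schema is not reproduced here, so the object apc_raw and the column names Stop_ID, Date, Board, and Alight are assumptions), aggregating stop-level APC records into the monthly totals and daily averages described above might look like the following:

library(dplyr)
library(lubridate)

# Hypothetical raw APC records: one row per stop-day with boarding/alighting counts
apc_monthly <- apc_raw %>%
  mutate(Year = year(Date), Month = month(Date, label = TRUE, abbr = FALSE)) %>%
  group_by(Stop_ID, Year, Month) %>%
  summarise(Board      = sum(Board, na.rm = TRUE),    # monthly boarding total
            Alight     = sum(Alight, na.rm = TRUE),   # monthly alighting total
            Days       = n_distinct(Date),            # operating days observed that month
            Avg_Board  = Board / Days,                # daily averages per stop
            Avg_Alight = Alight / Days,
            .groups = "drop")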

3.2 Limitations of dataset

On initial review, certain limitations of the data present themselves:

  1. No hourly ridership data is available, precluding analysis of peak-hour trends.
  2. Weekend versus weekday ridership is not available, which forces the assumption of similar usage patterns on weekends, weekdays, and holidays.
  3. Data on stops added or removed during the study years is not explicit.
  4. Only stop locations are provided, so route geometries must be sourced separately and spatially joined; this allows us to join only the most recent route data for these stops.
  5. The 2023 dataset is incomplete and only includes data for January through November.

3.3 Data Wrangling

In the data wrangling phase, we cleaned, preprocessed, and transformed the raw datasets to ensure consistency, accuracy, and compatibility with our machine learning models.
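The full wrangling code is not reproduced here; a minimal sketch of the kinds of steps involved, with hypothetical input and column names, is shown below. The Longitude/Latitude coordinates and the NAD83 North Carolina projection mirror those used elsewhere in this report.

library(dplyr)
library(sf)

# Hypothetical input: the raw stop-level ridership table described in Section 3
stops_clean <- ridership_raw %>%
  mutate(`Stop ID` = gsub("\\s+", "", `Stop ID`)) %>%   # strip stray whitespace in stop IDs
  filter(!is.na(Board), !is.na(Alight)) %>%             # drop records missing counts
  distinct()                                            # remove duplicate rows

# Build a spatial stop layer (analogous to the busstop_sf object used later)
stops_clean_sf <- st_as_sf(stops_clean,
                           coords = c("Longitude", "Latitude"), crs = 4326) %>%
  st_transform(crs = 3358)   # NAD83 / North Carolina, as used elsewhere in this report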

4. Exploratory Analysis

Before model building, a good understanding of the relationships, patterns, and distributions within the dataset is important for informing feature selection and model design. The exploratory analysis phase aims to uncover insights and patterns that can guide our subsequent modeling efforts.

We began with a range of visualizations, including histograms and plots, to show relationships between variables, detect outliers, and identify potential trends. This exploratory analysis not only enhanced our understanding of the dataset but also laid the foundation for informed feature selection and model design.

What do current boarding and alighting patterns of bus passengers in Charlotte look like? The available CATS data allow us to produce an initial visualization of boarding and alighting trends from the APC data.

4.1 APC data

# Map boarding values at each bus stop
ggplot(busstop_sf) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
  geom_sf(aes(color = Board), size = .3) +
  scale_color_gradient(low = "#FFAB00", high = "darkgreen") +
  mapTheme() +
  labs(title = "Bus Stops", color = "Board Value")

# Map alighting values at each bus stop
ggplot(busstop_sf) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
  geom_sf(aes(color = Alight), size = .3) +
  scale_color_gradient(low = "#FFAB00", high = "darkgreen") +
  mapTheme() +
  labs(title = "Bus Stops", color = "Alight Value")

4.2.3 Top 50 underperforming stops in 2022

The underperformance and spatial clustering trends in 2022 are consistent with the overall trend.

# Underperforming Stops in 2022
underperforming_stops_2022 <- stops_data[(stops_data$Board + stops_data$Alight) <= 10 & stops_data$Year == 2022, ]
underperformance_counts_2022 <- table(underperforming_stops_2022$Stop_ID)

# Top 50 stops for 2022
top_50_stops_2022 <- names(sort(underperformance_counts_2022, decreasing = TRUE)[1:50])
underperforming_top_50_2022 <- underperforming_stops_2022[underperforming_stops_2022$Stop_ID %in% top_50_stops_2022, ]


# Order
underperforming_top_50_2022$Stop_ID <- factor(underperforming_top_50_2022$Stop_ID, levels = rev(top_50_stops_2022))

bar_plot_top_50_2022 <- ggplot(underperforming_top_50_2022, aes(x = Stop_ID, fill = "2022")) +
  geom_bar(stat = "count", color = "black") +
  labs(title = "Top 50 Bus Stops with Highest Underperformance Counts",
       x = "Bus Stop Name",
       y = "Number of Times Underperformed",
       fill = "Year") +
  scale_fill_manual(values = c("2022" = "#7ACFD3"), name = "Year") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
##print(bar_plot_top_50_2022)

underperforming_top_50_2022 <- underperforming_top_50_2022 %>%
  mutate(Total_Board_Alight = Board + Alight)


underperforming_top_50_sf <- st_as_sf(underperforming_top_50_2022, coords = c("Longitude", "Latitude"), crs = 4326)
underperforming_top_50_sf <- st_transform(underperforming_top_50_sf, crs = 3358)  # NAD83 North Carolina

map_plot_top_50_2022 <- ggplot() +
  geom_sf(data = dat2020, fill = "grey90", alpha = 0.5, color = NA) +
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = underperforming_top_50_sf, aes(colour = q5(Total_Board_Alight)), size = 2) +
  scale_color_manual(values = rev(palettebase5),
                     labels = qBr(underperforming_top_50_sf, "Total_Board_Alight"),
                     name = "Quintile\nBreaks") +
  labs(title = "Top 50 Bus Stops with Highest Underperformance Counts in 2022",
       fill = "Total_Board_Alight") +
  mapTheme()
##print(map_plot_top_50_2022)

combined_plot <- bar_plot_top_50_2022 + map_plot_top_50_2022
print(combined_plot)

Some stops exhibit a growing trend of underperformance after the pandemic. About half of these stops show spatial clustering and are located along a single route. The underperformance and clustering trends in 2022 are consistent with the overall trend in the city center area.

4.3 Spatial Factors

Other features related to the built environment itself help determine the amount of ridership in an area. In this section, several spatial features are explored to further reveal bus ridership patterns in the City of Charlotte.

4.3.1 Existing and proposed transit lines and stations

When considering spatial features that might affect ridership at a bus stop, any other existing and proposed transit stops and routes have to be accounted for. If significant, the distance from each bus stop to other transit stops would also serve as a variable in the model. The City of Charlotte has two existing rail transit lines and two proposed ones. The existing lines are the CityLYNX Gold Line and the LYNX Blue Line light rail, while the Red Line and Silver Line have been proposed for future operation. All of these rail lines, together with the existing bus routes, are shown in the following map.

transit <- ggplot() +
  geom_sf(data = dat2020, fill = "grey90", alpha = 0.5, color = NA) +
  geom_sf(data = bus_routes, aes(color = "Bus Route"), size = 5, alpha = 0.5) +
  geom_sf(data = blue_routes, aes(color = "Blue Line"), size = 8) +
  geom_sf(data = blue_stops, aes(color = "Blue Line"), size = 2) +
  geom_sf(data = gold_routes, aes(color = "Gold Line"), size = 8) +
  geom_sf(data = gold_stops, aes(color = "Gold Line"), size = 2) +
  geom_sf(data = silver_routes, aes(color = "Silver Line"), size = 8, linetype = "dashed") +
  geom_sf(data = silver_stops, aes(color = "Silver Line"), size = 2, shape = 21) +
  geom_sf(data = red_routes, aes(color = "Red Line"), size = 8, linetype = "dashed") +
  geom_sf(data = red_stops, aes(color = "Red Line"), size = 2, shape = 21) +
  
  labs(title = "Existing and Proposed Transit Lines and Stops") +
    scale_color_manual(name = "Lines",
                     values = c("Bus Route" = "darkgrey",
                                "Blue Line" = "#4682B4", 
                                "Gold Line" = "#FFD700", 
                                "Silver Line" = "purple", 
                                "Red Line" = "#FF6347"),
                     labels = c("Blue Line","Bus Routes", "Gold Line", "Proposed Red Line", "Proposed Silver Line")) +

  mapTheme()

print(transit)

4.3.2 Density of Amenities

Besides nearby transit lines, spatial features such as schools, parks, police stations, grocery stores, and shopping centers are also considered as factors in stop-level ridership variation. These public amenities are destinations that bus riders might head to. Travel to particular amenities likely varies by day and time, but due to data limitations we treat this as a more general pattern. The following maps show the density of public amenities in the City of Charlotte. Most amenities are clustered within the city center and extend to the south, presenting an inequitable distribution. For model input, distances to these amenities serve as variables as well.

A <- ggplot() + geom_sf(data = dat2020, fill = "grey90", alpha = 0.5, color = NA) +
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = school, color = "darkgrey", size = .5, alpha = 0.3) +
  stat_density2d(data = data.frame(st_coordinates(school.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "grey", high = "#FFE6A7", name = "Density") +
  scale_alpha(range = c(0.00, 0.1), guide = "none") +
  labs(title = "Density of Schools") +
  mapTheme()

B <- ggplot() + geom_sf(data = dat2020, fill = "grey90", alpha = 0.5, color = NA) +
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = parks, color = "darkgrey", size = .5, alpha = 0.5) +
  stat_density2d(data = data.frame(st_coordinates(parks.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "grey", high = "#7ACFD3", name = "Density") +
  scale_alpha(range = c(0.00, 0.1), guide = "none") +
  labs(title = "Density of Parks") +
  mapTheme()

C <- ggplot() + geom_sf(data = dat2020, fill = "grey90", alpha = 0.5, color = NA) +
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = groceries, color = "darkgrey", size = .5, alpha = 0.5) +
  stat_density2d(data = data.frame(st_coordinates(groceries.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "grey", high = "#FFAB00", name = "Density") +
  scale_alpha(range = c(0.00, 0.1), guide = "none") +
  labs(title = "Density of Grocery Stores") +
  mapTheme()

D <- ggplot() + geom_sf(data = dat2020, fill = "grey90", alpha = 0.5, color = NA) +
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = policeoffice, color = "darkgrey", size = .5, alpha = 0.5) +
  stat_density2d(data = data.frame(st_coordinates(policeoffice.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "grey", high = "#0077B6", name = "Density") +
  scale_alpha(range = c(0.00, 0.1), guide = "none") +
  labs(title = "Density of Police Offices") +
  mapTheme()

E <- ggplot() + geom_sf(data = dat2020, fill = "grey90", alpha = 0.5, color = NA) +
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = shoppingcen, color = "darkgrey", size = .5, alpha = 0.5) +
  stat_density2d(data = data.frame(st_coordinates(shoppingcen.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "grey", high = "#FF6662", name = "Density") +
  scale_alpha(range = c(0.00, 0.1), guide = "none") +
  labs(title = "Density of Shopping Centers") +
  mapTheme()


combined_maps <- A + B + C + D + E
#combined_maps <- combined_maps / plot_layout(nrow = 2, ncol = 3)
print(combined_maps)

4.4. Demographic Factors

Demographics are the final category of features that we consider. Census characteristics of an area may affect ridership in that area. Characteristics such as total population, median household income, race, gender, commute mode, and vehicle ownership are among those that might influence a person’s preferred transportation mode.

4.4.2 By Year and Population

As a brief introduction to Mecklenburg County, which includes the City of Charlotte, we collected tract-level census data from 2017 to 2022. In the charts below, total population increases from 2017 to 2022 roughly in proportion to the increase in the total worker population. Over the same period, the population working from home increases significantly, especially during the COVID-19 pandemic. This likely affects the number of people traveling to work by bus, which decreased during this period.

a <- ggplot(allTracts, aes(x = as.factor(year), y = TotalPop, fill = as.factor(year))) +
  geom_bar(stat = "identity", color = "transparent") +
  labs(title = "Total Population: 2017-2022", x = "Year", y = "Total Population") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none") +
  theme(plot.title = element_text(size = 6)) +
  scale_fill_manual(values = palettebase7)

b <- ggplot(allTracts, aes(x = as.factor(year), y = TotalWorker, fill = as.factor(year))) +
  geom_bar(stat = "identity", color = "transparent") +
  labs(title = "Population of Worker: 2017-2022", x = "Year", y = "Total Population") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none") +
  theme(plot.title = element_text(size = 6)) +
  scale_fill_manual(values = palettebase7)

c <- ggplot(allTracts, aes(x = as.factor(year), y = Workfromhome, fill = as.factor(year))) +
  geom_bar(stat = "identity", color = "transparent") +
  labs(title = "Population of Work from Home: 2017-2022", x = "Year", y = "Total Population") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none") +
  theme(plot.title = element_text(size = 6)) +
  scale_fill_manual(values = palettebase7)
d <- ggplot(allTracts, aes(x = as.factor(year), y = Workwithbus, fill = as.factor(year))) +
  geom_bar(stat = "identity", color = "transparent") +
  labs(title = "Population of People Using Bus as Transportation to Work: 2017-2022", x = "Year", y = "Total Population") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none") +
  theme(plot.title = element_text(size = 6)) +
  scale_fill_manual(values = palettebase7)

combined_plot <- grid.arrange(a,b,c,d, ncol = 4, top = c(2, 2))

print(combined_plot)
## TableGrob (1 x 4) "arrange": 4 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (1-1,3-3) arrange gtable[layout]
## 4 4 (1-1,4-4) arrange gtable[layout]
4.4.3. Mode Share

The share of the population commuting by bus has generally decreased; the 2022 percentage is about half of the 2017 percentage. The share commuting by car has also decreased, by more than 10%. The share commuting by bicycle shows a subtle decrease of 0.07% over the six years. Finally, the share commuting on foot has slightly increased; it peaked during COVID-19, but in the post-COVID period the rate drops.

allTracts$year <- as.numeric(allTracts$year)

average_pct_bus <- allTracts %>%
  group_by(year) %>%
  summarise(avg_pct_bus = mean(pctworkwithbus))

ggplot(average_pct_bus, aes(x = year, y = avg_pct_bus)) +
  geom_line(color = "black",linewidth=1.5) +
  geom_point(color = "red",size=3) +
  labs(x = "Year", 
       y = "Average Percentage of Population Commuting with Bus",
       title = "Trend of Commuting by Bus from 2017 to 2022") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  theme_minimal()

average_pct_car <- allTracts %>%
  group_by(year) %>%
  summarise(avg_pct_car = mean(pctworkbycar))

ggplot(average_pct_car, aes(x = year, y = avg_pct_car)) +
  geom_line(color = "black",linewidth=1.5) +
  geom_point(color = "red",size=3) +
  labs(x = "Year", 
       y = "Average Percentage of Population Commuting by Car",
       title = "Trend of Commuting by Car from 2017 to 2022") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  theme_minimal()

average_pct_taxi <- allTracts %>%
  group_by(year) %>%
  summarise(avg_pct_taxi = mean(pctworkbytaxi))

ggplot(average_pct_taxi, aes(x = year, y = avg_pct_taxi)) +
  geom_line(color = "black",linewidth=1.5) +
  geom_point(color = "red",size=3) +
  labs(x = "Year", 
       y = "Average Percentage of Population Commuting by Taxi",
       title = "Trend of Commuting by Taxi from 2017 to 2022") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.01)) +
  theme_minimal()

average_pct_motor <- allTracts %>%
  group_by(year) %>%
  summarise(avg_pct_motor = mean(pctworkbymotor))

ggplot(average_pct_motor, aes(x = year, y = avg_pct_motor)) +
  geom_line(color = "black",linewidth=1.5) +
  geom_point(color = "red",size=3) +
  labs(x = "Year", 
       y = "Average Percentage of Population Commuting with Motorcycle",
       title = "Trend of Commuting by Motorcycle from 2017 to 2022") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.01)) +
  theme_minimal()

average_pct_bike <- allTracts %>%
  group_by(year) %>%
  summarise(avg_pct_bike = mean(pctworkbybike))

ggplot(average_pct_bike, aes(x = year, y = avg_pct_bike)) +
  geom_line(color = "black",linewidth=1.5) +
  geom_point(color = "red",size=3) +
  labs(x = "Year", 
       y = "Average Percentage of Population Commuting by Bike",
       title = "Trend of Commuting by Bike from 2017 to 2022") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.01)) +
  theme_minimal()

average_pct_walk <- allTracts %>%
  group_by(year) %>%
  summarise(avg_pct_walk = mean(pctworkbywalk))

ggplot(average_pct_walk, aes(x = year, y = avg_pct_walk)) +
  geom_line(color = "black",linewidth=1.5) +
  geom_point(color = "red",size=3) +
  labs(x = "Year", 
       y = "Average Percentage of Population Commuting by Walk",
       title = "Trend of Commuting by Walk from 2017 to 2022") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  theme_minimal()

4.4.4. Demographic Maps

The majority of Mecklenburg County’s population resides in suburban census tracts, a pattern consistent with where the county’s employed residents live. However, census data on commuting by public transportation show a significant proportion concentrated in the city center and extending to the south. This contrasts with census data on working from home, which tends to be dispersed across suburban areas farther from the city center. The following maps illustrate the demographic dynamics of the City of Charlotte from 2017 to 2022.

The maps show that suburban census tracts in Charlotte have more residents as well as more employed workers, while people commuting to work by bus are largely concentrated in the city center. Meanwhile, work-from-home trends show dominant concentrations in the suburban areas.
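The code for these maps is not reproduced in this section; a minimal sketch of one such map, assuming allTracts_sf carries the same year and commute-mode fields (e.g. pctworkwithbus) as allTracts, is:

# Sketch: tract-level share of workers commuting by bus, faceted by year
ggplot(allTracts_sf) +
  geom_sf(aes(fill = pctworkwithbus), color = NA) +
  facet_wrap(~year) +
  scale_fill_gradient(low = "grey90", high = "#005357",
                      labels = scales::percent_format(accuracy = 0.1),
                      name = "Share commuting\nby bus") +
  labs(title = "Share of Workers Commuting by Bus by Census Tract, 2017-2022") +
  mapTheme()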

4.4.5 Spatial Reference of Car Ownership and Commute by Bus

One of the major factors that might affect bus ridership is the availability of other transportation modes. The following maps show the distribution of vehicle ownership across Mecklenburg County census tracts. More households in the suburban areas own private vehicles than households in the city center, which is consistent with the population that primarily uses buses being concentrated in the city center area.

4.4.6. Car Ownership Trend

Households with no cars show a declining trend, potentially suggesting fewer households relying on public transportation. In contrast, one-car households increase slightly, indicating a stable preference for single-car ownership. There is a noticeable decline in households with two and three cars, highlighting a shift away from multiple-car ownership, possibly due to rising maintenance costs or a move toward sustainability. The opposite trend is observed for households with four cars and with five or more cars, which show a clear increase, possibly due to higher income levels or larger family sizes. These contrasting patterns provide critical insights into transportation behavior that can inform public transportation planning and infrastructure development.
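The chart behind this description is not shown in this section; a minimal sketch for one of the categories (zero-car households, using the pctnocar field that also feeds the models below) follows the same pattern as the mode-share plots above:

# Sketch: average share of zero-car households per year, mirroring the mode-share plots
average_pct_nocar <- allTracts %>%
  group_by(year) %>%
  summarise(avg_pct_nocar = mean(pctnocar))

ggplot(average_pct_nocar, aes(x = year, y = avg_pct_nocar)) +
  geom_line(color = "black", linewidth = 1.5) +
  geom_point(color = "red", size = 3) +
  labs(x = "Year",
       y = "Average Percentage of Zero-Car Households",
       title = "Trend of Zero-Car Households from 2017 to 2022") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  theme_minimal()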

5. Modeling

5.1. Modeling Strategies

In this section, we build a model to predict future bus ridership at each bus stop. This model will allow CATS transportation planners to test different scenarios, such as changes in demographic values like population and median household income, changes in the locations of public amenities, and changes in the number of routes serving a particular stop, each of which could affect network ridership.

We aim to choose and build an accurate and generalizable model in this section so that CATS transportation planners can obtain ridership predictions with minimal error. To identify the best prediction model, we compare how well an Ordinary Least Squares (OLS) regression model and a Random Forest model capture the variability in our dataset. The engineered features fall into three categories: demographics, proximity to amenities, and network characteristics. The demographic features include population, median income, median rent, commute to work, race, gender, work mode, car ownership, and population density. The proximity features are distances to schools, parks, grocery stores, police offices, shopping centers, and light rail stops. The network characteristics are the number of routes served by each bus stop, the total number of stops on the routes serving each bus stop, and the distance from each bus stop to the city center. We also include month as a feature to capture seasonal variation in bus ridership.

The dependent variable is the daily average ridership at each stop in 2022; the predicted daily average is then multiplied by the number of days the bus operates to obtain a monthly ridership count. We chose 2022 as our training year, assuming that this data is not influenced by the impact of the COVID-19 pandemic. In the modeling process, we split the data 70/30 into training and test sets, then evaluate each model’s performance on the test set.

5.2. Feature Engineering

In the demographics category, we collected population, median income, and other census variables. We then used a spatial join to attach the census data to the APC bus stop ridership data based on census tract geometry.

In the proximity-to-amenities category, amenity locations were collected from the city’s open data portal. We created new features by calculating, for each bus stop in the training data (2022 ridership data), nearest-neighbor distances to the closest one through five of each public amenity type used: schools, parks, grocery stores, police offices, shopping centers, and light rail stops. A compact sketch of this step is shown below.
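The following is a minimal sketch only, not the exact code used later in this section. It assumes a bus stop layer stops_sf (a placeholder name), the amenity layers named in this report (school.sf, parks.sf, etc.), and the nn_function() helper used elsewhere in this report.

# Sketch: generate *_nn1 ... *_nn5 distance columns in a loop over amenity layers
amenity_layers <- list(school       = school.sf,
                       parks        = parks.sf,
                       groceries    = groceries.sf,
                       policeoffice = policeoffice.sf,
                       shoppingcen  = shoppingcen.sf)

stop_xy <- st_coordinates(stops_sf)

for (name in names(amenity_layers)) {
  amenity_xy <- st_coordinates(amenity_layers[[name]])
  for (k in 1:5) {
    # distance from each stop to its k nearest amenities of this type (via nn_function)
    stops_sf[[paste0(name, "_nn", k)]] <- nn_function(stop_xy, amenity_xy, k = k)
  }
}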

In the network characteristics category, we joined the bus route data with the APC bus stop ridership data based on the bus stop ID. We then calculated the total number of routes serving each bus stop and the total number of stops on each of those routes. We also added a point representing downtown Charlotte to measure each bus stop’s distance to it, under the assumption that downtown serves as the economic center and is a destination for a significant portion of bus riders.

#Demographic Category
# Spatial join based on containment (bus stops within census tracts)
if (st_crs(busstop_sf) != st_crs(allTracts_sf)) {
  busstop_sf <- st_transform(busstop_sf, crs = st_crs(allTracts_sf))
}

# Spatial join
joined_data <- st_join(busstop_sf, allTracts_sf, join = st_within)

# Separate dataset
data_2017 <- filter(joined_data, Year == 2017 & year == 2017)
data_2018 <- filter(joined_data, Year == 2018 & year == 2018)
data_2019 <- filter(joined_data, Year == 2019 & year == 2019)
data_2020 <- filter(joined_data, Year == 2020 & year == 2020)
data_2021 <- filter(joined_data, Year == 2021 & year == 2021)
data_2022 <- filter(joined_data, Year == 2022 & year == 2022)
# Proximity to amenities
data_2022 <-
  data_2022 %>% 
    mutate(
      school_nn1 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(school.sf), k = 1),
      
      school_nn2 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(school.sf), k = 2), 
      
      school_nn3 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(school.sf), k = 3), 
      
      school_nn4 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(school.sf), k = 4), 
      
      school_nn5 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(school.sf), k = 5)) 

data_2022 <-
  data_2022 %>% 
    mutate(
      policeoffice_nn1 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(policeoffice.sf), k = 1),
      
      policeoffice_nn2 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(policeoffice.sf), k = 2), 
      
      policeoffice_nn3 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(policeoffice.sf), k = 3), 
      
      policeoffice_nn4 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(policeoffice.sf), k = 4), 
      
      policeoffice_nn5 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(policeoffice.sf), k = 5)) 

data_2022 <-
  data_2022 %>% 
    mutate(
      groceries_nn1 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(groceries.sf), k = 1),
      
      groceries_nn2 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(groceries.sf), k = 2), 
      
      groceries_nn3 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(groceries.sf), k = 3), 
      
      groceries_nn4 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(groceries.sf), k = 4), 
      
      groceries_nn5 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(groceries.sf), k = 5)) 


data_2022 <-
  data_2022 %>% 
    mutate(
      parks_nn1 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(parks.sf), k = 1),
      
      parks_nn2 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(parks.sf), k = 2), 
      
      parks_nn3 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(parks.sf), k = 3), 
      
      parks_nn4 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(parks.sf), k = 4), 
      
      parks_nn5 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(parks.sf), k = 5)) 

data_2022 <-
  data_2022 %>% 
    mutate(
      shoppingcen_nn1 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(shoppingcen.sf), k = 1),
      
      shoppingcen_nn2 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(shoppingcen.sf), k = 2), 
      
      shoppingcen_nn3 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(shoppingcen.sf), k = 3), 
      
      shoppingcen_nn4 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(shoppingcen.sf), k = 4), 
      
      shoppingcen_nn5 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(shoppingcen.sf), k = 5)) 
# Rename OBJECTID_1 column to OBJECTID in blue_stops
blue_stops <- blue_stops %>% rename(OBJECTID = OBJECTID_1)
blue_stops <- select(blue_stops, c(OBJECTID, NAME))

gold_stops <- select(gold_stops, c(OBJECTID, Stop_Name))
gold_stops <- gold_stops %>% rename(NAME = Stop_Name)
# Merge the datasets based on OBJECTID
existing_transit_stops <- rbind(blue_stops, gold_stops)
existing_transit_stops <- st_as_sf(existing_transit_stops, coords = c("geometry.x", "geometry.y"))


# Plot the density of points
ggplot() +
  geom_sf(data = dat2020, fill = "grey90", alpha = 0.5, color = NA) +
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = existing_transit_stops, color = "black", size = .5, alpha = 0.5) +
  stat_density_2d(data = existing_transit_stops, 
                  aes(x = st_coordinates(geometry)[, 1], 
                      y = st_coordinates(geometry)[, 2],
                      fill = ..level.., alpha = ..level..),
                  size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#737373", high = "#0077B6", name = "Density") +  # Adjust color gradient
  scale_alpha(range = c(0.00, 0.1), guide = "none") +
  labs(title = "Density of Existing Light Rail Stops") +
  mapTheme()

data_2022 <-
  data_2022 %>% 
    mutate(
      transit_stops_nn1 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(existing_transit_stops), k = 1),
      
      transit_stops_nn2 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(existing_transit_stops), k = 2), 
      
      transit_stops_nn3 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(existing_transit_stops), k = 3), 
      
      transit_stops_nn4 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(existing_transit_stops), k = 4), 
      
      transit_stops_nn5 = nn_function(st_coordinates(data_2022), 
                              st_coordinates(existing_transit_stops), k = 5)) 
# Network characteristic category
city_center<- st_sfc(st_point(c(1692115, 5236081)), crs = 3358)
data_2022$dist_to_center <- st_distance(data_2022$geometry, city_center)
# Network characteristic category
routes_test <- st_read("Bus Transit Data/Bus_Stops_With_Frequency_HLT.shp")
## Reading layer `Bus_Stops_With_Frequency_HLT' from data source 
##   `C:\Users\jonat\MUSA8010-Practicum-Charlotte-NC\Bus Transit Data\Bus_Stops_With_Frequency_HLT.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 2992 features and 21 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1349366 ymin: 434579.5 xmax: 1533447 ymax: 645331.2
## Projected CRS: NAD83 / North Carolina (ftUS)
new_routes <- routes_test[, c("StopID", "routes")]

# The "StopID" column contains values like "00040", "00125", etc.
# Convert "StopID" to numeric and then format it without leading zeros
new_routes_2 <- new_routes %>%
  mutate(StopID = as.numeric(StopID),
         StopID = sprintf("%d", StopID))

new_routes_2 <- new_routes_2 %>%
  rename('Stop ID' = StopID)

# Convert the sf object "sf_data" to a regular data frame
routes_df <- st_drop_geometry(new_routes_2)

# Remove spaces in "Stop ID" values
data_2022$`Stop ID` <- gsub("\\s+", "", data_2022$`Stop ID`)

# Perform a left join based on the common column "Stop ID"
merged_dataset_bus <- left_join(routes_df, data_2022,   by = "Stop ID")
## Warning in left_join(routes_df, data_2022, by = "Stop ID"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 17904 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
merged_dataset_bus <- merged_dataset_bus %>%
  mutate("total_routes_in_stops" = str_count(routes, ",") + 1)

# Separate the routes into individual rows
separated_data <- merged_dataset_bus %>%
  separate_rows(routes, sep = ",\\s*")

count_stops_routes <- separated_data %>%
  group_by(routes) %>%
  summarise(total_stops_in_routes = n_distinct(`Stop ID`))

# Merge the count_stops_data back to the original_data based on routes and counts
merge_count_data <- left_join(separated_data, count_stops_routes, by = "routes")

merge_count_data <- merge_count_data %>%
  mutate("total_routes_in_stops" = str_count(routes, ",") + 1)


summarized_data2 <- merge_count_data %>%
  group_by(`Stop ID`, Month) %>%
  summarize(total_stops_in_served_routes = sum(total_stops_in_routes, na.rm = TRUE))
## `summarise()` has grouped output by 'Stop ID'. You can override using the
## `.groups` argument.
# Merge the summarized data back with the original 'bus_2022' dataset
RF2_table <- merge(merged_dataset_bus, summarized_data2, by = c('Stop ID', 'Month'), all.x = TRUE)

RF2_table_sf <- st_as_sf(RF2_table)

5.3. Modeling Results

5.3.1. OLS

With the dependent variable and predictors ready, we split the 2022 transit data 70/30 into training and testing datasets and fit a linear regression model on the training data. The model predicts Avg_ridership from a variety of predictors. Using the test data, we calculate predictions and evaluate error metrics including Mean Absolute Error (MAE) and Absolute Percentage Error (APE), which help us understand the accuracy of the model predictions. The R² is around 0.17, which is not indicative of strong predictive power. Although some predictors demonstrate significant correlations with the dependent variable, others do not.

Residual plots from this OLS model do not scatter randomly, indicating potential heteroscedasticity.

We plot actual versus predicted ridership to visually assess the model’s performance, with a line of perfect prediction added for reference. Many predictions fall far from this line, so we do not consider the OLS model to have high predictive power or accuracy.

data_2022 <- data_2022 %>% 
             mutate(Avg_ridership = Avg_Board + Avg_Alight)
# split traning and testing
set.seed(123) # Set seed for reproducibility
train_index <- createDataPartition(data_2022$Avg_ridership, p = 0.7, list = FALSE)
OLS_train_data <- data_2022[train_index, ]
OLS_test_data <- data_2022[-train_index, ]

#Fit regression
OLS <- lm(Avg_ridership ~ TotalPop + MedHHInc + MedRent + Commutetowork + pctWhite + pctAfricanamerican + pctAsian + pctOther + pctMale + pctFemale + pctBachelors + pctWorker + pctPoverty + pctworkwithbus + pctworkfromhome + pctworkbycar + pctworkbytaxi + pctworkbymotor + pctworkbybike + pctworkbywalk + pctnocar + pct1car + pct2car + pct3car + pct4car + pct5car + policeoffice_nn1 + policeoffice_nn2 + policeoffice_nn3 + policeoffice_nn4 + policeoffice_nn5 + groceries_nn1 + groceries_nn2 + groceries_nn3 + groceries_nn4 + groceries_nn5 + parks_nn1 + parks_nn2 + parks_nn3 + parks_nn4 + parks_nn5 + shoppingcen_nn1 + shoppingcen_nn2 + shoppingcen_nn3 + shoppingcen_nn4 + shoppingcen_nn5 + transit_stops_nn1 + transit_stops_nn2 + transit_stops_nn3 + transit_stops_nn4 + transit_stops_nn5 + school_nn1 + school_nn2 + school_nn3 + school_nn4 + school_nn5 + dist_to_center
          , data = OLS_train_data)

# Evaluate model
OLS_testing <- OLS_test_data %>%
  mutate(
    Predictions = predict(OLS, OLS_test_data),
    Error = Predictions - OLS_test_data$Avg_ridership,
    AbsError = abs(Predictions - OLS_test_data$Avg_ridership),
    MAE = mean(abs(Error), na.rm = TRUE),
    APE = abs(Predictions - OLS_test_data$Avg_ridership) / Predictions)
summary(OLS)
## 
## Call:
## lm(formula = Avg_ridership ~ TotalPop + MedHHInc + MedRent + 
##     Commutetowork + pctWhite + pctAfricanamerican + pctAsian + 
##     pctOther + pctMale + pctFemale + pctBachelors + pctWorker + 
##     pctPoverty + pctworkwithbus + pctworkfromhome + pctworkbycar + 
##     pctworkbytaxi + pctworkbymotor + pctworkbybike + pctworkbywalk + 
##     pctnocar + pct1car + pct2car + pct3car + pct4car + pct5car + 
##     policeoffice_nn1 + policeoffice_nn2 + policeoffice_nn3 + 
##     policeoffice_nn4 + policeoffice_nn5 + groceries_nn1 + groceries_nn2 + 
##     groceries_nn3 + groceries_nn4 + groceries_nn5 + parks_nn1 + 
##     parks_nn2 + parks_nn3 + parks_nn4 + parks_nn5 + shoppingcen_nn1 + 
##     shoppingcen_nn2 + shoppingcen_nn3 + shoppingcen_nn4 + shoppingcen_nn5 + 
##     transit_stops_nn1 + transit_stops_nn2 + transit_stops_nn3 + 
##     transit_stops_nn4 + transit_stops_nn5 + school_nn1 + school_nn2 + 
##     school_nn3 + school_nn4 + school_nn5 + dist_to_center, data = OLS_train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -44.45  -7.59  -3.01   2.55 396.28 
## 
## Coefficients: (2 not defined because of singularities)
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.333e+02  1.454e+02  -0.917 0.359111    
## TotalPop            4.284e-03  4.677e-04   9.159  < 2e-16 ***
## MedHHInc           -1.296e-05  6.496e-06  -1.995 0.046091 *  
## MedRent            -1.650e-03  5.896e-04  -2.798 0.005145 ** 
## Commutetowork      -9.055e-03  8.527e-04 -10.619  < 2e-16 ***
## pctWhite           -1.525e+01  1.631e+00  -9.348  < 2e-16 ***
## pctAfricanamerican -1.420e+01  1.524e+00  -9.318  < 2e-16 ***
## pctAsian            1.154e+01  2.972e+00   3.882 0.000104 ***
## pctOther                   NA         NA      NA       NA    
## pctMale             2.431e+01  2.085e+00  11.660  < 2e-16 ***
## pctFemale          -1.428e+01  2.523e+00  -5.659 1.54e-08 ***
## pctBachelors        6.018e+01  4.864e+00  12.373  < 2e-16 ***
## pctWorker           1.982e+01  3.270e+00   6.060 1.38e-09 ***
## pctPoverty          6.839e+00  1.925e+00   3.554 0.000381 ***
## pctworkwithbus     -9.382e+00  7.570e+00  -1.239 0.215217    
## pctworkfromhome    -3.481e+01  6.291e+00  -5.533 3.19e-08 ***
## pctworkbycar       -3.666e+01  6.136e+00  -5.975 2.34e-09 ***
## pctworkbytaxi      -9.638e+01  1.748e+01  -5.514 3.54e-08 ***
## pctworkbymotor     -3.181e+02  3.476e+01  -9.150  < 2e-16 ***
## pctworkbybike      -1.567e+02  2.529e+01  -6.195 5.92e-10 ***
## pctworkbywalk      -1.873e+01  6.537e+00  -2.865 0.004179 ** 
## pctnocar           -3.387e+01  5.031e+00  -6.732 1.71e-11 ***
## pct1car            -8.044e-01  4.035e+00  -0.199 0.841987    
## pct2car             9.492e+00  4.023e+00   2.360 0.018301 *  
## pct3car            -1.834e+01  5.029e+00  -3.648 0.000265 ***
## pct4car                    NA         NA      NA       NA    
## pct5car             5.628e+00  2.038e+00   2.761 0.005768 ** 
## policeoffice_nn1   -1.396e-03  2.026e-04  -6.887 5.84e-12 ***
## policeoffice_nn2    2.045e-03  5.313e-04   3.850 0.000118 ***
## policeoffice_nn3   -2.996e-03  8.757e-04  -3.421 0.000624 ***
## policeoffice_nn4    4.474e-03  1.058e-03   4.228 2.36e-05 ***
## policeoffice_nn5   -1.809e-03  6.731e-04  -2.688 0.007201 ** 
## groceries_nn1       2.025e-03  5.350e-04   3.784 0.000154 ***
## groceries_nn2      -5.062e-03  1.277e-03  -3.963 7.42e-05 ***
## groceries_nn3       6.423e-03  2.378e-03   2.701 0.006926 ** 
## groceries_nn4      -1.144e-02  3.524e-03  -3.247 0.001169 ** 
## groceries_nn5       7.732e-03  2.136e-03   3.621 0.000295 ***
## parks_nn1           5.579e-03  6.706e-04   8.320  < 2e-16 ***
## parks_nn2          -2.656e-03  1.657e-03  -1.603 0.108909    
## parks_nn3          -5.217e-03  2.895e-03  -1.802 0.071523 .  
## parks_nn4           1.229e-02  4.020e-03   3.057 0.002242 ** 
## parks_nn5          -1.073e-02  2.413e-03  -4.448 8.72e-06 ***
## shoppingcen_nn1    -3.204e-03  5.610e-04  -5.711 1.14e-08 ***
## shoppingcen_nn2    -9.558e-03  1.367e-03  -6.991 2.80e-12 ***
## shoppingcen_nn3     1.261e-02  2.273e-03   5.545 2.97e-08 ***
## shoppingcen_nn4     8.924e-04  2.774e-03   0.322 0.747716    
## shoppingcen_nn5    -3.699e-03  1.667e-03  -2.218 0.026540 *  
## transit_stops_nn1  -3.752e-03  1.095e-03  -3.425 0.000615 ***
## transit_stops_nn2  -1.280e-03  2.219e-03  -0.577 0.564023    
## transit_stops_nn3  -3.809e-03  3.128e-03  -1.218 0.223387    
## transit_stops_nn4   3.740e-02  5.039e-03   7.422 1.19e-13 ***
## transit_stops_nn5  -2.911e-02  2.953e-03  -9.860  < 2e-16 ***
## school_nn1         -5.028e-04  6.878e-04  -0.731 0.464787    
## school_nn2          2.804e-03  1.796e-03   1.562 0.118391    
## school_nn3         -4.548e-03  3.022e-03  -1.505 0.132331    
## school_nn4         -1.606e-03  3.929e-03  -0.409 0.682745    
## school_nn5          3.822e-03  2.313e-03   1.652 0.098553 .  
## dist_to_center      3.470e-05  2.775e-05   1.250 0.211230    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.48 on 24593 degrees of freedom
##   (425 observations deleted due to missingness)
## Multiple R-squared:  0.169,  Adjusted R-squared:  0.1672 
## F-statistic: 90.95 on 55 and 24593 DF,  p-value: < 2.2e-16
plot(OLS)

# Extract actual and predicted values
OLSactual <- OLS_test_data$Avg_ridership
OLSpredicted <- predict(OLS, newdata = OLS_test_data)

# Create a scatter plot of actual vs. predicted values
plot(OLSactual, OLSpredicted, main = "Actual vs. Predicted", xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")  # Add a line of perfect prediction (y = x)

Using the testing dataset, now converted to a spatial format, we create a map to visualize the absolute prediction error across geographic locations. The map indicates that for some stops the absolute error of the OLS model exceeds 300 riders, further indicating that this model does not perform well in predicting average ridership from the predictors provided.

OLS_testing_sf <- st_as_sf(OLS_testing)

ggplot(OLS_testing_sf) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
  geom_sf(aes(color = AbsError), size = .5) +
  scale_color_gradient(low = "#FFBD37", high = "#005357") +
  mapTheme() +
  labs(title = "Bus Stops", color = "Absolute Error Value")

5.3.2. Random Forest Model 1 : Bus Stops
5.3.2.1 Random Forest Model 1 Result

Following the same process as the OLS model, we ran a random forest model on the bus stops. The output of the random forest regression indicates a highly effective model for predicting Avg_ridership, explaining 94.95% of the variance in the dependent variable. This suggests the model captures the relationships and patterns within the data well, which is further supported by a mean of squared residuals of 18.31362, indicating that predictions are generally close to the actual values. The model, built with 500 trees and evaluating 16 variables at each split, demonstrates robust predictive performance.

In the actual vs. predicted plot, the predictions lie much closer to the line of perfect prediction than in the OLS model, further indicating that the random forest model performs better.

Absolute errors decrease significantly as well; none of the bus stops has an absolute error greater than 75.

# Split the data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- sample(1:nrow(data_2022), 0.7 * nrow(data_2022)) 
train_data_RF <- data_2022[train_index, ]
train_data_RF <- na.omit(train_data_RF)
test_data_RF <- data_2022[-train_index, ]
test_data_RF<-na.omit(test_data_RF)

# Fit the Random Forest model
RF_model <- randomForest(
  formula = Avg_ridership ~ TotalPop + MedHHInc + MedRent + Commutetowork + pctWhite + pctAfricanamerican + pctAsian + pctOther + pctMale + pctFemale + pctBachelors + pctWorker + pctPoverty + pctworkwithbus + pctworkfromhome + pctworkbycar + pctworkbytaxi + pctworkbymotor + pctworkbybike + pctworkbywalk + pctnocar + pct1car + pct2car + pct3car + pct4car + pct5car + policeoffice_nn1 + policeoffice_nn2 + policeoffice_nn3 + policeoffice_nn4 + policeoffice_nn5 + groceries_nn1 + groceries_nn2 + groceries_nn3 + groceries_nn4 + groceries_nn5 + parks_nn1 + parks_nn2 + parks_nn3 + parks_nn4 + parks_nn5 + shoppingcen_nn1 + shoppingcen_nn2 + shoppingcen_nn3 + shoppingcen_nn4 + shoppingcen_nn5 + Month + PopDensity_km2, 
  data = train_data_RF)


RF_testing <- test_data_RF %>%
  mutate(Predictions_RF = predict(RF_model, newdata = test_data_RF),
Error_RF = Predictions_RF - test_data_RF$Avg_ridership,
AbsError_RF = abs(Predictions_RF - test_data_RF$Avg_ridership),
MAE_RF= mean(abs(Error_RF)),
APE_RF = (abs(Predictions_RF - test_data_RF$Avg_ridership)) / Predictions_RF)

RF_model
## 
## Call:
##  randomForest(formula = Avg_ridership ~ TotalPop + MedHHInc +      MedRent + Commutetowork + pctWhite + pctAfricanamerican +      pctAsian + pctOther + pctMale + pctFemale + pctBachelors +      pctWorker + pctPoverty + pctworkwithbus + pctworkfromhome +      pctworkbycar + pctworkbytaxi + pctworkbymotor + pctworkbybike +      pctworkbywalk + pctnocar + pct1car + pct2car + pct3car +      pct4car + pct5car + policeoffice_nn1 + policeoffice_nn2 +      policeoffice_nn3 + policeoffice_nn4 + policeoffice_nn5 +      groceries_nn1 + groceries_nn2 + groceries_nn3 + groceries_nn4 +      groceries_nn5 + parks_nn1 + parks_nn2 + parks_nn3 + parks_nn4 +      parks_nn5 + shoppingcen_nn1 + shoppingcen_nn2 + shoppingcen_nn3 +      shoppingcen_nn4 + shoppingcen_nn5 + Month + PopDensity_km2,      data = train_data_RF) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 16
## 
##           Mean of squared residuals: 18.31362
##                     % Var explained: 94.95
# Extract actual and predicted values
actual <- test_data_RF$Avg_ridership
predicted <- predict(RF_model, newdata = test_data_RF)

# Create a scatter plot of actual vs. predicted values
plot(actual, predicted, main = "Actual vs. Predicted", xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")  # Add a line of perfect prediction (y = x)

RF_testing_sf <- st_as_sf(RF_testing)

ggplot(RF_testing_sf) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
  geom_sf(aes(color = AbsError_RF), size = .5) +
  scale_color_gradient(low = "#FFBD37", high = "#005357") +
  mapTheme() +
  labs(title = "Random Forest Model 1 Visualizations", color = "Absolute Error Value")

5.3.2.2 Random Forest Model 1 Generalizability

The residuals are centered around the zero line (marked by the red dashed line), which is generally a good sign. This suggests that there’s no systematic bias in the model predictions overall. There is a noticeable increase in the spread of residuals as fitted values increase, particularly visible past a fitted value of about 200. This could indicate heteroscedasticity—where the variability in residuals is not constant across all levels of the predictor variables. In other words, the model might be less reliable at predicting higher values.

predictions <- predict(RF_model, newdata = train_data_RF)
residuals <- train_data_RF$Avg_ridership - predictions

residual_plot <- ggplot(train_data_RF, aes(x = predictions, y = residuals)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  ggtitle("Residual Plot of Random Forest Model 1") +
  xlab("Fitted Values") +
  ylab("Residuals")

print(residual_plot)

5.3.3. Random Forest Model 2 : Bus Routes

Because the transit system is formed not only by stops but also by the routes connecting them, and a single stop may belong to multiple routes, it is essential to analyze the pattern of bus routes as well. We therefore ran a second random forest model of bus stops following the same process, with the network characteristics added.

5.3.3.1. Random Forest Model 2 Result

As with the bus stop random forest model, this model is trained using 500 trees, with 20 variables randomly considered at each decision split. The results indicate a mean of squared residuals of 10.75963 and a remarkably high 97.03% of variance explained.

We also created an actual vs. predicted plot for the second random forest model. The predictions are closer to the line of perfect prediction than in both the OLS model and the first random forest model, further indicating that Random Forest Model 2 performs best. Absolute errors decrease significantly as well; none of the bus stops has an absolute error greater than 75.

The predictions generated by the model represent the average daily bus passenger forecast. To obtain the monthly forecast, the daily prediction value is then multiplied by the number of days the bus stop operates.

RF2_table_sf <- RF2_table_sf %>% 
             mutate(Avg_ridership = Avg_Board + Avg_Alight)
# Split the data into training and testing sets
set.seed(123)  # For reproducibility
train_index2 <- sample(1:nrow(RF2_table_sf), 0.7 * nrow(RF2_table_sf)) 
train_data_RF2 <- RF2_table_sf[train_index2, ]
train_data_RF2 <- na.omit(train_data_RF2)
test_data_RF2 <- RF2_table_sf[-train_index2, ]
test_data_RF2<- na.omit(test_data_RF2)


# Fit the Random Forest model
RF_model3 <- randomForest(
  formula = Avg_ridership ~ TotalPop + MedHHInc + MedRent + Commutetowork + pctWhite + pctAfricanamerican + pctAsian + pctOther + pctMale + pctFemale + pctBachelors + pctWorker + pctPoverty + pctworkwithbus + pctworkfromhome + pctworkbycar + pctworkbytaxi + pctworkbymotor + pctworkbybike + pctworkbywalk + pctnocar + pct1car + pct2car + pct3car + pct4car + pct5car + policeoffice_nn1 + policeoffice_nn2 + policeoffice_nn3 + policeoffice_nn4 + policeoffice_nn5 + groceries_nn1 + groceries_nn2 + groceries_nn3 + groceries_nn4 + groceries_nn5 + parks_nn1 + parks_nn2 + parks_nn3 + parks_nn4 + parks_nn5 + shoppingcen_nn1 + shoppingcen_nn2 + shoppingcen_nn3 + shoppingcen_nn4 + shoppingcen_nn5+ total_routes_in_stops + total_stops_in_served_routes + Month + PopDensity_km2 + transit_stops_nn1 + transit_stops_nn2 + transit_stops_nn3 + transit_stops_nn4 + transit_stops_nn5 + school_nn1 + school_nn2 + school_nn3 + school_nn4 + school_nn5 + dist_to_center, 
  data = train_data_RF2)


RF_testing3 <- test_data_RF2 %>%
  mutate(Predictions_RF3 = predict(RF_model3, newdata = test_data_RF2),
Error_RF3 = Predictions_RF3 - test_data_RF2$Avg_ridership,
AbsError_RF3 = abs(Predictions_RF3 - test_data_RF2$Avg_ridership),
MAE_RF3= mean(abs(Error_RF3)),
APE_RF3 = (abs(Predictions_RF3 - test_data_RF2$Avg_ridership)) / Predictions_RF3)

RF_model3
## 
## Call:
##  randomForest(formula = Avg_ridership ~ TotalPop + MedHHInc +      MedRent + Commutetowork + pctWhite + pctAfricanamerican +      pctAsian + pctOther + pctMale + pctFemale + pctBachelors +      pctWorker + pctPoverty + pctworkwithbus + pctworkfromhome +      pctworkbycar + pctworkbytaxi + pctworkbymotor + pctworkbybike +      pctworkbywalk + pctnocar + pct1car + pct2car + pct3car +      pct4car + pct5car + policeoffice_nn1 + policeoffice_nn2 +      policeoffice_nn3 + policeoffice_nn4 + policeoffice_nn5 +      groceries_nn1 + groceries_nn2 + groceries_nn3 + groceries_nn4 +      groceries_nn5 + parks_nn1 + parks_nn2 + parks_nn3 + parks_nn4 +      parks_nn5 + shoppingcen_nn1 + shoppingcen_nn2 + shoppingcen_nn3 +      shoppingcen_nn4 + shoppingcen_nn5 + total_routes_in_stops +      total_stops_in_served_routes + Month + PopDensity_km2 + transit_stops_nn1 +      transit_stops_nn2 + transit_stops_nn3 + transit_stops_nn4 +      transit_stops_nn5 + school_nn1 + school_nn2 + school_nn3 +      school_nn4 + school_nn5 + dist_to_center, data = train_data_RF2) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 20
## 
##           Mean of squared residuals: 10.75963
##                     % Var explained: 97.03
# Extract actual and predicted values
actual3 <- test_data_RF2$Avg_ridership
predicted3 <- predict(RF_model3, newdata = test_data_RF2)

# Create a scatter plot of actual vs. predicted values
plot(actual3, predicted3, main = "Actual vs. Predicted", xlab = "Actual", ylab = "Predicted")
abline(0, 1, col = "red")  # Add a line of perfect prediction (y = x)

RF_testing_sf3 <- st_as_sf(RF_testing3)

ggplot(RF_testing_sf3) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
  geom_sf(aes(color = AbsError_RF3), size = .5) +
  scale_color_gradient(low = "#FFBD37", high = "#005357") +
  mapTheme() +
  labs(title = "Random Forest Model 2 Visualizations", color = "Absolute Error Value")

# Calculating Monthly Prediction
RF_testing_sf_monthly <- RF_testing_sf3 %>%
  mutate(Monthly_prediction = Predictions_RF3 * Number_of_Days)

We identify and visualize the top 50 underperforming bus stops based on predicted ridership for 2022. First, we filter the prediction table to bus stops with predicted values less than or equal to 10. Then we rank these underperforming stops by their frequency of occurrence and select the top 50.

underperforming_stops_predicted2 <- RF_testing_sf_monthly[RF_testing_sf_monthly$Monthly_prediction <= 10, ]
underperformance_counts_predicted2 <- table(underperforming_stops_predicted2$"Stop ID")

# Top 50 predicted underperforming stops (2022)
top_50_stops_predicted2 <- names(sort(underperformance_counts_predicted2, decreasing = TRUE)[1:50])
underperforming_top_50_predicted2 <- underperforming_stops_predicted2[underperforming_stops_predicted2$"Stop ID" %in% top_50_stops_predicted2, ]

# Order
underperforming_top_50_predicted2$"Stop ID" <- factor(underperforming_top_50_predicted2$"Stop ID", levels = rev(top_50_stops_predicted2))

underperforming_top_50_predicted2$geometry <- st_transform(underperforming_top_50_predicted2$geometry, crs = 3358)  # NAD83 North Carolina

map_plot_top_50_predicted2 <- ggplot() +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.5, color = NA) +
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = underperforming_top_50_predicted2, aes(geometry = geometry, colour = q5(Monthly_prediction)), size = 2) +
  scale_color_manual(values = rev(palettebase5),
                     labels = c("1", "2", "3", "4", "5"),
                     name = "Quintile\nBreaks") +
  labs(title = "50 Bus Stops with Lowest Predicted Avg Ridership",
       fill = "Predictions_RF2") +
  mapTheme()+
  theme(legend.position = "bottom")

print(map_plot_top_50_predicted2)

We plot a map focusing specifically on the underperforming stops and their predicted average ridership for the month of January. Most stops show ridership of around 3,000 during January 2022.

underperforming_stops_predicted2_m <- underperforming_stops_predicted2[underperforming_stops_predicted2$Month == 'January', ]

ggplot() +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8, color = NA) +
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = underperforming_stops_predicted2_m, aes(color = Monthly_prediction), size = 1) +
  scale_color_gradient(low = "#FFBD37", high = "#005357") +
  mapTheme() +
  labs(title = "Predicted Avg Ridership under 10 in January", color = "Predicted Ridership") +
  theme(legend.position = "bottom")

5.3.3.2. Random Forest Model 2 Generalizability

The residuals of this random forest model show a pattern similar to the model of bus routes and bus stops: they are centered around the zero line, which suggests no systematic bias in the model's predictions overall. However, the spread of the residuals increases noticeably as the fitted values increase, particularly beyond fitted values of about 200. This points to heteroscedasticity, where the variability of the residuals is not constant across the range of predictions, and it suggests the model may be less reliable when predicting higher ridership values; a quick numerical check of this spread follows the residual plot below.

predictions3 <- predict(RF_model3, newdata = train_data_RF2)
residuals3 <- train_data_RF2$Avg_ridership - predictions3

residual_plot <- ggplot(train_data_RF2, aes(x = predictions3, y = residuals3)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  ggtitle("Residual Plot of Random Forest Model 2") +
  xlab("Fitted Values") +
  ylab("Residuals")

print(residual_plot)
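
To put a rough number on that widening spread, a minimal check (our own sketch, reusing the predictions3 and residuals3 objects created above) bins the fitted values and compares the residual standard deviation within each bin; the bin breaks are illustrative assumptions.

# Sketch: residual spread by fitted-value bin (assumes predictions3 and residuals3 from above)
library(dplyr)

resid_check <- data.frame(fitted = predictions3, resid = residuals3) %>%
  mutate(fitted_bin = cut(fitted, breaks = c(-Inf, 50, 100, 200, Inf))) %>%
  group_by(fitted_bin) %>%
  summarise(n = n(), resid_sd = sd(resid))

print(resid_check)  # a rising resid_sd from low to high bins is consistent with heteroscedasticity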

5.3.3.3. Random Forest Model 2 Hyperparameter Tuning

Additionally, analyzing which features are most influential in predicting ridership can provide valuable insight and help refine the model, so we tuned the mtry parameter (the number of variables randomly sampled as candidates at each split) while keeping the default number of trees. The model was tuned using 5-fold cross-validation, and mtry = 10 was selected as optimal before the final model was fit on the full training set. On the test set this yields a Mean Absolute Error (MAE) of 1.372216, the average absolute error of the ridership predictions. The Mean Absolute Percentage Error (MAPE) is reported as infinity (Inf), which occurs when actual values in the denominator are zero or near zero, making the percentage error undefined or infinite; a workaround is sketched after the tuning output below.

# Define the parameter grid for tuning only 'mtry'
tuneGrid <- expand.grid(
  mtry = c(10, 15, 20, 25)  # Number of variables randomly sampled as candidates at each split
)
trainControl <- trainControl(
  method = "cv",
  number = 5,  # 5-fold cross-validation
  verboseIter = TRUE
)
# Tune the Random Forest model
set.seed(123)  # Ensure reproducibility
train_index_tuned1 <- sample(1:nrow(RF2_table_sf), 0.7 * nrow(RF2_table_sf)) 
train_data_tuned1 <- RF2_table_sf[train_index_tuned1, ]
train_data_tuned1 <- na.omit(train_data_tuned1)
test_data_tuned1 <- RF2_table_sf[-train_index_tuned1, ]
test_data_tuned1 <- na.omit(test_data_tuned1)

RF_model_tuned1 <- train(
  Avg_ridership ~ TotalPop + MedHHInc + MedRent + Commutetowork + pctWhite + pctAfricanamerican + pctAsian + pctOther + pctMale + pctFemale + pctBachelors + pctWorker + pctPoverty + pctworkwithbus + pctworkfromhome + pctworkbycar + pctworkbytaxi + pctworkbymotor + pctworkbybike + pctworkbywalk + pctnocar + pct1car + pct2car + pct3car + pct4car + pct5car + policeoffice_nn1 + policeoffice_nn2 + policeoffice_nn3 + policeoffice_nn4 + policeoffice_nn5 + groceries_nn1 + groceries_nn2 + groceries_nn3 + groceries_nn4 + groceries_nn5 + parks_nn1 + parks_nn2 + parks_nn3 + parks_nn4 + parks_nn5 + shoppingcen_nn1 + shoppingcen_nn2 + shoppingcen_nn3 + shoppingcen_nn4 + shoppingcen_nn5+ total_routes_in_stops + total_stops_in_served_routes + PopDensity_km2 + transit_stops_nn1 + transit_stops_nn2 + transit_stops_nn3 + transit_stops_nn4 + transit_stops_nn5 + school_nn1 + school_nn2 + school_nn3 + school_nn4 + school_nn5 + dist_to_center,
  data = train_data_tuned1,
  method = "rf",
  metric = "RMSE",
  tuneGrid = tuneGrid,
  trControl = trainControl
)
## + Fold1: mtry=10 
## - Fold1: mtry=10 
## + Fold1: mtry=15 
## - Fold1: mtry=15 
## + Fold1: mtry=20 
## - Fold1: mtry=20 
## + Fold1: mtry=25 
## - Fold1: mtry=25 
## + Fold2: mtry=10 
## - Fold2: mtry=10 
## + Fold2: mtry=15 
## - Fold2: mtry=15 
## + Fold2: mtry=20 
## - Fold2: mtry=20 
## + Fold2: mtry=25 
## - Fold2: mtry=25 
## + Fold3: mtry=10 
## - Fold3: mtry=10 
## + Fold3: mtry=15 
## - Fold3: mtry=15 
## + Fold3: mtry=20 
## - Fold3: mtry=20 
## + Fold3: mtry=25 
## - Fold3: mtry=25 
## + Fold4: mtry=10 
## - Fold4: mtry=10 
## + Fold4: mtry=15 
## - Fold4: mtry=15 
## + Fold4: mtry=20 
## - Fold4: mtry=20 
## + Fold4: mtry=25 
## - Fold4: mtry=25 
## + Fold5: mtry=10 
## - Fold5: mtry=10 
## + Fold5: mtry=15 
## - Fold5: mtry=15 
## + Fold5: mtry=20 
## - Fold5: mtry=20 
## + Fold5: mtry=25 
## - Fold5: mtry=25 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 10 on full training set
# Assuming predictions should be made with RF_model_tuned1
predictions_tuned1 <- predict(RF_model_tuned1, test_data_tuned1) 
results_tuned1 <- test_data_RF2 %>%
  mutate(Predictions_tuned1 = predictions_tuned1,
    Error_tuned1 = Predictions_tuned1 - Avg_ridership,
    AbsError_tuned1 = abs(Error_tuned1),
    MAE_tuned1 = mean(AbsError_tuned1),
    APE_tuned1 = (AbsError_tuned1 / Avg_ridership)
  ) %>%
  summarise(MAE = mean(AbsError_tuned1), MAPE = mean(APE_tuned1))

# Print the error metrics
print(results_tuned1)
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOINT
## Dimension:     XY
## Bounding box:  xmin: 427800.6 ymin: 146155.5 xmax: 458741 ymax: 196692
## Projected CRS: NAD83(HARN) / North Carolina
##        MAE MAPE                       geometry
## 1 1.372216  Inf MULTIPOINT ((427800.6 16747...
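
Because MAPE is undefined when an actual value in the denominator is zero, a hedged workaround (a sketch of our own, not part of the original workflow) is to report MAPE only over observations with non-zero actuals, or to use a symmetric variant that stays finite whenever the prediction and the actual are not both zero.

# Sketch: alternatives to MAPE when actual ridership values contain zeros
# (assumes predictions_tuned1 and test_data_tuned1 from the tuning step above)
actual_tuned <- test_data_tuned1$Avg_ridership
pred_tuned   <- predictions_tuned1

# MAPE restricted to observations with non-zero actuals
nonzero <- actual_tuned > 0
mape_nz <- mean(abs(pred_tuned[nonzero] - actual_tuned[nonzero]) / actual_tuned[nonzero])

# Symmetric MAPE, defined whenever prediction and actual are not both zero
smape <- mean(2 * abs(pred_tuned - actual_tuned) /
                (abs(pred_tuned) + abs(actual_tuned)), na.rm = TRUE)

c(MAPE_nonzero = mape_nz, sMAPE = smape)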

6. Use-Case Scenario

The original goal of building this model is to provide predictions of bus ridership in Charlotte. As explained in the workflow, the model is constructed from demographic features, distances to public amenities, and network characteristics, so it can generate new ridership predictions whenever the values of these input features change.

This motivates us to develop an application that assists CATS transportation planners in predicting future ridership when changes occur, such as shifts in population demographics, modifications to public facilities, or alterations in the number of bus stops. The insights provided by this model can support planners' development decisions, risk mitigation, and problem-solving.

In this project we create four scenarios for ridership change in the city of Charlotte, each following the shared prediction workflow sketched below.
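
Each scenario perturbs one or more input features, re-predicts with RF_model3, converts the prediction to a monthly total, and maps the stops that fall below the underperformance threshold. A minimal helper capturing this repeated pattern is sketched here; it assumes the RF_model3, RF2_table_sf, Number_of_Days, and Month objects defined earlier, and the function name run_scenario is illustrative rather than part of the original analysis.

# Sketch of the repeated scenario workflow used in Sections 6.1-6.4
# (assumes RF_model3, RF2_table_sf, and the Number_of_Days and Month columns defined above)
run_scenario <- function(base_data, model, modify = identity,
                         threshold = 10, month = "January") {
  scenario <- modify(base_data)                              # apply the hypothetical change
  scenario$Predicted_Avg_ridership <- predict(model, newdata = scenario)
  scenario <- dplyr::mutate(scenario,
                            Monthly_prediction = Predicted_Avg_ridership * Number_of_Days)
  scenario_sf <- sf::st_as_sf(scenario)
  scenario_sf[scenario_sf$Monthly_prediction <= threshold & scenario_sf$Month == month, ]
}

# Example: Scenario 1 with a 15% increase in population density
# s1_result <- run_scenario(RF2_table_sf, RF_model3,
#                           modify = function(d) dplyr::mutate(d, PopDensity_km2 = PopDensity_km2 * 1.15))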

6.1 Scenario 1: Increase in Population Density

The last measured population density for Charlotte, NC was 2,827 people per square mile in 2018, and the city has grown at an average rate of 2.87% per year since our first recorded statistic in 2009. If past trends continue, we forecast population density growth of roughly 15% from 2022 to 2027 and 30% from 2022 to 2032. Increasing population density does not change average ridership much, but it does reduce the number of underperforming stops. In both 2027 and 2032 we would see ridership above 300 at stops on the southwestern periphery and in downtown, while the majority of stops still have ridership under 100. The southeastern and northern areas of the county have the most stops with average ridership under 10 in January 2027 and January 2032, but the total number of such stops decreases from 2027 to 2032.
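
As a quick check on these figures (our own hedged arithmetic, not part of the original analysis), compounding the 2.87% annual rate over five and ten years gives roughly 15% and 33%, which the scenarios below use as 15% and 30%:

# Compounding check for the assumed density-growth scenarios (sketch)
annual_growth <- 0.0287
(1 + annual_growth)^5  - 1   # ~0.152 -> the 15% scenario for 2022-2027
(1 + annual_growth)^10 - 1   # ~0.327 -> treated as the 30% scenario for 2022-2032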

6.1.1 Increase in Population Density 15% (2022 & 2027)
# Simulate 15% increase in PopDensity
RF2_table_2022_scenario1 <- RF2_table_sf %>% 
  mutate(PopDensity_km2 = PopDensity_km2 * 1.15)

# Make Predictions with the New Scenario Data
predictions2_2022_scenario1 <- predict(RF_model3, newdata = RF2_table_2022_scenario1)

# Add Predictions to the Data
RF2_table_2022_scenario1$Predicted_Avg_ridership <- predictions2_2022_scenario1

# Add Monthly Prediction Column
RF2_table_2022_scenario1 <- RF2_table_2022_scenario1 %>%
  mutate(Monthly_prediction = Predicted_Avg_ridership * Number_of_Days)

# Format Predicted Avg Ridership as Decimal Numbers
#RF2_table_2022_scenario1$Monthly_prediction <- sprintf("%.2f", predictions2_2022_scenario1)

# Convert Data to an sf Object
RF2_table_2022_scenario1_sf <- st_as_sf(RF2_table_2022_scenario1, geom = "geometry")


# Convert Predicted Avg Ridership to Numeric
RF2_table_2022_scenario1_sf$Monthly_prediction <- as.numeric(RF2_table_2022_scenario1_sf$Monthly_prediction)

result <- RF2_table_2022_scenario1_sf[RF2_table_2022_scenario1_sf$Monthly_prediction <= 10, ]
result <- result[result$Month == 'January', ]

# Plot the Map with Correct Fill Scale
ggplot(result) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(aes(color = Monthly_prediction), size = 1) +
  scale_color_gradient(low = "#FFBD37", high = "#005357") +
  mapTheme() +
  labs(title = "Predicted Avg Ridership under 10 in January: Scenario I", color = "Predicted Avg Ridership with \n15% Increase in Population Density")+
  theme(legend.position = "bottom")

6.1.2 Increase in Population Density 30% (2022 & 2032)
# Simulate 30% increase in PopDensity
RF2_table_2022_scenario1a <- RF2_table_sf %>% 
  mutate(PopDensity_km2 = PopDensity_km2 * 1.3)

# Make Predictions with the New Scenario Data
predictions2_2022_scenario1a <- predict(RF_model3, newdata = RF2_table_2022_scenario1a)

# Add Predictions to the Data
RF2_table_2022_scenario1a$Predicted_Avg_ridership <- predictions2_2022_scenario1a

# Add Monthly Prediction Column
RF2_table_2022_scenario1a <- RF2_table_2022_scenario1a %>%
  mutate(Monthly_prediction = Predicted_Avg_ridership * Number_of_Days)

# Convert Data to an sf Object
RF2_table_2022_scenario1a_sf <- st_as_sf(RF2_table_2022_scenario1a, geom = "geometry")


# Convert Predicted Avg Ridership to Numeric
RF2_table_2022_scenario1a_sf$Monthly_prediction <- as.numeric(RF2_table_2022_scenario1a_sf$Monthly_prediction)


result <- RF2_table_2022_scenario1a_sf[RF2_table_2022_scenario1a_sf$Monthly_prediction <= 10, ]
result <- result[result$Month == 'January', ]

# Plot the Map with Correct Fill Scale
ggplot(result) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(aes(color = Monthly_prediction), size = 1) +
  scale_color_gradient(low = "#FFBD37", high = "#005357") +
  mapTheme() +
  labs(title = "Predicted Avg Ridership under 10 in January: Scenario I", color = "Predicted Avg Ridership with \n30% Increase in Population Density")+
  theme(legend.position = "bottom")

6.2 Scenario 2: Increase in Median Income

Between 2020 and 2021, Charlotte's median household income grew from $65,359 to $68,367, a 4.6% increase. We assume an increase of 25% from 2022 to 2027 and 50% from 2022 to 2032. As income increases, average ridership does not change much: ridership above 300 remains concentrated at stops on the southwestern periphery and in downtown, while the majority of stops still see ridership under 100. However, fewer bus stops have a monthly average ridership under 10 in 2032 than in 2027.

6.2.1 Increase in Median Income 25%
RF2_table_2022_scenario2 <- RF2_table_sf %>% 
  mutate(MedHHInc = MedHHInc * 1.25)

# Make Predictions with the New Scenario Data
predictions2_2022_scenario2 <- predict(RF_model3, newdata = RF2_table_2022_scenario2)

# Add Predictions to the Data
RF2_table_2022_scenario2$Predicted_Avg_ridership <- predictions2_2022_scenario2

# Add Monthly Prediction Column
RF2_table_2022_scenario2 <- RF2_table_2022_scenario2 %>%
  mutate(Monthly_prediction = Predicted_Avg_ridership * Number_of_Days)

# Format Predicted Avg Ridership as Decimal Numbers
# RF2_table_2022_scenario2$Predicted_Avg_ridership <- sprintf("%.2f", predictions2_2022_scenario2)

# Convert Data to an sf Object
RF2_table_2022_scenario2_sf <- st_as_sf(RF2_table_2022_scenario2, geom = "geometry")


# Convert Predicted Avg Ridership to Numeric
RF2_table_2022_scenario2_sf$Monthly_prediction <- as.numeric(RF2_table_2022_scenario2_sf$Monthly_prediction)


result <- RF2_table_2022_scenario2_sf[RF2_table_2022_scenario2_sf$Monthly_prediction <= 10, ]
result <- result[result$Month == 'January', ]

# Plot the Map with Correct Fill Scale
ggplot(result) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(aes(color = Monthly_prediction), size = 1) +
  scale_color_gradient(low = "#FFBD37", high = "#005357") +
  mapTheme() +
  labs(title = "Predicted Avg Ridership under 10 in January: Scenario II", color = "Predicted Avg Ridership with \n25% Increase in Median Income")+
  theme(legend.position = "bottom")

6.2.2 Increase in Median Income 50%
# Simulate 50% increase in MedHHInc
RF2_table_2022_scenario2a <- RF2_table_sf %>% 
  mutate(MedHHInc = MedHHInc * 1.5)

# Make Predictions with the New Scenario Data
predictions2_2022_scenario2a <- predict(RF_model3, newdata = RF2_table_2022_scenario2a)

# Add Predictions to the Data
RF2_table_2022_scenario2a$Predicted_Avg_ridership <- predictions2_2022_scenario2a

# Add Monthly Prediction Column
RF2_table_2022_scenario2a <- RF2_table_2022_scenario2a %>%
  mutate(Monthly_prediction = Predicted_Avg_ridership * Number_of_Days)

# Format Predicted Avg Ridership as Decimal Numbers
# RF2_table_2022_scenario2a$Predicted_Avg_ridership <- sprintf("%.2f", predictions2_2022_scenario2a)

# Convert Data to an sf Object
RF2_table_2022_scenario2a_sf <- st_as_sf(RF2_table_2022_scenario2a, geom = "geometry")


# Convert Predicted Avg Ridership to Numeric
RF2_table_2022_scenario2a_sf$Monthly_prediction <- as.numeric(RF2_table_2022_scenario2a_sf$Monthly_prediction)


result <- RF2_table_2022_scenario2a_sf[RF2_table_2022_scenario2a_sf$Monthly_prediction <= 10, ]
result <- result[result$Month == 'January', ]

# Plot the Map with Correct Fill Scale
ggplot(result) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(aes(color = Monthly_prediction), size = 1) +
  scale_color_gradient(low = "#FFBD37", high = "#005357") +
  mapTheme() +
  labs(title = "Predicted Avg Ridership under 10 in January: Scenario II", color = "Predicted Avg Ridership with \n50% Increase in Median Income")+
  theme(legend.position = "bottom")

6.3 Scenario 3: Add Stops from Two Proposed Light Rail Lines as Features

The density of the new stops along the proposed light rail lines is highest in the city center area. Following the same modeling and visualization process, we add the stops of the two proposed lines to the dataset, re-run the random forest model, and assess its performance with an actual vs. predicted plot and an absolute error map. We also map the underperforming stops.

# Rename OBJECTID_1 column to OBJECTID in red_stops
red_stops <- red_stops %>% rename(OBJECTID = OBJECTID_1)
red_stops <- red_stops %>% rename(NAME = Name)
red_stops <- select(red_stops, -Status)


# Drop extra columns and align column names in silver_stops
silver_stops <- select(silver_stops, -c(Phase, parkRide, GeoID))
silver_stops <- silver_stops %>% rename(NAME = Name)

# Stack the existing stops with the proposed red and silver line stops
proposed_line_merged_data <- rbind(existing_transit_stops, red_stops)
proposed_line_merged_data <- rbind(proposed_line_merged_data, silver_stops)

# Convert the merged proposed-line stops to an sf object
proposed_line_merged_data_sf <- st_as_sf(proposed_line_merged_data, coords = c("geometry.x", "geometry.y"))


# Plot the density of points

ggplot() +
  geom_sf(data = dat2020, fill = "grey90", alpha = 0.5, color = NA) +
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = proposed_line_merged_data, color = "black", size = .5, alpha = 0.5) +
  stat_density_2d(data = proposed_line_merged_data_sf, 
                  aes(x = st_coordinates(geometry)[, 1], 
                      y = st_coordinates(geometry)[, 2],
                      fill = ..level.., alpha = ..level..),
                  size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#737373", high = "#0077B6", name = "Density") +  # Adjust color gradient
  scale_alpha(range = c(0.00, 0.1), guide = "none") +
  labs(title = "Density of Proposed Lines") +
  mapTheme()

RF2_table_sf <- na.omit(RF2_table_sf)

data_2022_s1 <-
  RF2_table_sf %>% 
    mutate(
      proposed_line_nn1 = nn_function(st_coordinates(RF2_table_sf), 
                              st_coordinates(proposed_line_merged_data_sf), k = 1),
      
      proposed_line_nn2 = nn_function(st_coordinates(RF2_table_sf), 
                              st_coordinates(proposed_line_merged_data_sf), k = 2), 
      
      proposed_line_nn3 = nn_function(st_coordinates(RF2_table_sf), 
                              st_coordinates(proposed_line_merged_data_sf), k = 3), 
      
      proposed_line_nn4 = nn_function(st_coordinates(RF2_table_sf), 
                              st_coordinates(proposed_line_merged_data_sf), k = 4), 
      
      proposed_line_nn5 = nn_function(st_coordinates(RF2_table_sf), 
                              st_coordinates(proposed_line_merged_data_sf), k = 5)) 
data_2022_s1 <- subset(data_2022_s1, select = -c(transit_stops_nn1, transit_stops_nn2, transit_stops_nn3, transit_stops_nn4, transit_stops_nn5))

# Rename columns using base R
names(data_2022_s1)[names(data_2022_s1) == "proposed_line_nn1"] <- "transit_stops_nn1"
names(data_2022_s1)[names(data_2022_s1) == "proposed_line_nn2"] <- "transit_stops_nn2"
names(data_2022_s1)[names(data_2022_s1) == "proposed_line_nn3"] <- "transit_stops_nn3"
names(data_2022_s1)[names(data_2022_s1) == "proposed_line_nn4"] <- "transit_stops_nn4"
names(data_2022_s1)[names(data_2022_s1) == "proposed_line_nn5"] <- "transit_stops_nn5"
# Make Predictions with the New Scenario Data
predictions2_2022_scenario3 <- predict(RF_model3, newdata = data_2022_s1)

# Add Predictions to the Data
data_2022_s1$Predicted_Avg_ridership <- predictions2_2022_scenario3

# Add Monthly Prediction Column
data_2022_s1 <- data_2022_s1 %>%
  mutate(Monthly_prediction = Predicted_Avg_ridership * Number_of_Days)

# Convert Data to an sf Object
RF2_table_2022_scenario3_sf <- st_as_sf(data_2022_s1, geom = "geometry")


# Convert Predicted Avg Ridership to Numeric
RF2_table_2022_scenario3_sf$Monthly_prediction <- as.numeric(RF2_table_2022_scenario3_sf$Monthly_prediction)


result <- RF2_table_2022_scenario3_sf[RF2_table_2022_scenario3_sf$Monthly_prediction <= 10, ]
result <- result[result$Month == 'January', ]

# Plot the Map with Correct Fill Scale
ggplot(result) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = silver_routes, color = "purple", size = 7, alpha = 0.5) +
  geom_sf(data = red_routes,color = "#FF6347", size = 7, alpha = 0.5) +
  geom_sf(data = blue_routes, color = "#4682B4", size = 7, alpha = 0.5) +
  geom_sf(data = gold_routes,color = "#FFD700", size = 7, alpha = 0.5) +
  geom_sf(aes(color = Monthly_prediction), size = 1) +
  scale_color_gradient(low = "#FFBD37", high = "#005357") +
  mapTheme() +
  labs(title = "Predicted Avg Ridership under 10 in January: Scenario III",
       color = "Predicted Avg Ridership \nwith Proposed Transit Lines Added") +
  theme(
    legend.position = "bottom",
    plot.title = element_text(size = 10)  # Adjust size here to your preference
  )

6.4 Scenario 4: Add 10 Test Stops

We add 10 test stops to the original dataset to see what happens when new stops are built. Again following the same steps, we use an actual vs. predicted scatterplot and a map of absolute errors to visualize performance, and we map the underperforming stops once the 10 additional stops are included.

newstops_test <- st_read("Test/2022_test.shp")%>%
  st_transform('EPSG:3358')
## Reading layer `2022_test' from data source 
##   `C:\Users\jonat\MUSA8010-Practicum-Charlotte-NC\Test\2022_test.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 35995 features and 12 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.17988 ymin: 34.92705 xmax: -80.55787 ymax: 35.50911
## Geodetic CRS:  WGS 84
# Perform spatial join based on containment (bus stops within census tracts)
joined_data_test <- st_join(newstops_test, dat2022, join = st_within)
joined_data_test <-
  joined_data_test %>% 
    mutate(
      school_nn1 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(school.sf), k = 1),
      
      school_nn2 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(school.sf), k = 2), 
      
      school_nn3 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(school.sf), k = 3), 
      
      school_nn4 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(school.sf), k = 4), 
      
      school_nn5 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(school.sf), k = 5)) 

joined_data_test <-
  joined_data_test %>% 
    mutate(
      policeoffice_nn1 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(policeoffice.sf), k = 1),
      
      policeoffice_nn2 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(policeoffice.sf), k = 2), 
      
      policeoffice_nn3 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(policeoffice.sf), k = 3), 
      
      policeoffice_nn4 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(policeoffice.sf), k = 4), 
      
      policeoffice_nn5 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(policeoffice.sf), k = 5)) 

joined_data_test <-
  joined_data_test %>% 
    mutate(
      groceries_nn1 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(groceries.sf), k = 1),
      
      groceries_nn2 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(groceries.sf), k = 2), 
      
      groceries_nn3 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(groceries.sf), k = 3), 
      
      groceries_nn4 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(groceries.sf), k = 4), 
      
      groceries_nn5 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(groceries.sf), k = 5)) 


joined_data_test <-
  joined_data_test %>% 
    mutate(
      parks_nn1 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(parks.sf), k = 1),
      
      parks_nn2 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(parks.sf), k = 2), 
      
      parks_nn3 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(parks.sf), k = 3), 
      
      parks_nn4 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(parks.sf), k = 4), 
      
      parks_nn5 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(parks.sf), k = 5)) 

joined_data_test <-
  joined_data_test %>% 
    mutate(
      shoppingcen_nn1 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(shoppingcen.sf), k = 1),
      
      shoppingcen_nn2 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(shoppingcen.sf), k = 2), 
      
      shoppingcen_nn3 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(shoppingcen.sf), k = 3), 
      
      shoppingcen_nn4 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(shoppingcen.sf), k = 4), 
      
      shoppingcen_nn5 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(shoppingcen.sf), k = 5)) 
joined_data_test <-
  joined_data_test %>% 
    mutate(
      transit_stops_nn1 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(existing_transit_stops), k = 1),
      
      transit_stops_nn2 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(existing_transit_stops), k = 2), 
      
      transit_stops_nn3 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(existing_transit_stops), k = 3), 
      
      transit_stops_nn4 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(existing_transit_stops), k = 4), 
      
      transit_stops_nn5 = nn_function(st_coordinates(joined_data_test), 
                              st_coordinates(existing_transit_stops), k = 5)) 
city_center<- st_sfc(st_point(c(1692115, 5236081)), crs = 3358)
joined_data_test$dist_to_center <- st_distance(joined_data_test$geometry, city_center)
joined_data_test <- joined_data_test %>% 
             mutate(Avg_ridership = Avg_Board + Avg_Alight)
routes_test <- st_read("Bus Transit Data/Bus_Stops_With_Frequency_HLT.geojson")
## Reading layer `Bus_Stops_With_Frequency_HLT' from data source 
##   `C:\Users\jonat\MUSA8010-Practicum-Charlotte-NC\Bus Transit Data\Bus_Stops_With_Frequency_HLT.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 2992 features and 21 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.17989 ymin: 34.92705 xmax: -80.55811 ymax: 35.50916
## Geodetic CRS:  WGS 84
new_routes <- routes_test[, c("StopID", "routes")]


# Convert StopID to numeric and reformat it to drop leading zeros (e.g., "00040" becomes "40")
new_routes_2 <- new_routes %>%
  mutate(StopID = as.numeric(StopID),
         StopID = sprintf("%d", StopID))


new_routes_2 <- new_routes_2 %>%
  rename('Stop ID' = StopID)

# Drop the geometry to get a regular data frame of stop-route pairs
routes_df <- st_drop_geometry(new_routes_2)

# Remove spaces in "Stop ID" values
joined_data_test$`Stop ID` <- gsub("\\s+", "", joined_data_test$`Stop.ID`)

# Perform a left join based on the common column "Stop ID"
merged_dataset_bus2 <- left_join( routes_df,joined_data_test, by = "Stop ID")
## Warning in left_join(routes_df, joined_data_test, by = "Stop ID"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 17988 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
merged_dataset_bus2 <- merged_dataset_bus2 %>%
  mutate("total_routes_in_stops" = str_count(routes, ",") + 1)


# Separate the routes into individual rows
separated_data2 <- merged_dataset_bus2 %>%
  separate_rows(routes, sep = ",\\s*")

count_stops_routes2 <- separated_data2 %>%
  group_by(routes) %>%
  summarise(total_stops_in_routes = n_distinct(`Stop ID`))

# Merge the count_stops_data back to the original_data based on routes and counts
merge_count_data2 <- left_join(separated_data2, count_stops_routes2, by = "routes")

merge_count_data2 <- merge_count_data2 %>%
  mutate("total_routes_in_stops" = str_count(routes, ",") + 1)


summarized_data22 <- merge_count_data2 %>%
  group_by(`Stop ID`, Month) %>%
  summarize(total_stops_in_served_routes = sum(total_stops_in_routes, na.rm = TRUE))
## `summarise()` has grouped output by 'Stop ID'. You can override using the
## `.groups` argument.
# Merge the summarized data back with the merged_dataset_bus2 dataset
RF2_table2 <- merge(merged_dataset_bus2, summarized_data22, by = c('Stop ID', 'Month'), all.x = TRUE)


RF2_table2_sf <- st_as_sf(RF2_table2)
# Extract the 10 new test stops from joined_data_test by stop name
filtered_table2 <- joined_data_test[joined_data_test$Stop.Name %in% c("Test_1", "Test_2", "Test_3", "Test_4", "Test_5", "Test_6", "Test_7", "Test_8", "Test_9", "Test_10"), ]

filtered_table2$routes <- 1
filtered_table2$total_routes_in_stops <- 1
filtered_table2$total_stops_in_served_routes <- 1

colnames(RF2_table2_sf) <- colnames(filtered_table2)  # align column names so the test stops can be row-bound below

table_test <- rbind(RF2_table2_sf, filtered_table2)
# Make Predictions with the New Scenario Data
predictions_add_stops <- predict(RF_model3, newdata = table_test)

# Add Predictions to the Data
table_test$Predicted_Avg_ridership <- predictions_add_stops

# Add Monthly Prediction Column
table_test <- table_test %>%
  mutate(Monthly_prediction = Predicted_Avg_ridership * Number_of_Days)

# Convert Data to an sf Object
table_test_sf <- st_as_sf(table_test, geom = "geometry")


# Convert Predicted Avg Ridership to Numeric
table_test_sf$Monthly_prediction <- as.numeric(table_test_sf$Monthly_prediction)

result <- table_test_sf[table_test_sf$Monthly_prediction <= 10, ]
result <- result[result$Month == 'January', ]

filtered_table2_sf <- st_as_sf(filtered_table2, geom = "geometry")

# Plot the Map with Correct Fill Scale
ggplot(result) +
  geom_sf(data = allTracts_sf, fill = "grey90", alpha = 0.8 , color = NA)+
#  geom_sf(data = dat2022,aes(fill = (PopDensity_km2)), alpha = 0.5) +
#  scale_fill_gradient(low = "white", high = "grey20") + 
  geom_sf(data = bus_routes, color = "darkgrey", size = 5, alpha = 0.5) +
  geom_sf(data = filtered_table2_sf, fill = "red", alpha = 0.8) +
  geom_sf(aes(color = Monthly_prediction), size = 1) +
  scale_color_gradient(low = "#FFBD37", high = "#005357") +
  mapTheme() +
  labs(title = "Predicted Avg Ridership under 10 in January: Scenario IV", color = "Predicted Avg Ridership with \n10 New Added Stops")+
  theme(legend.position = "bottom",
    plot.title = element_text(size = 10)  # Adjust size here to your preference
  )

7. Web Application

We developed a web application for CATS transportation planners to help them understand the current status of the transit system and predict future changes. Specifically, the app focuses on forecasting passenger numbers while accounting for demographic shifts, changes in public amenities, and the addition of new stops or routes. This tool gives planners insights they can use to make informed decisions, optimize resources, and enhance the overall efficiency and effectiveness of the transportation network in Charlotte.
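
As an illustration of how such a dashboard could wire user inputs to the model (a minimal sketch under our own assumptions, not the app's actual source code, which is available at the link below), a Shiny skeleton might expose a population-density multiplier and re-run the run_scenario helper sketched in Section 6 whenever the input changes; all object names here are carried over from the analysis above and are assumptions in this context.

# Minimal Shiny sketch (illustrative only; assumes RF_model3, RF2_table_sf,
# and the run_scenario() helper sketched in Section 6 are available)
library(shiny)
library(ggplot2)

ui <- fluidPage(
  sliderInput("dens_mult", "Population density multiplier",
              min = 1, max = 1.5, value = 1.15, step = 0.05),
  plotOutput("underperforming_map")
)

server <- function(input, output, session) {
  output$underperforming_map <- renderPlot({
    # re-run the scenario with the user-selected density multiplier
    result <- run_scenario(RF2_table_sf, RF_model3,
                           modify = function(d) dplyr::mutate(d, PopDensity_km2 = PopDensity_km2 * input$dens_mult))
    ggplot(result) +
      geom_sf(aes(color = Monthly_prediction), size = 1) +
      scale_color_gradient(low = "#FFBD37", high = "#005357") +
      labs(title = "Predicted Avg Ridership under 10 in January (scenario)") +
      theme(legend.position = "bottom")
  })
}

# shinyApp(ui, server)  # launch locally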

Link to our web app: https://roshiniganesh.shinyapps.io/MUSA-Charlotte-ShinyApp/