This project is part of the MUSA 801 course in the Master of Urban Spatial Analytics program at the University of Pennsylvania, supervised by Ken Steif, Michael Fichman, and Matthew Harris. Many thanks to them for generously providing insightful suggestions over the past five months. Special thanks to Jonathan Hartman from Capital Metro, who provided the ridership data this project uses, and to Kimberly Feldbauer from AECOM's Austin office, who provided insights on transit planning in Austin.

Visit the MUSA 801 website for more projects from this course: https://pennmusa.github.io/MUSA_801.io/

1. Introduction

1.1 Background

Ridership is a critically important measure of transit system performance and a key input to transportation planning. Even in the challenging times brought by the pandemic, when declining ridership has become a nightmare for every transit agency in the US, it remains critical to understand transit ridership in relation to the built environment, land use, demographics, and system dynamics. Austin's bus ridership data, collected by Automatic Passenger Counter (APC) sensors, offers a great chance for planners to assess system performance and back up decisions with predictions. The following image shows how the data is collected on each bus (source).

1.2 Use Case

Austin is a vibrantly growing city full of development opportunities, and its demand for bus service keeps changing as the city develops. In June 2018, Austin rolled out a large-scale bus system redesign, Cap Remap, which adjusted the transit network according to internal analysis and community outreach and aims to provide a more frequent, more reliable, and better-connected bus system. This redesign offers an opportunity to understand what factors influence bus ridership.

Given the renewed interest in bus transit in US cities such as Austin, there is an opportunity to streamline the bus planning process with modern data science methods. Currently, cities have to gather information on land use, the built environment, demographics, and more from different sources to understand how bus ridership may change in the future. This process is time-consuming and labor-intensive, and cities often have to outsource the analysis to third parties, which inevitably raises project costs. The goal of this project, therefore, is to present a scenario planning tool that lets planners test how changes in local land uses and bus route characteristics predict bus ridership. If such a predictive model proves robust, planners can use it to evaluate a range of possible futures involving different land use developments and bus route changes, and make strategic decisions for Austin efficiently.

1.3 What Data Are We Using?

The ridership data collected by the APC sensors is aggregated into a monthly average for each stop. This serves as the dependent variable in this project.

The dataset also provides information on route characteristics, such as route types and high-ridership lines, which we call hotlines.

From multiple open data platforms, we also retrieved built environment data, such as building areas and amenities.

Finally, the US Census provides comprehensive demographic data, such as population, vehicle ownership, and median income.

2. Exploratory Analysis

Before diving into model building, it is crucial to have a good grasp of the characteristics and anatomy of bus ridership in Austin in order to construct a ridership prediction application that is useful to planners. This section investigates Austin's APC ridership data and answers the following questions: How did ridership change before and after the large-scale system redesign, Cap Remap (06/03/2018)? How does ridership vary across the city? What types of route characteristics influence ridership? What are the popular bus routes in Austin, and what are their attributes?

2.1 Annual Citywide Ridership Trend

How did ridership change before and after Cap Remap (06/03/2018)? Did it increase after the redesign?

Currently available data from Capital Metro allows us to observe the trend in ridership before and after Cap Remap. The first important step of the exploratory analysis is to examine the citywide change in ridership brought by Cap Remap. To capture the overall trend, we used the monthly ridership data to create a system-wide regression discontinuity chart.

The x-axis represents months (June is marked as 0, since the redesign happened then), and the y-axis represents average daily ridership in the given month. Color distinguishes the periods before and after the redesign, and line type represents years. Comparing January through June (-6 to 0 in the chart) of 2018 and 2019, it is clear that system-wide ridership generally increased after the redesign. In terms of short-term effects, however, ridership actually declined in June and July of 2018, right after the redesign. This decrease is reasonable, as passengers normally need some time to adjust to a new schedule.
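
A minimal sketch of how such a chart can be constructed, assuming a monthly summary table ridership_monthly with columns YEAR_ID, MONTH_ID, and avg_daily_on (names hypothetical) and the plotTheme() helper from the Appendix:

# Hypothetical sketch: center months on the June 2018 redesign
ridership_monthly %>%
  mutate(month_rel = MONTH_ID - 6,                       # June = 0
         period = ifelse(YEAR_ID > 2018 | (YEAR_ID == 2018 & MONTH_ID >= 6),
                         "After Remap", "Before Remap")) %>%
  ggplot(aes(x = month_rel, y = avg_daily_on,
             color = period, linetype = factor(YEAR_ID))) +
  geom_line() +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Months relative to Cap Remap (June = 0)",
       y = "Average daily ridership") +
  plotTheme()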

2.2 Ridership Pattern in Subdivisions

After establishing the citywide trend, the next question is how ridership changed across the city: which areas experienced ridership increases and which experienced decreases. We first look at ridership patterns by a general typology to understand the broad trend, and then use Austin's neighborhoods to show the spatial pattern.

We defined the general typology according to two major factors in transportation planning: economic activity and user groups. In general, the downtown district has higher economic activity and provides many jobs and business opportunities, while the university attracts students and staff who commute every day. Thus, we separated the city into three major regions: the Downtown Central Business District (CBD), UT Austin (UT), and the rest of Austin (Rest). The map below shows the location of each typology. The bar plot below shows the 2019 average ridership of each region: UT and the CBD have higher ridership than the rest of the city. UT's ridership is particularly high, which suggests there are many transit users among students, faculty, and university staff.

Then we looked at ridership by neighborhood. The following maps show the average daily ridership in each neighborhood on a given weekday. As expected, the downtown and UT Austin neighborhoods (in the middle of the map) have the highest ridership, shown in dark blue. Several neighborhoods on the north and southeast sides of Austin also have high ridership, while the west side and the outskirts of Austin have lower ridership. This is one outcome of a bus system that mainly focuses on north-south connections.

We then created charts showing the ridership change in each neighborhood between June and September 2018. Twelve neighborhoods experienced a ridership decrease from June to September, while several neighborhoods gained more than 10,000 riders over the same period. Generally, most neighborhoods experienced a ridership increase after Cap Remap. Among the 78 neighborhoods in Austin, we identified three that represent distinct patterns: a neighborhood with an expected ridership increase, a neighborhood with an unexpected ridership increase, and a neighborhood with an unexpected ridership decrease.

Among all the neighborhoods, UT is the one with an expected ridership increase. The UT neighborhood sits just above the downtown neighborhood. With many university students living in the area, the bus network is sensitive to the school schedule, and there is a relatively clear trend of ridership changing with school seasons.

The second neighborhood, Govalle, experienced an unexpected ridership increase: after Cap Remap, ridership in Govalle increased by roughly 50% to 75%. Govalle is located on the east side of Austin and has been gentrifying in recent years, with artists converting industrial spaces in the south of the neighborhood into studios and exhibition spaces. This growing popularity, combined with the bus redesign, helps explain the ridership increase.

There are also neighborhoods experiencing ridership decreases along east-west routes. Zilker is located on the southwest side of Austin's downtown region. In Zilker, however, ridership seems largely unaffected by the redesign, remaining stable before and after. This stability is not surprising, since Zilker lies just south of downtown Austin within walkable distance. It is a neighborhood with many parks, green spaces, and leisure destinations, so demand for bus service is not as strong as in other neighborhoods.

2.3 Route Analysis

2.3.1 Route Type

What could potentially influence ridership in terms of route information?

There are several route types, each serving a different purpose. Our hypothesis is that route type plays an important role in determining ridership.

The Austin bus system comprises nine types of routes; the graphs below show the six main types. Capital Metro is a regional transit agency whose service area extends beyond the City of Austin, but since our analysis and model building focus on Austin, the basemap below outlines only the City of Austin.

Regarding route types, the characteristics are listed below:

  • Local: Capital Metro’s Local routes are intended to connect specific neighborhoods of Austin to Downtown Austin, with frequent stops.

  • MetroRapid: Capital Metro’s MetroRapid routes are a bus rapid transit-style service serving high-traffic corridors, running every 15 minutes on weekdays and every 10 minutes at rush hour.

  • UT Shuttle: The UT Shuttle system includes a number of routes during the University of Texas semester. They do not operate on Saturdays, except during finals.

  • Crosstown: Capital Metro’s Crosstown routes are local services between two neighborhoods of Austin, for which the route does not pass through Downtown Austin or the University of Texas.

  • Limited & Flyer: Capital Metro’s Limited and Flyer routes are limited stop services between two destinations. Limited routes tend to have fewer stops compared to their local counterparts, while Flyer routes serve nonstop between downtown or the UT campus and their neighborhoods of service.

  • Feeder: Capital Metro’s Feeder routes are local services between a neighborhood and a major transfer point for connecting service.

2.4 What Other Characteristics May Be Important for Predicting Ridership?

Besides the route features internal to the bus system, there are clearly other features of the built environment itself that determine whether an area has high ridership. In this section, three types of features are explored to further reveal ridership patterns in Austin.

2.4.1 Amenities

Amenities in a city are key nodes such as schools, stadiums, and supermarkets. These nodes are destinations that travelers head to, so they can influence variation in bus ridership. The amenities we investigate here are offices and schools: since the ridership data is the average daily ridership on a given weekday, offices and schools, as popular destinations for commuters, are worth investigating. As the following maps indicate, most offices cluster in the downtown area, while schools are scattered across every neighborhood. Compared with the ridership map, offices appear likely to have a large impact on ridership on the north side of Austin, across the Colorado River. Because schools are fairly evenly distributed among the neighborhoods, it is hard to tell which areas they affect more.

2.4.2 Built Environment

Many factors in the built environment could have a strong impact on bus ridership. Land use is no doubt an important feature to consider, according to transit planning experience. To reflect the impact of land use on ridership, two types of land use are selected: commercial and civic. Our initial priority was commercial and residential, but residential land use is so widespread in Austin that the map cannot reveal any pattern; the final land use types are therefore commercial and civic. According to the following maps, commercial land use mostly lines the main corridors in Austin. Although we expected to see more commercial land use concentrated downtown, it is actually scattered. Comparing the commercial land use map with the ridership map, it is clear that every high-ridership neighborhood has relatively more commercial land use. For civic land use, some parcels are noticeably large compared to commercial ones: in the University of Texas neighborhood, where civic land use occupies a large share of the space, ridership is very high.

2.4.3 Demographics

Lastly, demographic variables are the final set of features we take into account. Basic characteristics of an area, such as race, income, and age, might influence ridership. Here we use vehicle ownership, more specifically households without a vehicle, as an exploratory feature. Since demographic data comes at the census tract level, it reveals more detail than the neighborhood ridership map. In most high-ridership neighborhoods, there are census tracts where more households own no vehicle. This underscores the value of an application that lets planners see how ridership changes with demographic features, which can improve resource allocation with social equity in mind.

3. Modeling

3.1 Strategies

We will create a machine learning model that predicts ridership at each stop. This model will allow planners to test different scenarios in which large developments, land use changes, or route frequency changes could substantially affect system ridership. The ultimate goal of the modeling section is to build and choose an accurate and generalizable model. Accuracy is critical because we want the model to produce predictions with as little error as possible, so it can effectively tell planners how ridership will change. Generalizability refers to the model's performance across different groups, such as spatial areas like downtown, UT, and the rest of the city. The ambition is a generalizable model with similar errors or percentage errors across groups, so the predictions do not lead to severe mistakes for certain groups.
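
Concretely, accuracy is measured with mean absolute error (MAE) and mean absolute percentage error (MAPE), which can be written as simple helpers (the yardstick package used in the Appendix provides equivalent mae_vec() and mape_vec() functions):

# MAE and MAPE as plain R functions
mae_metric  <- function(actual, pred) mean(abs(actual - pred), na.rm = TRUE)
mape_metric <- function(actual, pred) mean(abs((actual - pred) / actual), na.rm = TRUE) * 100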

To make the prediction model more accurate and generalizable, we examine how linear, lasso and ridge regression, random forest, and XGBoost models capture the variability in our dataset. We start with feature engineering, which consists of five major categories: amenities, built environment, demographics, route network, and stop characteristics (internal data). The hypothesis is that these five categories influence ridership at each stop in different ways. The dependent variable is the average ridership for each stop in 2019; we use 2019 data because we want to focus on the period after Cap Remap, and our feature engineering aligns best with that year. In the modeling process, we split 75% of the data into a training set and put the rest into a test set. The model is trained on the training set with cross-validation based on the typology division, and then evaluated on the test set.
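
A minimal sketch of this setup with the tidymodels stack, assuming a feature table model_df holding the dependent variable mean_on and a typology column (names hypothetical):

library(tidymodels)

# Hypothetical sketch: 75/25 split, cross-validation folds grouped by typology
set.seed(801)
data_split <- initial_split(model_df, prop = 0.75)
train_df   <- training(data_split)
test_df    <- testing(data_split)
folds      <- group_vfold_cv(train_df, group = typology)

# One of the four candidate models; lm, glmnet and xgboost follow the same pattern
rf_spec <- rand_forest(mode = "regression") %>% set_engine("ranger")
rf_res  <- fit_resamples(rf_spec, mean_on ~ ., resamples = folds,
                         metrics = metric_set(yardstick::mae, yardstick::mape))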

3.2 Feature Engineering

The feature table below lists the five types of features and the source of each. All data comes from the following sources: APC aggregated and disaggregated data, Capital Metro, OpenStreetMap (OSM), Open Data Austin, and the American Community Survey (ACS).

In the amenity category, amenity locations are collected from OSM; examples include stadiums, supermarkets, offices, and train stations. We created buffers around each stop at 1/2 mile, 1/4 mile, and 1/8 mile, and counted each type of amenity within the buffer. To capture distance effects between stops and amenities, we also calculated the distance from each stop to its three nearest amenities.
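
A minimal sketch of the nearest-amenity distance feature, assuming stops and an amenity layer (such as the OSM office points built in the Appendix) are sf objects in the same projected CRS:

# Hypothetical sketch: mean distance from each stop to its 3 nearest amenities
nn_dist3 <- function(stops, amenity) {
  d <- st_distance(stops, amenity)                # stop-by-amenity distance matrix
  apply(d, 1, function(row) mean(sort(as.numeric(row))[1:3]))
}
stops$officeDist <- nn_dist3(stops, office)       # feet under EPSG:2278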

In the built environment category, land use types, building area, a neighborhood fixed effect, and a school district fixed effect are included. Some features, such as neighborhood and school district, are spatially joined to the stops. For others, such as land use and building area, the percentage of each land use type and the total building area within the buffer are calculated. Note that all three buffer sizes were tested here to capture more variation in the dataset.
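
A minimal sketch of the land use share computation, assuming a polygon layer landuse with a (hypothetical) type column land_use and the stop buffers StopBuff used in the Appendix:

# Hypothetical sketch: percent commercial land use within each stop buffer
lu_share <- st_intersection(StopBuff, landuse) %>%
  mutate(piece_area = as.numeric(st_area(.))) %>%
  st_drop_geometry() %>%
  group_by(STOP_ID) %>%
  summarize(pct_commercial = sum(piece_area[land_use == "Commercial"]) /
                             sum(piece_area))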

In the demographics category, data on population, median income, and vehicle ownership is collected. We used areal weighted interpolation to join the weighted census estimates within each buffer to the corresponding stop.

For route information, the percentage of each route type passing through each bus stop is calculated, as is the number of shifts, or buses, passing through each stop in a given week. We also included route membership as features: for feature "r10", for example, we labeled all stops on route 10 as 1 and all other stops as 0, repeating this for each route.
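
A minimal sketch of the route dummy construction, assuming a long table stop_routes with one row per stop-route pair (names hypothetical):

# Hypothetical sketch: one-hot route membership, e.g. r10 = 1 if a stop is on route 10
route_dummies <- stop_routes %>%
  distinct(STOP_ID, ROUTE) %>%
  mutate(on_route = 1, ROUTE = paste0("r", ROUTE)) %>%
  tidyr::pivot_wider(names_from = ROUTE, values_from = on_route, values_fill = 0)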

For the internal (stop characteristics) category, we first added a transit hub dummy defined by Capital Metro, and then calculated the spatial lag: the average ridership of the surrounding stops.
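
A minimal sketch of the spatial lag feature, assuming stops_sf is an sf point layer carrying the 2019 mean ridership mean_on:

# Hypothetical sketch: spatial lag = mean ridership of the k nearest stops
library(FNN)
coords <- st_coordinates(stops_sf)
nn5 <- FNN::get.knn(coords, k = 5)$nn.index       # 5 nearest stops (self excluded)
stops_sf$spatial_lag5 <- apply(nn5, 1, function(i) mean(stops_sf$mean_on[i]))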

A series of analyses was conducted to see which features are important. We first looked at the correlation between each feature and the dependent variable, the mean ridership in 2019. Below are selected features that correlate most strongly with ridership, positively and negatively. We found that route information, amenity distances, and stop characteristics often correlate highly with ridership.

A correlation matrix was made to check for potential collinearity among the features and with the dependent variable. We found that certain features correlate highly, such as the route direction features SouthNorth and WestEast, the land use features commercial and residential, and the amenity features distance to the CBD and distance to train stations. It is important to account for these variables when using features in the model.

3.3 Results and Validation

As shown in the exploratory analysis, neighborhood has a large impact on ridership; similarly, a typology division of downtown and UT versus the rest of Austin captures much of the difference in ridership patterns, owing to the city's built environment and school schedules. Thus, in the model validation section, neighborhoods and the city typology (downtown, UT Austin, and the rest) are used to test the model's generalizability.

With the features created in the five categories, four types of models are built: a simple linear regression model (lm in the following visualizations); a lasso and ridge regression model (glmnet); a random forest model (rf); and an XGBoost model (xgb). They are tested and validated through a generalizability test to see which model fits best.

Apart from the original 1/4-mile buffer, two more buffer sizes were created during feature engineering. The dataset corresponding to each buffer size is tested and validated through the generalizability test, and the best buffer size is selected.
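
A minimal sketch of the buffer construction (EPSG:2278 is measured in US survey feet, so a 1/4 mile is 1,320 ft):

# Hypothetical sketch: the three buffer sizes tested around each stop
StopBuff_eighth  <- st_buffer(stops, dist = 660)    # 1/8 mile
StopBuff_quarter <- st_buffer(stops, dist = 1320)   # 1/4 mile
StopBuff_half    <- st_buffer(stops, dist = 2640)   # 1/2 mile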

Selecting the Best Model

We tested the generalizability of the four models by holding out each neighborhood in turn, training the model on the remaining data, and calculating the prediction error on the held-out neighborhood.
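
A minimal sketch of this leave-one-neighborhood-out loop, assuming the feature table model_df carries the neighborhood name in label (as in the validation code in the Appendix):

# Hypothetical sketch: hold out each neighborhood, train on the rest, record MAPE
logo_mape <- sapply(unique(model_df$label), function(hood) {
  train <- dplyr::filter(model_df, label != hood)
  test  <- dplyr::filter(model_df, label == hood)
  fit   <- lm(mean_on ~ ., data = dplyr::select(train, -label))  # or rf/xgb
  pred  <- predict(fit, newdata = test)
  mean(abs(pred - test$mean_on) / test$mean_on, na.rm = TRUE) * 100
})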

The MAPE and MAE of the four models reveal that, generally speaking, the random forest and XGBoost models are the most accurate, while the lasso and ridge model predicts less accurately than the other three. The lasso and ridge regression has the largest MAPE and MAE, while the random forest model has the smallest.

In terms of predicted versus actual values, the simple linear regression predictions are close to the actual values but have large variance. The lasso and ridge model generally overpredicts ridership. The random forest model tends to overpredict when actual ridership exceeds 500. The XGBoost model generally predicts ridership well, with relatively small variance.

To test the generalizability of the models across neighborhoods, maps of MAPE are plotted. The models' accuracy is lower (high MAPE) in the neighborhoods around the University of Texas and a couple of neighborhoods on the outskirts of Austin. Comparing the range and distribution of MAPE, we found that the XGBoost model performs best, with small MAPE and variance.

Besides selecting the best model type, the best buffer size was also tested among 1/2 mile, 1/4 mile, and 1/8 mile; we settled on 1/4 mile. For details on selecting the buffer size, please see the Appendix.

4. Application

4.1 Why Scenario Planning?

As mentioned at the beginning, the current workflow for redesigning a bus system lacks efficiency: most of the analytical and modeling work on ridership is outsourced, and it is hard for in-house planners to test different planning scenarios on the fly. This inspired us to create an application that helps planners better understand ridership change under different scenarios. In addition, scenario planning gives planners a tool to think holistically about the future: it lets them investigate a number of future possibilities without being bogged down by how likely those scenarios are to happen. Such a procedure offers planners the systems thinking that reduces future risks and captures potential opportunities.

4.2 Three Different Scenarios for Ridership Change

According to an economic growth study by the Austin Chamber of Commerce, Austin's tech industry grew 2% faster than the average growth rate across all industries in the Austin area from 2017 to 2018, and has grown by nearly 25% over the past five years. The Austin area has a long history in the tech industry, dating back to the establishment of Dell in 1984. In recent years, the amicable business environment and vibrant arts and cultural scene have attracted tech giants, including Google, Facebook, Amazon, and Oracle, to open new campuses on the outskirts of Austin. For instance, Apple just announced a $1 billion new campus in North Austin, adding more than 15,000 workers on top of its existing 7,000. With outer Austin increasingly resembling a second Silicon Valley, the following three scenarios select Feeder routes, the ones running between neighborhoods, transit centers, and Park & Rides, as the targets for ridership study, on the premise that new tech employees are likely to use these Feeder routes to commute.

4.2.1 Scenario 1 - Bus Frequency and Ridership

In this scenario, we investigated the relationship between bus frequency and ridership. The frequency of all Feeder routes was increased to make them high-frequency routes (headways of less than 15 minutes). Overall ridership shows the largest increase among all the tested scenarios, with more than 1,600 additional average riders each month. The majority of stops would experience a ridership increase, while a few stops on the east side of the city show a decrease.
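
In model terms, this scenario edits the frequency-related features for Feeder stops and re-predicts; a minimal sketch, assuming the final feature table all_x4_normalize from the Appendix and a fitted tidymodels workflow best_fit (name hypothetical):

# Hypothetical sketch: mark every Feeder stop as high-frequency, then re-predict
scenario1 <- all_x4_normalize %>%
  mutate(HighFreq_cat = ifelse(Feeder_cat != "0", "1", HighFreq_cat))
ridership_delta <- predict(best_fit, new_data = scenario1)$.pred -
                   predict(best_fit, new_data = all_x4_normalize)$.pred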

4.2.2 Scenario 2 - New Real Estate Development and Ridership

In this scenario, the areas around the stops of all Feeder routes are given 400,000 additional square feet of building area. This scenario represents a situation in which new, higher-density development comes into a neighborhood. Given Austin's rapid growth and the real estate projects in the pipeline, it is very likely that the outskirts of the city will become denser.
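
A minimal sketch of this scenario, assuming best_fit as above and a total building area feature build_area (name hypothetical; see build_dens in the Appendix):

# Hypothetical sketch: add 400,000 sq ft of building area around Feeder stops
scenario2 <- all_x4_normalize %>%
  mutate(build_area = ifelse(Feeder_cat != "0", build_area + 400000, build_area))
pred2 <- predict(best_fit, new_data = scenario2)$.pred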

Similar to the increased-frequency scenario, the change in ridership does not seem to follow a pattern, despite the uniform change in building area. Overall ridership increases slightly, by 206 monthly riders. It is worth further investigation that most terminal stops experience a ridership increase.

4.2.3 Scenario 3 - Land Use Update and Ridership

The driving rationale behind this scenario is the relationship between land use policy and ridership change. The scenario assumes that the land use composition around the stops of Feeder routes increases by 10% in commercial and decreases by 10% in residential. This is the only scenario among the three that shows a decrease in total ridership across the entire city, with average monthly ridership reduced by 143 riders. Zooming in to the ridership changes of individual stops, there are stops with increases and decreases on every single route. We suspect the overall decrease shows users the influence that land use type, in this case commercial versus residential, has on bus ridership. Yet it is also clear from our model that bus ridership in Austin is shaped by a series of complex, interweaving factors.
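
A minimal sketch of this scenario, assuming land use share features pct_commercial and pct_residential (names hypothetical):

# Hypothetical sketch: shift 10% of land use share from residential to commercial
scenario3 <- all_x4_normalize %>%
  mutate(pct_commercial  = ifelse(Feeder_cat != "0",
                                  pct_commercial + 0.10, pct_commercial),
         pct_residential = ifelse(Feeder_cat != "0",
                                  pct_residential - 0.10, pct_residential))
pred3 <- predict(best_fit, new_data = scenario3)$.pred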

4.3 Web Application

We developed a web application according to the use case mentioned above: helping planners understand the transit system and envision potential changes to transit ridership with respect to changes in frequency, building area, and land use. The following screenshots demonstrate the prototype. The first four show how users can investigate current ridership patterns, including ridership by typology, by hotline, and by stop.

Click the link to try ridership prediction yourself: https://hanyongxu.com/rmd/austin/index.html

The screenshot below demonstrates how the application predicts ridership in a given scenario. Although the app currently covers only the three scenarios discussed above, the model and the application could be further developed to handle many more scenarios for planners to use.

5. Appendix

Setup

######### Set Up Functions and Plotting Options ######### 
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    # keep unrounded values, formatted to 3 significant digits
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

#plotTheme <- theme(
#  plot.title =element_text(size=12),
#  plot.subtitle = element_text(size=8),
#  plot.caption = element_text(size = 6),
#  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
#  axis.text.y = element_text(size = 10),
#  axis.title.y = element_text(size = 10),
#  panel.background=element_blank(),
#  plot.background=element_blank(),
#  axis.ticks=element_blank())

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=1.5),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

Background

Data Structure

#turn dataframe into spatial object
agg_sf <- agg%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)%>%
  st_transform(2278)

disagg_sf <- disagg%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)%>%
  st_transform(2278)

# We use aggregated data to look at the average ridership on weekdays at individual stops
ggplot()+
  geom_sf(data = subset(serviceArea,NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA)))+
  geom_sf(data = subset(agg_after_sf, STOP_ID == 476), aes(color = "Stop 476"), size = 2, show.legend = "point")+
  scale_colour_manual(values = c("Stop 476" = "darkorange"),
                      guide = guide_legend("Aggregated Data Example"))+
  labs(title = "Aggregated Data Structure",
       subtitle = "Data from Capital Metro")+
  ggrepel::geom_label_repel(
    data = subset(agg_after_sf, STOP_ID == 476),aes(label = "Average Ridership = 33 \n Average Passing Buses = 55", geometry = geometry),
    stat = "sf_coordinates",
    min.segment.length = 3)+mapTheme()

# We use disaggregated data to investigate the average ridership on weekdays on different routes.
disagg_803 <- subset(disagg_sf, ROUTE == 803)%>%
  group_by(STOP_ID)%>%
  summarize(avg_on = mean(PSGR_ON),
            avg_load = mean(PSGR_LOAD))
ggplot()+
  geom_sf(data = subset(serviceArea,NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA)))+
  geom_sf(data = disagg_803, aes(color = "Stops on Route 803"), size = 2, show.legend = "point")+
  scale_colour_manual(values = c("Stops on Route 803" = "darkorange"),
                      guide = guide_legend("Disggregated Data Example"))+
  labs(title = "Disaggregated Data Structure",
       subtitle = "Data from Capital Metro")+
  geom_label_repel(
    data = subset(disagg_803, STOP_ID == 2606),aes(label = "Average On-board Passengers of Stop 2606 = 11 \n Route Type = Metro Rapid", geometry = geometry),
    stat = "sf_coordinates",
    min.segment.length = 0,
    segment.color = "lightgrey",
    point.padding = 20)+mapTheme()

Route Type Changes

# Crosstown
crosstown <-ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Crosstown"), color = "greenyellow",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Crosstown"), color = "greenyellow",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Crosstown Routes Before and After Cap Remap")+mapTheme()

# Feeder
feeder <-ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Feeder"), color = "lightcoral",lwd = 0.8, show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Feeder"), color = "lightcoral",lwd = 0.8, show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Feeder Routes Before and After Cap Remap")+mapTheme()


# Flyer
flyer <- ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Flyer"), color = "magenta2",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Flyer"), color = "magenta2",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Flyer Routes Before and After Cap Remap")+mapTheme()

# Express
express <-ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Express"), color = "red",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Express"), color = "red",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Express Routes Before and After Cap Remap")+mapTheme()

# Special
special <- ggplot()+
  geom_sf(data = subset(serviceArea, NAME == "Austin"), aes(fill = "Austin"))+
  scale_fill_manual(values = c("Service Areas" = "gray25", "Austin" = "black"), name = NULL,
                    guide = guide_legend("Jurisdictions", override.aes = list(linetype = "blank", shape = NA))) +
  geom_sf(data = subset(Routes1801, ROUTETYPE == "Special"), color = "seashell2",lwd = 0.8,show.legend = FALSE)+
  geom_sf(data = subset(Routes2001, ROUTETYPE == "Special"), color = "seashell2",lwd = 0.8,show.legend = FALSE)+
  facet_grid(~capremap)+
  labs(title = "Speical Routes Before and After Cap Remap")+mapTheme()

# minor changes grid arrange

grid.arrange(crosstown, feeder, flyer, express, ncol =2)

Exploratory Analysis

Ridership Typology

#create stop shapefile
agg_sf <- agg%>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)%>%
  st_transform(2278)

agg_sf19 <- agg_sf%>%
  filter(YEAR_ID == 2019)%>%
  group_by(STOP_ID)%>%
  summarize(avg_on = mean(AVERAGE_ON))

#Read UT and CBD shapefiles
UT <- st_read("D:/Spring20/Practicum/data/UTAustin/UT.shp")%>%
  st_transform(2278)
CBD <- st_read("D:/Spring20/Practicum/data/CBD/CBD.shp")%>%
  st_transform(2278)

#create shapefile of area outside of CBD
nhood_CBD <- st_difference(nhood_merge, CBD)

#st_difference didn't work for UT; created in ArcMap instead
nhood_UT <- st_read("D:/Spring20/Practicum/data/nhood_UT.shp")%>%
  st_as_sf()%>%
  st_transform(2278)

#Create CBD typology
agg_sf19_CBD <- st_join(CBD, agg_sf19, join = st_contains)%>%
  mutate(typology = "CBD")

agg_sf19_oCBD <- st_join(nhood_CBD, agg_sf19, join = st_contains)%>%
  mutate(typology = "oCBD")%>%
  rename(geometry = x)

agg_sf19_oCBD <- agg_sf19_oCBD%>%
  group_by(Id)%>%
  summarize(avg_on = mean(avg_on))%>%
  mutate(label = "The Rest of Austin")

agg_CBD_typology <- rbind(agg_sf19_CBD,agg_sf19_oCBD)

#Create UT typology
agg_sf19_UT <- st_join(UT, agg_sf19, join = st_contains)%>%
  mutate(typology = "UT")

agg_sf19_UT <- agg_sf19_UT%>%
  group_by(Id)%>%
  summarize(avg_on = mean(na.omit(avg_on)))%>%
  mutate(label = "UT Austin")

agg_sf19_oUT <- st_join(nhood_UT, agg_sf19, join = st_contains)%>%
  mutate(typology = "oUT")%>%
  select(Id,
         STOP_ID,
         avg_on,
         typology,
         geometry)

agg_sf19_oUT <- agg_sf19_oUT%>%
  group_by(Id)%>%
  summarize(avg_on = mean(na.omit(avg_on)))%>%
  mutate(label = "The Rest of Austin")

agg_UT_typology <- rbind(agg_sf19_UT,agg_sf19_oUT)

How did ridership change before and after Cap Remap (06/03/2018)?

Ridership Change in Different Neighborhoods in Austin in 2018

Identify the Hotlines

First, let us look at the Kmeans analysis before Cap Remap. We grouped the disaggregated data by route and calculated the max and mean number of passengers on board at each stop, the average miles traveled and average hours spent per passenger at each stop, and the total run length and total run time of the route.

Then, we apply the Kmeans analysis. The number of clusters is determined by both the elbow chart and the 26 criteria provided by the NbClust package. For more information, see the appendix.

We applied the same analysis to the disaggregated dataset after Cap Remap.

These clustering labels are joined back to the original dataset. For more about the clustering results, please see the appendix.

Find the number of Kmeans clusters before and after Cap Remap:

Both the elbow chart and the 26 indices provided by the NbClust package are used to check how many clusters should be used in the Kmeans analysis.

Before CapRemap:

After CapRemap:

In either case, it is evident that the optimal number of clusters for the Kmeans analysis is 3. We then conducted the Kmeans analysis with 3 clusters, as mentioned in the exploratory analysis section.

Here are the Kmeans results for before and after Cap Remap. The numbers are the averages of each feature used in the Kmeans analysis. Cluster 2, both before and after the remapping, has the highest average ridership and run times, as well as the smallest size. We conclude that these are the most popular routes and define them as ‘hotlines’.

routeplot1 <- function(n,p,p1,d) {
  # line n before map
  t1 = ggplot() +
  geom_sf(data = rivers, color = "grey90", fill = "grey90")+
  geom_sf(data = streets %>% filter(road_class %in% c(1,2)), color = "grey90")+
  geom_sf(data = nhood, color = 'grey100',fill = 'grey90') +
  geom_sf(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN== p) %>%
            st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
            st_transform(2278) %>%
            group_by(STOP_ID) %>%
            summarise(mean_stop_load = mean(PSGR_LOAD),size = 0.8), 
          aes(color = mean_stop_load))+
  scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25)) +
  coord_sf(xlim = c(2310627, 2390089), ylim = c(13980004, 14059914)) + 
  labs(title=paste("Line",n,"Direction 1, Before CapRemap"),
  subtitle = "Average Number of\nPassengers at Each Stop")+mapTheme()
  
  #line n before passenger load chart
  t11 = ggplot(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_path(aes(x = STOP_SEQ_ID, y = mean_load, 
                size = mean_load, color = mean_load), lineend="round",linejoin="mitre")+
    scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25))+
    scale_size_continuous()+
    ylim(0, 23) +
    labs(subtitle=paste("Average Passenger Load"))+plotTheme()+ 
    theme(legend.position="none")
  
  #line n before passenger boarding and alighting
  t12 = ggplot(data = disagn1j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_on), fill="#9999CC", alpha="0.25") +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_off), fill="#9999CC", alpha="0.25") +
    geom_line(aes(x = STOP_SEQ_ID, y = mean_on, color = "Average Boarding"), size=1) + 
    geom_line(aes(x = STOP_SEQ_ID, y = mean_off, color = "Average Alighting"), size=1)+ 
    ylim(0, 10) +
    labs(subtitle=paste("Average Boarding/Alighting"))+plotTheme()+ 
    theme(legend.position="bottom", legend.box = "horizontal")
  
  # line n after map
  t2 = ggplot() +
  geom_sf(data = rivers, color = "grey90", fill = "grey90")+
  geom_sf(data = streets %>% filter(road_class %in% c(1,2)), color = "grey90")+
  geom_sf(data = nhood, color = 'grey100',fill = 'grey90') +
  geom_sf(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN== p1) %>%
            st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326, agr = "constant") %>%
            st_transform(2278) %>%
            group_by(STOP_ID) %>%
            summarise(mean_stop_load = mean(PSGR_LOAD),size = 0.8), 
          aes(color = mean_stop_load))+
  scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25)) +
  coord_sf(xlim = c(2310627, 2390089), ylim = c(13980004, 14059914)) + 
  labs(title=paste("Line",n,"Direction 1, After CapRemap"),
  subtitle = "Average Number of\nPassengers at Each Stop")+mapTheme()
  
  #line n after passenger load chart
  t21 = ggplot(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p1) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_path(aes(x = STOP_SEQ_ID, y = mean_load, 
                size = mean_load, color = mean_load), lineend="round",linejoin="mitre")+
    scale_color_gradientn(colors = c("#0c2c84","#41b6c4", "#ffffcc"), limits = c(0,25), 
                       breaks = c(0, 5, 10, 15, 20, 25))+
    scale_size_continuous()+
    ylim(0, 23) +
    labs(subtitle=paste("Average Passenger Load"))+plotTheme()+ 
    theme(legend.position="none")
  
  #line n after passenger boarding and alighting
  t22 = ggplot(data = disagn2j %>% filter(ROUTE == n & DIRECTION ==d & PATTERN == p1) %>%
         group_by(STOP_SEQ_ID) %>%
         summarise(mean_on = mean(PSGR_ON), mean_off = mean(PSGR_OFF), 
                   mean_load = mean(PSGR_LOAD))) +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_on), fill="#9999CC", alpha="0.25") +
    geom_ribbon(aes(x = STOP_SEQ_ID,ymin=0,ymax=mean_off), fill="#9999CC", alpha="0.25") +
    geom_line(aes(x = STOP_SEQ_ID, y = mean_on, color = "Average Boarding"), size=1) + 
    geom_line(aes(x = STOP_SEQ_ID, y = mean_off, color = "Average Alighting"), size=1)+ 
    ylim(0, 10) +
    labs(subtitle=paste("Average Boarding/Alighting"))+plotTheme()+ 
    theme(legend.position="bottom", legend.box = "horizontal")
  
  grid.arrange(arrangeGrob(t1, t11, t12, ncol = 1, nrow = 3),
               arrangeGrob(t2, t21, t22, ncol = 1, nrow = 3),ncol=2)
}

Features That Influenced Ridership in Austin

Feature Engineering

Amenities (use buffer)

Open Street Map (OSM) Amenity Counts

######### Get OSM Data #########
getOSM <- function(key,value){
  feature <- opq(bbox = 'Austin, Texas')%>%
    add_osm_feature(key = key, value = value) %>%
    osmdata_sf ()
  if(is.null(feature$osm_points)){
    feature_poly <- feature$osm_polygons%>%
      select(osm_id,geometry)%>%
      st_as_sf(coords = geometry, crs = 4326, agr = "constant")%>%
      st_transform(2278)
    return(feature_poly)
  } else {
  feature_pt <- feature$osm_points%>%
    select(osm_id,geometry)%>%
    st_as_sf(coords = geometry, crs = 4326, agr = "constant")%>%
    st_transform(2278)
  return (feature_pt)
  }
}

#commercial
commercial <- getOSM('building', 'commercial')
#retail
retail <- getOSM('building', 'retail')
#supermarket
supermkt <- getOSM('building', 'supermarket')
#office
office <- getOSM('building', 'office')
#residential
residential <- getOSM('building','residential')
#bar
bar <- getOSM('amenity', 'bar')
#school
school <- getOSM('amenity', 'school')
#uni
university <- getOSM('amenity', 'university')
#parking
parking <- getOSM('amenity', 'parking')
#statium
stadium <- getOSM('building', 'stadium')
#trainstation
trainstation <- getOSM('building', 'train_station')

######### spatial join #########
bufferInit <- function(Buffer, Points, Name){
  if(inherits(Points$geometry, "sfc_POINT")){
  Init <- st_join(Buffer%>% select(STOP_ID), Points, join = st_contains)%>%
  group_by(STOP_ID)%>%
    summarize(count = n())%>%
    rename(!!Name := count)
  }else {
    Init <- st_join(Buffer%>% select(STOP_ID), Points, join = st_intersects)%>%
      group_by(STOP_ID)%>%
      summarize(count = n())%>%
      rename(!!Name := count)
  }
}

Amenity Distance

Built Environments

Demographics

######### census #########
options(tigris_use_cache = TRUE)
v17 <- load_variables(2017, "acs5", cache = TRUE)

Hays <- get_acs(state = "48", county = "209", geography = "tract", 
                variables = "B01001_001", geometry = TRUE)
Travis <- get_acs(state = "48", county = "453", geography = "tract", 
                  variables = "B01001_001", geometry = TRUE)
Williamson <- get_acs(state = "48", county = "491", geography = "tract", 
                      variables = "B01001_001", geometry = TRUE) 

Travis_race <- get_acs(state = "48", county = "453", geography = "tract", 
                       variables = "B02001_002", geometry = TRUE)
Williamson_race <- get_acs(state = "48", county = "491", geography = "tract", 
                           variables = "B02001_002", geometry = TRUE) 

Travis_noveh <- get_acs(state = "48", county = "453", geography = "tract", 
                        variables = "B08014_002", geometry = TRUE)
Williamson_noveh <- get_acs(state = "48", county = "491", geography = "tract", 
                            variables = "B08014_002", geometry = TRUE)

Travis_oneveh <- get_acs(state = "48", county = "453", geography = "tract", 
                        variables = "B08014_003", geometry = TRUE)
Williamson_oneveh <- get_acs(state = "48", county = "491", geography = "tract", 
                            variables = "B08014_003", geometry = TRUE)

Travis_twoveh <- get_acs(state = "48", county = "453", geography = "tract", 
                         variables = "B08014_004", geometry = TRUE)
Williamson_twoveh <- get_acs(state = "48", county = "491", geography = "tract", 
                             variables = "B08014_004", geometry = TRUE)

Travis_threeveh <- get_acs(state = "48", county = "453", geography = "tract", 
                         variables = "B08014_005", geometry = TRUE)
Williamson_threeveh <- get_acs(state = "48", county = "491", geography = "tract", 
                             variables = "B08014_005", geometry = TRUE)

Travis_fourveh <- get_acs(state = "48", county = "453", geography = "tract", 
                           variables = "B08014_006", geometry = TRUE)
Williamson_fourveh <- get_acs(state = "48", county = "491", geography = "tract", 
                               variables = "B08014_006", geometry = TRUE)

Travis_fiveveh <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B08014_007", geometry = TRUE)
Williamson_fiveveh <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B08014_007", geometry = TRUE)

Travis_poverty <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B06012_002", geometry = TRUE)
Williamson_poverty <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B06012_002", geometry = TRUE)

Travis_medInc <- get_acs(state = "48", county = "453", geography = "tract", 
                          variables = "B19013_001", geometry = TRUE)
Williamson_medInc <- get_acs(state = "48", county = "491", geography = "tract", 
                              variables = "B19013_001", geometry = TRUE)
######### buffer demographics #########
#population
Population <- rbind(Travis, Williamson)%>%
  st_transform(2278)
Population_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Population, sid = GEOID, weight = "sum",
                                  output = "sf", extensive = "estimate")
Population_buff$estimate<- round(Population_buff$estimate)

#race
Race <- rbind(Travis_race, Williamson_race)%>%
  st_transform(2278)
Race_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Race, sid = GEOID, weight = "sum",
                            output = "sf", extensive = "estimate")
Race_buff$estimate <- round(Race_buff$estimate)

#vehicle ownership
NoVeh <- rbind(Travis_noveh, Williamson_noveh)%>%
  st_transform(2278)
NoVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = NoVeh, sid = GEOID, weight = "sum",
                             output = "sf", extensive = "estimate")
NoVeh_buff$estimate <- round(NoVeh_buff$estimate)


OneVeh <- rbind(Travis_oneveh, Williamson_oneveh)%>%
  st_transform(2278)
OneVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = OneVeh, sid = GEOID, weight = "sum",
                              output = "sf", extensive = "estimate")
OneVeh_buff$estimate <- round(OneVeh_buff$estimate)


TwoVeh <- rbind(Travis_twoveh, Williamson_twoveh)%>%
  st_transform(2278)
TwoVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = TwoVeh, sid = GEOID, weight = "sum",
                              output = "sf", extensive = "estimate")
TwoVeh_buff$estimate <- round(TwoVeh_buff$estimate)


ThreeVeh <- rbind(Travis_threeveh, Williamson_threeveh)%>%
  st_transform(2278)
ThreeVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = ThreeVeh, sid = GEOID, weight = "sum",
                                output = "sf", extensive = "estimate")
ThreeVeh_buff$estimate <- round(ThreeVeh_buff$estimate)


FourVeh <- rbind(Travis_fourveh, Williamson_fourveh)%>%
  st_transform(2278)
FourVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = FourVeh, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
FourVeh_buff$estimate <- round(FourVeh_buff$estimate)


FiveVeh <- rbind(Travis_fiveveh, Williamson_fiveveh)%>%
  st_transform(2278)

FiveVeh_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = FiveVeh, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
FiveVeh_buff$estimate <- round(FiveVeh_buff$estimate)


#poverty
Poverty <- rbind(Travis_poverty, Williamson_poverty)%>%
  st_transform(2278)
Poverty_buff <- aw_interpolate(StopBuff, tid = STOP_ID, source = Poverty, sid = GEOID, weight = "sum",
                               output = "sf", extensive = "estimate")
Poverty_buff$estimate <- round(Poverty_buff$estimate)


#MedInc
medInc <- rbind(Travis_medInc, Williamson_medInc)%>%
  st_transform(2278)
medInc_stop <- st_join(stops, medInc, join = st_intersects)
medInc_stop <- medInc_stop %>%
  st_drop_geometry() %>%
  select(STOP_ID, estimate) %>%
  rename(medInc = estimate)

Route Network

We use the datasets after Cap Remap.

Join All Features

all_x1 <- CommercialInit %>%  #amenities and route related
  left_join(st_drop_geometry(RetailInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(OfficeInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ResidentialInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(SupermktInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(BarInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(UniInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ParkingInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(SchoolInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(StationInit), by = "STOP_ID") %>%
  left_join(st_drop_geometry(StadiumInit), by = "STOP_ID") %>%
  left_join(stop_dir_freq, by = "STOP_ID") %>%
  left_join(stop_type_freq, by = "STOP_ID") %>%
  left_join(stop_hot_freq, by = "STOP_ID") %>%
  left_join(build_dens, by = "STOP_ID") %>%
  left_join(st_drop_geometry(stop_buff_landuse), by = "STOP_ID") %>%
  left_join(st_drop_geometry(Race_buff) %>% rename(race = estimate) %>% select(STOP_ID, race) %>% mutate(STOP_ID = as.numeric(STOP_ID), race = as.numeric(race)), by = "STOP_ID") %>% #census data
  left_join(st_drop_geometry(Population_buff) %>%  rename(population = estimate) %>% select(STOP_ID, population) %>% mutate(STOP_ID = as.numeric(STOP_ID), population = as.numeric(population)), by = "STOP_ID") %>% 
  left_join(st_drop_geometry(NoVeh_buff) %>%  rename(NoVeh = estimate) %>% select(STOP_ID, NoVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), NoVeh = as.numeric(NoVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(OneVeh_buff) %>%  rename(OneVeh = estimate) %>% select(STOP_ID, OneVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), OneVeh = as.numeric(OneVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(TwoVeh_buff) %>%  rename(TwoVeh = estimate) %>% select(STOP_ID, TwoVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), TwoVeh = as.numeric(TwoVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(ThreeVeh_buff) %>%  rename(ThreeVeh = estimate) %>% select(STOP_ID, ThreeVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), ThreeVeh = as.numeric(ThreeVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(FourVeh_buff) %>%  rename(FourVeh = estimate) %>% select(STOP_ID, FourVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), FourVeh = as.numeric(FourVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(FiveVeh_buff) %>%  rename(FiveVeh = estimate) %>% select(STOP_ID, FiveVeh) %>% mutate(STOP_ID = as.numeric(STOP_ID), FiveVeh = as.numeric(FiveVeh)), by = "STOP_ID") %>%
  left_join(st_drop_geometry(Poverty_buff) %>%  rename(Poverty = estimate) %>% select(STOP_ID, Poverty) %>% mutate(STOP_ID = as.numeric(STOP_ID), Poverty = as.numeric(Poverty)), by = "STOP_ID") %>%
  left_join(medInc_stop, by= "STOP_ID") %>%
  left_join(st_drop_geometry(stop_nhood), by = "STOP_ID") %>% # fixed effects
  left_join(st_drop_geometry(stop_school), by = "STOP_ID") %>%
  select(-c(hotline_0)) %>%
  left_join(data.2019.mean, by = "STOP_ID") %>%
  left_join(morning, by = "STOP_ID")%>%
  #left_join(afternoon, by = "STOP_ID")%>%
  left_join(evening, by = "STOP_ID") %>%
  left_join(route_fix, by = c("STOP_ID" = "r")) %>%
  left_join(bus_count, by = "STOP_ID")

#spatial lag, knn dist
all_x3 = bind_cols(list(all_x1, utaustinDist, CBDDist, commercialDist, retailDist, supermktDist, officeDist, residentialDist, schoolDist, universityDist, parkingDist, stadiumDist, trainstationDist, airportDist, spatial_lag %>% select(spatial_lag2, spatial_lag3, spatial_lag5)))
#recategorize variables
all_x4_normalize <-
  all_x3 %>% 
  mutate(Clockwise_cat = case_when(
      Clockwise == 0 ~ "0",
      Clockwise == 1 ~ "1",
      Clockwise > 0 & Clockwise <1 ~ "others"),
    Counterclockwise_cat = case_when(
      Counterclockwise == 0 ~ "0",
      Counterclockwise == 1 ~ "1",
      Counterclockwise > 0 & Counterclockwise <1 ~ "others"),
    Crosstown_cat = case_when(
      Crosstown == 0 ~ "0",
      Crosstown == 1 ~ "1",
      Crosstown > 0 & Crosstown <1 ~ "others"),
    Express_cat = case_when(
      Express == 0 ~ "0",
      Express == 1 ~ "1",
      Express > 0 & Express <1 ~ "others"),
    Feeder_cat = case_when(
      Feeder == 0 ~ "0",
      Feeder == 1 ~ "1",
      Feeder > 0 & Feeder <1 ~ "others"),
    Flyer_cat = case_when(
      Flyer == 0 ~ "0",
      Flyer == 1 ~ "1",
      Flyer > 0 & Flyer <1 ~ "others"),
    HighFreq_cat = case_when(
      `High Frequency` == 0 ~ "0",
      `High Frequency` == 1 ~ "1",
      `High Frequency` > 0 & `High Frequency` <1 ~ "others"),
    hotline_cat = case_when(
      hotline_1 == 0 ~ "0",
      hotline_1 == 1 ~ "1",
      hotline_1 > 0 & hotline_1 <1 ~ "others"),
    InOut_cat = case_when(
      InOut == 0 ~ "0",
      InOut == 1 ~ "1",
      InOut > 0 & InOut <1 ~ "others"),
    Local_cat = case_when(
      Local == 0 ~ "0",
      Local == 1 ~ "1",
      Local > 0 & Local <1 ~ "others"),
    NightOwl_cat = case_when(
      `Night Owl` == 0 ~ "0",
      `Night Owl` == 1 ~ "1",
      `Night Owl` > 0 & `Night Owl` <1 ~ "others"),
    SN_cat = case_when(
      SouthNorth == 0 ~ "0",
      SouthNorth == 1 ~ "1",
      SouthNorth > 0 & SouthNorth <1 ~ "others"),
    Special_cat = case_when(
      Special == 0 ~ "0",
      Special == 1 ~ "1",
      Special > 0 & Special <1 ~ "others"),
    utshuttle_cat = case_when(
      `UT Shuttle` == 0 ~ "0",
      `UT Shuttle` == 1 ~ "1",
      `UT Shuttle` > 0 & `UT Shuttle` <1 ~ "others"),
    WE_cat = case_when(
      WestEast == 0 ~ "0",
      WestEast == 1 ~ "1",
      WestEast > 0 & WestEast <1 ~ "others"))

Feature Exploration

Change the features and see how the predicted ridership responds; a sketch of this scenario-testing workflow follows the scenario list below.

Building area SCENARIO 1:

Building area SCENARIO 2:

Building area SCENARIO 3:
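Below is a minimal sketch of how one of these scenario tests might run. The workflow and data names (rf_best_wf.quarter, data.quarter) come from the 1/4-mile modeling section later in this document, and BldArea_sum is a hypothetical column standing in for the actual building-area feature in the assembled stop-level data.

#A minimal sketch of a building-area scenario test (column name assumed)
library(workflows)
library(dplyr)

rf_fit <- fit(rf_best_wf.quarter, data = data.quarter) #refit the finalized workflow

#Scenario: 20% more building area within each stop buffer
scenario1 <- data.quarter %>%
  mutate(BldArea_sum = BldArea_sum * 1.2) #hypothetical building-area column

#The model predicts log(mean_on), so exponentiate back to boardings
baseline_pred <- exp(predict(rf_fit, new_data = data.quarter)$.pred)
scenario_pred <- exp(predict(rf_fit, new_data = scenario1)$.pred)
summary(scenario_pred - baseline_pred) #predicted change in boardings per stop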

Modeling and Validation

Using Neighborhoods for Validation with Four Types of Models
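The idea, sketched minimally below (assuming a stop-level training frame bus_train with a neighborhood column label), is that each cross-validation fold holds out entire neighborhoods, so every model is scored on places it has never seen.

#Each fold leaves out whole neighborhoods (grouped by "label")
library(rsample)
cv_splits_geo <- group_vfold_cv(bus_train, strata = "mean_on", group = "label")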

###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title= "MAPE of each model on the testing set")
  theme_bw()
#MAE chart
ggplot(data = val_preds %>% 
           dplyr::select(model, MAE) %>% 
           distinct() , 
         aes(x = model, y = MAE, group = 1)) +
    geom_path(color = "blue") +
    geom_label(aes(label = paste0(round(MAE,1)))) +
    labs(title= "MAE of each model on the testing set")
  theme_bw()  
  
#Predicted vs Observed
ggplot(val_preds, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title="Predicted vs Observed on the testing set", subtitle= "blue line is predicted value") 
  theme_bw()

#Neighborhood validation
val_MAPE_by_hood <- val_preds %>% 
  group_by(label, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each neighborhood
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")


val_MAPE_by_hood %>%
  dplyr::select(label, model, MAE) %>%
  gather(Variable, MAE, -model, -label) %>%
  ggplot(aes(label, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  scale_fill_manual(values = "Blues") +
  facet_wrap(~label,scales="free", ncol=6)+
  labs(title = "Mean Absolute Errors by model specification and neighborhood") +
  plotTheme()

#Map of MAE in each neighborhood
#Add geometry to the MAE
MAE.nhood <- merge(nhood, val_MAPE_by_hood, by.x="label", by.y="label", all.y=TRUE)

#Produce the map

#Map: MAPE of lm
MAE.nhood%>%
  filter(model=="lm") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of lm in Neighborhoods") +
  mapTheme()

#Map: MAPE of glmnet
MAE.nhood%>%
  filter(model=="glmnet") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of glmnet in Neighborhoods") +
  mapTheme()
#MAPE of rf
MAE.nhood%>%
  filter(model=="rf") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of rf in Neighborhoods") +
  mapTheme()

#MAPE of xgb
MAE.nhood%>%
  filter(model=="xgb") %>%
  ggplot() +
  #    geom_sf(data = nhoods, fill = "grey40") +
  geom_sf(aes(fill = q5(MAPE))) +
  scale_fill_brewer(palette = "Blues",
                    aesthetics = "fill",
                    labels=qBr(MAE.nhood,"MAPE"),
                    name="Quintile\nBreaks, (%)") +
  labs(title="MAPE of xgb in Neighborhoods") +
  mapTheme()

Testing different buffer size for model accuracy and generalizability

Buffer size: 1/2 mile

#1/2-mile buffer size with typology test
data.half <- plyr::join(all_half, typology, type = "left")
data.half$STOP_ID <- NULL
data.half <- data.half %>%
  drop_na()
data.half$universityDist1 <- NULL
#Split the data into training and testing sets
data_split.half <- rsample::initial_split(data.half, strata = "mean_on", prop = 0.75)

bus_train.half <- rsample::training(data_split.half)
bus_test.half  <- rsample::testing(data_split.half)
names(bus_train.half)


cv_splits_geo.half <- rsample::group_vfold_cv(bus_train.half,  strata = "mean_on", group = "typology")

#Create recipe
model_rec.half <- recipe(mean_on ~ ., data = bus_train.half) %>% #the "." means use every variable in the training set
  update_role(typology, new_role = "typology") %>% #keep typology out of the predictors while retaining it for grouping
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) #%>% #put on standard deviation scale
#step_ns(Latitude, Longitude, options = list(df = 4))
model_rec.half

#Build the model
lm_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_engine("lm")

glmnet_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_args(penalty  = tune()) %>%
  parsnip::set_args(mixture  = tune()) %>%
  parsnip::set_engine("glmnet")

rf_plan <- parsnip::rand_forest() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 1000) %>% 
  parsnip::set_engine("ranger", importance = "impurity") %>% 
  parsnip::set_mode("regression")

XGB_plan <- parsnip::boost_tree() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 100) %>% 
  parsnip::set_engine("xgboost") %>% 
  parsnip::set_mode("regression")

#Hyperparameter tuning grids
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25), 
                           mixture = seq(0,1,0.25))

rf_grid <- expand.grid(mtry = c(2,5), 
                       min_n = c(1,5))
xgb_grid <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))
#Create workflow
lm_wf.half <-
  workflows::workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(lm_plan)

glmnet_wf.half <-
  workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(glmnet_plan)

rf_wf.half <-
  workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(rf_plan)
xgb_wf.half <-
  workflow() %>% 
  add_recipe(model_rec.half) %>% 
  add_model(XGB_plan)
# fit model to workflow and calculate metrics
library(tune)
library(yardstick)
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)

lm_tuned.half <- lm_wf.half %>%
  fit_resamples(.,
                resamples = cv_splits_geo.half,
                control   = control,
                metrics   = metric_set(rmse, rsq))
glmnet_tuned.half <- glmnet_wf.half %>%
  tune_grid(.,
            resamples = cv_splits_geo.half,
            grid      = glmnet_grid,
            control   = control,
            metrics   = metric_set(rmse, rsq))

rf_tuned.half <- rf_wf.half %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.half,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

xgb_tuned.half <- xgb_wf.half %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.half,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

show_best(lm_tuned.half, metric = "rmse", n = 15, maximize = FALSE)
show_best(glmnet_tuned.half, metric = "rmse", n = 15, maximize = FALSE)
show_best(rf_tuned.half, metric = "rmse", n = 15, maximize = FALSE)
show_best(xgb_tuned.half, metric = "rmse", n = 15, maximize = FALSE)

lm_best_params.half     <- select_best(lm_tuned.half, metric = "rmse", maximize = FALSE)
glmnet_best_params.half <- select_best(glmnet_tuned.half, metric = "rmse", maximize = FALSE)
rf_best_params.half     <- select_best(rf_tuned.half, metric = "rmse", maximize = FALSE)
xgb_best_params.half    <- select_best(xgb_tuned.half, metric = "rmse", maximize = FALSE)
#Final workflow
lm_best_wf.half     <- finalize_workflow(lm_wf.half, lm_best_params.half)
glmnet_best_wf.half <- finalize_workflow(glmnet_wf.half, glmnet_best_params.half)
rf_best_wf.half     <- finalize_workflow(rf_wf.half, rf_best_params.half)
xgb_best_wf.half    <- finalize_workflow(xgb_wf.half, xgb_best_params.half)

lm_val_fit_geo.half <- lm_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))
glmnet_val_fit_geo.half <- glmnet_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))

rf_val_fit_geo.half <- rf_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))

xgb_val_fit_geo.half <- xgb_best_wf.half %>% 
  last_fit(split     = data_split.half,
           control   = control,
           metrics   = metric_set(rmse, rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds.half <- collect_predictions(lm_tuned.half) 
glmnet_best_OOF_preds.half <- collect_predictions(glmnet_tuned.half) %>% 
  filter(penalty  == glmnet_best_params.half$penalty[1] & mixture == glmnet_best_params.half$mixture[1])
rf_best_OOF_preds.half <- collect_predictions(rf_tuned.half) %>% 
  filter(mtry  == rf_best_params.half$mtry[1] & min_n == rf_best_params.half$min_n[1])

xgb_best_OOF_preds.half <- collect_predictions(xgb_tuned.half) %>% 
  filter(mtry  == xgb_best_params.half$mtry[1] & min_n == xgb_best_params.half$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo.half     <- collect_predictions(lm_val_fit_geo.half)
glmnet_val_pred_geo.half <- collect_predictions(glmnet_val_fit_geo.half)
rf_val_pred_geo.half     <- collect_predictions(rf_val_fit_geo.half)
xgb_val_pred_geo.half    <- collect_predictions(xgb_val_fit_geo.half)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds.half <- rbind(data.frame(dplyr::select(lm_best_OOF_preds.half, .pred, mean_on), model = "lm"),
                           data.frame(dplyr::select(glmnet_best_OOF_preds.half, .pred, mean_on), model = "glmnet"),
                           data.frame(dplyr::select(rf_best_OOF_preds.half, .pred, mean_on), model = "RF"),
                           data.frame(dplyr::select(xgb_best_OOF_preds.half, .pred, mean_on), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         #RSQUARE = yardstick::rsq(mean_on, .pred),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         #SD_RMSE = sd(yardstick::rmse_vec(mean_on, .pred)),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         #SD_MAE = sd(yardstick::mae_vec(mean_on, .pred)),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  #SD_MAPE = sd(yardstick::mape_vec(mean_on, .pred))) %>% 
  ungroup()


# Aggregate predictions from Validation set
library(tidyverse)
library(yardstick)
#lm_val_pred_geo
val_preds.half <- rbind(data.frame(lm_val_pred_geo.half, model = "lm"),
                   data.frame(glmnet_val_pred_geo.half, model = "glmnet"),
                   data.frame(rf_val_pred_geo.half, model = "rf"),
                   data.frame(xgb_val_pred_geo.half, model = "xgb")) %>% 
  left_join(., data.half %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  dplyr::group_by(model) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()


#Per-model metrics on the testing set
val_preds.half %>%
  dplyr::select(model, MAE, MAPE, RMSE) %>%
  distinct()
#Rsquared, MAE, MAPE and RMSE (with spreads) on the testing set, per model
val_metrics <- function(df) {
  truth <- exp(df$mean_on) #back-transform from the log scale used in the recipe
  est   <- exp(df$.pred)
  err   <- abs(truth - est)
  ape   <- err / truth
  c(RSQ     = yardstick::rsq_vec(truth, est),
    MAE     = mean(err), MAE_sd  = sd(err),
    MAPE    = mean(ape), MAPE_sd = sd(ape),
    RMSE    = sqrt(mean(err^2)))
}
val_metrics(lm_val_pred_geo.half)
val_metrics(glmnet_val_pred_geo.half)
val_metrics(rf_val_pred_geo.half)
val_metrics(xgb_val_pred_geo.half)
###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds.half %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title= "1/2-mi Buffer, MAPE of each model on the testing set with typology")
theme_bw()
#MAE chart
ggplot(data = val_preds.half%>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAE,1)))) +
  labs(title= "1/2 mi Buffer, MAE of each model on the testing set with typology")
theme_bw()  
#RMSE
ggplot(data = val_preds.half%>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  labs(title= "1/2 mi Buffer, RMSE of each model on the testing set with typology")
theme_bw() 
#Predicted vs Observed
ggplot(val_preds.half, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title="1/2 Mile: Predicted vs Observed on the testing set", subtitle= "blue line is predicted value") 
theme_bw()

#Neighborhood validation
val_MAPE_by_typology.half <- val_preds.half %>% 
  group_by(typology, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each typology
plotTheme <- function(base_size = 10) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 20,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=20),
    axis.title = element_text(size=20),
    axis.text = element_text(size=15),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic", size= 20),
    legend.text = element_text(colour = "black", face = "italic",size = 20),
    strip.text.x = element_text(size = 15)
  )
}
palette4 <- c("#eff3ff", "#bdd7e7","#6baed6","#2171b5")


as.data.frame(val_MAPE_by_typology.half)%>%
  dplyr::select(typology, model, MAE) %>%
  #gather(Variable, MAE, -model, -typology) %>%
  ggplot(aes(typology, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  ylim(0, 300)+
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scale= "free", ncol=4)+
  labs(title = "1/2 mile: Mean Absolute Errors by model specification") +
  plotTheme()

Buffer size: 1/4 mile

#1/4 mile buffer with ridership: data.quarter
#typology
sum(is.na(All_3))
sum(is.na(all_2))
All_3$typology
summary(all_2$parkingDist)
typology<- All_3 %>%
  dplyr::select(STOP_ID, typology)
typology$typology <- ifelse(typology$typology == "CBD" , 'CBD',
                                ifelse(typology$typology == "UT", 'UT',
                                       ifelse(typology$typology == "UT&CBD", 'CBD', 'Rest')))

#write.csv(typology, "C:/Upenn/Practicum/Data/Typology_withSTOP_ID.csv")
names(all_2)
data.quarter <- plyr::join(all_2, typology, type ="left")
data.quarter$STOP_ID <- NULL

data.quarter<-data.quarter %>%
  drop_na()
data.quarter$universityDist1<-NULL
#Split the data into training and testing sets
data_split.quarter <- rsample::initial_split(data.quarter, strata = "mean_on", prop = 0.75)

bus_train.quarter <- rsample::training(data_split.quarter)
bus_test.quarter  <- rsample::testing(data_split.quarter)
names(bus_train.quarter)


cv_splits_geo.quarter <- rsample::group_vfold_cv(bus_train.quarter,  strata = "mean_on", group = "typology")
print(cv_splits_geo.quarter)

#Create recipe
model_rec.quarter <- recipe(mean_on ~ ., data = bus_train.quarter) %>% #the "." means use every variable in the training set
  update_role(typology, new_role = "typology") %>% #keep typology out of the predictors while retaining it for grouping
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) #%>% #put on standard deviation scale
#step_ns(Latitude, Longitude, options = list(df = 4))
model_rec.quarter

#Build the model
lm_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_engine("lm")

glmnet_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_args(penalty  = tune()) %>%
  parsnip::set_args(mixture  = tune()) %>%
  parsnip::set_engine("glmnet")

rf_plan <- parsnip::rand_forest() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 1000) %>% 
  parsnip::set_engine("ranger", importance = "impurity") %>% 
  parsnip::set_mode("regression")

XGB_plan <- parsnip::boost_tree() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 100) %>% 
  parsnip::set_engine("xgboost") %>% 
  parsnip::set_mode("regression")

#Hyperparameter tuning grids
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25), 
                           mixture = seq(0,1,0.25))

rf_grid <- expand.grid(mtry = c(2,5), 
                       min_n = c(1,5))
xgb_grid <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))
#Create workflow
lm_wf.quarter <-
  workflows::workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(lm_plan)

glmnet_wf.quarter <-
  workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(glmnet_plan)

rf_wf.quarter <-
  workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(rf_plan)
xgb_wf.quarter <-
  workflow() %>% 
  add_recipe(model_rec.quarter) %>% 
  add_model(XGB_plan)
# fit model to workflow and calculate metrics
library(tune)
library(yardstick)
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)

lm_tuned.quarter <- lm_wf.quarter %>%
  fit_resamples(.,
                resamples = cv_splits_geo.quarter,
                control   = control,
                metrics   = metric_set(rmse, rsq))
glmnet_tuned.quarter <- glmnet_wf.quarter %>%
  tune_grid(.,
            resamples = cv_splits_geo.quarter,
            grid      = glmnet_grid,
            control   = control,
            metrics   = metric_set(rmse, rsq))

rf_tuned.quarter <- rf_wf.quarter %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.quarter,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

xgb_tuned.quarter <- xgb_wf.quarter %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.quarter,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

show_best(lm_tuned.quarter, metric = "rmse", n = 15, maximize = FALSE)
show_best(glmnet_tuned.quarter, metric = "rmse", n = 15, maximize = FALSE)
show_best(rf_tuned.quarter, metric = "rmse", n = 15, maximize = FALSE)
show_best(xgb_tuned.quarter, metric = "rmse", n = 15, maximize = FALSE)

lm_best_params.quarter     <- select_best(lm_tuned.quarter, metric = "rmse", maximize = FALSE)
glmnet_best_params.quarter <- select_best(glmnet_tuned.quarter, metric = "rmse", maximize = FALSE)
rf_best_params.quarter     <- select_best(rf_tuned.quarter, metric = "rmse", maximize = FALSE)
xgb_best_params.quarter    <- select_best(xgb_tuned.quarter, metric = "rmse", maximize = FALSE)
#Final workflow
lm_best_wf.quarter     <- finalize_workflow(lm_wf.quarter, lm_best_params.quarter)
glmnet_best_wf.quarter <- finalize_workflow(glmnet_wf.quarter, glmnet_best_params.quarter)
rf_best_wf.quarter     <- finalize_workflow(rf_wf.quarter, rf_best_params.quarter)
xgb_best_wf.quarter    <- finalize_workflow(xgb_wf.quarter, xgb_best_params.quarter)

lm_val_fit_geo.quarter <- lm_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))
glmnet_val_fit_geo.quarter <- glmnet_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))

rf_val_fit_geo.quarter <- rf_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))

xgb_val_fit_geo.quarter <- xgb_best_wf.quarter %>% 
  last_fit(split     = data_split.quarter,
           control   = control,
           metrics   = metric_set(rmse, rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds.quarter <- collect_predictions(lm_tuned.quarter) 
glmnet_best_OOF_preds.quarter <- collect_predictions(glmnet_tuned.quarter) %>% 
  filter(penalty  == glmnet_best_params.quarter$penalty[1] & mixture == glmnet_best_params.quarter$mixture[1])
rf_best_OOF_preds.quarter <- collect_predictions(rf_tuned.quarter) %>% 
  filter(mtry  == rf_best_params.quarter$mtry[1] & min_n == rf_best_params.quarter$min_n[1])

xgb_best_OOF_preds.quarter <- collect_predictions(xgb_tuned.quarter) %>% 
  filter(mtry  == xgb_best_params.quarter$mtry[1] & min_n == xgb_best_params.quarter$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo.quarter     <- collect_predictions(lm_val_fit_geo.quarter)
glmnet_val_pred_geo.quarter <- collect_predictions(glmnet_val_fit_geo.quarter)
rf_val_pred_geo.quarter     <- collect_predictions(rf_val_fit_geo.quarter)
xgb_val_pred_geo.quarter    <- collect_predictions(xgb_val_fit_geo.quarter)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
lm_best_OOF_preds.quarter$mean_on <- as.numeric(lm_best_OOF_preds.quarter$mean_on)
glmnet_best_OOF_preds.quarter$mean_on <- as.numeric(glmnet_best_OOF_preds.quarter$mean_on)
rf_best_OOF_preds.quarter$mean_on <- as.numeric(rf_best_OOF_preds.quarter$mean_on)
xgb_best_OOF_preds.quarter$mean_on <- as.numeric(xgb_best_OOF_preds.quarter$mean_on)

OOF_preds.quarter <- rbind(data.frame(dplyr::select(lm_best_OOF_preds.quarter, .pred, mean_on), model = "lm"),
                   data.frame(dplyr::select(glmnet_best_OOF_preds.quarter, .pred, mean_on), model = "glmnet"),
                   data.frame(dplyr::select(rf_best_OOF_preds.quarter, .pred, mean_on), model = "RF"),
                   data.frame(dplyr::select(xgb_best_OOF_preds.quarter, .pred, mean_on), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         #RSQUARE = yardstick::rsq(mean_on, .pred),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         #SD_RMSE = sd(yardstick::rmse_vec(mean_on, .pred)),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         #SD_MAE = sd(yardstick::mae_vec(mean_on, .pred)),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
         #SD_MAPE = sd(yardstick::mape_vec(mean_on, .pred))) %>% 
  ungroup()


# Aggregate predictions from Validation set
library(tidyverse)
library(yardstick)
#lm_val_pred_geo
val_preds.quarter <- rbind(data.frame(lm_val_pred_geo.quarter, model = "lm"),
                   data.frame(glmnet_val_pred_geo.quarter, model = "glmnet"),
                   data.frame(rf_val_pred_geo.quarter, model = "rf"),
                   data.frame(xgb_val_pred_geo.quarter, model = "xgb")) %>% 
  left_join(., data.quarter %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  dplyr::group_by(model) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()


#Per-model metrics on the testing set
val_preds.quarter %>%
  dplyr::select(model, MAE, MAPE, RMSE) %>%
  distinct()
#Rsquared, MAE, MAPE and RMSE on the testing set, per model (val_metrics() as defined above)
val_metrics(lm_val_pred_geo.quarter)
val_metrics(glmnet_val_pred_geo.quarter)
val_metrics(rf_val_pred_geo.quarter)
val_metrics(xgb_val_pred_geo.quarter)
###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds.quarter %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title= "1/4-mi Buffer, MAPE of each model on the testing set with typology")
theme_bw()
#MAE chart
ggplot(data = val_preds.quarter%>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAE,1)))) +
  labs(title= "1/4 mi Buffer, MAE of each model on the testing set with typology")
theme_bw()  
#RMSE
ggplot(data = val_preds.quarter %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  labs(title= "1/4 mi Buffer, RMSE of each model on the testing set with typology")
theme_bw() 
#Predicted vs Observed
ggplot(val_preds.quarter, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title="1/4 Mile: Predicted vs Observed on the testing set", subtitle= "blue line is predicted value") 
theme_bw()

#Neighborhood validation
val_MAPE_by_typology.quarter <- val_preds.quarter %>% 
  group_by(typology, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each typology (plotTheme() and palette4 as defined above)


as.data.frame(val_MAPE_by_typology.quarter)%>%
  dplyr::select(typology, model, MAE) %>%
  #gather(Variable, MAE, -model, -typology) %>%
  ggplot(aes(typology, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  ylim(0, 300)+
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scale= "free", ncol=4)+
  labs(title = "1/4 mile: Mean Absolute Errors by model specification") +
  plotTheme()

Buffer size: 1/8 mile

#1/8-mile buffer
data.eighth <- plyr::join(all_eighth, typology, type = "left")
data.eighth$STOP_ID <- NULL

data.eighth<-data.eighth %>%
  drop_na()
data.eighth$universityDist1<-NULL
#Split the data into training and testing sets
data_split.eighth <- rsample::initial_split(data.eighth, strata = "mean_on", prop = 0.75)

bus_train.eighth <- rsample::training(data_split.eighth)
bus_test.eighth  <- rsample::testing(data_split.eighth)
names(bus_train.eighth)


cv_splits_geo.eighth <- rsample::group_vfold_cv(bus_train.eighth,  strata = "mean_on", group = "typology")
print(cv_splits_geo.eighth)

#Create recipe
model_rec.eighth <- recipe(mean_on ~ ., data = bus_train.eighth) %>% #the "." means use every variable in the training set
  update_role(typology, new_role = "typology") %>% #keep typology out of the predictors while retaining it for grouping
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) #%>% #put on standard deviation scale
#step_ns(Latitude, Longitude, options = list(df = 4))
model_rec.eighth

#Build the model
lm_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_engine("lm")

glmnet_plan <- 
  parsnip::linear_reg() %>% 
  parsnip::set_args(penalty  = tune()) %>%
  parsnip::set_args(mixture  = tune()) %>%
  parsnip::set_engine("glmnet")

rf_plan <- parsnip::rand_forest() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 1000) %>% 
  parsnip::set_engine("ranger", importance = "impurity") %>% 
  parsnip::set_mode("regression")

XGB_plan <- parsnip::boost_tree() %>%
  parsnip::set_args(mtry  = tune()) %>%
  parsnip::set_args(min_n = tune()) %>%
  parsnip::set_args(trees = 100) %>% 
  parsnip::set_engine("xgboost") %>% 
  parsnip::set_mode("regression")

#Hyperparameter tuning grids
glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25), 
                           mixture = seq(0,1,0.25))

rf_grid <- expand.grid(mtry = c(2,5), 
                       min_n = c(1,5))
xgb_grid <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))
#Create workflow
lm_wf.eighth <-
  workflows::workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(lm_plan)

glmnet_wf.eighth <-
  workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(glmnet_plan)

rf_wf.eighth <-
  workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(rf_plan)
xgb_wf.eighth <-
  workflow() %>% 
  add_recipe(model_rec.eighth) %>% 
  add_model(XGB_plan)
# fit model to workflow and calculate metrics
library(tune)
library(yardstick)
control <- tune::control_resamples(save_pred = TRUE, verbose = TRUE)

lm_tuned.eighth <- lm_wf.eighth %>%
  fit_resamples(.,
                resamples = cv_splits_geo.eighth,
                control   = control,
                metrics   = metric_set(rmse, rsq))
glmnet_tuned.eighth <- glmnet_wf.eighth %>%
  tune_grid(.,
            resamples = cv_splits_geo.eighth,
            grid      = glmnet_grid,
            control   = control,
            metrics   = metric_set(rmse, rsq))

rf_tuned.eighth <- rf_wf.eighth %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.eighth,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

xgb_tuned.eighth <- xgb_wf.eighth %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.eighth,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

show_best(lm_tuned.eighth, metric = "rmse", n = 15, maximize = FALSE)
show_best(glmnet_tuned.eighth, metric = "rmse", n = 15, maximize = FALSE)
show_best(rf_tuned.eighth, metric = "rmse", n = 15, maximize = FALSE)
show_best(xgb_tuned.eighth, metric = "rmse", n = 15, maximize = FALSE)

lm_best_params.eighth     <- select_best(lm_tuned.eighth, metric = "rmse", maximize = FALSE)
glmnet_best_params.eighth <- select_best(glmnet_tuned.eighth, metric = "rmse", maximize = FALSE)
rf_best_params.eighth     <- select_best(rf_tuned.eighth, metric = "rmse", maximize = FALSE)
xgb_best_params.eighth    <- select_best(xgb_tuned.eighth, metric = "rmse", maximize = FALSE)
#Final workflow
lm_best_wf.eighth     <- finalize_workflow(lm_wf.eighth, lm_best_params.eighth)
glmnet_best_wf.eighth <- finalize_workflow(glmnet_wf.eighth, glmnet_best_params.eighth)
rf_best_wf.eighth     <- finalize_workflow(rf_wf.eighth, rf_best_params.eighth)
xgb_best_wf.eighth    <- finalize_workflow(xgb_wf.eighth, xgb_best_params.eighth)

lm_val_fit_geo.eighth <- lm_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))
glmnet_val_fit_geo.eighth <- glmnet_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))

rf_val_fit_geo.eighth <- rf_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))

xgb_val_fit_geo.eighth <- xgb_best_wf.eighth %>% 
  last_fit(split     = data_split.eighth,
           control   = control,
           metrics   = metric_set(rmse, rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
lm_best_OOF_preds.eighth <- collect_predictions(lm_tuned.eighth) 
glmnet_best_OOF_preds.eighth <- collect_predictions(glmnet_tuned.eighth) %>% 
  filter(penalty  == glmnet_best_params.eighth$penalty[1] & mixture == glmnet_best_params.eighth$mixture[1])
rf_best_OOF_preds.eighth <- collect_predictions(rf_tuned.eighth) %>% 
  filter(mtry  == rf_best_params.eighth$mtry[1] & min_n == rf_best_params.eighth$min_n[1])

xgb_best_OOF_preds.eighth <- collect_predictions(xgb_tuned.eighth) %>% 
  filter(mtry  == xgb_best_params.eighth$mtry[1] & min_n == xgb_best_params.eighth$min_n[1])
# collect validation set predictions from last_fit model
lm_val_pred_geo.eighth     <- collect_predictions(lm_val_fit_geo.eighth)
glmnet_val_pred_geo.eighth <- collect_predictions(glmnet_val_fit_geo.eighth)
rf_val_pred_geo.eighth     <- collect_predictions(rf_val_fit_geo.eighth)
xgb_val_pred_geo.eighth    <- collect_predictions(xgb_val_fit_geo.eighth)
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds.eighth <- rbind(data.frame(dplyr::select(lm_best_OOF_preds.eighth, .pred, mean_on), model = "lm"),
                           data.frame(dplyr::select(glmnet_best_OOF_preds.eighth, .pred, mean_on), model = "glmnet"),
                           data.frame(dplyr::select(rf_best_OOF_preds.eighth, .pred, mean_on), model = "RF"),
                           data.frame(dplyr::select(xgb_best_OOF_preds.eighth, .pred, mean_on), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         #RSQUARE = yardstick::rsq(mean_on, .pred),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         #SD_RMSE = sd(yardstick::rmse_vec(mean_on, .pred)),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         #SD_MAE = sd(yardstick::mae_vec(mean_on, .pred)),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  #SD_MAPE = sd(yardstick::mape_vec(mean_on, .pred))) %>% 
  ungroup()


# Aggregate predictions from Validation set
library(tidyverse)
library(yardstick)
#lm_val_pred_geo
detach(package:plyr) #plyr masks dplyr verbs such as group_by() and summarise()
val_preds.eighth <- rbind(data.frame(lm_val_pred_geo.eighth, model = "lm"),
                   data.frame(glmnet_val_pred_geo.eighth, model = "glmnet"),
                   data.frame(rf_val_pred_geo.eighth, model = "rf"),
                   data.frame(xgb_val_pred_geo.eighth, model = "xgb")) %>% 
  left_join(., data.eighth %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  group_by(model) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = mean(abs(mean_on - .pred)),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()


lm_val_pred_geo.eighth<- lm_val_pred_geo.eighth%>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

glmnet_val_pred_geo.eighth<- glmnet_val_pred_geo.eighth%>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

rf_val_pred_geo.eighth<- rf_val_pred_geo.eighth%>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

xgb_val_pred_geo.eighth<- xgb_val_pred_geo.eighth%>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))

val_preds.eighth <- rbind(data.frame(lm_val_pred_geo.eighth, model = "lm"),
                           data.frame(glmnet_val_pred_geo.eighth, model = "glmnet"),
                           data.frame(rf_val_pred_geo.eighth, model = "rf"),
                           data.frame(xgb_val_pred_geo.eighth, model = "xgb"))%>%
  left_join(., data.eighth %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row")
summary(lm_val_pred_geo.eighth$MAE)
summary(glmnet_val_pred_geo.eighth$MAE)
summary(rf_val_pred_geo.eighth$MAE)
summary(xgb_val_pred_geo.eighth$MAE)
summary(lm_val_pred_geo.eighth$MAPE)
summary(glmnet_val_pred_geo.eighth$MAPE)
summary(rf_val_pred_geo.eighth$MAPE)
summary(xgb_val_pred_geo.eighth$MAPE)
summary(lm_val_pred_geo.eighth$RMSE)
summary(glmnet_val_pred_geo.eighth$RMSE)
summary(rf_val_pred_geo.eighth$RMSE)
summary(xgb_val_pred_geo.eighth$RMSE)

#Rsquared
rsq(lm_val_pred_geo.eighth, mean_on, .pred)
rsq(glmnet_val_pred_geo.eighth, mean_on, .pred)
rsq(rf_val_pred_geo.eighth, mean_on, .pred)
rsq(xgb_val_pred_geo.eighth, mean_on, .pred)

###################################Modelling part ends here, below are the visualizations##############
#MAPE chart
ggplot(data = val_preds.eighth %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title= "1/8-mi Buffer, MAPE of each model on the testing set with typology")
theme_bw()
#MAE chart
ggplot(data = val_preds.eighth%>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(MAE,1)))) +
  labs(title= "1/8 mi Buffer, MAE of each model on the testing set with typology")
theme_bw()  
#RMSE
ggplot(data = val_preds.eighth %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "blue") +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  labs(title= "1/8 mi Buffer, RMSE of each model on the testing set with typology")
theme_bw() 
#Predicted vs Observed
ggplot(val_preds.eighth, aes(x =.pred, y = mean_on, group = model)) +
  geom_point() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", color = "blue") +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title="1/8 Mile: Predicted vs Observed on the testing set", subtitle= "blue line is predicted value") 
theme_bw()

val_MAPE_by_typology.eighth <- val_preds.eighth %>% 
  group_by(typology, model) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAE in each typology (plotTheme() and palette4 as defined above)


as.data.frame(val_MAPE_by_typology.eighth)%>%
  dplyr::select(typology, model, MAE) %>%
  #gather(Variable, MAE, -model, -typology) %>%
  ggplot(aes(typology, MAE)) + 
  geom_bar(aes(fill = model), position = "dodge", stat="identity") +
  ylim(0, 300)+
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scale= "free", ncol=4)+
  labs(title = "1/8 mile: Mean Absolute Errors by model specification") +
  plotTheme()

Selecting the Best Buffer Size

It is hard to decide a priori which buffer size captures the most variation in the data. Because bus stops are relatively densely distributed in Austin, we drew on common walkable distances, starting from 1/2 mile and gradually shrinking the radius to 1/4 mile and then 1/8 mile. By comparing the R-squared, RMSE, MAE and MAPE of each model at each buffer size, and then testing the generalizability of the models, we can settle on the best buffer size; a sketch of the comparison follows.
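As a sketch, assuming the val_preds.half, val_preds.quarter and val_preds.eighth tables built in the sections above, the per-model testing-set metrics for all three buffer sizes can be stacked into one comparison table:

#Stack the per-model testing-set metrics from each buffer size
library(dplyr)
buffer_comparison <- bind_rows(
  val_preds.half    %>% mutate(buffer = "1/2 mile"),
  val_preds.quarter %>% mutate(buffer = "1/4 mile"),
  val_preds.eighth  %>% mutate(buffer = "1/8 mile")) %>%
  dplyr::select(buffer, model, RMSE, MAE, MAPE) %>%
  distinct() %>%
  arrange(buffer, MAE)
buffer_comparison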

1/2 mile Buffer

Four types of models are compared at each buffer size, and the best performer from each enters the final comparison. In the 1/2-mile test, the random forest model performs best among the four, with the largest R-squared (0.81), the lowest MAE (64), and a MAPE of 30.4%.

1/4 mile Buffer

In the 1/4-mile test, the random forest model again performs best among the four, with the largest R-squared (0.79), the lowest MAE (69.3), and a MAPE of 28.8%.

1/8 mile Buffer

In the 1/8-mile test, the random forest model performs best among the four, with the largest R-squared (0.72), the lowest MAE (77.8), and a MAPE of 33.6%.

In this comparison, the largest R-squared comes from the 1/2-mile buffer, while the lowest MAPE comes from the 1/4-mile buffer. The likely reason the 1/8-mile buffer falls short is that it cannot capture enough variation within such a limited radius around each stop. Especially in high-ridership areas such as the CBD or UT, much of the relevant variation lies outside the 1/8-mile ring.

Turning to how well the three buffer sizes generalize across the CBD, UT and the remaining areas, the 1/8-mile buffer clearly fails to generalize as well as the other two: its MAE in the CBD and UT is very high. Between the 1/2-mile and 1/4-mile buffers, the MAE suggests slightly better generalizability for the 1/2-mile buffer, given its smaller error in UT. However, considering our use case, in which planners test different developments at specific locations, a smaller buffer more accurately reflects real-world development, so the final buffer radius is set to 1/4 mile. A sketch of this typology comparison follows.
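As a sketch, assuming the val_MAPE_by_typology tables computed in the sections above, the random forest errors can be compared across typologies and buffer sizes directly:

#Compare random forest MAE in CBD, UT and the remaining areas by buffer size
library(dplyr)
library(ggplot2)
bind_rows(
  val_MAPE_by_typology.half    %>% mutate(buffer = "1/2 mile"),
  val_MAPE_by_typology.quarter %>% mutate(buffer = "1/4 mile"),
  val_MAPE_by_typology.eighth  %>% mutate(buffer = "1/8 mile")) %>%
  filter(model == "rf") %>%
  ggplot(aes(typology, MAE, fill = buffer)) +
  geom_col(position = "dodge") +
  labs(title = "Random forest MAE by typology and buffer size") +
  plotTheme()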

Build the Kitchen Sink Model Using Selected Variables

library(dplyr)
#install.packages("mltools") #run once if mltools is not installed
library(mltools)
sce0 <- sce %>% drop_na()
sum(is.na(sce0))

sce0 <- plyr::join(sce0, typology, type = "left")

library(data.table)
#One-hot encode all of the route-type categorical variables in one call
cat_cols <- c("SN_cat", "Crosstown_cat", "Express_cat", "Local_cat",
              "Flyer_cat", "NightOwl_cat", "HighFreq_cat", "InOut_cat",
              "Clockwise_cat", "utshuttle_cat", "Special_cat")
sce0 <- as.data.table(sce0) %>%
  one_hot(cols = cat_cols, dropCols = TRUE)
names(sce0)

sce0 <- sce0 %>% dplyr::select(-STOP_ID)
data_split.sce0 <- rsample::initial_split(sce0, strata = "mean_on", prop = 0.75)

bus_train.sce0 <- rsample::training(data_split.sce0)
bus_test.sce0  <- rsample::testing(data_split.sce0)
names(bus_train.sce0)

cv_splits_geo.sce0 <- rsample::group_vfold_cv(bus_train.sce0,  strata = "mean_on", group = "typology")
print(cv_splits_geo.sce0)

model_rec.sce0 <- recipe(mean_on ~ ., data = bus_train.sce0) %>% #the "." means use every variable in the training set
  update_role(typology, new_role = "typology") %>% #keep typology out of the predictors while retaining it for grouping
  step_other(typology, threshold = 0.005) %>%
  step_dummy(all_nominal(), -typology) %>%
  step_log(mean_on) %>% 
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -mean_on) %>%
  step_scale(all_predictors(), -mean_on) #%>% #put on standard deviation scale
#step_ns(Latitude, Longitude, options = list(df = 4))


#Create workflow

rf_wf.sce0 <-
  workflow() %>% 
  add_recipe(model_rec.sce0) %>% 
  add_model(rf_plan)

# fit model to workflow and calculate metrics
#Metrics are changed from rmse + rsq to rsq only
rf_tuned.sce0 <- rf_wf.sce0 %>%
  tune::tune_grid(.,
                  resamples = cv_splits_geo.sce0,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rsq))

show_best(rf_tuned.sce0, metric = "rsq", n = 15)



rf_best_params.sce0     <- select_best(rf_tuned.sce0, metric = "rsq")

#Final workflow
rf_best_wf.sce0     <- finalize_workflow(rf_wf.sce0, rf_best_params.sce0)

rf_val_fit_geo.sce0 <- rf_best_wf.sce0 %>% 
  last_fit(split     = data_split.sce0,
           control   = control,
           metrics   = metric_set(rsq))

####################################Model Validation
# Pull best hyperparam preds from out-of-fold predictions
rf_best_OOF_preds.sce0 <- collect_predictions(rf_tuned.sce0) %>% 
  filter(mtry  == rf_best_params.sce0$mtry[1] & min_n == rf_best_params.sce0$min_n[1])
# collect validation set predictions from last_fit model
rf_val_pred_geo.sce0     <- collect_predictions(rf_val_fit_geo.sce0)
# Aggregate predictions from Validation set
library(tidyverse)
library(yardstick)
#lm_val_pred_geo
rf_best_OOF_preds.sce0 <- rf_best_OOF_preds.sce0 %>% dplyr::select(-min_n, -mtry)
val_preds.sce0 <- rbind(data.frame(rf_val_pred_geo.sce0), data.frame(rf_best_OOF_preds.sce0) )%>% 
  left_join(., sce0 %>% 
              rowid_to_column(var = ".row") %>% 
              dplyr::select(typology, .row), 
            by = ".row") %>% 
  dplyr::group_by(typology) %>%
  mutate(.pred = exp(.pred),
         mean_on = exp(mean_on),
         RMSE = yardstick::rmse_vec(mean_on, .pred),
         MAE  = yardstick::mae_vec(mean_on, .pred),
         MAPE = yardstick::mape_vec(mean_on, .pred))%>%
  ungroup()

val_MAPE_by_typology.sce0 <- val_preds.sce0 %>% 
  group_by(typology) %>% 
  summarise(RMSE = yardstick::rmse_vec(mean_on, .pred),
            MAE  = yardstick::mae_vec(mean_on, .pred),
            MAPE = yardstick::mape_vec(mean_on, .pred)) %>% 
  ungroup() 
#Barchart of the MAPE in each typology (plotTheme() and palette4 as defined above)


as.data.frame(val_MAPE_by_typology.sce0)%>%
  dplyr::select(typology,MAPE) %>%
  #gather(Variable, MAE, -model, -typology) %>%
  ggplot(aes(typology, MAPE)) + 
  geom_bar(aes(fill = typology), position = "dodge", stat="identity") +
  ylim(0, 120)+
  scale_fill_manual(values = palette4) +
  facet_wrap(~typology, scale= "free", ncol=4)+
  labs(title = "MAPE of the random forest model") +
  plotTheme()