The project is for a course in the Master of Urban Spatial Analysis at the University of Pennsylvania taught by Professor Ken Steif.

How to use this document

This report is divided into two sections. The first explains our use case and results; the second provides source code intended to help others replicate our findings.

1. Introduction

1.1. Abstract

Indego, Philadelphia's bike share system, currently serves Center City well, but few of its stations benefit low-income neighborhoods. To make Indego a more sustainable and equitable bike share system, the Bicycle Coalition of Greater Philadelphia wishes to expand bike share stations into low-income communities. The aim of our project is to help them identify the best places to build new stations.

In this project, we predict bike share demand at every location citywide using machine learning algorithms, provide an instant cost-benefit analysis based on those predictions, and design a web app visualizing the cost-benefit results to assist with the bike share planning process.

1.2. Motivation

Subsidies are usually needed for the first-year construction of a new bike share station. As shown in the table below, the cost of equipment and installation in the first year is about three times the yearly operating cost.

| Station Size | 1st Year (equipment, install, operating) | 2nd Year (operating) | 3rd Year (operating) | 4th Year (operating) |
| --- | --- | --- | --- | --- |
| 6 bikes & 11 docks | $49,458 | $11,718 | $12,304 | $12,919 |
| 8 bikes & 15 docks | $60,913 | $15,624 | $16,405 | $17,225 |
| 10 bikes & 19 docks | $72,367 | $19,530 | $20,507 | $21,532 |
| 12 bikes & 23 docks | $83,822 | $23,436 | $24,608 | $25,838 |
| 14 bikes & 27 docks | $95,277 | $27,342 | $28,709 | $30,145 |

Source: ashevillenc.gov

A successful bike share system balances the desire for revenue with the need to benefit people, so it is worthwhile to estimate demand and cost before siting new stations. In the past, policymakers typically evaluated siting suitability by simply overlaying different criteria or by relying on personal experience. To make predictions more accurate, we build predictive models using machine learning algorithms and train them on actual data. We then predict the trip count at every location in Philadelphia and design a user-friendly interactive app that gives the client an immediate sense of predicted bike share demand and the subsidies needed.

1.3. Methods in Brief

Our project follows 4 steps:

Exploratory Analysis
First, we conduct exploratory analysis to answer several questions relevant to our use case. After presenting examples of factors related to bike share demand, we visualize their relationships with trip count and group the important factors into 5 categories, which are processed in the next step.

Data
Next, we prepare and finalize our dataset. To build the relationship between our predictors and the dependent variable (trip count), we encode our variables in multiple ways. We then select the most influential predictors using machine learning methods, leaving the final dataset ready for modeling.

Modeling
Then, we build predictive models using 6 different machine learning algorithms. We divide our dataset into a training set, used to train each model, and a testing set, used to assess its accuracy. To improve performance, we try various methods and algorithms and choose the model with the lowest error. We then test the model on different training and testing sets to ensure its generalizability.

Cost-Benefit Analysis
Finally, we analyze the cost and benefit sides based on predictions for every location in Philadelphia, and we present an interface design visualizing our cost-benefit analysis results.

2. Exploratory Analysis

In this section, we explore several questions related to bike share demand.

Will new bike share stations affect demand at nearby existing stations?

Before building a new bike share station, the first question to ask is: will a new station take demand away from existing stations? While we do not have the data to investigate this question rigorously, we develop plots visualizing trip counts at stations near each newly opened station immediately before and after it opens (code).


After some stations open, trip counts at nearby stations drop, which suggests that part of the demand at old stations may be taken away when a new station offers users one more choice. For most stations, however, nearby stations' trip counts remain the same or even increase slightly, perhaps because new stations enlarge the bike share network and make surrounding stations more convenient to use. We therefore tend to believe that building a new station will not significantly influence the trip counts of its nearby stations; in other words, the cost-benefit balance will not be affected.

What factors are associated with bike share demand?

The next step is to figure out which factors might increase or reduce bike share demand. Plotting each station's total 2016 trip count (as shown in the figure below) reveals that stations with higher trip counts cluster clearly in Center City, so we look for factors that capture this spatial structure.

Demographic
The potential users of bike share matter to the demand prediction, so we investigate the relationship between trip count and demographic variables, one of which is commute pattern. Bike share stations are expected to be used more frequently where more people commute to work by bike. The maps (code) show the percentage of residents cycling to work in each census tract, with the percentage commuting by car as a comparison. Residents near Center City, where the majority of bike share stations are located, are more likely to commute by bike.



Attractions and Repellents
Where do these potential users go? We should consider the attractive destinations those potential users may cycle to; around these attractions, trip counts are expected to be higher.

For example, the Central Business District (CBD) is regarded as one of the most important attractions. Stations with larger trip counts are mainly found closer to Center City, Philadelphia's CBD, as demonstrated by the figure below (code).



Each pixel in the plot on the right represents a pair of stations (start and end): the darker the square, the more trips between the corresponding stations; the closer the pixel is to (0,0), the closer the stations are to the CBD.

Workplaces are another attraction for people commuting to work by bike. The following map (code) shows the job count within 2,500 ft of each station; most existing stations fall in areas with more jobs. The scatterplot likewise shows that trip count rises with the number of nearby jobs.



There are many other attractions related to bike share demand, such as supermarkets, universities, restaurants, and tourist attractions. We create features based on the distance from each station to its nearest attractions. As the scatterplots below show, trip count tends to drop as the distance to attractions increases.



Transportation Network
What do these potential users experience while riding through the transportation network? We assume that where the network is well connected, people are more likely to use bike share.

For example, people might cycle to transit stops that are not too far from bike share stations. We therefore create maps and scatterplots to visualize the relationship between trip count and distance to transit (subway stations and bus stops). Both the maps and the plots show that the closer a station is to bus stops, the higher its trip count, while distance to subway stations does not appear correlated with bike share demand.





Another assumption is that bike share demand is associated with how busy and well connected the road network is. According to the plots below, street density is not clearly associated with trip count, while street length is negatively related to it.



Neighboring Stations
How does a bike share station relate to its neighbors? As noted above, high trip count stations usually cluster in groups, which suggests a station's demand may be affected by surrounding stations; we call this the "spatial lag" (code). The scatterplot below shows trip count as a function of the average trip count of the five nearest bike share stations: a station's trip count is positively associated with that of its neighbors.



Internal Factor
How is the trip count of a bike share station influenced by its internal features? Another assumption is that trip count is positively associated with the number of docks in the station.

3. Data

From the exploratory analysis, we know which factors are expected to be associated with trip count. After gathering these data from multiple sources, we start processing them.

Collect Generalizable Training Data

For the dependent variable, we use each station's full-year 2016 trip count. To ensure every station has a full year of data, some stations are removed, leaving 66 Philadelphia stations in our dataset. The next step is to find cities whose bike share experience is generalizable to Philadelphia's: they should be similar to Philadelphia in latitude and longitude, demographic environment, spatial scale, spatial characteristics, and bike share demand. Based on these criteria, we select 4 candidate cities and create boxplots showing the distribution of the demographic and spatial factors (including attractions, repellents, and the transportation network) in each city. The boxplots show that 3 of the 4 cities share similar characteristics, Minneapolis being the exception. We therefore drop Minneapolis and select the final three peer cities: Chicago, Boston, and Washington, D.C.





Predictors: Feature Engineering and Selection

Now we have the factors and the trip count data. To build the relationship between them, we encode our predictors, creating many candidate features with different feature engineering methods. For example, to reflect the spatial relationship between stations and the CBD, we create 2 variables indicating: 1) whether the station is inside or outside the CBD and 2) the station's distance to the CBD.
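A minimal sketch of these two encodings in base R (all names, coordinates, and the 1-mile cutoff below are illustrative, not taken from the project code):

stations <- data.frame(x = c(2694000, 2699000, 2710000),
                       y = c(236500, 235000, 252000))
cbd_x <- 2694000; cbd_y <- 236000   # assumed CBD centroid (EPSG:2272, feet)
cbd_radius <- 5280                  # assumed 1-mile CBD cutoff

# 2) continuous distance to the CBD
stations$dis_CBD <- sqrt((stations$x - cbd_x)^2 + (stations$y - cbd_y)^2)
# 1) binary inside/outside indicator
stations$isCBD <- ifelse(stations$dis_CBD <= cbd_radius, 1, 0)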



Another example is distance to intersections: we calculate the distance from each station to its 5/10/20 nearest road intersections, creating 3 variables.
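A sketch of these features using FNN, the same package used in Appendix 7.4; the toy coordinates are illustrative, and summarizing the k distances by their mean is our assumption:

library(FNN)
set.seed(1)
station_coords      <- matrix(runif(20),  ncol = 2)  # 10 toy stations
intersection_coords <- matrix(runif(400), ncol = 2)  # 200 toy intersections

# nn.dist holds, per station, the distances to its 20 nearest intersections
knn20 <- get.knnx(intersection_coords, station_coords, k = 20)
inter5  <- rowMeans(knn20$nn.dist[, 1:5])    # avg distance to 5 nearest
inter10 <- rowMeans(knn20$nn.dist[, 1:10])
inter20 <- rowMeans(knn20$nn.dist[, 1:20])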



We also apply machine learning methods such as Boruta and Random Forest to select significant predictors from the 130 candidates created by feature engineering. (For more information on feature engineering and selection, please refer to Feature Engineering and Selection.)

The final feature list is shown below.

4. Modeling

After settling on the list of variables to include, we start modeling. What makes a good predictive model? First, it should perform well on unseen data. Second, the error, namely the gap between the actual and predicted values, should be small. Third, it should perform similarly well across different sub-datasets.

First, we divide our dataset into a training set and a testing set to see how the model performs on unseen data.

How do we build the training and testing sets?

Our final dataset contains data from 4 cities, but our real concern is prediction accuracy in Philadelphia, so the testing set is drawn from Philadelphia's bike share stations. At the same time, Philadelphia's past experience is vital for the prediction, so part of Philadelphia's data should also be in the training set.

Therefore, we randomly select 60% of Philadelphia's stations, combine them with the other three cities' data as our training set, test on the remaining 40% of Philadelphia's stations, and repeat this process 100 times.



How do we calculate model error?

We build 6 models (code) based on different algorithms: OLS, Lasso, Ridge, Elastic Net, Gradient Boosting, and Random Forest (a brief introduction to our models) and compare their goodness of fit to select a final model.

We develop several goodness-of-fit indicators relating predicted trip counts to observed trip counts. We report the difference, or "error", between observed and predicted trip counts with two statistics: MAPE (Mean Absolute Percentage Error) and MAE (Mean Absolute Error). MAPE represents the percentage of trips more or less than actually occurred, and MAE is the number of trips more or less than actually occurred. They are calculated by the formulas below:

MAPE = mean(abs(predicted trip count - actual trip count) / actual trip count)

MAE = mean(abs(predicted trip count - actual trip count))
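These formulas translate directly into R:

# the two error metrics, a direct transcription of the formulas above
mape <- function(actual, predicted) mean(abs(predicted - actual) / actual)
mae  <- function(actual, predicted) mean(abs(predicted - actual))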

We want these indicators to be as small as possible, since we expect predicted trip counts to be close to actual trip counts. The table below summarizes the models, and the maps that follow show the predicted trip counts. The MAPE of the OLS model is the lowest, at around 25%, which means our accuracy reaches roughly 75% with the OLS model.

Summary of Models

| Model | MAPE | MAE |
| --- | --- | --- |
| OLS | 0.2523615 | 2,112.858 |
| Ridge | 0.2751189 | 2,188.518 |
| lasso | 0.2791624 | 2,201.902 |
| net | 0.2852234 | 2,240.458 |
| gbm | 0.2759588 | 2,152.214 |
| RF | 0.2685985 | 2,111.944 |



Does our model perform well and capture the spatial structure of trip count?

After running the final OLS model, the maps below show the predicted and actual trip counts at each bike share station; their spatial patterns are similar.



The plot of residuals (errors) as a function of trip count is shown below.



In the residual plots above, the points concentrate where station trip counts are low and spread out as trip counts grow. This indicates that our model does a good job predicting low and medium trip counts but does worse for stations with high trip counts. That is acceptable: most stations in Philadelphia have low or medium trip counts, and our goal is to predict for low-income communities, where trip counts will not be high.

Since stations with higher trip counts are not predicted as well, and we know from the previous maps that high trip counts cluster around the CBD, we want to test whether the residuals are also clustered spatially. In other words, we want to test whether our model captures the underlying spatial structure of bike share trip counts. To do this, we map the residuals below:



The map shows that the residuals are randomly distributed, with no significant spatial correlation among them. This indicates that our model successfully captures the spatial structure of station trip counts.
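One way to check this visual impression statistically (not part of the original analysis) is a Moran's I test on the residuals, reusing the spdep neighbor construction from Appendix 7.8; `ols` here is the OLS results table built in Appendix 7.10:

library(spdep)

# k-nearest-neighbor spatial weights among the Philadelphia stations
coords  <- cbind(ols$lon, ols$lat)
weights <- nb2listw(knn2nb(knearneigh(coords, 5)), style = "W")

# an insignificant Moran's I supports "no spatial autocorrelation"
moran.test(ols$Residual, weights)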

Does our model predict well for different sub-datasets?

The basic idea of this section is to test if our model performs equally well when predicting for different sub-datasets.

As mentioned before, we repeat the train/test split and model run 100 times. The plot below shows the distribution of the error rate over those 100 runs. The error rates concentrate around 25%, which suggests our model is generalizable.
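A minimal sketch of this plot, assuming the `mapetable` of per-split errors built in Appendix 7.10:

library(ggplot2)
ggplot(mapetable, aes(x = MAPE_ols4)) +
  geom_histogram(bins = 20, fill = "#812781", alpha = 0.8) +
  labs(title = "Distribution of MAPE over 100 Random Splits",
       x = "MAPE", y = "Number of Splits") +
  theme_light()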



Moreover, since we will predict for every location in Philadelphia, we have to make sure our model performs consistently well across diverse urban contexts. For example, if the model only predicted well for stations in high-income areas, predictions for low- and medium-income areas would be inaccurate.

To check this, we hold out stations in low-income communities as the testing set, train the model on the remaining stations, and validate on the held-out set. We then repeat the same process holding out stations in medium- and high-income communities, and do the same for distance to CBD and for trip count. The results are shown in the following tables.

Income

| Group | MAPE | MAE |
| --- | --- | --- |
| Low | 0.2980 | 1,656 |
| Mid | 0.2167 | 2,413 |
| High | 0.2372 | 2,274 |

Distance to CBD

| Group | MAPE | MAE |
| --- | --- | --- |
| <1 mile | 0.2543 | 3,075 |
| >=1 mile | 0.2541 | 1,515 |

Trip Count

| Group | MAPE | MAE |
| --- | --- | --- |
| Small | 0.2544 | 774 |
| Mid | 0.2109 | 1,342 |
| Large | 0.3173 | 4,159 |

Despite small differences between subsets, the model shows overall generalizability to a diverse set of urban contexts, since the error rate (MAPE) is broadly similar across subsets. The error rate is somewhat higher for large trip count stations, which we regard as acceptable, since natural demand in low- and medium-income communities is unlikely to be that high.
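A minimal sketch of one such hold-out run; `bike_Philly` and `count2` follow the names in Appendix 7.10, while the grouping column, its levels, and `ols_formula` are illustrative:

# hold out one subgroup, train on the rest, report MAPE on the held-out set
holdout_mape <- function(data, group_var, group_value, model_formula) {
  testing  <- data[data[[group_var]] == group_value, ]
  training <- data[data[[group_var]] != group_value, ]
  fit  <- lm(model_formula, data = training)
  pred <- exp(predict(fit, testing))           # back-transform log counts
  mean(abs(pred - testing$count2) / testing$count2)
}

# e.g. holdout_mape(bike_Philly, "income_level", "Low", ols_formula)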

The conclusion of the model

Overall, this is a good model. Its error rate is around 25%, an acceptable level of predictive power given that our dataset consists of only 600+ observations. The average error on the test set is about 2,113 trips. The spatial variation is well captured, with no clear spatial autocorrelation in the residuals. The model also generalizes well, predicting accurately across different urban contexts. We recommend this model to our client and continue using it to predict cost and benefit for each location in Philadelphia.

We can still improve the model's performance in several ways. First, we should collect more up-to-date data from more cities; with more observations in the training set, the model's predictive power should improve significantly. Second, more advanced machine learning algorithms could be applied to reduce the error. Third, the model predicts somewhat worse for stations with large trip counts, so for areas where such stations cluster, further investigation is needed to find variables that make the model more generalizable across diverse urban contexts.

5. Cost-Benefit Analysis

After finalizing our model, we conduct a cost-benefit analysis to estimate the required subsidies, considering both the cost side and the benefit side.

Cost Side

Building a new bike share station is expensive, and the cost varies with the number of docks: the more docks a station has, the higher the cost. In the first year, the cost covers equipment, installation, and operations; from the second year onward, only operating costs apply. We assume the cost of equipment and installation is subsidized in the first year. Based on previous research, the operating cost for the first year is as follows:

Operating Cost for the First Year

| Number of Docks | Operating Cost per Dock | Total Operating Cost |
| --- | --- | --- |
| 10 | $1,014 | $10,140 |
| 15 | $990 | $14,850 |
| 20 | $981 | $19,620 |
| 25 | $971 | $24,275 |
| 30 | $962 | $28,860 |
| 35 | $952 | $33,320 |

Benefit Side

Although the city does not run bike share for profit, the system still generates revenue because people pay for their trips. Based on Indego's current rates, every trip is assumed to generate $2 of revenue, so the total revenue of each station is calculated as:

              Total Revenue per Year ($) = Total Trips per Year * $2

Cost-Benefit Balance

Now that we already know the cost and benefit of opening a new station, the cost-benefit balance is calculated as follows:

              Cost-Benefit Balance = Cost - Benefit
                                   = Cost - Total Trips per Year * $2

In this way, we can estimate the net cost or revenue of building a new station at any location in Philadelphia: a positive balance indicates the subsidy needed, and a negative balance indicates net revenue.
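As an illustration, the balance can be computed directly from the first-year operating costs tabulated above; the function name and the table-lookup approach below are ours, and the sketch assumes the station size matches one of the tabulated dock counts:

# first-year operating cost by number of docks (from the Cost Side table)
op_cost <- c("10" = 10140, "15" = 14850, "20" = 19620,
             "25" = 24275, "30" = 28860, "35" = 33320)

cost_benefit_balance <- function(predicted_trips, docks) {
  cost    <- op_cost[as.character(docks)]   # assumes a tabulated dock count
  benefit <- predicted_trips * 2            # $2 of revenue per trip
  unname(cost - benefit)                    # positive = subsidy needed
}

cost_benefit_balance(5000, 15)   # 14850 - 10000 = 4850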

6. Use Case


We devise a web app to visualize the prediction results and the cost-benefit analysis. As users change the number of docks with the slider, the map shows the cost-benefit balance for each location citywide. To view more details about a specific location, users can click on the map; the sidebar then shows how many trips a new station at that location would generate, how much it would cost, and how much net revenue it would produce. Users can also glance at the neighborhood profile in the information box, which shows population density, median age, and median household income at the block group level, as well as the distance to the CBD, the distance to the nearest SEPTA station, and the average distance to the 10 nearest assault incidents. The web app gives policymakers an intuitive understanding of cost-benefit conditions across the whole city, making it possible to allocate resources better and build a more sustainable, more equitable bike share network.

7. Appendix

7.1. Data Source

Dependent Variable

-Trip Count: official websites of bike share programs

Philadelphia: Indego

Boston: Hubway

Chicago: Divvy

Washington, D.C.: Capital Bikeshare

Predictors

-Demographic: ACS 5-year estimates, loaded via the R package tidycensus

-Attractions and Repellents: city open data portals, Open Street Map, LEHD Employment data

-Transportation Network: city open data portal, Open Street Map

-Neighboring Stations: calculated from trip count

-Internal Factor: official websites of bike share programs

7.2 Machine Learning Models

We start with OLS (Ordinary Least Squares) regression. As a basic and commonly used linear regression model, OLS is simple to interpret and its mathematical framework is easy to understand. However, it has limitations: in particular, it can easily overfit. To avoid overfitting and make the model more generalizable, we use three regression shrinkage methods (ridge, lasso, and elastic net) to penalize coefficients.

We also use two tree-based models, random forest and gradient boosting, both of which build many decision trees. Random forest averages the predictions of all its trees (or lets them vote, in classification) and usually performs better on large datasets, while gradient boosting is an iterative process in which each tree corrects the errors of the previous ones. Both are flexible and adapt to various distribution patterns.

7.3 Feature Engineering and Selection

Feature engineering

Feature engineering plays a vital part in building modeling algorithms: the performance of our predictions depends heavily on how we translate and present our data. In the sections above, we explored many factors that might relate to bike share demand. Since there are numerous ways to characterize the data, we need to create features that are relevant to the scenario. To get the most out of our data, we engineered many features derived from the raw data.

1). Transform Continuous Variables into Categorical Ones

One way to engineer a numeric feature is to transform it into a categorical one. As mentioned above, income is regarded as an important predictor of bike share demand, and there are several ways to represent it. Here, by comparing each census tract's median household income with the citywide value, we categorized income into three levels and used the level (low, medium, high) in the model instead of the raw numbers.
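An illustrative recoding of tract income into three levels relative to the citywide median; the toy data and the 0.8/1.2 thresholds are assumptions, not the project's actual cutoffs:

tracts <- data.frame(MedHHInc = c(21000, 38000, 65000, 90000))
city_median <- median(tracts$MedHHInc, na.rm = TRUE)

# cut() turns the continuous income into an ordered three-level factor
tracts$IncomeLevel <- cut(tracts$MedHHInc,
                          breaks = c(-Inf, 0.8 * city_median, 1.2 * city_median, Inf),
                          labels = c("Low", "Medium", "High"))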

2). Isolate Key Information using Indicator Variables

An important way to extract key information from our data is to create indicator variables. Undoubtedly, we have to highlight the role of the Central Business District (CBD, Center City, Downtown, etc.) in exploring bike share demand. There are many ways to capture the relationship between the CBD and trip count. For example, the distance to the CBD is an intuitive way to depict the CBD factor. We can also use the indicator "is the station in the CBD" (isCBD): 1 if the station is in the CBD, 0 otherwise.
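A sketch of the isCBD indicator via a point-in-polygon test with sf (consistent with the spatial joins in Appendix 7.7); `stations_sf` and `cbd_polygon` are assumed to be sf objects in the same projection, with `cbd_polygon` a single feature:

library(sf)
# TRUE/FALSE for "inside the CBD polygon", coerced to the 0/1 indicator
stations_sf$isCBD <- as.integer(st_within(stations_sf, cbd_polygon, sparse = FALSE))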

3). Discover Best Distance Variables

To deal with distance variables, we need to find the best one in the pool. For example, we believe the distance to intersections is an important predictor. We first calculated the distance to the 5, 10, 15, and 20 nearest road intersections for each station (and did the same for the other initial target variables). We then selected the best candidate by looking at the correlation between each distance variable and bike share trip count.
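A sketch of this selection step; the variable names follow the inter* pattern used in Appendix 7.10 and are illustrative:

# rank candidate distance features by correlation with log trip count
candidates <- c("inter5", "inter10", "inter15", "inter20")
cors <- sapply(candidates,
               function(v) cor(bike_Philly[[v]], bike_Philly$Ln_count,
                               use = "complete.obs"))
best <- names(which.max(abs(cors)))   # keep the strongest candidate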

4). Interact Two Features to Tell a New Story

Some variables tell stories individually, but combined they can provide more information than they do alone, so we created many interactions to see how these variables tell stories simultaneously. For instance, the number of docks is expected to influence bike share trip count, and so is the indicator "is the station in the CBD". However, we tend to believe that adding one more dock at a station inside the CBD differs from adding one at a station outside it. Therefore, we created a new feature by interacting the variable "number of docks" (Num_Docks) with the variable "is the station in the CBD" (isCBD).
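Sketched directly, assuming the Num_Docks_CBD term in the model formula of Appendix 7.10 is a simple product of the two variables:

bike_Philly$Num_Docks_CBD <- bike_Philly$Num_Docks * bike_Philly$isCBD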

5). Apply Polynomial Functions Instead of Linear Ones

We also created new features by transforming existing ones with elementary functions. A simple example: we used the square of the distance to the CBD instead of the distance itself.
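Sketched directly, assuming the dis_CBD_s2 term in the model formula is the square of the (possibly scaled) distance to the CBD:

bike_Philly$dis_CBD_s2 <- bike_Philly$dis_CBD^2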

Feature Selection

After feature engineering, we have many candidate features. The next step, called feature selection, is to pick out the significant ones. We apply different methods with assistance from popular machine learning algorithms such as Boruta and Random Forest.

Boruta is an all-relevant feature selection wrapper algorithm, capable of working with any classification method that outputs a variable importance measure (VIM). The method performs a top-down search for relevant features by comparing the importance of the original attributes with the importance achievable at random, estimated from their permuted copies, and progressively eliminating irrelevant features to stabilize that test.

library(Boruta)
# run Boruta on all candidate predictors of log trip count
boruta_Train <- Boruta(Ln_count ~ ., data = data_s, doTrace = 2)

## save features
borutaVars <- getSelectedAttributes(boruta_Train)

## save formula
boruta.formula <- formula(paste("Ln_count ~ ", paste(borutaVars, collapse = " + ")))
boruta.formula

We also produced a variable importance plot with XGBoost. The script is shown below.

library(xgboost)
# fit_xgb is an xgboost model fitted earlier on the candidate predictors in df
importance_matrix <- xgb.importance(colnames(df), model = fit_xgb)
xgb.plot.importance(importance_matrix = importance_matrix[1:50])


7.4. Effects of New Stations

# packages used in this script
library(dplyr)
library(tidyr)      # gather
library(lubridate)  # date handling
library(FNN)        # get.knnx
library(ggplot2)
library(ggpubr)     # ggarrange

#read the dataset of daily trips for each station
bikeshare_byday<-read.csv("bikeshare_byday.csv")
stationID<-read.csv("indego-stations-2017-10-20.csv")

#clean the time
bikeshare_byday$dep_time<-as.character(bikeshare_byday$dep_time)
bikeshare_byday$dep_time<-ymd(bikeshare_byday$dep_time)

bikeshare_byday$year=year(bikeshare_byday$dep_time)
bikeshare_byday$month=month(bikeshare_byday$dep_time)
bikeshare_byday$day=mday(bikeshare_byday$dep_time)

#make datetime
bikeshare_byday<-bikeshare_byday %>% 
  mutate(dep_time = make_datetime(year=year,month=month,day=day)) %>% 
  dplyr::select(-X,-year,-month,-day)

#selected new stations were built between 5.2-5.5 in 2016, so just filter days 2 weeks before and 2 weeks after
newstation_daily <- bikeshare_byday %>% 
  filter(dep_time<=ymd("2016-05-05")+days(14) & dep_time>=ymd("2016-05-02")-days(14)) %>%
  mutate(day_no_0502=yday(dep_time)-yday("2016-05-02"),
         day_no_0503=yday(dep_time)-yday("2016-05-03"),
         day_no_0504=yday(dep_time)-yday("2016-05-04"),
         day_no_0505=yday(dep_time)-yday("2016-05-05"))%>% 
  group_by(day_no_0502,day_no_0503,day_no_0504, day_no_0505,start_station_id) %>% 
  dplyr::summarise(trips= sum(count))

newstation_daily <- dplyr::rename(newstation_daily,station_id=start_station_id)

#read the table including the start_station_id, lon and lat of each station
station_trips<-read.csv("Philadelphia.csv")
station_trips$id<-rownames(station_trips)
reference<-data.frame(station_id=station_trips$start_station_id,
                      row_id=station_trips$id,
                      trips=station_trips$count)

new_stations <- station_trips %>% 
  filter(start_station_id>=3098 & start_station_id<=3115)

old_stations <- station_trips %>% 
  filter(start_station_id<3098)

newstations_coor<-new_stations %>% select(lon,lat)%>% as.matrix()
oldstations_coor<-old_stations %>% select(lon,lat)%>% as.matrix()

#figure out which stations are nearest five stations of the new station
newstation_nearby=get.knnx(oldstations_coor,newstations_coor,k=5)

temp1 <-
  as.data.frame(newstation_nearby) %>%
  select(-nn.dist.1,-nn.dist.2,-nn.dist.3,-nn.dist.4,-nn.dist.5) %>%
  bind_cols(new_stations)

temp2<- temp1 %>% gather(attributes, value,-start_station_id,-lat,-lon,-count,-id,-city)

join_table<-merge(temp2,reference,by.x='value',by.y='row_id') %>%  
          select(-value,-id,-attributes,-trips,-city)

newstation_2weeks<-merge(join_table,newstation_daily,by.x='station_id',by.y='station_id')
newstation_2weeks<-na.omit(newstation_2weeks)

#keep the station name and ID columns for the address join
stationID<-stationID %>% select(Station.Name,Station.ID)

#filter stations built on 05-02-2016, and only include the data for two weeks before and two weeks after
nearby_trips_daily_0502<-newstation_2weeks %>% filter(start_station_id>=3098 & start_station_id<=3101 &
                                                        day_no_0502>=-14 & day_no_0502<=14) %>% 
                          group_by(start_station_id,day_no_0502) %>%  
                          dplyr::summarise(trips=as.integer(mean(trips)))
#join table to get address
nearby_trips_daily_0502<-merge(nearby_trips_daily_0502,stationID,by.x='start_station_id',by.y='Station.ID')

#find weekends
nearby_trips_daily_0502 <-nearby_trips_daily_0502 %>% filter(day_no_0502!=0)%>%
                          mutate(after=as.factor(ifelse(day_no_0502>0,"after","before")),
                          weekdays=ifelse(day_no_0502==-1 | day_no_0502==-2|
                                   day_no_0502==-8 | day_no_0502==-9|
                                   day_no_0502==5 | day_no_0502==6|
                                   day_no_0502==12 | day_no_0502==13,
                                  0,1))

#get rid of weekends
station1<-nearby_trips_daily_0502 %>% filter(start_station_id==3098 & weekdays==1)

#plot
newstation1<-
  ggplot(station1,aes(day_no_0502,trips,group=after))+
  geom_point(shape=16,size=3,color="#812781",alpha=0.8)+
  stat_smooth(method = "loess", se = FALSE, size = 0.8,alpha=0.6,color="#b63578")+
  geom_vline(xintercept = 0, colour = "#fc8861", linetype= "longdash", size=1) + 
  geom_vline(xintercept = 14, colour = "#fc8861", linetype= "longdash", size=1)+
  geom_vline(xintercept = -14, colour = "#fc8861", linetype= "longdash", size=1)+
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(size = 12, face = "italic", colour = "dark grey", hjust = 0)
    )+
   labs(title = station1$Station.Name)+
  xlab("Day Number") +
  ylab("Trip Counts")
  
newstation1

#do the above steps for all newstations
#arrange them together
total1<-ggarrange(newstation1,newstation2,newstation3,newstation4,newstation5,ncol=5,nrow=1)
total2<-ggarrange(newstation6,newstation7,newstation8,newstation9,newstation10,ncol=5,nrow=1)
total3<-ggarrange(newstation11,newstation12,newstation13,newstation14,newstation15,ncol=5,nrow=1)

total1
total2
total3
total<-ggarrange(total1,total2,total3,ncol=1,nrow=3)
total

7.5. Maps based on Shapefile (commute pattern)

# packages used in this script
library(maptools)   # readShapeSpatial
library(dplyr)
library(ggplot2)

# Load datasets (the peer-city tables are loaded the same way as Philly2016)
Philly2016 <- read.csv("Philly2016.csv")
CityJoin <- rbind(Philly2016,DC2016,Boston2016,Chicago2016)
# Load shapefile
Philly <- readShapeSpatial("census-tracts-philly.shp")
# Flatten the shapefile into a data frame keyed by GISJOIN
# (joined to the census table further below)
data <- fortify(Philly, region = "GISJOIN")

##### Commute Pattern Map #####
# Calculate proportion and transform NAs to 0
Philly2016$PctBike1 <- Philly2016$Bicycle/Philly2016$Total
Philly2016$PctCar1 <- Philly2016$CarTruckVan/Philly2016$Total
Philly2016 <- mutate(Philly2016, PctBike1 = ifelse(is.na(PctBike1),0,PctBike1) )
Philly2016 <- mutate(Philly2016, PctCar1 = ifelse(is.na(PctCar1),0,PctCar1) )
# Define cutlines
Philly2016$PctBike <- factor(
  cut(Philly2016$PctBike1, c(-1, 0.03, 0.06, 0.09, 0.12,1)),
  labels = c("<3%","3%-6%","6%-9%","9%-12%",">12%")
)
Philly2016$PctCar = cut(Philly2016$PctCar1, breaks=c(-1, 0.1, 0.2, 0.4, 0.6, 1),
                        labels=c("<10%","10%-20%","20%-40%","40%-60%",">60%"),
                        include.lowest = TRUE)
PhillyJoin <- left_join(data, Philly2016, by=c("id" = "GISJOIN"))

# 1) Map of Bike

BikePhilly<-ggplot() +
  geom_polygon(data = PhillyJoin, aes(x = long, y = lat, group = group,
                                      fill = PctBike), color = "grey30", size = 0.1) +
  coord_map()+ # Mercator projection
  #set the color scale
  scale_fill_manual(values=c('#ffffd4','#fee391','#fec44f','#fe9929','#ec7014')) +
  labs(title = "By Bike",
       subtitle = "Percentage of residents cycling\nto work by census tracts ",
       # remove the caption from the legend
       fill = "Percentage") +
  #set the plot theme
  theme_void() +
  theme(text = element_text(size = 8,face = "italic"),
        plot.title = element_text(size = 11,face = "bold.italic",colour = "dark orange", hjust = 0),
        plot.subtitle = element_text(size = 8, face = "italic", colour = "black", hjust = 0),
        panel.background = element_rect(fill = "NA", colour = "NA"),
        plot.margin = unit(c(0.25, 0.25, 0.25, 0.25), "in"),
        legend.text = element_text(size = 7),
        legend.position = c(0.8, 0.25))

# 2) Map of Car, Truck, Van 
CarPhilly<-ggplot() +
  geom_polygon(data = PhillyJoin, aes(x = long, y = lat, group = group,
                                      fill = PctCar), color = "grey30", size = 0.1) +
  coord_map()+ # Mercator projection
  #set the color scale
  # five PctCar classes need five colors; the fifth ('#7a0177') is added here
  # from the same RdPu palette to match the five labels
  scale_fill_manual(values=c('#feebe2','#fcc5c0','#f768a1','#ae017e','#7a0177')) +
  labs(title = "By Car, Truck or Van ",
       subtitle = "Percentage of residents taking car,truck or van \nto work by census tracts ",
       # remove the caption from the legend
       fill = "Percentage") +
  #set the plot theme
  theme_void() +
  theme(text = element_text(size = 8,face = "italic"),
        plot.title = element_text(size = 11,face = "bold.italic",colour = "dark blue", hjust = 0),
        plot.subtitle = element_text(size = 8, face = "italic", colour = "black", hjust = 0),
        panel.background = element_rect(fill = "NA", colour = "NA"),
        plot.margin = unit(c(0.25, 0.25, 0.25, 0.25), "in"),
        legend.text = element_text(size = 7),
        legend.position = c(0.8, 0.25))

7.6. Heat Plot

mapThemeflowCBD<-function(base_size = 12) {
  theme(text = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 17,face = "bold",colour = "black", hjust = 0,vjust=1),
        plot.subtitle = element_text(size = 12, face = "italic", colour = "dark grey", hjust = 0),
        plot.caption = element_text(size = 10, face = "italic", colour = "grey"),
        panel.background = element_blank(),
        legend.text = element_text(size = 10),
        panel.border = element_rect(colour = "grey", fill=NA, size=1),
        panel.grid.major =  element_line(colour="white",size = rel(0.5)),
        panel.grid.minor = element_blank(), 
        plot.background = element_rect(fill = "white"),
        plot.margin = unit(c(0,0,0,0), "lines"),
        legend.position = "bottom"
  )
}

dis_CBD_Philly<-read.csv("philly_distance.csv")
dis_CBD<-dis_CBD_Philly %>% dplyr::select(start_station_id,dis_CBD)

#Arrange in ascending order of distance to CBD
dis_CBD<-arrange(dis_CBD,dis_CBD)

#The order of the station by the distance to CBD
dis_CBD$id<-rownames(dis_CBD)

#Join table to make start_station_id equal order id
flow_data1<-flow_data %>% dplyr::select(start_station_id,end_station_id,count)

flow_data1<-left_join(flow_data1,dis_CBD,by="start_station_id")

flow_data1$start_id<-flow_data1$id

#Join table to make end_station_id equal order id
flow_data2<-flow_data %>% mutate(start=start_station_id,end=end_station_id) %>% dplyr::select(start,end)

flow_data2<-merge(flow_data2,dis_CBD,by.x="end",by.y="start_station_id")

flow_data2$end_id<-flow_data2$id 

flow_data2<-flow_data2 %>% dplyr::select(-id,-dis_CBD) 

flow_data2<-flow_data2 %>% group_by(end) %>% dplyr::summarise(end_id=first(end_id))

#Join two tables and both start_station_id and end_station_id are replaced by order id
data<-merge(flow_data1,flow_data2,by.x="end_station_id",by.y="end")

start_id<-data$start_id
end_id<-data$end_id
count<-data$count

mydata<-cbind(start_id,end_id,count)
mydata<-as.data.frame(mydata)

mydata$start_id<-as.character(mydata$start_id)
mydata$start_id<-as.integer(mydata$start_id)
mydata$end_id<-as.character(mydata$end_id)
mydata$end_id<-as.integer(mydata$end_id)
mydata$count<-as.character(mydata$count)
mydata$count<-as.integer(mydata$count)

#Define break
mydata$Break_count<-mydata$count

mydata$Break_count = cut(mydata$Break_count, breaks=c(0,20,50,100,200,500,4000), labels=c("0-20","20-50","50-100","100-200","200-500","500+"), include.lowest = TRUE)

#Use a scatterplot to draw the heat plot
flowCBD<-ggplot(mydata, aes(x=start_id, y=end_id)) +
  geom_point(aes(colour=Break_count),alpha=0.8,size=3,shape=15) + 
  scale_colour_manual(values =c("#fef0d9","#fdd49e","#fdbb84","#fc8d59","#e34a33","#b30000"),
                      name="Trip Counts")+
  labs(title = "Trip Flow Based on Distance to CBD",
       subtitle = "Philadelphia, 2016",
       x = "Rank of Start Station by Distance to CBD*",
       y = "Rank of End Station by Distance to CBD*",
       caption = "*The closer to (0,0), the shorter distance to CBD")+
  theme_void() +
  theme(text = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 17,face = "bold",colour = "black",vjust=0),
        plot.subtitle = element_text(size = 12, face = "italic", colour = "dark grey"),
        plot.caption = element_text(size = 10, face = "italic", colour = "grey"),
        panel.background = element_blank(),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        panel.border = element_rect(colour = "grey", fill=NA, size=1),
        axis.ticks = element_blank(), 
        panel.grid.major =  element_line(colour="white",size = rel(0.5)),
        panel.grid.minor = element_blank(), 
        plot.background = element_rect(fill = "white"),
        plot.margin = unit(c(0,0,0,0), "lines"),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.box = "vertical")

7.7. Spatial Join within Distance

#### 1 Spatial join to whole city (hex grids) and calculate count within a certain distance
# packages used in this script
library(sf)
library(dplyr)
library(ggplot2)
library(viridis)   # magma palette
# Job shapefile
Job <- st_read("points_2015.shp")
# Spatial join
Job <-
  Job  %>%
  st_transform(crs=2272) %>%
  st_intersection(PhillyBoundary) %>%
  dplyr::select(s000)%>%
  st_sf()
st_crs(hexPhilly)=st_crs(Job)
hexJobjoin = st_join(hexPhilly, Job, st_is_within_distance, dist = 2640) # join within 0.5 mile
# Sumarize
hexJoin<- hexJobjoin%>% # This is our final dataset!!!!!!!
  group_by(idhex) %>%
  summarize(JobCount = sum(s000, na.rm = TRUE))%>%    
  st_sf()

# Create map
hexJoin$Break = cut( hexJoin$JobCount , breaks=c(0,1000,2500,5000,10000,999999), labels=c("0-1000","1000-2500","2500-5000","5000-10000", "10000+"), include.lowest = TRUE )
my_palette=rev(magma(9))[c(-1,-9)]
jobp1<-ggplot()+
  geom_sf(data=hexJoin, aes(fill=Break), color = NA, size = 0.1,linetype = 0)+
  coord_sf()+
  theme_void() +
  scale_fill_manual(values=my_palette, name="Number of Jobs", guide = guide_legend( keyheight = unit(2, units = "mm"), keywidth=unit(3, units = "mm"), label.position = "bottom", title.position = 'top', nrow=1) ) +
  labs(title = "Job Counts by 500ft Hex Grids",
       subtitle = "Philadelphia, 2016",
       caption = "",
       fill = "Job Counts") +
  geom_point(data = tripPhilly, 
             aes(x=lon, y=lat),size=0.5,color="white",shape=5) + 
  #set the plot theme
  theme_void() +
  theme(text = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 17,face = "bold",colour = "black", hjust = 0),
        plot.subtitle = element_text(size = 12, face = "italic", colour = "dark grey", hjust = 0),
        plot.caption = element_text(size = 10, face = "italic", colour = "grey"),
        panel.background = element_blank(),
        legend.text = element_text(size = 10),
        panel.border = element_rect(colour = "grey", fill=NA, size=1),
        axis.ticks = element_blank(), 
        panel.grid.major =  element_line(colour="white",size = rel(0.5)),
        panel.grid.minor = element_blank(), 
        plot.background = element_rect(fill = "white"),
        plot.margin = unit(c(0,0,0,0), "lines"),
        legend.position = c(0.7, 0.04))


#### 2 Spatial join to stations and calculate count within a certain distance
# Bike Trips(Transform Projection)
tripcounts<-read.csv("FPhiladelphia.csv")
tripxy <- tripcounts %>%
  dplyr::select(lon,lat) %>%
  as.matrix()
tripcounts<-tripcounts%>%
  dplyr::select(-lon,-lat)
tripxy_coor<-project(tripxy,"+proj=lcc +lat_1=40.96666666666667 +lat_2=39.93333333333333 +lat_0=39.33333333333334 +lon_0=-77.75 +x_0=600000 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=us-ft +no_defs")
tripPhilly<-cbind(tripcounts,tripxy_coor) 
#final projected stations(sf) below:
tripPhillysf <- st_as_sf(tripPhilly, coords = c("lon","lat")) 
# make sure the projection are all same with the city
st_crs(tripPhillysf) = st_crs(PhillyBoundary)
st_crs(Job) = st_crs(PhillyBoundary)
# spatial join
StationJobjoin = st_join(tripPhillysf, Job, st_is_within_distance, dist = 2640)
# summarize
StationJobjoinF<-
  StationJobjoin%>% # This is our final dataset!!!!!!!
  group_by(start_station_id) %>%
  summarize(JobCount = sum(s000, na.rm = TRUE))%>%    
  st_sf()
# Add job count variable to original dataset
tripPhilly$JobCount<-
  StationJobjoinF$JobCount

7.8. Spatial Lag

library(spdep)

#create coordinates
coords <- cbind(nearby_stations$lon, nearby_stations$lat)

#get each point's 5 nearest neighbors
neighborPoints <- knn2nb(knearneigh(coords, 5))

#calculate spatial weights
spatialWeights <- nb2listw(neighborPoints, style="W")

#create the spatial lag (average trip count of the 5 nearest stations)
Count.Lag <- lag.listw(spatialWeights, nearby_stations$count)

#create an output data frame
output <- cbind(nearby_stations,data.frame(LagCount = Count.Lag))

7.9. Boxplot

#Read data of spatial predictors such as distance
Philly_spatial<-read.csv("philly_distance.csv")
Boston_spatial<-read.csv("Boston_distance.csv")
Chicago_spatial<-read.csv("Chica_distance.csv")
Minne_spatial<-read.csv("Minne_distance.csv")
DC_spatial<-read.csv("DCWashington_distance.csv")

#Combine them together
SpatialAll<-rbind(Philly_spatial,Boston_spatial,Chicago_spatial,Minne_spatial,DC_spatial)

#Select and gather it
Spatial_plot<-SpatialAll %>% dplyr::select(-start_station_id,-lon,-lat,-count,-dis_2school,-dis_5crime)%>%
                             gather(variable,value,dis_sub,dis_3bus,dis_5rest,dis_3spmkt,dis_univ,
                                   dis_tourism,dis_CBD,dis_parks,CrimeCount)

#Create boxplot
Spatial_boxplot<-
  ggplot(data = Spatial_plot, aes(city,value)) +
  geom_boxplot(aes(fill=city),width=9,alpha=0.8) +  
  facet_wrap(~variable,scales="free",ncol=3) +
  scale_fill_manual(values = c("#fc8861","#e85166","#b63578","#812781","#1c1148"),
                    name = "Cities") +
  labs(title="Variable Distributions across Cities",
       x="Cities",
       y="Value") +
  theme(plot.title = element_text(size =20,colour = "black"))

Spatial_boxplot

7.10. Modeling

#### Model Building ####
# packages used in this script
library(caret)         # createDataPartition
library(dplyr)
library(glmnet)        # ridge, lasso, elastic net
library(gbm)
library(randomForest)

# `y` is the outcome vector used to stratify the 60/40 partition;
# bike_Philly / bikeno_Philly are the Philadelphia and peer-city datasets
# prepared earlier.

#OLS
set.seed(777)
mapetable<-data.frame()
id<-bike_Philly$start_station_id
predicttable<-data.frame(id)
for (i in 1:100){
  partition <- createDataPartition(y,p=.6,list=F)
  training_Philly <- bike_Philly[partition,]
  testing <- bike_Philly[-partition,]
  id<-testing$start_station_id
  training <-rbind(bikeno_Philly,training_Philly)
  ols_model4<-lm(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                   LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                   dis_10rest +dis_10crime + dis_sub + hf_dis_sub +
                   dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                   isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens,
                 data=training)
  pred_ols4<-predict(ols_model4,testing)
  MAPE_ols4<-mean(abs(exp(pred_ols4)-testing$count2)/testing$count2)
  MAE_ols4<-mean(abs(exp(pred_ols4)-testing$count2))
  MAPEMAE_ols4<-data_frame(MAPE_ols4,MAE_ols4)
  Predited_Value<-data_frame(abs(exp(pred_ols4)),id)
  predicttable<-left_join(predicttable,Predited_Value,by="id",all=TRUE)
  mapetable<-rbind(MAPEMAE_ols4,mapetable)
}
id<-bike_Philly$start_station_id
Predict_Count<-rowMeans(predicttable[,2:101],na.rm=TRUE)
# Actual_Count, lon and lat are columns prepared earlier for the Philadelphia stations
ols<-data_frame(id,Predict_Count,Actual_Count,lon,lat)
ols$Residual<-abs(ols$Actual_Count-ols$Predict_Count)
ols$error<-ols$Residual/ols$Actual_Count


MAPE_ols<-mean(mapetable$MAPE_ols4)
MAE_ols<-mean(mapetable$MAE_ols4)
Total_Trips_ols<-sum(ols$Predict_Count)
Total_Trips_error_ols<-abs(sum(ols$Predict_Count)-sum(ols$Actual_Count))/sum(ols$Actual_Count)

MAPE_ols
MAE_ols
m1<-data_frame("OLS",MAPE_ols,MAE_ols,Total_Trips_ols,Total_Trips_error_ols)

m1



#Ridge


set.seed(777)
mapetable2<-data.frame()
id<-bike_Philly$start_station_id
predicttable2<-data.frame(id)
for (i in 1:100){
  partition <- createDataPartition(y,p=.6,list=F)
  training_Philly <- bike_Philly[partition,]
  testing <- bike_Philly[-partition,]
  training <-rbind(bikeno_Philly,training_Philly)
  trainingx <-training %>%
    dplyr::select(LnMedVal,hosng_n,pBac,pBi, pCa, Age,LnPop,LagCount8 ,dist_8_sta,inter20 , Num_Docks ,job05 , Lane_dens, LowInc,MediumInc,dis_10rest ,dis_10crime , dis_sub , hf_dis_sub ,dis_CBD_s2 , dis_bus_qr_inter , qr_dis_3sta , dis_3sta_qr_inter,LagCount3_dis_inter,isCBD, Num_Docks_CBD , zoneother, Zone_income, popdens)
  testingx <-testing %>%
    dplyr::select(LnMedVal,hosng_n,pBac,pBi, pCa, Age,LnPop,LagCount8 ,dist_8_sta,inter20 , Num_Docks ,job05 , Lane_dens, LowInc,MediumInc,
                  dis_10rest ,dis_10crime , dis_sub , hf_dis_sub ,dis_CBD_s2 , dis_bus_qr_inter , qr_dis_3sta ,
                  dis_3sta_qr_inter,LagCount3_dis_inter,isCBD, Num_Docks_CBD , zoneother, Zone_income, popdens)
  fit.ridge.cv <- cv.glmnet(data.matrix(trainingx), training$Ln_count, alpha=0,type.measure="mae")
  y_hat_ridge <- predict.cv.glmnet(fit.ridge.cv,data.matrix(testingx))
  MAPE_ridge<-mean(abs(exp(y_hat_ridge)-testing$count2)/testing$count2)
  MAE_ridge<-mean(abs(exp(y_hat_ridge)-testing$count2))
  MAPEMAEridge<-data_frame(MAPE_ridge,MAE_ridge)
  p<-abs(exp(y_hat_ridge))
  id<-testing$start_station_id
  Predited_Value<-data.frame(p,id)
  predicttable2<-left_join(predicttable2,Predited_Value,by="id",all=TRUE)
  mapetable2<-rbind(MAPEMAEridge,mapetable2)
}


MAPE_ridge<-mean(mapetable2$MAPE_ridge)
MAPE_ridge
MAE_ridge<-mean(mapetable2$MAE_ridge)
MAE_ridge

Predict_Count<-rowMeans(predicttable2[,2:101],na.rm=TRUE)
id<-bike_Philly$start_station_id
ridge<-data_frame(id,Predict_Count,Actual_Count,lon,lat)
ridge$Residual<-abs(ridge$Actual_Count-ridge$Predict_Count)
ridge$error<-ridge$Residual/ridge$Actual_Count

Total_Trips_ridge<-sum(ridge$Predict_Count)
Total_Trips_error_ridge<-abs(sum(ridge$Predict_Count)-sum(ridge$Actual_Count))/sum(ridge$Actual_Count)
m2<-data_frame("Ridge",MAPE_ridge,MAE_ridge,Total_Trips_ridge,Total_Trips_error_ridge)
m2

#Lasso
set.seed(777)
mapetable3<-data.frame()
id<-bike_Philly$start_station_id
predicttable3<-data.frame(id)
for (i in 1:100){
  partition <- createDataPartition(y,p=.6,list=F)
  training_Philly <- bike_Philly[partition,]
  testing <- bike_Philly[-partition,]
  training <-rbind(bikeno_Philly,training_Philly)
  trainingx <-training %>%
    dplyr::select(LnMedVal,hosng_n,pBac,pBi, pCa, Age,LnPop,LagCount8 ,dist_8_sta,inter20 , Num_Docks ,job05 , Lane_dens, LowInc,MediumInc,dis_10rest ,dis_10crime , dis_sub , hf_dis_sub ,dis_CBD_s2 , dis_bus_qr_inter , qr_dis_3sta ,dis_3sta_qr_inter,LagCount3_dis_inter,isCBD, Num_Docks_CBD , zoneother, Zone_income, popdens)
  testingx <-testing %>%
    dplyr::select(LnMedVal,hosng_n,pBac,pBi, pCa, Age,LnPop,LagCount8 ,dist_8_sta,inter20 , Num_Docks ,job05 , Lane_dens, LowInc,MediumInc, dis_10rest ,dis_10crime , dis_sub , hf_dis_sub ,dis_CBD_s2 , dis_bus_qr_inter , qr_dis_3sta ,dis_3sta_qr_inter,LagCount3_dis_inter,isCBD, Num_Docks_CBD , zoneother, Zone_income, popdens)
  fit.Lasso.cv <- cv.glmnet(data.matrix(trainingx), training$Ln_count, alpha=1,type.measure="mae")
  y_hat_Lasso <- predict.cv.glmnet(fit.Lasso.cv,data.matrix(testingx))
  MAPE_Lasso<-mean(abs(exp(y_hat_Lasso)-testing$count2)/testing$count2)
  MAE_Lasso<-mean(abs(exp(y_hat_Lasso)-testing$count2))
  MAPEMAELasso<-data_frame(MAPE_Lasso,MAE_Lasso)
  p<-abs(exp(y_hat_Lasso))
  id<-testing$start_station_id
  Predited_Value<-data.frame(p,id)
  predicttable3<-left_join(predicttable3,Predited_Value,by="id",all=TRUE)
  mapetable3<-rbind(MAPEMAELasso,mapetable3)}
MAPE_lasso<-mean(mapetable3$MAPE_Lasso)
MAE_lasso<-mean(mapetable3$MAE_Lasso)

Predict_Count<-rowMeans(predicttable3[,2:101],na.rm=TRUE)
id<-bike_Philly$start_station_id
lasso<-data_frame(id,Predict_Count,Actual_Count,lon,lat)
lasso$Residual<-abs(lasso$Actual_Count-lasso$Predict_Count)
lasso$error<-lasso$Residual/lasso$Actual_Count

Total_Trips_lasso<-sum(lasso$Predict_Count)
Total_Trips_error_lasso<-abs(sum(lasso$Predict_Count)-sum(lasso$Actual_Count))/sum(lasso$Actual_Count)
m3<-data_frame("lasso",MAPE_lasso,MAE_lasso,Total_Trips_lasso,Total_Trips_error_lasso)
m3

#Elastic Net
set.seed(1000)
mapetable4<-data.frame()
id<-bike_Philly$start_station_id
predicttable4<-data.frame(id)
for (i in 1:100){
  partition <- createDataPartition(y,p=.6,list=F)
  training_Philly <- bike_Philly[partition,]
  testing <- bike_Philly[-partition,]
  training <-rbind(bikeno_Philly,training_Philly)
  trainingx <-training %>%
    dplyr::select(LnMedVal,hosng_n,pBac,pBi, pCa, Age,LnPop,LagCount8 ,dist_8_sta,inter20 , Num_Docks ,job05 , Lane_dens, LowInc,MediumInc, dis_10rest ,dis_10crime , dis_sub , hf_dis_sub ,dis_CBD_s2 , dis_bus_qr_inter , qr_dis_3sta ,dis_3sta_qr_inter,LagCount3_dis_inter,isCBD, Num_Docks_CBD , zoneother, Zone_income, popdens)
  testingx <-testing %>%
    dplyr::select(LnMedVal,hosng_n,pBac,pBi, pCa, Age,LnPop,LagCount8 ,dist_8_sta,inter20 , Num_Docks ,job05 , Lane_dens, LowInc,MediumInc,dis_10rest ,dis_10crime , dis_sub , hf_dis_sub ,dis_CBD_s2 , dis_bus_qr_inter, qr_dis_3sta,dis_3sta_qr_inter,LagCount3_dis_inter,isCBD, Num_Docks_CBD , zoneother, Zone_income, popdens)
  fit.net.cv <- cv.glmnet(data.matrix(trainingx), training$Ln_count, alpha=.5,type.measure="mae")
  y_hat_net <- predict.cv.glmnet(fit.net.cv,data.matrix(testingx))
  MAPE_net<-mean(abs(exp(y_hat_net)-testing$count2)/testing$count2)
  MAE_net<-mean(abs(exp(y_hat_net)-testing$count2))
  MAPEMAEnet<-data_frame(MAPE_net,MAE_net)
  p<-abs(exp(y_hat_net))
  id<-testing$start_station_id
  Predited_Value<-data.frame(p,id)
  predicttable4<-left_join(predicttable4,Predited_Value,by="id",all=TRUE)
  mapetable4<-rbind(MAPEMAEnet,mapetable4)
}
MAPE_net<-mean(mapetable4$MAPE_net)
MAE_net<-mean(mapetable4$MAE_net)

Predict_Count<-rowMeans(predicttable4[,2:101],na.rm=TRUE)
id<-bike_Philly$start_station_id
net<-data_frame(id,Predict_Count,Actual_Count,lon,lat)
net$Residual<-abs(net$Actual_Count-net$Predict_Count)
net$error<-net$Residual/net$Actual_Count

Total_Trips_net<-sum(net$Predict_Count)
Total_Trips_error_net<-abs(sum(net$Predict_Count)-sum(net$Actual_Count))/sum(net$Actual_Count)
m4<-data_frame("net",MAPE_net,MAE_net,Total_Trips_net,Total_Trips_error_net)
m4

#Gradient Boosting
set.seed(300)
mapetable5<-data.frame()
id<-bike_Philly$start_station_id
predicttable5<-data.frame(id)
for (i in 1:100){
  partition <- createDataPartition(y,p=.6,list=F)
  training_Philly <- bike_Philly[partition,]
  testing <- bike_Philly[-partition,]
  training <-rbind(bikeno_Philly,training_Philly)
  ols_model4<-lm(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                   LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                   dis_10rest +dis_10crime + dis_sub + hf_dis_sub +
                   dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                   isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens,
                 data=training)
  
  training$pred_ols<-predict(ols_model4,training)
  testing$pred_ols<-predict(ols_model4,testing)
  
  fit_gbm_Philly <- gbm(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                          LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                          dis_10rest +dis_10crime + dis_sub + hf_dis_sub +
                          dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                          isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens+pred_ols,
                        distribution = "gaussian",
                        n.trees = 50,
                        shrinkage = 0.1,
                        interaction.depth = 2,
                        n.minobsinnode = 10,
                        data=training)
  y_hat_gbm_Philly <- predict(fit_gbm_Philly,testing,n.trees = 50)
  MAPE_gbm<-mean(abs(exp(y_hat_gbm_Philly)-testing$count2)/testing$count2)
  MAPE_gbm<-data_frame(MAPE_gbm)
  MAE_gbm<-mean(abs(exp(y_hat_gbm_Philly)-testing$count2))
  MAPEMAEgbm<-data.frame(MAPE_gbm,MAE_gbm)
  p<-abs(exp(y_hat_gbm_Philly))
  id<-testing$start_station_id
  Predited_Value<-data.frame(p,id)
  predicttable5<-left_join(predicttable5,Predited_Value,by="id",all=TRUE)
  mapetable5<-rbind(MAPEMAEgbm,mapetable5)
}
MAPE_gbm<-mean(mapetable5$MAPE_gbm)
MAE_gbm<-mean(mapetable5$MAE_gbm)

Predict_Count<-rowMeans(predicttable5[,2:101],na.rm=TRUE)
id<-bike_Philly$start_station_id
gbm<-data_frame(id,Predict_Count,Actual_Count,lon,lat)
gbm$Residual<-abs(gbm$Actual_Count-gbm$Predict_Count)
gbm$error<-gbm$Residual/gbm$Actual_Count

Total_Trips_gbm<-sum(gbm$Predict_Count)
Total_Trips_error_gbm<-abs(sum(gbm$Predict_Count)-sum(gbm$Actual_Count))/sum(gbm$Actual_Count)
m5<-data_frame("gbm",MAPE_gbm,MAE_gbm,Total_Trips_gbm,Total_Trips_error_gbm)
m5


#Random Forest
set.seed(777)
mapetable6<-data.frame()
id<-bike_Philly$start_station_id
predicttable6<-data.frame(id)
for (i in 1:100){
  partition <- createDataPartition(y,p=.6,list=F)
  training_Philly <- bike_Philly[partition,]
  testing <- bike_Philly[-partition,]
  training <-rbind(bikeno_Philly,training_Philly)
  ols_model4<-lm(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                   LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                   dis_10rest +dis_10crime + dis_3tourism + dis_sub + hf_dis_sub +
                   dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                   isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens,
                 data=training)
  
  training$pred_ols<-predict(ols_model4,training)
  testing$pred_ols<-predict(ols_model4,testing)
  
  fit_rf <- randomForest(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                           LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                           dis_10rest +dis_10crime + dis_3tourism + dis_sub + hf_dis_sub +
                           dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                           isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens+pred_ols,
                         method="anova", 
                         ntree = 1000,
                         data = training)
  
  y_hat_rf <- predict(fit_rf,testing)
  
  MAPE_RF <- mean(abs(exp(y_hat_rf) - testing$count2)/testing$count2)
  MAE_RF<-mean(abs(exp(y_hat_rf)-testing$count2))
  MAPEMAERF<-data.frame(MAPE_RF,MAE_RF)
  p<-abs(exp(y_hat_rf))
  id<-testing$start_station_id
  Predited_Value<-data.frame(p,id)
  predicttable6<-left_join(predicttable6,Predited_Value,by="id",all=TRUE)
  mapetable6<-rbind(MAPEMAERF,mapetable6)
}


MAPE_RF<-mean(mapetable6$MAPE_RF)
MAE_RF<-mean(mapetable6$MAE_RF)

Predict_Count<-rowMeans(predicttable6[,2:101],na.rm=TRUE)
id<-bike_Philly$start_station_id
RF<-data_frame(id,Predict_Count,Actual_Count,lon,lat)
RF$Residual<-abs(RF$Actual_Count-RF$Predict_Count)
RF$error<-RF$Residual/RF$Actual_Count

Total_Trips_RF<-sum(RF$Predict_Count)
Total_Trips_error_RF<-abs(sum(RF$Predict_Count)-sum(RF$Actual_Count))/sum(RF$Actual_Count)
m6<-data_frame("RF",MAPE_RF,MAE_RF,Total_Trips_RF,Total_Trips_error_RF)
m6
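
Note that predicttable6 gains one prediction column per iteration, and a station only receives a value in iterations where it fell into the test set; rowMeans(..., na.rm = TRUE) above therefore averages each station's out-of-sample predictions. A toy illustration with made-up numbers:

# Toy data (made-up): each run column is one train/test split; NA means the
# station was in the training set that iteration.
toy <- data.frame(id   = c(1, 2, 3),
                  run1 = c(120, NA, 85),
                  run2 = c(NA, 64, 90))
rowMeans(toy[, 2:3], na.rm = TRUE)  # per-station mean of out-of-sample predictions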


#### Final Model ####
set.seed(777)
mapetable7<-data.frame()
id<-bike_Philly$start_station_id
predicttable7<-data.frame(id)
for (i in 1:100){
  partition <- createDataPartition(y,p=.6,list=F)
  training_Philly <- bike_Philly[partition,]
  testing <- bike_Philly[-partition,]
  training <-rbind(bikeno_Philly,training_Philly)
  ols_model4<-lm(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                   LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                   dis_10rest +dis_10crime + dis_3tourism + dis_sub + hf_dis_sub +
                   dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                   isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens,
                 data=training)
  
  training$pred_ols<-predict(ols_model4,training)
  testing$pred_ols<-predict(ols_model4,testing)
  
  fit_rf <- randomForest(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                           LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                           dis_10rest +dis_10crime + dis_3tourism + dis_sub + hf_dis_sub +
                           dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                           isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens+pred_ols,
                         method="anova", 
                         ntree = 1000,
                         data = training)
  
  y_hat_rf <- predict(fit_rf,testing)
  
  fit_gbm_Philly <- gbm(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                          LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                          dis_10rest +dis_10crime + dis_sub + hf_dis_sub +
                          dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                          isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens+pred_ols,
                        distribution = "gaussian",
                        n.trees = 50,
                        shrinkage = 0.1,
                        interaction.depth = 2,
                        n.minobsinnode = 10,
                        data=training)
  y_hat_gbm_Philly <- predict(fit_gbm_Philly,testing,n.trees = 50)
  
  y_hat_all<-0.8*testing$pred_ols+0.15* y_hat_rf+0.05*y_hat_gbm_Philly
  
  MAPE_all <- mean(abs(exp(y_hat_all) - testing$count2)/testing$count2)
  MAE_all<-mean(abs(exp(y_hat_all)-testing$count2))
  MAPEMAEall<-data.frame(MAPE_all,MAE_all)
  p<-exp(y_hat_all)
  id<-testing$start_station_id
  Predicted_Value<-data.frame(p,id)
  predicttable7<-left_join(predicttable7,Predicted_Value,by="id")
  mapetable7<-rbind(MAPEMAEall,mapetable7)
}


MAPE_all<-mean(mapetable7$MAPE_all)
MAE_all<-mean(mapetable7$MAE_all)

Predict_Count<-rowMeans(predicttable7[,2:101],na.rm=TRUE)
id<-bike_Philly$start_station_id
all<-data_frame(id,Predict_Count,Actual_Count,lon,lat)
all$Residual<-abs(all$Actual_Count-all$Predict_Count)
all$error<-all$Residual/all$Actual_Count

Total_Trips_all<-sum(all$Predict_Count)
Total_Trips_error_all<-abs(sum(all$Predict_Count)-sum(all$Actual_Count))/sum(all$Actual_Count)
m7<-data_frame("final model",MAPE_all,MAE_all,Total_Trips_all,Total_Trips_error_all)
m7
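
The final model is a fixed-weight blend of the three learners, combined on the log scale before exponentiating back to a trip count. The short sketch below walks through that arithmetic with hypothetical log-count values; only the weights are taken from the model above.

# Illustrative only: hypothetical log-scale predictions for one station.
w <- c(ols = 0.80, rf = 0.15, gbm = 0.05)       # final-model ensemble weights
log_preds <- c(ols = 6.2, rf = 5.9, gbm = 6.0)  # made-up values
ensemble_log <- sum(w * log_preds)              # weighted average on the log scale
exp(ensemble_log)                               # back-transformed predicted trip count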

7.11. Spatial Cross-validation
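To check that the final model generalizes across space and not just across random splits, we hold out every station within one mile (5,280 ft) of the CBD, train on the remaining stations, and then reverse the split, comparing MAPE and MAE across the two folds.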

  training_lowcbd<-bike_Philly %>% filter(dis_CBD>=5280)
  training_lowcbd<-rbind(bikeno_Philly,training_lowcbd)
  testing_lowcbd<-bike_Philly %>% filter(dis_CBD<5280)
  
  ols_model_lowcbd<-lm(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                          LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                          dis_10rest +dis_10crime + dis_sub + hf_dis_sub +
                          dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                          isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens,
                        data=training_lowcbd)
  
  pred_ols_lowcbd<-predict(ols_model_lowcbd,testing_lowcbd)
  
  training_lowcbd$pred_ols_lowcbd<-ols_model_lowcbd$fitted.values
  testing_lowcbd$pred_ols_lowcbd<-pred_ols_lowcbd
  
  fit_rf_lowcbd <- randomForest(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                                   LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                                   dis_10rest +dis_10crime + dis_3tourism + dis_sub + hf_dis_sub +
                                   dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                                   isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens+pred_ols_lowcbd,
                                 method="anova", 
                                 ntree = 1000,
                                 data = training_lowcbd)
  
  pred_rf_lowcbd <- predict(fit_rf_lowcbd,testing_lowcbd)
  
  fit_gbm_lowcbd <- gbm(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                           LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                           dis_10rest +dis_10crime + dis_sub + hf_dis_sub +
                           dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                           isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens+pred_ols_lowcbd,
                         distribution = "gaussian",
                         n.trees = 50,
                         shrinkage = 0.1,
                         interaction.depth = 2,
                         n.minobsinnode = 10,
                         data=training_lowcbd)
  pred_gbm_lowcbd <- predict(fit_gbm_lowcbd,testing_lowcbd,n.trees = 50)
  
  # Blend the three log-scale predictions with the final-model weights
  pred_lowcbd<-0.8*pred_ols_lowcbd+0.15*pred_rf_lowcbd+0.05*pred_gbm_lowcbd
  mape_lowcbd<-mean(abs(exp(pred_lowcbd)-testing_lowcbd$count2)/testing_lowcbd$count2)
  mae_lowcbd<-mean(abs(exp(pred_lowcbd)-testing_lowcbd$count2))
  mape_lowcbd
  mae_lowcbd
  
  
  
  
  
  training_highcbd<-bike_Philly %>% filter(dis_CBD<5280)
  training_highcbd<-rbind(bikeno_Philly,training_highcbd)
  testing_highcbd<-bike_Philly %>% filter(dis_CBD>=5280)
  
  ols_model_highcbd<-lm(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                         LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                         dis_10rest +dis_10crime + dis_sub + hf_dis_sub +
                         dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                         isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens,
                       data=training_highcbd)
  
  pred_ols_highcbd<-predict(ols_model_highcbd,testing_highcbd)
  
  training_highcbd$pred_ols_highcbd<-ols_model_highcbd$fitted.values
  testing_highcbd$pred_ols_highcbd<-pred_ols_highcbd
  
  fit_rf_highcbd <- randomForest(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                                  LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                                  dis_10rest +dis_10crime + dis_3tourism + dis_sub + hf_dis_sub +
                                  dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                                  isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens+pred_ols_highcbd,
                                method="anova", 
                                ntree = 1000,
                                data = training_highcbd)
  
  pred_rf_highcbd <- predict(fit_rf_highcbd,testing_highcbd)
  
  fit_gbm_highcbd <- gbm(Ln_count~LnMedVal + hosng_n + pBac + pBi + pCa + Age + LnPop + IcLv +
                          LagCount8 + dist_8_sta +inter20 + Num_Docks +job05 + Lane_dens+ 
                          dis_10rest +dis_10crime + dis_sub + hf_dis_sub +
                          dis_CBD_s2 + dis_bus_qr_inter + qr_dis_3sta +dis_3sta_qr_inter+LagCount3_dis_inter+
                          isCBD + Num_Docks_CBD + zoneother + Zone_income + popdens+pred_ols_highcbd,
                        distribution = "gaussian",
                        n.trees = 50,
                        shrinkage = 0.1,
                        interaction.depth = 2,
                        n.minobsinnode = 10,
                        data=training_highcbd)
  pred_gbm_highcbd <- predict(fit_gbm_highcbd,testing_highcbd,n.trees = 50)
  
  # Blend the three log-scale predictions with the final-model weights
  pred_highcbd<-0.8*pred_ols_highcbd+0.15*pred_rf_highcbd+0.05*pred_gbm_highcbd
  mape_highcbd<-mean(abs(exp(pred_highcbd)-testing_highcbd$count2)/testing_highcbd$count2)
  mae_highcbd<-mean(abs(exp(pred_highcbd)-testing_highcbd$count2))
  mape_highcbd
  mae_highcbd
  
CV_cbd_close<-data.frame(MAPE=round(mape_lowcbd,digits=4),
                         MAE=as.integer(mae_lowcbd))

CV_cbd_far<-data.frame(MAPE=round(mape_highcbd,digits=4),
                       MAE=as.integer(mae_highcbd))

CV_cbd<-rbind(CV_cbd_close,CV_cbd_far)


rownames(CV_cbd)<-c("<1 mile",">=1 mile")

CV_cbd_table<-kable(CV_cbd,format='html',caption='Distance to CBD',
                        align='c',format.args = list(big.mark = ",")) %>%
                 kable_styling(bootstrap_options = "striped",full_width = T,font_size = 15) %>%  # bootstrap_options style HTML tables; latex_options have no effect here
                 row_spec(0, bold = T, color = "black") %>%
                 row_spec(1:2, font_size = 15 )
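
Since the two folds above repeat the same fit-and-score pattern with the holdout side flipped, the split logic can be wrapped in a single helper. Below is a minimal sketch under stated assumptions: spatial_fold_error() is our illustrative name, the objects bike_Philly and bikeno_Philly and a formula object fmla holding the predictor list above are assumed to exist, dis_CBD is measured in feet, and only the OLS stage is shown for brevity.

library(dplyr)

# Minimal sketch: score one spatial fold. hold_close = TRUE holds out stations
# within `cutoff` feet of the CBD; FALSE holds out everything farther away.
spatial_fold_error <- function(hold_close = TRUE, cutoff = 5280) {
  test  <- filter(bike_Philly, if (hold_close) dis_CBD <  cutoff else dis_CBD >= cutoff)
  train <- rbind(bikeno_Philly,
                 filter(bike_Philly, if (hold_close) dis_CBD >= cutoff else dis_CBD <  cutoff))
  fit  <- lm(fmla, data = train)   # OLS stage only; RF and GBM would follow the same pattern
  pred <- exp(predict(fit, test))  # back-transform log counts
  c(MAPE = mean(abs(pred - test$count2) / test$count2),
    MAE  = mean(abs(pred - test$count2)))
}
# spatial_fold_error(hold_close = TRUE)   # stations < 1 mile from the CBD held out
# spatial_fold_error(hold_close = FALSE)  # stations >= 1 mile held out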