This project is part of the Spring 2021 Master of Urban Spatial Analytics Practicum course at the University of Pennsylvania taught by Ken Steif, Michael Fichman, and Matthew Harris. We are incredibly thankful for their their time, knowledge, and generosity during this challenging and fully online semester. Additionally, we want to take this opportunity to thank Daniel Lodise and Dave Maynard from Philadelphia City Councilmember Isaiah Thomas’ office for sharing their perspectives on policy decision-making for the nighttime economy. Additionally, we are incredibly grateful to Kae Anderson from the Fishtown Business Improvement District for fantastic insights about how economic development professionals manage corridors and make decisions. We also want to acknowledge Andrew Renninger and Eugene Chong for their invaluable support accessing the SafeGraph dataset that was integral to this project and sharing general knowledge about the dataset.
Finally we want to acknowledge the support and feedback from our classmates in the MUSA and city planning programs at the University of Pennsylvania. We thank you for your help!
This project studies nightlife mobility patterns in Philadelphia, Pennsylvania and develops a web application for visualizing nightlife activity and predicting future nightlife traffic to commercial corridors. This document provides an overview of our process and key takeaways from our analysis.
While nighttime activity is an important component to Philadelphia’s economy, culture, and identity, there is limited empirical data available to help decision makers govern the nighttime economy. As a result, many of the logistical and policy decisions are largely based on emotional perceptions of the negative externalities of the nighttime economy, such as noise pollution and safety concerns. This project uses the novel SafeGraph dataset to analyze nighttime mobility patterns in Philadelphia and develop a model to predict nighttime trips to commercial corridors. Our model utilizes a variety of factors including retail mix, built environment features, and demographic data. Our findings indicate that changing the corridor retail mix impacts traffic during evening hours. Additionally, our research provides new quantitative insights into the behavior of visitors during nighttime hours in Philadelphia which has important implications for economic development in the city, particularly in the post COVID-19 climate.
All source code can be found throughout the document by pressing the code buttons throughout the report. Additional code we used for setup, data loading, and app development can be found in the appendix at the end of the document.
Broadly defined as economic activity that takes place during the evening hours and centered around food, drink, and entertainment, the nighttime economy is an important component for any vibrant city. Philadelphia, the focus of this study, has a dynamic nightlife economy that features a dynamic arts and culture scene. In 2017, The Philadelphia Cultural Alliance estimated that arts and culture in the city has a total economic impact of over $4 billion in the region. Philadelphia is also home to a renowned selection of restaurants that have an important economic impact on the city. A 2019 report by the Economy League of Greater Philadelphia and the Philadelphia Department of Public Health estimated that food jobs account for 12% of all jobs in the city and food related businesses account for 18% of all firms. These sectors are clearly an important piece of the wider economy with much of their activity and business taking place during the evening hours.
Despite Philadelphia’s nighttime economy being both culturally and economically important to the city, there is relatively limited knowledge of the nighttime economy’s impact at the neighborhood or corridor scale. As many policy and logistical decisions are drawn from perceptions of the nighttime economy’s negative externalities, such as noise pollution and safety concerns, there is a lack of data explaining the positive effects of the nighttime economy at this hyper-local level. Therefore, a key motivation of this project is to better understand how the nighttime economy contributes to commercial corridor traffic, which we understand as a proxy for local economic activity. We hope that this research will help policy makers better weigh the costs and benefits when making decisions governing the nighttime economy.
A separate motivation for this project comes from the unprecedented economic shock of the COVID-19 pandemic beginning in March 2020 and continuing through the lifespan of this project in Spring 2021. As shown in the below Figure 1.2.1, traffic to restaurants, bars, and arts establishments have decreased by about 50% compared to 2018 measurements. While the days of the pandemic appear to wane in Philadelphia, there is still much uncertainty about how these establishments will rebound from this devastating event. As these business types that are integral to the nighttime economy largely depend on gathering large groups of people indoors, it is important for economic development professionals in Philadelphia to have a deep understanding of how the possible benefits of investing in the nighttime economy can benefit the city’s broader economy and help the city rebound from the pandemic.
dat_2018 <-
dat %>%
mutate(popularity_by_hour = str_remove_all(popularity_by_hour, pattern = "\\[|\\]")) %>% #remove brackets
unnest(popularity_by_hour) %>% #unnest values
separate(.,
popularity_by_hour,
c("18",
"19", "20", "21", "22", "23"),
sep = ",") %>%
mutate(.,NightVisits2018 = as.numeric(`18`) + as.numeric(`19`) + as.numeric(`20`) + as.numeric(`21`) + as.numeric(`22`) + as.numeric(`23`))
dat_2018 <- dat_2018 %>%
dplyr::select(safegraph_place_id,
location_name,
date_range_start,
date_range_end,
raw_visit_counts,
raw_visitor_counts,
popularity_by_day,
NightVisits2018) %>%
rename(raw_visit_counts2018 = raw_visit_counts,
raw_visitor_counts2018 = raw_visitor_counts,
popularity_by_day2018 = popularity_by_day) %>%
mutate(month = substring(date_range_start,6,7))
dat_2020 <-
data2020 %>%
mutate(popularity_by_hour = str_remove_all(popularity_by_hour, pattern = "\\[|\\]")) %>% #remove brackets
unnest(popularity_by_hour) %>% #unnest values
separate(.,
popularity_by_hour,
c("18",
"19", "20", "21", "22", "23"),
sep = ",") %>%
mutate(.,NightVisits2020 = as.numeric(`18`) + as.numeric(`19`) + as.numeric(`20`) + as.numeric(`21`) + as.numeric(`22`) + as.numeric(`23`))
dat_join <- dat_2020 %>%
dplyr::select(safegraph_place_id,
location_name,
date_range_start,
date_range_end,
raw_visit_counts,
raw_visitor_counts,
popularity_by_day,
NightVisits2020) %>%
rename(raw_visit_counts2020 = raw_visit_counts,
raw_visitor_counts2020 = raw_visitor_counts,
popularity_by_day2020 = popularity_by_day) %>%
mutate(month = substring(date_range_start,6,7)) %>%
inner_join(., dat_2018, by = c("safegraph_place_id", "month", "location_name")) %>%
mutate(Month = case_when(month == "01" ~ "January",
month == "02" ~ "February",
month == "03" ~ "March",
month == "04" ~ "April",
month == "05" ~ "May",
month == "06" ~ "Jun",
month == "07" ~ "May",
month == "08" ~ "May",
month == "09" ~ "May",
month == "10" ~ "May",
month == "11" ~ "May",
month == "12" ~ "May"))
dat_join <- dat_join %>%
dplyr::select(safegraph_place_id,
location_name,
raw_visit_counts2018,
raw_visit_counts2020,
NightVisits2018,
NightVisits2020,
popularity_by_day2018,
popularity_by_day2020,
month) %>%
left_join(., phila, by = "safegraph_place_id") %>%
st_as_sf() %>%
st_transform('ESRI:102728')
dat_citywide <- dat_join %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
group_by(month) %>%
summarize(Total_Visits2018 = sum(NightVisits2018),
Total_Visits2020 = sum(NightVisits2020)) %>%
mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100)
#Separate by commercial use
dat_citywide2 <- dat_join %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
mutate(category = case_when(top_category == "Drinking Places (Alcoholic Beverages)" ~ "Bars",
top_category == "Restaurants and Other Eating Places" ~ "Restaurants",
top_category == "Promoters of Performing Arts, Sports, and Similar Events" ~ "Arts",
top_category == "Performing Arts Companies" ~ "Arts")) %>%
group_by(category, month) %>%
summarize(Total_Visits2018 = sum(NightVisits2018),
Total_Visits2020 = sum(NightVisits2020)) %>%
mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100)
#Generate lineplot
rbind(
(dat_join %>%
group_by(month) %>%
summarize(Total_Visits2018 = sum(NightVisits2018),
Total_Visits2020 = sum(NightVisits2020)) %>%
mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100,
category = "All")),
dat_citywide2) %>%
ggplot(., aes(x = month, y = Percent_Change, group = category, color = category)) +
geom_line(lwd = 1.5) +
geom_hline(yintercept=0, lwd = 1.5, linetype="dotted")+
scale_color_manual(values = c("#d3d3d3", "#440154", "#21908C", "#FDE725")) +
scale_x_discrete(name ="Month")+
scale_y_continuous(name ="Percent Change") +
labs(title = "Philadelphia Commercial Trip % Change, 2018 & 2020",
subtitle = "Figure 1.2.1") +
plotTheme()
Our goal is to help business improvement districts (BIDs) and economic development professionals understand nightlife patterns across Philadelphia’s commercial corridors. Not only do we believe this insight will provide valuable and interesting findings that will deepen the understanding of evening traffic patterns, but also that this can help further recovery efforts following the COVID-19 pandemic. While we believe that there are many potential applications for this work, we envisioned the following specific uses for our research, analysis, and web application.
In this study, we use a combination of public and private datasets to study nighttime trends across commercial corridors in Philadelphia. Our model predicts trips between the hours of 7PM and 12AM at the corridor level. After an iterative process of feature engineering, we built a model that combined features that described retail mix, visitor behaviors, the built environment, and demographic data. Ultimately, we embedded this model into a proof-of-concept web application that allows users to understand the impact of a corridor’s retail mix on the evening trip traffic. While the model has limited accuracy and generalizability indicating that it requires continual calibration, this proof-of-concept web application is an important step in making informed, data-driven policy decisions concerning the nighttime economy.
This document walks through our exploratory data analysis and model-building process in detail and includes relevant code to replicate our analysis. All source code can be found throughout the document by pressing the code buttons throughout the report. Additional code we used for setup, data loading, and app development can be found in the appendix at the end of the document.
SafeGraph is a novel data source that uses anonymized cell phone GPS data to record trips to commercial points of interest. For this project, we used two datasets from SafeGraph: the Monthly Patterns dataset and the Places dataset. For a given point of interest, the Monthly Patterns dataset tells us detailed information on the monthly traffic patterns, such as the number of trips and visitors, the median distance traveled, the median time spent at the establishment, and the origin census block group. Complementing the Patterns dataset, the Places dataset provides detailed information about the given point of interest, including the business category, the hours open, and other descriptive tags to help categorize the place.
The SafeGraph data are brand new and have numerous unexplored applications and insights. The dataset was essential to this project as it allowed us to see trip patterns with unprecedented detail and far higher accuracy than other mobility datasets. While SafeGraph is undoubtedly the cutting edge of mobility data, we encountered some limitations. Primarily, as SafeGraph is based on GPS data, it is inherently noisy with trips incorrectly attributed to certain points of interest. For example, if someone is waiting for a bus outside of an establishment, this may be incorrectly recorded in the data as a trip to the establishment. In dense urban areas where points of interest may be located on multiple floors of the same building or in close proximity to one another, this may result in inaccurate counts across the dataset.
Separately, one component of our app visualizes how many trips come from a given census block group. Due to privacy concerns, SafeGraph only records census block groups with at least 2 devices and any census block group with less than 5 devices are reported as 4. This limits the resolution with which we can see low-traffic census block groups.
We also used data from various sources on Open Data Philly to capture features in the built environment such as transit, park space, and building size. The dataset we used are summarized in the table in Section 4.1. While these features are relatively standard to cities, replicating this analysis and model for other cities would require that they have similar sources available.
Finally, we used 2018 American Community Survey 5-Year Estimates to describe demographic data for each corridor.
As our model predicts nighttime trips at the commercial corridor level, we relied heavily on a shapefile from the City of Philadelphia’s Planning Department which demarcates individual corridors and districts throughout the city. According to the available metadata “locations range from large, regional and specialty destinations to corridors that reflect the evolving economy, culture, and aesthetic traditions of surrounding neighborhoods.” We also use the administrative corridor categories defined in the dataset. a description of these categories is included in the below table.
Corridor Type | Gross Leasable Area | Store Types |
---|---|---|
Neighborhood Subcenter | 10,000 - 35,000 sq.ft. | Convenience store grocery, pharmacy, dry cleaner, deli, etc. |
Neighborhood Center | 30,000 - 120,000 sq.ft. | Supermarket, variety store, bank, pharmacy, post office, etc. |
Community Center | 100,000 - 500, 000 sq. ft. | Discount dept store, home improvement, “big boxes” or equiv., “power center” |
Regional Center | 300,000 - 900,000+ sq.ft. | One or two full-line department stores or equivalent |
Superregional Center | 500,000 - 2,000,000+ sq.ft. | Three or more full-line department stores or equivalent |
Specialty Center | N/A | Specialty goods or services, dining, bars, amusements, arts, etc. |
Figure 2.1.1 below shows the spatial distribution of the different types of commercial corridors across the city.
Before explaining the model building process, it is important to understand the overall trends in the data. In addition to informing which variables to incorporate into the model, this section highlights the descriptive capabilities of the SafeGraph dataset. While our model uses a broad range of datasets to describe Philadelphia’s socio-economic and built environment characteristics, this section primarily focuses on our analysis of the SafeGraph variables.
Key findings from the analysis include:
First, we are curious to see where Philadelphia nightlife establishments are located across the city. The following maps indicate where businesses that contribute to the city’s nightlife economy are located. The categories include:
Throughout the analysis, we pay special attention to bars and restaurants, as these organizations are well distributed throughout the city and apply to a local Philadelphia customer base.
Figure 3.1.1 below shows the spatial patterns of these business types. Bars and restaurants represent the highest number of businesses which are spread across the city. Hotels are mostly clustered in the central district of the city and near the airport in the southwest portion of the city. There are far fewer casinos and arts venues.
dat2 %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
mutate(category = case_when(top_category == "Drinking Places (Alcoholic Beverages)" ~ "Bars",
top_category == "Restaurants and Other Eating Places" ~ "Restaurants",
top_category == "Promoters of Performing Arts, Sports, and Similar Events" ~ "Arts",
top_category == "Performing Arts Companies" ~ "Arts")) %>%
ggplot() +
geom_sf(data = phl_cbg, fill = "grey80", color = "transparent") +
geom_sf(color = "red", size = .1) +
labs(title = "Location of Nightlife Establishments",
subtitle = "Figure 3.1.1") +
facet_wrap(~category, nrow = 1) +
mapTheme()
To look at the spatial distribution of business types another way, we review the distribution of businesses with a fishnet grid. The fishnet allows us to visualize clusters of each business type, as clusters are aggregated to the individual cell level. This allows us to more effectively see the density of these business types more effectively than the above Figure 3.1.1. While all three business types are mostly concentrated within Center City, the below Figure 3.1.2 demonstrates how restaurants are most thoroughly distributed throughout Philadelphia.
fishnet <-
st_make_grid(phl_boundary, cellsize = 1000, square = FALSE) %>%
.[phl_boundary] %>%
st_sf() %>%
mutate(uniqueID = rownames(.))
#Filter restaurants
restaurants <- dat2 %>%
filter(top_category == "Restaurants and Other Eating Places")%>%
mutate(Legend = "Restaurants")
#aggregate restaurant count by fishnet cell
restaurants_net <-
dplyr::select(restaurants) %>%
mutate(countRestaurants = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countRestaurants = replace_na(countRestaurants, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
#Bars
bars <- dat2 %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)") %>%
mutate(Legend = "Bars")
#aggregate bars by fishnet cell
bars_net <-
dplyr::select(bars) %>%
mutate(countBars = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countBars = replace_na(countBars, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE)) %>%
mutate(Legend = "Bars")
#Performing arts
performingarts <- dat2 %>%
filter(top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
mutate(Legend = "Performing Arts")
performingarts_net <-
dplyr::select(performingarts) %>%
mutate(countPerformingarts = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countPerformingarts = replace_na(countPerformingarts, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
# Combining fishnets into a single dataframe
vars_net <-
rbind(restaurants,
bars,
performingarts) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
dplyr::summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend,count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
vars_net.long <- gather(vars_net, Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
#Plotting small multiple maps
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange, c(mapList,
ncol=3, top="Count of Nightlife Businesses per Fishnet",
bottom = "Figure 3.1.2"))
Figure 3.1.3 below breaks down the density of restaurants in a given corridor. While centralized corridors tend to have more restaurants per square mile other smaller corridors in the city support a high concentration of restaurants as well.
Similar to the above, Figure 3.1.4 shows the density of bars along each commercial corridor. Again, we see the corridors around Center City generally showing a higher concentration of this business category. Distinct from the restaurant density map in Figure 3.1.3 above, there seem to be many more corridors with a low density of bars, particularly moving further away from the city’s central core.
Finally, Figure 3.1.5 below shows the density of arts venues by commercial corridor. The majority of corridors have a very low density of performing arts establishments as there are fewer destinations within this category across the city.
Now that we have investigated the distribution of nightlife establishments across Philadelphia, it is important to understand how these establishments have been impacted by the COVID-19 pandemic. Using SafeGraph data, we are able to see that the pandemic has a drastic impact on foot traffic across sectors and space, though not necessarily evenly. First, we observe citywide changes in foot traffic over the course of the pandemic in the below Figure 3.2.1. Comparing 2020 traffic to our 2018 dataset, we see that traffic across the city drastically decreased in March 2020 and represented less than half of the 2018 traffic. This trend continued throughout the year.
#Plot citywide chart
dat_citywide %>%
ggplot(., aes(x = month, y = Percent_Change, fill = Percent_Change)) +
geom_col() +
scale_fill_viridis_c() +
labs(title = "Philadelphia Commercial Trip % Change, 2018 & 2020",
subtitle = "Figure 3.2.1",
y = "Percent Change",
x = "Month",
fill = "Percent Change") +
plotTheme()
The next graphics splits out the percent change in traffic by the business types relevant to our nightlife study: arts venues, bars, and restaurants. The data show arts establishments were able to recover some of their traffic towards the end of 2020. Bars and restaurants, on the other hand, suffered the decreased traffic volumes throughout the end of the year. One possible explanation for this trend is that it may be more likely for arts establishments to hold programming while adhering to COVID-19 safety standards (e.g. mask-wearing and social distancing). Restaurants and bars in Philadelphia, are often in smaller spaces and require that visitors remove masks to partake in the experience.
#Separate by commercial use
dat_citywide2 <- dat_join %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
mutate(category = case_when(top_category == "Drinking Places (Alcoholic Beverages)" ~ "Bars",
top_category == "Restaurants and Other Eating Places" ~ "Restaurants",
top_category == "Promoters of Performing Arts, Sports, and Similar Events" ~ "Arts",
top_category == "Performing Arts Companies" ~ "Arts")) %>%
group_by(category, month) %>%
summarize(Total_Visits2018 = sum(NightVisits2018),
Total_Visits2020 = sum(NightVisits2020)) %>%
mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100)
#Plot citywide chart by commercial use
dat_citywide2 %>%
ggplot(., aes(x = month, y = Percent_Change, fill = Percent_Change)) +
geom_col() +
labs(title = "Philadelphia Commercial Trip % Change, 2018 & 2020",
subtitle = "Figure 3.2.2",
x = "Month",
y = "Percent Change",
fill = "Percent Change") +
scale_fill_viridis_c() +
facet_wrap(~category) +
plotTheme() +
theme(legend.position = "bottom",
axis.text.x = element_text(angle = 45))
In addition to looking at traffic by business type, the data allows us to view change in business traffic across space. The following Figure 3.2.3 aggregates the change in traffic across all nightlife business categories, underscoring the drastic economic impact the pandemic has had on businesses across the cit. Dark grey areas indicate census block groups where none of the three nighttime economy business category is located. While there are a few exceptions, almost all block groups show a sharp decrease in business traffic, many by well over 50%.
dat_night <- dat_join %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
filter(month == '04'| month == '05' | month == '06' | month == '07' | month == '08' | month == '09' | month =='10' | month == '11' | month == '12') %>%
mutate(GEOID10 = as.numeric(GEOID)) %>%
group_by(GEOID10) %>%
summarize(Total_Visits2018 = sum(NightVisits2018),
Total_Visits2020 = sum(NightVisits2020)) %>%
st_drop_geometry() %>%
left_join(phl_cbg) %>%
st_as_sf() %>%
mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100)
#Plot map
dat_night %>%
mutate(percent_change = case_when(Percent_Change > 100 ~ 100,
Percent_Change <= 100 ~ Percent_Change)) %>%
ggplot() +
geom_sf(data = phl_boundary, fill = "grey20", color = "black")+
geom_sf(aes(fill = percent_change), color = "transparent") +
scale_fill_distiller(palette="RdYlGn", direction = 1) +
labs(title = "Nightlife Trip % Change, 2018 & 2020",
subtitle = "Figure 3.2.3",
fill = "Percent Change") +
mapTheme()
Finally, we review how the changes in mobility to nightlife establishments have impacted individual commercial corridors across Philadelphia in Figure 3.2.4. The below map shows the change in traffic to nightlife establishments aggregated by corridor. While some smaller corridors did not experience much change or even observed small increases in traffic, most corridors saw a drastic decrease in traffic to nightlife establishments. This is particularly evident in Center City, along Market Street in West Philadelphia and at the airport in far Southwest Philadelphia.
corridors_filter <- phl_corridors %>%
select(OBJECTID, NAME, GLA, P_DIST, ST_EXT, PT_ADD, VAC_RATE) %>%
mutate(VAC_RATE = str_remove_all(VAC_RATE, pattern = "%")) %>%
mutate(VAC_RATE = as.numeric(VAC_RATE)) %>%
st_as_sf() %>%
st_transform('ESRI:102728')
dat_corr <- dat_join %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
filter(month == '04'| month == '05' | month == '06' | month == '07' | month == '08' | month == '09' | month =='10' | month == '11' | month == '12') %>%
mutate(GEOID10 = as.numeric(GEOID)) %>%
st_join(corridors_filter) %>%
group_by(NAME.y, month) %>%
summarize(Total_Visits2018 = sum(NightVisits2018),
Total_Visits2020 = sum(NightVisits2020)) %>%
mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100) %>%
drop_na(NAME.y) %>%
st_as_sf() %>%
st_transform('ESRI:102728') %>%
st_drop_geometry()
dat_corr <- dat_corr %>%
mutate(NAME = NAME.y) %>%
left_join(corridors_filter)
#Average across all months
dat_corr_avg <- dat_corr %>%
group_by(NAME) %>%
summarize(Percent_Change = mean(Percent_Change))
#All Commercial Uses Corridor Map
dat_corr %>%
filter(month == "04") %>%
mutate(Percent_Change = case_when(Percent_Change > 100 ~ 100,
Percent_Change <= 100 ~ Percent_Change)) %>%
ggplot() +
geom_sf(data = phl_boundary, fill = "grey20", color = "darkgrey")+
geom_sf(aes(fill = Percent_Change, color = Percent_Change, geometry = geometry)) +
scale_fill_distiller(palette="RdYlGn", direction = 1) +
scale_color_distiller(palette="RdYlGn", direction = 1) +
labs(title = "Commercial Nighttime Trip % Change, April 2018 & 2020",
subtitle = "Figure 3.2.4",
fill = "Percent Change",
color = "Percent Change") +
mapTheme()
Next we wanted to understand the hourly traffic patterns of the nightlife establishments and corridor types that we are studying in this project. Figure 3.3.1 shows the average daily foot traffic by business type. This graphic indicates that though certain business types are considered part of the nightlife economy, they do not exclusively experience traffic in the evenings. Restaurants are a good example of this, where the data indicates that the highest levels of traffic occur in the middle of the day. Arts venues, on the other hand, have a clear patterns indicating that they are more popular later in the day.
To further understand the role of nightlife for the different corridors across Philadelphia, we examine traffic by time of day for the six different corridor types introduced above. Figure 3.3.2 below aggregates the data by corridor type and shows the share of traffic across different segments of the day. We see that Regional Centers and Superregional Centers, larger corridors that tend to be located closer to Center City, have the highest share of traffic during the late morning and early afternoon. This makes sense, as these corridors are largely synonymous with the city’s major job hubs that attract a substantial amount of daytime traffic. Smaller corridors, such as the Neighborhood Subcenters and Neighborhood Centers, demonstrate a more significant share of their overall traffic occurring during the early morning and evening hours between 7PM and 6AM. The higher share of evening traffic in these areas suggests that the city’s nighttime economy is relatively active in the smaller corridors outside of Center City.
dat_1_6 <- dat_hour %>%
#select trip counts between the hours of 1AM and 6AM
filter(Hour == "1" | Hour == "2" | Hour == "3" | Hour == "4" | Hour == "5" | Hour == "6") %>%
group_by(safegraph_place_id, date_range_start) %>%
summarize(Hrs1_6 = sum(Count))
dat_7_12 <- dat_hour %>%
#select trip counts between the hours of 7AM and 12PM
filter(Hour == "7" | Hour == "8" | Hour == "9" | Hour == "10" | Hour == "11" | Hour == "12") %>%
group_by(safegraph_place_id, date_range_start) %>%
summarize(Hrs7_12 = sum(Count))
dat_13_18 <- dat_hour %>%
#select trip counts between the hours of 1PM and 6PM
filter(Hour == "13" | Hour == "14" | Hour == "15" | Hour == "16" | Hour == "17" | Hour == "18") %>%
group_by(safegraph_place_id, date_range_start) %>%
summarize(Hrs13_18 = sum(Count))
dat_19_0 <- dat_hour %>%
#select trip counts between the hours of 6PM and 12AM
filter(Hour == "19" | Hour == "20" | Hour == "21" | Hour == "22" | Hour == "23" | Hour == "0") %>%
group_by(safegraph_place_id, date_range_start) %>%
summarize(Hrs19_0 = sum(Count))
dat_workday <- dat_hour %>%
#select trip counts between workday hours of 9AM and 6PM
filter(Hour == "9" | Hour == "10" | Hour == "11" | Hour == "12" | Hour == "13" | Hour == "14" |Hour == "15" | Hour == "16" |Hour == "17" |Hour == "18") %>%
group_by(safegraph_place_id, date_range_start) %>%
summarize(Hrs_workday = sum(Count))
#Combine into new dataset with hour trip counts and corridor + neighborhood information
dat3 <-
dat2 %>%
left_join(dat_1_6, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_7_12, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_13_18, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_19_0, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_workday, by = c('safegraph_place_id', 'date_range_start')) %>%
st_join(phl_corridors, join = st_intersects) %>%
st_join(phl_nhoods, join = st_intersects)
#Visualization
dat3 %>%
st_drop_geometry() %>%
drop_na(corr_type) %>%
group_by(corr_type) %>%
summarise(Early_AM = sum(Hrs1_6),
Late_AM = sum(Hrs7_12),
Early_PM = sum(Hrs13_18),
Late_PM = sum(Hrs19_0)) %>%
pivot_longer(cols = 2:5,
names_to = "Time",
values_to = "Traffic") %>%
ggplot(aes(fill=factor(Time, levels=c("Late_PM",
"Early_PM",
"Late_AM",
"Early_AM")),
y=Traffic,
x=factor(corr_type, levels=c("Specialty Center",
"Superregional Center",
"Regional Center",
"Community Center",
"Neighborhood Center",
"Neighborhood Subcenter")))) +
geom_bar(position="fill", stat="identity") +
coord_flip() +
scale_fill_viridis_d() +
labs(title = "Share of Traffic by Time of Day Across Corridor Types",
subtitle = "Figure 3.3.2") +
scale_y_continuous(labels = scales::percent) +
plotTheme() +
theme(legend.position = "bottom",
legend.title = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank()) +
guides(fill = guide_legend(reverse = TRUE))
The below Figure 3.3.3 looks into the daily traffic patterns in greater detail and breaks out daily traffic pattern by each corridor type and specifically for the traffic to the specific business categories we are studying. Based on this graphic, arts establishments in Neighborhood Centers, Specialty Centers, and Superregional Centers receive the highest share of evening traffic of all corridor types. It is worth noting that the arts category is our smallest group of business types, so a certain establishment might be skewing these results. Restaurants and bars are more evenly visited throughout the day. Notably, restaurants in Regional Centers serve the highest share of day time traffic of the six corridor types. It is also clear that bars tend to have higher traffic in the evening than other business types.
The below Figure 3.3.4 shows that Neighborhood Subcenters and Superregional Centers have the highest shares of traffic occurring in the evening hours, between the 7PM and 12AM. Regional Centers, on average, have a substantially lower share of evening traffic at just over 12%. That being said, there is not huge variation among the different corridor types indicating that nighttime traffic occurs in many different spaces and in different corridor types throughout the city.
Finally, the below map in Figure 3.4.5 shows which corridors have a higher share of evening traffic. The dark purple, representing the lowest quintile of evening traffic is largely concentrated in Center City, suggesting that these corridors are relatively less popular during the evening. Interestingly, the waterfront corridors directly to the east of Center City are in the highest quintile for nighttime traffic. This suggests that the program of these corridors is quite distinct from their neighbors.
One of the fascinating features of the SafeGraph dataset is its ability to capture flows of traffic across space by connecting a series of origins and destinations. By mapping the origins and destinations of trips taking place across Philadelphia, we can observe which regions of the city draw from a broader pool swath of visitors across the city, and which areas cater to a more local population. To study these trends in the data, we chose a selection of representative corridors located within the central core of the city and were classified as a separate corridor type. The selected corridors and corresponding classification are listed below. Though we only show a subset of the corridors here, this information is available for all Philadelphia commercial corridors on our app.
Corridor Name | Corridor Type |
---|---|
South Street/16th-21st | Neighborhood Subcenter |
Fairmount/19th-25th | Neighborhood Center |
Central Waterfront/Washington | Community Center |
Market West | Regional Center |
Market East | Superregional Center |
Old City | Specialty Center |
Figures 3.4.1 below shows the volume of visitors to the selected corridors by census block group. Block groups shaded grey indicate that no resident made a trip to the corridor. Note that any trips originating from outside of the city are not included in this graphic.
corr_select <-
phl_corridors %>%
filter(NAME == "South Street/16th-21st" | #Neighborhood Subcenter
NAME == "Fairmount/19th-25th" | #Neighborhood Center
NAME == "Central Waterfront/Washington" | #Community Center
NAME == "Market West" | #Regional Center
NAME == "Market East" | #Superregional Center
NAME == "Old City") %>% #Specialty Center
st_as_sf()
dat_corr_select <- dat3[corr_select,]
corr_select_cbg <-
dat_corr_select %>%
select(safegraph_place_id,
date_range_start,
top_category,
sub_category,
visitor_home_cbgs,
NAME.y,
geometry) %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\[|\\]")) %>%
mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\{|\\}")) %>%
mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = '\\"|\\"')) %>%
mutate(visitor_home_cbgs = str_split(visitor_home_cbgs, pattern = ",")) %>%
unnest(visitor_home_cbgs) %>%
separate(.,
visitor_home_cbgs,
c("cbg_origin", "visitors"),
sep = ":") %>%
mutate(cbg_origin = as.numeric(cbg_origin),
visitors = as.numeric(visitors))
corr_select_cbg_map <-
corr_select_cbg %>%
dplyr::group_by(cbg_origin, NAME.y) %>%
dplyr::summarize(visitors = sum(visitors)) %>%
st_drop_geometry() %>%
dplyr::rename(., GEOID10 = cbg_origin) %>%
left_join(phl_cbg, by = "GEOID10") %>%
st_as_sf() %>%
drop_na(geometry) %>%
mutate(corr_type = case_when(NAME.y == "South Street/16th-21st" ~ "Neighborhood Subcenter",
NAME.y == "Fairmount/19th-25th"~ "Neighborhood Center",
NAME.y == "Central Waterfront/Washington" ~ "Community Center",
NAME.y == "Market West" ~ "Regional Center",
NAME.y == "Market East" ~ "Superregional Center",
NAME.y == "Old City" ~ "Specialty Center"))
corr_select_cbg_map %>%
ggplot() +
geom_sf(data = phl_cbg, fill = "grey70", color = "transparent") +
geom_sf(aes(fill = q5(visitors), geometry = geometry), color = "transparent") +
scale_fill_manual(values = palette5,
aesthetics = c("colour", "fill"),
name = "Visitor Count \n(Quintile)") +
mapTheme() +
labs(title = "Where do nightlife visitors come from?",
subtitle = "Figure 3.4.1") +
facet_wrap(~factor(corr_type,
levels=c("Neighborhood Subcenter",
"Neighborhood Center",
"Community Center",
"Regional Center",
"Superregional Center",
"Specialty Center")))+
theme(legend.position = "bottom")
Visualizing the origins and destinations in another way, the following maps in Figure 3.4.2 look at the origins and destinations for trips to restaurants, bars, and arts venues across the city. Specifically, it shows the distance between the centroid of the origin neighborhood to the centroid of the destination commercial corridor. We show the same six corridors discussed above. Again, any trips from outside of the city are not included in this graphic.
The graphic demonstrates how traffic flows change across corridor types, with the larger corridors generating far more traffic than neighborhood level corridors. This has important implications for the nighttime economy in Philadelphia as we know that corridor types generating more traffic also tend to be more “daytime” in character. This suggests that businesses catering to the nighttime crowd are pulling from a smaller subset of the traffic that these larger corridors are seeing.
flows <- dat2 %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>% #filter for business types
st_join(phl_corridors) %>% #join destinations to phl corridors (apply corridor destination to each trip)
st_as_sf() %>%
st_drop_geometry() %>%
select(safegraph_place_id,
date_range_start,
top_category,
poi_cbg,
visitor_home_cbgs,
NAME.y) %>% #select columns
mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\[|\\]")) %>%
mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\{|\\}")) %>%
mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = '\\"|\\"')) %>%
mutate(visitor_home_cbgs = str_split(visitor_home_cbgs, pattern = ",")) %>%
unnest(visitor_home_cbgs) %>% #unnest visitor cbg column
separate(.,
visitor_home_cbgs,
c("visitor_cbg", "count"),
sep = ":") %>% #separate count column from visitor cbg
mutate(count = as.numeric(count),
visitor_cbg = as.numeric(visitor_cbg)) %>%
dplyr::rename(., corridor_dest = NAME.y) %>%
drop_na(corridor_dest) #Remove trips with destinations outside of corridors
#prepare PHL boundary file to coordinates for plotting
phl_boundary_unproj <-
phl_boundary_unproj %>%
st_coordinates() # split out coordinates
phl_boundary_unproj <-
as.data.frame(phl_boundary_unproj) # save as dataframe
#----MAP V1: NEIGHBORHOOD CENTROID TO CORRIDOR CENTROIDS
#Generate CBG centroid
phl_cbg_cent <-
phl_cbg_unproj %>%
select(GEOID10, geometry) %>%
st_centroid()
#Generate neighborhood centroids
phl_nhood_cent <-
phl_nhoods_unproj %>%
select(name) %>%
st_centroid()
#Generate corridor centroid
phl_corr_cent <-
phl_corridors_unproj %>%
select(NAME, geometry) %>%
st_centroid()
#Join CBG centroid to neighborhood shapefile
phl_cbg_nhood <-
st_join(phl_nhoods_unproj, phl_cbg_cent, join = st_intersects) %>%
select(GEOID10, name) %>%
st_centroid()
flows_nhood <- flows %>%
left_join(phl_cbg_nhood, by=c("visitor_cbg"="GEOID10")) %>% #join visitor CBGs to nhoods
dplyr::rename(., nhood_origin = name) %>% #cleanup columns for clarity
drop_na(nhood_origin) %>% #dropping trips outside of Philadelphia
st_as_sf() %>%
st_drop_geometry() %>%
group_by(nhood_origin, top_category, corridor_dest) %>% #grouping trip counts by origin neighborhood
summarize(count = sum(count)) %>%
left_join(phl_nhood_cent, by=c("nhood_origin"="name")) %>% #join origin nhood to nhood centroid
left_join(., phl_corr_cent, by=c("corridor_dest"="NAME")) %>% #join destination corridor to corr centroid
dplyr::rename(., origin.geom = geometry.x,
dest.geom = geometry.y) #clean-up columns for clarity
#Convert from tibble to data frame for next step
flows_nhood <- as.data.frame(flows_nhood)
#split point data into lat and long columns
flows_nhood <- flows_nhood %>%
mutate(lat.origin = unlist(map(flows_nhood$origin.geom,1)),
long.origin = unlist(map(flows_nhood$origin.geom,2)),
lat.dest = unlist(map(flows_nhood$dest.geom,1)),
long.dest = unlist(map(flows_nhood$dest.geom,2)),
id = as.character(c(1:nrow(.))))
#Maps - straight line segment example
flows_nhood %>%
filter(corridor_dest == "South Street/16th-21st" | #Neighborhood Subcenter
corridor_dest == "Fairmount/19th-25th" | #Neighborhood Center
corridor_dest == "Central Waterfront/Washington" | #Community Center
corridor_dest == "Market West" | #Regional Center
corridor_dest == "Market East" | #Superregional Center
corridor_dest == "Old City") %>% #Specialty Center
mutate(corr_type = case_when(corridor_dest == "South Street/16th-21st" ~ "Neighborhood Subcenter",
corridor_dest == "Fairmount/19th-25th"~ "Neighborhood Center",
corridor_dest == "Central Waterfront/Washington" ~ "Community Center",
corridor_dest == "Market West" ~ "Regional Center",
corridor_dest == "Market East" ~ "Superregional Center",
corridor_dest == "Old City" ~ "Specialty Center")) %>%
ggplot() +
geom_polygon(data = phl_boundary_unproj, aes(x=X, y=Y), fill = "grey40") +
geom_segment(aes(x = lat.origin, y = long.origin, xend = lat.dest, yend = long.dest,
color=log(count)),
size = .5,
lineend = "round") +
scale_color_viridis_c() +
coord_equal() +
mapTheme() +
facet_wrap(~factor(corr_type,
levels=c("Neighborhood Subcenter",
"Neighborhood Center",
"Community Center",
"Regional Center",
"Superregional Center",
"Specialty Center")), ncol = 3) +
labs(title = "Trip Flows Across Center City Corridors",
subtitle = "Figure 3.4.2",
color = "Trip Volume") +
theme(legend.position = "bottom")
The SafeGraph dataset is incredibly rich and there are many more fascinating variables to investigate. The following section shows a selection of other discrete analyses that we conducted on the SafeGraph dataset that had interesting implications for our use case.
Figures 3.5.1 and 3.5.2 below visualize SafeGraph “distance_from_home” variable, which takes the average distance traveled from a visitor’s home to the point of interest along the corridor. For Figure 3.5.1, we have aggregated this variable by corridor to show that on average, visitors travel farther to Center Center and corridors close to the city’s border. Smaller corridors throughout the city draw more local visitors that travel shorter distances. One possible explanation for this pattern could be the presence of job hubs. With jobs concentrated in Center City and a handful of other centers, such as the airport in Southwest Philadelphia and the Navy Yard in South Philadelphia, this could indicate that the people are willing to travel relatively long distances to these corridors. Another possible explanation for this pattern could be the connectivity of these corridors. The commercial corridors in Center City are well-served by public transit and the corridors located close to the city’s border, particularly around South Philly are well connected to the interstate network, facilitating car travel to these destinations. Finally, tourism could play a role in the aggregated distance from home variable. Most of the city’s tourist-friendly attractions are located in the city core, so visitors from outside of Philadelphia could be skewing this variable higher.
dat3 %>%
drop_na(NAME.y) %>%
group_by(NAME.y) %>%
summarize(distance_from_home = mean(distance_from_home, na.rm = TRUE)) %>%
st_drop_geometry() %>%
rename("NAME" = "NAME.y") %>%
left_join(., phl_corridors) %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
geom_sf(aes(fill = q5(distance_from_home), color = q5(distance_from_home))) +
scale_fill_manual(values = palette5,
aesthetics = c("colour", "fill"),
name = "Distance from Home \n(Quintile)") +
scale_colour_manual(values = palette5,
aesthetics = c("colour", "fill"),
name = "Distance from Home \n(Quintile)") +
labs(title = "Distance from Home Across Commercial Corridors",
subtitle = "Figure 3.5.1") +
mapTheme()
Looking at the same variable across the corridor types, Figure 3.5.2 demonstrates that the Regional, Superregional, and Specialty Centers have a substantially higher average distance traveled corridor at 6.4, 14.8, and 9.7 miles respectively. Neighborhood Subcenters have the smallest distance from home value, suggesting that, on average, this corridor type appeals to a more local crowd.
#Barplots
dat3 %>%
st_drop_geometry() %>%
drop_na(corr_type) %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
group_by(corr_type) %>%
summarise(Avg_Distance_mi = mean(distance_from_home, na.rm = TRUE)*0.000621371) %>%
ggplot(aes(y=Avg_Distance_mi,
x=factor(corr_type, levels=c("Neighborhood Subcenter",
"Neighborhood Center",
"Community Center",
"Regional Center",
"Superregional Center",
"Specialty Center")))) +
geom_bar(stat="identity") +
# facet_wrap(~corr_type) +
guides(fill=guide_legend(title=NULL)) +
coord_flip()+
theme(axis.title.y = element_blank()) +
geom_text(aes(label=round(Avg_Distance_mi, digits = 1)), hjust=1.6, color="white", size=3.5) +
labs(title = "Distance from Home by Corridor Type",
subtitle = "Figure 3.5.2",
y = "Average Distance (miles)") +
plotTheme()
Another variable unique to the SafeGraph dataset is the median_dwell variable which aggregates the median amount of time spent by all visitors at a given point of interest. While we do not see as clear of a spatial pattern in the map as we did with the above distance_from_home variable (Figure 3.5.3), the bar plot in Figure 3.5.4 below indicates that Neighborhood Subcenters have the highest median dwell time at over 45 minutes, while Community Centers have the lowest. As Community Centers tend to be auto-oriented and suburban-style commercial centers, these corridors may be optimized for convenience and, therefore, encourage visitors to spend less time.
Applying this finding to the nighttime economy, establishments catering to the evening crowd tend to benefit when visitors spend a longer time at the establishment. For example, at bars and restaurants, the longer a patron spends at the establishment, the more likely they are to purchase additional items, translating to increased business revenue. This finding could have interesting planning implications for identifying optimal corridors for increased development of the nighttime economy.
dat3 %>%
drop_na(NAME.y, corr_type) %>%
group_by(NAME.y) %>%
summarize(median_dwell = median(median_dwell, na.rm = TRUE)) %>%
st_drop_geometry() %>%
rename("NAME" = "NAME.y") %>%
left_join(., phl_corridors) %>%
st_as_sf() %>%
ggplot() +
geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
geom_sf(aes(fill = q5(median_dwell), color = q5(median_dwell))) +
scale_fill_manual(values = palette5,
aesthetics = c("colour", "fill"),
name = "Median Dwell \n(Quintile)") +
scale_color_manual(values = palette5,
aesthetics = c("colour", "fill"),
name = "Median Dwell \n(Quintile)") +
labs(title = "Median Dwell Times by Philadelphia Commercial Corridor",
subtitle = "Figure 3.5.3") +
mapTheme()
#Barplots
dat3 %>%
st_drop_geometry() %>%
drop_na(corr_type) %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
group_by(corr_type) %>%
summarise(Dwell = median(median_dwell, na.rm = TRUE)) %>%
ggplot(aes(y=Dwell,
x=factor(corr_type, levels=c("Neighborhood Subcenter",
"Neighborhood Center",
"Community Center",
"Regional Center",
"Superregional Center",
"Specialty Center")))) +
geom_bar(stat="identity") +
scale_fill_viridis_d() +
labs(title = "Median Dwell Times by Corridor Type",
subtitle = "Figure 3.5.4",
y = "Median Dwell Time (minutes)") +
geom_text(aes(label=round(Dwell, digits = 1)), hjust=1.6, color="white", size=3.5)+
guides(fill=guide_legend(title=NULL)) +
theme(axis.title.y = element_blank()) +
coord_flip()+
plotTheme()
The SafeGraph dataset is incredibly rich and the above analysis merely demonstrates a small sample of the possibilities of this data. By combining the SafeGraph dataset with Philadelphia’s local commercial corridors, we were able to see unique patterns that further distinguish the corridor types as well as provide important insights into the behavior of people participating in the nighttime economy.
Informed by the above exploratory analysis, the following section discusses how we developed a model to predict nighttime traffic between the hours of 7PM and 12PM to commercial corridors in Philadelphia. We begin by explaining the features that we ultimately chose for the model, describe the modelling approach, and finally walk through an evaluation of our error terms.
As mentioned above, the ultimate output of the model is a prediction of trips to commercial corridors during evening hours. This raw trip count by corridor needed to be normalized by corridor area, to account for differently sized corridors. This variable, however, was highly skewed and would be difficult to model. Therefore, we ultimately used the log transformation of the normalized trip count as our dependent variable for the model.
Our independent variables that we used to predict the nighttime trip outcome were developed over an iterative process of data wrangling, feature engineering, and evaluation. Our final list falls into three primary categories described below.
SafeGraph Features: These features are entirely sourced from the SafeGraph dataset. They help describe the density of certain business types along a corridor, a measure of retail mix, the distance home and dwell time (explore further in the above Section 3), and how late businesses are open.
Built Environment Features: These features are sourced from local datasets maintained on Open Data Philly and describe the built environment of the corridor. They include information on distance to transit, parks, and average building size.
Demographic Features: These features are sourced from U.S. Census data and describe socio-economic characteristics of a corridor.
dat_pred_temp <-
dat2 %>%
left_join(dat_1_6, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_7_12, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_13_18, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_19_0, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_workday, by = c('safegraph_place_id', 'date_range_start')) %>%
st_join(phl_blockgroups) %>%
st_join(phl_nhoods) %>%
st_join(phl_corridors_pred) %>%
st_join(., phl_building_footprints) %>%
drop_na(corridor)
#Engineer variables at the destination level.
dat_pred <-
dat_pred_temp %>%
dplyr::mutate(total = 1,
bars = ifelse(top_category == 'Drinking Places (Alcoholic Beverages)', 1, 0),
restaurant = ifelse(top_category == 'Restaurants and Other Eating Places', 1, 0),
arts = ifelse(top_category == 'Promoters of Performing Arts, Sports, and Similar Events' |
top_category == 'Performing Arts Companies', 1, 0),
grocery = ifelse(top_category == 'Grocery Stores', 1, 0),
sports = ifelse(top_category == 'Spectator Sports', 1, 0),
transit_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(septaStops)), 2),
parks_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(parks)), 4),
trolley_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(phl_trolley)), 4),
late_night = if_else(grepl("Late Night", dat_pred_temp$category_tags), 1, 0),
closing_time = ifelse(str_detect(open_hours, paste(c("20:00", "21:00", "22:00", "23:00", "0:00", "1:00", "2:00", "3:00"), collapse = "|")), "OPEN LATE", "NOT OPEN LATE"),
closing_time = ifelse(is.na(closing_time) == TRUE, "NO DATA", closing_time),
open_late = ifelse(closing_time == "OPEN LATE", 1, 0)
)
dat_pred$UPenn <- as.numeric(st_distance(dat_pred, UPenn.sf))
#Aggregating up to the corridor level
dat_pred_agg <-
dat_pred %>%
group_by(corridor, date_range_start, corr_type) %>%
summarize(Night_visits = sum(Hrs19_0),
Workday_visits = sum(Hrs_workday),
phl_area_sqmi = mean(corr_area_sqmi, na.rm = TRUE),
Night_visits_sqmi = (Night_visits + 1) / phl_area_sqmi,
Night_visits_sqmi_log = log(Night_visits_sqmi),
Workday_visits_sqmi = (Workday_visits +1) / phl_area_sqmi,
Workday_visits_sqmi_log = log(Workday_visits_sqmi),
total = sum(total),
phl_building_size = mean(bld_area_sqft +1, na.rm = TRUE),
phl_building_size_log = log(phl_building_size),
phl_UPenn = mean(UPenn),
sg_dwell = mean(median_dwell, na.rm = TRUE),
sg_distance_home = mean(distance_from_home, na.rm = TRUE),
sg_distance_home_log = log(sg_distance_home),
dens_bars = (sum(bars, na.rm = TRUE) + 1)/phl_area_sqmi,
dens_rest = (sum(restaurant, na.rm = TRUE) + 1)/phl_area_sqmi,
dens_arts = (sum(arts, na.rm = TRUE) +1)/phl_area_sqmi,
dens_sports = (sum(sports, na.rm = TRUE) + 1)/phl_area_sqmi,
dens_grocery = (sum(grocery, na.rm = TRUE) + 1)/phl_area_sqmi,
count_retailmix_top = n_distinct(top_category),
count_retailmix_sub = n_distinct(sub_category),
dens_late_tag = (sum(late_night, na.rm = TRUE) + 1)/phl_area_sqmi,
dens_openlate = (sum(open_late, na.rm = TRUE) + 1)/phl_area_sqmi,
dens_bars_log = log(dens_bars),
dens_rest_log = log(dens_rest),
dens_arts_log = log(dens_arts),
dens_grocery_log = log(dens_grocery),
dens_late_tag_log = log(dens_late_tag),
count_retailmix_top_log = log(count_retailmix_top),
count_retailmix_sub_log = log(count_retailmix_sub),
dens_openlate_log = log(dens_openlate),
nn_transit = mean(transit_nn),
nn_parks = mean(parks_nn),
nn_trolley = mean(trolley_nn),
nn_parks_log = log(nn_parks),
nn_transit_log = log(nn_transit),
nn_trolley_log = log(nn_trolley),
demo_popdens = mean(popdens_mi),
demo_popdens_log = log(demo_popdens),
demo_pctWhite = weighted.mean(pctWhite, Hrs19_0, na.rm = TRUE),
demo_medAge = weighted.mean(MedAge, Hrs19_0, na.rm = TRUE),
demo_MHI = weighted.mean(MedHHInc, Hrs19_0, na.rm = TRUE),
demo_medrent = weighted.mean(MedRent, Hrs19_0, na.rm = TRUE)) %>%
st_drop_geometry() %>%
left_join(phl_corridors_pred %>% dplyr::select(corridor)) %>%
drop_na(corridor, demo_MHI, demo_medrent) %>% #Some of the census variables aren't reporting for our block groups
st_as_sf() %>%
ungroup() %>%
na.omit(st_distance_home)
The below is a final list of variables as well as a brief description of each.
Feature Name | Category | Source | Description |
---|---|---|---|
Night_visits_sqmi_log | Dependent Variable | SafeGraph | Log transformation of trip count between the hours of 7PM and 12AM normalized by area |
phl_building_size_log | Built Environment | City of Philadelphia Building Footprints | Log transformation of average building footprint (sq ft) on commercial corridor |
phl_UPenn | Built Environment | SafeGraph | Distance to the University of Pennsylvania |
sg_distance_home_log | SafeGraph | SafeGraph | Log transformation of average distance traveled to a point of interest aggregated to the commercial corridor |
sg_dwell | SafeGraph | SafeGraph | Median dwell time of points of interested aggregated to the corridor level |
dens_bars_log | SafeGraph | SafeGraph | Log transformation of bar density variable |
dens_rest_log | SafeGraph | SafeGraph | Log transformation of restaurant density variable |
dens_arts_log | SafeGraph | SafeGraph | Log transformation of arts venue density variable |
dens_grocery | SafeGraph | SafeGraph | Log transformation of grocery density variable |
dens_sports | SafeGraph | SafeGraph | Log transformation of sports establishment density variable |
count_retailmix_top | SafeGraph | SafeGraph | Count of unique business top categories |
count_retailmix_sub_log | SafeGraph | SafeGraph | Log transformation of count of unique business subcategories |
dens_openlate_log | SafeGraph | SafeGraph | Log transformation of businesses open after 8PM based on open_hours variable |
nn_transit_log | Built Environment | SEPTA - Market-Frankford Line Stations & SEPTA - Broad Street Line Stations | Log transformation of distance to nearest 2 SEPTA stations |
nn_parks_log | Built Environment | Philadelphia Parks & Recreation Properties | Log transformation of distance to nearest 4 parks |
nn_trolley | Built Environment | SEPTA - Trolley Stops | Distance to nearest 4 trolley stations |
dens_late_tag | SafeGraph | SafeGraph | Count of businesses with “late” tag |
corr_type | Built Environment | Commercial Corridors of Philadelphia | Corridor typology (see detailed description above) |
dens_late_tag_log | SafeGraph | SafeGraph | Log transformation of density of businesses with “late” tag |
demo_pctWhite | Demographic | 2018 ACS 5-Year Estimates | Percent of residents in corresponding census tract that are white |
demo_medAge | Demographic | 2018 ACS 5-Year Estimates | Median Age of residents in corresponding census tract |
demo_popdens | Demographic | 2018 ACS 5-Year Estimates | Population density of corresponding census tract |
demo_medrent | Demographic | 2018 ACS 5-Year Estimates | Median rent of corresponding census tract |
demo_MHI | Demographic | 2018 ACS 5-Year Estimates | Median household income of corresponding census tract |
In Figure 4.1.1 below, we see a corrplot display of the final features chosen for the model. There is some collinearity among the variables that describe the density of business types. While not ideal for modelling, including these variables was necessary for the ultimate application of predicting nighttime traffic based on different retail mix scenarios. In order to adjust the retail mix for certain nighttime establishments (bars, restaurants, and arts venues) a variable representing each of these factors needed to be present in the final model. Other variables, though they may have exhibited some degree of collinearity, are present in the model because they were ultimately statistically significant and improved the fit.
We developed two primary models: an OLS linear regression and random forest regression. Both models used the same dataset and predicted the logged nighttime visits per square mile. For both models, we split our data into a training and test sets (60% and 40%, respectively) and evaluated the Mean Absolute Error (MAE) and Mean Average Percent Error (MAPE).
We first developed and evaluated a linear model using our selected features. The results of the regression summary are shown in the table below. We can see that with the exception of the bars and arts variables (which, as explained above, are included for the prediction scenario purposes) and a few of the corridor types, the individual features are quite significant predictors overall. However, the R-Squared value is relatively low at about .5.
##
## Linear Regression Results
## ===========================================================
## Dependent variable:
## ---------------------------
## Night_visits_sqmi_log
## -----------------------------------------------------------
## phl_building_size_log 0.060***
## (0.017)
##
## phl_UPenn 0.00001***
## (0.00000)
##
## sg_distance_home_log -0.271***
## (0.037)
##
## sg_dwell 0.001***
## (0.0003)
##
## dens_bars_log 0.052
## (0.049)
##
## dens_rest_log 0.508***
## (0.031)
##
## dens_arts_log 0.035
## (0.056)
##
## dens_grocery 0.001***
## (0.0003)
##
## dens_sports 0.003***
## (0.001)
##
## count_retailmix_top -0.019***
## (0.003)
##
## count_retailmix_sub_log 0.673***
## (0.046)
##
## dens_openlate_log -0.099***
## (0.035)
##
## nn_transit_log -0.135***
## (0.023)
##
## nn_parks_log 0.123***
## (0.043)
##
## nn_trolley -0.00001***
## (0.00000)
##
## dens_late_tag -0.002***
## (0.0004)
##
## corr_typeNeighborhood Center -0.082*
## (0.045)
##
## corr_typeNeighborhood Subcenter -0.041
## (0.059)
##
## corr_typeRegional Center 0.596***
## (0.142)
##
## corr_typeSpecialty Center 0.041
## (0.081)
##
## corr_typeSuperregional Center 1.212***
## (0.152)
##
## dens_late_tag_log 0.239***
## (0.052)
##
## demo_pctWhite -0.002***
## (0.001)
##
## demo_medAge 0.005**
## (0.002)
##
## demo_popdens 0.00000**
## (0.00000)
##
## demo_medrent 0.0003***
## (0.0001)
##
## demo_MHI -0.00001***
## (0.00000)
##
## Constant 7.764***
## (0.513)
##
## -----------------------------------------------------------
## Observations 1,896
## R2 0.548
## Adjusted R2 0.542
## Residual Std. Error 0.584 (df = 1868)
## F Statistic 83.963*** (df = 27; 1868)
## ===========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Figure 4.2.1 below compares the predictions and the actual values across the training and the test sets. While the predicted value (the green line) lines up fairly well with the actual value (orange line), the scatter plot indicates a fair amount of variance. Additionally, there are some extreme outliers across both the training and the test set that the model is predicting quite poorly.
We also ran a k-fold cross validation test to view the distribution of errors. The MAE for each fold are plotted in the below Figure 4.2.2.
fitControl <- trainControl(method = "cv",
number = 10,
savePredictions = TRUE)
set.seed(414)
reg1.cv <-
train(Night_visits_sqmi_log ~ ., data = st_drop_geometry(dat_pred_agg) %>%
dplyr::select(reg.vars),
method = "lm",
trControl = fitControl,
na.action = na.pass)
reg1.cv.resample <- reg1.cv$resample
ErrorHist <-
ggplot(reg1.cv.resample, aes(x=MAE)) +
geom_histogram(color = "grey40", fill = "#27fdf5", bins = 50) +
labs(title="Linear Model: Histogram of Mean Absolute Error \nAcross 10 Folds",
subtitle = "Figure 4.2.2",
y = "Count",
x = "Mean Absolute Error") +
plotTheme()
ErrorHist
We also map the errors by commercial corridor in the following maps. The absolute errors suggest that our model is less accurate predicting nighttime traffic for larger, more heavily-trafficked corridors. The Percent Errors map on the right, however, indicates that smaller corridors throughout the city have a higher error.
cv_preds <- reg1.cv$pred
map_preds <- dat_pred_agg %>%
rowid_to_column(var = "rowIndex") %>%
left_join(cv_preds, by = "rowIndex") %>%
group_by(corridor, corr_type) %>%
summarise(Night_visits_sqmi = mean(Night_visits_sqmi),
pred = mean(pred)) %>%
dplyr::mutate(Visits.AbsError = abs(exp(pred) - Night_visits_sqmi),
PercentError = (Visits.AbsError / Night_visits_sqmi)*100)
#Quintile maps
ErrorPlot1 <- ggplot() +
geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
geom_sf(data = map_preds, aes(fill = q5(Visits.AbsError), color = q5(Visits.AbsError))) +
scale_fill_manual(values = palette5,
labels=qBr(map_preds,"Visits.AbsError"),
name="Quintile\nBreaks") +
scale_color_manual(values = palette5,
labels=qBr(map_preds,"Visits.AbsError"),
name="Quintile\nBreaks") +
labs(title="Absolute Errors",
subtitle = "Night Visit Predictions",
caption = "Figure 4.2.3a") +
mapTheme()
ErrorPlot2 <- ggplot() +
geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
geom_sf(data = map_preds, aes(fill = q5(PercentError), color = q5(PercentError))) +
scale_fill_manual(values = palette5,
labels=qBr(map_preds, "PercentError"),
name="Quintile\nBreaks") +
scale_color_manual(values = palette5,
labels=qBr(map_preds,"PercentError"),
name="Quintile\nBreaks") +
labs(title="Percent Errors",
subtitle = "Night Visit Predictions",
caption = "Figure 4.2.3b") +
mapTheme()
grid.arrange(ErrorPlot1, ErrorPlot2, ncol=2)
Finally the following tables summarize the error terms by corridor type. We acknowledge that the linear model is most accurate at predicting traffic for Neighborhood and Community Centers with an average percent error of about 25%. The model is substantially worse at predicting traffic along Regional Centers and Superregional Centers, where the model’s average percent error is about 50%. Not only is the linear model inaccurate, but it is not generalizable across corridors.
#Comparing errors by corr type
map_preds_corr <-
map_preds %>%
dplyr::mutate(MAE = round(abs(exp(pred) - Night_visits_sqmi),2),
MAPE = round((MAE / Night_visits_sqmi)*100,2)) %>%
group_by(corr_type) %>%
summarise(Night_visits_sqmi = mean(Night_visits_sqmi),
MAE = mean(MAE),
MAPE = mean(MAPE)) %>%
st_drop_geometry() %>%
dplyr::select(corr_type, MAE, MAPE)
map_preds_corr[order(factor(map_preds_corr$corr_type,
levels = c("Neighborhood Subcenter",
"Neighborhood Center",
"Community Center",
"Regional Center",
"Superregional Center",
"Specialty Center"))),] %>%
kable(caption = "Errors by Corridor Type: Linear Model") %>%
kable_styling(full_width = FALSE)
corr_type | MAE | MAPE |
---|---|---|
Neighborhood Subcenter | 29682.11 | 35.51520 |
Neighborhood Center | 24695.91 | 26.93982 |
Community Center | 20527.93 | 26.87079 |
Regional Center | 43407.16 | 47.77333 |
Superregional Center | 78367.69 | 44.75667 |
Specialty Center | 23359.27 | 33.76462 |
We also look at the errors for the linear model across months in the following table. The MAPE remains high across all months, but is particularly high during the winter months of January, February and December.
#Errors by date
dat_pred_agg %>%
rowid_to_column(var = "rowIndex") %>%
left_join(cv_preds, by = "rowIndex") %>%
group_by(corridor, date_range_start) %>%
summarise(Night_visits_sqmi = mean(Night_visits_sqmi),
pred = mean(pred)) %>%
dplyr::mutate(Visits.AbsError = abs(exp(pred) - Night_visits_sqmi),
PercentError = (Visits.AbsError / Night_visits_sqmi)*100) %>%
dplyr::mutate(MAE = round(abs(exp(pred) - Night_visits_sqmi),2),
MAPE = round((MAE / Night_visits_sqmi)*100,2)) %>%
group_by(date_range_start) %>%
summarise(Night_visits_sqmi = mean(Night_visits_sqmi),
MAE = mean(MAE),
MAPE = mean(MAPE)) %>%
st_drop_geometry() %>%
dplyr::select(date_range_start, MAE, MAPE) %>%
mutate(date_range_start = str_remove_all(date_range_start, pattern = "T05:00:00Z"),
date_range_start = str_remove_all(date_range_start, pattern = "T04:00:00Z")) %>%
kable(caption = "Errors by Corridor Type: Linear Model") %>%
kable_styling(full_width = FALSE)
date_range_start | MAE | MAPE |
---|---|---|
2018-01-01 | 35078.91 | 97.83992 |
2018-02-01 | 37863.43 | 107.83184 |
2018-03-01 | 30527.72 | 33.65015 |
2018-04-01 | 40404.08 | 34.39489 |
2018-05-01 | 27327.91 | 32.57414 |
2018-06-01 | 23889.45 | 31.31711 |
2018-07-01 | 26530.56 | 32.79917 |
2018-08-01 | 43336.84 | 37.66049 |
2018-09-01 | 39554.70 | 35.91173 |
2018-10-01 | 41802.99 | 35.10199 |
2018-11-01 | 43968.28 | 35.77019 |
2018-12-01 | 15757.49 | 184.57811 |
As discussed above, the linear model has a number of issues with accuracy and generalizability. In an effort to improve these metrics, we built a random forest model using the same features and dependent variable. In the below summary table, we see much improvement in the random forest model’s OOB Error of about 20%, down from about 40% in the linear model. Additionally, the adjusted R-squared value has increased from .5 to .7. This indicates that the model’s accuracy has improved.
set.seed(414)
inTrain <- createDataPartition(y=dat_pred_agg$Night_visits_sqmi_log, p = .60, list = FALSE)
phl.training.rf <- dat_pred_agg[inTrain,]
phl.test.rf <- dat_pred_agg[-inTrain,]
#Multivariate regression
rf1 <-
ranger(Night_visits_sqmi_log ~ ., #change lm to ranger, predictions are a little different
data = st_drop_geometry(phl.training.rf) %>%
dplyr::select(reg.vars),
importance = "impurity")
rf1
## Ranger result
##
## Call:
## ranger(Night_visits_sqmi_log ~ ., data = st_drop_geometry(phl.training.rf) %>% dplyr::select(reg.vars), importance = "impurity")
##
## Type: Regression
## Number of trees: 500
## Sample size: 1896
## Number of independent variables: 23
## Mtry: 4
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.2106636
## R squared (OOB): 0.7171268
Next we tune the model’s hyperparameters to optimize for the variance across time. As we saw in the above table, there was extensive variation in the error terms from month-to-month. Therefore, we decided to optimize for this metric. After testing different parameters using the tune_grid() function, the final parameters were selected based on the lowest Root Mean Square Error. The results of the grid search are shown in the below table.
mtry | min_n | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|
5 | 5 | rmse | standard | 0.4527947 | 12 | 0.1077851 | Preprocessor1_Model02 |
10 | 5 | rmse | standard | 0.4530589 | 12 | 0.1054736 | Preprocessor1_Model03 |
10 | 10 | rmse | standard | 0.4542392 | 12 | 0.1054445 | Preprocessor1_Model08 |
15 | 5 | rmse | standard | 0.4545687 | 12 | 0.1039943 | Preprocessor1_Model04 |
15 | 10 | rmse | standard | 0.4555386 | 12 | 0.1041836 | Preprocessor1_Model09 |
20 | 10 | rmse | standard | 0.4556419 | 12 | 0.1012414 | Preprocessor1_Model10 |
5 | 10 | rmse | standard | 0.4556548 | 12 | 0.1082819 | Preprocessor1_Model07 |
20 | 5 | rmse | standard | 0.4556906 | 12 | 0.1016482 | Preprocessor1_Model05 |
2 | 5 | rmse | standard | 0.4595188 | 12 | 0.1114530 | Preprocessor1_Model01 |
2 | 10 | rmse | standard | 0.4634090 | 12 | 0.1116446 | Preprocessor1_Model06 |
15 | 50 | rmse | standard | 0.4866572 | 12 | 0.1002259 | Preprocessor1_Model14 |
20 | 50 | rmse | standard | 0.4877684 | 12 | 0.0985493 | Preprocessor1_Model15 |
10 | 50 | rmse | standard | 0.4880091 | 12 | 0.1026802 | Preprocessor1_Model13 |
5 | 50 | rmse | standard | 0.4951714 | 12 | 0.1057242 | Preprocessor1_Model12 |
2 | 50 | rmse | standard | 0.5110093 | 12 | 0.1083489 | Preprocessor1_Model11 |
The below shows the tuned random forest model’s updated specifications.
## [[1]]
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: rand_forest()
##
## -- Preprocessor ----------------------------------------------------------------
## 0 Recipe Steps
##
## -- Model -----------------------------------------------------------------------
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5, x), num.trees = ~1000, min.node.size = min_rows(~5, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 1000
## Sample size: 1896
## Number of independent variables: 23
## Mtry: 5
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.2032407
## R squared (OOB): 0.7098645
The below Figure 4.2.4 is a bar plot showing the tuned and final random forest model’s relative feature importance. The features ranked more highly were more important predictors for nighttime traffic, while features ranked lower were less important predictors. A more extensive discussion of this model’s feature importance is included in below in Section 5.
Using a similar workflow to evaluate the random forest model errors, the following graphic shows a scatter plot comparing the actual and predicted values between the training and test sets of data. There is less overlap between the green and orange lines than in the analogous plot for the linear model (Figure 4.2.1), which suggests that this model could be better calibrated to align the predicted and actual values. However, the distribution of the predictions, shown by the tighter clustering of dots, has much improved. There are still outliers in both the testing and training datasets that indicate the model has trouble coping with particularly high values. Again, further calibration and including additional features to account for these high values would further improve the model’s accuracy.
Comparing the error terms of the test sets between the linear model and the random forest model, we see that the latter is substantially more accurate with a MAPE down to 17% from 40%.
phl.test_table <-
phl.test %>%
st_drop_geometry() %>%
mutate(Regression = "Linear Regression",
Visits.Predict = exp(predict(reg1, phl.test)),
Visits.Error = Visits.Predict - Night_visits_sqmi,
Visits.AbsError = abs(Visits.Predict - Night_visits_sqmi),
Visits.APE = (abs(Visits.Predict - Night_visits_sqmi)) / Visits.Predict)
phl.test.rf_table <-
phl.test.rf %>%
st_drop_geometry() %>%
mutate(Regression = "Random Forest Regression",
Visits.Predict = as.numeric(unlist(rf1_predict.test_fit)),
Visits.Error = Visits.Predict - Night_visits_sqmi,
Visits.AbsError = abs(Visits.Predict - Night_visits_sqmi),
Visits.APE = (abs(Visits.Predict - Night_visits_sqmi)) / Visits.Predict)
rbind(phl.test.rf_table, phl.test_table) %>%
group_by(Regression) %>%
summarize(MAE = mean(Visits.AbsError, na.rm = TRUE),
MAPE = (mean(Visits.AbsError, na.rm = T) / mean(Night_visits_sqmi, na.rm = T)*100)) %>%
kable(caption = "MAE and MAPE for Test Set Data") %>% kable_styling(full_width = FALSE)
Regression | MAE | MAPE |
---|---|---|
Linear Regression | 36855.72 | 42.34710 |
Random Forest Regression | 16284.79 | 18.71117 |
We performed a k-fold analysis on the random forest model to compare the MAE distribution across 10 folds. The output of this analysis is shown in Figure 4.2.6 below. We can see that the random forest’s error terms cluster lower than the linear model errors and are also more concentrated. This further demonstrates an improvement over the first model’s generalizability.
set.seed(414)
folds <- vfold_cv(dat_pred_agg, v = 10)
reg1.cv.rf <-
rf_best_wf_time %>%
fit_resamples(folds, metrics = metric_set(mae))
reg1.cv.rf.metrics <- collect_metrics(reg1.cv.rf, summarize = FALSE)
rbind(reg1.cv.resample %>%
mutate(Regression = "Linear Model") %>%
dplyr::select(MAE, Regression),
reg1.cv.rf.metrics %>%
mutate(Regression = "Random Forest Model") %>%
dplyr::select(.estimate, Regression) %>%
dplyr::rename("MAE" = ".estimate")) %>%
ggplot(aes(x=MAE)) +
geom_histogram(color = "grey40", fill = "#27fdf5", bins = 50) +
labs(title="Full Dataset: Histogram of Mean Absolute Error Across 10 Folds",
subtitle = "Figure 4.2.6") +
facet_wrap(~Regression, ncol = 1) +
scale_x_continuous(name = "Mean Absolute Error") +
scale_y_continuous(name = "Count") +
plotTheme()
date_range_start | MAE | MAPE |
---|---|---|
2018-01-01 | 23719.446 | 55.27263 |
2018-02-01 | 22863.423 | 58.18820 |
2018-03-01 | 13808.787 | 11.89650 |
2018-04-01 | 21049.975 | 17.22808 |
2018-05-01 | 9196.629 | 11.35959 |
2018-06-01 | 8515.113 | 11.08929 |
2018-07-01 | 8493.734 | 10.71643 |
2018-08-01 | 21914.708 | 16.67665 |
2018-09-01 | 17430.761 | 14.68143 |
2018-10-01 | 19129.894 | 14.41026 |
2018-11-01 | 21681.780 | 14.31361 |
2018-12-01 | 8867.888 | 130.37438 |
corr_type | MAE | MAPE |
---|---|---|
Neighborhood Subcenter | 17372.55 | 38.05579 |
Neighborhood Center | 18086.87 | 25.90396 |
Community Center | 11731.69 | 21.90437 |
Regional Center | 17339.01 | 13.50889 |
Superregional Center | 14820.46 | 13.56314 |
Specialty Center | 10115.62 | 24.12173 |
As a final evaluation of the random forest model, we map the errors across Philadelphia commercial corridors in the below Figures 4.2.7a and 4.2.7b. The map on the left shows the highest quintile of absolute error tends to occur in smaller corridors. We see a particular cluster of these errors in far West Philadelphia and the area north of Center City. When reviewing the Percent Error metric on the right, the best predictions are concentrated in Center City, likely because these corridors see more traffic overall.
map_preds_space <-
map_preds_rf %>%
dplyr::mutate(Visits.AbsError = round(abs(exp(.pred) - Night_visits_sqmi),2),
PercentError = round((Visits.AbsError / Night_visits_sqmi)*100,2)) %>%
group_by(corridor) %>%
summarise(Night_visits_sqmi = mean(Night_visits_sqmi),
Visits.AbsError = mean(Visits.AbsError),
PercentError = mean(PercentError))
#Quintile maps
ErrorPlot1 <- ggplot() +
geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
geom_sf(data = map_preds_space, aes(fill = q5(Visits.AbsError), color = q5(Visits.AbsError))) +
scale_fill_manual(values = palette5,
labels=qBr(map_preds_space,"Visits.AbsError"),
name="Quintile\nBreaks") +
scale_color_manual(values = palette5,
labels=qBr(map_preds_space,"Visits.AbsError"),
name="Quintile\nBreaks") +
labs(title="Absolute Errors",
subtitle = "Night Visit Predictions",
caption = "Figure 4.2.7a") +
mapTheme()
ErrorPlot2 <- ggplot() +
geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
geom_sf(data = map_preds_space, aes(fill = q5(PercentError), color = q5(PercentError))) +
scale_fill_manual(values = palette5,
labels=qBr(map_preds_space, "PercentError"),
name="Quintile\nBreaks") +
scale_color_manual(values = palette5,
labels=qBr(map_preds_space,"PercentError"),
name="Quintile\nBreaks") +
labs(title="Percent Errors",
subtitle = "Night Visit Predictions",
caption = "Figure 4.2.7b") +
mapTheme()
grid.arrange(ErrorPlot1, ErrorPlot2, ncol=2)
Regression | MAE | MAPE |
---|---|---|
Linear Regression | 36855.72 | 42.34710 |
Random Forest Regression | 16284.79 | 18.71117 |
Workday Linear Regression | 103900.78 | 124.42342 |
Workday Random Forest Regression | 78828.11 | 90.57324 |
This suggests that the factors determining workday traffic are distinct from those important to evening traffic. There does, however, seem to be some overlap in the significant features of the two models. The below summary table from the linear model shows that a number of variables are still statistically significant, including the density of restaurants, retail mix, and even if businesses are open late. Surprisingly, the distance to University of Pennsylvania, an important job hub within Philadelphia, is not a significant predictor. This is possibly due to the fact that the university is situated within a residential area in West Philadelphia. While the university itself may attract a fair amount of day time traffic, the surrounding areas might not. Other variables that are not similarly significant between the nighttime and workday models include many of the demographic variables, such as percent white, median age, and population density.
##
## Linear Regression Results
## =============================================================================
## Dependent variable:
## ---------------------------------------------
## Night_visits_sqmi_log Workday_visits_sqmi_log
## (1) (2)
## -----------------------------------------------------------------------------
## phl_building_size_log 0.060*** 0.142***
## (0.017) (0.016)
##
## phl_UPenn 0.00001*** 0.00000
## (0.00000) (0.00000)
##
## sg_distance_home_log -0.271*** -0.272***
## (0.037) (0.036)
##
## sg_dwell 0.001*** 0.001***
## (0.0003) (0.0003)
##
## dens_bars_log 0.052 -0.027
## (0.049) (0.047)
##
## dens_rest_log 0.508*** 0.421***
## (0.031) (0.029)
##
## dens_arts_log 0.035 0.154***
## (0.056) (0.053)
##
## dens_grocery 0.001*** 0.001***
## (0.0003) (0.0003)
##
## dens_sports 0.003*** 0.003***
## (0.001) (0.001)
##
## count_retailmix_top -0.019*** -0.003
## (0.003) (0.003)
##
## count_retailmix_sub_log 0.673*** 0.623***
## (0.046) (0.044)
##
## dens_openlate_log -0.099*** -0.074**
## (0.035) (0.032)
##
## nn_transit_log -0.135*** -0.158***
## (0.023) (0.022)
##
## nn_parks_log 0.123*** 0.064
## (0.043) (0.040)
##
## nn_trolley -0.00001*** -0.00000
## (0.00000) (0.00000)
##
## dens_late_tag -0.002*** -0.002***
## (0.0004) (0.0004)
##
## corr_typeNeighborhood Center -0.082* -0.009
## (0.045) (0.044)
##
## corr_typeNeighborhood Subcenter -0.041 -0.057
## (0.059) (0.057)
##
## corr_typeRegional Center 0.596*** 0.410***
## (0.142) (0.144)
##
## corr_typeSpecialty Center 0.041 -0.009
## (0.081) (0.081)
##
## corr_typeSuperregional Center 1.212*** 0.776***
## (0.152) (0.142)
##
## dens_late_tag_log 0.239*** 0.277***
## (0.052) (0.050)
##
## demo_pctWhite -0.002*** -0.001*
## (0.001) (0.001)
##
## demo_medAge 0.005** 0.001
## (0.002) (0.002)
##
## demo_popdens 0.00000** -0.00000
## (0.00000) (0.00000)
##
## demo_medrent 0.0003*** 0.0003***
## (0.0001) (0.0001)
##
## demo_MHI -0.00001*** -0.00001***
## (0.00000) (0.00000)
##
## Constant 7.764*** 9.103***
## (0.513) (0.486)
##
## -----------------------------------------------------------------------------
## Observations 1,896 1,896
## R2 0.548 0.551
## Adjusted R2 0.542 0.545
## Residual Std. Error (df = 1868) 0.584 0.557
## F Statistic (df = 27; 1868) 83.963*** 84.950***
## =============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
This model sheds light on research findings that have important implications for the governance of the nighttime economy in Philadelphia and perhaps in other cities as well. The following section discusses these findings in greater detail.
According to our final model, the following predictors were some of the most important in estimating nighttime traffic. A discussion of these features is included below.
The density of restaurants is, by far, the most significant predictor of nighttime traffic than the other features. In terms of an economic development takeaway, this suggests that restaurants and food service are important generators of evening activity that contribute to the nighttime economy. This also suggests that creating and sustaining an environment that is friendly to these businesses will be beneficial to the larger economic health of local commercial corridors. This finding is also significant within the context of COVID-19, as we have seen that restaurant operations have been severely hampered during this difficult year. Our model suggests that prioritizing restaurant recovery in the post-COVID era could have a positive spillover effect for other retail types in Philadelphia.
The density of businesses open late is also an important predictor in the model. While this may be unsurprising when predicting nighttime traffic, this has interesting policy implications in terms of licensing and permitting along commercial corridors. Facilitating businesses staying open later may be another strategy for policymakers to take into consideration when promoting economic recovery. This is particularly relevant to businesses that have highly regulated hours, such as bars and arts venues. Of course, the negative externalities of increased traffic in the evening should be considered carefully. That said, easing the restrictions on certain business types may encourage increased nighttime traffic revenue along local corridors.
The distance from home and dwell time variables derived from the SafeGraph dataset were also strong predictors of nighttime traffic. These are new variables that we did not have access to before SafeGraph showing how this new dataset is an important development in our ability to understand mobility patterns. Reviewing the coefficients of the linear model, we see that the distance from home has a negative relationship with nighttime visits, meaning that as the average distance traveled increases the number of nighttime trips goes down. This suggests that people are more likely to travel shorter distances in the evening and that local, neighborhood corridors that are near a residential areas are an important site of the nightlife economy. The dwell time variable, in contrast, has a positive relationship with nighttime visits. As the average dwell time on a corridor increases, so does the number of trips. In terms of a policy takeaway from this finding, helping nightlife establishments accommodate visitors comfortably (e.g. expanding seating area in and around nightlife destinations) may be a viable strategy for increasing traffic and thereby growing business revenue.
Finally, retail mix is another important predictor of nighttime trips, suggesting that the more retail diversity along a corridor, the more traffic it attracts during nighttime hours. This makes intuitive sense, as more business types along a corridor make it appeal to a wider audience. It also suggests that the most successful corridors at generating nighttime traffic cater to a wide variety of businesses, including those open during the daytime. If nighttime establishments do best along corridors with a diverse mix of retail, some of which we are assuming is not open late, this proposes interesting new planning challenges involving how to plan for a corridor that adapts from day to night. For example, how can a corridor feel both safe and inviting when some of the businesses have closed for the evening, leaving storefronts unactivated throughout the night?
Conversely, the features that were less important predictors have generated equally interesting findings and implications for economic development. We will discuss the three features listed below.
Though certain corridor types were statistically significant to our linear regression – notably the Regional, Superregional, and Specialty Centers – the variable overall was less important to our random forest model. There are a number of possibilities for why this is the case. First, though the corridor typologies describe a specific retail environment, there is likely extensive variation within each category. For example, there are over a hundred Neighborhood Subcenters across the city. The same is true for Neighborhood Centers. While the corridor typology might be useful for conceptualizing the range of retail destinations in Philadelphia or describing those categories with fewer corridors, this variable may be too broad to adequately describe these corridor types.
Given our understanding of the nighttime economy, we were surprised to see that arts establishments and bars ranked low in terms of feature importance. There could be a number of explanations for this. For one, there seems to be some collinearity with some of the other variables describing the density of retail establishments, meaning that they match the pattern of another feature and therefore do not contribute new information to the model. Another possibility for the lack of importance could be related to the hours that we are predicting. The model specifically looks at trips between 7PM and 12AM. It is possible that a number of late night trips to bars and arts establishments are falling outside of that time window. There may be an additional problem with predicting traffic to arts venues, as the trips to these destinations are almost entirely reliant on programming (same with the density of sports venues which also ranks low in terms of importance). This abnormality is difficult to model and is likely one of the factors accounting for the feature’s relatively weak performance.
Though the features representing bars and arts venues were not important to our final model, this is not to say that these establishments are unimportant to the nightlife economy more broadly. From a quantitative perspective, we discuss in the above section that a diverse retail mix along a corridor is positively correlated with nightlife traffic and an important factor in predicting these patterns. Bars and arts venues undoubtedly contribute to a commercial corridor’s retail diversity and mix of uses, adding to its vibrancy. Separately, though our project has modeled mobility as a proxy for economic importance, an establishment’s economic viability may not be the only or primary rationale for its existence. Even if an arts venue does not attract regular traffic to a commercial corridor, this does not mean that we should stop investing in these establishments. They enrich our lives and enhance the character along a corridor, and few would argue that this is not worthy of support. This simply means that the arguments to advocate for these places may not be exclusively economic and rather focus on the establishment’s social and cultural benefits.
Finally, the above analysis indicates that the model does not predict nighttime traffic with equivalent accuracy across time and and corridor type. Looking first at the generalizability across time, the model is best at predicting values during the spring and summer. In particular, the model is most accurate during March, May, June, and July where the MAPE is as low as 11%. In contrast, the predictions over the winter months is far less accurate, with the December prediction having a MAPE of over 100% and the January and February predictions having a MAPE of over 50%. These errors are likely due to far lower evening traffic counts during these months as shown in the below Figure 5.3.1. It is possible that the discrepancy between evening traffic during these months is an error in the data, but it is also possible that it correlates with weather and hours of day light. As the winter months are darker and colder, people may be less likely to participate in the nighttime economy activities during those months.
Another limitation in the model’s generalizability is across corridor types. As discussed in the above Section 4.2 on feature importance, the corridor types may be useful administrative tools for the City of Philadelphia, but they might not be descriptive enough for modelling purposes, particularly for the smaller Neighborhood Subcenters and Centers that have the highest errors in our model. There may simply be too much variation within these administrative categories to guarantee accuracy.
Finally, the model does not generalize well to predicting traffic during the workday. As discussed above, this means that people fundamentally travel for different reasons over the course of a day. While these distinctions in mobility patterns may be intuitive, this is an important reminder that attracting visitors to a given corridor is largely dependent upon surrounding context. Expanding the pool of visitors and times that people visit a corridor likely depends on a complex interplay of retail mix, the built environment, and the socio-economic characteristics of a given corridor. Understanding this broad context is essential to planning for the nighttime economy effectively.
While this project has only scratched the surface of the possibilities of the SafeGraph dataset, we have generated learnings that are relevant to the city’s economic development efforts, particularly during the COVID-19 recovery. This analysis and research has also contributed to a web application that features a data dashboard summarizing socio-economic information and shows visitors origins to commercial corridors across Philadelphia. We hope this tool will be a valuable empirical device for Philadelphia’s corridor managers and economic development officials who have a vested interest in exploring the benefits of the nighttime economy at the corridor level. Separately, the app also puts our predictive model into use. Showcasing a selection of corridors across Philadelphia, our prototype app allows users to manipulate the retail mix of nightlife establishments along a given corridor and see the impact on nighttime traffic.
There are many other iterations of the model that could enhance our understanding of the nighttime economy and improve the model’s accuracy. For one, we could further refine the hours when we are predicting trips. Instead of predicting between the hours of 7PM and 12PM, we could extend the boundaries of this variable to encompass the early morning. This would likely provide a more comprehensive view of traffic that occurs over the course of the entire night. The model could also incorporate multiple years of SafeGraph data to build a more robust picture of mobility along nightlife corridors. Finally, the model could be extended to other cities, which would expand the sample size, hopefully improving the model’s accuracy and applying the learnings outside of Philadelphia to other cities that have a vested interest in fostering their nighttime economy.
library(tidyverse)
library(tidycensus)
library(sf)
library(lubridate)
library(datetime)
library(viridis)
library(gridExtra)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(gganimate)
library(FNN)
library(caret)
library(stargazer)
library(ranger)
library(tidymodels)
qBr <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]],0),
c(.01,.2,.4,.6,.8), na.rm=T))
} else if (rnd == FALSE | rnd == F) {
as.character(formatC(quantile(df[[variable]]), digits = 3),
c(.01,.2,.4,.6,.8), na.rm=T)
}
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
palette3 <- viridis_pal()(3)
palette5 <- viridis_pal()(5)
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
mapTheme_dark <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 14,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_rect(fill = "black"),
axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "black"),
panel.border = element_rect(colour = "black", fill=NA, size=2)
)
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 16,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <- as.matrix(measureFrom)
measureTo_Matrix <- as.matrix(measureTo)
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
dplyr::summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
setwd("~/GitHub/musa_practicum_nighttime")
dat <- read.csv("./data/moves_2018.csv") #2018 SafeGraph Patterns Data
data2020 <- read.csv("./data/moves_monthly2020.csv") # 2020 SafeGraph Patterns Data
phila <- st_read("./demo/phila.geojson", quiet = TRUE) # SafeGraph Places data
#Wrangle 2018 SafeGraph Patterns and Places data into one dataset
dat2 <- dat %>%
dplyr::select(safegraph_place_id,
date_range_start,
date_range_end,
raw_visit_counts,
raw_visitor_counts,
visits_by_day,
poi_cbg,
visitor_home_cbgs,
visitor_daytime_cbgs,
visitor_work_cbgs,
visitor_country_of_origin,
distance_from_home,
median_dwell,
bucketed_dwell_times,
related_same_day_brand,
related_same_month_brand,
popularity_by_hour,
popularity_by_day,
device_type) %>%
left_join(., phila, by = "safegraph_place_id") %>%
st_as_sf() %>%
st_transform('ESRI:102728') #Project to Philly coordinates
#Load census block group shapefile (projected & unprojected)
phl_cbg <-
st_read("http://data.phl.opendata.arcgis.com/datasets/2f982bada233478ea0100528227febce_0.geojson", quiet = TRUE) %>%
mutate(GEOID10 = as.numeric(GEOID10))%>%
st_transform('ESRI:102728')
phl_cbg_unproj <-
st_read("http://data.phl.opendata.arcgis.com/datasets/2f982bada233478ea0100528227febce_0.geojson", quiet = TRUE) %>%
mutate(GEOID10 = as.numeric(GEOID10),
Lat = as.numeric(INTPTLAT10),
Lon = as.numeric(INTPTLON10))
#Load Philadelphia boundary shapefile (projected & unprojected)
phl_boundary <-
st_read("https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson", quiet = TRUE)%>%
st_transform('ESRI:102728')
phl_boundary_unproj <-
st_read("https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson", quiet = TRUE)
#Load commercial corridor shapefile (projected & unprojected)
phl_corridors <- st_read("http://data.phl.opendata.arcgis.com/datasets/f43e5f92d34e41249e7a11f269792d11_0.geojson", quiet = TRUE)%>%
st_transform('ESRI:102728') %>%
mutate(corr_type = ifelse(CORRIDOR_TYPE == 1, "Neighborhood Subcenter",
ifelse(CORRIDOR_TYPE == 2, "Neighborhood Center",
ifelse(CORRIDOR_TYPE == 3, "Community Center",
ifelse(CORRIDOR_TYPE == 4, "Regional Center",
ifelse(CORRIDOR_TYPE == 5, "Superregional Center",
ifelse(CORRIDOR_TYPE == 6, "Specialty Center", "Other")))))))
phl_corridors_unproj <- st_read("http://data.phl.opendata.arcgis.com/datasets/f43e5f92d34e41249e7a11f269792d11_0.geojson", quiet = TRUE)
#Load Neighborhood shapefile (projected & unprojected)
phl_nhoods <-
st_read("https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson",
quiet = TRUE) %>%
st_transform('ESRI:102728')
phl_nhoods_unproj <-
st_read("https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson",
quiet = TRUE)
#Load Philadelphia zip code shapefile & project
phl_zip <-
st_read("http://data.phl.opendata.arcgis.com/datasets/b54ec5210cee41c3a884c9086f7af1be_0.geojson", quiet = TRUE) %>%
st_transform('ESRI:102728') %>%
mutate(CODE = as.numeric(CODE))
#Generate fishnets
fishnet <-
st_make_grid(phl_boundary, cellsize = 1500, square = FALSE) %>%
.[phl_boundary] %>%
st_sf() %>%
mutate(uniqueID = rownames(.))
#Filter restaurants
restaurants <- dat2 %>%
filter(top_category == "Restaurants and Other Eating Places")%>%
mutate(Legend = "Restaurants")
#aggregate restaurant count by fishnet cell
restaurants_net <-
dplyr::select(restaurants) %>%
mutate(countRestaurants = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countRestaurants = replace_na(countRestaurants, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
#Bars
bars <- dat2 %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)") %>%
mutate(Legend = "Bars")
#aggregate bars by fishnet cell
bars_net <-
dplyr::select(bars) %>%
mutate(countBars = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countBars = replace_na(countBars, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE)) %>%
mutate(Legend = "Bars")
#Performing arts
performingarts <- dat2 %>%
filter(top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>%
mutate(Legend = "Performing Arts")
performingarts_net <-
dplyr::select(performingarts) %>%
mutate(countPerformingarts = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countPerformingarts = replace_na(countPerformingarts, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
# Combining fishnets into a single dataframe
vars_net <-
rbind(restaurants,
bars,
performingarts) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
dplyr::summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend,count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
vars_net.long <- gather(vars_net, Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
#Plotting small multiple maps
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange, c(mapList,
ncol=3, top="Count of Nightlife Businesses per Fishnet",
bottom = "Figure 3.1.3"))
# Modify 2018 data
dat_2018 <-
dat %>%
mutate(popularity_by_hour = str_remove_all(popularity_by_hour, pattern = "\\[|\\]")) %>% #remove brackets
unnest(popularity_by_hour) %>% #unnest values
separate(.,
popularity_by_hour,
c("18",
"19", "20", "21", "22", "23"),
sep = ",") %>%
mutate(.,NightVisits2018 = as.numeric(`18`) + as.numeric(`19`) + as.numeric(`20`) + as.numeric(`21`) + as.numeric(`22`) + as.numeric(`23`))
dat_2018 <- dat_2018 %>%
dplyr::select(safegraph_place_id,
location_name,
date_range_start,
date_range_end,
raw_visit_counts,
raw_visitor_counts,
popularity_by_day,
NightVisits2018) %>%
rename(raw_visit_counts2018 = raw_visit_counts,
raw_visitor_counts2018 = raw_visitor_counts,
popularity_by_day2018 = popularity_by_day) %>%
mutate(month = substring(date_range_start,6,7))
dat_2020 <-
data2020 %>%
mutate(popularity_by_hour = str_remove_all(popularity_by_hour, pattern = "\\[|\\]")) %>% #remove brackets
unnest(popularity_by_hour) %>% #unnest values
separate(.,
popularity_by_hour,
c("18",
"19", "20", "21", "22", "23"),
sep = ",") %>%
mutate(.,NightVisits2020 = as.numeric(`18`) + as.numeric(`19`) + as.numeric(`20`) + as.numeric(`21`) + as.numeric(`22`) + as.numeric(`23`))
dat_join <- dat_2020 %>%
dplyr::select(safegraph_place_id,
location_name,
date_range_start,
date_range_end,
raw_visit_counts,
raw_visitor_counts,
popularity_by_day,
NightVisits2020) %>%
rename(raw_visit_counts2020 = raw_visit_counts,
raw_visitor_counts2020 = raw_visitor_counts,
popularity_by_day2020 = popularity_by_day) %>%
mutate(month = substring(date_range_start,6,7)) %>%
inner_join(., dat_2018, by = c("safegraph_place_id", "month", "location_name")) %>%
mutate(Month = case_when(month == "01" ~ "January",
month == "02" ~ "February",
month == "03" ~ "March",
month == "04" ~ "April",
month == "05" ~ "May",
month == "06" ~ "Jun",
month == "07" ~ "May",
month == "08" ~ "May",
month == "09" ~ "May",
month == "10" ~ "May",
month == "11" ~ "May",
month == "12" ~ "May"))
dat_join <- dat_join %>%
dplyr::select(safegraph_place_id,
location_name,
raw_visit_counts2018,
raw_visit_counts2020,
NightVisits2018,
NightVisits2020,
popularity_by_day2018,
popularity_by_day2020,
month) %>%
left_join(., phila, by = "safegraph_place_id") %>%
st_as_sf() %>%
st_transform('ESRI:102728')
#Unnesting popularity by hour variable
dat_hour <-
dat2 %>%
select(safegraph_place_id,
date_range_start,
top_category,
sub_category,
popularity_by_hour,
poi_cbg,
median_dwell) %>% #select variables
mutate(popularity_by_hour = str_remove_all(popularity_by_hour, pattern = "\\[|\\]")) %>% #clean-up
unnest(popularity_by_hour) %>% #unnest
separate(.,
popularity_by_hour,
c("0", "1", "2", "3", "4", "5", "6",
"7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18",
"19", "20", "21", "22", "23"),
sep = ",") %>% #separate hour values into individual columns
pivot_longer(cols = 4:27, names_to = "Hour", values_to = "Count") %>% #convert to long format
mutate(Hour = as.numeric(Hour),
Count = as.numeric(Count)) #convert hour trip counts to numeric values
dat_1_6 <- dat_hour %>%
#select trip counts between the hours of 1AM and 6AM
filter(Hour == "1" | Hour == "2" | Hour == "3" | Hour == "4" | Hour == "5" | Hour == "6") %>%
group_by(safegraph_place_id, date_range_start) %>%
summarize(Hrs1_6 = sum(Count))
dat_7_12 <- dat_hour %>%
#select trip counts between the hours of 7AM and 12PM
filter(Hour == "7" | Hour == "8" | Hour == "9" | Hour == "10" | Hour == "11" | Hour == "12") %>%
group_by(safegraph_place_id, date_range_start) %>%
summarize(Hrs7_12 = sum(Count))
dat_13_18 <- dat_hour %>%
#select trip counts between the hours of 1PM and 6PM
filter(Hour == "13" | Hour == "14" | Hour == "15" | Hour == "16" | Hour == "17" | Hour == "18") %>%
group_by(safegraph_place_id, date_range_start) %>%
summarize(Hrs13_18 = sum(Count))
dat_19_0 <- dat_hour %>%
#select trip counts between the hours of 6PM and 12AM
filter(Hour == "19" | Hour == "20" | Hour == "21" | Hour == "22" | Hour == "23" | Hour == "0") %>%
group_by(safegraph_place_id, date_range_start) %>%
summarize(Hrs19_0 = sum(Count))
dat_workday <- dat_hour %>%
#select trip counts between workday hours of 9AM and 6PM
filter(Hour == "9" | Hour == "10" | Hour == "11" | Hour == "12" | Hour == "13" | Hour == "14" |Hour == "15" | Hour == "16" |Hour == "17" |Hour == "18") %>%
group_by(safegraph_place_id, date_range_start) %>%
summarize(Hrs_workday = sum(Count))
#Combine into new dataset with hour trip counts and corridor + neighborhood information
dat3 <-
dat2 %>%
left_join(dat_1_6, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_7_12, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_13_18, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_19_0, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_workday, by = c('safegraph_place_id', 'date_range_start')) %>%
st_join(phl_corridors, join = st_intersects) %>%
st_join(phl_nhoods, join = st_intersects)
#Philadelphia Features
phl_corridors_pred <- phl_corridors %>%
st_as_sf() %>%
dplyr::select(4:5,18, 'corr_type') %>%
rename(., corridor = NAME,
district = P_DIST,
vacancy_rate = VAC_RATE) %>%
mutate(vacancy_rate = str_remove_all(vacancy_rate, pattern = "%"),
vacancy_rate = str_remove_all(vacancy_rate, pattern = "\r\n"),
vacancy_rate = as.numeric(vacancy_rate),
corr_area_sqft = as.numeric(st_area(phl_corridors)),
corr_area_sqmi = as.numeric(st_area(phl_corridors)*0.000000035870))
phl_corridors_pred$vacancy_rate <- phl_corridors_pred$vacancy_rate %>% replace_na(0)
UPenn.sf <-
dat3 %>%
dplyr::select(location_name, date_range_start, top_category) %>%
filter(location_name == "Univ of Penn",
top_category == "Colleges, Universities, and Professional Schools",
date_range_start == "2018-01-01T05:00:00Z")
septaStops <-
rbind(
st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson", quiet = TRUE) %>%
dplyr::mutate(Line = "El") %>%
st_transform('ESRI:102728') %>%
dplyr::select(Station, Line),
st_read("https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson", quiet = TRUE) %>%
dplyr::mutate(Line ="Broad_St") %>%
st_transform('ESRI:102728') %>%
dplyr::select(Station, Line))
parks <-
st_read("https://opendata.arcgis.com/datasets/d52445160ab14380a673e5849203eb64_0.geojson",
quiet = TRUE) %>%
st_transform('ESRI:102728') %>%
dplyr::select(PUBLIC_NAME)
phl_building_footprints <-
st_read("https://opendata.arcgis.com/datasets/ab9e89e1273f445bb265846c90b38a96_0.geojson",
quiet = TRUE) %>%
dplyr::select(ADDRESS) %>%
st_transform('ESRI:102728')
phl_building_footprints <- phl_building_footprints %>%
dplyr::mutate(bld_area_sqft = as.numeric(st_area(phl_building_footprints)),
bld_area_sqmi = as.numeric(st_area(phl_building_footprints)*0.000000035870))
phl_trolley <-
st_read("https://opendata.arcgis.com/datasets/e09e9f98bdf04eada214d2217f3adbf1_0.geojson") %>%
st_transform('ESRI:102728') %>%
.[phl_boundary,] %>%
filter(Mode == "Trolley")
#Demographic data
phl_blockgroups <-
get_acs(geography = "block group",
variables = c("B01003_001E",
"B02001_002E",
"B01002_001E",
"B19013_001E",
"B25064_001E"
# "B03002_012E",
# "B02001_003E"
),
year=2018,
state=42,
county=101,
geometry=T,
output = "wide") %>%
st_transform('ESRI:102728')%>%
dplyr::rename(TotalPop = B01003_001E,
White = B02001_002E,
MedAge = B01002_001E,
MedHHInc = B19013_001E,
MedRent = B25064_001E,
# Hisp = B03002_012E,
# Black = B02001_003E,
GEOID10 = GEOID) %>%
dplyr::select(-ends_with("M")) %>%
dplyr::mutate(pctWhite = ((White / TotalPop)*100),
# pctBlack = ((Black / TotalPop)*100),
# pctHisp = ((Hisp / TotalPop)*100),
tract_area_mi = as.numeric(st_area(geometry)*0.000000035870),
popdens_mi = TotalPop / tract_area_mi
) %>%
st_as_sf()
dat_pred_temp <-
dat2 %>%
left_join(dat_1_6, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_7_12, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_13_18, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_19_0, by = c('safegraph_place_id', 'date_range_start')) %>%
left_join(dat_workday, by = c('safegraph_place_id', 'date_range_start')) %>%
st_join(phl_blockgroups) %>%
st_join(phl_nhoods) %>%
st_join(phl_corridors_pred) %>%
st_join(., phl_building_footprints) %>%
drop_na(corridor)
#Engineer variables at the destination level.
dat_pred <-
dat_pred_temp %>%
dplyr::mutate(total = 1,
bars = ifelse(top_category == 'Drinking Places (Alcoholic Beverages)', 1, 0),
restaurant = ifelse(top_category == 'Restaurants and Other Eating Places', 1, 0),
arts = ifelse(top_category == 'Promoters of Performing Arts, Sports, and Similar Events' |
top_category == 'Performing Arts Companies', 1, 0),
grocery = ifelse(top_category == 'Grocery Stores', 1, 0),
sports = ifelse(top_category == 'Spectator Sports', 1, 0),
transit_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(septaStops)), 2),
parks_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(parks)), 4),
trolley_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(phl_trolley)), 4),
late_night = if_else(grepl("Late Night", dat_pred_temp$category_tags), 1, 0),
closing_time = ifelse(str_detect(open_hours, paste(c("20:00", "21:00", "22:00", "23:00", "0:00", "1:00", "2:00", "3:00"), collapse = "|")), "OPEN LATE", "NOT OPEN LATE"),
closing_time = ifelse(is.na(closing_time) == TRUE, "NO DATA", closing_time),
open_late = ifelse(closing_time == "OPEN LATE", 1, 0)
)
dat_pred$UPenn <- as.numeric(st_distance(dat_pred, UPenn.sf))
#Aggregating up to the corridor level
dat_pred_agg <-
dat_pred %>%
group_by(corridor, date_range_start, corr_type) %>%
summarize(Night_visits = sum(Hrs19_0),
Workday_visits = sum(Hrs_workday),
phl_area_sqmi = mean(corr_area_sqmi, na.rm = TRUE),
Night_visits_sqmi = (Night_visits + 1) / phl_area_sqmi,
Night_visits_sqmi_log = log(Night_visits_sqmi),
Workday_visits_sqmi = (Workday_visits +1) / phl_area_sqmi,
Workday_visits_sqmi_log = log(Workday_visits_sqmi),
total = sum(total),
phl_building_size = mean(bld_area_sqft +1, na.rm = TRUE),
phl_building_size_log = log(phl_building_size),
phl_UPenn = mean(UPenn),
sg_dwell = mean(median_dwell, na.rm = TRUE),
sg_distance_home = mean(distance_from_home, na.rm = TRUE),
sg_distance_home_log = log(sg_distance_home),
dens_bars = (sum(bars, na.rm = TRUE) + 1)/phl_area_sqmi,
dens_rest = (sum(restaurant, na.rm = TRUE) + 1)/phl_area_sqmi,
dens_arts = (sum(arts, na.rm = TRUE) +1)/phl_area_sqmi,
dens_sports = (sum(sports, na.rm = TRUE) + 1)/phl_area_sqmi,
dens_grocery = (sum(grocery, na.rm = TRUE) + 1)/phl_area_sqmi,
count_retailmix_top = n_distinct(top_category),
count_retailmix_sub = n_distinct(sub_category),
dens_late_tag = (sum(late_night, na.rm = TRUE) + 1)/phl_area_sqmi,
dens_openlate = (sum(open_late, na.rm = TRUE) + 1)/phl_area_sqmi,
dens_bars_log = log(dens_bars),
dens_rest_log = log(dens_rest),
dens_arts_log = log(dens_arts),
dens_grocery_log = log(dens_grocery),
dens_late_tag_log = log(dens_late_tag),
count_retailmix_top_log = log(count_retailmix_top),
count_retailmix_sub_log = log(count_retailmix_sub),
dens_openlate_log = log(dens_openlate),
nn_transit = mean(transit_nn),
nn_parks = mean(parks_nn),
nn_trolley = mean(trolley_nn),
nn_parks_log = log(nn_parks),
nn_transit_log = log(nn_transit),
nn_trolley_log = log(nn_trolley),
demo_popdens = mean(popdens_mi),
demo_popdens_log = log(demo_popdens),
demo_pctWhite = weighted.mean(pctWhite, Hrs19_0, na.rm = TRUE),
demo_medAge = weighted.mean(MedAge, Hrs19_0, na.rm = TRUE),
demo_MHI = weighted.mean(MedHHInc, Hrs19_0, na.rm = TRUE),
demo_medrent = weighted.mean(MedRent, Hrs19_0, na.rm = TRUE)) %>%
st_drop_geometry() %>%
left_join(phl_corridors_pred %>% dplyr::select(corridor)) %>%
drop_na(corridor, demo_MHI, demo_medrent) %>% #Some of the census variables aren't reporting for our block groups
st_as_sf() %>%
ungroup() %>%
na.omit(st_distance_home)
reg.vars <- c('Night_visits_sqmi_log',
'phl_building_size_log',
'phl_UPenn',
'sg_distance_home_log',
'sg_dwell',
'dens_bars_log',
'dens_rest_log',
'dens_arts_log',
'dens_grocery',
'dens_sports',
'count_retailmix_top',
'count_retailmix_sub_log',
'dens_openlate_log',
'nn_transit_log',
'nn_parks_log',
'nn_trolley',
'dens_late_tag',
'corr_type',
'dens_late_tag_log',
'demo_pctWhite',
'demo_medAge',
'demo_popdens',
'demo_medrent',
'demo_MHI')
#Setting up test and training datasets
set.seed(414)
inTrain <- createDataPartition(y=dat_pred_agg$Night_visits_sqmi_log, p = .60, list = FALSE)
phl.training <- dat_pred_agg[inTrain,]
phl.test <- dat_pred_agg[-inTrain,]
#Multivariate Linear regression
reg1 <-
lm(Night_visits_sqmi_log ~ .,
data = st_drop_geometry(phl.training) %>%
dplyr::select(reg.vars))
stargazer(reg1, type = "text", title = "Linear Regression Results")
set.seed(414)
inTrain <- createDataPartition(y=dat_pred_agg$Night_visits_sqmi_log, p = .60, list = FALSE)
phl.training.rf <- dat_pred_agg[inTrain,]
phl.test.rf <- dat_pred_agg[-inTrain,]
#Multivariate regression
rf1 <-
ranger(Night_visits_sqmi_log ~ ., #change lm to ranger, predictions are a little different
data = st_drop_geometry(phl.training.rf) %>%
dplyr::select(reg.vars),
importance = "impurity")
rf1
set.seed(414)
phl.training.rf_fit <- dat_pred_agg[inTrain,]
phl.test.rf_fit <- dat_pred_agg[-inTrain,]
data_split <- initial_split(dat_pred_agg, strata = "Night_visits_sqmi_log", prop = 0.60)
model_rec <- recipe(Night_visits_sqmi_log ~ ., data = phl.training.rf %>%
dplyr::select(reg.vars) %>%
st_drop_geometry())
#Splitting by month
cv_splits_time <- group_vfold_cv(phl.training.rf%>%
dplyr::select(date_range_start, reg.vars) %>%
st_drop_geometry(),
strata = "Night_visits_sqmi_log",
group = "date_range_start")
rf_plan <- rand_forest() %>%
set_args(mtry = tune()) %>%
set_args(min_n = tune()) %>%
set_args(trees = 1000) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
#Setting parameters to try
rf_grid <- expand.grid(mtry = c(2, 5, 10, 15, 20),
min_n = c(5, 10, 50))
rf_wf <-
workflow() %>%
add_recipe(model_rec) %>%
add_model(rf_plan)
control <- control_resamples(save_pred = TRUE, verbose = TRUE)
#Test hyperparameters
rf_tuned_time <- rf_wf %>%
tune::tune_grid(.,
resamples = cv_splits_time,
grid = rf_grid,
control = control,
metrics = metric_set(rmse, rsq))
show_best(rf_tuned_time, metric = "rmse", n = 15) %>% kable() %>% kable_styling(full_width = FALSE)
rf_best_params_time <- select_best(rf_tuned_time, metric = "rmse")
rf_best_wf_time <- finalize_workflow(rf_wf, rf_best_params_time)
rf_val_fit_time <- rf_best_wf_time %>%
last_fit(split = data_split,
control = control,
metrics = metric_set(mae, rsq))
rf_val_fit_time$.workflow
#Funcion to generate a grid with the desired combinations.
ComboGrid <- function(rest_factor, bar_factor, arts_factor) {
allCombos <- expand.grid(rest_factor = rest_factor*100,
bar_factor = bar_factor*100,
arts_factor = arts_factor*100)
allCombos <- tibble::rowid_to_column(allCombos, "scenario")
allCombos <-
allCombos %>%
mutate(
rest = "rest",
bars = "bars",
arts = "arts")
allCombos <- allCombos %>%
mutate(
filename =
paste(allCombos$rest,
allCombos$rest_factor,
allCombos$bars,
allCombos$bar_factor,
allCombos$arts,
allCombos$arts_factor, sep = "-"))
return(allCombos)
}
#Function to generate separate dataframes for each of the scenarios
ScenarioGenerator <- function(dataset, corr_list, rest_factor, bar_factor, arts_factor) {
allCombos <- expand.grid(rest_factor = rest_factor,
bar_factor = bar_factor,
arts_factor = arts_factor)
allCombos <- tibble::rowid_to_column(allCombos, "ID")
comboList <- unique(allCombos[['ID']])
allScenarios <- data.frame()
for (i in comboList) {
thisScenario <-
dataset %>%
filter(corridor %in% corr_list) %>%
dplyr::mutate(scenario = as.numeric(allCombos %>% filter(ID==i) %>% dplyr::select(ID)),
count_rest = as.numeric(allCombos %>% filter(ID == i) %>% dplyr::select(rest_factor)) * count_rest,
count_rest_a = count_rest/phl_area_sqmi,
count_rest_a_log = log(count_rest_a + 1),
count_bars = as.numeric(allCombos %>% filter(ID == i) %>% dplyr::select(bar_factor)) * count_bars,
count_bars_a = count_bars/phl_area_sqmi,
count_bars_a_log = log(count_bars_a + 1),
count_arts = as.numeric(allCombos %>% filter(ID == i) %>% dplyr::select(arts_factor)) * count_arts,
count_arts_a = count_arts/phl_area_sqmi,
count_arts_a_log = log(count_arts_a + 1))
Scenario_preds <- predict(rf_val_fit_time$.workflow, new_data = thisScenario %>% as.data.frame())
Scenario_preds <- exp(Scenario_preds %>% as.data.frame())
Scenario_final <- cbind(Scenario_preds, thisScenario %>%
dplyr::select(corridor,
count_rest,
count_arts,
count_bars,
Night_visits,
phl_area_sqmi,
scenario,
geometry)) %>%
dplyr::rename("predictions" = ".pred") %>% #clean-up
mutate(predictions = round((predictions * phl_area_sqmi), digits = 0)) %>%
st_as_sf()
Scenario_final <- Scenario_final %>% st_as_sf() %>% st_transform(crs = "EPSG: 4326")
Scenario_final <- Scenario_final %>%
group_by(corridor, scenario) %>%
summarize(Night_visits = sum(Night_visits),
predictions = sum(predictions))
Scenario_final <-
Scenario_final %>%
mutate(pct_change = round(((predictions - Night_visits)/Night_visits * 100), digits = 1)) %>%
as.data.frame() %>%
left_join(., ScenarioCombos %>% dplyr::select(scenario, filename)) %>%
st_as_sf()
Scenario_final <- as_Spatial(Scenario_final)
writeOGR(Scenario_final,
dsn = paste(unique(Scenario_final$filename), ".geojson"),
'allScenario_final_small',
driver = 'GeoJSON')
}
return(Scenario_final)
}
#Preparing the dataset to run in the function
dat_scenario <-
dat_pred %>%
group_by(corridor, date_range_start) %>%
summarize(count_rest = sum(restaurant, na.rm = TRUE),
count_bars = sum(bars, na.rm = TRUE),
count_arts = sum(arts, na.rm = TRUE)) %>%
st_drop_geometry() %>%
left_join(dat_pred_agg) %>%
group_by(corridor) %>%
drop_na(corridor, demo_MHI, demo_medrent)
#Running the function
rest_factor_select <- c(.95, 1, 1.05) #different scenarios
bar_factor_select <- c(.95, 1, 1.05)
arts_factor_select <- c(.95, 1, 1.05)
ScenarioCombos <- ComboGrid(rest_factor = rest_factor_select,
bar_factor = bar_factor_select,
arts_factor = arts_factor_select)
ScenarioGenerator(dataset = dat_scenario, #Dataset
corr_list = c("30th Street", #Corridors we are including
"54006000 Lancaster Ave",
"Broad and Snyder",
"Bustleton and Red Lion",
"Callowhill East",
"Castor and Magee",
"Central Waterfront/Spring Garden",
"City Ave and Belmont vicinity",
"Front and Kensington",
"Girard and Marshall",
"Godfrey and Ogontz",
"Kensington Ave/Harrowgate",
"Market East",
"Oregon Ave/5th-13th",
"Presidential/Belair",
"Richmond and Allegheny",
'Snyder Plaza and Columbus Commons',
"South Street/Front-8th",
"Washington Avenue West"),
rest_factor = rest_factor_select, #values determined above
bar_factor = bar_factor_select,
arts_factor = arts_factor_select)
#Pulll census data for PA blockgroups
pa_blockgroups <-
get_acs(geography = "block group",
variables = c("B01003_001E", #select variables
"B02001_002E",
"B01002_001E",
"B19013_001E",
"B25064_001E",
"B03002_012E",
"B02001_003E"),
year=2018, #select year
state=42, #PA FIPs code
county=c(101, 017, 045, 091), #FIPS code for Philly, Bucks, Montgomery, Delaware
geometry=T,
output = "wide") %>%
st_transform('ESRI:102728') %>% #PA projection
rename(TotalPop = B01003_001E,
White = B02001_002E,
MedAge = B01002_001E,
MedHHInc = B19013_001E,
MedRent = B25064_001E,
Hisp = B03002_012E,
Black = B02001_003E,
GEOID10 = GEOID) %>% #Clean up column names
dplyr::select(-ends_with("M")) %>%
mutate(pctWhite = ((White / TotalPop)*100),
pctBlack = ((Black / TotalPop)*100),
pctHisp = ((Hisp / TotalPop)*100),
tract_area_mi = as.numeric(st_area(geometry)*0.000000035870),
popdens_mi = TotalPop / tract_area_mi) %>% #Generate new columns
st_as_sf()
#Pull census data for NJ blockgroups
nj_blockgroups <-
get_acs(geography = "block group",
variables = c("B01003_001E",
"B02001_002E",
"B01002_001E",
"B19013_001E",
"B25064_001E",
"B03002_012E",
"B02001_003E"),
year=2018,
state=c(34),
county=c(007, 015, 005), #Camden, Gloucester and Burlington
geometry=T,
output = "wide") %>%
st_transform('ESRI:102728')%>%
rename(TotalPop = B01003_001E,
White = B02001_002E,
MedAge = B01002_001E,
MedHHInc = B19013_001E,
MedRent = B25064_001E,
Hisp = B03002_012E,
Black = B02001_003E,
GEOID10 = GEOID) %>%
dplyr::select(-ends_with("M")) %>%
mutate(pctWhite = ((White / TotalPop)*100),
pctBlack = ((Black / TotalPop)*100),
pctHisp = ((Hisp / TotalPop)*100),
tract_area_mi = as.numeric(st_area(geometry)*0.000000035870),
popdens_mi = TotalPop / tract_area_mi) %>%
st_as_sf()
blockgroups2018 <- rbind(pa_blockgroups, nj_blockgroups)
blockgroups2018 <- blockgroups2018 %>%
mutate(centroid_cbg = st_centroid(geometry),
visitor_cbg = as.numeric(GEOID10)) %>%
select(-GEOID10)
#Modify 2018 data
#Join to places dataset
dat_2018 <- dat2018 %>%
dplyr::select(safegraph_place_id,
raw_visit_counts,
poi_cbg,
visitor_home_cbgs,
visitor_daytime_cbgs,
visitor_work_cbgs,
distance_from_home) %>%
left_join(., phila, by = "safegraph_place_id") %>%
st_as_sf() %>%
st_transform('ESRI:102728')
#Filter to just bars, restaurants, and performing arts and join to corridors
flows <- dat_2018 %>%
filter(top_category == "Drinking Places (Alcoholic Beverages)" |
top_category == "Restaurants and Other Eating Places" |
top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
top_category == "Performing Arts Companies") %>% #filter for business types
mutate(category = case_when(top_category == "Drinking Places (Alcoholic Beverages)" ~ "Bars",
top_category == "Restaurants and Other Eating Places" ~ "Restaurants",
top_category == "Promoters of Performing Arts, Sports, and Similar Events" ~ "Arts",
top_category == "Performing Arts Companies" ~ "Arts")) %>%
st_join(phl_corridors) %>% #join destinations to phl corridors (apply corridor destination to each trip)
st_as_sf() %>%
st_drop_geometry() %>%
select(safegraph_place_id,
category,
poi_cbg,
visitor_home_cbgs,
distance_from_home,
NAME.y) %>% #select columns
mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\[|\\]")) %>%
mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\{|\\}")) %>%
mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = '\\"|\\"')) %>%
mutate(visitor_home_cbgs = str_split(visitor_home_cbgs, pattern = ",")) %>%
unnest(visitor_home_cbgs) %>% #unnest visitor cbg column
separate(.,
visitor_home_cbgs,
c("visitor_cbg", "count"),
sep = ":") %>% #separate count column from visitor cbg
mutate(count = as.numeric(count),
visitor_cbg = as.numeric(visitor_cbg)) %>%
dplyr::rename(., NAME = NAME.y) %>%
drop_na(NAME) #Remove trips with destinations outside of corridors
#Join to corridors centroids
corr_summary <- flows %>%
group_by(NAME, visitor_cbg) %>%
summarize(Count = sum(count)-1,
distance_from_home = mean(distance_from_home)) %>%
left_join(., phl_corridors_cent, by = "NAME")
#Join to blockgroups data
corr_summary1 <- corr_summary %>%
right_join(., blockgroups2018, by = "visitor_cbg") %>%
dplyr::select(visitor_cbg,
NAME.x,
Count,
centroid_corr,
MedAge,
MedHHInc,
pctWhite,
pctBlack,
pctHisp,
centroid_cbg,
distance_from_home) %>%
rename(NAME = NAME.x)
#Group by destination corridor (One row per commercial corridor)
corr_group <- corr_summary1 %>%
group_by(NAME) %>%
summarize(total_trips = sum(Count),
distance_from_home = mean(distance_from_home, na.rm = TRUE),
weighted_age = weighted.mean(MedAge, Count, na.rm = TRUE),
weighted_inc = weighted.mean(MedHHInc, Count, na.rm = TRUE),
weighted_pctWhite = weighted.mean(pctWhite, Count, na.rm = TRUE),
weighted_pctBlack = weighted.mean(pctBlack, Count, na.rm = TRUE),
weighted_pctHisp = weighted.mean(pctHisp, Count, na.rm = TRUE)) %>%
mutate(phl_distance_from_home = weighted.mean(distance_from_home, total_trips, na.rm = TRUE),
phl_age = philadelphia_ACS$MedAge,
phl_inc = philadelphia_ACS$MedHHInc,
phl_pctWhite = philadelphia_ACS$pctWhite,
phl_pctBlack = philadelphia_ACS$pctBlack,
phl_pctHisp = philadelphia_ACS$pctHisp)
#Rejoin to original corridor dataset for geometry (FINAL CORRIDOR DESTINATION DATASET)
corr_group_final <- corr_group %>%
left_join(., phl_corridors %>% select(NAME, VAC_RATE),
by = "NAME")
#Rejoin corridor origins to cbg dataset for geometry (FINAL CORRIDOR ORIGINS DATASET. EACH ROW IS ONE CORRIDOR/CBG COMBINATION)
corr_origins_final <- corr_summary1 %>%
left_join(., corr_group %>% select(NAME, total_trips), by = "NAME") %>%
dplyr::select(visitor_cbg,
NAME,
Count,
MedAge,
MedHHInc,
pctWhite,
pctBlack,
pctHisp,
distance_from_home,
total_trips) %>%
mutate(percent_of_trips = Count / total_trips) %>%
left_join(., blockgroups2018 %>% select(visitor_cbg), by = "visitor_cbg")
corr_origins_final <- corr_origins_final %>% drop_na(NAME)
#Function to write GeoJSON for each corridor
CorridorFilter <- function(dataset) {
corridorList <- unique(dataset[['NAME']])
for (i in corridorList) {
thisCorridor <- dataset %>% filter(NAME==i)
thisCorridor <- thisCorridor %>%
mutate(NAME = gsub("/", "-", NAME, fixed = TRUE),
NAME = gsub(".", "", NAME, fixed = TRUE),
filename = gsub(" ", "-", NAME, fixed = TRUE)) %>%
st_as_sf() %>% st_transform(crs = "EPSG: 4326")
thisCorridor <- as_Spatial(thisCorridor)
writeOGR(thisCorridor,
dsn = paste(unique(thisCorridor$filename), ".geojson", sep = ""),
'thisCorridor',
driver = 'GeoJSON')
}
}
#Run function
CorridorFilter(corr_origins_final)