1. Introduction

1.1 About This Project

This project is part of the Spring 2021 Master of Urban Spatial Analytics Practicum course at the University of Pennsylvania taught by Ken Steif, Michael Fichman, and Matthew Harris. We are incredibly thankful for their their time, knowledge, and generosity during this challenging and fully online semester. Additionally, we want to take this opportunity to thank Daniel Lodise and Dave Maynard from Philadelphia City Councilmember Isaiah Thomas’ office for sharing their perspectives on policy decision-making for the nighttime economy. Additionally, we are incredibly grateful to Kae Anderson from the Fishtown Business Improvement District for fantastic insights about how economic development professionals manage corridors and make decisions. We also want to acknowledge Andrew Renninger and Eugene Chong for their invaluable support accessing the SafeGraph dataset that was integral to this project and sharing general knowledge about the dataset.

Finally we want to acknowledge the support and feedback from our classmates in the MUSA and city planning programs at the University of Pennsylvania. We thank you for your help!

1.1 Abstract

This project studies nightlife mobility patterns in Philadelphia, Pennsylvania and develops a web application for visualizing nightlife activity and predicting future nightlife traffic to commercial corridors. This document provides an overview of our process and key takeaways from our analysis.

While nighttime activity is an important component to Philadelphia’s economy, culture, and identity, there is limited empirical data available to help decision makers govern the nighttime economy. As a result, many of the logistical and policy decisions are largely based on emotional perceptions of the negative externalities of the nighttime economy, such as noise pollution and safety concerns. This project uses the novel SafeGraph dataset to analyze nighttime mobility patterns in Philadelphia and develop a model to predict nighttime trips to commercial corridors. Our model utilizes a variety of factors including retail mix, built environment features, and demographic data. Our findings indicate that changing the corridor retail mix impacts traffic during evening hours. Additionally, our research provides new quantitative insights into the behavior of visitors during nighttime hours in Philadelphia which has important implications for economic development in the city, particularly in the post COVID-19 climate.

All source code can be found throughout the document by pressing the code buttons throughout the report. Additional code we used for setup, data loading, and app development can be found in the appendix at the end of the document.

1.2 Motivation

Broadly defined as economic activity that takes place during the evening hours and centered around food, drink, and entertainment, the nighttime economy is an important component for any vibrant city. Philadelphia, the focus of this study, has a dynamic nightlife economy that features a dynamic arts and culture scene. In 2017, The Philadelphia Cultural Alliance estimated that arts and culture in the city has a total economic impact of over $4 billion in the region. Philadelphia is also home to a renowned selection of restaurants that have an important economic impact on the city. A 2019 report by the Economy League of Greater Philadelphia and the Philadelphia Department of Public Health estimated that food jobs account for 12% of all jobs in the city and food related businesses account for 18% of all firms. These sectors are clearly an important piece of the wider economy with much of their activity and business taking place during the evening hours.

Despite Philadelphia’s nighttime economy being both culturally and economically important to the city, there is relatively limited knowledge of the nighttime economy’s impact at the neighborhood or corridor scale. As many policy and logistical decisions are drawn from perceptions of the nighttime economy’s negative externalities, such as noise pollution and safety concerns, there is a lack of data explaining the positive effects of the nighttime economy at this hyper-local level. Therefore, a key motivation of this project is to better understand how the nighttime economy contributes to commercial corridor traffic, which we understand as a proxy for local economic activity. We hope that this research will help policy makers better weigh the costs and benefits when making decisions governing the nighttime economy.

A separate motivation for this project comes from the unprecedented economic shock of the COVID-19 pandemic beginning in March 2020 and continuing through the lifespan of this project in Spring 2021. As shown in the below Figure 1.2.1, traffic to restaurants, bars, and arts establishments have decreased by about 50% compared to 2018 measurements. While the days of the pandemic appear to wane in Philadelphia, there is still much uncertainty about how these establishments will rebound from this devastating event. As these business types that are integral to the nighttime economy largely depend on gathering large groups of people indoors, it is important for economic development professionals in Philadelphia to have a deep understanding of how the possible benefits of investing in the nighttime economy can benefit the city’s broader economy and help the city rebound from the pandemic.

dat_2018 <- 
  dat %>% 
  mutate(popularity_by_hour = str_remove_all(popularity_by_hour, pattern = "\\[|\\]")) %>% #remove brackets
  unnest(popularity_by_hour) %>% #unnest values
  separate(.,
           popularity_by_hour,
           c("18",
             "19", "20", "21", "22", "23"),
           sep = ",") %>%
  mutate(.,NightVisits2018 = as.numeric(`18`) + as.numeric(`19`) + as.numeric(`20`) + as.numeric(`21`) + as.numeric(`22`) + as.numeric(`23`))

dat_2018 <- dat_2018 %>% 
  dplyr::select(safegraph_place_id, 
                location_name,
                date_range_start, 
                date_range_end, 
                raw_visit_counts,
                raw_visitor_counts, 
                popularity_by_day,
                NightVisits2018) %>%
  rename(raw_visit_counts2018 = raw_visit_counts, 
         raw_visitor_counts2018 = raw_visitor_counts, 
         popularity_by_day2018 = popularity_by_day) %>%
  mutate(month = substring(date_range_start,6,7))

dat_2020 <- 
  data2020 %>% 
  mutate(popularity_by_hour = str_remove_all(popularity_by_hour, pattern = "\\[|\\]")) %>% #remove brackets
  unnest(popularity_by_hour) %>% #unnest values
  separate(.,
           popularity_by_hour,
           c("18",
             "19", "20", "21", "22", "23"),
           sep = ",") %>%
  mutate(.,NightVisits2020 = as.numeric(`18`) + as.numeric(`19`) + as.numeric(`20`) + as.numeric(`21`) + as.numeric(`22`) + as.numeric(`23`))

dat_join <- dat_2020 %>% 
  dplyr::select(safegraph_place_id, 
                location_name,
                date_range_start, 
                date_range_end, 
                raw_visit_counts,
                raw_visitor_counts, 
                popularity_by_day,
                NightVisits2020) %>%
  rename(raw_visit_counts2020 = raw_visit_counts, 
         raw_visitor_counts2020 = raw_visitor_counts, 
         popularity_by_day2020 = popularity_by_day) %>%
  mutate(month = substring(date_range_start,6,7)) %>%
  inner_join(., dat_2018, by = c("safegraph_place_id", "month", "location_name")) %>%
  mutate(Month = case_when(month == "01" ~ "January",
                    month == "02" ~ "February",
                    month == "03" ~ "March",
                    month == "04" ~ "April",
                    month == "05" ~ "May", 
                    month == "06" ~ "Jun",
                    month == "07" ~ "May",
                    month == "08" ~ "May",
                    month == "09" ~ "May",
                    month == "10" ~ "May",
                    month == "11" ~ "May",
                    month == "12" ~ "May"))

dat_join <- dat_join %>% 
  dplyr::select(safegraph_place_id, 
                location_name,
                raw_visit_counts2018,
                raw_visit_counts2020,
                NightVisits2018,
                NightVisits2020,
                popularity_by_day2018,
                popularity_by_day2020,
                month) %>%
  left_join(., phila, by = "safegraph_place_id") %>% 
  st_as_sf() %>%
  st_transform('ESRI:102728') 

dat_citywide <- dat_join %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)" |
           top_category == "Restaurants and Other Eating Places" |
           top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
           top_category == "Performing Arts Companies") %>%
  group_by(month) %>%
  summarize(Total_Visits2018 = sum(NightVisits2018),
            Total_Visits2020 = sum(NightVisits2020)) %>%
  mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100)

#Separate by commercial use  
dat_citywide2 <- dat_join %>%
    filter(top_category == "Drinking Places (Alcoholic Beverages)" |
             top_category == "Restaurants and Other Eating Places" |
             top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
             top_category == "Performing Arts Companies") %>%
  mutate(category = case_when(top_category == "Drinking Places (Alcoholic Beverages)" ~ "Bars",
           top_category == "Restaurants and Other Eating Places" ~ "Restaurants",
           top_category == "Promoters of Performing Arts, Sports, and Similar Events" ~ "Arts",
           top_category == "Performing Arts Companies" ~ "Arts")) %>%
    group_by(category, month) %>%
    summarize(Total_Visits2018 = sum(NightVisits2018),
              Total_Visits2020 = sum(NightVisits2020)) %>%
    mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100) 

#Generate lineplot
rbind(
  (dat_join %>% 
  group_by(month) %>%
  summarize(Total_Visits2018 = sum(NightVisits2018),
            Total_Visits2020 = sum(NightVisits2020)) %>%
  mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100,
         category = "All")), 
  dat_citywide2) %>%
  ggplot(., aes(x = month, y = Percent_Change, group = category, color = category)) + 
  geom_line(lwd = 1.5) +
  geom_hline(yintercept=0, lwd = 1.5, linetype="dotted")+
  scale_color_manual(values = c("#d3d3d3", "#440154", "#21908C", "#FDE725")) +
  scale_x_discrete(name ="Month")+  
  scale_y_continuous(name ="Percent Change") +
  labs(title = "Philadelphia Commercial Trip % Change, 2018 & 2020",
       subtitle = "Figure 1.2.1") +
  plotTheme()

1.3 Use Case

Our goal is to help business improvement districts (BIDs) and economic development professionals understand nightlife patterns across Philadelphia’s commercial corridors. Not only do we believe this insight will provide valuable and interesting findings that will deepen the understanding of evening traffic patterns, but also that this can help further recovery efforts following the COVID-19 pandemic. While we believe that there are many potential applications for this work, we envisioned the following specific uses for our research, analysis, and web application.

  • Understanding relative popularity of commercial corridors to inform decisions to grant licenses or small business assistance.
  • Understanding origins and demographics of visitors to more effectively target marketing resources.
  • Understanding the effects of changing the nightlife retail mix in a given commercial corridor.

1.4 Summary of Methodology

In this study, we use a combination of public and private datasets to study nighttime trends across commercial corridors in Philadelphia. Our model predicts trips between the hours of 7PM and 12AM at the corridor level. After an iterative process of feature engineering, we built a model that combined features that described retail mix, visitor behaviors, the built environment, and demographic data. Ultimately, we embedded this model into a proof-of-concept web application that allows users to understand the impact of a corridor’s retail mix on the evening trip traffic. While the model has limited accuracy and generalizability indicating that it requires continual calibration, this proof-of-concept web application is an important step in making informed, data-driven policy decisions concerning the nighttime economy.

This document walks through our exploratory data analysis and model-building process in detail and includes relevant code to replicate our analysis. All source code can be found throughout the document by pressing the code buttons throughout the report. Additional code we used for setup, data loading, and app development can be found in the appendix at the end of the document.

2. The Data

2.1 The SafeGraph Dataset

SafeGraph is a novel data source that uses anonymized cell phone GPS data to record trips to commercial points of interest. For this project, we used two datasets from SafeGraph: the Monthly Patterns dataset and the Places dataset. For a given point of interest, the Monthly Patterns dataset tells us detailed information on the monthly traffic patterns, such as the number of trips and visitors, the median distance traveled, the median time spent at the establishment, and the origin census block group. Complementing the Patterns dataset, the Places dataset provides detailed information about the given point of interest, including the business category, the hours open, and other descriptive tags to help categorize the place.

The SafeGraph data are brand new and have numerous unexplored applications and insights. The dataset was essential to this project as it allowed us to see trip patterns with unprecedented detail and far higher accuracy than other mobility datasets. While SafeGraph is undoubtedly the cutting edge of mobility data, we encountered some limitations. Primarily, as SafeGraph is based on GPS data, it is inherently noisy with trips incorrectly attributed to certain points of interest. For example, if someone is waiting for a bus outside of an establishment, this may be incorrectly recorded in the data as a trip to the establishment. In dense urban areas where points of interest may be located on multiple floors of the same building or in close proximity to one another, this may result in inaccurate counts across the dataset.

Separately, one component of our app visualizes how many trips come from a given census block group. Due to privacy concerns, SafeGraph only records census block groups with at least 2 devices and any census block group with less than 5 devices are reported as 4. This limits the resolution with which we can see low-traffic census block groups.

2.2 Other Sources

We also used data from various sources on Open Data Philly to capture features in the built environment such as transit, park space, and building size. The dataset we used are summarized in the table in Section 4.1. While these features are relatively standard to cities, replicating this analysis and model for other cities would require that they have similar sources available.

Finally, we used 2018 American Community Survey 5-Year Estimates to describe demographic data for each corridor.

2.3 Philadelphia Commercial Corridors

As our model predicts nighttime trips at the commercial corridor level, we relied heavily on a shapefile from the City of Philadelphia’s Planning Department which demarcates individual corridors and districts throughout the city. According to the available metadata “locations range from large, regional and specialty destinations to corridors that reflect the evolving economy, culture, and aesthetic traditions of surrounding neighborhoods.” We also use the administrative corridor categories defined in the dataset. a description of these categories is included in the below table.

Corridor Type Gross Leasable Area Store Types
Neighborhood Subcenter 10,000 - 35,000 sq.ft. Convenience store grocery, pharmacy, dry cleaner, deli, etc.
Neighborhood Center 30,000 - 120,000 sq.ft. Supermarket, variety store, bank, pharmacy, post office, etc.
Community Center 100,000 - 500, 000 sq. ft. Discount dept store, home improvement, “big boxes” or equiv., “power center”
Regional Center 300,000 - 900,000+ sq.ft. One or two full-line department stores or equivalent
Superregional Center 500,000 - 2,000,000+ sq.ft. Three or more full-line department stores or equivalent
Specialty Center N/A Specialty goods or services, dining, bars, amusements, arts, etc.

Figure 2.1.1 below shows the spatial distribution of the different types of commercial corridors across the city.

3. Exploratory Data Analysis

Before explaining the model building process, it is important to understand the overall trends in the data. In addition to informing which variables to incorporate into the model, this section highlights the descriptive capabilities of the SafeGraph dataset. While our model uses a broad range of datasets to describe Philadelphia’s socio-economic and built environment characteristics, this section primarily focuses on our analysis of the SafeGraph variables.

Key findings from the analysis include:

  • While we primarily examine restaurants, bars, and arts venues as the key business types that contribute to the nighttime economy, all three business types in Philadelphia generate traffic during all hours of the day.
  • COVID-19 has caused a drastic decrease in business traffic to the businesses we consider nightlife establishments. While the data indicate that arts establishments across Philadelphia were able to recover some of the loss towards the end of 2020, bars and restaurants maintained over a 50% decrease in traffic throughout the end of the year. This could be related to the relatively high risk associated with frequenting these establishment types.
  • Of the total daily traffic frequenting commercial corridors across the city, Regional and Superregional Centers have the highest share of traffic during the workday. Smaller corridors outside the central core of Philadelphia tend to have a higher share of evening and early morning traffic. This suggests that visitors tend to leave Center City outside of working hours.
  • The volume of trips varies by corridor type. Regional and Superregional Centers tend to draw a higher volume of travelers, while the smaller neighborhood corridors draw fewer visitors. Community Centers and Specialty Centers fall somewhere in the middle.
  • People travel further distances to reach the Center City area.

3.1 Philadelphia Nightlife Establishments

First, we are curious to see where Philadelphia nightlife establishments are located across the city. The following maps indicate where businesses that contribute to the city’s nightlife economy are located. The categories include:

  • Bars
  • Restaurants
  • Arts Venues

Throughout the analysis, we pay special attention to bars and restaurants, as these organizations are well distributed throughout the city and apply to a local Philadelphia customer base.

Figure 3.1.1 below shows the spatial patterns of these business types. Bars and restaurants represent the highest number of businesses which are spread across the city. Hotels are mostly clustered in the central district of the city and near the airport in the southwest portion of the city. There are far fewer casinos and arts venues.

dat2 %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)" |
           top_category == "Restaurants and Other Eating Places" |
           top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
           top_category == "Performing Arts Companies") %>%
  mutate(category = case_when(top_category == "Drinking Places (Alcoholic Beverages)" ~ "Bars",
                              top_category == "Restaurants and Other Eating Places" ~ "Restaurants",
                              top_category == "Promoters of Performing Arts, Sports, and Similar Events" ~ "Arts",
                              top_category == "Performing Arts Companies" ~ "Arts")) %>%
  ggplot() + 
  geom_sf(data = phl_cbg, fill = "grey80", color = "transparent") +
  geom_sf(color = "red", size = .1) +
  labs(title = "Location of Nightlife Establishments",
       subtitle = "Figure 3.1.1") +
  facet_wrap(~category, nrow = 1) +
  mapTheme()

To look at the spatial distribution of business types another way, we review the distribution of businesses with a fishnet grid. The fishnet allows us to visualize clusters of each business type, as clusters are aggregated to the individual cell level. This allows us to more effectively see the density of these business types more effectively than the above Figure 3.1.1. While all three business types are mostly concentrated within Center City, the below Figure 3.1.2 demonstrates how restaurants are most thoroughly distributed throughout Philadelphia.

fishnet <- 
  st_make_grid(phl_boundary, cellsize = 1000, square = FALSE) %>%
  .[phl_boundary] %>% 
  st_sf() %>%
  mutate(uniqueID = rownames(.))

#Filter restaurants
restaurants <- dat2 %>%
  filter(top_category == "Restaurants and Other Eating Places")%>%
  mutate(Legend = "Restaurants")

#aggregate restaurant count by fishnet cell
restaurants_net <-
  dplyr::select(restaurants) %>% 
  mutate(countRestaurants = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countRestaurants = replace_na(countRestaurants, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

#Bars
bars <- dat2 %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)") %>% 
  mutate(Legend = "Bars")

#aggregate bars by fishnet cell
bars_net <-
  dplyr::select(bars) %>% 
  mutate(countBars = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countBars = replace_na(countBars, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE)) %>%
  mutate(Legend = "Bars")

#Performing arts
performingarts <- dat2 %>%
  filter(top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
           top_category == "Performing Arts Companies") %>%
  mutate(Legend = "Performing Arts")

performingarts_net <-
  dplyr::select(performingarts) %>% 
  mutate(countPerformingarts = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countPerformingarts = replace_na(countPerformingarts, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

# Combining fishnets into a single dataframe
vars_net <- 
  rbind(restaurants, 
        bars, 
        performingarts) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  dplyr::summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend,count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()

vars_net.long <- gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

#Plotting small multiple maps
for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange, c(mapList, 
                        ncol=3, top="Count of Nightlife Businesses per Fishnet", 
                        bottom = "Figure 3.1.2"))

Figure 3.1.3 below breaks down the density of restaurants in a given corridor. While centralized corridors tend to have more restaurants per square mile other smaller corridors in the city support a high concentration of restaurants as well.

Similar to the above, Figure 3.1.4 shows the density of bars along each commercial corridor. Again, we see the corridors around Center City generally showing a higher concentration of this business category. Distinct from the restaurant density map in Figure 3.1.3 above, there seem to be many more corridors with a low density of bars, particularly moving further away from the city’s central core.

Finally, Figure 3.1.5 below shows the density of arts venues by commercial corridor. The majority of corridors have a very low density of performing arts establishments as there are fewer destinations within this category across the city.

3.2 The Impacts of COVID-19 on Nightlife Establishments & Commercial Corridors

Now that we have investigated the distribution of nightlife establishments across Philadelphia, it is important to understand how these establishments have been impacted by the COVID-19 pandemic. Using SafeGraph data, we are able to see that the pandemic has a drastic impact on foot traffic across sectors and space, though not necessarily evenly. First, we observe citywide changes in foot traffic over the course of the pandemic in the below Figure 3.2.1. Comparing 2020 traffic to our 2018 dataset, we see that traffic across the city drastically decreased in March 2020 and represented less than half of the 2018 traffic. This trend continued throughout the year.

#Plot citywide chart
dat_citywide %>%
ggplot(., aes(x = month, y = Percent_Change, fill = Percent_Change)) + 
  geom_col() +
  scale_fill_viridis_c() +
  labs(title = "Philadelphia Commercial Trip % Change, 2018 & 2020",
       subtitle = "Figure 3.2.1", 
       y = "Percent Change", 
       x = "Month",
       fill = "Percent Change") +
  plotTheme()

The next graphics splits out the percent change in traffic by the business types relevant to our nightlife study: arts venues, bars, and restaurants. The data show arts establishments were able to recover some of their traffic towards the end of 2020. Bars and restaurants, on the other hand, suffered the decreased traffic volumes throughout the end of the year. One possible explanation for this trend is that it may be more likely for arts establishments to hold programming while adhering to COVID-19 safety standards (e.g. mask-wearing and social distancing). Restaurants and bars in Philadelphia, are often in smaller spaces and require that visitors remove masks to partake in the experience.

#Separate by commercial use  
dat_citywide2 <- dat_join %>%
    filter(top_category == "Drinking Places (Alcoholic Beverages)" |
             top_category == "Restaurants and Other Eating Places" |
             top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
             top_category == "Performing Arts Companies") %>%
  mutate(category = case_when(top_category == "Drinking Places (Alcoholic Beverages)" ~ "Bars",
           top_category == "Restaurants and Other Eating Places" ~ "Restaurants",
           top_category == "Promoters of Performing Arts, Sports, and Similar Events" ~ "Arts",
           top_category == "Performing Arts Companies" ~ "Arts")) %>%
    group_by(category, month) %>%
    summarize(Total_Visits2018 = sum(NightVisits2018),
              Total_Visits2020 = sum(NightVisits2020)) %>%
    mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100)  
  
#Plot citywide chart by commercial use
dat_citywide2 %>%
  ggplot(., aes(x = month, y = Percent_Change, fill = Percent_Change)) + 
  geom_col() +
  labs(title = "Philadelphia Commercial Trip % Change, 2018 & 2020",
       subtitle = "Figure 3.2.2",
       x = "Month",
       y = "Percent Change",
       fill = "Percent Change") +
  scale_fill_viridis_c() +
  facet_wrap(~category) +
  plotTheme() +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45))

In addition to looking at traffic by business type, the data allows us to view change in business traffic across space. The following Figure 3.2.3 aggregates the change in traffic across all nightlife business categories, underscoring the drastic economic impact the pandemic has had on businesses across the cit. Dark grey areas indicate census block groups where none of the three nighttime economy business category is located. While there are a few exceptions, almost all block groups show a sharp decrease in business traffic, many by well over 50%.

dat_night <- dat_join %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)" |
             top_category == "Restaurants and Other Eating Places" |
             top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
             top_category == "Performing Arts Companies") %>%
  filter(month == '04'| month == '05' | month == '06' | month == '07' | month == '08' | month == '09' | month =='10' | month == '11' | month == '12') %>%
  mutate(GEOID10 = as.numeric(GEOID)) %>%
  group_by(GEOID10) %>%
  summarize(Total_Visits2018 = sum(NightVisits2018),
            Total_Visits2020 = sum(NightVisits2020)) %>%
  st_drop_geometry() %>%
  left_join(phl_cbg) %>% 
  st_as_sf() %>%
  mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100)

#Plot map
dat_night %>%  
  mutate(percent_change = case_when(Percent_Change > 100 ~ 100, 
                                    Percent_Change <= 100 ~ Percent_Change)) %>%
  ggplot() + 
  geom_sf(data = phl_boundary, fill = "grey20", color = "black")+
  geom_sf(aes(fill = percent_change), color = "transparent") + 
  scale_fill_distiller(palette="RdYlGn", direction = 1) +
  labs(title = "Nightlife Trip % Change, 2018 & 2020",
       subtitle = "Figure 3.2.3",
       fill = "Percent Change") +
  mapTheme()

Finally, we review how the changes in mobility to nightlife establishments have impacted individual commercial corridors across Philadelphia in Figure 3.2.4. The below map shows the change in traffic to nightlife establishments aggregated by corridor. While some smaller corridors did not experience much change or even observed small increases in traffic, most corridors saw a drastic decrease in traffic to nightlife establishments. This is particularly evident in Center City, along Market Street in West Philadelphia and at the airport in far Southwest Philadelphia.

corridors_filter <- phl_corridors %>%
  select(OBJECTID, NAME, GLA, P_DIST, ST_EXT, PT_ADD, VAC_RATE) %>%
  mutate(VAC_RATE = str_remove_all(VAC_RATE, pattern = "%")) %>%
  mutate(VAC_RATE = as.numeric(VAC_RATE)) %>%
  st_as_sf() %>%
  st_transform('ESRI:102728') 
  
dat_corr <- dat_join %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)" |
           top_category == "Restaurants and Other Eating Places" |
           top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
           top_category == "Performing Arts Companies") %>%
  filter(month == '04'| month == '05' | month == '06' | month == '07' | month == '08' | month == '09' | month =='10' | month == '11' | month == '12') %>%
  mutate(GEOID10 = as.numeric(GEOID)) %>%
  st_join(corridors_filter) %>% 
  group_by(NAME.y, month) %>%
  summarize(Total_Visits2018 = sum(NightVisits2018),
            Total_Visits2020 = sum(NightVisits2020)) %>%
  mutate(Percent_Change = (Total_Visits2020 - Total_Visits2018)/Total_Visits2018*100) %>%
  drop_na(NAME.y) %>%
  st_as_sf() %>%
  st_transform('ESRI:102728') %>%
  st_drop_geometry() 

dat_corr <- dat_corr %>%
  mutate(NAME = NAME.y) %>%
  left_join(corridors_filter)

#Average across all months
dat_corr_avg <- dat_corr %>%  
  group_by(NAME) %>%
  summarize(Percent_Change = mean(Percent_Change))

#All Commercial Uses Corridor Map
dat_corr %>%  
  filter(month == "04") %>%
  mutate(Percent_Change = case_when(Percent_Change > 100 ~ 100, 
                                    Percent_Change <= 100 ~ Percent_Change)) %>%
  ggplot() + 
  geom_sf(data = phl_boundary, fill = "grey20", color = "darkgrey")+
  geom_sf(aes(fill = Percent_Change, color = Percent_Change, geometry = geometry)) + 
  scale_fill_distiller(palette="RdYlGn", direction = 1) +
  scale_color_distiller(palette="RdYlGn", direction = 1) +
  labs(title = "Commercial Nighttime Trip % Change, April 2018 & 2020",
       subtitle = "Figure 3.2.4",
       fill = "Percent Change",
       color = "Percent Change") +
  mapTheme()

3.3 Hourly Traffic Volume Across Corridor and Business Category

Next we wanted to understand the hourly traffic patterns of the nightlife establishments and corridor types that we are studying in this project. Figure 3.3.1 shows the average daily foot traffic by business type. This graphic indicates that though certain business types are considered part of the nightlife economy, they do not exclusively experience traffic in the evenings. Restaurants are a good example of this, where the data indicates that the highest levels of traffic occur in the middle of the day. Arts venues, on the other hand, have a clear patterns indicating that they are more popular later in the day.

To further understand the role of nightlife for the different corridors across Philadelphia, we examine traffic by time of day for the six different corridor types introduced above. Figure 3.3.2 below aggregates the data by corridor type and shows the share of traffic across different segments of the day. We see that Regional Centers and Superregional Centers, larger corridors that tend to be located closer to Center City, have the highest share of traffic during the late morning and early afternoon. This makes sense, as these corridors are largely synonymous with the city’s major job hubs that attract a substantial amount of daytime traffic. Smaller corridors, such as the Neighborhood Subcenters and Neighborhood Centers, demonstrate a more significant share of their overall traffic occurring during the early morning and evening hours between 7PM and 6AM. The higher share of evening traffic in these areas suggests that the city’s nighttime economy is relatively active in the smaller corridors outside of Center City.

dat_1_6 <- dat_hour %>%
  #select trip counts between the hours of 1AM and 6AM
  filter(Hour == "1" | Hour == "2" | Hour == "3" | Hour == "4" | Hour == "5" | Hour == "6") %>% 
  group_by(safegraph_place_id, date_range_start) %>%
  summarize(Hrs1_6 = sum(Count)) 

dat_7_12 <- dat_hour %>%
  #select trip counts between the hours of 7AM and 12PM
  filter(Hour == "7" | Hour == "8" | Hour == "9" | Hour == "10" | Hour == "11" | Hour == "12") %>%
  group_by(safegraph_place_id, date_range_start) %>%
  summarize(Hrs7_12 = sum(Count)) 

dat_13_18 <- dat_hour %>%
  #select trip counts between the hours of 1PM and 6PM
  filter(Hour == "13" | Hour == "14" | Hour == "15" | Hour == "16" | Hour == "17" | Hour == "18") %>%
  group_by(safegraph_place_id, date_range_start) %>%
  summarize(Hrs13_18 = sum(Count)) 

dat_19_0 <- dat_hour %>%
  #select trip counts between the hours of 6PM and 12AM
  filter(Hour == "19" | Hour == "20" | Hour == "21" | Hour == "22" | Hour == "23" | Hour == "0") %>%
  group_by(safegraph_place_id, date_range_start) %>%
  summarize(Hrs19_0 = sum(Count)) 

dat_workday <- dat_hour %>%
  #select trip counts between workday hours of 9AM and 6PM
  filter(Hour == "9" | Hour == "10" | Hour == "11" | Hour == "12" | Hour == "13" | Hour == "14" |Hour == "15" | Hour == "16" |Hour == "17" |Hour == "18") %>%
  group_by(safegraph_place_id, date_range_start) %>%
  summarize(Hrs_workday = sum(Count))

#Combine into new dataset with hour trip counts and corridor + neighborhood information
dat3 <- 
  dat2 %>% 
  left_join(dat_1_6, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_7_12, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_13_18, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_19_0, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_workday, by = c('safegraph_place_id', 'date_range_start')) %>%
  st_join(phl_corridors, join = st_intersects) %>%
  st_join(phl_nhoods, join = st_intersects)

#Visualization
dat3 %>%
  st_drop_geometry() %>%
  drop_na(corr_type) %>%
  group_by(corr_type) %>%
  summarise(Early_AM = sum(Hrs1_6),
            Late_AM = sum(Hrs7_12),
            Early_PM = sum(Hrs13_18),
            Late_PM = sum(Hrs19_0)) %>%
  pivot_longer(cols = 2:5,
               names_to = "Time",
               values_to = "Traffic") %>%
  ggplot(aes(fill=factor(Time, levels=c("Late_PM", 
                                        "Early_PM", 
                                        "Late_AM", 
                                        "Early_AM")), 
             y=Traffic, 
             x=factor(corr_type, levels=c("Specialty Center",
                                          "Superregional Center",
                                          "Regional Center",
                                          "Community Center",
                                          "Neighborhood Center",
                                          "Neighborhood Subcenter")))) + 
  geom_bar(position="fill", stat="identity") +
  coord_flip() +
  scale_fill_viridis_d() +
  labs(title = "Share of Traffic by Time of Day Across Corridor Types",
       subtitle = "Figure 3.3.2") +
  scale_y_continuous(labels = scales::percent) +
  plotTheme() +
  theme(legend.position = "bottom",
        legend.title = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank()) +
  guides(fill = guide_legend(reverse = TRUE))

The below Figure 3.3.3 looks into the daily traffic patterns in greater detail and breaks out daily traffic pattern by each corridor type and specifically for the traffic to the specific business categories we are studying. Based on this graphic, arts establishments in Neighborhood Centers, Specialty Centers, and Superregional Centers receive the highest share of evening traffic of all corridor types. It is worth noting that the arts category is our smallest group of business types, so a certain establishment might be skewing these results. Restaurants and bars are more evenly visited throughout the day. Notably, restaurants in Regional Centers serve the highest share of day time traffic of the six corridor types. It is also clear that bars tend to have higher traffic in the evening than other business types.

The below Figure 3.3.4 shows that Neighborhood Subcenters and Superregional Centers have the highest shares of traffic occurring in the evening hours, between the 7PM and 12AM. Regional Centers, on average, have a substantially lower share of evening traffic at just over 12%. That being said, there is not huge variation among the different corridor types indicating that nighttime traffic occurs in many different spaces and in different corridor types throughout the city.

Finally, the below map in Figure 3.4.5 shows which corridors have a higher share of evening traffic. The dark purple, representing the lowest quintile of evening traffic is largely concentrated in Center City, suggesting that these corridors are relatively less popular during the evening. Interestingly, the waterfront corridors directly to the east of Center City are in the highest quintile for nighttime traffic. This suggests that the program of these corridors is quite distinct from their neighbors.

3.4 Trip Flows: Origins & Destinations

One of the fascinating features of the SafeGraph dataset is its ability to capture flows of traffic across space by connecting a series of origins and destinations. By mapping the origins and destinations of trips taking place across Philadelphia, we can observe which regions of the city draw from a broader pool swath of visitors across the city, and which areas cater to a more local population. To study these trends in the data, we chose a selection of representative corridors located within the central core of the city and were classified as a separate corridor type. The selected corridors and corresponding classification are listed below. Though we only show a subset of the corridors here, this information is available for all Philadelphia commercial corridors on our app.

Corridor Name Corridor Type
South Street/16th-21st Neighborhood Subcenter
Fairmount/19th-25th Neighborhood Center
Central Waterfront/Washington Community Center
Market West Regional Center
Market East Superregional Center
Old City Specialty Center

Figures 3.4.1 below shows the volume of visitors to the selected corridors by census block group. Block groups shaded grey indicate that no resident made a trip to the corridor. Note that any trips originating from outside of the city are not included in this graphic.

corr_select <-
  phl_corridors %>%
  filter(NAME == "South Street/16th-21st" | #Neighborhood Subcenter
           NAME == "Fairmount/19th-25th" | #Neighborhood Center
           NAME == "Central Waterfront/Washington" | #Community Center
           NAME == "Market West" | #Regional Center
           NAME == "Market East" | #Superregional Center
           NAME == "Old City") %>% #Specialty Center
  st_as_sf()

dat_corr_select <- dat3[corr_select,]

corr_select_cbg <-
  dat_corr_select %>%
  select(safegraph_place_id,
         date_range_start,
         top_category,
         sub_category,
         visitor_home_cbgs,
         NAME.y,
         geometry) %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)" |
         top_category == "Restaurants and Other Eating Places" |
         top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
         top_category == "Performing Arts Companies") %>%
  mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\[|\\]")) %>%
  mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\{|\\}")) %>%
  mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = '\\"|\\"')) %>%
  mutate(visitor_home_cbgs = str_split(visitor_home_cbgs, pattern = ",")) %>%
  unnest(visitor_home_cbgs) %>%
  separate(.,
           visitor_home_cbgs,
           c("cbg_origin", "visitors"),
           sep = ":") %>%
  mutate(cbg_origin = as.numeric(cbg_origin),
         visitors = as.numeric(visitors)) 

corr_select_cbg_map <-  
  corr_select_cbg %>%
  dplyr::group_by(cbg_origin, NAME.y) %>%
  dplyr::summarize(visitors = sum(visitors)) %>%
  st_drop_geometry() %>%
  dplyr::rename(., GEOID10 = cbg_origin) %>%
  left_join(phl_cbg, by = "GEOID10") %>%
  st_as_sf() %>%
  drop_na(geometry) %>%
  mutate(corr_type = case_when(NAME.y == "South Street/16th-21st" ~ "Neighborhood Subcenter",
                               NAME.y == "Fairmount/19th-25th"~ "Neighborhood Center",
                               NAME.y == "Central Waterfront/Washington" ~ "Community Center",
                               NAME.y == "Market West" ~ "Regional Center",
                               NAME.y == "Market East" ~ "Superregional Center",
                                NAME.y == "Old City" ~ "Specialty Center")) 

corr_select_cbg_map %>%
  ggplot() +
  geom_sf(data = phl_cbg, fill = "grey70", color = "transparent") +
  geom_sf(aes(fill = q5(visitors), geometry = geometry), color = "transparent") +
  scale_fill_manual(values = palette5,
                    aesthetics = c("colour", "fill"),
                    name = "Visitor Count \n(Quintile)") +
  mapTheme() +
  labs(title = "Where do nightlife visitors come from?",
       subtitle = "Figure 3.4.1") +
  facet_wrap(~factor(corr_type, 
                     levels=c("Neighborhood Subcenter",
                              "Neighborhood Center",
                              "Community Center",
                              "Regional Center",
                              "Superregional Center",
                              "Specialty Center")))+
  theme(legend.position = "bottom")

Visualizing the origins and destinations in another way, the following maps in Figure 3.4.2 look at the origins and destinations for trips to restaurants, bars, and arts venues across the city. Specifically, it shows the distance between the centroid of the origin neighborhood to the centroid of the destination commercial corridor. We show the same six corridors discussed above. Again, any trips from outside of the city are not included in this graphic.

The graphic demonstrates how traffic flows change across corridor types, with the larger corridors generating far more traffic than neighborhood level corridors. This has important implications for the nighttime economy in Philadelphia as we know that corridor types generating more traffic also tend to be more “daytime” in character. This suggests that businesses catering to the nighttime crowd are pulling from a smaller subset of the traffic that these larger corridors are seeing.

flows <- dat2 %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)" |
           top_category == "Restaurants and Other Eating Places" |
           top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
           top_category == "Performing Arts Companies") %>% #filter for business types
  st_join(phl_corridors) %>% #join destinations to phl corridors (apply corridor destination to each trip)
  st_as_sf() %>%
  st_drop_geometry() %>% 
  select(safegraph_place_id, 
         date_range_start,
         top_category,
         poi_cbg,
         visitor_home_cbgs,
         NAME.y) %>% #select columns
  mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\[|\\]")) %>%
  mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\{|\\}")) %>%
  mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = '\\"|\\"')) %>%
  mutate(visitor_home_cbgs = str_split(visitor_home_cbgs, pattern = ",")) %>%
  unnest(visitor_home_cbgs) %>% #unnest visitor cbg column
  separate(.,
           visitor_home_cbgs,
           c("visitor_cbg", "count"),
           sep = ":") %>% #separate count column from visitor cbg
  mutate(count = as.numeric(count),
         visitor_cbg = as.numeric(visitor_cbg)) %>%
  dplyr::rename(., corridor_dest = NAME.y) %>%
  drop_na(corridor_dest) #Remove trips with destinations outside of corridors

#prepare PHL boundary file to coordinates for plotting
phl_boundary_unproj <-
  phl_boundary_unproj %>%
  st_coordinates() # split out coordinates

phl_boundary_unproj <-
  as.data.frame(phl_boundary_unproj) # save as dataframe

#----MAP V1: NEIGHBORHOOD CENTROID TO CORRIDOR CENTROIDS
#Generate CBG centroid
phl_cbg_cent <- 
  phl_cbg_unproj %>% 
  select(GEOID10, geometry) %>% 
  st_centroid()

#Generate neighborhood centroids
phl_nhood_cent <- 
  phl_nhoods_unproj %>% 
  select(name) %>%
  st_centroid()

#Generate corridor centroid
phl_corr_cent <- 
  phl_corridors_unproj %>% 
  select(NAME, geometry) %>% 
  st_centroid()

#Join CBG centroid to neighborhood shapefile
phl_cbg_nhood <- 
  st_join(phl_nhoods_unproj, phl_cbg_cent, join = st_intersects) %>%
  select(GEOID10, name) %>%
  st_centroid()

flows_nhood <- flows %>%
  left_join(phl_cbg_nhood, by=c("visitor_cbg"="GEOID10")) %>% #join visitor CBGs to nhoods
  dplyr::rename(., nhood_origin = name) %>% #cleanup columns for clarity
  drop_na(nhood_origin) %>% #dropping trips outside of Philadelphia
  st_as_sf() %>%
  st_drop_geometry() %>% 
  group_by(nhood_origin, top_category, corridor_dest) %>% #grouping trip counts by origin neighborhood
  summarize(count = sum(count)) %>%
  left_join(phl_nhood_cent, by=c("nhood_origin"="name")) %>% #join origin nhood to nhood centroid
  left_join(., phl_corr_cent, by=c("corridor_dest"="NAME")) %>% #join destination corridor to corr centroid
  dplyr::rename(., origin.geom = geometry.x,
                        dest.geom = geometry.y) #clean-up columns for clarity

#Convert from tibble to data frame for next step
flows_nhood <- as.data.frame(flows_nhood) 

#split point data into lat and long columns
flows_nhood <- flows_nhood %>% 
  mutate(lat.origin = unlist(map(flows_nhood$origin.geom,1)),
         long.origin = unlist(map(flows_nhood$origin.geom,2)),
         lat.dest = unlist(map(flows_nhood$dest.geom,1)),
         long.dest = unlist(map(flows_nhood$dest.geom,2)),
         id = as.character(c(1:nrow(.))))

#Maps - straight line segment example
flows_nhood %>% 
  filter(corridor_dest == "South Street/16th-21st" | #Neighborhood Subcenter
           corridor_dest == "Fairmount/19th-25th" | #Neighborhood Center
           corridor_dest == "Central Waterfront/Washington" | #Community Center
           corridor_dest == "Market West" | #Regional Center
           corridor_dest == "Market East" | #Superregional Center
           corridor_dest == "Old City") %>% #Specialty Center
  mutate(corr_type = case_when(corridor_dest == "South Street/16th-21st" ~ "Neighborhood Subcenter",
                               corridor_dest == "Fairmount/19th-25th"~ "Neighborhood Center",
                               corridor_dest == "Central Waterfront/Washington" ~ "Community Center",
                               corridor_dest == "Market West" ~ "Regional Center",
                               corridor_dest == "Market East" ~ "Superregional Center",
                                corridor_dest == "Old City" ~ "Specialty Center")) %>%
  ggplot() + 
  geom_polygon(data = phl_boundary_unproj, aes(x=X, y=Y), fill = "grey40") + 
  geom_segment(aes(x = lat.origin, y = long.origin, xend = lat.dest, yend = long.dest,  
                   color=log(count)),
               size = .5,
               lineend = "round") +
  scale_color_viridis_c() +
  coord_equal() +
  mapTheme() + 
  facet_wrap(~factor(corr_type, 
                     levels=c("Neighborhood Subcenter",
                              "Neighborhood Center",
                              "Community Center",
                              "Regional Center",
                              "Superregional Center",
                              "Specialty Center")), ncol = 3) +
  labs(title = "Trip Flows Across Center City Corridors",
       subtitle = "Figure 3.4.2",
       color = "Trip Volume") +
  theme(legend.position = "bottom")

3.5 Other Exploratory Analysis with SafeGraph Data

The SafeGraph dataset is incredibly rich and there are many more fascinating variables to investigate. The following section shows a selection of other discrete analyses that we conducted on the SafeGraph dataset that had interesting implications for our use case.

Distance from Home

Figures 3.5.1 and 3.5.2 below visualize SafeGraph “distance_from_home” variable, which takes the average distance traveled from a visitor’s home to the point of interest along the corridor. For Figure 3.5.1, we have aggregated this variable by corridor to show that on average, visitors travel farther to Center Center and corridors close to the city’s border. Smaller corridors throughout the city draw more local visitors that travel shorter distances. One possible explanation for this pattern could be the presence of job hubs. With jobs concentrated in Center City and a handful of other centers, such as the airport in Southwest Philadelphia and the Navy Yard in South Philadelphia, this could indicate that the people are willing to travel relatively long distances to these corridors. Another possible explanation for this pattern could be the connectivity of these corridors. The commercial corridors in Center City are well-served by public transit and the corridors located close to the city’s border, particularly around South Philly are well connected to the interstate network, facilitating car travel to these destinations. Finally, tourism could play a role in the aggregated distance from home variable. Most of the city’s tourist-friendly attractions are located in the city core, so visitors from outside of Philadelphia could be skewing this variable higher.

dat3 %>%
  drop_na(NAME.y) %>%
  group_by(NAME.y) %>%
  summarize(distance_from_home = mean(distance_from_home, na.rm = TRUE)) %>%
  st_drop_geometry() %>%
  rename("NAME" = "NAME.y") %>%
  left_join(., phl_corridors) %>%
  st_as_sf() %>%
  ggplot() +
  geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
  geom_sf(aes(fill = q5(distance_from_home), color = q5(distance_from_home))) +
  scale_fill_manual(values = palette5,
                    aesthetics = c("colour", "fill"),
                    name = "Distance from Home \n(Quintile)") +
  scale_colour_manual(values = palette5,
                    aesthetics = c("colour", "fill"),
                    name = "Distance from Home \n(Quintile)") +
  labs(title = "Distance from Home Across Commercial Corridors",
       subtitle = "Figure 3.5.1") +
  mapTheme()

Looking at the same variable across the corridor types, Figure 3.5.2 demonstrates that the Regional, Superregional, and Specialty Centers have a substantially higher average distance traveled corridor at 6.4, 14.8, and 9.7 miles respectively. Neighborhood Subcenters have the smallest distance from home value, suggesting that, on average, this corridor type appeals to a more local crowd.

#Barplots
dat3 %>%
  st_drop_geometry() %>%
  drop_na(corr_type) %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)" |
           top_category == "Restaurants and Other Eating Places" |
           top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
           top_category == "Performing Arts Companies") %>%
  group_by(corr_type) %>%
  summarise(Avg_Distance_mi = mean(distance_from_home, na.rm = TRUE)*0.000621371) %>%
  ggplot(aes(y=Avg_Distance_mi, 
             x=factor(corr_type, levels=c("Neighborhood Subcenter",
                                          "Neighborhood Center",
                                          "Community Center",
                                          "Regional Center",
                                          "Superregional Center",
                                          "Specialty Center")))) + 
  geom_bar(stat="identity") +
  # facet_wrap(~corr_type) +
  guides(fill=guide_legend(title=NULL)) +
  coord_flip()+
  theme(axis.title.y = element_blank()) +
  geom_text(aes(label=round(Avg_Distance_mi, digits = 1)), hjust=1.6, color="white", size=3.5) +
  labs(title = "Distance from Home by Corridor Type",
       subtitle = "Figure 3.5.2",
       y = "Average Distance (miles)") +
  plotTheme()

Median Dwell Time

Another variable unique to the SafeGraph dataset is the median_dwell variable which aggregates the median amount of time spent by all visitors at a given point of interest. While we do not see as clear of a spatial pattern in the map as we did with the above distance_from_home variable (Figure 3.5.3), the bar plot in Figure 3.5.4 below indicates that Neighborhood Subcenters have the highest median dwell time at over 45 minutes, while Community Centers have the lowest. As Community Centers tend to be auto-oriented and suburban-style commercial centers, these corridors may be optimized for convenience and, therefore, encourage visitors to spend less time.

Applying this finding to the nighttime economy, establishments catering to the evening crowd tend to benefit when visitors spend a longer time at the establishment. For example, at bars and restaurants, the longer a patron spends at the establishment, the more likely they are to purchase additional items, translating to increased business revenue. This finding could have interesting planning implications for identifying optimal corridors for increased development of the nighttime economy.

dat3 %>%  
  drop_na(NAME.y, corr_type) %>%
  group_by(NAME.y) %>%
  summarize(median_dwell = median(median_dwell, na.rm = TRUE)) %>%
  st_drop_geometry() %>%
  rename("NAME" = "NAME.y") %>%
  left_join(., phl_corridors) %>%
  st_as_sf() %>%
  ggplot() +
  geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
  geom_sf(aes(fill = q5(median_dwell), color = q5(median_dwell))) +
  scale_fill_manual(values = palette5,
                    aesthetics = c("colour", "fill"),
                    name = "Median Dwell \n(Quintile)") +
  scale_color_manual(values = palette5,
                    aesthetics = c("colour", "fill"),
                    name = "Median Dwell \n(Quintile)") +
  labs(title = "Median Dwell Times by Philadelphia Commercial Corridor",
       subtitle = "Figure 3.5.3") +
  mapTheme()

#Barplots
dat3 %>%
  st_drop_geometry() %>%
  drop_na(corr_type) %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)" |
           top_category == "Restaurants and Other Eating Places" |
           top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
           top_category == "Performing Arts Companies") %>%
  group_by(corr_type) %>%
  summarise(Dwell = median(median_dwell, na.rm = TRUE)) %>%
  ggplot(aes(y=Dwell, 
             x=factor(corr_type, levels=c("Neighborhood Subcenter",
                                          "Neighborhood Center",
                                          "Community Center",
                                          "Regional Center",
                                          "Superregional Center",
                                          "Specialty Center")))) + 
  geom_bar(stat="identity") +
  scale_fill_viridis_d() +
  labs(title = "Median Dwell Times by Corridor Type",
       subtitle = "Figure 3.5.4",
       y = "Median Dwell Time (minutes)") +
  geom_text(aes(label=round(Dwell, digits = 1)), hjust=1.6, color="white", size=3.5)+
  guides(fill=guide_legend(title=NULL)) +
  theme(axis.title.y = element_blank()) +
  coord_flip()+
  plotTheme()

3.6 Concluding Thoughts from Exploratory Data Analysis

The SafeGraph dataset is incredibly rich and the above analysis merely demonstrates a small sample of the possibilities of this data. By combining the SafeGraph dataset with Philadelphia’s local commercial corridors, we were able to see unique patterns that further distinguish the corridor types as well as provide important insights into the behavior of people participating in the nighttime economy.

4 Predictive Model

Informed by the above exploratory analysis, the following section discusses how we developed a model to predict nighttime traffic between the hours of 7PM and 12PM to commercial corridors in Philadelphia. We begin by explaining the features that we ultimately chose for the model, describe the modelling approach, and finally walk through an evaluation of our error terms.

4.1 Feature Summary

As mentioned above, the ultimate output of the model is a prediction of trips to commercial corridors during evening hours. This raw trip count by corridor needed to be normalized by corridor area, to account for differently sized corridors. This variable, however, was highly skewed and would be difficult to model. Therefore, we ultimately used the log transformation of the normalized trip count as our dependent variable for the model.

Our independent variables that we used to predict the nighttime trip outcome were developed over an iterative process of data wrangling, feature engineering, and evaluation. Our final list falls into three primary categories described below.

  • SafeGraph Features: These features are entirely sourced from the SafeGraph dataset. They help describe the density of certain business types along a corridor, a measure of retail mix, the distance home and dwell time (explore further in the above Section 3), and how late businesses are open.

  • Built Environment Features: These features are sourced from local datasets maintained on Open Data Philly and describe the built environment of the corridor. They include information on distance to transit, parks, and average building size.

  • Demographic Features: These features are sourced from U.S. Census data and describe socio-economic characteristics of a corridor.

dat_pred_temp <- 
  dat2 %>% 
  left_join(dat_1_6, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_7_12, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_13_18, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_19_0, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_workday, by = c('safegraph_place_id', 'date_range_start')) %>%
  st_join(phl_blockgroups) %>%
  st_join(phl_nhoods) %>%
  st_join(phl_corridors_pred) %>%
  st_join(., phl_building_footprints) %>%
  drop_na(corridor)

#Engineer variables at the destination level.
dat_pred <-
  dat_pred_temp %>%
  dplyr::mutate(total = 1,
         bars = ifelse(top_category == 'Drinking Places (Alcoholic Beverages)', 1, 0),
         restaurant = ifelse(top_category == 'Restaurants and Other Eating Places', 1, 0),
         arts = ifelse(top_category == 'Promoters of Performing Arts, Sports, and Similar Events' |
                         top_category == 'Performing Arts Companies', 1, 0),
         grocery = ifelse(top_category == 'Grocery Stores', 1, 0),
         sports = ifelse(top_category == 'Spectator Sports', 1, 0),
         transit_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(septaStops)), 2),
         parks_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(parks)), 4),
         trolley_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(phl_trolley)), 4),
         late_night = if_else(grepl("Late Night", dat_pred_temp$category_tags), 1, 0),
         closing_time = ifelse(str_detect(open_hours, paste(c("20:00", "21:00", "22:00", "23:00", "0:00", "1:00", "2:00", "3:00"), collapse = "|")), "OPEN LATE", "NOT OPEN LATE"),
         closing_time = ifelse(is.na(closing_time) == TRUE, "NO DATA", closing_time),
         open_late = ifelse(closing_time == "OPEN LATE", 1, 0)
    )

dat_pred$UPenn <- as.numeric(st_distance(dat_pred, UPenn.sf))

#Aggregating up to the corridor level
dat_pred_agg <- 
  dat_pred %>%
  group_by(corridor, date_range_start, corr_type) %>%
  summarize(Night_visits = sum(Hrs19_0),
            Workday_visits = sum(Hrs_workday),
            phl_area_sqmi = mean(corr_area_sqmi, na.rm = TRUE),
            Night_visits_sqmi = (Night_visits + 1) / phl_area_sqmi,
            Night_visits_sqmi_log = log(Night_visits_sqmi),
            Workday_visits_sqmi = (Workday_visits +1) / phl_area_sqmi,
            Workday_visits_sqmi_log = log(Workday_visits_sqmi),
            total = sum(total),
            phl_building_size = mean(bld_area_sqft +1, na.rm = TRUE),
            phl_building_size_log = log(phl_building_size),
            phl_UPenn = mean(UPenn),
            sg_dwell = mean(median_dwell, na.rm = TRUE),
            sg_distance_home = mean(distance_from_home, na.rm = TRUE),
            sg_distance_home_log = log(sg_distance_home),
            dens_bars = (sum(bars, na.rm = TRUE) + 1)/phl_area_sqmi,
            dens_rest = (sum(restaurant, na.rm = TRUE) + 1)/phl_area_sqmi,
            dens_arts = (sum(arts, na.rm = TRUE) +1)/phl_area_sqmi,
            dens_sports = (sum(sports, na.rm = TRUE) + 1)/phl_area_sqmi,
            dens_grocery = (sum(grocery, na.rm = TRUE) + 1)/phl_area_sqmi,
            count_retailmix_top = n_distinct(top_category),
            count_retailmix_sub = n_distinct(sub_category),
            dens_late_tag = (sum(late_night, na.rm = TRUE) + 1)/phl_area_sqmi,
            dens_openlate = (sum(open_late, na.rm = TRUE) + 1)/phl_area_sqmi,
            dens_bars_log = log(dens_bars),
            dens_rest_log = log(dens_rest),
            dens_arts_log = log(dens_arts),
            dens_grocery_log = log(dens_grocery),
            dens_late_tag_log = log(dens_late_tag),
            count_retailmix_top_log = log(count_retailmix_top),
            count_retailmix_sub_log = log(count_retailmix_sub),
            dens_openlate_log = log(dens_openlate),
            nn_transit = mean(transit_nn),
            nn_parks = mean(parks_nn),
            nn_trolley = mean(trolley_nn),
            nn_parks_log = log(nn_parks),
            nn_transit_log = log(nn_transit),
            nn_trolley_log = log(nn_trolley),
            demo_popdens = mean(popdens_mi),
            demo_popdens_log = log(demo_popdens),
            demo_pctWhite = weighted.mean(pctWhite, Hrs19_0, na.rm = TRUE),
            demo_medAge = weighted.mean(MedAge, Hrs19_0, na.rm = TRUE),
            demo_MHI = weighted.mean(MedHHInc, Hrs19_0, na.rm = TRUE),
            demo_medrent = weighted.mean(MedRent, Hrs19_0, na.rm = TRUE)) %>% 
  st_drop_geometry() %>%
  left_join(phl_corridors_pred %>% dplyr::select(corridor)) %>%
  drop_na(corridor, demo_MHI, demo_medrent) %>% #Some of the census variables aren't reporting for our block groups
  st_as_sf() %>%
  ungroup() %>%
  na.omit(st_distance_home)

The below is a final list of variables as well as a brief description of each.

Feature Name Category Source Description
Night_visits_sqmi_log Dependent Variable SafeGraph Log transformation of trip count between the hours of 7PM and 12AM normalized by area
phl_building_size_log Built Environment City of Philadelphia Building Footprints Log transformation of average building footprint (sq ft) on commercial corridor
phl_UPenn Built Environment SafeGraph Distance to the University of Pennsylvania
sg_distance_home_log SafeGraph SafeGraph Log transformation of average distance traveled to a point of interest aggregated to the commercial corridor
sg_dwell SafeGraph SafeGraph Median dwell time of points of interested aggregated to the corridor level
dens_bars_log SafeGraph SafeGraph Log transformation of bar density variable
dens_rest_log SafeGraph SafeGraph Log transformation of restaurant density variable
dens_arts_log SafeGraph SafeGraph Log transformation of arts venue density variable
dens_grocery SafeGraph SafeGraph Log transformation of grocery density variable
dens_sports SafeGraph SafeGraph Log transformation of sports establishment density variable
count_retailmix_top SafeGraph SafeGraph Count of unique business top categories
count_retailmix_sub_log SafeGraph SafeGraph Log transformation of count of unique business subcategories
dens_openlate_log SafeGraph SafeGraph Log transformation of businesses open after 8PM based on open_hours variable
nn_transit_log Built Environment SEPTA - Market-Frankford Line Stations & SEPTA - Broad Street Line Stations Log transformation of distance to nearest 2 SEPTA stations
nn_parks_log Built Environment Philadelphia Parks & Recreation Properties Log transformation of distance to nearest 4 parks
nn_trolley Built Environment SEPTA - Trolley Stops Distance to nearest 4 trolley stations
dens_late_tag SafeGraph SafeGraph Count of businesses with “late” tag
corr_type Built Environment Commercial Corridors of Philadelphia Corridor typology (see detailed description above)
dens_late_tag_log SafeGraph SafeGraph Log transformation of density of businesses with “late” tag
demo_pctWhite Demographic 2018 ACS 5-Year Estimates Percent of residents in corresponding census tract that are white
demo_medAge Demographic 2018 ACS 5-Year Estimates Median Age of residents in corresponding census tract
demo_popdens Demographic 2018 ACS 5-Year Estimates Population density of corresponding census tract
demo_medrent Demographic 2018 ACS 5-Year Estimates Median rent of corresponding census tract
demo_MHI Demographic 2018 ACS 5-Year Estimates Median household income of corresponding census tract

In Figure 4.1.1 below, we see a corrplot display of the final features chosen for the model. There is some collinearity among the variables that describe the density of business types. While not ideal for modelling, including these variables was necessary for the ultimate application of predicting nighttime traffic based on different retail mix scenarios. In order to adjust the retail mix for certain nighttime establishments (bars, restaurants, and arts venues) a variable representing each of these factors needed to be present in the final model. Other variables, though they may have exhibited some degree of collinearity, are present in the model because they were ultimately statistically significant and improved the fit.

4.2 Model Building

We developed two primary models: an OLS linear regression and random forest regression. Both models used the same dataset and predicted the logged nighttime visits per square mile. For both models, we split our data into a training and test sets (60% and 40%, respectively) and evaluated the Mean Absolute Error (MAE) and Mean Average Percent Error (MAPE).

OLS Model

We first developed and evaluated a linear model using our selected features. The results of the regression summary are shown in the table below. We can see that with the exception of the bars and arts variables (which, as explained above, are included for the prediction scenario purposes) and a few of the corridor types, the individual features are quite significant predictors overall. However, the R-Squared value is relatively low at about .5.

## 
## Linear Regression Results
## ===========================================================
##                                     Dependent variable:    
##                                 ---------------------------
##                                    Night_visits_sqmi_log   
## -----------------------------------------------------------
## phl_building_size_log                    0.060***          
##                                           (0.017)          
##                                                            
## phl_UPenn                               0.00001***         
##                                          (0.00000)         
##                                                            
## sg_distance_home_log                     -0.271***         
##                                           (0.037)          
##                                                            
## sg_dwell                                 0.001***          
##                                          (0.0003)          
##                                                            
## dens_bars_log                              0.052           
##                                           (0.049)          
##                                                            
## dens_rest_log                            0.508***          
##                                           (0.031)          
##                                                            
## dens_arts_log                              0.035           
##                                           (0.056)          
##                                                            
## dens_grocery                             0.001***          
##                                          (0.0003)          
##                                                            
## dens_sports                              0.003***          
##                                           (0.001)          
##                                                            
## count_retailmix_top                      -0.019***         
##                                           (0.003)          
##                                                            
## count_retailmix_sub_log                  0.673***          
##                                           (0.046)          
##                                                            
## dens_openlate_log                        -0.099***         
##                                           (0.035)          
##                                                            
## nn_transit_log                           -0.135***         
##                                           (0.023)          
##                                                            
## nn_parks_log                             0.123***          
##                                           (0.043)          
##                                                            
## nn_trolley                              -0.00001***        
##                                          (0.00000)         
##                                                            
## dens_late_tag                            -0.002***         
##                                          (0.0004)          
##                                                            
## corr_typeNeighborhood Center              -0.082*          
##                                           (0.045)          
##                                                            
## corr_typeNeighborhood Subcenter           -0.041           
##                                           (0.059)          
##                                                            
## corr_typeRegional Center                 0.596***          
##                                           (0.142)          
##                                                            
## corr_typeSpecialty Center                  0.041           
##                                           (0.081)          
##                                                            
## corr_typeSuperregional Center            1.212***          
##                                           (0.152)          
##                                                            
## dens_late_tag_log                        0.239***          
##                                           (0.052)          
##                                                            
## demo_pctWhite                            -0.002***         
##                                           (0.001)          
##                                                            
## demo_medAge                               0.005**          
##                                           (0.002)          
##                                                            
## demo_popdens                             0.00000**         
##                                          (0.00000)         
##                                                            
## demo_medrent                             0.0003***         
##                                          (0.0001)          
##                                                            
## demo_MHI                                -0.00001***        
##                                          (0.00000)         
##                                                            
## Constant                                 7.764***          
##                                           (0.513)          
##                                                            
## -----------------------------------------------------------
## Observations                               1,896           
## R2                                         0.548           
## Adjusted R2                                0.542           
## Residual Std. Error                  0.584 (df = 1868)     
## F Statistic                      83.963*** (df = 27; 1868) 
## ===========================================================
## Note:                           *p<0.1; **p<0.05; ***p<0.01

Figure 4.2.1 below compares the predictions and the actual values across the training and the test sets. While the predicted value (the green line) lines up fairly well with the actual value (orange line), the scatter plot indicates a fair amount of variance. Additionally, there are some extreme outliers across both the training and the test set that the model is predicting quite poorly.

We also ran a k-fold cross validation test to view the distribution of errors. The MAE for each fold are plotted in the below Figure 4.2.2.

fitControl <- trainControl(method = "cv", 
                           number = 10,
                           savePredictions = TRUE)

set.seed(414)
reg1.cv <- 
  train(Night_visits_sqmi_log ~ ., data = st_drop_geometry(dat_pred_agg) %>% 
          dplyr::select(reg.vars), 
        method = "lm", 
        trControl = fitControl, 
        na.action = na.pass)

reg1.cv.resample <- reg1.cv$resample

ErrorHist <- 
  ggplot(reg1.cv.resample, aes(x=MAE)) + 
  geom_histogram(color = "grey40", fill = "#27fdf5", bins = 50) + 
  labs(title="Linear Model: Histogram of Mean Absolute Error \nAcross 10 Folds",
       subtitle = "Figure 4.2.2",
       y = "Count",
       x = "Mean Absolute Error") +
  plotTheme()

ErrorHist

We also map the errors by commercial corridor in the following maps. The absolute errors suggest that our model is less accurate predicting nighttime traffic for larger, more heavily-trafficked corridors. The Percent Errors map on the right, however, indicates that smaller corridors throughout the city have a higher error.

cv_preds <- reg1.cv$pred

map_preds <- dat_pred_agg %>% 
  rowid_to_column(var = "rowIndex") %>% 
  left_join(cv_preds, by = "rowIndex") %>%
  group_by(corridor, corr_type) %>%
  summarise(Night_visits_sqmi = mean(Night_visits_sqmi),
            pred = mean(pred)) %>% 
  dplyr::mutate(Visits.AbsError = abs(exp(pred) - Night_visits_sqmi),
         PercentError = (Visits.AbsError / Night_visits_sqmi)*100) 

#Quintile maps
ErrorPlot1 <- ggplot() +
  geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
  geom_sf(data = map_preds, aes(fill = q5(Visits.AbsError), color = q5(Visits.AbsError))) +
  scale_fill_manual(values = palette5,
                      labels=qBr(map_preds,"Visits.AbsError"),
                      name="Quintile\nBreaks") +
  scale_color_manual(values = palette5,
                      labels=qBr(map_preds,"Visits.AbsError"),
                      name="Quintile\nBreaks") +
  labs(title="Absolute Errors",
       subtitle = "Night Visit Predictions",
       caption = "Figure 4.2.3a") +
  mapTheme() 

ErrorPlot2 <- ggplot() +
  geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
  geom_sf(data = map_preds, aes(fill = q5(PercentError), color = q5(PercentError))) +
  scale_fill_manual(values = palette5,
                      labels=qBr(map_preds, "PercentError"),
                      name="Quintile\nBreaks") +
  scale_color_manual(values = palette5,
                      labels=qBr(map_preds,"PercentError"),
                      name="Quintile\nBreaks") +
  labs(title="Percent Errors",
       subtitle = "Night Visit Predictions",
       caption = "Figure 4.2.3b") +
  mapTheme() 

grid.arrange(ErrorPlot1, ErrorPlot2, ncol=2)

Finally the following tables summarize the error terms by corridor type. We acknowledge that the linear model is most accurate at predicting traffic for Neighborhood and Community Centers with an average percent error of about 25%. The model is substantially worse at predicting traffic along Regional Centers and Superregional Centers, where the model’s average percent error is about 50%. Not only is the linear model inaccurate, but it is not generalizable across corridors.

#Comparing errors by corr type
map_preds_corr <- 
  map_preds %>%
  dplyr::mutate(MAE = round(abs(exp(pred) - Night_visits_sqmi),2),
                MAPE = round((MAE / Night_visits_sqmi)*100,2)) %>%
  group_by(corr_type) %>% 
  summarise(Night_visits_sqmi = mean(Night_visits_sqmi),
            MAE = mean(MAE),
            MAPE = mean(MAPE)) %>% 
  st_drop_geometry() %>%
  dplyr::select(corr_type, MAE, MAPE)

map_preds_corr[order(factor(map_preds_corr$corr_type, 
                            levels = c("Neighborhood Subcenter",
                                       "Neighborhood Center",
                                       "Community Center",
                                       "Regional Center",
                                       "Superregional Center",
                                       "Specialty Center"))),] %>%
  kable(caption = "Errors by Corridor Type: Linear Model") %>%
  kable_styling(full_width = FALSE)
Errors by Corridor Type: Linear Model
corr_type MAE MAPE
Neighborhood Subcenter 29682.11 35.51520
Neighborhood Center 24695.91 26.93982
Community Center 20527.93 26.87079
Regional Center 43407.16 47.77333
Superregional Center 78367.69 44.75667
Specialty Center 23359.27 33.76462

We also look at the errors for the linear model across months in the following table. The MAPE remains high across all months, but is particularly high during the winter months of January, February and December.

#Errors by date
dat_pred_agg %>% 
  rowid_to_column(var = "rowIndex") %>% 
  left_join(cv_preds, by = "rowIndex") %>%
  group_by(corridor, date_range_start) %>%
  summarise(Night_visits_sqmi = mean(Night_visits_sqmi),
            pred = mean(pred)) %>% 
  dplyr::mutate(Visits.AbsError = abs(exp(pred) - Night_visits_sqmi),
         PercentError = (Visits.AbsError / Night_visits_sqmi)*100) %>%
  dplyr::mutate(MAE = round(abs(exp(pred) - Night_visits_sqmi),2),
                MAPE = round((MAE / Night_visits_sqmi)*100,2)) %>%
  group_by(date_range_start) %>% 
  summarise(Night_visits_sqmi = mean(Night_visits_sqmi),
            MAE = mean(MAE),
            MAPE = mean(MAPE)) %>% 
  st_drop_geometry() %>%
  dplyr::select(date_range_start, MAE, MAPE) %>%
  mutate(date_range_start = str_remove_all(date_range_start, pattern = "T05:00:00Z"),
         date_range_start = str_remove_all(date_range_start, pattern = "T04:00:00Z")) %>%
  kable(caption = "Errors by Corridor Type: Linear Model") %>%
  kable_styling(full_width = FALSE)
Errors by Corridor Type: Linear Model
date_range_start MAE MAPE
2018-01-01 35078.91 97.83992
2018-02-01 37863.43 107.83184
2018-03-01 30527.72 33.65015
2018-04-01 40404.08 34.39489
2018-05-01 27327.91 32.57414
2018-06-01 23889.45 31.31711
2018-07-01 26530.56 32.79917
2018-08-01 43336.84 37.66049
2018-09-01 39554.70 35.91173
2018-10-01 41802.99 35.10199
2018-11-01 43968.28 35.77019
2018-12-01 15757.49 184.57811

Random Forest Model

As discussed above, the linear model has a number of issues with accuracy and generalizability. In an effort to improve these metrics, we built a random forest model using the same features and dependent variable. In the below summary table, we see much improvement in the random forest model’s OOB Error of about 20%, down from about 40% in the linear model. Additionally, the adjusted R-squared value has increased from .5 to .7. This indicates that the model’s accuracy has improved.

set.seed(414)
inTrain <- createDataPartition(y=dat_pred_agg$Night_visits_sqmi_log, p = .60, list = FALSE)

phl.training.rf <- dat_pred_agg[inTrain,]
phl.test.rf <- dat_pred_agg[-inTrain,]

#Multivariate regression
rf1 <- 
  ranger(Night_visits_sqmi_log ~ ., #change lm to ranger, predictions are a little different
     data = st_drop_geometry(phl.training.rf) %>%
             dplyr::select(reg.vars),
     importance = "impurity")

rf1
## Ranger result
## 
## Call:
##  ranger(Night_visits_sqmi_log ~ ., data = st_drop_geometry(phl.training.rf) %>%      dplyr::select(reg.vars), importance = "impurity") 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      1896 
## Number of independent variables:  23 
## Mtry:                             4 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.2106636 
## R squared (OOB):                  0.7171268
Next we tune the model’s hyperparameters to optimize for the variance across time. As we saw in the above table, there was extensive variation in the error terms from month-to-month. Therefore, we decided to optimize for this metric. After testing different parameters using the tune_grid() function, the final parameters were selected based on the lowest Root Mean Square Error. The results of the grid search are shown in the below table.
mtry min_n .metric .estimator mean n std_err .config
5 5 rmse standard 0.4527947 12 0.1077851 Preprocessor1_Model02
10 5 rmse standard 0.4530589 12 0.1054736 Preprocessor1_Model03
10 10 rmse standard 0.4542392 12 0.1054445 Preprocessor1_Model08
15 5 rmse standard 0.4545687 12 0.1039943 Preprocessor1_Model04
15 10 rmse standard 0.4555386 12 0.1041836 Preprocessor1_Model09
20 10 rmse standard 0.4556419 12 0.1012414 Preprocessor1_Model10
5 10 rmse standard 0.4556548 12 0.1082819 Preprocessor1_Model07
20 5 rmse standard 0.4556906 12 0.1016482 Preprocessor1_Model05
2 5 rmse standard 0.4595188 12 0.1114530 Preprocessor1_Model01
2 10 rmse standard 0.4634090 12 0.1116446 Preprocessor1_Model06
15 50 rmse standard 0.4866572 12 0.1002259 Preprocessor1_Model14
20 50 rmse standard 0.4877684 12 0.0985493 Preprocessor1_Model15
10 50 rmse standard 0.4880091 12 0.1026802 Preprocessor1_Model13
5 50 rmse standard 0.4951714 12 0.1057242 Preprocessor1_Model12
2 50 rmse standard 0.5110093 12 0.1083489 Preprocessor1_Model11

The below shows the tuned random forest model’s updated specifications.

## [[1]]
## == Workflow [trained] ==========================================================
## Preprocessor: Recipe
## Model: rand_forest()
## 
## -- Preprocessor ----------------------------------------------------------------
## 0 Recipe Steps
## 
## -- Model -----------------------------------------------------------------------
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5,      x), num.trees = ~1000, min.node.size = min_rows(~5, x), importance = ~"impurity",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  1000 
## Sample size:                      1896 
## Number of independent variables:  23 
## Mtry:                             5 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.2032407 
## R squared (OOB):                  0.7098645

Feature Importance

The below Figure 4.2.4 is a bar plot showing the tuned and final random forest model’s relative feature importance. The features ranked more highly were more important predictors for nighttime traffic, while features ranked lower were less important predictors. A more extensive discussion of this model’s feature importance is included in below in Section 5.

Using a similar workflow to evaluate the random forest model errors, the following graphic shows a scatter plot comparing the actual and predicted values between the training and test sets of data. There is less overlap between the green and orange lines than in the analogous plot for the linear model (Figure 4.2.1), which suggests that this model could be better calibrated to align the predicted and actual values. However, the distribution of the predictions, shown by the tighter clustering of dots, has much improved. There are still outliers in both the testing and training datasets that indicate the model has trouble coping with particularly high values. Again, further calibration and including additional features to account for these high values would further improve the model’s accuracy.

Comparing the error terms of the test sets between the linear model and the random forest model, we see that the latter is substantially more accurate with a MAPE down to 17% from 40%.

phl.test_table <-
  phl.test %>%
  st_drop_geometry() %>%
  mutate(Regression = "Linear Regression",
         Visits.Predict = exp(predict(reg1, phl.test)),
         Visits.Error = Visits.Predict - Night_visits_sqmi,
         Visits.AbsError = abs(Visits.Predict - Night_visits_sqmi),
         Visits.APE = (abs(Visits.Predict - Night_visits_sqmi)) / Visits.Predict)

phl.test.rf_table <-
  phl.test.rf %>%
  st_drop_geometry() %>%
  mutate(Regression = "Random Forest Regression",
         Visits.Predict = as.numeric(unlist(rf1_predict.test_fit)),
         Visits.Error = Visits.Predict - Night_visits_sqmi,
         Visits.AbsError = abs(Visits.Predict - Night_visits_sqmi),
         Visits.APE = (abs(Visits.Predict - Night_visits_sqmi)) / Visits.Predict)

rbind(phl.test.rf_table, phl.test_table) %>%
  group_by(Regression) %>%
  summarize(MAE = mean(Visits.AbsError, na.rm = TRUE),
            MAPE = (mean(Visits.AbsError, na.rm = T) / mean(Night_visits_sqmi, na.rm = T)*100)) %>% 
  kable(caption = "MAE and MAPE for Test Set Data") %>% kable_styling(full_width = FALSE)
MAE and MAPE for Test Set Data
Regression MAE MAPE
Linear Regression 36855.72 42.34710
Random Forest Regression 16284.79 18.71117

We performed a k-fold analysis on the random forest model to compare the MAE distribution across 10 folds. The output of this analysis is shown in Figure 4.2.6 below. We can see that the random forest’s error terms cluster lower than the linear model errors and are also more concentrated. This further demonstrates an improvement over the first model’s generalizability.

set.seed(414)
folds <- vfold_cv(dat_pred_agg, v = 10)

reg1.cv.rf <- 
  rf_best_wf_time %>% 
  fit_resamples(folds, metrics = metric_set(mae)) 

reg1.cv.rf.metrics <- collect_metrics(reg1.cv.rf, summarize = FALSE) 

rbind(reg1.cv.resample %>% 
        mutate(Regression = "Linear Model") %>% 
        dplyr::select(MAE, Regression), 
      reg1.cv.rf.metrics %>% 
        mutate(Regression = "Random Forest Model") %>% 
        dplyr::select(.estimate, Regression) %>%
        dplyr::rename("MAE" = ".estimate")) %>%
  ggplot(aes(x=MAE)) + 
  geom_histogram(color = "grey40", fill = "#27fdf5", bins = 50) + 
  labs(title="Full Dataset: Histogram of Mean Absolute Error Across 10 Folds",
       subtitle = "Figure 4.2.6") +
  facet_wrap(~Regression, ncol = 1) +
  scale_x_continuous(name = "Mean Absolute Error") +
  scale_y_continuous(name = "Count") +
  plotTheme()

Leave One Group Out Cross Validation

Finally, the following code looks at generalizability across time and corridor type for the tuned random forest model. First we compare the model’s errors across months. Similar to our analysis of the linear regression, the random forest model is far better at predicting during the spring and summer with MAPE’s between 10% and 15%. This contrasts greatly to the predictions for the winter months, where the errors are as high as 50%. A continued discussion of why these errors deviate is included in the below Section 5.
Errors by Month: Random Forest Model
date_range_start MAE MAPE
2018-01-01 23719.446 55.27263
2018-02-01 22863.423 58.18820
2018-03-01 13808.787 11.89650
2018-04-01 21049.975 17.22808
2018-05-01 9196.629 11.35959
2018-06-01 8515.113 11.08929
2018-07-01 8493.734 10.71643
2018-08-01 21914.708 16.67665
2018-09-01 17430.761 14.68143
2018-10-01 19129.894 14.41026
2018-11-01 21681.780 14.31361
2018-12-01 8867.888 130.37438
In addition to studying the random forest model’s errors across time, the below table also looks at the errors across corridor type. Contrary to our evaluation of the linear model, where the predictions for smaller corridors were most accurate, the below table shows how our final random forest model predicts nighttime trips for Regional Centers best with a MAPE of a little over 15%. Additionally, this table shows less variation across corridor types compared to the linear model.
Errors by Corridor Type: Random Forest Model
corr_type MAE MAPE
Neighborhood Subcenter 17372.55 38.05579
Neighborhood Center 18086.87 25.90396
Community Center 11731.69 21.90437
Regional Center 17339.01 13.50889
Superregional Center 14820.46 13.56314
Specialty Center 10115.62 24.12173

Mapping Errors

As a final evaluation of the random forest model, we map the errors across Philadelphia commercial corridors in the below Figures 4.2.7a and 4.2.7b. The map on the left shows the highest quintile of absolute error tends to occur in smaller corridors. We see a particular cluster of these errors in far West Philadelphia and the area north of Center City. When reviewing the Percent Error metric on the right, the best predictions are concentrated in Center City, likely because these corridors see more traffic overall.

map_preds_space <-
  map_preds_rf %>%
  dplyr::mutate(Visits.AbsError = round(abs(exp(.pred) - Night_visits_sqmi),2),
                PercentError = round((Visits.AbsError / Night_visits_sqmi)*100,2)) %>%
  group_by(corridor) %>%
  summarise(Night_visits_sqmi = mean(Night_visits_sqmi),
            Visits.AbsError = mean(Visits.AbsError),
            PercentError = mean(PercentError)) 

#Quintile maps
ErrorPlot1 <- ggplot() +
  geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
  geom_sf(data = map_preds_space, aes(fill = q5(Visits.AbsError), color = q5(Visits.AbsError))) +
  scale_fill_manual(values = palette5,
                      labels=qBr(map_preds_space,"Visits.AbsError"),
                      name="Quintile\nBreaks") +
  scale_color_manual(values = palette5,
                      labels=qBr(map_preds_space,"Visits.AbsError"),
                      name="Quintile\nBreaks") +
  labs(title="Absolute Errors",
       subtitle = "Night Visit Predictions",
       caption = "Figure 4.2.7a") +
  mapTheme()

ErrorPlot2 <- ggplot() +
  geom_sf(data = phl_boundary, fill = "grey80", color = "grey40") +
  geom_sf(data = map_preds_space, aes(fill = q5(PercentError), color = q5(PercentError))) +
  scale_fill_manual(values = palette5,
                      labels=qBr(map_preds_space, "PercentError"),
                      name="Quintile\nBreaks") +
  scale_color_manual(values = palette5,
                      labels=qBr(map_preds_space,"PercentError"),
                      name="Quintile\nBreaks") +
  labs(title="Percent Errors",
       subtitle = "Night Visit Predictions",
       caption = "Figure 4.2.7b") +
  mapTheme()

grid.arrange(ErrorPlot1, ErrorPlot2, ncol=2)

4.3 Workday Model

Finally, we were curious to see if our model could predict traffic during other times of day. Using the same features, we ran an additional linear model and random forest model predicting workday traffic between the hours of 9AM and 6PM. While we were not concerned with evaluating the errors with the same rigor that we did above for our nighttime models, we compared the MAE and MAPE to the other models we built. From this comparison we see that the workday model is far less accurate than both the nighttime models with a MAPE’s as high as 125%.
MAE and MAPE for Test Set Data
Regression MAE MAPE
Linear Regression 36855.72 42.34710
Random Forest Regression 16284.79 18.71117
Workday Linear Regression 103900.78 124.42342
Workday Random Forest Regression 78828.11 90.57324

This suggests that the factors determining workday traffic are distinct from those important to evening traffic. There does, however, seem to be some overlap in the significant features of the two models. The below summary table from the linear model shows that a number of variables are still statistically significant, including the density of restaurants, retail mix, and even if businesses are open late. Surprisingly, the distance to University of Pennsylvania, an important job hub within Philadelphia, is not a significant predictor. This is possibly due to the fact that the university is situated within a residential area in West Philadelphia. While the university itself may attract a fair amount of day time traffic, the surrounding areas might not. Other variables that are not similarly significant between the nighttime and workday models include many of the demographic variables, such as percent white, median age, and population density.

## 
## Linear Regression Results
## =============================================================================
##                                              Dependent variable:             
##                                 ---------------------------------------------
##                                 Night_visits_sqmi_log Workday_visits_sqmi_log
##                                          (1)                    (2)          
## -----------------------------------------------------------------------------
## phl_building_size_log                 0.060***               0.142***        
##                                        (0.017)                (0.016)        
##                                                                              
## phl_UPenn                            0.00001***               0.00000        
##                                       (0.00000)              (0.00000)       
##                                                                              
## sg_distance_home_log                  -0.271***              -0.272***       
##                                        (0.037)                (0.036)        
##                                                                              
## sg_dwell                              0.001***               0.001***        
##                                       (0.0003)               (0.0003)        
##                                                                              
## dens_bars_log                           0.052                 -0.027         
##                                        (0.049)                (0.047)        
##                                                                              
## dens_rest_log                         0.508***               0.421***        
##                                        (0.031)                (0.029)        
##                                                                              
## dens_arts_log                           0.035                0.154***        
##                                        (0.056)                (0.053)        
##                                                                              
## dens_grocery                          0.001***               0.001***        
##                                       (0.0003)               (0.0003)        
##                                                                              
## dens_sports                           0.003***               0.003***        
##                                        (0.001)                (0.001)        
##                                                                              
## count_retailmix_top                   -0.019***               -0.003         
##                                        (0.003)                (0.003)        
##                                                                              
## count_retailmix_sub_log               0.673***               0.623***        
##                                        (0.046)                (0.044)        
##                                                                              
## dens_openlate_log                     -0.099***              -0.074**        
##                                        (0.035)                (0.032)        
##                                                                              
## nn_transit_log                        -0.135***              -0.158***       
##                                        (0.023)                (0.022)        
##                                                                              
## nn_parks_log                          0.123***                 0.064         
##                                        (0.043)                (0.040)        
##                                                                              
## nn_trolley                           -0.00001***             -0.00000        
##                                       (0.00000)              (0.00000)       
##                                                                              
## dens_late_tag                         -0.002***              -0.002***       
##                                       (0.0004)               (0.0004)        
##                                                                              
## corr_typeNeighborhood Center           -0.082*                -0.009         
##                                        (0.045)                (0.044)        
##                                                                              
## corr_typeNeighborhood Subcenter        -0.041                 -0.057         
##                                        (0.059)                (0.057)        
##                                                                              
## corr_typeRegional Center              0.596***               0.410***        
##                                        (0.142)                (0.144)        
##                                                                              
## corr_typeSpecialty Center               0.041                 -0.009         
##                                        (0.081)                (0.081)        
##                                                                              
## corr_typeSuperregional Center         1.212***               0.776***        
##                                        (0.152)                (0.142)        
##                                                                              
## dens_late_tag_log                     0.239***               0.277***        
##                                        (0.052)                (0.050)        
##                                                                              
## demo_pctWhite                         -0.002***               -0.001*        
##                                        (0.001)                (0.001)        
##                                                                              
## demo_medAge                            0.005**                 0.001         
##                                        (0.002)                (0.002)        
##                                                                              
## demo_popdens                          0.00000**              -0.00000        
##                                       (0.00000)              (0.00000)       
##                                                                              
## demo_medrent                          0.0003***              0.0003***       
##                                       (0.0001)               (0.0001)        
##                                                                              
## demo_MHI                             -0.00001***            -0.00001***      
##                                       (0.00000)              (0.00000)       
##                                                                              
## Constant                              7.764***               9.103***        
##                                        (0.513)                (0.486)        
##                                                                              
## -----------------------------------------------------------------------------
## Observations                            1,896                  1,896         
## R2                                      0.548                  0.551         
## Adjusted R2                             0.542                  0.545         
## Residual Std. Error (df = 1868)         0.584                  0.557         
## F Statistic (df = 27; 1868)           83.963***              84.950***       
## =============================================================================
## Note:                                             *p<0.1; **p<0.05; ***p<0.01

5. Findings & Discussion

This model sheds light on research findings that have important implications for the governance of the nighttime economy in Philadelphia and perhaps in other cities as well. The following section discusses these findings in greater detail.

5.1 Important Predictors of Nighttime Traffic

According to our final model, the following predictors were some of the most important in estimating nighttime traffic. A discussion of these features is included below.

  • Density of Restaurants
  • Density of Businesses Open Late
  • Average Distance Home
  • Average Dwell Time
  • Retail Mix

The density of restaurants is, by far, the most significant predictor of nighttime traffic than the other features. In terms of an economic development takeaway, this suggests that restaurants and food service are important generators of evening activity that contribute to the nighttime economy. This also suggests that creating and sustaining an environment that is friendly to these businesses will be beneficial to the larger economic health of local commercial corridors. This finding is also significant within the context of COVID-19, as we have seen that restaurant operations have been severely hampered during this difficult year. Our model suggests that prioritizing restaurant recovery in the post-COVID era could have a positive spillover effect for other retail types in Philadelphia.

The density of businesses open late is also an important predictor in the model. While this may be unsurprising when predicting nighttime traffic, this has interesting policy implications in terms of licensing and permitting along commercial corridors. Facilitating businesses staying open later may be another strategy for policymakers to take into consideration when promoting economic recovery. This is particularly relevant to businesses that have highly regulated hours, such as bars and arts venues. Of course, the negative externalities of increased traffic in the evening should be considered carefully. That said, easing the restrictions on certain business types may encourage increased nighttime traffic revenue along local corridors.

The distance from home and dwell time variables derived from the SafeGraph dataset were also strong predictors of nighttime traffic. These are new variables that we did not have access to before SafeGraph showing how this new dataset is an important development in our ability to understand mobility patterns. Reviewing the coefficients of the linear model, we see that the distance from home has a negative relationship with nighttime visits, meaning that as the average distance traveled increases the number of nighttime trips goes down. This suggests that people are more likely to travel shorter distances in the evening and that local, neighborhood corridors that are near a residential areas are an important site of the nightlife economy. The dwell time variable, in contrast, has a positive relationship with nighttime visits. As the average dwell time on a corridor increases, so does the number of trips. In terms of a policy takeaway from this finding, helping nightlife establishments accommodate visitors comfortably (e.g. expanding seating area in and around nightlife destinations) may be a viable strategy for increasing traffic and thereby growing business revenue.

Finally, retail mix is another important predictor of nighttime trips, suggesting that the more retail diversity along a corridor, the more traffic it attracts during nighttime hours. This makes intuitive sense, as more business types along a corridor make it appeal to a wider audience. It also suggests that the most successful corridors at generating nighttime traffic cater to a wide variety of businesses, including those open during the daytime. If nighttime establishments do best along corridors with a diverse mix of retail, some of which we are assuming is not open late, this proposes interesting new planning challenges involving how to plan for a corridor that adapts from day to night. For example, how can a corridor feel both safe and inviting when some of the businesses have closed for the evening, leaving storefronts unactivated throughout the night?

5.2 Less Important Predictors of Nighttime Traffic

Conversely, the features that were less important predictors have generated equally interesting findings and implications for economic development. We will discuss the three features listed below.

  • Corridor Type
  • Bar Density
  • Arts Density

Though certain corridor types were statistically significant to our linear regression – notably the Regional, Superregional, and Specialty Centers – the variable overall was less important to our random forest model. There are a number of possibilities for why this is the case. First, though the corridor typologies describe a specific retail environment, there is likely extensive variation within each category. For example, there are over a hundred Neighborhood Subcenters across the city. The same is true for Neighborhood Centers. While the corridor typology might be useful for conceptualizing the range of retail destinations in Philadelphia or describing those categories with fewer corridors, this variable may be too broad to adequately describe these corridor types.

Given our understanding of the nighttime economy, we were surprised to see that arts establishments and bars ranked low in terms of feature importance. There could be a number of explanations for this. For one, there seems to be some collinearity with some of the other variables describing the density of retail establishments, meaning that they match the pattern of another feature and therefore do not contribute new information to the model. Another possibility for the lack of importance could be related to the hours that we are predicting. The model specifically looks at trips between 7PM and 12AM. It is possible that a number of late night trips to bars and arts establishments are falling outside of that time window. There may be an additional problem with predicting traffic to arts venues, as the trips to these destinations are almost entirely reliant on programming (same with the density of sports venues which also ranks low in terms of importance). This abnormality is difficult to model and is likely one of the factors accounting for the feature’s relatively weak performance.

Though the features representing bars and arts venues were not important to our final model, this is not to say that these establishments are unimportant to the nightlife economy more broadly. From a quantitative perspective, we discuss in the above section that a diverse retail mix along a corridor is positively correlated with nightlife traffic and an important factor in predicting these patterns. Bars and arts venues undoubtedly contribute to a commercial corridor’s retail diversity and mix of uses, adding to its vibrancy. Separately, though our project has modeled mobility as a proxy for economic importance, an establishment’s economic viability may not be the only or primary rationale for its existence. Even if an arts venue does not attract regular traffic to a commercial corridor, this does not mean that we should stop investing in these establishments. They enrich our lives and enhance the character along a corridor, and few would argue that this is not worthy of support. This simply means that the arguments to advocate for these places may not be exclusively economic and rather focus on the establishment’s social and cultural benefits.

5.3 Limited Generalizability

Finally, the above analysis indicates that the model does not predict nighttime traffic with equivalent accuracy across time and and corridor type. Looking first at the generalizability across time, the model is best at predicting values during the spring and summer. In particular, the model is most accurate during March, May, June, and July where the MAPE is as low as 11%. In contrast, the predictions over the winter months is far less accurate, with the December prediction having a MAPE of over 100% and the January and February predictions having a MAPE of over 50%. These errors are likely due to far lower evening traffic counts during these months as shown in the below Figure 5.3.1. It is possible that the discrepancy between evening traffic during these months is an error in the data, but it is also possible that it correlates with weather and hours of day light. As the winter months are darker and colder, people may be less likely to participate in the nighttime economy activities during those months.

Another limitation in the model’s generalizability is across corridor types. As discussed in the above Section 4.2 on feature importance, the corridor types may be useful administrative tools for the City of Philadelphia, but they might not be descriptive enough for modelling purposes, particularly for the smaller Neighborhood Subcenters and Centers that have the highest errors in our model. There may simply be too much variation within these administrative categories to guarantee accuracy.

Finally, the model does not generalize well to predicting traffic during the workday. As discussed above, this means that people fundamentally travel for different reasons over the course of a day. While these distinctions in mobility patterns may be intuitive, this is an important reminder that attracting visitors to a given corridor is largely dependent upon surrounding context. Expanding the pool of visitors and times that people visit a corridor likely depends on a complex interplay of retail mix, the built environment, and the socio-economic characteristics of a given corridor. Understanding this broad context is essential to planning for the nighttime economy effectively.

6. Conclusions & Opportunities for Continued Research

While this project has only scratched the surface of the possibilities of the SafeGraph dataset, we have generated learnings that are relevant to the city’s economic development efforts, particularly during the COVID-19 recovery. This analysis and research has also contributed to a web application that features a data dashboard summarizing socio-economic information and shows visitors origins to commercial corridors across Philadelphia. We hope this tool will be a valuable empirical device for Philadelphia’s corridor managers and economic development officials who have a vested interest in exploring the benefits of the nighttime economy at the corridor level. Separately, the app also puts our predictive model into use. Showcasing a selection of corridors across Philadelphia, our prototype app allows users to manipulate the retail mix of nightlife establishments along a given corridor and see the impact on nighttime traffic.

There are many other iterations of the model that could enhance our understanding of the nighttime economy and improve the model’s accuracy. For one, we could further refine the hours when we are predicting trips. Instead of predicting between the hours of 7PM and 12PM, we could extend the boundaries of this variable to encompass the early morning. This would likely provide a more comprehensive view of traffic that occurs over the course of the entire night. The model could also incorporate multiple years of SafeGraph data to build a more robust picture of mobility along nightlife corridors. Finally, the model could be extended to other cities, which would expand the sample size, hopefully improving the model’s accuracy and applying the learnings outside of Philadelphia to other cities that have a vested interest in fostering their nighttime economy.

7. Code Appendix

Introduction

Set-Up Packages

library(tidyverse)
library(tidycensus)
library(sf)
library(lubridate)
library(datetime)
library(viridis)
library(gridExtra)
library(kableExtra)
library(dplyr)
library(ggplot2)
library(gganimate)
library(FNN)
library(caret)
library(stargazer)
library(ranger)
library(tidymodels)

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]]), digits = 3),
                 c(.01,.2,.4,.6,.8), na.rm=T)
  }
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
palette3 <- viridis_pal()(3)
palette5 <- viridis_pal()(5)
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}
mapTheme_dark <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_rect(fill = "black"),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "black"),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}


plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 16,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}
nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    dplyr::summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

setwd("~/GitHub/musa_practicum_nighttime")

Load Data

dat <- read.csv("./data/moves_2018.csv") #2018 SafeGraph Patterns Data
data2020 <- read.csv("./data/moves_monthly2020.csv") # 2020 SafeGraph Patterns Data

phila <- st_read("./demo/phila.geojson", quiet = TRUE) # SafeGraph Places data

#Wrangle 2018 SafeGraph Patterns and Places data into one dataset
dat2 <- dat %>% 
  dplyr::select(safegraph_place_id, 
                date_range_start, 
                date_range_end, 
                raw_visit_counts,
                raw_visitor_counts, 
                visits_by_day, 
                poi_cbg, 
                visitor_home_cbgs, 
                visitor_daytime_cbgs, 
                visitor_work_cbgs, 
                visitor_country_of_origin,
                distance_from_home, 
                median_dwell, 
                bucketed_dwell_times, 
                related_same_day_brand, 
                related_same_month_brand, 
                popularity_by_hour, 
                popularity_by_day, 
                device_type) %>%
  left_join(., phila, by = "safegraph_place_id") %>% 
  st_as_sf() %>%
  st_transform('ESRI:102728') #Project to Philly coordinates

#Load census block group shapefile (projected & unprojected)
phl_cbg <- 
  st_read("http://data.phl.opendata.arcgis.com/datasets/2f982bada233478ea0100528227febce_0.geojson", quiet = TRUE) %>%
  mutate(GEOID10 = as.numeric(GEOID10))%>%
  st_transform('ESRI:102728') 
phl_cbg_unproj <- 
  st_read("http://data.phl.opendata.arcgis.com/datasets/2f982bada233478ea0100528227febce_0.geojson", quiet = TRUE) %>%
  mutate(GEOID10 = as.numeric(GEOID10),
         Lat = as.numeric(INTPTLAT10),
         Lon = as.numeric(INTPTLON10))

#Load Philadelphia boundary shapefile (projected & unprojected)
phl_boundary <- 
  st_read("https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson", quiet = TRUE)%>%
  st_transform('ESRI:102728')

phl_boundary_unproj <- 
  st_read("https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson", quiet = TRUE)

#Load commercial corridor shapefile (projected & unprojected)
phl_corridors <- st_read("http://data.phl.opendata.arcgis.com/datasets/f43e5f92d34e41249e7a11f269792d11_0.geojson", quiet = TRUE)%>%
  st_transform('ESRI:102728') %>%
  mutate(corr_type = ifelse(CORRIDOR_TYPE == 1, "Neighborhood Subcenter", 
                            ifelse(CORRIDOR_TYPE == 2, "Neighborhood Center", 
                                   ifelse(CORRIDOR_TYPE == 3, "Community Center", 
                                          ifelse(CORRIDOR_TYPE == 4, "Regional Center", 
                                                 ifelse(CORRIDOR_TYPE == 5, "Superregional Center", 
                                                        ifelse(CORRIDOR_TYPE == 6, "Specialty Center", "Other")))))))

phl_corridors_unproj <- st_read("http://data.phl.opendata.arcgis.com/datasets/f43e5f92d34e41249e7a11f269792d11_0.geojson", quiet = TRUE)

#Load Neighborhood shapefile (projected & unprojected)
phl_nhoods <- 
  st_read("https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson", 
          quiet = TRUE) %>%
  st_transform('ESRI:102728')
phl_nhoods_unproj <- 
  st_read("https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson", 
          quiet = TRUE)

#Load Philadelphia zip code shapefile & project
phl_zip <- 
  st_read("http://data.phl.opendata.arcgis.com/datasets/b54ec5210cee41c3a884c9086f7af1be_0.geojson", quiet = TRUE) %>%
  st_transform('ESRI:102728') %>%
  mutate(CODE = as.numeric(CODE))

Exploratory Data Analysis

Count of Nightlife Establishments by Fishnet

#Generate fishnets
fishnet <- 
  st_make_grid(phl_boundary, cellsize = 1500, square = FALSE) %>%
  .[phl_boundary] %>% 
  st_sf() %>%
  mutate(uniqueID = rownames(.))

#Filter restaurants
restaurants <- dat2 %>%
  filter(top_category == "Restaurants and Other Eating Places")%>%
  mutate(Legend = "Restaurants")

#aggregate restaurant count by fishnet cell
restaurants_net <-
  dplyr::select(restaurants) %>% 
  mutate(countRestaurants = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countRestaurants = replace_na(countRestaurants, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

#Bars
bars <- dat2 %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)") %>% 
  mutate(Legend = "Bars")

#aggregate bars by fishnet cell
bars_net <-
  dplyr::select(bars) %>% 
  mutate(countBars = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countBars = replace_na(countBars, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE)) %>%
  mutate(Legend = "Bars")

#Performing arts
performingarts <- dat2 %>%
  filter(top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
           top_category == "Performing Arts Companies") %>%
  mutate(Legend = "Performing Arts")

performingarts_net <-
  dplyr::select(performingarts) %>% 
  mutate(countPerformingarts = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countPerformingarts = replace_na(countPerformingarts, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

# Combining fishnets into a single dataframe
vars_net <- 
  rbind(restaurants, 
        bars, 
        performingarts) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  dplyr::summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend,count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()

vars_net.long <- gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

#Plotting small multiple maps
for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange, c(mapList, 
                        ncol=3, top="Count of Nightlife Businesses per Fishnet", 
                        bottom = "Figure 3.1.3"))

COVID-19 Data Wrangling

# Modify 2018 data
dat_2018 <-
  dat %>%
  mutate(popularity_by_hour = str_remove_all(popularity_by_hour, pattern = "\\[|\\]")) %>% #remove brackets
  unnest(popularity_by_hour) %>% #unnest values
  separate(.,
           popularity_by_hour,
           c("18",
             "19", "20", "21", "22", "23"),
           sep = ",") %>%
  mutate(.,NightVisits2018 = as.numeric(`18`) + as.numeric(`19`) + as.numeric(`20`) + as.numeric(`21`) + as.numeric(`22`) + as.numeric(`23`))

dat_2018 <- dat_2018 %>%
  dplyr::select(safegraph_place_id,
                location_name,
                date_range_start,
                date_range_end,
                raw_visit_counts,
                raw_visitor_counts,
                popularity_by_day,
                NightVisits2018) %>%
  rename(raw_visit_counts2018 = raw_visit_counts,
         raw_visitor_counts2018 = raw_visitor_counts,
         popularity_by_day2018 = popularity_by_day) %>%
  mutate(month = substring(date_range_start,6,7))

dat_2020 <-
  data2020 %>%
  mutate(popularity_by_hour = str_remove_all(popularity_by_hour, pattern = "\\[|\\]")) %>% #remove brackets
  unnest(popularity_by_hour) %>% #unnest values
  separate(.,
           popularity_by_hour,
           c("18",
             "19", "20", "21", "22", "23"),
           sep = ",") %>%
  mutate(.,NightVisits2020 = as.numeric(`18`) + as.numeric(`19`) + as.numeric(`20`) + as.numeric(`21`) + as.numeric(`22`) + as.numeric(`23`))

dat_join <- dat_2020 %>%
  dplyr::select(safegraph_place_id,
                location_name,
                date_range_start,
                date_range_end,
                raw_visit_counts,
                raw_visitor_counts,
                popularity_by_day,
                NightVisits2020) %>%
  rename(raw_visit_counts2020 = raw_visit_counts,
         raw_visitor_counts2020 = raw_visitor_counts,
         popularity_by_day2020 = popularity_by_day) %>%
  mutate(month = substring(date_range_start,6,7)) %>%
  inner_join(., dat_2018, by = c("safegraph_place_id", "month", "location_name")) %>%
  mutate(Month = case_when(month == "01" ~ "January",
                    month == "02" ~ "February",
                    month == "03" ~ "March",
                    month == "04" ~ "April",
                    month == "05" ~ "May",
                    month == "06" ~ "Jun",
                    month == "07" ~ "May",
                    month == "08" ~ "May",
                    month == "09" ~ "May",
                    month == "10" ~ "May",
                    month == "11" ~ "May",
                    month == "12" ~ "May"))

dat_join <- dat_join %>%
  dplyr::select(safegraph_place_id,
                location_name,
                raw_visit_counts2018,
                raw_visit_counts2020,
                NightVisits2018,
                NightVisits2020,
                popularity_by_day2018,
                popularity_by_day2020,
                month) %>%
  left_join(., phila, by = "safegraph_place_id") %>%
  st_as_sf() %>%
  st_transform('ESRI:102728')

Unnesting & Wrangling SafeGraph popularity_by_hour variable

#Unnesting popularity by hour variable
dat_hour <- 
  dat2 %>% 
  select(safegraph_place_id, 
         date_range_start, 
         top_category, 
         sub_category, 
         popularity_by_hour, 
         poi_cbg, 
         median_dwell) %>% #select variables
  mutate(popularity_by_hour = str_remove_all(popularity_by_hour, pattern = "\\[|\\]")) %>% #clean-up
  unnest(popularity_by_hour) %>% #unnest
  separate(.,
           popularity_by_hour,
           c("0", "1", "2", "3", "4", "5", "6", 
             "7", "8", "9", "10", "11", "12", 
             "13", "14", "15", "16", "17", "18",
             "19", "20", "21", "22", "23"),
           sep = ",") %>% #separate hour values into individual columns
  pivot_longer(cols = 4:27, names_to = "Hour", values_to = "Count") %>% #convert to long format
  mutate(Hour = as.numeric(Hour),
         Count = as.numeric(Count)) #convert hour trip counts to numeric values

dat_1_6 <- dat_hour %>%
  #select trip counts between the hours of 1AM and 6AM
  filter(Hour == "1" | Hour == "2" | Hour == "3" | Hour == "4" | Hour == "5" | Hour == "6") %>% 
  group_by(safegraph_place_id, date_range_start) %>%
  summarize(Hrs1_6 = sum(Count)) 

dat_7_12 <- dat_hour %>%
  #select trip counts between the hours of 7AM and 12PM
  filter(Hour == "7" | Hour == "8" | Hour == "9" | Hour == "10" | Hour == "11" | Hour == "12") %>%
  group_by(safegraph_place_id, date_range_start) %>%
  summarize(Hrs7_12 = sum(Count)) 

dat_13_18 <- dat_hour %>%
  #select trip counts between the hours of 1PM and 6PM
  filter(Hour == "13" | Hour == "14" | Hour == "15" | Hour == "16" | Hour == "17" | Hour == "18") %>%
  group_by(safegraph_place_id, date_range_start) %>%
  summarize(Hrs13_18 = sum(Count)) 

dat_19_0 <- dat_hour %>%
  #select trip counts between the hours of 6PM and 12AM
  filter(Hour == "19" | Hour == "20" | Hour == "21" | Hour == "22" | Hour == "23" | Hour == "0") %>%
  group_by(safegraph_place_id, date_range_start) %>%
  summarize(Hrs19_0 = sum(Count)) 

dat_workday <- dat_hour %>%
  #select trip counts between workday hours of 9AM and 6PM
  filter(Hour == "9" | Hour == "10" | Hour == "11" | Hour == "12" | Hour == "13" | Hour == "14" |Hour == "15" | Hour == "16" |Hour == "17" |Hour == "18") %>%
  group_by(safegraph_place_id, date_range_start) %>%
  summarize(Hrs_workday = sum(Count))

#Combine into new dataset with hour trip counts and corridor + neighborhood information
dat3 <- 
  dat2 %>% 
  left_join(dat_1_6, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_7_12, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_13_18, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_19_0, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_workday, by = c('safegraph_place_id', 'date_range_start')) %>%
  st_join(phl_corridors, join = st_intersects) %>%
  st_join(phl_nhoods, join = st_intersects)

Model Building

Data Pull

#Philadelphia Features
phl_corridors_pred <- phl_corridors %>%
  st_as_sf() %>%
  dplyr::select(4:5,18, 'corr_type') %>%
  rename(., corridor = NAME,
         district = P_DIST,
         vacancy_rate = VAC_RATE) %>%
  mutate(vacancy_rate = str_remove_all(vacancy_rate, pattern = "%"),
         vacancy_rate = str_remove_all(vacancy_rate, pattern = "\r\n"),
         vacancy_rate = as.numeric(vacancy_rate),
         corr_area_sqft = as.numeric(st_area(phl_corridors)),
         corr_area_sqmi = as.numeric(st_area(phl_corridors)*0.000000035870))

phl_corridors_pred$vacancy_rate <- phl_corridors_pred$vacancy_rate %>% replace_na(0)

UPenn.sf <- 
  dat3 %>% 
  dplyr::select(location_name, date_range_start, top_category) %>%
  filter(location_name == "Univ of Penn",
         top_category == "Colleges, Universities, and Professional Schools",
         date_range_start == "2018-01-01T05:00:00Z")

septaStops <- 
  rbind(
    st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson", quiet = TRUE) %>% 
      dplyr::mutate(Line = "El") %>% 
      st_transform('ESRI:102728') %>%
      dplyr::select(Station, Line),
    st_read("https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson", quiet = TRUE) %>%
      dplyr::mutate(Line ="Broad_St") %>%
      st_transform('ESRI:102728') %>%
      dplyr::select(Station, Line)) 

parks <-
  st_read("https://opendata.arcgis.com/datasets/d52445160ab14380a673e5849203eb64_0.geojson", 
          quiet = TRUE) %>%
  st_transform('ESRI:102728') %>%
  dplyr::select(PUBLIC_NAME)

phl_building_footprints <-
  st_read("https://opendata.arcgis.com/datasets/ab9e89e1273f445bb265846c90b38a96_0.geojson", 
          quiet = TRUE) %>%
  dplyr::select(ADDRESS) %>%
  st_transform('ESRI:102728')

phl_building_footprints <- phl_building_footprints %>%
  dplyr::mutate(bld_area_sqft = as.numeric(st_area(phl_building_footprints)),
         bld_area_sqmi = as.numeric(st_area(phl_building_footprints)*0.000000035870))

phl_trolley <-
  st_read("https://opendata.arcgis.com/datasets/e09e9f98bdf04eada214d2217f3adbf1_0.geojson") %>%
  st_transform('ESRI:102728') %>%
  .[phl_boundary,] %>%
  filter(Mode == "Trolley")

#Demographic data
phl_blockgroups <- 
  get_acs(geography = "block group", 
          variables = c("B01003_001E", 
                        "B02001_002E", 
                        "B01002_001E",
                        "B19013_001E", 
                        "B25064_001E"
                        # "B03002_012E",
                        # "B02001_003E"
                        ),
          year=2018, 
          state=42, 
          county=101, 
          geometry=T, 
          output = "wide") %>%
  st_transform('ESRI:102728')%>%
  dplyr::rename(TotalPop = B01003_001E,
                White = B02001_002E,
                MedAge = B01002_001E,
                MedHHInc = B19013_001E,
                MedRent = B25064_001E,
                # Hisp = B03002_012E,
                # Black =   B02001_003E,
                GEOID10 = GEOID) %>%
  dplyr::select(-ends_with("M")) %>%
  dplyr::mutate(pctWhite = ((White / TotalPop)*100),
                # pctBlack = ((Black / TotalPop)*100),
                # pctHisp = ((Hisp / TotalPop)*100),
                tract_area_mi = as.numeric(st_area(geometry)*0.000000035870),
                popdens_mi = TotalPop / tract_area_mi
         ) %>%
  st_as_sf()

Data Wrangling

dat_pred_temp <- 
  dat2 %>% 
  left_join(dat_1_6, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_7_12, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_13_18, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_19_0, by = c('safegraph_place_id', 'date_range_start')) %>%
  left_join(dat_workday, by = c('safegraph_place_id', 'date_range_start')) %>%
  st_join(phl_blockgroups) %>%
  st_join(phl_nhoods) %>%
  st_join(phl_corridors_pred) %>%
  st_join(., phl_building_footprints) %>%
  drop_na(corridor)

#Engineer variables at the destination level.
dat_pred <-
  dat_pred_temp %>%
  dplyr::mutate(total = 1,
         bars = ifelse(top_category == 'Drinking Places (Alcoholic Beverages)', 1, 0),
         restaurant = ifelse(top_category == 'Restaurants and Other Eating Places', 1, 0),
         arts = ifelse(top_category == 'Promoters of Performing Arts, Sports, and Similar Events' |
                         top_category == 'Performing Arts Companies', 1, 0),
         grocery = ifelse(top_category == 'Grocery Stores', 1, 0),
         sports = ifelse(top_category == 'Spectator Sports', 1, 0),
         transit_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(septaStops)), 2),
         parks_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(parks)), 4),
         trolley_nn = nn_function(st_c(st_centroid(dat_pred_temp)), st_c(st_centroid(phl_trolley)), 4),
         late_night = if_else(grepl("Late Night", dat_pred_temp$category_tags), 1, 0),
         closing_time = ifelse(str_detect(open_hours, paste(c("20:00", "21:00", "22:00", "23:00", "0:00", "1:00", "2:00", "3:00"), collapse = "|")), "OPEN LATE", "NOT OPEN LATE"),
         closing_time = ifelse(is.na(closing_time) == TRUE, "NO DATA", closing_time),
         open_late = ifelse(closing_time == "OPEN LATE", 1, 0)
    )

dat_pred$UPenn <- as.numeric(st_distance(dat_pred, UPenn.sf))

#Aggregating up to the corridor level
dat_pred_agg <- 
  dat_pred %>%
  group_by(corridor, date_range_start, corr_type) %>%
  summarize(Night_visits = sum(Hrs19_0),
            Workday_visits = sum(Hrs_workday),
            phl_area_sqmi = mean(corr_area_sqmi, na.rm = TRUE),
            Night_visits_sqmi = (Night_visits + 1) / phl_area_sqmi,
            Night_visits_sqmi_log = log(Night_visits_sqmi),
            Workday_visits_sqmi = (Workday_visits +1) / phl_area_sqmi,
            Workday_visits_sqmi_log = log(Workday_visits_sqmi),
            total = sum(total),
            phl_building_size = mean(bld_area_sqft +1, na.rm = TRUE),
            phl_building_size_log = log(phl_building_size),
            phl_UPenn = mean(UPenn),
            sg_dwell = mean(median_dwell, na.rm = TRUE),
            sg_distance_home = mean(distance_from_home, na.rm = TRUE),
            sg_distance_home_log = log(sg_distance_home),
            dens_bars = (sum(bars, na.rm = TRUE) + 1)/phl_area_sqmi,
            dens_rest = (sum(restaurant, na.rm = TRUE) + 1)/phl_area_sqmi,
            dens_arts = (sum(arts, na.rm = TRUE) +1)/phl_area_sqmi,
            dens_sports = (sum(sports, na.rm = TRUE) + 1)/phl_area_sqmi,
            dens_grocery = (sum(grocery, na.rm = TRUE) + 1)/phl_area_sqmi,
            count_retailmix_top = n_distinct(top_category),
            count_retailmix_sub = n_distinct(sub_category),
            dens_late_tag = (sum(late_night, na.rm = TRUE) + 1)/phl_area_sqmi,
            dens_openlate = (sum(open_late, na.rm = TRUE) + 1)/phl_area_sqmi,
            dens_bars_log = log(dens_bars),
            dens_rest_log = log(dens_rest),
            dens_arts_log = log(dens_arts),
            dens_grocery_log = log(dens_grocery),
            dens_late_tag_log = log(dens_late_tag),
            count_retailmix_top_log = log(count_retailmix_top),
            count_retailmix_sub_log = log(count_retailmix_sub),
            dens_openlate_log = log(dens_openlate),
            nn_transit = mean(transit_nn),
            nn_parks = mean(parks_nn),
            nn_trolley = mean(trolley_nn),
            nn_parks_log = log(nn_parks),
            nn_transit_log = log(nn_transit),
            nn_trolley_log = log(nn_trolley),
            demo_popdens = mean(popdens_mi),
            demo_popdens_log = log(demo_popdens),
            demo_pctWhite = weighted.mean(pctWhite, Hrs19_0, na.rm = TRUE),
            demo_medAge = weighted.mean(MedAge, Hrs19_0, na.rm = TRUE),
            demo_MHI = weighted.mean(MedHHInc, Hrs19_0, na.rm = TRUE),
            demo_medrent = weighted.mean(MedRent, Hrs19_0, na.rm = TRUE)) %>% 
  st_drop_geometry() %>%
  left_join(phl_corridors_pred %>% dplyr::select(corridor)) %>%
  drop_na(corridor, demo_MHI, demo_medrent) %>% #Some of the census variables aren't reporting for our block groups
  st_as_sf() %>%
  ungroup() %>%
  na.omit(st_distance_home)

Linear Model

reg.vars <- c('Night_visits_sqmi_log',
              'phl_building_size_log',
              'phl_UPenn',
              'sg_distance_home_log',
              'sg_dwell',
              'dens_bars_log',
              'dens_rest_log',
              'dens_arts_log',
              'dens_grocery',
              'dens_sports',
              'count_retailmix_top',
              'count_retailmix_sub_log',
              'dens_openlate_log',
              'nn_transit_log',
              'nn_parks_log',
              'nn_trolley',
              'dens_late_tag',
              'corr_type',
              'dens_late_tag_log',
              'demo_pctWhite',
              'demo_medAge',
              'demo_popdens',
              'demo_medrent',
              'demo_MHI')

#Setting up test and training datasets
set.seed(414)
inTrain <- createDataPartition(y=dat_pred_agg$Night_visits_sqmi_log, p = .60, list = FALSE)

phl.training <- dat_pred_agg[inTrain,]
phl.test <- dat_pred_agg[-inTrain,]

#Multivariate Linear regression
reg1 <- 
  lm(Night_visits_sqmi_log ~ ., 
     data = st_drop_geometry(phl.training) %>% 
       dplyr::select(reg.vars))
             
stargazer(reg1, type = "text", title = "Linear Regression Results")

Random Forest Model

set.seed(414)
inTrain <- createDataPartition(y=dat_pred_agg$Night_visits_sqmi_log, p = .60, list = FALSE)

phl.training.rf <- dat_pred_agg[inTrain,]
phl.test.rf <- dat_pred_agg[-inTrain,]

#Multivariate regression
rf1 <- 
  ranger(Night_visits_sqmi_log ~ ., #change lm to ranger, predictions are a little different
     data = st_drop_geometry(phl.training.rf) %>%
             dplyr::select(reg.vars),
     importance = "impurity")

rf1

Hypertuning Random Forest Parameters

set.seed(414)

phl.training.rf_fit <- dat_pred_agg[inTrain,]
phl.test.rf_fit <- dat_pred_agg[-inTrain,]

data_split <- initial_split(dat_pred_agg, strata = "Night_visits_sqmi_log", prop = 0.60)

model_rec <- recipe(Night_visits_sqmi_log ~ ., data = phl.training.rf %>% 
                      dplyr::select(reg.vars) %>% 
                      st_drop_geometry())

#Splitting by month
cv_splits_time <- group_vfold_cv(phl.training.rf%>%
                      dplyr::select(date_range_start, reg.vars) %>%
                      st_drop_geometry(),
                      strata = "Night_visits_sqmi_log",
                      group = "date_range_start")

rf_plan <- rand_forest() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 1000) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

#Setting parameters to try
rf_grid <- expand.grid(mtry = c(2, 5, 10, 15, 20), 
                       min_n = c(5, 10, 50))

rf_wf <-
  workflow() %>% 
  add_recipe(model_rec) %>% 
  add_model(rf_plan)

control <- control_resamples(save_pred = TRUE, verbose = TRUE)

#Test hyperparameters
rf_tuned_time <- rf_wf %>%
  tune::tune_grid(.,
                  resamples = cv_splits_time,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metric_set(rmse, rsq))

show_best(rf_tuned_time, metric = "rmse", n = 15) %>% kable() %>% kable_styling(full_width = FALSE)

rf_best_params_time  <- select_best(rf_tuned_time, metric = "rmse") 

rf_best_wf_time <- finalize_workflow(rf_wf, rf_best_params_time)

rf_val_fit_time <- rf_best_wf_time %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metric_set(mae, rsq))

rf_val_fit_time$.workflow

App Development

Generate GeoJSONs for Prediction Scenarios

#Funcion to generate a grid with the desired combinations.
ComboGrid <- function(rest_factor, bar_factor, arts_factor) {
  
  allCombos <- expand.grid(rest_factor = rest_factor*100,
                           bar_factor = bar_factor*100,
                           arts_factor = arts_factor*100) 
  
  allCombos <- tibble::rowid_to_column(allCombos, "scenario")
  
  allCombos <- 
    allCombos %>% 
    mutate(
      rest = "rest",
      bars = "bars",
      arts = "arts")
  
  allCombos <- allCombos %>%
    mutate(
      filename = 
        paste(allCombos$rest,
      allCombos$rest_factor, 
      allCombos$bars, 
      allCombos$bar_factor, 
      allCombos$arts, 
      allCombos$arts_factor, sep = "-"))
  
  return(allCombos)
}

#Function to generate separate dataframes for each of the scenarios
ScenarioGenerator <- function(dataset, corr_list, rest_factor, bar_factor, arts_factor) {

  allCombos <- expand.grid(rest_factor = rest_factor,
                           bar_factor = bar_factor,
                           arts_factor = arts_factor) 

  allCombos <- tibble::rowid_to_column(allCombos, "ID")

  comboList <- unique(allCombos[['ID']])
  allScenarios <- data.frame()

for (i in comboList) {
    
  thisScenario <-
    dataset %>%
    filter(corridor %in% corr_list) %>%
    dplyr::mutate(scenario = as.numeric(allCombos %>% filter(ID==i) %>% dplyr::select(ID)),
                  count_rest = as.numeric(allCombos %>% filter(ID == i) %>% dplyr::select(rest_factor)) * count_rest,
                  count_rest_a = count_rest/phl_area_sqmi,
                  count_rest_a_log = log(count_rest_a + 1),
                  count_bars = as.numeric(allCombos %>% filter(ID == i) %>% dplyr::select(bar_factor)) * count_bars,
                  count_bars_a = count_bars/phl_area_sqmi,
                  count_bars_a_log = log(count_bars_a + 1),
                  count_arts = as.numeric(allCombos %>% filter(ID == i) %>% dplyr::select(arts_factor)) * count_arts,
                  count_arts_a = count_arts/phl_area_sqmi,
                  count_arts_a_log = log(count_arts_a + 1))
  
  Scenario_preds <- predict(rf_val_fit_time$.workflow, new_data = thisScenario %>% as.data.frame())
  Scenario_preds <- exp(Scenario_preds %>% as.data.frame())

  Scenario_final <- cbind(Scenario_preds, thisScenario %>%
                            dplyr::select(corridor,
                                          count_rest,
                                          count_arts,
                                          count_bars,
                                          Night_visits,
                                          phl_area_sqmi,
                                          scenario,
                                          geometry)) %>%
    dplyr::rename("predictions" = ".pred") %>% #clean-up
    mutate(predictions = round((predictions * phl_area_sqmi), digits = 0)) %>%
    st_as_sf()

  Scenario_final <- Scenario_final %>% st_as_sf() %>% st_transform(crs = "EPSG: 4326")

  Scenario_final <- Scenario_final %>%
    group_by(corridor, scenario) %>%
    summarize(Night_visits = sum(Night_visits),
              predictions = sum(predictions))

  Scenario_final <-
    Scenario_final %>%
    mutate(pct_change = round(((predictions - Night_visits)/Night_visits * 100), digits = 1)) %>%
    as.data.frame()  %>%
    left_join(., ScenarioCombos %>% dplyr::select(scenario, filename)) %>%
    st_as_sf()

  Scenario_final <- as_Spatial(Scenario_final)

  writeOGR(Scenario_final,
           dsn = paste(unique(Scenario_final$filename), ".geojson"),
           'allScenario_final_small',
           driver = 'GeoJSON')
  
}
  return(Scenario_final)
}

#Preparing the dataset to run in the function
dat_scenario <-
  dat_pred %>%
  group_by(corridor, date_range_start) %>%
  summarize(count_rest = sum(restaurant, na.rm = TRUE),
            count_bars = sum(bars, na.rm = TRUE),
            count_arts = sum(arts, na.rm = TRUE)) %>%
  st_drop_geometry() %>%
  left_join(dat_pred_agg) %>%
  group_by(corridor) %>%
  drop_na(corridor, demo_MHI, demo_medrent)

#Running the function
rest_factor_select <- c(.95, 1, 1.05) #different scenarios
bar_factor_select <- c(.95, 1, 1.05)
arts_factor_select <- c(.95, 1, 1.05)

ScenarioCombos <- ComboGrid(rest_factor = rest_factor_select,
                            bar_factor = bar_factor_select,
                            arts_factor = arts_factor_select)

ScenarioGenerator(dataset = dat_scenario, #Dataset
                          corr_list = c("30th Street", #Corridors we are including
                                        "54006000 Lancaster Ave",
                                        "Broad and Snyder",
                                        "Bustleton and Red Lion",
                                        "Callowhill East",
                                        "Castor and Magee",
                                        "Central Waterfront/Spring Garden",
                                        "City Ave and Belmont vicinity",
                                        "Front and Kensington",
                                        "Girard and Marshall",
                                        "Godfrey and Ogontz",
                                        "Kensington Ave/Harrowgate",
                                        "Market East",
                                        "Oregon Ave/5th-13th",
                                        "Presidential/Belair",
                                        "Richmond and Allegheny",
                                        'Snyder Plaza and Columbus Commons',
                                        "South Street/Front-8th",
                                        "Washington Avenue West"),
                          rest_factor = rest_factor_select, #values determined above
                          bar_factor = bar_factor_select,
                          arts_factor = arts_factor_select)

Generate GeoJSONs for Census Block Group Trip Origins

#Pulll census data for PA blockgroups
pa_blockgroups <- 
  get_acs(geography = "block group", 
          variables = c("B01003_001E", #select variables 
                        "B02001_002E", 
                        "B01002_001E",
                        "B19013_001E", 
                        "B25064_001E",
                        "B03002_012E",
                        "B02001_003E"),
          year=2018, #select year
          state=42, #PA FIPs code
          county=c(101, 017, 045, 091), #FIPS code for Philly, Bucks, Montgomery, Delaware
          geometry=T, 
          output = "wide") %>%
  st_transform('ESRI:102728') %>% #PA projection
  rename(TotalPop = B01003_001E, 
         White = B02001_002E,
         MedAge = B01002_001E,
         MedHHInc = B19013_001E,
         MedRent = B25064_001E,
         Hisp = B03002_012E,
         Black =    B02001_003E,
         GEOID10 = GEOID) %>% #Clean up column names
  dplyr::select(-ends_with("M")) %>%
  mutate(pctWhite = ((White / TotalPop)*100),
         pctBlack = ((Black / TotalPop)*100),
         pctHisp = ((Hisp / TotalPop)*100),
         tract_area_mi = as.numeric(st_area(geometry)*0.000000035870),
         popdens_mi = TotalPop / tract_area_mi) %>% #Generate new columns
  st_as_sf()

#Pull census data for NJ blockgroups
nj_blockgroups <- 
  get_acs(geography = "block group", 
          variables = c("B01003_001E", 
                        "B02001_002E", 
                        "B01002_001E",
                        "B19013_001E", 
                        "B25064_001E",
                        "B03002_012E",
                        "B02001_003E"),
          year=2018, 
          state=c(34),
          county=c(007, 015, 005), #Camden, Gloucester and Burlington
          geometry=T, 
          output = "wide") %>%
  st_transform('ESRI:102728')%>%
  rename(TotalPop = B01003_001E,
         White = B02001_002E,
         MedAge = B01002_001E,
         MedHHInc = B19013_001E,
         MedRent = B25064_001E,
         Hisp = B03002_012E,
         Black =    B02001_003E,
         GEOID10 = GEOID) %>%
  dplyr::select(-ends_with("M")) %>%
  mutate(pctWhite = ((White / TotalPop)*100),
         pctBlack = ((Black / TotalPop)*100),
         pctHisp = ((Hisp / TotalPop)*100),
         tract_area_mi = as.numeric(st_area(geometry)*0.000000035870),
         popdens_mi = TotalPop / tract_area_mi) %>%
  st_as_sf()

blockgroups2018 <- rbind(pa_blockgroups, nj_blockgroups)

blockgroups2018 <- blockgroups2018 %>%
  mutate(centroid_cbg = st_centroid(geometry),
         visitor_cbg = as.numeric(GEOID10)) %>%
  select(-GEOID10)

#Modify 2018 data
#Join to places dataset
dat_2018 <- dat2018 %>% 
  dplyr::select(safegraph_place_id, 
                raw_visit_counts,
                poi_cbg, 
                visitor_home_cbgs, 
                visitor_daytime_cbgs, 
                visitor_work_cbgs, 
                distance_from_home) %>%
  left_join(., phila, by = "safegraph_place_id") %>% 
  st_as_sf() %>%
  st_transform('ESRI:102728') 

#Filter to just bars, restaurants, and performing arts and join to corridors
flows <- dat_2018 %>%
  filter(top_category == "Drinking Places (Alcoholic Beverages)" |
           top_category == "Restaurants and Other Eating Places" |
           top_category == "Promoters of Performing Arts, Sports, and Similar Events" |
           top_category == "Performing Arts Companies") %>% #filter for business types
  mutate(category = case_when(top_category == "Drinking Places (Alcoholic Beverages)" ~ "Bars",
                              top_category == "Restaurants and Other Eating Places" ~ "Restaurants",
                              top_category == "Promoters of Performing Arts, Sports, and Similar Events" ~ "Arts",
                              top_category == "Performing Arts Companies" ~ "Arts")) %>%
  st_join(phl_corridors) %>% #join destinations to phl corridors (apply corridor destination to each trip)
  st_as_sf() %>%
  st_drop_geometry() %>% 
  select(safegraph_place_id, 
         category,
         poi_cbg,
         visitor_home_cbgs,
         distance_from_home,
         NAME.y) %>% #select columns
  mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\[|\\]")) %>%
  mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = "\\{|\\}")) %>%
  mutate(visitor_home_cbgs = str_remove_all(visitor_home_cbgs, pattern = '\\"|\\"')) %>%
  mutate(visitor_home_cbgs = str_split(visitor_home_cbgs, pattern = ",")) %>%
  unnest(visitor_home_cbgs) %>% #unnest visitor cbg column
  separate(.,
           visitor_home_cbgs,
           c("visitor_cbg", "count"),
           sep = ":") %>% #separate count column from visitor cbg
  mutate(count = as.numeric(count),
         visitor_cbg = as.numeric(visitor_cbg)) %>%
  dplyr::rename(., NAME = NAME.y) %>%
  drop_na(NAME) #Remove trips with destinations outside of corridors

#Join to corridors centroids
corr_summary <- flows %>%
  group_by(NAME, visitor_cbg) %>%
  summarize(Count = sum(count)-1,
            distance_from_home = mean(distance_from_home)) %>%
  left_join(., phl_corridors_cent, by = "NAME") 

#Join to blockgroups data
corr_summary1 <- corr_summary %>%
  right_join(., blockgroups2018, by = "visitor_cbg")  %>%
  dplyr::select(visitor_cbg, 
                NAME.x, 
                Count, 
                centroid_corr, 
                MedAge, 
                MedHHInc, 
                pctWhite, 
                pctBlack, 
                pctHisp, 
                centroid_cbg, 
                distance_from_home) %>%
  rename(NAME = NAME.x) 

#Group by destination corridor (One row per commercial corridor)
corr_group <- corr_summary1 %>%
  group_by(NAME) %>%
  summarize(total_trips = sum(Count),
            distance_from_home = mean(distance_from_home, na.rm = TRUE),
            weighted_age = weighted.mean(MedAge, Count, na.rm = TRUE),
            weighted_inc = weighted.mean(MedHHInc, Count, na.rm = TRUE),
            weighted_pctWhite = weighted.mean(pctWhite, Count, na.rm = TRUE),
            weighted_pctBlack = weighted.mean(pctBlack, Count, na.rm = TRUE),
            weighted_pctHisp = weighted.mean(pctHisp, Count, na.rm = TRUE)) %>%
  mutate(phl_distance_from_home = weighted.mean(distance_from_home, total_trips, na.rm = TRUE),
         phl_age = philadelphia_ACS$MedAge,
         phl_inc = philadelphia_ACS$MedHHInc,
         phl_pctWhite = philadelphia_ACS$pctWhite,
         phl_pctBlack = philadelphia_ACS$pctBlack,
         phl_pctHisp = philadelphia_ACS$pctHisp) 

#Rejoin to original corridor dataset for geometry (FINAL CORRIDOR DESTINATION DATASET)
corr_group_final <- corr_group %>%
  left_join(., phl_corridors %>% select(NAME, VAC_RATE), 
            by = "NAME") 

#Rejoin corridor origins to cbg dataset for geometry (FINAL CORRIDOR ORIGINS DATASET. EACH ROW IS ONE CORRIDOR/CBG COMBINATION)
corr_origins_final <- corr_summary1 %>%
  left_join(., corr_group %>% select(NAME, total_trips), by = "NAME") %>%
  dplyr::select(visitor_cbg,
                NAME,
                Count,
                MedAge,
                MedHHInc,
                pctWhite,
                pctBlack,
                pctHisp,
                distance_from_home,
                total_trips) %>%
  mutate(percent_of_trips = Count / total_trips) %>%
  left_join(., blockgroups2018 %>% select(visitor_cbg), by = "visitor_cbg")

corr_origins_final <- corr_origins_final %>% drop_na(NAME)

#Function to write GeoJSON for each corridor
CorridorFilter <- function(dataset) {
  
  corridorList <- unique(dataset[['NAME']])
  
  for (i in corridorList) {
    
    thisCorridor <- dataset %>% filter(NAME==i)
    
    thisCorridor <- thisCorridor %>% 
      mutate(NAME = gsub("/", "-", NAME, fixed = TRUE),
             NAME = gsub(".", "", NAME, fixed = TRUE),
             filename = gsub(" ", "-", NAME, fixed = TRUE)) %>%
      st_as_sf() %>% st_transform(crs = "EPSG: 4326")
    
    thisCorridor <- as_Spatial(thisCorridor)
    writeOGR(thisCorridor,
             dsn = paste(unique(thisCorridor$filename), ".geojson", sep = ""),
             'thisCorridor',
             driver = 'GeoJSON')
    }
}

#Run function
CorridorFilter(corr_origins_final)