1 Introduction

Currently, the Philadelphia Department of Parks & Recreation (PPR) owns 524 facilities across the city and hosts thousands of programs and events every year that contribute to residents' wellness. From top down, the hierarchy of its service system is: Districts -> Service Areas -> Facilities -> Programs and Permits.

In the past, PPR staff estimated the demand for programming from program data (like registered attendance) and other proxy measures of park visits (e.g., total trash collected). However, these measures may not be fully reliable or accurate. How can PPR make smarter decisions about allocating programs across its parks and recreation facilities? We take a data-driven approach to help them uncover the dynamic relationship between planned activities and visitors.

Only recently, with the dynamic data collected by SafeGraph and other cell phone data providers, has it become possible to analyze large datasets of cell phone location activity, including where people travel and how long they stay. SafeGraph's mobile device panel collects anonymized foot-traffic data from numerous smartphone apps and can be treated as a selected sample for understanding people's travel patterns. These data are further aggregated to answer questions such as how often people visit a location and how long they stay there.

By incorporating this novel dataset, we can help PPR analyze whether its programming meets citizens' demands and better allocate program resources using SafeGraph's Patterns data. With an understanding of the imbalance between demand and supply, PPR can adjust the quantity of programs and events, such as adding programs at under-supplied facilities and reducing them at over-supplied ones. The predicted market areas can then inform PPR's future programming strategies.

To know how many additional programs are needed and how an adjustment will affect other facilities, we turn to the Huff model. In spatial analysis, the Huff model is a widely used tool for predicting the probability that a consumer visits a site, as a function of the distance from origin to destination, the site's attractiveness, and the relative attractiveness of alternatives. With the predicted probability of a consumer visiting a facility, we can interpret and normalize it to reflect the number of visitors traveling from a census block group to a given facility. In this way, we can help PPR better understand the use of its facilities and provide recommendations on how to allocate programming resources within Program Service Areas and which types of programs to offer to meet diverse user demands.
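
For reference, below is a minimal sketch of the Huff calculation, assuming hypothetical inputs: a vector of facility attractiveness scores, a matrix of origin-to-facility distances, and tuning exponents alpha and beta (these names are illustrative, not the project's code). Under the model, P_ij = (A_j^alpha / d_ij^beta) / sum_k (A_k^alpha / d_ik^beta).

# minimal Huff model sketch; inputs and names are illustrative, not project code
huff_prob <- function(attractiveness, dist, alpha = 1, beta = 2) {
  # utility of facility j for origin i: A_j^alpha / d_ij^beta
  utility <- sweep(1 / dist^beta, 2, attractiveness^alpha, `*`)
  # normalize each row so probabilities across facilities sum to 1
  utility / rowSums(utility)
}

# toy example: 2 origins (rows), 3 facilities (columns)
A <- c(10, 5, 8)
D <- matrix(c(1, 2, 3,
              2, 1, 4), nrow = 2, byrow = TRUE)
huff_prob(A, D)  # each row sums to 1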

The eventual user interface will be a dashboard that conveys PPR-related information, useful exploratory analysis, and the market areas estimated by the Huff model. The dashboard will visualize existing programs, permits, and estimated visit counts for each facility, Service Area, and district, and will display proposed activities to visitors in the future.

windowsFonts(font = windowsFont('Helvetica'))
crs = 4326
library(vroom)
library(grid)
library(gridExtra)
library(sf)
library(terra)
library(dplyr)
library(spData)
library(mapview)
library(geosphere)
library(sp)
library(rgeos)
library(ggplot2)
library(ggmap)
library(kableExtra)
library(tidyverse)
library(data.table)
#https://rdrr.io/cran/osrm/man/osrmRoute.html
library(osrm)
library(corrplot)
#remotes::install_github("CityOfPhiladelphia/rphl")
library(rphl)
library(lubridate)
library(furrr)
library(riem)
library(tidycensus)
library(rgdal)
library(NbClust)
library(cluster)
library(psych)
library(splitTools)
library(scales)
library(stringr) 
library(FNN)
library(caret)
library(cowplot)
# helper: reproject an sf object to lat/lon (EPSG:4326)
ll <- function(dat, proj4 = 4326){st_transform(dat, proj4)}

census_api_key("YOUR_CENSUS_API_KEY", overwrite = TRUE)  # replace with your own Census API key

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

plotTheme <- function(base_size = 9, title_size = 10){
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black", hjust = 0.5), 
    plot.subtitle = element_text( face = 'italic',
                                 size = base_size, colour = "black", hjust = 0.5),
    plot.caption = element_text( hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.01),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=.5),
    strip.background = element_blank(),
    strip.text = element_text( size=9),
    axis.title = element_text( size=9),
    axis.text = element_text( size=7),
    axis.text.y = element_text( size=7),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text( colour = "black", face = "italic", size = 9),
    legend.text = element_text( colour = "black", face = "italic", size = 7),
    strip.text.x = element_text( size = 9),
    legend.key.size = unit(.5, 'line')
  )
}

mapTheme <- function(base_size = 9, title_size = 10){
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black", hjust = 0.5), 
    plot.subtitle = element_text( face = 'italic',
                                 size = base_size, colour = "black", hjust = 0.5),
    plot.caption = element_text( hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size=base_size),
    axis.title = element_text( size=9),
    axis.text = element_blank(),
    axis.text.y = element_blank(),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text( colour = "black", face = "italic", size = 9),
    legend.text = element_text( colour = "black", face = "italic", size = 7),
    strip.text.x = element_text(size=base_size),
    legend.key.size = unit(.5, 'line')
  )
}

palette10 <- c("#315d7f","#4f5d7f","#6d5c7e","#a36681","#d96f83","#f2727f","#f6928a","#f8a28f","#f9b294","#fcdc97")
palette9 <- c('#315d7f', '#4f5d7f', '#6d5c7e', '#a36681', '#d96f83', '#f2727f', '#f6928a', '#f8a28f', '#f9b294')
palette7 <- c('#315d7f', '#4f5d7f', '#6d5c7e', '#d96f83', '#f2727f', '#f6928a', '#f9b294')
palette5 <- c("#f9b294","#f2727f","#c06c86","#6d5c7e","#315d7f")
palette4 <- c("#f9b294","#f2727f","#c06c86","#6d5c7e")
palette2 <- c("#f9b294","#f2727f")
palette1_main <- "#F2727F"
palette1_assist <- '#F9B294'

1.1 District & Service Area

pprDistrict <- st_read('https://opendata.arcgis.com/datasets/0cdc4a1e86c6463b9600f9d9fca39875_0.geojson') %>%
  st_transform(crs)

pprServiceArea <- read_sf(dsn="data/FromPPR/PPR_Service_Areas_2021/PPR_Service_Areas_2021.shp")%>%
  st_transform(crs = crs)

#save as geojson for app
# st_write(pprServiceArea,"pprServiceArea.GEOJSON")

base_map <- get_map(location = unname(st_bbox(ll(st_buffer(st_union(pprDistrict),5000)))),maptype = "terrain")

ggmap(base_map) + 
  geom_sf(data=ll(st_union(pprDistrict)),color="black",size=1,fill = "transparent",inherit.aes = FALSE)+
  geom_sf(data=ll(pprDistrict),color='black',size=1,fill = "transparent",inherit.aes = FALSE)+
  geom_sf(data=ll(pprServiceArea),color='black',size=0.6,fill = "transparent",inherit.aes = FALSE)+
  geom_sf(data=ll(pprDistrict %>% filter(DISTRICTID %in% c(7,8,9))),color=palette1_main,size=1,fill = "transparent",inherit.aes = FALSE)+
  geom_sf(data=ll(pprServiceArea %>% filter(PPR_DIST %in% c(7,8,9))),color=palette1_main,size=0.6,fill = "transparent",inherit.aes = FALSE)+
  labs(title = "Location of PPR Districts and Pilot Service Areas", 
       subtitle = "",
       x="",y="")+
  mapTheme()
Figure 1.1.1 Location of PPR Districts and Pilot Service Areas

The city is divided into 10 districts, each containing several Program Service Areas. PPR allocates staff to run programs at facilities and parks within each Program Service Area. Besides citywide analyses, this practicum focuses on Districts 7, 8, and 9 per PPR's request; these districts are part of a pilot that begins in spring 2022. In Figure 1.1.1, the areas outlined in pink are the Service Areas in Districts 7, 8, and 9.

1.2 Properties

pprProperties <- st_read('https://opendata.arcgis.com/datasets/d52445160ab14380a673e5849203eb64_0.geojson')%>%
  st_transform(crs)

property <- st_join(st_centroid(pprProperties),pprServiceArea,left=TRUE) %>% 
  st_drop_geometry() %>% 
  left_join(pprProperties %>% dplyr::select(OBJECTID,geometry),by='OBJECTID') %>% 
  st_sf() %>% 
  st_transform(crs = crs) %>% 
  dplyr::select(-Shape__Length,-Shape__Area,-Shape_Leng,-Shape_Area) %>% 
  rename('ServiceAreaID' = 'INFO')

#save as geojson for app
#st_write(property,"property.GEOJSON")

ggplot() + 
  geom_sf(data=property,color=palette1_main,fill = palette1_main)+
  geom_sf(data=st_union(pprDistrict),color="black",size=1,fill = "transparent")+
  geom_sf(data=pprDistrict,color="black",size=0.7,linetype ="dashed",fill = "transparent")+
  geom_sf(data=pprDistrict %>% filter(DISTRICTID %in% c(7,8,9)),color="black",size=1,fill = "transparent")+
  labs(title = "Locations of PPR Properties", 
       subtitle = "",
       x="",y="")+
  mapTheme()
Figure 1.2.1 Locations of PPR Properties

Figure 1.2.1 shows the locations of more than 500 PPR property boundaries in Philadelphia. Notably, some larger parks contain several "child" properties located wholly inside the "parent" property. For example, 'Wissahickon Valley Park' includes 16 "child" properties ('Wissahickon Valley Park', 'Wissahickon Environmental Center', 'Salvatore Pachella Memorial Field', 'David P Montgomery Field', 'John F Boyce Memorial Field', 'Arrow Field', 'Walnut Lane Golf Club', 'Samuel F Houston Playground', 'Carpenters Woods', 'Dodge Tract', 'Historic Rittenhouse Town', 'Clifford Park', 'Blue Bell Park', 'Saul High School Farm', 'Andorra Natural Area', and 'Saylors Grove').

2 Exploratory Analysis - Program & Permits

The activities held on PPR properties are recorded in two systems: program schedules and permits. With the former, PPR staff arrange many activities seasonally at different PPR facilities; with the latter, people outside PPR apply to conduct activities and reserve space in parks and recreation areas. Although Sections 2 and 3 analyze both programs and permits to help PPR better understand the situation, the Huff model section focuses only on the impact of programs because of the use case proposed.

In 2021, the recorded program schedules covered seven categories: After School, Athletic, Camp, Cultural, Educational, Environmental/Outdoor, and other activities. Additional activities proposed by volunteers or local residents were recorded in the permit system. In the following analyses, "events" refers to the combination of program and permit data.

2.1 Overall Distribution

In Figure 2.1.1 below, red markers show the locations of facilities with program schedules, while orange markers show facilities with permit records. The number of facilities with permits is larger than the number with programs, which illustrates the demand for self-proposed activities and the potential to enrich the program schedules in those areas.

Through data wrangling, we obtain information recorded by PPR, such as the duration and attendance of each event. Furthermore, we link the program and permit datasets to their host properties and their Program Service Areas (shown in the following map).

permit2021 <- vroom("data/FromPPR/PPR-recreation-permits-2021.csv")
program2021 <- vroom("data/FromPPR/PPR-programs-attended-2021-with-schedules.csv")
facilityID <- read.csv("data/FromPPR/tblFacility_to_PPR_Properties.csv")
facilityIDNOT789 <- read.csv("data/FromPPR/tblFacility_to_PPR_Properties_NOT_789.csv")
# facilityID <- facilityID %>% rbind(facilityIDNOT789 %>% rename("PPR_Properties_ObjectID"= "PPR_Properties_ID"))
# note: the "ï.." prefix in the raw header comes from a UTF-8 byte-order mark (BOM)
facilityID <- facilityID %>% rbind(facilityIDNOT789 %>% rename(
  "FacilityID"="ï..FacilityID",
  "PPR_Properties_ObjectID"= "PPR_Properties_ID"))

# define date, filter by attendance date
program2021.clean <- program2021 %>% 
  mutate(AttendanceWeekDate = mdy(AttendanceWeekDate),
         DateFrom = mdy(DateFrom),
         DateTo = mdy(DateTo)) %>% 
  filter(AttendanceWeekDate > DateFrom & AttendanceWeekDate < DateTo)

# original data is recorded by week, here we change it into being recorded by occurrence
program2021.clean <- separate(program2021.clean, Days,into= c("1","2","3","4","5","6","7"))

program2021.clean <- program2021.clean %>% 
  gather(colNames, weekday, 15:21) %>% 
  select(-colNames) %>% 
  na.omit(cols='weekday')%>% 
  mutate(AttendenceRealDate = case_when(
    weekday == "Monday" ~ AttendanceWeekDate,
    weekday == "Tuesday" ~ AttendanceWeekDate+1,
    weekday == "Wednesday" ~ AttendanceWeekDate+2,
    weekday == "Thursday" ~ AttendanceWeekDate+3,
    weekday == "Friday" ~ AttendanceWeekDate+4,
    weekday == "Saturday" ~ AttendanceWeekDate+5,
    weekday == "Sunday" ~ AttendanceWeekDate+6,
  ))

# join property,permit and program data
program2021.join <- left_join(program2021.clean, facilityID, by =c("FacilityID" = "FacilityID")) %>% 
  left_join(., property, by =c("PPR_Properties_ObjectID"="OBJECTID"))

permit2021.join <- left_join(permit2021, facilityID, by =c("FacilityID")) %>% 
  left_join(., property, by =c("PPR_Properties_ObjectID"="OBJECTID"))

# filter the failed joining items
program2021.otherDIstrict <- program2021.join %>% filter(PPR_DISTRICT %in% c(1,2,3,4,5,6,10))
program2021.join <- program2021.join %>% drop_na(PPR_Properties_ObjectID)

permit2021.otherDIstrict <- permit2021.join %>% filter(is.na(PPR_Properties_ObjectID))
permit2021.join <- permit2021.join %>% drop_na(PPR_Properties_ObjectID)

# Wrangle "program2021.join", and extract month attendance
Facility_Program <- program2021.join %>%
  select(Facility,ActvityTypeCategory,ActivityType,
         AttendanceWeekDate,
         RegisteredIndividualsCount,
         PPR_DISTRICT, PUBLIC_NAME, PARENT_NAME,geometry) %>%
  # map month number to its abbreviation ("Jan" ... "Dec")
  mutate(month = month.abb[month(AttendanceWeekDate)]) %>% 
  distinct(.keep_all = FALSE)

#save as geojson for app
#st_write(Facility_Program,"Facility_Program.GEOJSON")

Facility_Program_otherDistricts <- program2021.otherDIstrict %>%
  select(Facility,ActvityTypeCategory,ActivityType,
         AttendanceWeekDate,
         RegisteredIndividualsCount,
         PUBLIC_NAME, PARENT_NAME,geometry) %>%
  # map month number to its abbreviation ("Jan" ... "Dec")
  mutate(month = month.abb[month(AttendanceWeekDate)]) %>%
  distinct(.keep_all = FALSE)

# Aggregate weekly visits
WeekVisit <- aggregate(
  RegisteredIndividualsCount ~ AttendanceWeekDate + ActvityTypeCategory + PPR_DISTRICT, data = Facility_Program, FUN = sum)

#save as geojson for app
#st_write(permit2021.join,"permit2021.join.GEOJSON")

#save as geojson for app
#st_write(program2021.join,"program2021.join.GEOJSON")
ggplot()+
  geom_sf(data=pprServiceArea,color='black',size=0.25,linetype ="dashed", fill= "transparent")+
  geom_sf(data=pprDistrict,color="black",size=1,fill = "transparent")+
  geom_sf(data=permit2021.join,aes(geometry = geometry),color =palette1_assist,fill = palette1_assist, alpha = 0.7) +
  geom_sf(data=program2021.join,aes(geometry = geometry),color = palette1_main,fill = palette1_main,alpha = 0.7)+
  labs(title="Facilities w/ Programmed (red) & Permited (orange) Activities")+
  mapTheme()
Figure 2.1.1 Facilities w/ Programmed (red) & Permitted (orange) Activities across Philly

2.2 Districts 7, 8, 9

In the following section, we focus on Districts 7, 8, and 9 as requested in the proposal. Permits at each facility are not discussed in Section 2.

2.2.1 District 7

There are three facilities with programs scheduled in District 7: Athletic Recreation Center, Mander Playground, and Marian Anderson Recreation Center.

ggplot()+
  geom_sf(data=st_union(pprServiceArea %>% filter(PPR_DIST ==7)),color="black",size=1,fill = "transparent")+
  geom_sf(data=pprServiceArea %>% filter(PPR_DIST ==7),color="black",linetype ="dashed",size=1,fill = "transparent")+
  geom_sf(data=permit2021.join%>% filter(PPR_DISTRICT ==7),aes(geometry = geometry),color =palette1_assist,fill = palette1_assist, alpha = 0.7) +
  geom_sf(data=program2021.join%>% filter(PPR_DISTRICT ==7),aes(geometry = geometry),color = palette1_main,fill = palette1_main,alpha = 0.7)+
  labs(title="Facilities w/ Programed (red) & Permitted (orange) Activities in District 7")+
  mapTheme()
Figure 2.2.2 Facilities w/ Programmed (red) & Permitted (orange) Activities in District 7

plot1 <- ggplot(Facility_Program %>%filter(PPR_DISTRICT == 7)) +
  geom_bar(aes(x= Facility,fill = ActvityTypeCategory),position="stack")+
  scale_fill_manual(values = palette5)+ 
  labs(y = "Program Frequency", fill="Category", title = "")+
  plotTheme()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.text = element_text( colour = "black", face = "italic", size = 5))

plot2 <- ggplot(Facility_Program %>%filter(PPR_DISTRICT == 7)) + 
  geom_bar(aes(x= Facility, fill = ActivityType),position="stack")+
  scale_fill_manual(values = palette7)+
  labs(y = "Program Frequency", fill="sub-Category", title = "")+ 
  plotTheme()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.text = element_text( colour = "black", face = "italic", size = 5))

grid.arrange(plot1, plot2,ncol=2,top = "Categories of Events in District 7")
Figure 2.2.3 Categories of Events in District 7

In Figure 2.2.3, we can see that Marian Anderson Recreation Center held programs focused mainly on the athletic category, like soccer and baseball, while the other two facilities hosted more cultural activities, like art and music. Overall, the number of programs at Marian Anderson Recreation Center is far larger than at the other two facilities.

ggplot(WeekVisit %>%filter(PPR_DISTRICT == 7)) +
  geom_line(size=0.5,aes(x = AttendanceWeekDate, y = RegisteredIndividualsCount, group = ActvityTypeCategory, colour = ActvityTypeCategory)) +
  geom_point(aes(x = AttendanceWeekDate, y = RegisteredIndividualsCount, group = ActvityTypeCategory, colour = ActvityTypeCategory, size = RegisteredIndividualsCount)) +
  scale_color_manual(values = palette5)+
  scale_size_continuous(range = c(2, 4))+
  labs(title = "Visitor Counts by Week and Activity Categories in District 7",
       color = "Program Category",
       size="Visitor Counts",
       x = "Week of the Year",
       y = "Visitor Counts")+
  plotTheme()+
    theme(axis.text.x = element_text( hjust = 1, size = 8),
        axis.text.y = element_text( hjust = 1, size = 8),
        legend.text = element_text( colour = "black", size = 8))
Figure 2.2.4 Visitor Counts by Week and Activity Categories in District 7

#save as geojson for app
##st_write(WeekVisit,"WeekVisit.GEOJSON")

In Figure 2.2.4, we can see that athletic activities were held from March to November, while cultural and after-school activities mostly took place in fall.

2.2.2 District 8

There are five facilities with programs scheduled in District 8, shown in Figure 2.2.5: 48th & Woodland Playground, Christy Recreation Center, Howards S. Morris Recreation Center, Laura Sims Skate House, and Shepard Recreation Center.

ggplot()+
  geom_sf(data=st_union(pprServiceArea %>% filter(PPR_DIST ==8)),color="black",size=1,fill = "transparent")+
  geom_sf(data=pprServiceArea %>% filter(PPR_DIST ==8),color="black",linetype ="dashed",size=1,fill = "transparent")+
  geom_sf(data=permit2021.join%>% filter(PPR_DISTRICT ==8),aes(geometry = geometry),color =palette1_assist,fill = palette1_assist, alpha = 0.7) +
  geom_sf(data=program2021.join%>% filter(PPR_DISTRICT ==8),aes(geometry = geometry),color = palette1_main,fill = palette1_main,alpha = 0.7)+
  labs(title="Facilities w/ Programed (red) & Permitted (orange) Activities in District 8")+
  mapTheme()
Figure 2.2.5 Facilities w/ Programmed (red) & Permitted (orange) Activities in District 8

plot1 <- ggplot(Facility_Program %>%filter(PPR_DISTRICT == 8)) +
  geom_bar(aes(x= Facility,fill = ActvityTypeCategory),position="stack")+
  scale_fill_manual(values = palette5)+
  labs(y = "Program Frequency", fill="Category", title = "")+
  plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.position = "bottom",
        legend.text = element_text( colour = "black", face = "italic", size = 5))

plot2 <- ggplot(Facility_Program %>%filter(PPR_DISTRICT == 8)) + 
  geom_bar(aes(x= Facility, fill = ActivityType),position="stack")+
  scale_fill_manual(values = palette9)+
  labs(y = "Program Frequency", fill="sub-Category", title = "Categories of Events in District8")+ 
  plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.position = "bottom",
        legend.text = element_text( colour = "black", face = "italic", size = 5))

grid.arrange(plot1, plot2,ncol=2,top = "Categories of Events in District 8")
Figure 2.2.6 Categories of Events in District 8

In Figure 2.2.6, we can see that Laura Sims Skate House held hockey and ice skating activities, while Morris Recreation Center hosted more cultural activities like dance as well as athletic activities such as gymnastics, tumbling, and basketball. Overall, in District 8, Laura Sims Skate House held the most programs, with Morris Recreation Center second.

ggplot(WeekVisit %>%filter(PPR_DISTRICT == 8)) +
  geom_point(aes(x = AttendanceWeekDate, y = RegisteredIndividualsCount, group = ActvityTypeCategory, colour = ActvityTypeCategory, size = RegisteredIndividualsCount)) +
  geom_line(size=0.5,aes(x = AttendanceWeekDate, y = RegisteredIndividualsCount, group = ActvityTypeCategory, colour = ActvityTypeCategory)) +
  scale_color_manual(values = palette5)+
  scale_size_continuous(range = c(2, 4))+
  labs(title = "Visitor Counts by Week and Activity Categories in District 8",
       color = "Program Category",
       size="Visitor Counts",
       x = "Week of the Year",
       y = "Visitor Counts")+
  plotTheme()+
  theme(legend.text = element_text( colour = "black", size = 8))
Figure 2.2.7 Visitor Counts by Week and Activity Categories in District 8

In Figure 2.2.7, we can see that hockey and ice skating activities take place in fall, winter, and spring, while cultural and after-school activities are mainly held in fall.

2.2.3 District 9

There are seven facilities with programs scheduled in District 9, including Barry Playground and Pool, Cibotti Recreation Center, DiSilvestro Playground, East Passyunk Community Center, Eastwick Regional Playground, and Thomas B. Smith Recreation Center.

ggplot()+
  geom_sf(data=st_union(pprServiceArea %>% filter(PPR_DIST ==9)),color="black",size=1,fill = "transparent")+
  geom_sf(data=pprServiceArea %>% filter(PPR_DIST ==9),color="black",linetype ="dashed",size=1,fill = "transparent")+
  geom_sf(data=permit2021.join%>% filter(PPR_DISTRICT ==9 & FacilityName != "FDR Park"),aes(geometry = geometry),color =palette1_assist,fill = palette1_assist, alpha = 0.7) +
  geom_sf(data=program2021.join%>% filter(PPR_DISTRICT ==9),aes(geometry = geometry),color = palette1_main,fill = palette1_main,alpha = 0.7)+
  labs(title="Facilities w/ Programed (red) & Permitted (orange) Activities in Disdrict 9")+
  mapTheme()
Figure 2.2.8 Facilities w/ Programmed (red) & Permitted (orange) Activities in District 9

In Figure 2.2.9, we can see that athletic activities such as basketball and aquatics mostly took place at Barry Playground and Pool and East Passyunk Community Center. Eastwick Regional Playground and Thomas B. Smith Recreation Center are the two most popular facilities, with activities in the athletic, after-school, cultural, educational, and other categories.

ggplot(Facility_Program %>%filter(PPR_DISTRICT == 9)) +
  geom_bar(aes(x= Facility,fill = ActvityTypeCategory),position="stack")+
  scale_fill_manual(values = palette7[2:7])+
  labs(y = "Program Frequency", fill="Category", title = "Categories of Events in District 9")+
  plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.position = "bottom",
        legend.text = element_text( colour = "black", size = 8))
Figure 2.2.9 Facility & Program Categories in District 9

ggplot(WeekVisit %>%filter(PPR_DISTRICT == 9)) +
  geom_point(aes(x = AttendanceWeekDate, y = RegisteredIndividualsCount, group = ActvityTypeCategory, colour = ActvityTypeCategory, size = RegisteredIndividualsCount)) +
  geom_line(size=0.5,aes(x = AttendanceWeekDate, y = RegisteredIndividualsCount, group = ActvityTypeCategory, colour = ActvityTypeCategory)) +
  scale_color_manual(values = palette7[2:7])+
  scale_size_continuous(range = c(2, 4))+
  labs(title = "Visitor Counts by Week and Activity Categories in District 9",
       color = "Program Category",
       size="Visitor Counts",
       x = "Week of the Year",
       y = "Visitor Counts")+
  plotTheme()+
    theme(legend.text = element_text( colour = "black", size = 8))
Figure 2.2.10 Visitor Counts by Week and Activity Categories in District 9

Figure 2.2.10 indicates that athletic activities took place throughout the year, while other categories, like older-adult and mentoring activities, were held in the second half of the year. Cultural activities, like art, dance, and music, were usually suspended in summer.

2.3 Other Districts

As shown in Figure 2.3.1, there are 27 facilities with programs scheduled in the other districts of PPR's service areas, each with a different program arrangement.

ggplot(Facility_Program_otherDistricts) +
  geom_bar(aes(x= Facility,fill = ActvityTypeCategory),position="stack")+
  scale_fill_manual(values = palette7)+
  labs(y = "Program Frequency", fill="Program Category", title = "Categories of Events in Other Districts")+
  plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.position = "bottom",
        legend.text = element_text( colour = "black", face = "italic", size = 8))
Figure 2.3.1 Facility & Program Categories in Other Districts

3 Exploratory Analysis - SafeGraph Data

This project uses SafeGraph's Patterns dataset to analyze whether PPR programming achieves its goals of meeting citizens' demands and serving them well. Further, SafeGraph data can be used to suggest PPR's future programming strategies in Philadelphia.

SafeGraph's Patterns dataset includes visitor and visit aggregations for points of interest (POIs) in the US. It contains aggregated raw counts of visits to POIs from a panel of mobile devices, answering how often people visit, how long they stay (dwell time), where they come from (origin), where else they go (departure), and more.
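
To make the structure concrete, the sketch below builds a toy, single-row imitation of a Patterns record (all values invented); the column names match the ones used later in this section.

library(dplyr)

# toy imitation of a SafeGraph Patterns row (values invented for illustration)
toy_pattern <- tibble(
  placekey             = "zzz-example",
  location_name        = "Example Playground",
  naics_code           = 712190,                        # Nature Parks and Similar Institutions
  date_range_start     = "2021-01-01T00:00:00-05:00",
  visits_by_day        = "[3,5,2,4,6]",                 # JSON array: one count per day
  visitor_home_cbgs    = '{"421010177011":25}',         # JSON map: home CBG -> visitor count
  bucketed_dwell_times = '{"<5":10,"5-10":7,">240":2}'  # JSON map: dwell bucket -> visitors
)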

# brand_info <- vroom("data/safegraph/Philadelphia-Camden-WilmingtonPA-NJ-DE-MDMSA-CORE_POI-2021_11-2021-12-17/brand_info.csv")
# core_poi <- vroom("data/safegraph/Philadelphia-Camden-WilmingtonPA-NJ-DE-MDMSA-CORE_POI-2021_11-2021-12-17/core_poi.csv")
# 
# monthList = c("01","02","03","04","05","06","07","08","09","10","11")
# 
# patternAllMonth = data.frame()
# for (i in monthList){
#   currentMonth = vroom(paste("data/safegraph/SafeGraph Data Purchase Dec-16-2021/Philadelphia-Camden-WilmingtonPA-NJ-DE-MDMSA-PATTERNS-2021_",
#        i,
#        "-2021-12-17/patterns.csv.gz",sep = ""))%>%
#     filter(region=="PA")%>%
#     mutate(month=paste(i,sep = ""))
#   patternAllMonth = rbind(patternAllMonth,currentMonth)
# }
# 
# # filter into philly, join with POI data
# safeGraph <- patternAllMonth %>%
#   filter(city == "Philadelphia")%>%
#   left_join(core_poi %>% dplyr::select(placekey,location_name,top_category,sub_category,naics_code,latitude,longitude),
#             by=c("placekey"="placekey","location_name" = "location_name"),keep=FALSE)
# 
# # create geometry from lat & lng
# safeGraph.geo <-
#   safeGraph %>%
#   st_as_sf(coords = c("longitude", "latitude"), crs = crs, agr = "constant", na.fail=FALSE)


# patternAllMonth <- st_read("data/output/patternAllMonth.csv")
#st_write(safeGraph.geo,"data/output/safeGraph.geo.GeoJSON")
safeGraph.geo <- st_read("data/output/safeGraph.geo.GeoJSON",crs = crs)

# adjust the number of parallel workers to your machine
# note: plan(multiprocess) is deprecated in current versions of the future package
plan(multisession, workers = 10)

# keep congeneric businesses
congenericMoves <-
  safeGraph.geo %>%
  filter(top_category %in% c("Promoters of Performing Arts, Sports, and Similar Events","Other Amusement and Recreation Industries","Museums, Historical Sites, and Similar Institutions") | str_detect(location_name, "Park") | str_detect(location_name, "Playground") | str_detect(location_name, "Recreation Center")) %>%
  filter(str_detect(location_name, "Parking", negate = TRUE))

# Keep only ppr sites
parks <- safeGraph.geo %>% 
  dplyr::select(placekey, naics_code, location_name) %>% 
  distinct() %>% 
  filter(naics_code %in% c(712190, 713990, 713940, 711310) | str_detect(location_name, "Park") | str_detect(location_name, "Playground") | str_detect(location_name, "Recreation Center")) %>%
  filter(str_detect(location_name, "Parking", negate = TRUE)) %>% 
  st_transform(crs = 4326)

PPRmoves <- safeGraph.geo %>% 
  filter(placekey %in% as.list(parks$placekey))

In the data wrangling above, we first combine all 11 months of Patterns data from the SafeGraph dataset and attach geometry information. We then filter the data to the Pennsylvania region and to project-related POIs using NAICS codes. The NAICS codes we chose are as follows:

712190: Nature Parks and Other Similar Institutions;
713990: All Other Amusement and Recreation Industries;
713940: Fitness and Recreational Sports Centers;
711310: Promoters of Performing Arts, Sports, and Similar Events.

# join filtered SafeGraph places with PPR properties, using a 10 m buffer in the projected CRS (EPSG:32118)
propertyWithPlaceKey <- st_join(property %>% filter(NESTED == "N") %>% st_transform(crs=32118),
                                st_buffer(parks %>% drop_na() %>% dplyr::select(placekey, geometry) %>% st_transform(crs=32118),10)) %>%
  st_drop_geometry() %>% 
  left_join(property %>% dplyr::select(OBJECTID),by=('OBJECTID'='OBJECTID')) %>% 
  st_sf() %>% 
  st_transform(crs=crs) %>% 
  drop_na(placekey)
#st_write(propertyWithPlaceKey, "data/output/dashboard/SitesRelation_PPR_SG.GeoJSON")

# back up geometry information
program2021.joinGEO <- program2021.join %>% dplyr::select(FacilityID,geometry) %>% distinct()
permit2021.joinGEO <- permit2021.join %>% dplyr::select(FacilityID,geometry) %>% distinct()

program2021.joinWithPlaceKey <- st_join(program2021.join %>%
            st_sf() %>% 
            st_transform(crs=32118),
          propertyWithPlaceKey %>% 
            st_transform(crs=32118) %>% 
            mutate(Parent_ID = OBJECTID) %>% 
            dplyr::select(Parent_ID, placekey, geometry)
          ,left=FALSE, join=st_contains) %>%
  st_drop_geometry() %>% 
  left_join(program2021.joinGEO,
              by='FacilityID')%>%
  st_sf() %>% 
  st_transform(crs=crs)

permit2021.joinWithPlaceKey <- st_join(permit2021.join %>%
            st_sf() %>% 
            st_transform(crs=32118),
          propertyWithPlaceKey%>% 
            st_transform(crs=32118) %>% 
            mutate(Parent_ID = OBJECTID) %>% 
            dplyr::select(Parent_ID, placekey, geometry),
          left=FALSE, join=st_contains) %>% 
  st_drop_geometry() %>% 
  left_join(permit2021.joinGEO,
              by='FacilityID')%>%
  st_sf() %>% 
  st_transform(crs=crs)

Through this data wrangling, we use spatial joins to link the filtered SafeGraph POIs with PPR facilities and with PPR programs & permits, bridging these three main datasets. In the above map, the purple polygons are all PPR properties and the green dots are the properties successfully joined to a placekey; the map shows that most PPR facilities join successfully. The layers can be switched on and off to read the map more clearly.

placeKeyNeeded = unique(propertyWithPlaceKey$placekey)
PPRmoves <- PPRmoves %>% filter(placekey %in% placeKeyNeeded)
# unnest visit Count data
# visitCount <-
#   PPRmoves %>%
#   select(placekey, date_range_start, date_range_end, visits_by_day) %>%
#   mutate(date_range_start = as_date(date_range_start),
#          date_range_end = as_date(date_range_end)) %>%
#   dplyr::select(-date_range_end) %>%
#   mutate(visits_by_day = future_map(visits_by_day, function(x){
#     unlist(x) %>%
#       as_tibble() %>%
#       rowid_to_column(var = "day") %>%
#       mutate(visits = value) %>%
#       dplyr::select(-value)
#   })) %>%
#   unnest(cols = c("visits_by_day"))
# visitCount <- visitCount %>%
#     rename(visitDay = date_range_start) %>%
#     mutate(visitDay = day+visitDay-1) %>%
#     mutate(month = month(visitDay))
# st_write(visitCount,"data/output/visitCount.GeoJSON")
visitCount <- st_read("data/output/visitCount.GeoJSON",crs=crs)

# unnest popularity_by_hour data
# visitHour <-
#   PPRmoves %>%
#   select(placekey, popularity_by_hour, date_range_start) %>%
#   mutate(date_range_start = as_date(date_range_start),
#          month =  month(date_range_start)) %>%
#   dplyr::select(-date_range_start) %>%
#   mutate(popularity_by_hour = future_map(popularity_by_hour, function(x){
#     unlist(x) %>%
#       as_tibble() %>%
#       rowid_to_column(var = "hour") %>%
#       rename(visit = value)
#   }))%>%
#   unnest(popularity_by_hour)
# 
# st_write(visitHour,"data/output/visitHour.GeoJSON")
visitHour <- st_read("data/output/visitHour.GeoJSON",crs=crs)


# unnest the origin-destination data

# originCount <-
#   PPRmoves %>%
#   select(placekey, visitor_home_cbgs, date_range_start) %>%
#   mutate(date_range_start = as_date(date_range_start),
#          month =  month(date_range_start)) %>%
#   dplyr::select(-date_range_start) %>%
#   mutate(visitor_home_cbgs = future_map(visitor_home_cbgs, function(x){
#     jsonlite::fromJSON(x) %>%
#       as_tibble() %>%
#       dplyr::select(starts_with("4")) %>%
#       gather()
#   })) %>%
#   unnest(visitor_home_cbgs) %>%
#   rename(origin =key ,visitors= value)
# 
# st_write(originCount,"data/output/originCount.GeoJSON")
originCount <- st_read("data/output/originCount.GeoJSON",crs=crs)


#unnest the departure - destination data
# departCount <-
#   PPRmoves %>%
#   select(placekey, visitor_daytime_cbgs, date_range_start) %>%
#   mutate(date_range_start = as_date(date_range_start),
#          month =  month(date_range_start)) %>%
#   dplyr::select(-date_range_start) %>%
#   mutate(visitor_daytime_cbgs = future_map(visitor_daytime_cbgs, function(x){
#     jsonlite::fromJSON(x) %>%
#       as_tibble() %>%
#       dplyr::select(starts_with("4")) %>%
#       gather()
#   }))%>%
#   unnest(visitor_daytime_cbgs) %>%
#   rename(departure =key ,visitors= value)
# 
# st_write(departCount,"data/output/departCount.GeoJSON")
departCount <- st_read("data/output/departCount.GeoJSON",crs=crs)

# unnest the bucketed_dwell_times data
# dwellTime <-
#   PPRmoves %>%
#   select(placekey, bucketed_dwell_times, date_range_start) %>%
#   mutate(date_range_start = as_date(date_range_start),
#          month =  month(date_range_start)) %>%
#   dplyr::select(-date_range_start) %>%
#   mutate(bucketed_dwell_times = future_map(bucketed_dwell_times, function(x){
#     jsonlite::fromJSON(x) %>%
#       as_tibble() %>%
#       gather()
#   }))%>%
#   unnest(bucketed_dwell_times) %>%
#   rename(dwellTimes =key ,visitors= value)
# 
# st_write(dwellTime,"data/output/dwellTime.GeoJSON")
dwellTime <- st_read("data/output/dwellTime.GeoJSON",crs=crs)


# unnest the related_same_day_brand
# relatedBrand <-
#   PPRmoves %>%
#   select(placekey, related_same_day_brand, date_range_start) %>%
#   mutate(date_range_start = as_date(date_range_start),
#          month =  month(date_range_start)) %>%
#   dplyr::select(-date_range_start) %>%
#   mutate(related_same_day_brand = future_map(related_same_day_brand, function(x){
#     jsonlite::fromJSON(x) %>%
#       as_tibble() %>%
#       gather()
#   }))
# 
# relatedBrand <- relatedBrand %>%
#   unnest(related_same_day_brand) %>%
#   rename(relatedBrand =key ,visitors= value)
# 
# st_write(relatedBrand,"data/output/relatedBrand.GeoJSON")
relatedBrand <- st_read("data/output/relatedBrand.GeoJSON",crs=crs)


# unnest the popularity_by_day data
# visitWeekday <-
#   PPRmoves %>%
#   select(placekey, popularity_by_day, date_range_start) %>%
#   mutate(date_range_start = as_date(date_range_start),
#          month =  month(date_range_start)) %>%
#   dplyr::select(-date_range_start) %>%
#   mutate(popularity_by_day = future_map(popularity_by_day, function(x){
#     jsonlite::fromJSON(x) %>%
#       as_tibble() %>%
#       gather()
#   }))%>%
#   unnest(popularity_by_day) %>%
#   rename(visitWeekday =key ,visits= value)
# 
# st_write(visitWeekday,"data/output/visitWeekday.GeoJSON")
visitWeekday <- st_read("data/output/visitWeekday.GeoJSON",crs=crs)

Because SafeGraph data is nested within the dataframe, we have to unnest it into separate dataframes for further analyses. Through the above data wrangling, we obtain the number of visits to these facilities by day, by hour, and by weekday, the census block groups of visitors' homes, the census block groups of visitors' daytime locations, the distribution of visit dwell times, and so on.
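
As a minimal illustration of the unnesting pattern used above (toy values, not project data), a JSON-encoded column can be expanded into long format like so:

library(dplyr)
library(tidyr)

# toy nested column: home census block group -> visitor count, as a JSON string
toy <- tibble(placekey = "zzz-example",
              visitor_home_cbgs = '{"421010177011":25,"421010005001":4}')

toy %>%
  mutate(visitor_home_cbgs = lapply(visitor_home_cbgs, function(x){
    jsonlite::fromJSON(x) %>% as_tibble() %>% gather()
  })) %>%
  unnest(visitor_home_cbgs) %>%            # one row per origin CBG
  rename(origin = key, visitors = value)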

Notably, "visits" here refers to visits recorded in SafeGraph's mobile device panel rather than true visits to these facilities, since the data have been filtered and privacy-screened. Even though the sample may be biased in its representativeness, we believe the data can still give us insight into the pattern of actual visits. However, we apply several rectifications below to make the data more reliable.

3.1 Preprocessing - Bias Rectification

Sometimes SafeGraph counts "residents" or "passers-by" as "visitors", which introduces large biases and can mislead the analysis. To identify the typical types of location bias in the SafeGraph data, we choose midnight visits and dwell time as features for a cluster analysis.

# calculate midnight visits (hours 1 am through 5 am)
visitHourNight <- visitHour %>% 
  filter(hour >= 1 & hour <= 5) %>%   # '&', not '|': '|' would keep every hour
  group_by(placekey) %>% 
  summarize(nightVisit = sum(visit))


# integrate sg data by midnight visit and dwell time
visitIntegrated <- full_join(visitHourNight %>% 
                               st_drop_geometry(), 
                             dwellTime %>% 
                               group_by(placekey, dwellTimes) %>% 
                               summarize(visitors = sum(visitors)),
                             by=c('placekey'))

# wide format
visitIntegrated <- visitIntegrated %>% 
  spread(key = dwellTimes, value = visitors)

#save as geojson for app
# st_write(visitIntegrated, "data/output/visitIntegrated.GeoJSON")
# import
visitIntegrated <- st_read("data/output/visitIntegrated.GeoJSON") %>% 
  rename(c("Dwell<5"="X.5","Dwell>240"="X.240","Dwell11-20"="X11.20","Dwell121-240"="X121.240","Dwell21-60"="X21.60","Dwell5-10"="X5.10","Dwell61-120"="X61.120" ))

# reorder dwell-time columns from shortest to longest bucket
visitIntegrated <- visitIntegrated[,c(1,2,3,8,5,7,9,6,4)]

# normalize
visitIntegrated.scale <- scale(visitIntegrated %>% 
                                 dplyr::select(-placekey) %>% 
                                 st_drop_geometry())

set.seed(12)
# decide cluster number (only run once)

#nc <- NbClust(visitIntegrated.scale, min.nc=2, max.nc=15, method="kmeans", index="all")
#table(nc$Best.n[1,])
# 
# barplot(table(nc$Best.n[1,]),
#         xlab="Numer of Clusters", ylab="Number of Criteria",
#         main="Number of Clusters Chosen by 26 Criteria")


# Run K-Means cluster
cluster1 <- kmeans(visitIntegrated.scale, 3) 
#summary(cluster1)

# add cluster number back
visitIntegrated <- visitIntegrated %>% 
  mutate(cluster = cluster1$cluster)

# mean by cluster
cluster1_mean <- aggregate(visitIntegrated %>% 
                             dplyr::select(-placekey, -cluster) %>% 
                             st_drop_geometry(),
                       by=list(cluster=visitIntegrated$cluster),
                       FUN=mean) %>% 
  left_join(visitIntegrated %>% 
              st_drop_geometry() %>% 
              group_by(cluster) %>% 
              summarize(size = n()),
            by="cluster")
Table 1. Mean values of clusters for SafeGraph data

cluster   nightVisit      Dwell<5    Dwell5-10   Dwell11-20   Dwell21-60   Dwell61-120   Dwell121-240    Dwell>240   size
1           199231.2   1761.50000   15820.0000    9939.7500    16928.500    11083.0000      8557.7500      7321.75      4
2          2589580.1    340.57143    3161.0000    4029.0714    26152.429    48977.9286     64342.5714    172938.86     14
3            29297.5     72.22395     696.6275     455.0355     1043.796      908.0621       837.4013      1666.14    451
## put histograms on the diagonal
panel.hist <- function(x, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(usr[1:2], 0, 1.5) )
    h <- hist(x, plot = FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, ...)
}

#Color points by groups
my_cols <- c(palette5[5],palette5[4],palette5[1])

pairs(visitIntegrated %>% 
        st_drop_geometry() %>% 
        mutate(`Dwell<10`=`Dwell<5`+`Dwell5-10`,
               `Dwell11-120` = `Dwell11-20` + `Dwell21-60` + `Dwell61-120`,
               `Dwell>120` =  `Dwell121-240` + `Dwell>240`) %>% 
        dplyr::select(nightVisit, `Dwell<10`, `Dwell11-120`, `Dwell>120`), 
      pch = 19,  cex = 0.5, cex.labels=1, diag.panel = panel.hist,
      col = my_cols[visitIntegrated$cluster],
      lower.panel=NULL, panel = panel.smooth)
Figure 3.1.1 Correlation Matrix of SafeGraph Bias

From the cluster analysis, most locations (451 of 469) fall in cluster 3, whose data look reliable. According to Figure 3.1.1, cluster 2 includes 14 places with disproportionately many night visits and long visits of more than two hours, while cluster 1 consists of 4 locations with many transitory visits of less than 10 minutes.

# cluster info
cluster <- visitIntegrated %>% 
              st_drop_geometry() %>%
              dplyr::select(placekey, cluster)

# calculate the ratio to trim "Dwell5-10" for cluster 1
Sum <- visitIntegrated %>% 
  st_drop_geometry() %>% 
  mutate(Total = `Dwell<5`+`Dwell5-10`+`Dwell11-20`+`Dwell21-60`+`Dwell61-120`+`Dwell121-240`+`Dwell>240`) %>% 
  group_by(cluster) %>% 
  summarize(`Dwell5-10` = sum(`Dwell5-10`),
            `Dwell<5` = sum(`Dwell<5`),
            Total = sum(Total))

ratio5To10 <- Sum[3,]$`Dwell5-10`/Sum[3,]$Total

group1.dwell <- dwellTime %>% 
  st_drop_geometry() %>% 
  spread(key=dwellTimes, value = visitors) %>% 
  left_join(visitIntegrated %>% dplyr::select(placekey, cluster) %>% st_drop_geometry(), by="placekey") %>% 
  filter(cluster==1) %>% 
  mutate(Total = `<5`+`5-10`+`11-20`+`21-60`+`61-120`+`121-240`+`>240`,
         `rectified5-10`=ifelse((`5-10`-ratio5To10*Total/(ratio5To10-1)/`5-10`)<=`5-10`,
                                `5-10`-ratio5To10*Total/(ratio5To10-1)/`5-10`,
                                `5-10`),
         `rectified5-10rate`=`rectified5-10`/Total,
         `<5rate`=`<5`/Total)

group23.dwell <- dwellTime %>% 
  st_drop_geometry() %>% 
  spread(key=dwellTimes, value = visitors) %>% 
  mutate(Total = `<5`+`5-10`+`11-20`+`21-60`+`61-120`+`121-240`+`>240`,
         `<5rate`=`<5`/Total)

# calculate the ratio to trim "Dwell121-240" and "Dwell>240" for cluster 2
ratio2h <- visitHour %>% 
  st_drop_geometry() %>% 
  spread(key = hour, value = visit) %>% 
  left_join(visitIntegrated %>% 
              st_drop_geometry() %>%
              dplyr::select(placekey, cluster)) %>% 
  mutate(residents = (`1`+`2`+`3`+`4`+`5`)/5,
         hourlySum = (`1`+`2`+`3`+`4`+`5`+`6`+`7`+`8`+`9`+`10`+`11`+`12`+`13`+`14`+`15`+`16`+`17`+`18`+`19`+`20`+`21`+`22`+`23`+`24`),
         `ratio>2h` = 16.25*residents/hourlySum) %>% 
  dplyr::select(placekey, month, `ratio>2h`, cluster, residents)


# rectify 'visitHour'
visitHour <- visitHour %>% 
  left_join(ratio2h, by=c("placekey","month"))
  
visitHour <- visitHour %>% 
  filter(cluster==2) %>% 
  filter(hour<=5) %>% 
  mutate(visit = ifelse(visit-residents*0.9<0,
                        visit*0.05,
                        visit-residents*0.9)) %>%  
  rbind(
    visitHour %>% 
      filter(cluster==2) %>% 
      filter(hour>=6 & hour<=10) %>% 
      mutate(visit=ifelse(visit-residents*(0.9-0.115*(hour-5))<0,
                          visit*0.05,
                          visit-residents*(0.9-0.115*(hour-5))))
  ) %>% 
  rbind(
    visitHour %>% 
      filter(cluster==2) %>% 
      filter(hour>=11 & hour<=18) %>% 
      mutate(visit=ifelse(visit-residents*0.25*20/24<0,
                          visit*0.05,
                          visit-residents*0.25*20/24))
  ) %>% 
  rbind(
    visitHour %>% 
      filter(cluster==2) %>% 
      filter(hour>=19) %>% 
      mutate(visit=ifelse(visit-residents*(0.25*20/24+0.115*(hour-18))<0,
                          visit*0.05,
                          visit-residents*(0.25*20/24+0.115*(hour-18))))
  ) %>% 
  rbind(
    visitHour %>% 
      filter(cluster!=2)
  ) %>% 
  dplyr::select(-`ratio>2h`, -cluster,-residents)

# rectify 'dwellTime'
dwellTime <- dwellTime %>% 
  left_join(ratio2h, by=c("placekey","month"))

dwellTime <- dwellTime %>% 
  filter(cluster==2) %>% 
  filter(dwellTimes=="121-240" | dwellTimes==">240") %>% 
  mutate(visitors = visitors*(1-`ratio>2h`)) %>%  
  rbind(dwellTime %>% 
          filter(cluster==2) %>% 
          filter(dwellTimes!="121-240" & dwellTimes!=">240")) %>% 
  rbind(dwellTime %>% 
          filter(cluster==1) %>% 
          filter(dwellTimes=="5-10") %>% 
          left_join(group1.dwell %>% dplyr::select(placekey,month,`rectified5-10`),by=c("placekey","month")) %>% 
          mutate(visitors = `rectified5-10`) %>% 
          dplyr::select(-`rectified5-10`)) %>% 
  rbind(dwellTime %>% 
          filter(cluster==1) %>% 
          filter(dwellTimes!="5-10")) %>% 
  rbind(dwellTime %>% 
          filter(cluster==3)) %>% 
  dplyr::select(-`ratio>2h`, -cluster)

dwellTime <- dwellTime %>% 
  filter(dwellTimes!='<5')

# rectify 'visitCount'
visitCount <- visitCount %>% 
  left_join(ratio2h, by=c("placekey","month"))

visitCount <- visitCount %>% 
  filter(cluster==1) %>% 
  left_join(group1.dwell,by=c("placekey","month")) %>% 
  mutate(visits = visits*(1-`rectified5-10rate`-`<5rate`)) %>% 
  dplyr::select(placekey,visitDay,day,visits,month,geometry) %>% 
  rbind(visitCount %>% 
          filter(cluster==2)%>% 
          left_join(group23.dwell,by=c("placekey","month")) %>% 
          left_join(ratio2h, by=c("placekey","month")) %>% 
          mutate(visits = visits*(1-`<5rate`-`ratio>2h.y`)) %>% 
          dplyr::select(placekey,visitDay,day,visits,month,geometry)) %>% 
  rbind(visitCount %>% 
          filter(cluster==3) %>% 
          left_join(group23.dwell,by=c("placekey","month")) %>% 
          mutate(visits = visits*(1-`<5rate`)) %>% 
          dplyr::select(placekey,visitDay,day,visits,month,geometry)) %>% 
  mutate(visits = round(visits,0))

Based on the bias analysis above, we rectify the SafeGraph data depending on its cluster/type. First, all visit counts with dwell times under 5 minutes are removed. For cluster 1, visit counts with dwell times between 5 and 10 minutes are reduced toward the corresponding proportion in the normal cluster 3. For cluster 2, visits longer than 2 hours are trimmed according to the ratio implied by midnight visitors, who can be treated as "residents".
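
As a toy illustration of the cluster-2 rule (all numbers invented): a site whose midnight hours average 100 visits and whose hourly visits sum to 8,000 gets a resident ratio of about 0.20, so its long-visit counts are scaled down by roughly 20 percent.

# toy illustration of the cluster-2 trimming rule (numbers invented)
residents  <- 100    # mean visits during 1-5 am, treated as "residents"
hourlySum  <- 8000   # total hourly visits over the month
ratio_2h   <- 16.25 * residents / hourlySum  # resident share of >2h visits, ~0.203
longVisits <- 500                            # recorded visits with dwell > 2 hours
longVisits * (1 - ratio_2h)                  # rectified count, ~398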

3.2 Overall SafeGraph Patterns

Through the above analysis and rectification, we finally obtain more reliable SafeGraph data. In the following section, we draw insights from it in six aspects: visit counts in the device panel, origins of visits, departures of visits, weekday patterns of visits, dwell times of visits, and same-day related brands.

3.2.1 Visit Counts in Device Panel
sumVisit <- visitCount %>%
  dplyr::select(-visitDay,-day,-month) %>% 
  group_by(placekey) %>%
  summarise(visits=sum(visits))

ggplot(sumVisit)+
  geom_sf(data=pprServiceArea,color='black',size=0.5,fill= "transparent")+
  geom_sf(data=pprDistrict,color='black',size=1,fill= "transparent")+
  geom_sf(aes(size = visits),color = palette1_main,fill = palette1_main,alpha = 0.3) +
  scale_size_continuous(range = c(1, 3),name = "Visits")+
  mapTheme()+
  labs(title = "Map of Visit Counts in Device Panel")+
  theme(legend.position = "bottom",
        legend.key.width = unit(0.5, 'cm'),
        legend.key.height  = unit(0.2, 'cm'),
        legend.text = element_text( colour = "black", size = 8))
Figure 3.2.1 Map of Visit Counts in Device Panel

From Figure 3.2.1, we can see that people frequently visit facilities in Center City, where PPR facilities are dense; per-facility visit counts there are small because visitors are dispersed across many facilities. In contrast, the largest visit counts occur on the outskirts of Philadelphia, where facilities are fewer.

sumVisitServiceArea <- pprServiceArea %>%
  st_join(sumVisit,left =TRUE) %>% 
  drop_na(INFO) %>% 
  dplyr::select(INFO,visits,geometry) %>% 
  group_by(INFO) %>% 
  mutate(totalVisits=sum(visits)) %>% 
  dplyr::select(-visits) %>% 
  distinct() %>% 
  st_sf() %>% 
  st_transform(crs=crs)

ggplot(sumVisitServiceArea)+
  geom_sf(aes(fill = q5(totalVisits)),color = 'white',size=0.5) +
  scale_fill_manual(values = palette5,labels = qBr(sumVisitServiceArea,'totalVisits'),name = "Total Device Visits") +
  mapTheme()+
  labs(title = "Map of total device visits")+
  theme(legend.position = "bottom",
        legend.key.width = unit(0.5, 'cm'),
        legend.key.height  = unit(0.2, 'cm'),
        legend.text = element_text( colour = "black", size = 8))
Figure 3.2.2 Map of total device visits

In Figure 3.2.2, we aggregate the visits to the Service Area level. Overall, Service Areas in Center City, where facilities are dense, attract more visits.

3.2.2 Origins of Visits in Device Panel
CensusData <-
  get_acs(geography = "block group",
          variables = c("B01003_001E"),
          year=2019, state="PA", county="Philadelphia", geometry=T, output="wide") %>%
  st_transform(crs=4326) %>%
  dplyr::select(-NAME,-starts_with('B')) %>%
  st_centroid() %>%
  as.data.frame() %>%
  rename("originGeometry" = "geometry")
# 
# placekeyGeometry <-
#   originCount %>%
#   dplyr::select(placekey) %>%
#   group_by(placekey) %>% summarise()
# 
# orgCountPlot <- originCount %>%
#   st_drop_geometry() %>%
#   group_by(origin,placekey) %>%
#   summarise(visitors=sum(visitors))%>%
#   left_join(placekeyGeometry,by=c("placekey" = "placekey")) %>%
#   rename("parkGeometry" = "geometry")%>%
#   filter(startsWith(origin,"42101"))%>%
#   left_join(CensusData,by=c("origin" = "GEOID"))%>%
#   mutate(park.y=as.numeric(map(parkGeometry,function(x){return(x[2])})),
#          park.x=as.numeric(map(parkGeometry,function(x){return(x[1])})),
#          origin.y=as.numeric(map(originGeometry,function(x){return(x[2])})),
#          origin.x=as.numeric(map(originGeometry,function(x){return(x[1])})))
# 
# st_write(orgCountPlot,"data/output/orgCountPlot.GEOJSON")
orgCountPlot <- st_read("data/output/orgCountPlot.GEOJSON")

orgCountPlot2 <- orgCountPlot%>%
  filter(visitors>200)

orgCountPlotHF <- orgCountPlot%>%
  filter(visitors>5000)


ggplot(data = orgCountPlot2) + 
  geom_sf(data = pprServiceArea %>% st_transform(crs=4326),fill ="transparent", color="black",size=0.5)+
  geom_sf(data=pprDistrict %>% st_transform(crs=4326),fill ="transparent", color="black",size=1)+
  geom_sf(data=propertyWithPlaceKey %>% dplyr::select(OBJECTID) %>%distinct() %>%  st_transform(crs=crs), fill ='#C5C5C5', color='#C5C5C5',alpha=0.1,size=0)+
  geom_curve(aes(x = origin.x, y = origin.y,xend = park.x,yend = park.y,color = q5(visitors)),
             size = 0.5,
             curvature = -0.2, 
             alpha = 0.4)+
  scale_color_manual(values = palette5,
                     labels = qBr(orgCountPlot2,"visitors"),
                     name = "Device Visits (Quintile Breaks)") +
  labs(x="",y="",title="Flow map of parks and origins")+
  mapTheme()+
  theme(legend.position = "bottom",panel.spacing = unit(6, 'lines'),
        legend.key.width = unit(0.5, 'cm'),
        legend.key.height  = unit(0.2, 'cm'),
        legend.text = element_text( colour = "black", size = 8))
Figure 3.2.3 Flow map of parks and origins

ggplot(data = orgCountPlotHF) + 
  geom_sf(data = pprServiceArea %>% st_transform(crs=4326),fill ="transparent", color="black",size=0.5)+
  geom_sf(data=pprDistrict %>% st_transform(crs=4326),fill ="transparent", color="black",size=1)+
  geom_sf(data=propertyWithPlaceKey %>% dplyr::select(OBJECTID) %>%distinct() %>%  st_transform(crs=crs), fill ='#C5C5C5', color='#C5C5C5',alpha=0.3,size=0)+
  geom_curve(aes(x = origin.x,y = origin.y,xend = park.x,yend = park.y),
             # arrow = arrow(length = unit(0.01, "npc"), type="closed"),
             size = 0.75,color = palette1_main,curvature = -0.2, alpha = 0.7)+
  labs(x="",y="",title="Flow map of parks and origins - routes have more than 5000 device visits in 2021")+
  mapTheme()+
  theme(legend.position = "bottom",panel.spacing = unit(6, 'lines'),
        legend.key.width = unit(0.5, 'cm'),
        legend.key.height  = unit(0.2, 'cm'),
        legend.text = element_text( colour = "black", size = 8))
Figure 3.2.4 Flow map of parks and origins - routes with more than 5000 device visits in 2021

In Figure 3.2.3, we plot the home origins of visitors for each PPR site with more than 200 visits. In Figure 3.2.4, we further filter to routes with more than 5,000 device visits in 2021. An interesting result emerges: people from census block group 421010177011, located along Kensington Avenue, are more likely to visit PPR facilities than people from other census block groups.

One possible cause is that vulnerable groups have a stronger need for public resources, which may compensate for an unsatisfying living environment: at PPR facilities they can build social connections, take part in training programs, and even receive free food.

3.2.3 Departure Points of Device Panel Visits
# depaCountPlot <- departCount %>%
#   st_drop_geometry() %>%
#   group_by(departure,placekey) %>%
#   summarise(visitors=sum(visitors))%>%
#   left_join(placekeyGeometry,by=c("placekey" = "placekey")) %>%
#   rename("parkGeometry" = "geometry")%>%
#   filter(startsWith(departure,"42101"))%>%
#   left_join(CensusData,by=c("departure" = "GEOID"))%>%
#   mutate(park.y=as.numeric(map(parkGeometry,function(x){return(x[2])})),
#          park.x=as.numeric(map(parkGeometry,function(x){return(x[1])})),
#          departure.y=as.numeric(map(originGeometry,function(x){return(x[2])})),
#          departure.x=as.numeric(map(originGeometry,function(x){return(x[1])})))
# 
# st_write(depaCountPlot,"data/output/depaCountPlot.GEOJSON")
depaCountPlot <- st_read("data/output/depaCountPlot.GEOJSON")
normalizationQuotient = st_read("data/output/normalizationQuotient.csv")
normalizationQuotient$quotient = as.numeric(normalizationQuotient$quotient)
depaCountPlot <- 
  depaCountPlot %>% 
  left_join(normalizationQuotient %>% dplyr::select(GEOID, quotient),by=c("departure" = "GEOID")) %>% 
  mutate(visitors = visitors * quotient) %>% 
  dplyr::select(-quotient)

depaCountPlot2 <- depaCountPlot%>%
  filter(visitors>1000)

ggplot(data = depaCountPlot2) + 
  geom_sf(data = pprServiceArea %>% st_transform(crs=4326),fill ="transparent", color="black",size=0.5)+
  geom_sf(data=pprDistrict %>% st_transform(crs=4326),fill ="transparent", color="black",size=1)+
  geom_sf(data=propertyWithPlaceKey %>% dplyr::select(OBJECTID) %>%distinct() %>%  st_transform(crs=crs), fill ='#C5C5C5', color='#C5C5C5',alpha=0.1,size=0)+
  geom_curve(aes(x = departure.x, y = departure.y, xend = park.x, yend = park.y,color = q5(visitors)),
             size = 0.5,curvature = -0.2, alpha = 0.4)+
  scale_color_manual(values = palette5,
                     labels = qBr(depaCountPlot2,"visitors"),
                     name = "Device Visits (Quintile Breaks)") +
  labs(x="",y="",title="Flow map of parks and departure points")+
  mapTheme()+
  theme(legend.position = "bottom",panel.spacing = unit(6, 'lines'),
        legend.key.width = unit(0.5, 'cm'),
        legend.key.height  = unit(0.2, 'cm'),
        legend.text = element_text( colour = "black", size = 8))
Figure 3.2.5 Flow map of parks and departure points

depaCountPlotMostPopular <- depaCountPlot%>%
  st_drop_geometry() %>% 
  dplyr::select(placekey,visitors) %>% 
  group_by(placekey) %>% 
  summarise(totalvisits = sum(visitors)) %>% 
  filter(totalvisits==max(totalvisits))

depaCountPlotHF <- depaCountPlot%>%
  filter(placekey=="zzz-222@628-pj6-ndv" & visitors>=2000)
  # filter(placekey==depaCountPlotMostPopular$placekey[1] & visitors>=2000)

ggplot(data = depaCountPlotHF) + 
  geom_sf(data = pprServiceArea %>% st_transform(crs=4326),fill ="transparent", color="black",size=0.5)+
  geom_sf(data=pprDistrict %>% st_transform(crs=4326),fill ="transparent", color="black",size=1)+
  geom_sf(data=propertyWithPlaceKey %>% dplyr::select(OBJECTID) %>%distinct() %>%  st_transform(crs=crs), fill ='#C5C5C5', color='#C5C5C5',alpha=0.3,size=0)+
  geom_curve(aes(x = departure.x, y = departure.y, xend = park.x, yend = park.y,color = q5(visitors)),
             # arrow = arrow(length = unit(0.01, "npc"), type="closed"),
             size = 1,curvature = -0.2, alpha = 0.7)+
  scale_color_manual(values = palette5,
                     labels = qBr(depaCountPlotHF,"visitors"),
                     name = "Device Visits (Quintile Breaks)") +
  labs(x="",y="",title="Flow map of most popular parks and its departure")+
  mapTheme()+
  theme(legend.position = "bottom",panel.spacing = unit(6, 'lines'),
        legend.key.width = unit(0.5, 'cm'),
        legend.key.height  = unit(0.2, 'cm'),
        legend.text = element_text( colour = "black", size = 8))
Figure 3.2.6 Flow map of the most popular park and its departure points

Figures 3.2.5 and 3.2.6 map the departure block groups of visitors for each PPR site. From Figure 3.2.5, we can dive into any specific facility. For example, Figure 3.2.6 shows that visitors to the Charles Playground facility come mainly from three communities: A St, Chinatown, and Rittenhouse Square.

3.2.4 Weekday Pattern of Device Panel Visits
VisitWeekdaybyServiceArea <- pprServiceArea %>% 
  st_join(visitWeekday) %>% 
  st_drop_geometry() %>%  
  dplyr::select(-placekey,-month,-PPR_DIST,-Shape_Leng,-Shape_Area)%>% 
  group_by(NAME,INFO,visitWeekday) %>%
  summarise(totalVisit = sum(visits, na.rm = T)) %>% 
  ungroup()

VisitWeekdaybyServiceArea$visitWeekday <- factor(VisitWeekdaybyServiceArea$visitWeekday, levels= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))

VisitWeekdaybyServiceArea%>%
  ggplot(aes(visitWeekday,totalVisit)) +
  geom_bar(position ="dodge",fill = palette1_main,stat='identity') +
  labs(x="Weekday", y="Aggregated Visits in device panel") +
  facet_wrap(~NAME, ncol = 8, scales = "fixed")+
  plotTheme(5,5)+
  labs(title="Distribution of weekday pattern by service area level")+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text = element_text( size=6),
        axis.text = element_text( size=6),
        strip.text.x = element_text( size = 6),
        legend.text = element_text( colour = "black", size = 6))
Figure 3.2.7 Distribution of weekday pattern by service area level

In Figure 3.2.7, we group the data by day of week to examine the weekday distribution pattern. Setting aside differences in total visit counts, three types of distribution emerge: service areas with similar device panel visit counts on every day of the week, service areas whose peak falls on the weekend, and service areas whose peak falls on weekdays.

3.2.5 Hour Pattern of Device Panel Visits
VisitHourbyServiceArea <- pprServiceArea %>% 
  st_join(visitHour) %>% 
  st_drop_geometry() %>%  
  dplyr::select(-placekey,-month,-PPR_DIST,-Shape_Leng,-Shape_Area)%>% 
  group_by(NAME,INFO,hour) %>%
  summarise(totalVisit = sum(visit, na.rm = T)) %>% 
  ungroup()

VisitHourbyServiceArea%>%
  ggplot(aes(hour,totalVisit)) +
  geom_bar(position ="dodge",fill = palette1_main,stat='identity') +
  labs(x="hour", y="Aggregated Visits in device panel",title="Distribution of hour pattern by service area level") +
  facet_wrap(~NAME, ncol = 8, scales = "fixed")+
  plotTheme(5,5)+
  theme(axis.text = element_text( size=6),strip.text = element_text( size=8),strip.text.x = element_text( size = 6))
Figure 3.2.8 Distribution of hour pattern by service area level

After the correction described above, we finally obtain an hourly pattern that matches common sense. In Figure 3.2.8, we group the data by hour to examine the hourly distribution pattern. Setting aside differences in total visit counts, there are two types of distribution: service areas with no obvious hourly pattern during the day, and service areas with an obvious peak around midday.

3.2.6 Dwelling Time of Device Panel Visits
dwellTimeForPlot <- dwellTime %>% 
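  # map the dwell-time buckets (minutes) to approximate midpoints so a weighted average can be computed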
  mutate(dwellTimes = recode(dwellTimes,
                             "<5" = 2.5,
                             "5-10" = 7.5,
                             "11-20" = 15,
                             "21-60" = 40,
                             "61-120" = 90,
                             "121-240" = 180,
                             ">240" = 240)) %>%
  mutate(sepTotalDwellTime = (visitors*dwellTimes)) %>% 
  group_by(placekey) %>% 
  mutate(totalVisitors=sum(visitors) )%>%
  filter(totalVisitors>50) %>% 
  mutate(avgDwellTime=sum(sepTotalDwellTime)/totalVisitors) %>% 
  dplyr::select(placekey,avgDwellTime) %>% 
  distinct()

ggplot(dwellTimeForPlot)+
  geom_sf(data=pprServiceArea,color='black',size=0.25,fill= "transparent")+
  geom_sf(data=pprDistrict,color='black',size=0.75,fill='transparent')+
  geom_sf(aes(size = avgDwellTime,color = avgDwellTime),alpha = 0.5) +
  scale_size_continuous(range = c(0,2),name = "avgDwellTime")+
  scale_color_continuous(low = '#FFDEDB',high ='#FF2903',
                     name = "avgDwellTime") +
  labs(title="Map of dwelling time")+
  mapTheme()+
  theme(legend.position = "bottom",
        legend.key.width = unit(0.5, 'cm'),
        legend.key.height  = unit(0.2, 'cm'),
        legend.text = element_text( colour = "black", size = 8))
Figure 3.2.9 Map of dwelling time

In Figure 3.2.9, we calculate the average dwelling time for each facility. From the map, we can see that the longest dwelling time occurs at the placekey (), the Clemente Park Playground.

dwellTImePlotServiceArea <- pprServiceArea %>%
  st_join(dwellTimeForPlot,left =TRUE) %>% 
  drop_na(INFO) %>% 
  dplyr::select(INFO,avgDwellTime,geometry) %>% 
  group_by(INFO) %>% 
  mutate(avgDwellTime=mean(avgDwellTime)) %>% 
  distinct() %>% 
  st_sf() %>% 
  st_transform(crs=crs)

ggplot(dwellTImePlotServiceArea)+
  geom_sf(aes(fill = q5(avgDwellTime)),color = 'white',size=0.5) +
  scale_fill_manual(values = palette5,labels = qBr(dwellTImePlotServiceArea,'avgDwellTime'),name = "Average Device Dwelling Time") +
  labs(title="Map of average dwelling time")+
  mapTheme()+
  theme(legend.position = "bottom",
        legend.key.width = unit(0.5, 'cm'),
        legend.key.height  = unit(0.2, 'cm'),
        legend.text = element_text( colour = "black", size = 8))
Figure 3.2.10 Map of average dwelling time

If we aggregate to the service area level, individual facilities’ characteristics are smoothed out. We can see that long dwelling times are more likely to occur in residential regions with higher population densities, where facilities are more likely to be visited incidentally.

4 Demand & Supply Analysis - Facility Classification

With SafeGraph’s Patterns dataset, we find that device panel visit counts are inconsistent with scheduled programs and events at some facilities.

4.1 Preprocessing - Rescale SafeGraph Visits

The number of devices residing in the SafeGraph data is a sample of the device users tracked by SafeGraph, so it cannot be treated as the full count of visitors traveling from a census block group (CBG) to a given park facility. Following SafeGraph’s guidelines for normalizing Patterns data, we rescale the device counts by a quotient: the total CBG population divided by the average number of devices residing in that CBG. The result estimates the total number of visitors from a CBG to a given facility. Since PPR does not schedule programs at some facilities, we exclude the facilities on the exception list (168 out of 524) from further analysis.
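
In equation form, the rescaled visitor estimate for a route from CBG \(i\) to facility \(j\) is:

\[ \hat{v}_{ij} = v_{ij} \times \frac{pop_i}{\overline{dev}_i} \]

where \(v_{ij}\) is the raw device visit count, \(pop_i\) is the CBG’s ACS population, and \(\overline{dev}_i\) is the average number of devices residing in CBG \(i\) across the monthly home panel summaries.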

excepListog <- st_read("data/FromPPR/exceptionListPPR_Properties.csv")
excepList <- excepListog %>% filter(Exclude=="Yes")
excepList$OBJECTID <- as.numeric(excepList$OBJECTID)
excepList <- excepList %>%
  dplyr::select(OBJECTID) %>% 
  left_join(propertyWithPlaceKey %>% st_drop_geometry() %>%  dplyr::select(OBJECTID,placekey),by="OBJECTID") %>% 
  drop_na()
monthList = c("01","02","03","04","05","06","07","08","09","10","11")

# device panel data from PPR
homePanelAllMonth = data.frame()
for (i in monthList){
  currentMonth = vroom(paste("data/safegraph/SafeGraph Data Purchase Dec-16-2021/Philadelphia-Camden-WilmingtonPA-NJ-DE-MDMSA-PATTERNS-2021_",
       i,
       "-2021-12-17/home_panel_summary.csv",sep = ""))%>%
    filter(region=="pa")%>%
    filter(census_block_group %in% CensusData$GEOID)
  homePanelAllMonth = rbind(homePanelAllMonth,currentMonth)
}

homePanelAllMonth <- homePanelAllMonth %>% dplyr::select(month,census_block_group,number_devices_residing)

# population data from acs
censusPopData <-
  get_acs(geography = "block group",
          variables = c("B01003_001E"),
          year=2019, state="PA", county="Philadelphia", geometry=T, output="wide") %>%
  st_transform(crs=4326) %>%
  dplyr::select(-NAME,-B01003_001M) %>%
  st_drop_geometry() %>%
  as.data.frame() %>%
  rename("population" = "B01003_001E")

#st_write(censusPopData, "data/output/dashboard/population-CensusBlockGroup.csv")

# get the quotient and apply
# note: some CBGs have no population, so their quotient is NA and removed by the filter below
homePanelToGo <- 
  homePanelAllMonth %>% 
  rename("GEOID"="census_block_group") %>% 
  group_by(GEOID) %>% 
  summarise(avgDeviceResiding = mean(number_devices_residing)) %>% 
  left_join(censusPopData, by="GEOID") %>% 
  mutate(quotient = population / avgDeviceResiding) %>% 
  filter(quotient>1)

originCount_11m <- originCount %>% 
  st_drop_geometry() %>% 
  left_join(excepList,by="placekey")  %>% 
  filter(is.na(OBJECTID)) %>% 
  dplyr::select(-OBJECTID) %>% 
  group_by(placekey, origin) %>% 
  summarise(visitors = sum(visitors)) %>% 
  ungroup() %>% 
  filter(origin %in% CensusData$GEOID) %>% 
  inner_join(homePanelToGo %>% dplyr::select(GEOID,quotient),by=c("origin"="GEOID")) %>% 
  mutate(visitors =visitors*quotient) %>% 
  dplyr::select(-quotient)

originCount_seperate_month <- originCount %>% 
  st_drop_geometry() %>% 
  left_join(excepList,by="placekey") %>% 
  filter(is.na(OBJECTID)) %>% 
  dplyr::select(-OBJECTID) %>% 
  group_by(placekey, origin, month) %>% 
  summarise(visitors = sum(visitors)) %>% 
  ungroup() %>% 
  filter(origin %in% CensusData$GEOID) %>% 
  inner_join(homePanelToGo %>% dplyr::select(GEOID,quotient),by=c("origin"="GEOID")) %>% 
  mutate(visitors =visitors*quotient) %>% 
  dplyr::select(-quotient)

modelData <- 
  left_join(originCount_11m, propertyWithPlaceKey %>% 
              st_drop_geometry() %>% 
              filter(NESTED == "N") %>%
              dplyr::select(placekey, OBJECTID, ACREAGE),by="placekey") %>% 
  rename(c("area" = "ACREAGE")) %>% 
  group_by(OBJECTID, origin) %>% 
  summarise(visitors = sum(visitors),
            area = max(area))

4.2 Imbalance between Mobility Data and Program Usage

To help allocate programs and resources, we divide all locations into three categories based on their visit data and program data. (The number of facilities with both scheduled programs and SafeGraph data available is limited, only 14, so we run the analyses below on facility~month observations with programs.)

# calculate mean dwell time
dwellMean <- dwellTime %>% 
  st_drop_geometry() %>% 
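  # note: the "<5" bucket is not recoded, so facility~months containing it become NA here and are dropped by na.omit() below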
  mutate(dwell = case_when(dwellTimes=="5-10" ~ 7.5,
                           dwellTimes=="11-20" ~ 15.5,
                           dwellTimes=="21-60" ~ 40.5,
                           dwellTimes=="61-120" ~ 90.5,
                           dwellTimes=="121-240" ~ 180.5,
                           dwellTimes==">240" ~ 270),
         dwellMuti = dwell*visitors) %>% 
  group_by(placekey, month) %>% 
  summarize(dwellMuti = sum(dwellMuti),
            dvisitors = sum(visitors)) %>% 
  ungroup() 


# integrate dwell, total counts in PPR property unit
sgIntegrated_byMonth <- full_join(dwellMean, 
                                  originCount_seperate_month %>% 
                                    group_by(placekey, month) %>% 
                                    summarize(visitors = sum(visitors)) %>% ungroup(),
                                  by=c('placekey', 'month')) %>% 
  na.omit() %>% 
  left_join(propertyWithPlaceKey %>% 
              dplyr::select(OBJECTID, placekey, PPR_DISTRICT, PUBLIC_NAME), by=c("placekey")) %>% 
  group_by(OBJECTID, month, PPR_DISTRICT, PUBLIC_NAME) %>% 
  summarise(dwellMuti = sum(dwellMuti),
            dvisitors = sum(dvisitors),
            visitors = sum(visitors)) %>% 
  ungroup() %>% 
  mutate(meanDwell = dwellMuti/dvisitors) %>% 
  dplyr::select(-dwellMuti, -dvisitors)
  

  # add program numbers per facility~month
sgIntegrated_byMonth <-  sgIntegrated_byMonth %>% 
  left_join(program2021.joinWithPlaceKey %>% 
              st_drop_geometry() %>% 
              mutate(month = month(AttendenceRealDate)) %>% 
              group_by(PPR_Properties_ObjectID, month,PPR_DISTRICT, PUBLIC_NAME) %>% 
              summarize(programs = n()) %>% ungroup(),
            by=c('OBJECTID'='PPR_Properties_ObjectID', 'month'='month',
                 'PPR_DISTRICT'='PPR_DISTRICT', 'PUBLIC_NAME'='PUBLIC_NAME')) 
sgProgramLists = sgIntegrated_byMonth %>% drop_na() %>% distinct(PUBLIC_NAME)
sgIntegrated_byMonth <- sgIntegrated_byMonth %>% filter(PUBLIC_NAME %in% sgProgramLists$PUBLIC_NAME)
sgIntegrated_byMonth$programs = sgIntegrated_byMonth$programs %>% replace_na(0) 

Potential facility~months: 12 out of 380 have a huge number of visitors who stay longer than normal but relatively few activity programs. There is great potential to develop activities at these sites, and PPR may schedule more programs in the future.

Normal facility~months: these facilities’ programs are consistent with their visits. This cluster contains 333 observations.

Over-programmed facility~months: 35 out of 380 have many programs but little foot traffic, suggesting over-programming; PPR may reduce the number of programs at these sites in the future.

# normalize
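# clustering features after dropping IDs and meanDwell: visitors and programs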
sgIntegrated_byMonth.scale <- scale(sgIntegrated_byMonth %>% 
                                      dplyr::select(-month, -OBJECTID, -PPR_DISTRICT, -PUBLIC_NAME,- meanDwell))

set.seed(114)

# decide cluster number (only run once)
# nc <- NbClust(sgIntegrated_byMonth.scale, min.nc=2, max.nc=15, method="kmeans", index="all")
# table(nc$Best.n[1,])
# 
# barplot(table(nc$Best.n[1,]),
#         xlab="Numer of Clusters", ylab="Number of Criteria",
#         main="Number of Clusters Chosen by 26 Criteria")


# Run K-Means cluster
cluster2 <- kmeans(sgIntegrated_byMonth.scale, 3) 
# summary(cluster2)

# add cluster number back
sgIntegrated_byMonth <- sgIntegrated_byMonth %>% 
  mutate(cluster = cluster2$cluster)

# mean by cluster
cluster2_mean <- aggregate(sgIntegrated_byMonth %>% 
                             dplyr::select(-cluster,-month,-PPR_DISTRICT,-OBJECTID, -PUBLIC_NAME,- meanDwell),
                       by=list(cluster=sgIntegrated_byMonth$cluster),
                       FUN=mean) %>% 
  left_join(sgIntegrated_byMonth %>% 
              group_by(cluster) %>% 
              summarize(size = n()),
            by="cluster")
kable(cluster2_mean %>% 
        # mutate(color=c("pink","orange","blue"),.before = 2) %>% 
        mutate(group=c("normal","potential","over-programmed"),.before = 2),
      align = 'c',caption = '<center>Table 4.1 Mean values of clusters for conflicts <center/>') %>%
  kable_classic(full_width = T)%>%
  kable_styling(position = "center")
Table 4.1 Mean values of clusters for conflicts

cluster   group             visitors    programs    size
1         normal            2647.048    3.936937    333
2         potential         28475.018   3.250000    12
3         over-programmed   3241.220    66.514286   35
ggplot() + 
  geom_sf(data=sgIntegrated_byMonth %>% 
            left_join(propertyWithPlaceKey,by=c("OBJECTID")) %>% st_sf() %>% st_centroid(),
          aes(color=as.factor(cluster)), size=1)+
  geom_sf(data=pprDistrict,color="black",size=0.7,fill = "transparent")+
  facet_wrap(~month, ncol = 3)+
  labs(title = "", 
       subtitle = "",
       x="",y="")+
  scale_color_manual(values = c(palette1_assist,palette1_main,palette5[5]),name = "Facility Clusters", 
                     label=c("Normal (visitors correspond to program numbers)",
                             "Potential (many visitors, few programs)",
                             "Over-programmed (few visitors, many programs)")) +
  labs(title="4.1.2 Map of Bias Type of SafeGraph Data")+
  mapTheme()+
  theme(legend.position = "bottom",
        legend.text = element_text( colour = "black", size = 8))
Figure 4.1.2 Map of Bias Type of SafeGraph Data

4.3 Over-programmed Case - Thomas B. Smith Recreation Center

Thomas B. Smith Recreation Center is a case of an over-programmed facility, with low demand but high activity supply (permits) in July, October, and November. Several sports courts, the Smith playground, and a spray park surround the recreation center. According to the program schedule, the activities held here are mostly athletic, educational, and after-school programs.

kable(sgIntegrated_byMonth %>% filter(PUBLIC_NAME=="Thomas B Smith Recreation Center" & cluster == 3) %>% dplyr::select(-OBJECTID, -PPR_DISTRICT,-PUBLIC_NAME,-meanDwell),
      align = 'c',caption = '<center>Table 4.2 Over-programmed months for Thomas B Smith Recreation Center <center/>') %>%
  kable_classic(full_width = T)%>%
  kable_styling(position = "center")
Table 4.2 Over-programmed months for Thomas B Smith Recreation Center

month   visitors   programs   cluster
7       5888.187   40         3
10      7149.213   82         3
11      5226.784   84         3

Much of the feedback in Google reviews says it is a good location for community residents to use, but that a three-year-old kid will get bored within 10 minutes because of the limited playground equipment. More equipment, renovations, and better management and maintenance may be needed to improve this facility.

Smith <- program2021.join %>% 
  filter(FacilityID == "{13600B8B-AB62-4A39-95CF-5F31C6942010}")


SmithSum <- Smith %>%
  mutate(totalCount = as.numeric(RegisteredIndividualsCount))%>%
  dplyr::select(totalCount,ActivityType, AttendanceWeekDate) %>% 
  drop_na() %>% 
  group_by(ActivityType, AttendanceWeekDate)%>%
  summarise(totalCount=mean(totalCount)) %>%
  ungroup() %>%
  group_by(ActivityType)%>%
  summarise(totalCount=sum(totalCount))

SmithSum %>%
  ggplot(aes(ActivityType, totalCount, fill=ActivityType)) +
  geom_bar(position ="dodge",stat="identity",fill=palette1_main) +
  labs(y = "Visitor Count", fill=palette1_main, title = "Smith Rec Center Visits in Program Level (data from PPR)")+
  plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom",
        legend.text = element_text( colour = "black", size = 8))
Figure 3.3.3 Smith Rec Center Visits in Program Level (data from PPR)

Records from PPR show that various types of programs are held at this site. A multi-purpose interior space with a computer lab, a kitchen, and several tables and seats supports after-school, cultural, and educational activities, and the surrounding sports courts (tennis courts, basketball courts, and a football field) host athletic activities.

sumVisit <- dwellTime %>% 
  filter(placekey=='zzw-222@628-pm7-rtv') %>% 
  dplyr::select(-month) %>% 
  group_by(dwellTimes) %>%
  summarise(visitors=sum(visitors)) %>% 
  st_drop_geometry() 

sumVisit$dwellTimes <- factor(sumVisit$dwellTimes, levels= c("<5","5-10","11-20","21-60","61-120","121-240",">240" ))

plot1 <- sumVisit%>%
  ggplot(aes(dwellTimes,visitors)) +
  geom_bar(position ="dodge",fill = palette1_main,stat='identity') +
  labs(x="Dwell Time", y="Aggregated Visitors",
       title ='') +
  plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.position = "bottom",
        legend.text = element_text( colour = "black", face = "italic", size = 8))

sumVisit <- visitHour %>% 
  filter(placekey=='zzw-222@628-pm7-rtv') %>% 
  dplyr::select(-month) %>% 
  group_by(hour) %>%
  summarise(visits=sum(visit)) %>% 
  st_drop_geometry()

plot2 <- sumVisit%>%
  ggplot(aes(hour,visits)) +
  geom_bar(position ="dodge",fill = palette1_main,stat='identity') +
  labs(x="hour", y="Aggregated Visits",
       title ='Dwelling time and time distribution of Smith Rec Center Activity (data from SafeGraph)') +
  plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.position = "bottom",
        legend.text = element_text( colour = "black", face = "italic", size = 8))

grid.arrange(plot1, plot2,ncol=2,top = "")
Figure 3.3.4 Dwelling time and time distribution of Smith Rec Center Activity (data from SafeGraph)

smithProgramDate <- 
  program2021.joinWithPlaceKey %>% 
  filter(placekey=="zzw-222@628-pm7-rtv") %>% 
  mutate(month = month(AttendenceRealDate))
visitCount %>% 
  filter(placekey=='zzw-222@628-pm7-rtv') %>% 
  st_drop_geometry()%>%
  na.omit()%>%
  ggplot(aes(visitDay,visits)) + 
  geom_line(color=palette1_main,size=0.75)+
  geom_vline(xintercept = smithProgramDate$AttendenceRealDate , colour=palette1_assist, size=0.35,linetype = "longdash")+
  geom_vline(xintercept = (smithProgramDate%>% filter(month %in% c(7,10,11)))$AttendenceRealDate , colour=palette1_assist, size=0.35)+
  labs(title = 'Distribution of Smith Rec Center visits in 2021',x="Visit Date",y="Safegraph Visit")+
  plotTheme()+
  theme(panel.border = element_rect(colour = "black", fill=NA, size=1))
Figure 3.3.5 Distribution of Smith Rec Center visits in 2021

Figure 3.3.5 shows the time trend of visits recorded by SafeGraph. The dashed orange lines mark the dates of programs, and the solid lines mark programs held in July, October, and November, which, according to the analysis above, are the over-programmed months for Thomas B. Smith Recreation Center.

4.4 Potential Case - Albert W. Christy, Sr. Recreation Center

Albert W. Christy, Sr. Recreation Center is a representative case of a potential facility, with a small number of programs but high visits and long dwelling times. It is a neighborhood park in District 8. Only 33 programs were scheduled there in the past year, yet SafeGraph recorded more than 1,000 device visits on a single day.

According to local news, the weekly program attracts an average of 50-100 teens and is contributing to the overall increase in recreational programming now available to older youth in the West Philadelphia neighborhood. PPR records, however, show that only limited types of programs were provided: athletic events and after-school activities.

Christy <- program2021.join %>% 
  filter(FacilityID == "{AC45E5EF-2467-4926-A1DA-07258305D871}")


ChristySum <- Christy %>%
  mutate(totalCount = as.numeric(RegisteredIndividualsCount))%>%
  dplyr::select(totalCount, ActivityType, AttendanceWeekDate) %>% 
  drop_na() %>% 
  group_by(ActivityType, AttendanceWeekDate)%>%
  summarise(totalCount=mean(totalCount)) %>% 
  ungroup() %>% 
  group_by(ActivityType)%>%
  summarise(totalCount=sum(totalCount))

ChristySum %>%
  ggplot(aes(ActivityType, totalCount, fill=ActivityType)) +
  geom_bar(position ="dodge",stat="identity",fill=palette1_main) +
  labs(y = "Visitor Count", fill=palette1_main, title = "Christy Rec Center Activity with Total")+
  plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom",
        legend.text = element_text( colour = "black", size = 8))
Figure 3.3.6 Christy Rec Center Activity with Total

Several visit peaks show up in mid-March, May, June, and August, but programs were scheduled in autumn. There is therefore considerable potential to optimize program schedules toward the seasons in which visits peak.

christyProgramDate <- 
  program2021.joinWithPlaceKey %>% 
  filter(placekey=="zzz-222@63s-dvq-5fz")
visitCount %>% 
  filter(placekey=='zzz-222@63s-dvq-5fz') %>% 
  st_drop_geometry()%>%
  na.omit()%>%
  ggplot(aes(visitDay,visits)) + 
  geom_line(color=palette1_main,size=0.75)+
  geom_vline(xintercept = christyProgramDate$AttendenceRealDate , colour=palette1_assist, size=0.5,linetype = "longdash")+
  labs(title = 'Distribution of Christy Recreation Center visits in 2021',x="Visit Date",y="Safegraph Visit")+
  plotTheme()+
  theme(panel.border = element_rect(colour = "black", fill=NA, size=1))
Figure 3.3.7 Distribution of Christy Recreation Center visits in 2021

A baseball field, a playground, and an outdoor basketball court surround the Christy Rec Center, and an indoor basketball court and a multi-purpose interior space are provided inside.

sumVisit <- visitHour %>% 
  filter(placekey=='zzz-222@63s-dvq-5fz') %>% 
  dplyr::select(-month) %>% 
  group_by(hour) %>%
  summarise(visits=sum(visit)) %>% 
  st_drop_geometry()

plot1 <- ggplot(data=sumVisit, aes(hour,visits)) +
  geom_bar(position ="dodge",fill = palette1_main,stat='identity') +
  labs(x="hour", y="Aggregated Visits",
       title ='') +
  plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.position = "bottom",
        legend.text = element_text( colour = "black", face = "italic", size = 8))
  

sumVisit3 <- dwellTime %>% 
  filter(placekey=='zzz-222@63s-dvq-5fz') %>% 
  dplyr::select(-month) %>% 
  group_by(dwellTimes) %>%
  summarise(visitors=sum(visitors)) %>% 
  st_drop_geometry() 

sumVisit3$dwellTimes <- factor(sumVisit3$dwellTimes, levels= c("<5","5-10","11-20","21-60","61-120","121-240",">240" ))

plot2 <- ggplot(data=sumVisit3, aes(dwellTimes,visitors)) +
  geom_bar(position ="dodge",fill = palette1_main,stat='identity') +
  labs(x="Dwell Time", y="Aggregated Visitors",
       title ='Temporal Visits Pattern by Hour & Visits Dwelling Time') +
  plotTheme()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
        legend.position = "bottom",
        legend.text = element_text( colour = "black", face = "italic", size = 8))

grid.arrange(plot2,plot1,ncol=2,top="")
Figure 3.3.8 Temporal Visits Pattern by Hour & Visits Dwelling Time

According to Figure 3.3.8, people usually dwell here for a long time (>120 min), and 11 am to 3 pm is the period with the most aggregated visits. Most programs, however, were scheduled in the evening, indicating insufficient supply to meet demand at peak visiting times.

The facility improvements in 2021 made the Christy Recreation Center a safe destination for West Philadelphia teens and young adults, and the community and local businesses have responded to violence by coming together with law enforcement and elected leaders to expand on- and off-court programs for teens and older youth. More basketball-related programs and more indoor activities might be suggested for future program schedules. To estimate the number of visitors and optimize the time schedule, we turn to the Huff model for insights.

5 Huff Model - Feature Engineering

The original Huff model (Huff, 1964) was designed to estimate the probability of customers at each origin patronizing a given store among all candidate destination stores. Two factors are taken into account: attractiveness and distance. Attractiveness can be computed as a function of many attributes of a store, including store size, number of parking spaces, customer reviews, etc.

We will use an extended version of the Huff model to predict the probability of visitors at each census block group going to a given park facility. Our Huff model adds a neighboring effect, based on the assumptions that:

  1. People tend to choose more attractive travel destinations;

  2. People tend to choose closer travel destinations;

  3. People tend to choose travel destinations with more beneficial future choices. For example, a tourist at Origin O has two identical destination choices A and B, with equal distance to origin O. In this case, destination B should be preferred since it has more future choices in its neighborhood compared with destination A.

Below is the equation for our Huff model.

\[ P_{ijt} = \frac{A_{jt}^{\alpha}\,D_{ij}^{\beta}\,C_{jt}^{\theta}}{\sum_{j=1}^{n}A_{jt}^{\alpha}\,D_{ij}^{\beta}\,C_{jt}^{\theta}} \] \(P_{ijt}\) represents the probability of a visitor at location i visiting attraction j at time t; \(A_{jt}\) is the attractiveness of attraction j at time t (we include several attractiveness measures in our model); \(D_{ij}\) is the distance between origin i and attraction j; \(C_{jt}\) is the term describing the neighboring effect of attraction j (we mainly consider convenience stores) relative to other attractions at time t; and n is the total number of attractions in the area. The parameters \(\alpha\), \(\beta\), and \(\theta\) weight the attractiveness, distance, and neighboring effect factors, respectively, and are fitted for each origin.
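
To make the equation concrete, here is a minimal sketch (ours, not the fitting routine in huffModelScripts.R used later) of how the probability vector for a single origin at one time period follows from the three factors; the function name and inputs are hypothetical:

# minimal sketch of the Huff probability for a single origin and time period;
# attr, dist, and centr are vectors over the n candidate facilities, and
# alpha, beta, theta are that origin's fitted parameters
huffProbability <- function(attr, dist, centr, alpha, beta, theta) {
  score <- attr^alpha * dist^beta * centr^theta
  score / sum(score) # probabilities over all facilities sum to 1
}

# e.g. two otherwise identical facilities, one twice as far away, with beta = -2:
# huffProbability(c(1, 1), c(1, 2), c(1, 1), alpha = 1, beta = -2, theta = 1)
# returns 0.8 and 0.2, i.e. the closer facility captures most of the visits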

5.1 Attractiveness

Corresponding to the term \(A_{jt}\) in the Huff model, we collected many facility features as attractiveness measures, including: area, facility number, facility area, trail miles, picnic table number, exercise equipment number, swimming pool number, spray ground number, hydration station number, tennis court number, playground number, tree height, tree area, program number and capacity, and permit number and capacity.

Figure 5.2.1 Summary of Attractiveness

In Figure 5.2.2, we plot the correlation between scaled SafeGraph visits and each attractiveness measure. Most of the correlations are moderate. In Section 6, we will apply Principal Component Analysis (PCA) to see if it can improve the prediction results.

Figure 5.2.2 Correlation of Attractiveness

5.2 Distance

OSRM (Open Source Routing Machine) is a high-performance C++ routing engine for computing shortest driving paths in a road network. Corresponding to the term \(D_{ij}\) in the Huff model, we use it to calculate the shortest driving distance from each origin CBG to each destination facility.
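
The cached code below queries osrmRoute() pair by pair against a local OSRM server. As a sketch of an alternative, and assuming the installed osrm version supports measure = "distance" in osrmTable(), the whole origin-destination matrix could be requested in far fewer calls:

# # a sketch, not the cached pipeline below; bgLng/bgLat and parkLng/parkLat are
# # built below and may need as.numeric() since they are parsed from text
# options(osrm.server = "http://127.0.0.1:5000/")
# distList <- osrmTable(src = CensusData2 %>% dplyr::select(GEOID, bgLng, bgLat),
#                       dst = modelPlacesFinal %>% dplyr::select(OBJECTID, parkLng, parkLat),
#                       measure = "distance")
# distList$distances # origins x destinations matrix of driving distances in meters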

# # parks
# modelPlaces<- as.character(attracGEO$geometry %>% st_centroid())
# 
# modelPlaces <- modelPlaces %>% str_replace("\\(","") %>%
#   str_replace("\\)","")%>%
#   str_replace("c","")
# 
# latParks = map(modelPlaces,function(x){
#   medium = str_split(x,",")
#   medium = medium[[1]][2]
#   return(medium)
#   })
# 
# lngParks = map(modelPlaces,function(x){
#   medium = str_split(x,",")
#   medium = medium[[1]][1]
#   return(medium)
# })
# 
# latParks = unlist(latParks)
# lngParks = unlist(lngParks)
# 
# modelPlacesFinal <- attracGEO %>%
#   mutate(parkLat = latParks)%>%
#   mutate(parkLng = lngParks) %>%
#   st_drop_geometry()
# 
# # census
# distCensusData<- as.character(CensusData$originGeometry)
# 
# distCensusData <- distCensusData %>% str_replace("\\(","") %>%
#   str_replace("\\)","")%>%
#   str_replace("c","")
# 
# latCensus = map(distCensusData,function(x){
#   medium = str_split(x,",")
#   medium = medium[[1]][2]
#   return(medium)
# })
# 
# lngCensus = map(distCensusData,function(x){
#   medium = str_split(x,",")
#   medium = medium[[1]][1]
#   return(medium)
# })
# 
# latCensus = unlist(latCensus)
# lngCensus = unlist(lngCensus)
# 
# CensusData2 <- CensusData %>%
#   mutate(bgLat = latCensus)%>%
#   mutate(bgLng = lngCensus)
# 
# #openstreetmap
# st_write(modelPlacesFinal,"data/output/modelPlacesFinal.csv")
# st_write(CensusData2,"data/output/CensusData2.csv")
modelPlacesFinal <- st_read("data/output/modelPlacesFinalNew.csv")
# modelPlacesFinal <- st_read("data/output/modelPlacesFinal.csv")
CensusData2 <- st_read("data/output/CensusData2.csv")
#
# #query the distance
# options(osrm.server = "http://127.0.0.1:5000/")
# options(digits=8)
# 
# theDistanceMatrix=list()
# for(i in 1:nrow(CensusData2)){
#   src = c(as.numeric(CensusData2["bgLng"][i,]),as.numeric(CensusData2["bgLat"][i,]))
#   theVector = c()
#   for (j in 1:nrow(modelPlacesFinal)){
#     dst = c(as.numeric(modelPlacesFinal["parkLng"][j,]),as.numeric(modelPlacesFinal["parkLat"][j,]))
#     medium = osrmRoute(src,dst,overview = "FALSE")
#     theVector = append(theVector,medium[2])
#   }
#   theDistanceMatrix= append(theDistanceMatrix,list(theVector))
#   print(paste0("finish ...",i,"/",nrow(CensusData2)))
# }
# 
# theDistanceMatrix2 <- theDistanceMatrix
# daFrame <- as.data.frame(matrix(unlist(theDistanceMatrix2), nrow=length(theDistanceMatrix2), byrow=TRUE))
# st_write(daFrame,"data/output/theDistanceMatrixWhole2.csv")
theDistanceMatrixWhole <- st_read("data/output/theDistanceMatrixWhole2.csv")
theDistanceMatrixWhole <- setNames(theDistanceMatrixWhole, 
                                   modelPlacesFinal$OBJECTID)
theDistanceMatrixWhole <- 
  merge(theDistanceMatrixWhole, CensusData2 %>% dplyr::select(GEOID), by = "row.names") %>%
  dplyr::select(-Row.names)
theDistanceMatrixWhole2 <- 
  theDistanceMatrixWhole %>%
  pivot_longer(!GEOID, names_to = "OBJECTID", values_to = "distance")

# in PPR OBJECTID unit
theDistanceMatrixWhole3 <- theDistanceMatrixWhole2 %>%
  right_join(attracGEO %>%
               dplyr::select(OBJECTID) %>% 
               st_drop_geometry() %>% 
               mutate(OBJECTID = as.character(OBJECTID)),
             by="OBJECTID") %>%
  drop_na() %>% # delete this step in the future
  mutate(distance = as.numeric(distance)) %>%
  group_by(GEOID, OBJECTID) %>%
  summarise(distance = mean(distance)) %>%
  ungroup()

5.3 Centrality

Centrality describes the neighboring effect of an attraction, corresponding to the term \(C_{jt}\) in the Huff model. We ranked the POIs visited on the same day as parks by their NAICS codes and found that the top category is convenience stores (445120), so we use convenience stores as the centrality factor affecting visitors’ choice of park facility.

# find the naics code that has the most same-day visits

# unnest the related_same_day_brand
# relatedBrand <-
#   PPRmoves %>%
#   select(placekey, related_same_day_brand, date_range_start) %>%
#   mutate(date_range_start = as_date(date_range_start),
#          month =  month(date_range_start)) %>%
#   dplyr::select(-date_range_start) %>%
#   mutate(related_same_day_brand = future_map(related_same_day_brand, function(x){
#     jsonlite::fromJSON(x) %>%
#       as_tibble() %>%
#       gather()
#   }))
# 
# relatedBrand <- relatedBrand %>%
#   unnest(related_same_day_brand) %>%
#   rename(relatedBrand =key ,visitors= value)
# 
# st_write(relatedBrand,"data/output/relatedBrand.GeoJSON")

# relatedBrand <- st_read("data/output/relatedBrand.GeoJSON",crs=crs)
# relatedBrand <- relatedBrand %>% 
#   dplyr::select(-placekey,-month) %>% 
#   st_drop_geometry() %>% 
#   group_by(relatedBrand) %>% 
#   summarize(totalVisitors = sum(visitors))
# 
# # connect this dataframe with naics code
# core_poi <- vroom("data/safegraph/Philadelphia-Camden-WilmingtonPA-NJ-DE-MDMSA-CORE_POI-2021_11-2021-12-17/core_poi.csv")
# relatedNaics <- core_poi %>% dplyr::select(brands,naics_code) %>% distinct() %>% drop_na()
# 
# relatedBrandWithNaics <- 
#   relatedBrand %>% 
#   left_join(relatedNaics,by=c("relatedBrand" = "brands")) %>% 
#   drop_na() %>% 
#   group_by(naics_code) %>% 
#   summarise(totalVisitors = sum(totalVisitors)) %>% 
#   filter(totalVisitors>365)
# 
# relatedBrandWithNaics$naics_code <- as.factor(relatedBrandWithNaics$naics_code)
# st_write(relatedBrandWithNaics,"data/output/relatedBrandWithNaics.csv")
relatedBrandWithNaics <- st_read("data/output/relatedBrandWithNaics.csv")
relatedBrandWithNaics$totalVisitors <- as.numeric(relatedBrandWithNaics$totalVisitors)
relatedBrandWithNaics%>%
  arrange(-totalVisitors) %>% 
  mutate(naics_code = factor(naics_code, naics_code)) %>%
  ggplot(aes(x=naics_code, y=totalVisitors)) +
  geom_bar(position ="dodge",fill = palette1_main,stat='identity') +
  labs(x="Naics Code", y="Total Visitors", title="Total Vistors of Related Brand With Naics",size = 10) +
  plotTheme(5,5)+
  theme(
    axis.text = element_text( size=5,angle = 90),
    strip.text = element_text( size=5),
    strip.text.x = element_text( size = 10))
Figure 5.5.1 Total Visitors of Related Brands with NAICS

In Figure 5.5.2, we plot the locations of parks and their surrounding convenience stores. Parks are shown as grey shapes and convenience stores as orange dots.

# use the same naics code to get the safegraph poi data to construct naics code's location
# 445120 - Convenience Stores

# use number of visits as their attractiveness
pprRelatedNaicsMoves <- function(pprRelatedNaics, safeGraph.geo) {
  safeGraph.geo$raw_visitor_counts = as.numeric(safeGraph.geo$raw_visitor_counts)
  moves <- safeGraph.geo %>% 
    filter(naics_code==pprRelatedNaics) %>% 
    dplyr::select(placekey, location_name, raw_visitor_counts) %>% 
    group_by(placekey) %>% 
    summarise(visits = sum(raw_visitor_counts))
  return(moves)
}

# centrality with convenience store variable
pprRelatedNaics = 445120
convenientSurroundEffect <- pprRelatedNaicsMoves(pprRelatedNaics, safeGraph.geo)

# centrality with Park variable
parksSurroundEffect <- attracGEO

# visualized
# rbind(restSurroundEffect %>% mutate(label="restaurant") %>% dplyr::select(-visits),
#       convenientSurroundEffect %>% mutate(label="convinient store")%>% dplyr::select(-visits),
#       parksSurroundEffect %>% mutate(label="park")) %>% 
#   ggplot() +
#   geom_sf(data=pprServiceArea,color='black',size=0.35,fill= "transparent")+
#   geom_sf(data=pprDistrict,color='black',size=0.5,fill= "transparent")+
#   geom_sf(color = palette1_main,fill = palette1_main,alpha = 0.3) +
#   facet_wrap(~label)+
#   mapTheme()+
#   theme(legend.position = "bottom",
#         legend.key.width = unit(0.5, 'cm'),
#         legend.key.height  = unit(0.2, 'cm'))

convenientSurroundDot <- convenientSurroundEffect %>% mutate(label="convenience store")%>% dplyr::select(-visits)

parksSurroundDot <- parksSurroundEffect %>% mutate(label="park")

# visualized three types of objects in one map
ggplot() +
  geom_sf(data=pprServiceArea,color='black',size=0.35,fill= "transparent")+
  geom_sf(data=pprDistrict,color='black',size=0.5,fill= "transparent")+
  geom_sf(data=parksSurroundDot,aes(geometry=geometry),color = "black",fill = "grey",alpha = 1, size = 0.35) +
  geom_sf(data=convenientSurroundDot,aes(geometry=geometry),color = palette2[1],fill = palette2[1],alpha = 1,size = 1.5) +
  labs(title="Overlay Map of Parks(black), Convinient Stores(orange)")+
  mapTheme()
Figure 5.5.2 Overlay Map of Parks (grey) and Convenience Stores (orange)

5.4 Probability

The observed probability that visitors from a census block group (CBG) choose a given facility as their destination is calculated by dividing the normalized number of visitors on that route by the total normalized visits originating from that CBG.
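
In symbols, with \(\hat{v}_{ij}\) the normalized number of visitors from CBG \(i\) to facility \(j\):

\[ p_{ij} = \frac{\hat{v}_{ij}}{\sum_{j=1}^{n} \hat{v}_{ij}} \]

This observed probability is the quantity the Huff model parameters are fitted against.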

# calculate probability 
prob <- modelData %>% 
  group_by(origin) %>% 
  summarise(total = sum(visitors)) %>% 
  left_join(modelData, by="origin") %>% 
  mutate(probability = visitors/total) %>% 
  dplyr::select(OBJECTID, origin, probability)

prob.na <- prob %>% 
  filter(is.na(prob$probability))

CensusPoly <-
  get_acs(geography = "block group",
          variables = c("B01003_001E"),
          year=2019, state="PA", county="Philadelphia", geometry=T, output="wide") %>%
  st_transform(crs=4326) %>%
  dplyr::select(-NAME,-starts_with('B'))

6 Huff Model - Preprocessing

Because of the limitations of the SafeGraph data, there is no record for some origin~destination combinations, so the probability for those routes is missing. How to handle these missing values is a tricky question.

Also because of the limited number of data points, we need PCA to summarize the predictors. We prepare four data sets to test the different handling methods, listed below.

  1. Replace NA probability values with 4.4e-5 and use PCA components from the large group of attractiveness measures as predictors;

  2. Drop NA probability values and use PCA components from the large group of attractiveness measures as predictors;

  3. Drop NA probability values and use PCA components from the small group of attractiveness measures as predictors;

  4. Drop NA probability values and use all features as predictors, with no PCA.

6.1 Dimensionality - PCA

We will use two methods to do feature extraction.

  1. Large group of attractiveness: all 17 attractiveness measures are included in the PCA, and 3 principal components are extracted.
Figure 6.2.3 Principal Components of Large Attractiveness

  2. Small group of attractiveness: a subset of important measures (area, trail miles, swimmingPoolNum, spraygroundNum, tennisCourtNum, treeHeight, treeArea, programNum, and permitNum) is included in the PCA, and 2 principal components are extracted.
Figure 6.2.4 Principal Components of Small Attractiveness
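
The pca and pcaSmall objects used below are computed in an earlier chunk; a minimal sketch of how they could be produced with prcomp(), assuming attracMatrix and attracMatrixSmall hold the large and small attractiveness matrices (both names hypothetical):

# pca      <- prcomp(attracMatrix,      center = TRUE, scale. = TRUE)
# pcaSmall <- prcomp(attracMatrixSmall, center = TRUE, scale. = TRUE)
# summary(pca) # variance explained per component, used to keep 3 (large) / 2 (small) PCs
# # pca$x holds the component scores per facility, consumed below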

# construct attractiveness matrix
largeRetrunPCAValue <- as.data.frame(pca$x)
smallRetrunPCAValue <- as.data.frame(pcaSmall$x)
LargePCA <- largeRetrunPCAValue %>% dplyr::select(PC1,PC2,PC3)
SmallPCA <- smallRetrunPCAValue %>% dplyr::select(PC1,PC2)

LargePCAAttTogo <- 
  cbind(LargePCA, attracGEO %>% dplyr::select(OBJECTID)) %>% 
  st_sf() %>% 
  mutate(PC1 = scales::rescale(PC1, to=c(0,1)),
         PC2 = scales::rescale(PC2, to=c(0,1)),
         PC3 = scales::rescale(PC3, to=c(0,1)))

smallPCAAttTogo <- 
  cbind(SmallPCA, attracGEO %>% dplyr::select(OBJECTID)) %>% 
  st_sf()%>% 
  mutate(PC1 = scales::rescale(PC1, to=c(0,1)),
         PC2 = scales::rescale(PC2, to=c(0,1)))

# st_write(LargePCAAttTogo,"data/output/LargePCAAttTogo.GEOJSON")
# st_write(smallPCAAttTogo,"data/output/smallPCAAttTogo.GEOJSON")

# try no pca
noPCATogo <- attracGEO %>% dplyr::select(-PPR_DISTRICT) %>% st_sf()

6.2 Origin-Destination Combination Wrangling

As mentioned before, there is no record for some origin~destination combinations, so the probability for those routes is missing. Here, we try two ways to wrangle the data: the first replaces the NA values with 4.4e-5, and the second simply drops them.

# Based on the distance panel
huffData <- 
  theDistanceMatrixWhole2 %>% 
  rename("origin" = "GEOID")%>%
  mutate(distance = as.numeric(distance)) %>% 
  mutate(OBJECTID = as.numeric(OBJECTID)) %>% 
  left_join(prob,by=c("OBJECTID","origin"))

huffDataReplaceNa <- 
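  # option 1: replace missing route probabilities with the small pseudo-probability 4.4e-5 (data set 1 above)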
  huffData %>% 
  mutate(probability = replace_na(probability, 4.4e-5))

huffDataDropNa <- 
  huffData %>% 
  drop_na(probability)

huffDataDropNaNOPCA <- 
  huffDataDropNa%>% 
  left_join(noPCATogo %>% 
              st_drop_geometry() %>% 
              mutate_all(.,as.numeric),by="OBJECTID")

huffDataReplaceNaLargeAttract <- 
  huffDataReplaceNa %>% 
  left_join(LargePCAAttTogo %>% st_drop_geometry(),by="OBJECTID")

huffDataDropNaLargeAttract <- 
  huffDataDropNa %>% 
  left_join(LargePCAAttTogo %>% st_drop_geometry(),by="OBJECTID")

huffDataDropNaLargeAttract_program <- 
  huffDataDropNa %>% 
  left_join(LargePCAAttTogo %>% st_drop_geometry(),by="OBJECTID") %>% 
  left_join(attracGEOVISIT %>% dplyr::select(OBJECTID,programNum), by="OBJECTID") %>% 
  mutate(programNum = as.numeric(programNum))

huffDataDropNaSmallAttract <- 
  huffDataDropNa %>% 
  left_join(smallPCAAttTogo %>% st_drop_geometry(),by="OBJECTID")

7 Huff Model - Modeling and Validation

7.1 Fitting Data Selection

To determine which data wrangling method produces the more accurate result, in the following section we fit the model on each data set and compare the fitting R-squares.

source("huffModelScripts.R")

outcomeAnalysis <- function(df,scenario){
  print(scenario)
  dfnoNA <- df %>% drop_na()
  print(paste("Rsquare: Mean is: ",mean(dfnoNA$r2)))
  print(paste("# of lost origins is: ",nrow(df) - nrow(dfnoNA)))
}

## neighbor attr and geometry - Here is convenient stores
convenientSurroundEffect <- st_read("data/output/convenientSurroundEffect.GeoJSON")
7.1.1 Annual Model

SCENARIO 1 Drop NA probability and use PCA from the large group of attractiveness.

# SCENARIO 1: Drop NA probability and use large PCA
RealModelData <- huffDataDropNaLargeAttract
RealModelPlaces <- attracGEO %>% dplyr::select(OBJECTID) %>% st_centroid()

fit <- fit_parameter(data = RealModelData,
                           places = RealModelPlaces,
                           neighbor_data = convenientSurroundEffect,
                           neighbor_id_column="placekey",
                           neighbor_attr_column ="visits",
                           id_column="OBJECTID",
                           attr_column =c("PC1","PC2","PC3"),
                           distance_column="distance",
                           probability_column="probability",
                           origin_column = "origin",
                           k=2)

parameter <- fit %>% 
  dplyr::select(parameter) %>% 
  unnest(cols="parameter")
s1R2 <- parameter[['r2']]
outcomeAnalysis(parameter,"SCENARIO 1: Drop NA probability and use PCA from the large group of attractiveness")
## [1] "SCENARIO 1: Drop NA probability and use PCA from the large group of attractiveness"
## [1] "Rsquare: Mean is:  0.396096893759358"
## [1] "# of lost origins is:  4"


SCENARIO 2 Drop NA probability and use PCA from the small group of attractiveness.

# SCENARIO 2: Drop NA probability and use small PCA
RealModelData <- huffDataDropNaSmallAttract

fit <- fit_parameter(data = RealModelData,
                           places = RealModelPlaces,
                           neighbor_data = convenientSurroundEffect,
                           neighbor_id_column="placekey",
                           neighbor_attr_column ="visits",
                           id_column="OBJECTID",
                           attr_column =c("PC1","PC2"),
                           distance_column="distance",
                           probability_column="probability",
                           origin_column = "origin",
                           k=2)

parameter <- fit %>% 
  dplyr::select(parameter) %>% 
  unnest(cols="parameter")
s2R2 <- parameter[['r2']]
outcomeAnalysis(parameter,"SCENARIO 2: Drop NA probability and use PCA from the small group of attractiveness")
## [1] "SCENARIO 2: Drop NA probability and use PCA from the small group of attractiveness"
## [1] "Rsquare: Mean is:  0.349729671636457"
## [1] "# of lost origins is:  2"


SCENARIO 3 Drop NA probability and use no PCA

# SCENARIO 3: Drop NA probability and use no PCA
RealModelData <- huffDataDropNaNOPCA

fit <- fit_parameter(data = RealModelData,
                           places = RealModelPlaces,
                           neighbor_data = convenientSurroundEffect,
                           neighbor_id_column="placekey",
                           neighbor_attr_column ="visits",
                           id_column="OBJECTID",
                           attr_column =colnames(huffDataDropNaNOPCA)[5:19],
                           distance_column="distance",
                           probability_column="probability",
                           origin_column = "origin",
                           k=2)

parameter <- fit %>% 
  dplyr::select(parameter) %>% 
  unnest(cols="parameter")
s3R2 <- parameter[['r2']]
outcomeAnalysis(parameter,"SCENARIO 3: Drop NA probability and use no PCA")
## [1] "SCENARIO 3: Drop NA probability and use no PCA"
## [1] "Rsquare: Mean is:  0.654948454087436"
## [1] "# of lost origins is:  262"


SCENARIO 4 Replace NA probability and use PCA from the large group of variables

# SCENARIO 4: Replace NA probability and use large PCA
RealModelData <- huffDataReplaceNaLargeAttract

fit <- fit_parameter(data = RealModelData,
                           places = RealModelPlaces,
                           neighbor_data = convenientSurroundEffect,
                           neighbor_id_column="placekey",
                           neighbor_attr_column ="visits",
                           id_column="OBJECTID",
                           attr_column =c("PC1","PC2","PC3"),
                           distance_column="distance",
                           probability_column="probability",
                           origin_column = "origin",
                           k=2)

parameter <- fit %>% 
  dplyr::select(parameter) %>% 
  unnest(cols="parameter")
s4R2 <- parameter[['r2']]
outcomeAnalysis(parameter,"SCENARIO 4: Replace NA probability and use PCA from the large group of variables")
## [1] "SCENARIO 4: Replace NA probability and use PCA from the large group of variables"
## [1] "Rsquare: Mean is:  0.172348957656676"
## [1] "# of lost origins is:  16"
p1 <- ggplot(as.data.frame(s1R2), aes(x=s1R2)) + 
  geom_histogram(color=palette1_main, fill="white",bins = 50) +
  plotTheme()

p2 <- ggplot(as.data.frame(s2R2), aes(x=s2R2)) + 
  geom_histogram(color=palette1_main, fill="white",bins = 50) +
  plotTheme()

p3 <- ggplot(as.data.frame(s3R2), aes(x=s3R2)) + 
  geom_histogram(color=palette1_main, fill="white",bins = 50) +
  plotTheme()

p4 <- ggplot(as.data.frame(s4R2), aes(x=s4R2)) + 
  geom_histogram(color=palette1_main, fill="white",bins = 50) +
  plotTheme()

plot_grid(p1, p2, p3, p4, ncol = 2)
Figure 7.1.1 Histograms of R square in different scenarios

Above, we compare the R-squares of the different models. The best is Scenario 3, where we drop NA probabilities and use no PCA. One thing to note, however, is that Scenario 3 loses many origins: as the number of predictors increases, the minimum number of training data points required per origin increases as well. Scenario 1 reaches a similar R-square to Scenario 3, and, as shown in Section 7.2, it has better prediction accuracy on the test data set.

7.1.2 Seasonal Model

We also try dividing the data by record date into four seasons and running an individual model to predict the probability in each season. In Figure 5.7.1, the orange line represents temperature while the other line represents the number of visitors; the temperature values are multiplied by 500 for visualization.

# temperature data
weather.Data <-
  riem_measures(station = "PHL", date_start = "2021-01-01",
                date_end = "2021-11-30") %>% 
  mutate(day = ymd(substr(valid, 1, 10))) %>% 
  dplyr::select(day, tmpf) %>% 
  drop_na() %>% 
  group_by(day) %>% 
  summarise(avgTmp = 500*mean(tmpf))

# map of daily visits
visitCountForPlot <- 
  visitCount %>% 
  st_drop_geometry() %>% 
  group_by(visitDay) %>% 
  summarize(dayVisits = sum(visits)) %>% 
  left_join(weather.Data,by=c("visitDay" = "day"))

ggplot(data = visitCountForPlot, aes(x=visitDay))+
  geom_line(aes(y=dayVisits),color=palette1_main,size=0.5)+
  geom_line(aes(y=avgTmp),color=palette1_assist,size=0.5)+
  scale_y_continuous(
    name = "Safegraph Visit",
    sec.axis = sec_axis(~ ., name="500*Daily Temp"))+
  labs(title = 'Temporal Chart of Visitors and Temperature',x="Visit Date")+
  plotTheme()+
  theme(panel.border = element_rect(colour = "black", fill=NA, size=1))
Figure 5.7.1 Temporal Chart of Visitors and Temperature

# allocate the season
originCountSeasonal <- originCount %>% 
  st_drop_geometry() %>% 
  left_join(excepList,by="placekey") %>% 
  filter(is.na(OBJECTID)) %>% 
  dplyr::select(-OBJECTID) %>% 
  left_join(.,propertyWithPlaceKey %>% dplyr::select(OBJECTID, placekey),
            by="placekey") %>% 
  group_by(OBJECTID, origin, month) %>% 
  summarise(visitors = sum(visitors)) %>% 
  ungroup() %>% 
  filter(origin %in% CensusData$GEOID) %>% 
  left_join(homePanelToGo %>% dplyr::select(GEOID,quotient),by=c("origin"="GEOID")) %>% 
  mutate(visitors =visitors*quotient) %>% 
  dplyr::select(-quotient) %>% 
  mutate(season = case_when(month %in% c(3,4,5) ~ "spring",
                            month %in% c(6,7,8) ~ "summer",
                            month %in% c(9,10,11) ~ "autumn",
                            month %in% c(12,1,2) ~ "winter")) %>% 
  dplyr::select(-month)

modelDataSeasonal <- 
  left_join(originCountSeasonal, propertyWithPlaceKey %>% 
              st_drop_geometry() %>% 
              filter(NESTED == "N") %>%
              dplyr::select(OBJECTID, ACREAGE),by="OBJECTID") %>% 
  rename(c("area" = "ACREAGE")) 

With the limited data remaining after splitting into four seasons, some census block groups cannot produce a convergent Huff model; these areas are blank in the map. We ultimately set the seasonal model aside for now, since we only have one year of SafeGraph data, but it can be attempted in the future as more data become available. The following illustrates the process and shows the areas that cannot support a convergent Huff model.

# calculate probability 
probSeasonal <- modelDataSeasonal %>% 
  group_by(origin,season) %>% 
  summarise(total = sum(visitors)) %>% 
  left_join(modelDataSeasonal, by=c("origin","season")) %>% 
  mutate(probability = visitors/total) %>% 
  dplyr::select(OBJECTID, origin, probability,season)

probSeasonal.na <- probSeasonal %>% filter(is.na(probability))

# Based on the distance panel
huffDataSeasonal <- 
  theDistanceMatrixWhole3 %>% 
  rename("origin" = "GEOID")%>% 
  mutate(distance = as.numeric(distance)) %>% 
  left_join(probSeasonal %>% 
              mutate(OBJECTID = as.character(OBJECTID)),
            by=c("OBJECTID","origin"))

huffDataDropNaSeasonal <- 
  huffDataSeasonal %>% 
  drop_na()%>% 
  left_join(LargePCAAttTogo %>% 
              st_drop_geometry() %>% 
              mutate(OBJECTID = as.character(OBJECTID)),by="OBJECTID")

As an example, take the winter model, which predicts the probability of visitors from each census block group going to a given park facility between December and February. 393 census block groups appear blank on the map because SafeGraph records too few visits from them.

winterModelData <- huffDataDropNaSeasonal %>% filter(season=="winter") %>% dplyr::select(-season)
winterPlaces <- attracGEO %>% 
  filter(OBJECTID %in% unique(winterModelData$OBJECTID)) %>% 
  mutate(OBJECTID = as.character(OBJECTID)) %>% 
  st_centroid()

fit <- fit_parameter(data = winterModelData,
                         places = winterPlaces,
                         neighbor_data = convenientSurroundEffect,
                         neighbor_id_column="placekey",
                         neighbor_attr_column ="visits",
                         id_column="OBJECTID",
                         attr_column =c("PC1","PC2","PC3"),
                         distance_column="distance",
                         probability_column="probability",
                         origin_column = "origin",
                         k=2)
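For reference, the functional form we assume fit_parameter() estimates is a generalized Huff probability (the exact internals live in the project's huffModelScripts.R): each attractiveness component $a_{jm}$ (here PC1–PC3) enters with its own exponent $\alpha_m$, $d_{ij}$ is the distance from origin $i$ to facility $j$, and $c_j$ is the centrality term built from each facility's $k$ nearest neighboring places:

$$
P_{ij} \;=\; \frac{\Bigl(\prod_{m} a_{jm}^{\alpha_m}\Bigr)\, c_j^{\theta}\, d_{ij}^{-\beta}}
{\sum_{l}\Bigl(\prod_{m} a_{lm}^{\alpha_m}\Bigr)\, c_l^{\theta}\, d_{il}^{-\beta}}
$$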

parameter <- fit %>% 
  dplyr::select(parameter) %>% 
  unnest(cols="parameter")

outcomeAnalysis(parameter,"winter")
## [1] "winter"
## [1] "Rsquare: Mean is:  0.559587589496826"
## [1] "# of lost origins is:  393"
winterPlot <- 
  parameter %>% 
  drop_na() %>% 
  dplyr::select(origin,r2) %>% 
  left_join(CensusPoly, by=c("origin" = "GEOID")) %>% 
  st_sf()

ggplot() + 
  geom_sf(data=winterPlot,color="black",size=0.1,fill = "transparent")+
  geom_sf(data=winterPlot,color="black",size=0.1,aes(fill = q5(r2)))+
  scale_fill_manual(values = palette5 %>% rev(), name="Fitting R2", label=qBr(winterPlot,"r2",F))+
  labs(title="Fitting R square Distribution Map in Winter Model ")+
  mapTheme()+
  theme(legend.position = "bottom")
Figure 6.7.6 Fitting R square Distribution Map in Winter Model

7.2 Train and Test

As mentioned above, Scenario 3 has the better fitting R-squared but many lost regions, while Scenario 1 has a similar fitting R-squared with far fewer lost regions. Below, we test the prediction accuracy of Scenarios 1 and 3.

7.2.1 Scenario 1
# SCENARIO 1: Drop NA probability and use large PCA
RealModelData1 <- huffDataDropNaLargeAttract
RealModelPlaces1 <- attracGEO %>% dplyr::select(OBJECTID) %>% st_centroid()

# records are not enough for (421010008033, 421010008034, 421010008042, 421010158004)
RealModelData1 <- RealModelData1 %>%
  filter(!origin %in% c(421010008033, 421010008034, 421010008042, 421010158004))

# Split data
set.seed(3456)
split <- partition(RealModelData1$origin, p=c(train=0.5, test=0.5))
Train <- RealModelData1[split$train,]
Test <- RealModelData1[split$test,]

fit <- fit_parameter(data = Train,
                           places = RealModelPlaces1,
                           neighbor_data = convenientSurroundEffect,
                           neighbor_id_column="placekey",
                           neighbor_attr_column ="visits",
                           id_column="OBJECTID",
                           attr_column =c("PC1","PC2","PC3"),
                           distance_column="distance",
                           probability_column="probability",
                           origin_column = "origin",
                           k=2)

parameter1 <- fit %>% 
  dplyr::select(parameter) %>% 
  unnest(cols="parameter")
# replace NAs in parameter
parameter1 <- parameter1 %>% 
  mutate_all(~replace(., is.na(.), 0))
# join to same index with Test Set
parameter_full1 <- left_join(Test, parameter1, by="origin") 

# centrality
TestPlaces <- RealModelPlaces1 %>% 
  filter(OBJECTID %in% unique(Test$OBJECTID))
Test <- centrality_to_df(data = Test, 
                         places = TestPlaces, 
                         neighbor_data = convenientSurroundEffect, 
                         neighbor_id_column="placekey",
                         neighbor_attr_column ="visits",
                         id_column="OBJECTID",
                         k = 2)

# attractiveness values in Test must not be 0: a zero raised to a negative
# exponent evaluates to Inf, which breaks the Huff prediction
Test <- Test %>% 
  mutate_if(is.numeric, ~ifelse(. == 0, 0.0000001, .))
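# (Hypothetical sanity check, safe to run: R returns Inf for zero raised to a
# negative power, so any zero attribute would blow up the power-law terms.)
stopifnot(is.infinite(0^-1.5))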


# predict
huff_results <- huff_NE(destinations_name = Test$id, 
                        destinations_attractiveness = Test[c("PC1","PC2","PC3")], 
                        origins_name = Test$origin, 
                        destinations_centrality = Test$centrality,
                        distance = Test$distance,
                        alpha = parameter_full1 %>% dplyr::select(starts_with("alpha")),
                        beta = parameter_full1$beta,
                        theta = parameter_full1$theta)

error <- left_join(huff_results, Test, by=c("origins_name"="origin", "destinations_name"="id") ) %>% 
  mutate(AE = abs(huff_probability - probability),
         APE = abs(huff_probability - probability)/probability) %>% 
  #filter(probability>0.1) %>% 
  group_by(origins_name) %>% 
  summarise(MAPE = mean(APE,na.rm=T),
            MAE = mean(AE,na.rm=T))
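Written out, the per-origin error metrics computed above are

$$
\mathrm{MAE}_i = \frac{1}{n_i}\sum_{j}\bigl|\hat P_{ij}-P_{ij}\bigr|,
\qquad
\mathrm{MAPE}_i = \frac{1}{n_i}\sum_{j}\frac{\bigl|\hat P_{ij}-P_{ij}\bigr|}{P_{ij}},
$$

where $\hat P_{ij}$ is the Huff prediction (huff_probability), $P_{ij}$ the observed SafeGraph share (probability), and $n_i$ the number of facility records for origin $i$.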

scenario1FrequencyResult <- left_join(huff_results, Test, by=c("origins_name"="origin", "destinations_name"="id") ) %>%
  mutate(frequency = case_when(
    probability > 0.5 ~ "super-high",
    probability > 0.1 & probability <= 0.5 ~ "high",
    probability > 0.03 & probability <= 0.1 ~ "medium",
    probability > 0.015 & probability <= 0.03 ~ "mid-low",
    probability > 0.009 & probability <= 0.015 ~ "low",
    probability <= 0.009 ~ "super-low"
  )) %>% 
  drop_na() %>% 
  mutate(error = probability - huff_probability) %>% 
  group_by(frequency) %>% 
  summarise(MAE = mean(abs(error)),
            mean_p = mean(probability)) %>% 
  ungroup() %>% 
  mutate(MAPE = MAE/mean_p)

print(c(paste0("MAE for Scenario 1 is ",round(mean(error$MAE,na.rm=T),4)),
        paste0("MAPE for Scenario 1 is ",round(mean(error$MAPE,na.rm=T),4))))
## [1] "MAE for Scenario 1 is 0.0605"  "MAPE for Scenario 1 is 3.3391"
7.2.2 Scenario 3
# SCENARIO 3: Drop NA probability and use no PCA
RealModelData3 <- huffDataDropNaNOPCA %>% 
  dplyr::select(-origin, -OBJECTID, -probability) %>% 
  # rescale every predictor to [0, 1]
  mutate_all(~scales::rescale(., to=c(0,1))) %>% 
  cbind(huffDataDropNaNOPCA %>% dplyr::select(origin, OBJECTID, probability))

RealModelPlaces3 <- attracGEO %>% dplyr::select(OBJECTID) %>% st_centroid()

# records are not enough for (421010008033, 421010008034, 421010008042, 421010158004)
RealModelData3 <- RealModelData3 %>%
  filter(!origin %in% c(421010008033, 421010008034, 421010008042, 421010158004))

# Split data
set.seed(3456)
split <- partition(RealModelData3$origin, p=c(train=0.5, test=0.5))
Train <- RealModelData3[split$train,]
Test <- RealModelData3[split$test,]
# attractiveness columns (named to avoid shadowing base::colnames)
attrColumns <- c("area", "facilityNum", "facilityArea","hydrationNum","tennisCourtNum",
                 "playgroundNum","treeHeight","programNum","permitNum")

fit <- fit_parameter(data = Train,
                           places = RealModelPlaces3,
                           neighbor_data = convenientSurroundEffect,
                           neighbor_id_column="placekey",
                           neighbor_attr_column ="visits",
                           id_column="OBJECTID",
                           attr_column = attrColumns,
                           #attr_column = colnames(huffDataDropNaNOPCA)[5:19],
                           distance_column="distance",
                           probability_column="probability",
                           origin_column = "origin",
                           k=2)

parameter3 <- fit %>% 
  dplyr::select(parameter) %>% 
  unnest(cols="parameter")
# replace NAs in parameter
parameter3 <- parameter3 %>% 
  mutate_all(~replace(., is.na(.), 0))
# join to same index with Test Set
parameter_full3 <- left_join(Test, parameter3, by="origin") 

# centrality
TestPlaces <- RealModelPlaces3 %>% 
  filter(OBJECTID %in% unique(Test$OBJECTID))
Test <- centrality_to_df(data = Test, 
                         places = TestPlaces, 
                         neighbor_data = convenientSurroundEffect, 
                         neighbor_id_column="placekey",
                         neighbor_attr_column ="visits",
                         id_column="OBJECTID",
                         k = 2)

# attractiveness values in Test must not be 0: a zero raised to a negative
# exponent evaluates to Inf, which breaks the Huff prediction
Test <- Test %>% 
  mutate_if(is.numeric, ~ifelse(. == 0, 0.0000001, .))

# predict
huff_results <- huff_NE(destinations_name = Test$id, 
                        destinations_attractiveness = Test[attrColumns], 
                        origins_name = Test$origin, 
                        destinations_centrality = Test$centrality,
                        distance = Test$distance,
                        alpha = parameter_full3 %>% dplyr::select(starts_with("alpha")),
                        beta = parameter_full3$beta,
                        theta = parameter_full3$theta)

error <- left_join(huff_results, Test, by=c("origins_name"="origin", "destinations_name"="id") ) %>% 
  mutate(AE = abs(huff_probability - probability),
         APE = abs(huff_probability - probability)/probability) %>% 
  group_by(origins_name) %>% 
  summarise(MAPE = mean(APE,na.rm=T),
            MAE = mean(AE,na.rm=T))

print(c(paste0("MAE for Scenario 3 is ",round(mean(error$MAE,na.rm=T),4)),
        paste0("MAPE for Scenario 3 is ",round(mean(error$MAPE,na.rm=T),4))))
## [1] "MAE for Scenario 3 is 0.0793"  "MAPE for Scenario 3 is 4.0921"

Scenario 1 therefore has higher prediction accuracy than Scenario 3. Considering the number of lost regions as well, we stick with Scenario 1 and improve the model from there.

kable(scenario1FrequencyResult,align = 'c',caption = '<center>Table 7.2.1 Prediction Error of Scenario 1 in different visit frequency <center/>') %>%
  kable_classic(full_width = T)%>%
  kable_styling(position = "center")
Table 7.2.1 Prediction Error of Scenario 1 in different visit frequency

frequency     MAE        mean_p     MAPE
super-high    0.4022483  0.5780892  0.6958240
high          0.1258262  0.1806336  0.6965821
medium        0.0593454  0.0515820  1.1505062
mid-low       0.0490007  0.0210707  2.3255407
low           0.0388017  0.0117808  3.2936335
super-low     0.0304014  0.0057639  5.2744242

Table 7.2.1 also shows that MAPE increases as visit frequency decreases. In the optimization that follows, we therefore add a categorical variable, frequency, to account for this between-group error.

7.3 Model Optimization

7.3.1 Training
# SCENARIO 1 variant: Drop NA probability and use large PCA, with programNum kept outside the PCA
RealModelData_final <- huffDataDropNaLargeAttract_program
RealModelPlaces_final <- attracGEO %>% dplyr::select(OBJECTID) %>% st_centroid()

RealModelData_final <- RealModelData_final %>% 
  mutate(frequency = case_when(
    probability > 0.5 ~ "super-high",
    probability > 0.1 & probability <= 0.5 ~ "high",
    probability > 0.03 & probability <= 0.1 ~ "medium",
    probability > 0.015 & probability <= 0.03 ~ "mid-low",
    probability > 0.009 & probability <= 0.015 ~ "low",
    probability <= 0.009 ~ "super-low"
  ))

# Split data
set.seed(3456)
split <- partition(RealModelData_final$origin, p=c(train=0.5, test=0.5))
Train <- RealModelData_final[split$train,]
Test <- RealModelData_final[split$test,]

# avoid error:contrasts can be applied only to factors with 2 or more levels
drop_list2 <- Train %>%
  group_by(origin) %>%
  summarise(n_program = n_distinct(programNum),
            n_frequency = n_distinct(frequency)) %>%
  filter(n_program < 2 | n_frequency < 2)

Train <- Train %>%
  filter(!origin %in% drop_list2$origin)
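
# Hypothetical demonstration of the error guarded against above: lm() cannot
# contrast-code a factor observed at only one level, which is why such
# origins are dropped before fitting.
tryCatch(
  lm(y ~ f, data = data.frame(y = rnorm(3), f = factor(rep("a", 3)))),
  error = function(e) message(conditionMessage(e))
)
## contrasts can be applied only to factors with 2 or more levels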

trainPlaces <- RealModelPlaces_final %>% 
  filter(OBJECTID %in% unique(Train$OBJECTID))

fit <- fit_parameter_level(data = Train,
                           places = trainPlaces,
                           neighbor_data = convenientSurroundEffect,
                           neighbor_id_column="placekey",
                           neighbor_attr_column ="visits",
                           id_column="OBJECTID",
                           attr_column =c("PC1","PC2","PC3","programNum"),
                           distance_column="distance",
                           probability_column="probability",
                           origin_column = "origin",
                           k=2,
                           category = "frequency")

parameter <- fit %>% 
  dplyr::select(parameter) %>% 
  unnest(cols="parameter") 

parameter <- parameter %>% 
  dplyr::select(-frequencyhigh)

outcomeAnalysis(parameter,"Final Optimal Model: Dropping NA probability, using PCA, and taking visits frequency level into consideration")
## [1] "Final Optimal Model: Dropping NA probability, using PCA, and taking visits frequency level into consideration"
## [1] "Rsquare: Mean is:  0.972369869839444"
## [1] "# of lost origins is:  74"
ggplot(parameter, aes(x=r2)) + 
  geom_histogram(color=palette1_main, fill="white",bins = 50) +
  labs(title="R-squared Distribution Histgram for Optimized Model")+
  plotTheme()
Figure 7.3.1 R-squared Distribution Histogram for Optimized Model

After adding the categorical variable, frequency, the R-squared on the training data improves dramatically, from 0.39 in the Scenario 1 model of Section 7.1.1 to 0.97 here, which marks the success of the model optimization.
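
To make the mechanics concrete, here is a hypothetical sketch of how a categorical term can enter a per-origin Huff fit, assuming, as the exponent-style parameters suggest, that fit_parameter_level() works on the log scale. The toy data, column names, and lm() call are illustrative only, not the function's actual internals:

library(dplyr)

# Toy data for one origin: routes with a known power-law structure plus an
# intercept shift for "high"-frequency routes.
set.seed(7)
toy <- tibble(
  attractiveness = runif(60, 0.1, 1),
  distance       = runif(60, 0.5, 5),
  frequency      = sample(c("high", "medium", "low"), 60, replace = TRUE)
) %>%
  mutate(probability = exp(0.8 * log(attractiveness) - 1.2 * log(distance) +
                             (frequency == "high") + rnorm(60, sd = 0.1)))

# On the log scale the Huff exponents become linear coefficients and the
# frequency dummies become additive level offsets.
fit <- lm(log(probability) ~ log(attractiveness) + log(distance) + frequency,
          data = toy)
coef(fit)  # recovers ~0.8 (alpha), ~-1.2 (-beta), plus frequency-level shifts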

Next, we check whether the prediction accuracy improves as well.

7.3.2 Prediction
# avoid error:contrasts can be applied only to factors with 2 or more levels
Test <- Test %>%
  filter(origin %in% Train$origin)

testPlaces <- RealModelPlaces_final %>% 
  filter(OBJECTID %in% unique(Train$OBJECTID))

# one-hot encode the frequency category with caret::dummyVars()
dmy <- dummyVars(" ~ .", data = Test %>% mutate(origin=as.numeric(origin)))
Test <- data.frame(predict(dmy, 
                                        newdata = Test %>% mutate(origin=as.numeric(origin))))
Test <- Test %>% 
  mutate(origin = as.character(origin)) %>% 
  rename(c("high" = "frequencyhigh",
           "low" = "frequencylow",
           "medium" = "frequencymedium",
           "mid-low" = "frequencymid.low",
           #"super-high" = "frequencysuper.high",
            "super-low" =  "frequencysuper.low"))

# replace NAs in parameter
parameter <- parameter %>% 
  mutate_all(~replace(., is.na(.), 0))

predictionResult <- huff_NE_level(data = Test,
                           places = testPlaces,
                           neighbor_data = convenientSurroundEffect,
                           neighbor_id_column="placekey",
                           neighbor_attr_column ="visits",
                           id_column="OBJECTID",
                           attr_column =c("PC1","PC2","PC3","programNum"),
                           distance_column="distance",
                           probability_column="probability",
                           origin_column = "origin",
                           k=2,
                           category = c("high", "low", "medium",
                                  "mid-low", 
                                  #"super-high", 
                                  "super-low"),
                           parameter = parameter,
                           prefix = "frequency")

predictionResult <- predictionResult %>% 
  distinct() %>% 
  mutate(prob = exp(prob))

predictionResult <- left_join(predictionResult, 
                              Test %>% dplyr::select(origin, OBJECTID, probability),
                              by=c("origin"="origin", "id"="OBJECTID"))

predictionResult_2 <- predictionResult %>% 
  group_by(origin) %>% 
  summarise(d = sum(prob)) %>% 
  ungroup() %>% 
  right_join(predictionResult, by=c("origin"="origin")) %>% 
  mutate(prob = prob/d,
         error = probability-prob)
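
The two steps above, exponentiating the fitted log-probabilities and then dividing by their per-origin sum d, renormalize the predictions so that they sum to one within each census block group:

$$
\hat P_{ij} = \frac{\exp(\hat y_{ij})}{\sum_{k}\exp(\hat y_{ik})}
$$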

predictionResult_3 <- predictionResult_2 %>% 
  left_join(.,RealModelData_final %>% dplyr::select(origin, OBJECTID, frequency),
            by=c("origin"="origin","id"="OBJECTID"))

predictionResult_group2 <- predictionResult_3 %>% 
  group_by(frequency) %>% 
  summarise(MAE = mean(abs(error)),
            mean_p = mean(probability)) %>% 
  ungroup() %>% 
  mutate(MAPE = MAE/mean_p)

predictionResult_group <- predictionResult_2 %>% 
  group_by(origin) %>% 
  summarise(MAE = mean(abs(error)),
            mean_p = mean(probability)) %>% 
  ungroup() %>% 
  mutate(MAPE = MAE/mean_p)

print(
  c(
    paste0("MAE is ",round(mean(predictionResult_group$MAE,na.rm=T),4)),
  paste0("MAPE is ",round(mean(predictionResult_group$MAPE,na.rm=T),4))))
## [1] "MAE is 0.0441" "MAPE is 1.464"

The MAE decreases from 0.06 to 0.044, an impressive improvement, and the MAPE drops from 3.34 in Section 7.2.1 to 1.46. So the model optimization improves prediction accuracy on the test set as well. The detailed distribution is shown in Figure 7.3.2.

p1 <- ggplot(data=melt(predictionResult_group2 %>% mutate(frequency = fct_reorder(frequency, mean_p)), 
                          id.vars = c("frequency","MAPE"), 
                          variable.name = "var"),
                aes(x=frequency, y=value, fill=var)) +
  geom_bar(stat="identity", position=position_dodge())+
  scale_fill_manual(values = palette2, label=c("MAE", "Mean Probability"))+
  plotTheme() + 
  theme(legend.position = "bottom",axis.text.x = element_text(angle = 30, vjust = 1, hjust=1))

p2 <- ggplot(data=predictionResult_group2 %>% mutate(frequency = fct_reorder(frequency, mean_p)),
                aes(x=frequency, y=MAPE)) +
  geom_bar(stat="identity", position=position_dodge(), fill=palette1_main) +
  plotTheme() + 
  theme(axis.text.x = element_text(angle = 30, vjust = 1, hjust=1))

plot_grid(p1, p2 
          # labels = c("Distribution Histogram of MAE for Different Census Block Groups",
          #                    "Distribution Histogram of MAPE for Different Census Block Groups")
          )
Figure 7.3.2 Error among Different Route Visit Frequency Levels

p1 <- ggplot(predictionResult_group, aes(x=MAE)) + 
  geom_histogram(color=palette1_main, fill="white",bins = 50) +
  plotTheme()

p2 <- ggplot(predictionResult_group, aes(x=MAPE)) + 
  geom_histogram(color=palette1_main, fill="white",bins = 50) +
  plotTheme()
     
plot_grid(p1, p2
          # labels = c("Distribution Histogram of MAE for Different Census Block Groups",
          #                    "Distribution Histogram of MAPE for Different Census Block Groups")
          )
Figure 7.3.2.1 MAE & MAPE Distribution Histograms for Census Block Groups

In Figure 7.3.2.2, we can see that 74 of the 1,339 census block groups do not have a convergent Huff model. Overall, though, most census block groups converge, and most have a MAPE around 1.

  ggplot() + 
    geom_sf(data=left_join(predictionResult_group, CensusPoly, by=c("origin"="GEOID")) %>% st_sf(),
            color="black",size=0.05,aes(fill = q5(MAPE)))+
    geom_sf(data=pprDistrict %>% st_union(),color="black",size=1,fill = "transparent")+
    scale_fill_manual(values = palette5, name="MAPE", label=qBr(predictionResult_group,"MAPE",F))+
  labs(title="MAPE Distribution Map for Census Block Groups")+
    mapTheme()+
    theme(legend.position = "bottom")
Figure 7.3.2.2 MAPE Distribution Map for Census Block Groups

Although the Huff model operates at the census block group level, aggregating the MAE and MAPE by facility shifts the centers of both distributions to the left. In other words, the practical accuracy for the use case is even higher than the figures above suggest: the facility-level MAE is 0.032.

error_parks <- predictionResult_2 %>% 
  group_by(id) %>% 
  summarise(MAE = mean(abs(error)),
            mean_p = mean(probability)) %>% 
  mutate(MAPE = MAE/mean_p)

p1 <- ggplot(error_parks, aes(x=MAE)) + 
  geom_histogram(color=palette1_main, fill="white",bins = 50) +
  plotTheme()

p2 <- ggplot(error_parks, aes(x=MAPE)) + 
  geom_histogram(color=palette1_main, fill="white",bins = 50) +
  plotTheme()
     
plot_grid(p1, p2
          # labels = c("Distribution Histogram of MAE for Different Parks $ Rec Centers",
          #                    "Distribution Histogram of MAPE for Different Parks $ Rec Centers")
          )
Figure 7.3.2.3 MAE & MAPE Distribution Histograms for Different Parks & Rec Centers

Figure 7.3.2.4 shows the geospatial distribution of this facility-level MAPE. Large facilities in Philadelphia produce more accurate results than small ones, which matches the earlier observation that high-frequency routes are predicted more accurately.

  ggplot() + 
    geom_sf(data=left_join(error_parks, pprProperties, by=c("id"="OBJECTID")) %>% st_sf(),
            color="black",size=0.05,aes(fill = q5(MAPE)))+
    geom_sf(data=pprDistrict %>% st_union(),color="black",size=0.5,fill = "transparent")+
    scale_fill_manual(values = palette5, name="MAPE", label=qBr(error_parks,"MAPE",F))+
    labs(title="MAPE Distribution Map for Parks")+
    mapTheme()+
    theme(legend.position = "bottom")
Figure 7.3.2.4 MAPE Distribution Map for Parks

7.4 Cross Validation - Spatial Generalizability
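
We evaluate spatial generalizability with cross validation grouped by PPR district (cv_id = "PPR_DISTRICT"), which we take to hold each district out in turn, fit parameters on the remaining districts, and predict the held-out one; the commented-out huff_crossValidate() calls below were run once and cached to CSV. As a hypothetical illustration of that leave-one-group-out scheme, with a toy lm() standing in for the project's fit/predict pair:

library(dplyr)

# Toy leave-one-group-out CV over five synthetic "districts".
set.seed(1)
toy <- tibble(district = rep(1:5, each = 20),
              x = runif(100)) %>%
  mutate(y = 2 * x + rnorm(100, sd = 0.1))

logo_mae <- bind_rows(lapply(unique(toy$district), function(d) {
  train <- filter(toy, district != d)   # fit on all other districts
  test  <- filter(toy, district == d)   # predict the held-out district
  m <- lm(y ~ x, data = train)          # stand-in for fit_parameter()
  tibble(district = d,
         MAE = mean(abs(predict(m, test) - test$y)))
}))
logo_mae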

CVModelData1 <- RealModelData1 %>%
  left_join(., pprProperties %>%
              dplyr::select(OBJECTID, PPR_DISTRICT) %>%
              st_drop_geometry(),
            by = "OBJECTID")
# 
# 
# cv_result  <- huff_crossValidate(data = CVModelData1,
#                                          cv_id = "PPR_DISTRICT",
#                                          places = RealModelPlaces1,
#                                          neighbor_data = convenientSurroundEffect,
#                                          neighbor_id_column="placekey",
#                                          neighbor_attr_column ="visits",
#                                          id_column="OBJECTID",
#                                          attr_column =c("PC1","PC2","PC3"),
#                                          distance_column="distance",
#                                          probability_column="probability",
#                                          origin_column = "origin",
#                                          k=2)
# cv_result <- left_join(cv_result,
#                        CVModelData1 %>% dplyr::select(origin, OBJECTID, probability, PPR_DISTRICT),
#                        by = c("origins_name"="origin", "destinations_name"="OBJECTID")) %>% 
#   mutate(error = probability - huff_probability)
# 
# st_write(cv_result,"data/output/cv_result.csv")
cv_result <- st_read("data/output/cv_result.csv")
cv_result$error <- as.numeric(cv_result$error)
cv_result$probability <- as.numeric(cv_result$probability)
cv_group_result <- cv_result %>%
  group_by(origins_name) %>%
  summarise(MAE = mean(abs(error), na.rm = T),
            MAPE = MAE/mean(probability, na.rm = T)) %>%
  ungroup()
CVModelData2 <- RealModelData_final %>%
  left_join(., pprProperties %>%
              dplyr::select(OBJECTID, PPR_DISTRICT) %>%
              st_drop_geometry(),
            by = "OBJECTID")
# 
# source("huffModelScripts.R")
# 
# cv_result2  <- huff_crossValidate_level(data = CVModelData2,
#                                          cv_id = "PPR_DISTRICT",
#                                          places = RealModelPlaces_final,
#                                          neighbor_data = convenientSurroundEffect,
#                                          neighbor_id_column="placekey",
#                                          neighbor_attr_column ="visits",
#                                          id_column="OBJECTID",
#                                          attr_column =c("PC1","PC2","PC3","programNum"),
#                                          distance_column="distance",
#                                          probability_column="probability",
#                                          origin_column = "origin",
#                                          k=2,
#                                          category = "frequency",
#                                         category_list =  c("high", "low", "medium",
#                                                           "mid-low", 
#                                                           #"super-high", 
#                                                           "super-low"),
#                                         prefix = "frequency")
# 
# st_write(cv_result2,"data/output/cv_result2.csv")
cv_result2 <- st_read("data/output/cv_result2.csv")
cv_result2$prob <- as.numeric(cv_result2$prob)
cv_result2$id <- as.numeric(cv_result2$id)
cv_result_clean <- cv_result2 %>%
  distinct() %>%
  mutate(prob = exp(prob))

cv_result_clean <- cv_result_clean %>%
  group_by(origin) %>%
  summarise(d = sum(prob)) %>%
  ungroup() %>%
  right_join(.,cv_result_clean, by=c("origin"="origin")) %>%  # obtain prob
  left_join(., CVModelData2 %>%  # obtain probability
              dplyr::select(origin, OBJECTID, probability, frequency),
            by=c("origin"="origin", "id"="OBJECTID")) %>%
  mutate(prob = prob/d,
         error = probability-prob)

cv_group_result2 <- cv_result_clean %>%
  group_by(origin) %>%
  summarise(MAE = mean(abs(error)),
            MAPE = MAE/mean(probability)) %>%
  ungroup()
cv_group_result_dist <- cv_result %>% 
  group_by(PPR_DISTRICT) %>% 
  summarise(MAPE = mean(abs(error))/mean(probability)) %>% 
  left_join(.,pprDistrict %>% mutate(DISTRICTID = as.character(DISTRICTID)),
            by=c("PPR_DISTRICT"="DISTRICTID")) %>% 
  st_sf()

m1 <- ggplot() + 
  geom_sf(data=cv_group_result_dist,
          color="white",size=0.05,aes(fill=MAPE))+
  geom_sf(data=pprDistrict %>% st_union(),color="white",size=0.5,fill = "transparent")+
  scale_fill_gradient2(low=palette5[1], mid=palette5[3], high=palette5[5], 
                       midpoint=6, limits=c(0,13), 
                       breaks = c(0, 6, 13), labels = c("0", "6", "13"))+
  mapTheme()+
  theme(legend.position = "bottom")

cv_group_result2_dist <- left_join(cv_result_clean, 
                                   pprProperties %>% 
                                     dplyr::select(OBJECTID, PPR_DISTRICT) %>% 
                                     st_drop_geometry(), 
                                   by=c("id"="OBJECTID")) %>% 
  group_by(PPR_DISTRICT) %>% 
  summarise(MAPE = mean(abs(error))/mean(probability)) %>% 
  left_join(.,pprDistrict %>% mutate(DISTRICTID = as.character(DISTRICTID)),
            by=c("PPR_DISTRICT"="DISTRICTID")) %>% 
  st_sf()

m2 <- ggplot() + 
  geom_sf(data=cv_group_result2_dist,
          color="white",size=0.05,aes(fill=MAPE))+
  geom_sf(data=pprDistrict %>% st_union(),color="white",size=0.5,fill = "transparent")+
  scale_fill_gradient2(low=palette5[1], mid=palette5[3], high=palette5[5], 
                       midpoint=6, limits=c(0,13), 
                       breaks = c(0, 6, 13), labels = c("0", "6", "13"))+
  mapTheme()+
  theme(legend.position = "bottom")

plot_grid(m1, m2, 
          labels = c(" ",
                     " ")
          )
Figure 7.4.1 MAPE from Cross-Validation for PPR Districts (Left: Before Optimization, Right: After Optimization)

8 Conclusion

8.1 Application

For the use case, we can help PPR better understand the use of their facilities and the effect of rearranging program counts. Each time PPR changes the number of programs at a facility, the market areas change accordingly, so PPR can iterate until it finds an optimized allocation. The specific process is displayed in the dashboard.

8.2 Limitation

Since the Huff model predicts probabilities, which are always smaller than 1, the MAPE is inevitably high: a small denominator inflates the ratio. For example, a route with a true probability of 0.005 and a prediction of 0.015 has an absolute error of only 0.01 but an APE of 200%. And if we multiply the predicted probabilities by the total visits in each census block group, the error in predicted visits from each block group to each facility grows even larger. This is a dominant limitation of the Huff model. Furthermore, time-lag variables are usually helpful for predicting temporal visit patterns and might further improve the model, but with only one year (2021) of SafeGraph data, this approach cannot be attempted.

8.3 Conclusion

In this research, we first analyzed the PPR program data and the SafeGraph Patterns data separately. Building on that analysis, we used clustering and regression to unveil the mismatch between the visit patterns in the SafeGraph device panel and PPR's scheduled programs, which indicates an imbalance between demand and supply.

To provide optimized solutions, we used the Huff model to predict the probability of visitors from each census block group going to a given park facility. By changing the number of programs, the Huff model can predict future visit flows, helping PPR reallocate resources more reasonably. During modeling we tried several techniques, including PCA, model-data selection, a seasonal model attempt, and the addition of a frequency level. The final model reaches an MAE of 0.028 in cross validation, and it generalizes well, with a MAPE of 0.88. Compared with the initial model, this is a large improvement.

So far we have focused on adjusting programs, one kind of social asset, to influence visitor trends and market areas. Civic infrastructure is also organized through other assets, such as physical assets and community engagement. Investment in physical assets could be allocated with reference to the Huff model, for example to renovate facilities and thereby shift visitor flows. The Huff model dashboard could also grow into a platform for communicating visitors' expectations, supporting better community engagement. Optimization schemes built on the Huff model can remediate imbalances in the distribution of resources and better address inequity, gentrification, and unequal access to urban parks and public spaces.