This project was completed for the Urban Spatial Analytics/Smart Cities Practicum course (MUSA 801), instructed by Michael Fichman and Matthew Harris. We are grateful to our instructors for their continued support and feedback. We are also grateful to our client Daniel Sebastian Padilla Ochoa from Ocean Conservancy for his guidance and support throughout this practicum project.
Marine litter continues to pose a major threat to people, animals, and the environment. While zero-waste pilot programs like Urban Ocean have made strides in reducing marine litter, there persists difficulty in site-selection of pilots due to lacking litter information. This adds unnecessary barriers to the siting of pilot zero-waste pickup programs. By leveraging geospatial machine learning, this project aims to explore how best to navigate zero-waste siting in areas facing bleak marine litter conditions.
Image Source: Ocean Conservancy, The Independent
Urban Ocean aims to prevent ocean plastic pollution by supporting projects in cities that address plastic pollution and infrastructure resilience. This includes reducing the amount of single-use plastic produced, improving waste management systems, promoting circular economy principles, building awareness, and more. Urban Ocean is an accelerator project created out of a trilateral initiative/partnership between Circulate Initiative, Resilient Cities Network, and Ocean Conservancy:
This program takes multiple approaches to address ocean plastic pollution. First, Urban Ocean mobilizes cleanup campaigns across each city based on volunteer efforts with international and local support groups, such as the International Coastal Cleanup. Second, they research each city’s sanitation and urban systems to identify key details regarding what causes trash to end up in oceans and waterways, sharing findings and recommending action plans to local communities and decision makers and working with stakeholders directly to create changes in products, practices, and behaviors as alternative solutions to their current needs. With these efforts, each campaign requires an evaluation of local context and detailed understanding of the city’s systems and dynamic. This aspect of the process requires long-term planning and methodical decision-making to understand the pre-existing conditions in place. Hence, our model aims to streamline the process of understanding the general structure of each city-context and provide a comprehensive solution on where efforts and resources should be allocated to.
We based our focus and model on the current cities from the Urban Ocean program. The 12 participating cities a part of the Urban Ocean program as of May 2024 are:
As marine litter worsens across the world, litter accumulation poses significant cyclical health, biodiversity, and climate risks. For example, an estimated 11 million metric tons of plastic waste enter the ocean every year, and this number expected to triple by 2040. Particularly in developing countries with lacking waste disposal systems and recycling programs, litter accumulation poses a much steeper issue that is often difficult to solve systemically. This often disproportionately impacts vulnerable coastal dwellers and fisherman who heavily rely on the marine ecosystem. The trilateral partnership program, Urban Ocean, has been developing initiatives that mitigate marine pollution, assess waste management, and enable cities to address ocean plastics and resilience. These projects intend to deploy “zero-waste” pilots in cities around the world. With this wide-spanning outreach and diversity in partnerships, the difficulty of site selection and resource allocation hinders efficiency each waste-reduction campaign.
Our objective is to develop a site selection and assessment tool that identifies effective zero-waste site locations for multinational use. This geospatial risk assessment model predicts litter accumulation based on globally sourced data with a repeatable framework on an international scale. The results of the model aim to evaluate which sections of a given area have the highest likelihood to produce and contain litter relative to its surroundings. For our case, the Urban Ocean program intends to dedicate “zero-waste” solutions in these areas for the most effective impact for each deployment.
Our resulting dashboard aims to support the narrative-building and decision-making of zero-waste pilots and attract funding sources and other support for these cities.
Our project involved a range of data sources for finding key insights of the current state of marine litter and further model building. We primarily focused on adopting global data sources to support scalability of the model to the international cities.
Our primary dataset was existing marine debris/litter from the Marine Debris Tracker powered by Morgan Stanley in collaboration with the National Geographic Society and the University of Georgia. The Marine Debris Tracker is a web application and public use database that tracks user-recorded global litter pick-up projects and initiatives from participating organizations. The marine litter data scope selected for this analysis was between January 2021 - February 2024. This period was selected because it was the largest period of data collection time post-COVID-19.
OpenStreetMap (OSM) data was used to provide additional context to the litter data. OSM featured land-use, proximity to water, waste site facilities, roads, and other indicators of commercial activity. The aim of these variables was to act as proxies for litter cases, providing insight on where litter most likely accumulated.
For our model, we directed our focus on data modeling from the Urban Ocean cities. To develop our model for each respective city, each area was determined by a ‘boundary box’ that was created by a custom KML file using Google My Maps or sourced similarly. Each boundary matched existing political boundaries and a practical scope of where data was recorded within the city.
Additionally, we also hypothesized population density as having a major influence on our model. For this, we used Data for Good at Meta’s global high resolution population density maps. This global dataset acts as a proxy for population density.
Given all these data sources, our model utilized a range of variables, depicted below. The variables used in the model were selected based on hypothesized impact on litter accumulation. These variables included land-use, proximity to water, waste site facilities, roads, and other indicators of commercial activity. We hypothesized these variables being able to provide the best insight on where litter most likely accumulated or ends up.
Social and economic activities are evaluated using restaurants and population density. Restaurants are an indication of commercial corridors, given they are typically located in central hubs of urban activity. Their presence, especially when in clusters, can indicate a higher likelihood of commercial corridors and in turn program the model to classify these areas as with greater activity. Another metric for characterizing a city landscape is population density. Regardless of population itself, density is a key factor in determining where people are located relative to the rest of the city. This metric supports our geospatial assessment in determining relative risk of litter accumulation, rather than independent probability.
Geomorphology describes the city’s physical geography for the model. Trash, when moved around by nature, weather, human activity, has a higher probability of moving towards lower elevations and bodies of water. Using water distribution as a factor accomplishes two pieces of information in one variable. Naturally, water distribution displays where water is located. When trash makes contact with a body of water, it rarely exits, requiring cleanup intervention to remove it, acting as an indication of higher risk. Secondly, it acts as a general but not definite idea of terrain. Proximity to water is not fullyindicative of elevation, but it does provide a general idea of a path of travel for loose trash naturally moving.
Waste management or waste facilities serve the purpose of collecting, storing, transporting, recycling, and disposing of waste. This impact directly minimizes adverse effects on the environment and public health of an area. These facilities act as a factor of litter risk because of their location in relation to the city center, where litter has been located, and their capabilities to cover their assigned jurisdiction. Their critical role in the sanitation infrastructure and system in a city is a key factor in understanding any operation weaknesses.
Land use is a general indicator of the city’s zoning and urban planning. Industrial zones, for example, are likely to have litter accumulation due to the nature of the activities. Residential zones, on the other hand, are less likely to have litter accumulation due to the opposite implication. The model uses land use to potentially discover patterns in the city that may not necessarily be identified by intuition alone, but is still important is creating the picture of how an area is structured. With this generalization in mind, zoning alone is not enough for describing landu se. Roads also are a variable incorporated into the model to frame the network of the city, also acting as a proxy to density. Gridded streets or segments closer together often represent denser, potentially taller buildings that represent activity, as well as vise-versa. Hence, each variable is a general feature known to be present in every city and allows for replicated use to any place in the world.
A fishnet grid was employed in this analysis to allow for the most effective modeling of risk. Each grid cell has a count of a particular variable. Following data extrapolation from OSM data for, the litter data and other variables were overlaid on a 500x500 meter fishnet grid to create a uniformed data structure for each city. The fishnet grid was created using the ‘sf’ package and projected to the respective city’s UTM projection. The grid was used to aggregate and visualize the litter data and other variables to create a uniformed data structure for each city.
The following section demonstrate the process of importing the litter CSV, boundary KML, and the calling of OSM data and other data onto the grid used for data exploration and analysis. This process was then repeated for the rest of the Urban Ocean cities. As calling large amount of OSM data is time consuming for each city, each fishnet for each city was saved to file and reloaded for the purposes of this report.
litter <- read.csv('https://raw.githubusercontent.com/TrevorKap/MUSA810-Marine-Pollution/main/Data/mdt-dataChennai.csv')
litter_p <- litter%>%filter(master_material == 'PLASTIC')%>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant")%>%st_transform('EPSG:32643')
chen_bdry <- st_read('https://github.com/TrevorKap/MUSA810-Marine-Pollution/raw/main/Data/Chennai.kml')
chen_bdry <- st_set_crs(chen_bdry, 4326)%>%st_transform('EPSG:32643')
temp_bd <- st_read('https://github.com/TrevorKap/MUSA810-Marine-Pollution/raw/main/Data/Chennai.kml')
temp_bbox <- get_bbox(temp_bd)
temp_fish <- create_fish(chen_bdry)
final_net <- countfishnet(temp_fish, litter_p)
final_net <- pn_gen(stor_df)
final_net <- moran_gen(final_net,stor_df)
chen_net <- final_netcity_data_list <- lapply(cities, load_city_data)
names(city_data_list) <- cities
bd_data_list <- lapply(cities, load_city_kml)
names(bd_data_list) <- bd
bd_data_meter_list <- lapply(cities, load_city_kml_meter)
names(bd_data_meter_list) <- bdM
for (i in seq_along(city_data_list)) {assign(cities[i], city_data_list[[i]])}
for (i in seq_along(bd_data_list)) {assign(bd[i], bd_data_list[[i]])}
for (i in seq_along(bd_data_meter_list)) {assign(bdM[i], bd_data_meter_list[[i]])}
base_path <- "https://raw.githubusercontent.com/TrevorKap/MUSA810-Marine-Pollution/main/Data/stored_city/"
city_names <- c("santa_fe", "semarang", "panama", "can_tho", "melaka", "salvador",
"surat", "santiago", "bangkok", "chennai", "mumbai", "pune")
city_data <- list()
for (city in city_names) {
file_path <- paste0(base_path, city, "_net.geojson")
net_sf <- st_read(file_path)
net_sf <- st_set_crs(net_sf, 32643)
city_data[[city]] <- net_sf
}
list2env(setNames(city_data, paste0("net_", city_names)), envir = .GlobalEnv)As suggested above, MDT data used only accounts for litter that has been recorded by the user of the tracker, suggesting that areas where litter was not recorded in a general surrounding could not confidently be assumed to be present or not present. Thus, there was a range of data wrangling techniques employed in this model to reduce bias and noise. For example, rather than predicting for litter count, we only considered fishnet grid cells that had records of litter. This meant that only grid cells with litter at a count greater than 0 were modeled. Furthermore, we decided to opt for a “per user” count, in which the litter per grid cell was measured with regards to the number of pieces reported per user to be able to compare differing amounts of litter per person, given that users would typically record data in areas that were biased to their own living circumstances and conditions. Additionally, we used a function to sample more data from the original litter dataset. Lastly, z-score normalization was applied to help normalize the data from a scale of 0-1 to make the data more interpretable.
library(grid)
cnt_wt <- function(litter,net){
temp <- st_join(litter,net,join = st_within)
temp <- temp %>% filter(!is.na(uniqueID))
temp_sum <- temp %>%
group_by(uniqueID)%>%
summarise(count_un = n_distinct(username))
net <- left_join(net,st_drop_geometry(temp_sum),by = 'uniqueID')
net <- net %>%
mutate(count_un = replace_na(count_un, 0),
count = count/count_un,
count = replace_na(count, 0))
return(net)
}
net_mumbai <- cnt_wt(Mumbai,net_mumbai)
net_chennai <- cnt_wt(Chennai,net_chennai)
net_bangkok <- cnt_wt(Bangkok,net_bangkok)
net_can_tho <- cnt_wt(Can_Tho,net_can_tho)
net_melaka <- cnt_wt(Melaka,net_melaka)
net_panama <- cnt_wt(Panama_City,net_panama)
net_pune <- cnt_wt(Pune,net_pune)
net_salvador <- cnt_wt(Salvador,net_salvador)
net_santa_fe <- cnt_wt(Santa_Fe,net_santa_fe)
net_santiago <- cnt_wt(Santiago,net_santiago)
net_semarang <- cnt_wt(Semarang,net_semarang)
net_surat <- cnt_wt(Surat,net_surat)
#z-score-normalizing
z_score_normalize <- function(x) {
(x - mean(x)) / sd(x)
}
normal_city <- function(data){
data <- data%>%
mutate(across(where(is.numeric) & -'count', z_score_normalize),
across(where(is.numeric), ~replace(., is.nan(.), 0)))
return(data)
}
net_salvador <- normal_city(net_salvador)
net_santa_fe <- normal_city(net_santa_fe)
net_santiago <- normal_city(net_santiago)
net_semarang <- normal_city(net_semarang)
net_surat <- normal_city(net_surat)
net_bangkok <- normal_city(net_bangkok)
net_can_tho <- normal_city(net_can_tho)
net_chennai <- normal_city(net_chennai)
net_melaka <- normal_city(net_melaka)
net_mumbai <- normal_city(net_mumbai)
net_panama <- normal_city(net_panama)
net_pune <- normal_city(net_pune)
# agg
net_bangkok <- net_bangkok %>% mutate(city = 'Bangkok',country = 'Thailand')
net_can_tho <- net_can_tho %>% mutate(city = 'Can_Tho',country = 'Vietnam')
net_chennai <- net_chennai %>% mutate(city = 'Chennai',country = 'India')
net_melaka <- net_melaka %>% mutate(city = 'Melaka',country = 'Malaysia')
net_mumbai <- net_mumbai %>% mutate(city = 'Mumbai',country = 'India')
net_panama <- net_panama %>% mutate(city = 'Panama_City',country = 'Panama')
net_pune <- net_pune %>% mutate(city = 'Pune',country = 'India')
net_salvador <- net_salvador %>% mutate(city = 'Salvador',country = 'Brazil')
net_santa_fe <- net_santa_fe %>% mutate(city = 'Santa_Fe',country = 'Argentina')
net_santiago <- net_santiago %>% mutate(city = 'Santiago',country = 'Chile')
net_semarang <- net_semarang %>% mutate(city = 'Semarang',country = 'Indonesia')
net_surat <- net_surat %>% mutate(city = 'Surat',country = 'India')
net_total <- rbind(net_bangkok,net_can_tho,net_chennai,net_melaka,net_mumbai,net_panama,net_pune,net_salvador,net_santa_fe,net_santiago,net_semarang,net_surat) %>% mutate(uniqueID = 1:n())
net_total_li <- net_total %>% dplyr::filter(count != 0)
net_total_li <- net_total_li %>%
mutate(count = (count - mean(count)) / sd(count))
total_litter <- do.call(rbind, city_data_list)
grid.draw(
grobTree(
rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
grid.arrange(
ggplot(total_litter, aes(x = master_material, fill = master_material)) +
geom_bar() +
plotTheme() +
labs(title = "Count of Main Item Category",
x = "Category",
y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c("PLASTIC" = "#1c3aa9", "other" = "grey")) +
guides(fill = FALSE),
nrow = 1
)
)
)The litter data features a wide variety of item categories and general information about each piece. Through the process of our feature engineering, we were able to identify the major litter risks to the cities. According to the above chart, plastic waste is of greatest concern, both in quantity, and given that they are tied to significant long-term sustainability issues including the lack of recyclability, effects on marine life, and the dependence on single-use plastics.
In addition to the dependent variables in their counted presence, we implemented a K-Nearest Neighbor analysis to determine the proximity of each data point to one another and incorporate that as an additional variable to our model. K-Nearest Neighbor (KNN) allows us to find the proximity of another variant to determine commercial/high activity spots in the area. The nearest neighbor factor was repeated for land-use, roads, restaurants, water, and waste facilities. We added KNN as a factor to place emphasis on density and location-based processing. This addition acts as a fail-safe for our model’s cell calculation given that examining count alone does not incorporate the real-world application of proximity.
Local Moran’s I measures spatial autocorrelation which measures the degree of spatial relationship between observations, measuring how closer values are clustered in a space. We measure these among the dependent and independent variables to factor in any spatial lags identified within the region. Similar to KNN in terms of purpose, this factor allows the model to understand the relationship between the dependent and independent variables and apply that relationship in its calculations for each cell.
Following this feature engineering, the different variables were visualized in the fishnet grid below.
The visuals showing ‘count’ measure the number of parcels with the zone or number of buildings within the cells and does not account for density. The nearest neighbor maps (‘Variable’_nn) measured their proximity.
library(ggplot2)
library(stringr)
library(grid)
library(gridExtra)
map_waste <- visual_count(net_chennai, "waste")
map_waste_nn <- visual_count(net_chennai,'waste_nn')
grid.draw(grobTree(rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
grid.arrange(map_waste, map_waste_nn, ncol = 2)))The following sections also map OSM variables, including water, restaurants, roads, land use, and more. The selection was based on our hypothesis of areas of human activity leading to higher litter risk. Each variable is mapped on the fishnet grid for an initial exploratory analysis.
visual_count1 <- function(net_one,variable){
ggplot() +
geom_sf(data = net_one, aes(fill = net_one[[variable]]), color = NA) +
scale_fill_viridis_c(option = "mako", name = str_to_title(variable)) +
#labs(title = paste(str_to_title(variable),"Count for the Fishnet")) +
mapTheme()
}
map_water <- visual_count1(net_chennai,"water")
map_restaurant <- visual_count1(net_chennai,"restaurant")
map_road <- visual_count1(net_chennai,'road')
map_industrial <- visual_count1(net_chennai,'industrial')
map_residential <- visual_count1(net_chennai,"residential")
map_retail <- visual_count1(net_chennai,'retail')
grid.draw(
grobTree(
rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
grid.arrange(map_water, map_restaurant, map_road,
map_industrial, map_residential, map_retail, nrow = 2)
)
)map_water_nn <- visual_count1(net_chennai,'water_nn')
map_restaurant_nn <- visual_count1(net_chennai,"restaurant_nn")
map_road_nn <- visual_count1(net_chennai,'road_nn')
map_industrial_nn <- visual_count1(net_chennai,'industrial_nn')
map_residential_nn <- visual_count1(net_chennai,"residential_nn")
map_retail_nn <- visual_count1(net_chennai,'retail_nn')
grid.draw(
grobTree(
rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
grid.arrange(map_water_nn, map_restaurant_nn, map_road_nn,
map_industrial_nn, map_residential_nn, map_retail_nn, nrow = 2)
)
)The Principal Component Analysis (PCA) is a dimensional reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving as much variance as possible. The PCA below is used to identify the most important variables in the dataset and their correlation with litter, ultimately creating synthetic variables that can augment the data input for the model. The first output is a table that depicts the general metrics found in the independent variables, such as standard deviation, variance, and the proportion of variance explained by each component.
corr <- st_drop_geometry(net_total) %>% dplyr::select(!c(uniqueID,cvID,city,country))
corr_nor<- scale(corr)
corr_matrix <- cor(corr_nor)
#ggcorrplot(corr_matrix)
data.pca <- princomp(corr_matrix)
summary(data.pca)## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.1544504 0.36523214 0.34046877 0.31029899 0.25846779
## Proportion of Variance 0.6144093 0.06149576 0.05343942 0.04438824 0.03079783
## Cumulative Proportion 0.6144093 0.67590503 0.72934445 0.77373269 0.80453052
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.23518531 0.21725529 0.20707302 0.1916407 0.18403398
## Proportion of Variance 0.02549926 0.02175945 0.01976761 0.0169310 0.01561361
## Cumulative Proportion 0.83002978 0.85178923 0.87155684 0.8884878 0.90410144
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 0.17324598 0.15353018 0.143181926 0.1332383 0.120928775
## Proportion of Variance 0.01383673 0.01086663 0.009451127 0.0081840 0.006741655
## Cumulative Proportion 0.91793817 0.92880480 0.938255928 0.9464399 0.953181583
## Comp.16 Comp.17 Comp.18 Comp.19
## Standard deviation 0.120295484 0.102914317 0.101110151 0.100044032
## Proportion of Variance 0.006671229 0.004882686 0.004712992 0.004614127
## Cumulative Proportion 0.959852812 0.964735498 0.969448490 0.974062617
## Comp.20 Comp.21 Comp.22 Comp.23
## Standard deviation 0.097015031 0.09618051 0.092972876 0.088041257
## Proportion of Variance 0.004338956 0.00426463 0.003984921 0.003573384
## Cumulative Proportion 0.978401573 0.98266620 0.986651124 0.990224508
## Comp.24 Comp.25 Comp.26 Comp.27
## Standard deviation 0.080715923 0.068107836 0.058459405 0.052915014
## Proportion of Variance 0.003003486 0.002138461 0.001575491 0.001290818
## Cumulative Proportion 0.993227994 0.995366455 0.996941947 0.998232765
## Comp.28 Comp.29 Comp.30 Comp.31
## Standard deviation 0.0409395731 0.035875430 0.0252563016 0.0152463435
## Proportion of Variance 0.0007726696 0.000593337 0.0002940673 0.0001071615
## Cumulative Proportion 0.9990054343 0.999598771 0.9998928385 1.0000000000
## Comp.32
## Standard deviation 2.228071e-09
## Proportion of Variance 2.288576e-18
## Cumulative Proportion 1.000000e+00
table_html <- kbl(data.pca$loadings[, 1:2])
kbl(data.pca$loadings[, 1:2]) %>%
kable_styling(bootstrap_options = "striped", font_size = 12, full_width = FALSE, html_font = "Avenir")| Comp.1 | Comp.2 | |
|---|---|---|
| count | 0.0391065 | 0.0282442 |
| water | 0.0693956 | -0.1633430 |
| water_nn | -0.2529400 | 0.1799049 |
| waste | 0.0600653 | -0.0664571 |
| waste_nn | -0.2834370 | 0.0208787 |
| restaurant | 0.1223218 | -0.0422238 |
| restaurant_nn | -0.2771403 | -0.0894539 |
| road | 0.1634139 | 0.4109086 |
| road_nn | -0.1375091 | -0.3191800 |
| industrial | 0.0455750 | -0.2397279 |
| industrial_nn | -0.1254326 | 0.2251861 |
| residential | 0.1228023 | -0.0306500 |
| residential_nn | -0.1698571 | -0.1098496 |
| retail | 0.0834391 | -0.1030719 |
| retail_nn | -0.2773497 | 0.0287448 |
| avg_pop | 0.2391612 | 0.2413757 |
| sum_pop | 0.2599853 | 0.2968630 |
| water_sig | 0.0926867 | -0.1871962 |
| water_sig_dis | -0.2652302 | 0.1821440 |
| waste_sig | 0.1003334 | -0.1030702 |
| waste_sig_dis | -0.2567700 | 0.0972708 |
| restaurant_sig | 0.1312017 | -0.0589612 |
| restaurant_sig_dis | -0.2806288 | -0.0111852 |
| road_sig | 0.0539025 | 0.2301286 |
| road_sig_dis | -0.0984326 | -0.2721898 |
| industrial_sig | 0.0558730 | -0.2686758 |
| industrial_sig_dis | -0.1533502 | 0.2396142 |
| residential_sig | 0.1201466 | -0.1019100 |
| residential_sig_dis | -0.2504080 | 0.0363113 |
| retail_sig | 0.1067757 | -0.1163043 |
| retail_sig_dis | -0.2377061 | 0.0360326 |
| count_un | 0.0524686 | 0.0262507 |
The Eigenvalues are the amount of variance explained by each component, quantifying the importance of each component in the dataset. The first dimension explains the vast majority of the variance, with a sharp decrease thereafter. These results mean that we only need 1 dimension of overlap when evaluating our model and additional dimensions of evaluation will not be useful thereafter. A single dimensional view means having all the independent and dependent variables overlaid onto a single space provides the greatest insight, and any variations of such will not be useful.
grid.draw(
grobTree(
rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
grid.arrange(
fviz_eig(data.pca, addlabels = TRUE) + ggtitle("Eigenvalues from Principal Component Analysis") + plotTheme(),
nrow = 1
)
)
)The squared cosine value (Cos2) shows the quality of the representation of the variables on the factor map. The closer the value is to 1, the better the representation of the variable. What we notice particularly about these variables with the largest quality of representation is that they are statistical factor rather than a numeric value. Almost each independent variable with a ~0.75 value is either a KNN or significance value. The only exception being sum_pop (population density) and avg_pop (average population).
grid.draw(
grobTree(
rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
grid.arrange(
fviz_cos2(data.pca, choice = "var", axes = 1:2)+ plotTheme(),
nrow = 1
)
)
)The PCA Variables graph shows the correlation of each variable in comparison to the first two dimensions. The arrows that represent each variable have varying directions and lengths, indicating mostly little correlation between the first dimension and second dimension. But the few variables that have relatively longer lengths have stronger relationships between both dimensions and can be used for the model. To better understand their importance we evaluate their Squared Cosine. This graph serves as the basis for choosing the several variables. Based on the result, the selected variables included ‘waste_sig_dis, restaurant_sig_dis, residential_sig_dis, water_sig_dis, residential_nn, industrial_sig_dis, industrial_sig, restaurant_sig, industrial_nn, road_sig_dis, residential_sig, restaurant_sig, restaurant’ as the ‘shortened model’ independent variables.
n_colors <- 100
mako_gradient <- viridis(n_colors, option = "mako")
grid.draw(
grobTree(
rectGrob(gp=gpar(fill="#ecf6ff", lwd=0, col = "#ecf6ff")),
grid.arrange(
fviz_pca_var(data.pca, col.var = "cos2",
gradient.cols = mako_gradient,
repel = TRUE) + plotTheme(),
nrow = 1
)
)
)