This project is conducted to fulfill the course requirements for the 8010 Master of Urban Spatial Analytics (MUSA)/Smart Cities Practicum, Spring 2025. It is a collaboration between the University of Pennsylvania and the DC Office of the Deputy Mayor for Education (DME).
We sincerely appreciate Professor Michael Fichman and Professor Matthew Harris, as well as Rebecca Lee and Rory Lawless at DME, for their invaluable guidance, support, and dedication throughout the project.
[Project GitHub Page] - to be added after the cargo period
Four team members collaboratively contributed to this project. To reach out, please find their contact details below:
Jingmiao Fei: MUSA ’25 - LinkedIn
Jia Yue Ong: MUSA ’25 - LinkedIn
Amy Solano: MUSA ’25 - LinkedIn
Neve Zhang: MCP ’25 - LinkedIn
The project is commissioned by the DC Office of the Deputy Mayor for Education (DME), who are responsible for developing and implementing the Mayor’s vision for academic excellence and creating a high-quality education continuum from birth to 24 (from early childhood to K-12 to post-secondary and the workforce). The three major functions of the DME include:
Overseeing a District-wide education strategy
Managing interagency and cross-sector coordination
Providing oversight and/or support for the following education-related agencies including the DC Public Schools (DCPS) and Public Charter School Board (PCSB).
This student yield prediction project aims to enhance the DME’s ability to accurately forecast student enrollment across DCPS schools and boundaries, enabling more efficient resource allocation. This will support DME’s efforts including updates to the Master Facilities Plan, a key document that helps education stakeholders understand the current landscape and emerging needs, guiding decision-making and investments for a comprehensive approach to facility planning over the next 5 to 10 years.
Students’ decisions to enroll in public schools have traditionally been influenced by four key factors: parental demand, neighborhood characteristics, school availability, and policy interventions. As these factors have grown more complex, Washington, DC, has experienced steady growth in school-aged children between 2010 and 2020, an increase in highly educated families driving parental and housing demand, rising real estate prices, and migration patterns shaped by the pursuit of better education in specific wards.
Collectively, since existing year-on-year prediction methods do not adequately account for exogenous factors such as population and housing trends, our project provides a more comprehensive, data-informed tool to support the client in forecasting matriculation numbers. Per the client’s request, we focus specifically on elementary school (K–5) enrollments, with particular attention to kindergarten (grade K), as its enrollment cannot be predicted based on the previous grade. Our deliverables include this markdown and an app-format dashboard, to support the use case as below:
The project supports our client (DME) to make informed decision-making for education-related resource allocation, including personnel and capital investments, at each DCPS elementary school. The client can input housing pipeline and demographic data into a system to generate boundary school-specific matriculation predictions for incoming elementary students in a near-future year.
For the scope of our project and up-to-date data availability, our app forecasts for each school year between SY25-26 and SY29-30.
DC’s public education system consists of two categories of schools: DC Public schools and charter schools. DCPS includes all of the traditional schools in the Districts which have centralized school policies subject to regulation and oversight from the District government. Charter schools, comparatively, operate on innovative and non-traditional school models and policies in exchange of greater accountability for student performance. (Source: DC Pave)
schools_table %>%
kbl() %>%
column_spec(1, bold = T, border_right = T) %>%
kable_styling(bootstrap_options = "striped", font_size = 12)
School Type | Facilities | Funding | Funding Formula | Regulation | Leadership | Curriculum | Staffing |
---|---|---|---|---|---|---|---|
DCPS | Provided by the District | Funded by the District on a per-student basis | By law, DCPS is required to give schools at least 95 percent of their previous year’s budget. A student-based budgeting (SBB) formula,Limited staff-based allocations, Program grants, and Stability funding. | Must follow all District laws and regulations | Centralized decision-making process led by the Chancellor | Chosen by DC Central Office with some input from principals | Teacher positions are allocated using the Comprehensive Staffing Model Formula |
Charter | Funded with the Charter Facilities Allotment | Funded by the District on a per-student basis | Funded based on their audited, or actual, enrollment Unlike with DCPS, public charter LEAs do not receive one large lump sum of UPSFF funds. | Freedom to create their own policies | Individual Local Education Agency Executive Leaders have autonomy to make decisions | Each individual Local Education Agency chooses their own curriculum | LEA leaders decide how they want to structure staff positions at their schools |
In terms of school choice, public school students in the district have multiple options.
enroll in their in-boundary DCPS school, determined by their home address;
enroll at a DCPS school outside of their geographic school boundary;
enroll in a DCPS citywide or selective school; or
enroll at a public charter school.
In SY2022-23, 28% of students were enrolled at their in-boundary schools, with Ward 3 having the highest percentage of in-boundary attendance (84%), and Wards 5, 7, and 8 having the lowest (16-19%). Notably, 53-58% of students living in Wards 5, 7, and 8 were enrolled in public charter schools.
Overall, 21% of students chose to attend an out-of-boundary school, while only 6% of students from Ward 3 made that choice.
The current business-as-usual approach that Washington D.C. leverages to make enrollment forecasts is the Grade Progression Ratio (https://edscape.dc.gov/node/1607141), which operates on the rationale that enrollments at higher grade levels can be extrapolated from past enrollments at a corresponding lower grade (i.e. K-grade students from the same school facility will become the incoming 5th grade cohort in five years).
To calculate the Grade Promotion Rate (GPR) from Kindergarten to Grade 1 for the 2014-15 school year, one divides the number of students in Grade 1 during the 2014-15 audit by the number of students who were in Kindergarten during the 2013-14 audit. A ratio more than 1 (or greater than 100%) suggests that there are additional students joining, who did not enroll in the previous grade. A ratio less than 1 (or less than 100%) suggests that students may be transferring away from the school, or out of the public school system. It is important to note that GPRs are different than re-enrollment because they don’t necessarily reflect the same students.
However, this approach has several limitations. First, while the approach is more effective in making one-year forecasts, particularly for higher grade levels, it is inapplicable to make long-term forecasts for lower grade enrollments. For instance, while it is possible to deduce 5th grade enrollments in the next five years from current K-grade students, K level enrollments could be forecasted due to a lack of associated lower grade data, resulting in significantly fewer predictions feasible. Likewise, this method does not sufficiently account for external factors dynamically shaping enrollment preferences, such as facility expansion / closure, housing market changes, and socioeconomic trends, but rather assumes enrollment as a static choice once a cohort is enrolled, reinforcing a “business-as-usual” rationale.
Our approach improves the business-as-usual approach by improving on the two critical shortcomings highlighted. By incorporating historic and current enrollment records with comprehensive data sources such as the American Community Survey 5-Year Estimates and DC’s permits and residential mass appraisals, we deploy a regression-based method that enables significantly more predictions across all grades and years. Section 2 “Understanding Key Influences through Exploratory Data Analysis” further unwraps our thinking behind the new mechanism.
elementary <- list('K','1','2','3','4','5')
# Manually removed a record associated with school_id == 247 - this is the only duplicated row
schools_cleaned <- schools %>%
filter(grade_band %in% c("Elementary", "K-8")) %>% # Neve added
group_by(school_year, lea_id, school_id, mar_latitude) %>%
slice_max(order_by = facility_capacity, n = 1, with_ties = FALSE) %>%
ungroup()
elementary_by_year <- students_by_tract_sy1314_to_sy2425 %>%
filter(grade_level %in% elementary) %>%
left_join(schools_cleaned %>%
dplyr::select(-school_name, -school_sector),
by=c("lea_code"="lea_id","school_code"="school_id","school_year"="school_year", "school_main_facility_lat" = "mar_latitude", "school_main_facility_long" = "mar_longitude")) %>%
dplyr::select(# original variables from the secret data
school_year, residence_census_tract, lea_code, school_code, school_name,
school_sector, in_boundary_indicator, school_main_facility_lat, school_main_facility_long,
grade_level, total_students,
# variables added through the merge
dcps_facility_name, facility_sector, grade_band,`ward_(2022)`,
dcps_boundary, facility_capacity, enrollment, utilization, unfilled_seats) %>%
rename(ward_22 = `ward_(2022)`) # tidied up workflow
# Create unique facility ID for each school facility
elementary_by_year <- elementary_by_year %>%
group_by(school_code, grade_band, school_main_facility_lat, school_main_facility_long) %>% # this ensures distinct facilities are selected
mutate(unique_id = paste0("F", sprintf("%06d", cur_group_id()))) %>%
ungroup()
# Create new column variable for matching census data
elementary_by_year <- elementary_by_year %>%
mutate(census_year = case_when(
school_year == 'SY13-14' ~ 2013,
school_year == 'SY14-15' ~ 2014,
school_year == 'SY15-16' ~ 2015,
school_year == 'SY16-17' ~ 2016,
school_year == 'SY17-18' ~ 2017,
school_year == 'SY18-19' ~ 2018,
school_year == 'SY19-20' ~ 2019,
school_year == 'SY20-21' ~ 2020,
school_year == 'SY21-22' ~ 2021,
school_year == 'SY22-23' ~ 2022,
school_year == 'SY23-24' ~ 2023,
school_year == 'SY24-25' ~ 2024,
TRUE ~ NA_real_
))
Plot of the Errors of the Business-As-Usual Approach
ggplot(df5, aes(x = `sum(total_students).x`, y = `sum(total_students).y`)) +
geom_point(alpha = 0.3, color = "#440154") +
geom_abline(linetype = "dashed", color = "#fc8961") +
geom_smooth(method = "lm", color = "black") +
coord_equal() +
scale_x_continuous(
expand = c(0, 0),
limits = c(0, NA),
breaks = seq(0, max(df5$`sum(total_students).x`, na.rm = TRUE), by = 40)
) +
scale_y_continuous(
expand = c(0, 0),
limits = c(0, NA),
breaks = seq(0, max(df5$`sum(total_students).y`, na.rm = TRUE), by = 50)
) +
labs(
title = "Plot of Predicted Against Observed with Best Fit Line",
x = "Observed",
y = "Predicted"
) +
theme_bw()
Summary Table of Errors of the Business-As-Usual Approach
# MAE
df5$error = abs(df5$`sum(total_students).y`-df5$`sum(total_students).x`)
mean(df5$error, na.rm = TRUE)
## [1] 12.2847
# MAPE
df5$APE = abs(df5$`sum(total_students).y`-df5$`sum(total_students).x`)/(df5$`sum(total_students).x`)
mean(df5$APE, na.rm = TRUE)
## [1] 0.4295409
# RMSE
df5$sqerror = (df5$`sum(total_students).y` - df5$`sum(total_students).x`)^2
rmse_5_year <- sqrt(mean(df5$error, na.rm = TRUE))
# Create summary table
progression_error_summary <- data.frame(
Metric = c("Mean Absolute Error", "Mean Absolute Percentage Error", "Root Mean Square Error"),
Value = c(
mean(df5$error, na.rm = TRUE),
mean(df5$APE, na.rm = TRUE),
sqrt(mean(df5$sqerror, na.rm = TRUE))
)
)
print(progression_error_summary)
## Metric Value
## 1 Mean Absolute Error 12.2846975
## 2 Mean Absolute Percentage Error 0.4295409
## 3 Root Mean Square Error 15.4071666
As the team believes that grade-level enrollment changes is not isolated to the facility itself, but associated with a range of exogenous factors and concerns. Some of these factors are demographic variables (i.e., total population), socioeconomic variables (i.e., income), building permits and new residential development. The next section uses exploratory charts and visualizations to demonstrate the following insights:
School facility capacity and utilization patterns vary across years and geographies
Socioeconomic characteristics have been a key underlying concern to school district planning
Housing developments have shaped and will continue to shape elementary school enrollments in DC
We have consulted the following data sources to build up the analysis process:
Open Data DC
Others
Student Data by Census Tract SY13-14 to SY 23-24 (provided by client, internal use only)
School Classification by Year (provided by client, internal use only)
We then combine these variables with the school enrollment to create a panel data for our analysis. Details of our panel data creation can be found in Section 3.
This exploratory analysis encompasses three primary spatial units present in our data: census tracts, elementary school boundaries, and wards. Elementary school boundaries change over time (explained further in respective section). DC is divided into 8 different wards, each having its own demographic and social characteristics, as well as representation in the District.
ggarrange(nrow=1,
ggplot(boundary)+
geom_sf(fill="whitesmoke", color="#440154", lwd=0.1)+
labs(title="Elementary School Boundary")+
theme_void(),
ggplot(wards, aes(label=NAME))+
geom_sf(fill="whitesmoke", color="#440154", lwd=0.1)+
geom_sf_label(fill = "white",
fun.geometry = sf::st_centroid,
size=2.7)+
labs(title="Wards")+
theme_void()+
theme(plot.title = element_text(hjust = 0.5)),
ggplot(ct20)+
geom_sf(fill="whitesmoke", color="#440154", lwd=0.1)+
labs(title="Census Tracts (2020)")+
theme_void()
)
For the last decade (SY13-14 to SY23-24), most elementary schools in DC operate close to full capacity, with utilization rates clustering around 1.0. Nonetheless, a noticeable portion of schools falls below this threshold, indicating under-utilization, while the utilization of a smaller yet significant number exceeds their capacity, reflecting overcrowding. Comparatively, total school capacity and enrollment numbers exhibit multiple peaks, indicating variations in school sizes and enrollment volumes.
School_info$Utilization <- as.numeric(gsub("%", "", School_info$Utilization)) / 100
## Histogram
display_histogram <- function(arg, binwidth){
ggplot(School_info, aes(x = .data[[arg]])) +
geom_histogram(binwidth = binwidth, fill = "#440154", color = "black", alpha = 0.7) +
labs(title = paste("Distribution of", arg),
x = arg,
y = "Count") +
theme_minimal()
}
display_histogram("Facility capacity", 10)
display_histogram("Enrollment", 10)
display_histogram("Utilization", 0.02)
Elementary schools under both DCPS and Public Charter sectors have experienced relatively steady growth in facility capacity over time. Meanwhile, while enrollment increased in both sectors before SY19-20, DCPS schools have faced a more pronounced decline since then, suggesting a greater negative impact from COVID-19 compared to Charter schools.
## time series plot
# Convert 'school_year' to an ordered factor to ensure correct ordering
School_info$`School year` <- factor(School_info$`School year`,
levels = c("SY13-14", "SY14-15", "SY15-16", "SY16-17", "SY17-18", "SY18-19", "SY19-20", "SY20-21", "SY21-22", "SY22-23", "SY23-24", "SY24-25"),
ordered = TRUE)
# filter SY24-25 for they have no data
School_info_filtered <- School_info %>%
filter(`School year` != "SY24-25")
# Calculate the sum of 'capacity' by 'sector' for each year
time_series_plot <- function(arg){
arg_summary <- School_info_filtered %>%
group_by(`School year`, `School sector`) %>%
summarise(total_capacity = sum(.data[[arg]], na.rm = TRUE))
# plot time series chart
ggplot(arg_summary, aes(x = `School year`, y = total_capacity, color = `School sector`, group = `School sector`)) +
geom_line() +
geom_point() +
scale_color_manual(values=c("#440154","#fc8961"))+
labs(title = paste("Total", arg, "by Sector Over Time"),
x = "School Year",
y = paste("Total",arg),
color = "Sector") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
time_series_plot("Facility capacity")
time_series_plot("Enrollment")
However, the upward trend in school capacity is not reflected in utilization rates. In particular, DCPS schools have experienced a steady decline in utilization over the past decade, in stark contrast to the increasing rates seen in charter schools. This divergence may be driven by demographic shifts, evolving education policies, and a growing preference for public charter schools. Importantly, it also highlights the risk of relying solely on lower-grade enrollment trends to forecast future demand at higher grade levels.
time_series_plot("Utilization")
We further enquired the spatial patterns of elementary schools’ average utilization rate (by school boundary and wards) in Washington DC over two five year periods (SY13-18 and SY18-23). Note: We also conducted analysis on average capacity and enrollment rate, but the spatial clustering of those two factors evolved less significantly compared to utilization.
Overall, we observed a decreased average utilization rate across DCPS elementary schools across both school boundaries and wards. The decreases are the most severe in school boundaries in Ward 7 and Ward 8, but the lowest average utilization rate continues to be observed in Ward 5.
Comparatively, changes in average utilization rate across Public Charter elementary schools appear more versatile. For instance, while certain school districts from wards 7 and 8 have seen improved utilization rate of Public Charter elementary schools over the years, the gains are counteracted by decrease in adjacent school boundaries, leading to an overall drop in ward-level utilization rate. Nonetheless, this suggests a shift in the spatial distribution of student enrollment.
# SY13-14 ~ SY17-18 and SY18-19 ~ SY22-23, total capacity by school boundary
school_info_sf <-st_as_sf(as.data.frame(School_info), coords = c("MAR longitude", "MAR latitude"), crs = 4326)
school_boundary_sf <- st_join(school_info_sf, boundary, join = st_within)
## mutate a new variable: year_group, including: SY13-18, SY18-23
school_boundary_sf <- school_boundary_sf %>%
mutate(year_group = case_when(
`School year` %in% c("SY13-14", "SY14-15", "SY15-16", "SY16-17", "SY17-18") ~ "SY13-18",
`School year` %in% c("SY18-19", "SY19-20", "SY20-21", "SY21-22", "SY22-23") ~ "SY18-23",
TRUE ~ NA_character_ # dealing the no match case
))
# Calculate the sum of school capacities for each boundary, grouped by time period and School sector
mean_by_boundary <- school_boundary_sf %>%
group_by(NAME, year_group, `School sector`) %>%
summarise(
mean_capacity = mean(`Facility capacity`, na.rm = TRUE),
mean_enrollment = mean(Enrollment, na.rm = TRUE),
mean_utilization = mean(Utilization, na.rm = TRUE)
) %>%
na.omit() %>%
st_drop_geometry()
# merge the capacity, enrollment and utilization data to the Boundary data
Boundary <- boundary %>%
left_join(mean_by_boundary, by = "NAME")
# filter by School sector
dcps_data <- Boundary %>% filter(`School sector` == "DCPS")
charter_data <- Boundary %>% filter(`School sector` == "Public charter")
# function for visualization
base_map <- function(geodata, argbysector, total_arg){
ggplot() +
geom_sf(data = geodata, fill = "transparent") +
geom_sf(data = argbysector, aes(fill = .data[[total_arg]])) +
#scale_fill_viridis_c(option = "magma", trans = "log", direction = -1) +
scale_fill_gradientn(colours = c("#440154", "white", "#fc8961"), trans="log")+
facet_wrap(~ year_group) +
theme_void() +
labs(fill = total_arg) +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
plot.title = element_text(hjust = 0.5))
}
## visualize by ward
wards <- st_read("https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Administrative_Other_Boundaries_WebMercator/MapServer/53/query?outFields=*&where=1%3D1&f=geojson", quiet = TRUE)
# Group school_boundary_info by Ward(2022) and year_group, and calculate the total capacity
ward_summary <- school_boundary_sf %>%
group_by(`Ward (2022)`, year_group, `School sector`) %>%
summarise(
mean_capacity = mean(`Facility capacity`, na.rm = TRUE),
mean_enrollment = mean(Enrollment, na.rm = TRUE),
mean_utilization = mean(Utilization, na.rm = TRUE)
) %>%
na.omit()%>%
st_drop_geometry()
# spatial join
wards <- wards %>%
left_join(ward_summary, by = c("NAME" = "Ward (2022)"))
# filter by School sector
dcps_data_ward <- wards %>% filter(`School sector` == "DCPS")
charter_data_ward <- wards %>% filter(`School sector` == "Public charter")
dcps_plot_uti <- base_map(Boundary, dcps_data, "mean_utilization") + ggtitle("Average Utilization Rate by School Boundary (DCPS)")
charter_plot_uti <- base_map(Boundary, charter_data, "mean_utilization") + ggtitle("Average Utilization Rate by School Boundary (Public Charter)")
print(dcps_plot_uti)
print(charter_plot_uti)
A closer look at enrollment patterns reveals shifting in-boundary and out-of-boundary behaviors within DCPS schools. The following interactive maps illustrate kindergarten enrollment flows to high-utilization DCPS schools (those operating at 90% capacity or higher) in 2014, 2018, and 2022. In these maps, each line connects the estimated centroid of a student’s home census tract to their school’s location. Yellow-to-purple line pairs represent in-boundary enrollment flows, while purple-to-purple lines indicate out-of-boundary enrollment. Line thickness corresponds to the number of students, with thicker lines indicating greater volume. Note: Only home tracts that sent three or more students to a school in a given year are shown to highlight stronger school-tract relationships while protecting individual privacy. Tooltip information excludes names of census tracts and schools.
As the maps reveal, most high-utilization schools primarily serve in-boundary students and are located in neighborhoods with varying income and homeownership levels. However, over time, there has been a growing concentration of these schools in Wards 3, 4, and 6. Concurrently, in-boundary enrollments have strengthened, while out-of-boundary enrollments have declined.
These shifts indicate that high enrollment is increasingly driven by neighborhood-based preferences rather than static choice. As a result, relying on past lower-grade enrollment figures to forecast future demand at higher grade levels may introduce risk—particularly if those higher grades serve broader or different catchment areas, or if the observed in-boundary demand does not persist across the years, violating the progression assumption.
## identify high-utilization schools
near_capacity_2014 <- schools.sf %>%
st_drop_geometry() %>%
filter(utilization > 90 & school_year == "SY14-15" & school_sector == "DCPS") %>%
dplyr::pull(school_id)
near_capacity_2018 <- schools.sf %>%
st_drop_geometry() %>%
filter(utilization > 90 & school_year == "SY18-19" & school_sector == "DCPS") %>%
dplyr::pull(school_id)
near_capacity_2022 <- schools.sf %>%
st_drop_geometry() %>%
filter(utilization > 90 & school_year == "SY22-23" & school_sector == "DCPS") %>%
dplyr::pull(school_id)
elem_centroid <- st_centroid(tracts12_23)
both_centroids_14 <- elem_centroid %>%
rename(residence_census_tract = GEOID,
census_year = year) %>%
left_join(elementary_students, by = c("residence_census_tract", "census_year")) %>%
drop_na(total_students)
both_centroids_14$geom2 = both_centroids_14 %>% st_drop_geometry() %>% st_as_sf(coords=c("school_main_facility_long","school_main_facility_lat")) %>% st_geometry()
both_centroids_14 %>%
filter(
census_year == 2014 & school_code %in% near_capacity_2014 & grade_level == "K" & total_students >= 3
) %>%
group_by(residence_census_tract) %>%
mutate(weight = total_students,
pct_own = round(pct_own * 100, 2),
pct_vacant = round(pct_vacant * 100, 2),
pct_poverty = round(pct_poverty * 100, 2),
boundary_word = if_else(in_boundary_indicator == 0, "No", "Yes"),
tooltip = paste0(total_students, " students",
"<br>", "In Boundary School? ", boundary_word,
"<br><br>", "Total Population: ", total_population,
"<br>", "Housing Units: ", total_housing_units,
"<br>", "Vacancy rate: ", pct_vacant,"%",
"<br>", "Tract homeownership rate: ", pct_own, "%",
"<br>", "Tract median gross rent: ", median_gross_rent,
"<br>", "Median HH Income: ", median_household_income,
"<br>", "Poverty rate: ", pct_poverty, "%")) %>%
mapdeck(token = token, style = mapdeck_style("light")) %>%
add_arc(origin = "geometry",
destination = "geom2",
stroke_width = "weight",
stroke_from = "boundary_word",
stroke_to = "census_year",
tooltip = "tooltip",
auto_highlight = TRUE,
legend= list( stroke_from = TRUE, stroke_to = FALSE ),
legend_options = list(
stroke_from = list( title = "In-boundary School"))) %>%
add_polygon(
data = boundary,
fill_opacity = 0,
stroke_width = 15,
stroke_colour = "#708090",
layer = "polygon_layer") %>%
mapdeck_view(
location = c(-77.0395304337645, 38.892799521270774),
# set the zoom level
zoom = 10,
# set the pitch angle
pitch = 30,
)
both_centroids <- elem_centroid %>%
rename(residence_census_tract = GEOID,
census_year = year) %>%
left_join(elementary_students, by = c("residence_census_tract", "census_year")) %>%
drop_na(total_students)
both_centroids$geom2 = both_centroids %>% st_drop_geometry() %>% st_as_sf(coords=c("school_main_facility_long","school_main_facility_lat")) %>% st_geometry()
both_centroids %>%
filter(
census_year == 2018 & school_code %in% near_capacity_2018 & grade_level == "K" & total_students >= 3
) %>%
group_by(residence_census_tract) %>%
mutate(weight = total_students,
pct_own = round(pct_own * 100, 2),
pct_vacant = round(pct_vacant * 100, 2),
pct_poverty = round(pct_poverty * 100, 2),
boundary_word = if_else(in_boundary_indicator == 0, "No", "Yes"),
tooltip = paste0(total_students, " students",
"<br>", "In Boundary School? ", boundary_word,
"<br><br>", "Total Population: ", total_population,
"<br>", "Housing Units: ", total_housing_units,
"<br>", "Vacancy rate: ", pct_vacant,"%",
"<br>", "Tract homeownership rate: ", pct_own, "%",
"<br>", "Tract median gross rent: ", median_gross_rent,
"<br>", "Median HH Income: ", median_household_income,
"<br>", "Poverty rate: ", pct_poverty, "%")) %>%
mapdeck(token = token, style = mapdeck_style("light")) %>%
add_arc(origin = "geometry",
destination = "geom2",
stroke_width = "weight",
stroke_from = "boundary_word",
stroke_to = "census_year",
tooltip = "tooltip",
auto_highlight = TRUE,
legend= list( stroke_from = TRUE, stroke_to = FALSE ),
legend_options = list(
stroke_from = list( title = "In-boundary School"))) %>%
add_polygon(
data = boundary,
fill_opacity = 0,
stroke_width = 15,
stroke_colour = "#708090",
layer = "polygon_layer") %>%
mapdeck_view(
location = c(-77.0395304337645, 38.892799521270774),
# set the zoom level
zoom = 10,
# set the pitch angle
pitch = 30,
)
both_centroids_22 <- elem_centroid %>%
rename(residence_census_tract = GEOID,
census_year = year) %>%
left_join(elementary_students, by = c("residence_census_tract", "census_year")) %>%
drop_na(total_students)
both_centroids_22$geom2 = both_centroids_22 %>% st_drop_geometry() %>% st_as_sf(coords=c("school_main_facility_long","school_main_facility_lat")) %>% st_geometry()
both_centroids_14 %>%
filter(
census_year == 2022 & school_code %in% near_capacity_2022 & grade_level == "K" & total_students >= 3
) %>%
group_by(residence_census_tract) %>%
mutate(weight = total_students,
pct_own = round(pct_own * 100, 2),
pct_vacant = round(pct_vacant * 100, 2),
pct_poverty = round(pct_poverty * 100, 2),
boundary_word = if_else(in_boundary_indicator == 0, "No", "Yes"),
tooltip = paste0(total_students, " students",
"<br>", "In Boundary School? ", boundary_word,
"<br><br>", "Total Population: ", total_population,
"<br>", "Housing Units: ", total_housing_units,
"<br>", "Vacancy rate: ", pct_vacant,"%",
"<br>", "Tract homeownership rate: ", pct_own, "%",
"<br>", "Tract median gross rent: ", median_gross_rent,
"<br>", "Median HH Income: ", median_household_income,
"<br>", "Poverty rate: ", pct_poverty, "%")) %>%
mapdeck(token = token, style = mapdeck_style("light")) %>%
add_arc(origin = "geometry",
destination = "geom2",
stroke_width = "weight",
stroke_from = "boundary_word",
stroke_to = "census_year",
tooltip = "tooltip",
auto_highlight = TRUE,
legend= list( stroke_from = TRUE, stroke_to = FALSE ),
legend_options = list(
stroke_from = list( title = "In-boundary School"))) %>%
add_polygon(
data = boundary,
fill_opacity = 0,
stroke_width = 15,
stroke_colour = "#708090",
layer = "polygon_layer") %>%
mapdeck_view(
location = c(-77.0395304337645, 38.892799521270774),
# set the zoom level
zoom = 10,
# set the pitch angle
pitch = 30,
)
To establish a more comprehensive approach, our tool incorporates the use of socioeconomic characteristics pertaining to both the school’s census tract and a student’s origin census tract (i.e., the census tract their home is) into the formula. This meets our client’s request of “input demographic data” to build future forecasts as addressed in the case study and corroborates DME’s continued effort to balance resource efficiency at DCPS schools through boundary revisions. For instance, DME completed a review of DC elementary school boundaries in 2014, with subsequent changes implemented between SY15-16. According to the “Elementary School Fact Sheet - SY2014-15” the purpose of the revision was to address previously closed schools as well as to ensure that there are a sufficient and evenly distributed number of students living within each of the boundaries. Overlaying with ACS data before and after the school boundary study (year 2010-2014 and 2014-2018), the DCPS zone revision corresponds to the following results:
Within the affected zones, there has been some increase in school-aged population, particularly in the Southwest (SW) Quadrant.
Along with the population increase, there has also been a simultaneous boom in housing units. (The map accounts for both traditional residential and rental units.)
The pressure of elementary school enrollment (approximated by the number of population aged 5 to 9 years’ old) was eased overall, as indicated by less clusterings and overall deconcentration.
ggplot(data = change_zones) +
geom_sf(data=city_boundary)+
geom_sf(aes(fill = CHANGE)) + # Set colors based on the 'CHANGE' attribute
scale_fill_manual(values=c("#440154","#fc8961"))+ # Automatically assign discrete colors
theme_minimal() +
labs(title = "Change Area Visualization by CHANGE Attribute",
fill = "Change Type") +
theme_void()+# Add title and legend
theme(plot.title = element_text(hjust = 0.5)) # Center the title
# Function to map total population with change_zones overlay
map_tot_pop <- function(data, title) {
ggplot() +
geom_sf(data = city_boundary, fill = "whitesmoke", color = "black", lwd = 0.2) +
geom_sf(data = data %>%
st_intersection(change_zones), aes(fill = tot_pop), color = NA) +
geom_sf(data = change_zones, fill = NA, color = "black", lwd = 0.3) +
scale_fill_gradientn(colors = c("#440154", "white", "#fc8961"))+
theme_void() +
theme(plot.title = element_text(hjust = 0.5))+
labs(title = title, fill = "Total Population")
}
# Generate maps for each year
map_2010_pop <- map_tot_pop(tract10_clip, "Total Population, 2010")
map_2014_pop <- map_tot_pop(tract14_clip, "2014")
map_2018_pop <- map_tot_pop(tract18_clip, "2018")
ggarrange(nrow=1,
map_2010_pop, map_2014_pop, map_2018_pop,
common.legend = TRUE, legend="bottom")
map_tot_housing <- function(data, title) {
ggplot() +
geom_sf(data = city_boundary, fill = "whitesmoke", color = "black", lwd = 0.2) +
geom_sf(data = data %>%
st_intersection(change_zones), aes(fill = tot_housing), color = NA) +
geom_sf(data = change_zones, fill = NA, color = "black", lwd = 0.3) +
scale_fill_gradientn(colors = c("#440154", "white", "#fc8961"))+
theme_void() +
labs(title = title, fill = "Total Housing")
}
# Generate maps for each year
map_2010_housing <- map_tot_housing(tract10_clip, "Total Housing, 2010")
map_2014_housing <- map_tot_housing(tract14_clip, "2014")
map_2018_housing <- map_tot_housing(tract18_clip, "2018")
ggarrange(nrow=1,
map_2010_housing, map_2014_housing, map_2018_housing,
common.legend = TRUE, legend="bottom")
map_school_age <- function(data, title) {
ggplot() +
geom_sf(data = city_boundary, fill = "whitesmoke", color = "black", lwd = 0.2) +
geom_sf(data = data %>%
st_intersection(change_zones), aes(fill = school_aged), color = NA) +
geom_sf(data = change_zones, fill = NA, color = "black", lwd = 0.3) +
scale_fill_gradientn(colors = c("#440154", "white", "#fc8961"))+
theme_void() +
theme(plot.title = element_text(hjust = 0.5))+
labs(title = title, fill = "Total School-aged Children")
}
# Generate maps for each year
map_2010_school_age <- map_school_age(tract10_clip, "School-aged Children, 2010")
map_2014_school_age <- map_school_age(tract14_clip, "2014")
map_2018_school_age <- map_school_age(tract18_clip, "2018")
# Arrange maps side by side
ggarrange(nrow=1,
map_2010_school_age,
map_2014_school_age,
map_2018_school_age,
common.legend = TRUE, legend="bottom")
Housing is a key feature in our predictive model, as development and consumption patterns offer insights into the current distribution of population and wealth while also serving as a proxy for long-term residential preferences and the spatial clustering of distinct community characteristics. The importance of incorporating planned housing construction into forecasting is further supported by the ongoing apartment boom in the D.C. region. According to RentCafe’s 2024 report, the Washington D.C. Metro area ranks seventh among all U.S. metropolitan areas in new apartment construction—signalling that housing dynamics will continue to shape residential choices and, by extension, school enrollment patterns.
Leveraging post-transaction residential property ownership, our preliminary analysis of the Computer Assisted Mass Appraisal (CAMA) – Residential dataset from Open Data DC reveals a distinct spatial clustering of high-priced homes concentrated in Wards 2 through 4, particularly in the NW Quadrant. These areas also tend to host schools with high utilization rates. Note: To ensure data quality and reduce historical bias, the dataset was filtered to include only sales occurring after 2002 and retained only the most recent transaction per property.
year_of_sale <- res_data %>%
group_by(SALEYEAR) %>%
summarize(Total_Units = sum(NUM_UNITS),
Total_GBA = sum(as.numeric(GBA))) %>% #use ground building area, since multiple buildings may exist in one SSL
ungroup()
# Plot Total Units by Year
ggplot(year_of_sale, aes(x = factor(SALEYEAR), y = Total_Units)) +
geom_bar(stat = "identity", fill = "#440154") +
labs(title = "Total Units by Year", x = as.factor("Year of Sale"), y = "Total Units") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
pal <- c("#fed395","#fea973","#e95462","#a3307e","#440154")
ggplot() +
geom_sf(data = wards22, fill = "whitesmoke", col = "white", lwd = 0.75)+
geom_sf(data = res_data, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = pal,
labels=qBr(res_data, "price"),
name="Quintile Breaks\nof Property Price") +
guides(colour = guide_legend(override.aes = list(size=3))) +
labs(title="Most Recent Sales Price by Property in Washington, D.C.") +
theme_void()+
theme(plot.title = element_text(size = 12, hjust = 0.5))
The additional tables were created to summarize housing characteristics by current Ward and DCPS School Boundaries in Washington, DC.
Table of Housing Properties by Wards
res_ward22 <- res_data %>%
group_by(ward_2022) %>%
summarize(
Tot_Units = sum(NUM_UNITS, na.rm = TRUE),
Med_Price = median(price, na.rm = TRUE),
Med_GBA = median(as.numeric(GBA), na.rm = TRUE),
Med_AYB = median(year_built, na.rm = TRUE),
Med_Rooms = median(rooms, na.rm = TRUE),
Med_Bedrooms = median(bedroom, na.rm = TRUE),
Med_Bathrooms = median(bathroom, na.rm = TRUE),
Med_Stories = median(as.numeric(STORIES), na.rm = TRUE)
)
res_ward22 <- st_drop_geometry(res_ward22)
kable(res_ward22, caption = "Summary of Residential Data by Ward (2022 boundary)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
scroll_box(width = "100%", height = "300px")
ward_2022 | Tot_Units | Med_Price | Med_GBA | Med_AYB | Med_Rooms | Med_Bedrooms | Med_Bathrooms | Med_Stories |
---|---|---|---|---|---|---|---|---|
Ward 1 | 5733 | 799000 | 1602 | 1911 | 7 | 4 | 2 | 2 |
Ward 2 | 4586 | 1350000 | 1908 | 1900 | 8 | 3 | 3 | 2 |
Ward 3 | 7492 | 1252509 | 2170 | 1937 | 8 | 4 | 3 | 2 |
Ward 4 | 10645 | 695250 | 1600 | 1930 | 7 | 4 | 2 | 2 |
Ward 5 | 12137 | 550000 | 1412 | 1930 | 7 | 3 | 2 | 2 |
Ward 6 | 8687 | 845000 | 1512 | 1910 | 6 | 3 | 2 | 2 |
Ward 7 | 11052 | 355000 | 1224 | 1944 | 6 | 3 | 2 | 2 |
Ward 8 | 6584 | 338000 | 1308 | 1949 | 7 | 3 | 2 | 2 |
Table of Housing Properties by DCPS Boundaries
res_DCPS <- res_data %>%
group_by(dcps_zone) %>%
summarize(
Tot_Units = sum(NUM_UNITS, na.rm = TRUE),
Med_Price = median(price, na.rm = TRUE),
Med_GBA = median(as.numeric(GBA), na.rm = TRUE),
Med_AYB = median(year_built, na.rm = TRUE),
Med_Rooms = median(rooms, na.rm = TRUE),
Med_Bedrooms = median(bedroom, na.rm = TRUE),
Med_Bathrooms = median(bathroom, na.rm = TRUE),
Med_Stories = median(as.numeric(STORIES), na.rm = TRUE)
)
kable(res_DCPS %>% st_drop_geometry(), caption = "Summary of Residential Data by Elementary School Boundary") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
scroll_box(width = "100%", height = "200px")
dcps_zone | Tot_Units | Med_Price | Med_GBA | Med_AYB | Med_Rooms | Med_Bedrooms | Med_Bathrooms | Med_Stories |
---|---|---|---|---|---|---|---|---|
Amidon-Bowen | 143 | 910000.0 | 1640.0 | 1967.0 | 7.0 | 3 | 3 | 2.00 |
Bancroft | 1080 | 1050000.0 | 1971.0 | 1916.0 | 7.0 | 4 | 3 | 2.25 |
Barnard | 1544 | 610000.0 | 1440.0 | 1929.0 | 7.0 | 3 | 2 | 2.00 |
Beers | 804 | 364400.0 | 1184.0 | 1942.5 | 6.0 | 3 | 2 | 2.00 |
Boone | 506 | 365000.0 | 1360.0 | 1937.0 | 6.0 | 3 | 2 | 2.00 |
Brent | 1261 | 1050000.0 | 1679.5 | 1900.0 | 7.0 | 3 | 2 | 2.00 |
Brightwood | 474 | 650000.0 | 1650.5 | 1936.0 | 7.0 | 4 | 2 | 2.00 |
Browne | 850 | 425000.0 | 1152.0 | 1937.0 | 6.0 | 3 | 2 | 2.00 |
Bruce Monroe @ Park View | 1332 | 639500.0 | 1488.0 | 1913.0 | 7.0 | 3 | 2 | 2.00 |
Bunker Hill | 2077 | 500000.0 | 1292.0 | 1948.0 | 7.0 | 3 | 2 | 2.00 |
Burroughs | 944 | 600000.0 | 1549.0 | 1927.0 | 7.0 | 3 | 2 | 2.00 |
Burrville | 711 | 289000.0 | 1026.0 | 1945.0 | 6.0 | 3 | 2 | 2.00 |
Cleveland | 540 | 750000.0 | 1464.0 | 1907.0 | 6.0 | 3 | 2 | 2.00 |
Cooke, H.D | 233 | 955000.0 | 2223.0 | 1911.5 | 8.0 | 4 | 3 | 2.50 |
Drew | 659 | 280500.0 | 1276.0 | 1951.0 | 6.0 | 3 | 2 | 2.00 |
Eaton | 628 | 1602000.0 | 2426.0 | 1925.0 | 9.0 | 4 | 3 | 2.25 |
Garfield | 384 | 288900.0 | 1024.0 | 1947.0 | 6.0 | 3 | 2 | 2.00 |
Garrison | 1368 | 1025000.0 | 1666.0 | 1895.0 | 7.0 | 3 | 2 | 2.00 |
Harris, C. W | 917 | 340000.0 | 1292.0 | 1950.0 | 6.0 | 3 | 2 | 2.00 |
Hearst | 497 | 1125000.0 | 1920.0 | 1929.0 | 8.0 | 4 | 3 | 2.00 |
Hendley | 573 | 287500.0 | 1152.0 | 1955.0 | 6.0 | 3 | 2 | 2.00 |
Houston | 669 | 328607.5 | 1343.5 | 1940.0 | 7.0 | 3 | 2 | 2.00 |
Hyde-Addison | 1966 | 1450000.0 | 1803.5 | 1900.0 | 8.0 | 3 | 3 | 2.00 |
Janney | 1396 | 1002000.0 | 1852.0 | 1937.0 | 7.0 | 3 | 3 | 2.00 |
Ketcham | 915 | 347000.0 | 1344.0 | 1936.0 | 6.0 | 3 | 2 | 2.00 |
Key | 1411 | 1325000.0 | 2315.0 | 1951.0 | 9.0 | 4 | 3 | 2.00 |
Kimball | 379 | 310750.0 | 1362.5 | 1942.0 | 6.0 | 3 | 2 | 2.00 |
King, Martin Luther | 863 | 317650.0 | 1548.0 | 1941.0 | 7.0 | 3 | 2 | 2.00 |
Lafayette | 1706 | 1090000.0 | 2159.0 | 1936.0 | 8.0 | 4 | 3 | 2.00 |
Langdon | 1625 | 521990.0 | 1596.0 | 1951.0 | 7.0 | 3 | 2 | 2.00 |
Langley | 2398 | 699000.0 | 1600.0 | 1910.0 | 7.0 | 4 | 3 | 2.00 |
Lasalle-Backus | 1027 | 460000.0 | 1152.0 | 1952.0 | 6.0 | 3 | 1 | 2.00 |
Leckie | 312 | 335000.0 | 1152.0 | 1950.0 | 6.0 | 3 | 2 | 2.00 |
Lewis | 1223 | 770000.0 | 1818.0 | 1922.0 | 8.0 | 4 | 3 | 2.00 |
Ludlow-Taylor | 1167 | 877500.0 | 1579.0 | 1909.0 | 6.5 | 3 | 2 | 2.00 |
Malcolm X | 264 | 422940.0 | 1600.0 | 2008.0 | 8.0 | 3 | 2 | 2.00 |
Mann | 987 | 1675000.0 | 3036.0 | 1952.0 | 9.5 | 5 | 4 | 2.00 |
Maury | 1383 | 761500.0 | 1387.0 | 1914.0 | 6.0 | 3 | 2 | 2.00 |
Miner | 1643 | 620000.0 | 1312.0 | 1925.0 | 6.0 | 3 | 2 | 2.00 |
Moten | 594 | 354000.0 | 1368.0 | 2001.0 | 7.0 | 3 | 2 | 2.00 |
Murch | 1007 | 1186500.0 | 2303.5 | 1938.0 | 9.0 | 4 | 3 | 2.00 |
Nalle | 657 | 315000.0 | 1358.0 | 1950.0 | 6.0 | 3 | 2 | 2.00 |
Noyes | 1218 | 679000.0 | 1400.0 | 1928.0 | 7.0 | 3 | 2 | 2.00 |
Oyster-Adams | 756 | 1640000.0 | 2789.0 | 1922.0 | 10.0 | 5 | 3 | 2.50 |
Patterson | 469 | 355209.5 | 1713.0 | 1944.0 | 8.0 | 4 | 3 | 2.00 |
Payne | 1177 | 781750.0 | 1368.0 | 1918.0 | 6.0 | 3 | 2 | 2.00 |
Peabody-Watkins | 2270 | 919500.0 | 1596.0 | 1908.0 | 7.0 | 3 | 2 | 2.00 |
Plummer | 804 | 295000.0 | 1056.0 | 1951.0 | 6.0 | 3 | 2 | 2.00 |
Powell | 782 | 850000.0 | 1845.0 | 1923.0 | 8.0 | 4 | 3 | 2.00 |
Randle Highlands | 930 | 407626.0 | 1373.0 | 1941.0 | 7.0 | 3 | 2 | 2.00 |
Raymond | 729 | 740000.0 | 1536.0 | 1912.0 | 7.0 | 4 | 3 | 2.00 |
Reed, Marie | 398 | 949000.0 | 1520.0 | 1907.0 | 7.0 | 3 | 2 | 2.00 |
Ross | 348 | 1585000.0 | 2352.0 | 1890.0 | 9.5 | 4 | 3 | 3.00 |
Savoy | 314 | 311500.0 | 1272.0 | 1943.0 | 6.0 | 3 | 2 | 2.00 |
School Without Walls @ Francis-Stevens | 522 | 2012500.0 | 3100.0 | 1912.5 | 11.0 | 4 | 4 | 3.00 |
Seaton | 1100 | 776000.0 | 1625.0 | 1900.0 | 7.0 | 3 | 2 | 2.00 |
Shepherd | 627 | 865000.0 | 2211.0 | 1937.0 | 8.0 | 4 | 3 | 2.00 |
Simon | 429 | 323500.0 | 1296.0 | 1950.0 | 6.0 | 3 | 2 | 2.00 |
Smothers | 808 | 318000.0 | 1176.0 | 1941.0 | 6.0 | 3 | 2 | 2.00 |
Stanton | 230 | 349950.0 | 1278.0 | 1951.0 | 7.0 | 3 | 2 | 2.00 |
Stoddert | 795 | 1070500.0 | 1547.0 | 1931.0 | 8.0 | 4 | 3 | 2.00 |
Takoma | 657 | 625000.0 | 1621.0 | 1932.0 | 7.0 | 3 | 2 | 2.00 |
Thomas | 766 | 318750.0 | 1040.0 | 1942.0 | 6.0 | 3 | 2 | 2.00 |
Thomson | 78 | 1105000.0 | 2340.0 | 1900.0 | 8.0 | 4 | 3 | 2.50 |
Truesdell | 1033 | 595000.0 | 1404.0 | 1925.5 | 8.0 | 3 | 2 | 2.00 |
Tubman | 1256 | 691000.0 | 1520.0 | 1910.0 | 6.0 | 4 | 2 | 2.00 |
Turner | 620 | 310000.0 | 1360.0 | 2006.0 | 7.0 | 3 | 2 | 2.00 |
Tyler | 494 | 765000.0 | 1298.0 | 1909.0 | 6.0 | 3 | 2 | 2.00 |
Van Ness | 448 | 728032.5 | 1440.0 | 2009.0 | 7.0 | 3 | 2 | 3.00 |
Walker-Jones | 396 | 670000.0 | 1470.0 | 1906.0 | 7.0 | 3 | 2 | 2.00 |
Wheatley | 1920 | 499000.0 | 1368.0 | 1927.0 | 7.0 | 3 | 2 | 2.00 |
Whitlock | 588 | 300501.0 | 1190.0 | 1944.0 | 6.0 | 3 | 2 | 2.00 |
Whittier | 1703 | 545000.0 | 1368.0 | 1931.0 | 7.0 | 3 | 2 | 2.00 |
Wilson, J.O. | 1134 | 706750.0 | 1393.0 | 1912.0 | 6.0 | 3 | 2 | 2.00 |
Over the past decade, there has been a noticeable increase in both single-family and multi-family residential transactions, often occurring concurrently across different parts of the city. When disaggregated by DCPS elementary school boundaries, the broader price trends remain consistent; however, the distribution of housing types diverges. Single-family and multi-family units tend to cluster in different school zones, suggesting underlying patterns of residential segmentation.
res_DCPS <- res_DCPS %>%
st_drop_geometry() %>%
inner_join(boundary, by = c("dcps_zone" = "NAME")) %>%
st_as_sf()
ggplot() +
geom_sf(data = res_DCPS, aes(fill = Med_Price), size = 0.75) +
scale_fill_gradientn(colors = pal, name = "Median Price ($)") +
labs(title = "Housing Price by Current Elementary DCPS Zone") +
theme_void() +
theme(plot.title = element_text(size = 12, hjust = 0.5))
single_family <- res_data %>%
filter(NUM_UNITS == 1)
single_family_DCPS <- single_family %>%
group_by(dcps_zone) %>%
summarize(
Tot_Units = sum(NUM_UNITS, na.rm = TRUE),
Med_Price = median(price, na.rm = TRUE),
Med_GBA = median(as.numeric(GBA), na.rm = TRUE),
Med_AYB = median(year_built, na.rm = TRUE),
Med_Rooms = median(rooms, na.rm = TRUE),
Med_Bedrooms = median(bedroom, na.rm = TRUE),
Med_Bathrooms = median(bathroom, na.rm = TRUE),
Med_Stories = median(as.numeric(STORIES), na.rm = TRUE)
) %>%
st_drop_geometry()
single_family_DCPS <- single_family_DCPS %>%
inner_join(boundary, by = c("dcps_zone" = "NAME")) %>%
st_as_sf()
ggplot() +
geom_sf(data = single_family_DCPS, aes(fill = Tot_Units), size = 0.75) +
scale_fill_gradientn(colors = pal, name = "Total Units") +
labs(title = "Single Family Housing Units by Current Elementary DCPS Zone") +
theme_void() +
theme(plot.title = element_text(size = 12, hjust = 0.5))
multi_family <- res_data %>%
filter(NUM_UNITS > 1)
multi_family_DCPS <- multi_family %>%
group_by(dcps_zone) %>%
summarize(
Tot_Units = sum(NUM_UNITS, na.rm = TRUE),
Med_Price = median(price, na.rm = TRUE),
Med_GBA = median(as.numeric(GBA), na.rm = TRUE),
Med_AYB = median(year_built, na.rm = TRUE),
Med_Rooms = median(rooms, na.rm = TRUE),
Med_Bedrooms = median(bedroom, na.rm = TRUE),
Med_Bathrooms = median(bathroom, na.rm = TRUE),
Med_Stories = median(as.numeric(STORIES), na.rm = TRUE)
) %>%
st_drop_geometry()
multi_family_DCPS <- multi_family_DCPS %>%
inner_join(boundary %>% as_tibble(), by = c("dcps_zone" = "NAME")) %>%
st_as_sf()
ggplot() +
geom_sf(data = multi_family_DCPS, aes(fill = Tot_Units), size = 0.75) +
scale_fill_gradientn(colors = c("#ffffff", "#f98e09", "#440154"), name = "Total Units") +
labs(title = "Multi-Family Housing Units by Current Elementary DCPS Zones") +
theme_void() +
theme(plot.title = element_text(size = 12, hjust = 0.5))
Data of construction permits from 2013 to 2023 are downloaded from OpenData DC Building Permits (https://opendata.dc.gov/search?q=building%20permits). We filter out for building permits that are “Construction” type for our analysis. Broadly, construction encompass activities like alteration and repair, civil plan, demolition and new buildings. We decided to use activities classified as construction as we were interested in how these activities near a school could influence student enrollment numbers.
The total number of construction permits issued annually in DC shows fluctuation. From 2013 to 2019, the total number of permits issued increased year-on-year. This decreased from 2019 to 2023, likely due to the supply-chain disruptions from Covid-19.
The total number of construction permits issued annually by wards also follow similar trends, with most wards having an increase in construction permit from 2013 to 2019 but a decrease there after. Ward 2 and 6 have the highest number of construction permits issued compared to other wards.
Taking 2023 as an example, we observe that there is a spatial distribution of construction permits issued by DCPS Zones.There appears to be less construction permits in school zones in the south while large number of permits are clustered in the east.
# Total number of construction permits issued annually
ggplot(combined_construction_data %>%
group_by(year) %>%
summarise(total_count = n()),
aes(x=year, y=total_count)) +
geom_bar(stat="identity", fill="#a3307e")+
labs(title = "Total number of construction permits issued annually", x="Year", y="Total Construction Permits")+
theme_minimal()
# Total number of construction permits issued annually by wards
ggplot(combined_construction_data %>%
group_by(year, WARD) %>%
summarise(total_count = n()),
aes(x=year, y=total_count, color=WARD)) +
geom_point() +
geom_line() +
scale_color_manual(values = c("#fcfdbf", "#fed395", "#fea973", "#f1605d", "#a3307e", "#0D0887", "#440154", "#180f3d")) +
scale_x_continuous(breaks = seq(min(combined_construction_data$year), max(combined_construction_data$year), by = 1))+
labs(title = "Total number of construction permits issued annually by wards", x="Year", y="Total Construction Permits")
theme_minimal()
## List of 136
## $ line :List of 6
## ..$ colour : chr "black"
## ..$ linewidth : num 0.5
## ..$ linetype : num 1
## ..$ lineend : chr "butt"
## ..$ arrow : logi FALSE
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_line" "element"
## $ rect :List of 5
## ..$ fill : chr "white"
## ..$ colour : chr "black"
## ..$ linewidth : num 0.5
## ..$ linetype : num 1
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_rect" "element"
## $ text :List of 11
## ..$ family : chr ""
## ..$ face : chr "plain"
## ..$ colour : chr "black"
## ..$ size : num 11
## ..$ hjust : num 0.5
## ..$ vjust : num 0.5
## ..$ angle : num 0
## ..$ lineheight : num 0.9
## ..$ margin : 'margin' num [1:4] 0points 0points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : logi FALSE
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ title : NULL
## $ aspect.ratio : NULL
## $ axis.title : NULL
## $ axis.title.x :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 1
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 2.75points 0points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.title.x.top :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 0
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 0points 2.75points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.title.x.bottom : NULL
## $ axis.title.y :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 1
## ..$ angle : num 90
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 2.75points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.title.y.left : NULL
## $ axis.title.y.right :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 1
## ..$ angle : num -90
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 0points 0points 2.75points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : chr "grey30"
## ..$ size : 'rel' num 0.8
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text.x :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 1
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 2.2points 0points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text.x.top :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : NULL
## ..$ vjust : num 0
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 0points 2.2points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text.x.bottom : NULL
## $ axis.text.y :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 1
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 2.2points 0points 0points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text.y.left : NULL
## $ axis.text.y.right :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 0
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 0points 0points 2.2points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.text.theta : NULL
## $ axis.text.r :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 0.5
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : 'margin' num [1:4] 0points 2.2points 0points 2.2points
## .. ..- attr(*, "unit")= int 8
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ axis.ticks : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ axis.ticks.x : NULL
## $ axis.ticks.x.top : NULL
## $ axis.ticks.x.bottom : NULL
## $ axis.ticks.y : NULL
## $ axis.ticks.y.left : NULL
## $ axis.ticks.y.right : NULL
## $ axis.ticks.theta : NULL
## $ axis.ticks.r : NULL
## $ axis.minor.ticks.x.top : NULL
## $ axis.minor.ticks.x.bottom : NULL
## $ axis.minor.ticks.y.left : NULL
## $ axis.minor.ticks.y.right : NULL
## $ axis.minor.ticks.theta : NULL
## $ axis.minor.ticks.r : NULL
## $ axis.ticks.length : 'simpleUnit' num 2.75points
## ..- attr(*, "unit")= int 8
## $ axis.ticks.length.x : NULL
## $ axis.ticks.length.x.top : NULL
## $ axis.ticks.length.x.bottom : NULL
## $ axis.ticks.length.y : NULL
## $ axis.ticks.length.y.left : NULL
## $ axis.ticks.length.y.right : NULL
## $ axis.ticks.length.theta : NULL
## $ axis.ticks.length.r : NULL
## $ axis.minor.ticks.length : 'rel' num 0.75
## $ axis.minor.ticks.length.x : NULL
## $ axis.minor.ticks.length.x.top : NULL
## $ axis.minor.ticks.length.x.bottom: NULL
## $ axis.minor.ticks.length.y : NULL
## $ axis.minor.ticks.length.y.left : NULL
## $ axis.minor.ticks.length.y.right : NULL
## $ axis.minor.ticks.length.theta : NULL
## $ axis.minor.ticks.length.r : NULL
## $ axis.line : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ axis.line.x : NULL
## $ axis.line.x.top : NULL
## $ axis.line.x.bottom : NULL
## $ axis.line.y : NULL
## $ axis.line.y.left : NULL
## $ axis.line.y.right : NULL
## $ axis.line.theta : NULL
## $ axis.line.r : NULL
## $ legend.background : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ legend.margin : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
## ..- attr(*, "unit")= int 8
## $ legend.spacing : 'simpleUnit' num 11points
## ..- attr(*, "unit")= int 8
## $ legend.spacing.x : NULL
## $ legend.spacing.y : NULL
## $ legend.key : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ legend.key.size : 'simpleUnit' num 1.2lines
## ..- attr(*, "unit")= int 3
## $ legend.key.height : NULL
## $ legend.key.width : NULL
## $ legend.key.spacing : 'simpleUnit' num 5.5points
## ..- attr(*, "unit")= int 8
## $ legend.key.spacing.x : NULL
## $ legend.key.spacing.y : NULL
## $ legend.frame : NULL
## $ legend.ticks : NULL
## $ legend.ticks.length : 'rel' num 0.2
## $ legend.axis.line : NULL
## $ legend.text :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : 'rel' num 0.8
## ..$ hjust : NULL
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ legend.text.position : NULL
## $ legend.title :List of 11
## ..$ family : NULL
## ..$ face : NULL
## ..$ colour : NULL
## ..$ size : NULL
## ..$ hjust : num 0
## ..$ vjust : NULL
## ..$ angle : NULL
## ..$ lineheight : NULL
## ..$ margin : NULL
## ..$ debug : NULL
## ..$ inherit.blank: logi TRUE
## ..- attr(*, "class")= chr [1:2] "element_text" "element"
## $ legend.title.position : NULL
## $ legend.position : chr "right"
## $ legend.position.inside : NULL
## $ legend.direction : NULL
## $ legend.byrow : NULL
## $ legend.justification : chr "center"
## $ legend.justification.top : NULL
## $ legend.justification.bottom : NULL
## $ legend.justification.left : NULL
## $ legend.justification.right : NULL
## $ legend.justification.inside : NULL
## $ legend.location : NULL
## $ legend.box : NULL
## $ legend.box.just : NULL
## $ legend.box.margin : 'margin' num [1:4] 0cm 0cm 0cm 0cm
## ..- attr(*, "unit")= int 1
## $ legend.box.background : list()
## ..- attr(*, "class")= chr [1:2] "element_blank" "element"
## $ legend.box.spacing : 'simpleUnit' num 11points
## ..- attr(*, "unit")= int 8
## [list output truncated]
## - attr(*, "class")= chr [1:2] "theme" "gg"
## - attr(*, "complete")= logi TRUE
## - attr(*, "validate")= logi TRUE
# Construction counts in 2023 by school zones
construction_data_2023 <- combined_construction_data[combined_construction_data$year=="2023",] %>%
st_transform(crs = st_crs(boundary))
construction_within_dcps <- st_join(construction_data_2023, boundary, join = st_within)
construction_count <- construction_within_dcps %>%
as_tibble() %>%
group_by(DCPS_ID, year) %>%
summarise(construction_projects = n())
breaks <- c(0, 100, 200, 300, 400, 500, Inf)
labels <- c("0-100", "101-200", "201-300", "301-400", "401-500", "500+")
dcps_zones_2 <- left_join(boundary, construction_count, by="DCPS_ID") %>%
mutate(construction_group = cut(construction_projects, breaks = breaks, labels = labels, include.lowest = TRUE))
ggplot() +
geom_sf(data = dcps_zones_2, aes(fill = construction_group)) +
scale_fill_manual(values = c("#180f3d", "#440154", "#a3307e", "#f1605d", "#fea973", "#fed395", "#fcfdbf"),
limits = c("500+", "401-500", "301-400", "201-300", "101-200", "0-100"),
name = "Construction Permits Issued 2023") +
coord_sf() +
theme(legend.position = "right")
The rental housing market in DC is expanding, with over 32,000 units of luxury and mixed-income housing completed since 2014, over 21,000 affordable units completed, and thousands more in the pipeline. Housing construction is a priority in DC, with the recent announcement of the RENTAL Act, meant to protect affordable housing and strengthen the local housing ecosystem.
new_residential_polygon %>% group_by(NAME) %>% summarize(n=n(), sum_units=sum(UNITS)) %>%
ggplot(aes(fill=sum_units, label=NAME)) +
geom_sf()+
geom_sf_label(fill = "white",
fun.geometry = sf::st_centroid)+
scale_fill_gradientn(colors = c("#ffffff", "#f98e09", "#440154"), name="Units Built")+
labs(title="Units Built, 2014-2023")+
theme_void()
Construction of new rental housing has been most pronounced in Ward 6, but most Wards have seen over 1,000 new units in the ten-year period of these data. Affordable housing has also been on the rise, and its construction is less concentrated in one part of the city. The RENTAL Act should see the market continue to flourish with the incentivization of new housing construction and protections for renters being a priority for Mayor Bowser.
ggplot(new_residential_point)+
geom_sf(data=boundary, fill="whitesmoke", color="grey70", lwd=0.1)+
geom_sf(data=waterbodies, fill = "ghostwhite", color="lightsteelblue")+
#geom_sf(aes(size=UNITS), color="#fdb42f",alpha=0.9)+
geom_sf(data=affordable %>% filter(STATUS_PUBLIC != "Pipeline"), aes(size=TOTAL_AFFORDABLE_UNITS),
color = "#440154", alpha=0.6)+
geom_sf(aes(size=UNITS), color="#fdb42f",alpha=0.6)+
scale_size_continuous(range = c(0.5, 4), limits=c(0,1400), breaks =seq(0, 1400,by=300), name="Units Built")+
guides(size = guide_legend(override.aes = list(color="black")))+
labs(caption = "Source: Open Data DC\n",
title="New units in DC, 2013-2023", subtitle="Affordable Units in Purple")+
theme_void()+
theme(legend.position = "bottom")
Panel data is a dataset in which the behavior of entities (i) are observed across time (t). In this case, it is the behavior of student enrollment by grades, in each school, over time. One row of data will therefore show the number of Students for Grade X, in School Y, in Year Z.
To develop the best model supporting the app’s backend results, we engineered a panel data that enlists every unique school facility-grade level-school year combination, targeted at student enrollment 5 years later, with supporting predictors ranging from facility based data (enrollment, capacity, utilization), socioeconomic data (demographic, age, income, housing, education), construction permit records, residential mass appraisal data. From data collected, we further manipulated future and lag variables for construction permits (cumulative permits within 5 years), student enrollments (both lag records associated with previous years, as well as future variables indicating enrollment 1-3 year(s) later), and capacity (change in capacity 5 years later).
The list of variables in our panel data are below. For each type of variables added, please find more detailed elaborations in their respective section.Raw.Variable.Name | Defintion |
---|---|
school_year | School Year |
school_code | School Code |
school_name | School Name |
grade_level | Grade level |
unique_id | This ID was assigned to each school based on their geographic location, to differentiate them when modelling. It is a categorical variable |
census_year | The year this row of data corresponds to. For example SY13-14 is 2013 |
geoid | Geographic ID of School |
dcps_boundary | This is the name of DCPS School Zones Boundary the school is located in. It is a categorical variable |
school_sector | This is the type of public school in DC. It is a categorical variable of either DC Public School or Public Charter Schools |
facility_capacity | School Capacity: The maximum number of students who can be housed in a school building based on the number and types of spaces available and the type of educational plan offered. |
enrollment | Student Enrollment Numbers |
utilization | Utilization is a measurement of how full a facility is to meet educational programmatic needs. Utilization is calculated as the number of enrolled students over the capacity of the school facility. |
median_hhincome_indv_mean | Median Household Income, averaged from the proportion of student in each census tract. The following equation calculates this: (1/Total Students)×Σ(Number of students in census tract N)×(Median Household Income in census tract N) |
pct_poverty_indv_mean | Poverty rate, averaged from the proportion of students in each census tract. The following equation calculates this: (1/Total Students)×Σ(Number of students in census tract N)×(Poverty rate in census tract N) |
total_population | Total population, in the census tract the school is located in |
num_households | Total number of households, in the census tract the school is located in |
total_housing_units | Total housing units, in the census tract the school is located in |
median_gross_rent | Median Gross Rent, in the census tract the school is located in |
med_rent_pct_inc | Median Gross Rent as percentage of income, in the census tract the school is located in |
med_owner_cost_pct_inc | Median Owner Costs as percentage of income, in the census tract the school is located in |
med_rooms_renter | Median Number of Rooms in Renter Occupied housing, in the census tract the school is located in |
median_household_income | Median Household Income, in the census tract the school is located in |
avg_hh_size | Average Household Size, in the census tract the school is located in |
pct_nonwhite | Percentage of non-white population, in the census tract the school is located in |
pct_poverty | Poverty rate, in the census tract the school is located in |
pct_vacant | Percentage of vacant housing, in the census tract the school is located in |
pct_own | Percentage of owner occupied housing among total population, in the census tract the school is located in |
pct_rent | Percentage of renter occupied housing among total population, in the census tract the school is located in |
pct_education_no_school | Percentage of adults aged 25 and above without school qualifications, in the census tract the school is located in |
pct_education_hs | Percentage of adults aged 25 and above with high school qualifications, in the census tract the school is located in |
pct_education_bachelors | Percentage of adults aged 25 and above with bachelor’s degree, in the census tract the school is located in |
construction_projects | Total number of construction permits, in the census tract the school is located in |
avg_price_5_nearest | Average sales price of 5 nearest residential properties to the school |
med_bedroom_5_nearest | Median number of bedrooms of 5 nearest residential properties to the school |
cumulative_permits5yr | Total number of construction permits issued in the school’s census tract in past 5 years |
housing_1yrbefore | Total housing units, 1 year before the specified school year, in the census tract the school is located in |
pct_change_housing_yrbefore | Percentage change in total housing units from 1 year before, in the census tract the school is located in |
agg_total_students | This is the total number of students in the school for each grade, in the specified school year. The specified school year is contained in a separate column in the panel data |
geometry | Geographic location of the school in geometry form |
agg_total_students_1yrbefore | This is the total number of students in the school for each grade, 1 year before the specified school year |
agg_total_students_2yrbefore | This is the total number of students in the school for each grade, 2 years before the specified school year |
agg_total_students_3yrbefore | This is the total number of students in the school for each grade, 3 years before the specified school year |
agg_total_students_1yrlater | This is the total number of students in the school for each grade, 1 year after the specified school year |
agg_total_students_2yrlater | This is the total number of students in the school for each grade, 2 years after the specified school year |
agg_total_students_3yrlater | This is the total number of students in the school for each grade, 3 years after the specified school year |
agg_total_students_4yrlater | This is the total number of students in the school for each grade, 4 years after the specified school year |
agg_total_students_5yrlater | This is the total number of students in the school for each grade, 5 years after the specified school year |
capacity_5yrlater | Percentage change in school capacity 5 years after specified year. The specified school year is contained in a separate column in the panel data |
pct_change_capacity_5yrlater | Percentage change in school capacity 5 years after specified year. The specified school year is contained in a separate column in the panel data |
ward | This is the name of municipal ward in DC. It is a categorical variable. |
avg_util | Average utilization by wards |
The same school code can be associated with multiple facilities in D.C., and scatter across distinct school boundaries. To appreciate the difference, we created a uniqueID for each facility in the region (unique combinations of school_code, school_name, latitude, and longitude), ensuring maximum capacity, enrollment, and utilization accuracy is captured for each facility.
Due to code re-sequencing for narrative improvement, we repeated the same code as in section 1.4 below. This does not impose any changes to our final results.
elementary <- list('K','1','2','3','4','5')
# Manually removed a record associated with school_id == 247 - this is the only duplicated row
schools_cleaned <- schools %>%
filter(grade_band %in% c("Elementary", "K-8")) %>% # Neve added
group_by(school_year, lea_id, school_id, mar_latitude) %>%
slice_max(order_by = facility_capacity, n = 1, with_ties = FALSE) %>%
ungroup()
elementary_by_year <- students_by_tract_sy1314_to_sy2425 %>%
filter(grade_level %in% elementary) %>%
left_join(schools_cleaned %>%
dplyr::select(-school_name, -school_sector),
by=c("lea_code"="lea_id","school_code"="school_id","school_year"="school_year", "school_main_facility_lat" = "mar_latitude", "school_main_facility_long" = "mar_longitude")) %>%
dplyr::select(# original variables from the secret data
school_year, residence_census_tract, lea_code, school_code, school_name,
school_sector, in_boundary_indicator, school_main_facility_lat, school_main_facility_long,
grade_level, total_students,
# variables added through the merge
dcps_facility_name, facility_sector, grade_band,`ward_(2022)`,
dcps_boundary, facility_capacity, enrollment, utilization, unfilled_seats) %>%
rename(ward_22 = `ward_(2022)`) # tidied up workflow
# Create unique facility ID for each school facility
elementary_by_year <- elementary_by_year %>%
group_by(school_code, grade_band, school_main_facility_lat, school_main_facility_long) %>% # this ensures distinct facilities are selected
mutate(unique_id = paste0("F", sprintf("%06d", cur_group_id()))) %>%
ungroup()
# Create new column variable for matching census data
elementary_by_year <- elementary_by_year %>%
mutate(census_year = case_when(
school_year == 'SY13-14' ~ 2013,
school_year == 'SY14-15' ~ 2014,
school_year == 'SY15-16' ~ 2015,
school_year == 'SY16-17' ~ 2016,
school_year == 'SY17-18' ~ 2017,
school_year == 'SY18-19' ~ 2018,
school_year == 'SY19-20' ~ 2019,
school_year == 'SY20-21' ~ 2020,
school_year == 'SY21-22' ~ 2021,
school_year == 'SY22-23' ~ 2022,
school_year == 'SY23-24' ~ 2023,
school_year == 'SY24-25' ~ 2024,
TRUE ~ NA_real_
))
We respectively matched the tract-level information to each school facility’s unique neighborhood tract, and to home origin tracts of the recorded student cohort (to obtain the weighted average that represents the student’s household characteristics).
tracts12_23 <- tracts12_23 %>%
group_by(GEOID) %>%
arrange(GEOID, year, .by_group = TRUE) %>% # Ensures sort happens within each group
mutate(
housing_1yrbefore = lag(total_housing_units, n = 1),
pct_change_housing_yrbefore = (total_housing_units - housing_1yrbefore) / housing_1yrbefore
) %>%
ungroup()
Manually Create 2024 Rows Using Best Data Available (ACS 2023)
Because 2020-2024 ACS 5-Year Estimates were not available during the period of this project, after retrieving all ACS variables from 2019-2023 ACS 5-Year Estimates for census year 2023 and calculating change variables (i.e. housing_1yrbefore & pct_change_housing_yrbefore), we duplicated the 2023 rows as the “best available data” for 2024, this enables us to predict for SY29-30 using SY24-25 enrollment records.
tracts_2024 <- tracts12_23 %>%
filter(year == 2023) %>%
mutate(year = 2024)
# Then bind it back to the original data
tracts12_24 <- bind_rows(tracts12_23, tracts_2024) %>%
arrange(GEOID, year)
Create Individual Mean Variables Based on Student Cohort Makeup
tracts_no_geometry <- tracts12_24 %>%
st_drop_geometry() %>%
dplyr::select(GEOID, year, total_population,
population_white, population_black, population_native, population_asian, population_hispanic,
Male5to9, Female5to9,
num_households, total_housing_units, occupied_housing_units, vacant_housing_units,
median_gross_rent, med_rent_pct_inc, median_home_value, med_owner_cost_pct_inc,
pop_owner_occupied, pop_renter_occupied,
med_rooms_owner, med_rooms_renter, median_household_income, poverty_pop, pop_25above,
education_no_school, education_hs,
education_bachelors, kinder_enrollment, first_fourth_grade_enr, avg_hh_size)
test1<- elementary_by_year %>%
left_join(tracts_no_geometry, by = c("census_year" = "year", "residence_census_tract" = "GEOID"))
# Calculate the weighted mean of each variable
test2 <- test1 %>%
group_by(census_year, unique_id, grade_level) %>%
mutate(TotalPop_Indv_Mean = sum(total_population * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
Male5to9_Indv_Mean = sum(Male5to9 * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
Female5to9_Indv_Mean = sum(Female5to9 * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
NumberWhites_Indv_Mean = sum(population_white * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
NumberBlack_Indv_Mean = sum(population_black * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
Median_HHIncome_Indv_Mean = sum(median_household_income * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
NumberNative_Indv_Mean = sum(population_native * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
NumberAsian_Indv_Mean = sum(population_asian * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
NumberHispanic_Indv_Mean = sum(population_hispanic * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
NumHouseholds_Indv_Mean = sum(num_households * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
TotalHousingUnits_Indv_Mean = sum(total_housing_units * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
OccupiedHousingUnits_Indv_Mean = sum(occupied_housing_units * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
VacantHousingUnits_Indv_Mean = sum(vacant_housing_units * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
MedianGrossRent_Indv_Mean = sum(median_gross_rent * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
MedRentPctInc_Indv_Mean = sum(med_rent_pct_inc * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
MedianHomeValue_Indv_Mean = sum(median_home_value * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
MedOwnerCostPctInc_Indv_Mean = sum(med_owner_cost_pct_inc * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
PopOwnerOccupied_Indv_Mean = sum(pop_owner_occupied * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
PopRenterOccupied_Indv_Mean = sum(pop_renter_occupied * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
MedRoomsOwner_Indv_Mean = sum(med_rooms_owner * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
MedRoomsRenter_Indv_Mean = sum(med_rooms_renter * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
PovertyPop_Indv_Mean = sum(poverty_pop * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
Pop_25Above_Indv_Mean = sum(pop_25above * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
EducationNoSchool_Indv_Mean = sum(education_no_school * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
EducationHS_Indv_Mean = sum(education_hs * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
EducationBachelors_Indv_Mean = sum(education_bachelors * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
KinderEnrollment_Indv_Mean = sum(kinder_enrollment * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
FirstFourthGradeEnr_Indv_Mean = sum(first_fourth_grade_enr * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
AvgHHSize_Indv_Mean = sum(avg_hh_size * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE)
) %>%
ungroup()
test2 <- test2 %>%
mutate(
pct_5to9_Indv_Mean = (Male5to9_Indv_Mean + Female5to9_Indv_Mean) / TotalPop_Indv_Mean,
pct_nonwhite_Indv_Mean = 1 - (NumberWhites_Indv_Mean / TotalPop_Indv_Mean),
pct_poverty_Indv_Mean = PovertyPop_Indv_Mean / TotalPop_Indv_Mean,
pct_vacant_Indv_Mean = VacantHousingUnits_Indv_Mean / TotalHousingUnits_Indv_Mean,
pct_own_Indv_Mean = PopOwnerOccupied_Indv_Mean / TotalHousingUnits_Indv_Mean,
pct_rent_Indv_Mean = PopRenterOccupied_Indv_Mean / TotalHousingUnits_Indv_Mean,
pct_educationnoschool_Indv_Mean = EducationNoSchool_Indv_Mean / Pop_25Above_Indv_Mean,
pct_educationhs_Indv_Mean = EducationHS_Indv_Mean / Pop_25Above_Indv_Mean,
pct_educationbachelors_Indv_Mean = EducationBachelors_Indv_Mean / Pop_25Above_Indv_Mean
)
elementary_by_year_Indv_Mean <- test2 %>%
group_by(school_year, school_code, school_name, grade_level, unique_id, census_year,
school_sector,school_main_facility_lat, school_main_facility_long,
facility_sector, facility_capacity, enrollment, utilization, unfilled_seats, # Neve added these back, census year school capacity
TotalPop_Indv_Mean, Male5to9_Indv_Mean, Female5to9_Indv_Mean,
NumberWhites_Indv_Mean, NumberBlack_Indv_Mean, Median_HHIncome_Indv_Mean,
NumberNative_Indv_Mean, NumberAsian_Indv_Mean, NumberHispanic_Indv_Mean,
NumHouseholds_Indv_Mean, TotalHousingUnits_Indv_Mean, OccupiedHousingUnits_Indv_Mean,
VacantHousingUnits_Indv_Mean, MedianGrossRent_Indv_Mean, MedRentPctInc_Indv_Mean,
MedianHomeValue_Indv_Mean, MedOwnerCostPctInc_Indv_Mean, PopOwnerOccupied_Indv_Mean,
PopRenterOccupied_Indv_Mean, MedRoomsOwner_Indv_Mean, MedRoomsRenter_Indv_Mean,
PovertyPop_Indv_Mean, EducationNoSchool_Indv_Mean, EducationHS_Indv_Mean,
EducationBachelors_Indv_Mean, KinderEnrollment_Indv_Mean, FirstFourthGradeEnr_Indv_Mean,
AvgHHSize_Indv_Mean,
pct_5to9_Indv_Mean, pct_nonwhite_Indv_Mean, pct_poverty_Indv_Mean, pct_vacant_Indv_Mean,
pct_own_Indv_Mean, pct_rent_Indv_Mean,
pct_educationnoschool_Indv_Mean, pct_educationhs_Indv_Mean, pct_educationbachelors_Indv_Mean) %>%
summarise(agg_total_students = sum(total_students, na.rm = TRUE))
# Convert to shapefile
elem_by_year_Indv_Mean.sf <- st_as_sf(elementary_by_year_Indv_Mean,
coords = c("school_main_facility_long","school_main_facility_lat"),
crs = 4326)
Appending School-Based ACS Variables
Match ACS data (both actual data from 2013 to 2023, and hypothetical data from 2024) to each school facility-year combination.
# Use st_within to identify the census tract each school were within, at different years
results_list <- list()
years_new <- 2013:2024 # add hypothetical 2024
# Loop through each year
for (i in years_new) {
# Filter the data frames for the specific year
elementary_filtered <- elem_by_year_Indv_Mean.sf %>%
filter(census_year == i)
tracts_filtered <- tracts12_24 %>%
filter(year == i) %>%
dplyr::select(GEOID, total_population, population_white, population_black, population_native, population_asian, population_hispanic,
Male5to9, Female5to9, num_households, total_housing_units, occupied_housing_units, vacant_housing_units, median_gross_rent,
med_rent_pct_inc, median_home_value, med_owner_cost_pct_inc, pop_owner_occupied, pop_renter_occupied, med_rooms_owner,
med_rooms_renter, median_household_income, poverty_pop, education_no_school, education_hs, education_bachelors,
kinder_enrollment, first_fourth_grade_enr, avg_hh_size,
pct_5to9, pct_nonwhite, pct_poverty, pct_vacant, pct_own, pct_rent,
pct_education_no_school, pct_education_hs, pct_education_bachelors,
housing_1yrbefore, pct_change_housing_yrbefore) # NZ: added these variables
# Perform the spatial join
joined_data <- st_join(elementary_filtered, tracts_filtered, join = st_within, left = TRUE)
# Add the joined data to the results list
results_list[[as.character(i)]] <- joined_data
}
Convert Results to Panel
Merge the panel results for the first time, the total number of rows (unique school facility-grade level-school year combinations) retained is 9401. We will be repeating the merging/binding process to add subsequent variables.
panel <- do.call(rbind, results_list)
We calculated the cumulative permits issued within the near 5 years for each DCPS zone on file, to account for the effect created by housing constructions. For example, for all records in SY13-14, cumulative permits issued within the near 5 years is the sum of permits issued in 2013, 2014, 2015, 2016, and 2017, which effectively accounts for all additional construction projects that influence the targeted year (SY18-19)’s prediction results.
construction_within_dcps <- st_join(combined_construction_data %>%
st_transform(st_crs(boundary)), boundary, join = st_within)
construction_count <- construction_within_dcps %>%
as_tibble() %>%
group_by(DCPS_ID, year) %>%
summarise(construction_projects = n()) %>%
st_drop_geometry()
# NZ: I think many-to-many in the following steps should be expected, as we intentionally match multiple years' construction permit count to the base year for each DCPS
projects_5yr_sum <- construction_count %>%
rename(base_year = year) %>%
inner_join(construction_count, by = c("DCPS_ID")) %>%
filter(year >= base_year, year < base_year + 5) %>%
group_by(DCPS_ID, base_year) %>%
summarise(cumulative_permits5yr = sum(construction_projects.y, na.rm = TRUE), .groups = "drop") %>%
rename(year = base_year)
construction_count <- construction_count %>%
left_join(projects_5yr_sum, by = c("DCPS_ID", "year"))
Removed 5-year cumulative permits for 2020 and later, since those counts would be incomplete. These years will remain in the final dataset used for model fitting, but will be excluded from training and testing. Projected permit values for these years will be manually assigned in scenario analyses.
construction_count <- construction_count %>%
filter(year <= 2019 )
# Match schools to DCPS zones
dcps_code <- dplyr::select(boundary, DCPS_ID, NAME)
panel <- st_join(panel, dcps_code, join = st_within)
panel <- panel %>%
left_join(construction_count, by = c("DCPS_ID", "census_year" = "year"))
Next, we added several residential property transaction-related data points within each DCPS boundary or nearest to the school facility from the year prior to the current school year. For instance, residential sales records in 2012 are matched with SY13–14 enrollment records. This is based on the belief that housing sales one year prior are more directly associated with enrollments in the subsequent school year, as families typically move within a few months of purchase and enroll their children in school the following fall. This creates a more stable predictor that aligns household settlement patterns with school enrollment cycles.
combined_sales <- res_data %>%
as_tibble() %>%
mutate(unit_type = if_else(NUM_UNITS == 1, "single", "multi")) %>%
group_by(SALEYEAR, school_year, dcps_zone, unit_type) %>%
tally() %>%
pivot_wider(
names_from = unit_type,
values_from = n,
names_prefix = "num_",
values_fill = 0
) %>%
rename(
num_single_units_sold = num_single,
num_multi_units_sold = num_multi
)
panel <- panel %>%
left_join(combined_sales %>%
dplyr::select(-SALEYEAR),
by = c("NAME" = "dcps_zone", "census_year" = "school_year"))
panel <- panel %>%
left_join(school_metrics_by_year)
Intermediate Step: Reducing Input Variables to Reduce Multicollinearity
# standardize percentages
panel_clean <- panel %>%
mutate(utilization = as.numeric(sub("%", "", utilization)) / 100,
med_rent_pct_inc = med_rent_pct_inc / 100,
med_owner_cost_pct_inc = med_owner_cost_pct_inc / 100) %>%
rename(dcps_boundary = NAME)
# clean up variable names to lower case
names(panel_clean) <- tolower(names(panel_clean))
# Improved variable inclusion based on Amy ENTIRE_panel_allVars.geojson
# Everything before mutating additonal agg_students and capacity related lag/future vars
panel_clean <- panel_clean %>%
dplyr::select(school_year, school_code, school_name, grade_level, unique_id,
census_year, geoid, dcps_boundary,
school_sector, facility_capacity, enrollment, utilization,
median_hhincome_indv_mean, pct_poverty_indv_mean,
total_population, num_households,
total_housing_units,median_gross_rent, med_rent_pct_inc,
med_owner_cost_pct_inc, med_rooms_renter, median_household_income, avg_hh_size, pct_nonwhite, pct_poverty,
pct_vacant, pct_own, pct_rent,pct_education_no_school, pct_education_hs, pct_education_bachelors,
construction_projects, avg_price_5_nearest, med_bedroom_5_nearest,
cumulative_permits5yr, housing_1yrbefore, pct_change_housing_yrbefore,
agg_total_students)
Finally, to support forecasting of future enrollment based on past records, we created the following columns:
agg_total_students_5yrslater: the aggregate total student enrollment for the same school facility and grade level, five years later.
pct_change_capacity_5yrs: the anticipated percentage change in the school’s capacity, five years later (i.e. percentage change in school capacity between base year SY13-14 and target year SY18-19).
The former serves as the target variable in the model. While the latter, similar to cumulative_permits5yr, acts as a hypothetical input in the training data. Collectively, they enables clients to conduct scenario-based forecasting informed by qualitative judgments about development pressures within a school district or planned facility-level capacity changes.
Student Enrollment
# aggregate enrollment lags and leads, per grade, future and past
grade_enrollment_lags <- expand.grid(
unique_id = sort(unique(panel_clean$unique_id)),
census_year = seq(from = min(panel_clean$census_year), to = max(panel_clean$census_year))
) %>%
left_join(panel_clean, by = c("unique_id", "census_year")) %>%
dplyr::select(unique_id,census_year, grade_level, agg_total_students) %>%
arrange(unique_id, census_year) %>%
group_by(unique_id, grade_level, census_year) %>% summarize(agg_total_students = mean(agg_total_students)) %>%
mutate(agg_total_students_1yrbefore = lag(agg_total_students, n=1),
agg_total_students_2yrbefore = lag(agg_total_students, n=2),
agg_total_students_3yrbefore = lag(agg_total_students, n=3),
agg_total_students_1yrlater = lead(agg_total_students, n=1),
agg_total_students_2yrlater = lead(agg_total_students, n=2),
agg_total_students_3yrlater = lead(agg_total_students, n=3),
agg_total_students_4yrlater = lead(agg_total_students, n=4),
agg_total_students_5yrlater = lead(agg_total_students, n=5)) %>%
ungroup()
panel_clean <- panel_clean %>%
left_join(grade_enrollment_lags %>% dplyr::select(-agg_total_students),
by=c("unique_id","census_year", "grade_level"))
School Capacity
capacity_change <- expand.grid(
unique_id = sort(unique(panel_clean$unique_id)),
census_year = seq(from = min(panel_clean$census_year), to = max(panel_clean$census_year))
) %>%
left_join(panel_clean, by = c("unique_id", "census_year")) %>%
#dplyr::select(unique_id,census_year, total_housing_units) %>%
arrange(unique_id, census_year) %>%
group_by(unique_id, census_year) %>% summarize(facility_capacity = mean(facility_capacity)) %>%
mutate(capacity_5yrlater = lead(facility_capacity, n=5),
pct_change_capacity_5yrlater = (capacity_5yrlater - facility_capacity) / facility_capacity
) %>%
ungroup()
panel_clean <- panel_clean %>%
left_join(capacity_change %>% dplyr::select(-facility_capacity),
by=c("unique_id","census_year"))
Final tweaks to clean up the panel data
panel_clean <- panel_clean %>%
mutate(across(c(facility_capacity, enrollment), round))
# impute enrollment for missing values based on aggregate values
panel_clean <- panel_clean %>%
group_by(school_year, unique_id) %>%
# where enrollment is missing, add up agg_total_students for each grade
mutate(enrollment = case_when(is.na(enrollment) ~ sum(agg_total_students),
.default = enrollment))
panel_clean <- panel_clean %>%
st_join(wards22 %>% dplyr::select(NAME, geometry) %>%
rename("ward"="NAME")) # join wards back
# calculate average utilization for each ward in the respective year
# where this column is empty, replace with that average
ward_avg_util <- panel_clean %>%
st_drop_geometry() %>%
group_by(school_year, ward) %>%
summarize(avg_util = mean(utilization, na.rm=T))
# join wards to impute average value
# will drop avg_util column later
panel_clean <- panel_clean %>%
left_join(ward_avg_util, by=c("school_year","ward"))
panel_clean <- panel_clean %>%
#assume missing values mean no enrollment that year
mutate_at(vars(agg_total_students_2yrbefore, agg_total_students_3yrbefore, agg_total_students_1yrbefore), ~replace_na(., 0)) %>%
# impute the same value for housing variables
# making the assumption that the year-to-year doesn't change much when it's only a year difference
# also doing for capacity
# only for 1yr
mutate(housing_1yrbefore = case_when(is.na(housing_1yrbefore) ~ total_housing_units, .default=housing_1yrbefore),
pct_change_housing_yrbefore = case_when(is.na(pct_change_housing_yrbefore) ~ 0, .default=pct_change_housing_yrbefore),
facility_capacity = case_when(facility_capacity == 0 | is.na(facility_capacity) ~ enrollment, .default=facility_capacity),
utilization=case_when(is.na(utilization) ~ avg_util, .default=utilization))
Inspect NAs in the Panel
sapply(panel_clean, function(x)sum(is.na(x)))
## school_year school_code
## 0 0
## school_name grade_level
## 0 0
## unique_id census_year
## 0 0
## geoid dcps_boundary
## 0 0
## school_sector facility_capacity
## 0 0
## enrollment utilization
## 0 823
## median_hhincome_indv_mean pct_poverty_indv_mean
## 0 0
## total_population num_households
## 0 0
## total_housing_units median_gross_rent
## 0 185
## med_rent_pct_inc med_owner_cost_pct_inc
## 6 52
## med_rooms_renter median_household_income
## 0 64
## avg_hh_size pct_nonwhite
## 0 0
## pct_poverty pct_vacant
## 0 0
## pct_own pct_rent
## 0 0
## pct_education_no_school pct_education_hs
## 0 0
## pct_education_bachelors construction_projects
## 0 4072
## avg_price_5_nearest med_bedroom_5_nearest
## 0 0
## cumulative_permits5yr housing_1yrbefore
## 4072 0
## pct_change_housing_yrbefore agg_total_students
## 0 0
## geometry agg_total_students_1yrbefore
## 0 0
## agg_total_students_2yrbefore agg_total_students_3yrbefore
## 0 0
## agg_total_students_1yrlater agg_total_students_2yrlater
## 1266 2325
## agg_total_students_3yrlater agg_total_students_4yrlater
## 3246 4131
## agg_total_students_5yrlater capacity_5yrlater
## 4982 5683
## pct_change_capacity_5yrlater ward
## 5683 0
## avg_util
## 823
Optional: Export Panel Data
A range of recent studies have explored the application of predictive modeling to support enrollment forecasting across different educational contexts. Tang and Yin (2012) applied grey prediction models to forecast U.S. education expenditure and school enrollment, while Goenner and Pauls (2006) used Bayesian logistic regression models to assess the likelihood of college enrollment based on inquiry behaviors and geodemographic indicators. Yang et al. (2020) introduced a hybrid Support Vector Regression model optimized with the Whale Optimization Algorithm to forecast student and teacher counts in Taiwan, and Aksenova et al. (2006) combined SVR with Cubist rule-based models to enhance accuracy and interpretability of university enrollment projections.
Among all the papers reviewed, the study most relevant to our approach was published by Tanner et al. (2021) at the National Center for Education Evaluation and Regional Assistance at Institution of Educational Sciences (IES). The study develops an enrollment forecasting model for the School District of Philadelphia using four statistical and machine learning methods: OLS regression, LASSO, Elastic Net, and Random Forest. Results showed that LASSO and Elastic Net—both implemented via the GLMNet framework—and Random Forest achieved comparable accuracy, with Random Forest reducing the need for classroom reallocation in 22% of cases, compared to 23% with Elastic Net. The research question addressed offers a direct parallel to the goals of our client at DME targeted at future enrollment forecasting and planning. This precedent reinforces our decision to test out GLMNet, Random Forest, and an additional XGBoost model in our preliminary testings (refer to Appendix at the bottom of the markdown).
Through literature review and trials and errors, our results indicates that a GLMNet model yields best predictions for the task we were given, and yields additional benefits including interpretability, flexibility, and computational efficiency. The next part outlines our model building process in the following steps:
Data Splitting: Based on data availability, especially limitations imposed by future-based variables, we split the data by whole school years into train data (SY13-14 to SY16-17 input data used to train the model and perform cross-fold validations), test data (SY17-18 to SY18-19, data used to perform cross validation on other samples), and fit data (SY19-20 and beyond, data used to make scenario-based predictions). Collectively, our model is trained on 2,496 rows of train data, validated on 1,211 rows of test data, and fitted on 4,871 rows of raw fit data before their expanded scenarios.
Validation: Performed 2 types of cross validation including 5-fold on the train data, and temporal cross validation on 1,211 rows of test data in different years. The test data are unseen observations to the model and effectively evaluates the model’s robustness.
Generalizability Check: For the testing set specifically, we further checked the generalizability across school years, school sectors, capacity, and location of the school facility.
Making Predictions: Finally, we expanded the raw fit data (4,871 rows) into combinations of scenarios based on outlook for total permits within the DCPS zone in the next 5 years (cumulative_permits5yr) and percentage change in the school’s capacity 5 years later (pct_change_capacity_5yrlater). This allows the user to customize predictions in the app later on.
Codes for the respective steps and key results are included below.
Raw.Variable.Name | Defintion |
---|---|
facility_capacity | School Capacity: The maximum number of students who can be housed in a school building based on the number and types of spaces available and the type of educational plan offered. |
enrollment | Student Enrollment Numbers |
utilization | Utilization is a measurement of how full a facility is to meet educational programmatic needs. Utilization is calculated as the number of enrolled students over the capacity of the school facility. |
median_hhincome_indv_mean | Median Household Income, averaged from the proportion of student in each census tract. The following equation calculates this: (1/Total Students)*Summation[(Number of students in census tract N)*(Median Household Income in census tract N)] |
pct_poverty_indv_mean | Poverty rate, averaged from the proportion of students in each census tract. The following equation calculates this: (1/Total Students)*Summation[(Number of students in census tract N)*(Poverty rate in census tract N)] |
total_population | Total population, in the census tract the school is located in |
total_housing_units | Total housing units, in the census tract the school is located in |
median_gross_rent | Median Gross Rent, in the census tract the school is located in |
med_owner_cost_pct_inc | Median Owner Costs as percentage of income, in the census tract the school is located in |
med_rooms_renter | Median Number of Rooms in Renter Occupied housing, in the census tract the school is located in |
avg_hh_size | Average Household Size, in the census tract the school is located in |
pct_nonwhite | Percentage of non-white population, in the census tract the school is located in |
pct_poverty | Poverty rate, in the census tract the school is located in |
pct_vacant | Percentage of vacant housing, in the census tract the school is located in |
pct_own | Percentage of owner occupied housing among total population, in the census tract the school is located in |
pct_education_no_school | Percentage of adults aged 25 and above without school qualifications, in the census tract the school is located in |
pct_education_hs | Percentage of adults aged 25 and above with high school qualifications, in the census tract the school is located in |
avg_price_5_nearest | Average sales price of 5 nearest residential properties to the school |
med_bedroom_5_nearest | Median number of bedrooms of 5 nearest residential properties to the school |
cumulative_permits5yr | Total number of construction permits issued in the school’s census tract in past 5 years |
housing_1yrbefore | Total housing units, 1 year before the specified school year, in the census tract the school is located in |
pct_change_housing_yrbefore | Percentage change in total housing units from 1 year before, in the census tract the school is located in |
agg_total_students | This is the total number of students in the school for each grade, in the specified school year. The specified school year is contained in a separate column in the panel data |
agg_total_students_1yrbefore | This is the total number of students in the school for each grade, 1 year before the specified school year |
agg_total_students_2yrbefore | This is the total number of students in the school for each grade, 2 years before the specified school year |
agg_total_students_3yrbefore | This is the total number of students in the school for each grade, 3 years before the specified school year |
agg_total_students_1yrlater | This is the total number of students in the school for each grade, 1 year after the specified school year |
pct_change_capacity_5yrlater | Percentage change in school capacity 5 years after specified year. The specified school year is contained in a separate column in the panel data |
school_year | This is the school year. It is a categorical variable |
grade_level_X2 | This is the grade level of students. It is a categorical variable |
unique_id | This ID was assigned to each school to differentiate them when modelling. It is a categorical variable |
dcps_boundary | This is the name of DCPS School Zones Boundary the school is located in. It is a categorical variable |
school_sector | This is the type of public school in DC. It is a categorical variable of either DC Public School or Public Charter Schools |
ward | This is the name of municipal ward in DC. It is a categorical variable. |
panel_5yr <- panel_clean %>%
dplyr::select(-geoid, -school_name, -school_code, -census_year, -avg_util,
-capacity_5yrlater,
-construction_projects,
# these vars have extremely big coefficients, which will put rows without these vars at great prediction inaccuracy risk
-agg_total_students_2yrlater, -agg_total_students_3yrlater, -agg_total_students_4yrlater) %>%
st_drop_geometry() %>%
filter(agg_total_students_5yrlater > 0) %>%
filter_all(all_vars(!is.infinite(.))) %>%
filter(school_year != "SY19-20")
panel_5yr <- panel_5yr %>%
left_join(
panel_clean %>% dplyr::select(unique_id, school_year, grade_level, geometry),
by = c("unique_id", "school_year", "grade_level")
) %>%
st_as_sf()
colnames(panel_5yr)
## [1] "school_year" "grade_level"
## [3] "unique_id" "dcps_boundary"
## [5] "school_sector" "facility_capacity"
## [7] "enrollment" "utilization"
## [9] "median_hhincome_indv_mean" "pct_poverty_indv_mean"
## [11] "total_population" "num_households"
## [13] "total_housing_units" "median_gross_rent"
## [15] "med_rent_pct_inc" "med_owner_cost_pct_inc"
## [17] "med_rooms_renter" "median_household_income"
## [19] "avg_hh_size" "pct_nonwhite"
## [21] "pct_poverty" "pct_vacant"
## [23] "pct_own" "pct_rent"
## [25] "pct_education_no_school" "pct_education_hs"
## [27] "pct_education_bachelors" "avg_price_5_nearest"
## [29] "med_bedroom_5_nearest" "cumulative_permits5yr"
## [31] "housing_1yrbefore" "pct_change_housing_yrbefore"
## [33] "agg_total_students" "agg_total_students_1yrbefore"
## [35] "agg_total_students_2yrbefore" "agg_total_students_3yrbefore"
## [37] "agg_total_students_1yrlater" "agg_total_students_5yrlater"
## [39] "pct_change_capacity_5yrlater" "ward"
## [41] "geometry"
Data Splitting
set.seed(2554)
data_split5 <- initial_time_split(panel_5yr %>%
st_drop_geometry() %>%
filter_all(all_vars(!is.infinite(.))),
prop = 2492/3739) # train 2013-2016, test 2017-2018
train_data_lag5 <- training(data_split5)
test_data_lag5 <- testing(data_split5)
data_cv5 <- group_vfold_cv(train_data_lag5, group = "school_year")
Set Recipe
linear_big_model_rec5 <- recipe(agg_total_students_5yrlater ~ .,
data=train_data_lag5) %>%
# acept mean impute for columns where there aren't many NA
step_impute_mean(all_numeric_predictors()) %>%
step_novel(all_nominal()) %>%
step_dummy(all_nominal()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal())
glmnet_plan5 <-
linear_reg() %>%
set_args(penalty = tune()) %>%
set_args(mixture = tune()) %>%
set_engine("glmnet")
control <- control_resamples(save_pred = TRUE, verbose = TRUE)
metrics <- metric_set(rmse, rsq, mape, smape)
glmnet_grid5 <- expand.grid(penalty = seq(0,0.5, by = 0.025),
mixture = seq(0,0.5, by = 0.025))
glmnet_wf5 <-
workflow() %>%
add_recipe(linear_big_model_rec5) %>%
add_model(glmnet_plan5)
Training and Selecting Best Model
We finalized the best workflow (0.025 for penalty and 0.175 for mixture) based on smallest RMSE
collect_metrics(glmnet_tuned5)
## # A tibble: 1,764 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0 0 mape standard 15.7 4 1.68 Preprocessor1_Model0…
## 2 0 0 rmse standard 8.34 4 0.541 Preprocessor1_Model0…
## 3 0 0 rsq standard 0.886 4 0.0188 Preprocessor1_Model0…
## 4 0 0 smape standard 14.2 4 1.18 Preprocessor1_Model0…
## 5 0.025 0 mape standard 15.7 4 1.68 Preprocessor1_Model0…
## 6 0.025 0 rmse standard 8.34 4 0.541 Preprocessor1_Model0…
## 7 0.025 0 rsq standard 0.886 4 0.0188 Preprocessor1_Model0…
## 8 0.025 0 smape standard 14.2 4 1.18 Preprocessor1_Model0…
## 9 0.05 0 mape standard 15.7 4 1.68 Preprocessor1_Model0…
## 10 0.05 0 rmse standard 8.34 4 0.541 Preprocessor1_Model0…
## # ℹ 1,754 more rows
show_best(glmnet_tuned5, metric = "rmse", n = 15)
## # A tibble: 15 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.025 0.175 rmse standard 8.14 4 0.495 Preprocessor1_Model149
## 2 0.025 0.15 rmse standard 8.14 4 0.491 Preprocessor1_Model128
## 3 0 0.175 rmse standard 8.14 4 0.494 Preprocessor1_Model148
## 4 0.05 0.075 rmse standard 8.14 4 0.500 Preprocessor1_Model066
## 5 0.05 0.125 rmse standard 8.15 4 0.506 Preprocessor1_Model108
## 6 0.05 0.1 rmse standard 8.15 4 0.505 Preprocessor1_Model087
## 7 0 0.05 rmse standard 8.15 4 0.495 Preprocessor1_Model043
## 8 0.025 0.05 rmse standard 8.15 4 0.495 Preprocessor1_Model044
## 9 0.05 0.05 rmse standard 8.15 4 0.498 Preprocessor1_Model045
## 10 0.025 0.1 rmse standard 8.15 4 0.490 Preprocessor1_Model086
## 11 0.025 0.125 rmse standard 8.15 4 0.494 Preprocessor1_Model107
## 12 0 0.075 rmse standard 8.15 4 0.489 Preprocessor1_Model064
## 13 0.025 0.075 rmse standard 8.15 4 0.489 Preprocessor1_Model065
## 14 0 0.15 rmse standard 8.15 4 0.489 Preprocessor1_Model127
## 15 0.025 0.275 rmse standard 8.15 4 0.510 Preprocessor1_Model233
glmnet_best_params5 <- select_best(glmnet_tuned5, metric = "rmse")
## Final workflow
glmnet_best_wf5 <- finalize_workflow(glmnet_wf5, glmnet_best_params5)
Inspect Coefficients & Zeroed-out Variables
As a result of Elastic Net (balancing LASSO and Ridge regularizations), the following list of variables are zeroed-out (evaluated to be insignificant variables with a zero-coefficient assigned).
# Extract the fitted parsnip model from the loaded workflow
glmnet_best_fit5 <- fit(glmnet_best_wf5, data = train_data_lag5)
glmnet_fit5 <- extract_fit_parsnip(glmnet_best_fit5)
# Tidy up and inspect coefficients
coef_df5 <- tidy(glmnet_fit5)
nonzero_coefs5 <- coef_df5 %>% filter(estimate != 0)
zeroed_out_coefs5 <- coef_df5 %>% filter(estimate == 0, term != "(Intercept)")
# Print non-zero coefficients
print(nonzero_coefs5)
## # A tibble: 218 × 3
## term estimate penalty
## <chr> <dbl> <dbl>
## 1 (Intercept) 50.9 0.025
## 2 facility_capacity 0.130 0.025
## 3 enrollment -0.00268 0.025
## 4 utilization -1.20 0.025
## 5 median_hhincome_indv_mean 1.33 0.025
## 6 pct_poverty_indv_mean 0.288 0.025
## 7 total_population 1.09 0.025
## 8 total_housing_units -0.00449 0.025
## 9 median_gross_rent -0.765 0.025
## 10 med_owner_cost_pct_inc 0.335 0.025
## # ℹ 208 more rows
# Print zeroed_out coefficients
print(zeroed_out_coefs5)
## # A tibble: 48 × 3
## term estimate penalty
## <chr> <dbl> <dbl>
## 1 num_households 0 0.025
## 2 med_rent_pct_inc 0 0.025
## 3 median_household_income 0 0.025
## 4 pct_rent 0 0.025
## 5 pct_education_bachelors 0 0.025
## 6 unique_id_F000066 0 0.025
## 7 unique_id_F000072 0 0.025
## 8 unique_id_F000073 0 0.025
## 9 unique_id_F000135 0 0.025
## 10 unique_id_F000180 0 0.025
## # ℹ 38 more rows
Optional: Save Co-efficients for Interpretation
write.csv(coef_df5, "coef_df5_May4.csv")
First, we evaluated the model performance across 5-folds on the train data, achieving an RMSE of 8.17, a MAPE of 14.8%, and an r-squared value of 0.885 (i.e. the model explains 88.5% of variations in the observed data).
RMSE (root mean squared error): The square root of the mean of the squared differences between actual and predicted values.
MAPE (Mean Absolute Percentage Error): Prediction errors as a percentage of actual values.
# Get best OOF predictions for glmnet only
glmnet_best_OOF_preds5 <- collect_predictions(glmnet_tuned5) %>%
filter(penalty == glmnet_best_params5$penalty[1],
mixture == glmnet_best_params5$mixture[1]) %>%
dplyr::select(.pred, agg_total_students_5yrlater)
OOF_preds5 <- glmnet_best_OOF_preds5 %>%
mutate(model = "glmnet")
# calculate overall metrics
glmnet_oof_metrics5 <- bind_rows(
metrics(
data = glmnet_best_OOF_preds5,
truth = agg_total_students_5yrlater,
estimate = .pred
),
mape(
data = glmnet_best_OOF_preds5,
truth = agg_total_students_5yrlater,
estimate = .pred
)
)
print(glmnet_oof_metrics5)
## # A tibble: 5 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 8.17
## 2 rsq standard 0.885
## 3 mape standard 14.8
## 4 smape standard 13.8
## 5 mape standard 14.8
# OOF predicted versus actual
ggplot(OOF_preds5, aes(x = agg_total_students_5yrlater, y = .pred, group = model)) +
geom_point(alpha = 0.3) +
geom_abline(linetype = "dashed", color = "#fc8961") +
geom_smooth(method = "lm", color = "#440154") +
coord_equal() +
facet_wrap(~model, ncol=3) +
theme_bw()
Next, we evaluated the model performance on the test data (data that the model has not “seen” yet) to evaluate its generalizability across years. This yields an RMSE of 9.62 and a MAPE of 17.5%. The errors, however, increases in SY18-19 as compared to SY17-18.
glmnet_val_fit_geo5 <- glmnet_best_wf5 %>%
last_fit(split = data_split5,
control = control,
metrics = metrics)
# collect validation set predictions from last_fit model
glmnet_val_pred_geo5 <- collect_predictions(glmnet_val_fit_geo5)
val_preds5 <- data.frame(glmnet_val_pred_geo5, model = "glmnet") %>%
left_join(
panel_5yr %>%
filter(agg_total_students_5yrlater > 0) %>%
rowid_to_column(var = ".row") %>%
dplyr::select(unique_id, school_year, grade_level, school_sector, ward, dcps_boundary,
utilization, facility_capacity, pct_change_capacity_5yrlater,
cumulative_permits5yr,
geometry, .row),
by = ".row"
) %>%
group_by(unique_id, school_year, grade_level, model) %>%
mutate(
RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
MAE = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred)
) %>%
ungroup() %>%
mutate(model = factor(model, levels = "glmnet"))
val_preds_sf5 <- st_as_sf(val_preds5, crs = 4326)
val_avg_metrics5 <- val_preds5 %>%
summarise(
overall_RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
overall_MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred)
)
print(val_avg_metrics5)
## # A tibble: 1 × 2
## overall_RMSE overall_MAPE
## <dbl> <dbl>
## 1 9.62 17.5
val_preds_by_year_only_sf <- val_preds_sf5 %>%
group_by(school_year) %>%
summarise(RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
MAE = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred))
ggplot(val_preds_by_year_only_sf, aes(x=school_year, y=RMSE))+
geom_bar(position="dodge",stat = "identity", fill="#440154", color="#fed395")+
labs(title="RMSE by school year")
Looking at RMSE by grade levels, we found that predictions tend to be more accurate at higher grades, and is most prone to error for lower grades like K and 1. This implies as students naturally progress to upper grade levels, their enrollment patterns become more trackable.
# Validation Predicted vs. actual
ggplot(val_preds5, aes(x = agg_total_students_5yrlater, y = .pred, group = model)) +
geom_point(alpha = 0.3, color="#440154") +
geom_abline(linetype = "dashed", color = "#fc8961") +
geom_smooth(method = "lm", color = "#440154") +
coord_equal() +
facet_wrap(~model, ncol = 3) +
theme_bw()
# Validation Predicted vs. actual
ggplot(val_preds5, aes(x = agg_total_students_5yrlater, y = .pred, group = model)) +
geom_point(alpha = 0.3, color = "#440154") +
geom_abline(linetype = "dashed", color = "#fc8961") +
geom_smooth(method = "lm", color = "#440154") +
coord_equal() +
scale_x_continuous(breaks = seq(0, max(val_preds5$agg_total_students_5yrlater, na.rm = TRUE), by = 40)) +
scale_y_continuous(breaks = seq(0, max(val_preds5$.pred, na.rm = TRUE), by = 50)) +
theme_bw()
val_preds_by_year_grade_sf <- val_preds_sf5 %>%
group_by(school_year, grade_level) %>%
summarise(RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
MAE = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred))%>%
arrange(RMSE) %>%
mutate(grade_level = factor(grade_level, levels=c("K","1","2","3","4","5")))
# plot rmse, grade level facet
ggplot(val_preds_by_year_grade_sf, aes(x=school_year, y=RMSE))+
geom_bar(position="dodge",stat = "identity", fill="#440154", color="#fed395")+
facet_wrap(~grade_level)+
labs(title="RMSE by grade level")
We further plotted for the distribution of errors in relation to school facility’s utilization, capacity, as well as the two scenario variables – cumulative permits in 5 years, and change in facility capacity 5 years later. As observed, the errors are randomly distributed with no obvious patterns of clustering. However, predictions for public charters do demonstrate more instability as compared to DCPS.
Error (RMSE) vs Utilization, by sector
ggplot(val_preds_sf5 %>%
mutate(grade_level = factor(grade_level, levels = c("K","1","2","3","4","5"))),
aes(x = utilization, y = RMSE)) +
geom_point(alpha = 0.3, color = "#440154") +
facet_wrap(~school_sector, nrow = 2) +
labs(title = "Error by Utilization and School Sector", subtitle = "5-Year Lag Model") +
theme_bw()
Error (RMSE) vs Capacity, by Sector
ggplot(val_preds_sf5 %>%
mutate(grade_level = factor(grade_level, levels = c("K","1","2","3","4","5"))),
aes(x = facility_capacity, y = RMSE)) +
geom_point(alpha = 0.3, color = "#440154") +
facet_wrap(~school_sector, nrow = 2) +
labs(title = "Error by Facility Capacity and School Sector", subtitle = "5-Year Lag Model") +
theme_bw()
Error (RMSE) vs Anticipated Construction in 5 Years, all schools (no facet)
ggplot(val_preds_sf5 %>%
mutate(grade_level = factor(grade_level, levels = c("K","1","2","3","4","5"))),
aes(x = cumulative_permits5yr, y = RMSE)) +
geom_point(alpha = 0.3, color = "#440154") +
labs(title = "Error by Anticipated Construction Projects", subtitle = "5-Year Lag Model") +
theme_bw()
Error (RMSE) vs Anticipated Pct Change in Capacity in 5 Years, all schools (no facet)
ggplot(val_preds_sf5 %>%
mutate(grade_level = factor(grade_level, levels = c("K","1","2","3","4","5"))),
aes(x = pct_change_capacity_5yrlater, y = RMSE)) +
geom_point(alpha = 0.3, color = "#440154") +
labs(title = "Error by Anticipated Change in Facility Capacity", subtitle = "5-Year Lag Model") +
theme_bw()
Finally, using test data, we inspected RMSE by facilities and report the following results.
List of Top 20 Schools with Biggest Errors
val_preds5_spatial <- val_preds5 %>%
group_by(unique_id, ward, dcps_boundary) %>% # the most granular is still school facility
summarise(
RMSE_school = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
MAE_school = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
MAPE_school = yardstick::mape_vec(agg_total_students_5yrlater, .pred),
.pred_mean = mean(.pred, na.rm = TRUE),
actual_mean = mean(agg_total_students_5yrlater, na.rm = TRUE),
n_obs = n(),
.groups = "drop"
) %>%
arrange(desc(RMSE_school)) %>%
slice_head(n = 20)
# Display table nicely
val_preds5_spatial %>%
dplyr::select(
`Unique ID` = unique_id,
`DCPS Zone` = dcps_boundary,
`Ward` = ward,
`Predicted (Mean)` = .pred_mean,
`Actual (Mean)` = actual_mean,
`School RMSE` = RMSE_school,
`School MAE` = MAE_school,
`School MAPE` = MAPE_school
) %>%
mutate(across(where(is.numeric), ~round(., 2))) %>%
mutate(
`Predicted (Mean)` = round(`Predicted (Mean)`, 0),
`Actual (Mean)` = round(`Actual (Mean)`, 0)
) %>%
kable(
format = "html",
align = "c",
caption = "Top 20 Schools with Highest RMSE (5-Year Lag)"
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Unique ID | DCPS Zone | Ward | Predicted (Mean) | Actual (Mean) | School RMSE | School MAE | School MAPE |
---|---|---|---|---|---|---|---|
F000091 | King, Martin Luther | Ward 8 | 102 | 52 | 51.80 | 50.17 | 102.49 |
F000028 | Nalle | Ward 7 | 100 | 123 | 23.91 | 23.46 | 18.93 |
F000102 | Bancroft | Ward 1 | 93 | 112 | 22.88 | 19.48 | 16.32 |
F000087 | Nalle | Ward 7 | 89 | 108 | 18.78 | 18.61 | 17.27 |
F000161 | Ketcham | Ward 8 | 58 | 76 | 18.26 | 18.08 | 23.78 |
F000035 | Noyes | Ward 5 | 37 | 19 | 18.26 | 17.84 | 100.82 |
F000193 | Walker-Jones | Ward 5 | 82 | 68 | 17.73 | 15.16 | 25.74 |
F000033 | Seaton | Ward 2 | 98 | 84 | 16.34 | 14.65 | 18.33 |
F000203 | Harris, C. W | Ward 7 | 57 | 46 | 14.25 | 11.53 | 28.77 |
F000092 | Noyes | Ward 5 | 74 | 61 | 14.10 | 12.68 | 22.03 |
F000130 | Garrison | Ward 2 | 37 | 49 | 14.01 | 11.78 | 21.99 |
F000237 | Lewis | Ward 4 | 57 | 67 | 13.78 | 11.70 | 16.45 |
F000247 | Miner | Ward 7 | 60 | 50 | 13.59 | 9.92 | 23.54 |
F000220 | Takoma | Ward 4 | 52 | 58 | 13.57 | 12.12 | 20.37 |
F000096 | Lasalle-Backus | Ward 5 | 66 | 66 | 13.55 | 9.74 | 16.02 |
F000027 | Bunker Hill | Ward 5 | 69 | 73 | 13.27 | 11.87 | 16.52 |
F000146 | Miner | Ward 7 | 5 | 18 | 13.05 | 12.99 | 72.00 |
F000105 | Beers | Ward 7 | 62 | 50 | 13.05 | 12.18 | 25.38 |
F000001 | Beers | Ward 7 | 70 | 80 | 12.58 | 10.21 | 11.85 |
F000029 | Moten | Ward 8 | 118 | 115 | 12.17 | 11.82 | 10.74 |
While most facilities have low to moderate errors in both years, based on the two maps below, we note that there is one particular school that demonstrated particularly outstanding errors. Upon closer inspection, the facility is the Eagle Academy PCS. It has reported serious financial struggles in the recent years and was put on a financial monitoring list in June of 2023, resulting in declining enrollments and an abrupt closure without any advanced notice in August 2024 (source: DC NEWS). This suggests that while our model has demonstrated ability to capture regular changes in socioeconomic and housing, it is susceptible to abrupt changes that cannot be accounted for by the broader change patterns.
Map of RMSE by School Facility across Both Test Years (SY17-18 & SY18-19)
val_pred_facility_sf5 <- val_preds_sf5 %>%
as_tibble() %>%
group_by(unique_id, model) %>%
summarise(
RMSE_school = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
MAPE_school = yardstick::mape_vec(agg_total_students_5yrlater, .pred),
geometry = st_union(geometry), # Collapse to one geometry per school
.groups = "drop"
) %>%
st_as_sf()
# Then plot
ggplot() +
geom_sf(data = boundary, fill = "grey85", color = "grey70", lwd = 0.1) +
geom_sf(data = waterbodies, fill = "ghostwhite", color = "lightsteelblue") +
geom_sf(data = val_pred_facility_sf5, aes(color = RMSE_school), alpha = 0.8) +
scale_color_gradientn(colors = c("#fcfdbf", "#fed395", "#fea973", "#f1605d", "#a3307e", "#440154", "#180f3d")) +
facet_wrap(~model) +
labs(title = "RMSE by School Facility (5-Year Lag)") +
theme_void()
Map of RMSE by School Facility by School Year
val_pred_facility_year_sf5 <- val_preds_sf5 %>%
as_tibble() %>%
group_by(unique_id, model, school_year) %>%
summarise(
RMSE_school = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
MAPE_school = yardstick::mape_vec(agg_total_students_5yrlater, .pred),
geometry = st_union(geometry), # Merge all points/polygons if multiple
.groups = "drop"
) %>%
st_as_sf()
# Then plot
ggplot() +
geom_sf(data = boundary, fill = "grey85", color = "grey70", lwd = 0.1) +
geom_sf(data = waterbodies, fill = "ghostwhite", color = "lightsteelblue") +
geom_sf(data = val_pred_facility_year_sf5, aes(color = RMSE_school), alpha = 0.8) +
scale_color_gradientn(colors = c("#fcfdbf", "#fed395", "#fea973", "#f1605d", "#a3307e", "#440154", "#180f3d")) +
facet_wrap(~school_year) +
labs(title = "RMSE by Facility and Census Year (5-Year Lag)") +
theme_void()
Map of RMSE by Ward and School Zones
Errors by ward and zones have increased across most wards in the later test year (SY18-19). Zooming into changes by school zone specifically, it can be found that the most significant changes occur in dcps zones associated with list of facilities with top errors highlighted in the previous table, including Eagle Academy (in King, Martin Luther) and Bancroft (in Bankcroft), which is associated with changes like financial difficulties or expansion that is unperceived in the current model variables.
# by Ward
# Spatial join points to wards
val_preds_sf_wards5 <- st_join(val_preds_sf5, wards %>% dplyr::select(ward_name = NAME))
# Group by ward and year, then correctly calculate RMSE, MAPE
val_preds_sf_wards5 <- val_preds_sf_wards5 %>%
st_drop_geometry() %>% # Drop POINT geometry
group_by(ward_name, school_year) %>%
summarise(
RMSE_ward = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
MAPE_ward = yardstick::mape_vec(agg_total_students_5yrlater, .pred),
.groups = "drop"
) %>%
left_join(wards %>% dplyr::select(ward_name = NAME), by = "ward_name") %>%
st_as_sf()
# Plot MAPE map
ggplot(val_preds_sf_wards5) +
geom_sf(aes(fill = MAPE_ward), color = "white") +
facet_wrap(~ school_year) +
scale_fill_viridis_c(option = "magma", name = "MAPE") +
theme_minimal() +
labs(
title = "MAPE by Ward and Year (5-Year Lag)",
subtitle = "Spatial aggregation of prediction error",
fill = "MAPE"
)
# Plot RMSE map
ggplot(val_preds_sf_wards5) +
geom_sf(aes(fill = RMSE_ward), color = "white") +
facet_wrap(~ school_year) +
scale_fill_viridis_c(option = "magma", name = "RMSE") +
theme_minimal() +
labs(
title = "RMSE by Ward and Year (5-Year Lag)",
subtitle = "Spatial aggregation of prediction error",
fill = "RMSE"
)
# by DCPS Zones
# Spatial join points to DCPS zones
val_preds_sf_dcps5 <- st_join(val_preds_sf5, boundary %>% dplyr::select(dcps_zone = NAME))
# Group by DCPS zone and school_year, correctly compute RMSE, MAPE
val_preds_sf_dcps5 <- val_preds_sf_dcps5 %>%
st_drop_geometry() %>%
group_by(dcps_boundary, school_year) %>%
summarise(
RMSE_dcps = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
MAPE_dcps = yardstick::mape_vec(agg_total_students_5yrlater, .pred),
.groups = "drop"
) %>%
left_join(boundary %>% dplyr::select(dcps_zone = NAME), by = c("dcps_boundary" = "dcps_zone")) %>%
st_as_sf()
# Plot MAPE by DCPS Zone
ggplot(val_preds_sf_dcps5) +
geom_sf(data = boundary, color = "lightgrey")+
geom_sf(aes(fill = MAPE_dcps), color = "white") +
facet_wrap(~ school_year) +
scale_fill_viridis_c(option = "magma", name = "MAPE") +
theme_minimal() +
labs(
title = "MAPE by DCPS Zone and Year (5-Year Lag)",
subtitle = "Spatial aggregation of prediction error",
fill = "MAPE"
)
# Plot RMSE by DCPS Zone
ggplot() +
geom_sf(data = boundary, color = "lightgrey")+
geom_sf(data = val_preds_sf_dcps5, aes(fill = RMSE_dcps), color = "white") +
facet_wrap(~ school_year) +
scale_fill_viridis_c(option = "magma", name = "RMSE") +
theme_minimal() +
labs(
title = "RMSE by DCPS Zone and Year (5-Year Lag)",
subtitle = "Spatial aggregation of prediction error",
fill = "RMSE"
)
By inspecting the distribution of total construction permits issued for each school zone within 5 years (i.e. the total construction permits issued from the base year to the end of census year prior to the target school year; eg. to predict for SY27-28, this variable sums up all construction permits issued in a school zone from start of 2022 to end of 2026), and the distribution of percent change in facility capacity 5 years later (i.e. percentage change between facility capacity in the base school year and the facility capacity in the target school year; eg. to predict for SY27-28, this variable compares facility capacity in SY27-28 to facility capacity to SY22-23), we mutated the following threshold that the client can interact with in the app:
cumulative_permits5yr: every 250 between 0 to 2000, 4000, 6000, 8000
pct_change_capacity_5yrs: since -1.0 represents that the school is gone, let’s do every every -50%, -25%, -10%, -5%, 0%, 5%, 10%, 25%, 50%, 100%, 150%
hist(panel_clean$cumulative_permits5yr)
hist(panel_clean$pct_change_capacity_5yrlater)
Set expansion grid
projects_grid5 <- c(seq(0, 2000, by = 250), 4000, 6000, 8000)
capacity_pct_grid5 <- c(-0.5, -0.25, -0.10, -0.05, 0, 0.05, 0.10, 0.25, 0.5, 1, 1.5)
# Create all combinations (Cartesian product)
grid <- expand.grid(
cumulative_permits5yr = projects_grid5,
pct_change_capacity_5yrlater = capacity_pct_grid5
)
Build Scenarios for 5-year Fit Data
fit_data5 <- panel_clean %>%
filter(census_year >= 2019) # we use anything beyond 2019
# Drop existing variables before expansion
fit_data5 <- fit_data5 %>%
dplyr::select(-cumulative_permits5yr, -pct_change_capacity_5yrlater, -agg_total_students_5yrlater)
# Now safe to cross
fit_data5 <- tidyr::crossing(fit_data5, grid)
Create Absolute Change in Capacity
Create a column that indicates what number of seats the scenario correspond to, for the App Team’s use.
fit_data5 <- fit_data5 %>%
mutate(abs_change_capacity_5yrlater = round(pct_change_capacity_5yrlater * facility_capacity))
Fit model to the scenario-based fit data (over 600,000 rows to generate predictions
# Fit on all available training data
glmnet_final_fit5 <- fit(glmnet_best_wf5, data = training(data_split5))
predictions5 <- predict(glmnet_final_fit5, new_data = fit_data5) %>%
bind_cols(fit_data5) #.pred stands for agg_total_students_5yrlater in this case!
predictions5 <- predictions5 %>%
mutate(`.pred` = ceiling(`.pred`))
predictions5_sf <- st_as_sf(predictions5)
Retain only useful rows to reduce file size
app_data <- predictions5_sf %>%
dplyr::select(`.pred`, school_year, school_code, school_name, grade_level, unique_id,census_year,
dcps_boundary, school_sector, ward,
cumulative_permits5yr, pct_change_capacity_5yrlater, abs_change_capacity_5yrlater)
Save data for App Team’s use
Based on previous discussions and results, as well as the strengths and limitations of the current model, we would like to conclude our project by proposing the following recommendations for our client:
Incorporate Qualitative Consideration: While using the app, we highly recommend our client to incorporate reasonable qualitative assumptions while picking a scenario.
Make a Flexible Switch between the App and the Business-as-usual Approach: For unique facilities that are extraordinarily high in capacity / undergoes sudden administrative changes etc. The Client can switch flexibly between our app and the traditional approach.
Enrich Data Sources: As census availability is a major impediment to the app, The Client may consider incorporating additional internal data, which are updated on a more regular basis, to substitute variables where suited to achieve more up-to-date results.
We’d like to express our sincere gratitude towards Rebecca Lee and Rory Lawless, our client representatives at the DC Office of the Deputy Mayor for Education (DME), as well as our instructors Professor Michael Fichman and Matthew Harris for their relentless support and guidance. We also appreciate this unique and exciting opportunity to apply our learnings into making a real-life impact through the collaboration between the Weitzman School of Design at University of Pennsylvania and the DME.
For any future inquiries, concerns, or updates, please feel free to reach out to team members at:
The following codes archives our initial process of testing out all three types of models: GLMNet, Random Forest, and XGBoost. According to the prediction vs. actual prediction plots, both Random Forest and XGBoost are prone to underpredict for high enrollment records. We thus decided to pursue polishing and development of the GLMNet model over the other two methods.
panel_prelim <- panel %>%
mutate(utilization = as.numeric(sub("%", "", utilization)) / 100,
med_rent_pct_inc = med_rent_pct_inc / 100,
med_owner_cost_pct_inc = med_owner_cost_pct_inc / 100) %>%
rename(dcps_boundary = NAME)
# clean up variable names to lower case
names(panel_prelim) <- tolower(names(panel_prelim))
panel_prelim <- panel_prelim %>%
left_join(grade_enrollment_lags %>% dplyr::select(-agg_total_students),
by=c("unique_id","census_year", "grade_level"))
panel_prelim <- panel_prelim %>%
left_join(capacity_change %>% dplyr::select(-facility_capacity),
by=c("unique_id","census_year"))
panel_prelim <- panel_prelim %>%
mutate(across(c(facility_capacity, enrollment), round))
# impute enrollment for missing values based on aggregate values
panel_prelim <- panel_prelim %>%
group_by(school_year, unique_id) %>%
# where enrollment is missing, add up agg_total_students for each grade
mutate(enrollment = case_when(is.na(enrollment) ~ sum(agg_total_students),
.default = enrollment))
panel_prelim <- panel_prelim %>%
st_join(wards22 %>% dplyr::select(NAME, geometry) %>%
rename("ward"="NAME")) # join wards back
panel_prelim <- panel_prelim %>%
left_join(ward_avg_util, by=c("school_year","ward"))
panel_prelim <- panel_prelim %>%
#assume missing values mean no enrollment that year
mutate_at(vars(agg_total_students_2yrbefore, agg_total_students_3yrbefore, agg_total_students_1yrbefore), ~replace_na(., 0)) %>%
# impute the same value for housing variables
# making the assumption that the year-to-year doesn't change much when it's only a year difference
# also doing for capacity
# only for 1yr
mutate(housing_1yrbefore = case_when(is.na(housing_1yrbefore) ~ total_housing_units, .default=housing_1yrbefore),
pct_change_housing_yrbefore = case_when(is.na(pct_change_housing_yrbefore) ~ 0, .default=pct_change_housing_yrbefore),
facility_capacity = case_when(facility_capacity == 0 | is.na(facility_capacity) ~ enrollment, .default=facility_capacity),
utilization=case_when(is.na(utilization) ~ avg_util, .default=utilization))
sapply(panel_prelim, function(x)sum(is.na(x)))
Data Splitting
set.seed(2554)
data_split_pre5 <- initial_time_split(panel_prelim5 %>%
st_drop_geometry() %>%
filter_all(all_vars(!is.infinite(.))),
prop = 2492/3739) # train 2013-2016, test 2017-2018
train_data_lag_pre5 <- training(data_split_pre5)
test_data_lag_pre5 <- testing(data_split_pre5)
data_cv_pre5 <- group_vfold_cv(train_data_lag_pre5, group = "school_year")
Recipe Setting
model_rec_pre5 <- recipe(agg_total_students_5yrlater ~ .,
data=train_data_lag_pre5) %>%
step_novel(all_nominal()) %>%
step_dummy(all_nominal()) %>%
step_indicate_na(agg_total_students_5yrlater)
linear_model_rec_pre5 <- recipe(agg_total_students_5yrlater ~ .,
data=train_data_lag_pre5) %>%
# acept mean impute for columns where there aren't many NA
step_impute_mean(all_numeric_predictors()) %>%
step_novel(all_nominal()) %>%
step_dummy(all_nominal()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal())
rf_plan_pre5 <- rand_forest() %>%
set_args(mtry = tune()) %>%
set_args(min_n = tune()) %>%
set_args(trees = 2000) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
glmnet_plan_pre5 <-
linear_reg() %>%
set_args(penalty = tune()) %>%
set_args(mixture = tune()) %>%
set_engine("glmnet")
XGB_plan_pre5 <- boost_tree() %>%
set_args(mtry = tune()) %>%
set_args(min_n = tune()) %>%
set_args(trees = 100) %>%
set_engine("xgboost") %>%
set_mode("regression")
glmnet_grid_pre5 <- expand.grid(penalty = seq(0, 1, by = .25),
mixture = seq(0,1,0.25))
rf_grid_pre5 <- expand.grid(mtry = c(5,10,17,25,30,40,50),
min_n = c(5,10,30,50,70,100,500))
xgb_grid_pre5 <- expand.grid(mtry = c(3,5),
min_n = c(1,5))
glmnet_wf_pre5 <-
workflow() %>%
add_recipe(linear_model_rec_pre5) %>%
add_model(glmnet_plan_pre5)
rf_wf_pre5 <-
workflow() %>%
add_recipe(model_rec_pre5) %>%
add_model(rf_plan_pre5)
xgb_wf_pre5 <-
workflow() %>%
add_recipe(model_rec_pre5) %>%
add_model(XGB_plan_pre5)
glmnet_tuned_pre5 <- glmnet_wf_pre5 %>%
tune::tune_grid(.,
resamples = data_cv_pre5,
grid = glmnet_grid_pre5,
control = control,
metrics = metrics)
rf_tuned_pre5 <- rf_wf_pre5 %>%
tune::tune_grid(.,
resamples = data_cv_pre5,
grid = rf_grid_pre5,
control = control,
metrics = metrics)
xgb_tuned_pre5 <- xgb_wf_pre5 %>%
tune::tune_grid(.,
resamples = data_cv_pre5,
grid = xgb_grid_pre5,
control = control,
metrics = metrics)
Model Results
show_best(glmnet_tuned_pre5, metric = "rsq", n = 15)
show_best(rf_tuned_pre5, metric = "rsq", n = 15)
show_best(xgb_tuned_pre5, metric = "rsq", n = 15)
glmnet_best_params_pre5 <- select_best(glmnet_tuned_pre5, metric = "rmse")
rf_best_params_pre5 <- select_best(rf_tuned_pre5, metric = "rmse")
xgb_best_params_pre5 <- select_best(xgb_tuned_pre5, metric = "rmse")
## Final workflow
glmnet_best_wf_pre5 <- finalize_workflow(glmnet_wf_pre5, glmnet_best_params_pre5)
rf_best_wf_pre5 <- finalize_workflow(rf_wf_pre5, rf_best_params_pre5)
xgb_best_wf_pre5 <- finalize_workflow(xgb_wf_pre5, xgb_best_params_pre5)
glmnet_best_OOF_preds_pre5 <- collect_predictions(glmnet_tuned_pre5) %>%
filter(penalty == glmnet_best_params_pre5$penalty[1] & mixture == glmnet_best_params_pre5$mixture[1])
rf_best_OOF_preds_pre5 <- collect_predictions(rf_tuned_pre5) %>%
filter(mtry == rf_best_params_pre5$mtry[1] & min_n == rf_best_params_pre5$min_n[1])
xgb_best_OOF_preds_pre5 <- collect_predictions(xgb_tuned_pre5) %>%
filter(mtry == xgb_best_params_pre5$mtry[1] & min_n == xgb_best_params_pre5$min_n[1])
# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds_pre5 <- rbind(data.frame(dplyr::select(glmnet_best_OOF_preds_pre5, .pred, agg_total_students_5yrlater), model = "glmnet"),
data.frame(dplyr::select(rf_best_OOF_preds_pre5, .pred, agg_total_students_5yrlater), model = "rf"),
data.frame(dplyr::select(xgb_best_OOF_preds_pre5, .pred, agg_total_students_5yrlater), model = "xgb")) %>%
group_by(model) %>%
mutate(RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
MAE = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred)) %>%
ungroup() %>%
mutate(model = factor(model, levels=c("glmnet","rf","xgb")))
ggplot(data = OOF_preds_pre5 %>%
dplyr::select(model, MAPE) %>%
distinct() ,
aes(x = model, y = MAPE, group = 1)) +
geom_path(color = "#440154") +
geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
labs(x=NULL, title="5 year lag model MAPE")+
theme_bw()
ggplot(data = OOF_preds_pre5 %>%
dplyr::select(model, RMSE) %>%
distinct() ,
aes(x = model, y = RMSE, group = 1)) +
geom_path(color = "#440154") +
geom_label(aes(label = paste0(round(RMSE)))) +
labs(x=NULL, title="5 year lag model RMSE")+
theme_bw()
# OOF predicted versus actual
ggplot(OOF_preds_pre5, aes(x = agg_total_students_5yrlater, y = .pred, group = model)) +
geom_point(alpha = 0.3) +
geom_abline(linetype = "dashed", color = "#fc8961") +
geom_smooth(method = "lm", color = "#440154") +
coord_equal() +
facet_wrap(~model, ncol=3) +
labs(title="5 year lag model")+
theme_bw()
glmnet_val_fit_geo_pre5 <- glmnet_best_wf_pre5 %>%
last_fit(split = data_split_pre5,
control = control,
metrics = metrics)
rf_val_fit_geo_pre5 <- rf_best_wf_pre5 %>%
last_fit(split = data_split_pre5,
control = control,
metrics = metrics)
xgb_val_fit_geo_pre5 <- xgb_best_wf_pre5 %>%
last_fit(split = data_split_pre5,
control = control,
metrics = metrics)
# collect validation set predictions from last_fit model
glmnet_val_pred_geo_pre5 <- collect_predictions(glmnet_val_fit_geo_pre5)
rf_val_pred_geo_pre5 <- collect_predictions(rf_val_fit_geo_pre5)
xgb_val_pred_geo_pre5 <- collect_predictions(xgb_val_fit_geo_pre5)
# Aggregate predictions from Validation set
val_preds_pre5 <- rbind(data.frame(glmnet_val_pred_geo_pre5, model = "glmnet"),
data.frame(rf_val_pred_geo_pre5, model = "rf"),
data.frame(xgb_val_pred_geo_pre5, model = "xgb")) %>%
left_join(., panel_prelim5 %>% filter(agg_total_students_5yrlater > 0) %>% #because remove NA/infinite alr done before data split, this is ok
rowid_to_column(var = ".row") %>%
dplyr::select(geometry, .row),
by = ".row") %>%
group_by(model) %>%
mutate(RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
MAE = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred)) %>%
ungroup() %>%
mutate(model = factor(model, levels=c("glmnet","rf","xgb")))
# plot RMSE by model type
ggplot(data = val_preds_pre5 %>%
dplyr::select(model, RMSE) %>%
distinct() ,
aes(x = model, y = RMSE, group = 1)) +
geom_path(color = "#440154") +
geom_label(aes(label = round(RMSE,1))) +
labs(title="5 year lag preliminary model, validation set, RMSE")+
theme_bw()
# plot MAPE by model type
ggplot(data = val_preds_pre5 %>%
dplyr::select(model, MAPE) %>%
distinct() ,
aes(x = model, y = MAPE, group = 1)) +
geom_path(color = "#440154") +
geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
labs(title="5 year lag preliminary model, validation set, MAPE")+
theme_bw()
# Validation Predicted vs. actual
ggplot(val_preds_pre5, aes(x = agg_total_students_5yrlater, y = .pred, group = model)) +
geom_point(alpha = 0.3) +
geom_abline(linetype = "dashed", color = "#fc8961") +
geom_smooth(method = "lm", color = "#440154") +
coord_equal() +
labs(title="5 year lag preliminary model, validation results")+
facet_wrap(~model, ncol = 3) +
theme_bw()