1 Introduction

This project is conducted to fulfill the course requirements for the 8010 Master of Urban Spatial Analytics (MUSA)/Smart Cities Practicum, Spring 2025. It is a collaboration between the University of Pennsylvania and the DC Office of the Deputy Mayor for Education (DME).

We sincerely appreciate Professor Michael Fichman and Professor Matthew Harris, as well as Rebecca Lee and Rory Lawless at DME, for their invaluable guidance, support, and dedication throughout the project.

[Project GitHub Page] - to be added after the cargo period

Project Dashboard

1.1 Team and Client

Four team members collaboratively contributed to this project. To reach out, please find their contact details below:

The project is commissioned by the DC Office of the Deputy Mayor for Education (DME), who are responsible for developing and implementing the Mayor’s vision for academic excellence and creating a high-quality education continuum from birth to 24 (from early childhood to K-12 to post-secondary and the workforce). The three major functions of the DME include:

  • Overseeing a District-wide education strategy

  • Managing interagency and cross-sector coordination

  • Providing oversight and/or support for the following education-related agencies including the DC Public Schools (DCPS) and Public Charter School Board (PCSB).

1.2 Context and Use Case

This student yield prediction project aims to enhance the DME’s ability to accurately forecast student enrollment across DCPS schools and boundaries, enabling more efficient resource allocation. This will support DME’s efforts including updates to the Master Facilities Plan, a key document that helps education stakeholders understand the current landscape and emerging needs, guiding decision-making and investments for a comprehensive approach to facility planning over the next 5 to 10 years.

Students’ decisions to enroll in public schools have traditionally been influenced by four key factors: parental demand, neighborhood characteristics, school availability, and policy interventions. As these factors have grown more complex, Washington, DC, has experienced steady growth in school-aged children between 2010 and 2020, an increase in highly educated families driving parental and housing demand, rising real estate prices, and migration patterns shaped by the pursuit of better education in specific wards.

Collectively, since existing year-on-year prediction methods do not adequately account for exogenous factors such as population and housing trends, our project provides a more comprehensive, data-informed tool to support the client in forecasting matriculation numbers. Per the client’s request, we focus specifically on elementary school (K–5) enrollments, with particular attention to kindergarten (grade K), as its enrollment cannot be predicted based on the previous grade. Our deliverables include this markdown and an app-format dashboard, to support the use case as below:

The project supports our client (DME) to make informed decision-making for education-related resource allocation, including personnel and capital investments, at each DCPS elementary school. The client can input housing pipeline and demographic data into a system to generate boundary school-specific matriculation predictions for incoming elementary students in a near-future year.

For the scope of our project and up-to-date data availability, our app forecasts for each school year between SY25-26 and SY29-30.

1.3 Overview of Washington, DC, Public School System

DC’s public education system consists of two categories of schools: DC Public schools and charter schools. DCPS includes all of the traditional schools in the Districts which have centralized school policies subject to regulation and oversight from the District government. Charter schools, comparatively, operate on innovative and non-traditional school models and policies in exchange of greater accountability for student performance. (Source: DC Pave)

schools_table %>%
  kbl() %>%
  column_spec(1, bold = T, border_right = T) %>%
  kable_styling(bootstrap_options = "striped", font_size = 12)
School Type Facilities Funding Funding Formula Regulation Leadership Curriculum Staffing
DCPS Provided by the District Funded by the District on a per-student basis By law, DCPS is required to give schools at least 95 percent of their previous year’s budget. A student-based budgeting (SBB) formula,​Limited staff-based allocations, Program grants, and Stability funding. Must follow all District laws and regulations Centralized decision-making process led by the Chancellor Chosen by DC Central Office with some input from principals Teacher positions are allocated using the Comprehensive Staffing Model Formula
Charter Funded with the Charter Facilities Allotment Funded by the District on a per-student basis Funded based on their audited, or actual, enrollment Unlike with DCPS, public charter LEAs do not receive one large lump sum of UPSFF funds. Freedom to create their own policies Individual Local Education Agency Executive Leaders have autonomy to make decisions Each individual Local Education Agency chooses their own curriculum LEA leaders decide how they want to structure staff positions at their schools

In terms of school choice, public school students in the district have multiple options.

  1. enroll in their in-boundary DCPS school, determined by their home address;

  2. enroll at a DCPS school outside of their geographic school boundary;

  3. enroll in a DCPS citywide or selective school; or​

  4. enroll at a public charter school.

In SY2022-23, 28% of students were enrolled at their in-boundary schools, with Ward 3 having the highest percentage of in-boundary attendance (84%), and Wards 5, 7, and 8 having the lowest (16-19%). Notably, 53-58% of students living in Wards 5, 7, and 8 were enrolled in public charter schools.​

Overall, 21% of students chose to attend an out-of-boundary school, while only 6% of students from Ward 3 made that choice.

1.4 The Business-as-Usual Approach

The current business-as-usual approach that Washington D.C. leverages to make enrollment forecasts is the Grade Progression Ratio (https://edscape.dc.gov/node/1607141), which operates on the rationale that enrollments at higher grade levels can be extrapolated from past enrollments at a corresponding lower grade (i.e. K-grade students from the same school facility will become the incoming 5th grade cohort in five years).

To calculate the Grade Promotion Rate (GPR) from Kindergarten to Grade 1 for the 2014-15 school year, one divides the number of students in Grade 1 during the 2014-15 audit by the number of students who were in Kindergarten during the 2013-14 audit. A ratio more than 1 (or greater than 100%) suggests that there are additional students joining, who did not enroll in the previous grade. A ratio less than 1 (or less than 100%) suggests that students may be transferring away from the school, or out of the public school system. It is important to note that GPRs are different than re-enrollment because they don’t necessarily reflect the same students.

However, this approach has several limitations. First, while the approach is more effective in making one-year forecasts, particularly for higher grade levels, it is inapplicable to make long-term forecasts for lower grade enrollments. For instance, while it is possible to deduce 5th grade enrollments in the next five years from current K-grade students, K level enrollments could be forecasted due to a lack of associated lower grade data, resulting in significantly fewer predictions feasible. Likewise, this method does not sufficiently account for external factors dynamically shaping enrollment preferences, such as facility expansion / closure, housing market changes, and socioeconomic trends, but rather assumes enrollment as a static choice once a cohort is enrolled, reinforcing a “business-as-usual” rationale.

Our approach improves the business-as-usual approach by improving on the two critical shortcomings highlighted. By incorporating historic and current enrollment records with comprehensive data sources such as the American Community Survey 5-Year Estimates and DC’s permits and residential mass appraisals, we deploy a regression-based method that enables significantly more predictions across all grades and years. Section 2 “Understanding Key Influences through Exploratory Data Analysis” further unwraps our thinking behind the new mechanism.

elementary <- list('K','1','2','3','4','5')

# Manually removed a record associated with school_id == 247 - this is the only duplicated row
schools_cleaned <- schools %>% 
              filter(grade_band %in% c("Elementary", "K-8")) %>% # Neve added
              group_by(school_year, lea_id, school_id, mar_latitude) %>% 
              slice_max(order_by = facility_capacity, n = 1, with_ties = FALSE) %>%
              ungroup()

elementary_by_year <- students_by_tract_sy1314_to_sy2425 %>%
  filter(grade_level %in% elementary) %>%
  left_join(schools_cleaned %>%
              dplyr::select(-school_name, -school_sector),
            by=c("lea_code"="lea_id","school_code"="school_id","school_year"="school_year", "school_main_facility_lat" = "mar_latitude", "school_main_facility_long" = "mar_longitude")) %>%
  dplyr::select(# original variables from the secret data
    school_year, residence_census_tract, lea_code, school_code, school_name,
    school_sector, in_boundary_indicator, school_main_facility_lat, school_main_facility_long,
    grade_level, total_students, 
    # variables added through the merge
    dcps_facility_name, facility_sector, grade_band,`ward_(2022)`, 
    dcps_boundary, facility_capacity, enrollment, utilization, unfilled_seats) %>%
  rename(ward_22 = `ward_(2022)`) # tidied up workflow

# Create unique facility ID for each school facility
elementary_by_year <- elementary_by_year %>%
  group_by(school_code, grade_band, school_main_facility_lat, school_main_facility_long) %>% # this ensures distinct facilities are selected
  mutate(unique_id = paste0("F", sprintf("%06d", cur_group_id()))) %>%
  ungroup()
# Create new column variable for matching census data
elementary_by_year <- elementary_by_year %>%
  mutate(census_year = case_when(
    school_year == 'SY13-14' ~ 2013,
    school_year == 'SY14-15' ~ 2014,
    school_year == 'SY15-16' ~ 2015,
    school_year == 'SY16-17' ~ 2016,
    school_year == 'SY17-18' ~ 2017,
    school_year == 'SY18-19' ~ 2018,
    school_year == 'SY19-20' ~ 2019,
    school_year == 'SY20-21' ~ 2020,
    school_year == 'SY21-22' ~ 2021,
    school_year == 'SY22-23' ~ 2022,
    school_year == 'SY23-24' ~ 2023,
    school_year == 'SY24-25' ~ 2024,
    TRUE ~ NA_real_
  ))

Plot of the Errors of the Business-As-Usual Approach

ggplot(df5, aes(x = `sum(total_students).x`, y = `sum(total_students).y`)) +
  geom_point(alpha = 0.3, color = "#440154") +
  geom_abline(linetype = "dashed", color = "#fc8961") +
  geom_smooth(method = "lm", color = "black") +
  coord_equal() +
  scale_x_continuous(
    expand = c(0, 0),
    limits = c(0, NA),
    breaks = seq(0, max(df5$`sum(total_students).x`, na.rm = TRUE), by = 40)
  ) +
  scale_y_continuous(
    expand = c(0, 0),
    limits = c(0, NA),
    breaks = seq(0, max(df5$`sum(total_students).y`, na.rm = TRUE), by = 50)
  ) +
  labs(
    title = "Plot of Predicted Against Observed with Best Fit Line",
    x = "Observed",
    y = "Predicted"
  ) +
  theme_bw()

Summary Table of Errors of the Business-As-Usual Approach

# MAE
df5$error = abs(df5$`sum(total_students).y`-df5$`sum(total_students).x`)
mean(df5$error, na.rm = TRUE)
## [1] 12.2847
# MAPE
df5$APE = abs(df5$`sum(total_students).y`-df5$`sum(total_students).x`)/(df5$`sum(total_students).x`)
mean(df5$APE, na.rm = TRUE)
## [1] 0.4295409
# RMSE
df5$sqerror = (df5$`sum(total_students).y` - df5$`sum(total_students).x`)^2
rmse_5_year <- sqrt(mean(df5$error, na.rm = TRUE))

# Create summary table
progression_error_summary <- data.frame(
  Metric = c("Mean Absolute Error", "Mean Absolute Percentage Error", "Root Mean Square Error"),
  Value = c(
    mean(df5$error, na.rm = TRUE),
    mean(df5$APE, na.rm = TRUE),
    sqrt(mean(df5$sqerror, na.rm = TRUE))
  )
)

print(progression_error_summary)
##                           Metric      Value
## 1            Mean Absolute Error 12.2846975
## 2 Mean Absolute Percentage Error  0.4295409
## 3         Root Mean Square Error 15.4071666

2 Understanding Key Influences through Exploratory Data Analysis

As the team believes that grade-level enrollment changes is not isolated to the facility itself, but associated with a range of exogenous factors and concerns. Some of these factors are demographic variables (i.e., total population), socioeconomic variables (i.e., income), building permits and new residential development. The next section uses exploratory charts and visualizations to demonstrate the following insights:

  • School facility capacity and utilization patterns vary across years and geographies

  • Socioeconomic characteristics have been a key underlying concern to school district planning

  • Housing developments have shaped and will continue to shape elementary school enrollments in DC

We have consulted the following data sources to build up the analysis process:

Open Data DC

Others

We then combine these variables with the school enrollment to create a panel data for our analysis. Details of our panel data creation can be found in Section 3.

2.1 Spatial Units

This exploratory analysis encompasses three primary spatial units present in our data: census tracts, elementary school boundaries, and wards. Elementary school boundaries change over time (explained further in respective section). DC is divided into 8 different wards, each having its own demographic and social characteristics, as well as representation in the District.

ggarrange(nrow=1,
  ggplot(boundary)+
    geom_sf(fill="whitesmoke", color="#440154", lwd=0.1)+
    labs(title="Elementary School Boundary")+
    theme_void(),
  
  ggplot(wards, aes(label=NAME))+
    geom_sf(fill="whitesmoke", color="#440154", lwd=0.1)+
    geom_sf_label(fill = "white",
                fun.geometry = sf::st_centroid,
                size=2.7)+
    labs(title="Wards")+
    theme_void()+
    theme(plot.title = element_text(hjust = 0.5)),
  
  ggplot(ct20)+
    geom_sf(fill="whitesmoke", color="#440154", lwd=0.1)+
    labs(title="Census Tracts (2020)")+
    theme_void()
)

2.2 Insight 1: School facility capacity and utilization patterns vary across years and geographies

For the last decade (SY13-14 to SY23-24), most elementary schools in DC operate close to full capacity, with utilization rates clustering around 1.0. Nonetheless, a noticeable portion of schools falls below this threshold, indicating under-utilization, while the utilization of a smaller yet significant number exceeds their capacity, reflecting overcrowding. Comparatively, total school capacity and enrollment numbers exhibit multiple peaks, indicating variations in school sizes and enrollment volumes.

School_info$Utilization <- as.numeric(gsub("%", "", School_info$Utilization)) / 100

## Histogram
display_histogram <- function(arg, binwidth){
  ggplot(School_info, aes(x = .data[[arg]])) +  
    geom_histogram(binwidth = binwidth, fill = "#440154", color = "black", alpha = 0.7) +  
    labs(title = paste("Distribution of", arg),
         x = arg, 
         y = "Count") + 
    theme_minimal()  
}

display_histogram("Facility capacity", 10)

display_histogram("Enrollment", 10)

display_histogram("Utilization", 0.02)

Elementary schools under both DCPS and Public Charter sectors have experienced relatively steady growth in facility capacity over time. Meanwhile, while enrollment increased in both sectors before SY19-20, DCPS schools have faced a more pronounced decline since then, suggesting a greater negative impact from COVID-19 compared to Charter schools.

## time series plot
# Convert 'school_year' to an ordered factor to ensure correct ordering
School_info$`School year` <- factor(School_info$`School year`, 
                                  levels = c("SY13-14", "SY14-15", "SY15-16", "SY16-17", "SY17-18", "SY18-19", "SY19-20", "SY20-21", "SY21-22", "SY22-23", "SY23-24", "SY24-25"), 
                                  ordered = TRUE)

# filter SY24-25 for they have no data
School_info_filtered <- School_info %>%
  filter(`School year` != "SY24-25") 

# Calculate the sum of 'capacity' by 'sector' for each year
time_series_plot <- function(arg){
  arg_summary <- School_info_filtered %>%
    group_by(`School year`, `School sector`) %>%
    summarise(total_capacity = sum(.data[[arg]], na.rm = TRUE)) 
  
  # plot time series chart
  ggplot(arg_summary, aes(x = `School year`, y = total_capacity, color = `School sector`, group = `School sector`)) +
    geom_line() + 
    geom_point() +  
    scale_color_manual(values=c("#440154","#fc8961"))+
    labs(title = paste("Total", arg, "by Sector Over Time"),
         x = "School Year",
         y = paste("Total",arg),
         color = "Sector") + 
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))  
}

time_series_plot("Facility capacity")

time_series_plot("Enrollment")

However, the upward trend in school capacity is not reflected in utilization rates. In particular, DCPS schools have experienced a steady decline in utilization over the past decade, in stark contrast to the increasing rates seen in charter schools. This divergence may be driven by demographic shifts, evolving education policies, and a growing preference for public charter schools. Importantly, it also highlights the risk of relying solely on lower-grade enrollment trends to forecast future demand at higher grade levels.

time_series_plot("Utilization")

2.2.1 How utilization varies across space?

We further enquired the spatial patterns of elementary schools’ average utilization rate (by school boundary and wards) in Washington DC over two five year periods (SY13-18 and SY18-23). Note: We also conducted analysis on average capacity and enrollment rate, but the spatial clustering of those two factors evolved less significantly compared to utilization.

Overall, we observed a decreased average utilization rate across DCPS elementary schools across both school boundaries and wards. The decreases are the most severe in school boundaries in Ward 7 and Ward 8, but the lowest average utilization rate continues to be observed in Ward 5.

Comparatively, changes in average utilization rate across Public Charter elementary schools appear more versatile. For instance, while certain school districts from wards 7 and 8 have seen improved utilization rate of Public Charter elementary schools over the years, the gains are counteracted by decrease in adjacent school boundaries, leading to an overall drop in ward-level utilization rate. Nonetheless, this suggests a shift in the spatial distribution of student enrollment.

# SY13-14 ~ SY17-18 and SY18-19 ~ SY22-23, total capacity by school boundary
school_info_sf <-st_as_sf(as.data.frame(School_info), coords = c("MAR longitude", "MAR latitude"), crs = 4326)

school_boundary_sf <- st_join(school_info_sf, boundary, join = st_within)

## mutate a new variable: year_group, including: SY13-18, SY18-23
school_boundary_sf <- school_boundary_sf %>%
  mutate(year_group = case_when(
    `School year` %in% c("SY13-14", "SY14-15", "SY15-16", "SY16-17", "SY17-18") ~ "SY13-18",
    `School year` %in% c("SY18-19", "SY19-20", "SY20-21", "SY21-22", "SY22-23") ~ "SY18-23",
    TRUE ~ NA_character_  # dealing the no match case
  ))
  
  
# Calculate the sum of school capacities for each boundary, grouped by time period and School sector
mean_by_boundary <- school_boundary_sf %>%
  group_by(NAME, year_group, `School sector`) %>%
    summarise(
    mean_capacity = mean(`Facility capacity`, na.rm = TRUE), 
    mean_enrollment = mean(Enrollment, na.rm = TRUE),          
    mean_utilization = mean(Utilization, na.rm = TRUE)         
  ) %>%
  na.omit() %>%
  st_drop_geometry()

# merge the capacity, enrollment and utilization data to the Boundary data
Boundary <- boundary %>%
  left_join(mean_by_boundary, by = "NAME")

# filter by School sector
dcps_data <- Boundary %>% filter(`School sector` == "DCPS")
charter_data <- Boundary %>% filter(`School sector` == "Public charter")

# function for visualization
base_map <- function(geodata, argbysector, total_arg){
  ggplot() +
    geom_sf(data = geodata, fill = "transparent") +
    geom_sf(data = argbysector, aes(fill = .data[[total_arg]])) +  
    #scale_fill_viridis_c(option = "magma", trans = "log", direction = -1) +
    scale_fill_gradientn(colours = c("#440154", "white", "#fc8961"), trans="log")+
    facet_wrap(~ year_group) +  
    theme_void() +
    labs(fill = total_arg) +
    theme(axis.text = element_blank(),
          axis.ticks = element_blank(),
          plot.title = element_text(hjust = 0.5))
} 
## visualize by ward
wards <- st_read("https://maps2.dcgis.dc.gov/dcgis/rest/services/DCGIS_DATA/Administrative_Other_Boundaries_WebMercator/MapServer/53/query?outFields=*&where=1%3D1&f=geojson", quiet = TRUE)

# Group school_boundary_info by Ward(2022) and year_group, and calculate the total capacity
ward_summary <- school_boundary_sf %>%
  group_by(`Ward (2022)`, year_group, `School sector`) %>%
  summarise(
    mean_capacity = mean(`Facility capacity`, na.rm = TRUE), 
    mean_enrollment = mean(Enrollment, na.rm = TRUE),          
    mean_utilization = mean(Utilization, na.rm = TRUE)         
  ) %>%
  na.omit()%>% 
  st_drop_geometry()

# spatial join
wards <- wards %>%
  left_join(ward_summary, by = c("NAME" = "Ward (2022)"))

# filter by School sector
dcps_data_ward <- wards %>% filter(`School sector` == "DCPS")
charter_data_ward <- wards %>% filter(`School sector` == "Public charter")
dcps_plot_uti <- base_map(Boundary, dcps_data, "mean_utilization") + ggtitle("Average Utilization Rate by School Boundary (DCPS)")

charter_plot_uti <- base_map(Boundary, charter_data, "mean_utilization") + ggtitle("Average Utilization Rate by School Boundary (Public Charter)")


print(dcps_plot_uti)

print(charter_plot_uti)

2.2.2 Spatial Changes in In/Out-of-Boundary Enrollment Flows

A closer look at enrollment patterns reveals shifting in-boundary and out-of-boundary behaviors within DCPS schools. The following interactive maps illustrate kindergarten enrollment flows to high-utilization DCPS schools (those operating at 90% capacity or higher) in 2014, 2018, and 2022. In these maps, each line connects the estimated centroid of a student’s home census tract to their school’s location. Yellow-to-purple line pairs represent in-boundary enrollment flows, while purple-to-purple lines indicate out-of-boundary enrollment. Line thickness corresponds to the number of students, with thicker lines indicating greater volume. Note: Only home tracts that sent three or more students to a school in a given year are shown to highlight stronger school-tract relationships while protecting individual privacy. Tooltip information excludes names of census tracts and schools.

As the maps reveal, most high-utilization schools primarily serve in-boundary students and are located in neighborhoods with varying income and homeownership levels. However, over time, there has been a growing concentration of these schools in Wards 3, 4, and 6. Concurrently, in-boundary enrollments have strengthened, while out-of-boundary enrollments have declined.

These shifts indicate that high enrollment is increasingly driven by neighborhood-based preferences rather than static choice. As a result, relying on past lower-grade enrollment figures to forecast future demand at higher grade levels may introduce risk—particularly if those higher grades serve broader or different catchment areas, or if the observed in-boundary demand does not persist across the years, violating the progression assumption.

## identify high-utilization schools
near_capacity_2014 <- schools.sf %>%
  st_drop_geometry() %>%
  filter(utilization > 90 & school_year == "SY14-15" & school_sector == "DCPS") %>%
  dplyr::pull(school_id)

near_capacity_2018 <- schools.sf %>%
  st_drop_geometry() %>%
  filter(utilization > 90 & school_year == "SY18-19" & school_sector == "DCPS") %>%
  dplyr::pull(school_id)

near_capacity_2022 <- schools.sf %>%
  st_drop_geometry() %>%
  filter(utilization > 90 & school_year == "SY22-23" & school_sector == "DCPS") %>%
  dplyr::pull(school_id)
elem_centroid <- st_centroid(tracts12_23)

both_centroids_14 <- elem_centroid %>%
  rename(residence_census_tract = GEOID,
         census_year = year) %>%
  left_join(elementary_students, by = c("residence_census_tract", "census_year")) %>%
  drop_na(total_students)

both_centroids_14$geom2 = both_centroids_14 %>% st_drop_geometry() %>% st_as_sf(coords=c("school_main_facility_long","school_main_facility_lat")) %>% st_geometry()

both_centroids_14 %>%
  filter(
    census_year == 2014 & school_code %in% near_capacity_2014 & grade_level == "K" & total_students >= 3
    ) %>%
  group_by(residence_census_tract) %>%
  mutate(weight = total_students,
         pct_own = round(pct_own * 100, 2),
         pct_vacant = round(pct_vacant * 100, 2),
         pct_poverty = round(pct_poverty * 100, 2),
         boundary_word = if_else(in_boundary_indicator == 0, "No", "Yes"),
         tooltip = paste0(total_students, " students",
                          "<br>", "In Boundary School? ", boundary_word,
                          "<br><br>", "Total Population: ", total_population,
                          "<br>", "Housing Units: ", total_housing_units,
                          "<br>", "Vacancy rate: ", pct_vacant,"%",
                          "<br>", "Tract homeownership rate: ", pct_own, "%",
                          "<br>", "Tract median gross rent: ", median_gross_rent,
                          "<br>", "Median HH Income: ", median_household_income,
                          "<br>", "Poverty rate: ", pct_poverty, "%")) %>%
  mapdeck(token = token, style = mapdeck_style("light")) %>%
  add_arc(origin = "geometry",
          destination = "geom2",
          stroke_width = "weight",
          stroke_from = "boundary_word",
          stroke_to = "census_year",
          tooltip = "tooltip",
          auto_highlight = TRUE,
          legend= list( stroke_from = TRUE, stroke_to = FALSE ),
          legend_options = list(
            stroke_from = list( title = "In-boundary School"))) %>%
  add_polygon(
    data = boundary,
    fill_opacity = 0,
    stroke_width = 15,
    stroke_colour = "#708090",
    layer = "polygon_layer") %>%
    mapdeck_view(
    location = c(-77.0395304337645, 38.892799521270774),
    # set the zoom level
    zoom = 10,
    # set the pitch angle
    pitch = 30,
  )
both_centroids <- elem_centroid %>%
  rename(residence_census_tract = GEOID,
         census_year = year) %>%
  left_join(elementary_students, by = c("residence_census_tract", "census_year")) %>%
  drop_na(total_students)

both_centroids$geom2 = both_centroids %>% st_drop_geometry() %>% st_as_sf(coords=c("school_main_facility_long","school_main_facility_lat")) %>% st_geometry()

both_centroids %>%
  filter(
    census_year == 2018 & school_code %in% near_capacity_2018 & grade_level == "K" & total_students >= 3
    ) %>%
  group_by(residence_census_tract) %>%
  mutate(weight = total_students,
         pct_own = round(pct_own * 100, 2),
         pct_vacant = round(pct_vacant * 100, 2),
         pct_poverty = round(pct_poverty * 100, 2),
         boundary_word = if_else(in_boundary_indicator == 0, "No", "Yes"),
         tooltip = paste0(total_students, " students",
                          "<br>", "In Boundary School? ", boundary_word,
                          "<br><br>", "Total Population: ", total_population,
                          "<br>", "Housing Units: ", total_housing_units,
                          "<br>", "Vacancy rate: ", pct_vacant,"%",
                          "<br>", "Tract homeownership rate: ", pct_own, "%",
                          "<br>", "Tract median gross rent: ", median_gross_rent,
                          "<br>", "Median HH Income: ", median_household_income,
                          "<br>", "Poverty rate: ", pct_poverty, "%")) %>%
  mapdeck(token = token, style = mapdeck_style("light")) %>%
  add_arc(origin = "geometry",
          destination = "geom2",
          stroke_width = "weight",
          stroke_from = "boundary_word",
          stroke_to = "census_year",
          tooltip = "tooltip",
          auto_highlight = TRUE,
          legend= list( stroke_from = TRUE, stroke_to = FALSE ),
          legend_options = list(
            stroke_from = list( title = "In-boundary School"))) %>%
  add_polygon(
    data = boundary,
    fill_opacity = 0,
    stroke_width = 15,
    stroke_colour = "#708090",
    layer = "polygon_layer") %>%
    mapdeck_view(
    location = c(-77.0395304337645, 38.892799521270774),
    # set the zoom level
    zoom = 10,
    # set the pitch angle
    pitch = 30,
  )
both_centroids_22 <- elem_centroid %>%
  rename(residence_census_tract = GEOID,
         census_year = year) %>%
  left_join(elementary_students, by = c("residence_census_tract", "census_year")) %>%
  drop_na(total_students)

both_centroids_22$geom2 = both_centroids_22 %>% st_drop_geometry() %>% st_as_sf(coords=c("school_main_facility_long","school_main_facility_lat")) %>% st_geometry()

both_centroids_14 %>%
  filter(
    census_year == 2022 & school_code %in% near_capacity_2022 & grade_level == "K" & total_students >= 3
    ) %>%
  group_by(residence_census_tract) %>%
  mutate(weight = total_students,
         pct_own = round(pct_own * 100, 2),
         pct_vacant = round(pct_vacant * 100, 2),
         pct_poverty = round(pct_poverty * 100, 2),
         boundary_word = if_else(in_boundary_indicator == 0, "No", "Yes"),
         tooltip = paste0(total_students, " students",
                          "<br>", "In Boundary School? ", boundary_word,
                          "<br><br>", "Total Population: ", total_population,
                          "<br>", "Housing Units: ", total_housing_units,
                          "<br>", "Vacancy rate: ", pct_vacant,"%",
                          "<br>", "Tract homeownership rate: ", pct_own, "%",
                          "<br>", "Tract median gross rent: ", median_gross_rent,
                          "<br>", "Median HH Income: ", median_household_income,
                          "<br>", "Poverty rate: ", pct_poverty, "%")) %>%
  mapdeck(token = token, style = mapdeck_style("light")) %>%
  add_arc(origin = "geometry",
          destination = "geom2",
          stroke_width = "weight",
          stroke_from = "boundary_word",
          stroke_to = "census_year",
          tooltip = "tooltip",
          auto_highlight = TRUE,
          legend= list( stroke_from = TRUE, stroke_to = FALSE ),
          legend_options = list(
            stroke_from = list( title = "In-boundary School"))) %>%
  add_polygon(
    data = boundary,
    fill_opacity = 0,
    stroke_width = 15,
    stroke_colour = "#708090",
    layer = "polygon_layer") %>%
    mapdeck_view(
    location = c(-77.0395304337645, 38.892799521270774),
    # set the zoom level
    zoom = 10,
    # set the pitch angle
    pitch = 30,
  )

2.3 Insight 2: Socioeconomic characteristics has been a key underlying concern to school district planning

To establish a more comprehensive approach, our tool incorporates the use of socioeconomic characteristics pertaining to both the school’s census tract and a student’s origin census tract (i.e., the census tract their home is) into the formula. This meets our client’s request of “input demographic data” to build future forecasts as addressed in the case study and corroborates DME’s continued effort to balance resource efficiency at DCPS schools through boundary revisions. For instance, DME completed a review of DC elementary school boundaries in 2014, with subsequent changes implemented between SY15-16. According to the “Elementary School Fact Sheet - SY2014-15” the purpose of the revision was to address previously closed schools as well as to ensure that there are a sufficient and evenly distributed number of students living within each of the boundaries. Overlaying with ACS data before and after the school boundary study (year 2010-2014 and 2014-2018), the DCPS zone revision corresponds to the following results:

  • Within the affected zones, there has been some increase in school-aged population, particularly in the Southwest (SW) Quadrant.

  • Along with the population increase, there has also been a simultaneous boom in housing units. (The map accounts for both traditional residential and rental units.)

  • The pressure of elementary school enrollment (approximated by the number of population aged 5 to 9 years’ old) was eased overall, as indicated by less clusterings and overall deconcentration.

ggplot(data = change_zones) +
  geom_sf(data=city_boundary)+
  geom_sf(aes(fill = CHANGE)) +  # Set colors based on the 'CHANGE' attribute
  scale_fill_manual(values=c("#440154","#fc8961"))+  # Automatically assign discrete colors
  theme_minimal() +
  labs(title = "Change Area Visualization by CHANGE Attribute", 
       fill = "Change Type") +
  theme_void()+# Add title and legend
  theme(plot.title = element_text(hjust = 0.5))  # Center the title

# Function to map total population with change_zones overlay
map_tot_pop <- function(data, title) {
  ggplot() +
    geom_sf(data = city_boundary, fill = "whitesmoke", color = "black", lwd = 0.2) + 
    geom_sf(data = data %>% 
              st_intersection(change_zones), aes(fill = tot_pop), color = NA) +
    geom_sf(data = change_zones, fill = NA, color = "black", lwd = 0.3) +  
    scale_fill_gradientn(colors = c("#440154", "white", "#fc8961"))+
    theme_void() +
    theme(plot.title = element_text(hjust = 0.5))+
    labs(title = title, fill = "Total Population")
}

# Generate maps for each year
map_2010_pop <- map_tot_pop(tract10_clip, "Total Population, 2010")
map_2014_pop <- map_tot_pop(tract14_clip, "2014")
map_2018_pop <- map_tot_pop(tract18_clip, "2018")

ggarrange(nrow=1,
             map_2010_pop, map_2014_pop, map_2018_pop,
          common.legend = TRUE, legend="bottom")

map_tot_housing <- function(data, title) {
  ggplot() +
    geom_sf(data = city_boundary, fill = "whitesmoke", color = "black", lwd = 0.2) +
    geom_sf(data = data %>% 
              st_intersection(change_zones), aes(fill = tot_housing), color = NA) +
    geom_sf(data = change_zones, fill = NA, color = "black", lwd = 0.3) +  
    scale_fill_gradientn(colors = c("#440154", "white", "#fc8961"))+
    theme_void() +
    labs(title = title, fill = "Total Housing")
}

# Generate maps for each year
map_2010_housing <- map_tot_housing(tract10_clip, "Total Housing, 2010")
map_2014_housing <- map_tot_housing(tract14_clip, "2014")
map_2018_housing <- map_tot_housing(tract18_clip, "2018")

ggarrange(nrow=1,
          map_2010_housing, map_2014_housing, map_2018_housing,
          common.legend = TRUE, legend="bottom")

map_school_age <- function(data, title) {
  ggplot() +
    geom_sf(data = city_boundary, fill = "whitesmoke", color = "black", lwd = 0.2) +
    geom_sf(data = data %>% 
              st_intersection(change_zones), aes(fill = school_aged), color = NA) +
    geom_sf(data = change_zones, fill = NA, color = "black", lwd = 0.3) + 
    scale_fill_gradientn(colors = c("#440154", "white", "#fc8961"))+
    theme_void() +
    theme(plot.title = element_text(hjust = 0.5))+
    labs(title = title, fill = "Total School-aged Children")
}

# Generate maps for each year
map_2010_school_age <- map_school_age(tract10_clip, "School-aged Children, 2010")
map_2014_school_age <- map_school_age(tract14_clip, "2014")
map_2018_school_age <- map_school_age(tract18_clip, "2018")

# Arrange maps side by side
ggarrange(nrow=1,
          map_2010_school_age,
          map_2014_school_age,
          map_2018_school_age,
          common.legend = TRUE, legend="bottom")

2.4 Insight 3: Housing developments have shaped and will continue to shape elementary school enrollments in DC

Housing is a key feature in our predictive model, as development and consumption patterns offer insights into the current distribution of population and wealth while also serving as a proxy for long-term residential preferences and the spatial clustering of distinct community characteristics. The importance of incorporating planned housing construction into forecasting is further supported by the ongoing apartment boom in the D.C. region. According to RentCafe’s 2024 report, the Washington D.C. Metro area ranks seventh among all U.S. metropolitan areas in new apartment construction—signalling that housing dynamics will continue to shape residential choices and, by extension, school enrollment patterns.

2.4.1 Transacted Residential Properties

Leveraging post-transaction residential property ownership, our preliminary analysis of the Computer Assisted Mass Appraisal (CAMA) – Residential dataset from Open Data DC reveals a distinct spatial clustering of high-priced homes concentrated in Wards 2 through 4, particularly in the NW Quadrant. These areas also tend to host schools with high utilization rates. Note: To ensure data quality and reduce historical bias, the dataset was filtered to include only sales occurring after 2002 and retained only the most recent transaction per property.

year_of_sale <- res_data %>% 
  group_by(SALEYEAR) %>% 
  summarize(Total_Units = sum(NUM_UNITS),
         Total_GBA = sum(as.numeric(GBA))) %>%  #use ground building area, since multiple buildings may exist in one SSL
  ungroup()

# Plot Total Units by Year
ggplot(year_of_sale, aes(x = factor(SALEYEAR), y = Total_Units)) +
  geom_bar(stat = "identity", fill = "#440154") +
  labs(title = "Total Units by Year", x = as.factor("Year of Sale"), y = "Total Units") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

pal <- c("#fed395","#fea973","#e95462","#a3307e","#440154")

ggplot() +
  geom_sf(data = wards22, fill = "whitesmoke", col = "white", lwd = 0.75)+
  geom_sf(data = res_data, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = pal,
                   labels=qBr(res_data, "price"),
                   name="Quintile Breaks\nof Property Price") +
  guides(colour = guide_legend(override.aes = list(size=3))) +
  labs(title="Most Recent Sales Price by Property in Washington, D.C.") +
  theme_void()+
  theme(plot.title = element_text(size = 12, hjust = 0.5))

The additional tables were created to summarize housing characteristics by current Ward and DCPS School Boundaries in Washington, DC.

Table of Housing Properties by Wards

res_ward22 <- res_data %>%
  group_by(ward_2022) %>% 
  summarize(
    Tot_Units = sum(NUM_UNITS, na.rm = TRUE),
    Med_Price = median(price, na.rm = TRUE),
    Med_GBA = median(as.numeric(GBA), na.rm = TRUE),
    Med_AYB = median(year_built, na.rm = TRUE),
    Med_Rooms = median(rooms, na.rm = TRUE),
    Med_Bedrooms = median(bedroom, na.rm = TRUE),
    Med_Bathrooms = median(bathroom, na.rm = TRUE),
    Med_Stories = median(as.numeric(STORIES), na.rm = TRUE)
  ) 

res_ward22 <- st_drop_geometry(res_ward22)

kable(res_ward22, caption = "Summary of Residential Data by Ward (2022 boundary)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  scroll_box(width = "100%", height = "300px")
Summary of Residential Data by Ward (2022 boundary)
ward_2022 Tot_Units Med_Price Med_GBA Med_AYB Med_Rooms Med_Bedrooms Med_Bathrooms Med_Stories
Ward 1 5733 799000 1602 1911 7 4 2 2
Ward 2 4586 1350000 1908 1900 8 3 3 2
Ward 3 7492 1252509 2170 1937 8 4 3 2
Ward 4 10645 695250 1600 1930 7 4 2 2
Ward 5 12137 550000 1412 1930 7 3 2 2
Ward 6 8687 845000 1512 1910 6 3 2 2
Ward 7 11052 355000 1224 1944 6 3 2 2
Ward 8 6584 338000 1308 1949 7 3 2 2

Table of Housing Properties by DCPS Boundaries

res_DCPS <- res_data %>%
  group_by(dcps_zone) %>% 
  summarize(
    Tot_Units = sum(NUM_UNITS, na.rm = TRUE),
    Med_Price = median(price, na.rm = TRUE),
    Med_GBA = median(as.numeric(GBA), na.rm = TRUE),
    Med_AYB = median(year_built, na.rm = TRUE),
    Med_Rooms = median(rooms, na.rm = TRUE),
    Med_Bedrooms = median(bedroom, na.rm = TRUE),
    Med_Bathrooms = median(bathroom, na.rm = TRUE),
    Med_Stories = median(as.numeric(STORIES), na.rm = TRUE)
  )

kable(res_DCPS %>% st_drop_geometry(), caption = "Summary of Residential Data by Elementary School Boundary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  scroll_box(width = "100%", height = "200px")
Summary of Residential Data by Elementary School Boundary
dcps_zone Tot_Units Med_Price Med_GBA Med_AYB Med_Rooms Med_Bedrooms Med_Bathrooms Med_Stories
Amidon-Bowen 143 910000.0 1640.0 1967.0 7.0 3 3 2.00
Bancroft 1080 1050000.0 1971.0 1916.0 7.0 4 3 2.25
Barnard 1544 610000.0 1440.0 1929.0 7.0 3 2 2.00
Beers 804 364400.0 1184.0 1942.5 6.0 3 2 2.00
Boone 506 365000.0 1360.0 1937.0 6.0 3 2 2.00
Brent 1261 1050000.0 1679.5 1900.0 7.0 3 2 2.00
Brightwood 474 650000.0 1650.5 1936.0 7.0 4 2 2.00
Browne 850 425000.0 1152.0 1937.0 6.0 3 2 2.00
Bruce Monroe @ Park View 1332 639500.0 1488.0 1913.0 7.0 3 2 2.00
Bunker Hill 2077 500000.0 1292.0 1948.0 7.0 3 2 2.00
Burroughs 944 600000.0 1549.0 1927.0 7.0 3 2 2.00
Burrville 711 289000.0 1026.0 1945.0 6.0 3 2 2.00
Cleveland 540 750000.0 1464.0 1907.0 6.0 3 2 2.00
Cooke, H.D 233 955000.0 2223.0 1911.5 8.0 4 3 2.50
Drew 659 280500.0 1276.0 1951.0 6.0 3 2 2.00
Eaton 628 1602000.0 2426.0 1925.0 9.0 4 3 2.25
Garfield 384 288900.0 1024.0 1947.0 6.0 3 2 2.00
Garrison 1368 1025000.0 1666.0 1895.0 7.0 3 2 2.00
Harris, C. W 917 340000.0 1292.0 1950.0 6.0 3 2 2.00
Hearst 497 1125000.0 1920.0 1929.0 8.0 4 3 2.00
Hendley 573 287500.0 1152.0 1955.0 6.0 3 2 2.00
Houston 669 328607.5 1343.5 1940.0 7.0 3 2 2.00
Hyde-Addison 1966 1450000.0 1803.5 1900.0 8.0 3 3 2.00
Janney 1396 1002000.0 1852.0 1937.0 7.0 3 3 2.00
Ketcham 915 347000.0 1344.0 1936.0 6.0 3 2 2.00
Key 1411 1325000.0 2315.0 1951.0 9.0 4 3 2.00
Kimball 379 310750.0 1362.5 1942.0 6.0 3 2 2.00
King, Martin Luther 863 317650.0 1548.0 1941.0 7.0 3 2 2.00
Lafayette 1706 1090000.0 2159.0 1936.0 8.0 4 3 2.00
Langdon 1625 521990.0 1596.0 1951.0 7.0 3 2 2.00
Langley 2398 699000.0 1600.0 1910.0 7.0 4 3 2.00
Lasalle-Backus 1027 460000.0 1152.0 1952.0 6.0 3 1 2.00
Leckie 312 335000.0 1152.0 1950.0 6.0 3 2 2.00
Lewis 1223 770000.0 1818.0 1922.0 8.0 4 3 2.00
Ludlow-Taylor 1167 877500.0 1579.0 1909.0 6.5 3 2 2.00
Malcolm X 264 422940.0 1600.0 2008.0 8.0 3 2 2.00
Mann 987 1675000.0 3036.0 1952.0 9.5 5 4 2.00
Maury 1383 761500.0 1387.0 1914.0 6.0 3 2 2.00
Miner 1643 620000.0 1312.0 1925.0 6.0 3 2 2.00
Moten 594 354000.0 1368.0 2001.0 7.0 3 2 2.00
Murch 1007 1186500.0 2303.5 1938.0 9.0 4 3 2.00
Nalle 657 315000.0 1358.0 1950.0 6.0 3 2 2.00
Noyes 1218 679000.0 1400.0 1928.0 7.0 3 2 2.00
Oyster-Adams 756 1640000.0 2789.0 1922.0 10.0 5 3 2.50
Patterson 469 355209.5 1713.0 1944.0 8.0 4 3 2.00
Payne 1177 781750.0 1368.0 1918.0 6.0 3 2 2.00
Peabody-Watkins 2270 919500.0 1596.0 1908.0 7.0 3 2 2.00
Plummer 804 295000.0 1056.0 1951.0 6.0 3 2 2.00
Powell 782 850000.0 1845.0 1923.0 8.0 4 3 2.00
Randle Highlands 930 407626.0 1373.0 1941.0 7.0 3 2 2.00
Raymond 729 740000.0 1536.0 1912.0 7.0 4 3 2.00
Reed, Marie 398 949000.0 1520.0 1907.0 7.0 3 2 2.00
Ross 348 1585000.0 2352.0 1890.0 9.5 4 3 3.00
Savoy 314 311500.0 1272.0 1943.0 6.0 3 2 2.00
School Without Walls @ Francis-Stevens 522 2012500.0 3100.0 1912.5 11.0 4 4 3.00
Seaton 1100 776000.0 1625.0 1900.0 7.0 3 2 2.00
Shepherd 627 865000.0 2211.0 1937.0 8.0 4 3 2.00
Simon 429 323500.0 1296.0 1950.0 6.0 3 2 2.00
Smothers 808 318000.0 1176.0 1941.0 6.0 3 2 2.00
Stanton 230 349950.0 1278.0 1951.0 7.0 3 2 2.00
Stoddert 795 1070500.0 1547.0 1931.0 8.0 4 3 2.00
Takoma 657 625000.0 1621.0 1932.0 7.0 3 2 2.00
Thomas 766 318750.0 1040.0 1942.0 6.0 3 2 2.00
Thomson 78 1105000.0 2340.0 1900.0 8.0 4 3 2.50
Truesdell 1033 595000.0 1404.0 1925.5 8.0 3 2 2.00
Tubman 1256 691000.0 1520.0 1910.0 6.0 4 2 2.00
Turner 620 310000.0 1360.0 2006.0 7.0 3 2 2.00
Tyler 494 765000.0 1298.0 1909.0 6.0 3 2 2.00
Van Ness 448 728032.5 1440.0 2009.0 7.0 3 2 3.00
Walker-Jones 396 670000.0 1470.0 1906.0 7.0 3 2 2.00
Wheatley 1920 499000.0 1368.0 1927.0 7.0 3 2 2.00
Whitlock 588 300501.0 1190.0 1944.0 6.0 3 2 2.00
Whittier 1703 545000.0 1368.0 1931.0 7.0 3 2 2.00
Wilson, J.O. 1134 706750.0 1393.0 1912.0 6.0 3 2 2.00

Over the past decade, there has been a noticeable increase in both single-family and multi-family residential transactions, often occurring concurrently across different parts of the city. When disaggregated by DCPS elementary school boundaries, the broader price trends remain consistent; however, the distribution of housing types diverges. Single-family and multi-family units tend to cluster in different school zones, suggesting underlying patterns of residential segmentation.

res_DCPS <- res_DCPS %>% 
  st_drop_geometry() %>% 
  inner_join(boundary, by = c("dcps_zone" = "NAME")) %>% 
  st_as_sf()

ggplot() +
  geom_sf(data = res_DCPS, aes(fill = Med_Price), size = 0.75) +
  scale_fill_gradientn(colors = pal, name = "Median Price ($)") +  
  labs(title = "Housing Price by Current Elementary DCPS Zone") +
  theme_void() +
  theme(plot.title = element_text(size = 12, hjust = 0.5))

single_family <- res_data %>% 
  filter(NUM_UNITS == 1)

single_family_DCPS <- single_family %>%
  group_by(dcps_zone) %>% 
  summarize(
    Tot_Units = sum(NUM_UNITS, na.rm = TRUE),
    Med_Price = median(price, na.rm = TRUE),
    Med_GBA = median(as.numeric(GBA), na.rm = TRUE),
    Med_AYB = median(year_built, na.rm = TRUE),
    Med_Rooms = median(rooms, na.rm = TRUE),
    Med_Bedrooms = median(bedroom, na.rm = TRUE),
    Med_Bathrooms = median(bathroom, na.rm = TRUE),
    Med_Stories = median(as.numeric(STORIES), na.rm = TRUE)
  ) %>% 
  st_drop_geometry()

single_family_DCPS <- single_family_DCPS %>% 
  inner_join(boundary, by = c("dcps_zone" = "NAME")) %>% 
  st_as_sf()

ggplot() +
  geom_sf(data = single_family_DCPS, aes(fill = Tot_Units), size = 0.75) +
  scale_fill_gradientn(colors = pal, name = "Total Units") +
  labs(title = "Single Family Housing Units by Current Elementary DCPS Zone") +
  theme_void() +
  theme(plot.title = element_text(size = 12, hjust = 0.5))

multi_family <- res_data %>% 
  filter(NUM_UNITS > 1)

multi_family_DCPS <- multi_family %>%
  group_by(dcps_zone) %>% 
  summarize(
    Tot_Units = sum(NUM_UNITS, na.rm = TRUE),
    Med_Price = median(price, na.rm = TRUE),
    Med_GBA = median(as.numeric(GBA), na.rm = TRUE),
    Med_AYB = median(year_built, na.rm = TRUE),
    Med_Rooms = median(rooms, na.rm = TRUE),
    Med_Bedrooms = median(bedroom, na.rm = TRUE),
    Med_Bathrooms = median(bathroom, na.rm = TRUE),
    Med_Stories = median(as.numeric(STORIES), na.rm = TRUE)
  ) %>% 
  st_drop_geometry()

multi_family_DCPS <- multi_family_DCPS %>% 
  inner_join(boundary %>% as_tibble(), by = c("dcps_zone" = "NAME")) %>% 
  st_as_sf()

ggplot() +
  geom_sf(data = multi_family_DCPS, aes(fill = Tot_Units), size = 0.75) +
  scale_fill_gradientn(colors = c("#ffffff", "#f98e09", "#440154"), name = "Total Units") +
  labs(title = "Multi-Family Housing Units by Current Elementary DCPS Zones") +
  theme_void() +
  theme(plot.title = element_text(size = 12, hjust = 0.5))

2.4.2 Planned Construction

Data of construction permits from 2013 to 2023 are downloaded from OpenData DC Building Permits (https://opendata.dc.gov/search?q=building%20permits). We filter out for building permits that are “Construction” type for our analysis. Broadly, construction encompass activities like alteration and repair, civil plan, demolition and new buildings. We decided to use activities classified as construction as we were interested in how these activities near a school could influence student enrollment numbers.

The total number of construction permits issued annually in DC shows fluctuation. From 2013 to 2019, the total number of permits issued increased year-on-year. This decreased from 2019 to 2023, likely due to the supply-chain disruptions from Covid-19.

The total number of construction permits issued annually by wards also follow similar trends, with most wards having an increase in construction permit from 2013 to 2019 but a decrease there after. Ward 2 and 6 have the highest number of construction permits issued compared to other wards.

Taking 2023 as an example, we observe that there is a spatial distribution of construction permits issued by DCPS Zones.There appears to be less construction permits in school zones in the south while large number of permits are clustered in the east.

# Total number of construction permits issued annually
ggplot(combined_construction_data %>%
         group_by(year) %>%
         summarise(total_count = n()),
       aes(x=year, y=total_count)) +
  geom_bar(stat="identity", fill="#a3307e")+
  labs(title = "Total number of construction permits issued annually", x="Year", y="Total Construction Permits")+
  theme_minimal()

# Total number of construction permits issued annually by wards
ggplot(combined_construction_data %>% 
         group_by(year, WARD) %>%
         summarise(total_count = n()),
       aes(x=year, y=total_count, color=WARD)) +
  geom_point() +
  geom_line() +
  scale_color_manual(values = c("#fcfdbf", "#fed395", "#fea973", "#f1605d", "#a3307e", "#0D0887", "#440154", "#180f3d")) +
  scale_x_continuous(breaks = seq(min(combined_construction_data$year), max(combined_construction_data$year), by = 1))+
  labs(title = "Total number of construction permits issued annually by wards", x="Year", y="Total Construction Permits")

  theme_minimal()
## List of 136
##  $ line                            :List of 6
##   ..$ colour       : chr "black"
##   ..$ linewidth    : num 0.5
##   ..$ linetype     : num 1
##   ..$ lineend      : chr "butt"
##   ..$ arrow        : logi FALSE
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_line" "element"
##  $ rect                            :List of 5
##   ..$ fill         : chr "white"
##   ..$ colour       : chr "black"
##   ..$ linewidth    : num 0.5
##   ..$ linetype     : num 1
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_rect" "element"
##  $ text                            :List of 11
##   ..$ family       : chr ""
##   ..$ face         : chr "plain"
##   ..$ colour       : chr "black"
##   ..$ size         : num 11
##   ..$ hjust        : num 0.5
##   ..$ vjust        : num 0.5
##   ..$ angle        : num 0
##   ..$ lineheight   : num 0.9
##   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : logi FALSE
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ title                           : NULL
##  $ aspect.ratio                    : NULL
##  $ axis.title                      : NULL
##  $ axis.title.x                    :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 2.75points 0points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.title.x.top                :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 0
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 2.75points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.title.x.bottom             : NULL
##  $ axis.title.y                    :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : num 90
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 2.75points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.title.y.left               : NULL
##  $ axis.title.y.right              :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : num -90
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.75points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text                       :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : chr "grey30"
##   ..$ size         : 'rel' num 0.8
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.x                     :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 1
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 2.2points 0points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.x.top                 :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : NULL
##   ..$ vjust        : num 0
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 2.2points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.x.bottom              : NULL
##  $ axis.text.y                     :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 1
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 0points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.y.left                : NULL
##  $ axis.text.y.right               :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 0
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 0points 0points 2.2points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.text.theta                 : NULL
##  $ axis.text.r                     :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 0.5
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : 'margin' num [1:4] 0points 2.2points 0points 2.2points
##   .. ..- attr(*, "unit")= int 8
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ axis.ticks                      : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ axis.ticks.x                    : NULL
##  $ axis.ticks.x.top                : NULL
##  $ axis.ticks.x.bottom             : NULL
##  $ axis.ticks.y                    : NULL
##  $ axis.ticks.y.left               : NULL
##  $ axis.ticks.y.right              : NULL
##  $ axis.ticks.theta                : NULL
##  $ axis.ticks.r                    : NULL
##  $ axis.minor.ticks.x.top          : NULL
##  $ axis.minor.ticks.x.bottom       : NULL
##  $ axis.minor.ticks.y.left         : NULL
##  $ axis.minor.ticks.y.right        : NULL
##  $ axis.minor.ticks.theta          : NULL
##  $ axis.minor.ticks.r              : NULL
##  $ axis.ticks.length               : 'simpleUnit' num 2.75points
##   ..- attr(*, "unit")= int 8
##  $ axis.ticks.length.x             : NULL
##  $ axis.ticks.length.x.top         : NULL
##  $ axis.ticks.length.x.bottom      : NULL
##  $ axis.ticks.length.y             : NULL
##  $ axis.ticks.length.y.left        : NULL
##  $ axis.ticks.length.y.right       : NULL
##  $ axis.ticks.length.theta         : NULL
##  $ axis.ticks.length.r             : NULL
##  $ axis.minor.ticks.length         : 'rel' num 0.75
##  $ axis.minor.ticks.length.x       : NULL
##  $ axis.minor.ticks.length.x.top   : NULL
##  $ axis.minor.ticks.length.x.bottom: NULL
##  $ axis.minor.ticks.length.y       : NULL
##  $ axis.minor.ticks.length.y.left  : NULL
##  $ axis.minor.ticks.length.y.right : NULL
##  $ axis.minor.ticks.length.theta   : NULL
##  $ axis.minor.ticks.length.r       : NULL
##  $ axis.line                       : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ axis.line.x                     : NULL
##  $ axis.line.x.top                 : NULL
##  $ axis.line.x.bottom              : NULL
##  $ axis.line.y                     : NULL
##  $ axis.line.y.left                : NULL
##  $ axis.line.y.right               : NULL
##  $ axis.line.theta                 : NULL
##  $ axis.line.r                     : NULL
##  $ legend.background               : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ legend.margin                   : 'margin' num [1:4] 5.5points 5.5points 5.5points 5.5points
##   ..- attr(*, "unit")= int 8
##  $ legend.spacing                  : 'simpleUnit' num 11points
##   ..- attr(*, "unit")= int 8
##  $ legend.spacing.x                : NULL
##  $ legend.spacing.y                : NULL
##  $ legend.key                      : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ legend.key.size                 : 'simpleUnit' num 1.2lines
##   ..- attr(*, "unit")= int 3
##  $ legend.key.height               : NULL
##  $ legend.key.width                : NULL
##  $ legend.key.spacing              : 'simpleUnit' num 5.5points
##   ..- attr(*, "unit")= int 8
##  $ legend.key.spacing.x            : NULL
##  $ legend.key.spacing.y            : NULL
##  $ legend.frame                    : NULL
##  $ legend.ticks                    : NULL
##  $ legend.ticks.length             : 'rel' num 0.2
##  $ legend.axis.line                : NULL
##  $ legend.text                     :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : 'rel' num 0.8
##   ..$ hjust        : NULL
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ legend.text.position            : NULL
##  $ legend.title                    :List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : NULL
##   ..$ hjust        : num 0
##   ..$ vjust        : NULL
##   ..$ angle        : NULL
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi TRUE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  $ legend.title.position           : NULL
##  $ legend.position                 : chr "right"
##  $ legend.position.inside          : NULL
##  $ legend.direction                : NULL
##  $ legend.byrow                    : NULL
##  $ legend.justification            : chr "center"
##  $ legend.justification.top        : NULL
##  $ legend.justification.bottom     : NULL
##  $ legend.justification.left       : NULL
##  $ legend.justification.right      : NULL
##  $ legend.justification.inside     : NULL
##  $ legend.location                 : NULL
##  $ legend.box                      : NULL
##  $ legend.box.just                 : NULL
##  $ legend.box.margin               : 'margin' num [1:4] 0cm 0cm 0cm 0cm
##   ..- attr(*, "unit")= int 1
##  $ legend.box.background           : list()
##   ..- attr(*, "class")= chr [1:2] "element_blank" "element"
##  $ legend.box.spacing              : 'simpleUnit' num 11points
##   ..- attr(*, "unit")= int 8
##   [list output truncated]
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi TRUE
##  - attr(*, "validate")= logi TRUE
# Construction counts in 2023 by school zones
construction_data_2023 <- combined_construction_data[combined_construction_data$year=="2023",] %>%
  st_transform(crs = st_crs(boundary))

construction_within_dcps <- st_join(construction_data_2023, boundary, join = st_within)

construction_count <- construction_within_dcps %>%
  as_tibble() %>% 
  group_by(DCPS_ID, year) %>%
  summarise(construction_projects = n())


breaks <- c(0, 100, 200, 300, 400, 500, Inf)
labels <- c("0-100", "101-200", "201-300", "301-400", "401-500", "500+")

dcps_zones_2 <- left_join(boundary, construction_count, by="DCPS_ID") %>%
  mutate(construction_group = cut(construction_projects, breaks = breaks, labels = labels, include.lowest = TRUE))

ggplot() + 
  geom_sf(data = dcps_zones_2, aes(fill = construction_group)) +
  scale_fill_manual(values = c("#180f3d", "#440154", "#a3307e", "#f1605d", "#fea973", "#fed395", "#fcfdbf"),
                    limits = c("500+", "401-500", "301-400", "201-300", "101-200", "0-100"),
                    name = "Construction Permits Issued 2023") + 
  coord_sf() +
  theme(legend.position = "right")

2.4.3 Rental Housing

The rental housing market in DC is expanding, with over 32,000 units of luxury and mixed-income housing completed since 2014, over 21,000 affordable units completed, and thousands more in the pipeline. Housing construction is a priority in DC, with the recent announcement of the RENTAL Act, meant to protect affordable housing and strengthen the local housing ecosystem.

new_residential_polygon %>% group_by(NAME) %>% summarize(n=n(), sum_units=sum(UNITS)) %>%
  ggplot(aes(fill=sum_units, label=NAME)) +
  geom_sf()+
  geom_sf_label(fill = "white",
                fun.geometry = sf::st_centroid)+
  scale_fill_gradientn(colors = c("#ffffff", "#f98e09", "#440154"), name="Units Built")+
  labs(title="Units Built, 2014-2023")+
  theme_void()

Construction of new rental housing has been most pronounced in Ward 6, but most Wards have seen over 1,000 new units in the ten-year period of these data. Affordable housing has also been on the rise, and its construction is less concentrated in one part of the city. The RENTAL Act should see the market continue to flourish with the incentivization of new housing construction and protections for renters being a priority for Mayor Bowser.

ggplot(new_residential_point)+
  geom_sf(data=boundary, fill="whitesmoke", color="grey70", lwd=0.1)+
  geom_sf(data=waterbodies, fill = "ghostwhite", color="lightsteelblue")+
  #geom_sf(aes(size=UNITS), color="#fdb42f",alpha=0.9)+
  geom_sf(data=affordable %>% filter(STATUS_PUBLIC != "Pipeline"), aes(size=TOTAL_AFFORDABLE_UNITS),
          color = "#440154", alpha=0.6)+
  geom_sf(aes(size=UNITS), color="#fdb42f",alpha=0.6)+
  scale_size_continuous(range = c(0.5, 4), limits=c(0,1400), breaks =seq(0, 1400,by=300), name="Units Built")+
  guides(size = guide_legend(override.aes = list(color="black")))+
  labs(caption = "Source: Open Data DC\n",
       title="New units in DC, 2013-2023", subtitle="Affordable Units in Purple")+
  theme_void()+
  theme(legend.position = "bottom")

3 Panel Data Creation

Panel data is a dataset in which the behavior of entities (i) are observed across time (t). In this case, it is the behavior of student enrollment by grades, in each school, over time. One row of data will therefore show the number of Students for Grade X, in School Y, in Year Z.

To develop the best model supporting the app’s backend results, we engineered a panel data that enlists every unique school facility-grade level-school year combination, targeted at student enrollment 5 years later, with supporting predictors ranging from facility based data (enrollment, capacity, utilization), socioeconomic data (demographic, age, income, housing, education), construction permit records, residential mass appraisal data. From data collected, we further manipulated future and lag variables for construction permits (cumulative permits within 5 years), student enrollments (both lag records associated with previous years, as well as future variables indicating enrollment 1-3 year(s) later), and capacity (change in capacity 5 years later).

The list of variables in our panel data are below. For each type of variables added, please find more detailed elaborations in their respective section.
Raw Variable Names and Descriptions
Raw.Variable.Name Defintion
school_year School Year
school_code School Code
school_name School Name
grade_level Grade level
unique_id This ID was assigned to each school based on their geographic location, to differentiate them when modelling. It is a categorical variable
census_year The year this row of data corresponds to. For example SY13-14 is 2013
geoid Geographic ID of School
dcps_boundary This is the name of DCPS School Zones Boundary the school is located in. It is a categorical variable
school_sector This is the type of public school in DC. It is a categorical variable of either DC Public School or Public Charter Schools
facility_capacity School Capacity: The maximum number of students who can be housed in a school building based on the number and types of spaces available and the type of educational plan offered.
enrollment Student Enrollment Numbers
utilization Utilization is a measurement of how full a facility is to meet educational programmatic needs. Utilization is calculated as the number of enrolled students over the capacity of the school facility.
median_hhincome_indv_mean Median Household Income, averaged from the proportion of student in each census tract. The following equation calculates this: (1/Total Students)×Σ(Number of students in census tract N)×(Median Household Income in census tract N)
pct_poverty_indv_mean Poverty rate, averaged from the proportion of students in each census tract. The following equation calculates this: (1/Total Students)×Σ(Number of students in census tract N)×(Poverty rate in census tract N)
total_population Total population, in the census tract the school is located in
num_households Total number of households, in the census tract the school is located in
total_housing_units Total housing units, in the census tract the school is located in
median_gross_rent Median Gross Rent, in the census tract the school is located in
med_rent_pct_inc Median Gross Rent as percentage of income, in the census tract the school is located in
med_owner_cost_pct_inc Median Owner Costs as percentage of income, in the census tract the school is located in
med_rooms_renter Median Number of Rooms in Renter Occupied housing, in the census tract the school is located in
median_household_income Median Household Income, in the census tract the school is located in
avg_hh_size Average Household Size, in the census tract the school is located in
pct_nonwhite Percentage of non-white population, in the census tract the school is located in
pct_poverty Poverty rate, in the census tract the school is located in
pct_vacant Percentage of vacant housing, in the census tract the school is located in
pct_own Percentage of owner occupied housing among total population, in the census tract the school is located in
pct_rent Percentage of renter occupied housing among total population, in the census tract the school is located in
pct_education_no_school Percentage of adults aged 25 and above without school qualifications, in the census tract the school is located in
pct_education_hs Percentage of adults aged 25 and above with high school qualifications, in the census tract the school is located in
pct_education_bachelors Percentage of adults aged 25 and above with bachelor’s degree, in the census tract the school is located in
construction_projects Total number of construction permits, in the census tract the school is located in
avg_price_5_nearest Average sales price of 5 nearest residential properties to the school
med_bedroom_5_nearest Median number of bedrooms of 5 nearest residential properties to the school
cumulative_permits5yr Total number of construction permits issued in the school’s census tract in past 5 years
housing_1yrbefore Total housing units, 1 year before the specified school year, in the census tract the school is located in
pct_change_housing_yrbefore Percentage change in total housing units from 1 year before, in the census tract the school is located in
agg_total_students This is the total number of students in the school for each grade, in the specified school year. The specified school year is contained in a separate column in the panel data
geometry Geographic location of the school in geometry form
agg_total_students_1yrbefore This is the total number of students in the school for each grade, 1 year before the specified school year
agg_total_students_2yrbefore This is the total number of students in the school for each grade, 2 years before the specified school year
agg_total_students_3yrbefore This is the total number of students in the school for each grade, 3 years before the specified school year
agg_total_students_1yrlater This is the total number of students in the school for each grade, 1 year after the specified school year
agg_total_students_2yrlater This is the total number of students in the school for each grade, 2 years after the specified school year
agg_total_students_3yrlater This is the total number of students in the school for each grade, 3 years after the specified school year
agg_total_students_4yrlater This is the total number of students in the school for each grade, 4 years after the specified school year
agg_total_students_5yrlater This is the total number of students in the school for each grade, 5 years after the specified school year
capacity_5yrlater Percentage change in school capacity 5 years after specified year. The specified school year is contained in a separate column in the panel data
pct_change_capacity_5yrlater Percentage change in school capacity 5 years after specified year. The specified school year is contained in a separate column in the panel data
ward This is the name of municipal ward in DC. It is a categorical variable.
avg_util Average utilization by wards

3.1 Matching Enrollment Data with Facility Data

The same school code can be associated with multiple facilities in D.C., and scatter across distinct school boundaries. To appreciate the difference, we created a uniqueID for each facility in the region (unique combinations of school_code, school_name, latitude, and longitude), ensuring maximum capacity, enrollment, and utilization accuracy is captured for each facility.

Due to code re-sequencing for narrative improvement, we repeated the same code as in section 1.4 below. This does not impose any changes to our final results.

elementary <- list('K','1','2','3','4','5')

# Manually removed a record associated with school_id == 247 - this is the only duplicated row
schools_cleaned <- schools %>% 
              filter(grade_band %in% c("Elementary", "K-8")) %>% # Neve added
              group_by(school_year, lea_id, school_id, mar_latitude) %>% 
              slice_max(order_by = facility_capacity, n = 1, with_ties = FALSE) %>%
              ungroup()

elementary_by_year <- students_by_tract_sy1314_to_sy2425 %>%
  filter(grade_level %in% elementary) %>%
  left_join(schools_cleaned %>%
              dplyr::select(-school_name, -school_sector),
            by=c("lea_code"="lea_id","school_code"="school_id","school_year"="school_year", "school_main_facility_lat" = "mar_latitude", "school_main_facility_long" = "mar_longitude")) %>%
  dplyr::select(# original variables from the secret data
    school_year, residence_census_tract, lea_code, school_code, school_name,
    school_sector, in_boundary_indicator, school_main_facility_lat, school_main_facility_long,
    grade_level, total_students, 
    # variables added through the merge
    dcps_facility_name, facility_sector, grade_band,`ward_(2022)`, 
    dcps_boundary, facility_capacity, enrollment, utilization, unfilled_seats) %>%
  rename(ward_22 = `ward_(2022)`) # tidied up workflow

# Create unique facility ID for each school facility
elementary_by_year <- elementary_by_year %>%
  group_by(school_code, grade_band, school_main_facility_lat, school_main_facility_long) %>% # this ensures distinct facilities are selected
  mutate(unique_id = paste0("F", sprintf("%06d", cur_group_id()))) %>%
  ungroup()
# Create new column variable for matching census data
elementary_by_year <- elementary_by_year %>%
  mutate(census_year = case_when(
    school_year == 'SY13-14' ~ 2013,
    school_year == 'SY14-15' ~ 2014,
    school_year == 'SY15-16' ~ 2015,
    school_year == 'SY16-17' ~ 2016,
    school_year == 'SY17-18' ~ 2017,
    school_year == 'SY18-19' ~ 2018,
    school_year == 'SY19-20' ~ 2019,
    school_year == 'SY20-21' ~ 2020,
    school_year == 'SY21-22' ~ 2021,
    school_year == 'SY22-23' ~ 2022,
    school_year == 'SY23-24' ~ 2023,
    school_year == 'SY24-25' ~ 2024,
    TRUE ~ NA_real_
  ))

3.2 Processing and Adding ACS Predictors

We respectively matched the tract-level information to each school facility’s unique neighborhood tract, and to home origin tracts of the recorded student cohort (to obtain the weighted average that represents the student’s household characteristics).

tracts12_23 <- tracts12_23 %>% 
  group_by(GEOID) %>% 
  arrange(GEOID, year, .by_group = TRUE) %>%  # Ensures sort happens within each group
  mutate(
    housing_1yrbefore = lag(total_housing_units, n = 1),
    pct_change_housing_yrbefore = (total_housing_units - housing_1yrbefore) / housing_1yrbefore
  ) %>%
  ungroup()

Manually Create 2024 Rows Using Best Data Available (ACS 2023)

Because 2020-2024 ACS 5-Year Estimates were not available during the period of this project, after retrieving all ACS variables from 2019-2023 ACS 5-Year Estimates for census year 2023 and calculating change variables (i.e. housing_1yrbefore & pct_change_housing_yrbefore), we duplicated the 2023 rows as the “best available data” for 2024, this enables us to predict for SY29-30 using SY24-25 enrollment records.

tracts_2024 <- tracts12_23 %>%
  filter(year == 2023) %>%
  mutate(year = 2024)

# Then bind it back to the original data
tracts12_24 <- bind_rows(tracts12_23, tracts_2024) %>%
  arrange(GEOID, year)

Create Individual Mean Variables Based on Student Cohort Makeup

tracts_no_geometry <- tracts12_24 %>%
  st_drop_geometry() %>%
  dplyr::select(GEOID, year, total_population, 
                population_white, population_black, population_native, population_asian, population_hispanic, 
                Male5to9, Female5to9, 
                num_households, total_housing_units, occupied_housing_units, vacant_housing_units,
                median_gross_rent, med_rent_pct_inc, median_home_value, med_owner_cost_pct_inc, 
                pop_owner_occupied, pop_renter_occupied,
                med_rooms_owner, med_rooms_renter, median_household_income, poverty_pop, pop_25above,
                education_no_school, education_hs,
                education_bachelors, kinder_enrollment, first_fourth_grade_enr, avg_hh_size)
  
test1<- elementary_by_year %>%
  left_join(tracts_no_geometry, by = c("census_year" = "year", "residence_census_tract" = "GEOID"))

# Calculate the weighted mean of each variable
test2 <- test1 %>%
  group_by(census_year, unique_id, grade_level) %>%
  mutate(TotalPop_Indv_Mean = sum(total_population * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         Male5to9_Indv_Mean = sum(Male5to9 * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         Female5to9_Indv_Mean = sum(Female5to9 * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         NumberWhites_Indv_Mean = sum(population_white * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         NumberBlack_Indv_Mean = sum(population_black * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         Median_HHIncome_Indv_Mean = sum(median_household_income * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         NumberNative_Indv_Mean = sum(population_native * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         NumberAsian_Indv_Mean = sum(population_asian * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         NumberHispanic_Indv_Mean = sum(population_hispanic * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         NumHouseholds_Indv_Mean = sum(num_households * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         TotalHousingUnits_Indv_Mean = sum(total_housing_units * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         OccupiedHousingUnits_Indv_Mean = sum(occupied_housing_units * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         VacantHousingUnits_Indv_Mean = sum(vacant_housing_units * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         MedianGrossRent_Indv_Mean = sum(median_gross_rent * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         MedRentPctInc_Indv_Mean = sum(med_rent_pct_inc * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         MedianHomeValue_Indv_Mean = sum(median_home_value * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         MedOwnerCostPctInc_Indv_Mean = sum(med_owner_cost_pct_inc * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         PopOwnerOccupied_Indv_Mean = sum(pop_owner_occupied * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         PopRenterOccupied_Indv_Mean = sum(pop_renter_occupied * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         MedRoomsOwner_Indv_Mean = sum(med_rooms_owner * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         MedRoomsRenter_Indv_Mean = sum(med_rooms_renter * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         PovertyPop_Indv_Mean = sum(poverty_pop * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         Pop_25Above_Indv_Mean = sum(pop_25above * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         EducationNoSchool_Indv_Mean = sum(education_no_school * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         EducationHS_Indv_Mean = sum(education_hs * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         EducationBachelors_Indv_Mean = sum(education_bachelors * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         KinderEnrollment_Indv_Mean = sum(kinder_enrollment * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         FirstFourthGradeEnr_Indv_Mean = sum(first_fourth_grade_enr * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE),
         AvgHHSize_Indv_Mean = sum(avg_hh_size * total_students, na.rm = TRUE) / sum(total_students, na.rm = TRUE)
) %>% 
  ungroup()

test2 <- test2 %>%
  mutate(
    pct_5to9_Indv_Mean = (Male5to9_Indv_Mean + Female5to9_Indv_Mean) / TotalPop_Indv_Mean,
    pct_nonwhite_Indv_Mean = 1 - (NumberWhites_Indv_Mean / TotalPop_Indv_Mean),
    pct_poverty_Indv_Mean = PovertyPop_Indv_Mean / TotalPop_Indv_Mean,
    pct_vacant_Indv_Mean = VacantHousingUnits_Indv_Mean / TotalHousingUnits_Indv_Mean,
    pct_own_Indv_Mean = PopOwnerOccupied_Indv_Mean / TotalHousingUnits_Indv_Mean,
    pct_rent_Indv_Mean = PopRenterOccupied_Indv_Mean / TotalHousingUnits_Indv_Mean,
    pct_educationnoschool_Indv_Mean = EducationNoSchool_Indv_Mean / Pop_25Above_Indv_Mean,
    pct_educationhs_Indv_Mean = EducationHS_Indv_Mean / Pop_25Above_Indv_Mean,
    pct_educationbachelors_Indv_Mean = EducationBachelors_Indv_Mean / Pop_25Above_Indv_Mean
  )

elementary_by_year_Indv_Mean <- test2 %>%
  group_by(school_year, school_code, school_name, grade_level, unique_id, census_year,
           school_sector,school_main_facility_lat, school_main_facility_long,
           facility_sector, facility_capacity, enrollment, utilization, unfilled_seats, # Neve added these back, census year school capacity
            TotalPop_Indv_Mean, Male5to9_Indv_Mean, Female5to9_Indv_Mean, 
            NumberWhites_Indv_Mean, NumberBlack_Indv_Mean, Median_HHIncome_Indv_Mean, 
            NumberNative_Indv_Mean, NumberAsian_Indv_Mean, NumberHispanic_Indv_Mean, 
            NumHouseholds_Indv_Mean, TotalHousingUnits_Indv_Mean, OccupiedHousingUnits_Indv_Mean, 
            VacantHousingUnits_Indv_Mean, MedianGrossRent_Indv_Mean, MedRentPctInc_Indv_Mean, 
            MedianHomeValue_Indv_Mean, MedOwnerCostPctInc_Indv_Mean, PopOwnerOccupied_Indv_Mean, 
            PopRenterOccupied_Indv_Mean, MedRoomsOwner_Indv_Mean, MedRoomsRenter_Indv_Mean, 
            PovertyPop_Indv_Mean, EducationNoSchool_Indv_Mean, EducationHS_Indv_Mean, 
            EducationBachelors_Indv_Mean, KinderEnrollment_Indv_Mean, FirstFourthGradeEnr_Indv_Mean, 
            AvgHHSize_Indv_Mean,
           pct_5to9_Indv_Mean, pct_nonwhite_Indv_Mean, pct_poverty_Indv_Mean, pct_vacant_Indv_Mean, 
           pct_own_Indv_Mean, pct_rent_Indv_Mean, 
           pct_educationnoschool_Indv_Mean, pct_educationhs_Indv_Mean, pct_educationbachelors_Indv_Mean) %>%
  summarise(agg_total_students = sum(total_students, na.rm = TRUE))
# Convert to shapefile
elem_by_year_Indv_Mean.sf <- st_as_sf(elementary_by_year_Indv_Mean,
                                               coords = c("school_main_facility_long","school_main_facility_lat"),
                                               crs = 4326)

Appending School-Based ACS Variables

Match ACS data (both actual data from 2013 to 2023, and hypothetical data from 2024) to each school facility-year combination.

# Use st_within to identify the census tract each school were within, at different years
results_list <- list()

years_new <- 2013:2024 # add hypothetical 2024

# Loop through each year
for (i in years_new) {
  # Filter the data frames for the specific year
  elementary_filtered <- elem_by_year_Indv_Mean.sf %>%
    filter(census_year == i) 
  
  tracts_filtered <- tracts12_24 %>%
    filter(year == i) %>%
    dplyr::select(GEOID, total_population, population_white, population_black, population_native, population_asian, population_hispanic,
           Male5to9, Female5to9, num_households, total_housing_units, occupied_housing_units, vacant_housing_units, median_gross_rent,
           med_rent_pct_inc, median_home_value, med_owner_cost_pct_inc, pop_owner_occupied, pop_renter_occupied, med_rooms_owner,
           med_rooms_renter, median_household_income, poverty_pop, education_no_school, education_hs, education_bachelors,
           kinder_enrollment, first_fourth_grade_enr, avg_hh_size, 
           pct_5to9, pct_nonwhite, pct_poverty, pct_vacant, pct_own, pct_rent, 
           pct_education_no_school, pct_education_hs, pct_education_bachelors,
           housing_1yrbefore, pct_change_housing_yrbefore) # NZ: added these variables
  # Perform the spatial join
  joined_data <- st_join(elementary_filtered, tracts_filtered, join = st_within, left = TRUE)
  
  # Add the joined data to the results list
  results_list[[as.character(i)]] <- joined_data
}

Convert Results to Panel

Merge the panel results for the first time, the total number of rows (unique school facility-grade level-school year combinations) retained is 9401. We will be repeating the merging/binding process to add subsequent variables.

panel <- do.call(rbind, results_list)

3.3 Creating Aggregated Construction Permits

We calculated the cumulative permits issued within the near 5 years for each DCPS zone on file, to account for the effect created by housing constructions. For example, for all records in SY13-14, cumulative permits issued within the near 5 years is the sum of permits issued in 2013, 2014, 2015, 2016, and 2017, which effectively accounts for all additional construction projects that influence the targeted year (SY18-19)’s prediction results.

construction_within_dcps <- st_join(combined_construction_data %>% 
                                      st_transform(st_crs(boundary)), boundary, join = st_within)

construction_count <- construction_within_dcps %>%
  as_tibble() %>% 
  group_by(DCPS_ID, year) %>%
  summarise(construction_projects = n()) %>%
  st_drop_geometry()

# NZ: I think many-to-many in the following steps should be expected, as we intentionally match multiple years' construction permit count to the base year for each DCPS
projects_5yr_sum <- construction_count %>%
  rename(base_year = year) %>%
  inner_join(construction_count, by = c("DCPS_ID")) %>%
  filter(year >= base_year, year < base_year + 5) %>%
  group_by(DCPS_ID, base_year) %>%
  summarise(cumulative_permits5yr = sum(construction_projects.y, na.rm = TRUE), .groups = "drop") %>%
  rename(year = base_year)

construction_count <- construction_count %>%
  left_join(projects_5yr_sum, by = c("DCPS_ID", "year"))

Removed 5-year cumulative permits for 2020 and later, since those counts would be incomplete. These years will remain in the final dataset used for model fitting, but will be excluded from training and testing. Projected permit values for these years will be manually assigned in scenario analyses.

construction_count <- construction_count %>% 
  filter(year <= 2019 )
# Match schools to DCPS zones
dcps_code <- dplyr::select(boundary, DCPS_ID, NAME)
panel <- st_join(panel, dcps_code, join = st_within)

panel <- panel %>% 
   left_join(construction_count, by = c("DCPS_ID", "census_year" = "year"))

3.4 Engineering Residential Sales Data

Next, we added several residential property transaction-related data points within each DCPS boundary or nearest to the school facility from the year prior to the current school year. For instance, residential sales records in 2012 are matched with SY13–14 enrollment records. This is based on the belief that housing sales one year prior are more directly associated with enrollments in the subsequent school year, as families typically move within a few months of purchase and enroll their children in school the following fall. This creates a more stable predictor that aligns household settlement patterns with school enrollment cycles.

combined_sales <- res_data %>%
  as_tibble() %>%
  mutate(unit_type = if_else(NUM_UNITS == 1, "single", "multi")) %>%
  group_by(SALEYEAR, school_year, dcps_zone, unit_type) %>%
  tally() %>%
  pivot_wider(
    names_from = unit_type,
    values_from = n,
    names_prefix = "num_",
    values_fill = 0
  ) %>%
  rename(
    num_single_units_sold = num_single,
    num_multi_units_sold = num_multi
  )
panel <- panel %>%
  left_join(combined_sales %>%
            dplyr::select(-SALEYEAR),
    by = c("NAME" = "dcps_zone", "census_year" = "school_year"))
panel <- panel %>%
  left_join(school_metrics_by_year)

Intermediate Step: Reducing Input Variables to Reduce Multicollinearity

# standardize percentages
panel_clean <- panel %>% 
  mutate(utilization =  as.numeric(sub("%", "", utilization)) / 100,
         med_rent_pct_inc = med_rent_pct_inc / 100,
         med_owner_cost_pct_inc = med_owner_cost_pct_inc / 100) %>% 
  rename(dcps_boundary = NAME)

# clean up variable names to lower case
names(panel_clean) <- tolower(names(panel_clean))

# Improved variable inclusion based on Amy ENTIRE_panel_allVars.geojson
# Everything before mutating additonal agg_students and capacity related lag/future vars
panel_clean <- panel_clean %>%
  dplyr::select(school_year, school_code, school_name, grade_level, unique_id, 
                census_year, geoid, dcps_boundary, 
                school_sector, facility_capacity, enrollment, utilization,
                median_hhincome_indv_mean, pct_poverty_indv_mean,
                total_population, num_households, 
                total_housing_units,median_gross_rent, med_rent_pct_inc,
                med_owner_cost_pct_inc, med_rooms_renter, median_household_income, avg_hh_size, pct_nonwhite, pct_poverty,
                pct_vacant, pct_own, pct_rent,pct_education_no_school, pct_education_hs, pct_education_bachelors,
                construction_projects, avg_price_5_nearest, med_bedroom_5_nearest,
                cumulative_permits5yr, housing_1yrbefore, pct_change_housing_yrbefore,
                agg_total_students)

3.5 Creating Enrollment and Capacity Change Lag Variables

Finally, to support forecasting of future enrollment based on past records, we created the following columns:

  • agg_total_students_5yrslater: the aggregate total student enrollment for the same school facility and grade level, five years later.

  • pct_change_capacity_5yrs: the anticipated percentage change in the school’s capacity, five years later (i.e. percentage change in school capacity between base year SY13-14 and target year SY18-19).

The former serves as the target variable in the model. While the latter, similar to cumulative_permits5yr, acts as a hypothetical input in the training data. Collectively, they enables clients to conduct scenario-based forecasting informed by qualitative judgments about development pressures within a school district or planned facility-level capacity changes.

Student Enrollment

# aggregate enrollment lags and leads, per grade, future and past
grade_enrollment_lags <- expand.grid(
  unique_id = sort(unique(panel_clean$unique_id)),
  census_year   = seq(from = min(panel_clean$census_year), to = max(panel_clean$census_year))
) %>% 
  left_join(panel_clean, by = c("unique_id", "census_year")) %>% 
  dplyr::select(unique_id,census_year, grade_level, agg_total_students) %>%
  arrange(unique_id, census_year) %>% 
  group_by(unique_id, grade_level, census_year) %>% summarize(agg_total_students = mean(agg_total_students)) %>%
  mutate(agg_total_students_1yrbefore = lag(agg_total_students, n=1),
         agg_total_students_2yrbefore = lag(agg_total_students, n=2),
         agg_total_students_3yrbefore = lag(agg_total_students, n=3),
         agg_total_students_1yrlater = lead(agg_total_students, n=1),
         agg_total_students_2yrlater = lead(agg_total_students, n=2),
         agg_total_students_3yrlater = lead(agg_total_students, n=3),
         agg_total_students_4yrlater = lead(agg_total_students, n=4),
         agg_total_students_5yrlater = lead(agg_total_students, n=5)) %>%
  ungroup()
panel_clean <- panel_clean %>%
  left_join(grade_enrollment_lags %>% dplyr::select(-agg_total_students),
            by=c("unique_id","census_year", "grade_level"))

School Capacity

capacity_change <- expand.grid(
  unique_id = sort(unique(panel_clean$unique_id)),
  census_year   = seq(from = min(panel_clean$census_year), to = max(panel_clean$census_year))
) %>% 
  left_join(panel_clean, by = c("unique_id", "census_year")) %>% 
  #dplyr::select(unique_id,census_year, total_housing_units) %>%
  arrange(unique_id, census_year) %>% 
  group_by(unique_id, census_year) %>% summarize(facility_capacity = mean(facility_capacity)) %>%
  mutate(capacity_5yrlater = lead(facility_capacity, n=5),
         pct_change_capacity_5yrlater = (capacity_5yrlater - facility_capacity) / facility_capacity
  ) %>%
  ungroup()
panel_clean <- panel_clean %>%
  left_join(capacity_change %>% dplyr::select(-facility_capacity),
            by=c("unique_id","census_year"))

Final tweaks to clean up the panel data

panel_clean <- panel_clean %>%
  mutate(across(c(facility_capacity, enrollment), round))

# impute enrollment for missing values based on aggregate values
panel_clean <- panel_clean %>%
  group_by(school_year, unique_id) %>%
  # where enrollment is missing, add up agg_total_students for each grade
  mutate(enrollment = case_when(is.na(enrollment) ~ sum(agg_total_students),
                                .default = enrollment))

panel_clean <- panel_clean %>%
  st_join(wards22 %>% dplyr::select(NAME, geometry) %>%
  rename("ward"="NAME")) # join wards back

# calculate average utilization for each ward in the respective year
# where this column is empty, replace with that average
ward_avg_util <- panel_clean %>%
  st_drop_geometry() %>%
  group_by(school_year, ward) %>%
  summarize(avg_util = mean(utilization, na.rm=T))

# join wards to impute average value
# will drop avg_util column later
panel_clean <- panel_clean %>%
  left_join(ward_avg_util, by=c("school_year","ward"))

panel_clean <- panel_clean %>%
  #assume missing values mean no enrollment that year
  mutate_at(vars(agg_total_students_2yrbefore, agg_total_students_3yrbefore, agg_total_students_1yrbefore), ~replace_na(., 0)) %>%
  # impute the same value for housing variables
  # making the assumption that the year-to-year doesn't change much when it's only a year difference
  # also doing for capacity
  # only for 1yr
  mutate(housing_1yrbefore = case_when(is.na(housing_1yrbefore) ~ total_housing_units, .default=housing_1yrbefore),
         pct_change_housing_yrbefore = case_when(is.na(pct_change_housing_yrbefore) ~ 0, .default=pct_change_housing_yrbefore),
         facility_capacity = case_when(facility_capacity == 0 | is.na(facility_capacity) ~ enrollment, .default=facility_capacity),
         utilization=case_when(is.na(utilization) ~ avg_util, .default=utilization))

Inspect NAs in the Panel

sapply(panel_clean, function(x)sum(is.na(x)))
##                  school_year                  school_code 
##                            0                            0 
##                  school_name                  grade_level 
##                            0                            0 
##                    unique_id                  census_year 
##                            0                            0 
##                        geoid                dcps_boundary 
##                            0                            0 
##                school_sector            facility_capacity 
##                            0                            0 
##                   enrollment                  utilization 
##                            0                          823 
##    median_hhincome_indv_mean        pct_poverty_indv_mean 
##                            0                            0 
##             total_population               num_households 
##                            0                            0 
##          total_housing_units            median_gross_rent 
##                            0                          185 
##             med_rent_pct_inc       med_owner_cost_pct_inc 
##                            6                           52 
##             med_rooms_renter      median_household_income 
##                            0                           64 
##                  avg_hh_size                 pct_nonwhite 
##                            0                            0 
##                  pct_poverty                   pct_vacant 
##                            0                            0 
##                      pct_own                     pct_rent 
##                            0                            0 
##      pct_education_no_school             pct_education_hs 
##                            0                            0 
##      pct_education_bachelors        construction_projects 
##                            0                         4072 
##          avg_price_5_nearest        med_bedroom_5_nearest 
##                            0                            0 
##        cumulative_permits5yr            housing_1yrbefore 
##                         4072                            0 
##  pct_change_housing_yrbefore           agg_total_students 
##                            0                            0 
##                     geometry agg_total_students_1yrbefore 
##                            0                            0 
## agg_total_students_2yrbefore agg_total_students_3yrbefore 
##                            0                            0 
##  agg_total_students_1yrlater  agg_total_students_2yrlater 
##                         1266                         2325 
##  agg_total_students_3yrlater  agg_total_students_4yrlater 
##                         3246                         4131 
##  agg_total_students_5yrlater            capacity_5yrlater 
##                         4982                         5683 
## pct_change_capacity_5yrlater                         ward 
##                         5683                            0 
##                     avg_util 
##                          823

Optional: Export Panel Data

4 Final Model

A range of recent studies have explored the application of predictive modeling to support enrollment forecasting across different educational contexts. Tang and Yin (2012) applied grey prediction models to forecast U.S. education expenditure and school enrollment, while Goenner and Pauls (2006) used Bayesian logistic regression models to assess the likelihood of college enrollment based on inquiry behaviors and geodemographic indicators. Yang et al. (2020) introduced a hybrid Support Vector Regression model optimized with the Whale Optimization Algorithm to forecast student and teacher counts in Taiwan, and Aksenova et al. (2006) combined SVR with Cubist rule-based models to enhance accuracy and interpretability of university enrollment projections.

Among all the papers reviewed, the study most relevant to our approach was published by Tanner et al. (2021) at the National Center for Education Evaluation and Regional Assistance at Institution of Educational Sciences (IES). The study develops an enrollment forecasting model for the School District of Philadelphia using four statistical and machine learning methods: OLS regression, LASSO, Elastic Net, and Random Forest. Results showed that LASSO and Elastic Net—both implemented via the GLMNet framework—and Random Forest achieved comparable accuracy, with Random Forest reducing the need for classroom reallocation in 22% of cases, compared to 23% with Elastic Net. The research question addressed offers a direct parallel to the goals of our client at DME targeted at future enrollment forecasting and planning. This precedent reinforces our decision to test out GLMNet, Random Forest, and an additional XGBoost model in our preliminary testings (refer to Appendix at the bottom of the markdown).

Through literature review and trials and errors, our results indicates that a GLMNet model yields best predictions for the task we were given, and yields additional benefits including interpretability, flexibility, and computational efficiency. The next part outlines our model building process in the following steps:

  • Data Splitting: Based on data availability, especially limitations imposed by future-based variables, we split the data by whole school years into train data (SY13-14 to SY16-17 input data used to train the model and perform cross-fold validations), test data (SY17-18 to SY18-19, data used to perform cross validation on other samples), and fit data (SY19-20 and beyond, data used to make scenario-based predictions). Collectively, our model is trained on 2,496 rows of train data, validated on 1,211 rows of test data, and fitted on 4,871 rows of raw fit data before their expanded scenarios.

  • Validation: Performed 2 types of cross validation including 5-fold on the train data, and temporal cross validation on 1,211 rows of test data in different years. The test data are unseen observations to the model and effectively evaluates the model’s robustness.

  • Generalizability Check: For the testing set specifically, we further checked the generalizability across school years, school sectors, capacity, and location of the school facility.

  • Making Predictions: Finally, we expanded the raw fit data (4,871 rows) into combinations of scenarios based on outlook for total permits within the DCPS zone in the next 5 years (cumulative_permits5yr) and percentage change in the school’s capacity 5 years later (pct_change_capacity_5yrlater). This allows the user to customize predictions in the app later on.

Codes for the respective steps and key results are included below.

4.1 Data Splitting and Model Training

The list of variables that we included in our final modelling is shown in the table below.
Names of Variable in model and Descriptions
Raw.Variable.Name Defintion
facility_capacity School Capacity: The maximum number of students who can be housed in a school building based on the number and types of spaces available and the type of educational plan offered.
enrollment Student Enrollment Numbers
utilization Utilization is a measurement of how full a facility is to meet educational programmatic needs. Utilization is calculated as the number of enrolled students over the capacity of the school facility.
median_hhincome_indv_mean Median Household Income, averaged from the proportion of student in each census tract. The following equation calculates this: (1/Total Students)*Summation[(Number of students in census tract N)*(Median Household Income in census tract N)]
pct_poverty_indv_mean Poverty rate, averaged from the proportion of students in each census tract. The following equation calculates this: (1/Total Students)*Summation[(Number of students in census tract N)*(Poverty rate in census tract N)]
total_population Total population, in the census tract the school is located in
total_housing_units Total housing units, in the census tract the school is located in
median_gross_rent Median Gross Rent, in the census tract the school is located in
med_owner_cost_pct_inc Median Owner Costs as percentage of income, in the census tract the school is located in
med_rooms_renter Median Number of Rooms in Renter Occupied housing, in the census tract the school is located in
avg_hh_size Average Household Size, in the census tract the school is located in
pct_nonwhite Percentage of non-white population, in the census tract the school is located in
pct_poverty Poverty rate, in the census tract the school is located in
pct_vacant Percentage of vacant housing, in the census tract the school is located in
pct_own Percentage of owner occupied housing among total population, in the census tract the school is located in
pct_education_no_school Percentage of adults aged 25 and above without school qualifications, in the census tract the school is located in
pct_education_hs Percentage of adults aged 25 and above with high school qualifications, in the census tract the school is located in
avg_price_5_nearest Average sales price of 5 nearest residential properties to the school
med_bedroom_5_nearest Median number of bedrooms of 5 nearest residential properties to the school
cumulative_permits5yr Total number of construction permits issued in the school’s census tract in past 5 years
housing_1yrbefore Total housing units, 1 year before the specified school year, in the census tract the school is located in
pct_change_housing_yrbefore Percentage change in total housing units from 1 year before, in the census tract the school is located in
agg_total_students This is the total number of students in the school for each grade, in the specified school year. The specified school year is contained in a separate column in the panel data
agg_total_students_1yrbefore This is the total number of students in the school for each grade, 1 year before the specified school year
agg_total_students_2yrbefore This is the total number of students in the school for each grade, 2 years before the specified school year
agg_total_students_3yrbefore This is the total number of students in the school for each grade, 3 years before the specified school year
agg_total_students_1yrlater This is the total number of students in the school for each grade, 1 year after the specified school year
pct_change_capacity_5yrlater Percentage change in school capacity 5 years after specified year. The specified school year is contained in a separate column in the panel data
school_year This is the school year. It is a categorical variable
grade_level_X2 This is the grade level of students. It is a categorical variable
unique_id This ID was assigned to each school to differentiate them when modelling. It is a categorical variable
dcps_boundary This is the name of DCPS School Zones Boundary the school is located in. It is a categorical variable
school_sector This is the type of public school in DC. It is a categorical variable of either DC Public School or Public Charter Schools
ward This is the name of municipal ward in DC. It is a categorical variable.
panel_5yr <- panel_clean %>%
  dplyr::select(-geoid, -school_name, -school_code, -census_year, -avg_util, 
                -capacity_5yrlater, 
                -construction_projects, 
                # these vars have extremely big coefficients, which will put rows without these vars at great prediction inaccuracy risk
                -agg_total_students_2yrlater, -agg_total_students_3yrlater, -agg_total_students_4yrlater) %>%
  st_drop_geometry() %>%
  filter(agg_total_students_5yrlater > 0) %>%
  filter_all(all_vars(!is.infinite(.))) %>%
  filter(school_year != "SY19-20")

panel_5yr <- panel_5yr %>%
  left_join(
    panel_clean %>% dplyr::select(unique_id, school_year, grade_level, geometry),
    by = c("unique_id", "school_year", "grade_level")
  ) %>%
  st_as_sf()

colnames(panel_5yr)
##  [1] "school_year"                  "grade_level"                 
##  [3] "unique_id"                    "dcps_boundary"               
##  [5] "school_sector"                "facility_capacity"           
##  [7] "enrollment"                   "utilization"                 
##  [9] "median_hhincome_indv_mean"    "pct_poverty_indv_mean"       
## [11] "total_population"             "num_households"              
## [13] "total_housing_units"          "median_gross_rent"           
## [15] "med_rent_pct_inc"             "med_owner_cost_pct_inc"      
## [17] "med_rooms_renter"             "median_household_income"     
## [19] "avg_hh_size"                  "pct_nonwhite"                
## [21] "pct_poverty"                  "pct_vacant"                  
## [23] "pct_own"                      "pct_rent"                    
## [25] "pct_education_no_school"      "pct_education_hs"            
## [27] "pct_education_bachelors"      "avg_price_5_nearest"         
## [29] "med_bedroom_5_nearest"        "cumulative_permits5yr"       
## [31] "housing_1yrbefore"            "pct_change_housing_yrbefore" 
## [33] "agg_total_students"           "agg_total_students_1yrbefore"
## [35] "agg_total_students_2yrbefore" "agg_total_students_3yrbefore"
## [37] "agg_total_students_1yrlater"  "agg_total_students_5yrlater" 
## [39] "pct_change_capacity_5yrlater" "ward"                        
## [41] "geometry"

Data Splitting

set.seed(2554)

data_split5 <- initial_time_split(panel_5yr %>% 
                                  st_drop_geometry() %>% 
                                  filter_all(all_vars(!is.infinite(.))), 
                                  prop = 2492/3739) # train 2013-2016, test 2017-2018

train_data_lag5 <- training(data_split5)
test_data_lag5 <- testing(data_split5)

data_cv5 <- group_vfold_cv(train_data_lag5, group = "school_year")

Set Recipe

linear_big_model_rec5 <- recipe(agg_total_students_5yrlater ~ .,
                     data=train_data_lag5) %>%
  # acept mean impute for columns where there aren't many NA
  step_impute_mean(all_numeric_predictors()) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors(), -all_nominal())

glmnet_plan5 <- 
  linear_reg() %>% 
  set_args(penalty  = tune()) %>%
  set_args(mixture  = tune()) %>%
  set_engine("glmnet")

control <- control_resamples(save_pred = TRUE, verbose = TRUE)
metrics <- metric_set(rmse, rsq, mape, smape)

glmnet_grid5 <- expand.grid(penalty = seq(0,0.5, by = 0.025),
                            mixture = seq(0,0.5, by = 0.025))

glmnet_wf5 <-
  workflow() %>% 
  add_recipe(linear_big_model_rec5) %>% 
  add_model(glmnet_plan5)

Training and Selecting Best Model

We finalized the best workflow (0.025 for penalty and 0.175 for mixture) based on smallest RMSE

collect_metrics(glmnet_tuned5)
## # A tibble: 1,764 × 8
##    penalty mixture .metric .estimator   mean     n std_err .config              
##      <dbl>   <dbl> <chr>   <chr>       <dbl> <int>   <dbl> <chr>                
##  1   0           0 mape    standard   15.7       4  1.68   Preprocessor1_Model0…
##  2   0           0 rmse    standard    8.34      4  0.541  Preprocessor1_Model0…
##  3   0           0 rsq     standard    0.886     4  0.0188 Preprocessor1_Model0…
##  4   0           0 smape   standard   14.2       4  1.18   Preprocessor1_Model0…
##  5   0.025       0 mape    standard   15.7       4  1.68   Preprocessor1_Model0…
##  6   0.025       0 rmse    standard    8.34      4  0.541  Preprocessor1_Model0…
##  7   0.025       0 rsq     standard    0.886     4  0.0188 Preprocessor1_Model0…
##  8   0.025       0 smape   standard   14.2       4  1.18   Preprocessor1_Model0…
##  9   0.05        0 mape    standard   15.7       4  1.68   Preprocessor1_Model0…
## 10   0.05        0 rmse    standard    8.34      4  0.541  Preprocessor1_Model0…
## # ℹ 1,754 more rows
show_best(glmnet_tuned5, metric = "rmse", n = 15)
## # A tibble: 15 × 8
##    penalty mixture .metric .estimator  mean     n std_err .config               
##      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
##  1   0.025   0.175 rmse    standard    8.14     4   0.495 Preprocessor1_Model149
##  2   0.025   0.15  rmse    standard    8.14     4   0.491 Preprocessor1_Model128
##  3   0       0.175 rmse    standard    8.14     4   0.494 Preprocessor1_Model148
##  4   0.05    0.075 rmse    standard    8.14     4   0.500 Preprocessor1_Model066
##  5   0.05    0.125 rmse    standard    8.15     4   0.506 Preprocessor1_Model108
##  6   0.05    0.1   rmse    standard    8.15     4   0.505 Preprocessor1_Model087
##  7   0       0.05  rmse    standard    8.15     4   0.495 Preprocessor1_Model043
##  8   0.025   0.05  rmse    standard    8.15     4   0.495 Preprocessor1_Model044
##  9   0.05    0.05  rmse    standard    8.15     4   0.498 Preprocessor1_Model045
## 10   0.025   0.1   rmse    standard    8.15     4   0.490 Preprocessor1_Model086
## 11   0.025   0.125 rmse    standard    8.15     4   0.494 Preprocessor1_Model107
## 12   0       0.075 rmse    standard    8.15     4   0.489 Preprocessor1_Model064
## 13   0.025   0.075 rmse    standard    8.15     4   0.489 Preprocessor1_Model065
## 14   0       0.15  rmse    standard    8.15     4   0.489 Preprocessor1_Model127
## 15   0.025   0.275 rmse    standard    8.15     4   0.510 Preprocessor1_Model233
glmnet_best_params5 <- select_best(glmnet_tuned5, metric = "rmse")

## Final workflow
glmnet_best_wf5 <- finalize_workflow(glmnet_wf5, glmnet_best_params5)

Inspect Coefficients & Zeroed-out Variables

As a result of Elastic Net (balancing LASSO and Ridge regularizations), the following list of variables are zeroed-out (evaluated to be insignificant variables with a zero-coefficient assigned).

# Extract the fitted parsnip model from the loaded workflow
glmnet_best_fit5 <- fit(glmnet_best_wf5, data = train_data_lag5)
glmnet_fit5 <- extract_fit_parsnip(glmnet_best_fit5)

# Tidy up and inspect coefficients
coef_df5 <- tidy(glmnet_fit5)
nonzero_coefs5 <- coef_df5 %>% filter(estimate != 0)
zeroed_out_coefs5 <- coef_df5 %>% filter(estimate == 0, term != "(Intercept)")

# Print non-zero coefficients
print(nonzero_coefs5)
## # A tibble: 218 × 3
##    term                      estimate penalty
##    <chr>                        <dbl>   <dbl>
##  1 (Intercept)               50.9       0.025
##  2 facility_capacity          0.130     0.025
##  3 enrollment                -0.00268   0.025
##  4 utilization               -1.20      0.025
##  5 median_hhincome_indv_mean  1.33      0.025
##  6 pct_poverty_indv_mean      0.288     0.025
##  7 total_population           1.09      0.025
##  8 total_housing_units       -0.00449   0.025
##  9 median_gross_rent         -0.765     0.025
## 10 med_owner_cost_pct_inc     0.335     0.025
## # ℹ 208 more rows
# Print zeroed_out coefficients
print(zeroed_out_coefs5)
## # A tibble: 48 × 3
##    term                    estimate penalty
##    <chr>                      <dbl>   <dbl>
##  1 num_households                 0   0.025
##  2 med_rent_pct_inc               0   0.025
##  3 median_household_income        0   0.025
##  4 pct_rent                       0   0.025
##  5 pct_education_bachelors        0   0.025
##  6 unique_id_F000066              0   0.025
##  7 unique_id_F000072              0   0.025
##  8 unique_id_F000073              0   0.025
##  9 unique_id_F000135              0   0.025
## 10 unique_id_F000180              0   0.025
## # ℹ 38 more rows

Optional: Save Co-efficients for Interpretation

write.csv(coef_df5, "coef_df5_May4.csv")

4.2 CV on Folds

First, we evaluated the model performance across 5-folds on the train data, achieving an RMSE of 8.17, a MAPE of 14.8%, and an r-squared value of 0.885 (i.e. the model explains 88.5% of variations in the observed data).

RMSE (root mean squared error): The square root of the mean of the squared differences between actual and predicted values.

MAPE (Mean Absolute Percentage Error): Prediction errors as a percentage of actual values.

# Get best OOF predictions for glmnet only
glmnet_best_OOF_preds5 <- collect_predictions(glmnet_tuned5) %>% 
  filter(penalty == glmnet_best_params5$penalty[1],
         mixture == glmnet_best_params5$mixture[1]) %>% 
  dplyr::select(.pred, agg_total_students_5yrlater)

OOF_preds5 <- glmnet_best_OOF_preds5 %>%
  mutate(model = "glmnet")

# calculate overall metrics
glmnet_oof_metrics5 <- bind_rows(
  metrics(
    data = glmnet_best_OOF_preds5,
    truth = agg_total_students_5yrlater,
    estimate = .pred
  ),
  mape(
    data = glmnet_best_OOF_preds5,
    truth = agg_total_students_5yrlater,
    estimate = .pred
  )
)
print(glmnet_oof_metrics5)
## # A tibble: 5 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       8.17 
## 2 rsq     standard       0.885
## 3 mape    standard      14.8  
## 4 smape   standard      13.8  
## 5 mape    standard      14.8
# OOF predicted versus actual
ggplot(OOF_preds5, aes(x = agg_total_students_5yrlater, y = .pred, group = model)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed", color = "#fc8961") +
  geom_smooth(method = "lm", color = "#440154") +
  coord_equal() +
  facet_wrap(~model, ncol=3) +
  theme_bw()

4.3 CV on Test

Next, we evaluated the model performance on the test data (data that the model has not “seen” yet) to evaluate its generalizability across years. This yields an RMSE of 9.62 and a MAPE of 17.5%. The errors, however, increases in SY18-19 as compared to SY17-18.

glmnet_val_fit_geo5 <- glmnet_best_wf5 %>% 
  last_fit(split     = data_split5,
           control   = control,
           metrics   = metrics)

# collect validation set predictions from last_fit model
glmnet_val_pred_geo5 <- collect_predictions(glmnet_val_fit_geo5)

val_preds5 <- data.frame(glmnet_val_pred_geo5, model = "glmnet") %>%
  left_join(
    panel_5yr %>%
      filter(agg_total_students_5yrlater > 0) %>%
      rowid_to_column(var = ".row") %>%
      dplyr::select(unique_id, school_year, grade_level, school_sector, ward, dcps_boundary, 
                    utilization, facility_capacity, pct_change_capacity_5yrlater,
                    cumulative_permits5yr, 
                    geometry, .row),
    by = ".row"
  ) %>%
  group_by(unique_id, school_year, grade_level, model) %>%
  mutate(
    RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
    MAE  = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
    MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred)
  ) %>%
  ungroup() %>%
  mutate(model = factor(model, levels = "glmnet"))

val_preds_sf5 <- st_as_sf(val_preds5, crs = 4326)
val_avg_metrics5 <- val_preds5 %>%
  summarise(
    overall_RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
    overall_MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred)
  )

print(val_avg_metrics5)
## # A tibble: 1 × 2
##   overall_RMSE overall_MAPE
##          <dbl>        <dbl>
## 1         9.62         17.5
val_preds_by_year_only_sf <- val_preds_sf5 %>% 
  group_by(school_year) %>% 
  summarise(RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
            MAE  = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
            MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred))

ggplot(val_preds_by_year_only_sf, aes(x=school_year, y=RMSE))+
  geom_bar(position="dodge",stat = "identity", fill="#440154", color="#fed395")+
  labs(title="RMSE by school year")

Looking at RMSE by grade levels, we found that predictions tend to be more accurate at higher grades, and is most prone to error for lower grades like K and 1. This implies as students naturally progress to upper grade levels, their enrollment patterns become more trackable.

# Validation Predicted vs. actual
ggplot(val_preds5, aes(x = agg_total_students_5yrlater, y = .pred, group = model)) +
  geom_point(alpha = 0.3, color="#440154") +
  geom_abline(linetype = "dashed", color = "#fc8961") +
  geom_smooth(method = "lm", color = "#440154") +
  coord_equal() +
  facet_wrap(~model, ncol = 3) +
  theme_bw()

# Validation Predicted vs. actual
ggplot(val_preds5, aes(x = agg_total_students_5yrlater, y = .pred, group = model)) +
  geom_point(alpha = 0.3, color = "#440154") +
  geom_abline(linetype = "dashed", color = "#fc8961") +
  geom_smooth(method = "lm", color = "#440154") +
  coord_equal() +
  scale_x_continuous(breaks = seq(0, max(val_preds5$agg_total_students_5yrlater, na.rm = TRUE), by = 40)) +
  scale_y_continuous(breaks = seq(0, max(val_preds5$.pred, na.rm = TRUE), by = 50)) +
  theme_bw()

val_preds_by_year_grade_sf <- val_preds_sf5 %>% 
  group_by(school_year, grade_level) %>% 
  summarise(RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
            MAE  = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
            MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred))%>% 
  arrange(RMSE) %>%
  mutate(grade_level = factor(grade_level, levels=c("K","1","2","3","4","5")))

# plot rmse, grade level facet
ggplot(val_preds_by_year_grade_sf, aes(x=school_year, y=RMSE))+
  geom_bar(position="dodge",stat = "identity", fill="#440154", color="#fed395")+
  facet_wrap(~grade_level)+
  labs(title="RMSE by grade level")

We further plotted for the distribution of errors in relation to school facility’s utilization, capacity, as well as the two scenario variables – cumulative permits in 5 years, and change in facility capacity 5 years later. As observed, the errors are randomly distributed with no obvious patterns of clustering. However, predictions for public charters do demonstrate more instability as compared to DCPS.

Error (RMSE) vs Utilization, by sector

ggplot(val_preds_sf5 %>%
         mutate(grade_level = factor(grade_level, levels = c("K","1","2","3","4","5"))), 
       aes(x = utilization, y = RMSE)) +
  geom_point(alpha = 0.3, color = "#440154") +
  facet_wrap(~school_sector, nrow = 2) +
  labs(title = "Error by Utilization and School Sector", subtitle = "5-Year Lag Model") +
  theme_bw()

Error (RMSE) vs Capacity, by Sector

ggplot(val_preds_sf5 %>%
         mutate(grade_level = factor(grade_level, levels = c("K","1","2","3","4","5"))), 
       aes(x = facility_capacity, y = RMSE)) +
  geom_point(alpha = 0.3, color = "#440154") +
  facet_wrap(~school_sector, nrow = 2) +
  labs(title = "Error by Facility Capacity and School Sector", subtitle = "5-Year Lag Model") +
  theme_bw()

Error (RMSE) vs Anticipated Construction in 5 Years, all schools (no facet)

ggplot(val_preds_sf5 %>%
         mutate(grade_level = factor(grade_level, levels = c("K","1","2","3","4","5"))), 
       aes(x = cumulative_permits5yr, y = RMSE)) +
  geom_point(alpha = 0.3, color = "#440154") +
  labs(title = "Error by Anticipated Construction Projects", subtitle = "5-Year Lag Model") +
  theme_bw()

Error (RMSE) vs Anticipated Pct Change in Capacity in 5 Years, all schools (no facet)

ggplot(val_preds_sf5 %>%
         mutate(grade_level = factor(grade_level, levels = c("K","1","2","3","4","5"))), 
       aes(x = pct_change_capacity_5yrlater, y = RMSE)) +
  geom_point(alpha = 0.3, color = "#440154") +
  labs(title = "Error by Anticipated Change in Facility Capacity", subtitle = "5-Year Lag Model") +
  theme_bw()

4.4 Generalizability Across Space

Finally, using test data, we inspected RMSE by facilities and report the following results.

List of Top 20 Schools with Biggest Errors

val_preds5_spatial <- val_preds5 %>%
  group_by(unique_id, ward, dcps_boundary) %>% # the most granular is still school facility
  summarise(
    RMSE_school = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
    MAE_school = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
    MAPE_school = yardstick::mape_vec(agg_total_students_5yrlater, .pred),
    .pred_mean = mean(.pred, na.rm = TRUE),
    actual_mean = mean(agg_total_students_5yrlater, na.rm = TRUE),
    n_obs = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(RMSE_school)) %>%
  slice_head(n = 20)

# Display table nicely
val_preds5_spatial %>%
  dplyr::select(
    `Unique ID` = unique_id,
    `DCPS Zone` = dcps_boundary,
    `Ward` = ward,
    `Predicted (Mean)` = .pred_mean,
    `Actual (Mean)` = actual_mean,
    `School RMSE` = RMSE_school,
    `School MAE` = MAE_school,
    `School MAPE` = MAPE_school
  ) %>%
  mutate(across(where(is.numeric), ~round(., 2))) %>%
  mutate(
    `Predicted (Mean)` = round(`Predicted (Mean)`, 0),
    `Actual (Mean)` = round(`Actual (Mean)`, 0)
  ) %>%
  kable(
    format = "html",
    align = "c",
    caption = "Top 20 Schools with Highest RMSE (5-Year Lag)"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Top 20 Schools with Highest RMSE (5-Year Lag)
Unique ID DCPS Zone Ward Predicted (Mean) Actual (Mean) School RMSE School MAE School MAPE
F000091 King, Martin Luther Ward 8 102 52 51.80 50.17 102.49
F000028 Nalle Ward 7 100 123 23.91 23.46 18.93
F000102 Bancroft Ward 1 93 112 22.88 19.48 16.32
F000087 Nalle Ward 7 89 108 18.78 18.61 17.27
F000161 Ketcham Ward 8 58 76 18.26 18.08 23.78
F000035 Noyes Ward 5 37 19 18.26 17.84 100.82
F000193 Walker-Jones Ward 5 82 68 17.73 15.16 25.74
F000033 Seaton Ward 2 98 84 16.34 14.65 18.33
F000203 Harris, C. W Ward 7 57 46 14.25 11.53 28.77
F000092 Noyes Ward 5 74 61 14.10 12.68 22.03
F000130 Garrison Ward 2 37 49 14.01 11.78 21.99
F000237 Lewis Ward 4 57 67 13.78 11.70 16.45
F000247 Miner Ward 7 60 50 13.59 9.92 23.54
F000220 Takoma Ward 4 52 58 13.57 12.12 20.37
F000096 Lasalle-Backus Ward 5 66 66 13.55 9.74 16.02
F000027 Bunker Hill Ward 5 69 73 13.27 11.87 16.52
F000146 Miner Ward 7 5 18 13.05 12.99 72.00
F000105 Beers Ward 7 62 50 13.05 12.18 25.38
F000001 Beers Ward 7 70 80 12.58 10.21 11.85
F000029 Moten Ward 8 118 115 12.17 11.82 10.74

While most facilities have low to moderate errors in both years, based on the two maps below, we note that there is one particular school that demonstrated particularly outstanding errors. Upon closer inspection, the facility is the Eagle Academy PCS. It has reported serious financial struggles in the recent years and was put on a financial monitoring list in June of 2023, resulting in declining enrollments and an abrupt closure without any advanced notice in August 2024 (source: DC NEWS). This suggests that while our model has demonstrated ability to capture regular changes in socioeconomic and housing, it is susceptible to abrupt changes that cannot be accounted for by the broader change patterns.

Map of RMSE by School Facility across Both Test Years (SY17-18 & SY18-19)

val_pred_facility_sf5 <- val_preds_sf5 %>%
  as_tibble() %>%
  group_by(unique_id, model) %>%
  summarise(
    RMSE_school = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
    MAPE_school = yardstick::mape_vec(agg_total_students_5yrlater, .pred),
    geometry = st_union(geometry),   # Collapse to one geometry per school
    .groups = "drop"
  ) %>%
  st_as_sf()

# Then plot
ggplot() +
  geom_sf(data = boundary, fill = "grey85", color = "grey70", lwd = 0.1) +
  geom_sf(data = waterbodies, fill = "ghostwhite", color = "lightsteelblue") +
  geom_sf(data = val_pred_facility_sf5, aes(color = RMSE_school), alpha = 0.8) +
  scale_color_gradientn(colors = c("#fcfdbf", "#fed395", "#fea973", "#f1605d", "#a3307e", "#440154", "#180f3d")) +
  facet_wrap(~model) +
  labs(title = "RMSE by School Facility (5-Year Lag)") +
  theme_void()

Map of RMSE by School Facility by School Year

val_pred_facility_year_sf5 <- val_preds_sf5 %>%
  as_tibble() %>%
  group_by(unique_id, model, school_year) %>%
  summarise(
    RMSE_school = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
    MAPE_school = yardstick::mape_vec(agg_total_students_5yrlater, .pred),
    geometry = st_union(geometry),   # Merge all points/polygons if multiple
    .groups = "drop"
  ) %>%
  st_as_sf()

# Then plot
ggplot() +
  geom_sf(data = boundary, fill = "grey85", color = "grey70", lwd = 0.1) +
  geom_sf(data = waterbodies, fill = "ghostwhite", color = "lightsteelblue") +
  geom_sf(data = val_pred_facility_year_sf5, aes(color = RMSE_school), alpha = 0.8) +
  scale_color_gradientn(colors = c("#fcfdbf", "#fed395", "#fea973", "#f1605d", "#a3307e", "#440154", "#180f3d")) +
  facet_wrap(~school_year) +
  labs(title = "RMSE by Facility and Census Year (5-Year Lag)") +
  theme_void()

Map of RMSE by Ward and School Zones

Errors by ward and zones have increased across most wards in the later test year (SY18-19). Zooming into changes by school zone specifically, it can be found that the most significant changes occur in dcps zones associated with list of facilities with top errors highlighted in the previous table, including Eagle Academy (in King, Martin Luther) and Bancroft (in Bankcroft), which is associated with changes like financial difficulties or expansion that is unperceived in the current model variables.

# by Ward
# Spatial join points to wards
val_preds_sf_wards5 <- st_join(val_preds_sf5, wards %>% dplyr::select(ward_name = NAME))

# Group by ward and year, then correctly calculate RMSE, MAPE
val_preds_sf_wards5 <- val_preds_sf_wards5 %>%
  st_drop_geometry() %>%  # Drop POINT geometry
  group_by(ward_name, school_year) %>%
  summarise(
    RMSE_ward = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
    MAPE_ward = yardstick::mape_vec(agg_total_students_5yrlater, .pred),
    .groups = "drop"
  ) %>%
  left_join(wards %>% dplyr::select(ward_name = NAME), by = "ward_name") %>%
  st_as_sf()

# Plot MAPE map
ggplot(val_preds_sf_wards5) +
  geom_sf(aes(fill = MAPE_ward), color = "white") +
  facet_wrap(~ school_year) +
  scale_fill_viridis_c(option = "magma", name = "MAPE") +
  theme_minimal() +
  labs(
    title = "MAPE by Ward and Year (5-Year Lag)",
    subtitle = "Spatial aggregation of prediction error",
    fill = "MAPE"
  )

# Plot RMSE map
ggplot(val_preds_sf_wards5) +
  geom_sf(aes(fill = RMSE_ward), color = "white") +
  facet_wrap(~ school_year) +
  scale_fill_viridis_c(option = "magma", name = "RMSE") +
  theme_minimal() +
  labs(
    title = "RMSE by Ward and Year (5-Year Lag)",
    subtitle = "Spatial aggregation of prediction error",
    fill = "RMSE"
  )

# by DCPS Zones
# Spatial join points to DCPS zones
val_preds_sf_dcps5 <- st_join(val_preds_sf5, boundary %>% dplyr::select(dcps_zone = NAME))

# Group by DCPS zone and school_year, correctly compute RMSE, MAPE
val_preds_sf_dcps5 <- val_preds_sf_dcps5 %>%
  st_drop_geometry() %>%
  group_by(dcps_boundary, school_year) %>%
  summarise(
    RMSE_dcps = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
    MAPE_dcps = yardstick::mape_vec(agg_total_students_5yrlater, .pred),
    .groups = "drop"
  ) %>%
  left_join(boundary %>% dplyr::select(dcps_zone = NAME), by = c("dcps_boundary" = "dcps_zone")) %>%
  st_as_sf()

# Plot MAPE by DCPS Zone
ggplot(val_preds_sf_dcps5) +
  geom_sf(data = boundary, color = "lightgrey")+
  geom_sf(aes(fill = MAPE_dcps), color = "white") +
  facet_wrap(~ school_year) +
  scale_fill_viridis_c(option = "magma", name = "MAPE") +
  theme_minimal() +
  labs(
    title = "MAPE by DCPS Zone and Year (5-Year Lag)",
    subtitle = "Spatial aggregation of prediction error",
    fill = "MAPE"
  )

# Plot RMSE by DCPS Zone
ggplot() +
  geom_sf(data = boundary, color = "lightgrey")+
  geom_sf(data = val_preds_sf_dcps5, aes(fill = RMSE_dcps), color = "white") +
  facet_wrap(~ school_year) +
  scale_fill_viridis_c(option = "magma", name = "RMSE") +
  theme_minimal() +
  labs(
    title = "RMSE by DCPS Zone and Year (5-Year Lag)",
    subtitle = "Spatial aggregation of prediction error",
    fill = "RMSE"
  )

4.5 Expand Fitted Data to Construct Different Scenarios

By inspecting the distribution of total construction permits issued for each school zone within 5 years (i.e. the total construction permits issued from the base year to the end of census year prior to the target school year; eg. to predict for SY27-28, this variable sums up all construction permits issued in a school zone from start of 2022 to end of 2026), and the distribution of percent change in facility capacity 5 years later (i.e. percentage change between facility capacity in the base school year and the facility capacity in the target school year; eg. to predict for SY27-28, this variable compares facility capacity in SY27-28 to facility capacity to SY22-23), we mutated the following threshold that the client can interact with in the app:

cumulative_permits5yr: every 250 between 0 to 2000, 4000, 6000, 8000

pct_change_capacity_5yrs: since -1.0 represents that the school is gone, let’s do every every -50%, -25%, -10%, -5%, 0%, 5%, 10%, 25%, 50%, 100%, 150%

hist(panel_clean$cumulative_permits5yr)

hist(panel_clean$pct_change_capacity_5yrlater)

Set expansion grid

projects_grid5 <- c(seq(0, 2000, by = 250), 4000, 6000, 8000)
capacity_pct_grid5 <- c(-0.5, -0.25, -0.10, -0.05, 0, 0.05, 0.10, 0.25, 0.5, 1, 1.5)

# Create all combinations (Cartesian product)
grid <- expand.grid(
  cumulative_permits5yr = projects_grid5,
  pct_change_capacity_5yrlater = capacity_pct_grid5
)

Build Scenarios for 5-year Fit Data

fit_data5 <- panel_clean %>% 
  filter(census_year >= 2019) # we use anything beyond 2019

# Drop existing variables before expansion
fit_data5 <- fit_data5 %>%
  dplyr::select(-cumulative_permits5yr, -pct_change_capacity_5yrlater, -agg_total_students_5yrlater)

# Now safe to cross
fit_data5 <- tidyr::crossing(fit_data5, grid)

Create Absolute Change in Capacity

Create a column that indicates what number of seats the scenario correspond to, for the App Team’s use.

fit_data5 <- fit_data5 %>% 
  mutate(abs_change_capacity_5yrlater = round(pct_change_capacity_5yrlater * facility_capacity))

Fit model to the scenario-based fit data (over 600,000 rows to generate predictions

# Fit on all available training data
glmnet_final_fit5 <- fit(glmnet_best_wf5, data = training(data_split5))

predictions5 <- predict(glmnet_final_fit5, new_data = fit_data5) %>%
  bind_cols(fit_data5) #.pred stands for agg_total_students_5yrlater in this case!

predictions5 <- predictions5 %>% 
  mutate(`.pred` = ceiling(`.pred`))

predictions5_sf <- st_as_sf(predictions5)

Retain only useful rows to reduce file size

app_data <- predictions5_sf %>% 
  dplyr::select(`.pred`, school_year, school_code, school_name, grade_level, unique_id,census_year, 
         dcps_boundary, school_sector, ward, 
         cumulative_permits5yr, pct_change_capacity_5yrlater, abs_change_capacity_5yrlater)

Save data for App Team’s use

Conclusion

Based on previous discussions and results, as well as the strengths and limitations of the current model, we would like to conclude our project by proposing the following recommendations for our client:

  • Incorporate Qualitative Consideration: While using the app, we highly recommend our client to incorporate reasonable qualitative assumptions while picking a scenario.

  • Make a Flexible Switch between the App and the Business-as-usual Approach: For unique facilities that are extraordinarily high in capacity / undergoes sudden administrative changes etc. The Client can switch flexibly between our app and the traditional approach.

  • Enrich Data Sources: As census availability is a major impediment to the app, The Client may consider incorporating additional internal data, which are updated on a more regular basis, to substitute variables where suited to achieve more up-to-date results.

We’d like to express our sincere gratitude towards Rebecca Lee and Rory Lawless, our client representatives at the DC Office of the Deputy Mayor for Education (DME), as well as our instructors Professor Michael Fichman and Matthew Harris for their relentless support and guidance. We also appreciate this unique and exciting opportunity to apply our learnings into making a real-life impact through the collaboration between the Weitzman School of Design at University of Pennsylvania and the DME.

For any future inquiries, concerns, or updates, please feel free to reach out to team members at:

  • General Process: Jia Yue Ong (email)

  • Panel Data: Neve Zhang (email)

  • Model: Amy Solano

  • App: Jingmiao Fei

6 Appendix: Preliminary Models

The following codes archives our initial process of testing out all three types of models: GLMNet, Random Forest, and XGBoost. According to the prediction vs. actual prediction plots, both Random Forest and XGBoost are prone to underpredict for high enrollment records. We thus decided to pursue polishing and development of the GLMNet model over the other two methods.

6.1 Preliminary Model Development

panel_prelim <- panel %>% 
  mutate(utilization =  as.numeric(sub("%", "", utilization)) / 100,
         med_rent_pct_inc = med_rent_pct_inc / 100,
         med_owner_cost_pct_inc = med_owner_cost_pct_inc / 100) %>% 
  rename(dcps_boundary = NAME)

# clean up variable names to lower case
names(panel_prelim) <- tolower(names(panel_prelim))

panel_prelim <- panel_prelim %>%
  left_join(grade_enrollment_lags %>% dplyr::select(-agg_total_students),
            by=c("unique_id","census_year", "grade_level"))

panel_prelim <- panel_prelim %>%
  left_join(capacity_change %>% dplyr::select(-facility_capacity),
            by=c("unique_id","census_year"))

panel_prelim <- panel_prelim %>%
  mutate(across(c(facility_capacity, enrollment), round))

# impute enrollment for missing values based on aggregate values
panel_prelim <- panel_prelim %>%
  group_by(school_year, unique_id) %>%
  # where enrollment is missing, add up agg_total_students for each grade
  mutate(enrollment = case_when(is.na(enrollment) ~ sum(agg_total_students),
                                .default = enrollment))

panel_prelim <- panel_prelim %>%
  st_join(wards22 %>% dplyr::select(NAME, geometry) %>%
  rename("ward"="NAME")) # join wards back

panel_prelim <- panel_prelim %>%
  left_join(ward_avg_util, by=c("school_year","ward"))

panel_prelim <- panel_prelim %>%
  #assume missing values mean no enrollment that year
  mutate_at(vars(agg_total_students_2yrbefore, agg_total_students_3yrbefore, agg_total_students_1yrbefore), ~replace_na(., 0)) %>%
  # impute the same value for housing variables
  # making the assumption that the year-to-year doesn't change much when it's only a year difference
  # also doing for capacity
  # only for 1yr
  mutate(housing_1yrbefore = case_when(is.na(housing_1yrbefore) ~ total_housing_units, .default=housing_1yrbefore),
         pct_change_housing_yrbefore = case_when(is.na(pct_change_housing_yrbefore) ~ 0, .default=pct_change_housing_yrbefore),
         facility_capacity = case_when(facility_capacity == 0 | is.na(facility_capacity) ~ enrollment, .default=facility_capacity),
         utilization=case_when(is.na(utilization) ~ avg_util, .default=utilization))

sapply(panel_prelim, function(x)sum(is.na(x)))

Data Splitting

set.seed(2554)

data_split_pre5 <- initial_time_split(panel_prelim5 %>% 
                                  st_drop_geometry() %>% 
                                  filter_all(all_vars(!is.infinite(.))), 
                                  prop = 2492/3739) # train 2013-2016, test 2017-2018

train_data_lag_pre5 <- training(data_split_pre5)
test_data_lag_pre5 <- testing(data_split_pre5)

data_cv_pre5 <- group_vfold_cv(train_data_lag_pre5, group = "school_year")

Recipe Setting

model_rec_pre5 <- recipe(agg_total_students_5yrlater ~ .,
                     data=train_data_lag_pre5) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_indicate_na(agg_total_students_5yrlater)

linear_model_rec_pre5 <- recipe(agg_total_students_5yrlater ~ .,
                     data=train_data_lag_pre5) %>%
  # acept mean impute for columns where there aren't many NA
  step_impute_mean(all_numeric_predictors()) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors(), -all_nominal())

rf_plan_pre5 <- rand_forest() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 2000) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

glmnet_plan_pre5 <- 
  linear_reg() %>% 
  set_args(penalty  = tune()) %>%
  set_args(mixture  = tune()) %>%
  set_engine("glmnet")

XGB_plan_pre5 <- boost_tree() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 100) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

glmnet_grid_pre5 <- expand.grid(penalty = seq(0, 1, by = .25), 
                           mixture = seq(0,1,0.25))

rf_grid_pre5 <- expand.grid(mtry = c(5,10,17,25,30,40,50), 
                       min_n = c(5,10,30,50,70,100,500))

xgb_grid_pre5 <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))

glmnet_wf_pre5 <-
  workflow() %>% 
  add_recipe(linear_model_rec_pre5) %>% 
  add_model(glmnet_plan_pre5)

rf_wf_pre5 <-
  workflow() %>% 
  add_recipe(model_rec_pre5) %>% 
  add_model(rf_plan_pre5)

xgb_wf_pre5 <-
  workflow() %>% 
  add_recipe(model_rec_pre5) %>% 
  add_model(XGB_plan_pre5)
glmnet_tuned_pre5 <- glmnet_wf_pre5 %>%
  tune::tune_grid(.,
                  resamples = data_cv_pre5,
                  grid      = glmnet_grid_pre5,
                  control   = control,
                  metrics   = metrics)

rf_tuned_pre5 <- rf_wf_pre5 %>%
  tune::tune_grid(.,
                  resamples = data_cv_pre5,
                  grid      = rf_grid_pre5,
                  control   = control,
                  metrics   = metrics)

xgb_tuned_pre5 <- xgb_wf_pre5 %>%
  tune::tune_grid(.,
                  resamples = data_cv_pre5,
                  grid      = xgb_grid_pre5,
                  control   = control,
                  metrics   = metrics)

Model Results

show_best(glmnet_tuned_pre5, metric = "rsq", n = 15)
show_best(rf_tuned_pre5, metric = "rsq", n = 15)
show_best(xgb_tuned_pre5, metric = "rsq", n = 15)

glmnet_best_params_pre5 <- select_best(glmnet_tuned_pre5, metric = "rmse")
rf_best_params_pre5     <- select_best(rf_tuned_pre5, metric = "rmse")
xgb_best_params_pre5    <- select_best(xgb_tuned_pre5, metric = "rmse")

## Final workflow
glmnet_best_wf_pre5 <- finalize_workflow(glmnet_wf_pre5, glmnet_best_params_pre5)
rf_best_wf_pre5     <- finalize_workflow(rf_wf_pre5, rf_best_params_pre5)
xgb_best_wf_pre5    <- finalize_workflow(xgb_wf_pre5, xgb_best_params_pre5)

6.2 CV on Folds

glmnet_best_OOF_preds_pre5 <- collect_predictions(glmnet_tuned_pre5) %>% 
  filter(penalty  == glmnet_best_params_pre5$penalty[1] & mixture == glmnet_best_params_pre5$mixture[1])

rf_best_OOF_preds_pre5  <- collect_predictions(rf_tuned_pre5) %>% 
  filter(mtry  == rf_best_params_pre5$mtry[1] & min_n == rf_best_params_pre5$min_n[1])

xgb_best_OOF_preds_pre5 <- collect_predictions(xgb_tuned_pre5) %>% 
  filter(mtry  == xgb_best_params_pre5$mtry[1] & min_n == xgb_best_params_pre5$min_n[1])

# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds_pre5 <- rbind(data.frame(dplyr::select(glmnet_best_OOF_preds_pre5, .pred, agg_total_students_5yrlater), model = "glmnet"),
                   data.frame(dplyr::select(rf_best_OOF_preds_pre5, .pred, agg_total_students_5yrlater), model = "rf"),
                   data.frame(dplyr::select(xgb_best_OOF_preds_pre5, .pred, agg_total_students_5yrlater), model = "xgb")) %>% 
  group_by(model) %>% 
  mutate(RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
         MAE  = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
         MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred)) %>% 
  ungroup() %>% 
  mutate(model = factor(model, levels=c("glmnet","rf","xgb")))
ggplot(data = OOF_preds_pre5 %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "#440154") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(x=NULL, title="5 year lag model MAPE")+
  theme_bw()

ggplot(data = OOF_preds_pre5 %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "#440154") +
  geom_label(aes(label = paste0(round(RMSE)))) +
  labs(x=NULL, title="5 year lag model RMSE")+
  theme_bw()

# OOF predicted versus actual
ggplot(OOF_preds_pre5, aes(x = agg_total_students_5yrlater, y = .pred, group = model)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed", color = "#fc8961") +
  geom_smooth(method = "lm", color = "#440154") +
  coord_equal() +
  facet_wrap(~model, ncol=3) +
  labs(title="5 year lag model")+
  theme_bw()

6.2 CV on Test

glmnet_val_fit_geo_pre5 <- glmnet_best_wf_pre5 %>% 
  last_fit(split     = data_split_pre5,
           control   = control,
           metrics   = metrics)

rf_val_fit_geo_pre5 <- rf_best_wf_pre5 %>% 
  last_fit(split     = data_split_pre5,
           control   = control,
           metrics   = metrics)

xgb_val_fit_geo_pre5 <- xgb_best_wf_pre5 %>% 
  last_fit(split     = data_split_pre5,
           control   = control,
           metrics   = metrics)

# collect validation set predictions from last_fit model
glmnet_val_pred_geo_pre5 <- collect_predictions(glmnet_val_fit_geo_pre5)
rf_val_pred_geo_pre5 <- collect_predictions(rf_val_fit_geo_pre5)
xgb_val_pred_geo_pre5 <- collect_predictions(xgb_val_fit_geo_pre5)

# Aggregate predictions from Validation set
val_preds_pre5 <- rbind(data.frame(glmnet_val_pred_geo_pre5, model = "glmnet"),
                   data.frame(rf_val_pred_geo_pre5, model = "rf"),
                   data.frame(xgb_val_pred_geo_pre5, model = "xgb")) %>% 
  left_join(., panel_prelim5 %>% filter(agg_total_students_5yrlater > 0) %>% #because remove NA/infinite alr done before data split, this is ok
              rowid_to_column(var = ".row") %>%  
              dplyr::select(geometry, .row), 
            by = ".row") %>% 
  group_by(model) %>%
  mutate(RMSE = yardstick::rmse_vec(agg_total_students_5yrlater, .pred),
         MAE  = yardstick::mae_vec(agg_total_students_5yrlater, .pred),
         MAPE = yardstick::mape_vec(agg_total_students_5yrlater, .pred)) %>% 
  ungroup() %>% 
  mutate(model = factor(model, levels=c("glmnet","rf","xgb")))
# plot RMSE by model type
ggplot(data = val_preds_pre5 %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = "#440154") +
  geom_label(aes(label = round(RMSE,1))) +
  labs(title="5 year lag preliminary model, validation set, RMSE")+
  theme_bw()


# plot MAPE by model type
ggplot(data = val_preds_pre5 %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = "#440154") +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title="5 year lag preliminary model, validation set, MAPE")+
  theme_bw()

# Validation Predicted vs. actual
ggplot(val_preds_pre5, aes(x = agg_total_students_5yrlater, y = .pred, group = model)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed", color = "#fc8961") +
  geom_smooth(method = "lm", color = "#440154") +
  coord_equal() +
  labs(title="5 year lag preliminary model, validation results")+
  facet_wrap(~model, ncol = 3) +
  theme_bw()