PhillyStat360 — Residential Vacancy Risk Model

Overview — Pipeline

An Automated, Cache-Aware Modeling Pipeline

The pipeline turns administrative data from OPA, L&I, and the Philadelphia Revenue Department into a parcel-level predicted probability of vacancy for every residential property in the city — plus a calibrated 0–100 risk score, an ensemble flag for the top one percent of parcels, and a five-tier rank bucket for dashboard display.

The whole pipeline is R Markdown end to end and lives in code/r_code/. Modeling, validation, calibration, explainability, and tiling all run in R using tidymodels, ranger, xgboost, lightgbm (via bonsai), probably for calibration, and treeshap for local explanations. The data_py/ output folder name is historical — it is just where every step reads from and writes to.

Manual once — establish labels & reference

00_data_inventory → reference + OVS definition

00b_new_data_check → quality check for new sources

02_ovs_exploration → understand the dependent variable

03_1_Ovs → ovs_residential.csv

03_2_Analysis → tier mapping, EDA, baseline model (AUC 0.798)

03_3_Features → features_residential.csv

R Markdown modeling & validation in code/r_code/

04a_tidymodeling → Logit · RF · XGB · LGB · 50/50 calibrated ensemble

04b_model_validation → spatial CV by ZIP, LOGO CV, RF tree-variance CIs

04b_vpi_comparison → per-ZIP, per-building-type, per-ward, equity by income

04c_vs_city_vpi → head-to-head against the City VPI

04d_recalibration → held-out isotonic refit

04e_operational_thresholds → per-ward capacity flagging

04f_local_explanations → TreeSHAP on top-flagged parcels

04g_temporal_validation → evaluation by violation-recency cohort

04h_block_cv_rf → group_vfold_cv by census tract

05_output_analysis → probability summaries, ZCTA choropleth, capacity lookup

06_tiling → PMTiles vector tilesets for web consumption

Caching

Each Rmd caches expensive computations under data_py/cache/ keyed by a SHA-1 of the feature list. Re-runs are fast as long as the feature set has not changed; deleting the cache forces a clean refit.

Reference — Repository Layout

Repository Layout

The repository separates the R Markdown pipeline from the rendered reports and the static web surface. Knit outputs and headline figures live under docs/ so the GitHub Pages site picks them up directly.

code/r_code/ → Full R Markdown pipeline. Steps 00 through 06.

docs/code/outputs/ → Rendered HTML reports plus a handful of standalone PNGs from early-stage EDA.

docs/code/03_*_files/ → Auto-generated figure folders from knitting the Rmd files.

docs/graphs/ → Headline figures used across documentation.

docs/graphs/python/ → Modeling and validation figures. Folder name is historical — every file is now produced by the R pipeline.

docs/ → Static dashboard, landing page, PMTiles, ward GeoJSON, and the optional local Flask backend (tileserver.py + load_db.py). Served as the GitHub Pages root.

Reference — Running the Pipeline

Running the Pipeline

The early steps (00 to 03_3) load raw OpenDataPhilly extracts, validate quality, define the outcome variable, and engineer the feature matrix. They are run once whenever raw inputs change. They produce two flat tables that downstream steps consume:

data/ovs_residential.csv — around 522K residential parcels with the OVS label and source flags.
data/features_residential.csv — the same parcels with around 80 engineered features.

Each Rmd reads its inputs from data/ or data_py/, writes outputs back to data_py/, and knits a self-contained HTML rendering next to itself. The files have a strict left-to-right dependency. 04a fits the four base learners and the calibrated ensemble. 04b through 04h consume those fitted artefacts for validation. 05 produces the stakeholder-facing summaries, and 06 turns the parcel GeoJSON into PMTiles for the website.

How to Re-Run

To re-run the pipeline end to end, knit the Rmd files in alphabetical order. Each step caches its expensive computations under data_py/cache/ keyed by a SHA-1 of the feature list, so re-runs are fast as long as the feature set has not changed. Deleting data_py/cache/ forces a clean refit.

01 — Data Inventory

Data Inventory & Reference

code/r_code/00_data_inventory.Rmd is the reference layer for the entire project. It establishes the vocabulary, the data quality picture, and the definition of the outcome variable before any analysis begins.

1.1 Load and profile all eight raw datasets

All source files are loaded from rawdata/ with row counts, column names, and a glimpse(). Profiling upfront prevents silent data errors downstream — knowing a field has 40% missing values, or that date ranges only go back to 2010, shapes every subsequent design decision.

Dataset	Source	Contents
opa_properties_public.csv	OPA	~583K parcel records. Property characteristics, zoning, assessed values, sale history.
violations.csv	L&I	Full code violation history with code, date, and status (open, complied, closed).
business_licenses.csv	L&I	License history per parcel with revenue code, status, and issue dates.
clean_seal.csv	L&I	City-initiated board-up and securing actions with case date and work order status.
unsafe.csv	L&I	Unsafe structure orders.
imm_dang.csv	L&I	Imminently dangerous structure orders.
assessments.csv	OPA	Historical assessment records.
VIOLATION_DEFINITION.csv	L&I	Lookup from violation code to title and definition text.

1.2 Field-level data dictionaries

For each dataset a kable table is generated with field name, data type, example values, and a plain-English description. Missing-value rates are computed for every column. The dictionary is the team's living reference — rather than re-reading raw files, the dictionary documents what each field means in context.

1.3 Define the OVS (Observed Vacancy Status) outcome

The OVS definition is formalized here and carried unchanged through the entire pipeline. A parcel has OVS = 1 if it meets any of the following.

Source 01

Clean & Seal

Appears in L&I Clean & Seal records within a 2-year window of the training cutoff, with no subsequent demolition or new-construction permit.

Source 02

Open Vacant Violation

Has an OPEN violation with one of eleven specific vacancy codes: 9-3904, 9-3905, PM15-901.1, PM15-901.2, PM15-301, CP-102, CP-103, PM-102.4/1..4.

Source 03

Active Vacant License

Holds an active business license with revenue code 3219 (Residential Vacant) or 3634 (Commercial Vacant) — owner self-declaration, often for reduced tax treatment.

Three independent administrative systems triangulate vacancy more reliably than any single source. Clean & Seal captures severe cases the city has physically intervened on; the violation codes capture inspector-flagged properties; vacant licenses capture owners who self-declare.

1.4 Catalog vacancy-related violation codes

All 54 violation codes whose title or definition text contains the word "vacant" are enumerated. These are separated from the eleven OVS-defining codes. The remaining 43 are vacancy-adjacent signals used in feature engineering without being part of the label itself.

1.5 Computed OVS statistics

Overall observed vacancy rate: ~8.3% (48,671 of 583,802 parcels).
After filtering to residential: ~1.1%.
Source breakdown of OVS = 1: Clean & Seal 88.4%, violation only 3.6%, license only 1.9%, overlapping 6.1%.

Key Finding

Clean & Seal dominates the label. The model will largely learn to predict which properties the city has already physically boarded up, plus catch earlier-stage signals for properties that have not yet reached that point.

Figure 1.1

OVS class balance. Counts of OVS = 1 vs OVS = 0 across all parcels.

Figure 1.2

OVS source combinations. How parcels distribute across the three label sources and their overlaps.

02 — Quality Check

New Data Quality Check

code/r_code/00b_new_data_check.Rmd validates two newer sources before they enter the feature pipeline.

2.1 Real Estate Transfer records (RTT_SUMMARY)

The RTT dataset is loaded and profiled. Key checks: join key detection (confirmed as opa_account_num), match rate against OPA, date range on display_date, and document type distribution (DEED, DEED SHERIFF, MORTGAGE, AGREEMENT OF SALE).

RTT contains the full deed and transfer history, richer than the single most-recent sale_date already in OPA. Rapid re-sales and sheriff (foreclosure) sales are distress signals that predict vacancy independently of physical condition.

2.2 Building and zoning permits

Permits are loaded similarly. The join key is confirmed as opa_account_num; permittype, typeofwork, and status distributions are checked.

Permits play two roles. Demolition permits after a Clean & Seal event remove the parcel from the active vacancy list. New construction permits signal that a formerly vacant lot has been redeveloped. Both scenarios would make the OVS label stale if not accounted for.

03 — OVS EDA

OVS Exploratory Analysis

code/r_code/02_ovs_exploration_JF.Rmd profiles the dependent variable across the full parcel universe before applying the residential filter.

3.1 OVS source overlap

A bar chart shows how many OVS = 1 parcels are flagged by only one source versus combinations of two or three. Clean & Seal alone accounts for the bulk of the label, with the other two sources contributing smaller exclusive shares and a long tail of overlap. Parcels flagged by multiple sources carry high label confidence; parcels flagged by only one source may have higher label noise.

3.2 OVS = 1 versus OVS = 0 parcel characteristics

Box plots and summary statistics compare vacant against occupied parcels.

Market value. Vacant median ~$145K vs. occupied ~$222K.
Exterior condition (1–7 scale). Vacant parcels skew heavily toward 5, 6, 7.
Year built. Vacant parcels tend to be older.
Livable area. Vacant parcels tend to be smaller.

3.3 Spatial summary by ZIP code

Vacancy rates per ZIP range from ~0.3% in newer or wealthier areas to ~17% in areas with concentrated disinvestment. Vacancy is not randomly distributed — it clusters geographically. This finding directly motivated the spatial cross-validation work in Step 8.

3.4 Three case-study parcel drill-downs

Three specific parcels are selected and their full administrative history is reconstructed.

Case 1, severe vacancy. A parcel with C&S records, open violations, and lapsed licenses — confirms the three-source composite correctly identifies clear vacants.
Case 2, commercial false positive. A parking garage flagged by code PM15-901.1. The code applies to vacant property maintenance but parking lots can receive it without being vacant in the residential sense. This led directly to the parking filter in 03_3_Features.Rmd.
Case 3, false negative. An OPA category 6 (Vacant Land) parcel with no violation, license, or C&S signal. Genuinely vacant land is largely invisible to the three-source OVS definition, motivating treating land separately.

Figure 3.1

Exterior condition by OVS. Vacant parcels skew heavily toward conditions 5–7.

Figure 3.2

Year built by OVS. Vacant parcels tend to be older.

Figure 3.3

Market value by OVS. Vacant median ~$145K vs occupied ~$222K.

Figure 3.4

OPA category by OVS. Distribution of OVS by OPA building category.

Figure 3.5

ZIP-level OVS rates. Vacancy rates per ZIP range from ~0.3% to ~17%.

Figure 3.6

Vacant parcels map. Geographic clustering of OVS = 1 parcels.

04 — Label Construction

OVS Construction (Residential Only)

code/r_code/03_1_Ovs.Rmd is the single source of truth for the model's outcome variable. Every downstream file reads ovs_residential.csv rather than re-constructing OVS independently.

4.1 Set the temporal anchor

TRAIN_CUTOFF is fixed at 2025-10-01 globally. Every feature and every OVS source must use data strictly before this date. This prevents any information from after the cutoff from leaking into the model.

4.2 Source 1, Clean & Seal with two Fichman filters

cs_parcels <- clean_seal %>%
  filter(casecreateddate >= (TRAIN_CUTOFF - years(2))) %>%  # 2-year window
  anti_join(demo_after_seal, by = "opa_account_num")        # demolition filter

The two-year window keeps the label current — a C&S record from five years ago does not necessarily mean the building is still vacant today. The demolition filter removes parcels where a completed demolition permit followed the C&S action; the building was torn down, so it is no longer a vacant building.

4.3 Source 2, open vacant violations

Only violations with violationstatus == "open" and a code in the eleven-code OVS list are included. Closed or complied violations mean the issue was resolved and the property should not be labeled as currently vacant.

4.4 Source 3, active vacant property license

Only licenses with licensestatus == "active" and revenue codes 3219 or 3634 are included. An expired or inactive vacant license means the owner no longer holds that designation.

4.5 Merge and compute OVS

All three source flags are left-joined to the OPA property table and combined with OVS = if_else(ovs_cs == 1 | ovs_viol == 1 | ovs_lic == 1, 1, 0). The individual source flags are retained so downstream analysis can decompose the label.

4.6 Filter to residential only

filter(category_code %in% c("1", "2", "3", "14")) removes commercial (4, 5), industrial (implicit), vacant land (6), and other non-residential categories.

Code	Category
1	Single Family
2	Multi-Family, Duplex, Triplex
3	Mixed Use (residential plus commercial)
14	Large Apartment Buildings (4+ units)

4.7 Export

Geometry is dropped on export since downstream files work with flat tables, but the parcel-level building description is retained for the secondary parking and commercial filter applied during feature engineering.

Output

data/ovs_residential.csv — ~520K residential parcels (522,583 unique) with an OVS = 1 rate of about 1.1%.

Figure 4.1

OVS source overlap on residential parcels. Same overlap analysis as Step 3, restricted to residential parcels after the category-code filter.

05 — Tiers & Baseline

Violation Tier Mapping, EDA & Baseline Model

code/r_code/03_2_Analysis.Rmd establishes the violation severity scheme and the performance floor any engineered model must beat.

5.1 The VacancyGuide violation tier scheme

108 L&I violation codes are manually mapped to three severity tiers based on the VacancyGuide2026 methodology developed by Tim Haynes.

Tier

Weight / Codes

Description

Level 1

+1 · 26 codes

Minor deterioration and general maintenance — broken windows, debris, peeling paint.

Level 2

+2 · 19 codes

Moderate structural damage — deteriorated masonry, structural members, roofing.

Level 3

+3 · 63 codes

Severe vacant or unsafe or collapse risk — structural failure, imminently dangerous conditions, known vacancy codes.

Caveat from Fichman's 2026-02-18 Review

The tier weights were established by trial and error by Tim Haynes without statistical validation. They are used as a Track B comparison benchmark, not as primary model inputs. The engineered model in Track A learns its own weights from data.

5.2 Validate tier codes against real violation data

The 108 tier codes are matched against the actual violations dataset to count OPEN violations per tier. This confirms the codes are actively used (not hypothetical) and that Level 3 codes dominate OPEN violations as expected.

5.3 Identify the 54 vacancy keyword codes

VIOLATION_DEFINITION.csv is scanned for any code whose title or definition contains the word "vacant" — yielding 54 codes, a superset of the eleven OVS-defining codes. Using the eleven OVS codes as features would create direct leakage; the broader 54-code set provides predictive signal without circularity.

5.4 EDA — exterior condition vs vacancy rate

Vacancy rates are computed for each exterior condition rating (1 = New through 7 = Collapsed) and plotted as a bar chart. Condition = 7 shows dramatically higher vacancy rates than condition = 1. A violin plot was initially used here, but Fichman pushed back: exterior condition is an ordered categorical variable, not continuous, and bar charts are more honest about the discrete scale.

5.5 New-construction false positive check

Properties built after 2010 and labeled OVS = 1 are isolated and their source breakdown compared to the overall OVS = 1 population. The analysis reveals whether these are genuine vacants (construction never occupied) or data artefacts such as a C&S record on a wrong parcel.

5.6 Violation history across time windows

Mean violation counts in four windows (all-time, five-year, three-year, six-month) are computed separately for OVS = 1 and OVS = 0 parcels. OVS = 1 parcels have consistently higher violation counts across every window. The gap narrows in the six-month window, suggesting that acute recent activity is a weaker signal than long-term chronic history. This validates the life-history features in Step 6.

5.7 Baseline logistic regression

A logistic regression is fit using only four OPA fields — exterior_condition, log_market_value, building_age, log_livable_area — with no violation, license, or transfer features.

Spec	Value
Model	`ovs ~ exterior_condition + log_market_value + building_age + log_livable_area`
Dataset	276K parcels with non-missing values on all four fields
Baseline AUC	0.798

Visible overlap in the predicted score density between OVS = 0 and OVS = 1 indicates room for improvement. Outputs: data/building_tier_mapping.csv, data/vacancy_keyword_codes.csv.

Figure 5.1

Vacancy rate by exterior condition. Bar chart over the 1–7 ordered categorical scale.

Figure 5.2

Violation history faceted by OVS. Mean violation counts in 4 windows, OVS = 1 vs OVS = 0.

Figure 5.3

Baseline ROC curve. Logistic regression on four OPA fields. AUC = 0.798.

Figure 5.4

Baseline predicted probability distribution. Density of predicted scores by OVS class — visible overlap motivates the engineered features.

06 — Feature Engineering

Feature Engineering

code/r_code/03_3_Features.Rmd is the largest and most complex file in the pipeline. It transforms seven raw data sources into a single flat feature matrix. The guiding principle throughout is no temporal leakage. Every feature must be constructed from data strictly before TRAIN_CUTOFF.

6.1 Training-cutoff time windows

Window	Definition	Purpose
RECENT_WINDOW	Last 6 months	Acute current activity
WINDOW_2YR	Last 2 years	Recent trend
WINDOW_3YR	Last 3 years	Mid-term life history
WINDOW_5YR	Last 5 years	Long-term trajectory

6.2 Commercial and parking filter (secondary)

A second-pass filter removes parcels whose building description matches parking or commercial structure keywords. The category-code filter in Step 4 already excludes overtly commercial parcels, but some mixed-use parcels in category 3 are physically parking structures or commercial-only buildings. The fine-grained building-use description catches these edge cases, including the parking garage false positive identified in Case 2.

parking_pattern <- regex("parking|garage|surface lot|parking lot|commercial lot",
                         ignore_case = TRUE)
ovs_df <- ovs_df %>%
  filter(is.na(bldg_desc) | !str_detect(bldg_desc, parking_pattern))

6.3 Demolition and new-construction permit lookup

Permit records are filtered to completed demolition and new-construction permits before TRAIN_CUTOFF. For each parcel the most recent demolition date and most recent new-construction date are recorded.

6.4 Violation features

The OVS-defining eleven violation codes are explicitly excluded from all violation features via filter(!is_ovs_code) — using them would create direct leakage since they are components of the label.

Feature	Description
n_violations_total	All-time violation count
n_violations_recent	Violations in last 6 months
n_violations_2yr	Violations in last 2 years
n_violations_3yr	Violations in last 3 years
n_violations_5yr	Violations in last 5 years
n_violations_open	Count of currently open violations. Excluded from final model as an OVS proxy
n_distinct_codes	Number of unique violation codes ever cited
viol_trend_3v5	Count in 3yr window minus count in the 3 to 5yr window. Positive means worsening
viol_accel_2v3	Count in 2yr window minus count in the 2 to 3yr window. Positive means accelerating
n_repeat_codes	Count of codes cited two or more times. Measures repeat non-compliance
resolution_rate	Fraction of violations resolved (complied or closed). Low means persistent non-compliance
has_maintenance_code	Ever had a Level 1 or 2 maintenance or moderate violation
has_structural_code	Ever had a Level 3 severe or structural violation
has_fire_safety_code	Ever had a fire, egress, or hazard violation
has_open_structural	Has a currently open Level 3 violation
days_since_last_viol	Days from most recent violation to TRAIN_CUTOFF
days_oldest_open_viol	Age of the oldest currently open violation in days

The life-history features fulfill a Fichman requirement. A property with five violations all in the last six months is different from one with five violations spread over ten years. The trend and acceleration features (viol_trend_3v5, viol_accel_2v3) capture whether a property is getting worse, improving, or stable over time, which a single count cannot express.

6.5 Clean & Seal features

C&S features focus on history and trajectory rather than current active status (which is the OVS label component).

Feature	Description
n_cs_total	Total C&S events ever recorded
days_since_last_cs	Days from most recent C&S to TRAIN_CUTOFF
cs_active_2yr	C&S event in last 2 years (excluded — near-current)
cs_has_closed	At least one work order completed
cs_span_days	Days between first and most recent C&S
demo_after_cs	Demolition permit completed after most recent C&S
newcon_after_cs	New construction completed after most recent C&S
cs_truly_active	C&S in 2yr AND no demo/newcon after (excluded — this is the OVS rule)

6.6 License features

License features capture the lifecycle trajectory of how a property's licensed use has changed over time.

Feature	Description
had_vacancy_license	Ever had a vacancy license regardless of current status
n_vacancy_licenses	Total count of vacancy licenses over time
ever_had_vacant_lic	Binary version of above
has_rental_license, has_active_rental	Current active rental license. Excluded from final model as direct complement of vacancy status
ever_had_rental	Ever had a rental license
had_rental_then_vacant	Had a rental license, then later got a vacancy license. Lifecycle transition signal
license_lapse_rate	Fraction of all licenses that are no longer active
days_since_last_lic	Recency of most recent license activity

The had_rental_then_vacant feature is unusually informative. A property that went from having a rental license to a vacancy license has a documented owner-reported history of occupancy followed by abandonment. That transition is a strong qualitative signal of the kind of properties the model aims to predict.

6.7 Unsafe & Imminently Dangerous features

These orders sit outside the regular violation system. Features are simple — a binary "has ever" flag, a count, and days since the most recent order. These were not part of the OVS definition, so there is no leakage risk.

6.8 Real Estate Transfer (RTT) features

RTT records are filtered to deed and sheriff sale documents (excluding mortgages, liens, and other encumbrances) before TRAIN_CUTOFF. The join key is detected automatically by scanning for opa_account_num or parcel_number in the column names.

Feature	Description
n_transfers_total	Total deed or transfer count, all time
n_transfers_5yr	Transfers in last 5 years
n_transfers_3yr	Transfers in last 3 years
n_deed_transfers	Arms-length deed transfers with recorded prices
n_sheriff_sales	Total sheriff foreclosure sale count
had_sheriff_sale	Ever had a sheriff sale
sheriff_sale_recent	Sheriff sale within last 5 years
log_price_change	`log(last_deed_price) - log(prior_deed_price)`. Negative means price declined
days_since_last_transfer	Days from most recent any-type transfer to TRAIN_CUTOFF

OPA's sale_price and sale_date capture only the most recent transaction. RTT adds the full ownership history. Frequent re-sales can indicate a distressed property being flipped without rehabilitation. A sheriff sale in the recent past is a direct foreclosure indicator. A log price decline between the two most recent deed sales suggests the market is pricing in deterioration risk.

last_transfer_price and prior_transfer_price are left as NA when a parcel has fewer than two deed transfers with recorded prices. The Random Forest recipe handles these via median imputation. log_price_change is also left as NA in these cases.

6.9 OPA-derived features

Derived from the OPA fields already present in ovs_residential.csv.

Feature	Description
exterior_condition	Assessor-rated condition on a 1 to 7 scale
building_age	2024 minus `year_built`. Capped at 200 years, negative values set to NA
log_market_value	`log1p(market_value)`. Excluded from the final model due to assessed value bias
log_livable_area	`log1p(total_livable_area)`
log_sale_price	`log1p(sale_price)`
days_since_sale	Days from most recent OPA sale to TRAIN_CUTOFF
years_since_sale	`days_since_sale / 365.25`
is_poor_condition	Binary. `exterior_condition >= 5`
value_per_sqft	`market_value / total_livable_area`. NA when area is zero

6.10 Assemble the feature matrix

All feature groups are left-joined to the OPA base table on parcel_number. Left joins ensure every parcel appears even if it has no records in a particular source. A parcel with no violation history is kept with zeros, not dropped.

NA fill logic.

Integer counts (violation counts, C&S counts, and so on) are filled to zero. A missing parcel means no events occurred.
Rate fields (resolution rate, lapse rate) are filled to zero.
Trend and acceleration fields are filled to zero. No activity means no change.
Days fields are filled to as.integer(TRAIN_CUTOFF - as.Date("2000-01-01")) — a large sentinel value meaning "no activity on record before the data starts".
RTT price fields (log_price_change, raw prices) are left as NA. Genuinely missing when there are fewer than two arms-length sales. Handled by the model recipe.

6.11 Missing-value audit

A summary table shows every feature with its missing-value count and percentage, grouped into flags (Complete, less than 5 percent, 5 to 20 percent, more than 20 percent). Most engineered features are complete. The highest-missing features are days_oldest_open_viol (only populated for parcels with open violations) and price-related fields with fewer than two transactions.

6.12 Univariate correlation screening

Pearson correlations between each numeric feature and ovs are computed and the top 20 by absolute value are plotted. This acts as a quick sanity check — the top features should be conceptually sensible vacancy indicators (violation counts, C&S history, license flags).

Output

data/features_residential.csv — ~520K parcels (522,481 unique) by 80+ columns.

Figure 6.1

Top 20 univariate correlations with OVS. Quick sanity check that the strongest predictors are conceptually sensible.

07 — Modeling

Tidymodeling & the Vacancy Risk Score Ensemble

code/r_code/04a_tidymodeling.Rmd is the heart of the production pipeline. It fits four base learners on the engineered feature matrix, blends two of them into a calibrated ensemble (the Vacancy Risk Score), and exports the artefacts every downstream step depends on.

7.1 Load features and define the model variable set

features_residential.csv is loaded directly. model_vars is an explicit character vector and is the single source of truth for the feature set used everywhere downstream. The exclusions listed below capture the leakage and bias decisions made during feature engineering.

Excluded Feature	Reason
n_violations_open	Direct count of open violations. The OVS violation rule is "any open violation". Near-exact proxy
has_open_violation	One half of the OVS violation rule
has_vacancy_kw_code	The other half. Violation code matches the keyword list
has_open_vacancy_kw	Intersection of both. Directly reconstructs the OVS violation component
cs_truly_active	Current active C&S status is the OVS Clean & Seal rule
cs_active_2yr	Too close to current status. Near-proxy for `cs_truly_active`
has_rental_license, has_active_rental	Current active rental is the functional complement of vacancy license status
log_market_value, value_per_sqft	OPA assessed values carry well-documented racial and geographic bias (Fichman guidance)
last_transfer_price, prior_transfer_price	Redundant with `log_sale_price`. `log_price_change` captures the useful signal instead

Cache files in data_py/cache/04a/ are fingerprinted with a SHA-1 of model_vars. Any change to the feature list invalidates stale fits silently.

7.2 Stratified 70/30 train/test split

split    <- initial_split(df, prop = 0.70, strata = ovs)
train_df <- training(split)
test_df  <- testing(split)

A temporal split was evaluated and discarded earlier in the pipeline. Activity recency is itself a vacancy proxy via fields like days_since_last_viol and cs_active_2yr. A temporal split therefore causes distributional shift, with train OVS-equal-one rate around 0.2 percent and test rate around 6.9 percent. AUC estimates become unreliable. Stratified random split ensures both train (around 364K rows) and test (around 156K rows) have approximately the production-prevalence OVS-equal-one rate of around 1.1 percent. Temporal generalization is tested separately in Step 14.

7.3 Preprocessing recipe

Every base learner is wrapped in a recipes recipe with the same three steps in the same order.

step_impute_median() — imputes NAs primarily in log_price_change, days_oldest_open_viol, and log_sale_price.
step_zv() — removes zero-variance columns that survived feature engineering.
themis::step_rose() — synthetic minority oversampling applied only to the training fold of each cross-validation split. Wrapping it in the recipe ensures the resampling step never leaks into the test fold.

OVS prevalence sits at around 1.1 percent. Without correction, models tend to drive themselves toward the majority class. ROSE rebalances the training data before the model is fit. Class-weight tuning (described below) provides a complementary lever inside the model itself.

7.4 Four base learners

Each learner is fit with hyperparameters chosen up-front rather than searched on every run. Tuning results from earlier runs are kept in data_py/rf_tune_results.csv, data_py/xgb_tune_results.csv, and data_py/lgb_tune_results.csv.

Model	Configuration
Logistic Regression	`glm` with L2 regularization via `glmnet`, class weights inversely proportional to prevalence
Random Forest	`ranger` with `num.trees = 500`, `mtry = 7`, `min.node.size = 5`, class weights set to balance
XGBoost	Pre-tuned hyperparameters loaded from `xgb_tune_results.csv`, `scale_pos_weight` tuned to prevalence
LightGBM	Pre-tuned hyperparameters loaded from `lgb_tune_results.csv`, class weights set to balance via `bonsai`

The Random Forest mtry value of seven corresponds approximately to the floor of the square root of the feature count, matching the canonical default for RF on this size of feature set.

7.5 The Vacancy Risk Score ensemble

The production score is a 50/50 average of the calibrated Logit and Random Forest probabilities. XGBoost and LightGBM are kept as diagnostic comparators only. They are reported in the model thresholds table and ablation results, but do not feed the production ensemble.

ensemble_prob_raw <- 0.5 * logit_prob_raw + 0.5 * rf_prob_raw
ensemble_prob     <- predict(isotonic_ensemble, ensemble_prob_raw)

Isotonic regression is fit on the test set predictions of the raw ensemble.

Vacancy is rare. Raw probabilities cluster well below 0.5 even for parcels the model is confident about. Isotonic calibration maps the raw score back to honest empirical positive rates — even the very top one percent of parcels is only around 59% truly vacant, so the highest calibrated probability is around 0.6, not 1.0.

Important Implication

Calibrated probabilities should never be compared to a fixed threshold like 0.5 — almost nothing will exceed it. Use ensemble_flag (top one percent by raw rank) or qtile_tier (a five-bucket rank) for inspection triage.

7.6 Headline test-set metrics

Metric	Value
Test ROC-AUC	0.9395
Test PR-AUC	0.5461 — beats both parents
Test Brier	0.0068
Mean ensemble P(vacant) across full population	~1.28%

Variable importance from the Random Forest is dominated by C&S history, vacancy license history, violation trajectories, and recency signals. RTT features (had_sheriff_sale, log_price_change) appear in the top half when they carry independent signal.

7.7 Outputs from 04a

File	Contents
data_py/model_logit_final.rds	Fitted Logistic Regression pipeline
data_py/model_rf_final.rds	Fitted Random Forest pipeline (around 178 MB)
data_py/model_xgb_final.rds	Fitted XGBoost pipeline
data_py/model_lgb_final.rds	Fitted LightGBM pipeline
data_py/calibrators.rds	Per-model isotonic calibrators plus the ensemble calibrator
data_py/all_predictions_rf.csv	One row per parcel (around 520K) with all model probabilities, calibrated and raw, ensemble flag, risk_score, qtile_tier
data_py/model_thresholds.csv	Per-model Youden thresholds plus AUC, sens, spec, plus an `is_best` boolean
data_py/rf_tune_results.csv	Cached RF tuning grid
data_py/xgb_tune_results.csv	Cached XGBoost tuning grid
data_py/lgb_tune_results.csv	Cached LightGBM tuning grid

The all_predictions_rf.csv columns are worth memorizing.

risk_score — integer 0–100, ensemble probability times 100. Use for dashboard display.
ensemble_prob — calibrated probability 0–1. Use for "X% chance of being vacant" callouts.
ensemble_prob_raw — uncalibrated. Use for ranking and sorting because it preserves spread.
ensemble_flag — 1 equals top one percent by raw rank. Use for operational triage.
qtile_tier — five-bucket rank for tier badges in UI such as "Top 1 percent (highest risk)".

Figure 7.1

ROSE versus no subsampling. Class-balance lever before model fitting.

Figure 7.2

Train vs test density. Predicted-score density on each split — gap indicates overfit.

Figure 7.3

Calibration curve for the ensemble. Reliability after isotonic calibration.

Calibrated versus raw probability distributions

Figure 7.4

Calibrated vs raw distributions. Why the highest calibrated probability sits ~0.6, not 1.0.

Figure 7.5

ROC curves, all four base models. Logit, RF, XGBoost, LightGBM, plus the ensemble.

Random Forest variable importance, top 20

Figure 7.6

Random Forest top-20 variable importance. C&S history and violation trajectories dominate.

Figure 7.7

Threshold sensitivity. Precision and recall as the operating threshold sweeps.

Figure 7.8

Five-tier probability distribution. Population shares per qtile_tier bucket.

08 — Spatial Validation

Spatial Validation & Prediction Confidence Intervals

code/r_code/04b_model_validation.Rmd tests whether the production ensemble generalizes to data it was not trained on. The motivating concern is spatial autocorrelation — parcels in the same neighborhood share many features, including features that are themselves built from neighborhood activity. A standard random split overstates AUC because the test set contains parcels whose neighbors taught the model how to score them.

8.1 Spatial cross-validation, 10-fold by ZIP group

folds <- group_vfold_cv(df, group = zip_code, v = 10)
fits  <- map(folds$splits, ~ fit(workflow, data = analysis(.x)))

group_vfold_cv ensures that all parcels in a given ZIP code stay together — either all in training or all in test, never split across both. With ten folds, each fold holds out approximately ten percent of Philadelphia ZIPs. A parcel in 19139 cannot show up in both train fold 1 and test fold 2, so the model cannot lean on memorized neighborhood patterns.

A per-fold line chart shows AUC and J-Index across the ten folds. High variance across folds would suggest the model struggles with some geographic areas.

Result

Spatial CV mean AUC is 0.8877 ± 0.0064 across the ten folds — the lower-bound, conservative AUC for the RF component without ensemble calibration.

8.2 Leave-One-ZIP-Out (LOGO) cross-validation

LOGO takes the spatial CV to its extreme. Each of Philadelphia's residential ZIP codes is held out exactly once across separate model fits. To keep total compute bounded, the step samples 15 of the roughly 45 residential ZIPs uniformly at random and fits a fresh RF for each.

Per-ZIP AUC results are plotted as a ranked bar chart. ZIPs with AUC less than 0.70 would be highlighted in red as candidates for manual review. In the most recent run none of the sampled ZIPs fall below that line, with median AUC of 0.95 and above. The mean LOGO AUC is 0.8849, very close to the 10-fold spatial CV result.

LOGO matters for deployment. If the city ever applies the model to newly annexed areas, or if future data includes ZIPs not well represented in the training window, LOGO CV predicts how well the model will perform in those situations.

8.3 Prediction confidence intervals from the RF tree variance

The Random Forest is leveraged for free uncertainty quantification by computing per-tree probabilities and taking their dispersion across the 500 estimators.

tree_probs <- predict(rf_fit, test_df, predict.all = TRUE)$predictions[, 2, ]
rf_prob    <- rowMeans(tree_probs)
rf_se      <- apply(tree_probs, 1, sd) / sqrt(rf_fit$num.trees)
ci_lower   <- pmax(0, rf_prob - 1.96 * rf_se)
ci_upper   <- pmin(1, rf_prob + 1.96 * rf_se)
ci_width   <- ci_upper - ci_lower

A parcel with a 0.60 RF probability and CI width 0.05 is a confident prediction — the trees agree. The same 0.60 probability with CI width 0.35 is ambiguous — different subsets of trees give wildly different predictions.

Mean CI width across the test set: ~0.0466.
Share of parcels with CI under 0.10: ~45% (high-confidence band).
Share of parcels with CI over 0.30: ~5% (uncertain band).
Share of currently flagged parcels with CI > 0.30: ~4.3% — flagged for inspector review rather than automated action.

8.4 Sanity checks

Four checks confirm the model is learning from the right signals.

Feature-importance alignment. Top 20 features by Gini importance are compared against an expected_top list of domain-sensible features (vacancy license history, C&S total, license lapse rate, days since last violation). The five strongest signals are n_violations_total, days_since_last_viol, n_cs_total, n_violations_recent, and has_fire_safety_code. All sit comfortably in the expected list.
Calibration curve. The test set is binned by predicted probability and the observed vacancy rate within each bin is plotted against the bin midpoint. The original ensemble lies very close to the 45° diagonal, slightly under-predicting vacancy by around 0.03 percentage points on average.
Known-vacant holdout check. Test parcels are grouped by which OVS signal they carry (cs_truly_active, had_vacancy_license, has_open_vacancy_kw, or none). Mean predicted probability is reported per group. Parcels with cs_truly_active score around 2.5 times higher than the no-signal group, and parcels with a vacancy license score around 4.5 times higher. The model is responding to its strongest available signals.
Partial dependence spot-check. n_violations_total is varied from its 1st to 99th percentile while all other features are held at training-set medians. The predicted probability rises monotonically with violation count, as expected.

Outputs

data_py/spatial_cv_metrics.csv per-fold metrics, data_py/logo_cv_metrics.csv per-ZIP metrics, data_py/predictions_with_ci.csv test set with CI bounds, and data_py/validation_summary.csv headline metrics table.

Figure 8.1

Spatial CV per-fold AUC and J-Index. Variance across the 10 ZIP-grouped folds.

Figure 8.2

LOGO AUC by held-out ZIP. 15 sampled ZIPs, ranked.

Figure 8.3

Prediction CI analysis. RF tree-variance CI width vs predicted probability.

Figure 8.4

04b feature importance. Sanity-check reproduction of 04a's RF VIP under spatial CV.

09 — Subgroup & Equity

Subgroup Generalization & the Equity Audit

code/r_code/04b_vpi_comparison.Rmd breaks the citywide AUC apart along three operationally relevant axes (ZIP, building category, ward) and runs an equity audit by census-tract median household income.

9.1 Frame and tier classification

Predictions are joined to features (for ZIP and tract) and OPA (for ward and category). A five-tier classification using equal-width 0.2 probability buckets — Very Unlikely, Unlikely, Maybe, Likely, Very Likely — is applied to every parcel as prob_tier. The full population skews heavily into the lowest tier (~99.88% Very Unlikely). The Very Likely tier (> 0.8) holds about 0.47% — the operational high-risk pool.

9.2 AUC by ZIP code

For each ZIP with at least 50 parcels and at least 5 observed vacants, AUC is computed by calling roc_auc_score on the subset. Forty-one ZIPs survive the minimum-N filter. Results are ranked and plotted as a horizontal bar chart with ZIPs colored red where AUC falls below 0.70.

Median per-ZIP AUC: 0.973.
5th percentile: 0.949.
No ZIP below 0.70.

9.3 AUC by building type

Category	Parcels	AUC
Single Family	~461K	0.980
Multi-Family / Duplex / Triplex	~41K	0.978
Mixed Use	~14K	0.945
Apartments 4+ Units	~3.6K	0.942

9.4 Five-tier distribution by category

Tier counts and observed OVS rates per tier are computed within each category. The Very Likely tier shows significantly higher observed vacancy rates than Very Unlikely inside every category — confirming the probability scores discriminate within building type and not just on average.

9.5 Ward-level summary

Philadelphia's 66 political wards are used as an aggregation unit. For each ward this step reports parcel count, observed vacancy count and rate, mean predicted probability, count in Very Likely and Likely tiers, and a combined "high risk" rate (probability of 0.6 or higher). Top wards by mean predicted probability:

Ward	Mean predicted
11	0.0436
44	0.0419
28	0.0410
6	0.0384
16	0.0377

9.6 Equity audit by census-tract income quintile

ACS 2022 5-year median household income (B19013_001E) is joined at the census-tract level. Tracts are binned into income quintiles (Q1 = lowest, Q5 = highest); coverage of the join is ~99.5%.

Quintile	Observed rate	Mean predicted	AUC
Q1 (lowest income)	2.60%	2.93%	0.97
Q2	Lower	Lower	0.97
Q3	Lower	Lower	0.97
Q4	Lower	Lower	0.98
Q5 (highest income)	0.31%	0.28%	0.97

Headline Equity Finding

AUC is consistent across all income quintiles (0.968–0.978). The model identifies vacancy with the same ranking quality whether the parcel sits in a low-income or high-income tract. Mean predicted probability tracks observed rate closely in every quintile — the model is not systematically over- or under-predicting in any income band. It concentrates absolute flag counts in lower-income areas because vacancy is genuinely concentrated there.

Outputs. data_py/predictions_04b.csv parcel frame, data_py/zip_auc_04b.csv per-ZIP AUC, data_py/ward_summary_04b.csv per-ward rollup, data_py/equity_income_auc.csv equity audit, plus a static ZCTA choropleth and an interactive Leaflet map.

Figure 9.1

AUC by ZIP code. Per-ZIP discrimination, ranked. No ZIP falls below 0.70.

Equity flag rate vs observed vacancy by ZIP

Figure 9.2

Equity scatter. Per-ZIP flag rate vs observed vacancy rate — alignment indicates calibration.

Figure 9.3

Mean predicted probability by ward. Wards 11, 44, 28 lead.

Figure 9.4

AUC by income quintile. 0.968–0.978 across Q1–Q5.

10 — Vs City VPI

Head-to-Head Comparison Against the City VPI

code/r_code/04c_vs_city_vpi.Rmd compares the production ensemble to the City of Philadelphia's binary Vacant Property Indicator (VPI) head-to-head on the residential-only universe. rawdata/vpi_bldg.geojson contains 8,048 buildings flagged by the City. After filtering to residential parcels via the OPA join, 6,418 parcels (~79.7%) remain in scope. The remaining 20-plus percent are commercial or industrial properties not covered by the residential model.

10.1 Match-prevalence comparison

The ensemble probability is thresholded at the level that produces approximately the same number of positives as the City's VPI (~6,400 parcels).

Approach	Flagged	Precision	Recall vs OVS
Ensemble (matched capacity)	~6,400	54.7%	78.3%
City VPI	~6,400	53.3%	57.6%
Union of both	~9,800	47.9%	79.7%

The continuous-score view is cleaner. Ensemble AUC against OVS is 0.9786; VPI as a binary flag has AUC 0.7849. Ensemble average precision is 0.7538; VPI is 0.3116.

10.2 Four-bucket disagreement analysis

Bucket	Parcels	Observed OVS rate
Both flag the parcel	3,199	83.3%
Only the model flags	5,300	37.4%
Only the VPI flags	3,211	23.4%
Neither flags	508,486	0.1%

The "only ours" bucket has substantially higher observed vacancy than the "only VPI" bucket — the model's exclusive flags are higher quality on average than the VPI's exclusive flags.

10.3 Per-ZIP precision scatter

Per-ZIP precision is plotted with City VPI on the x-axis and ensemble on the y-axis (one point per ZIP, ≥50 parcels). The model achieves higher precision than VPI in around 30 of the 41 ZIPs, sometimes by a wide margin.

10.4 High-probability OVS = 0 candidates

A separate analysis isolates the top 339 parcels with the highest ensemble probability that have OVS = 0. Twenty-seven of those also appear in the City VPI — independent corroboration that the parcel is genuinely vacant despite OVS missing it. The remaining 312 are candidate false positives or genuinely undetected vacants and are exported with addresses for inspector review.

Outputs

data_py/vs_city_vpi_headline.csv, data_py/vs_city_vpi_buckets.csv, data_py/vs_city_vpi_zip.csv, data_py/vs_city_vpi_parcel.csv, data_py/vs_city_vpi_map.html (interactive overlay of all three buckets), and data_py/vs_city_vpi_high_prob_ovs0_map.html (the candidate review map).

Figure 10.1

Ensemble vs City VPI — ROC and PR curves. Ensemble AUC 0.9786 vs VPI 0.7849.

Figure 10.2

Four-bucket disagreement. Both / model-only / VPI-only / neither, with observed OVS rates per bucket.

Figure 10.3

Disagreement map. Spatial layout of model-only vs VPI-only flags.

Per-ZIP precision scatter, ensemble vs VPI

Figure 10.4

Per-ZIP precision, ensemble vs VPI. Above the diagonal = model wins. Ensemble wins ~30 of 41 ZIPs.

Figure 10.5

High-prob OVS = 0 candidates. Top 339 model-flagged parcels with OVS = 0 — candidates for inspector review.

11 — Recalibration

Held-Out Recalibration

code/r_code/04d_recalibration.Rmd refits the calibrator on a held-out half of the test split and evaluates calibration on the other half. The original 04a calibrator was fit on the full test set — a deliberate shortcut for an initial release that allows a small amount of optimism into the calibrated probabilities.

11.1 Split, refit, measure

The 30% test set from 04a is split 50/50 stratified by OVS. The raw ensemble probabilities for the calibration half are used to fit a fresh IsotonicRegression; the new calibrator is applied to the full population producing ensemble_prob_v2. On the evaluation half (which neither the model nor either calibrator has seen):

Metric (evaluation half)	Original	Recalibrated
ROC-AUC	0.9399	0.9375
Brier	0.0067	0.0067
Mean predicted to observed ratio	1.02×	1.04×

Both reliability curves follow the diagonal, but the recalibrated version stays closer across all probability bins. Practical takeaway: the original calibrator was already nearly correct — ensemble_prob_v2 is the recommendation for production, but the substantive difference is small enough that consumers already on ensemble_prob do not need to migrate urgently.

Outputs

data_py/calibrators_v2.rds and data_py/predictions_calibrated.csv (full population with the new column).

Reliability curve, original versus recalibrated

Figure 11.1

Reliability curve, original vs recalibrated. Recalibrated curve hugs the 45° diagonal more tightly across all bins.

12 — Operational Capacity

Operational Thresholds & Per-Ward Capacity

code/r_code/04e_operational_thresholds.Rmd translates model scores into actionable inspection capacity. Rather than choose one citywide threshold, it applies a per-ward capacity policy: inspect the top N parcels per ward, where N is some fraction of the ward's residential parcel count.

12.1 The capacity policy

CAPACITY_PCT is parameterized at one percent of each ward's residential parcels per inspection cycle. Within each ward, the top N parcels by ensemble probability are flagged. The same policy is then applied to the City's VPI for direct comparison, and to the union of the two.

12.2 Citywide results at 1% ward capacity

Strategy	Flagged	Precision	Recall
Model only	5,232 (1.01%)	57.0%	49.1%
City VPI only	6,410 (1.23%)	53.3%	57.6%
Union	9,874 (1.90%)	47.9%	79.7%

The model achieves the highest precision per parcel inspected. The union strategy trades precision for recall and is the right choice when coverage is the priority.

12.3 Per-ward precision scatter

A per-ward scatter (City VPI precision on the x-axis, model precision on the y-axis) shows wide variance across wards. Some wards reach perfect or near-perfect precision under the model's flagging policy, while others land in the 0.3–0.5 range. It flags wards where precision falls below a configurable threshold for individual review.

Outputs

data_py/operational_flags_by_ward.csv per-ward summary, data_py/operational_precision_at_capacity.csv headline citywide table, and the per-ward precision scatter PNG.

Operational precision at one percent ward capacity

Figure 12.1

Operational precision at 1% ward capacity. Per-ward precision under the capped per-ward policy.

Figure 12.2

Capacity vs threshold curve. Number of parcels flagged for any chosen threshold — lookup for ops teams.

13 — Local Explanations

Local Explanations with TreeSHAP

code/r_code/04f_local_explanations.Rmd. A model that is operationally useful must be explainable at the parcel level. An inspector who is told "go visit this address" must be able to ask "why this one" and get a meaningful answer.

13.1 Explainer setup & per-parcel attributions

This step uses treeshap on the fitted Random Forest component (which dominates the feature attribution alongside the Logistic Regression in the ensemble). For each parcel the explainer returns one SHAP value per feature. The step ranks parcels by ensemble probability, takes the top 200, and for each parcel records the five features with the largest absolute SHAP value, the feature value, the SHAP magnitude, and a direction (pushed up or pushed down).

13.2 Global SHAP summary across the top 200

days_since_last_cs — Clean & Seal recency is the single largest signal.
n_cs_total — total Clean & Seal events.
days_since_last_viol — recent violation pushes probability up.
n_violations_5yr & n_violations_3yr — five and three year violation density.

A useful pattern in the local explanations: some neighborhood-level features push downward at flagged parcels. A high count of vacant parcels in the same ZIP can lower the ensemble probability for a specific parcel — the model accounts for the neighborhood baseline elsewhere, so an individual parcel in a high-vacancy ZIP is comparatively less risky than a parcel that looks unusual within its ZIP.

Outputs

data_py/parcel_shap_topflags.csv (1,000 rows of parcel by top-five-feature explanations) and docs/graphs/python/shap_summary_topflags.png.

Figure 13.1

Top SHAP drivers across the top 200 flagged parcels. Mean absolute SHAP per feature.

14 — Temporal Validation

Temporal Validation

code/r_code/04g_temporal_validation.Rmd tests how the model performs on parcels with very recent administrative activity versus parcels whose activity is years old. The training split is stratified random rather than temporal — this is correct for measuring AUC (see Step 7.2), but it leaves open the question of how the model performs across activity windows. days_since_last_viol is used as the temporal proxy.

14.1 Cohorts

Old cohort. Last violation > 1,100 days ago, or none on record. ~460,820 parcels (88.6%), OVS rate 0.397%.
New cohort. Violation within 520 days (since early 2024). ~35,313 parcels (6.8%), OVS rate 8.9%.

14.2 Cohort-level metrics

Cohort	Observed rate	Mean predicted	AUC	Brier
Old	0.397%	0.45%	0.9579	0.0028
New	8.90%	9.63%	0.9760	0.0226

The model's discrimination is actually higher on the new cohort, where the recent administrative trail is rich. The Brier score on the new cohort is larger in absolute terms because prevalence is roughly 20× higher, but the calibration ratio is good. There is no evidence of temporal decay between the older and newer activity windows.

The OVS rates differ by an order of magnitude between the two cohorts, which by itself is informative. Recent violation activity is itself a strong vacancy signal, even before the model touches it.

Outputs

data_py/temporal_validation.csv and docs/graphs/python/temporal_validation.png.

Temporal validation, old versus new cohort

Figure 14.1

Temporal validation, old vs new cohort. ROC curves and calibration on the two recency-defined cohorts.

15 — Block CV

Block Cross-Validation by Census Tract

code/r_code/04h_block_cv_rf.Rmd. Several engineered features are spatially smooth — neighborhood vacancy counts, ZIP-level aggregations, and the implicit information that a parcel near many flagged parcels is itself probably flagged. A standard random train/test split puts neighbors on opposite sides of the split, so the model effectively gets to see its training answer through its neighborhood features when scoring the test set. This inflates AUC.

This step re-fits the Random Forest under group_vfold_cv(df, group = census_tract, v = 5). Each fold holds out approximately 20% of Philadelphia's census tracts entirely. No tract appears in both train and test. Production RF hyperparameters are loaded directly from model_rf_final.rds; the same imputation and variance-threshold preprocessing is reused. ROSE oversampling is not applied here because the goal is honest discrimination, not optimized recall.

15.1 Per-fold AUC

Fold	AUC
1	0.9677
2	0.9789
3	0.9644
4	0.9713
5	0.9564

The Headline Number

Block-CV mean AUC is 0.9677 ± 0.0083. The random-split test AUC from 04a was 0.9395 (ensemble) and the spatial-CV ZIP-blocked AUC from 04b was 0.8877 (RF only). The block-CV AUC sits between these, which is consistent with the underlying logic — ZIP groups are larger than census tracts, so ZIP blocking is a more aggressive test of generalization. Census-tract blocking is the right granularity for the operational story because the city does inspect and intervene at the tract level all the time. ~0.97 is the right number to put on a stakeholder slide as the honest, leakage-controlled AUC for the Random Forest component.

Outputs. data_py/block_cv_rf.csv and docs/graphs/python/block_cv_rf_aucs.png.

Figure 15.1

Block CV AUC by fold. Five tract-grouped folds, mean AUC 0.9677 ± 0.0083.

16 — Output Analysis

Output Analysis & Stakeholder Summaries

code/r_code/05_output_analysis.Rmd turns production predictions into stakeholder-facing materials — distributions by category and ZIP, a calibration scatter, a choropleth map, an interactive Leaflet map, and a capacity-based threshold lookup table.

16.1 The "no-vote" framing

About 25% of residential parcels have any positive tree votes from the Random Forest. The remaining 75% receive exactly zero probability, meaning the ensemble unanimously predicts occupied. Three quarters of the residential housing stock is so clearly not vacant that no tree in 500 disagrees.

16.2 Category summary

Mixed Use carries the highest mean predicted probability at ~2.73%; Single Family the lowest at ~1.20%. Both mean and median are reported because the distribution is right-skewed — a few high-risk parcels pull the mean up, and the median reflects the typical parcel.

16.3–16.5 Within-category visualization

A density plot of rf_prob is drawn per category, filtered to rf_prob > 0. The zero-vote parcels are excluded because they create a spike at zero that dominates the plot and obscures meaningful variation in the non-zero range. A horizontal bar chart shows mean predicted probability per category with observed OVS rate overlaid as red diamond markers; a stacked bar shows tier distribution within each category using log-spaced breaks (No Signal, Low, Moderate, High, Very High). A category calibration scatter plots mean predicted against observed rate with point size proportional to parcel count — points on the 45° diagonal are perfectly calibrated.

16.6 ZIP code summary

ZIP 19132 has the highest mean predicted probability at ~3.61%. ZIP 19137 has the lowest at ~0.85%.

16.7 ZCTA choropleth and interactive Leaflet map

tigris::zctas(starts_with = "191") downloads US Census ZCTA boundaries for Philadelphia ZIPs (a few MB). These are joined to the ZIP-level summary and rendered as a static choropleth, plus an interactive Leaflet map with hover highlighting and click popups showing ZIP, mean P(vacant), parcel count, and observed vacancy rate.

Why ZCTA, not parcel polygons?

ZCTA boundaries are used rather than PWD_PARCELS.geojson for the static map because the parcel file is over 400 MB and consistently causes the R session to run out of memory or render at unusable resolutions. The parcel-resolution map lives separately in Step 17 as a vector tileset that the browser handles efficiently.

16.8 Capacity-based threshold lookup

A precision-recall curve is computed across thresholds 0.01–0.99 with each row annotated by the corresponding number of parcels flagged. Common inspection volumes (100, 250, 500, 1K, 2K, 5K, 10K) are read off the curve as a lookup table. To flag 5,000 parcels: threshold ≈ 0.67, precision ≈ 78%, recall ≈ 67%. Operations teams pick a row based on actual resourcing rather than committing to a fixed citywide threshold.

16.9 Optional GeoJSON exports

When PWD_PARCELS.geojson is available, this step writes two GeoJSON files for downstream use.

data_py/vacancy_predictions.geojson — around 436K parcels, around 405 MB. All prediction columns plus all feature metadata.
data_py/vacancy_predictions_flagged.geojson — around 4,500 parcels (top one percent flagged), around 3.9 MB. Same column set, smaller geometry.

Outputs

data_py/output_zip_summary.csv, data_py/output_category_summary.csv, data_py/capacity_threshold_curve.csv, data_py/vacancy_risk_map.html, plus the two GeoJSON files and a folder of static PNGs under docs/graphs/python/.

Figure 16.1

Probability distribution by category. Density of rf_prob per OPA category, non-zero predictions only.

Figure 16.2

Mean predicted probability by category. Bars with observed OVS rate overlaid as red diamonds.

Figure 16.3

Five-tier distribution by category. Stacked bars using log-spaced breaks (No Signal…Very High).

Figure 16.4

Category calibration scatter. Mean predicted vs observed rate per category, point size by parcel count.

Figure 16.5

Mean predicted probability by ZIP. Bars + observed OVS rate as diamond markers.

Figure 16.6

ZCTA choropleth of mean P(vacant). Static map for stakeholder summaries.

17 — Vector Tilesets

Vector Tilesets for Web Consumption

code/r_code/06_tiling.Rmd. The 405 MB GeoJSON from Step 16 is unusable in a browser as is. This step converts both GeoJSON exports into PMTiles, a single-file vector tileset format that the browser can stream from static storage (S3, GCS, GitHub Pages) and consume directly via MapLibre GL JS or Mapbox GL JS without a tile server.

17.1 Toolchain

tippecanoe (version 2.80, Felt fork) is the converter. It is invoked via system2() from inside the Rmd. The Felt fork is required for the PMTiles output format — vanilla tippecanoe writes mbtiles only.

17.2 Flagged tileset

vacancy_flagged.pmtiles covers the top one percent flagged parcels — around 4,500 features, zoom 10–16. The layer name inside the tileset is flagged. At lower zooms tippecanoe coalesces the densest parcels so the tileset stays readable. Final file size: ~2.2 MB. This is the right tileset for high-zoom drill-down on the inspection target list.

17.3 Full prediction tileset

vacancy_predictions.pmtiles covers all ~436K parcels, zoom 10–15. The layer name inside the tileset is parcels. The pipeline applies a 10-unit simplification at low zooms and drops the densest features when tile size exceeds the limit. Final file size: ~46 MB.

Both tilesets carry the full feature property bag including risk_score, ensemble_prob, qtile_tier, zip, tract, ward, address, and ovs. Client-side styling and filtering are therefore possible without round-tripping to a server. A consumer can color by risk score, filter to flagged only, search by ZIP, or drill into a single parcel popup, all in the browser.

For visual sanity-checking the file https://protomaps.github.io/PMTiles/ accepts a PMTiles URL and renders it.

Outputs

data_py/tiles/vacancy_flagged.pmtiles and data_py/tiles/vacancy_predictions.pmtiles.

Reference — Website & Dashboard

Website & Dashboard

Everything that gets handed to a non-technical audience lives in docs/. The folder is fully static, ships with a graceful fallback for the Flask backend, and is designed to be pushed to GitHub Pages or any static host without modification.

Public-facing pages

File	Purpose
docs/index.html	Tiny redirect that drops visitors onto the landing page
docs/Vacancy Risk Landing Page.html	Project landing page. The "what this is and why it matters" surface for stakeholders
docs/PhillyStat360 v2.html	Full methodology write-up. Mirrors the step-by-step structure of the README in a presentation-ready layout
docs/dashboard.html	Interactive parcel-level dashboard. MapLibre + PMTiles vector parcels, SHAP risk drivers per parcel, ward choropleth, sidebar summary cards, ward filter, and parcel search

Data shipped with the dashboard

The dashboard needs a small set of static files alongside the HTML. Each file is the direct output of a pipeline step.

File	Source step	Purpose
vacancy_predictions.pmtiles (~46 MB)	Step 17	Full citywide parcel tileset, layer name `parcels`
vacancy_flagged.pmtiles (~2.2 MB)	Step 17	Top one percent flagged parcels, layer name `flagged`
vacancy_predictions_flagged.geojson (~3.9 MB)	Step 16	Flagged parcels as GeoJSON for the SHAP table backend
ward_boundaries.geojson	Step 16	Polygon geometry for the 66 political wards
ward_stats.json	Step 16	Per-ward rollups used to power the choropleth, summary cards, and ward filter list in static mode
dashboard_shap.json	Step 13	Per-parcel SHAP risk drivers for the dashboard's "why this parcel" panel
colors_and_type.css	—	Shared colour palette and type scale across landing page, methodology, and dashboard

Static vs local-server mode

dashboard.html auto-detects the host. On localhost or 127.0.0.1 it tries the local Flask backend first (docs/tileserver.py, backed by PostgreSQL/PostGIS via docs/load_db.py). On any other host it skips the backend entirely and falls back to the JSON files above.

Feature	Local-server mode	Static mode
Map basemap and PMTiles	CARTO basemap, vector parcels from `vacancy_predictions.pmtiles`	Same
Parcel popups (click)	Full property bag from PMTiles	Full property bag from PMTiles
SHAP risk drivers	`dashboard_shap.json`	`dashboard_shap.json`
Ward choropleth	`ward_stats.json` + `ward_boundaries.geojson`	Same
Sidebar summary cards	`/summary` endpoint	Aggregated from `ward_stats.json`
Ward filter list	`/wards` endpoint	From `ward_stats.json`
Ward fly-to	`/ward_bounds` endpoint	Computed client-side from the ward boundary GeoJSON
Parcel search	DB-wide search via `/search`	Limited to parcels currently rendered in view (`queryRenderedFeatures`)
Census tract filter	`/census_tracts` and `/tract_bounds`	Disabled unless `census_tracts.json` is shipped

The local Flask backend is optional. It exists for the case where the city wants DB-wide parcel search and tract-level filtering on a workstation. Everything in the public deployment is static.

Deployment

Full GitHub Pages deployment notes live in docs/DEPLOY.md. The short version is:

# From the repo root, push the docs/ folder along with everything else:
git add docs/
git commit -m "Update site"
git push origin main
# Settings → Pages → Source: Deploy from a branch, main / docs

The two PMTiles files stay in the repo as ordinary objects rather than Git LFS, because GitHub Pages does not serve LFS objects through its CDN. The largest file (vacancy_predictions.pmtiles, around 46 MB) sits comfortably under GitHub's 100 MB per-file limit. To host the tilesets on S3, Cloudflare R2, or any other static bucket instead, set the PMTILES_URL constant at the top of the <script> block in dashboard.html to the absolute URL — the PMTiles JS reader will stream byte ranges over HTTP.

docs/sync_methodology_assets.sh copies the latest figures from docs/graphs/ and docs/code/outputs/ into the methodology page so the public write-up stays in sync with the modeling pipeline.

Appendix A — Headline Metrics

Cross-Notebook Headline Metrics

Source	Method	Value
04a, random 30% test	Calibrated ensemble ROC-AUC	0.9395
04a, random 30% test	Calibrated ensemble PR-AUC	0.5461
04a, random 30% test	Calibrated ensemble Brier	0.0068
04b, 10-fold ZIP-grouped CV	RF mean AUC	0.8877 ± 0.0064
04b, LOGO sample of 15 ZIPs	RF mean AUC	0.8849
04b_vpi, by income quintile	Quintile AUC range	0.968 – 0.978
04b_vpi, by building category	Category AUC range	0.942 – 0.980
04c, vs City VPI on residential	Ensemble AUC	0.9786
04c, vs City VPI on residential	Ensemble PR-AUC	0.7538
04c, vs City VPI on residential	VPI binary AUC	0.7849
04c, at matched ~6.4K capacity	Ensemble precision / recall	54.7% / 78.3%
04c, at matched ~6.4K capacity	VPI precision / recall	53.3% / 57.6%
04d, evaluation half	Recalibrated ensemble AUC	0.9375
04e, citywide at 1% ward capacity	Model precision / recall	57.0% / 49.1%
04g, new-cohort temporal proxy	Ensemble AUC	0.9760
04g, old-cohort temporal proxy	Ensemble AUC	0.9579
04h, 5-fold tract-grouped CV	RF mean AUC	0.9677 ± 0.0083

Most Defensible Number

The most defensible number to cite for the model's generalization performance is the block-CV AUC of around 0.9677 from Step 15. It controls for the leakage that random splits allow through neighborhood-level features and uses the right spatial granularity for how city operations actually play out.

Appendix B — Design Decisions

Key Design Decisions

Decision	Choice	Reason
OVS definition	Three-source composite (C&S + violations + licenses)	No single source captures all vacancies. Composite reduces false negatives.
Temporal anchor	TRAIN_CUTOFF = 2025-10-01	Hard cutoff prevents future data from influencing features or label.
C&S window	2-year window plus demolition filter	Stale C&S records misrepresent current status. Demolished buildings are not vacant buildings.
Residential scope	Category codes 1, 2, 3, 14 only	Buildings and land require different feature vocabularies. Commercial properties introduce systematic false positives.
Commercial / parking filter	Pattern match on parcel building description	Catches parking garages and surface lots that survive the category code filter.
Life-history features	Trend & acceleration over 3yr / 5yr / 2yr windows	Single counts miss whether a property is getting worse, stable, or improving.
RTT integration	Transfer frequency, sheriff sales, log price change	Full deed history adds distress signals not in OPA's single most-recent sale.
Train / test split	Stratified random 70/30	Temporal split caused distributional shift because recency is itself a vacancy proxy.
Spatial validation	10-fold ZIP CV + LOGO + 5-fold tract block CV	Tests generalization independent of spatial autocorrelation. LOGO simulates new geography. Tract block CV is the honest stakeholder-reportable AUC.
No assessed value features	`log_market_value`, `value_per_sqft` excluded	OPA assessed values have documented racial and geographic bias.
Production score	50/50 Logit + RF, isotonic at the end	Logit and RF capture different signal shape. Equal weighting beat tuned weighting in CV. Isotonic restores honest empirical positive rates.
No fixed binary threshold	Top 1% rank flag plus 5-tier rank bucket	Calibrated probabilities are all small because vacancy is rare. Use rank or capacity.
ZCTA boundaries for static maps	`tigris` package	`PWD_PARCELS.geojson` is over 400 MB. ZCTA is practical for rendering and presentations. Parcel-resolution lives in PMTiles.
Per-ward capacity flagging	1% of each ward's residential parcels	Single citywide threshold concentrates too many flags in a few neighborhoods. Per-ward keeps inspection load sensible everywhere.
04b mirrors 04a exactly	Identical model_vars, recipe, hyperparameters	Spatial CV and LOGO CV must evaluate the same model that is deployed. Divergence would produce misleading validation metrics.
No poverty rate in model	ACS variables removed from `model_vars`	Used only in the equity audits as an analysis dimension, not a feature. Avoids an external Census API dependency at training time.

Appendix C — Output Catalog

Output File Catalog

Production model artefacts (in `data_py/`)

File	Contents
model_logit_final.rds	Fitted Logistic Regression pipeline.
model_rf_final.rds	Fitted Random Forest pipeline.
model_xgb_final.rds	Fitted XGBoost pipeline. Diagnostic comparator.
model_lgb_final.rds	Fitted LightGBM pipeline. Diagnostic comparator.
calibrators.rds	Original isotonic calibrators for each model and the ensemble.
calibrators_v2.rds	Held-out recalibrated ensemble calibrator. Recommended for production.
model_thresholds.csv	Per-model Youden thresholds plus AUC, sensitivity, specificity, and `is_best`.

Parcel-level prediction frames

File	Contents
all_predictions_rf.csv	One row per parcel with all probabilities, calibrated and raw, plus flag, risk_score, qtile_tier.
predictions_calibrated.csv	Adds `ensemble_prob_v2` from the held-out calibrator.
predictions_04b.csv	Predictions joined to ZIP, tract, ward, and category. The right starting point for analysis.
predictions_with_ci.csv	Test-set predictions with RF tree-variance CI bounds.
parcel_shap_topflags.csv	Top 200 flagged parcels with five SHAP-attributed features each.
vs_city_vpi_parcel.csv	Per-parcel ensemble probability vs City VPI flag.

Aggregated rollups

File	Contents
output_zip_summary.csv	Per-ZIP mean predicted, observed rate, count.
output_category_summary.csv	Per-category mean predicted, observed rate, count.
ward_summary_04b.csv	Per-ward rollup including high-risk count.
zip_auc_04b.csv	Per-ZIP AUC for ZIPs meeting the minimum-N filter.
equity_income_auc.csv	Per-income-quintile AUC, observed rate, mean predicted.
equity_zip_audit.csv	Per-ZIP flag rate vs observed vacancy rate.

Validation & diagnostics

File	Contents
spatial_cv_metrics.csv	Per-fold ZIP-grouped CV AUC and J-Index.
logo_cv_metrics.csv	Per-ZIP LOGO AUC for sampled ZIPs.
validation_summary.csv	Headline cross-validation metrics.
block_cv_rf.csv	Per-fold tract-grouped CV AUC.
temporal_validation.csv	Old and new cohort metrics.
capacity_threshold_curve.csv	Threshold → flagged-count → precision / recall lookup.
operational_flags_by_ward.csv	Per-ward flag counts under the 1% capacity policy.
operational_precision_at_capacity.csv	Citywide model vs VPI vs union summary.

Geospatial & interactive outputs

File	Contents
vacancy_risk_map.html	Interactive Leaflet ZCTA-level map.
vacancy_predictions.geojson	~436K parcel polygons with all predictions.
vacancy_predictions_flagged.geojson	~4,500 flagged parcels with all predictions.
tiles/vacancy_flagged.pmtiles	~2.2 MB. Top 1%.
tiles/vacancy_predictions.pmtiles	~46 MB. Full citywide.
vs_city_vpi_map.html	Interactive bucket overlay.
vs_city_vpi_high_prob_ovs0_map.html	Candidate-review map for high-probability OVS = 0 parcels.

Appendix D — Data Sources

Data Sources

All data from OpenDataPhilly unless noted.

File	Source	Purpose
opa_properties_public.csv	OPA	Parcel registry, property characteristics
violations.csv	L&I	Code violation history
business_licenses.csv	L&I	License history including vacant property licenses
clean_seal.csv	L&I	City-initiated boarding and securing actions
unsafe.csv	L&I	Unsafe structure orders
imm_dang.csv	L&I	Imminently dangerous structure orders
permits.csv	L&I	Building and zoning permits including demolition and new construction
RTT_SUMMARY.csv	Philadelphia Revenue	Full deed and transfer history
VIOLATION_DEFINITION.csv	L&I	Violation code title lookup
vpi_bldg.geojson	City of Philadelphia	Vacant Property Indicator binary flag
ACS 2022 5-year B19013_001E	US Census Bureau	Census-tract median household income for the equity audit