MUSA 801 · Smart Cities Practicum · University of Pennsylvania

PhillyStat360
Vacancy Risk Model

A spatial machine learning pipeline for predicting residential vacancy across Philadelphia parcels

ClientCity of Philadelphia OIT
DataOPA · L&I · RTT
ModelCalibrated Ensemble · AUC 0.94
Scope~520,000 residential parcels

An Automated, Cache-Aware Modeling Pipeline

The pipeline turns administrative data from OPA, L&I, and the Philadelphia Revenue Department into a parcel-level predicted probability of vacancy for every residential property in the city — plus a calibrated 0–100 risk score, an ensemble flag for the top one percent of parcels, and a five-tier rank bucket for dashboard display.

The whole pipeline is R Markdown end to end and lives in code/r_code/. Modeling, validation, calibration, explainability, and tiling all run in R using tidymodels, ranger, xgboost, lightgbm (via bonsai), probably for calibration, and treeshap for local explanations. The data_py/ output folder name is historical — it is just where every step reads from and writes to.

Manual once — establish labels & reference
00_data_inventory reference + OVS definition
00b_new_data_check quality check for new sources
02_ovs_exploration understand the dependent variable
03_1_Ovs ovs_residential.csv
03_2_Analysis tier mapping, EDA, baseline model (AUC 0.798)
03_3_Features features_residential.csv
R Markdown modeling & validation in code/r_code/
04a_tidymodeling Logit · RF · XGB · LGB · 50/50 calibrated ensemble
04b_model_validation spatial CV by ZIP, LOGO CV, RF tree-variance CIs
04b_vpi_comparison per-ZIP, per-building-type, per-ward, equity by income
04c_vs_city_vpi head-to-head against the City VPI
04d_recalibration held-out isotonic refit
04e_operational_thresholds per-ward capacity flagging
04f_local_explanations TreeSHAP on top-flagged parcels
04g_temporal_validation evaluation by violation-recency cohort
04h_block_cv_rf group_vfold_cv by census tract
05_output_analysis probability summaries, ZCTA choropleth, capacity lookup
06_tiling PMTiles vector tilesets for web consumption
Caching

Each Rmd caches expensive computations under data_py/cache/ keyed by a SHA-1 of the feature list. Re-runs are fast as long as the feature set has not changed; deleting the cache forces a clean refit.

Repository Layout

The repository separates the R Markdown pipeline from the rendered reports and the static web surface. Knit outputs and headline figures live under docs/ so the GitHub Pages site picks them up directly.

code/r_code/ Full R Markdown pipeline. Steps 00 through 06.
docs/code/outputs/ Rendered HTML reports plus a handful of standalone PNGs from early-stage EDA.
docs/code/03_*_files/ Auto-generated figure folders from knitting the Rmd files.
docs/graphs/ Headline figures used across documentation.
docs/graphs/python/ Modeling and validation figures. Folder name is historical — every file is now produced by the R pipeline.
docs/ Static dashboard, landing page, PMTiles, ward GeoJSON, and the optional local Flask backend (tileserver.py + load_db.py). Served as the GitHub Pages root.

Running the Pipeline

The early steps (00 to 03_3) load raw OpenDataPhilly extracts, validate quality, define the outcome variable, and engineer the feature matrix. They are run once whenever raw inputs change. They produce two flat tables that downstream steps consume:

Each Rmd reads its inputs from data/ or data_py/, writes outputs back to data_py/, and knits a self-contained HTML rendering next to itself. The files have a strict left-to-right dependency. 04a fits the four base learners and the calibrated ensemble. 04b through 04h consume those fitted artefacts for validation. 05 produces the stakeholder-facing summaries, and 06 turns the parcel GeoJSON into PMTiles for the website.

How to Re-Run

To re-run the pipeline end to end, knit the Rmd files in alphabetical order. Each step caches its expensive computations under data_py/cache/ keyed by a SHA-1 of the feature list, so re-runs are fast as long as the feature set has not changed. Deleting data_py/cache/ forces a clean refit.

Data Inventory & Reference

code/r_code/00_data_inventory.Rmd is the reference layer for the entire project. It establishes the vocabulary, the data quality picture, and the definition of the outcome variable before any analysis begins.

1.1 Load and profile all eight raw datasets

All source files are loaded from rawdata/ with row counts, column names, and a glimpse(). Profiling upfront prevents silent data errors downstream — knowing a field has 40% missing values, or that date ranges only go back to 2010, shapes every subsequent design decision.

DatasetSourceContents
opa_properties_public.csvOPA~583K parcel records. Property characteristics, zoning, assessed values, sale history.
violations.csvL&IFull code violation history with code, date, and status (open, complied, closed).
business_licenses.csvL&ILicense history per parcel with revenue code, status, and issue dates.
clean_seal.csvL&ICity-initiated board-up and securing actions with case date and work order status.
unsafe.csvL&IUnsafe structure orders.
imm_dang.csvL&IImminently dangerous structure orders.
assessments.csvOPAHistorical assessment records.
VIOLATION_DEFINITION.csvL&ILookup from violation code to title and definition text.

1.2 Field-level data dictionaries

For each dataset a kable table is generated with field name, data type, example values, and a plain-English description. Missing-value rates are computed for every column. The dictionary is the team's living reference — rather than re-reading raw files, the dictionary documents what each field means in context.

1.3 Define the OVS (Observed Vacancy Status) outcome

The OVS definition is formalized here and carried unchanged through the entire pipeline. A parcel has OVS = 1 if it meets any of the following.

Source 01

Clean & Seal

Appears in L&I Clean & Seal records within a 2-year window of the training cutoff, with no subsequent demolition or new-construction permit.

Source 02

Open Vacant Violation

Has an OPEN violation with one of eleven specific vacancy codes: 9-3904, 9-3905, PM15-901.1, PM15-901.2, PM15-301, CP-102, CP-103, PM-102.4/1..4.

Source 03

Active Vacant License

Holds an active business license with revenue code 3219 (Residential Vacant) or 3634 (Commercial Vacant) — owner self-declaration, often for reduced tax treatment.

Three independent administrative systems triangulate vacancy more reliably than any single source. Clean & Seal captures severe cases the city has physically intervened on; the violation codes capture inspector-flagged properties; vacant licenses capture owners who self-declare.

1.4 Catalog vacancy-related violation codes

All 54 violation codes whose title or definition text contains the word "vacant" are enumerated. These are separated from the eleven OVS-defining codes. The remaining 43 are vacancy-adjacent signals used in feature engineering without being part of the label itself.

1.5 Computed OVS statistics

Key Finding

Clean & Seal dominates the label. The model will largely learn to predict which properties the city has already physically boarded up, plus catch earlier-stage signals for properties that have not yet reached that point.

OVS class balance
Figure 1.1

OVS class balance. Counts of OVS = 1 vs OVS = 0 across all parcels.

OVS source combinations
Figure 1.2

OVS source combinations. How parcels distribute across the three label sources and their overlaps.

New Data Quality Check

code/r_code/00b_new_data_check.Rmd validates two newer sources before they enter the feature pipeline.

2.1 Real Estate Transfer records (RTT_SUMMARY)

The RTT dataset is loaded and profiled. Key checks: join key detection (confirmed as opa_account_num), match rate against OPA, date range on display_date, and document type distribution (DEED, DEED SHERIFF, MORTGAGE, AGREEMENT OF SALE).

RTT contains the full deed and transfer history, richer than the single most-recent sale_date already in OPA. Rapid re-sales and sheriff (foreclosure) sales are distress signals that predict vacancy independently of physical condition.

2.2 Building and zoning permits

Permits are loaded similarly. The join key is confirmed as opa_account_num; permittype, typeofwork, and status distributions are checked.

Permits play two roles. Demolition permits after a Clean & Seal event remove the parcel from the active vacancy list. New construction permits signal that a formerly vacant lot has been redeveloped. Both scenarios would make the OVS label stale if not accounted for.

OVS Exploratory Analysis

code/r_code/02_ovs_exploration_JF.Rmd profiles the dependent variable across the full parcel universe before applying the residential filter.

3.1 OVS source overlap

A bar chart shows how many OVS = 1 parcels are flagged by only one source versus combinations of two or three. Clean & Seal alone accounts for the bulk of the label, with the other two sources contributing smaller exclusive shares and a long tail of overlap. Parcels flagged by multiple sources carry high label confidence; parcels flagged by only one source may have higher label noise.

3.2 OVS = 1 versus OVS = 0 parcel characteristics

Box plots and summary statistics compare vacant against occupied parcels.

3.3 Spatial summary by ZIP code

Vacancy rates per ZIP range from ~0.3% in newer or wealthier areas to ~17% in areas with concentrated disinvestment. Vacancy is not randomly distributed — it clusters geographically. This finding directly motivated the spatial cross-validation work in Step 8.

3.4 Three case-study parcel drill-downs

Three specific parcels are selected and their full administrative history is reconstructed.

Exterior condition by OVS
Figure 3.1

Exterior condition by OVS. Vacant parcels skew heavily toward conditions 5–7.

Year built by OVS
Figure 3.2

Year built by OVS. Vacant parcels tend to be older.

Market value by OVS
Figure 3.3

Market value by OVS. Vacant median ~$145K vs occupied ~$222K.

OPA category by OVS
Figure 3.4

OPA category by OVS. Distribution of OVS by OPA building category.

ZIP-level OVS rates
Figure 3.5

ZIP-level OVS rates. Vacancy rates per ZIP range from ~0.3% to ~17%.

Vacant parcels map
Figure 3.6

Vacant parcels map. Geographic clustering of OVS = 1 parcels.

OVS Construction (Residential Only)

code/r_code/03_1_Ovs.Rmd is the single source of truth for the model's outcome variable. Every downstream file reads ovs_residential.csv rather than re-constructing OVS independently.

4.1 Set the temporal anchor

TRAIN_CUTOFF is fixed at 2025-10-01 globally. Every feature and every OVS source must use data strictly before this date. This prevents any information from after the cutoff from leaking into the model.

4.2 Source 1, Clean & Seal with two Fichman filters

cs_parcels <- clean_seal %>%
  filter(casecreateddate >= (TRAIN_CUTOFF - years(2))) %>%  # 2-year window
  anti_join(demo_after_seal, by = "opa_account_num")        # demolition filter

The two-year window keeps the label current — a C&S record from five years ago does not necessarily mean the building is still vacant today. The demolition filter removes parcels where a completed demolition permit followed the C&S action; the building was torn down, so it is no longer a vacant building.

4.3 Source 2, open vacant violations

Only violations with violationstatus == "open" and a code in the eleven-code OVS list are included. Closed or complied violations mean the issue was resolved and the property should not be labeled as currently vacant.

4.4 Source 3, active vacant property license

Only licenses with licensestatus == "active" and revenue codes 3219 or 3634 are included. An expired or inactive vacant license means the owner no longer holds that designation.

4.5 Merge and compute OVS

All three source flags are left-joined to the OPA property table and combined with OVS = if_else(ovs_cs == 1 | ovs_viol == 1 | ovs_lic == 1, 1, 0). The individual source flags are retained so downstream analysis can decompose the label.

4.6 Filter to residential only

filter(category_code %in% c("1", "2", "3", "14")) removes commercial (4, 5), industrial (implicit), vacant land (6), and other non-residential categories.

CodeCategory
1Single Family
2Multi-Family, Duplex, Triplex
3Mixed Use (residential plus commercial)
14Large Apartment Buildings (4+ units)

4.7 Export

Geometry is dropped on export since downstream files work with flat tables, but the parcel-level building description is retained for the secondary parking and commercial filter applied during feature engineering.

Output

data/ovs_residential.csv — ~520K residential parcels (522,583 unique) with an OVS = 1 rate of about 1.1%.

OVS source overlap on residential parcels
Figure 4.1

OVS source overlap on residential parcels. Same overlap analysis as Step 3, restricted to residential parcels after the category-code filter.

Violation Tier Mapping, EDA & Baseline Model

code/r_code/03_2_Analysis.Rmd establishes the violation severity scheme and the performance floor any engineered model must beat.

5.1 The VacancyGuide violation tier scheme

108 L&I violation codes are manually mapped to three severity tiers based on the VacancyGuide2026 methodology developed by Tim Haynes.

Tier
Weight / Codes
Description
Level 1
+1 · 26 codes
Minor deterioration and general maintenance — broken windows, debris, peeling paint.
Level 2
+2 · 19 codes
Moderate structural damage — deteriorated masonry, structural members, roofing.
Level 3
+3 · 63 codes
Severe vacant or unsafe or collapse risk — structural failure, imminently dangerous conditions, known vacancy codes.
Caveat from Fichman's 2026-02-18 Review

The tier weights were established by trial and error by Tim Haynes without statistical validation. They are used as a Track B comparison benchmark, not as primary model inputs. The engineered model in Track A learns its own weights from data.

5.2 Validate tier codes against real violation data

The 108 tier codes are matched against the actual violations dataset to count OPEN violations per tier. This confirms the codes are actively used (not hypothetical) and that Level 3 codes dominate OPEN violations as expected.

5.3 Identify the 54 vacancy keyword codes

VIOLATION_DEFINITION.csv is scanned for any code whose title or definition contains the word "vacant" — yielding 54 codes, a superset of the eleven OVS-defining codes. Using the eleven OVS codes as features would create direct leakage; the broader 54-code set provides predictive signal without circularity.

5.4 EDA — exterior condition vs vacancy rate

Vacancy rates are computed for each exterior condition rating (1 = New through 7 = Collapsed) and plotted as a bar chart. Condition = 7 shows dramatically higher vacancy rates than condition = 1. A violin plot was initially used here, but Fichman pushed back: exterior condition is an ordered categorical variable, not continuous, and bar charts are more honest about the discrete scale.

5.5 New-construction false positive check

Properties built after 2010 and labeled OVS = 1 are isolated and their source breakdown compared to the overall OVS = 1 population. The analysis reveals whether these are genuine vacants (construction never occupied) or data artefacts such as a C&S record on a wrong parcel.

5.6 Violation history across time windows

Mean violation counts in four windows (all-time, five-year, three-year, six-month) are computed separately for OVS = 1 and OVS = 0 parcels. OVS = 1 parcels have consistently higher violation counts across every window. The gap narrows in the six-month window, suggesting that acute recent activity is a weaker signal than long-term chronic history. This validates the life-history features in Step 6.

5.7 Baseline logistic regression

A logistic regression is fit using only four OPA fields — exterior_condition, log_market_value, building_age, log_livable_area — with no violation, license, or transfer features.

SpecValue
Modelovs ~ exterior_condition + log_market_value + building_age + log_livable_area
Dataset276K parcels with non-missing values on all four fields
Baseline AUC0.798

Visible overlap in the predicted score density between OVS = 0 and OVS = 1 indicates room for improvement. Outputs: data/building_tier_mapping.csv, data/vacancy_keyword_codes.csv.

Vacancy rate by exterior condition
Figure 5.1

Vacancy rate by exterior condition. Bar chart over the 1–7 ordered categorical scale.

Violation history faceted by OVS
Figure 5.2

Violation history faceted by OVS. Mean violation counts in 4 windows, OVS = 1 vs OVS = 0.

Baseline ROC curve
Figure 5.3

Baseline ROC curve. Logistic regression on four OPA fields. AUC = 0.798.

Baseline predicted probability distribution
Figure 5.4

Baseline predicted probability distribution. Density of predicted scores by OVS class — visible overlap motivates the engineered features.

Feature Engineering

code/r_code/03_3_Features.Rmd is the largest and most complex file in the pipeline. It transforms seven raw data sources into a single flat feature matrix. The guiding principle throughout is no temporal leakage. Every feature must be constructed from data strictly before TRAIN_CUTOFF.

6.1 Training-cutoff time windows

WindowDefinitionPurpose
RECENT_WINDOWLast 6 monthsAcute current activity
WINDOW_2YRLast 2 yearsRecent trend
WINDOW_3YRLast 3 yearsMid-term life history
WINDOW_5YRLast 5 yearsLong-term trajectory

6.2 Commercial and parking filter (secondary)

A second-pass filter removes parcels whose building description matches parking or commercial structure keywords. The category-code filter in Step 4 already excludes overtly commercial parcels, but some mixed-use parcels in category 3 are physically parking structures or commercial-only buildings. The fine-grained building-use description catches these edge cases, including the parking garage false positive identified in Case 2.

parking_pattern <- regex("parking|garage|surface lot|parking lot|commercial lot",
                         ignore_case = TRUE)
ovs_df <- ovs_df %>%
  filter(is.na(bldg_desc) | !str_detect(bldg_desc, parking_pattern))

6.3 Demolition and new-construction permit lookup

Permit records are filtered to completed demolition and new-construction permits before TRAIN_CUTOFF. For each parcel the most recent demolition date and most recent new-construction date are recorded.

6.4 Violation features

The OVS-defining eleven violation codes are explicitly excluded from all violation features via filter(!is_ovs_code) — using them would create direct leakage since they are components of the label.

FeatureDescription
n_violations_totalAll-time violation count
n_violations_recentViolations in last 6 months
n_violations_2yrViolations in last 2 years
n_violations_3yrViolations in last 3 years
n_violations_5yrViolations in last 5 years
n_violations_openCount of currently open violations. Excluded from final model as an OVS proxy
n_distinct_codesNumber of unique violation codes ever cited
viol_trend_3v5Count in 3yr window minus count in the 3 to 5yr window. Positive means worsening
viol_accel_2v3Count in 2yr window minus count in the 2 to 3yr window. Positive means accelerating
n_repeat_codesCount of codes cited two or more times. Measures repeat non-compliance
resolution_rateFraction of violations resolved (complied or closed). Low means persistent non-compliance
has_maintenance_codeEver had a Level 1 or 2 maintenance or moderate violation
has_structural_codeEver had a Level 3 severe or structural violation
has_fire_safety_codeEver had a fire, egress, or hazard violation
has_open_structuralHas a currently open Level 3 violation
days_since_last_violDays from most recent violation to TRAIN_CUTOFF
days_oldest_open_violAge of the oldest currently open violation in days

The life-history features fulfill a Fichman requirement. A property with five violations all in the last six months is different from one with five violations spread over ten years. The trend and acceleration features (viol_trend_3v5, viol_accel_2v3) capture whether a property is getting worse, improving, or stable over time, which a single count cannot express.

6.5 Clean & Seal features

C&S features focus on history and trajectory rather than current active status (which is the OVS label component).

FeatureDescription
n_cs_totalTotal C&S events ever recorded
days_since_last_csDays from most recent C&S to TRAIN_CUTOFF
cs_active_2yrC&S event in last 2 years (excluded — near-current)
cs_has_closedAt least one work order completed
cs_span_daysDays between first and most recent C&S
demo_after_csDemolition permit completed after most recent C&S
newcon_after_csNew construction completed after most recent C&S
cs_truly_activeC&S in 2yr AND no demo/newcon after (excluded — this is the OVS rule)

6.6 License features

License features capture the lifecycle trajectory of how a property's licensed use has changed over time.

FeatureDescription
had_vacancy_licenseEver had a vacancy license regardless of current status
n_vacancy_licensesTotal count of vacancy licenses over time
ever_had_vacant_licBinary version of above
has_rental_license, has_active_rentalCurrent active rental license. Excluded from final model as direct complement of vacancy status
ever_had_rentalEver had a rental license
had_rental_then_vacantHad a rental license, then later got a vacancy license. Lifecycle transition signal
license_lapse_rateFraction of all licenses that are no longer active
days_since_last_licRecency of most recent license activity

The had_rental_then_vacant feature is unusually informative. A property that went from having a rental license to a vacancy license has a documented owner-reported history of occupancy followed by abandonment. That transition is a strong qualitative signal of the kind of properties the model aims to predict.

6.7 Unsafe & Imminently Dangerous features

These orders sit outside the regular violation system. Features are simple — a binary "has ever" flag, a count, and days since the most recent order. These were not part of the OVS definition, so there is no leakage risk.

6.8 Real Estate Transfer (RTT) features

RTT records are filtered to deed and sheriff sale documents (excluding mortgages, liens, and other encumbrances) before TRAIN_CUTOFF. The join key is detected automatically by scanning for opa_account_num or parcel_number in the column names.

FeatureDescription
n_transfers_totalTotal deed or transfer count, all time
n_transfers_5yrTransfers in last 5 years
n_transfers_3yrTransfers in last 3 years
n_deed_transfersArms-length deed transfers with recorded prices
n_sheriff_salesTotal sheriff foreclosure sale count
had_sheriff_saleEver had a sheriff sale
sheriff_sale_recentSheriff sale within last 5 years
log_price_changelog(last_deed_price) - log(prior_deed_price). Negative means price declined
days_since_last_transferDays from most recent any-type transfer to TRAIN_CUTOFF

OPA's sale_price and sale_date capture only the most recent transaction. RTT adds the full ownership history. Frequent re-sales can indicate a distressed property being flipped without rehabilitation. A sheriff sale in the recent past is a direct foreclosure indicator. A log price decline between the two most recent deed sales suggests the market is pricing in deterioration risk.

last_transfer_price and prior_transfer_price are left as NA when a parcel has fewer than two deed transfers with recorded prices. The Random Forest recipe handles these via median imputation. log_price_change is also left as NA in these cases.

6.9 OPA-derived features

Derived from the OPA fields already present in ovs_residential.csv.

FeatureDescription
exterior_conditionAssessor-rated condition on a 1 to 7 scale
building_age2024 minus year_built. Capped at 200 years, negative values set to NA
log_market_valuelog1p(market_value). Excluded from the final model due to assessed value bias
log_livable_arealog1p(total_livable_area)
log_sale_pricelog1p(sale_price)
days_since_saleDays from most recent OPA sale to TRAIN_CUTOFF
years_since_saledays_since_sale / 365.25
is_poor_conditionBinary. exterior_condition >= 5
value_per_sqftmarket_value / total_livable_area. NA when area is zero

6.10 Assemble the feature matrix

All feature groups are left-joined to the OPA base table on parcel_number. Left joins ensure every parcel appears even if it has no records in a particular source. A parcel with no violation history is kept with zeros, not dropped.

NA fill logic.

6.11 Missing-value audit

A summary table shows every feature with its missing-value count and percentage, grouped into flags (Complete, less than 5 percent, 5 to 20 percent, more than 20 percent). Most engineered features are complete. The highest-missing features are days_oldest_open_viol (only populated for parcels with open violations) and price-related fields with fewer than two transactions.

6.12 Univariate correlation screening

Pearson correlations between each numeric feature and ovs are computed and the top 20 by absolute value are plotted. This acts as a quick sanity check — the top features should be conceptually sensible vacancy indicators (violation counts, C&S history, license flags).

Output

data/features_residential.csv — ~520K parcels (522,481 unique) by 80+ columns.

Top 20 univariate correlations with OVS
Figure 6.1

Top 20 univariate correlations with OVS. Quick sanity check that the strongest predictors are conceptually sensible.

Tidymodeling & the Vacancy Risk Score Ensemble

code/r_code/04a_tidymodeling.Rmd is the heart of the production pipeline. It fits four base learners on the engineered feature matrix, blends two of them into a calibrated ensemble (the Vacancy Risk Score), and exports the artefacts every downstream step depends on.

7.1 Load features and define the model variable set

features_residential.csv is loaded directly. model_vars is an explicit character vector and is the single source of truth for the feature set used everywhere downstream. The exclusions listed below capture the leakage and bias decisions made during feature engineering.

Excluded FeatureReason
n_violations_openDirect count of open violations. The OVS violation rule is "any open violation". Near-exact proxy
has_open_violationOne half of the OVS violation rule
has_vacancy_kw_codeThe other half. Violation code matches the keyword list
has_open_vacancy_kwIntersection of both. Directly reconstructs the OVS violation component
cs_truly_activeCurrent active C&S status is the OVS Clean & Seal rule
cs_active_2yrToo close to current status. Near-proxy for cs_truly_active
has_rental_license, has_active_rentalCurrent active rental is the functional complement of vacancy license status
log_market_value, value_per_sqftOPA assessed values carry well-documented racial and geographic bias (Fichman guidance)
last_transfer_price, prior_transfer_priceRedundant with log_sale_price. log_price_change captures the useful signal instead

Cache files in data_py/cache/04a/ are fingerprinted with a SHA-1 of model_vars. Any change to the feature list invalidates stale fits silently.

7.2 Stratified 70/30 train/test split

split    <- initial_split(df, prop = 0.70, strata = ovs)
train_df <- training(split)
test_df  <- testing(split)

A temporal split was evaluated and discarded earlier in the pipeline. Activity recency is itself a vacancy proxy via fields like days_since_last_viol and cs_active_2yr. A temporal split therefore causes distributional shift, with train OVS-equal-one rate around 0.2 percent and test rate around 6.9 percent. AUC estimates become unreliable. Stratified random split ensures both train (around 364K rows) and test (around 156K rows) have approximately the production-prevalence OVS-equal-one rate of around 1.1 percent. Temporal generalization is tested separately in Step 14.

7.3 Preprocessing recipe

Every base learner is wrapped in a recipes recipe with the same three steps in the same order.

  1. step_impute_median() — imputes NAs primarily in log_price_change, days_oldest_open_viol, and log_sale_price.
  2. step_zv() — removes zero-variance columns that survived feature engineering.
  3. themis::step_rose() — synthetic minority oversampling applied only to the training fold of each cross-validation split. Wrapping it in the recipe ensures the resampling step never leaks into the test fold.

OVS prevalence sits at around 1.1 percent. Without correction, models tend to drive themselves toward the majority class. ROSE rebalances the training data before the model is fit. Class-weight tuning (described below) provides a complementary lever inside the model itself.

7.4 Four base learners

Each learner is fit with hyperparameters chosen up-front rather than searched on every run. Tuning results from earlier runs are kept in data_py/rf_tune_results.csv, data_py/xgb_tune_results.csv, and data_py/lgb_tune_results.csv.

ModelConfiguration
Logistic Regressionglm with L2 regularization via glmnet, class weights inversely proportional to prevalence
Random Forestranger with num.trees = 500, mtry = 7, min.node.size = 5, class weights set to balance
XGBoostPre-tuned hyperparameters loaded from xgb_tune_results.csv, scale_pos_weight tuned to prevalence
LightGBMPre-tuned hyperparameters loaded from lgb_tune_results.csv, class weights set to balance via bonsai

The Random Forest mtry value of seven corresponds approximately to the floor of the square root of the feature count, matching the canonical default for RF on this size of feature set.

7.5 The Vacancy Risk Score ensemble

The production score is a 50/50 average of the calibrated Logit and Random Forest probabilities. XGBoost and LightGBM are kept as diagnostic comparators only. They are reported in the model thresholds table and ablation results, but do not feed the production ensemble.

ensemble_prob_raw <- 0.5 * logit_prob_raw + 0.5 * rf_prob_raw
ensemble_prob     <- predict(isotonic_ensemble, ensemble_prob_raw)

Isotonic regression is fit on the test set predictions of the raw ensemble.

Vacancy is rare. Raw probabilities cluster well below 0.5 even for parcels the model is confident about. Isotonic calibration maps the raw score back to honest empirical positive rates — even the very top one percent of parcels is only around 59% truly vacant, so the highest calibrated probability is around 0.6, not 1.0.

Important Implication

Calibrated probabilities should never be compared to a fixed threshold like 0.5 — almost nothing will exceed it. Use ensemble_flag (top one percent by raw rank) or qtile_tier (a five-bucket rank) for inspection triage.

7.6 Headline test-set metrics

MetricValue
Test ROC-AUC0.9395
Test PR-AUC0.5461 — beats both parents
Test Brier0.0068
Mean ensemble P(vacant) across full population~1.28%

Variable importance from the Random Forest is dominated by C&S history, vacancy license history, violation trajectories, and recency signals. RTT features (had_sheriff_sale, log_price_change) appear in the top half when they carry independent signal.

7.7 Outputs from 04a

FileContents
data_py/model_logit_final.rdsFitted Logistic Regression pipeline
data_py/model_rf_final.rdsFitted Random Forest pipeline (around 178 MB)
data_py/model_xgb_final.rdsFitted XGBoost pipeline
data_py/model_lgb_final.rdsFitted LightGBM pipeline
data_py/calibrators.rdsPer-model isotonic calibrators plus the ensemble calibrator
data_py/all_predictions_rf.csvOne row per parcel (around 520K) with all model probabilities, calibrated and raw, ensemble flag, risk_score, qtile_tier
data_py/model_thresholds.csvPer-model Youden thresholds plus AUC, sens, spec, plus an is_best boolean
data_py/rf_tune_results.csvCached RF tuning grid
data_py/xgb_tune_results.csvCached XGBoost tuning grid
data_py/lgb_tune_results.csvCached LightGBM tuning grid

The all_predictions_rf.csv columns are worth memorizing.

ROSE versus no subsampling
Figure 7.1

ROSE versus no subsampling. Class-balance lever before model fitting.

Train versus test density
Figure 7.2

Train vs test density. Predicted-score density on each split — gap indicates overfit.

Calibration curve for the ensemble
Figure 7.3

Calibration curve for the ensemble. Reliability after isotonic calibration.

Calibrated versus raw probability distributions
Figure 7.4

Calibrated vs raw distributions. Why the highest calibrated probability sits ~0.6, not 1.0.

ROC curves for all four base models
Figure 7.5

ROC curves, all four base models. Logit, RF, XGBoost, LightGBM, plus the ensemble.

Random Forest variable importance, top 20
Figure 7.6

Random Forest top-20 variable importance. C&S history and violation trajectories dominate.

Threshold sensitivity precision and recall
Figure 7.7

Threshold sensitivity. Precision and recall as the operating threshold sweeps.

Five-tier probability distribution
Figure 7.8

Five-tier probability distribution. Population shares per qtile_tier bucket.

Spatial Validation & Prediction Confidence Intervals

code/r_code/04b_model_validation.Rmd tests whether the production ensemble generalizes to data it was not trained on. The motivating concern is spatial autocorrelation — parcels in the same neighborhood share many features, including features that are themselves built from neighborhood activity. A standard random split overstates AUC because the test set contains parcels whose neighbors taught the model how to score them.

8.1 Spatial cross-validation, 10-fold by ZIP group

folds <- group_vfold_cv(df, group = zip_code, v = 10)
fits  <- map(folds$splits, ~ fit(workflow, data = analysis(.x)))

group_vfold_cv ensures that all parcels in a given ZIP code stay together — either all in training or all in test, never split across both. With ten folds, each fold holds out approximately ten percent of Philadelphia ZIPs. A parcel in 19139 cannot show up in both train fold 1 and test fold 2, so the model cannot lean on memorized neighborhood patterns.

A per-fold line chart shows AUC and J-Index across the ten folds. High variance across folds would suggest the model struggles with some geographic areas.

Result

Spatial CV mean AUC is 0.8877 ± 0.0064 across the ten folds — the lower-bound, conservative AUC for the RF component without ensemble calibration.

8.2 Leave-One-ZIP-Out (LOGO) cross-validation

LOGO takes the spatial CV to its extreme. Each of Philadelphia's residential ZIP codes is held out exactly once across separate model fits. To keep total compute bounded, the step samples 15 of the roughly 45 residential ZIPs uniformly at random and fits a fresh RF for each.

Per-ZIP AUC results are plotted as a ranked bar chart. ZIPs with AUC less than 0.70 would be highlighted in red as candidates for manual review. In the most recent run none of the sampled ZIPs fall below that line, with median AUC of 0.95 and above. The mean LOGO AUC is 0.8849, very close to the 10-fold spatial CV result.

LOGO matters for deployment. If the city ever applies the model to newly annexed areas, or if future data includes ZIPs not well represented in the training window, LOGO CV predicts how well the model will perform in those situations.

8.3 Prediction confidence intervals from the RF tree variance

The Random Forest is leveraged for free uncertainty quantification by computing per-tree probabilities and taking their dispersion across the 500 estimators.

tree_probs <- predict(rf_fit, test_df, predict.all = TRUE)$predictions[, 2, ]
rf_prob    <- rowMeans(tree_probs)
rf_se      <- apply(tree_probs, 1, sd) / sqrt(rf_fit$num.trees)
ci_lower   <- pmax(0, rf_prob - 1.96 * rf_se)
ci_upper   <- pmin(1, rf_prob + 1.96 * rf_se)
ci_width   <- ci_upper - ci_lower

A parcel with a 0.60 RF probability and CI width 0.05 is a confident prediction — the trees agree. The same 0.60 probability with CI width 0.35 is ambiguous — different subsets of trees give wildly different predictions.

8.4 Sanity checks

Four checks confirm the model is learning from the right signals.

Outputs

data_py/spatial_cv_metrics.csv per-fold metrics, data_py/logo_cv_metrics.csv per-ZIP metrics, data_py/predictions_with_ci.csv test set with CI bounds, and data_py/validation_summary.csv headline metrics table.

Spatial CV per-fold AUC and J-Index
Figure 8.1

Spatial CV per-fold AUC and J-Index. Variance across the 10 ZIP-grouped folds.

LOGO AUC by held-out ZIP
Figure 8.2

LOGO AUC by held-out ZIP. 15 sampled ZIPs, ranked.

Prediction confidence interval analysis
Figure 8.3

Prediction CI analysis. RF tree-variance CI width vs predicted probability.

04b feature importance reproduction
Figure 8.4

04b feature importance. Sanity-check reproduction of 04a's RF VIP under spatial CV.

Subgroup Generalization & the Equity Audit

code/r_code/04b_vpi_comparison.Rmd breaks the citywide AUC apart along three operationally relevant axes (ZIP, building category, ward) and runs an equity audit by census-tract median household income.

9.1 Frame and tier classification

Predictions are joined to features (for ZIP and tract) and OPA (for ward and category). A five-tier classification using equal-width 0.2 probability buckets — Very Unlikely, Unlikely, Maybe, Likely, Very Likely — is applied to every parcel as prob_tier. The full population skews heavily into the lowest tier (~99.88% Very Unlikely). The Very Likely tier (> 0.8) holds about 0.47% — the operational high-risk pool.

9.2 AUC by ZIP code

For each ZIP with at least 50 parcels and at least 5 observed vacants, AUC is computed by calling roc_auc_score on the subset. Forty-one ZIPs survive the minimum-N filter. Results are ranked and plotted as a horizontal bar chart with ZIPs colored red where AUC falls below 0.70.

9.3 AUC by building type

CategoryParcelsAUC
Single Family~461K0.980
Multi-Family / Duplex / Triplex~41K0.978
Mixed Use~14K0.945
Apartments 4+ Units~3.6K0.942

9.4 Five-tier distribution by category

Tier counts and observed OVS rates per tier are computed within each category. The Very Likely tier shows significantly higher observed vacancy rates than Very Unlikely inside every category — confirming the probability scores discriminate within building type and not just on average.

9.5 Ward-level summary

Philadelphia's 66 political wards are used as an aggregation unit. For each ward this step reports parcel count, observed vacancy count and rate, mean predicted probability, count in Very Likely and Likely tiers, and a combined "high risk" rate (probability of 0.6 or higher). Top wards by mean predicted probability:

WardMean predicted
110.0436
440.0419
280.0410
60.0384
160.0377

9.6 Equity audit by census-tract income quintile

ACS 2022 5-year median household income (B19013_001E) is joined at the census-tract level. Tracts are binned into income quintiles (Q1 = lowest, Q5 = highest); coverage of the join is ~99.5%.

QuintileObserved rateMean predictedAUC
Q1 (lowest income)2.60%2.93%0.97
Q2LowerLower0.97
Q3LowerLower0.97
Q4LowerLower0.98
Q5 (highest income)0.31%0.28%0.97
Headline Equity Finding

AUC is consistent across all income quintiles (0.968–0.978). The model identifies vacancy with the same ranking quality whether the parcel sits in a low-income or high-income tract. Mean predicted probability tracks observed rate closely in every quintile — the model is not systematically over- or under-predicting in any income band. It concentrates absolute flag counts in lower-income areas because vacancy is genuinely concentrated there.

Outputs. data_py/predictions_04b.csv parcel frame, data_py/zip_auc_04b.csv per-ZIP AUC, data_py/ward_summary_04b.csv per-ward rollup, data_py/equity_income_auc.csv equity audit, plus a static ZCTA choropleth and an interactive Leaflet map.

AUC by ZIP code
Figure 9.1

AUC by ZIP code. Per-ZIP discrimination, ranked. No ZIP falls below 0.70.

Equity flag rate vs observed vacancy by ZIP
Figure 9.2

Equity scatter. Per-ZIP flag rate vs observed vacancy rate — alignment indicates calibration.

Mean predicted probability by ward
Figure 9.3

Mean predicted probability by ward. Wards 11, 44, 28 lead.

AUC by census-tract income quintile
Figure 9.4

AUC by income quintile. 0.968–0.978 across Q1–Q5.

Head-to-Head Comparison Against the City VPI

code/r_code/04c_vs_city_vpi.Rmd compares the production ensemble to the City of Philadelphia's binary Vacant Property Indicator (VPI) head-to-head on the residential-only universe. rawdata/vpi_bldg.geojson contains 8,048 buildings flagged by the City. After filtering to residential parcels via the OPA join, 6,418 parcels (~79.7%) remain in scope. The remaining 20-plus percent are commercial or industrial properties not covered by the residential model.

10.1 Match-prevalence comparison

The ensemble probability is thresholded at the level that produces approximately the same number of positives as the City's VPI (~6,400 parcels).

ApproachFlaggedPrecisionRecall vs OVS
Ensemble (matched capacity)~6,40054.7%78.3%
City VPI~6,40053.3%57.6%
Union of both~9,80047.9%79.7%

The continuous-score view is cleaner. Ensemble AUC against OVS is 0.9786; VPI as a binary flag has AUC 0.7849. Ensemble average precision is 0.7538; VPI is 0.3116.

10.2 Four-bucket disagreement analysis

BucketParcelsObserved OVS rate
Both flag the parcel3,19983.3%
Only the model flags5,30037.4%
Only the VPI flags3,21123.4%
Neither flags508,4860.1%

The "only ours" bucket has substantially higher observed vacancy than the "only VPI" bucket — the model's exclusive flags are higher quality on average than the VPI's exclusive flags.

10.3 Per-ZIP precision scatter

Per-ZIP precision is plotted with City VPI on the x-axis and ensemble on the y-axis (one point per ZIP, ≥50 parcels). The model achieves higher precision than VPI in around 30 of the 41 ZIPs, sometimes by a wide margin.

10.4 High-probability OVS = 0 candidates

A separate analysis isolates the top 339 parcels with the highest ensemble probability that have OVS = 0. Twenty-seven of those also appear in the City VPI — independent corroboration that the parcel is genuinely vacant despite OVS missing it. The remaining 312 are candidate false positives or genuinely undetected vacants and are exported with addresses for inspector review.

Outputs

data_py/vs_city_vpi_headline.csv, data_py/vs_city_vpi_buckets.csv, data_py/vs_city_vpi_zip.csv, data_py/vs_city_vpi_parcel.csv, data_py/vs_city_vpi_map.html (interactive overlay of all three buckets), and data_py/vs_city_vpi_high_prob_ovs0_map.html (the candidate review map).

Ensemble vs VPI ROC and PR curves
Figure 10.1

Ensemble vs City VPI — ROC and PR curves. Ensemble AUC 0.9786 vs VPI 0.7849.

Four-bucket disagreement summary
Figure 10.2

Four-bucket disagreement. Both / model-only / VPI-only / neither, with observed OVS rates per bucket.

Disagreement map snapshot
Figure 10.3

Disagreement map. Spatial layout of model-only vs VPI-only flags.

Per-ZIP precision scatter, ensemble vs VPI
Figure 10.4

Per-ZIP precision, ensemble vs VPI. Above the diagonal = model wins. Ensemble wins ~30 of 41 ZIPs.

High-probability OVS = 0 candidate map
Figure 10.5

High-prob OVS = 0 candidates. Top 339 model-flagged parcels with OVS = 0 — candidates for inspector review.

Held-Out Recalibration

code/r_code/04d_recalibration.Rmd refits the calibrator on a held-out half of the test split and evaluates calibration on the other half. The original 04a calibrator was fit on the full test set — a deliberate shortcut for an initial release that allows a small amount of optimism into the calibrated probabilities.

11.1 Split, refit, measure

The 30% test set from 04a is split 50/50 stratified by OVS. The raw ensemble probabilities for the calibration half are used to fit a fresh IsotonicRegression; the new calibrator is applied to the full population producing ensemble_prob_v2. On the evaluation half (which neither the model nor either calibrator has seen):

Metric (evaluation half)OriginalRecalibrated
ROC-AUC0.93990.9375
Brier0.00670.0067
Mean predicted to observed ratio1.02×1.04×

Both reliability curves follow the diagonal, but the recalibrated version stays closer across all probability bins. Practical takeaway: the original calibrator was already nearly correct — ensemble_prob_v2 is the recommendation for production, but the substantive difference is small enough that consumers already on ensemble_prob do not need to migrate urgently.

Outputs

data_py/calibrators_v2.rds and data_py/predictions_calibrated.csv (full population with the new column).

Reliability curve, original versus recalibrated
Figure 11.1

Reliability curve, original vs recalibrated. Recalibrated curve hugs the 45° diagonal more tightly across all bins.

Operational Thresholds & Per-Ward Capacity

code/r_code/04e_operational_thresholds.Rmd translates model scores into actionable inspection capacity. Rather than choose one citywide threshold, it applies a per-ward capacity policy: inspect the top N parcels per ward, where N is some fraction of the ward's residential parcel count.

12.1 The capacity policy

CAPACITY_PCT is parameterized at one percent of each ward's residential parcels per inspection cycle. Within each ward, the top N parcels by ensemble probability are flagged. The same policy is then applied to the City's VPI for direct comparison, and to the union of the two.

12.2 Citywide results at 1% ward capacity

StrategyFlaggedPrecisionRecall
Model only5,232 (1.01%)57.0%49.1%
City VPI only6,410 (1.23%)53.3%57.6%
Union9,874 (1.90%)47.9%79.7%

The model achieves the highest precision per parcel inspected. The union strategy trades precision for recall and is the right choice when coverage is the priority.

12.3 Per-ward precision scatter

A per-ward scatter (City VPI precision on the x-axis, model precision on the y-axis) shows wide variance across wards. Some wards reach perfect or near-perfect precision under the model's flagging policy, while others land in the 0.3–0.5 range. It flags wards where precision falls below a configurable threshold for individual review.

Outputs

data_py/operational_flags_by_ward.csv per-ward summary, data_py/operational_precision_at_capacity.csv headline citywide table, and the per-ward precision scatter PNG.

Operational precision at one percent ward capacity
Figure 12.1

Operational precision at 1% ward capacity. Per-ward precision under the capped per-ward policy.

Capacity threshold curve
Figure 12.2

Capacity vs threshold curve. Number of parcels flagged for any chosen threshold — lookup for ops teams.

Local Explanations with TreeSHAP

code/r_code/04f_local_explanations.Rmd. A model that is operationally useful must be explainable at the parcel level. An inspector who is told "go visit this address" must be able to ask "why this one" and get a meaningful answer.

13.1 Explainer setup & per-parcel attributions

This step uses treeshap on the fitted Random Forest component (which dominates the feature attribution alongside the Logistic Regression in the ensemble). For each parcel the explainer returns one SHAP value per feature. The step ranks parcels by ensemble probability, takes the top 200, and for each parcel records the five features with the largest absolute SHAP value, the feature value, the SHAP magnitude, and a direction (pushed up or pushed down).

13.2 Global SHAP summary across the top 200

A useful pattern in the local explanations: some neighborhood-level features push downward at flagged parcels. A high count of vacant parcels in the same ZIP can lower the ensemble probability for a specific parcel — the model accounts for the neighborhood baseline elsewhere, so an individual parcel in a high-vacancy ZIP is comparatively less risky than a parcel that looks unusual within its ZIP.

Outputs

data_py/parcel_shap_topflags.csv (1,000 rows of parcel by top-five-feature explanations) and docs/graphs/python/shap_summary_topflags.png.

Top SHAP drivers for flagged parcels
Figure 13.1

Top SHAP drivers across the top 200 flagged parcels. Mean absolute SHAP per feature.

Temporal Validation

code/r_code/04g_temporal_validation.Rmd tests how the model performs on parcels with very recent administrative activity versus parcels whose activity is years old. The training split is stratified random rather than temporal — this is correct for measuring AUC (see Step 7.2), but it leaves open the question of how the model performs across activity windows. days_since_last_viol is used as the temporal proxy.

14.1 Cohorts

14.2 Cohort-level metrics

CohortObserved rateMean predictedAUCBrier
Old0.397%0.45%0.95790.0028
New8.90%9.63%0.97600.0226

The model's discrimination is actually higher on the new cohort, where the recent administrative trail is rich. The Brier score on the new cohort is larger in absolute terms because prevalence is roughly 20× higher, but the calibration ratio is good. There is no evidence of temporal decay between the older and newer activity windows.

The OVS rates differ by an order of magnitude between the two cohorts, which by itself is informative. Recent violation activity is itself a strong vacancy signal, even before the model touches it.

Outputs

data_py/temporal_validation.csv and docs/graphs/python/temporal_validation.png.

Temporal validation, old versus new cohort
Figure 14.1

Temporal validation, old vs new cohort. ROC curves and calibration on the two recency-defined cohorts.

Block Cross-Validation by Census Tract

code/r_code/04h_block_cv_rf.Rmd. Several engineered features are spatially smooth — neighborhood vacancy counts, ZIP-level aggregations, and the implicit information that a parcel near many flagged parcels is itself probably flagged. A standard random train/test split puts neighbors on opposite sides of the split, so the model effectively gets to see its training answer through its neighborhood features when scoring the test set. This inflates AUC.

This step re-fits the Random Forest under group_vfold_cv(df, group = census_tract, v = 5). Each fold holds out approximately 20% of Philadelphia's census tracts entirely. No tract appears in both train and test. Production RF hyperparameters are loaded directly from model_rf_final.rds; the same imputation and variance-threshold preprocessing is reused. ROSE oversampling is not applied here because the goal is honest discrimination, not optimized recall.

15.1 Per-fold AUC

FoldAUC
10.9677
20.9789
30.9644
40.9713
50.9564
The Headline Number

Block-CV mean AUC is 0.9677 ± 0.0083. The random-split test AUC from 04a was 0.9395 (ensemble) and the spatial-CV ZIP-blocked AUC from 04b was 0.8877 (RF only). The block-CV AUC sits between these, which is consistent with the underlying logic — ZIP groups are larger than census tracts, so ZIP blocking is a more aggressive test of generalization. Census-tract blocking is the right granularity for the operational story because the city does inspect and intervene at the tract level all the time. ~0.97 is the right number to put on a stakeholder slide as the honest, leakage-controlled AUC for the Random Forest component.

Outputs. data_py/block_cv_rf.csv and docs/graphs/python/block_cv_rf_aucs.png.

Block CV AUC by fold
Figure 15.1

Block CV AUC by fold. Five tract-grouped folds, mean AUC 0.9677 ± 0.0083.

Output Analysis & Stakeholder Summaries

code/r_code/05_output_analysis.Rmd turns production predictions into stakeholder-facing materials — distributions by category and ZIP, a calibration scatter, a choropleth map, an interactive Leaflet map, and a capacity-based threshold lookup table.

16.1 The "no-vote" framing

About 25% of residential parcels have any positive tree votes from the Random Forest. The remaining 75% receive exactly zero probability, meaning the ensemble unanimously predicts occupied. Three quarters of the residential housing stock is so clearly not vacant that no tree in 500 disagrees.

16.2 Category summary

Mixed Use carries the highest mean predicted probability at ~2.73%; Single Family the lowest at ~1.20%. Both mean and median are reported because the distribution is right-skewed — a few high-risk parcels pull the mean up, and the median reflects the typical parcel.

16.3–16.5 Within-category visualization

A density plot of rf_prob is drawn per category, filtered to rf_prob > 0. The zero-vote parcels are excluded because they create a spike at zero that dominates the plot and obscures meaningful variation in the non-zero range. A horizontal bar chart shows mean predicted probability per category with observed OVS rate overlaid as red diamond markers; a stacked bar shows tier distribution within each category using log-spaced breaks (No Signal, Low, Moderate, High, Very High). A category calibration scatter plots mean predicted against observed rate with point size proportional to parcel count — points on the 45° diagonal are perfectly calibrated.

16.6 ZIP code summary

ZIP 19132 has the highest mean predicted probability at ~3.61%. ZIP 19137 has the lowest at ~0.85%.

16.7 ZCTA choropleth and interactive Leaflet map

tigris::zctas(starts_with = "191") downloads US Census ZCTA boundaries for Philadelphia ZIPs (a few MB). These are joined to the ZIP-level summary and rendered as a static choropleth, plus an interactive Leaflet map with hover highlighting and click popups showing ZIP, mean P(vacant), parcel count, and observed vacancy rate.

Why ZCTA, not parcel polygons?

ZCTA boundaries are used rather than PWD_PARCELS.geojson for the static map because the parcel file is over 400 MB and consistently causes the R session to run out of memory or render at unusable resolutions. The parcel-resolution map lives separately in Step 17 as a vector tileset that the browser handles efficiently.

16.8 Capacity-based threshold lookup

A precision-recall curve is computed across thresholds 0.01–0.99 with each row annotated by the corresponding number of parcels flagged. Common inspection volumes (100, 250, 500, 1K, 2K, 5K, 10K) are read off the curve as a lookup table. To flag 5,000 parcels: threshold ≈ 0.67, precision ≈ 78%, recall ≈ 67%. Operations teams pick a row based on actual resourcing rather than committing to a fixed citywide threshold.

16.9 Optional GeoJSON exports

When PWD_PARCELS.geojson is available, this step writes two GeoJSON files for downstream use.

Outputs

data_py/output_zip_summary.csv, data_py/output_category_summary.csv, data_py/capacity_threshold_curve.csv, data_py/vacancy_risk_map.html, plus the two GeoJSON files and a folder of static PNGs under docs/graphs/python/.

Probability distribution by category
Figure 16.1

Probability distribution by category. Density of rf_prob per OPA category, non-zero predictions only.

Mean predicted probability by category
Figure 16.2

Mean predicted probability by category. Bars with observed OVS rate overlaid as red diamonds.

Five-tier distribution by category
Figure 16.3

Five-tier distribution by category. Stacked bars using log-spaced breaks (No Signal…Very High).

Category calibration scatter
Figure 16.4

Category calibration scatter. Mean predicted vs observed rate per category, point size by parcel count.

Mean predicted probability by ZIP
Figure 16.5

Mean predicted probability by ZIP. Bars + observed OVS rate as diamond markers.

ZCTA choropleth of mean P(vacant)
Figure 16.6

ZCTA choropleth of mean P(vacant). Static map for stakeholder summaries.

Vector Tilesets for Web Consumption

code/r_code/06_tiling.Rmd. The 405 MB GeoJSON from Step 16 is unusable in a browser as is. This step converts both GeoJSON exports into PMTiles, a single-file vector tileset format that the browser can stream from static storage (S3, GCS, GitHub Pages) and consume directly via MapLibre GL JS or Mapbox GL JS without a tile server.

17.1 Toolchain

tippecanoe (version 2.80, Felt fork) is the converter. It is invoked via system2() from inside the Rmd. The Felt fork is required for the PMTiles output format — vanilla tippecanoe writes mbtiles only.

17.2 Flagged tileset

vacancy_flagged.pmtiles covers the top one percent flagged parcels — around 4,500 features, zoom 10–16. The layer name inside the tileset is flagged. At lower zooms tippecanoe coalesces the densest parcels so the tileset stays readable. Final file size: ~2.2 MB. This is the right tileset for high-zoom drill-down on the inspection target list.

17.3 Full prediction tileset

vacancy_predictions.pmtiles covers all ~436K parcels, zoom 10–15. The layer name inside the tileset is parcels. The pipeline applies a 10-unit simplification at low zooms and drops the densest features when tile size exceeds the limit. Final file size: ~46 MB.

Both tilesets carry the full feature property bag including risk_score, ensemble_prob, qtile_tier, zip, tract, ward, address, and ovs. Client-side styling and filtering are therefore possible without round-tripping to a server. A consumer can color by risk score, filter to flagged only, search by ZIP, or drill into a single parcel popup, all in the browser.

For visual sanity-checking the file https://protomaps.github.io/PMTiles/ accepts a PMTiles URL and renders it.

Outputs

data_py/tiles/vacancy_flagged.pmtiles and data_py/tiles/vacancy_predictions.pmtiles.

Website & Dashboard

Everything that gets handed to a non-technical audience lives in docs/. The folder is fully static, ships with a graceful fallback for the Flask backend, and is designed to be pushed to GitHub Pages or any static host without modification.

Public-facing pages

FilePurpose
docs/index.htmlTiny redirect that drops visitors onto the landing page
docs/Vacancy Risk Landing Page.htmlProject landing page. The "what this is and why it matters" surface for stakeholders
docs/PhillyStat360 v2.htmlFull methodology write-up. Mirrors the step-by-step structure of the README in a presentation-ready layout
docs/dashboard.htmlInteractive parcel-level dashboard. MapLibre + PMTiles vector parcels, SHAP risk drivers per parcel, ward choropleth, sidebar summary cards, ward filter, and parcel search

Data shipped with the dashboard

The dashboard needs a small set of static files alongside the HTML. Each file is the direct output of a pipeline step.

FileSource stepPurpose
vacancy_predictions.pmtiles (~46 MB)Step 17Full citywide parcel tileset, layer name parcels
vacancy_flagged.pmtiles (~2.2 MB)Step 17Top one percent flagged parcels, layer name flagged
vacancy_predictions_flagged.geojson (~3.9 MB)Step 16Flagged parcels as GeoJSON for the SHAP table backend
ward_boundaries.geojsonStep 16Polygon geometry for the 66 political wards
ward_stats.jsonStep 16Per-ward rollups used to power the choropleth, summary cards, and ward filter list in static mode
dashboard_shap.jsonStep 13Per-parcel SHAP risk drivers for the dashboard's "why this parcel" panel
colors_and_type.cssShared colour palette and type scale across landing page, methodology, and dashboard

Static vs local-server mode

dashboard.html auto-detects the host. On localhost or 127.0.0.1 it tries the local Flask backend first (docs/tileserver.py, backed by PostgreSQL/PostGIS via docs/load_db.py). On any other host it skips the backend entirely and falls back to the JSON files above.

FeatureLocal-server modeStatic mode
Map basemap and PMTilesCARTO basemap, vector parcels from vacancy_predictions.pmtilesSame
Parcel popups (click)Full property bag from PMTilesFull property bag from PMTiles
SHAP risk driversdashboard_shap.jsondashboard_shap.json
Ward choroplethward_stats.json + ward_boundaries.geojsonSame
Sidebar summary cards/summary endpointAggregated from ward_stats.json
Ward filter list/wards endpointFrom ward_stats.json
Ward fly-to/ward_bounds endpointComputed client-side from the ward boundary GeoJSON
Parcel searchDB-wide search via /searchLimited to parcels currently rendered in view (queryRenderedFeatures)
Census tract filter/census_tracts and /tract_boundsDisabled unless census_tracts.json is shipped

The local Flask backend is optional. It exists for the case where the city wants DB-wide parcel search and tract-level filtering on a workstation. Everything in the public deployment is static.

Deployment

Full GitHub Pages deployment notes live in docs/DEPLOY.md. The short version is:

# From the repo root, push the docs/ folder along with everything else:
git add docs/
git commit -m "Update site"
git push origin main
# Settings → Pages → Source: Deploy from a branch, main / docs

The two PMTiles files stay in the repo as ordinary objects rather than Git LFS, because GitHub Pages does not serve LFS objects through its CDN. The largest file (vacancy_predictions.pmtiles, around 46 MB) sits comfortably under GitHub's 100 MB per-file limit. To host the tilesets on S3, Cloudflare R2, or any other static bucket instead, set the PMTILES_URL constant at the top of the <script> block in dashboard.html to the absolute URL — the PMTiles JS reader will stream byte ranges over HTTP.

docs/sync_methodology_assets.sh copies the latest figures from docs/graphs/ and docs/code/outputs/ into the methodology page so the public write-up stays in sync with the modeling pipeline.

Cross-Notebook Headline Metrics

SourceMethodValue
04a, random 30% testCalibrated ensemble ROC-AUC0.9395
04a, random 30% testCalibrated ensemble PR-AUC0.5461
04a, random 30% testCalibrated ensemble Brier0.0068
04b, 10-fold ZIP-grouped CVRF mean AUC0.8877 ± 0.0064
04b, LOGO sample of 15 ZIPsRF mean AUC0.8849
04b_vpi, by income quintileQuintile AUC range0.968 – 0.978
04b_vpi, by building categoryCategory AUC range0.942 – 0.980
04c, vs City VPI on residentialEnsemble AUC0.9786
04c, vs City VPI on residentialEnsemble PR-AUC0.7538
04c, vs City VPI on residentialVPI binary AUC0.7849
04c, at matched ~6.4K capacityEnsemble precision / recall54.7% / 78.3%
04c, at matched ~6.4K capacityVPI precision / recall53.3% / 57.6%
04d, evaluation halfRecalibrated ensemble AUC0.9375
04e, citywide at 1% ward capacityModel precision / recall57.0% / 49.1%
04g, new-cohort temporal proxyEnsemble AUC0.9760
04g, old-cohort temporal proxyEnsemble AUC0.9579
04h, 5-fold tract-grouped CVRF mean AUC0.9677 ± 0.0083
Most Defensible Number

The most defensible number to cite for the model's generalization performance is the block-CV AUC of around 0.9677 from Step 15. It controls for the leakage that random splits allow through neighborhood-level features and uses the right spatial granularity for how city operations actually play out.

Key Design Decisions

DecisionChoiceReason
OVS definitionThree-source composite (C&S + violations + licenses)No single source captures all vacancies. Composite reduces false negatives.
Temporal anchorTRAIN_CUTOFF = 2025-10-01Hard cutoff prevents future data from influencing features or label.
C&S window2-year window plus demolition filterStale C&S records misrepresent current status. Demolished buildings are not vacant buildings.
Residential scopeCategory codes 1, 2, 3, 14 onlyBuildings and land require different feature vocabularies. Commercial properties introduce systematic false positives.
Commercial / parking filterPattern match on parcel building descriptionCatches parking garages and surface lots that survive the category code filter.
Life-history featuresTrend & acceleration over 3yr / 5yr / 2yr windowsSingle counts miss whether a property is getting worse, stable, or improving.
RTT integrationTransfer frequency, sheriff sales, log price changeFull deed history adds distress signals not in OPA's single most-recent sale.
Train / test splitStratified random 70/30Temporal split caused distributional shift because recency is itself a vacancy proxy.
Spatial validation10-fold ZIP CV + LOGO + 5-fold tract block CVTests generalization independent of spatial autocorrelation. LOGO simulates new geography. Tract block CV is the honest stakeholder-reportable AUC.
No assessed value featureslog_market_value, value_per_sqft excludedOPA assessed values have documented racial and geographic bias.
Production score50/50 Logit + RF, isotonic at the endLogit and RF capture different signal shape. Equal weighting beat tuned weighting in CV. Isotonic restores honest empirical positive rates.
No fixed binary thresholdTop 1% rank flag plus 5-tier rank bucketCalibrated probabilities are all small because vacancy is rare. Use rank or capacity.
ZCTA boundaries for static mapstigris packagePWD_PARCELS.geojson is over 400 MB. ZCTA is practical for rendering and presentations. Parcel-resolution lives in PMTiles.
Per-ward capacity flagging1% of each ward's residential parcelsSingle citywide threshold concentrates too many flags in a few neighborhoods. Per-ward keeps inspection load sensible everywhere.
04b mirrors 04a exactlyIdentical model_vars, recipe, hyperparametersSpatial CV and LOGO CV must evaluate the same model that is deployed. Divergence would produce misleading validation metrics.
No poverty rate in modelACS variables removed from model_varsUsed only in the equity audits as an analysis dimension, not a feature. Avoids an external Census API dependency at training time.

Output File Catalog

Production model artefacts (in data_py/)

FileContents
model_logit_final.rdsFitted Logistic Regression pipeline.
model_rf_final.rdsFitted Random Forest pipeline.
model_xgb_final.rdsFitted XGBoost pipeline. Diagnostic comparator.
model_lgb_final.rdsFitted LightGBM pipeline. Diagnostic comparator.
calibrators.rdsOriginal isotonic calibrators for each model and the ensemble.
calibrators_v2.rdsHeld-out recalibrated ensemble calibrator. Recommended for production.
model_thresholds.csvPer-model Youden thresholds plus AUC, sensitivity, specificity, and is_best.

Parcel-level prediction frames

FileContents
all_predictions_rf.csvOne row per parcel with all probabilities, calibrated and raw, plus flag, risk_score, qtile_tier.
predictions_calibrated.csvAdds ensemble_prob_v2 from the held-out calibrator.
predictions_04b.csvPredictions joined to ZIP, tract, ward, and category. The right starting point for analysis.
predictions_with_ci.csvTest-set predictions with RF tree-variance CI bounds.
parcel_shap_topflags.csvTop 200 flagged parcels with five SHAP-attributed features each.
vs_city_vpi_parcel.csvPer-parcel ensemble probability vs City VPI flag.

Aggregated rollups

FileContents
output_zip_summary.csvPer-ZIP mean predicted, observed rate, count.
output_category_summary.csvPer-category mean predicted, observed rate, count.
ward_summary_04b.csvPer-ward rollup including high-risk count.
zip_auc_04b.csvPer-ZIP AUC for ZIPs meeting the minimum-N filter.
equity_income_auc.csvPer-income-quintile AUC, observed rate, mean predicted.
equity_zip_audit.csvPer-ZIP flag rate vs observed vacancy rate.

Validation & diagnostics

FileContents
spatial_cv_metrics.csvPer-fold ZIP-grouped CV AUC and J-Index.
logo_cv_metrics.csvPer-ZIP LOGO AUC for sampled ZIPs.
validation_summary.csvHeadline cross-validation metrics.
block_cv_rf.csvPer-fold tract-grouped CV AUC.
temporal_validation.csvOld and new cohort metrics.
capacity_threshold_curve.csvThreshold → flagged-count → precision / recall lookup.
operational_flags_by_ward.csvPer-ward flag counts under the 1% capacity policy.
operational_precision_at_capacity.csvCitywide model vs VPI vs union summary.

Geospatial & interactive outputs

FileContents
vacancy_risk_map.htmlInteractive Leaflet ZCTA-level map.
vacancy_predictions.geojson~436K parcel polygons with all predictions.
vacancy_predictions_flagged.geojson~4,500 flagged parcels with all predictions.
tiles/vacancy_flagged.pmtiles~2.2 MB. Top 1%.
tiles/vacancy_predictions.pmtiles~46 MB. Full citywide.
vs_city_vpi_map.htmlInteractive bucket overlay.
vs_city_vpi_high_prob_ovs0_map.htmlCandidate-review map for high-probability OVS = 0 parcels.

Data Sources

All data from OpenDataPhilly unless noted.

FileSourcePurpose
opa_properties_public.csvOPAParcel registry, property characteristics
violations.csvL&ICode violation history
business_licenses.csvL&ILicense history including vacant property licenses
clean_seal.csvL&ICity-initiated boarding and securing actions
unsafe.csvL&IUnsafe structure orders
imm_dang.csvL&IImminently dangerous structure orders
permits.csvL&IBuilding and zoning permits including demolition and new construction
RTT_SUMMARY.csvPhiladelphia RevenueFull deed and transfer history
VIOLATION_DEFINITION.csvL&IViolation code title lookup
vpi_bldg.geojsonCity of PhiladelphiaVacant Property Indicator binary flag
ACS 2022 5-year B19013_001EUS Census BureauCensus-tract median household income for the equity audit