A spatial machine learning pipeline for predicting residential vacancy across Philadelphia parcels
The pipeline turns administrative data from OPA, L&I, and the Philadelphia Revenue Department into a parcel-level predicted probability of vacancy for every residential property in the city — plus a calibrated 0–100 risk score, an ensemble flag for the top one percent of parcels, and a five-tier rank bucket for dashboard display.
The whole pipeline is R Markdown end to end and lives in code/r_code/. Modeling, validation, calibration, explainability, and tiling all run in R using tidymodels, ranger, xgboost, lightgbm (via bonsai), probably for calibration, and treeshap for local explanations. The data_py/ output folder name is historical — it is just where every step reads from and writes to.
Each Rmd caches expensive computations under data_py/cache/ keyed by a SHA-1 of the feature list. Re-runs are fast as long as the feature set has not changed; deleting the cache forces a clean refit.
The repository separates the R Markdown pipeline from the rendered reports and the static web surface. Knit outputs and headline figures live under docs/ so the GitHub Pages site picks them up directly.
The early steps (00 to 03_3) load raw OpenDataPhilly extracts, validate quality, define the outcome variable, and engineer the feature matrix. They are run once whenever raw inputs change. They produce two flat tables that downstream steps consume:
data/ovs_residential.csv — around 522K residential parcels with the OVS label and source flags.data/features_residential.csv — the same parcels with around 80 engineered features.Each Rmd reads its inputs from data/ or data_py/, writes outputs back to data_py/, and knits a self-contained HTML rendering next to itself. The files have a strict left-to-right dependency. 04a fits the four base learners and the calibrated ensemble. 04b through 04h consume those fitted artefacts for validation. 05 produces the stakeholder-facing summaries, and 06 turns the parcel GeoJSON into PMTiles for the website.
To re-run the pipeline end to end, knit the Rmd files in alphabetical order. Each step caches its expensive computations under data_py/cache/ keyed by a SHA-1 of the feature list, so re-runs are fast as long as the feature set has not changed. Deleting data_py/cache/ forces a clean refit.
code/r_code/00_data_inventory.Rmd is the reference layer for the entire project. It establishes the vocabulary, the data quality picture, and the definition of the outcome variable before any analysis begins.
All source files are loaded from rawdata/ with row counts, column names, and a glimpse(). Profiling upfront prevents silent data errors downstream — knowing a field has 40% missing values, or that date ranges only go back to 2010, shapes every subsequent design decision.
| Dataset | Source | Contents |
|---|---|---|
| opa_properties_public.csv | OPA | ~583K parcel records. Property characteristics, zoning, assessed values, sale history. |
| violations.csv | L&I | Full code violation history with code, date, and status (open, complied, closed). |
| business_licenses.csv | L&I | License history per parcel with revenue code, status, and issue dates. |
| clean_seal.csv | L&I | City-initiated board-up and securing actions with case date and work order status. |
| unsafe.csv | L&I | Unsafe structure orders. |
| imm_dang.csv | L&I | Imminently dangerous structure orders. |
| assessments.csv | OPA | Historical assessment records. |
| VIOLATION_DEFINITION.csv | L&I | Lookup from violation code to title and definition text. |
For each dataset a kable table is generated with field name, data type, example values, and a plain-English description. Missing-value rates are computed for every column. The dictionary is the team's living reference — rather than re-reading raw files, the dictionary documents what each field means in context.
The OVS definition is formalized here and carried unchanged through the entire pipeline. A parcel has OVS = 1 if it meets any of the following.
Appears in L&I Clean & Seal records within a 2-year window of the training cutoff, with no subsequent demolition or new-construction permit.
Has an OPEN violation with one of eleven specific vacancy codes: 9-3904, 9-3905, PM15-901.1, PM15-901.2, PM15-301, CP-102, CP-103, PM-102.4/1..4.
Holds an active business license with revenue code 3219 (Residential Vacant) or 3634 (Commercial Vacant) — owner self-declaration, often for reduced tax treatment.
Three independent administrative systems triangulate vacancy more reliably than any single source. Clean & Seal captures severe cases the city has physically intervened on; the violation codes capture inspector-flagged properties; vacant licenses capture owners who self-declare.
All 54 violation codes whose title or definition text contains the word "vacant" are enumerated. These are separated from the eleven OVS-defining codes. The remaining 43 are vacancy-adjacent signals used in feature engineering without being part of the label itself.
Clean & Seal dominates the label. The model will largely learn to predict which properties the city has already physically boarded up, plus catch earlier-stage signals for properties that have not yet reached that point.
OVS class balance. Counts of OVS = 1 vs OVS = 0 across all parcels.
OVS source combinations. How parcels distribute across the three label sources and their overlaps.
code/r_code/00b_new_data_check.Rmd validates two newer sources before they enter the feature pipeline.
The RTT dataset is loaded and profiled. Key checks: join key detection (confirmed as opa_account_num), match rate against OPA, date range on display_date, and document type distribution (DEED, DEED SHERIFF, MORTGAGE, AGREEMENT OF SALE).
RTT contains the full deed and transfer history, richer than the single most-recent sale_date already in OPA. Rapid re-sales and sheriff (foreclosure) sales are distress signals that predict vacancy independently of physical condition.
Permits are loaded similarly. The join key is confirmed as opa_account_num; permittype, typeofwork, and status distributions are checked.
Permits play two roles. Demolition permits after a Clean & Seal event remove the parcel from the active vacancy list. New construction permits signal that a formerly vacant lot has been redeveloped. Both scenarios would make the OVS label stale if not accounted for.
code/r_code/02_ovs_exploration_JF.Rmd profiles the dependent variable across the full parcel universe before applying the residential filter.
A bar chart shows how many OVS = 1 parcels are flagged by only one source versus combinations of two or three. Clean & Seal alone accounts for the bulk of the label, with the other two sources contributing smaller exclusive shares and a long tail of overlap. Parcels flagged by multiple sources carry high label confidence; parcels flagged by only one source may have higher label noise.
Box plots and summary statistics compare vacant against occupied parcels.
Vacancy rates per ZIP range from ~0.3% in newer or wealthier areas to ~17% in areas with concentrated disinvestment. Vacancy is not randomly distributed — it clusters geographically. This finding directly motivated the spatial cross-validation work in Step 8.
Three specific parcels are selected and their full administrative history is reconstructed.
PM15-901.1. The code applies to vacant property maintenance but parking lots can receive it without being vacant in the residential sense. This led directly to the parking filter in 03_3_Features.Rmd.
Exterior condition by OVS. Vacant parcels skew heavily toward conditions 5–7.
Year built by OVS. Vacant parcels tend to be older.
Market value by OVS. Vacant median ~$145K vs occupied ~$222K.
OPA category by OVS. Distribution of OVS by OPA building category.
ZIP-level OVS rates. Vacancy rates per ZIP range from ~0.3% to ~17%.
Vacant parcels map. Geographic clustering of OVS = 1 parcels.
code/r_code/03_1_Ovs.Rmd is the single source of truth for the model's outcome variable. Every downstream file reads ovs_residential.csv rather than re-constructing OVS independently.
TRAIN_CUTOFF is fixed at 2025-10-01 globally. Every feature and every OVS source must use data strictly before this date. This prevents any information from after the cutoff from leaking into the model.
cs_parcels <- clean_seal %>% filter(casecreateddate >= (TRAIN_CUTOFF - years(2))) %>% # 2-year window anti_join(demo_after_seal, by = "opa_account_num") # demolition filter
The two-year window keeps the label current — a C&S record from five years ago does not necessarily mean the building is still vacant today. The demolition filter removes parcels where a completed demolition permit followed the C&S action; the building was torn down, so it is no longer a vacant building.
Only violations with violationstatus == "open" and a code in the eleven-code OVS list are included. Closed or complied violations mean the issue was resolved and the property should not be labeled as currently vacant.
Only licenses with licensestatus == "active" and revenue codes 3219 or 3634 are included. An expired or inactive vacant license means the owner no longer holds that designation.
All three source flags are left-joined to the OPA property table and combined with OVS = if_else(ovs_cs == 1 | ovs_viol == 1 | ovs_lic == 1, 1, 0). The individual source flags are retained so downstream analysis can decompose the label.
filter(category_code %in% c("1", "2", "3", "14")) removes commercial (4, 5), industrial (implicit), vacant land (6), and other non-residential categories.
| Code | Category |
|---|---|
| 1 | Single Family |
| 2 | Multi-Family, Duplex, Triplex |
| 3 | Mixed Use (residential plus commercial) |
| 14 | Large Apartment Buildings (4+ units) |
Geometry is dropped on export since downstream files work with flat tables, but the parcel-level building description is retained for the secondary parking and commercial filter applied during feature engineering.
data/ovs_residential.csv — ~520K residential parcels (522,583 unique) with an OVS = 1 rate of about 1.1%.
OVS source overlap on residential parcels. Same overlap analysis as Step 3, restricted to residential parcels after the category-code filter.
code/r_code/03_2_Analysis.Rmd establishes the violation severity scheme and the performance floor any engineered model must beat.
108 L&I violation codes are manually mapped to three severity tiers based on the VacancyGuide2026 methodology developed by Tim Haynes.
The tier weights were established by trial and error by Tim Haynes without statistical validation. They are used as a Track B comparison benchmark, not as primary model inputs. The engineered model in Track A learns its own weights from data.
The 108 tier codes are matched against the actual violations dataset to count OPEN violations per tier. This confirms the codes are actively used (not hypothetical) and that Level 3 codes dominate OPEN violations as expected.
VIOLATION_DEFINITION.csv is scanned for any code whose title or definition contains the word "vacant" — yielding 54 codes, a superset of the eleven OVS-defining codes. Using the eleven OVS codes as features would create direct leakage; the broader 54-code set provides predictive signal without circularity.
Vacancy rates are computed for each exterior condition rating (1 = New through 7 = Collapsed) and plotted as a bar chart. Condition = 7 shows dramatically higher vacancy rates than condition = 1. A violin plot was initially used here, but Fichman pushed back: exterior condition is an ordered categorical variable, not continuous, and bar charts are more honest about the discrete scale.
Properties built after 2010 and labeled OVS = 1 are isolated and their source breakdown compared to the overall OVS = 1 population. The analysis reveals whether these are genuine vacants (construction never occupied) or data artefacts such as a C&S record on a wrong parcel.
Mean violation counts in four windows (all-time, five-year, three-year, six-month) are computed separately for OVS = 1 and OVS = 0 parcels. OVS = 1 parcels have consistently higher violation counts across every window. The gap narrows in the six-month window, suggesting that acute recent activity is a weaker signal than long-term chronic history. This validates the life-history features in Step 6.
A logistic regression is fit using only four OPA fields — exterior_condition, log_market_value, building_age, log_livable_area — with no violation, license, or transfer features.
| Spec | Value |
|---|---|
| Model | ovs ~ exterior_condition + log_market_value + building_age + log_livable_area |
| Dataset | 276K parcels with non-missing values on all four fields |
| Baseline AUC | 0.798 |
Visible overlap in the predicted score density between OVS = 0 and OVS = 1 indicates room for improvement. Outputs: data/building_tier_mapping.csv, data/vacancy_keyword_codes.csv.
Vacancy rate by exterior condition. Bar chart over the 1–7 ordered categorical scale.
Violation history faceted by OVS. Mean violation counts in 4 windows, OVS = 1 vs OVS = 0.
Baseline ROC curve. Logistic regression on four OPA fields. AUC = 0.798.
Baseline predicted probability distribution. Density of predicted scores by OVS class — visible overlap motivates the engineered features.
code/r_code/03_3_Features.Rmd is the largest and most complex file in the pipeline. It transforms seven raw data sources into a single flat feature matrix. The guiding principle throughout is no temporal leakage. Every feature must be constructed from data strictly before TRAIN_CUTOFF.
| Window | Definition | Purpose |
|---|---|---|
| RECENT_WINDOW | Last 6 months | Acute current activity |
| WINDOW_2YR | Last 2 years | Recent trend |
| WINDOW_3YR | Last 3 years | Mid-term life history |
| WINDOW_5YR | Last 5 years | Long-term trajectory |
A second-pass filter removes parcels whose building description matches parking or commercial structure keywords. The category-code filter in Step 4 already excludes overtly commercial parcels, but some mixed-use parcels in category 3 are physically parking structures or commercial-only buildings. The fine-grained building-use description catches these edge cases, including the parking garage false positive identified in Case 2.
parking_pattern <- regex("parking|garage|surface lot|parking lot|commercial lot",
ignore_case = TRUE)
ovs_df <- ovs_df %>%
filter(is.na(bldg_desc) | !str_detect(bldg_desc, parking_pattern))
Permit records are filtered to completed demolition and new-construction permits before TRAIN_CUTOFF. For each parcel the most recent demolition date and most recent new-construction date are recorded.
The OVS-defining eleven violation codes are explicitly excluded from all violation features via filter(!is_ovs_code) — using them would create direct leakage since they are components of the label.
| Feature | Description |
|---|---|
| n_violations_total | All-time violation count |
| n_violations_recent | Violations in last 6 months |
| n_violations_2yr | Violations in last 2 years |
| n_violations_3yr | Violations in last 3 years |
| n_violations_5yr | Violations in last 5 years |
| n_violations_open | Count of currently open violations. Excluded from final model as an OVS proxy |
| n_distinct_codes | Number of unique violation codes ever cited |
| viol_trend_3v5 | Count in 3yr window minus count in the 3 to 5yr window. Positive means worsening |
| viol_accel_2v3 | Count in 2yr window minus count in the 2 to 3yr window. Positive means accelerating |
| n_repeat_codes | Count of codes cited two or more times. Measures repeat non-compliance |
| resolution_rate | Fraction of violations resolved (complied or closed). Low means persistent non-compliance |
| has_maintenance_code | Ever had a Level 1 or 2 maintenance or moderate violation |
| has_structural_code | Ever had a Level 3 severe or structural violation |
| has_fire_safety_code | Ever had a fire, egress, or hazard violation |
| has_open_structural | Has a currently open Level 3 violation |
| days_since_last_viol | Days from most recent violation to TRAIN_CUTOFF |
| days_oldest_open_viol | Age of the oldest currently open violation in days |
The life-history features fulfill a Fichman requirement. A property with five violations all in the last six months is different from one with five violations spread over ten years. The trend and acceleration features (viol_trend_3v5, viol_accel_2v3) capture whether a property is getting worse, improving, or stable over time, which a single count cannot express.
C&S features focus on history and trajectory rather than current active status (which is the OVS label component).
| Feature | Description |
|---|---|
| n_cs_total | Total C&S events ever recorded |
| days_since_last_cs | Days from most recent C&S to TRAIN_CUTOFF |
| cs_active_2yr | C&S event in last 2 years (excluded — near-current) |
| cs_has_closed | At least one work order completed |
| cs_span_days | Days between first and most recent C&S |
| demo_after_cs | Demolition permit completed after most recent C&S |
| newcon_after_cs | New construction completed after most recent C&S |
| cs_truly_active | C&S in 2yr AND no demo/newcon after (excluded — this is the OVS rule) |
License features capture the lifecycle trajectory of how a property's licensed use has changed over time.
| Feature | Description |
|---|---|
| had_vacancy_license | Ever had a vacancy license regardless of current status |
| n_vacancy_licenses | Total count of vacancy licenses over time |
| ever_had_vacant_lic | Binary version of above |
| has_rental_license, has_active_rental | Current active rental license. Excluded from final model as direct complement of vacancy status |
| ever_had_rental | Ever had a rental license |
| had_rental_then_vacant | Had a rental license, then later got a vacancy license. Lifecycle transition signal |
| license_lapse_rate | Fraction of all licenses that are no longer active |
| days_since_last_lic | Recency of most recent license activity |
The had_rental_then_vacant feature is unusually informative. A property that went from having a rental license to a vacancy license has a documented owner-reported history of occupancy followed by abandonment. That transition is a strong qualitative signal of the kind of properties the model aims to predict.
These orders sit outside the regular violation system. Features are simple — a binary "has ever" flag, a count, and days since the most recent order. These were not part of the OVS definition, so there is no leakage risk.
RTT records are filtered to deed and sheriff sale documents (excluding mortgages, liens, and other encumbrances) before TRAIN_CUTOFF. The join key is detected automatically by scanning for opa_account_num or parcel_number in the column names.
| Feature | Description |
|---|---|
| n_transfers_total | Total deed or transfer count, all time |
| n_transfers_5yr | Transfers in last 5 years |
| n_transfers_3yr | Transfers in last 3 years |
| n_deed_transfers | Arms-length deed transfers with recorded prices |
| n_sheriff_sales | Total sheriff foreclosure sale count |
| had_sheriff_sale | Ever had a sheriff sale |
| sheriff_sale_recent | Sheriff sale within last 5 years |
| log_price_change | log(last_deed_price) - log(prior_deed_price). Negative means price declined |
| days_since_last_transfer | Days from most recent any-type transfer to TRAIN_CUTOFF |
OPA's sale_price and sale_date capture only the most recent transaction. RTT adds the full ownership history. Frequent re-sales can indicate a distressed property being flipped without rehabilitation. A sheriff sale in the recent past is a direct foreclosure indicator. A log price decline between the two most recent deed sales suggests the market is pricing in deterioration risk.
last_transfer_price and prior_transfer_price are left as NA when a parcel has fewer than two deed transfers with recorded prices. The Random Forest recipe handles these via median imputation. log_price_change is also left as NA in these cases.
Derived from the OPA fields already present in ovs_residential.csv.
| Feature | Description |
|---|---|
| exterior_condition | Assessor-rated condition on a 1 to 7 scale |
| building_age | 2024 minus year_built. Capped at 200 years, negative values set to NA |
| log_market_value | log1p(market_value). Excluded from the final model due to assessed value bias |
| log_livable_area | log1p(total_livable_area) |
| log_sale_price | log1p(sale_price) |
| days_since_sale | Days from most recent OPA sale to TRAIN_CUTOFF |
| years_since_sale | days_since_sale / 365.25 |
| is_poor_condition | Binary. exterior_condition >= 5 |
| value_per_sqft | market_value / total_livable_area. NA when area is zero |
All feature groups are left-joined to the OPA base table on parcel_number. Left joins ensure every parcel appears even if it has no records in a particular source. A parcel with no violation history is kept with zeros, not dropped.
NA fill logic.
as.integer(TRAIN_CUTOFF - as.Date("2000-01-01")) — a large sentinel value meaning "no activity on record before the data starts".log_price_change, raw prices) are left as NA. Genuinely missing when there are fewer than two arms-length sales. Handled by the model recipe.A summary table shows every feature with its missing-value count and percentage, grouped into flags (Complete, less than 5 percent, 5 to 20 percent, more than 20 percent). Most engineered features are complete. The highest-missing features are days_oldest_open_viol (only populated for parcels with open violations) and price-related fields with fewer than two transactions.
Pearson correlations between each numeric feature and ovs are computed and the top 20 by absolute value are plotted. This acts as a quick sanity check — the top features should be conceptually sensible vacancy indicators (violation counts, C&S history, license flags).
data/features_residential.csv — ~520K parcels (522,481 unique) by 80+ columns.
Top 20 univariate correlations with OVS. Quick sanity check that the strongest predictors are conceptually sensible.
code/r_code/04a_tidymodeling.Rmd is the heart of the production pipeline. It fits four base learners on the engineered feature matrix, blends two of them into a calibrated ensemble (the Vacancy Risk Score), and exports the artefacts every downstream step depends on.
features_residential.csv is loaded directly. model_vars is an explicit character vector and is the single source of truth for the feature set used everywhere downstream. The exclusions listed below capture the leakage and bias decisions made during feature engineering.
| Excluded Feature | Reason |
|---|---|
| n_violations_open | Direct count of open violations. The OVS violation rule is "any open violation". Near-exact proxy |
| has_open_violation | One half of the OVS violation rule |
| has_vacancy_kw_code | The other half. Violation code matches the keyword list |
| has_open_vacancy_kw | Intersection of both. Directly reconstructs the OVS violation component |
| cs_truly_active | Current active C&S status is the OVS Clean & Seal rule |
| cs_active_2yr | Too close to current status. Near-proxy for cs_truly_active |
| has_rental_license, has_active_rental | Current active rental is the functional complement of vacancy license status |
| log_market_value, value_per_sqft | OPA assessed values carry well-documented racial and geographic bias (Fichman guidance) |
| last_transfer_price, prior_transfer_price | Redundant with log_sale_price. log_price_change captures the useful signal instead |
Cache files in data_py/cache/04a/ are fingerprinted with a SHA-1 of model_vars. Any change to the feature list invalidates stale fits silently.
split <- initial_split(df, prop = 0.70, strata = ovs) train_df <- training(split) test_df <- testing(split)
A temporal split was evaluated and discarded earlier in the pipeline. Activity recency is itself a vacancy proxy via fields like days_since_last_viol and cs_active_2yr. A temporal split therefore causes distributional shift, with train OVS-equal-one rate around 0.2 percent and test rate around 6.9 percent. AUC estimates become unreliable. Stratified random split ensures both train (around 364K rows) and test (around 156K rows) have approximately the production-prevalence OVS-equal-one rate of around 1.1 percent. Temporal generalization is tested separately in Step 14.
Every base learner is wrapped in a recipes recipe with the same three steps in the same order.
step_impute_median() — imputes NAs primarily in log_price_change, days_oldest_open_viol, and log_sale_price.step_zv() — removes zero-variance columns that survived feature engineering.themis::step_rose() — synthetic minority oversampling applied only to the training fold of each cross-validation split. Wrapping it in the recipe ensures the resampling step never leaks into the test fold.OVS prevalence sits at around 1.1 percent. Without correction, models tend to drive themselves toward the majority class. ROSE rebalances the training data before the model is fit. Class-weight tuning (described below) provides a complementary lever inside the model itself.
Each learner is fit with hyperparameters chosen up-front rather than searched on every run. Tuning results from earlier runs are kept in data_py/rf_tune_results.csv, data_py/xgb_tune_results.csv, and data_py/lgb_tune_results.csv.
| Model | Configuration |
|---|---|
| Logistic Regression | glm with L2 regularization via glmnet, class weights inversely proportional to prevalence |
| Random Forest | ranger with num.trees = 500, mtry = 7, min.node.size = 5, class weights set to balance |
| XGBoost | Pre-tuned hyperparameters loaded from xgb_tune_results.csv, scale_pos_weight tuned to prevalence |
| LightGBM | Pre-tuned hyperparameters loaded from lgb_tune_results.csv, class weights set to balance via bonsai |
The Random Forest mtry value of seven corresponds approximately to the floor of the square root of the feature count, matching the canonical default for RF on this size of feature set.
The production score is a 50/50 average of the calibrated Logit and Random Forest probabilities. XGBoost and LightGBM are kept as diagnostic comparators only. They are reported in the model thresholds table and ablation results, but do not feed the production ensemble.
ensemble_prob_raw <- 0.5 * logit_prob_raw + 0.5 * rf_prob_raw ensemble_prob <- predict(isotonic_ensemble, ensemble_prob_raw)
Isotonic regression is fit on the test set predictions of the raw ensemble.
Vacancy is rare. Raw probabilities cluster well below 0.5 even for parcels the model is confident about. Isotonic calibration maps the raw score back to honest empirical positive rates — even the very top one percent of parcels is only around 59% truly vacant, so the highest calibrated probability is around 0.6, not 1.0.
Calibrated probabilities should never be compared to a fixed threshold like 0.5 — almost nothing will exceed it. Use ensemble_flag (top one percent by raw rank) or qtile_tier (a five-bucket rank) for inspection triage.
| Metric | Value |
|---|---|
| Test ROC-AUC | 0.9395 |
| Test PR-AUC | 0.5461 — beats both parents |
| Test Brier | 0.0068 |
| Mean ensemble P(vacant) across full population | ~1.28% |
Variable importance from the Random Forest is dominated by C&S history, vacancy license history, violation trajectories, and recency signals. RTT features (had_sheriff_sale, log_price_change) appear in the top half when they carry independent signal.
| File | Contents |
|---|---|
| data_py/model_logit_final.rds | Fitted Logistic Regression pipeline |
| data_py/model_rf_final.rds | Fitted Random Forest pipeline (around 178 MB) |
| data_py/model_xgb_final.rds | Fitted XGBoost pipeline |
| data_py/model_lgb_final.rds | Fitted LightGBM pipeline |
| data_py/calibrators.rds | Per-model isotonic calibrators plus the ensemble calibrator |
| data_py/all_predictions_rf.csv | One row per parcel (around 520K) with all model probabilities, calibrated and raw, ensemble flag, risk_score, qtile_tier |
| data_py/model_thresholds.csv | Per-model Youden thresholds plus AUC, sens, spec, plus an is_best boolean |
| data_py/rf_tune_results.csv | Cached RF tuning grid |
| data_py/xgb_tune_results.csv | Cached XGBoost tuning grid |
| data_py/lgb_tune_results.csv | Cached LightGBM tuning grid |
The all_predictions_rf.csv columns are worth memorizing.
risk_score — integer 0–100, ensemble probability times 100. Use for dashboard display.ensemble_prob — calibrated probability 0–1. Use for "X% chance of being vacant" callouts.ensemble_prob_raw — uncalibrated. Use for ranking and sorting because it preserves spread.ensemble_flag — 1 equals top one percent by raw rank. Use for operational triage.qtile_tier — five-bucket rank for tier badges in UI such as "Top 1 percent (highest risk)".
ROSE versus no subsampling. Class-balance lever before model fitting.
Train vs test density. Predicted-score density on each split — gap indicates overfit.
Calibration curve for the ensemble. Reliability after isotonic calibration.
Calibrated vs raw distributions. Why the highest calibrated probability sits ~0.6, not 1.0.
ROC curves, all four base models. Logit, RF, XGBoost, LightGBM, plus the ensemble.
Random Forest top-20 variable importance. C&S history and violation trajectories dominate.
Threshold sensitivity. Precision and recall as the operating threshold sweeps.
Five-tier probability distribution. Population shares per qtile_tier bucket.
code/r_code/04b_model_validation.Rmd tests whether the production ensemble generalizes to data it was not trained on. The motivating concern is spatial autocorrelation — parcels in the same neighborhood share many features, including features that are themselves built from neighborhood activity. A standard random split overstates AUC because the test set contains parcels whose neighbors taught the model how to score them.
folds <- group_vfold_cv(df, group = zip_code, v = 10) fits <- map(folds$splits, ~ fit(workflow, data = analysis(.x)))
group_vfold_cv ensures that all parcels in a given ZIP code stay together — either all in training or all in test, never split across both. With ten folds, each fold holds out approximately ten percent of Philadelphia ZIPs. A parcel in 19139 cannot show up in both train fold 1 and test fold 2, so the model cannot lean on memorized neighborhood patterns.
A per-fold line chart shows AUC and J-Index across the ten folds. High variance across folds would suggest the model struggles with some geographic areas.
Spatial CV mean AUC is 0.8877 ± 0.0064 across the ten folds — the lower-bound, conservative AUC for the RF component without ensemble calibration.
LOGO takes the spatial CV to its extreme. Each of Philadelphia's residential ZIP codes is held out exactly once across separate model fits. To keep total compute bounded, the step samples 15 of the roughly 45 residential ZIPs uniformly at random and fits a fresh RF for each.
Per-ZIP AUC results are plotted as a ranked bar chart. ZIPs with AUC less than 0.70 would be highlighted in red as candidates for manual review. In the most recent run none of the sampled ZIPs fall below that line, with median AUC of 0.95 and above. The mean LOGO AUC is 0.8849, very close to the 10-fold spatial CV result.
LOGO matters for deployment. If the city ever applies the model to newly annexed areas, or if future data includes ZIPs not well represented in the training window, LOGO CV predicts how well the model will perform in those situations.
The Random Forest is leveraged for free uncertainty quantification by computing per-tree probabilities and taking their dispersion across the 500 estimators.
tree_probs <- predict(rf_fit, test_df, predict.all = TRUE)$predictions[, 2, ] rf_prob <- rowMeans(tree_probs) rf_se <- apply(tree_probs, 1, sd) / sqrt(rf_fit$num.trees) ci_lower <- pmax(0, rf_prob - 1.96 * rf_se) ci_upper <- pmin(1, rf_prob + 1.96 * rf_se) ci_width <- ci_upper - ci_lower
A parcel with a 0.60 RF probability and CI width 0.05 is a confident prediction — the trees agree. The same 0.60 probability with CI width 0.35 is ambiguous — different subsets of trees give wildly different predictions.
Four checks confirm the model is learning from the right signals.
expected_top list of domain-sensible features (vacancy license history, C&S total, license lapse rate, days since last violation). The five strongest signals are n_violations_total, days_since_last_viol, n_cs_total, n_violations_recent, and has_fire_safety_code. All sit comfortably in the expected list.cs_truly_active, had_vacancy_license, has_open_vacancy_kw, or none). Mean predicted probability is reported per group. Parcels with cs_truly_active score around 2.5 times higher than the no-signal group, and parcels with a vacancy license score around 4.5 times higher. The model is responding to its strongest available signals.n_violations_total is varied from its 1st to 99th percentile while all other features are held at training-set medians. The predicted probability rises monotonically with violation count, as expected.data_py/spatial_cv_metrics.csv per-fold metrics, data_py/logo_cv_metrics.csv per-ZIP metrics, data_py/predictions_with_ci.csv test set with CI bounds, and data_py/validation_summary.csv headline metrics table.
Spatial CV per-fold AUC and J-Index. Variance across the 10 ZIP-grouped folds.
LOGO AUC by held-out ZIP. 15 sampled ZIPs, ranked.
Prediction CI analysis. RF tree-variance CI width vs predicted probability.
04b feature importance. Sanity-check reproduction of 04a's RF VIP under spatial CV.
code/r_code/04b_vpi_comparison.Rmd breaks the citywide AUC apart along three operationally relevant axes (ZIP, building category, ward) and runs an equity audit by census-tract median household income.
Predictions are joined to features (for ZIP and tract) and OPA (for ward and category). A five-tier classification using equal-width 0.2 probability buckets — Very Unlikely, Unlikely, Maybe, Likely, Very Likely — is applied to every parcel as prob_tier. The full population skews heavily into the lowest tier (~99.88% Very Unlikely). The Very Likely tier (> 0.8) holds about 0.47% — the operational high-risk pool.
For each ZIP with at least 50 parcels and at least 5 observed vacants, AUC is computed by calling roc_auc_score on the subset. Forty-one ZIPs survive the minimum-N filter. Results are ranked and plotted as a horizontal bar chart with ZIPs colored red where AUC falls below 0.70.
| Category | Parcels | AUC |
|---|---|---|
| Single Family | ~461K | 0.980 |
| Multi-Family / Duplex / Triplex | ~41K | 0.978 |
| Mixed Use | ~14K | 0.945 |
| Apartments 4+ Units | ~3.6K | 0.942 |
Tier counts and observed OVS rates per tier are computed within each category. The Very Likely tier shows significantly higher observed vacancy rates than Very Unlikely inside every category — confirming the probability scores discriminate within building type and not just on average.
Philadelphia's 66 political wards are used as an aggregation unit. For each ward this step reports parcel count, observed vacancy count and rate, mean predicted probability, count in Very Likely and Likely tiers, and a combined "high risk" rate (probability of 0.6 or higher). Top wards by mean predicted probability:
| Ward | Mean predicted |
|---|---|
| 11 | 0.0436 |
| 44 | 0.0419 |
| 28 | 0.0410 |
| 6 | 0.0384 |
| 16 | 0.0377 |
ACS 2022 5-year median household income (B19013_001E) is joined at the census-tract level. Tracts are binned into income quintiles (Q1 = lowest, Q5 = highest); coverage of the join is ~99.5%.
| Quintile | Observed rate | Mean predicted | AUC |
|---|---|---|---|
| Q1 (lowest income) | 2.60% | 2.93% | 0.97 |
| Q2 | Lower | Lower | 0.97 |
| Q3 | Lower | Lower | 0.97 |
| Q4 | Lower | Lower | 0.98 |
| Q5 (highest income) | 0.31% | 0.28% | 0.97 |
AUC is consistent across all income quintiles (0.968–0.978). The model identifies vacancy with the same ranking quality whether the parcel sits in a low-income or high-income tract. Mean predicted probability tracks observed rate closely in every quintile — the model is not systematically over- or under-predicting in any income band. It concentrates absolute flag counts in lower-income areas because vacancy is genuinely concentrated there.
Outputs. data_py/predictions_04b.csv parcel frame, data_py/zip_auc_04b.csv per-ZIP AUC, data_py/ward_summary_04b.csv per-ward rollup, data_py/equity_income_auc.csv equity audit, plus a static ZCTA choropleth and an interactive Leaflet map.
AUC by ZIP code. Per-ZIP discrimination, ranked. No ZIP falls below 0.70.
Equity scatter. Per-ZIP flag rate vs observed vacancy rate — alignment indicates calibration.
Mean predicted probability by ward. Wards 11, 44, 28 lead.
AUC by income quintile. 0.968–0.978 across Q1–Q5.
code/r_code/04c_vs_city_vpi.Rmd compares the production ensemble to the City of Philadelphia's binary Vacant Property Indicator (VPI) head-to-head on the residential-only universe. rawdata/vpi_bldg.geojson contains 8,048 buildings flagged by the City. After filtering to residential parcels via the OPA join, 6,418 parcels (~79.7%) remain in scope. The remaining 20-plus percent are commercial or industrial properties not covered by the residential model.
The ensemble probability is thresholded at the level that produces approximately the same number of positives as the City's VPI (~6,400 parcels).
| Approach | Flagged | Precision | Recall vs OVS |
|---|---|---|---|
| Ensemble (matched capacity) | ~6,400 | 54.7% | 78.3% |
| City VPI | ~6,400 | 53.3% | 57.6% |
| Union of both | ~9,800 | 47.9% | 79.7% |
The continuous-score view is cleaner. Ensemble AUC against OVS is 0.9786; VPI as a binary flag has AUC 0.7849. Ensemble average precision is 0.7538; VPI is 0.3116.
| Bucket | Parcels | Observed OVS rate |
|---|---|---|
| Both flag the parcel | 3,199 | 83.3% |
| Only the model flags | 5,300 | 37.4% |
| Only the VPI flags | 3,211 | 23.4% |
| Neither flags | 508,486 | 0.1% |
The "only ours" bucket has substantially higher observed vacancy than the "only VPI" bucket — the model's exclusive flags are higher quality on average than the VPI's exclusive flags.
Per-ZIP precision is plotted with City VPI on the x-axis and ensemble on the y-axis (one point per ZIP, ≥50 parcels). The model achieves higher precision than VPI in around 30 of the 41 ZIPs, sometimes by a wide margin.
A separate analysis isolates the top 339 parcels with the highest ensemble probability that have OVS = 0. Twenty-seven of those also appear in the City VPI — independent corroboration that the parcel is genuinely vacant despite OVS missing it. The remaining 312 are candidate false positives or genuinely undetected vacants and are exported with addresses for inspector review.
data_py/vs_city_vpi_headline.csv, data_py/vs_city_vpi_buckets.csv, data_py/vs_city_vpi_zip.csv, data_py/vs_city_vpi_parcel.csv, data_py/vs_city_vpi_map.html (interactive overlay of all three buckets), and data_py/vs_city_vpi_high_prob_ovs0_map.html (the candidate review map).
Ensemble vs City VPI — ROC and PR curves. Ensemble AUC 0.9786 vs VPI 0.7849.
Four-bucket disagreement. Both / model-only / VPI-only / neither, with observed OVS rates per bucket.
Disagreement map. Spatial layout of model-only vs VPI-only flags.
Per-ZIP precision, ensemble vs VPI. Above the diagonal = model wins. Ensemble wins ~30 of 41 ZIPs.
High-prob OVS = 0 candidates. Top 339 model-flagged parcels with OVS = 0 — candidates for inspector review.
code/r_code/04d_recalibration.Rmd refits the calibrator on a held-out half of the test split and evaluates calibration on the other half. The original 04a calibrator was fit on the full test set — a deliberate shortcut for an initial release that allows a small amount of optimism into the calibrated probabilities.
The 30% test set from 04a is split 50/50 stratified by OVS. The raw ensemble probabilities for the calibration half are used to fit a fresh IsotonicRegression; the new calibrator is applied to the full population producing ensemble_prob_v2. On the evaluation half (which neither the model nor either calibrator has seen):
| Metric (evaluation half) | Original | Recalibrated |
|---|---|---|
| ROC-AUC | 0.9399 | 0.9375 |
| Brier | 0.0067 | 0.0067 |
| Mean predicted to observed ratio | 1.02× | 1.04× |
Both reliability curves follow the diagonal, but the recalibrated version stays closer across all probability bins. Practical takeaway: the original calibrator was already nearly correct — ensemble_prob_v2 is the recommendation for production, but the substantive difference is small enough that consumers already on ensemble_prob do not need to migrate urgently.
data_py/calibrators_v2.rds and data_py/predictions_calibrated.csv (full population with the new column).
Reliability curve, original vs recalibrated. Recalibrated curve hugs the 45° diagonal more tightly across all bins.
code/r_code/04e_operational_thresholds.Rmd translates model scores into actionable inspection capacity. Rather than choose one citywide threshold, it applies a per-ward capacity policy: inspect the top N parcels per ward, where N is some fraction of the ward's residential parcel count.
CAPACITY_PCT is parameterized at one percent of each ward's residential parcels per inspection cycle. Within each ward, the top N parcels by ensemble probability are flagged. The same policy is then applied to the City's VPI for direct comparison, and to the union of the two.
| Strategy | Flagged | Precision | Recall |
|---|---|---|---|
| Model only | 5,232 (1.01%) | 57.0% | 49.1% |
| City VPI only | 6,410 (1.23%) | 53.3% | 57.6% |
| Union | 9,874 (1.90%) | 47.9% | 79.7% |
The model achieves the highest precision per parcel inspected. The union strategy trades precision for recall and is the right choice when coverage is the priority.
A per-ward scatter (City VPI precision on the x-axis, model precision on the y-axis) shows wide variance across wards. Some wards reach perfect or near-perfect precision under the model's flagging policy, while others land in the 0.3–0.5 range. It flags wards where precision falls below a configurable threshold for individual review.
data_py/operational_flags_by_ward.csv per-ward summary, data_py/operational_precision_at_capacity.csv headline citywide table, and the per-ward precision scatter PNG.
Operational precision at 1% ward capacity. Per-ward precision under the capped per-ward policy.
Capacity vs threshold curve. Number of parcels flagged for any chosen threshold — lookup for ops teams.
code/r_code/04f_local_explanations.Rmd. A model that is operationally useful must be explainable at the parcel level. An inspector who is told "go visit this address" must be able to ask "why this one" and get a meaningful answer.
This step uses treeshap on the fitted Random Forest component (which dominates the feature attribution alongside the Logistic Regression in the ensemble). For each parcel the explainer returns one SHAP value per feature. The step ranks parcels by ensemble probability, takes the top 200, and for each parcel records the five features with the largest absolute SHAP value, the feature value, the SHAP magnitude, and a direction (pushed up or pushed down).
days_since_last_cs — Clean & Seal recency is the single largest signal.n_cs_total — total Clean & Seal events.days_since_last_viol — recent violation pushes probability up.n_violations_5yr & n_violations_3yr — five and three year violation density.A useful pattern in the local explanations: some neighborhood-level features push downward at flagged parcels. A high count of vacant parcels in the same ZIP can lower the ensemble probability for a specific parcel — the model accounts for the neighborhood baseline elsewhere, so an individual parcel in a high-vacancy ZIP is comparatively less risky than a parcel that looks unusual within its ZIP.
data_py/parcel_shap_topflags.csv (1,000 rows of parcel by top-five-feature explanations) and docs/graphs/python/shap_summary_topflags.png.
Top SHAP drivers across the top 200 flagged parcels. Mean absolute SHAP per feature.
code/r_code/04g_temporal_validation.Rmd tests how the model performs on parcels with very recent administrative activity versus parcels whose activity is years old. The training split is stratified random rather than temporal — this is correct for measuring AUC (see Step 7.2), but it leaves open the question of how the model performs across activity windows. days_since_last_viol is used as the temporal proxy.
| Cohort | Observed rate | Mean predicted | AUC | Brier |
|---|---|---|---|---|
| Old | 0.397% | 0.45% | 0.9579 | 0.0028 |
| New | 8.90% | 9.63% | 0.9760 | 0.0226 |
The model's discrimination is actually higher on the new cohort, where the recent administrative trail is rich. The Brier score on the new cohort is larger in absolute terms because prevalence is roughly 20× higher, but the calibration ratio is good. There is no evidence of temporal decay between the older and newer activity windows.
The OVS rates differ by an order of magnitude between the two cohorts, which by itself is informative. Recent violation activity is itself a strong vacancy signal, even before the model touches it.
data_py/temporal_validation.csv and docs/graphs/python/temporal_validation.png.
Temporal validation, old vs new cohort. ROC curves and calibration on the two recency-defined cohorts.
code/r_code/04h_block_cv_rf.Rmd. Several engineered features are spatially smooth — neighborhood vacancy counts, ZIP-level aggregations, and the implicit information that a parcel near many flagged parcels is itself probably flagged. A standard random train/test split puts neighbors on opposite sides of the split, so the model effectively gets to see its training answer through its neighborhood features when scoring the test set. This inflates AUC.
This step re-fits the Random Forest under group_vfold_cv(df, group = census_tract, v = 5). Each fold holds out approximately 20% of Philadelphia's census tracts entirely. No tract appears in both train and test. Production RF hyperparameters are loaded directly from model_rf_final.rds; the same imputation and variance-threshold preprocessing is reused. ROSE oversampling is not applied here because the goal is honest discrimination, not optimized recall.
| Fold | AUC |
|---|---|
| 1 | 0.9677 |
| 2 | 0.9789 |
| 3 | 0.9644 |
| 4 | 0.9713 |
| 5 | 0.9564 |
Block-CV mean AUC is 0.9677 ± 0.0083. The random-split test AUC from 04a was 0.9395 (ensemble) and the spatial-CV ZIP-blocked AUC from 04b was 0.8877 (RF only). The block-CV AUC sits between these, which is consistent with the underlying logic — ZIP groups are larger than census tracts, so ZIP blocking is a more aggressive test of generalization. Census-tract blocking is the right granularity for the operational story because the city does inspect and intervene at the tract level all the time. ~0.97 is the right number to put on a stakeholder slide as the honest, leakage-controlled AUC for the Random Forest component.
Outputs. data_py/block_cv_rf.csv and docs/graphs/python/block_cv_rf_aucs.png.
Block CV AUC by fold. Five tract-grouped folds, mean AUC 0.9677 ± 0.0083.
code/r_code/05_output_analysis.Rmd turns production predictions into stakeholder-facing materials — distributions by category and ZIP, a calibration scatter, a choropleth map, an interactive Leaflet map, and a capacity-based threshold lookup table.
About 25% of residential parcels have any positive tree votes from the Random Forest. The remaining 75% receive exactly zero probability, meaning the ensemble unanimously predicts occupied. Three quarters of the residential housing stock is so clearly not vacant that no tree in 500 disagrees.
Mixed Use carries the highest mean predicted probability at ~2.73%; Single Family the lowest at ~1.20%. Both mean and median are reported because the distribution is right-skewed — a few high-risk parcels pull the mean up, and the median reflects the typical parcel.
A density plot of rf_prob is drawn per category, filtered to rf_prob > 0. The zero-vote parcels are excluded because they create a spike at zero that dominates the plot and obscures meaningful variation in the non-zero range. A horizontal bar chart shows mean predicted probability per category with observed OVS rate overlaid as red diamond markers; a stacked bar shows tier distribution within each category using log-spaced breaks (No Signal, Low, Moderate, High, Very High). A category calibration scatter plots mean predicted against observed rate with point size proportional to parcel count — points on the 45° diagonal are perfectly calibrated.
ZIP 19132 has the highest mean predicted probability at ~3.61%. ZIP 19137 has the lowest at ~0.85%.
tigris::zctas(starts_with = "191") downloads US Census ZCTA boundaries for Philadelphia ZIPs (a few MB). These are joined to the ZIP-level summary and rendered as a static choropleth, plus an interactive Leaflet map with hover highlighting and click popups showing ZIP, mean P(vacant), parcel count, and observed vacancy rate.
ZCTA boundaries are used rather than PWD_PARCELS.geojson for the static map because the parcel file is over 400 MB and consistently causes the R session to run out of memory or render at unusable resolutions. The parcel-resolution map lives separately in Step 17 as a vector tileset that the browser handles efficiently.
A precision-recall curve is computed across thresholds 0.01–0.99 with each row annotated by the corresponding number of parcels flagged. Common inspection volumes (100, 250, 500, 1K, 2K, 5K, 10K) are read off the curve as a lookup table. To flag 5,000 parcels: threshold ≈ 0.67, precision ≈ 78%, recall ≈ 67%. Operations teams pick a row based on actual resourcing rather than committing to a fixed citywide threshold.
When PWD_PARCELS.geojson is available, this step writes two GeoJSON files for downstream use.
data_py/vacancy_predictions.geojson — around 436K parcels, around 405 MB. All prediction columns plus all feature metadata.data_py/vacancy_predictions_flagged.geojson — around 4,500 parcels (top one percent flagged), around 3.9 MB. Same column set, smaller geometry.data_py/output_zip_summary.csv, data_py/output_category_summary.csv, data_py/capacity_threshold_curve.csv, data_py/vacancy_risk_map.html, plus the two GeoJSON files and a folder of static PNGs under docs/graphs/python/.
Probability distribution by category. Density of rf_prob per OPA category, non-zero predictions only.
Mean predicted probability by category. Bars with observed OVS rate overlaid as red diamonds.
Five-tier distribution by category. Stacked bars using log-spaced breaks (No Signal…Very High).
Category calibration scatter. Mean predicted vs observed rate per category, point size by parcel count.
Mean predicted probability by ZIP. Bars + observed OVS rate as diamond markers.
ZCTA choropleth of mean P(vacant). Static map for stakeholder summaries.
code/r_code/06_tiling.Rmd. The 405 MB GeoJSON from Step 16 is unusable in a browser as is. This step converts both GeoJSON exports into PMTiles, a single-file vector tileset format that the browser can stream from static storage (S3, GCS, GitHub Pages) and consume directly via MapLibre GL JS or Mapbox GL JS without a tile server.
tippecanoe (version 2.80, Felt fork) is the converter. It is invoked via system2() from inside the Rmd. The Felt fork is required for the PMTiles output format — vanilla tippecanoe writes mbtiles only.
vacancy_flagged.pmtiles covers the top one percent flagged parcels — around 4,500 features, zoom 10–16. The layer name inside the tileset is flagged. At lower zooms tippecanoe coalesces the densest parcels so the tileset stays readable. Final file size: ~2.2 MB. This is the right tileset for high-zoom drill-down on the inspection target list.
vacancy_predictions.pmtiles covers all ~436K parcels, zoom 10–15. The layer name inside the tileset is parcels. The pipeline applies a 10-unit simplification at low zooms and drops the densest features when tile size exceeds the limit. Final file size: ~46 MB.
Both tilesets carry the full feature property bag including risk_score, ensemble_prob, qtile_tier, zip, tract, ward, address, and ovs. Client-side styling and filtering are therefore possible without round-tripping to a server. A consumer can color by risk score, filter to flagged only, search by ZIP, or drill into a single parcel popup, all in the browser.
For visual sanity-checking the file https://protomaps.github.io/PMTiles/ accepts a PMTiles URL and renders it.
data_py/tiles/vacancy_flagged.pmtiles and data_py/tiles/vacancy_predictions.pmtiles.
Everything that gets handed to a non-technical audience lives in docs/. The folder is fully static, ships with a graceful fallback for the Flask backend, and is designed to be pushed to GitHub Pages or any static host without modification.
| File | Purpose |
|---|---|
| docs/index.html | Tiny redirect that drops visitors onto the landing page |
| docs/Vacancy Risk Landing Page.html | Project landing page. The "what this is and why it matters" surface for stakeholders |
| docs/PhillyStat360 v2.html | Full methodology write-up. Mirrors the step-by-step structure of the README in a presentation-ready layout |
| docs/dashboard.html | Interactive parcel-level dashboard. MapLibre + PMTiles vector parcels, SHAP risk drivers per parcel, ward choropleth, sidebar summary cards, ward filter, and parcel search |
The dashboard needs a small set of static files alongside the HTML. Each file is the direct output of a pipeline step.
| File | Source step | Purpose |
|---|---|---|
| vacancy_predictions.pmtiles (~46 MB) | Step 17 | Full citywide parcel tileset, layer name parcels |
| vacancy_flagged.pmtiles (~2.2 MB) | Step 17 | Top one percent flagged parcels, layer name flagged |
| vacancy_predictions_flagged.geojson (~3.9 MB) | Step 16 | Flagged parcels as GeoJSON for the SHAP table backend |
| ward_boundaries.geojson | Step 16 | Polygon geometry for the 66 political wards |
| ward_stats.json | Step 16 | Per-ward rollups used to power the choropleth, summary cards, and ward filter list in static mode |
| dashboard_shap.json | Step 13 | Per-parcel SHAP risk drivers for the dashboard's "why this parcel" panel |
| colors_and_type.css | — | Shared colour palette and type scale across landing page, methodology, and dashboard |
dashboard.html auto-detects the host. On localhost or 127.0.0.1 it tries the local Flask backend first (docs/tileserver.py, backed by PostgreSQL/PostGIS via docs/load_db.py). On any other host it skips the backend entirely and falls back to the JSON files above.
| Feature | Local-server mode | Static mode |
|---|---|---|
| Map basemap and PMTiles | CARTO basemap, vector parcels from vacancy_predictions.pmtiles | Same |
| Parcel popups (click) | Full property bag from PMTiles | Full property bag from PMTiles |
| SHAP risk drivers | dashboard_shap.json | dashboard_shap.json |
| Ward choropleth | ward_stats.json + ward_boundaries.geojson | Same |
| Sidebar summary cards | /summary endpoint | Aggregated from ward_stats.json |
| Ward filter list | /wards endpoint | From ward_stats.json |
| Ward fly-to | /ward_bounds endpoint | Computed client-side from the ward boundary GeoJSON |
| Parcel search | DB-wide search via /search | Limited to parcels currently rendered in view (queryRenderedFeatures) |
| Census tract filter | /census_tracts and /tract_bounds | Disabled unless census_tracts.json is shipped |
The local Flask backend is optional. It exists for the case where the city wants DB-wide parcel search and tract-level filtering on a workstation. Everything in the public deployment is static.
Full GitHub Pages deployment notes live in docs/DEPLOY.md. The short version is:
# From the repo root, push the docs/ folder along with everything else: git add docs/ git commit -m "Update site" git push origin main # Settings → Pages → Source: Deploy from a branch, main / docs
The two PMTiles files stay in the repo as ordinary objects rather than Git LFS, because GitHub Pages does not serve LFS objects through its CDN. The largest file (vacancy_predictions.pmtiles, around 46 MB) sits comfortably under GitHub's 100 MB per-file limit. To host the tilesets on S3, Cloudflare R2, or any other static bucket instead, set the PMTILES_URL constant at the top of the <script> block in dashboard.html to the absolute URL — the PMTiles JS reader will stream byte ranges over HTTP.
docs/sync_methodology_assets.sh copies the latest figures from docs/graphs/ and docs/code/outputs/ into the methodology page so the public write-up stays in sync with the modeling pipeline.
| Source | Method | Value |
|---|---|---|
| 04a, random 30% test | Calibrated ensemble ROC-AUC | 0.9395 |
| 04a, random 30% test | Calibrated ensemble PR-AUC | 0.5461 |
| 04a, random 30% test | Calibrated ensemble Brier | 0.0068 |
| 04b, 10-fold ZIP-grouped CV | RF mean AUC | 0.8877 ± 0.0064 |
| 04b, LOGO sample of 15 ZIPs | RF mean AUC | 0.8849 |
| 04b_vpi, by income quintile | Quintile AUC range | 0.968 – 0.978 |
| 04b_vpi, by building category | Category AUC range | 0.942 – 0.980 |
| 04c, vs City VPI on residential | Ensemble AUC | 0.9786 |
| 04c, vs City VPI on residential | Ensemble PR-AUC | 0.7538 |
| 04c, vs City VPI on residential | VPI binary AUC | 0.7849 |
| 04c, at matched ~6.4K capacity | Ensemble precision / recall | 54.7% / 78.3% |
| 04c, at matched ~6.4K capacity | VPI precision / recall | 53.3% / 57.6% |
| 04d, evaluation half | Recalibrated ensemble AUC | 0.9375 |
| 04e, citywide at 1% ward capacity | Model precision / recall | 57.0% / 49.1% |
| 04g, new-cohort temporal proxy | Ensemble AUC | 0.9760 |
| 04g, old-cohort temporal proxy | Ensemble AUC | 0.9579 |
| 04h, 5-fold tract-grouped CV | RF mean AUC | 0.9677 ± 0.0083 |
The most defensible number to cite for the model's generalization performance is the block-CV AUC of around 0.9677 from Step 15. It controls for the leakage that random splits allow through neighborhood-level features and uses the right spatial granularity for how city operations actually play out.
| Decision | Choice | Reason |
|---|---|---|
| OVS definition | Three-source composite (C&S + violations + licenses) | No single source captures all vacancies. Composite reduces false negatives. |
| Temporal anchor | TRAIN_CUTOFF = 2025-10-01 | Hard cutoff prevents future data from influencing features or label. |
| C&S window | 2-year window plus demolition filter | Stale C&S records misrepresent current status. Demolished buildings are not vacant buildings. |
| Residential scope | Category codes 1, 2, 3, 14 only | Buildings and land require different feature vocabularies. Commercial properties introduce systematic false positives. |
| Commercial / parking filter | Pattern match on parcel building description | Catches parking garages and surface lots that survive the category code filter. |
| Life-history features | Trend & acceleration over 3yr / 5yr / 2yr windows | Single counts miss whether a property is getting worse, stable, or improving. |
| RTT integration | Transfer frequency, sheriff sales, log price change | Full deed history adds distress signals not in OPA's single most-recent sale. |
| Train / test split | Stratified random 70/30 | Temporal split caused distributional shift because recency is itself a vacancy proxy. |
| Spatial validation | 10-fold ZIP CV + LOGO + 5-fold tract block CV | Tests generalization independent of spatial autocorrelation. LOGO simulates new geography. Tract block CV is the honest stakeholder-reportable AUC. |
| No assessed value features | log_market_value, value_per_sqft excluded | OPA assessed values have documented racial and geographic bias. |
| Production score | 50/50 Logit + RF, isotonic at the end | Logit and RF capture different signal shape. Equal weighting beat tuned weighting in CV. Isotonic restores honest empirical positive rates. |
| No fixed binary threshold | Top 1% rank flag plus 5-tier rank bucket | Calibrated probabilities are all small because vacancy is rare. Use rank or capacity. |
| ZCTA boundaries for static maps | tigris package | PWD_PARCELS.geojson is over 400 MB. ZCTA is practical for rendering and presentations. Parcel-resolution lives in PMTiles. |
| Per-ward capacity flagging | 1% of each ward's residential parcels | Single citywide threshold concentrates too many flags in a few neighborhoods. Per-ward keeps inspection load sensible everywhere. |
| 04b mirrors 04a exactly | Identical model_vars, recipe, hyperparameters | Spatial CV and LOGO CV must evaluate the same model that is deployed. Divergence would produce misleading validation metrics. |
| No poverty rate in model | ACS variables removed from model_vars | Used only in the equity audits as an analysis dimension, not a feature. Avoids an external Census API dependency at training time. |
data_py/)| File | Contents |
|---|---|
| model_logit_final.rds | Fitted Logistic Regression pipeline. |
| model_rf_final.rds | Fitted Random Forest pipeline. |
| model_xgb_final.rds | Fitted XGBoost pipeline. Diagnostic comparator. |
| model_lgb_final.rds | Fitted LightGBM pipeline. Diagnostic comparator. |
| calibrators.rds | Original isotonic calibrators for each model and the ensemble. |
| calibrators_v2.rds | Held-out recalibrated ensemble calibrator. Recommended for production. |
| model_thresholds.csv | Per-model Youden thresholds plus AUC, sensitivity, specificity, and is_best. |
| File | Contents |
|---|---|
| all_predictions_rf.csv | One row per parcel with all probabilities, calibrated and raw, plus flag, risk_score, qtile_tier. |
| predictions_calibrated.csv | Adds ensemble_prob_v2 from the held-out calibrator. |
| predictions_04b.csv | Predictions joined to ZIP, tract, ward, and category. The right starting point for analysis. |
| predictions_with_ci.csv | Test-set predictions with RF tree-variance CI bounds. |
| parcel_shap_topflags.csv | Top 200 flagged parcels with five SHAP-attributed features each. |
| vs_city_vpi_parcel.csv | Per-parcel ensemble probability vs City VPI flag. |
| File | Contents |
|---|---|
| output_zip_summary.csv | Per-ZIP mean predicted, observed rate, count. |
| output_category_summary.csv | Per-category mean predicted, observed rate, count. |
| ward_summary_04b.csv | Per-ward rollup including high-risk count. |
| zip_auc_04b.csv | Per-ZIP AUC for ZIPs meeting the minimum-N filter. |
| equity_income_auc.csv | Per-income-quintile AUC, observed rate, mean predicted. |
| equity_zip_audit.csv | Per-ZIP flag rate vs observed vacancy rate. |
| File | Contents |
|---|---|
| spatial_cv_metrics.csv | Per-fold ZIP-grouped CV AUC and J-Index. |
| logo_cv_metrics.csv | Per-ZIP LOGO AUC for sampled ZIPs. |
| validation_summary.csv | Headline cross-validation metrics. |
| block_cv_rf.csv | Per-fold tract-grouped CV AUC. |
| temporal_validation.csv | Old and new cohort metrics. |
| capacity_threshold_curve.csv | Threshold → flagged-count → precision / recall lookup. |
| operational_flags_by_ward.csv | Per-ward flag counts under the 1% capacity policy. |
| operational_precision_at_capacity.csv | Citywide model vs VPI vs union summary. |
| File | Contents |
|---|---|
| vacancy_risk_map.html | Interactive Leaflet ZCTA-level map. |
| vacancy_predictions.geojson | ~436K parcel polygons with all predictions. |
| vacancy_predictions_flagged.geojson | ~4,500 flagged parcels with all predictions. |
| tiles/vacancy_flagged.pmtiles | ~2.2 MB. Top 1%. |
| tiles/vacancy_predictions.pmtiles | ~46 MB. Full citywide. |
| vs_city_vpi_map.html | Interactive bucket overlay. |
| vs_city_vpi_high_prob_ovs0_map.html | Candidate-review map for high-probability OVS = 0 parcels. |
All data from OpenDataPhilly unless noted.
| File | Source | Purpose |
|---|---|---|
| opa_properties_public.csv | OPA | Parcel registry, property characteristics |
| violations.csv | L&I | Code violation history |
| business_licenses.csv | L&I | License history including vacant property licenses |
| clean_seal.csv | L&I | City-initiated boarding and securing actions |
| unsafe.csv | L&I | Unsafe structure orders |
| imm_dang.csv | L&I | Imminently dangerous structure orders |
| permits.csv | L&I | Building and zoning permits including demolition and new construction |
| RTT_SUMMARY.csv | Philadelphia Revenue | Full deed and transfer history |
| VIOLATION_DEFINITION.csv | L&I | Violation code title lookup |
| vpi_bldg.geojson | City of Philadelphia | Vacant Property Indicator binary flag |
| ACS 2022 5-year B19013_001E | US Census Bureau | Census-tract median household income for the equity audit |