Chapter 3 Predicting the State House, Part 3

3.1 Prepping the data for analysis

After having made such a big deal about Relational Databases, we need to create a boring old rectangular data.frame for our regression analysis. Let’s do that here.

Our target is a wide results table, with one row for each race, and sufficient columns for any race-level covariates we want to include in the model.

library(tidyverse)
source("utils/util.R")

load("data/relational_db.rda")
load("outputs/geographies_output.rda")

3.2 Thinking ahead

Let’s think about how we want to model the data. That’ll inform what columns we need.

First, I decide to model only the “two-party” vote, dropping everyone who’s not a Democrat or Republican. Remember that candidates has an imputed party_replaced for candidates who sneakily ran as third-parties for an uncontested election.

results_with_parties <- geography_results %>%
  inner_join(
    races %>% filter(substr(election, 6, 6) == "G")
  ) %>%
  left_join(candidates_to_races) %>%
  select(race, candidate, office, cofips:sth, vote_total, GEOID, party_replaced) 

## Make sure only Dem or Rep won. If a third party ever won, we'd need to rethink this...
results_with_parties %>%
  group_by(race, office, party_replaced) %>%
  summarise(vote_total = sum(vote_total)) %>%
  group_by(race) %>%
  mutate(rank = rank(-vote_total)) %>%
  filter(rank <= 2) %>%
  group_by(office, party_replaced, rank) %>%
  count() %>%
  spread(rank, n)

## # A tibble: 39 x 4
## # Groups:   office, party_replaced [39]
##    office party_replaced   `1`   `2`
##    <chr>  <chr>          <int> <int>
##  1 GOV    DEM                3     1
##  2 GOV    REP                1     3
##  3 STH    ACT               NA     1
##  4 STH    CON               NA     1
##  5 STH    CST               NA     3
##  6 STH    D/G               NA     1
##  7 STH    DBP               NA     1
##  8 STH    DEM              742   504
##  9 STH    F4B               NA     1
## 10 STH    F89               NA     1
## # ... with 29 more rows

Ok, we’ve got cleaned parties.

3.3 The model

Here, I’ll preview the model I use in Making the Predictions, since that dictates what data we need. How will we predict the votes? Here’s the model.

Let’s call \(sth_{yr}\) the two-party percent of the vote for State House in year \(y\) in race \(r\).

In a best case scenario, we would have polls of voters. Then we could capture simple-seeming things that our data has no idea of: How charismatic is a candidate? Are they well organized? Have they been running ads? Polls are what FiveThirtyEight uses, and why they get such good predictions. Of course, noboday actually publicly polls the 203 PA State House races.

In a worst case scenario, we would have to use only data from prior elections. This would leave us completely unable to predict large swings in public sentiment, and we would have to expand our uncertainty to capture the full range of election-level random effects (aka the way that all races are correlated from year to year). Imagine trying to predict the 2018 election using only the 2016 and 2014 results, without any data from 2018 that signaled Something Is Different. We would need to produce predictions capable of saying both “maybe this year is like 2010” and “maybe this year is like 2006”.

Luckily, we are somewhere in between. While we don’t have polling on this year’s State House races, we do have polling on the US Congressional races. To the extent that USC races are correlated with STH races (probably a lot), we can use the USC polls to estimate the overall tenor of the race. Better yet, we don’t have to actually use polling data itself, because FiveThirtyEight has already translated them into predicted votes. _sunglassesemoji_ Here’s the plan: model the results in a State House race as a function of past results in that district along with the US Congress results in that year:

\[ \begin{align*} sth_{yr} =& \beta_0 + \beta_1 sth_{y-1,r} + \beta_2 incumbent\_is\_dem_{yr} + \beta_3 sth_{y-1,r} * incumbent\_is\_dem_{yr} \\ &+ \beta_4 usc_{y,usc(r)} + \beta_5 usc_{y,PA} + \beta_6 uspgov_{y-1, r} + \beta_7 uspgov_{y, PA} + \beta_8 uspgov_{y-1, PA} + \epsilon_{yr} \end{align*} \] where \(incumbent\_is\_dem\) is \(1\) if the democratic candidate is an incumbent, \(-1\) if the republican is, and \(0\) otherwise; \(usc(r)\) is the result in the entire USC district that race \(r\) belongs to (as opposed to just in the precinct); the subscript \(PA\) represents the state-wide results; and \(uspgov\) is the result of either the USP or the GOV race, whichever occurred that year.

One thing to note is that I don’t include year-level random effects. Instead, I parametrize the vote using several annual-level covariates: this year’s USPGOV results, last year’s USPGOV results, this year’s total USC results, and an overall mean. That’s four degrees of freedom used up when we’re only going to be able to use data from seven elections; we’re already on thin ice, and hopefully uncertainty in those \(\beta\)s capture annual variations. We’ll check when we model

The above results will work well for races where everything was contested, but we clearly shouldn’t include uncontested races. So we’ll do two things: for races that are uncontested in year \(y\), we will not model them at all, since we know the running candidate will win 100% of the vote. For races that were not contested last cycle but are contested this year, we will model them entirely separately, using a different equation:

Model for formerly uncontested races: \[ \begin{align*} sth_{yr} =& \beta_0 + \beta_1 dem\_is\_uncontested_{y-1, r} + \beta_2 dem\_is\_uncontested_{y-1, r} * incumbent\_is\_running_{y,r} \\ &+ \beta_4 usc_{y,usc(r)} + \beta_5 usc_{y,PA} + \beta_6 uspgov_{y-1, r} + \beta_7 uspgov_{y, PA} + \beta_8 uspgov_{y-1, PA} + \epsilon_{yr} \end{align*} \]

So now we know what we need: a dataframe with one row per State House race, with the last year’s STH results, and this year’s USP/GOV and USC results.

First, let’s create the wide table, with the STH results.

## from here on out we only consider 2-party vote:
results_with_parties <- results_with_parties %>%
  filter(party_replaced %in% c("DEM", "REP")) 

rep_na_0 <- function(x) replace(x, is.na(x), 0)  
sth_races <- results_with_parties %>%
  inner_join(races %>% filter(office == 'STH'))  %>% 
  group_by(race, sth, party_replaced, candidate) %>%
  summarise(votes_sth = sum(vote_total)) %>%
  left_join(candidates_to_races %>% select(race, candidate, is_incumbent)) %>%
  gather("key", "value", votes_sth, candidate, is_incumbent) %>%
  unite("key", party_replaced, key) %>%
  spread(key, value, convert = TRUE) %>%
  mutate(
    sth_pctdem = rep_na_0(DEM_votes_sth) / (rep_na_0(DEM_votes_sth) + rep_na_0(REP_votes_sth))
  )

sth_races$incumbent_is_dem <- 0
sth_races$incumbent_is_dem <- with(
  sth_races, 
  replace(incumbent_is_dem, !is.na(DEM_is_incumbent) & as.logical(DEM_is_incumbent), 1)
)

sth_races$incumbent_is_dem <- with(
  sth_races, 
  replace(incumbent_is_dem, !is.na(REP_is_incumbent) & as.logical(REP_is_incumbent), -1)
)

sth_races$dem_is_uncontested <- with(
  sth_races,
  ifelse(is.na(REP_candidate), 1, ifelse(is.na(DEM_candidate), -1, 0))
)

## We'll call our main table df
df <- sth_races %>% mutate(year = substr(race,1,4))
head(df)

## # A tibble: 6 x 12
## # Groups:   race, sth [6]
##   race  sth   DEM_candidate DEM_is_incumbent DEM_votes_sth REP_candidate
##   <chr> <chr> <chr>         <lgl>                    <int> <chr>        
## 1 2002~ 001   LINDA BEBKOJ~ NA                        8895 BILL STEPHAN~
## 2 2002~ 010   <NA>          NA                          NA FRANK LAGROT~
## 3 2002~ 100   BRUCE BEARDS~ NA                        3308 GIBSON C ARM~
## 4 2002~ 101   NOEL HUBLER ~ NA                        5358 MAUREE A GIN~
## 5 2002~ 102   DAN BACKENST~ NA                        3965 PETER J ZUG ~
## 6 2002~ 103   RON BUXTON (~ NA                        7866 SHERMAN C CU~
## # ... with 6 more variables: REP_is_incumbent <lgl>, REP_votes_sth <dbl>,
## #   sth_pctdem <dbl>, incumbent_is_dem <dbl>, dem_is_uncontested <dbl>,
## #   year <chr>

Next, let’s get the USC results from the same year. We want to calculate the overall USC results, then apportion them to STH districts. While we could use precinct-level USC results to train the model, we don’t have precinct-level predictions from FiveThirtyEight, and will need to use only topline results for our predictions. So we’ll mimic that here.

usc_results <- results_with_parties %>%
  inner_join(races %>% filter(office == 'USC')) %>%
  group_by(race, usc, party_replaced, candidate) %>%
  summarise(votes_usc = sum(vote_total)) %>%
  group_by(race) %>%
  mutate(
    cand_pct = votes_usc / sum(votes_usc)
  )

Some of the USC races were uncontested, which will unhelpfully show up as 100% wins, and potentially skew our predictions. Let’s simplistically impute what those races would have been had they been contested, by regressing on the district’s vote in the USP/GOV race. (Typical studies find that the uncontested candidate would have won by 60%-95%.)

usc_wide <- usc_results %>% 
  select(race, usc, party_replaced, cand_pct) %>%
  spread(party_replaced, cand_pct, fill=0) %>%
  mutate(
    is_uncontested = ifelse(
      DEM == 0, "REP", ifelse(REP == 0, "DEM", "contested")
    ),
    year = substr(race,1,4),
    usc_pctdem_2party = DEM / (DEM + REP),
    total_votes = DEM + REP
  )

uspgov_by_usc <- results_with_parties %>%
  filter(office %in% c("GOV", "USP")) %>%
  mutate(year = substr(race, 1, 4)) %>%
  group_by(year, usc, party_replaced) %>%
  summarise(vote_total = sum(vote_total)) %>%
  spread(party_replaced, vote_total) %>%
  mutate(uspgov_pctdem_2party = DEM / (DEM + REP)) %>%
  select(-DEM, -REP)

imputation_df <- usc_wide %>% left_join(uspgov_by_usc)
imputation_fit <- lm(
  usc_pctdem_2party ~ uspgov_pctdem_2party + factor(year),
  data=imputation_df %>% filter(is_uncontested == 'contested')
)

summary(imputation_fit)

## 
## Call:
## lm(formula = usc_pctdem_2party ~ uspgov_pctdem_2party + factor(year), 
##     data = imputation_df %>% filter(is_uncontested == "contested"))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.226962 -0.049530 -0.007464  0.042506  0.296493 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -0.07601    0.03669  -2.072  0.04046 *  
## uspgov_pctdem_2party  0.99908    0.04848  20.608  < 2e-16 ***
## factor(year)2004      0.04495    0.03214   1.399  0.16454    
## factor(year)2006      0.02414    0.03003   0.804  0.42321    
## factor(year)2008      0.07660    0.02978   2.572  0.01135 *  
## factor(year)2010      0.10078    0.03064   3.289  0.00132 ** 
## factor(year)2012      0.05891    0.02988   1.972  0.05097 .  
## factor(year)2014     -0.00992    0.03096  -0.320  0.74921    
## factor(year)2016      0.06344    0.03128   2.028  0.04479 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0815 on 118 degrees of freedom
## Multiple R-squared:  0.7886, Adjusted R-squared:  0.7743 
## F-statistic: 55.02 on 8 and 118 DF,  p-value: < 2.2e-16

Let’s make sure they look sane:

imputation_df$usc_pctdem_2party_imputed <- ifelse(
  imputation_df$is_uncontested == 'contested',
  imputation_df$usc_pctdem_2party,
  predict(imputation_fit, newdata = imputation_df)
)

ggplot(imputation_df, aes(x=uspgov_pctdem_2party, y=usc_pctdem_2party_imputed)) +
  geom_point(aes(color=is_uncontested)) +
  theme_sixtysix()

Sure, they’re fine. The imputed values aren’t obviously different than the contested races. Let’s apportion the USC results to STH districts using a population-weighted average.

usc_to_sth <- geographies %>% 
  as.data.frame() %>%
  group_by(vintage, sth, usc) %>%
  summarise(
    pop10=sum(pop10, na.rm=TRUE)
  ) %>%
  group_by(vintage, usc) %>%
  mutate(frac_of_usc = pop10 / sum(pop10))

sth_usc_results <- imputation_df %>%
  left_join(year_to_geo_vintage) %>%
  left_join(usc_to_sth) %>%
  group_by(year, sth) %>%
  summarise(
    usc_pctdem = weighted.mean(usc_pctdem_2party_imputed, w = total_votes * frac_of_usc)
  )

Join the USC results to the STH results.

df <- left_join(df, sth_usc_results)
tail(df)

## # A tibble: 6 x 13
## # Groups:   race, sth [6]
##   race  sth   DEM_candidate DEM_is_incumbent DEM_votes_sth REP_candidate
##   <chr> <chr> <chr>         <lgl>                    <int> <chr>        
## 1 2016~ 094   <NA>          NA                          NA STANLEY E SA~
## 2 2016~ 095   CAROL HILLEV~ FALSE                    13726 JOEL L SEARS~
## 3 2016~ 096   PETER MICHAE~ TRUE                     17340 ROBERT F BIG~
## 4 2016~ 097   CHARLES J KL~ FALSE                    13403 STEVEN CURTI~
## 5 2016~ 098   <NA>          NA                          NA DAVID S HICK~
## 6 2016~ 099   DUANE A GROF~ FALSE                     6219 DAVID H ZIMM~
## # ... with 7 more variables: REP_is_incumbent <lgl>, REP_votes_sth <dbl>,
## #   sth_pctdem <dbl>, incumbent_is_dem <dbl>, dem_is_uncontested <dbl>,
## #   year <chr>, usc_pctdem <dbl>

Now let’s add the state-wide USP/GOV and USC races:

statewide_results <- results_with_parties %>% 
  inner_join(
    races %>% 
      filter(office %in% c("USP","GOV", "USC")) %>%
      inner_join(elections %>% filter(election_type == "G"))
  ) %>% 
  mutate(office = ifelse(office %in% c("USP", "GOV"), "USPGOV", office)) %>%
  group_by(election_year, party_replaced, office) %>%
  summarise(vote_total=sum(vote_total)) %>%
  unite("key", office, party_replaced) %>%
  spread(key, vote_total) %>%
  mutate(
    uspgov_pctdem_statewide = USPGOV_DEM / (USPGOV_DEM + USPGOV_REP),
    usc_pctdem_statewide = USC_DEM / (USC_DEM + USC_REP)
  )

df <- df %>% left_join(
  statewide_results %>% select(election_year, uspgov_pctdem_statewide, usc_pctdem_statewide), 
  by=c("year"="election_year")
)

head(df)

## # A tibble: 6 x 15
## # Groups:   race, sth [6]
##   race  sth   DEM_candidate DEM_is_incumbent DEM_votes_sth REP_candidate
##   <chr> <chr> <chr>         <lgl>                    <int> <chr>        
## 1 2002~ 001   LINDA BEBKOJ~ NA                        8895 BILL STEPHAN~
## 2 2002~ 010   <NA>          NA                          NA FRANK LAGROT~
## 3 2002~ 100   BRUCE BEARDS~ NA                        3308 GIBSON C ARM~
## 4 2002~ 101   NOEL HUBLER ~ NA                        5358 MAUREE A GIN~
## 5 2002~ 102   DAN BACKENST~ NA                        3965 PETER J ZUG ~
## 6 2002~ 103   RON BUXTON (~ NA                        7866 SHERMAN C CU~
## # ... with 9 more variables: REP_is_incumbent <lgl>, REP_votes_sth <dbl>,
## #   sth_pctdem <dbl>, incumbent_is_dem <dbl>, dem_is_uncontested <dbl>,
## #   year <chr>, usc_pctdem <dbl>, uspgov_pctdem_statewide <dbl>,
## #   usc_pctdem_statewide <dbl>

Now for the hard part: adding lagged STH results. We need to crosswalk the past election results forward to the next year, using the crosswalks we made.

## first, get the results for each year
sth_uspgov_results <- results_with_parties %>%
  mutate(
    office = replace(office, office %in% c("GOV", "USP"), "USPGOV"),
    year = substr(race, 1, 4)
  ) %>%
  filter(office %in% c("STH", "USPGOV")) %>%
  select(year, office, cofips:sth, GEOID, party_replaced, vote_total) %>%
  unite("key", office, party_replaced) %>%
  spread(key, vote_total, fill=0) %>%
  mutate(
    sth_is_uncontested = ifelse(STH_DEM == 0, "REP", ifelse(STH_REP == 0, 'DEM', 'contested')),
    sth_pctdem = STH_DEM / (STH_DEM + STH_REP),
    uspgov_pctdem = USPGOV_DEM / (USPGOV_DEM + USPGOV_REP)
  )

## now, walk them forward.
results_list <- list()
for(year_ in seq(2002, 2016, 2)){
  print(year_)
  vintage <- year_to_geo_vintage$vintage[year_to_geo_vintage$year == year_]
  needed_vintage <- year_to_geo_vintage$vintage[year_to_geo_vintage$year == (year_+2)]
  
  if(vintage != needed_vintage){
    needed_crosswalk <- paste(vintage, needed_vintage, sep=',')
    cw <- crosswalks[[needed_crosswalk]] %>% filter(pop > 0 & !is.na(pop))
  } else {
    ## dummmy cw
    cw <- data.frame(
      xid = unique(sth_uspgov_results$GEOID),
      yid = unique(sth_uspgov_results$GEOID),
      frac_of_x=1,
      frac_of_y=1
    )
  }
  
  geo_results_lagged <- sth_uspgov_results %>%
    filter(year == year_) %>%
    left_join(cw, by = c("GEOID" = "xid")) %>%
    group_by(yid, year) %>%
    summarise(
      sth_pctdem_lagged = weighted.mean(sth_pctdem, w = frac_of_y, na.rm=TRUE),
      uspgov_pctdem_lagged = weighted.mean(uspgov_pctdem, w = frac_of_y, na.rm=TRUE),
      sth_frac_contested_lagged = weighted.mean(sth_is_uncontested == "contested", w = frac_of_y, na.rm=TRUE)
    ) %>%
    mutate(year = as.character(asnum(year) + 2)) %>%
    rename(GEOID = yid)
  
  sth_results_lagged <- geo_results_lagged %>%
    left_join(year_to_geo_vintage) %>%
    left_join(geographies %>% as.data.frame %>% select(-geometry)) %>%
    group_by(year, sth) %>%
    summarise(
      sth_pctdem_lagged = weighted.mean(sth_pctdem_lagged, w=pop10, na.rm=TRUE),
      uspgov_pctdem_lagged = weighted.mean(uspgov_pctdem_lagged, w=pop10, na.rm=TRUE),
      sth_frac_contested_lagged = weighted.mean(sth_frac_contested_lagged, w=pop10, na.rm=TRUE)
    )
  
  results_list[[as.character(year_)]] <- sth_results_lagged
}

## [1] 2002
## [1] 2004
## [1] 2006
## [1] 2008
## [1] 2010
## [1] 2012
## [1] 2014
## [1] 2016

sth_lagged <- do.call(rbind, results_list)

df <- df %>% left_join(sth_lagged)

Finally, we need the state-wide USPGOV results from the last year.

statewide_lagged <- statewide_results %>%
  group_by() %>%
  mutate(year = as.character(asnum(election_year)+2)) %>%
  rename(
    uspgov_pctdem_statewide_lagged = uspgov_pctdem_statewide,
    usc_pctdem_statewide_lagged = usc_pctdem_statewide
  ) %>%
  select(year, uspgov_pctdem_statewide_lagged, usc_pctdem_statewide_lagged)

df <- left_join(df, statewide_lagged)

We’re done with our training data! We have a rectangular data.frame with sufficient data for the equation above. Save it and move on.

save(df, file="outputs/df.rda")

3.4 Building 2018

All that’s left is to create the corresponding table for 2018, which we will use for our predictions. We obviously don’t have the results, but we can populate the race characteristics, the lagged results, and the USC estimates from FiveThirtyEight.

df_2018 <- sth_lagged %>%
  filter(year == 2018) %>%
  left_join(statewide_lagged) %>%
  filter(!is.na(sth))

## fivethirtyeight's prediction for 2018 wolf is 57%
df_2018$uspgov_pctdem_statewide <- 0.57

## load the race particulars
cands_2018 <- read_csv("data/2018_cands.csv") %>% 
  rename(
    d_inc = incumbent_is_dem,
    r_inc = incumbent_is_rep
  )

cands_2018 <- cands_2018 %>%
  filter(office == 'STH') %>%
  mutate(
    sth = sprintf("%03d", asnum(district)),
    dem_is_uncontested = ifelse(
      dem_cand == "No candidate",
      -1, 
      ifelse(rep_cand == "No candidate", 1, 0)
    ),
    incumbent_is_dem = ifelse(d_inc, 1, ifelse(r_inc, -1, 0))
  )

df_2018 <- df_2018 %>% left_join(
  cands_2018 %>% select(sth, dem_is_uncontested, incumbent_is_dem)
)

## now we need the state predictions
fivethirtyeight <- read_csv("data/congress_races_2018_538.csv") %>%
  mutate(usc = sprintf("%02d", district)) %>%
  rename(usc_pctdem = fivethirtyeight)

## notice that one race is uncontested. we need to impute it.
## it looks like uncontested winners should be imputed as 66%:
print(
  imputation_df %>% 
    group_by(is_uncontested) %>% 
    summarise(mean_imputed = mean(usc_pctdem_2party_imputed))
)

## # A tibble: 3 x 2
##   is_uncontested mean_imputed
##   <chr>                 <dbl>
## 1 contested             0.501
## 2 DEM                   0.665
## 3 REP                   0.344

fivethirtyeight$usc_pctdem[fivethirtyeight$district == '18'] <- 0.66

df_2018$usc_pctdem_statewide <- mean(fivethirtyeight$usc_pctdem)

sth_usc_538 <- geographies %>% 
  as.data.frame() %>% select(-geometry) %>%
  filter(vintage == 2018 & !is.na(pop10)) %>%
  left_join(
    fivethirtyeight %>% select(usc, usc_pctdem) 
  ) %>%
  group_by(sth) %>%
  summarise(
    usc_pctdem = weighted.mean(usc_pctdem, w = pop10, na.rm = TRUE)
  )

df_2018 <- left_join(df_2018, sth_usc_538)

df_2018$race <- paste0("2018 G STH STH-", df_2018$sth)
head(df_2018)

## # A tibble: 6 x 13
## # Groups:   year [1]
##   year  sth   sth_pctdem_lagg~ uspgov_pctdem_l~ sth_frac_contes~
##   <chr> <chr>            <dbl>            <dbl>            <dbl>
## 1 2018  001              0.736            0.661                1
## 2 2018  002              0.646            0.580                1
## 3 2018  003              0.597            0.480                1
## 4 2018  004              0                0.362                0
## 5 2018  005              0                0.321                0
## 6 2018  006              0.400            0.351                1
## # ... with 8 more variables: uspgov_pctdem_statewide_lagged <dbl>,
## #   usc_pctdem_statewide_lagged <dbl>, uspgov_pctdem_statewide <dbl>,
## #   dem_is_uncontested <dbl>, incumbent_is_dem <dbl>,
## #   usc_pctdem_statewide <dbl>, usc_pctdem <dbl>, race <chr>

## I think we're done
save(df_2018, file="outputs/df_2018.rda")

So that’s it! We’ve got our data. Now for the modeling.