First, let’s load the Relational Database.
library(tidyverse)
source("utils/util.R")
load("data/relational_db.rda")
The results
dataframe contains precinct-level results. For our model, we are going to need to compare these results from year to year (for example, to use last election’s results to predict this one). This can be a problem. Districts can, and do, change from year to year. There are two ways this can happen, one measurable and one not.
First, the measurable: the state legislature passed new boundaries following the 2010 US Census. Thus, the districts from 2010 and before are completely different from those 2014 and after. The middle year, 2012, was particularly complicated. The Supreme Court had struck down the State House (STH) and State Senate (STS) boundaries, but allowed the US Congress (USC) boundaries; that election used the pre-2012 boundaries for state elections, post-2012 boundaries for USC. And then in 2018, of course, the PA Supreme Court redrew the USC boundaries, but not the state ones.
Here’s a table of elections to vintage of districts:
Election | State Boundaries | USC Boundaries |
---|---|---|
2010 and before | pre-2012 | pre-2012 |
2012 | pre-2012 | post-2012 |
2014-2016 | post-2012 | post-2012 |
2018 | post-2012 | 2018 |
But that’s all measurable. We can (and will) use GIS to create crosswalks from one to the other.
More pernicious is that municipalities seem to arbitrarily change the names and boundaries of their precincts. So precinct 1 in a township in 2004 possibly has zero overlap with that precinct in 2006. Ideally, we would have a shapefile of the precincts and could map them out. But we don’t. As far as I can tell, the only requirement for municipalities is to provide pdfs or even textual descriptions of their new precincts (e.g. "South of Pine Street, down to Maple St."). After gallantly downloading a few dozen pdfs, I gave up on attempts to map this out.
(Aside: There’s a great project to collect and create shapefiles for all precincts used in the 2016 election, but even that would only be for a single year, and not the full set of precincts we need.)
The table precincts_to_districts
contains a matching of each precinct and what STH, STS, and USC districts they voted for in a given election. This is pulled entirely from the results table; there’s no GIS used.
For example, here are precinct ids that changed State House districts between 2014 and 2016. Since the STH districts didn’t change that year, the only explanation is that the precinct boundaries are in totally new places. Four precincts in Delaware County appear to have swapped Congressional and State House districts. This could be an actual change in the map, a data error, who knows.
inner_join(
precincts_to_districts %>%
filter(election == "2014 G"),
precincts_to_districts %>%
filter(election == "2016 G"),
by=c('county_code_.', 'precinct_code', 'vtd_code')
) %>%
filter(sth.x != sth.y) %>% as.data.frame()
## election.x county_code_. precinct_code vtd_code cofips.x usc.x sts.x
## 1 2014 G 23 3350 3370 045 01 026
## 2 2014 G 23 3380 3400 045 01 026
## 3 2014 G 23 3470 3490 045 07 026
## 4 2014 G 23 3510 3530 045 07 026
## sth.x election.y cofips.y usc.y sts.y sth.y
## 1 164 2016 G 045 07 026 163
## 2 164 2016 G 045 07 026 163
## 3 163 2016 G 045 01 026 164
## 4 163 2016 G 045 01 026 164
The one thing that I do trust is the county and STH, STS, and USC boundaries. So we’ll use those as our anchor. In other words, if precinct code 3350 voted for Congressional District 1 in 2014, and for CD 7 in 2016, we’ll assume that the boundaries moved, and not that the same people voted for different districts.
So let’s create our own geographies. The idea is that we will identify unique combinations of STH, STS, USC, and county boundaries using shapefiles. We can then confidently map the precinct results to them, because we know which STH, STS, and USC each precinct voted for in a given election. These geographies will thus be larger than precincts, but smaller than STH boundaries (potentially equal to, if an STH is entirely contained withing STS, USC, and county). I call these, imaginatively, geographies
.
We will need four vintages of these: pre-2012, 2012, post-2012, and 2018 (see the table above). Let’s load the shapefiles, and calculate the intersections of the four geometries:
## Warning, I haphazardly move between sf and sp packages,
## based on my preferred function.
library(sf)
library(rgeos)
library(rgdal)
library(sp)
sth_pre2012 <- st_read("data/state_house/tigris_lower_house_2011.shp", quiet=TRUE) %>%
mutate(sth = SLDLST)
sth_post2012 <- st_read("data/state_house/tigris_lower_house_2015.shp", quiet=TRUE) %>%
mutate(sth = SLDLST)
sts_pre2012 <- st_read("data/state_senate/tigris_upper_house_2011.shp", quiet=TRUE) %>%
mutate(sts = SLDUST)
sts_post2012 <- st_read("data/state_senate/tigris_upper_house_2015.shp", quiet=TRUE) %>%
mutate(sts = SLDUST)
usc_pre2012 <- st_read("data/congress/tigris_usc_2011.shp", quiet=TRUE) %>%
mutate(usc = CD111FP)
usc_post2012 <- st_read("data/congress/tigris_usc_2015.shp", quiet=TRUE) %>%
mutate(usc = CD114FP)
usc_2018 <- st_read("data/congress/supcourt_usc_2018.shp", quiet=TRUE) %>%
mutate(usc = DISTRICT)
counties <- st_read("data/census/tigris_counties_2015.shp", quiet=TRUE) %>%
mutate(cofips = COUNTYFP)
## some helper functions
get_sp_id <- function(s) sapply(s@polygons, slot, "ID")
gInt_byid <- function(x, y, sep="_"){
int <- gIntersection(x, y, byid=TRUE, drop_lower_td=TRUE)
int <- gBuffer(int, byid=TRUE, width=0)
int <- spChFIDs(int, gsub("\\s", sep, get_sp_id(int)))
return(int)
}
sf_to_sp <- function(sf, idcol){
sf <- as(sf, "Spatial")
sf <- spChFIDs(sf, as.character(sf@data[,idcol]))
return(sf)
}
get_intersection <- function(sth, sts, usc, counties, vintage){
## I find the sf intersection functions hard to work with, so we'll use sp
sth <- sf_to_sp(sth, 'sth')
sts <- sf_to_sp(sts, 'sts')
usc <- sf_to_sp(usc, 'usc')
counties <- sf_to_sp(counties, 'cofips')
sth_sts <- gInt_byid(sth, sts)
sth_sts_usc <- gInt_byid(sth_sts, usc)
sth_sts_usc_co <- gInt_byid(sth_sts_usc, counties)
sth_sts_usc_co <- spChFIDs(
sth_sts_usc_co,
paste0(vintage, "_", row.names(sth_sts_usc_co))
)
id_df <- data.frame(GEOID = row.names(sth_sts_usc_co)) %>%
separate(
GEOID,
into=c('vintage', 'sth', 'sts', 'usc','cofips'),
sep="_",
remove=FALSE
)
sth_sts_usc_co <- SpatialPolygonsDataFrame(
sth_sts_usc_co,
id_df,
match.ID=FALSE
)
sth_sts_usc_co <- st_as_sf(sth_sts_usc_co)
return(sth_sts_usc_co)
}
## get and store the geographies
geographies <- list()
geographies[['pre2012']] <- get_intersection(
sth_pre2012, sts_pre2012, usc_pre2012, counties, "pre2012"
)
geographies[['2012']] <- get_intersection(
sth_pre2012, sts_pre2012, usc_post2012, counties, "2012"
)
geographies[['post2012']] <- get_intersection(
sth_post2012, sts_post2012, usc_post2012, counties, "post2012"
)
geographies[['2018']] <- get_intersection(
sth_post2012, sts_post2012, usc_2018, counties, "2018"
)
## create a single table to rule them all
geographies <- do.call(rbind, geographies)
## one useful table matching years to their vintage
year_to_geo_vintage <- data.frame(
year = as.character(seq(2002, 2018, 2)),
vintage = c(rep("pre2012", 5), "2012", rep("post2012", 2), "2018"),
stringsAsFactors = FALSE
)
It’s worth noting that geographies
are weird. They are wildly different sizes, and in general represent vastly different populations. One geography might be an entire state house district (if it’s nested within a state senate, congressional, and county boundary). Another might be a tiny sliver with no population, where a nub of one district crosses into a different county. We’ll need to keep this in mind when we model them later.
For example, here are the 2014-2016 geographies for Philadelphia:
ggplot(
geographies %>% filter(vintage == 'post2012' & cofips == 101)
) +
geom_sf() +
theme_map_sixtysix()
To be able to ignore precincts forever hereafter, let’s aggregate the results from precincts to geographies.
tmp_result_list <- list()
for(year in seq(2002, 2016, 2)){
vintage <- year_to_geo_vintage$vintage[year_to_geo_vintage$year == year]
geo <- geographies[geographies$vintage == vintage,] %>%
as.data.frame() %>%
select(-geometry)
tmp_result <- results %>%
filter(substr(race, 1, 4) == year) %>%
left_join(races %>% select(race, election)) %>%
left_join(precincts_to_districts) %>%
left_join(geo) %>%
group_by(race, candidate, party, cofips, usc, sts, sth) %>%
summarise(vote_total = sum(vote_total)) %>%
group_by() %>%
mutate(
GEOID = paste(vintage, sth, sts, usc, cofips, sep="_")
)
tmp_result_list[[as.character(year)]] <- tmp_result
}
geography_results <- bind_rows(tmp_result_list)
head(geography_results)
## # A tibble: 6 x 9
## race candidate party cofips usc sts sth vote_total GEOID
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 2002 G ~ ED RENDELL~ DEM 001 19 033 091 5393 pre2012_~
## 2 2002 G ~ ED RENDELL~ DEM 001 19 033 193 2339 pre2012_~
## 3 2002 G ~ ED RENDELL~ DEM 003 04 037 016 698 pre2012_~
## 4 2002 G ~ ED RENDELL~ DEM 003 04 037 027 876 pre2012_~
## 5 2002 G ~ ED RENDELL~ DEM 003 04 037 044 2024 pre2012_~
## 6 2002 G ~ ED RENDELL~ DEM 003 04 038 024 503 pre2012_~
Let’s do a quick check to make sure we didn’t duplicate any votes:
usp_2016 <- "2016 G USP USP-PA"
new_results <- geography_results %>%
filter(race == usp_2016) %>%
group_by(candidate) %>%
summarise(new_votes = sum(vote_total))
old_results <- results %>%
filter(race == usp_2016) %>%
group_by(candidate) %>%
summarise(old_votes = sum(vote_total))
left_join(new_results, old_results)
## # A tibble: 5 x 3
## candidate new_votes old_votes
## <chr> <dbl> <dbl>
## 1 DARRELL L CASTLE (USP) 21566 21566
## 2 DONALD J TRUMP (USP) 2970378 2970378
## 3 GARY E JOHNSON (USP) 146659 146659
## 4 HILLARY CLINTON (USP) 2925758 2925758
## 5 JILL STEIN (USP) 49935 49935
Yep, Donald Trump still won.
We’re going to want to compare the results from one year to the next, so we need to map vintages of the geographies to each other. In order to do that, let’s build a crosswalk. People often use areal crosswalks, but we’ll do one better with population-weighted crosswalks.
The strategy is to create the geographic intersections of geographies x
and y
, use Census Blocks to calculate the population in each intersection, and then apportion votes in x
based on the fraction of its population that overlaps with y
. We’ll use blocks from the 2010 Census. Using populations from a single year won’t accurately weight places where population are changing, but will provide a darn good first order approximation. (You could imagine being even better, weighting by voting age population, or even downloading the voter file and weighting by voters).
block_pops <- read_csv(
"data/census/block10_centroid_pops.csv",
col_types = cols(GEOID10 = col_character())
)
## get rid of NA pop blocks. (I have no idea what these are, but they have 0 ALAND)
block_pops %>% filter(is.na(pop10))
## # A tibble: 7 x 9
## STATEFP10 COUNTYFP10 TRACTCE10 BLOCKCE10 GEOID10 pop10 ALAND10 INTPTLAT10
## <int> <chr> <chr> <int> <chr> <int> <int> <dbl>
## 1 42 049 990000 2 420499~ NA 0 42.2
## 2 42 049 990000 6 420499~ NA 0 42.2
## 3 42 049 990000 1 420499~ NA 0 42.3
## 4 42 049 990000 3 420499~ NA 0 42.2
## 5 42 049 990000 4 420499~ NA 0 42.1
## 6 42 049 012400 1 420490~ NA 0 42.1
## 7 42 049 990000 5 420499~ NA 0 42.2
## # ... with 1 more variable: INTPTLON10 <dbl>
block_pops <- block_pops %>% filter(!is.na(pop10))
block_cents <- SpatialPointsDataFrame(
coords=block_pops[,c('INTPTLON10','INTPTLAT10')],
data=block_pops,
coords.nrs=match(c('INTPTLON10','INTPTLAT10'), names(block_pops)),
proj4string=CRS("+init=epsg:4326")
) %>%
st_as_sf %>%
st_transform(st_crs(geographies))
create_crosswalk <- function(
x,
y,
blocks=block_cents,
xid='GEOID',
yid='GEOID'
){
x <- sf_to_sp(x, xid)
y <- sf_to_sp(y, yid)
id_sep <- ":"
int <- gInt_byid(x, y, id_sep)
blocks_to_int <- blocks %>% as("Spatial") %>% over(int)
# Provide a sanity check that all blocks got mapped.
print(paste("We lost", sum(is.na(blocks_to_int)), "blocks."))
print(paste("We lost", sum(blocks$pop10[is.na(blocks_to_int)]), "people."))
crosswalk <- data.frame(
int_num = blocks_to_int,
pop = blocks$pop10
) %>%
mutate(int_id = get_sp_id(int)[int_num]) %>%
group_by(int_id) %>%
summarise(pop = sum(pop)) %>%
group_by() %>%
separate(int_id, into=c("xid", "yid"), sep=id_sep) %>%
group_by(xid) %>%
mutate(frac_of_x = pop / sum(pop)) %>%
group_by(yid) %>%
mutate(frac_of_y = pop / sum(pop))
## check that nothing got duplicated/lost.
print(paste("Pop of blocks:", sum(blocks$pop10)))
print(paste("Pop of cw: ", with(crosswalk, sum(crosswalk$pop))))
return(crosswalk)
}
crosswalks <- list()
for(
vintage_pair in c(
"pre2012,2012", "pre2012,post2012", "2012,post2012", "post2012,2018"
)
){
print(vintage_pair)
vintages <- strsplit(vintage_pair, ",")[[1]]
crosswalks[[vintage_pair]] <- create_crosswalk(
geographies %>% filter(vintage == vintages[1]),
geographies %>% filter(vintage == vintages[2])
)
}
## [1] "pre2012,2012"
## [1] "We lost 2 blocks."
## [1] "We lost 0 people."
## [1] "Pop of blocks: 12702379"
## [1] "Pop of cw: 12702379"
## [1] "pre2012,post2012"
## [1] "We lost 2 blocks."
## [1] "We lost 0 people."
## [1] "Pop of blocks: 12702379"
## [1] "Pop of cw: 12702379"
## [1] "2012,post2012"
## [1] "We lost 0 blocks."
## [1] "We lost 0 people."
## [1] "Pop of blocks: 12702379"
## [1] "Pop of cw: 12702379"
## [1] "post2012,2018"
## [1] "We lost 0 blocks."
## [1] "We lost 0 people."
## [1] "Pop of blocks: 12702379"
## [1] "Pop of cw: 12702379"
Here’s an example of how to use the crosswalk. Suppose we wanted to map the results in 2012 to the geographies in 2014. We want to assign to the 2014 boundaries the weighted average of the results from 2012:
geography_results %>%
inner_join(
races %>% filter(election == '2012 G')
) %>%
left_join(
crosswalks[["2012,post2012"]],
by = c("GEOID" = "xid")
) %>%
group_by(election, yid, office, party) %>%
summarise(
vote_total = sum(vote_total * frac_of_x)
)
## # A tibble: 7,185 x 5
## # Groups: election, yid, office [?]
## election yid office party vote_total
## <chr> <chr> <chr> <chr> <dbl>
## 1 2012 G post2012_001_049_03_049 STH DEM 15907.
## 2 2012 G post2012_001_049_03_049 STH OTH 0
## 3 2012 G post2012_001_049_03_049 STS DEM 13708.
## 4 2012 G post2012_001_049_03_049 STS OTH 0
## 5 2012 G post2012_001_049_03_049 STS REP 4480.
## 6 2012 G post2012_001_049_03_049 USC DEM 12831.
## 7 2012 G post2012_001_049_03_049 USC IND 909.
## 8 2012 G post2012_001_049_03_049 USC OTH 0
## 9 2012 G post2012_001_049_03_049 USC REP 4371.
## 10 2012 G post2012_001_049_03_049 USP DEM 14046.
## # ... with 7,175 more rows
Finally, let’s use the crosswalks to add a quick feature to geographies
: the 2010 population. (This doesn’t actually require the crosswalks–we could just map the blocks to a single time period–but since we already have them…)
get_pop_from_cw <- function(vintage_pair, vintage){
vintage_split <- str_split(vintage_pair, ",")[[1]]
x_or_y <- c("xid", "yid")[which(vintage_split == vintage)]
if(length(x_or_y) != 1) stop("vintage doesn't match")
cw <- crosswalks[[vintage_pair]]
cw$GEOID <- cw[,x_or_y] %>% unlist
return(cw %>% group_by(GEOID) %>% summarise(pop10 = sum(pop, na.rm = TRUE)))
}
pops_pre2012 <- get_pop_from_cw("pre2012,2012", "pre2012")
pops_2012 <- get_pop_from_cw("pre2012,2012", "2012")
pops_post2012 <- get_pop_from_cw("2012,post2012", "post2012")
pops_2018 <- get_pop_from_cw("post2012,2018", "2018")
geography_pops <- bind_rows(pops_pre2012, pops_2012, pops_post2012, pops_2018)
geographies <- left_join(geographies, geography_pops)
Let’s compare the total votes to the total population to make sure nothing crazy happened:
uspgov_votes <- geography_results %>%
inner_join(
races %>%
filter(
office %in% c("GOV", "USP") &
substr(election, 6, 6) == "G"
)
) %>%
mutate(year = substr(election, 1, 4)) %>%
group_by(year, GEOID) %>%
summarise(vote_total = sum(vote_total))
ggplot(
uspgov_votes %>% left_join(geography_pops),
aes(x=pop10, y=vote_total)
) +
geom_point() +
theme_sixtysix()
That looks good, but I want to flag something that’s weird. There are some combinations of districts that the shapefiles say shouldn’t exist. Here are some votes that aren’t in our intersections.
## Looks
uspgov_votes %>% filter(!GEOID %in% geography_pops$GEOID)
## # A tibble: 40 x 3
## # Groups: year [8]
## year GEOID vote_total
## <chr> <chr> <dbl>
## 1 2002 pre2012_046_017_07_091 430
## 2 2002 pre2012_058_045_12_129 1754
## 3 2002 pre2012_058_045_18_129 106
## 4 2002 pre2012_136_016_15_095 1368
## 5 2002 pre2012_137_016_15_095 1672
## 6 2002 pre2012_175_001_03_101 0
## 7 2004 pre2012_058_045_12_129 2770
## 8 2004 pre2012_058_045_18_129 155
## 9 2004 pre2012_136_016_15_095 2223
## 10 2004 pre2012_137_016_15_095 2889
## # ... with 30 more rows
Particularly, let’s consider pre2012_136_016_15_095
. There are 1,618 votes in 2006 in precincts voted for State House 137 and State Senate 16. According to the census maps, those districts don’t overlap. I haven’t been able to figure out if this is a problem with the vote data or the shapefiles, and we’ll ignore it for the time being. How many votes do we mess up?
uspgov_votes %>%
filter(!GEOID %in% geography_pops$GEOID) %>%
group_by(year) %>%
summarise(sum(vote_total))
## # A tibble: 8 x 2
## year `sum(vote_total)`
## <chr> <dbl>
## 1 2002 5330
## 2 2004 8037
## 3 2006 6796
## 4 2008 9332
## 5 2010 6056
## 6 2012 13250
## 7 2014 1
## 8 2016 1
Our lost votes come entirely in 2012 or before, with a max of 13,000 votes in 2012. Does this matter? I have no idea.
We’re done for now. Let’s save it and do some analysis!
if(!dir.exists("outputs")) dir.create("outputs")
save_objs <- c("crosswalks","geographies","geography_results", "year_to_geo_vintage")
save(
list = save_objs,
file="outputs/geographies_output.rda"
)