Section 1: Introduction

1.1: Motivation

Rates of urbanization and climate change are increasingly affecting our planet and specifically, the growing population of unhoused individuals. Even more so, the Covid-19 pandemic has brought the urgency of homelessness and affordable housing to the forefront as Americans were told to shelter in place while experiencing mass unemployment and income loss. The federal government even put a moratorium on evictions in order to decrease the flood of housing insecurity the country experienced.

While the pandemic has brought large-scale tragedy and challenges to the American people, we also see it as an opportunity for addressing the issue of homelessness on a large scale. Currently, President Joe Biden’s stimulus package includes $5 billion dollars towards emergency housing services. This funding amount has a chance to house every person experiencing homelessness in America as long as the resources are allocated effectively.

Our client, the Metro Dallas Homeless Alliance (MDHA), has a mission of making the experience of homelessness “rare, brief and non-recurring”. We hope that throughout our analysis, modeling, and application we can demonstrate that data provides further clarity around patterns of movement and builds understanding around the allocation of resources in order to find housing for all.

1.2: Use Case

Following MDHA’s motto, we have decided to build our application and modeling to focus specifically on resource allocation at a service provider level. Our resource allocation approach seeks to understand the likelihood of clients at specific service providers to stay housed or to experience a return to homelessness. Given this information, we hope MDHA can further understand the ways people move through the system and the support clients’ permanent housing needs.

At a high level, we measure homelessness as the usage of a bed night at an emergency shelter. Our likelihood score is calculated based on information like past usage of emergency shelter services and key demographics like race, income, and gender. Our use case rests on this predictive likelihood score as once we allocate resources to service providers we want to remain assured that they will confidently provide support to clients in order to prevent their return to homelessness.

Importantly, we recognize that improving the housing system requires a somewhat top down approach with a focus on services and how they can better serve individuals. We hope that our predictive scores will help funding and resources stretch far to help those who need them.

Section 2: Exploratory Data Analysis

2.0. Setup and Data Wrangling

Our dataset required quite a bit of data wrangling in order to associate clients with services and services offered with service providers as throughout our analysis we work on various units of scale. We also used the service date to associate a client with the most recent service and service provider they used within the system. Secondly, we assign organization IDs to each service record based on the Associated Program field. This ID will be used to track service provider-level performance in the modeling stage. Original service provider names are not shown for privacy reasons.

#Data Wrangling to connect clients with last used service
# Ordered list of all housing services dates client in chronological order
housed <- rbind(PSH_Serv, RRH_Serv, OPH_Serv) %>% 
  select('Family ID', 'Client ID', 'Service Date', Service, ProgramType, 'Associated Program')
housed$`Service Date` <- lubridate::ymd(housed$`Service Date`)
housed <- housed[order(housed$`Client ID`, housed$'Service Date'),] # Order by client ID, then service date
housed <- housed %>% 
  rename(ServiceDate = 'Service Date', ClientID = 'Client ID', FamilyID = 'Family ID') %>%
  distinct() # Unique client, date service received
housed$ServiceType <- 'Housed' #Housing service
# Ordered list of all homelessness services dates client in chronological order
homeless <- rbind(ES_Serv, TH_Serv, SH_Serv, SO_Serv) %>% 
  select('Family ID', 'Client ID', 'Service Date', Service, ProgramType, 'Associated Program')
homeless$`Service Date` <- lubridate::ymd(homeless$`Service Date`)
homeless <- homeless[order(homeless$`Client ID`, homeless$'Service Date'),]
homeless <- homeless %>% 
  rename(ServiceDate = 'Service Date', ClientID = 'Client ID', FamilyID = 'Family ID') %>%
  distinct()
homeless$ServiceType <- 'Homelessness'
# Create a filtered list of events when clients have received a homelessness service after a housing service
h_combined <- rbind(housed, homeless)
h_combined <- h_combined[order(h_combined$ClientID, h_combined$ServiceDate),]
# Create a set of lead fields that indicate information about the next service received
h_combined$NextClientID <- lead(h_combined$ClientID)
h_combined$NextServiceDate <- lead(h_combined$ServiceDate)
h_combined$NextService <- lead(h_combined$Service)
h_combined$NextProgramType <- lead(h_combined$ProgramType)
h_combined$NextServiceType <- lead(h_combined$ServiceType)
h_combined$ServiceMonth <- floor_date(h_combined$ServiceDate, "month")
# Remove rows where next record is from a different client
h_combined <- h_combined %>% filter(ClientID == NextClientID)

Next, we remove any clients who received services before October 1, 2015. We have done this to help with the incidence design of our data analysis. We want to be able to track people who exclusively showed up in the data after October 1, 2015 in order to see the factors associated with the onset of homelessness and recurrence rather than the persistence of homelessness over time. This incidence design will especially help in our modeling section.

ClientsPre10.2015 <- h_combined %>% 
  filter(ServiceDate < '2015-10-01') %>%
  select(ClientID) %>%
  unique()
Data_ClientsPre10.2015 <- h_combined %>% filter(ClientID %in% ClientsPre10.2015$ClientID)
nrow(Data_ClientsPre10.2015) / nrow(h_combined) # This removes 3.3% of the services. That's acceptable
length(Data_ClientsPre10.2015$ClientID %>% unique()) / length(h_combined$ClientID %>% unique()) # And 3.32% of the clients.
# Remove clients from before 10/01/2015 
h_combined <- h_combined %>% filter((ClientID %in% ClientsPre10.2015$ClientID)==FALSE)

To prepare for the later step of showing service providers in both our web application and modeling, we have assigned random aliases to the service providers in our data. We were careful to create names that show the types of organizations and non profits housing individuals such as religious groups, YMCA groups, and providers focused specifically on subsets of the population like single mothers or veterans.

# Now read from static output
setwd("C:/Users/ctb80/Box/Practicum_Dallas/modeling")
#Read in Org Alias table made and generated separately.
aliasTable <- read.csv('OrganizationAliases.csv')
#Combine into our data table so we can start using alias names
h_combined <- left_join(h_combined, aliasTable, by='OrganizationName')

Next, we create the client categorization for four narratives or profiles that explain how clients move through the system. These narratives are mined from the data and provide useful insights into the varying ways that clients interact with housed and homeless services. These client profiles are leveraged in later visualizations.

#Creating the service counts
ClientServicesCounts <- 
  h_combined %>%
  group_by(ClientID) %>%
  summarise(TotalESServices = sum(ProgramType == 'ES'),
            TotalPSHServices = sum(ProgramType == 'PSH'),
            TotalHomelessServices = sum(ServiceType == 'Homelessness'),
            TotalHousedServices = sum(ServiceType == 'Housed'),
            TotalServices = n()
            )
#Client services
nrow(ClientServicesCounts %>% filter(TotalESServices == TotalServices))
nrow(ClientServicesCounts %>% filter(TotalHomelessServices == TotalServices))
nrow(ClientServicesCounts %>% filter(TotalPSHServices == TotalServices))
nrow(ClientServicesCounts %>% filter(TotalHousedServices == TotalServices))
ESOnlyClients <- ClientServicesCounts %>% filter(TotalESServices == TotalServices)
HousedOnlyClients <- ClientServicesCounts %>% filter(TotalHousedServices == TotalServices)
ClientsHousedandHomeless <- 
  ClientServicesCounts %>%
  filter(TotalHomelessServices >0, TotalHousedServices > 0)
#Connecting with clients
EStoPSH <- h_combined %>% 
       filter(ClientID == NextClientID,
              ProgramType =='ES', 
              NextProgramType=='PSH') %>% 
       arrange(ClientID, ServiceDate)
length(EStoPSH$ClientID%>%unique())
HomelesstoHoused <- h_combined %>% 
       filter(ClientID == NextClientID,
              ServiceType =='Homelessness', 
              NextServiceType=='Housed') %>% 
       arrange(ClientID, ServiceDate)
length(HomelesstoHoused$ClientID%>%unique())
Recurrence <- h_combined %>% 
       filter(ClientID %in% HomelesstoHoused$ClientID,
              ServiceType =='Housed', 
              NextServiceType=='Homelessness'
              ) %>% 
       arrange(ClientID, ServiceDate)
length(Recurrence$ClientID%>%unique())
HomelessRehoused <- h_combined %>%
  filter(ClientID %in% HomelesstoHoused$ClientID,
        (ClientID %in% Recurrence$ClientID) == FALSE) %>%
  arrange(ClientID, ServiceDate)
length(HomelessRehoused$ClientID %>% unique())
#Providers for last three narratives
ESOnlyProviders <-
  h_combined %>%
  filter(ClientID %in% ESOnlyClients$ClientID) %>%
  group_by(ClientID) %>%
  summarise(AliasOrganizationName = AliasOrganizationName[which.max(ServiceDate)]) %>%
  group_by(AliasOrganizationName) %>%
  summarise(y = n_distinct(ClientID)) %>%
  rename(label=AliasOrganizationName)
ESOnlyProviders <- ESOnlyProviders[order(ESOnlyProviders$y, decreasing = TRUE),]
HousedOnlyProviders <-
  h_combined %>%
  filter(ClientID %in% HousedOnlyClients$ClientID) %>%
  group_by(ClientID) %>%
  summarise(AliasOrganizationName = AliasOrganizationName[which.max(ServiceDate)]) %>%
  group_by(AliasOrganizationName) %>%
  summarise(y = n_distinct(ClientID)) %>%
  rename(label=AliasOrganizationName)
HousedOnlyProviders <- HousedOnlyProviders[order(HousedOnlyProviders$y, decreasing = TRUE),]
 
HomelessRehousedProviders <-
  HomelessRehoused %>%
  group_by(ClientID) %>%
  summarise(AliasOrganizationName = AliasOrganizationName[which.max(ServiceDate)]) %>%
  group_by(AliasOrganizationName) %>%
  summarise(y = n_distinct(ClientID)) %>%
  rename(label=AliasOrganizationName)
HomelessRehousedProviders <- HomelessRehousedProviders[order(HomelessRehousedProviders$y, decreasing = TRUE),]
RecurrenceProviders <-
  Recurrence %>%
  group_by(ClientID) %>%
  summarise(AliasOrganizationName = AliasOrganizationName[which.max(ServiceDate)]) %>%
  group_by(AliasOrganizationName) %>%
  summarise(y = n_distinct(ClientID)) %>%
  rename(label=AliasOrganizationName)
RecurrenceProviders <- RecurrenceProviders[order(RecurrenceProviders$y, decreasing = TRUE),]
# Add the narratives into h_combined
h_combined$Narrative <- 'NA'
# The ES only numbers are 9731 in ESOnlyClients and 9674 in h_combined$Narrative. Close enough.
#Grouping clients into each of the four narrative groupings
h_combined$Narrative[which (h_combined$ClientID %in% ESOnlyClients$ClientID)] <- 'ESOnly'
h_combined$Narrative[which (h_combined$ClientID %in% HousedOnlyClients$ClientID)] <- 'HousedOnly'
h_combined$Narrative[which (h_combined$ClientID %in% HomelessRehoused$ClientID)] <- 'HomelessRehoused'
h_combined$Narrative[which (h_combined$ClientID %in% Recurrence$ClientID)] <- 'Recurrence'
label_interval <- function(breaks) {
  paste0(breaks[1:length(breaks) - 1], "-", breaks[2:length(breaks)])
}
ESOnlyDuration <-
  h_combined %>%
  filter(ClientID %in% ESOnlyClients$ClientID) %>%
  group_by(ClientID) %>%
  mutate(FirstServiceDate = min(ServiceDate),
         LastServiceDate = max(ServiceDate)) %>%
  select(ClientID, FirstServiceDate, LastServiceDate) %>%
  unique() %>%
  mutate(TotalDurationInSystem = as.numeric(difftime(LastServiceDate, FirstServiceDate, units='days'), units='days'))
HousedOnlyDuration <-
  h_combined %>%
  filter(ClientID %in% HousedOnlyClients$ClientID) %>%
  group_by(ClientID) %>%
  mutate(FirstServiceDate = min(ServiceDate),
         LastServiceDate = max(ServiceDate)) %>%
  select(ClientID, FirstServiceDate, LastServiceDate) %>%
  unique() %>%
  mutate(TotalDurationInSystem = as.numeric(difftime(LastServiceDate, FirstServiceDate, units='days'), units='days'))
HousedOnlyDuration$DurationBin <- cut(HousedOnlyDuration$TotalDurationInSystem, breaks=hist(HousedOnlyDuration$TotalDurationInSystem, plot=FALSE)$breaks, 
                                      labels = label_interval(hist(HousedOnlyDuration$TotalDurationInSystem, plot=FALSE)$breaks))
HousedOnlyDuration[is.na(HousedOnlyDuration)] <- '0-200'
HousedOnlyDurationBinOutput <-
  HousedOnlyDuration %>%
  group_by(DurationBin) %>%
  summarise(y = n_distinct(ClientID)) %>%
  rename(label=DurationBin)
HomelessRehousedDuration <-
  h_combined %>%
  filter(ClientID %in% HomelessRehoused$ClientID) %>%
  group_by(ClientID) %>%
  mutate(FirstServiceDate = min(ServiceDate),
         LastServiceDate = max(ServiceDate)) %>%
  select(ClientID, FirstServiceDate, LastServiceDate) %>%
  unique() %>%
  mutate(TotalDurationInSystem = as.numeric(difftime(LastServiceDate, FirstServiceDate, units='days'), units='days'))
HomelessRehousedDuration$DurationBin <- cut(HomelessRehousedDuration$TotalDurationInSystem, breaks=hist(HomelessRehousedDuration$TotalDurationInSystem, plot=FALSE)$breaks, 
                                      labels = label_interval(hist(HomelessRehousedDuration$TotalDurationInSystem, plot=FALSE)$breaks))
HomelessRehousedDurationBinOutput <-
  HomelessRehousedDuration %>%
  group_by(DurationBin) %>%
  summarise(y = n_distinct(ClientID)) %>%
  rename(label=DurationBin)
RecurrenceDuration <-
  h_combined %>%
  filter(ClientID %in% Recurrence$ClientID) %>%
  group_by(ClientID) %>%
  mutate(FirstServiceDate = min(ServiceDate),
         LastServiceDate = max(ServiceDate)) %>%
  select(ClientID, FirstServiceDate, LastServiceDate) %>%
  unique() %>%
  mutate(TotalDurationInSystem = as.numeric(difftime(LastServiceDate, FirstServiceDate, units='days'), units='days'))
RecurrenceDuration$DurationBin <- cut(RecurrenceDuration$TotalDurationInSystem, breaks=hist(RecurrenceDuration$TotalDurationInSystem, plot=FALSE)$breaks, 
                                      labels = label_interval(hist(RecurrenceDuration$TotalDurationInSystem, plot=FALSE)$breaks))
RecurrenceDurationBinOutput <-
  RecurrenceDuration %>%
  group_by(DurationBin) %>%
  summarise(y = n_distinct(ClientID)) %>%
  rename(label=DurationBin)

2 : Client Movement

2.1: General Movement

Our project first seeks to understand individual movements within the homelessness system. The graphic below visualizes individual usage of various project types from 2015 to the present. Dates of service are shown on the X axis and deidentified client ID’s are on the Y axis. From left to right, the chart shows how individuals use and moved through the seven different project types over the course of the past 6 years. Interestingly, there is a noticeable vertical break (i.e., white space) in the data at March 2020, when Covid-19 lock downs first became widespread in the United States, causing a temporary reduction in housing and homelessness services. This visualization informs the four profiles that describe how clients interact with the housing system.

first <- h_combined %>%
  select(ClientID, ServiceDate, ProgramType, ServiceType) %>%
  count(ClientID, ServiceDate, ProgramType, ServiceType) %>%
  arrange(ClientID, ServiceDate, -n) %>%
  rename(Count=n)
second <- first %>%  
  group_by(ClientID, ServiceDate) %>% 
  summarise(Count = max(Count))
third <- inner_join(second, first, 
                    by=c("ClientID" = "ClientID", 'ServiceDate' = 'ServiceDate', 'Count'='Count')) %>%
  arrange(ServiceType)
#third$ServiceMonthDisp <- format(third$ServiceMonth, "%Y-%m")
third$ID <- third %>% group_indices(ClientID)
third$ProgramType <- factor(third$ProgramType, levels = c( 'ES', 'SO', 'SH', 'TH', 'RRH', 'PSH', 'OPH'))
# Main chart that creates a statastical group of 250x250 bins
ggplot(third, aes(x=ServiceDate, y=ID, colour=ProgramType), size=4) +
  scale_fill_manual(values=palette7) +
  scale_color_manual(values=palette7) +
  stat_bin2d(bins=250, aes(fill=ProgramType)) +
  scale_x_date(date_labels = '%Y-%m', breaks=date_breaks('year')) +
  xlab('Date of Service') + ylab('Client') +
  ggtitle('Dallas Housing Services by Client, 2015-Present') +
  #theme(legend.position = c(0.05,.85)) +
  plotTheme

2.2: Client Profiles

There are four major ways that clients use and interact with the housing system. The first is someone who exclusively uses Emergency Shelter Services throughout the years of study. This might consist of a bed night, a bus pass, or a shower. This person likely experiences housing instability from time to time and uses Emergency Shelter services to supplement other forms of housing. Emergency Shelter services are the most used services in the system and therefore have the most clients associated with them at 13,782 people.

ES Only

Our second profile is an individual or family that enters into the housing system through a housing project type. This means that immediately upon entry they either get placed in Rapid Re-Housing (RRH), Permanent Supportive Housing (PSH), or Other Permanent Housing (OPH). It is important to note that Rapid Re-Housing’s purpose is to house individuals and families swiftly as research has proven that getting housing stability quickly means low rates of return to shelter. Additionally, Permanent Supportive Housing has eligibility requirements that one must have a disability and fit the definition of chronically homeless (meaning they have experienced homelessness for longer than a year or four or more times in the past three years). These individuals are a smaller subset of the population.

Housed Only

Our third profile represents a successful use of the system. An individual comes into the system through a homeless service such as Emergency Shelter or Street Outreach. A caseworker or employee at the service provider then connects them with housing services in order to give them housing stability. These individuals achieve housing stability and no longer make use of homeless services. In this narrative, supportive services and housing given to a client lead to housing permanency.

Homeless to Housed

The fourth profile is the one we are most interested in avoiding in the future. It represents individuals who start out like our third profile clients. They enter the system through homelessness services like Street Outreach or Emergency Shelter. They are then placed, like the third profile clients, into a housing project which will give them housing stability. However, unlike the third profile clients, we find that they end up back in the homelessness system over time. These clients recur back into the homelessness system for a myriad of reasons. From a policy perspective, it means agencies and service providers are not able to provide the proper resources to this subset of the homeless population. Our focus on minimizing the amount of clients in this category guides our modeling approach and web application.

Recurring Homeless

The graphics below provide a better sense of how each client profile looks individually. Note the volume of Emergency Shelter services relative to other categories and the pattern of movement of all profiles through different project types.

ESGraph <- ggplot(third %>% filter(ClientID %in% ESOnlyClients$ClientID), aes(x=ServiceDate, y=ID, colour=ProgramType), size=4) +
  scale_fill_manual(values=palette7) +
  scale_color_manual(values=palette7) +
  stat_bin2d(bins=250, aes(fill=ProgramType)) +
  scale_x_date(date_labels = '%Y-%m', breaks=date_breaks('year')) +
  xlab('Date of Service') + ylab('Client') +
  ggtitle('Emergency Services') +
  #theme(legend.position = c(0.05,.85)) +
  plotTheme
HousedOnlyGraph <- ggplot(third %>% filter(ClientID %in% HousedOnlyClients$ClientID), aes(x=ServiceDate, y=ID, colour=ProgramType), size=4) +
  scale_fill_manual(values=HousedOnlypalette) +
  scale_color_manual(values=HousedOnlypalette) +
  stat_bin2d(bins=250, aes(fill=ProgramType)) +
  scale_x_date(date_labels = '%Y-%m', breaks=date_breaks('year')) +
  xlab('Date of Service') + ylab('Client') +
  ggtitle('Housed Only') +
  #theme(legend.position = c(0.05,.85)) +
  plotTheme
HomelessRehousedGraph <- ggplot(third %>% filter(ClientID %in% HomelessRehoused$ClientID), aes(x=ServiceDate, y=ID, colour=ProgramType), size=4) +
  scale_fill_manual(values=palette7) +
  scale_color_manual(values=palette7) +
  stat_bin2d(bins=250, aes(fill=ProgramType)) +
  scale_x_date(date_labels = '%Y-%m', breaks=date_breaks('year')) +
  xlab('Date of Service') + ylab('Client') +
  ggtitle('Homeless to Housed') +
  #theme(legend.position = c(0.05,.85)) +
  plotTheme
RecurrenceGraph <- ggplot(third %>% filter(ClientID %in% Recurrence$ClientID), aes(x=ServiceDate, y=ID, colour=ProgramType), size=4) +
  scale_fill_manual(values=palette7) +
  scale_color_manual(values=palette7) +
  stat_bin2d(bins=250, aes(fill=ProgramType)) +
  scale_x_date(date_labels = '%Y-%m', breaks=date_breaks('year')) +
  xlab('Date of Service') + ylab('Client') +
  ggtitle('Experiencing Recurrence') +
  #theme(legend.position = c(0.05,.85)) +
  plotTheme
legend<- g_legend(RecurrenceGraph+theme(legend.position='bottom', legend.key.size = unit(.75, 'cm'), legend.title = element_text(size=10), legend.text = element_text(size=10), plot.margin = ggplot2::margin(10,10,10,50)))


grid.arrange(ESGraph+theme(legend.position='hidden'), 
             HousedOnlyGraph+theme(legend.position='hidden'),
             HomelessRehousedGraph+theme(legend.position='hidden'),
             RecurrenceGraph+theme(legend.position='hidden'), 
             top = grid::textGrob("Dallas Housing Services by Client Type, 2015-Present", gp=grid::gpar(fontsize=20, fontface='bold')),legend)

3: Service Providers

In addition to individual clients, we are focused on evaluating service providers within the system. We made this decision because we believe that service providers and the agencies who monitor them are responsible for improving the services they provide to individuals. We believe homelessness does not occur through individual fault and can be solved by providing the right combination of wraparound services and permanent housing. Thus, this section maintains the four archetypes of clients who use the system but visualizes their interaction with service providers. We create two plots for each profile, one that shows duration in the system and one that measures service providers associated with each archetype. The duration plot measures how long a client has used a particular set of services or stayed at a service provider. The second plot assesses service providers who have a large volume of a particular client profile. This plot is important in building a greater understanding of the clients a particular service provider works with. We hope it can create more conversation on challenges and improvement areas for particular service providers in order to increase their effectiveness in getting clients housed permanently.

ESOnlyDurationG<- ggplot(ESOnlyDuration,
       aes(TotalDurationInSystem)) +
  geom_histogram(fill= ES_col, binwidth = 7) +
  geom_vline(aes(xintercept = median(TotalDurationInSystem)),col= accent_col,size=.7) +
  coord_cartesian(xlim = c(0, 1000)) +
  xlab('Duration in the system (days)') + ylab('# of Clients') +
    labs(title = "50% of clients only using Emergency Shelter \nspend less than 25 Days in the System",
       subtitle = "Median = 25 days",
       caption = "Histogram of # of Days in the System by Clients using ES only") +
  plotTheme
ESOnlyProvidersG <- ggplot(data = ESOnlyProviders,
       aes(x=reorder(label, y), y=y)) +
  geom_bar(fill= ES_col, stat="identity") +
  xlab('Service Provider') + ylab('# of Clients') +
  labs(title = "   Recent Providers for clients using \n   Emergency Shelter Only") +
  coord_flip() +
  plotTheme
HousedOnlyDurationG <- ggplot(HousedOnlyDuration,
       aes(TotalDurationInSystem)) +
  geom_histogram(fill= PSH_col, binwidth = 7) +
  geom_vline(aes(xintercept = median(TotalDurationInSystem)),col= accent_col,size=.7) +
  coord_cartesian(xlim = c(0, 1500)) +
  xlab('Duration in the system (days)') + ylab('# of Clients') +
    labs(title = "50% of clients only using Housed Services \nspend less than 105 Days in the System",
       subtitle = "Median = 105 days",
       caption = "Histogram of # of Days in the System by Clients using Housed Services only") +
  plotTheme
HousedOnlyProvidersG <- ggplot(data = HousedOnlyProviders,
       aes(x=reorder(label, y), y=y)) +
  geom_bar(fill= PSH_col, stat="identity") +
  xlab('Service Provider') + ylab('# of Clients') +
  labs(title = "    Recent Providers for clients using \n    Housed Services Only") +
  coord_flip() +
  plotTheme
HomelessRehousedDurationG <- ggplot(HomelessRehousedDuration,
       aes(TotalDurationInSystem)) +
  geom_histogram(fill= PSH_col, binwidth = 7) +
  geom_vline(aes(xintercept = median(TotalDurationInSystem)),col= accent_col,size=.7) +
  #coord_cartesian(xlim = c(0, 1500)) +
  xlab('Duration in the system (days)') + ylab('# of Clients') +
    labs(title = "50% of clients using Housed, then Homeless Services \nspend more than 482 Days in the System",
       subtitle = "Median = 482 days",
       caption = "Histogram of # of Days in the System by Clients using Housed, then Homeless Services") +
  plotTheme
HomelessRehousedProvidersG <- ggplot(data = HomelessRehousedProviders,
       aes(x=reorder(label, y), y=y)) +
  geom_bar(fill= PSH_col, stat="identity") +
  xlab('Service Provider') + ylab('# of Clients') +
  labs(title = "    Recent Providers for clients using \n    Housed, then Homeless Services") +
  coord_flip() +
  plotTheme
RecurrenceDurationG <- ggplot(RecurrenceDuration,
       aes(TotalDurationInSystem)) +
  geom_histogram(fill= RRH_col, binwidth = 7) +
  geom_vline(aes(xintercept = median(TotalDurationInSystem)),col= accent_col,size=.7) +
  #coord_cartesian(xlim = c(0, 1500)) +
  xlab('Duration in the system (days)') + ylab('# of Clients') +
    labs(title = "50% of clients experiencing \nHomeless Recurrence \nspend more than 520 Days in the System",
       subtitle = "Median = 520 days",
       caption = "Histogram of # of Days in the System by Clients experiencing Homeless Recurrence") +
  plotTheme
RecurrenceProvidersG <- ggplot(data = RecurrenceProviders,
       aes(x=reorder(label, y), y=y)) +
  geom_bar(fill= RRH_col, stat="identity") +
  xlab('Service Provider') + ylab('# of Clients') +
  labs(title = "    Recent Providers for clients experiencing \n    Homeless Recurrence") +
  coord_flip() +
  plotTheme

Our first narrative, those that exclusively use Emergency Services, is associated with a low duration in the system. Below we also see the service providers most associated with this client profile, the top two being Funk Services of Dallas and Collins Veteran’s Program.

grid.arrange(ESOnlyDurationG, ESOnlyProvidersG, ncol=2, widths=c(1,2))

Our second client profile, those exclusively in housing projects, have a different client set of experiences. The clients spend a great deal longer in the system as this is permanent housing that provides stability. Additionally, the service providers most associated with this client profile are Watson Project and Neelay Sisterhood Rescue.

grid.arrange(HousedOnlyDurationG, HousedOnlyProvidersG, ncol=2, widths=c(1,2))

Our third client profile are those that go from homeless services in emergency shelter to housed services. These individuals also spend a long time in the system as they originally are marked as experiencing homeless and then are placed into a housing project. The service providers most associated with this client profile are Neelay Sisterhood Rescue and Campa Rescue.

grid.arrange(HomelessRehousedDurationG, HomelessRehousedProvidersG, ncol=2, widths=c(1,2))

Our fourth client profile is the one which we are most trying to avoid, those that are placed into a housing project and then recur into homelessness at a later point. These individuals have the longest duration in the system as they spend time between homeless and housing projects. Our objective is for them to be placed and remain in stable housing. Interestingly, the service providers most associated with this client profile, Neelay Sisterhood Rescue and Campa Rescue, are the same ones associated with our third client profile.

grid.arrange(RecurrenceDurationG, RecurrenceProvidersG, ncol=2, widths=c(1,2))

3.1: Overall Duration

Similar to the above plots, we show below the median amount of days clients spend at a service provider for the entire system. Our plot shows that half of clients spend over 34 days in the housing system. Thus, we understand that for many clients, it can be a month or so before they achieve self sufficiency. For the other half of clients, those that spend less than 34 days, we understand that many individuals using housing services are self motivated to leave. Thus, our focus returns to the ones with severe issues who must receive various types of services over a longer period of time to achieve stability and exit the system.

# Total duration in the system by client
# Make two new cols, first appearance, last appearance
DurationByClient <- h_combined %>%
  select(ClientID, ServiceDate) %>%
  group_by(ClientID) %>%
  mutate(FirstServiceDate = min(ServiceDate),
         LastServiceDate = max(ServiceDate))
#For some reason this isn't working when it's all in one
DurationByClient <- DurationByClient %>% 
  select(ClientID, FirstServiceDate, LastServiceDate) %>%
  unique() %>%
  mutate(TotalDurationInSystem = difftime(LastServiceDate, FirstServiceDate, units='days'))
# Average duration between services by client
DurationBtwServices <- 
  h_combined %>% 
  filter(ClientID == NextClientID) %>%
  select(ClientID, ServiceDate, NextServiceDate) %>%
  mutate(DaysBetweenService = difftime(NextServiceDate, ServiceDate, units='days')) %>%
  filter(DaysBetweenService > 0) %>%
  group_by(ClientID) %>%
  mutate(AverageDurationBetweenServices = mean(DaysBetweenService)) %>%
  select(ClientID, AverageDurationBetweenServices) %>%
  unique()
ggplot(DurationByClient, aes(x=TotalDurationInSystem)) +
  geom_histogram(binwidth=34, fill='#999999') +
  geom_vline(aes(xintercept = median(TotalDurationInSystem)),col= accent_col,size=.7) +
  coord_cartesian(xlim = c(0, 2000)) +
  ylab('# of Clients') + xlab('Days in System') +
  labs(title = "50% of Clients spend more than 34 Days in the System",
       subtitle = "Median =  34 days",
       caption = "Histogram of Number of Days in the System by Client") +  
  plotTheme

3.2: Frequency of Services

Additionally, we see that the services the housing system provides play a crucial role in helping individuals achieve housing permanency. The plot below shows that half of our clients require services often (interpreted as every few days). These services, such as gaining a bus pass or meeting with a case manager, are important to clients and hopefully lead to stability.

ggplot(DurationBtwServices, aes(x=AverageDurationBetweenServices)) +
  geom_histogram(fill='#999999', binwidth = 7) + # Bins represent a week
  geom_vline(aes(xintercept = median(AverageDurationBetweenServices)),col= accent_col,size=.7) +
  coord_cartesian(xlim = c(0, 300)) +
  ylab('# of Clients') + xlab('Average # of Days Between Services')+
  labs(title = "50% of Clients have an average of 7 Days or less \nbetween receiving Services",
       subtitle = "Median = 1 week",
       caption = "Histogram of Average Duration Between Services by Client") +  
  plotTheme

3.3: Service Seasonality

Monthly Sevices (excluding Emergency Shelter)

Timing plays a critical role throughout the housing system. We track the time individuals spend in the system as well as the time of year in which services on a system level are most in use. Below, we see the amount of services rendered in each project type by month. It appears that street outreach peaks around August with about 4000 services rendered at that time. Additionally, we see the high amount of services rendered throughout the year in the Permanent Supportive Housing project type. This makes sense as this project type comes with a wealth of provided services.

#change this plot's month from numbers to 
ServicesByMonth <- h_combined %>%
  mutate(ServiceMonth = month(ServiceDate)) %>%
  count(ServiceMonth, ProgramType)
ServicesByMonth$ProgramType_name <- ServicesByMonth$ProgramType
ServicesByMonth$ProgramType_name[which(ServicesByMonth$ProgramType == "ES")] <- "Emergency Shelter"
ServicesByMonth$ProgramType_name[which(ServicesByMonth$ProgramType == "SO")] <- "Street Outreach"
ServicesByMonth$ProgramType_name[which(ServicesByMonth$ProgramType == "SH")] <- "Safe Haven"
ServicesByMonth$ProgramType_name[which(ServicesByMonth$ProgramType == "TH")] <- "Transitional Housing"
ServicesByMonth$ProgramType_name[which(ServicesByMonth$ProgramType == "RRH")] <- "Rapid Re-Housing"
ServicesByMonth$ProgramType_name[which(ServicesByMonth$ProgramType == "PSH")] <- "Permanent Supportive Housing"
ServicesByMonth$ProgramType_name[which(ServicesByMonth$ProgramType == "OPH")] <- "Other Permanent Housing"
ServicesByMonth$ProgramType_name <- factor(ServicesByMonth$ProgramType_name, levels = c("Emergency Shelter", "Street Outreach", "Safe Haven", "Transitional Housing", "Rapid Re-Housing", "Permanent Supportive Housing", "Other Permanent Housing"))
ggplot(ServicesByMonth %>% 
         filter(ProgramType != 'ES') %>%
         rename(`Program Type` = ProgramType_name),
       aes(x=ServiceMonth, y=n, colour=`Program Type`)) +
  scale_color_manual(values = c(SO_col, SH_col, TH_col, RRH_col, PSH_col, OPH_col)) +
  coord_cartesian(xlim = c(1, 11.5)) +
  xlab('Month') + ylab('Services Rendered') +
  geom_line(size = 1) +
  labs(title = 'Most programs are used consistently across the year,\nwhile Street Outreach (SO) peaks in August',
       caption = "Services Rendered by Month") +
  plotTheme

Monthly Services, Emergency Shelter Only

Below, we see the same tracking of services throughout the year but subset to Emergency Shelter as this project type offers a much greater volume of services. For Emergency Shelter, which has the most commonly used services ranging from 2000 services rendered in its off season to over 6000 services rendered in its peak season, it is important to understand and anticipate when services will need to be provided for clients. It appears that June is the time when these services are most in need. This focus on Emergency Shelter and its usage will continue into our modeling and web application.

ggplot(ServicesByMonth %>% 
         filter(ProgramType == 'ES'),
       aes(x=ServiceMonth, y=n)) +
  geom_line(aes(ServiceMonth), size = 1, colour = ES_col) +
  xlab('Month') + ylab('# of Services Rendered') + 
  coord_cartesian(xlim = c(1, 11.5)) +
    labs(title = 'Emergency Shelter (ES) Services reach their peak in June, \nwith additional demand in Winters',
       caption = "Emergency Shelter Services Rendered by Month") +
  plotTheme

4: Disparities

4.1: Project Type Demographics

Another interest for us in exploring the use of the housing system by clients is how individuals, as representatives of various demographic groups, make use of the system. This section examines the ways demographics groups are represented in different project types. Within housing, we have a great concern for equity and ensuring that all people get the services they need and deserve. We are particularly concerned with creating equitable solutions that center people of color and other minorities within the system. Our interest in demographics appears again in the way we’ve built our web application.

Race and Gender by Count

Our first plot shows Race and Gender by number of clients in the system. Here we see that Black individuals make up a majority of the population of cis-men and cis-women. Additionally, we see that there are more cis men than cis women in the system overall.

# re-order levels
reorder_size <- function(x) {
  factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}
#Race and Gender by Count
all_dem%>%
  select(Race, Gender, DisablingCondition) %>%
  filter( is.na(Race) == FALSE) %>%
  filter(Gender == "Male (1)"|Gender ==  "Female (0)"| Gender ==  "Trans/Non-Binary") %>%
  ggplot(aes(x = reorder_size(`Race`))) +
  geom_bar(stat='count', aes(fill = Race))+
  xlab("Race") +
  facet_grid(~Gender) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_fill_manual(values = palette6) + 
  labs(title= "Race and Gender by Count") +
  plotTheme +
  theme(plot.margin=ggplot2::margin(10,10,10,100))

Distribution of Race across Project Type

In this graph, we see the racial makeup of each project type. Our key takeaways from this plot are the relatively even proportion of Black and White individuals within Safe Haven. We also note that the gap between the percent of Black individuals and white individuals in Emergency Shelter grows much larger in the Other Permanent Housing project type. Additionally, we feel that this graph does not tell the full story of individuals within each project type as it doesn’t compare the racial distribution of project types to the racial distribution of the metropolitan Dallas below poverty population. In a future iteration, we would love to have further explored disparities by using below poverty to normalize. To read more on this method, please see this paper.

ggplot(all_dem %>%
         dplyr::filter(is.na(Race) == FALSE), aes(x=as.factor(ProjectType), fill=as.factor(Race)))+
  geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]), position="dodge" ) +
  geom_text(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..], 
                 label=scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..], accuracy=0.1)),
            stat="count", position=position_dodge(0.9), vjust=-0.5)+
  ylab('Percent of Project Type Group, %') +
  scale_y_continuous(labels = scales::percent)+
  scale_fill_manual(values = palette6) + 
  xlab("Project Type")+
  labs(title= "Distribution of Race across Project Type", fill="Race") +
  plotTheme

Distribution of Gender across Project Type

After examining race, we choose to view the distribution of gender as well. Our key takeaways from this graph are the large disparity between men and women within Safe Haven. We see a more even ratio of men to women within Rapid Re-Housing. In every project type, men make up the larger percentage of the population except for Other Permanent Housing. In future iterations of this plot, we would want to examine single women and women with families separately as these are two populations that use housing services differently. A vast majority of homeless families are women with children which is why they would be overrepresented in project types oriented to families like Rapid Re-Housing and Other Permanent Housing.

all_dem%>%
  filter(Gender == "Male (1)"|Gender ==  "Female (0)"| Gender ==  "Trans/Non-Binary") %>% 
  ggplot(., aes(x=as.factor(ProjectType), fill=as.factor(Gender)))+
  geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]), position="dodge" ) +
  geom_text(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..], 
                 label=scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..], accuracy=0.1)),
            stat="count", position=position_dodge(0.9), vjust=-0.5)+
  ylab('Percent of Project Type Group, %') +
  scale_y_continuous(labels = scales::percent)+
  scale_fill_manual(values = palette4) + 
  xlab("Gender")+
  labs(title= "Distribution of Gender across Project Type", fill="Gender") +
  plotTheme

Distribution of Project Type by Race

In this graph we see how each project type is distributed across race proportionally. We understand that there is a higher percentage of Native people in Emergency Shelter than any other project type and this number is a higher percentage relative to other racial group distributions. We also see a higher percentage of Asian people in Permanent Supportive Housing relative to other racial group distributions. Lastly, there is a lower percentage of Black people in Emergency Shelter compared to other racial group distributions.

ggplot(all_dem, aes(x=as.factor(Race), fill=ProjectType) )+
  geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]), position="dodge" ) +
  geom_text(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..], 
                 label=scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..], accuracy=0.1)),
            stat="count", position=position_dodge(0.9), vjust=-0.5)+
  ylab('Percent of Racial Group, %') +
  scale_y_continuous(labels = scales::percent)+
  scale_fill_manual(values = palette7) + 
  xlab("Race")+
  labs(title= "Distribution of Project Type by Race", fill="ProjectType") +
  plotTheme

Distribution of Project Type by Gender

In looking at project types distributed across gender we have two key takeaways. The first is that there is a higher percentage of women in Rapid Re-Housing than in Permanent Supportive Housing and for men this ratio is flipped. This ratio flipping makes sense in the context above which explains that most women in the system are women with families who would be placed in family oriented programs like Rapid Re-Housing. Additionally, there is a higher proportional distribution of trans and non binary people in Permanent Supportive Housing than compared to the percentages for men and women in Permanent Supportive Housing.

all_dem%>%
  filter(Gender == "Male (1)"|Gender ==  "Female (0)"| Gender ==  "Trans/Non-Binary") %>% 
  ggplot(., aes(x=as.factor(Gender), fill=as.factor(ProjectType)))+
  geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]), position="dodge" ) +
  geom_text(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..], 
                 label=scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..], accuracy=0.1)),
            stat="count", position=position_dodge(0.9), vjust=-0.5)+
  ylab('Percent of Project Type Group, %') +
  scale_y_continuous(labels = scales::percent)+
  scale_fill_manual(values = palette7) + 
  xlab("Project Type")+
  labs(title= "Distribution of Project Type by Gender", fill="Gender") +
  plotTheme

4.2: Maximum Incomes

In addition to the client disparities by project type that we’ve explored above, we also wanted to focus on disparities by income level. Income plays a significant role in supporting individuals experiencing homelessness in their ability to pay rent (even if they get a government subsidized unit) and to provide for themselves and their family. We see below that, on average, Black individuals have a higher income than other racial groups within the housing system. Cis-men have a higher average income than cis-women or trans/non-binary individuals. Those who either have a disability, are a veteran or a domestic violence survivor have a higher income than those that don’t. Higher incomes for these populations likely reflect more financial assistance from the government as eligibility statuses change. Lastly, the individuals in each project type with the most income are Safe Haven, Transitional Housing, and Permanent Supportive Housing. Those with the least are Emergency Shelter and Street Outreach. Our understanding of income and difference in income by subsets of the population helps to inform where resources should go and which individuals are more vulnerable than others.

options(dplyr.summarise.inform = FALSE)
#Removing outliers for Max Income by percentiles 
lower_bound <- quantile(all_dem$Max.Income.from.any.Assessment, 0.025) #all that lie outside 2.5 percentile
upper_bound <- quantile(all_dem$Max.Income.from.any.Assessment, 0.975)#all that lie outside 97.5 percentile
outlier_ind <- which(all_dem$Max.Income.from.any.Assessment < lower_bound | all_dem$Max.Income.from.any.Assessment > upper_bound)
all_demaoutliersremoved <- all_dem[- which(all_dem$Max.Income.from.any.Assessment < lower_bound | all_dem$Max.Income.from.any.Assessment > upper_bound),]
#Creating Demographic Plots: Race, Gender, Disability, Veteran, Domestic Violence
RaceIncomeDF <- all_demaoutliersremoved %>%
  dplyr::filter(is.na(Race) == FALSE) %>%
  group_by(Race) %>%
  summarize(countrace = n(),
            Maxincome= sum(Max.Income.from.any.Assessment),
            incomebyrace = round((Maxincome / countrace), digits=1 ))
RaceIncomePlot <-RaceIncomeDF%>%
  ggplot(aes(x=Race, y= incomebyrace, fill=Race)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=incomebyrace), position=position_dodge(width=0.9), vjust=-0.25)+
  scale_fill_manual(values = palette6) +
  labs(y= 'Dollars' ) +
  ggtitle('Race')+
  #theme(legend.position = 'none')+
  plotTheme+
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank())
GenderIncomeDF <- all_demaoutliersremoved %>%
  filter(Gender == "Male (1)"|Gender ==  "Female (0)"| Gender ==  "Trans/Non-Binary") %>%
  group_by(Gender) %>%
  summarize(count = n(),
            Maxincome= sum(Max.Income.from.any.Assessment),
            incomepp = round(Maxincome / count))
GenderIncomePlot <-GenderIncomeDF%>%
  ggplot(aes(x=Gender, y= incomepp, fill=Gender)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=incomepp), position=position_dodge(width=0.9), vjust=-0.25)+
  scale_fill_manual(values = palette4) + 
  labs(y= 'Dollars') +
  ggtitle('Gender')+
  #theme(legend.position = 'none')+
  plotTheme+
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank())
DisabilityIncomeDF <-all_demaoutliersremoved %>%
  filter(DisablingCondition == "No (0)"|DisablingCondition ==  "Yes (1)") %>% 
  group_by(DisablingCondition) %>%
  summarize(count = n(),
            Maxincome= sum(Max.Income.from.any.Assessment),
            incomepp = round((Maxincome / count)))
DisabilityIncomePlot <-DisabilityIncomeDF%>%
  ggplot(aes(x=DisablingCondition, y= incomepp, fill=DisablingCondition)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=incomepp), position=position_dodge(width=0.9), vjust=-0.25)+
  scale_fill_manual(values =palette4) + 
  labs(y= 'Dollars') +
  ggtitle('Disability Status')+
  #theme(legend.position = 'none')+
  plotTheme+
  theme(axis.title.x = element_blank())
DomViolenceIncomeDF <-all_demaoutliersremoved %>%
  filter(DomesticViolenceVictim == "No"|DomesticViolenceVictim ==  "Yes") %>% 
  group_by(DomesticViolenceVictim) %>%
  summarize(count = n(),
            Maxincome= sum(Max.Income.from.any.Assessment),
            incomepp = round((Maxincome / count)))
DomViolenceIncomePlot <-DomViolenceIncomeDF%>%
  ggplot(aes(x=DomesticViolenceVictim, y= incomepp, fill=DomesticViolenceVictim)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=incomepp), position=position_dodge(width=0.9), vjust=-0.25)+
  scale_fill_manual(values = c("#3a7d7c", "#ffa137")) + 
  labs(y= 'Dollars')+
  ggtitle('Domestic Violence \nSurvivor')+
  #theme(legend.position = 'none')+
  plotTheme+
  theme(axis.title.x = element_blank())
VeteranIncomeDF <- all_demaoutliersremoved %>%
  filter(VeteranStatus == "No (0)"|VeteranStatus ==  "Yes (1)") %>% 
  group_by(VeteranStatus) %>%
  summarize(count = n(),
            Maxincome= sum(Max.Income.from.any.Assessment),
            incomepp = round((Maxincome / count)))
VeteranIncomePlot <-VeteranIncomeDF%>%
  ggplot(aes(x=VeteranStatus, y= incomepp, fill=VeteranStatus)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=incomepp), position=position_dodge(width=0.9), vjust=-0.25)+
  scale_fill_manual(values = c("#3a7d7c", "#ffa137")) + 
  labs(axis.title.x = element_blank(), y= 'Dollars')+
  ggtitle('Veteran Status')+
  plotTheme+
  theme(axis.title.x = element_blank())
ProjectTypeIncomeDF <-all_demaoutliersremoved %>%
  #filter(Gender == "Male (1)"|Gender ==  "Female (0)") %>% 
  group_by(ProjectType) %>%
  summarize(count = n(),
            Maxincome= sum(Max.Income.from.any.Assessment),
            incomepp = round((Maxincome / count)))
ProjectTypeIncomePlot <-ProjectTypeIncomeDF%>%
  ggplot(aes(x=ProjectType, y= incomepp, fill=ProjectType)) +
  geom_bar(stat="identity") +
  geom_text(aes(label=incomepp), position=position_dodge(width=0.9), vjust=-0.25)+
  scale_fill_manual(values = palette7 ) + 
  labs(y= 'Dollars') +
  ggtitle('Project Type')+
  #theme(legend.position = 'none')+
  plotTheme+
  theme(axis.title.x = element_blank())
#First Plot with Race and Gender
grid.arrange(RaceIncomePlot+theme(legend.position = 'bottom', legend.spacing = unit(.5, 'cm'), legend.text=element_text(size=5), plot.margin=ggplot2::margin(10,10,10,50)), GenderIncomePlot+theme(legend.position = 'bottom', legend.spacing = unit(.5, 'cm'), legend.text=element_text(size=7)), ncol=2)

#Second Plot with Disability, Veteran and Income
grid.arrange(DisabilityIncomePlot+theme(legend.position = 'hidden'), VeteranIncomePlot+theme(legend.position = 'hidden'), DomViolenceIncomePlot+theme(legend.position = 'hidden'), ncol=3) 

#Last Plot with Project Type
ProjectTypeIncomePlot+theme(legend.position = 'hidden')

Section 3. Predictive Modeling

1. Constructing an outcome

Despite the richness of the demographic and services data, it was not immediately evident what outcome would be most useful to predict for in service of our use case. The demographics data offers important variables about individuals’ statuses, yet contained large amounts of missing data. The services data represented historical transactions - not comprehensive information about clients - making direct predictions less useful.

Ultimately, we decided to engineer a pair of dependent variables from the services data: a person’s homelessness status over 6 and 12 month horizons. This status is defined as whether a client uses a Emergency Shelter Bed Night service at least once within the time horizon. The models are thus binary classifiers predicting a client’s status as not homeless(0) or homeless(1). Outcomes are predicted every six months, starting at the middle of 2016 (07/01/2016) through the end of 2020, resulting in 9 “start dates”. For each start date, two “end dates” are constructed, representing a 6 month and 12 month prediction.

Only clients that have received services within 6 months of a start date will be predicted for in the model. This helps ensure that there is not sequential overlap in outcome variables - this is especially important as we will be using historical services data as inputs (i.e., time lag features) to our model. Restricting the history for a given start date (prediction point) to six months ensures that the subsequent start date does not use the same services data.

This outcome variable was selected for its interpretability and applicability to the client’s mission. The outcomes will eventually be aggregated to the service provider level to offer MDHA visibility into a given service provider’s proportion of its client population expected to experience homelessness in the next six months or year.

The time intervals are constructed below for each of the 9 start dates, and clients’ 6 and 12 month homeless status is calculated. The table below aggregates the homeless status up to each prediction date.

start_dates <- list(ymd('2016-07-01'), ymd('2017-01-01'), ymd('2017-07-01'), ymd('2018-01-01'), ymd('2018-07-01'), ymd('2019-01-01'), ymd('2019-07-01'), ymd('2020-01-01'), ymd('2020-07-01'))

cumulative_df <- data.frame(ClientID=numeric(), start=numeric(), end=numeric())

for (d in start_dates)
{
  # Only individuals who have received services within a certain lag of the prediction time
  temp_df <- h_combined %>% filter(ServiceDate < d, ServiceDate > d %m-% months(6)) %>% dplyr::select(ClientID)
  temp_df$start <- d
  temp_df$end_6 <- temp_df$start %m+% months(6)
  temp_df$end_12 <- temp_df$start %m+% months(12)
  temp_df <- left_join(temp_df, 
              h_combined %>% 
                select(ClientID, ServiceDate, AliasOrganizationName, AliasOrgNameShort), by='ClientID') %>% 
                filter(ServiceDate < start) %>% 
                group_by(ClientID, start, end_6, end_12) %>% 
                mutate(MostRecentServiceDate = max(ServiceDate)) %>% 
                select (ClientID, start, end_6, end_12, MostRecentServiceDate, AliasOrgNameShort, AliasOrganizationName) %>% 
                distinct()
  cumulative_df <- rbind(cumulative_df, temp_df %>% unique())
}
options(dplyr.summarise.inform = FALSE)

# Construct the dependent variable
rolling_es <- 
  left_join(cumulative_df, 
                        h_combined %>% filter(ProgramType=='ES'& Service=='Bed night') %>% dplyr::select(ClientID, ServiceDate), 
                        by = 'ClientID') %>%
  mutate(
    ES_6_Months = case_when(
      start>ServiceDate ~ 0,
      end_6<ServiceDate ~ 0,
      is.na(ServiceDate) ~ 0,
      TRUE ~ 1),
    ES_12_Months = case_when(
      start>ServiceDate ~ 0,
      end_12<ServiceDate ~ 0,
      is.na(ServiceDate) ~ 0,
      TRUE ~ 1)) %>%
  group_by(ClientID,AliasOrganizationName, AliasOrgNameShort, start, end_6, end_12) %>%
  summarise(ES_6_Months=max(ES_6_Months), ES_12_Months=max(ES_12_Months)) 

# Add in information about the family ID
rolling_es <- left_join(rolling_es, h_combined %>% dplyr::select(ClientID, FamilyID) %>% unique(), by='ClientID')

rolling_es %>% group_by(start) %>%
  summarise('Six Month Observed Homeless Rate' = paste(round(sum(ES_6_Months)/length(ClientID) *100, 2), "%"),
            'Twelve Month Observed Homeless Rate' = paste(round(sum(ES_12_Months)/length(ClientID) *100, 2), "%")) %>%
  rename('Prediction Date'=start) %>%
  kable() %>% kable_styling()
Prediction Date Six Month Observed Homeless Rate Twelve Month Observed Homeless Rate
2016-07-01 38.37 % 40.06 %
2017-01-01 18.29 % 22.49 %
2017-07-01 24.67 % 31.14 %
2018-01-01 27.65 % 31.28 %
2018-07-01 27.74 % 33.27 %
2019-01-01 27.25 % 32.96 %
2019-07-01 26.81 % 30.35 %
2020-01-01 24.33 % 27.5 %
2020-07-01 29.65 % 29.65 %

The table above shows the observed outcome for each of the outcome variables by start date. In other words, these percentages represents the total share of the client population that experienced homelessness within a 6 or 12 month time horizon.

On the whole, the observed outcome variable is relatively tightly distributed, with the first two prediction dates (07/01/2016 and 01/01/2017) as notable high and low outliers, respectively.

2. Feature Engineering

Our feature engineering process will leverage the rich historical services data that we have access to. We will calculate clients’ experiences in Emergency Shelter, Street Outreach, Rapid Re-housing, and Street Outreach. We also calculate the number of family members that a client has in the services population. Again, all of these features limit calculations to the previous six months to prevent data leakages across instances of our outcome variables. In other words, because the features only “look back” six months, none of the outcome variables, which are also in six month intervals, will include the same data.

These time lag features hold the greatest predictive power in our model, engineered from the services data to reflect clients’ experiences with both homeless and housed services at a given point in time.

options(dplyr.summarise.inform = FALSE)
rolling_es_w_prior <- 
  left_join(rolling_es, 
            h_combined %>% filter(ProgramType=='ES') %>% dplyr::select(ClientID, ServiceDate, Service),
            by='ClientID') %>%
  mutate(
    PriorService = case_when(
      start<ServiceDate ~ 0,
      start %m-% months(6) > ServiceDate ~ 0,
      TRUE ~ 1),
    DaysSinceES= case_when(
      PriorService == 1 ~ start-ServiceDate,
      TRUE ~ 0))%>%
  group_by(ClientID, AliasOrganizationName, AliasOrgNameShort, FamilyID, start, end_6, end_12, ES_6_Months, ES_12_Months) %>%
  summarise(DaysSinceES = min(DaysSinceES),
    ES_Services_Prior = sum(PriorService==1),
    Bed_Night_Services_Prior = sum(Service=='Bed night' & PriorService==1)
    )
PSH_Services_Prior <-
  left_join(rolling_es, h_combined %>% filter(ProgramType=='PSH') %>% dplyr::select(ClientID, ServiceDate, Service),by='ClientID') %>%
  mutate(PriorService = case_when(
    start<ServiceDate ~ 0,
    start %m-% months(6) > ServiceDate ~ 0,
    TRUE ~ 1))%>%
  group_by(ClientID, FamilyID, start, end_6, end_12, ES_6_Months, ES_12_Months) %>%
  summarise(PSH_Services_Prior = sum(PriorService==1)) %>%
  ungroup() %>%
  dplyr::select(ClientID, start, PSH_Services_Prior)
rolling_es_w_prior <- left_join(rolling_es_w_prior, PSH_Services_Prior, by=c('ClientID', 'start'))

# RRH Services
RRH_Services_Prior <-
  left_join(rolling_es, h_combined %>% filter(ProgramType=='RRH') %>% dplyr::select(ClientID, ServiceDate, Service),by='ClientID') %>%
  mutate(PriorService = case_when(
    start<ServiceDate ~ 0,
    start %m-% months(6) > ServiceDate ~ 0,
    TRUE ~ 1))%>%
  group_by(ClientID, FamilyID, start, end_6, end_12, ES_6_Months, ES_12_Months) %>%
  summarise(RRH_Services_Prior = sum(PriorService==1)) %>%
  ungroup() %>%
  dplyr::select(ClientID, start, RRH_Services_Prior)
rolling_es_w_prior <- left_join(rolling_es_w_prior, RRH_Services_Prior, by=c('ClientID', 'start'))

SO_Services_Prior <-
  left_join(rolling_es, h_combined %>% filter(ProgramType=='SO') %>% dplyr::select(ClientID, ServiceDate, Service),by='ClientID') %>%
  mutate(PriorService = case_when(
    start<ServiceDate ~ 0,
    start %m-% months(6) > ServiceDate ~ 0,
    TRUE ~ 1))%>%
  group_by(ClientID, FamilyID, start, end_6, end_12, ES_6_Months, ES_12_Months) %>%
  summarise(SO_Services_Prior = sum(PriorService==1)) %>%
  ungroup() %>%
  dplyr::select(ClientID, start, SO_Services_Prior)
rolling_es_w_prior <- left_join(rolling_es_w_prior, SO_Services_Prior, by=c('ClientID', 'start'))

rolling_es_w_prior <- left_join(rolling_es_w_prior, 
                                all_dem %>% dplyr::select(
                                  clientid, Races, HUDEthnicity, Gender, VeteranStatus,DisablingCondition,Max.Income.from.any.Assessment,DomesticViolenceVictim, Medicare, PhysicalDisability, DevelopmentalDisability, ChronicHealthCondition, DrugAbuse, AlcoholAbuse, Medicaid),
                                by=c('ClientID' = 'clientid'))

rolling_es_w_prior$DaysSinceES <- as.numeric(rolling_es_w_prior$DaysSinceES)

rolling_es_w_prior$RacesDisplay <- case_when(rolling_es_w_prior$Races == 'White' ~ 'White',
                                              rolling_es_w_prior$Races == 'Black or African American' ~ 'Black or African American',
                                              rolling_es_w_prior$Races == 'Asian' ~ 'Asian',
                                             rolling_es_w_prior$Races == 'Native Hawaiian or Other Pacific Islander' ~ 'Native Hawaiian or Other Pacific Islander',
                                             rolling_es_w_prior$Races == 'American Indian or Alaska Native' ~ 'American Indian or Alaska Native',
                                              rolling_es_w_prior$Races == "Client refused" ~ "Unknown",
                                              rolling_es_w_prior$Races == "Data not collected" ~ "Unknown",
                                              rolling_es_w_prior$Races == "Client doesn't know" ~ 'Unknown',
                                             is.na(rolling_es_w_prior$Races) ~ 'Unknown',
                                              TRUE ~ 'Two or more races'
                                              )
rolling_es_w_prior$RacesDisplayShort <- case_when(rolling_es_w_prior$Races == 'White' ~ 'white',
                                              rolling_es_w_prior$Races == 'Black or African American' ~ 'baa',
                                              rolling_es_w_prior$Races == 'Asian' ~ 'Asian',
                                             rolling_es_w_prior$Races == 'Native Hawaiian or Other Pacific Islander' ~ 'ntvhi',
                                             rolling_es_w_prior$Races == 'American Indian or Alaska Native' ~ 'amind',
                                              rolling_es_w_prior$Races == "Client refused" ~ "unkwn",
                                              rolling_es_w_prior$Races == "Data not collected" ~ "unkwn",
                                              rolling_es_w_prior$Races == "Client doesn't know" ~ 'unkwn',
                                             is.na(rolling_es_w_prior$Races) ~ 'unkwn',
                                              TRUE ~ 'twomore'
                                              )

rolling_es_w_prior$GenderDisplay <- case_when(rolling_es_w_prior$Gender == 'Male (1)' ~ 'Male',
                                              rolling_es_w_prior$Gender == 'Female (0)' ~ 'Female',
                                              rolling_es_w_prior$Gender == 'Trans Female (MTF or Male to Female) (2)' ~ 'Trans Female',
                                              rolling_es_w_prior$Gender == "Trans Male (FTM or Female to Male) (3)" ~ "Trans Male",
                                              rolling_es_w_prior$Gender == "Gender Non-Conforming (i.e. not exclusively male or female) (4)" ~ "Gender Non-Conforming", 
                                              TRUE ~ "Unknown"
                                              )


rolling_es_w_prior$GenderDisplayShort <- case_when(rolling_es_w_prior$Gender == 'Male (1)' ~ 'male',
                                              rolling_es_w_prior$Gender == 'Female (0)' ~ 'female',
                                              rolling_es_w_prior$Gender == 'Trans Female (MTF or Male to Female) (2)' ~ 'trfemale',
                                              rolling_es_w_prior$Gender == "Trans Male (FTM or Female to Male) (3)" ~ "trmale",
                                              rolling_es_w_prior$Gender == "Gender Non-Conforming (i.e. not exclusively male or female) (4)" ~ "gnc", 
                                              TRUE ~ "unkwn"
                                              )

rolling_es_w_prior$IncomeDisplay <- case_when (
                                              rolling_es_w_prior$Max.Income.from.any.Assessment <= 200 ~ '200',
                                              rolling_es_w_prior$Max.Income.from.any.Assessment < 500 ~ '200-500',
                                              rolling_es_w_prior$Max.Income.from.any.Assessment < 800 ~ '500-800',
                                              rolling_es_w_prior$Max.Income.from.any.Assessment >= 800 ~ '800',
                                              TRUE ~ "Unknown"
                                            )
rolling_es_w_prior$IncomeDisplayShort <- case_when (
                                              rolling_es_w_prior$Max.Income.from.any.Assessment <= 200 ~ '200',
                                              rolling_es_w_prior$Max.Income.from.any.Assessment < 500 ~ '200-500',
                                              rolling_es_w_prior$Max.Income.from.any.Assessment < 800 ~ '500-800',
                                              rolling_es_w_prior$Max.Income.from.any.Assessment >= 800 ~ '800',
                                              TRUE ~ "unkwn"
                                            )

rolling_es_w_prior$VeteranDisplay <- case_when (
                                              rolling_es_w_prior$VeteranStatus == 'Yes (1)' ~ 'Veteran',
                                              rolling_es_w_prior$VeteranStatus == 'No (0)' ~ 'Non-veteran',
                                              TRUE ~ 'Unknown'
                                            )
rolling_es_w_prior$VeteranDisplayShort <- case_when (
                                              rolling_es_w_prior$VeteranStatus == 'Yes (1)' ~ 'vet',
                                              rolling_es_w_prior$VeteranStatus == 'No (0)' ~ 'novet',
                                              TRUE ~ 'unkwn'
                                            )

rolling_es_w_prior$DisabDisplay <- case_when (
                                              rolling_es_w_prior$DisablingCondition == 'Yes (1)' ~ 'Disabled',
                                              rolling_es_w_prior$DisablingCondition == 'No (0)' ~ 'Not disabled',
                                              TRUE ~ 'Unknown'
                                            )
rolling_es_w_prior$DisabDisplayShort <- case_when (
                                              rolling_es_w_prior$DisablingCondition == 'Yes (1)' ~ 'disab',
                                              rolling_es_w_prior$DisablingCondition == 'No (0)' ~ 'nodisab',
                                              TRUE ~ 'unkwn'
                                            )

rolling_es_w_prior$EthnicityDisplay <- case_when (
                                              rolling_es_w_prior$HUDEthnicity == 'Hispanic/Latino (H)' ~ 'Hispanic/Latino',
                                              rolling_es_w_prior$HUDEthnicity == 'Non-Hispanic/Latino (O)' ~ 'Non-Hispanic/Latino',
                                              TRUE ~ 'Unknown'
                                            )
rolling_es_w_prior$EthnicityDisplayShort <- case_when (
                                              rolling_es_w_prior$HUDEthnicity == 'Hispanic/Latino (H)' ~ 'hisp',
                                              rolling_es_w_prior$HUDEthnicity == 'Non-Hispanic/Latino (O)' ~ 'nonhisp',
                                              TRUE ~ 'unkwn'
                                            )
FamilyCounts <- h_combined %>% group_by(FamilyID) %>% summarise(CountFamilyinSystem = n_distinct(ClientID))
rolling_es_w_prior <- left_join(rolling_es_w_prior, FamilyCounts, by='FamilyID') 
rolling_es_w_prior <- rolling_es_w_prior %>% ungroup()

# Replace time lag variable NAs with 0.
rolling_es_w_prior[c('ES_Services_Prior', 'Bed_Night_Services_Prior', 'PSH_Services_Prior', 'RRH_Services_Prior', 'SO_Services_Prior', 'CountFamilyinSystem')][is.na(rolling_es_w_prior[c('ES_Services_Prior', 'Bed_Night_Services_Prior', 'PSH_Services_Prior', 'RRH_Services_Prior', 'SO_Services_Prior', 'CountFamilyinSystem')])] <- 0

# Drop these original demographic variables because they've been replaced with recategorized variables
rolling_es_w_prior <- rolling_es_w_prior %>% select(-DaysSinceES, -Races, -HUDEthnicity, -Gender, -VeteranStatus, -DisablingCondition, -'Max.Income.from.any.Assessment')

#Now fill in the rest with Unknown
rolling_es_w_prior[c('Medicare', 'DomesticViolenceVictim')][is.na(rolling_es_w_prior[c('Medicare', 'DomesticViolenceVictim')])] <- "Unknown"

rolling_es_w_prior[c('PhysicalDisability', 'DevelopmentalDisability', 'ChronicHealthCondition', 'DrugAbuse', 'AlcoholAbuse', 'Medicaid')][is.na(rolling_es_w_prior[c('PhysicalDisability', 'DevelopmentalDisability', 'ChronicHealthCondition', 'DrugAbuse', 'AlcoholAbuse', 'Medicaid')])] <- "Unknown"


# Fill in FamilyID NAs with 999.
rolling_es_w_prior['FamilyID'][is.na(rolling_es_w_prior['FamilyID'])] <- 999

We also have access to a substantial amount of demographic data. Here we join in a select amount of the HMIS data that will be used for predictive modeling. Some categorical variables, however, have many possible values (e.g., Races, Gender,), so we will group them into higher level categories - this boosts their predictive power. For the income variable, we convert a continuous variable into ordinal ranges.

Importantly, the population of clients represented in the services and demographics datasets are not the same. Some clients in the services dataset are not in the demographics dataset and vice versa. Since our outcome variable (and subsequent modeling process) is built on the services data, we will rely primarily on the services dataset. Time lag variables with no data will be zero-filled to indicate that a client did not interact with a program type during a given time frame.

The table below summarizes all the features going into our model, by category.

featuretable <- rbind(data.frame(Category = 'Time Lag', 
                           Feature = c('Family Members Receiving Services', 'Prior ES Services',  'Prior RRH Services','Prior PSH Services', 'Prior SO Services')), 
                      data.frame(Category = 'Demographics',
                                 Feature = c('Ethnicity','Income','Medicare Use','Medicaid Use','Domestic Violence Victim','Drug Abuse',          'Alcohol Abuse')))

featuretable %>% kable() %>% kable_styling()
Category Feature
Time Lag Family Members Receiving Services
Time Lag Prior ES Services
Time Lag Prior RRH Services
Time Lag Prior PSH Services
Time Lag Prior SO Services
Demographics Ethnicity
Demographics Income
Demographics Medicare Use
Demographics Medicaid Use
Demographics Domestic Violence Victim
Demographics Drug Abuse
Demographics Alcohol Abuse

Now that outcome variables and features have been engineered, we can visualize the relationships between the dependent and independent variables in our dataset. Select plots are shown below.

rolling_es_w_prior$ES_6_Months <- as.factor(rolling_es_w_prior$ES_6_Months)
rolling_es_w_prior$ES_12_Months <- as.factor(rolling_es_w_prior$ES_12_Months)

# Faceted bar plots predictors v. outcome
rolling_es_w_prior %>%
  dplyr::select(ES_6_Months, SO_Services_Prior, ES_Services_Prior, Bed_Night_Services_Prior,
                RRH_Services_Prior, PSH_Services_Prior, CountFamilyinSystem) %>%
  gather(Variable, value, -ES_6_Months) %>%
  ggplot(aes(ES_6_Months, value, fill=ES_6_Months)) +
  geom_bar(position='dodge', stat = 'summary', fun='mean') +
  facet_wrap(~Variable, scales='free') +
  scale_fill_manual(values=palette2) +
  labs(x="Use of Emergency Shelter Bed Night Within 6 Months", y="Mean", fill='Observed \nHomelessness', 
       title = "Feature associations with the likelihood of ES Use: 6 Months",
       subtitle = "(Continous inputs)") +
  plotTheme #+ theme(legend.position = "none")

Use of homelessness services, such as Bed Nights and other ES services, show in the top row, seems to have an association with use of Emergency Shelter Bed Night services within a 6 month time frame, which is intuitive. Permanent housing services such as PSH and RRH seem to have a negative association with the outcome variable which also makes sense. Having a higher number of family members receiving services in the system seems to have a moderately negative association with the outcome variable.

3. Construct the models

Our outcome is a binary variable, so for the model, we will develop a classifier. First we construct a logistic regression, then a random forest. One model is created for each time horizon, 6 and 12 months, for a total of four models. We will compare the results and select the best model for final predictions.

The full dataset is broken into a 75-25 train-test split, stratified by the outcome variable to ensure equal distribution of the labels across the train and test sets. Each of the four model’s predictions are then compared against the test set labels.

set.seed(123)
esSplit <- initial_split(rolling_es_w_prior, strata = ES_6_Months)
esTrain <- training(esSplit)
esTest <- testing(esSplit)
# Variables that will be used by all four classifiers
clf.vars <- c('CountFamilyinSystem', 'ES_Services_Prior',  'RRH_Services_Prior','PSH_Services_Prior', 'SO_Services_Prior','EthnicityDisplay','IncomeDisplay','Medicare','DomesticViolenceVictim','DrugAbuse','AlcoholAbuse','Medicaid')

glm.6.mo <- glm(ES_6_Months ~ .,
                  data=esTrain %>% 
                    dplyr::select(ES_6_Months, all_of(clf.vars)),
                  family="binomial" 
                  (link="logit"))
glm.12.mo <- glm(ES_12_Months ~ .,
                  data=esTrain %>% 
                    dplyr::select(ES_12_Months, all_of(clf.vars)),
                  family="binomial" 
                  (link="logit"))

GridSearch was performed on the random forest model to determine the optimal values for hyperparameters - this process takes roughly 2 hours. The identified optimal values for hyperparameters min_n and mtry are set as inputs below to reduce processing time for this code, but the grid search code is included below as well.

Because the 12 month model shares all the same features with the 6 month model, we will not retune the hyperparameters. Instead, we will just refit the model for the entire training set and the 12 month outcome variable.

rf_rec <- recipe(ES_6_Months ~ ., data = esTrain %>% dplyr::select(ES_6_Months, all_of(clf.vars))) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  themis::step_downsample(ES_6_Months)
rf_prep <- prep(rf_rec)
juiced <- juice(rf_prep)

final_rf <- rand_forest(
  mtry = 11,
  trees = 1000,
  min_n = 2) %>%
  set_mode("classification") %>%
  set_engine("ranger")

final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(ES_6_Months ~ .,
      data = juice(rf_prep)
  )

final_wf <- workflow() %>%
  add_recipe(rf_rec) %>%
  add_model(final_rf)

final_res <- final_wf %>%
  last_fit(esSplit)

rf_preds_6 <- as.data.frame(final_res %>%
  collect_predictions() %>%
  mutate(correct = case_when(
    ES_6_Months == .pred_class ~ "Correct",
    TRUE ~ "Incorrect"
  )))

rf_rec_12 <- recipe(ES_12_Months ~ ., data = esTrain %>% dplyr::select(ES_12_Months, all_of(clf.vars))) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  themis::step_downsample(ES_12_Months)
rf_prep_12 <- prep(rf_rec_12)

# Retrain the model on the full train set and 12 mo outcome var
final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(ES_12_Months ~ .,
      data = juice(rf_prep_12)
  ) %>%
  vip(geom = "point", aesthetics = list(size=3)) +
  ggtitle('Feature importance to Random Forest Model') + 
  plotTheme

final_wf_12 <- workflow() %>%
  add_recipe(rf_rec_12) %>%
  add_model(final_rf)

final_res_12 <- final_wf_12 %>%
  last_fit(esSplit)

rf_preds_12 <- as.data.frame(final_res_12 %>%
  collect_predictions() %>%
  mutate(correct = case_when(
    ES_12_Months == .pred_class ~ "Correct",
    TRUE ~ "Incorrect"
  )))

The feature importance plot provides a useful visualization of the relative importance of the various features in our model. Historical use of Emergency Shelter services is far and away the most important predictor, with other time lag variables also holding meaningful predictive power. Of the demographic features, drug abuse and alcohol abuse appear to be the most important to the random forest models’ predictions.

4. Model evaluation

With each of the models trained, let’s predict on the test set and evaluate the predictions as they compare to observed outcomes. The most effective way to do this is to look at the predicted probability distribution for each model and outcome variable separately.

# Add the RF predictions to this same DF
testProbs <- data.frame(Outcome.6 = as.factor(esTest$ES_6_Months),
                        Outcome.12 = as.factor(esTest$ES_12_Months),
                        Probs.GLM.6 = predict(glm.6.mo, esTest, type= "response"),
                        Probs.RF.6 = rf_preds_6$.pred_1,
                        Probs.GLM.12 = predict(glm.12.mo, esTest, type="response"),
                        Probs.RF.12 = rf_preds_6$.pred_1,
                        RacesDisplay = esTest$RacesDisplay,
                        GenderDisplay = esTest$GenderDisplay)
glm.6mo.preds.chart <- 
  ggplot(testProbs, aes(x = Probs.GLM.6, fill = as.factor(Outcome.6))) + 
    geom_density() +
    scale_fill_manual(values=palette2) +
    facet_grid(Outcome.6 ~ .) +
    labs(x = "ES Within 6 Months", y = "Density of probabilities", 
         title='Logistic Regression') +
    theme(strip.text.x = element_text(size = 18), legend.position = "none") + 
    plotTheme

glm.12mo.preds.chart <-
  ggplot(testProbs, aes(x = Probs.GLM.12, fill = as.factor(Outcome.12))) + 
    geom_density() +
    scale_fill_manual(values=palette2) +
    facet_grid(Outcome.12 ~ .) +
    labs(x = "ES Within 12 Months", y = "Density of probabilities", fill='Predicted Homelessness') +
    #theme(strip.text.x = element_text(size = 18), legend.position = "none") + 
    plotTheme

rf.6mo.preds.chart <- 
  ggplot(testProbs, aes(x = Probs.RF.6, fill = as.factor(Outcome.6))) + 
    geom_density() +
    scale_fill_manual(values=palette2) +
    facet_grid(Outcome.6 ~ .) +
    labs(x = "ES Within 6 Months", y = "Density of probabilities",
         title='Random Forest') +
    theme(strip.text.x = element_text(size = 18), legend.position = "none") + 
    plotTheme

rf.12mo.preds.chart <-
  ggplot(testProbs, aes(x = Probs.RF.12, fill = as.factor(Outcome.12))) + 
    geom_density() +
    scale_fill_manual(values=palette2) +
    facet_grid(Outcome.12 ~ .) +
    labs(x = "ES Within 12 Months", y = "Density of probabilities", fill='Predicted Homelessness') +
    #theme(strip.text.x = element_text(size = 18), legend.position = "none") + 
    plotTheme

#grid.arrange(glm.6mo.preds.chart, rf.6mo.preds.chart,glm.12mo.preds.chart, rf.12mo.preds.chart, ncol=2, top=('Distribution of predicted probabilities by observed outcome'))

figure1 <- ggarrange(glm.6mo.preds.chart, rf.6mo.preds.chart, ncol=2)
figure2 <- ggarrange(glm.12mo.preds.chart, rf.12mo.preds.chart, ncol=2, common.legend = TRUE, legend='bottom')

annotate_figure(figure1, top=text_grob("Distribution of predicted probabilities by observed outcome", face="bold", size=18))

annotate_figure(figure2)

The random forest models clearly do a much better job of separating out the probabilities. The observed 0 outcomes (not homeless) are clustered around lower probabilities and the observed 1 outcomes (homeless) are clustered around higher probabilities. With the logistic regression, it is harder to visually identify a probability threshold that would accurately predict clients’ outcomes.

Next, we’ll look at ROC curves and their respective areas under the curves (AUC) to continue evaluating logistic regression compared to random forest.

ggplot(gather(testProbs, Model, Probs, Probs.GLM.6:Probs.RF.12), 
       aes(d = as.numeric(Outcome.6), m = Probs, color=Model)) + 
  geom_roc(n.cuts = 50, labels = FALSE) + 
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') + 
  labs(title='ROC Curves by Model') +
  plotTheme

all_aucs <- data.frame(Model= c('Logistic Regression - 6 Month', 'Logistic Regression - 12 Month',
                                'Random Forest - 6 Month', 'Random Forest - 12 Month'),
                       AUC = c(auc(testProbs$Outcome.6, testProbs$Probs.GLM.6), auc(testProbs$Outcome.12, testProbs$Probs.GLM.12),
                               auc(testProbs$Outcome.6, testProbs$Probs.RF.6), auc(testProbs$Outcome.12, testProbs$Probs.RF.12)))

all_aucs %>% kable() %>% kable_styling()
Model AUC
Logistic Regression - 6 Month 0.7698508
Logistic Regression - 12 Month 0.7584693
Random Forest - 6 Month 0.8689312
Random Forest - 12 Month 0.8579623

The ROC curves clearly demonstrate that the random forest model is superior. They achieve higher true positive rates and lower false positive rates for each threshold, which is reflected in the AUC values - there is a difference of about 0.1 for both the AUCs for both outcome variables. This is sufficient evidence to justify selecting the random forest model. From this point on, we will proceed with the random forest model only.

With our model selected, we can evaluate it a bit more closely by subpopulations. Here we’ll look at the same ROC chart, but separated out by racial group.

ggplot(testProbs, 
       aes(d = as.numeric(Outcome.6), m = Probs.RF.6, group=RacesDisplay, colour=RacesDisplay)) + 
  geom_roc(n.cuts = 50, labels = FALSE) +
  scale_color_manual(values = palette7) +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title="ROC Curves by race", fill='Race') +
  plotTheme

There is some variability among the ROC curves by racial groups. First, the less smooth curves are a reflection of predicting for small populations, i.e., those racial groups that do not have a large population in the test set or system overall, including Asian and Native Hawaiian or Other Pacific Islander.

The ROC curve for the Two or more races group appears to have less area under it curves that the White or Black / African American curves.

5. Cost Benefit Analysis

With the Random Forest model selected, we may proceed forward with threshold selection. Up until this point, we have selected a model to predict the probability that a client will experience homelessness over 6 and 12 month time horizons for given points in time. Next, these probabilities must be converted into binary outcomes: homeless or not homeless. In order to do that we must select a threshold - below this threshold, a client will be predicted to not experience homelessness and above the threshold, a client will be predicted to experience homelessness.

To quantify this, we identified the monetary costs associated with the four possible scenarios that represent all combinations of the binary observed outcome and the binary predicted outcome. These four scenarios and their associated unit costs (i.e., the cost of one client experiencing the scenario) are summarized below. Scenarios in green represent correct predictions, and those in red represent incorrect predictions.

cbunittbale <- data.frame(Scenario = c('Client does not need bed night, does not use it', 'Client needs a bed night, uses it', 'Client does not need a bed night, uses it', 'Client needs a bed night, does not receive it'),
                           Cost = c('$0', '$80', '$80', '$137'))

cbunittbale %>% kable() %>% kable_styling('striped') %>%
  row_spec(c(1:2), bold = F, color = "white", background = "#a1c17e") %>%
  row_spec(c(3:4), bold = F, color = "white", background = "#f39f9f") %>%
  add_footnote('Source: HUD, U.S. Council on Homelessness')
Scenario Cost
Client does not need bed night, does not use it $0
Client needs a bed night, uses it $80
Client does not need a bed night, uses it $80
Client needs a bed night, does not receive it $137
a Source: HUD, U.S. Council on Homelessness

Now that unit costs have been identified, we can develop a cost function to calculate system-level costs for each of the four scenarios. In this next section we iterate through 100 thresholds 0-1 (increments of 0.01), calculate a confusion matrix for each, and the total costs for each scenario per threshold. The plot below colors the costs by scenario and also includes total costs.

options(scipen=999)
options(dplyr.summarise.inform = FALSE)
iterateThresholds <- function() {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
    
    this_prediction <-
      testProbs %>%
      mutate(predOutcome = ifelse(Probs.RF.6 > x, 1, 0)) %>%
      count(predOutcome, Outcome.6) %>%
      summarize('True Negative' = sum(n[predOutcome==0 & Outcome.6==0]),
                'True Positive' = sum(n[predOutcome==1 & Outcome.6==1]),
                'False Negative' = sum(n[predOutcome==0 & Outcome.6==1]),
                'False Positive' = sum(n[predOutcome==1 & Outcome.6==0])) %>%
      gather(Variable, Count) %>%
      mutate(Cost =
           ifelse(Variable == "True Negative", Count * 0,
                  ifelse(Variable == "True Positive",(Count * 80),
                         ifelse(Variable == "False Negative", Count *137,
                                ifelse(Variable == "False Positive",80 * Count, 0)))),
             Threshold = x)
    
    all_prediction <- rbind(all_prediction, this_prediction)
    x <- x + .01
  }
  return(all_prediction)
}
whichThreshold.6 <- iterateThresholds()
thresholdTotals.6 <- whichThreshold.6 %>% group_by(Threshold) %>% summarise(Cost = sum(Cost), Count=sum(Count)) %>% mutate(Variable='Total')
whichThreshold.6 <- rbind(whichThreshold.6, thresholdTotals.6)
threshold6.chart <- 
  whichThreshold.6 %>%
    ggplot(.,aes(Threshold, Cost, colour = Variable)) +
    geom_point() +
    scale_colour_manual(values = palette5) +    
    labs(title = "6 Month Outcome",
         y="Cost") +
    guides(colour=guide_legend(title="Confusion Matrix Scenario"))

iterateThresholds <- function() {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
    
    this_prediction <-
      testProbs %>%
      mutate(predOutcome = ifelse(Probs.RF.12 > x, 1, 0)) %>%
      count(predOutcome, Outcome.12) %>%
      summarize('True Negative' = sum(n[predOutcome==0 & Outcome.12==0]),
                'True Positive' = sum(n[predOutcome==1 & Outcome.12==1]),
                'False Negative' = sum(n[predOutcome==0 & Outcome.12==1]),
                'False Positive' = sum(n[predOutcome==1 & Outcome.12==0])) %>%
      gather(Variable, Count) %>%
      mutate(Cost =
           ifelse(Variable == "True Negative", Count * 0,
                  ifelse(Variable == "True Positive",(Count * 80),
                         ifelse(Variable == "False Negative", Count *137,
                                ifelse(Variable == "False Positive",80 * Count, 0)))),
             Threshold = x)
    
    all_prediction <- rbind(all_prediction, this_prediction)
    x <- x + .01
  }
  return(all_prediction)
}

whichThreshold.12 <- iterateThresholds()
thresholdTotals.12 <- whichThreshold.12 %>% group_by(Threshold) %>% summarise(Cost = sum(Cost), Count=sum(Count)) %>% mutate(Variable='Total')
whichThreshold.12 <- rbind(whichThreshold.12, thresholdTotals.12)
threshold12.chart <- 
  whichThreshold.12 %>%
    ggplot(.,aes(Threshold, Cost, colour = Variable)) +
    geom_point() +
    scale_colour_manual(values = palette5) +    
    labs(title = "12 Month Outcome",
         y="Cost") +
   guides(colour=guide_legend(title="Confusion Matrix Scenario"))

figure <- ggarrange(threshold6.chart, threshold12.chart, ncol=2, common.legend = TRUE, legend='bottom')

annotate_figure(figure, top=text_grob('Cost by confusion matrix type and threshold', face='bold', size='15'))

We can see that for both outcome variables, as the threshold increases, the total cost to the system decreases. This makes sense. An increasing threshold means that fewer people receive Emergency Shelter Bed Night services, reducing costs. However, as the threshold approaches 1, the total cost begins to increase again, because the cost of a false negative (denying a bed night to someone who needs it) is much higher than the cost of a false positive (the cost of a bed night).

We will select the threshold that allows for the largest distribution of bed night services to the most amount of clients at the lowest cost to the system. Let’s look at total cost by itself to make that determination.

options(scipen=10000)
totalcost6 <- 
  whichThreshold.6 %>% 
  filter(Variable=='Total') %>%
  dplyr::select(Threshold, Cost) %>%
  gather(Variable, Value, -Threshold) %>%
    ggplot(aes(Threshold, Value, colour = Variable)) +
    geom_point() +
    geom_vline(xintercept = 0.47) +
    scale_colour_manual(values = c('navy')) +
    labs(title = "6 Month Outcome - 0.47 Threshold", y ='Total cost ($)') + 
  plotTheme
totalcost12 <- 
  whichThreshold.12 %>% 
  filter(Variable=='Total') %>%
  dplyr::select(Threshold, Cost) %>%
  gather(Variable, Value, -Threshold) %>%
    ggplot(aes(Threshold, Value, colour = Variable)) +
    geom_point() +
    geom_vline(xintercept = 0.54) +
    scale_colour_manual(values = c('navy')) +
    labs(title = "12 Month Outcome - 0.54 Threshold", y ='Total cost ($)') +
  plotTheme
figure <- ggarrange(totalcost6, totalcost12, ncol=2, common.legend = TRUE, legend = 'bottom')

annotate_figure(figure, top=text_grob('Total cost by threshold', face='bold', size='15'),
                bottom = text_grob('Vertical lines denote selected thresholds', hjust=1, face='italic', size='12'))

Looking at just these total costs helps us see how adjustments to the threshold will impact overall costs. We can see that for a substantial amount of threshold values, the total cost stays within the $350,000 - $400,000 range. Because we are prioritizing the distribution of bed nights services, we are going to select the lowest threshold that keeps total cost below $400,000. Ultimately, this is a very subjective decision. The primary value of this analysis framework is that it allows administrators to easily see the trade offs of adjustments to the threshold if necessary, for example if there were particular budgetary requirements to factor in as well.

Now that thresholds have been selected, let’s reevaluate the cost benefit table from earlier to quantify the system costs for each model. We already know that total cost will be around $400,000.

cost_benefit_table_6 <-
  testProbs %>%
  mutate(predOutcome = ifelse(Probs.RF.6 > 0.47, 1, 0)) %>%
  count(predOutcome, Outcome.6) %>%
  summarize('True Negative' = sum(n[predOutcome==0 & Outcome.6==0]),
            'True Positive' = sum(n[predOutcome==1 & Outcome.6==1]),
            'False Negative' = sum(n[predOutcome==0 & Outcome.6==1]),
            'False Positive' = sum(n[predOutcome==1 & Outcome.6==0])) %>%
  gather(Variable, Count) %>%
  mutate('Unit Cost' =
           ifelse(Variable == "True Negative", '$0',
                  ifelse(Variable == "True Positive", '$80',
                         ifelse(Variable == "False Negative", '$137',
                                ifelse(Variable == "False Positive",'$80', 0)))),
    'Total Cost' =
           ifelse(Variable == "True Negative", paste("$", Count * 0),
                  ifelse(Variable == "True Positive",paste("$", Count * 80),
                         ifelse(Variable == "False Negative", paste("$",Count * 137),
                                ifelse(Variable == "False Positive", paste("$",80 * Count), 0))))) %>%
  bind_cols(data.frame(Description = c(
    'Client does not need bed night, does not use it', 'Client needs a bed night, uses it', 'Client needs a bed night, does not receive it', 'Client does not need a bed night, uses it')))
kable(cost_benefit_table_6,
      caption = "Cost/Benefit Table - 6 Month Outcome") %>% kable_styling()
Cost/Benefit Table - 6 Month Outcome
Variable Count Unit Cost Total Cost Description
True Negative 5101 $0 $ 0 Client does not need bed night, does not use it
True Positive 2242 $80 $ 179360 Client needs a bed night, uses it
False Negative 402 $137 $ 55074 Client needs a bed night, does not receive it
False Positive 2084 $80 $ 166720 Client does not need a bed night, uses it
cost_benefit_table_12 <-
  testProbs %>%
  mutate(predOutcome = ifelse(Probs.RF.12 > 0.54, 1, 0)) %>%
  count(predOutcome, Outcome.12) %>%
  summarize('True Negative' = sum(n[predOutcome==0 & Outcome.12==0]),
            'True Positive' = sum(n[predOutcome==1 & Outcome.12==1]),
            'False Negative' = sum(n[predOutcome==0 & Outcome.12==1]),
            'False Positive' = sum(n[predOutcome==1 & Outcome.12==0])) %>%
  gather(Variable, Count) %>%
  mutate('Unit Cost' =
           ifelse(Variable == "True Negative", '$0',
                  ifelse(Variable == "True Positive", '$80',
                         ifelse(Variable == "False Negative", '$137',
                                ifelse(Variable == "False Positive",'$80', 0)))),
    'Total Cost' =
           ifelse(Variable == "True Negative", paste("$", Count * 0),
                  ifelse(Variable == "True Positive",paste("$", Count * 80),
                         ifelse(Variable == "False Negative", paste("$",Count * 137),
                                ifelse(Variable == "False Positive", paste("$",80 * Count), 0))))) %>%
  bind_cols(data.frame(Description = c('Client does not need bed night, does not use it', 'Client needs a bed night, uses it', 'Client needs a bed night, does not receive it', 'Client does not need a bed night, uses it')))

kable(cost_benefit_table_12,
      caption = "Cost/Benefit Table - 12 Month Outcome") %>% kable_styling(position='float_right')
Cost/Benefit Table - 12 Month Outcome
Variable Count Unit Cost Total Cost Description
True Negative 5343 $0 $ 0 Client does not need bed night, does not use it
True Positive 2244 $80 $ 179520 Client needs a bed night, uses it
False Negative 765 $137 $ 104805 Client needs a bed night, does not receive it
False Positive 1477 $80 $ 118160 Client does not need a bed night, uses it

We can clearly see that the higher threshold for the 12 month outcome (0.54) as compared to the 6 month outcome (0.47) creates more false negatives. In other words, more clients in need of a bed night are being denied one for the 12 month outcome than the 6 month. On the other hand, the 6 month outcome has a higher cost associated with false positives - i.e., providing bed night services to clients who did not actually need them.

6. Additional evaluation

Another way to look at model performance is three high level metrics: accuracy, sensitivity, and specificity.

testProbs$predOutcome.6 <- as.factor(ifelse(testProbs$Probs.RF.6 > 0.47, 1, 0))
testProbs$predOutcome.12 <- as.factor(ifelse(testProbs$Probs.RF.12 > 0.54, 1, 0))

cm_1 <- caret::confusionMatrix(testProbs$predOutcome.6, testProbs$Outcome.6, 
                       positive = "1")
cm_2 <- caret::confusionMatrix(testProbs$predOutcome.12, testProbs$Outcome.12, 
                       positive = "1")

model_table <- data.frame(Model = c("6 Month Outcome", "12 Month Outcome"),
           Accuracy=c(cm_1$overall['Accuracy'],cm_2$overall['Accuracy']),
           Sensitivity =c(cm_1$byClass['Sensitivity'],cm_2$byClass['Sensitivity']),
           Specificity =c(cm_1$byClass['Specificity'],cm_2$byClass['Specificity']))
kable(model_table, align = "l",
      caption = "Model Comparison") %>% kable_styling()
Model Comparison
Model Accuracy Sensitivity Specificity
6 Month Outcome 0.7470750 0.8479576 0.7099513
12 Month Outcome 0.7718995 0.7457627 0.7834311

While a 74-77% accuracy may not seem that impressive, it is important to remember that we are most interested in the models’ abilities to predict for homelessness outcomes. Thus, we should look closely at the sensitivity numbers, which measure the ability of a model to predict for a positive outcome. The 84% sensitivity of the 6 Month Outcome model, in particular, is notably high. This is likely a function of our decision to down-sample the negative outcomes when we were building our random forest model, helping the model better predict positive outcomes.

We can also evaluate the same confusion metrics by racial group, which is another important way to understand how well our model generalizes across different subpopulations.

options(dplyr.summarise.inform = FALSE)
cm_race_6 <- testProbs %>%
  group_by(RacesDisplay) %>%
  count(predOutcome.6,Outcome.6) %>%
  summarize(sum_trueNeg = sum(n[predOutcome.6==0 & Outcome.6==0]),
            sum_truePos = sum(n[predOutcome.6==1 & Outcome.6==1]),
            sum_falseNeg = sum(n[predOutcome.6==0 & Outcome.6==1]),
            sum_falsePos = sum(n[predOutcome.6==1 & Outcome.6==0]),
            total=sum(n)) %>%
  mutate(True_Positive = sum_truePos / total,
         True_Negative = sum_trueNeg / total,
         False_Negative = sum_falseNeg / total,
         False_Positive = sum_falsePos / total,
         Accuracy = (sum_truePos + sum_trueNeg) / total) %>%
  select(RacesDisplay, True_Positive, True_Negative, False_Negative, False_Positive,Accuracy) %>%
  gather(key,value,True_Positive:Accuracy) %>%
    ggplot(.,aes(key,value,fill=RacesDisplay)) +
    geom_bar(aes(fill=RacesDisplay),position="dodge",stat="identity") +
    scale_fill_manual(values = palette7) +
    labs(title="6 Month Outcome (47% threshold)",
         x="Scenario",y="Rate", fill='Race') +
    #theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  plotTheme

cm_race_12 <- testProbs %>%
  group_by(RacesDisplay) %>%
  count(predOutcome.12,Outcome.12) %>%
  summarize(sum_trueNeg = sum(n[predOutcome.12==0 & Outcome.12==0]),
            sum_truePos = sum(n[predOutcome.12==1 & Outcome.12==1]),
            sum_falseNeg = sum(n[predOutcome.12==0 & Outcome.12==1]),
            sum_falsePos = sum(n[predOutcome.12==1 & Outcome.12==0]),
            total=sum(n)) %>%
  mutate(True_Positive = sum_truePos / total,
         True_Negative = sum_trueNeg / total,
         False_Negative = sum_falseNeg / total,
         False_Positive = sum_falsePos / total,
         Accuracy = (sum_truePos + sum_trueNeg) / total) %>%
  select(RacesDisplay, True_Positive, True_Negative, False_Negative, False_Positive,Accuracy) %>%
  gather(key,value,True_Positive:Accuracy) %>%
    ggplot(.,aes(key,value,fill=RacesDisplay)) +
    geom_bar(aes(fill=RacesDisplay),position="dodge",stat="identity") +
    scale_fill_manual(values = palette7) +
    labs(title="12 Month Outcome (54% threshold)",
         x="Scenario",y="Rate", fill='Race') + plotTheme
    #theme(axis.text.x = element_text(angle = 45, hjust = 1))

figure <- ggarrange(cm_race_6, cm_race_12, ncol=2, common.legend = TRUE, legend = 'bottom')

annotate_figure(figure, top=text_grob('Confusion matrix rates by race', face='bold', size='15'))

The metrics appear to be relatively tightly distributed across racial groups - this is important for the generalizability of our model. Importantly, some of the scenario metrics that are particularly higher or lower than the others in the group come from racial groups with a smaller sample size. For example, Asians have a much higher false negative rate, yet only represent 33 clients in the test set. Unfortunately, it is harder to accurately predict for such a small sample size.

For our final model evaluation, we will compare observed and predicted rates for our outcome variables across racial groups. This is another important way to evaluate the degree to which our model generalizes across subpopulations.

options(dplyr.summarise.inform = FALSE)
testProbs %>%
  group_by(RacesDisplay) %>%
  summarize(Clients = paste(round(n()/nrow(testProbs)*1000)/10, "%"),
            #Clients.Homeless.6 =  sum(as.numeric(Outcome.6) -1 ) ,
            #Clients.Predicted.Homeless.6 = sum(as.numeric(predOutcome.6) - 1), 
            'Observed Homeless \n Rate 6 Months' = paste(round(sum(as.numeric(Outcome.6) - 1) / n() * 100, 2), "%"),
            'Predicted Homeless \n Rate 6 Months' = paste(round(sum(as.numeric(predOutcome.6) - 1) / n() * 100, 2), "%"),
            #Clients.Homeless.12 =  sum(as.numeric(Outcome.12) -1 ) ,
            #Clients.Predicted.Homeless.12 = sum(as.numeric(predOutcome.12) - 1), 
            'Observed Homeless \n Rate 12 Months' = paste(round(sum(as.numeric(Outcome.12) - 1) / n() * 100, 2), "%"),
            'Predicted Homeless \n Rate 12 Months' = paste(round(sum(as.numeric(predOutcome.12) - 1) / n() * 100, 2), "%")) %>%
  arrange(desc(Clients)) %>%
  rename(Race=RacesDisplay) %>%
  kable() %>% kable_styling()
Race Clients Observed Homeless Rate 6 Months Predicted Homeless Rate 6 Months Observed Homeless Rate 12 Months Predicted Homeless Rate 12 Months
Black or African American 52.1 % 27 % 42.48 % 31.08 % 36.51 %
Unknown 22.9 % 30.1 % 50.82 % 33.7 % 43.53 %
White 21.3 % 23.98 % 41.04 % 27.07 % 35.37 %
Two or more races 1.9 % 23.04 % 37.7 % 25.65 % 34.03 %
American Indian or Alaska Native 1 % 15.84 % 41.58 % 18.81 % 33.66 %
Asian 0.3 % 33.33 % 42.42 % 39.39 % 39.39 %
Native Hawaiian or Other Pacific Islander 0.3 % 28.12 % 53.12 % 28.12 % 53.12 %

For almost all racial groups and both outcome variables, the predicted homelessness rate is higher than the observed rate. This is part of the design of our model - we want to be more sensitive to homelessness outcomes and therefore predict higher rates. What is important to note here is that for most racial groups that represent a substantial portion of the population (i.e., Black or African American, Unknown, White, and Two or More Races), the difference is within 10-20%, which is reasonably close. Racial groups with a much smaller population in the test set (e.g. Native Hawaiian and Other Pacific Islander) have larger disparities - this is primarily due to the difficulty in predicting for smaller sample sizes.

Finally, the difference between observed and predicted is smaller for the 12 month outcome due to the higher threshold selected for that model.

Section 4: Conclusion

Throughout this report, we have communicated the complexity around homelessness as a policy issue and the solutions that involve the individuals and the providers that serve them. Our approach acknowledges the difficulties surrounding this topic while also trying to distill it to the most basic elements: we want individuals who get housing to remain housed and to decrease the amount of return to homelessness within the system.

By focusing on this one solution, we have developed tools and analytics that address the root causes and patterns behind individuals not remaining housed on both an individual, system wide, and institutional level. The centering of service providers allows for clearer allocation of resources on the service wide level as well as evaluation of service providers whose populations have a high rate of return to homelessness. For MDHA, our client, this is a key tool as it allows them to understand and evaluate how their 30 service providers are supplying care and housing to their clients.

As we mentioned at the start, our purpose is to make the experience of homelessness “rare, brief and non-recurring”, we hope that with our data visualizations, modeling, and application we can help bring Dallas and our country one step closer towards this crucial goal.

Section 5: Acknowledgements

We would like to thank a number of individuals who helped make this project possible. Firstly, we would like to thank Alexandra Espinosa and Phil Force at the Metro Dallas Homeless Alliance for entrusting us with a sensitive dataset and answering our questions throughout the process. Secondly, we would like to thank the team of Ken Steif, Michael Fichman, and Matt Harris for expertly guiding our Practicum course. Lastly, we would like to thank Dennis Culhane for serving as our expert adviser and offering helpful tips and understandings of the system.