Return to MUSA 801 Projects page
This project was done as part of the University of Pennsylvania’s Master of Urban Spatial Analytics (MUSA) Spring 2022 Practicum. The instructors were Michael Fichman and Matthew Harris. We would like to thank them for their tireless help with this project. The group would also like to thank the team at US Ignite, Palak Agarwal and Kyle Compton, for their help in supplying the data and high-level guidance.
This report documents the data analysis pipeline through multiple steps. First is processing, exploration and visualization, then building the traffic prediction model through feature engineering, model training and validation testing.
Ever since the dawn of the Auto age, traffic jams have been present wherever large numbers of drivers exist. It is one phenomenon where, though widely experienced by nearly everyone at some point in time, is far less commonly understood in terms of its causes and potential mitigations. Without certain policy measures such as dynamic highway tolling, road network utilization becomes a classic “tragedy of the commons” situation, where unfettered open access to a shared resource result in systemic collapse and disutility for all who use it. Although the COVID-19 pandemic has shifted the usual travel and commute patterns found throughout the United States, especially for those those who have the option to work from home, the truth is that congestion will only shift to different locations and hours throughout the day. And even so, there are other workers who are required to work on site and still face congestion day to day. Novel ways of predicting and alerting drivers of traffic conditions ahead of time, especially for known major traffic generators such as large employers, will help both regular commuters make informed decisions on when and where to travel, and also the greater public.
In this study, we take the case of Colorado Springs, Colorado, a mid-sized city which is dominated by particular major employers, namely the US Armed Forces. To the south of the city and separate from it lies Fort Carson, one of the largest Army bases in the country since 1942, with a total area of 137,000 acres. Though Census data claims the CDP population is only 13,800, mainly counting active duty soldiers, including the “associated” population of retirees, family members, and civilian employees brings the total population closer to 125,000, which is 26% of the entirety of Colorado Springs.
Before the pandemic, Colorado Springs experienced a steadily-rising congestion level, peaking in 2019 according to the Texas Transportation Institute’s Urban Mobility Report. Measured in hours of travel delay in aggregate and per-person, this was 20,010 hours and 48 hours of delay respectively, right at the average for mid-sized cities. But what is unusual is that this congestion is particularly car commuter and Interstate highway-based: both statistics are significantly above the average for mid-sized cities1. Putting this together with the large employment center of Fort Carson (as well as the Air Force Academy and Peterson Air Force Base), it is apparent that these large employment sites, closely sited along Colorado Springs’ highways, are a major driver of regional traffic volumes and congestion.
Presently, there are little to no interventions made in Fort Carson in terms of managing traffic flow. As a military base will have restricted access to and from its premises, there is a ring of seven gates surrounding Fort Carson which control traffic movements in and out of the base. During hours of heavy congestion, base operators will selectively close gates in order to redirect traffic to less congested highways. In addition, during forecasts of inclement weather, the base will go through an overnight protocol to either delay base opening or limit opening to critical operating personnel only. The project team and US Ignite’s goal then is to develop a sophisticated predictive model for traffic congestion for Fort Carson and Colorado Springs, that will serve both commuters and base operators in identifying and avoiding potential traffic jams based on historical data.
Our primary data source and response variable will be a spatio-temporal dataset of traffic jams and irregularities around Carson Springs, provided from the Waze for Cities data sharing program through US Ignite. We transform this data into a defined congestion index (CI), and gather other candidate predictors of interest, such as historic accidents, weather patterns, roadway infrastructure features, and more.
The use cases for the predictions yielded from this project are twofold. First, we seek to send alerts and live map updates of predicted traffic disruptions, around two hours in advance, for commuters to and from Fort Carson to aid in their drive to and from the base. Secondly, base operators will have access to predictive information at their fingertips in order to direct drivers to specific gates or close them entirely.
First, the relevant boundaries were created and read in, including for Fort Carson, the city of Colorado Springs, the county, and study border. The main border for analysis purposes is a predefined bounding box over most of the Colorado Springs metropolitan area, covering all of the relevant Waze traffic data.
border.county = st_read("data/boundary_county.geojson") %>%
st_transform(crs)
border.city = st_read("data/boundary_city.geojson") %>%
st_transform(crs)
border = st_read("data/boundary_research.geojson") %>%
st_transform(crs)
border.military = st_read("data/boundary_military.geojson") %>%
st_transform(crs) %>%
filter(NAME == "Fort Carson") %>%
st_intersection(border)
Waze data was loaded with two major categories: irregularities and jams. Irregularities are automatically generated by users’ GPS data and identified by the Waze system as irregular or atypical by taking into account historical speed data for the same day of the week and hour. On the other hand, jams are either irregularities determined by the app to be significantly slower than normal, or user-generated reports shared by Waze users.
blacklist <- c("N Cheyenne Canyon Rd","High Dr")
waze.ir = read_xlsx("./data/Waze_Irregularities_1215.xlsx", sheet = "Waze_Irregularities_1215") %>%
st_as_sf(wkt = 'geo',crs = 4326)
waze.jam = st_read("data/waze.jam.geojson") %>%
# filter this road under construction
filter(!street %in% blacklist) %>%
rename(jam_id = id) %>%
mutate(ymdh = ymd_h(paste0(date, hour),tz = "UTC"),
ymdh = with_tz(ymdh,"MST"),
hour = hour(ymdh),
date = date(ymdh))
# filter jams with same id
# getting the most severe ci
waze.jam.uniqueid.severe =
waze.jam %>%
group_by(jam_id) %>%
arrange(speed) %>%
slice(1)
# for numeric, getting the average/median ci
# for categorical, getting the first value
# for geom, getting the union
waze.jam.uniqueid.uniongeom =
waze.jam %>%
group_by(jam_id) %>%
summarise(geometry = st_union(geometry))
waze.jam.uniqueid.avg.left =
waze.jam.uniqueid.severe %>%
st_drop_geometry() %>%
dplyr::select(-where(is.numeric),jam_id) %>%
add_column(geometry = waze.jam.uniqueid.uniongeom$geometry) %>%
st_sf()
waze.jam.uniqueid.avg =
waze.jam %>%
st_drop_geometry() %>%
group_by(jam_id) %>%
summarise(across(where(is.numeric), mean)) %>%
left_join(waze.jam.uniqueid.avg.left,by="jam_id") %>%
st_sf()
waze.jam.uniqueid.median =
waze.jam %>%
st_drop_geometry() %>%
group_by(jam_id) %>%
summarise(across(where(is.numeric), median)) %>%
left_join(waze.jam.uniqueid.avg.left,by="jam_id" ) %>%
st_sf()
waze.jam.uniqueid.severe %>%
st_write('data/waze.jam.uniqueid.severe.geojson',append=F)
waze.jam.uniqueid.avg %>%
st_write('data/waze.jam.uniqueid.avg.geojson',append=F)
waze.jam.uniqueid.median %>%
st_write('data/waze.jam.uniqueid.median.geojson',append=F)
# the most severe jam in one id (lowest speed)
waze.jam.uniqueid.severe =
st_read('data/waze.jam.uniqueid.severe.geojson')
# average in one id
waze.jam.uniqueid.avg =
st_read('data/waze.jam.uniqueid.avg.geojson')
# median in one id
waze.jam.uniqueid.median =
st_read('data/waze.jam.uniqueid.median.geojson')
The jam dataset columns results as follows:
Name | Type | Explanation | %NA |
---|---|---|---|
type | Category | Alert type | 98.5 |
speed | Number | Current average speed on jammed segments in m/s | 0 |
speedKMH | Number | Speed in km/h | 0 |
length | Number | Jam length in meters | 0 |
delay | Number | Delay of jam (in seconds) | 0 |
street | Text | .. | 4.39 |
city | Text | .. | 4.27 |
level | Category | Traffic congestion level | 0 |
Overall, there was waze data for 2.5 years, which comprised nearly 3
million rows of points. As the data were analyzed for overall patterns,
it was decided early on that in order to save processing time for this
analysis, the data would be subsetted to a time period of
October-December 2021. Not only was this a fairly recent time period
available to use that reflected post-pandemic traffic usage, but it was
also largely clear of any major unexplained events and disruptions at
Fort Carson that could throw off our prediction. Overall, the dataset
length of waze.jam
fell to 344,529 rows. In addition, in
further review it was decided to omit analysis of the waze
irregularities dataset, as there were too many NA values to produce
reasonable analysis without extensive imputation methods.
As we expect, there is strong periodicity within the jam dataset, both in hours and in days. The X axis is time from Monday to Sunday, And the Y axis is week number from 1 to 54.
There are 3 main takeaways.
To identify and plot the spatial relations of traffic gams, we first plot a map of traffic jams in a day. The gray area to the south is Fort Carson. As before, the timelapse shows both strong periodicity in intensity of jams, both in the morning and sustained throughout the early-mid afternoon. In addition, the outline of major arterials of Colorado Springs, as well as the highway network, are spatially concentrated.
The most congested streets over the day are as follows: | Street | Jam Count | | —————– | ——— | | S Nevada Ave | 3596 | | W Fillmore St | 1883 | | E Fountain Blvd | 1688 | | Lake Ave | 1475 | | S Academy Blvd | 1422 | | E Fillmore St | 1359 | | W Cimarron St | 1216 | | Manitou Ave | 1188 | | Ronald Reagan Hwy (I-25) | 1133 | | E Platte Ave | 1003 |
Because these segments are adjacent to one another or connected, it is apparent that the Spatial lag / Network effect of roadway congestion is significant; we can see clusters, and shifts of jam over time.
As a result, the decision was made to use the hexagonal “Fishnet” as the spacial unit instead of lines or points. That is because we can then model this network effect, by considering nearby cells. Although it leads to less precision, in terms of our use case, to help decision making and sending alerts, its totally acceptable.
The following analyses look at bringing in additional open data sources, including two sources explicitly identified by US Ignite as observations of interest and pervasive roadway hazards: weather patterns and crash patterns.
Historically, people may feel inclement weather like rainy days are
always accompanied with more traffic jams and longer delay time. With
data, we can see that this is undeniably the case. It is reasonable to
assume that inclement weather in Colorado Springs and Fort Carson Base
exacerbate driving conditions, which forces drivers drive slowly or even
leads to crashes, so that congestion index of roads will increase
accordingly. Therefore, we explored the weather data and analyzed its
relationship with congestion index, and researched how we can make use
of weather data as one of independent variables.
Weather data of Colorado Spring from October 1 to December 31 was
downloaded with the riem
package released by the Iowa
Environment Mesonet. The weather data’s time interval is 5 minutes,
since our congestion data’s smallest time interval is one hour, so we
summarized weather data at two temporal scales, one is one-hour and the
other is one-day. The five quantitative weather indicators are max
temperature, min temperature, min visibility, max precipitation and max
wind speed. Weather type (wxcodes) is also downloaded, which is helpful
when doing categorical analysis. The rule of summarizing weather type is
when there is an inclement weather type during this time interval, then
the weather at this time is that weather type.
We build linear regression models between number of jams and wind speed at each hour, and plot them. It is found that during morning rush hours and noon, then wind speed and number of jams are mainly positively correlated, since the slopes are more than zero. But during afternoon rush hours, there is an inverse situation. So we can speculate that the relationship between weather and traffic will change with the time of day.
So despite some basic weather quantitative indicators, we also add another two types of weather variables.
Based on our analysis before, we speculate that there is time lag in weather. That means, the inclement weather may have lasting effects on driving environments and traffic jams
Taking October 1st as an example, It is part of a light rainy day. The first line chart describes the change of number of traffic jams by hour on October 1 and average conditions of all Fridays, and the second line chart takes CI as dependent variable. Those plots are useful when exploring the relationship between traffic, hour and weather type.
It was found that after raining at 11am, the number of traffic congestion climbed more rapidly than usual Fridays. And after raining at 5pm, the number decreased more dramatically and the value is smaller than usual. The CI during and after raining i s smaller than usual as well. At an hour after raining at 7pm the number is higher than usual, while at two hours after that, the CI is very high
In conclusion, inclement weather has lasting effects on congestion, but those effects also differ by time in a given day.
Here we are looking at the hexagon fishnet level crash distribution in our study area. This data was sourced from Colorado DOT’s C-Plan Open Data portal Because crash data is supplied by CDOT via the state police, data is only available for the major highways. We divide the yearly crash data into five quantiles. While there are some safety incidents happening on local roads, the most affected areas are the state and interstate highways. The blue points on the edge of Fort Carson signify its six gates. On roads where crashes are relatively rare, the location of the gate is often the area where crash clusters.
Another interesting open data source, and unconventional is bringing in electricity usage throughout El Paso County, which includes Fort Carson and Colorado Springs. This data was sourced from the local electric utility, Colorado Springs Utilities. It clearly shows a similar weekly and daily periodicity also reflected in traffic congestion, which again is quite logical and expected.
In addition, several other static features, sourced from a combination of CDOT and OpenStreetMap, were also joined into the model. Many of these proved to be very useful in predictive efficacy, such as density of street signs, bus stops, and parking lots.
Several other data sources were considered but ultimately thrown out of the model. This includes public transit bus ridership (from Mountain Metro Transit),
The basic data structure for our model is a temporal-spatial, or Time-Space data panel. Data processing comprises aggregating each jam data point by hour, across each unique day, and taking the total count by each hour. All hourly bins with no jams are imputed with zero. This panel was then split into training and testing sets to a ratio of 70:30.
Several candidate models were considered, and the panel data was trained on each model to produce test R-squared results. These include:
We engineered 4 type of predictors with a total number over 50. The most important predictors are located to the left of the chart. time Lags are super strong for example the categorical hour and 2 hour lag on the dependent variable, followed by spatial lags and others. Weather predictors improves a lot. There are now several encoded weather predictors that are now pretty powerful. Like if it is normal weather in the middle of the day.
We tried different dependent variables to describe jams, and we built different regression models including ridge lasso, random forest, and XGB.
The winner here is the Random forest with Jam count as dependent variable, the R squared here is 0.32. And by Weighing the pros and cons and after discussion of our client, we are sticking to the winner model.
We then predict on the testing dataset. Although our model is not very good in terms of r squared, but our predictions seem quitegood! The rose red line is our prediction, and gray line is the actual value. The model successfully predict most peaks and troughs. Although it seems to be conservative on the peaks, overall, we are pretty happy with the results.
In terms of cross-validation method analysis, for time-series data there are two main methods at use. The first is using a continuous sliding window method, where training is done on n-data points and validation immediately on the next n-points. This 2n window is kept fixed as each former validation set becomes the new training set and so on. On the other hand, for forward chaining the training window grows by the previous validation window each time. Below we cross-validate our random forest model with both methods. We see that with either forward chaining or sliding window, the R-square is fairly constant.
Lastly we plot the fishnet map of the dependent variable, jam count, of both the actual values and prediction map. We see that overall, the predictions are spatially consistent with the actual testing set, with the values slightly more subdued.
And there is some unexplained errors in space, we will later do some detailed analysis like Moran’s I and account for the remaining errors. So there is still plenty of room for improvement.
And lastly, the error map of the dependent variable confirms that the remaining underpredictions are at the spatially identified jam hotspots, which is consistent for such data being an infrequent occurrence.
We have found that, with just a few publicly available data sources, we are able to predict to a high degree of predictability the traffic jam intensity in Waze counts, as well as each jam’s pervasiveness through roadway network effects.
However, it is also clear that the model, like any time-space analysis, cannot always predict for a particularly severe case of congestion intensity on a particular given day. For that, it is necessary to bring in other data sources that may not be public or less easy to source, including citywide events in Colorado Springs or at Fort Carson. Ultimately, for the level of data we were able to source, we have contributed towards a significant improvement in the way that Fort Carson, and any other large employer or city, can manage congestion for their employees and commuters.
We have developed a prototype web application to display traffic jam predictive analytics and historical predictions through the study period. A screenshot of the app is below; click on the image to go to the production website.
border.county = st_read("data/county-boundary.geojson") %>%
st_transform(crs)
border.city = st_read("data/city-boundary.geojson")%>%
st_transform(crs)
border.military = st_read("data/military-border.geojson") %>%
st_transform(crs) %>%
filter(NAME=="Fort Carson") %>%
st_intersection(border)
border = st_read("data/research-boundary.geojson") %>%
st_transform(crs)
all_time = expand_grid(week=9:52, time=0:167)
time.ana.data = read.csv("data/time-ana.csv") %>%
select(-X) %>%
filter(year(ymdh)<2022) %>%
mutate(ymdh= ymd_hms(ymdh),
week = format(ymdh,"%W") %>% as.numeric(),
wday = format(ymdh,"%w") %>% as.numeric(),
wday = ifelse(wday==0,7,wday)-1,
hour = hour(ymdh) %>% as.numeric(),
time = wday*24+hour) %>%
filter(week %>% as.numeric()>8) %>%
left_join(all_time,.) %>%
replace_na(list(n=0))
time.ana.data %>%
ggplot()+
geom_tile(aes(time,week,fill=n))+
scale_x_continuous(breaks=seq(12,170,24),
labels=strsplit("Mon Tue Wen Thu Fri Sat Sun"," ")[[1]],
expand = c(0, 0))+
scale_y_reverse(expand = c(0, 0))+
scale_fill_viridis(option="plasma",name="Jam Count")+
labs(title="Time Pattern of Jam Counts in 2021",
subtitle="2021, Colorado Springs, CO",
y="Week Number in 2021",
x="Time in a Week by Hour")+
plotTheme()+
theme(legend.position="left")+
plot_spacer()+
time.ana.data %>%
group_by(week) %>%
summarise(n = sum(n)) %>%
ggplot()+
geom_col(aes(x=week,y=n,fill=n), size=0) +
scale_fill_viridis(option="plasma", begin=0.2, end=0.8,guide="none")+
scale_x_reverse(expand = c(0, 0))+
coord_flip()+
theme_minimal()+
theme(axis.text.y = element_blank())+
labs(y="Jam Count by Week",x="")+
plot_layout(widths = c(4,-0.2,0.8))
ggsave("img/heat-final.png",bg="white",
width=12,height=6)
towday = function (n) {
wday = strsplit("Mon Tue Wen Thu Fri Sat Sun"," ")[[1]]
return(wday[n+1])
}
time.ana.data2 = time.ana.data%>%
mutate(weekday = floor( time/24),
hour = time%%24,
weekday = towday(weekday),
weekday = factor(weekday,levels=strsplit("Mon Tue Wen Thu Fri Sat Sun"," ")[[1]]))
time.ana.data2 %>%
group_by(weekday,hour) %>%
summarise(n=mean(n)) %>%
ggplot()+
geom_line(data=time.ana.data2,
aes(hour,n,color=weekday,group=week,alpha=week))+
geom_point(data=time.ana.data2,
aes(hour,n,color=weekday,group=week,alpha=week),
size=1)+
scale_color_viridis(option="plasma",begin=0.2,end=0.7,discrete = T,name="Average Jam Count")+
scale_alpha_continuous(range = c(0, 0.2),name="Jam Count Histroy")+
geom_line(aes(hour,n,color=weekday),size=1)+
geom_point(aes(hour,n,color=weekday),shape=21,size=1.5,fill="white",stroke=1.3)+
facet_wrap(~weekday, nrow=1)+
labs(title="Time Pattern of Jam Counts in a Week",
subtitle="2021, Colorado Springs, CO",
y="Jam Count",
x="Hour in One Day")+
plotTheme()
ggsave("img/time-final.png",bg="white",
width=12,height=6)
# visualization of joining process
wkt1 = "POLYGON ((-104.82604980468749 38.80560399500997, -104.79969978332518 38.80560399500997, -104.79969978332518 38.816973682736794, -104.82604980468749 38.816973682736794, -104.82604980468749 38.80560399500997))"
temp_boundary = st_as_sf(data.frame(wkt=wkt1),wkt="wkt",crs=4326)
x.b = st_coordinates(temp_boundary,)[1:2,1]
y.b = st_coordinates(temp_boundary)[3:2,2]
temp_boundary=temp_boundary %>% st_transform(crs)
small_roads = roads %>% st_crop(temp_boundary)
set.seed(100)
waze.jam.filtered = waze.jam %>%
sample_n(10000) %>%
st_filter(temp_boundary)
waze.jam.filtered %>% nrow()
buffer_size = 50*3.28 # meters
# waze.jam.filtered %>% view()
ggplot(waze.jam.filtered)+
geom_sf(color="red")+
geom_sf(data=small_roads)+
mapTheme()+
coord_sf(expand = FALSE,xlim=x.b,ylim=y.b,crs=4326)+
labs(title="Jam Road Segments",
subtitle = "Red Line: Jams, Black Line: Road Segments")
ggsave("img/jam-join-1.png")
patch = ggplot()
for (n in 1:20){
waze.jam.filtered.buffer =
waze.jam.filtered %>%
slice(n) %>%
st_buffer(buffer_size)
st_join(roads, waze.jam.filtered.buffer,
left=F, join=st_within) %>%
ggplot()+
geom_sf(data=small_roads)+
geom_sf(data=waze.jam.filtered %>% slice(n),color="blue")+
geom_sf(data=waze.jam.filtered.buffer,color=NA,fill="#33338899")+
geom_sf(color="red")+mapTheme()+
coord_sf(expand = FALSE,xlim=x.b,ylim=y.b,crs=4326)+
labs(title="Jam Buffer (50m) + Road Segments",
subtitle = "Red Line: Projected Jams, Blue Line: Jams")
ggsave(paste0("img/jam-gif",n,".png"))
}
patch$patches
buffer_size = 50*3.28 # meters
waze.jam.buffer = waze.jam.uniqueid.severe %>%
select(jam_id) %>%
st_buffer(buffer_size)
waze.jam.joined =
st_join(roads %>% select(road_id), waze.jam.buffer,
left=F, join=st_within)
waze.jam.joined %>% st_drop_geometry() %>%
write.csv("data/waze.jam.joined.csv",row.names = F)
waze.jam.filtered = waze.jam.joined %>%
pull(jam_id) %>% unique()
waze.jam.joined = read.csv("data/waze.jam.joined.csv")
waze.jam.joined %>% sample_n(1000) %>% view()
waze.jam.joined.map = waze.jam.joined %>%
group_by(road_id) %>%
summarise(n=n()) %>%
# mutate(n=q5(n)) %>%
left_join(roads,by="road_id") %>%
st_sf()%>%
arrange(desc(n))
waze.jam.joined.map %>%
ggplot()+
geom_sf(data=border.military,fill="#00000022",size=0.1)+
geom_sf(data=gates,size=1,color="steelblue")+
geom_sf(data=roads,size=0.1,alpha=0.4)+
geom_sf(aes(color=q5(n)),alpha=0.9)+
scale_color_manual(values=plasma(6)[1:5],
labels=qBr(waze.jam.joined.map,"n"))+
labs(title="Jam Count Map",
subtitle = "Oct-Dec 2021, Colorado Springs, CO")+
mapTheme()
ggsave("img/jam-m-4.png")
wkt2 = 'POLYGON ((-104.80768203735352 38.76666579487878, -104.77386474609375 38.76666579487878, -104.77386474609375 38.80627285040503, -104.80768203735352 38.80627285040503, -104.80768203735352 38.76666579487878))'
temp_boundary =
st_as_sf(data.frame(wkt=wkt2),wkt="wkt",crs=4326) %>%
st_transform(crs)
jam.oneroad = waze.jam.joined.map %>%
st_filter(temp_boundary) %>%
filter(route=="025A") %>%
mutate(y = st_coordinates(st_centroid(.)) %>% .[,2]) %>%
arrange(y) %>%
mutate(seg_id=row_number()) %>%
# v(z="y",lwd=5)+v(border.military,alpha=0.2)+v(gates)
select(road_id,seg_id) %>%
st_drop_geometry() %>%
left_join(waze.jam.joined,by="road_id") %>%
left_join(waze.jam.uniqueid.severe %>% st_drop_geometry(),by="jam_id")
jam.oneroad
jam.oneroad %>% write.csv("./data/jam.oneroad.csv",row.names = F)
jam.oneroad %>% view()
jam.oneroad %>%
group_by(seg_id,hour) %>%
filter(hour%in%6:21) %>%
# filter(!seg_id%in%36:37) %>%
summarise(speedKMH=speedKMH %>% mean(),
fflow.speedKMH =fflow.speedKMH %>% mean(),
ci=ci %>% mean(),
n=n(),
delay=delay %>% mean()/60) %>%
ggplot()+
geom_col(aes(seg_id,delay), width=1,
fill="#a64d79",color="white",size=NA)+
facet_wrap(~hour,ncol=4,)+
labs(title="Average Delay Time by Hour on Ronald Reagan Hwy",
subtitle = "Oct-Dec 2021, Colorado Springs, CO",
x="Road Segment (About 100 meters / 328 ft)",
y="Delay (minutes)")+
plotTheme()
ggsave("img/jam-case-1.png",bg="white",
width=10,height=8,scale=0.65)
jam.oneroad %>%
group_by(jam_id) %>%
slice(1) %>%
filter(hour%in%17:19 &
delay>120) %>%
group_by() %>%
summarise(n=n(),
length.max=max((length)),
length.mean=mean((length)),
delay.max=max(delay)/60,
delay.mean=mean(delay)/60,
speedKMH.mean=mean(speedKMH))
all_dates = ymd("2021-10-01")+days(0:91)
all_dates=as.character(all_dates)
jammed_dates = jam.oneroad %>%
filter(hour%in%17:19) %>%
group_by(d=date(ymdh)) %>%
summarise(n=(n()>0)) %>%
pull(d) %>% as.character()
setdiff(all_dates,jammed_dates)
waze.jam.uniqueid.severe %>%
st_drop_geometry() %>%
group_by(street) %>%
summarise(n=n()) %>%
arrange(n %>% desc()) %>%
filter(street!="") %>%
head(10) %>%
clipr::write_clip()
require(sp)
library(rgeos)
# 500m
HexPts <-spsample(as_Spatial(border), type="hexagonal", cellsize=1640)
hex <- HexPoints2SpatialPolygons(HexPts)
hex = hex %>% st_as_sf()
hex = hex %>%
st_filter(waze.jam.uniqueid.severe) %>%
data.frame(geometry=.) %>%
mutate(hex_id=row_number()) %>%
st_sf()
v(hex)+v(roads)
hex %>% nrow()
waze.jam.joined %>%
select(road_id) %>%
unique() %>% nrow()
# create time series
time_series=
seq(ymd_h('2021-10-01 0'),
ymd_h('2021-12-31 23'),
by=as.difftime(hours(1))) %>%
data.frame(ymdh=.)
time_series=
seq(ymd_h('2021-10-01 0'),
ymd_h('2021-10-28 23'),
by=as.difftime(hours(1))) %>%
data.frame(ymdh=.)
# spatial join jam and hex
# using the worst jam
jam.hex = hex%>%
st_join(waze.jam.uniqueid.severe) %>%
select(hex_id,ci,ymdh,jam_id,level,delay) %>%
st_drop_geometry() %>% #nrow()
group_by(hex_id,ymdh) %>%
mutate(jam.count=n(),
delay=sum(delay)) %>%
drop_na(ci) %>%
arrange(ci) %>%
dplyr::slice(1) %>%
ungroup()
jam.hex %>% .$ci %>% is.na() %>% mean()
jam.hex %>% filter(hour(ymdh)%in%5:22)%>% nrow()
jam.hex %>% head(1000) %>% view()
ggplot(jam.hex)+
geom_violin(aes(y=ci,x=level %>% as.factor()))
ggplot(jam.hex)+
geom_violin(aes(x=level %>% as.factor(),y=jam.count))
# join time panel
jam.panel = time_series %>%
expand_grid(hex) %>%
left_join(jam.hex) %>%
st_sf()
jam.panel = jam.panel %>% filter(hour(ymdh)%in%5:22)
jam.panel %>% head(1000) %>% view()
jam.panel %>% .$ci %>% is.na() %>% mean()
jam.panel %>% dim()
jam.panel %>% summary()
jam.panel %>%
st_drop_geometry() %>%
group_by(hour(ymdh)) %>%
summarise(mean(!is.na(ci))) %>%
plot()
system("rm data/jam.panel.geojson")
jam.panel %>%
st_write("data/jam.panel.geojson",append=F)
# waze.jam.joined1 = waze.jam.joined %>%
# left_join(waze.jam.uniqueid.severe)
#
# waze.jam.panel = waze.jam.panel %>%
# left_join(waze.jam.joined1)
#
# waze.jam.panel %>% nrow()
# waze.jam.panel %>% head(10000) %>% view()
## weather types
wxcodes_all <- riem_measures(station = "COS", date_start = "2021-09-30", date_end = "2022-1-2") %>%
dplyr::select(valid, wxcodes) %>%
replace(is.na(.), 'NW')
## convert the time zone
wxcodes_all['valid_c'] <- with_tz(wxcodes_all['valid'], tzone = "MST")
## weather data including temperature, precipitation, wind speed, humidity, visibility
weather.COS <-
riem_measures(station = "COS", date_start = "2021-9-30", date_end = "2022-1-2") %>%
dplyr::select(valid, tmpf, p01i, sknt, vsby, relh)%>%
replace(is.na(.), 0)
## convert the time zone
weather.COS['valid_c'] <- with_tz(weather.COS['valid'], tzone = "MST")
## join wxcodes and other weather indicators
weather.COS <- weather.COS %>%
mutate(interval60 = ymd_h(substr(valid_c,1,13)),
intervalday = ymd(substr(valid_c,1,10)),
Hour = hour(valid_c),
week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
rename(Temperature = tmpf,
Precipitation = p01i,
Wind_Speed = sknt,
Humidity = relh,
Visibility = vsby) %>%
dplyr::select(-valid) %>%
left_join(wxcodes_all %>% dplyr::select(-valid) ,by='valid_c')
weather.day.min.temp <- weather.COS %>%
filter(Temperature != 0) %>%
group_by(intervalday) %>%
summarize(Temp_Min = min(Temperature))
weather.day.count <- weather.COS %>%
group_by(intervalday) %>%
summarize(Temp_Max = max(Temperature),
Precipitation = sum(Precipitation),
Wind_Speed = max(Wind_Speed),
Visibility = min(Visibility),
Hour = mean(as.numeric(Hour)),
week = mean(as.numeric(week)),
dotw = mean(as.numeric(dotw)),
) %>%
left_join(weather.COS[c('intervalday','wxcodes')] %>% distinct(),
by = "intervalday") %>%
left_join(weather.day.min.temp,by='intervalday') %>%
left_join(jam.day.count, by = c("intervalday" = "date")) %>%
separate_rows(wxcodes,sep=' ',convert = FALSE) %>%
replace(is.na(.), 0)
weather.hour.count <- weather.COS %>%
group_by(intervalday,Hour) %>%
summarize(Temperature = max(Temperature),
Precipitation = sum(Precipitation),
Wind_Speed = max(Wind_Speed),
Visibility = min(Visibility),
Hour = mean(as.numeric(Hour)),
week = mean(as.numeric(week)),
dotw = mean(as.numeric(dotw)),
) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature)) %>%
left_join(weather.COS[c('intervalday','Hour','wxcodes')] %>% distinct(),
by = c("intervalday" = "intervalday","Hour" = "Hour")) %>%
left_join(jam.hour.count, by = c("intervalday" = "date","Hour" = "hour")) %>%
separate_rows(wxcodes,sep=' ',convert = FALSE) %>%
replace(is.na(.), 0)
weather.hour.count %>%
mutate(weather_type = case_when(wxcodes == 'BLDU' ~ "blowing dustorm",
wxcodes == 'HZ' ~ "haze",
wxcodes == 'SQ' ~ "squalls",
wxcodes == 'RA' ~ "rain",
wxcodes == '-RA' ~ "light rain",
wxcodes == '+RA' ~ "heavy rain",
wxcodes == '-TSRA' ~ "thunderstorm&rain",
wxcodes == 'TSRA' ~ "thunderstorm&rain",
wxcodes == 'VCTS' ~ "vicinity of thunderstorm",
wxcodes == '-DZ' ~ "light drizzle",
wxcodes == 'FG' ~ "fog",
wxcodes == 'BR' ~ "mist",
wxcodes == '-SN' ~ "light snow",
wxcodes == 'FZFG' ~ "freezing fog",
wxcodes == 'SN' ~ "snow",
wxcodes == 'NW' ~ "normal weather")) %>%
group_by(weather_type) %>%
summarize(MeanCount = mean(count)) %>%
ggplot(mapping = aes(x = reorder(weather_type,-MeanCount), y = MeanCount))+
geom_bar(stat = 'identity', fill = '#a64d79') +
geom_hline(yintercept = mean(filter(weather.hour.count, wxcodes == 'NW')[['count']]),
color = '#351c75', size = 1) +
labs(title = "Average Hourly Number of Traffic Jams For Each Weather Type", subtitle = 'Oct 1 to Dec 31, 2021',
x = 'Weather Types', y= 'Average number of jams') +
plotTheme +
theme(panel.border = element_rect(color = "black",
fill = NA,
size = 2))
lm_eqn <- function(df){
m <- lm(y ~ x, df);
eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
list(a = format(unname(coef(m)[1]), digits = 2),
b = format(unname(coef(m)[2]), digits = 2),
r2 = format(summary(m)$r.squared, digits = 3)))
as.character(as.expression(eq));
}
my.formular <- value ~ count
weather.hour.count %>%
ungroup() %>%
dplyr::select(count,Temperature,Precipitation,Wind_Speed,Visibility) %>%
gather(Variable, value, -count) %>%
ggplot(aes(x = value, y = count)) +
geom_point(color = '#a64d79') +
stat_smooth(method = 'lm', color = 'black') +
# stat_poly_eq(formuar = my.formular,
# aes(label = paste(..rr.label.., sep = "~~~")),
# parse = TRUE, size = 3, color = 'black') +
facet_wrap(~Variable, scales = "free",ncol = 4) +
#scale_fill_manual(values = palette2) +
labs(title = "Scatter Plots Between Hourly Congestion Count and Weather Conditions ",
subtitle = "2021, Colorado Spring",
y = 'Congestion count',x = 'Weather') +
plotTheme +
theme(plot.title=element_text(size=18),
plot.subtitle=element_text(size=15),
axis.title.x = element_text(size=15),
axis.title.y = element_text(size=15),
strip.text.x = element_text(size=20, face="italic", color='black'))+
theme(panel.border = element_rect(color = "black",
fill = NA,
size = 2))
ConRoad <- st_read('/Users/tushimin/Desktop/MUSA801/weather codes/jamoneroad.csv') %>%
dplyr::select(-road_id) %>% distinct()
# ConRoad <- st_read('C:/Users/tushimin/Desktop/jam.oneroad.csv') %>%
# dplyr::select(-road_id) %>% distinct()
ConRoad$jam_id <- as.numeric(ConRoad$jam_id)
ConRoad <- waze.jam.uniqueid.severe %>%
st_drop_geometry() %>%
right_join(ConRoad['jam_id'],by='jam_id')
CR.hour.count <- ConRoad %>%
group_by(date, hour) %>%
count() %>%
rename('count' = 'n') %>%
filter(date >= '2021-10-01')
CR.hour.count.wea <- weather.hour.count %>%
filter(intervalday >= '2021-10-01') %>%
dplyr::select(-count,-wxcodes,-week,-dotw) %>%
left_join(CR.hour.count, by=c('intervalday'='date','Hour'='hour')) %>%
replace(is.na(.), 0)
CR.hour.count.wea %>%
ungroup() %>%
dplyr::select(count,Temperature,Precipitation,Wind_Speed,Visibility) %>%
gather(Variable, value, -count) %>%
ggplot(aes(x = value, y = count)) +
geom_point(color = '#a64d79') +
stat_smooth(method = 'lm',color = 'black') +
# stat_poly_eq(formular = my.formular,
# aes(label = paste(..rr.label.., sep = "~~~")),
# parse = TRUE, size = 3, color = 'black') +
facet_wrap(~Variable, scales = "free",ncol = 4) +
labs(title = "Scatter Plots Between Daily Congestion Index And Weather Conditions ",
subtitle = "2021, Ronald Reagan Highway",
y = 'Congestion count',x = 'Weather') +
plotTheme +
theme(plot.title=element_text(size=20),
plot.subtitle=element_text(size=15),
axis.title.x = element_text(size=15),
axis.title.y = element_text(size=15),
strip.text.x = element_text(size=20, face="italic", color='black'))+
theme(panel.border = element_rect(color = "black",
fill = NA,
size = 2))
temp.count.formular <- Temperature ~ count
weather.hour.count %>%
mutate(time_of_day = case_when(Hour < 7 | Hour > 18 ~ "Overnight",
Hour >= 7 & Hour < 10 ~ "AM Rush",
Hour >= 10 & Hour < 15 ~ "Mid-Day",
Hour >= 15 & Hour <= 18 ~ "PM Rush")) %>%
dplyr::select(Wind_Speed,count,time_of_day) %>%
ggplot(aes(x = Wind_Speed, y = count)) +
geom_point(color = '#a64d79') +
stat_smooth(method = 'lm',color = 'black') +
# stat_poly_eq(formuar = temp.count.formular,
# aes(label = paste(..rr.label.., sep = "~~~")),
# parse = TRUE, size = 5,color = 'black') +
facet_wrap(~time_of_day, scales = "free",ncol = 4) +
#scale_fill_manual(values = palette2) +
labs(title = "Scatter Plots Between Number of Traffic Jams and Wind Speed ",
subtitle = '2021, Colorado Springs',
y = 'Congestion count',x = 'visibility') +
plotTheme +
theme(plot.title=element_text(size=25),
plot.subtitle=element_text(size=20),
axis.title.y=element_text(size=15),
axis.title.x=element_text(size=15),
strip.text.x = element_text(size=20, face="italic", color="black")
)+
theme(panel.border = element_rect(color = "black",
fill = NA,
size = 2))
rate <- c()
hour <- 0:23
for(i in c(0:23)) {
hour_data <- weather.hour.count %>%
filter(intervalday >= '2021-10-01' & intervalday <= '2021-12-31' & Hour == i)
lrm <- lm(count ~ Wind_Speed, data = hour_data)
#hour_rate[nrow(hour_rate)+1,] <- c(i,rate)
rate <- append(rate,lrm$coefficients[2])
}
hour_rate <- data.frame(hour,rate)
hour_rate <- hour_rate %>% mutate(time_of_day = case_when(hour < 7 | hour > 18 ~ "Overnight",
hour >= 7 & hour < 10 ~ "AM Rush",
hour >= 10 & hour < 15 ~ "Mid-Day",
hour >= 15 & hour <= 18 ~ "PM Rush"))
ggplot(hour_rate) +
geom_point(aes(x=hour,y=rate)) +
geom_line(aes(x=hour,y=rate,color=factor(time_of_day),group=1),size=1) +
geom_vline(xintercept=c(7,10,15,19),size=1,linetype = 'dotted')+
scale_color_viridis(option="plasma",discrete = T,begin=0.8,end=0.2,name = "time of day") +
# scale_color_viridis(option = "plasma",discrete = TRUE, name = "time of day") +
plotTheme +
labs(title = "Changes of the Relationship Between Number of Jams and Wind Speed in a Day ",
subtitle = '2021, Colorado Springs',
y = 'Slope',x = 'Hour') +
theme(plot.title=element_text(size=11),
plot.subtitle=element_text(size=8),
axis.title.y=element_text(size=8),
axis.title.x=element_text(size=8),
panel.border = element_rect(color = "black",
fill = NA,
size = 2))
# weather data prepared for building a panel
wxcodes <- riem_measures(station = "COS", date_start = "2021-09-30", date_end = "2022-01-02") %>%
dplyr::select(valid, wxcodes) %>%
replace(is.na(.), 'NW')
wxcodes['valid'] <- with_tz(wxcodes['valid'], tzone = "MST")
wxcodes <- wxcodes%>%
mutate(interval60 = ymd_h(substr(valid,1,13)),
intervalday = ymd(substr(valid,1,10)),
Hour = hour(valid),
week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
dplyr::select(intervalday,Hour,wxcodes) %>% distinct() %>%
separate_rows(wxcodes,sep=' ',convert = FALSE)
## multiple classes of weather in one hour
wxcodes.multi <- wxcodes %>%
group_by(intervalday,Hour) %>%
count() %>%
filter(n > 1) %>%
left_join(wxcodes,by=c('intervalday','Hour')) %>%
filter(wxcodes != 'NW') %>%
dplyr::select(-n) %>%
group_by(intervalday,Hour) %>%
count() %>%
filter(n > 1) %>%
left_join(wxcodes,by=c('intervalday','Hour'))%>%
filter(wxcodes != 'NW')
wxcodes.multi <- wxcodes.multi %>% ungroup() %>% slice(2,3,5,16,23,28,32,34,36,38,40,45,47,49,50,52,54,57,60,66,69,72) %>%
dplyr::select(-n)
## two classes in one hour
wxcodes.two <- wxcodes %>%
group_by(intervalday,Hour) %>%
count() %>%
filter(n > 1) %>%
left_join(wxcodes,by=c('intervalday','Hour')) %>%
filter(wxcodes != 'NW') %>%
dplyr::select(-n) %>%
group_by(intervalday,Hour) %>%
count() %>%
filter(n == 1) %>%
left_join(wxcodes,by=c('intervalday','Hour')) %>%
filter(wxcodes != 'NW') %>%
dplyr::select(-n)
## one class in one hour
wxcodes.one <- wxcodes %>%
group_by(intervalday,Hour) %>%
count() %>%
filter(n == 1) %>%
left_join(wxcodes,by=c('intervalday','Hour')) %>%
dplyr::select(-n)
## one hour corresponds to only one weather class
wxcodes.final <- rbind(wxcodes.one,wxcodes.multi) %>% rbind(.,wxcodes.two) # all wxcodes in 94 days
## sort by day and hour
wxcodes.final <- wxcodes.final[order(wxcodes.final$intervalday, wxcodes.final$Hour),]
wxcodes.final <- wxcodes.final %>%
right_join(weather.COS[,c('interval60','intervalday','Hour')] %>% distinct(),by=c('intervalday','Hour')) %>%
left_join(weather.hour.count %>% dplyr::select(-wxcodes,-count)%>% distinct(),by=c('intervalday','Hour'))
## recalculate dow, Hour,week
wxcodes.final <- wxcodes.final %>%
mutate(time_of_day = case_when(Hour < 7 | Hour > 18 ~ "Overnight",
Hour >= 7 & Hour < 10 ~ "AM Rush",
Hour >= 10 & Hour < 15 ~ "Mid-Day",
Hour >= 15 & Hour <= 18 ~ "PM Rush")) %>%
dplyr::select(-interval60)
## build a weather panel
lag.panel <- wxcodes.final %>%
ungroup() %>%
mutate(lagTemp = dplyr::lag(Temperature,2),
lagPrecp = dplyr::lag(Precipitation,2),
lagWS = dplyr::lag(Wind_Speed,2),
lagVisib = dplyr::lag(Visibility,2),
lagCat = dplyr::lag(wxcodes,2))
allwea_fri_count <- waze.jam.uniqueid.severe %>%
st_drop_geometry() %>%
filter(day_of_week == 6) %>%
group_by(date,hour) %>%
count(name = 'count') %>%
filter(date >= '2021-10-01' & date <= '2021-12-31') %>%
group_by(hour) %>%
summarize(count = mean(count)) %>%
mutate(wxcodes = 'NW',type='Average')
oct.1 <- lag.panel %>%
filter(intervalday == '2021-10-01') %>%
left_join(jam.hour.count,by=c('intervalday' = 'date','Hour' = 'hour')) %>%
replace(is.na(.),0) %>%
mutate(type = 'OCT.1') %>%
rename('hour' = 'Hour') %>%
dplyr::select(hour,count, wxcodes, type)
df <- rbind(oct.1,allwea_fri_count)
bounds <- df %>%
dplyr::select(-wxcodes) %>%
pivot_wider(names_from = type, values_from = count) %>%
mutate(
ymax = pmax(OCT.1, Average),
ymin = pmin(Average, OCT.1),
fill = OCT.1 >= Average
)
bounds46 <- bounds %>%
mutate(fill = case_when(
hour >=4 & hour <=6 ~ TRUE,
TRUE ~ FALSE
)) %>%
filter(fill == TRUE)
bounds914 <- bounds %>%
mutate(fill = case_when(
hour >=9 & hour <=14 ~ TRUE,
TRUE ~ FALSE
)) %>%
filter(fill == TRUE)
bounds2223 <- bounds %>%
mutate(fill = case_when(
hour >=21 & hour <=23 ~ TRUE,
TRUE ~ FALSE
)) %>%
filter(fill == TRUE)
t2.rect1 <- data.frame (xmin=3.5, xmax=4.5, ymin=-Inf, ymax=Inf)
t2.rect2 <- data.frame (xmin=8.5, xmax=10.5, ymin=-Inf, ymax=Inf)
t2.rect3 <- data.frame (xmin=11.5, xmax=12.5, ymin=-Inf, ymax=Inf)
t2.rect4 <- data.frame (xmin=20.5, xmax=21.5, ymin=-Inf, ymax=Inf)
ggplot(df) +
geom_line(aes(hour, count, linetype = type)) +
geom_point(aes(x=hour,y=count),size = 1,color="#c27ba0") +
geom_ribbon(data = bounds46, aes(hour, ymin = ymin, ymax = ymax, fill = fill), alpha = 0.4)+
geom_ribbon(data = bounds914, aes(hour, ymin = ymin, ymax = ymax, fill = fill), alpha = 0.4) +
geom_ribbon(data = bounds2223, aes(hour, ymin = ymin, ymax = ymax, fill = fill), alpha = 0.4) +
geom_rect(data=t2.rect1, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax), fill="#848484", alpha = 0.3,inherit.aes = FALSE) +
geom_rect(data=t2.rect2, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax), fill="#848484", alpha = 0.3,inherit.aes = FALSE) +
geom_rect(data=t2.rect3, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax), fill="#848484", alpha = 0.3,inherit.aes = FALSE) +
geom_rect(data=t2.rect4, aes(xmin=xmin, xmax=xmax, ymin=ymin, ymax=ymax), fill="#848484", alpha = 0.3,inherit.aes = FALSE) +
labs(title = "Average Number of Congestion and Weather Type of Every Hour on October 1",
subtitle = "2021, Colorado Spring",
y = 'Jam Count', x = 'Hour',
#color='Weather Type',
linetype = 'Date') +
theme(panel.border = element_rect(color = "black",
fill = NA,
size = 2)) +
scale_fill_manual(values = c('#a64d79'),
label = c('More jams'))+
guides(fill=guide_legend(title=c("Weather Lag")))+
theme(plot.title=element_text(size=30),
plot.subtitle=element_text(size=20),
axis.title.y=element_text(size=20),
axis.title.x=element_text(size=15))+
plotTheme
all_year_count <- st_read('/Users/tushimin/Desktop/MUSA801/Mobility-Decision-Support-System/data/time-ana.csv') %>%
mutate(intervalday = ymd(substr(ymdh,1,10)),
Hour = hour(ymdh)) %>%
rename('count' = 'n') %>%
dplyr::select(-ymdh,-field_1)
### weather data of the whole 2021
raw_year_wea <- riem_measures(station = "COS", date_start = "2020-12-30", date_end = "2022-1-2") %>%
dplyr::select(valid, wxcodes) %>%
replace(is.na(.), 'NW')
## convert the time zone
raw_year_wea['valid_c'] <- with_tz(raw_year_wea['valid'], tzone = "MST")
all_year_wea <- riem_measures(station = "COS", date_start = "2020-12-30", date_end = "2022-1-2") %>%
dplyr::select(valid, tmpf, p01i, sknt, vsby, relh)%>%
replace(is.na(.), 0)
## convert the time zone
all_year_wea['valid_c'] <- with_tz(all_year_wea['valid'], tzone = "MST")
## join wxcodes and other weather indicators
all_year_wea <- all_year_wea %>%
mutate(interval60 = ymd_h(substr(valid_c,1,13)),
intervalday = ymd(substr(valid_c,1,10)),
Hour = hour(valid_c),
week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
rename(Temperature = tmpf,
Precipitation = p01i,
Wind_Speed = sknt,
Humidity = relh,
Visibility = vsby) %>%
dplyr::select(-valid) %>%
left_join(raw_year_wea %>% dplyr::select(-valid) ,by='valid_c')
all.weather.hour.count <- all_year_wea %>%
group_by(intervalday,Hour) %>%
summarize(Temperature = max(Temperature),
Precipitation = sum(Precipitation),
Wind_Speed = max(Wind_Speed),
Visibility = min(Visibility),
Hour = mean(as.numeric(Hour)),
week = mean(as.numeric(week)),
dotw = mean(as.numeric(dotw)),
) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature)) %>%
left_join(all_year_wea[c('intervalday','Hour','wxcodes')] %>% distinct(),
by = c("intervalday" = "intervalday","Hour" = "Hour"))
all.weather.hour.count <- all.weather.hour.count %>%
left_join(all_year_count, by = c("intervalday","Hour"))
all.weather.hour.count <- all.weather.hour.count %>%
mutate(count = as.numeric(count))
all.weather.hour.count <- all.weather.hour.count %>%
replace(is.na(.), 0)
### convert wxcodes whose whole year frequency is less than 10 to NW
## infrequent wxcodes
infrequent_wxcodes <- all.weather.hour.count %>%
group_by(wxcodes) %>%
count() %>%
filter(n<10)
infrequent_wxcodes <- infrequent_wxcodes$wxcodes
## frequent wxcodes
frequent_wxcodes <- all.weather.hour.count %>%
group_by(wxcodes) %>%
count() %>%
filter(n >= 10)
frequent_wxcodes <- frequent_wxcodes$wxcodes
## convert infrequent wxcodes
infrequent_count <- weather.hour.count[weather.hour.count$wxcodes %in% infrequent_wxcodes,]
frequent_count <- weather.hour.count[weather.hour.count$wxcodes %in% frequent_wxcodes,]
infrequent_count <- infrequent_count %>%
dplyr::select(-wxcodes) %>%
mutate(wxcodes = 'NW')
## the final dataframe
weather.hour.count <- frequent_count %>%
rbind(infrequent_count)
## categorize weather type into five levels
weather.hour.count <- weather.hour.count %>%
mutate(wea_cat = case_when(
wxcodes == '-SN' | wxcodes == 'SN' | wxcodes == '+SN' | wxcodes == '-RASN'| wxcodes == 'BLSN' ~ 'Snowy',
wxcodes == 'BR' | wxcodes == 'FZFG' | wxcodes == 'HZ' | wxcodes == 'FG'| wxcodes == 'FU' ~ 'Foggy',
wxcodes == 'UP' | wxcodes == '-RA' | wxcodes == '-DZ'| wxcodes == 'RA'| wxcodes == '+RA'| wxcodes == 'TSRA'| wxcodes == '-TSRA'| wxcodes == 'TS' ~ 'Rainy',
wxcodes == 'SQ' ~ 'Windy',
wxcodes == 'NW' | wxcodes == 'VCTS' ~ 'Normal Weather',
))
## add time periods
weather.hour.count <- weather.hour.count %>%
mutate(time_of_day = case_when(Hour < 7 | Hour > 18 ~ "Overnight",
Hour >= 7 & Hour < 10 ~ "AM Rush",
Hour >= 10 & Hour < 15 ~ "Mid-Day",
Hour >= 15 & Hour <= 18 ~ "PM Rush"))
## one hot encoding weather types
oh_1 <- weather.hour.count
oh_1 <- oh_1 %>%
mutate('Normal Weather' = case_when(wea_cat == 'Normal Weather' ~ 1,
TRUE ~ 0),
'Snowy' = case_when(wea_cat == 'Snowy' ~ 1,
TRUE ~ 0),
'Windy' = case_when(wea_cat == 'Windy' ~ 1,
TRUE ~ 0),
'Rainy' = case_when(wea_cat == 'Rainy' ~ 1,
TRUE ~ 0),
'Foggy' = case_when(wea_cat == 'Foggy' ~ 1,
TRUE ~ 0))
## there might be several weather types in one hour, so we aggregate them into one row.
oh_1 <- aggregate(cbind(`Normal Weather`,`Snowy`,`Rainy`,`Foggy`,`Windy`) ~ intervalday+Hour, data = oh_1, FUN = sum)
## add two-hour weather lag
weatherlag <- oh_1 %>%
ungroup() %>%
mutate(lagNW_2 = dplyr::lag(`Normal Weather`,2),
lagWindy_2 = dplyr::lag(Windy,2),
lagRainy_2 = dplyr::lag(Rainy,2),
lagFoggy_2 = dplyr::lag(Foggy,2),
lagSnowy_2 = dplyr::lag(Snowy,2))
## one hot encoding weather + time slots
all_wxcodes <- weather.hour.count %>% ungroup() %>% distinct(wea_cat)
all_wxcodes <- all_wxcodes$wea_cat
all_timeslots <- weather.hour.count$time_of_day %>% unique()
oh_2 <- weather.hour.count
n = 12
for(i in all_wxcodes) {
for(j in all_timeslots){
n = n + 1
oh_2 <- oh_2 %>%
mutate('temp' = case_when(
wea_cat == i & time_of_day == j ~ 1,
TRUE ~ 0))
names(oh_2)[n] <- paste(i,'+',j)
}
}
## aggregate into one row per day
onehot <- aggregate(cbind(`Normal Weather + PM Rush`,`Normal Weather + Overnight`,`Normal Weather + AM Rush`,`Normal Weather + Mid-Day`,`Snowy + PM Rush`,`Snowy + Overnight`,`Snowy + AM Rush`,`Snowy + Mid-Day`,`Rainy + PM Rush`,`Rainy + Overnight`, `Rainy + AM Rush`,`Rainy + Mid-Day`,`Foggy + PM Rush`,`Foggy + Overnight`, `Foggy + AM Rush`,`Foggy + Mid-Day`) ~ intervalday+Hour, data = oh_2, FUN = sum)
wxcodes_onehot <- onehot %>%
left_join(oh_2 %>% dplyr::select(intervalday,Hour, Temperature,Precipitation, Wind_Speed, Visibility, dotw, count) %>% distinct(), by=c('intervalday','Hour'))
## add quantitative indicators' lag
wxcodes_onehot <- wxcodes_onehot %>%
ungroup() %>%
mutate(lagTemp = dplyr::lag(Temperature,2),
lagPrecp = dplyr::lag(Precipitation,2),
lagWS = dplyr::lag(Wind_Speed,2),
lagVisib = dplyr::lag(Visibility,2))
## the final dataframe
wxcodes_onehot_lag <- weatherlag %>%
dplyr::select(-`Normal Weather`,-Snowy, -Rainy,-Foggy,-Windy) %>%
left_join(wxcodes_onehot,by=c('intervalday','Hour'))
crash <- st_read("data/crash-2019.csv")
crash_clean <- crash[crash$longitude != 0,]
crash.sf <- st_as_sf(crash_clean, coords = c("longitude","latitude"), crs = 4326)
crash_2232 <- crash.sf %>%
st_transform(crs = 2232)
border = st_read("data/research-boundary.geojson") %>%
st_transform(crs = 2232)
gates = st_read("data/gate.geojson") %>%
st_transform(crs)
border.military = st_read("data/military-border.geojson") %>%
st_transform(crs) %>%
filter(NAME=="Fort Carson") %>%
st_intersection(border)
roads.highway = st_read("data/road-highway.geojson") %>%
st_transform(crs) %>%
st_intersection(border) %>%
mutate(class="highway")
roads.major = st_read("data/road-major.geojson") %>%
st_transform(crs) %>%
st_intersection(border)%>%
mutate(class="major")
roads = bind_rows(roads.highway,roads.major)%>%
mutate(road_id = 10000+row_number())
roads %>% sample_n(1000) %>% view()
roads %>% st_length() %>% mean()
##################
crash_study <- st_intersection(crash_2232, border)
crash.map = crash_study %>%
ggplot()+
geom_sf(data=gates,size=3,color="steelblue")+
# geom_sf(data=border,fill="#00000022",size=0.1)+
geom_sf(data=roads.highway,size=0.1,alpha=0.4)+
geom_sf(data=roads.major,size=0.1,alpha=0.4)+
geom_sf(data=border.military,fill="#00000044",size=0.1)+
geom_sf(alpha=0.1,color="#e44001")+
#facet_wrap(~hour,nrow = 3)+
labs(title="Heatmap of Crashes",
subtitle = "2019, Colorado Springs, CO")+
mapTheme()
crash.map
###################
crash_study$date = as.Date(crash_study$date, "%Y-%m-%d")
crash_study$hour = substr(crash_study$time, 1, 2)
crash_study$month = format(crash_study$date, "%m")
###################
crash_group_hour <- crash_study %>% group_by(hour) %>% summarise(n = n())
mapborder="POLYGON ((-104.94964599609374 38.70908932739828, -104.63104248046875 38.70908932739828, -104.63104248046875 38.86697874644268, -104.94964599609374 38.86697874644268, -104.94964599609374 38.70908932739828))" %>%
data.frame(w=.) %>%
st_as_sf(wkt="w",crs=4326) %>%
st_transform(crs)
x.b = st_coordinates(mapborder)[1:2,1]
y.b = st_coordinates(mapborder)[3:2,2]
crash.animation = crash_group_hour.sf %>%
# sample_n(10000) %>%
ggplot()+
geom_sf(data=gates,size=1,color="steelblue")+
geom_sf(data=roads.highway,size=0.1,alpha=0.4)+
geom_sf(data=roads.major,size=0.1,alpha=0.4)+
geom_sf(data=border.military,fill="#00000022",size=0.1)+
geom_sf(alpha=0.5,color="#e44001")+
coord_sf(xlim=x.b,ylim=y.b,expand = F)+
transition_manual(hour) +
labs(title="Heatmap of Crashes at {current_frame}:00",
subtitle = "2019, Colorado Springs, CO")+
mapTheme()
animate(crash.animation,res=150,width=10,height=6,units = "in",fps = 15)
#crash.animation$data
anim_save("img/crash-map.gif")
####################
ggplot(data = crash_study, aes(x = month, fill = severity)) +
geom_bar()+
labs(title="Crash Severity Histogram",
subtitle = "2019, Colorado Springs, CO")+
plotTheme()+
scale_fill_viridis(discrete = TRUE)
ggplot(data = crash_study, aes(x = month, fill = road_desc)) +
geom_bar()+
labs(title="Crash Road Histogram",
subtitle = "2019, Colorado Springs, CO")+
plotTheme()
ggplot(data = crash_study, aes(x = month, fill = condition)) +
geom_bar()+
labs(title="Crash Weather Histogram",
subtitle = "2019, Colorado Springs, CO")+
plotTheme()+
scale_fill_viridis(discrete = TRUE)
ggplot(data = crash_study, aes(x = month, fill = lighting)) +
geom_bar()+
labs(title="Crash Lighting Histogram",
subtitle = "2019, Colorado Springs, CO")+
plotTheme()+
scale_fill_viridis(discrete = TRUE)
ggplot(data = crash_study, aes(x = month, fill = limit1)) +
geom_bar()+
labs(title="Crash Speed Limit Histogram",
subtitle = "2019, Colorado Springs, CO")+
plotTheme()+
scale_fill_viridis(discrete = TRUE)
ggplot(data = crash_study, aes(x = month, fill = system)) +
geom_bar()+
labs(title="Crash System Histogram",
subtitle = "2019, Colorado Springs, CO")+
plotTheme()+
scale_fill_viridis(discrete = TRUE)
##################
crash_study_injlevel <- crash_study[, c("month", "injlevel_1", "injlevel_2", "injlevel_3", "injlevel_4", "injlevel_5")]
crash_study_injlevel$injlevel_1 <- as.numeric(crash_study_injlevel$injlevel_1)
crash_group_hour <- crash_study_injlevel %>% group_by(month) %>% summarise(ninjlevel_1 = sum(injlevel_1),
ninjlevel_2 = sum(injlevel_2),
ninjlevel_3 = sum(injlevel_3),
ninjlevel_4 = sum(injlevel_4),
ninjlevel_5 = sum(injlevel_5))
ggplot(crash_study %>%
select(matches("injlevel_")) %>%
gather(key = "variable", value = "value") %>%
filter(value != ""))+
geom_bar(aes(sum(value)), width = 0.6, fill = '#1c9099')+
ylab("Count of Injury Level")+
xlab("Injury Level")+
theme(
axis.title.x = element_text(size = 10, face = "bold"),
axis.title.y = element_text(size = 10, face = "bold")
)+
labs(title = "Injury Level Plot", subtitle = "2019, Colorado Springs, CO")+
theme(
plot.title = element_text(color = '#1c9099', size = 20),
plot.subtitle = element_text(face="italic")
)+
coord_flip()
######
time_series19=
seq(ymd_h('2019-10-01 0'),
ymd_h('2019-12-31 23'),
by=as.difftime(hours(1))) %>%
data.frame(ymdh=.)
time_series19=
seq(ymd_h('2019-10-01 0'),
ymd_h('2019-10-14 23'),
by=as.difftime(hours(1))) %>%
data.frame(ymdh=.)
#hexigon
require(sp)
library(rgeos)
# 500m
HexPts <-spsample(as_Spatial(border), type="hexagonal", cellsize=1640)
hex <- HexPoints2SpatialPolygons(HexPts)
hex = hex %>% st_as_sf()
hex = hex %>%
st_filter(roads) %>%
data.frame(geometry=.) %>%
mutate(hex_id=row_number()) %>%
st_sf()
v(hex)+v(roads)
hex %>% nrow()
crash_group_hour.sf <- crash_group_hour %>% st_as_sf(crs = crs)
crash.hex = hex%>%
st_join(crash_group_hour.sf) %>%
st_drop_geometry() %>%
group_by(hex_id,hour) %>%
arrange(n) %>%
slice(1) %>%
ungroup()
# join time panel
crash.panel = hex %>%
left_join(crash.hex) %>%
st_sf()
#
# jam.panel = jam.panel %>% filter(hour(ymdh)%in%5:22)
# jam.panel %>% head(10000) %>% view()
# jam.panel %>% .$ci %>% is.na() %>% mean()
crash.panel %>% dim()
crash.panel %>% summary()
crash.panel %>%
st_drop_geometry() %>%
group_by(hour) %>%
summarise(mean(!is.na(n))) %>%
plot()
crash.panel %>%
st_write("data/crash.panel.geojson")
write.csv(crash.panel,"data/crashpanel.csv", row.names = FALSE)
#pt.panel %>% ggplot()
hex_crash <- crash.panel %>%
ggplot()+
geom_sf(data=gates,size=2,color="steelblue")+
# geom_sf(data=border,fill="#00000022",size=0.1)+
geom_sf(data=roads.highway,size=0.1,alpha=0.4)+
geom_sf(data=roads.major,size=0.1,alpha=0.4)+
geom_sf(data=border.military,fill="#00000044",size=0.1)+
geom_sf(aes(fill = q5(n)),alpha=0.6, color="transparent")+
#facet_wrap(~hour,nrow = 3)+
labs(title="Number of Cash Incidents",
subtitle = "Colorado Springs, CO")+
mapTheme()+
scale_fill_manual(values = plasma(n=5, begin = 0.9, end = 0.2), labels = qBr(crash.panel, "n"), name="Crash(Quintitle Breaks)")
hex_crash
ggsave("img/pt_hexmap.png",bg="white")
loss <- readxl::read_xlsx("data/2020 Master Statewide LOSS Listing Spreadsheet.xlsx",skip=1) %>%
mutate(ROUTE = paste0(str_pad(Highway,3,pad='0'),Section))
highways <- st_read("data/ELPA/HIGHWAYS.shp") %>% st_zm() %>% select(ROUTE, REFPT,ENDREFPT,FROMMEAS,TOMEAS)
elpaso <- highways %>% distinct(ROUTE)
joined = list()
for (i in 1:nrow(elpaso)) {
loss_i = loss %>% filter(ROUTE == elpaso[i,"ROUTE"])
highways_i = highways %>% filter(ROUTE == elpaso[i,"ROUTE"])
join_i <- interval_inner_join(loss_i,highways_i,by=c(`Start MP`="FROMMEAS",`End MP`="TOMEAS")) %>%
st_sf()
join_i$i <- i
joined[[i]] <- join_i
}
joined <- bind_rows(joined) %>% st_transform(4326)
basemap <- get_stamenmap(bbox=c(left=-104.96681,
bottom=38.65013,
right=-104.66469,
top=38.93751), zoom=11,maptype="toner")
ggmap(basemap) +
geom_sf(data=joined,size=3,aes(color=as.factor(`LOSS (Total)`)),inherit.aes = F) +
scale_color_viridis_d(option = "A",direction=-1) +
theme(axis.title = element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"),
legend.key.width = unit(0.2, "cm")) +
labs(title="Level of Service of Safety, Colorado Springs interstate highways",
subtitle="Source: Colorado DOT",
color="LOSS")
ggmap(basemap) +
geom_sf(data=highways %>% st_transform(4326),size=3,aes(color=AADT),inherit.aes = F) +
scale_color_viridis_c(option = "A",direction=-1) +
theme(axis.title = element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2)
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"),
legend.key.width = unit(0.2, "cm")),
labs(title="Annual average daily traffic, Colorado Springs interstate highways",
subtitle="Source: Colorado DOT")
options(scipen=1000000)
loss %>% filter(County == 'El Paso' & AADT > 10000) %>%
ggplot(aes(x=as.factor(`LOSS (Total)`),y=AADT)) +
geom_violin(aes(fill=as.factor(`LOSS (Total)`)),lwd=0,show.legend=F) +
geom_boxplot(width=0.1) + plotTheme +
scale_fill_viridis_d(option = "A",direction=-1) +
labs(title='LOSS versus AADT, El Paso County roads',
subtitle='AADT above 10000',
x='LOSS')
# download osm data
library(tidyverse)
library(osmdata)
library(sf)
library(FNN)
crs = 2232
download.boudnary <- opq(bbox = c(-104.9668,38.65013,-104.6647,38.93751))
getOsmCentroid = function (osm){
res = osm$osm_points
if (!osm$osm_lines %>% is.null()){
res = res %>% bind_rows(osm$osm_lines %>%
st_centroid())
}
if (!osm$osm_polygons %>% is.null()){
res = res %>% bind_rows(osm$osm_polygons %>%
st_centroid())
}
if (!osm$osm_multilines %>% is.null()){
res = res %>% bind_rows(osm$osm_multilines %>%
st_centroid())
}
if (!osm$osm_multipolygons %>% is.null()){
res = res %>% bind_rows(osm$osm_multipolygons %>%
st_centroid())
}
res = res %>% select(geometry)
return(res)
}
parking.lots =
add_osm_feature(opq = download.boudnary,
key = 'amenity',
value = "parking") %>%
osmdata_sf() %>%
getOsmCentroid() %>%
st_transform(crs)
link.roads =
add_osm_feature(opq = download.boudnary,
key = 'highway',
value = c("motorway_link","trunk_link","primary_link",
"secondary_link"," tertiary_link")) %>%
osmdata_sf() %>%
getOsmCentroid() %>%
st_transform(crs)
bus.stops =
add_osm_feature(opq = download.boudnary,
key = 'highway',
value = c("bus_stop")) %>%
osmdata_sf() %>%
getOsmCentroid() %>%
st_transform(crs)
signs =
add_osm_feature(opq = download.boudnary,
key = 'highway',
value = c("stop","traffic_mirror","traffic_signals")) %>%
osmdata_sf() %>%
getOsmCentroid() %>%
st_transform(crs)
lights =
add_osm_feature(opq = download.boudnary,
key = 'highway',
value = c("street_lamp")) %>%
osmdata_sf() %>%
getOsmCentroid() %>%
st_transform(crs)
parking.building =
add_osm_feature(opq = download.boudnary,
key = 'building',
value = c("parking")) %>%
osmdata_sf() %>%
getOsmCentroid() %>%
st_transform(crs)
v(parking.building)
nn.hex = hex %>%
st_centroid() %>%
mutate(
nn.parkinglots = nn_function(st_coordinates(geometry), st_coordinates(parking.lots), 20),
nn.link.roads = nn_function(st_coordinates(geometry), st_coordinates(link.roads), 10),
nn.bus.stops = nn_function(st_coordinates(geometry), st_coordinates(bus.stops), 1),
nn.signs = nn_function(st_coordinates(geometry), st_coordinates(signs), 1),
nn.lights = nn_function(st_coordinates(geometry), st_coordinates(lights), 2),
nn.parking.building = nn_function(st_coordinates(geometry), st_coordinates(parking.building), 1),
)
nn.hex = nn.hex %>%
mutate(
nn.parkinglots = nn.parkinglots %>% log1p(),
nn.link.roads = nn.link.roads %>% log1p(),
nn.bus.stops = nn.bus.stops %>% log1p(),
nn.signs = nn.signs %>% log1p(),
nn.parking.building = nn.parking.building %>% log1p()
)
nn.hex %>%
st_drop_geometry() %>%
pivot_longer(-hex_id,names_to = "key",values_to = "value") %>%
ggplot()+
geom_histogram(aes(value))+
facet_wrap(~key,scale="free")
nn.hex %>%
v(z=c("nn.parkinglots","nn.link.roads","nn.bus.stops",
"nn.signs","nn.lights","nn.parking.building"))
nn.hex %>%
st_drop_geometry() %>%
write.csv("data/osm.features.csv",row.names = F)
# tidymodels for ML
# install.packages("tidymodels")
# install.packages("xgboost")
library(tidymodels) # for the parsnip package, along with the rest of tidymodels
# Helper packages
# library(readr) # for importing data
# library(broom.mixed) # for converting bayesian models to tidy tibbles
# library(dotwhisker) # for visualizing regression results
library(caret)
library(xgboost)
library(glmnet)
library(ranger)
library(ROSE) # sampling
# examine correlation for categorical data
# ggplot(urchins,
# aes(x = initial_volume,
# y = width,
# group = food_regime,
# col = food_regime)) +
# geom_point() +
# geom_smooth(method = lm, se = FALSE) +
# scale_color_viridis_d(option = "plasma", end = .7)
#> `geom_smooth()` using formula 'y ~ x'
# spliting test and train
jam.panel.model = st_read("data/jam.panel.model.geojson")
jam.panel.model = st_read("data/jam.panel.count.model.geojson")
jam.panel.model = read.csv("data/jam.panel.model.newnogeo.csv")
# jam.panel.model = jam.panel
jam.panel.data =
jam.panel.model%>%
st_drop_geometry() %>%
# rowwise() %>%
mutate(delay = ifelse(delay<0,500,delay),
dependent.variable=jam.count,
) %>%
select(-jam_id,-hex_id,-level,-jam.count,-ci,
-delay,-electricity_usage,
# -lagTemp,-lagPrecp,-lagWS,-lagVisib,
-starts_with("ci.lag"))
jam.panel.data %>% glimpse()
jam.panel.data$dependent.variable %>% hist()
# jam.panel.model %>% pull(ymdh) %>% unique()
reg.train = jam.panel.data %>%
filter(ymdh<=ymd("2021-10-21")) %>%
select(-ymdh)
reg.test = jam.panel.data %>%
filter(ymdh>ymd("2021-10-21")&
ymdh<=ymd("2021-10-28")) %>%
select(-ymdh)
#pca before the regression
install.packages("stats")
library(stats)
jam.panel.data.pca <- prcomp(jam.panel.data[,c(3:10,13:19,21:24,26:28)], center = TRUE,scale. = TRUE)
summary(jam.panel.data.pca)
str(jam.panel.data.pca)
library(devtools)
install_github("vqv/ggbiplot")
library(ggbiplot)
ggbiplot(jam.panel.data.pca)
#model 0 - PCR?
#install.packages("pls")
library(pls)
reg.train.num <- reg.train[,c(2:9,12:18,20:23,24:27)]
reg.test.num <- reg.test[,c(2:9,12:18,20:23,24:27)]
set.seed(777)
pcr_model <- pcr(dependent.variable~., data=reg.train.num, scale=TRUE, validation="CV")
summary(pcr_model)
#If we only use the intercept term in the model, the test RMSE is 0.4532.
#If we add in the first principal component, the test RMSE drops to 0.4278.
#If we add in the second principal component, the test RMSE drops to 0.4221.
#By using just the first principal component, we can explain 2.766% of the variation in the response variable.
#By adding in the second principal component, we can explain 3.945% of the variation in the response variable.
validationplot(pcr_model)
validationplot(pcr_model, val.type="MSEP")
validationplot(pcr_model, val.type="R2")
# oops we don't have an optimal point :\
#use model to make predictions on a test set
pcr_pred <- predict(pcr_model, reg.test.num[,c(1:22)], ncomp=221)
#calculate RMSE
sqrt(mean((pcr_pred - reg.test.num[,c("dependent.variable")])^2))
#the result is 0.3648681
# model 1 - linear regression
linear_model = lm(dependent.variable~.,data = reg.train)
log_linear_model = lm(log1p(dependent.variable)~.,data = reg.train)
linear_model %>% summary()
linear_model %>% plot()
tc = trainControl(method = "cv",number = 5)
linear_model = train(ci~.,data = reg.train,
trControl = tc, method = "lm")
linear_model %>% summary()
mean(linear_model$residuals^2)^0.5
prediction = predict(linear_model,
newdata = reg.test)
log_prediction = predict(linear_model,newdata = reg.test) %>% expm1()
R2(prediction,reg.test$dependent.variable)
RMSE(prediction,reg.test$dependent.variable)
# model 2 - random forest #
rf_model = ranger(dependent.variable~., data = reg.train)
rf_model
reg.train.sampling =
ROSE::ovun.sample(is.0~.,
reg.train %>%
mutate(is.0=ifelse(dependent.variable==0,0,1)),
N=nrow(reg.train), p=0.5,
method = 'both')$data %>%
dplyr::select(-is.0)
reg.train.sampling %>%
ggplot()+
geom_bar(aes(dependent.variable))+
reg.train %>%
ggplot()+
geom_bar(aes(dependent.variable))
reg.train$dependent.variable %>% hist()
rf_model2 = ranger(dependent.variable~.,
data = reg.train.sampling,
# importance = "impurity"
)
reg.train %>% summary()
rf_model3 = ranger(dependent.variable~.,
data = reg.train,
importance = "impurity"
)
# tuning ------------------------------------------------------------------
hyperparameter_grid = expand.grid(
mtry = c(1,3, 5),
min.node.size = c(1, 10, 18),
replace = c(TRUE),
sample.fraction = c(0.5,1),
rmse = NA
)
for(i in 1:nrow(hyperparameter_grid)) {
tmp = ranger(
formula = dependent.variable ~ .,
data = reg.train %>% sample_frac(0.25),
num.trees = 500,
mtry = hyperparameter_grid$mtry[i],
min.node.size = hyperparameter_grid$min.node.size[i],
replace = hyperparameter_grid$replace[i],
sample.fraction = hyperparameter_grid$sample.fraction[i],
verbose = T,
seed = 100,
)
print(paste(i," / ",nrow(hyperparameter_grid)))
hyperparameter_grid$rmse[i] = sqrt(tmp$prediction.error)
}
best_combo_index = which.min(hyperparameter_grid$rmse)
best_combo_index
hyperparameter_grid[best_combo_index,]
rf_model3 = ranger(
formula = dependent.variable ~ .,
data = reg.train,
num.trees = 500,
mtry = hyperparameter_grid[best_combo_index, "mtry"],
min.node.size = hyperparameter_grid[best_combo_index, "min.node.size"],
replace = hyperparameter_grid[best_combo_index, "replace"],
sample.fraction = hyperparameter_grid[best_combo_index, "sample.fraction"],
verbose = FALSE,
seed = 777 )
# chaining cv ---------------------------------------------------------------
split.dates = seq(
ymd("2021-10-01"),
ymd("2021-12-31"),
as.difftime(period(weeks=1))
)[-1]
forward.chaining = data.frame(
split.date = split.dates,
baseline.R2 = NA,
baseline.RMSE = NA,
rf.R2 = NA,
rf.RMSE = NA
)
# for (i in 1:(length(split.dates)-1)) {
for (i in 1:3) {
reg.train = jam.panel.data %>%
filter(ymdh<=ymd(split.dates[i])) %>%
select(-ymdh)
reg.train.cats = reg.train$wea.cat %>% unique()
reg.test = jam.panel.data %>%
filter(ymdh>ymd(split.dates[i]) &
ymdh<=ymd(split.dates[i+1])) %>%
select(-ymdh)
print(dim(reg.test))
reg.test = reg.test %>%
filter(wea.cat%in%reg.train.cats)
print(dim(reg.test))
linear_model = lm(dependent.variable~.,data = reg.train)
rf_model = ranger(dependent.variable~.,
data = reg.train)
prediction.s = predict(rf_model, data = reg.test)$predictions
forward.chaining[i,]$rf.R2 = R2(prediction.s,reg.test$dependent.variable)
forward.chaining[i,]$rf.RMSE = RMSE(prediction.s,reg.test$dependent.variable)
prediction = predict(linear_model, newdata = reg.test)
forward.chaining[i,]$baseline.R2 = R2(prediction,reg.test$dependent.variable)
forward.chaining[i,]$baseline.RMSE = RMSE(prediction,reg.test$dependent.variable)
}
forward.chaining %>%
drop_na() %>%
ggplot() +
geom_line(aes(split.date,rf.R2, color="rf.R2")) +
geom_point(aes(split.date,rf.R2, color="rf.R2")) +
geom_line(aes(split.date,baseline.R2,color="baseline.R2")) +
geom_point(aes(split.date,baseline.R2,color="baseline.R2")) +
labs(title = "Forward Chaining Cross Validation",
subtitle = "Using OLS and Random Forest Trained on Growing Data\nTested on following 2 weeks.",
)+
scale_colour_manual(name = 'R Squared',
values =c("rf.R2"=p12[8],
"baseline.R2"=p12[4]))+
plotTheme()
ggsave("img/cv-fc.png",bg="white",
width=11,height=6,scale=1)
# window cv ---------------------------------------------------------------
split.dates.window = seq(
ymd("2021-10-01"),
ymd("2021-12-31"),
as.difftime(period(weeks=2))
)
sliding.window = data.frame(
split.date = split.dates.window,
baseline.R2 = NA,
baseline.RMSE = NA,
rf.R2 = NA,
rf.RMSE = NA
)
for (i in 2:(length(split.dates.window)-1)) {
reg.train = jam.panel.data %>%
filter(ymdh<=ymd(split.dates.window[i])&
ymdh>=ymd(split.dates.window[i-1])) %>%
select(-ymdh)
reg.train.cats = reg.train$wea.cat %>% unique()
reg.test = jam.panel.data %>%
filter(ymdh>ymd(split.dates.window[i]) &
ymdh<=ymd(split.dates.window[i+1])) %>%
select(-ymdh)
print(dim(reg.test))
reg.test = reg.test %>%
filter(wea.cat%in%reg.train.cats)
print(dim(reg.test))
linear_model = lm(dependent.variable~.,data = reg.train)
rf_model = ranger(dependent.variable~.,
data = reg.train)
prediction.s = predict(rf_model, data = reg.test)$predictions
sliding.window[i,]$rf.R2 = R2(prediction.s,reg.test$dependent.variable)
sliding.window[i,]$rf.RMSE = RMSE(prediction.s,reg.test$dependent.variable)
prediction = predict(linear_model, newdata = reg.test)
sliding.window[i,]$baseline.R2 = R2(prediction,reg.test$dependent.variable)
sliding.window[i,]$baseline.RMSE = RMSE(prediction,reg.test$dependent.variable)
}
sliding.window %>%
ggplot() +
geom_line(aes(split.date,rf.R2, color="rf.R2")) +
geom_point(aes(split.date,rf.R2, color="rf.R2")) +
geom_line(aes(split.date,baseline.R2,color="baseline.R2")) +
geom_point(aes(split.date,baseline.R2,color="baseline.R2")) +
labs(title = "Sliding Window Cross Validation",
subtitle = "Using OLS and Random Forest Trained on 2 Weeks of Data\nTested on following 2 weeks.",
)+
scale_colour_manual(name = 'R Squared',
values =c("rf.R2"=p12[8],
"baseline.R2"=p12[4]))+
plotTheme()
ggsave("img/cv-sl.png",bg="white",
width=11,height=6,scale=1)
# verification ------------------------------------------------------------
prediction.s = predict(rf_model2,data = reg.test)
prediction.s = prediction.s$predictions #%>% expm1()
R2(prediction.s,reg.test$dependent.variable)
RMSE(prediction.s,reg.test$dependent.variable)
plot(reg.test$dependent.variable~prediction.s)
prediction = predict(rf_model3, data = reg.test)
prediction = prediction$predictions #%>% expm1()
# prediction %>% expm1() %>% expm1() %>% summary()
R2(prediction,reg.test$dependent.variable)
RMSE(prediction,reg.test$dependent.variable)
plot(reg.test$dependent.variable~prediction)
# model 3 - prophet model # no seasonal
# model 4 - mixed-effect
# model 5 - XGB / lgbm #
xg_model = xgboost(data = train.matrix,
label = ytr, max.depth = 2,
eta = 1, nthread = 2, nrounds = 2)
prediction = predict(xg_model,newdata = test.matrix) %>% expm1()
R2(prediction,yte)
RMSE(prediction,yte)
plot(prediction,yte)
# model 6 - Lasso and Elastic-Net Regularized Generalized Linear Models (glmnet)#
train.matrix = model.matrix(log1p(dependent.variable)~.,reg.train)[,-1]
test.matrix = model.matrix(log1p(dependent.variable)~.,reg.test)[,-1]
ytr = reg.train$dependent.variable
yte = reg.test$dependent.variable
# lasso
lasso_model = cv.glmnet(train.matrix,ytr,alpha=1)
lasso_model
prediction = predict(lasso_model,newx = test.matrix) %>% expm1()
R2(prediction,yte)
RMSE(prediction,yte)
# ridge
ridge_model = cv.glmnet(train.matrix,ytr,alpha=0)
ridge_model
prediction = predict(ridge_model,newx = test.matrix) %>% expm1()
R2(prediction,yte)
RMSE(prediction,yte)
# model 7 - SVM #
# model 8 - Weighted Logistic Beyes #
# model 9 - GWR spatial models
############################
# PREDICTION VISUALIZATION #
############################
read.csv("data/model-results.csv") %>%
ggplot()+
geom_col(aes(model,test.R2,fill=dependent.variable),size=0)+
scale_fill_viridis(option="plasma",discrete = T,begin=0.3,end=0.6,guide="none")+
coord_flip()+
facet_wrap(~dependent.variable,ncol=1,scale="free_y",strip.position = "left")+
plotTheme()+
labs(title="Test R-Squared on Different Dependent Variables",
subtitle = "Using Regression Models: OLS, Ridge, Lasso, RF, and XGB",
x = "",y="Test R-Squared")
ggsave("img/model-res.png",bg="white",
width=10,height=6,scale=0.8)
getCategories = function(n){
cates = strsplit("Time Lags,Spatial Lags,Weather,Others",",")[[1]]
cates = factor(cates,levels=cates)
return(cates[n])
}
data.frame(keyName=names(rf_model3$variable.importance),
value=rf_model3$variable.importance, row.names=NULL,
category = append(c(1,1,1,1,1,1,1,1,1,1,1,4,2,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,4,2,2,2,2,2,2),numeric(38)+3)) %>%
mutate(keyName = fct_reorder(keyName, value),
category = getCategories(category)) %>%
ggplot()+
geom_col(aes(value,keyName,fill=category),size=0)+
scale_fill_viridis(option="plasma",discrete = T,begin=0.8,end=0.2)+
labs(title="Variable importance for Jam Count",)+
plotTheme()
ggsave("img/predictor-importance2.png",bg="white",
width=11,height=6,scale=1)
factor(1,levels = 1:10)
reg.pred = jam.panel.model %>%
filter(ymdh > ymd("2021-10-21")) %>%
mutate(prediction = prediction,
dependent.variable=jam.count,
error = dependent.variable-prediction) %>%
select(dependent.variable, prediction,
error, ymdh, hex_id)
reg.pred.agg.byhour = reg.pred %>%
st_drop_geometry() %>%
group_by(ymdh) %>%
summarise(prediction=mean(prediction),
real=mean(dependent.variable),
error=mean(error))
reg.pred.agg.his = jam.panel.model %>%
st_drop_geometry() %>%
group_by(ymdh) %>%
summarise(jam.count=mean(jam.count)) %>%
filter(ymdh > ymd("2021-10-14") &
ymdh <= ymd("2021-10-21"))
reg.pred.agg.byhour %>%
ggplot()+
geom_line(data = reg.pred.agg.his,
aes(ymdh, jam.count,color=p12[4]),
size=0.8)+
geom_line(aes(ymdh,real,color="black"),alpha=0.3)+
geom_point(aes(ymdh,real,color="black"),alpha=0.3,size=1)+
geom_line(aes(ymdh,prediction,color=p12[6]),size=0.8)+
geom_point(aes(ymdh, prediction,color=p12[6]),
size=1.5,shape=21,fill="white",stroke=0.8)+
geom_vline(xintercept=reg.pred.agg.byhour$ymdh[1],size=0.5,color="black")+
scale_color_identity(name = "Legend",
breaks = c(p12[6],"black",p12[4]),
labels = c("Prediction","Actual Value","Training Data"),
guide = "legend")+
labs(title = "Prediction vs. Real Jam Count",
subtitle = "using Random Forest Trained on 3 Weeks of Data",
x="Hour", y = "Mean Jam Count")+
plotTheme()
ggsave("img/model-pre.png",bg="white",
width=10,height=5,scale=1)
reg.pred.agg.byhour %>%
group_by(hour = hour(ymdh)) %>%
summarize(error=mean(error)) %>%
ggplot()+
geom_col(aes(hour,error,fill=error),size=0,alpha=1)+
scale_fill_viridis_b(option="plasma") +
labs(title = "Error of Jam Count by Hour",
subtitle = "blue: prediction, gray: real jam count")
ggplot(data.frame(pred = prediction.s,
actual = reg.test$dependent.variable),
aes(x = pred, y = actual)) +
# geom_boxplot(aes(x = factor(actual), y =pred),alpha = 0.5,size=0.5) +
geom_jitter(alpha = 0.1,size=0.5,width = 0,height=1) +
coord_equal(xlim=c(0,15)) +
geom_abline() +
geom_smooth(method = "lm") +
plotTheme()+
labs(title = "Prediction vs. Actual Value",
subtitle = "blue: prediction, gray: real jam count")
# MAP ---------------------------------------------------------------------
mapborder="POLYGON ((-104.94964599609374 38.70908932739828, -104.63104248046875 38.70908932739828, -104.63104248046875 38.86697874644268, -104.94964599609374 38.86697874644268, -104.94964599609374 38.70908932739828))" %>%
data.frame(w=.) %>%
st_as_sf(wkt="w",crs=4326) %>%
st_transform(crs)
x.b = st_coordinates(mapborder)[1:2,1]
y.b = st_coordinates(mapborder)[3:2,2]
jam.panel %>% pull(ymdh) %>% hour() %>% unique()
reg.pred.byHour = reg.pred %>%
group_by(hex_id, hour=hour(ymdh)) %>%
summarise(error=mean(error))
my.animation = reg.pred.byHour %>%
# filter(hour%in%5:22) %>%
ggplot()+
# geom_sf(data=gates,size=1,color="steelblue")+
geom_sf(data=roads,size=0.1,alpha=0.2)+
geom_sf(data=border.military,fill="#00000022",size=0.2)+
geom_sf(aes(fill=error),alpha=0.8,size=0.01,color="white")+
scale_fill_viridis(option="plasma",name="Prediction Error")+
# facet_wrap(~hour, ncol=6) +
coord_sf(xlim=x.b,ylim=y.b,expand = F)+
transition_manual(hour+5) +
labs(title="Error Map of Dependent Variable at {current_frame}:00",
subtitle = "Oct 2021, Colorado Springs, CO")+
mapTheme()
animate(my.animation,res=150,width=10,height=6,fps = 15,unit="in")
anim_save("img/error-map.gif")
# eda map ---------------------------------------------------------------------
reg.act.byHour = reg.pred %>%
group_by(hex_id, hour=hour(ymdh)) %>%
summarise(jam.count=mean(dependent.variable))
my.animation = reg.act.byHour %>%
# filter(hour%in%5:22) %>%
ggplot()+
geom_sf(data=roads,size=0.1,alpha=0.2)+
geom_sf(data=border.military,fill="#00000022",size=0.2)+
geom_sf(aes(fill=jam.count %>% log1p()),alpha=0.8,size=0.01,color="white")+
scale_fill_viridis(option="plasma",name="Jam Count")+
# facet_wrap(~hour, ncol=6) +
transition_manual(hour+5) +
coord_sf(xlim=x.b,ylim=y.b,expand = F)+
labs(title="Error Map of Dependent Variable at {current_frame}:00",
subtitle = "Oct 2021, Colorado Springs, CO")+
mapTheme()
animate(my.animation,res=150,width=10,height=6,fps = 15,unit="in")
anim_save("img/dv-map.gif")
#mae by geoid
# generating web data -------------------------------------------------------------
prediction = predict(rf_model3, data = reg.train)
prediction = prediction$predictions #%>% expm1()
web.data.1 = jam.panel.model %>%
filter(ymdh <= ymd("2021-10-23")) %>%
mutate(prediction = prediction,
dependent.variable=jam.count) %>%
select(dependent.variable, prediction,
ymdh, hex_id) %>%
# summary()
filter(ymdh<ymd("2021-10-23") &
ymdh>=ymd("2021-10-22")) %>%
st_transform(4326)
web.data.1 = web.data.1 %>%
filter(hour(ymdh)==15) %>%
rename(time_now = ymdh,
jam_count_now = dependent.variable) %>%
select(-prediction) %>%
st_drop_geometry() %>%
left_join(web.data.1 %>%
filter(hour(ymdh)==17)%>%
rename(time_prediction = ymdh,
jam_count_prediction = prediction) %>%
select(-dependent.variable)) %>%
select(hex_id,time_now,jam_count_now,time_prediction,jam_count_prediction,geometry)
system("rm data/map-data.geojson")
web.data.1 %>% st_write("data/map-data.geojson")
###
last_dates = c("2021-10-01","2021-10-08","2021-10-15","2021-10-22")
web.data.2 = jam.panel.model%>%
st_drop_geometry() %>%
filter(as.Date(date(ymdh), "%Y-%m-%d")%in%last_dates) %>%
group_by(ymdh, hex_id) %>%
summarise(jam.count = mean(jam.count)) %>%
rowwise() %>%
mutate(hour=hour(ymdh),
v=as.Date(date(ymdh), "%Y-%m-%d"),
v = which(last_dates==v),
v = paste("jam_count",4-v,"weeks_ago",sep="_")) %>%
select(-ymdh)
web.data.22 = web.data.2 %>%
pivot_wider(names_from = v,
values_from = c("jam.count"),) %>%
group_by(hex_id,hour) %>%
summarise(jam_count_3_weeks_ago=max(jam_count_3_weeks_ago,na.rm = T),
jam_count_2_weeks_ago=max(jam_count_2_weeks_ago,na.rm = T),
jam_count_1_weeks_ago=max(jam_count_1_weeks_ago,na.rm = T),
jam_count_0_weeks_ago=max(jam_count_0_weeks_ago,na.rm = T),)
web.data.22 %>% write_csv("./data/chart-data.csv",)
###
gate.ids = gates %>%
select(geometry) %>%
st_nearest_feature(hex)
system("rm data/gates-data.geojson")
gates %>%
select(geometry,gate_name=Name) %>%
mutate(hex_id = hex[gate.ids,]$hex_id) %>%
st_transform(4326) %>%
st_write("data/gates-data.geojson")
91% commute-based and 34.6% highway peak-hour-based congestion↩︎