How long does Ariel’s commute take, and does it matter when she leaves?
Ariel grabbed her data from Moves. Here I’m going to see how timeofday affects her commute duration.
Side note: Kudos to Moves for excellent export facilities. Absolutely wonderful. They give you the data in every format and breakdown imaginable.
%pylab inline
import pandas as pd
import seaborn
df = pd.read_csv('storyline.csv')
df['Start'] = pd.to_datetime(df['Start'])
# we only want since the current commute started
df = df[df['Start'] > '20150301']
df.head()

Date 
Type 
Name 
Start 
End 
Duration 
4897 
3/1/15 
place 
Home 
20150301 05:00:00 
20150301T11:17:2705:00 
40647 
4898 
3/1/15 
move 
transport 
20150301 16:17:27 
20150301T11:49:2005:00 
1913 
4899 
3/1/15 
move 
walking 
20150301 16:49:19 
20150301T11:56:5805:00 
459 
4900 
3/1/15 
place 
Place in Cambridgeport 
20150301 16:56:59 
20150301T14:12:0305:00 
8104 
4901 
3/1/15 
move 
walking 
20150301 19:12:03 
20150301T14:19:0405:00 
421 
Later I’m going to only want the time of day, not the date, so set all dates to 19700101.
df['Start'] = df['Start'].apply(lambda x: x.replace(year=1970, month=1, day=1))
I want to know which transport rows go between Home and Koch, so I make two new columns shifting name up and down so I can match against those.
df['prev_name'] = df['Name'].shift(1)
df['next_name'] = df['Name'].shift(1)
df.head()

Date 
Type 
Name 
Start 
End 
Duration 
prev_name 
next_name 
4897 
3/1/15 
place 
Home 
19700101 05:00:00 
20150301T11:17:2705:00 
40647 
NaN 
transport 
4898 
3/1/15 
move 
transport 
19700101 16:17:27 
20150301T11:49:2005:00 
1913 
Home 
walking 
4899 
3/1/15 
move 
walking 
19700101 16:49:19 
20150301T11:56:5805:00 
459 
transport 
Place in Cambridgeport 
4900 
3/1/15 
place 
Place in Cambridgeport 
19700101 16:56:59 
20150301T14:12:0305:00 
8104 
walking 
walking 
4901 
3/1/15 
move 
walking 
19700101 19:12:03 
20150301T14:19:0405:00 
421 
Place in Cambridgeport 
transport 
koch = 'MIT Koch Day Care Ctr'
east = df.loc[(df.Name == 'transport') & (df.prev_name == 'Home') & (df.next_name == koch) ]
west = df.loc[(df.Name == 'transport') & (df.next_name == 'Home') & (df.prev_name == koch) ]
east.head()

Date 
Type 
Name 
Start 
End 
Duration 
prev_name 
next_name 
4945 
3/5/15 
move 
transport 
19700101 12:49:44 
20150305T08:45:3605:00 
3352 
Home 
MIT Koch Day Care Ctr 
4983 
3/9/15 
move 
transport 
19700101 11:50:58 
20150309T08:32:0204:00 
2464 
Home 
MIT Koch Day Care Ctr 
5059 
3/16/15 
move 
transport 
19700101 12:03:33 
20150316T08:40:4004:00 
2227 
Home 
MIT Koch Day Care Ctr 
5075 
3/17/15 
move 
transport 
19700101 12:16:17 
20150317T08:49:5804:00 
2021 
Home 
MIT Koch Day Care Ctr 
5093 
3/18/15 
move 
transport 
19700101 12:21:08 
20150318T08:51:0004:00 
1792 
Home 
MIT Koch Day Care Ctr 
plot(west.Start,west.Duration,'.') # not bothering to make the x axis a date for now
Let’s get rid of outliers.
Yes, obviously if you leave work in the middle of the day, there’s zero traffic, but we don’t want those 5 points driving the entire model, so let’s drop them.
west = west.ix[(west.Start > 7.5e13) & (west.Start < 8e13)].copy()
plot(west.Start,west.Duration,'.')
plot(east.Start,east.Duration,'.')
east = east.ix[(east.Start > 4.2e13) & (east.Start < 5e13)].copy()
plot(east.Start,east.Duration,'.')
How long does the commute take?
hist([k/60 for k in east.Duration], 15, [20, 60])
# using pyplot.hist instead of DataFrame.hist so I can /60 to get minutes
title('Eastbound')
xlabel('Minutes')
hist([k/60 for k in west.Duration], 15, [20, 60])
title('Westbound')
xlabel('Minutes')
Write to file.
east.to_csv('east.csv')
west.to_csv('west.csv')
Now, do some statistics, in R
Thanks to @ozjimbob for most of this.
Load in the data and convert the times to a POSIXct class, then make them numeric because gam models don’t always play nice with this class. So we have seconds from the UNIX epoch.
I’m also going to convert Duration to minutes, because who thinks in seconds.
east=read.csv("east.csv", stringsAsFactors=FALSE)
west=read.csv("west.csv" , stringsAsFactors=FALSE)
east$Start=as.numeric(as.POSIXct(east$Start))
west$Start=as.numeric(as.POSIXct(west$Start))
east$Start = east$Start  3600*4 # fix time zones back to US/Eastern
west$Start = west$Start  3600*4
east$Duration = east$Duration / 60
west$Duration = west$Duration / 60
We’re going to do generalized additive models that allow you to fit a spline smooth through a predictor variable.
e_mod=mgcv::gam(Duration ~ s(Start),data=east)
w_mod=mgcv::gam(Duration ~ s(Start),data=west)
Let’s examine the summaries for these.
East
summary(e_mod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Duration ~ s(Start)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 35.4346 0.8483 41.77 <2e16 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F pvalue
## s(Start) 6.574 7.712 6.318 1.18e05 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rsq.(adj) = 0.46 Deviance explained = 52.7%
## GCV = 45.195 Scale est. = 38.856 n = 54
West
summary(w_mod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Duration ~ s(Start)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 41.2279 0.8465 48.7 <2e16 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F pvalue
## s(Start) 1 1 0.139 0.711
##
## Rsq.(adj) = 0.0195 Deviance explained = 0.316%
## GCV = 34.463 Scale est. = 32.965 n = 46
West looks terrible  there’s just nothing there. Let’s start with East.
East  morning commute to work
Let’s look at the east model  generate predictions, with standard errors, for each possible time between the minimum and maximum.
pred_frame=data.frame(Start=min(east$Start):max(east$Start))
e_predict=predict(e_mod,newdata = pred_frame,se.fit=TRUE)
pred_frame$pred=e_predict$fit
pred_frame$upper=e_predict$fit + e_predict$se.fit
pred_frame$lower=e_predict$fit  e_predict$se.fit
Now prepare a graph  plot the predicted fit with the error bars. Convert the times back into POSIXct for nice plotting too.
pgoe_x = c(pred_frame$Start,rev(pred_frame$Start))
pgoe_y = c(pred_frame$upper,rev(pred_frame$lower))
pred_frame$Start_po=as.POSIXct(pred_frame$Start,origin="19700101")
plot(pred_frame$pred ~ pred_frame$Start_po,type="l",ylim=c(20,60),xlab="Start",ylab="Duration")
polygon(pgoe_x,pgoe_y,col="#44444444")
points(east$Duration ~ east$Start)
Let’s try a regression tree.
rp_mod=rpart(Duration ~ Start,data=east)
rp_predict=predict(rp_mod,newdata=pred_frame)
pred_frame$pred_rt = rp_predict
plot(pred_frame$pred_rt ~ pred_frame$Start_po,col="red")
points(east$Duration ~ east$Start)
And a randomForest.
rf_mod=randomForest(Duration ~ Start,data=east)
print(rf_mod)
##
## Call:
## randomForest(formula = Duration ~ Start, data = east)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 53.52246
## % Var explained: 24.24
Plot the model predictions.
rf_predict=predict(rf_mod,newdata=pred_frame,type="response")
pred_frame$pred_rf = rf_predict
plot(pred_frame$pred_rf ~ pred_frame$Start_po,col="red")
points(east$Duration ~ east$Start)
It’s insanely overfitted, but ok.
West  evening ride home
Let’s look at the west plot.
Ariel leaves work in such a narrow window that there’s just no difference in the small changes she makes. Maybe in the future she can start experimenting with intentionally leaving later to see what that does to her commute time.
Everything’s going to be garbage, but let’s look anyway. Regression tree:
randomForest.
print(rf_mod)
##
## Call:
## randomForest(formula = Duration ~ Start, data = west)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 49.80263
## % Var explained: 57.45
lol negative variance
lol nope