2015-08-04

Ariel's Commute

How long does Ariel’s commute take, and does it matter when she leaves?

Ariel grabbed her data from Moves. Here I’m going to see how time-of-day affects her commute duration.

Side note: Kudos to Moves for excellent export facilities. Absolutely wonderful. They give you the data in every format and breakdown imaginable.

%pylab inline
import pandas as pd
import seaborn
df = pd.read_csv('storyline.csv')
df['Start'] = pd.to_datetime(df['Start'])
# we only want since the current commute started
df = df[df['Start'] > '2015-03-01']
df.head()
Date Type Name Start End Duration
4897 3/1/15 place Home 2015-03-01 05:00:00 2015-03-01T11:17:27-05:00 40647
4898 3/1/15 move transport 2015-03-01 16:17:27 2015-03-01T11:49:20-05:00 1913
4899 3/1/15 move walking 2015-03-01 16:49:19 2015-03-01T11:56:58-05:00 459
4900 3/1/15 place Place in Cambridgeport 2015-03-01 16:56:59 2015-03-01T14:12:03-05:00 8104
4901 3/1/15 move walking 2015-03-01 19:12:03 2015-03-01T14:19:04-05:00 421

Later I’m going to only want the time of day, not the date, so set all dates to 1970-01-01.

df['Start'] = df['Start'].apply(lambda x: x.replace(year=1970, month=1, day=1))

I want to know which transport rows go between Home and Koch, so I make two new columns shifting name up and down so I can match against those.

df['prev_name'] = df['Name'].shift(1)
df['next_name'] = df['Name'].shift(-1)
df.head()
Date Type Name Start End Duration prev_name next_name
4897 3/1/15 place Home 1970-01-01 05:00:00 2015-03-01T11:17:27-05:00 40647 NaN transport
4898 3/1/15 move transport 1970-01-01 16:17:27 2015-03-01T11:49:20-05:00 1913 Home walking
4899 3/1/15 move walking 1970-01-01 16:49:19 2015-03-01T11:56:58-05:00 459 transport Place in Cambridgeport
4900 3/1/15 place Place in Cambridgeport 1970-01-01 16:56:59 2015-03-01T14:12:03-05:00 8104 walking walking
4901 3/1/15 move walking 1970-01-01 19:12:03 2015-03-01T14:19:04-05:00 421 Place in Cambridgeport transport

 

koch = 'MIT Koch Day Care Ctr'
east = df.loc[(df.Name == 'transport') & (df.prev_name == 'Home') & (df.next_name == koch) ]
west = df.loc[(df.Name == 'transport') & (df.next_name == 'Home') & (df.prev_name == koch) ]
east.head()
Date Type Name Start End Duration prev_name next_name
4945 3/5/15 move transport 1970-01-01 12:49:44 2015-03-05T08:45:36-05:00 3352 Home MIT Koch Day Care Ctr
4983 3/9/15 move transport 1970-01-01 11:50:58 2015-03-09T08:32:02-04:00 2464 Home MIT Koch Day Care Ctr
5059 3/16/15 move transport 1970-01-01 12:03:33 2015-03-16T08:40:40-04:00 2227 Home MIT Koch Day Care Ctr
5075 3/17/15 move transport 1970-01-01 12:16:17 2015-03-17T08:49:58-04:00 2021 Home MIT Koch Day Care Ctr
5093 3/18/15 move transport 1970-01-01 12:21:08 2015-03-18T08:51:00-04:00 1792 Home MIT Koch Day Care Ctr

 

    plot(west.Start,west.Duration,'.')  # not bothering to make the x axis a date for now

png

Let’s get rid of outliers.

Yes, obviously if you leave work in the middle of the day, there’s zero traffic, but we don’t want those 5 points driving the entire model, so let’s drop them.

west = west.ix[(west.Start > 7.5e13) & (west.Start < 8e13)].copy()
plot(west.Start,west.Duration,'.')

png

plot(east.Start,east.Duration,'.')

png

east = east.ix[(east.Start > 4.2e13) & (east.Start < 5e13)].copy()
plot(east.Start,east.Duration,'.')

png

How long does the commute take?

hist([k/60 for k in east.Duration], 15, [20, 60])
# using pyplot.hist instead of DataFrame.hist so I can /60 to get minutes
title('Eastbound')
xlabel('Minutes')

png

hist([k/60 for k in west.Duration], 15, [20, 60])
title('Westbound')
xlabel('Minutes')

png

Write to file.

east.to_csv('east.csv')
west.to_csv('west.csv')

Now, do some statistics, in R

Thanks to @ozjimbob for most of this.

Load in the data and convert the times to a POSIXct class, then make them numeric because gam models don’t always play nice with this class. So we have seconds from the UNIX epoch.

I’m also going to convert Duration to minutes, because who thinks in seconds.

east=read.csv("east.csv", stringsAsFactors=FALSE)
west=read.csv("west.csv" , stringsAsFactors=FALSE)
east$Start=as.numeric(as.POSIXct(east$Start))
west$Start=as.numeric(as.POSIXct(west$Start))
east$Start = east$Start - 3600*4   # fix time zones back to US/Eastern
west$Start = west$Start - 3600*4
east$Duration = east$Duration / 60
west$Duration = west$Duration / 60

We’re going to do generalized additive models that allow you to fit a spline smooth through a predictor variable.

e_mod=mgcv::gam(Duration ~ s(Start),data=east)
w_mod=mgcv::gam(Duration ~ s(Start),data=west)

Let’s examine the summaries for these.

East

summary(e_mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Duration ~ s(Start)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.4346     0.8483   41.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F  p-value    
## s(Start) 6.574  7.712 6.318 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.46   Deviance explained = 52.7%
## GCV = 45.195  Scale est. = 38.856    n = 54

West

summary(w_mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Duration ~ s(Start)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.2279     0.8465    48.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value
## s(Start)   1      1 0.139   0.711
## 
## R-sq.(adj) =  -0.0195   Deviance explained = 0.316%
## GCV = 34.463  Scale est. = 32.965    n = 46

West looks terrible - there’s just nothing there. Let’s start with East.

East - morning commute to work

Let’s look at the east model - generate predictions, with standard errors, for each possible time between the minimum and maximum.

pred_frame=data.frame(Start=min(east$Start):max(east$Start))

e_predict=predict(e_mod,newdata = pred_frame,se.fit=TRUE)

pred_frame$pred=e_predict$fit
pred_frame$upper=e_predict$fit + e_predict$se.fit
pred_frame$lower=e_predict$fit - e_predict$se.fit

Now prepare a graph - plot the predicted fit with the error bars. Convert the times back into POSIXct for nice plotting too.

pgoe_x = c(pred_frame$Start,rev(pred_frame$Start))
pgoe_y = c(pred_frame$upper,rev(pred_frame$lower))

pred_frame$Start_po=as.POSIXct(pred_frame$Start,origin="1970-01-01")

plot(pred_frame$pred ~ pred_frame$Start_po,type="l",ylim=c(20,60),xlab="Start",ylab="Duration")
polygon(pgoe_x,pgoe_y,col="#44444444")
points(east$Duration ~ east$Start)

center

Let’s try a regression tree.

rp_mod=rpart(Duration ~ Start,data=east)
rp_predict=predict(rp_mod,newdata=pred_frame)
pred_frame$pred_rt = rp_predict
plot(pred_frame$pred_rt ~ pred_frame$Start_po,col="red")
points(east$Duration ~ east$Start)

center

And a randomForest.

rf_mod=randomForest(Duration ~ Start,data=east)
print(rf_mod)
## 
## Call:
##  randomForest(formula = Duration ~ Start, data = east) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 53.52246
##                     % Var explained: 24.24

Plot the model predictions.

rf_predict=predict(rf_mod,newdata=pred_frame,type="response")
pred_frame$pred_rf = rf_predict
plot(pred_frame$pred_rf ~ pred_frame$Start_po,col="red")
points(east$Duration ~ east$Start)

center

It’s insanely overfitted, but ok.

West - evening ride home

Let’s look at the west plot.

center

Ariel leaves work in such a narrow window that there’s just no difference in the small changes she makes. Maybe in the future she can start experimenting with intentionally leaving later to see what that does to her commute time.

Everything’s going to be garbage, but let’s look anyway. Regression tree: center

randomForest.

print(rf_mod)
## 
## Call:
##  randomForest(formula = Duration ~ Start, data = west) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 49.80263
##                     % Var explained: -57.45

lol negative variance

center

lol nope