Forecast Calls
Sam West
May 4, 2016
My last post looked at different trends in Madison's police logs. This time I wanted to use a few different methods to forecast the number of calls Madison's Police Department receives each month, and see which method is the most accurate. There is a concept called Occam's razor, which holds that, all things being equal, the simplest model is best. We have many sophisticated models for predicting the weather today, but because weather is so complicated, it's often more accurate to forecast 14 days out simply by guessing the weather will match the historical average for that day. I was curious whether the same idea would apply to the police call data.
source("policeAssessment.R")
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forecast)
## Loading required package: timeDate
## This is forecast 7.0
library(stats)
library(ggplot2)
library(scales)
dfCalls <- read.csv("callsForService.csv")
dfCalls <- updateTimes(dfCalls)
dfCalls <- dfCalls[dfCalls$Date.Time<"2016-03-01 00:00:00",] # keep data through Feb 2016, as we're missing more recent data
dfCalls <- dfCalls[!is.na(dfCalls$Date.Time),] # remove rows without a valid time
First, as a reminder, let's plot the number of calls per month over time. We can see that there is a lot of seasonality from year to year.
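The plotting code isn't shown above, but a minimal sketch, assuming monthSummary holds the per-month counts in Date and Freq columns (as it does in the code below), would be:

```r
# Sketch of the monthly plot, assuming monthSummary has a Date column
# and a Freq (calls per month) column, as used throughout this post.
p <- ggplot(monthSummary, aes(x = Date, y = Freq, group = 1)) + geom_line()
print(p)
```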
When looking at time series data, we can often make a pretty good forecast just by looking at historic values. If you own a grocery store and want to guess how many people will come in today, yesterday's customer count is probably a pretty good estimate. We can apply that same idea to a time series by looking at how the series correlates with its own past values using the autocorrelation function (ACF). This graph shows how previous values in the series compare with the current one; previous values that are perfectly correlated will have a value of 1 in the ACF. The function starts at a lag of 0 periods, comparing the series with itself, which is why the first value in the ACF graph is 1. We can ignore this value: if we already know the current count, we don't need a forecast. What we can focus on is that the previous two periods are above the blue line, which indicates that those values are significant. In other words, the previous two months of data are correlated with the next month. Beyond two months we drop below the significance line and the lags become less significant. Once we get around 4 months back, the values start to be negatively correlated, and at a lag of 12 months they are positively correlated again, which is what we would expect having observed in the previous plot that the data seems to follow a 12-month pattern.
acf(monthSummary$Freq)
One issue with the ACF is that it compares the current value to each previous lag independently, so we don't know how much each lag actually contributes on its own. For this reason we can use the partial autocorrelation function (PACF), which at each lag removes the correlation already explained by the shorter lags. We can then see which lags contribute directly to the current period.
pacf(monthSummary$Freq)
Now let's try to create a few models. I mentioned that the previous month seems to be correlated with the current month, so let's use that as our first model: estimate this month's count using last month's value. Looking at the blue line in our graph, we can see that our model isn't too bad, but it has a decent amount of error.
tmpFreq <- monthSummary$Freq
tmpFreq <- append(x = tmpFreq, values = NA, after = 0) # prepend an NA to shift the series forward one month
tmpFreq <- tmpFreq[1:(length(tmpFreq)-1)]              # drop the extra value at the end
monthSummary$PrevMonth <- tmpFreq
tmp <- monthSummary[2:length(monthSummary$Freq),]
p <- ggplot() + geom_line(data=tmp,aes(x=Date,y=Freq,color="Actual Calls",group=1)) + geom_line(data=tmp,aes(x=Date,y=PrevMonth,color="Previous Month"),group=1)
print(p)
We can calculate the mean absolute percent error (MAPE) and mean absolute error (MAE) as well. We can see the MAPE is around 7% and the MAE is about 781 calls. This means that by using the previous month's count to estimate the current month, we're off by about 7%, or roughly 781 calls, on average.
mape(tmp$Freq,tmp$PrevMonth)
## [1] 0.07053552
mae(tmp$Freq,tmp$PrevMonth)
## [1] 781.5102
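The mape and mae helpers come from the sourced policeAssessment.R, which isn't shown in this post; they presumably compute something like:

```r
# Hypothetical definitions of the error helpers sourced from
# policeAssessment.R (the actual file isn't shown here).
mape <- function(actual, predicted) {
  mean(abs(actual - predicted) / actual)  # mean absolute percent error
}
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))           # mean absolute error
}
```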
Let's try this again, but this time estimate the current month with the value from 12 months ago, to capture the yearly seasonality.
tmpFreq <- monthSummary$Freq
tmpFreq <- append(x = tmpFreq, values = rep(NA, 12), after = 0) # prepend 12 NAs to shift the series forward one year
tmpFreq <- tmpFreq[1:(length(tmpFreq)-12)]                      # drop the extra values at the end
monthSummary$PrevYear <- tmpFreq
tmp <- monthSummary[13:length(monthSummary$Freq),]
p <- ggplot() + geom_line(data=tmp,aes(x=Date,y=Freq,color="Actual Calls",group=1)) + geom_line(data=tmp,aes(x=Date,y=PrevYear,color="Previous Year"),group=1)
print(p)
This isn't quite an apples-to-apples comparison, as I'm evaluating this model on about a year less of data, but by looking at the month one year prior we get a significant improvement: the MAPE is now about 4% and the MAE is about 451 calls per month.
mape(tmp$Freq,tmp$PrevYear)
## [1] 0.04108195
mae(tmp$Freq,tmp$PrevYear)
## [1] 450.9474
Finally, let's plug this data into a SARIMA model to see how well we can do. Since we know there is a yearly seasonal component, we'll give the model a seasonal period of 12 months. To keep the model simple, I'll include one autoregressive and one moving average term, along with one seasonal term of each.
fit <- arima(monthSummary$Freq,order=c(1,0,1), seasonal=list(order=c(1,0,1),period=12))
summary(fit)
##
## Call:
## arima(x = monthSummary$Freq, order = c(1, 0, 1), seasonal = list(order = c(1,
## 0, 1), period = 12))
##
## Coefficients:
## ar1 ma1 sar1 sma1 intercept
## 0.6598 -0.3660 0.9406 -0.2995 11501.566
## s.e. 0.1780 0.2108 0.0529 0.2527 570.797
##
## sigma^2 estimated as 280858: log likelihood = -394.23, aic = 800.45
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -12.29505 529.9606 428.7537 -0.4119077 3.896567 0.548622
## ACF1
## Training set -0.116086
tmpFit <- forecast::fitted.Arima(fit)
monthSummary$ARIMA1 <- tmpFit
p <- ggplot() + geom_line(data=monthSummary,aes(x=Date,y=Freq,color="Actual Calls",group=1)) + geom_line(data=monthSummary,aes(x=Date,y=ARIMA1,color="Best SARIMA",group=1))
print(p)
As we can see, the SARIMA model improves on the previous-month estimate by about 3 percentage points of MAPE, and edges out the previous-year estimate slightly.
mape(monthSummary$Freq,forecast::fitted.Arima(fit))
## [1] 0.03896567
mae(monthSummary$Freq,forecast::fitted.Arima(fit))
## [1] 428.7537
Lastly, for this exploratory model I wanted to test what happens if I evaluate a larger number of SARIMA models. We could incorporate a few more autoregressive parameters, so I created a function to evaluate all the combinations of ARIMA parameters we could make and report the top results. As you can see, the best fit is an almost perfect match against the data we tested on, and it's very difficult to distinguish the actual number of calls from the SARIMA model. This isn't necessarily a good thing, though, which I'll go into later: our model may be overfitting.
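The search function itself isn't shown in this post. A minimal sketch of what it might look like, assuming we enumerate small orders and rank the fits by AIC (the searchArima name and the order bounds are my assumptions), is:

```r
# Hypothetical brute-force search over small SARIMA orders, ranked by
# AIC (lower is better). Slow: with maxOrder = 2 this tries 324 fits.
searchArima <- function(y, maxOrder = 2) {
  results <- data.frame()
  for (p in 0:maxOrder) for (d in 0:1) for (q in 0:maxOrder)
    for (P in 0:maxOrder) for (D in 0:1) for (Q in 0:maxOrder) {
      # some combinations fail to converge, so skip errors
      fit <- tryCatch(
        arima(y, order = c(p, d, q),
              seasonal = list(order = c(P, D, Q), period = 12)),
        error = function(e) NULL)
      if (!is.null(fit))
        results <- rbind(results,
                         data.frame(p, d, q, P, D, Q, aic = AIC(fit)))
    }
  results[order(results$aic), ]  # best models first
}
```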
fit <- arima(monthSummary$Freq,order=c(0,1,0), seasonal=list(order=c(3,1,0),period=12))
summary(fit)
##
## Call:
## arima(x = monthSummary$Freq, order = c(0, 1, 0), seasonal = list(order = c(3,
## 1, 0), period = 12))
##
## Coefficients:
## sar1 sar2 sar3
## -0.4681 -0.4681 -1
## s.e. 0.0001 0.0001 0
##
## sigma^2 estimated as 2.742e-19: log likelihood = 671.4, aic = -1334.79
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.4407114 5.859909 1.370676 -0.004611019 0.01274322
## MASE ACF1
## Training set 0.001753882 0.1026767
tmpFit <- forecast::fitted.Arima(fit)
monthSummary$ARIMA1 <- tmpFit
p <- ggplot() + geom_line(data=monthSummary,aes(x=Date,y=Freq,color="Actual Calls",group=1)) + geom_line(data=monthSummary,aes(x=Date,y=ARIMA1,color="Best SARIMA",group=1))
print(p)
mape(monthSummary$Freq,monthSummary$ARIMA1)
## [1] 0.0001274322
mae(monthSummary$Freq,monthSummary$ARIMA1)
## [1] 1.370676
This next model uses fewer parameters and places less emphasis on the seasonality, which I believe will help with the overfitting.
fit <- arima(monthSummary$Freq,order=c(1,0,0), seasonal=list(order=c(2,1,0),period=12))
summary(fit)
##
## Call:
## arima(x = monthSummary$Freq, order = c(1, 0, 0), seasonal = list(order = c(2,
## 1, 0), period = 12))
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 sar1 sar2
## 0.0161 -0.319 -0.0412
## s.e. NaN NaN NaN
##
## sigma^2 estimated as 294992: log likelihood = -293.69, aic = 595.38
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -49.97685 473.5263 343.6308 -0.4677906 3.130685 0.439701
## ACF1
## Training set 0.01084465
monthSummary$ARIMA3 <- forecast::fitted.Arima(fit)
p <- ggplot() + geom_line(data=monthSummary,aes(x=Date,y=Freq,color="Actual Calls",group=1)) + geom_line(data=monthSummary,aes(x=Date,y=ARIMA3,color="More Simple SARIMA"),group=1)
print(p)
mape(monthSummary$Freq,monthSummary$ARIMA3)
## [1] 0.03130685
mae(monthSummary$Freq,monthSummary$ARIMA3)
## [1] 343.6308
Finally, I mentioned that I hadn't been making an apples-to-apples comparison earlier. Let's see which model would have performed best had I run each model against the 2015 calendar year, as if we were planning a year ahead. In this case we'll use data from 2012 through 2014 as our training data, and 2015 will be our test data.
test <- monthSummary[monthSummary$Date>"2014-12-31"&monthSummary$Date<"2016-01-01",]
train <- monthSummary[monthSummary$Date<"2015-01-01",]
fit <- arima(train$Freq,order=c(1,0,1), seasonal=list(order=c(1,0,1),period=12))
tmpPred <- predict(fit,n.ahead = 12)
test$ARIMA_1_0_1_1_0_1 <- ts(tmpPred$pred,start=1,end=12)
fit <- arima(train$Freq,order=c(1,0,0), seasonal=list(order=c(1,0,0),period=12))
tmpPred <- predict(fit,n.ahead = 12)
test$ARIMA_1_0_0_1_0_0 <- ts(tmpPred$pred,start=1,end=12)
fit <- arima(train$Freq,order=c(0,1,0), seasonal=list(order=c(1,1,0),period=12))
tmpPred <- predict(fit,n.ahead = 12)
test$ARIMA_0_1_0_1_1_0 <- ts(tmpPred$pred,start=1,end=12)
fit <- arima(train$Freq,order=c(0,1,1), seasonal=list(order=c(0,1,1),period=12))
tmpPred <- predict(fit,n.ahead = 12)
test$ARIMA_0_1_1_0_1_1 <- ts(tmpPred$pred,start=1,end=12)
fit <- arima(train$Freq,order=c(2,1,0), seasonal=list(order=c(1,1,2),period=12))
tmpPred <- predict(fit,n.ahead = 12)
test$ARIMA_2_1_0_1_1_2 <- ts(tmpPred$pred,start=1,end=12)
If we break down all of the models, we can see that just looking at the same month from the previous year actually gives one of our best results. The more parameters we added to the SARIMA models, the more we overfit to our training data, and when projecting forward onto the test data, the models with more parameters didn't describe it as well. Interestingly enough, the best model was one of the simplest ones, which used the previous month's value and the value from the same month one year before.
p <- ggplot() + geom_line(data=test,aes(x=Date,y=Freq,color="Actual",group=1)) + geom_line(data=test,aes(x=Date,y=PrevYear,color="Previous Year"),group=1) + geom_line(data=test,aes(x=Date,y=ARIMA_1_0_1_1_0_1,color="SARIMA 1,0,1,1,0,1"),group=1) + geom_line(data=test,aes(x=Date,y=ARIMA_1_0_0_1_0_0,color="SARIMA 1,0,0,1,0,0"),group=1) + geom_line(data=test,aes(x=Date,y=ARIMA_0_1_0_1_1_0,color="SARIMA 0,1,0,1,1,0"),group=1) + geom_line(data=test,aes(x=Date,y=ARIMA_2_1_0_1_1_2,color="SARIMA 2,1,0,1,1,2"),group=1) + geom_line(data=test,aes(x=Date,y=ARIMA_0_1_1_0_1_1,color="SARIMA 0,1,1,0,1,1"),group=1)
#print(p)
mape(test$Freq,test$PrevYear)
## [1] 0.03474613
mape(test$Freq,test$ARIMA_1_0_1_1_0_1)
## [1] 0.03536673
mape(test$Freq,test$ARIMA_1_0_0_1_0_0)
## [1] 0.03117555
mape(test$Freq,test$ARIMA_0_1_0_1_1_0)
## [1] 0.0473024
mape(test$Freq,test$ARIMA_2_1_0_1_1_2)
## [1] 0.04663715
mape(test$Freq,test$ARIMA_0_1_1_0_1_1)
## [1] 0.0333368
We can break down the best-performing models to get a better picture of the final results. Although the SARIMA model does outperform the previous-year estimate slightly, the difference is hardly noticeable, and it may be better just to use the month from the previous year, as it's much easier to interpret.
p <- ggplot() + geom_line(data=test,aes(x=Date,y=Freq,color="Actual",group=1)) + geom_line(data=test,aes(x=Date,y=PrevYear,color="Previous Year",group=1)) + geom_line(data=test,aes(x=Date,y=ARIMA_1_0_0_1_0_0,color="SARIMA 1,0,0,1,0,0",group=1))
print(p)
Overall, I got some experience working with ARIMA models and came away with the conclusion that, for this particular case, it's tough to beat the value provided by the month from the previous year.