Introduction

This is an analysis of armed robberies in Boston from January 1966 to October 1975. The data are freely available here. We will start with an exploration of the data, with the goal of fitting a decent model to it, while noting any interesting properties along the way.


Data Exploration

After some recoding to sort the variables in our data set, we'll briefly look at our data. Note that a year/month combination was attached to each observation; I've specified the first day of each month so that R can parse the dates.
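A sketch of this recoding step is below. The inline data frame is a stand-in for the actual CSV read; the real file name and column headers are not reproduced here, so treat those as assumptions.

```r
# Sketch of the date recoding; the inline data frame stands in for the
# CSV that was actually read in (column names are assumptions).
raw <- data.frame(year = c(1966, 1966, 1966),
                  month = c(1, 2, 3),
                  num_robberies = c(41, 39, 50))
# Attach the first day of each month so R can parse a proper Date
raw$date <- as.Date(paste(raw$year, raw$month, 1, sep = "-"), format = "%Y-%m-%d")
crime <- raw[, c("date", "num_robberies")]
head(crime)
```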

##         date num_robberies
## 1 1966-01-01            41
## 2 1966-02-01            39
## 3 1966-03-01            50
## 4 1966-04-01            40
## 5 1966-05-01            43
## 6 1966-06-01            38
##       date            num_robberies  
##  Min.   :1966-01-01   Min.   : 29.0  
##  1st Qu.:1968-06-08   1st Qu.: 85.5  
##  Median :1970-11-16   Median :166.0  
##  Mean   :1970-11-15   Mean   :196.3  
##  3rd Qu.:1973-04-23   3rd Qu.:296.8  
##  Max.   :1975-10-01   Max.   :500.0  
##  NA's   :1            NA's   :1

The variable num_robberies is self-explanatory: the number of armed robberies in Boston for that month. We see one missing value; after viewing the data, we could see it was caused by some stray text at the end of the CSV file. We will remove it in the subsequent analysis.

Looking at the plot of our data, we see a definite increasing trend, along with an increase in variance. From the periodogram, we see strong evidence for a yearly cycle and for trend. We'd like to stabilize our data by de-trending it and inducing homoskedasticity.
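A hedged sketch of how such a periodogram can be produced in R, using a simulated series with trend plus a yearly cycle in place of the actual robbery counts:

```r
# Simulated stand-in for the monthly robbery counts: linear trend plus a
# yearly cycle plus noise.
set.seed(1)
n <- 118
m <- seq_len(n)
x <- ts(30 + 1.5 * m + 20 * sin(2 * pi * m / 12) + rnorm(n, sd = 10),
        frequency = 12, start = c(1966, 1))
# Smoothed periodogram; with frequency = 12, the axis is in cycles per year,
# so a yearly cycle shows up near frequency 1.
pg <- spec.pgram(x, spans = c(3, 3), taper = 0, detrend = FALSE, plot = FALSE)
pg$freq[which.max(pg$spec)]  # dominant power sits at low frequency (trend)
```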


Detrending and an Examination of Seasonality

A potential remedy for the increasing variance would be taking the log of the number of robberies.

This plot looks much better with regard to variance. Judging from the periodogram, we have retained the trend (not surprising) but may have lost our evidence for a yearly cycle, which seems odd. We will investigate this later, after de-trending the data.

Detrending: Linear Regression with ARMA Errors

One strategy for tackling trend is linear regression with ARMA errors. We'll use the same technique to de-trend both the original series and the logged series, which sets us up for a diagnosis of our lost seasonality later.

The following are plots of both the transformed and non-transformed data, with the following model plotted in red: \[X = \mathrm{Intercept} + \beta_1 \, \mathrm{MonthIndex} + \beta_2 \, \mathrm{MonthIndex}^2\] Note that \(X\) is the logged number of robberies and the raw number of robberies for each plot, respectively.
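A minimal sketch of such a fit, using a simulated stand-in for the logged series (the real logcrimets is not reproduced here; the generating coefficients are loosely based on the fitted summary, purely for illustration):

```r
# Simulated stand-in for logcrimets
set.seed(2)
monthseq <- 1:118
logcrimets <- 3.45 + 0.035 * monthseq - 1.1e-4 * monthseq^2 +
  rnorm(118, sd = 0.2)
# Quadratic trend fit: intercept, linear term, quadratic term
fit_log <- lm(logcrimets ~ monthseq + I(monthseq^2))
coef(fit_log)
```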

And now we look at the summaries of our models; note that logcrimets is the time series of the logged data, while crimets is the time series of the non-logged data:

## 
## Call:
## lm(formula = logcrimets ~ monthseq + I(monthseq^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5197 -0.1486  0.0192  0.1547  0.4764 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.452e+00  6.154e-02  56.093  < 2e-16 ***
## monthseq       3.495e-02  2.388e-03  14.641  < 2e-16 ***
## I(monthseq^2) -1.101e-04  1.944e-05  -5.664 1.11e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2191 on 115 degrees of freedom
## Multiple R-squared:  0.9238, Adjusted R-squared:  0.9225 
## F-statistic: 697.3 on 2 and 115 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = crimets ~ monthseq + I(monthseq^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -109.09  -21.04   -3.14   17.73  113.07 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   21.202019  11.796713   1.797 0.074915 .  
## monthseq       1.816161   0.457633   3.969 0.000126 ***
## I(monthseq^2)  0.014259   0.003726   3.827 0.000211 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.99 on 115 degrees of freedom
## Multiple R-squared:  0.8943, Adjusted R-squared:  0.8924 
## F-statistic: 486.4 on 2 and 115 DF,  p-value: < 2.2e-16

We see the polynomial coefficient evaluated as significant in both models. It may be worth noting that this coefficient is quite small in both models for a time series of 118 observations; however, it adds a curve to our fitted plot that seems to match the data well, so we'll stick with the polynomial terms.

Now we'll digress momentarily to look at an explanation for the loss of the seasonal signal after the log transformation.

Seasonality Examination

A good illustration of our loss of seasonality is given by plotting the ACF of the residuals of each model fit.

Note that in the ACF of our original data on the left, there are clear cycles every 12 months. The ACF of the logged data does not show the same behavior.
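The comparison can be sketched numerically as well; the two residual series below are simulated stand-ins, with a yearly cycle injected only into the non-logged one:

```r
set.seed(3)
m <- 1:118
# Yearly cycle only in the stand-in for the non-logged residuals
res_raw <- 20 * sin(2 * pi * m / 12) + rnorm(118, sd = 5)
res_log <- rnorm(118, sd = 0.2)
a_raw <- acf(res_raw, lag.max = 36, plot = FALSE)
a_log <- acf(res_log, lag.max = 36, plot = FALSE)
# Element 13 is lag 12 (element 1 is lag 0)
c(raw_lag12 = a_raw$acf[13], log_lag12 = a_log$acf[13])
```

The strong positive autocorrelation at lag 12 in the first series is what shows up as the 12-month cycles in the ACF plot.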

One hypothesis would be that the yearly cycles in the non-logged model carry larger variance and are dominant, thus giving us evidence for seasonality, while the same yearly cycles in the logged model had their power dampened by the nature of the transformation. It would be worth plotting the residuals of the 1st and 2nd halves of our data on top of each other; this would illustrate the dominant variance of the residuals in the 2nd half of the data when it is not log transformed.

Why not look at some numerical comparisons as well:

Variance of Residuals

Month Index      Crime Model   Logged Crime Model
Months 1-60          716.914                0.064
Months 61-118       2789.185                0.030
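The split-half variances above can be computed with simple subsetting; the residual vectors here are simulated placeholders for the two model fits, with the non-logged residuals given a larger spread in the second half to mimic the pattern in the table:

```r
# Simulated residual stand-ins
set.seed(4)
res_crime <- c(rnorm(60, sd = 27), rnorm(58, sd = 53))
res_log   <- c(rnorm(60, sd = 0.25), rnorm(58, sd = 0.17))
# Variance of each half of each residual series
c(first = var(res_crime[1:60]), second = var(res_crime[61:118]))
c(first = var(res_log[1:60]),   second = var(res_log[61:118]))
```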

We see evidence, both from the variance measures and from the plot, that the non-transformed data has higher variance in the second half of the data set, while the log-transformed data actually saw a decrease in variance. Looking at the periodogram of each half of the data helps confirm our stated hypothesis of a dominant yearly cycle in the 2nd half.

This provides more evidence supporting our hypothesis of a yearly cycle in the latter months of our crime data. Given that the log transform dampened the dominant variance of these months, we can now see why the seasonality disappeared when we took the log of our data; the mystery of the vanishing yearly cycle is resolved. With that said, it is certainly worth noting that we see evidence for seasonality in roughly half of our data; just because the logged series doesn't show it does not mean we should disregard it when thinking about armed robberies in Boston during this period.


Fitting a Model

Our next step is to fit an ARMA model to the residuals of our model for the log-transformed data. Typically a first step would be to look at the ACF of the residuals and a smoothed periodogram; we have already shown both, but another look won't hurt.

Interestingly enough, it looks like we may have evidence of periodic behavior every 3 years. We can see this from the ACF, where roughly every 36 months a period appears to complete. We also see a potentially significant bump around a frequency of 0.33 in our periodogram.

With that said, we should acknowledge that we are limited by our data: with only 10 years of monthly observations, we don't have much ground for claiming periodic behavior every 3 years.

AIC Values of ARIMA models for Model Errors with No Seasonal Component

         MA0     MA1     MA2     MA3     MA4     MA5
AR0   -22.49  -54.17  -56.60  -57.88  -56.01  -55.59
AR1   -62.15  -60.15  -58.38  -56.65  -54.68  -54.30
AR2   -60.15  -58.35  -56.71  -54.76  -52.78  -52.94
AR3   -58.42  -56.77  -60.95  -60.65  -59.61  -56.93
AR4   -56.55  -54.85  -59.85  -62.65  -57.63  -59.40

AIC Values of ARIMA models for Model Errors with an AR(1) Seasonal Component, period=36

         MA0     MA1     MA2     MA3
AR0   -32.80  -60.93  -62.17  -63.76
AR1   -67.54  -65.59  -63.91  -62.42
AR2   -65.58  -63.81  -62.50  -60.60
AR3   -64.02  -62.64  -66.33  -65.44

In the non-seasonal table, we see the ARMA(1,0,0) model has a low AIC value; on top of that, it's a simple model. Among the AIC values in the seasonal table, the ARMA(1,0,0)x(1,0,0) model with period=36 does well. Still, it feels a little presumptuous to run with that model: ideally we would like more than 10 years of data before diagnosing cyclic behavior every 3 years. If we were to do further analysis with more data, this would certainly be worth investigating. For now we will stick with the ARMA(1,0,0) model for our errors.
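Tables like these can be generated by looping arima fits over candidate orders. The series below is a simulated AR(1) stand-in for logcrime_res, a smaller grid is used than in the tables above, and the tryCatch guard simply records NA for any fit that fails to converge:

```r
# Simulated AR(1) stand-in for logcrime_res
set.seed(5)
logcrime_res <- arima.sim(model = list(ar = 0.55), n = 118, sd = 0.18)
aic_table <- matrix(NA, nrow = 3, ncol = 3,
                    dimnames = list(paste0("AR", 0:2), paste0("MA", 0:2)))
for (p in 0:2) for (q in 0:2) {
  fit <- tryCatch(arima(logcrime_res, order = c(p, 0, q)),
                  error = function(e) NULL)  # record NA on failure
  aic_table[p + 1, q + 1] <- if (is.null(fit)) NA else fit$aic
}
round(aic_table, 2)
```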

Now we’ll look at diagnostics for the ARMA model fitted to our residuals:

Here we see 2 violations of the hypothesis-testing bands for IID errors in our model. It should certainly be noted that the violations come at lags of 15 and 36. The lag-36 violation is not surprising, as we saw evidence for cyclic behavior at that lag in our AIC table and previous ACF plots; perhaps with more data we would have gone with the seasonal model and this correlation would have been accounted for. Looking at the periodogram, we see no dominant cycles, which is evidence of the IID errors we want. Given our data and what we know so far, we'll stick with this model.

Let's take a look at its summary:

## 
## Call:
## arima(x = logcrime_res, order = c(1, 0, 0))
## 
## Coefficients:
##          ar1  intercept
##       0.5453     0.0025
## s.e.  0.0768     0.0363
## 
## sigma^2 estimated as 0.03276:  log likelihood = 34.07,  aic = -62.15

Our AR1 coefficient is roughly 7 times the size of its standard error; this is reassuring, as we would have reason for concern if the coefficient were within 1.96 standard errors of 0.

The likelihood profile may help to assure us even further of our selected AR1 coefficient:

In the plots above, the red line shows the cutoff for the confidence interval for our AR1 coefficient, given by Wilks' theorem. To be explicit, the interval is given by \[\{\theta_d : {\ell}({\theta^*}) - {\ell^\mathrm{profile}_d}(\theta_d) < 1.92\}.\]

Here \(\theta_d^* =\) 0.5453 is the MLE of the AR1 coefficient given by our ARIMA fit, and \({\ell}({\theta^*}) =\) 34.07 is the corresponding maximized log likelihood.
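A sketch of how the profile can be computed, using arima's fixed argument to pin the AR1 coefficient at each grid point; the series is again a simulated stand-in for logcrime_res:

```r
set.seed(6)
logcrime_res <- arima.sim(model = list(ar = 0.55), n = 118, sd = 0.18)
fit <- arima(logcrime_res, order = c(1, 0, 0))
# Profile: re-fit with the AR1 coefficient fixed at each grid value,
# leaving the intercept free (the NA in `fixed`)
ar_grid <- seq(0.2, 0.9, by = 0.01)
prof <- sapply(ar_grid, function(a)
  arima(logcrime_res, order = c(1, 0, 0), fixed = c(a, NA),
        transform.pars = FALSE)$loglik)
# Wilks cutoff: qchisq(0.95, 1) / 2 = 1.92
in_ci <- ar_grid[fit$loglik - prof < 1.92]
range(in_ci)  # approximate 95% profile CI for the AR1 coefficient
```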

We see this confidence interval is well above zero, giving us more reason to believe that we've selected an AR1 coefficient that fits our data well.


Summary / Conclusion

At this point, we’re content with our model and ought to go on and summarize what we’ve found.

Model Fit

We've found that a reasonable model for armed robberies in Boston between 1966 and 1975 is a linear model with ARMA errors after performing a log transform on the number of robberies. Specifically, the model is:

\[(1 - 0.5453\mathrm{B})(\log{X_n} - 3.4522 - 0.0349n + 0.0001n^2) = \epsilon_n\]

Here \(n\) is the \(n\)th month, beginning with \(n=1\) in January of 1966. \(X_n\) would then be the number of armed robberies recorded in Boston at month \(n\).

Seasonality

Upon taking the log transform of our data, we noticed a loss of seasonality by comparing ACF plots and periodograms. After further investigation, we saw that the seasonal behavior was displayed in the latter half of the months, which had dominating variance on the original scale. On a log scale this dominating variance was dampened and no seasonality remained visible. This is not to say the seasonality is not worth noting; perhaps the log transform was not ideal here, and another variance-stabilizing transform that preserves seasonality may be better.

Future Analysis

Business Cycles

We noticed evidence of 3-year cycles in our analysis of the residuals of our linear regression model. Lack of sufficient data for strong evidence gave us pause in declaring this behavior significant. The length of regular business cycles is a debated topic; we can see from the National Bureau of Economic Research that cycles of around 3 years have been observed at certain points in history. It would certainly be worth investigating this in the future if more data were available and aggregated.


Unemployment and Crime

A brief look into the relationship between crime numbers and unemployment seems reasonable. It is natural to suppose that a change in unemployment may affect crime rates; i.e., people with less work have less money and may thus turn to crime in order to make a living.

The unemployment data shown below is available here.

It may be useful to decompose our time series into seasonal, trend, and irregular components using R's stl function. This uses loess smoothing to identify the seasonal component, and after the seasonal component is removed, the trend is identified and removed. Specifics of this decomposition can be found here and here. Note that the grey bar on the right is of the same length in each plot; a smaller grey bar therefore indicates more variance explained by that component.
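A minimal stl example on a simulated monthly series (standing in for the robbery counts, which are not reproduced here):

```r
set.seed(7)
m <- 1:118
x <- ts(30 + 1.8 * m + 15 * sin(2 * pi * m / 12) + rnorm(118, sd = 8),
        frequency = 12, start = c(1966, 1))
# s.window = "periodic" forces an identical seasonal pattern each year
dec <- stl(x, s.window = "periodic")
# The three components sum back to the original series exactly
head(dec$time.series)
```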

Here is an example of the stl breakdown of our crime series:

The remainder portion of the plot is what is left after the seasonal and trend are removed from our data. Now we’ll compare remainders for both crime and unemployment.

Here we see some evidence of coinciding troughs and peaks, although at times robbery spikes or dips appear to precede changes in the unemployment rate, which is perplexing; a lag relationship between robberies and unemployment in that direction wouldn't make much sense. The relationship may still be worth investigating, but given how little data we have for drawing strong conclusions, we are probably best off leaving this analysis for another time.


Explanations for our Trend (Non-Technical)

The observation that crime increased during the 1960s is not new. One interesting explanation has been put forth by the psychologist and cognitive scientist Steven Pinker, involving theories about the coming of age of the baby boomers and changes in cultural norms. Pinker combines the values of the counter-culture, societal connectedness, and the de-valuing of the family unit as possible explanations for the rise in crime during that time. An excerpt from his book can be found here. While not rigorously statistical, perhaps some analysis could be devised in the future to examine his points.


References

Ionides, E. (n.d.). Stats 531 (Winter 2016) ‘Analysis of Time Series’ Retrieved March 10, 2016, from http://ionides.github.io/531w16/

6.5 STL decomposition. (n.d.). Retrieved March 10, 2016, from https://www.otexts.org/fpp/6/5

Pinker, S. (n.d.). Decivilization in the 1960s. Retrieved March 10, 2016, from http://quod.lib.umich.edu/h/humfig/11217607.0002.206/–decivilization-in-the-1960s?rgn=main;view