As Rafael Mangual put it in his article, “Chicago deserves its reputation as an American murder capital, or at least a significant part of it does.” Crime in Chicago is a large and persistent problem, so analyzing the trend in the number of crimes is meaningful: it shows how the crime count has evolved, helps us reason about why, and may allow the future trend to be predicted from past behavior.
In this project, I use data on reported incidents of crime that occurred in the City of Chicago from 2001 to the present (minus the most recent seven days) to model the trend in the number of crimes in Chicago, and I explore interesting properties of the data along the way.
The raw data can be downloaded from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4. The original data records the details of every reported incident, such as the case number, the date the incident occurred, the crime type, the location, and so on. Every incident has a unique ID. As of the date I downloaded the data, there were 6,546,196 incidents in total, occurring from 2001-01 to 2018-02.
In this project, I focus only on the monthly number of criminal incidents, so I preprocess the data by counting the incidents that occurred in each month. We can take a brief look at the data after recoding: N is the number of incidents and the corresponding time is the month in which those incidents occurred.
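As a rough sketch of this preprocessing step (the file name, the Date column name, and its MM/DD/YYYY format are assumptions based on the Chicago Data Portal export), the monthly counts could be produced like this:

```r
library(dplyr)

# Assumed: one row per incident, with a "Date" column such as "01/25/2018 09:30:00 PM"
raw <- read.csv("Crimes_-_2001_to_present.csv", stringsAsFactors = FALSE)

data <- raw %>%
  # build a "YYYY-MM" label from the month/day/year date string
  mutate(time = paste0(substr(Date, 7, 10), "-", substr(Date, 1, 2))) %>%
  count(time) %>%   # n = number of incidents in each month
  arrange(time)

head(data)  # columns: time ("2001-01", ...) and n (monthly incident count)
```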
Also check the summary of the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12533 24852 32533 31778 38367 46013
##
## Range of Time: 2001-01 , 2018-02
Then I make a time plot for the data.
We can see a strong seasonal pattern repeating roughly every 12.5 months (about 4 cycles per 50 months) as well as a decreasing trend. The variance looks stable apart from some minor fluctuation.
Next, take a look at the unsmoothed spectrum.
After smoothing, the spectrum is shown below.
We can see a dominant frequency. Calculation gives a dominant frequency of 0.0833 cycles per month, corresponding to a period of 12 months (1/0.0833), which matches the result read off the time plot.
## The dominant frequence for origin data is 0.0833333333333333
## The dominant period for origin data is 12 month.
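The smoothed spectrum and the dominant frequency above could be obtained along the following lines (a sketch; the smoothing spans passed to spectrum() are my own choice, not necessarily the ones used here):

```r
# Unsmoothed periodogram
spectrum(data$n, main = "Unsmoothed periodogram of monthly crime counts")

# Smoothed periodogram (the spans are illustrative)
smoothed <- spectrum(data$n, spans = c(3, 5, 3),
                     main = "Smoothed periodogram of monthly crime counts")

# Dominant frequency (cycles per month) and the corresponding period
dom_freq <- smoothed$freq[which.max(smoothed$spec)]
dom_freq      # approx. 0.0833
1 / dom_freq  # approx. 12 months
```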
Variance
Although the variance in the time plot looks stable, I still examine the log-transformed data to see whether it improves the stability of the variance.
To do so, I plot the first difference of the original data and of the log data. The graphs are shown below.
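A minimal sketch of these two plots (the layout is my own choice):

```r
# First differences of the original and the log-transformed monthly counts
par(mfrow = c(1, 2))
plot(diff(data$n), type = "l", xlab = "Month index",
     ylab = "First difference", main = "Original data")
plot(diff(log(data$n)), type = "l", xlab = "Month index",
     ylab = "First difference", main = "Log data")
par(mfrow = c(1, 1))
```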
The fluctuation patterns in the two graphs are similar, and both show a relatively stable variance over time. Next, take a look at the smoothed spectra.
From the graphs, we can see that the spectra of the original data and the log data are almost identical. Combined with the differencing plots, this suggests the log transformation does not improve much, so we will stick with the original data.
Trend
As the time plot shows, the original data has a decreasing trend. To handle this trend, I use linear regression with ARMA errors.
From the time plot of this dataset, the decreasing trend is pronounced from 2001 to 2016, while after 2016 the data flatten out. I therefore fit the model \[X_n = \beta_0 + \beta_1\,\mathrm{MonthIndex}_n + \beta_2\,\mathrm{MonthIndex}_n^{2} + \beta_3\,\mathrm{MonthIndex}_n^{3} + \eta_n\] Here, \(X_n\) is the monthly number of criminal incidents and \(\mathrm{MonthIndex}_n\) is the year-month date converted to a running month count, i.e. 2001-01 becomes 1 and 2018-02 becomes 206. \(\eta_n\) is a correlated ARMA error that will be analyzed later. A sketch of the regression fit is given below, followed by the summary of the linear model.
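A minimal sketch of this fit, assuming the monthly counts are stored in data$n (it reproduces the lm call shown in the summary):

```r
# Month index: 1 corresponds to 2001-01, 206 to 2018-02
month <- seq_along(data$n)

# Cubic trend in the month index; the ARMA structure of the errors is handled later
lm1 <- lm(data$n ~ month + I(month^2) + I(month^3))
summary(lm1)
```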
##
## Call:
## lm(formula = data$n ~ month + I(month^2) + I(month^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9497.7 -1584.7 436.6 2180.7 6271.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.948e+04 8.772e+02 45.006 < 2e-16 ***
## month 7.318e+01 3.661e+01 1.999 0.047 *
## I(month^2) -1.908e+00 4.105e-01 -4.649 6.02e-06 ***
## I(month^3) 5.398e-03 1.304e-03 4.141 5.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3091 on 202 degrees of freedom
## Multiple R-squared: 0.8346, Adjusted R-squared: 0.8322
## F-statistic: 339.9 on 3 and 202 DF, p-value: < 2.2e-16
We see that all the polynomial coefficients are evaluated as statistically significant. Next, plot the data together with the fitted model.
The model is plotted in red. From the plot, we can see that it captures the trend of the data quite well, so I will keep this polynomial.
Apart from the polynomial term in the linear regression with ARMA errors, we still need to determine the ARMA error term. First, plot the ACF (here I use a correlation plot) of the residuals of the linear model.
The ACF plot shows a clear periodic pattern remaining in the residuals of the linear model, so I will include a seasonal AR(2) term in the model to capture this seasonal behavior. Next, plot the smoothed spectrum of the residuals.
## The dominant frequence for residuals is 0.0833333333333333
## The dominant period for residuals is 12 month.
Based on the period reported above, the residuals follow the same 12-month period as the original data. Thus, it is appropriate to model the residuals with a monthly SARIMA model (period = 12).
Model for residuals
The next step is to fit a SARIMA model to the residuals of the linear model. Following the analysis above, the fitted model will have period 12 with a seasonal AR(2) term for the annual polynomial. To determine the non-seasonal (monthly) part, we compare AIC values.
|     | MA0      | MA1      | MA2     | MA3      | MA4      |
|-----|----------|----------|---------|----------|----------|
| AR0 | 3569.297 | 3533.880 | 3525.50 | 3522.375 | 3522.673 |
| AR1 | 3517.969 | 3515.704 | 3517.26 | 3519.023 | 3521.021 |
In this table, SARIMA(1,0,1)x(2,0,0) with period 12 has the lowest AIC and is also a simple model, so we will stick with it for the residuals.
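For reference, the AIC table above could be computed with a sketch like the following, holding the seasonal AR(2), period-12 part fixed as motivated above:

```r
# AIC over ARMA(p, q) choices for the non-seasonal part of the residual model
aic_table <- matrix(NA, nrow = 2, ncol = 5,
                    dimnames = list(paste0("AR", 0:1), paste0("MA", 0:4)))
for (p in 0:1) {
  for (q in 0:4) {
    aic_table[p + 1, q + 1] <- arima(lm1$residuals, order = c(p, 0, q),
                                     seasonal = list(order = c(2, 0, 0),
                                                     period = 12))$aic
  }
}
round(aic_table, 2)
```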
Fit the SARIMA(1,0,1)x(2,0,0), period = 12 model to the residuals. The summary is shown below.
##
## Call:
## arima(x = lm1$residuals, order = c(1, 0, 1), seasonal = list(order = c(2, 0,
## 0), period = 12))
##
## Coefficients:
## ar1 ma1 sar1 sar2 intercept
## 0.7410 -0.3167 0.5507 0.3668 200.8405
## s.e. 0.0993 0.1473 0.0697 0.0718 1509.3363
##
## sigma^2 estimated as 1301627: log likelihood = -1752.85, aic = 3515.7
Next, look at the diagnostics for this model fitted to the residuals. The errors of the SARIMA model are assumed to be white noise. First, check the ACF (again a correlation plot) of the SARIMA model errors.
In the plot, there are still 5 violations (out of 206 lags) of the lines corresponding to the hypothesis test of IID errors, but the seasonality has disappeared. The 5 violations appear scattered at random, so I do not weight them heavily; the errors are approximately uncorrelated.
Next, plot the smoothed spectrum of the errors.
From the plot, we see that the dominant frequency has been eliminated. There are no dominant cycles, which is further evidence that the errors are non-seasonal.
Next, test the normality of the errors with a QQ plot.
The QQ plot looks quite good, indicating that the errors are approximately normally distributed.
Last, check the mean and variance by plotting the errors themselves.
We can conclude that the mean of the errors is approximately 0 and the variance is stable. In conclusion, the errors of the SARIMA model are approximately white noise, which tallies with the model's assumption. It is therefore appropriate to fit the SARIMA(1,0,1)x(2,0,0), period = 12 model to the residuals of the linear model.
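The diagnostics above could be reproduced along these lines (a sketch; the object name arma1 and the smoothing spans are my own choices):

```r
# SARIMA(1,0,1)x(2,0,0) with period 12, fitted to the linear-model residuals
arma1 <- arima(lm1$residuals, order = c(1, 0, 1),
               seasonal = list(order = c(2, 0, 0), period = 12))

err <- residuals(arma1)
acf(err, lag.max = 50)                # correlation plot: remaining autocorrelation
spectrum(err, spans = c(3, 5, 3))     # smoothed spectrum: any dominant cycles left?
qqnorm(err); qqline(err)              # normality check
plot(err, type = "l"); abline(h = 0)  # mean approx. 0 and stable variance
```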
Final Model
Based on the above analysis, a reasonable model for the monthly number of criminal incidents is a linear model with SARIMA errors: \[X_n = \beta_0 + \beta_1\,\mathrm{MonthIndex}_n + \beta_2\,\mathrm{MonthIndex}_n^{2} + \beta_3\,\mathrm{MonthIndex}_n^{3} + \eta_n\] where \(\eta_n\) is a SARIMA error, fitted by a SARIMA(1,0,1)x(2,0,0) model with period 12.
Now fit this final model to the original data. The summary is shown below.
##
## Call:
## arima(x = data$n, order = c(1, 0, 1), seasonal = list(order = c(2, 0, 0), period = 12),
## xreg = cbind(month, sq_month, cb_month), method = "ML")
##
## Coefficients:
## ar1 ma1 sar1 sar2 intercept month sq_month
## 0.7416 -0.3197 0.5502 0.3677 40375.03 34.3648 -1.5187
## s.e. 0.1017 0.1494 0.0697 0.0717 1934.27 45.7531 0.5201
## cb_month
## 0.0043
## s.e. 0.0016
##
## sigma^2 estimated as 1296457: log likelihood = -1752.48, aic = 3520.96
Finally, the original data is fitted with this model. The fitted result is shown below: the red line is the original data and the blue line is the fitted values.
The fit is actually very good: the fitted curve captures almost all of the seasonality of the original data and reproduces the trend closely.
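For reference, a sketch of this final fit and the fitted-versus-observed plot (the regressor names match the summary above; the fitted values are computed as the observations minus the model residuals):

```r
# Polynomial regressors on the month index
month <- seq_along(data$n)
sq_month <- month^2
cb_month <- month^3

# Linear regression with SARIMA(1,0,1)x(2,0,0), period 12, errors
final_fit <- arima(data$n, order = c(1, 0, 1),
                   seasonal = list(order = c(2, 0, 0), period = 12),
                   xreg = cbind(month, sq_month, cb_month), method = "ML")

# Fitted values = observed values minus one-step residuals
fitted_vals <- data$n - residuals(final_fit)
plot(data$n, type = "l", col = "red",
     xlab = "Month index", ylab = "Monthly incidents")
lines(as.numeric(fitted_vals), col = "blue")
```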
The number of criminal incidents in Chicago has a seasonality of one cycle per year. Within a cycle, the crime count traces a downward-opening parabola: it first increases, peaks around the middle of the year, and then drops.
The data also shows a rapid decreasing trend from 2001 to 2016, after which the trend flattens, meaning the number of criminal incidents in Chicago stopped decreasing after 2016.
According to the fitted result, the specific fitted model is: \[(1 - 0.7416B)(1 - 0.5502B^{12} - 0.3677B^{24})(Y_n - 40375.03 - 34.3648\,n + 1.5187\,n^{2} - 0.0043\,n^{3}) = (1 - 0.3197B)\epsilon_n\] where \(\epsilon_n\) is white noise, \(n\) is the month index defined earlier, and \(Y_n\) is the monthly number of criminal incidents.
We found evidence of a one-year cycle in the number of criminal incidents, and it is worth investigating the reason for this downward-opening parabola pattern.
Unemployment Rate and Crime
A brief look at the relationship between crime counts and the unemployment rate is reasonable. It makes sense that changes in the unemployment rate could influence crime counts: unemployment makes people's lives harder, so some may commit crimes to make a living.
The unemployment rate data shown below is retrieved from https://fred.stlouisfed.org/series/CHIC917URN.
We can see that the pattern and trend of the unemployment rate and the crime counts are quite different. Without further exploration, it is hard to conclude that the unemployment rate and crime counts are correlated in Chicago.
Climate and Crime
Exploring the relationship between climate and crime is also reasonable. Since Chicago winters are harsh, it seems natural that the crime count would decrease when the temperature drops: in winter it is cold and snows heavily, so people tend to stay in one place rather than linger outside.
The climate data shown below is retrieved from https://www.climatestations.com/chicago/. For generality, we look at the climate data of four years: 2001, 2005, 2009, and 2017.
We can see that the within-year pattern of Chicago's climate is similar to that of the crime counts: both are downward-opening parabolas that peak around the middle of the year (June, July, August). Without further analysis, it is only a possibility that climate is related to criminal activity, with more crime when temperatures are high; this would be worth exploring further in the future.
Ionides, E. (2018, winter). Stats 531: Analysis of Time Series. Retrieved from http://ionides.github.io/531w18/
Mangual, R. (2017, summer). Sub-Chicago and America's Real Crime Rate. Retrieved from https://www.city-journal.org/html/sub-chicago-and-americas-real-crime-rate-15341.html
Chicago Data Portal (2018, March). Crimes - 2001 to present. Retrieved from https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
FRED Economic Data. Unemployment Rate in Chicago-Naperville-Elgin, IL-IN-WI (MSA). Retrieved from https://fred.stlouisfed.org/series/CHIC917URN
ClimateStations. Graphical Climatology of Chicago (1871-Present). Retrieved from https://www.climatestations.com/chicago/