Stats 531 Midterm Project, An Analysis of Shanghai Car License Plate Auction Price

Introduction

According to a survey in 2016, Shanghai has a population of about 24.15 billion, and the average population density in urban areas is 3,854 people per square kilometer. Such high population density put a heavy burden on traffic. To alleviate the traffic congestion and the pollutions from cars, Shanghai uses an auction system to sell a limited number of license plates to fossil-fuel car buyers every month. Because of the high demand, nowadays the average price of a license plate is about $13000, which has been refered to as “the most expensive piece of metal in the world”. In real life, it would be useful if we are able predict the price of the plate in the future, so that we can pick the best time of application if we were to apply for a license plate. In this project, I aim to explore the time series data of the average price of each month from 2002 to 2018. The goal is to find a decent model to fit the data, and explore the properties of the data along the way.

Data Exploration

First, let’s take a first look of our data:

##         Date Issued Average Applied
## 1 2002-01-01   1400   14735    3718
## 2 2002-02-01   1800   14057    4590
## 3 2002-03-01   2000   14662    5190
## 4 2002-04-01   2300   16334    4806
## 5 2002-05-01   2350   18357    4665
## 6 2002-06-01   2800   20178    4502

##       Date                Issued         Average         Applied      
##  Min.   :2002-01-01   Min.   : 1400   Min.   :14057   Min.   :  3525  
##  1st Qu.:2005-12-24   1st Qu.: 5698   1st Qu.:34803   1st Qu.: 10247  
##  Median :2010-01-16   Median : 7507   Median :43302   Median : 18664  
##  Mean   :2010-01-03   Mean   : 7406   Mean   :52253   Mean   : 58814  
##  3rd Qu.:2014-01-08   3rd Qu.: 9000   3rd Qu.:74547   3rd Qu.: 62273  
##  Max.   :2018-01-01   Max.   :16000   Max.   :93540   Max.   :277889

It should be noted that the original data has only “year/month” information, and I added the first day of each month in order for R to read the dates.

We can see that the number of the license plate issued did not increase too much over the past 15 years, however, the number of the applications increased a lot, especially after 2014. The average price has an obvious increasing trend over the past 15 years, which is reasonable because of the increasing demand. It makes sense (either from the plot, or from common sense) that the average price is related to the number of the licensce plates issued and the number of the applications in that month.

The smoothed periodogram does not show any strong evidence of periodic pattern. There might be a small peak at around 0.2 cycles per month, which indicating a period of every 4-5 months, but it might not be real when considering the error bar.

Detrending Data with Linear Regression

In order to tackle the trend of our time series, a linear regression model is used. The following plot shows the model below in red: \[\mathrm{AveragePrice} = \beta_0 + \beta_1 * \mathrm{Issued}_n + \beta_2 * \mathrm{log(Applied_n)} + \beta_3 * \mathrm{MonthIndex}+ \epsilon_n\] where $\epsilon_n$ is an ARMA process. The $\mathrm{MonthIndex}$ term is included to account for the inflation of RMB over the past 15 years.

The linear regression captures the increasing trend of the data, while there are still many peaks and fluctuations not predicted well. Let’s look at the summary of our linear regression model:

## 
## Call:
## lm(formula = Average ~ Issued + log(Applied) + monthseq, data = price)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34593  -5535   1183   5017  29533 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.752e+04  1.056e+04  -1.659  0.09875 .  
## Issued       -8.092e-01  4.695e-01  -1.724  0.08643 .  
## log(Applied)  4.481e+03  1.238e+03   3.619  0.00038 ***
## monthseq      3.138e+02  3.351e+01   9.365  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9376 on 188 degrees of freedom
## Multiple R-squared:  0.8351, Adjusted R-squared:  0.8325 
## F-statistic: 317.4 on 3 and 188 DF,  p-value: < 2.2e-16

We can see that the coefficients of $\mathrm{log(Applied)}$ and $\mathrm{MonthIndex}$ is being evaluated as significant for the model. The coefficient of $\mathrm{Issued}$ has a relatively high p-value, but it should be noted that the model has a potential collinearity problem from the variable $\mathrm{Issued}$ and $\mathrm{monthseq}$, so the significance of the coefficients may be underestimated.

Fitting to ARMA and SARMA models

Next, let’s focusing on the residuals of the linear model before. Specifically, we fit an ARMA model to the residuals from the linear model. First, let’s look at the ACF plot and the smoothed periodogram of the residuals:

The ACF plot shows a high autocorrelation. In the periodogram, we can still see the small peak at around the frequency of 0.2, which corresponds to a 5-month cycle.

We then fit our data to an ARMA(p,q) model where $p$ and $q$ are decided by the AIC values:

AIC Values of ARMA models

	MA0	MA1	MA2	MA3	MA4
AR0	4130.38	4018.94	3981.60	3972.22	3970.60
AR1	3949.83	3951.81	3949.96	3949.83	3951.76
AR2	3951.82	3953.32	3950.71	3953.66	3953.16
AR3	3951.01	3950.30	3949.69	3953.19	3955.51
AR4	3949.67	3951.62	3951.18	3951.89	3954.48
AR5	3951.60	3952.56	3952.35	3954.62	3954.88

From the table, ARMA(1,2) has the lowest AIC value. We noted that the simple model ARMA(1,0) also has a relatively low AIC value. Here we choose ARMA(1,2) as our best model.

We can plot the ACF of the residuals of ARMA(1,2) model:

We can see that there is no significant autocorrelation between different time lags, and no seasonality is observed. The periodogram also shows no dominant cycles, indicating IID errors. Therefore, we tend not to believe that the “5-month cycle” is real.

Let’s take a look at the summary of ARMA(1,2) model:

## 
## Call:
## arima(x = price_res, order = c(1, 0, 2))
## 
## Coefficients:
##          ar1      ma1      ma2  intercept
##       0.9259  -0.1353  -0.1714  -153.4565
## s.e.  0.0338   0.0816   0.0750  2957.4617
## 
## sigma^2 estimated as 21523192:  log likelihood = -1894.06,  aic = 3798.12

We can visually look at the confidence interval of our coefficients. In the following plot, the red line shows the cutoff for the confidence interval for our AR1, MA1, and MA2 coefficients, given by Wilk’s theorem.

The confidence intervals for our coefficients shows that the coefficient is mostly meaningful and fits our data well.

Although from the ACF and periodogram of the residuals from the ARMA(1,2) model, the residuals are IID errors, these residuals do not follow a normal distribution, as shown in the qq plot below:

Specifically, there are three points with very low residuals. These data points correspond to the observations in 2004-06, 2008-01, and 2010-12. In these months, the average price of the license plates cannot be fully explained by the given information and our model. For example, in June 2004, the number of application greatly increased from 8114 to 19233, but the average price actually decreased from 34266 to 21001. More information is needed to better explain and predict these anomalies.

On the other hand, even without the previous mentioned outliers, the residuals are still skewed from a normal distribution, meaning that some other analysis might be needed to further investigate the data.

Conclusions and Future Analysis

We’ve found that a reasonable model for the average price of license plates in Shanghai from 2002 to 2018 is a linear model with ARMA errors. Specifically, the model is:

\[(1 - 0.9259\mathrm{B})(X_n - 0.8092\mathrm{Issued}_n + 4481log(\mathrm{Applied}_n) + 313.8n - 17520) = (1-0.1353\mathrm{B}-0.1714\mathrm{B}^2)\epsilon_n\] where $n$ is the $n$th month, beginning with $n=1$ in January of 2002.

It should be noted that the ARMA(1,0) model is also a good candidate for the data:

## 
## Call:
## arima(x = price_res, order = c(1, 0, 0))
## 
## Coefficients:
##         ar1  intercept
##       0.858  -116.8778
## s.e.  0.036  2318.9086
## 
## sigma^2 estimated as 22123181:  log likelihood = -1896.67,  aic = 3799.34

In this case, the specific model would be: \[(1 - 0.858\mathrm{B})(X_n - 0.8092\mathrm{Issued}_n + 4481log(\mathrm{Applied}_n) + 313.8n - 17520) = \epsilon_n\]

The ACF plot and periodogram both indicate IID errors.

We did not observe any seasonality in our data. However, some seasonality might lie in the numbers of the applications time series, as implied in the periodogram below:

For future analysis, it would be valuable to investigate the numbers of applications as time series. This would also be useful when trying to predict the average price of the future. We can use the stl function in R to decompose our time series into seasonal, trend, and irregular components, as shown below:

The remainder includes the information after the seasonal and trend are removed from the data.

References

[1] The time series data was retrieved from: https://www.kaggle.com/bogof666/shanghai-car-license-plate-auction-price
[2] https://ionides.github.io/531w18/
[3] http://worldpopulationreview.com/world-cities/shanghai-population/
[4] 6.5 STL decomposition. (n.d.). Retrieved March 10, 2016, from https://www.otexts.org/fpp/6/5