1. Introduction to the Project


Retailing is the business in which an organization sells its products and services directly to an end consumer for personal use. By definition, whenever an organization, whether a manufacturer or a wholesaler, sells directly to the end consumer, it is operating in the retail space.
For this midterm project, we are going to look for basic patterns in total US retail revenue. We will then use the patterns we find to fit other data; specifically, we choose the revenue of furniture and home furnishings stores. The data come from https://www.census.gov, from the Manufacturing & Trade Inventories & Sales release, in the form of monthly retailer sales.
We deal with both adjusted and unadjusted data in this project. The adjusted data are produced by the X-13ARIMA-SEATS Seasonal Adjustment Program.

Generally speaking, there are several goals in this project:

  1. Analyze the total retail revenue data in the frequency domain and by filtering.
  2. Fit the data with different models, including SARIMA and linear regression with SARIMA errors.
  3. Forecast future retail revenue.
  4. Use the patterns found in the total retail revenue data to fit other, more specific series.

2. Introduction to the Original Data


First of all, we do some pre-processing before analyzing the data.

setwd("D:\\531\\project1\\pro1")

#ReadData

Adj_Data_Unmodified = read.csv("Adj.csv", header = T)
Org_Data_Unmodified = read.csv("Org.csv", header = T)

library(zoo)
library(astsa)
library(knitr)
library(forecast)
Sys.setlocale("LC_TIME", "C") 
#Parse the Period column (format "%b-%y", i.e. abbreviated month and two-digit year) into Date objects
Org_Data_Unmodified$Period = as.Date(as.yearmon(Org_Data_Unmodified$Period, "%b-%y"))
Adj_Data_Unmodified$Period = as.Date(as.yearmon(Adj_Data_Unmodified$Period, "%b-%y"))

#Add a running index column
Org_Data_Unmodified = cbind(c(1:dim(Org_Data_Unmodified)[1]),Org_Data_Unmodified)
colnames(Org_Data_Unmodified) = c("Number","Period","Value")
Adj_Data_Unmodified = cbind(c(1:dim(Adj_Data_Unmodified)[1]),Adj_Data_Unmodified)
colnames(Adj_Data_Unmodified) = c("Number","Period","Value")

#Remove thousands separators and convert Value to numeric
Org_Data_Unmodified$Value = as.numeric(gsub(",","",Org_Data_Unmodified$Value))
Adj_Data_Unmodified$Value = as.numeric(gsub(",","",Adj_Data_Unmodified$Value))

#Working copies
Org_Data = Org_Data_Unmodified
Adj_Data = Adj_Data_Unmodified

Now, in order to get an overall picture of our data, we plot the adjusted and unadjusted series in the same figure and look for basic patterns.

plot(Org_Data$Period,Org_Data$Value,type = "l", col = "red", main = "Time Plot of Total Revenue")
lines(Org_Data$Period,Adj_Data$Value,type = "l")
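Since both series are drawn in the same figure, a legend makes them easier to tell apart. This is a small optional addition, matching the colours used in the plot calls above:

#Add a legend (optional): red = unadjusted, black = adjusted
legend("topleft", legend = c("Unadjusted", "Adjusted"), col = c("red", "black"), lty = 1)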

acf(Org_Data$Value, lag.max = 30, main = "ACF of Unadjusted Total Revenue")

pacf(Org_Data$Value, main = "PACF of Unadjusted Total Revenue")

The time plots show a significant long-term linear trend in both the adjusted and unadjusted data, which indicates that the series are non-stationary. In addition, there is a clear seasonal pattern in the unadjusted data, while this pattern has already been removed from the adjusted data. The autocorrelation and partial autocorrelation functions show similar features.

Box.test(Org_Data$Value,type="Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  Org_Data$Value
## X-squared = 238.51, df = 1, p-value < 2.2e-16

We also run a basic Ljung-Box test, which indicates that our data are not white noise.
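For reference, the Ljung-Box statistic with \(h\) lags is

\[ Q = N(N+2)\sum_{k=1}^{h}\frac{\hat{\rho}_k^2}{N-k}, \]

where \(N\) is the sample size and \(\hat{\rho}_k\) the sample autocorrelation at lag \(k\); under the white-noise null hypothesis, \(Q\) is approximately \(\chi^2_h\) distributed. Here \(h = 1\), the Box.test default, which matches the df = 1 in the output above.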
Now, we use the HP filter and a one-step first difference to extract the patterns in the unadjusted data.
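For reference, the HP filter with smoothing parameter \(\lambda\) (the freq argument of hpfilter below) defines the trend \(s_{1:N}\) of the data \(y_{1:N}\) as the minimizer of

\[ \sum_{n=1}^{N}(y_n - s_n)^2 + \lambda\sum_{n=2}^{N-1}\big[(s_{n+1}-s_n)-(s_n-s_{n-1})\big]^2, \]

and the cycle component plotted below is \(y_n - s_n\).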

library(mFilter)
Org_Data_HP <- hpfilter(Org_Data$Value, freq=10,type="lambda",drift=F)$cycle
plot(Org_Data$Number,Org_Data_HP,type="l",xlab="Index",ylab="", main = "HP-filter of Unadjusted Data")

#1-step diff
plot(diff(Org_Data$Value),type = "l", main = "1-step Difference of Unadjusted Data")

The result shows that both the one-step first difference and the HP filter capture the seasonal pattern well. Moreover, the plots above show a significant seasonal pattern, which suggests that a SARIMA model might be a good choice.

3. Analysis in Frequency Domain

We now analyze our data in the frequency domain. There are several things to do here:

  - First, we present the spectrum of our data, which shows the periodic patterns.
  - Second, we apply local regression to separate the variation into different frequency ranges.
  - Finally, we try to explain what we find in the frequency domain.

3.1 Spectrum Analysis


We now take a look at our data in the frequency domain. The smoothing parameter (span) used below was chosen based on the tests we carried out.

##Spectrum
spectrum(Org_Data_Unmodified$Value,span = c(3,3))
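As a quick numeric complement to the plot, we can locate the frequency at which the smoothed periodogram is largest and convert it into a period. This is a small sketch; spec_est is a name introduced here, and the frequency axis is in cycles per month because the input is a plain monthly vector.

#Period (in months) corresponding to the largest peak of the smoothed periodogram
spec_est = spectrum(Org_Data_Unmodified$Value, spans = c(3,3), plot = FALSE)
1/spec_est$freq[which.max(spec_est$spec)]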


The spectrum reveals strong seasonal peaks at periods of 12, 6, 4 and 3 months. We can use a difference operator to eliminate this seasonal pattern, and based on our tests we prefer the seasonal difference \((1-B^{12})\). There are several reasons: 1. our data are monthly, so a lag-12 seasonal difference is a natural choice; 2. as the plot shown below demonstrates:

  #Seasonal (lag-12) difference; the first 12 entries are placeholders and are dropped on the next line
  Org_Data["12V"]=append(c(1:12),diff(Org_Data_Unmodified$Value,12))
  Org_Data = tail(Org_Data,dim(Org_Data)[1]-12)
  plot(Org_Data$Number,Org_Data$`12V`,type = "l")


The seasonal difference \((1-B^{12})\) seems to remove the seasonal pattern.
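A quick check (a sketch, reusing the objects defined above) is to look at the sample ACF of the seasonally differenced series; if the operator has done its job, the strong spikes at lags 12, 24, ... should be greatly reduced:

  #ACF of the seasonally differenced series
  acf(diff(Org_Data_Unmodified$Value, lag = 12), lag.max = 36, main = "ACF after Seasonal Differencing")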

3.2 Local Regression

Now, we try to extract the long-term and short-term trends from the data. The method we use here is local regression (loess):

Org_loess <- loess(Value~Number,span=0.5,data= Org_Data)
plot(Org_Data$Number,Org_Data$Value,type="l",col="red")
lines(Org_loess$x,Org_loess$fitted,type="l")

We use different span parameters to extract different frequency ranges. After some testing, we choose span = 0.5 to extract the low-frequency component and span = 0.1 for the high-frequency component, and we treat the remainder as the mid-range variation.

  #Extracting Cycle
  Org_low <- ts(loess(Value~Number,span=0.5,data= Org_Data)$fitted)
  Org_hi <- ts(Org_Data$Value - loess(Value~Number,span=0.1,data= Org_Data)$fitted)
  Org_cycles <- Org_Data$Value - Org_low - Org_hi
  plot(ts.union(Org_Data$Value, Org_low,Org_hi,Org_cycles),
     main="Decomposition of retail revenue as trend + Seasonal + Fluctuation")

The plot above shows an increasing long-term linear trend over the sample period, with a significant fluctuation around 2008, which is reasonable given the financial crisis. In the high-frequency component, we can see that the seasonal pattern has been extracted successfully.

  spec_cycle <- spectrum(ts.union(Org_Data$Value,Org_hi),spans=c(3,3),plot=FALSE)
  freq_response_cycle <- spec_cycle$spec[,2]/spec_cycle$spec[,1]
  plot(spec_cycle$freq,freq_response_cycle,
       type="l",log="y",
       ylab="frequency ratio", xlab="frequency", ylim=c(5e-6,1.1),
       main="frequency response (dashed line at 1.0)")
  abline(h=1,lty="dashed",col="red")  

The frequency response function also shows that the seasonal pattern plays an important role in the data.

4. Fitting a SARIMA Model

In this part, we fit a SARIMA model to our data. A SARIMA model has seven parameters to determine, namely (p,d,q), (P,D,Q) and the seasonal period S, and we prefer not to choose all of them at once. Thus, we fix S and (d,D) first, and then choose (p,q) and (P,Q) separately, using AIC tables to decide which model to use. Moreover, since we prefer models with fewer parameters, we also consult BIC and other criteria.
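In the notation we use, a SARIMA \((p,d,q)\times(P,D,Q)_{12}\) model for a series \(Y_{1:N}\) is

\[ \phi(B)\,\Phi(B^{12})\big[(1-B)^d(1-B^{12})^D Y_n - \mu\big] = \psi(B)\,\Psi(B^{12})\,\epsilon_n, \]

where \(\{\epsilon_n\}\) is a white noise process, \(\mu\) is the mean of the differenced process, and \(\phi, \psi, \Phi, \Psi\) are polynomials of orders \(p, q, P, Q\) respectively.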

4.1 Fit the AR and MA Part in SARIMA

From the previous analysis, S = 12 is a good choice for our model. We first use the AIC criterion to choose the AR and MA orders.
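Recall that for a model with \(k\) estimated parameters and maximized log likelihood \(\ell(\hat{\theta})\),

\[ AIC = -2\,\ell(\hat{\theta}) + 2k, \]

so lower values are preferred, and the \(2k\) term penalizes larger models.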

Table_For_ARMA_AIC <- function(data,P,Q){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1,q+1] <- Arima(data,order=c(p,0,q),seasonal=list(order=c(0,1,0),period=12))$aic
    }
  }
  dimnames(table) <- list(paste("<b> AR",0:P, "</b>", sep=""),paste("MA",0:Q,sep=""))
  table
}
Retail_aic_table <- Table_For_ARMA_AIC(Org_Data$Value,6,6)
kable(Retail_aic_table,digits=2)
Retail_aic_table <- Table_For_ARMA_AIC(Org_Data$Value,6,6)
kable(Retail_aic_table,digits=2)
         MA0      MA1      MA2      MA3      MA4      MA5      MA6
AR0  5873.17  5666.62  5630.30  5509.94  5508.81  5455.43  5441.35
AR1  5453.81  5417.23  5414.68  5406.36  5403.75  5389.12  5390.99
AR2  5422.34  5417.50  5402.86  5390.97  5392.92  5390.89  5389.52
AR3  5408.09  5399.02  5365.41  5358.07  5359.79  5390.40  5358.45
AR4  5395.73  5397.23  5359.51  5360.24  5359.67  5388.39  5389.54
AR5  5397.23  5399.19  5358.51  5360.46  5348.83  5348.15  5348.54
AR6  5398.86  5401.07  5360.39  5359.30  5349.72  5341.86  5340.87

As shown above, the AIC table suggests that the larger the model, the better the AIC, which does not give a clear stopping point. We therefore turn to the BIC criterion instead.
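BIC replaces the AIC penalty \(2k\) with \(k\log N\), where \(N\) is the sample size,

\[ BIC = -2\,\ell(\hat{\theta}) + k\log N, \]

so it penalizes additional parameters more heavily and tends to select smaller models.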

Table_For_ARMA_BIC <- function(data,P,Q){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1,q+1] <- Arima(data,order=c(p,0,q),seasonal=list(order=c(0,1,0),period=12))$bic
    }
  }
  dimnames(table) <- list(paste("<b> AR",0:P, "</b>", sep=""),paste("MA",0:Q,sep=""))
  table
}
Retail_bic_table <- Table_For_ARMA_BIC(Org_Data$Value,6,6)
kable(Retail_bic_table,digits=2)
Retail_bic_table <- Table_For_ARMA_BIC(Org_Data$Value,6,6)
kable(Retail_bic_table,digits=2)
         MA0      MA1      MA2      MA3      MA4      MA5      MA6
AR0  5876.75  5673.77  5641.02  5524.25  5526.69  5476.89  5466.38
AR1  5460.97  5427.96  5428.98  5424.24  5425.21  5414.15  5419.60
AR2  5433.06  5431.80  5420.74  5412.43  5417.95  5419.50  5421.70
AR3  5422.39  5416.90  5386.86  5383.10  5388.39  5422.58  5394.21
AR4  5413.61  5418.68  5384.54  5388.85  5391.85  5424.15  5428.88
AR5  5418.69  5424.23  5387.11  5392.65  5384.59  5387.48  5391.45
AR6  5423.89  5429.68  5392.57  5395.06  5389.05  5384.77  5387.36

The BIC table gives different results. Based on it, we prefer (3,0,3) for the non-seasonal ARMA orders, which suggests something like a quarterly pattern.

4.2 Fit the SAR and SMA Part in SARIMA

Now we turn to the SAR and SMA orders. Based on our tests, larger seasonal orders usually return a non-stationary AR part, and the estimation becomes very time consuming if the SAR and SMA orders are set too large. Thus, in this case, we restrict the search to orders up to (2,2), as below:

Table_For_Season_AIC <- function(data,P,Q){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1,q+1] <- Arima(data,order=c(3,0,3),seasonal=list(order=c(p,1,q),period=12))$aic
    }
  }
  dimnames(table) <- list(paste("<b> AR",0:P, "</b>", sep=""),paste("MA",0:Q,sep=""))
  table
}
Retail_aic_table <- Table_For_Season_AIC(Org_Data$Value,2,2)
kable(Retail_aic_table,digits=2)
         MA0      MA1      MA2
AR0  5358.07  5317.26  5314.86
AR1  5335.81  5355.10  5353.65
AR2  5322.90  5315.96  5318.78


The results show that SARIMA(3,0,3)(0,1,2)[12] is a good choice for us. In this case, the BIC criterion does not add much to the decision.

4.3 Residual Analysis

Based on the analysis above, we get our SARIMA model for the total revenue data:

Org_Result_1 = Arima(Org_Data$Value,order=c(3,0,3),seasonal=list(order=c(0,1,2),period=12))


We now turn to the analysis of the residuals.

plot(Org_Result_1$residuals)

acf(Org_Result_1$residuals)


The residuals look like white noise, although a little structure remains. We therefore show a Q-Q plot against the normal distribution below:

qqnorm(Org_Result_1$residuals)
qqline(Org_Result_1$residuals)


There seems to be a slightly long tail, but overall we can treat the residuals as approximately Gaussian white noise.
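To supplement the visual diagnostics, we could also run a Ljung-Box test on the residuals (with fitdf set to 8, the number of estimated ARMA coefficients) and a Shapiro-Wilk normality test. This is a sketch of optional checks:

#Optional residual checks
Box.test(Org_Result_1$residuals, lag = 24, type = "Ljung-Box", fitdf = 8)
shapiro.test(as.numeric(Org_Result_1$residuals))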
At the same time, we try to fit a linear regression with SARIMA errors:

Org_Result_test = Arima(Org_Data$Value,order=c(3,0,3),xreg = Org_Data$Number,seasonal=list(order=c(0,1,2),period=12))

acf(Org_Result_test$residuals)


There is no consequential improvement from adding the linear regression term, so we prefer the original model in order to keep it clear and simple.
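One simple way to quantify this is to compare the AIC values of the two fits (a small sketch using the objects already defined):

#AIC without and with the linear regression term
c(Org_Result_1$aic, Org_Result_test$aic)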

5. Fitted Results
Now, we show the detailed results of our model.

Org_Result_1
## Series: Org_Data$Value 
## ARIMA(3,0,3)(0,1,2)[12]                    
## 
## Coefficients:
##           ar1    ar2     ar3     ma1     ma2      ma3     sma1     sma2
##       -0.1642  0.147  0.9875  0.8156  0.6201  -0.2905  -0.4675  -0.1122
## s.e.   0.0087  0.010  0.0105  0.0582  0.0684   0.0645   0.0670   0.0599
## 
## sigma^2 estimated as 29211317:  log likelihood=-2648.43
## AIC=5314.86   AICc=5315.56   BIC=5347.04


Comparing each estimated coefficient with its standard error suggests that most of the parameters are significantly different from zero, which indicates that the fitted model terms are all contributing.
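A rough numeric version of this check (a sketch) is the ratio of each estimate to its standard error, which should be large in absolute value for significant coefficients:

#Approximate z-ratios of the estimated coefficients
Org_Result_1$coef/sqrt(diag(Org_Result_1$var.coef))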
Meanwhile, we also plot the original data together with the fitted values:

plot(Org_Result_1$x,type="l", main = "Original Data and Fitted Result")
lines(fitted(Org_Result_1),col="red")


Comparing the two lines, the plot above shows that the model fits the data very well. Since we now have a reasonable SARIMA model, we can also forecast future retail revenue; such predictions can give us some understanding of what may happen next. First, however, we need to verify the predictive performance of the model. We therefore remove the last 12 months from the data, refit the model, and compare the predictions with the actual values.

Test_Org_Data = head(Org_Data,n=dim(Org_Data)[1]-12)
Org_Result_2 = Arima(Test_Org_Data$Value,order=c(3,0,3),seasonal=list(order=c(0,1,2),period=12))
test = forecast.Arima(Org_Result_2)
plot(test, main = "Testing about the prediction of SARIMA")
lines(Org_Data$Value, type= "l", col = "red")
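To complement the visual check with a number, we can compute the root mean squared error of the 12 held-out months against the corresponding point forecasts (a sketch; held_out is a name introduced here):

#Out-of-sample RMSE over the held-out year
held_out = tail(Org_Data$Value, 12)
sqrt(mean((held_out - test$mean[1:12])^2))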

In the plot above, the red line is the original data, while the blue line and shaded bands are the forecasts with their prediction intervals; the forecasts track the held-out year well, so we can apply the model to predict future retail revenue. The forecasts for the next two years, based on the model fitted to the full data, are shown below:

forecast.Arima(Org_Result_1)
##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 277       360706.2 353779.7 367632.6 350113.0 371299.3
## 278       354477.8 346211.4 362744.3 341835.4 367120.3
## 279       406074.4 396627.6 415521.2 391626.7 420522.1
## 280       394914.2 384344.6 405483.8 378749.4 411079.1
## 281       416687.0 405257.4 428116.6 399207.0 434167.1
## 282       407659.3 395376.5 419942.2 388874.3 426444.4
## 283       408102.6 394992.8 421212.4 388052.8 428152.3
## 284       413974.6 400207.2 427742.0 392919.2 435030.1
## 285       391235.2 376774.0 405696.3 369118.7 413351.6
## 286       397118.7 381998.1 412239.3 373993.8 420243.6
## 287       405417.7 389757.1 421078.3 381466.9 429368.5
## 288       467935.3 451681.8 484188.8 443077.7 492792.9
## 289       366595.1 348508.0 384682.2 338933.3 394256.9
## 290       366322.1 347145.7 385498.4 336994.4 395649.8
## 291       414803.3 394489.1 435117.6 383735.4 445871.3
## 292       401014.7 379655.0 422374.4 368347.8 433681.5
## 293       427832.9 405593.5 450072.4 393820.6 461845.2
## 294       414593.6 391403.1 437784.1 379126.8 450060.4
## 295       413344.9 389300.9 437388.9 376572.8 450117.0
## 296       424901.1 400110.1 449692.0 386986.6 462815.6
## 297       397427.6 371815.7 423039.6 358257.5 436597.8
## 298       403529.8 377201.0 429858.6 363263.4 443796.2
## 299       416042.2 389058.7 443025.7 374774.4 457309.9
## 300       472545.2 444840.7 500249.7 430174.8 514915.6

6. Seasonally Adjusted Data Analysis


From the previous analysis, including the frequency response function and the SARIMA model, we can see that the seasonal pattern is so strong that it dominates the analysis. From now on, we therefore use the seasonally adjusted data. From census.gov, we learn that the data are adjusted by the X-13ARIMA-SEATS Seasonal Adjustment Program, which decomposes the series into components and removes the seasonal one, revealing the underlying trend of the time series. The seasonally adjusted data are exactly what we are looking for, since we want to analyze the underlying behaviour of the series.

We first plot the seasonally adjusted data.

plot(Adj_Data$Value,type= "l", main = "Seasonally Adjusted Revenue")


The plot shows a significant linear trend, so for an ARIMA model we would set d = 1. Now we look at the spectra of the adjusted and unadjusted data:

Org_Data = Org_Data_Unmodified
Adj_Data = Adj_Data_Unmodified
spectrum(ts.union(ts(Org_Data$Value),ts(Adj_Data$Value)),spans=c(3,5,3))

From the spectra, we can see that most of the seasonal cycles have been eliminated by the X-13 adjustment. In addition, we plot the frequency response function between the adjusted and unadjusted data.

  spec_cycle <- spectrum(ts.union(Org_Data$Value,ts(Adj_Data$Value)),spans=c(3,3),plot=FALSE)
  freq_response_cycle <- spec_cycle$spec[,2]/spec_cycle$spec[,1]
  plot(spec_cycle$freq,freq_response_cycle,
       type="l",log="y",
       ylab="frequency ratio", xlab="frequency", ylim=c(5e-6,1.1),
       main="frequency response (dashed line at 1.0)")
  abline(h=1,lty="dashed",col="red")  

Based on these plots in the frequency domain, we conclude that most of the seasonal cycles have been eliminated, so an ARIMA model is more appropriate than SARIMA for the seasonally adjusted data.
The AIC table is provided below:

Table_For_ARMA_AIC <- function(data,P,Q){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1,q+1] <- Arima(data,order=c(p,1,q),method = "ML")$aic
    }
  }
  dimnames(table) <- list(paste("<b> AR",0:P, "</b>", sep=""),paste("MA",0:Q,sep=""))
  table
}
Retail_aic_table <- Table_For_ARMA_AIC(Adj_Data$Value,6,6)
kable(Retail_aic_table,digits=2)
Retail_aic_table <- Table_For_ARMA_AIC(Adj_Data$Value,6,6)
kable(Retail_aic_table,digits=2)
         MA0      MA1      MA2      MA3      MA4      MA5      MA6
AR0  5429.37  5431.09  5428.39  5429.33  5427.85  5429.74  5426.30
AR1  5431.01  5416.37  5416.97  5418.62  5420.58  5419.14  5428.29
AR2  5427.08  5416.84  5418.13  5420.13  5418.97  5420.74  5425.46
AR3  5427.42  5418.54  5420.61  5414.52  5419.27  5418.38  5420.48
AR4  5425.46  5420.47  5418.52  5419.08  5420.99  5422.10  5424.50
AR5  5427.00  5419.05  5424.47  5417.80  5418.57  5421.13  5419.97
AR6  5424.97  5423.81  5423.46  5423.43  5424.81  5419.65  5419.36

The table indicates that ARIMA(1,1,1) is a good choice: its AIC is close to the minimum while it uses only two ARMA coefficients.
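Written in the same notation as the SARIMA equation above, this model is

\[ (1-\phi_1 B)(1-B)Y_n = (1+\psi_1 B)\epsilon_n, \]

where \(\{\epsilon_n\}\) is white noise. We now fit it: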

Adj_Result_test = Arima(Adj_Data$Value,order=c(1,1,1),method = "ML")
plot(Adj_Result_test$residuals, main = "Residuals of Adjusted Data")

acf(Adj_Result_test$residuals)

The result shows that the model fits the data well, and the residuals appear to be white noise.

plot(Adj_Result_test$x,type="l")
lines(fitted(Adj_Result_test),col="red")

The fitted values are plotted together with the original data, which indicates that the model does well in this case. We can also provide forecasts for the seasonally adjusted data.

forecast.Arima(Adj_Result_test)
##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 289       395927.0 392084.0 399769.9 390049.7 401804.2
## 290       396556.1 391038.5 402073.7 388117.7 404994.5
## 291       397179.4 390320.5 404038.3 386689.6 407669.2
## 292       397796.9 389760.2 405833.6 385505.8 410088.0
## 293       398408.7 389293.2 407524.3 384467.7 412349.8
## 294       399014.9 388887.0 409142.8 383525.6 414504.2
## 295       399615.5 388522.8 410708.1 382650.7 416580.2
## 296       400210.5 388188.5 412232.4 381824.4 418596.5
## 297       400800.0 387875.9 413724.0 381034.4 420565.6
## 298       401384.0 387579.3 415188.7 380271.6 422496.5

7. Other Data for Analysis


In this part, we turn to other data for our analysis. Since we began by analyzing the total retail revenue, it is natural to use the patterns found there to help fit models for the revenue of a specific category.
For instance, we use the monthly sales data of Furniture and Home Furnishings Stores, which come from the same website:

Home_Data = read.csv("Home.csv")
Home_Data$Value = as.numeric(gsub(",","",Home_Data$Value))

For convenience’s sake, we fit the model with ARIMA(3,1,3), since there is a significant linear trend and ARMA(3,3) is large enough for us, meanwhile we also have done the AIC and BIC test, concluded that ARIMA(3,1,3) might be a good choice.
The different between those models below is that, in the second model, we use the HP-filtering result of total retail revenue to regress the error part of ARIMA.
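Concretely, writing \(Y_n\) for the furniture and home furnishings sales and \(z_n\) for the HP cycle of total retail revenue computed earlier, the second model is the linear regression with ARIMA errors

\[ Y_n = \beta z_n + \eta_n, \]

where \(\eta_n\) follows an ARIMA(3,1,3) process; the first model simply drops the regression term.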

Home_Result_1 <-  Arima(Home_Data$Value,order=c(3,1,3))
Home_Result_2 <-  Arima(Home_Data$Value,xreg=Org_Data_HP,order=c(3,1,3))

The plots are shown below; the black line is the original data:

plot(Home_Result_1$x,type="l", main = "ARIMA without Regression")
lines(fitted(Home_Result_1),col = "red")

plot(Home_Result_1$x,type="l", main = "ARIMA with Regression" )
lines(fitted(Home_Result_2),col = "blue")


The model with the regression term clearly tracks the data better than the plain ARIMA model. Hence we can conclude that the pattern of the total retail revenue does contribute to fitting the Furniture and Home Furnishings Stores revenue.
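Beyond the visual comparison, the AIC values of the two fits can also be compared (a small sketch using the fitted objects above):

#AIC without and with the total-revenue regressor
c(Home_Result_1$aic, Home_Result_2$aic)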

8. Conclusion


In this project, we have done several things:

  - First, we analyzed the basic patterns of the unadjusted total US retail revenue data, which show a seasonal pattern and a significant long-term linear trend.
  - Accordingly, we fit a SARIMA model to the data, chose the best version of it, and made predictions for the future; the results show that the model does well in both fitting and forecasting.
  - We then compared the unadjusted data with the seasonally adjusted data to examine the differences.
  - Finally, using the pattern extracted from the total revenue, we fit the home furnishings revenue data and found that the resulting model performs better than an ordinary ARIMA model.

In conclusion, the SARIMA model and regression with ARIMA errors are good models for both fitting and prediction, and analysis in the frequency domain is a helpful tool for time series analysis.

9. References

1. Lecture notes of STATS 531
2. US retail data from the US Census Bureau, https://www.census.gov (Manufacturing & Trade Inventories & Sales)
3. Documentation of the X-13ARIMA-SEATS Seasonal Adjustment Program