Algae Bloom is a critical problem in preserving the sound water environment, ecosystem, and people’s health. Bloom means tiny small greenish microbes flourished in water, resulting in several problems in our ecosystem like choking marine life and contaminating the drinking water source. In August 2014, Toledo had a severe bloom at lake Erie. In fact, the algae bloom at Lake Erie was not the first case; it has occasionally occurred every year, which is a big headache to Michigan.
Generally, two phosphorus sources cause a bloom in the water body. First, the non-point source that runs off from the agricultural area, and another is the point source, for example, the wastewater treatment plant (WWTP). Thus, the Michigan state has tried to control both sources to reduce phosphorus amounts entering the water body.
This report will account for a wastewater dataset including daily phosphorus amounts discharged from 300 million residences of Detroit. The wastewater (herein phosphorus that we are interested in) from the city is collected, and pollutants (including phosphorus) are removed from the WWTP, then flown into lake Erie. We will look over the two different types of data: phosphorus amount produced from residences and phosphorus amount treated and discharged from WWTP to lake Erie.
Two general analyses would be carried out; 1) Characterizing the varying pattern of phosphorus amount produced from the city and entered into WWTP, and 2) Figure out a correlation between phosphorus generation and its potential burden to water receiving body (lake Erie). In other words, we will see how well the WWTP has handled the influent phosphorus amount to reduce possible bloom in lake Erie.
We will denote the generated phosphorus amounts from Detroit city’s residence (collected and entered into WWTP) as InfP (InfPhosphorus) and discharged phosphorus amount from WWTP (remains after removed from WWTP) as EffP (Effluent Phosphorus). The time scale we will analyze is 7.5 years, from Jan. 2013 to May. 2020.
We did data processing works to filter available dataset and plotted to take a view of the InfP variation during the period on a daily basis. The black line indicates InfP, again the discharged phosphorus amounts from Detroit and entering into WWTP. The red line shows the trend estimated by loess smoother. The daily variation data seems too noisy to perform a neat analysis, so we performed additional work to smooth these fluctuations.
The monthly averaged InfP was computed and plotted with its trend line to smooth the daily volatility, which shows two clear patterns. 1) The monthly averaged InfP (the black) shows somewhat seasonal patterns approximately on yearly basis: relatively lower InfP in the early of a year and higher InfP during summer. 2) Second, the trend of InfP estimated by loess smoother (the red line) indicates a decreasing phosphorus mass trend over time series.
Now, we will conduct several statistical analyses from the next section to check the two observations are relevant.
The two general patterns identified from the InfP dataset forced us to investigate a more in-depth analysis of the observations. We did work to define a seasonal pattern by periodogram analysis. The spectrum suggests a robust seasonal pattern between the one-year(Black Dot) period and the two-year (Red Dot) period.
Several factors can influence the InfP variation along the time span, including variations of temperature, precipitation, and population. Given those factor’s variations could gradually change the pattern of InfP, the higher seasonal patterns in adjacent years seem reasonable.
Then, the seasonal patterns were decomposed by the Hodrick-Prescott filter to separate trend, noise, and cycles from the monthly averaged InfP. The decomposition function shows a clear decreasing long-term trend and seasonal variation that was inferred from the periodogram analysis.
Now, we would construct a signal plus ARMA noise model to describe the decreasing trend. To decide the value of \(p\) and \(q\) of the ARMA(p, q) model, we will start by tabulating Akaike Information Criteria (AIC). A model with low AIC values implies precise prediction. Note that the AIC method might not be appropriate to select the most privileged mode for our dataset solely. Still, it is useful to eliminate some of the models that return relatively poorer predictions.
MA0 | MA1 | MA2 | MA3 | MA4 | |
---|---|---|---|---|---|
AR0 | 1351.74 | 1334.80 | 1330.48 | 1330.41 | 1330.58 |
AR1 | 1330.73 | 1332.46 | 1329.95 | 1331.73 | 1332.58 |
AR2 | 1332.36 | 1329.62 | 1331.46 | 1330.94 | 1332.93 |
AR3 | 1333.86 | 1331.56 | 1333.50 | 1330.42 | 1329.00 |
AR4 | 1335.27 | 1336.01 | 1330.76 | 1328.86 | 1330.84 |
The lowest value of AIC criteria is 1330.42 computed from the ARMA(3,4) model. The next lowest value is 1330.76 from ARMA(4,3). However, both models are relatively heavy, so we will also consider two simpler models, AR(1) and MA(2), whose models also have low AIC values, 1330.73 and 1330.48 repeatedly.
Then we can perform formal hypothesis tests on those selected models to finalize our decision. The general form of signal plus ARMA noise model is: \[Y_n = \mu_n + \eta_n\] where {\(\eta_n\)} is a stationary, causal invertible ARMA(p, q) with mean zero and {\(\mu_n\)} is the mean function. The mean function here has a linear specification: \(\mu_n=Z_n\beta\)
The descriptions for each model are shown below.
##
## Call:
## arima(x = i, order = c(1, 0, 0), xreg = year)
##
## Coefficients:
## ar1 intercept year
## 0.4868 279174.20 -135.3430
## s.e. 0.0945 76911.86 38.1382
##
## sigma^2 estimated as 166242: log likelihood = -661.36, aic = 1330.73
##
## Call:
## arima(x = i, order = c(0, 0, 2), xreg = year)
##
## Coefficients:
## ma1 ma2 intercept year
## 0.5813 0.2892 280504.66 -136.0025
## s.e. 0.1264 0.1185 73627.23 36.5094
##
## sigma^2 estimated as 161852: log likelihood = -660.24, aic = 1330.48
##
## Call:
## arima(x = i, order = c(4, 0, 3), xreg = year)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2 ma3 intercept
## 0.0672 0.6632 0.3476 -0.357 0.4783 -0.4783 -1.000 279864.63
## s.e. 0.1156 0.1252 0.1112 0.114 0.0606 0.0597 0.065 30018.69
## year
## -135.6639
## s.e. 14.8852
##
## sigma^2 estimated as 131623: log likelihood = -654.43, aic = 1328.86
##
## Call:
## arima(x = i, order = c(3, 0, 4), xreg = year)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 ma4 intercept
## -0.0383 -0.8166 -0.2163 0.5876 1.2234 0.7432 0.5874 283468.01
## s.e. 0.2512 0.1043 0.2323 0.2260 0.1000 0.1919 0.1418 71972.37
## year
## -137.4722
## s.e. 35.6888
##
## sigma^2 estimated as 135630: log likelihood = -654.5, aic = 1329
The coefficient of “MA3” in the ARMA(4,3) model is 1, which means that the estimated ARMA(4,3) is at the boundary of invertibility, in other words, unstable. Therefore, we will exclude ARMA (4,3) from our inventory and perform additional work to decide the final model among the other three models, ARMA(3,4), AR(1), and MA(2)
###3.2.3 Model Selection with a hypothesis test Before comparing the three models with trends (ARMA(3,4), AR(1), and MA(2)), we did one more analysis to prevent hesitation in selecting the fittest model to our dataset, which was performed by adding a model without a trend. It suggests that AR(1) model is best for the non-trend analysis. The procedures in selecting the fitted model are introduced at the supplementary part (end of the report)
To select the fitted model, we considered the formal hypothesis testing methods, Wilks approximation. To construct this hypothesis test, we will first state our null hypothesis \(H_0\) and our alternative hypothesis \(H_1\):
\(H_0\): The AR(1) without trend model is better to fit the data with a higher likelihood (or AR(1) with trend, or MA(2) with trend). \(H_1\): The ARMA(3,4) model with regression better fits the data with higher likelihood. And we denote the log-likelihood that maximized over each of the hypotheses as follows: \[l_0 =\sup_{\theta \in \Theta_0} l(\theta)\] \[l_1 =\sup_{\theta \in \Theta_1} l(\theta)\] Then, the Wilks’ approximation asserts that the cutoff between the null hypothesis and the alternative hypothesis is (Wikipedia, Wilks’ Theorem): \[ D = 2 \times (l_1 - l_0) \approx \chi^2_{D_1-D_0}\]
## [1] 1.858796e-06
## [1] 0.03279868
## [1] 0.04265723
ARMA(3,4) with trend | |
---|---|
AR(1) without trend | 0.00 |
AR(1) with trend | 0.03 |
MA(2) with trend | 0.04 |
All p-values of the three hypothesis tests are lower than 0.05 significance level. Therefore, we reject the all null hypothesis, suggesting the conclusion is that ARMA(3,4) model with regression is more favored in our time series data. Thus, the selected model is: \[I_n -0.038I_{n-1} -0.817I_{n-2} -0.216I_{n-3}= \epsilon_{n}+0.588\epsilon_{n-1}+ 1.223\epsilon_{n-2}+ 0.743\epsilon_{n-3}+ 0.587\epsilon_{n-4}+ 283467.150 -137.472\times year\]
We evaluate the performance of our model by checking residual, QQ-plot, and an autocorrelation plot.
The residual shows that there is no apparent pattern in the data from our model. There is little evidence for fluctuations increasing in amplitude over time, which precipitation patterns might induce. However, it is not significant and could be studied in a future analysis.
Then we construct the autocorrelation (ACF) plot to determine whether the residuals are uncorrelated. From the ACF plot, we notice that there is no significant correlation other than lag 0. Therefore, we can conclude that the residuals are indeed uncorrelated.
##
## Autocorrelations of series 'best$residuals', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 0.001 0.002 -0.005 -0.052 -0.008 -0.046 -0.001 -0.094 -0.112 -0.095
## 11 12 13 14 15 16 17 18 19
## -0.086 0.092 -0.082 0.035 0.012 -0.050 0.043 0.058 0.082
Finally, we check our normality assumption by the QQ-plot. Also, the QQ plot shows the residuals well follow a normal distribution. The majority of residuals lie on the diagonal guideline, which implies that our model has normal residuals.
We defined the characteristics of InfP so far, which indicates that there are consistent decreasing trends and seasonal variations. The constructed model ARMA(3,4) well fit the characteristics of InfP.
In this section, we will conduct the next analysis to define the correlation between InfP and EffP. Suppose there are statistically significant correlations between them. In that case, it can be concluded that the more efficient plan for preventing bloom in Lake Erie is to reduce the produced phosphorus amount from Detroit. Otherwise, if there are no distinct correlations, it can be concluded that WWTP has a crucial role in suppressing phosphorus discharged amounts from the point source. Then, the efficient plan is to enforce WWTP to optimize phosphorus removal from their system strictly. Currently, Michigan has conducted a strategy to regulate EffP from WWTP, and the result of this section could clear whether their regulation is a practical approach or not. Let’s see.
First, we draw out the smoothed plots to directly compare the monthly averaged patterns each model has.
There seems to be no clear correlation from the monthly averaged plots. So, we did detrend both data and plotted again.
Detrended plots show some evidence of positive correlation and also show a negative correlation, indicating that inspection with eyes would be inappropriate. Then, we try to see the ARMA regression models on the two monthly averaged datasets to test the correlation between InfP and EffP.
MA0 | MA1 | MA2 | MA3 | |
---|---|---|---|---|
AR0 | 1312.46 | 1307.58 | 1309.54 | 1290.64 |
AR1 | 1309.24 | 1309.57 | 1296.15 | 1291.31 |
AR2 | 1306.79 | 1292.02 | 1306.52 | 1293.06 |
AR3 | 1306.60 | 1293.95 | 1294.68 | 1294.92 |
MA(3) model shows the lowest value, 1290.64. Then, we take a hypothesis test with this model. Here, \(H_0\) and \(H_1\) are,
\(H_0\): The InfP and EffP are equal in MA(3) model, which shows they have correlations between each other. \(H_1\): There are no correlations between InfP and EffP as predicted by MA(3) model
##
## Call:
## arima(x = Influent_hp, order = c(0, 0, 3), xreg = Effluent_hp)
##
## Coefficients:
## ma1 ma2 ma3 intercept Effluent_hp
## 0.0519 -0.4097 -0.6421 -0.2626 0.2275
## s.e. 0.0951 0.0810 0.0916 3.3116 0.2142
##
## sigma^2 estimated as 97000: log likelihood = -639.32, aic = 1290.64
## [1] 0.2864263
The model shows an invertibility feature, and the computed likelihood ratio test is 0.2864, which suggests that we cannot reject the null hypothesis. There is no substantial evidence that the InfP is correlated with the EffP, which is an interesting inference. It means the WWTP has played a key role in managing phosphorus amounts that flow into the receiving body. Thus, if the systemic problem occurred at the WWTP, more potential for bloom increased.
We did perform a cross-correlation analysis. The results shown below are impressive, which seems to be affected by seasonal variations. However, it should be analyzed further in detail. Note that the elevated monthly averaged EffP is strongly related to the previous monthly averaged InfP, which tells us that the EffP can be roughly estimated beforehand by about one month. It would then be great inference for WWTP operators to adjust their strategies for proper phosphorus removal in terms of InfP variations.
Such oscillatory patterns are also shown in the coherency plot below. The coherence is most considerable at a period of 3 months, which also offers a seasonal pattern. These results also suggest that InfP predicts effP variability before around 3months ahead, associated with the cross-correlation analysis as shown above.
The phase is also plotted below, which illustrates that the phase between these two-time series is unstable. Note the phase oscillates around 1, which also shows the monthly averaged InfP can allow a cyclic prediction for the monthly averaged EffP.
To sum up this section, there seems to be no clear correlation between InfP and EffP, but some spurious patterns exist, which is interesting and needs to be investigated later. Nevertheless, it is important to remember that the elevated EffP could be predicted by monitoring InfP from one to three months ahead, allowing an operator to have appropriated actions for reducing EffP to the treated water receiving body.
We observed the seasonal variation and decreasing trend by smoothing its variations, which confirmed that the seasonal variation is statistically significant by a smoothed periodogram, which shows about 1-year patterns. The decreasing trend is also confirmed by the decomposition of the dataset, which shows that the long-term trend is decreasing.
The selected model is ARMA(3,4) with trends, which is carefully done by AIC modeling and three different hypothesis tests based on the Wilks’ approximation. Lastly, model assessments conducted show a well-fitted model we obtain without any significant problems.
We plotted both datasets and detrended one either, which shows no clear correlations between each other. Then, we did AIC modeling to select the fitted model and hypothesis tests to validate observations. From the hypothesis test, we concluded there are no clear correlations from the monthly averaged phosphorus variations. These results suggested that WWTP has a crucial role in controlling phosphorus amount entering into the water body. Michigan’s policy, regulating EffP, seems to be the proper approach for preventing occasional bloom at lake Erie. However, an interesting feature is that the monthly averaged InfP is positively related to EffP in one to three months ahead. This inference might be helpful for operators of WWTP to plan appropriate phosphorus removal strategies in advance.
Consequently, our work suggested a fitted model for reflecting InfP variation and defined the correlation between InfP and EffP. Then, we argue that an operator from WWTP can prepare in advance at least one month early by monitoring InfP variation. There are still tasks to be investigated to enhance our understanding, for instance, what can cause the seasonal variation of InfP, but we would leave it later and would like to finalize this report here.
MA0 | MA1 | MA2 | MA3 | MA4 | |
---|---|---|---|---|---|
AR0 | 1377.68 | 1351.37 | 1339.98 | 1341.59 | 1338.22 |
AR1 | 1337.73 | 1339.67 | 1341.53 | 1340.20 | 1339.42 |
AR2 | 1339.68 | 1340.10 | 1342.28 | 1339.70 | 1341.28 |
AR3 | 1341.63 | 1341.98 | 1336.52 | 1338.24 | 1338.55 |
AR4 | 1340.97 | 1341.85 | 1338.44 | 1338.52 | 1340.00 |
From the table above, we could see that models that have relatively low AIC values are AR(1), and ARMA(3,2). The summary of two model are listed below.
ar1
##
## Call:
## arima(x = i, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.6310 6221.6346
## s.e. 0.0853 121.0558
##
## sigma^2 estimated as 183454: log likelihood = -665.87, aic = 1337.73
ar3ma2
##
## Call:
## arima(x = i, order = c(3, 0, 2))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept
## -0.7288 0.0831 0.4262 1.4779 1.0000 6217.5312
## s.e. 0.1129 0.1352 0.1093 0.0499 0.0542 118.0764
##
## sigma^2 estimated as 156944: log likelihood = -661.26, aic = 1336.52
The summary of ARMA(3,2) model have a MA(2) coefficient 1 which is located at the unit circle. To avoid the problem in further analysis, we choose an invertible AR(1) model to be the potential model in our model selection.
This section has been removed to maintain anonymity.