Your report should include the R code that you use. For calculations, please add enough explanation to help the reader understand what you did and why. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that influenced your report. As usual, a statement of Sources is required to get credit for the homework.
\({\mathbb{A}}=[a_{ij}]_{p\times p}\) is a symmetric matrix if \(a_{ij}=a_{ji}\) for all \(i\) and \(j\). This means the matrix has reflective symmetry across its leading diagonal. Recall that variance/covariance matrices are symmetric because \({\mathrm{Cov}}(X_i,X_j)= {\mathrm{Cov}}(X_j,X_i)\) due to the symmetry of covariance.
An example of a symmetric matrix is
\[ \begin{bmatrix} 2 & 3 & -1 \\ 3 & 1 & 4 \\ -1 & 4 & 7 \\ \end{bmatrix} \]
An equivalent definition is that \({\mathbb{A}}\) is symmetric if \({\mathbb{A}}={\mathbb{A}}^{{\scriptscriptstyle \mathrm{T}}}\). Transposing swaps rows and columns, which is exactly the same thing as reflection across the leading diagonal. Thus, a matrix is equal to its transpose if and only if it has reflective symmetry across the leading diagonal. Checking whether a matrix is equal to its transpose can be a good way to see if a matrix is symmetric since we have already used rules for transposing matrix products and inverses.
Q1. Let \({\mathbb{A}}\) be a symmetric \(p\times p\) matrix and let \({\mathbb{B}}\) be any \(p\times q\) matrix. Let \[ {\mathbb{C}} = {\mathbb{B}}^{{\scriptscriptstyle \mathrm{T}}}{\mathbb{A}}{\mathbb{B}} \] Show that \({\mathbb{C}}\) is equal to its own transpose and hence show that \({\mathbb{C}}\) is a symmetric \(q\times q\) matrix.
Hint: Apply the rules of matrix multiplication and transposition to simplify \({\mathbb{C}}^{{\scriptscriptstyle \mathrm{T}}}= \big({\mathbb{B}}^{{\scriptscriptstyle \mathrm{T}}}{\mathbb{A}}{\mathbb{B}}\big)^{{\scriptscriptstyle \mathrm{T}}}\). Using the assumption that \({\mathbb{A}}\) is symmetric, this simplification should give you an expression equal to \({\mathbb{C}}\).
Q2. Let \({\mathbb{A}}\) be a symmetric \(p\times p\) matrix with an inverse \({\mathbb{A}}^{-1}\). Is \({\mathbb{A}}^{-1}\) symmetric?
Calculate an example using R to check this.
Explain this result using properties of the matrix transpose and inverse.
Q3. Investigating multivariate data.
Carbon dioxide (\(CO_2\)) levels in the atmosphere have been increasingly steadily. The longest data recording this are the measurements taken at Mauna Loa observatory in Hawaii. An increasing trend in \(CO_2\) matches increasing trends in both global economic activity and the global population. Which provides a better explanation? To address this, one may have to look at variations around the trend. However, on shorter timescales, fluctuating geophysical processes such as volcanic activity and the El Nino Southern Oscillation (ENSO) may be dominant. In addition, one can wonder whether existing data measuring global \(CO_2\) emissions do a better job than overall economic activity for explaining fluctuations of \(CO_2\) around its increasing trend. The following dataset collects together indices measuring all these phenomena.
download.file(destfile="climate.txt",
  url="https://ionides.github.io/401f18/hw/hw05/climate.txt")X <- read.table("climate.txt",header=TRUE)
head(X)##   Year    CO2  GDP  Pop    ENSO Volcanic Emissions
## 1 1958     NA   NA   NA  0.9054   0.0042        NA
## 2 1959 315.97   NA   NA  0.2698   0.0027        NA
## 3 1960 316.91 7.23 3027 -0.2568   0.0024        NA
## 4 1961 317.64 7.54 3069 -0.2322   0.0024       9.5
## 5 1962 318.45 7.97 3123 -0.7650   0.0024       9.8
## 6 1963 318.99 8.38 3189 -0.1629   1.8454      10.4CO2: Mean annual concentrations of atmospheric CO2 (in parts per million by volume) at Mauna Loa observatory.
GDP: World gross domestic product from World Development Indicators database (data.worldbank.org/data-catalog/world-development-indicators) of the World Bank
Pop: World population from World Development Indicators database (data.worldbank.org/data-catalog/world-development-indicators) of the World Bank
ENSO: El Nino Southern Oscillation index from NOAA sources (www.esrl.noaa.gov/psd/people/klaus.wolter/MEI/table.html)
Volcanic: An index of monthly estimated sulfate aerosols derived from NOAA data (ftp.ncdc.noaa.gov/pub/data/paleo/climate_forcing/volcanic_aerosols/ammann2003b_volcanics.txt)
Emissions: estimated emissions of CO2 (million Kt) from World Development Indicators database (data.worldbank.org/data-catalog/world-development-indicators) of the World Bank
diff_CO2 diff_GDP diff_pop and diff_emissions that measure the percentage annual change in each variable. A trick to do this is to take the difference of the logarithm. In addition, it is necessary to add a missing value (NA) at the start since the annual change is not available in the first year of the dataset. For example,X$diff_CO2 <- c(NA,diff(log(X$CO2)))Make a pairs plot of the variables diff_CO2, diff_GDP, diff_pop, diff_emissions, ENSO and Volcanic. You will likely have to look at ?pairs to study the syntax of pairs().
Describe what you see in the pairs plot. Which variables seem to approximately follow a joint normal distribution and which do not? Can you explain any of these results by whether or not a central limit theorem is applicable to the quantities?
Calculate the sample correlation matrix of these variables using cor(). If any of the measurements are NA the sample correlation will show up as NA. You can subset the data to consider only the years when all these variables were measured.
Interpret the sample correlation matrix. Global climate and the global economy are complex systems, and it is unlikely that a preliminary analysis such as (d) is going to be highly revealing. More likely, some of the results will be surprising and some may not readily make sense. Comment on what does and does not make sense to you in the sample correlation matrix, and then discuss the limitations of this analysis and things that might be done to overcome these limitations. You are not asked to carry out additional analysis, though you can if you want to!
Q4. Working with the multivariate normal distribution in three dimensions.
This question guides you through a simulation experiment to verify that mvnorm(n,mean=mu,sigma=V) draws random variables with mean mu and variance/covariance matrix V. This differs slightly from rnorm() for which the arguments are mean and standard deviation rather than mean and variance.
Carry out the following steps in R.
Set mu to be a length 3 vector of your choice.
Set V to be a \(3\times 3\) symmetric matrix of your choice.
Draw a large sample from the multivariate normal distribution with these parameters using rmvnorm(). There may be some fiddling required to choose a V that makes this work without the warning message
Warning message:
In rmvnorm: sigma is numerically not positive semidefinitein which case you can look ahead to part (f).
Compute the sample mean and sample variance/covariance matrix for the multivariate random variables you simulated, and check they are close to mu and V.
Make a pairs plot of your three simulated variables using the pairs() function. By looking at the pairs plot, guess the three correlations between the three possible choices of pairs of these variables. Then compute the exact correlations from V and see how close your guess was.
Not all symmetric matrices are valid variance/covariance matrices. In particular, variances have to be non-negative and so the diagonal entries of the variance/covariance matrix must be non-negative. Covariances can be negative. However, variables with small variance can’t have large covariance. To see this, since the correlation is between -1 and 1, we have an inequality \[
\left| \frac{{\mathrm{Cov}}(X,Y)}{\sqrt{{\mathrm{Var}}(X){\mathrm{Var}}(Y)}} \right| \le 1
\] which gives us \[
\big[ {\mathrm{Cov}}(X,Y) \big]^2 \le {\mathrm{Var}}(X) \times {\mathrm{Var}}(Y).
\] Try to carry out step (c) and (d) with a non-valid variance/covariance matrix and report on what happens. If your original variance/covariance matrix was not valid, re-run your code for (a-d) with a valid choice of V.
Q5. Working with means and variances of linear combinations.
Let \({\boldsymbol{\mathrm{X}}}=(X_1,X_2,X_3,X_4)\) be a vector random variable with mean vector \((1,2,3,4)\) and variance/covariance matrix \[ {\mathbb{V}} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \\ \end{bmatrix} \]
Find the mean and variance of \(\sum_{i=1}^4X_i\).
Find the mean and variance of \(\bar X = \frac{1}{4} \sum_{i=1}^4 X_i\).
Use matrix methods to find the mean and variance of the linear combination \(X_1+2X_2+3X_3+4X_4\). You do not have to do the calculation by hand. Report the required calculation using mathematical notation and carry out the computation using R.
Suppose that \({\boldsymbol{\mathrm{X}}}\) is a multivariate normal vector random variable with the mean and variance/covariance matrix given above. Find the probability that \(X_4\) is greater than \(X_3\). Explain your reasoning, and report an R calculation. Hint: Rewrite the event \(\{X_4>X_3\}\) as a statement about the linear combination \(X_4-X_3\). Work out the mean and variance of \(X_4-X_3\) using the same approach you used for part (c). Then, use the property that a linear combination of multivariate normal random variables is normally distributed.
License: This material is provided under an MIT license
 Acknowledgment: The climate dataset comes from from https://https://doi.org/10.1016/j.envsci.2012.03.008