Homework 5, STATS 401 F18

Your report should include the R code that you use. For calculations, please add enough explanation to help the reader understand what you did and why. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that influenced your report. As usual, a statement of Sources is required to get credit for the homework.

Symmetric matrices

\({\mathbb{A}}=[a_{ij}]_{p\times p}\) is a symmetric matrix if \(a_{ij}=a_{ji}\) for all \(i\) and \(j\). This means the matrix has reflective symmetry across its leading diagonal. Recall that variance/covariance matrices are symmetric because \({\mathrm{Cov}}(X_i,X_j)= {\mathrm{Cov}}(X_j,X_i)\) due to the symmetry of covariance.

An example of a symmetric matrix is

\[ \begin{bmatrix} 2 & 3 & -1 \\ 3 & 1 & 4 \\ -1 & 4 & 7 \\ \end{bmatrix} \]

An equivalent definition is that \({\mathbb{A}}\) is symmetric if \({\mathbb{A}}={\mathbb{A}}^{{\scriptscriptstyle \mathrm{T}}}\). Transposing swaps rows and columns, which is exactly the same thing as reflection across the leading diagonal. Thus, a matrix is equal to its transpose if and only if it has reflective symmetry across the leading diagonal. Checking whether a matrix is equal to its transpose can be a good way to see if a matrix is symmetric since we have already used rules for transposing matrix products and inverses.

Q1. Let \({\mathbb{A}}\) be a symmetric \(p\times p\) matrix and let \({\mathbb{B}}\) be any \(p\times q\) matrix. Let \[ {\mathbb{C}} = {\mathbb{B}}^{{\scriptscriptstyle \mathrm{T}}}{\mathbb{A}}{\mathbb{B}} \] Show that \({\mathbb{C}}\) is equal to its own transpose and hence show that \({\mathbb{C}}\) is a symmetric \(q\times q\) matrix.

Q1 Solution.

\begin{eqnarray*} \mathbb{C}^T &=& (\mathbb{B}^T \mathbb{AB})^T \\ &=& \mathbb{B}^T \mathbb{A}^T (\mathbb{B^T})^T \\ &=& \mathbb{B}^T \mathbb{AB} \end{eqnarray*}

Therefore \(\mathbb{C}^T = \mathbb{C}\).

Hint: Apply the rules of matrix multiplication and transposition to simplify \({\mathbb{C}}^{{\scriptscriptstyle \mathrm{T}}}= \big({\mathbb{B}}^{{\scriptscriptstyle \mathrm{T}}}{\mathbb{A}}{\mathbb{B}}\big)^{{\scriptscriptstyle \mathrm{T}}}\). Using the assumption that \({\mathbb{A}}\) is symmetric, this simplification should give you an expression equal to \({\mathbb{C}}\).

Q2. Let \({\mathbb{A}}\) be a symmetric \(p\times p\) matrix with an inverse \({\mathbb{A}}^{-1}\). Is \({\mathbb{A}}^{-1}\) symmetric?

Calculate an example using R to check this.

Question 2a Solution.

Let \[\mathbb{A} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 1 & 4 \\ 3 & 4 & 1 \end{bmatrix}\]

Using R to solve, we obtain that \[\mathbb{A}^{-1} = \begin{bmatrix} -0.75 & 0.5 & 0.25 \\ 0.50 & -0.4 & 0.10 \\ 0.25 & 0.1 & -0.15 \end{bmatrix}\] which is also a symmetric matrix.

Explain this result using properties of the matrix transpose and inverse.

Question 2b Solution.

We want to show that \((\mathbb{A}^{-1})^T = \mathbb{A}^{-1}\). We are assuming that \(\mathbb{A}\) is a symmetric matrix which has an inverse \({\mathbb{A}}^{-1}\). Some basic properties of the matrix transpose and inverse are \(\mathbb{A}^{-1}\mathbb{A} = \mathbb{AA}^{-1} = \mathbb{I}\) and \(({\mathbb{A}}^{-1})^{{\scriptscriptstyle \mathrm{T}}}= ({\mathbb{A}}^{{\scriptscriptstyle \mathrm{T}}})^{-1}\). Using the second of these properties, together with the symmetry property that \({\mathbb{A}}^{{\scriptscriptstyle \mathrm{T}}}= {\mathbb{A}}\), we calculate \[ ({\mathbb{A}}^{-1})^{{\scriptscriptstyle \mathrm{T}}}= ({\mathbb{A}}^{{\scriptscriptstyle \mathrm{T}}})^{-1} = {\mathbb{A}}^{-1}. \] We have checked that \(({\mathbb{A}}^{-1})^{{\scriptscriptstyle \mathrm{T}}}= {\mathbb{A}}^{-1}\) which shows that \({\mathbb{A}}^{-1}\) is itself symmetric.

Q3. Investigating multivariate data.

Carbon dioxide (\(CO_2\)) levels in the atmosphere have been increasingly steadily. The longest data recording this are the measurements taken at Mauna Loa observatory in Hawaii. An increasing trend in \(CO_2\) matches increasing trends in both global economic activity and the global population. Which provides a better explanation? To address this, one may have to look at variations around the trend. However, on shorter timescales, fluctuating geophysical processes such as volcanic activity and the El Nino Southern Oscillation (ENSO) may be dominant. In addition, one can wonder whether existing data measuring global \(CO_2\) emissions do a better job than overall economic activity for explaining fluctuations of \(CO_2\) around its increasing trend. The following dataset collects together indices measuring all these phenomena.

download.file(destfile="climate.txt",
  url="https://ionides.github.io/401f18/hw/hw05/climate.txt")

X <- read.table("climate.txt",header=TRUE)
head(X)

##   Year    CO2  GDP  Pop    ENSO Volcanic Emissions
## 1 1958     NA   NA   NA  0.9054   0.0042        NA
## 2 1959 315.97   NA   NA  0.2698   0.0027        NA
## 3 1960 316.91 7.23 3027 -0.2568   0.0024        NA
## 4 1961 317.64 7.54 3069 -0.2322   0.0024       9.5
## 5 1962 318.45 7.97 3123 -0.7650   0.0024       9.8
## 6 1963 318.99 8.38 3189 -0.1629   1.8454      10.4

CO2: Mean annual concentrations of atmospheric CO2 (in parts per million by volume) at Mauna Loa observatory.
GDP: World gross domestic product from World Development Indicators database (data.worldbank.org/data-catalog/world-development-indicators) of the World Bank
Pop: World population from World Development Indicators database (data.worldbank.org/data-catalog/world-development-indicators) of the World Bank
ENSO: El Nino Southern Oscillation index from NOAA sources (www.esrl.noaa.gov/psd/people/klaus.wolter/MEI/table.html)
Emissions: estimated emissions of CO2 (million Kt) from World Development Indicators database (data.worldbank.org/data-catalog/world-development-indicators) of the World Bank

Create new variables called diff_CO2 diff_GDP diff_pop and diff_emissions that measure the percentage annual change in each variable. A trick to do this is to take the difference of the logarithm. In addition, it is necessary to add a missing value (NA) at the start since the annual change is not available in the first year of the dataset. For example,

X$diff_CO2 <- c(NA,diff(log(X$CO2)))

Question 3a Solution.

X$diff_CO2 <- c(NA,diff(log(X$CO2)))
X$diff_GDP <- c(NA,diff(log(X$GDP)))
X$diff_pop <- c(NA,diff(log(X$Pop)))
X$diff_emissions <- c(NA,diff(log(X$Emissions)))

Make a pairs plot of the variables diff_CO2, diff_GDP, diff_pop, diff_emissions, ENSO and Volcanic. You will likely have to look at ?pairs to study the syntax of pairs().

Question 3b Solution.

pairs_plot_vars <- X[, c("diff_CO2", "diff_GDP", "diff_pop", "diff_emissions", "ENSO", "Volcanic")]
pairs(pairs_plot_vars)

Describe what you see in the pairs plot. Which variables seem to approximately follow a joint normal distribution and which do not? Can you explain any of these results by whether or not a central limit theorem is applicable to the quantities?

Question 3c Solution.

(diff_CO2 and diff_GDP), (diff_CO2 and diff_emissions), (diff_pop and diff_emissions), (diff_pop and ENSO), (diff_emissions and ENSO), (diff_GDP and diff_emissions), and (diff_GDP and diff_pop) appear to follow joint normal distributions, since their pairwise scatterplots are approximately football shaped. Note: There appear to be some potential outliers.

A normal distribution due to the central limit theorem is expected when a quantity is the sum of many smaller quantities. Many global processes, including human socioeconomic activity, is the sum of a large number of small contributions spread across the globe and so variation may be expected to follow a normal curve. By contrast, volcanic activity is driven by a few rare giant volcanic erruptions and so is not subject to the central limit theorem. ENSO measures a single global phenomenon, the El Nino Southern Oscillation, which cannot readily be broken down as the sum of many small things so there may be no clear reason why a central limit theorem should apply to it. However, the ENSO index appears roughly normally distributed anyhow. The central limit principle does not forbid the normal distribution occurring for other reasons, it just says that the normal distribution will arise when the quantity is the sum of many small and approximately independent contributions.

Calculate the sample correlation matrix of these variables using cor(). If any of the measurements are NA the sample correlation will show up as NA. You can subset the data to consider only the years when all these variables were measured.

Question 3d Solution.

# remove the observations that contain missing values
corr_vars <- na.omit(pairs_plot_vars)
cor(corr_vars)

##                   diff_CO2    diff_GDP   diff_pop diff_emissions
## diff_CO2        1.00000000 -0.11068825 -0.5048093    -0.02691627
## diff_GDP       -0.11068825  1.00000000  0.4786702     0.71034357
## diff_pop       -0.50480933  0.47867016  1.0000000     0.28688387
## diff_emissions -0.02691627  0.71034357  0.2868839     1.00000000
## ENSO            0.34404762 -0.14042767 -0.1653128    -0.18804802
## Volcanic       -0.43053100 -0.04741184  0.2609260    -0.12030808
##                      ENSO    Volcanic
## diff_CO2        0.3440476 -0.43053100
## diff_GDP       -0.1404277 -0.04741184
## diff_pop       -0.1653128  0.26092596
## diff_emissions -0.1880480 -0.12030808
## ENSO            1.0000000  0.33582604
## Volcanic        0.3358260  1.00000000

Interpret the sample correlation matrix. Global climate and the global economy are complex systems, and it is unlikely that a preliminary analysis such as (d) is going to be highly revealing. More likely, some of the results will be surprising and some may not readily make sense. Comment on what does and does not make sense to you in the sample correlation matrix, and then discuss the limitations of this analysis and things that might be done to overcome these limitations. You are not asked to carry out additional analysis, though you can if you want to!

Question 3e Solution.

Note: It is not surprising that the signs and values of the pairwise correlations follow the trends that we saw in the pairs() plot in part b.

It initially surprising that \(CO_2\) levels would decrease as GDP, population, and emissions rise. However, this may be driven by changes in technology or changes in the make-up of emissions (for instance emitting a different gas as opposed to \(CO_2\)). It is somewhat surprising that \(CO_2\) declines so sharply with increases in Volcanic activity but increases as the El Nino index rises.

It is not surprising that GDP would increase as population increases and that emissions would increase with increases in GDP and the population. The correlation between GDP and volcanic activity is close to zero, which might be expected because volcanic activity has little effect on most of the global economic activities. The correlation between ENSO and GDP is also quite small, and might be due to chance variation. Similarly, the correlations of population with ENSO and volcanic activity do not have readily identifiable causes and may just be spurious correlations.

Q4. Working with the multivariate normal distribution in three dimensions.

This question guides you through a simulation experiment to verify that mvnorm(n,mean=mu,sigma=V) draws random variables with mean mu and variance/covariance matrix V. This differs slightly from rnorm() for which the arguments are mean and standard deviation rather than mean and variance.

Carry out the following steps in R.

Set mu to be a length 3 vector of your choice.

Question 4a Solution.

mu = matrix(c(1,2,3))

Set V to be a \(3\times 3\) symmetric matrix of your choice.

Question 4b Solution.

# create a variance/covariance matrix
V = matrix(c(3,2,1,2,5,1,1,1,6), nrow = 3)
V

##      [,1] [,2] [,3]
## [1,]    3    2    1
## [2,]    2    5    1
## [3,]    1    1    6

# check that it is positive semi-definite
eigen(V)$values

## [1] 7.507903 4.755640 1.736457

Since my \(\mathbb{V}\) have eigenvalues that are all greater than 0, I won’t get the warning message mentioned in part c.

Draw a large sample from the multivariate normal distribution with these parameters using rmvnorm(). There may be some fiddling required to choose a V that makes this work without the warning message
```
Warning message:
In rmvnorm: sigma is numerically not positive semidefinite
```
in which case you can look ahead to part (f).

Question 4c Solution.

# Draw a large sample from the multivariate normal distribution
library(mvtnorm)
mvn_sample <- rmvnorm(1000, mu, V)

Compute the sample mean and sample variance/covariance matrix for the multivariate random variables you simulated, and check they are close to mu and V.

Question 4d Solution.

# sample mean
apply(mvn_sample, 2, mean)

## [1] 1.069725 2.126886 2.956789

# sample covariance matrix
cov(mvn_sample)

##           [,1]      [,2]      [,3]
## [1,] 2.9604048 2.0783820 0.9448336
## [2,] 2.0783820 5.0425931 0.8607823
## [3,] 0.9448336 0.8607823 6.4108265

The sample vector mean and covariance matrix are close to the true values.

Make a pairs plot of your three simulated variables using the pairs() function. By looking at the pairs plot, guess the three correlations between the three possible choices of pairs of these variables. Then compute the exact correlations from V and see how close your guess was.

Question 4e Solution.

pairs(mvn_sample)

Guess: The correlation between variable 1 and variable 2 is around 0.6. The correlation between variable 1 and variable 2 is around 0.3. Variable 2 and variable 3 have a correlation close to 0.

The calculated correlation matrix is the following:

cor(mvn_sample)

##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.5379264 0.2168816
## [2,] 0.5379264 1.0000000 0.1513942
## [3,] 0.2168816 0.1513942 1.0000000

Not all symmetric matrices are valid variance/covariance matrices. In particular, variances have to be non-negative and so the diagonal entries of the variance/covariance matrix must be non-negative. Covariances can be negative. However, variables with small variance can’t have large covariance. To see this, since the correlation is between -1 and 1, we have an inequality \[ \left| \frac{{\mathrm{Cov}}(X,Y)}{\sqrt{{\mathrm{Var}}(X){\mathrm{Var}}(Y)}} \right| \le 1 \] which gives us \[ \big[ {\mathrm{Cov}}(X,Y) \big]^2 \le {\mathrm{Var}}(X) \times {\mathrm{Var}}(Y). \] Try to carry out step (c) and (d) with a non-valid variance/covariance matrix and report on what happens. If your original variance/covariance matrix was not valid, re-run your code for (a-d) with a valid choice of V.

Question 4f Solution.

# create a variance/covariance matrix
V = matrix(c(1,2,3,2,5,1,3,1,6), nrow = 3)
V

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    2    5    1
## [3,]    3    1    6

# Draw a large sample from the multivariate normal distribution
mvn_sample <- rmvnorm(1000, mu, V)

## Warning in rmvnorm(1000, mu, V): sigma is numerically not positive
## semidefinite

# sample mean
apply(mvn_sample, 2, mean)

## [1] 1.012456 2.108220 2.956429

# sample covariance matrix
cov(mvn_sample)

##          [,1]     [,2]     [,3]
## [1,] 1.635497 1.821639 2.807777
## [2,] 1.821639 4.994261 1.085820
## [3,] 2.807777 1.085820 6.225845

# pairs plot of the sample
pairs(mvn_sample)

# correlation matrix
cor(mvn_sample)

##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.6373851 0.8799107
## [2,] 0.6373851 1.0000000 0.1947257
## [3,] 0.8799107 0.1947257 1.0000000

My correlation between variable 1 and variable 2 is fairly high at 0.63, and my correlation between variable 2 and variable 3 is very high with a correlation of 0.88. This indicates that my variables may not be independent; variable 2 may be linearly related to variable 3.

Q5. Working with means and variances of linear combinations.

Let \({\boldsymbol{\mathrm{X}}}=(X_1,X_2,X_3,X_4)\) be a vector random variable with mean vector \((1,2,3,4)\) and variance/covariance matrix \[ {\mathbb{V}} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \\ \end{bmatrix} \]

Find the mean and variance of \(\sum_{i=1}^4X_i\).

Question 5a Solution.

Note: I can re-write \(\sum_{i=1}^4X_i\) as \(\begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}\). Let \(\mathbb{A}_{1 \times 4}= \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}\).

Then

\[ E[\mathbb{A}\boldsymbol{X}] = \mathbb{A}E[\boldsymbol{X}] \\ \mathbb{A}E[\boldsymbol{X}] = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \\ = 10 \]

and

\[ Var(\mathbb{A}\boldsymbol{X}) = \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = 30 \]

Find the mean and variance of \(\bar X = \frac{1}{4} \sum_{i=1}^4 X_i\).

Question 5b Solution.

Note: I can re-write \(\frac{1}{4}\sum_{i=1}^4X_i\) as \(\begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}\). Let \(\mathbb{A}_{1 \times 4}= \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix}\).

Then

\[ E[\mathbb{A}\boldsymbol{X}] = \mathbb{A}E[\boldsymbol{X}] \\ \mathbb{A}E[\boldsymbol{X}] = \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \\ = 2.5 \] and

\[ Var(\mathbb{A}\boldsymbol{X}) = \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} \frac{1}{4} \\ \frac{1}{4} \\ \frac{1}{4} \\ \frac{1}{4} \end{bmatrix} \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = 1.875 \]

Use matrix methods to find the mean and variance of the linear combination \(X_1+2X_2+3X_3+4X_4\). You do not have to do the calculation by hand. Report the required calculation using mathematical notation and carry out the computation using R.

Question 5c Solution.

Note: I can re-write \(\frac{1}{4}\sum_{i=1}^4X_i\) as \(\begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}\). Let \(\mathbb{A}_{1 \times 4}= \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}\).

Then

\[ E[\mathbb{A}\boldsymbol{X}] = \mathbb{A}E[\boldsymbol{X}] \\ \mathbb{A}E[\boldsymbol{X}] = \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \\ = 30 \] and

\[ Var(\mathbb{A}\boldsymbol{X}) = \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = 246 \]

Suppose that \({\boldsymbol{\mathrm{X}}}\) is a multivariate normal vector random variable with the mean and variance/covariance matrix given above. Find the probability that \(X_4\) is greater than \(X_3\). Explain your reasoning, and report an R calculation. Hint: Rewrite the event \(\{X_4>X_3\}\) as a statement about the linear combination \(X_4-X_3\). Work out the mean and variance of \(X_4-X_3\) using the same approach you used for part (c). Then, use the property that a linear combination of multivariate normal random variables is normally distributed.

Question 5c Solution.

\(P(X_4>X_3) = P(X_4 - X_3 > 0)\)

Note: \[ X_4 - X_3 = 0X_1 + 0X_2 - X_3 + X_4 \]

Then we can rewrite this as \(\mathbb{A}\boldsymbol{X}\) where \(\mathbb{A}_{1 \times 4} = \begin{bmatrix} 0 & 0 & -1 & 1 \end{bmatrix}\).

Then

\[ E[\mathbb{A}\boldsymbol{X}] = \mathbb{A}E[\boldsymbol{X}] \\ \mathbb{A}E[\boldsymbol{X}] = \begin{bmatrix} 0 & 0 & -1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \\ = 1 \]

and

\[ Var(\mathbb{A}\boldsymbol{X}) = \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = \begin{bmatrix} 0 & 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ -1 \\ 1 \end{bmatrix} \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = 1 \]

Then \(P(X_4>X_3) = P(X_4 - X_3 > 0)\) can be found using pnorm(0, 1, 1, lower.tail = F).

Thus, \(P(X_4 - X_3 > 0) = 0.8413447\).

License: This material is provided under an MIT license
Acknowledgment: The climate dataset comes from from https://https://doi.org/10.1016/j.envsci.2012.03.008

Homework 5, STATS 401 F18

Due in lab on 10/19

Symmetric matrices