Your report should include the R code that you use. For calculations, please add enough explanation to help the reader understand what you did and why. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that influenced your report. As usual, a statement of Sources is required to get credit for the homework.
A=[a_{ij}]_{p\times p} is a symmetric matrix if a_{ij}=a_{ji} for all i and j. This means the matrix has reflective symmetry across its leading diagonal. Recall that variance/covariance matrices are symmetric because \mathrm{Cov}(X_i,X_j)=\mathrm{Cov}(X_j,X_i), due to the symmetry of covariance.
An example of a symmetric matrix is
\begin{bmatrix} 2 & 3 & -1 \\ 3 & 1 & 4 \\ -1 & 4 & 7 \end{bmatrix}
An equivalent definition is that A is symmetric if A=A^T. Transposing swaps rows and columns, which is exactly the same thing as reflection across the leading diagonal. Thus, a matrix is equal to its transpose if and only if it has reflective symmetry across the leading diagonal. Checking whether a matrix equals its transpose is a convenient way to establish symmetry, since we already have rules for transposing matrix products and inverses.
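As a quick illustration (my addition, not part of the assignment), this check can be done in R for the example matrix above:
A <- matrix(c(2,3,-1, 3,1,4, -1,4,7), nrow = 3, byrow = TRUE)
all(A == t(A))    # TRUE: A equals its transpose
isSymmetric(A)    # TRUE: base R's built-in symmetry check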
Q1. Let A be a symmetric p×p matrix and let B be any p×q matrix. Let C=B^T A B. Show that C is equal to its own transpose, and hence that C is a symmetric q×q matrix.
Hint: Apply the rules of matrix multiplication and transposition to simplify C^T=(B^T A B)^T. Using the assumption that A is symmetric, this simplification should give you an expression equal to C.
Q1 Solution.
C^T=(B^T A B)^T=B^T A^T (B^T)^T=B^T A B=C. Therefore C^T=C, so C is symmetric. Moreover, since B^T is q×p, A is p×p and B is p×q, the product C=B^T A B is a q×q matrix.
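A quick numerical sanity check in R (my addition): construct a random symmetric A and an arbitrary B, and confirm that B^T A B is symmetric.
set.seed(1)                    # arbitrary seed, for reproducibility
M <- matrix(rnorm(9), 3, 3)
A <- M + t(M)                  # a symmetric 3x3 matrix
B <- matrix(rnorm(12), 3, 4)   # any 3x4 matrix
C <- t(B) %*% A %*% B          # C = B^T A B, a 4x4 matrix
isSymmetric(C)                 # TRUE (up to numerical rounding)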
Q2. Let A be a symmetric p×p matrix with an inverse A^{-1}. Is A^{-1} symmetric?
Question 2a Solution.
Let A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 1 & 4 \\ 3 & 4 & 1 \end{bmatrix}
Using R to solve for the inverse, we obtain A^{-1} = \begin{bmatrix} -0.75 & 0.5 & 0.25 \\ 0.5 & -0.4 & 0.1 \\ 0.25 & 0.1 & -0.15 \end{bmatrix}, which is also a symmetric matrix.
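For completeness, a sketch of the R computation referred to above (the explicit solve() call and symmetry check are my additions):
A <- matrix(c(1,2,3, 2,1,4, 3,4,1), nrow = 3, byrow = TRUE)
Ainv <- solve(A)        # matrix inverse
Ainv                    # prints the inverse shown above
isSymmetric(Ainv)       # TRUE: the inverse is also symmetric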
Question 2b Solution.
We want to show that (A^{-1})^T=A^{-1}. We are assuming that A is a symmetric matrix which has an inverse A^{-1}. Some basic properties of the matrix transpose and inverse are A^{-1}A=AA^{-1}=I and (A^{-1})^T=(A^T)^{-1}. Using the second of these properties, together with the symmetry property that A^T=A, we calculate (A^{-1})^T=(A^T)^{-1}=A^{-1}. We have checked that (A^{-1})^T=A^{-1}, which shows that A^{-1} is itself symmetric.
Q3. Investigating multivariate data.
Carbon dioxide (CO2) levels in the atmosphere have been increasing steadily. The longest record of this is the series of measurements taken at the Mauna Loa observatory in Hawaii. The increasing trend in CO2 matches increasing trends in both global economic activity and the global population. Which provides a better explanation? To address this, one may have to look at variations around the trend. However, on shorter timescales, fluctuating geophysical processes such as volcanic activity and the El Nino Southern Oscillation (ENSO) may be dominant. In addition, one can ask whether existing data measuring global CO2 emissions do a better job than overall economic activity of explaining fluctuations of CO2 around its increasing trend. The following dataset collects together indices measuring all these phenomena.
download.file(destfile="climate.txt",
url="https://ionides.github.io/401f18/hw/hw05/climate.txt")
X <- read.table("climate.txt",header=TRUE)
head(X)
## Year CO2 GDP Pop ENSO Volcanic Emissions
## 1 1958 NA NA NA 0.9054 0.0042 NA
## 2 1959 315.97 NA NA 0.2698 0.0027 NA
## 3 1960 316.91 7.23 3027 -0.2568 0.0024 NA
## 4 1961 317.64 7.54 3069 -0.2322 0.0024 9.5
## 5 1962 318.45 7.97 3123 -0.7650 0.0024 9.8
## 6 1963 318.99 8.38 3189 -0.1629 1.8454 10.4
CO2: Mean annual concentrations of atmospheric CO2 (in parts per million by volume) at Mauna Loa observatory.
GDP: World gross domestic product from World Development Indicators database (data.worldbank.org/data-catalog/world-development-indicators) of the World Bank
Pop: World population from World Development Indicators database (data.worldbank.org/data-catalog/world-development-indicators) of the World Bank
ENSO: El Nino Southern Oscillation index from NOAA sources (www.esrl.noaa.gov/psd/people/klaus.wolter/MEI/table.html)
Emissions: estimated emissions of CO2 (million Kt) from World Development Indicators database (data.worldbank.org/data-catalog/world-development-indicators) of the World Bank
Construct new variables diff_CO2, diff_GDP, diff_pop and diff_emissions that measure the percentage annual change in each variable. A trick to do this is to take the difference of the logarithm. In addition, it is necessary to add a missing value (NA) at the start, since the annual change is not available in the first year of the dataset. For example, X$diff_CO2 <- c(NA,diff(log(X$CO2))).
Question 3a Solution.
X$diff_CO2 <- c(NA,diff(log(X$CO2)))
X$diff_GDP <- c(NA,diff(log(X$GDP)))
X$diff_pop <- c(NA,diff(log(X$Pop)))
X$diff_emissions <- c(NA,diff(log(X$Emissions)))
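A brief aside (my addition) on why the difference of logs measures percentage change: log(x_t) - log(x_{t-1}) = log(x_t/x_{t-1}), which is approximately (x_t - x_{t-1})/x_{t-1} when the change is small. A tiny numerical check with hypothetical values:
x <- c(100, 103, 101)          # hypothetical values
diff(log(x))                   # approx  0.0296, -0.0196
diff(x) / head(x, -1)          # exact proportional change: 0.0300, -0.0194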
Make a pairs plot of the variables diff_CO2, diff_GDP, diff_pop, diff_emissions, ENSO and Volcanic. You will likely have to look at ?pairs to study the syntax of pairs().
Question 3b Solution.
pairs_plot_vars <- X[, c("diff_CO2", "diff_GDP", "diff_pop", "diff_emissions", "ENSO", "Volcanic")]
pairs(pairs_plot_vars)
Question 3c Solution.
The pairs (diff_CO2 and diff_GDP), (diff_CO2 and diff_emissions), (diff_pop and diff_emissions), (diff_pop and ENSO), (diff_emissions and ENSO), (diff_GDP and diff_emissions), and (diff_GDP and diff_pop) appear to follow joint normal distributions, since their pairwise scatterplots are approximately football shaped. Note: there appear to be some potential outliers.
A normal distribution is expected, due to the central limit theorem, when a quantity is the sum of many smaller quantities. Many global processes, including human socioeconomic activity, are the sum of a large number of small contributions spread across the globe, and so their variation may be expected to follow a normal curve. By contrast, volcanic activity is driven by a few rare giant volcanic eruptions and so is not subject to the central limit theorem. ENSO measures a single global phenomenon, the El Nino Southern Oscillation, which cannot readily be broken down as the sum of many small things, so there may be no clear reason why a central limit theorem should apply to it. However, the ENSO index appears roughly normally distributed anyhow. The central limit principle does not forbid the normal distribution from occurring for other reasons; it just says that the normal distribution will arise when the quantity is the sum of many small and approximately independent contributions.
Compute the sample correlation matrix of these variables using cor(). If any of the measurements are NA, the sample correlation will show up as NA. You can subset the data to consider only the years when all these variables were measured.
Question 3d Solution.
# remove the observations that contain missing values
corr_vars <- na.omit(pairs_plot_vars)
cor(corr_vars)
## diff_CO2 diff_GDP diff_pop diff_emissions
## diff_CO2 1.00000000 -0.11068825 -0.5048093 -0.02691627
## diff_GDP -0.11068825 1.00000000 0.4786702 0.71034357
## diff_pop -0.50480933 0.47867016 1.0000000 0.28688387
## diff_emissions -0.02691627 0.71034357 0.2868839 1.00000000
## ENSO 0.34404762 -0.14042767 -0.1653128 -0.18804802
## Volcanic -0.43053100 -0.04741184 0.2609260 -0.12030808
## ENSO Volcanic
## diff_CO2 0.3440476 -0.43053100
## diff_GDP -0.1404277 -0.04741184
## diff_pop -0.1653128 0.26092596
## diff_emissions -0.1880480 -0.12030808
## ENSO 1.0000000 0.33582604
## Volcanic 0.3358260 1.00000000
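For reference, an equivalent approach (my addition, not in the original solution): cor() accepts a use argument that drops incomplete rows itself, giving the same matrix without an explicit call to na.omit().
# alternative: casewise deletion handled by cor()
cor(pairs_plot_vars, use = "complete.obs")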
Question 3e Solution.
Note: It is not surprising that the signs and sizes of the pairwise correlations match the patterns that we saw in the pairs() plot in part (b).
It is initially surprising that growth in CO2 would be negatively correlated with growth in GDP, population, and emissions. However, this may be driven by changes in technology or in the make-up of emissions (for instance, emitting a different gas rather than CO2). It is somewhat surprising that CO2 growth declines so sharply with increases in volcanic activity but increases as the El Nino index rises.
It is not surprising that GDP would increase as population increases, and that emissions would increase with increases in GDP and population. The correlation between GDP growth and volcanic activity is close to zero, which might be expected because volcanic activity has little effect on most global economic activity. The correlation between ENSO and GDP growth is also quite small, and might be due to chance variation. Similarly, the correlations of population growth with ENSO and volcanic activity do not have readily identifiable causes and may just be spurious correlations.
Q4. Working with the multivariate normal distribution in three dimensions.
This question guides you through a simulation experiment to verify that rmvnorm(n, mean=mu, sigma=V) draws random vectors with mean mu and variance/covariance matrix V. This differs slightly from rnorm(), for which the arguments are mean and standard deviation rather than mean and variance.
Carry out the following steps in R.
Set mu to be a length 3 vector of your choice.
Question 4a Solution.
mu <- c(1, 2, 3)   # a length 3 mean vector
Set V to be a 3×3 symmetric matrix of your choice.
Question 4b Solution.
# create a variance/covariance matrix
V = matrix(c(3,2,1,2,5,1,1,1,6), nrow = 3)
V
## [,1] [,2] [,3]
## [1,] 3 2 1
## [2,] 2 5 1
## [3,] 1 1 6
# check that it is positive semi-definite
eigen(V)$values
## [1] 7.507903 4.755640 1.736457
Since my V has eigenvalues that are all greater than zero, it is positive definite, so I won't get the warning message mentioned in part (c).
Draw a large sample from the multivariate normal distribution with these parameters using rmvnorm(). There may be some fiddling required to choose a V that makes this work without the warning message
Warning message:
In rmvnorm: sigma is numerically not positive semidefinite
in which case you can look ahead to part (f).
Question 4c Solution.
# Draw a large sample from the multivariate normal distribution
library(mvtnorm)
mvn_sample <- rmvnorm(1000, mu, V)
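Note (my addition): because no random seed is set, re-running this code will give slightly different sample statistics from those reported below. Calling set.seed() first, with any fixed value, would make the results reproducible, for example:
set.seed(401)   # hypothetical seed value, chosen arbitrarily
mvn_sample <- rmvnorm(1000, mu, V)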
Calculate the sample mean vector and the sample variance/covariance matrix of your simulated sample, and compare them with mu and V.
Question 4d Solution.
# sample mean
apply(mvn_sample, 2, mean)
## [1] 1.069725 2.126886 2.956789
# sample covariance matrix
cov(mvn_sample)
## [,1] [,2] [,3]
## [1,] 2.9604048 2.0783820 0.9448336
## [2,] 2.0783820 5.0425931 0.8607823
## [3,] 0.9448336 0.8607823 6.4108265
The sample vector mean and covariance matrix are close to the true values.
Plot your sample using the pairs() function. By looking at the pairs plot, guess the three correlations between the three possible choices of pairs of these variables. Then compute the exact correlations from V and see how close your guess was.
Question 4e Solution.
pairs(mvn_sample)
Guess: The correlation between variable 1 and variable 2 is around 0.6. The correlation between variable 1 and variable 3 is around 0.3. Variable 2 and variable 3 have a correlation close to 0.
The sample correlation matrix is the following:
cor(mvn_sample)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.5379264 0.2168816
## [2,] 0.5379264 1.0000000 0.1513942
## [3,] 0.2168816 0.1513942 1.0000000
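The question also asks for the exact correlations implied by V; these can be obtained with cov2cor() (this check is my addition). For the V chosen above they are 2/sqrt(15) ≈ 0.52, 1/sqrt(18) ≈ 0.24 and 1/sqrt(30) ≈ 0.18, so both the guesses and the sample correlations are reasonably close.
# exact correlation matrix implied by the chosen V
cov2cor(V)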
Now repeat the steps above using a choice of V that produces the warning message (i.e., a V that is not positive semi-definite), and comment on what you find.
Question 4f Solution.
# create a variance/covariance matrix
V = matrix(c(1,2,3,2,5,1,3,1,6), nrow = 3)
V
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 2 5 1
## [3,] 3 1 6
# Draw a large sample from the multivariate normal distribution
mvn_sample <- rmvnorm(1000, mu, V)
## Warning in rmvnorm(1000, mu, V): sigma is numerically not positive
## semidefinite
# sample mean
apply(mvn_sample, 2, mean)
## [1] 1.012456 2.108220 2.956429
# sample covariance matrix
cov(mvn_sample)
## [,1] [,2] [,3]
## [1,] 1.635497 1.821639 2.807777
## [2,] 1.821639 4.994261 1.085820
## [3,] 2.807777 1.085820 6.225845
# pairs plot of the sample
pairs(mvn_sample)
# correlation matrix
cor(mvn_sample)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.6373851 0.8799107
## [2,] 0.6373851 1.0000000 0.1947257
## [3,] 0.8799107 0.1947257 1.0000000
The sample correlation between variable 1 and variable 2 is fairly high at 0.64, and the sample correlation between variable 1 and variable 3 is very high at 0.88. More importantly, the sample covariance matrix no longer matches the V I specified (for example, the sample variance of the first variable is about 1.64 rather than 1). This is because a matrix that is not positive semi-definite cannot be the variance/covariance matrix of any random vector, so rmvnorm() cannot draw a sample whose covariance matches it.
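To see why the warning appears, one can check the eigenvalues of this V (my addition): its determinant is 1(30-1) - 2(12-3) + 3(2-15) = -28 < 0, so V must have a negative eigenvalue and hence is not positive semi-definite.
# at least one eigenvalue is negative, so V is not a valid covariance matrix
eigen(V)$values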
Q5. Working with means and variances of linear combinations.
Let {\boldsymbol{\mathrm{X}}}=(X_1,X_2,X_3,X_4) be a vector random variable with mean vector (1,2,3,4) and variance/covariance matrix {\mathbb{V}} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \\ \end{bmatrix}
Question 5a Solution.
Note: I can re-write \sum_{i=1}^4X_i as \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}. Let \mathbb{A}_{1 \times 4}= \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}.
Then
E[\mathbb{A}\boldsymbol{X}] = \mathbb{A}E[\boldsymbol{X}] \\ \mathbb{A}E[\boldsymbol{X}] = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \\ = 10
and
Var(\mathbb{A}\boldsymbol{X}) = \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = 30
Question 5b Solution.
Note: I can re-write \frac{1}{4}\sum_{i=1}^4X_i as \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}. Let \mathbb{A}_{1 \times 4}= \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix}.
Then
E[\mathbb{A}\boldsymbol{X}] = \mathbb{A}E[\boldsymbol{X}] \\ \mathbb{A}E[\boldsymbol{X}] = \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \\ = 2.5
and
Var(\mathbb{A}\boldsymbol{X}) = \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} \frac{1}{4} \\ \frac{1}{4} \\ \frac{1}{4} \\ \frac{1}{4} \end{bmatrix} \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = 1.875
Question 5c Solution.
Note: I can re-write \sum_{i=1}^4 i\,X_i = X_1 + 2X_2 + 3X_3 + 4X_4 as \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}. Let \mathbb{A}_{1 \times 4}= \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}.
Then
E[\mathbb{A}\boldsymbol{X}] = \mathbb{A}E[\boldsymbol{X}] \\ \mathbb{A}E[\boldsymbol{X}] = \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \\ = 30
and
Var(\mathbb{A}\boldsymbol{X}) = \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = \begin{bmatrix} 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = 246
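As an optional cross-check of parts (a)-(c) (my addition, not required), the same calculations can be done in R by defining mu, V and each coefficient row vector A and evaluating A %*% mu and A %*% V %*% t(A):
mu <- c(1, 2, 3, 4)
V  <- matrix(c(1,1,1,1, 1,2,2,2, 1,2,3,3, 1,2,3,4), nrow = 4, byrow = TRUE)
for (a in list(rep(1, 4), rep(1/4, 4), 1:4)) {
  A <- matrix(a, nrow = 1)                          # coefficient row vector
  cat("mean:", A %*% mu, " variance:", A %*% V %*% t(A), "\n")
}
# prints 10 and 30, then 2.5 and 1.875, then 30 and 246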
Question 5d Solution.
P(X_4>X_3) = P(X_4 - X_3 > 0)
Note: X_4 - X_3 = 0X_1 + 0X_2 - X_3 + X_4
Then we can rewrite this as \mathbb{A}\boldsymbol{X} where \mathbb{A}_{1 \times 4} = \begin{bmatrix} 0 & 0 & -1 & 1 \end{bmatrix}.
Then
E[\mathbb{A}\boldsymbol{X}] = \mathbb{A}E[\boldsymbol{X}] \\ \mathbb{A}E[\boldsymbol{X}] = \begin{bmatrix} 0 & 0 & -1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} \\ = 1
and
Var(\mathbb{A}\boldsymbol{X}) = \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = \begin{bmatrix} 0 & 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ -1 \\ 1 \end{bmatrix} \\ \mathbb{A}Var(\boldsymbol{X})\mathbb{A}^T = 1
Assuming X is multivariate normal, the linear combination X_4 - X_3 is itself normally distributed, with mean 1 and variance 1 (so standard deviation 1) as computed above. Then P(X_4>X_3) = P(X_4 - X_3 > 0) can be found using pnorm(0, 1, 1, lower.tail = FALSE).
Thus, P(X_4 - X_3 > 0) = 0.8413447.
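As an informal check (my addition), the probability can also be approximated by simulation with rmvnorm() under the same multivariate normal assumption; the V in this question is positive definite, so the draw is valid, and the estimate should be close to 0.84.
library(mvtnorm)
set.seed(5)                                        # arbitrary seed
mu <- c(1, 2, 3, 4)
V  <- matrix(c(1,1,1,1, 1,2,2,2, 1,2,3,3, 1,2,3,4), nrow = 4, byrow = TRUE)
sim <- rmvnorm(100000, mean = mu, sigma = V)
mean(sim[, 4] > sim[, 3])                          # approximately 0.84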
License: This material is provided under an MIT license
Acknowledgment: The climate dataset comes from https://doi.org/10.1016/j.envsci.2012.03.008