Your report should include the R code that you use. For calculations, please add enough explanation to help the reader understand what you did and why. Recall that you are permitted to collaborate and to use any internet resources, but you must list all sources that influenced your report. As usual, a statement of Sources is required to get credit for the homework.


Working with the probability model for the linear model.

In principle, once we have a probability model for a linear model we can calculate the probability of any outcome related to the model that we might be interested in. This homework leads you through two such calculations. We can calculate the probability by simulating many realizations from the probability model. For some events related to the linear model, we can also work out the probability in terms of the normal probability density function.

We investigate the probability model \({\boldsymbol{\mathrm{Y}}}={\mathbb{X}}{\boldsymbol{\mathrm{\beta}}} + {\boldsymbol{\mathrm{\epsilon}}}\) with \({\boldsymbol{\mathrm{\epsilon}}}\sim{\mathrm{MVN}}(0,\sigma^2{\mathbb{I}})\), applied to fluctuations of life expectancy and unemployment around a linear trend. We will base our analysis on the dataset below, which we have seen repeatedly. The previous discussions of the data and scientific background will not be presented here; you should review them as needed.

The data can be downloaded using

download.file(destfile="life_expectancy.txt",
  url="https://ionides.github.io/401f18/hw/hw06/life_expectancy.txt")
download.file(destfile="unemployment.csv",
  url="https://ionides.github.io/401f18/hw/hw06/unemployment.csv")

We can then obtain residuals from fitting a linear trend and build a dataframe for the years common to both datasets, as follows

L <- read.table("https://ionides.github.io/401f18/hw/hw06/life_expectancy.txt",header=TRUE)
U <- read.table("https://ionides.github.io/401f18/hw/hw06/unemployment.csv",sep=",",header=TRUE)
U_annual <- apply(U[,2:13],1,mean)             # average the 12 monthly unemployment rates in each year
L_subset <- subset(L$Total,L$Year %in% U$Year) # keep the years present in both datasets
health <- data.frame(year=U$Year,u=U_annual,l=L_subset)
x <- lm(u~year,data=health)$residuals          # detrended unemployment
y <- lm(l~year,data=health)$residuals          # detrended life expectancy
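
As a quick sanity check, we might confirm that the merged dataframe has one row for each of the \(n=68\) years common to both datasets.

dim(health)   # expect 68 rows and 3 columns (year, u, l)
head(health)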

Now we consider a model we have investigated earlier to assess whether the fluctuations of life expectancy can be statistically explained by fluctuations in the unemployment rate. Let \({\boldsymbol{\mathrm{y}}}=(y_1,\dots,y_n)\) be a vector of detrended total life expectancy for each of the \(n=68\) years. Let \({\boldsymbol{\mathrm{x}}}=(x_1,\dots,x_n)\) be a vector of the corresponding unemployment rates. We consider the probability model \[ Y_i = \beta_1 x_i + \beta_2 + \epsilon_i, \quad \epsilon_i\sim{\mathrm{iid}}\;{\mathrm{normal}}(0,\sigma),\quad i=1,\dots,n. \] This defines a probability model for any values of \(\beta_1\), \(\beta_2\) and \(\sigma\). The value of the vector \({\boldsymbol{\mathrm{x}}}\) is considered known and fixed. We will call the fitted probability model the particular case when these parameters correspond to the least squares choices \(\beta_1=b_1\) and \(\beta_2=b_2\), with \(\sigma=s\) being the sample estimate of the standard deviation computed from the residuals. This notation corresponds to the sample simple linear regression model \[ y_i = b_1 x_i + b_2 + e_i, \quad i=1,\dots,n \] computed in R using

lm_fit <- lm(y~x)
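
For reference, the least squares quantities \(b_1\), \(b_2\) and \(s\) defined above can be read off from lm_fit as follows; the variable names b1, b2 and s below are just our own labels for them.

b1 <- coef(lm_fit)["x"]            # slope b_1
b2 <- coef(lm_fit)["(Intercept)"]  # intercept b_2
s <- summary(lm_fit)$sigma         # residual standard deviation s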

Q1. Under the fitted probability model, we want to find the probability that a year with unemployment 5 percent above trend has higher life expectancy (relative to trend) than a year with typical unemployment. This is just one example of a quantity we might be interested in when interpreting the fitted model. We calculate it in two different ways.

  (a) By simulation. We can write a new hypothetical value of the explanatory variable as \(x^*\). A corresponding outcome from the linear model is a random variable \(Y^*\) given by \[ Y^*=b_1 x^* + b_2+\epsilon^* \] with \(\epsilon^*\sim{\mathrm{normal}}(0,s)\). You can use the following code to simulate a pair of possible outcomes from the model, one having \(x^*=5\) and the other having \(x^*=0\).
Y_sim <- function(){
  beta_1 <- coef(lm_fit)["x"]            # slope b_1
  beta_2 <- coef(lm_fit)["(Intercept)"]  # intercept b_2
  sigma <- summary(lm_fit)$sigma         # residual standard deviation s
  Y1 <- beta_1*5 + beta_2 + rnorm(1,0,sigma)  # outcome with x* = 5
  Y2 <- beta_1*0 + beta_2 + rnorm(1,0,sigma)  # outcome with x* = 0
  c(Y1,Y2)
}

Use replicate() to carry out 5000 such simulations and write R code to find the fraction of these in which the variable Y1 (corresponding to a realization from the fitted probability model with \(x^*=5\)) is higher than the variable Y2 (corresponding to a realization with \(x^*=0\)).

Let us first use replicate() to carry out 5000 simulations

rep5000 <- replicate(5000,Y_sim())

#to see the structure of rep5000
dim(rep5000)
## [1]    2 5000

We see that the 5000 simulation results are stored as separate columns of the matrix rep5000.

Evaluating the proportion of simulations for which Y1 > Y2 is the same as finding the proportion of columns in which the first entry is greater than the second entry.

mean(rep5000[1,]>rep5000[2,])
## [1] 0.8984
#or
vec <- apply(rep5000,2,function(x){x[1]>x[2]}) #vector of length 5000 containing TRUE if Y1>Y2 
mean(vec) #proportion of entries corresponding to TRUE
## [1] 0.8984

  (b) By a normal probability calculation. Let \(U\sim {\mathrm{normal}}(5b_1+b_2,s)\) and \(V\sim{\mathrm{normal}}(b_2,s)\) be independent draws from the probability model in part (a) with \(x^*=5\) and \(x^*=0\) respectively. Find \({\mathrm{P}}(U>V)\) as a call to pnorm(). Check that this is consistent with your answer to part (a).

We have \(U\sim{\mathrm{normal}}(5b_1 + b_2,s)\) and \(V\sim{\mathrm{normal}}(b_2,s)\), where \(b_1=0.1313673\), \(b_2=3.046417\times 10^{-17}\) (numerically zero, as expected since both x and y are residuals with mean zero) and \(s=0.3633792\). Since \(U\) and \(V\) are independent, \({\mathrm{Var}}(U-V)={\mathrm{Var}}(U)+{\mathrm{Var}}(V)=2s^2\), so \(U-V\sim{\mathrm{normal}}(5b_1,\sqrt{2}\,s)\). Therefore \({\mathrm{P}}(U>V)={\mathrm{P}}(U-V>0)\) can be calculated as follows

beta_1 <- coef(lm_fit)["x"]
beta_2 <- coef(lm_fit)["(Intercept)"]
sigma <- summary(lm_fit)$sigma

pnorm(0,5*beta_1,sqrt(2)*sigma,lower.tail = FALSE)  # P(U-V > 0) = P(U > V)
## [1] 0.899402
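
This agrees closely with the simulation estimate of 0.8984 from part (a); the remaining discrepancy is Monte Carlo error.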

Q2. Suppose that the data are modeled as a realization of the following collection of random variables:
\[ Y_i = \beta_1 x_i + \beta_2 + \epsilon_i, \quad\epsilon_i\sim{\mathrm{iid}}\;{\mathrm{normal}}(0,\sigma),\quad i=1,\dots,n \] with \(\beta_1=0\) and the other parameters taking their fitted values from the data; \(\beta_2=b_2\) and \(\sigma=s\). We seek to calculate \({\mathrm{P}}(\hat\beta_1>b_1)\) in two different ways.

  (a) By simulation. We can generate a random dataset from this model, refit the linear model, and extract the simulated realization of \(\hat\beta_1\) as follows
beta_1_sim <- function(){
  beta_1 <- 0                            # the model sets beta_1 = 0
  beta_2 <- coef(lm_fit)["(Intercept)"]  # beta_2 = b_2
  sigma <- summary(lm_fit)$sigma         # sigma = s
  Y_sim <- beta_1*x + beta_2 + rnorm(length(x),0,sigma)  # simulate a dataset from this model
  coef(lm(Y_sim~x))["x"]                 # refit and extract the simulated beta_1_hat
}

Each call to the function beta_1_sim() produces one simulated realization of \(\hat\beta_1\) drawn from the specified probability model.
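
For example, a single call returns one draw; the value changes from call to call.

beta_1_sim()   # one simulated realization of beta_1_hat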

(i). Make a histogram of 5000 simulated realizations of \(\hat\beta_1\) using replicate(). Compare this histogram with the value of \(b_1\).

(ii). Find the fraction of these simulations which are larger than \(b_1\). The simulations will be different each time you carry them out, but it is possible your fraction will be zero.

b1_sim <- replicate(5000,beta_1_sim())
hist(b1_sim,
     xlim=c(min(b1_sim),max(max(b1_sim),beta_1)+0.05),
     main='Histogram of simulated realizations of beta1_hat')
abline(v=beta_1,col="red")   # beta_1 still holds b_1 = 0.1313673, set in Q1b
text(beta_1,400,"0.1313673",col="maroon")

The proportion of the simulated values that are larger than \(b_1\) is

mean(b1_sim>beta_1)
## [1] 0

  (b) By a normal probability calculation. Use matrix methods to find the mean and variance of \(\hat\beta_1\) for this probability model with \(\beta_1=0\). Construct a call to pnorm() that gives \({\mathrm{P}}(\hat\beta_1>b_1)\) for the probability model. Check that this is consistent with the answer you got by simulation.

The mean is \(\text{E}(\hat{\beta}_1) = \beta_1 = 0\). The covariance matrix of \(\hat{\boldsymbol{\mathrm{\beta}}}\) is \(\sigma^2(\mathbb{X}^T\mathbb{X})^{-1}\) with \(\sigma=s\), and \(\text{Var}(\hat{\beta}_1)\) is its [1,1] entry, since the first column of the design matrix \(\mathbb{X}\) corresponds to the slope.

# find the variance of beta_1_hat
## define the design matrix, with the slope column first
X <- cbind(x,1)
## covariance matrix of beta_hat
cov_beta <- (sigma^2)*solve(t(X)%*%X)
## variance of beta_1_hat is the top-left entry
var_beta1hat <- cov_beta[1,1]

To find the probability

pnorm(beta_1,0,sqrt(var_beta1hat),lower.tail = FALSE)
##            x 
## 5.269751e-06
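
As a consistency check, with \({\mathrm{P}}(\hat\beta_1>b_1)\approx 5.3\times 10^{-6}\) the expected number of exceedances of \(b_1\) among 5000 simulations is about \(5000\times 5.3\times 10^{-6}\approx 0.03\), so the zero count in part (a) is exactly what the normal calculation predicts. We can also compare the Monte Carlo and theoretical standard deviations of \(\hat\beta_1\), which should agree up to Monte Carlo error.

sd(b1_sim)          # Monte Carlo standard deviation of the simulated slopes
sqrt(var_beta1hat)  # theoretical standard deviation from sigma^2 (X^T X)^{-1}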

Q3. What are \({\boldsymbol{\mathrm{b}}}\), \({\boldsymbol{\mathrm{\beta}}}\) and \(\hat{\boldsymbol{\mathrm{\beta}}}\) in questions Q1 and Q2? Explain why these two analyses with the same data and the same model (and therefore the same \({\boldsymbol{\mathrm{b}}}\)) consider different choices of \({\boldsymbol{\mathrm{\beta}}}\), and therefore different probability distributions for the vector random variable \(\hat{\boldsymbol{\mathrm{\beta}}}\). The purpose of doing the calculation in Q2 will be discussed in Chapter 6 of the notes, but you may be able to motivate what is going on by thinking back to the hypothesis tests you have done in a previous Stats class.

\({\boldsymbol{\mathrm{\beta}}}\) is the vector of the true population coefficients in the probability model; it is fixed but unknown, and it is what we want to estimate. \(\hat{\boldsymbol{\mathrm{\beta}}}\) is the unbiased estimator of \({\boldsymbol{\mathrm{\beta}}}\), the random variable \(\hat{\boldsymbol{\mathrm{\beta}}} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T{\boldsymbol{\mathrm{Y}}}\), where \({\boldsymbol{\mathrm{Y}}}\) is the vector random variable modeling the outcome. \({\boldsymbol{\mathrm{b}}}\) is the realization of \(\hat{\boldsymbol{\mathrm{\beta}}}\) obtained from our sample data by least squares, \({\boldsymbol{\mathrm{b}}}=(\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T{\boldsymbol{\mathrm{y}}}\), where \({\boldsymbol{\mathrm{y}}}\) is the observed outcome vector and \(\mathbb{X}\) is the design matrix. The two analyses choose different \({\boldsymbol{\mathrm{\beta}}}\) because they ask different questions: Q1 sets \({\boldsymbol{\mathrm{\beta}}}={\boldsymbol{\mathrm{b}}}\) in order to interpret the fitted model, while Q2 sets \(\beta_1=0\) to describe a model in which unemployment has no effect on detrended life expectancy. Under that model, \({\mathrm{P}}(\hat\beta_1>b_1)\) measures how surprising the observed slope would be if no true relationship existed, which is the logic of a hypothesis test.
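
As a small illustration (a sketch assuming the objects x, y and lm_fit created earlier), the matrix formula reproduces the coefficients computed by lm():

## design matrix with the slope column first, as in Q2b
X <- cbind(x,1)
## least squares by the matrix formula: b = (X^T X)^{-1} X^T y
b <- solve(t(X)%*%X)%*%t(X)%*%y
## compare with lm(); the numbers match, with the slope listed first
cbind(b,coef(lm_fit)[c("x","(Intercept)")])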


Hints

  1. It may help you to develop your code with a small number (say, 10) realizations rather than 5000. Once your code is debugged and working, re-run it with the full 5000.

  2. The dimension of replicate(5000,Y_sim()) is somewhat surprising. It is often a good idea to run dim() on new variables you construct to see what you are dealing with. If a matrix seems like it would make more sense to you with rows and columns switched, you can always take its transpose using t().

  3. Variables created within an R function are not returned when R evaluates the function. The only thing returned is the output, which is the last line of the function. So, when the function Y_sim() is called, only the un-named length 2 vector made up of the quantities Y1 and Y2 is returned. These vectors are packed together into a matrix by replicate(). Conceptually, variables created within a function are local to that function and are not exported to the global environment where you find all the variables you create at the R prompt. You can check what variables are in your R session by calling the function ls(). You can check, for example, that calling Y_sim() does not cause the creation of variables Y1 and Y2 in your main R session.
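
For instance, assuming Y_sim() has been defined as above:

Y_sim()        # returns the un-named length 2 vector
exists("Y1")   # FALSE: Y1 was local to Y_sim(), provided you have not created Y1 yourself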

  4. For a logical vector, sum() adds 1 for each TRUE and 0 for each FALSE, so it gives the number of times the vector is TRUE. For example,

z <- c(TRUE,TRUE,FALSE,TRUE,FALSE)  # called z here so we do not overwrite our data vector x
sum(z)
## [1] 3
  5. Q1b involves similar techniques to HW5 Q5d. You could do it using the formula for the variance of a general linear combination, but it is simpler to use \({\mathrm{Var}}(X+Y)={\mathrm{Var}}(X)+{\mathrm{Var}}(Y)+2{\mathrm{Cov}}(X,Y)\).

  6. For Q1b, \(U\) and \(V\) are normally distributed and independent. Their independence is asserted in the question. Therefore, their covariance is zero, and once you have found their means and variances you are in a good position to move forward. Note that \(b_1\), \(b_2\) and \(s\) are constants, not random variables.

  7. For Q2b, \(b_1\) is a constant. It is derived from the data, which are constant numbers, not random variables. The numeric value of \(b_1\) is calculated using least squares by lm(). \(\hat\beta_1\) is the random variable; it has a normal distribution with mean and standard deviation given in the notes. You may be surprised to find that \(b_1\) is somewhat different from the simulated values calculated in Q2a. This might make you doubt your own calculation, but it turns out to be correct; you can think about the statistical interpretation of this.



License: This material is provided under an MIT license