First, we load the data:

circulation = read.table("https://ionides.github.io/401f18/hw/hw07/circulation.txt", sep = "\t", header = T)
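As a quick sanity check (our addition, not part of the assignment), we can inspect the data to confirm that the Sunday, Weekday, and Competition columns loaded as expected:

head(circulation)  # first few rows
str(circulation)   # column names and types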

Q1

plot(Sunday~Weekday, pch=ifelse(Competition,2,1), data=circulation)

  1. Examining the plot below, we can see that the points that disappear are those with Competition == 0. R interprets the Competition column directly as a color index, so col = 1 for Competition == 1 and col = 0 otherwise. Since col = 0 corresponds to the background color (white), those points are invisible.
plot(Sunday~Weekday, col=Competition, data=circulation)
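A simple fix (our addition, not asked for in the question) is to shift the color index by one, so that Competition == 0 maps to col = 1 (black) and Competition == 1 maps to col = 2 (red):

plot(Sunday~Weekday, col=Competition+1, data=circulation)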

We plot the log-transformed data below:

circulation$log_Sunday <- log(circulation$Sunday)
circulation$log_Weekday <- log(circulation$Weekday)

plot(log_Sunday ~ log_Weekday,
     col=ifelse(Competition,"red","blue"),
     pch = ifelse(Competition,1,4),
     data=circulation)

Before taking the log transform, many of the values were clustered in the bottom-left corner. After the transform, the data are more football shaped, so it looks like it would be better to fit the linear model on the log scale.
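For a direct visual comparison (a convenience view we add here, not required by the question), the two scales can be drawn side by side:

op = par(mfrow = c(1, 2))  # two panels in one row
plot(Sunday ~ Weekday, data = circulation, main = "raw scale")
plot(log_Sunday ~ log_Weekday, data = circulation, main = "log scale")
par(op)  # restore the previous plotting settings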

Q2

lm1 = lm(log_Sunday~log_Weekday+Competition,data=circulation)
X = model.matrix(lm1)
y = circulation$log_Sunday

For the coefficients we calculate \[\textbf{b} = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \textbf{y}\]

b = solve(t(X) %*% X) %*% t(X) %*% y
b
##                   [,1]
## (Intercept) -0.4472999
## log_Weekday  1.0613296
## Competition -0.5313725

For the standard errors, we first calculate the residual standard error \(s\). We have the formula \[s = \sqrt{\frac{1}{n-p}\sum_{i = 1}^{n}(y_i - \hat{y}_i)^2} = \sqrt{\frac{1}{n-p}\textbf{e}^\top \textbf{e}}\] where \(\textbf{e}\) is the vector of residuals. We calculate this below:

e = y - X %*% b
n = nrow(X)
p = ncol(X)

s = sqrt((t(e) %*% e)/(n-p))

The standard errors for each coefficient are then obtained by calculating the matrix \((\mathbb{X}^\top\mathbb{X})^{-1}\) and multiplying the square root of the diagonal elements by the residual standard error:

XtX_inv = solve(t(X) %*% X)
std_errs = as.vector(s)*sqrt(diag(XtX_inv))
std_errs
## (Intercept) log_Weekday Competition 
##  0.35138427  0.02847685  0.06800382

Note that in the code above, we write as.vector(s)*sqrt(diag(XtX_inv)) instead of s*sqrt(diag(XtX_inv)) because R gives an error in the latter case: s is a 1-by-1 matrix (the result of a matrix multiplication), and R refuses to multiply a 1-by-1 matrix elementwise by a vector of length 3. as.vector(s) converts s to a scalar, which R then multiplies by each element of sqrt(diag(XtX_inv)).
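A toy illustration of this behavior (with arbitrarily chosen values, not part of the solution):

m = matrix(2)         # a 1-by-1 matrix, analogous to s
v = c(1, 2, 3)
# m * v               # error: non-conformable arrays
as.vector(m) * v      # works, giving 2 4 6
drop(m) * v           # drop() is an equivalent way to remove length-1 dimensions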

  1. We can see the values calculated in part (a) match the values from lm1:
b
##                   [,1]
## (Intercept) -0.4472999
## log_Weekday  1.0613296
## Competition -0.5313725
coefficients(lm1)
## (Intercept) log_Weekday Competition 
##  -0.4472999   1.0613296  -0.5313725
std_errs
## (Intercept) log_Weekday Competition 
##  0.35138427  0.02847685  0.06800382
summary(lm1)$coefficients[,2]
## (Intercept) log_Weekday Competition 
##  0.35138427  0.02847685  0.06800382
  1. We can see that the residual standard error we calculated in part (b) matches the one from lm1:
s
##           [,1]
## [1,] 0.1391926
summary(lm1)$sigma
## [1] 0.1391926

In this case, the degrees of freedom is the value used in the denominator when calculating \(s\). One way to think of the degrees of freedom is as the effective number of observations available to estimate the residual standard error. Because we estimate \(p = 3\) parameters (the coefficients), we subtract 3 from \(n = 89\), giving \(89 - 3 = 86\) degrees of freedom.
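We can verify this directly (a check we add here) using df.residual(), which reports the residual degrees of freedom of a fitted model:

n - p             # 89 - 3 = 86
df.residual(lm1)  # also 86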

Q3

The probability model is \[\textbf{Y} = \mathbb{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]

where \(\textbf{Y}\) is the vector of log Sunday circulations, \(\mathbb{X}\) is the \(89 \times 3\) design matrix from question 2, \(\boldsymbol{\beta}\) is the vector of unknown coefficients, and \(\boldsymbol{\epsilon}\) is a vector of independent \(N(0, \sigma^2)\) random variables.
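To make the model concrete, here is a minimal sketch (our addition, not part of the required answer) of simulating one realization of \(\textbf{Y}\), plugging in the estimates b and s from question 2 for \(\boldsymbol{\beta}\) and \(\sigma\); the seed is an arbitrary choice for reproducibility:

set.seed(1)  # arbitrary seed, for reproducibility
y_sim = X %*% b + rnorm(n, mean = 0, sd = as.vector(s))  # one simulated realization of Y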

Q4

Our confidence interval is given by \[b_2 \pm 1.96 \times \text{SE}(b_2)\]

where 1.96 was obtained as the 97.5th percentile of the standard normal distribution (i.e., qnorm(.975)). We can use the values calculated in question 2. We are interested in the third coefficient, since that is the one corresponding to the Competition variable:

lwr = b[3] - 1.96*std_errs[3]
upr = b[3] + 1.96*std_errs[3]
c(lwr,upr)
## Competition Competition 
##   -0.664660   -0.398085

Assuming our sample is a realization from the probability model defined in question 3, we are 95% confident that this confidence interval contains our true coefficient \(\beta_2\). What this means is that if we were to repeatedly sample from our probability model and construct a confidence interval each time, we would expect \(\beta_2\) to be in approximately 95% of the confidence intervals constructed.
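As a cross-check (our addition), R's confint() gives a similar interval; it uses t quantiles rather than the normal approximation, so its interval is slightly wider:

confint(lm1, "Competition", level = 0.95)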

Q5

  1. We have an intercept \(b_0\) when Competition == 0 and an intercept of \(b_0 + b_2\) when Competition == 1 (see part (b) for more details). We have a slope of \(b_1\) for both lines. We can plot these lines using the following code:
slope = coefficients(lm1)[2]
intercept0 = coefficients(lm1)[1]
intercept1 = coefficients(lm1)[1] + coefficients(lm1)[3]

plot(log_Sunday ~ log_Weekday,
     col=ifelse(Competition,"red","blue"),
     pch = ifelse(Competition,1,4),
     data=circulation)

abline(a = intercept0, b = slope)
abline(a = intercept1, b = slope)

On the plot above, we can see one line corresponding to Competition == 0 and another corresponding to Competition == 1, where both lines share the same slope. Thus the fitted linear model is equivalent to one with two different intercepts, corresponding to the two levels of Competition.

We can also see this from the fitted model: \(\hat{y}_i = b_0 + b_1x_{i1} + b_2x_{i2}\) for \(i = 1, \dots, 89\). When Competition is equal to 0 (i.e., \(x_{i2} = 0\)), this is equivalent to \(\hat{y}_i = b_0 + b_1x_{i1}\). When Competition is equal to 1 (i.e., \(x_{i2} = 1\)), this is equivalent to \(\hat{y}_i = b_0 + b_1x_{i1} + b_2 = (b_0 + b_2) + b_1x_{i1}\). In each case the slope is the same (\(b_1\)), while the intercept is \(b_0\) for Competition == 0 and \(b_0 + b_2\) for Competition == 1.
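As a numerical check of this reading (our addition), reconstructing the fitted values from the two lines reproduces the model's fitted values:

yhat_manual = with(circulation,
                   ifelse(Competition, intercept1, intercept0) + slope * log_Weekday)
all.equal(unname(yhat_manual), unname(fitted(lm1)))  # should be TRUE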

Q6

A causal interpretation of the coefficient for Competition would mean that being a tabloid with a competing serious newspaper results in a reduction in log Sunday circulation, holding weekday circulation fixed. A causal interpretation would allow us to predict the consequence of an intervention. For example, if a serious competitor started up, we would predict Sunday circulation (relative to weekday circulation) to decrease on the log scale by the amount of the coefficient.

To infer a causal relationship, we need to believe that we have controlled for all potential confounders, such that the only difference between the two groups is whether they are a tabloid with a competing serious newspaper or not. Often we infer causality using a randomized experiment, as we could use randomization to control for confounders; however, this is an observational study.

In this case, it does not seem plausible that we have controlled for all confounders. While we do control for log weekday circulation by including it in the linear model, there may still be other confounders unrelated to log weekday circulation. For example, newspapers in certain locations might have higher Sunday circulation and serious newspapers may be more likely to be in these locations.