Homework 8, STATS 401 F18

Your report should include the R code that you use. For calculations, please add enough explanation to help the reader understand what you did and why. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that influenced your report. As usual, a statement of Sources is required to get credit for the homework.

Making predictions and hypothesis tests.

This homework continues the analysis of the newspaper circulation data from HW7. Refer back to HW7 for a description of the data and how to download it. Start by fitting a linear model on a logarithmic scale as follows.

circulation <- read.table("circulation.txt",sep="\t",header=T)
circulation$log_Sunday <- log(circulation$Sunday)
circulation$log_Weekday <- log(circulation$Weekday)
lm1 <- lm(log_Sunday~log_Weekday+Competition,data=circulation)

Q1. Let the sample least squares model be \({\boldsymbol{\mathrm{y} } }={\mathbb{X} }{\boldsymbol{\mathrm{b} } }+{\boldsymbol{\mathrm{e} } }\) where \({\mathbb{X} }=[x_{ij}]_{n\times p}\) with \(n=89\) and \(p=3\). Write \({\mathbb{X} }=[ {\boldsymbol{\mathrm{1} } } \; {\boldsymbol{\mathrm{w} } } \; {\boldsymbol{\mathrm{c} } }]\). Here, \(y_i\) is the log Sunday circulation of the \(i\)th paper, \({\boldsymbol{\mathrm{1} } }\) is a column vector of ones, \(w_{i}\) is the log weekday circulation of the \(i\)th paper, and \(c_{i}\) is the binary explanatory variable taking value 1 for tabloids with a serious competitor.

Suppose a new Sunday edition is proposed for a serious (i.e., non-tabloid) weekday paper with a circulation of 150,000. Find the numeric value of a row vector \({\boldsymbol{\mathrm{x} } }^*\) and a column vector \({\boldsymbol{\mathrm{b} } }\) such that \({\boldsymbol{\mathrm{x^*} } }{\boldsymbol{\mathrm{b} } }\) gives the least squares prediction of the log circulation for the proposed Sunday edition.
Find a standard error for this prediction, using matrix methods, and hence use a normal approximation to find a confidence interval for the expected logarithm of Sunday circulation.
Figure out how to use the predict() function to get a confidence interval for the expected log Sunday circulation. If you look at ?predict you will see that predict() is a generic function. This means that predict(object,...) will look at the class of object when working out what to do. Here, we call predict(lm1,...) where lm1 has class lm and so predict() should follow the rules for prediction using a fitted linear model. When predict(lm1,...) sees that lm1 has class lm, it passes on its work to a function called predict.lm(). Therefore, ?predict.lm gives you the necessary syntax for calling predict(lm1,...). In particular, you will have to look through the arguments described in ?predict.lm to see which are relevant to this particular question. Once you have a working call to predict() it becomes easier to adapt it to make it a correct answer to the question. You might like to use the following code as a starting point, though this is not intended to directly answer the question.

predict(lm1,newdata=data.frame(log_Weekday=12,Competition=1))

##        1 
## 11.75728

When you are comfortable with the methods for making predictions and confidence intervals with linear models, using predict() can save time. It is much easier to use a high-level function like predict() accurately and well once you have an understanding of the task you want it to carry out. For example, you should bear in mind that there are two related but different prediction problems (finding an interval estimate of the mean at a new value of the explanatory variables, and finding an interval estimate for a new observation measured at that set of explanatory variables). Somehow, predict() will have to be told which of these prediction problems you want to solve.

Your confidence interval from (c) should be close to your answer from (b) but may be slightly different. Looking at these numbers, do you think predict.lm() uses a normal approximation or a t distribution? Explain your reasoning.

Q2. Explain some things you would look for to check whether your confidence interval from Q1(b) is trustworthy.

Q3. We now turn to the problem of finding a prediction interval for the circulation of the proposed new Sunday edition.

Explain how to use the data vector \({\boldsymbol{\mathrm{y} } }\), the design matrix \({\mathbb{X} }\), and a row vector \({\boldsymbol{\mathrm{x} } }^*\) to construct a normal approximation prediction interval that seeks to cover the logarithm of the circulation of the proposed new Sunday edition with a probability of 95%.
Carry out this matrix computation in R.
Figure out how to use predict() to get a comparable prediction interval.
Comment on what “probability of 95%” means in the context of part (b). Is the introduction of a new Sunday newspaper a repeatable experiment? How does the interpretation of a probability statement depend on whether one can carry out repetitions?

Q4. Transforming data, such as by taking the logarithm, is a useful technique to build a good statistical model but the transformation has to be undone (inverted) if you want to pull answers back to the scale of the original data. The inverse of the natural logarithm is the natural exponent. In math, \[ \log(e^x)=e^{\log(x)}=x \] In R,

log(exp(5))

## [1] 5

exp(log(5))

## [1] 5

Construct a prediction interval for Sunday circulation. This involves moving the prediction interval in Q3(b) for log Sunday cirulation back to the original un-logged scale.

Q5. How compelling is the evidence that there is a real difference between newspapers which are tabloid with a serious competitor (Competition==1) and the other papers (Competition==0)? Address this question by specifying a suitable null hypothesis and making an appropriate hypothesis test.

License: This material is provided under an MIT license
Acknowledgement: The circulation data come from S. J. Sheather (2009) “A Modern Approach to Regression with R.”

Homework 8, STATS 401 F18

Due in lab on 11/9

Making predictions and hypothesis tests.