For your homework report this week, you do not need to submit solutions to the swirl lessons. Please report whether you successfully completed them, and discuss any issues that arose in the Sources and Please explain statements. For the data analysis exercise, write a brief report including the output you are asked to compute and the R code you used to generate it. Recall that you are permitted to collaborate, or to use internet resources including the source code for the textbook at http://www.stat.tamu.edu/~sheather/book/, but you must list all sources that make a substantial contribution to your report.


Building the design matrix for a linear model

  1. Start by reading Section 1.2.1 of the textbook. This section describes the data and the models we will consider, and then goes on to discuss p-values, which we have not yet introduced in class. We will get to them soon. Look at the data at https://ionides.github.io/401w18/hw/hw03/FieldGoals2003to2006.csv. The data are described in the header of that file.

  2. Read the data into R (as in Homework 1).

data_nfl <- read.csv("https://ionides.github.io/401w18/hw/hw03/FieldGoals2003to2006.csv",header = TRUE,skip=5)
  3. Let \(i\) correspond to the \(i\)th row of the data table. Recall that each kicker has \(4\) rows for the seasons 2003, 2004, 2005, 2006. Let \(y_i\) be the field goal average for row \(i\), and \(x_i\) be the average for the previous season. A simple linear regression model predicting performance based on the previous season is

    (SLR) \(\hspace{1cm}y_i = m x_i + c + e_i, \hspace{1cm}\) for \(i=1,\dots,4\times 19\).

    We can write this in matrix form as \[ {\boldsymbol{\mathrm{y}}} = {\mathbb{X}}{\boldsymbol{\mathrm{b}}} + {\boldsymbol{\mathrm{e}}}.\] Construct the matrix \({\mathbb{X}}\) for model (SLR) in R. Use the material in Chapter 3 of the notes to obtain the least squares values of \(m\) and \(c\) in model (SLR) via matrix multiplication in R. Plot the data, together with the fitted values constructed using a matrix multiplication formula. This should look like Figure 1.2 of Sheather, with the addition of a single line corresponding to a fit for all kickers.
y <- data_nfl$FGt
X <- cbind(data_nfl$FGtM1,rep(1,length(data_nfl$FGtM1)))
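# Placeholder naming the entries of b (slope m, then intercept c); overwritten by the estimate below
b <- c("m","c")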
#Confirm that you have defined the correct matrices
head(y); head(X)
## [1] 73.5 93.9 80.0 89.4 82.7 84.3
##      [,1] [,2]
## [1,] 90.0    1
## [2,] 73.5    1
## [3,] 93.9    1
## [4,] 80.0    1
## [5,] 88.2    1
## [6,] 82.7    1
b <- solve( t(X) %*% X ) %*% t(X) %*% y
b
##            [,1]
## [1,] -0.1509583
## [2,] 94.6097871
plot(x=data_nfl$FGtM1,y=jitter(y),xlab = "Av Field Goals in previous year (t-1)", ylab = "Av Field Goals in current year (t)")
abline(a=b[2],b=b[1],col="red")
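
As a sanity check on the matrix formula used for model (SLR), the following is a minimal sketch on a small made-up data set (the values of `x` and `y` are hypothetical and are not the NFL data). It solves the normal equations with `solve(crossprod(X), crossprod(X, y))`, which avoids forming the inverse explicitly, and compares the result with `lm()`. This is one possible way to check the calculation, not the only one.

```r
# Toy data: hypothetical values, not the NFL data set
x <- c(70, 75, 80, 85, 90, 95)
y <- c(88, 84, 86, 80, 83, 78)

# Design matrix with the slope column first and the intercept column second,
# matching the column order used for model (SLR)
X <- cbind(x, 1)

# Solve the normal equations (X'X) b = X'y without forming the inverse
b_toy <- solve(crossprod(X), crossprod(X, y))

# Fitted values obtained by matrix multiplication
fitted_toy <- X %*% b_toy

# Cross-check against lm(), which reports the intercept first
fit_toy <- lm(y ~ x)
all.equal(as.numeric(b_toy), unname(coef(fit_toy)[c("x", "(Intercept)")]))
all.equal(as.numeric(fitted_toy), unname(fitted(fit_toy)))
```

Both comparisons are expected to agree up to numerical tolerance, since `lm()` computes the same least squares fit by a different numerical route.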

  4. Now, we build a linear model for predicting performance based on the previous season where each kicker has their own intercept. This model can be written mathematically as

    (LM) \(\hspace{1cm}y_i = m x_i + c_1 z_{i,1} + c_2 z_{i,2} +\dots+c_{19}z_{i,19} + e_i, \hspace{1cm}\) for \(i=1,\dots,4\times 19\).

    where \(z_{i,1}\) takes the value \(1\) when row \(i\) of the data corresponds to kicker 1 (i.e., for \(i=1,2,3,4\)) and \(0\) otherwise, \(z_{i,2}\) takes the value \(1\) when \(i\) corresponds to kicker 2 (i.e., for \(i=5,6,7,8\)) and \(0\) otherwise, and so on. This means that \(c_k\) is the intercept of a linear model for kicker \(k\), for \(k=1,\dots,19\), where the lines for all the kickers share the same slope, \(m\). To write model (LM) in the form \({\boldsymbol{\mathrm{y}}} = {\mathbb{X}}{\boldsymbol{\mathrm{b}}} + {\boldsymbol{\mathrm{e}}}\), we need \({\mathbb{X}}=[{\boldsymbol{\mathrm{x}}},{\boldsymbol{\mathrm{z}}}_1,\dots,{\boldsymbol{\mathrm{z}}}_{19}]\) where \({\boldsymbol{\mathrm{x}}}\) is a column vector containing \((x_1,\dots,x_{76})\) and \({\boldsymbol{\mathrm{z}}}_k\) is a column vector containing \((z_{1,k},\dots,z_{76,k})\) for \(k=1,\dots,19\). The vector \({\boldsymbol{\mathrm{z}}}_k\) can be called a dummy variable for kicker \(k\). Dummy variables are explanatory variables that we construct in order to allow a coefficient (here, the intercept) to be estimated separately for different subsets of the data (here, the intercept is estimated separately for each kicker).

    Construct the \({\mathbb{X}}\) matrix for model (LM) in R. A direct way is to build each of the twenty columns, with the help of rep(), and glue them together with cbind().
Z <- matrix(0,nrow=19*4,ncol=19)
for(k in 1:19){Z[,k]<-c(rep(c(0,0,0,0),k-1),rep(1,4),rep(c(0,0,0,0),19 - k))}
# or, equivalently, set the block of four ones for kicker k by indexing
for(k in 1:19){Z[,k][(4*(k-1)+1) : (4*k)] <- rep(1,4)}
X <- cbind(data_nfl$FGtM1,Z)
b <- solve( t(X) %*% X ) %*% t(X) %*% y
m <- b[1]; m
## [1] -0.5037008

There are other succinct ways to construct this matrix, and you can look for them if you wish. Report whether your least squares estimate of \(m\), constructed using the design matrix \({\mathbb{X}}\), matches the value of -0.504 in Figure 1.2 of Sheather.
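
Two of the more succinct constructions can be sketched as follows. This is an illustration under the assumption, taken from the text, that the rows are ordered so that each kicker occupies four consecutive rows; the names `Z_kron`, `Z_fac` and `kicker` are introduced here for illustration only.

```r
n_kickers <- 19  # number of kickers, as in the text
n_seasons <- 4   # rows per kicker, as in the text

# Option 1: Kronecker product of an identity matrix with a column of ones.
# Column k has ones in rows 4(k-1)+1, ..., 4k and zeros elsewhere.
Z_kron <- kronecker(diag(n_kickers), matrix(1, nrow = n_seasons, ncol = 1))

# Option 2: build a factor labelling the kicker for each row (a hypothetical
# label, constructed here with rep()) and let model.matrix() expand it into
# dummy variables; "~ 0 +" suppresses the usual intercept column.
kicker <- factor(rep(1:n_kickers, each = n_seasons))
Z_fac <- model.matrix(~ 0 + kicker)

dim(Z_kron)           # rows and columns of the dummy variable block
all(Z_kron == Z_fac)  # do the two constructions agree entry by entry?
```

Either matrix can then be glued to the previous-season column with `cbind()`, in the same way as `Z` above.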

  5. Can you use the fitted values and/or the least squares coefficient estimates that you obtained for model (LM) above to reproduce the parallel lines in Figure 1.2 of Sheather? This is an optional additional task, to be carried out if you have time. It will not be counted toward the “completeness” grade for your homework.
plot(x=data_nfl$FGtM1,y=jitter(y),xlab = "Av Field Goals in previous year (t-1)", ylab = "Av Field Goals in current year (t)")
# palette(rainbow(19)) returns the previous palette, so index rainbow(19) directly
cols <- rainbow(19)
for(i in 1:19){abline(a=b[i+1],b=b[1],col=cols[i])}
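
An alternative to `abline()` is to draw the lines from the fitted values \({\mathbb{X}}{\boldsymbol{\mathrm{b}}}\) directly, so that each line covers only the range of that kicker's data. The sketch below does this on simulated data with three hypothetical kickers (not the NFL data), and compares the matrix calculation with `lm()`; the intercepts, slope and noise used in the simulation are arbitrary choices made for illustration.

```r
set.seed(1)  # simulated data, not the NFL data set
K <- 3       # hypothetical number of kickers
n_per <- 4   # rows per kicker
kicker <- factor(rep(1:K, each = n_per))
x <- round(runif(K * n_per, 65, 95), 1)
# Arbitrary kicker-specific intercepts and a common slope of -0.5, plus noise
y <- c(110, 120, 130)[kicker] - 0.5 * x + rnorm(K * n_per)

# Design matrix for model (LM): slope column, then one dummy column per kicker
X_lm <- cbind(x, model.matrix(~ 0 + kicker))
b_lm <- solve(crossprod(X_lm), crossprod(X_lm, y))
fitted_lm <- as.numeric(X_lm %*% b_lm)

# Plot the data and join each kicker's fitted values in order of x;
# the segments are parallel because they share the slope b_lm[1]
plot(x, y, col = as.integer(kicker),
     xlab = "Previous season (t-1)", ylab = "Current season (t)")
for (k in 1:K) {
  rows <- which(kicker == k)
  ord <- rows[order(x[rows])]
  lines(x[ord], fitted_lm[ord], col = k)
}

# Cross-check: lm() with no global intercept and one intercept per kicker
fit_lm <- lm(y ~ 0 + x + kicker)
all.equal(as.numeric(b_lm), unname(coef(fit_lm)))
```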