Your report should include the R code that you use. For calculations, please add enough explanation to help the reader understand what you did and why. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that influenced your report. As usual, a statement of Sources is required to get credit for the homework.



Kicking goals: linear models with factors.

We consider the field goal kicking data introduced in Chapter 7 of the notes. The primary question under investigation is whether an athlete that has a good season, is likely to perform above or below their typical level in the next season. There is also the possibility that the previous season has no predictive skill. We consider field goal kicking success for the 19 National Football League (NFL) kickers who played every season during 2002-2006.

download.file(destfile="FieldGoals.csv", 
  url="https://ionides.github.io/401f18/hw/hw09/FieldGoals.csv")
goals <- read.table("FieldGoals.csv",header=TRUE,sep=",")
head(goals[,1:8]) 
##             Name Yeart Teamt FGAt  FGt Teamt1 FGAt1 FGt1
## 1 Adam Vinatieri  2003    NE   34 73.5     NE    30 90.0
## 2 Adam Vinatieri  2004    NE   33 93.9     NE    34 73.5
## 3 Adam Vinatieri  2005    NE   25 80.0     NE    33 93.9
## 4 Adam Vinatieri  2006   IND   19 89.4     NE    25 80.0
## 5    David Akers  2003   PHI   29 82.7    PHI    34 88.2
## 6    David Akers  2004   PHI   32 84.3    PHI    29 82.7

Q1. Let \(i\) correspond to the \(i\)th row of the data table. Recall that each kicker has \(4\) rows for the seasons 2003, 2004, 2005, 2006. Let \(y_i\) be the percentage of successful field goal average for row \(i\), denoted by the FGt variable in goals. Let \(x_i\) be the percentage for the previous season, denoted by FGt1. A simple linear regression model predicting performance based on the previous season (in the sample form) is \[ y_i = a x_i + b + e_i, \quad \mbox{for $i=1,\dots,4\times 19$}. \]

  1. Use R to find the least squares values of \(a\) and \(b\).

  2. Make a scatterplot of \(y_i\) against \(x_i\) and add the fitted line \(y=ax+b\).

  3. Is the value of \(a\) large enough to convince you that it can’t reasonably be explained by chance variation? Explain your reasoning. This should include a formal statistical test.

Q2. Now, we consider a linear model for predicting performance based on the previous season where each kicker has their own intercept. In double subscript notation this is \[ y_{ij} = a x_{ij} + b_i + e_{ij}, \quad\mbox{for $i=1,\dots,19$ and $j=1,\dots,4$}, \] where \(x_{ij}=y_{i,j-1}\) with \(y_{i0}\) corresponding to the successful field goal percentage in 2002. This gives a different parameterization from the model corresponding to lm(FGt~FGt1+Name,data=goals) that is investigated in the Chapter 7 notes.

  1. Write this model in matrix form, \({\boldsymbol{\mathrm{y} } }={\mathbb{X} }{\boldsymbol{\mathrm{b} } }+{\boldsymbol{\mathrm{e} } }\). You will find that \({\mathbb{X} }\) has many 0 and 1 entries. It is not practical to write all the entries of \({\mathbb{X} }\) so judicious use of “\(\dots\)” is appropriate.

  2. Construct the design matrix \({\mathbb{X} }\) in R. There are many ways to do this. A direct approach is to build each column using rep() and stick them together using cbind(). If you don’t want to build each of the columns separately, one way to code this more compactly is via the R function for(), using for(i in 1:19){...} and then inserting code to build the column corresponding to the ith dummy variable. Using for() is optional, but if you choose to try it you might like to look at ?Control for documentation and examples. You are asked not to solve this problem indirectly by calling model.matrix() on a fitted linear model since our goal here is to build the design matrix by hand and then cross-check it against the lm() output.

  3. Check that the design matrix matches the design matrix used by lm(FGt~FGt1+Name-1,data=goals). The -1 in the model formula removes the intercept term. It is not obvious that asking R to remove the intercept term will lead R to fit exactly the model above, so this gives a situation where checking the design matrix is a good way to explore how to use lm() correctly. The order of the columns in the two design matrices may differ, but this is inconsequential for the model.

  4. Plot all the fitted lines, \[ y = ax+b_i, \quad\mbox{for $i=1,\dots,19$} \] on top of a scatterplot of \(y_{ij}\) against \(x_{ij}\). This should look identical to the corresponding plot in Chapter 7 of the notes, with a parallel line for each kicker.

  5. Comment on your interpretation of the difference between the fitted lined in Q2(d) and Q1(b).

Q3. Suppose you want to address the question of whether there is convincing evidence that Shayne Graham (\(i=19\)) is a better kicker than Jay Feely (\(i=5\)). We can rephrase this to ask if it is plausible that the difference in their kicking record could just be due to small differences in luck between two players of equal skill.

  1. Write out a probability model for which this null hypothesis is \(H_0: \beta_5=\beta_{19}\).

  2. Find the mean and estimated variance of \(\hat\beta_{19}-\hat\beta_{5}\) in the context of this model and hypothesis. You can do this one of two ways, either (i) construct a row vector \({\boldsymbol{\mathrm{x} } }\) such that \(\hat\beta_{19}-\hat\beta_{5}={\boldsymbol{\mathrm{x} } }{\boldsymbol{\mathrm{\hat\beta} } }\) and use the formulas for the mean and variance/covariance matrix of \({\mathbb{A} }{\boldsymbol{\mathrm{X} } }\) with \({\boldsymbol{\mathrm{X} } }={\boldsymbol{\mathrm{\hat\beta} } }\), or (ii) use the formulas for the mean and variance of \(X-Y\) with \(X=\hat\beta_{19}\) and \(Y=\hat\beta_{5}\). Either way, you will need to work with the variance/covariance matrix of \({\boldsymbol{\mathrm{\hat\beta} } }\) which you may obtain either from appropriate output of lm() or by direct matrix calculation. At some point, you will have to replace the standard deviation of the measurement errors with its estimate from the sample.

  3. Use a standard error based on this variance to carry out a normal approximation test of \(H_0\) with the test statistic being \(b_{19}-b_{5}\).

  4. Implicit in the hypothesis test developed above is a causal interpretation: if there is evidence that \(b_{19}\) larger than \(b_{5}\) beyond what can be explained by chance variation then we are tempted to conclude that Shayne Graham is a more skillfull kicker than Jay Feely. Explain the limitations and disclaimers that should be understood for this causal interpretation.



License: This material is provided under an MIT license
Acknowledgement: The field goal data come from S. J. Sheather (2009) “A Modern Approach to Regression with R.”