Your report should include the R code that you use. For calculations, please add enough explanation to help the reader understand what you did and why. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that influenced your report. As usual, a statement of Sources is required to get credit for the homework.
We consider the field goal kicking data introduced in Chapter 7 of the notes. The primary question under investigation is whether an athlete who has a good season is likely to perform above or below their typical level in the next season. There is also the possibility that the previous season has no predictive skill. We consider field goal kicking success for the 19 National Football League (NFL) kickers who played every season during 2002-2006.
download.file(destfile="FieldGoals.csv",
url="https://ionides.github.io/401f18/hw/hw09/FieldGoals.csv")
goals <- read.table("FieldGoals.csv",header=TRUE,sep=",")
head(goals[,1:8])
## Name Yeart Teamt FGAt FGt Teamt1 FGAt1 FGt1
## 1 Adam Vinatieri 2003 NE 34 73.5 NE 30 90.0
## 2 Adam Vinatieri 2004 NE 33 93.9 NE 34 73.5
## 3 Adam Vinatieri 2005 NE 25 80.0 NE 33 93.9
## 4 Adam Vinatieri 2006 IND 19 89.4 NE 25 80.0
## 5 David Akers 2003 PHI 29 82.7 PHI 34 88.2
## 6 David Akers 2004 PHI 32 84.3 PHI 29 82.7
Q1. Let \(i\) correspond to the \(i\)th row of the data table. Recall that each kicker has \(4\) rows, for the seasons 2003, 2004, 2005, 2006. Let \(y_i\) be the field goal success percentage for row \(i\), given by the FGt variable in goals, and let \(x_i\) be the corresponding percentage for the previous season, given by FGt1. A simple linear regression model predicting performance from the previous season (in its sample form) is \[
y_i = a x_i + b + e_i, \quad \mbox{for $i=1,\dots,4\times 19$}.
\]
lm1 <- lm(FGt~FGt1,data = goals)
summary(lm1)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.6097871 10.2525316 9.227944 6.18095e-14
## FGt1 -0.1509583 0.1248457 -1.209159 2.30451e-01
Thus, \(a=-0.1509583\) and \(b=94.6097871\).
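As a sanity check, the same estimates can be recovered from the usual least-squares formulas; the following is a minimal sketch using the goals data loaded above.
# Slope: sample covariance over sample variance of the predictor;
# intercept: mean(y) minus slope times mean(x).
a_hat <- cov(goals$FGt1, goals$FGt)/var(goals$FGt1)
b_hat <- mean(goals$FGt) - a_hat*mean(goals$FGt1)
c(slope=a_hat, intercept=b_hat)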
plot(goals$FGt1, goals$FGt, xlab="FGt1: percentage for previous year",
  ylab="FGt: field goal success percentage", main="Plot of FGt vs FGt1")
abline(summary(lm1)$coefficients[,"Estimate"],col="blue")
summary(lm1)
##
## Call:
## lm(formula = FGt ~ FGt1, data = goals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4350 -7.0576 0.6933 5.3824 18.7047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.6098 10.2525 9.228 6.18e-14 ***
## FGt1 -0.1510 0.1248 -1.209 0.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.723 on 74 degrees of freedom
## Multiple R-squared: 0.01937, Adjusted R-squared: 0.006123
## F-statistic: 1.462 on 1 and 74 DF, p-value: 0.2305
We can see this from the lm() output above. The sample t statistic for the coefficient on FGt1 is -1.21, which the t-test compares with the t distribution on \((76-2)=74\) degrees of freedom to give a p-value of 0.23. Since this is greater than 0.05, we do not have sufficient evidence to reject the null hypothesis that \(a=0\); it is plausible that the observed value of \(a\) is in fact due to chance variation.
Since FGt1 is the only covariate, the t-test is equivalent to the F-test for the model, as can be seen from the identical p-values in the lm() summary. The sample F statistic is 1.462 (the square of the t statistic) with p-value 0.2305, which is again greater than 0.05.
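As a quick numerical check of this equivalence (a small sketch using the lm1 fit above), the squared t statistic reproduces the F statistic:
# With a single covariate, t^2 equals the F statistic reported by summary()
t_stat <- summary(lm1)$coefficients["FGt1","t value"]
c(t_squared=t_stat^2, F_stat=unname(summary(lm1)$fstatistic["value"]))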
Q2. Now, we consider a linear model for predicting performance based on the previous season where each kicker has their own intercept. In double subscript notation this is \[
y_{ij} = a x_{ij} + b_i + e_{ij}, \quad\mbox{for $i=1,\dots,19$ and $j=1,\dots,4$},
\] where \(x_{ij}=y_{i,j-1}\) with \(y_{i0}\) corresponding to the successful field goal percentage in 2002. This gives a different parameterization from the model corresponding to lm(FGt~FGt1+Name,data=goals)
that is investigated in the Chapter 7 notes.
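Although the parameterization differs, the two formulas describe the same model. As a quick check (a sketch comparing the two fits), both give identical fitted values:
# Baseline-plus-contrasts parameterization (as in the notes) versus one intercept per kicker
fit_ref <- lm(FGt~FGt1+Name, data=goals)
fit_own <- lm(FGt~FGt1+Name-1, data=goals)
all.equal(unname(fitted(fit_ref)), unname(fitted(fit_own)))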
In this model, \(\mathbf{y}\) is the 76-vector whose \(i\)th entry is the field goal success percentage for row \(i\) of the data, \(\mathbf{b}=(a,b_1,b_2,\dots,b_{19})\) is the 20-vector of least-squares fitted coefficients, and \(\mathbf{e}\) is the 76-vector of residual errors. \(\mathbb{X}\) is the design matrix with 76 rows and 20 columns, \(\mathbb{X} = \begin{bmatrix} | & | & \dots & | \\ \mathbf{x}_1 & \mathbf{x}_2 & \dots & \mathbf{x}_{20} \\ | & | & \dots & | \end{bmatrix}\). The first column \(\mathbf{x}_1\) corresponds to FGt1: its \(i\)th entry is the \(i\)th entry of FGt1 in the dataset. Columns 2-20, that is \(\mathbf{x}_2,\dots,\mathbf{x}_{20}\), are dummy variables, one for each of the 19 kickers, with the dummy for kicker \(i\) in column \(\mathbf{x}_{i+1}\). Since each kicker contributes 4 seasons of data, each dummy column contains four 1's and 72 0's. The first kicker occupies the first four rows of the dataset, so \(\mathbf{x}_2=(\underbrace{1,\dots,1}_\text{4 times},\underbrace{0,\dots,0}_\text{72 times})\); similarly, \(\mathbf{x}_3 = (\underbrace{0,\dots,0}_\text{4 times},\underbrace{1,\dots,1}_\text{4 times},\underbrace{0,\dots,0}_\text{68 times})\). In general, the \(i-1\) kickers listed before kicker \(i\) occupy the first \(4(i-1)\) rows, kicker \(i\) occupies rows \(4(i-1)+1\) through \(4i\), and the remaining \(19-i\) kickers occupy the last \(4(19-i)\) rows. Therefore the dummy column for kicker \(i\) is \(\mathbf{x}_{i+1} = (\underbrace{0,\dots,0}_\text{$4(i-1)$ times},\underbrace{1,\dots,1}_\text{4 times},\underbrace{0,\dots,0}_\text{$4(19-i)$ times})\), and we have \[
\mathbb{X} =
\begin{bmatrix}
| & | & \dots & | \\
\textbf{x}_1 & \textbf{x}_2 & \dots & \textbf{x}_{20} \\
| & | & \dots & |
\end{bmatrix}=
\begin{bmatrix}
90.0 & 1 & 0 & 0 & & \dots & & 0 \\
73.5 & 1 & 0 & 0 & & \dots & & 0 \\
93.9 & 1 & 0 & 0 & & \dots & & 0\\
80.0 & 1 & 0 & 0 & & \dots & & 0\\
88.2 & 0 & 1 & 0 & & \dots & & 0\\
82.7 & 0 & 1 & 0 & & \dots & & 0\\
84.3 & 0 & 1 & 0 & & \dots & & 0\\
72.7 & 0 & 1 & 0 & & \dots & & 0\\
72.2 & 0 & 0 & 1 & & \dots & & 0\\
87.0 & 0 & 0 & 1 & & \dots & & 0\\
85.2 & 0 & 0 & 1 & & \dots & & 0\\
75.0 & 0 & 0 & 1 & & \dots & & 0\\
\vdots & \vdots & \vdots & \vdots & \ddots & & & \vdots\\
\vdots & \vdots & \vdots & \vdots & & \ddots & & \vdots\\
\vdots & \vdots & \vdots & \vdots & & & \ddots & \vdots \\
72.2 & 0 & 0 & 0 & & \dots & & 1\\
88.0 & 0 & 0 & 0 & & \dots & & 1\\
87.0 & 0 & 0 & 0 & & \dots & & 1\\
87.5 & 0 & 0 & 0 & & \dots & & 1
\end{bmatrix}
\]
One way to construct this design matrix in R is to define each column using rep() and stick them together using cbind(). If you don't want to build each of the columns separately, one way to code this more compactly is via the R function for(), using for(i in 1:19){...} and then inserting code to build the column corresponding to the ith dummy variable. Using for() is optional, but if you choose to try it you might like to look at ?Control for documentation and examples. You are asked not to solve this problem indirectly by calling model.matrix() on a fitted linear model, since our goal here is to build the design matrix by hand and then cross-check it against the lm() output. Let us define \(\mathbb{X}\) in two ways: first, by defining each column using rep() and pasting the columns together with cbind(); second, by defining the matrix directly using a for() loop.
## Method 1
# Define each column separately
dummy_1 <- rep(c(1,0),c(4,76-4))
dummy_2 <- rep(c(0,1,0),c(4,4,76-2*(4)))
dummy_3 <- rep(c(0,1,0),c(2*4,4,76-3*(4)))
dummy_4 <- rep(c(0,1,0),c(3*4,4,76-4*(4)))
dummy_5 <- rep(c(0,1,0),c(4*4,4,76-5*(4)))
dummy_6 <- rep(c(0,1,0),c(5*4,4,76-6*(4)))
dummy_7 <- rep(c(0,1,0),c(6*4,4,76-7*(4)))
dummy_8 <- rep(c(0,1,0),c(7*4,4,76-8*(4)))
dummy_9 <- rep(c(0,1,0),c(8*4,4,76-9*(4)))
dummy_10 <- rep(c(0,1,0),c(9*4,4,76-10*(4)))
dummy_11 <- rep(c(0,1,0),c(10*4,4,76-11*(4)))
dummy_12 <- rep(c(0,1,0),c(11*4,4,76-12*(4)))
dummy_13 <- rep(c(0,1,0),c(12*4,4,76-13*(4)))
dummy_14 <- rep(c(0,1,0),c(13*4,4,76-14*(4)))
dummy_15 <- rep(c(0,1,0),c(14*4,4,76-15*(4)))
dummy_16 <- rep(c(0,1,0),c(15*4,4,76-16*(4)))
dummy_17 <- rep(c(0,1,0),c(16*4,4,76-17*(4)))
dummy_18 <- rep(c(0,1,0),c(17*4,4,76-18*(4)))
dummy_19 <- rep(c(0,1,0),c(18*4,4,76-19*(4)))
# Define X
X1 <- cbind(goals$FGt1,dummy_1,dummy_2,dummy_3,dummy_4,dummy_5,dummy_6,dummy_7,dummy_8,dummy_9,dummy_10,dummy_11,dummy_12,dummy_13,dummy_14,dummy_15,dummy_16,dummy_17,dummy_18,dummy_19)
## Method 2
X2 <- matrix(nrow=76, ncol=20)
X2[,1] <- goals$FGt1 # First column corresponds to FGt1
for(i in 1:19){X2[,i+1] <- rep(c(0,1,0),c((i-1)*4,4,76-i*4))}
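# Optional sanity checks on the hand-built matrix: 76 rows and 20 columns,
# each dummy column contains exactly four 1's, and each row activates exactly one dummy.
stopifnot(dim(X2) == c(76,20),
  all(colSums(X2[,-1]) == 4),
  all(rowSums(X2[,-1]) == 1))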
#####
# Alternatively, rep() can be used in the following way.
# The two constructions below give the same output for any fixed i in 1,...,19;
# here we check them at a randomly chosen value of i.
i <- floor(runif(1,1,20))
vec1 <- rep(c(0,1,0),c((i-1)*4,4,76-i*4))
vec2 <- rep(rep(c(0,1,0),c(i-1,1,19-i)), each=4)
identical(vec1, vec2)
## [1] TRUE
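For completeness, the dummy block can also be written as a Kronecker product: it is the \(19\times 19\) identity matrix with each entry expanded into a column of four copies. This is just a sketch of an alternative construction; X2 above is what we use below.
# Alternative construction: expand each entry of diag(19) into a 4x1 block of that value
X3 <- cbind(goals$FGt1, kronecker(diag(19), matrix(1,4,1)))
all.equal(X2, X3, check.attributes=FALSE)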
The model above can be fitted in R using lm(FGt~FGt1+Name-1,data=goals). The -1 in the model formula removes the intercept term. It is not obvious that asking R to remove the intercept term will lead R to fit exactly the model above, so this gives a situation where checking the design matrix is a good way to explore how to use lm() correctly. The order of the columns in the two design matrices may differ, but this is inconsequential for the model. Let us now check whether the two matrices constructed above are equal to each other and to the matrix extracted from lm() using model.matrix().
## Fit regression model
lm2 <- lm(FGt~FGt1+Name-1,data=goals)
## Check matrix equivalence
X1X2 <- all.equal(X1,X2,check.attributes=FALSE, check.names=FALSE)
X1mmat <- all.equal(X1,model.matrix(lm2),check.attributes=FALSE, check.names=FALSE)
X2mmat <- all.equal(X2,model.matrix(lm2),check.attributes=FALSE, check.names=FALSE)
cat(" X1 and X2 are equal:", X1X2, "\n", "X1 and model.matrix(lm2) are equal:", X1mmat,
"\n", "X2 and model.matrix(lm2) are equal:", X2mmat)
## X1 and X2 are equal: TRUE
## X1 and model.matrix(lm2) are equal: TRUE
## X2 and model.matrix(lm2) are equal: TRUE
# We could also check directly by computing the fraction of matching entries, which should be 1
X1X2 <- sum(X1==X2)/(20*76)
X1mmat <- sum(X1==model.matrix(lm2))/(20*76)
X2mmat <- sum(X2==model.matrix(lm2))/(20*76)
# Plot the data together with the 19 fitted lines from lm2 (one intercept per kicker,
# common slope) and, in blue, the single fitted line from lm1.
plot(FGt~FGt1,data=goals)
intercept <- as.numeric(lm2$coefficients[2:20])
slope <- lm2$coefficients[1]
for(i in 1:19){abline(a=intercept[i], b=slope)}
abline(summary(lm1)$coefficients[,"Estimate"],col="blue")
# Compare the fitted intercepts from lm2 with each kicker's mean FGt,
# which are the coefficients of lm(FGt~Name-1).
int <- c(); for(i in 1:19){int <- c(int, mean(goals[goals$Name==levels(goals$Name)[i],"FGt"]))}
int
summary(lm1)$coefficients
summary(lm2)
lm(FGt ~ Name-1, data=goals); int
# Average of the 19 fitted intercepts from lm2
mean(summary(lm2)$coefficients[2:20,"Estimate"])
Q3. Suppose you want to address the question of whether there is convincing evidence that Shayne Graham (\(i=19\)) is a better kicker than Jay Feely (\(i=5\)). We can rephrase this to ask if it is plausible that the difference in their kicking record could just be due to small differences in luck between two players of equal skill.
The probability model is the one corresponding to the sample model in Q2 above. That is, \(\mathbf{Y} = \mathbb{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\), where \(\mathbf{Y}\) is the random vector modeling the field goal success percentages, \(\mathbb{X}\) is the design matrix described in Q2, and \(\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_{19})\) is the vector of true but unknown population coefficients, with \(\beta_0\) the slope on FGt1 and \(\beta_i\) the intercept for kicker \(i\), \(i=1,\dots,19\). Lastly, \(\boldsymbol{\epsilon} = (\epsilon_1,\dots,\epsilon_{76})\) is the vector of iid measurement errors, \(\epsilon_i \sim \mathcal{N}(0,\sigma^2)\) for \(i=1,2,\dots,76\).
The null hypothesis is that the intercept for kicker 5 equals the intercept for kicker 19, that is, \(H_0: \beta_5 = \beta_{19}\).
The calculation can be carried out either via lm() or by direct matrix calculation. At some point, we have to replace the standard deviation of the measurement errors with its estimate from the sample. We can find the expected value and variance of \(\hat{\beta}_{19} - \hat{\beta}_5\) using standard formulas. Note that we can get the estimated coefficients and the variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\) from the lm2 output as follows:
betahat <- lm2$coefficients
# Cov using output of lm
cov_betahat <- summary(lm2)$cov.unscaled
cov_betahat <- (summary(lm2)$sigma)^2 *cov_betahat
# Cov using matrix multiplication
X<- model.matrix(lm2)
cov_betaHat <- (summary(lm2)$sigma)^2 * solve(t(X)%*%X)
# Check that they are the same
all.equal(cov_betahat,cov_betaHat)
## [1] TRUE
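Similarly, the coefficient estimates themselves can be cross-checked against the normal equations; this is a sketch using the design matrix X extracted above.
# Least-squares estimates via the normal equations, (X^T X)^{-1} X^T y
betahat_direct <- solve(t(X)%*%X) %*% t(X) %*% goals$FGt
all.equal(unname(betahat), as.numeric(betahat_direct))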
Then, we can calculate the mean and variance in two ways as explained in the question.
i. Let \(\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_{19})\). Then \(\hat{\beta}_{19} - \hat{\beta}_5 = [\underbrace{0,\dots,0}_\text{5 times},-1,\underbrace{0,\dots,0}_\text{13 times},1]\,\hat{\boldsymbol{\beta}} = \mathbf{x}\hat{\boldsymbol{\beta}}\), where \(\mathbf{x}\) is this row vector of length 20.
Using the following formulas we can then compute the mean and variance as shown below:
- \(\text{E}[\mathbf{x}\hat{\boldsymbol{\beta}}] = \mathbf{x}\,\text{E}[\hat{\boldsymbol{\beta}}] = \mathbf{x}\boldsymbol{\beta}\)
- \(\text{Var}(\mathbf{x}\hat{\boldsymbol{\beta}}) = \mathbf{x}\,\text{Var}(\hat{\boldsymbol{\beta}})\,\mathbf{x}^{T}\)
ii. Alternatively, the variance can be computed directly from the variance-covariance matrix as \(\text{Var}(\hat{\beta}_{19}-\hat{\beta}_5) = \text{Var}(\hat{\beta}_{19}) + \text{Var}(\hat{\beta}_5) - 2\,\text{Cov}(\hat{\beta}_{19},\hat{\beta}_5)\).
# Contrast vector picking out betahat_19 - betahat_5
x <- c(rep(0,5),-1,rep(0,13),1)
# Point estimate of the difference
diff <- x%*%betahat
mean_diff <- diff
# Variance of the estimated difference: via the contrast vector, and directly as
# Var(b19) + Var(b5) - 2*Cov(b19,b5)
var_diff <- x%*%cov_betahat%*%x
var_diff_direct <- cov_betaHat[6,6] + cov_betaHat[20,20] - 2*cov_betaHat[20,6]
# Comparing the mean and the two variance calculations
cat(" Mean of diff:", mean_diff, "\n",
    "Variance using method (i):", var_diff, "\n",
    "Variance using method (ii):", var_diff_direct)
## Mean of diff: 12.50869
## Variance using method (i): 19.711
## Variance using method (ii): 19.711
Since the null hypothesis is \(\beta_{19} - \beta_5 = 0\), the test statistic is \(z = \frac{(b_{19} - b_{5}) - \text{E}(\hat{\beta}_{19} - \hat{\beta}_5)}{\text{sd}(\hat{\beta}_{19} - \hat{\beta}_5)} = \frac{b_{19} - b_{5}}{\text{sd}(\hat{\beta}_{19} - \hat{\beta}_5)}\), since the expected difference is zero under the null hypothesis. We then find the one-sided p-value \(\text{P}(Z \geq z)\), where \(Z\sim\mathcal{N}(0,1)\), as below:
# Define the test statistic: the estimated difference divided by its standard error
z_stat <- diff/sqrt(var_diff)
# Find the one-sided p-value
pnorm(z_stat,lower.tail = FALSE)
## [,1]
## [1,] 0.00242
The p-value is approximately 0.0024, which is well below 0.05, so we reject the null hypothesis that \(\beta_5=\beta_{19}\). The difference between the two kickers' records is larger than can plausibly be explained by chance variation between two players of equal skill, so the data give evidence that Shayne Graham (\(i=19\)) is a better kicker than Jay Feely (\(i=5\)).
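An equivalent check (a sketch, assuming the Name levels are spelled "Jay Feely" and "Shayne Graham" exactly as the names appear above) is to re-level the kicker factor so that Jay Feely is the baseline. The coefficient for Shayne Graham in the re-parameterized fit then estimates \(\beta_{19}-\beta_5\) directly, and its t statistic gives the corresponding (two-sided) test.
# Re-fit with Jay Feely as the reference level; the "NameShayne Graham" row of the
# coefficient table reports the estimate of beta_19 - beta_5 and its test
goals$Name <- relevel(factor(goals$Name), ref="Jay Feely")
lm3 <- lm(FGt~FGt1+Name, data=goals)
summary(lm3)$coefficients["NameShayne Graham",]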
License: This material is provided under an MIT license
Acknowledgement: The field goal data come from S. J. Sheather (2009) “A Modern Approach to Regression with R.”