Write a report answering the questions below, and including your R code. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that make a substantial contribution to your report. As usual, following the syllabus, you are also requested to give some feedback in a “Please explain” statement.
Question 1. Use the R functions pt() and pf() to provide convincing evidence that if \(T\) is a random variable following Student's t distribution on \(n\) degrees of freedom, and \(F\) is a random variable following the F distribution on 1 and \(n\) degrees of freedom, then \(T^2\) has the same distribution as \(F\).
# Fix degrees of freedom of the t distribution
n <- 10
# Compare P(F <= x) with P(T^2 <= x) = P(-sqrt(x) <= T <= sqrt(x)) for several x
for (x in c(1, 5, 9, 16, 25, 36)) {
  print(c(pf(x, 1, n), pt(sqrt(x), n) - pt(-sqrt(x), n)))
}
## [1] 0.6591069 0.6591069
## [1] 0.9506678 0.9506678
## [1] 0.9866563 0.9866563
## [1] 0.9974817 0.9974817
## [1] 0.9994627 0.9994627
## [1] 0.9998679 0.9998679
The two probabilities agree to all displayed decimal places for every value of \(x\) tested, i.e. \(P(F \le x) = P(-\sqrt x \le T \le \sqrt x) = P(T^2 \le x)\) numerically. This is convincing numerical evidence, though not an exact proof, that \(T^2\) has the same distribution as \(F\).
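A complementary check (not part of the original solution) uses the quantile functions rather than the distribution functions: if \(T^2\) and \(F\) have the same distribution, then qf(p, 1, n) should equal qt((1+p)/2, n)^2 for every \(p\) in \((0,1)\). The grid of probabilities below is arbitrary; this is only a numerical sketch.
# Complementary check via quantile functions: under the claim,
# qf(p, 1, n) should equal qt((1 + p)/2, n)^2 for all p in (0, 1)
p <- seq(0.05, 0.95, by = 0.05)
max(abs(qf(p, 1, n) - qt((1 + p)/2, n)^2))  # should be numerically close to 0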
Question 2. The random variable \(T\) is defined to have the t distribution on \(n\) degrees of freedom if \[T = \frac {X_0}{\sqrt{\frac 1n \sum_{i=1}^n X_i^2}}\] where \(X_0, X_1, \dots, X_n\) are independent identically distributed (iid) N[0,1] random variables. Now consider the random variable \[\bar T = \frac {\sqrt{n}\, \bar X}{\sqrt{\frac 1{n-1} \sum_{i=1}^n (X_i - \bar X)^2}}\] where \(\bar X = \frac 1n \sum_{i=1}^n X_i\). Here, \(\bar T\) is the model-generated statistic corresponding to a sample statistic \[\bar t = \bar x/ SE(\bar x)\] where \(SE(\bar x)= \frac{s}{\sqrt{n}}\) with \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar x)^2\). Note that \(\bar t\) is a test statistic for a one-sample t-test investigating whether a sample can reasonably be modeled as having mean zero.
(a) Simulate a large number of draws of \(\bar T\) under the null hypothesis that \(X_1,\dots,X_n\) are iid \(N[0,1]\), with \(n=8\).
(b) Plot a histogram of these simulations and add a line corresponding to the density of the \(t\) distribution on 7 degrees of freedom. These should be similar.
Solution to (a) and (b):
N <- 50000 ; sigma <- 1 ; d <- 7 ; set.seed(23)   # d = n - 1, so each sample has n = 8 points
# Each row of X is one simulated sample of n = d + 1 iid N(0,1) draws
X <- matrix(rnorm(N*(d+1), mean=0, sd=sigma), nrow=N)
# T_bar = sqrt(n) * xbar / s, where s^2 = sum((x - xbar)^2) / (n - 1) uses all n observations
T_bar_evaluator <- function(x) sqrt(d+1)*mean(x) / sqrt(sum((x - mean(x))^2)/d)
T_bar_sim <- apply(X, 1, T_bar_evaluator)
hist(T_bar_sim, freq=FALSE, main="", breaks=30, ylim=c(0,0.4))
x <- seq(min(T_bar_sim), max(T_bar_sim), length.out=200)
lines(x, dt(x, df=d), col="red")
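As an additional sanity check (not required by the question), one can compare a few empirical quantiles of the simulated \(\bar T\) values with the corresponding quantiles of the \(t_7\) distribution; they should be close, up to Monte Carlo error. The name probs below is introduced only for this check.
# Optional check: empirical quantiles of T_bar_sim vs. theoretical t_7 quantiles
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
round(quantile(T_bar_sim, probs), 3)
round(qt(probs, df = d), 3)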
Solution:
\(X_i\) and \(\bar X\) are not independent: \(\bar X = \frac 1n \sum_{j=1}^n X_j\) contains \(X_i\) as one of its terms, and indeed \(Cov(X_i, \bar X) = \frac 1n Var(X_i) = \frac 1n \neq 0\).
Solution:
Consider the case \(n = 3\) with \(i = 1\), i.e. examine \(Cov(\bar X, X_1 - \bar X)\).
\[\begin{aligned} Cov(\bar X, X_1 - \bar X) &= Cov\left(\tfrac 13 X_1 + \tfrac 13 X_2 + \tfrac 13 X_3,\; \left(1 - \tfrac 13\right)X_1 - \tfrac 13 X_2 - \tfrac 13 X_3\right)\\ &= \tfrac 13\left(1 - \tfrac 13\right)Cov(X_1, X_1) - \tfrac 13 \cdot \tfrac 13\, Cov(X_2, X_2) - \tfrac 13 \cdot \tfrac 13\, Cov(X_3, X_3)\\ &= \tfrac 29 - \tfrac 19 - \tfrac 19 = 0. \end{aligned}\]
The cross terms \(Cov(X_1, X_2)\), \(Cov(X_1, X_3)\) and \(Cov(X_2, X_3)\) do not appear because \(X_1, X_2, X_3\) are independent, so these covariances are all 0, while \(Cov(X_j, X_j) = Var(X_j) = 1\). The same expansion works for any \(n\) and any \(i\), so \(Cov(\bar X, X_i - \bar X) = 0\) in general; the cases \(Cov(\bar X, X_2 - \bar X)\), \(Cov(\bar X, X_3 - \bar X)\), etc. follow by symmetry because \(X_1, X_2, \dots\) are iid.
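As an informal numerical check of this result (not part of the derivation), one can simulate many samples of size \(n = 3\) and estimate \(Cov(\bar X, X_1 - \bar X)\); the estimate should be close to zero. The names n_sim, Xs and xbar below are introduced only for this sketch.
# Informal check: the sample covariance of Xbar and X_1 - Xbar should be near 0
set.seed(1)
n_sim <- 100000
Xs <- matrix(rnorm(3 * n_sim), ncol = 3)   # each row is one sample of size 3
xbar <- rowMeans(Xs)
cov(xbar, Xs[, 1] - xbar)                  # approximately 0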
We consider again the data on freshman GPA, ACT exam scores and percentile ranking of each student within their high school for 705 students at a large state university in the file gpa.txt. In addition, we now consider the year (1996 to 2000) in which the student entered college.
gpa <- read.table("gpa.txt",header=T)
head(gpa)
## # A tibble: 6 x 5
##      ID   GPA High_School   ACT  Year
##   <dbl> <dbl>       <dbl> <dbl> <dbl>
## 1  1.00 0.980        61.0  20.0  1996
## 2  2.00  1.13        84.0  20.0  1996
## 3  3.00  1.25        74.0  19.0  1996
## 4  4.00  1.32        95.0  23.0  1996
## 5  5.00  1.48        77.0  28.0  1996
## 6  6.00  1.57        47.0  23.0  1996
There are many ways in which the predictive relationship between freshman GPA and the admission scores may, or may not, be stable over time. The director of admissions is interested in whether there is evidence for changes larger than can be explained by chance variation, and, if so, in what ways. Let the null hypothesis, \(H_0\), be the probability model considered in the midterm exam, where freshman GPA is modeled to depend linearly on ACT score and high school ranking. Let \(H_a\) be the probability model where \(H_0\) is extended to include year as a factor, as fitted in R by
lm_gpa <- lm(GPA~High_School+ACT+factor(Year),data=gpa)
Question 3.
model.matrix(lm_gpa) may help you to understand the model that R has fitted. You can write the models in either matrix or subscript form.
Solution:
\(Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \beta_3X_{i,3} + \beta_4X_{i,4} + \beta_5X_{i,5} + \beta_6X_{i,6} + \epsilon_i\), with \(\epsilon_i\) iid \(N(0, \sigma^2)\),
where \(Y_i\) is the freshman GPA of student \(i\), \(X_{i,1}\) is the student's percentile ranking within their high school, \(X_{i,2}\) is the ACT score, and \(X_{i,3}, \dots, X_{i,6}\) are 0/1 indicators that the student entered college in 1997, 1998, 1999 and 2000 respectively (1996 is the reference level).
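To see this coding explicitly (an optional check; output not shown here), one can inspect the design matrix that R constructs: the columns factor(Year)1997 through factor(Year)2000 are 0/1 indicators, and rows from 1996 have zeros in all four of them.
# Optional: inspect the design matrix R builds for lm_gpa
head(model.matrix(lm_gpa))
colnames(model.matrix(lm_gpa))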
summary(lm_gpa). Is there any evidence that some year, or years, may have a different intercept? Why is there no row in the results table for factor(Year)1996?
summary(lm_gpa)
##
## Call:
## lm(formula = GPA ~ High_School + ACT + factor(Year), data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.15048 -0.28873 0.07655 0.39619 1.30415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.217874 0.144598 8.422 < 2e-16 ***
## High_School 0.010124 0.001285 7.878 1.28e-14 ***
## ACT 0.037188 0.005951 6.248 7.21e-10 ***
## factor(Year)1997 0.083657 0.068816 1.216 0.2245
## factor(Year)1998 0.115339 0.066158 1.743 0.0817 .
## factor(Year)1999 0.080071 0.067475 1.187 0.2358
## factor(Year)2000 0.056007 0.068013 0.823 0.4105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5674 on 698 degrees of freedom
## Multiple R-squared: 0.2071, Adjusted R-squared: 0.2003
## F-statistic: 30.39 on 6 and 698 DF, p-value: < 2.2e-16
Based on the summary, the percentile ranking of the student in their high school and their ACT score are both highly significant predictors of a student's freshman GPA in college. However, there is no strong evidence that any year has a different intercept: none of the factor(Year) coefficients is significant at the 0.05 level (the largest, for 1998, has \(p = 0.082\)).
There is no row for factor(Year)1996 because 1996 is the reference (baseline) level: its effect is absorbed into the intercept, and each factor(Year) coefficient estimates the difference in intercept between that year and 1996.
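To make this concrete (a minimal sketch, not required by the question; the names Year_f and lm_gpa_relevel are introduced only here), one could refit the model with a different reference level. A coefficient for 1996 then appears, while the fitted values are unchanged.
# Refit with 1997 as the reference level; a 1996 coefficient now appears,
# but the fitted values are identical to those of lm_gpa
gpa$Year_f <- relevel(factor(gpa$Year), ref = "1997")
lm_gpa_relevel <- lm(GPA ~ High_School + ACT + Year_f, data = gpa)
all.equal(fitted(lm_gpa), fitted(lm_gpa_relevel))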
anova(lm_gpa). Write out in detail an F test to test \(H_0\) against \(H_a\), explaining how the sample test statistic is constructed, giving the distribution of the model-generated test statistic under \(H_0\), and saying how the resulting p-value is calculated and interpreted.
anova(lm_gpa)
## Analysis of Variance Table
##
## Response: GPA
## Df Sum Sq Mean Sq F value Pr(>F)
## High_School 1 45.007 45.007 139.784 < 2.2e-16 ***
## ACT 1 12.628 12.628 39.219 6.618e-10 ***
## factor(Year) 4 1.072 0.268 0.832 0.505
## Residuals 698 224.742 0.322
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[H_0: Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \epsilon_i\] \[H_a: Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \beta_3X_{i,3} + \beta_4X_{i,4} + \beta_5X_{i,5} + \beta_6X_{i,6} + \epsilon_i\] with \(\epsilon_i\) iid \(N(0, \sigma^2)\) in each model.
\[f = \frac {(RSS_0 - RSS_a)/d}{RSS_a/(n-q)}\] where \(RSS_0\) and \(RSS_a\) are the residual sums of squares, \(\sum_{i=1}^n (y_i - \hat y_i)^2\), under the null and alternative models, \(d = 4\) is the number of additional parameters in the alternative model (the difference in degrees of freedom between the two models), and \(n - q = 705 - 7 = 698\) is the residual degrees of freedom of the alternative model. Under \(H_0\), the corresponding model-generated statistic \(F^*\) follows the \(F_{4, 698}\) distribution. The p-value is calculated as \(P(F^* \ge f)\), the probability of a model-generated statistic at least as large as the observed value \(f = 0.832\). Since the p-value is 0.505, we fail to reject the null hypothesis at \(\alpha = 0.05\) and conclude that the year a student enters college does not add explanatory power for freshman GPA beyond what chance variation can explain. The ANOVA agrees with what we saw in the regression output: the year a student enters college is not significant.
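As a check on the arithmetic (a sketch using the sums of squares reported in the ANOVA table above; lm_null is a name introduced only here), the same F statistic and p-value can be reproduced by fitting the null model explicitly and comparing it to lm_gpa, or by hand with pf():
# Explicit model comparison for the F test of H_0 against H_a
lm_null <- lm(GPA ~ High_School + ACT, data = gpa)   # the null model H_0
anova(lm_null, lm_gpa)                               # F on (4, 698) degrees of freedom
# By hand, from the ANOVA table: extra SS = 1.072 on 4 df, RSS_a = 224.742 on 698 df
f <- (1.072 / 4) / (224.742 / 698)
c(f, pf(f, 4, 698, lower.tail = FALSE))              # about 0.832 and 0.505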