Write a report answering the questions below, and including your R code. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that make a substantial contribution to your report. As usual, following the syllabus, you are also requested to give some feedback in a “Please explain” statement.


Exploring the t and F distributions

Question 1. Use the R functions pt() and pf() to provide convincing evidence that if \(T\) is a random variable following Student’s t distribution on n degrees of freedom, and \(F\) is a random variable following the F distribution on 1 and n degrees of freedom, then \(T^2\) has the same distribution as \(F\).

# Fix degrees of freedom of t distribution
n=10

for(x in c(1,5,9,16,25,36)){print(c(pf(x,1,n), pt(sqrt(x),n)-pt(-sqrt(x),n)))}
## [1] 0.6591069 0.6591069
## [1] 0.9506678 0.9506678
## [1] 0.9866563 0.9866563
## [1] 0.9974817 0.9974817
## [1] 0.9994627 0.9994627
## [1] 0.9998679 0.9998679

We note that the approximations are not exact.

Question 2. The random variable T is defined to have the t distribution on n degrees of freedom if \[T = \frac {X_0}{(\frac 1n \sum_{i=1}^n X_i^2)}\] where \(X_0, X_1, \dots, X_n\) are independent identically distributed (iid) N[0,1] random variables. Now consider the random variable \[\bar T = \frac {\sqrt{n} \bar X}{\frac 1{n-1} \sum_{i=1}^n (X_i - \bar X)^2}\] where \(\bar X = \frac 1n \sum_{i=1}^n X_i\). Here, \(\bar T\) is the model generated statistic corresponding to a sample statistic \[\bar t = \bar x/ SE(\bar x)\] where \(SE(\bar x)= \frac{s}{\sqrt{n}}\) with \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar x)^2\). Note that \(\bar t\) is a test statistic for a one-sample t-test investigating whether a sample can reasonably be modeled as having mean zero.

  1. Simulate a large number of draws of \(\bar T\) from the null hypothesis that \(X_1,\dots,X_n\) are iid \(N[0,1]\), with \(n=8\).

  2. Plot a histogram of these simulations and add a line corresponding to the t density of a \(t\) distribution on 7 degrees of freedom. These should be similar.

Solution to (a) and (b):

N <- 50000 ; sigma <- 1 ; d <- 7 ; set.seed(23)
X <- matrix(rnorm(N*(d+1),mean=0,sd=sigma),nrow=N)
T_bar_evaluator <- function(x) sqrt(d+1)*mean(x)/sqrt(sum((x[1:d+1]-mean(x))^2)/d)
T_bar_sim <- apply(X,1, T_bar_evaluator)

hist(T_bar_sim,freq=F,main="", breaks=30,ylim=c(0,0.4))
x <- seq(length=200,min(T_bar_sim),max(T_bar_sim))
lines(x,dt(x,df=d), col="red")

  1. Explain why \(\sum_{i=1}^n (X_i-\bar X)^2\) behaves like a sum of squares on \(n-1\) degrees of freedom, not \(n\). You are expected to make an intuitive explanation; you are not asked to show this mathematically.

Solution:

\(X_i\) and \(\bar X\) are not independent because \(\bar X = \frac 1n \sum_{i=1}^n X_i\) which is dependent on \(X_i\).

  1. Check that \(Cov(\bar X,X_i-\bar X)=0\) for each \(i=1,2,\dots,n\), by using the bilinarity of covariance, which in its simplest form is \(Cov(X,Y+Z)=Cov(X,Y)+Cov(X,Z)\) and in general is \[ Cov\Big(\sum_{i=1}^na_iX_i,\sum_{j=1}^nb_jX_j\Big)=\sum_{i=1}^n\sum_{j=1}^n a_ib_iCov(X_i,X_j).\] This means that each term \(X_i-\bar X\) in the denominator of the formula for \(\bar T\) is uncorrelated with the numerator. It is a fact that multivariate normal random variables that are uncorrelated are also independent. Thus, the numerator and denominator of \(\bar T\) are independent, just as in the formula for \(T\).

Solution:

Consider n = 3 and \(X_i = X_1\)

\[\begin{equation*} Cov(\bar X, X_i - \bar X) \\ Cov(\frac 13 (X_1 + X_2 + X_3), X_1 - \frac 13 (X_1 + X_2 + X_3))\\ Cov(\frac 13 X_1 + \frac 13 X_2 + \frac 13 X_3), (1 - \frac 13)X_1 - \frac 13 X_2 - \frac 13 X_3)\\ \end{equation*}\]

We don’t need to consider \(Cov(X_1, X_2), Cov(X_1, X_3),\) or \(Cov(X_2, X_3)\) because \(X_1, X_2\), and \(X_3\) are independent. Thus \(Cov(X_1, X_2), Cov(X_1, X_3),\) or \(Cov(X_2, X_3)\) are equal to 0. Therefore,

\[\begin{equation*} Cov(\frac 13 X_1 + \frac 13 X_2 + \frac 13 X_3), (1 - \frac 13)X_1 - \frac 13 X_2 - \frac 13 X_3)\\ Cov(\frac 13 X_1, (1- \frac 13) X_1) + Cov(\frac 13 X_2, -\frac 13 X_2) + Cov(\frac 13 X_3, -\frac 13 X_3) \\ \frac 13 {(1- \frac 13) Cov(X_1, X_1) + \frac 13 Cov(X_2, X_2) + \frac 13 Cov(X_3, X_3)} \\ \frac 13 {(1- \frac 13) *1 + \frac 13 *1 + \frac 13 *1} \\ \frac 13 (0) \\ 0 \end{equation*}\]

Since this same argument works for any value of \(n\), \(Cov(\bar X,X_i-\bar X)=0\). Similarly, we don’t need to consider \(Cov(\bar X,X_2-\bar X), Cov(\bar X,X_3-\bar X),etc.\) because \(X_1, X_2, etc.\) are iid.


Making and interpreting an F test

We consider again the data on freshman GPA, ACT exam scores and percentile ranking of each student within their high school for 705 students at a large state university in the file gpa.txt. In addition, we now consider the year (1996 to 2000) in which the student entered college.

## # A tibble: 6 x 5
##      ID   GPA High_School   ACT  Year
##   <dbl> <dbl>       <dbl> <dbl> <dbl>
## 1  1.00 0.980        61.0  20.0  1996
## 2  2.00 1.13         84.0  20.0  1996
## 3  3.00 1.25         74.0  19.0  1996
## 4  4.00 1.32         95.0  23.0  1996
## 5  5.00 1.48         77.0  28.0  1996
## 6  6.00 1.57         47.0  23.0  1996
gpa <- read.table("gpa.txt",header=T)
head(gpa)

There are many ways in which the predictive relationship between freshman GPA and the admission scores may, or may not, be stable over time. The director of admissions is interested in what ways, if any, there is evidence for changes larger than can be explained by chance variation. Let the null hypothesis, \(H_0\), be the probability model considered in the midterm exam, where freshman GPA is modeled to depend linearly on ACT score and high school ranking. Let \(H_a\) be the probability model where \(H_0\) is extended to include year as a factor, as fitted in R by

lm_gpa <- lm(GPA~High_School+ACT+factor(Year),data=gpa)

Question 3.

  1. Write out the null and alternative hypotheses by completely specifying the probability models. Sometimes in class, to save time and space, we don’t write the complete details of each model. Here, you are asked to do that. Looking at model.matrix(lm_gpa) may help you to understand the model that R has fitted. You can write the models in either matrix or subscript form.

Solution:

\(Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \beta_3X_{i,3} + \beta_4X_{i,4} + \beta_5X_{i,5} + \beta_6X_{i,6}\)

wher \(Y_i\) is the GPA of freshman i, \(X_1\) is the percentile ranking of each student within their high school, \(X_2\) is the ACT score, and \(X_3, \dots X_6\) are indicators for the year in which the student entered college.

  1. Interpret the results in summary(lm_gpa). Is there any evidence that some year, or years, may have different intercept? Why is there no row in the results table for factor(Year)1996?
summary(lm_gpa)
## 
## Call:
## lm(formula = GPA ~ High_School + ACT + factor(Year), data = gpa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.15048 -0.28873  0.07655  0.39619  1.30415 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.217874   0.144598   8.422  < 2e-16 ***
## High_School      0.010124   0.001285   7.878 1.28e-14 ***
## ACT              0.037188   0.005951   6.248 7.21e-10 ***
## factor(Year)1997 0.083657   0.068816   1.216   0.2245    
## factor(Year)1998 0.115339   0.066158   1.743   0.0817 .  
## factor(Year)1999 0.080071   0.067475   1.187   0.2358    
## factor(Year)2000 0.056007   0.068013   0.823   0.4105    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5674 on 698 degrees of freedom
## Multiple R-squared:  0.2071, Adjusted R-squared:  0.2003 
## F-statistic: 30.39 on 6 and 698 DF,  p-value: < 2.2e-16

Based on the summary, the percentile ranking of the student in their high school and their ACT score are significant in determining a student’s freshman GPA in college. However, there is no evidence that any of the years have a different intercept because none of coefficients are statistically signficant.

There is no row for factor(Year)1996 because it is contained in the intercept.

  1. Interpret the results in anova(lm_gpa). Write out in detail an F test to test \(H_0\) against \(H_a\), explaining how the sample test statistic is constructed, giving the distribution of the model-generated test statistic under \(H_0\), and saying how the resulting p-value is calculated and interpreted.
anova(lm_gpa)
## Analysis of Variance Table
## 
## Response: GPA
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## High_School    1  45.007  45.007 139.784 < 2.2e-16 ***
## ACT            1  12.628  12.628  39.219 6.618e-10 ***
## factor(Year)   4   1.072   0.268   0.832     0.505    
## Residuals    698 224.742   0.322                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\[H_0: Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2}\] \[H_a: Y_i = \beta_0 + \beta_1X_{i,1} + \beta_2X_{i,2} + \beta_3X_{i,3} + \beta_4X_{i,4} + \beta_5X_{i,5} + \beta_6X_{i,6}\]

\[f = \frac {(RSS_0 - RSS_a)/d}{RSS_a/(n-q)}\] where \(RSS_0\) and \(RSS_a\) are the residual sum of squares (\(\sum_{i=1}^n (y_i - \hat y_i)^2\)) for the null and alternative models, \(d\) is the difference in the degrees of freedom between the null and alternative models, \(n-q\) is the degrees of freedom in the altnerative model, and \(f\) ~ \(F_{4, 698}\). The p-value was calculated as the \(P(f^* > f = 0.824)\). Since the p-value is equal to 0.505 we fail to reject the null hypothesis at an \(\alpha = 0.05\) and conclude that the year a student enters college is not significant in determining their freshman GPA. The ANOVA agrees with what we saw in the regression output: the year a student enters college is not signficant.