Homework 8, STATS 401 W18

Exploring the t and F distributions
Making and interpreting an F test

Write a report answering the questions below, and including your R code. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that make a substantial contribution to your report. As usual, following the syllabus, you are also requested to give some feedback in a “Please explain” statement.

Exploring the t and F distributions

Question 1. Use the R functions pt() and pf() to provide convincing evidence that if $T$ is a random variable following Student’s t distribution on $n$ degrees of freedom, and $F$ is a random variable following the F distribution on $1$ and $n$ degrees of freedom, then $T^2$ has the same distribution as $F$ .

Question 2. The random variable $T$ is defined to have the t distribution on $n$ degrees of freedom if $T=\frac{X_0}{\Big(\frac{1}{n}\sum_{i=1}^n X_i^2\Big)}$ where $X_0, X_1,\dots,X_n$ are independent identically distributed (iid) $N[0,1]$ random variables. Now consider the random variable $\bar T = \frac{\sqrt{n} \, \bar X}{ \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i-\bar X)^2} }$ where $\bar X = \frac{1}{n}\sum_{i=1}^n X_i$ . Here, $\bar T$ is the model-generated statistic corresponding to a sample statistic $\bar t = \bar x \big/ {\mathrm{SE}}(\bar x)$ where ${\mathrm{SE}}(\bar x)= \frac{s}{\sqrt{n}}$ with $s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-\bar x)^2$ . Note that $\bar t$ is a test statistic for a one-sample t-test investigating whether a sample can reasonably be modeled as having mean zero.

Simulate a large number of draws of $\bar T$ from the null hypothesis that $X_1,\dots,X_n$ are iid $N[0,1]$ , with $n=8$ .
Plot a histogram of these simulations and add a line corresponding to the t density of a $t$ distribution on 7 degrees of freedom. These should be similar.
Explain why $\sum_{i=1}^n (X_i-\bar X)^2$ behaves like a sum of squares on $n-1$ degrees of freedom, not $n$ . You are expected to make an intuitive explanation; you are not asked to show this mathematically.
Check that ${\mathrm{Cov}}(\bar X,X_i-\bar X)=0$ for each $i=1,2,\dots,n$ , by using the bilinarity of covariance, which in its simplest form is ${\mathrm{Cov}}(X,Y+Z)={\mathrm{Cov}}(X,Y)+{\mathrm{Cov}}(X,Z)$ and in general is ${\mathrm{Cov}}\Big(\sum_{i=1}^na_iX_i,\sum_{j=1}^nb_jX_j\Big)=\sum_{i=1}^n\sum_{j=1}^n a_ib_i{\mathrm{Cov}}(X_i,X_j).$ This means that each term $X_i-\bar X$ in the denominator of the formula for $\bar T$ is uncorrelated with the numerator. It is a fact that multivariate normal random variables that are uncorrelated are also independent. Thus, the numerator and denominator of $\bar T$ are independent, just as in the formula for $T$ .

Making and interpreting an F test

We consider again the data on freshman GPA, ACT exam scores and percentile ranking of each student within their high school for 705 students at a large state university in the file gpa.txt. In addition, we now consider the year (1996 to 2000) in which the student entered college.

gpa <- read.table("gpa.txt",header=T)
head(gpa)

##   ID  GPA High_School ACT Year
## 1  1 0.98          61  20 1996
## 2  2 1.13          84  20 1996
## 3  3 1.25          74  19 1996
## 4  4 1.32          95  23 1996
## 5  5 1.48          77  28 1996
## 6  6 1.57          47  23 1996

There are many ways in which the predictive relationship between freshman GPA and the admission scores may, or may not, be stable over time. The director of admissions is interested in what ways, if any, there is evidence for changes larger than can be explained by chance variation. Let the null hypothesis, $H_0$ , be the probability model considered in the midterm exam, where freshman GPA is modeled to depend linearly on ACT score and high school ranking. Let $H_a$ be the probability model where $H_0$ is extended to include year as a factor, as fitted in R by

lm_gpa <- lm(GPA~High_School+ACT+factor(Year),data=gpa)

Question 3.

Write out the null and alternative hypotheses by completely specifying the probability models. Sometimes in class, to save time and space, we don’t write the complete details of each model. Here, you are asked to do that. Looking at model.matrix(lm_gpa) may help you to understand the model that R has fitted. You can write the models in either matrix or subscript form.
Interpret the results in summary(lm_gpa). Is there any evidence that some year, or years, may have different intercept? Why is there no row in the results table for factor(Year)1996?
Interpret the results in anova(lm_gpa). Write out in detail an F test to test $H_0$ against $H_a$ , explaining how the sample test statistic is constructed, giving the distribution of the model-generated test statistic under $H_0$ , and saying how the resulting p-value is calculated and interpreted.

License: This material is provided under an MIT license

Homework 8, STATS 401 W18

Due in your lab on 3/22 or 3/23

Exploring the t and F distributions

Making and interpreting an F test