Your report should include the R code that you use. For calculations, please add enough explanation to help the reader understand what you did and why. Recall that you are permitted to collaborate, or to use any internet resources, but you must list all sources that influenced your report. As usual, a statement of Sources is required to get credit for the homework.
We consider again the data on freshman GPA, ACT exam scores and percentile ranking of each student within their high school for 705 students at a large state university. In addition, we now consider the year (1996 to 2000) in which the student entered college.
download.file(destfile="gpa.txt",
url="https://ionides.github.io/401f18/hw/hw10/gpa.txt")
gpa <- read.table("gpa.txt",header=T)
head(gpa)
## ID GPA High_School ACT Year
## 1 1 0.98 61 20 1996
## 2 2 1.13 84 20 1996
## 3 3 1.25 74 19 1996
## 4 4 1.32 95 23 1996
## 5 5 1.48 77 28 1996
## 6 6 1.57 47 23 1996
The predictive relationship between freshman GPA and the admission scores may, or may not, be stable over time. Intercepts and/or slopes of fitted lines may be approximately constant, or may show trends through time. The director of admissions is interested in what ways, if any, there is evidence for changes larger than can be explained by chance variation. Let the null hypothesis, \(H_0\), be a probability model for which freshman GPA is modeled to depend linearly on ACT score and high school ranking. Let \(H_a\) be the probability model where \(H_0\) is extended to include year as a factor, as fitted in R by
lm_gpa <- lm(GPA~High_School+ACT+factor(Year),data=gpa)
model.matrix(lm_gpa)
may help you to understand the model that R has fitted. You can write the models in matrix, subscript or double subscript form.List the years as \(i=1,2,3,4,5\) corresponding to 1996, 1997, 1998, 1999, 2000. Let \(Y_{ij}\) model the freshman GPA of the \(j\)th student entering as a freshman in year \(i\). The alternative hypothesis is \[ H_a: \quad Y_{ij} = \mu + \beta_1 h_{ij} + \beta_2 s_{ij} + \alpha_i + \epsilon_{ij} \] where \(h_{ij}\) is the student’s high school ranking, \(s_{ij}\) is the student’s ACT score, and \(\alpha_i\) is a contrast for year \(i\), and \(\mu\) is the intercept. The model supposes that there is a true coefficient \(\beta_1\) corresponding to high school ranking, \(\beta_2\) corresponding to ACT score, and contrast \(\alpha_i\) for each \(i\). The measurement error terms \(\epsilon_{ij}\) are \(\mathrm{iid} \; \mathrm{normal}(0,\sigma)\). In order not to over-specify the model, we set \(\alpha_1=0\).
The null hypothesis is the special case of \(H_a\) where \(\alpha_i=0\) for all \(i\).
summary(lm_gpa)
in the context of whether there is evidence that some year, or years, may have substantially different intercepts. Why is there no row in the results table for factor(Year)1996
?summary(lm_gpa)
##
## Call:
## lm(formula = GPA ~ High_School + ACT + factor(Year), data = gpa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.15048 -0.28873 0.07655 0.39619 1.30415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.217874 0.144598 8.422 < 2e-16 ***
## High_School 0.010124 0.001285 7.878 1.28e-14 ***
## ACT 0.037188 0.005951 6.248 7.21e-10 ***
## factor(Year)1997 0.083657 0.068816 1.216 0.2245
## factor(Year)1998 0.115339 0.066158 1.743 0.0817 .
## factor(Year)1999 0.080071 0.067475 1.187 0.2358
## factor(Year)2000 0.056007 0.068013 0.823 0.4105
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5674 on 698 degrees of freedom
## Multiple R-squared: 0.2071, Adjusted R-squared: 0.2003
## F-statistic: 30.39 on 6 and 698 DF, p-value: < 2.2e-16
Based on the linear model results above, it does not appear that any of the years 1997-2000 have a substantially different intercept from the year 1996, because none of the p-values are signficant for the coefficients for these years at the 5% signficance level. It is interesting to note, that 1998 is signficant at the 10% level. We should still perform a hypothesis test (ANOVA) to confirm that year is not statistically significant in our model.
There is no row in the results table for the year 1996 because 1996 is contained in the intercept. If we included a dummy variable for the year 1996, then we would have an over-specified model.
anova(lm_gpa)
are relevant for constructing and interpreting an F test to test \(H_0\) against \(H_a\). Explain how to calculate the sample test statistic for this test. Specify the distribution of the model-generated test statistic under \(H_0\), and saying how the resulting p-value is calculated. What do you conclude?anova(lm_gpa)
## Analysis of Variance Table
##
## Response: GPA
## Df Sum Sq Mean Sq F value Pr(>F)
## High_School 1 45.007 45.007 139.784 < 2.2e-16 ***
## ACT 1 12.628 12.628 39.219 6.618e-10 ***
## factor(Year) 4 1.072 0.268 0.832 0.505
## Residuals 698 224.742 0.322
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The numbers that are relevant for testing \(H_0\) against \(H_a\) are 1.072, 4, 224.742, and 698.
Recall that the sample F test statistic is calculated as \(f = \frac{(RSS_0 - RSS_a)/d}{RSS_a/(n-q)}\). Here \(RSS_0 - RSS_a = 1.072\), \(RSS_a = 224.742\), \(d = 4\), and \(n-q = 698\). In our case, the F test statistic is calculated as \(\frac{1.072/4}{224.742/698} = 0.832\)
\(f\) ~ \(F_{4, 698}\); the model-generated test statistic under \(H_0\) has an F-distribution with 4 and 698 degrees of freedom.
Our p-value is by definition \(P(F>f)\) and is calculated in R using pf(0.832, 4, 698, lower.tail = FALSE)
. We can see from the ANOVA output that the p-value is 0.505 which is much larger than any reasonable level of significance (0.01, 0.05, or 0.1). Therefore, we fail to reject the null hypothesis and conclude that the year in which a student entered college should not be included in the model.
If at all possible, do these by yourself. It is okay to discuss the exercises with your peers or to come to office hours after a reasonable amount of time working by yourself, experimenting by trial and error, consulting R help pages, and collecting information from the internet. One source to reinforce R skills appropriate for STATS 401 is rstatistics.net/r-tutorial-exercise-for-beginners.
gpa
by two methods: (i) using matrix indexing; (2) via the tail()
function.To obtain the last 10 lines of gpa
we can use the following lines of code: gpa[696:705,]
and tail(gpa, 10)
.
gpa
who were admitted in 1997. Hint: sum()
applied to a logical vector counts the number of TRUE
values.sum(c(TRUE,TRUE,FALSE,TRUE))
## [1] 3
To count the number of students in the dataset gpa
who were admitted in 1997 we can use the following code: sum(gpa$Year == 1997)
.
gpa
who are both above the 50th percentile in their high school and have an ACT score greater than 15. &
on suitably constructed logical vectors is perhaps the simplest and best, and is something you should be able to do.Note that the High_School variable in the GPA dataset is the percentile ranking of the student.
Relevant code to find the proportion of students who are both above the 50th percentile and have an ACT score greater than 15: sum(gpa$High_School > 50 & gpa$ACT > 15)/nrow(gpa)
.
my_color <- rep("black",length=nrow(gpa))
and then modify the entries of my_color
corresponding to the different years. For example, you can create a logical vector gpa$Year==1997
to pick out the rows with admission year 1997. The corresponding entries of my_color
can then be modified to take the value "blue"
. Finally, you can use plot(...,col=my_color)
to color your scatterplot.my_color <- rep("black",length=nrow(gpa))
my_color[gpa$Year == 1997] <- 'blue'
my_color[gpa$Year == 1998] <- 'red'
my_color[gpa$Year == 1999] <- 'green'
my_color[gpa$Year == 2000] <- 'cyan'
plot(gpa$High_School, gpa$ACT, xlab = 'High School', ylab = 'ACT',
main = 'ACT score by High School Ranking', col = my_color)
par(mfrow=c(3,2))
to combine the scatterplots into a single figure. Use the xlim
and ylim
arguments to plot()
to give all five scatterplots the same axis scales: this is helpful to do since otherwise the differing scales on each plot make them hard to compare. xlim
argument expects a vector of length 2 giving the lower and upper limits of the x-axis. Conveniently, range(gpa$ACT)
produces a vector of length 2 giving the minimum and maximum values in the vector gpa$ACT
. An internet search on R plot mfrow example
will provide examples using par(mfrow=...)
to combine several plots into one figure. ggplot()
will fairly easily give you a nicer version if you have learned previously how to use it. However, learning to use base R graphics (as you are asked to do here) remains a useful skill.# subset the dataset based on the year a student was admitted
gpa1996 <- gpa[gpa$Year == 1996,]
gpa1997 <- gpa[gpa$Year == 1997,]
gpa1998 <- gpa[gpa$Year == 1998,]
gpa1999 <- gpa[gpa$Year == 1999,]
gpa2000 <- gpa[gpa$Year == 2000,]
# get the summary statistics from the full gpa dataset to determine the x and y limits
summary(gpa)
## ID GPA High_School ACT
## Min. : 1 Min. :0.510 Min. : 4.00 Min. :13.00
## 1st Qu.:177 1st Qu.:2.609 1st Qu.:68.00 1st Qu.:22.00
## Median :353 Median :3.050 Median :81.00 Median :25.00
## Mean :353 Mean :2.977 Mean :76.95 Mean :24.54
## 3rd Qu.:529 3rd Qu.:3.470 3rd Qu.:92.00 3rd Qu.:28.00
## Max. :705 Max. :4.000 Max. :99.00 Max. :35.00
## Year
## Min. :1996
## 1st Qu.:1997
## Median :1998
## Mean :1998
## 3rd Qu.:1999
## Max. :2000
Based on the summary statistics, we can guess that the limits for the x-axis with High School Ranking should be 4 and 99, and the limits for the y-axis with ACT score should be 13 and 35.
par(mfrow=c(3,2))
plot(gpa1996$High_School, gpa1996$ACT, xlab = 'High School', ylab = 'ACT',
main = 'Students Admitted in 1996',
xlim = c(4, 99), ylim = c(13, 35), col = my_color[gpa$Year == 1996])
plot(gpa1997$High_School, gpa1997$ACT, xlab = 'High School', ylab = 'ACT',
main = 'Students Admitted in 1997',
xlim = c(4, 99), ylim = c(13, 35), col = my_color[gpa$Year == 1997])
plot(gpa1998$High_School, gpa1998$ACT, xlab = 'High School', ylab = 'ACT',
main = 'Students Admitted in 1998',
xlim = c(4, 99), ylim = c(13, 35), col = my_color[gpa$Year == 1998])
plot(gpa1999$High_School, gpa1999$ACT, xlab = 'High School', ylab = 'ACT',
main = 'Students Admitted in 1999',
xlim = c(4, 99), ylim = c(13, 35), col = my_color[gpa$Year == 1999])
plot(gpa2000$High_School, gpa2000$ACT, xlab = 'High School', ylab = 'ACT',
main = 'Students Admitted in 2000',
xlim = c(4, 99), ylim = c(13, 35), col = my_color[gpa$Year == 2000])
If you decided to use range, an example of what the r code should look like is
plot(gpa2000$High_School, gpa2000$ACT, xlab = 'High School', ylab = 'ACT',
main = 'Students Admitted in 2000',
xlim = range(gpa$High_School), ylim = range(gpa$ACT), col = my_color[gpa$Year == 2000])
(d):
Strengths: You can see all of the data at once. If there were distinct clusters of students based on their high school ranking, ACT score, and year admitted, the clusters would be easy to see with this graph.
Weaknesses: Since there are no distinct clusters, any differences in overall high school ranking and ACT scores by year admitted are difficult to see.
(e):
Strengths: You can see the differences in overall high school rankings and ACT scores for students admitted in different years. It is much easier to compare trends between the years.
Weaknesses: It would be easy to overstate differences in the high school rankings and ACT scores which could lead to erroneous conclusions. The graphs are smaller which can make them harder to read.