Homework 4
SOLUTIONS CS&SS 508 Winter 2012 Assigned: Feb. 17 Due: Mar. 2 Please submit code and output for all questions except those marked with an asterisk (*). For questions marked with an asterisk submit code only.
Lecture 6: Programming in R 1) Load the warpbreaks data set and attach it.* This data set gives the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn. The variables are: [,1] breaks numeric The number of breaks [,2] wool factor The type of wool (A or B) [,3] tension factor The level of tension (L, M, H) There are measurements on 9 looms for each of the six types of warp (AL, AM, AH, BL, BM, BH). > data(warpbreaks) > attach(warpbreaks) 2) Use the tapply() function to compute the mean number of breaks separately for each level of tension. > tapply(breaks, tension, mean) L M H 36.38889 26.38889 21.66667 3) Use the tapply() function to compute the mean number of breaks separately for each type of wool. > tapply(breaks, wool, mean) A B 31.03704 25.25926 4 a) Again using the warpbreaks data set, create a factor variable which indicates the combination of tension and wool for each observation. You should do this with the paste() function. The paste() function can be used to paste together two character strings into a single character string. By default it adds a space between the two character strings. Example: > vec1 = "red" > vec2 = "head"
> vec3=paste(vec1,vec2, sep="") > vec3 [1] "redhead" Use the paste() function to create your factor variable, which takes the values “AL”, “AM”, “AH”, “BL”, “BM”, “BH”. > wooltension = as.factor( paste(wool, tension, sep="")) > table(wooltension) wooltension AH AL AM BH BL BM 9 9 9 9 9 9 4 b) Use the tapply() function to compute the mean number of breaks separately for each combination of wool type and tension level. > tapply(breaks, wooltension, mean) AH AL AM BH BL BM 24.55556 44.55556 24.00000 18.77778 28.22222 28.77778 > 5) The following code computes bootstrapped standard errors for a regression model from the swiss data set. Run the code line by line to figure out how each line works and what it is doing. Add a comment to each line of code describing exactly what it’s doing. ## Here is the regression model with regular (non-bootstrapped) ## standard errors: mod1=lm(Examination~Agriculture, data=swiss) summary(mod1) ## The following code bootstraps the standard errors: nboot = 1000 ## specify number of bootstrap simulations boot.est = matrix(nrow=nboot,ncol=2) ## Create an empty matrix with two columns and with one row for each ## simulation. for (b in 1:nboot) { ## Repeat this loop nboot times. boot.sample = sample(1:dim(swiss)[1], replace=T) ## sample the indices of the swiss data set WITH REPLACEMENT. ## So some indices may be sampled twice, and others not at all. boot.dat = swiss[boot.sample,] ## Subset the data set to include the rows sampled in the ## previous step. This is the “bootstrapped data set” mod=lm(Examination~Agriculture, data=boot.dat) ## Compute linear regression coefficients for the bootstrapped ## data set. boot.est[b,] = mod$coef ## save these coefficients to the matrix boot.est. ## The intercept is saved in the first column, the slope in the ## second.
} apply(boot.est, 2, sd) ## Compute the standard deviation of all bootstrapped coefficient ## estimates (intercept and slope) apply(boot.est, 2, sd) ## ditto. (This duplicate line was a typo.) apply(boot.est, 2, quantile, 0.025) ## Compute the 2.5% quantile of the bootstrapped coefficient ## estimates (intercept and slope). This gives the lower bound for the ## 95% bootstrap confidence interval apply(boot.est, 2, quantile, 0.975) ## Compute the 97.5% quantile of the bootstrapped coefficient ## estimates (intercept and slope). This gives the upper bound for the ## 95% bootstrap confidence interval
How do the bootstrapped standard errors compare to the regular standard errors? Regular standard errors: > summary(mod1)$coef[,2] (Intercept) Agriculture 2.11000959 0.03807406 Bootstrapped standard errors: > apply(boot.est, 2, sd) [1] 2.25346577 0.03680473 They are quite close!!
Lecture 7 6) We’ll use variables in the swiss data set to explain fertility levels. Here’s a summary description of the data set and variables available: Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888. [,1] Fertility Ig, ‘common standardized fertility measure’ [,2] Agriculture % of males involved in agriculture as occupation [,3] Examination % draftees receiving highest mark on army examination [,4] Education % education beyond primary school for draftees. [,5] Catholic % ‘catholic’ (as opposed to ‘protestant’). [,6] Infant.Mortality live births who live less than 1 year.
a) Perform a linear regression of Fertility on the percentage of draftees achieving the highest mark on the army Examination. Display summary output from the regression model, including standard errors and p-values. > > > >
data(swiss) attach(swiss) mod=lm(Fertility~Examination) summary(mod)
Call: lm(formula = Fertility ~ Examination) Residuals: Min 1Q -25.9375 -6.0044
Median -0.3393
3Q 7.9239
Max 19.7399
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 86.8185 3.2576 26.651 < 2e-16 *** Examination -1.0113 0.1782 -5.675 9.45e-07 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 9.642 on 45 degrees of freedom Multiple R-squared: 0.4172, Adjusted R-squared: 0.4042 F-statistic: 32.21 on 1 and 45 DF, p-value: 9.45e-07 b) Is the Examination variable statistically significant? What is its p-value? Yes. Its p-value is 9.45e-07, which is substantially less than 0.05. c) For an increase of one percentage point in Examination, how many units do we expect the Fertility measure to increase? We expect it to decrease by 1.0113 units. d) Make a scatter plot of fitted values on the x-axis and residuals on the y-axis. Add a horizontal line at y=0.
e) What does your plot in (d) tell you? The constant variance assumption appears to be met. f) Make a QQ normal plot of the residuals from your model. Add a QQ line. What does this plot tell you? > qqnorm(resid(mod)) > qqline(resid(mod)) The normality assumption is met:
g) Make a scatter plot of Fertility (y-axis) against Examination (x-axis) and add the regression line in red.
> plot(Fertility~Examination) > abline(lm(Fertility~Examination),col="red")