-
Notifications
You must be signed in to change notification settings - Fork 63
/
Copy pathTutorial5_BasicAnalytics.Rmd
691 lines (547 loc) · 23.1 KB
/
Tutorial5_BasicAnalytics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
% Tutorial 5: Analytics
% DPI R Bootcamp
% Jared Knowles
# Overview
In this lesson we hope to learn:
- How to use summary statistics to look at data
- How to run basic statistical tests on a dataset
- How to use formulas to build a statistical model
- Analyze subsets of data
<p align="center"><img src="img/dpilogo.png" height="81" width="138"></p>
```{r setup, include=FALSE}
# set global chunk options
opts_chunk$set(fig.path='figure/slides5-', cache.path='cache/slides5-',fig.width=8,fig.height=6,message=FALSE,error=FALSE,warning=FALSE,echo=TRUE,size='tiny',dev='png')
library(eeptools)
```
# Datasets
In this tutorial we will use a number of datasets of different types:
- `stulong`: student-level assessment and demographics data (simulated and research ready)
- `midwest_schools.csv`: aggregate school level test score averages from a large Midwest state
# Reading Data In
- We start with the aggregate school level data
```{r readschdata}
load('data/midwest_schools.rda')
head(midsch[,1:12])
```
# What do we have then?
- We have unique identifiers for districts and schools
- For each school/district combination we have a row of test scores in year 1 and year 2 by test_year (of year 1); grade; and subject
- How can we use R to ask this?
```{r datastruc}
table(midsch$test_year,midsch$grade)
length(unique(midsch$district_id))
length(unique(midsch$school_id))
```
- What's wrong with this?
* More districts than schools? The IDs must be goofed
* We need to create a unique school ID
# Explore Data Structure (II)
```{r datastruc2}
table(midsch$subject,midsch$grade)
```
- Why don't we want to do `table(midsch$district_id,midsch$grade)`
- What else do we want to know?
# Diagnostic Plots Perhaps
```{r diag1}
library(ggplot2)
qplot(ss1,ss2,data=midsch,alpha=I(.07))+theme_dpi()+geom_smooth()+
geom_smooth(method='lm',se=FALSE,color='purple')
```
# Frequencies, Crosstabs, and t-tests
- Some of the most basic analyses we can implement in R are sometimes the most useful
- Before we dive in to linear regression and evaluating a linear regression, we will first look into tests of differences among groups
- This is really useful in an education context or for evaluating experiments quickly when we are interested in whether the difference we observe in groups is real, or due to chance
# Let's take a simple example of cars
- Sometimes we want to compare groups of data to other groups or a fixed value
- We use a t-test for this, but only if we believe the data are normally distributed
```{r carsdata}
data(mtcars) # load the data from R
head(mtcars)
```
# Check for normality
- The Shapiro-Wilk normality test checks this for us:
```{r normtest}
shapiro.test(mtcars$mpg)
shapiro.test(mtcars$hp)
```
- Which of these is normally distributed?
- For more on hypothesis testing, see the optional intro to statistics module
# T-test
- We can t-test the `mpg` variable then
- Let's test it against an assumption about the population using a one-sided test
```{r onewayttest}
mean(mtcars$mpg)
t.test(mtcars$mpg,mu=18,alternative="greater")
t.test(mtcars$mpg,mu=22,alternative="less")
```
- What does `t.test(mtcars$mpg,mu=18)` test?
# Two sample t-test
- What if we want to compare two groups?
- For example cars with and without automatic transmissions
- How might we do this?
- Look at the documentation?
# Answer
```{r twowaytest}
t.test(mpg~am,data=mtcars)
```
# If data is non-normal
- You can do `wilcox.test(mtcars$hp,mu=102)`
- Or `wilcox.test(hp~am,data=mtcars)`
- Just Google it!
# Chi-Square
- Placebo v. aspirin
- Heartattack or no
```{r aspirindata}
aspirin <- matrix(c(189,104,10845,10933),ncol=2,dimnames=list(c("Placebo","Aspirin"),
c("MI","No MI")))
aspirin
```
- We want to know how independent from one another heart attacks and taking the placebo are. If it is unlikely that they are independent, then we would conclude these two variables have some relationship in our population (assuming the data was collected well)
# Test it
- To test these we use a chi-squared test statistic
```{r chi-squaretest}
chisq.test(aspirin,correct=FALSE)
fisher.test(aspirin)
```
# Back to School Data
- Let's imagine that a journalist has used this dataset to detect testing "irregularities" using publicly available aggregate test data
- The journalist's methodology is to regress test scores for a school/grade/subject in one year on a school/grade/subject aggregate test score in the next year
- For example, 2005-06, 3rd grade, reading scores are regressed on 2006-07, 4th grade, reading scores
- Where the observed gains are higher or lower than predicted by this statistical model, "irregularities" are suspected
# Regression 101
- What is wrong with this approach?
- What are the five assumptions of simple linear regression?
1. Dependent variable has a linear relationship to a combination of independent variables + a disturbance term (no variables omitted)
2. The expected value of the disturbance term is zero.
3. Disturbance terms have the same variance and are not correlated with one another.
4. The observations of the independent variables are considered fixed in repeated samples.
5. The number of observations exceeds the number of independent variables and no fixed linear combination exists among the independent variables (perfect collinearity)
- What are other concerns?
1. Sensitivity of the model to outliers
2. Confidence interval around predictions
3. Validity of the model on key subsets
# How to approach this?
- This is a perfect case for exploring the power of R for doing analysis on data and for checking accuracy of results
- Two approaches
1. Work on one test,grade,school_year combination and validate that
2. Test model assumptions across all combinations
3. Build one mega model from full data and control for year, grade, and subject
# First Step
- How many unique combinations are there of `test_year`, `grade`, and `subject`?
```{r uniq}
nrow(unique(midsch[,c(3,4,14)]))
```
# Let's look at one subset to start
5th grade, 2011, math scores
```{r drawsubset}
midsch_sub<-subset(midsch,midsch$grade==5 &
midsch$test_year==2011 &
midsch$subject=='math')
```
- How many observations in `midsch_sub`?
# How to specify a regression in R
`my_mod<-lm(ss2~ss1,data=midsch_sub)`
- OLS regression is done by the trusty `lm` function
- The `~` character divides the dependent variable `ss2` from the independent variable `ss1`
- We want to store the results of our function so we can capture it by `my_mod<-`
- `data` means we don't have to write: `lm(midsch_sub$ss2~midsch_sub$ss1)`
# Run the regression
- To implement the regression described above is simple in this framework
```{r basicreg}
ss_mod<-lm(ss2~ss1,data=midsch_sub)
summary(ss_mod)
```
# Explore the Model Output
```{r exploremod}
objects(ss_mod)
```
- Most of these we can ignore
- A few are interesting such as `coefficients` `fitted.values` and `call`
- Any idea how to access these objects?
# Omitted Variable
- What other data elements do we have available that might be omitted from our model specification?
* What about the class size?
- Class size is attractive since class size probably correlates with the variability in the change of scores from year 1 to year 2--big classes swing less than small classes
# Plot of class size
```{r diagn}
qplot(n2,ss2-ss1,data=midsch,alpha=I(.1))+theme_dpi()+geom_smooth()
```
- Group size might matter
- Another type of omitted variable are non-linear terms (polynomials) of the independent variable
# How to check formally
```{r basicregn}
ssN1_mod<-lm(ss2~ss1+n1,data=midsch_sub)
summary(ssN1_mod)
ssN2_mod<-lm(ss2~ss1+n2,data=midsch_sub)
summary(ssN2_mod)
```
# F Test
- Both n1 and n2 seemed to matter, or potentially to matter
- How can we test this formally?
```{r ftest}
anova(ss_mod,ssN1_mod,ssN2_mod)
AIC(ssN2_mod)
AIC(ssN1_mod)
```
- No difference between `n1` and `n2` but either improves model fit over the model without it
# Diagnostic Check for Linearity
```{r regdiagreset}
library(lmtest)
resettest(ss_mod,power=2:4)
```
- Statistically significant
```{r raintest}
raintest(ss2~ss1,fraction=0.5,order.by=~ss1,data=midsch_sub)
```
- Statistically significant
```{r harvtest}
harvtest(ss2~ss1,order.by=~ss1,data=midsch_sub)
```
- Statistically significant
- This is not a good sign for our model.
# Adjust for linearity
- No need to despair, we can quickly test a couple easy adjustments for non-linearity
- First, let's just include polynomial terms of our predictor
```{r polyregression}
ss_poly<-lm(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4),data=midsch_sub)
summary(ss_poly)
```
- Ok, now what?
```{r polyanova}
anova(ss_mod,ss_poly)
```
# Is this polynomial model still nonlinear?
```{r testpolys}
resettest(ss_poly,power=2:4)
raintest(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4),fraction=0.5,order.by=~ss1,data=midsch_sub)
harvtest(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4),order.by=~ss1,data=midsch_sub)
```
- We don't eliminate all the problems
# What if we include our omitted variable?
```{r polyn}
ss_polyn<-lm(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4)+n2,data=midsch_sub)
anova(ss_mod,ssN2_mod,ss_poly,ss_polyn)
```
- Promising
# Non-linearity tests
```{r testpolyns}
resettest(ss_polyn,power=2:4)
raintest(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4)+n2,fraction=0.5,order.by=~ss1,data=midsch_sub)
harvtest(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4)+n2,order.by=~ss1,data=midsch_sub)
```
- Yipes, nope, this isn't going to fix it.
# Another way to explore non-linearity
- Why might student test scores have a non-linear relationship?
- Tests are goofy at the low and high end of the scale, partly due to design, partly due to regression toward the mean
- How can we check if this is occurring in our data?
- We can use quantile regression, to fit different models to different subsets of the data and see if they are different
# Diagnostic Check for Quantile Regression
```{r quantileregression}
library(quantreg)
ss_quant<-rq(ss2~ss1,tau=c(seq(.1,.9,.1)),data=midsch_sub)
plot(summary(ss_quant,se='boot',method='wild'))
```
# Results
- `ss_quant` shows that in the lower quantiles the coefficients for the intercept and `ss1` fall outside the confidence interval around the base coefficient
- This suggests the relationship may vary in a statistically significant fashion at the high and low end of the scales, evidence of systematic non-linearity
# Robustness
```{r quantileregression2}
ss_quant2<-rq(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4)+n2,tau=c(seq(.1,.9,.1)),data=midsch_sub)
plot(summary(ss_quant2,se='boot',method='wild'))
```
- The polynomials seem to address some of our concern about non-linearity in this manner, but remember, don't eliminate other symptoms of non-linearity
# Showing Off
```{r betterquantileplot}
ss_quant3<-rq(ss2~ss1,tau=-1,data=midsch_sub)
qplot(ss_quant3$sol[1,],ss_quant3$sol[5,],geom='line',main='Continuous Quantiles')+
theme_dpi()+xlab('Quantile')+ylab(expression(beta))+geom_hline(yintercept=coef(summary(ss_mod))[2,1])+
geom_hline(yintercept=coef(summary(ss_mod))[2,1]+(2*coef(summary(ss_mod))[2,2]),linetype=3)+
geom_hline(yintercept=coef(summary(ss_mod))[2,1]-(2*coef(summary(ss_mod))[2,2]),linetype=3)
```
# Showing Off 2
```{r betterquantileplot2}
ss_quant4<-rq(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4)+n2,tau=-1,data=midsch_sub)
qplot(ss_quant4$sol[1,],ss_quant4$sol[5,],geom='line',main='Continuous Quantiles')+
theme_dpi()+xlab('Quantile')+ylab(expression(beta))+geom_hline(yintercept=coef(summary(ss_mod))[2,1])+
geom_hline(yintercept=coef(summary(ss_mod))[2,1]+(2*coef(summary(ss_mod))[2,2]),linetype=3)+
geom_hline(yintercept=coef(summary(ss_mod))[2,1]-(2*coef(summary(ss_mod))[2,2]),linetype=3)
```
# Test all 50 models
- This is just one of the fifty models we identified at the start
- How do we test them all?
- With a function and `dlply`
```{r mods}
library(plyr)
midsch$id<-interaction(midsch$test_year,midsch$grade,midsch$subject)
mods<-dlply(midsch,.(id),lm,formula=ss2 ~ ss1)
objects(mods)[1:10]
```
# Now we have fifty models in an object
- We need to test each one of them
- Sound tedious?
- R can easily do this as well
```{r resettest}
mytest<-llply(mods,function(x) resettest(x,power=2:4))
mytest[[1]]
mytest[[2]]
```
- OK, not that easy!
# Test Residuals
```{r residplot1}
a1<-qplot(id,residmean,
data=ddply(midsch,.(id),summarize,residmean=mean(residuals)),
geom='bar',main='Provided Residuals')+
theme_dpi()+opts(axis.text.x=theme_blank(),axis.ticks=theme_blank())+
ylab('Mean of Residuals')+
xlab('Model')+geom_text(aes(x=12,y=0.3),label='SD of Residuals = 9')
a2<-qplot(id,V1,data=ldply(mods,function(x) mean(x$residuals)),
geom='bar',main='Replication Models')+
theme_dpi()+opts(axis.text.x=theme_blank(),axis.ticks=theme_blank())+
ylab('Mean of Residuals')+
xlab('Model')+geom_text(aes(x=7,y=0.3),
label=paste("SD =",round(mean(ldply(mods,function(x) sd(x$residuals))$V1),2)))
grid.arrange(a1,a2,main="Comparing Replication and Provided Residual Means by Model")
```
# Test Expected Value of Residuals
- A key thing is that the residuals sum to 0
```{r residplot}
qplot(residuals,data=midsch,geom='density')+
stat_function(fun = dnorm, aes(colour = 'Normal'))+
geom_histogram(aes(y = ..density..), alpha = I(0.4)) +
geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
scale_colour_manual(name = 'Density', values = c('red', 'blue')) +
opts(legend.position = c(0.85, 0.85))+theme_dpi()
```
# Residuals Have Uniform Variance
```{r perfectmodel}
b<-2*rnorm(5000)
c<-b+runif(5000)
dem<-lm(c~b)
a1<-qplot(midsch$ss1,abs(midsch$residuals),
main='Residual Plot of Replication Data',geom='point',alpha=I(0.1))+
geom_smooth(method='lm',se=TRUE)+xlab('SS1')+ylab('Residuals')+
geom_smooth(se=FALSE)+
ylim(c(0,50))+theme_dpi()
a2<-qplot(b,abs(lm(c~b)$residuals),main='Well Specified OLS',alpha=I(.3))+theme_dpi()+geom_smooth()
grid.arrange(a1,a2,ncol=2)
```
# Empirical Tests
- We can do two tests, Breusch-Pagan and the Goldfeld-Quandt test to test for non-standard error variance
- Again, in R these are simple to use
```{r bpgqtest}
bptest(ss_mod)
gqtest(ss_mod)
```
# Correcting for Heteroskedacticity
- After all it only messes up the standard errors, not the estimates themselves
# Accuracy of Predictions
- Even if the regression models fit the assumptions above, a somewhat heroic assumption, they still might not be accurate!
- What are some good ways to address accuracy and outlier sensitivity?
- R model diagnostics can be easily run on any `lm` object
# Convenience Functions
- Using `ggplot2` we can run something called `fortify` on our linear model to get a data frame that tells us a lot of diagnostics about each observation
- Example:
```{r fortifymethod}
damodel<-fortify(ss_mod)
summary(damodel)
```
# What do we get?
- `dv` `iv` `.hat` `.sigma` `.cooksd` `.fitted` `.resid` and `.stdresid`
- Some are obvious: `.fitted` is the prediction from our model
- `.resid` = `dv` - `.fitted`
- `.stdresid` = normalized `.resid`
- `.sigma` = estimate of residual standard deviation when observation is dropped from the model
- `.hat` is more obscure, but is a measure of the influence an individual observation has on overall model fit
# So, how do we use this?
- Visual inspection is the best in this case
- It's easy to implement, easy to interpret, and easy to explain to others
- Watch: let's look at an ideal linear regression model
```{r simulatedgoodmodel}
a<-rnorm(500)
b<-runif(500)
c<-a+b
goodsim<-lm(c~a)
goodsim_a<-fortify(goodsim)
qplot(c,.hat,data=goodsim_a)+theme_dpi()+geom_smooth(se=FALSE)
```
# Let's look at our model
```{r nonsim}
qplot(ss2,.hat,data=damodel)+theme_dpi()+geom_smooth(se=FALSE)
```
- The deviation here is quite stark
# Compare and contrast
```{r comparisonplot,out.width='800px',out.height='570px'}
a<-qplot(c,.hat,data=goodsim_a)+theme_dpi()+geom_smooth(se=FALSE)
b<-qplot(ss2,.hat,data=damodel)+theme_dpi()+geom_smooth(se=FALSE)
grid.arrange(a,b,ncol=2)
```
- These are different, but what do they tell us?
- Points with a high `hat` value are what we call "high leverage" observations, and on their own are not bad--in fact our good model has lots of them
- They help keep the model robust to outliers
- What do you notice about our replication model's outliers?
# One step further
- A rule of thumb is that observations greater than hat of 3x the mean hat value are troubling
```{r diagnosticplot}
qplot(ss2,.hat,data=damodel)+theme_dpi()+geom_smooth(se=FALSE)+geom_hline(yintercept=3*mean(damodel$.hat),color=I("red"),size=I(1.1))
```
- Yikes!
# Checking this systematically
- First, a nasty chunk of R code
```{r influentialobs}
infobs<-which(apply(influence.measures(ss_mod)$is.inf,1,any))
ssdata<-cbind(fortify(ss_mod),midsch_sub)
ssdata$id3<-paste(ssdata$district_id,ssdata$school_id,sep='.')
noinf<-lm(ss2~ss1,data=midsch_sub[-infobs,])
noinff<-fortify(noinf)
```
# Then a plot
```{r infobsplot}
qplot(ss1,ss2,data=ssdata,alpha=I(.5))+
geom_line(aes(ss1,.fitted,group=1),data=ssdata,size=I(1.02))+
geom_line(aes(x=ss1,y=.fitted,group=1),data=noinff,linetype=6,size=1.1,
color='blue')+
theme_dpi()+xlab('SS1')+ylab('Y')
```
# What have we learned?
- Regression in R is easy
- Regression is easy to get wrong
# What might we do different to address these concerns?
- Well, there is nesting in our data that is being ignored
- Also, by fitting fifty separate models we are not efficiently using our data
- Let's look at some quick easy strategies to address that concern
- Let's start with the megamodel
# Megamodel I
```{r megamodel2}
my_megamod<-lm(ss2~ss1+grade+test_year+subject,data=midsch)
summary(my_megamod)
```
- What's wrong with this?
# Megamodel II
```{r megamodel3}
my_megamod2<-lm(ss2~ss1+as.factor(grade)+
as.factor(test_year)+subject,data=midsch)
summary(my_megamod2)
```
# Comparison
- How do we test between these two?
- An F test, which we can run in ANOVA--how?
# Answer
```{r anovamega}
anova(my_megamod,my_megamod2)
```
# Interaction Terms
- R model syntax makes it easy to include interaction terms as well
- An interaction fits a parameter for all combinations of factors in the interaction and can be inserted simply with a `*`
```{r modelinteraction}
megamodeli<-lm(ss2~ss1+as.factor(grade)+subject*factor(test_year),data=midsch)
summary(megamodeli)
```
# Interaction Terms II
- The `:` in this case only includes the interactions, but not the main effects in every combination
```{r modelinteraction2}
megamodelii<-lm(ss2~ss1+as.factor(grade)+subject:factor(test_year),data=midsch)
summary(megamodelii)
```
- How is this different, and why does it matter?
# Meganova for fun
```{r meagnova}
anova(my_megamod,my_megamod2,megamodelii,megamodeli)
```
# What about nesting the data?
- One problem here is that observations--which are grades--are nested within schools and districts
- Why can't we include a variable for each school in our replication model from earlier?
- What happens if we try?
```{r breakrep}
badidea<-lm(ss2~ss1+as.factor(grade)+
as.factor(test_year)+subject+factor(district_id),data=midsch)
summary(badidea)
```
# Adjusting for Heteroskedasticity
- We do this with the `sandwich` package, which allows us to manipulate the variance-covariance matrix and adjust our estimates of standard errors appropriately to correct for non-independence
```{r sandwiching}
library(sandwich)
vcovHC(ss_mod,type="HC4")
```
- This is ugly. I wrote a function to make it easy for the univariate case. For the multivariate case functions are still pending.
# Function
```{r glsfunction,eval=FALSE}
gls.correct <-
function(lm){
require(sandwich)
rob<-t(sapply(c('const','HC0','HC1','HC2','HC3','HC4','HC5'),function (x)
sqrt(diag(vcovHC(lm,type=x)))))
a<-c(NA,(rob[2,1]-rob[1,1])/rob[1,1],(rob[3,1]-rob[1,1])/rob[1,1],
(rob[4,1]-rob[1,1])/rob[1,1],(rob[5,1]-rob[1,1])/rob[1,1],
(rob[6,1]-rob[1,1])/rob[1,1],(rob[7,1]-rob[1,1])/rob[1,1])
rob<-cbind(rob,round(a*100,2))
a<-c(NA,(rob[2,2]-rob[1,2])/rob[1,2],(rob[3,2]-rob[1,2])/rob[1,2],
(rob[4,2]-rob[1,2])/rob[1,2],(rob[5,2]-rob[1,2])/rob[1,2],
(rob[6,2]-rob[1,2])/rob[1,2],(rob[7,2]-rob[1,2])/rob[1,2])
rob<-cbind(rob,round(a*100,2))
rob<-as.data.frame(rob)
names(rob)<-c('alpha','beta','alpha.pchange','beta.pchange')
rob$id2<-rownames(rob)
rob
}
```
# Results of GLS Function
```{r glscorrect}
gls.correct(ss_mod)
```
- So our standard errors were underestimated by about 18%
# How to Explore Substantive Significance
- One way is to plot the effects across the range of the variables
```{r substantiveeffect,echo=FALSE}
a<-coef(ssN1_mod)
xdat<-seq(min(midsch_sub$ss1),max(midsch_sub$ss1),length.out=20)
ydat<-xdat*a[2]+a[1]+(median(midsch_sub$n1)*a[3])
ydatsmall<-xdat*a[2]+a[1]+(min(midsch_sub$n1)*a[3])
ydatbig<-xdat*a[2]+a[1]+(max(midsch_sub$n1)*a[3])
myx<-rep(xdat,3)
myy<-c(ydat,ydatsmall,ydatbig)
mymod<-rep(c("mean","low","high"),each=length(xdat))
newdat<-data.frame(x=myx,y=myy,type=mymod)
ggplot(newdat,aes(x=x,y=y,color=mymod))+geom_line(aes(linetype=mymod),size=I(.8))+coord_cartesian()+theme_dpi()
```
# Clumsy Code to Do This
```{r codeforlineplots,eval=FALSE}
a<-coef(ssN1_mod)
xdat<-seq(min(midsch_sub$ss1),max(midsch_sub$ss1),length.out=20)
ydat<-xdat*a[2]+a[1]+(median(midsch_sub$n1)*a[3])
ydatsmall<-xdat*a[2]+a[1]+(min(midsch_sub$n1)*a[3])
ydatbig<-xdat*a[2]+a[1]+(max(midsch_sub$n1)*a[3])
myx<-rep(xdat,3)
myy<-c(ydat,ydatsmall,ydatbig)
mymod<-rep(c("mean","low","high"),each=length(xdat))
newdat<-data.frame(x=myx,y=myy,type=mymod)
ggplot(newdat,aes(x=x,y=y,color=mymod))+geom_line(aes(linetype=mymod),size=I(.8))+coord_cartesian()+theme_dpi()
```
# Coefficient Plots
```{r coefplots,fig.width=12,fig.height=6,out.width='600px',out.height='300px'}
b<-sqrt(diag(vcov(ssN1_mod)))
mycoef<-data.frame(var=names(b),y=a,se=b)
ggplot(mycoef[2:3,],aes(x=var,y=y,ymin=y-2*se,ymax=y+2*se))+geom_pointrange()+theme_dpi()+geom_hline(yintercept=0,size=I(1.1),color=I('red'))
```
# Exercises
1. Test our new megamodel for heteroskedacticty. What do you find?
2. Does `megamodel2` exhibit non-linearity? Can you fit a quantile regression model of this?
3. Can you simulate a model with multiple variables? What does it look like?
Bonus: Write better code for the plot with different lines for different values of variable 2.
# Other References
- [Video Tutorials](http://www.twotorials.com/)
- [Quick R: Multiple Regression in R](http://www.statmethods.net/stats/regression.html)
- [UCLA Regression Guide](http://www.ats.ucla.edu/stat/r/sk/books_pra.htm)
# Session Info
It is good to include the session info, e.g. this document is produced with **knitr** version `r packageVersion('knitr')`. Here is my session info:
```{r session-info}
print(sessionInfo(), locale=FALSE)
```
# Attribution and License
<p xmlns:dct="http://purl.org/dc/terms/">
<a rel="license" href="http://creativecommons.org/publicdomain/mark/1.0/">
<img src="http://i.creativecommons.org/p/mark/1.0/88x31.png"
style="border-style: none;" alt="Public Domain Mark" />
</a>
<br />
This work (<span property="dct:title">R Tutorial for Education</span>, by <a href="www.jaredknowles.com" rel="dct:creator"><span property="dct:title">Jared E. Knowles</span></a>), in service of the <a href="http://www.dpi.wi.gov" rel="dct:publisher"><span property="dct:title">Wisconsin Department of Public Instruction</span></a>, is free of known copyright restrictions.
</p>