Tutorial5_BasicAnalytics.Rmd

% Tutorial 5: Analytics
% DPI R Bootcamp 
% Jared Knowles

# Overview
In this lesson we hope to learn:

- How to use summary statistics to look at data
- How to run basic statistical tests on a dataset
- How to use formulas to build a statistical model
- Analyze subsets of data

<p align="center"><img src="img/dpilogo.png" height="81" width="138"></p>

```{r setup, include=FALSE}
# set global chunk options
opts_chunk$set(fig.path='figure/slides5-', cache.path='cache/slides5-',fig.width=8,fig.height=6,message=FALSE,error=FALSE,warning=FALSE,echo=TRUE,size='tiny',dev='png')
library(eeptools)
```


# Datasets
In this tutorial we will use a number of datasets of different types:

- `stulong`: student-level assessment and demographics data (simulated and research ready)
- `midwest_schools.csv`: aggregate school level test score averages from a large Midwest state

# Reading Data In
- We start with the aggregate school level data
```{r readschdata}
load('data/midwest_schools.rda')
head(midsch[,1:12])
```

#  What do we have then?
- We have unique identifiers for districts and schools
- For each school/district combination we have a row of test scores in year 1 and year 2 by test_year (of year 1); grade; and subject
- How can we use R to ask this?
```{r datastruc}
table(midsch$test_year,midsch$grade)
length(unique(midsch$district_id))
length(unique(midsch$school_id))
```
- What's wrong with this?
  * More districts than schools? The IDs must be goofed
  * We need to create a unique school ID
  

# Explore Data Structure (II)
```{r datastruc2}
table(midsch$subject,midsch$grade)
```
- Why don't we want to do `table(midsch$district_id,midsch$grade)`
- What else do we want to know?

# Diagnostic Plots Perhaps
```{r diag1}
library(ggplot2)
qplot(ss1,ss2,data=midsch,alpha=I(.07))+theme_dpi()+geom_smooth()+
  geom_smooth(method='lm',se=FALSE,color='purple')
```

# Frequencies, Crosstabs, and t-tests
- Some of the most basic analyses we can implement in R are sometimes the most useful
- Before we dive in to linear regression and evaluating a linear regression, we will first look into tests of differences among groups
- This is really useful in an education context or for evaluating experiments quickly when we are interested in whether the difference we observe in groups is real, or due to chance

# Let's take a simple example of cars

- Sometimes we want to compare groups of data to other groups or a fixed value
- We use a t-test for this, but only if we believe the data are normally distributed

```{r carsdata}
data(mtcars) # load the data from R
head(mtcars)
```


# Check for normality
- The Shapiro-Wilk normality test checks this for us:

```{r normtest}
shapiro.test(mtcars$mpg)
shapiro.test(mtcars$hp)
```

- Which of these is normally distributed?
- For more on hypothesis testing, see the optional intro to statistics module

# T-test

- We can t-test the `mpg` variable then
- Let's test it against an assumption about the population using a one-sided test

```{r onewayttest}
mean(mtcars$mpg)
t.test(mtcars$mpg,mu=18,alternative="greater")
t.test(mtcars$mpg,mu=22,alternative="less")
```

- What does `t.test(mtcars$mpg,mu=18)` test?

# Two sample t-test
- What if we want to compare two groups?
- For example cars with and without automatic transmissions
- How might we do this?
- Look at the documentation?

# Answer

```{r twowaytest}
t.test(mpg~am,data=mtcars)
```

# If data is non-normal
- You can do `wilcox.test(mtcars$hp,mu=102)`
- Or `wilcox.test(hp~am,data=mtcars)`
- Just Google it!

# Chi-Square
- Placebo v. aspirin
- Heartattack or no

```{r aspirindata}
aspirin <- matrix(c(189,104,10845,10933),ncol=2,dimnames=list(c("Placebo","Aspirin"),
                                                              c("MI","No MI")))
aspirin
```

- We want to know how independent from one another heart attacks and taking the placebo are. If it is unlikely that they are independent, then we would conclude these two variables have some relationship in our population (assuming the data was collected well)

# Test it
- To test these we use a chi-squared test statistic 
```{r chi-squaretest}
chisq.test(aspirin,correct=FALSE)
fisher.test(aspirin)
```


# Back to School Data
- Let's imagine that a journalist has used this dataset to detect testing "irregularities" using publicly available aggregate test data
- The journalist's methodology is to regress test scores for a school/grade/subject in one year on a school/grade/subject aggregate test score in the next year
- For example, 2005-06, 3rd grade, reading scores are regressed on 2006-07, 4th grade, reading scores
- Where the observed gains are higher or lower than predicted by this statistical model, "irregularities" are suspected

# Regression 101
- What is wrong with this approach? 
- What are the five assumptions of simple linear regression?
  1. Dependent variable has a linear relationship to a combination of independent variables + a disturbance term (no variables omitted)
  2. The expected value of the disturbance term is zero.
  3. Disturbance terms have the same variance and are not correlated with one another.
  4. The observations of the independent variables are considered fixed in repeated samples.
  5. The number of observations exceeds the number of independent variables and no fixed linear combination exists among the independent variables (perfect collinearity)
- What are other concerns?
  1. Sensitivity of the model to outliers
  2. Confidence interval around predictions
  3. Validity of the model on key subsets


# How to approach this?
- This is a perfect case for exploring the power of R for doing analysis on data and for checking accuracy of results
- Two approaches
  1. Work on one test,grade,school_year combination and validate that
  2. Test model assumptions across all combinations
  3. Build one mega model from full data and control for year, grade, and subject


# First Step
- How many unique combinations are there of `test_year`, `grade`, and `subject`?

```{r uniq}
nrow(unique(midsch[,c(3,4,14)]))
```

# Let's look at one subset to start
5th grade, 2011, math scores
```{r drawsubset}
midsch_sub<-subset(midsch,midsch$grade==5 &
                   midsch$test_year==2011 &
                   midsch$subject=='math')
```
- How many observations in `midsch_sub`?

# How to specify a regression in R

`my_mod<-lm(ss2~ss1,data=midsch_sub)`

- OLS regression is done by the trusty `lm` function
- The `~` character divides the dependent variable `ss2` from the independent variable `ss1`
- We want to store the results of our function so we can capture it by `my_mod<-`
- `data` means we don't have to write: `lm(midsch_sub$ss2~midsch_sub$ss1)`


# Run the regression
- To implement the regression described above is simple in this framework
```{r basicreg}
ss_mod<-lm(ss2~ss1,data=midsch_sub)
summary(ss_mod)
```

# Explore the Model Output
```{r exploremod}
objects(ss_mod)
```
- Most of these we can ignore
- A few are interesting such as `coefficients` `fitted.values` and `call`
- Any idea how to access these objects?

# Omitted Variable
- What other data elements do we have available that might be omitted from our model specification?
  * What about the class size?
- Class size is attractive since class size probably correlates with the variability in the change of scores from year 1 to year 2--big classes swing less than small classes

# Plot of class size
```{r diagn}
qplot(n2,ss2-ss1,data=midsch,alpha=I(.1))+theme_dpi()+geom_smooth()
```
- Group size might matter
- Another type of omitted variable are non-linear terms (polynomials) of the independent variable


# How to check formally
```{r basicregn}
ssN1_mod<-lm(ss2~ss1+n1,data=midsch_sub)
summary(ssN1_mod)
ssN2_mod<-lm(ss2~ss1+n2,data=midsch_sub)
summary(ssN2_mod)
```

# F Test
- Both n1 and n2 seemed to matter, or potentially to matter
- How can we test this formally?

```{r ftest}
anova(ss_mod,ssN1_mod,ssN2_mod)
AIC(ssN2_mod)
AIC(ssN1_mod)
```
- No difference between `n1` and `n2` but either improves model fit over the model without it

# Diagnostic Check for Linearity
```{r regdiagreset}
library(lmtest)
resettest(ss_mod,power=2:4)
```
- Statistically significant
```{r raintest}
raintest(ss2~ss1,fraction=0.5,order.by=~ss1,data=midsch_sub)
```
- Statistically significant
```{r harvtest}
harvtest(ss2~ss1,order.by=~ss1,data=midsch_sub)
```
- Statistically significant
- This is not a good sign for our model.

# Adjust for linearity
- No need to despair, we can quickly test a couple easy adjustments for non-linearity
- First, let's just include polynomial terms of our predictor
```{r polyregression}
ss_poly<-lm(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4),data=midsch_sub)
summary(ss_poly)
```
- Ok, now what?
```{r polyanova}
anova(ss_mod,ss_poly)
```

# Is this polynomial model still nonlinear?
```{r testpolys}
resettest(ss_poly,power=2:4)
raintest(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4),fraction=0.5,order.by=~ss1,data=midsch_sub)
harvtest(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4),order.by=~ss1,data=midsch_sub)
```
- We don't eliminate all the problems

# What if we include our omitted variable?
```{r polyn}
ss_polyn<-lm(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4)+n2,data=midsch_sub)
anova(ss_mod,ssN2_mod,ss_poly,ss_polyn)
```
- Promising

# Non-linearity tests
```{r testpolyns}
resettest(ss_polyn,power=2:4)
raintest(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4)+n2,fraction=0.5,order.by=~ss1,data=midsch_sub)
harvtest(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4)+n2,order.by=~ss1,data=midsch_sub)
```
- Yipes, nope, this isn't going to fix it.

# Another way to explore non-linearity
- Why might student test scores have a non-linear relationship?
- Tests are goofy at the low and high end of the scale, partly due to design, partly due to regression toward the mean
- How can we check if this is occurring in our data? 
- We can use quantile regression, to fit different models to different subsets of the data and see if they are different

# Diagnostic Check for Quantile Regression
```{r quantileregression}
library(quantreg)
ss_quant<-rq(ss2~ss1,tau=c(seq(.1,.9,.1)),data=midsch_sub)
plot(summary(ss_quant,se='boot',method='wild'))
```

# Results
- `ss_quant` shows that in the lower quantiles the coefficients for the intercept and `ss1` fall outside the confidence interval around the base coefficient
- This suggests the relationship may vary in a statistically significant fashion at the high and low end of the scales, evidence of systematic non-linearity

# Robustness
```{r quantileregression2}
ss_quant2<-rq(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4)+n2,tau=c(seq(.1,.9,.1)),data=midsch_sub)
plot(summary(ss_quant2,se='boot',method='wild'))
```
- The polynomials seem to address some of our concern about non-linearity in this manner, but remember, don't eliminate other symptoms of non-linearity

# Showing Off
```{r betterquantileplot}
ss_quant3<-rq(ss2~ss1,tau=-1,data=midsch_sub)
qplot(ss_quant3$sol[1,],ss_quant3$sol[5,],geom='line',main='Continuous Quantiles')+
  theme_dpi()+xlab('Quantile')+ylab(expression(beta))+geom_hline(yintercept=coef(summary(ss_mod))[2,1])+
  geom_hline(yintercept=coef(summary(ss_mod))[2,1]+(2*coef(summary(ss_mod))[2,2]),linetype=3)+
  geom_hline(yintercept=coef(summary(ss_mod))[2,1]-(2*coef(summary(ss_mod))[2,2]),linetype=3)
```

# Showing Off 2
```{r betterquantileplot2}
ss_quant4<-rq(ss2~ss1+I(ss1^2)+I(ss1^3)+I(ss1^4)+n2,tau=-1,data=midsch_sub)
qplot(ss_quant4$sol[1,],ss_quant4$sol[5,],geom='line',main='Continuous Quantiles')+
  theme_dpi()+xlab('Quantile')+ylab(expression(beta))+geom_hline(yintercept=coef(summary(ss_mod))[2,1])+
  geom_hline(yintercept=coef(summary(ss_mod))[2,1]+(2*coef(summary(ss_mod))[2,2]),linetype=3)+
  geom_hline(yintercept=coef(summary(ss_mod))[2,1]-(2*coef(summary(ss_mod))[2,2]),linetype=3)
```


# Test all 50 models
- This is just one of the fifty models we identified at the start
- How do we test them all?
- With a function and `dlply`
```{r mods}
library(plyr)
midsch$id<-interaction(midsch$test_year,midsch$grade,midsch$subject)
mods<-dlply(midsch,.(id),lm,formula=ss2 ~ ss1)
objects(mods)[1:10]
```

# Now we have fifty models in an object
- We need to test each one of them
- Sound tedious?
- R can easily do this as well
```{r resettest}
mytest<-llply(mods,function(x) resettest(x,power=2:4))
mytest[[1]]
mytest[[2]]
```
- OK, not that easy!

# Test Residuals
```{r residplot1}
a1<-qplot(id,residmean,
      data=ddply(midsch,.(id),summarize,residmean=mean(residuals)),
      geom='bar',main='Provided Residuals')+
        theme_dpi()+opts(axis.text.x=theme_blank(),axis.ticks=theme_blank())+
        ylab('Mean of Residuals')+
        xlab('Model')+geom_text(aes(x=12,y=0.3),label='SD of Residuals = 9')

a2<-qplot(id,V1,data=ldply(mods,function(x) mean(x$residuals)),
          geom='bar',main='Replication Models')+
        theme_dpi()+opts(axis.text.x=theme_blank(),axis.ticks=theme_blank())+
        ylab('Mean of Residuals')+
        xlab('Model')+geom_text(aes(x=7,y=0.3),
        label=paste("SD =",round(mean(ldply(mods,function(x) sd(x$residuals))$V1),2)))
grid.arrange(a1,a2,main="Comparing Replication and Provided Residual Means by Model")
```

# Test Expected Value of Residuals
- A key thing is that the residuals sum to 0
```{r residplot}
qplot(residuals,data=midsch,geom='density')+
  stat_function(fun = dnorm, aes(colour = 'Normal'))+
  geom_histogram(aes(y = ..density..), alpha = I(0.4)) +
  geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
  scale_colour_manual(name = 'Density', values = c('red', 'blue')) + 
  opts(legend.position = c(0.85, 0.85))+theme_dpi()
```

# Residuals Have Uniform Variance
```{r perfectmodel}
b<-2*rnorm(5000)
c<-b+runif(5000)
dem<-lm(c~b)

a1<-qplot(midsch$ss1,abs(midsch$residuals),
  main='Residual Plot of Replication Data',geom='point',alpha=I(0.1))+
  geom_smooth(method='lm',se=TRUE)+xlab('SS1')+ylab('Residuals')+
  geom_smooth(se=FALSE)+
    ylim(c(0,50))+theme_dpi()

a2<-qplot(b,abs(lm(c~b)$residuals),main='Well Specified OLS',alpha=I(.3))+theme_dpi()+geom_smooth()

grid.arrange(a1,a2,ncol=2)
```

# Empirical Tests
- We can do two tests, Breusch-Pagan and the Goldfeld-Quandt test to test for non-standard error variance
- Again, in R these are simple to use
```{r bpgqtest}
bptest(ss_mod)
gqtest(ss_mod)
```

# Correcting for Heteroskedacticity
- After all it only messes up the standard errors, not the estimates themselves

# Accuracy of Predictions
- Even if the regression models fit the assumptions above, a somewhat heroic assumption, they still might not be accurate!
- What are some good ways to address accuracy and outlier sensitivity?
- R model diagnostics can be easily run on any `lm` object

# Convenience Functions
- Using `ggplot2` we can run something called `fortify` on our linear model to get a data frame that tells us a lot of diagnostics about each observation
- Example:

```{r fortifymethod}
damodel<-fortify(ss_mod)
summary(damodel)           
```

# What do we get?
- `dv` `iv` `.hat` `.sigma` `.cooksd` `.fitted` `.resid` and `.stdresid`
- Some are obvious: `.fitted` is the prediction from our model
- `.resid` = `dv` - `.fitted`
- `.stdresid` = normalized `.resid`
- `.sigma` = estimate of residual standard deviation when observation is dropped from the model
- `.hat` is more obscure, but is a measure of the influence an individual observation has on overall model fit

# So, how do we use this?
- Visual inspection is the best in this case
- It's easy to implement, easy to interpret, and easy to explain to others
- Watch: let's look at an ideal linear regression model

```{r simulatedgoodmodel}
a<-rnorm(500)
b<-runif(500)
c<-a+b
goodsim<-lm(c~a)
goodsim_a<-fortify(goodsim)
qplot(c,.hat,data=goodsim_a)+theme_dpi()+geom_smooth(se=FALSE)
```

# Let's look at our model

```{r nonsim}
qplot(ss2,.hat,data=damodel)+theme_dpi()+geom_smooth(se=FALSE)
```

- The deviation here is quite stark

# Compare and contrast
```{r comparisonplot,out.width='800px',out.height='570px'}
a<-qplot(c,.hat,data=goodsim_a)+theme_dpi()+geom_smooth(se=FALSE)
b<-qplot(ss2,.hat,data=damodel)+theme_dpi()+geom_smooth(se=FALSE)
grid.arrange(a,b,ncol=2)
```
- These are different, but what do they tell us?
- Points with a high `hat` value are what we call "high leverage" observations, and on their own are not bad--in fact our good model has lots of them
- They help keep the model robust to outliers
- What do you notice about our replication model's outliers?

# One step further
- A rule of thumb is that observations greater than hat of 3x the mean hat value are troubling

```{r diagnosticplot}
qplot(ss2,.hat,data=damodel)+theme_dpi()+geom_smooth(se=FALSE)+geom_hline(yintercept=3*mean(damodel$.hat),color=I("red"),size=I(1.1))
```
- Yikes!

# Checking this systematically
- First, a nasty chunk of R code

```{r influentialobs}
infobs<-which(apply(influence.measures(ss_mod)$is.inf,1,any))
ssdata<-cbind(fortify(ss_mod),midsch_sub)
ssdata$id3<-paste(ssdata$district_id,ssdata$school_id,sep='.')
noinf<-lm(ss2~ss1,data=midsch_sub[-infobs,])
noinff<-fortify(noinf)
```

# Then a plot

```{r infobsplot}

qplot(ss1,ss2,data=ssdata,alpha=I(.5))+
   geom_line(aes(ss1,.fitted,group=1),data=ssdata,size=I(1.02))+
   geom_line(aes(x=ss1,y=.fitted,group=1),data=noinff,linetype=6,size=1.1,
             color='blue')+
   theme_dpi()+xlab('SS1')+ylab('Y')

```

# What have we learned?
- Regression in R is easy
- Regression is easy to get wrong


# What might we do different to address these concerns?
- Well, there is nesting in our data that is being ignored
- Also, by fitting fifty separate models we are not efficiently using our data
- Let's look at some quick easy strategies to address that concern
- Let's start with the megamodel

# Megamodel I
```{r megamodel2}
my_megamod<-lm(ss2~ss1+grade+test_year+subject,data=midsch)
summary(my_megamod)
```
- What's wrong with this?

# Megamodel II
```{r megamodel3}
my_megamod2<-lm(ss2~ss1+as.factor(grade)+
  as.factor(test_year)+subject,data=midsch)
summary(my_megamod2)
```

# Comparison
- How do we test between these two?
- An F test, which we can run in ANOVA--how?

# Answer
```{r anovamega}
anova(my_megamod,my_megamod2)
```

# Interaction Terms
- R model syntax makes it easy to include interaction terms as well
- An interaction fits a parameter for all combinations of factors in the interaction and can be inserted simply with a `*`

```{r modelinteraction}
megamodeli<-lm(ss2~ss1+as.factor(grade)+subject*factor(test_year),data=midsch)
summary(megamodeli)
```


# Interaction Terms II
- The `:` in this case only includes the interactions, but not the main effects in every combination

```{r modelinteraction2}
megamodelii<-lm(ss2~ss1+as.factor(grade)+subject:factor(test_year),data=midsch)
summary(megamodelii)
```

- How is this different, and why does it matter?

# Meganova for fun

```{r meagnova}
anova(my_megamod,my_megamod2,megamodelii,megamodeli)
```


# What about nesting the data?

- One problem here is that observations--which are grades--are nested within schools and districts
- Why can't we include a variable for each school in our replication model from earlier?
- What happens if we try?

```{r breakrep}
badidea<-lm(ss2~ss1+as.factor(grade)+
  as.factor(test_year)+subject+factor(district_id),data=midsch)
summary(badidea)
```


# Adjusting for Heteroskedasticity
- We do this with the `sandwich` package, which allows us to manipulate the variance-covariance matrix and adjust our estimates of standard errors appropriately to correct for non-independence

```{r sandwiching}
library(sandwich)
vcovHC(ss_mod,type="HC4")
```

- This is ugly. I wrote a function to make it easy for the univariate case. For the multivariate case functions are still pending. 

# Function
```{r glsfunction,eval=FALSE}
gls.correct <-
function(lm){
  require(sandwich)
  rob<-t(sapply(c('const','HC0','HC1','HC2','HC3','HC4','HC5'),function (x)
    sqrt(diag(vcovHC(lm,type=x)))))
  a<-c(NA,(rob[2,1]-rob[1,1])/rob[1,1],(rob[3,1]-rob[1,1])/rob[1,1],
       (rob[4,1]-rob[1,1])/rob[1,1],(rob[5,1]-rob[1,1])/rob[1,1],
       (rob[6,1]-rob[1,1])/rob[1,1],(rob[7,1]-rob[1,1])/rob[1,1])
  rob<-cbind(rob,round(a*100,2))
  a<-c(NA,(rob[2,2]-rob[1,2])/rob[1,2],(rob[3,2]-rob[1,2])/rob[1,2],
       (rob[4,2]-rob[1,2])/rob[1,2],(rob[5,2]-rob[1,2])/rob[1,2],
       (rob[6,2]-rob[1,2])/rob[1,2],(rob[7,2]-rob[1,2])/rob[1,2])
  rob<-cbind(rob,round(a*100,2))
  rob<-as.data.frame(rob)
  names(rob)<-c('alpha','beta','alpha.pchange','beta.pchange')
  rob$id2<-rownames(rob)
  rob
}
```

# Results of GLS Function
```{r glscorrect}
gls.correct(ss_mod)
```
- So our standard errors were underestimated by about 18%

# How to Explore Substantive Significance
- One way is to plot the effects across the range of the variables

```{r substantiveeffect,echo=FALSE}
a<-coef(ssN1_mod)
xdat<-seq(min(midsch_sub$ss1),max(midsch_sub$ss1),length.out=20)
ydat<-xdat*a[2]+a[1]+(median(midsch_sub$n1)*a[3])
ydatsmall<-xdat*a[2]+a[1]+(min(midsch_sub$n1)*a[3])
ydatbig<-xdat*a[2]+a[1]+(max(midsch_sub$n1)*a[3])

myx<-rep(xdat,3)
myy<-c(ydat,ydatsmall,ydatbig)
mymod<-rep(c("mean","low","high"),each=length(xdat))

newdat<-data.frame(x=myx,y=myy,type=mymod)

ggplot(newdat,aes(x=x,y=y,color=mymod))+geom_line(aes(linetype=mymod),size=I(.8))+coord_cartesian()+theme_dpi()
```

# Clumsy Code to Do This
```{r codeforlineplots,eval=FALSE}
a<-coef(ssN1_mod)
xdat<-seq(min(midsch_sub$ss1),max(midsch_sub$ss1),length.out=20)
ydat<-xdat*a[2]+a[1]+(median(midsch_sub$n1)*a[3])
ydatsmall<-xdat*a[2]+a[1]+(min(midsch_sub$n1)*a[3])
ydatbig<-xdat*a[2]+a[1]+(max(midsch_sub$n1)*a[3])

myx<-rep(xdat,3)
myy<-c(ydat,ydatsmall,ydatbig)
mymod<-rep(c("mean","low","high"),each=length(xdat))

newdat<-data.frame(x=myx,y=myy,type=mymod)

ggplot(newdat,aes(x=x,y=y,color=mymod))+geom_line(aes(linetype=mymod),size=I(.8))+coord_cartesian()+theme_dpi()
```

# Coefficient Plots
```{r coefplots,fig.width=12,fig.height=6,out.width='600px',out.height='300px'}
b<-sqrt(diag(vcov(ssN1_mod)))
mycoef<-data.frame(var=names(b),y=a,se=b)

ggplot(mycoef[2:3,],aes(x=var,y=y,ymin=y-2*se,ymax=y+2*se))+geom_pointrange()+theme_dpi()+geom_hline(yintercept=0,size=I(1.1),color=I('red'))

```


# Exercises

1. Test our new megamodel for heteroskedacticty. What do you find?

2. Does `megamodel2` exhibit non-linearity? Can you fit a quantile regression model of this?

3. Can you simulate a model with multiple variables? What does it look like?

Bonus: Write better code for the plot with different lines for different values of variable 2.

# Other References
- [Video Tutorials](http://www.twotorials.com/)
- [Quick R: Multiple Regression in R](http://www.statmethods.net/stats/regression.html)
- [UCLA Regression Guide](http://www.ats.ucla.edu/stat/r/sk/books_pra.htm)


# Session Info

It is good to include the session info, e.g. this document is produced with **knitr** version `r packageVersion('knitr')`. Here is my session info:

```{r session-info}
print(sessionInfo(), locale=FALSE)
```

# Attribution and License
<p xmlns:dct="http://purl.org/dc/terms/">
<a rel="license" href="http://creativecommons.org/publicdomain/mark/1.0/">
<img src="http://i.creativecommons.org/p/mark/1.0/88x31.png"
     style="border-style: none;" alt="Public Domain Mark" />
</a>
<br />
This work (<span property="dct:title">R Tutorial for Education</span>, by <a href="www.jaredknowles.com" rel="dct:creator"><span property="dct:title">Jared E. Knowles</span></a>), in service of the <a href="http://www.dpi.wi.gov" rel="dct:publisher"><span property="dct:title">Wisconsin Department of Public Instruction</span></a>, is free of known copyright restrictions.
</p>