-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathfinal_pres_done.Rmd
588 lines (431 loc) · 29.1 KB
/
final_pres_done.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
---
title: "Forecasting Carabid Beetle Abundance at Niwot Ridge"
author: "Anna Spiers, Christa Torrens, Grant Vagle"
date: "5/07/2020"
output:
html_document:
df_print: paged
keep_md: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Project overview and background
## Project goals - big picture
* To use NEON data, including AOP lidar data, to accurately forecast the abundance of carabid beetles across the landscape, using training data from Niwot Ridge.
* To be able to forecast changes to carabid beetle distribution under changing climate conditions
* To use our forecast model to determine "how much data is enough?" for accurate forcasting at different timesteps
+ Assume the predicted data are real - run model 25 years into future
+ Test model accuracy with 1, 3, 5, 10, 15, 20 years of “real” data
## Project goals - this semester
* To create and validate a working model to accurately predict the abundance of 1-2 carabid spp. at NEON's Niwot carabid traps
+ *either* a temporal prediction (2019 season)
+ *or* spatial prediction (one of the plots, 2018 season).
## Why Carabids?
Ground beetles (Carabidae) are an excellent sentinel taxon, and can be used to understand the consequences of global climate change in high mountain areas.
* They are abundant, found across many biomes, and are relatively easy to identify
* They are sensitive to habitat change, thus are predicted to serve as useful bioindicators of environmental and land use change
+ Numerous studies have shown that carabid beetle richness and/or assemblage composition change along altitudinal gradients, with vegetation type, and/ or land use type
* Their presence/ absence impacts other ecosystem components:
+ They are an important component of terrestrial food webs and can influence trophic structure
(Hoekman et al. 2017, Hiramatsu and Usio 2018)
Available data: The NEON data suite includes carabid abundance since 2015, as well as many of our target predictor variables.
## Research questions (initial)
1. Which carabid species are 'best' to focus on for our semester goals?
2. Initial exploration of variables (EDA)
+ Based on the literature, what are likely to be important variables driving carabid presence/ absence?
+ Note: We initially focused on spatial predictions, but found temporal variance was a stronger influence on carabid abundance
3. What are the key predictor variables for each species? (Model building)
4. Which model type will best predict species abundance? (Model testing)
## Q1: Carabid species selection
First we determined that there were seven abundant and well-identified species in NEON's Niwot data.
```{r, echo=FALSE, out.width = 600}
knitr::include_graphics("images/Carabid ID_7circled.png")
```
(the next section will go into details on sampling design and analysis)
Then we looked at abundance by sampling plot and collection date
```{r, echo=FALSE}
knitr::include_graphics("output/species_through_years.png")
```
Looking at the abundance data, we chose two of those seven: *Carabus taedatus* and *Cymindis unicolor*.
```{r, echo=FALSE}
knitr::include_graphics("output/abund_DateClass_twospp.png")
```
<!-- find code to create the distribution by nlcd class and spp image -->
Both species were consistently present in both of Niwot's landscape classes: tundra and evergreen forest. *C. unicolor* was more abundant in the forested plots and *C. taedatus* was more abundant in the tundra plots. This combination allows us to test our model's accuracy for high and low abundance levels in both habitats.
## Species traits
This information was sparse for our selected species - therefore some of the "specific" phenology information is for the family or genus level.
In general, carabids are ground-dwelling beetles that prefer moist conditions; note that both of our selected species are xerophilous. Most species are omnivorous as larvae and adults, commonly eating both other arthropods and seeds. The life cycle is 1 year long for most species, with adults living 2-4 years. Most taxa have three instars. Pupation is in the ground.
### *Carabus taedatus*
```{r, echo=FALSE, out.width = 100}
knitr::include_graphics("images/CARTAE.png")
```
**Range**: Nearctic; **Habitat**: Boreal forests. Prefers dry, well-drained environments with thin, low vegetation; **Diet**: both larvae and adults are predatory; **Phenology**: probably overwinters as an adult; **Reproduction**: (genus) Females lay their eggs in specially-constructed cells made of mud, twigs and leaves.
### *Cymindis unicolor*
```{r, echo=FALSE, out.width = 100}
knitr::include_graphics("images/CYMUNI.png")
```
**Range**: Holarctic; **Habitat**: An arctic-alpine species occurring in tundra and tundra-forest transition zones; prefers dry, treeless, well-drained environments; **Diet**: no specific information found; **Phenology**: Little known: In Canada, most specimens were found in June and July, but also found as late as October; **Reproduction**: (genus) Females use ovipositors to create cavities in the earth then deposit their eggs. **Remarks**: The species is relatively rare. The *Lebiini* tribe is found in all the major zoogeographical regions of the world; this genus ranges from Costa Rica to arctic tundra.
```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(knitr)
```
# Q2: EDA
## Carabid data
### Sampling
Carabid beetles were sampled at 10 plots, with 4 traps per plot. Beetles were collected every two weeks during the summer season (with temperature cutoffs to start and stop sampling). Data were available for 2015--2018 (2019 beetles were not yet identified).
```{r, echo=FALSE}
knitr::include_graphics("output/canopy_height_traps.png")
```
In 2018, plot 4 was removed from sampling and plot 13 was added. Plot 4 was forested, and plot 13 is a tundra site. Plot 4 was likely removed from sampling due to low number of beetles caught (speculation from Matt Bitters). We took advantage of the new site in 2018 by using plot 13 as a test plot for our model predictions, which Anna will get to later.
### Identification
All individuals were identified to species (or morphospecies) by a parataxonomist. A subset (1511/1974) of those were then sent for expert identification. There was 97% agreement overall between the parataxonomist and the expert taxonomist.
```{r, echo=FALSE, out.width = 600}
knitr::include_graphics("output/paratax_vs_expertax.png")
```
<!-- ![species_abund_by_year_by_trap](output/paratax_vs_expertax.png) -->
### Species selection
Most of the species collected were only found rarely, so we first narrowed down the set of species to the seven most abundant (and all were accurately ID'd).
```{r, echo=FALSE}
knitr::include_graphics("output/species_through_years.png")
```
However, comparing multiple models across seven species is still a bit too much. To narrow further, we selected the two species of these seven that were found both above and below treeline.
**Selected species**:
*Carabus taedatus* is more abundant in tundra plots.
```{r, echo=FALSE, out.width = 100}
knitr::include_graphics("images/CARTAE.png")
```
*Cymindis unicolor* is more abundant in forested plots.
```{r, echo=FALSE, out.width = 50}
knitr::include_graphics("images/CYMUNI.png")
```
<!-- info/natural history about the species we chose
Do we have an idea for why certain carabids are greater in different nlcdClass given their biology?
-->
We chose these species because we thought it would be interesting to examine model performance for the species that were found in both habitats, especially since they show opposite trends in abundance in the two habitats.
### Abundance over time
<!-- by year -->
<!-- change to just the two species -->
```{r, echo=FALSE}
knitr::include_graphics("output/species_abund_by_year_by_trap_nlcd.png")
```
This plot shows the abundances at each trap in each year for *Carabus taedatus* and *Cymindis unicolor*. *Carabus taedatus* was reliably caught at the tundra sites and was caught on occasion in some forested sites. *Cymindis unicolor* was reliably caught at some (but not all) forested sites and was caught on occasion at one tundra plot (8) and once at another tundra plot (6).
## Predictor variables
At the start of this project we spent some time brainstorming and searching the literature for potentially important predictor variables for carabid abundance. Once we had identified a long list of potential predictors, we searched through the available datasets for usable predictors. We ended up using predictor variables from three sources: NEON data included in the carabid data product, NEON AOP data products, and Niwot Ridge LTER environmental data. The predictors we used from each of these sources will be shown below. There were also several predictor variables that we were not able to use: soil moisture (NEON product, not colocated with beetle traps), woody debris (NEON product, not colocated with beetle traps), vegetation (NEON product, difficult to summarize at plot or trap level), and snowmelt date (didn't find or have time to incorporate).
#### Included with carabid data product
Day of year - extracted from collect date
NLCD class - evergreen forest or grassland herbaceous
Elevation
#### AOP (remote sensing) data products
Data products from NEON's AOP collection were downloaded via the `neonUtilities` package in the form of multiple (over 100) .tif files in a zip folder for each data product. We then merged together all of the .tif's into a single raster in R (`carabids_01_download_AOP.R`) for each data product. These rasters were then used for all of our downstream analyses.
**Canopy height**
Measured via LIDAR, this data product provides a 1x1m resolution map of canopy height across the site. To summarize for each trap, we took the average canopy height for a 10m radius from each trap location (see`raster::extract` in `carabids_03_EDA_CHM`).
```{r, echo=FALSE, out.width = 600}
knitr::include_graphics("output/canopy_height_traps.png")
```
**Leaf area index**
This is the ratio of leaf area to bare ground area as sensed from above (similar metric to percent canopy cover). This was processed the same way as canopy height, but with some extra data cleaning. Data were available for 2017, 2018, and 2019, but the data looked different year to year. In the end, we used an average of 2017 and 2018 because these issues were difficult to sort out and those years were similar and appeared the most reliable. After contacting someone at NEON, we learned that they used different quality control measures in different years resulting in different values.
**Slope and aspect**
Slope and aspect were summarized at the plot level (average slope and aspect at 10m radius from plot center). We included them because they may be related to microclimatic factors such as soil moisture (slope) and energy availability via sun exposure (aspect).
```{r, echo=FALSE, out.width = 400}
knitr::include_graphics("output/slope_traps.png")
```
```{r, echo=FALSE, out.width = 400}
knitr::include_graphics("output/aspect_traps.png")
```
#### Niwot Ridge LTER data sources
**Precipitation**
For the forested plots, C1 precipitation data were used; for the tundra plots Saddle precipitation data were used. [NEON also has a precipitation data product, but there were too many missing data points for our use.] We calculated total precipitation during the collection period (2 weeks) for each collect date. This summary gets at the short-term weather patterns that may influence beetle behavior and detectability in traps. [We wanted to get to a long-term precipitation summary, but ran out of time.]
```{r, echo=FALSE}
knitr::include_graphics("output/precip_2week_summ.png")
```
**Growing degree days**
Using temperature data from the C1 sensor (Saddle sensor data contained missing dates), we calculated growing degree days. Growing degree days were summarized similar to (Nufio et al. 2010) as a cumulative sum of growing degree days in that year until the date of collection. Degree days were quantified for temperatures between 12--33C based on grasshopper physiology at Niwot Ridge (Nufio et al. 2010). We did not have any information regarding the carabid temperature limits, so grasshopper values from the same location were our best estimate.
```{r, echo=FALSE}
knitr::include_graphics("output/temp_time.png")
```
```{r, echo=FALSE}
knitr::include_graphics("output/degree_days_cumulative.png")
```
### Summary of predictors used in models
* Day of year
* NLCD class
* Elevation
* Canopy height (10m radius)
* Leaf area index (LAI) (10m radius)
* Slope
* Aspect
* Precipitation (during trapping period)
* Growing degree days (cumulative over season)
# Q3 and 4: Model building and selection
```{r, echo = FALSE, results=FALSE, message = FALSE, warning = FALSE}
library(dplyr)
library(mctest) #omcdiag()
library(tidyr) #gather()
library(ggplot2)
library(gridExtra) #grid.arrange
library(broom) #glance
library(MuMIn) #model.sel
library(brms) #brm()
library(mgcv) #gam()
library(gratia) #draw()
# Load in model df
all7sp_dat <- read.csv("data_derived/model_df_by_species_in_sample.csv") %>%
mutate(sc_DOY = scale(DOY, center = TRUE, scale = TRUE),
col_year_fac = as.factor(col_year))
```
To build and select a model for abundance of *Carabus taedatus* and *Cymindis unicolor*, our goals were to:
1. Identify what are the key predictor variables for each species.
2. Find what model best fits the training data species abundance.
#### Predictor variable selection
We ruled out some variables by evaluating collinearity with the `mctest` package.
```{r, echo = FALSE}
vars_mat <- all7sp_dat %>%
mutate(nlcdclass_num = as.numeric(nlcdClass)) %>%
dplyr::select(elev, LAI_1718avg, plot_CHM, LAI_1718avg, precip_2weeks,
DOY, plot17aspect, plot17slope, GDD_cum, col_year, nlcdclass_num)
pairs(vars_mat, pch=20, cex=0.75, col="dodgerblue", main="Correlation scatterplots of every possible predictor")
```
Note the nearly perfect correlation between growing degree-days and day of year. Comparing all possible variables, we detect overall collinearity between elevation, canopy height, and NLCD class.
```{r, echo = FALSE}
imcdiag(vars_mat, all7sp_dat$sp_abund, method="VIF")
```
After dropping elevation, we failed to detect pairwise collinearity between variables. However, variance inflation factor (VIF) values are above 5 (indicating high correlation) for canopy height, NLCD class, day of year, and cummulative growing degree-day.
```{r, echo = FALSE}
# remove elevation
vars_mat <- all7sp_dat %>%
mutate(nlcdclass_num = as.numeric(nlcdClass)) %>%
dplyr::select(LAI_1718avg, plot_CHM, LAI_1718avg, precip_2weeks, DOY,
plot17aspect, plot17slope, GDD_cum, col_year, nlcdclass_num)
imcdiag(vars_mat, all7sp_dat$sp_abund, method="VIF")
```
Canopy height and NLCD class are highly correlated because canopy height is 0 in the tundra. NLCD class is useful in model building since we can vary the intercept by this categorical variable.
Day of year and growing degree-day are highly correlated. Both capture seasonal variation.
We don't have enough information right now to select which variable to rule out of each pair. Model fitting with these variables can help us decide. Let's take a step in this direction by utilizing Melissa's `stepGAM` function to compare model AIC values for every iteration of these variables combined.
Second, use Melissa's stepGAM(), a wrapper function for stepAIC that is compatible with GAMs, to see which predictor variables are important in model fit.
The effects of plot, trap within plot, canopy height, and slope/aspect are strong for both species. For *Cymindis unicolor* we also see the interaction of day of year and precipitation as an important effect.
```{r, echo = FALSE, results=FALSE, message = FALSE, warning = FALSE}
load("data_derived/stepGAM_output_cartae.Rdata")
load("data_derived/stepGAM_output_cymuni.Rdata")
allAIC_cartae_filt <- allAIC_cartae %>%
filter(!is.na(AIC)) %>% arrange(AIC) %>%
mutate(rank=seq(1:length(AIC))) %>%
mutate(AIC=as.numeric(AIC)
, Dev.expl = as.numeric(Dev.expl)
, REML = as.numeric(REML))# %>%
#filter(AIC<500)
allAIC_cartae_plot <- allAIC_cartae_filt %>%
gather(-c(AIC, REML, Dev.expl, formula,rank), key=Predictor, value=WithPredictor) %>%
select(AIC, REML, Dev.expl, Predictor, WithPredictor) %>%
mutate(WithPredictor = ifelse(WithPredictor==0, F, T)) %>%
ggplot() +
geom_violin(aes(x=WithPredictor, y=AIC), col="dodgerblue") +
ggtitle("Carabus taedatus") +
facet_wrap(.~Predictor)
allAIC_cymuni_filt <- allAIC_cymuni %>%
filter(!is.na(AIC)) %>% arrange(AIC) %>%
mutate(rank=seq(1:length(AIC))) %>%
mutate(AIC=as.numeric(AIC)
, Dev.expl = as.numeric(Dev.expl)
, REML = as.numeric(REML))# %>%
#filter(AIC<500)
allAIC_cymuni_plot <- allAIC_cymuni_filt %>%
gather(-c(AIC, REML, Dev.expl, formula,rank), key=Predictor, value=WithPredictor) %>%
select(AIC, REML, Dev.expl, Predictor, WithPredictor) %>%
mutate(WithPredictor = ifelse(WithPredictor==0, F, T)) %>%
ggplot() +
geom_violin(aes(x=WithPredictor, y=AIC), col="dodgerblue") +
ggtitle("Cymindis unicolor") +
facet_wrap(.~Predictor)
grid.arrange(allAIC_cartae_plot,allAIC_cymuni_plot,nrow=1)
```
#### Model selection
Now that we've identified what are the possible key predictor variables to include in modeling, let's compare a few modeling approaches. We found that GAMMs performed better than GLMMs as they capture well the seasonal flux in carabid abundance. We used the `mgcv` package for fitting GAMMs. Our initial efforts used a poisson or negative binomial distribution to model carabid abundance. However, predictions were severely biased toward 0, so we should next try a zero-inflated poisson model. While we have a number of environmental variables at our disposal to use as predictors, we learned from Harris et al. 2018 that simply an average model may yield the best predictions, even if it's not the best fit model. For this reason, we will compare the model results from the best baseline (variance partitioning) and negative binomial models. First, we iterated through models with many combinations of environmental predictors.
The best fit model for each species utilized distinct environmental predictor variables. Since we observe clear differences in abundance between NLCD class types for each species, we had the models fit different smooths of continuous variables to each type of NLCD class. Models for both species used grouping variables that are artifacts of the sampling structure:
* collection date (24 total)
* collection year (4 total)
* plot (10 across site)
* traps within plots (4 per plot)
For *Carabus taedatus*, the following environmental predictors were used:
* day of year
* canopy height
* accumulated precipitation over collection period
* interaction of precipitation and day of year
* interaction of slope and aspect
For *Cymindis unicolor*, the following environmental predictors were used:
* cummulative growing degree days
* LAI
* accumulated precipitation over collection period
* interaction of slope and aspect
* interaction of precipitation and day of year
* interaction of growing degree days and precipitation
```{r, echo = FALSE}
load("data_derived/final_mod_cartae.Rdata")
load("data_derived/final_mod_cymuni.Rdata")
# View selected model formulae
cartae_mod$formula
cymuni_mod$formula
```
# Results & Interpretation
## Biological interpretation
Many of the models we selected between had comparable AIC values, so we chose ones that would be most easily interpreted biologically. Now let's interpret them.
First, let's look at the output of the variance partition model for each species to see where the variability in the data sit.Note that the variance partition model explains over half of the deviance in the data already!
```{r}
load("data_derived/vp_mod_cartae.Rdata")
load("data_derived/vp_mod_cymuni.Rdata")
summary(vp_mod_cymuni)
```
Now let's look at the best-fit GAMM result plots.
```{r}
summary(cartae_mod)
plot(cartae_mod, pages = 5, rug = TRUE, residuals = TRUE, shade=TRUE, shade.col = "lightblue", all.terms = TRUE, seWithMean = TRUE, shift = coef(cartae_mod)[1], scheme=2)
```
```{r}
summary(cymuni_mod)
plot(cymuni_mod, pages = 5, rug = TRUE, residuals = TRUE, shade=TRUE, shade.col = "lightblue", all.terms = TRUE, seWithMean = TRUE, shift = coef(cymuni_mod)[1], scheme=2)
```
## Prediction
Given the limited size of the carabid dataset over 4 years and at one site, we will validate our models on data from a single plot in a single year. We trained our models on the rest of the data. Remember that in our last sampling year, 2018, NEON swapped out plot 4 for plot 13. We thought it would be interesting to predict beetle abundance at this new plot, given what we know about the other plots and their abundances.
```{r, warning=F}
val_dat <- all7sp_dat %>%
filter(plotID == "NIWO_013" & col_year == 2018)
vp_cartae_pred <- cbind(val_dat %>% filter(para_sciname == "Carabus taedatus") %>% select(sp_abund), data.frame(predict(vp_mod_cartae, newdata = val_dat %>% filter(para_sciname == "Carabus taedatus"), type = "response", se.fit= TRUE)))
vp_cymuni_pred <- cbind(val_dat %>% filter(para_sciname == "Cymindis unicolor") %>% select(sp_abund), data.frame(predict(vp_mod_cymuni, newdata = val_dat %>% filter(para_sciname == "Cymindis unicolor"), type = "response", se.fit= TRUE)))
cartae_pred <- cbind(val_dat %>% filter(para_sciname == "Carabus taedatus") %>% select(sp_abund), data.frame(predict(cartae_mod, newdata = val_dat %>% filter(para_sciname == "Carabus taedatus"), type = "response", se.fit= TRUE)))
cymuni_pred <- cbind(val_dat %>% filter(para_sciname == "Cymindis unicolor") %>% select(sp_abund), data.frame(predict(cymuni_mod, newdata = val_dat %>% filter(para_sciname == "Cymindis unicolor"), type = "response", se.fit= TRUE)))
plot_vp_cartae_pred <- vp_cartae_pred %>%
ggplot(aes(x=fit, y=sp_abund)) +
geom_jitter(col="dodgerblue") +
geom_abline(slope = 1, intercept = 0, col="dodgerblue") +
ggtitle("Var. Part. model for Carabus taedatus") +
xlim(0,5) + ylim(0,5) +
ylab("Observed count" ) +
xlab("Predicted count")
plot_vp_cymuni_pred <- vp_cymuni_pred %>%
ggplot(aes(x=fit, y=sp_abund)) +
geom_jitter(col="dodgerblue") +
geom_abline(slope = 1, intercept = 0, col="dodgerblue") +
ggtitle("Var. Part. model for Cymindis unicolor") +
xlim(0,5) + ylim(0,5) +
ylab("Observed count" ) +
xlab("Predicted count")
plot_cartae_pred <- cartae_pred %>%
ggplot(aes(x=fit, y=sp_abund)) +
geom_jitter(col="dodgerblue") +
geom_abline(slope = 1, intercept = 0, col="dodgerblue") +
ggtitle("Best fit model for Carabus taedatus") +
xlim(0,5) + ylim(0,5) +
ylab("Observed count" ) +
xlab("Predicted count")
plot_cymuni_pred <- cymuni_pred %>%
ggplot(aes(x=fit, y=sp_abund)) +
geom_jitter(col="dodgerblue") +
geom_abline(slope = 1, intercept = 0, col="dodgerblue") +
ggtitle("Best fit model for Cymindis unicolor") +
xlim(0,5) + ylim(0,5) +
ylab("Observed count" ) +
xlab("Predicted count")
grid.arrange(plot_vp_cartae_pred,plot_vp_cymuni_pred,plot_cartae_pred,plot_cymuni_pred,nrow=2)
```
Our models can predict absence or low abundance, but not more. This is support for trying a zero-inflated poisson next.
## Cross-validation
Cross-validation is intended to evaluate model performance (esp re: prediction error) using a limited data sample. It estimates the skill of the model on unseen (test) data. There are various cross-validation methods; the key difference is in how they partition the data into training and test datasets.
We chose the k-fold cross validation method, using *k* = 10. This method has relatively low bias due to random grouping of test and training data.
### About the method
* This method randomly splits the data into *k* groups for cross-validation
* Then it cycles through the groups, selecting each one in turn as a test data set and using the other *k*-1 groups as a training data set to fit and test the model
* The Mean Square Error [MSE] is calculated and retained for each step
* The mean of all *k* MSEs is reported to evaluate overall model performance
```{r}
# all7sp_dat_cartae_k <- all7sp_dat %>%
# filter(para_sciname=="Carabus taedatus") %>%
# mutate(k = sample(1:20, 960, replace=T))
# all7sp_dat_cymuni_k <- all7sp_dat %>%
# filter(para_sciname=="Cymindis unicolor" ) %>%
# mutate(k = sample(1:20, 960, replace=T))
#
# err_cartae <- rep(NA,960)
# for (i in 1:20 ) {
# gamcartae_loo <- gam(formula = sp_abund ~ nlcdClass + s(DOY, bs = "cc", k = 3) +
# s(precip_2weeks, by = nlcdClass, bs = "ts", k = 3) +
# s(plot_CHM,bs = "ts", k = 3) +
# te(DOY, precip_2weeks, by = nlcdClass, bs = c("cc", "ts"), k = 3) +
# te(plot17aspect, plot17slope, bs = c("ts", "ts"), k = 3) +
# s(collectDate, bs = "re") + s(col_year_fac, bs = "re") +
# s(plot_trap, bs = "re") + s(plotID, bs = "re"), family = nb,
# data = all7sp_dat_cartae_k %>% filter(k != i), method = "REML")
#
# err_temp <- ( predict(gamcartae_loo,newdata=all7sp_dat_cartae_k %>% filter(k == i)) -
# all7sp_dat_cartae_k %>% filter(k == i) %>% select(sp_abund) )^2 #MSE
# err_cartae <- rbind(err_cartae, err_temp)
# rm(gamcartae_loo, err_temp)
# }
# err_cartae <- err_cartae[-1,]
# CV_cartae <- mean(err_cartae)
#
# err_cymuni <-rep(NA,960)
# for (i in 1:20 ) {
# gamcymuni_loo <- gam(formula = sp_abund ~ nlcdClass + s(GDD_cum, bs = "ts", k = 1) +
# s(LAI_1718avg, bs = "ts", k = 1) +
# s(precip_2weeks, bs = "ts", k = 1) +
# te(GDD_cum, precip_2weeks, bs = c("ts", "ts"), k = 3) +
# te(plot17aspect, plot17slope, by = nlcdClass, bs = c("ts", "ts"), k = 5) +
# te(DOY, precip_2weeks, bs = c("cc", "ts"), k = 4) +
# s(collectDate, bs = "re") + s(col_year_fac, bs = "re") +
# s(plot_trap, bs = "re") + s(plotID, bs = "re"), family = nb,
# data = all7sp_dat_cymuni_k %>% filter(k != i), method = "REML")
#
# err_temp <- ( predict(gamcymuni_loo,newdata=all7sp_dat_cymuni_k %>% filter(k == i)) -
# all7sp_dat_cymuni_k %>% filter(k == i) %>% select(sp_abund) )^2 #MSE
# err_cymuni <- rbind(err_cymuni, err_temp)
# rm(gamcymuni_loo, err_temp)
# }
# err_cymuni <- err_cymuni[-1,]
# CV_cymuni <- mean(err_cymuni)
#
# err_cartae <- rep(NA,960)
# for (i in 1:20 ) {
# gamcartae_vp_loo <- gam(formula = sp_abund ~
# s(collectDate, bs = "re") + s(col_year_fac, bs = "re") +
# s(plot_trap, bs = "re") + s(plotID, bs = "re"), family = nb,
# data = all7sp_dat_cartae_k %>% filter(k != i), method = "REML")
#
# err_temp <- ( predict(gamcartae_vp_loo,newdata=all7sp_dat_cartae_k %>% filter(k == i)) -
# all7sp_dat_cartae_k %>% filter(k == i) %>% select(sp_abund) )^2 #MSE
# err_cartae <- rbind(err_cartae, err_temp)
# rm(gamcartae_vp_loo, err_temp)
# }
# err_cartae <- err_cartae[-1,]
# CV_cartae_vp <- mean(err_cartae)
#
# err_cymuni <-rep(NA,960)
# for (i in 1:20 ) {
# gamcymuni_vp_loo <- gam(formula = sp_abund ~
# s(collectDate, bs = "re") + s(col_year_fac, bs = "re") +
# s(plot_trap, bs = "re") + s(plotID, bs = "re"), family = nb,
# data = all7sp_dat_cymuni_k %>% filter(k != i), method = "REML")
#
# err_temp <- ( predict(gamcymuni_vp_loo,newdata=all7sp_dat_cymuni_k %>% filter(k == i)) -
# all7sp_dat_cymuni_k %>% filter(k == i) %>% select(sp_abund) )^2 #MSE
# err_cymuni <- rbind(err_cymuni, err_temp)
# rm(gamcymuni_vp_loo, err_temp)
# }
# err_cymuni <- err_cymuni[-1,]
# CV_cymuni_vp <- mean(err_cymuni)
#
# # Compare models
# save(CV_cartae, CV_cymuni, CV_cartae_vp, CV_cymuni_vp, file="data_derived/cv_output.Rdata")
load("data_derived/cv_output.Rdata")
CV_mat <- matrix(c(CV_cartae, CV_cymuni, CV_cartae_vp, CV_cymuni_vp), nrow=2, byrow=T)
rownames(CV_mat) <- c("best fit", "var par")
colnames(CV_mat) <- c("Car tae", "Cym uni")
CV_mat
```
# Discussion - Did we do what we set out to do?
We did meet our primary semester goal:
To create a working model to accurately predict the abundance of 1-2 carabid spp. at NEON's Niwot carabid traps (spatial prediction, Plot 13 in 2018)
This was scaled back quite a bit from our early aspirations. And we did not meet some of our larger goals, including running a working cross-validation
## Future work
If we had more time (and data), we'd include the following items:
* Once the 2019 carabid data is identified, we'd try to predict our two species for 2019 at all plots and collection dates
* If successful, we'd try to predict all 7 abundant species
* Regardless of outcome, we'd try to use our model to answer the question: "How much data is enough?" for accurate forcasting at different timesteps