Systematic exploration of the National Lake Assessment dataset
By the end of this lab, you will be able to:
The National Lakes Assessment (NLA) is a collaborative survey between EPA, states, and tribes to assess the condition of lakes across the United States. The survey occurs every 5 years and collects data on:
We’re using the 2022 NLA data for this course. You can explore summary results using the NLA 2022 dashboard.
This course is different from traditional labs. We’re not just learning JSDMs - we’re actively exploring how they can reveal patterns of community assembly across trophic levels in lake ecosystems. You are collaborators in this research, not just students following a recipe.
By dividing the analytical work across taxonomic groups, we can:
This mimics real collaborative ecology: different researchers specialize in different taxa, then synthesis happens through collaboration.
Document your work clearly. The rest of the research team needs to understand your findings to build on them. This means:
Think of your lab submission as a technical report to collaborators, not just an assignment.
As with everything in this class, this is a work in progress. We (the instructors) are genuinely exploring these data alongside you. If you find resources, identify additional analytical steps we’re missing, or have methodological suggestions, we welcome contributions through GitHub Issues or direct discussion.
This means some uncertainty is built in. We’ll make decisions together, document our reasoning, and learn from what emerges. That’s how real ecology works.
Data portal: https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys
For a JSDM we need at least three things: site info, species data (counts or presence/absence), and environmental covariates. Navigate to National Lakes Assessment 2022:
EVERYONE downloads:

1. Site information
2. Environmental data:
   - Physical Habitat
   - Secchi
   - Water Chemistry
   - Landscape Data

YOUR ASSIGNED GROUP (download ONE of these):
- Group 1: Benthic Macroinvertebrates
- Group 2: Zooplankton
- Group 3: Phytoplankton
Save these files to your data/raw/ folder. Keep the original filenames.
Add an entry to your data/README.md following the template from Lab 0.
Include:
[We’ll assign these in class]
What to focus on for your group:
If you have Benthic Macroinvertebrates:
- These are bottom-dwelling invertebrates (insect larvae, worms, mollusks)
- Key ecological role: Indicators of water quality and habitat condition
- Questions to explore: How do benthos respond to sediment, nutrients, oxygen?

If you have Zooplankton:
- These are tiny drifting animals (copepods, cladocerans, rotifers)
- Key ecological role: Link between phytoplankton and fish
- Questions to explore: How do zooplankton respond to nutrients, predators, lake size?

If you have Phytoplankton:
- These are microscopic algae
- Key ecological role: Primary producers, base of food web
- Questions to explore: How does phytoplankton composition vary with nutrients and other environmental variables?
Before we fit Joint Species Distribution Models, we need to understand what we’re modeling. This lab walks through a systematic EDA framework, not because you don’t know how to explore data, but because having a checklist prevents you from missing critical issues. As with everything in this class, this is a work in progress.
Think of this as the workflow you’d use if a collaborator handed you a new dataset and said “can you analyze this?” You need to quickly but thoroughly understand what you’re working with.
Your task: Work through the phases below with the NLA 2022 data. The specific code is up to you. I’m providing the questions you should answer and the decisions you should make. Document your findings and decisions as you go.
Goal: Know what you have before you manipulate it or model it
```r
library(tidyverse)
library(here)

# Load data - ADJUST FILENAMES TO MATCH WHAT YOU DOWNLOADED
sites <- read_csv(here("data", "raw", "nla2022_wide_siteinfo.csv"))
zoop <- read_csv(here("data", "raw", "nla2022_wide_zooplankton_count.csv"))
```

```r
# Dimensions
dim(sites)
dim(zoop)

# Structure
str(sites)
str(zoop)

# What are we working with?
# How many sites?
# How many species?
# What format?
```

Deliverable: Write 3-4 sentences describing the dataset structure in plain English.
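As a concrete sketch of what Phase 1 answers look like, here is a toy wide-format count table. The column names (`SITE_ID` and the taxon columns) are made up for illustration; they are not the actual NLA column names, so adapt the idea to whatever your downloaded file contains.

```r
library(dplyr)

# Toy stand-in for a wide count table: one row per site, one column per taxon
zoop_toy <- tibble::tibble(
  SITE_ID   = c("NLA22-001", "NLA22-002", "NLA22-003"),
  DAPHNIA   = c(12, 0, 4),
  BOSMINA   = c(0, 7, 0),
  KERATELLA = c(3, 3, 9)
)

n_sites   <- n_distinct(zoop_toy$SITE_ID)  # how many sites?
n_species <- ncol(zoop_toy) - 1            # non-ID columns = taxa
n_sites    # 3
n_species  # 3
```

In a wide table like this, "how many species" is a question about columns and "how many sites" is a question about rows; in a long table both are questions about distinct values.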
Goal: Find problems before they become analytical nightmares
```r
# Missing data patterns
summary(sites)
summary(zoop)

# Check for specific issues:
# - Are there NAs?
# - Range of abundance values
# - Species name consistency
# - Duplicate records?
```

Deliverable: A list of data quality issues found and decisions made about how to handle them. Document these as comments in your code or in a separate markdown section.
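One way to make those checks concrete, using a toy long-format table seeded with deliberate problems (the column names and values are illustrative, not from the NLA files):

```r
library(dplyr)

zoop_toy <- tibble::tibble(
  SITE_ID = c("A", "B", "B", "C"),
  TAXON   = c("Daphnia", "Daphnia", "Daphnia", "daphnia"),  # inconsistent case
  COUNT   = c(5, NA, 3, -1)                                 # NA and an impossible value
)

colSums(is.na(zoop_toy))              # NAs per column
range(zoop_toy$COUNT, na.rm = TRUE)   # negative counts are a red flag
sum(duplicated(zoop_toy))             # fully duplicated rows
n_distinct(zoop_toy$TAXON)            # case differences inflate species counts
n_distinct(tolower(zoop_toy$TAXON))   # one fix: standardize case before counting
```

Each check maps to a decision you should document: drop the NA, treat the negative count as a data-entry error, and standardize taxon names before any richness calculation.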
Goal: Understand the biological patterns
You’ll likely need to reshape/transform the data depending on whether it’s in long or wide format.
```r
# Species richness per site
# Hint: you may need pivot_longer() if data is wide

# Species prevalence (how many sites have each species?)

# Total abundance patterns

# Rarity: how many species are found at only 1 site? 2 sites?
```

Deliverable: 2-3 plots with captions describing what they reveal about community structure.
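The reshape-then-summarize pattern, sketched on a toy wide table (column names are illustrative, not the NLA ones):

```r
library(dplyr)
library(tidyr)

zoop_wide <- tibble::tibble(
  SITE_ID = c("A", "B", "C"),
  DAPHNIA = c(12, 0, 4),
  BOSMINA = c(5, 7, 0)
)

# Wide -> long: one row per site-taxon combination
zoop_long <- zoop_wide |>
  pivot_longer(-SITE_ID, names_to = "taxon", values_to = "count")

# Richness: number of taxa with count > 0 at each site
richness <- zoop_long |>
  filter(count > 0) |>
  count(SITE_ID, name = "richness")

# Prevalence: number of sites where each taxon occurs
prevalence <- zoop_long |>
  filter(count > 0) |>
  count(taxon, name = "n_sites")
```

The same long table feeds the rarity question: filter `prevalence` to `n_sites == 1` or `n_sites == 2`.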
Goal: Understand environmental context
```r
# Identify key environmental variables in sites data
# Univariate summaries
# Pairwise correlations
# Look for gradients
```

Deliverable: Identify 2-3 environmental variables that might be important and explain why.
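A minimal correlation check on a toy environmental table (variable names and units are illustrative, not the actual NLA codes):

```r
# Toy environmental data along a eutrophication gradient
env_toy <- data.frame(
  tp    = c(10, 50, 200, 400),      # total phosphorus (ug/L)
  tn    = c(200, 600, 1500, 3000),  # total nitrogen (ug/L)
  depth = c(30, 12, 5, 2)           # lake depth (m)
)

summary(env_toy)                              # univariate summaries
cor(env_toy, use = "pairwise.complete.obs")   # pairwise correlations
```

Strongly correlated predictors (here `tp` and `tn`) carry largely redundant information; note the correlation now so you can justify which one enters the model later.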
Goal: Generate hypotheses about community-environment relationships
```r
# You'll need to join species and environmental data

# Richness vs environment
# Pick 2-3 key environmental variables

# Individual species responses
# Pick 3-5 common species and examine their environmental associations
```

Deliverable: Generate 3 hypotheses about community-environment relationships based on patterns you observed.
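The join step, sketched with toy tables (the shared key will be whatever site identifier your NLA files use; `SITE_ID` here is an assumption):

```r
library(dplyr)

richness_toy <- tibble::tibble(SITE_ID = c("A", "B", "C"), richness = c(12, 5, 8))
env_toy      <- tibble::tibble(SITE_ID = c("A", "B", "C", "D"), tp = c(10, 200, 50, 400))

# left_join keeps every site with species data; sites with environmental
# data but no species sample (here "D") drop out
joined <- left_join(richness_toy, env_toy, by = "SITE_ID")
nrow(joined)  # 3
```

Check the row count before and after the join: unexpected growth means duplicate keys, unexpected NAs mean sites missing from one table.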
After going through all these quality checks and decisions, save a cleaned version:
```r
# Save your processed data
write_csv(cleaned_data, here("data", "processed", "nla_2022_cleaned.csv"))
```

What surprised you? Were there data quality issues you didn’t expect?
How does this inform JSDMs? Based on your EDA, what patterns would you want a Joint Species Distribution Model to capture?
What decisions did you make? List 2-3 major data cleaning decisions and justify them.
Submit to Canvas:
- data/processed/nla_2022_cleaned.csv
- data/README.md with NLA data documented

Review the learning objectives. Can you: