library(dplyr)
library(dbplyr) #need to install, but don't actually need to explicitly load
library(DBI)
library(duckdb)Working with forestTIME-builder data
Setup
Packages you’ll need:
If any are missing, install with install.packages("duckdb"), for example.
Step 1: connecting to the database
The path to the duckdb file can be a relative path if you’re in an RStudio project. If you’re working in a Quarto or RMarkdown document, you might need to use here::here("project/path/to/foresttime-DE.duckdb") to ensure the same working directory whether you’re working interactively or rendering the document.
con <- dbConnect(duckdb("data/db/foresttime-DE.duckdb"))Step 2: getting a table
List the tables in the database:
dbListTables(con) [1] "all_invyrs" "cond"
[3] "nsvb_vars" "plot"
[5] "qa_flags" "ref_species"
[7] "ref_tree_carbon_ratio_dead" "ref_tree_decay_prop"
[9] "sapling_transitions" "tree"
[11] "tree_annualized" "tree_carbon"
[13] "tree_carbon_annualized_midpoint" "tree_carbon_annualized_mortyr"
[15] "tree_cns" "tree_info_composite_id"
[17] "trees_annual_measures_midpoint_nsvb" "trees_annual_measures_mortyr_nsvb"
Pull a specific table to work with:
tree_annualized <- tbl(con, "tree_annualized")
tree_annualized# Source: table<tree_annualized> [?? x 13]
# Database: DuckDB v1.1.3 [root@Darwin 24.2.0:R 4.4.1//Users/ericscott/Documents/GitHub/mekevans/forestTIME-builder/data/db/foresttime-DE.duckdb]
STATECD TREE_COMPOSITE_ID TREE_CN_midpoint YEAR midpoint_dead_year
<int> <chr> <chr> <int> <dbl>
1 10 10_1_1_48_1_2 1275355140290487 2021 NA
2 10 10_1_1_48_3_10 1275355163290487 2021 NA
3 10 10_1_1_132_3_7 1275355801290487 2021 NA
4 10 10_1_1_132_4_4 1275355808290487 2021 NA
5 10 10_1_1_132_4_1 1275355806290487 2021 NA
6 10 10_1_1_202_1_8 1275365477290487 2021 NA
7 10 10_1_1_202_1_9 1275365478290487 2021 NA
8 10 10_1_1_202_2_8 1275365489290487 2021 NA
9 10 10_1_1_238_1_9 1275355197290487 2021 NA
10 10 10_1_1_238_2_12 1275355217290487 2021 NA
# ℹ more rows
# ℹ 8 more variables: HT_est_midpoint <dbl>, DIA_est_midpoint <dbl>,
# AHEIGHT_est_midpoint <dbl>, TREE_CN_mortyr <chr>, mortyr_dead_year <dbl>,
# HT_est_mortyr <dbl>, DIA_est_mortyr <dbl>, AHEIGHT_est_mortyr <dbl>
Step 3: working with tables
The table above is not in memory. You can tell because it lists the dimensions as [?? x 13]—it doesn’t know how many rows there are because it’s only read in enough to figure out what kinds of data (numeric, character, date, etc.) the columns hold. However, you can still use most tidyverse functions to manipulate it.
tree_annualized |>
filter(DIA_est_midpoint < 5) |>
mutate(is_dead = YEAR >= midpoint_dead_year) |>
select(-STATECD)# Source: SQL [?? x 13]
# Database: DuckDB v1.1.3 [root@Darwin 24.2.0:R 4.4.1//Users/ericscott/Documents/GitHub/mekevans/forestTIME-builder/data/db/foresttime-DE.duckdb]
TREE_COMPOSITE_ID TREE_CN_midpoint YEAR midpoint_dead_year HT_est_midpoint
<chr> <chr> <int> <dbl> <dbl>
1 10_1_1_202_1_9 1275365478290487 2021 NA 23
2 10_1_1_202_2_8 1275365489290487 2021 NA 20
3 10_1_5_249_4_3 1275355080290487 2021 2018 20
4 10_1_1_78_3_12 1540953243290487 2022 NA 19
5 10_1_1_488_3_4 463202152489998 2016 NA 31
6 10_1_5_226_3_12 463202794489998 2016 NA 26
7 10_1_5_276_3_3 463203082489998 2016 NA 16
8 10_1_5_276_3_7 463203086489998 2016 NA 13
9 10_1_5_276_4_8 463203115489998 2016 NA 15
10 10_1_5_395_1_60 463202548489998 2016 NA 35
# ℹ more rows
# ℹ 8 more variables: DIA_est_midpoint <dbl>, AHEIGHT_est_midpoint <dbl>,
# TREE_CN_mortyr <chr>, mortyr_dead_year <dbl>, HT_est_mortyr <dbl>,
# DIA_est_mortyr <dbl>, AHEIGHT_est_mortyr <dbl>, is_dead <lgl>
This includes joins
tree <- tbl(con, "tree")
plot <- tbl(con, "plot")
left_join(
tree,
plot |> select(PLOT_COMPOSITE_ID, DESIGNCD),
by = join_by(PLOT_COMPOSITE_ID)
)# Source: SQL [?? x 199]
# Database: DuckDB v1.1.3 [root@Darwin 24.2.0:R 4.4.1//Users/ericscott/Documents/GitHub/mekevans/forestTIME-builder/data/db/foresttime-DE.duckdb]
TREE_CN PLT_CN PREV_TRE_CN INVYR STATECD UNITCD COUNTYCD PLOT SUBP TREE
<chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
1 654580880… 60134… <NA> 2004 10 1 1 132 2 1
2 654580900… 60134… <NA> 2004 10 1 1 132 2 2
3 654580920… 60134… <NA> 2004 10 1 1 132 3 1
4 654580940… 60134… <NA> 2004 10 1 1 132 3 2
5 654580960… 60134… <NA> 2004 10 1 1 132 3 3
6 654580980… 60134… <NA> 2004 10 1 1 132 3 4
7 654581000… 60134… <NA> 2004 10 1 1 132 3 5
8 654581020… 60134… <NA> 2004 10 1 1 132 3 6
9 654581040… 60134… <NA> 2004 10 1 1 132 3 7
10 654581060… 60134… <NA> 2004 10 1 1 132 3 8
# ℹ more rows
# ℹ 189 more variables: CONDID <int>, PREVCOND <int>, STATUSCD <int>,
# SPCD <dbl>, SPGRPCD <int>, DIA <dbl>, DIAHTCD <int>, HT <int>, HTCD <int>,
# ACTUALHT <int>, TREECLCD <int>, CR <int>, CCLCD <int>, TREEGRCD <int>,
# AGENTCD <int>, CULL <int>, DAMLOC1 <int>, DAMTYP1 <int>, DAMSEV1 <int>,
# DAMLOC2 <int>, DAMTYP2 <int>, DAMSEV2 <int>, DECAYCD <int>, STOCKING <dbl>,
# WDLDSTEM <int>, VOLCFNET <dbl>, VOLCFGRS <dbl>, VOLCSNET <dbl>, …
If you need to bring the dataframe into memory for some reason, you can use collect().
tiny_saplings <- tree |>
filter(DIA < 2) |>
collect()
tiny_saplings# A tibble: 1,226 × 198
TREE_CN PLT_CN PREV_TRE_CN INVYR STATECD UNITCD COUNTYCD PLOT SUBP TREE
<chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
1 654583970… 60135… <NA> 2004 10 1 1 30 3 1
2 654578730… 60134… <NA> 2004 10 1 1 48 2 2
3 654578710… 60134… <NA> 2004 10 1 1 48 2 1
4 654578750… 60134… <NA> 2004 10 1 1 48 2 3
5 654578770… 60134… <NA> 2004 10 1 1 48 2 4
6 654578790… 60134… <NA> 2004 10 1 1 48 2 5
7 654579050… 60134… <NA> 2004 10 1 1 48 3 4
8 654579070… 60134… <NA> 2004 10 1 1 48 3 5
9 654579290… 60134… <NA> 2004 10 1 1 48 4 1
10 654579310… 60134… <NA> 2004 10 1 1 48 4 2
# ℹ 1,216 more rows
# ℹ 188 more variables: CONDID <int>, PREVCOND <int>, STATUSCD <int>,
# SPCD <dbl>, SPGRPCD <int>, DIA <dbl>, DIAHTCD <int>, HT <int>, HTCD <int>,
# ACTUALHT <int>, TREECLCD <int>, CR <int>, CCLCD <int>, TREEGRCD <int>,
# AGENTCD <int>, CULL <int>, DAMLOC1 <int>, DAMTYP1 <int>, DAMSEV1 <int>,
# DAMLOC2 <int>, DAMTYP2 <int>, DAMSEV2 <int>, DECAYCD <int>, STOCKING <dbl>,
# WDLDSTEM <int>, VOLCFNET <dbl>, VOLCFGRS <dbl>, VOLCSNET <dbl>, …
The resulting dataframe is in memory and has 1226 rows.
Be careful using collect() on very large tables. You might crash your R session! If you’re just wanting to View() a table, you might consider using head() or slice_sample() on the data frame before using collect()
tree |>
slice_sample(n = 100) |> #get 100 random rows
collect() |> #pull into memory
View() #open in data viewer tab in RStudioStep 4: disconnect
You can disconnect from the database with dbDisconnect(). It isn’t totally necessary and restarting the R session will do the same thing, but it’s good practice to include in a script.
dbDisconnect(con)