Working with forestTIME-builder data

Author

Eric Scott

Setup

Packages you’ll need:

library(dplyr)
library(dbplyr) #need to install, but don't actually need to explicitly load
library(DBI)
library(duckdb)

If any are missing, install with install.packages("duckdb"), for example.

Step 1: connecting to the database

Tip

The path to the duckdb file can be a relative path if you’re in an RStudio project. If you’re working in a Quarto or RMarkdown document, you might need to use here::here("project/path/to/foresttime-DE.duckdb") to ensure the same working directory whether you’re working interactively or rendering the document.

con <- dbConnect(duckdb("data/db/foresttime-DE.duckdb"))

Step 2: getting a table

List the tables in the database:

dbListTables(con)
 [1] "all_invyrs"                          "cond"                               
 [3] "nsvb_vars"                           "plot"                               
 [5] "qa_flags"                            "ref_species"                        
 [7] "ref_tree_carbon_ratio_dead"          "ref_tree_decay_prop"                
 [9] "sapling_transitions"                 "tree"                               
[11] "tree_annualized"                     "tree_carbon"                        
[13] "tree_carbon_annualized_midpoint"     "tree_carbon_annualized_mortyr"      
[15] "tree_cns"                            "tree_info_composite_id"             
[17] "trees_annual_measures_midpoint_nsvb" "trees_annual_measures_mortyr_nsvb"  

Pull a specific table to work with:

tree_annualized <- tbl(con, "tree_annualized")
tree_annualized
# Source:   table<tree_annualized> [?? x 13]
# Database: DuckDB v1.1.3 [root@Darwin 24.2.0:R 4.4.1//Users/ericscott/Documents/GitHub/mekevans/forestTIME-builder/data/db/foresttime-DE.duckdb]
   STATECD TREE_COMPOSITE_ID TREE_CN_midpoint  YEAR midpoint_dead_year
     <int> <chr>             <chr>            <int>              <dbl>
 1      10 10_1_1_48_1_2     1275355140290487  2021                 NA
 2      10 10_1_1_48_3_10    1275355163290487  2021                 NA
 3      10 10_1_1_132_3_7    1275355801290487  2021                 NA
 4      10 10_1_1_132_4_4    1275355808290487  2021                 NA
 5      10 10_1_1_132_4_1    1275355806290487  2021                 NA
 6      10 10_1_1_202_1_8    1275365477290487  2021                 NA
 7      10 10_1_1_202_1_9    1275365478290487  2021                 NA
 8      10 10_1_1_202_2_8    1275365489290487  2021                 NA
 9      10 10_1_1_238_1_9    1275355197290487  2021                 NA
10      10 10_1_1_238_2_12   1275355217290487  2021                 NA
# ℹ more rows
# ℹ 8 more variables: HT_est_midpoint <dbl>, DIA_est_midpoint <dbl>,
#   AHEIGHT_est_midpoint <dbl>, TREE_CN_mortyr <chr>, mortyr_dead_year <dbl>,
#   HT_est_mortyr <dbl>, DIA_est_mortyr <dbl>, AHEIGHT_est_mortyr <dbl>

Step 3: working with tables

The table above is not in memory. You can tell because it lists the dimensions as [?? x 13]—it doesn’t know how many rows there are because it’s only read in enough to figure out what kinds of data (numeric, character, date, etc.) the columns hold. However, you can still use most tidyverse functions to manipulate it.

tree_annualized |> 
  filter(DIA_est_midpoint < 5) |> 
  mutate(is_dead = YEAR >= midpoint_dead_year) |> 
  select(-STATECD)
# Source:   SQL [?? x 13]
# Database: DuckDB v1.1.3 [root@Darwin 24.2.0:R 4.4.1//Users/ericscott/Documents/GitHub/mekevans/forestTIME-builder/data/db/foresttime-DE.duckdb]
   TREE_COMPOSITE_ID TREE_CN_midpoint  YEAR midpoint_dead_year HT_est_midpoint
   <chr>             <chr>            <int>              <dbl>           <dbl>
 1 10_1_1_202_1_9    1275365478290487  2021                 NA              23
 2 10_1_1_202_2_8    1275365489290487  2021                 NA              20
 3 10_1_5_249_4_3    1275355080290487  2021               2018              20
 4 10_1_1_78_3_12    1540953243290487  2022                 NA              19
 5 10_1_1_488_3_4    463202152489998   2016                 NA              31
 6 10_1_5_226_3_12   463202794489998   2016                 NA              26
 7 10_1_5_276_3_3    463203082489998   2016                 NA              16
 8 10_1_5_276_3_7    463203086489998   2016                 NA              13
 9 10_1_5_276_4_8    463203115489998   2016                 NA              15
10 10_1_5_395_1_60   463202548489998   2016                 NA              35
# ℹ more rows
# ℹ 8 more variables: DIA_est_midpoint <dbl>, AHEIGHT_est_midpoint <dbl>,
#   TREE_CN_mortyr <chr>, mortyr_dead_year <dbl>, HT_est_mortyr <dbl>,
#   DIA_est_mortyr <dbl>, AHEIGHT_est_mortyr <dbl>, is_dead <lgl>

This includes joins

tree <- tbl(con, "tree")
plot <- tbl(con, "plot")
left_join(
  tree,
  plot |> select(PLOT_COMPOSITE_ID, DESIGNCD),
  by = join_by(PLOT_COMPOSITE_ID)
)
# Source:   SQL [?? x 199]
# Database: DuckDB v1.1.3 [root@Darwin 24.2.0:R 4.4.1//Users/ericscott/Documents/GitHub/mekevans/forestTIME-builder/data/db/foresttime-DE.duckdb]
   TREE_CN    PLT_CN PREV_TRE_CN INVYR STATECD UNITCD COUNTYCD  PLOT  SUBP  TREE
   <chr>      <chr>  <chr>       <int>   <int>  <int>    <int> <int> <int> <int>
 1 654580880… 60134… <NA>         2004      10      1        1   132     2     1
 2 654580900… 60134… <NA>         2004      10      1        1   132     2     2
 3 654580920… 60134… <NA>         2004      10      1        1   132     3     1
 4 654580940… 60134… <NA>         2004      10      1        1   132     3     2
 5 654580960… 60134… <NA>         2004      10      1        1   132     3     3
 6 654580980… 60134… <NA>         2004      10      1        1   132     3     4
 7 654581000… 60134… <NA>         2004      10      1        1   132     3     5
 8 654581020… 60134… <NA>         2004      10      1        1   132     3     6
 9 654581040… 60134… <NA>         2004      10      1        1   132     3     7
10 654581060… 60134… <NA>         2004      10      1        1   132     3     8
# ℹ more rows
# ℹ 189 more variables: CONDID <int>, PREVCOND <int>, STATUSCD <int>,
#   SPCD <dbl>, SPGRPCD <int>, DIA <dbl>, DIAHTCD <int>, HT <int>, HTCD <int>,
#   ACTUALHT <int>, TREECLCD <int>, CR <int>, CCLCD <int>, TREEGRCD <int>,
#   AGENTCD <int>, CULL <int>, DAMLOC1 <int>, DAMTYP1 <int>, DAMSEV1 <int>,
#   DAMLOC2 <int>, DAMTYP2 <int>, DAMSEV2 <int>, DECAYCD <int>, STOCKING <dbl>,
#   WDLDSTEM <int>, VOLCFNET <dbl>, VOLCFGRS <dbl>, VOLCSNET <dbl>, …

If you need to bring the dataframe into memory for some reason, you can use collect().

tiny_saplings <- tree |> 
  filter(DIA < 2) |> 
  collect()
tiny_saplings
# A tibble: 1,226 × 198
   TREE_CN    PLT_CN PREV_TRE_CN INVYR STATECD UNITCD COUNTYCD  PLOT  SUBP  TREE
   <chr>      <chr>  <chr>       <int>   <int>  <int>    <int> <int> <int> <int>
 1 654583970… 60135… <NA>         2004      10      1        1    30     3     1
 2 654578730… 60134… <NA>         2004      10      1        1    48     2     2
 3 654578710… 60134… <NA>         2004      10      1        1    48     2     1
 4 654578750… 60134… <NA>         2004      10      1        1    48     2     3
 5 654578770… 60134… <NA>         2004      10      1        1    48     2     4
 6 654578790… 60134… <NA>         2004      10      1        1    48     2     5
 7 654579050… 60134… <NA>         2004      10      1        1    48     3     4
 8 654579070… 60134… <NA>         2004      10      1        1    48     3     5
 9 654579290… 60134… <NA>         2004      10      1        1    48     4     1
10 654579310… 60134… <NA>         2004      10      1        1    48     4     2
# ℹ 1,216 more rows
# ℹ 188 more variables: CONDID <int>, PREVCOND <int>, STATUSCD <int>,
#   SPCD <dbl>, SPGRPCD <int>, DIA <dbl>, DIAHTCD <int>, HT <int>, HTCD <int>,
#   ACTUALHT <int>, TREECLCD <int>, CR <int>, CCLCD <int>, TREEGRCD <int>,
#   AGENTCD <int>, CULL <int>, DAMLOC1 <int>, DAMTYP1 <int>, DAMSEV1 <int>,
#   DAMLOC2 <int>, DAMTYP2 <int>, DAMSEV2 <int>, DECAYCD <int>, STOCKING <dbl>,
#   WDLDSTEM <int>, VOLCFNET <dbl>, VOLCFGRS <dbl>, VOLCSNET <dbl>, …

The resulting dataframe is in memory and has 1226 rows.

Important

Be careful using collect() on very large tables. You might crash your R session! If you’re just wanting to View() a table, you might consider using head() or slice_sample() on the data frame before using collect()

tree |> 
  slice_sample(n = 100) |> #get 100 random rows
  collect() |> #pull into memory
  View() #open in data viewer tab in RStudio

Step 4: disconnect

You can disconnect from the database with dbDisconnect(). It isn’t totally necessary and restarting the R session will do the same thing, but it’s good practice to include in a script.

dbDisconnect(con)