August 13, 2019 From rOpenSci (https://deploy-preview-121--ropensci.netlify.app/blog/2019/08/13/popler/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
The availability of large quantities of freely available data is revolutionizing the world of ecological research. Open data maximizes the opportunities to perform comparative analyses and meta-analyses. Such synthesis efforts will increasingly exploit “population data”, which we define here as time series of population abundance. Such population data plays a central role in testing ecological theory and guiding management decisions. One of the richest sources of open access population data is the USA Long Term Ecological Research (LTER) Network. However, LTER data presents the drawback common to all ecological time-series: extreme heterogeneity derived from differences in sampling designs. We experienced this heterogeneity first hand, upon embarking on our own comparative analysis of population data. Specifically, we noticed that heterogeneities in sampling design made datasets hard to compare, and therefore hard to search and analyze.
To alleviate these issues, we created popler (POPulation time series from Long-term Ecological Research): an online PostgreSQL database (henceforth “popler online database”), and associated R client (henceforth “popler R package”). popler accommodates raw population time-series data using the same structure for all datasets. Using raw data, without aggregation, promotes flexibility in data analysis. Moreover, using a common data structure implies that all datasets share the same variables, or the same types of variables. As a result, popler facilitates comparing, manipulating, and retrieving population data.
We strove to make user-friendliness the hallmark of the popler R package. To facilitate use, we created as few functions as possible: popler relies on essentially two functions: pplr_browse()
, which you can use to see what types of data are available in popler, and pplr_get_data()
, to download datasets from the popler online database.
pplr_browse()
allows selection of which one of the 305 datasets satisfies user-specified criteria. These criteria will depend on the metadata variables used to describe the dataset stored in popler. These metadata variables, and their content, is described by function pplr_dictionary()
.
Once a user understands the meaning and content of metadata variables, she/he can use pplr_browse()
to identify datasets of interest. For example, one could be interested in experimental datasets that quantify population abundance using cover measurements, and that are at least 20 years long. To see which of these data are available, the user should run:
library(popler)
data_search <- pplr_browse( studytype == 'exp' & datatype == 'cover' & duration_years > 20 )
data_search
The resulting tibble object shows that there are only four datasets with these characteristics:
## # A tibble: 4 x 20
## title proj_metadata_k~ lterid datatype structured_data studytype duration_years community
## * <chr> <int> <chr> <chr> <chr> <chr> <int> <chr>
## 1 SGS-~ 71 SGS cover no exp 42 yes
## 2 Post~ 140 AND cover no exp 22 yes
## 3 Vege~ 196 BNZ cover no exp 27 yes
## 4 e133~ 235 CDR cover no exp 26 yes
## # ... with 12 more variables: studystartyr <chr>, studyendyr <chr>, structured_type_1 <chr>,
## # structured_type_2 <chr>, structured_type_3 <chr>, structured_type_4 <chr>, treatment_type_1 <chr>,
## # treatment_type_2 <chr>, treatment_type_3 <chr>, lat_lter <dbl>, lng_lter <dbl>, taxas <list>
If interested we can download these directly:
df_obj <- pplr_get_data( datatype == 'cover' & studytype == 'exp' & duration_years > 20 )
One of the biggest hurdles in creating the popler R package was integrating it with an Application Programming Interface (API) that allows secure access to the popler online database. In principle, the popler R package could access the popler online database directly from R using functionalities of R packages such as RPostgreSQL. However, this poses substantial security threats, and we therefore welcomed the suggestion by Noam Ross, from rOpenSci, to use an API. Subsequently, Scott Chamberlain, also from rOpenSci, was kind enough to develop the API using the Ruby language, the code for which is currently stored on GitHub.
Querying an SQL database using an API improves security, but poses a few limitations on the features of the popler R package, and imposes the additional challenge of managing the API. The API limits the ways in which the R package can query the popler online database. In particular, our API allows querying popler in two ways: downloading the metadata of all 305 datasets, and downloading entire datasets. Fortunately, these two queries reflect the vast majority of operations that users perform when using the popler R package. In particular, the metadata of all 305 datasets is essentially what is used by the pplr_browse()
function, and users almost always tend to use pplr_get_data()
to download entire datasets. Hence, the limitations posed by the API were of little importance when weighed against the benefit provided by increased security. As a side note, using an API provides a minor, yet somewhat enticing perk: a progress bar for the downloads of pplr_get_data()
.
The API also needs to be managed in order to avoid crashes which make downloads impossible. The API crashes whenever the online database is stopped and restarted, and whenever the API web service crashes. The former case is predictable, but the API web service can crash at any time. To monitor whether the web service of our API is running correctly, we use Uptime Robot, a free application that quantifies the time during which the web server is available to users.
Our work with popler highlighted many opportunities linked to data synthesis using population data from the USA LTER network. Specifically, we found that population time-series data, while containing sometimes daunting heterogeneities, also present conspicuous regularities. By “regularities”, we mean that population abundance can be quantified in just a few ways (e.g. using densities, counts, or area covered), that all datasets have a temporal replication of population censuses, a spatial replication structure (which is often times nested), and so on. Ecologists can use these regularities to format most population time-series using the same data model. During our work on popler, we realized that the population data associated with the USA LTER data, and with other long-term population sampling efforts, provide untapped potential to foster synthesis work in population, community, and macro ecology.
We thank reviewers Corinna Gries, and Benjamin Bond-Lamberty for helpful comments that improved the package functionality.