July 27, 2017 From rOpenSci (https://deploy-preview-121--ropensci.netlify.app/blog/2017/07/27/taxonomy-suite/). Except where otherwise noted, content on this site is licensed under the CC-BY license.
Taxonomy in its most general sense is the practice and science of classification. It can refer to many things. You may have heard or used the word taxonomy to indicate any sort of classification of things, whether it be companies or widgets. Here, we’re talking about biological taxonomy, the science of defining and naming groups of biological organisms.
In case you aren’t familiar with the terminology, here’s a brief intro.
- species - the term you are likely most familiar with, usually defined as a group of individuals in which any 2 individuals can produce fertile offspring, although definitions can vary.
- genus/family/order/class/phylum/kingdom - These are nested groupings of similar species. genus (e.g. Homo) is a restrictive grouping and kingdom (e.g. Animalia) is a much more inclusive grouping. There are genera in families, families in orders, etc.
- taxon - a species or grouping of species. e.g. Homo sapiens, Primates, and Animalia are all taxa.
- taxa - the plural of taxon.
- taxonomic hierarchy or taxonomic classification - the list of groups a species (or other taxon) belongs to. For example the taxonomic classification of humans is: Animalia;Chordata;Mammalia;Primates;Hominidae;Homo;sapiens
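To make the idea of a classification as data concrete, here is a minimal base R sketch that splits the human classification string into one row per rank. The rank labels are supplied by hand (an assumption of this sketch, since the string itself carries no rank information), and the variable names are purely illustrative.

```r
# The human classification from the text, as a single delimited string
classification_string <- "Animalia;Chordata;Mammalia;Primates;Hominidae;Homo;sapiens"

# Rank labels are an assumption of this sketch: one per element, most inclusive first
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")

# Split on the delimiter and pair each name with its rank
taxon_names <- strsplit(classification_string, ";", fixed = TRUE)[[1]]
human <- data.frame(rank = ranks, name = taxon_names, stringsAsFactors = FALSE)

# The genus name and the species epithet together form the binomial name
binomial <- paste(human$name[human$rank == "genus"],
                  human$name[human$rank == "species"])

print(human)
print(binomial)
```

The packages described in this post return classifications in a similar rank/name tabular shape, usually with an identifier column as well.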
We put a lot of time into our suite of taxonomic software for a good reason - probably all naturalists/biologists/environmental consultants/etc. will be confronted with taxonomic names in their research/work/surveys/etc. at some point or all along the way. Some people study a single species their entire career, likely having little trouble with taxonomic names - while others study entire communities or ecosystems, dealing with thousands of taxonomic names.
Taxonomic names are not only ubiquitous but are incredibly important to get right. Just as the URL points to the correct page you want to view on the internet (an incorrect URL will not get you where you want to go), taxonomic names point to the right definition/description of a taxon, leading to lots of resources increasingly online including text, images, sounds, etc. If you get the taxonomic name wrong, all information downstream is likely to be wrong.
R is gaining in popularity in general (TIOBE index, Muenchen 2017), and in academia. At least in my graduate school experience (‘06 - ‘12), most graduate students used R - despite their bosses often using other things.
Given that R is widely used among biologists that have to deal with taxonomic names, it makes a lot of sense to build taxonomic tools in R.
We have an ever-growing suite of packages that enable users to:
The packages:
- taxize - taxonomic data from many sources
- taxizedb - work with taxonomic SQL databases locally
- taxa - taxonomic classes and manipulation functions
- binomen - taxonomic name classes and parsing methods (getting folded into taxa, will be archived on CRAN soon)
- wikitaxa - taxonomic data from Wikipedia/Wikidata/Wikispecies
- ritis - get ITIS (Integrated Taxonomic Information System) taxonomic data
- worrms - get WoRMS (World Register of Marine Species) taxonomic data
- pegax - taxonomy PEG (Parsing Expression Grammar)

For each package below, there are 2-3 badges: one for whether the package is on CRAN, a link to source on GitHub, and another for when the package is community contributed.
For each package we show a very brief example - all packages have much more functionality - check them out on CRAN or GitHub.
This was our first package for taxonomy. It is a one stop shop for lots of different taxonomic data sources online, including NCBI, ITIS, GBIF, EOL, IUCN, and more - up to 22 data sources now.
The canonical reference for taxize is the paper we published in 2013:
Chamberlain, S. A., & Szöcs, E. (2013). taxize: taxonomic search and retrieval in R. F1000Research.
Check it out at https://doi.org/10.12688/f1000research.2-191.v1
We released a new version (v0.8.8) about a month ago (a tiny bug fix, v0.8.9, was pushed more recently) with some new features requested by users:
- ncbi_downstream and downstream
- wikitaxa package
- get_iucn
- tax_rank now works with many more data sources: ncbi, itis, eol, col, tropicos, gbif, nbn, worms, natserv, and bold

A quick example of the power of taxize
install.packages("taxize")
library("taxize")
Get WORMS identifiers for three taxa:
ids <- get_wormsid(c("Platanista gangetica", "Lichenopora neapolitana", 'Gadus morhua'))
Get classifications for each taxon
clazz <- classification(ids, db = 'worms')
Combine all three into a single data.frame
head(rbind(clazz))
#> name rank id query
#> 1 Animalia Kingdom 2 254967
#> 2 Chordata Phylum 1821 254967
#> 3 Vertebrata Subphylum 146419 254967
#> 4 Gnathostomata Superclass 1828 254967
#> 5 Tetrapoda Superclass 1831 254967
#> 6 Mammalia Class 1837 254967
taxizedb is a relatively new package. We just released a new version (v0.1.4) about one month ago, with fixes for the new dplyr version.
The sole purpose of taxizedb is to solve the use case where a user has a lot of taxonomic names, and thus using taxize is too slow. Although taxize is a powerful tool, every request is a transaction over the internet, and the speed of that transaction can vary from very fast to very slow, depending on three factors: data provider speed (including many things), your internet speed, and how much data you requested. taxizedb gets around this problem by using a local SQL database of the same stuff the data providers have, so you can get things done much faster.
The trade-off with taxizedb is that the interface is quite different from taxize. So there is a learning curve. There are two options in taxizedb: you can use SQL syntax, or dplyr commands. I’m guessing people are more familiar with the latter.
Install taxizedb
install.packages("taxizedb")
library("taxizedb")
Here, we show working with the ITIS SQL database. Other sources work with the same workflow of function calls.
Download ITIS SQL database
x <- db_download_itis()
#> downloading...
#> unzipping...
#> cleaning up...
#> [1] "/Users/sacmac/Library/Caches/R/taxizedb/ITIS.sql"
db_load_itis() loads the SQL database into Postgres. Data sources vary in the SQL database used, see help for more.
db_load_itis(x, "<your Postgresql user name>", "your Postgresql password, if any")
Create a src object to connect to the SQL database.
src <- src_itis("<your Postgresql user name>", "your Postgresql password, if any")
Query!
library(dbplyr)
library(dplyr)
tbl(src, sql("select * from taxonomic_units limit 10"))
# Source: SQL [?? x 26]
# Database: postgres 9.6.0 [sacmac@localhost:5432/ITIS]
tsn unit_ind1 unit_name1 unit_ind2 unit_name2 unit_ind3 unit_name3 unit_ind4 unit_name4
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 50 <NA> Bacteria <NA> <NA> <NA> <NA> <NA> <NA>
2 51 <NA> Schizomycetes <NA> <NA> <NA> <NA> <NA> <NA>
3 52 <NA> Archangiaceae <NA> <NA> <NA> <NA> <NA> <NA>
4 53 <NA> Pseudomonadales <NA> <NA> <NA> <NA> <NA> <NA>
5 54 <NA> Rhodobacteriineae <NA> <NA> <NA> <NA> <NA> <NA>
6 55 <NA> Pseudomonadineae <NA> <NA> <NA> <NA> <NA> <NA>
7 56 <NA> Nitrobacteraceae <NA> <NA> <NA> <NA> <NA> <NA>
8 57 <NA> Nitrobacter <NA> <NA> <NA> <NA> <NA> <NA>
9 58 <NA> Nitrobacter <NA> agilis <NA> <NA> <NA> <NA>
10 59 <NA> Nitrobacter <NA> flavus <NA> <NA> <NA> <NA>
# ... with more rows, and 17 more variables: unnamed_taxon_ind <chr>, name_usage <chr>, unaccept_reason <chr>,
# credibility_rtng <chr>, completeness_rtng <chr>, currency_rating <chr>, phylo_sort_seq <int>, initial_time_stamp <dttm>,
# parent_tsn <int>, taxon_author_id <int>, hybrid_author_id <int>, kingdom_id <int>, rank_id <int>, update_date <date>,
# uncertain_prnt_ind <chr>, n_usage <chr>, complete_name <chr>
taxa is our newest entry (hit CRAN just a few weeks ago) into the taxonomic R package space. It defines taxonomic classes for R, and basic, but powerful manipulations on those classes.
It defines two broad types of classes: those with just taxonomic data, and a class with taxonomic data plus other associated data (such as traits, environmental data, etc.) called taxmap.
The taxa package includes functions to do various operations with these taxonomic classes. With the taxonomic classes, you can filter out or keep taxa based on various criteria. In the case of the taxmap class, when you filter on taxa, the associated data is filtered the same way so taxa and data are in sync.
A manuscript about taxa is being prepared at the moment - so look out for that.
Most of the hard work in taxa has been done by my co-maintainer Zachary Foster!
A quick example of the power of taxa
install.packages("taxa")
library("taxa")
An example Hierarchy data object that comes with the package:
ex_hierarchy1
#> <Hierarchy>
#> no. taxon's: 3
#> Poaceae / family / 4479
#> Poa / genus / 4544
#> Poa annua / species / 93036
We can remove taxa like the following, combining criteria targeting ranks, taxonomic names, or IDs:
ex_hierarchy1 %>% pop(ranks("family"), ids(4544))
#> <Hierarchy>
#> no. taxon's: 1
#> Poa annua / species / 93036
An example taxmap class:
ex_taxmap
#> <Taxmap>
#> 17 taxa: b. Mammalia ... q. lycopersicum, r. tuberosum
#> 17 edges: NA->b, NA->c, b->d ... j->o, k->p, l->q, l->r
#> 4 data sets:
#> info:
#> # A tibble: 6 x 4
#> name n_legs dangerous taxon_id
#> <fctr> <dbl> <lgl> <chr>
#> 1 tiger 4 TRUE m
#> 2 cat 4 FALSE n
#> 3 mole 4 FALSE o
#> # ... with 3 more rows
#> phylopic_ids: e148eabb-f138-43c6-b1e4-5cda2180485a ... 63604565-0406-460b-8cb8-1abe954b3f3a
#> foods: a list with 6 items
#> And 1 more data sets: abund
#> 1 functions:
#> reaction
Here, filter by taxonomic names to those starting with the letter t (notice the taxa, edgelist, and datasets have changed)
filter_taxa(ex_taxmap, startsWith(taxon_names, "t"))
#> <Taxmap>
#> 3 taxa: m. tigris, o. typhlops, r. tuberosum
#> 3 edges: NA->m, NA->o, NA->r
#> 4 data sets:
#> info:
#> # A tibble: 3 x 4
#> name n_legs dangerous taxon_id
#> <fctr> <dbl> <lgl> <chr>
#> 1 tiger 4 TRUE m
#> 2 mole 4 FALSE o
#> 3 potato 0 FALSE r
#> phylopic_ids: e148eabb-f138-43c6-b1e4-5cda2180485a ... 63604565-0406-460b-8cb8-1abe954b3f3a
#> foods: a list with 3 items
#> And 1 more data sets: abund
#> 1 functions:
#> reaction
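The idea of keeping taxa and their associated data in sync can be sketched in base R. This is not how taxa implements taxmap; it is a toy stand-in using a named vector and a data frame keyed by taxon_id. The helper filter_in_sync is hypothetical, and the data are hand-written to resemble the printed output (the name "catus" for taxon n is an assumption).

```r
# Toy stand-ins for the two parts of a taxmap: taxon names keyed by ID,
# and a data table that refers to those IDs. Values are illustrative only.
taxon_names <- c(m = "tigris", n = "catus", o = "typhlops", r = "tuberosum")
info <- data.frame(
  name     = c("tiger", "cat", "mole", "potato"),
  n_legs   = c(4, 4, 4, 0),
  taxon_id = c("m", "n", "o", "r"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the taxa flagged by `keep`, then drop data rows
# whose taxon_id no longer exists, so taxa and data stay in sync
filter_in_sync <- function(taxon_names, data, keep) {
  kept <- taxon_names[keep]
  list(
    taxa = kept,
    data = data[data$taxon_id %in% names(kept), , drop = FALSE]
  )
}

# Keep only taxa whose names start with "t"
result <- filter_in_sync(taxon_names, info, startsWith(taxon_names, "t"))
print(result$taxa)
print(result$data)
```

In the real package, filter_taxa also updates the edge list and every other data set attached to the taxmap object, which this sketch leaves out.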
wikitaxa is a client that allows you to get taxonomic data from four different Wiki-* sites: Wikipedia, Wikimedia Commons, Wikispecies, and Wikidata.
Only Wikispecies is focused on taxonomy - for the others you could use wikitaxa to do any searches, but we look for and parse out taxonomy-specific items in the wiki objects that are returned.
We released a new version (v0.1.4) earlier this year. Big thanks to Ethan Welty for help on this package.
wikitaxa is used in taxize to get Wiki* data.
A quick example of the power of wikitaxa
install.packages("wikitaxa")
library("wikitaxa")
Search for Malus domestica (apple):
res <- wt_wikispecies(name = "Malus domestica")
# links to language sites for the taxon
res$langlinks
#> # A tibble: 12 x 5
#> lang url langname
#> * <chr> <chr> <chr>
#> 1 ast https://ast.wikipedia.org/wiki/Malus_domestica Asturian
#> 2 es https://es.wikipedia.org/wiki/Malus_domestica Spanish
#> 3 hu https://hu.wikipedia.org/wiki/Nemes_alma Hungarian
#> 4 ia https://ia.wikipedia.org/wiki/Malus_domestica Interlingua
#> 5 it https://it.wikipedia.org/wiki/Malus_domestica Italian
#> 6 nds https://nds.wikipedia.org/wiki/Huusappel Low German
#> 7 nl https://nl.wikipedia.org/wiki/Appel_(plant) Dutch
#> 8 pl https://pl.wikipedia.org/wiki/Jab%C5%82o%C5%84_domowa Polish
#> 9 pms https://pms.wikipedia.org/wiki/Malus_domestica Piedmontese
#> 10 pt https://pt.wikipedia.org/wiki/Malus_domestica Portuguese
#> 11 sk https://sk.wikipedia.org/wiki/Jablo%C5%88_dom%C3%A1ca Slovak
#> 12 vi https://vi.wikipedia.org/wiki/Malus_domestica Vietnamese
#> # ... with 2 more variables: autonym <chr>, `*` <chr>
# any external links on the page
res$externallinks
#> [1] "https://web.archive.org/web/20090115062704/http://www.ars-grin.gov/cgi-bin/npgs/html/taxon.pl?104681"
# any common names, and the language they are from
res$common_names
#> # A tibble: 19 x 2
#> name language
#> <chr> <chr>
#> 1 Ябълка български
#> 2 Poma, pomera català
#> 3 Apfel Deutsch
#> 4 Aed-õunapuu eesti
#> 5 Μηλιά Ελληνικά
#> 6 Apple English
#> 7 Manzano español
#> 8 Pomme français
#> 9 Melâr furlan
#> 10 사과나무 한국어
#> 11 ‘Āpala Hawaiʻi
#> 12 Melo italiano
#> 13 Aapel Nordfriisk
#> 14 Maçã, Macieira português
#> 15 Яблоня домашняя русский
#> 16 Tarhaomenapuu suomi
#> 17 Elma Türkçe
#> 18 Яблуня домашня українська
#> 19 Pomaro vèneto
# the taxonomic hierarchy - or classification
res$classification
#> # A tibble: 8 x 2
#> rank name
#> <chr> <chr>
#> 1 Superregnum Eukaryota
#> 2 Regnum Plantae
#> 3 Cladus Angiosperms
#> 4 Cladus Eudicots
#> 5 Cladus Core eudicots
#> 6 Cladus Rosids
#> 7 Cladus Eurosids I
#> 8 Ordo Rosales
ritis is a client for ITIS (Integrated Taxonomic Information System), part of USGS.
There are a number of different ways to get ITIS data, one of which (local SQL dump) is available in taxizedb, while the others (the SOLR service and the RESTful web service) are covered in ritis.
The functions that use the SOLR service are: itis_search, itis_facet, itis_group, and itis_highlight.
All other functions interact with the RESTful web service.
We released a new version (v0.5.4) late last year.
ritis is used in taxize to get ITIS data.
A quick example of the power of ritis
install.packages("ritis")
library("ritis")
Search for blue oak (Quercus douglasii)
search_scientific("Quercus douglasii")
#> # A tibble: 1 x 12
#> author combinedName kingdom tsn unitInd1 unitInd2 unitInd3
#> * <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl>
#> 1 Hook. & Arn. Quercus douglasii Plantae 19322 NA NA NA
#> # ... with 5 more variables: unitInd4 <lgl>, unitName1 <chr>,
#> # unitName2 <chr>, unitName3 <lgl>, unitName4 <lgl>
Get taxonomic hierarchy down from the Oak genus - that is, since it’s a genus, get all species in the Oak genus
res <- search_scientific("Quercus")
hierarchy_down(res[1,]$tsn)
#> # A tibble: 207 x 5
#> parentname parenttsn rankname taxonname tsn
#> * <chr> <chr> <chr> <chr> <chr>
#> 1 Quercus 19276 Species Quercus falcata 19277
#> 2 Quercus 19276 Species Quercus lyrata 19278
#> 3 Quercus 19276 Species Quercus michauxii 19279
#> 4 Quercus 19276 Species Quercus nigra 19280
#> 5 Quercus 19276 Species Quercus palustris 19281
#> 6 Quercus 19276 Species Quercus phellos 19282
#> 7 Quercus 19276 Species Quercus virginiana 19283
#> 8 Quercus 19276 Species Quercus macrocarpa 19287
#> 9 Quercus 19276 Species Quercus coccinea 19288
#> 10 Quercus 19276 Species Quercus agrifolia 19289
#> # ... with 197 more rows
worrms is a client for working with data from the World Register of Marine Species (WoRMS).
WoRMS is the most authoritative list of names of all marine species globally.
We released our first version (v0.1.0) earlier this year.
worrms is used in taxize to get WoRMS data.
A quick example of the power of worrms
install.packages("worrms")
library("worrms")
Get taxonomic name synonyms for salmon (Oncorhynchus)
xx <- wm_records_name("Oncorhynchus", fuzzy = FALSE)
wm_synonyms(id = xx$AphiaID)
#> # A tibble: 4 x 25
#> AphiaID url
#> * <int> <chr>
#> 1 296858 http://www.marinespecies.org/aphia.php?p=taxdetails&id=296858
#> 2 397908 http://www.marinespecies.org/aphia.php?p=taxdetails&id=397908
#> 3 397909 http://www.marinespecies.org/aphia.php?p=taxdetails&id=397909
#> 4 297397 http://www.marinespecies.org/aphia.php?p=taxdetails&id=297397
#> # ... with 23 more variables: scientificname <chr>, authority <chr>,
#> # status <chr>, unacceptreason <chr>, rank <chr>, valid_AphiaID <int>,
#> # valid_name <chr>, valid_authority <chr>, kingdom <chr>, phylum <chr>,
#> # class <chr>, order <chr>, family <chr>, genus <chr>, citation <chr>,
#> # lsid <chr>, isMarine <int>, isBrackish <lgl>, isFreshwater <lgl>,
#> # isTerrestrial <int>, isExtinct <lgl>, match_type <chr>, modified <chr>
pegax aims to be a powerful taxonomic name parser for R. This package started at #runconf17 and was made possible because the talented Oliver Keyes created a Parsing Expression Grammar package for R: piton.
From piton, PEGs are:
a way of defining formal grammars for formatted data that allow you to identify matched structures and then take actions on them
Some great taxonomic name parsing does exist already. Global Names Parser (gnparser) is a great effort by Dmitry Mozzherin and others. The only problem is that Java does not play nice with R - thus pegax, which is implemented in C++. We’ll definitely try to learn a lot from the work they have done on gnparser.
pegax is not on CRAN yet. The package is in very very early days, so expect lots of changes.
A quick example of the power of pegax
devtools::install_github("ropenscilabs/pegax")
library("pegax")
Parse out authority name
authority_names("Linnaeus, 1758")
#> [1] "Linnaeus"
Parse out authority year
authority_years("Linnaeus, 1758")
#> [1] "1758"
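For comparison, the same two pieces can be pulled out of a simple "Author, Year" string with base R regular expressions. This is a regex stand-in, not a PEG and not how pegax works internally; parse_authority is a hypothetical helper written only for this illustration, and it assumes the year is the first run of four digits.

```r
# Hypothetical helper, not part of pegax: split "Author, Year" with regular expressions
parse_authority <- function(x) {
  # The year is taken to be the first run of four digits
  year <- regmatches(x, regexpr("[0-9]{4}", x))
  # The name is everything before the year, minus separating commas and spaces
  name <- sub("[, ]*[0-9]{4}.*$", "", x)
  list(name = name, year = year)
}

parsed <- parse_authority("Linnaeus, 1758")
print(parsed$name)
print(parsed$year)
```

A regex like this breaks down quickly on real authority strings (multiple authors, parentheses, "ex" and "in" constructions), which is the kind of structure a grammar-based parser is meant to handle.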
These packages do not primarily deal with taxonomy, but do include taxonomic data. No examples are included below, but do check out their vignettes and other documentation to get started.
rotl is maintained by Francois Michonneau, Joseph Brown, and David Winter, and is a package to interact with the Open Tree of Life (OTL). OTL’s main purpose is perhaps phylogeny data, but they do have a taxonomy they maintain, and rotl has functions that let you access that taxonomic data.
rotl is used in taxize to get OTL data.
rredlist is an interface to the IUCN Redlist of Threatened Species, which provides taxonomic, conservation status and distribution information on plants, fungi and animals that have been globally evaluated using the IUCN Red List Categories and Criteria.
rredlist is used in taxize to get IUCN Redlist Taxonomy data.
bold is an interface to the Barcode of Life Data Systems (BOLD), which provides DNA barcode sequence and specimen data along with an associated taxonomy.
bold is used in taxize to get BOLD taxonomy data.
rgbif is an interface to the Global Biodiversity Information Facility, the largest provider of free and open access biodiversity data.
rgbif is used in taxize to get GBIF taxonomy data.
Together, the rOpenSci taxonomy suite of packages makes it much easier to work with taxonomy data in R. We hope you agree :)
Despite all of the above, we still have some things to work on:
- Use taxa taxonomy classes where appropriate. We plan to deploy taxa classes inside of the taxize package very soon, but they may be appropriate elsewhere as well. Using the same classes in many packages will make working with taxonomic data more consistent across packages.
- taxizedb needs to be more robust. Given that the package not only touches your file system, but for some data sources also depends on different SQL databases, we likely will run into many problems with various operating system + database combinations. Please do kick the tires and get back to us!
- Once pegax is ready for use, we’ll be able to use it in many packages whenever we need to parse taxonomic names.
- taxize - get in touch and let us know.

What do you think about the taxonomic suite of packages? Anything we’re missing? Anything we can be doing better with any of the packages? Are you working on a taxonomic R package? Consider submitting to rOpenSci.