Forum Paper
Corresponding author: R. Henrik Nilsson (henrik.nilsson@bioenv.gu.se). Academic editor: Thorsten Lumbsch
© 2023 R. Henrik Nilsson, Martin Ryberg, Christian Wurzbacher, Leho Tedersoo, Sten Anslan, Sergei Põlme, Viacheslav Spirin, Vladimir Mikryukov, Sten Svantesson, Martin Hartmann, Charlotte Lennartsdotter, Pauline Belford, Maryia Khomich, Alice Retter, Natàlia Corcoll, Daniela Gómez Martinez, Tobias Jansson, Masoomeh Ghobad-Nejhad, Duong Vu, Marisol Sanchez-Garcia, Erik Kristiansson, Kessy Abarenkov.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Nilsson RH, Ryberg M, Wurzbacher C, Tedersoo L, Anslan S, Põlme S, Spirin V, Mikryukov V, Svantesson S, Hartmann M, Lennartsdotter C, Belford P, Khomich M, Retter A, Corcoll N, Gómez Martinez D, Jansson T, Ghobad-Nejhad M, Vu D, Sanchez-Garcia M, Kristiansson E, Abarenkov K (2023) How, not if, is the question mycologists should be asking about DNA-based typification. MycoKeys 96: 143-157. https://doi.org/10.3897/mycokeys.96.102669
Fungal metabarcoding of substrates such as soil, wood, and water is uncovering an unprecedented number of fungal species that do not seem to produce tangible morphological structures and that defy our best attempts at cultivation, thus falling outside the scope of the International Code of Nomenclature for algae, fungi, and plants. The present study uses the new, ninth release of the species hypotheses of the UNITE database to show that species discovery through environmental sequencing vastly outpaces traditional, Sanger sequencing-based efforts in a strongly increasing trend over the last five years. Our findings challenge the present stance of some in the mycological community – that the current situation is satisfactory and that no change is needed to “the code” – and suggest that we should be discussing not whether to allow DNA-based descriptions (typifications) of species and by extension higher ranks of fungi, but what the precise requirements for such DNA-based typifications should be. We submit a tentative list of such criteria for further discussion. The present authors hope for a revitalized and deepened discussion on DNA-based typification, because to us it seems harmful and counter-productive to intentionally deny the overwhelming majority of extant fungi a formal standing under the International Code of Nomenclature for algae, fungi, and plants.
Keywords: Dark taxa, ICN, nomenclature, species description, taxonomy, type principle
Dark matter is an astronomical concept that denotes mass of a hitherto unknown nature. That mass is detectable indirectly through the gravity it exerts – such as the bending of passing light – but its exact nature has so far defied scientific explanation. Mycology offers an analogy in the form of dark taxa, a concept that we define as taxa that do not seem to produce tangible morphological structures and that we cannot seem to cultivate in the lab (cf.
Most of the present authors have spent considerable time in the company of dark fungal taxa (DFT) as recovered through environmental metabarcoding and as manifested in the UNITE database for molecular identification of fungi (
In the present forum paper, we wish to visualize the relative contribution of DFT to molecular mycological species discovery over time. We do this through two molecular datasets, both of which reflect current knowledge but also carry biases of various kinds. These datasets are: 1) all full-length fungal ITS sequences in the International Nucleotide Sequence Database Collaboration (INSDC;
The full flow of operation behind the UNITE database is described elsewhere (
We downloaded all sequences included in the October 2022 version 9 release of the UNITE species hypothesis system. To allow us to contrast the species discovery from taxonomic and metabarcoding studies, we made the admittedly coarse assumption that all SHs that contained at least one sequence from the INSDC could be considered as taxonomy-derived SHs, that is, SHs with some sort of footing in traditional taxonomy. In contrast, all SHs containing only metabarcoding sequences were considered to be DFT. Based on the date of initial submission of each sequence (submission to INSDC and to UNITE, respectively, for INSDC and DFT sequences), we examined the accumulation of SHs over time. We plotted the accumulation of taxonomy-derived and DFT-only SHs against date of initial discovery in R v. 4.2.2 (
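The classification rule and the accumulation over time described above can be illustrated with a minimal Python sketch. It is not the authors' R workflow. The record layout, the toy SH identifiers, and the function names are hypothetical, and the choice of "earliest submission year in the relevant source" as the date of initial discovery is an assumption made for illustration.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical toy records: (SH identifier, sequence source, year of initial submission).
records = [
    ("SH_A", "INSDC", 2005),
    ("SH_A", "UNITE", 2019),
    ("SH_B", "UNITE", 2020),
    ("SH_B", "UNITE", 2018),
    ("SH_C", "INSDC", 2012),
    ("SH_D", "UNITE", 2020),
]


def classify_shs(records):
    """Label each SH and assign it a discovery year.

    An SH with at least one INSDC sequence is treated as taxonomy-derived;
    an SH with metabarcoding (UNITE) sequences only is treated as DFT.
    Assumption: the discovery year is the earliest INSDC submission for
    taxonomy-derived SHs and the earliest UNITE submission for DFT SHs.
    """
    by_sh = defaultdict(list)
    for sh, source, year in records:
        by_sh[sh].append((source, year))

    classified = {}
    for sh, seqs in by_sh.items():
        insdc_years = [year for source, year in seqs if source == "INSDC"]
        if insdc_years:
            classified[sh] = ("taxonomy-derived", min(insdc_years))
        else:
            classified[sh] = ("DFT", min(year for _, year in seqs))
    return classified


def cumulative_counts(classified, category):
    """Return (year, cumulative number of SHs) pairs for one category."""
    per_year = defaultdict(int)
    for cat, year in classified.values():
        if cat == category:
            per_year[year] += 1
    years = sorted(per_year)
    totals = accumulate(per_year[y] for y in years)
    return list(zip(years, totals))


classified = classify_shs(records)
dft_curve = cumulative_counts(classified, "DFT")
taxonomy_curve = cumulative_counts(classified, "taxonomy-derived")
```

The two resulting lists correspond to the two accumulation curves; years without any new SH are simply absent from the lists rather than carried forward.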
While there is little hope of piecing together the ecological context of these sequences in an automated way, at least there is an opportunity to visualize the country of collection for many of the sequences in INSDC and UNITE. We thus sought to illustrate the geographical component of the SH accumulation curves by summarizing the country of collection of the taxonomy-derived and DFT sequences. In total, 63% of the taxonomy-derived, and 99.9% of the DFT, sequences were tagged with an explicit country of origin. The 20 most common countries of origin in each dataset were compiled using R.
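The country summary was compiled in R; the same kind of tally can be sketched in Python as follows. The function name, the input layout (one country string per sequence, `None` where no country is given), and the toy values are hypothetical, and mapping untagged sequences to an "Unknown" label is an assumption.

```python
from collections import Counter


def summarize_countries(countries, top_n=20):
    """Return the share of tagged sequences and the top_n country counts.

    `countries` holds one entry per sequence; None or an empty string
    means that no country of collection was recorded. Untagged sequences
    are counted under the label "Unknown" (an assumption of this sketch).
    """
    if not countries:
        return 0.0, []
    labels = [c if c else "Unknown" for c in countries]
    tagged_share = sum(1 for c in countries if c) / len(countries)
    return tagged_share, Counter(labels).most_common(top_n)


# Hypothetical toy input, not real data.
toy = ["Estonia", "Estonia", "Estonia", "Italy", "Italy", None]
share, top = summarize_countries(toy, top_n=2)
```

Run on the real data with `top_n=20`, a tally of this kind would yield one country and count pair per row for each dataset.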
We retrieved a total of 1.26 M taxonomy- (Sanger sequencing-) derived sequences from INSDC and 7.1 M metabarcoding-derived DFT sequences from UNITE (https://unite.ut.ee/repository.php). The taxonomy-derived sequences were found to stem from a total of 88,665 distinct published and unpublished studies as defined by the combination of the INSDC fields AUTHORS, TITLE, and JOURNAL. The DFT sequences were found to stem from five studies. The SH accumulation curves at the dynamic 1.5% similarity threshold level are shown in Fig.
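Counting studies by the combination of the AUTHORS, TITLE, and JOURNAL fields amounts to counting distinct field triples. A minimal Python sketch follows; the dictionary layout, the function name, and the toy entries are hypothetical, and trimming surrounding whitespace before comparison is an added assumption.

```python
def count_distinct_studies(entries):
    """Count unique (AUTHORS, TITLE, JOURNAL) combinations.

    Each entry is a dict of INSDC reference fields for one sequence.
    Surrounding whitespace is trimmed before comparison (an assumption).
    """
    keys = {
        (
            entry.get("AUTHORS", "").strip(),
            entry.get("TITLE", "").strip(),
            entry.get("JOURNAL", "").strip(),
        )
        for entry in entries
    }
    return len(keys)


# Hypothetical toy entries, not real records.
toy_entries = [
    {"AUTHORS": "Doe J", "TITLE": "Fungi of site X", "JOURNAL": "Unpublished"},
    {"AUTHORS": "Doe J", "TITLE": "Fungi of site X", "JOURNAL": "Unpublished "},
    {"AUTHORS": "Doe J", "TITLE": "Fungi of site X", "JOURNAL": "Journal Y 1: 1-10"},
]
n_studies = count_distinct_studies(toy_entries)
```

Under this definition, the same work counts as two studies if it is deposited once as unpublished and once with its final journal reference.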
The accumulation of SHs at the 1.5% distance threshold over time in the Sanger (black; 88,665 studies of various sizes) and the DFT (red; 5 large studies) datasets. The Y axis depicts the number of SHs, and the X axis depicts year of sequence deposition. Solid trend lines were calculated using cubic smoothing splines. Also plotted (blue) is the cumulative number of newly described species for the period 2002–2022 (excluding recombinations, orthographic variants, invalid names, and illegitimate names). The numbers of species described in ca. 2020–2022 may be slight underestimates due to widespread violation of the ICN recommendation F.5A to “inform the recognised repository of the complete bibliographic details upon publication of the name”. In reality, the Sanger (INSDC) dataset is also likely to hold some proportion of DFT. DFT sequences are notoriously difficult to tell apart in an automated way from sequences that are unidentified for other reasons (
The 20 most common countries of collection for the Sanger and the DFT sequences. The DFT dataset is dominated by sequences from Estonia, from which most of the five metabarcoding studies were run. Estonia is not known as a particularly rich biodiversity hotspot, which suggests that additional worldwide sampling might have produced even more dramatic increases in the number of DFT SHs.
INSDC country | INSDC sequences | DFT country | DFT sequences |
---|---|---|---|
Unknown | 463524 | Estonia | 1788894 |
United States | 133496 | United States | 350869 |
China | 117292 | Italy | 287842 |
India | 31788 | Brazil | 285473 |
Japan | 29754 | Czechia | 260611 |
Brazil | 27765 | Russian Federation | 228979 |
Canada | 26038 | Mexico | 210643 |
Spain | 22362 | Norway | 208422 |
Australia | 22205 | Colombia | 204172 |
Germany | 19971 | Australia | 177777 |
Italy | 18078 | Sweden | 177318 |
Mexico | 16326 | Latvia | 169168 |
France | 14896 | Lithuania | 166553 |
Korea, Republic of | 12434 | Georgia | 146440 |
Russian Federation | 11668 | Finland | 127258 |
Iran, Islamic Republic of | 11285 | India | 123706 |
Poland | 10969 | Argentina | 116852 |
New Zealand | 10956 | China | 100143 |
Thailand | 10708 | Papua New Guinea | 96253 |
South Africa | 10642 | Tanzania, United Republic of | 95203 |
The present study approximated fungal species accumulation over time as deduced from taxonomic and metabarcoding efforts. We found that the DFT account for the clear majority of the new species discovered in the last five years (although some limited proportion of both the Sanger-derived and the DFT sequences may possibly correspond to described, but so far unsequenced, species). We reached this conclusion based on a very limited number of metabarcoding studies – in fact, just five – of soil fungal communities and in the almost complete absence of metabarcoding data from, e.g., water, air, wood, and plant material. One can only imagine that Fig.
When data are sparse, opinions may be maintained and cherished for longer than necessary. Our results show that data are no longer sparse; DFT, in view of their diversity and abundance, form a major, inextricable component of the fungal kingdom. They simply cannot be ignored. It is not scientifically defensible to exclude them from mycological efforts in phylogeny, ecology, or biogeography. We therefore argue that it does not make sense to deny them a formal standing under the ICN. We feel that it is time – in fact, long overdue – to resume and deepen the discussion initiated by, e.g.,
There is clearly room for refinement of the requirements mentioned here, and we are furthermore certain that the mycological community can come up with additional prerequisites to further increase stringency and reduce the risk of haphazard, more or less irreproducible or irresponsible use of DNA sequences as types (cf.
It could be argued that a separate nomenclature code should be erected for the DFT, akin perhaps to the Candidatus concept in bacteria (
The GenBank staff is gratefully acknowledged for assistance with establishing the date of submission for the INSDC entries. We thank Nathan Smith for useful feedback on scientific publishing and mycological journals. David Hibbett and Conrad Schoch are acknowledged for valuable feedback on an earlier draft of the manuscript. Konstanze Bensch and MycoBank are acknowledged for assistance with species description statistics. The work of KA was supported by the Estonian Research Council (grant PRG1170).