Research Article
Corresponding author: R. Henrik Nilsson (henrik.nilsson@dpes.gu.se). Academic editor: Thorsten Lumbsch
© 2016 R. Henrik Nilsson, Christian Wurzbacher, Mohammad Bahram, Victor R. M. Coimbra, Ellen Larsson, Leho Tedersoo, Jonna Eriksson, Camila Duarte, Sten Svantesson, Marisol Sánchez-García, Martin K. Ryberg, Erik Kristiansson, Kessy Abarenkov.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Nilsson RH, Wurzbacher C, Bahram M, Coimbra VRM, Larsson E, Tedersoo L, Eriksson J, Duarte Ritter C, Svantesson S, Sánchez-García M, Ryberg M, Kristiansson E, Abarenkov K (2016) Top 50 most wanted fungi. MycoKeys 12: 29-40. https://doi.org/10.3897/mycokeys.12.7553
Environmental sequencing regularly recovers fungi that cannot be classified to any meaningful taxonomic level beyond “Fungi”. There are several examples where evidence of such lineages has been sitting in public sequence databases for up to ten years before receiving scientific attention and formal recognition. In order to highlight these unidentified lineages for taxonomic scrutiny, a search function is presented that produces updated lists of approximately genus-level clusters of fungal ITS sequences that remain unidentified at the phylum, class, and order levels, respectively. The search function (https://unite.ut.ee/top50.php) is implemented in the UNITE database for molecular identification of fungi, such that the underlying sequences and fungal lineages are open to third-party annotation. We invite researchers to examine these enigmatic fungal lineages in the hope that their taxonomic resolution will not have to wait another ten years or more.
Fungi, environmental sequencing, taxonomic orphans, metabarcoding, taxonomy feedback loop
Fungi form a large and diverse kingdom of heterotrophic eukaryotes. Recent studies suggest that there may be more than 6 million extant species of fungi (
Molecular ecology studies regularly struggle to identify the recovered fungi to meaningful taxonomic levels. Lack of reference sequences, mis-annotated reference sequences, and reference sequences annotated only to, e.g., kingdom or phylum level combine to make taxonomic identification of newly recovered sequence data challenging (
In our work with environmental sequencing of fungi, we regularly run across these unidentified lineages. We typically encounter them through sequences of the internal transcribed spacer (ITS), the formal fungal barcode (
UNITE clusters all public fungal ITS sequences (~500,000 at the time of this writing) to approximately the genus/subgenus level (called a “compound cluster”) using a clustering threshold of 80% sequence similarity. A second round of clustering inside each such compound cluster seeks to produce molecular operational taxonomic units (OTUs) at approximately the species level; these OTUs are called
Although UNITE offers various search functions targeting the compound clusters and species hypotheses, none of the search functions were designed to find truly poorly known lineages. To remedy this, we devised a search function to retrieve fungal lineages for which little to no taxonomic information is available. The user is presented with two main choices: 1) the taxonomic level to be considered (phylum, class, or order), and 2) whether the list of compound clusters should be ordered by the number of constituent sequences or by the number of studies in which the sequences were found. In addition, the user can exercise control over how the output is shown through several other options.
To enable exploration of different hierarchical levels in the classification system, the search function supports three different levels: phylum, class, and order. Thus, the search function will retrieve clusters of sequences where none of the sequences are identified at the phylum, class, or order level depending on the choice of the user.
Multiple independent recoveries of a particular fungal sequence type strengthen the case that the lineage corresponds to a biological reality. Conversely, for sequence types found only in a single study, some healthy skepticism is warranted given the sequence quality issues that affect studies based on cloning as well as next-generation sequencing (
Each search will retrieve all clusters of sequences fulfilling the criteria. Thus, there are 3 (phylum, class, and order) × 2 (ordered by sequences or by studies) = 6 lists of “poorly known” fungal lineages. Some degree of overlap among these lists is likely; a compound cluster where all sequences are unidentified at the order level may also qualify as a cluster where all of the sequences are unidentified at the phylum level. No attempt was made to account for such redundancy.
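The selection and ordering logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the UNITE implementation: the `Sequence` and `CompoundCluster` classes, their field names, and the convention that a missing taxonomic name is stored as `None` are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import product

LEVELS = ("phylum", "class", "order")
ORDERINGS = ("sequences", "studies")


@dataclass
class Sequence:
    accession: str
    study: str
    # Hypothetical annotation record: level -> name; absent means unidentified.
    taxonomy: dict = field(default_factory=dict)


@dataclass
class CompoundCluster:
    name: str
    sequences: list


def is_unidentified(cluster, level):
    """True if none of the cluster's sequences carries a name at `level`."""
    return all(seq.taxonomy.get(level) is None for seq in cluster.sequences)


def most_wanted(clusters, level, order_by, top=50):
    """Return the `top` unidentified clusters, largest first."""
    hits = [c for c in clusters if is_unidentified(c, level)]
    if order_by == "sequences":
        size = lambda c: len(c.sequences)
    else:
        # Count distinct studies in which the cluster's sequences were found.
        size = lambda c: len({s.study for s in c.sequences})
    return sorted(hits, key=size, reverse=True)[:top]


def all_lists(clusters, top=50):
    """Build the 3 x 2 = 6 lists, keyed by (level, ordering)."""
    return {
        (level, order_by): most_wanted(clusters, level, order_by, top)
        for level, order_by in product(LEVELS, ORDERINGS)
    }
```

In this sketch, a cluster whose sequences carry no annotation at all appears in every one of the six lists, which mirrors the overlap noted above; as in the text, no deduplication is attempted.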
A concern was that these sequences could be subject to quality issues. Alternatively, they could be false positives in that they lacked explicit taxonomic annotation but nevertheless were easy to assign to a known taxonomic lineage. To minimize these concerns, we examined the 50 largest lineages at the phylum, class, and order levels (as ordered by the number of constituent studies) through BLAST searches in UNITE and the INSDC following
The phylum-level search returned 1,004 compound clusters, of which 830 (83%) were singletons. Out of the 1,364 class-level clusters, 1,056 (77%) were singletons; and out of the 1,738 order-level clusters, 1,290 (74%) were singletons. The results presented here focus on the 50 topmost entries in each of these lists. The largest of the phylum-level clusters comprised 30 sequences, and the average number of sequences in the 50 topmost clusters was 7.4 (standard deviation: 4.9). At the class level, the largest cluster comprised 60 sequences (average cluster size 8.5 sequences, standard deviation 9.7). At the order level, the largest cluster comprised 60 sequences (average cluster size 9.5 sequences, standard deviation 9.5). The cluster with the highest number of independent recoveries had been found in 23 different studies and was unidentified at the order level.
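The singleton proportions quoted above follow directly from the reported counts. The short check below recomputes them; the variable and function names are illustrative only, and the figures are those given in the text.

```python
# (singletons, total compound clusters) per taxonomic level, as reported.
counts = {
    "phylum": (830, 1004),
    "class": (1056, 1364),
    "order": (1290, 1738),
}


def singleton_share(singletons, total):
    """Percentage of clusters that are singletons, rounded to a whole number."""
    return round(100 * singletons / total)


shares = {level: singleton_share(s, t) for level, (s, t) in counts.items()}

# Clusters with two or more sequences, i.e. the pool the top-50 lists draw on
# once singletons are set aside.
non_singletons = {level: t - s for level, (s, t) in counts.items()}

for level in counts:
    print(f"{level}: {shares[level]}% singletons, "
          f"{non_singletons[level]} multi-sequence clusters")
```

The number of multi-sequence clusters is not stated in the text; it is derived here by subtraction.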
The lists, with accompanying multiple sequence alignments and geo/ecological metadata, are available for viewing and third-party annotation at https://unite.ut.ee/top50.php (Figs
A compound cluster displayed in the web browser of the user. The INSDC accession numbers and their taxonomic annotation are shown in columns 1 and 2. The DNA source and the country of collection are shown in columns 3 and 4. Column 5 shows the inclusiveness of the species hypotheses at the 97% similarity level (rightmost filled column), the 97.5% similarity level (second-to-rightmost filled column), and so on up to 100% similarity. The aligned sequence data are shown in column 6.
Web-based third-party taxonomic annotation of the sequences in a species hypothesis is demonstrated. Third-party annotation requires non-anonymous registration, and such annotations are subject to peer review. Annotations are tagged with the name of the annotator as well as the date. Multiple annotations for individual entries are supported.
Our data assembly effort to restore data on the country and host of collection resulted in 60 sequences being tagged with a country of collection and 261 with a substrate of collection. Data on country and substrate of collection for the 50 largest compound clusters that were not identified at the phylum, class, and order level, respectively, are shown in Figs
Geographical distribution of the top 50 most wanted fungi at the phylum, class, and order level. Each fungal sequence was assigned to country of origin according to its INSDC entry (or underlying publication as applicable) and then summarized based on the continents: Africa (dark blue), Antarctica (green), Asia (grey), Australia (yellow), Europe (orange), North America (light blue), and South America (blue).
The most common substrates associated with the top 50 most wanted fungi at the phylum, class, and order level. Each fungal sequence was assigned to substrate according to its GenBank entry (or underlying publication as applicable). The major substrates included soil (light blue), living plants (blue), mycorrhiza (orange), dust (green), lichen (dark blue), dead wood (red), and other (grey). To improve readability, rare substrates (<3 occurrences) were merged into the ‘other’ category.
This paper presents a set of lists of fungi for which taxonomic assignment is very troublesome at present. These lists matter, because the underlying fungi are regularly recovered in environmental sequencing efforts, where they contribute to the proportion of unidentified sequences. Mycology is a comparatively small discipline that struggles for funding (cf.
We examined all sequence types from the 50 largest compound clusters for telltale signs of a technically compromised nature, such as chimeric insertions or low read quality (cf.
It is not immediately clear that all of these lineages indeed are fungi, although at least one fungus-specific primer seems to have been involved in the generation of many of them. Many studies have reported the occasional (even frequent) co-amplification of, e.g., plants and metazoans with fungus-specific primers (cf.
Precise and robust taxonomic assignment of these ITS sequences is not possible at present due to the lack of similar reference sequences in the public sequence databases. Sequence data from the much more conserved, neighboring small and large subunit genes (18S/SSU and 28S/LSU, respectively) would presumably have alleviated this problem by allowing phylogenetic placement in the context of known SSU and LSU sequences. However, ITS sequences are typically sequenced and deposited without significant parts of the SSU and LSU, particularly in environmental sequencing efforts, rendering this approach difficult. Deeply sequenced metagenomes, as well as emerging sequencing technologies producing very long reads, offer a route by which to retrieve parts of the ITS region attached to either the SSU or LSU, or indeed spanning both. Thus, the increasing popularity of metagenomics and genomics may solve many of these cases over time. Researchers in traditional systematics and taxonomy can contribute as well. Supplying, as a minimum, an ITS sequence with each new species description would offer structure to available sequence data and would significantly reduce interpretation difficulties of species names (
We are working to add further flexibility to the generation of these lists. Some researchers may, for example, be interested only in unknown fungi found in the built environment, or in a medical context, or from aquatic environments. We will seek to address these needs by compiling a set of keywords for each such research field. For the built environment, these keywords would include, e.g., “house”, “dust”, “building”, and “gypsum”. The search function will then require that a compound cluster contains at least one sequence where at least one of these keywords occurs either in the title of the underlying scientific study or in the FEATURES field of the corresponding INSDC/UNITE entry. In this way, the search function would retrieve compound clusters with at least one fungal sequence that has a relation to the built environment. We will similarly endeavor to add support for the genus and species levels in the search function.
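The planned keyword filter could look roughly like the sketch below. It is an illustration only, since the feature is described as future work: the `Record` class and its fields are hypothetical, and matching keywords at the start of a word (so that “dust” does not match “industrial”) is a design assumption that the text does not specify.

```python
import re
from dataclasses import dataclass

# Keywords named in the text for the built environment.
BUILT_ENVIRONMENT = ("house", "dust", "building", "gypsum")


@dataclass
class Record:
    accession: str
    study_title: str  # title of the underlying scientific study
    features: str     # text of the FEATURES field of the INSDC/UNITE entry


def mentions_keyword(record, keywords):
    """True if any keyword starts a word in the study title or FEATURES text."""
    text = f"{record.study_title} {record.features}".lower()
    return any(
        re.search(rf"\b{re.escape(keyword.lower())}", text)
        for keyword in keywords
    )


def cluster_matches(records, keywords):
    """A compound cluster qualifies if at least one of its sequences matches."""
    return any(mentions_keyword(record, keywords) for record in records)
```

Prefix matching of this kind still admits false positives (for example, “housekeeping gene” would match “house”), so a production keyword list would likely need manual curation.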
We refer to this list as the “most wanted” fungi. That is not meant to suggest that these fungi are the ecologically or economically most important extant fungi. Indeed, we make no claim as to the importance of these fungi from whatever point of view. We do make a claim to their uniqueness though, because it is frustrating, in the year 2016, not to be able to assign a name to a fungal sequence even at the phylum level. When it comes to taxonomic discovery potential, we argue that these lineages definitely should be counted among the most interesting candidates. Even if we assume that some proportion of the present lineages in fact are technical artifacts or represent non-fungal organisms, it is reasonable to assume that some proportion of them indeed represent new or previously unsequenced lineages of fungi. None of them are at least 80% similar to sequences with richer taxonomic annotations; many are much more distant from known reference sequences than that. Common rules of thumb for ITS sequence similarity thresholds (
RHN acknowledges financial support from the Swedish Research Council for Environment, Agricultural Sciences, and Spatial Planning (FORMAS, 215-2011-498). Support from the Sloan Foundation is gratefully acknowledged. CW acknowledges a Marie Skłodowska-Curie postdoctoral grant (660122, CRYPTRANS).