Research Article
Corresponding author: R. Henrik Nilsson (henrik.nilsson@bioenv.gu.se). Academic editor: Thorsten Lumbsch
© 2022 Kessy Abarenkov, Erik Kristiansson, Martin Ryberg, Sandra Nogal-Prata, Daniela Gómez-Martínez, Katrin Stüer-Patowsky, Tobias Jansson, Sergei Põlme, Masoomeh Ghobad-Nejhad, Natàlia Corcoll, Ruud Scharn, Marisol Sánchez-García, Maryia Khomich, Christian Wurzbacher, R. Henrik Nilsson.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Abarenkov K, Kristiansson E, Ryberg M, Nogal-Prata S, Gómez-Martínez D, Stüer-Patowsky K, Jansson T, Põlme S, Ghobad-Nejhad M, Corcoll N, Scharn R, Sánchez-García M, Khomich M, Wurzbacher C, Nilsson RH (2022) The curse of the uncultured fungus. MycoKeys 86: 177-194. https://doi.org/10.3897/mycokeys.86.76053
The international DNA sequence databases abound in fungal sequences not annotated beyond the kingdom level, typically bearing names such as “uncultured fungus”. These sequences beget low-resolution mycological results and invite further deposition of similarly poorly annotated entries. What do these sequences represent? This study uses a 767,918-sequence corpus of public full-length fungal ITS sequences to estimate what proportion of the 95,055 “uncultured fungus” sequences represent truly unidentifiable fungal taxa – and what proportion would have been straightforward to annotate to some more meaningful taxonomic level at the time of sequence deposition. Our results suggest that more than 70% of these sequences would have been trivial to identify to at least the order/family level at the time of sequence deposition, hinting that factors other than poor availability of relevant reference sequences explain the low-resolution names. We speculate that researchers’ perceived lack of time and lack of insight into the ramifications of this problem are the main explanations for the low-resolution names. We were surprised to find that more than a fifth of these sequences seem to have been deposited by mycologists rather than researchers unfamiliar with the consequences of poorly annotated fungal sequences in molecular repositories. The proportion of these needlessly poorly annotated sequences does not decline over time, suggesting that this problem must not be left unchecked.
Data interoperability, data mining, DNA barcoding, scientific practice, species identification, taxonomic annotation
DNA sequencing enables researchers to explore environmental habitats such as soil, wood and water for fungal diversity. A common choice of genetic marker for such pursuits is the nuclear ribosomal internal transcribed spacer (ITS) region, the formal fungal barcode (
Roughly 42% (326,062) of the 767,918 full-length Sanger-derived fungal ITS sequences in the INSDC (November 2020) lack a full species name and 29% (95,055) of these are not annotated beyond the kingdom level (e.g. “uncultured fungus” from the environmental (ENV) sample division and “fungal sp.” from the plants and fungal (PLN) division;
Many of the present authors are curators of specific taxonomic groups in the UNITE database. In that role, we revisit our favourite fungal groups and multiple sequence alignments after each incremental update with new INSDC sequences. Unfortunately, we regularly find that previously tidy and well-annotated species hypotheses have been watered down by tens to hundreds of sequences of the “uncultured fungus” kind (Figure
A screenshot from species hypothesis SH1159264.08FU (Vishniacozyma victoriae; https://dx.doi.org/10.15156/BIO/SH1159264.08FU) in UNITE. Identifying a Vishniacozyma victoriae ITS sequence to at least the genus level is trivial, yet the screenshot hints at the swathes of kingdom level-annotated Vishniacozyma victoriae sequences regularly deposited in the INSDC. SequenceID – INSDC accession number. UNITE taxon name – taxonomic annotation in UNITE. INSD taxon name – original taxonomic annotation in INSDC. RefSeq – indicates a type-derived sequence. More than thirty studies have deposited kingdom-level annotations in this species hypothesis. The ones shown primarily stem from
We targeted all 767,918 full-length, Sanger-derived fungal ITS sequences (annotated as such) in the INSDC (November 2020) as mirrored in the UNITE species hypotheses release 8. For each such sequence, UNITE extracts and stores relevant metadata from the GenBank flat file format (https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html). Sequence quality control is part of the species hypotheses generation and seeks to exclude clear cases of, for example, chimeras and low read quality sequences through tools, such as USEARCH (
UNITE uses the NCBI Taxonomy classification (
The fact that UNITE stores the INSDC initial release date for each sequence allowed us to build a map of what sequences were available in INSDC at any time. We wanted to capture what we feel are the two most common scenarios of INSDC sequence deposition, namely: (i) a user deposits sequences for immediate release and (ii) a user deposits sequences for release, pending acceptance of the underlying manuscript. Thus, for each sequence A that was only annotated at the kingdom level, we considered all sequences that were released at least seven days before A as being available for BLAST searches by the authors of A. This leaves room for the authors of A to have done a final double check of the taxonomic affiliation of their soon-to-be-released sequences, including A, prior to setting them free.
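The seven-day availability rule can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the pipeline used in the study: the `Record` type, its field names and the example accessions are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


# Hypothetical record type; the field names are illustrative, not UNITE's schema.
@dataclass(frozen=True)
class Record:
    accession: str
    release_date: date  # INSDC initial release date


def available_before(query: Record, corpus: list[Record], lag_days: int = 7) -> list[Record]:
    """Return the records released at least `lag_days` days before `query`."""
    cutoff = query.release_date - timedelta(days=lag_days)
    return [
        r for r in corpus
        if r.accession != query.accession and r.release_date <= cutoff
    ]


query = Record("A", date(2015, 6, 15))
corpus = [
    Record("OLD", date(2015, 6, 1)),    # 14 days earlier: available
    Record("EDGE", date(2015, 6, 8)),   # exactly 7 days earlier: available
    Record("LATE", date(2015, 6, 10)),  # 5 days earlier: not yet available
    query,
]
visible = [r.accession for r in available_before(query, corpus)]
```

Treating a sequence released exactly seven days earlier as available follows the "at least seven days" wording; a stricter cut-off would only require changing the comparison.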
We sought to recreate what such a BLAST search would have looked like to the authors of A with respect to closely matching (≥ 97% similarity) sequences (the topmost, high-scoring sequences in a BLAST hit list), as well as sequences that produced reasonable (≥ 80% similarity), but not top-scoring, matches to A. This captures our experience of BLAST – most users, it seems to us, do not bother looking beyond the first ~20 BLAST matches for clues to the taxonomic affiliation of a query sequence. For this “closely matching sequences” dataset, we examined the 3.0% species hypothesis of each kingdom-level sequence for the presence of sequences at least 7 days older than the kingdom-level sequence. Any such sequences were examined for their INSDC taxonomic annotation from kingdom to the species level. This allowed us to build a view of what the author of the kingdom-level sequence would have seen, had they done a BLAST search prior to the release of the kingdom-level sequence. For the “reasonable, but not top-scoring, matches” dataset, we, instead, considered the (≥ 80% similarity) compound cluster where each kingdom-level sequence was found. This allowed us to model the scenario where the kingdom-level sequence authors progressed further down in the BLAST hit list for taxonomic clues, plus the scenario where there were no close BLAST matches to begin with.
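The two-tier lookup described above might be sketched as follows. The sketch rests on several assumptions: cluster membership (3.0% species hypothesis and compound cluster) is taken as already computed, "annotated beyond the kingdom level" is reduced to a non-empty phylum field, and the `Seq` type, function names and labels are hypothetical rather than taken from the study.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


# Hypothetical sequence record; phylum=None stands for a kingdom-level annotation.
@dataclass(frozen=True)
class Seq:
    accession: str
    release_date: date
    phylum: Optional[str]


def has_older_annotated(query: Seq, cluster: list[Seq], lag_days: int = 7) -> bool:
    """True if the cluster holds a better-annotated sequence released >= lag_days earlier."""
    cutoff = query.release_date - timedelta(days=lag_days)
    return any(
        s.accession != query.accession
        and s.phylum is not None
        and s.release_date <= cutoff
        for s in cluster
    )


def classify(query: Seq, sh_members: list[Seq], compound_members: list[Seq]) -> str:
    """Check the species hypothesis first, then the wider compound cluster."""
    if has_older_annotated(query, sh_members):
        return "false negative (close match)"
    if has_older_annotated(query, compound_members):
        return "false negative (reasonable match)"
    return "true positive"


q1 = Seq("Q1", date(2018, 3, 20), None)
sh1 = [q1, Seq("S1", date(2018, 3, 1), "Basidiomycota")]
label_close = classify(q1, sh1, sh1)

q2 = Seq("Q2", date(2018, 3, 20), None)
sh2 = [q2, Seq("S2", date(2018, 3, 18), "Ascomycota")]  # released only 2 days earlier
compound2 = sh2 + [Seq("C2", date(2016, 5, 5), "Ascomycota")]
label_reasonable = classify(q2, sh2, compound2)
label_true = classify(q2, sh2, sh2)
```

Checking the species hypothesis before the compound cluster mirrors a user who reads the top of the BLAST hit list first and only then scrolls further down.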
We examined the GenBank FEATURES field for information on the country of collection of each sequence to get a feeling for whether kingdom-level sequences and the sequences annotated beyond the kingdom level stemmed from dramatically different sampling areas. Some two percent (2,049) of the sequences annotated only at the kingdom level (e.g. “uncultured fungus”) were found to initially lack an explicit country of collection, yet stem from a published or otherwise available (e.g. a pre-print) study (as opposed to being a “Direct submission” or an “Unpublished” INSDC submission). Similarly, some 7% (48,540) of the sequences with at least a phylum-level annotation (e.g. “Ascomycota sp.” and “Rhizoplaca sp.”) were found to lack an explicit country of collection, but to stem from a published or otherwise available study. These sequences offer some hope of restoration of the missing country of collection through recourse to the presumed underlying publication; sequences merely listed as “Direct submission” or “Unpublished” do not, in our experience (e.g.
When the GenBank REFERENCE field specified a scientific journal, we used the journal name as a proxy for whether the author(s) of each sequence were mycologists or not. We made the admittedly crude assumptions that a mycologist is someone who publishes in a mycological journal and that only mycologists publish papers in mycological journals, and we considered only the 29 journals listed under “Mycology” in Web of Science (November 2020; Suppl. material
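The journal-name proxy could be sketched as below. This is an assumption-laden illustration: the journal name is assumed to have been extracted from the REFERENCE field already, `MYCOLOGY_JOURNALS` is a placeholder set rather than the 29-journal list used in the study, and the function names are hypothetical.

```python
from typing import Optional

# Placeholder entries only; the actual 29-journal list is given in the
# supplementary material and is not reproduced here.
MYCOLOGY_JOURNALS = {"mycokeys", "mycologia", "fungal diversity"}

# Submission labels that carry no journal information.
NO_JOURNAL = {"", "unpublished", "direct submission"}


def normalise(name: str) -> str:
    """Lower-case the name, drop full stops and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def deposited_by_mycologist(journal: Optional[str]) -> Optional[bool]:
    """Return True/False for a named journal, None when no journal is available."""
    if journal is None:
        return None
    key = normalise(journal)
    if key in NO_JOURNAL:
        return None
    return key in MYCOLOGY_JOURNALS


in_list = deposited_by_mycologist("MycoKeys")
not_in_list = deposited_by_mycologist("Molecular Ecology")
undetermined = deposited_by_mycologist("Unpublished")
```

Returning `None` for unpublished and direct submissions keeps those sequences out of both the numerator and the denominator instead of silently counting them as non-mycological.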
The year of deposition of each sequence was assessed to examine whether the proportion of kingdom-level INSDC depositions fluctuated over time (2001–2020).
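A per-year proportion of this kind can be computed with a simple tally. The sketch below is a hypothetical illustration on toy data; the input format, a sequence of `(year, flag)` pairs, is an assumption and not the study's data model.

```python
from collections import Counter
from typing import Iterable


def yearly_proportion(records: Iterable[tuple[int, bool]]) -> dict[int, float]:
    """Return, for each year, the share of records whose flag is True."""
    totals: Counter = Counter()
    flagged: Counter = Counter()
    for year, flag in records:
        totals[year] += 1
        if flag:
            flagged[year] += 1
    return {year: flagged[year] / totals[year] for year in sorted(totals)}


# Toy data: (deposition year, sequence is flagged).
toy = [(2001, True), (2001, False), (2002, True), (2002, True)]
trend = yearly_proportion(toy)
```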
Regarding our attempt to mimic BLAST users who only consider matches with very high match scores, we found that a full 68,929 (73%) of the 95,055 sequences annotated only at the kingdom level (Fungi) were false negatives (Figure
Pie chart representing all the 95,055 kingdom-level ITS sequences and the proportion of these that were true-positives (had no or only very distant taxonomically more well-annotated BLAST matches at the time of sequence deposition/release; red, 10%), false-negatives (had only reasonable matches; green, 17%) and false-negatives (had close matches; blue, 73%). The chart suggests that nearly all kingdom-level fungal ITS sequences in INSDC could have been given a more taxonomically-resolved name at the time of sequence deposition/release.
If we include the “reasonable, but not top-scoring, matches” from the corresponding compound cluster (i.e. sequences that would have appeared further down in the BLAST hit list) in these statistics, we found that 85,093 (90%) of the 95,055 sequences annotated only at the kingdom level were false negatives (Figure
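The three shares quoted in this section follow from the reported counts by simple subtraction. The snippet below only re-derives them as a worked check; the variable names are ours and the counts are those given in the text.

```python
total = 95_055                # kingdom-level sequences
close = 68_929                # close matches available (species hypothesis)
close_or_reasonable = 85_093  # close or reasonable matches available

reasonable_only = close_or_reasonable - close  # reasonable matches only
true_positive = total - close_or_reasonable    # no informative match at all


def pct(n: int) -> int:
    """Share of all kingdom-level sequences, rounded to a whole percent."""
    return round(100 * n / total)


shares = {
    "close": pct(close),
    "reasonable_only": pct(reasonable_only),
    "true_positive": pct(true_positive),
}
```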
Initially, 2,049 (2.2%) of the publication-associated kingdom-level sequences were found to lack information on country of collection. The corresponding number was 7% (48,540) for the publication-associated sequences with at least a phylum-level annotation. We were able to restore the country of collection for 1,983 (96.8%) of these kingdom-level sequences and 1,812 (89.3%) of these phylum-level sequences. The newly-obtained countries of collection were deposited in UNITE for each sequence to facilitate further mycological enterprises by UNITE users. Figure
The top 15 most common countries of collection for the publication-associated sequences annotated at or beyond the phylum level (green) expressed as the proportion of the sequences stemming from each country out of all phylum-level-and-beyond sequences. The corresponding country for publication-associated sequences annotated only at the kingdom level (orange) is similarly expressed as the proportion of sequences stemming from that country out of all kingdom-level sequences. The figure is ordered in decreasing order by the country of collection for the phylum-level sequences.
For the “closely matching sequences” scenario, we found that 22% (21,205) of the full INSDC set of kingdom-level sequences, for which a more resolved name would have been only a BLAST search away, were generated by mycologists (following our admittedly crude definition of a mycologist). When, instead, considering the fully identified sequences, 182,402 (27.1%) were deposited by mycologists. The proportion of false-negative INSDC depositions does not decline over time (Figure
The proportion of false-negative sequences (had reasonable matches; green) and false-negative sequences (had close matches; blue) out of all kingdom-level sequences over time (2001-2020). The figure suggests that the act of taking sequence annotation very lightly is not in an abating trend. The data for 2020 extend through early November 2020 and are thus partial.
The present paper examines the corpus of reasonably full-length public fungal ITS sequences not annotated to any meaningful taxonomic level. We found that our initial, UNITE curation-based hunches were largely right: reasons other than lack of established taxonomy and available reference sequences lie behind the lack of resolved taxonomic annotations for the overwhelming majority of these sequences. A full 12% of the 767,918 sequences in our dataset were annotated only to kingdom level – and in at least 73% of these cases seemingly without clear justification. In fact, for 64% of these sequences, an annotation to at least the genus level seems to have been possible and only a BLAST search away at the time of sequence deposition/release. The tendency of researchers not to name fungal sequences beyond the kingdom level, even when this would have been perfectly possible, does not seem to go down over time (Figure
It would somehow have been nice to conclude that mycology is the victim of the decisions of non-mycologist researchers: only non-mycologists are behind the countless “uncultured fungus” depositions. Our results are not in line with this though; mycologists seem to be behind more than one fifth of these sequences. We find this remarkable, considering that mycology is often touted as an overlooked and easily dismissed discipline (
Our results make it painfully clear that human nature, rather than lack of taxonomic information and resolution, is the cause of the lion’s share of the kingdom-level annotations. Indeed, more than 70% of the kingdom-level sequences belong to lineages for which an established Latin name – and at least one reference sequence annotated accordingly – were readily available at the time of sequence deposition. This raises the question of why those sequence authors did not go looking for that information to begin with. One can think of many answers: lack of mycological or bioinformatics expertise, lack of money/time, a research focus other than taxonomy, wanting journal policies on metadata richness and availability and, indeed, lack of a perceived good reason to take the time to do it in the first place. All those reasons can be countered one way or the other. For instance, any environmental sequencing effort likely to unravel fungi – although they may not target fungi or taxonomic aspects specifically – should always include a mycologist as well as a bioinformatician to maximise resolution in the analysis, but also the data deposition step. Grant applications should be written in such a way as to provide sufficient time and resources for reproducible down-the-road data handling and not just the field and sequencing expenses. Similarly, journal policies on data availability should ideally be extended – and enforced – to also include aspects of data annotation and re-usability, perhaps to the extent that any pending INSDC entries to be released upon publication of the study must be submitted to the journal for review alongside the other manuscript files. Above all, individual research efforts should be seen not only as a way to increase the length of one’s CV and to meet promises to funding agencies, but also as a contribution to the ever-growing corpus of scientific – mycological – knowledge.
In fact, we speculate that this last issue is the main reason behind the findings of the present study. Researchers do not perceive their sequence data as atomised contributions to science and, thus, fail to take the steps that would have enabled meaningful use of those sequences beyond the study at hand.
The present results dispel the assertion that only mycologists are in a position to add to our growing knowledge of the fungal kingdom. This, in turn, suggests that mycologists should make it as easy as possible for anyone to make use of, but also add to, the corpus of mycological data. After all, DNA sequences form a key component of contemporary mycology (
It is painful to come across sequences that are annotated as “uncultured fungus” or “fungal sp.” in INSDC, but that are deeply nested (and sometimes even well annotated) in well-supported clades in phylogenetic trees of, for example, Fusarium, Helotiales, and Lactarius in the associated publications. The present study argues that taxonomic annotations of the “uncultured fungus” kind should be reserved for cases where taxonomic annotation beyond the kingdom level was attempted, but came up short. Then users would know that each such sequence carries a non-trivial potential for taxonomic discovery – you could even argue that such sequences would be amongst the most interesting and exciting of all fungal sequences. Right now, however, the “uncultured fungus” label is used as a catch-all device whose routine use serves to mask the presence of truly unidentifiable fungi. Many researchers seem to shun unidentified sequences also in situations where these sequences clearly should have been considered (
Phylogenetic analysis is probably the most robust way to assess the taxonomic affiliation of sequences and hence to annotate sequences. However, we acknowledge that not all studies use phylogenetic approaches to begin with and that phylogenetic analysis may not be applicable in all situations. Fortunately, similarity-based searches, such as BLAST in INSDC, will take you a long way. By ticking the GenBank-BLAST box “Exclude: Uncultured/environmental sample sequences”, a more taxonomy-orientated picture is likely to emerge. We feel that a sequence that produces a long list of, say, robust Fusarium matches – when both BLAST coverage and similarity are considered closely (
We would like to stress that annotating sequences is always a balance between under- and over-annotation. There is no shortage of incorrectly annotated fungal sequences in the public repositories (
Tedersoo et al. (2014) do not specify when a sequence should be annotated at the species level; indeed, sequences were not annotated at the species level in that study. We agree with this move and we personally do not annotate newly-generated environmental sequences to the species level other than in very rare and particularly unequivocal cases. After all, there are many examples of clearly distinct species that have identical ITS sequences (
The present study should be viewed as a rough estimation of the reasons why we keep seeing INSDC submissions of the “uncultured fungus” kind. Many aspects of the present study are clearly hard to algorithmise. For instance, in our mimicking of BLAST searches, we used default BLAST settings and a single version of BLAST. However, the BLAST output may have looked somewhat different to a user with non-default parameter values or another version of BLAST. It is, furthermore, difficult to model human behaviour when it comes to processing and interpreting BLAST hit lists. One can also think of cases where the sequence authors did, in fact, do BLAST searches, but were presented with contradictory information: “Ascomycota sp.” and “Basidiomycota sp.”. In our experience, it is often easy to single out and resolve many misannotated sequences, based on the annotations of the other relevant BLAST hits – a single Lactarius (Basidiomycota) annotation in a large group of Fusarium (Ascomycota), for instance – but we can certainly see why some users would feel uncomfortable doing this. The magnitude of this problem appears limited, as 0.5% of the SHs and 1.8% of the compound clusters contained annotation conflicts at the phylum level. Complications such as these, nevertheless, suggest that our estimate that more than 70% of the kingdom-level annotations are false negatives may be off by several percentage points. That said, many of our parameter settings – such as the permissive single-linkage clustering underlying the SH generation – were deliberately set to be very forgiving. We, therefore, argue that at least the order of magnitude of our estimate is reasonable. Our estimate is, furthermore, in line with our admittedly basidiomycete-centric experience of UNITE sequence curation.
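A phylum-level conflict check of the kind mentioned above might look as follows. This is a hypothetical sketch on toy data: clusters are represented as lists of phylum names, with `None` standing for a kingdom-level annotation, and neither the function names nor the data layout come from the study.

```python
from typing import Optional


def has_phylum_conflict(phyla: list[Optional[str]]) -> bool:
    """True if the cluster members carry more than one distinct phylum name."""
    named = {p for p in phyla if p}  # ignore kingdom-level (None) entries
    return len(named) > 1


def conflict_rate(clusters: dict[str, list[Optional[str]]]) -> float:
    """Share of clusters with at least one phylum-level conflict."""
    if not clusters:
        return 0.0
    conflicting = sum(has_phylum_conflict(members) for members in clusters.values())
    return conflicting / len(clusters)


# Toy clusters keyed by made-up identifiers.
toy_clusters = {
    "c1": ["Ascomycota", "Ascomycota", None],
    "c2": ["Ascomycota", "Basidiomycota"],  # conflicting annotations
    "c3": [None, None],
    "c4": ["Basidiomycota"],
}
rate = conflict_rate(toy_clusters)
```

Kingdom-level entries are ignored in the check, so a cluster holding one phylum name plus any number of "uncultured fungus" sequences does not count as conflicting.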
The scoring of sequence authors as mycologists or non-mycologists, based on the journal of the underlying publication, is clearly a move that will prove to be wrong in many cases. We are well aware – and welcome – that non-mycologists also publish their findings in mycological journals. Conversely, mycologists often – and rightfully – seek to publish their findings beyond mycological journals. Finally, Web of Science is not an ideal arbiter of what is mycology and what is not, given that there are many mycological journals that do not yet have a formal impact factor. Thus, while we agree that these shortcomings haunt our estimate that 22.3% of the kingdom-level sequences were submitted by mycologists, it is not immediately clear whether our estimate is biased towards, or away from, mycologists. Our estimate is clearly so high that it would be counter-intuitive to argue that only non-mycologists are behind it.
The study of fungi is being reshaped by the many novel and hitherto nameless fungal lineages unearthed by environmental sequencing efforts (
The work of KA was supported by the Estonian Research Council grant (PRG1170). KSP and CW gratefully acknowledge funding from the German Research Foundation (DFG: WU890/2–1). NC gratefully acknowledges funding from the Swedish Research Council Formas (project HerbEvol grant No. 2015–1464). SNP gratefully acknowledges funding from the Spanish Ministry of Economy and Competitiveness (CGL2015–67459–P and BES–2016–077793). MG gratefully acknowledges funding from the International Association for Plant Taxonomy (IAPT: 2021).
A list of the 29 journals under the Web of Science heading “Mycology” as of November 2020
Data type: Text