Introduction
The built environment presents dry, harsh conditions for fungal life, and traditional estimates of “indoor” fungi run in the low hundreds (Flannigan et al. 2011; Khan and Karuppayil 2012). General taxonomic progress and studies based on high-throughput sequencing (HTS) of amplicons are changing our view of the built environment as a biologically depauperate habitat. In a global study of indoor dust, Amend et al. (2010) found no less than 4,473 approximately species-level fungal operational taxonomic units (OTUs; Blaxter et al. 2005) distributed across more than 20 different fungal orders. Similarly, Adams et al. (2013) found 986 fungal OTUs from a homogeneous set of houses in a California family housing complex. The majority may represent outdoor fungi that drifted indoors as dead or dormant stages such as hyphal fragments, spores, or other propagules, but these stages must be considered as at least one aspect of the built mycobiome. Furthermore, niches not typically considered in indoor surveys, such as house plants (including Christmas trees), may harbour groups of fungi such as endophytes or mycorrhizae that are not typically considered part of the indoor mycobiome but would be detected by sensitive molecular techniques. Understanding the built mycobiome, therefore, becomes a matter of understanding a much larger number of species than those traditionally considered to form the core indoor fungi. Similarly, many of the common indoor species have been divided into numerous new species, thereby increasing the number of indoor species considerably (e.g., Aspergillus versicolor divided into ten new species (Jurjević et al. 2012), Penicillium chrysogenum into five species (Houbraken et al. 2012), and Wallemia sebi into four species (Nguyen et al. 2015, Jančič et al. 2015)).
There is good reason to study the built mycobiome and the built microbiome at large (Nevalainen et al. 2015; Stamper et al. 2016; Adams et al. 2016). In damp dwellings an increased risk for health problems is apparent from epidemiological studies (WHO 2009). Indoor fungi and fungal particles are linked to a range of medical conditions, including asthma onset, allergies, and fatigue (Norbäck et al. 2016; Tischer et al. 2016). Fungi are a serious cause of decay of building materials, including recently introduced components such as composite wood products and various types of wall board, in the presence of sufficient moisture (Mensah-Attipoe et al. 2016; Meyer et al. 2016). Food spoilage or biodeterioration of textiles or other objects used for clothing, furniture, or carpeting in homes, hospitals, factories, and agricultural settings are also of concern (Benedict et al. 2016; Cardinale et al. 2017; Garnier et al. 2017). This puts fungi on the research agenda for a range of scientific disciplines in addition to traditional mycology, which increases the pressure on mycology to produce data, results, and resources that are straightforward to apply for mycologists and non-mycologists alike. Most mycologists would presumably agree that you should not have to be a taxonomist or even a mycologist to be able to identity fungal DNA sequences from the built environment to a meaningful taxonomic level, such as genus or preferably species. Similarly, it should be straightforward also for non-mycologists to retrieve all public fungal DNA sequences collected on, say, interior walls or floors for further study. In reality, neither of these possibilities is feasible (cf. Abarenkov et al. 2016).
Molecular identification of fungi is largely centred on the nuclear ribosomal internal transcribed spacer (ITS) region, which is the formal fungal DNA barcoding marker (Schoch et al. 2012). However, a number of problems beset ITS-based molecular identification of fungi. To begin with, reference ITS sequences are available for less than 1% of the estimated 6 million extant species of fungi (Blackwell 2011; Taylor et al. 2014). This is coupled with the fact that some 50% of the ~750,000 fungal ITS sequences in the International Nucleotide Sequence Database Collaboration (INSDC: GenBank, ENA, and DDBJ; Cochrane et al. 2016) are not identified to species level, and of the sequences that do have a full Latin binomial, more than 10% may have names that are incorrect (Nilsson et al. 2012). On top of that, technical artefacts such as chimeric unions, poor sequence trimming, and low read quality are common (Kang et al. 2010; Hyde et al. 2013; Nilsson et al. 2017). Finally, many researchers are in the habit of submitting their sequences with very little associated metadata (such as country and substrate or host of collection), leaving more than 50% of the fungal ITS entries in the INSDC more or less non-attributable (Tedersoo et al. 2011). Taken together, these issues often make informed molecular identification of fungi difficult even for well-trained mycologists. It is therefore no surprise that non-mycologists struggle even more.
The UNITE database for molecular identification of fungi (https://unite.ut.ee/; Kõljalg et al. 2013) was designed to overcome these complications, with the ultimate purpose to offer robust and reproducible identification and reference to all species of fungi, whether or not formally described. UNITE draws from the public fungal ITS sequences in the INSDC and is centred on the concept of species hypotheses (SHs), which are approximately species-level OTUs derived from sequence clustering (Kõljalg et al. 2013). All SHs have unique, individual digital object identifiers (DOIs; https://www.doi.org/) for unambiguous reference across time and scientific studies (e.g., 10.15156/BIO/SH216455.07FU which resolves to http://dx.doi.org/10.15156/BIO/SH216455.07FU). UNITE recently received an Alfred P. Sloan Foundation grant to improve the support within the database for handling, processing, and characterising the mycobiome from the built environment (the built mycobiome). Several actions have been taken towards that goal, including a workshop to annotate all extant built-environment fungal ITS sequences according to the MIxS-BE annotation standard (Glass et al. 2014; Abarenkov et al. 2016). The present paper reports on the outcome of a taxonomic annotation workshop that specifically addressed fungal taxa and sequences from the built environment from taxonomic and nomenclatural points of view.
The workshop was held at the James Hutton Institute/University of Aberdeen on April 10–11 2017, and comprised 19 in situ and 12 remote participants. Fifteen of the participants had a taxonomic background and were tasked with assessing the public fungal ITS sequences and SHs within their respective expertise area in relation to assigned names, nomenclature, and recent taxonomic progress. Four of the participants had a general background in built-environment mycology and were asked to annotate recent sequences from the built environment according to the MIxS-BE standard. Nine participants had a background in other fields of mycology and were asked to harvest missing sequence metadata from the literature for fungal groups relevant to the built environment. Finally, four participants had a background in bioinformatics and were asked to process the corpus of public fungal ITS sequences from a technical point of view. All participants operated under the expectation that their contribution should meet the highest of quality standards, and that their work would be incorporated in UNITE, adopted by the downstream resources that make use of UNITE data (see, e.g., https://unite.ut.ee/repository.php), and shared with the INSDC and the recently established ISHAM database, which is a comprehensive, expertly curated ITS database of clinically important fungal pathogens (Irinyi et al. 2015).
Materials and methods
Taxonomic annotation of fungi related to the built environment
The participants examined the public sequences from their respective fungal groups of expertise from nomenclatural and taxonomic points of view through the PlutoF workbench of UNITE (Abarenkov et al. 2010). Sequences were given (or re-annotated with) names that reflected the level at which the taxonomist felt comfortable providing a name. Thus, some sequences were demoted from species level to genus level, some sequences were promoted from kingdom-level (“Uncultured fungus”) to order level (“Dothideales”), and so on. Other sequences were re-named to account for, e.g., recent synonymies and merger of anamorphic and teleomorphic names. From a taxonomic standpoint, reference sequences for individual SHs were designated at the similarity level (97–100% similarity) at which the application made the most taxonomic sense. Inter-specific divergence is known to be very low or even non-existent for the ITS region of certain species complexes or genera, for example parts of Penicillium, Fusarium, Aspergillus, and Talaromyces (Visagie et al. 2014; O’Donnell et al. 2015; Visagie et al. 2017; Yilmaz et al. 2014), and the participants sought to represent the species level as closely as possible for each SH, at times drawing from information from other genetic markers. The participants used recent publications, Index Fungorum (http://www.indexfungorum.org/), MycoBank (http://www.mycobank.org/), and other resources in this pursuit. In addition, all type-derived sequences from the NCBI RefSeq Targeted Loci Project (https://www.ncbi.nlm.nih.gov/refseq/targetedloci/; O’Leary et al. 2015) were considered during the workshop, and were designated as reference sequences for the corresponding SHs whenever possible.
Annotation of built-environment sequences according to the MIxS-BE standard
Abarenkov et al. (2016) annotated all published, public fungal ITS sequences from the built environment – identifiable as such – according to one or more aspect of the MIxS-BE annotation standard. During the ~12 months that had elapsed after the Abarenkov et al. (2016) workshop, 924 new Sanger-derived fungal ITS sequences from the built environment had become available in the INSDC. Those 924 sequences, in so far as they corresponded to a formally published (or otherwise available) study, were examined and annotated according to the MIxS-BE standard following Abarenkov et al. (2016).
Metadata assembly for sequences from species with a relation to the built environment
In UNITE, all sequences that are at least 80% similar are grouped into compound clusters, which are further clustered into SHs (Kõljalg et al. 2013). SHs that contain at least one sequence recovered from the built environment, regardless of whether or not the underlying species is known as a “traditional” indoor fungus, were considered for this task. Those SHs were examined for sequences that lacked explicit specification of both country of collection and host of collection. UNITE was found to contain more than 5,000 such sequences, and the workshop participants sought to reduce this number by applying country and/or host of collection to these entries through scrutiny of the underlying scientific publications (as available) or other online resources. These basic metadata were restored with the hope that the amalgamated information would assist present and future researchers in the interpretation of the biology of fungi with a relation to the built environment. Country names were specified according to the ISO 3166 standard. Hosts were specified by Latin names following the PlutoF consensus classification (Abarenkov et al. 2010).
Analysis of sequences from a technical, quality-related point of view
Several of the workshop participants had a background in bioinformatics and focused on quality-related aspects of public fungal ITS sequences with and without a direct relation to the built environment. Chimera control was done following Nilsson et al. (2015, 2016), and sequence trimming/read quality was examined following Hyde et al. (2013) and Nilsson et al. (2017).
Discussion
Jointly the workshop participants implemented a total of 19,508 changes in UNITE (Tables 2–3). Some 27% (5,288) were taxonomically-related in the sense of giving sequences correct names or designating reference sequences (and their similarity threshold of application) for SHs. Although these numbers may sound impressive, several participants reported that they were unable to finish the annotation of their genera of expertise. In many cases, these genera contained dozens to hundreds of species, highlighting the very substantial amount of time required to process them. Indeed, for the larger genera, no single taxonomic expert can be expected to know all species equally well, and no single researcher can be expected to oversee the annotation of the entire genus. Input from different researchers is clearly needed to process large genera such as Penicillium, Aspergillus, and Fusarium. The workshop participants were asked to focus on the species they knew well, and this report makes no claim of exhaustiveness regarding the genera covered during the workshop or fungi in the built environment in general.
Several participants expressed frustration over the fact that numerous scientific studies were found to have released hundreds of sequences identified only as “Uncultured fungus” (or similar) even when a more informative name would be only seconds away through, e.g., a BLAST search (Altschul et al. 1997). The two presumed main reasons why a researcher would deposit a sequence under the name “Uncultured fungus” would be lack of time to investigate the taxonomic affiliation of the sequence prior to deposition, and concerns about providing a Latin name that would later turn out to be incorrect. As far as fungal sequences go, “Uncultured fungus” will always be a correct – albeit very uninformative – name, and while we agree that it is an error-proof way of giving names to sequences, it also introduces uncertainty, especially for least recent ancestor analyses, and serves to mask fully identified reference sequences, much to the damage of molecular identification of fungi. Consider the case of a cloning-based study that gives rise to, say, 25 near-identical public sequences called “Uncultured fungus”. Another researcher happens to have the same species in a sample, generates an ITS sequence from that species, and uses that sequence for a BLAST search. Even if, say, two highly similar, non-cloning-based sequences with a full species name were available, the resulting BLAST output would be confusing (perhaps starting with 10–25 “Uncultured fungus” sequences as top matches). No wonder, then, that such a user might adopt the name “Uncultured fungus” for the new sequence, with the effect that uncertainty and mistakes will persist and may even be amplified over time (Gilks et al. 2002). The subsequent scientific study of that researcher would be plagued by yet another OTU not assigned beyond the kingdom level, needlessly depriving the study of fungi and fungal communities of much-needed resolution. We advocate that sequence depositors in the INSDC try to go beyond the kingdom level when assigning names to their newly generated sequences, at least for straightforward cases. This task should be undertaken by somebody with significant understanding of fungal systematics and sequence analysis. Instead of arguing that the taxonomic expertise to make such taxonomic calls were not available to the project team, researchers should plan their projects to include sufficient taxonomic expertise that the process of making such calls is feasible. We furthermore ask journal editors and reviewers to set high standards regarding the taxonomic annotations in any manuscript they handle.
Another issue that surfaced repeatedly during the workshop was the occurrence of legacy names, some of them downright outrageously outdated, and other obsolete data. In one case, a name that was synonymized more than 20 years ago was found. We take this to indicate that many researchers do not feel a personal responsibility for their INSDC submissions once those have become a part of the public corpus. However, this view goes against the INSDC policies (https://www.ncbi.nlm.nih.gov/genbank/submit/), which make it clear that sequence authors should approach the INSDC whenever additional explanatory information pertinent to their entries becomes available. Major changes to INSDC entries, such as changes in species names or the very sequence data, will also reach UNITE automatically. We hope that this workshop will serve as a general call to taxonomists and other researchers to revisit their previous INSDC submissions to see if they can be updated or if additional data can be provided. At an altruistic level, any such additional data are likely to move the study of fungi forward – in whatever context they are found – which should be at the heart of every mycologist. At a more personal level, researchers who ensure that “their” group of fungi are properly annotated in the public sequence databases, will soon start to see additional sequences for “their” fungi being identified and deposited by other researchers. This should translate into new opportunities for knowledge expansion and scientific collaboration, to the benefit of the initial researcher and, ultimately, everyone else.
The workshop also identified several shortcomings and avenues for improvement of the UNITE database. For example, recent taxonomic progress in fungi traditionally classified in the polyphyletic genera Candida, Cryptococcus, and Rhodotorula resulted in the recognition of a number of new genera and species names (e.g., Daniel et al. 2014; Kurtzman 2014; Liu et al. 2015; Wang et al. 2015). These changes necessitate the renaming of hundreds to thousands of sequences at, typically, the genus level. At present, UNITE has no software support for batch renaming of sequences at the SH or genus levels, suggesting an urgent need for improvement of UNITE. Similarly, several workshop participants expressed the need to name sequences according to publications that were “in press” or that were published just weeks or a month ago. UNITE, and to some extent Index Fungorum/MycoBank, operate at longer time scales than weeks and will not always have the very latest information at hand. In a digital age where information is created and disseminated more or less simultaneously, updating information on a bimonthly basis is no longer sufficient. This observation puts pressure on UNITE to improve the frequency at which information is exchanged with Index Fungorum and other databases. A significant amount of work remains in terms of information exchange policies across databases such as the INDSC, UNITE, Index Fungorum, MycoBank, and ISHAM-ITS. A change implemented in one of these databases does not necessarily reach the others. As one participant pointed out, it is frustrating enough to provide updates and corrections once for other researchers’ data. To have to do it twice, to two different repositories at that, is disheartening – and ultimately unlikely to occur. A solution to this problem would be the establishment of an integrated cloud-based dynamic database network that would allow an instantaneous update in all relevant databases. Towards that end, all changes implemented during this workshop were shared with the INSDC and ISHAM-ITS, and several updates were sent to Index Fungorum.
In conclusion, the present workshop implemented a total of 19,508 changes in UNITE relating to fungi in the built environment. This will undoubtedly improve the taxonomic resolution in studies of the built, as well as many other, mycobiomes. Although truly uncharacterized lineages of fungi are repeatedly found in the built environment (e.g., Nilsson et al. 2016), in many cases it is more likely lack of input from the mycological community that is responsible for the low taxonomic resolution that haunts many molecular ecology studies of fungal communities. If all taxonomic experts were to look through and annotate fungi in their areas of expertise in the international nucleotide sequence databases, the problem would be greatly diminished. However, it is not just expert taxonomists who can make a difference – sequence authors, article co-authors, reviewers, and editors should make it a habit to insist that sequence data are annotated beyond the barest minimum and in compliance with recent taxonomic progress and relevant metadata standards. An increasing number of non-mycologists now sequence fungi and fungal communities as a part of their professional pursuits, and it would greatly benefit mycology if these non-mycologists could obtain unambiguous, correct, and reproducible results.