Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding

Sten Anslan; R. Henrik Nilsson; Christian Wurzbacher; Petr Baldrian; Leho Tedersoo; Mohammad Bahram

doi:10.3897/mycokeys.39.28109

Short Communication

Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding

Sten Anslan^‡, R. Henrik Nilsson^§, Christian Wurzbacher^|, Petr Baldrian^¶, Leho Tedersoo^#, Mohammad Bahram^¤#«

‡ Braunschweig University of Technology, Braunschweig, Germany

§ University of Gothenburg, Gothenburg, Sweden

| Technical University of Munich, Garching, Germany

¶ Institute of Microbiology of the Czech Academy of Sciences, Praha, Czech Republic

# Tartu University, Tartu, Estonia

¤ Swedish University of Agricultural Sciences, Uppsala, Sweden

« Uppsala University, Uppsala, Sweden

Corresponding author: Sten Anslan ( sten1987@gmail.com )

Corresponding author: Mohammad Bahram ( bahram@ut.ee )

Academic editor: Thorsten Lumbsch

This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Citation: Anslan S, Nilsson RH, Wurzbacher C, Baldrian P, Tedersoo L, Bahram M (2018) Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding. MycoKeys 39: 29-40. https://doi.org/10.3897/mycokeys.39.28109

Abstract

Along with recent developments in high-throughput sequencing (HTS) technologies and thus fast accumulation of HTS data, there has been a growing need and interest for developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of the application of different bioinformatics workflow on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, quality of error filtering and hence output of specific bioinformatics process largely depends on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units, although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.

Keywords

Microbial communities, microbiome, mycobiome, fungal biodiversity, metagenomics, amplicon sequencing

Introduction

Fungi are major ecological and functional players in terrestrial ecosystems. The full diversity of fungi remains largely uncharted due to their largely unculturable nature, the lack of tangible morphological manifestations and shortcomings of the mycological community to sample beyond traditional habitats and substrates (Grossart et al. 2016; Hibbett et al. 2017; Lücking et al. 2018). As a result, identification of fungi has come to rely mainly on direct DNA sequencing of material containing fungal hyphae or spores. In this regard, several DNA barcoding regions have been evaluated and the current consensus is that the nuclear ribosomal internal transcribed spacer (ITS) region is the best region for delimiting fungal taxa at the species level across a variety of fungal groups (Schoch et al. 2012). Recent advances in high-throughput sequencing (HTS) have made it possible to sequence millions of reads and identify thousands of fungal taxa from a single sample. Handling this enormous amount of data is often complicated and requires extensive bioinformatics expertise.

Multiple analysis platforms have been introduced to facilitate the bioinformatics treatment of HTS data. However, most of these software suites were developed for the prokaryotic 16S rRNA gene and may therefore perform poorly with other markers and other organisms, in particular ITS sequences due to their length variation and non-alignability across taxonomic expanses. To accommodate this, several tailored platforms have recently been developed to specifically address fungal ITS datasets (Anslan et al. 2017; Gweon et al. 2015; Hildebrand et al. 2014; Vetrovský et al. 2018). These platforms cover multiple steps of the analysis procedure, including quality control, clustering, taxonomic assignment and generating Operational Taxonomic Unit (OTU) abundance tables. Many of these platforms cover all these analysis steps, whereas others do not.

The application of different bioinformatics workflows may introduce variation in the data quality and output OTU tables (Majaneva et al. 2015; Sinha et al. 2017). However, to date, there are no data on the relative performance of the available tools for fungal HTS data analysis. In this study, we report on the relative performance of the most popular software pipelines on two contrasting HTS datasets.

Methods

Sequence data and general workflow

We compared the performance of bioinformatics analysis platforms on two fungal ITS datasets. Tested datasets included Illumina MiSeq paired-end ITS2 amplicons from arthropod substrates (Anslan et al. 2018) and full ITS circular consensus sequences from Pacific Biosciences (PacBio) Sequel machine, amplified from soil samples. PacBio data set is available through PlutoF database (Abarenkov et al. 2010b), https://plutof.ut.ee/#/datacite/10.15156%2FBIO%2F781236). For bioinformatics analyses, we used multiple platforms that support all steps in the analysis of HTS-based metabarcoding datasets: QIIME2 (v2018.2; Caporaso et al. 2010), LotuS (v1.59; Hildebrand et al. 2014), Galaxy (v.2.1.1; Afgan et al. 2016), PipeCraft (v1.0; Anslan et al. 2017) and PIPITS (v2.0; Gweon et al. 2015) (Table 1; Figure 1). Depending on the analysis platform, quality filtering was performed using either VSEARCH (Rognes et al. 2016), trimmomatic (Bolger et al. 2014), DADA2 (Callahan et al. 2016), sdm (Hildebrand et al. 2014) or fastx (http://hannonlab.cshl.edu/fastx_toolkit). Quality filtered sequences were passed through chimeric reads removal algorithms as implemented in USEARCH (Edgar 2013; Edgar et al. 2011) or VSEARCH. Using PipeCraft, LotuS and PIPITS, reads were also subjected to ITS extraction using ITSx (Bengtsson-Palme et al. 2013) to remove conservative flanking genes of the ITS region. OTU formation (clustering) was performed using USEARCH or VSEARCH as outlined below (Platform specific options). For each platform, we relied on de-novo single linkage clustering, which is the most popular approach in fungal community studies, knowing that reference-based clustering methods can provide similar results (Cline et al. 2017). Taxonomic affiliations were assigned to OTUs using DP Naive Bayesian rRNA Classifier (RDP classifier v2.11; Wang et al. 2007) with the Warcup Fungal ITS trainset 2 (confidence threshold: 80%; Deshpande et al. 2016) as well as BLAST+ (Camacho et al. 2009) search (e-value = 0.001, word size = 7, reward = 1, penalty = -1, gap open = 1, gap extend = 2) against the UNITE v7.2 reference database (Abarenkov et al. 2010a).

Table 1.

Download as

CSV

XLSX

Used software, sequence and OTU counts (values in bold) by a) Illumina and b) PacBio analysis platforms. The number of sequences denotes raw input reads and remaining reads after each analysis step. Singleton OTUs were excluded from the OTU counts.

a)	LotuS	Qiime2	PipeCraft	Galaxy	PIPITS
Raw reads	7,981,812a	7,335,838b	7,981,812a	7,981,812a	7 335 838b
Assembly	FLASH/ NA	DADA2/ NA	VSEARCH/ 7,511,274	FASTQ joiner/ 7,911,554	VSEARCH/ 7,198,094
Quality filtering	sdm/NA	DADA2/ 5,428,563	VSEARCH/ 7,511,274	trimmomatic/ 7,879,960	fastqx/ 7,142,354
Demultiplexing	sdm/ 6,727,631	NP	mothur/ 6,558,772	mothur/ 1,643,879	NP
Chimera filtering	USEARCH/ 6,486,802	NP	VSEARCH/ 6,300,085	VSEARCH/ 1,621,330	VSEARCH/ NA
ITS extractor	5,919,084	NP	6,262,000	NP	6,401,097
Clustering (OTUs)	UPARSE/ 8,659	VSEARCH/ 7,477	UPARSE/ 7,598	VSEARCH/ 23,167	VSEARCH/ 7,887
b)	LotusS	PipeCraft	Galaxy
CCSc reads	720,222a	720,222a	720,222a
Quality filtering	sdm/ NA	VSEARCH/ 462,010	trimmomatic/ 672,292
Demultiplexing	sdm/ 258,085	mothur/ 380,722	mothur/ 457,173
Chimera filtering	USEARCH/ 255,746	VSEARCH/ 341,154	VSEARCH/ 405,025
ITS extraction	192,485	338,150	NP
Clustering (OTUs)	UPARSE/ 942	UPARSE/ 4,176	VSEARCH/ 8,338

Figure 1.

Outline of workflow in different analysis pipelines.

Platform specific options

Using QIIME2, reads were assembled (Illumina data) and quality filtered using DADA2 (Callahan et al. 2016) with default options, except --p-trunc-len = 0, --p-max-ee = 1 and --p-chimera-method = none (with chimera-method = consensus, QIIME2 reported error for our data). Clustering was performed with VSEARCH cluster-features-de-novo (--p-perc-identity 0.97).

In LotuS pipline, data was assembled (Illumina data), quality filtered (minimum length = 170, minAvgQuality = 27, TruncateSequenceLength = 170, maxAccumulatedError = 0.75) and demultiplexed with sdm (pdiffs = 1, bdiffs = 1). Chimera filtering was undertaken using USEARCH de novo chimera filtering (abundance annotation = 0.97, abskew = 2) and USEARCH reference-based chimera filtering using UNITE v7.2 as reference database. Flanking genes of the ITS region were discarded using ITSx (v1.0.11; default options). ITS reads were clustered to OTUs with USEARCH/UPARSE algorithm (-id = 3, -minsize = 2).

Using web-based Galaxy pipeline, Illumina data were assembled with Fastq joiner (Galaxy Version 2.0.1.1; Blankenberg et al. 2010) with default options. Quality filtering was performed with Trimmomatic (Galaxy Version 0.36.3) ‒ SLIDINGWINDOW; number of bases to average across = 15, average quality required = 30, minimum length of kept reads = 45. Fastq files were converted to FASTA files using FASTQ to FASTA converter (Galaxy Version 1.0.0). Fasta files were demultiplexed using mothur (Galaxy Version 1.39.5.0; Schloss et al. 2009) ‒ pdiffs = 2, bdiffs = 1. As sequences were of mixed orientation in the files (5’-3’ and 3’-5’), the demultiplexing step was repeated for reverse orientated sequences (reads were reversed using mothur reverse.seqs). Chimera filtering was undertaken using VSEARCH chimera detection (Galaxy Version 1.9.7.0) with default settings (abundance annotation = 97% similarity threshold) and using the UNITE v7.2 database as reference. Clustering was performed using VSEARCH (--cluster-fast, --id 0.97, --iddef 1).

In PipeCraft, platform reads were assembled (Illumina data) and quality filtered using VSEARCH (minimum overlap = 15, minimum length = 100, E max = 1, max ambiguous = 0, allowstagger = T). Demultiplexing was undertaken using mothur (pdiffs = 2, bdiffs = 1). In this step, sequences are also re-orientated into the 5’-3’ orientation based on primers (2 mismatches allowed).

Chimeric sequences were removed using VSEARCH de novo (abundance annotation = 0.97, abskew = 2) and reference-based (UNITE v7.2 as reference) chimera filtering algorithms. In the chimera filtering step, the PipeCraft supported option for “primer artefact” removal was also used (sequences where primer strings were found in the middle of the sequence were removed). ITS reads were extracted using ITSx (default options). Clustering was performed using USEARCH/UPARSE algorithm (id = 3, minsize = 2).

Using PIPITS, sequences were assembled with VSEARCH and quality-filtering was undertaken with fastx through the PIPITS command pispino_createreadpairslist. The ITSx was executed through the PIPITS command pipits_funits. Chimera filtering and clustering were undertaken using VSEARCH through the PIPITS command pipits_process.

Additional filtering

The additional manual OTU table filtering was based on the BLAST similarity scores when run against UNITE (v7.2) reference database. Any OTUs that had no BLAST hit or that were not classified to the kingdom Fungi were discarded from the OTU table. The remaining OTUs were filtered based on BLAST e-value and query coverage. OTUs with higher e-value than 1e^-25 and query coverage less than 70% were excluded from the dataset (as putative artefacts or non-fungal OTUs). Additionally, OTUs with low numbers of sequences per sample were removed (less than 10 sequences per sample; Brown et al. (2015)). Finally, the LULU (Frøslev et al. 2017) algorithm was applied (minimum_ratio_type = “min”, minimum_match = 97) to merge consistently co-occurring ‘daughter’ OTUs.

Data pooling

To detect the effect of analysis platform choice on the OTU composition, we pooled sequences originating from different platforms and applied the common clustering method to generate a single OTU table. For Illumina data, filtered reads from PipeCraft, LotuS and PIPITS were pooled and clustered using CD-HIT (Fu et al. 2012) at 97% sequence similarity (Table 1). The pooled PacBio dataset included filtered sequences from LotuS, PipeCraft and Galaxy platform, clustering was performed using UPARSE algorithm with 97% sequence similarity threshold (Table 1).

Statistical analysis

We used PERMANOVA analysis (Anderson and Walsh 2013; Type III SS, 4,999 permutations) on Bray-Curtis distances of Hellinger-transformed OTU matrices, using PRIMER6 (Clarke and Gorley 2006). Outliers were screened and removed using analysis of non-metric multidimensional scaling (NMDS). The numbers of sequences per sample were included in the analysis as covariates. Rarefaction curves were generated based on OTU abundance matrices for each dataset using the RTK package (Saary et al. 2017) of R (R-Core-Team 2015).

Results and discussion

Properties of bioinformatics analysis platforms

All tested bioinformatics platforms offer straightforward installation. While Galaxy provides a freely available online platform, the benefits of PipeCraft and QIIME2 include easy-to-use graphical user interfaces and multiple options for data analysis. These platforms bundle many tools for diverse tasks. LotuS and PIPITS represent command-line based platforms. PIPITS offers a limited number of tools, but data analysis is easily performed with a straightforward pipeline. LotuS has been developed to minimise computational time and memory requirements. Specifically, for accuracy of ITS-based analyses of fungi and other eukaryotes, PipeCraft, LotuS and PIPITS implement the ITSx tool (Bengtsson-Palme et al. 2013), which removes the fragments of conservative flanking genes for precise clustering purposes. There is no such option in QIIME2 and Galaxy.

Bioinformatics platforms differ by specific requirements to the input data, with the options being a raw multiplexed file (a single file containing all sequences from one run) and multiple demultiplexed files (reads split into separate files based on indexes). PipeCraft and Galaxy use raw multiplexed data, whereas QIIME2 and PIPITS require demultiplexed files. Only LotuS allows both, multiplexed and demultiplexed files as input. As the raw data files are multiplexed by default, QIIME2 and PIPITS platforms required additional steps of analyses outside these tools to meet the input requirements. Using a Python script, we demultiplexed the raw Illumina data, allowing 2 and 1 mismatches to primer and index strings, respectively. However, PacBio data analysis was dropped for QIIME2 and PIPITS as the present versions of these platforms are limited to analysis of short read (Illumina) data.

Performance of bioinformatics platforms on sequence data

For both the Illumina and PacBio datasets, the final OTU richness (singleton OTUs excluded) differed considerably amongst the tested workflows (Table 1). We found that pipelines, which produced roughly comparable numbers of total OTUs (QIIME2, PipeCraft, PIPITS and LotuS for Illumina data), still exhibited large variations in OTU richness per sample (Figures 2 and 3). By performing joint de-novo clustering for filtered sequences from different pipelines (total number of OTUs = 16333), we observed a weak but significant effect of pipeline choice on overall OTU composition for the Illumina data set (PERMANOVA: pseudo-F_2,868 = 5.88, R²_adj = 0.012, P < 0.001). For the PacBio dataset (total number of OTUs = 4448), differences amongst platforms were slightly stronger (pseudo-F_2,512 = 9.174; R²_adj = 0.033, P < 0.001).

Taxonomic annotation tools differed in the ability to classify OTUs. In general, BLAST searches revealed many cases of high-quality matches to non-fungal organisms (in some cases for hundreds of OTUs), while RDP when combined with the Warcup Fungal ITS trainset optimistically classified all OTUs to Fungi (100% confidence). Numerous papers have evaluated the performance of different methods on the accuracy of taxonomic assignment and performance inevitably hinges on the completeness of the reference database used (e.g. Gdanetz et al. 2017; Richardson et al. 2017). In spite of its relatively rapid performance, the RDP Fungal ITS trainset does not include any non-fungal data, which explains its shortcomings in detecting non-fungal OTUs. However, the confidence score of an RDP classifier did not exceed 64% for non-fungal OTUs, mostly overestimating the group of unclassified fungi.

We also observed that the quality-filtered datasets included up to ~10% of obvious erroneous/chimeric OTUs that produced matches with low query coverage and confidence scores. A long tail of satellite OTUs, assigned to a single species hypothesis with 99–100% BLAST identity and RDP classifier confidence level, were also common – especially in the results where a relatively high number of OTUs was observed (Galaxy platform). After filtering the spurious OTUs manually (see Methods), we found that richness estimates per sample became more homogeneous across pipelines (Illumina data: Figure 3). When OTU table filtering was applied to jointly clustered reads from different pipelines, the significant effect of pipeline choice on the community composition diminished (Illumina data: pseudo-F_2,837 = 0.955, R²_adj = 0.007, P = 0.779).

Figure 2.

OTU accumulation curves of the evaluated pipelines for a) PacBio and b) Illumina datasets.

Figure 3.

Number of OTUs per sample for Illumina data recorded from a) pipeline-generated OTU tables (median differences = 38 OTUs) and from b) filtered OTU tables (median differences = 12 OTUs). The Galaxy workflow was excluded here.

In conclusion, our results indicate that bioinformatics analysis pipelines greatly differ in their relative performance on ITS datasets targeting fungi, although roughly similar quality-orientated settings were implemented. Overall, our recommended Illumina data workflow would be PipeCraft, PIPITS or LotuS, which provide a good balance between speed, mycological accuracy (including support for ITS Extractor) and technical quality. For PacBio, the tools implemented in PipeCraft were most suitable for the long-read analysis. Conversely, the widely used platform in prokaryote 16S-based studies, our options chosen in Galaxy, performed relatively poorly on the ITS data. While QIIME2 implements an accurate quality filtering algorithm of DADA2, the lack of ITS region extraction lowers the accuracy for mycological studies. Of classification tools, BLAST searches against the UNITE database provided more accurate results on the kingdom and phylum levels compared with the RDP and Warcup ITS trainset combined. We emphasise that none of the tested bioinformatics workflows is able to fully filter out the errors that accumulated during sample preparation and sequencing, even when using the most elaborate error-filtering options. Therefore, manual curation of OTU tables continues to be an important step in obtaining robust datasets, although semi-automatic tools to assist evaluation are becoming available (Frøslev et al. 2017). It is also important to rely on high-coverage reference databases to be able to recognise non-target organisms and metagenomic reads.

Acknowledgments

We thank Falk Hildebrand for advice on bioinformatics analysis. This study was supported by the Estonian Research Council (grant no. PUT1317).

References

AbarenkovK, NilssonRH, LarssonK-H, AlexanderIJ, EberhardtU, ErlandS, HoilandK, KjollerR, LarssonE, PennanenT, SenR, TaylorAFS, TedersooL, UrsingBM, VralstadT, LiimatainenK, PeintnerU, KõljalgU (2010a) The UNITE database for molecular identification of fungi – recent updates and future perspectives.New Phytologist186: 281–285. https://doi.org/10.1111/j.1469-8137.2009.03160.x

AbarenkovK, TedersooL, NilssonRH, VellakK, SaarI, VeldreV, ParmastoE, ProusM, AanA, OtsM, KurinaO, OstonenI, JogevaJ, HalapuuS, PoldmaaK, TootsM, TruuJ, LarssonK-H, KoljalgU (2010b) PlutoF-a Web Based Workbench for Ecological and Taxonomic Research, with an Online Implementation for Fungal ITS Sequences.Evolutionary Bioinformatics6: 189–196. https://doi.org/10.4137/ebo.s6271

AfganE, BakerD, Van den BeekM, BlankenbergD, BouvierD, ČechM, ChiltonJ, ClementsD, CoraorN, EberhardC (2016) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update.Nucleic Acids Research44: 3–10. https://doi.org/10.1093/nar/gkw343

AndersonMJ, WalshDCI (2013) PERMANOVA, ANOSIM, and the Mantel test in the face of heterogeneous dispersions: What null hypothesis are you testing? Ecological Monographs 83: 557–574. https://doi.org/10.1890/12-2010.1

AnslanS, BahramM, HiiesaluI, TedersooL (2017) PipeCraft: flexible open-source toolkit for bioinformatics analysis of custom high-throughput amplicon sequencing data. Molecular Ecology Resources 17: e234–e240. https://doi.org/10.1111/1755-0998.12692

AnslanS, BahramM, TedersooL (2018) Seasonal and annual variation in fungal communities associated with epigeic springtails (Collembola spp.) in boreal forests.Soil Biology and Biochemistry116: 245–252. doi:https://doi.org/10.1016/j.soilbio.2017.10.021

Bengtsson-PalmeJ, RybergM, HartmannM, BrancoS, WangZ, GodheA, De WitP, Sanchez-GarciaM, EbersbergerI, de SousaF, AmendAS, JumpponenA, UnterseherM, KristianssonE, AbarenkovK, BertrandYJK, SanliK, ErikssonKM, VikU, VeldreV, NilssonRH (2013) Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for analysis of environmental sequencing data.Methods in Ecology and Evolution4: 914–919. https://doi.org/10.1111/2041-210x.12073

BlankenbergD, GordonA, Von KusterG, CoraorN, TaylorJ, NekrutenkoA, TeamG (2010) Manipulation of FASTQ data with Galaxy.Bioinformatics26: 1783–1785. https://doi.org/10.1093/bioinformatics/btq281

BolgerAM, LohseM, UsadelB (2014) Trimmomatic: a flexible trimmer for Illumina sequence data.Bioinformatics30: 2114–2120. https://doi.org/10.1093/bioinformatics/btu170

BrownSP, VeachAM, Rigdon-HussAR, GrondK, LickteigSK, LothamerK, OliverAK, JumpponenA (2015) Scraping the bottom of the barrel: are rare high throughput sequences artifacts? Fungal Ecology 13: 221–225. http://dx.doi.org/10.1016/j.funeco.2014.08.006

CallahanBJ, McMurdiePJ, RosenMJ, HanAW, JohnsonAJA, HolmesSP (2016) DADA2: high-resolution sample inference from Illumina amplicon data. Nature Methods 13: 581. https://doi.org/10.1038/nmeth.3869

CamachoC, CoulourisG, AvagyanV, MaN, PapadopoulosJ, BealerK, MaddenTL (2009) BLAST+: architecture and applications. BMC Bioinformatics 10: 421. https://doi.org/10.1186/1471-2105-10-421

CaporasoJG, KuczynskiJ, StombaughJ, BittingerK, BushmanFD, CostelloEK, FiererN, PeñaAG, GoodrichJK, GordonJI (2010) QIIME allows analysis of high-throughput community sequencing data.Nature Methods7: 335–336. https://doi.org/10.1038/nmeth.f.303

ClarkeK, GorleyR (2006) PRIMER V6: User Manual / Tutorial.Primer-E Ltd, Plymouth, 192 pp.

ClineLC, SongZ, Al‐GhalithGA, KnightsD, KennedyPG (2017) Moving beyond de novo clustering in fungal community ecology. New Phytol.216(3): 629–634. https://doi.org/10.1111/nph.14752

DeshpandeV, WangQ, GreenfieldP, CharlestonM, Porras-AlfaroA, KuskeCR, ColeJR, MidgleyDJ, Tran-DinhN (2016) Fungal identification using a Bayesian classifier and the Warcup training set of internal transcribed spacer sequences.Mycologia108: 1–5. https://doi.org/10.3852/14-293

EdgarRC (2013) UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nature Methods 10. https://doi.org/10.1038/nmeth.2604

EdgarRC, HaasBJ, ClementeJC, QuinceC, KnightR (2011) UCHIME improves sensitivity and speed of chimera detection.Bioinformatics27: 2194–2200. https://doi.org/10.1093/bioinformatics/btr381

FrøslevTG, KjøllerR, BruunHH, EjrnæsR, BrunbjergAK, PietroniC, HansenAJ (2017) Algorithm for post-clustering curation of DNA amplicon data yields reliable biodiversity estimates. Nature communications 8: 1188. https://doi.org/10.1038/s41467-017-01312-x

FuL, NiuB, ZhuZ, WuS, LiW (2012) CD-HIT: accelerated for clustering the next-generation sequencing data.Bioinformatics28: 3150–3152. https://doi.org/10.1093/bioinformatics/bts565

GrossartH-P, WurzbacherC, JamesTY, KagamiM (2016) Discovery of dark matter fungi in aquatic ecosystems demands a reappraisal of the phylogeny and ecology of zoosporic fungi.Fungal Ecology19: 28–38. doi:https://doi.org/10.1016/j.funeco.2015.06.004

GweonHS, OliverA, TaylorJ, BoothT, GibbsM, ReadDS, GriffithsRI, SchonroggeK (2015) PIPITS: an automated pipeline for analyses of fungal internal transcribed spacer sequences from the Illumina sequencing platform.Methods in Ecology and Evolution6: 973–980. https://doi.org/10.1111/2041-210x.12399

HibbettD, AbarenkovK, KoljalgU, OpikM, ChaiB, ColeJR, WangQ, CrousPW, RobertVARG, HelgasonT, HerrJ, KirkP, LueschowS, O’DonnellK, NilssonH, OonoR, SchochCL, SmythC, WalkerD, Porras-AlfaroA, TaylorJW, GeiserDM (2017) Sequence-based classification and identification of Fungi.Mycologia108: 1049–1068

HildebrandF, TadeoR, VoigtAY, BorkP, RaesJ (2014) LotuS: an efficient and user-friendly OTU processing pipeline. Microbiome 2: 30. https://doi.org/10.1186/2049-2618-2-30

LückingR, KirkPM, HawksworthDL (2018) Sequence-based nomenclature: a reply to Thines et al. and Zamora et al. and provisions for an amended proposal.IMA fungus9: 185–198. https://doi.org/10.5598/imafungus.2018.09.01.12

MajanevaM, HyytiäinenK, VarvioSL, NagaiS, BlomsterJ (2015) Bioinformatic amplicon read processing strategies strongly affect eukaryotic diversity and the taxonomic composition of communities. PLoS ONE 10: e0130035. https://doi.org/10.1371/journal.pone.0130035

R-Core-Team (2015) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna.

RognesT, FlouriT, NicholsB, QuinceC, MahéF (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4: e2584. https://doi.org/10.7717/peerj.2584

SaaryP, ForslundK, BorkP, HildebrandF (2017) RTK: efficient rarefaction analysis of large datasets.Bioinformatics33: 2594–2595. https://doi.org/10.1093/bioinformatics/btx206

SchlossPD, WestcottSL, RyabinT, HallJR, HartmannM, HollisterEB, LesniewskiRA, OakleyBB, ParksDH, RobinsonCJ, SahlJW, StresB, ThallingerGG, Van HornDJ, WeberCF (2009) Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities.Applied and Environmental Microbiology75: 7537–7541. https://doi.org/10.1128/aem.01541-09

SchochCL, SeifertKA, HuhndorfS, RobertV, SpougeJL, LevesqueCA, ChenW, BolchacovaE, VoigtK, CrousPW, MillerAN, WingfieldMJ, AimeMC, AnKD, BaiFY, BarretoRW, BegerowD, BergeronMJ, BlackwellM, BoekhoutT, BogaleM, BoonyuenN, BurgazAR, BuyckB, CaiL, CaiQ, CardinaliG, ChaverriP, CoppinsBJ, CrespoA, CubasP P, CummingsC, DammU, de BeerZW, de HoogGS, Del-PradoR, DentingerB, Dieguez-UribeondoJ, DivakarPK, DouglasB, DuenasM, DuongTA, EberhardtU, EdwardsJE, ElshahedMS, FliegerovaK, FurtadoM, GarciaMA, GeZW, GriffithGW, GriffithsK, GroenewaldJZ, GroenewaldM, GrubeM, GryzenhoutM, GuoLD, HagenF, HambletonS, HamelinRC, HansenK, HarroldP, HellerG, HerreraG, HirayamaK, HirookaY, HoHM, HoffmannK, HofstetterV, HognabbaF, HollingsworthPM, HongSB, HosakaK, HoubrakenJ, HughesK, HuhtinenS, HydeKD, JamesT, JohnsonEM, JohnsonJE, JohnstonPR, JonesEB, KellyLJ, KirkPM, KnappDG, KoljalgU, KovacsGM, KurtzmanCP, LandvikS, LeavittSD, LiggenstofferAS, LiimatainenK, LombardL, Luangsa-ArdJJ, LumbschHT, MagantiH, MaharachchikumburaSS, MartinMP, MayTW, McTaggartAR, MethvenAS, MeyerW, MoncalvoJM, MongkolsamritS, NagyLG, NilssonRH, NiskanenT, NyilasiI, OkadaG, OkaneI, OlariagaI, OtteJ, PappT, ParkD, PetkovitsT, Pino-BodasR, QuaedvliegW, RajaHA, RedeckerD, RintoulT, RuibalC, Sarmiento-RamirezJM, SchmittI, SchusslerA, ShearerC, SotomeK, StefaniFO, StenroosS, StielowB, StockingerH, SuetrongS, SuhSO, SungGH, SuzukiM, TanakaK, TedersooL, TelleriaMT, TretterE, UntereinerWA, UrbinaH, VagvolgyiC, VialleA, VuTD, WaltherG, WangQM, WangY, WeirBS, WeissM, WhiteMM, XuJ, YahrR, YangZL, YurkovA, ZamoraJC, ZhangN, ZhuangWY, SchindelD, FungalBarcoding C (2012) Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi.Proceedings of the National Academy of Sciences of the United States of America109: 6241–6246. https://doi.org/10.1073/pnas.1117018109

SinhaR, Abu-AliG, VogtmannE, FodorAA, RenB, AmirA, SchwagerE, CrabtreeJ, MaS, AbnetCC (2017) Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nature Biotechnology volume 35, pages 1077–1086. https://doi.org/10.1038/nbt.3981

WangQ, GarrityGM, TiedjeJM, ColeJR (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Applied and Environmental Microbiology 73. https://doi.org/10.1128/aem.00062-07

VetrovskýT, BaldrianP, MoraisD, BergerB (2018) SEED 2: a user-friendly platform for amplicon high-throughput sequencing data analyses. Bioinformatics 1: 3. https://doi.org/10.1093/bioinformatics/bty071