Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding

Abstract Along with recent developments in high-throughput sequencing (HTS) technologies and thus fast accumulation of HTS data, there has been a growing need and interest for developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of the application of different bioinformatics workflow on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, quality of error filtering and hence output of specific bioinformatics process largely depends on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units, although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.


Introduction
Fungi are major ecological and functional players in terrestrial ecosystems. The full diversity of fungi remains largely uncharted due to their largely unculturable nature, the lack of tangible morphological manifestations and shortcomings of the mycological community to sample beyond traditional habitats and substrates (Grossart et al. 2016;Hibbett et al. 2017;Lücking et al. 2018). As a result, identification of fungi has come to rely mainly on direct DNA sequencing of material containing fungal hyphae or spores. In this regard, several DNA barcoding regions have been evaluated and the current consensus is that the nuclear ribosomal internal transcribed spacer (ITS) region is the best region for delimiting fungal taxa at the species level across a variety of fungal groups (Schoch et al. 2012). Recent advances in high-throughput sequencing (HTS) have made it possible to sequence millions of reads and identify thousands of fungal taxa from a single sample. Handling this enormous amount of data is often complicated and requires extensive bioinformatics expertise.
Multiple analysis platforms have been introduced to facilitate the bioinformatics treatment of HTS data. However, most of these software suites were developed for the prokaryotic 16S rRNA gene and may therefore perform poorly with other markers and other organisms, in particular ITS sequences due to their length variation and non-alignability across taxonomic expanses. To accommodate this, several tailored platforms have recently been developed to specifically address fungal ITS datasets (Anslan et al. 2017;Gweon et al. 2015;Hildebrand et al. 2014;Vetrovský et al. 2018). These platforms cover multiple steps of the analysis procedure, including quality control, clustering, taxonomic assignment and generating Operational Taxonomic Unit (OTU) abundance tables. Many of these platforms cover all these analysis steps, whereas others do not.
The application of different bioinformatics workflows may introduce variation in the data quality and output OTU tables (Majaneva et al. 2015;Sinha et al. 2017). However, to date, there are no data on the relative performance of the available tools for fungal HTS data analysis. In this study, we report on the relative performance of the most popular software pipelines on two contrasting HTS datasets.

Sequence data and general workflow
We compared the performance of bioinformatics analysis platforms on two fungal ITS datasets. Tested datasets included Illumina MiSeq paired-end ITS2 amplicons from arthropod substrates (Anslan et al. 2018) and full ITS circular consensus sequences from Pacific Biosciences (PacBio) Sequel machine, amplified from soil samples. PacBio data set is available through PlutoF database (Abarenkov et al. 2010b Figure 1). Depending on the analysis platform, quality filtering was performed using either VSEARCH (Rognes et al. 2016), trimmomatic (Bolger et al. 2014), DADA2 (Callahan et al. 2016), sdm (Hildebrand et al. 2014) or fastx (http://hannonlab.cshl.edu/fastx_toolkit). Quality filtered sequences were passed through chimeric reads removal algorithms as implemented in USEARCH (Edgar 2013;Edgar et al. 2011) or VSEARCH. Using PipeCraft, LotuS and PIPITS, reads were also subjected to ITS extraction using ITSx (Bengtsson-Palme et al. 2013) to remove conservative flanking genes of the ITS region. OTU formation (clustering) was performed using USEARCH or VSEARCH as outlined below (Platform specific options). For each platform, we relied on de-novo single linkage clustering, which is the most popular approach in fungal community studies, knowing that reference-based clustering methods can provide similar results (Cline et al. 2017). Taxonomic affiliations were assigned to OTUs using DP Naive Bayesian rRNA Classifier (RDP classifier v2.11; Wang et al. 2007) with the Warcup Fungal ITS trainset 2 (confidence threshold: 80%; Deshpande et al. 2016) as well as BLAST+ (Camacho et al. 2009) search (e-value = 0.001, word size = 7, reward = 1, penalty = -1, gap open = 1, gap extend = 2) against the UNITE v7.2 reference database (Abarenkov et al. 2010a).
Chimeric sequences were removed using VSEARCH de novo (abundance annotation = 0.97, abskew = 2) and reference-based (UNITE v7.2 as reference) chimera filtering algorithms. In the chimera filtering step, the PipeCraft supported option for "primer artefact" removal was also used (sequences where primer strings were found in the middle of the sequence were removed). ITS reads were extracted using ITSx (default options). Clustering was performed using USEARCH/UPARSE algorithm (id = 3, minsize = 2).
Using PIPITS, sequences were assembled with VSEARCH and quality-filtering was undertaken with fastx through the PIPITS command pispino_createreadpairslist. The ITSx was executed through the PIPITS command pipits_funits. Chimera filtering and clustering were undertaken using VSEARCH through the PIPITS command pipits_process.

Additional filtering
The additional manual OTU table filtering was based on the BLAST similarity scores when run against UNITE (v7.2) reference database. Any OTUs that had no BLAST hit or that were not classified to the kingdom Fungi were discarded from the OTU table. The remaining OTUs were filtered based on BLAST e-value and query coverage. OTUs with higher e-value than 1e -25 and query coverage less than 70% were excluded from the dataset (as putative artefacts or non-fungal OTUs). Additionally, OTUs with low numbers of sequences per sample were removed (less than 10 sequences per sample; Brown et al. (2015)). Finally, the LULU (Frøslev et al. 2017) algorithm was applied (minimum_ratio_type = "min", minimum_match = 97) to merge consistently co-occurring 'daughter' OTUs.

Data pooling
To detect the effect of analysis platform choice on the OTU composition, we pooled sequences originating from different platforms and applied the common clustering method to generate a single OTU table. For Illumina data, filtered reads from Pipe-Craft, LotuS and PIPITS were pooled and clustered using CD-HIT (Fu et al. 2012) at 97% sequence similarity (Table 1). The pooled PacBio dataset included filtered sequences from LotuS, PipeCraft and Galaxy platform, clustering was performed using UPARSE algorithm with 97% sequence similarity threshold (Table 1).

Statistical analysis
We used PERMANOVA analysis (Anderson and Walsh 2013; Type III SS, 4,999 permutations) on Bray-Curtis distances of Hellinger-transformed OTU matrices, using PRIMER6 (Clarke and Gorley 2006). Outliers were screened and removed using analysis of non-metric multidimensional scaling (NMDS). The numbers of sequences per sample were included in the analysis as covariates. Rarefaction curves were generated based on OTU abundance matrices for each dataset using the RTK package (Saary et al. 2017) of R (R-Core-Team 2015).

Properties of bioinformatics analysis platforms
All tested bioinformatics platforms offer straightforward installation. While Galaxy provides a freely available online platform, the benefits of PipeCraft and QIIME2 include easy-to-use graphical user interfaces and multiple options for data analysis. These platforms bundle many tools for diverse tasks. LotuS and PIPITS represent command-line based platforms. PIPITS offers a limited number of tools, but data analysis is easily performed with a straightforward pipeline. LotuS has been developed to minimise computational time and memory requirements. Specifically, for accuracy of ITS-based analyses of fungi and other eukaryotes, PipeCraft, LotuS and PIPITS implement the ITSx tool (Bengtsson-Palme et al. 2013), which removes the fragments of conservative flanking genes for precise clustering purposes. There is no such option in QIIME2 and Galaxy. Bioinformatics platforms differ by specific requirements to the input data, with the options being a raw multiplexed file (a single file containing all sequences from one run) and multiple demultiplexed files (reads split into separate files based on indexes). PipeCraft and Galaxy use raw multiplexed data, whereas QIIME2 and PIPITS require demultiplexed files. Only LotuS allows both, multiplexed and demultiplexed files as input. As the raw data files are multiplexed by default, QIIME2 and PIPITS platforms required additional steps of analyses outside these tools to meet the input requirements. Using a Python script, we demultiplexed the raw Illumina data, allowing 2 and 1 mismatches to primer and index strings, respectively. However, PacBio data analysis was dropped for QIIME2 and PIPITS as the present versions of these platforms are limited to analysis of short read (Illumina) data.

Performance of bioinformatics platforms on sequence data
For both the Illumina and PacBio datasets, the final OTU richness (singleton OTUs excluded) differed considerably amongst the tested workflows (Table 1). We found that pipelines, which produced roughly comparable numbers of total OTUs (QIIME2, PipeCraft, PIPITS and LotuS for Illumina data), still exhibited large variations in OTU richness per sample (Figures 2 and 3). By performing joint de-novo clustering for filtered sequences from different pipelines (total number of OTUs = 16333), we observed a weak but significant effect of pipeline choice on overall OTU composition for the Illumina data set (PERMANOVA: pseudo-F 2,868 = 5.88, R 2 adj = 0.012, P < 0.001). For the PacBio dataset (total number of OTUs = 4448), differences amongst platforms were slightly stronger (pseudo-F 2,512 = 9.174; R 2 adj = 0.033, P < 0.001). Taxonomic annotation tools differed in the ability to classify OTUs. In general, BLAST searches revealed many cases of high-quality matches to non-fungal organisms (in some cases for hundreds of OTUs), while RDP when combined with the Warcup Fungal ITS trainset optimistically classified all OTUs to Fungi (100% confidence). Numerous papers have evaluated the performance of different methods on the accuracy of taxonomic assignment and performance inevitably hinges on the completeness of the reference database used (e.g. Gdanetz et al. 2017;Richardson et al. 2017). In spite of its relatively rapid performance, the RDP Fungal ITS trainset does not include any non-fungal data, which explains its shortcomings in detecting non-fungal OTUs. However, the confidence score of an RDP classifier did not exceed 64% for non-fungal OTUs, mostly overestimating the group of unclassified fungi.
We also observed that the quality-filtered datasets included up to ~10% of obvious erroneous/chimeric OTUs that produced matches with low query coverage and confidence scores. A long tail of satellite OTUs, assigned to a single species hypothesis with 99-100% BLAST identity and RDP classifier confidence level, were also commonespecially in the results where a relatively high number of OTUs was observed (Galaxy platform). After filtering the spurious OTUs manually (see Methods), we found that richness estimates per sample became more homogeneous across pipelines (Illumina data: Figure 3). When OTU table filtering was applied to jointly clustered reads from different pipelines, the significant effect of pipeline choice on the community composition diminished (Illumina data: pseudo-F 2,837 = 0.955, R 2 adj = 0.007, P = 0.779). In conclusion, our results indicate that bioinformatics analysis pipelines greatly differ in their relative performance on ITS datasets targeting fungi, although roughly similar quality-orientated settings were implemented. Overall, our recommended Illumina data workflow would be PipeCraft, PIPITS or LotuS, which provide a good balance between speed, mycological accuracy (including support for ITS Extractor) and technical quality. For PacBio, the tools implemented in PipeCraft were most suitable for the long-read analysis. Conversely, the widely used platform in prokaryote 16Sbased studies, our options chosen in Galaxy, performed relatively poorly on the ITS data. While QIIME2 implements an accurate quality filtering algorithm of DADA2, the lack of ITS region extraction lowers the accuracy for mycological studies. Of clas- sification tools, BLAST searches against the UNITE database provided more accurate results on the kingdom and phylum levels compared with the RDP and Warcup ITS trainset combined. We emphasise that none of the tested bioinformatics workflows is able to fully filter out the errors that accumulated during sample preparation and sequencing, even when using the most elaborate error-filtering options. Therefore, manual curation of OTU tables continues to be an important step in obtaining robust datasets, although semi-automatic tools to assist evaluation are becoming available (Frøslev et al. 2017). It is also important to rely on high-coverage reference databases to be able to recognise non-target organisms and metagenomic reads.