Short Communication
Corresponding author: Sten Anslan (sten1987@gmail.com)
Corresponding author: Mohammad Bahram (bahram@ut.ee)
Academic editor: Thorsten Lumbsch
© 2018 Sten Anslan, R. Henrik Nilsson, Christian Wurzbacher, Petr Baldrian, Leho Tedersoo, Mohammad Bahram.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Anslan S, Nilsson RH, Wurzbacher C, Baldrian P, Tedersoo L, Bahram M (2018) Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding. MycoKeys 39: 29-40. https://doi.org/10.3897/mycokeys.39.28109
Along with recent developments in high-throughput sequencing (HTS) technologies and the resulting fast accumulation of HTS data, there has been a growing need for, and interest in, tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of applying different bioinformatics workflows on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that computation time, the quality of error filtering and hence the output of specific bioinformatics processes depend largely on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units (OTUs), although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon dataset. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.
Microbial communities, microbiome, mycobiome, fungal biodiversity, metagenomics, amplicon sequencing
Fungi are major ecological and functional players in terrestrial ecosystems. The full diversity of fungi remains largely uncharted due to their largely unculturable nature, the lack of tangible morphological manifestations and shortcomings of the mycological community to sample beyond traditional habitats and substrates (
Multiple analysis platforms have been introduced to facilitate the bioinformatics treatment of HTS data. However, most of these software suites were developed for the prokaryotic 16S rRNA gene and may therefore perform poorly with other markers and other organisms, in particular ITS sequences due to their length variation and non-alignability across taxonomic expanses. To accommodate this, several tailored platforms have recently been developed to specifically address fungal ITS datasets (
The application of different bioinformatics workflows may introduce variation in the data quality and output OTU tables (
We compared the performance of bioinformatics analysis platforms on two fungal ITS datasets. Tested datasets included Illumina MiSeq paired-end ITS2 amplicons from arthropod substrates (
Software used, and sequence and OTU counts (values in bold), for the a) Illumina and b) PacBio analysis platforms. The number of sequences denotes raw input reads and remaining reads after each analysis step. Singleton OTUs were excluded from the OTU counts.
a) | LotuS | QIIME2 | PipeCraft | Galaxy | PIPITS |
---|---|---|---|---|---|
Raw reads | 7,981,812a | 7,335,838b | 7,981,812a | 7,981,812a | 7,335,838b |
Assembly | FLASH/ NA | DADA2/ NA | VSEARCH/ 7,511,274 | FASTQ joiner/ 7,911,554 | VSEARCH/ 7,198,094 |
Quality filtering | sdm/ NA | DADA2/ 5,428,563 | VSEARCH/ 7,511,274 | trimmomatic/ 7,879,960 | fastx/ 7,142,354 |
Demultiplexing | sdm/ 6,727,631 | NP | mothur/ 6,558,772 | mothur/ 1,643,879 | NP |
Chimera filtering | USEARCH/ 6,486,802 | NP | VSEARCH/ 6,300,085 | VSEARCH/ 1,621,330 | VSEARCH/ NA |
ITS extractor | 5,919,084 | NP | 6,262,000 | NP | 6,401,097 |
Clustering (OTUs) | UPARSE/ 8,659 | VSEARCH/ 7,477 | UPARSE/ 7,598 | VSEARCH/ 23,167 | VSEARCH/ 7,887 |
b) | LotuS | PipeCraft | Galaxy | ||
CCSc reads | 720,222a | 720,222a | 720,222a | ||
Quality filtering | sdm/ NA | VSEARCH/ 462,010 | trimmomatic/ 672,292 | ||
Demultiplexing | sdm/ 258,085 | mothur/ 380,722 | mothur/ 457,173 | ||
Chimera filtering | USEARCH/ 255,746 | VSEARCH/ 341,154 | VSEARCH/ 405,025 | ||
ITS extraction | 192,485 | 338,150 | NP | ||
Clustering (OTUs) | UPARSE/ 942 | UPARSE/ 4,176 | VSEARCH/ 8,338 |
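The read counts in panel a) can be converted into retention percentages, which makes the platforms easier to compare. The following is a minimal sketch that uses only the chimera-filtering counts for the three platforms that share the same raw input; the function and variable names are illustrative and not part of any of the tested tools.

```python
# Read counts copied from panel a) (Illumina) of the table above.
RAW_READS = 7_981_812  # raw input shared by LotuS, PipeCraft and Galaxy

# Reads remaining after the chimera filtering step.
after_chimera_filtering = {
    "LotuS": 6_486_802,
    "PipeCraft": 6_300_085,
    "Galaxy": 1_621_330,
}


def retained_percent(remaining: int, raw: int) -> float:
    """Percentage of raw reads still present after a given step."""
    return 100.0 * remaining / raw


retention = {
    platform: retained_percent(count, RAW_READS)
    for platform, count in after_chimera_filtering.items()
}

for platform, pct in retention.items():
    print(f"{platform}: {pct:.1f}% of raw reads retained")
```

Expressed this way, the sharp drop in the Galaxy workflow is traceable to the demultiplexing row of the table rather than to chimera filtering itself.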
Using QIIME2, reads were assembled (Illumina data) and quality filtered using DADA2 (
In the LotuS pipeline, data were assembled (Illumina data), quality filtered (minimum length = 170, minAvgQuality = 27, TruncateSequenceLength = 170, maxAccumulatedError = 0.75) and demultiplexed with sdm (pdiffs = 1, bdiffs = 1). Chimera filtering was undertaken using USEARCH de novo chimera filtering (abundance annotation = 0.97, abskew = 2) and USEARCH reference-based chimera filtering using UNITE v7.2 as reference database. Flanking genes of the ITS region were discarded using ITSx (v1.0.11; default options). ITS reads were clustered to OTUs with USEARCH/UPARSE algorithm (-id = 3, -minsize = 2).
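To illustrate what thresholds such as a minimum average quality and a maximum accumulated error mean in practice, the sketch below applies them to a single read. It is not the sdm implementation: it assumes Phred+33 quality encoding, treats the accumulated error as the sum of per-base error probabilities, and leaves out read truncation. All function names are hypothetical.

```python
from typing import List


def decode_qualities(quality_string: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def phred_to_error_prob(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def passes_quality_filter(
    seq: str,
    quality_string: str,
    min_length: int = 170,
    min_avg_quality: float = 27,
    max_accumulated_error: float = 0.75,
) -> bool:
    """Apply length, average-quality and accumulated-error thresholds.

    Default values mirror the LotuS settings quoted in the text; the
    exact way sdm evaluates them is an assumption of this sketch.
    """
    if len(seq) < min_length:
        return False
    scores = decode_qualities(quality_string)
    if sum(scores) / len(scores) < min_avg_quality:
        return False
    accumulated_error = sum(phred_to_error_prob(q) for q in scores)
    return accumulated_error <= max_accumulated_error


# A read can have a high average quality and still fail on accumulated
# error if a few bases are very poor: 160 bases at Q40 plus 10 at Q2.
mixed_quality = "I" * 160 + "#" * 10
print(passes_quality_filter("A" * 170, mixed_quality))
```

The example shows why an accumulated-error threshold is stricter than an average-quality threshold alone: a handful of very low-quality bases barely moves the average but dominates the summed error probability.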
Using the web-based Galaxy pipeline, Illumina data were assembled with Fastq joiner (Galaxy Version 2.0.1.1;
In the PipeCraft platform, reads were assembled (Illumina data) and quality filtered using VSEARCH (minimum overlap = 15, minimum length = 100, E max = 1, max ambiguous = 0, allowstagger = T). Demultiplexing was undertaken using mothur (pdiffs = 2, bdiffs = 1). In this step, sequences were also re-orientated into the 5’-3’ orientation based on primers (2 mismatches allowed).
Chimeric sequences were removed using VSEARCH de novo (abundance annotation = 0.97, abskew = 2) and reference-based (UNITE v7.2 as reference) chimera filtering algorithms. In the chimera filtering step, the PipeCraft-supported option for “primer artefact” removal was also used (sequences in which primer strings were found in the middle of the sequence were removed). ITS reads were extracted using ITSx (default options). Clustering was performed using the USEARCH/UPARSE algorithm (id = 3, minsize = 2).
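The re-orientation and "primer artefact" steps described above can be sketched as follows. This is an illustration of the general idea, not PipeCraft's or mothur's code: the primer is a made-up toy string, only unambiguous A/C/G/T bases are handled (no IUPAC codes), and matching is by simple mismatch counting without indels.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def starts_with_primer(seq: str, primer: str, max_mismatches: int = 2) -> bool:
    """True if the read begins with the primer, allowing some mismatches."""
    if len(seq) < len(primer):
        return False
    return mismatches(seq[: len(primer)], primer) <= max_mismatches


def reorient(seq: str, forward_primer: str, max_mismatches: int = 2) -> Optional[str]:
    """Return the read in 5'-3' orientation, or None if no primer is found."""
    if starts_with_primer(seq, forward_primer, max_mismatches):
        return seq
    flipped = reverse_complement(seq)
    if starts_with_primer(flipped, forward_primer, max_mismatches):
        return flipped
    return None


def has_internal_primer(seq: str, primer: str, max_mismatches: int = 0) -> bool:
    """True if the primer also occurs anywhere after the start of the read."""
    last_start = len(seq) - len(primer)
    return any(
        mismatches(seq[start : start + len(primer)], primer) <= max_mismatches
        for start in range(1, last_start + 1)
    )


# Hypothetical toy primer and reads, for illustration only.
TOY_PRIMER = "GATTACAGT"
read = TOY_PRIMER + "CCCCCCCCCC"
artefact = TOY_PRIMER + "CCCC" + TOY_PRIMER + "CCCC"

print(reorient(reverse_complement(read), TOY_PRIMER) == read)
print(has_internal_primer(artefact, TOY_PRIMER))
```

A read that arrives in reverse-complement orientation is flipped back, a read with no recognisable primer is rejected, and a read carrying a second primer copy inside the amplicon is flagged as a putative artefact.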
Using PIPITS, sequences were assembled with VSEARCH and quality filtering was undertaken with fastx through the PIPITS command pispino_createreadpairslist. ITSx was executed through the PIPITS command pipits_funits. Chimera filtering and clustering were undertaken using VSEARCH through the PIPITS command pipits_process.
The additional manual OTU table filtering was based on the BLAST similarity scores obtained against the UNITE (v7.2) reference database. Any OTUs that had no BLAST hit or that were not classified to the kingdom Fungi were discarded from the OTU table. The remaining OTUs were filtered based on BLAST e-value and query coverage. OTUs with an e-value higher than 1e-25 and query coverage less than 70% were excluded from the dataset (as putative artefacts or non-fungal OTUs). Additionally, OTUs with low numbers of sequences per sample were removed (less than 10 sequences per sample;
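The BLAST-based criteria described above can be expressed as a small filter. The sketch below is one possible reading, not the authors' procedure: it assumes one best hit per OTU, and it excludes an OTU when either the e-value or the query coverage threshold is failed, which is an interpretation of the wording in the text. The per-sample abundance criterion is left out. The class, function and OTU names are hypothetical, and the hit values are invented toy data.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BlastHit:
    """Best BLAST hit for one OTU (fields chosen for this sketch)."""
    kingdom: str
    evalue: float
    query_coverage: float  # percent of the query covered by the alignment


def keep_otu(
    hit: Optional[BlastHit],
    max_evalue: float = 1e-25,
    min_coverage: float = 70.0,
) -> bool:
    """Decide whether an OTU survives the manual BLAST-based filtering."""
    if hit is None:  # no BLAST hit at all
        return False
    if hit.kingdom != "Fungi":  # not classified to kingdom Fungi
        return False
    # Assumption: failing either threshold is enough to exclude the OTU.
    if hit.evalue > max_evalue or hit.query_coverage < min_coverage:
        return False
    return True


# Invented toy hits, for illustration only.
best_hits: Dict[str, Optional[BlastHit]] = {
    "OTU_1": BlastHit("Fungi", 1e-80, 98.0),
    "OTU_2": BlastHit("Metazoa", 1e-90, 100.0),
    "OTU_3": BlastHit("Fungi", 1e-10, 95.0),
    "OTU_4": BlastHit("Fungi", 1e-60, 45.0),
    "OTU_5": None,
}

kept = sorted(otu for otu, hit in best_hits.items() if keep_otu(hit))
print(kept)
```

Keeping the thresholds as function parameters makes it easy to test how sensitive the final OTU richness is to the chosen cut-offs.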
To detect the effect of analysis platform choice on the OTU composition, we pooled sequences originating from different platforms and applied the common clustering method to generate a single OTU table. For Illumina data, filtered reads from PipeCraft, LotuS and PIPITS were pooled and clustered using CD-HIT (
We used PERMANOVA analysis (
All tested bioinformatics platforms offer straightforward installation. While Galaxy provides a freely available online platform, the benefits of PipeCraft and QIIME2 include easy-to-use graphical user interfaces and multiple options for data analysis. These platforms bundle many tools for diverse tasks. LotuS and PIPITS represent command-line based platforms. PIPITS offers a limited number of tools, but data analysis is easily performed with a straightforward pipeline. LotuS has been developed to minimise computational time and memory requirements. Specifically, for accuracy of ITS-based analyses of fungi and other eukaryotes, PipeCraft, LotuS and PIPITS implement the ITSx tool (
Bioinformatics platforms differ in their requirements for the input data, with the options being a raw multiplexed file (a single file containing all sequences from one run) and multiple demultiplexed files (reads split into separate files based on indexes). PipeCraft and Galaxy use raw multiplexed data, whereas QIIME2 and PIPITS require demultiplexed files. Only LotuS accepts both multiplexed and demultiplexed files as input. As the raw data files are multiplexed by default, the QIIME2 and PIPITS platforms required additional analysis steps outside these tools to meet the input requirements. Using a Python script, we demultiplexed the raw Illumina data, allowing 2 and 1 mismatches to primer and index strings, respectively. However, PacBio data analysis was dropped for QIIME2 and PIPITS as the present versions of these platforms are limited to analysis of short read (Illumina) data.
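The script used for demultiplexing is not shown in the text. As a rough illustration of the approach, the sketch below assigns a read to a sample when its leading index matches exactly one known index with at most 1 mismatch and the adjacent primer matches with at most 2 mismatches. The read layout (index, then primer, then insert), the index and primer strings, and all names are assumptions made for this example.

```python
from typing import Dict, Optional, Tuple


def mismatches(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_read(
    read: str,
    index_to_sample: Dict[str, str],
    primer: str,
    max_index_mismatches: int = 1,
    max_primer_mismatches: int = 2,
) -> Optional[Tuple[str, str]]:
    """Return (sample, insert) for a read, or None if it cannot be assigned."""
    candidates = []
    for index, sample in index_to_sample.items():
        head = read[: len(index)]
        if len(head) == len(index) and mismatches(head, index) <= max_index_mismatches:
            candidates.append((index, sample))
    # Reject reads with no matching index or an ambiguous match.
    if len(candidates) != 1:
        return None
    index, sample = candidates[0]
    primer_end = len(index) + len(primer)
    primer_region = read[len(index) : primer_end]
    if len(primer_region) < len(primer):
        return None
    if mismatches(primer_region, primer) > max_primer_mismatches:
        return None
    return sample, read[primer_end:]


# Hypothetical toy indexes and primer, for illustration only.
INDEXES = {"AAAAAA": "sampleA", "CCCCCC": "sampleB"}
PRIMER = "GATTACAGT"

print(assign_read("AAAAAT" + PRIMER + "TTTT", INDEXES, PRIMER))
print(assign_read("AAAATT" + PRIMER + "TTTT", INDEXES, PRIMER))
```

Rejecting reads that match more than one index avoids assigning a read to the wrong sample when indexes differ by only a few bases, at the cost of discarding some reads.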
For both the Illumina and PacBio datasets, the final OTU richness (singleton OTUs excluded) differed considerably amongst the tested workflows (Table
Taxonomic annotation tools differed in the ability to classify OTUs. In general, BLAST searches revealed many cases of high-quality matches to non-fungal organisms (in some cases for hundreds of OTUs), while RDP when combined with the Warcup Fungal ITS trainset optimistically classified all OTUs to Fungi (100% confidence). Numerous papers have evaluated the performance of different methods on the accuracy of taxonomic assignment and performance inevitably hinges on the completeness of the reference database used (e.g. Gdanetz et al. 2017; Richardson et al. 2017). In spite of its relatively rapid performance, the RDP Fungal ITS trainset does not include any non-fungal data, which explains its shortcomings in detecting non-fungal OTUs. However, the confidence score of an RDP classifier did not exceed 64% for non-fungal OTUs, mostly overestimating the group of unclassified fungi.
We also observed that the quality-filtered datasets included up to ~10% of obviously erroneous or chimeric OTUs that produced matches with low query coverage and confidence scores. Long tails of satellite OTUs, assigned to a single species hypothesis with 99–100% BLAST identity and RDP classifier confidence level, were also common, especially in the results where a relatively high number of OTUs was observed (Galaxy platform). After filtering the spurious OTUs manually (see Methods), we found that richness estimates per sample became more homogeneous across pipelines (Illumina data: Figure
In conclusion, our results indicate that bioinformatics analysis pipelines differ greatly in their relative performance on ITS datasets targeting fungi, even though roughly similar quality-orientated settings were implemented. Overall, our recommended workflows for Illumina data are PipeCraft, PIPITS or LotuS, which provide a good balance between speed, mycological accuracy (including support for ITS Extractor) and technical quality. For PacBio, the tools implemented in PipeCraft were the most suitable for long-read analysis. Conversely, Galaxy, a platform widely used in prokaryote 16S-based studies, performed relatively poorly on the ITS data with the options we chose. While QIIME2 implements the accurate quality filtering algorithm of DADA2, the lack of ITS region extraction lowers its accuracy for mycological studies. Of the classification tools, BLAST searches against the UNITE database provided more accurate results at the kingdom and phylum levels than the RDP classifier combined with the Warcup ITS trainset. We emphasise that none of the tested bioinformatics workflows is able to fully filter out the errors that accumulated during sample preparation and sequencing, even when using the most elaborate error-filtering options. Therefore, manual curation of OTU tables continues to be an important step in obtaining robust datasets, although semi-automatic tools to assist evaluation are becoming available (
We thank Falk Hildebrand for advice on bioinformatics analysis. This study was supported by the Estonian Research Council (grant no. PUT1317).