Research Article
Corresponding author: R. Henrik Nilsson (henrik.nilsson@bioenv.gu.se)
Academic editor: Imke Schmitt
© 2017 R. Henrik Nilsson, Marisol Sánchez-García, Martin K. Ryberg, Kessy Abarenkov, Christian Wurzbacher, Erik Kristiansson.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Nilsson RH, Sánchez-García M, Ryberg M, Abarenkov K, Wurzbacher C, Kristiansson E (2017) Read quality-based trimming of the distal ends of public fungal DNA sequences is nowhere near satisfactory. MycoKeys 26: 13-24. https://doi.org/10.3897/mycokeys.26.14591
DNA sequences are increasingly used for taxonomic and functional assessment of environmental communities. In mycology, the nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen marker for such pursuits. Molecular identification is associated with many challenges, one of which is low read quality of the reference sequences used for inference of taxonomic and functional properties of the newly sequenced community (or single taxon). This study investigates whether public fungal ITS sequences are subjected to sufficient trimming in their distal (5’ and 3’) ends prior to deposition in the public repositories. We examined 86 species (and 10,584 sequences) across the fungal tree of life, and we found that on average 13.1% of the sequences were poorly trimmed in one or both of their 5’ and 3’ ends. Deposition of poorly trimmed entries was found to continue through 2016. Poorly trimmed reference sequences add noise and mask biological signal in sequence similarity searches and phylogenetic analyses, and we provide a set of recommendations on how to manage the sequence trimming problem.
Molecular identification, DNA barcoding, database curation, Sanger sequencing, high-throughput sequencing, molecular ecology
Molecular (DNA-based) species identification is the process by which newly generated DNA sequences are examined for taxonomic affiliation and sometimes functional aspects by comparison to reference sequences of firmly established taxonomic origin. It is a powerful tool to identify organisms, particularly those with few or no discriminatory morphological characters and those with cryptic or inconspicuous life styles (
Several factors combine to make molecular identification of fungi complicated. In addition to the lack of reference sequences for more than 99% of the estimated number of extant species of fungi, technical complications such as chimera formation and low read quality may introduce noise and bias to such efforts (
One aspect of sequence reliability that remains largely unexplored is quality trimming of the distal (approximately 25 bases at the very 5’ and 3’) ends of Sanger sequences. Owing to the nature of the Sanger sequencing process, the very first bases are often hard to resolve due to the presence of un-incorporated nucleotides and leftover primers. Similarly, the signal-to-noise ratio typically drops with the length of the amplicon in that it becomes increasingly difficult to separate amplicons of near-identical lengths from each other on the electrophoresis gel. Thus, an important part of Sanger sequencing is to inspect the resulting chromatograms and remove any noisy distal sequence parts in the newly generated sequence data. This step is, however, sometimes overlooked. When working with INSDC data for fungal molecular identification and sequence analysis purposes, we regularly come across entries whose distal ends appear to be very poorly trimmed. They may feature extended homopolymer regions (e.g., AAAAAAAAA…) or stretches of seemingly random bases that are not found in other conspecific sequences (
The problem is of particular concern for the nuclear ribosomal internal transcribed spacer (ITS) region, the formal fungal barcode and the most popular genetic marker for assessing the taxonomic composition of fungal communities (
The ribosomal operon is regularly left out from genome sequencing efforts due to assembly difficulties (
For each of the 86 species (spanning 3 fungal phyla and 29 orders, Suppl. material
We went through each position in each of the alignments, starting from the 50th-to-last base of the SSU to the 50th base of LSU, and noted the proportion of INSDC sequences that produced a different nucleotide base from that of the corresponding genome-derived ITS sequence. All three of DNA base mismatches, gaps, and DNA ambiguity symbols (
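The per-position comparison described above can be sketched as follows. This is a minimal illustration, not the authors' actual script: the function name, the use of "-" as the gap character, and the rule that a sequence only counts at columns between its first and last base are assumptions. It also assumes that mismatches, gaps, and ambiguity symbols are all scored as disagreements, which is what the sentence above appears to indicate.

```python
# Hypothetical sketch: per-column disagreement against a genome-derived reference.
# Assumes a pre-computed alignment in which every sequence has the same length
# and "-" marks a gap. None of these names come from the original study.

UNAMBIGUOUS_OR_GAP = set("ACGT-")


def disagreement_profile(reference, queries):
    """Return, for each alignment column, the fraction of covering query
    sequences that disagree with the reference (None if no query covers it).

    A query covers a column if the column lies between its first and last
    non-gap character, so padding outside the sequenced region is ignored.
    Mismatches, internal gaps, and ambiguity symbols count as disagreements.
    """
    reference = reference.upper()
    differing = [0] * len(reference)
    covering = [0] * len(reference)
    for query in queries:
        query = query.upper()
        if len(query) != len(reference):
            raise ValueError("all sequences must have the alignment length")
        residues = [i for i, base in enumerate(query) if base != "-"]
        if not residues:
            continue  # an all-gap row carries no information
        for col in range(residues[0], residues[-1] + 1):
            covering[col] += 1
            same = query[col] == reference[col]
            if not same or query[col] not in UNAMBIGUOUS_OR_GAP:
                differing[col] += 1
    return [d / c if c else None for d, c in zip(differing, covering)]
```

Plotting the returned list against column index gives a curve of the kind described for the dissimilarity figure; elevated values confined to the first and last columns would point to poor trimming rather than to genuine intraspecific variation.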
The 86 multiple sequence alignments, each covering at most 50 bases of the SSU, the full ITS region (minus at most 50 bases of the 5’ end of ITS1 and/or 50 bases of the end of ITS2), and at most 50 bases of the LSU, are provided in Suppl. material
The plotting of disagreements with respect to the genome-derived sequences revealed that insufficient trimming of sequence data seems to be a widespread problem (Figs
Example of poorly trimmed sequences (sequence four and on) from the species Setosphaeria turcica. The 5’ end of the alignment is shown, and the poorly trimmed sequences cover the last ~5 bases of SSU and the immediate start of ITS1. The topmost sequence is genome-derived, and sequences two and three are regular Sanger sequences retrieved from the INSDC from studies other than the one with the poorly trimmed sequences (sequences four and on). SeaView v. 4 (
Public fungal ITS sequences are subjected to insufficient trimming in their distal ends. Panel a shows the dissimilarity (y-axis) as a function of the relative sequence position (x-axis). The plot is based on 10,584 sequences from 86 species. Panels b and c show zoom-ins of the 5’ and 3’ ends, respectively. Dashed lines indicate point-wise standard errors.
The proportion of poorly trimmed (y-axis) fungal ITS sequences submitted to the INSDC does not decrease over time (x-axis). The regression line (dashed), which was derived by overdispersed Poisson rate regression, shows a weak but significant increasing trend (yearly relative increase of 0.047, p=0.0291).
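For readers unfamiliar with overdispersed Poisson rate regression, the following sketch shows one common formulation. It is an assumption that this matches the authors' model: the yearly count of poorly trimmed sequences is modelled with a log link, the yearly total of sequences enters as an offset, and the variance is allowed to exceed the mean by a dispersion factor estimated from Pearson residuals. The function and variable names are hypothetical, and no data from the study are reproduced.

```python
import math


def fit_rate_regression(years, poor, total, iterations=50):
    """Fit log(E[poor]) = log(total) + b0 + b1 * (year - mean year).

    Coefficients are found by Newton-Raphson on the Poisson log-likelihood.
    Overdispersion is handled quasi-Poisson style: the dispersion is the
    Pearson chi-square divided by the residual degrees of freedom, and the
    standard error of b1 is inflated by its square root.
    Returns (b0, b1, dispersion, se_b1).
    """
    if len(years) < 3:
        raise ValueError("need at least three years to estimate dispersion")
    mean_year = sum(years) / len(years)
    x = [year - mean_year for year in years]

    def expected(b0, b1):
        return [n * math.exp(b0 + b1 * xi) for n, xi in zip(total, x)]

    def information(mu):
        i00 = sum(mu)
        i01 = sum(xi * m for xi, m in zip(x, mu))
        i11 = sum(xi * xi * m for xi, m in zip(x, mu))
        return i00, i01, i11, i00 * i11 - i01 * i01

    b0 = math.log(sum(poor) / sum(total))  # start from the pooled rate
    b1 = 0.0
    for _ in range(iterations):
        mu = expected(b0, b1)
        g0 = sum(y - m for y, m in zip(poor, mu))  # score for b0
        g1 = sum(xi * (y - m) for xi, y, m in zip(x, poor, mu))  # score for b1
        i00, i01, i11, det = information(mu)
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det

    mu = expected(b0, b1)
    i00, _, _, det = information(mu)
    pearson = sum((y - m) ** 2 / m for y, m in zip(poor, mu))
    dispersion = pearson / (len(years) - 2)
    se_b1 = math.sqrt(dispersion * i00 / det)
    return b0, b1, dispersion, se_b1
```

Under this formulation, math.expm1(b1) is the estimated relative change in the rate per year; whether the reported value of 0.047 corresponds to that quantity or to the slope itself is not stated in the caption.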
We provide data to suggest that many public DNA sequences are poorly trimmed in their distal parts. The fact that poorly trimmed sequences continue to be deposited through 2016 furthermore suggests that this problem will not go away by itself over time. We hope that the present paper will serve as an eye-opener, both for researchers who risk using the poorly trimmed data for molecular identification and for researchers generating and depositing sequences in public sequence repositories. As things stand, these sequences may confound sequence similarity searches by falsely suggesting that two sequences (biological entities) are less similar than they really are. This reduces the precision in taxonomic and functional assessment – whether manual or carried out through some software pipeline – of newly generated sequences. Other kinds of sequence analysis, such as phylogenetic analyses, will similarly be distorted by poorly trimmed sequences.
Fortunately, managing read quality in Sanger sequences is fairly straightforward. The chromatograms, indicating the relative signal strength for each of the four purines/pyrimidines C, T, A, and G for each position in the sequence, are a key resource in this pursuit. Brief guidelines for how chromatograms should be processed are available in
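As a complement to manual chromatogram inspection, distal trimming can also be approximated from per-base quality scores. The sketch below is one possible approach and not a procedure prescribed by this paper: it assumes that Phred-style quality values have already been obtained from the base caller, and the window size of 10 and the threshold of 20 are illustrative defaults only. The function name is hypothetical.

```python
def trim_by_quality(bases, qualities, window=10, min_quality=20):
    """Trim low-quality distal ends from a read.

    The kept region runs from the start of the first window whose mean
    quality reaches min_quality to the end of the last such window. Any
    individual bases below min_quality that remain at either edge are then
    removed one at a time. Returns the trimmed sequence ("" if no window
    qualifies). Interior low-quality bases are deliberately left alone.
    """
    if len(bases) != len(qualities):
        raise ValueError("bases and qualities must have the same length")
    n = len(bases)
    if n == 0:
        return ""
    window = min(window, n)  # short reads are judged as a single window
    good_starts = [
        i for i in range(n - window + 1)
        if sum(qualities[i:i + window]) / window >= min_quality
    ]
    if not good_starts:
        return ""
    start = good_starts[0]
    end = good_starts[-1] + window
    # Shave remaining poor bases at the edges of the kept region.
    while start < end and qualities[start] < min_quality:
        start += 1
    while end > start and qualities[end - 1] < min_quality:
        end -= 1
    return bases[start:end]
```

A function of this kind only flags the obvious cases. Ambiguous calls and mixed peaks inside the read still require a look at the chromatogram itself.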
In this study we show that incomplete (or lack of) trimming of sequence ends remains abundant in molecular mycology. Although this was expected based on our experience, this is the first study to provide at least an initial estimate of the magnitude of the problem. We used genome-derived ITS sequences from 86 fungal species from 29 different orders in our pursuit, such that we think that it is reasonable to extrapolate our findings to the fungal kingdom at large. Furthermore, we cannot think of any reason why this would be a uniquely fungus-specific problem, and we consider that our findings in fact may hold true for Sanger sequences from all genes and groups of organisms, possibly excluding groups and genes that only a few meticulous researchers have worked on. We would, however, like to stress that we provide estimates rather than hard facts. Our approach relied on genome-derived ITS sequences, and we quantified deviations from the genome sequences among conspecific ITS sequences in the INSDC as assessed through species names (Latin binomials). However, some degree of deviation from the genome-derived sequences is expected, since intraspecific ITS variation may reach 3% or in some cases more (
In conclusion, we have shown beyond reasonable doubt that there is room for improvement in the way the mycological community – and to some degree the scientific community at large – trim their DNA sequences. The poor sequence trimming leaves a mark on all subsequent studies that make use of those sequences through BLAST searches or otherwise. Mycology faces enough challenges as it is without having to worry about the burden of poorly trimmed sequences (cf.
RHN acknowledges financial support from the Swedish Research Council of Environment, Agricultural Sciences, and Spatial Planning (FORMAS, 215-2011-498) and MR from the same agency (FORMAS, 226-2014-1109). RHN, KA, and the UNITE community acknowledge support from the Alfred P. Sloan Foundation. EK acknowledges funding from FORMAS and Wallenberg. CW and RHN acknowledge funding from Stiftelsen Olle Engkvist, Stiftelsen Lars Hiertas Minne, Stiftelsen Birgit och Birger Wåhlströms minnesfond för den bohuslänska havs- och insjömiljön, and Kapten Carl Stenholms donationsfond. Conrad Schoch (NCBI) is gratefully acknowledged for valuable comments on the manuscript. CW acknowledges a Marie Skłodowska-Curie postdoctoral grant (CRYPTRANS).
Details on the fungal genomes/contigs targeted
Data type: Excel spreadsheet
Explanation note: List of the fungal genomes/contigs targeted, their URL, their taxonomic affiliation, and the number of sequences (with and without poor trimming) for each entry.
The 86 multiple sequence alignments used
Data type: Text
Explanation note: The multiple sequence alignments used to infer the statistics of the study. They are provided in the FASTA format (