Research Article |
Corresponding author: Cécile Gueidan ( cecile.gueidan@csiro.au ) Academic editor: Francesco Dal Grande
© 2019 Cécile Gueidan, John A. Elix, Patrick M. McCarthy, Claude Roux, Max Mallen-Cooper, Gintaras Kantvilas.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Gueidan C, Elix JA, McCarthy PM, Roux C, Mallen-Cooper M, Kantvilas G (2019) PacBio amplicon sequencing for metabarcoding of mixed DNA samples from lichen herbarium specimens. MycoKeys 53: 73-91. https://doi.org/10.3897/mycokeys.53.34761
|
The detection and identification of species of fungi in the environment using molecular methods heavily depends on reliable reference sequence databases. However, these databases are largely incomplete in terms of taxon coverage, and a significant effort is required from herbaria and living fungal collections for the mass-barcoding of well-identified and well-curated fungal specimens or strains. Here, a PacBio amplicon sequencing approach is applied to recent lichen herbarium specimens for the sequencing of the fungal ITS barcode, allowing a higher throughput sample processing than Sanger sequencing, which often required the use of cloning. Out of 96 multiplexed samples, a full-length ITS sequence of the target lichenised fungal species was recovered for 85 specimens. In addition, sequences obtained for co-amplified fungi gave an interesting insight into the diversity of endolichenic fungi. Challenges encountered at both the laboratory and bioinformatic stages are discussed, and cost and quality are compared with Sanger sequencing. With increasing data output and reducing sequencing cost, PacBio amplicon sequencing is seen as a promising approach for the generation of reference sequences for lichenised fungi as well as the characterisation of lichen-associated fungal communities.
SMRT sequencing, high-throughput sequencing, long amplicon analysis (LAA), lichenised fungi
Across the world, herbaria and fungal culture collections host large numbers of specimens and strains with the main goal being to document, preserve and classify the many species within the fungal kingdom. Among characters generally used to identify fungi, DNA sequences have been particularly useful for the numerous fungal species with plastic or convergent morphologies. Of the DNA markers traditionally used for fungal identification, the internal transcribed spacer region (ITS) was chosen as the primary barcode because it allowed species-level identification for the broadest range of fungi (
Although the characterisation of fungal diversity in environmental samples has benefited immensely from the development of next generation sequencing (NGS) technologies, the high-throughput generation of reference ITS sequences from herbarium specimens has been hindered by the short length of Illumina reads, the most commonly used NGS platform. With a size ranging from 500 up to 1200 bp, the full length of the ITS region cannot be sequenced as a single DNA fragment. For DNA extractions of lichen specimens, which harbor on their surface or within their thalli a plethora of other fungi, as well as for any other DNA samples of mixed fungal communities, assembling DNA markers from a pool of short reads belonging to several species carries a high risk of obtaining chimeric sequences (
Long-read sequencing allows us to circumvent these challenges. For lichenised fungi, early long amplicon sequencing studies made use of Roche 454 pyrosequencing technology, either for the full ITS region (
The scarcity of automated pipelines to analyse SMRT amplicon data for different applications, including community analysis, remains an issue (
In this study, the use of SMRT sequencing and the LAA pipeline for the production of ITS barcode sequences from lichen herbarium specimens was explored. The goals were: 1) to establish a high-throughput protocol to obtain indexed PCR products from lichen DNA extracts for SMRT sequencing; and 2) to investigate the ability of LAA to recover high-quality sequences for the target lichenised fungal species as well as for non-target fungal species.
A total of 96 specimens were selected for their frequent need for cloning as well as their relevance to several ongoing taxonomic studies on Australian lichens at the Australian National Herbarium (see Suppl. material
Each of the 96 cluster tubes contained a washed chrome steel bearing ball 3 mm in diameter (3MMCH/S/B grade 40, BSC Bearing & Power Transmission Solutions, Chullora, Australia). The samples were ground with a TissueLyser II (Qiagen, Hilden, Germany) in two cycles of 1 min and a frequency of 25/s. The cluster tubes were centrifuged for 1 min at 6,000 rpm and the caps were removed with care to avoid cross-contaminations. A lysis buffer was added to each tube. Genomic DNA was extracted using the Invisorb® DNA Plant HTS 96 kit (STRATEC Molecular, Berlin, Germany) following the manufacturer’s instructions, except for the last centrifugation step which was changed to 10 min at 2,000 rpm (instead of 5 min at 4,000 rpm) to avoid breaking the elution plates. A 1/10 dilution of the DNA extraction plate was prepared and subsequently used for amplification.
Indexed PCR products were generated using the PacBio Barcoded Universal Primers protocol (https://www.pacb.com/wp-content/uploads/2015/09/Procedure-and-Checklist-Preparing-SMRTbell-Libraries-PacB-Barcoded-Universal-Primers.pdf). The ITS barcode (internal transcribed spacer 1, 5.8S ribosomal RNA subunit and internal transcribed spacer 2) was the region targeted. With a first PCR, our target region was amplified using the primers ITS1F (
A second amplification was then performed using the Barcoded Universal F/R Primers Plate-96 available from PacBio (Millenium Science, Mulgrave, VIC, Australia). In a 25 µl reaction, 5 µl of buffer, 1 µl of MyFi, 2.5 µl of the barcoded primers, 15.5 µl of water and 1 µl of the product of the first round of PCR were added. The PCR programme was 3 min at 95 °C, then 20 cycles of 30 sec at 95 °C, 30 sec at 53 °C and 1:30 min at 72 °C, followed by a final elongation step for 7 min at 72 °C. All PCR products were checked on a gel as previously described. For samples showing no band, a new round of amplification (first PCR with ITS tailed primers and second PCR with barcoded universal primers) was performed using the non-diluted DNA extracts. Positive PCR products generated by this second round of amplification were used to substitute the negative samples on the original plate. The samples for which neither the dilution nor the original extract gave amplicons were left on the original plate. The 96 samples were cleaned as described above and their concentration measured with a Nanodrop 8000 spectrophotometer (Thermo Scientific, Waltham, MA, USA). All samples with a DNA concentration larger than 50 ng/µl were normalised manually to 50 ng/µl. One microliter of each of the 96 samples was pooled into a 1.5 ml Eppendorf tube.
The pooled sample (1.5 µg of DNA) was sent to the Ramaciotti Centre for Genomics (UNSW Sydney, Australia) for single molecular real-time (SMRT) sequencing. The sample met the quality control requirements and showed two clear peaks at 650 and 923 bp, which are within the expected size range for the ITS barcode. The library preparation was done using the PacBio Barcoded Universal Primers protocol (https://www.pacb.com/wp-content/uploads/2015/09/Procedure-and-Checklist-Preparing-SMRTbell-Libraries-PacB-Barcoded-Universal-Primers.pdf). The sample was sequenced in one SMRT cell on a PacBio RSII using a P6 chemistry with a four-hour movie. The raw data were analysed using Long Amplicon Analysis (LAA v. 1) with barcoding option on the SMRT Portal. This pipeline was run using the following settings: symmetric DNA barcodes with a minimum score of 22, a minimum subread length of 450 bp, a maximum number of subreads of 4,000, and default values for all other parameters. The phase alleles option was selected.
After demultiplexing, quality control and generation of consensus sequences using LAA, the amplicon data was provided in 96 files in fasta format. Sequences from the targeted lichenised fungal species were identified from fungal endophytes or contaminants based on sequence similarity using a BLASTN v 2.8.1 search on the nr database by excluding uncultured/environmental sample sequences (https://blast.ncbi.nlm.nih.gov/; search done on Dec 12–14 2018). The first BLASTN match (or highest bit score) was recorded unless it did not correspond to at least a fungal Class, in which case the next match was considered. For 12 samples, the target sequence obtained with PacBio was cross-validated by a sequence obtained with Sanger sequencing. The ITS barcode was amplified with the primers ITS1F and ITS4 using the DNA polymerase MyFi as previously described, and products were sent to Macrogen (Seoul, Korea) for purification and sequencing. PacBio and Sanger sequences were manually compared in Mesquite v 3.51 (
After the two amplification steps, 19 of the 96 samples did not show any visible product. For these samples, a new round of amplifications was performed using the non-diluted DNA extracts instead of the 1/10 dilution. This second amplification round generated products for 13 samples that were previously negative. These additional products were then used to substitute the corresponding negative samples on the original plate. For the five samples for which none of the amplification rounds was positive (sample barcode numbers 41, 45, 73, 84, 92, 93), the initial sample were not substituted and were pooled together with the positive samples for final submission.
A first SMRT cell was run with a 0.02 nM sample, but resulting ZMW (Zero-Mode Waveguide) productivity was poor (P1=12%). After filtering, only 17,582 polymerase reads were recovered for this first run, which corresponded to 247,626 subreads (Table
Productivity and sequence output from two SMRT runs with different loading concentration.
Loading concentration | ZMW productivity | Post-filter polymerase reads | Subreads | ||||||
---|---|---|---|---|---|---|---|---|---|
P0 | P1 | P2 | number | mean length | N50 | number | mean length | number of passes | |
0.02 nM | 86% | 12% | 2% | 17,582 | 15,234 bp | 27,799 bp | 247,626 | 1038 bp | 14.68 |
0.05 nM | 29% | 55% | 16% | 82,501 | 15,321 bp | 21,997 bp | 1,095,489 | 1111 bp | 13.79 |
Sequences were recovered for 89 of the 96 samples (see Suppl. material
For 46 samples, a single sequence was obtained for the target species among all the co-amplified fungal taxa. For the other samples, the clustering and phasing process resulted in 2–9 sequences that could be attributed to the target species using the BLASTN match. The similarity between these different sequence versions was investigated in an alignment. After reverse complement and barcode trimming, these sequence versions were in fact identical for an additional ten samples, bringing the total of successful target single-sequences to 56. A few sequence versions also differed with indels in long single nucleotide repeats (e.g., samples 27, 27, 28, 49, 64, 68, 69) and were the likely result of poor quality for low coverage consensus sequences. For the remaining samples, differences between sequence versions were more random (indels and single nucleotide mutations) and more likely due to genuine sequence heterogeneity within a sample, either due to biological reasons (concerted evolution or mixed individual) or carry-over contaminations. The percentage identity between the most divergent target sequence versions of each sample ranged between 94.78 and 99.98% (Suppl. material
In addition to the target species, sequences were also obtained for other fungi for 81 of the 96 samples. The great majority of these co-amplifying fungal sequences (479 out of 506 sequences) belonged to the ascomycetes, with Capnodiales and Chaetothyriales being the most commonly represented orders. Several sequences (23) belonged to the basidiomycetes, two to the chytridiomycetes and one to the mucoromycetes. Finally, one sequence matched Pirula salina, a Xanthophyceae algae from the order Tribonematales.
Sanger sequences were obtained for 12 samples. For five of these (3, 6, 8, 61 and 76), the Sanger sequence was entirely identical to the PacBio sequence. For five additional samples (5, 7, 58, 60 and 74), the Sanger sequence was identical to the PacBio sequence, but contained ambiguous bases or was slightly shorter due to missing bases at the 5’ or 3’ ends. For two samples (4 and 57), the Sanger sequences comprised one nucleotide difference compared to the PacBio sequences in addition to several ambiguous bases. These differences were located at the start or the end of the sequence and were likely due to poor quality chromatograms in the 5’ and 3’ regions of the Sanger sequences.
Collections of well-curated lichen specimens are important resources for species identification, whether done traditionally using morpho-anatomical together with chemical characters, or using DNA barcodes. For the latter, access to databases with high quality sequence data, detailed voucher information and reliable taxon names is critical. Although current reference sequence datasets exist, they are still incomplete. Thus, the molecular barcoding of lichen specimens for a broad range of species remains a priority (
The protocol used in this study enabled the high-throughput processing of samples and the bioinformatic pipeline permitted the recovery of high-quality sequences with minimum time and effort, because both the primary and secondary analyses could be carried out by the sequencing provider. For mixed lichen DNA extracts, SMRT sequencing is a cheaper option than cloning and Sanger sequencing when high number of samples need to be processed. The cost of producing a barcode Sanger sequence is generally around AUD$15/sample (including DNA extraction, amplification and sequencing costs), but when cloning is required, this cost can increase to AUD$100/sample. With the PacBio amplicon protocol used in this study and the PacBio RSII platform, the cost per sample was estimated at AUD$37. This cost could be further reduced by co-amplifying more markers and/or multiplexing more samples. With the new PacBio platform Sequel, multiplexing 384 samples for one marker would bring the cost down to AUD$12/sample. Moreover, with the recent increased output per SMRT cell (from 1 gigabase for RSII to 20 gigabases for the new PacBio platform Sequel), the cost per sample could fall even further with higher multiplexing levels.
The use of PacBio amplicon sequencing for high-throughput barcoding of lichen specimens is therefore very promising. Some challenges remain, both for the laboratory and the bioinformatic aspects of this method. Some considerations and suggestions for improvements based on this study are outlined below, especially with regard to the preparation of the amplicon library and the generation of ITS barcode sequences using the LAA pipeline.
A relatively high risk of carry-over contaminations is generally associated with two-step PCR amplicon protocols (Seitz et al. 2015), including the PacBio Barcoded Universal Primers protocol used in this study. In a preliminary test of this protocol (data not shown), output sequences suggested a significant number of potential carry-over contaminations. In light of these results, particular care was taken at various stages of the final run when handling the genomic DNA and PCR samples. More specifically, DNA extracts were stored in single tubes as opposed to a 96-well plate, and amplifications were carried out in strip tubes with individual caps instead of 96-well plates. In the final SMRT run, carry-over contaminations may have also occurred, but at a much lower level than the preliminary test run. Our current protocol could therefore be improved further. More specifically, the use of a PCR hood or post- and pre-PCR separation, as previously recommended (
The choice of polymerase has been shown to be important when using HT sequencing for rare allele detection (
In addition to polymerase misincorporation, chimera formation is another common PCR-induced source of error. Low starting template concentration and low amplification cycle numbers tend to reduce the rate of chimera formation (
Metagenomics and metabarcoding studies using two-step PCR protocols often include triplicate samples for amplification, which are then pooled and used as a template for the second PCR (e.g.,
The Long Amplicon Analysis pipeline has been developed by PacBio to phase and detect alleles in diploid organisms, and enables differentiation between underlying sequences that are 99.9% similar, such as haplotypes and pseudogenes (
Most target sequences had high accuracy and coverage values, and a subset of these was validated with Sanger sequences obtained from the same DNA extracts. Initially, the size variation among amplicons recovered was more than the 10% recommended by PacBio and was of potential concern. An excess of sequences for the shorter amplicons relative to the longer ones could have been problematic, because lichenised fungi often have long ITS regions due to the presence of introns. However, at this low level of multiplexing (one marker for 96 samples), the difference in amplicon size did not prevent the generation of high-quality sequences for the target species. Moreover, at a higher multiplexing level (5 genes for 96 samples), the sequencing bias due to amplicon size variation did not seem to influence the results (
A recent study identified some problems with the LAA pipeline, including the formation of a few incorrect or truncated sequences even at relatively high read depths (
PacBio amplicon sequencing is a promising approach for the metabarcoding of lichen specimens, and can be applied to the generation of reference sequences and the characterisation of lichen-associated fungal communities. Although restricted to specimens for which the genomic DNA is not overly degraded, this approach succeeded in generating full-length and high-quality ITS barcodes for specimens up to 25 years old. By scaling up the multiplexing level, this approach could significantly reduce the cost of barcode/sample and compete with Sanger sequencing as well as other NGS approaches.
The authors would like to thank Judith Curnow (Australian National Botanic Gardens, Canberra) for her help with specimen databasing and curation, Michel Bertrand and Jean-Yves Monnat for providing lichen specimens, and Lan Li (Australian National Herbarium, Canberra) for her help with the molecular work. We also thank Tonia Russell (Ramaciotti Centre for Genomics, UNSW Sydney, Sydney) and Pacific Biosciences for their advice with sample preparation and data analysis.
Table S1
Data type: measurement