Research Article
Corresponding author: Cécile Gueidan (cecile.gueidan@csiro.au)
Academic editor: Imke Schmitt
© 2022 Cécile Gueidan, Lan Li.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Gueidan C, Li L (2022) A long-read amplicon approach to scaling up the metabarcoding of lichen herbarium specimens. MycoKeys 86: 195-212. https://doi.org/10.3897/mycokeys.86.77431
Reference sequence databases are critical to the accurate detection and identification of fungi in the environment. As repositories of large numbers of well-curated specimens, herbaria and fungal culture collections have the material resources to generate sequence data for large numbers of taxa, and could therefore help fill the taxonomic gaps often present in reference sequence databases. The financial resources to do so are, however, often lacking, so recent efforts have focused on decreasing sequencing cost by increasing the number of multiplexed samples per sequencing run while maintaining high sequence quality. Following a previous study that aimed at decreasing sequencing cost for lichen specimens by generating fungal ITS barcodes for 96 specimens using PacBio amplicon sequencing, we present a method that further decreases lichen specimen metabarcoding costs. A total of 384 mixed DNA extracts obtained from lichen herbarium specimens, mostly from the four genera Buellia, Catillaria, Endocarpon and Parmotrema, were used to generate new fungal ITS sequences using a Sequel I sequencing platform and the PacBio M13 barcoded primers. The average success rate across all taxa was high (86.5%), with particularly high rates for the crustose saxicolous taxa (Buellia, Catillaria and others; 93.3%) and the terricolous squamulose taxa (Endocarpon and others; 96.5%). In contrast, the success rate for the foliose genus Parmotrema was lower (60.4%). With this taxon sampling, greater specimen age did not appear to impact sequencing success. In fact, the 1966–1980 collection date category showed the highest success rate (97.3%). Compared to the previous study, the abundance-based sequence denoising method showed some limitations, but the cost of generating ITS barcodes was further decreased thanks to the higher multiplexing level.
In addition to contributing new ITS barcodes for specimens of four interesting lichen genera, this study further highlights the potential and challenges of using new sequencing technologies on collection specimens to generate DNA sequences for reference databases.
Collection specimens, ITS barcode, lichenised fungi, PacBio amplicon sequencing
Reference nucleotide sequence databases aim at providing access to curated and high-quality nucleotide sequences representing a broad taxonomic range of living organisms. They are critical to the accurate detection and identification of organisms from environmental samples and, for organisms lacking diagnostic characters, they are a useful tool to confirm morphology-based identifications. In fungi, the internal transcribed spacer region (ITS) has historically been used for species-level molecular identification (
Taking advantage of the development of next generation sequencing (NGS) methods, large numbers of fungal ITS sequences have been generated over the last ten years. Fungal metabarcoding studies that detect and identify fungi in environmental samples based on inferred operational taxonomic units have mostly generated partial ITS sequences, either ITS1 or ITS2 (
Although used for whole genome sequencing of lichen metagenomes (
The main goals of this study were to 1) assess the current cost and efficiency of a PacBio metabarcoding method applied to lichen herbarium specimens following changes in laboratory and bioinformatic pipelines, and 2) generate high-quality ITS sequences to contribute to reference sequence databases, as well as to molecular taxonomic studies of several lichen groups.
For this study, 384 lichen specimens were selected because of their importance to several ongoing taxonomic works on Australian lichens at the Australian National Herbarium (see Suppl. material
Examples of lichen herbarium specimens used for this study A Parmotrema perlatum, specimen J.A. Elix 43686 (CANB790817) B Endocarpon pusillum, specimen H. Streiman 45100 (CBG9011273) C Buellia albula, specimen J.A. Elix 45138 (CANB810791) D Catillaria sp., specimen J.A. Elix 37142 (CANB872684). Scale bar: 1 cm. Photos C. Gueidan.
The samples were ground with a Precellys Evolution (Bertin Instruments, Montigny-le-Bretonneux, France) in 2–3 cycles of 30 sec at 6,000 rpm. To avoid cross-contamination, the tubes were briefly centrifuged before the caps were removed. Genomic DNA was extracted using the Invisorb DNA Plant HTS 96 kit (Stratec Molecular, Berlin, Germany) following the manufacturer’s instructions, with the following modifications. The lysis buffer and proteinase K were added to each tube of ground material, and the tubes were then manually homogenised and incubated at 65 °C for 1 hour. The tubes were centrifuged at 11,000 rpm for 2 min and the supernatants were transferred onto the 96-well prefilter plate using a width-adjustable multichannel pipette. The RNase A (40 µl/well of a 10 mg/ml solution) was added after the prefiltration step and the tubes were incubated at room temperature for 15–20 min before adding the binding buffer. The last centrifugation step was changed to 10 min at 2,000 rpm (instead of 5 min at 4,000 rpm) to avoid breaking the elution plates. The DNA was eluted in 100 µl of elution buffer and 1/10 dilutions of the DNA samples were prepared.
Indexed PCR products were generated using a 2-step PCR approach as described in the PacBio Barcoded Universal Primers protocol (https://www.pacb.com/wp-content/uploads/2015/09/Procedure-and-Checklist-Preparing-SMRTbell-Libraries-PacB-Barcoded-Universal-Primers.pdf), with a few modifications. The target region was the fungal ITS barcode (internal transcribed spacer 1, 5.8S ribosomal RNA subunit and internal transcribed spacer 2). In a first PCR, the target region was amplified using the primers ITS1F (
A second amplification was then performed using part of a set of 64 barcoded M13 primers (32 forward and 32 reverse) provided by PacBio (Menlo Park, CA, USA). The barcode sequences were 16 bp long (see Suppl. material
The pooled sample was sent to the Ramaciotti Centre for Genomics (UNSW Sydney, Australia) for single molecule real-time (SMRT) sequencing. The library preparation was done using the SMRTbell Template Prep Kit v. 1.0 (Pacific Biosciences, Menlo Park, CA, USA). The sample was sequenced in one SMRT cell with a ten-hour movie, using the Sequel Binding Kit v. 3.0 and the Sequel Sequencing Plate v. 3.0 (Pacific Biosciences). The subread bam file provided by Ramaciotti was generated using SMRT Link v. 6.0 (Pacific Biosciences). This subread bam file was then demultiplexed using the “lima” command in SMRT Tools v. 7.0.1 (Pacific Biosciences), and the circular consensus sequences (CCSs) were generated using the “ccs” command (0.9999 minimum predicted accuracy and 3 minimum passes).
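The two CCS thresholds quoted above can be read on the more familiar Phred scale. The following is a minimal sketch, not the implementation of the “ccs” command: the function names are hypothetical and the only values taken from the text are the 0.9999 minimum predicted accuracy and the 3 minimum passes.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a predicted read accuracy (0 <= accuracy < 1) to a Phred-scaled quality."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def passes_ccs_filter(accuracy: float, n_passes: int,
                      min_accuracy: float = 0.9999, min_passes: int = 3) -> bool:
    """Hypothetical restatement of the two thresholds quoted in the text."""
    return accuracy >= min_accuracy and n_passes >= min_passes


if __name__ == "__main__":
    # A predicted accuracy of 0.9999 is one expected error in 10,000 bases (about Q40).
    print(round(accuracy_to_phred(0.9999), 2))
```

Under these assumptions, a read with a predicted accuracy of 0.9999 but only two passes would be rejected, while a read with the same accuracy and three passes would be kept.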
Generated CCSs were denoised using DADA2 v. 1.14 (
The blastn output was parsed into a single text file using a custom script and the results checked manually. Sequencing was considered successful if one of the generated sequence variants matched the same genus as the target taxon. For the unsuccessful samples, the fastq ccs files were converted to fasta using the fastqtofasta command in fastx 0.0.14, and an additional blastn query (BLASTN 2.12.0+) was performed on the ccs files using the same parameters as above. The blastn results were checked manually and sequencing was considered successful if one of the CCSs matched the same genus as the target taxon. Demultiplexed fastq files were deposited in the Sequence Read Archive on NCBI (BioProject ID PRJNA796455).
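The success criterion described above (at least one hit from the same genus as the target taxon) can be sketched as follows. This is not the custom script used in the study, which is not described here. The tab-separated layout (sample ID, variant ID, hit title), the function names and the sample IDs in the usage lines are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def genera_hit_per_sample(blast_tsv: str) -> dict:
    """Collect, per sample, the genus (first word of the hit title) of every hit.

    Assumed row layout (hypothetical): sample_id <TAB> variant_id <TAB> hit_title.
    """
    hits = defaultdict(set)
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        sample_id, hit_title = row[0], row[2]
        words = hit_title.split()
        if words:
            hits[sample_id].add(words[0])
    return dict(hits)


def sequencing_success(target_genus: dict, hits: dict) -> dict:
    """A sample counts as successful if any hit shares the genus of its target taxon."""
    return {sample: genus in hits.get(sample, set())
            for sample, genus in target_genus.items()}


if __name__ == "__main__":
    # Made-up rows and sample IDs, for illustration only.
    example = (
        "S001\tvar1\tBuellia albula voucher X\n"
        "S001\tvar2\tCladosporium sp.\n"
        "S002\tvar1\tCladosporium sp.\n"
    )
    targets = {"S001": "Buellia", "S002": "Parmotrema"}
    print(sequencing_success(targets, genera_hit_per_sample(example)))
```

A genus-level match on the hit title is a coarse filter, which is consistent with the manual checking of results reported in the text.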
The two-step amplification approach generated PCR products with concentrations ranging from 11 to 1,573 ng/µl (Suppl. material
Using SMRT Tools, CCSs were recovered for 372 of the 384 samples, with only 12 samples for which no reads were generated (Suppl. material
When divided into three main morphological groups of taxa (Fig.
Sequencing success for different morphological groups of taxa included in this study. Specimens were grouped into three main morphological categories: 1 Buellia, Catillaria and other crustose saxicolous taxa 2 Endocarpon and other squamulose terricolous taxa 3 the foliose corticolous genus Parmotrema. In the graph, stacked columns show successful samples (sequence generated for the target species) in dark grey and unsuccessful samples (no sequence generated or generated sequences not from the target species) in light grey. The total number of samples (N) is indicated below each corresponding column.
Building upon a previous work (
Sequencing success for different ages of specimens included in this study. Specimens were grouped in five categories: 1966–1980, 1981–1990, 1991–2000, 2001–2010, 2011–2020. In the graph, stacked columns show successful samples (sequence generated for the target species) in dark grey and unsuccessful samples (no sequence generated or generated sequences not from the target species) in light grey. The total number of samples (N) is indicated below each corresponding column.
ITS sequences were successfully generated for 332 of the 384 herbarium specimens included in this study. Most of the specimens included belonged to the four genera Buellia, Catillaria, Endocarpon and Parmotrema. The success rate for the sequencing of the target ITS barcode was high (an average of 86.5% across all taxa) and similar to the one reported in
The genomic DNA of some groups of lichens, most often from crustose corticolous tropical families (e.g.,
PacBio long-read sequencing is a powerful approach which, when applied to amplicons, can use circular sequencing to generate high-quality consensus sequences of shorter nucleotide fragments. To correct sequencing errors, subreads extracted from one polymerase read, and therefore generated from a single amplicon molecule, are aligned and assembled into one circular consensus sequence (CCS). Following CCS generation, additional software and pipelines are available to further correct sequencing errors, a step often called denoising. Although several software packages are available for denoising Illumina amplicon data (e.g., unoise,
Although DADA2 generated target sequence variants for a large number of our samples, a substantial number of samples (70) did not yield sequence variants from the target taxon even though one to several of their CCSs matched it. For error correction, DADA2 is trained on a pool of sequences and uses sequence abundance to discriminate between sequencing error and true sequence variation. In our case, the sequence pools corresponded to each of the 384 samples and were rather small due to the high level of multiplexing (average of 229 CCSs per sample/pool). In addition, among the CCSs available for each pool, in particular for the samples for which DADA2 did not recover the target taxon, the target CCSs were in low abundance within a large pool of lichen-associated fungal sequences or contaminant sequences. Because DADA2 error correction is based on sequence abundance, sequence variants are only inferred for high-abundance sequences. It is therefore not fully applicable to the metabarcoding of lichen herbarium specimens, or at least not when sequences of associated fungi are abundant. In this case, denoising methods that are not based on sequence abundance may perform better.
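The effect described above can be illustrated with a deliberately simplified toy example. It is not DADA2's error model: it only keeps unique sequences whose count reaches a hypothetical threshold (`min_abundance`), which is enough to show how a rare target sequence can be lost in a pool dominated by an associated fungus. The sequences and counts are made up.

```python
from collections import Counter


def abundant_variants(reads, min_abundance=2):
    """Toy abundance filter: keep unique sequences seen at least min_abundance times.

    This is an illustrative stand-in, not the DADA2 algorithm.
    """
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_abundance}


if __name__ == "__main__":
    # Made-up pool: an associated fungus dominates, the target occurs once.
    associated_fungus = "ACGTACGTAC"
    target_taxon = "TTGCAATTGC"
    pool = [associated_fungus] * 8 + [target_taxon]

    kept = abundant_variants(pool)
    print(kept)                  # only the dominant sequence is retained
    print(target_taxon in kept)  # the rare target is dropped
```

In such a pool, checking the individual CCSs directly, as was done for the unsuccessful samples in this study, can still recover the target.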
In terms of sequencing efficiency, with an average sequencing success rate of 86.5%, the new M13 amplicon sequencing protocol from PacBio is comparable to the protocol used in
With an average sequencing success of 86.5%, this long-read amplicon sequencing method is confirmed as a potential alternative to Sanger sequencing for the generation of full-length and high-quality DNA barcodes from mixed DNA samples extracted from lichen specimens. It performed particularly well for crustose saxicolous (93.3% success) and squamulose terricolous (96.5% success) taxa. In terms of cost (AU$27/sample), although still more expensive than Sanger sequencing, it allows the recovery of high-quality sequences even when other lichen-associated fungi amplify as well, eliminating the need for gel separation or cloning. At a high multiplexing level (more than 500 samples/run), this high-throughput method is therefore an attractive option for the generation of DNA barcodes from large numbers of herbarium specimens.
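As a quick check of the headline figure, the 86.5% average follows from the counts reported in this study (ITS sequences generated for 332 of the 384 specimens). The helper below is a hypothetical convenience function, shown only to make the arithmetic explicit.

```python
def success_rate(n_success: int, n_total: int) -> float:
    """Return the sequencing success rate as a percentage."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_success / n_total


if __name__ == "__main__":
    # Counts taken from the text: 332 successful specimens out of 384.
    print(round(success_rate(332, 384), 1))
```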
The authors would like to thank Judith Curnow (CANB, Canberra) for her help with specimen databasing and curation, and the BM and NSW herbaria as well as the following collectors and herbaria for providing specimens used in this study: A. Aptroot (ABL), M. Bertrand, J.A. Elix (CANB), G. Kantvilas (HO), M. Mallen-Cooper (UNSW), P.M. McCarthy, B. McCune, R.W. Purdie (CANB), C. Roux (MARSSJ) and D. Stone. They would also like to acknowledge the contribution of Cameron Jack (ANU, Canberra) and Nunzio Knerr (CANB, Canberra) to various bioinformatic scripts. They also thank Tonia Russell (Ramaciotti, UNSW, Sydney) and Cheryl Heiner (PacBio, USA) for their advice with sample preparation and sequencing.
Table S1. List of specimens used for this study, including their voucher information, plate location, indexing, amplicon concentration and sequencing results, both as an output from SMRT tools (CCSs) and as an output from DADA2 (sequence variants). Table S2. List of the 64 barcode sequences used to index the samples. Used barcode pairs are listed in Table S1
Data type: Taxon sampling
Explanation note: List of specimens used for this study, including their voucher information, plate location, indexing, amplicon concentration and sequencing results, both as an output from SMRT tools (CCSs) and as an output from DADA2 (sequence variants). A summary of the blast results for the sequence variants is also listed for each sample. In the "recovered target" column, samples for which the target sequence was recovered from the CCS file but not the sequence variant file are indicated by a star.