Chromosome-level genome assembly of the Chinese longsnout catfish Leiocassis longirostris

2021-08-16 08:09:44 Wen-Ping He, Jian Zhou, Zhe Li, et al.

Zoological Research, 2021, Issue 4

The Chinese longsnout catfish (Leiocassis longirostris Günther) is one of the most economically important freshwater fish in China. As wild populations have declined sharply in recent years, it is also a valuable model for research on sexual dimorphism, comparative biology, and conservation. However, the current lack of high-quality chromosome-level genome information for the species hinders the advancement of comparative genomic analysis and evolutionary studies. Therefore, we constructed the first high-quality chromosome-level reference genome for L. longirostris. The total genome was 703.19 Mb, with 389 contigs and a contig N50 length of 4.29 Mb. Using high-throughput chromosome conformation capture (Hi-C) data, the genome sequences (685.53 Mb) were scaffolded into 26 chromosomes ranging from 17.36 to 43.97 Mb, resulting in a chromosomal anchoring rate for the genome of 97.44%. In total, 23 708 protein-coding genes were identified in the genome. Phylogenetic analysis indicated that L. longirostris and its closest related species P. fulvidraco diverged approximately 26.6 million years ago. This high-quality reference genome of L. longirostris should pave the way for future genomic comparisons and evolutionary research.

Leiocassis longirostris (also named Jiangtuan) belongs to the family Bagridae, which contains more than 220 species (Ferraris, 2007), and the order Siluriformes. It is a semi-migratory and commercially important freshwater species endemic to China, especially the Huaihe, Liaohe, Minjiang, Yangtze, and Pearl rivers, and the western regions of the Korean Peninsula (Shen et al., 2014; Wang et al., 2006; Zhu et al., 2005). In recent years, wild populations of L. longirostris have experienced a rapid decline due to over-fishing, water pollution, hydropower construction, and other human activities (Liang et al., 2016; Luo et al., 2000; Wang et al., 2006; Xiao & Yang, 2009). Thus, to facilitate conservation and evolutionary research, we constructed the first high-quality chromosome-level reference genome for L. longirostris using BGISEQ-500, Nanopore, and high-throughput chromosome conformation capture (Hi-C) technologies.

One healthy adult female L. longirostris (Figure 1A) collected from a farm at the Sichuan Academy of Agricultural Sciences in Meishan, Sichuan Province, China, was used for genome sequencing. Muscle tissue was collected for DNA extraction after treatment with the anesthetic tricaine MS-222. Genomic DNA for BGISEQ-500 and Nanopore sequencing was isolated using standard chloroform-isoamyl alcohol extraction procedures (Sambrook et al., 1989). DNA quality and quantity were measured using a NanoDrop One UV-Vis spectrophotometer (Thermo Fisher Scientific, USA) and Qubit 3.0 fluorometer (Invitrogen, USA), respectively.

A DNA library (200–400 bp insert size) was constructed following the manufacturer’s instructions as described in a previous study (Huang et al., 2017). The library was then sequenced following the BGISEQ-500 protocols (Huang et al., 2017). The short-read data obtained from the BGISEQ-500 platform were filtered using SOAPnuke v1.5.2 (Chen et al., 2018). The adapter sequences were removed from the reads, and paired reads with more than 10% ambiguous or low-quality (Phred score<5) bases were discarded, with BLAST v2.2.31 applied for the evaluation of sample contamination (Altschul et al., 1990). As a result, we obtained a total of 64.11 Gb of short reads (Supplementary Table S1). The K-mer frequency distribution for L. longirostris was calculated using Jellyfish v2.2.6 (Marçais & Kingsford, 2011) with a K-mer size of 17 (Supplementary Figure S1), and the Jellyfish results were then passed to GenomeScope (Vurture et al., 2017). As a result, the genome size of L. longirostris was estimated to be 688.99 Mb, with heterozygosity, repeat content, and GC content of 0.35%, 42.53%, and 38.43%, respectively.
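The idea behind K-mer-based genome size estimation can be illustrated with a minimal Python sketch. It is not the Jellyfish or GenomeScope implementation: it counts K-mers naively, ignores reverse complements, and uses the simple "total K-mers divided by peak depth" rule rather than GenomeScope's fitted mixture model. The function names and the `min_depth` error cutoff are assumptions made for illustration only.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map each depth to the number of distinct K-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as (total K-mers) / (depth of the main peak).

    K-mers with depth below min_depth are assumed to be sequencing
    errors and are ignored (a hypothetical cutoff, not from the paper).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no K-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth
```

With real data, the histogram would come from the K-mer counter's output for K=17 rather than from `count_kmers`, which is far too slow and memory-hungry for tens of gigabases of reads.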

Figure 1 Genome analysis of L. longirostris

For Nanopore sequencing, we prepared a library using a Ligation Sequencing Kit (Oxford Nanopore Technologies, UK, SQK-LSK109) according to the manufacturer’s instructions. The library was sequenced using the Nanopore GridION X5 sequencer (Oxford Nanopore Technologies, UK) with flow cell R9.4 on five flow cells. Base calling was performed using Guppy v2.0.8 with default parameters, and reads were filtered for mean_qscore_template ≥7. NanoPlot v1.0.0 (De Coster et al., 2018) was then used to filter the Nanopore reads. For the construction of the Hi-C library, 1 g of muscle tissue was used to prepare a library according to previously established protocols (Rao et al., 2014). The library was then sequenced on a BGISEQ-500 sequencer (BGI Genomics, China) using 100 bp paired-end sequencing.
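A read-level quality filter of the kind described above (mean Q-score ≥7) can be sketched as follows. This is an illustrative stand-in, not the Guppy or NanoPlot code. It assumes Phred+33 quality strings and assumes the mean is taken over per-base error probabilities rather than over the raw Q-scores; the function names and record layout are hypothetical.

```python
import math


def mean_qscore(quality_string, offset=33):
    """Mean read quality, averaged in error-probability space.

    Each character encodes a Phred score Q = ord(c) - offset, and the
    per-base error probability is 10 ** (-Q / 10).
    """
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    mean_prob = sum(probs) / len(probs)
    return -10 * math.log10(mean_prob)


def filter_reads(records, min_q=7.0):
    """Keep (name, sequence, quality) records with mean Q-score >= min_q."""
    return [rec for rec in records if mean_qscore(rec[2]) >= min_q]
```

Averaging error probabilities penalizes a few very poor bases more strongly than averaging Q-scores directly, so a read with mixed quality scores lower under this definition than its arithmetic mean Q would suggest.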

For transcriptome sequencing, the liver tissues of 15 L. longirostris individuals collected from the same farm were used for RNA extraction with TRIzol reagent (Invitrogen, USA), followed by treatment with DNase I (Invitrogen, USA) to remove genomic DNA. RNA concentration and integrity were measured using a Qubit RNA Assay Kit and Qubit 2.0 fluorometer (Life Technologies, USA) and an RNA Nano 6000 Assay Kit with the Agilent Bioanalyzer 2100 system (Agilent Technologies, USA), respectively. Three RNA sequencing libraries (five fish per library) with an insert size of 250–300 bp were prepared using a NEBNext Ultra RNA Library Prep Kit for Illumina (NEB, USA) following the manufacturer’s protocols, and then sequenced on the Illumina HiSeq X Ten platform (Illumina Inc., USA) as 150 bp paired-end reads. The raw RNA-seq reads were cleaned and assembled as described previously (Ye et al., 2018).

Using the Nanopore sequencing platform, we obtained 43.23 Gb of long reads, with an expected average sequencing coverage of 61.48× for genome assembly (Supplementary Table S1). We then performed de novo genome assembly using Canu v1.8 (Koren et al., 2017) following the correction, trimming, and contig construction steps. After contig assembly, three rounds of contig sequence polishing were performed with cleaned genomic short reads using Pilon v1.23 (Walker et al., 2014). Purge Haplotigs v1.0.3 (Roach et al., 2018) was used to produce an improved and deduplicated assembly. Finally, we obtained the assembled genome of L. longirostris, which was 703.19 Mb in length, with 389 contigs and an N50 contig size of 4.29 Mb. This is a medium-sized genome among other sequenced catfish genomes (Table 1; Supplementary Table S2). We performed genome assembly quality control using the distribution of GC_depth. The GC_depth scatter plots demonstrated a Poisson distribution, indicating that this genome had no significant contamination. The overall GC content of 39.67% in the L. longirostris genome was slightly higher than that of the walking catfish (Clarias batrachus) (Li et al., 2018) and common carp (Cyprinus carpio) but much lower than that of most teleost genomes (Xu et al., 2014). The completeness of the assembled L. longirostris genome was estimated using BUSCO v3.0.2 (Simão et al., 2015) with the actinopterygii_odb9 database. As a result, 4 293 (93.6%) of the 4 584 BUSCO genes were completely identified in the genome, including 4 109 (89.6%) single-copy and 184 (4.0%) duplicated genes. These results suggest high genome assembly completeness.
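Two of the summary statistics above, sequencing coverage and contig N50, follow simple definitions that the short Python sketch below makes explicit. The function names are illustrative; only the input figures (43.23 Gb of reads, 703.19 Mb assembly) come from the text.

```python
def coverage(total_bases, genome_size):
    """Average sequencing depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


def n50(lengths):
    """Return the N50 of a set of contig (or scaffold) lengths.

    The N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Coverage reported in the text: 43.23 Gb of reads over a 703.19 Mb assembly.
print(round(coverage(43.23e9, 703.19e6), 2))  # 61.48
```

The same `n50` function applies to contigs and to chromosome-level scaffolds; only the list of lengths differs.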

For chromosome-level assembly of the L. longirostris genome, Hi-C reads were first filtered using HiC-Pro v2.8.0 (Servant et al., 2015). Juicer v1.5 (Durand et al., 2016a) was then used to analyze the Hi-C datasets, and 3D-DNA v170123 was used to anchor the genome assembly to the chromosomes (Dudchenko et al., 2017) with parameters “-m haploid -s 0 -c 26”. The contact matrix of the L. longirostris contigs was mapped using Juicebox v1.11.08 (Durand et al., 2016b) (Figure 1B). A total of 126.35 Gb of clean Hi-C reads were obtained, and 685.53 Mb (97.44% of the total genome) of genome sequence was successfully scaffolded into 26 pseudochromosomes. The number of chromosome scaffolds is consistent with previous research on karyotypes of L. longirostris (2n=52; Hong & Zhou, 1984). The lengths of chromosomes ranged from 17.36 Mb to 43.97 Mb (Supplementary Table S3). The scaffold N50 of the chromosome-level assembly was 28.03 Mb (Table 1).

For the annotation of repetitive sequences, we used RepeatModeler v1.0.10 (Bao & Eddy, 2002), which employs two complementary computational methods, i.e., RECON v1.08 and RepeatScout v1.0.5 (RepeatScout, RRID:SCR_014653) (Price et al., 2005), to identify repeat element boundaries and family relationships from sequence data. Subsequently, the RepeatModeler output and the RepBase v21.01 library were combined and used for further characterization of transposable elements (TEs), many of which are not repetitive, and other repeats by homology-based methods, including identification with RepeatMasker v4.0.7, rmblast-2.2.28 (RRID:SCR_012954). Using RepBase-based homology and de novo methods, 239.11 Mb (33.99% of the total genome) of repetitive elements were identified, with DNA transposons (146.40 Mb, 20.81%) being the most abundant type in the genome (Supplementary Table S4-1). The proportion of repetitive elements in L. longirostris is similar to that in the Glyptosternon maculatum genome (33.96%) (Liu et al., 2018) and higher than that of most teleost genomes (Supplementary Table S4-2).

Combined homology-, de novo-, and transcriptome-based methods were used for gene prediction in the genome. The protein sequences of nine fish species, including Danio rerio, Gasterosteus aculeatus, Ictalurus punctatus, Larimichthys crocea, Oreochromis niloticus, Oryzias latipes, Pangasianodon hypophthalmus, Tachysurus fulvidraco, and Takifugu rubripes, were downloaded from the Ensembl database and mapped onto the assembled L. longirostris genome using BLASTN. Subsequently, GeneWise v2.2.0 (Birney et al., 2004) with default options was used for homologous annotation. For de novo prediction, Augustus v3.1.0 (Stanke & Waack, 2003) was used to predict gene models. In addition, RNA-seq data were aligned to the assembled L. longirostris genome to predict gene coding regions. The gene models were then predicted by combining the above homology-, de novo-, and transcriptome-based information using PASA v2.3.3 (Haas et al., 2003). Various databases, including SwissProt (Boeckmann et al., 2003), Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa & Goto, 2000), TrEMBL (Boeckmann et al., 2003), InterPro (Zdobnov & Apweiler, 2001), and Gene Ontology (GO) (Ashburner et al., 2000), were used to functionally annotate the predicted protein-coding genes, and GLEAN (Elsik et al., 2007) was used to create a consensus gene set. Finally, a total of 23 708 protein-coding genes were identified in the L. longirostris genome (Supplementary Table S5), of which 21 692, 20 072, 23 114, 21 169, and 16 638 protein-coding genes were annotated in the SwissProt, KEGG, TrEMBL, InterPro, and GO databases, respectively (Supplementary Table S6 and Figure S2). BUSCO was also used to test the completeness of the genome annotation with the actinopterygii_odb9 database, which showed that 92.4% complete and 4.0% fragmented conserved single-copy orthologs were predicted for L. longirostris.

Table 1 Summary of sequenced catfish genomes

For non-coding RNAs, microRNA (miRNA) and small nuclear RNA (snRNA) were predicted using INFERNAL v1.1 (Nawrocki & Eddy, 2013) and the Rfam database (Kalvari et al., 2018). Transfer RNA (tRNA) and ribosomal RNA (rRNA) were identified using tRNAscan-SE v1.3.1 (Lowe & Eddy, 1997) and RNAmmer v1.2 (Lagesen et al., 2007), respectively. After analysis, 422 miRNAs, 2 118 tRNAs, 1 838 rRNAs, and 1 925 snRNAs were annotated in the L. longirostris genome (Supplementary Table S7).

To identify gene families, protein sequences from the longest transcripts of each gene from L. longirostris and 10 other fish species, including D. rerio, Astyanax mexicanus, G. aculeatus, G. maculatum, I. punctatus, Lepisosteus oculatus, Oreochromis niloticus, Oryzias latipes, Pelteobagrus fulvidraco, and T. rubripes, were aligned using BLASTP with an e-value threshold of 1e-5. OrthoMCL v1.4 (Li et al., 2003) was then used to construct gene families. A total of 19 438 gene families and 3 585 single-copy ortholog families were identified among the 11 species, with 68 gene families specific to L. longirostris (Supplementary Table S8). In addition, 11 729 (89.1%) gene families were shared by the four catfish species, with 301 gene families specific to L. longirostris (Supplementary Figure S3).
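The general shape of this step, filtering all-against-all BLASTP hits by e-value, grouping proteins into families, and then picking out single-copy families, can be sketched in Python. This is a deliberately simplified stand-in: OrthoMCL uses Markov clustering on normalized similarity scores, whereas the sketch below groups proteins by plain connected components. All function names and the input layout are assumptions for illustration.

```python
def cluster_families(hits, max_evalue=1e-5):
    """Group proteins into families from (query, subject, evalue) hits.

    Proteins linked by a hit with evalue <= max_evalue end up in the same
    family (connected components via union-find). This is a simplification
    and not the Markov clustering that OrthoMCL performs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        root_q, root_s = find(query), find(subject)
        if evalue <= max_evalue and root_q != root_s:
            parent[root_q] = root_s

    families = {}
    for protein in list(parent):
        families.setdefault(find(protein), set()).add(protein)
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: members[0])


def is_single_copy(family, species_of, all_species):
    """True if the family has exactly one protein from every species."""
    return sorted(species_of[p] for p in family) == sorted(all_species)
```

For example, with hits `[("a1", "b1", 1e-30), ("b1", "c1", 1e-10), ("a2", "b2", 1e-3)]`, the first two hits pass the cutoff and join `a1`, `b1`, and `c1` into one family, while `a2` and `b2` remain singletons.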

To investigate the phylogenetic relationships of L. longirostris with the above 10 fish species, the shared single-copy genes were aligned by MUSCLE v3.8.31 (Edgar, 2004). RAxML v8.2.1163 (Stamatakis, 2014) was then employed to construct a phylogenetic tree with the -m PROTGAMMAAUTO model and 100 bootstrap replicates. MCMCTREE v3.8.31 (Yang, 2007) was used to estimate divergence time based on the “correlated molecular clock” and “HKY85” models. Phylogenetic analysis indicated that L. longirostris and P. fulvidraco, which are both from the family Bagridae, were clustered onto one branch, and L. longirostris was close to the P. fulvidraco, G. maculatum, and I. punctatus clades, which belong to the Siluriformes order. These results are similar to previous phylogenetic analyses based on the mitochondrial genome of L. longirostris (Liu et al., 2019). Our results also showed that L. longirostris diverged approximately 26.2 million years ago from its closest related species P. fulvidraco (Figure 1C). Furthermore, phylogenetic analysis estimated that I. punctatus diverged from P. fulvidraco around 82.2 million years ago, consistent with the 81.9 million years reported in a previous study (Gong et al., 2018). Collinearity analysis of chromosomes between L. longirostris and I. punctatus was performed using LASTZ v1.02.00 (Harris, 2007) with parameters “T=2 C=2 H=2000 Y=3400 L=6000 K=2200”. As a result, all 26 pseudochromosomes of L. longirostris displayed high homology with the corresponding chromosomes of I. punctatus (Figure 1D), suggesting a high-quality L. longirostris genome assembly.

In the present study, the first chromosome sequences for L. longirostris were constructed using a combination of BGISEQ-500, Nanopore, and Hi-C technologies. The reference genome exhibited high quality in terms of continuity and completeness. This study should improve our understanding of the L. longirostris genome and provide valuable chromosomal information for genomic comparisons and evolutionary research among important aquaculture species.

DATA AVAILABILITY

The raw genome and RNA sequencing data were deposited in the National Center for Biotechnology Information (NCBI) database under accession No. PRJNA692071.

SUPPLEMENTARY DATA

Supplementary data to this article can be found online.

COMPETING INTERESTS

The authors declare that they have no competing interests.

AUTHORS’ CONTRIBUTIONS

W.P.H., H.L., J.Z., and H.Y. designed the experiments; W.P.H., H.L., J.Z., Z.L., T.S.J., C.H.L., Y.J.Y., M.B.X., and C.W.Z. performed the experiments and analyzed data; W.P.H., G.J.L., H.Y.X., and H.Y. wrote the paper. All authors read and approved the final version of the manuscript.
