CHEN Sheng Lin, KANG Yu Tong, LIANG Yi He, QIU Xiao Tong, and LI Zhen Jun,#
1.School of Public Health, Shanxi Medical University, Taiyuan 030001, Shanxi, China; 2.State Key Laboratory for Infectious Disease Prevention and Control, National Institute for Communicable Disease Control and Prevention,Chinese Center for Disease Control and Prevention, Beijing 102206, China
Abstract Objective A core genome multilocus sequence typing (cgMLST) scheme to genotype and identify potential risk clonal groups (CGs) in Proteus mirabilis.
Key words: Proteus mirabilis; CgMLST; Genotyping; Clonal evolution; ChewBBACA
Proteus mirabiliscauses complex urinary tract infections, which may lead to sepsis or systemic inflammatory response syndrome, with a fatality rate of 20%–50%[1].P.mirabilis, the most common pathogen of the genusProteus, is a member of theEnterobacteriaceaethat causes urinary system infections, second only toEscherichia coli.P.mirabilisaccounts for 3% of nosocomial infections[2].With the widespread irregular use of antibiotics,P.mirabilisantibiotic resistance has increased.Multidrug-resistantP.mirabilisproducing broad-spectrum β-lactamase was reported for the first time in Italy[3].Subsequently,Poland, China, Japan, and other countries and regions have reported multidrug-resistantP.mirabilis[4-8], showing global spread.Due to the continuous increase in antibiotic resistance, the prevention and control ofP.mirabilisinfections is challenging.Therefore, high-resolutionP.mirabilistyping is needed.
High-resolution sequence typing is important to assess the epidemiological links and identify possible sources of transmission during outbreaks.Currently,multilocus sequence typing (MLST)[9,10]and pulsedfield gel electrophoresis (PFGE)[11-13]are the two most commonly used typing methods for outbreak investigations.However, their main disadvantages are the low resolution in MLST[14,15]and difficulties in standardization of PFGE[16].To the best of our knowledge, there are few studies on MLST ofP.mirabilis, and most studies have focused on MLST of carbapenem-resistant Gram-negative clinical isolates containingP.mirabilis[9,10].
In recent years, with the continuous development of genomic epidemiology, the geneby-gene (GbG) method has been advocated as an extension of MLST[17].This method has the advantages of portability, scalability, and independence.It is regarded by PulseNet International as an important method for the identification of bacterial strains by highthroughput sequencing.The open source chewBBACA algorithm[18]provides a simple bioinformatics pipeline to construct cgMLST schemes of target strains.To typeP.mirabilisglobally and monitor the clonal groups (CGs) of potentially high-riskP.mirabilis, we used chewBBACA as a bioinformatics pipeline for GbG analysis to construct and verify the cgMLST scheme ofP.mirabilis.Our results support the prevention and control ofP.mirabilisinfections.
Genomic sequences available as of September 10, 2021, forP.mirabilis(Enterobacteria) were downloaded from the National Center for Biotechnology Information (NCBI) genome sequence repository (http://www.ncbi.nlm.nih.gov/genome/).The downloaded sequences included 74 complete genomes and 679 whole genome shotgun sequences available as scaffolds (116), chromosomes (6), or contigs (557).After removing low-quality assemblies(gene assembly rate < 95% or contamination rate> 5%), all remaining genomes were included in the preliminary analysis.
In the present study, a chewBBACA suite was designed to create and evaluate the core genome GbG typing scheme, followed by allele calling in availableP.mirabilisstrains.ChewBBACA performs scheme creation and allele calls for complete or draft genomes generated byde novoassembly.Before starting, we created a training file in Prodigal v2.6.3[19]using the reference genome HI4320 ofP.mirabilis.The genome sequence ofP.mirabilisHI4320 (RefSeq assembly accession:GCF_ 000069965.1) was used only as a reference genome to predict the whole genome MLST(wgMLST)lociand was then removed from further analysis.
A total of 72 complete genome sequences were annotated in the first step.In this step, the algorithm defined the coding sequences (CDSs) of each genome,and then all CDSs in the genome were compared in a pairwise manner to generate a single FASTA file containing the selected CDSs.Then, a two-step evaluation process was conducted to identify all CDSs in the genomes.First, CDSs that have the same sequence but are smaller than other CDSs were removed, and the larger CDSs were retained.Second,the remaining CDSs were gathered in uniquelociby performing an all-against-all BLASTP search and calculating the blast score ratio (BSR)[20].CDSs with a pairwise BSR of > 0.6 were considered to be alleles at the same locus, and the larger allele (in bp) was retained.The remaining wgMLSTlociwere used to define the cgMLST scheme.We selected candidatelocifrom 100% (threshold) of the available complete genomes (72 genomes) to define our cgMLST scheme.The second step was validation of candidatelocibased on a total of 635 unfinishedP.mirabilisgenomes.We reduced stringency and maintained candidatelocishared by 99% of the isolates.This cutoff was chosen based on the quality of the tested genomes, since the sequences used in this step were incomplete genomes.ChewBBACA is available at https://github.com/B-UMMI/chewBBACA.
To define CGs in a non-arbitrary manner, we analyzed the distribution of the number of allelic mismatches (lociwith different sequences) among the pairs of genomes.We performed a single-link algorithm designed to classify isolates from cgMLST data.Similar to the classification of classical MLST data, the single-link algorithm was also used for cgMLST data to accurately classify isolates[21].
A minimum spanning tree (MST) was constructed using the MSTree method in GrapeTree (version 2.1)for the allelic profiles obtained for each isolate by the cgMLST scheme.
Based on our cgMLST scheme, MSTs for different geographical locations were built to investigate possible evolutionary relationships.We focused on the distribution of CGs in different countries or regions and examined the shared and unique CGs between them.
We used the ABRicate pipeline (https://github.com/tseemann/abricate) to annotate the antibiotic resistance genes (ARGs) and virulence genes in theP.mirabilisgenome using the CARD (https://card.mcmaster.ca/) and VFDB (http://www.mgc.ac.cn/VFs/main.htm) databases.The annotation parameters for ARGs and virulence genes were gene coverage > 60% and identity > 70%.
As of September 10, 2021, 753 genome sequences were available at the NCBI genome sequence repository (http://www.ncbi.nlm.nih.gov/genome/).Of these genomes, 45 were not included due to the low assembly quality (Supplementary Table S1 available in www.besjournal.com).The genome sequences of 12P.mirabilisstrains were excluded during the cgMLST scheme validation process (Supplementary Table S1).According to our cgMLST scheme, reference genome HI4320 was not included.Finally, 695P.mirabilisstrains were included in the analysis according to our cgMLST scheme.Information regarding all strains is provided in Supplementary Table S2 (available in www.besjournal.com).As shown in Figure 1, the majority ofP.mirabilisisolates were collected from humans(82.4%), most samples were collected in the USA or China (75.3%), urine and sputum samples were the most common (40.6%), and most samples were collected in 2018 or 2019 (54.8%).
A flowchart of the cgMLST scheme ofP.mirabilis(https://github.com/B-UMMI/chewBBACA) is provided in Figure 2.In total, 72 complete genome sequences were retrieved.A total of 7,305 targetlociwere generated and annotated using the wgMLST dataset.Using the RemoveGenes operation to remove possible paralogouslocifrom the wgMLST scheme,we discarded 85loci.We filtered out an additional 4,921lociusing the genome quality test (threshold 55).A total of 2,299 candidatelocifor cgMLST were retained, which were present in 72 complete genome sequences (Supplementary Figure S1 available in www.besjournal.com).During the creation of the scheme, no genome sequence was removed.
We used the 635 unfinishedP.mirabilisgenomes for the validation of the cgMLST scheme.The TestGenomeQuality operation was used to selectlocipresent in a predetermined percentage (99%) of the analyzed strains from the cgMLST scheme(Supplementary Figure S2 available in www.besjournal.com).During the validation step, 12 unfinished genomes were removed.In the end,1,842 cgMLSTlociwere selected.Following this step,a cgMLST scheme consisting of 1,842 gene targets was defined, covering 49.91% of the 3,690 open reading frames predicted for the reference strain HI4320.
The allelic difference threshold analysis showed a high number of genome pairs with < 291 mismatches or > 1,345 mismatches, and almost no genome pairs had 291–1,345 mismatches (Figure 3).Finally, we defined theP.mirabiliscgMLST CGs as a set of cgMLST profiles having ≤ 291 allelic mismatches with at least one other member of the group.The final 205 CGs were identified based on single-linkage clustering of the cgMLST allele profile.
From 695P.mirabilisstrains that conform to the cgMLST scheme, 205 CG types were identified(Supplementary Table S2).Owing to the limitation of the MST visualization, Figure 4 only exhibits the CGs(ranked by the number of strains included in the top 30) distributed in clusters.As shown in Figure 4, CGs exhibited good clustering.In other words, mostP.mirabilisstrains have been well typed according to our cgMLST scheme.
Based on our cgMLST scheme, a minimal generative tree of 695P.mirabilisstrains from different countries or regions showed no close relationships at the geographic level (Figure 5).Overall, it could be seen thatP.mirabilispresented a clear global spread, and the root of the MST originated from the USA.However, there are differences between shared CGs and unique CGs in different countries or regions.In total, 46 shared CGs were distributed in 26 countries or regions.CG3 was present in at least 13 countries or regions,CG20 was present in seven countries, and CG7 was present in five countries (Supplementary Figure S3,available in www.besjournal.com).Moreover, 159 unique CGs were distributed in 16 countries(Supplementary Table S3, available in www.besjournal.com).The USA had 105 unique CGs, China had 10 unique CGs, and Denmark had 13 unique CGs.Overall, there were regional differences in the distribution of CGs.

Figure 1.Characterization of the P.mirabilis isolates included in this study.(A) Hosts of isolated P.mirabilis.“Others”represents host types with only one strain.(B) Geographic distribution of available genomes.“Others”represents countries with less than five strains.(C) Isolation source distribution of isolates.“Others”represents sources with less than eight strains.(D) Temporal distribution of P.mirabilis isolates.“Others”represents years when less than two strains were collected.
Virulence-associated Genes in P.mirabilis Clonal GroupsIn total, 64 virulence gene types were found in 695P.mirabilisstrains.Supplementary Table S4 (available in www.besjournal.com) presents the distribution of the 64 virulence gene types in the 205 CGs.The carrier rate of virulence genescheB,cheY,flgG,flgH,fliI,fliN,flip,gmhA/lpcA,kdsA,lpxC,lpxD, and luxSwas 100%.The carrier rate of virulence genescheW,flgC,flhC,flhD,fliA,fliG,fliM,fliZ,galU,htpB,ompA, and rfaDwas > 95%.It is worth noting that CG20 carries nine unique virulence types, namely,papC,papD,papE,papF,papG,papH,papI,papJ, and papK.These nine unique types of virulence genes were present only in CG20 and were P fimbrial operons that cause severe urinary tract infections[22-24].
Antibiotic Resistance Genes in P.mirabilis CGsThe subtypes of 156 ARGs were found in 695P.mirabilisstrains.Supplementary Table S5 (available in www.besjournal.com) presents the distribution of the subtypes of 156 ARGs among the 205 CGs.The ARGCRPwas present in all CGs.The ARGsH-NS,acrB,cpxA, andtet(J)were present in more than 95%of CGs ofP.mirabilis.The total number of ARG subtypes present in different CGs ranged from six to 95.CG3, CG20, CG29, CG32, CG6, CG49, CG41, CG80,CG12, CG7, CG4, CG1, CG61, CG68, CG48, CG28,CG107, and CG11 not only have a large number of shared ARG subtypes, but also unique ARG subtypes.CG3 has 77 shared ARG subtypes and 18 unique ARG subtypes.CG20 has 37 shared ARG subtypes and five unique ARG subtypes.

Figure 2.Flowchart describing the development of the cgMLST scheme for P.mirabilis using chewBBACA(https://github.com/B-UMMI/chewBBACA).

Figure 3.Distribution of the number of paired allele mismatches in P.mirabilis (i.e., the number of different loci for a given pair of strains).The allelic mismatch cutoff value of 291 proposed for the CG definition is shown in red.
The spread of multidrug-resistantP.mirabilis[4-8]is an urgent public health problem.Proper typing ofP.mirabilisis important.MLST[9,10]and PFGE[11-13]are almost incapable of high-resolution typing ofP.mirabilis.CgMLST is based on comparison of bacterial whole genome sequences, and has proven to be a promising means for disease outbreak investigation, source tracing, surveillance of bacterial pathogens, and contamination control[25-27].cgMLST is based on comparison of a complete or draft de novo genome assembly with a scheme consisting of a series oflociand a collection of related allele sequences.This is a suitable and unbiased means to identify possible clusters from a sample of the entire species, and does not require any outbreak-specific reference[28].To the best of our knowledge, no previous studies have proposed a cgMLST scheme forP.mirabilistyping.
GbG approaches are becoming increasingly popular in bacterial typing[29,30], bacterial genome epidemiology[31], and outbreak investigation[31,32].However, the lack of open source software for scheme definition and allele calling has limited the widespread use of the cgMLST approach.Interestingly, the chewBBACA suite can address this limitation.The chewBBACA suite was designed to help users create and evaluate novel whole genome or core genome GbG schemes and make allelic calls to bacterial strains of interest[17].We used chewBBACA to call alleles ofP.mirabilisand created and validated the first cgMLST scheme forP.mirabilis.In the end, the proposed cgMLST scheme forP.mirabilistyped 100% (707/707) of the genomes available for this research considering at least 95% of the cgMLST target genes present.We proposed the public cgMLST scheme forP.mirabilisbased on 72 complete genome sequences and validated the scheme using 1,842 targetlociin 635 unfinished genome sequences.The proposed 1842-locus cgMLST scheme delineated the 695P.mirabilisstrains into 205 distinct CGs.CG3 (18.3%) was the most dominant, followed by CG2 (4.7%), CG17(3.0%), and CG32 (2.4%).Several CGs showed interesting regional distribution characteristics.

Figure 4.Minimum spanning tree (MST) of CGs generated using the 695 P.mirabilis strains based on the cgMLST scheme.Due to the limitation of the visualization of the MST, only the CG types containing the top 30 strains are displayed.“Others”represents the CG types containing less than five strains.
The cgMLST approach, which extends the MLST concept to the core genome, has proven to be a useful high-resolution typing scheme in other bacterial species[29,30,33,34].In contrast to the genuswideLeptospiracore genome MLST for strain taxonomy[30], we established a cgMLST scheme at the species level.Currently, the genusProteusconsists of five species (P.mirabilis,P.vulgaris,P.penneri,P.myxofaciens, andP.hauseri), withP.mirabilisbeing the main pathogenic species[35].P.mirabilisis generally more susceptible to antimicrobials thanP.penneri,P.hauseri, andP.vulgaris[35,36].Therefore, a cgMLST scheme at the species level is necessary.Unlike the cgMLST scheme of de Sales RO et al.forPseudomonas aeruginosa[29],our cgMLST scheme allows for analysis of ARGs and virulence genes.The identification of CGs of high-riskP.mirabilisis important for public health.
Virulence genespapC,papD,papE,papF,papG,papH,papI,papJ, and papKwere only in CG20.These nine virulence genes encode adhesive pili of the chaperone-usher family and they form the importantP.mirabilispmf fimbrial operon.These virulence genes inP.mirabilismay be the cause of severe urinary tract infections[22-24,37-39].In addition, CG20 has 42 shared ARG subtypes and five unique ARG subtypes.ARGsAAC(3)-Id[40,41],aadA7[42,43],dfrA15[44,45], andvanRA[46]have been found in multidrug-resistantP.mirabilis.A recent study from Italy revealed that the coexistence of ARGs and virulence factors could lead to the emergence of a new population of resistant clones, which poses a major challenge for addressing antimicrobial resistance[47].We believe thatP.mirabilisdeserves more attention.In addition, CG3 has 77 shared ARG subtypes and 18 unique ARG subtypes.CG3 was widely distributed on all continents, and its major resistance genes were classified as β-lactams (e.g.,cephalosporin and carbapenem), which is consistent with the prevalence of β-lactamase-carryingP.mirabilisstrains reported in many countries and regions[48-50].Therefore, attention should be paid to these CGs ofP.mirabilis.

Figure 5.Minimal spanning tree analysis of cgMLST profiles of 695 P.mirabilis strains isolated from different countries or regions.The genome sequences were downloaded from the NCBI database(https://www.ncbi.nlm.nih.gov/ genome/).Colors indicate the geographic locations of the isolates.
There are some limitations in this study that must be pointed out.First, our analysis of the ARGs and virulence genes of the CGs ofP.mirabiliswas only preliminary.We simply used the easy and fast bioinformatics pipeline (ABRicate) and did not perform relevant experimental studies such as gene validation using polymerase chain reaction and quantitative PCR.The main reason for this is that the sequences of all 695P.mirabilisstrains were downloaded from NCBI, which became the main reason why we were unable to perform many experiments.Second, the main aim of the cgMLST scheme is not only typing, but also outbreak tracing.We did not apply the cgMLST scheme to epidemiological practice in this study.Therefore,further research will be conducted in the future to confirm that our cgMLST scheme is well suited not only for molecular typing ofP.mirabilis, but also for epidemiological practice ofP.mirabilis.Despite these current limitations, this is still the first study that performed high-resolution molecular typing ofP.mirabilisusing WGS technology combined with bioinformatics pipelines (chewBBACA), our results support the prevention and control ofP.mirabilisinfections.
In summary, we carried out high-resolution molecular typing ofP.mirabilisfrom all over the world and found that the CGs ofP.mirabilisshowed significant regional distribution differences.We identified a total of 205 CGs in 695P.mirabilisstrains.The proposed cgMLST scheme forP.mirabilistyped 100% (707/707) of the available isolates, and at least 95% of these selectedlociwere present in the genome.We also found that someP.mirabilisCGs carried a large number of shared ARGs (or virulence genotypes) and unique ARGs (or virulence genotypes).To the best of our knowledge, this is the first report of high-resolution molecular typing ofP.mirabilisusing whole genome sequencing technology combined with a bioinformatics pipeline (chewBBACA).We have published our scheme, associated codes, and files on GitHub(https://github.com/Natasha22222222/cgMLSTProteus-mirabilis) to facilitate further discussion and contribute to the establishment of cgMLST forP.mirabilis.

Supplementary Figure S1.Number of loci and number of strains at every exclusion threshold level for the 72 complete genomes that were used to create the cgMLST scheme.

Supplementary Figure S2.Number of loci and number of genomes at every exclusion threshold level for the 635 validation strains’genomes and 72 completed genomes that were used to construct the cgMLST scheme.

Supplementary Figure S3.Heatmap of 46 shared CGs in 26 countries or regions.
CONFLICT OF INTEREST
The authors declare that they have no competing interests.
DATA AVAILABILITY STATEMENT
The datasets for this study can be found in the NCBI (National Center for Biotechnology Information) genome sequence repository (http://www.ncbi.nlm.nih.gov/genome/).The RefSeq assembly accession of all includedP.mirabilissequences are provided in Supplementary Table S2.
SUPPLEMENTARY MATERIALS
All data generated or analyzed during this study are included in this article and its supplementary information.
Received: August 2, 2022;Accepted: November 30, 2022
Biomedical and Environmental Sciences
2023年4期