Prediction of Escherichia coli promoters using a position-specific scoring matrix

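2016-01-15 02:06:22 YAN Yan, WAN Ping

Chinese Journal of Bioinformatics (生物信息學), 2015, Issue 2

YAN Yan, WAN Ping*

(College of Life Science, Capital Normal University, Beijing 100048, China)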

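Abstract: The promoter is a key element in the initiation of gene transcription. Using the E. coli promoter data available in a public database, we built a promoter prediction method based on the position-specific scoring matrix (PSSM) algorithm and evaluated its predictions with ROC curves. The method achieved accuracies of 86%, 96%, 93%, 96%, 97% and 74% for the sigma24, sigma28, sigma32, sigma38, sigma54 and sigma70 promoters of E. coli, respectively. Because prokaryotic promoter sequences are conserved, the method can be extended to promoter prediction in other prokaryotes.

Keywords: Escherichia coli; promoter; position-specific scoring matrix (PSSM); prediction

CLC number: Q811.4    Document code: A

Received: 2015-01-14; Revised: 2015-03-10.

About the author: WU Wenfeng, male, master's student; research interests: intelligent information and image processing; E-mail: 641178636@qq.com.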

doi:10.3969/j.issn.1672-5565.2015.02.10

Prediction of Escherichia coli K-12 promoters using position-specific scoring matrix (PSSM) method

YAN Yan, WAN Ping*

(College of Life Science, Capital Normal University, Beijing 100048, China)

Abstract: The promoter is an essential element in transcription initiation. In this study, we propose a promoter prediction method based on a position-specific scoring matrix (PSSM) constructed from data in the RegulonDB database, and we evaluate its performance with receiver operating characteristic (ROC) curves. For Escherichia coli K-12 promoters, the prediction accuracies for sigma24, sigma28, sigma32, sigma38, sigma54 and sigma70 are 86%, 96%, 93%, 96%, 97% and 74%, respectively. Since promoter sequences are conserved among prokaryotes, the PSSM method could be applied to the prediction of promoters in other prokaryotes.

Keywords: E. coli; promoter; position-specific scoring matrix; prediction

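The promoter is a key element in the initiation of gene transcription and lies near the transcription start site of a gene. In bacteria, a promoter is recognized jointly by the RNA polymerase core enzyme and the corresponding sigma factor [1]. There are seven types of sigma factor: sigma19, sigma24, sigma28, sigma32, sigma38, sigma54 and sigma70, and the sequences recognized by each sigma factor have characteristic features. Except for sigma54 promoters, promoters have conserved regions near positions -10 and -35 upstream of the transcription start site [2]; in sigma54 promoters, the conserved regions lie near positions -12 and -24 upstream of the transcription start site [3].

Many algorithms are available for modeling characteristic motifs. Commonly used ones include the position-specific scoring matrix (PSSM, also called the position weight matrix, PWM), greedy algorithms, the EM algorithm and the MCMC algorithm, each with its own strengths and weaknesses [4]. In addition, several newer algorithms have been reported in recent years, such as the pHMM-ANN method [5], the GLECLUBS algorithm [6], the BOBRO algorithm [7-8], neural network algorithms [9], and a PSSM algorithm built on a non-traditional 16-column dinucleotide matrix [10].

Among these algorithms, PSSM remains the most commonly used and holds an important place. It is widely applied to detect signal-bearing nucleic acid sequences such as promoter elements or alternative splicing sites [11]. There are many ways to build a PSSM; the most common is to construct the scoring matrix from aligned, equal-length motifs that are known to have similar functions. The number of rows of the matrix is determined by the number of distinct symbols in the motif (four for nucleotides), and the number of columns by the motif length. Once built, the scoring matrix can be used to search DNA or protein sequences for segments similar to the known sequences [10].

The ROC (receiver operating characteristic) curve is a graphical analysis tool that depicts the trade-off between sensitivity and specificity in a diagnostic test [12].

To our knowledge, the use of the PSSM method to predict prokaryotic promoters has not yet been reported. In this study, we use the PSSM method to predict E. coli promoters and evaluate the predictions with ROC curves.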

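1 Data and methods

1.1 Promoter nucleotide sequences of E. coli K-12

The nucleotide sequences of the sigma24, sigma28, sigma32, sigma38, sigma54 and sigma70 promoters of E. coli K-12 were downloaded from the RegulonDB database (http://regulondb.ccg.unam.mx/). RegulonDB catalogs the regulatory complexes and regulatory networks involved in transcription initiation in E. coli K-12. It also covers interactions among genes of various functions, such as transcription complexes, operons, and the genes of simple or complex regulons [13].

Because RegulonDB contains only one sigma19 promoter sequence, sigma19 was excluded from this study. The downloaded promoter sequences were first filtered to remove redundant sequences, sequences without annotation, and sequences belonging to more than one promoter class. The last group consists of sequences that can be recognized by more than one type of sigma factor; such sequences would degrade the predictive performance of the PSSM. After filtering, 2 954 promoter sequences remained: 511 for sigma24, 138 for sigma28, 285 for sigma32, 130 for sigma38, 92 for sigma54 and 1 787 for sigma70.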
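The filtering step can be illustrated with a minimal Python sketch. The record layout (a sequence paired with its sigma annotation, with several sigma factors separated by commas) and the name `filter_promoters` are assumptions made for illustration; they are not the RegulonDB file format itself.

```python
def filter_promoters(records):
    """Filter (sequence, sigma_annotation) pairs.

    Drops sequences that are empty or contain non-ACGT symbols, records with
    no sigma annotation, records annotated with several sigma factors, and
    duplicate sequences (the first copy is kept).
    """
    kept = {}
    for sequence, sigma in records:
        seq = sequence.strip().lower()
        if not seq or set(seq) - set("acgt"):
            continue  # missing or non-ACGT sequence
        if not sigma or "," in sigma:
            continue  # no annotation, or more than one sigma factor
        kept.setdefault(seq, sigma.strip())  # redundant copies are ignored
    return kept


# Hypothetical records, invented for illustration only.
records = [
    ("ACGT", "Sigma70"),
    ("acgt", "Sigma70"),           # redundant copy of the first record
    ("TTTT", "Sigma24, Sigma70"),  # recognized by two sigma factors
    ("GGGG", ""),                  # no annotation
    ("ACNT", "Sigma32"),           # ambiguous base
]
filtered = filter_promoters(records)
print(filtered)
```

Only the first record survives in this toy input, because each of the others trips one of the three exclusion rules.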

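1.2 Construction of the position-specific scoring matrix (PSSM)

A separate PSSM was constructed for each promoter class of E. coli K-12.

1.2.1 Constructing the frequency matrix

Every promoter sequence downloaded from RegulonDB is 81 bases long. Taking the base at the transcription start site as position 0, the promoter spans coordinates -60 to 20. To build the frequency matrix, the occurrences of each of the four nucleotides are counted at every coordinate and entered into a matrix of 4 rows and 81 columns. The rows are named A, C, G and T, and the columns are named by the corresponding promoter coordinates.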
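As a minimal sketch of this counting step, the following Python snippet builds a count matrix from a toy alignment. The function name `count_matrix` and the three 4-base sequences are invented for illustration; real input would be the 81-base promoter sequences.

```python
BASES = "ACGT"


def count_matrix(sequences):
    """Return {base: [count at each position]} for equal-length DNA sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    matrix = {b: [0] * length for b in BASES}
    for seq in sequences:
        for j, base in enumerate(seq.upper()):
            matrix[base][j] += 1  # a non-ACGT symbol raises KeyError here
    return matrix


# Hypothetical toy alignment (4 positions instead of 81).
toy = ["TATA", "TACA", "GATA"]
counts = count_matrix(toy)
for b in BASES:
    print(b, counts[b])

# Column names for the real data: coordinates -60 .. 20 give 81 columns.
coordinates = list(range(-60, 21))
```

Each row holds the counts of one base, so the columns of the four lists together sum to the number of sequences at every position.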

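1.2.2 Constructing the pseudocount matrix

Some elements of the frequency matrix may be 0. This is generally attributed to an insufficient amount of collected data. To compensate, a positive number (1 in this study) is added to every element of the frequency matrix, producing the pseudocount matrix.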

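1.2.3 Constructing the probability matrix

Each element of the pseudocount matrix is divided by the sum of its column to give the probability matrix (equation (1)), where $n_{b,j}$ is the number of times base $b$ occurs at position $j$ and $N$ is the number of promoter sequences.

$$M_{b,j} = \frac{n_{b,j} + 1}{N + 4} \qquad (1)$$

1.2.4 Constructing the odds ratio matrix

Each element of the probability matrix is divided by the probability of the corresponding base under random conditions (0.25 for every base) to give the odds ratio matrix, as shown in equation (2), where M denotes the observed case and R the random case.

$$O_{b,j} = \frac{M_{b,j}}{R_b}, \qquad R_b = 0.25 \qquad (2)$$

1.2.5 Constructing the log-odds ratio matrix, i.e. the position-specific scoring matrix (PSSM)

Taking the base-2 logarithm of each element of the odds ratio matrix and keeping the integer part gives the log-odds ratio matrix (equation (3)), which is the final position-specific scoring matrix (PSSM).

$$S_{b,j} = \mathrm{int}\left(\log_2 \frac{M_{b,j}}{R_b}\right) \qquad (3)$$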
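The chain from counts to scores (pseudocount, probability, odds ratio, log-odds) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name `pssm_from_counts` and the toy counts are invented, and "keeping the integer part" is read as truncation toward zero.

```python
import math

BASES = "ACGT"


def pssm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Turn {base: [counts]} into {base: [integer log2-odds scores]}."""
    length = len(counts["A"])
    pssm = {b: [0] * length for b in BASES}
    for j in range(length):
        # Column total after adding the pseudocount to all four bases.
        column_total = sum(counts[b][j] + pseudocount for b in BASES)
        for b in BASES:
            prob = (counts[b][j] + pseudocount) / column_total  # probability
            odds = prob / background                            # odds ratio
            pssm[b][j] = int(math.log2(odds))                   # integer part of log2
    return pssm


# Hypothetical counts for 12 sequences and 3 positions (illustration only).
toy_counts = {
    "A": [12, 3, 0],
    "C": [0, 3, 0],
    "G": [0, 3, 6],
    "T": [0, 3, 6],
}
toy_pssm = pssm_from_counts(toy_counts)
for b in BASES:
    print(b, toy_pssm[b])
```

In the toy input, a base that never occurs at a position still gets a finite score of -2, because its pseudocount probability is 1/16 and log2(0.0625/0.25) = -2. Without the pseudocount the logarithm would be undefined.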

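1.3 Predicting E. coli K-12 promoters with the PSSM

1.3.1 Prediction method

For a given DNA sequence, the score of the base observed at each position is looked up in the PSSM, and the scores over all positions are summed to give a total score. The same DNA sequence is scored with the PSSM of each promoter class, and the class with the highest score is taken as the promoter type of the sequence.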
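The scoring rule can be sketched in a few lines of Python. The two 3-position matrices below are invented purely for illustration, and `score` and `classify` are hypothetical helper names.

```python
def score(sequence, pssm):
    """Sum the PSSM entries selected by the base at each position."""
    return sum(pssm[base][j] for j, base in enumerate(sequence.upper()))


def classify(sequence, pssms):
    """Return (best_class, scores); best_class has the highest total score."""
    scores = {name: score(sequence, matrix) for name, matrix in pssms.items()}
    best = max(scores, key=scores.get)  # on a tie, the first class listed wins
    return best, scores


# Hypothetical 3-position PSSMs for two classes (values invented).
pssms = {
    "sigma70": {"A": [1, -2, 0], "C": [-2, 0, -1], "G": [-2, 0, 1], "T": [0, 1, -2]},
    "sigma54": {"A": [-1, 0, 0], "C": [0, 1, -2], "G": [1, -2, 0], "T": [-2, 0, 1]},
}
best, scores = classify("ATG", pssms)
print(best, scores)
```

For the sequence ATG the sigma70 matrix contributes 1 + 1 + 1 = 3 and the sigma54 matrix contributes -1 + 0 + 0 = -1, so the sequence is assigned to sigma70.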

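1.3.2 Prediction on the positive and negative data sets

For a given promoter class, the positive data set consists of the DNA sequences that belong to that class, and the negative data set consists of DNA sequences that do not. In this study, the positive and negative data sets contain sequences in a 1:1 ratio. The negative data set for a class is composed of promoter sequences from the other five classes.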
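One way to assemble such data sets is sketched below in Python. The text does not say how the 1:1 ratio was obtained, so random subsampling of the other classes is an assumption here, as are the function name `make_datasets`, the fixed seed, and the toy sequences.

```python
import random


def make_datasets(seqs_by_class, target, seed=0):
    """Return (positives, negatives) for one promoter class.

    Negatives are drawn at random, without replacement, from the sequences of
    all other classes (an assumed procedure). Their number is capped by the
    size of that pool, so the ratio is 1:1 only when the pool is large enough.
    """
    positives = sorted(seqs_by_class[target])
    pool = sorted(
        seq
        for name, seqs in seqs_by_class.items()
        if name != target
        for seq in seqs
    )
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    k = min(len(positives), len(pool))
    negatives = rng.sample(pool, k)
    return positives, negatives


# Hypothetical toy data (illustration only).
seqs_by_class = {
    "sigma24": {"aaaa", "cccc"},
    "sigma28": {"gggg"},
    "sigma70": {"tttt", "acgt", "tgca"},
}
pos, neg = make_datasets(seqs_by_class, "sigma24")
print(pos, neg)
```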

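1.4 Evaluating the predictions with ROC curves

ROC curves were drawn with the ROCR package [14] in RStudio.

In a ROC curve, AUC stands for the area under the curve; the closer this value is to 1, the better the prediction.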
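To make the AUC concrete, the following Python sketch computes ROC points and the area under them directly from scores and labels. It is an independent illustration, not the ROCR implementation; the function names `roc_points` and `auc` and the toy scores are invented.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) lists obtained by lowering the score threshold."""
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every item tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


# Hypothetical scores: three true promoters (True) and three others (False).
toy_scores = [5, 4, 3, 2, 1, 0]
toy_labels = [True, True, False, True, False, False]
fpr, tpr = roc_points(toy_scores, toy_labels)
toy_auc = auc(fpr, tpr)
print(round(toy_auc, 3))
```

In this toy case 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9. A perfect ranking gives 1 and a random ranking gives about 0.5, which corresponds to the diagonal of the ROC plot.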

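Sensitivity (Sens), specificity (Spec) and accuracy (Acc) were used to evaluate predictive performance [15]. They are defined in equations (4)-(6), where TP is the number of true positives, FN the number of false negatives, TN the number of true negatives and FP the number of false positives.

$$\mathrm{Sens} = \frac{TP}{TP + FN} \qquad (4)$$

$$\mathrm{Spec} = \frac{TN}{TN + FP} \qquad (5)$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + FN + TN + FP} \qquad (6)$$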
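A small Python sketch shows how these three measures follow from the four confusion counts. The function name `evaluate` and the counts in the example are invented for illustration.

```python
def evaluate(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sens = tp / (tp + fn)                   # fraction of true promoters found
    spec = tn / (tn + fp)                   # fraction of non-promoters rejected
    acc = (tp + tn) / (tp + fn + tn + fp)   # fraction of all calls that are right
    return sens, spec, acc


# Hypothetical counts: 10 positives and 10 negatives.
sens, spec, acc = evaluate(tp=8, fn=2, tn=7, fp=3)
print(sens, spec, acc)
```

With equal numbers of positive and negative sequences, as in this example, accuracy is simply the mean of sensitivity and specificity.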

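2 Results

2.1 PSSMs of the six promoter types of E. coli K-12

We computed the PSSMs of the six promoter types of E. coli K-12; Figure 1 shows the corresponding sequence logos. The following Perl script processes the E. coli K-12 promoters to produce the positive and negative data sets for the ROC evaluation: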

#!/usr/bin/perl -w
use strict;

# The raw promoter files downloaded from RegulonDB are kept in one folder.
my @file = glob "PromoterSigma*Set.txt";
my (%sigma_all_sequences, %matrices_total, %prediction_all);

# Score Matrices Computation
foreach my $file (@file) {
    my ($promoter_index) = $file =~ /PromoterSigma(.*)Set\.txt/;
    my (%sigma_sequences, %sigma_transposition_sequences);
    my @matrix_score;

    # Store the promoters of the current file in %sigma_sequences and then
    # store all types of promoter sequences in %sigma_all_sequences.
    open(my $in, "<", $file) || die "Can't open file \"$file\"!\n";
    while (<$in>) {
        chomp;
        (/^#/ or /^\s*$/) and next;
        my @fields = split("\t");    # tab-delimited columns
        # Skip rows without a plain ACGT sequence or with several sigma factors.
        (!defined $fields[5] or $fields[5] !~ /^[acgtACGT]+$/ or $fields[4] =~ /,/) and next;
        $sigma_sequences{lc($fields[5])} = $fields[4];
    }
    close $in;
    $sigma_all_sequences{$promoter_index} = \%sigma_sequences;

    # Transpose %sigma_sequences into %sigma_transposition_sequences
    # in order to calculate the score matrices.
    foreach my $key (keys %sigma_sequences) {
        my @fields = split("", $key);
        for (my $i = 0; $i < @fields; $i++) {
            $sigma_transposition_sequences{$i} .= $fields[$i];
        }
    }

    # Calculate score matrices and store the results in %matrices_total.
    foreach my $key (sort { $a <=> $b } keys %sigma_transposition_sequences) {
        my $promoter_num = length $sigma_transposition_sequences{$key};
        my %acgt = ();
        $acgt{$_}++ foreach split("", $sigma_transposition_sequences{$key});
        # Score all four bases, so that a base never seen at this position
        # still receives a pseudocount-based score.
        foreach my $base (qw(a c g t)) {
            # Pseudocount of 0.1 per base; the factor 4 divides by the background 0.25.
            my $base_score = (($acgt{$base} || 0) + .1) * 4 / ($promoter_num + .4);
            # Base-2 logarithm, rounded to the nearest integer.
            $base_score = sprintf "%.0f", log($base_score) / log(2);
            $matrix_score[$key][judge($base)] = int($base_score);
        }
    }
    $matrices_total{$promoter_index} = \@matrix_score;
}

# True or False Promoters Prediction
foreach my $file (@file) {
    my ($promoter_index) = $file =~ /PromoterSigma(.*)Set\.txt/;
    my %prediction;

    # True promoters score
    foreach my $key (keys %{ $sigma_all_sequences{$promoter_index} }) {
        my $score = 0;
        for (my $i = 0; $i < length($key); $i++) {
            $score += ${ $matrices_total{$promoter_index} }[$i][judge(substr($key, $i, 1))];
        }
        $prediction{$key} = $score . "\tT";
    }

    # False promoters score
    foreach my $key (keys %sigma_all_sequences) {
        $key eq $promoter_index and next;
        foreach my $false_key (keys %{ $sigma_all_sequences{$key} }) {
            my $score = 0;
            for (my $i = 0; $i < length($false_key); $i++) {
                $score += ${ $matrices_total{$promoter_index} }[$i][judge(substr($false_key, $i, 1))];
            }
            $prediction{$false_key} = $score . "\tF";
        }
    }
    $prediction_all{$promoter_index} = \%prediction;
}

# Print the scores into files (columns: sequence, score, T/F label).
foreach my $key (keys %prediction_all) {
    open(RS, ">", "score_sigma" . $key . ".txt") || die "Can't write score file for sigma$key!\n";
    foreach my $sub_key (keys %{ $prediction_all{$key} }) {
        print RS $sub_key, "\t", ${ $prediction_all{$key} }{$sub_key}, "\n";
    }
    close RS;
}

# Map a base to its column index in the score matrix.
sub judge {
    my ($string) = @_;
    my $num;
    $string =~ /a/ and $num = 0;
    $string =~ /c/ and $num = 1;
    $string =~ /g/ and $num = 2;
    $string =~ /t/ and $num = 3;
    return $num;
}

Figure 1 Sequence logos of the six promoter types of E. coli K-12

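2.2 ROC curves

Based on the predictions for the six promoter types, we drew the corresponding ROC curves with the ROCR package in R (Figure 2). Table 1 lists the sensitivity and specificity of the PSSM predictions for each promoter type. The R script used to draw the ROC curves is as follows: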

library(ROCR)
setwd("") # set the working directory to the folder that holds the score files
par(mfrow=c(2,3),bg="white",mai=c(.6,.6,.6,.6))
for(i in c("24","28","32","38","54","70")){
  data=read.table(paste0("score_sigma",i,".txt"))
  pred <- prediction(data[,2],data[,3])
  perf <- performance(pred,"tpr","fpr")
  sens <- performance(pred,"spec","sens")@x.values
  spec <- performance(pred,"spec","sens")@y.values
  print(paste0("sigma",i,sens,spec))
  auc <- format(performance(pred,"auc")@y.values,digits=2)
  plot(perf,main=paste0("sigma",i),colorize=FALSE,lwd=2,xaxis.cex.axis=1,
       yaxis.cex.axis=1,yaxis.las=1,cex.main=1.5)
  segments(0,0,1,1,lty=2)
  text(0.6,0.5,paste0("AUC=",auc),cex=1.2)
}

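Table 1 Sensitivity and specificity of the PSSM predictions for each promoter type

              σ24    σ28    σ32    σ38    σ54    σ70
Sensitivity   0.71   0.93   0.74   0.74   0.73   0.73
Specificity   0.84   0.74   0.88   0.64   0.77   0.68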

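3 Discussion

Comparing the accuracy (Acc) of the PSSM and BacPP methods [16] (Table 2) shows that, across the six types of sigma-factor promoters, the PSSM method is more accurate than BacPP for three promoter types (sigma28, sigma32 and sigma38) and matches BacPP for one (sigma54), with an accuracy of 0.97 for both.

Table 2 Comparison of accuracy (Acc) between the PSSM and BacPP methods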

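These results indicate that the PSSM model is a reasonably accurate algorithm for predicting prokaryotic promoters.

First, all the ROC curves in Figure 2 lie above the dashed diagonal, which shows that the PSSM predicts promoters better than random guessing.

Second, the AUC values indicate how reliable the PSSM method is. Only sigma38 has an AUC of 0.74; all the others exceed 0.8, which indicates that the PSSM is highly reliable.

Third, sensitivity and specificity are the most informative measures of a prediction method. In Table 1, the sensitivity and specificity of the PSSM are all above 0.6, and the sensitivity for sigma28 reaches 0.93, a considerably high level.

The PSSM algorithm provides a reasonably accurate and reliable method for predicting E. coli K-12 promoters. The shape of the ROC curves, the AUC values, and the sensitivity, specificity and accuracy all demonstrate its effectiveness in promoter prediction. Because prokaryotic promoters are highly conserved, PSSM can serve as an effective method for promoter prediction in prokaryotes. A limitation of the PSSM method is that the window size of the scoring matrix must be specified; this limitation can be overcome with hidden Markov model (HMM) methods. In addition, we will use multi-fold cross-validation to further improve the prediction accuracy.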

References

[1] YANG Ming, LI Quansheng. Prokaryotic sigma factors [J]. Henan Medical Research, 1999, 8(1): 88-90. (in Chinese)

[2] HAWLEY D K, MCCLURE W R. Compilation and analysis of Escherichia coli promoter DNA sequences [J]. Nucleic Acids Research, 1983, 11(8): 2237-2255.

[3] THÖNY B, HENNECKE H. The -24/-12 promoter comes of age [J]. FEMS Microbiol. Rev, 1989, 63: 341-357.

[4] GUHATHAKURTA D. Computational identification of transcriptional regulatory elements in DNA sequence [J]. Nucleic Acids Research, 2006, 34(12): 3585-3598.

[5] MANN S, LI J, CHEN Y P P. A pHMM-ANN based discriminative approach to promoter identification in prokaryote genomic contexts [J]. Nucleic Acids Res, 2007, 35: e12.

[6] ZHANG S, XU M, LI S, et al. Genome-wide de novo prediction of cis-regulatory binding sites in prokaryotes [J]. Nucleic Acids Research, 2009, 37(10): e72.

[7] LI G, LIU B, MA Q, et al. A new framework for identifying cis-regulatory motifs in prokaryotes [J]. Nucleic Acids Research, 2011, 39(7): e42.

[8] MA Q, LIU B, ZHOU C, et al. An integrated toolkit for accurate prediction and analysis of cis-regulatory motifs at a genome scale [J]. Bioinformatics, 2013, 29(18): 2261-2268.

[9] AHMAD S, SARAI A. PSSM-based prediction of DNA binding sites in proteins [J]. BMC Bioinformatics, 2005, 6: 33.

[10] GERSHENZON N I, STORMO G D, IOSHIKHES I P. Computational technique for improvement of the position-weight matrices for the DNA/protein binding sites [J]. Nucleic Acids Research, 2005, 33(7): 2290-2301.

[11] CLAVERIE J M, AUDIC S. The statistical significance of nucleotide position-weight matrix matches [J]. Computer Applications in the Biosciences, 1996, 12(5): 431-439.

[12] WEI Xiuxi, ZHOU Yongquan. Assess of performance of two types of classification methods based on ROC [J]. Computer Technology and Development, 2010, 20(11): 47-50. (in Chinese)

[13] SALGADO H, PERALTA-GIL M, GAMA-CASTRO S, et al. RegulonDB (version 8.0): Omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more [J]. Nucleic Acids Research, 2012, 41(D1): D203-213.

[14] SING T, SANDER O, BEERENWINKEL N, et al. ROCR: visualizing classifier performance in R [J]. Bioinformatics, 2005, 21(20): 3940-3941.

[15] ZHOU X, LI Z, DAI Z, et al. Predicting promoters by pseudo-trinucleotide compositions based on discrete wavelets transform [J]. Journal of Theoretical Biology, 2013, 319: 1-7.

[16] DE AVILA E SILVA S, ECHEVERRIGARAY S, GERHARDT G J. BacPP: bacterial promoter prediction - a tool for accurate sigma-factor specific assignment in enterobacteria [J]. J. Theor Biol, 2011, 287: 92-99.

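*Corresponding author: LIU Yihui, female, Ph.D., professor; research interests: biological computing, intelligent information processing, etc.; E-mail: yxl@sdili.edu.cn.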
