Genome Informatics 13: 314–315 (2002)

Classification of C2H2 Zinc Finger Domains Using Support Vector Machines

Takafumi Nagano (1), Makiko Suwa (2), Kiyoshi Asai (2)

(1) Advanced Technology R&D Center, Mitsubishi Electric Corp., 8-1-1 Tsukaguchi-Honmachi, Amagasaki City, Hyogo 661-8661, Japan

(2) Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-41-6 Aomi, Koutou-ku, Tokyo 135-0064, Japan

Keywords: zinc finger, C2H2, sequence analysis, support vector machine, Fisher kernel

1 Introduction

Zinc finger proteins include nuclear receptors for steroid hormones and are mainly DNA-binding transcription factors. They are therefore regarded as target proteins for drug discovery. The C2H2 zinc finger gene family is one of the most common and complex superfamilies. C2H2 zinc finger domains are composed of approximately 25 to 30 amino acid residues, including the paired cysteines and histidines that form coordinate bonds with a zinc ion. Although C2H2 domains are well studied, it is difficult to detect them with high accuracy by homology search or hidden Markov models (HMMs), owing to the wide variety of their sequences.

In this research, we have extended the Support Vector Machine (SVM) based method using the Fisher kernel [1] in order to achieve better accuracy than an HMM. The Fisher kernel uses an HMM to extract a fixed-length feature vector, known as a Fisher score vector (FSV), from a variable-length sequence. The method in [1] classifies G-protein coupled receptors (GPCRs) into GPCR subfamilies.
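For reference, the Fisher score underlying the FSV can be written in the standard form used in the Fisher kernel literature. This is a sketch under stated assumptions rather than the exact parameterization used in this work: the symbols $\theta$, $\xi$, and $U_X$ are notation introduced here for illustration, and the second expression is the plain emission-probability form, not the Dirichlet-mixture variant of [1, 3].

```latex
% Fisher score of a sequence X under an HMM with parameters \theta
U_X = \nabla_{\theta} \log P(X \mid \theta)

% Component for emitting residue x in match state s, where \xi(x, s) is the
% posterior expected number of times x is emitted from s given X
U_X(x, s) = \frac{\xi(x, s)}{\theta_{x,s}} - \xi(s),
\qquad \xi(s) = \sum_{x'} \xi(x', s)
```

Because the gradient has one component per model parameter, every sequence maps to a vector of the same length regardless of how many residues it contains.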

2 Method and Results

We propose a method to discriminate among domains that an HMM detects with little significance. A training data set is constructed from domains detected by an HMM, and an SVM is trained to distinguish positive examples from negative examples.

First, the C2H2 domains with positive scores were extracted from protein sequences of SWISS-PROT Release 40.0 using HMMER 2.2 with the profile HMM (zf-C2H2) of Pfam 6.6. The domains whose coordinating residues are neither cysteines nor histidines were removed. To allow the accuracy of the proposed method to be evaluated, we also removed the domains that overlapped with domains carrying an uncertain annotation in the SWISS-PROT database (ATYPICAL, DEFECTIVE, DEGENERATE, INCOMPLETE, POTENTIAL, or LOW DNA-BINDING AFFINITY) and that did not wholly correspond with domains in the SWISS-PROT database.

Figure 1: The score histogram of C2H2 domains detected by the HMM.


Then the domains that corresponded with domains in the SWISS-PROT database were labeled as positive examples, and the others as negative examples. The score histogram of C2H2 domains detected by the HMM is shown in Fig. 1.

Each profile HMM in the Pfam database has a trusted cutoff and a noise cutoff for the domain scores, TC2 and NC2. TC2 is the lowest domain score found in the Pfam full alignment. NC2 is the highest domain score of matches not included in the Pfam full alignment. TC2 and NC2 of zf-C2H2 are 5.5 and 19.2, respectively. We therefore regarded the domains with scores greater than or equal to NC2 as positive examples in the training data set and the domains with scores lower than TC2 as negative examples in the training data set. Note that the training data set includes misclassifications, because the proposed method is designed to discriminate among domains detected by an HMM using only a profile HMM, TC2, and NC2. The test data set was composed of the domains with scores greater than or equal to TC2 and lower than NC2. The training and test data sets were made non-redundant. The numbers of positive and negative examples in the training data set were 2523 and 113, respectively.
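The score-based split described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `Domain` record, the `partition` function, and the toy scores are hypothetical, and only the two cutoff values are taken from the text.

```python
from typing import NamedTuple

TC2 = 5.5   # cutoff value quoted in the text for zf-C2H2
NC2 = 19.2  # cutoff value quoted in the text for zf-C2H2


class Domain(NamedTuple):
    """Hypothetical record for one HMM hit (not a HMMER data structure)."""
    name: str
    score: float  # HMM domain score


def partition(domains, tc=TC2, nc=NC2):
    """Split hits into training positives, training negatives, and test set."""
    train_pos = [d for d in domains if d.score >= nc]      # score >= NC2
    train_neg = [d for d in domains if d.score < tc]       # score < TC2
    test = [d for d in domains if tc <= d.score < nc]      # TC2 <= score < NC2
    return train_pos, train_neg, test


# Toy scores, invented for illustration only.
domains = [
    Domain("a", 25.0),
    Domain("b", 19.2),
    Domain("c", 12.3),
    Domain("d", 5.5),
    Domain("e", 1.0),
]
train_pos, train_neg, test = partition(domains)
```

The boundary cases follow the wording of the text: a score equal to NC2 goes to the training positives, and a score equal to TC2 goes to the test set.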

All domain sequences were transformed into FSVs on match states in the HMM using the nine-component Dirichlet mixture uprior.9comp [1, 3]. As the domain score decreased, the FSV became more scattered from that of the HMM consensus. The number of positive examples in the training data was too large compared with that of negative examples, so the positive examples were reduced to the 200 domains with the lowest scores in order to capture the sensitive features of the domains around the classification boundary. Each negative example was also refined: some amino acid residues on the match states through which the Viterbi path passed were replaced by those of the positive example with the minimum mean square distance of FSV, and the negative example was then retransformed into an FSV. Because FSVs were based on emission probabilities of match states in the HMM, they were not informative enough for an SVM to be trained to distinguish positive examples from negative examples in the test data set with high accuracy. Likelihood ratio scores were therefore also calculated on the delete and insert states through which the Viterbi path passed, and those on the other delete and insert states were set to 0.
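Two of the steps in this paragraph, keeping the lowest-scoring positive examples and finding the positive example closest to a negative example in FSV space, can be sketched in plain Python. The function names, the `(score, fsv)` tuple layout, and the toy vectors are assumptions made for illustration. The residue replacement and the retransformation into an FSV depend on the HMM alignment and are not shown.

```python
def lowest_scoring(examples, k=200):
    """Keep the k examples with the lowest HMM scores.

    `examples` is a list of (score, fsv) tuples (a hypothetical layout).
    """
    return sorted(examples, key=lambda e: e[0])[:k]


def mean_square_distance(u, v):
    """Mean of squared component differences between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def nearest_positive(neg_fsv, positives):
    """Return the positive (score, fsv) whose FSV is closest to neg_fsv."""
    return min(positives, key=lambda p: mean_square_distance(neg_fsv, p[1]))


# Toy data, invented for illustration only.
positives = [
    (30.0, [1.0, 0.0]),
    (21.0, [0.0, 1.0]),
    (19.5, [0.5, 0.5]),
]
kept = lowest_scoring(positives, k=2)       # the text uses k = 200
closest = nearest_positive([0.1, 0.9], kept)
```

Selecting the positives nearest the cutoff concentrates the training set on the region where the HMM score alone is least decisive.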

Then we trained a linear ν-SVM [2] using the FSVs and the likelihood ratio scores on delete and insert states. The ν-SVM has the advantage that its parameter ν controls the number of margin errors and support vectors. The positive accuracy (TP/(TP+FN)) and negative accuracy (TN/(TN+FP)) on the test data set when trained with ν = 0.1 were 75.0% and 76.9%, respectively. The positive and negative accuracy of the HMM were 73.0% and 72.5%, respectively. The performance of the HMM was evaluated as follows. As a threshold was varied from TC2 to NC2, domains with scores greater than or equal to the threshold were classified as positive examples and domains with scores lower than the threshold were classified as negative examples. The result at the threshold where both positive and negative accuracy were high was regarded as the performance of the HMM.
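The two accuracy measures and the threshold sweep used for the HMM baseline can be sketched as follows. The text says only that the chosen threshold is one where both accuracies are high, so this sketch assumes one possible criterion, maximizing the smaller of the two accuracies. The function names and the toy scores and labels are hypothetical.

```python
def accuracies(labels, predictions):
    """Return (positive accuracy, negative accuracy).

    Positive accuracy is TP / (TP + FN); negative accuracy is TN / (TN + FP).
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


def sweep_threshold(scores, labels, thresholds):
    """Pick the threshold maximizing min(pos. accuracy, neg. accuracy).

    The max-min criterion is an assumption, not taken from the text.
    Ties keep the earliest threshold.
    """
    best = None
    for t in thresholds:
        predictions = [s >= t for s in scores]  # score >= threshold -> positive
        pos_acc, neg_acc = accuracies(labels, predictions)
        if best is None or min(pos_acc, neg_acc) > min(best[1], best[2]):
            best = (t, pos_acc, neg_acc)
    return best


# Toy test set, invented for illustration only.
scores = [6.0, 8.0, 11.0, 13.0, 16.0, 18.0]
labels = [False, False, True, False, True, True]
best = sweep_threshold(scores, labels, thresholds=[5.5, 10.0, 15.0, 19.2])
```

Reporting the two accuracies separately matters here because the classes are unbalanced, so a single overall accuracy could hide poor performance on the smaller class.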

The proposed method showed better performance than the HMM. We note that the proposed method should be applicable to a variety of domains, since it does not make use of any specific characteristic of C2H2 domains.

References

[1] Karchin, R., Karplus, K., and Haussler, D., Classifying G-protein coupled receptors with support vector machines, Bioinformatics, 18(1):147–159, 2002.

[2] Schölkopf, B., Smola, A.J., Williamson, R.C., and Bartlett, P.L., New support vector algorithms, Neural Computation, 12:1207–1245, 2000.

[3] Sjölander, K., Karplus, K., Brown, M.P., Hughey, R., Krogh, A., Mian, I.S., and Haussler, D., Dirichlet mixtures: A method for improving detection of weak but significant protein sequence homology, CABIOS, 12:327–345, 1996.
