Nucleotide BLAST -- 688IT编程网

Nucleotide BLAST:
‧ blastn program: nucleotide sequence alignment.
‧ MegaBLAST: can search a batch of EST sequences, long cDNA sequences, or genomic sequences.

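BLAST (Basic Local Alignment Search Tool) is a tool for aligning nucleotide and protein sequences. The BLAST web page provides the BLAST programs, an overview, usage instructions, and answers to frequently asked questions (URL: bi.v/BLAST/).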

BLAST Program Selection Guide：bi.v/blast/producttable.shtml#tab31

Program Selection for Nucleotide Queries
Length ¹	Database	Purpose	Program
20 bp or longer (28 bp or above for megablast)	Nucleotide	Identify the query sequence	discontiguous megablast, megablast, or blastn
		Find sequences similar to query sequence	discontiguous megablast or blastn
		Find similar sequence from the Trace archive	Trace megablast, or Trace discontiguous megablast
		Find similar proteins to translated query in a translated database	Translated BLAST (tblastx)
	Peptide	Find similar proteins to translated query in a protein database	Translated BLAST (blastx)
7 - 20 bp	Nucleotide	Find primer binding sites or map short contiguous motifs	Search for short, nearly exact matches

4.1 MEGABLAST is the tool of choice to identify a nucleotide sequence.

The best way to identify an unknown sequence is to see if that sequence already exists in a public database. If the database sequence is well characterized, then one will have access to a wealth of biological information. MEGABLAST, discontiguous megablast, and blastn can all be used to accomplish this goal. However, MEGABLAST is specifically designed to efficiently find long alignments between very similar sequences and thus is the best tool for finding the identical match to your query sequence. In addition to the Expect value significance cut-off, MEGABLAST provides an adjustable percent identity cut-off for the alignment, which acts as a second filter on top of the significance threshold.
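The way the two cut-offs combine can be pictured with a small sketch. The function names and the 95% default below are illustrative assumptions, not MEGABLAST's internal code; the sketch only shows an alignment being kept when it passes both the Expect value threshold and the percent identity threshold.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent of alignment columns in which both rows carry the same base."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned rows must have equal length")
    if not aligned_query:
        return 0.0
    matches = sum(
        1
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
        if q == s and q != "-"  # a gap column never counts as a match
    )
    return 100.0 * matches / len(aligned_query)


def passes_cutoffs(evalue: float, identity: float,
                   max_evalue: float = 10.0, min_identity: float = 95.0) -> bool:
    """Keep a hit only if it satisfies both thresholds (values are assumptions)."""
    return evalue <= max_evalue and identity >= min_identity


identity = percent_identity("ACGTACGTAC", "ACGTACGAAC")  # 9 of 10 columns match
print(identity, passes_cutoffs(1e-20, identity))
```

A hit with a very small Expect value can still be rejected here because its identity falls under the percent identity cut-off.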

Web MEGABLAST and discontiguous megablast pages can also accept batch queries, the only web BLAST pages with this capability. Please refer to the "Batch Search" section for details.

4.2 Discontiguous MEGABLAST is better at finding nucleotide sequences similar, but not identical, to your nucleotide query.

The BLAST nucleotide algorithm finds similar sequences by breaking the query into short subsequences called words. The program first identifies the exact matches to the query words (word hits). The BLAST program then extends these word hits in multiple steps to generate the final gapped alignments.
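The seed-and-extend idea described above can be sketched in a few lines. This is a simplified illustration with hypothetical function names: it finds exact word hits through a lookup table and extends them only while bases stay identical, whereas the real program scores the extension and goes on to build gapped alignments.

```python
from collections import defaultdict


def index_words(sequence: str, word_size: int) -> dict:
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        index[sequence[start:start + word_size]].append(start)
    return index


def find_word_hits(query: str, subject: str, word_size: int) -> list:
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    subject_index = index_words(subject, word_size)
    hits = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in subject_index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits


def extend_hit(query: str, subject: str, q_pos: int, s_pos: int, word_size: int):
    """Ungapped extension: grow the hit in both directions while bases are identical."""
    q_start, s_start = q_pos, s_pos
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    q_end, s_end = q_pos + word_size, s_pos + word_size
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return q_start, s_start, q_end - q_start  # start in query, start in subject, length


query = "GGACGTTAGCAT"
subject = "TTTACGTTAGCCC"
hits = find_word_hits(query, subject, word_size=7)
print(hits)                                   # [(2, 3), (3, 4)]
print(extend_hit(query, subject, 2, 3, 7))    # (2, 3, 8)
```

Both word hits lie on the same diagonal, so extending either one recovers the same 8-base identical segment.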

One of the important parameters governing the sensitivity of BLAST searches is the length of the initial words, or word size as it is called. The most important reason that blastn is more sensitive than MEGABLAST is that it uses a shorter default word size (11). Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms. The word size is adjustable in blastn and can be reduced from the default value to a minimum of 7 to increase search sensitivity.
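The effect of word size on sensitivity follows from a simple observation: a seed can only form where the two sequences share an uninterrupted run of identical bases at least as long as the word. The sketch below uses hypothetical helper names and a made-up pair of sequences that are 90% identical but mismatch at every tenth base.

```python
def longest_identical_run(aligned_a: str, aligned_b: str) -> int:
    """Length of the longest stretch of consecutive identical columns."""
    best = current = 0
    for a, b in zip(aligned_a, aligned_b):
        current = current + 1 if a == b else 0
        best = max(best, current)
    return best


def can_seed(aligned_a: str, aligned_b: str, word_size: int) -> bool:
    """An exact word hit of this size exists only if some run is long enough."""
    return longest_identical_run(aligned_a, aligned_b) >= word_size


a = "ACGTACGTAC" * 3
# Introduce a mismatch at every tenth base (90% identity overall).
b = "".join("T" if i % 10 == 9 else base for i, base in enumerate(a))

for word_size in (28, 11, 7):
    print(word_size, can_seed(a, b, word_size))
# 28 False
# 11 False
# 7 True
```

With evenly spaced mismatches the longest identical run is 9 bases, so neither a word size of 28 nor the blastn default of 11 can seed this pair, while the minimum of 7 can.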

A more sensitive search can be achieved by using the newly introduced discontiguous megablast page. This page uses an algorithm with the same name, which is similar to that reported by Ma et al. Rather than requiring exact word matches as seeds for alignment extension, discontiguous megablast uses non-contiguous words within a longer template window. In coding mode, third-base wobble is taken into account by focusing on matches at the first and second codon positions while ignoring mismatches at the third position. At the same word size, a discontiguous MEGABLAST search is more sensitive and efficient than a standard blastn search. For this reason, it is now the recommended tool for this type of search. Alternative non-coding patterns can also be specified if desired. Additional details on discontiguous megablast are available at:

Parameters unique for discontiguous megablast are:

∙ word size: restricted to two options, i.e., 11 or 12

∙ template: only three options are available, 16, 18, or 21

∙ template type: coding (0), non-coding (1), or both (2)
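A discontiguous word can be illustrated with a template string in which 1 marks a position that must match and 0 marks a position that is ignored. The pattern below (11 required positions within a 16-base window, with every third position ignored) is an assumption chosen to mimic coding mode, and the function names are hypothetical; it is a sketch of the idea rather than the program's actual templates.

```python
# Illustrative coding-style template: word size 11 inside a 16-base window.
CODING_TEMPLATE = "1101101101101101"


def template_match(query_window: str, subject_window: str, template: str) -> bool:
    """True when every position marked '1' in the template is identical."""
    if not (len(query_window) == len(subject_window) == len(template)):
        raise ValueError("windows and template must have the same length")
    return all(
        q == s
        for q, s, t in zip(query_window, subject_window, template)
        if t == "1"
    )


def find_discontiguous_hits(query: str, subject: str, template: str) -> list:
    """Naive scan for (query_pos, subject_pos) pairs that satisfy the template."""
    span = len(template)
    hits = []
    for q_pos in range(len(query) - span + 1):
        for s_pos in range(len(subject) - span + 1):
            if template_match(query[q_pos:q_pos + span],
                              subject[s_pos:s_pos + span], template):
                hits.append((q_pos, s_pos))
    return hits


query = "GCTGAAGGTCCTATCA"
subject = "GCCGAAGGACCTATGA"  # differs from the query only at third codon positions

print(CODING_TEMPLATE.count("1"))                                  # 11
print(find_discontiguous_hits(query, subject, CODING_TEMPLATE))    # [(0, 0)]
print(any(query[i:i + 11] in subject for i in range(len(query) - 10)))  # False
```

The two sequences share no contiguous 11-base word, yet the template still produces a seed because the three differences all fall on ignored positions.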

It is important to point out that nucleotide-nucleotide searches are not the best method for finding homologous protein coding regions in other organisms. That task is better accomplished by performing searches at the protein level, by direct protein-protein BLAST searches or by translated BLAST searches. This is because of codon degeneracy, the greater information available in amino acid sequences, and the more sophisticated algorithm and scoring matrices used in protein-protein BLAST.

4.3 "Search for short nearly exact matches" is useful for primer or short nucleotide searches.

Short sequences (less than 20 bases) will often not find any significant matches to the database entries under the standard nucleotide-nucleotide BLAST settings. The usual reasons for this are that the significance threshold governed by the Expect value parameter is set too stringently and the default word size parameter is set too high.
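A rough calculation shows why the Expect value threshold matters for short queries. The sketch below uses the Karlin-Altschul relation E = K * m * n * exp(-lambda * S) with parameter values assumed for ungapped +1/-3 nucleotide scoring and a hypothetical database of 10^10 bases; the real program also corrects the sequence lengths, so the numbers are only indicative.

```python
import math

# Assumed statistical parameters for ungapped scoring with match +1, mismatch -3.
LAMBDA = 1.374
K = 0.711


def expect_value(raw_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return K * query_length * database_length * math.exp(-LAMBDA * raw_score)


def best_possible_evalue(query_length: int, database_length: int,
                         match_reward: int = 1) -> float:
    """Expect value of a perfect, full-length match to the query."""
    return expect_value(match_reward * query_length, query_length, database_length)


DATABASE_LENGTH = 10**10  # hypothetical database size in bases

for length in (15, 20, 30):
    print(length, best_possible_evalue(length, DATABASE_LENGTH))
```

Under these assumptions even a perfect match to a 15-base query has an Expect value above the default threshold of 10, so it would not be reported, while a perfect 30-base match is far below it.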

You can adjust both the word size and the expect value on the standard BLAST pages to work with short sequences. NCBI provides a BLAST page with these values preset to give optimal results with short sequences. This page ("Search for short nearly exact matches") is linked under the nucleotide BLAST section of the main BLAST page.

Table 4.3.1 Parameter settings for standard blastn and "Search for short nearly exact matches"
Program	Word Size	DUST Filter Setting	Expect Value
Standard blastn	11	On	10
Search for short nearly exact matches	7	Off	1000
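For readers who run searches from the command line instead of the web page, the two rows of the table can be expressed as option presets. The sketch below assumes the standalone BLAST+ blastn program and its -word_size, -dust, and -evalue options; the preset names, file name, and database name are hypothetical, and the function only builds the argument list without running anything.

```python
# Settings taken from Table 4.3.1; the preset names are made up for this sketch.
PRESETS = {
    "standard_blastn": {"word_size": 11, "dust": "yes", "evalue": 10},
    "short_nearly_exact": {"word_size": 7, "dust": "no", "evalue": 1000},
}


def build_blastn_args(query_path: str, database: str, preset: str) -> list:
    """Build an argument list for BLAST+ blastn (assumed to be installed)."""
    settings = PRESETS[preset]
    return [
        "blastn",
        "-task", "blastn",
        "-query", query_path,
        "-db", database,
        "-word_size", str(settings["word_size"]),
        "-dust", settings["dust"],
        "-evalue", str(settings["evalue"]),
    ]


# Hypothetical file and database names.
print(" ".join(build_blastn_args("primers.fasta", "nt", "short_nearly_exact")))
```

Keeping the presets in one dictionary makes it easy to see that only three values differ between the standard search and the short-sequence search.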

A common use of this page is to check the specificity of PCR or hybridization primers. A useful way to check a pair of PCR primers is to first concatenate them by inserting a string of 20 or more N's between the two primers, and then search the concatenated pair as one sequence. Since BLAST looks for local alignments and automatically searches both strands, there is no need to reverse complement the reverse primer before doing the concatenation or the search.
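The concatenation step is simple enough to script. The helper below is a minimal sketch with a hypothetical function name and made-up primer sequences; it joins the two primers with a spacer of N's and leaves the reverse primer as given, since both strands are searched.

```python
def concatenate_primers(forward: str, reverse: str, spacer_length: int = 20) -> str:
    """Join two primers with a run of N's so they can be searched as one query."""
    if spacer_length < 20:
        raise ValueError("use a spacer of 20 or more N's")
    return forward.upper() + "N" * spacer_length + reverse.upper()


# Made-up primer sequences for illustration only.
forward_primer = "AGCTGACCTGAGGAGTTCAA"
reverse_primer = "TTGCCAGTACGGATCCATTG"

combined = concatenate_primers(forward_primer, reverse_primer)
print(combined)
print(len(combined))  # 60
```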

The query sequence should contain no ambiguous bases. Consensus motifs with degenerate bases, such as AACNNNNNNRTAYG (StySQI recognition site) or TGGNNNNNNGCCAA (NF-1 binding site) will not work for this type of search.
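Before submitting a primer or motif, it can help to check it for ambiguity codes. The function below is a small sketch with a hypothetical name; it treats anything other than A, C, G, or T as ambiguous and reports where those characters occur.

```python
UNAMBIGUOUS_BASES = set("ACGT")


def ambiguous_positions(sequence: str) -> list:
    """Return (index, character) for every base that is not A, C, G, or T."""
    return [
        (index, base)
        for index, base in enumerate(sequence.upper())
        if base not in UNAMBIGUOUS_BASES
    ]


print(ambiguous_positions("ACGTACGT"))            # []
print(len(ambiguous_positions("AACNNNNNNRTAYG")))  # 8
```

The StySQI motif from the text yields eight flagged positions (six N's, one R, and one Y), which marks it as unsuitable for this type of search.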

4.4 Use the Trace Archive BLAST page to search raw primary sequence trace files.

Trace data files are not official entries of the GenBank database and have no associated feature annotations. Despite this limitation, they are still a rich source of sequence information, especially for organisms lacking a significant amount of regular mRNA or assembled genomic sequences. The sequence data come from a variety of projects and sequencing strategies, including Whole Genome Shotgun (WGS), BAC end sequencing, and EST sequencing. The trace data are single-pass sequencing reads not trimmed for quality or vector contamination. Their average lengths are between 500 and 700 bp.

A search against the Trace Archive can use MEGABLAST or discontiguous MEGABLAST. The former is better for identifying exact matches in intra-species searches, such as looking for extra mRNA sequences or the genomic counterparts for a given gene, while the latter is better for identifying similar coding sequences from different species. Information on the Trace Archive is available from the Trace documentation page.

Nucleotide Databases for BLAST
Database	Content Description
nr ¹	All GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant" due to computational cost.
refseq_mrna	mRNA sequences from NCBI Reference Sequence Project.
refseq_genomic	Genomic sequences from NCBI Reference Sequence Project.
est	Database of GenBank + EMBL + DDBJ sequences from EST division.
est_human	Human subset of est.
est_mouse	Mouse subset of est.
est_others	Subset of est other than human or mouse.
gss	Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
htgs	Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2. Finished, phase 3 HTG sequences are in nr.
pat	Nucleotides from the Patent division of GenBank.
pdb	Sequences derived from the 3-dimensional structure records from Protein Data Bank. They are NOT the coding sequences for the corresponding proteins found in the same PDB record.
month	All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
alu_repeats	Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. See "Alu alert" by Claverie and Makalowski, Nature 371: 752 (1994).
dbsts	Database of Sequence Tag Site entries from the STS division of GenBank + EMBL + DDBJ.
chromosome	Complete genomes and complete chromosomes from the NCBI Reference Sequence project. It overlaps with refseq_genomic.
wgs	Assemblies of Whole Genome Shotgun sequences.
env_nt	Sequences from environmental samples, such as uncultured bacterial samples isolated from soil or marine samples. The largest single source is the Sargasso Sea project. This does NOT overlap with nucleotide nr.

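When running BLASTn, the system offers three program options: Highly similar sequences (megablast), More dissimilar sequences (discontiguous megablast), and Somewhat similar sequences (blastn).

The first option, megablast, compares highly similar DNA sequences. The best way to identify an unknown DNA sequence is to check whether it already exists in a public database. Megablast is a sequence comparison tool designed specifically for long sequence fragments that are highly similar (95% similarity or more). In addition to an Expect value significance cut-off for alignments, megablast provides a percent identity cut-off. Users can adjust both parameters together to refine the search results.

The second option, discontiguous megablast, is generally chosen when the sequences differ more than megablast can handle. Its basic principle is to break the query sequence into short fragments, called words, and compare them with the database sequences; where a word matches, it serves as a seed that is extended in both directions to obtain similar sequences that meet the search criteria. The words used by discontiguous megablast are non-contiguous, which increases the sensitivity of the search. Its template type option has three settings: coding (0), non-coding (1), or both (2). In coding mode, following the wobble principle for the third base, as long as the first and second bases of a codon match exactly, the third base is ignored and not compared. At the same word length, discontiguous megablast is more sensitive than blastn.

The third option, Somewhat similar sequences (blastn), can compare sequences whose similarity is quite low. It uses the same seed-and-extend approach as discontiguous megablast, except that its words are contiguous. The blastn word is shorter than the megablast word, so blastn is more sensitive than megablast, but it runs more slowly.

Note: word size is one of the main parameters affecting BLAST sensitivity, and its value should be chosen case by case.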

NCBI BLASTn:

www.incogen/public_documents/vibe/details/NcbiBlastn.html

Standard nucleotide-nucleotide BLAST

Takes nucleotide sequences and compares them against the NCBI nucleotide databases. It is better at finding sequences similar, but not identical, to your query.

The BLAST nucleotide algorithm finds similar sequences by generating an indexed table or dictionary of short subsequences called words for both the query and the database. The program can then rapidly find initial exact matches to the query words by simply looking up a particular word in the database dictionary. These initial matches serve as starting points for longer alignments that are generated in several steps, ending with a final gapped alignment.

One of the important parameters governing the sensitivity of BLAST searches is the length of the initial words (word size). The most important reason that blastn is more sensitive than MEGABLAST is that it uses a shorter default word size. Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms since the initial exact match can be shorter. The word size is adjustable in blastn and can be reduced from the default value of 11 to a minimum of 7 to increase sensitivity. This word size can also be increased to increase the search speed and limit the number of database hits.
