Nucleotide BLAST:
‧ blastn program: nucleotide sequence alignment.
‧ MegaBLAST: searches a batch of EST sequences, long cDNA sequences, or genomic sequences.
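BLAST (Basic Local Alignment Search Tool) is a tool for comparing nucleotide and protein sequences. The BLAST web page provides the BLAST programs, an overview, usage instructions, and answers to frequently asked questions (URL: bi.v/BLAST/).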
BLAST Program Selection Guide: bi.v/blast/producttable.shtml#tab31
Program Selection for Nucleotide Queries
Length ¹ | Database | Purpose | Program |
20 bp or longer (28 bp or above for megablast) | Nucleotide | Identify the query sequence | discontiguous megablast, megablast, or blastn |
20 bp or longer (28 bp or above for megablast) | Nucleotide | Find sequences similar to query sequence | discontiguous megablast or blastn |
20 bp or longer (28 bp or above for megablast) | Nucleotide | Find similar sequence from the Trace archive | Trace megablast, or Trace discontiguous megablast |
20 bp or longer (28 bp or above for megablast) | Nucleotide | Find similar proteins to translated query in a translated database | Translated BLAST (tblastx) |
20 bp or longer (28 bp or above for megablast) | Peptide | Find similar proteins to translated query in a protein database | Translated BLAST (blastx) |
7 - 20 bp | Nucleotide | Find primer binding sites or map short contiguous motifs | Search for short, nearly exact matches |
4.1 MEGABLAST is the tool of choice to identify a nucleotide sequence.
The best way to identify an unknown sequence is to see whether it already exists in a public database. If the database sequence is well characterized, you gain access to a wealth of biological information. MEGABLAST, discontiguous megablast, and blastn can all be used for this. However, MEGABLAST is specifically designed to efficiently find long alignments between very similar sequences, so it is the best tool for finding an identical match to your query sequence. In addition to the Expect value significance cut-off, MEGABLAST provides an adjustable percent identity cut-off for the alignment. The web MEGABLAST and discontiguous megablast pages also accept batch queries and are the only web BLAST pages with this capability. Please refer to the "Batch Search" section for details.
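The two cut-offs described above can be pictured as a simple filter on hits. The following is a minimal sketch, not MEGABLAST's own code: the function names and the default thresholds (95% identity, Expect value 10) are assumptions chosen for illustration, and percent identity is taken here as identical columns divided by alignment length.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Return identical columns as a percentage of the alignment length.

    Both strings must already be aligned (same length); '-' marks a gap.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        q == s and q != "-"
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
    )
    return 100.0 * matches / len(aligned_query)


def passes_cutoffs(identity: float, evalue: float,
                   min_identity: float = 95.0,   # hypothetical default
                   max_evalue: float = 10.0) -> bool:  # hypothetical default
    """Keep a hit only if it satisfies both the identity and Expect cut-offs."""
    return identity >= min_identity and evalue <= max_evalue


if __name__ == "__main__":
    identity = percent_identity("ACGTACGTAC", "ACGTACGAAC")
    print(identity, passes_cutoffs(identity, evalue=1e-20))
```

A hit with a very small Expect value can still be rejected if it falls under the identity threshold, which is the point of having the second cut-off.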
4.2 Discontiguous MEGABLAST is better at finding nucleotide sequences similar, but not identical, to your nucleotide query.
The BLAST nucleotide algorithm finds similar sequences by breaking the query into short subsequences called words. The program first identifies exact matches to the query words (word hits), then extends these word hits in multiple steps to generate the final gapped alignments. One of the important parameters governing the sensitivity of BLAST searches is the length of the initial words, called the word size. The most important reason that blastn is more sensitive than MEGABLAST is that it uses a shorter default word size (11). Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms. The word size is adjustable in blastn and can be reduced from the default value to a minimum of 7 to increase search sensitivity.
A more sensitive search can be achieved by using the newly introduced discontiguous megablast page. This page uses an algorithm of the same name, which is similar to that reported by Ma et al. Rather than requiring exact word matches as seeds for alignment extension, discontiguous megablast uses non-contiguous words within a longer template window. In coding mode, third-base wobble is taken into account by looking for matches at the first and second codon positions while ignoring mismatches at the third position. At the same word size, discontiguous MEGABLAST is more sensitive and efficient than standard blastn. For this reason, it is now the recommended tool for this type of search. Alternative non-coding patterns can also be specified if desired.
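The difference between contiguous and non-contiguous words can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the discontiguous megablast implementation: the template string below (11 required positions in a 16-base window, every third base ignored) is an illustrative coding-style pattern, and the two test sequences are made up so that they differ only at third codon positions.

```python
def seed_hits(query: str, subject: str, template: str):
    """Return (query_pos, subject_pos) pairs where every '1' position of the
    template matches; '0' positions are ignored.

    A template of all '1's is an ordinary contiguous word.
    """
    keep = [k for k, flag in enumerate(template) if flag == "1"]
    span = len(template)
    hits = []
    for i in range(len(query) - span + 1):
        for j in range(len(subject) - span + 1):
            if all(query[i + k] == subject[j + k] for k in keep):
                hits.append((i, j))
    return hits


CONTIGUOUS_11 = "1" * 11                # contiguous word of size 11
CODING_TEMPLATE = "1101101101101101"    # illustrative: skips every third base

# Hypothetical sequences: same codons except for the third position of each.
QUERY = "GCTGAAAAGCTGGTTGATCGT"
SUBJECT = "GCCGAGAAACTCGTGGACCGC"

if __name__ == "__main__":
    print("contiguous seeds:", seed_hits(QUERY, SUBJECT, CONTIGUOUS_11))
    print("template seeds:  ", seed_hits(QUERY, SUBJECT, CODING_TEMPLATE))
```

For these two sequences the contiguous search finds no seed at all, while the template search finds seeds that include (0, 0) and (3, 3), so an alignment could be started even though no 11 consecutive bases agree.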
Additional details on discontiguous megablast are available at:
Parameters unique to discontiguous megablast are:
∙ word size: restricted to two options, 11 or 12
∙ template: only three options are available, 16, 18, or 21
∙ template type: coding (0), non-coding (1), or both (2)
It is important to point out that nucleotide-nucleotide searches are not the best method for finding homologous protein coding regions in other organisms. That task is better accomplished by searching at the protein level, either by direct protein-protein BLAST searches or by translated BLAST searches. This is because of codon degeneracy, the greater information available in amino acid sequences, and the more sophisticated algorithm and scoring matrix used in protein-protein BLAST.
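Codon degeneracy is easy to see with a small translation sketch. The code below uses the standard genetic code; the two DNA sequences are hypothetical examples constructed to differ at every third codon position, and the helper names are assumptions for illustration only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate reading frame 1; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def nucleotide_identity(a: str, b: str) -> float:
    """Percent of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical coding sequences that differ only at third codon positions.
SEQ_A = "GCTGAAAAGCTGGTTGATCGT"
SEQ_B = "GCCGAGAAACTCGTGGACCGC"

if __name__ == "__main__":
    print(round(nucleotide_identity(SEQ_A, SEQ_B), 1), "% nucleotide identity")
    print(translate(SEQ_A), translate(SEQ_B))
```

The two sequences agree at only 14 of 21 nucleotide positions, yet both translate to the same peptide (AEKLVDR), which is why a protein-level comparison recovers a relationship that a nucleotide-level comparison can miss.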
4.3 "Search for short nearly exact matches" is useful for primer or short nucleotide searches.
Short sequences (less than 20 bases) will often not find any significant matches to the database entries under the standard nucleotide-nucleotide BLAST settings. The usual reasons for this are that the significance threshold governed by the Expect value parameter is set too stringently and the default word size parameter is set too high. You can adjust both the word size and the expect value on the standard BLAST pages to work with short sequences. NCBI provides a BLAST page with these values preset to give optimal results with short sequences. This page ("Search for short nearly exact matches") is linked under the nucleotide BLAST section of the main BLAST page.
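Why short queries struggle against the Expect threshold can be sketched with the Karlin-Altschul form of the Expect value, E = K·m·n·e^(−λS), where m is the query length, n the database length, and S the raw alignment score. The snippet below is only an illustration: the values of λ and K, the database size, and the assumption that a perfect match scores one point per base are placeholders, not figures from the BLAST documentation, and edge-effect corrections are ignored.

```python
import math


def expect_value(score: float, query_len: int, db_len: float,
                 lam: float, k: float) -> float:
    """Karlin-Altschul Expect value: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder statistical parameters and database size (assumptions).
LAMBDA = 1.37
K = 0.71
DB_LENGTH = 1e11  # total bases in a hypothetical database


def perfect_match_evalue(length: int) -> float:
    """E-value of a full-length perfect match, assuming a reward of 1 per base."""
    return expect_value(score=length, query_len=length, db_len=DB_LENGTH,
                        lam=LAMBDA, k=K)


if __name__ == "__main__":
    for length in (15, 18, 25, 100):
        print(length, perfect_match_evalue(length))
```

Because the best possible score is capped by the query length, even a perfect match to a short query ends up with a far larger Expect value than a perfect match to a long one, so it can fall outside a threshold that long queries pass easily.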
A common use of this page is to check the specificity of PCR or hybridization primers. A useful way to check a pair of PCR primers is to first concatenate them by inserting a string of 20 or more N's between the two primers, and then search the concatenated pair as one sequence. Since BLAST looks for local alignments and automatically searches both strands, there is no need to reverse complement the reverse primer before doing the concatenation or the search. Apart from this N spacer, the query sequence should contain no ambiguous bases. Consensus motifs with degenerate bases, such as AACNNNNNNRTAYG (StySQI recognition site) or TGGNNNNNNGCCAA (NF-1 binding site), will not work for this type of search.
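The primer concatenation step can be sketched as a small helper. This is a minimal illustration: the function name and the example primer sequences are hypothetical, and the check for non-ACGT characters simply mirrors the advice that the primers themselves should carry no ambiguous bases.

```python
def concatenate_primers(forward: str, reverse: str, spacer_length: int = 20) -> str:
    """Join two primers with a run of N's so they can be searched as one query.

    The reverse primer is used as given; no reverse complement is taken.
    """
    if spacer_length < 20:
        raise ValueError("use a spacer of 20 or more N's")
    for name, primer in (("forward", forward), ("reverse", reverse)):
        unexpected = set(primer.upper()) - set("ACGT")
        if not primer or unexpected:
            raise ValueError(f"{name} primer must contain only A, C, G, T")
    return forward.upper() + "N" * spacer_length + reverse.upper()


if __name__ == "__main__":
    # Hypothetical primer pair, for illustration only.
    query = concatenate_primers("ACGTACGTACGTACGTAC", "ttgcatgcatgcatgcaa")
    print(query)
```

The resulting string can be pasted into the search box as a single query; the N spacer keeps the two primers from being joined into one spurious word.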
4.4 Use the Trace Archive BLAST page to search raw primary sequence trace files.
Trace data files are not official entries of the GenBank database and have no associated feature annotations. Despite this limitation, they are still a rich source of sequence information, especially for organisms lacking a significant amount of regular mRNA or assembled genomic sequences. The sequence data come from a variety of projects and sequencing strategies, including Whole Genome Shotgun (WGS), BAC end sequencing, and EST sequencing. The trace data are single-pass sequencing reads not trimmed for quality or vector contamination. Their average lengths are between 500 and 700 bp. A search against the Trace Archive can use MEGABLAST or discontiguous MEGABLAST. The former is better for identifying exact matches in intra-species searches, such as looking for extra mRNA sequences or the genomic counterparts of a given gene, while the latter is better for identifying similar coding sequences from different species. Information on the Trace Archive is available from the Trace documentation page.
Nucleotide Databases for BLAST
Database | Content Description |
nr ¹ | All GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant" due to computational cost. |
refseq_mrna | mRNA sequences from NCBI Reference Sequence Project. |
refseq_genomic | Genomic sequences from NCBI Reference Sequence Project. |
est | Database of GenBank + EMBL + DDBJ sequences from EST division. |
est_human | Human subset of est. |
est_mouse | Mouse subset of est. |
est_others | Subset of est other than human or mouse. |
gss | Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences. |
htgs | Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2. Finished, phase 3 HTG sequences are in nr. |
pat | Nucleotides from the Patent division of GenBank. |
pdb | Sequences derived from the 3-dimensional structure records from Protein Data Bank. They are NOT the coding sequences for the corresponding proteins found in the same PDB record. |
month | All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days. |
alu_repeats | Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. See "Alu alert" by Claverie and Makalowski, Nature 371: 752 (1994). |
dbsts | Database of Sequence Tag Site entries from the STS division of GenBank + EMBL + DDBJ. |
chromosome | Complete genomes and complete chromosomes from the NCBI Reference Sequence project. It overlaps with refseq_genomic. |
wgs | Assemblies of Whole Genome Shotgun sequences. |
env_nt | Sequences from environmental samples, such as uncultured bacterial samples isolated from soil or marine samples. The largest single source is the Sargasso Sea project. This does NOT overlap with nucleotide nr. |
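When you run BLASTn, the system offers three program options: Highly similar sequences (megablast), More dissimilar sequences (discontiguous megablast), and Somewhat similar sequences (blastn).
The first option, megablast, compares highly similar DNA sequences. The best way to identify an unknown DNA sequence is to check whether it already exists in a public database. Megablast is a comparison tool designed specifically for long sequence segments that are highly similar (95% similarity or more). Besides the Expect value significance cut-off for alignments, megablast provides a percent identity cut-off. Users can adjust both parameters together to refine the search results.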
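The second option, discontiguous megablast, is generally chosen when the sequences differ more than megablast can handle. The underlying algorithm breaks the query into short segments called words and compares them with the database sequences; where a word matches exactly, it serves as a seed that is extended in both directions to obtain similar sequences that meet the search criteria. The words used by discontiguous megablast are non-contiguous, which makes it the most sensitive of the three programs. Its template type option has three settings: coding (0), non-coding (1), and both (2). In coding mode, following the wobble principle for the third base, the third base of a codon is ignored as long as the first and second bases match exactly. At the same word length, discontiguous megablast is more sensitive than blastn.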
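The third option, Somewhat similar sequences (blastn), can compare sequences whose similarity is quite low. It uses the same seed-and-extend approach as discontiguous megablast, except that its words are contiguous. Blastn uses shorter words than megablast, so it is more sensitive than megablast but runs more slowly.
Note: word size is one of the main parameters affecting BLAST sensitivity, and its value should be chosen case by case.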
NCBI BLASTn:
www.incogen/public_documents/vibe/details/NcbiBlastn.html
Standard nucleotide-nucleotide BLAST
Takes nucleotide sequences and compares them against the NCBI nucleotide databases. It is better at finding sequences similar, but not identical, to your query.
The BLAST nucleotide algorithm finds similar sequences by generating an indexed table or dictionary of short subsequences called words for both the query and the database. The program can then rapidly find initial exact matches to the query words by simply looking up a particular word in the database dictionary. These initial matches serve as starting points for longer alignments that are generated in several steps, ending with a final gapped alignment.
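The dictionary lookup and extension steps can be sketched as follows. This is a simplified illustration rather than the blastn implementation: the function names and the example sequences are hypothetical, and the extension shown here only grows a hit while bases match exactly, whereas real BLAST extends with a scoring scheme and finishes with a gapped alignment.

```python
from collections import defaultdict


def build_word_index(sequence: str, word_size: int = 11):
    """Map every word of length word_size to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(sequence) - word_size + 1):
        index[sequence[pos:pos + word_size]].append(pos)
    return index


def find_word_hits(query: str, db_index, word_size: int = 11):
    """Look up each query word in the index; return (query_pos, db_pos) pairs."""
    hits = []
    for qpos in range(len(query) - word_size + 1):
        for dpos in db_index.get(query[qpos:qpos + word_size], ()):
            hits.append((qpos, dpos))
    return hits


def extend_exact(query: str, subject: str, qpos: int, spos: int, word_size: int):
    """Grow a word hit left and right while bases keep matching exactly.

    Returns (query_start, subject_start, length) of the extended match.
    """
    q_start, s_start = qpos, spos
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    q_end, s_end = qpos + word_size, spos + word_size
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return q_start, s_start, q_end - q_start


if __name__ == "__main__":
    # Hypothetical database sequence and query, for illustration only.
    database = "TTTTACGTACGGATCCATGAAAA"
    query = "ACGTACGGATCCATG"
    index = build_word_index(database, word_size=11)
    for qpos, dpos in find_word_hits(query, index, word_size=11):
        print((qpos, dpos), extend_exact(query, database, qpos, dpos, 11))
```

Building the index once means each query word costs a single dictionary lookup instead of a scan of the whole database, which is what makes the initial search fast.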
One of the important parameters governing the sensitivity of BLAST searches is the length of the initial words (word size). The most important reason that blastn is more sensitive than MEGABLAST is that it uses a shorter default word size. Because of this, blastn is better than MEGABLAST at finding alignments to related nucleotide sequences from other organisms since the initial exact match can be shorter. The word size is adjustable in blastn and can be reduced from the default value of 11 to a minimum of 7 to increase sensitivity. This word size can also be increased to increase the search speed and limit the number of database hits.