
Anthony S. Amend - a.amend@berkeley.edu

June 2010 MOTHUR Tutorial for Fungal Community Analysis

INTRODUCTION

Before we get started, there are some things you should know…

MOTHUR works with a command line interface. This may sound intimidating, but if you take a few hours to learn some basic operations in your Unix/Linux/Terminal, it will save you *LOTS* of time and frustration in the future. There are plenty of great free online tutorials to help you out. Here’s the first of ~6,000,000 google hits for “unix tutorial”:

surrey.ac.uk/Teaching/Unix/

Eukaryotic systems in general, and ITS analyses specifically, are a little outside of the microbial ecology mainstream. MOTHUR was designed to work with environmental sequences that can be reasonably aligned and are also useful for distinguishing species. In prokaryotic systems the 16S (a.k.a. the gene encoding the small-subunit ribosomal RNA) seems to fit both of these criteria. In fungi we’re not so lucky. The ITS region is good at distinguishing species, but is difficult to get into a meaningful alignment across kingdom Fungi. Other loci, like the large or small ribosomal subunits (28S or 18S), can yield reasonable multiple sequence alignments, but might not be good for determining Operational Taxonomic Units (OTUs), depending on which groups of fungi you’re interested in.

To make a long story short, the parts of MOTHUR that deal with clustering OTUs based on multiple sequence alignments probably won’t work for you no matter which locus you’re working with. That’s OK! There are a number of programs that use pairwise alignments and will work for you just fine (e.g. Blastclust, CD-HIT, UCLUST, TGICL), and at this meeting we’ll learn about some analysis pipelines that integrate these types of clustering algorithms so that you can work with programs like MOTHUR. By the time you read this, this may even be a part of MOTHUR.

This tutorial is divided into two sections: a “Phylogenetic Analysis” section in which we use MOTHUR on our raw 454 data of environmental LSU sequences, and an “OTU Analysis” section in which I’ve already used the program CD-HIT-EST to calculate OTUs based on 97% ITS sequence identity. The data used in this tutorial are from fungi found in indoor dust samples from around the globe (see Amend A, Samson R, Seifert K, Bruns T (2010) Deep sequencing reveals diverse and geographically structured assemblages of fungi in indoor dust. Proceedings of the National Academy of Sciences, doi:10.1073/pnas.1000454107 for details). We’ve reduced the number of reads to make things simpler.

Finally, MOTHUR can do a lot more than I had time to present in this tutorial, and it is constantly being improved upon and expanded. If this is a program that you’re planning to use, I urge you to consult the MOTHUR manual for alternative analysis options and parameters (sections which are indented below were “borrowed” from this source).

STARTING MOTHUR

In your terminal, navigate to your “FesinFiles” folder. On my machine I type:

crassa$ cd /Users/crassa/Documents/Fesin/FesinFiles

My computer is named “crassa” and the “$” indicates the UNIX prompt, so I don’t actually type those. The UNIX command “cd” stands for “change directory”. You can save yourself the trouble of typing the entire path by locating the file in your “finder” or windows directory and dragging it into your terminal. The path, as if by magic, appears.

Now that we’re inside the right directory, the easiest way to execute MOTHUR is by simply indicating the path where you have the MOTHUR executable installed. Again it’s pretty easy to do this by finding the file and dragging it in:

FesinFiles crassa$ /Applications/mothur

OK, we’re ready to go! MOTHUR will always look for input files in, and will write output files to, the directory from which it is executed (FesinFiles in this case), so if the program can’t find files that you KNOW are there, you should make sure that you are working from the correct directory.

PHYLOGENETIC (LSU) ANALYSES

There are three files we’ll be using here: LSU.fna, LSU.qual, and LSUoligos. *.fna is equivalent to a fasta sequence file, and *.qual is a quality score file. These are typical outputs from a 454 run.

Initial processing and clean-up of sequences

Let’s take a quick look at the LSU.fna file using “summary.seqs”:

MOTHUR > summary.seqs(fasta=LSU.fna)

             Start  End  NBases  Ambigs  Polymer
Minimum:     1      55   55      0       3
2.5%-tile:   1      104  104     0       4
25%-tile:    1      278  278     0       4
Median:      1      345  345     0       5
75%-tile:    1      392  392     0       5
97.5%-tile:  1      484  484     1       6
Maximum:     1      572  572     5       61
# of Seqs:   1936

So we can see that we have 1936 sequences, that our shortest read was 55 bp, our longest was 572 bp, and the median was 345 bp. We can also see that our “messiest” sequence contained 5 ambiguous base calls (Ns), and there was at least one sequence with a homopolymer (single nucleotide repeat) 61 bases long!
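
To make the columns of that table concrete, here is a minimal Python sketch of the per-sequence numbers being summarized. It is not MOTHUR's code: the function name `seq_stats` is made up for illustration, and it assumes that "ambiguous" means any character other than A, C, G or T.

```python
from itertools import groupby


def seq_stats(seq):
    """Return (n_bases, n_ambiguous, longest_homopolymer) for one sequence."""
    seq = seq.upper()
    n_bases = len(seq)
    # Assumption: anything that is not A, C, G or T counts as ambiguous (e.g. N).
    n_ambig = sum(1 for base in seq if base not in "ACGT")
    # groupby yields one group per run of identical characters.
    longest = max((len(list(run)) for _, run in groupby(seq)), default=0)
    return n_bases, n_ambig, longest


print(seq_stats("ACGTNNAAAAAG"))  # (12, 2, 5)
```

Running this over every read and taking the minimum, median, maximum and percentiles of each value would give a table shaped like the one above.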

Let’s clean these sequences up a little bit. Here’s what Pat Schloss says about the trim.seqs command:

    The trim.seqs command provides the preprocessing features needed to screen and sort pyrosequences. Similar analyses are provided by RDP; here we give you added flexibility and speed. The command will enable you to trim off primer sequences and barcodes, use the barcode information to generate a group file and split a fasta file into sub-files, screen sequences based on the qual file that comes from 454 sequencers, cull sequences based on sequence length and the presence of ambiguous bases and get the reverse complement of your sequences. While this analysis is clearly geared towards pyrosequencing collections, it can also be used with traditional Sanger sequences.

We used “barcoded” primers to multiplex a number of samples in a single run. To be able to figure out which sequences go with which samples, we created an “oligos” file:

    The oligos option takes a file that can contain the sequences of the forward and reverse primers and barcodes and their sample identifier. Each line of the oligos file can start with the key words "forward", "reverse", and "barcode" or it can start with a "#" to tell MOTHUR to ignore that line of the oligos file. Here we are using a "#" for the reverse primer to indicate that we don't want MOTHUR to screen for the reverse primer because the 454 sequencing platform cannot generate sequences that are >500 bp long. If the "#" were removed, all of the sequences would wind up in the scrap file. You can enter your oligos as upper or lowercase letters. It has been shown that sequencing errors in the PCR primer region of a sequence correlate highly with poor sequence quality. Therefore, MOTHUR will not allow you to have anything less than an exact match to the primer or barcode sequences that you provide.
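
The exact-match rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MOTHUR's implementation: the function name, barcodes, primer and sample names are all invented, the read is assumed to start with the barcode followed directly by the forward primer, and degenerate primer bases are not handled.

```python
def assign_sample(seq, forward_primer, barcodes):
    """Return (sample, trimmed_seq), or (None, seq) when nothing matches exactly.

    barcodes maps barcode sequence -> sample name (all values here are
    hypothetical). Case is ignored, mirroring the upper/lowercase note above.
    """
    seq = seq.upper()
    for barcode, sample in barcodes.items():
        prefix = barcode.upper() + forward_primer.upper()
        if seq.startswith(prefix):
            # Strip the barcode and primer, keep the rest of the read.
            return sample, seq[len(prefix):]
    return None, seq


# Hypothetical 8 bp barcodes and a made-up forward primer.
barcodes = {"ACGTACGT": "sampleA", "TGCATGCA": "sampleB"}
primer = "GGAACC"

print(assign_sample("ACGTACGTGGAACCTTTTCCCC", primer, barcodes))
# ('sampleA', 'TTTTCCCC')
print(assign_sample("ACGTACGAGGAACCTTTTCCCC", primer, barcodes))
# (None, 'ACGTACGAGGAACCTTTTCCCC')  one barcode mismatch, so the read is scrapped
```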

So let’s have MOTHUR sort our sequences by barcode:

MOTHUR > trim.seqs(fasta=LSU.fna, oligos=LSUoligos) 

You’ll notice four new files in your directory: 

LSU.scrap.fasta - the sequences that didn’t pass

 they came 

And then let’s check our results: 

MOTHUR > summary.seqs(fasta=LSU.trim.fasta)

             Start  End  NBases  Ambigs  Polymer
Minimum:     1      24   24      0       3
2.5%-tile:   1      73   73      0       4
25%-tile:    1      247  247     0       4
Median:      1      314  314     0       5
75%-tile:    1      360  360     0       5
97.5%-tile:  1      450  450     1       6
Maximum:     1      521  521     4       14
# of Seqs:   1782

So it looks like only 154 sequences were “scrapped” including the sequence with a 61 bp homopolymer. You’ll also notice that our shortest sequences are now 31 bp shorter. That’s because MOTHUR removed the priming region and 8 bp barcode from our sequences.

We can also trim our sequences based on quality scores.

    In addition to a .fna file, the 454 platform generates a .qual file which mirrors the fna file. The difference is that in place of letters, the qual file has numbers indicating the quality of each base. For example, an "N" will be represented by a "0" because it is a poor base call. If a qfile is provided, you must provide either the qaverage or qthreshold options as well.
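
As a rough illustration of what an average quality score means, the Python sketch below averages the scores on one line of a .qual record. It assumes the scores are whitespace-separated integers, and the helper name `mean_quality` is hypothetical rather than anything MOTHUR provides.

```python
def mean_quality(qual_line):
    """Average score from one whitespace-separated line of a .qual record."""
    scores = [int(token) for token in qual_line.split()]
    if not scores:
        return 0.0  # an empty record has no scores to average
    return sum(scores) / len(scores)


# The "0" stands for an N, as described above.
print(mean_quality("40 38 35 0 27"))  # 28.0
```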

MOTHUR has a number of other quality-checking options:

qaverage - average Q score of a sequence
qthreshold - threshold Q score of a sequence
qtrim - trim away low quality sections of a sequence
maxambig - the number of allowable Ns
maxhomop - the length of tolerable homopolymers
minlength - the minimum length of sequences
maxlength - the maximum length of sequences

These can be combined in a single command and changed as you see fit. Here’s a moderately tolerant quality check:

MOTHUR > trim.seqs(fasta=LSU.fna, qfile=LSU.qual, oligos=LSUoligos, minlength=300, maxlength=500, maxambig=0, maxhomop=10, qaverage=25)

Check this file with:

MOTHUR > summary.seqs(fasta=LSU.trim.fasta)

As you can see, ~50% of our original sequences passed.
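
To show how these checks combine, here is a hedged Python sketch of the same screen applied to a single read. The default values are copied from the trim.seqs command above; everything else is an assumption for illustration (the function name, treating only N as ambiguous, and treating the length limits and the qaverage cutoff as inclusive), so MOTHUR's own behavior may differ in the details.

```python
import re


def passes_filters(seq, quals, minlength=300, maxlength=500,
                   maxambig=0, maxhomop=10, qaverage=25):
    """Return True if one read passes every check (sketch, not MOTHUR's code)."""
    seq = seq.upper()
    if not (minlength <= len(seq) <= maxlength):
        return False
    if seq.count("N") > maxambig:
        return False
    # Each match of (.)\1* is one run of a single repeated character.
    longest_run = max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)),
                      default=0)
    if longest_run > maxhomop:
        return False
    if not quals or sum(quals) / len(quals) < qaverage:
        return False
    return True


good = "ACGT" * 100                      # 400 bp, no Ns, no long runs
print(passes_filters(good, [30] * 400))  # True
print(passes_filters(good, [20] * 400))  # False: average Q below 25
```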

Unique Sequences

You may be working in a system with relatively low complexity, or your samples may be dominated by a few species. To save hard drive space and computational effort, MOTHUR will collapse all of the identical sequences in a “.names” file and will generate a list of unique sequences.

MOTHUR > unique.seqs(fasta=LSU.trim.fasta)

This command works on “identical” sequences in the strict sense, meaning redundant sequences must have exact nucleotide matches and sequence length. In this case this didn’t help reduce our file size, but the “.names” file will still be useful later on.
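
The idea of collapsing strictly identical reads can be sketched in Python as follows. This is an illustration only, with made-up read names and a hypothetical function name; it prints one line per unique sequence in a two-column layout (representative read, then a comma-separated list of every read it stands for), which is how I understand the “.names” file to be organized.

```python
def collapse_identical(records):
    """Group reads whose sequences match exactly (same bases, same length).

    records: iterable of (read_name, sequence) pairs.
    Returns {representative_name: [all names sharing that sequence]},
    with the first-seen read as the representative.
    """
    by_seq = {}
    for name, seq in records:
        by_seq.setdefault(seq, []).append(name)
    return {names[0]: names for names in by_seq.values()}


# Hypothetical reads: r1 and r2 are identical, r3 differs by one base.
reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "ACGA")]
for rep, names in collapse_identical(reads).items():
    print(rep + "\t" + ",".join(names))
# r1	r1,r2
# r3	r3
```

Because the match is exact, a read that is one base shorter or differs at a single position stays in its own group, which is why this step did not shrink the file here.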

Align sequences

MOTHUR has a built-in aligner that, in my opinion, has its advantages and disadvantages. It can handle ENORMOUS amounts of data and process them relatively rapidly, and it produces alignments that are, obviously, easy to use in subsequent MOTHUR analyses (like chimera checking). However, it requires a “seed” or “template” alignment off which to work. Depending on which locus you are working with, you may have to find or create your own such template alignment.

I’ve taken the sequences from Tim James’ ncLSU alignment off of the AFTOL website, removed the non-fungal taxa, re-aligned with MAFFT, and then removed the empty columns or the positions downstream from our target locus. This is the “im.mafft” file. We can use this as our “template alignment”.

To align our sequences based on this template alignment we type:

MOTHUR > align.seqs(candidate=LSU.trim.unique.fasta, template=im.mafft, flip=T)
