Anthony S. Amend - a.amend@berkeley.edu
June 2010 MOTHUR Tutorial for Fungal Community Analysis
INTRODUCTION
Before we get started, there are some things you should know…
MOTHUR works with a command line interface. This may sound intimidating, but if you take a few hours to learn some basic operations in your Unix/Linux/Terminal, it will save you *LOTS* of time and frustration in the future. There are plenty of great free online tutorials to help you out. Here’s the first of ~6,000,000 Google hits for “unix tutorial”:
surrey.ac.uk/Teaching/Unix/
Eukaryotic systems in general, and ITS analyses specifically, are a little outside of the microbial ecology mainstream. MOTHUR was designed to work with environmental sequences that can be reasonably aligned and are also useful for distinguishing species. In prokaryotic systems the 16S (a.k.a. the gene encoding the small-subunit ribosomal RNA) seems to fit both of these criteria. In fungi we’re not so lucky. The ITS region is good at distinguishing species, but is difficult to get into a meaningful alignment across kingdom Fungi. Other loci, like the large or small ribosomal subunits (28S or 18S), can yield reasonable multiple sequence alignments, but might not be good for determining Operational Taxonomic Units (OTUs), depending on which groups of fungi you’re interested in.
To make a long story short, the parts of MOTHUR that deal with clustering OTUs based on multiple sequence alignments probably won’t work for you no matter which locus you’re working with. That’s OK! There are a number of programs which use pairwise alignments and will work for you just fine (e.g. Blastclust, CD-HIT, UCLUST, TGICL), and at this meeting we’ll learn about some analysis pipelines which integrate these types of clustering algorithms to enable you to work with programs like MOTHUR. By the time you read this, this may even be a part of MOTHUR.
This tutorial is divided into two sections: a “Phylogenetic Analysis” section, in which we use MOTHUR on our raw 454 data of environmental LSU sequences, and an “OTU Analysis” section, in which I’ve already used the program CD-HIT-EST to calculate OTUs based on 97% ITS sequence identity. The data used in this tutorial are from fungi found in indoor dust samples collected worldwide (see Amend A, Samson R, Seifert K, Bruns T (2010) Deep sequencing reveals diverse and geographically structured assemblages of fungi in indoor dust. Proceedings of the National Academy of Sciences, doi:10.1073/pnas.1000454107 for details). We’ve reduced the number of reads to make things simpler.
Finally, MOTHUR can do a lot more than I had time to present in this tutorial, and it is constantly being improved upon and expanded. If this is a
program that you’re planning to use, I urge you to consult the MOTHUR manual
for alternative analysis options and parameters (sections which are indented
below were “borrowed” from this source).
STARTING MOTHUR
In your terminal, navigate to your “FesinFiles” folder. On my machine I type:
crassa$ cd /Users/crassa/Documents/Fesin/FesinFiles
My computer is named “crassa” and the “$” indicates the UNIX prompt, so I don’t
actually type those. The UNIX command “cd” stands for “change directory”. You
can save yourself the trouble of typing the entire path by locating the file in your
“finder” or windows directory and dragging it into your terminal. The path, as if by
magic, appears.
Now that we’re inside the right directory, the easiest way to execute MOTHUR is
by simply indicating the path where you have the MOTHUR executable installed.
Again it’s pretty easy to do this by finding the file and dragging it in:
FesinFiles crassa$ /Applications/mothur
OK, we’re ready to go! MOTHUR will always look for input files from, and will
output files to the directory from which it is executed (FesinFiles in this case) so if
the program can’t find files that you KNOW are there, you should make sure that
you are working from the correct directory.
PHYLOGENETIC (LSU) ANALYSES
There are three files we’ll be using here: LSU.fna, LSU.qual, and LSUoligos.
*.fna is equivalent to a fasta sequence file, and *.qual is a quality score file.
These are typical outputs from a 454 run.
Initial processing and clean-up of sequences
Let’s take a quick look at the LSU.fna file using “summary.seqs”:
MOTHUR > summary.seqs(fasta=LSU.fna)
             Start   End   NBases   Ambigs   Polymer
Minimum:     1       55    55       0        3
2.5%-tile:   1       104   104      0        4
25%-tile:    1       278   278      0        4
Median:      1       345   345      0        5
75%-tile:    1       392   392      0        5
97.5%-tile:  1       484   484      1        6
Maximum:     1       572   572      5        61
# of Seqs:   1936
So we can see that we have 1936 sequences, that our shortest read was 55 bp, our longest was 572 bp, and the median was 345 bp. We can also see that our “messiest” sequence contained 5 ambiguous base calls (Ns), and there was at least one sequence with a homopolymer (single nucleotide repeat) 61 bases long!
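To make the columns of that report concrete, here is a minimal Python sketch (not MOTHUR's own code) that computes the same kinds of numbers for a tiny made-up FASTA file: sequence count, length range, maximum number of Ns, and longest homopolymer. The example reads and all function names are assumptions chosen purely for illustration.

```python
import re
from statistics import median


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {read_name: sequence}."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line.upper()
    return seqs


def longest_homopolymer(seq):
    """Length of the longest run of one repeated character."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def summarize(seqs):
    """Collect a few of the statistics shown in the summary report."""
    lengths = [len(s) for s in seqs.values()]
    return {
        "n_seqs": len(seqs),
        "min_len": min(lengths),
        "median_len": median(lengths),
        "max_len": max(lengths),
        "max_ambigs": max(s.count("N") for s in seqs.values()),
        "max_homopolymer": max(longest_homopolymer(s) for s in seqs.values()),
    }


# Made-up reads, only for illustration.
EXAMPLE_FASTA = """>read1
ACGTACGTAA
>read2
ACGNNTTTTTTG
>read3
ACGTAC
"""

summary = summarize(read_fasta(EXAMPLE_FASTA))
print(summary)
```

Note that this sketch counts a run of Ns as a homopolymer too; whether MOTHUR does the same is not something this tutorial states.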
Let’s clean these sequences up a little bit. Here’s what Pat Schloss says about the trim.seqs command:
The trim.seqs command provides the preprocessing features needed to
screen and sort pyrosequences. Similar analyses are provided by RDP;
here we give you added flexibility and speed. The command will enable
you to trim off primer sequences and barcodes, use the barcode
information to generate a group file and split a fasta file into sub-files,
screen sequences based on the qual file that comes from 454 sequencers, cull sequences based on sequence length and the presence of ambiguous bases and get the reverse complement of your sequences. While this
analysis is clearly geared towards pyrosequencing collections, it can also be used with traditional Sanger sequences.
We used “barcoded” primers to multiplex a number of samples in a single run. To be able to figure out which sequences go with which samples, we created an “oligos” file:
The oligos option takes a file that can contain the sequences of the forward and reverse primers and barcodes and their sample identifier. Each line of the oligos file can start with the key words "forward", "reverse", and
"barcode" or it can start with a "#" to tell MOTHUR to ignore that line of the oligos file. Here we are using a "#" for the reverse primer to indicate that
we don't want MOTHUR to screen for the reverse primer because the 454 sequencing platform cannot generate sequences that are >500 bp long. If the "#" were removed, all of the sequences would wind up in the scrap file. You can enter your oligos as upper or lowercase letters. It has
been shown that sequencing errors in the PCR primer region of a
sequence correlate highly with poor sequence quality. Therefore,
MOTHUR will not allow you to have anything less than an exact
match to the primer or barcode sequences that you provide.
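The exact-match rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MOTHUR's implementation: it assumes each read begins with the barcode followed directly by the forward primer, and the barcodes, primer, sample names, and reads below are all hypothetical.

```python
def sort_by_barcode(reads, barcodes, forward_primer):
    """Assign reads to samples by an exact barcode + primer match at the read start.

    reads:    {read_name: sequence}
    barcodes: {barcode_sequence: sample_name}
    Returns (trimmed, groups, scrap): trimmed reads with barcode and primer
    removed, the sample assigned to each trimmed read, and unmatched reads.
    """
    trimmed, groups, scrap = {}, {}, {}
    for name, seq in reads.items():
        seq = seq.upper()
        for barcode, sample in barcodes.items():
            prefix = (barcode + forward_primer).upper()
            if seq.startswith(prefix):  # exact match only, no mismatches allowed
                trimmed[name] = seq[len(prefix):]
                groups[name] = sample
                break
        else:
            scrap[name] = seq
    return trimmed, groups, scrap


# Hypothetical 8 bp barcodes and a hypothetical primer, for illustration only.
BARCODES = {"ACGTACGT": "sampleA", "TGCATGCA": "sampleB"}
PRIMER = "GGATCC"
READS = {
    "r1": "ACGTACGT" + "GGATCC" + "TTTTAAAA",
    "r2": "TGCATGCA" + "GGATCC" + "CCCCGGGG",
    "r3": "ACGTACGT" + "GGATCG" + "TTTTAAAA",  # one primer mismatch
}

trimmed, groups, scrap = sort_by_barcode(READS, BARCODES, PRIMER)
print(trimmed, groups, sorted(scrap))
```

Read r3 differs from the primer at a single base, so under the exact-match rule it is scrapped rather than assigned to a sample.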
So let’s have MOTHUR sort our sequences by barcode:
MOTHUR > trim.seqs(fasta=LSU.fna, oligos=LSUoligos)
You’ll notice four new files in your directory:
LSU.scrap.fasta - the sequences that didn’t pass
[…] they came
And then let’s check our results:
MOTHUR > summary.seqs(im.fasta)
             Start   End   NBases   Ambigs   Polymer
Minimum:     1       24    24       0        3
2.5%-tile:   1       73    73       0        4
25%-tile:    1       247   247      0        4
Median:      1       314   314      0        5
75%-tile:    1       360   360      0        5
97.5%-tile:  1       450   450      1        6
Maximum:     1       521   521      4        14
# of Seqs:   1782
So it looks like only 154 sequences were “scrapped” including the sequence with a 61 bp homopolymer. You’ll also notice that our shortest sequences are now 31 bp shorter. That’s because MOTHUR removed the priming region and 8 bp barcode from our sequences.
We can also trim our sequences based on quality scores.
In addition to a .fna file, the 454 platform generates a .qual file which
mirrors the fna file. The difference is that in place of letters, the qual file has numbers indicating the quality of each base. For example, an "N" will be
represented by a "0" because it is a poor base call. If a qfile is provided,
you must provide either the qaverage or qthreshold options as well.
MOTHUR has a number of other quality-checking options:
qaverage - the minimum average Q score of a sequence
qthreshold - the threshold Q score of a sequence
qtrim - trim away low-quality sections of a sequence
maxambig - the number of allowable Ns
maxhomop - the length of tolerable homopolymers
minlength - the minimum length of sequences
maxlength - the maximum length of sequences
These can be combined in a single command and changed as you see fit. Here’s a moderately tolerant quality check:
MOTHUR > trim.seqs(fasta=LSU.fna, qfile=LSU.qual, oligos=LSUoligos, minlength=300, maxlength=500, maxambig=0, maxhomop=10, qaverage=25)
Check this file with
MOTHUR > summary.seqs(im.fasta)
As you can see, ~50% of our original sequences passed.
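For readers who want to see how such criteria combine, here is a rough Python sketch of a per-read filter using the same option names and default values as the command above. It is an assumption-laden illustration, not MOTHUR's code: in particular, whether MOTHUR treats each limit as inclusive or exclusive is not stated in this tutorial, so the boundary behavior below is a guess.

```python
import re


def longest_homopolymer(seq):
    """Length of the longest run of one repeated character."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def passes_filter(seq, quals, minlength=300, maxlength=500,
                  maxambig=0, maxhomop=10, qaverage=25):
    """Return True if a read meets every criterion.

    seq:   nucleotide string
    quals: list of per-base quality scores, same length as seq
    Limits are treated as inclusive here (an assumption).
    """
    seq = seq.upper()
    if not quals or not (minlength <= len(seq) <= maxlength):
        return False
    if seq.count("N") > maxambig:
        return False
    if longest_homopolymer(seq) > maxhomop:
        return False
    if sum(quals) / len(quals) < qaverage:
        return False
    return True


# Tiny made-up reads, with the length limits lowered so they fit on a line.
limits = dict(minlength=10, maxlength=20)
good = passes_filter("ACGT" * 3, [30] * 12, **limits)
too_short = passes_filter("ACGT", [30] * 4, **limits)
has_n = passes_filter("ACGTNACGTACG", [30] * 12, **limits)
long_run = passes_filter("A" * 12, [30] * 12, **limits)
low_quality = passes_filter("ACGT" * 3, [20] * 12, **limits)
print(good, too_short, has_n, long_run, low_quality)
```

A read has to pass every test to be kept, which is why stacking several moderately tolerant thresholds can still remove a large share of the reads.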
Unique Sequences
You may be working in a system with relatively low complexity, or your samples may be dominated by a few species. To save hard drive space and computational effort, MOTHUR will collapse all of the identical sequences in a “.names” file and will generate a list of unique sequences.
MOTHUR > unique.seqs(im.fasta)
This command works on “identical” sequences in the strict sense, meaning redundant sequences must have exact nucleotide matches and sequence length. In this case this didn’t help reduce our file size, but the “.names” file will still be useful later on.
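The collapsing step can be pictured with a short Python sketch. It is illustrative only: the reads are made up, and the two-column layout written by `names_lines` (a representative read name, then a comma-separated list of every read sharing that sequence) is my understanding of the “.names” format rather than something this tutorial spells out.

```python
def unique_seqs(seqs):
    """Collapse reads with identical sequences (same bases, same length).

    seqs: {read_name: sequence}
    Returns (unique, names):
      unique: {representative_name: sequence}
      names:  {representative_name: [all read names with that sequence]}
    """
    representative = {}  # sequence -> name of the first read seen with it
    unique, names = {}, {}
    for name, seq in seqs.items():
        rep = representative.setdefault(seq, name)
        if rep == name:
            unique[name] = seq
        names.setdefault(rep, []).append(name)
    return unique, names


def names_lines(names):
    """Format the mapping as tab-separated lines, one representative per line."""
    return [f"{rep}\t{','.join(members)}" for rep, members in names.items()]


# Made-up reads: r1 and r2 are identical; r3 and r4 differ in length.
READS = {"r1": "ACGT", "r2": "ACGT", "r3": "ACGTA", "r4": "ACG"}

unique, names = unique_seqs(READS)
lines = names_lines(names)
print(unique)
print("\n".join(lines))
```

Because r3 and r4 differ from r1 only in length, they stay separate, which matches the strict definition of “identical” given above.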
Align sequences
MOTHUR has a built-in aligner that, in my opinion, has its advantages and disadvantages. It can handle ENORMOUS amounts of data and process them relatively rapidly, and it produces alignments that are, obviously, easy to use in subsequent MOTHUR analyses (like chimera checking). However, it requires a “seed” or “template” alignment off which to work. Depending on which locus you are working with, you may have to find or create your own such template alignment.
I’ve taken the sequences from Tim James’ ncLSU alignment off of the AFTOL website, removed the non-fungal taxa, re-aligned with MAFFT, and then removed the empty columns or the positions downstream from our target locus. This is
the “im.mafft” file. We can use this as our “template alignment”.
To align our sequences based on this template alignment we type:
MOTHUR > align.seqs(im.unique.fasta, im.mafft, flip=T)