RNA-seq流程学习笔记(8)-使⽤SAMtools将SAM⽂件转换为BAM⽂件、排
序、建⽴索引
参考⽂章:
1. 进⾏格式转换的原因
BAM and SAM formats are designed to contain the same information. The SAM format is more human readable, and easier to process by conventional text based processing programs, such as awk, sed, python, cut and so on. The BAM format provides binary versions of most of the same data, and is designed to compress reasonably well.
为什么要转换格式?为了让计算机好处理。SAM(sequence Alignment/mapping)数据格式是⽬前⾼通量测序中存放⽐对数据的标准格式。bam是sam的⼆进制格式,减少了sam⽂件的储存量。
2. SAMTools的主要功能
⽬前处理SAM格式的⼯具主要是SAMTools,这是Heng Li⼤神写的,除了C语⾔版本,还有Java的Picard,Python的
Pysam,Common lisp的cl-sam等其他版本。软件对应各命令如下:
view: BAM-SAM/SAM-BAM 转换和提取部分⽐对
sort: ⽐对排序 merge: 聚合多个排序⽐对
index:索引排序⽐对
faidx: 建⽴FASTA索引,提取部分序列
tview: ⽂本格式查看序列
pileup: 产⽣基于位置的结果和 consensus/indel calling
3. SAMtools的使⽤
最常⽤的三板斧就是格式转换,排序,索引。⽽进阶教程就是看⽂档提⾼。
1. 格式转换(view命令)
Usage: samtools view [options]<in.bam>|<in.sam>|&am>[region ...]
Options:
-b      output BAM
-C      output CRAM (requires -T)
-1      use fast BAM compression (implies -b)
-o FILE  output file name [stdout]
-S      ignored (input format is auto-detected)
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-@, --threads INT
Number of additional threads to use [0]
# 范例⼀:利⽤的是samtools的view选项,参数-S 输⼊sam⽂件;参数-b 指定输出的⽂件为bam;最后重定向写⼊>bam⽂件
$ cd mnt/f/rna_seq/aligned
$ for((i=56;i<=62;i++));do samtools view -S SRR35899${i}.sam -b > SRR35899${i}.bam;done
# 操作记录:
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ls
m3108.sam  m3110_1.sam  m3110.sam  m3111.sam  m3112.sam  m3113.sam  m3114.sam  m3122.sam  msh1.sam  msh2.sam  Scr.sam #使⽤参数-S;-b;-@;输出选⽤>进⾏重定向写⼊,也可以使⽤-o Filename 指定输⼊⽂件名称
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools view -@ 8 -S Scr.sam -b > Scr.bam
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll
total 256G
-
rw-rw-r-- 1 zexing zexing 4.3G 6⽉  3 12:22 Scr.bam
-rw-rw-r-- 1 zexing zexing  23G 6⽉  2 15:08 Scr.sam
#两个对⽐发现:.sam⽂件⽐.bam⽂件⼤5-6倍
#使⽤参数-S;-b;-@;输出选⽤>进⾏重定向写⼊,也可以使⽤-o Filename 指定输⼊⽂件名称
#参数-1压缩速度略快,结果略⼤
-rw-rw-r-- 1 zexing zexing 4.3G 6⽉  3 13:16 Scr.bam-0
-rw-rw-r-- 1 zexing zexing 4.6G 6⽉  3 13:09 Scr.bam_1
确认代码:
samtools view -@ 8 -S {$i}.sam -1b -o {$i}.bam
2. 排序(sort命令)
Usage: samtools ][in.bam]
Options:
-l INT    Set compression level, from 0 (uncompressed) to 9 (best)
-m INT    Set maximum memory per thread; suffix K/M/G recognized [768M]
-n        Sort by read name
-t TAG    Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE    Write final output to FILE rather than standard output
-@, --threads INT
Number of additional threads to use [0]
#范例⼀:将所有的bam⽂件按默认的染⾊体位置进⾏排序
$ for((i=56;i<=62;i++));do samtools sort SRR35899${i}.bam -o SRR35899${i}_sorted.bam;done
#操作记录:
#使⽤参数-n;-l;-@;;-o Filename 指定输⼊⽂件名称,输⼊⽂件放在最后。
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools sort -@ 8 -n -l 4 -o Scr.bam.sort Scr.bam
[bam_sort_core] merging from 24 files and 8
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll
total 270G
-rw-rw-r-- 1 zexing zexing 4.3G 6⽉  3 12:22 Scr.bam
-rw-rw-r-- 1 zexing zexing 4.5G 6⽉  3 13:30 Scr.bam.sort
-rw-rw-r-- 1 zexing zexing  23G 6⽉  2 15:08 Scr.sam
#使⽤参数-n;-l;-@;;-o Filename 指定输⼊⽂件名称,输⼊⽂件放在选项前边。
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools sort Scr.bam -@ 8 -n -l 9 -o Scr.bam.sort_9
[bam_sort_core] merging from 24 files and 8
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll
total 274G
-rw-rw-r-- 1 zexing zexing 4.3G 6⽉  3 12:22 Scr.bam
-rw-rw-r-- 1 zexing zexing 4.5G 6⽉  3 13:30 Scr.bam.sort
-rw-rw-r-- 1 zexing zexing 4.1G 6⽉  3 13:38 Scr.bam.sort_9
-rw-rw-r-- 1 zexing zexing  23G 6⽉  2 15:08 Scr.sam
#压缩⽐⼤时可以适当缩⼩⽂件⼤⼩
#默认按照染⾊体位置进⾏排序,⽽-n参数则是根据read名进⾏排序,-t 根据TAG进⾏排序。
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools sort -@ 8 -l 5 -o Scr.bam.sort_n Scr.bam
[bam_sort_core] merging from 24 files and 8
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll
total 263G
-rw-rw-r-- 1 zexing zexing 4.3G 6⽉  3 12:22 Scr.bam
-rw-rw-r-- 1 zexing zexing 4.5G 6⽉  3 13:30 Scr.bam.sort
-rw-rw-r-- 1 zexing zexing 2.1G 6⽉  3 14:26 Scr.bam.sort_n
-rw-rw-r-- 1 zexing zexing  23G 6⽉  2 15:08 Scr.sam
#上⾯Scr.bam.sort_n为默认排序,⽂件明显变⼩。为什么 BAM ⽂件sort之后体积会变⼩?因为BAM⽂件是压缩的⼆进制⽂件,对⽂件内容排序之后相似的内容排在⼀起,使得⽂件压缩⽐提⾼了,因此排序之后的BAM⽂件变⼩了,相对应的 SAM ⽂件就是纯⽂本⽂件,对SAM ⽂件进⾏排序就不会改变⽂件⼤⼩。⽽且由于 RNA-seq 中由于基因表达量的关系,RNA-seq 的数据⽐对结果 BAM ⽂件使⽤samtools 进⾏ sort 之后⽂件压缩⽐例变化会⽐DNA-seq 更甚。另外,samtools 对 BAM ⽂件进⾏排序之后那些没有⽐对上的 reads 会被放在⽂件的末尾。
确认代码:
samtools sort -@ 8 -l 5 -o {$i}.bam.sort {$i}.bam
3. 建⽴索引(index命令)
Usage: samtools index [-bc][-m INT]<in.bam>[out.index]
Options:
-b      Generate BAI-format index for BAM files [default]
-c      Generate CSI-format index for BAM files
-m INT  Set minimum interval size for CSI indices to 2^INT [14]
-@ INT  Sets the number of threads [none]
#范例⼀:将所有的排序⽂件建⽴索引,索引⽂件.bai后缀
$ for((i=56;i<=62;i++));do samtools index SRR35899${i}_sorted.bam;done
#操作记录:
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools index -@ 8 Scr.bam Scr.bam.index [E::hts_idx_push] Unsorted positions on sequence #15: 76904671 followed by 76904427
samtools index: failed to create index for"Scr.bam"
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools index -@ 8 Scr.bam.sort Scr.bam.index [E::hts_idx_push] Unsorted positions on sequence #19: 44575082 followed by 44574995
#以上报错,源⽂件Scr.bam在之前samtools sort命令时,加⼊参数-n,按reads名排序,⽆法建⽴Index。(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools index -@ 8 Scr.bam.sort_n Scr.bam.index (base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll
total 263G
-rw-rw-r-- 1 zexing zexing 4.3G 6⽉  3 12:22 Scr.bam
-rw-rw-r-- 1 zexing zexing 2.2M 6⽉  3 14:58 Scr.bam.index
-rw-rw-r-- 1 zexing zexing 4.5G 6⽉  3 13:30 Scr.bam.sort
-rw-rw-r-- 1 zexing zexing 2.1G 6⽉  3 14:26 Scr.bam.sort_n
-rw-rw-r-- 1 zexing zexing  23G 6⽉  2 15:08 Scr.sam
#第⼆次源⽂件Scr.bam.sort_n在之前samtools sort命令时,按默认染⾊体位置排序,顺利建⽴Index。
确认代码:
samtools index -@ 8 {$i}.bam.sort {$i}.bam.index
4. 查看reads⽐对情况(flagstat命令)
#samtools flagstat 使⽤说明:
Usage: samtools flagstat [options]<in.bam>
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-
@, --threads INT
Number of additional threads to use [0]
#操作记录:
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools flagstat Scr.bam.sort
50976263 + 0 in total (QC-passed reads + QC-failed reads)
5208051 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
49743886 + 0 mapped (97.58% : N/A)
45768212 + 0 paired in sequencing
22884106 + 0 read1
sort命令排序22884106 + 0 read2
43519458 + 0 properly paired (95.09% : N/A)
43928560 + 0 with itself and mate mapped
607275 + 0 singletons (1.33% : N/A)
103944 + 0 with mate mapped to a different chr
76413 + 0 with mate mapped to a different chr (mapQ>=5)
确认代码:
samtools flagstat -@ 8 {$i}.bam.sort
5. 使⽤shell scripts将以上命令写在⼀起
范例⼀:
#!/bin/bash
#这⾥写了⼀个⼩脚本,把三个步骤写在⼀个for循环⾥,for循环会依次对每⼀个sam⽂件进⾏处理
for i in`seq 77 80`
do
samtools view -S /media/yanfang/FYWD/RNA_seq/sam_files/SRR9576${i}.sam -b > /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam
#第⼀步将⽐对后的sam⽂件转换成bam⽂件。-S 后⾯跟的是sam⽂件的路径;-b 指定输出的⽂件为bam,后⾯跟输出的路径;最后重定向写⼊bam⽂件
samtools sort /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam -o /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_sorted.bam
#第⼆步将所有的bam⽂件按默认的染⾊体位置进⾏排序。-o是输出⽂件名
samtools index /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_sorted.bam
#第三步将所有的排序⽂件建⽴索引,索引⽂件,⽣成的索引⽂件是以bai为后缀的
done
操作记录:
#shell script记录如下:
#!/bin/bash
#这⾥写⼀个脚本,将samtools的view、sort和index三个命令串联在⼀起,对赵秀娟的⼏个数据循环处理
#    2020/06/03        李泽兴      First release
for i in msh1 msh2 m3108 m3110 m3111 m3112 m3113 m3114
do
samtools view -@ 8 -S /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.sam -1b -o /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.bam
samtools sort -@ 8 -l 5 -o /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.bam.sort /f/xudonglab/
zexing/projects/zhaoxiujuan/aligned/$i.bam
samtools index -@ 8 /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.bam.sort /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.bam.index
done
#后台运⾏命令如下
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ nohup sh samtools.sh &
[1] 46028
#运⾏的nohup_out结果如下:
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ cat nohup.out
[bam_sort_core] merging from 24 files and 8
[bam_sort_core] merging from 24 files and 8
[W::sam_read1] Parse error at line 12148383
[main_samview] truncated file.
[bam_sort_core] merging from 0 files and 8
[bam_sort_core] merging from 24 files and 8
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] Parse error at line 13734
[main_samview] truncated file.
[bam_sort_core] merging from 0 files and 8
[bam_sort_core] merging from 24 files and 8
[bam_sort_core] merging from 24 files and 8
[bam_sort_core] merging from 24 files and 8
使⽤shell script对⽐对结果进⾏查看,操作记录如下:
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ for i in msh1 msh2 m3108 m3110 m3111 m3112 m3113 m3114;do samtools flagstat -@ 8 $i.bam.so rt;done
56463066 + 0 in total (QC-passed reads + QC-failed reads)
4886336 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
55172624 + 0 mapped (97.71% : N/A)
51576730 + 0 paired in sequencing
25788365 + 0 read1
25788365 + 0 read2
49186058 + 0 properly paired (95.36% : N/A)
49644168 + 0 with itself and mate mapped
642120 + 0 singletons (1.24% : N/A)
107584 + 0 with mate mapped to a different chr
83662 + 0 with mate mapped to a different chr (mapQ>=5)

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。