RNA seq 3 學習筆記序列對比

主要參考**：

chip詳細分析流程

序列比對：hisat2

1.需要建立乙個index檔案有兩種方法。

為啥要這個index？需要把測序資料和這個參考基因組做對比，但是又不能直接和基因組做對比，不然哪兒跟哪兒可能區分不開，只能拿個簡化版的注釋檔案做對比。

# 其實hisat2-buld在執行的時候也會自己尋找exons和splice_sites，但是先做的目的是為了提高執行效率 extract_exons.py gencode.v26lift37.annotation.sorted.gtf > hg19.exons.gtf & extract_splice_sites.py gencode.v26lift37.annotation.gtf > hg19.splice_sites.gtf & # 建立index，必須選項是基因組所在檔案路徑和輸出的字首 hisat2-build --ss hg19.splice_sites.gtf --exon hg19.exons.gtf genome/hg19/hg19.fa hg19

解壓：

tar -zxvf *.tar.gz

cd  /mnt/f/chip_seq/aligned
for ((i=5;i<=9;i++));do hisat2 -t -x /mnt/f/chip_seq/data/reference/index/mm10/genome -u /mnt/f/chip_seq/data/srr62020$.fastq.gz -s srr62020$.sam; done
#如果是雙端測序，使用下面這個**，去掉 u，更換成-1 -2
for ((i=19;i<=23;i++));do hisat2 -t -x /mnt/d/biotech/chip/ref/mm10/geno
me -p 8 -1 /mnt/d/biotech/rna/srr53377$_1.fastq.gz -2 /mnt/d/biotech/rna/srr53377$_2.fastq.gz -s srr53377$.sam;
done

在對比的過程中，出現了這個錯誤**：

traceback (most recent call last): file "/home/leo/miniconda3/bin/hisat2_read_statistics.py", line 210, in reads_stat(args.read_file, args.read_count) file "/home/leo/miniconda3/bin/hisat2_read_statistics.py", line 168, in reads_stat for id, seq in fstream: file "/home/leo/miniconda3/bin/hisat2_read_statistics.py", line 44, in parser_fq if line[0] == '@': indexerror: index out of range could not locate a hisat2 index corresponding to basename "/mnt/f/chip_seq/data/reference/index/mm10/"

後來我仔細看了看，發現是，index定位不準確，沒有加上genome 。

/mnt/f/chip_seq/data/reference/index/mm10/genome

但是，這些關於index out of range 的錯誤先不用管。可能是fastq檔案裡面有一些空格。但這些可以被允許，儘管是錯誤。

可以參考一下幾個**：

question: hisat2: an 「indexerror: index out of range」

index out of range?

python-indexerror: list index out of range

python 報錯不影響hisat2執行

最後的結果：

traceback (most recent call last)
: file "/home/leo/miniconda3/bin/hisat2_read_statistics.py"
, line 210,in
reads_stat(args.read_file, args.read_count)
file "/home/leo/miniconda3/bin/hisat2_read_statistics.py"
, line 168
,in reads_stat
forid, seq in fstream:
file "/home/leo/miniconda3/bin/hisat2_read_statistics.py"
, line 44
,in parser_fq
if line[0]
=='@'
:indexerror: index out of range
time loading forward index:00:
00:26time loading reference:00:
00:04multiseed full-index search:00:
27:0919687027 reads; of these:
19687027
(100.00
%) were unpaired; of these:
8549808
(43.43
%) aligned 0 times
9537058
(48.44
%) aligned exactly 1 time
1600161
(8.13
%) aligned >
1 times
56.57
% overall alignment rate
time searching:00:
27:13overall time:00:
27:39

第乙個用了27min.

隨後對把sam檔案轉換成bam檔案，**如下：

samtools sort . /srr620204.sam -o ring1b.bam

其實這樣做會導致檔案量很大，而且內部的資料沒有很好的排序，可以如下操作：（當然，我這一次沒這麼做，直接就按照上邊的**寫了下去）

# 首先將比對後的sam檔案轉換成bam檔案
# 利用的是samtools的view選項，引數-s 輸入sam檔案；引數-b 指定輸出的檔案為bam；最後重定向寫入bam檔案
$ cd mnt/f/rna_seq/aligned
$ for
((i=
56;i<=
62;i++)
);do samtools view -s srr35899$
.sam -b > srr35899$
.bam;done
# 將所有的bam檔案按預設的染色體位置進行排序
$ for
((i=
56;i<=
62;i++)
);do samtools sort srr35899$
.bam -o srr35899$_sorted.bam;done
# 將所有的排序檔案建立索引，索引檔案.bai字尾
$ for
((i=
56;i<=
62;i++)
);do samtools index srr35899$_sorted.bam;done

3.拿到bam檔案，應該再進行一次質控**：

具體步驟參考序列比對：hisat2

另乙個關於bam檔案加載入i**和質控的**。

質控結果如下：（摘自上述第二個**）

source ("")
bioclite("chipseeker")
bioclite("txdb.mmusculus.ucsc.mm10.knowngene")

然後又是很久。

整個帖子的過程我用了兩天下午+晚上做完了。畢竟是新手可能會慢很多

RNA seq 3 學習筆記序列對比

C 學習筆記對比C

python序列學習筆記

oracle 學習筆記序列

RNA seq 3 學習筆記 序列對比

C 學習筆記 對比C

python序列學習筆記

oracle 學習筆記 序列

相關推薦

RNA seq 3 學習筆記序列對比

C 學習筆記對比C

oracle 學習筆記序列