跟着 Cell 学 Ribo-seq 分析 二
随缘

1前言
昨天早晨醒来看到一些句子说的挺好,分享给大家:
你只是心动过,并没有坚定地走向我,也只是短暂的喜欢了我一下,我却记了很久很久… 确实啊,当初对我的感情那么浓烈,我想不到是你不爱我,心动这个词很好,只是心动,从来没有坚定的选择我。
难过是后知后觉的,可以是在一个阳光灿烂的午后,也可以是在一个大雨滂沱的夜晚,回想生命中错过的片段。
2引言
上一节我们看了 reads 的长度分布, 接下来需要计算每个位置的 reads 的 density。采用的不是传统的 5'end
的模式,而是采用 center weighting
的方法。具体原理我画个示意图阐述:

基本原理就是对 reads 的上下游缩减 11nt, 然后中间的位置的得分为剩余长度的倒数, 最后叠加这个位置的分数即为该位置的 density。
文章介绍:
A critical step in the analysis involves the use of a specific scoring system, a center-weighted strategy for counting each read. Center-weighting can better locate the position of the ribosomal A-site and P-site than a simple counting at the 5′ end of the read, because only the midpoint (center) of a footprint fragment is scored. Center-weighting also takes into account the heterogeneous read lengths after MNase digestion and size selection. To this end, every footprint receives the same score regardless of length. If the read is longer than the defined minimum length (23 nt), the position of the ribosome cannot be clearly assigned. To this end, 11 nt from either end are removed and the score of the footprint is distributed equally among the remaining nucleotides. This scoring system was successfully used to verify well-known pause sites,such as stop codons and nascent chain–mediated stalling sites15. In addition, it facilitated the detection of pausing at serine codons upon starvation and of Shine-Dalgarno-like sequences as general ribosome pausing sites15. Center-weighted scores can be used for downstream analyses, which include the calculation of gene expression levels, normalized read densities along the genome or in protein coding regions and the average read density in a meta-gene analysis.
3center weighting 计算
下面是代码实现过程,输入文件为 bowtie 默认输出的结果文件, 输出为正负链的 density:
"""
Supplementary Note 2: Center Weighting
Authors: Eugene Oh, Annemarie Becker
inputFile:
.map file generated by Bowtie default output.
outputFileP:
read density file for plus strand
col0: position along genome
col1: read density at that position
outputFileM:
read density file for minus strand
col0: position along genome
col1: read density at that position
"""
def rawdata(inputFile, outputFileP, outputFileM):
pDict = {}
mDict = {}
inFile = open(inputFile, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col2 = str(fields[2]) #strand; note: if sequencing was performed without barcode reading, the column numbering is changed
col4 = int(fields[4]) #left-most position
col5 = str(fields[5]) #footprint seq
length = len(col5) #footprint length
if 22 < length < 42: #select range of footprint read lengths
if col2 == '+': #for plus strand
columns = len(fields) #count number of columns to check if alignment contains mismatches
if columns > 8:
col8 = str(fields[8])
if col8.startswith("0"): #if there is a mismatch in the 1st position
length0 = length - 1 #subtract wrong base at 1st position
end5 = col4 + 2 #Bowtie uses 0-based offset: transform to 1-based and subtract 1st base
end3 = end5 + length0 - 1
centerEnd5 = end5 + 11 #define center
centerEnd3 = end3 - 11
centerLength = centerEnd3 - centerEnd5 + 1
else:
end5 = col4 + 1 #Bowtie uses zero-based offset, transform to 1-based
end3 = end5 + length - 1
centerEnd5 = end5 + 11
centerEnd3 = end3 - 11
centerLength = centerEnd3 - centerEnd5 + 1
else:
end5 = col4 + 1
end3 = end5 + length - 1
centerEnd5 = end5 + 11
centerEnd3 = end3 - 11
centerLength = centerEnd3 - centerEnd5 + 1
for elem in range(centerEnd5, centerEnd3 + 1):
if elem in pDict:
pDict[elem] += (1.0 / centerLength)
else:
pDict[elem] = (1.0 / centerLength)
elif col2 == '-': #for minus strand
columns = len(fields)
if columns > 8:
col8 = str(fields[8])
if col8.startswith("0"):
length0 = length - 1
end3 = col4 + 1 #for minus strand, Bowtie gives leftmost position (3' end) with zero-based numbering
end5 = end3 + length0 - 1
centerEnd5 = end5 - 11
centerEnd3 = end3 + 11
centerLength = centerEnd5 - centerEnd3 + 1
else:
end3 = col4 + 1
end5 = end3 + length - 1
centerEnd5 = end5 - 11
centerEnd3 = end3 + 11
centerLength = centerEnd5 - centerEnd3 + 1
else:
end3 = col4 + 1
end5 = end3 + length - 1
centerEnd5 = end5 - 11
centerEnd3 = end3 + 11
centerLength = centerEnd5 - centerEnd3 + 1
for elem in range(centerEnd3, centerEnd5 + 1):
if elem in mDict:
mDict[elem] += (1.0 / centerLength)
else:
mDict[elem] = (1.0 / centerLength)
line = inFile.readline()
pList = list(pDict.items())
pList.sort()
outFileP = open(outputFileP, 'w')
for J in pList:
outFileP.write(str(J[0]) + 't' + str(J[1]) + 'n')
mList = list(mDict.items())
mList.sort()
outFileM = open(outputFileM, 'w')
for J in mList:
outFileM.write(str(J[0]) + 't' + str(J[1]) + 'n')
inFile.close()
outFileP.close()
outFileM.close()
代码也比较容易看懂, 作者也考虑了错配的碱基,并不纳入该位置进行计算。
值得注意的是,保存每个位置的 density 作者采用的字典的 key 为每个 read 的比对位置, 那么这就会有一个问题, 每条 read 比对到不同的染色体同一位置处,也需要进行叠加 density?
这么计算是不合理的,应该把字典的 key 加上染色体的信息 才能消除这种影响。
作者的代码是下面这样,elem 应该加上染色体的信息作为字典的 key
:
for elem in range(centerEnd3, centerEnd5 + 1):
if elem in mDict:
mDict[elem] += (1.0 / centerLength)
else:
mDict[elem] = (1.0 / centerLength)
咱们先不管,按着原来的代码往后分析:
批量计算:
# sample name
sample = ['Ssb1-inter-rep1.map','Ssb1-inter-rep2.map','Ssb2-inter-rep1.map','Ssb2-inter-rep2.map',
'Ssb1-trans-rep1.map','Ssb1-trans-rep2.map','Ssb2-trans-rep1.map','Ssb2-trans-rep2.map']
outPlus = ['Ssb1-inter-rep1.PDensity.txt','Ssb1-inter-rep2.PDensity.txt','Ssb2-inter-rep1.PDensity.txt','Ssb2-inter-rep2.PDensity.txt',
'Ssb1-trans-rep1.PDensity.txt','Ssb1-trans-rep2.PDensity.txt','Ssb2-trans-rep1.PDensity.txt','Ssb2-trans-rep2.PDensity.txt']
outMinus = ['Ssb1-inter-rep1.MDensity.txt','Ssb1-inter-rep2.MDensity.txt','Ssb2-inter-rep1.MDensity.txt','Ssb2-inter-rep2.MDensity.txt',
'Ssb1-trans-rep1.MDensity.txt','Ssb1-trans-rep2.MDensity.txt','Ssb2-trans-rep1.MDensity.txt','Ssb2-trans-rep2.MDensity.txt']
# run
for i in range(0,8):
rawdata(sample[i],''.join(['2.ribo-density-data/',outPlus[i]]),''.join(['2.ribo-density-data/',outMinus[i]]))
4计算每个样本的 total density
"""
Supplementary Note 3: Total reads
Author: Annemarie Becker
inputFileP:
read density file for plus strand (Supplementary Note 2)
col0: position along genome
col1: read density at that position
inputFileM:
read density file for minus strand (Supplementary Note 2)
col0: position along genome
col1: read density at that position
outputFile:
total read number as float
"""
def countReads(inputFileP,inputFileM, outputFile):
inFileP = open(inputFileP, 'r')
inFileM = open(inputFileM, 'r')
outFile = open(outputFile, 'w')
line = inFileP.readline()
i = 0
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
i = i+col1 #count reads on plus strand
line = inFileP.readline()
totReadsP = i
line = inFileM.readline()
j = 0
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
j = j+col1 #count reads on minus strand
line = inFileM.readline()
totReadsM = j
totalReads = i + j
outFile.write(str(totalReads))
inFileP.close()
inFileM.close()
outFile.close()
批量计算:
# sample name
outPlus = ['Ssb1-inter-rep1.PDensity.txt','Ssb1-inter-rep2.PDensity.txt','Ssb2-inter-rep1.PDensity.txt','Ssb2-inter-rep2.PDensity.txt',
'Ssb1-trans-rep1.PDensity.txt','Ssb1-trans-rep2.PDensity.txt','Ssb2-trans-rep1.PDensity.txt','Ssb2-trans-rep2.PDensity.txt']
outMinus = ['Ssb1-inter-rep1.MDensity.txt','Ssb1-inter-rep2.MDensity.txt','Ssb2-inter-rep1.MDensity.txt','Ssb2-inter-rep2.MDensity.txt',
'Ssb1-trans-rep1.MDensity.txt','Ssb1-trans-rep2.MDensity.txt','Ssb2-trans-rep1.MDensity.txt','Ssb2-trans-rep2.MDensity.txt']
sample = ['Ssb1-inter-rep1.totalReads.txt','Ssb1-inter-rep2.totalReads.txt','Ssb2-inter-rep1.totalReads.txt','Ssb2-inter-rep2.totalReads.txt',
'Ssb1-trans-rep1.totalReads.txt','Ssb1-trans-rep2.totalReads.txt','Ssb2-trans-rep1.totalReads.txt','Ssb2-trans-rep2.totalReads.txt']
# run
for i in range(0,8):
countReads(''.join(['2.ribo-density-data/',outPlus[i]]),''.join(['2.ribo-density-data/',outMinus[i]]),''.join(['3.totalReads-data/',sample[i]]))
5准备基因信息文件
作者只说了 NCBI 的 ptt 文件, 但我找不到,于是自己写代码准备这样一个基因文件,一个正链一个负链的基因信息文件
,三列,基因名
,基因起始位点
和基因终止位点
:
# get protein-coding gene start and end info
inputListP = open('geneListPlus.txt','w')
inputListM = open('geneListMlus.txt','w')
with open('Saccharomyces_cerevisiae.R64-1-1.105.gtf','r') as gtf:
for line in gtf:
if line.startswith('#'):
continue
fileds = line.split()
# for Plus strand
if fileds[2] == 'gene':
if len(fileds) > 14:
type = fileds[15]
gene_name = fileds[11].replace('"','').replace(';','')
else:
type = fileds[13]
gene_name = fileds[9].replace('"','').replace(';','')
if type == '"protein_coding";' and fileds[6] == '+':
gene_name = gene_name
start = fileds[3]
end = fileds[4]
inputListP.write('t'.join([gene_name,str(start),str(end)]) + 'n')
elif type == '"protein_coding";' and fileds[6] == '-':
gene_name = gene_name
start = fileds[3]
end = fileds[4]
inputListM.write('t'.join([gene_name,str(start),str(end)]) + 'n')
else:
pass
else:
pass
inputListP.close()
inputListM.close()
输出文件长这样:
THI74 1338274 1339386
PAU10 1523249 1523611
IZH1 1434924 1435874
6计算每个基因的总 density
基因前面的 density 的基因信息文件,对基因位置内的 density 进行加和即可,代码也比较好理解:
"""
Supplementary Note 4: Read density per gene
Authors: Eugene Oh
inputFileP:
read density file for plus strand (Supplementary Note 2)
col0: position along genome
col1: read density at that position
inputFileM:
read density file for minus strand (Supplementary Note 2)
col0: position along genome
col1: read density at that position
inputListP:
E. coli MC4100 gene list of the plus strand
col0: gene name
col1: start coordinate of gene
col2: stop coordinate of gene
inputListM
E. coli MC4100 gene list of the minus strand
col0: gene name
col1: start coordinate of gene
col2: stop coordinate of gene
outputFileP:
read densities per gene on plus strand
col0: gene name
col1: start coordinate of gene
col2: stop coordinate of gene
col3: sum of read densities
outputFileM:
read densities per gene on minus strand
col0: gene name
col1: start coordinate of gene
col2: stop coordinate of gene
col3: sum of read densities
"""
def expression(inputFileP, inputFileM, inputListP, inputListM, outputFileP, outputFileM):
### PLUS STRAND ###
# Upload read density file from plus strand as a dictionary
inFileDictP = {} #create dictionary that looks like input file
inFile = open(inputFileP, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
inFileDictP[col0] = col1
line = inFile.readline()
# Upload plus strand gene list as a dictionary and list
geneDictP = {} #create dictionary with col0=gene name; col1=read number
geneListP = [] #create list that looks like input gene list
inFile = open(inputListP, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
geneListP.append(fields) #add an item to the end of the list
col0 = str(fields[0]) #plus strand gene list: gene name
col1 = int(fields[1]) #plus strand gene list: start
col2 = int(fields[2]) #plus strand gene list: stop
# Sum up and write read densities per protein coding region in dictionary
for Z in range(col1, col2 + 1):
if Z in inFileDictP and col0 in geneDictP:
geneDictP[col0] += inFileDictP[Z]
elif Z in inFileDictP:
geneDictP[col0] = inFileDictP[Z]
line = inFile.readline()
# Assign gene expression levels to all genes
tupledlistP = geneDictP.items()
for J in geneListP:
match = 0
for K in tupledlistP:
if J[0] == K[0]:
match = 1
J.append(K[1])
if match == 0: #list genes that don't have any reads
J.append(0)
# Output file for plus strand
outFile = open(outputFileP, 'w')
for J in geneListP:
outFile.write(str(J[0]) + 't' + str(J[1]) + 't' + str(J[2]) + 't' + str(J[3]) + 'n')
### MINUS STRAND ###
# Upload read density file from minus strand as a dictionary
inFileDictM = {}
inFile = open(inputFileM, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
inFileDictM[col0] = col1
line = inFile.readline()
# Upload minus strand gene list as a dictionary and list
geneDictM = {} #create dictionary with col0=gene name; col1=read number
geneListM = [] #create list that looks like input gene list
inFile = open(inputListM, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
geneListM.append(fields) #add an item to the end of the list
col0 = str(fields[0]) #minus strand gene list: gene name
col1 = int(fields[1]) #minus strand gene list: start
col2 = int(fields[2]) #minus strand gene list: stop
# Sum up and write read densities per protein coding region in dictionary
for Z in range(col2, col1 + 1):
if Z in inFileDictM and col0 in geneDictM:
geneDictM[col0] += inFileDictM[Z]
elif Z in inFileDictM:
geneDictM[col0] = inFileDictM[Z]
line = inFile.readline()
# Assign gene expression levels to all genes
tupledlistM = geneDictM.items()
for J in geneListM:
match = 0
for K in tupledlistM:
if J[0] == K[0]:
match = 1
J.append(K[1])
if match == 0:
J.append(0)
# Output file for minus strand
outFile = open(outputFileM, 'w')
for J in geneListM:
outFile.write(str(J[0]) + 't' + str(J[1]) + 't' + str(J[2]) + 't' + str(J[3]) + 'n')
批量计算:
# expression(inputFileP, inputFileM, inputListP, inputListM, outputFileP, outputFileM)
# sample name
inputFileP = ['Ssb1-inter-rep1.PDensity.txt','Ssb1-inter-rep2.PDensity.txt','Ssb2-inter-rep1.PDensity.txt','Ssb2-inter-rep2.PDensity.txt',
'Ssb1-trans-rep1.PDensity.txt','Ssb1-trans-rep2.PDensity.txt','Ssb2-trans-rep1.PDensity.txt','Ssb2-trans-rep2.PDensity.txt']
inputFileM = ['Ssb1-inter-rep1.MDensity.txt','Ssb1-inter-rep2.MDensity.txt','Ssb2-inter-rep1.MDensity.txt','Ssb2-inter-rep2.MDensity.txt',
'Ssb1-trans-rep1.MDensity.txt','Ssb1-trans-rep2.MDensity.txt','Ssb2-trans-rep1.MDensity.txt','Ssb2-trans-rep2.MDensity.txt']
inputListP = 'geneListPlus.txt';inputListM = 'geneListMlus.txt'
outputFileP = ['Ssb1-inter-rep1.geneExpPlus.txt','Ssb1-inter-rep2.geneExpPlus.txt','Ssb2-inter-rep1.geneExpPlus.txt','Ssb2-inter-rep2.geneExpPlus.txt',
'Ssb1-trans-rep1.geneExpPlus.txt','Ssb1-trans-rep2.geneExpPlus.txt','Ssb2-trans-rep1.geneExpPlus.txt','Ssb2-trans-rep2.geneExpPlus.txt']
outputFileM = ['Ssb1-inter-rep1.geneExpMinus.txt','Ssb1-inter-rep2.geneExpMinus.txt','Ssb2-inter-rep1.geneExpMinus.txt','Ssb2-inter-rep2.geneExpMinus.txt',
'Ssb1-trans-rep1.geneExpMinus.txt','Ssb1-trans-rep2.geneExpMinus.txt','Ssb2-trans-rep1.geneExpMinus.txt','Ssb2-trans-rep2.geneExpMinus.txt']
# run
for i in range(0,8):
expression(''.join(['2.ribo-density-data/',inputFileP[i]]),''.join(['2.ribo-density-data/',inputFileM[i]]),
inputListP,inputListM,
''.join(['4.gene-expression-data/',outputFileP[i]]),''.join(['4.gene-expression-data/',outputFileM[i]]))
输出结果:
THI74 1338274 1339386 1.5714285714285714
PAU10 1523249 1523611 0
IZH1 1434924 1435874 436.71428571428805
RBS1 122216 123589 7250.4234126983865
MRPL28 1386073 1386516 192.90277777777777
NUM1 755628 763874 19323.75396825382
7结尾
未完待续…

欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群
哦,数据代码已上传至QQ群,欢迎加入下载。
群二维码:
老俊俊微信:
知识星球:
所以今天你学习了吗?
今天的分享就到这里了,敬请期待下一篇!
最后欢迎大家分享转发,您的点赞是对我的鼓励和肯定!
如果觉得对您帮助很大,赏杯快乐水喝喝吧!
往期回顾
◀RiboPlotR 优雅的可视化你的 Ribo-seq 数据
◀南京的春
◀..
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!