跟着 Cell 学 Ribo-seq 分析 四 (终)
随缘

1引言
为了去除测序文库大小带来的差异,往往需要对文库大小进行归一化,计算 RPM 来 使不同样本之间的基因具有可比性。
之前我们计算了 每个位置处的的 density, 但是没有进行归一化,所以需要计算一下 RPM 的值。

2calculate RPM
输入文件为 density 文件,输出文件为每个位置处的 RPM 值:
"""
Supplementary Note 6: RPM-normalized read densities
Author: Annemarie Becker
inputFileP:
read density file for plus strand (Supplementary Note 2)
col0: position along genome
col1: read density at that position
inputFileM:
read density file for minus strand (Supplementary Note 2)
col0: position along genome
col1: read density at that position
inputNumber:
total read number as float (Supplementary Note 3)
outputFileP:
RPM-normalized read density file for plus strand
col0: position along genome
col1: RPM-normalized read density at that position
outputFileM:
RPM-normalized read density file for minus strand
col0: position along genome
col1: RPM-normalized read density at that position
"""
定义函数:
def norm(inputFileP, inputFileM, inputNumber, outputFileP, outputFileM):
### PLUS STRAND ###
inFile = open(inputFileP, 'r')
inNumber = open(inputNumber, 'r')
outFile = open(outputFileP, 'w')
line = inFile.readline()
number = inNumber.readline()
totalReads = int(float(number))
i = 0
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
RPM = col1 / totalReads * 1000000
outFile.write(str(col0) + 't' + str(RPM) + 'n')
line = inFile.readline()
### MINUS STRAND ###
inFile = open(inputFileM, 'r')
inNumber = open(inputNumber, 'r')
outFile = open(outputFileM, 'w')
line = inFile.readline()
number = inNumber.readline()
totalReads = int(float(number))
i = 0
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
RPM = col1 / totalReads * 1000000
outFile.write(str(col0) + 't' + str(RPM) + 'n')
line = inFile.readline()
代码比较好懂,接下来批量运行:
# norm(inputFileP, inputFileM, inputNumber, outputFileP, outputFileM)
# sample name
inputFileP = ['Ssb1-inter-rep1.PDensity.txt','Ssb1-inter-rep2.PDensity.txt','Ssb2-inter-rep1.PDensity.txt','Ssb2-inter-rep2.PDensity.txt',
'Ssb1-trans-rep1.PDensity.txt','Ssb1-trans-rep2.PDensity.txt','Ssb2-trans-rep1.PDensity.txt','Ssb2-trans-rep2.PDensity.txt']
inputFileM = ['Ssb1-inter-rep1.MDensity.txt','Ssb1-inter-rep2.MDensity.txt','Ssb2-inter-rep1.MDensity.txt','Ssb2-inter-rep2.MDensity.txt',
'Ssb1-trans-rep1.MDensity.txt','Ssb1-trans-rep2.MDensity.txt','Ssb2-trans-rep1.MDensity.txt','Ssb2-trans-rep2.MDensity.txt']
inputNumber = ['Ssb1-inter-rep1.totalReads.txt','Ssb1-inter-rep2.totalReads.txt','Ssb2-inter-rep1.totalReads.txt','Ssb2-inter-rep2.totalReads.txt',
'Ssb1-trans-rep1.totalReads.txt','Ssb1-trans-rep2.totalReads.txt','Ssb2-trans-rep1.totalReads.txt','Ssb2-trans-rep2.totalReads.txt']
outputFileP = ['Ssb1-inter-rep1.PRPM.txt','Ssb1-inter-rep2.PRPM.txt','Ssb2-inter-rep1.PRPM.txt','Ssb2-inter-rep2.PRPM.txt',
'Ssb1-trans-rep1.PRPM.txt','Ssb1-trans-rep2.PRPM.txt','Ssb2-trans-rep1.PRPM.txt','Ssb2-trans-rep2.PRPM.txt']
outputFileM = ['Ssb1-inter-rep1.MRPM.txt','Ssb1-inter-rep2.MRPM.txt','Ssb2-inter-rep1.MRPM.txt','Ssb2-inter-rep2.MRPM.txt',
'Ssb1-trans-rep1.MRPM.txt','Ssb1-trans-rep2.MRPM.txt','Ssb2-trans-rep1.MRPM.txt','Ssb2-trans-rep2.MRPM.txt']
# run
for i in range(0,8):
norm(''.join(['2.ribo-density-data/',inputFileP[i]]),''.join(['2.ribo-density-data/',inputFileM[i]]),
''.join(['3.totalReads-data/',inputNumber[i]]),
''.join(['6.RPM-data/',outputFileP[i]]),''.join(['6.RPM-data/',outputFileM[i]]))
结果类似于下面这样:
726 0.0807090320751414
742 0.008967670230571266
743 0.008967670230571266
744 0.008967670230571266
745 0.008967670230571266
3extend positions
为了后面方便计算,需要对把数据 拓展为完整的基因组坐标位置, 因为现在是断断续续
的位置:

输入文件为上一步的 RPM 文件:
"""
Supplementary Note 7: Complete RPM-normalized read densities
Author: Annemarie Becker
inputFileP:
RPM-normalized read density file for plus strand (Supplementary Note 6)
col0: position along genome
col1: RPM-normalized read density at that position
inputFileM:
RPM-normalized read density file for minus strand (Supplementary Note 6)
col0: position along genome
col1: RPM-normalized read density at that position
outputFileP:
complete RPM-normalized read density file for all positions along the genome on plus strand
col0: position along genome
col1: RPM-normalized read density at that position
outputFileM:
complete RPM-normalized read density file for all positions along the genome on minus strand
col0: position along genome
col1: RPM-normalized read density at that position
"""
定义函数:
def RPMcomplete(inputFileP, inputFileM, outputFileP, outputFileM):
### PLUS STRAND###
DictP = {}
inFile = open(inputFileP, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
DictP[col0] = col1
line = inFile.readline()
outFile = open(outputFileP, 'w')
for elem in range(1,4578160): #genome length, change for different genome
if elem not in DictP:
DictP[elem] = 0.0
ListP = list(DictP.items())
ListP.sort()
for J in ListP:
outFile.write(str(J[0]) + 't' + str(J[1]) + 'n')
### MINUS STRAND###
DictM = {}
inFile = open(inputFileM, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
DictM[col0] = col1
line = inFile.readline()
outFile = open(outputFileM, 'w')
for elem in range(1,4578160): #genome length, change for different genome
if elem not in DictM:
DictM[elem] = 0.0
ListM = list(DictM.items())
ListM.sort()
for J in ListM:
outFile.write(str(J[0]) + 't' + str(J[1]) + 'n')
这里作者设置的酵母基因组长度为 range(1,4578160)
。
注意,这里依然会出现我前面提到的问题,应该是对每个染色体对应的长度进行分析。所以我觉得代码是有问题的。
批量运行:
# RPMcomplete(inputFileP, inputFileM, outputFileP, outputFileM)
# sample name
inputFileP = ['Ssb1-inter-rep1.PRPM.txt','Ssb1-inter-rep2.PRPM.txt','Ssb2-inter-rep1.PRPM.txt','Ssb2-inter-rep2.PRPM.txt',
'Ssb1-trans-rep1.PRPM.txt','Ssb1-trans-rep2.PRPM.txt','Ssb2-trans-rep1.PRPM.txt','Ssb2-trans-rep2.PRPM.txt']
inputFileM = ['Ssb1-inter-rep1.MRPM.txt','Ssb1-inter-rep2.MRPM.txt','Ssb2-inter-rep1.MRPM.txt','Ssb2-inter-rep2.MRPM.txt',
'Ssb1-trans-rep1.MRPM.txt','Ssb1-trans-rep2.MRPM.txt','Ssb2-trans-rep1.MRPM.txt','Ssb2-trans-rep2.MRPM.txt']
outputFileP = ['Ssb1-inter-rep1.PRPM-full.txt','Ssb1-inter-rep2.PRPM-full.txt','Ssb2-inter-rep1.PRPM-full.txt','Ssb2-inter-rep2.PRPM-full.txt',
'Ssb1-trans-rep1.PRPM-full.txt','Ssb1-trans-rep2.PRPM-full.txt','Ssb2-trans-rep1.PRPM-full.txt','Ssb2-trans-rep2.PRPM-full.txt']
outputFileM = ['Ssb1-inter-rep1.MRPM-full.txt','Ssb1-inter-rep2.MRPM-full.txt','Ssb2-inter-rep1.MRPM-full.txt','Ssb2-inter-rep2.MRPM-full.txt',
'Ssb1-trans-rep1.MRPM-full.txt','Ssb1-trans-rep2.MRPM-full.txt','Ssb2-trans-rep1.MRPM-full.txt','Ssb2-trans-rep2.MRPM-full.txt']
# run
for i in range(0,8):
RPMcomplete(''.join(['6.RPM-data/',inputFileP[i]]),''.join(['6.RPM-data/',inputFileM[i]]),
''.join(['6.RPM-data/',outputFileP[i]]),''.join(['6.RPM-data/',outputFileM[i]]))
输出结果,可以看到从 1 开始:
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
4Enrichment efficiency
计算富集效率,输入文件为 IP 和 Input 的 RPM 拓展坐标的文件:
"""
Supplementary Note 14: Enrichment efficiency
Author: Annemarie Becker
inputFile1:
RPM-normalized read densities along the whole genome or in protein coding regions on plus or minus strand from sample 1 (Supplementary Note 7 or 8)
col0: position along genome
col1: RPM-normalized read density at that position
inputFile2:
RPM-normalized read densities along the whole genome or in protein coding regions on plus or minus strand from sample 2 (Supplementary Note 7 or 8)
col0: position along genome
col1: RPM-normalized read density at that position
outputFile:
ratio of RPM-normalized read densities for protein coding regions on plus or minus strand from samples 1 and 2
col0: position along genome
col1: ratio of RPM-normalized read densities from samples 1 and 2
"""
因为 很多位置出的 density 并不连续, 作者采用了
滑动窗口
的方法,计算每个位置上下游 20nt 的 density 的总和,再除以 Input 的上下游 20nt 的 density 的总和得到 ratio
,这样出来的图的曲线会更光滑一些:

作者使用滑动窗口的方法在整个基因组上进行滑动计算比值, 分为
三个区间
,前 20nt
,中间区域
和末尾 20nt
。
定义函数:
def ratio(inputFile1, inputFile2, outputFile):
# Upload input1
Dict1 = {}
inFile = open(inputFile1, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
Dict1[col0] = col1
line = inFile.readline()
# Upload input2
Dict2 = {}
inFile = open(inputFile2, 'r')
line = inFile.readline()
while line != '':
fields = line.split()
col0 = int(fields[0])
col1 = float(fields[1])
Dict2[col0] = col1
line = inFile.readline()
# calculate window +-20
outFile = open(outputFile, 'w')
sum1 = 0
sum2 = 0
start = 1
end = 4578159 #change this value dependent on the genome
start_sum = start + 20 #start_sum = 21
end_sum = end - 20 #end_sum = 4578139
for J in range(start, start_sum + 1): #lines 1-21
for X in range(start, J+20+1): #sum 1-(J+20)
sum1 += float(Dict1[X])
sum2 += float(Dict2[X])
if sum2 != 0:
ratio = sum1 / sum2
else:
ratio = 0.0
outFile.write(str(J) + 't' + str(ratio) + 'n')
sum1 = 0
sum2 = 0
for K in range(start_sum + 1, end_sum + 1): #lines 22-4578139
for Y in range(K-20, K+20+1): #sum (K-20)-(K+20)
sum1 += float(Dict1[Y])
sum2 += float(Dict2[Y])
if sum2 != 0:
ratio = sum1 / sum2
else:
ratio = 0.0
outFile.write(str(K) + 't' + str(ratio) + 'n')
sum1 = 0
sum2 = 0
for L in range(end_sum + 1, end + 1): #lines 4578140-4578159
for Z in range(L-20, end + 1):
sum1 += float(Dict1[Z])
sum2 += float(Dict2[Z])
if sum2 != 0:
ratio = sum1 / sum2
else:
ratio = 0.0
outFile.write(str(L) + 't' + str(ratio) + 'n')
sum1 = 0
sum2 = 0
批量运行:
inputFile1 = ['Ssb1-inter-rep1.PRPM-full.txt','Ssb1-inter-rep2.PRPM-full.txt','Ssb2-inter-rep1.PRPM-full.txt','Ssb2-inter-rep2.PRPM-full.txt',
'Ssb1-inter-rep1.MRPM-full.txt','Ssb1-inter-rep2.MRPM-full.txt','Ssb2-inter-rep1.MRPM-full.txt','Ssb2-inter-rep2.MRPM-full.txt']
inputFile2 = ['Ssb1-trans-rep1.PRPM-full.txt','Ssb1-trans-rep2.PRPM-full.txt','Ssb2-trans-rep1.PRPM-full.txt','Ssb2-trans-rep2.PRPM-full.txt',
'Ssb1-trans-rep1.MRPM-full.txt','Ssb1-trans-rep2.MRPM-full.txt','Ssb2-trans-rep1.MRPM-full.txt','Ssb2-trans-rep2.MRPM-full.txt']
outputFile = ['Ssb1-rep1-Penrichment.txt','Ssb1-rep2-Penrichment.txt','Ssb2-rep1-Penrichment.txt','Ssb2-rep2-Penrichment.txt',
'Ssb1-rep1-Menrichment.txt','Ssb1-rep2-Menrichment.txt','Ssb2-rep1-Menrichment.txt','Ssb2-rep2-Menrichment.txt']
# run
for i in range(0,8):
ratio(''.join(['6.RPM-data/',inputFile1[i]]),
''.join(['6.RPM-data/',inputFile2[i]]),
''.join(['8.enrichment-ratio-data/',outputFile[i]]))
得到每个位置出的比值:
8750 0.0
8751 0.0
8752 0.0
8753 0.11379265589085022
8754 0.22758531178170044
8755 0.3413779676725507
5可视化
这里准备一个基因信息文件,挑选了 三个正链基因 和 两个负链基因:
PMT1 287059 289512
CDC37 790328 791848
CCT3 407558 409162
SSC1 519638 521602
PDI1 48653 50221
正链基因
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!