python bioinfokit 计算 FPKM/RPKM/TPM 值
最困难的事情就是认识自己。——希腊

1引言
介绍一下使用 bioinfokit 来计算 FPKM/RPKM/TPM
归一化值。bioinfokit 是一个处理分析,可视化生物数据的库,功能还很多的,后面慢慢介绍。
github 地址:
https://github.com/reneshbedre/bioinfokit

2安装
Install using pip for Python 3:
# install
pip install bioinfokit
# upgrade to latest version
pip install bioinfokit --upgrade
# uninstall
pip uninstall bioinfokit
Install using conda:
conda install -c bioconda bioinfokit
Install using git:
# download and install bioinfokit (Tested on Linux, Mac, Windows)
git clone https://github.com/reneshbedre/bioinfokit.git
cd bioinfokit
python setup.py install
检查版本:
>>> import bioinfokit
>>> bioinfokit.__version__
'0.4'
3计算
RPKM (Reads per kilo base of transcript per million mapped reads)
RPKM 和 FPKM (Fragments per kilo base of transcript per million mapped fragments)
计算公式一样,只不过后者对于双端数据而言的。
公式:

注意:
RPKM considers the gene length for normalization. RPKM is suitable for sequencing protocols where reads sequencing depends on gene length. Used in single-end RNA-seq experiments (FPKM for paired-end RNA-seq data).
RPKM/FPKM does not represent the accurate measure of relative RNA molar concentration (rmc) and can be biased towards identifying the differentially expressed genes as the total normalized counts for each sample will be different. TPM is proposed as an alternative to the RPKM
.
计算:
from bioinfokit.analys import norm, get_data
# load sugarcane RNA-seq expression dataset (Published in Bedre et al., 2019)
df = get_data('sc_exp').data
df.head(2)
# output
gene ctr1 ctr2 ctr3 trt1 trt2 trt3 length
0 Sobic.001G000200 338 324 246 291 202 168 1982.0
1 Sobic.001G000400 49 21 53 16 16 11 4769.0
# make gene column as index column
df = df.set_index('gene')
df.head(2)
# output
ctr1 ctr2 ctr3 trt1 trt2 trt3 length
gene
Sobic.001G000200 338 324 246 291 202 168 1982.0
Sobic.001G000400 49 21 53 16 16 11 4769.0
# now, normalize raw counts using RPKM method
# gene length must be in bp
nm = norm()
nm.rpkm(df=df, gl='length')
# get RPKM normalized dataframe
rpkm_df = nm.rpkm_norm
rpkm_df.head(2)
# output
ctr1 ctr2 ctr3 trt1 trt2 trt3
gene
Sobic.001G000200 50.804745 51.327542 37.699846 46.737552 37.472610 48.090169
Sobic.001G000400 3.060976 1.382614 3.375644 1.067995 1.233556 1.308627
TPM (Transcripts per million)
公式:

注意:
TPM considers the gene length for normalization. TPM is suitable for sequencing protocols where reads sequencing depends on gene length.
TPM is proposed as an alternative to RPKM because of inaccuracy in RPKM measurement. In contrast to RPKM, the TPM average is constant and is proportional to the relative RNA molar concentration (rmc)
.
计算:
from bioinfokit.analys import norm, get_data
# load sugarcane RNA-seq expression dataset (Published in Bedre et al., 2019)
df = get_data('sc_exp').data
df.head(2)
# output
gene ctr1 ctr2 ctr3 trt1 trt2 trt3 length
0 Sobic.001G000200 338 324 246 291 202 168 1982.0
1 Sobic.001G000400 49 21 53 16 16 11 4769.0
# make gene column as index column
df = df.set_index('gene')
# output
ctr1 ctr2 ctr3 trt1 trt2 trt3 length
gene
Sobic.001G000200 338 324 246 291 202 168 1982.0
Sobic.001G000400 49 21 53 16 16 11 4769.0
# now, normalize raw counts using TPM method
# gene length must be in bp
nm = norm()
nm.tpm(df=df, gl='length')
# get TPM normalized dataframe
tpm_df = nm.tpm_norm
tpm_df.head(2)
# output
ctr1 ctr2 ctr3 trt1 trt2 trt3
gene
Sobic.001G000200 99.730156 97.641941 72.361658 89.606265 69.447237 90.643338
Sobic.001G000400 6.008723 2.630189 6.479263 2.047584 2.286125 2.466582
4gene length
如果没有基因长度文件可以试试我的 GetGeneLength Package。
install:
$ pip install GetGeneLength
# for lattest version
$ pip install GetGeneLength==0.0.3
提取长度:
$ GetGeneLength -d gencode -g gencode.v38.annotation_human.gtf -l gene_length.txt
Your job is running, please wait...
Your job is done!
$ head -n 3 gene_length.txt
DDX11L1 ENSG00000223972.5 transcribed_unprocessed_pseudogene 1735
WASH7P ENSG00000227232.5 unprocessed_pseudogene 1351
MIR6859-1 ENSG00000278267.1 miRNA 68
然后和你的 count 文件根据基因名合并即可。

欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群
哦,数据代码已上传至QQ群,欢迎加入下载。
群二维码:

老俊俊微信:
知识星球:
所以今天你学习了吗?
欢迎小伙伴留言评论!
今天的分享就到这里了,敬请期待下一篇!
最后欢迎大家分享转发,您的点赞是对我的鼓励和肯定!
如果觉得对您帮助很大,赏杯快乐水喝喝吧!
往期回顾
◀...
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!