Nature 文章主图结果复现
测试开头







测试结尾

没关注?伸出手指点这里—


1引言
复现一篇 Nature 文章的两张主图 Fig.1
和 Fig.2
。包含数据上下游分析及可视化。
文章标题:
主图 1:
主图 2:
文章数据分析方法部分:
本文相关推文:
2数据下载
下载方式多样,根据自己需求采用喜欢的方式下载。总共 30 的 fastq 文件。
文章 GSE 号:
GSE116570
3质控
# 1.qc raw-data
mkdir -p 2.QC-data/raw-qc 2.QC-data/clean-qc
fastqc -t 12 -q -o 2.QC-data/raw-qc/ 1.raw-data/*.gz
multiqc 2.QC-data/raw-qc/* -o 2.QC-data/raw-qc/
4去除接头
# 2.trim adaptors
mkdir 3.trim-data
for i in SRR74712{27..44} SRR74712{49..60}
do
cutadapt -j 10 -f fastq
-a CTGTAGGCACCATCAATTCGTATGCCGTCTT
-O 6 -m 20 -M 35
-o 3.trim-data/${i}.trimmed.fq.gz
1.raw-data/${i}.fastq.gz
done
# qc
fastqc -t 12 -q -o 2.QC-data/clean-qc/ 3.trim-data/*.gz
multiqc 2.QC-data/clean-qc/* -o 2.QC-data/clean-qc/
5给 tRNA/rRNA 建立索引
# 3.build index for rRNA/tRNA and genome
mkdir rRNA-index tRNA-index
bowtie2-build Saccharomyces-cerevisiae-rRNA.fasta rRNA-index/Saccharomyces-cerevisiae-rRNA
bowtie2-build Saccharomyces-cerevisiae-tRNA.fasta tRNA-index/Saccharomyces-cerevisiae-tRNA
6去除 rRNA/tRNA 序列污染
# 4.rm rtRNA contamination from fastq files
mkdir 4.rmrtRNA-data
# rmrRNA
for i in SRR74712{27..44} SRR74712{49..60}
do
bowtie2 -p 20 -x ./index/rRNA-index/Saccharomyces-cerevisiae-rRNA
--un-gz 4.rmrtRNA-data/${i}.rmrRNA.fq.gz
-U 3.trim-data/${i}.trimmed.fq.gz
-S 4.rmrtRNA-data/null
done
# rmtRNA
for i in SRR74712{27..44} SRR74712{49..60}
do
bowtie2 -p 20 -x ./index/tRNA-index/Saccharomyces-cerevisiae-tRNA
--un-gz 4.rmrtRNA-data/${i}.rmtRNA.fq.gz
-U 4.rmrtRNA-data/${i}.rmrRNA.fq.gz
-S 4.rmrtRNA-data/null
done
7比对
序列准备参考下面:
建立索引
# 5.map to trancriptome
mkdir trans-index
mkdir 6.map-trans-data
hisat2-build extened-CDS-seq.fa trans-index/extened-CDS-seq
比对
这里采用 hisat2 比对软件进行比对到转录组序列上,原文是 tophat2 比对到基因组序列上,我做的和原文有点不同
,输出 sam 文件:
for i in SRR74712{27..44} SRR74712{49..60}
do
hisat2 -p 20 -x ./index/trans-index/extened-CDS-seq
-k 1 --summary-file 5.map-data/${i}.mapinfo.txt
-U 4.rmrtRNA-data/${i}.rmtRNA.fq.gz
-S 5.map-data/${i}.sam
done
8计算 read density
代码根据之前相关文章修改一下即可,注意一下 sam 文件的坐标系统为 1-based 的即可:
import os
# make folder
os.mkdir('./3.ribo-density-data')
sample = []
for i in [range(27,45),range(49,61)]:
for j in i:
samp = ''.join(['../5.map-data/','SRR74712',str(j),'.sam'])
sample.append(samp)
# output name
outputName = ['FAS1-trans-rep1','FAS1-inter-rep1','FAS1-trans-rep2','FAS1-inter-rep2',
'FAS2-trans-rep1','FAS2-inter-rep1','FAS2-trans-rep2','FAS2-inter-rep2',
'FAS1-MPTdel-trans-rep1','FAS1-MPTdel-inter-rep1','FAS1-MPTdel-trans-rep2','FAS1-MPTdel-trans-rep3',
'FAS1-MPTdel-inter-rep2','FAS1-MPTdel-inter-rep3',
'FAS2-MPTdel-trans-rep1','FAS2-MPTdel-inter-rep1','FAS2-MPTdel-trans-rep2','FAS2-MPTdel-inter-rep2',
'GUS1-trans-rep1','GUS1-inter-rep1','GUS1-trans-rep2','GUS1-inter-rep2',
'MES1-trans-rep1','MES1-inter-rep1','MES1-trans-rep2','MES1-inter-rep2',
'ARC1-trans-rep1','ARC1-inter-rep1','ARC1-trans-rep2','ARC1-inter-rep2']
# run
for i in range(0,len(sample)):
riboDensityAtEachPosition(sample[i],''.join(['3.ribo-density-data/',outputName[i],'.density.txt']))
9计算 ratio
代码参考之前文章:
import os
# make folder
os.mkdir('./5.gene-enrichment-data')
# '3.ribo-density-data/ssb1-inter-rep1.map.density.txt' '3.ribo-density-data/ssb1-trans-rep1.map.density.txt'
geneList = ['FAS1','FAS2']
IPfile = ['FAS1-inter-rep1','FAS1-inter-rep2','FAS2-inter-rep1','FAS2-inter-rep2',
'FAS1-MPTdel-inter-rep1','FAS1-MPTdel-inter-rep2','FAS1-MPTdel-inter-rep3',
'FAS2-MPTdel-inter-rep1','FAS2-MPTdel-inter-rep2']
Inputfile = ['FAS1-trans-rep1','FAS1-trans-rep2','FAS2-trans-rep1','FAS2-trans-rep2',
'FAS1-MPTdel-trans-rep1','FAS1-MPTdel-trans-rep2','FAS1-MPTdel-trans-rep3',
'FAS2-MPTdel-trans-rep1','FAS2-MPTdel-trans-rep2']
outputfile = ['FAS1-rep1','FAS1-rep2','FAS2-rep1','FAS2-rep2',
'FAS1_MPTdel-rep1','FAS1_MPTdel-rep2','FAS1_MPTdel-rep3',
'FAS2_MPTdel-rep1','FAS2_MPTdel-rep2']
# run
for i in range(0,9):
batchEnrichment(''.join(['3.ribo-density-data/',IPfile[i],'.density.txt']),
''.join(['3.ribo-density-data/',Inputfile[i],'.density.txt']),
geneList,
''.join(['./5.gene-enrichment-data/',outputfile[i],'.sample1.enrich.txt']))
# '3.ribo-density-data/ssb1-inter-rep1.map.density.txt' '3.ribo-density-data/ssb1-trans-rep1.map.density.txt'
geneList = ['GUS1','MES1','ARC1']
IPfile = ['GUS1-inter-rep1','GUS1-inter-rep2','MES1-inter-rep1','MES1-inter-rep2','ARC1-inter-rep1','ARC1-inter-rep2']
Inputfile = ['GUS1-trans-rep1','GUS1-trans-rep2','MES1-trans-rep1','MES1-trans-rep2','ARC1-trans-rep1','ARC1-trans-rep2']
outputfile = ['GUS1-rep1','GUS1-rep2','MES1-rep1','MES1-rep2','ARC1-rep1','ARC1-rep2']
# run
for i in range(0,6):
batchEnrichment(''.join(['3.ribo-density-data/',IPfile[i],'.density.txt']),
''.join(['3.ribo-density-data/',Inputfile[i],'.density.txt']),
geneList,
''.join(['./5.gene-enrichment-data/',outputfile[i],'.sample2.enrich.txt']))
10Fig1 复现
批量读取数据进行预处理:
library(ggplot2)
library(tidyverse)
library(ggsci)
library(Rmisc)
library(data.table)
################################################### expriment 1
name <- list.files(path = '5.gene-enrichment-data/',pattern = '.sample1.enrich.txt')
map_df(1:length(name), function(x){
tmp <- fread(paste('5.gene-enrichment-data/',name[x],sep = ''),sep = 't')
colnames(tmp) <- c('id','pos','ratio')
# add gene name
tmp <- tmp[, c("gene_name") := tstrsplit(id, "|", fixed=TRUE)[1]]
# loop for every gene to process data
map_df(unique(tmp$gene_name),function(y){
tmp1 <- tmp[gene_name == y]
start <- sapply(strsplit(tmp1$id[1],split = '\|'),'[',3) %>% as.numeric()
end <- sapply(strsplit(tmp1$id[1],split = '\|'),'[',4) %>% as.numeric()
# 1.transform to codon pos
sq <- seq(1,(end - start + 1),3);rg <- c(1:length(sq))
map_df(1:length(sq),function(z){
tmp2 = tmp1[pos >= sq[z] & pos <= sq[z] + 2
][,.(mean_ratio = mean(ratio)),by = .(id,gene_name)
][,`:=`(codon_pos = rg[z])]
return(tmp2)
}) -> codon_res
return(codon_res)
}) -> final_res
# add sample info
final_res$sample <- sapply(strsplit(name[x],split = '\.'),'[',1)
final_res$type <- sapply(strsplit(name[x],split = '\-'),'[',1)
final_res$exp <- sapply(strsplit(name[x],split = '\-'),'[',2)
return(final_res)
}) %>% data.table() -> df_ratio
##################################################
# mean for replicates
merge_rep <- df_ratio %>%
dplyr::group_by(type,gene_name,codon_pos) %>%
dplyr::summarise(mean_rep_ratio = mean(mean_ratio),
mean_sd = sd(mean_ratio))
主图上半部分可视化:
###################################################
merge_rep$gene_name <- factor(merge_rep$gene_name,
levels = c('FAS1','FAS2'))
# plot
ggplot(merge_rep %>% filter(type %in% c('FAS1','FAS2')),
aes(x = codon_pos,y = mean_rep_ratio)) +
geom_line(aes(color = gene_name)) +
geom_hline(yintercept = 2,lty = 'dashed',color = 'red',size = 1) +
geom_ribbon(aes(ymin = mean_rep_ratio - mean_sd,
ymax = mean_rep_ratio + mean_sd,
fill = gene_name),
alpha = 0.3) +
theme_classic(base_size = 16) +
scale_color_manual(name = '',values = c('#C89E70','#54A36C')) +
scale_fill_manual(name = '',values = c('#C89E70','#54A36C')) +
theme(legend.background = element_blank(),
strip.background = element_rect(color = NA,fill = 'grey')) +
ylab('Mean enrichment [AU] n (co-IP/total)') +
xlab('Ribosome position n (Codons/amino acids)') +
# facet_wrap(~gene_name,scales = 'free',ncol = 2)
facet_grid(type~gene_name,scales = 'free_x')

主图下半部分可视化:
# plot
ggplot(merge_rep %>% filter(type %in% c('FAS1_MPTdel')),
aes(x = codon_pos,y = mean_rep_ratio)) +
geom_line(aes(color = gene_name)) +
geom_hline(yintercept = 2,lty = 'dashed',color = 'red',size = 1) +
geom_ribbon(aes(ymin = mean_rep_ratio - mean_sd,
ymax = mean_rep_ratio + mean_sd,
fill = gene_name),
alpha = 0.3) +
theme_classic(base_size = 16) +
scale_color_manual(name = '',values = c('#C89E70','#54A36C')) +
scale_fill_manual(name = '',values = c('#C89E70','#54A36C')) +
theme(legend.background = element_blank(),
strip.background = element_rect(color = NA,fill = 'grey')) +
ylab('Mean enrichment [AU] n (co-IP/total)') +
xlab('Ribosome position n (Codons/amino acids)') +
facet_wrap(~gene_name,scales = 'free_x',ncol = 2) +
ylim(0,90)

11Fig2 复现
批量读取数据进行预处理:
################################################### expriment 2
name <- list.files(path = '5.gene-enrichment-data/',pattern = '.sample2.enrich.txt')
map_df(1:length(name), function(x){
tmp <- fread(paste('5.gene-enrichment-data/',name[x],sep = ''),sep = 't')
colnames(tmp) <- c('id','pos','ratio')
# add gene name
tmp <- tmp[, c("gene_name") := tstrsplit(id, "|", fixed=TRUE)[1]]
# loop for every gene to process data
map_df(unique(tmp$gene_name),function(y){
tmp1 <- tmp[gene_name == y]
start <- sapply(strsplit(tmp1$id[1],split = '\|'),'[',3) %>% as.numeric()
end <- sapply(strsplit(tmp1$id[1],split = '\|'),'[',4) %>% as.numeric()
# 1.transform to codon pos
sq <- seq(1,(end - start + 1),3);rg <- c(1:length(sq))
map_df(1:length(sq),function(z){
tmp2 = tmp1[pos >= sq[z] & pos <= sq[z] + 2
][,.(mean_ratio = mean(ratio)),by = .(id,gene_name)
][,`:=`(codon_pos = rg[z])]
return(tmp2)
}) -> codon_res
return(codon_res)
}) -> final_res
# add sample info
final_res$sample <- sapply(strsplit(name[x],split = '\.'),'[',1)
final_res$type <- sapply(strsplit(name[x],split = '\-'),'[',1)
final_res$exp <- sapply(strsplit(name[x],split = '\-'),'[',2)
return(final_res)
}) %>% data.table() -> df_ratio
##################################################
# mean for replicates
merge_rep <- df_ratio %>%
dplyr::group_by(type,gene_name,codon_pos) %>%
dplyr::summarise(mean_rep_ratio = mean(mean_ratio),
mean_sd = sd(mean_ratio))
主图可视化:
###################################################
merge_rep$gene_name <- factor(merge_rep$gene_name,
levels = c('GUS1','ARC1','MES1'))
merge_rep$type <- factor(merge_rep$type,
levels = c('GUS1','MES1','ARC1'))
# plot
ggplot(merge_rep,
aes(x = codon_pos,y = mean_rep_ratio)) +
geom_line(aes(color = gene_name)) +
geom_hline(yintercept = 2,lty = 'dashed',color = 'red',size = 1) +
geom_ribbon(aes(ymin = mean_rep_ratio - mean_sd,
ymax = mean_rep_ratio + mean_sd,
fill = gene_name),
alpha = 0.3) +
theme_classic(base_size = 16) +
scale_color_manual(name = '',values = c('#4883C6','#AA356A','#D52C30')) +
scale_fill_manual(name = '',values = c('#4883C6','#AA356A','#D52C30')) +
theme(legend.background = element_blank(),
strip.background = element_rect(color = NA,fill = 'grey')) +
ylab('Mean enrichment [AU] n (co-IP/total)') +
xlab('Ribosome position n (Codons/amino acids)') +
# facet_wrap(~gene_name,scales = 'free',ncol = 2)
facet_grid(type~gene_name,scales = 'free_x')

和原文还是很像的。
12结尾
下面是文章想阐明的分子机制,还是比较好懂的:

欢迎加入生信交流群。加我微信我也拉你进 微信群聊 老俊俊生信交流群
哦,数据代码已上传至QQ群,欢迎加入下载。
群二维码:
老俊俊微信:
知识星球:
所以今天你学习了吗?
今天的分享就到这里了,敬请期待下一篇!
最后欢迎大家分享转发,您的点赞是对我的鼓励和肯定!
如果觉得对您帮助很大,赏杯快乐水喝喝吧!
往期回顾
◀...
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!