R语言-使用RTCGAToolbox包对获取TCGA数据数据分析与可视化(2)
继续关注RTCGA包:数据处理及可视化功能,主要是两个分析:
(1)生存分析
survivalTCGA()从RTCGA.clinical中获取临床数据,kmTCGA()生成K-M生存曲线;
(2)表达分析
expressionsTCGA()从RTCGA.rnaseq、RTCGA.mirna等共5个包中获取表达数据,pcaTCGA()、heatmapTCGA()、boxplotTCGA()可视化。
生存分析
RTCGA.clinical包中分癌症类型,有38个数据表,如BRCA.clinical, OV.clinical等等,癌症的编号详见http://gdac.broadinstitute.org/。每个数据表包括”admin.bcr”、”admin.day_of_dcc_upload”、”admin.disease_code” 等共计3703列。如下所示:
library("RTCGA.clinical")
library("tidyverse")
BRCA.clinical%>%colnames()%>%head
# "admin.bcr" "admin.day_of_dcc_upload" "admin.disease_code"
# "admin.file_uuid" "admin.month_of_dcc_upload" "admin.patient_withdrawal.withdrawn"
BRCA.clinical%>%ncol()
#3703
survivalTCGA()可以从RTCGA.clinical中获取临床数据,如果不注明extract.cols参数,那么结果只有三列:times 、bcr_patient_barcode、patient.vital_status
,分别是生存时间、barcode及病人生存状态。
kmTCGA()用于绘制生存曲线,参数explanatory.names代表分组:
# 获取生存数据,不同的疾病
survivalTCGA(BRCA.clinical, OV.clinical, extract.cols = "admin.disease_code") -> BRCAOV.survInfo
# 绘图
kmTCGA(BRCAOV.survInfo, explanatory.names = "admin.disease_code", pval = TRUE)
#kmTCGA(BRCAOV.survInfo, explanatory.names = "admin.disease_code", main = "",
xlim = c(0,4000))
# 获取生存数据,不同的治疗方式
survivalTCGA(BRCA.clinical,
extract.cols = c("patient.drugs.drug.therapy_types.therapy_type")) %>%
filter(patient.drugs.drug.therapy_types.therapy_type %in%
c("chemotherapy", "hormone therapy")) %>%
rename(therapy = patient.drugs.drug.therapy_types.therapy_type) -> BRCA.survInfo.chemo
# 绘图
kmTCGA(BRCA.survInfo.chemo, explanatory.names = "therapy", xlim = c(0, 3000), conf.int = FALSE)
表达分析
以RTCGA.rnaseq包为例,expressionsTCGA ()用于获取表达数据,这部分在上一篇RTCGA中已经说过。
expressionsTCGA()如指定参数extract.cols,则返回特定基因在各个样本的表达量,如不指定则返回全部基因,其值的形式为“Gene symbol|Gene ID”,如 “VENTX|27287″。
- PCA
library(RTCGA.rnaseq)
# 获取全部基因的表达情况
expressionsTCGA(BRCA.rnaseq, OV.rnaseq, HNSC.rnaseq) %>%
rename(cohort = dataset) %>%
filter(substr(bcr_patient_barcode, 14, 15) == "01") -> BRCA.OV.HNSC.rnaseq.cancer
BRCA.OV.HNSC.rnaseq.cancer %>%ncol()
# 20533
# 由于基因数太多了,随机选1000个基因绘图
BRCA.OV.HNSC.rnaseq.cancer%>%select(sample(colnames(.),1000),-bcr_patient_barcode,cohort) %>%
pcaTCGA("cohort")->pca.rnaseq
plot(pca.rnaseq)
- 热图
hearmap用于绘制热图,主要有四个参数,第一个为包含所需变量的数据库,第二个和第三个分别是热图的X、Y轴分组,第四个变量为展示的变量。
# 获取MET、ZNF500、ZNF501三个基因在ACC、BLCA、BRCA、OV癌症中的表达
expressionsTCGA(ACC.rnaseq, BLCA.rnaseq, BRCA.rnaseq, OV.rnaseq,
extract.cols = c("MET|4233", "ZNF500|26048", "ZNF501|115560")) %>%
rename(cohort = dataset, MET = `MET|4233`) %>% #cancer samples
filter(substr(bcr_patient_barcode, 14, 15) == "01") %>%
mutate(MET = cut(MET,
round(quantile(MET, probs = seq(0,1,0.25)), -2),
include.lowest = TRUE,
dig.lab = 5)) -> ACC_BLCA_BRCA_OV.rnaseq
# 以癌症为第一分组、MET基因表达量为第二分组,计算各分组ZNF500、ZNF501的基因表达中位数
ACC_BLCA_BRCA_OV.rnaseq %>%
select(-bcr_patient_barcode) %>%
group_by(cohort, MET) %>%
summarise_all(median) %>%
mutate(ZNF500 = round(`ZNF500|26048`),
ZNF501 = round(`ZNF501|115560`)) -> ACC_BLCA_BRCA_OV.rnaseq.medians
# 绘制ZNF501热图
heatmapTCGA(ACC_BLCA_BRCA_OV.rnaseq.medians,
"cohort", "MET", "ZNF501", title = "Heatmap of ZNF501 expression")
- 箱线图
boxplotTCGA用于绘制箱线图,主要的参数有data、x、y、fill,分别是数据库、x轴、y轴,染色填充,两个有意思的参数coord.flip坐标轴翻转(默认翻转), facet.names分组(facet)变量,xlab和ylab用于定义坐标轴标题,legend.title定义图例,legend定义图例的位置。
# 获取RNAseq表达量数据
expressionsTCGA(ACC.rnaseq, BLCA.rnaseq, BRCA.rnaseq, OV.rnaseq,
extract.cols = "MET|4233") %>%
rename(cohort = dataset,
MET = `MET|4233`) %>%
#cancer samples
filter(substr(bcr_patient_barcode, 14, 15) == "01") -> ACC_BLCA_BRCA_OV.rnaseq
boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "MET")
boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "cohort", "log1p(MET)") # 数据变换,可以压缩异常值的离群趋势
boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq, "reorder(cohort,log1p(MET), median)", "log1p(MET)") # 调整x的顺序
boxplotTCGA(ACC_BLCA_BRCA_OV.rnaseq,"cohort", "log1p(MET)",facet.names = "cohort")
本周的分享就到这里。
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!