GEO数据库中国区镜像横空出世
接收到太多的粉丝求助,想下载个表达矩阵做一下数据挖掘偏偏第一步就卡在了,数据文件下载半天毫无动静,或者下载到99%就卡死了。如果我恰好在电脑旁,通常会帮忙下载后微云或者百度云传递给粉丝,但这毕竟不是长久之计。
经过个把月的不懈努力,我终于把全部的GEO数据库里面的表达芯片数据都下载并且全部格式化处理成为r数据文件,并且购置一个2万块钱的腾讯云服务器来存放它们,供广大粉丝使用!
下载使用GEOmirror
Or just use source function to load the codes of geoChina function, as below:
source('http://raw.githubusercontent.com/jmzeng1314/GEOmirror/master/R/geoChina.R')
What if all these 3 methods failed? I'm so sorry, what' a pity that there's no chance for you to use our GEOmirror !!
安装成功后,加载我们的R包会有一个提示,妈妈以后再也不用担心自己不知道怎么样致谢生信技能树团队啦!
image-20191130112132080
这个就是官方致谢:在感恩节官宣!(文末有惊喜哈)
使用起来非常方便,就一句话,找到你的GSE数据集的ID,传给我们的函数即可:
use it to download GEO dataset, as below :
eSet=geoChina('GSE1009')
eSet=geoChina('GSE27533')
eSet=geoChina('GSE95166')
举一个简单的例子
Once you download the ExpressionSet of GEO dataset, you can access the expression matrix and phenotype data:
## download GSE95166 data
# https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE95166
#eSet=getGEO('GSE95166', destdir=".", AnnotGPL = F, getGPL = F)[[1]]
library(GEOmirror)
eSet=geoChina('GSE95166')
eSet
eSet=eSet[[1]]
probes_expr <- exprs(eSet);dim(probes_expr)
head(probes_expr[,1:4])
boxplot(probes_expr,las=2)
## pheno info
phenoDat <- pData(eSet)
head(phenoDat[,1:4])
# https://www.ncbi.nlm.nih.gov/pubmed/31430288
groupList=factor(c(rep('npc',4),rep('normal',4)))
table(groupList)
eSet@annotation
# GPL15314 Arraystar Human LncRNA microarray V2.0 (Agilent_033010 Probe Name version)
对于这一点表达矩阵数据集,我们可以看看PCA图,火山图以及热图:
image-20191130112636685
代码如下:
genes_expr=probes_expr
library("FactoMineR")
library("factoextra")
dat.pca <- PCA(t(genes_expr) , graph = FALSE)
dat.pca
fviz_pca_ind(dat.pca,
geom.ind = "point",
col.ind = groupList,
addEllipses = TRUE,
legend.title = "Groups"
)
library(limma)
design=model.matrix(~factor(groupList))
design
fit=lmFit(genes_expr,design)
fit=eBayes(fit)
DEG=topTable(fit,coef=2,n=Inf)
head(DEG)
# We observed that 2107 lncRNAs were upregulated
# while 2090 lncRNAs were downregulated by more than 2-fold,
# NKILA among these downregulated lncRNAs (Fig 1A, GSE95166).
## for volcano plot
df=DEG
attach(df)
df$v= -log10(P.Value)
df$g=ifelse(df$P.Value>0.05,'stable',
ifelse( df$logFC >1,'up',
ifelse( df$logFC < -1,'down','stable') )
)
table(df$g)
df$name=rownames(df)
head(df)
library(ggpubr)
ggpubr::ggscatter(df, x = "logFC", y = "v", color = "g",size = 0.5,
label = "name", repel = T,
label.select =head(rownames(df)),
palette = c("#00AFBB", "#E7B800", "#FC4E07") )
detach(df)
x=DEG$logFC
names(x)=rownames(DEG)
cg=c(names(head(sort(x),100)),
names(tail(sort(x),100)))
cg
library(pheatmap)
n=t(scale(t(genes_expr[cg,])))
n[n>2]=2
n[n< -2]= -2
n[1:4,1:4]
ac=data.frame(groupList=groupList)
rownames(ac)=colnames(n)
pheatmap(n,show_colnames =F,show_rownames = F,
annotation_col=ac)
实际上,这个时候,我们需要把探针的ID转换为基因名字,进行后续分析,不过这个比较复杂超出了本文范围,感兴趣的看:(重磅!价值一千元的R代码送给你)芯片探针序列的基因组注释
是不是很激动,想试试看
同样的代码,你可以处理自己的数据集哦。
因为功能非常简单,just a replacement of getGEO function from GEOquery package.
所以是不会有bug的,但是,也许大家在使用的过程有新的需求,我可以酌情根据时间来开发增加功能
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!