R/Seurat包对数据处理03-b数据整合分析
回想起来自己从事生物相关的研究已经大概15年了,从研究生进入实验室也有10年时间,陆续从硕士,博士到博后,研究地点也从化学学院,到药学院再到医院科室。自己做的研究是“干-湿”实验结合的,发表的成果也是各自一半,但是综合起来还是生物信息分析的文章发表的影响因子高一些。到现在由于工作场所频繁发生变化,反而没有稳定的场所做实验,所以愈发的在生物信息方面下较多的功夫。因此我对这十几年来的生信研究进行总结,希望帮助新手克服生物信息陡峭的学习曲线,当然我自己也不是科班出身的,也希望与你一起交流学习。所有的内容均是以自己的实验数据(会明确下载地址给读者)操作来进行,避免某些在demo运行很好却在自己的环境中出现bug的情况。最后一点,现在通讯太发达了,欢迎大家与我直接交流共同进步。
9 scRNA-seq Dataset Integration | Analysis of single cell RNA-seq data
大部分的情况下我们会碰到需要将数据整合分析的场景,比如不同批次测的同一个样本的单细胞测序,这样做的好处是可以获得更多的细胞数量方便分析获得更多的数据。
library(Seurat)
library(SeuratDisk)
library(SeuratWrappers)
library(patchwork)
library(harmony)
library(rliger)
library(reshape2)
library(RColorBrewer)
library(dplyr)
# 获取数据
download.file("https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_NGSC3_DI_PBMC/Parent_NGSC3_DI_PBMC_filtered_feature_bc_matrix.h5",
destfile = "3p_pbmc10k_filt.h5")
download.file("https://cf.10xgenomics.com/samples/cell-vdj/5.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_filtered_feature_bc_matrix.h5",
destfile = "5p_pbmc10k_filt.h5")
# Let’s read the data and create the appropriate Seurat objects. Note that 5’ dataset has other assays - namely, VDJ data.
matrix_3p <- Read10X_h5("data/update/3p_pbmc10k_filt.h5",use.names = T)
matrix_5p <- Read10X_h5("data/update/5p_pbmc10k_filt.h5",use.names = T)$`Gene Expression`
srat_3p <- CreateSeuratObject(matrix_3p,project = "pbmc10k_3p")
srat_5p <- CreateSeuratObject(matrix_5p,project = "pbmc10k_5p")
# 节省内存移除一些加载
rm(matrix_3p)
rm(matrix_5p)
# Let’s calculate the fractions of mitochondrial genes and ribosomal proteins, and do quick-and-dirty filtering of the datasets:
srat_3p[["percent.mt"]] <- PercentageFeatureSet(srat_3p, pattern = "^MT-")
srat_3p[["percent.rbp"]] <- PercentageFeatureSet(srat_3p, pattern = "^RP[SL]")
srat_5p[["percent.mt"]] <- PercentageFeatureSet(srat_5p, pattern = "^MT-")
srat_5p[["percent.rbp"]] <- PercentageFeatureSet(srat_5p, pattern = "^RP[SL]")
# 可视化结果
VlnPlot(srat_3p, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rbp"), ncol = 4)
VlnPlot(srat_5p, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rbp"), ncol = 4)
# 需要对比基因名字是否一致
table(rownames(srat_3p) %in% rownames(srat_5p))
##
## TRUE
## 36601
# Quick filtering of the datasets removes dying cells and putative doublets:
srat_3p <- subset(srat_3p, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 15)
srat_5p <- subset(srat_5p, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 10)
# 然后执行正态化以及后续的分析
pbmc_list <- list()
pbmc_list[["pbmc10k_3p"]] <- srat_3p
pbmc_list[["pbmc10k_5p"]] <- srat_5p
for (i in 1:length(pbmc_list)) {
pbmc_list[[i]] <- NormalizeData(pbmc_list[[i]], verbose = F)
pbmc_list[[i]] <- FindVariableFeatures(pbmc_list[[i]], selection.method = "vst", nfeatures = 2000, verbose = F)
}
# After this, we use the two following Seurat commands to find integration anchors and actually perform integration. This takes about 10 min:
pbmc_anchors <- FindIntegrationAnchors(object.list = pbmc_list, dims = 1:30)
## Warning in CheckDuplicateCellNames(object.list = object.list): Some cell names
## are duplicated across objects provided. Renaming to enforce unique cell names.
## Computing 2000 integration features
## Scaling features for provided objects
## Finding all pairwise anchors
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
## Found 21626 anchors
## Filtering anchors
## Retained 8832 anchors
pbmc_seurat <- IntegrateData(anchorset = pbmc_anchors, dims = 1:30)
## Merging dataset 1 into 2
## Extracting anchors for merged samples
## Finding integration vectors
## Finding integration vector weights
## Integrating data
# Let’s remove all the datastructures we’re not using to save the RAM:
rm(pbmc_list)
rm(pbmc_anchors)
# Seurat integration creates a unified object that contains both original data (‘RNA’ assay) as well as integrated data (‘integrated’ assay). Let’s set the assay to RNA and visualize the datasets before integration.
DefaultAssay(pbmc_seurat) <- "RNA"
# Let’s do normalization, HVG finding, scaling, PCA, and UMAP on the un-integrated (RNA) assay:
pbmc_seurat <- NormalizeData(pbmc_seurat, verbose = F)
pbmc_seurat <- FindVariableFeatures(pbmc_seurat, selection.method = "vst", nfeatures = 2000, verbose = F)
pbmc_seurat <- ScaleData(pbmc_seurat, verbose = F)
pbmc_seurat <- RunPCA(pbmc_seurat, npcs = 30, verbose = F)
pbmc_seurat <- RunUMAP(pbmc_seurat, reduction = "pca", dims = 1:30, verbose = F)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
# UMAP plot of the datasets before integration shows clear separation. Note that we can use patchwork syntax with Seurat plotting functions:
DimPlot(pbmc_seurat,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and 10k 5' PBMC cells, before integration")
# 发现未整合的数据靠的分厂的远
# Now let’s change the assay to integrated and do the same do the same thing in the integrated assay (it’s already normalized and HVGs are selected):
DefaultAssay(pbmc_seurat) <- "integrated"
pbmc_seurat <- ScaleData(pbmc_seurat, verbose = F)
pbmc_seurat <- RunPCA(pbmc_seurat, npcs = 30, verbose = F)
pbmc_seurat <- RunUMAP(pbmc_seurat, reduction = "pca", dims = 1:30, verbose = F)
# Finally, let’s plot the integrated UMAP: 就发现很好的整合了
DimPlot(pbmc_seurat, reduction = "umap") + plot_annotation(title = "10k 3' PBMC and 10k 5' PBMC cells, after integration (Seurat 3)")
# The data are visibly very nicely integrated. Let’s try a split plot, which should make the comparison easier:
DimPlot(pbmc_seurat, reduction = "umap", split.by = "orig.ident") + NoLegend()
# Now let’s cluster the integrated matrix and look how clusters are distributed between the two sets:
pbmc_seurat <- FindNeighbors(pbmc_seurat, dims = 1:30, k.param = 10, verbose = F)
pbmc_seurat <- FindClusters(pbmc_seurat, verbose = F)
DimPlot(pbmc_seurat,label = T) + NoLegend()
# We can now calculate the number of cells in each cluster that came for either 3’ or the 5’ dataset:
count_table <- table(pbmc_seurat@meta.data$seurat_clusters, pbmc_seurat@meta.data$orig.ident)
count_table
##
## pbmc10k_3p pbmc10k_5p
## 0 1313 2140
## 1 1427 945
## 2 1214 994
## 3 898 1110
## 4 612 769
## 5 576 437
## 6 401 603
## 7 338 646
## 8 311 403
## 9 359 223
## 10 251 292
## 11 347 131
## 12 202 240
## 13 136 288
## 14 251 61
## 15 109 190
## 16 118 143
## 17 76 92
## 18 93 61
## 19 55 87
## 20 35 35
## 21 24 43
## 22 13 40
## 23 15 10
## 24 14 5
## 25 2 12
# Let’s plot the distribution among clusters using our custom function:
plot_integrated_clusters(pbmc_seurat)
## Using cluster as id variables
# 以下命令删除并且进行下一步的hrmony整合
rm(pbmc_seurat)
# Harmony, 3’ vs 5’ 10k PBMC
pbmc_harmony <- merge(srat_3p,srat_5p)
## Warning in CheckDuplicateCellNames(object.list = objects): Some cell names are
## duplicated across objects provided. Renaming to enforce unique cell names.
# Now let’s do the same as we did before:
pbmc_harmony <- NormalizeData(pbmc_harmony, verbose = F)
pbmc_harmony <- FindVariableFeatures(pbmc_harmony, selection.method = "vst", nfeatures = 2000, verbose = F)
pbmc_harmony <- ScaleData(pbmc_harmony, verbose = F)
pbmc_harmony <- RunPCA(pbmc_harmony, npcs = 30, verbose = F)
pbmc_harmony <- RunUMAP(pbmc_harmony, reduction = "pca", dims = 1:30, verbose = F)
# Let’s plot the “before” plot again: 先可视化非整合的数据
DimPlot(pbmc_harmony,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and 10k 5' PBMC cells, before integration")
# Use RunHarmony to run it on the combined Seurat object using orig.ident as batch:
pbmc_harmony <- pbmc_harmony %>% RunHarmony("orig.ident", plot_convergence = T)
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony 4/10
## Harmony 5/10
## Harmony converged after 5 iterations
## Warning: Invalid name supplied, making object name syntactically valid. New
## object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details
## on syntax validity
# Check the generated embeddings:
harmony_embeddings <- Embeddings(pbmc_harmony, 'harmony')
harmony_embeddings[1:5, 1:5]
## harmony_1 harmony_2 harmony_3 harmony_4 harmony_5
## AAACCCACATAACTCG-1_1 -9.206607 -2.351619 -2.374652 -1.897186 -1.011885
## AAACCCACATGTAACC-1_1 7.124223 21.600131 -0.292039 1.530283 -5.792142
## AAACCCAGTGAGTCAG-1_1 -18.134725 3.405369 5.256459 4.220001 3.961466
## AAACGAACAGTCAGTT-1_1 -18.103262 15.279955 12.301681 -18.115094 31.785955
## AAACGAACATTCGGGC-1_1 11.097966 -2.330278 -2.723953 1.546468 1.552332
# Check the PCA plot after:
p1 <- DimPlot(object = pbmc_harmony, reduction = "harmony", pt.size = .1, group.by = "orig.ident") + NoLegend()
p2 <- VlnPlot(object = pbmc_harmony, features = "harmony_1", group.by = "orig.ident", pt.size = .1) + NoLegend()
plot_grid(p1,p2)
## 可以看到pca和voil图很好的一致
# Do UMAP and clustering:
pbmc_harmony <- pbmc_harmony %>%
RunUMAP(reduction = "harmony", dims = 1:30, verbose = F) %>%
FindNeighbors(reduction = "harmony", k.param = 10, dims = 1:30) %>%
FindClusters() %>%
identity()
## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 19190
## Number of edges: 280811
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8952
## Number of communities: 26
## Elapsed time: 3 seconds
# Finally, so same UMAP plots of integrated datasets as above:
pbmc_harmony <- SetIdent(pbmc_harmony,value = "orig.ident")
DimPlot(pbmc_harmony,reduction = "umap") + plot_annotation(title = "10k 3' PBMC and 10k 5' PBMC cells, after integration (Harmony)")
# 以下命令分开展示
DimPlot(pbmc_harmony, reduction = "umap", group.by = "orig.ident", pt.size = .1, split.by = 'orig.ident') + NoLegend()
# These look a bit worse than ones from Seurat:结果看起来不好
pbmc_harmony <- SetIdent(pbmc_harmony,value = "seurat_clusters")
DimPlot(pbmc_harmony,label = T) + NoLegend()
# Finally, let’s take a look at the cluster content:
plot_integrated_clusters(pbmc_harmony)
## Using cluster as id variables
# 以下命令移除然后使用新的方法
rm(pbmc_harmony)
# 后续不做太多赘述,一般都是要尝试seurat harmony liger三种方法来找到最合适的合并方法,然后进行分析
最后,如果感兴趣请加微信:cll7658,如果你有觉得比较重要的讨论即可。
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!
标签:单细胞测序