• 主页
  • 课程

    关于课程

    • 课程归档
    • 成为一名讲师
    • 讲师信息
    同等学历教学

    同等学历教学

    免费
    阅读更多
  • 特色
    • 展示
    • 关于我们
    • 问答
  • 事件
  • 个性化
  • 博客
  • 联系
  • 站点资源
    有任何问题吗?
    (00) 123 456 789
    weinfoadmin@weinformatics.cn
    注册登录
    恒诺新知
    • 主页
    • 课程

      关于课程

      • 课程归档
      • 成为一名讲师
      • 讲师信息
      同等学历教学

      同等学历教学

      免费
      阅读更多
    • 特色
      • 展示
      • 关于我们
      • 问答
    • 事件
    • 个性化
    • 博客
    • 联系
    • 站点资源

      未分类

      • 首页
      • 博客
      • 未分类
      • Tutorial:Enrichment Analysis, Clustering and Scoring with pathfindR

      Tutorial:Enrichment Analysis, Clustering and Scoring with pathfindR

      • 发布者 weinfoeditor
      • 分类 未分类
      • 日期 2021年7月20日
      • 评论 0评论

      https://www.biostars.org/p/322415/

      n this tutorial, I’ll try to provide a brief overview of the capabilities of our enrichment analysis R package pathfindR. The tool is in CRAN and its introductory vignette can be found here. We also have a detailed wiki.

      This tool is designed to improve enrichment analysis by firstly identifying active subnetworks in differential expression/methylation data using a protein-protein interaction network. It then performs over-representation analysis using the identified active subnetworks. By utilizing the interaction information, the tool identifies numerous gene sets relevant to the condition under study.

      I’ll be using the example data provided in the package: a gene expression data comparing rheumatoid arthritis patients vs. controls.


      Active-Subnetwork-Oriented Pathway Enrichment Analysis

      As illustrated below, this workflow takes in a data frame of 3 columns containing: Gene Symbols, Change Values (optional) and p values. It then maps these genes onto a protein-protein interaction network and identifies and filters active subnetworks. Using the genes in each active subnetwork, it performs enrichment analyses. The results are returned as a data frame as well as presented in an HTML report.

      pathfindR enrichment overview

      We’ll first load the library and the example data:

      library(pathfindR)
      
      head(RA_input)
        Gene.symbol    logFC  adj.P.Val
      1     FAM110A -0.69394 3.4087e-06
      2      RNASE2  1.35350 1.0085e-05
      3      S100A8  1.54483 3.4664e-05
      4      S100A9  1.02809 2.2626e-04
      5      TEX261 -0.32360 2.2626e-04
      6    ARHGAP17 -0.69193 2.7081e-04
      

      Then we simply run the wrapper function run_pathfindR for active-subnetwork oriented pathway enrichment analysis:

      RA_output <- run_pathfindR(RA_input)
      

      By default the gene sets used for enrichment analysis is KEGG pathways. The available gene sets are “KEGG”, “Reactome”, “BioCarta”, “GO-All”, “GO-BP”, “GO-CC”, “GO-MF” and “mmu_KEGG”. The gene sets can be specified using the gene_sets argument. Detailed information on all the arguments of run_pathfindR is provided here.

      The output of run_pathfindR() is a data frame containing 8 (or 9) columns:

      • ID: ID of the enriched term
      • Term_Description: Description of the enriched term
      • Fold_Enrichment: Fold enrichment value for the enriched term (Calculated using ONLY the input genes)
      • occurrence: the number of iterations that the given term was found to enriched over all iterations
      • lowest_p: the lowest adjusted-p value of the given term over all iterations
      • highest_p: the highest adjusted-p value of the given term over all iterations
      • non_Signif_Snw_Genes (OPTIONAL): the non-significant active subnetwork genes, comma-separated
      • Up_regulated: the up-regulated genes (as determined by ‘change value’ > 0, if the ‘change column’ was provided) in the input involved in the given term’s gene set, comma-separated. If change column not provided, all affected are listed here.
      • Down_regulated: the down-regulated genes (as determined by ‘change value’ < 0, if the ‘change column’ was provided) in the input involved in the given term’s gene set, comma-separated

      By default, the function also plots a bubble chart of enrichment results. The example bubble chart of enrichment results for RA_output is presented below:

      bubble chart of enrichment results

      The function also creates an HTML report named results.html under the directory specified by the argument output_dir (default = ‘pathfindR_Results’). The report contains a table of the active subnetwork-oriented pathway enrichment results. This table contains the same information as the returned data frame. If KEGG pathways were chosen, each enriched pathway name is linked to the visualization of that KEGG pathway with the gene nodes colored according to log-fold-change values. The colored human KEGG pathway diagrams are created using the input genes. If another gene set is being used, the graph visualization of interactions of pathway genes (in the PIN) are plotted instead.


      Enrichment Term Clustering

      Enrichment analysis usually yields a large number of related terms. In order to establish representative terms among similar groups of terms, we provide functionality to cluster the enriched terms, as outlined below:

      pathfindR clustering overview

      The wrapper function cluster_enriched_terms() is used for clustering the enriched terms. This function first calculates the pairwise kappa statistics between the terms in the data frame obtained from run_pathfindR(). It then clusters the terms and partitions them into relevant clusters. As can be seen in the diagram, there are two options for clustering: (1) hierarchical clustering and (2) fuzzy clustering.

      Detailed information on the clustering functionality can be found here.

      Hierarchical Clustering

      By default, cluster_enriched_terms() performs hierarchical clustering (using 1 - kappa statistic as the distance metric), determines the optimal number of clusters by maximizing the average silhouette width and returns a data frame with cluster assignments. The representative terms are chosen as the ones with the lowest p value in each cluster.

      RA_clustered <- cluster_enriched_terms(RA_output)
      head(RA_clustered, 2)
              ID                        Term_Description Fold_Enrichment occurrence   lowest_p  highest_p Up_regulated
      4 hsa04722 Neurotrophin signaling pathway          24.849          4 5.0626e-05 0.00066481             
      7 hsa04218            Cellular senescence          20.060          2 1.6810e-03 0.00168099             
                                                      Down_regulated Cluster         Status
      4 CRKL, FASLG, SH2B3, ABL1, MAGED1, IRAK2, IKBKB, CALM1, CALM3       1 Representative
      7       MTOR, ETS1, RBL2, CALM1, CALM3, NFATC3, SLC25A5, VDAC1       1         Member
      

      An enrichment bubble chart grouped by clusters can be plotted via:

      enrichment_chart(RA_clustered, plot_by_cluster = TRUE)
      
      bubble chart by clusters

      Heuristic Fuzzy Multiple-linkage Partitioning

      Alternatively, the fuzzy clustering method (as described in Huang DW, Sherman BT, Tan Q, et al. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8(9):R183.) can be used to obtain fuzzy cluster assignments and representative terms per each cluster:

      RA_clustered <- cluster_enriched_terms(RA_output, method = "fuzzy")
      

      Aggregated Term Scores per Sample

      The function score_terms() can be used to calculate the agglomerated z score of each enriched term per sample. This allows the user to individually examine the scores and infer how a term is overall altered (activated or repressed) in a given sample or a group of samples. The function uses:

      1. The output data frame obtain of run_pathfindR(),
      2. The expression matrix used during analysis (differential expression/methylation),
      3. (Optional for ordering samples) a vector of “case” IDs.

      Example usage is presented below:

      ## Vector of "Case" IDs cases <- c("GSM389703", "GSM389704", "GSM389706", "GSM389708", 
                 "GSM389711", "GSM389714", "GSM389716", "GSM389717", 
                 "GSM389719", "GSM389721", "GSM389722", "GSM389724", 
                 "GSM389726", "GSM389727", "GSM389730", "GSM389731", 
                 "GSM389733", "GSM389735")
      
      ## Calculate scores for representative terms 
      ## and plot heat map using term descriptions
      score_matrix <- score_terms(enrichment_table = RA_clustered[RA_clustered$Status == "Representative", ],
                                  exp_mat = RA_exp_mat,
                                  cases = cases,
                                  use_description = TRUE, # default FALSE
                                  label_samples = FALSE) # default = TRUE
      
      Pathway scoring

      pathfindR Analysis with Custom Gene Sets

      As of the latest version, pathfindR offers utility functions for obtaining organism-specific PIN data and organism-specific gene sets data via get_pin_file() and get_gene_sets_list(), respectively.

      It is possible to use run_pathfindR() with custom gene sets (including gene sets for non-Homo-sapiens species). Here, we provide an example application of active-subnetwork-oriented enrichment analysis of the target genes of two transcription factors.

      We first load and prepare the gene sets:

      ## CREB target genes
      CREB_target_genes <- normalizePath(system.file("extdata/CREB.txt", package = "pathfindR"))
      CREB_target_genes <- readLines(CREB_target_genes)[-c(1, 2)] # skip the first two lines
      
      ## MYC target genes
      MYC_target_genes <- normalizePath(system.file("extdata/MYC.txt", package = "pathfindR"))
      MYC_target_genes <- readLines(MYC_target_genes)[-c(1, 2)] # skip the first two lines
      
      ## Prep for use
      custom_genes <- list(TF1 = CREB_target_genes, TF2 = MYC_target_genes)
      custom_descriptions <- c(TF1 = "CREB target genes", TF2 = "MYC target genes")
      

      We next prepare the example input data frame. Because of the way we choose genes, we expect significant enrichment for MYC targets (40 MYC target genes + 10 CREB target genes). Because this is only an example, we also assign each genes random p-values between 0.001 and 0.05.

      set.seed(123)
      
      ## Select 40 random genes from MYC gene sets and 10 from CREB gene sets
      selected_genes <- sample(MYC_target_genes, 40)
      selected_genes <- c(selected_genes, 
                          sample(CREB_target_genes, 10))
      
      ## Assign random p value between 0.001 and 0.05 for each selected gene
      rand_p_vals <- sample(seq(0.001, 0.05, length.out = 5),
                            size = length(selected_genes),
                            replace = TRUE)
      
      input_df <- data.frame(Gene_symbol = selected_genes,
                             p_val = rand_p_vals)
      

      Finally, we perform active-subnetwork-oriented enrichment analysis via run_pathfindR() using the custom genes as the gene sets:

      custom_result <- run_pathfindR(input_df,
                                     gene_sets = "Custom",
                                     custom_genes = custom_genes,
                                     custom_descriptions = custom_descriptions,
                                     max_gset_size = Inf, # DO NOT LIMIT GENE SET SIZE
                                     iterations = 1, # for demo, setting number of iterations to 1
                                     output_dir = "misc/v1.4/CREB_MYC")
      
      ID  Term_Description Fold_Enrichment occurrence   lowest_p  highest_p
      1 TF2  MYC target genes          15.133          1 9.3559e-14 9.3559e-14
      2 TF1 CREB target genes          17.346          1 5.0187e-05 5.0187e-05
                                                                                                                                                                                                                                                                                     Up_regulated
      1 AGRP, ATP6V1C1, C19orf54, CD3EAP, CNPY3, EPB41, FOXD3, FXR1, GLA, HNRNPD, HOXA7, IL1RAPL1, KDM6A, LONP1, LTBR, MDN1, MICU1, NET1, NEUROD6, NMNAT2, NOL6, NUDC, PEPD, PKN1, PSMB3, RPS19, RPS28, RRS1, SLC9A5, SMC3, STC2, TESK2, TNPO2, TOPORS, TPM2, TSSK3, WBP2, ZBTB8OS, ZFYVE26, ZHX2
      2                                                                                                                                                                                                  BRAF, DIO2, ELAVL1, EPB41, FAM65A, FOXD3, NEUROD6, NOC4L, NUPL2, PPP1R15A, SYNGR3, TIPRL
        Down_regulated
      1               
      2
      

      Closing remarks

      In this tutorial I tried to provide an overview of pathfindR. I try to update this tutorial as certain changes and additions to the package are made. To keep this tutorial as brief as possible, I had to omit certain frequently asked issues (such as pathfindR analysis using non-human data). For such questions and any issues, please visit our website or submit an issue to our issues page on GitHub.

      Also check out the vignettes included in the package via browseVignettes("pathfindR") or find them here:

      1. Introduction to pathfindR
      2. Step-by-Step Execution of the pathfindR Enrichment Workflow
      3. pathfindR Analysis for non-Homo-sapiens organisms
      4. Visualization of pathfindR Enrichment Results
      5. Comparing Two pathfindR Results
      6. Obtaining PIN and Gene Sets Data

      请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!

      • 分享:
      作者头像
      weinfoeditor

      上一篇文章

      Plot.ly Volcano Plot Example
      2021年7月20日

      下一篇文章

      Tutorial: guidelines for the computational analysis of single-cell RNA sequencing data
      2021年7月20日

      你可能也喜欢

      2-1675088548
      lncRNA和miRNA生信分析系列讲座免费视频课和课件资源包,干货满满
      30 1月, 2023
      9-1675131201
      如何快速批量修改 Git 提交记录中的用户信息
      26 1月, 2023
      8-1678501786
      肿瘤细胞通过改变CD8+ T细胞中的丙酮酸利用和琥珀酸信号来调控抗肿瘤免疫应答。
      7 12月, 2022

      留言 取消回复

      要发表评论,您必须先登录。

      搜索

      分类

      • R语言
      • TCGA数据挖掘
      • 单细胞RNA-seq测序
      • 在线会议直播预告与回放
      • 数据分析那些事儿分类
      • 未分类
      • 生信星球
      • 老俊俊的生信笔记

      投稿培训

      免费

      alphafold2培训

      免费

      群晖配置培训

      免费

      最新博文

      Nature | 单细胞技术揭示衰老细胞与肌肉再生
      301月2023
      lncRNA和miRNA生信分析系列讲座免费视频课和课件资源包,干货满满
      301月2023
      如何快速批量修改 Git 提交记录中的用户信息
      261月2023
      logo-eduma-the-best-lms-wordpress-theme

      (00) 123 456 789

      weinfoadmin@weinformatics.cn

      恒诺新知

      • 关于我们
      • 博客
      • 联系
      • 成为一名讲师

      链接

      • 课程
      • 事件
      • 展示
      • 问答

      支持

      • 文档
      • 论坛
      • 语言包
      • 发行状态

      推荐

      • iHub汉语代码托管
      • iLAB耗材管理
      • WooCommerce
      • 丁香园论坛

      weinformatics 即 恒诺新知。ICP备案号:粤ICP备19129767号

      • 关于我们
      • 博客
      • 联系
      • 成为一名讲师

      要成为一名讲师吗?

      加入数以千计的演讲者获得100%课时费!

      现在开始

      用你的站点账户登录

      忘记密码?

      还不是会员? 现在注册

      注册新帐户

      已经拥有注册账户? 现在登录

      close
      会员购买 你还没有登录,请先登录
      • ¥99 VIP-1个月
      • ¥199 VIP-半年
      • ¥299 VIP-1年
      在线支付 激活码

      立即支付
      支付宝
      微信支付
      请使用 支付宝 或 微信 扫码支付
      登录
      注册|忘记密码?