Tutorial:Enrichment Analysis, Clustering and Scoring with pathfindR
https://www.biostars.org/p/322415/
n this tutorial, I’ll try to provide a brief overview of the capabilities of our enrichment analysis R package pathfindR. The tool is in CRAN and its introductory vignette can be found here. We also have a detailed wiki.
This tool is designed to improve enrichment analysis by firstly identifying active subnetworks in differential expression/methylation data using a protein-protein interaction network. It then performs over-representation analysis using the identified active subnetworks. By utilizing the interaction information, the tool identifies numerous gene sets relevant to the condition under study.
I’ll be using the example data provided in the package: a gene expression data comparing rheumatoid arthritis patients vs. controls.
Active-Subnetwork-Oriented Pathway Enrichment Analysis
As illustrated below, this workflow takes in a data frame of 3 columns containing: Gene Symbols
, Change Values
(optional) and p values
. It then maps these genes onto a protein-protein interaction network and identifies and filters active subnetworks. Using the genes in each active subnetwork, it performs enrichment analyses. The results are returned as a data frame as well as presented in an HTML report.

We’ll first load the library and the example data:
library(pathfindR)
head(RA_input)
Gene.symbol logFC adj.P.Val
1 FAM110A -0.69394 3.4087e-06
2 RNASE2 1.35350 1.0085e-05
3 S100A8 1.54483 3.4664e-05
4 S100A9 1.02809 2.2626e-04
5 TEX261 -0.32360 2.2626e-04
6 ARHGAP17 -0.69193 2.7081e-04
Then we simply run the wrapper function run_pathfindR
for active-subnetwork oriented pathway enrichment analysis:
RA_output <- run_pathfindR(RA_input)
By default the gene sets used for enrichment analysis is KEGG pathways. The available gene sets are “KEGG”, “Reactome”, “BioCarta”, “GO-All”, “GO-BP”, “GO-CC”, “GO-MF” and “mmu_KEGG”. The gene sets can be specified using the gene_sets
argument. Detailed information on all the arguments of run_pathfindR
is provided here.
The output of run_pathfindR()
is a data frame containing 8 (or 9) columns:
- ID: ID of the enriched term
- Term_Description: Description of the enriched term
- Fold_Enrichment: Fold enrichment value for the enriched term (Calculated using ONLY the input genes)
- occurrence: the number of iterations that the given term was found to enriched over all iterations
- lowest_p: the lowest adjusted-p value of the given term over all iterations
- highest_p: the highest adjusted-p value of the given term over all iterations
- non_Signif_Snw_Genes (OPTIONAL): the non-significant active subnetwork genes, comma-separated
- Up_regulated: the up-regulated genes (as determined by ‘change value’ > 0, if the ‘change column’ was provided) in the input involved in the given term’s gene set, comma-separated. If change column not provided, all affected are listed here.
- Down_regulated: the down-regulated genes (as determined by ‘change value’ < 0, if the ‘change column’ was provided) in the input involved in the given term’s gene set, comma-separated
By default, the function also plots a bubble chart of enrichment results. The example bubble chart of enrichment results for RA_output
is presented below:

The function also creates an HTML report named results.html
under the directory specified by the argument output_dir
(default = ‘pathfindR_Results’). The report contains a table of the active subnetwork-oriented pathway enrichment results. This table contains the same information as the returned data frame. If KEGG pathways were chosen, each enriched pathway name is linked to the visualization of that KEGG pathway with the gene nodes colored according to log-fold-change values. The colored human KEGG pathway diagrams are created using the input genes. If another gene set is being used, the graph visualization of interactions of pathway genes (in the PIN) are plotted instead.
Enrichment Term Clustering
Enrichment analysis usually yields a large number of related terms. In order to establish representative terms among similar groups of terms, we provide functionality to cluster the enriched terms, as outlined below:

The wrapper function cluster_enriched_terms()
is used for clustering the enriched terms. This function first calculates the pairwise kappa statistics between the terms in the data frame obtained from run_pathfindR()
. It then clusters the terms and partitions them into relevant clusters. As can be seen in the diagram, there are two options for clustering: (1) hierarchical clustering and (2) fuzzy clustering.
Detailed information on the clustering functionality can be found here.
Hierarchical Clustering
By default, cluster_enriched_terms()
performs hierarchical clustering (using 1 - kappa statistic
as the distance metric), determines the optimal number of clusters by maximizing the average silhouette width and returns a data frame with cluster assignments. The representative terms are chosen as the ones with the lowest p value in each cluster.
RA_clustered <- cluster_enriched_terms(RA_output)
head(RA_clustered, 2)
ID Term_Description Fold_Enrichment occurrence lowest_p highest_p Up_regulated
4 hsa04722 Neurotrophin signaling pathway 24.849 4 5.0626e-05 0.00066481
7 hsa04218 Cellular senescence 20.060 2 1.6810e-03 0.00168099
Down_regulated Cluster Status
4 CRKL, FASLG, SH2B3, ABL1, MAGED1, IRAK2, IKBKB, CALM1, CALM3 1 Representative
7 MTOR, ETS1, RBL2, CALM1, CALM3, NFATC3, SLC25A5, VDAC1 1 Member
An enrichment bubble chart grouped by clusters can be plotted via:
enrichment_chart(RA_clustered, plot_by_cluster = TRUE)

Heuristic Fuzzy Multiple-linkage Partitioning
Alternatively, the fuzzy clustering method (as described in Huang DW, Sherman BT, Tan Q, et al. The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8(9):R183.) can be used to obtain fuzzy cluster assignments and representative terms per each cluster:
RA_clustered <- cluster_enriched_terms(RA_output, method = "fuzzy")
Aggregated Term Scores per Sample
The function score_terms()
can be used to calculate the agglomerated z score of each enriched term per sample. This allows the user to individually examine the scores and infer how a term is overall altered (activated or repressed) in a given sample or a group of samples. The function uses:
- The output data frame obtain of
run_pathfindR()
, - The expression matrix used during analysis (differential expression/methylation),
- (Optional for ordering samples) a vector of “case” IDs.
Example usage is presented below:
## Vector of "Case" IDs cases <- c("GSM389703", "GSM389704", "GSM389706", "GSM389708",
"GSM389711", "GSM389714", "GSM389716", "GSM389717",
"GSM389719", "GSM389721", "GSM389722", "GSM389724",
"GSM389726", "GSM389727", "GSM389730", "GSM389731",
"GSM389733", "GSM389735")
## Calculate scores for representative terms
## and plot heat map using term descriptions
score_matrix <- score_terms(enrichment_table = RA_clustered[RA_clustered$Status == "Representative", ],
exp_mat = RA_exp_mat,
cases = cases,
use_description = TRUE, # default FALSE
label_samples = FALSE) # default = TRUE

pathfindR Analysis with Custom Gene Sets
As of the latest version, pathfindR offers utility functions for obtaining organism-specific PIN data and organism-specific gene sets data via
get_pin_file()
andget_gene_sets_list()
, respectively.
It is possible to use run_pathfindR()
with custom gene sets (including gene sets for non-Homo-sapiens species). Here, we provide an example application of active-subnetwork-oriented enrichment analysis of the target genes of two transcription factors.
We first load and prepare the gene sets:
## CREB target genes
CREB_target_genes <- normalizePath(system.file("extdata/CREB.txt", package = "pathfindR"))
CREB_target_genes <- readLines(CREB_target_genes)[-c(1, 2)] # skip the first two lines
## MYC target genes
MYC_target_genes <- normalizePath(system.file("extdata/MYC.txt", package = "pathfindR"))
MYC_target_genes <- readLines(MYC_target_genes)[-c(1, 2)] # skip the first two lines
## Prep for use
custom_genes <- list(TF1 = CREB_target_genes, TF2 = MYC_target_genes)
custom_descriptions <- c(TF1 = "CREB target genes", TF2 = "MYC target genes")
We next prepare the example input data frame. Because of the way we choose genes, we expect significant enrichment for MYC targets (40 MYC target genes + 10 CREB target genes). Because this is only an example, we also assign each genes random p-values between 0.001 and 0.05.
set.seed(123)
## Select 40 random genes from MYC gene sets and 10 from CREB gene sets
selected_genes <- sample(MYC_target_genes, 40)
selected_genes <- c(selected_genes,
sample(CREB_target_genes, 10))
## Assign random p value between 0.001 and 0.05 for each selected gene
rand_p_vals <- sample(seq(0.001, 0.05, length.out = 5),
size = length(selected_genes),
replace = TRUE)
input_df <- data.frame(Gene_symbol = selected_genes,
p_val = rand_p_vals)
Finally, we perform active-subnetwork-oriented enrichment analysis via run_pathfindR() using the custom genes as the gene sets:
custom_result <- run_pathfindR(input_df,
gene_sets = "Custom",
custom_genes = custom_genes,
custom_descriptions = custom_descriptions,
max_gset_size = Inf, # DO NOT LIMIT GENE SET SIZE
iterations = 1, # for demo, setting number of iterations to 1
output_dir = "misc/v1.4/CREB_MYC")
ID Term_Description Fold_Enrichment occurrence lowest_p highest_p
1 TF2 MYC target genes 15.133 1 9.3559e-14 9.3559e-14
2 TF1 CREB target genes 17.346 1 5.0187e-05 5.0187e-05
Up_regulated
1 AGRP, ATP6V1C1, C19orf54, CD3EAP, CNPY3, EPB41, FOXD3, FXR1, GLA, HNRNPD, HOXA7, IL1RAPL1, KDM6A, LONP1, LTBR, MDN1, MICU1, NET1, NEUROD6, NMNAT2, NOL6, NUDC, PEPD, PKN1, PSMB3, RPS19, RPS28, RRS1, SLC9A5, SMC3, STC2, TESK2, TNPO2, TOPORS, TPM2, TSSK3, WBP2, ZBTB8OS, ZFYVE26, ZHX2
2 BRAF, DIO2, ELAVL1, EPB41, FAM65A, FOXD3, NEUROD6, NOC4L, NUPL2, PPP1R15A, SYNGR3, TIPRL
Down_regulated
1
2
Closing remarks
In this tutorial I tried to provide an overview of pathfindR. I try to update this tutorial as certain changes and additions to the package are made. To keep this tutorial as brief as possible, I had to omit certain frequently asked issues (such as pathfindR analysis using non-human data). For such questions and any issues, please visit our website or submit an issue to our issues page on GitHub.
Also check out the vignettes included in the package via browseVignettes("pathfindR")
or find them here:
- Introduction to pathfindR
- Step-by-Step Execution of the pathfindR Enrichment Workflow
- pathfindR Analysis for non-Homo-sapiens organisms
- Visualization of pathfindR Enrichment Results
- Comparing Two pathfindR Results
- Obtaining PIN and Gene Sets Data
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!