PCA应用于虚构的人物个性
专题介绍:R是一种广泛用于数据分析和统计计算的强大语言,于上世纪90年代开始发展起来。得益于全世界众多 爱好者的无尽努力,大家继而开发出了一种基于R但优于R基本文本编辑器的R Studio(用户的界面体验更好)。也正是由于全世界越来越多的数据科学社区和用户对R包的慷慨贡献,让R语言在全球范围内越来越流行。其中一些R包,例如MASS,SparkR, ggplot2,使数据操作,可视化和计算功能越来越强大。R是用于统计分析、绘图的语言和操作环境。R是属于GNU系统的一个自由、免费、源代码开放的软件,它是一个用于统计计算和统计制图的优秀工具。R作为一种统计分析软件,是集统计分析与图形显示于一体的。它可以运行于UNIX、Windows和Macintosh的操作系统上,而且嵌入了一个非常方便实用的帮助系统,相比于其他统计分析软件,R的学术性开发比较早,适合生物学和医学等学术学科的科研人员使用。
【R语言】已开通R语言社群,五湖四海,天南地北,各行各业,有缘相聚,共享R事,雕刻数据,求解问题,以创价值。喜乐入群者,请加微信号luqin360,或扫描文末二维码,添加为好友,同时附上R-入群。有朋自远方来,不亦乐乎,并诚邀入群,以达相互学习和进步之美好心愿。
R语言出品
题目:Applying PCA to fictional character personalities
来源:
https://www.alexcookson.com/post/2020-11-19-applying-pca-to-fictional-character-personalities/
编译:R学习者
在本文,我们将主成分分析(PCA)应用于虚构人物个性的数据集。
PCA是一种常用的降维技术。例如,若你需要把分类模型和含有很多变量的数据集放在一块,可能要做降维操作。
我们正在使用的数据集是众包评分的800个虚构人物的个性特征,这些任务来自书籍、电影和电视节目,例如《权利的游戏》、《傲慢与偏见》和《狮子王》。这些数据来自开源的心理测量项目-在线“那个人物”的个性测验。你可以在这里了解更多信息。
准备工作
1# 准备工作
2library(tidyverse) # For data manipulation
3library(tidymodels) # For PCA
4library(tidytext) # For reorder_within()
5library(extrafont) # For custom fonts on graphs
6library(glue) # For creating strings from multiple pieces
7library(ggtext) # For fine-grained text formatting
8library(plotly) # For interactive graphs
9
10# These are to make tables look nice on the website
11# You probably won't need them
12library(knitr)
13library(kableExtra)
14
15theme_set(theme_void())
16
17personalities <- read_tsv("https://raw.githubusercontent.com/tacookson/data/master/fictional-character-personalities/personalities.txt")
数据检视
我们来看一个示例人物:《星际迷航:下一代》中的Jean-Luc Picard船长。
1personalities %>%
2 filter(character_name == "Jean-Luc Picard") %>%
3 head(8) %>%
4 arrange(mean) %>%
5 kable(format = "html") %>%
6 kable_styling()
我们感兴趣的主要字段是:spectrum_low,spectrum_high和mean:
-
频谱字段告诉我们所考虑频谱两端的特质是什么
-
平均值是分数(从-50到+50),其中分数接近-50表示人物更像Spectrum_low特质,而分数接近+50意味着人物更像Spectrum_high特质
如果我们看一下Picard船长最强的一些特质-即接近-50或+50-我们会发现他很有魅力和知识分子(低分对应于Spectrum_low),以及大胆而有品位(高分对应于Spectrum_high)。我当然会用这些话来形容我们亲爱的船长。
其他大多数字段是要确定角色是谁以及他们来自哪个虚构世界。对于该人物-频谱组合,我们还具有评分和标准偏差(sd)的数据。最后,我们有一个辅助列is_emoji,它告诉我们频谱两端的特质是表情符号还是单词。
为了确保能检查出这些数据,因为我真的很好奇-让我们看看“傲慢与偏见”和“狮子王”中的其他示例。我们将看看每个角色的八个最强特质。我在括号内包含了频谱另一端的特质,以供参考。
傲慢与偏见
1pride_palette <- c("#19313b", "#3b636b", "#bbb6a2",
2 "#ffc277", "#bc652f", "#472411")
3
4pride_example <- personalities %>%
5 filter(!is_emoji,
6 fictional_work == "Pride and Prejudice",
7 character_name %in% c("Elizabeth Bennet", "Jane Bennet", "Lydia Bennet",
8 "Mr. Darcy", "Mr. William Collins", "George Wickham")) %>%
9 mutate(character_name = ifelse(character_name == "Mr. William Collins", "Mr. Collins", character_name)) %>%
10 group_by(character_name) %>%
11 top_n(8, abs(mean)) %>%
12 ungroup() %>%
13 mutate(spectrum_label = ifelse(mean > 0,
14 glue("{spectrum_high}<br><span style='font-size:8pt'>({spectrum_low})</span>"),
15 glue("{spectrum_low}<br><span style='font-size:8pt'>({spectrum_high})</span>")),
16 mean = abs(mean),
17 spectrum_label = reorder_within(spectrum_label, mean, character_name))
18
19
20pride_example %>%
21 ggplot() +
22 geom_col(aes(mean, spectrum_label, fill = character_name), show.legend = FALSE) +
23 geom_vline(xintercept = seq(0, 50, by = 10), colour = "#FFF1E6", size = 0.1) +
24 scale_y_reordered() +
25 scale_fill_manual(values = pride_palette) +
26 expand_limits(x = c(0, 50)) +
27 facet_wrap(~ character_name, scales = "free_y") +
28 labs(title = "Pride and Prejudice",
29 caption = "Visualization: @alexcookson | Data: Open Source Psychometrics Project") +
30 theme(plot.title = element_text(family = "JaneAusten", face = "bold", size = 30,
31 hjust = 0.5, margin = margin(t = 10, b = 20)),
32 plot.background = element_rect(fill = "#FFF1E6", colour = NA),
33 plot.margin = margin(10, 10, 10, 10),
34 plot.caption = element_text(family = "Sylfaen", size = 8, colour = "grey50", margin = margin(t = 10)),
35 strip.text = element_text(family = "JaneAusten", size = 12, margin = margin(b = 10)),
36 axis.text.x = element_text(family = "Sylfaen", size = 10, hjust = 0.5),
37 axis.text.y = element_markdown(family = "Sylfaen", size = 11, hjust = 1))
某些特质相当轻率,例如伊丽莎白·本内特(Elizabeth Bennet)的主要特质是“宝藏”,而在另一端则带有“垃圾”。请记住,这些数据来自在线测验,而不是严格的科学研究,这很有趣!
其他人物的特质很明显,自私的乔治·威克汉姆(George Wickham),善良的简(Jane),愚蠢的莉迪亚(Lydia),笨拙的柯林斯先生(Collins)以及富有而又保留的达西先生(Darcy)。
狮子王
1lion_king_palette <- c("#FFBA08", "#FCA311", "#E85D04",
2 "#DC2F02", "#791901", "#38160D")
3
4lion_king_example <- personalities %>%
5 filter(!is_emoji,
6 fictional_work == "The Lion King") %>%
7 group_by(character_name) %>%
8 top_n(8, abs(mean)) %>%
9 ungroup() %>%
10 mutate(spectrum_label = ifelse(mean > 0,
11 glue("{spectrum_high}<br><span style='font-size:8pt'>({spectrum_low})</span>"),
12 glue("{spectrum_low}<br><span style='font-size:8pt'>({spectrum_high})</span>")),
13 mean = abs(mean),
14 spectrum_label = reorder_within(spectrum_label, mean, character_name))
15
16lion_king_example %>%
17 ggplot() +
18 geom_col(aes(mean, spectrum_label, fill = character_name), show.legend = FALSE) +
19 geom_vline(xintercept = seq(0, 50, by = 10), colour = "#FFFBEB", size = 0.1) +
20 scale_y_reordered() +
21 scale_fill_manual(values = lion_king_palette) +
22 expand_limits(x = c(0, 50)) +
23 facet_wrap(~ character_name, scales = "free_y") +
24 labs(title = "The Lion King",
25 caption = "Visualization: @alexcookson | Data: Open Source Psychometrics Project") +
26 theme(plot.title = element_text(family = "African", size = 30, hjust = 0.5, margin = margin(t = 10, b = 20)),
27 plot.background = element_rect(fill = "#FFFBEB", colour = NA),
28 plot.margin = margin(10, 10, 10, 10),
29 plot.caption = element_text(family = "IBM Plex Sans", size = 8, colour = "grey50", margin = margin(t = 10)),
30 strip.text = element_text(family = "African", size = 12, margin = margin(b = 10)),
31 axis.text.x = element_text(family = "IBM Plex Sans", size = 10, hjust = 0.5),
32 axis.text.y = element_markdown(family = "IBM Plex Sans", size = 11, hjust = 1))
Scar对于反派人物具有很大的特质(包括鼻子上的反面人物!)。Timon薄,Pumbaa厚。而且Nala的主要特质是美丽,我对动画狮子感到有些奇怪。(实际上,它暗示了媒体中如何描绘和感知不同性别的角色,我计划在以后的文章中对此进行更多分析。)
主成分分析(PCA)
现在我们对数据有了一定了解,让我们开始进行主成分分析。已经有一些不错的资源,例如Julia Silge的使用嘻哈歌曲的PCA以了解如何应用它,tidymodels文档以查看tidymodels如何实现PCA,以及StatQuest对主成分分析的逐步指导,以理解一些基本的数学知识。
辅助数据集
在开始之前,我们先创建一些辅助数据集,这些数据集将帮助我们逐步发展。为什么?我们将仅使用代码字段进行PCA,以识别人物和频谱。完成PCA后,我们将要检查结果,看到人物“ HP/2”或频谱“BAP209”不是很有帮助。取而代之的是,我们希望看到“Harry Potter”或“city-slicker to country-bumpkin”。
1character_list <- personalities %>%
2 distinct(character_code, fictional_work, character_name)
3
4spectrum_list <- personalities %>%
5 distinct(spectrum, spectrum_low, spectrum_high)
使用tidymodels的PCA
tidymodels使PCA非常简单–在这种情况下,大约需要10行代码就能正确设置数据并运行PCA。
首先,让我们将数据转换为比数据框更接近矩阵的格式。我们希望每一行对应一个字符,每一列对应一个频谱。我们还将在character_code上添加一个特殊的列,以便我们一旦完成就知道谁是谁!
这就是我们要输入到整洁模型中的数据:
1pca_data <- personalities %>%
2 filter(!is_emoji) %>%
3 select(character_code, spectrum, mean) %>%
4 pivot_wider(names_from = spectrum, values_from = mean)
5
6pca_data %>%
7 head(6) %>%
8 select(1:5) %>%
9 kable(format = "html") %>%
10 kable_styling()
其次,我们将创建一个配方,将character_code指定为id列,将分数标准化(这在PCA中非常重要),然后运行PCA本身。
1# Write recipe for PCA
2pca_recipe <- recipe(~ ., data = pca_data) %>%
3 # Specify character_code as key/id column
4 update_role(character_code, new_role = "id") %>%
5 # Normalize sets mean to zero and standard deviation of one
6 step_normalize(all_predictors()) %>%
7 # PCA is done here
8 # Use threshold to specify we want to capture 90% of variance in the data
9 step_pca(all_predictors(),
10 threshold = 0.9)
11
12# prep() implements the recipe
13# bake() applies the PCA model to a dataset
14personality_pca <- prep(pca_recipe) %>%
15 bake(new_data = pca_data)
PCA已完成。(真!)
什么是主要成分?
您可以将每个主成分(PC)看作为捆绑了许多频谱,这些频谱将相似的信息告诉我们一个变量。考虑一下达西先生的主要特质:
1pride_example %>%
2 filter(character_name == "Mr. Darcy") %>%
3 mutate(spectrum_label = str_replace(spectrum_label, "8pt", "10pt"),
4 spectrum_label = fct_reorder(spectrum_label, mean)) %>%
5 ggplot() +
6 geom_col(aes(mean, spectrum_label),
7 fill = "#472411", show.legend = FALSE) +
8 geom_vline(xintercept = seq(0, 50, by = 10), colour = "#FFF1E6", size = 0.1) +
9 expand_limits(x = c(0, 50)) +
10 scale_y_reordered() +
11 labs(title = "Mr. Darcy") +
12 theme(plot.title = element_text(family = "JaneAusten", face = "bold", size = 30, hjust = 0.5),
13 plot.background = element_rect(fill = "#FFF1E6", colour = NA),
14 plot.margin = margin(10, 10, 10, 10),
15 axis.text.x = element_text(family = "Sylfaen", size = 12, hjust = 0.5),
16 axis.text.y = element_markdown(family = "Sylfaen", size = 16, hjust = 1))
他是保守,私密,守卫,隐居,机密和性格内向的人。所有这些都是同一件事。如果我们已经知道达西先生是保守的,如果我们也知道达西先生是私密的,我们是否能从他身上学到更多东西?不是吗?这两个特质往往并存(如果您很好奇,它们的相关系数为0.909)。
直观地讲,所有保守型特质都在一起。因此,让我们将它们捆绑在一起成为一个新的变量。我们为什么不称呼它为主要成分?如果我们知道达西先生在这个将保守、私密等捆绑在一起的新的主要部分上的分数是多少,即使我们不知道他的分数,我们仍然会很清楚自己是什么类型的人。这就是主成分分析所做的,只是在更大的范围内,用数学而不是直觉,就像我们对达西先生所做的那样。
最主要的成分告诉我们什么?
我们可以看到每个主要成分的组成,它告诉我们哪些特质已捆绑在一起。以下是排名前六的主成分:
1# Custom colour palette
2top_pc_palette <- c("#264653", "#2A9D8F", "#8AB17D",
3 "#E9C46A", "#F4A261", "#E76F51")
4
5# Inspect a few principal components to see what they capture
6prep(pca_recipe) %>%
7 # Inspect the recipe after it has completed step 2
8 tidy(type = "coef", number = 2) %>%
9 # Add interpretable names of spectrums
10 left_join(spectrum_list, by = c("terms" = "spectrum")) %>%
11 # Take the end of the spectrum that the principal component value is closest to
12 mutate(label = ifelse(value > 0,
13 glue("{spectrum_high} <span style='font-size:7pt'>({spectrum_low})</span>"),
14 glue("{spectrum_low} <span style='font-size:7pt'>({spectrum_high})</span>")),
15 abs_value = abs(value)) %>%
16 # Look at the first six components
17 filter(component %in% paste0("PC", 1:6)) %>%
18 # Look at the 8 strongest elements of each component
19 group_by(component) %>%
20 top_n(8, abs_value) %>%
21 ungroup() %>%
22 # Reorder spectrum labels so that they will be graphed from highest to lowest
23 # reorder_within() is useful for when you want things ordered in a facetted graph
24 mutate(text_label = label,
25 label = reorder_within(label, by = abs_value, within = component)) %>%
26 ggplot() +
27 geom_col(aes(abs_value, label, fill = component),
28 show.legend = FALSE) +
29 geom_richtext(aes(x = 0, y = label, label = text_label),
30 hjust = 0, nudge_x = 0.002, family = "IBM Plex Sans",
31 colour = "white", size = 3.7, fontface = "bold",
32 fill = NA, label.colour = NA) +
33 scale_x_continuous(breaks = seq(0, 0.2, by = 0.05),
34 expand = c(0, 0)) +
35 scale_y_reordered() +
36 scale_fill_manual(values = top_pc_palette) +
37 facet_wrap(~ component, scales = "free_y", ncol = 2) +
38 labs(title = "What are the top principal components?",
39 subtitle = paste("Each component is shown with the eight traits that contribute most to that component.",
40 "The opposing trait is shown in parentheses for context.",
41 sep = "n")) +
42 theme(text = element_text(family = "IBM Plex Sans"),
43 plot.title = element_text(size = 24, margin = margin(t = 10)),
44 plot.subtitle = element_text(size = 10, margin = margin(t = 10, b = 20)),
45 plot.margin = margin(0, 20, 20, 20),
46 panel.spacing.y = unit(20, "points"),
47 strip.text = element_text(size = 12, hjust = 0, margin = margin(0, 5, 5, 5)),
48 axis.text.x = element_text(size = 9, hjust = 0.5))
这些都是显然的! 乔佛里·巴拉森(Joffrey Baratheon)滑稽可笑,精神错乱,古斯·弗林(Gus Fring)坚强和披甲的,莱斯利·诺普(Leslie Knope)绝对是痴迷和匆忙。 从疲倦的泥泞的柯林斯先生到精致,修剪整齐的Tahani Al-Jamil,PC两端的人物也是很明显的。
我们还可以组合各个成分,以了解它们之间的关系。(在更高级的术语中,我们采用的是双变量视图,而不是单变量视图)。例如,如果将PC01(错乱与合理)相对于PC04(粗糙与精致)作图,我们可以看到字符在两个分量上的位置 ,从而使他们的性格更加细致入微。右侧(或左侧)更远的人物更错乱(或合理),上侧(或下册)更远的任务更粗糙(或更精致)。
1p <- character_list %>%
2 left_join(personality_pca, by = "character_code") %>%
3 mutate(Character = character_name,
4 `Fictional Work` = fictional_work,
5 colour_group = case_when(PC01 > 0 & PC04 > 0 ~ "deranged_rugged",
6 PC01 > 0 ~ "deranged_refined",
7 PC04 < 0 ~ "reasonable_refined",
8 TRUE ~ "reasonable_rugged"),
9 alpha_factor = sqrt(PC01 ^ 2 + PC04 ^ 2)) %>%
10 ggplot(aes(PC01, PC04, label = Character, label2 = `Fictional Work`)) +
11 geom_hline(yintercept = 0, lty = 2, alpha = 0.8, colour = "grey50") +
12 geom_vline(xintercept = 0, lty = 2, alpha = 0.8, colour = "grey50") +
13 geom_point(aes(colour = colour_group, alpha = alpha_factor)) +
14 annotate("text", x = 14, y = 14, label = "Derangedn+nRugged",
15 family = "IBM Plex Sans", size = 7, colour = "#cc0024") +
16 annotate("text", x = 14, y = -14, label = "Derangedn+nRefined",
17 family = "IBM Plex Sans", size = 7, colour = "#4b264d") +
18 annotate("text", x = -14, y = -14, label = "Reasonablen+nRefined",
19 family = "IBM Plex Sans", size = 7, colour = "#244F26") +
20 annotate("text", x = -14, y = 14, label = "Reasonablen+nRugged",
21 family = "IBM Plex Sans", size = 7, colour = "#016eae") +
22 expand_limits(x = c(-21, 21),
23 y = c(-16, 16)) +
24 scale_colour_manual(values = c("#4b264d", "#cc0024", "#244F26", "#016eae")) +
25 labs(title = "") +
26 theme(legend.position = "none",
27 text = element_text(family = "IBM Plex Sans"),
28 plot.title = element_text(margin = margin(t = 10, b = 30)))
29
30ggplotly(p, tooltip = c("label", "label2"))
采用这种观点还强调了单个成分的作用只能到此为止。“精神错乱”(PC01)的任务不止一种。添加另一个成分,在本例中是“粗犷与精炼”(PC04),我们可以看到deranged有不同的口味。例如,Jayne Cobb(《萤火虫》)和Lucille Bluth(《发展受阻》)在PC01上的得分大致相同,但在PC04上的得分却大不相同。我同意他们都是精神错乱,但方式完全不同。增加第三个(或第四个!或者第五个!)这样的成分会给我们带来更多的细微差别。(但这里我就不画了,因为要把超过两个连续维的空间视觉化就很棘手了)
结论
通过进行PCA并探索结果,我们已经学到了很多有关此数据集的知识。但是像PCA这样的降维技术通常是其他事情的先导,例如开发分类模型。我还认为该数据集非常丰富且有趣,您可以通过其他与PCA无关的方式对其进行探索。
如果您想进一步分析,可以在我的GitHub中找到有趣的数据集。需要研究的领域包括:
-
性别如何影响人物的画像和感知方式?
-
人格类型的集群是什么?不同小说作品中的哪些人物最相似?
-
某些人物在某些类型中是否更为突出?例如,情景喜剧是否倾向于具有其他类型的人物没有的个性?
附录:完整代码
1# 准备工作
2library(tidyverse) # For data manipulation
3library(tidymodels) # For PCA
4library(tidytext) # For reorder_within()
5library(extrafont) # For custom fonts on graphs
6library(glue) # For creating strings from multiple pieces
7library(ggtext) # For fine-grained text formatting
8library(plotly) # For interactive graphs
9
10# These are to make tables look nice on the website
11# You probably won't need them
12library(knitr)
13library(kableExtra)
14
15theme_set(theme_void())
16
17personalities <- read_tsv("https://raw.githubusercontent.com/tacookson/data/master/fictional-character-personalities/personalities.txt")
18
19
20personalities <- read_tsv("./datasets/personalities.txt")
21
22# 2 数据检视
23personalities %>%
24 filter(character_name == "Jean-Luc Picard") %>%
25 head(8) %>%
26 arrange(mean) %>%
27 kable(format = "html") %>%
28 kable_styling()
29
30# 3 傲慢与偏见
31pride_palette <- c("#19313b", "#3b636b", "#bbb6a2",
32 "#ffc277", "#bc652f", "#472411")
33
34pride_example <- personalities %>%
35 filter(!is_emoji,
36 fictional_work == "Pride and Prejudice",
37 character_name %in% c("Elizabeth Bennet", "Jane Bennet", "Lydia Bennet",
38 "Mr. Darcy", "Mr. William Collins", "George Wickham")) %>%
39 mutate(character_name = ifelse(character_name == "Mr. William Collins", "Mr. Collins", character_name)) %>%
40 group_by(character_name) %>%
41 top_n(8, abs(mean)) %>%
42 ungroup() %>%
43 mutate(spectrum_label = ifelse(mean > 0,
44 glue("{spectrum_high}<br><span style='font-size:8pt'>({spectrum_low})</span>"),
45 glue("{spectrum_low}<br><span style='font-size:8pt'>({spectrum_high})</span>")),
46 mean = abs(mean),
47 spectrum_label = reorder_within(spectrum_label, mean, character_name))
48
49
50pride_example %>%
51 ggplot() +
52 geom_col(aes(mean, spectrum_label, fill = character_name), show.legend = FALSE) +
53 geom_vline(xintercept = seq(0, 50, by = 10), colour = "#FFF1E6", size = 0.1) +
54 scale_y_reordered() +
55 scale_fill_manual(values = pride_palette) +
56 expand_limits(x = c(0, 50)) +
57 facet_wrap(~ character_name, scales = "free_y") +
58 labs(title = "Pride and Prejudice",
59 caption = "Visualization: @alexcookson | Data: Open Source Psychometrics Project") +
60 theme(plot.title = element_text(family = "JaneAusten", face = "bold", size = 30,
61 hjust = 0.5, margin = margin(t = 10, b = 20)),
62 plot.background = element_rect(fill = "#FFF1E6", colour = NA),
63 plot.margin = margin(10, 10, 10, 10),
64 plot.caption = element_text(family = "Sylfaen", size = 8, colour = "grey50", margin = margin(t = 10)),
65 strip.text = element_text(family = "JaneAusten", size = 12, margin = margin(b = 10)),
66 axis.text.x = element_text(family = "Sylfaen", size = 10, hjust = 0.5),
67 axis.text.y = element_markdown(family = "Sylfaen", size = 11, hjust = 1))
68
69# 狮子王
70lion_king_palette <- c("#FFBA08", "#FCA311", "#E85D04",
71 "#DC2F02", "#791901", "#38160D")
72
73lion_king_example <- personalities %>%
74 filter(!is_emoji,
75 fictional_work == "The Lion King") %>%
76 group_by(character_name) %>%
77 top_n(8, abs(mean)) %>%
78 ungroup() %>%
79 mutate(spectrum_label = ifelse(mean > 0,
80 glue("{spectrum_high}<br><span style='font-size:8pt'>({spectrum_low})</span>"),
81 glue("{spectrum_low}<br><span style='font-size:8pt'>({spectrum_high})</span>")),
82 mean = abs(mean),
83 spectrum_label = reorder_within(spectrum_label, mean, character_name))
84
85lion_king_example %>%
86 ggplot() +
87 geom_col(aes(mean, spectrum_label, fill = character_name), show.legend = FALSE) +
88 geom_vline(xintercept = seq(0, 50, by = 10), colour = "#FFFBEB", size = 0.1) +
89 scale_y_reordered() +
90 scale_fill_manual(values = lion_king_palette) +
91 expand_limits(x = c(0, 50)) +
92 facet_wrap(~ character_name, scales = "free_y") +
93 labs(title = "The Lion King",
94 caption = "Visualization: @alexcookson | Data: Open Source Psychometrics Project") +
95 theme(plot.title = element_text(family = "African", size = 30, hjust = 0.5, margin = margin(t = 10, b = 20)),
96 plot.background = element_rect(fill = "#FFFBEB", colour = NA),
97 plot.margin = margin(10, 10, 10, 10),
98 plot.caption = element_text(family = "IBM Plex Sans", size = 8, colour = "grey50", margin = margin(t = 10)),
99 strip.text = element_text(family = "African", size = 12, margin = margin(b = 10)),
100 axis.text.x = element_text(family = "IBM Plex Sans", size = 10, hjust = 0.5),
101 axis.text.y = element_markdown(family = "IBM Plex Sans", size = 11, hjust = 1))
102
103
104
105# 辅助数据集
106character_list <- personalities %>%
107 distinct(character_code, fictional_work, character_name)
108
109spectrum_list <- personalities %>%
110 distinct(spectrum, spectrum_low, spectrum_high)
111
112# 使用tidymodels库的PCA算法
113pca_data <- personalities %>%
114 filter(!is_emoji) %>%
115 select(character_code, spectrum, mean) %>%
116 pivot_wider(names_from = spectrum, values_from = mean)
117
118pca_data %>%
119 head(6) %>%
120 select(1:5) %>%
121 kable(format = "html") %>%
122 kable_styling()
123
124# 运行PCA算法
125# Write recipe for PCA
126pca_recipe <- recipe(~ ., data = pca_data) %>%
127 # Specify character_code as key/id column
128 update_role(character_code, new_role = "id") %>%
129 # Normalize sets mean to zero and standard deviation of one
130 step_normalize(all_predictors()) %>%
131 # PCA is done here
132 # Use threshold to specify we want to capture 90% of variance in the data
133 step_pca(all_predictors(),
134 threshold = 0.9)
135
136# prep() implements the recipe
137# bake() applies the PCA model to a dataset
138personality_pca <- prep(pca_recipe) %>%
139 bake(new_data = pca_data)
140
141# 主成分分析
142pride_example %>%
143 filter(character_name == "Mr. Darcy") %>%
144 mutate(spectrum_label = str_replace(spectrum_label, "8pt", "10pt"),
145 spectrum_label = fct_reorder(spectrum_label, mean)) %>%
146 ggplot() +
147 geom_col(aes(mean, spectrum_label),
148 fill = "#472411", show.legend = FALSE) +
149 geom_vline(xintercept = seq(0, 50, by = 10), colour = "#FFF1E6", size = 0.1) +
150 expand_limits(x = c(0, 50)) +
151 scale_y_reordered() +
152 labs(title = "Mr. Darcy") +
153 theme(plot.title = element_text(family = "JaneAusten", face = "bold", size = 30, hjust = 0.5),
154 plot.background = element_rect(fill = "#FFF1E6", colour = NA),
155 plot.margin = margin(10, 10, 10, 10),
156 axis.text.x = element_text(family = "Sylfaen", size = 12, hjust = 0.5),
157 axis.text.y = element_markdown(family = "Sylfaen", size = 16, hjust = 1))
158
159# 排名前六的主成分的组成分析
160# Custom colour palette
161top_pc_palette <- c("#264653", "#2A9D8F", "#8AB17D",
162 "#E9C46A", "#F4A261", "#E76F51")
163
164# Inspect a few principal components to see what they capture
165prep(pca_recipe) %>%
166 # Inspect the recipe after it has completed step 2
167 tidy(type = "coef", number = 2) %>%
168 # Add interpretable names of spectrums
169 left_join(spectrum_list, by = c("terms" = "spectrum")) %>%
170 # Take the end of the spectrum that the principal component value is closest to
171 mutate(label = ifelse(value > 0,
172 glue("{spectrum_high} <span style='font-size:7pt'>({spectrum_low})</span>"),
173 glue("{spectrum_low} <span style='font-size:7pt'>({spectrum_high})</span>")),
174 abs_value = abs(value)) %>%
175 # Look at the first six components
176 filter(component %in% paste0("PC", 1:6)) %>%
177 # Look at the 8 strongest elements of each component
178 group_by(component) %>%
179 top_n(8, abs_value) %>%
180 ungroup() %>%
181 # Reorder spectrum labels so that they will be graphed from highest to lowest
182 # reorder_within() is useful for when you want things ordered in a facetted graph
183 mutate(text_label = label,
184 label = reorder_within(label, by = abs_value, within = component)) %>%
185 ggplot() +
186 geom_col(aes(abs_value, label, fill = component),
187 show.legend = FALSE) +
188 geom_richtext(aes(x = 0, y = label, label = text_label),
189 hjust = 0, nudge_x = 0.002, family = "IBM Plex Sans",
190 colour = "white", size = 3.7, fontface = "bold",
191 fill = NA, label.colour = NA) +
192 scale_x_continuous(breaks = seq(0, 0.2, by = 0.05),
193 expand = c(0, 0)) +
194 scale_y_reordered() +
195 scale_fill_manual(values = top_pc_palette) +
196 facet_wrap(~ component, scales = "free_y", ncol = 2) +
197 labs(title = "What are the top principal components?",
198 subtitle = paste("Each component is shown with the eight traits that contribute most to that component.",
199 "The opposing trait is shown in parentheses for context.",
200 sep = "n")) +
201 theme(text = element_text(family = "IBM Plex Sans"),
202 plot.title = element_text(size = 24, margin = margin(t = 10)),
203 plot.subtitle = element_text(size = 10, margin = margin(t = 10, b = 20)),
204 plot.margin = margin(0, 20, 20, 20),
205 panel.spacing.y = unit(20, "points"),
206 strip.text = element_text(size = 12, hjust = 0, margin = margin(0, 5, 5, 5)),
207 axis.text.x = element_text(size = 9, hjust = 0.5))
208
209# 主成分变量组合分析和解读
210p <- character_list %>%
211 left_join(personality_pca, by = "character_code") %>%
212 mutate(Character = character_name,
213 `Fictional Work` = fictional_work,
214 colour_group = case_when(PC01 > 0 & PC04 > 0 ~ "deranged_rugged",
215 PC01 > 0 ~ "deranged_refined",
216 PC04 < 0 ~ "reasonable_refined",
217 TRUE ~ "reasonable_rugged"),
218 alpha_factor = sqrt(PC01 ^ 2 + PC04 ^ 2)) %>%
219 ggplot(aes(PC01, PC04, label = Character, label2 = `Fictional Work`)) +
220 geom_hline(yintercept = 0, lty = 2, alpha = 0.8, colour = "grey50") +
221 geom_vline(xintercept = 0, lty = 2, alpha = 0.8, colour = "grey50") +
222 geom_point(aes(colour = colour_group, alpha = alpha_factor)) +
223 annotate("text", x = 14, y = 14, label = "Derangedn+nRugged",
224 family = "IBM Plex Sans", size = 7, colour = "#cc0024") +
225 annotate("text", x = 14, y = -14, label = "Derangedn+nRefined",
226 family = "IBM Plex Sans", size = 7, colour = "#4b264d") +
227 annotate("text", x = -14, y = -14, label = "Reasonablen+nRefined",
228 family = "IBM Plex Sans", size = 7, colour = "#244F26") +
229 annotate("text", x = -14, y = 14, label = "Reasonablen+nRugged",
230 family = "IBM Plex Sans", size = 7, colour = "#016eae") +
231 expand_limits(x = c(-21, 21),
232 y = c(-16, 16)) +
233 scale_colour_manual(values = c("#4b264d", "#cc0024", "#244F26", "#016eae")) +
234 labs(title = "") +
235 theme(legend.position = "none",
236 text = element_text(family = "IBM Plex Sans"),
237 plot.title = element_text(margin = margin(t = 10, b = 30)))
238
239ggplotly(p, tooltip = c("label", "label2"))
你在阅读过程中,遇到什么问题,或者有什么心得与收获,可以扫描我的微信号,备注“R-入群”。我会邀请你加入R语言群,和大家一起讨论与学习。
推荐阅读:
1 R语言机器学习3本经典书籍集合本,提高你的R语言和机器学习能力!(可供下载)
2 R语言实战英文书籍,配套源代码,帮助你学习R语言!(可下载)
推荐公众号:数据科学与人工智能
数据科学与人工智能公众号推广Python语言,数据科学与人工智能的知识和信息。扫码下方二维码关注我,一起学习Python语言和数据科学与人工智能。
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!