R语言学习:一门在线课程,玫瑰图,数据可视化指南,R语言编码风格指南, tidyverse研讨会
2021年第42周。
这一周R语言学习,记录如下。
01
一门面向市场营销专业学生的R课程
一门面向市场营销专业学生的R课程,学习网址:
https://bookdown.org/content/1340/
课程内容结构
-
教程介绍
-
R语言简介
-
基础数据分析
-
主成分分析
-
聚类分析与分群
-
联合分析
02
玫瑰图
玫瑰与数学的故事。
利用数据可视化和数学方程,绘制玫瑰图,赠送给需要的人。
学习和娱乐,融为一体。
源代码1
library(tidyverse)
seq(-3,3,by=.01) %>%
expand.grid(x=., y=.) %>%
ggplot(aes(x=(1-x-sin(y^2)), y=(1+y-cos(x^2)))) +
geom_point(alpha=.05, shape=20, size=0)+
theme_void()+
coord_polar()
结果图1
源代码2
library(tidyverse)
library(ggfx)
seq(-3, 3, by = .02) %>%
expand.grid(x = ., y = .) %>%
ggplot(aes(x = (1 - x - sin(y ^ 2)), y = (1 + y - cos(x ^ 2)))) +
ggfx::with_outer_glow(
geom_point(
alpha = .05,
shape = 20,
size = 0,
color = "white"
),
colour = "red",
expand = 10,
sigma = 3
) +
theme_void() +
coord_polar() +
labs(
caption = paste0(
"@lenkiefer modified from @aschinchon #rose tweet,",
"nhttps://twitter.com/aschinchon/status/1405136386034970630"
)
)
结果图2
学习资料:
http://lenkiefer.com/2021/06/16/rose-chart-in-r/
03
数据可视化指南
不管是数据可视化新手,还是老手,这5条数据可视化指南都是很棒的。
-
Show the Data
-
Reduce the Clutter
-
Integrate Graphics and Text
-
Small Multiples
-
Start with Gray
每条指南详细的介绍,请阅读下面网址。
http://lenkiefer.com/2021/02/26/data-visualization-guidelines-and-a-case-study/
04
R语言编码风格指南
R大神Hadley Wickham所说:
好的编码风格就像使用正确的标点符号。你可以不用它,但它确实让内容更容易阅读。
好的R语言编码,可以让R代码更专业,更可读,更易于分享和传播,也更能价值最大化。
关于R语言编码风格,我们可以关注这些方面。
1 命名
包括文件、变量、函数和类名。
2 语法
包括行长度、间距、大括号、缩进格式、换行操作、注释
3 其它建议
上述方面的具体操作,请阅读下面网址。
https://irudnyts.github.io//r-coding-style-guide/
参考资料:
1 http://adv-r.had.co.nz/Style.html
2 https://google.github.io/styleguide/Rguide.html
05
用tidyverse做数据科学
这是一个基于R for Data Science这本书的为期两天的研讨会。你将学习如何在R中可视化、转换和建模数据,以及如何使用日期-时间、字符串和无序数据格式。在此过程中,您将学习和使用来自tidyverse的许多包,包括ggplot2、dplyr、tidyr、readr、purrr、tibble、stringr、lubridate和forcats。
这个研讨会的材料,包括PDF讲义、Rmd文件等,它是利用RStudio工具做项目管理。这份材料,可以在R4DS学习交流群的群公告查看获取链接。
你可以扫描下方二维码,备注R4DS,添加我的微信,我要请你进入R4DS学习交流群。
为了顺利学习和操作研讨会里面的内容,你需要提前安装好这些R包。
# tidyverse包做数据科学任务
# 提前安装好所需R包
install.packages(
c(
"babynames",
"fivethirtyeight",
"formatR",
"gapminder",
"hexbin",
"mgcv",
"maps",
"mapproj",
"nycflights13",
"rmarkdown",
"skimr",
"tidyverse",
"viridis"
)
)
06
R做因子分析
因子分析的具体目的用于揭示潜在的变量。
请思考:
因子分析和主成分分析的差异?
源代码
# R中的因子分析
## R包
library(tidyverse)
library(psych)
# 数据导入
consumers <- read_csv("./data/customers_quan.csv") %>%
select(starts_with("p"))
dim(consumers)
consumers %>% View
## 数据清洗
sdevs <- apply(consumers, 1, sd, na.rm = TRUE)
incomplete <- apply(consumers, 1, function(i) any(is.na(i)))
consumers <- consumers[sdevs != 0 & !incomplete, ]
dim(consumers)
## 探索性分析
consumers %>%
rownames_to_column(var = "Subject") %>%
gather(Item, Response, -Subject) %>%
ggplot(aes(Item, Response)) + geom_boxplot(fill = "#f7941d") +
theme_bw(base_size = 10) +
ggtitle("personal Involvement Index",
subtitle = paste("Tap Water Consumers USA and Australia (n =",
nrow(consumers), ")"))
ggsave("./figs/involvement-explore.png", width = 6, height = 4)
##png("involvement-correlation.png")
# 相关性分析
corPlot(consumers)
##dev.off()
## 因子分析
piiFac <- fa(consumers, nfactors = 2, rotate = "varimax")
##png("involvement-factors.png")
fa.diagram(piiFac)
##dev.off()
结果图
思考题:
因子分析如何确定旋转因子?正交旋转和斜旋转的差异?
学习资料:
https://lucidmanager.org/data-science/measuring-consumer-involvement/
07
可重复性代码构建指南
创建项目工程,做项目管理
项目的层级架构,参考下图:
各个文件夹和文件的用途
请注意
1 永远不要修改原始数据,或者说,一定要备份好原始数据
2 对于任何项目,创建一个文件,记录你的所思和所做,便于复盘和迭代
3 脚本的命名,请知名晓意,赋予含义,具有条理性和逻辑性,重视代码的可读性,代码是让电脑来运行的,更重要的是,让人来看的。
4 对于一个复杂的项目,编写代码之前,先写伪代码或者画流程图
08
get函数,把字符串转换为变量名
最近在一个项目中,自定义了一个函数,其中参数传递为字符串的时候,需要把这个参数的字符串转换为一个变量名。
R语言的get函数可以实现这个功能。
09
ggplot2包设计和生成优美的数据可视化图形
来源一份PPT学习资料,作者分享了如何使用ggplot2包设计和生成优美的数据可视化图形。我把这个PPT上面的代码做了测试,可以看到令人心动的图形。
这份代码值得学习和迁移。
直接审查代码吧。
# R包
library(tidyverse)
# 上代码
diff_df <- readr::read_csv("./data/diff_df.csv")
combo_pass <- readr::read_csv("./data/combo_pass.csv")
glimpse(diff_df)
# 1) 简单图形
basic_plot <- ggplot(diff_df, aes(x = differential, y = win_percent)) +
geom_point()
basic_plot
# 使用内置的主题
basic_plot +
theme_minimal()
basic_plot +
theme_bw()
# 使用其它包提供的主题
# ggthemes包
basic_plot +
ggthemes::theme_fivethirtyeight()
basic_plot +
ggthemes::theme_economist()
# 自己设计和定制主题
basic_plot +
theme(
panel.grid.major = element_line(color = "red"),
axis.text.x = element_text(size = 20, color = "red"),
plot.background = element_rect(fill = "lightblue"),
panel.background = element_rect(fill = "blue")
)
# 模仿学习
# 收集好点子,加以所用
# 提供一些素材源头
# https://fivethirtyeight.com/features/the-56-best-and-weirdest-charts-we-made-in-2019/
# https://www.nytimes.com/2020/06/10/learning/over-60-new-york-times-graphs-for-students-to-analyze.html
# https://www.storytellingwithdata.com/blog/2018/6/19/june-swdchallenge-recap-slopegraphs
# https://www.notion.so/spcanelon/Data-Viz-Bookmarks-dc01718020bd4fd6a8a4ca80e6bce933
# 自定义主题
theme_538 <- function(base_size = 12, base_family = "Chivo") {
theme_grey(base_size = base_size, base_family = base_family) %+replace%
theme(
# drop minor gridlines and axis-ticks
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
# change font elements/family
text = element_text(family = "Chivo", size = base_size),
axis.text = element_text(face = "bold", color = "grey", size = base_size),
axis.title = element_text(face = "bold", size = rel(1.33)),
axis.title.x = element_text(margin = margin(0.5, 0, 0, 0, unit = "cm")),
axis.title.y = element_text(margin = margin(0, 0.5, 0, 0, unit = "cm"), angle =90),
plot.title = element_text(face = "bold", size = rel(1.67), hjust = 0),
plot.title.position = "plot",
plot.subtitle = element_text(size = 16, margin = margin(0.2, 0, 1, 0, unit = "cm"), hjust = 0),
plot.caption = element_text(size = 10, margin = margin(1, 0, 0, 0, unit = "cm"), hjust = 1),
# change plot colors for the background/panel elements
plot.background = element_rect(fill = "#f0f0f0", color = NA),
panel.background = element_rect(fill = "#f0f0f0", color = NA),
panel.grid.major = element_line(color = "#d0d0d0"),
panel.border = element_blank(),
# shrinks margin and simplify the strip/facet elements
plot.margin = margin(0.5, 1, 0.5, 1, unit = "cm"),
strip.background = element_blank(),
strip.text = element_text(size = rel(1.33), face = "bold")
)
}
# espnscrapeR包的安装方法
# remotes::install_github("jthomasmock/espnscrapeR")
nfl_stand <- 2014:2020 %>%
map_dfr(espnscrapeR::get_nfl_standings)
glimpse(nfl_stand)
nfl_stand_plot <- nfl_stand %>%
ggplot(aes(x = pts_diff, y = win_pct)) +
geom_point() +
geom_smooth(method = "lm")
nfl_stand_plot
nfl_stand_prep <- nfl_stand %>%
mutate(
color = case_when(
season < 2020 & seed <= 6 ~ "blue",
season == 2020 & seed <= 7 ~ "blue",
TRUE ~ "red"
)
)
nfl_stand_prep
# 颜色区分
nfl_stand_prep %>%
ggplot(aes(x = pts_diff, y = win_pct)) +
geom_vline(xintercept = 0, size = 0.75, color = "#737373") +
geom_point(aes(color = I(color)))
# 添加文本信息
nfl_stand_prep %>%
ggplot(aes(x = pts_diff, y = win_pct)) +
geom_vline(xintercept = 0, size = 0.75, color = "#737373") +
geom_point(
aes(color = I(color)),
size = 3, alpha = 0.8
) +
labs(x = "Points Differential", y = "Win Percent",
title = "Playoff teams typically have a positive point differential",
subtitle = "Data through week 17 of the 2020 NFL Season",
caption = "Plot: @thomas_mock | Data: ESPN")
# 精细化标注信息
library(ggtext)
# create a tiny dataset
playoff_label_scatter <- tibble(
differential = c(25,-70), y = c(0.3, 0.8),
label = c("Missed<br>Playoffs", "Made<br>Playoffs"),
color = c("#D50A0A", "#013369")
)
nfl_stand_refine <- nfl_stand %>%
mutate(
color = case_when(
season < 2020 & seed <= 6 ~ "#013369",
season == 2020 & seed <= 7 ~ "#013369",
TRUE ~ "#D50A0A"
)
)
playoff_diff_plot <- nfl_stand_refine %>%
ggplot(aes(x = pts_diff, y = win_pct)) +
geom_vline(xintercept = 0, size = 0.75, color = "#737373") +
geom_hline(yintercept = 0, size = 0.75, color = "#737373") +
geom_point(
aes(color = I(color)),
size = 3, alpha = 0.8
) +
ggtext::geom_richtext(
data = playoff_label_scatter,
aes(x = differential, y = y, label = label, color = I(color)),
fill = "#f0f0f0", label.color = NA, # remove background and outline
label.padding = grid::unit(rep(0, 4), "pt"), # remove padding
family = "Chivo", hjust = 0.1, fontface = "bold",
size = 8
) +
labs(x = "Points Differential", y = "Win Percent",
title = "Playoff teams typically have a positive point differential",
subtitle = "Data through week 17 of the 2020 NFL Season",
caption = str_to_upper("Plot: @thomas_mock | Data: ESPN")) +
scale_y_continuous(
labels = scales::percent_format(accuracy = 1),
breaks = seq(.0, 1, by = .20)
) +
scale_x_continuous(
breaks = seq(-200, 250, by = 50)
) +
theme_538()
playoff_diff_plot
# 相同数据,不同故事
library(ggridges)
stand_density <- nfl_stand %>%
mutate(
color = case_when(
season < 2020 & seed <= 6 ~ "#013369",
season == 2020 & seed <= 7 ~ "#013369",
TRUE ~ "#D50A0A"
)
) %>%
ggplot(aes(x = pts_diff, y = factor(season), color = I(color), fill = I(color))) +
geom_vline(xintercept = 0.5, size = 0.75, color = "#737373") +
geom_density_ridges(alpha = 0.8, scale = 0.9) +
theme_538()
stand_density
# create a small dataset for the custom annotations
playoff_label_ridge <- tibble(
y = c(7.55, 7.55), differential = c(-250,175),
label = c("Missed<br>Playoffs", "Made<br>Playoffs"),
color = c("#D50A0A", "#013369")
)
stand_density +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
coord_cartesian(xlim = c(-250, 250)) +
ggtext::geom_richtext(
data = playoff_label_ridge,
aes(x = differential, y = y, label = label, color = color),
fill = "#f0f0f0", label.color = NA, # remove background and outline
label.padding = grid::unit(rep(0, 4), "pt"), # remove padding
family = "Chivo", hjust = 0 , fontface = "bold",
size = 6
) +
theme_538() +
theme(panel.grid.major.y = element_blank()) +
labs(
x = "Point Differential", y = "",
title = "Playoff teams typically have a positive point differential",
subtitle = "Data through week 15 of the 2020 NFL Season",
caption = "Plot: @thomas_mock | Data: ESPN"
)
stand_df <- nfl_stand %>%
filter(season == 2020)
stand_df %>%
filter(seed <= 12 & season == 2020) %>%
ggplot(aes(x = tidytext::reorder_within(team_abb, seed, conf), y = pts_diff)) +
geom_col() +
tidytext::scale_x_reordered() +
facet_grid(~conf, scales = "free_x") +
geom_hline(yintercept = 0, size = 0.75, color = "#737373") +
theme_538()
# Small label dataset
playoff_label <- tibble(
seed = c(9, 2),
pts_diff = c(30, 145),
conf = c("AFC", "AFC"),
label = c("Outside<br>looking in", "Playoff<br>teams"),
color = c("#D50A0A", "#013369")
)
stand_df %>%
filter(seed <= 12) %>%
ggplot(aes(x = as.factor(seed), y = pts_diff)) +
geom_col(
aes(fill = if_else(seed <= 7, "#013369", "#D50A0A")),
width = 0.8
) +
ggtext::geom_richtext(
data = playoff_label,
aes(label = label, color = I(color)),
fill = "#f0f0f0", label.color = NA,
# remove background and outline
label.padding = grid::unit(rep(0, 4), "pt"),
# remove padding
family = "Chivo", hjust = 0.1, fontface = "bold", size = 6
) +
geom_hline(yintercept = 0, size = 0.75, color = "#737373") +
geom_vline(xintercept = 7.5, size = 1, color = "grey") +
geom_vline(xintercept = 0.5, size = 0.75, color = "#737373") +
facet_grid(~conf, scales = "free_x") +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
scale_fill_identity(aesthetics = c("fill", "color")) +
theme_538() +
theme(panel.grid.major.x = element_blank()) +
labs(
x = "Playoff Seed",
y = "Points Differential",
title = "Playoff teams typically have a positive point differential",
subtitle = "Data through week 15 of the 2020 NFL Season",
caption = "Plot: @thomas_mock | Data: ESPN"
)
# 重新创造
library(rvest)
raw_url <- "https://www.pro-football-reference.com/years/2020/opp.htm"
raw_html <- read_html(raw_url)
raw_table <- raw_html %>%
html_table(fill = TRUE) %>%
.[[2]] %>%
janitor::clean_names() %>%
tibble()
pressure_df <- raw_table %>%
select(tm, blitz_pct = bltz_percent, press_pct = prss_percent) %>%
mutate(across(c(blitz_pct, press_pct), parse_number))
pass_def_raw <- raw_html %>%
html_node("#all_passing") %>%
html_nodes(xpath = "comment()") %>%
html_text() %>%
read_html() %>%
html_node("table") %>%
html_table() %>%
janitor::clean_names() %>%
tibble()
pass_def_df <- pass_def_raw %>%
select(tm, pass_att = att, int, pass_def = pd, sack = sk, ypa = y_a, anypa = any_a)
pass_def_df %>% View
combo_pass <- left_join(
pressure_df, pass_def_df,
by = "tm"
)
combo_pass %>%
glimpse()
# 快速可视化
combo_pass %>%
ggplot(aes(x = blitz_pct, y = press_pct)) +
geom_point() +
labs(
x = "Blitz Rate", y = "Pressure Rate",
title = "The Colts are pressuring QBs without much of a blitz",
subtitle = "Blitz rate vs. pressure rate for each NFL defense, through Week 17nof the 2020 season"
) +
theme_538()
# 添加颜色和文本信息
colt_df <- combo_pass %>%
mutate(
color = if_else(tm == "Indianapolis Colts", "#359fda", "#91c390"),
fill = colorspace::lighten(color, amount = 0.3)
) %>%
rowwise() %>%
mutate(
att_def = sum(int, pass_def, sack),
cov_rate = att_def/pass_att*100
) %>%
ungroup() %>%
arrange(desc(cov_rate))
label_df_cov <- tibble(
label = c("Colts", "Everyone else"),
color = c("#359fda", "#91c390"),
fill = colorspace::lighten(color, amount = 0.3),
x = c(16, 33),
y = c(25, 28)
)
colt_df %>%
ggplot(aes(x = blitz_pct, y = cov_rate, color = color, fill = fill)) +
geom_point(size = 5, pch = 21) +
scale_color_identity(aesthetics = c("fill", "color")) +
labs(
x = "Blitz Rate",
y = "Pass Affected Rate",
title = "The Colts affect passes at an elite rate while blitzing the least",
subtitle = "Blitz rate vs. pressure rate for each NFL defense, through Week 17nof the 2020 season",
caption = "Plot: @thomas_mock | Source: PFR"
) +
scale_x_continuous(limits = c(10, 45), breaks = seq(10, 45, by = 5)) +
scale_y_continuous(limits = c(10, 35), breaks = seq(10, 35, by = 5)) +
coord_cartesian(clip = "off") +
annotate("text", x = 10, y = 10, label = "Pass affected rate = (ints + sacks + passes defended)/pass attempts",
vjust = 10, hjust = 0.2, color = "darkgrey") +
theme_538()
colt_df %>%
ggplot(aes(x = blitz_pct, y = cov_rate, color = color, fill = fill)) +
geom_point(size = 5, pch = 21) +
scale_color_identity(aesthetics = c("fill", "color")) +
labs(
x = "Blitz Rate",
y = "Pass Affected Rate",
title = "The Colts affect passes at an elite rate while blitzing the least",
subtitle = "Blitz rate vs. pressure rate for each NFL defense, through Week 17nof the 2020 season",
caption = "Plot: @thomas_mock | Source: PFR"
) +
scale_x_continuous(limits = c(10, 45), breaks = seq(10, 45, by = 5)) +
scale_y_continuous(limits = c(10, 35), breaks = seq(10, 35, by = 5)) +
coord_cartesian(clip = "off") +
annotate("text", x = 10, y = 10, label = "Pass affected rate = (ints + sacks + passes defended)/pass attempts",
vjust = 10, hjust = 0.2, color = "darkgrey") +
theme_538() +
geom_label(
data = label_df_cov,
aes(x = x, y = y, color = color, label = label),
fill = "#f0f0f0",
size = 6,
fontface = "bold",
hjust = 0.8,
label.size = NA
)
部分结果图:
学习资料
https://themockup.blog/static/slides/intro-plot.html#1
https://jthomasmock.github.io/espnscrapeR/
我创建了R语言群,添加我的微信,备注:姓名-入群,我邀请你进群,一起学习R语言。
如果你觉得文章内容有用,请关注下方公众号~
如果你想找数据工作,请关注下方公众号~
R语言学习专辑:
觉得本文不错,就顺手帮我转发到朋友圈和微信群哦,谢谢。
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!