R Learning Notes: Data Visualization Skills, Pair Plots, a Handwritten Logistic Regression Model, dplyr's across() Function, and Learning ggplot2 Quickly
Week 49 of 2021.
Here are my R learning notes for this week.
01
R data visualization skills
The purpose of data visualization is discovery and communication.
R is particularly good at data visualization.
Data visualization is one of the core skills of data science work.
I enjoy learning and practicing data visualization in R to keep sharpening this skill.
How would you design and implement the figure below using ggplot2?
Reference code:
# the CPS85 data set comes from the mosaicData package
if(!require("mosaicData")){
install.packages("mosaicData")
library(mosaicData)
}
library(dplyr)
library(ggplot2)
plotdata <- CPS85 %>%
filter(wage < 40)
ggplot(data = plotdata,
mapping = aes(x = exper,
y = wage,
color = sex)) +
geom_point(alpha = .7,
size = 3) +
geom_smooth(method = "lm",
se = FALSE,
size = 1.5) +
scale_x_continuous(breaks = seq(0, 60, 10)) +
scale_y_continuous(breaks = seq(0, 30, 5),
labels = scales::dollar) +
scale_color_manual(values = c("indianred3",
"cornflowerblue")) +
facet_wrap(~sector) +
labs(title = "Relationship between wages and experience",
subtitle = "Current Population Survey",
caption = "source: http://mosaic-web.org/",
x = "Years of Experience",
y = "Hourly Wage",
color = "Gender") +
theme_minimal()
This figure uses the following pieces of ggplot2:
1) the ggplot() function
2) geoms
3) scales
4) facets
5) labs
6) themes
Once we are familiar with these parts of ggplot2, we can use them to design and build a whole range of useful figures that help us gain insight from data and communicate more effectively.
Learning material:
https://rkabacoff.github.io/datavis/IntroGGPLOT.html
02
How to draw a pair plot with the tidyverse?
Pair plots are well suited to exploratory analysis of relationships among multiple variables.
Here we draw a pair plot using the tidyverse.
We use the Palmer penguins dataset.
The metadata of this dataset is described in the figure below.
Reference code for a pair plot of the penguins dataset:
library(palmerpenguins)
library(tidyverse)
# use the penguins dataset
penguins %>% glimpse()
penguins %>%
slice_head(n = 10) %>%
View
# data preparation
df <- penguins %>%
rowid_to_column() %>%
mutate(year=factor(year)) %>%
select(where(is.numeric))
df %>% glimpse()
df %>%
slice_head(n = 10) %>%
View
# pivot to long format, then self-join on rowid to build every pair of variables
df1 <- df %>%
pivot_longer(cols = -rowid) %>%
full_join(., ., by = "rowid")
# attach species back for coloring the points
df2 <- df1 %>%
left_join(penguins %>%
rowid_to_column() %>%
select(rowid, species))
df2 %>%
drop_na() %>%
ggplot(aes(x = value.x, y = value.y, color=species)) +
geom_point(alpha = 0.5) +
facet_wrap(name.x ~ name.y, scales = "free")+
theme(axis.title = element_blank(),
legend.position = "bottom")
Resulting figure: a faceted grid of pairwise scatter plots, colored by species.
Key takeaways:
1) Use dplyr and tidyr for the data preparation.
2) Use ggplot2 for the visualization.
03
Logistic regression model
Logistic regression is a model I use frequently in my day-to-day work.
A handwritten write-up of the logistic regression model (see the figure).
Question to think about:
1) How should we understand the cost function of logistic regression? (See the sketch below.)
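As a minimal sketch (my own illustration, not taken from the handwritten notes): the cost function is the average negative log-likelihood (cross-entropy) of the observed 0/1 labels under the predicted probabilities. The sigmoid() and logistic_cost() helpers below are hypothetical names used only for this example.
# logistic regression cost function: average negative log-likelihood
sigmoid <- function(z) 1 / (1 + exp(-z))
logistic_cost <- function(beta, X, y) {
  p <- sigmoid(X %*% beta)                  # predicted probabilities
  -mean(y * log(p) + (1 - y) * log(1 - p))  # average cross-entropy
}
# quick check against glm() on a toy dataset
set.seed(1)
X <- cbind(1, rnorm(100))                   # intercept + one predictor
y <- rbinom(100, 1, sigmoid(X %*% c(-0.5, 1)))
fit <- glm(y ~ X[, 2], family = binomial())
logistic_cost(coef(fit), X, y)              # cost at the fitted coefficients
Minimizing this cost over the coefficients is exactly maximum likelihood estimation, which is how glm() fits the model.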
04
The across() function in dplyr
dplyr's across() function is powerful and I use it often.
With across() you can apply operations to several columns at once, and you can name the resulting columns using "glue" syntax.
Examples of using across():
# dplyr's across() function
library(readr)
library(dplyr)
ac_items <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/items.csv')
ac_items %>% glimpse()
ac_items %>%
slice_head(n = 100) %>%
View
# summary statistics
# mean of sell_value and buy_value
ac_items %>%
group_by(category) %>%
summarise(sell_value = mean(sell_value, na.rm = TRUE),
buy_value = mean(buy_value, na.rm = TRUE),
.groups = "drop")
# using across()
ac_items %>%
group_by(category) %>%
summarise(across(c(sell_value, buy_value), mean, na.rm = TRUE),
.groups = "drop")
ac_items %>%
group_by(category) %>%
mutate(across(c(sell_value, buy_value), ~ .x / max(.x, na.rm = TRUE),
.names = "{col}_prop")) %>%
select(category, ends_with("prop"))
# renaming the output columns
# using across()'s .names argument
ac_items %>%
group_by(category) %>%
summarise(across(c(sell_value, buy_value), mean, na.rm = TRUE,
.names = "{col}_mean"))
# note: .names uses glue syntax
# applying multiple functions
ac_items %>%
group_by(category) %>%
summarise(across(c(sell_value, buy_value),
list(mean = mean, sd = sd), na.rm = TRUE,
.names = "{col}_{fn}"))
# use contains() to select the variables matching a pattern
ac_items %>%
group_by(category) %>%
summarise(across(contains("value"),
mean, na.rm = TRUE,
.names = "{col}_"))
# use where(is.numeric) to select all numeric variables
ac_items %>%
group_by(category) %>%
summarise(across(where(is.numeric),
mean, na.rm = TRUE,
.names = "{col}_"))
# a custom summary function
summarizer <- function(data, numeric_cols = NULL, ...) {
data %>%
group_by(...) %>%
summarise(across({{numeric_cols}}, list(
mean = ~ mean(.x, na.rm = TRUE),
sd = ~ sd(.x, na.rm = TRUE),
q05 = ~ quantile(.x, 0.05, na.rm = TRUE),
q95 = ~ quantile(.x, 0.95, na.rm = TRUE)
), .names = "{col}_{fn}"))
}
summarizer(ac_items, numeric_cols = c(sell_value, buy_value), category)
The output of the custom summary function is shown in the figure below.
Learning material:
https://willhipson.netlify.app/post/dplyr_across/dplyr_across/
I recommend getting comfortable with this function.
05
How can we learn data visualization with ggplot2 quickly?
Data visualization with ggplot2 involves three core components.
Let's start from data visualization itself:
one part is the data,
one part is the visualization,
and one part is the mapping between the data and the visualization.
These correspond to ggplot2's three basic components:
data
Geoms
mapping = aes()
Once these three basic components have produced a first figure, the rest of the work is sculpting it step by step to fit the actual need, until you are satisfied (a minimal example of the three components is sketched below).
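As a minimal sketch of just these three components, using the murders dataset from dslabs (the same data used in the code that follows):
library(dslabs)
library(ggplot2)
data("murders")
# data + mapping + geom: the three basic components in a single call
ggplot(data = murders, mapping = aes(x = population, y = total)) +
  geom_point()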
We can work through the code below step by step to see how to quickly build a usable figure.
# How to learn plotting with ggplot2 quickly?
library(dslabs)
library(dplyr)
library(ggplot2)
data("murders")
murders %>%
glimpse()
# 1) the ggplot object
p <- ggplot(data = murders)
class(p)
# 2) Geometries + Aesthetic mappings
# draw a scatter plot
murders %>% ggplot() +
geom_point(aes(x = population/10^6, y = total))
# 3) Layers
p + geom_point(aes(x = population/10^6, y = total)) +
geom_text(aes(population/10^6, total, label = abb))
# 4)Global versus local aesthetic mappings
p <- murders %>% ggplot(aes(population/10^6, total, label = abb))
# 5)Scales
p + geom_point(size = 3) +
geom_text(nudge_x = 0.05) +
scale_x_continuous(trans = "log10") +
scale_y_continuous(trans = "log10")
# or equivalently
p + geom_point(size = 3) +
geom_text(nudge_x = 0.05) +
scale_x_log10() +
scale_y_log10()
# 6)Labels and titles
p + geom_point(size = 3) +
geom_text(nudge_x = 0.05) +
scale_x_log10() +
scale_y_log10() +
xlab("Populations in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010")
# 7) Categories as colors
p <- murders %>% ggplot(aes(population/10^6, total, label = abb)) +
geom_text(nudge_x = 0.05) +
scale_x_log10() +
scale_y_log10() +
xlab("Populations in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010")
p + geom_point(size = 3, color ="blue")
p + geom_point(aes(col=region), size = 3)
# 8)Annotation, shapes, and adjustments
r <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^6) %>%
pull(rate)
r
p + geom_point(aes(col=region), size = 3) +
geom_abline(intercept = log10(r))
p <- p + geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
geom_point(aes(col=region), size = 3)
p <- p + scale_color_discrete(name = "Region")
# 9) Themes
ds_theme_set()
library(ggthemes)
p + theme_economist()
# 10) The complete code, put together
library(dslabs)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(ggrepel)
data(murders)
r <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^6) %>%
pull(rate)
murders %>% ggplot(aes(population/10^6, total, label = abb)) +
geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
geom_point(aes(col=region), size = 3) +
geom_text_repel() +
scale_x_log10() +
scale_y_log10() +
xlab("Populations in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010") +
scale_color_discrete(name = "Region") +
theme_economist()
The figure produced by the complete code:
Learning material:
https://rafalab.github.io/dsbook/ggplot2.html
06
Learning and applying the tidymodels package
Episode 1: Getting to know tidymodels — what the package is, how to install and load it, its ecosystem and commonly used functions, plus a simple worked example.
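As a quick reminder, installing and loading tidymodels looks like this (a minimal sketch; loading the meta-package attaches the core packages such as parsnip, recipes, rsample, workflows and yardstick):
# install once, then load the tidymodels meta-package
if(!require("tidymodels")){
install.packages("tidymodels")
library(tidymodels)
}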
Episode 2: Linear regression with tidymodels
1) Derivation of the linear regression model (a small sketch of the closed-form solution follows below)
2) A linear regression case study with tidymodels
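For item 1, a hedged sketch of where the derivation ends up: the least-squares estimate minimizes the residual sum of squares and has the closed form beta_hat = (X'X)^(-1) X'y, which we can check against lm() (the mtcars example below is only for illustration):
# closed-form OLS estimate (normal equations) versus lm()
X <- cbind(1, mtcars$wt, mtcars$hp)         # design matrix with an intercept column
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^(-1) X'y
beta_hat
coef(lm(mpg ~ wt + hp, data = mtcars))      # should match beta_hat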
# 02 linear regression case study
library(tidyverse)
library(tidymodels)
library(vip)
# load the datasets
advertising <- read_rds(url('https://gmudatamining.com/data/advertising.rds'))
home_sales <- read_rds(url('https://gmudatamining.com/data/home_sales.rds')) %>%
select(-selling_date)
advertising %>% glimpse()
home_sales %>% glimpse()
# split the dataset
# into a training set and a test set
set.seed(314)
advertising_split <- initial_split(advertising, prop = 0.75,
strata = Sales)
# training set
advertising_training <- advertising_split %>%
training()
# test set
advertising_test <- advertising_split %>%
testing()
# use the parsnip package for a unified model specification:
# Pick a model type
# Set the engine
# Set the mode (either regression or classification)
lm_model <- linear_reg() %>%
set_engine('lm') %>% # adds lm implementation of linear regression
set_mode('regression')
lm_model
# fit the linear regression model on the training set
# using parsnip's fit() function
# it takes three arguments:
# a parsnip model object specification
# a model formula
# a data frame with the training data
lm_fit <- lm_model %>%
fit(Sales ~ ., data = advertising_training)
lm_fit
# inspect and explore the fitted model
names(lm_fit)
summary(lm_fit$fit)
# diagnostic plots for the fitted regression model
par(mfrow=c(2, 2))
plot(lm_fit$fit, pch = 16, col = "#006EA1")
# tidy-format summaries of the fitted model
# the tidy() and glance() generics come from the broom package
# (attached when tidymodels is loaded)
tidy(lm_fit)          # coefficient estimates as a tibble
glance(lm_fit$fit)    # model-level summary statistics
# variable importance
vip(lm_fit)
# evaluate performance on the test set
# i.e. how well the model generalizes
# parsnip's predict() function takes:
# a trained parsnip model object
# new_data for which to generate predictions
predict(lm_fit, new_data = advertising_test)
advertising_test_results <- predict(lm_fit, new_data = advertising_test) %>%
bind_cols(advertising_test)
advertising_test_results
# compute RMSE and R^2 on the test set
# using yardstick's rmse() and rsq() functions
yardstick::rmse(advertising_test_results,
truth = Sales,
estimate = .pred)
yardstick::rsq(advertising_test_results,
truth = Sales,
estimate = .pred)
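# Supplementary sketch (not part of the original tutorial): what these metrics compute.
# RMSE is sqrt(mean((truth - estimate)^2)); yardstick's rsq() is the squared
# correlation between truth and estimate.
with(advertising_test_results, sqrt(mean((Sales - .pred)^2)))  # should match rmse()
with(advertising_test_results, cor(Sales, .pred)^2)            # should match rsq()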
# visualize the results
# R^2 plot
# ideally the points fall on the line y = x
ggplot(data = advertising_test_results,
mapping = aes(x = .pred, y = Sales)) +
geom_point(color = '#006EA1') +
geom_abline(intercept = 0, slope = 1, color = 'orange') +
labs(title = 'Linear Regression Results - Advertising Test Set',
x = 'Predicted Sales',
y = 'Actual Sales')
# an upgraded version:
# build a machine learning workflow
# Step 1: split our data
set.seed(314)
# Create a split object
advertising_split <- initial_split(advertising, prop = 0.75,
strata = Sales)
# Build training data set
advertising_training <- advertising_split %>%
training()
# Build testing data set
advertising_test <- advertising_split %>%
testing()
# Step 2: feature engineering (a recipe)
advertising_recipe <- recipe(Sales ~ ., data = advertising_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes())
# Step 3: specify a model
lm_model <- linear_reg() %>%
set_engine('lm') %>%
set_mode('regression')
# Step 4: create a workflow
# using the workflows package
# we start with workflow() to create an empty workflow and then add our model and recipe with add_model() and add_recipe()
advertising_workflow <- workflow() %>%
add_model(lm_model) %>%
add_recipe(advertising_recipe)
# Step 5: execute the workflow
advertising_fit <- advertising_workflow %>%
last_fit(split = advertising_split)
# analyze model performance
advertising_fit %>% collect_metrics()
# test set predictions
# Obtain test set predictions data frame
test_results <- advertising_fit %>%
collect_predictions()
# View results
test_results
Partial results are shown in the figure below.
Learning material:
https://www.gmudatamining.com/lesson-10-r-tutorial.html