Learning R: the tidyquery package, R4DS exercise solutions, heatmaps, STAT545, and an EDA template
Week 43 of 2021.
Here are my notes from this week of learning R.
01
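The tidyquery package: running SQL on data frames
In my daily work I use both R and SQL.
I store datasets in R data frames, process them with the dplyr package, and use SQL to pull the datasets I am authorized to access from the database platform.
The workflow is:
SQL query fetches a table --> R data frame --> data processing with dplyr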
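The three stages of this workflow can be sketched end to end. This is only a minimal sketch under stated assumptions: it assumes the DBI and RSQLite packages are installed, it uses an in-memory SQLite database as a stand-in for the real database platform, and the table name `cars` and the query itself are purely illustrative.

```r
library(DBI)
library(dplyr)

# Stand-in for the real database platform: an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars)  # hypothetical table name

# Stages 1 and 2: a SQL query fetches rows, which arrive as an R data frame
cars_df <- dbGetQuery(con, "SELECT mpg, cyl, hp FROM cars WHERE hp > 100")
dbDisconnect(con)

# Stage 3: process the data frame with dplyr
cyl_summary <- cars_df %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n()) %>%
  arrange(desc(mean_mpg))

cyl_summary
```

With a real database only the `dbConnect()` call changes; the hand-off from SQL to dplyr stays the same.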
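The tidyquery package lets you run SQL statements directly on R data frames, which bridges your R skills and your SQL skills.
Its query() function treats R data frames as database tables and runs the SQL you write against them; its show_dplyr() function shows the dplyr code that a SQL statement translates to. In addition, query() can be combined with the pipe operator and with dplyr verbs, which makes it more flexible.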
# tidyquery: run SQL queries on R data frames
# Two functions:
# query() runs a SQL query against R data frames
# show_dplyr() translates a SQL query into the equivalent dplyr code
library(tidyverse)
library(tidyquery)
library(nycflights13)
query(
  " SELECT origin, dest,
      COUNT(flight) AS num_flts,
      round(SUM(seats)) AS num_seats,
      round(AVG(arr_delay)) AS avg_delay
    FROM flights f LEFT OUTER JOIN planes p
      ON f.tailnum = p.tailnum
    WHERE distance BETWEEN 200 AND 300
      AND air_time IS NOT NULL
    GROUP BY origin, dest
    HAVING num_flts > 3000
    ORDER BY num_seats DESC, avg_delay ASC
    LIMIT 2;"
)
# Combine a query with the pipe operator and dplyr verbs
# (the FROM clause is omitted because the data frame is piped in)
planes %>%
  filter(engine == "Turbo-fan") %>%
  query("SELECT manufacturer AS maker, COUNT(*) AS num_planes GROUP BY maker") %>%
  arrange(desc(num_planes)) %>%
  head(5)
# show_dplyr() example
# Show the dplyr equivalent of a SQL statement
show_dplyr(
  " SELECT manufacturer,
      COUNT(*) AS num_planes
    FROM planes
    WHERE engine = 'Turbo-fan'
    GROUP BY manufacturer
    ORDER BY num_planes DESC;"
)
Output (screenshot not reproduced here)
Learning resources:
https://github.com/ianmcook/tidyquery
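For comparison, the aggregation from the show_dplyr() example can also be written by hand in dplyr. The sketch below is a hand-written equivalent, not the literal output of show_dplyr(), whose exact form may differ. To stay self-contained it uses a small made-up data frame (`toy_planes`, a hypothetical name) instead of nycflights13::planes.

```r
library(dplyr)

# Hypothetical stand-in for nycflights13::planes
toy_planes <- tibble(
  manufacturer = c("BOEING", "BOEING", "BOEING", "AIRBUS", "AIRBUS", "EMBRAER", "BOEING"),
  engine = c("Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan",
             "Turbo-fan", "Turbo-jet")
)

num_planes_by_maker <- toy_planes %>%
  filter(engine == "Turbo-fan") %>%       # WHERE engine = 'Turbo-fan'
  group_by(manufacturer) %>%              # GROUP BY manufacturer
  summarise(num_planes = n()) %>%         # COUNT(*) AS num_planes
  arrange(desc(num_planes))               # ORDER BY num_planes DESC

num_planes_by_maker
```

Each SQL clause maps to one dplyr verb, which is the correspondence show_dplyr() makes visible.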
02
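Solutions to the R4DS exercises
I use R for data science tasks every day.
I set up an R4DS study group; it is built around the R4DS book and focuses on doing data science with R.
R4DS is a good book, and anyone who wants to do data science with R should read it. Each chapter ends with a rich set of exercises, many of which require writing R code.
Here is a set of solutions to the R4DS exercises for study and reference.
Read online: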
https://jrnold.github.io/r4ds-exercise-solutions/
GitHub repository:
https://github.com/jrnold/r4ds-exercise-solutions
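You are welcome to join the R4DS study group to discuss and exchange ideas. I also share R4DS-related material in the group on a regular basis. To join, add me on WeChat with the note "Name-R4DS" and I will invite you.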
03
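Heatmaps
The tidyHeatmap package builds information-rich heatmaps following tidy principles. It uses ComplexHeatmap as its plotting engine.
The main tidyHeatmap functions used in the examples below are heatmap(), the annotation functions add_tile(), add_point(), add_bar() and add_line(), the splitting functions split_rows() and split_columns(), and save_pdf().
Example: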
# Drawing heatmaps
# Install the tidyHeatmap package
install.packages("BiocManager")
BiocManager::install("ComplexHeatmap")
install.packages("tidyHeatmap")
# Using the tidyHeatmap package: an example
# 1 Load the data and put it in tidy form
library(tidyHeatmap)
library(tidyverse)
mtcars_tidy <-
  mtcars %>%
  as_tibble(rownames = "Car name") %>%
  # Scale every column except the car name, hp and vs
  mutate_at(vars(-`Car name`, -hp, -vs), scale) %>%
  # Reshape to long (tidy) format
  pivot_longer(cols = -c(`Car name`, hp, vs), names_to = "Property", values_to = "Value")
mtcars_tidy
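# 2 Draw the heatmap
mtcars_heatmap <-
  mtcars_tidy %>%
  heatmap(`Car name`, Property, Value) %>%
  add_tile(hp)
mtcars_heatmap
# 3 Save the figure (create the output directory first if it does not exist)
dir.create("./figs", showWarnings = FALSE)
mtcars_heatmap %>% save_pdf("./figs/mtcars_heatmap.pdf")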
# 4 Grouped heatmap
mtcars_tidy %>%
  group_by(vs) %>%
  heatmap(`Car name`, Property, Value) %>%
  add_tile(hp)
# Set the colors used for the groups
mtcars_tidy %>%
  group_by(vs) %>%
  heatmap(
    `Car name`, Property, Value,
    palette_grouping = list(c("#66C2A5", "#FC8D62"))
  ) %>%
  add_tile(hp)
# Split rows and columns into 2 groups each, based on the clustering dendrograms
mtcars_tidy %>%
  heatmap(`Car name`, Property, Value) %>%
  split_rows(2) %>%
  split_columns(2)
# Cluster rows and columns with k-means
mtcars_tidy %>%
  heatmap(
    `Car name`, Property, Value,
    row_km = 2,
    column_km = 2
  )
# Custom palettes
# Colors can be given by name or as hex codes
mtcars_tidy %>%
  heatmap(
    `Car name`,
    Property,
    Value,
    palette_value = c("red", "white", "blue")
  )
# A palette with explicit breakpoints, built with circlize::colorRamp2()
mtcars_tidy %>%
  heatmap(
    `Car name`,
    Property,
    Value,
    palette_value = circlize::colorRamp2(
      seq(-2, 2, length.out = 11),
      RColorBrewer::brewer.pal(11, "RdBu")
    )
  )
# The same idea with a viridis palette
mtcars_tidy %>%
  heatmap(
    `Car name`,
    Property,
    Value,
    palette_value = circlize::colorRamp2(c(-2, -1, 0, 1, 2), viridis::magma(5))
  )
# Multiple groupings and annotations
tidyHeatmap::pasilla %>%
  group_by(location, type) %>%
  heatmap(
    .column = sample,
    .row = symbol,
    .value = `count normalised adjusted`
  ) %>%
  add_tile(condition) %>%
  add_tile(activation)
# Add simulated per-sample variables (size, age) to show other annotation types
pasilla_plus <-
  tidyHeatmap::pasilla %>%
  dplyr::mutate(act = activation) %>%
  tidyr::nest(data = -sample) %>%
  dplyr::mutate(size = rnorm(n(), 4, 0.5)) %>%
  dplyr::mutate(age = runif(n(), 50, 200)) %>%
  tidyr::unnest(data)
# Inspect the data in the interactive viewer
pasilla_plus %>% View
# Annotate with tiles, points, bars and lines
pasilla_plus %>%
  heatmap(
    .column = sample,
    .row = symbol,
    .value = `count normalised adjusted`
  ) %>%
  add_tile(condition) %>%
  add_point(activation) %>%
  add_tile(act) %>%
  add_bar(size) %>%
  add_line(age)
# Keep a custom row order: order the factor and turn off row clustering
library(forcats)
mtcars_tidy %>%
  mutate(`Car name` = fct_reorder(`Car name`, `Car name`, .desc = TRUE)) %>%
  heatmap(
    `Car name`, Property, Value,
    cluster_rows = FALSE
  )
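Some of the resulting figures (not reproduced here)
Applying this to your own work:
Building on the heatmap code above, first reshape your own data into the format the functions expect, then pick a suitable style of heatmap.
Learning resources: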
https://github.com/stemangiola/tidyHeatmap
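As a rough illustration of the reshaping step mentioned in the note on applying this to your own data, the sketch below turns a wide table into the long, tidy layout (one row per row-column pair) that heatmap() takes. The data frame `wide` and its column names are made up for the example, and scaling each column is one possible preprocessing choice, not a requirement.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per sample, one column per measured feature
wide <- tibble(
  sample = c("s1", "s2", "s3"),
  gene_a = c(1.2, 3.4, 2.2),
  gene_b = c(0.5, 0.7, 0.1),
  gene_c = c(10, 12, 9)
)

tidy <- wide %>%
  # Center and scale each feature so the colors are comparable across features
  mutate(across(-sample, ~ as.numeric(scale(.x)))) %>%
  # One row per (sample, feature) pair
  pivot_longer(cols = -sample, names_to = "feature", values_to = "value")

tidy

# With tidyHeatmap installed, the tidy table can then be plotted as:
# tidyHeatmap::heatmap(tidy, sample, feature, value)
```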
04
Online course: STAT545
Course URL:
https://stat545.com/index.html
Course description: STAT 545 covers data wrangling, exploration, and analysis with R.
05
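Advanced data wrangling in R: a workshop
Data wrangling is part of every data task.
Mastering data wrangling in R makes you more productive.
Learning objectives:
- Learn how to reshape and manipulate data
- Learn how to summarise data with the tidyverse packages
Preparation:
Install these R packages in advance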
# p_load loads and, if necessary, install missing packages.
# install.packages() + library() = p_load()
# If you just want to install, then use p_install()
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
tidyverse, # for the tidyverse framework
palmerpenguins,
gapminder,
kableExtra,
flextable,
modelr,
nycflights13
)
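To give a feel for the two learning objectives above, here is a small sketch that summarises a dataset and then reshapes the summary. It is not taken from the workshop materials; it uses the built-in mtcars data so that it runs with only dplyr and tidyr, and the object names are illustrative.

```r
library(dplyr)
library(tidyr)

# Summarise: mean fuel economy and group size per cylinder count and transmission
summary_long <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mean_mpg = mean(mpg), n = n(), .groups = "drop")

# Reshape: one row per cylinder count, one column per transmission type
summary_wide <- summary_long %>%
  select(-n) %>%
  pivot_wider(names_from = am, values_from = mean_mpg, names_prefix = "am_")

summary_long
summary_wide
```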
All of the workshop materials are available in the group announcement of the R4DS study group.
06
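An EDA template in R
EDA is also something I do in my data work every day.
For how to go about EDA, you can refer to this template for doing EDA in R; it is an Rmd template.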
TITLE by YOUR_NAME_HERE
========================================================
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages that you end up using
# in your analysis in this code chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk.
# This prevents the code from displaying in the knitted HTML output.
# You should set echo=FALSE for all code chunks in your file.
library(ggplot2)
```
```{r echo=FALSE, Load_the_Data}
# Load the Data
```
# Univariate Plots Section
```{r echo=FALSE, Univariate_Plots}
```
# Univariate Analysis
### What is the structure of your dataset?
### What is/are the main feature(s) of interest in your dataset?
### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
### Did you create any new variables from existing variables in the dataset?
### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
# Bivariate Plots Section
```{r echo=FALSE, Bivariate_Plots}
```
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
### What was the strongest relationship you found?
# Multivariate Plots Section
```{r echo=FALSE, Multivariate_Plots}
```
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
### Were there any interesting or surprising interactions between features?
### OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
------
# Final Plots and Summary
### Plot One
```{r echo=FALSE, Plot_One}
```
### Description One
### Plot Two
```{r echo=FALSE, Plot_Two}
```
### Description Two
### Plot Three
```{r echo=FALSE, Plot_Three}
```
### Description Three
------
# Reflection
Source:
https://github.com/nicolasfguillaume/Exploratory-Data-Analysis-with-R/blob/master/EDA%20template%20in%20R.Rmd
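As a hint of what might go into the template's Univariate_Plots chunk, here is a minimal sketch. It is not part of the original template; the mtcars dataset, the choice of the mpg variable and the bin width are assumptions made for illustration.

```r
library(ggplot2)

# Structure and summary statistics, useful for the "structure of your dataset" question
str(mtcars)
summary(mtcars$mpg)

# Distribution of a single variable of interest
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of fuel economy",
    x = "Miles per gallon",
    y = "Number of cars"
  )

p_hist
```

The bivariate and multivariate chunks would follow the same pattern, adding a second variable to `aes()` or a grouping such as `facet_wrap()`.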