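Learning R: Creating an R Package with devtools and usethis, EDA, Data Manipulation with dplyr, and Venn Diagrams
Week 47 of 2021.
Here are my notes from this week's R study.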
01
Creating Your Own R Package with devtools and usethis
1 Why create an R package?
2 How do you write an R function?
The aim is to write a robust function.
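As one possible illustration of what "robust" can mean here, the sketch below validates its input, fails early with a clear message, and handles edge cases (empty input, constant vectors, missing values). The function name `rescale01` and its behaviour are assumptions chosen for this example and do not come from the original notes.

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1].
rescale01 <- function(x) {
  # Fail early, with a clear message, on input the function cannot handle.
  if (!is.numeric(x)) {
    stop("`x` must be a numeric vector, not ", class(x)[1], ".", call. = FALSE)
  }
  # Nothing to rescale in an empty vector.
  if (length(x) == 0) {
    return(numeric(0))
  }
  # finite = TRUE ignores NA, NaN, and infinite values when finding the range.
  rng <- range(x, finite = TRUE)
  # A constant vector has zero range; return zeros instead of dividing by zero.
  if (rng[1] == rng[2]) {
    return(ifelse(is.na(x), NA_real_, 0))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(2, 4, 6, NA))
```

Calling `rescale01("a")` stops with an error that names the offending argument, which is usually easier to debug than a failure deep inside the arithmetic.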
3 tidyeval in practice
Use the {{ }} (curly-curly) operator.
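A minimal sketch of {{ }} in a user-defined function, assuming dplyr is installed. The helper `group_mean()` and its argument names are hypothetical; the built-in `mtcars` data set is used only so that the example runs on its own.

```r
library(dplyr)

# Hypothetical helper: mean of one column within the groups of another.
# {{ }} forwards the unquoted column names to dplyr's data-masking verbs.
group_mean <- function(data, group_var, value_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(mean_value = mean({{ value_var }}, na.rm = TRUE), .groups = "drop")
}

# Column names are passed bare, exactly as in an ordinary dplyr call.
group_mean(mtcars, cyl, mpg)
```

Without {{ }}, `group_by(group_var)` would look for a column literally named `group_var` and fail.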
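4 What is a package?
5 A simple way to create your own R package
The devtools and usethis packages make it efficient to create your own R package.
The steps are as follows:
Step 1: In RStudio, create a new project and choose the R package option.
Step 2: Run usethis::use_git() to put the project under Git version control.
Step 3: Delete the files that the project template generated automatically: NAMESPACE, R/hello.R, and man/hello.Rd.
Step 4: Run usethis::use_r() to create the files for the functions the package will provide.
Step 5: Write the functions and document them.
Place the cursor inside a function and press Ctrl+Shift+Alt+R to insert a roxygen comment skeleton, then edit the skeleton.
Step 6: Run devtools::document() to generate the help files (man/*.Rd) and the NAMESPACE file, devtools::install() to install the package, and devtools::check() to check it; then fix whatever the check reports.
Step 7: Push the locally created package to GitHub.
First, create a repository on GitHub with the same name as the local package.
Then run the following commands: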
git add .
git commit -m "add your comments"
git remote add origin <SSH URL of the repository>
git push -u origin master
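To make Steps 4 to 6 more concrete, here is a sketch of what a documented function file might look like after the roxygen skeleton has been filled in. The function `celsius_to_fahrenheit()` is a hypothetical example, not part of the original notes; in a real package the file would typically be created with `usethis::use_r("celsius_to_fahrenheit")` and live under `R/`.

```r
#' Convert Celsius to Fahrenheit
#'
#' @param celsius A numeric vector of temperatures in degrees Celsius.
#'
#' @return A numeric vector of temperatures in degrees Fahrenheit.
#' @export
#'
#' @examples
#' celsius_to_fahrenheit(c(0, 100))
celsius_to_fahrenheit <- function(celsius) {
  # Reject non-numeric input before doing any arithmetic.
  stopifnot(is.numeric(celsius))
  celsius * 9 / 5 + 32
}
```

Running `devtools::document()` turns the `#'` comments into a help page under `man/` and, because of the `@export` tag, adds the function to NAMESPACE.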
Learning resources:
https://www.youtube.com/watch?v=EpTkT6Rkgbs
02
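EDA (Exploratory Data Analysis)
Studying Chapter 5 (EDA) of the R4DS book
EDA is an important stage of data analysis that helps us understand the data better. It relies on visualization and numerical summaries, and it is iterative: each pass refines both the questions and the answers.
Study code for the EDA chapter of R4DS.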
##################
# EDA: exploratory data analysis
# R4DS, Chapter 5
##################
# EDA is an iterative process
# EDA uses visualization and numerical summaries
# EDA starts and ends with questions, explored through visualization, transformation, and modelling
# Data cleaning is one application of EDA
# Setup
library(tidyverse)
# 1 Variation ------------------------------------------------------
# 1.1 Visualizing distributions
# How to visualize a distribution depends on whether the variable is continuous or categorical.
# Categorical variables: factors or character vectors
# A bar chart shows the frequency distribution
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
diamonds %>%
  dplyr::count(cut)
# Continuous variables
# Numbers and date-times
# Use a histogram for the distribution of a continuous variable
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
# Count the observations in each bin
diamonds %>%
  count(cut_width(carat, 0.5))
smaller <- diamonds %>%
  filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)
ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
  geom_freqpoly(binwidth = 0.1)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)
ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_histogram(binwidth = 0.25)
# Unusual values
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)
# Zoom in on the y axis so that rare values become visible
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
# Flag the unusual values
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  arrange(y)
unusual %>%
  View()
# Missing values
# Option 1: drop the rows that contain unusual values
diamonds2 <- diamonds %>%
  filter(between(y, 3, 20))
glimpse(diamonds2)
# Option 2: replace the unusual values with NA
diamonds2 <- diamonds %>%
  mutate(
    y = ifelse(y < 3 | y > 20, NA, y)
  )
glimpse(diamonds2)
# ggplot2 warns that rows with missing values were removed
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point()
# na.rm = TRUE suppresses that warning
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)
# Compare scheduled departure times of cancelled and non-cancelled flights
nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot(mapping = aes(sched_dep_time)) +
  geom_freqpoly(
    mapping = aes(color = cancelled),
    binwidth = 1/4
  )
# 2 Covariation -----------------------------------------------------------
# A categorical and a continuous variable
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
ggplot(diamonds) +
  geom_bar(mapping = aes(x = cut))
# Plot density instead of count so that groups of different sizes are comparable
# after_stat(density) replaces the older ..density.. notation (ggplot2 >= 3.3.0)
ggplot(
  data = diamonds,
  mapping = aes(x = price, y = after_stat(density))
) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
# Reorder the classes by their median hwy
ggplot(data = mpg) +
  geom_boxplot(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    )
  )
# Flip the axes so that long class names stay readable
ggplot(data = mpg) +
  geom_boxplot(
    mapping = aes(
      x = reorder(class, hwy, FUN = median),
      y = hwy
    )
  ) +
  coord_flip()
# Two categorical variables
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))
diamonds %>%
  count(color, cut)
# As a heat map
diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))
# Two continuous variables
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
ggplot(data = diamonds) +
  geom_point(
    mapping = aes(x = carat, y = price),
    alpha = 1 / 100
  )
ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))
# geom_hex() requires the hexbin package
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))
# Bin one continuous variable so it behaves like a categorical one
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting))
library(modelr)
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
  add_residuals(mod) %>%
  mutate(resid = exp(resid))
# Residual analysis
ggplot(data = diamonds2) +
  geom_point(mapping = aes(x = carat, y = resid))
ggplot(data = diamonds2) +
  geom_boxplot(mapping = aes(x = cut, y = resid))
# The same kind of plot, first written in full and then more concisely
ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_freqpoly(binwidth = 0.25)
ggplot(faithful, aes(eruptions)) +
  geom_freqpoly(binwidth = 0.25)
diamonds %>%
  count(cut, clarity) %>%
  ggplot(aes(clarity, cut, fill = n)) +
  geom_tile()
# Summary:
# EDA is an important stage of data analysis
# EDA helps us understand the data better and guides later analysis and modelling
I recommend running the code yourself to see exactly what each snippet does.
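The code above finds unusual values by eye, using histograms and hand-picked cut-offs. As a numerical counterpart, the sketch below flags values with the common 1.5 × IQR rule, which is the same rule a box plot uses for its whiskers. This rule is swapped in for illustration and is not the method used in the R4DS chapter; the helper name `flag_unusual()` and the toy vector are assumptions made up for the example.

```r
# Hypothetical helper: TRUE for values beyond k * IQR from the quartiles.
flag_unusual <- function(x, k = 1.5) {
  stopifnot(is.numeric(x))
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  # Missing values are reported as not flagged rather than as NA.
  !is.na(x) & (x < q[1] - k * iqr | x > q[2] + k * iqr)
}

# Toy measurements (made up): mostly 4 to 5.5, plus two suspicious entries.
y <- c(3.9, 4.1, 4.3, 4.5, 4.7, 5.0, 5.2, 5.5, 0, 31.8)

# Show the flagged values.
y[flag_unusual(y)]

# Replace flagged values with NA, mirroring the ifelse() approach above.
y_clean <- ifelse(flag_unusual(y), NA, y)
```

A rule like this is only a starting point; whether a flagged value is an error or a genuine observation still has to be judged from the data.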
03
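Book reading: 《R数据可视化手册(第2版)》 (R Graphics Cookbook, 2nd Edition)
《R数据可视化手册(第2版)》 teaches the knowledge and skills needed for data visualization in R.
The book's summary is excerpted below.
The book is organized by topic, and each chapter collects common problems and their solutions. Chapter 1 covers R basics, including installing packages and loading data. Chapter 2 is an overview of plotting that helps readers draw basic graphs quickly. Chapters 3 to 6 cover specific kinds of graphs (such as bar graphs, line graphs, and scatter plots). Chapters 7 to 12 discuss how to modify the elements of a graph (such as annotations, axes, titles, legends, and colors). Chapter 13 covers miscellaneous graphs that are hard to categorize. Chapter 14 covers exporting graphs from R in different formats. Chapter 15 discusses data preparation. The book moves from simple to advanced topics with a clear structure and suits beginners in data analysis, data processing, and data visualization; for readers who already have some experience in those areas, it also works as a handy quick reference.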
04
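Code study: data manipulation with dplyr
The dplyr package makes data manipulation efficient, including:
1) Combining tables
2) Manipulating variables (columns)
3) Manipulating observations (rows)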
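Before working with files, it can help to see how the join verbs differ on a tiny example. The sketch below assumes dplyr is installed; the two tables are toy data made up for illustration (they are not the tutorial's CSV files), although the key columns `id` and `IDs` mirror the names used in the study code.

```r
library(dplyr)

# Toy tables (made up for illustration).
animal_weight <- tibble(
  id     = c(1, 2, 3, 4),
  animal = c("zebra", "zebra", "ostrich", "penguin"),
  weight = c(25.2, 27.9, 6.4, 4.1)
)
animal_meal <- tibble(
  IDs  = c(2, 3, 4, 5),
  meal = c("grass", "seeds", "fish", "fish")
)

# Mutating joins add columns from the second table.
# left_join: keep every row of animal_weight; id 1 gets NA for meal.
left  <- left_join(animal_weight, animal_meal, by = c("id" = "IDs"))
# inner_join: keep only ids present in both tables (2, 3, 4).
inner <- inner_join(animal_weight, animal_meal, by = c("id" = "IDs"))
# full_join: keep every id from either table (1 to 5).
full  <- full_join(animal_weight, animal_meal, by = c("id" = "IDs"))

# Filtering joins keep or drop rows without adding columns.
# semi_join: rows of animal_weight that have a match in animal_meal.
semi <- semi_join(animal_weight, animal_meal, by = c("id" = "IDs"))
# anti_join: rows of animal_weight that have no match in animal_meal.
anti <- anti_join(animal_weight, animal_meal, by = c("id" = "IDs"))

left
anti
```

Comparing `nrow()` across the five results is a quick way to confirm which rows each join keeps.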
Study code for data manipulation with dplyr.
library(tidyverse)
# Import the data sets
animal_p1 <- read_csv("data/animal_p1.csv")
animal_p2 <- read_csv("data/animal_p2.csv")
animal_rp <- read_csv("data/animal_rp.csv")
animal_meal <- read_csv("data/animal_meal.csv")
# 1) Combining tables
glimpse(animal_p1)
glimpse(animal_p2)
# Binding rows
# Use bind_rows()
animal <- bind_rows(animal_p1, animal_p2)
animal
# 2) Comparing tables with set operations
setequal(animal_p1, animal_p2)
intersect(animal, animal_rp)
setdiff(animal, animal_rp)
setdiff(animal_rp, animal)
union(animal, animal_rp)
# Combining columns
# Use joins
animal_meal
# Inner and outer joins
# Inner join: inner_join
# Outer joins: left_join / right_join / full_join
animal_weight <- union(animal, animal_rp)
animal_weight
# Left join
animal_joined <- animal_weight %>%
  left_join(animal_meal, by = c("id" = "IDs"))
animal_joined %>%
  View()
# Inner join
animal_weight %>%
  inner_join(animal_meal, by = c("id" = "IDs")) %>%
  View()
# Right join
animal_weight %>%
  right_join(animal_meal, by = c("id" = "IDs")) %>%
  View()
# Full join
animal_weight %>%
  full_join(animal_meal, by = c("id" = "IDs")) %>%
  View()
# A full join on every shared column stacks the two tables, much like bind_rows(animal_p1, animal_p2)
animal_p1 %>%
  full_join(animal_p2, by = c("id", "animal", "weight"))
# Filtering joins
# semi_join(x, y): rows of x that have a match in y
# anti_join(x, y): rows of x that have no match in y
animal_weight %>%
  semi_join(animal_meal, by = c("id" = "IDs"))
animal_weight %>%
  anti_join(animal_meal, by = c("id" = "IDs"))
animal_new <- read_csv("data/animal_new.csv")
str(animal_new)
animal_joined
# Combine the tables with full_join
animal_final <- animal_joined %>%
  full_join(animal_new,
            by = c("id" = "ID", "animal" = "Animals", "weight", "meal" = "Meal"))
animal_final %>%
  View()
# Visualize the prepared data set
library(gridExtra)
# Bar chart
barplot <- ggplot(animal_final, aes(animal, fill = meal)) +
  geom_bar(alpha = 0.8) +
  labs(title = "Diversity of meals", x = NULL) +
  scale_fill_brewer(palette = "Set3", type = "seq", na.value = "grey") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"),
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), units = "cm"))
barplot
# Box plot
boxplot <- ggplot(animal_final) +
  geom_boxplot(aes(animal, weight, fill = animal), alpha = 0.5, position = "dodge2") +
  scale_y_continuous(limits = c(0, 30)) +
  labs(title = "Mean weights of animals", x = NULL, y = "Weight (kg)") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"),
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), units = "cm"))
boxplot
# Arrange the two plots side by side
animal_panel <- grid.arrange(barplot, boxplot, ncol = 2)
animal_panel
# Save the figure
ggsave(filename = "figs/animal_panel.png", plot = animal_panel, width = 10, height = 5)
# Goals:
# Master combining tables
# Get familiar with operations on variables and observations
marine <- read_csv("data/LPI_marine.csv")
str(marine)
marine %>%
  slice_head(n = 100) %>%
  View()
# Reshape the raw data into tidy (long) format
marine2 <- marine %>%
  gather(key = year, value = pop, c(25:69)) %>%
  mutate(
    year = parse_number(as.character(year)),
    pop = as.numeric(pop)
  ) %>%
  drop_na(pop)
marine2 %>%
  View()
glimpse(marine2)
# Variable operations
# 1) Extract a column as a vector
marine2 %>%
  pull(Species) %>%
  glimpse()
# 2) Extract columns as a data frame
marine2 %>%
  select(Species) %>%
  glimpse()
marine2 %>%
  select(id, pop, year, Country.list) %>%
  glimpse()
# Rename variables while selecting them
marine2 %>%
  select("Country list" = Country.list,
         method = Sampling.method) %>%
  glimpse()
# Move chosen columns to the front and keep the rest
marine2 %>%
  select(id, year, pop, everything()) %>%
  glimpse()
# Avoid selecting columns by numeric position where possible
# Readable code matters a great deal
marine2 %>%
  select(Family:Species, 24:26) %>%
  glimpse()
# Drop columns you are not interested in
marine2 %>%
  select(-c(2:22, 24)) %>%
  glimpse()
# Select the columns named in a character vector
marine_cols <- c("Genus", "Species", "year", "pop", "id")
marine2 %>%
  select(all_of(marine_cols)) %>%
  glimpse()
# starts_with("x") matches names starting with "x"
# ends_with("x") matches names ending with "x"
# contains("x") matches names containing "x"
marine2 %>%
  select(starts_with("Decimal")) %>%
  glimpse()
# Select columns by type with select_if()
# (superseded since dplyr 1.0.0 by select(where(is.numeric)))
marine2 %>%
  select_if(is.numeric) %>%
  glimpse()
# Combine several selection methods
marine2 %>%
  select(id,                                    # put id first
         Class:Family,                          # add columns between `Class` and `Family`
         genus = Genus,                         # rename `Genus` to lowercase
         starts_with("Decimal"),                # add columns starting with "Decimal"
         everything(),                          # add all the other columns
         -c(6:9, system:Data.transformed)) %>%  # delete columns in these ranges
  glimpse()
# Save the selected columns as a new data set
marine3 <- marine2 %>%
  select(id, Class, Genus, Species, year, pop,
         location = Location.of.population,
         lat = Decimal.Latitude,
         lon = Decimal.Longitude) %>%
  glimpse()
# Rename variables with rename()
marine3 %>%
  rename(class = Class,
         genus = Genus,
         species = Species) %>%  # renames only chosen columns
  glimpse()
# To rename variables with a function,
# use rename_with()
marine3 %>%
  rename_with(tolower) %>%
  glimpse()
marine4 <- marine3 %>%
  select_all(tolower) %>%
  glimpse()
marine3 %>%
  select_at(vars(Genus, Species), tolower) %>%
  glimpse()
# Further notes:
# select_all() if you want to apply the function to all columns
# select_at() if you want to apply the function to specific columns (specify them with vars())
# select_if() if you want to apply the function to columns of a certain characteristic (e.g. data type)
# rename_with() if you want to rename columns with a function (optionally limited to chosen columns via `.cols`)
# Create new columns
# mutate()
marine5 <- marine4 %>%
  mutate(genus_species = paste(genus, species, sep = "_")) %>%
  glimpse()
# Use case_when() to express multi-condition logic
marine6 <- marine5 %>%
  mutate(region = case_when(lat > 0 & lon >= 0 ~ "NE",
                            lat <= 0 & lon >= 0 ~ "SE",
                            lat > 0 & lon < 0 ~ "NW",
                            lat <= 0 & lon < 0 ~ "SW")) %>%
  glimpse()
unique(marine6$region)
# transmute() keeps only the newly created columns
marine4 %>%
  transmute(genus_species = paste(genus, species, sep = "_"),
            region = case_when(lat > 0 & lon >= 0 ~ "NE",
                               lat <= 0 & lon >= 0 ~ "SE",
                               lat > 0 & lon < 0 ~ "NW",
                               lat <= 0 & lon < 0 ~ "SW")) %>%
  glimpse()
# Apply a function to chosen columns
marine6 %>%
  mutate_at(vars(class, genus, location), tolower) %>%
  glimpse()
# Add a row-number column; seq_len(nrow(.)) avoids hard-coding the row count
marine6 %>%
  add_column(observation_num = seq_len(nrow(.))) %>%
  glimpse()
# Count the observations per group
marine6 %>%
  select(genus_species, year) %>%
  group_by(genus_species) %>%
  add_tally(name = "observations_count") %>%
  glimpse()
marine6 %>%
  select(genus_species, year) %>%
  # `add_count()` includes the grouping variable (here `genus_species`) inside the function
  add_count(genus_species, name = "observations_count") %>%
  glimpse()
# Selecting observations (rows)
marine6 %>%
  filter(class == "Mammalia") %>%
  glimpse()
marine6 %>%
  filter(class %in% c("Mammalia", "Aves")) %>%
  glimpse()
marine6 %>%
  filter(class != "Actinopteri") %>%
  glimpse()
marine6 %>%
  filter(!class %in% c("Mammalia", "Aves")) %>%
  glimpse()
marine6 %>%
  filter(pop >= 10 & pop <= 100) %>%
  glimpse()
marine6 %>%
  filter(between(pop, 10, 100)) %>%
  glimpse()
marine6 %>%
  filter(!is.na(pop)) %>%
  glimpse()
# Parentheses control how the conditions are combined
marine6 %>%
  filter((class == "Mammalia" | pop > 100) & region != "SE") %>%
  glimpse()
marine6 %>%
  filter(class == "Mammalia" | (pop > 100 & region != "SE")) %>%
  glimpse()
# Keep distinct rows, and count them
marine6 %>%
  distinct() %>%
  glimpse()
marine6 %>%
  n_distinct()
# Slicing
marine6 %>%
  select(id:species) %>%
  slice(2:4)
marine6 %>%
  top_n(5, pop) %>%
  glimpse()
marine7 <- marine6 %>%
  filter(id == 2077) %>%
  select(id, genus_species, year, pop)
# Add observations (rows)
marine7 %>%
  add_row(id = 2077, genus_species = "Chrysophrys_auratus", year = 1997, pop = 39000)
marine7 %>%
  add_row(id = 2077, genus_species = "Chrysophrys_auratus", year = 1969, pop = 39000,
          .before = 1)
marine_final <- marine6 %>%
  filter(genus_species == "Chelonia_mydas") %>%
  # change `id` to factor (otherwise it would display as a continuous variable on the plot)
  mutate(id = as.factor(id))
chelonia_trends <- ggplot(marine_final, aes(x = year, y = pop, colour = location)) +
  geom_point(size = 2, alpha = 0.7) +
  geom_smooth(method = "lm", colour = "black", fill = "lightgrey") +
  scale_x_continuous(limits = c(1970, 2005), breaks = c(1970, 1980, 1990, 2000)) +
  labs(x = NULL, y = "Population count\n",
       title = "Positive trend of Green Sea Turtle population in Australia",
       colour = "Location") +
  theme_minimal() +
  theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold"),
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), units = "cm"))
chelonia_trends
ggsave(chelonia_trends, filename = "figs/chelonia_trends.png", width = 8, height = 6)
(Please run the code and check the results yourself.)
Learning resources:
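The reshaping step in the study code uses `gather()`, which tidyr has superseded with `pivot_longer()`. The sketch below shows the same wide-to-long idea with `pivot_longer()` on a toy table. Everything about the toy table is an assumption made for illustration: the `X1970`-style column names, the "NULL" text marking missing counts, and the values themselves are not taken from the LPI data file.

```r
library(dplyr)
library(tidyr)
library(readr)

# Toy wide table (made up): one column per year, counts stored as text.
wide <- tibble(
  id      = c(1, 2),
  Species = c("mydas", "auratus"),
  X1970   = c("10", "NULL"),
  X1971   = c("12", "40")
)

long <- wide %>%
  # Stack the year columns into year/pop pairs.
  pivot_longer(cols = starts_with("X"), names_to = "year", values_to = "pop") %>%
  mutate(
    year = parse_number(year),                # "X1970" becomes 1970
    pop  = suppressWarnings(as.numeric(pop))  # "NULL" becomes NA
  ) %>%
  # Drop the year/species combinations that have no count.
  drop_na(pop)

long
```

Selecting the year columns by name pattern rather than by position (such as `25:69`) keeps the code working if columns are added or reordered.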
https://ourcodingclub.github.io/tutorials/data-manip-creative-dplyr/
05
Beautiful plots: Venn diagrams
How is the figure below drawn?
Code snippet
# Venn diagram
set.seed(20190708)
genes <- paste("gene", 1:1000, sep = "")
x <- list(
  A = sample(genes, 300),
  B = sample(genes, 525),
  C = sample(genes, 440),
  D = sample(genes, 350)
)
# Installing ggVennDiagram
# if (!require(devtools)) install.packages("devtools")
# devtools::install_github("gaospecial/ggVennDiagram")
library("ggVennDiagram")
ggVennDiagram(x, label_alpha = 0)
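The counts shown in the regions of a Venn diagram can be checked with base R set operations. The sketch below rebuilds the same four gene lists and computes a few of those counts; the variable names `shared_all`, `only_A`, and `overlap` are made up for this example, and no package beyond base R is needed.

```r
# Rebuild the same four sets as in the snippet above.
set.seed(20190708)
genes <- paste0("gene", 1:1000)
x <- list(
  A = sample(genes, 300),
  B = sample(genes, 525),
  C = sample(genes, 440),
  D = sample(genes, 350)
)

# Genes shared by all four sets: the central region of the diagram.
shared_all <- Reduce(intersect, x)
length(shared_all)

# Genes found only in A: the outer region of A.
only_A <- setdiff(x$A, Reduce(union, x[c("B", "C", "D")]))
length(only_A)

# Pairwise overlap sizes; the diagonal holds the size of each set.
overlap <- sapply(x, function(a) sapply(x, function(b) length(intersect(a, b))))
overlap
```

If a region count in the plot looks surprising, comparing it with these numbers is a quick sanity check.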
Learning resources:
https://www.datanovia.com/en/blog/beautiful-ggplot-venn-diagram-with-r/
06
Resource sharing: Beautiful graphics in ggplot2
Link:
https://themockup.blog/static/slides/intro-plot.html#1
Part of its outline is quoted below:
1 Why ggplot2?
2 theme() elements
3 Steal like an artist
4 Summary