R语言学习:可视化分析,Apply家族函数,row_number函数,R4DS学习,数据变换
2021年第41周。
这一周R语言学习,记录如下。
01
美国谋杀数据集可视化分析
画图代码
# 美国谋杀数据集可视化分析
library(dslabs)
library(tidyverse)
library(ggthemes)
data(murders)
murders %>%
View
murders %>%
glimpse()
r <- murders %>%
summarize(rate = sum(total) / sum(population) * 10^6) %>%
pull(rate)
murders %>% ggplot(aes(population/10^6, total, label = abb)) +
geom_abline(intercept = log10(r), lty = 2, color = "darkgrey") +
geom_point(aes(col=region), size = 3) +
geom_text_repel() +
scale_x_log10() +
scale_y_log10() +
xlab("Populations in millions (log scale)") +
ylab("Total number of murders (log scale)") +
ggtitle("US Gun Murders in 2010") +
scale_color_discrete(name = "Region") +
theme_economist()
结果图
可视化分析
我们可以清楚地看到各州在人口规模和谋杀总数上的差异,也看到了谋杀总数和人口规模之间的明显关系。位于灰色虚线上的州的谋杀率与美国的平均水平相同。这四个地理区域用颜色表示,这说明大多数南部州的谋杀率高于平均水平。
学习资料:
https://rafalab.github.io/dsbook/introduction-to-data-visualization.html
02
Apply家族函数
Apply家族函数,可以避免循环语句操作。
Apply家族函数的选择,可以由数据的结果和想要的结果来决定。
Apply家族函数是R语言自带的函数集,不需要安装任何额外包就可以使用和执行。
1 apply()
函数描述
举例说明
?apply
mymatrix <- matrix(1:9, nrow = 3)
mymatrix
apply(mymatrix, 1, sum)
apply(mymatrix, 2, sum)
mymatrix[2, 3] <- NA
mymatrix
apply(mymatrix, 1, sum)
apply(mymatrix, 1, sum, na.rm = TRUE)
2 lapply()
函数定义
举例说明
?lapply
mylist <- list(A = matrix(1:9, nrow = 3),
B = 1:5,
C = 8)
mylist
lapply(mylist, sum)
unlist(lapply(mylist, sum))
lapply(mylist, function(x) x * 20)
运行结果
3 sapply()
函数定义
举例说明
?sapply
sapply(mylist, sum)
运行结果
4 mapply()
函数描述
举例说明
?mapply
mapply(rep, 1:4, 4:1)
x <- c(A = 20, B = 1, C = 40)
y <- c(J = 430, K = 50, L = 10)
simply <- function(u, v) {
(u + v) * 2
}
mapply(simply, x, y)
运行结果
5 tapply()
函数描述
举例说明
?tapply
tapply(iris$Sepal.Length, iris$Species, max)
tapply(iris$Sepal.Length, iris$Species, min)
运行结果
总结:
apply:把函数应用到数组指定的margins
lapply:把函数应用到列表的每个元素
sapply:类似lapply,结果显示更简洁
mapply:lapply的多元版本
tapply:把函数应用到向量的每个子集(基于某个逻辑的分组)
03
窗口函数row_number()
dplyr包的row_number()方法实现类似SQL的row_number()的功能。
举例说明
library(dplyr)
data(iris)
by_species <- iris %>%
arrange(Species, desc(Sepal.Length)) %>%
group_by(Species) %>%
mutate(rank = row_number())
by_species %>% View
上面的语句类似下面的SQL逻辑
select
*
,row_number() over(partition by Species order by Sepal.Length desc) as rank
from iris
04
R4DS学习交流群
我创建了R4DS学习交流群,以R4DS书籍为基础,聚焦于R语言做数据科学的任务。
想进群的伙伴,可以添加我的微信,备注:R4DS。我诚邀你入群,与大家交流和讨论,相互学习。
05
R4DS第三章 工作流:使用dplyr包做数据变换
1 内容结构
1)引言
为什么需要数据转换
相关准备工作
2)filter()过滤行
3)arrange()排序行
4)select()选择列
5)mutate()增加新列
6) summarize()和group_by()实现汇总和分组汇总处理
2 目标管理
1)掌握行选择技能
2)掌握列选择技能
3)掌握变量新增和重命名技能
4)掌握数据汇总技能
5)掌握利用dplyr包把数据变换为合理数据格式的技能
3 实操代码
# R4DS 第三章 使用dplyr包做数据变换
# 1 准备工作
library(pacman)
p_load(
nycflights13,
tidyverse
)
# 2 数据理解
flights %>% glimpse()
# 3 行过滤 filter()
filter(flights, month == 1, day == 1)
jan1 <- filter(flights, month == 1, day == 1)
(dec25 <- filter(flights, month == 12, day == 25))
# 3.1) 使用比较运算操作
# > , >= , < , <= , != (不等于), and == (等于)
# 3.2)使用逻辑运算操作
# & is “与” | is “或” and ! is “非”
filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
# 4 行排序 arrange()
arrange(flights, year, month, day)
arrange(flights, desc(arr_delay))
# 5 列选择 select()
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
# 5.1) 使用一些函数来辅助选择具有某种模式的列
# starts_with("abc") matches names that begin with “abc”.
# ends_with("xyz") matches names that end with “xyz”.
# contains("ijk") matches names that contain “ijk”.
# matches("(.)\1") selects variables that match a regular expression.
# num_range("x", 1:3) matches x1 , x2 , and x3
select(flights, time_hour, air_time, everything())
# 6 变量重命名 rename()
rename(flights, tail_num = tailnum)
# 7 增加新的变量 mutate()
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
mutate(flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60
)
mutate(flights_sml,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
# 只需要新增的变量
transmute(flights,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
# 对于新增变量的可以采用这些操作
# 1)Arithmetic operators + , - , * , / , ^
# 2)Modular arithmetic ( %/% and %% )
transmute(flights,
dep_time,
hour = dep_time %/% 100,
minute = dep_time %% 100
)
# 3)Logs log() , log2() , log10()
# 4)Offsets lead() and lag()
# 5)Cumulative and rolling aggregates cumsum() , cumprod() , cummin() , cummax()
# dplyr包的cummean()
# RcppRoll包提供a sum computed over a rolling window
# 6)Logical comparisons < , <= , > , >= , !=
# 7)Ranking min_rank()
# row_number() , dense_rank() , percent_rank() , cume_dist() ntile()
# 8 数据汇总和分组汇总
# summarize()和group_by()
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
by_day <- group_by(flights, year, month, day)
summarize(by_day, delay = mean(dep_delay, na.rm = TRUE))
# 9 管道操作
# 让代码更好理解,可读性更好
by_dest <- group_by(flights, dest)
delay <- summarize(
by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
# 上述代码的的管道操作表示
delays <- flights %>%
group_by(dest) %>%
summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
# 或者
flights %>%
group_by(dest) %>%
summarize(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL") %>%
ggplot(data = ., mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay))
flights %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay, na.rm = TRUE))
not_cancelled <- flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarize(mean = mean(dep_delay))
# 10 计数操作
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(
delay = mean(arr_delay)
)
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarize(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = n, y = delay)) +
geom_point(alpha = 1/10)
batting <- as_tibble(Lahman::Batting)
batters <- batting %>%
group_by(playerID) %>%
summarize(
ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
ab = sum(AB, na.rm = TRUE)
)
batters %>%
filter(ab > 100) %>%
ggplot(mapping = aes(x = ab, y = ba)) +
geom_point() +
geom_smooth(se = FALSE)
batters %>%
arrange(desc(ba))
# 11 有用的汇总函数
# 1)Measures of location mean(x) median(x)
not_cancelled %>%
group_by(year, month, day) %>%
summarize(
# average delay:
avg_delay1 = mean(arr_delay),
# average positive delay:
avg_delay2 = mean(arr_delay[arr_delay > 0])
)
# 2)Measures of spread sd(x) , IQR(x) , mad(x)
not_cancelled %>%
group_by(dest) %>%
summarize(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
# 3)Measures of rank min(x) , quantile(x, 0.25) , max(x)
not_cancelled %>%
group_by(year, month, day) %>%
summarize(
first = min(dep_time),
last = max(dep_time)
)
# 4)Measures of position first(x) , nth(x, 2) , last(x)
not_cancelled %>%
group_by(year, month, day) %>%
summarize(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
not_cancelled %>%
group_by(year, month, day) %>%
mutate(r = min_rank(desc(dep_time))) %>%
filter(r %in% range(r))
# 5)Counts n() n_distinct(x)
not_cancelled %>%
group_by(dest) %>%
summarize(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
not_cancelled %>%
count(dest)
not_cancelled %>%
count(tailnum, wt = distance)
# 6)Counts and proportions of logical values sum(x > 10) , mean(y == 0)
not_cancelled %>%
group_by(year, month, day) %>%
summarize(n_early = sum(dep_time < 500))
not_cancelled %>%
group_by(year, month, day) %>%
summarize(hour_perc = mean(arr_delay > 60))
# 12 多变量分组操作
daily <- group_by(flights, year, month, day)
(per_day <- summarize(daily, flights = n()))
(per_month <- summarize(per_day, flights = sum(flights)))
(per_year <- summarize(per_month, flights = sum(flights)))
# 13 去掉group
daily %>%
ungroup() %>% # no longer grouped by date
summarize(flights = n()) # all flights
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)
popular_dests
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
代码结果请自测,关于代码有任何问题,请来R4DS学习交流群讨论。
06
雷达图
雷达图,又叫蜘蛛图,用于可视化多个定量变量的值,实现多个变量在二维空间的对比分析。
客户分群项目,总结每个群体的特性时,可以用雷达图来表示。
library(tidyverse)
# devtools::install_github("ricardo-bion/ggradar")
library("ggradar")
data<- data.frame(
row.names = c("A", "B", "C"),
Thickness = c(7.9, 3.9, 9.4),
Apperance = c(10, 7, 5),
Spredability = c(3.7, 6, 2.5),
Likeability = c(8.7, 6, 4)
)
data
df <- data %>% rownames_to_column("group")
df
ggradar(
df,
values.radar = c("0", "5", "10"),
grid.min = 0,
grid.mid = 5,
grid.max = 10,
# Polygons
group.line.width = 1,
group.point.size = 3,
group.colours = c("#00AFBB", "#E7B800", "#FC4E07"),
background.circle.colour = "white",
gridline.mid.colour = "grey",
legend.position = "bottom"
)
结果图
我创建了R语言群,添加我的微信,备注:姓名-入群,我邀请你进群,一起学习R语言。
如果你觉得文章内容有用,请关注下方公众号~
如果你想找数据工作,请关注下方公众号~
R语言学习专辑:
觉得本文不错,就顺手帮我转发到朋友圈和微信群哦,谢谢。
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!