Learning R: Legend Management, Manipulating Data Frames with SQL, Hands-on tidyverse, Sentiment Analysis, and Word Clouds
Week 38 of 2021.
Here are my R learning notes for this week.
01
Methods for Exploring Data Frames
Data frames are among the most commonly used data structures in R.
Ways to explore a data frame:
1) the View() function
2) glimpse() from the dplyr package
3) kable() from the knitr package
library(dplyr)
library(nycflights13)
library(knitr)
# Ways to explore a data frame
View(flights)
glimpse(flights)
kable(airlines)
02
Legend Management
Legend management covers keeping and removing the legend.
A kept legend can be placed either outside or inside the plot.
library(ggplot2)
ToothGrowth$dose<-factor(ToothGrowth$dose)
# Keep the legend
# 1) Outside the plot
p <- ggplot(ToothGrowth, aes(x=dose, y=len, fill=dose)) +
  geom_boxplot()
p
p + theme(legend.position = "bottom")
# 2) Inside the plot
p + theme(legend.position = c(.9, .9))
p + theme(legend.position = c(.9, .1))
# Remove the legend
p + theme(legend.position = "none")
Selected result plots
03
Manipulating Data Frames with SQL
The sqldf package is a handy data-management utility in R.
It lets you run SQL statements directly against data frames.
library(sqldf)
# mtcars, a dataset built into R
# Step 1: write a SQL statement for the data-processing logic
sql_st = "
select
*
from mtcars
where carb = 1
order by mpg
"
# Step 2: run the SQL statement with sqldf() from the sqldf package
new_df <- sqldf(sql_st, row.names = TRUE)
# Step 3: inspect the result
new_df
Result table
04
Hands-on tidyverse
The tidyverse is a package I use every day for data management, data visualization, and data-science work.
library(tidyverse)
library(lubridate)
library(nycflights13)
# Load the data
# Flight data
head(flights)
# 1) Counting long flights
flights %>%
mutate(long_flight = (air_time >= 6 * 60)) %>%
View()
flights %>%
mutate(long_flight = (air_time >= 6 * 60)) %>%
count(long_flight)
# or, equivalently
flights %>%
count(long_flight = air_time >= 6 * 60)
# 2) Creating new variables by group
flights %>%
group_by(date = make_date(year, month, day)) %>%
summarise(flights_n = n(),
air_time_mean = mean(air_time, na.rm = TRUE)) %>%
ungroup()
# 3) Randomly sampling rows
flights %>%
slice_sample(n = 15)
flights %>%
slice_sample(prop = 0.15)
# 4) Creating a date column
flights %>%
select(year, month, day) %>%
mutate(date = make_date(year, month, day))
# 5) Parsing numbers
numbers_1 <- tibble(number = c("#1", "Number8", "How are you 3"))
numbers_1 %>% mutate(number = parse_number(number))
# 6) Deriving a variable from multiple conditions, evaluated in order
flights %>%
mutate(
origin = case_when(
(origin == "EWR") &
dep_delay > 20 ~ "Newark International Airport - DELAYED",
(origin == "EWR") &
dep_delay <= 20 ~ "Newark International Airport - ON TIME DEPARTURE",
)
) %>%
count(origin)
# 7) Replacing all matched patterns at once
flights %>%
mutate(origin = str_replace_all(
origin,
c("^EWR$" = "Newark International", "^JFK$" = "John F. Kennedy International")
)) %>%
count(origin)
# 8) Filtering the dataset: row selection
flights_top_carriers <- flights %>%
group_by(carrier) %>%
filter(n() >= 10000) %>%
ungroup()
flights_top_carriers %>% View
# 9) Extracting rows via string detection
beginning_with_am <- airlines %>%
filter(name %>% str_detect("^Am"))
beginning_with_am %>% View
# 10) Anti-join and inner join (complement and intersection)
data1 <- flights %>%
  anti_join(beginning_with_am, by = "carrier")
data2 <- flights %>%
  inner_join(beginning_with_am, by = "carrier")
nrow(data1)
nrow(data2)
nrow(flights)
# 11) fct_reorder
airline_names <- flights %>%
left_join(airlines, by = "carrier")
airline_names %>%
count(name) %>%
ggplot(aes(name, n)) +
geom_col()
airline_names %>%
count(name) %>%
mutate(name = fct_reorder(name, n)) %>%
ggplot(aes(name, n)) +
geom_col()
# 12) Flipping the axes
airline_names %>%
count(name) %>%
mutate(name = fct_reorder(name, n)) %>%
ggplot(aes(name, n)) +
geom_col() +
coord_flip()
# 13) Crossing
# generate a Cartesian product
crossing(
customer_channel = c("Bus", "Car"),
customer_status = c("New", "Repeat"),
spend_range = c("$0-$10", "$10-$20", "$20-$50", "$50+"))
# 14) Grouped summaries with a custom function
summary1 <- function(data, col_names, na.rm = TRUE) {
  data %>%
    summarise(across({{ col_names }},
                     list(
                       min = ~ min(.x, na.rm = na.rm),
                       max = ~ max(.x, na.rm = na.rm),
                       median = ~ median(.x, na.rm = na.rm),
                       mean = ~ mean(.x, na.rm = na.rm)
                     ),
                     .names = "{.col}_{.fn}"
    ))
}
airline_names %>%
summary1(c(air_time, arr_delay)) %>%
View
airline_names %>%
group_by(carrier) %>%
summary1(c(air_time, arr_delay)) %>%
View
Run the code yourself and check the results.
Reference:
https://finnstats.com/index.php/2021/04/02/tidyverse-in-r/
05
Tukey's HSD Test
Use case: when a grouping variable takes three or more values, Tukey's HSD tests the significance of pairwise differences between groups.
rm(list = ls())
# Simulated dataset
set.seed(1045)
data <- data.frame(group = rep(c("P1", "P2", "P3"), each = 40),
                   values = c(rnorm(40, 0, 3), rnorm(40, 0, 6), rnorm(40, 1, 5)))
head(data)
# ANOVA model
model <- aov(values ~ group, data = data)
summary(model)
# Run Tukey's HSD test
TukeyHSD(model, conf.level = .95)
# Visualize the pairwise comparisons
plot(TukeyHSD(model, conf.level = .95), las = 2)
Result table
Result plot
06
Adding Trend Lines
Use case: adding a trend line to a scatter plot.
For example, add a linear trend line to a scatter plot, or add a confidence band on top of the linear trend.
data <- data.frame(x = c(1, 2, 5, 3, 5, 5, 9, 10, 12),
y = c(18, 10, 10, 20, 22, 13, 15, 16, 17))
data
# 1) Linear trend with a confidence band
ggplot(data, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = lm, level = 0.99)
# 2) Linear trend without a confidence band
ggplot(data, aes(x = x, y = y)) +
geom_point() +
geom_smooth(
method = lm,
se = FALSE,
col = 'blue',
size = 2
)
# 3) The default loess curve
ggplot(data, aes(x = x, y = y)) +
geom_point() +
geom_smooth(se = FALSE)
Result plots
07
A Guide to Reproducible Code
Create a project and manage your work as a project.
For the project's directory hierarchy and the purpose of each folder and file, see the figure (omitted).
Please note:
1 Never modify the raw data; in other words, always keep a backup of the raw data.
2 For every project, keep a file that records what you thought and did, to make review and iteration easier.
3 Name your scripts so the names convey their meaning; keep them orderly and logical, and value readability: code is run by computers, but more importantly it is read by people.
4 For a complex project, write pseudocode or draw a flowchart before writing the code.
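As a minimal sketch of the points above (the folder names here are illustrative assumptions, not a fixed standard), such a project skeleton can be created from R:

```r
# Create an illustrative project skeleton; adjust the folder names to taste
dirs <- c("data/raw", "data/processed", "R", "figs", "output")
for (d in dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
# A notes file for recording what you thought and did (rule 2 above)
file.create("NOTES.md")
```

Keeping raw data under data/raw and writing all derived files to data/processed makes it easy to honor rule 1: scripts only ever read from the raw folder and never write to it.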
08
The subset Function
The subset function extracts subsets of rows and columns from a data frame.
graphics.off()
rm(list = ls())
options(warn = -1)
options(scipen = 999)
options(digits = 3)
# Create a data frame
manager <- c(1, 2, 3, 4, 5)
date <- c("10/24/08", "10/28/08", "10/1/08", "10/12/08", "5/1/09")
country <- c("US", "US", "UK", "UK", "UK")
gender <- c("M", "F", "F", "M", "F")
age <- c(32, 45, 25, 39, 99)
q1 <- c(5, 3, 3, 3, 2)
q2 <- c(4, 5, 5, 3, 2)
q3 <- c(5, 2, 5, 4, 1)
q4 <- c(5, 5, 5, NA, 2)
q5 <- c(5, 5, 2, NA, 1)
leadership <- data.frame(manager, date, country, gender, age,
q1, q2, q3, q4, q5, stringsAsFactors=FALSE)
# Extract subsets with subset()
# 1) rows where age >= 35 or age < 24, keeping variables q1, q2, and q5
new_df1 <- subset(leadership,
age >= 35 | age < 24,
select = c(q1, q2, q5))
new_df1
# 2) men older than 25, keeping the variables from gender through q5
new_df2 <- subset(leadership,
gender == 'M' & age > 25,
select = gender:q5)
new_df2
Exercise: how would you implement the operations above with the tidyverse?
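One possible answer to the exercise, as a sketch using dplyr's filter() and select() (assuming the leadership data frame defined above):

```r
library(dplyr)
# 1) rows where age >= 35 or age < 24, keeping q1, q2, and q5
new_df1 <- leadership %>%
  filter(age >= 35 | age < 24) %>%
  select(q1, q2, q5)
# 2) men older than 25, keeping the variables from gender through q5
new_df2 <- leadership %>%
  filter(gender == "M" & age > 25) %>%
  select(gender:q5)
```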
09
The TidyX Project
Project vision: help more people learn and apply R through a series of interesting, useful, and fun data projects, learning from data and answering questions with data.
Project format: pick one contributor's code from the TidyTuesday project, read it line by line, explain what the code does and what each function is for, break down the visualizations, and transfer them to similar scenarios.
Episode 2: exploring The Office dataset with sentiment analysis and word clouds.
Source code
# Episode 2: The Office dataset, sentiment analysis, and word clouds
# Workspace setup
graphics.off()
rm(list = ls())
options(warn = -1)
options(scipen = 999)
options(digits = 3)
# Packages
library(tidyr)
library(stringr)
library(schrute)
library(tidytext)
library(wordcloud)
library(ggplot2)
library(dplyr)
library(reshape2)
# Import the data
office_ratings <-
readr::read_csv('./data/tidytuesday/data/2020/2020-03-17/office_ratings.csv')
schrute <- schrute::theoffice
head(schrute) %>% View
# Inspect the data
dplyr::glimpse(schrute)
dplyr::glimpse(office_ratings)
# Prepare the data: tokenize the dialogue
token.schrute <- schrute %>%
tidytext::unnest_tokens(word, text)
dplyr::glimpse(token.schrute)
# Remove stop words
stop_words <- tidytext::stop_words
tidy.token.schrute <- token.schrute %>%
dplyr::anti_join(stop_words, by = 'word')
# Most common words
tidy.token.schrute %>% # 169,835 observations
dplyr::count(word, sort = TRUE)
# Word-frequency analysis
p1 <- tidy.token.schrute %>%
dplyr::count(word, sort = TRUE) %>%
dplyr::filter(n > 400) %>%
dplyr::mutate(word = stats::reorder(word, n)) %>%
ggplot2::ggplot(ggplot2::aes(word, n)) +
ggplot2::geom_col() +
ggplot2::xlab(NULL) +
ggplot2::coord_flip() +
ggplot2::theme_minimal()
ggsave(filename = './figs/Most_used_words.pdf',
plot = p1,
scale = .6)
# Sentiment analysis
# Load a sentiment lexicon
sentiments <-
get_sentiments("bing") # Codes words as positive or negative (Bing Liu). NA for neutral.
dplyr::glimpse(sentiments)
unique(sentiments$sentiment)
# Sentiment by season
schrute.sentiment <- tidy.token.schrute %>%
dplyr::left_join(sentiments) %>%
dplyr::count(episode_name, sentiment) %>%
spread(sentiment, n, fill = 0) %>% # fill missing values w/ 0
mutate(sentimentc = positive - negative) %>% # positive value means more words had positive connotation than negative
dplyr::select(episode_name, sentimentc, negative, positive, neutral =
`<NA>`)
bing_word_counts <- tidy.token.schrute %>%
inner_join(sentiments %>% filter(sentiment == 'positive' |
sentiment == 'negative')) %>%
count(word, sentiment, sort = TRUE)
bing_word_counts
p2 <- bing_word_counts %>%
filter(n > 150) %>%
mutate(n = ifelse(sentiment == 'negative', -n, n)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col() +
coord_flip() +
labs(y = 'Contribution to sentiment analysis', x = 'Word') +
theme_bw() +
theme(
legend.position = 'none',
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = 'bold'),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = 'bold')
)
p2
# word cloud
pdf('./figs/comparison_cloud.pdf',
width = 4,
height = 4)
bing_word_counts %>%
acast(word ~ sentiment, value.var = 'n', fill = 0) %>%
comparison.cloud(colors = c("#F8766D", "#00BFC4"),
max.words = 100)
dev.off()
# Is there a relationship between word sentiment and episode rating?
office_ratings <- office_ratings %>%
dplyr::select(season, episode, episode_name = title, imdb_rating)
sent.rating <- schrute.sentiment %>%
inner_join(office_ratings, by = 'episode_name') %>%
mutate(season = as.factor(season))
ggplot(data = sent.rating, aes(x = sentimentc, y = imdb_rating, color = season)) +
geom_point()
sent.rating %>%
group_by(season) %>%
summarize(rating_ave = mean(imdb_rating),
sentiment_ave = mean(sentimentc)) %>%
ggplot(data = ., aes(x = sentiment_ave, y = rating_ave, color = season)) +
geom_point()
# Descriptive fig of pos & neg characters
glimpse(tidy.token.schrute)
char.sentiment <- tidy.token.schrute %>%
inner_join(sentiments, by = 'word') %>%
count(character, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentimentc = positive - negative)
p3 <- char.sentiment %>%
filter(negative + positive > 300) %>%
mutate(sent_dummy = ifelse(sentimentc < 0, 'More Negative', 'More Positive')) %>%
mutate(character = reorder(character, sentimentc)) %>%
ggplot(aes(character, sentimentc, fill = sent_dummy)) +
geom_col() +
coord_flip() +
labs(y = 'Emotional Charge of Dialogue\n(Positive - Negative Words)', x = 'Character') +
theme_bw() +
theme(
legend.position = 'none',
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = 'bold'),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = 'bold')
)
ggsave(
filename = './figs/word_bar.pdf',
plot = p2,
width = 4.5,
height = 4
)
ggsave(
filename = './figs/characters_sentiment.pdf',
plot = p3,
width = 9,
height = 6
)
Result plots
1 Word-frequency plot
2 Word cloud
3 Sentiment word-frequency plot
4 Character sentiment plot
If you have any questions or thoughts about the source code, please leave a comment for discussion.
Reference:
https://github.com/rrobinn/tidy-tuesday/blob/master/20200318-The-Office/20200318-The-Office.R